[Info-vax] Home-grown application process dumps
Stephen Hoffman
seaohveh at hoffmanlabs.invalid
Mon Jan 5 12:40:52 EST 2015
On 2015-01-05 17:08:19 +0000, RGB said:
> What I find interesting about the above is that these "bugs" can NOT be
> reproduced in our test/development/QA environments.
I've seen more than a few of these cases. This is the
software-troubleshooting version of proving a negative, after all...
> Said environments run on exactly the same hardware and config i.e.,
> rx2800 i2 with 32GB RAM and VMS v8.4.
I well recall one of these cases that involved heap corruption that
turned out to be secondary to the length of the DECnet host name. If
the length of the DECnet host name caused a heap allocation to be right
on a quadword boundary, the bug would be exposed and the code would
crash. Any other length would cause the heap to pad the allocation
to the next quadword, and the bug was masked. It was also common for
the overrun to hit something that didn't trigger a crash. Worse, there
was only one host around with that particular DECnet host name length
value, too. Cue much head scratching.
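A minimal sketch of that masking effect, in portable C rather than the original code. The one-byte overrun and the helper names (quadword_round, overrun_escapes_padding) are assumptions for illustration; the actual bug and allocator are not shown in this thread.

```c
#include <stddef.h>

/* Round a requested size up to the next multiple of 8 bytes (a quadword),
 * the way an allocator with quadword granularity would pad a request. */
static size_t quadword_round(size_t requested)
{
    return (requested + 7u) & ~(size_t)7u;
}

/* Hypothetical bug: the caller asks for `requested` bytes and then writes
 * `written` bytes. Returns 1 if the write runs past the padded allocation
 * (corrupting whatever follows it), or 0 if the excess lands in padding
 * and the bug stays masked. */
static int overrun_escapes_padding(size_t requested, size_t written)
{
    return written > quadword_round(requested);
}
```

With these assumptions, a 6-byte request overrun by one byte stays inside the 8-byte padded block and nothing visible happens, while an 8-byte request overrun by one byte escapes it. Only the one size that lands exactly on the boundary exposes the bug.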
So... Having the same hardware doesn't really matter to various bugs,
as you've already conclusively proven here.
A while ago, I traced one case that was secondary to a
previously-unknown hardware difference. The servers involved were from
an order of identically-configured systems, and even had consecutive
serial numbers from manufacturing. It turned out that the boxes
crossed over two batches of I/O controllers when they were originally
manufactured. The controllers in one bunch all worked fine, and the
controllers in the other (again, original, and which had never been
changed) were from a different batch, and didn't work quite the same.
When debugging these cases, looking at what's alike doesn't tell you
nearly as much as looking for what's different.
> These processes dump ONLY in production but, then again, the modules
> are more heavily utilized in production than in the aforementioned
> test/dev environments.
Application and system load are always good triggers for snits from
wobbly code; these can be latent bugs, and can point to buggy
synchronization. Failure to specify IOSBs, and failure to check both
return status values and IOSBs, are common sources of wobbly code.
Pretty much any code that can or does use event flags can be suspect.
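A rough sketch of the two-step check, written as standalone C so it does not depend on the system headers. The struct layout is an assumption modeled on the common QIO IOSB shape, and io_final_status is a hypothetical helper, not a system service. What it does rely on is the OpenVMS convention that a condition value is a success only when its low bit is set.

```c
#include <stdint.h>

/* Assumed IOSB layout for illustration: 16-bit completion status,
 * 16-bit transfer count, 32 bits of device-dependent data. */
struct iosb {
    uint16_t status;
    uint16_t count;
    uint32_t dev_info;
};

/* OpenVMS condition values indicate success when the low bit is set. */
static int vms_success(uint32_t cond)
{
    return (cond & 1u) != 0;
}

/* Hypothetical helper: combine the service return status with the IOSB.
 * The return status only says whether the request was accepted; the IOSB
 * says whether the I/O itself worked. Both have to be checked. */
static uint32_t io_final_status(uint32_t service_status,
                                const struct iosb *iosb)
{
    if (!vms_success(service_status))
        return service_status;   /* request was never queued */
    return iosb->status;         /* 0 here means not (yet) completed */
}
```

A caller would pass the value returned by the I/O service call along with the IOSB it supplied on that call, and treat anything that fails vms_success() as an error to log and handle. An IOSB that is still zero also fails the test, which is the behavior you want if the code looked at it before the I/O completed.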
If I were working in your environment, I'd audit the IOSB and error
handling throughout the code, would identify and audit any asynchronous
code, would add signal handlers and traceback support and integrated
debugging, and would integrate tracing and logging support into the
code. The lightest of the tracing code should be on continuously, and
additional tracing code should be capable of being enabled even in
production.
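One way the tiered tracing could look, as a minimal sketch in portable C. The level names, trace_set_level, and the stderr destination are all assumptions for illustration; a real deployment would more likely write to a log file or mailbox and take the level from a logical name or an operator command.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical trace levels: TRACE_LIGHT is always on, the others are
 * off until somebody raises the threshold at run time. */
enum trace_level { TRACE_LIGHT = 0, TRACE_DETAIL = 1, TRACE_VERBOSE = 2 };

static volatile int trace_threshold = TRACE_LIGHT;
static unsigned long trace_emitted;   /* count of messages written */

/* Raise or lower the threshold without restarting the application. */
static void trace_set_level(int level)
{
    trace_threshold = level;
}

/* Write one trace line if `level` is enabled. Returns 1 if the message
 * was written and 0 if it was suppressed. */
static int trace(int level, const char *fmt, ...)
{
    va_list ap;

    if (level > trace_threshold)
        return 0;

    va_start(ap, fmt);
    fprintf(stderr, "[trace %d] ", level);
    vfprintf(stderr, fmt, ap);
    fputc('\n', stderr);
    va_end(ap);

    trace_emitted++;
    return 1;
}
```

The cheap TRACE_LIGHT calls stay in the hot paths permanently, and the suppressed levels cost one comparison each until trace_set_level() turns them on in the running production process.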
--
Pure Personal Opinion | HoffmanLabs LLC