[Info-vax] Home-grown application process dumps

Mon Jan 5 12:40:52 EST 2015

On 2015-01-05 17:08:19 +0000, RGB said:

> What I find interesting about the above is that these "bugs" can NOT be 
> reproduced in our test/development/QA environments.

I've seen more than a few of these cases.   This is the 
software-troubleshooting version of proving a negative, after all...

> Said environments run on exactly the same hardware and config i.e., 
> rx2800 i2 with 32GB RAM and VMS v8.4.

I well recall one of these cases that involved heap corruption that 
turned out to be secondary to the length of the DECnet host name.  If 
the length of the DECnet host name caused a heap allocation to be right 
on a quadword boundary, the bug would be exposed and the code would 
crash.   Any other length would cause the heap to pad the allocation 
out to the next quadword, and the bug was masked.   It was also common for 
the overrun to hit something that didn't trigger a crash.  Worse, there 
was only one host around with that particular DECnet host name length 
value, too.  Cue much head scratching.
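
A rough sketch of why the length mattered, in portable C.  The 8-byte 
granule and all of the function names here are my own assumptions for 
illustration, and not the actual allocator involved:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed allocation granule: one quadword (8 bytes). */
#define GRANULE ((size_t)8)

/* Round a requested size up to the next multiple of GRANULE. */
static size_t padded_size(size_t requested)
{
    return (requested + GRANULE - 1) / GRANULE * GRANULE;
}

/* Padding bytes that follow the requested size inside the allocation. */
static size_t slack_bytes(size_t requested)
{
    size_t padded = padded_size(requested);
    assert(padded >= requested);
    return padded - requested;
}

/* Nonzero if writing `overrun` bytes past the end leaves the padding
 * and lands in whatever follows the allocation. */
static int overrun_escapes(size_t requested, size_t overrun)
{
    return overrun > slack_bytes(requested);
}
```

With a 13-byte request there are three bytes of padding, so a one-byte 
overrun lands in the padding and nothing visibly breaks.  With a 
16-byte request there is no padding at all, and the same one-byte 
overrun lands on whatever follows.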

So...  Having the same hardware doesn't really matter to various bugs, 
as you've already conclusively shown here.

A while ago, I traced one case that was secondary to a 
previously-unknown hardware difference.  The servers involved were from 
an order of identically-configured systems, and even had consecutive 
serial numbers from manufacturing.    It turned out that the boxes 
crossed over two batches of I/O controllers when they were originally 
manufactured.   The controllers in one batch all worked fine, and the 
controllers in the other (again original, and never changed) were from 
a different batch, and didn't work quite the same.

When debugging these cases, looking at what's alike doesn't tell you 
nearly as much as looking for what's different.

> These processes dump ONLY in production but, then again, the modules 
> are more heavily utilized in production than in the aforementioned 
> test/dev environments.

Application and system load are always good triggers for snits from 
wobbly code.  These are often latent bugs, and can point to buggy 
synchronization.  Failure to specify IOSBs, and failure to check both 
return status values and IOSBs, are common sources of wobbly code.  
Pretty much any code that can or does use event flags can be suspect.
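
As a sketch of the two-step check, here in portable C rather than with 
the real system service declarations.  The struct and function names 
are hypothetical stand-ins of mine.  The one piece taken from OpenVMS 
itself is that a condition value indicates success when its low bit is 
set:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a $QIO-style I/O status block. */
struct iosb_sketch {
    uint16_t status;   /* final completion status of the I/O */
    uint16_t count;    /* bytes transferred */
    uint32_t devdep;   /* device-dependent information */
};

/* OpenVMS condition values indicate success when the low bit is set. */
static int status_ok(uint32_t status)
{
    return (status & 1u) != 0;
}

/* Combine both checks: the service return status says whether the
 * request was accepted; the IOSB says how the I/O actually ended.
 * Only meaningful once the I/O has completed. */
static uint32_t io_final_status(uint32_t service_status,
                                const struct iosb_sketch *iosb)
{
    assert(iosb != NULL);
    if (!status_ok(service_status))
        return service_status;   /* request failed outright */
    return iosb->status;         /* accepted; IOSB holds the outcome */
}
```

Code that tests only the return status will treat a request that was 
accepted and then failed as a success.  Code that reads the IOSB before 
the I/O has completed has the related synchronization problem.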

If I were working in your environment, I'd audit the IOSB and error 
handling throughout the code, would identify and audit any asynchronous 
code, would add signal handlers and traceback support and integrated 
debugging, and would integrate tracing and logging support into the 
code.  The lightest of the tracing code should be on continuously, and 
additional tracing code should be capable of being enabled even in 
production.
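
A minimal sketch of that sort of layered tracing, again in portable C.  
The level names, the functions and the use of stderr are all 
assumptions for illustration.  Real code would likely tie the level to 
a logical name or similar so it can be changed in a running production 
system:

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical trace levels; TRACE_BASIC is the always-on tier. */
enum { TRACE_BASIC = 0, TRACE_DETAIL = 1, TRACE_VERBOSE = 2 };

/* Current level; adjustable at run time. */
static int trace_current = TRACE_BASIC;

static void trace_set_level(int level)
{
    assert(level >= TRACE_BASIC && level <= TRACE_VERBOSE);
    trace_current = level;
}

/* A message is emitted when its level is at or below the current one. */
static int trace_enabled(int level)
{
    return level <= trace_current;
}

/* Write one trace line to stderr; returns 1 if emitted, 0 if filtered. */
static int trace_msg(int level, const char *fmt, ...)
{
    va_list ap;

    if (!trace_enabled(level))
        return 0;
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
    return 1;
}
```

The filtered case costs one integer comparison, which is what makes it 
reasonable to leave the detailed trace calls compiled into the 
production build.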

-- 
Pure Personal Opinion | HoffmanLabs LLC