[Info-vax] hung program location
Stephen Hoffman
seaohveh at hoffmanlabs.invalid
Tue Feb 19 10:29:20 EST 2013
On 2013-02-19 14:48:44 +0000, Tom Adams said:
> There are no direct $hiber or $wait calls, but I use lib$wait to cause
> brief pauses.
I've found that programs with asynchronous logic that also include
brief pauses can be excellent indicators of latent race conditions; of
latent bugs that some previous programmer hadn't directly resolved.
> Can't think of where other hidden $hiber's could be, unless they happen
> in QIO calls.
That's probably not the best approach when working with sys$hiber and
sys$wake <http://labs.hoffmanlabs.com/node/829>, and irrespective of
whether the code you're working on includes asynchronous logic. Given
the complexity of a typical application and the possibility that some
other programmer somewhere might decide to add sys$hiber or sys$wake or
sys$schdwk calls (to some of your application code, to some library
you're calling, or some system or compiler or application library
you're using — the hibernation scheduling state is process-wide, after
all), it's usually best to always plan for the arrival of spurious
$wake calls.
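
The usual defence is to hibernate inside a loop that re-checks the real
completion condition, rather than treating any wake as "my work is
done". What follows is only a portable sketch of that pattern in
Python, not OpenVMS code: the WakeFlag class is a hypothetical stand-in
for the process-wide wake-pending state, and wait_for and demo are
names made up for the illustration.

```python
import threading


class WakeFlag:
    """Toy stand-in for a process-wide wake-pending flag (not an OpenVMS API)."""

    def __init__(self):
        self._event = threading.Event()

    def wake(self):
        # Anyone can call this, including code you have never heard of.
        self._event.set()

    def hiber(self):
        # Returns at once if a wake is already pending, then consumes it.
        self._event.wait()
        self._event.clear()


def wait_for(flag, is_done):
    """Hibernate until is_done() is true, tolerating spurious wakes."""
    wakeups = 0
    while not is_done():      # always re-check the real condition
        flag.hiber()
        wakeups += 1
    return wakeups


def demo():
    flag = WakeFlag()
    state = {"done": False}

    def complete():
        state["done"] = True  # publish the result first...
        flag.wake()           # ...then issue the wake

    flag.wake()               # spurious wake from some unrelated code
    timer = threading.Timer(0.05, complete)
    timer.start()
    wakeups = wait_for(flag, lambda: state["done"])
    timer.join()
    return state["done"], wakeups
```

A single hiber() call without the loop would return on the spurious
wake, before complete() has run. With the loop, the early wake costs
one extra pass and nothing more.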
> The process does QIO calls to establish network and/or serial links.
There are two flavors: sys$qio and sys$qiow. The former is a rich
source of asynchronous activity and quite ripe for introducing latent
programming bugs: failure to specify IOSBs, failure to specify IOSBs
that are and will remain valid over the lifetime of the asynchronous
calls, failure to properly manage all memory and all variables that are
shared between AST <http://labs.hoffmanlabs.com/node/617> and non-AST
routines, event flag <http://labs.hoffmanlabs.com/node/613> collisions,
etc.
Any number of ways to go off the rails here, too.
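
As a rough illustration of the discipline involved, here is a portable
Python analogy, not OpenVMS code. Each outstanding request carries its
own status block and its own completion event, both of which stay alive
until the request completes, and the caller checks the completion
status rather than just the "queued OK" return. Request, start_io and
demo are hypothetical names, and the status values are invented for the
sketch.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Request:
    """One outstanding operation with its own status block and event."""
    name: str
    status: int = 0      # 0 = still pending, like a zeroed status block
    nbytes: int = 0
    done: threading.Event = field(default_factory=threading.Event)


def start_io(req, payload, delay):
    """Hypothetical async operation: returns at once, completes later."""
    def complete():
        req.nbytes = len(payload)
        req.status = 1   # final status is written before signalling
        req.done.set()

    threading.Timer(delay, complete).start()
    return True          # means "queued" only, not "finished"


def demo():
    reqs = [Request("link-a"), Request("link-b")]
    start_io(reqs[0], b"hello", 0.05)
    start_io(reqs[1], b"hi", 0.01)

    results = {}
    for req in reqs:
        # Each request waits on its own event, so completions can't collide.
        if not req.done.wait(timeout=5):
            raise TimeoutError(req.name)
        results[req.name] = (req.status, req.nbytes)
    return results
```

Sharing one status block or one event between the two requests would
let the faster completion masquerade as the slower one, which is the
flavor of bug that tends to stay hidden until the timing changes.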
Caveat: simply waiting in the context of an AST routine is also
something best avoided.
> It most likely hung under conditions where it was suppose to be
> retrying to establish a link for weeks on end, because we only hook up
> the device it's trying to link to about once a month.
Or maybe a garden-variety bug. This "most likely hung" is a theory:
worth verifying, but far from a certainty. Add application-level
debugging, as a starting point.
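
One inexpensive form of application-level debugging is to log every
step of the retry loop, so that the last line written shows where the
program stopped making progress. A minimal sketch in Python follows,
assuming a retry loop roughly like the one described. The names
connect_with_retry, ListHandler and fake_connect are hypothetical, and
the failure sequence is made up for the demonstration.

```python
import logging
import time


def connect_with_retry(connect, attempts, delay, log):
    """Call connect() until it succeeds, logging each step of each attempt."""
    for attempt in range(1, attempts + 1):
        log.info("attempt %d: connecting", attempt)
        try:
            link = connect()
        except OSError as exc:
            log.warning("attempt %d: failed: %s", attempt, exc)
            time.sleep(delay)
            continue
        log.info("attempt %d: connected", attempt)
        return link
    log.error("giving up after %d attempts", attempts)
    return None


class ListHandler(logging.Handler):
    """Collects formatted log messages in memory for the demonstration."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(record.getMessage())


def demo():
    log = logging.getLogger("linkdemo")
    log.setLevel(logging.INFO)
    log.propagate = False
    handler = ListHandler()
    log.addHandler(handler)

    # Made-up behaviour: two failures, then success.
    outcomes = [OSError("no carrier"), OSError("no carrier"), "link-up"]

    def fake_connect():
        result = outcomes.pop(0)
        if isinstance(result, Exception):
            raise result
        return result

    link = connect_with_retry(fake_connect, attempts=5, delay=0.01, log=log)
    log.removeHandler(handler)
    return link, handler.lines
```

In a real program the log would go to a file with timestamps. If the
last entry is a "connecting" line with nothing after it, the hang is
inside the connect call. If the entries simply stop between attempts,
look at the delay or wait logic instead.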
The ways of asynchronous programming wizardry on OpenVMS can be quite
subtle, and sometimes quick to anger.
If there is asynchronous code here (e.g., sys$qio calls or other non-W
calls, or asynchronous or synchronous calls with AST completion
routines specified, rather than only sys$qiow or other synchronous
calls), then you're in the deep end of the pool. Familiarity with
what's documented in the OpenVMS Programming Concepts manual is likely
necessary, particularly its synchronization chapters, and you may need
to become familiar with memory synchronization
<http://labs.hoffmanlabs.com/node/407>.
--
Pure Personal Opinion | HoffmanLabs LLC