[Info-vax] hung program location

Stephen Hoffman seaohveh at hoffmanlabs.invalid
Tue Feb 19 10:29:20 EST 2013


On 2013-02-19 14:48:44 +0000, Tom Adams said:

> There are no direct $hiber or $wait calls, but I use lib$wait to cause
> brief pauses.

I've found that programs with asynchronous logic that also include 
brief pauses can be excellent indicators of latent race conditions; of 
latent bugs that some previous programmer hadn't directly resolved.

> Can't think of where other hidden $hiber's could be, unless they happen 
> in QIO calls.

That's probably not the best approach when working with sys$hiber and 
sys$wake <http://labs.hoffmanlabs.com/node/829>, irrespective of 
whether the code you're working on includes asynchronous logic.  Given 
the complexity of a typical application, and the possibility that some 
other programmer somewhere might decide to add sys$hiber or sys$wake or 
sys$schdwk calls (to some of your application code, or to some system, 
compiler or application library you're calling; the hibernation 
scheduling state is process-wide, after all), it's usually best to 
always plan for the arrival of spurious $wake calls.
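
As a rough illustration of planning for spurious wakes, here is a 
minimal sketch in portable Python rather than OpenVMS code.  A 
condition variable stands in for sys$hiber and sys$wake (a loose 
analogy only; the real services keep a process-wide wake-pending 
state), and the WorkQueue class and its method names are hypothetical. 
The point is the loop: a wake-up is treated as a hint to go look for 
work, never as proof that work exists.

```python
import threading
from collections import deque


class WorkQueue:
    """Hypothetical stand-in for 'hibernate until there is work to do'."""

    def __init__(self):
        self._cond = threading.Condition()
        self._items = deque()
        self.stray_wakeups = 0

    def post(self, item):
        # Queue the work first, then wake the waiter.
        with self._cond:
            self._items.append(item)
            self._cond.notify()

    def stray_wake(self):
        # A wake with no work attached, as unrelated code might issue.
        with self._cond:
            self._cond.notify()

    def take(self, timeout):
        # Re-check for real work after every wake-up.
        with self._cond:
            while not self._items:
                woken = self._cond.wait(timeout)
                if self._items:
                    break
                if not woken:
                    return None  # timed out with nothing queued
                self.stray_wakeups += 1  # woken, but nothing to do
            return self._items.popleft()
```

A consumer calling take() keeps going back to sleep on stray wakes, 
and only returns when an item has been posted or the wait times out.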

> The process does QIO calls to establish network and/or serial links.

There are two flavors: sys$qio and sys$qiow.  The former is a rich 
source of asynchronous activity and quite ripe for introducing latent 
programming bugs: failure to specify IOSBs, failure to specify IOSBs 
that are and will remain valid over the lifetime of the asynchronous 
calls, failure to properly manage all memory and all variables that 
are shared between AST <http://labs.hoffmanlabs.com/node/617> and 
non-AST routines, event flag <http://labs.hoffmanlabs.com/node/613> 
collisions, etc.

Any number of ways to go off the rails here, too.
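
To make the IOSB points a little more concrete, here is a hedged 
sketch in portable Python that only models the bookkeeping; it calls 
no OpenVMS services.  The Iosb and Request classes and the complete() 
and outcome() helpers are hypothetical names.  It relies on two 
things: the OpenVMS convention that a status with the low bit set 
means success, and the assumption that an IOSB reads as zero from the 
time the request is queued until the I/O completes.  Each request owns 
its own status block for as long as the request exists, and both the 
status returned by the queuing call and the status in the block get 
checked.

```python
from dataclasses import dataclass, field

SS_NORMAL = 1  # OpenVMS normal-success status value


def ok(status):
    # OpenVMS convention: low bit set means success.
    return bool(status & 1)


@dataclass
class Iosb:
    status: int = 0  # assumed to read as 0 until the I/O completes
    count: int = 0


@dataclass
class Request:
    name: str
    # One status block per request, living exactly as long as the request.
    iosb: Iosb = field(default_factory=Iosb)


def complete(req, status, count):
    # Hypothetical stand-in for the I/O completing and filling in the block.
    req.iosb.status = status
    req.iosb.count = count


def outcome(queue_status, req):
    # Check the queuing call's status first, then the block's status.
    if not ok(queue_status):
        return "not queued"
    if req.iosb.status == 0:
        return "pending"
    return "ok" if ok(req.iosb.status) else "failed"
```

The C analog of the lifetime problem is an IOSB declared on the stack 
of a routine that returns before the I/O completes; tying the block to 
the request avoids that.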

Caveat: simply waiting in the context of an AST routine is also 
something best avoided.

> It most likely hung under conditions where it was suppose to be 
> retrying to establish a link for weeks on end, because we only hook up 
> the device it's trying to link to about once a month.

Or maybe a garden-variety bug.  This "most likely hung" is a theory; 
one worth verifying, but far from a certainty.  Add application-level 
debugging, as a starting point.
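
One possible shape for that application-level debugging, sketched in 
portable Python with hypothetical names (Trace, try_link, and the 
connect callable are all assumptions, not anything from the 
application in question): keep a small ring of timestamped markers 
around each step of the retry logic, so that when the program next 
appears hung, the last marker shows how far it got.

```python
import time
from collections import deque


class Trace:
    """Keeps the most recent timestamped markers in a bounded ring."""

    def __init__(self, capacity=64, clock=time.time):
        self._events = deque(maxlen=capacity)
        self._clock = clock

    def mark(self, label):
        self._events.append((self._clock(), label))

    def last(self):
        return self._events[-1] if self._events else None

    def dump(self):
        return [f"{stamp:.3f} {label}" for stamp, label in self._events]


def try_link(connect, trace, attempts):
    # Mark before and after each step, so a hang sits between two markers.
    for n in range(1, attempts + 1):
        trace.mark(f"attempt {n}: connecting")
        if connect():
            trace.mark(f"attempt {n}: connected")
            return True
        trace.mark(f"attempt {n}: failed, will retry")
    return False
```

Writing the same markers to a log file or mailbox would serve as well; 
the bounded ring just keeps weeks of retries from filling the disk.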

The ways of asynchronous programming wizardry on OpenVMS can be quite 
subtle, and sometimes quick to anger.

If there is asynchronous code here (e.g., sys$qio calls or other non-W 
calls, or asynchronous or synchronous calls with AST completion 
routines specified, and not just sys$qiow or other synchronous calls), 
then you're in the deep end of the pool here, too.  Familiarity with 
the synchronization chapters in the OpenVMS Programming Concepts 
manual is likely necessary here, and you may need to become familiar 
with memory synchronization <http://labs.hoffmanlabs.com/node/407>.


-- 
Pure Personal Opinion | HoffmanLabs LLC



