[Info-vax] hung program location
Tom Adams
w.tom.adams at gmail.com
Wed Feb 20 10:45:13 EST 2013
On Feb 19, 2:27 pm, VAXman- @SendSpamHere.ORG wrote:
> In article <780e61f7-c9b5-4a64-a5bb-2b4f9ef0f... at w14g2000vba.googlegroups.com>, Tom Adams <w.tom.ad... at gmail.com> writes:
>
> >On Feb 19, 10:29 am, Stephen Hoffman <seaoh... at hoffmanlabs.invalid>
> >wrote:
> >> On 2013-02-19 14:48:44 +0000, Tom Adams said:
>
> >> > There are no direct $hiber or $wait calls, but I use lib$wait to cause
> >> > brief pauses.
>
> >> I've found that programs with asynchronous logic that also include
> >> brief pauses can be excellent indicators of latent race conditions; of
> >> latent bugs that some previous programmer hadn't directly resolved.
>
> >> > Can't think of where other hidden $hiber's could be, unless they happen
> >> > in QIO calls.
>
> >> That's probably not the best approach when working with sys$hiber and
> >> sys$wake <http://labs.hoffmanlabs.com/node/829>, and irrespective of
> >> whether the code you're working on includes asynchronous logic. Given
> >> the complexity of a typical application and the possibility that some
> >> other programmer somewhere might decide to add sys$hiber or sys$wake or
> >> sys$schdwk calls (to some of your application code, to some library
> >> you're calling, or some system or compiler or application library
> >> you're using; the hibernation scheduling state is process-wide, after
> >> all), it's usually best to always plan for the arrival of spurious
> >> $wake calls.
>
> >> > The process does QIO calls to establish network and/or serial links.
>
> >> There are two flavors; sys$qio, and sys$qiow. The former is a rich
> >> source of asynchronous activity and quite ripe for introducing latent
> >> programming bugs. Failure to specify IOSBs, failure to correctly
> >> specify IOSBs that are and will remain valid over the lifetime of the
> >> asynchronous calls, failure to properly manage all memory and all
> >> variables that are shared between AST
> >> <http://labs.hoffmanlabs.com/node/617> and non-AST routines, event flag
> >> <http://labs.hoffmanlabs.com/node/613> collisions, etc.
>
> >> Any number of ways to go off the rails here, too.
>
> >> Caveat: simply waiting in the context of an AST routine is also
> >> something best avoided.
>
> >> > It most likely hung under conditions where it was supposed to be
> >> > retrying to establish a link for weeks on end, because we only hook up
> >> > the device it's trying to link to about once a month.
>
> >> Or maybe a garden-variety bug. But this "most likely hung" is a
> >> theory, one worth verification, but far from a certainty. Add
> >> application-level debugging, as a starting point.
>
> >> The ways of asynchronous programming on OpenVMS wizardry can be quite
> >> subtle, and sometimes quick to anger.
>
> >> If there is asynchronous code here (eg: sys$qio calls or other non-W
> >> calls, or asynch or synch calls with AST completion routines specified,
> >> and not necessarily with sys$qiow or other synchronous calls), then
> >> you're in the deep end of the pool here, too. Familiarity with what's
> >> documented in the OpenVMS Programming Concepts is likely necessary
> >> here, and you may need to become familiar with memory synchronization
> >> <http://labs.hoffmanlabs.com/node/407>, and with the synchronization
> >> chapters in the Programming Concepts manual.
>
> >> --
> >> Pure Personal Opinion | HoffmanLabs LLC
>
> >There is only one AST programmed in. It's a resource wait AST that
> >only fires when the system is telling the process to shutdown, so it
> >did not cause the problem. None of the QIOs or QIOWs use ASTs.
>
> >My theory is that the process got stuck in a LIB$WAIT call, but I
> >don't know why that would happen. But it could be that the program
> >gets into HIB somewhere else in the code processing, and there is the
> >possibility that I am overlooking some bug that would screw up a
> >LIB$WAIT.
>
> I doubt that it's the fault of LIB$WAIT though.
>
> >One odd thing is that the same process hung on three different
> >Alphas. But I don't know if they all got hung at the same time. The
> >three processes would all be trying to get (or had or lost) a network
> >connection to the same IP address of the same analyzer. The analyzer
> >is turned off and on and moved around to different physical connection
> >points. The code has been stable for a long time, but moving
> >devices around like this is a fairly new practice.
>
> How long is the wait that you are specifying with LIB$WAIT???
>
> --
> VAXman- A Bored Certified VMS Kernel Mode Hacker VAXman(at)TMESIS(dot)ORG
>
> Well I speak to machines with the voice of humanity.
There are a number of LIB$WAITs in the code that range from 0.5 to
10.0 seconds.
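
For what it's worth, the quoted advice about planning for spurious $wake
calls amounts to a bounded wait loop that rechecks its condition after
every pause. A minimal sketch of that pattern, written in Python rather
than against the actual VMS services; wait_for, its parameters, and the
injected sleep/clock are hypothetical names for illustration, not anything
from the program under discussion:

```python
import time


def wait_for(condition, timeout, pause=0.5,
             sleep=time.sleep, clock=time.monotonic):
    """Return True once condition() holds, False if timeout seconds pass first.

    Hypothetical helper: 'sleep' stands in for whatever pause primitive the
    real program uses, and 'clock' for its time source.
    """
    deadline = clock() + timeout
    while not condition():
        remaining = deadline - clock()
        if remaining <= 0:
            return False  # bounded: the caller can log and retry
        # The pause may end early (the analogue of a spurious wake).
        # That is harmless, because the loop rechecks both the
        # condition and the deadline before pausing again.
        sleep(min(pause, remaining))
    return True
```

In the real code the pause would be the LIB$WAIT call and the condition
would be whatever "link established" test the program already has. The
point is only that an early wake changes nothing, and the deadline gives a
natural place to log that the retry loop is still alive.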