[Info-vax] hung program location
VAXman- at SendSpamHere.ORG
Tue Feb 19 14:27:49 EST 2013
In article <780e61f7-c9b5-4a64-a5bb-2b4f9ef0fda0 at w14g2000vba.googlegroups.com>, Tom Adams <w.tom.adams at gmail.com> writes:
>On Feb 19, 10:29 am, Stephen Hoffman <seaoh... at hoffmanlabs.invalid>
>wrote:
>> On 2013-02-19 14:48:44 +0000, Tom Adams said:
>>
>> > There are no direct $hiber or $wait calls, but I use lib$wait to cause
>> > brief pauses.
>>
>> I've found that programs with asynchronous logic that also include
>> brief pauses can be excellent indicators of latent race conditions; of
>> latent bugs that some previous programmer hadn't directly resolved.
>>
>> > Can't think of where other hidden $hiber's could be, unless they happen
>> > in QIO calls.
>>
>> That's probably not the best approach when working with sys$hiber and
>> sys$wake <http://labs.hoffmanlabs.com/node/829>, and irrespective of
>> whether the code you're working on includes asynchronous logic.  Given
>> the complexity of a typical application and the possibility that some
>> other programmer somewhere might decide to add sys$hiber or sys$wake or
>> sys$schdwk calls (to some of your application code, to some library
>> you're calling, or some system or compiler or application library
>> you're using -- the hibernation scheduling state is process-wide, after
>> all), it's usually best to always plan for the arrival of spurious
>> $wake calls.
>>
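A minimal sketch of the "plan for spurious wakes" advice above: rather than
treating every return from hibernation as the event you were waiting for,
re-check an explicit flag and go back to sleep if it is not set. This is
portable C for illustration only. fake_hiber() is a hypothetical stand-in for
sys$hiber, and work_ready and wait_for_work() are made-up names; in real code
the flag would be set by whatever completes the work before it issues the wake.

```c
#include <stdbool.h>

/* Set by whatever completes the work (hypothetical name). */
static volatile bool work_ready = false;

/* Hypothetical stand-in for sys$hiber(): it just counts calls, and on the
   third call it simulates the real event arriving. The first two returns
   model wakes issued by unrelated code in the same process. */
static int hiber_calls = 0;

void fake_hiber(void)
{
    hiber_calls++;
    if (hiber_calls == 3)
        work_ready = true;
}

/* Hibernate until work_ready is set. A wake that arrives without the flag
   being set is counted as spurious and otherwise ignored.
   Returns the number of spurious wakes seen. */
int wait_for_work(void)
{
    int spurious = 0;

    while (!work_ready) {
        fake_hiber();
        if (!work_ready)
            spurious++;
    }
    work_ready = false;   /* consume the event */
    return spurious;
}
```

The point of the loop is that a stray wake from a library or from someone
else's code costs one extra pass through the test, not a logic error.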
>> > The process does QIO calls to establish network and/or serial links.
>>
>> There are two flavors; sys$qio, and sys$qiow.  The former is a rich
>> source of asynchronous activity and quite ripe for introducing latent
>> programming bugs.  Failure to specify IOSBs, failure to correctly
>> specify IOSBs that are and will remain valid over the lifetime of the
>> asynchronous calls, failure to properly manage all memory and all
>> variables that are shared between AST
>> <http://labs.hoffmanlabs.com/node/617> and non-AST routines, event flag
>> <http://labs.hoffmanlabs.com/node/613> collisions, etc.
>>
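As a rough illustration of the IOSB point above, here is a sketch of checking
both the status returned by the call and the completion status in the IOSB.
It is portable C, not code from the program under discussion: the struct
layout and the names vms_success() and io_succeeded() are assumptions made for
the example. The one fact it relies on is that OpenVMS condition values
indicate success when the low bit is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Layout assumed for illustration: 16-bit completion status, 16-bit
   transfer count, 32 bits of device-dependent information. */
struct iosb {
    uint16_t status;
    uint16_t count;
    uint32_t dev_info;
};

/* OpenVMS condition values signal success with the low bit set. */
bool vms_success(uint32_t condition)
{
    return (condition & 1u) != 0;
}

/* An I/O request only succeeded if it was queued successfully (the value
   returned by the call) AND it completed successfully (the IOSB status). */
bool io_succeeded(uint32_t call_status, const struct iosb *iosb)
{
    if (!vms_success(call_status))
        return false;               /* request was never queued */
    return vms_success(iosb->status);
}
```

For the asynchronous flavor the IOSB is written when the I/O completes, so it
has to live in storage that is still valid at that time (static or heap, not
the stack frame of a routine that has already returned).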
>> Any number of ways to go off the rails here, too.
>>
>> Caveat: simply waiting in the context of an AST routine is also
>> something best avoided.
>>
>> > It most likely hung under conditions where it was supposed to be
>> > retrying to establish a link for weeks on end, because we only hook up
>> > the device it's trying to link to about once a month.
>>
>> Or maybe a garden-variety bug.  But this "most likely hung" is a
>> theory, one worth verification, but far from a certainty.  Add
>> application-level debugging, as a starting point.
>>
>> The ways of asynchronous programming on OpenVMS wizardry can be quite
>> subtle, and sometimes quick to anger.
>>
>> If there is asynchronous code here (eg: sys$qio calls or other non-W
>> calls, or asynch or synch calls with AST completion routines specified,
>> and not necessarily with sys$qiow or other synchronous calls), then
>> you're in the deep end of the pool here, too.  Familiarity with what's
>> documented in the OpenVMS Programming Concepts is likely necessary
>> here, and you may need to become familiar with memory synchronization
>> <http://labs.hoffmanlabs.com/node/407>, and with the synchronization
>> chapters in the Programming Concepts manual.
>>
>> --
>> Pure Personal Opinion | HoffmanLabs LLC
>
>There is only one AST programmed in. It's a resource wait AST that
>only fires when the system is telling the process to shut down, so it
>did not cause the problem. None of the QIOs or QIOWs use ASTs.
>
>My theory is that the process got stuck in a LIB$WAIT call, but I
>don't know why that would happen. But it could be that the program
>gets into HIB somewhere else in the code processing, and there is the
>possibility that I am overlooking some bug that would screw up a
>LIB$WAIT.
I doubt that it's the fault of LIB$WAIT though.
>One odd thing is that the same process hung on three different
>Alphas. But I don't know if they all got hung at the same time. The
>three processes would all be trying to get (or had or lost) a network
>connection to the same IP address of the same analyzer. The analyzer
>is turned off and on and moved around to different physical connection
>points. The code has been stable for a long time, but the practice
>of moving devices around like this is kind of a new practice.
How long is the wait that you are specifying with LIB$WAIT???
--
VAXman- A Bored Certified VMS Kernel Mode Hacker VAXman(at)TMESIS(dot)ORG
Well I speak to machines with the voice of humanity.