[Info-vax] A 5 minutes hang during early stage of a shutdown.

Stephen Hoffman seaohveh at hoffmanlabs.invalid
Tue Jan 22 09:31:34 EST 2013


On 2013-01-22 13:50:28 +0000, Jan-Erik Soderholm said:

> OK, I guess that is simply defining SHUTDOWN$VERIFY, right?

Edit the file(s) involved and stick the proverbial SET VERIFY at the 
top, if you have to.  Revert when done.  This is VMS and not rocket 
science, after all.
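
For what it's worth, a minimal sketch of both approaches.  This 
assumes SHUTDOWN$VERIFY merely needs to be defined (the particular 
value shouldn't matter), and uses SYSHUTDWN.COM as the example target 
file; adjust for whichever procedure you're actually chasing:

  $ ! Either define the logical system-wide before the shutdown...
  $ DEFINE/SYSTEM SHUTDOWN$VERIFY TRUE
  $ ! ...or edit the procedure (e.g. SYS$MANAGER:SYSHUTDWN.COM) and
  $ ! temporarily add this at the top; remove it when done:
  $ SET VERIFY
  $ ! ... existing procedure contents ...
  $ SET NOVERIFY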

>> Sort out the particular trigger for the mutex, as those can cause various
>> secondary problems.  <http://labs.hoffmanlabs.com/node/231>
> 
> This has been "up" in at least one thread before. I never managed
> to find any information. I'm now confident that this happens
> if one tries to run "telnet delete" or "telnet create" against
> a TNA device that is already in "waiting for delete".
> 
> After a "telnet delete", it takes 10-15 seconds before the TNA
> device is removed.

Confidence is not the same as a reproducer, and a reproducer will 
usually get the attention of the support folks: it's much harder to 
deny there's an issue when you're handed a reproducer, and it's much 
easier to test a fix with one.  Whether you get anywhere with that 
reproducer is another matter and another discussion; the (mis)behavior 
demonstrated by the reproducer could still be declared a "feature" and 
not a "bug", for instance.

If you're doing a whole lot of that and OpenVMS and TCP/IP aren't 
reacting quickly or appropriately enough, and HP support isn't 
providing you with a solution sufficient for your needs, then write 
some tools to manage and allocate your application's TN device access 
yourself.
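
As a rough illustration of the sort of wrapper I mean, a sketch of the 
delete side: issue whatever "telnet delete" form you're using, then 
poll with F$GETDVI until the TNA unit is actually gone before the next 
create is attempted.  The device name here is a placeholder:

  $ device = "TNA23:"     ! placeholder unit name
  $ ! ... issue your "telnet delete" against 'device' here ...
  $ wait_for_gone:
  $ IF .NOT. F$GETDVI(device,"EXISTS") THEN GOTO gone
  $ WAIT 00:00:01         ! the unit can linger 10-15 seconds
  $ GOTO wait_for_gone
  $ gone:
  $ ! only now issue the next "telnet create" against that unit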


>> As a complete WAG, I've seen vaguely similar SHUTDOWN-time hangs with bogus
>> hosts listed in the queue database, and cases where the old host name is
>> latent in the queue database after a host has been renamed.
> 
> Yes, that does correspond with the fact that the shutdown process
> is in QUEMAN.EXE.

Yes, you've mentioned this several times.

> We should probably either clean up the DECnet database or
> not start DECnet at all...

The distributed queue manager uses SCS-level communications, and not DECnet.

Specifically, the queue manager uses the SCA$TRANSPORT SCS SYSAP, and 
not DECnet, and not IP.

This is the ugly, gnarly sausage factory of VMS.  Do some local DCL 
debugging.  Figure out where things went sideways.
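
Along those lines, a couple of starting points; a sketch, assuming the 
usual queue manager commands and SDA displays:

  $ ! Which nodes the queue manager thinks it can run on; stale or
  $ ! renamed host names tend to turn up in this display:
  $ SHOW QUEUE /MANAGER /FULL
  $ ! If an old node name is latent, the node list can be reset
  $ ! (NODEA and NODEB are placeholders):
  $ START /QUEUE /MANAGER /ON=(NODEA::, NODEB::, *)
  $ !
  $ ! And the SYSAPs actually in use (SCA$TRANSPORT among them,
  $ ! no DECnet involved):
  $ ANALYZE /SYSTEM
  SDA> SHOW CONNECTIONS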

————

ps: Standard caveat whenever mixing Internet discussions of clustering 
and DECnet in the same posting or same IRC chat: DECnet is not the 
underpinnings of clustering. DECnet is not related to clustering.  
Clustering does not use DECnet.  Yes, there are clustering-related 
services that can use DECnet, such as the use of MOP within the 
satellite boot process, and the MONITOR VPM server.  But there are 
alternatives for these and other uses, and the transport underneath 
clustering is not DECnet.  You can cluster without DECnet installed.  
This might not be your intent and you may well know the disparate 
nature of DECnet and clustering, but I'm decidedly cautious when I 
encounter discussions of tweaking DECnet when clustering is centrally 
involved.


-- 
Pure Personal Opinion | HoffmanLabs LLC



