[Info-vax] Production VMS cluster hanging with lots of LEFO

John Santos john at egh.com
Fri Mar 13 18:20:37 EDT 2009


In article <e61cc3be-c732-4ed0-bff2-866c86e22314
@e18g2000yqo.googlegroups.com>, heinvandenheuvel at gmail.com says...
> On Mar 13, 5:32 am, filip.debl... at proximus.net wrote:
> > Greetings.
> >
> > Yesterday we had a massive incident on our most important VMS
> > machines.
> 
> > OS is VMS 8.3, we run DECNET over IP.
> 
> > Fortunately I always keep some sessions open on my station (not part
> > of the cluster), which were still working. The system was NOT down.
> 
> > When looking at the first system, I immediately noticed a significant
> > number of LEFO processes, most of them related to individual (DCL) users,

[snip]

> > - (not 100% sure of this) most of the LEFO processes were attempts to
> > login, trying to run LOGINOUT.EXE (just another image ...)
> 
> Handwaving here... wild speculation... when a process does not exist
> yet and the first image still has to run but cannot get to the disk
> (that serialization lock), would it be Outswapped waiting for the
> disk, considering it is not in yet?
> I should crack open the internals book and remind myself of the states
> in the life of a process.  "In the beginning...". But I don't have
> time for that now.
> 

To elaborate a little on this, since no one else has mentioned it
explicitly...

(Warning: this is all based on vague and ancient memory, and could
be decades obsolete!)

I think early on, when the job controller (?) creates a new interactive
process, it sets up a process header with appropriate values in the
saved registers, and its memory is mapped to a swapped-out copy of
LOGINOUT.EXE (i.e. pointing to LOGINOUT's shared sections).  Then,
when the swapper decides to swap it in, it loads the new process's
read-write data from LOGINOUT's writable, copy-on-reference pages
and the code & read-only data from LOGINOUT's shareable pages, and
off it goes.  During the time between the process being created and
the swapper swapping it in, it sits in LEFO state.
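
(If this ever happens again, here's a quick way to see what's piling
up.  Caveat: the /STATE qualifier needs a reasonably recent VMS; the
PIPE form should work on anything from 7.1 on.  Same ancient memory
as the rest of this, so season to taste:)

$ SHOW SYSTEM /STATE=LEFO                    ! list only LEFO processes
$ PIPE SHOW SYSTEM | SEARCH SYS$PIPE "LEFO"  ! equivalent on older VMS
$ ANALYZE /SYSTEM                            ! then, at the SDA prompt:
SDA> SHOW SUMMARY            ! every process with its scheduling state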

So zillions of these processes could be due to something attempting
a zillion interactive process creations (a telnet storm?) combined
with some kind of swapper deadlock or bottleneck.  Did the
"network load test" do anything like this?  Could the quorum node
have been a bottleneck?  Maybe it was trying to log all the process
creations to its private OPERATOR.LOG or to a non-shared accounting
or security log?  Maybe its system disk got full?  (Did the boot
procedure on the quorum node purge anything, thus curing the
problem?)  Maybe there was a "China Syndrome"[1] bottleneck, with a
zillion OPCOM "telnet session created" messages being sent to a slow
serial port?  I'm assuming here that the quorum node is a much
smaller, slower system, maybe with a small local system disk.
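
(Next time the quorum node is suspect, a few quick checks; all
bog-standard DCL as far as I remember, and the file names below are
just the usual defaults, so they may differ on your systems:)

$ SHOW DEVICE SYS$SYSDEVICE                ! free blocks on system disk
$ DIRECTORY /SIZE=ALL SYS$MANAGER:OPERATOR.LOG;*
$ DIRECTORY /SIZE=ALL SYS$MANAGER:ACCOUNTNG.DAT
$ SHOW ACCOUNTING                          ! which events get recorded
$ SHOW AUDIT                               ! ditto for security alarms
$ REPLY /LOG                               ! close the current operator
                                           ! log and start a fresh one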

[snip]

> 
> Cheers,
> Hein.

[1] If anyone remembers the movie, when things started going
wrong, thousands of messages started printing out on a terminal
that looked suspiciously like an LA36, and so the messages
were way behind the actual events.  I understand this was
based on real behaviour of at least some nuclear power plants:
during routine operations, so many messages would print on
the console printers that they would get way behind, and the
operators would do "something" (I suspect typing Ctrl/O) to
skip ahead so the console printer could catch up, but there
was a real danger they would miss important, non-standard
messages when they did this.

-- 
John Santos
Evans Griffiths & Hart, Inc.


