[Info-vax] Production VMS cluster hanging with lots of LEFO

Richard B. Gilbert rgilbert88 at comcast.net
Fri Mar 13 18:41:14 EDT 2009


John Santos wrote:
> In article <e61cc3be-c732-4ed0-bff2-866c86e22314
> @e18g2000yqo.googlegroups.com>, heinvandenheuvel at gmail.com says...> 
>> On Mar 13, 5:32 am, filip.debl... at proximus.net wrote:
>>> Greetings.
>>>
>>> Yesterday we had a massive incident on our most important VMS
>>> machines.
>>> OS is VMS 8.3, we run DECNET over IP.
>>> Fortunately I always keep some sessions open on my station (not part
>>> of the cluster), which were still working. The system was NOT down.
>>> When looking at the first system, I immediately noticed a significant
>>> number of LEFO processes, most of them related to individual (DCL) users,
> 
> [snip]
> 
>>> - (not 100% sure of this) most of the LEFO processes were attempts to
>>> log in, trying to run LOGINOUT.EXE (just another image ...)
>> Handwaving here... wild speculation... when a process does not exist
>> yet and the first image still has to run but can not get to the disk
>> (that serialization lock) would it be Outswapped waiting for the disk,
>> considering it is not in yet?
>> I should crack open the internal book and remind myself of the states
>> in the life of a process.  "in the beginning...". But I don't have
>> time for that now.
>>
> 
> To elaborate a little on this, since no one else has mentioned this
> explicitly...
> 
> (Warning: this is all based on vague and ancient memory, and could
> be decades obsolete!)
> 
> I think early on when the job controller (?) creates a new interactive
> process, it sets up a process header with appropriate values in the
> saved registers and its memory mapped to a swapped-out copy of
> LOGINOUT.EXE (i.e. pointing to LOGINOUT's shared sections.)  Then
> when the swapper decides to swap it in, it loads the new process's
> read-write data from LOGINOUT's writeable, copy-on-reference pages
> and the code & read-only data from LOGINOUT's shareable pages and
> off it goes.  During the time between it getting created and when
> the swapper swaps it in, it's in LEFO state.
> 
> So zillions of these processes could be due to something attempting
> a zillion interactive process creations (telnet storm?) combined
> with some kind of swapper deadlock or bottleneck.  Did the
> "network load test" do anything like this?  Could the quorum node
> have been a bottleneck?  Maybe trying to log all the process creations
> to its private OPERATOR.LOG or a non-shared accounting or security
> log?  Maybe its system disk got full?  (Did the boot procedure on the
> quorum node purge anything, thus curing the problem?)  Maybe there
> was a "China Syndrome"[1] bottleneck, sending a zillion OPCOM "telnet
> session created" messages to a slow serial port?  I'm assuming
> here that the quorum node is a much smaller, slower system, maybe
> with a small local system disk.
> 
> [snip]
> 
>> Cheers,
>> Hein.
> 
> [1] If anyone remembers the movie, when things started going
> wrong, thousands of messages started printing out on a terminal
> that looked suspiciously like an LA36, and so the messages
> were way behind the actual events.  I understand this was
> based on real behaviour of at least some nuclear power plants:
> during routine operations, so many messages would print on
> the console printers that they would get way behind, and the
> operators would do "something" (I suspect typing Ctrl/O) to
> skip ahead so the console printer could catch up, but there
> was a real danger they would miss important, non-standard
> messages when they did this.
> 

At least VMS will say something like "Above message repeated 50 times" 
instead of trying to print all 50 copies.
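
To make the idea concrete, here is a rough sketch in Python of that kind
of coalescing. It is only an illustration of the technique, not how VMS
or OPCOM actually does it; the function name and the exact wording of
the summary line are made up for the example.

```python
from itertools import groupby


def coalesce(messages):
    """Collapse runs of identical consecutive messages.

    Each run is emitted once, followed by a summary line saying how many
    additional times it occurred. (Hypothetical helper, for illustration.)
    """
    out = []
    for text, run in groupby(messages):
        extra = sum(1 for _ in run) - 1  # repeats beyond the first copy
        out.append(text)
        if extra > 0:
            unit = "time" if extra == 1 else "times"
            out.append(f"Above message repeated {extra} {unit}")
    return out
```

Only consecutive duplicates are merged, so a different message arriving
in the middle of a storm still shows up in its proper place.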

There is much to be said for prioritizing messages.  If you bombard the 
operator with drivel, he will not pay as much attention as he should.

I recall, while working as a contractor at Mobil, hearing a discussion 
of what to tell the operators when the refinery was in danger of blowing 
up.  I asked "Why involve the operator at all?"  If there is danger of 
an explosion just shut down to a safe condition!  The odds of a human 
doing the right thing in an emergency, especially if the emergency is 
his fault, are not good!


