[Info-vax] Production VMS cluster hanging with lots of LEFO

Tue Mar 17 20:00:11 EDT 2009

In article <00A8879A.EF14B146 at SendSpamHere.ORG>, VAXman-
@SendSpamHere.ORG says...
> In article <MPG.2424a6445c5a927e989687 at news.verizon.net>, John Santos <john at egh.com> writes:
> >In article <e61cc3be-c732-4ed0-bff2-866c86e22314
> >@e18g2000yqo.googlegroups.com>, heinvandenheuvel at gmail.com says...
> >> On Mar 13, 5:32 am, filip.debl... at proximus.net wrote:
> >> > Greetings.
> >> >
> >> > Yesterday we had a massive incident on our most important VMS
> >> > machines.
> >>
> >> > OS is VMS 8.3, we run DECNET over IP.
> >>
> >> > Fortunately I always keep some sessions open on my station (not part
> >> > of the cluster), which were still working. The system was NOT down.
> >>
> >> > When looking at the first system, I immediately remarked a significant
> >> > number of LEFO processes, most of them related to individual (DCL) users,
> >
> >[snip]
> >
> >> > - (not 100% sure of this) most of the LEFO processes were attempts to
> >> > login, trying to run LOGINOUT.EXE (just another image ...)
> >>
> >> Handwaving here... wild speculation... when a process does not exist
> >> yet and the first image still has to run but can not get to the disk
> >> (that serialization lock) would it be Outswapped waiting for the disk,
> >> considering it is not in yet?
> >> I should crack open the internal book and remind myself of the states
> >> in the life of a process.  "in the beginning...". But I don't have
> >> time for that now.
> >>
> >
> >To elaborate a little on this, since no one else has mentioned this
> >explicitly...
> >
> >(Warning: this is all based on vague and ancient memory, and could
> >be decades obsolete!)
> >
> >I think early on, when the job controller (?) creates a new interactive
> >process, it sets up a process header with appropriate values in the
> >saved registers and its memory mapped to a swapped-out copy of
> >LOGINOUT.EXE (i.e. pointing to LOGINOUT's shared sections.)  Then
> >when the swapper decides to swap it in, it loads the new process's
> >read-write data from LOGINOUT's writeable, copy-on-reference pages
> >and the code & read-only data from LOGINOUT's shareable pages and
> >off it goes.  During the time between it getting created and when
> >the swapper swaps it in, it's in LEFO state.
> 
> COMO -> COM

That makes sense...  But (idle speculation), I wonder if it could
go from COMO to LEFO if the scheduler decided to page it in and
got stuck while waiting for the page fault read I/O to page it in?
Or would that be some sort of page fault MWAIT state?

Or maybe there's something locked (file, bucket, record) in SYSUAF.DAT
or RIGHTSLIST.DAT and the loginout processes got swapped out again
while waiting...
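
To make the state question concrete, here is a toy model of the
transitions being discussed, as a minimal Python sketch. The state
names (COMO, COM, CUR, LEF, LEFO, PFW) are real VMS scheduling states,
but the event names and the transition table are my own simplification
and an assumption, not something taken from the internals book. In this
model a process cannot go straight from COMO to LEFO: it has to be
swapped in, run, wait on a local event flag, and then be swapped out
again.

```python
# Toy model only: the state names are real VMS scheduling states, but the
# event names and this transition table are illustrative assumptions.
TRANSITIONS = {
    ("COMO", "inswap"): "COM",          # swapper brings the process in
    ("COM", "schedule"): "CUR",         # scheduler picks it to run
    ("COM", "outswap"): "COMO",         # swapped out while computable
    ("CUR", "preempt"): "COM",
    ("CUR", "wait_efn"): "LEF",         # waits on a local event flag
    ("CUR", "page_fault"): "PFW",       # resident page fault wait
    ("PFW", "page_read_done"): "COM",
    ("LEF", "efn_set"): "COM",
    ("LEF", "outswap"): "LEFO",         # swapped out while waiting
    ("LEFO", "efn_set"): "COMO",        # flag set, still outswapped
}


def step(state, event):
    """Return the next state, or raise if the model has no such transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state} on {event}") from None


def run(events, state="COMO"):
    """Apply events in order and return the list of states visited."""
    path = [state]
    for event in events:
        state = step(state, event)
        path.append(state)
    return path


def reachable_in_one_step(state):
    """States the model allows directly from the given state."""
    return sorted({dst for (src, _), dst in TRANSITIONS.items() if src == state})
```

Under these assumptions, run(["inswap", "schedule", "wait_efn",
"outswap"]) walks a freshly created process from COMO to LEFO, which
would fit a LOGINOUT that got far enough to start waiting on something
(an I/O or a lock) before being swapped out again.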

(Maybe I should read ahead before posting, in case someone else
already figured it out :-)

-- 
John Santos
Evans Griffiths & Hart, Inc.