[Info-vax] Production VMS cluster hanging with lots of LEFO

Tue Mar 17 22:32:37 EDT 2009

John Santos wrote:
> 
> In article <00A8879A.EF14B146 at SendSpamHere.ORG>, VAXman-@SendSpamHere.ORG says...
> > In article <MPG.2424a6445c5a927e989687 at news.verizon.net>, John Santos <john at egh.com> writes:
> > >In article <e61cc3be-c732-4ed0-bff2-866c86e22314@e18g2000yqo.googlegroups.com>, heinvandenheuvel at gmail.com says...
> > >> On Mar 13, 5:32 am, filip.debl... at proximus.net wrote:
> > >> > Greetings.
> > >> >
> > >> > Yesterday we had a massive incident on our most important VMS
> > >> > machines.
> > >>
> > >> > OS is VMS 8.3, we run DECNET over IP.
> > >>
> > >> > Fortunately I always keep some sessions open on my station (not part
> > >> > of the cluster), which were still working. The system was NOT down.
> > >>
> > >> > When looking at the first system, I immediately noticed a significant
> > >> > number of LEFO processes, most of them related to individual (DCL) users,
> > >
> > >[snip]
> > >
> > >> > - (not 100% sure of this) most of the LEFO processes were attempts to
> > >> > log in, trying to run LOGINOUT.EXE (just another image ...)
> > >>
> > >> Handwaving here... wild speculation... when a process does not exist
> > >> yet and the first image still has to run but can not get to the disk
> > >> (that serialization lock) would it be Outswapped waiting for the disk,
> > >> considering it is not in yet?
> > >> I should crack open the internal book and remind myself of the states
> > >> in the life of a process.  "in the beginning...". But I don't have
> > >> time for that now.
> > >>
> > >
> > >To elaborate a little on this, since no one else has mentioned this
> > >explicitly...
> > >
> > >(Warning: this is all based on vague and ancient memory, and could
> > >be decades obsolete!)
> > >
> > >I think early on when the job controller (?) creates a new interactive
> > >process, it sets up a process header with appropriate values in the
> > >saved registers and its memory mapped to a swapped-out copy of
> > >LOGINOUT.EXE (i.e. pointing to LOGINOUT's shared sections.)  Then
> > >when the swapper decides to swap it in, it loads the new process's
> > >read-write data from LOGINOUT's writeable, copy-on-reference pages
> > >and the code & read-only data from LOGINOUT's shareable pages and
> > >off it goes.  During the time between it getting created and when
> > >the swapper swaps it in, it's in LEFO state.
> >
> > COMO -> COM
> 
> That makes sense...  But (idle speculation), I wonder if it could
> go from COMO to LEFO if the scheduler decided to page it in and
> got stuck while waiting for the page fault read I/O to page it in?
> Or would that be some sort of page fault MWAIT state?

I'm thinking it would be some kind of page wait state.
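
For reference, here is a rough sketch of the scheduling state names that
SHOW SYSTEM displays, written as a small Python lookup table. The state
abbreviations are the standard ones as I recall them; the descriptions
are my own wording, and the helper names (is_outswapped,
page_wait_states) are made up for illustration. It only catalogs the
names; it does not model how the scheduler moves a process between them.

```python
# Scheduling states as shown by SHOW SYSTEM (descriptions paraphrased
# from memory; check the VMS documentation before relying on them).
STATES = {
    "CUR": "currently executing",
    "COM": "computable, resident",
    "COMO": "computable, outswapped",
    "LEF": "local event flag wait, resident",
    "LEFO": "local event flag wait, outswapped",
    "CEF": "common event flag wait",
    "HIB": "hibernating, resident",
    "HIBO": "hibernating, outswapped",
    "SUSP": "suspended, resident",
    "SUSPO": "suspended, outswapped",
    "PFW": "page fault wait",
    "FPG": "free page wait",
    "COLPG": "collided page wait",
    "MWAIT": "miscellaneous (mutex/resource) wait",
}

# States whose trailing "O" marks the process as outswapped.
OUTSWAPPED = {"COMO", "LEFO", "HIBO", "SUSPO"}


def is_outswapped(state):
    """Return True if the named state is one of the outswapped variants."""
    return state.upper() in OUTSWAPPED


def page_wait_states():
    """Return the state names whose description mentions a page wait."""
    return sorted(name for name, desc in STATES.items() if "page" in desc)


if __name__ == "__main__":
    print(is_outswapped("LEFO"), is_outswapped("LEF"))
    print(page_wait_states())
```

If the guess above is right, the stuck process would show up under one
of the names returned by page_wait_states() rather than as LEFO.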

D.J.D.