[Info-vax] Production VMS cluster hanging with lots of LEFO

Fri Mar 13 09:28:32 EDT 2009

On Mar 13, 5:32 am, filip.debl... at proximus.net wrote:
> Greetings.
>
> Yesterday we had a massive incident on our most important VMS
> machines.

> OS is VMS 8.3, we run DECNET over IP.

> Fortunately I always keep some sessions open on my station (not part
> of the cluster), which were still working. The system was NOT down.

What I am missing in the story so far is any mention of DECamds, the
Availability Manager.
It interacts with the systems at driver level and is typically still
available when normal processes are in trouble.
Get that installed right away, will you!

> When looking at the first system, I immediately remarked a significant
> number of LEFO processes, most of them related to individual (DCL) users,

So what was the memory pressure/availability?
Another reason for LEFO processes USED to be running out of
balance set slots (BALSETCNT).
It may still explain something. Check out MCR SYSGEN HELP SYS BALSET

" BALSETCNT is no longer a strict setting of the number of processes
that might be resident in memory. The swapper tries to reduce the
number of resident processes down to BALSETCNT.
However, if the total number of active processes and processes that
have disabled swapping exceeds BALSETCNT, the swapper does not force
processes out of memory just to meet the BALSETCNT setting."
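
One way to read that help text, as a toy model only: the swapper looks
at how far the resident count is above BALSETCNT, but only processes
that are neither active nor marked no-swap are candidates for
outswapping. The function and parameter names below are made up for
illustration; this is a sketch of the quoted rule, not how the VMS
swapper is actually coded.

```python
def outswap_candidates(resident, active, noswap, balsetcnt):
    """Toy reading of the SYSGEN help text for BALSETCNT.

    resident  -- processes currently resident in memory (assumed to
                 include the active and no-swap ones)
    active    -- resident processes that are actively computing
    noswap    -- resident processes that have disabled swapping
    balsetcnt -- the BALSETCNT parameter, treated as a soft target
    """
    # How many processes the swapper would like to get rid of.
    excess = resident - balsetcnt
    # Assumption: active and no-swap processes are never forced out
    # just to meet BALSETCNT, so only the remainder is eligible.
    eligible = resident - (active + noswap)
    return max(0, min(excess, eligible))


# Plenty of idle processes: the full excess can be outswapped.
print(outswap_candidates(resident=50, active=10, noswap=5, balsetcnt=40))
# Active + no-swap already exceed BALSETCNT: only the idle few qualify.
print(outswap_candidates(resident=50, active=40, noswap=8, balsetcnt=40))
```

Under this reading, a pile of LEFO processes points at processes that
were idle enough to be picked, which is why the memory numbers matter.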

> In a second session (on another machine of the cluster), session got
> iced as well during a DCL command.

The 'easiest' way for processes to lock up big time is a disk
serialization lock.
Running processes continue to read and write data in files they already
have open, but touch the file system with a DIR, and you lose another
process.

That and losing quorum.
Did you lose quorum due to losing network connectivity?
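
For reference, the arithmetic behind that question can be sketched as
follows. The rule QUORUM = (EXPECTED_VOTES + 2) / 2, rounded down, is
the documented cluster formula; the function names and the three-vote
example are illustrative assumptions, not this site's actual
configuration.

```python
def quorum(expected_votes):
    """Cluster quorum derived from EXPECTED_VOTES (integer division)."""
    return (expected_votes + 2) // 2


def has_quorum(visible_votes, expected_votes):
    """True if the votes a node can still see are enough to keep running."""
    return visible_votes >= quorum(expected_votes)


# Hypothetical setup: two nodes plus a quorum node, one vote each.
expected = 3
print(quorum(expected))
# A node cut off from the network sees only its own vote and stalls.
print(has_quorum(1, expected))
# Two members that still see each other keep going.
print(has_quorum(2, expected))
```

A node on the wrong side of a network break does not crash; it
suspends activity until quorum is regained, which would look a lot like
"inaccessible but not dead".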

> - (not 100% sure of this) most of the LEFO processes were attempts to
> login, trying to run LOGINOUT.EXE (just another image ...)

Handwaving here... wild speculation... when a process does not fully
exist yet and the first image still has to run but cannot get to the
disk (that serialization lock), would it be shown as outswapped,
waiting for the disk, considering it is not in memory yet?
I should crack open the internals book and remind myself of the states
in the life of a process. "In the beginning...". But I don't have
time for that now.

> A desperate try to login to console (console monitoring is running on a
> separate node) yielded no success. It appeared that all machines
> (including quorum node) were inaccessible (but not dead !)

Shoulda had the Availability Manager all set up, ready to go!
Just a little extra driver on the VMS systems and you can play and
monitor from Windoze all you like after that!

Cheers,
Hein.