[Info-vax] Production VMS cluster hanging with lots of LEFO
filip.deblock at proximus.net
Fri Mar 13 05:32:41 EDT 2009
Greetings.
Yesterday we had a massive incident on our most important VMS
machines.
Production is configured as a disaster-tolerant cluster containing four
identical midsize Alphas. These are grouped two-by-two into two computer
rooms, separated by more than 25 km. The connection between them is a
four-fold, very high capacity network, which is also shared by a massive
army of UN*X boxes.
A fifth quorum node (small thing, only has to be present) sits in a
third room.
The application running on the cluster is ACMS-driven and quite stable:
everything is installed in memory, it takes up on average at most 10-15%
CPU, and it has memory to burn, so outswapped processes are extremely
rare. This application accesses a monster SYBASE database, which is
running on a UN*X box (did I mention the thing was disaster tolerant? :-( )
The OS is VMS 8.3, and we run DECnet over IP.
The previous night, some "load test" was done on the network. Not a lot
is known about that, but it is believed to have included the links
between the two sites. I was not aware of this being done, and it would
probably have been none of my concern anyway.
Very soon alarms started to come in stating that users could not log in
anymore, neither over the dedicated TCP/IP interfaces (using some
application-to-application mechanism) nor via SET HOST, TELNET, etc.
Fortunately I always keep some sessions open on my station (not part of
the cluster), and these were still working. The system was NOT down.
When looking at the first system, I immediately noticed a significant
number of LEFO processes, most of them belonging to individual (DCL)
users, with close to 0 CPU time and I/O. I also spotted one HIBO (REMACP!).
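(For the record, the quick survey in a situation like this is roughly the
following; MONITOR STATES certainly exists, but the /STATE qualifier on
SHOW SYSTEM is quoted from memory, so treat that one as an assumption,
and the node name is a placeholder:)

  $ MONITOR STATES              ! per-state process counts (LEF, LEFO, HIB, COM, ...)
  $ SHOW SYSTEM/STATE=LEFO      ! list only processes currently in LEFO (qualifier assumed)
  $ SHOW SYSTEM/NODE=OTHERNODE  ! the same picture on another cluster member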
I was able to STOP/ID all the LEFOs (did not touch REMACP), to no
avail.
When trying to find the real identity of a user (MC AUTHORIZE), my
session
froze.
In a second session (on another machine of the cluster), the session got
iced as well during a DCL command.
I got worried.
It seemed that it was not possible to run an image anymore (a lot of DCL
commands do start up an image). Very soon I lost control of _all_
sessions, but before that I was able to notice:
- the cluster was fine (all 5 machines up, all participating with 1 vote)
- there was at least one looping process (this happens all the time, we
simply kill them)
- (not 100% sure of this) most of the LEFO processes were attempts to
log in, trying to run LOGINOUT.EXE (just another image ...)
So SNAFU
It was found out later, through some (external) database monitoring, that
at least one of the looping processes (the image was already running by
the time the problems started) was still doing some DB activity, so the
VMS process was not aware of any problems and happily kept looping.
A desperate attempt to log in to the consoles (console monitoring runs on
a separate node) yielded no success. It appeared that all machines
(including the quorum node) were inaccessible (but not dead!).
Somewhat later it was claimed that the network modifications (?) had been
rolled back. The VMS cluster did not recover by itself.
Finally (we need zillions of authorizations for everything) the quorum
node was crashed.
And I was happily looking at >>>
The first boot failed due to bizarre (and unrelated) problems, but
booting as MIN did work. I was able to log in to the quorum node (via the
console, of course).
Then the miracle happened: all the LEFOs disappeared and the beast went
back to business.
Most processes simply continued from the point where they were blocked,
with no damage (except for part of the application which had timed out;
a simple restart solved this).
Unfortunately I did not check whether the situation had already been
normalised by the crash of the quorum node itself; I only observed 'back
to business' after the minimal reboot.
Now, 24 hours later, things are as normal as always.
A lot of unknowns are still left.
Q: what caused the image activator to go into LEFO (actually, to remain
in LEFO)? At some point during image activation (the last phase?) it
starts waiting for an event flag. What could be setting that event flag?
I suspect it was never set ...
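(What I plan to try next time, before anything gets crashed, is to point
SDA at one of the stuck processes and see exactly what it is waiting for.
Roughly like this; the exact qualifiers are quoted from memory, so
double-check them, and the PID is a placeholder:)

  $ ANALYZE/SYSTEM              ! SDA on the running system
  SDA> SET PROCESS/ID=xxxxxxxx  ! select one of the stalled LEFO processes
  SDA> SHOW PROCESS             ! state, event flag wait mask, current PC
  SDA> SHOW PROCESS/LOCKS       ! any lock requests still pending for this process
  SDA> SHOW PROCESS/CHANNEL     ! open files/devices (the image file being activated?)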
Q: crashing (and rebooting) the quorum node solved things immediately.
Could this be caused by a lock held by the quorum node? If so, is this a
lock that is related to cluster transitions?
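(If it really was a lock, the same SDA session ought to be able to show
it: take a waiting lock id from SHOW PROCESS/LOCKS and look at the
resource to see which node is mastering/holding it. Again a sketch from
memory, and the lock id is a placeholder:)

  SDA> SHOW LOCK xxxxxxxx             ! the lock block: granted/waiting, resource name
  SDA> SHOW RESOURCE/LOCKID=xxxxxxxx  ! the resource and the CSID of the node mastering it

If that CSID had turned out to be the quorum node, that would more or
less answer my own question.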
Q: would we have had the same effect by crashing/rebooting any one of the
other nodes?
And finally:
Can some form of (minor?) network outage trigger events like this?
Any takers?
advTHANKSance