[Info-vax] Production VMS cluster hanging with lots of LEFO

Bob Gezelter gezelter at rlgsc.com
Fri Mar 13 07:46:07 EDT 2009


On Mar 13, 5:32 am, filip.debl... at proximus.net wrote:
> Greetings.
>
> Yesterday we had a massive incident on our most important VMS
> machines.
>
> Production is configured as a disaster-tolerant cluster containing four
> identical midsize Alphas. These are grouped two-by-two into two computer
> rooms, separated by more than 25 km. The connection between them is a
> four-fold, extremely high capacity network, which is also shared by a
> massive army of UN*X boxes.
>
> A fifth quorum node (a small thing, it only has to be present) sits in
> a third room.
>
> The application running on the cluster is ACMS-driven and quite
> stable: everything is installed in memory, it takes at most 10-15% CPU
> on average, and it has memory to burn, so outswapped processes are
> extremely rare. This application accesses a monster SYBASE database,
> which is running on a UN*X box (did I mention the thing was disaster
> tolerant? :-(
>
> The OS is VMS 8.3, and we run DECnet over IP.
>
> The previous night, some "load test" was done on the network. Not a
> lot is known about it, but it is believed to have included the links
> between the two sites. I was not aware this was being done, and it
> would probably have been none of my concern anyway.
>
> Very soon alarms started coming in stating that users could no longer
> log in, neither over the dedicated TCP/IP interfaces (using some
> application-to-application mechanism), nor via SET HOST, TELNET, etc.
>
> Fortunately I always keep some sessions open on my station (not part
> of the cluster), and these were still working. The system was NOT down.
>
> Looking at the first system, I immediately noticed a significant
> number of LEFO processes, most of them belonging to individual (DCL)
> users, with close to 0 CPU time and I/O. I also spotted one HIBO
> (REMACP!). I was able to STOP/ID all the LEFOs (I did not touch
> REMACP), to no avail. When I tried to find the real identity of a
> user (MC AUTHORIZE), my session froze.
>
> In a second session (on another machine of the cluster), the session
> froze as well during a DCL command.
>
> I got worried.
>
> It seemed it was no longer possible to run an image (a lot of DCL
> commands start up an image). Very soon I lost control of _all_
> sessions, but before that I was able to notice:
> - the cluster was fine (all 5 machines up, all participating with 1
>   vote)
> - there was at least one looping process (happens all the time, we
>   simply kill them)
> - (not 100% sure of this) most of the LEFO processes were login
>   attempts, trying to run LOGINOUT.EXE (just another image ...)
>
> So SNAFU
>
> It was found out later, via some (external) database monitoring, that
> at least one of the looping processes (its image was already running
> by the time the problems started) did do some DB activity, so the VMS
> process was not aware of any problem and happily kept looping.
>
> A desperate attempt to log in at the console (console monitoring runs
> on a separate node) yielded no success. It appeared that all machines
> (including the quorum node) were inaccessible (but not dead!).
>
> Somewhat later it was claimed that the network modifications (?) were
> rolled back. The VMS cluster did not recover by itself.
>
> Finally (we need zillions of authorizations for everything) the quorum
> node was crashed.
> And I was happily looking at >>>
>
> The first boot failed due to bizarre (and unrelated) problems, but a
> minimum (MIN) boot did work. I was able to log in to the quorum node
> (via the console, of course).
>
> A miracle happened. All the LEFOs disappeared and the beast went back
> to business. Most processes simply continued from the point where they
> had been blocked, with no damage (except for part of the application,
> which had timed out; a simple restart solved this).
>
> Unfortunately I did not check whether the situation had already been
> normalised by the crash of the quorum node; I only observed 'back to
> business' after the minimal reboot.
>
> Now, 24 hours later, things are as normal as always.
>
> A lot of unknowns are still left.
>
> Q: what caused the image activator to go into LEFO (actually, to
> remain in LEFO)? At some point during image activation (the last
> phase?) it starts waiting for an event flag. What could be setting
> that event flag? I suspect it never came ...
>
> Q: crashing (and rebooting) the quorum node solved things immediately.
> Could this have been caused by a lock held by the quorum node? If so,
> is this a lock related to cluster transitions?
>
> Q: would we have had the same effect by crashing/rebooting any one of
> the other nodes?
>
> And finally :
>
> Can some form of (minor ?) network outage trigger events like this?
>
> Any takers ?
>
> advTHANKSance

Filip,

In essence, I concur with Vaxman. There are few ways to diagnose this
problem without a system dump (or live access to SDA, the utility
invoked by the ANALYZE/SYSTEM command, during such an event).
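
As a rough sketch only (the process index "nn" and lock id "xxxx" below
are placeholders, and the exact qualifiers may differ by version), a
live session during such a hang would pick one of the stuck LEFO
processes and look at what it is waiting on:

    $ ANALYZE/SYSTEM                 ! examine the running system
    SDA> SHOW SUMMARY                ! processes and their scheduling states
    SDA> SET PROCESS/INDEX=nn        ! select one of the LEFO processes
    SDA> SHOW PROCESS                ! PCB contents, including the event flag wait mask
    SDA> SHOW PROCESS/LOCKS          ! locks held or waited for by that process
    SDA> SHOW RESOURCE/LOCKID=xxxx   ! the resource a waiting lock is queued on
    SDA> EXIT

If all the frozen processes turn out to be queued behind the same
resource, that would narrow things down considerably.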

LEF (and LEFO) are completely normal wait states. A SHOW SYSTEM on a
normal day will show many processes in this state. The "O" means that
the process is outswapped. With today's large memories, outswapping is
less often seen, but it too is fairly normal. One will also normally
see many processes in the HIB state; these have executed the $HIBER
system service and are waiting for some event to wake them.
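
For a quick survey from DCL, without SDA, something like the following
works; the /STATE qualifier on SHOW SYSTEM is from memory, so check the
HELP text on your system first:

    $ MONITOR STATES            ! count of processes in each scheduler state
    $ SHOW SYSTEM/STATE=LEFO    ! list only the outswapped local event flag waiters

The state itself tells you little; what matters is whether processes
that ought to be running are piling up in it and never leaving.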

That much is completely normal. The quorum machine crashing may (or may
not) be part of the answer, as may the cluster continuing on its way
afterward. If the quorum machine left a dump when it crashed, analyzing
that dump might indicate why processes were freezing.
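
If that dump exists, the analysis is essentially the same SDA session
run against the file instead of the live system. A minimal sketch,
assuming the default dump file location:

    $ ANALYZE/CRASH SYS$SYSTEM:SYSDUMP.DMP
    SDA> SHOW CRASH             ! bugcheck type and crash-time context
    SDA> SHOW SUMMARY           ! process states at the moment of the crash
    SDA> SET PROCESS/INDEX=nn   ! then SHOW PROCESS/LOCKS, as above

Since the crash was deliberate, the interesting part is not the
bugcheck itself but the lock and connection state the quorum node was
holding at that moment.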

If the cluster communications channel (e.g., the LAN between the
machines) is being disrupted in some way, strange things can happen. In
essence, the connection is presumed to be a simple IEEE 802.3 Ethernet.
If someone were to put routing or traffic shaping devices into that
path, it could produce interesting results. This is particularly true
if some security device is removing packets from the stream for some
reason.
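
If you want to watch the cluster interconnect itself the next time the
network people run a "load test", SHOW CLUSTER in continuous mode is
cheap to leave running; the class names here are from memory:

    $ SHOW CLUSTER/CONTINUOUS
    Command> ADD CIRCUITS
    Command> ADD CONNECTIONS
    Command> ADD COUNTERS

CIRCUITS shows the SCS virtual circuits between the nodes, CONNECTIONS
the SYSAP connections riding on them, and COUNTERS the error and
retransmission counts. Circuits closing and reopening, or counters
climbing, would point at the path between the sites rather than at VMS
itself.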

The crash dumps are vital. Knowledge of precisely what was happening on
the network would also be useful, but it would have to be a complete
list, with no "incidental" changes omitted.

- Bob Gezelter, http://www.rlgsc.com


