[Info-vax] Production VMS cluster hanging with lots of LEFO
H Vlems
hvlems at freenet.de
Fri Mar 13 09:33:09 EDT 2009
On 13 Mar, 10:32, filip.debl... at proximus.net wrote:
> Greetings.
>
> Yesterday we had a massive incident on our most important VMS
> machines.
>
> Production is configured as a disaster-tolerant cluster containing
> four identical midsize Alphas. These are grouped two-by-two into two
> computer rooms, separated by more than 25 km. The connection between
> them is a four-fold, extremely high-capacity network, which is also
> shared by a massive army of UN*X boxes.
>
> A fifth quorum node (a small thing, it only has to be present) sits
> in a third room.
>
> The application running on the cluster is ACMS driven and is quite
> stable: everything is installed in memory, it takes up on average at
> most 10-15% CPU, and it has memory to burn, so outswapped processes
> are extremely rare. This application accesses a monster SYBASE
> database, which is running on a UN*X box (did I mention the thing
> was disaster tolerant? :-(
>
> OS is VMS 8.3; we run DECnet over IP.
>
> The previous night, some "load test" was done on the network. Not a
> lot is known about it, but it is believed to have included the links
> between the two sites. I was not aware of this being done, and it
> would probably have been none of my concern anyway.
>
> Very soon alarms started to come in stating that users could not log
> in anymore, neither over the dedicated TCP/IP interfaces (using some
> application-to-application mechanism), nor via SET HOST, TELNET, etc.
>
> Fortunately I always keep some sessions open on my station (not part
> of the cluster), and these were still working. The system was NOT
> down.
>
> Looking at the first system, I immediately noticed a significant
> number of LEFO processes, most of them belonging to individual (DCL)
> users, with close to 0 CPU time and I/O. I also spotted one HIBO
> (REMACP!). I was able to STOP/ID all the LEFOs (I did not touch
> REMACP), to no avail. When I tried to find the real identity of a
> user (MC AUTHORIZE), my session froze.
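>
> (For anyone retracing this: on VMS 8.3 something like the following
> should reproduce that survey; the PID is of course hypothetical:
>
>   $ SHOW SYSTEM /STATE=LEFO        ! processes in LEF wait, outswapped
>   $ STOP /IDENTIFICATION=2040011A  ! delete one of them by PID
> )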
>
> In a second session (on another machine of the cluster), the session
> froze as well during a DCL command.
>
> I got worried.
>
> It seemed that it was no longer possible to run an image. (A lot of
> DCL commands activate an image.) Very soon I lost control of _all_
> sessions, but before that I was able to notice:
> - the cluster was fine (all 5 machines up, all participating with 1
>   vote each; see the quorum arithmetic below)
> - there was at least one looping process (happens all the time, we
>   simply kill them)
> - (not 100% sure of this) most of the LEFO processes were login
>   attempts, trying to run LOGINOUT.EXE (just another image ...)
>
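> (Quorum arithmetic, for reference: with five members at one vote
> each and no quorum disk, EXPECTED_VOTES = 5 and VMS computes
>
>   quorum = (EXPECTED_VOTES + 2) / 2 = (5 + 2) / 2 = 3   (integer division)
>
> so any three connected voting members keep the cluster alive. All
> five were still voting here, so quorum itself was never lost.)
>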
> So SNAFU
>
> It was found out later, by some (external) database monitoring, that
> at least one of the looping processes (its image was already running
> by the time the problems started) was still doing DB activity, so
> the VMS process was not aware of any problems and happily kept
> looping.
>
> A desperate attempt to log in via the console (console monitoring
> runs on a separate node) yielded no success. It appeared that all
> machines (including the quorum node) were inaccessible (but not
> dead!).
>
> Somewhat later it was claimed that the network modifications (?)
> were rolled back. The VMS cluster did not recover by itself.
>
> Finally (we need zillions of authorizations for everything) the
> quorum node was crashed, and I was happily looking at >>>
>
> The first boot failed due to bizarre (and unrelated) problems, but a
> minimal (MIN) boot did work. I was able to log in to the quorum node
> (via the console, of course).
>
> A miracle happened. All LEFOs disappeared and the beast went back to
> business. Most processes simply continued from the point where they
> had been blocked, with no damage (except for a part of the
> application that had timed out; a simple restart solved that).
>
> Unfortunately I did not check whether the situation had already been
> normalised by the crash of the quorum node; I only observed 'back to
> business' after the minimal reboot.
>
> Now, 24 hours later, things are as normal as always.
>
> A lot of unknowns are still left.
>
> Q: what caused the image activator to go into LEFO (actually, to
> remain in LEFO)? At some point during image activation (the last
> phase?) it starts waiting for an event flag. What should have set
> that event flag? I suspect it never was set ...
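>
> (Next time, SDA on a live member should show what such a process is
> actually waiting for; a minimal sketch, again with a hypothetical
> PID:
>
>   $ ANALYZE /SYSTEM
>   SDA> SET PROCESS /ID=2040011A   ! pick one of the LEFO processes
>   SDA> SHOW PROCESS               ! state and event flag wait mask
>   SDA> SHOW PROCESS /LOCKS        ! locks held and waited for
> )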
>
> Q: crashing (and rebooting) the quorum node solved things
> immediately. Could this be caused by a lock held by the quorum node?
> If so, is this a lock related to cluster transitions?
>
> Q: would we have had the same effect by crashing/rebooting any one
> of the other nodes?
>
> And finally:
>
> Can some form of (minor?) network outage trigger events like this?
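>
> (For context: while cluster communication over a link is disturbed
> but no member has yet been removed, the connection manager stalls
> cluster-wide traffic - including the lock requests on which image
> activation and logins depend - for up to RECNXINTERVAL seconds. The
> value can be checked with, e.g.:
>
>   $ MCR SYSGEN
>   SYSGEN> SHOW RECNXINTERVAL   ! reconnection interval, in seconds
> )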
>
> Any takers?
>
> advTHANKSance
Filip,
one way to get into a situation like this is when (a) you have a
quorum node in a cluster and (b) it is connected to a low-speed link -
low speed compared to the links between the other nodes in the
cluster.
Is there a quorum disk defined on one or more nodes in the cluster?
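A quick way to check that on each member (a blank DISK_QUORUM means
no quorum disk is configured):

  $ MCR SYSGEN
  SYSGEN> SHOW DISK_QUORUM   ! device name of the quorum disk, if any
  SYSGEN> SHOW QDSKVOTES     ! votes contributed by the quorum disk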
Hans