[Info-vax] Production VMS cluster hanging with lots of LEFO
Richard B. Gilbert
rgilbert88 at comcast.net
Fri Mar 13 17:41:59 EDT 2009
VAXman- @SendSpamHere.ORG wrote:
> In article <0003343f$0$2186$c3e8da3 at news.astraweb.com>, JF Mezei <jfmezei.spamnot at vaxination.ca> writes:
>> VAXman- @SendSpamHere.ORG wrote:
>>
>>> I thought he said he DID crash them.
>>>
>> seems like he crashed only the quorum node.
>>
>> And *apparently* it is when he rebooted the quorum node that all
>> magically came back to normal on the production node.
>
> OK. I read through it quickly this morning and it was formatted rather
> ugly in my TEXT reader for pre-3rd pot of coffee reading.
>
Here it is, all nicely formatted, spelling errors corrected, etc.
filip.deblock at proximus.net wrote:
> Greetings.
>
> Yesterday we had a massive incident on our most important VMS
> machines.
>
> Production is configured as a disaster tolerant cluster containing
> four identical midsize Alphas. These are grouped two-by-two into two
> computer rooms, separated by more than 25 km. The connection between
> them is a four-fold, extremely high capacity network, which is also
> shared by a massive army of UN*X boxes.
>
> A fifth quorum node (small thing, only has to be present) sits in a
> third room.
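
For those keeping score on the quorum arithmetic: assuming all five
members carry one vote and EXPECTED_VOTES is set to 5, the connection
manager computes

    quorum = (EXPECTED_VOTES + 2) / 2 = (5 + 2) / 2 = 3   (integer division)

so the cluster keeps quorum as long as any three voting members can
still see each other.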
>
> The application that is running on the cluster is ACMS driven and is
> quite stable: everything is installed in memory, it takes up on
> average at most 10-15% CPU, and it has memory to burn, so outswapped
> processes are extremely rare. This application accesses a monster
> SYBASE database, which is running on a UN*X box (did I mention the
> thing was disaster tolerant? :-( )
>
> OS is VMS 8.3, we run DECNET over IP.
>
> The previous night, some "load test" was done on the network. Not a
> lot is known about it, but it is believed to have included the links
> between the two sites. I was not aware of this being done, and it
> would probably have been none of my concern anyway.
>
> Very soon alarms started to come in stating that users could not log
> in anymore, neither over the dedicated TCP/IP interfaces (using some
> application-to-application mechanism), nor via SET HOST, TELNET, etc.
>
> Fortunately I always keep some sessions open on my station (not part
> of the cluster), and those were still working. The system was NOT down.
>
> When looking at the first system, I immediately noticed a significant
> number of LEFO processes, most of them belonging to individual (DCL)
> users, with close to 0 CPU time and I/O. I also spotted one HIBO
> (REMACP!). I was able to STOP/ID all the LEFOs (did not touch
> REMACP), to no avail. When trying to find the real identity of a user
> (MC AUTHORIZE), my session froze.
>
> In a second session (on another machine of the cluster), the session
> got iced as well during a DCL command.
>
> I got worried.
>
> It seemed that it was not possible to run an image anymore (a lot of
> DCL commands start up an image). Very soon I lost control of _all_
> sessions, but before that I was able to notice (see the note below):
>
> - the cluster was fine (all 5 machines up, all participating with 1 vote)
> - there was at least one looping process (happens all the time, we
>   simply kill them)
> - (not 100% sure of this) most of the LEFO processes were attempts to
>   login, trying to run LOGINOUT.EXE (just another image ...)
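
A side note on the "a lot of DCL commands start up an image" remark:
the distinction matters in a hang like this. Commands that DCL handles
internally should keep responding, while anything that activates an
image joins the LEFO pile. Roughly (my own illustration, not from the
original report):

    $ SHOW SYMBOL *           ! internal to DCL - should still answer
    $ STOP /ID=2040011C       ! internal to DCL - why STOP/ID still worked
    $ MC AUTHORIZE            ! activates AUTHORIZE.EXE - froze
    $ SET HOST other-node     ! activates an image - hence the dead logins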
>
> So SNAFU
>
> It was found out later, by some (external) database monitoring, that
> at least one of the looping processes (the image was already running
> by the time the problems started) did do some DB activity, so the VMS
> process was not aware of any problems and happily kept looping.
>
> A desperate try to log in via the consoles (console monitoring runs
> on a separate node) yielded no success. It appeared that all machines
> (including the quorum node) were inaccessible (but not dead !)
>
> Somewhat later it was claimed that the network modifications (?) had
> been rolled back. The VMS cluster did not recover by itself.
>
> Finally (we need zillions of authorizations for everything) the
> quorum node was crashed. And I was happily looking at >>>
>
> The first boot failed due to bizarre (and unrelated) problems, but
> booting MIN did work. I was able to log into the quorum node (via the
> console of course).
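
For the record, "booting MIN" is the usual conversational boot dance at
the console; a sketch, with DKA0 standing in for whatever the real
system disk is:

    >>> BOOT -FLAGS 0,1 DKA0
    SYSBOOT> SET STARTUP_P1 "MIN"
    SYSBOOT> CONTINUE

The 0,1 flags request a conversational boot, STARTUP_P1 "MIN" runs a
minimal startup, and CONTINUE lets the boot proceed. Remember to set
STARTUP_P1 back to "" (SYSGEN or SYSMAN) afterwards so the next boot is
a normal one.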
>
> A miracle happened. All the LEFOs disappeared and the beast went back
> to business. Most processes simply continued from the point where
> they had been blocked, no damage (except for the part of the
> application which had timed out; a simple restart solved that).
>
> Unfortunately I did not check whether the situation had already
> normalised because of the crash of the quorum node; I only observed
> 'back to business' after the minimal reboot.
>
> Now, 24 hours later, things are as normal as always.
>
> A lot of unknowns are still left.
>
> Q: what caused the image activator to go into LEFO (actually, to
> remain in LEFO)? At some point during image activation (the last
> phase ?) it starts waiting for an event flag. What could be setting
> that event flag ? I am suspecting it never came ...
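
One way to chase that next time is SDA against the running system (a
sketch; it needs CMKRNL, and the PID is again just an example):

    $ ANALYZE /SYSTEM
    SDA> SET PROCESS /ID=2040011C
    SDA> SHOW PROCESS
    SDA> SHOW PROCESS /CHANNEL
    SDA> SHOW PROCESS /LOCKS

SHOW PROCESS includes the event flag wait mask, so you can at least see
which flag the process is parked on; /CHANNEL and /LOCKS then hint at
whether it is image activation I/O or a lock that never completed.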
>
> Q: crashing (and rebooting) the quorum node solved things
> immediately. Could this have been caused by a lock held by the quorum
> node? If so, is it a lock that is related to cluster transitions?
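
If it was indeed a lock, SDA can usually show the chain; a sketch, with
made-up IDs that you would take from SHOW PROCESS /LOCKS:

    $ ANALYZE /SYSTEM
    SDA> SET PROCESS /ID=2040011C
    SDA> SHOW PROCESS /LOCKS
    SDA> SHOW LOCK 0300ABCD
    SDA> SHOW RESOURCE /LOCKID=0300ABCD

SHOW LOCK and SHOW RESOURCE /LOCKID show the resource the stuck lock is
queued on and who holds the granted lock; if the resource is mastered
on another node (the quorum node, perhaps), you have to continue the
chase in SDA over there.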
>
> Q: would we have had the same effect by crashing/rebooting any one of
> the other nodes?
>
> And finally :
>
> Can some form of (minor ?) network outage trigger events like this?
>
> Any takers ?
>
> advTHANKSance