[Info-vax] Production VMS cluster hanging with lots of LEFO
Richard B. Gilbert
rgilbert88 at comcast.net
Fri Mar 13 17:41:59 EDT 2009
VAXman- @SendSpamHere.ORG wrote:
> In article <0003343f$0$2186$c3e8da3 at news.astraweb.com>, JF Mezei <jfmezei.spamnot at vaxination.ca> writes:
>> VAXman- @SendSpamHere.ORG wrote:
>>
>>> I thought he said he DID crash them.
>>>
>> seems like he crashed only the quorum node.
>>
>> And *apparently* it is when he rebooted the quorum node that all
>> magically came back to normal on the production node.
>
> OK. I read through it quickly this morning and it was formatted rather
> ugly in my TEXT reader for pre-3rd pot of coffee reading.
>
Here it is, all nicely formatted, spelling errors corrected, etc.
filip.deblock at proximus.net wrote:
> Greetings.
>
> Yesterday we had a massive incident on our most important VMS
> machines.
>
> Production is configured as a disaster tolerant cluster containing
> four identical midsize Alphas. These are grouped two-by-two into two
> computer rooms, separated by more than 25 km. The connection between
> them is a four-fold, extremely high capacity network, which is also
> shared by a massive army of UN*X boxes.
>
> A fifth quorum node (small thing, only has to be present) sits in a
> third room.
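
For those keeping score on the quorum arithmetic: assuming all five
members carry one vote and EXPECTED_VOTES is set to 5, the connection
manager computes

    quorum = (EXPECTED_VOTES + 2) / 2 = (5 + 2) / 2 = 3   (integer division)

so the cluster keeps quorum as long as any three voting members can
still see each other.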
>
> The application that is running on the cluster is ACMS driven and is
> quite stable: everything is installed in memory, it takes up on
> average at most 10-15% CPU, and it has memory to burn, so outswapped
> processes are extremely rare. This application accesses a monster
> SYBASE database, which is running on a UN*X box (did I mention the
> thing was disaster tolerant? :-( )
>
> OS is VMS 8.3, we run DECNET over IP.
>
> The previous night, some "load test" was done on the network. Not a
> lot is known about it, but it is believed to have included the links
> between the two sites. I was not aware of this being done, and it
> would probably have been none of my concern anyway.
>
> Very soon alarms started to come in stating that users could not log
> in anymore, neither over the dedicated TCP/IP interfaces (using some
> application-to-application mechanism), nor via SET HOST, TELNET, etc.
>
> Fortunately I always keep some sessions open on my station (not part
> of the cluster), and those were still working. The system was NOT down.
>
> When looking at the first system, I immediately noticed a significant
> number of LEFO processes, most of them belonging to individual (DCL)
> users, with close to 0 CPU time and I/O. I also spotted one HIBO
> (REMACP!). I was able to STOP/ID all the LEFOs (did not touch
> REMACP), to no avail. When trying to find the real identity of a user
> (MC AUTHORIZE), my session froze.
>
> In a second session (on another machine of the cluster), the session
> got iced as well during a DCL command.
>
> I got worried.
>
> It seemed that it was not possible to run an image anymore (a lot of
> DCL commands start up an image). Very soon I lost control of _all_
> sessions, but before that I was able to notice (see the note below):
>
> - the cluster was fine (all 5 machines up, all participating with 1 vote)
> - there was at least one looping process (happens all the time, we
>   simply kill them)
> - (not 100% sure of this) most of the LEFO processes were attempts to
>   login, trying to run LOGINOUT.EXE (just another image ...)
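
A side note on the "a lot of DCL commands start up an image" remark:
the distinction matters in a hang like this. Commands that DCL handles
internally should keep responding, while anything that activates an
image joins the LEFO pile. Roughly (my own illustration, not from the
original report):

    $ SHOW SYMBOL *           ! internal to DCL - should still answer
    $ STOP /ID=2040011C       ! internal to DCL - why STOP/ID still worked
    $ MC AUTHORIZE            ! activates AUTHORIZE.EXE - froze
    $ SET HOST other-node     ! activates an image - hence the dead logins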
>
> So SNAFU
>
> It was found out later, by some (external) database monitoring, that
> at least one of the looping processes (the image was already running
> by the time the problems started) did do some DB activity, so the VMS
> process was not aware of any problems and happily kept looping.
>
> A desperate try to log in via the consoles (console monitoring runs
> on a separate node) yielded no success. It appeared that all machines
> (including the quorum node) were inaccessible (but not dead !)
>
> Somewhat later it was claimed that the network modifications (?) had
> been rolled back. The VMS cluster did not recover by itself.
>
> Finally (we need zillions of authorizations for everything) the
> quorum node was crashed. And I was happily looking at >>>
>
> The first boot failed due to bizarre (and unrelated) problems, but
> booting MIN did work. I was able to log into the quorum node (via the
> console of course).
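
For the record, "booting MIN" is the usual conversational boot dance at
the console; a sketch, with DKA0 standing in for whatever the real
system disk is:

    >>> BOOT -FLAGS 0,1 DKA0
    SYSBOOT> SET STARTUP_P1 "MIN"
    SYSBOOT> CONTINUE

The 0,1 flags request a conversational boot, STARTUP_P1 "MIN" runs a
minimal startup, and CONTINUE lets the boot proceed. Remember to set
STARTUP_P1 back to "" (SYSGEN or SYSMAN) afterwards so the next boot is
a normal one.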
>
> A miracle happened. All the LEFOs disappeared and the beast went back
> to business. Most processes simply continued from the point where
> they had been blocked, no damage (except for the part of the
> application which had timed out; a simple restart solved that).
>
> Unfortunately I did not check whether the situation had already
> normalised because of the crash of the quorum node; I only observed
> 'back to business' after the minimal reboot.
>
> Now, 24 hours later, things are as normal as always.
>
> A lot of unknowns are still left.
>
> Q: what caused the image activator to go into LEFO (actually, to
> remain in LEFO)? At some point during image activation (the last
> phase ?) it starts waiting for an event flag. What could be setting
> that event flag ? I am suspecting it never came ...
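
One way to chase that next time is SDA against the running system (a
sketch; it needs CMKRNL, and the PID is again just an example):

    $ ANALYZE /SYSTEM
    SDA> SET PROCESS /ID=2040011C
    SDA> SHOW PROCESS
    SDA> SHOW PROCESS /CHANNEL
    SDA> SHOW PROCESS /LOCKS

SHOW PROCESS includes the event flag wait mask, so you can at least see
which flag the process is parked on; /CHANNEL and /LOCKS then hint at
whether it is image activation I/O or a lock that never completed.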
>
> Q: crashing (and rebooting) the quorum node solved things
> immediately. Could this have been caused by a lock held by the quorum
> node? If so, is it a lock that is related to cluster transitions?
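
If it was indeed a lock, SDA can usually show the chain; a sketch, with
made-up IDs that you would take from SHOW PROCESS /LOCKS:

    $ ANALYZE /SYSTEM
    SDA> SET PROCESS /ID=2040011C
    SDA> SHOW PROCESS /LOCKS
    SDA> SHOW LOCK 0300ABCD
    SDA> SHOW RESOURCE /LOCKID=0300ABCD

SHOW LOCK and SHOW RESOURCE /LOCKID show the resource the stuck lock is
queued on and who holds the granted lock; if the resource is mastered
on another node (the quorum node, perhaps), you have to continue the
chase in SDA over there.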
>
> Q: would we have had the same effect by crashing/rebooting any one of
> the other nodes?
>
> And finally :
>
> Can some form of (minor ?) network outage trigger events like this?
>
> Any takers ?
>
> advTHANKSance