[Info-vax] Production VMS cluster hanging with lots of LEFO
Main, Kerry
Kerry.Main at hp.com
Fri Mar 13 10:25:28 EDT 2009
> -----Original Message-----
> From: info-vax-bounces at rbnsn.com [mailto:info-vax-bounces at rbnsn.com] On
> Behalf Of Bob Gezelter
> Sent: March 13, 2009 7:46 AM
> To: info-vax at rbnsn.com
> Subject: Re: [Info-vax] Production VMS cluster hanging with lots of LEFO
>
> On Mar 13, 5:32 am, filip.debl... at proximus.net wrote:
> > Greetings.
> >
> > Yesterday we had a massive incident on our most important VMS machines.
> >
> > Production is configured as a disaster-tolerant cluster containing four
> > identical midsize Alphas. These are grouped two-by-two into two computer
> > rooms, separated by more than 25 km. The connection between them is a
> > four-fold, extremely high-capacity network, which is also shared by a
> > massive army of UN*X boxes.
> >
> > A fifth quorum node (small thing, only has to be present) sits in a
> > third room.
> >
> > The application that is running on the cluster is ACMS driven and is
> > quite stable: everything is installed in memory, it takes up on average
> > max 10-15% CPU, and it has memory to burn, so outswapped processes are
> > extremely rare. This application accesses a monster SYBASE database,
> > which is running on a UN*X box (did I mention the thing was disaster
> > tolerant? :-(
> >
> > OS is VMS 8.3, we run DECNET over IP.
> >
> > The previous night, some "load test" was done on the network. Not a lot
> > is known about that, but it is believed it included the links between
> > the two sites. I was not aware of this being done, and it would probably
> > have been none of my concern.
> >
> > Very soon alarms started to come in stating that users could not log in
> > anymore, neither over the dedicated TCP/IP interfaces (using some
> > application-to-application mechanism) nor via SET HOST, TELNET, etc.
> >
[snip...]
> _______________________________________________
An absolute must for situations like this is to have the OpenVMS Availability
Manager running. This is exactly the scenario it is designed to troubleshoot.
Because its driver runs at a high IPL, it can still access the servers
when everything else (including the console) is hung.
It also gives a very good view of active locking activity, and it can not
only monitor and display active process quotas but also adjust them
dynamically.
Reference:
http://h71000.www7.hp.com/openvms/products/availman/index.html
Regards
Kerry Main
Senior Consultant
HP Services Canada
Voice: 613-254-8911
Fax: 613-591-4477
kerryDOTmainAThpDOTcom
(remove the DOT's and AT)
OpenVMS - the secure, multi-site OS that just works.