[Info-vax] Production VMS cluster hanging with lots of LEFO

JF Mezei jfmezei.spamnot at vaxination.ca
Fri Mar 13 11:52:23 EDT 2009


If you don't normally have processes that are outswapped, then seeing
many LEFOs is an indication that some process(es) are consuming an
exorbitant amount of memory.

SHOW SYSTEM is a low-level command that can usually still run, and you
can use SHOW SYSTEM/NODE across a cluster even when you can no longer
log in to the problem node.

In such events, looking at the working set size of the processes can
provide insight into the culprit.
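
As a rough sketch of that check (PRODN1 is a placeholder node name, and
the qualifiers available depend on your VMS version, so check
HELP SHOW SYSTEM first):

```
$ ! List only the processes that are outswapped in LEF state
$ SHOW SYSTEM/STATE=LEFO
$ ! Look at the problem node from a healthy cluster member;
$ ! /FULL should add the UIC and working set size for each process
$ SHOW SYSTEM/NODE=PRODN1/FULL
```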

SHOW MEMORY also shows you how much page file space is remaining. When
free page file space gets near zero, you get the situation you
described (again, usually because of abnormal memory consumption by
one or more processes).
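
For example, something along these lines (a sketch only; the layout of
the output varies between VMS versions):

```
$ ! Physical memory, process slots, pools and page/swap file usage
$ SHOW MEMORY
$ ! Only the page and swap files, with their free and total space
$ SHOW MEMORY/FILES
```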


Note that SCS (the clustering protocol) is very time sensitive, so if
OPCOM didn't show any loss of connection to other nodes, the "network
tests" wouldn't have affected that Ethernet traffic.

I encountered a similar event with the IMAP server. It hung for some
reason. Then the client tried to connect again, and a new IMAP server
was created. It consumed all its allowed memory and hung (repeat the
process until all memory and page file space are used).

Perhaps your database application couldn't communicate with the remote
server and queued all transactions in memory. Perhaps it declared itself
dead and a new instance was created, but the old one remained around.
Loop this and you would get your situation.

Are you absolutely sure that rebooting the quorum node is what fixed the
problem?

If communications with the remote Unix server were re-established,
perhaps all outstanding I/Os were able to complete, proper process
rundown happened, and memory was released, which would bring your system
back to normal.


