[Info-vax] Current VMS engineering quality, was: Re: What's VMS up to these

Tue Mar 13 19:02:05 EDT 2012

Michael Kraemer wrote:

> Crashing an entire workstation cluster due to some network problem
> can hardly be called "the right thing".

Actually, it unfortunately is.  If a network problem results in possible
cluster partitioning, or in nodes that hung on loss of quorum and whose
view of the cluster has gone stale (no updates to locks, logical names
and all those shared structures), then the best way is to simply force
such a node to reboot to ensure its stale data is not used.
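
To make "loss of quorum" concrete, here is a rough sketch of the check.
The formula is the usual one derived from EXPECTED_VOTES; the three-node,
one-vote-each cluster and the function names are made up purely for
illustration and are not VMS code.

```python
def quorum(expected_votes: int) -> int:
    """Quorum derived from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def has_quorum(visible_votes: int, expected_votes: int) -> bool:
    """A partition keeps running only if the votes it can see reach quorum."""
    return visible_votes >= quorum(expected_votes)


# Hypothetical cluster: three nodes, one vote each.
EXPECTED_VOTES = 3

# The link breaks: one node is cut off, the other two still see each other.
isolated_side = has_quorum(visible_votes=1, expected_votes=EXPECTED_VOTES)
surviving_side = has_quorum(visible_votes=2, expected_votes=EXPECTED_VOTES)

print(isolated_side, surviving_side)  # False True
```

The isolated node sees only its own vote, so it hangs; the other two
still make quorum and carry on without it.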

Say you had an application running on node X. It held a lock on a
remote file as well as on a local one.

When the link is broken, node X freezes due to loss of quorum. The rest
of the cluster will eventually kick that node out after the
recnxinterval timeout. When this happens, the surviving cluster will zap
all locks that X had.

Now, node X still thinks that it has a lock on both the local and remote
files. The rest of the cluster sees X as having no locks.

If you allowed X to rejoin the cluster, you couldn't merge the two lock
tables because, meanwhile, an application on node Y might have taken a
lock and started to use that remote file (since that file became
lockable once node X was declared lost).
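
The scenario above can be played out with a toy model. This is only a
sketch: the lock tables are plain dictionaries, and the resource names
and the detect_conflicts helper are invented for the example. The real
distributed lock manager is far more involved.

```python
def detect_conflicts(stale_view, cluster_view):
    """Return resources that both views consider locked, but by different owners."""
    return sorted(
        res
        for res, owner in stale_view.items()
        if res in cluster_view and cluster_view[res] != owner
    )


# Before the link breaks, everyone agrees: node X holds both locks.
cluster_view = {"local_file": "X", "remote_file": "X"}
x_view = dict(cluster_view)  # node X's copy, frozen when it lost quorum

# After the reconnection timeout, the survivors remove X and zap its locks.
cluster_view = {res: owner for res, owner in cluster_view.items() if owner != "X"}

# An application on node Y takes the lock that just became available.
cluster_view["remote_file"] = "Y"

# If X were allowed back in with its stale table, the two views would clash.
conflicts = detect_conflicts(x_view, cluster_view)
print(conflicts)  # ['remote_file']
```

Both X and Y believe they own remote_file, and nothing in either table
can say whose writes should win. Rebooting X throws the stale table away
instead of trying to reconcile it.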

This is where the real VMS engineering did shine. They made unpopular
decisions (such as forcing a crash, or the much hated RWAST and RWMBX
states) because they did spend the time to think about all the
implications and saw possibilities where there would be corruption and
made damned sure that it wouldn't happen.