[Info-vax] Current VMS engineering quality, was: Re: What's VMS up to these

Keith Parris keithparris_deletethis at yahoo.com
Tue Mar 13 11:09:59 EDT 2012


On 3/12/2012 10:30 AM, Michael Kraemer wrote:
> In article <fgi139-ui41.ln1 at news.sture.ch>, Paul Sture <paul at sture.ch> writes:
>> On Mon, 12 Mar 2012 14:53:05 +0000, VAXman- wrote:
>>
>> Well, er, back when everyone in my office had both an Alpha PWS and a PC
>> on their desk, we also suffered a lot of network problems.  All those
>> Alphas were booted as cluster satellites, network problems would bring
>> them back to the boot screen, and that screen colour was blue.
>
> Welcome to the club, I know this all too well.
> And yet people still talk about a "rock solid" OS ...
>
>> Our solitary Unix expert used to goad us with cries of "Blue Screen of
>> Death" when that happened.
>
> It's called ... wait ... "Affinity", ISTR,
> Weendoze boxes and Alphas both crashing the same way.

It is normal and appropriate to see a crash of a VMS box under these 
circumstances (a too-long temporary loss of network connectivity, 
resulting in a CLUEXIT bugcheck). When a VMS cluster member loses 
connectivity with the rest of the cluster for more than RECNXINTERVAL 
seconds, and the rest of the cluster retains quorum, the remaining nodes 
continue on after a cluster state transition, during which they discard 
any locks the unreachable node may have held. When network connectivity 
is restored, the node discovers that it has been removed from the 
cluster and that any locks it holds are no longer valid. It cannot 
continue from where it left off, so it must reboot to rejoin the 
cluster.
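
For anyone curious about the knob involved: RECNXINTERVAL is a dynamic 
SYSGEN parameter, so it can be examined, and raised if your LAN is prone 
to brief outages, without a reboot. A minimal sketch from a suitably 
privileged account (the 60 below is purely an illustrative value, not a 
recommendation):

    $ MCR SYSGEN
    SYSGEN> SHOW RECNXINTERVAL    ! seconds of lost connectivity tolerated
    SYSGEN> SET RECNXINTERVAL 60  ! illustration only; pick a value for your network
    SYSGEN> WRITE ACTIVE          ! dynamic parameter, takes effect immediately
    SYSGEN> EXIT

(You'd also want the setting in MODPARAMS.DAT so a later AUTOGEN doesn't 
undo it.) The tradeoff is obvious: the longer the interval, the longer 
the whole cluster waits before declaring a node dead when it really is 
gone.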

This does not represent a bug or problem in the OS; it is the 
appropriate reaction to a lengthy problem in the network. VMS is doing 
exactly the right thing under the circumstances to protect the data.

Another important distinction is that a VMS cluster node is smart enough 
that when it loses connectivity with the rest of the cluster, it 
voluntarily keeps its mitts off the shared resources like disks, to 
avoid corruption due to uncoordinated access. In a Linux cluster, nodes 
aren't that smart and the rest of the cluster has to try to forcibly 
"fence" the node off from the shared resources by powering it off, 
disabling its SAN and/or network ports, etc. If the fencing operation 
fails, shared resources could be corrupted. If two nodes (or two subsets 
of the cluster) try to fence each other off, both might go down at once.
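
For contrast, here is what the forcible approach looks like on the Linux 
side. In a Pacemaker-based cluster, fencing is delegated to a STONITH 
agent, and the operation the cluster attempts automatically can also be 
requested by hand; a minimal sketch, assuming the pcs tooling and a 
hypothetical node name:

    # pcs stonith fence node2

If the fence device itself is unreachable, which is exactly the failure 
case above, the operation fails, and the cluster cannot safely let 
anything else take over the dead node's resources.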

Another advantage of the VMS approach is that if a node or a subset of 
nodes loses quorum, while they keep their hands off the shared resources 
they also retain all the context they had. So if it turns out they are 
the only surviving subset of the cluster, manual intervention, in the 
form of a few commands at the console or a few clicks in the 
Availability Manager GUI, lets them continue on from right where they 
left off. In a Linux (or Serviceguard) cluster, once a node loses quorum 
it must always reboot to rejoin the cluster.
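
To put numbers on "loses quorum": VMS computes quorum as 
(EXPECTED_VOTES + 2)/2, truncated to an integer. A worked example, for a 
hypothetical three-node cluster with one vote per node, so 
EXPECTED_VOTES = 3 and quorum = (3 + 2)/2 = 2:

    Reachable nodes   Votes present   Versus quorum   Result
    ---------------   -------------   -------------   ------
          3                 3            3 >= 2       normal operation
          2                 2            2 >= 2       normal operation; isolated node removed
          1                 1            1 <  2       pauses with context intact

The paused lone survivor is the recovery case described above: a quorum 
recalculation requested at the console (or the quorum-adjustment fix in 
Availability Manager) makes it recompute quorum from the votes actually 
present, and it picks up right where it paused.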


