[Info-vax] Current VMS engineering quality, was: Re: What's VMS up to these
Keith Parris
keithparris_deletethis at yahoo.com
Tue Mar 13 11:09:59 EDT 2012
On 3/12/2012 10:30 AM, Michael Kraemer wrote:
> In article<fgi139-ui41.ln1 at news.sture.ch>, Paul Sture<paul at sture.ch> writes:
>> On Mon, 12 Mar 2012 14:53:05 +0000, VAXman- wrote:
>>
>> Well, er, back when everyone in my office had both an Alpha PWS and a PC
>> on their desk, we also suffered a lot of network problems. All those
>> Alphas were booted as cluster satellites, network problems would bring
>> them back to the boot screen, and that screen colour was blue.
>
> Welcome to the club, I know this all too well.
> And yet people still talk about a "rock solid" OS ...
>
>> Our solitary Unix expert used to goad us with cries of "Blue Screen of
>> Death" when that happened.
>
> It's called ... wait ... "Affinity", ISTR,
> Weendoze boxes and Alphas both crashing the same way.
It is normal and appropriate to see a crash of a VMS box under these
circumstances (too-long temporary loss of network connectivity resulting
in a CLUEXIT bugcheck). When a node is a member of a VMS cluster and it
loses connectivity with the rest of the cluster for more than
RECNXINTERVAL seconds, and the rest of the cluster retains quorum, the
rest of the cluster continues on after a cluster state transition,
during which they discard any locks the unreachable node may have held.
When the node has network connectivity restored, it discovers that it
has been removed from the cluster and that any locks it holds are no
longer valid. It cannot safely continue from where it left off, so it
must reboot to rejoin the cluster.
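The decision logic described above can be sketched roughly as follows. This is an illustrative model in Python, not VMS source; only the name RECNXINTERVAL comes from the actual SYSGEN parameter, and the function names, return strings, and the 20-second value are hypothetical:

```python
# Illustrative sketch (not VMS code) of the reconnection-interval logic.
# RECNXINTERVAL mirrors the VMS SYSGEN parameter name; everything else
# here is a made-up model for explanation only.

RECNXINTERVAL = 20  # seconds a node may be unreachable before removal

def surviving_partition_action(unreachable_for, has_quorum):
    """What the still-connected partition does about an unreachable node."""
    if unreachable_for <= RECNXINTERVAL:
        return "wait"          # transient glitch: keep waiting
    if has_quorum:
        return "remove-node"   # cluster state transition; discard its locks
    return "suspend"           # no quorum: hang, touch no shared resources

def returning_node_action(removed_from_cluster):
    """What a node does when its connectivity comes back."""
    # Its locks were discarded, so it cannot safely resume where it left
    # off; on real VMS this is the CLUEXIT bugcheck followed by a reboot.
    return "reboot-and-rejoin" if removed_from_cluster else "continue"
```

The key point the sketch captures is that both sides act conservatively: the survivors wait out RECNXINTERVAL before discarding the node's locks, and the removed node never resumes with stale locks.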
This does not represent a bug or problem in the OS; it is the
appropriate reaction to a lengthy problem in the network. VMS is doing
exactly the right thing under the circumstances to protect the data.
Another important distinction is that a VMS cluster node is smart enough
that when it loses connectivity with the rest of the cluster, it
voluntarily keeps its mitts off the shared resources like disks, to
avoid corruption due to uncoordinated access. In a Linux cluster, nodes
aren't that smart and the rest of the cluster has to try to forcibly
"fence" the node off from the shared resources by powering it off,
disabling its SAN and/or network ports, etc. If the fencing operation
fails, shared resources could be corrupted. If two nodes (or two subsets
of the cluster) try to fence each other off, both might go down at once.
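The fencing escalation above can be sketched like this. Again this is a toy Python model, not code from any real cluster manager; the agent names and node representation are hypothetical:

```python
# Illustrative sketch (not real cluster-manager code) of fencing:
# the surviving partition tries each fencing agent in turn until one
# succeeds in isolating the unreachable node from shared storage.

def fence(node, agents):
    """Try each fencing agent; return True once the node is isolated."""
    for agent in agents:       # e.g. power-off, then SAN-port disable
        if agent(node):
            return True        # node is cut off; safe to recover resources
    return False               # all agents failed: shared data is at risk

# Hypothetical agents keyed off a made-up node dictionary:
power_off = lambda node: node.get("power_fencing_works", False)
disable_san_port = lambda node: node.get("san_fencing_works", False)
```

The contrast with VMS is visible in the failure branch: if every agent returns False, the surviving nodes cannot safely proceed, whereas a VMS node in the same situation has already taken itself off the shared resources voluntarily.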
Another advantage of the VMS approach is that when a node or a subset of
nodes loses quorum, it keeps its hands off the shared resources but
retains all the context it had. If it turns out to be the only surviving
subset of the cluster, manual intervention, in the form of a few
commands at the console or a few clicks in the Availability Manager GUI,
lets it continue on from right where it left off. In a Linux (or
Serviceguard) cluster, once a node loses quorum it must always reboot to
rejoin the cluster.
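For reference, the quorum arithmetic behind all of this is simple. OpenVMS derives the quorum value from EXPECTED_VOTES as (EXPECTED_VOTES + 2) divided by 2, truncated; a partition may proceed only while the votes it can see meet that threshold. A minimal sketch in Python (the function names are mine, the formula is the documented one):

```python
def quorum(expected_votes):
    """OpenVMS quorum: (EXPECTED_VOTES + 2) // 2, integer division."""
    return (expected_votes + 2) // 2

def partition_has_quorum(votes_present, expected_votes):
    """A partition may touch shared resources only while it has quorum."""
    return votes_present >= quorum(expected_votes)
```

So in a three-node cluster with one vote per node, quorum is 2: a lone surviving node suspends rather than continuing, until an operator confirms it really is the only survivor and adjusts the expected votes to restore quorum.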