[Info-vax] Wide area cluster, metro area network, seeking info

Mark Berryman mark at theberrymans.com
Fri Jun 11 21:10:09 EDT 2021


On 6/8/21 4:28 PM, Rich Jordan wrote:
> We are looking at the possibility of putting VMS boxes in two locations, with Integrity boxes running VSI VMS.  This is the very beginning of the research on the possibility of clustering those two servers instead of just having them networked.  Probably have to be master/slave since only two nodes and no shared storage.
> 
> After reviewing the various cluster docs, they seem to be focused on older technologies like SONET and DS3 using FDDI bridges (which would allow shared storage).  The prospect has a metropolitan area network but I do not have any specs on that as yet.
> 
> Are there available docs relevant to running a distributed VMS cluster over a metro area network or fast/big enough VPN tunnel?  Or is that just the straight cluster over IP configuration in the docs (which we've never used) that we need to concentrate on?
> 
> Thanks
> 

First, I recommend you ignore the suggestions to add a 3rd node to your 
cluster.  In your situation, it is not really a viable answer.

There are configurations that will allow a member of a 2-node cluster to 
automatically continue in the event that the other node fails.  However, 
if you lose the communication channel but both nodes stay up, the 
cluster will partition, and then you have to be very careful about how 
you reform it.  Because of this, I tend not to recommend that approach 
except in very specific circumstances: namely, where you can guarantee 
that the correct node becomes the shadow master when the cluster reforms 
and that you haven't been writing different data to each node.
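
For reference, the quorum arithmetic behind that partitioning risk can 
be sketched with illustrative MODPARAMS.DAT entries (the numbers below 
are examples only, not a recommendation for your configuration):

  ! Quorum is computed as (EXPECTED_VOTES + 2) / 2, truncated.
  !
  ! Symmetric 2-node cluster: quorum = (2 + 2) / 2 = 2, so neither
  ! node can continue by itself if it loses sight of the other.
  VOTES = 1               ! on each node
  EXPECTED_VOTES = 2
  !
  ! Asymmetric variant: quorum = (3 + 2) / 2 = 2, so the 2-vote node
  ! carries on alone.  This is the sort of "automatic continuation"
  ! setup where a lost link can leave the two sites holding different
  ! data if both are ever allowed to run independently.
  ! VOTES = 2             ! on the primary node only
  ! EXPECTED_VOTES = 3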

As far as I can tell from your description, the only way clustering 
would be a viable answer for you would be if you also did HBVS 
(host-based volume shadowing).  In that case, simply build a 2-node 
cluster with enough identical disks that all of the data you want 
present at the backup site can be placed on host-based shadow sets. 
HBVS will then keep the data at both sites in sync.
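
If it helps to picture it, the mount for one such shadow set might look 
roughly like the sketch below.  The device names, volume label, and 
logical name are invented, and host-based shadowing (the SHADOWING 
system parameter set to 2) is assumed to be enabled on both nodes:

  $ ! Illustrative only -- one member is served from each site and
  $ ! HBVS keeps the two copies in sync across the intersite link.
  $ MOUNT/SYSTEM DSA1: /SHADOW=($1$DGA101:, $2$DGA201:) DATA DATA_DISK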

Failure modes in this scenario:

1. Loss of the communication channel.  In this case, both nodes will 
hang for the duration allowed by the cluster timeout parameters.  More 
specifically, each will freeze any process that attempts a write to 
disk.  As long as the communication channel comes back up before the 
cluster times out, everything will resume automatically.  If it doesn't, 
both nodes should take a CLUEXIT bugcheck.  Once the communication 
channel is back up, you then bring each node back up as appropriate. 
(See the sketch after this list for the main timeout parameter involved.)

2. Loss of one node.  In this case the other node will hang.  Manual 
intervention is required to get it going again (specifically, a couple 
of commands at the console to reset quorum; see the sketch after this 
list).  At that point, everything simply resumes on that node.
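
For what it is worth, the "cluster timeout" in case 1 is largely the 
reconnection interval, which you can inspect on each node; a rough 
sketch (permanent changes would normally go through MODPARAMS.DAT and 
AUTOGEN):

  $ ! Show how long (in seconds) the connection manager waits for a
  $ ! lost connection before giving up on the other node.
  $ RUN SYS$SYSTEM:SYSGEN
  SYSGEN> SHOW RECNXINTERVAL
  SYSGEN> EXIT

The manual quorum reset in case 2 is typically done through the IPC 
facility at the console (Availability Manager offers an equivalent 
quorum-adjustment fix).  Roughly, and hedged because the exact way to 
reach the IPC> prompt varies by platform:

  (halt to the console and invoke the IPC facility -- see the VSI
   OpenVMS Cluster Systems manual for the platform-specific steps)
  IPC> Q          ! recalculate quorum so the surviving node resumes
  IPC> Ctrl/Z     ! dismiss IPC and let the system continue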

The main reason for doing it this way is that it makes what to do after 
any failure a human decision.  In the event of any node or communication 
failure, the surviving cluster members will simply stop until you tell 
them what to do.  The main intent is to prevent the wrong node from 
becoming the shadow master when (or if) the cluster is reformed.

Since you are in contact with VSI, I have no doubt they will cover this 
type of scenario with you.  This is presented merely as an idea to 
generate questions as part of your discussions.

Mark Berryman


