[Info-vax] Quorum disk woes

Stephen Hoffman seaohveh at hoffmanlabs.invalid
Sun Jun 2 11:49:31 EDT 2019


On 2019-06-02 09:55:59 +0000, Marc Van Dyck said:

> Unless I completely misunderstood, expected_votes is the number of 
> votes observed when the whole cluster is up.

I'd swap the word "observed" for either "intended" or "expected" in that.

The number of votes observed is determined dynamically, from the votes 
actually contributed by the members (and any quorum disk) present.

EXPECTED_VOTES is a mechanism that provides a lower bound and an 
initial default for the total number of votes, prior to the connection 
manager and its connections being established.

Once the cluster connections are established, then the number of votes 
present and the calculated value for quorum will be floated to the 
actual value of votes present in the configuration.
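
Roughly, and treat the current OpenVMS Cluster Systems manual as the 
authority on the exact rules, the arithmetic looks like this:

    quorum = (expected_votes + 2) / 2    ! integer division, fraction dropped

    EXPECTED_VOTES = 1  ->  quorum = 1
    EXPECTED_VOTES = 2  ->  quorum = 2
    EXPECTED_VOTES = 3  ->  quorum = 2

where expected_votes is the larger of the EXPECTED_VOTES parameter and 
the total votes the connection manager has actually seen; that's the 
floating-upward behavior described above.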

There are folks who try to use EXPECTED_VOTES to fake out the cluster 
connection manager and the voting, and that can end badly, particularly 
in cases where connections cannot be established.  It'll look like it 
works. Until it doesn't. And it'll end in disk corruption.
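
To make that failure mode concrete, here's a sketch with made-up 
parameter values of the sort of settings that fake the voting; each 
member can make quorum by itself:

    ! MODPARAMS.DAT on *both* members; do not do this
    VOTES = 1
    EXPECTED_VOTES = 1      ! each member computes quorum = 1 on its own

Lose the cluster interconnect and both members keep running, each 
believing it has quorum, and each able to mount and write the shared 
disks. That's the partitioned-cluster case, and that's where the 
corruption comes from.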

> If each member has one vote, and the quorum disk has one vote too, then 
> expected_votes should be 3.

This particular configuration is constrained by the requirements.  It 
is not going to be a robust configuration, and it is going to be less 
than what it could be.  That's all dictated by the stated app 
requirements found in that other thread of yours that's directly 
related to this topic:
https://groups.google.com/d/msg/comp.os.vms/tc0v7lnTLSo/GDvyor2SAQAJ

The requirement for a single and consistent primary host means a 
different voting pattern, means the quorum disk is unnecessary, and 
means somebody will get to learn the IPC handler as a way to free up a 
stuck secondary.  With a primary-secondary configuration, there's no 
need for a quorum disk, and no need for votes on the secondary.
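
A minimal sketch of that voting pattern, using hypothetical node names, 
with the settings landing in each node's SYS$SYSTEM:MODPARAMS.DAT and 
applied via AUTOGEN:

    ! MODPARAMS.DAT on the primary (hypothetical node NODEA)
    VOTES = 1
    EXPECTED_VOTES = 1
    ! leave DISK_QUORUM blank; no quorum disk

    ! MODPARAMS.DAT on the secondary (hypothetical node NODEB)
    VOTES = 0
    EXPECTED_VOTES = 1

    ! then, on each node, something like:
    $ @SYS$UPDATE:AUTOGEN GETDATA REBOOT NOFEEDBACK

With that, the primary alone holds quorum, the secondary can join and 
leave freely, and the secondary by itself contributes no votes and so 
cannot boot into a working cluster without the primary present.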

> Then for such a configuration the calculated quorum is 2. When booting 
> one member and with the quorum disk present, we have 2 votes so the 
> quorum is gained and the cluster can be alive.

Or one vote and one expected vote, and the secondary can join and 
depart the "cluster" at any time.  It can share resources.  But it 
can't boot and operate without the primary present (which is one of the 
stated requirements) without manual intervention.  That manual 
intervention will switch host names, votes, and related baggage, which 
will allow the secondary to impersonate the usual primary, and will 
allow the software with the hard-coded host names and the rest to 
operate appropriately in the failover configuration.  And with the IPC 
handler the secondary can clear the quorum hang left by the "departed" 
primary for a look around, or the secondary can be rebooted with a 
different persona and can poke around prior to the manual failover.
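
For reference, a sketch of both escape hatches. The console sequence 
shown is the classic VAX-style one; the details of halting and reaching 
the IPC> prompt differ on Alpha and Integrity consoles, so check the 
OpenVMS Cluster Systems manual before relying on it:

    ! on the console of the hung (quorum-blocked) member:
    Ctrl/P                  ! halt to the console
    >>> D SIRR C            ! request an IPL C software interrupt
    >>> C                   ! continue; the IPC> prompt appears
    IPC> Q                  ! recompute quorum from the votes present
    IPC> Ctrl/Z             ! dismiss the IPC handler

    ! or, from DCL on a member that still has quorum:
    $ SET CLUSTER/EXPECTED_VOTES

As I recall, SET CLUSTER/EXPECTED_VOTES with no value specified has the 
connection manager recompute quorum from the votes currently present.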

> Regarding the visibility of the quorum disk, I think it's OK but just 
> to be sure, tomorrow I will extract a console log and post it here.

The quorum disk is unnecessary here.  If it's used, it needs to be 
configured on all hosts with direct access to it, which would be all 
hosts connected to the FC SAN if that's where the quorum disk is 
located.
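
If the quorum disk does stay in the picture, the relevant knobs (the 
device name here is a placeholder) get set identically on every member 
with a direct path to the disk:

    ! MODPARAMS.DAT on every FC-connected member (placeholder device name)
    DISK_QUORUM = "$1$DGA100"
    QDSKVOTES = 1
    EXPECTED_VOTES = 3       ! one vote per member plus the quorum disk's vote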

Your app software is incompatible with the app failover capabilities of 
a cluster.  Clustering does not provide a mechanism to fix this.  If 
you want what you want, you cannot have the cluster configuration 
you're trying for here.

The OpenVMS app development doc is simply awful at describing how to 
code apps for what you want, too.  And how not to code them.  The doc 
does describe all the pieces, but not how they fit together.  And the 
frameworks and tools are weak, such as those around the DLM and around 
common tasks such as distributed job scheduling and app and system 
failovers.  These are not easy tasks to code properly, either.

-- 
Pure Personal Opinion | HoffmanLabs LLC 



