[Info-vax] Distant Cluster?
Stephen Hoffman
seaohveh at hoffmanlabs.invalid
Sat Oct 6 12:09:03 EDT 2012
On 2012-10-06 15:09:20 +0000, Christoph Gartmann said:
You mention "distant" in the title. How distant? The
officially-supported cluster distance is 500 miles / 800 kilometers
with the HP default cluster configuration support. Longer spans can
and do work, though you'll (officially) want to work with HP if you
need formal support.
> I have a few Alphas, all booting from a common system disk, running OpenVMS
> 7.3-2 and connected via LAN.
I could ask for justification for the fossil version, but I really
don't care what fig leaf somebody will propose.
> Next, there are two Itanium boxes in a different building, each having its own
> system disk. These are members of the cluster as well.
I'd probably have one disk for the two boxes, but that's another discussion.
> There is a shadowed disk consisting of a physical disk connected to one of the
> Itaniums and a physical disk connected to one of the Alphas. The common cluster
> files reside on this disk.
That's a fairly typical configuration.
> Now it is possible to shutdown all the Alphas, the Itanium boxes continue to
> run and vice versa. But if the network connection between the two sites drops,
> the Itaniums crash. Why?
Because you haven't yet configured a stable network, or haven't added a
parallel/redundant connection? Or because your network is getting
overloaded, and the load of (for instance) shadowing is saturating the
wire? Or wasn't that the (intended) question?
As for the question you've asked, this discussion of quorum and
partitioning and CLUEXITs and flaky networks is very far from a new
discussion, of course. Search the archives for details.
FWIW, the hosts usually crash when the network connection resumes,
because they're determined to be the "outliers" in the cluster and
CLUEXIT; you ended up with a partition, and the Alpha boxes "won" the
decision during the reconnection process.
Why does this happen? Partitioning. Or more specifically, avoiding
partitioning.
This can be because the hosts in the Alpha lobe have more votes, and
don't stop their processing. Alternatively, if there are equal votes
in the cluster lobes (and the network connections are gone too long),
then there's no particular (documented, determinate) way to predict
which lobe will get the go-ahead and which will (eventually) CLUEXIT
when the broken network connection temporarily unbreaks itself.
Which means the usual approach is to adjust the votes in the lobes to
keep whichever section you want as the portion that will continue.
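
As a rough illustration of that vote arithmetic, here is a minimal
Python sketch. It assumes the standard quorum calculation,
(EXPECTED_VOTES + 2) / 2 with the remainder dropped. The vote counts
are hypothetical, as the actual VOTES and EXPECTED_VOTES settings at
this site weren't posted, and the function names are made up for the
example.

```python
def quorum(expected_votes: int) -> int:
    """Quorum as OpenVMS derives it from EXPECTED_VOTES (integer division)."""
    return (expected_votes + 2) // 2


def lobe_has_quorum(lobe_votes: int, expected_votes: int) -> bool:
    """True if a lobe cut off from the rest still holds enough votes to keep running."""
    return lobe_votes >= quorum(expected_votes)


# Hypothetical vote assignments; the original post does not state the real ones.
scenarios = {
    "Alpha lobe favored": {"alpha": 2, "itanium": 1},
    "equal votes": {"alpha": 2, "itanium": 2},
}

for name, lobes in scenarios.items():
    expected = sum(lobes.values())
    print(f"{name}: EXPECTED_VOTES={expected}, quorum={quorum(expected)}")
    for lobe, votes in lobes.items():
        state = "keeps quorum" if lobe_has_quorum(votes, expected) else "loses quorum"
        print(f"  {lobe} lobe with {votes} vote(s) {state}")
```

With the hypothetical 2-to-1 split, only the Alpha lobe stays at or
above quorum while the link is down. With the even 2-to-2 split,
neither lobe holds quorum on its own, which is the equal-vote case
described above.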
The ugly approach is to increase the cluster timeout past your
networking group's typical network outage interval; see RECNXINTERVAL
here
<http://h71000.www7.hp.com/doc/84final/4477/4477pro_026.html#clus_sysgen>
--
Pure Personal Opinion | HoffmanLabs LLC