[Info-vax] Distant Cluster?

Sat Oct 6 16:10:48 EDT 2012

On 2012-10-06 19:25:18 +0000, Phillip Helbig---undress to reply said:

> In article <k4pl2v$kks$1 at dont-email.me>, Stephen Hoffman
> <seaohveh at hoffmanlabs.invalid> writes:
> 
>> You mention "distant" in the title.  How distant?  The
>> officially-supported cluster distance is 500 miles / 800 kilometers
>> with the HP default cluster configuration support.  Longer spans can
>> and do work, though you'll (officially) want to work with HP if you
>> need formal support.
> 
> He did say "in another building".  If it were in another country, I
> think he would have mentioned that.  :-)

Phillip, you've been known to leave details out.  You don't always post 
the commands that were used, the versions or related details, in your 
own questions.

And two buildings roughly 500 meters apart can still have a network 
distance of 800 kilometers (network cabling doesn't always go where you 
expect), or a too-small network pipe; shadowing can push big volumes of 
data and saturate links.  Either way, two adjacent buildings can still 
get in trouble with the network connection.
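
For a rough feel of how quickly shadowing traffic can fill a pipe, here 
is a back-of-the-envelope sketch.  The function name, the volume size, 
the link speed and the "usable fraction" of the link are all made-up 
assumptions for illustration; none of them come from the OP's 
configuration, and real shadow copies have overheads this ignores.

```python
def full_copy_hours(volume_gb: float, link_mbps: float,
                    usable_fraction: float = 0.7) -> float:
    """Estimate hours to push volume_gb (decimal gigabytes) across a link.

    link_mbps is the nominal link speed in megabits per second.
    usable_fraction is an assumed share of the link that the copy
    actually gets; 0.7 is an arbitrary illustrative value.
    """
    if volume_gb < 0 or link_mbps <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("inputs out of range")
    total_bits = volume_gb * 8e9                      # GB -> bits
    usable_bits_per_second = link_mbps * 1e6 * usable_fraction
    return total_bits / usable_bits_per_second / 3600


if __name__ == "__main__":
    # Hypothetical: a 100 GB shadow set member over a 100 Mb/s link.
    print(f"{full_copy_hours(100, 100):.1f} hours")
```

While a copy like that is running, cluster traffic is competing for 
the same wire.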

Put another way, this is 
comp.os.we.dont.always.get.the.full.story.with.the.first.post.vms after 
all.

> 
>> I could ask for justification for the fossil version, but I really
>> don't care what fig leaf somebody will propose.
> 
> The old stuff does the job and is paid off while new kit would be less
> reliable and would cost extra money?  Difficulty of getting a quote,
> much less a sensible price for an academic institution, from HP?  Lack
> of public commitment to VMS on the part of HP makes it difficult to make
> a business case for further investment involving new hardware?

If I don't care about the OP's justification for down-revision 
software, what might imply I care about your justification?

> 
>>> Now it is possible to shutdown all the Alphas, the Itanium boxes continue to
>>> run and vice versa. But if the network connection between the two sites drops,
>>> the Itaniums crash. Why?
>> 
>> Because you haven't yet configured a stable network, or haven't added a
>> parallel/redundant connection?  Or because your network is getting
>> overloaded, and the load of (for instance) shadowing is saturating the
>> wire?  Or wasn't that the (intended) question?
> 
> I don't think so.  Normally, one shouldn't be able to switch off one part
> of a cluster and, at another time, the other part, and have what is left
> keep running---partitioned cluster etc.

An unstable network makes for an unstable cluster.  Full stop.

> 
>> FWIW, the hosts usually crash when the network connection resumes,
>> because they're determined to be the "outliers" in the cluster and
>> CLUEXIT; you ended up with a partition, and the Alpha boxes "won" the
>> decision during the reconnection process.
>> 
>> Why does this happen?  Partitioning.  Or more specifically, avoiding
>> partitioning.
> 
> Right.  I was puzzled by the fact that the boxes "kept running".  Maybe
> the fans were running, but the OS noticed the partition.

Or the transient outage was just a little longer than the reconnection 
interval.
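
As a minimal sketch of the arithmetic behind avoiding partitioning: 
quorum is derived from EXPECTED_VOTES as (EXPECTED_VOTES + 2) / 2, 
truncated, and a set of nodes keeps going only if the votes it can see 
reach quorum.  The function names and the vote counts below are 
hypothetical, the comparison against the reconnection interval 
(RECNXINTERVAL) is deliberately simplified, and none of this says 
anything about the OP's actual settings.

```python
def quorum(expected_votes: int) -> int:
    """Quorum from EXPECTED_VOTES: (EXPECTED_VOTES + 2) / 2, truncated."""
    if expected_votes < 1:
        raise ValueError("expected_votes must be at least 1")
    return (expected_votes + 2) // 2


def has_quorum(votes_present: int, expected_votes: int) -> bool:
    """True if the votes visible to one side of a split reach quorum."""
    return votes_present >= quorum(expected_votes)


def outlasts_reconnection(outage_seconds: float,
                          reconnection_interval_seconds: float) -> bool:
    """True if an outage ran longer than the reconnection interval.

    Simplified: the real connection manager does more than compare
    two numbers.
    """
    return outage_seconds > reconnection_interval_seconds


if __name__ == "__main__":
    # Hypothetical: two sites with two votes each, EXPECTED_VOTES = 4.
    print(quorum(4))          # 3
    print(has_quorum(2, 4))   # False: neither half reaches quorum alone
    print(has_quorum(3, 4))   # True
```

With an even split like that hypothetical one, neither side has quorum 
by itself, which is why the vote assignments matter as much as the 
wire.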

What might be happening with the OP's configuration isn't yet clear.

-- 
Pure Personal Opinion | HoffmanLabs LLC