[Info-vax] Distant Cluster?

Wed Oct 10 14:37:54 EDT 2012

David Froble <davef at tsoft-inc.com> writes:

>Michael Moroney wrote:
>> gartmann at nonsense.immunbio.mpg.de (Christoph Gartmann) writes:
>> 
>>> In article <b9dcb6a6-8b38-4729-9b1e-abac00f3bba1 at googlegroups.com>, Ken Fairfield <ken.fairfield at gmail.com> writes:
>> 
>>>> There are no inherent problems with a long RECNXINTERVAL other
>>>> than (some vague memories I have of) lengthened cluster transition
>>>> times.
>> 
>>> Good to know.
>> 
>> The question you have to ask yourself is whether you or your users can
>> tolerate random "hangs" by the entire cluster for up to RECNXINTERVAL
>> seconds, pretty much any time there is a network glitch such as rebooting
>> a switch.  Because that is what will happen until things resolve themselves
>> or some node(s) get kicked out of the cluster.
>> 
>> Default RECNXINTERVAL is 20 seconds.

>That is a timeout value, and only comes into play when the link is down. 

RECNXINTERVAL is how long a node (or nodes) in a cluster will wait, after
detecting that other nodes are unreachable, before trying to continue
without them.  It also means that if the link is restored after that
point, some portion of the cluster will reboot.

>  If the cluster is broken, aren't the users hosed anyway?

RECNXINTERVAL is part of the definition of when things are "broken" rather
than a temporary glitch.  Each site has a different definition.  The base
note has network routers with a 2 minute reboot time, but their reboots
are really a network glitch, not actual brokenness.  Others have networks
where outages of more than a few seconds happen only when something is
really broken.

Second, VMS Clusters can survive and be useful even if a network link
breaks and a portion of the cluster is unreachable and unusable.
This is part of the whole "disaster tolerant" idea.

Third, some clusters are in situations where they absolutely have to
respond within a certain time period or [bad things happen].  These
need to have RECNXINTERVAL set to shorter than this time so a portion
of the cluster can respond in time.
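
To make the trade-off concrete, here is a rough Python sketch.  It is a
toy model only, not how the connection manager actually works: it
ignores quorum and votes, and the function names and the transition_s
allowance are made up for illustration.  It just shows how the same
outage is a "glitch" at one site and "broken" at another, and how a
response deadline puts an upper bound on RECNXINTERVAL.

```python
def outage_outcome(outage_s: float, recnxinterval_s: float) -> str:
    """Classify a link outage in this simplified model.

    "glitch": the link comes back before RECNXINTERVAL expires, so the
              cluster stalls for the length of the outage and resumes.
    "broken": RECNXINTERVAL expires first, so the surviving nodes
              try to continue without the unreachable ones.
    """
    return "glitch" if outage_s < recnxinterval_s else "broken"


def stall_s(outage_s: float, recnxinterval_s: float) -> float:
    """How long users see a hang: the outage itself, capped at RECNXINTERVAL."""
    return min(outage_s, recnxinterval_s)


def meets_deadline(recnxinterval_s: float, deadline_s: float,
                   transition_s: float = 0.0) -> bool:
    """True if survivors can be running again before deadline_s.

    transition_s is an assumed, site-specific allowance for the cluster
    transition that follows the timeout.
    """
    return recnxinterval_s + transition_s < deadline_s


# A 2 minute router reboot, as in the base note:
print(outage_outcome(120, 20))    # default 20 s interval
print(outage_outcome(120, 180))   # interval raised above the reboot time
print(stall_s(120, 180))          # hang seen by users with the long interval
print(meets_deadline(180, 30))    # a site that must respond within 30 s
```

With the default of 20 seconds the router reboot counts as "broken";
with 180 seconds it is a "glitch" but everyone hangs for the full two
minutes, which a site with a 30 second deadline cannot accept.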

>  I'd rather 
>they take a short break and come back to where they left off.  If it 
>happens often, then perhaps the core problem should be addressed.

As I stated, each site's definition of a "short break" is different. Too
long a break causes bad things to happen for some, and it is better for
part of the cluster to try to continue on without the other part.

>There is "doable" and then there is "prudent".  I'd think a private 
>direct link would be prudent.

Another possibility is a second parallel non-private link that shares
no hardware with the first, with clustering using both.