[Info-vax] Distant Cluster?
Michael Moroney
moroney at world.std.spaamtrap.com
Wed Oct 10 14:37:54 EDT 2012
David Froble <davef at tsoft-inc.com> writes:
>Michael Moroney wrote:
>> gartmann at nonsense.immunbio.mpg.de (Christoph Gartmann) writes:
>>
>>> In article <b9dcb6a6-8b38-4729-9b1e-abac00f3bba1 at googlegroups.com>, Ken Fairfield <ken.fairfield at gmail.com> writes:
>>
>>>> There are no inherent problems with a long RECNXINTERVAL other
>>>> than (some vague memories I have of) lengthened cluster transition
>>>> times.
>>
>>> Good to know.
>>
>> The question you have to ask yourself is whether you or your users can
>> tolerate random "hangs" by the entire cluster for up to RECNXINTERVAL
>> seconds, pretty much any time there is a network glitch such as rebooting
>> a switch. Because that is what will happen until things resolve themselves
>> or some node(s) get kicked out of the cluster.
>>
>> Default RECNXINTERVAL is 20 seconds.
>That is a timeout value, and only comes into play when the link is down.
RECNXINTERVAL is how long the nodes in a cluster wait, after detecting
that other nodes are unreachable, before trying to continue without them.
It also means that if the link is restored after that point, some portion
of the cluster will reboot.
> If the cluster is broken, aren't the users hosed anyway?
RECNXINTERVAL is part of the definition of when things are "broken" rather
than a temporary glitch. Each site has a different definition. The base
note has network routers with a 2 minute reboot time, but their reboots
are really a network glitch, not actual brokenness. Others have networks
where outages of more than a few seconds happen only when something is
really broken.
Second, VMS Clusters can survive and be useful even if a network link
breaks and a portion of the cluster is unreachable and unusable.
This is part of the whole "disaster tolerant" idea.
Third, some clusters are in situations where they absolutely have to
respond within a certain time period or [bad things happen]. These
need to have RECNXINTERVAL set shorter than this time so a portion
of the cluster can respond in time.
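
To make the tradeoff concrete, here is a rough sketch (not anything VMS
provides) that checks a proposed RECNXINTERVAL against the two
constraints above: it should be longer than the longest routine network
glitch, and shorter than any hard response deadline. The function name
and its arguments are made up for illustration; the only real values
are the 20 second default and the 2 minute router reboot from this
thread.

```python
from typing import List, Optional


def check_recnxinterval(recnx_s: int,
                        longest_glitch_s: int,
                        deadline_s: Optional[int] = None) -> List[str]:
    """Return warnings for a proposed RECNXINTERVAL, all values in seconds.

    longest_glitch_s and deadline_s are site-specific assumptions that
    the site has to supply; nothing here reads them from a real system.
    """
    warnings = []
    # A routine glitch that outlasts RECNXINTERVAL stops being a "hang"
    # and becomes a cluster membership change (nodes removed, reboots later).
    if recnx_s <= longest_glitch_s:
        warnings.append(
            "glitches of up to %ds outlast RECNXINTERVAL=%ds; "
            "nodes may be removed from the cluster"
            % (longest_glitch_s, recnx_s))
    # The cluster can hang for up to RECNXINTERVAL, so a site with a hard
    # deadline needs RECNXINTERVAL below that deadline.
    if deadline_s is not None and recnx_s >= deadline_s:
        warnings.append(
            "cluster may hang for up to %ds, past the %ds response deadline"
            % (recnx_s, deadline_s))
    return warnings


# Default of 20 seconds against routers that take 2 minutes to reboot.
default_vs_router = check_recnxinterval(20, 120)
# A longer interval against the same routers, no hard deadline.
long_interval = check_recnxinterval(180, 120)
# The same longer interval at a site that must respond within 60 seconds.
long_with_deadline = check_recnxinterval(180, 120, deadline_s=60)
```

If both warnings fire for every value you try, no single RECNXINTERVAL
satisfies the site, and the network itself has to change.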
> I'd rather
>they take a short break and come back to where they left off. If it
>happens often, then perhaps the core problem should be addressed.
As I stated, each site's definition of a "short break" is different. Too
long a break causes bad things to happen for some, and it is better for
part of the cluster to try to continue on without the other part.
>There is "doable" and then there is "prudent". I'd think a private
>direct link would be prudent.
Another possibility is a second, parallel non-private link which doesn't
share any hardware with the first, with clustering using both.