[Info-vax] reboot due to network failure

Michael Moroney moroney at world.std.spaamtrap.com
Tue Dec 13 10:46:55 EST 2011


helbig at astro.multiCLOTHESvax.de (Phillip Helbig---undress to reply) writes:

>In article <4ee74c90$0$2818$c3e8da3$fdf4f6af at news.astraweb.com>, JF
>Mezei <jfmezei.spamnot at vaxination.ca> writes: 

>> Phillip Helbig---undress to reply wrote:
>> > Recently, my switch died, which of course froze my LAN cluster.  When I 
>> > replaced it with another one (actually a hub; I plan to buy a switch 
>> > this week), I saw that 2 of the 3 nodes in the cluster rebooted.  What 
>> > determines whether any nodes reboot and if so which ones?
>> 
>> When a node reconnects with a cluster after the others kicked it out, it
>> will perform hara kiri because it realises that its lock database and
>> all other cluster structures are woefully out of date.  Rebooting is the
>> simpler way to get the node to rejoin cleanly.
>> 
>> I believe it goes by votes. If one node has quorum (or if a few nodes
>> managed to stay connected during outage) then they will survive, and
>> reconnecting nodes will reboot.

>3 nodes.  Two rebooted, one didn't.  One node can't have quorum.  Maybe 
>one rebooted and came back before the other one did, so quorum was never 
>lost.  Boottimes are almost 3 minutes apart.

When parts of a cluster lose connection for more than RECNXINTERVAL, each
of the subclusters will kick out all the nodes in the cluster with which
contact was lost.  (In your case, each subcluster was a single node.)
This results in three completely incompatible views of the cluster.  This
is not a problem since, due to the quorum mechanism, at most one
subcluster can continue on as "the" cluster.  (In your case, none did.)
Nodes in any subcluster without quorum are "frozen".
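
As a rough illustration of the quorum arithmetic (a sketch, not VMS code):
OpenVMS computes quorum as (EXPECTED_VOTES + 2) / 2, rounded down.  The
one-vote-per-node setup and EXPECTED_VOTES = 3 below are assumptions about
your configuration, and the function and variable names are made up for the
example.

```python
def quorum(expected_votes: int) -> int:
    """Quorum as OpenVMS derives it: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def has_quorum(subcluster_votes: list[int], expected_votes: int) -> bool:
    """A subcluster keeps running only if its total votes reach quorum."""
    return sum(subcluster_votes) >= quorum(expected_votes)


# Assumed configuration: three nodes, one vote each, EXPECTED_VOTES = 3.
EXPECTED_VOTES = 3
partitions = {"A": [1], "B": [1], "C": [1]}  # hypothetical node names

# With the switch dead, every node is its own one-vote subcluster.
surviving = [name for name, votes in partitions.items()
             if has_quorum(votes, EXPECTED_VOTES)]
print(quorum(EXPECTED_VOTES), surviving)
```

With these assumed numbers quorum is 2, so no single-node subcluster can
continue, which matches all three nodes freezing.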

When connection is restored, these subclusters see each other and resolve
their differences via hara-kiri.  The rules are as follows: the subcluster
with the most votes "wins", and members of the other subclusters reboot.  If
votes are tied (a 3-way tie in your case), the subcluster with the most
nodes wins (two servers with 1 vote each beat a 2-vote server).  In your
case, that is again a 3-way tie.  At that point, I was told by a cluster
engineer, it is essentially random who wins.  (Over the years I've also
heard that it is the subcluster with the lowest SCSNODE in it, and that it
is the subcluster with the highest SCSNODE in it; obviously at least one of
these must be wrong.)
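
The two rules I'm sure of can be sketched like this.  This is only a model
of the description above, not the actual connection manager logic; the node
names and the (name, votes) representation are invented for the example,
and the final tie-breaker is deliberately left unresolved because I don't
know what it really is.

```python
def rank(subcluster):
    """Rank a subcluster by (total votes, node count); higher is better.

    A subcluster is a list of (node_name, votes) pairs (hypothetical format).
    """
    total_votes = sum(votes for _, votes in subcluster)
    return (total_votes, len(subcluster))


def candidates(subclusters):
    """Return the subclusters still tied after applying both known rules.

    One result means a clear winner; more than one means the outcome
    depends on a further tie-breaker that is not modeled here.
    """
    best = max(rank(s) for s in subclusters)
    return [s for s in subclusters if rank(s) == best]


# Two 1-vote servers against one 2-vote server: votes tie, node count decides.
pair = [("NODE1", 1), ("NODE2", 1)]
big = [("NODE3", 2)]
print(candidates([pair, big]))

# Three single-node, 1-vote subclusters: still a 3-way tie.
singles = [[("A", 1)], [("B", 1)], [("C", 1)]]
print(len(candidates(singles)))
```

Members of every subcluster that is not the eventual winner reboot.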


