[Info-vax] Wide area cluster, metro area network, seeking info

Wed Jun 16 00:10:53 EDT 2021

On 6/15/2021 3:54 PM, dthi...--- via Info-vax wrote:
>
>>> VSI _is_ involved and we are working with them on this possibility. At this point I think the additional license subscription costs are going to kill the HBVS/cluster option, especially if a third node was needed (and a third location and connection, and set of licenses). That means going with the one-day latency backup option and generating and testing the procedures for failing back to the main system when it is available again.
>
> This is just a thinking-outside-the-box thought to bounce off of VSI in the design discussions.  Since OpenVMS doesn't really have a good solution for stabilizing a wide area 2-node cluster without extensive hardware/software investment, you might ask them if they would authorize you to set up a FreeAXP OpenVMS VM at a third site using the Alpha Community License solely to provide a tie-breaking vote for your production two-node cluster, and to do no other work. It never hurts to ask if there's an acceptable lower cost alternative to an expensive situation. This tie-breaker VM could even be hosted at an offsite VM provider like AWS so that your company doesn't have to invest in a physical 3rd site and network presence.

Hmm.  Let's call the tie-breaking site T.  Site A is the site they want
to treat as the main site, and site B is the backup site.
Suppose a fire (or other spreading damage) occurs.  Fiber is destroyed
such that site A loses all connectivity.  An hour or two passes, during
which site B continues to communicate with site T.  Then the fire (or
whatever) also severs the connection between site B and site T.
Priorities being what they are, the company servicing the fiber goes
out and repairs or replaces the fiber so that site A resumes
communication with site T first.  It's going to be a couple more days
before they repair the connection so that site B can communicate with
site T.
The question:  Is there enough information at site T for site A to
really resume operation, or did site B make enough changes in the
meantime that site A cannot safely operate until site B is back in
communication?
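
To make the vote arithmetic in that scenario concrete, here is a rough
sketch in Python.  It assumes one vote per site and EXPECTED_VOTES = 3
(my assumption, nothing in the thread specifies the vote assignments),
and uses the usual OpenVMS rule that quorum is (EXPECTED_VOTES + 2) / 2,
rounded down.  It only counts votes for each partition.  It does not
model shadow set state, so it says nothing about whether site A's data
would be current, which is the actual question.

```python
# Sketch only: vote counting for the A / B / T scenario described above.
# Assumptions (not from the thread): one vote per site, EXPECTED_VOTES = 3.

VOTES = {"A": 1, "B": 1, "T": 1}
EXPECTED_VOTES = sum(VOTES.values())


def quorum(expected_votes: int) -> int:
    """OpenVMS quorum rule: (EXPECTED_VOTES + 2) / 2, rounded down."""
    return (expected_votes + 2) // 2


def has_quorum(partition) -> bool:
    """True if the sites that can still see each other hold enough votes."""
    return sum(VOTES[site] for site in partition) >= quorum(EXPECTED_VOTES)


# Each step lists the groups of sites that can still talk to each other.
TIMELINE = [
    ("all links up", [{"A", "B", "T"}]),
    ("site A cut off", [{"A"}, {"B", "T"}]),
    ("site B also loses T", [{"A"}, {"B"}, {"T"}]),
    ("A-T link repaired, B still isolated", [{"A", "T"}, {"B"}]),
]


def run(timeline):
    """Return (step, members, has_quorum) for every partition at every step."""
    results = []
    for step, partitions in timeline:
        for partition in partitions:
            members = "+".join(sorted(partition))
            results.append((step, members, has_quorum(partition)))
    return results


if __name__ == "__main__":
    for step, members, ok in run(TIMELINE):
        state = "has quorum" if ok else "no quorum"
        print(f"{step:38s} {members:6s} {state}")
```

By vote count alone, A plus T would reach quorum in the last step even
though B was the only site doing work for the previous hour or two.
Counting votes is therefore not enough to answer the question; it just
shows why the question matters.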

I think this solution is safer if you can be sure you have completely
redundant fiber links between site A and site B, where the fiber enters
and exits each building by two different routes and the two routes
never cross along the way.  This reduces the chance of a single
communication failure causing problems with the cluster.