[Info-vax] Uptime for OpenVMS

seasoned_geek roland at logikalsolutions.com
Sat May 21 11:09:50 EDT 2011


On May 14, 3:32 am, Johnny Billquist <b... at softjar.se> wrote:
> On 2011-05-13 21.24, JF Mezei wrote:
>
> > Michael Moroney wrote:
>
> >> Mount Verify gets invoked if a drive goes offline when trying to do I/O
> >> to it.
>
> > OK, when you reboot after a crash, and the MOUNT command takes an
> > eternity to complete because it does some "cleanup" of the disk, what is
> > that process called ?
>
> Is that on VMS? I don't think I've seen a mount take an eternity to
> complete, except for the time when it waits for a disk to physically
> spin up. But then again, I use VMS seldom enough that I might just have
> forgotten about it...
>
>         Johnny


If you have RMS journaled files or databases which were participating
in ACMS transactions, or an Rdb database split across multiple
spindles, the MOUNT can "take an eternity" during the initial boot
because the database, RMS Journaling, and DECdtm all need to verify
their states.  I haven't seen this much in the post-RA-disk world,
but I also no longer have clients without enough UPS capacity to keep
the data center running for at least 10 minutes while the diesel
generators start.
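For what it's worth, part of that boot-time delay is the plain MOUNT
rebuild of the caches a crash left stale, and that part can be
deferred.  A minimal sketch; the device and label names (DKA100:,
USERDISK) are placeholders:

  $! Defer the rebuild at boot so the volume comes online sooner.
  $ MOUNT/SYSTEM/NOREBUILD DKA100: USERDISK
  $! Later, during a quiet period, rebuild the free-space and quota
  $! caches that were not flushed by an orderly dismount:
  $ SET VOLUME/REBUILD DKA100:

The database and DECdtm recovery passes still have to happen, of
course; only the file-system cache rebuild can be put off this way.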

Remember that "up-time" question which started this thread, and the
learned scholar who claimed extreme up-time accomplished nothing?
Well, they were wrong; then again, they tend to be wrong 99.999% of
the time.  Extreme up-times mean you never see the cost of high
availability, because you never have to pay the recovery price at
boot.

Let us wander back to the tales of 9/11, when those trading companies
with DISTRIBUTED OpenVMS systems lost one or more locations in the
Twin Towers, yet continued to trade until the end of the business day
with less than a 15-minute "hesitation" in trade processing.  They
were able to do this because OpenVMS was designed for extreme up-
times combined with high availability, a combined target which is
physically impossible with any flavor of Linux/Unix or Windows.

The trading sites had RMS Journaling participating with a relational
database and something akin to MQSeries + ACMS, all participating
under the umbrella of unified DECdtm transactions.  Lots of
information is kept in lots of places about the state of each sub-
transaction participating in the unified parent transaction.  When a
location is lost while the cluster is up, the cluster has to wait for
the reconnection timeout to expire and kick the lost node(s) out;
then DECdtm has to roll its in-flight transactions back to consistent
states before processing continues.
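If you want to see where that state lives, each node keeps a DECdtm
transaction log managed through LMCP, and the node-removal wait is
governed by the RECNXINTERVAL system parameter.  A rough sketch,
assuming the default log placement:

  $! Inspect this node's DECdtm transaction log(s) with the Log
  $! Manager Control Program:
  $ RUN SYS$SYSTEM:LMCP
  LMCP> SHOW LOG
  LMCP> EXIT
  $! The number of seconds the cluster waits before removing an
  $! unreachable node is the RECNXINTERVAL system parameter:
  $ RUN SYS$SYSTEM:SYSGEN
  SYSGEN> SHOW RECNXINTERVAL
  SYSGEN> EXIT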

If you are at a site which fully uses the power of the cluster and
DECdtm, especially if there are shadow volumes at each site, you will
see really long mounts after a catastrophic power outage.  If your
power-crisis management is configured so that local nodes begin an
orderly shutdown once the UPS indicates it has less than N minutes of
power remaining, you won't see this, because an orderly exit and
shutdown will already have occurred.
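What that orderly exit can look like, as a very rough DCL sketch: the
UPS reading and the threshold are hypothetical stand-ins for a site's
actual UPS monitoring hook, and you should check SHUTDOWN.COM's
parameters against your own version before relying on them.

  $! Hypothetical hook run by a UPS monitor; MINUTES_LEFT would come
  $! from the site's actual UPS interface.
  $ MINUTES_LEFT = 8
  $ IF MINUTES_LEFT .GE. 10 THEN EXIT
  $! Begin the orderly shutdown now; the first two parameters answer
  $! SHUTDOWN.COM's "minutes until shutdown" and "reason" prompts.
  $ @SYS$SYSTEM:SHUTDOWN 0 "UPS battery low"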


