[Info-vax] Unpleasant Disk Shadowing Surprise

Wed Oct 12 14:44:06 EDT 2011

In article <4e95c170$0$28524$c3e8da3$9b4ff22a at news.astraweb.com>, JF
Mezei <jfmezei.spamnot at vaxination.ca> writes: 

> Bob Koehler wrote:
> 
> >    If your definition of real-time can't handle 3 minutes of
> >    interruption, then you probably need to engineer a different solution
> >    than the kind of shadowing approach you're using now.
> 
> I seem to recall being told that VMS would seamlessly continue to run
> after the loss of a disk.

For some definitions of "seamlessly".  I once had a system disk crash,
back when I didn't have it shadowed (how I could sleep at night then I
don't know).  VMS continued fine---no problems until it had to access
the disk.  Last night I had some sort of crash (haven't analysed it yet;
everything is back now).  One machine (of three in the cluster, each
with one vote) was accessible; SHOW SYSTEM/CLUSTER showed the
usual processes etc.  However, the system disk (shadow set) of this node
was in mount verification.  I couldn't mount it, because in order to do
so it would have to access the system disk.  I couldn't find any way to
reboot it other than powering it down.  There is one bizarre side
effect: 

SYSMAN> do write sys$output f$getsyi("boottime")
%SYSMAN-I-OUTPUT, command execution on node MINNIM
11-OCT-2011 22:19:51.00
%SYSMAN-I-OUTPUT, command execution on node JANDER
11-OCT-2011 22:25:38.00
%SYSMAN-I-OUTPUT, command execution on node LEEBIG
 1-JAN-2015 00:06:43.00
SYSMAN> conf sh time
System time on node MINNIM: 12-OCT-2011 20:39:17.32
System time on node JANDER: 12-OCT-2011 20:39:17.55
System time on node LEEBIG: 12-OCT-2011 20:39:17.74
SYSMAN> conf set time
SYSMAN> conf sh time
System time on node MINNIM: 12-OCT-2011 20:39:25.88
System time on node JANDER: 12-OCT-2011 20:39:25.90
System time on node LEEBIG: 12-OCT-2011 20:39:25.91
SYSMAN> do write sys$output f$getsyi("boottime")
%SYSMAN-I-OUTPUT, command execution on node MINNIM
11-OCT-2011 22:19:51.00
%SYSMAN-I-OUTPUT, command execution on node JANDER
11-OCT-2011 22:25:38.00
%SYSMAN-I-OUTPUT, command execution on node LEEBIG
 1-JAN-2015 00:06:43.00

Both the date and the time are way off for LEEBIG.  How could this 
possibly happen?

> For proper fault tolerance, you would want to have 2 SCSI controllers.

And have the members of the shadow set connected to different nodes in 
the cluster.
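
As a rough sketch of what that could look like (the device names, the
virtual unit DSA1: and the label DATA1 are made up for illustration;
the members are assumed to be local disks on two different nodes,
MSCP-served to the rest of the cluster):

$ MOUNT/SYSTEM DSA1: /SHADOW=(MINNIM$DKA100:, JANDER$DKA100:) DATA1
$ SHOW DEVICE/FULL DSA1:    ! lists the shadow set members and their state

That way the loss of one node, or of its controller, should take out
only one member of the set.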