[Info-vax] Unpleasant Disk Shadowing Surprise

tadamsmar tadamsmar at yahoo.com
Tue Oct 11 14:53:12 EDT 2011


On Oct 11, 1:25 pm, "Robert A. Brooks" <r... at aitchpee.com> wrote:
> On 10/11/2011 10:53 AM, tadamsmar wrote:
>
> > After the incident I found there was a disk error on one member of the
> > shadow set.
> > According to the console log, more than 30 seconds after the watchdog
> > sounded, the shadow set changed state, the offending disk went
> > offline, a mount verification started and completed.
>
> > Immediately after the mount verification completed, VMS started
> > working again.
>
> > Looks like the disk system was inaccessible for about 3 minutes and
> > any process that tried to use it got halted somehow.
>
> > Is this to be expected?
>
> Yes, this is expected.  Once a shadowset goes into mount verification,
> all I/O is queued up; all members are treated as suspect until proven good.
>
> Shadowing and mount verification are more complicated than when
> a "scalar" device goes into mount verification; see SYSGEN params
> SHADOW_MBR_TMO and MVTIMEOUT for details.
>
> Note that a partially-failing disk can be a big problem, in that the
> device will whipsaw in and out of mount verification.  In more bizarre
> cases, we've seen I/O reads work, but writes fail.
> This is a big problem, because mount verification only reads the disk,
> so the device will quickly exit mount verification, only to reenter upon
> the retry of the failing write.
>
> Most I/O errors will trigger mount verification; a few, such as
> SS$_DRVERR, will not trigger verification, and will be immediately
> returned to the caller.  This is not happening in your case, however,
> and is relatively rare.
>
>                                 -- Rob

Here is what I find in the error log:

I had two errors on pka0: (Adaptec AIC-7899), at 7:58:14 and 7:58:21.

The disk error is logged at 7:58:22.

But the watchdog started sounding at least 30 seconds before any
errors were logged.  The system engineers were issuing commands to
shut down and reboot the application at least 15 seconds before the
first logged error.  I don't have an exact timestamp for the external
watchdog, but it must have gone off a while before anyone issued an
application shutdown command.  I heard the watchdog sound from my
office, but an engineer shut the audible alarm off so quickly that I
assumed it was just momentary slowness of the system, not a ~3 minute
event.
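
To make the timing explicit, here is a rough Python sketch of the
arithmetic.  The three timestamps are the ones from the error log
above; the variable names are made up for illustration and the date is
left out.  The two computed times are only the latest moments
consistent with the "at least 30 seconds" and "at least 15 seconds"
estimates, not measured values.

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"

# Timestamps quoted from the error log (date omitted).
first_pka0_error = datetime.strptime("7:58:14", FMT)
second_pka0_error = datetime.strptime("7:58:21", FMT)
disk_error = datetime.strptime("7:58:22", FMT)

# Estimates from the message, relative to the first logged error.
# These are upper bounds on when each event could have happened.
latest_watchdog = first_pka0_error - timedelta(seconds=30)
latest_shutdown_cmd = first_pka0_error - timedelta(seconds=15)

# Minimum gap between the watchdog sounding and the disk error entry.
min_gap = disk_error - latest_watchdog

print("watchdog sounded no later than ", latest_watchdog.strftime(FMT))
print("shutdown issued no later than  ", latest_shutdown_cmd.strftime(FMT))
print("watchdog led the disk error by at least", int(min_gap.total_seconds()), "s")
```

So whatever stalled the application began no later than about 7:57:44,
at least 38 seconds before the disk error entry at 7:58:22.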


