[Info-vax] Boot drive died on a shadowed system disk
abrsvc
dansabrservices at yahoo.com
Fri May 18 06:41:53 EDT 2012
On Thursday, May 17, 2012 4:20:04 PM UTC-4, Rich Jordan wrote:
> On May 17, 11:36 am, Rich Jordan <jor... at ccs4vms.com> wrote:
> > Based on the docs I think we're ok but it's the first time this has
> > happened so if anyone knows for certain please feel free to comment.
> >
> > DS10. OpenVMS V8.2, ECOs current. Two channel KZPEA SCSI controller,
> > two drives on each channel; drive DKA0 is the console selected boot
> > disk. System shadow disk DSA0 contains DKA0 and DKB0, the data unit
> > DSA1 has drives DKA100 and DKB100. Console AUTO_ACTION is RESTART.
> >
> > DKA0 failed out of the shadowset with hard errors (DKB0 failed out of
> > DSA1 shortly thereafter, but was able to rejoin manually). DKA0 will
> > not remount into DSA0 (got the following error):
> >
> > $ MOUNT/SYSTEM DSA0 /SHADOW=$1$DKA0: ALPHASYS /CONFIRM
> > %MOUNT-I-MOUNTED, ALPHASYS mounted on _DSA0:
> > %MOUNT-I-SHDWMEMFAIL, _$1$DKA0: (NODE) failed as a member of the
> > shadow set.
> > -SYSTEM-F-ABORT, abort
> > %MOUNT-I-ISAMBR, _$1$DKB0: (NODE) is a member of the shadow set
> >
> > No errors were logged against the DKA0 device from this mount attempt
> > but one bus error on PKA0 was. We're not certain yet which component
> > or components are at fault (a support call is being placed).
> >
> > I can mount DKA0 locally/writelocked and have run an analyze/disk on
> > it (with some cleanup indicated as needed).
> >
> > I suppose I could mount it /override=shadow then dismount and try to
> > have it rejoin the set but I don't think it's trustworthy, so I'm not
> > going to try.
> >
> > My question is this. In the event of a reboot before service can be
> > performed, what will happen? My expectation based on the shadow docs
> > is one of two outcomes, either of which is survivable.
> >
> > DKA0 is nonbootable: the system just fails at console level, and can
> > either have its console boot device changed to DKB0 or just manually
> > booted from DKB0.
> >
> > DKA0 is at least nominally bootable: the system starts to boot, sees
> > the shadow info (so long as I don't mount it /OVERRIDE=SHADOW!), looks
> > to DKB0 and sees the severe mismatch and that DKA0 was not a valid
> > member of the set. It then fails the boot with a SHADBOOTFAIL
> > bugcheck and someone onsite still has to manually boot from DKB0.
> >
> > I don't see a way for the system to actually come up on the outdated
> > DKA0: disk. Just boot failures if it goes down. Is this correct?
>
> Finally got the log to a WSEA-equipped box. Perhaps the KZPEA has
> failed since the log seems to call that out. I've not seen it before;
> could this still be the result of a failing disk or perhaps an
> overheated disk (if the cage fan has failed)?
>
> If more of the log output is needed I'll be happy to post it; this was
> just a snapshot showing the failure callout.
>
> Thanks for any insights.
>
> ====================
>
> emb_Device_Number 0
> emb_func 0
> emb_name_len 10
> emb_name FPO001$PKA
> emb_dtname_len 0
> emb_dtname
>
> KZPEA_2
> KZPEA_LW_CNT 90
> pka_erl_b_rev x0032 packet revision 2
> pka_sub_packet_class x1389 PCI-SCSI SubPacket
> pka_sub_packet_type x0002 OVMS SubPacket
> KZPEA_ErrCode x0402 Adapter Hardware Failure
> SubType[7:0] x2 Runtime Error
> Type[15:8] x4
> pka_pci_bus 0
> pka_pci_slot 15
> pka_vendor_id x9005
> pka_device_id x00C0 KZPEA Ultra 3 Dual Port
> pka_subsystem_vendor_id x9005
> pka_subsystem_id xF620
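
A note on reading the callout above: the log labels the error code fields as Type[15:8] and SubType[7:0], so x0402 splits into a type byte of x4 and a subtype byte of x2, matching the two lines printed under it. Below is a minimal Python sketch of that split. The function name is made up for illustration and is not part of OpenVMS or WSEA. The sketch assumes only the bit layout shown in the log and says nothing about what other code values mean.

```python
from typing import Tuple


def decode_errcode(code: int) -> Tuple[int, int]:
    """Split a 16-bit error code into (type, subtype).

    Assumes the layout printed in the log: Type in bits 15:8,
    SubType in bits 7:0. Hypothetical helper, for illustration only.
    """
    err_type = (code >> 8) & 0xFF   # upper byte, Type[15:8]
    subtype = code & 0xFF           # lower byte, SubType[7:0]
    return err_type, subtype


err_type, subtype = decode_errcode(0x0402)
print(f"Type[15:8] = x{err_type:X}, SubType[7:0] = x{subtype:X}")
# prints: Type[15:8] = x4, SubType[7:0] = x2
```

The text descriptions ("Adapter Hardware Failure", "Runtime Error") come from the analyzer output itself, so the sketch does not try to map the numbers to names.
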
I would concentrate on the disk drive itself rather than the controller. I had a case where all indications were that a tape drive was hanging the controller. It turned out that the problem was a defective disk cage slot. Eliminating the use of that slot "resolved" the hanging problem. Please note that all drives had been in use without problems for days before the hang occurred. In this case, I would remove the drive and replace it as the first "repair" attempt.
Dan