[Info-vax] Eisner? Down? (10 days later)
Main, Kerry
Kerry.Main at hp.com
Mon Jan 5 09:20:12 EST 2009
> -----Original Message-----
> From: info-vax-bounces at rbnsn.com [mailto:info-vax-bounces at rbnsn.com] On
> Behalf Of Richard B. Gilbert
> Sent: January 4, 2009 11:39 PM
> To: info-vax at rbnsn.com
> Subject: Re: [Info-vax] Eisner? Down? (10 days later)
>
> G Cornelius wrote:
> > DeCoy wrote:
> >> Thanks, George. The problem appears to be storage-related, perhaps with
> >> the RAID array on the Mylex controller, and perhaps with either
> >> controller hardware or controller configuration.
> >>
> >> Expertise in diagnosing (and perhaps fixing) Mylex controller symptoms
> >> would be initially useful.
> >
> > I won't be of much help - my experience is with the HSJ/HSZ/HSG
> > controller series.
> >
> > I did leave voice mail for Steve offering my services, but you folks
> > will probably do better diagnosing it remotely than me trying to get
> > involved. Let me know, though, if I can do something, even if it's
> > just getting him some spare parts.
> >
> > Coincidentally, the reason I am not using the DS20 that's in my garage
> > is that the Mylex (KZPBC?) controller failed when I was trying to
> > configure it and I have not yet sprung for a replacement or stuffed in
> > a non-raid SCSI card.
> >
> > I know of others around here who have used the Mylex controller and
> > have encountered some of its quirks. I seem to remember helping someone
> > on the research side of things restore a backup of what was at the time
> > a large (30GB) raid volume that was lost due to Mylex controller issues,
> > or perhaps due to not noticing that a raid disk had failed until a
> > second failure made recovery impossible.
> >
>
> It seems to me that it's a SYS$MANGLER's JOB to notice things like
> failing disks. I had a batch job called "MORNING_CHECK" that ran every
> day at 07:30. It compared the output of "SHOW ERROR" with the output
> from yesterday. It checked log files for errors ("-E-" and "-F-"), etc,
> etc. If it found something that looked like a problem I was notified by
> a text message to my pager. This gave me time to work on the problem
> before it turned into a crisis!
>
> A failed disk was not allowed to become a problem! I would swap it out
> with a spare and call DEC/Compaq/HP to pick up the dear departed and
> bring me a replacement drive.
>
> In fact, thanks to MORNING_CHECK, I usually found disks that were
> developing problems before the problems developed fully. One error was
> allowed but when a disk started logging multiple errors, I swapped it
> out with a spare and called for a replacement. The same guy who fetched
> replacements for field service would fetch me a new one and I gave him
> the dear departed!
> _______________________________________________
This does not seem to have been the issue here, but just an observation:

As Richard outlined, one of the big issues with RAID (HW or SW) is that
it works too well and staff can become too dependent on it. A drive in a
LUN may fail and the error gets logged, but if the error is missed or not
followed up on, chances are that a second failure on another drive a few
weeks or months later will take out the entire LUN.

Some RAID arrays have beeps and flashing lights to remind people of
errors, but in lights-out environments I have seen even these get missed.
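
For illustration, the kind of daily check Richard describes could be
sketched roughly as below. This is Python rather than DCL, and it is not
his actual MORNING_CHECK job. It assumes the SHOW ERROR output has already
been saved as simple "device count" lines (a simplified format), and the
names parse_error_counts, new_errors, suspicious_log_lines, morning_check
and the allowed_errors threshold are all made up for this sketch.

```python
import re

# VMS-style messages carry a severity letter, e.g. %BACKUP-E-OPENIN.
# Only -E- (error) and -F- (fatal) are of interest here.
SEVERITY_RE = re.compile(r"-[EF]-")


def parse_error_counts(text):
    """Parse 'device count' lines into a dict (assumed, simplified format)."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip headers and anything that is not exactly 'name number'.
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts


def new_errors(yesterday, today):
    """Return {device: increase} for devices whose error count went up."""
    return {
        dev: count - yesterday.get(dev, 0)
        for dev, count in today.items()
        if count > yesterday.get(dev, 0)
    }


def suspicious_log_lines(lines):
    """Return log lines containing an -E- or -F- severity marker."""
    return [ln for ln in lines if SEVERITY_RE.search(ln)]


def morning_check(yesterday_text, today_text, log_lines, allowed_errors=1):
    """Build a list of report lines; an empty list means nothing to page about."""
    yesterday = parse_error_counts(yesterday_text)
    today = parse_error_counts(today_text)
    report = []
    for dev, delta in sorted(new_errors(yesterday, today).items()):
        # One error is tolerated; more than that marks the disk for a swap.
        action = "swap candidate" if today[dev] > allowed_errors else "watch"
        report.append(f"{dev} +{delta} errors (total {today[dev]}): {action}")
    for ln in suspicious_log_lines(log_lines):
        report.append(f"log: {ln.strip()}")
    return report
```

The point of comparing against yesterday's snapshot is that only a change
triggers a notification, so a disk that has started logging errors gets
noticed while the LUN still has its redundancy. How the report reaches a
pager or mailbox is left out on purpose.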
Regards
Kerry Main
Senior Consultant
HP Services Canada
Voice: 613-254-8911
Fax: 613-591-4477
kerryDOTmainAThpDOTcom
(remove the DOT's and AT)
OpenVMS - the secure, multi-site OS that just works.