[Info-vax] Eisner? Down? (10 days later)

Mon Jan 5 09:20:12 EST 2009

> -----Original Message-----
> From: info-vax-bounces at rbnsn.com [mailto:info-vax-bounces at rbnsn.com] On
> Behalf Of Richard B. Gilbert
> Sent: January 4, 2009 11:39 PM
> To: info-vax at rbnsn.com
> Subject: Re: [Info-vax] Eisner? Down? (10 days later)
> 
> G Cornelius wrote:
> > DeCoy wrote:
> >> Thanks, George.  The problem appears to be storage-related, perhaps
> >> with the RAID array on the Mylex controller, and perhaps with either
> >> controller hardware or controller configuration.
> >>
> >> Expertise in diagnosing (and perhaps fixing) Mylex controller
> >> symptoms would be initially useful.
> >
> > I won't be of much help - my experience is with the HSJ/HSZ/HSG
> > controller series.
> >
> > I did leave voice mail for Steve offering my services, but you folks
> > will probably do better diagnosing it remotely than me trying to get
> > involved.  Let me know, though, if I can do something, even if it's
> > just getting him some spare parts.
> >
> > Coincidentally, the reason I am not using the DS20 that's in my
> > garage is that the Mylex (KZPBC?) controller failed when I was trying
> > to configure it and I have not yet sprung for a replacement or stuffed
> > in a non-raid SCSI card.
> >
> > I know of others around here who have used the Mylex controller and
> > have encountered some of its quirks.  I seem to remember helping
> > someone on the research side of things restore a backup of what was
> > at the time a large (30GB) raid volume that was lost due to Mylex
> > controller issues, or perhaps due to not noticing that a raid disk
> > had failed until a second failure made recovery impossible.
> >
> 
> It seems to me that it's a SYS$MANGLER's JOB to notice things like
> failing disks.  I had a batch job called "MORNING_CHECK" that ran every
> day at  07:30.  It compared the output of "SHOW ERROR" with the output
> from yesterday.  It checked log files for errors ("-E-" and "-F-"), etc,
> etc.  If it found something that looked like a problem I was notified
> by a text message to my pager.  This gave me time to work on the
> problem before it turned into a crisis!
> 
> A failed disk was not allowed to become a problem!  I would swap it out
> with a spare and call DEC/Compaq/HP to pick up the dear departed and
> bring me a replacement drive.
> 
> In fact, thanks to MORNING_CHECK, I usually found disks that were
> developing problems before the problems developed fully.  One error was
> allowed but when a disk started logging multiple errors, I swapped it
> out with a spare and called for a replacement.  The same guy who
> fetched replacements for field service would fetch me a new one and I
> gave him the dear departed!

This does not seem to have been the issue here, but just an observation:

As Richard outlined, one of the big issues with RAID (HW or SW) is that
it works too well and staff can become too dependent on it. Hence, a
drive in a LUN may fail and the error gets logged, but if that error is
missed and/or not followed up on, then chances are that a second failure
on another drive a few weeks/months later will take out the entire LUN.

Some RAID arrays have beeps and flashing lights to remind people of
errors, but in lights-out environments I have seen even these get missed.
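
For anyone wanting to automate the kind of daily check Richard describes,
here is a minimal sketch in Python (not DCL, and not Richard's actual
MORNING_CHECK). It assumes each data line of the saved "SHOW ERROR" output
is a device name ending in a colon followed by an error count; the
function names and the one-error threshold are illustrative assumptions.

```python
import re

# Assumed layout of a SHOW ERROR data line: "<device>:   <count>".
ERROR_LINE = re.compile(r"^\s*(\S+:)\s+(\d+)\s*$")
# Message severity codes for error (-E-) and fatal (-F-).
SEVERITY = re.compile(r"-[EF]-")


def parse_show_error(text):
    """Return {device: error_count} from saved SHOW ERROR output."""
    counts = {}
    for line in text.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def new_errors(yesterday, today, allowed=1):
    """Return {device: (old, new)} for devices whose count rose since
    yesterday and now exceeds the allowed total (hypothetical threshold)."""
    old = parse_show_error(yesterday)
    new = parse_show_error(today)
    flagged = {}
    for device, count in new.items():
        previous = old.get(device, 0)
        if count > previous and count > allowed:
            flagged[device] = (previous, count)
    return flagged


def suspicious_log_lines(text):
    """Return the log lines that carry an -E- or -F- severity code."""
    return [line for line in text.splitlines() if SEVERITY.search(line)]
```

Anything returned by new_errors() or suspicious_log_lines() would then be
sent to whatever notification channel is in use (a pager in Richard's
case), so a first drive failure gets followed up before a second one.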

Regards

Kerry Main
Senior Consultant
HP Services Canada
Voice: 613-254-8911
Fax: 613-591-4477
kerryDOTmainAThpDOTcom
(remove the DOT's and AT)

OpenVMS - the secure, multi-site OS that just works.