[Info-vax] Transient anal/disk errors

Thu Dec 12 10:51:16 EST 2013

On Thursday, December 12, 2013 10:07:28 AM UTC-5, Stephen Hoffman wrote:
> On 2013-12-12 14:39:04 +0000, tadamsmar said:
>
> > I started running ANAL/DISK a lot more lately on our systems:
> >
> > One of them gives transient warnings like this pretty often:
>
> You're looking at a live disk, with active changes, and not a static
> environment.  That'll inherently generate diagnostics on an ODS-2 or
> ODS-5 file system.
>
> If you really want to pursue this, write some tools to scan for severe
> and fatal errors, and mask the expected errors.
> 
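
A minimal sketch of such a scanning tool, in Python, assuming the
ANAL/DISK output has already been captured to text by some means.  It
relies only on the standard VMS message layout (%FACILITY-S-IDENT, text,
with continuation lines starting with "-").  The function name, the mask
list, and the sample EXAMPLE message are made up for illustration; the
set of "expected" idents is something each site would have to decide.

```python
import re

# Standard VMS message layout: %FACILITY-S-IDENT, text
# Continuation lines of the same message start with '-' instead of '%'.
MESSAGE_RE = re.compile(
    r"^(?P<lead>[%-])(?P<facility>[A-Z0-9_$]+)-(?P<severity>[SIWEF])-"
    r"(?P<ident>[A-Z0-9_$]+),\s*(?P<text>.*)$"
)

# Hypothetical mask list: idents this site has decided to ignore.
EXPECTED_IDENTS = frozenset({"OPENQUOTA"})


def scan(lines, expected=EXPECTED_IDENTS, report="EF"):
    """Return (facility, severity, ident, text) for unmasked messages
    whose severity letter is in `report` (E = error, F = fatal)."""
    kept = []
    primary_masked = False
    for raw in lines:
        m = MESSAGE_RE.match(raw.strip())
        if m is None:
            continue  # not a message line
        if m["lead"] == "%":
            # A new primary message decides whether its continuations are masked.
            primary_masked = m["ident"] in expected
        if primary_masked or m["ident"] in expected:
            continue
        if m["severity"] in report:
            kept.append((m["facility"], m["severity"], m["ident"], m["text"]))
    return kept


sample = [
    "%ANALDISK-I-OPENQUOTA, error opening QUOTA.SYS",
    "-SYSTEM-W-NOSUCHFILE, no such file",
    "%ANALDISK-E-EXAMPLE, made-up error for illustration",  # not a real ident
]
for facility, severity, ident, text in scan(sample):
    print(f"{facility}-{severity}-{ident}: {text}")
```

Passing report="WEF" would also surface warnings, which on a live disk
is probably where most of the transient chatter would show up.
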
> > If I run it again, the messages go away or change.  This is a shadowed
> > disk that is not logging errors.
>
> Those are the typical sorts of chatter that arise with an active disk.
>
> So have you finished working on your backup strategy, and have you
> recently tested a recovery-restart from that?    (This is vastly more
> important than analyzing your disks, as ANALYZE /DISK is reactive and
> as it doesn't spot impending failures, RAID doesn't protect against
> various common errors including volume corruptions larger than what it
> can handle, file deletions or any sorts of intentional theft or
> corruption or damage that might occur.  The BACKUPs allow for recovery.)
>
> There's also the infamous BACKUP /IGNORE=INTERLOCK command, which some
> folks think is an online BACKUP.  It's not.  Worse, it allows silent
> data corruptions in the output savesets.  If you have control over the
> applications involved, that's where the BACKUP support needs to reside,
> particularly if your applications are writing clumps of updates to
> disk.  Various relational databases on VMS include application-internal
> backup tools, and always use those in preference to using the OpenVMS
> BACKUP command.  Alternatively, quiesce the applications or the disks
> or the systems, and then use the standard BACKUP tools. Or quiesce the
> environment and yank a disk from the shadowset, and backup that.

You think I was recently working on my backup strategy?  I was just
working on those persistent ANAL/DISK problems.

But I probably do need to work on my backup strategy.  I have been
yanking a disk out of the shadowset without quiescing and then backing
up the yanked disk.  I have not done any deliberate recovery testing,
only de facto testing when I had to recover a file or compress a disk.
Yanking a disk is easy, since I only have to run command procedures,
but as you point out, the result might not be reliable.

What's the easiest way to quiesce and yank?  The only way I am sure of
is to shut down, boot from a CD, yank, then reboot normally.  I am not
sure that there is a console command that will yank a disk from a
shadowset, but I seem to recall one that will disable shadowing.

I have noticed that sometimes a yanked disk will not run ANAL/DISK
clean.  This also seems to be transient.

> That's a much smaller window of downtime.   Test the recovery process
> periodically.
>
> > The other 4 systems run ANAL/DISK clean, if they have transient
> > warnings at all then it must be at a much lower rate.
>
> I'd look to replace all of the disks in this configuration, just
> because most of them are probably as old as those boxes.  As good as
> the old DEC SCSI disks were, statistically, they're failure fodder
> given their likely relative ages.
>
> > All 5 systems are V7.3-2.
>
> Ancient.

We have had no budget for a support contract in many years.

> > The one with the transients is a DS10 466 MHz.
>
> Shut it down, boot from CD or a backup disk, and try again.  Quiesce
> the environment, in other words.
>
> > The others are DS10s 466 or 600 and one is an AS800.
>
> The arsenal of the ancient, eh?  One rx2660 would likely easily replace
> this whole configuration.

Heck, one DS10 600 could probably replace the whole thing.

There was this idea that running on 4 systems made a total failure less
likely, so we spread it out over 4 systems plus a development system
that could also act as a quickly configurable spare.  But this was kind
of pointless.  Someone once brought almost all of it down by yanking on
one thin wire, which was THE thinwire.  Now we have thickwire with one
switch, which revolutionized bringing down the system: it can be done
remotely without yanking a cable, or by one dead UPS battery, or by
unplugging the switch.  All or most of these have happened.

> Less power, less space, more capacity.
> Maybe two with a low-end FC SAN or shared SCSI connection for the boot
> and quorum disk, if you're in an uptime-critical environment.
>
> > They all are running in essentially the same operating system configuration.
>
> "Essentially" is a particularly loaded word when debugging stuff.  It's
> those "essential" differences that often play into differences in how
> bugs manifest themselves.
>
> > Maybe there is more application level activity on the one with the
> > transients, not sure.
>
> Re-read the above listing of transients.  There's your evidence.

Yes, that one system was arguably more active today while I was doing
my testing.

> > PS: They all give this informational, and always have:
> > %ANALDISK-I-OPENQUOTA, error opening QUOTA.SYS
> > -SYSTEM-W-NOSUCHFILE, no such file
> >
> > How does one get rid of the OPENQUOTA statement?
>
> Activate the disk quotas on the disk, rebuild, and set the limits past
> the capacity of the disk, and take a slight performance hit tracking
> the quotas.  Or do what everybody else does here, and ignore it.
>
> -- 
> Pure Personal Opinion | HoffmanLabs LLC