[Info-vax] Transient anal/disk errors

Thu Dec 12 10:07:28 EST 2013

On 2013-12-12 14:39:04 +0000, tadamsmar said:

> I started running ANAL/DISK a lot more lately on our systems:
> 
> One of them gives transient warnings like this pretty often:

You're looking at a live disk, with active changes, and not a static 
environment.  That'll inherently generate diagnostics on an ODS-2 or 
ODS-5 file system.

If you really want to pursue this, write some tools to scan for severe 
and fatal errors, and mask the expected errors.
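
As a rough sketch of that sort of tool, and not anything ANALYZE /DISK 
itself provides: the following Python assumes you've captured the 
output to a text file and moved it somewhere Python runs, and that the 
messages use the usual %FACILITY-S-IDENT, text layout with a severity 
letter of S, I, W, E or F.  The function name, the choice to keep only 
E and F, and the masking by message ident are all illustrative 
assumptions.

```python
import re

# Usual VMS message layout: %FACILITY-S-IDENT, text
# Linked (secondary) messages start with "-" instead of "%".
MESSAGE_RE = re.compile(r"^([%-])([A-Z0-9_$]+)-([SIWEF])-([A-Z0-9_$]+),\s*(.*)$")


def filter_messages(lines, keep=frozenset("EF"), masked=frozenset()):
    """Return groups of message lines worth a look.

    A group is a primary "%" message plus any "-" messages that
    directly follow it.  A group is kept when the primary severity is
    in `keep` and the primary ident is not in `masked`.
    """
    groups = []
    current = None  # group being collected, or None if it was dropped
    for raw in lines:
        line = raw.rstrip()
        match = MESSAGE_RE.match(line)
        if match is None:
            current = None  # not a message line; ends any open group
            continue
        prefix, _facility, severity, ident, _text = match.groups()
        if prefix == "%":
            current = None
            if severity in keep and ident not in masked:
                current = [line]
                groups.append(current)
        elif current is not None:
            current.append(line)  # linked message of a kept primary
    return groups
```

Feed it the lines of the captured log, and add to the masked set any 
idents you've decided are expected chatter on your live disks.  With 
the defaults, informationals and warnings drop out and only error and 
fatal messages are returned.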

> If I run it again, the messages go away or change.  This is a shadowed 
> disk that is not logging errors.

Those are the typical sorts of chatter that arise with an active disk.

So have you finished working on your backup strategy, and have you 
recently tested a recovery-restart from that?  This is vastly more 
important than analyzing your disks.  ANALYZE /DISK is reactive and 
doesn't spot impending failures, and RAID doesn't protect against 
various common errors, including volume corruptions larger than what it 
can handle, file deletions, or any sort of intentional theft, 
corruption or damage that might occur.  The BACKUPs allow for recovery.

There's also the infamous BACKUP /IGNORE=INTERLOCK command, which some 
folks think is an online BACKUP.  It's not.  Worse, it allows silent 
data corruptions in the output savesets.  If you have control over the 
applications involved, that's where the BACKUP support needs to reside, 
particularly if your applications are writing clumps of updates to 
disk.  Various relational databases on VMS include application-internal 
backup tools; always use those in preference to the OpenVMS BACKUP 
command.  Alternatively, quiesce the applications or the disks or the 
systems, and then use the standard BACKUP tools.  Or quiesce the 
environment, yank a disk from the shadowset, and back that up.  
That's a much smaller window of downtime.  Test the recovery process 
periodically.

> The other 4 systems run ANAL/DISK clean, if they have transient 
> warnings at all then it must be at a much lower rate.

I'd look to replace all of the disks in this configuration, just 
because most of them are probably as old as those boxes.  As good as 
the old DEC SCSI disks were, statistically, they're failure fodder 
given their likely relative ages.

> All 5 systems are V7.3-2.

Ancient.

> The one with the transients is a DS10  466mhz.

Shut it down, boot from CD or a backup disk, and try again.  Quiesce 
the environment, in other words.

> The others are DS10s 466 or 600 and one is an AS800.

The arsenal of the ancient, eh?  One rx2660 would likely easily replace 
this whole configuration.  Less power, less space, more capacity.  
Maybe two with a low-end FC SAN or shared SCSI connection for the boot 
and quorum disk, if you're in an uptime-critical environment.

> They all are running in essentially the same operating system configuration.

"Essentially" is a particularly loaded word when debugging stuff.  It's 
those "essential" differences that often play into differences in how 
bugs manifest themselves.

> Maybe there is more application level activity on the one with the 
> transients, not sure.

Re-read the above listing of transients.  There's your evidence.

> PS: They all give this informational, and always have:
> %ANALDISK-I-OPENQUOTA, error opening QUOTA.SYS
> -SYSTEM-W-NOSUCHFILE, no such file
> 
> How does one get rid of the OPENQUOTA statement?

Enable disk quotas on the disk, rebuild, set the limits past the 
capacity of the disk, and take a slight performance hit from tracking 
the quotas.  Or do what everybody else does here, and ignore it.

-- 
Pure Personal Opinion | HoffmanLabs LLC