[Info-vax] Transient anal/disk errors
Stephen Hoffman
seaohveh at hoffmanlabs.invalid
Thu Dec 12 10:07:28 EST 2013
On 2013-12-12 14:39:04 +0000, tadamsmar said:
> I started running ANAL/DISK a lot more lately on our systems:
>
> One of them gives transient warnings like this pretty often:
You're looking at a live disk, with active changes, and not a static
environment. That'll inherently generate diagnostics on an ODS-2 or
ODS-5 file system.
If you really want to pursue this, write some tools to scan for severe
and fatal errors, and mask the expected errors.
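A rough sketch of such a tool, in Python, assuming the ANALYZE /DISK
output has already been captured to text by whatever means you prefer.
It relies only on the standard OpenVMS message layout
(%FACILITY-L-IDENT, text, with L one of S, I, W, E, F). The names
scan, REPORT_SEVERITIES and EXPECTED_IDENTS are made up for this
illustration, and the mask list is a placeholder to tune per site.

```python
import re

# OpenVMS messages look like "%FACILITY-L-IDENT, text". Linked secondary
# messages start with "-" instead of "%". L is the severity letter.
MESSAGE = re.compile(
    r"^[%-](?P<facility>[A-Z0-9_$]+)"
    r"-(?P<severity>[SIWEF])"
    r"-(?P<ident>[A-Z0-9_$]+),\s*(?P<text>.*)$"
)

# Severities worth reporting: E (error) and F (fatal, severe error).
REPORT_SEVERITIES = {"E", "F"}

# Assumption: message idents this site has decided to ignore.
# OPENQUOTA is the one example taken from this thread.
EXPECTED_IDENTS = {"OPENQUOTA"}


def scan(lines, expected=EXPECTED_IDENTS, report=REPORT_SEVERITIES):
    """Return (facility, severity, ident, text) for each unmasked E/F message.

    Each line is judged on its own; linked "-" lines are not tied
    back to the primary "%" message that precedes them.
    """
    hits = []
    for line in lines:
        m = MESSAGE.match(line.strip())
        if m is None:
            continue  # not a message line (blank, banner, file name, ...)
        if m["ident"] in expected:
            continue  # masked as expected chatter
        if m["severity"] in report:
            hits.append((m["facility"], m["severity"], m["ident"], m["text"]))
    return hits


if __name__ == "__main__":
    sample = [
        "%ANALDISK-I-OPENQUOTA, error opening QUOTA.SYS",
        "-SYSTEM-W-NOSUCHFILE, no such file",
        # Hypothetical line, invented for this example only:
        "%EXAMPLE-E-MADEUP, hypothetical error line",
    ]
    for facility, severity, ident, text in scan(sample):
        print(f"{facility}-{severity}-{ident}: {text}")
```

Run it against output from several passes and only chase what shows
up repeatedly; a one-off on a live disk is probably still chatter.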
> If I run it again, the messages go away or change. This is a shadowed
> disk that is not logging errors.
Those are the typical sorts of chatter that arise with an active disk.
So have you finished working on your backup strategy, and have you
recently tested a recovery-restart from that? This is vastly more
important than analyzing your disks: ANALYZE /DISK is reactive and
doesn't spot impending failures, and RAID doesn't protect against
various common errors, including volume corruptions larger than what it
can handle, file deletions, or any sort of intentional theft,
corruption or damage that might occur. The BACKUPs allow for recovery.
There's also the infamous BACKUP /IGNORE=INTERLOCK command, which some
folks think is an online BACKUP. It's not. Worse, it allows silent
data corruptions in the output savesets. If you have control over the
applications involved, that's where the BACKUP support needs to reside,
particularly if your applications are writing clumps of updates to
disk. Various relational databases on VMS include application-internal
backup tools; always use those in preference to the OpenVMS BACKUP
command. Alternatively, quiesce the applications or the disks
or the systems, and then use the standard BACKUP tools. Or quiesce the
environment, yank a disk from the shadowset, and back that up.
That's a much smaller window of downtime. Test the recovery process
periodically.
> The other 4 systems run ANAL/DISK clean, if they have transient
> warnings at all then it must be at a much lower rate.
I'd look to replace all of the disks in this configuration, just
because most of them are probably as old as those boxes. As good as
the old DEC SCSI disks were, statistically, they're failure fodder
given their likely relative ages.
> All 5 systems are V7.3-2.
Ancient.
> The one with the transients is a DS10 466mhz.
Shut it down, boot from CD or a backup disk, and try again. Quiesce
the environment, in other words.
> The others are DS10s 466 or 600 and one is an AS800.
The arsenal of the ancient, eh? One rx2660 would likely easily replace
this whole configuration. Less power, less space, more capacity.
Maybe two with a low-end FC SAN or shared SCSI connection for the boot
and quorum disk, if you're in an uptime-critical environment.
> They all are running in essentially the same operating system configuration.
"Essentially" is a particularly loaded word when debugging stuff. It's
those "essential" differences that often play into differences in how
bugs manifest themselves.
> Maybe there is more application level activity on the one with the
> transients, not sure.
Re-read the above listing of transients. There's your evidence.
> PS: They all give this informational, and always have:
> %ANALDISK-I-OPENQUOTA, error opening QUOTA.SYS
> -SYSTEM-W-NOSUCHFILE, no such file
>
> How does one get rid of the OPENQUOTA statement?
Activate disk quotas on the disk, rebuild, set the limits past the
capacity of the disk, and accept a slight performance hit from tracking
the quotas. Or do what everybody else does here, and ignore it.
--
Pure Personal Opinion | HoffmanLabs LLC