[Info-vax] RX2800 sporadic disk I/O slowdowns

Richard Jordan usenet at cropcircledogs.com
Fri Oct 18 14:26:53 EDT 2024


RX2800 i4 server, 64GB RAM, 4 processors, P410i controller with ten 2TB 
disks in a RAID 6 array, broken down into volumes.

We periodically (sometimes steadily once a week, sometimes more 
frequently) see one overnight batch job take much longer than normal to 
run.  The normal runtime of about 30-35 minutes stretches to 4.5 - 6.5 
hours.  Several images called by that job all run much slower than 
normal.  At the end, the overall CPU and I/O counts are very close 
between a normal run and a long one.

The data files are very large indexed files.  Records are read and 
updated but not added in this job; output is just tabulated reports.

We've run MONITOR for all classes and for the disks, and also built 
polling snapshot jobs that check for locked/busy files and other active 
batch jobs, and that automatically check through the system analyzer 
(SDA) for any other processes accessing the busy files at the same time 
as the problem batch.  Two data files show long busy periods, but we do 
not see any other process with channels to those files at the same 
time, except for backup (see next).
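The snapshot job is essentially this sort of polling loop (a simplified 
sketch; the device, file, and PID below are placeholders, and the SDA 
check is just ANALYZE/SYSTEM with SET PROCESS/ID and SHOW 
PROCESS/CHANNEL):

$! SNAPSHOT.COM - crude polling loop (placeholder names throughout)
$ BATCH_PID = "2040011A"                  ! PID of the problem batch job
$ LOOP:
$   WRITE SYS$OUTPUT "---- ", F$TIME(), " ----"
$   SHOW DEVICE/FILES DKA100:             ! who has channels to files on the disk
$   DIRIO = F$GETJPI(BATCH_PID, "DIRIO")  ! cumulative direct I/O count so far
$   WRITE SYS$OUTPUT "Batch DIRIO = ", DIRIO
$   WAIT 00:05:00                         ! repeat every 5 minutes
$   GOTO LOOP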

The backups start at the same time, but they do not get to the data 
disks until well after the problem job normally completes; that does 
cause concurrent access to the problem files, but it occurs only after 
the job has already run long, so it is not the cause.  Overall backup 
time is about the same regardless of how long the problem batch takes.

MONITOR during a long run shows average and peak I/O rates to the disks 
with the busy files at about half of what they are for normal runs.  We 
can see that in the process snapshots too; the direct I/O count on a 
slow run increases much more slowly than on a normal run, but both 
normal and long runs end up with close to the same CPU time and total 
I/Os.
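The monitor captures themselves are nothing exotic; roughly the 
following, with the interval and record file name just examples:

$ MONITOR DISK/ITEM=ALL/INTERVAL=60/RECORD=MON_DISK.DAT
$ MONITOR PROCESSES/TOPDIO/INTERVAL=60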

Other jobs in MONITOR are somewhat slowed down, but nowhere near as 
much (and they do far less disk access).

Before anyone asks, the indexed files could probably use a 
cleanup/rebuild, but if that's the cause, would we see periodic 
performance issues?  I would expect them to be constant.

There is a backup server available, so I'm going to restore backups of 
the two problem files to it and do rebuilds there to see how long they 
take; that will determine how and when we can do it on the production 
server.
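The plan is just the standard analyze/optimize/convert sequence, 
something like the following (file names are placeholders):

$ ANALYZE/RMS_FILE/FDL/OUTPUT=BIGFILE.FDL BIGFILE.IDX    ! capture current structure
$ EDIT/FDL/ANALYSIS=BIGFILE.FDL/NOINTERACTIVE BIGFILE.FDL ! run the optimize script
$ CONVERT/FDL=BIGFILE.FDL/STATISTICS BIGFILE.IDX BIGFILE_NEW.IDX

Timing that on the restored copies should tell us what kind of window 
we'd need on production.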



So something is apparently causing the job to be I/O constrained, but 
so far we can't find it.  The concurrent processes are the same, and 
other jobs don't appear to be slowed down much (but they may be much 
less I/O sensitive, or may use data on other disks; I've thrown that 
question to the devs).

Is there anything in the background below VMS that could cause this? 
The controller doing drive checks or other maintenance activities?

Thanks for any ideas.


