[Info-vax] What choices are available to the OpenVMS Process Scheduler once all the CPUs are in use?

David Froble davef at tsoft-inc.com
Sun Feb 28 23:28:02 EST 2016


sean at obanion.us wrote:
> Sorry I haven't been back; we've been busy, as you'll see, and I'm not allowed to post here from work.
> 
> On Monday Feb 15, since it was a holiday, I went in to run some more PAWZ reports on this issue, and to fix a failed backup. (We break a member out of the system disk shadow set and back it up to tape, and to a saveset on disk; the backup had failed because that output disk had run out of space. I hadn't purged it for a while...) When the backup had completed, I went to mount the system member ($1$DGA2501:) back into the existing shadow set for a full copy with the other member (going from one member to two), and got an error that a volume with that label was already mounted.
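> 
> For reference, that break-out/remount sequence is roughly the following; the tape device and volume label here are placeholders, and the exact qualifiers may differ by configuration:
> 
> $ DISMOUNT $1$DGA2501:                      ! drop the member out of DSA51:
> $ MOUNT/NOWRITE/OVERRIDE=(IDENTIFICATION,SHADOW_MEMBERSHIP) $1$DGA2501:
> $ BACKUP/IMAGE $1$DGA2501: MKA500:SYSBCK.BCK/SAVE_SET  ! to tape; same again to a disk saveset
> $ DISMOUNT $1$DGA2501:
> $ MOUNT/SYSTEM DSA51:/SHADOW=($1$DGA2501:) SYSDSK      ! remount; full shadow copy starts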
> 
> Looking at the console, I saw that a few minutes earlier:
> 
> 
> SHADOW-I-VOLPROC, $1$DGA1501:  is an incorrect shadow set member volume.
> SHADOW-I-VOLPROC, $1$DGA1501:  has been removed from shadow set.
> SHADOW-I-VOLPROC, DSA51: has aborted volume processing.
> SHADOW-I-VOLPROC, DSA51: contains zero working members.
> 
> ERRFMT BUGCHECK - DELETING ERRFMT PROCESS
> WRITING PROCESS DUMP TO SYS$ERRORLOG:ERRFMT.DMP
> TO RESTART ERRFMT PROCESS, USE "@SYS$SYSTEM:STARTUP ERRFMT"
>  
> 
> At this point, it was fairly certain that a CTRL-P crash was in the near future.
> I tried to put $1$DGA1501: back in, but got the same error about a volume with the same label already being mounted.
> 
> We had seen bad fragmentation, along with corruption reported by Analyze/Disk, on other similarly configured systems, so we applied the fix we had developed there:
> We crashed it (and found that the DOSD paths were old, so the dump failed), booted off the original disk, $1$DGA1501:, into the expected shadow set, DSA51, added two more members, and waited for the shadow copies to complete.
> Then we broke out the two added members, mounted one privately, /over=(id,shadow), and ran Analyze/Disk/Repair, which, among other things, put about 16K lost files, in the form of ".;" spooled device files, into [SYSLOST], where we deleted them.
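> 
> That private mount and repair were roughly as follows (the member device name here is illustrative):
> 
> $ MOUNT/OVERRIDE=(IDENTIFICATION,SHADOW_MEMBERSHIP) $1$DGA1502:
> $ ANALYZE/DISK_STRUCTURE/REPAIR $1$DGA1502:   ! recovered lost files land in [SYSLOST]
> 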
> We then did a Dir/Size/Grand for all .EXE files, as one prior attempt on a different system had ended up with missing executables (SETHOW.EXE), but got the same total as the boot disk.
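> 
> That check is just the following, run against both the repaired member and the boot disk, comparing the grand totals:
> 
> $ DIRECTORY/SIZE=ALL/GRAND_TOTAL $1$DGA1502:[*...]*.EXE
> 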
> The other former shadow member, which we had kept in case the first Repair needed a source to copy missing files from, was then used as the target of an image backup restore, giving us a repaired and defragmented disk to boot from.
> We then rebooted from this third disk, later adding the other two members back into the shadow set.
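> 
> The restore step was roughly this (the third disk's name is illustrative); an image backup lays the files back down contiguously, which is what defragmented the disk:
> 
> $ MOUNT/FOREIGN $1$DGA1503:               ! restore target must be mounted foreign
> $ BACKUP/IMAGE $1$DGA1502: $1$DGA1503: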
> 
> Mission Critical provided their usual outstanding support, walking us through all of that and through some obscure UEFI boot path errors that we would have avoided if we had remembered to run BOOT_OPTIONS after the last SAN and system changes; that would also have gotten us a dump.
> 
> We were down for about 7 hours, the longest in more than 20 years here, but the Cache databases were undamaged.
> Some out-of-order processing went on through the next day.
> 
> Mission Critical found a study that Rob Eulenstein had done a while ago that reproduced this sequence, the shadow server reducing the membership to zero, when a second standalone system mounted the member of a single-member set. We had storage support check our P9500s and SAN logs, finding nothing, and found no other system or storage problems in our environment that day.
> 
> 
> At this point, we are guessing that the fragmentation and corruption were so bad that the shadow server removed the member, and we suspect that even if there had been more than one member at that moment, it still would have happened.
> 
> There are a variety of proposals being floated, including moving the Cache TEMP and Write Image Journal (WIJ) files off the system disk onto their own shadow sets, since throughput has improved since this cleanup.
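> 
> Each would get its own small shadow set, mounted something like this (the device names and label are placeholders):
> 
> $ MOUNT/SYSTEM DSA52:/SHADOW=($1$DGA2601:,$1$DGA2602:) CACHEWIJ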
> 
> Relatedly, the PAWZ programmer has found a bug that caused the CPU CUR queue to be reported incorrectly, so a fix for that is coming.
> We expect to have some appropriate T4 data collection in place for this coming Tuesday, our next month end.
> 
> 
> Sean

DEC/Compaq/HP does have a defragmentation tool ....

Not saying others will agree, but what I do is: when a backup completes 
successfully, the command file then purges the old save sets. I keep 3; 
others might use a different number.
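
A minimal sketch of that, assuming a simple image backup (the device, the 
BCK$DISK logical, and the saveset name are just placeholders):

$ BACKUP/IMAGE/LOG DKA100: BCK$DISK:[SAVESETS]DATA.BCK/SAVE_SET
$ IF $STATUS THEN PURGE/KEEP=3 BCK$DISK:[SAVESETS]DATA.BCK

Each run creates a new version of the saveset file, and the purge runs only 
after a successful backup, keeping the three newest versions.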

I consider it important to know that the latest backup completed successfully 
before I'll delete any old versions.

Keeping a system disk for only VMS, and possibly some static rarely used data, 
is a good idea.


