[Info-vax] Are queue manager updates written to disk immediately ?

Simon Clubley clubley at remove_me.eisner.decus.org-Earth.UFP
Fri Apr 12 10:03:58 EDT 2013


On 2013-04-12, Stephen Hoffman <seaohveh at hoffmanlabs.invalid> wrote:
> On 2013-04-11 15:27:06 +0000, Simon Clubley said:
>
> If this queue manager misbehavior is a sufficient issue for you, 
> consider getting yourself a Less-Interruptible Power Supply (LIPS, as 
> I've never met a truly uninterruptible power supply) for the system.  

Thanks for the feedback, Hoff.

The problem with that is that it feels like a hardware workaround for a
software bug.

> And as others have mentioned, add some checks against a job that really 
> can't run twice.

The problem with ad-hoc checks is, just as you mention, that they _are_ ad-hoc.
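
(For the job in question, the kind of check I'd have to bolt on would be
something along these lines -- a minimal sketch only, with a made-up
marker file name; the real check would have to be tailored per job:

$! Refuse to run if this job has already completed today.  The marker
$! file name below is hypothetical; adjust per job.
$ today = f$cvtime(,"COMPARISON","DATE") - "-" - "-"
$ marker = "DISK$DATA:[JOBS]NIGHTLY_" + today + ".DONE"
$ if f$search(marker) .nes. ""
$ then
$   write sys$output "Marker ''marker' exists - refusing to run twice"
$   exit
$ endif
$!
$! ... real work of the job goes here ...
$!
$ open/write done 'marker'
$ close done
$ exit

which is exactly the kind of per-job scaffolding I'd rather not have to
maintain by hand across every job.)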

The normal application-level production jobs (this was not one of them)
are part of a site-specific scheduler, which means that when they run is
under that scheduler's control (job-specific .com files are created and
submitted by the scheduler as required).

This design also means there are no holding jobs waiting to be released
manually by mistake when they should not be; the scheduler in use was
designed that way on purpose to stop just this problem of a job being
run when it should not be. What it will not currently protect against,
however, is VMS itself running the same submitted job twice.

In case it's not obvious by now :-), I tend to be rather paranoid when
it comes to data integrity and security, and even I did not think about
the possibility of VMS itself doing something like this (if indeed that
turns out to be the case).

>  (I've seen a few of these cases in clusters, when the 
> cluster time was skewed among hosts.  Your "tomorrow+08:20" should have 
> avoided problems from the usual minor skews, unless the time in the 
> cluster -- on the host that was running the queue manager, which is not 
> necessarily the host that was running the batch job -- was very skewed.) 
>  I've ended up with a batch scheduler for these and related tasks.
>

It's a standalone system; no cluster involved.

All hardware is official HP-supported hardware; no unsupported third-party
equipment for either the controller or the disks.

Everything is configured as write-through; no deferred writes involved.

BTW, it also occurred to me after my last batch of responses that if
a window exists during job rundown when the queue manager thinks the
job is still active even though the log file is complete, then the job
should have been marked with the "system failed during execution"
status you would normally get in that situation upon system restart.
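
(One thing I may do going forward is set the queues to retain finished
entries, so that whatever final status the queue manager records is
still visible after a restart; a rough sketch, with a hypothetical
queue name:

$! Keep completed and failed entries on the queue for later inspection.
$! SYS$NIGHTLY_BATCH is a made-up queue name.
$ set queue sys$nightly_batch /retain=all
$!
$! After a restart, the retained entries and their completion status
$! should be visible with:
$ show entry /full
)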

That makes me think nothing about the job actually starting was written
to the queue manager database on disk even though a full logfile was
written to those same disks. (The logfile was on a different disk, but
that disk was attached to the same controller.)
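
(For anyone wanting to check the same thing on their own system, the
location of the queue database -- and hence the disk it lives on -- can
be displayed with something like:

$! Display the queue manager state and the directory holding the
$! queue database files; exact output varies by VMS version.
$ show queue /managers /full
)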

I've now logged the issue with HP and they are currently looking at it.

Thanks everyone,

Simon.

-- 
Simon Clubley, clubley at remove_me.eisner.decus.org-Earth.UFP
Microsoft: Bringing you 1980s technology to a 21st century world


