[Info-vax] OpenVMS servers and clusters as a cloud service

Tue Jan 9 11:28:12 EST 2018

On 2018-01-09 04:13:58 +0000, Grant Taylor said:

> On 01/08/2018 03:40 AM, Jan-Erik Soderholm wrote:
>> Then you have a very badly designed batch job. Either split it up into 
>> multiple jobs or make sure in some other way that the job is 
>> restartable. Such a job should of course be able to pick up where it 
>> crashed.
> 
> I think you hit the nail on the head.
> 
> I can't tell you how many different things I've seen that are /just/ 
> good enough that they work (most of the time) when everything they 
> depend on is working properly.  If a gnat farts in the room, things 
> fall over.  Then someone has to pick the pieces up.
> 
> I really wish that this was not the case.  But this is what I've come 
> to expect.  That way I'm pleasantly surprised when something is as you 
> are describing it should be.

I'd like to see the ability to restart and recover far more widely 
used.  But what's out there now in OpenVMS is clearly not getting used 
widely, whether because it's not viewed as useful, not capable enough, 
or whatever.  Which means either developers and project managers have 
to change their whole approaches here (and sure, like that's going to 
happen), or the provided APIs and approaches have to be made easier, 
more transparent, and clearer to the developers.  And made easier and 
more visible and valuable to end-users, for that matter.

Yeah, even OpenVMS development itself got hung up on "uptime", and 
(neglected? forgot? ignored?) what's involved with that, and what 
happens when the uptime counter inevitably resets.  That means cluster 
rolling reboots and the need to apply patches, app failures and 
recovery, and app patches and testing and recovery, all of which also 
tie into crash logging, telemetry, patch deployments, online backups, 
integration with IP networking for failover and load-balancing, and 
other related topics.  The same goes for making these goals easier for 
developers to adopt and to achieve, and for making it easier to keep 
OpenVMS and third-party apps current and available.

Not everybody has the time or the skills or the management support to 
slog through the effort involved with making a process capable of 
supporting checkpoint-restart or other sorts of recovery, 
unfortunately.  The APIs have to be made easier and more transparent 
and more capable, and the documentation far better.  Because what 
OpenVMS has right now clearly isn't getting used by most developers 
and designers; it's not widely-known or not easily-useful or not 
expeditious or not valuable enough, or whatever other excuse folks 
will use here.  Even the folks that do know and do use and do develop 
for OpenVMS and have implemented restart and recovery don't 
ubiquitously implement that recovery and restart logic for all of 
their long-running processes.  Ponder why that is.
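
For anybody that hasn't had to hand-roll this, here's roughly what even 
the simplest checkpoint-restart involves.  This is a minimal sketch in 
Python and not anything OpenVMS-specific; the names (run_job, 
load_checkpoint, save_checkpoint, the "next_item" field) and the JSON 
checkpoint file are all assumptions made up for illustration.  It 
records the index of the next unit of work after each completed unit, 
and a restarted job resumes from that index.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next item to process (0 if no checkpoint)."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)["next_item"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_item):
    """Write the checkpoint to a temp file, then rename it into place,
    so a crash mid-write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"next_item": next_item}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def run_job(items, process, checkpoint_path):
    """Process items in order, resuming after the last completed item."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(items)):
        process(items[i])  # may raise; the checkpoint still points at item i
        save_checkpoint(checkpoint_path, i + 1)
    # The whole job finished, so the next submission starts from scratch.
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)
```

Resubmitting the job with the same checkpoint path picks up at the 
first item that did not complete.  Note that process() has to be safe 
to repeat, because a crash after the work and before the checkpoint 
write re-runs that one item.  And that's before dealing with multiple 
nodes, shared state, or partial output, which is where the slog 
really starts.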

Now do I think the related restart and recovery work is anywhere near 
the top of VSI's priority list?  No.  The x86-64 port and a pile of 
other code are more important.  But distributed job and process 
control, which includes batch job restart and recovery, is an area 
where OpenVMS has not seen particular enhancements in ~thirty years, 
where it's often not viewed as reasonable to implement for whatever 
reason, and where what's out there is far too dependent on hand-rolled 
code and batch and server queues, and on DQS where that's been 
acquired, and on piles and piles of arcane glue code.  It's gotta be 
easier and simpler to adopt restart and recovery capabilities, if 
it's going to be of interest to new sites, and even to existing sites 
that might want to upgrade their own long-running processing.  And 
it's all also tied in with rolling reboots, online backups, app and 
system crash monitoring, and patch management.  As well as with 
testing and rollbacks.

-- 
Pure Personal Opinion | HoffmanLabs LLC