[Info-vax] OpenVMS servers and clusters as a cloud service
Stephen Hoffman
seaohveh at hoffmanlabs.invalid
Tue Jan 9 11:28:12 EST 2018
On 2018-01-09 04:13:58 +0000, Grant Taylor said:
> On 01/08/2018 03:40 AM, Jan-Erik Soderholm wrote:
>> Then you have a very badly designed batch job. Either split it up into
>> multiple jobs or make sure in some other way that the job is
>> restartable. Such a job should of course be able to pick up where it
>> crashed.
>
> I think you hit the nail on the head.
>
> I can't tell you how many different things I've seen that are /just/
> good enough that they work (most of the time) when everything they
> depend on is working properly. If a gnat farts in the room, things
> fall over. Then someone has to pick the pieces up.
>
> I really wish that this was not the case. But this is what I've come
> to expect. That way I'm pleasantly surprised when something is as you
> are describing it should be.
I'd like to see the ability to restart and recover far more widely
used. But what's out there now in OpenVMS clearly isn't getting used
widely, whether because it's not viewed as useful, not capable, or
whatever. Which means either developers and project managers have to
change their whole approaches here (and sure, like that's going to
happen), or the provided APIs and approaches have to be made easier,
more transparent, and clearer to the developers. And made easier and
more visible and valuable to end-users, for that matter.
Yeah, even OpenVMS development itself got hung up on "uptime", and
(neglected? forgot? ignored?) what's involved with that, and what
happens when the uptime counter inevitably resets. That means cluster
rolling reboots and the need to apply patches, app failures and
recovery, and app patches and testing and recovery, all of which also
ties into crash logging, telemetry, patch deployments, online backups,
integration with IP networking for failover and load-balancing, and
other related topics. And it means making these goals easier for
developers to adopt and to achieve, and making it easier to keep
OpenVMS and third-party apps current and available.
Not everybody has the time or the skills or the management support to
slog through the associated effort involved with making a process
capable of supporting checkpoint-restart or other sorts of recovery,
unfortunately. The APIs have to be made easier and more transparent
and more capable, and the documentation far better. Because what
OpenVMS has right now clearly isn't getting used by most developers
and designers, whether because it's not widely known, not easily
useful, not expeditious, not valuable enough, or whatever other excuse
folks will use here. Even the folks that do know and do use and
do develop for OpenVMS and have implemented restart and recovery don't
ubiquitously implement that recovery and restart logic for all of the
long-running processes. Ponder why that is.
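
To make "pick up where it crashed" concrete, here is a minimal sketch of a restartable job. It is written in Python and is not an OpenVMS API or any existing OpenVMS mechanism; the names run_job, load_checkpoint, save_checkpoint, and the JSON checkpoint file format are all assumptions invented for illustration. The idea is only that the job records how far it got after each unit of work, and consults that record on startup.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next unprocessed item, or 0 if no checkpoint exists."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write the checkpoint atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"next_index": next_index}, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step, so a crash leaves either the
    # old checkpoint or the new one, never a half-written file.
    os.replace(tmp_path, path)


def run_job(items, process, checkpoint_path):
    """Process items in order, resuming after the last completed item."""
    start = load_checkpoint(checkpoint_path)
    for index in range(start, len(items)):
        process(items[index])
        save_checkpoint(checkpoint_path, index + 1)
    # Finished cleanly: remove the checkpoint so the next run starts fresh.
    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)
```

One caveat with this sketch: if the job dies after process() returns but before the checkpoint is saved, that one item is processed again on restart, so process() has to tolerate being repeated. Getting details like that right in every long-running job is exactly the hand-rolled effort described above.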
Now do I think the related restart and recovery work is anywhere near
the top of VSI's priority list? No. The x86-64 port and a pile of
other code is more important. But distributed job and process
control (which includes batch job restart and recovery) is an area
where OpenVMS has not seen particular enhancements in ~thirty years,
where restart is often not viewed as reasonable to implement for
whatever reason, and where what's out there is far too dependent on
hand-rolled code and batch and server queues, and on DQS where that's
been acquired, and on piles and piles of arcane glue code. It's gotta
be easier and
simpler to adopt restart and recovery capabilities, if it's going to be
of interest to new sites, and even to existing sites that might want to
upgrade their own long-running processing. And it's all also tied in
with rolling reboots, online backups, app and system crash monitoring,
and patch management. As well as with testing and rollbacks.
--
Pure Personal Opinion | HoffmanLabs LLC