[Info-vax] OpenVMS servers and clusters as a cloud service

Tue Jan 9 12:13:48 EST 2018

Stephen Hoffman wrote:
> On 2018-01-09 04:13:58 +0000, Grant Taylor said:
> 
>> On 01/08/2018 03:40 AM, Jan-Erik Soderholm wrote:
>>> Then you have a very badly designed batch job. Either split it up 
>>> into multiple jobs or make sure in some other way that the job is 
>>> restartable. Such a job should of course be able to pick up where it 
>>> crashed.
>>
>> I think you hit the nail on the head.
>>
>> I can't tell you how many different things I've seen that are /just/ 
>> good enough that they work (most of the time) when everything they 
>> depend on is working properly.  If a gnat farts in the room, things 
>> fall over.  Then someone has to pick the pieces up.
>>
>> I really wish that this was not the case.  But this is what I've come 
>> to expect.  That way I'm pleasantly surprised when something is as you 
>> are describing it should be.
> 
> I'd like to see the ability to restart and recover far more widely 
> used.  But what's out there now in OpenVMS?  That's clearly not getting 
> used widely, not viewed as useful, not capable, or whatever.  Which 
> means either developers and project managers have to change their whole 
> approaches here — and sure, like that's going to happen — or the 
> provided APIs and approaches have to be made easier and more transparent 
> and clearer to the developers.  And made easier and more visible and 
> valuable to end-users, for that matter.
> 
> Yeah, even OpenVMS development itself got hung up on "uptime", and 
> (neglected? forgot? ignored?) what's involved with that, and when the 
> uptime counter inevitably resets.  About cluster rolling reboots and the 
> need to apply patches, and about app failures and recovery, and about 
> app patches and testing and recovery, and all of which also ties into 
> crash logging, telemetry, patch deployments, online backups, and into 
> integration with IP networking for failover and load-balancing, and 
> other related topics.   At making these goals easier for developers to 
> adopt and to achieve, and easier to keep OpenVMS and third-party apps 
> current, and available.
> 
> Not everybody has the time or the skills or the management support to 
> slog through the associated effort involved with making a process 
> capable of supporting checkpoint-restart or other sorts of recovery, 
> unfortunately.   The APIs have to be made easier and more transparent 
> and more capable, and the documentation far better.   Because what 
> OpenVMS has right now clearly isn't getting used — or it's not 
> widely-known or not easily-useful or not expeditious or not valuable 
> enough or whatever other excuse folks will use here — to most developers 
> and designers.   Even the folks that do know and do use and do develop 
> for OpenVMS and have implemented restart and recovery don't ubiquitously 
> implement that recovery and restart logic for all of the long-running 
> processes.  Ponder why that is.
> 
> Now do I think the related restart and recovery work is anywhere near 
> the top of VSI's priority list?  No.  The x86-64 port and a pile of 
> other code is more important.    But distributed job and process control 
> — which includes batch job restart and recovery — is an area that 
> OpenVMS has not seen particular enhancements in ~thirty years, where 
> it's often not viewed as reasonable to implement for whatever reason, 
> and what's out there is far too dependent on hand-rolled code and batch 
> and server queues, and on DQS where that's been acquired, and on piles 
> and piles of arcane glue code.  It's gotta be easier and simpler to 
> adopt restart and recovery capabilities, if it's going to be of interest 
> to new sites, and even to existing sites that might want to upgrade 
> their own long-running processing.  And it's all also tied in with 
> rolling reboots, online backups, app and system crash monitoring, and 
> patch management.  As well as with testing and rollbacks.

I like it when people think "outside the box".  That's how we get new things. 
So don't take this as criticism; think of it as perspective.

My software has been doing checkpoints and restarts since the mid-1980s.  Back 
then it was more important.  Not so much today, with backup power and rather 
capable systems.  Still, it gives me some perspective on restarts.

Not always, but sometimes a restart is very application specific.  For those, I 
don't see what OS capabilities could do to help.  For transactions, some 
database products offer all-or-nothing commits, which weren't available to me 
back then, but there will always be a finite limit to how large a transaction 
can be and still allow that.  Beyond that, things get rather application 
specific.
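
To illustrate the all-or-nothing commit with a bounded transaction size, here 
is a minimal sketch in Python using the standard sqlite3 module.  It is not 
anything from the software discussed in this thread; the table names (ledger, 
progress), the function name and the batch size are all hypothetical.  The 
idea is to commit the work in fixed-size batches and to record how far the job 
has got inside the same transaction as the data, so the two cannot disagree 
after a failure.

```python
import sqlite3

def load_in_batches(conn, rows, batch_size=1000):
    """Insert rows in bounded transactions; resume from the recorded position.

    Hypothetical schema: 'ledger' holds the data, 'progress' holds one
    counter saying how many input rows have been committed so far.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ledger (id INTEGER PRIMARY KEY, amount INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress "
        "(k INTEGER PRIMARY KEY CHECK (k = 0), done INTEGER)"
    )
    conn.execute("INSERT OR IGNORE INTO progress VALUES (0, 0)")
    conn.commit()

    # On a restart this picks up the last committed position.
    done = conn.execute("SELECT done FROM progress").fetchone()[0]
    while done < len(rows):
        batch = rows[done:done + batch_size]
        # 'with conn' commits if the block succeeds and rolls back if it
        # raises, so the batch and its progress update land together or
        # not at all.
        with conn:
            conn.executemany(
                "INSERT INTO ledger (id, amount) VALUES (?, ?)", batch
            )
            done += len(batch)
            conn.execute("UPDATE progress SET done = ?", (done,))
    return done
```

If a batch fails partway, the database still holds every earlier batch plus a 
progress counter that matches it, and a rerun continues from there.  What goes 
into a batch, and what to do about the bad one, remains application specific.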

My biggest concern back then was ensuring that the progress flags survived a 
failure.  Nothing is perfect; I had to take what I could get.  Still, I 
don't remember one failure of a restart.  Usually the application realized it 
was in the middle of a transaction and restarted automatically.  A few times 
some intervention was required.  Don't need fingers of both hands to count them.
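
For readers who have not built one of these, a minimal sketch of durable 
progress flags follows, in Python for illustration only.  It is an assumption 
about the general technique, not the code described above: the checkpoint file 
layout, the function names and the fail_at test hook are all hypothetical.  
The flag is written to a temporary file, forced to disk, and then renamed over 
the old one, so a crash leaves either the old flag or the new one, never a 
half-written file.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Persist progress atomically: temp file, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())   # force the data to stable storage
        os.replace(tmp, path)      # atomic swap of old flag for new
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise

def load_checkpoint(path):
    """Return the saved progress, or a fresh start if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_item": 0}

def run_job(items, path, process, fail_at=None):
    """Process items in order, resuming from the last checkpoint.

    fail_at is a hypothetical hook that simulates a crash for testing.
    """
    state = load_checkpoint(path)
    for i in range(state["next_item"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        process(items[i])
        save_checkpoint(path, {"next_item": i + 1})
```

One caveat: a crash after process() but before save_checkpoint() means that 
item is processed again on restart, so the per-item work has to tolerate being 
repeated.  That part is exactly the application-specific piece.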

So, I'm still having trouble seeing how any OS capabilities could have 
helped.  Of course, that's all in the context of what my apps were doing.  
Perhaps other apps could benefit.

-- 
David Froble                       Tel: 724-529-0450
Dave Froble Enterprises, Inc.      E-Mail: davef at tsoft-inc.com
DFE Ultralights, Inc.
170 Grimplin Road
Vanderbilt, PA  15486