[Info-vax] OpenVMS servers and clusters as a cloud service

Tue Jan 9 16:39:27 EST 2018

On 2018-01-09 at 18:13, DaveFroble wrote:
> Stephen Hoffman wrote:
>> On 2018-01-09 04:13:58 +0000, Grant Taylor said:
>>
>>> On 01/08/2018 03:40 AM, Jan-Erik Soderholm wrote:
>>>> Then you have a very badly designed batch job. Either split it up into 
>>>> multiple jobs or make sure in some other way that the job is 
>>>> restartable. Such a job should of course be able to pick up where it 
>>>> crashed.
>>>
>>> I think you hit the nail on the head.
>>>
>>> I can't tell you how many different things I've seen that are /just/ 
>>> good enough that they work (most of the time) when everything they 
>>> depend on is working properly.  If a gnat farts in the room, things fall 
>>> over.  Then someone has to pick the pieces up.
>>>
>>> I really wish that this was not the case.  But this is what I've come to 
>>> expect.  That way I'm pleasantly surprised when something is as you are 
>>> describing it should be.
>>
>> I'd like to see the ability to restart and recover far more widely used.  
>> But what's out there now in OpenVMS?  That's clearly not getting used 
>> widely, not viewed as useful, not capable, or whatever.  Which means 
>> either developers and project managers have to change their whole 
>> approaches here — and sure, like that's going to happen — or the provided 
>> APIs and approaches have to be made easier and more transparent and 
>> clearer to the developers.  And made easier and more visible and valuable 
>> to end-users, for that matter.
>>
>> Yeah, even OpenVMS development itself got hung up on "uptime", and 
>> (neglected? forgot? ignored?) what's involved with that, and when the 
>> uptime counter inevitably resets.  About cluster rolling reboots and the 
>> need to apply patches, and about app failures and recovery, and about app 
>> patches and testing and recovery, and all of which also ties into crash 
>> logging, telemetry, patch deployments, online backups, and into 
>> integration with IP networking for failover and load-balancing, and other 
>> related topics.   At making these goals easier for developers to adopt 
>> and to achieve, and easier to keep OpenVMS and third-party apps current, 
>> and available.
>>
>> Not everybody has the time or the skills or the management support to 
>> slog through the associated effort involved with making a process capable 
>> of supporting checkpoint-restart or other sorts of recovery, 
>> unfortunately.   The APIs have to be made easier and more transparent and 
>> more capable, and the documentation far better.   Because what OpenVMS 
>> has right now clearly isn't getting used by most developers and 
>> designers (or it's not widely known or not easily useful or not 
>> expeditious or not valuable enough, or whatever other excuse folks will 
>> use here).
>> Even the folks that do know and do use and do develop for OpenVMS and 
>> have implemented restart and recovery don't ubiquitously implement that 
>> recovery and restart logic for all of the long-running processes.  Ponder 
>> why that is.
>>
>> Now do I think the related restart and recovery work is anywhere near the 
>> top of VSI's priority list?  No.  The x86-64 port and a pile of other 
>> code is more important.    But distributed job and process control — 
>> which includes batch job restart and recovery — is an area that OpenVMS 
>> has not seen particular enhancements in ~thirty years, where it's often 
>> not viewed as reasonable to implement for whatever reason, and what's out 
>> there is far too dependent on hand-rolled code and batch and server 
>> queues, and on DQS where that's been acquired, and on piles and piles of 
>> arcane glue code.  It's gotta be easier and simpler to adopt restart and 
>> recovery capabilities, if it's going to be of interest to new sites, and 
>> even to existing sites that might want to upgrade their own long-running 
>> processing.  And it's all also tied in with rolling reboots, online 
>> backups, app and system crash monitoring, and patch management.  As well 
>> as with testing and rollbacks.
> 
> I like it when people think "outside the box".  That's how we get new 
> things. So don't think this is criticism, think of it as perspective.
> 
> My software has been doing checkpoints and restarts since the mid 1980s.  Back 
> then it was more important.  Not so much today with backup power and rather 
> capable systems.  Still, it gives me some perspective on restarts.
> 
> Not always, but sometimes a restart is very application specific.  For 
> such, I don't see what OS capabilities could do to help.  For transactions 
> there are all-or-nothing commits in some database products, which weren't 
> available to me back then, but, there will always be a finite limit to just 
> how large a transaction could be to allow such.

The issue is more often how small a transaction can be allowed to be.

When we have "a lot" of records to process in some way, we make sure
that each record is processed, that some status or similar in the record
is updated, and that the transaction is then committed. Then we start a
new transaction to process the next record.

If the process crashes (or the whole system), it will just pick up
with the next unprocessed (uncommitted) record. Any half-processed
record left by a process (or system) crash is automatically
restored by the database before any user processes are allowed access.

So to the user processes the crash is more or less transparent/invisible.

> Beyond that things get
> rather application specific.
> 
> My biggest concern back then was ensuring that the progress flags were 
> retained in some failure.  Nothing is perfect, had to take what I could 
> get.  Still, I don't remember one failure of a restart.  Usually the 
> application realized it was in the middle of a transaction and restarted 
> automatically.  A few times some intervention was required.  Don't need 
> fingers of both hands to count them.
> 
> So, I'm still having some problem in seeing how any OS capabilities could 
> have helped.  Of course, all in the context of what my apps were doing.  
> Perhaps other apps could benefit.
>