[Info-vax] HP Integrity rx2800 i4 (2.53GHz/32.0MB) :: PAKs won't load

Kerry Main kerry.main at backtothefutureit.com
Mon Feb 29 13:27:23 EST 2016


> -----Original Message-----
> From: Info-vax [mailto:info-vax-bounces at info-vax.com] On Behalf Of
> lists--- via Info-vax
> Sent: 29-Feb-16 11:34 AM
> To: info-vax at info-vax.com
> Cc: lists at openmailbox.org
> Subject: Re: [New Info-vax] HP Integrity rx2800 i4 (2.53GHz/32.0MB) ::
> PAKs won't load
> 
> On Mon, 29 Feb 2016 14:28:42 +0000
> Kerry Main via Info-vax <info-vax at rbnsn.com> wrote:
> 
> > > -----Original Message-----
> > > From: Info-vax [mailto:info-vax-bounces at info-vax.com] On Behalf Of
> > > lists--- via Info-vax
> > > Sent: 29-Feb-16 1:24 AM
> > > To: info-vax at info-vax.com
> > > Cc: lists at openmailbox.org
> > > Subject: Re: [New Info-vax] HP Integrity rx2800 i4 (2.53GHz/32.0MB) ::
> > > PAKs won't load
> > >
> > > On Sun, 28 Feb 2016 14:20:06 +0000
> > > Kerry Main via Info-vax <info-vax at rbnsn.com> wrote:
> > >
> > > > Then, add in the "what if we lose a site and we cannot lose any
> data"
> > > > scenario to the cluster and the complexity goes through the roof -
> > > > "split brain" and many of the issues which ALL OS platforms need to
> > > > address.
> > >
> > > Does that mean "can't lose access to any data" or "can't lose data?"
> > >
> >
> > Can't lose any data - often referred to as RPO=0 (recovery point
> > objective).
> 
> Ok. I think there is still a timing dependency in there regardless. You
> have to get the data off the host and that takes time. If you throw
> transactional overhead in there it takes even longer. As far as I know this
> has been available on IBM for a fairly long time but with the proviso that
> not all types of facilities and software can participate in a transaction.
> So anything that requires logical unit of work integrity has to be designed
> with that in mind.
> 

If a transaction does not complete after being submitted for commit in an
active-active cluster scenario OR in a 2PC app environment, the data is not
lost; instead, an error is raised to the App screen and the App logic decides
what to do with that error. In this scenario, the update did not complete at
all sites, so the transaction or IO is rolled back at all sites. The update
is not considered committed.

The difference is that 2PC is a fault-tolerant application-level design,
while an OpenVMS multi-site cluster is an OS-and-above design which
may have many critical applications, support utilities and ISV applications
running that were not designed with 2PC in mind.
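The rollback-at-all-sites behavior described above can be sketched as a
minimal two-phase commit coordinator. This is purely illustrative Python -
the `Participant` class and function names are hypothetical, not any real
OpenVMS or DBMS API:

```python
# Minimal two-phase commit sketch (hypothetical names, not a real API).
# If any site votes "no" in the prepare phase, every site rolls back and
# the update is never considered committed, so no data is silently lost:
# the application just sees an error and decides what to do next.

class Participant:
    """One site (e.g. DC1 or DC2) holding a tentative update."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy   # False simulates an unreachable site
        self.committed = False

    def prepare(self):
        # Phase 1: persist the update tentatively and vote yes/no.
        return self.healthy

    def commit(self):
        # Phase 2a: make the update durable at this site.
        self.committed = True

    def rollback(self):
        # Phase 2b: discard the tentative update at this site.
        self.committed = False


def two_phase_commit(participants):
    """Return True only if every site committed; otherwise roll back all."""
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False  # the caller (the App) handles the error
```

If either site votes "no" in the prepare phase (say, DC2 is unreachable),
every site rolls back and the caller sees the failure - RPO=0 is preserved
because nothing was left half-committed.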

> The way it often works in banking that use IBM teller terminals is store
> and forward. If a plane hits the data center the transaction will
> eventually get committed to the DR host.

Teller transactions are typically considered minor (less than a thousand
dollars or two), so store-and-forward may be OK at that level. The bank will
often decide to eat the loss should this type of event happen.

The same cannot be said for the bigger server-to-server transactions.

>  If a plane hits the branch before
> the transaction was sent there will be a problem. I don't know any way
> around this. I mean, there is always the single point for failure for a
> short time. It should not affect a lot of work but it can affect in-flight
> transactions and that is dealt with by the back-end.
> 

OpenVMS clusters are designed for server-to-server and site-to-site DR/DT
features - not client or branch DR/DT.

Store and forward might be OK for remote teller / ATM devices, but it
would not protect server-to-server transactions at the back end.

[snip..]

> 
> I don't understand how there could be no data loss if the data source
> was struck. I understand the data loss could be greatly mitigated, but to
> say no data loss at all seems physically impossible, having nothing to do
> with software/hardware technology but just as a consequence of
> transmission time, etc. To say the database (whatever that means) is
> always in a consistent state, sure, that has to happen and has for many
> decades. Again, probably not to this day on many platforms but at least
> yours and mine, probably implemented in very different ways.

See the note above - when a plane hits DC1, data is not lost if the App
errors out and takes whatever action is deemed appropriate. If the local
branch is using store and forward, then once the cluster state transition
completes (usually in less than a couple of minutes), the App can retry
sending the update, or ask the teller to re-enter it, and the data will
be applied at the DC2 site.

A key concept is that OpenVMS multi-site clusters ENABLE relatively
seamless transitions, but applications also need to be cluster-aware and
customized to handle the various what-if scenarios.
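The branch-side store-and-forward retry described above can be sketched as
follows. This is an illustrative Python sketch under stated assumptions -
`send_update`, the local queue, and the retry count are all hypothetical,
not a real OpenVMS clustering API:

```python
# Store-and-forward retry sketch (hypothetical send_update callable).
# The branch queues updates locally; while the cluster is mid
# state-transition the sends fail and the updates simply stay queued,
# so nothing is silently dropped. A later pass drains the queue.

from collections import deque

def drain_queue(queue, send_update, max_retries=3):
    """Try to deliver each queued update in order.

    Stops early if the remote site is still unreachable, leaving the
    remaining updates queued for the next pass. Returns the updates
    that were confirmed delivered.
    """
    delivered = []
    while queue:
        update = queue[0]  # peek; dequeue only after a confirmed send
        for _attempt in range(max_retries):
            if send_update(update):
                delivered.append(queue.popleft())
                break
        else:
            # Site still down: keep everything queued and stop this pass.
            break
    return delivered
```

During a cluster state transition the sends fail and the updates stay in
the local queue; once the surviving site is reachable again, the next pass
drains the queue in order, which matches the "retry or re-enter" behavior
described above.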

> 
> > > > Having stated this, to the point raised in the last reply, one needs
> > > > to understand that multi-site clustering with no data loss in a DR
> > > > scenario is a really tough nut to crack - on ANY OS platform.
> 
> I think it has become less so in the IBM world over the past decade as
> they have added more facilities to the transactional recovery bucket. For
> instance VSAM was often the outsider unless you were updating it through
> one of the transaction processing facilities like CICS or IMS. If you had a
> batch job there was no transactional support for VSAM although the same
> batch job could create LUWs for DB2, IMS/DB, etc. VSAM now has support to
> be part of the picture. It's been a while since I was involved in this so I
> may have the exact details off but in general I think this is correct.
> 
> > > But if it's only a matter of temporary unavailability, or short term
> > > delay then I know another OS that can provide that through third party
> > > tools and I guess others could also.
> > >
> >
> > All OS's can provide active-passive type DR solutions (even Windows
> > albeit coyote ugly).   The issue with active-passive solutions is that one
> > has to assume that SOME data WILL be lost i.e. RPO not = to 0. This is
> > because active-passive solutions use some form of SW or HW based
> > replication which is only sync'd every X or XX minutes.
> 
> Yes, this is clearly NFG for transaction processing. If anybody is doing
> that I don't really know why unless it's cheaper to throw away the
> transactions and pay out refunds than it is to develop a real solution.
> Oh wait a minute, for certain businesses and OS it is.
> 
> > If a significant event happens, the App thinks a write is complete, but
> > it has not yet been committed at the remote site. When the App is
> > restarted at the remote site, those transactions that were caught in that
> > replication buffer window are lost. [Remember that only writes are
> > propagated in these replication buffers]
> 
> I can't imagine we're even talking about this today in 2016. It's so
> obvious that when you're dealing with money you have to have two-phase
> commit and preserve LUW and the atomicity of your transactions and all
> the underlying pieces.

That is the view from a critical application, but feeder / ISV apps may not
have 2PC as part of their design.

> 
> > If transaction updates are not critical, then it's not a big deal. If the
> > transactions are measured in large $'s (banks measure some in
> > millions of $'s), then it really is a big deal.
> 
> Right, and the facilities to do that have been available for decades on
> IBM, even without anything resembling clustering.
> 

An old saying was that the two best OS clustering technologies came from
OS's sharing the same 3 letters - VMS/MVS (aka OpenVMS / z/OS).

The similarities are very interesting - both are active-active,
shared-everything models.

> > As Keith Parris emphasizes in his DR/DT presentations, there is a huge
> > difference between a disaster recovery(DR) solution and a disaster
> > tolerant (DT) solution. With DR, the business is down for some period
> > of time and steps are taken to get the business back on line. With DT,
> > the business can continue to run with no data loss.
> 
> That's a distinction I haven't really heard discussed much. I think it is
> another important talking point. I know some businesses that lose
> serious
> money for every second they're offline. That was in the olden days. Now
> with so much e-commerce, how much more so.
> 

Yep - especially with global time zones etc., downtime for things like
offline backups, file maintenance, etc. is rapidly becoming a thing of the
past.

> > I like to compare DR vs DT to insurance - the more you need, the
> > more it costs.  If you never use it then it is a big expense. If it does
> > save the company major $'s and/or loss of business / public
> > credibility, then the additional insurance was a very cheap expense.
> 
> Very good points, all.
> 

Btw, another analogy is the spare tire scenario. Everyone would likely
agree that carrying a spare tire in their vehicle is a good expense,
just in case there is a flat. If you never have a flat, you paid for that
extra tire for nothing. For environments like rough bush roads, maybe one
spare tire is not enough - perhaps throwing an extra spare in the back is
acceptable risk mitigation, especially if you know there is no cell phone
coverage where you are going.

Risk vs. cost vs. benefits - it all depends on what the impact might be if
that significant event does occur.

:-)

Regards,

Kerry Main
Kerry dot main at starkgaming dot com
