[Info-vax] HP Integrity rx2800 i4 (2.53GHz/32.0MB) :: PAKs won't load
lists at openmailbox.org
Mon Feb 29 11:34:15 EST 2016
On Mon, 29 Feb 2016 14:28:42 +0000
Kerry Main via Info-vax <info-vax at rbnsn.com> wrote:
> > -----Original Message-----
> > From: Info-vax [mailto:info-vax-bounces at info-vax.com] On Behalf Of
> > lists--- via Info-vax
> > Sent: 29-Feb-16 1:24 AM
> > To: info-vax at info-vax.com
> > Cc: lists at openmailbox.org
> > Subject: Re: [New Info-vax] HP Integrity rx2800 i4 (2.53GHz/32.0MB) ::
> > PAKs won't load
> >
> > On Sun, 28 Feb 2016 14:20:06 +0000
> > Kerry Main via Info-vax <info-vax at rbnsn.com> wrote:
> >
> > > Then, add in the "what if we lose a site and we cannot lose any data"
> > > scenario to the cluster and the complexity goes through the roof -
> > > "split brain" and many of the issues which ALL OS platforms need to
> > > address.
> >
> > Does that mean "can't lose access to any data" or "can't lose data?"
> >
>
> Can't lose any data - often referred to as RPO=0 (recovery point
> objective).
Ok. I think there is still a timing dependency in there regardless. You
have to get the data off the host, and that takes time. If you throw
transactional overhead in there, it takes even longer. As far as I know this
has been available on IBM for a fairly long time, but with the proviso that
not all types of facilities and software can participate in a transaction.
So anything that requires logical-unit-of-work integrity has to be designed
with that in mind.
The way it often works in banks that use IBM teller terminals is store
and forward. If a plane hits the data center, the transaction will
eventually get committed to the DR host. If a plane hits the branch before
the transaction was sent, there will be a problem. I don't know any way
around this; there is always a single point of failure for a short time. It
should not affect a lot of work, but it can affect in-flight transactions,
and that is dealt with by the back end.
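To make the store-and-forward idea concrete, here's a toy Python sketch (the class and method names are invented for illustration, not any IBM facility): the branch commits locally first, and the local log is drained to the DR host whenever the link is up.

```python
from collections import deque

class StoreAndForwardQueue:
    """Toy store and forward: transactions commit locally first,
    then drain to the DR host when the link is available.
    Illustrative sketch only; names are invented."""

    def __init__(self):
        self.local_log = deque()   # would be a durable local store in reality
        self.remote_log = []       # transactions committed at the DR host

    def record(self, txn):
        # The branch commits locally; the DR host has not seen it yet.
        self.local_log.append(txn)

    def forward(self, link_up=True):
        # Drain the local log to the DR host while the link is up.
        while link_up and self.local_log:
            self.remote_log.append(self.local_log.popleft())

q = StoreAndForwardQueue()
q.record({"acct": 1, "amount": 100})
q.record({"acct": 2, "amount": -50})
q.forward(link_up=False)   # link down: nothing reaches the DR host yet
q.forward(link_up=True)    # link restored: both transactions eventually commit
```

The point is the window: anything still sitting in `local_log` when the branch is destroyed is exactly the exposure described above.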
> Apologies for not making this clearer, but I was referring to a
> multi-site OpenVMS cluster that uses host based shadowing (sync writes)
> to ensure that data is consistent across both sites. If a write happens
> at one site, then it either completes at both sites or neither site. If
> there is an abort, then the App must take whatever is the correct course
> of action.
Right, this is pretty standard stuff in my world. I realize that in the
rest of the world (excluding VMS) it is not.
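The write-both-or-neither rule described above can be sketched in a few lines of Python (a toy model only; real host-based shadowing works at the volume level and handles cluster membership, which this ignores):

```python
class Site:
    """A toy storage site holding written blocks."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.up = True

    def write(self, lba, data):
        if not self.up:
            raise IOError(f"{self.name} unreachable")
        self.blocks[lba] = data

def shadowed_write(sites, lba, data):
    """Toy synchronous shadow write: the write succeeds at every
    site or at none. On failure, already-written sites are undone
    and the app must handle the abort. Sketch only."""
    staged = []
    try:
        for s in sites:
            s.write(lba, data)
            staged.append(s)
    except IOError:
        for s in staged:          # undo so neither site keeps the write
            del s.blocks[lba]
        return False
    return True
```

If one site is down, the call returns False and the surviving site holds nothing for that write, which is the "completes at both sites or neither" behavior.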
> Very bad scenario in any multi-site cluster is a split cluster where a
> write completes at one site, but not the other, yet the App thinks both
> sites were updated.
Absolutely. But this is very old stuff and is a known quantity. The issue
comes up with teller transactions and ATM and POS and card processing all
the time, and has since the beginning of those types of transactions. Same
thing for airline reservations (we talked about that here earlier).
> > If the former, how many setups can really provide that in real time? I
> > know
> > a system that can do it in near-real time and it will sync up but if you
> > lose a system your data on that system is going to be inaccessible. As
> > far as "can't lose any data" goes, that sounds like a feature of any OS
> > that claims to be enterprise ready. Until an airplane hits the data
> > center.
> >
>
> Again in a multi-site cluster, the airplane hit would be an impact, but
> assuming a properly designed App environment, after a short cluster
> transition, the app would continue (with no data lost) at the other site.
I don't understand how there could be no data loss if the data source was
struck. I understand the data loss could be greatly mitigated, but to say
no data loss at all seems physically impossible, having nothing to do
with software/hardware technology but simply as a consequence of
transmission time, etc. To say the database (whatever that means) is always
in a consistent state, sure, that has to happen and has for many decades.
Again, probably not to this day on many platforms, but at least on yours and
mine, probably implemented in very different ways.
> > > Having stated this, to the point raised in the last reply, one needs
> > > to understand that multi-site clustering with no data loss in a DR
> > > scenario is a really tough nut to crack - on ANY OS platform.
I think it has become less so in the IBM world over the past decade as they
have added more facilities to the transactional-recovery bucket. For
instance, VSAM was often the outsider unless you were updating it through
one of the transaction processing facilities like CICS or IMS. If you had a
batch job, there was no transactional support for VSAM, although the same
batch job could create LUWs for DB2, IMS/DB, etc. VSAM now has support to
be part of the picture. It's been a while since I was involved in this, so I
may have the exact details off, but in general I think this is correct.
> > But if it's only a matter of temporary unavailability, or short term
> > delay then I know another OS that can provide that through third party
> > tools and I
> > guess others could also.
> >
>
> All OS's can provide active-passive type DR solutions (even Windows
> albeit coyote ugly). The issue with active-passive solutions is that one
> has to assume that SOME data WILL be lost i.e. RPO not = to 0. This is
> because active-passive solutions use some form of SW or HW based
> replication which is only sync'd every X or XX minutes.
Yes, this is clearly NFG for transaction processing. If anybody is doing
that, I don't really know why, unless it's cheaper to throw away the
transactions and pay out refunds than it is to develop a real solution.
Oh wait a minute, for certain businesses and OSs it is.
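The replication-window exposure being described can be shown with a toy Python model (invented names; this stands in for any SW/HW async replicator): writes acknowledge at the primary immediately, ship to the replica only on each sync, and a failure between syncs loses whatever is still buffered — i.e. RPO > 0.

```python
class AsyncReplica:
    """Toy active-passive replication. The app sees success as soon
    as the primary has the write; the replica only catches up at
    each sync interval. Illustrative sketch only."""

    def __init__(self):
        self.primary = []
        self.replica = []
        self.buffer = []     # writes not yet shipped to the replica

    def write(self, rec):
        self.primary.append(rec)
        self.buffer.append(rec)   # app already thinks this is safe

    def sync(self):
        # Ship buffered writes to the passive site.
        self.replica.extend(self.buffer)
        self.buffer.clear()

    def primary_fails(self):
        # Everything caught in the replication window is lost.
        lost = list(self.buffer)
        self.primary = None
        return lost
```

Run three writes, sync, then two more and fail the primary: the replica holds three records and the last two are gone, exactly the "caught in that replication buffer window" case.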
> If a significant event happens, the App thinks a write is complete, but
> it has not yet been committed at the remote site. When the App is
> restarted at the remote site, those transactions that were caught in that
> replication buffer window are lost. [Remember that only writes are
> propagated in these replication buffers]
I can't imagine we're even talking about this today in 2016. It's so
obvious that when you're dealing with money you have to have two-phase
commit and preserve LUW and the atomicity of your transactions and all the
underlying pieces.
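The two-phase commit idea is simple enough to sketch in Python (a sketch of the protocol only, not CICS/IMS syncpoint code; all names are invented): commit everywhere only if every participant votes yes in the prepare phase, otherwise roll everyone back, which is what preserves the LUW's atomicity.

```python
class Participant:
    """Toy resource manager taking part in a two-phase commit."""
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if the work can be made durable.
        self.state = "prepared" if self.healthy else "aborted"
        return self.healthy

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Toy coordinator: commit everywhere only if every participant
    votes yes in phase 1; otherwise roll all of them back."""
    if all(p.prepare() for p in participants):
        for p in participants:    # Phase 2: everyone commits
            p.commit()
        return True
    for p in participants:        # any "no" vote aborts the whole LUW
        p.rollback()
    return False
```

With one unhealthy participant the whole unit of work aborts; with all healthy it commits everywhere.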
> If transaction updates are not critical, then it's not a big deal. If the
> transactions are measured in large $'s (banks measure some in
> millions of $'s), then it really is a big deal.
Right, and the facilities to do that have been available for decades on
IBM, even without anything resembling clustering.
> As Keith Parris emphasizes in his DR/DT presentations, there is a huge
> difference between a disaster recovery(DR) solution and a disaster
> tolerant (DT) solution. With DR, the business is down for some period
> of time and steps are taken to get the business back on line. With DT,
> the business can continue to run with no data loss.
That's a distinction I haven't really heard discussed much. I think it is
another important talking point. I know some businesses that lose serious
money for every second they're offline. That was in the olden days; now,
with so much e-commerce, how much more so.
> I like to compare DR vs DT to insurance - the more you need, the
> more it costs. If you never use it then it is a big expense. If it does
> save the company major $'s and/or loss of business / public
> credibility, then the additional insurance was a very cheap expense.
Very good points, all.