[Info-vax] VMS port to x86

Sat Mar 24 09:38:17 EDT 2012

On Mar 24, 12:45 pm, "John Reagan" <johnrrea... at earthlink.net> wrote:
> "JF Mezei"  wrote in message
>
> news:4f6d5a90$0$2201$c3e8da3$460562f1 at news.astraweb.com...
>
> >I was just thinking.
> >If HP were to produce fault tolerant 8086s for NSK, those boxes could
>
> Any more fault tolerant than the Itaniums used by NSK today?

Exactly.

Tandem hasn't needed CPU lockstep since they dropped their own
proprietary processor architectures, and that's a LONG time ago in
processor history. This isn't a secret, but it may as well be as far
as the trade media and others are concerned.

In modern Tandem boxes, the synchronisation is not at instruction
level or even main-memory access level but conceptually more like "IO
access level" (or maybe process context switch level). The
synchronisation is not on the CPU chip, not even particularly close to
it, but is managed by a piece of complex external logic called the
Logical Synchronization Unit, which doesn't care about instruction-
level lockstep but does care that each processor's operations result
in the same IO with the outside world (and the same context if a
processor swap has to occur). The LSU also does a lot more than that,
which I won't go into here.
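
To make the "compare at IO level" idea concrete, here is a rough Python
sketch. It is not how the LSU is actually built: the function name
compare_io, the representation of an IO request as a byte string, and
the simple majority rule are all assumptions made for illustration. The
point is only that the comparison looks at what each processor tries to
send to the outside world, not at which instruction it is executing.

```python
from collections import Counter


def compare_io(outputs):
    """Compare the IO requests produced by redundant processors.

    `outputs` maps a (hypothetical) processor id to the bytes that
    processor wants to send to the outside world. Returns the agreed
    request and a sorted list of processors that disagreed with it.
    Raises RuntimeError when no strict majority exists.
    """
    counts = Counter(outputs.values())
    value, votes = counts.most_common(1)[0]
    if votes <= len(outputs) // 2:
        # E.g. two processors that disagree: voting alone cannot say
        # which one is right, so something else has to decide.
        raise RuntimeError("no majority among processor outputs")
    dissenters = sorted(pid for pid, out in outputs.items() if out != value)
    return value, dissenters


if __name__ == "__main__":
    agreed, bad = compare_io(
        {"A": b"write 42", "B": b"write 42", "C": b"write 43"}
    )
    print(agreed, bad)
```

Note that nothing in the comparison depends on when each processor
produced its request, only on what the request was.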

The 2006 Oztug presentation at [1] was given by one of Tandem's senior
architects, Hal Massey, and was an excellent intro to the internals of
Tandem boxes over the years. Sadly, it's fallen off the internet; I
didn't keep a copy, and I haven't yet been able to find a suitable
replacement. Suggestions for a definitive replacement are welcome, but
I'm not holding my breath.

Don't take my word for it; work it out from first principles. Modern
chips have lots of RAS/availability features which result in many soft
errors being correctable, but a corrected error may still have an impact
on timing: sometimes software support is needed at the time or later, or
the error results in visibly different processor behaviour. For example,
a cache error might result in a reference to main memory, in a chip
which can continue with other processing while it waits for the main
memory reference to arrive. If two chips didn't both have the same
correctable soft error at the same time (what are the odds of that?),
they'd not both be executing the same instructions at the same time,
so they'd be out of sync. (Gross oversimplification alert, but
hopefully you get the drift.)
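
The same argument as a toy Python simulation, for anyone who prefers to
see it run. Everything in it is made up for illustration: the two-opcode
"instruction set", the one-cycle-per-instruction timing, and the 50-cycle
penalty for a corrected error are assumptions, not properties of any real
chip. Two simulated processors run the same program; one of them takes a
corrected soft error on one instruction.

```python
def run(program, error_at=frozenset(), penalty=50):
    """Run `program` and return (io_log, finish_cycles).

    `program` is a list of (op, arg) pairs, where op is "add" or "out".
    `error_at` holds the indexes of instructions that hit a corrected
    soft error: the result is unchanged, but the instruction completes
    `penalty` cycles later.
    """
    cycle = 0
    acc = 0
    io_log = []          # what the outside world would see
    finish_cycles = []   # the cycle at which each instruction completed
    for index, (op, arg) in enumerate(program):
        cycle += 1
        if index in error_at:
            cycle += penalty  # corrected error: same answer, just late
        if op == "add":
            acc += arg
        elif op == "out":
            io_log.append(acc)
        finish_cycles.append(cycle)
    return io_log, finish_cycles


if __name__ == "__main__":
    program = [("add", 2), ("add", 3), ("out", None),
               ("add", 10), ("out", None)]
    io_a, t_a = run(program)                # no errors
    io_b, t_b = run(program, error_at={1})  # one corrected error
    print("same IO:    ", io_a == io_b)
    print("same timing:", t_a == t_b)
```

The two processors produce the same IO but are no longer at the same
instruction on the same cycle, so a cycle-by-cycle comparison would flag
a mismatch while an IO-level comparison would not.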

hth
john

[1] www.oztug.org/events/2006/AdvancedArchitecture_Massey.pdf - now
vanished, sadly.

[yes this text may seem familiar]