[Info-vax] RealWorldTech on Poulson

Johnny Billquist bqt at softjar.se
Sun Jul 3 19:52:36 EDT 2011


On 2011-07-04 01.11, John Wallace wrote:
> On Jul 3, 5:34 pm, Johnny Billquist <b... at softjar.se> wrote:
>> Let's make one thing clear here. The cache coherency strategy is not
>> inherent to the architecture. x86 can use directory based cache
>> coherency just as well as an Itanium can. There is no actual
>> performance advantage in Itanium itself here.
>> It's just a question of current implementation.
>>
>> Yes, the compiler advances dreamt of by the Itanium team never
>> materialized, and honestly it was a desperate dream to start with.
>> Alpha made the right choice, but was killed for political reasons.
>> At this point, x86 is a better platform to develop for than Itanium,
>> which made several bad design choices. It's like SPARC, which decided
>> to define a branch delay slot in the architecture. That has hurt
>> them ever since.
>> Bad processor designs are very hard to recover from. And Itanium
>> seems to have collected a lot of them. No clever cache snooping
>> hardware is going to change that.
>>
>>          Johnny
>>
>> --
>> Johnny Billquist                  || "I'm on a bus
>>                                     ||  on a psychedelic trip
>> email: b... at softjar.se             ||  Reading murder books
>> pdp is alive!                     ||  tryin' to stay hip" - B. Idol
>
> "The cache coherency strategy design is not inherent to the
> architecture."
>
> Q1) Are you sure about that?
> Q2) Does it matter?
>
> A1: I agree that at first glance the cache coherency is not inherent
> to the architecture. Then again, why does Alpha have "memory barrier"
> instructions rather than a chip<->memory system design which enforces
> consistency 100% of the time? And it's not just Alpha. I think the
> theory was that it helps design faster systems.

Yes. The cache coherency protocol itself is totally invisible to the 
architecture. Memory barriers are a related but different thing. They 
are used to guarantee the ordering of memory commits in the face of 
out-of-order execution, which can otherwise cause writes to memory to 
happen in a different order than your program suggests, since the 
hardware is free to reorder instructions that appear unrelated to each 
other (such as two stores to different memory addresses).

This definitely helps to speed up systems. In a multiprocessor system, 
such optimizations are not always safe, though, which is why memory 
barriers exist: they guarantee that, at that point, all the writes in 
your program have actually been committed to memory before the next 
instruction executes. So no reordering is allowed across a memory 
barrier.
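
To make that concrete, here is a minimal sketch in C11 (the names and 
the flag/payload pattern are my own illustration, not taken from any 
particular system). Without the fences, the reader could see ready 
become 1 while data still holds its old value:

#include <stdatomic.h>

int data;               /* payload, written with a plain store        */
atomic_int ready = 0;   /* flag telling the reader the data is there  */

void writer(void)
{
    data = 42;
    /* Without a barrier here, the hardware may make the flag visible
       before the payload, since the two stores look unrelated. The
       release fence forbids reordering the stores across it. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

int reader(void)
{
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ;                                       /* spin until flagged */
    atomic_thread_fence(memory_order_acquire);  /* pairs with release */
    return data;                                /* now guaranteed 42  */
}

On Alpha, those fences are the kind of place where the MB instruction 
ends up; on a strongly ordered machine they may compile to nothing.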

Like I said, you cannot even tell what kind of cache coherency protocol 
a machine implements, as it is totally invisible to code executing on 
the machine. So you are always free to implement whatever algorithm you 
want. The only constraint is that all CPUs in your machine must use the 
same protocol, or else you will not have cache coherency. (Machines 
have been built without any hardware cache coherency at all, but at 
that point it becomes the responsibility of the OS to maintain 
coherency instead, and software needs to be aware of that.)
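
As a sketch of what "the OS maintains coherency" means in practice on 
such a machine, something like the following has to happen around any 
data shared between CPUs. The two primitives here are hypothetical 
placeholders for whatever the platform actually provides:

#include <stddef.h>

/* Hypothetical primitives on a machine without hardware coherency. */
void flush_dcache_range(void *p, size_t len);      /* write back dirty */
void invalidate_dcache_range(void *p, size_t len); /* drop stale lines */

void hand_buffer_to_other_cpu(void *buf, size_t len)
{
    /* Our writes may still sit in our own cache; push them out to
       memory before telling anyone else to look at the buffer. */
    flush_dcache_range(buf, len);
    /* ... signal the other CPU that buf is ready ... */
}

void take_buffer_from_other_cpu(void *buf, size_t len)
{
    /* We may hold stale copies of these lines; discard them so the
       next read really comes from memory. */
    invalidate_dcache_range(buf, len);
    /* ... now it is safe to read buf ... */
}

Miss one of those calls anywhere and you silently read or clobber 
stale data, which is what being aware amounts to.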

> A2: Maybe this "cache efficiency" thing matters, but even the RWT
> article admits that in the one to four socket market, it doesn't
> matter enough to make AMD64 a bad choice. Given that x86 clock rates
> have hit their limit for now (forever?), the market is forced to
> accept multicore (or what us oldtimers used to call SMP) as the way
> forward. Four sockets gets you 32 Xeon cores (not quite as many
> Opterons yet). What kind of workload really needs 32 cores? More
> importantly, what kind of operating environment/OS can usefully use 32
> cores? Well I guess server consolidation (which seems to be called
> virtualisation these days) might, but other than that?

To be honest, cache snooping protocols are faster than directory based 
ones. It's just that they don't scale to many CPUs, while directory 
based ones are slower, but do scale to many CPUs.

And well, yes, it does matter if you want to have very many CPUs in 
your machine.
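
A toy model of the difference (not real hardware; NCPUS and the 
structures are just for illustration): on a write, a snooping protocol 
has nothing but a broadcast, while a directory remembers who holds the 
line and contacts only those caches.

#define NCPUS 64

/* Snooping: no bookkeeping, so every write miss is broadcast. */
int snoop_invalidate(int writer)
{
    int probes = 0;
    for (int cpu = 0; cpu < NCPUS; cpu++)  /* O(NCPUS) traffic, always */
        if (cpu != writer)
            probes++;                      /* probe that CPU's cache   */
    return probes;                         /* always NCPUS - 1         */
}

/* Directory: a sharer bitmap kept per memory line. */
struct dir_entry { unsigned long long sharers; };

int directory_invalidate(int writer, struct dir_entry *e)
{
    int probes = 0;
    for (int cpu = 0; cpu < NCPUS; cpu++)
        if (cpu != writer && (e->sharers & (1ULL << cpu)))
            probes++;                      /* contact real sharers only */
    e->sharers = 1ULL << writer;           /* writer now owns the line  */
    return probes;                         /* usually far below NCPUS-1 */
}

The directory lookup itself adds latency to every miss, which is the 
"slower" part; the saved broadcast traffic is the "scales" part.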

But I agree, most software cannot make use of massively parallel 
machines anyway, so machines with very many CPUs are for most purposes 
not really a good solution. Virtual machines are the one real 
exception, and that is getting pretty popular.

> "Bad processor designs are very hard to recover from. And Itanium seem
> to
> have collected a lot of those. No clever cache snooping hardware is
> going to change that."
>
> I don't know if IA64 is bad. It's just irrelevant. It's VMS that makes
> boxes with IA64 interesting. Or NSK. Or maybe even HP-UX.  But VLIW
> itself is neither here nor there. Nobody cares about VLIW.

Nobody cares about VLIW in itself, for sure. But everyone cares about 
the machines being fast, and VLIW is actually standing in the way of 
achieving that. So indirectly everyone cares.
The same is true for some other parts of Itanium as well. The other 
aspect is that it takes a lot of gates to implement everything an 
Itanium is supposed to do, which also leads to large chips, high 
costs, complex and long development cycles, and, from that point of 
view as well, difficulty in achieving high speeds.

	Johnny


