[Info-vax] RealWorldTech on Poulson
Johnny Billquist
bqt at softjar.se
Sun Jul 3 19:52:36 EDT 2011
On 2011-07-04 01.11, John Wallace wrote:
> On Jul 3, 5:34 pm, Johnny Billquist<b... at softjar.se> wrote:
>> Let's make one thing clear here. The cache coherency strategy design is
>> not inherent to the architecture. x86 can use a directory based cache
>> coherency just as good as an Itanium. There is no actual performance
>> advantage in Itanium itself in this.
>> It's just a question of current implementation.
>>
>> Yes, compiler advances dreamt of by the Itanium team never materialized,
>> and honestly it was a desperate dream to start with. Alpha made the
>> right choice, but was killed for political reasons.
>> At this point, the x86 is a better platform to develop than Itanium,
>> which made several bad design choices. It's like the Sparc, who decided
>> to have a branch delay slot defined by the architecture. That has hurt
>> them ever since.
>> Bad processor designs are very hard to recover from. And Itanium seem to
>> have collected a lot of those. No clever cache snooping hardware is
>> going to change that.
>>
>> Johnny
>>
>> --
>> Johnny Billquist || "I'm on a bus
>> || on a psychedelic trip
>> email: b... at softjar.se || Reading murder books
>> pdp is alive! || tryin' to stay hip" - B. Idol
>
> "The cache coherency strategy design is not inherent to the
> architecture."
>
> Q1) Are you sure about that?
> Q2) Does it matter?
>
> A1: I agree that at first glance the cache coherency is not inherent
> to the architecture. Then again, why does Alpha have "memory barrier"
> instructions rather than a chip<->memory system design which enforces
> consistency 100% of the time? And it's not just Alpha. I think the
> theory was that it helps design faster systems.
Yes. The cache coherency protocol itself is totally invisible to the
architecture. Memory barriers are a related but different thing. Memory
barriers are used to guarantee memory ordering in the face of
out-of-order execution, which can otherwise cause writes to memory to
become visible in a different order than your program issued them,
since the hardware is free to reorder instructions that appear
unrelated to each other (such as two stores to different memory
addresses).
Such reordering definitely helps to speed up systems. In a
multiprocessor system, though, those optimizations are not always safe,
which is why memory barriers exist: a barrier guarantees that, at that
point, all the writes your program issued before it have actually been
committed to memory before the next instruction executes. So no
reordering is allowed across a memory barrier.
Like I said, you cannot even tell what kind of cache coherency protocol
a machine implements; it is totally invisible to code executing on the
machine. So you are always free to implement whatever algorithm you
want. The only constraint is that all CPUs in the machine must use the
same protocol, or else you won't have cache coherency. (There have been
machines built without any hardware cache coherency at all, but there
it becomes the responsibility of the OS to maintain coherency instead,
and software has to be written with that in mind.)
> A2: Maybe this "cache efficiency" thing matters, but even the RWT
> article admits that in the one to four socket market, it doesn't
> matter enough to make AMD64 a bad choice. Given that x86 clock rates
> have hit their limit for now (forever?), the market is forced to
> accept multicore (or what us oldtimers used to call SMP) as the way
> forward. Four sockets gets you 32 Xeon cores (not quite as many
> Opterons yet). What kind of workload really needs 32 cores? More
> importantly, what kind of operating environment/OS can usefully use 32
> cores? Well I guess server consolidation (which seems to be called
> virtualisation these days) might, but other than that?
To be honest, cache snooping algorithms are faster than directory
based ones. They just don't scale to many CPUs, while directory based
ones are slower but do scale to many CPUs.
And well, yes, it does matter, if you want to have very many CPUs in
your machine.
But I agree, most software cannot make use of massively parallel
machines anyway, so machines with very many CPUs are for most purposes
not really a good solution. Virtual machines are the one real
exception, and that is getting pretty popular.
> "Bad processor designs are very hard to recover from. And Itanium seem
> to
> have collected a lot of those. No clever cache snooping hardware is
> going to change that."
>
> I don't know if IA64 is bad. It's just irrelevant. It's VMS that makes
> boxes with IA64 interesting. Or NSK. Or maybe even HP-UX. But VLIW
> itself is neither here nor there. Nobody cares about VLIW.
Nobody cares about VLIW in itself, for sure. But everyone cares about
the machines being fast, and VLIW is actually sitting in the way of
achieving that. So indirectly everyone cares.
And the same is true for some other parts of Itanium as well. The
other aspect is that it takes a lot of gates to implement everything an
Itanium is supposed to do, which also leads to large chips, high costs,
long and complex development cycles, and difficulty reaching high clock
speeds from that direction as well.
Johnny