[Info-vax] Intel previews new Itanium "Poulson" processor

John Reagan johnrreagan at earthlink.net
Fri Feb 25 16:07:36 EST 2011


"JF Mezei" <jfmezei.spamnot at vaxination.ca> wrote in message 
news:4d6813ef$1$2796$c3e8da3$38634283 at news.astraweb.com...
> John Reagan wrote:
>
>> As for number of cycles to execute once the chip starts chewing...  No, a
>> register-to-register move or shift will be faster than some fixed
>> multiply or such.
>
>
> OK, so say I have a set of 25 instructions that are declared as being
> able to run in parallel on Tukwila.
>
> The first 6 go in right away.  Does this mean that if slot #4 finishes
> executing first, the 7th instruction will go to slot #4, before the
> rest of the original 6 have finished executing?
>
> Or does it wait for all of the original 6 to finish before loading the
> next 6 in?

Slots are just how the 3 instructions are encoded in a bundle.  Once loaded 
and picked apart, just think of them as instructions.

No, it doesn't wait for all 6 to finish before going on.  It can vary based 
on what KIND of instruction it is.  For example, there are only so many 
memory ports, so if there are outstanding memory reads, any further ones 
might have to wait even though the compiler said there are no register 
dependencies.  Same for other on-chip resources.  What I believe Poulson 
does is provide more on-chip resources (adders, shifters, memory ports?, 
etc.) so it can grab 12 instructions and see how many it can get running at 
once.
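To make the "register dependencies vs. chip resources" distinction concrete, 
here is a small C sketch (my illustration only, not anything GEM actually 
emits, and the "two memory ports" number is just an assumption for the 
example):

    /* Four loads with no register dependencies between them.  An IA-64
     * compiler can put them all in one instruction group (no stop bits
     * needed), but a core with, say, only two memory ports can still
     * start at most two of the loads per cycle; the other two wait on
     * the hardware resource, not on any data dependency. */
    long sum4(const long *a, const long *b, const long *c, const long *d)
    {
        long ra = *a;   /* independent load #1 */
        long rb = *b;   /* independent load #2 */
        long rc = *c;   /* independent load #3 */
        long rd = *d;   /* independent load #4 */
        return ra + rb + rc + rd;
    }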

Not much different from Alpha (except that Alpha's out-of-order model let the 
chip skip ahead; my recollection of Itanium is that it is an in-order 
machine: it wants to do the very next instruction and waits until the chip 
resources are available).  That is all part of the compiler's instruction 
scheduling algorithm (which is different from looking for register 
dependencies to place the stop bits).  Again, not much different from Alpha.  
Prior to EV6's out-of-order model, the compilers were very careful to 
sequence the instructions to spread the work across the on-chip resources.  
With EV6, the chip did that for us.

For instance, on EV5 and EV56 one of the "bad things" to do was a memory 
fetch from a location that had just been fetched.  If the first fetch had a 
cache miss, the chip asked the memory subsystem to fetch memory, fill the 
cache line, etc., and load the data.  If you then tried to load the same 
address (actually the same cache line) again, EV56 would cancel the first 
memory load and start all over again, even though if it had just waited one 
more cycle the data would have been present.  So GEM tried to space memory 
loads a certain distance apart if we thought there might be a clash.  It 
happened more often than you'd think... consider loop unrolling where we 
wanted to fetch A(I) and A(I+1) with adjacent instructions.  You'd hit this 
problem.  So it was often faster to move other instructions between the two 
fetches or even put NOPs in there.  Again, EV6 was much smarter.
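
In C terms, the unrolled case looks something like this (a made-up sketch; 
the real scheduling happened on the generated instructions, not the source, 
and the loop and array names here are invented):

    /* Unrolled by two: a[i] and a[i+1] almost always sit in the same
     * cache line, so issuing the two loads back-to-back is exactly the
     * EV56 clash described above. */
    void scale(double *a, double s, int n)
    {
        int i;
        for (i = 0; i + 1 < n; i += 2) {   /* odd tail element ignored  */
            double x = a[i];               /* load #1 - may miss        */
            double y = a[i + 1];           /* load #2 - same cache line */
            a[i]     = x * s;
            a[i + 1] = y * s;
        }
    }
    /* On EV5/EV56, GEM would try to schedule the multiply for x (or even
     * a NOP) between the two loads so the second load doesn't restart
     * the cache fill the first one already kicked off.  On EV6 the
     * out-of-order hardware makes that hand-spacing unnecessary. */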

John
