[Info-vax] Intel previews new Itanium "Poulson" processor
John Reagan
johnrreagan at earthlink.net
Fri Feb 25 16:07:36 EST 2011
"JF Mezei" <jfmezei.spamnot at vaxination.ca> wrote in message
news:4d6813ef$1$2796$c3e8da3$38634283 at news.astraweb.com...
> John Reagan wrote:
>
>> As for number of cycles to execute once the chip starts chewing... No a
>> register to register move or shift will be faster than some fixed
>> multiply
>> or such.
>
>
> OK, so say I have a set of 25 instructions that are declared as being
> able to run in parallel running on Tukwila.
>
> The first 6 go in right away. Does this mean that if slot #4 finishes
> executing first, the 7th instruction will go to slot #4 before the
> rest of the original 6 have finished executing?
>
> Or does it wait for all of the original 6 to finish before loading the
> next 6 in ?
Slots are just how the 3 instructions are encoded in a bundle. Once loaded
and picked apart, just think of them as instructions.
No, it doesn't wait for all 6 to finish before going on. It might vary based
on what KIND of instruction. For example, there are only so many memory
ports, so if there are outstanding memory reads, any more might have to wait
even though the compiler said there are no register dependencies. Same for
other on-chip resources. What I believe Poulson does is provide more on-chip
resources (adders, shifters, memory ports?, etc.) so it can grab 12
instructions and see how many it can get running at once.
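As a rough sketch of that idea (the instruction mix, resource names, and
port counts here are all invented for illustration, not real Tukwila or
Poulson numbers), in-order issue limited by structural resources looks
something like this:

```python
# Toy model: in-order issue stalls on structural hazards (e.g. memory
# ports) even when the compiler has marked the instructions independent.
# Resource names and unit counts are made up for illustration.
def issue_cycles(instrs, ports):
    """instrs: list of resource names, in program order.
    ports: resource name -> units available per cycle.
    Returns the number of cycles needed to issue everything in order."""
    cycles = 0
    i = 0
    while i < len(instrs):
        used = {}
        # Issue as many consecutive instructions as free units allow.
        while i < len(instrs) and used.get(instrs[i], 0) < ports[instrs[i]]:
            used[instrs[i]] = used.get(instrs[i], 0) + 1
            i += 1
        cycles += 1
    return cycles

# Six independent loads but only two memory ports: three cycles,
# even though the compiler said they could all run in parallel.
print(issue_cycles(["mem"] * 6, {"mem": 2, "alu": 4}))  # -> 3
```

More units per resource (the Poulson direction) shrinks the cycle count
without any change to the instruction stream.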
Not much different than Alpha (except that Alpha's out-of-order model let the
chip skip ahead; my recollection of Itanium is that it is an in-order machine:
it wants to do the very next instruction and waits until the chip resources
are available). That is all part of the compiler's instruction scheduling
algorithm (which is different than looking for register dependencies to
place the stop bits). Again, not much different than Alpha. Prior to EV6's
out-of-order model, the compilers were very careful to sequence the
instructions to spread the work across the on-chip resources. With EV6,
the chip did that for us.
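To make the stop-bit part concrete (a simplified sketch; real IA-64 group
formation has more rules, and the tuple encoding here is invented): the
compiler starts a new instruction group whenever an instruction touches a
register written earlier in the current group, and a stop bit marks the
boundary.

```python
# Toy sketch of stop-bit placement: begin a new instruction group when
# an instruction reads or writes a register already written in the
# current group. Instructions are (dest, src, src...) tuples; this is
# an invented encoding, not real IA-64 syntax.
def place_stops(instrs):
    groups, current, written = [], [], set()
    for dest, *srcs in instrs:
        if dest in written or any(s in written for s in srcs):
            groups.append(current)          # a stop bit goes here
            current, written = [], set()
        current.append((dest, *srcs))
        written.add(dest)
    groups.append(current)
    return groups

# r3 reads r1 and r2, so a stop separates it from the first two.
prog = [("r1", "r8", "r9"), ("r2", "r10", "r11"), ("r3", "r1", "r2")]
print(place_stops(prog))
# -> [[('r1', 'r8', 'r9'), ('r2', 'r10', 'r11')], [('r3', 'r1', 'r2')]]
```

Resource-aware scheduling is the separate, harder problem on top of this:
even within one group, the chip can only start what it has units for.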
For instance, on EV5s and EV56s, one of the "bad things" to do was a
memory fetch on a location that was just fetched. If the first fetch had a
cache miss, the chip asked the memory subsystem to fetch memory, fill the
cache line, etc., and load the data. If you then tried to load the same
address (actually the same cache line) again, EV56 would cancel the first
memory load and start all over again, even though if it had just waited one
more cycle, the data would have been present. So GEM tried to space memory
loads a certain distance apart if we thought there might be a clash.
It happened more than you'd think... consider a loop unrolling where we
wanted to fetch A(I) and A(I+1) with adjacent instructions. You'd hit this
problem. So it would often be faster to move other instructions between the
two fetches or even put NOPs in there. Again, EV6 was much smarter.
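A crude model of that cancel-and-restart behavior (the 10-cycle miss
latency and the one-op-per-cycle issue are made-up numbers, not EV56
timings) shows why spacing the two loads out can win even when the filler
is pure NOPs:

```python
# Toy model of the EV56 behavior described above: a second load to a
# cache line whose fill is still in flight cancels the fill and restarts
# it from scratch. All latencies are invented for illustration.
MISS = 10  # made-up cycles to fill a cache line on a miss

def finish_time(trace):
    """trace: list of ("load", line) or ("nop",); one op issues per cycle.
    Returns the cycle at which all loaded data is available."""
    cycle, fill_done = 0, {}        # line -> cycle its fill completes
    for op in trace:
        cycle += 1
        if op[0] == "load":
            line = op[1]
            if line in fill_done and cycle < fill_done[line]:
                fill_done[line] = cycle + MISS  # cancel & restart the fill
            elif line not in fill_done:
                fill_done[line] = cycle + MISS  # ordinary miss
            # else: fill already complete, so this load is a cache hit
    return max([cycle] + list(fill_done.values()))

adjacent = [("load", 0), ("load", 0)]               # A(I), A(I+1): same line
spaced   = [("load", 0)] + [("nop",)] * 9 + [("load", 0)]
print(finish_time(adjacent), finish_time(spaced))   # -> 12 11
```

The back-to-back pair restarts the fill and finishes later than the padded
version, which is the clash GEM's scheduler was trying to avoid.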
John