Paper

Adopting an Advanced EPIC Architecture
Intel Itanium processor 9500 series represents a near clean-
sheet redesign of the Intel Itanium cores to support an un-
precedented amount of instruction-level parallelism in its main
execution pipeline. It can execute up to 12 instructions each
cycle in 4 instruction bundles. It has 2 memory execution units, 2
general purpose integer units, 2 ALU units, 2 floating point units,
3 branch units and 1 NOP unit. The Intel Itanium bundle template
determines which units are candidates for executing each in-
struction. The hardware algorithm used to disperse the incoming
instructions into each of the 12 execution unit pipelines is simple,
deterministic and efficient, allowing compilers to control
execution resources exactly. To support 12-wide issue, the register
files have 12 read and 12 write ports.
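The issue-width arithmetic and the idea of template-driven dispersal can be sketched in a few lines. This is only a toy model, not the hardware algorithm: the unit counts come from the paragraph above, while the unit letters (`A` for the ALU units, `NOP` for the NOP unit), the function names, and the rule "stop at the first slot with no free unit of its type" are assumptions made for illustration. It also ignores the fact that ALU-type instructions can occupy M or I slots.

```python
from collections import Counter

# Unit counts as listed in the text; the letter keys are illustrative labels.
UNITS = {"M": 2, "I": 2, "A": 2, "F": 2, "B": 3, "NOP": 1}
SLOTS_PER_BUNDLE = 3  # an Itanium bundle carries three instruction slots


def peak_issue_width(bundles_per_cycle=4):
    """Maximum instructions per cycle for a given number of bundles."""
    return bundles_per_cycle * SLOTS_PER_BUNDLE


def disperse(templates):
    """Toy in-order dispersal for one cycle.

    Each template is a string of slot letters such as "MII" or "MFB".
    Every slot takes the next free unit of its type; dispersal stops at the
    first slot that finds no free unit. Returns the units assigned so far.
    """
    free = Counter(UNITS)   # units still available this cycle
    used = Counter()        # how many units of each type are already taken
    issued = []
    for template in templates:
        for slot in template:
            if free[slot] == 0:
                return issued  # structural limit reached in this toy model
            free[slot] -= 1
            issued.append(f"{slot}{used[slot]}")
            used[slot] += 1
    return issued


if __name__ == "__main__":
    print(peak_issue_width())          # 4 bundles x 3 slots
    print(disperse(["MII", "MFB"]))    # both bundles fit
    print(disperse(["MII", "MII"]))    # third I slot finds no free I unit
```

Because the mapping from template to unit is fixed and in order, a compiler that knows the rule can predict exactly which bundles will issue together, which is the property the paragraph above attributes to the real dispersal logic.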
Intel Itanium processor New-Instructions Architectural
Extensions
Intel Itanium processor 9500 series adds a set of new instruc-
tions that extends the Itanium architecture. It adds integer
multiply instructions and a count-leading-zero instruction. It
adds an instruction to provide better OS control of thread
behavior. It adds and extends instructions that provide more
detailed data access hints as well as a new user-controlled
register file to control those hints. This gives compilers much
finer-grained control of data cache and TLB policies. It also adds an
instruction for multi-line software prefetches. All of these new
instructions are motivated by the desire to increase perfor-
mance, both single-thread and multi-thread.
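For readers unfamiliar with a count-leading-zero operation, the following reference model shows what such an instruction computes on a 64-bit value. It is a sketch only: the function name `clz64` is hypothetical, and the result for a zero input (64 here) is an assumption, not a statement about the behavior of the Itanium instruction.

```python
def clz64(x: int) -> int:
    """Count the leading zero bits of x viewed as an unsigned 64-bit value.

    Assumption for this sketch: an input of 0 yields 64.
    """
    if not 0 <= x < (1 << 64):
        raise ValueError("x must fit in 64 unsigned bits")
    # bit_length() is the position of the highest set bit (0 for x == 0).
    return 64 - x.bit_length()


def floor_log2(x: int) -> int:
    """Typical use: integer log2 of a positive 64-bit value via clz."""
    if x <= 0:
        raise ValueError("x must be positive")
    return 63 - clz64(x)
```

Without a dedicated instruction, software usually needs a loop or a multi-step bit trick for this, which is one reason such an operation is a common architectural addition.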
Memory Parallelism
Intel Itanium processor 9500 series also focuses on increasing
memory parallelism by addressing throughput and queuing in
the memory subsystem. The core has additional queuing for
pending memory operations tweaked for throughput. Queue
sizes were increased and the scheduler was changed to focus
on performance and power.
Another key improvement in memory parallelism is the ability
to avoid pipeline hazards by executing data prefetch operations
that move data between the various levels of caches in advance
of use. In addition to the software and hardware prefetchers in
the memory pipeline, compilers get extra hooks to control caching
policies. With these, Intel Itanium processor 9500 series can
control explicit data and control speculation mechanisms, and its
prefetchers can use adaptive algorithms that conserve bandwidth
as much as possible to help relieve potential pipeline bottlenecks.
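The text does not describe the adaptive algorithms themselves. As a purely illustrative sketch of what "adaptive" can mean for a prefetcher, the toy below is a next-line prefetcher that raises its prefetch degree while its prefetches are being used and lowers it when they are not, so it spends less bandwidth on irregular access patterns. The class name, the degree limits, and the adjustment rule are all assumptions; nothing here is taken from the processor's actual design.

```python
class AdaptivePrefetcher:
    """Toy next-line prefetcher with a usefulness-driven prefetch degree."""

    def __init__(self, max_degree=4):
        self.max_degree = max_degree
        self.degree = 1        # how many lines ahead to prefetch
        self.pending = set()   # prefetched lines that have not been used yet

    def access(self, line):
        """Record a demand access and return the lines newly prefetched."""
        if line in self.pending:
            # An earlier prefetch was useful: be more aggressive.
            self.pending.discard(line)
            self.degree = min(self.degree + 1, self.max_degree)
        else:
            # The access was not covered by a prefetch: back off.
            self.degree = max(self.degree - 1, 1)
        new_lines = [
            line + i
            for i in range(1, self.degree + 1)
            if line + i not in self.pending
        ]
        self.pending.update(new_lines)
        return new_lines


if __name__ == "__main__":
    p = AdaptivePrefetcher()
    for line in range(8):          # sequential stream: degree ramps up
        p.access(line)
    print(p.degree)
    for line in (1000, 2000, 3000, 4000):  # scattered accesses: degree drops
        p.access(line)
    print(p.degree)
```

A real design would also bound the pending set and account for cache capacity and memory latency; the sketch leaves these out to keep the feedback loop visible.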
Core Parallelism
Probably the most obvious form of parallelism Intel Itanium
processor 9500 series supports is core-level parallelism. The
processor has eight cores per socket connected to eight 4MB
last level cache modules via a ring interconnect. The ring
interconnect is capable of 700 GB/s of aggregate bandwidth.
The ring caches are connected using QPI protocols to the two
on-die memory controllers and a ten port router. The router
The refreshed microarchitecture also allowed a focus on power
efciency. The power aware design of Intel Itanium processor
9500 series was essential to being able to double core count and
operating frequency while simultaneously reducing maximum
package power to achieve a factor of three power efciency
advantage over the previous Itanium processor design.
Figure 1 Intel Itanium processor 9500 series core floorplan.
New microarchitecture features an 11-stage pipeline and
architectural extensions.
[Floorplan block labels: Floating Point Execution, Integer Execution, 1st level Cache, Branch Predict, Interface Logic, Mid-Level Inst. Cache, Mid-Level Data Cache, Pipe Line Control, Instruction Queues, Buffers, Integer Register, Float. Pt RF, BR CTL]