User Guide

128-Bit Media and Scientific Programming 191
24592—Rev. 3.15—November 2009 AMD64 Technology
cache line must be used. For further details, see the Optimization Guide for AMD Athlon™ 64 and
AMD Opteron™ Processors, order# 25112.
4.12.8 Use 128-Bit Media Code for Moving Data
Movements of data between memory, GPR, XMM, and MMX registers can take advantage of the
parallel vector operations supported by the 128-bit media MOVx instructions. Figure 4-6 on page 111
illustrates the range of move operations available.
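
As an illustration of such moves, the following C sketch uses SSE2 compiler intrinsics. The function names copy_bytes_xmm and roundtrip_gpr_xmm are hypothetical and are not taken from this manual; the sketch assumes a compiler that provides <emmintrin.h> and a processor that supports SSE2. The unaligned load and store intrinsics typically compile to MOVDQU, and the GPR-to-XMM conversions typically compile to MOVD, although the compiler is free to choose other encodings.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy n bytes, 16 bytes at a time, through an XMM
   register. Neither pointer needs to be 16-byte aligned. */
void copy_bytes_xmm(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i)); /* memory -> XMM */
        _mm_storeu_si128((__m128i *)(dst + i), v);               /* XMM -> memory */
    }
    for (; i < n; i++)  /* scalar tail for the remaining 0..15 bytes */
        dst[i] = src[i];
}

/* Hypothetical helper: move a 32-bit value from a GPR into the low
   doubleword of an XMM register and back again. */
int32_t roundtrip_gpr_xmm(int32_t x)
{
    __m128i v = _mm_cvtsi32_si128(x);  /* GPR -> XMM, upper bits zeroed */
    return _mm_cvtsi128_si32(v);       /* XMM -> GPR */
}
```

When both buffers are known to be 16-byte aligned, the aligned forms _mm_load_si128 and _mm_store_si128 can be substituted for the unaligned forms.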
4.12.9 Retain Intermediate Results in XMM Registers
Keep intermediate results in the XMM registers as much as possible, especially if the intermediate
results are used shortly after they have been produced. Avoid spilling intermediate results to memory
and reusing them shortly thereafter. In 64-bit mode, the architecture provides 16 XMM registers, twice
as many as the eight legacy XMM registers.
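
A minimal C sketch of this guideline follows, using SSE intrinsics. The function name sum_floats is hypothetical. The running sum is held in a single __m128 variable, which a compiler would normally keep in an XMM register for the whole loop instead of writing partial sums to memory on each iteration; the final register allocation remains the compiler's decision.

```c
#include <xmmintrin.h>  /* SSE intrinsics */
#include <stddef.h>

/* Hypothetical helper: sum n single-precision values. The four partial
   sums live in one XMM-sized variable until the loop is finished. */
float sum_floats(const float *a, size_t n)
{
    __m128 acc = _mm_setzero_ps();  /* four partial sums, all zero */
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));  /* no store inside the loop */

    /* Reduce the four partial sums to one value, once, after the loop. */
    __m128 hi = _mm_movehl_ps(acc, acc);                 /* (acc2, acc3, acc2, acc3) */
    __m128 s2 = _mm_add_ps(acc, hi);                     /* lanes 0,1: acc0+acc2, acc1+acc3 */
    __m128 s1 = _mm_add_ss(s2, _mm_shuffle_ps(s2, s2, 1)); /* lane 0: full vector sum */
    float total = _mm_cvtss_f32(s1);

    for (; i < n; i++)  /* scalar tail for the remaining 0..3 elements */
        total += a[i];
    return total;
}
```

Because the partial sums are accumulated in a different order than in a purely sequential loop, the result can differ in the last bits for inputs that are not exactly representable.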
4.12.10 Replace GPR Code with 128-Bit Media Code
In 64-bit mode, the AMD64 architecture provides twice as many general-purpose registers (GPRs) as
the legacy x86 architecture, thereby reducing potential pressure on GPRs. Nevertheless,
general-purpose instructions do not operate in parallel on vectors of elements, as do 128-bit media
instructions. Thus, 128-bit media code can perform better than general-purpose code with
algorithms and data that are organized for parallel operations.
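
The difference can be sketched in C as follows. Both function names are hypothetical. The first function adds one pair of 32-bit elements per operation, as general-purpose code does; the second uses the SSE2 intrinsic _mm_add_epi32 (which corresponds to PADDD) to add four pairs per operation. Unsigned elements are used so that overflow wraps identically in both versions. An optimizing compiler may vectorize the first loop on its own, so the comparison is illustrative only.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar version: one 32-bit addition per iteration. */
void add_u32_gpr(uint32_t *out, const uint32_t *a, const uint32_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Hypothetical vector version: four 32-bit additions per iteration. */
void add_u32_xmm(uint32_t *out, const uint32_t *a, const uint32_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; i++)  /* scalar tail for the remaining 0..3 elements */
        out[i] = a[i] + b[i];
}
```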
4.12.11 Replace x87 Code with 128-Bit Media Code
One of the most useful advantages of 128-bit media instructions is the ability to intermix integer and
floating-point instructions in the same procedure, using a register set that is separate from the GPR,
MMX, and x87 register sets. Code written with 128-bit media floating-point instructions can operate
in parallel on four times as many single-precision floating-point operands as can x87 floating-point
code, potentially performing four times the computational work of x87 instructions that take
single-precision operands. Also, the higher density of 128-bit media floating-point operands may make it
possible to remove local temporary variables that would otherwise be needed in x87 floating-point
code. 128-bit media code is also easier to write than x87 floating-point code, because the XMM
register file is flat, rather than stack-oriented, and in 64-bit mode there are twice as many XMM
registers as x87 registers. Moreover, when integer and floating-point instructions must be used
together, 128-bit media floating-point instructions avoid the potential need to save and restore state
between integer operations and floating-point procedures.
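
The following C sketch shows integer and floating-point operations intermixed in one procedure, four elements at a time, using SSE2 intrinsics. The function name offset_and_scale and its parameters are hypothetical. It assumes that in[i] + offset does not overflow a 32-bit signed integer and that the results are exactly representable in single precision.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also provides the SSE set) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = (float)(in[i] + offset) * scale.
   Assumes in[i] + offset does not overflow int32_t. */
void offset_and_scale(float *out, const int32_t *in, int32_t offset,
                      float scale, size_t n)
{
    __m128i voff = _mm_set1_epi32(offset);  /* offset in all four integer lanes */
    __m128 vscale = _mm_set1_ps(scale);     /* scale in all four float lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i vi = _mm_loadu_si128((const __m128i *)(in + i));
        vi = _mm_add_epi32(vi, voff);                    /* integer add (PADDD) */
        __m128 vf = _mm_cvtepi32_ps(vi);                 /* int32 -> float (CVTDQ2PS) */
        _mm_storeu_ps(out + i, _mm_mul_ps(vf, vscale));  /* float multiply (MULPS) */
    }
    for (; i < n; i++)  /* scalar tail for the remaining 0..3 elements */
        out[i] = (float)(in[i] + offset) * scale;
}
```

The integer addition, the conversion, and the floating-point multiplication all operate on XMM registers, so no state needs to be saved or restored between the integer and floating-point steps.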