User Guide

128-Bit Media and Scientific Programming 189
24592—Rev. 3.15—November 2009 AMD64 Technology
128-bit media procedure accesses an MMX register by means of a data-transfer or data-conversion
instruction.
In such cases, software should separate these procedures or dynamic link libraries (DLLs) from x87
floating-point procedures or DLLs by clearing the MMX state with the EMMS instruction, as
described in “Exit Media State” on page 209. For further details, see “Mixing Media Code with x87
Code” on page 233.
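The rule above can be sketched in C with compiler intrinsics. This is an illustrative sketch only: the function name low_dword_via_mmx is hypothetical, the code assumes a GCC- or Clang-style compiler that still exposes the MMX intrinsics, and the compiler is free to select different instructions than the ones named in the comments.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, including XMM-to-MMX transfers */
#include <mmintrin.h>   /* MMX intrinsics, including _mm_empty() */
#include <stdint.h>

/* Hypothetical 128-bit media procedure that touches an MMX register
 * through a data-transfer intrinsic, then clears the MMX state so that
 * x87 floating-point code can safely run afterwards. */
int low_dword_via_mmx(const int32_t v[4])
{
    __m128i x = _mm_loadu_si128((const __m128i *)v);
    __m64 m = _mm_movepi64_pi64(x);  /* corresponds to MOVDQ2Q (XMM to MMX) */
    int lo = _mm_cvtsi64_si32(m);    /* corresponds to MOVD (MMX to GPR) */
    _mm_empty();                     /* corresponds to EMMS */
    return lo;
}
```

Placing the _mm_empty() call at the end of the procedure, rather than at each x87 call site, keeps the clean-up next to the code that dirtied the MMX state.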
4.12 Performance Considerations
In addition to typical code optimization techniques, such as those affecting loops and the inlining of
function calls, the following considerations may help improve the performance of application
programs written with 128-bit media instructions.
These are implementation-independent performance considerations. Other considerations depend on
the hardware implementation. For information about such implementation-dependent considerations
and for more information about application performance in general, see the data sheets and the
software-optimization guides relating to particular hardware implementations.
4.12.1 Use Small Operand Sizes
The performance advantage available with 128-bit media operations is to some extent a function of
the data sizes operated upon. The smaller the data size, the more data elements can be packed into
a single 128-bit vector. The parallelism of computation increases as the number of elements per vector
increases.
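As a rough illustration of this point, the following C sketch performs the same saturating addition on 8-bit and on 16-bit elements using SSE2 intrinsics. The function names add_sat_u8 and add_sat_u16 are hypothetical and not part of any library. Each pass of the byte loop handles 16 elements, while each pass of the word loop handles only 8.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Unsigned saturating add on bytes: 16 elements per 128-bit operation. */
void add_sat_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb));
    }
    for (; i < n; ++i) {  /* scalar tail for the leftover elements */
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(s > 255u ? 255u : s);
    }
}

/* Unsigned saturating add on words: 8 elements per 128-bit operation. */
void add_sat_u16(uint16_t *dst, const uint16_t *a, const uint16_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu16(va, vb));
    }
    for (; i < n; ++i) {  /* scalar tail for the leftover elements */
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint16_t)(s > 65535u ? 65535u : s);
    }
}
```

If the data fits in 8 bits, the byte version needs half as many vector operations for the same element count.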
4.12.2 Reorganize Data for Parallel Operations
Much of the performance benefit from the 128-bit media instructions comes from the parallelism
inherent in vector operations. It can be advantageous to reorganize data before performing arithmetic
operations so that its layout after reorganization maximizes the parallelism of the arithmetic
operations.
The speed of memory access is particularly important for certain types of computation, such as
graphics rendering, that depend on the regularity and locality of data-memory accesses. For example,
in matrix operations, performance is high when operating on the rows of the matrix, because row bytes
are contiguous in memory, but lower when operating on the columns of the matrix, because column
bytes are not contiguous in memory and accessing them can result in cache misses. To improve
performance for operations on such columns, the matrix should first be transposed. Such
transpositions can, for example, be done using a sequence of unpacking or shuffle instructions.
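One possible unpack-and-shuffle sequence for a 4x4 single-precision matrix is sketched below in C with SSE intrinsics. The function name transpose4x4 and the row-major array-of-16 layout are assumptions made for this illustration. The intrinsics correspond to the UNPCKLPS, UNPCKHPS, and SHUFPS instructions.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Transposes a 4x4 row-major matrix of floats: dst[c][r] = src[r][c].
 * All rows are loaded before any store, so dst may alias src. */
void transpose4x4(float dst[16], const float src[16])
{
    __m128 r0 = _mm_loadu_ps(src + 0);    /* a0 a1 a2 a3 */
    __m128 r1 = _mm_loadu_ps(src + 4);    /* b0 b1 b2 b3 */
    __m128 r2 = _mm_loadu_ps(src + 8);    /* c0 c1 c2 c3 */
    __m128 r3 = _mm_loadu_ps(src + 12);   /* d0 d1 d2 d3 */

    /* Interleave pairs of rows. */
    __m128 t0 = _mm_unpacklo_ps(r0, r1);  /* a0 b0 a1 b1 */
    __m128 t1 = _mm_unpackhi_ps(r0, r1);  /* a2 b2 a3 b3 */
    __m128 t2 = _mm_unpacklo_ps(r2, r3);  /* c0 d0 c1 d1 */
    __m128 t3 = _mm_unpackhi_ps(r2, r3);  /* c2 d2 c3 d3 */

    /* Combine the 64-bit halves to form the columns. */
    _mm_storeu_ps(dst + 0,  _mm_shuffle_ps(t0, t2, _MM_SHUFFLE(1, 0, 1, 0))); /* a0 b0 c0 d0 */
    _mm_storeu_ps(dst + 4,  _mm_shuffle_ps(t0, t2, _MM_SHUFFLE(3, 2, 3, 2))); /* a1 b1 c1 d1 */
    _mm_storeu_ps(dst + 8,  _mm_shuffle_ps(t1, t3, _MM_SHUFFLE(1, 0, 1, 0))); /* a2 b2 c2 d2 */
    _mm_storeu_ps(dst + 12, _mm_shuffle_ps(t1, t3, _MM_SHUFFLE(3, 2, 3, 2))); /* a3 b3 c3 d3 */
}
```

After the transpose, each former column occupies four contiguous floats and can be loaded with a single 128-bit access. Larger matrices could be handled by applying the same routine to 4x4 tiles.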
4.12.3 Remove Branches
Branches can be replaced with 128-bit media instructions that simulate predicated execution or
conditional moves, as described in “Branch Removal” on page 114. The branch can be replaced with