User Guide

128-Bit Media and Scientific Programming 189

24592—Rev. 3.15—November 2009 AMD64 Technology

128-bit media procedure accesses an MMX register by means of a data-transfer or data-conversion instruction.

In such cases, software should separate such procedures or dynamic link libraries (DLLs) from x87 floating-point procedures or DLLs by clearing the MMX state with the EMMS instruction, as described in “Exit Media State” on page 209. For further details, see “Mixing Media Code with x87 Code” on page 233.

4.12 Performance Considerations

In addition to typical code optimization techniques, such as those affecting loops and the inlining of function calls, the following considerations may help improve the performance of application programs written with 128-bit media instructions.

These are implementation-independent performance considerations. Other considerations depend on the hardware implementation. For information about such implementation-dependent considerations and for more information about application performance in general, see the data sheets and the software-optimization guides relating to particular hardware implementations.

4.12.1 Use Small Operand Sizes

The performance advantages available with 128-bit media operations are to some extent a function of the data sizes operated upon. The smaller the data size, the more data elements can be packed into a single 128-bit vector. The parallelism of computation increases as the number of elements per vector increases.
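As a rough illustration of this point, the following C sketch uses SSE2 compiler intrinsics to add two vectors at two different element sizes. It assumes a compiler that provides `<emmintrin.h>` and an SSE2-capable target. The function names (`lanes_per_vector`, `add_u8x16`, `add_i32x4`) are hypothetical and chosen only for this example.

```c
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Number of elements of the given size that fit in one 128-bit vector. */
size_t lanes_per_vector(size_t elem_bytes)
{
    return sizeof(__m128i) / elem_bytes;
}

/* One packed byte add (PADDB) processes sixteen 8-bit elements.
   Each element wraps around on overflow. */
void add_u8x16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* One packed doubleword add (PADDD) processes only four 32-bit elements. */
void add_i32x4(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

Both functions issue a single packed add, but the byte version handles four times as many elements per instruction. Whether the narrower type is usable depends on the value range the application needs.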

4.12.2 Reorganize Data for Parallel Operations

Much of the performance benefit from the 128-bit media instructions comes from the parallelism inherent in vector operations. It can be advantageous to reorganize data before performing arithmetic operations so that its layout after reorganization maximizes the parallelism of the arithmetic operations.

The speed of memory access is particularly important for certain types of computation, such as graphics rendering, that depend on the regularity and locality of data-memory accesses. For example, in matrix operations, performance is high when operating on the rows of the matrix, because row bytes are contiguous in memory, but lower when operating on the columns of the matrix, because column bytes are not contiguous in memory and accessing them can result in cache misses. To improve performance for operations on such columns, the matrix should first be transposed. Such transpositions can, for example, be done using a sequence of unpacking or shuffle instructions.
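The transposition described above can be sketched in C for a 4x4 matrix of single-precision values, using the SSE intrinsics that correspond to UNPCKLPS, UNPCKHPS, and SHUFPS. This is one possible sequence rather than a prescribed one. It assumes a row-major matrix stored as 16 contiguous floats, and the function name `transpose4x4` is hypothetical.

```c
#include <xmmintrin.h>  /* SSE single-precision intrinsics */

/* Transposes a row-major 4x4 float matrix: dst[c*4 + r] = src[r*4 + c]. */
void transpose4x4(const float src[16], float dst[16])
{
    __m128 r0 = _mm_loadu_ps(src + 0);    /* a0 a1 a2 a3 */
    __m128 r1 = _mm_loadu_ps(src + 4);    /* b0 b1 b2 b3 */
    __m128 r2 = _mm_loadu_ps(src + 8);    /* c0 c1 c2 c3 */
    __m128 r3 = _mm_loadu_ps(src + 12);   /* d0 d1 d2 d3 */

    /* Unpack: interleave elements from pairs of rows. */
    __m128 t0 = _mm_unpacklo_ps(r0, r1);  /* a0 b0 a1 b1 */
    __m128 t1 = _mm_unpackhi_ps(r0, r1);  /* a2 b2 a3 b3 */
    __m128 t2 = _mm_unpacklo_ps(r2, r3);  /* c0 d0 c1 d1 */
    __m128 t3 = _mm_unpackhi_ps(r2, r3);  /* c2 d2 c3 d3 */

    /* Shuffle: combine 64-bit halves so each result holds one column. */
    _mm_storeu_ps(dst + 0,  _mm_shuffle_ps(t0, t2, _MM_SHUFFLE(1, 0, 1, 0))); /* a0 b0 c0 d0 */
    _mm_storeu_ps(dst + 4,  _mm_shuffle_ps(t0, t2, _MM_SHUFFLE(3, 2, 3, 2))); /* a1 b1 c1 d1 */
    _mm_storeu_ps(dst + 8,  _mm_shuffle_ps(t1, t3, _MM_SHUFFLE(1, 0, 1, 0))); /* a2 b2 c2 d2 */
    _mm_storeu_ps(dst + 12, _mm_shuffle_ps(t1, t3, _MM_SHUFFLE(3, 2, 3, 2))); /* a3 b3 c3 d3 */
}
```

After the call, each original column occupies four contiguous floats in `dst`, so subsequent column operations can load it with a single 128-bit access instead of four strided ones.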

4.12.3 Remove Branches

Branches can be replaced with 128-bit media instructions that simulate predicated execution or conditional moves, as described in “Branch Removal” on page 114. The branch can be replaced with