User Guide

64-Bit Media Programming 235

24592—Rev. 3.15—November 2009 AMD64 Technology

Unlike FSAVE and FNSAVE, however, FXSAVE does not alter the tag bits (thus, it does not perform

the state-clearing function of EMMS or FEMMS). The tag state of the MMX and x87 registers is

retained after the save, so the registers may still be marked as valid (or as whatever other state the tag bits

indicated prior to the save). To invalidate the contents of the MMX and x87 registers after FXSAVE,

software must explicitly execute an FINIT instruction. Also, FXSAVE (like FNSAVE) and FXRSTOR

do not check for pending unmasked x87 floating-point exceptions. An FWAIT instruction can be used

to perform that check.

For details about the FXSAVE and FXRSTOR memory formats, see “Media and x87 Processor State”

in Volume 2.

5.15 Performance Considerations

In addition to typical code optimization techniques, such as those affecting loops and the inlining of

function calls, the following considerations may help improve the performance of application

programs written with 64-bit media instructions.

These are implementation-independent performance considerations. Other considerations depend on

the hardware implementation. For information about such implementation-dependent considerations

and for more information about application performance in general, see the data sheets and the

software-optimization guides relating to particular hardware implementations.

5.15.1 Use Small Operand Sizes

The performance advantages available with 64-bit media operations are to some extent a function of the

data sizes operated upon. The smaller the data size, the more data elements can be packed into a

single 64-bit vector. The parallelism of computation increases as the number of elements per vector

increases.
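
The relationship between element size and parallelism can be illustrated with a minimal sketch in portable C. It does not use the 64-bit media instructions themselves: a uint64_t stands in for a 64-bit vector, and packed_add_bytes is a software emulation of a packed byte add with wraparound (the operation PADDB performs). The function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of elements of the given width (in bits) that fit in one 64-bit vector. */
unsigned elements_per_vector(unsigned element_bits)
{
    /* Assumption: only the 8-, 16-, and 32-bit packed integer sizes are considered. */
    assert(element_bits == 8 || element_bits == 16 || element_bits == 32);
    return 64u / element_bits;
}

/* Emulated packed byte add with wraparound: eight independent 8-bit additions
 * carried out on one 64-bit value, with no carry crossing a byte boundary. */
uint64_t packed_add_bytes(uint64_t a, uint64_t b)
{
    const uint64_t high_bits = 0x8080808080808080ULL;

    /* Add the low 7 bits of every byte; the sum of two 7-bit values fits in 8 bits. */
    uint64_t low_sum = (a & ~high_bits) + (b & ~high_bits);

    /* Fold the top bit of each byte back in without generating a carry. */
    return low_sum ^ ((a ^ b) & high_bits);
}
```

With byte elements, one operation on a 64-bit vector performs eight additions; with doubleword elements, the same vector width yields only two.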

5.15.2 Reorganize Data for Parallel Operations

Much of the performance benefit from the 64-bit media instructions comes from the parallelism

inherent in vector operations. It can be advantageous to reorganize data before performing arithmetic

operations so that its layout after reorganization maximizes the parallelism of the arithmetic

operations.
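
One common form of such reorganization is separating interleaved fields into their own arrays, so that like elements sit next to each other and can be loaded together into a vector. The following is a minimal sketch in portable C under assumed names: the point16 structure and the split_points function are hypothetical and serve only to illustrate the layout change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interleaved layout: x and y alternate in memory. */
struct point16 {
    int16_t x;
    int16_t y;
};

/* Copy the interleaved points into two separate arrays. After the split,
 * four consecutive x values (or four y values) occupy 64 contiguous bits
 * and can be processed as one vector of 16-bit words. */
void split_points(const struct point16 *points, size_t count,
                  int16_t *xs, int16_t *ys)
{
    assert(points != NULL && xs != NULL && ys != NULL);

    for (size_t i = 0; i < count; i++) {
        xs[i] = points[i].x;
        ys[i] = points[i].y;
    }
}
```

Whether the cost of the extra pass is repaid depends on how much arithmetic is subsequently performed on the reorganized data.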

The speed of memory access is particularly important for certain types of computation, such as

graphics rendering, that depend on the regularity and locality of data-memory accesses. For example,

in operations on a matrix stored in row-major order, performance is high when operating on rows, because row bytes

are contiguous in memory, but lower when operating on columns, because column

bytes are not contiguous in memory and accessing them can result in cache misses. To improve

performance for operations on such columns, the matrix should first be transposed. Such

transpositions can, for example, be done using a sequence of unpacking or shuffle instructions.
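
As an illustration, a 4x4 matrix of 16-bit words can be transposed with two rounds of unpack operations. The sketch below is written in portable C and emulates the behavior of PUNPCKLWD, PUNPCKHWD, PUNPCKLDQ, and PUNPCKHDQ on uint64_t values instead of issuing the instructions themselves. All function names are hypothetical, and word 0 of each vector is assumed to occupy the least-significant bits.

```c
#include <assert.h>
#include <stdint.h>

/* Build a 64-bit vector from four 16-bit words; w0 is least significant. */
uint64_t pack_words(uint64_t w0, uint64_t w1, uint64_t w2, uint64_t w3)
{
    assert((w0 | w1 | w2 | w3) <= 0xFFFFu);
    return w0 | (w1 << 16) | (w2 << 32) | (w3 << 48);
}

/* Extract word i (0 = least significant) from a 64-bit vector. */
uint64_t get_word(uint64_t v, unsigned i)
{
    return (v >> (16u * i)) & 0xFFFFu;
}

/* Emulates PUNPCKLWD: interleave the two low words of a and b. */
uint64_t unpack_low_words(uint64_t a, uint64_t b)
{
    return pack_words(get_word(a, 0), get_word(b, 0), get_word(a, 1), get_word(b, 1));
}

/* Emulates PUNPCKHWD: interleave the two high words of a and b. */
uint64_t unpack_high_words(uint64_t a, uint64_t b)
{
    return pack_words(get_word(a, 2), get_word(b, 2), get_word(a, 3), get_word(b, 3));
}

/* Emulates PUNPCKLDQ: low doubleword of a, then low doubleword of b. */
uint64_t unpack_low_dwords(uint64_t a, uint64_t b)
{
    return (a & 0xFFFFFFFFULL) | (b << 32);
}

/* Emulates PUNPCKHDQ: high doubleword of a, then high doubleword of b. */
uint64_t unpack_high_dwords(uint64_t a, uint64_t b)
{
    return (a >> 32) | (b & 0xFFFFFFFF00000000ULL);
}

/* Transpose a 4x4 matrix of 16-bit words held as four row vectors. */
void transpose_4x4_words(const uint64_t rows[4], uint64_t cols[4])
{
    /* Round 1: interleave words of adjacent rows (a, b, c, d = rows 0..3). */
    uint64_t t0 = unpack_low_words(rows[0], rows[1]);   /* a0 b0 a1 b1 */
    uint64_t t1 = unpack_low_words(rows[2], rows[3]);   /* c0 d0 c1 d1 */
    uint64_t t2 = unpack_high_words(rows[0], rows[1]);  /* a2 b2 a3 b3 */
    uint64_t t3 = unpack_high_words(rows[2], rows[3]);  /* c2 d2 c3 d3 */

    /* Round 2: interleave doublewords to complete each column. */
    cols[0] = unpack_low_dwords(t0, t1);   /* a0 b0 c0 d0 */
    cols[1] = unpack_high_dwords(t0, t1);  /* a1 b1 c1 d1 */
    cols[2] = unpack_low_dwords(t2, t3);   /* a2 b2 c2 d2 */
    cols[3] = unpack_high_dwords(t2, t3);  /* a3 b3 c3 d3 */
}
```

After the transposition, each former column is a single contiguous 64-bit vector, so column operations can proceed with the same access pattern as row operations.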