User Guide
General-Purpose Programming
24592—Rev. 3.15—November 2009 AMD64 Technology
then invalidates the line in the cache and in all other caches in the cache hierarchy that contain the line.
Once invalidated, the line is available for use by the processor and can be filled with other data.
3.10 Performance Considerations
In addition to typical code optimization techniques, such as those affecting loops and the inlining of
function calls, the following considerations may help improve the performance of application
programs written with general-purpose instructions.
These are implementation-independent performance considerations. Other considerations depend on
the hardware implementation. For information about such implementation-dependent considerations
and for more information about application performance in general, see the data sheets and the
software-optimization guides relating to particular hardware implementations.
3.10.1 Use Large Operand Sizes
Loading, storing, and moving data with the largest relevant operand size maximizes the memory
bandwidth of these instructions.
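As a sketch of this guideline, the copy routine below (an illustrative helper, not taken from the manual) moves the bulk of a buffer with 64-bit loads and stores, one eighth as many memory instructions as a byte-at-a-time loop, and handles the remaining bytes individually:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Copy n bytes using the widest general-purpose operand size (64 bits)
 * for the bulk of the data, then byte copies for the 0-7 byte tail.
 * Illustrative sketch only; the name is not from the manual. */
static void copy_wide(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;

    /* Move eight bytes per load/store pair instead of one. */
    for (; i + 8 <= n; i += 8) {
        uint64_t tmp;
        memcpy(&tmp, s + i, 8);   /* compiles to a single 64-bit load  */
        memcpy(d + i, &tmp, 8);   /* ...and a single 64-bit store      */
    }
    for (; i < n; i++)            /* remaining tail bytes */
        d[i] = s[i];
}
```

The `memcpy` of a fixed 8-byte size is the portable way to express an unaligned 64-bit access; compilers reduce it to one MOV.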
3.10.2 Use Short Instructions
Use the shortest possible form of an instruction (the form with fewest opcode bytes). This increases
the number of instructions that can be decoded at any one time, and it reduces overall code size.
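A common example is zeroing a register. The two forms below have identical results, but the XOR idiom is shorter (the encodings shown are the standard single-byte-opcode forms):

```
31 C0            xor  eax, eax    ; 2 bytes
B8 00 00 00 00   mov  eax, 0      ; 5 bytes
```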
3.10.3 Align Data
Data alignment directly affects memory-access performance. Data alignment is particularly important
when accessing streaming (also called non-temporal) data—data that will not be reused and therefore
should not be cached. Data alignment is also important when data written by one instruction is
read by a subsequent instruction soon after the write.
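One way to guarantee alignment for heap data is to allocate on a cache-line boundary, so that no element straddles a line. The helper below is a sketch using C11 `aligned_alloc`; the 64-byte line size is an assumption for illustration, not an architectural guarantee:

```c
#include <stdlib.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; query the processor for the real value */

/* Return a CACHE_LINE-aligned buffer of at least n bytes (caller frees).
 * aligned_alloc (C11) requires the size to be a multiple of the alignment,
 * so round n up first. */
static void *alloc_cacheline_aligned(size_t n)
{
    size_t rounded = (n + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    return aligned_alloc(CACHE_LINE, rounded);
}
```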
3.10.4 Avoid Branches
Branching can be very time-consuming. If the body of a branch is small, the branch may be
replaceable with conditional move (CMOVcc) instructions, or with 128-bit or 64-bit media
instructions that simulate predicated parallel execution or parallel conditional moves.
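As a small illustration, the select below (a hypothetical helper) chooses between two values without a branch. Compilers typically lower the equivalent ternary expression to a CMOVcc instruction; the mask form makes the branch elimination explicit:

```c
#include <stdint.h>

/* Branch-free select: return a if cond is nonzero, else b.
 * mask is all ones when cond is true and all zeros otherwise,
 * so exactly one operand survives the AND/OR combination. */
static int64_t select_branchless(int cond, int64_t a, int64_t b)
{
    int64_t mask = -(int64_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}
```

The same masking idea extends element-wise to the 64-bit and 128-bit media instructions for parallel conditional moves.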
3.10.5 Prefetch Data
Memory latency can be substantially reduced—especially for data that will be used multiple times—
by prefetching such data into various levels of the cache hierarchy. Software can use the PREFETCHx
instructions very effectively in such cases. One PREFETCHx per cache line should be used.
Some of the best places to use prefetch instructions are inside loops that process large amounts of data.
If the loop goes through less than one cache line of data per iteration, partially unroll the loop. Try to
use virtually all of the prefetched data. This usually requires unit-stride memory accesses—those in
which all accesses are to contiguous memory locations.
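The loop below sketches these points together: it walks the array with unit stride, is partially unrolled so each iteration consumes one full cache line (eight doubles), and issues one prefetch per line a fixed distance ahead. `__builtin_prefetch` is a GCC/Clang builtin standing in for the PREFETCHx instructions; the 64-byte line size and the prefetch distance are tuning assumptions:

```c
#include <stddef.h>

/* Sum a large array, prefetching one cache line ahead of the working set. */
static double sum_with_prefetch(const double *a, size_t n)
{
    enum { DOUBLES_PER_LINE = 8,                     /* 64-byte line assumed   */
           PREFETCH_AHEAD   = 8 * DOUBLES_PER_LINE };/* distance is a tunable  */
    double sum = 0.0;
    size_t i = 0;

    /* Unrolled body: one cache line of data, one prefetch, per iteration. */
    for (; i + DOUBLES_PER_LINE <= n; i += DOUBLES_PER_LINE) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD]);
        for (size_t j = 0; j < DOUBLES_PER_LINE; j++)
            sum += a[i + j];
    }
    for (; i < n; i++)   /* tail elements */
        sum += a[i];
    return sum;
}
```

Because every prefetched line is fully consumed by a later iteration, virtually none of the prefetched data is wasted.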