User Guide
General-Purpose Programming
24592—Rev. 3.15—November 2009 AMD64 Technology
then invalidates the line in the cache and in all other caches in the cache hierarchy that contain the line.
Once invalidated, the line is available for use by the processor and can be filled with other data.
3.10 Performance Considerations
In addition to typical code optimization techniques, such as those affecting loops and the inlining of
function calls, the following considerations may help improve the performance of application
programs written with general-purpose instructions.
These are implementation-independent performance considerations. Other considerations depend on
the hardware implementation. For information about such implementation-dependent considerations
and for more information about application performance in general, see the data sheets and the
software-optimization guides relating to particular hardware implementations.
3.10.1 Use Large Operand Sizes
Loading, storing, and moving data with the largest relevant operand size maximizes the memory
bandwidth of these instructions.
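As a sketch of this guideline, the copy routine below (an illustrative helper, not taken from the manual) moves the bulk of a buffer with 64-bit loads and stores, one eighth as many memory instructions as a byte-at-a-time loop, and handles the remaining bytes individually:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Copy n bytes using the widest general-purpose operand size (64 bits)
 * for the bulk of the data, then byte copies for the 0-7 byte tail.
 * Illustrative sketch only; the name is not from the manual. */
static void copy_wide(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;

    /* Move eight bytes per load/store pair instead of one. */
    for (; i + 8 <= n; i += 8) {
        uint64_t tmp;
        memcpy(&tmp, s + i, 8);   /* compiles to a single 64-bit load  */
        memcpy(d + i, &tmp, 8);   /* ...and a single 64-bit store      */
    }
    for (; i < n; i++)            /* remaining tail bytes */
        d[i] = s[i];
}
```

The `memcpy` of a fixed 8-byte size is the portable way to express an unaligned 64-bit access; compilers reduce it to one MOV.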
3.10.2 Use Short Instructions
Use the shortest possible form of an instruction (the form with fewest opcode bytes). This increases
the number of instructions that can be decoded at any one time, and it reduces overall code size.
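A common example is zeroing a register. The two forms below have identical results, but the XOR idiom is shorter (the encodings shown are the standard single-byte-opcode forms):

```
31 C0            xor  eax, eax    ; 2 bytes
B8 00 00 00 00   mov  eax, 0      ; 5 bytes
```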
3.10.3 Align Data
Data alignment directly affects memory-access performance. Data alignment is particularly important
when accessing streaming (also called non-temporal) data—data that will not be reused and therefore
should not be cached. Data alignment is also important when data written by one instruction is
read by a subsequent instruction soon after the write.
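One way to guarantee alignment for heap data is to allocate on a cache-line boundary, so that no element straddles a line. The helper below is a sketch using C11 `aligned_alloc`; the 64-byte line size is an assumption for illustration, not an architectural guarantee:

```c
#include <stdlib.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; query the processor for the real value */

/* Return a CACHE_LINE-aligned buffer of at least n bytes (caller frees).
 * aligned_alloc (C11) requires the size to be a multiple of the alignment,
 * so round n up first. */
static void *alloc_cacheline_aligned(size_t n)
{
    size_t rounded = (n + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    return aligned_alloc(CACHE_LINE, rounded);
}
```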
3.10.4 Avoid Branches
Branching can be very time-consuming. If the body of a branch is small, the branch may be
replaceable with conditional move (CMOVcc) instructions, or with 128-bit or 64-bit media
instructions that simulate predicated parallel execution or parallel conditional moves.
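As a small illustration, the select below (a hypothetical helper) chooses between two values without a branch. Compilers typically lower the equivalent ternary expression to a CMOVcc instruction; the mask form makes the branch elimination explicit:

```c
#include <stdint.h>

/* Branch-free select: return a if cond is nonzero, else b.
 * mask is all ones when cond is true and all zeros otherwise,
 * so exactly one operand survives the AND/OR combination. */
static int64_t select_branchless(int cond, int64_t a, int64_t b)
{
    int64_t mask = -(int64_t)(cond != 0);
    return (a & mask) | (b & ~mask);
}
```

The same masking idea extends element-wise to the 64-bit and 128-bit media instructions for parallel conditional moves.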
3.10.5 Prefetch Data
Memory latency can be substantially reduced—especially for data that will be used multiple times—
by prefetching such data into various levels of the cache hierarchy. Software can use the PREFETCHx
instructions very effectively in such cases. One PREFETCHx per cache line should be used.
Some of the best places to use prefetch instructions are inside loops that process large amounts of data.
If the loop goes through less than one cache line of data per iteration, partially unroll the loop. Try to
use virtually all of the prefetched data. This usually requires unit-stride memory accesses—those in
which all accesses are to contiguous memory locations.
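The loop below sketches these points together: it walks the array with unit stride, is partially unrolled so each iteration consumes one full cache line (eight doubles), and issues one prefetch per line a fixed distance ahead. `__builtin_prefetch` is a GCC/Clang builtin standing in for the PREFETCHx instructions; the 64-byte line size and the prefetch distance are tuning assumptions:

```c
#include <stddef.h>

/* Sum a large array, prefetching one cache line ahead of the working set. */
static double sum_with_prefetch(const double *a, size_t n)
{
    enum { DOUBLES_PER_LINE = 8,                     /* 64-byte line assumed   */
           PREFETCH_AHEAD   = 8 * DOUBLES_PER_LINE };/* distance is a tunable  */
    double sum = 0.0;
    size_t i = 0;

    /* Unrolled body: one cache line of data, one prefetch, per iteration. */
    for (; i + DOUBLES_PER_LINE <= n; i += DOUBLES_PER_LINE) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD]);
        for (size_t j = 0; j < DOUBLES_PER_LINE; j++)
            sum += a[i + j];
    }
    for (; i < n; i++)   /* tail elements */
        sum += a[i];
    return sum;
}
```

Because every prefetched line is fully consumed by a later iteration, virtually none of the prefetched data is wasted.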