
236 64-Bit Media Programming
AMD64 Technology 24592—Rev. 3.15—November 2009
5.15.3 Remove Branches
Branches can be replaced with 64-bit media instructions that simulate predicated execution or
conditional moves, as described in “Branch Removal” on page 198. Where possible, break long
dependency chains into several shorter chains that can be executed in parallel. This is
especially important for floating-point instructions because of their longer latencies.
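As a sketch of the branch-removal idea in portable C: a compare produces an all-ones or all-zeros mask, and AND/AND-NOT/OR combine the two sources, the same per-element pattern the PCMPGT/PAND/PANDN/POR media instructions implement. The function name is illustrative, not from this manual.

```c
#include <stdint.h>

/* Branchless select: returns a if a > b, else b.
   -(a > b) yields an all-ones mask (-1) when the condition holds
   and zero otherwise, so the result is picked with no branch. */
static int32_t select_max(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a > b);   /* all ones if a > b, else 0 */
    return (a & mask) | (b & ~mask);    /* mask chooses a; ~mask chooses b */
}
```

Applied element-wise across a packed register, this removes a data-dependent branch from the loop body entirely, so no mispredictions can occur.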
5.15.4 Align Data
Data alignment is particularly important for performance when data written by one instruction is read
by a subsequent instruction soon after the write, or when accessing streaming (non-temporal) data—
data that will not be reused and therefore should not be cached. These cases may occur frequently in
64-bit media procedures.
Accesses to data stored at unaligned locations may benefit from on-the-fly software alignment or from
repetition of data at different alignment boundaries, as required by different loops that process the data.
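One common form of on-the-fly software alignment is to peel a scalar prologue until the pointer reaches an aligned boundary, then run the main loop on aligned blocks. The sketch below sums a byte buffer this way; the function name and the 8-byte block width are illustrative assumptions, not values from this manual.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sum a byte buffer: handle the unaligned head one byte at a time,
   then read the aligned body in 8-byte words. */
static uint64_t sum_bytes(const uint8_t *p, size_t n)
{
    uint64_t total = 0;

    /* scalar prologue: advance until p is 8-byte aligned */
    while (n > 0 && ((uintptr_t)p & 7) != 0) { total += *p++; n--; }

    /* aligned main loop: one word load per 8 bytes */
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, p, 8);                       /* aligned 8-byte load */
        for (int i = 0; i < 8; i++)
            total += (w >> (8 * i)) & 0xFF;     /* sum the 8 lanes */
        p += 8; n -= 8;
    }

    /* scalar epilogue for the remaining tail bytes */
    while (n--) total += *p++;
    return total;
}
```

The same prologue/body/epilogue structure applies when the body uses media loads such as MOVQ or MOVDQA, which is where alignment actually pays off.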
5.15.5 Organize Data for Cacheability
Pack small data structures into cache-line-size blocks. Organize frequently accessed constants and
coefficients into cache-line-size blocks and prefetch them. Procedures that access data arranged in
memory-bus-sized blocks, or memory-burst-sized blocks, can make optimum use of the available
memory bandwidth.
For data that will be used only once in a procedure, consider using non-cacheable memory. Accesses to
such memory are not burdened by the overhead of cache protocols.
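Packing constants into a cache-line-size block can be expressed directly in C11 with an alignment specifier, so one cache-line fill brings in the whole group. The type name and the 64-byte line size below are assumptions for illustration; the actual line size is implementation-dependent.

```c
#include <stddef.h>

/* Group frequently accessed coefficients into exactly one 64-byte
   cache line: 16 floats x 4 bytes = 64 bytes, aligned to 64. */
typedef struct {
    _Alignas(64) float coeff[16];
} coeff_block_t;
```

Because the block both starts on a line boundary and fills the line exactly, a single prefetch of its address makes every coefficient cache-resident, and no neighboring data is dragged in with it.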
5.15.6 Prefetch Data
Media applications typically operate on large data sets. Because of this, they make intensive use of the
memory bus. Memory latency can be substantially reduced—especially for data that will be used only
once—by prefetching such data into various levels of the cache hierarchy. Software can use the
PREFETCHx instructions very effectively in such cases, as described in “Cache and Memory
Management” on page 66.
Some of the best places to use prefetch instructions are inside loops that process large amounts of data.
If the loop goes through less than one cache line of data per iteration, partially unroll the loop to obtain
multiple iterations of the loop within a cache line. Try to use virtually all of the prefetched data. This
usually requires unit-stride memory accesses—those in which all accesses are to contiguous memory
locations.
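In C, the pattern of prefetching inside a unit-stride loop can be sketched with `__builtin_prefetch`, a GCC/Clang extension that typically compiles to one of the PREFETCHx instructions. The prefetch distance below is a tuning assumption, not a value from this manual; the right distance depends on memory latency and loop cost.

```c
#include <stddef.h>

#define PF_DIST 16  /* elements ahead to prefetch -- tune per platform */

/* Sum an array while prefetching ahead of the access stream.
   Arguments to __builtin_prefetch: address, rw (0 = read),
   locality (0 = use once, do not keep in cache). */
static double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0, 0);
        s += a[i];
    }
    return s;
}
```

The unit-stride access pattern means each prefetched line is fully consumed before the loop moves past it, which is the condition stated above for making good use of prefetched data.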
5.15.7 Retain Intermediate Results in MMX™ Registers
Keep intermediate results in the MMX registers as much as possible, especially if the intermediate
results are used shortly after they have been produced. Avoid spilling intermediate results to memory
and reusing them shortly thereafter.
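The same advice applies at the C level: accumulate into a local variable, which the compiler keeps in a register, instead of storing each partial result through a pointer and reloading it every iteration. The function name is illustrative.

```c
#include <stddef.h>

/* Dot product with the running sum held in a local variable.
   Because acc never escapes, the compiler keeps it in a register;
   writing each partial sum to *out inside the loop would instead
   force a store/reload round trip every iteration. */
static double dot(const double *a, const double *b, size_t n)
{
    double acc = 0.0;                 /* stays in a register */
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;                       /* one store at most, at the end */
}
```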