
236 64-Bit Media Programming
AMD64 Technology 24592—Rev. 3.15—November 2009
5.15.3 Remove Branches
Branches can be replaced with 64-bit media instructions that simulate predicated execution or
conditional moves, as described in “Branch Removal” on page 198. Where possible, break long
dependency chains into several shorter chains that can be executed in parallel. This is
especially important for floating-point instructions because of their longer latencies.
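As a sketch of the branch-removal idea in portable C: a compare produces an all-ones or all-zeros mask, and AND/AND-NOT/OR combine the two sources, the same per-element pattern the PCMPGT/PAND/PANDN/POR media instructions implement. The function name is illustrative, not from this manual.

```c
#include <stdint.h>

/* Branchless select: returns a if a > b, else b.
   -(a > b) yields an all-ones mask (-1) when the condition holds
   and zero otherwise, so the result is picked with no branch. */
static int32_t select_max(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a > b);   /* all ones if a > b, else 0 */
    return (a & mask) | (b & ~mask);    /* mask chooses a; ~mask chooses b */
}
```

Applied element-wise across a packed register, this removes a data-dependent branch from the loop body entirely, so no mispredictions can occur.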
5.15.4 Align Data
Data alignment is particularly important for performance when data written by one instruction is read
by a subsequent instruction soon after the write, or when accessing streaming (non-temporal) data—
data that will not be reused and therefore should not be cached. These cases may occur frequently in
64-bit media procedures.
Accesses to data stored at unaligned locations may benefit from on-the-fly software alignment or from
repetition of data at different alignment boundaries, as required by different loops that process the data.
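One common form of on-the-fly software alignment is to peel a scalar prologue until the pointer reaches an aligned boundary, then run the main loop on aligned blocks. The sketch below sums a byte buffer this way; the function name and the 8-byte block width are illustrative assumptions, not values from this manual.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sum a byte buffer: handle the unaligned head one byte at a time,
   then read the aligned body in 8-byte words. */
static uint64_t sum_bytes(const uint8_t *p, size_t n)
{
    uint64_t total = 0;

    /* scalar prologue: advance until p is 8-byte aligned */
    while (n > 0 && ((uintptr_t)p & 7) != 0) { total += *p++; n--; }

    /* aligned main loop: one word load per 8 bytes */
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, p, 8);                       /* aligned 8-byte load */
        for (int i = 0; i < 8; i++)
            total += (w >> (8 * i)) & 0xFF;     /* sum the 8 lanes */
        p += 8; n -= 8;
    }

    /* scalar epilogue for the remaining tail bytes */
    while (n--) total += *p++;
    return total;
}
```

The same prologue/body/epilogue structure applies when the body uses media loads such as MOVQ or MOVDQA, which is where alignment actually pays off.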
5.15.5 Organize Data for Cacheability
Pack small data structures into cache-line-size blocks. Organize frequently accessed constants and
coefficients into cache-line-size blocks and prefetch them. Procedures that access data arranged in
memory-bus-sized blocks, or memory-burst-sized blocks, can make optimum use of the available
memory bandwidth.
For data that will be used only once in a procedure, consider using non-cacheable memory. Accesses to
such memory are not burdened by the overhead of cache protocols.
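Packing constants into a cache-line-size block can be expressed directly in C11 with an alignment specifier, so one cache-line fill brings in the whole group. The type name and the 64-byte line size below are assumptions for illustration; the actual line size is implementation-dependent.

```c
#include <stddef.h>

/* Group frequently accessed coefficients into exactly one 64-byte
   cache line: 16 floats x 4 bytes = 64 bytes, aligned to 64. */
typedef struct {
    _Alignas(64) float coeff[16];
} coeff_block_t;
```

Because the block both starts on a line boundary and fills the line exactly, a single prefetch of its address makes every coefficient cache-resident, and no neighboring data is dragged in with it.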
5.15.6 Prefetch Data
Media applications typically operate on large data sets. Because of this, they make intensive use of the
memory bus. Memory latency can be substantially reduced—especially for data that will be used only
once—by prefetching such data into various levels of the cache hierarchy. Software can use the
PREFETCHx instructions very effectively in such cases, as described in “Cache and Memory
Management” on page 66.
Some of the best places to use prefetch instructions are inside loops that process large amounts of data.
If the loop goes through less than one cache line of data per iteration, partially unroll the loop to obtain
multiple iterations of the loop within a cache line. Try to use virtually all of the prefetched data. This
usually requires unit-stride memory accesses—those in which all accesses are to contiguous memory
locations.
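In C, the pattern of prefetching inside a unit-stride loop can be sketched with `__builtin_prefetch`, a GCC/Clang extension that typically compiles to one of the PREFETCHx instructions. The prefetch distance below is a tuning assumption, not a value from this manual; the right distance depends on memory latency and loop cost.

```c
#include <stddef.h>

#define PF_DIST 16  /* elements ahead to prefetch -- tune per platform */

/* Sum an array while prefetching ahead of the access stream.
   Arguments to __builtin_prefetch: address, rw (0 = read),
   locality (0 = use once, do not keep in cache). */
static double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0, 0);
        s += a[i];
    }
    return s;
}
```

The unit-stride access pattern means each prefetched line is fully consumed before the loop moves past it, which is the condition stated above for making good use of prefetched data.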
5.15.7 Retain Intermediate Results in MMX™ Registers
Keep intermediate results in the MMX registers as much as possible, especially if the intermediate
results are used shortly after they have been produced. Avoid spilling intermediate results to memory
and reusing them shortly thereafter.
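The same advice applies at the C level: accumulate into a local variable, which the compiler keeps in a register, instead of storing each partial result through a pointer and reloading it every iteration. The function name is illustrative.

```c
#include <stddef.h>

/* Dot product with the running sum held in a local variable.
   Because acc never escapes, the compiler keeps it in a register;
   writing each partial sum to *out inside the loop would instead
   force a store/reload round trip every iteration. */
static double dot(const double *a, const double *b, size_t n)
{
    double acc = 0.0;                 /* stays in a register */
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;                       /* one store at most, at the end */
}
```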