Branches can often be replaced with 128-bit media instructions that simulate predicated execution or conditional moves. Figure 4-10 on
page 115 shows an example of a non-branching sequence that implements a two-way multiplexer.
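For illustration, the following C sketch uses SSE2 compiler intrinsics (rather than the assembly of Figure 4-10) to express the same PAND/PANDN/POR selection pattern; the function name mux128 is hypothetical:

   #include <emmintrin.h>   /* SSE2 intrinsics */

   /* Branchless two-way multiplexer: selects bits of a where the mask
      is 1 and bits of b where it is 0, i.e. (mask & a) | (~mask & b). */
   static __m128i mux128(__m128i mask, __m128i a, __m128i b)
   {
       __m128i t = _mm_and_si128(mask, a);     /* PAND:  mask & a   */
       __m128i f = _mm_andnot_si128(mask, b);  /* PANDN: ~mask & b  */
       return _mm_or_si128(t, f);              /* POR: merge halves */
   }

The mask is typically produced by a vector compare such as PCMPGTD (_mm_cmpgt_epi32), which sets each element to all ones or all zeros.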
Where possible, break long dependency chains into several shorter dependency chains that can be
executed in parallel. This is especially important for floating-point instructions because of their longer
latencies.
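As a hypothetical illustration of breaking a dependency chain, the following C sketch sums an array with four independent ADDPS accumulators instead of one; the function name and the assumptions (n is a multiple of 16, a is 16-byte aligned) are the example's, not the manual's:

   #include <emmintrin.h>

   float sum4(const float *a, int n)
   {
       __m128 s0 = _mm_setzero_ps(), s1 = _mm_setzero_ps();
       __m128 s2 = _mm_setzero_ps(), s3 = _mm_setzero_ps();
       for (int i = 0; i < n; i += 16) {
           /* Four independent chains; each ADDPS waits only on its own
              accumulator, so the additions overlap in the pipeline. */
           s0 = _mm_add_ps(s0, _mm_load_ps(a + i));
           s1 = _mm_add_ps(s1, _mm_load_ps(a + i + 4));
           s2 = _mm_add_ps(s2, _mm_load_ps(a + i + 8));
           s3 = _mm_add_ps(s3, _mm_load_ps(a + i + 12));
       }
       s0 = _mm_add_ps(_mm_add_ps(s0, s1), _mm_add_ps(s2, s3));
       float r[4];
       _mm_storeu_ps(r, s0);   /* horizontal reduction of the final vector */
       return r[0] + r[1] + r[2] + r[3];
   }

With a single accumulator, every ADDPS would have to wait for the previous one, serializing the loop on the instruction's latency.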
4.12.4 Use Streaming Stores
The MOVNTDQ and MASKMOVDQU instructions store streaming (non-temporal) data to memory, and MOVNTDQA performs the corresponding streaming load. These instructions indicate to the processor that the data they reference will be used only once and is therefore not subject to cache-related overhead (such as write-allocation). A typical
case benefitting from streaming stores occurs when data written by the processor is never read by the
processor, such as data written to a graphics frame buffer.
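A minimal C sketch of this case, assuming dst is 16-byte aligned and n is a multiple of 4 (the function name and parameters are illustrative):

   #include <emmintrin.h>

   /* Fill a write-only buffer (for example, a frame buffer) with
      non-temporal stores; the data bypasses the cache, avoiding
      write-allocation for lines the processor never reads back. */
   void fill_buffer(int *dst, int n, int pixel)
   {
       __m128i v = _mm_set1_epi32(pixel);
       for (int i = 0; i < n; i += 4)
           _mm_stream_si128((__m128i *)(dst + i), v);  /* MOVNTDQ */
       _mm_sfence();  /* order the weakly-ordered streaming stores
                         ahead of any subsequent stores */
   }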
4.12.5 Align Data
Data alignment is particularly important for performance when data written by one instruction is read
by a subsequent instruction soon after the write, or when accessing streaming (non-temporal) data.
These cases may occur frequently in 128-bit media procedures.
Accesses to data stored at unaligned locations may benefit from on-the-fly software alignment or from
repetition of data at different alignment boundaries, as required by different loops that process the data.
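The difference alignment makes is visible at the instruction level: MOVDQA (_mm_load_si128) requires a 16-byte-aligned address, while MOVDQU (_mm_loadu_si128) accepts any address at some cost. A small C sketch, with illustrative names:

   #include <emmintrin.h>

   /* Force 16-byte alignment so the aligned-load form can be used.
      _Alignas is C11; aligned_alloc or posix_memalign serve the same
      purpose for heap allocations. */
   _Alignas(16) static const int coeffs[4] = { 1, 2, 3, 4 };

   __m128i load_coeffs(void)
   {
       return _mm_load_si128((const __m128i *)coeffs);  /* MOVDQA */
   }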
4.12.6 Organize Data for Cacheability
Pack small data structures into cache-line-size blocks. Organize frequently accessed constants and
coefficients into cache-line-size blocks and prefetch them. Procedures that access data arranged in
memory-bus-sized blocks, or memory-burst-sized blocks, can make optimum use of the available
memory bandwidth.
For data that will be used only once in a procedure, consider using non-cacheable memory. Accesses to
such memory are not burdened by the overhead of cache protocols.
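As a hypothetical sketch of such packing in C, two coefficient tables that are always used together are placed in one 64-byte block (64 bytes is the cache-line size on typical AMD64 implementations):

   /* Both tables land in a single cache line, so one line fill (or one
      prefetch) brings in everything the inner loop needs. */
   struct coeff_block {
       float filter[8];   /* 32 bytes */
       float window[8];   /* 32 bytes */
   };
   _Alignas(64) static struct coeff_block coeffs;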
4.12.7 Prefetch Data
Media applications typically operate on large data sets. Because of this, they make intensive use of the
memory bus. Memory latency can be substantially reduced—especially for data that will be used only
once—by prefetching such data into various levels of the cache hierarchy. Software can use the
PREFETCHx instructions very effectively in such cases, as described in “Cache and Memory
Management” on page 66.
Some of the best places to use prefetch instructions are inside loops that process large amounts of data.
If the loop goes through less than one cache line of data per iteration, partially unroll the loop. Try to
use virtually all of the prefetched data. This usually requires unit-stride memory accesses—those in
which all accesses are to contiguous memory locations. Exactly one PREFETCHx instruction per cache line should be used.
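A hypothetical C sketch of such a loop, processing one 64-byte cache line (sixteen floats) per iteration and issuing one prefetch per line; the 256-byte prefetch distance is an assumption to be tuned for the target processor:

   #include <emmintrin.h>

   /* dst and src are assumed 16-byte aligned; n is a multiple of 16. */
   void scale(float *dst, const float *src, int n, float k)
   {
       __m128 vk = _mm_set1_ps(k);
       for (int i = 0; i < n; i += 16) {  /* 16 floats = one cache line */
           /* One PREFETCHx per line; prefetches never fault, so running
              a little past the end of the array is harmless. */
           _mm_prefetch((const char *)(src + i) + 256, _MM_HINT_T0);
           _mm_store_ps(dst + i,      _mm_mul_ps(vk, _mm_load_ps(src + i)));
           _mm_store_ps(dst + i + 4,  _mm_mul_ps(vk, _mm_load_ps(src + i + 4)));
           _mm_store_ps(dst + i + 8,  _mm_mul_ps(vk, _mm_load_ps(src + i + 8)));
           _mm_store_ps(dst + i + 12, _mm_mul_ps(vk, _mm_load_ps(src + i + 12)));
       }
   }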