Branches can often be replaced with 128-bit media instructions that simulate predicated execution or conditional moves. Figure 4-10 on
page 115 shows an example of a non-branching sequence that implements a two-way multiplexer.
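For illustration, the following C sketch uses SSE2 compiler intrinsics (rather than the assembly of Figure 4-10) to express the same PAND/PANDN/POR selection pattern; the function name mux128 is hypothetical:

   #include <emmintrin.h>   /* SSE2 intrinsics */

   /* Branchless two-way multiplexer: selects bits of a where the mask
      is 1 and bits of b where it is 0, i.e. (mask & a) | (~mask & b). */
   static __m128i mux128(__m128i mask, __m128i a, __m128i b)
   {
       __m128i t = _mm_and_si128(mask, a);     /* PAND:  mask & a   */
       __m128i f = _mm_andnot_si128(mask, b);  /* PANDN: ~mask & b  */
       return _mm_or_si128(t, f);              /* POR: merge halves */
   }

The mask is typically produced by a vector compare such as PCMPGTD (_mm_cmpgt_epi32), which sets each element to all ones or all zeros.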
Where possible, break long dependency chains into several shorter dependency chains that can be
executed in parallel. This is especially important for floating-point instructions because of their longer
latencies.
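As a hypothetical illustration of breaking a dependency chain, the following C sketch sums an array with four independent ADDPS accumulators instead of one; the function name and the assumptions (n is a multiple of 16, a is 16-byte aligned) are the example's, not the manual's:

   #include <emmintrin.h>

   float sum4(const float *a, int n)
   {
       __m128 s0 = _mm_setzero_ps(), s1 = _mm_setzero_ps();
       __m128 s2 = _mm_setzero_ps(), s3 = _mm_setzero_ps();
       for (int i = 0; i < n; i += 16) {
           /* Four independent chains; each ADDPS waits only on its own
              accumulator, so the additions overlap in the pipeline. */
           s0 = _mm_add_ps(s0, _mm_load_ps(a + i));
           s1 = _mm_add_ps(s1, _mm_load_ps(a + i + 4));
           s2 = _mm_add_ps(s2, _mm_load_ps(a + i + 8));
           s3 = _mm_add_ps(s3, _mm_load_ps(a + i + 12));
       }
       s0 = _mm_add_ps(_mm_add_ps(s0, s1), _mm_add_ps(s2, s3));
       float r[4];
       _mm_storeu_ps(r, s0);   /* horizontal reduction of the final vector */
       return r[0] + r[1] + r[2] + r[3];
   }

With a single accumulator, every ADDPS would have to wait for the previous one, serializing the loop on the instruction's latency.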
4.12.4 Use Streaming Stores
The MOVNTDQ and MASKMOVDQU instructions store streaming (non-temporal) data to memory, and MOVNTDQA performs the corresponding streaming load. These instructions indicate to the processor that the data they reference will be used only once and is therefore not subject to cache-related overhead (such as write-allocation). A typical
case benefitting from streaming stores occurs when data written by the processor is never read by the
processor, such as data written to a graphics frame buffer.
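A minimal C sketch of this case, assuming dst is 16-byte aligned and n is a multiple of 4 (the function name and parameters are illustrative):

   #include <emmintrin.h>

   /* Fill a write-only buffer (for example, a frame buffer) with
      non-temporal stores; the data bypasses the cache, avoiding
      write-allocation for lines the processor never reads back. */
   void fill_buffer(int *dst, int n, int pixel)
   {
       __m128i v = _mm_set1_epi32(pixel);
       for (int i = 0; i < n; i += 4)
           _mm_stream_si128((__m128i *)(dst + i), v);  /* MOVNTDQ */
       _mm_sfence();  /* order the weakly-ordered streaming stores
                         ahead of any subsequent stores */
   }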
4.12.5 Align Data
Data alignment is particularly important for performance when data written by one instruction is read
by a subsequent instruction soon after the write, or when accessing streaming (non-temporal) data.
These cases may occur frequently in 128-bit media procedures.
Accesses to data stored at unaligned locations may benefit from on-the-fly software alignment or from
repetition of data at different alignment boundaries, as required by different loops that process the data.
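The difference alignment makes is visible at the instruction level: MOVDQA (_mm_load_si128) requires a 16-byte-aligned address, while MOVDQU (_mm_loadu_si128) accepts any address at some cost. A small C sketch, with illustrative names:

   #include <emmintrin.h>

   /* Force 16-byte alignment so the aligned-load form can be used.
      _Alignas is C11; aligned_alloc or posix_memalign serve the same
      purpose for heap allocations. */
   _Alignas(16) static const int coeffs[4] = { 1, 2, 3, 4 };

   __m128i load_coeffs(void)
   {
       return _mm_load_si128((const __m128i *)coeffs);  /* MOVDQA */
   }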
4.12.6 Organize Data for Cacheability
Pack small data structures into cache-line-size blocks. Organize frequently accessed constants and
coefficients into cache-line-size blocks and prefetch them. Procedures that access data arranged in
memory-bus-sized blocks, or memory-burst-sized blocks, can make optimum use of the available
memory bandwidth.
For data that will be used only once in a procedure, consider using non-cacheable memory. Accesses to
such memory are not burdened by the overhead of cache protocols.
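As a hypothetical sketch of such packing in C, two coefficient tables that are always used together are placed in one 64-byte block (64 bytes is the cache-line size on typical AMD64 implementations):

   /* Both tables land in a single cache line, so one line fill (or one
      prefetch) brings in everything the inner loop needs. */
   struct coeff_block {
       float filter[8];   /* 32 bytes */
       float window[8];   /* 32 bytes */
   };
   _Alignas(64) static struct coeff_block coeffs;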
4.12.7 Prefetch Data
Media applications typically operate on large data sets. Because of this, they make intensive use of the
memory bus. Memory latency can be substantially reduced—especially for data that will be used only
once—by prefetching such data into various levels of the cache hierarchy. Software can use the
PREFETCHx instructions very effectively in such cases, as described in “Cache and Memory
Management” on page 66.
Some of the best places to use prefetch instructions are inside loops that process large amounts of data.
If the loop goes through less than one cache line of data per iteration, partially unroll the loop. Try to
use virtually all of the prefetched data. This usually requires unit-stride memory accesses—those in
which all accesses are to contiguous memory locations. Exactly one PREFETCHx instruction per cache line should be used.
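A hypothetical C sketch of such a loop, processing one 64-byte cache line (sixteen floats) per iteration and issuing one prefetch per line; the 256-byte prefetch distance is an assumption to be tuned for the target processor:

   #include <emmintrin.h>

   /* dst and src are assumed 16-byte aligned; n is a multiple of 16. */
   void scale(float *dst, const float *src, int n, float k)
   {
       __m128 vk = _mm_set1_ps(k);
       for (int i = 0; i < n; i += 16) {  /* 16 floats = one cache line */
           /* One PREFETCHx per line; prefetches never fault, so running
              a little past the end of the array is harmless. */
           _mm_prefetch((const char *)(src + i) + 256, _MM_HINT_T0);
           _mm_store_ps(dst + i,      _mm_mul_ps(vk, _mm_load_ps(src + i)));
           _mm_store_ps(dst + i + 4,  _mm_mul_ps(vk, _mm_load_ps(src + i + 4)));
           _mm_store_ps(dst + i + 8,  _mm_mul_ps(vk, _mm_load_ps(src + i + 8)));
           _mm_store_ps(dst + i + 12, _mm_mul_ps(vk, _mm_load_ps(src + i + 12)));
       }
   }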