100 General-Purpose Programming

AMD64 Technology 24592—Rev. 3.15—November 2009

to a PREFETCH. Refer to the Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors, order# 25112, for details relating to a particular processor family, brand or model.

- PREFETCHT0—Prefetches temporal data into the entire cache hierarchy.

- PREFETCHT1—Prefetches temporal data into the second-level (L2) and higher-level caches, but not into the L1 cache.

- PREFETCHT2—Prefetches temporal data into the third-level (L3) and higher-level caches, but not into the L1 or L2 cache.

- PREFETCHNTA—Prefetches non-temporal data into the processor, minimizing cache pollution. The specific technique for minimizing cache pollution is implementation-dependent and can include techniques such as allocating space in a software-invisible buffer, or allocating a cache line in a single cache or in a specific way of a cache.

• PREFETCH—(a 3DNow! instruction) Prefetches read data into the L1 data cache. Data can be written to such a cache line, but doing so can result in additional delay because the processor must signal externally to negotiate the right to change the cache line’s cache-coherency state for the purpose of writing to it.

• PREFETCHW—(a 3DNow! instruction) Prefetches write data into the L1 data cache. Data can be written to the cache line without additional delay, because the data is already prefetched in the modified cache-coherency state. Data can also be read from the cache line without additional delay. However, prefetching write data takes longer than prefetching read data if the processor must wait for another caching master to first write back its modified copy of the requested data to memory before the prefetch request is satisfied.
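As a rough illustration of the read-prefetch hints listed above, the following C sketch sums an array while hinting that data a few cache lines ahead will be read soon. It relies on the GCC/Clang `__builtin_prefetch` builtin rather than on assembly; on x86-64 these compilers typically lower its locality values 3, 2, 1, and 0 to PREFETCHT0, PREFETCHT1, PREFETCHT2, and PREFETCHNTA respectively, though that mapping is a compiler detail, not something this manual defines. The 64-byte line size, the look-ahead distance, and the function name are assumptions chosen for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; the real size is implementation-dependent. */
#define CACHE_LINE_BYTES 64
/* Hypothetical tuning value: how many cache lines ahead to prefetch. */
#define PREFETCH_LINES_AHEAD 4

/* Sum an array while hinting that later elements will be read soon. */
uint64_t sum_with_prefetch(const uint32_t *data, size_t count)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(uint32_t);
    const size_t ahead = PREFETCH_LINES_AHEAD * per_line;
    uint64_t total = 0;

    for (size_t i = 0; i < count; i++) {
        /* One hint per cache line, and only for addresses inside the array. */
        if (i % per_line == 0 && i + ahead < count) {
#if defined(__GNUC__) || defined(__clang__)
            /* rw = 0 (read intent), locality = 3 (keep in all cache levels). */
            __builtin_prefetch(&data[i + ahead], 0, 3);
#endif
        }
        total += data[i];
    }
    return total;
}
```

Changing the last argument to 0 would request the non-temporal variant instead. A prefetch is only a hint, so the function returns the same result whether or not the hint is honored.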

The PREFETCHW instruction provides a hint to the processor that the cache line is to be modified, and is intended for use when the cache line will be written to shortly after the prefetch is performed. The processor can place the cache line in the modified state when it is prefetched, but before it is actually written. Doing so can save time compared to a PREFETCH instruction, followed by a subsequent cache-state change due to a write.
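The write-intent case described above can be sketched in C as follows. The example passes a read/write argument of 1 to the GCC/Clang `__builtin_prefetch` builtin to signal that the line will be modified shortly. Whether the compiler actually emits PREFETCHW for this depends on the compiler version and target options, so treat that as an assumption. The line size, look-ahead distance, and function name are likewise illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines; the real size is implementation-dependent. */
#define CACHE_LINE_BYTES 64
/* Hypothetical tuning value: how many cache lines ahead to prefetch. */
#define PREFETCH_LINES_AHEAD 4

/* Multiply every element in place, hinting that upcoming lines will be written. */
void scale_in_place(uint32_t *data, size_t count, uint32_t factor)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(uint32_t);
    const size_t ahead = PREFETCH_LINES_AHEAD * per_line;

    for (size_t i = 0; i < count; i++) {
        /* One hint per cache line, and only for addresses inside the array. */
        if (i % per_line == 0 && i + ahead < count) {
#if defined(__GNUC__) || defined(__clang__)
            /* rw = 1 (write intent), locality = 3 (keep in all cache levels). */
            __builtin_prefetch(&data[i + ahead], 1, 3);
#endif
        }
        data[i] *= factor;
    }
}
```

The look-ahead of several lines keeps each prefetched address well away from the line currently being written.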

To prevent a false-store dependency from stalling a prefetch instruction, prefetched data should be located at least one cache line away from the address of any surrounding data write. For example, if the cache-line size is 32 bytes, avoid prefetching from data addresses within 32 bytes of the data address in a preceding write instruction.
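The distance rule above amounts to a simple address comparison. The helper below is a minimal sketch of that check. The function name is hypothetical, and it interprets "within one cache line" as an absolute byte distance smaller than the line size, which is one reading of the guideline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns true when prefetch_addr is fewer than line_size bytes away from
 * write_addr, i.e. when the prefetch would violate the spacing guideline.
 */
bool prefetch_too_close_to_write(const void *write_addr,
                                 const void *prefetch_addr,
                                 size_t line_size)
{
    assert(line_size > 0);

    uintptr_t w = (uintptr_t)write_addr;
    uintptr_t p = (uintptr_t)prefetch_addr;
    /* Absolute distance in bytes, computed without signed overflow. */
    uintptr_t distance = (w > p) ? (w - p) : (p - w);

    return distance < line_size;
}
```

With the 32-byte line size from the example in the text, a prefetch 16 bytes past a preceding write is flagged, while one exactly 32 bytes past it is not.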

Non-Temporal Stores. Non-temporal store instructions are provided to prevent memory writes from being stored in the cache, thereby reducing cache pollution. These non-temporal store instructions are specific to the type of register they write:

• GPR Non-Temporal Stores—MOVNTI.

• XMM Non-Temporal Stores—MASKMOVDQU, MOVNTDQ, MOVNTPD, and MOVNTPS.

• MMX Non-Temporal Stores—MASKMOVQ and MOVNTQ.
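As one possible illustration of the GPR case, the sketch below fills a buffer with the `_mm_stream_si32` compiler intrinsic from `<emmintrin.h>`, which corresponds to MOVNTI. The function name is hypothetical. The trailing `_mm_sfence` is an addition not discussed in the passage above: non-temporal stores are weakly ordered, so a store fence is commonly issued before other code depends on the written data. On targets without SSE2 the sketch falls back to ordinary stores so that it still compiles.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_si32 (MOVNTI) and _mm_sfence (SFENCE) */
#endif

/* Fill dst[0..count) with value, bypassing the cache where supported. */
void fill_non_temporal(int *dst, size_t count, int value)
{
    for (size_t i = 0; i < count; i++) {
#if defined(__SSE2__)
        /* Non-temporal 32-bit store from a general-purpose register. */
        _mm_stream_si32(&dst[i], value);
#else
        /* Fallback assumption: a plain cached store on non-SSE2 targets. */
        dst[i] = value;
#endif
    }
#if defined(__SSE2__)
    /* Order the weakly ordered non-temporal stores before later stores. */
    _mm_sfence();
#endif
}
```

This style suits large buffers that will not be read again soon. For data that is reused shortly afterward, ordinary cached stores are usually the better choice.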

Removing Stale Cache Lines. When cache data becomes stale, it occupies space in the cache that could be used to store frequently accessed data. Applications can use the CLFLUSH instruction to free a stale cache line for use by other data. CLFLUSH writes the contents of a cache line to memory and