User Guide
100 General-Purpose Programming
AMD64 Technology 24592—Rev. 3.15—November 2009
to a PREFETCH. Refer to the Optimization Guide for AMD Athlon™ 64 and AMD Opteron™
Processors, order# 25112, for details relating to a particular processor family, brand or model.
- PREFETCHT0—Prefetches temporal data into the entire cache hierarchy.
- PREFETCHT1—Prefetches temporal data into the second-level (L2) and higher-level caches,
but not into the L1 cache.
- PREFETCHT2—Prefetches temporal data into the third-level (L3) and higher-level caches,
but not into the L1 or L2 cache.
- PREFETCHNTA—Prefetches non-temporal data into the processor, minimizing cache
pollution. The specific technique for minimizing cache pollution is implementation-dependent
and can include allocating space in a software-invisible buffer, or allocating a cache line in a
single cache or in a specific way of a cache.
• PREFETCH—(a 3DNow! instruction) Prefetches read data into the L1 data cache. Data can be
written to such a cache line, but doing so can result in additional delay because the processor must
signal externally to negotiate the right to change the cache line’s cache-coherency state for the
purpose of writing to it.
• PREFETCHW—(a 3DNow! instruction) Prefetches write data into the L1 data cache. Data can be
written to the cache line without additional delay, because the data is already prefetched in the
modified cache-coherency state. Data can also be read from the cache line without additional delay.
However, prefetching write data takes longer than prefetching read data if the processor must wait
for another caching master to first write back its modified copy of the requested data to memory
before the prefetch request is satisfied.
The PREFETCHW instruction provides a hint to the processor that the cache line is to be modified,
and is intended for use when the cache line will be written to shortly after the prefetch is performed.
The processor can place the cache line in the modified state when it is prefetched, before it is
actually written. Doing so can save time compared to issuing a PREFETCH instruction and then
changing the cache-coherency state when the write occurs.
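The prefetch hints above can be sketched in C. This is a minimal illustration, not a tuned implementation. It assumes a GCC- or Clang-compatible compiler, whose __builtin_prefetch(addr, rw, locality) built-in typically lowers on x86 targets to PREFETCHT0, PREFETCHT1, PREFETCHT2, or PREFETCHNTA for locality values 3, 2, 1, and 0, and may emit PREFETCHW for rw = 1 when the target supports it. The function names, the 64-byte line size, and the prefetch distance are hypothetical choices made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size in bytes; a real program should query the target. */
#define CACHE_LINE_BYTES 64
/* Hypothetical tuning value: how many cache lines ahead to prefetch. */
#define PREFETCH_AHEAD_LINES 4

/* Sum an array, issuing one read prefetch per cache line with high temporal
   locality (rw = 0, locality = 3; typically PREFETCHT0 on x86). */
long sum_with_prefetch(const long *data, size_t n)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(long);
    const size_t ahead = PREFETCH_AHEAD_LINES * per_line;
    long total = 0;

    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i % per_line == 0 && i + ahead < n)
            __builtin_prefetch(&data[i + ahead], 0, 3);
        total += data[i];
    }
    return total;
}

/* Increment every element, prefetching with write intent (rw = 1) because
   each line is modified shortly after it is prefetched. The compiler may
   emit PREFETCHW here, or fall back to a read prefetch. */
void increment_with_write_prefetch(long *data, size_t n)
{
    const size_t per_line = CACHE_LINE_BYTES / sizeof(long);
    const size_t ahead = PREFETCH_AHEAD_LINES * per_line;

    assert(data != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (i % per_line == 0 && i + ahead < n)
            __builtin_prefetch(&data[i + ahead], 1, 3);
        data[i] += 1;
    }
}
```

Prefetches are hints only, so both functions return the same results whether or not the processor acts on them. The prefetch distance is workload-dependent and is best chosen by measurement.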
To prevent a false-store dependency from stalling a prefetch instruction, prefetched data should be
located at least one cache line away from the address of any surrounding data write. For example, if
the cache-line size is 32 bytes, avoid prefetching from data addresses within 32 bytes of the data
address in a preceding write instruction.
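The distance rule above reduces to a simple address comparison. The helper below is a hypothetical sketch (its name and signature are not part of any AMD interface) that reports whether a candidate prefetch address is at least one cache line away from the address of a nearby store.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if prefetch_addr is at least line_bytes away from store_addr,
   in either direction. line_bytes is the cache-line size assumed by the
   caller (32 in the example given in the text). */
bool prefetch_clear_of_store(const void *store_addr,
                             const void *prefetch_addr,
                             size_t line_bytes)
{
    const uintptr_t s = (uintptr_t)store_addr;
    const uintptr_t p = (uintptr_t)prefetch_addr;
    const uintptr_t distance = (s > p) ? (s - p) : (p - s);

    assert(line_bytes > 0);
    return distance >= line_bytes;
}
```

With a 32-byte line, a prefetch 24 bytes from a preceding store fails the check, while a prefetch 32 or more bytes away passes it.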
Non-Temporal Stores. Non-temporal store instructions are provided to prevent memory writes from
being stored in the cache, thereby reducing cache pollution. These non-temporal store instructions are
specific to the type of register whose contents they write to memory:
• GPR Non-Temporal Stores—MOVNTI.
• XMM Non-Temporal Stores—MASKMOVDQU, MOVNTDQ, MOVNTPD, and MOVNTPS.
• MMX Non-Temporal Stores—MASKMOVQ and MOVNTQ.
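As an illustration of the GPR and XMM forms, the following sketch fills a buffer using compiler intrinsics. It assumes an x86-64 target and the SSE2 header emmintrin.h, where _mm_stream_si128 corresponds to MOVNTDQ and _mm_stream_si32 corresponds to MOVNTI. The function name fill_nontemporal is hypothetical, and on other targets the sketch falls back to ordinary stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2: _mm_stream_si128, _mm_stream_si32, _mm_sfence */
#define HAVE_NT_STORES 1
#else
#define HAVE_NT_STORES 0
#endif

/* Fill count 32-bit elements of a 16-byte-aligned buffer with value,
   using non-temporal stores where available so that the written data
   does not displace other data in the cache. */
void fill_nontemporal(uint32_t *dst, size_t count, uint32_t value)
{
    /* MOVNTDQ requires a 16-byte-aligned destination. */
    assert(((uintptr_t)dst & 15u) == 0);

#if HAVE_NT_STORES
    const __m128i v = _mm_set1_epi32((int)value);
    size_t i = 0;

    /* XMM non-temporal stores (MOVNTDQ), four elements at a time. */
    for (; i + 4 <= count; i += 4)
        _mm_stream_si128((__m128i *)&dst[i], v);

    /* GPR non-temporal stores (MOVNTI) for any remaining elements. */
    for (; i < count; i++)
        _mm_stream_si32((int *)&dst[i], (int)value);

    /* Non-temporal stores are weakly ordered; SFENCE orders them with
       respect to later stores. */
    _mm_sfence();
#else
    /* Fallback for non-x86 targets: ordinary cached stores. */
    for (size_t i = 0; i < count; i++)
        dst[i] = value;
#endif
}
```

Non-temporal stores tend to pay off for large buffers that will not be read again soon. For data that is reused shortly after being written, ordinary stores are usually the better choice.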
Removing Stale Cache Lines. When cache data becomes stale, it occupies space in the cache that
could be used to store frequently accessed data. Applications can use the CLFLUSH instruction to free
a stale cache line for use by other data. CLFLUSH writes the contents of a cache line to memory and