User Guide

136 128-Bit Media and Scientific Programming
AMD64 Technology 24592—Rev. 3.15—November 2009
location, the memory address must be aligned. The MOVDQU instruction performs the same
operation but does not require its memory operand to be aligned. The LDDQU instruction is
virtually identical in operation to the MOVDQU instruction: it moves a double quadword of data
from a 128-bit memory operand into a destination XMM register.
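
The difference between the aligned and unaligned moves can be sketched in C with SSE2 intrinsics. This is an illustrative sketch, not part of the instruction definitions: it assumes a compiler that provides <emmintrin.h> and targets x86-64, where _mm_load_si128 and _mm_store_si128 typically compile to MOVDQA and _mm_loadu_si128 and _mm_storeu_si128 typically compile to MOVDQU. The function names copy16_aligned, copy16_unaligned, and copy16_demo are hypothetical. LDDQU is not shown because its intrinsic requires SSE3 support.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Copy 16 bytes with aligned moves (typically MOVDQA).
   Both addresses must be 16-byte aligned, so check them first. */
void copy16_aligned(void *dst, const void *src)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert(((uintptr_t)src & 15u) == 0);
    _mm_store_si128((__m128i *)dst, _mm_load_si128((const __m128i *)src));
}

/* Copy 16 bytes with unaligned moves (typically MOVDQU).
   No alignment requirement on either address. */
void copy16_unaligned(void *dst, const void *src)
{
    _mm_storeu_si128((__m128i *)dst, _mm_loadu_si128((const __m128i *)src));
}

/* Usage example: returns 1 if both copies reproduce their source bytes. */
int copy16_demo(void)
{
    _Alignas(16) uint8_t src[32];
    _Alignas(16) uint8_t dst[32] = {0};
    for (int i = 0; i < 32; i++)
        src[i] = (uint8_t)i;

    copy16_aligned(dst, src);             /* both addresses aligned */
    copy16_unaligned(dst + 16, src + 1);  /* src + 1 is misaligned */

    return memcmp(dst, src, 16) == 0 &&
           memcmp(dst + 16, src + 1, 16) == 0;
}
```

Passing a misaligned address to copy16_aligned would trip the assertion in a debug build; without the check, the aligned move would be expected to fault at run time.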
The MOVDQ2Q instruction copies the low-order 64-bit value in an XMM register to an MMX
register. The MOVQ2DQ instruction copies a 64-bit value from an MMX register to the low-order 64
bits of an XMM register, with zero-extension to 128 bits.
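
As a hedged illustration of these two moves, the sketch below uses the intrinsics _mm_movpi64_epi64 (usually compiled to MOVQ2DQ) and _mm_movepi64_pi64 (usually compiled to MOVDQ2Q). It assumes GCC or Clang targeting x86-64, where the 64-bit MMX conversion helpers _mm_cvtsi64_m64 and _mm_cvtm64_si64 are available; some other compilers do not provide MMX intrinsics in 64-bit mode. The function name mmx_xmm_demo and the test constants are hypothetical.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides the MMX __m64 type */

/* Returns 1 if the MMX-to-XMM copy zero-extends to 128 bits and the
   XMM-to-MMX copy takes only the low-order 64 bits. */
int mmx_xmm_demo(void)
{
    /* MMX -> XMM: low quadword copied, upper quadword cleared. */
    __m64 m = _mm_cvtsi64_m64(0x1122334455667788LL);
    __m128i x = _mm_movpi64_epi64(m);
    uint64_t lanes[2];
    _mm_storeu_si128((__m128i *)lanes, x);

    /* XMM -> MMX: only the low quadword is copied. */
    __m128i y = _mm_set_epi64x(-1LL, 0x0102030405060708LL);
    __m64 lo = _mm_movepi64_pi64(y);
    long long lo_val = _mm_cvtm64_si64(lo);

    _mm_empty();  /* clear MMX state before any later x87 code */

    return lanes[0] == 0x1122334455667788ULL &&
           lanes[1] == 0 &&
           lo_val == 0x0102030405060708LL;
}
```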
Figure 4-17 on page 137 shows the capabilities of the various integer move instructions. These
instructions move large amounts of data. When copying between XMM registers, or between an XMM
register and memory, a move instruction can copy up to 16 bytes of data. When copying between an
XMM register and an MMX register or a GPR, a move instruction can copy up to 8 bytes of data. The
MOVx instructions, along with the PUNPCKx instructions, are often among the most frequently
used instructions in 128-bit media integer and floating-point procedures.
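
The 8-byte limit on moves between an XMM register and a GPR can be illustrated as follows. This is a sketch under stated assumptions: it requires an x86-64 target, where _mm_cvtsi64_si128 and _mm_cvtsi128_si64 are available and usually compile to 64-bit MOVQ forms. Because a single move reaches only the low quadword, the upper quadword is first brought down with _mm_unpackhi_epi64 (usually PUNPCKHQDQ). The function names gpr_round_trip and high_quadword are hypothetical.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Move a 64-bit integer into the low quadword of an XMM value
   (upper quadword is zeroed) and move it back out again. */
int64_t gpr_round_trip(int64_t v)
{
    __m128i x = _mm_cvtsi64_si128(v);  /* GPR -> XMM, 8 bytes */
    return _mm_cvtsi128_si64(x);       /* XMM -> GPR, 8 bytes */
}

/* Read the upper quadword: unpack-high places it in the low quadword,
   which an 8-byte move can then copy to a GPR. */
int64_t high_quadword(__m128i x)
{
    __m128i hi = _mm_unpackhi_epi64(x, x);
    return _mm_cvtsi128_si64(hi);
}
```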
The move instructions are in many respects similar to the assignment operator in high-level languages.
The simplest example of their use is for initializing variables. To initialize a register to 0, however,
it may be more efficient to use the PXOR instruction with identical destination and source operands
than to use a MOVx instruction.
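
The zeroing idiom can be sketched with intrinsics. The example assumes <emmintrin.h> is available; _mm_xor_si128 corresponds to PXOR, and compilers commonly emit a PXOR of a register with itself for _mm_setzero_si128 as well, although the exact instruction chosen is up to the compiler. The function names zero_by_xor and is_all_zero are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* XOR a value with itself: the result is zero for any input. */
__m128i zero_by_xor(__m128i x)
{
    return _mm_xor_si128(x, x);
}

/* Returns 1 if all 16 bytes of v are zero, otherwise 0.
   Each byte is compared with 0 and the 16 results are gathered
   into a 16-bit mask. */
int is_all_zero(__m128i v)
{
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```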
Move Non-Temporal. The move non-temporal instructions are streaming-store instructions. They
minimize pollution of the cache.
MOVNTDQ—Move Non-Temporal Double Quadword
MASKMOVDQU—Masked Move Double Quadword Unaligned
The MOVNTDQ instruction stores its second operand (a 128-bit XMM register value) into its first
operand (a 128-bit memory location). MOVNTDQ indicates to the processor that its data is
non-temporal, meaning that the data is expected to be used only once and therefore need not incur
cache-related overhead (as opposed to temporal data, which is expected to be accessed again soon
and should be cached). The non-temporal instructions use weakly ordered, write-combining
buffering of write data, and they minimize cache pollution. The exact method by which
cache pollution is minimized depends on the hardware implementation of the instruction. For further
information, see “Memory Optimization” on page 92.
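
A minimal sketch of a streaming store follows, assuming <emmintrin.h> and a 16-byte-aligned destination. The intrinsic _mm_stream_si128 usually compiles to MOVNTDQ. Because non-temporal stores are weakly ordered, the sketch issues _mm_sfence after the loop so that the stores are ordered before any later stores; whether a fence is needed depends on how the data is consumed afterward. The function names stream_fill and stream_fill_demo and the buffer size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */

/* Fill n_blocks consecutive 16-byte blocks at dst with value, using
   non-temporal stores. dst must be 16-byte aligned. */
void stream_fill(void *dst, __m128i value, size_t n_blocks)
{
    assert(((uintptr_t)dst & 15u) == 0);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < n_blocks; i++)
        _mm_stream_si128(p + i, value);  /* typically MOVNTDQ */
    _mm_sfence();  /* order the weakly ordered stores before later stores */
}

/* Usage example: returns 1 if every byte of the buffer was written. */
int stream_fill_demo(void)
{
    _Alignas(16) uint8_t buf[64];
    memset(buf, 0, sizeof buf);
    stream_fill(buf, _mm_set1_epi8(0x5A), sizeof buf / 16);
    for (size_t i = 0; i < sizeof buf; i++)
        if (buf[i] != 0x5A)
            return 0;
    return 1;
}
```

A buffer this small gains nothing from streaming stores; the technique is intended for large outputs that will not be read again soon.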