User Guide

136 128-Bit Media and Scientific Programming

AMD64 Technology 24592—Rev. 3.15—November 2009

location, the memory address must be aligned. The MOVDQU instruction does the same, except for
unaligned operands. The LDDQU instruction is virtually identical in operation to the MOVDQU
instruction. The LDDQU instruction moves a double quadword of data from a 128-bit memory
operand into a destination XMM register.
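
The aligned/unaligned distinction can be sketched in C with SSE2 intrinsics. This is a minimal
illustration, not code from this manual: the helper names (sum_bytes, sum_aligned, sum_unaligned,
demo_loads) are hypothetical, and the mapping from each intrinsic to MOVDQA or MOVDQU is what
compilers typically emit rather than a guarantee.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sum the sixteen bytes of a 128-bit value (helper for illustration only). */
static unsigned sum_bytes(__m128i v)
{
    uint8_t out[16];
    unsigned s = 0;
    _mm_storeu_si128((__m128i *)out, v);   /* unaligned 128-bit store */
    for (int i = 0; i < 16; i++)
        s += out[i];
    return s;
}

/* Aligned load: p must be 16-byte aligned (typically compiled to MOVDQA). */
static unsigned sum_aligned(const uint8_t *p)
{
    return sum_bytes(_mm_load_si128((const __m128i *)p));
}

/* Unaligned load: p may have any alignment (typically compiled to MOVDQU). */
static unsigned sum_unaligned(const uint8_t *p)
{
    return sum_bytes(_mm_loadu_si128((const __m128i *)p));
}

/* Hypothetical demo: returns 1 if both loads read the expected bytes. */
static int demo_loads(void)
{
    _Alignas(16) uint8_t buf[32];
    for (int i = 0; i < 32; i++)
        buf[i] = (uint8_t)i;
    /* buf is 16-byte aligned: bytes 0..15 sum to 120.       */
    /* buf + 1 is not aligned: bytes 1..16 sum to 136.       */
    return sum_aligned(buf) == 120 && sum_unaligned(buf + 1) == 136;
}
```

Passing buf + 1 to sum_aligned instead would violate the alignment requirement and can fault, which
is why the unaligned form is used for that address.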

The MOVDQ2Q instruction copies the low-order 64-bit value in an XMM register to an MMX
register. The MOVQ2DQ instruction copies a 64-bit value from an MMX register to the low-order 64
bits of an XMM register, with zero-extension to 128 bits.

Figure 4-17 on page 137 shows the capabilities of the various integer move instructions. These
instructions move large amounts of data. When copying between XMM registers, or between an XMM
register and memory, a move instruction can copy up to 16 bytes of data. When copying between an
XMM register and an MMX or GPR register, a move instruction can copy up to 8 bytes of data. The
MOVx instructions, along with the PUNPCKx instructions, are often among the most frequently
used instructions in 128-bit media integer and floating-point procedures.
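
The two transfer sizes can be illustrated with a short C sketch using SSE2 intrinsics. It assumes
an x86-64 target (the 64-bit GPR conversions are not available in 32-bit mode), and the function
names copy16, low_qword, from_gpr, and demo_sizes are hypothetical names chosen for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Memory -> XMM -> memory: one 128-bit move in each direction copies 16 bytes. */
static void copy16(void *dst, const void *src)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, v);
}

/* XMM -> GPR: only the low-order 8 bytes are copied. */
static int64_t low_qword(__m128i v)
{
    return _mm_cvtsi128_si64(v);
}

/* GPR -> XMM: 8 bytes are copied and the upper 64 bits are zeroed. */
static __m128i from_gpr(int64_t x)
{
    return _mm_cvtsi64_si128(x);
}

/* Hypothetical demo: returns 1 if every transfer behaves as described. */
static int demo_sizes(void)
{
    uint8_t src[16], dst[16] = {0};
    uint64_t halves[2];
    __m128i v;

    for (int i = 0; i < 16; i++)
        src[i] = (uint8_t)(0xA0 + i);
    copy16(dst, src);

    v = from_gpr(0x1122334455667788LL);
    _mm_storeu_si128((__m128i *)halves, v);  /* halves[1] holds the upper 64 bits */

    return memcmp(dst, src, 16) == 0
        && low_qword(v) == 0x1122334455667788LL
        && halves[1] == 0;
}
```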

The move instructions are in many respects similar to the assignment operator in high-level languages.
The simplest example of their use is for initializing variables. To initialize a register to 0, however,
rather than using a MOVx instruction it may be more efficient to use the PXOR instruction with
identical destination and source operands.
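
The zeroing idiom can be sketched in C with SSE2 intrinsics. The function names below are
hypothetical, and the statement that a compiler emits PXOR for them describes typical behavior,
not a guarantee.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Zero a 128-bit value with the dedicated intrinsic
   (compilers typically emit PXOR with identical operands). */
static __m128i zero_setzero(void)
{
    return _mm_setzero_si128();
}

/* Zero by XORing a value with itself: x ^ x is 0 for any x. */
static __m128i zero_by_xor(__m128i x)
{
    return _mm_xor_si128(x, x);
}

/* Returns 1 when all 128 bits of v are zero. */
static int is_all_zero(__m128i v)
{
    /* Compare each byte with 0, then gather the 16 results into a bit mask. */
    __m128i eq = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}
```

Both functions produce the same all-zero value; neither needs a constant loaded from memory.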

Move Non-Temporal. The move non-temporal instructions are streaming-store instructions. They
minimize pollution of the cache.

• MOVNTDQ—Move Non-Temporal Double Quadword

• MASKMOVDQU—Masked Move Double Quadword Unaligned

The MOVNTDQ instruction stores its second operand (a 128-bit XMM register value) into its first
operand (a 128-bit memory location). MOVNTDQ indicates to the processor that its data is
non-temporal, which assumes that the referenced data will be used only once and is therefore not
subject to cache-related overhead (as opposed to temporal data, which assumes that the data will be
accessed again soon and should be cached). The non-temporal instructions use weakly-ordered,
write-combining buffering of write data, and they minimize cache pollution. The exact method by
which cache pollution is minimized depends on the hardware implementation of the instruction. For
further information, see “Memory Optimization” on page 92.
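
A streaming store can be sketched in C with the SSE2 intrinsic _mm_stream_si128, which compilers
typically compile to MOVNTDQ. The function names stream_fill and demo_stream are hypothetical. The
sketch assumes a 16-byte-aligned destination, and it issues a store fence after the loop because
the stores are weakly ordered.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Fill n_vecs 16-byte blocks at dst with one byte value using non-temporal
   stores. dst must be 16-byte aligned. */
static void stream_fill(void *dst, uint8_t value, size_t n_vecs)
{
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;

    for (size_t i = 0; i < n_vecs; i++)
        _mm_stream_si128(p + i, v);   /* non-temporal 128-bit store */

    /* Make the weakly-ordered stores globally visible before later stores. */
    _mm_sfence();
}

/* Hypothetical demo: returns 1 if the whole buffer was written. */
static int demo_stream(void)
{
    _Alignas(16) uint8_t buf[64];

    for (int i = 0; i < 64; i++)
        buf[i] = 0;
    stream_fill(buf, 0x5A, 4);        /* 4 blocks of 16 bytes = 64 bytes */
    for (int i = 0; i < 64; i++)
        if (buf[i] != 0x5A)
            return 0;
    return 1;
}
```

A small buffer like this one gains nothing from streaming stores; the technique is intended for
large outputs that will not be read again soon.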