User Guide


AMD64 Technology

AMD64 Architecture

Programmer’s Manual

Volume 1:

Application Programming

Publication No. Revision Date

24592 3.15 November 2009

Summary of content (336 pages)

PAGE 2
AMD64 Technology 24592—Rev. 3.15—November 2009 © 2002 – 2009 Advanced Micro Devices, Inc. All rights reserved. The contents of this document are provided in connection with Advanced Micro Devices, Inc. (“AMD”) products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice.
PAGE 3
Contents
Contents (page i)
Figures (page ix)
Tables
PAGE 4
3 General-Purpose Programming (page 23)
Registers (page 23)
Legacy Registers
PAGE 5
Procedure Stack (page 77)
Jumps (page 78)
Procedure Calls
PAGE 6
MXCSR Register (page 117)
Other Data Registers (page 120)
rFLAGS Registers
PAGE 7
Use Small Operand Sizes (page 189)
Reorganize Data for Parallel Operations (page 189)
Remove Branches (page 189)
Use Streaming Stores
PAGE 8
Instruction Effects on Flags (page 228)
Instruction Prefixes (page 228)
Supported Prefixes (page 229)
Special-Use and Reserved Prefixes
PAGE 9
Load Constants (page 265)
Arithmetic (page 266)
Transcendental Functions (page 269)
Compare and Test
PAGE 11
Figures
Figure 1-1. Application-Programming Register Set (page 2)
Figure 2-1. Virtual-Memory Segmentation (page 10)
Figure 2-2. Segment Registers (page 11)
Figure 2-3.
PAGE 12
Figure 4-2. Parallel Operations on Vectors of Floating-Point Elements (page 107)
Figure 4-3. Unpack and Interleave Operation (page 108)
Figure 4-4. Pack Operation (page 109)
Figure 4-5. Shuffle Operation
PAGE 13
Figure 4-35. ADDPS Arithmetic Operation (page 167)
Figure 4-36. CMPPD Compare Operation (page 172)
Figure 4-37. COMISD Compare Operation (page 174)
Figure 4-38.
PAGE 15
Tables
Table 1-1. Operating Modes (page 2)
Table 1-2. Application Registers and Stack, by Operating Mode (page 3)
Table 2-1. Address-Size Prefixes (page 18)
Table 3-1.
PAGE 16
Table 6-2. Types of Rounding (page 245)
Table 6-3. Mapping Between Internal and Software-Visible Tag Bits (page 246)
Table 6-4. Instructions that Access the x87 Environment (page 248)
Table 6-5.
PAGE 17
Revision History
Date, Revision: Description
November 2009, 3.15: Modified description of the Auxiliary Carry Flag on page 35. Clarified section 3.3.4, “Load Segment Registers” on page 49. Added “Atomicity of accesses.” on page 94. Revised section 3.11, “Cross-Modifying Code” on page 103.
September 2007, 3.14: Incorporated minor clarifications and formatting changes.
July 2007, 3.13: Revised rFLAGS register table 3-5 on page 34.
PAGE 19
Preface
About This Book
This book is part of a multivolume work entitled the AMD64 Architecture Programmer’s Manual. This table lists each volume and its order number.
Title Order No.
PAGE 20
• 128-bit Media Programming: This model uses the 128-bit XMM registers and supports integer and floating-point operations on vector (packed) and scalar data types.
• 64-bit Media Programming: This model uses the 64-bit MMX™ registers and supports integer and floating-point operations on vector (packed) and scalar data types.
• x87 Floating-Point Programming: This model uses the 80-bit x87 registers and supports floating-point operations on scalar data types.
PAGE 21
32-bit mode: Legacy mode or compatibility mode in which a 32-bit address size is active. See legacy mode and compatibility mode.
64-bit mode: A submode of long mode. In 64-bit mode, the default address size is 64 bits and new features, such as register extensions, are supported for system and application software.
#GP(0): Notation indicating a general-protection exception (#GP) with error code of 0.
PAGE 22
direct: Referencing a memory location whose address is included in the instruction’s syntax as an immediate operand. The address may be an absolute or relative address. Compare indirect.
dirty data: Data held in the processor’s caches or internal buffers that is more recent than the copy held in main memory.
displacement: A signed value that is added to the base of a segment (absolute addressing) or an instruction pointer (relative addressing). Same as offset.
PAGE 23
FF /0: Notation indicating that FF is the first byte of an opcode, and a subopcode in the ModR/M byte has a value of 0.
flush: An often ambiguous term meaning (1) writeback, if modified, and invalidate, as in “flush the cache line,” or (2) invalidate, as in “flush the pipeline,” or (3) change a value, as in “flush to zero.”
GDT: Global descriptor table.
GIF: Global interrupt flag.
IDT: Interrupt descriptor table.
IGN: Ignore. Field is ignored.
PAGE 24
long mode: An operating mode unique to the AMD64 architecture. A processor implementation of the AMD64 architecture can run in either long mode or legacy mode. Long mode has two submodes, 64-bit mode and compatibility mode.
lsb: Least-significant bit.
LSB: Least-significant byte.
main memory: Physical memory, such as RAM and ROM (but not cache memory), that is installed in a particular computer system.
PAGE 25
offset: Same as displacement.
overflow: The condition in which a floating-point number is larger in magnitude than the largest, finite, positive or negative number that can be represented in the data-type format being used.
packed: See vector.
PAE: Physical-address extensions.
physical memory: Actual memory, consisting of main memory and cache.
probe: A check for an address in a processor’s caches or internal buffers.
PAGE 26
Software must not depend on the state of a reserved field, nor upon the ability of such fields to return to a previously written state. If a reserved field is not marked with one of the above qualifiers, software must not change the state of that field; it must reload that field with the same values returned from a prior read.
REX: An instruction prefix that specifies a 64-bit operand size and provides access to additional registers.
PAGE 27
TSS: Task-state segment.
underflow: The condition in which a floating-point number is smaller in magnitude than the smallest nonzero, positive or negative number that can be represented in the data-type format being used.
vector: (1) A set of integer or floating-point values, called elements, that are packed into a single operand. Most of the 128-bit and 64-bit media instructions use vectors as operands.
PAGE 28
CRn: Control register number n.
CS: Code segment register.
eAX–eSP: The 16-bit AX, BX, CX, DX, DI, SI, BP, and SP registers or the 32-bit EAX, EBX, ECX, EDX, EDI, ESI, EBP, and ESP registers. Compare rAX–rSP.
EFER: Extended features enable register.
eFLAGS: 16-bit or 32-bit flags register. Compare rFLAGS.
EFLAGS: 32-bit (extended) flags register.
eIP: 16-bit or 32-bit instruction-pointer register. Compare rIP.
EIP: 32-bit (extended) instruction-pointer register.
PAGE 29
MSR: Model-specific register.
r8–r15: The 8-bit R8B–R15B registers, or the 16-bit R8W–R15W registers, or the 32-bit R8D–R15D registers, or the 64-bit R8–R15 registers.
rAX–rSP: The 16-bit AX, BX, CX, DX, DI, SI, BP, and SP registers, or the 32-bit EAX, EBX, ECX, EDX, EDI, ESI, EBP, and ESP registers, or the 64-bit RAX, RBX, RCX, RDX, RDI, RSI, RBP, and RSP registers.
PAGE 30
RSP: 64-bit version of the ESP register.
SP: Stack pointer register.
SS: Stack segment register.
TPR: Task priority register (CR8), a new register introduced in the AMD64 architecture to speed interrupt management.
TR: Task register.
Endian Order
The x86 and AMD64 architectures address memory using little-endian byte-ordering.
PAGE 33
1 Overview of the AMD64 Architecture
1.1 Introduction
The AMD64 architecture is a simple yet powerful 64-bit, backward-compatible extension of the industry-standard (legacy) x86 architecture. It adds 64-bit addressing and expands register resources to support higher performance for recompiled 64-bit programs, while supporting legacy 16-bit and 32-bit applications and operating systems without modification or recompilation.
PAGE 34
Figure 1-1. Application-Programming Register Set [figure: the 64-bit general-purpose registers (GPRs) RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15; the 64-bit media and floating-point registers MMX0/FPR0 through MMX7/FPR7; the flags register (EFLAGS); and the instruction pointer (EIP)]
PAGE 35
1.1.2 Registers
Table 1-2 compares the register and stack resources available to application software, by operating mode. The left set of columns shows the legacy x86 resources, which are available in the AMD64 architecture’s legacy and compatibility modes. The right set of columns shows the comparable resources in 64-bit mode. Gray shading indicates differences between the modes.
PAGE 36
1.1.3 Instruction Set
The AMD64 architecture supports the full legacy x86 instruction set, with additional instructions to support long mode (see Table 1-1 on page 2 for a summary of operating modes). The application-programming instructions are organized into four subsets, as follows:
• General-Purpose Instructions: These are the basic x86 integer instructions used in virtually all programs.
PAGE 37
The 128-bit and 64-bit media instructions are designed to accelerate these applications. The instructions use a form of vector (or packed) parallel processing known as single-instruction, multiple data (SIMD) processing. This vector technology has the following characteristics:
• A single register can hold multiple independent pieces of data.
PAGE 38
media instructions. This provides application programs with three distinct sets of floating-point registers. In addition, certain high-end implementations of the AMD64 architecture may support 128-bit media, 64-bit media, and x87 instructions with separate execution units.
1.2 Modes of Operation
Table 1-1 on page 2 summarizes the modes of operation supported by the AMD64 architecture.
PAGE 39
instructions, these defaults can be overridden on an instruction-by-instruction basis using instruction prefixes. REX prefixes specify the 64-bit operand size and register extensions.
RIP-Relative Data Addressing. 64-bit mode supports data addressing relative to the 64-bit instruction pointer (RIP). The legacy x86 architecture supports IP-relative addressing only in control-transfer instructions.
PAGE 40
Legacy mode is compatible with existing 32-bit processor implementations of the x86 architecture. Processors that implement the AMD64 architecture boot in legacy real mode, just like processors that implement the legacy x86 architecture. Throughout this document, references to legacy mode refer to all three submodes: protected mode, virtual-8086 mode, and real mode.
PAGE 41
2 Memory Model
This chapter describes the memory characteristics that apply to application software in the various operating modes of the AMD64 architecture. These characteristics apply to all instructions in the architecture. Several additional system-level details about memory and cache management are described in Volume 2.
2.1 Memory Organization
2.1.1 Virtual Memory
Virtual memory consists of the entire address space available to programs.
PAGE 42
Figure 2-1. Virtual-Memory Segmentation [figure: in 64-bit mode (flat segmentation model), addresses run from 0 to 2^64 - 1 with one base address for all segments; in legacy and compatibility mode (multi-segment model), addresses run from 0 to 2^32 - 1 with separate code-segment (CS), stack-segment (SS), and data-segment (DS) bases for code, stack, and data]
PAGE 43
Figure 2-2. Segment Registers [figure: the 16-bit segment registers CS, DS, ES, FS, GS, and SS in legacy mode and compatibility mode; in 64-bit mode, CS supplies attributes only, FS and GS supply a base only, and DS, ES, and SS are ignored]
For details on segmentation and the segment registers, see “Segmented Virtual Memory” in Volume 2.
2.1.
PAGE 44
Long-Mode Memory Management. Figure 2-3 shows the flow, from top to bottom, of memory management functions performed in the two submodes of long mode.
Figure 2-3. [figure: in 64-bit mode, a 64-bit virtual (linear) address passes through paging to a physical address in bits 51:0; in compatibility mode, a 16-bit selector and a 32-bit effective address pass through segmentation to a virtual address in bits 31:0, which passes through paging to a physical address in bits 51:0]
PAGE 45
Figure 2-4. [figure: address translation in protected mode, virtual-8086 mode, and real mode; in each mode a selector and an effective address (EA) pass through segmentation to a linear address, and in protected and virtual-8086 modes the linear address then passes through paging to the physical address (PA)]
PAGE 46
2.2 Memory Addressing
2.2.1 Byte Ordering
Instructions and data are stored in memory in little-endian byte order. Little-endian ordering places the least-significant byte of the instruction or data item at the lowest memory address and the most-significant byte at the highest memory address. Figure 2-5 shows a generalization of little-endian memory and register images of a quadword data type.
PAGE 47
the first (most-significant) byte. In memory, the REX prefix byte (48) would be stored at the lowest address, and the first immediate byte (11) would be stored at the highest instruction address.
Figure 2-6. Example of 10-Byte Instruction in Memory
Address   Byte
09h       11    (high, most-significant)
08h       22
07h       33
06h       44
05h       55
04h       66
03h       77
02h       88
01h       B8
00h       48    (low, least-significant)
2.2.
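The byte layout in Figure 2-6 can be reproduced with a short sketch. It assumes the 10-byte instruction is MOV RAX, 1122334455667788h, which is what the bytes 48 B8 followed by an 8-byte immediate encode; the helper name `encode_mov_rax_imm64` is hypothetical and exists only for this illustration.

```python
import struct

def encode_mov_rax_imm64(imm: int) -> bytes:
    """Hypothetical helper: REX.W prefix (48h), opcode B8h, then the
    64-bit immediate stored least-significant byte first."""
    return bytes([0x48, 0xB8]) + struct.pack("<Q", imm & 0xFFFFFFFFFFFFFFFF)

code = encode_mov_rax_imm64(0x1122334455667788)

# Print each byte with its offset, lowest address first.
for address, byte in enumerate(code):
    print(f"{address:02X}h: {byte:02X}")
```

The prefix and opcode occupy the two lowest addresses, and the immediate's most-significant byte (11) lands at offset 09h, as in the figure.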
PAGE 48
• Instruction-Relative Addresses: These addresses are given as displacements (or offsets) from the current instruction pointer (IP), also called the program counter (PC). They are generated by control-transfer instructions. A displacement in the instruction encoding, or one read from memory, serves as an offset from the address that follows the transfer. See “RIP-Relative Addressing” on page 18 for details about RIP-relative addressing in 64-bit mode.
PAGE 49
truncated to the effective-address size of the current mode (64-bit mode or compatibility mode), as overridden by any address-size prefix. The result is then zero-extended to the full 64-bit address width. Because of this, 16-bit and 32-bit applications running in compatibility mode can access only the low 4 GB of the long-mode virtual-address space.
PAGE 50
Table 2-1. Address-Size Prefixes

Operating Mode                                        Default Address   Effective Address   Address-Size Prefix
                                                      Size (Bits)       Size (Bits)         (67h)(1) Required?
Long Mode, 64-Bit Mode                                64                64                  no
                                                                        32                  yes
Long Mode, Compatibility Mode                         32                32                  no
                                                                        16                  yes
                                                      16                32                  yes
                                                                        16                  no
Legacy Mode (Protected, Virtual-8086, or Real Mode)   32                32                  no
                                                                        16                  yes
                                                      16                32                  yes
                                                                        16                  no

Note: 1. “No” indicates that the default address size is used.
2.2.
PAGE 51
RIP-relative addressing. The effect of the address-size prefix is to truncate and zero-extend the computed effective address to 32 bits, like any other addressing mode.
Encoding. For details on instruction encoding of RIP-relative addressing, see “RIP-Relative Addressing” in Volume 3.
2.3 Pointers
Pointers are variables that contain addresses rather than data. They are used by instructions to reference memory.
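As an illustration of the RIP-relative address calculation and the effect of the address-size prefix described above, the sketch below assumes the usual x86-64 behavior: the displacement is a signed 32-bit value added to the address of the next instruction. The function and parameter names are hypothetical.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def rip_relative_address(next_rip: int, disp32: int,
                         address_size_prefix: bool = False) -> int:
    """Effective address = RIP of the next instruction + signed disp32.
    With the 67h prefix, the result is truncated to 32 bits and
    zero-extended (assumed behavior, modeled by masking)."""
    ea = (next_rip + sign_extend(disp32, 32)) & MASK64
    if address_size_prefix:
        ea &= MASK32
    return ea

print(hex(rip_relative_address(0x7FFF00001000, 0xFFFFFFF0)))
print(hex(rip_relative_address(0x7FFF00001000, 0xFFFFFFF0, True)))
```

The first call yields an address 16 bytes below the next instruction; the second shows the same computation after truncation to 32 bits.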
PAGE 52
Figure 2-9. [figure: the stack frame before and after a procedure call. Before the call, the stack-frame base pointer (rBP) and stack pointer (rSP) are shown together above the stack-segment (SS) base address; after the call, rBP and rSP are shown separately, with the passed data on the stack]
PAGE 53
Figure 2-10. Instruction Pointer (rIP) Register [figure: the rIP register viewed as IP, EIP, or RIP, with bit positions 63, 32, 31, and 0 marked]
The contents of the rIP are not directly readable by software. However, the rIP is pushed onto the stack by a call instruction. The memory model described in this chapter is used by all of the programming environments that make up the AMD64 architecture.
PAGE 55
3 General-Purpose Programming
The general-purpose programming model includes the general-purpose registers (GPRs), integer instructions and operands that use the GPRs, program-flow control methods, memory optimization methods, and I/O. This programming model includes the original x86 integer-programming architecture, plus 64-bit extensions and a few additional instructions.
PAGE 56
Figure 3-1. General-Purpose Programming Registers [figure: the 64-bit general-purpose registers (GPRs) rAX, rBX, rCX, rDX, rBP, rSI, rDI, rSP, and R8 through R15; the 16-bit segment registers CS, DS, ES, FS, GS, and SS; and the flags and instruction-pointer registers rFLAGS and rIP. Legend: available to software in all modes; available to software only in 64-bit mode; ignored by hardware in 64-bit mode]
3.1.
PAGE 57
Figure 3-2. General Registers in Legacy and Compatibility Modes

Register encoding   High 8-bit   Low 8-bit   16-bit   32-bit
0                   AH (4)       AL          AX       EAX
3                   BH (7)       BL          BX       EBX
1                   CH (5)       CL          CX       ECX
2                   DH (6)       DL          DX       EDX
6                                            SI       ESI
7                                            DI       EDI
5                                            BP       EBP
4                                            SP       ESP

Flags register: FLAGS (bits 15:0), EFLAGS (bits 31:0). Instruction pointer: IP (bits 15:0), EIP (bits 31:0).

The legacy GPRs include:
• Eight 8-bit registers (AH, AL, BH, BL, CH, CL, DH, DL).
PAGE 58
3.1.2 64-Bit-Mode Registers
In 64-bit mode, eight new GPRs are added to the eight legacy GPRs, all 16 GPRs are 64 bits wide, and the low bytes of all registers are accessible. Figure 3-3 on page 27 shows the GPRs, flags register, and instruction-pointer register available in 64-bit mode. The GPRs include:
• Sixteen 8-bit low-byte registers (AL, BL, CL, DL, SIL, DIL, BPL, SPL, R8B, R9B, R10B, R11B, R12B, R13B, R14B, R15B).
PAGE 60
Figure 3-4.
PAGE 61
Default Operand Size. For most instructions, the default operand size in 64-bit mode is 32 bits. To access 16-bit operand sizes, an instruction must contain an operand-size prefix (66h), as described in Section 3.2.2, “Operand Sizes and Overrides,” on page 39. To access the full 64-bit operand size, most instructions must contain a REX prefix. For details on operand size, see Section 3.2.2, “Operand Sizes and Overrides,” on page 39.
Byte Registers.
PAGE 62
66 01C3 ADD BX,AX ;66 is 16-bit size override
Result: RBX = 0002_0002_0123_5502 (bits 63:16 are preserved)

Example 4: 8-bit Add:
Before: RAX = 0002_0001_8000_2201
        RBX = 0002_0002_0123_3301
00C3 ADD BL,AL ;8-bit add
Result: RBX = 0002_0002_0123_3302 (bits 63:08 are preserved)

GPR High 32 Bits Across Mode Switches. The processor does not preserve the upper 32 bits of the 64-bit GPRs across switches from 64-bit mode to compatibility or legacy modes.
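The 16-bit and 8-bit results above can be checked with a small model of how a GPR write of each operand size affects the full 64-bit register. This is a sketch, not the architectural definition: the zero-extension of 32-bit results is assumed from general x86-64 behavior rather than taken from the excerpt, and `write_gpr` and `add_gpr` are hypothetical names.

```python
MASK64 = (1 << 64) - 1

def write_gpr(old: int, value: int, size_bits: int) -> int:
    """Return the new 64-bit register value after a write of size_bits."""
    if size_bits == 64:
        return value & MASK64
    if size_bits == 32:
        # Assumed: 32-bit results are zero-extended to 64 bits.
        return value & 0xFFFFFFFF
    if size_bits in (8, 16):
        # 8-bit and 16-bit results leave the upper bits unchanged.
        mask = (1 << size_bits) - 1
        return (old & ~mask & MASK64) | (value & mask)
    raise ValueError("size_bits must be 8, 16, 32, or 64")

def add_gpr(dst: int, src: int, size_bits: int) -> int:
    """Model ADD dst, src at the given operand size (flags ignored)."""
    mask = (1 << size_bits) - 1
    return write_gpr(dst, (dst & mask) + (src & mask), size_bits)

RAX = 0x0002_0001_8000_2201
RBX = 0x0002_0002_0123_3301
print(f"{add_gpr(RBX, RAX, 16):016X}")  # ADD BX,AX
print(f"{add_gpr(RBX, RAX, 8):016X}")   # ADD BL,AL
```

With the register values from the examples, the two calls reproduce the results 0002_0002_0123_5502 and 0002_0002_0123_3302.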
PAGE 63
Table 3-1. Implicit Uses of GPRs

Registers(1)
Low 8-Bit   16-Bit   32-Bit   64-Bit   Name          Implicit Uses
AL          AX       EAX      RAX(2)   Accumulator   • Operand for decimal arithmetic, multiply, divide, string, compare-and-exchange, table-translation, and I/O instructions.
                                                     • Special accumulator encoding for ADD, XOR, and MOV instructions.
                                                     • Used with EDX to hold double-precision operands.
BL          BX       EBX      RBX(2)
CL          CX       ECX      RCX(2)
DL          DX       EDX
SIL(2)      SI       ESI
PAGE 64
Table 3-1. Implicit Uses of GPRs (continued)

Registers(1)
Low 8-Bit   16-Bit   32-Bit   64-Bit   Name                Implicit Uses
DIL(2)      DI       EDI      RDI(2)   Destination Index   • Memory address of destination operand for string instructions.
                                                           • Memory index for 16-bit addresses.
BPL(2)      BP       EBP      RBP(2)   Base Pointer        • Memory address of stack-frame base pointer.
SPL(2)      SP       ESP      RSP(2)   Stack Pointer       • Memory address of last stack entry (top of stack).
PAGE 65
Decimal Arithmetic. The decimal arithmetic instructions (AAA, AAD, AAM, AAS, DAA, DAS) that adjust binary-coded decimal (BCD) operands implicitly use the AL and AH registers for their operations.
Shifts and Rotates. Shift and rotate instructions can use the CL register to specify the number of bits an operand is to be shifted or rotated.
Conditional Jumps. Special conditional-jump instructions use the rCX register instead of flags.
PAGE 66
Figure 3-5. [figure: rFLAGS register layout. Bits 63:32 are reserved, RAZ. See Volume 2 for system flags. The application flags, all R/W, are:]

Bit   Mnemonic   Description
11    OF         Overflow Flag
10    DF         Direction Flag
7     SF         Sign Flag
6     ZF         Zero Flag
4     AF         Auxiliary Carry Flag
2     PF         Parity Flag
0     CF         Carry Flag
PAGE 67
The sections below describe each application-visible flag. All of these flags are readable and writable. For example, the POPF, POPFD, POPFQ, IRET, IRETD, and IRETQ instructions write all flags. The carry and direction flags are writable by dedicated application instructions. Other application-visible flags are written indirectly by specific instructions.
PAGE 68
to 0 specifies incrementing the data pointer. The pointers are stored in the rSI or rDI register. Software can set or clear the flag with the STD and CLD instructions, respectively.
Overflow Flag (OF). Bit 11. Hardware sets the overflow flag to 1 to indicate that the most-significant (sign) bit of the result of the last signed integer operation differed from the signs of both source operands. Otherwise, hardware clears the flag to 0.
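The overflow rule above, together with the other arithmetic status flags, can be sketched for an integer addition as follows. The bit positions come from the rFLAGS layout; the function names are hypothetical, and the formulas for CF, PF, AF, ZF, and SF are the conventional x86 definitions, assumed here because this excerpt does not reproduce them.

```python
FLAG_BITS = {"CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "OF": 11}

def add_flags(a: int, b: int, bits: int = 8) -> dict:
    """Status flags after an ADD of two `bits`-wide operands."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a &= mask
    b &= mask
    full = a + b
    result = full & mask
    return {
        "CF": int(full > mask),                             # unsigned carry out
        "PF": int(bin(result & 0xFF).count("1") % 2 == 0),  # even parity, low byte
        "AF": int(((a ^ b ^ result) & 0x10) != 0),          # carry out of bit 3
        "ZF": int(result == 0),
        "SF": int((result & sign) != 0),
        # OF: the result's sign differs from the signs of both operands.
        "OF": int(((a ^ result) & (b ^ result) & sign) != 0),
    }

def pack_flags(flags: dict) -> int:
    """Place each flag at its rFLAGS bit position."""
    return sum(value << FLAG_BITS[name] for name, value in flags.items())

print(add_flags(0x7F, 0x01))
print(hex(pack_flags(add_flags(0x7F, 0x01))))
```

Adding 7Fh and 01h as bytes gives 80h: both operands are non-negative but the result's sign bit is set, so OF is 1 while CF stays 0.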
PAGE 69
Figure 3-6. General-Purpose Data Types [figure: signed and unsigned integer formats for byte (bits 7:0), word (2 bytes), doubleword (4 bytes), quadword (8 bytes, 64-bit mode only), and double quadword (16 bytes, 64-bit mode only), with the sign bit (s) in the most-significant bit of each signed format; plus the packed BCD, BCD digit, and bit formats]
Signed and Unsigned Integers.
PAGE 70
Table 3-2. Representable Values of General-Purpose Data Types (continued)

Unsigned Integers
  Byte:                 0 to +2^8 - 1 (0 to 255)
  Word:                 0 to +2^16 - 1 (0 to 65,535)
  Doubleword:           0 to +2^32 - 1 (0 to 4.29 x 10^9)
  Quadword:             0 to +2^64 - 1 (0 to 1.84 x 10^19)
  Double Quadword(2):   0 to +2^128 - 1 (0 to 3.40 x 10^38)
Packed BCD Digits
  Byte:                 00 to 99
  Larger sizes:         multiple packed BCD-digit bytes
BCD Digit
  Byte:                 0 to 9
  Larger sizes:         multiple BCD-digit bytes

Note: 1.
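The unsigned ranges in Table 3-2 follow directly from the operand width. The sketch below recomputes them; the signed ranges assume two's-complement representation, and the names `SIZES`, `unsigned_range`, and `signed_range` are illustrative only.

```python
SIZES = {
    "byte": 8,
    "word": 16,
    "doubleword": 32,
    "quadword": 64,
    "double quadword": 128,
}

def unsigned_range(bits: int) -> tuple:
    """Smallest and largest unsigned value that fits in `bits` bits."""
    return 0, (1 << bits) - 1

def signed_range(bits: int) -> tuple:
    """Two's-complement range (assumed representation) for `bits` bits."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

for name, bits in SIZES.items():
    low, high = unsigned_range(bits)
    print(f"{name}: {low} to {high} (about {high:.2e})")
```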
PAGE 71
3.2.2 Operand Sizes and Overrides
Default Operand Size. In legacy and compatibility modes, the default operand size is either 16 bits or 32 bits, as determined by the default-size (D) bit in the current code-segment descriptor (for details, see “Segmented Virtual Memory” in Volume 2). In 64-bit mode, the default operand size for most instructions is 32 bits.
PAGE 72
Immediate Operand Size. In legacy and compatibility modes, the size of immediate operands can be 8, 16, or 32 bits, depending on the instruction. In 64-bit mode, the maximum size of an immediate operand is also 32 bits, except that 64-bit immediates can be copied into a 64-bit GPR using the MOV instruction. When the operand size of a MOV instruction is 64 bits, the processor sign-extends immediates to 64 bits before using them.
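Sign extension of a 32-bit immediate to a 64-bit operand can be illustrated with a few lines. This is a minimal sketch; `sign_extend_imm32` is a hypothetical name, and the result is expressed as the unsigned 64-bit bit pattern.

```python
def sign_extend_imm32(imm32: int) -> int:
    """Replicate bit 31 of a 32-bit immediate into bits 63:32."""
    imm32 &= 0xFFFFFFFF
    if imm32 & 0x80000000:
        return imm32 | 0xFFFFFFFF00000000
    return imm32

print(f"{sign_extend_imm32(0x7FFFFFFF):016X}")  # positive: upper bits are 0
print(f"{sign_extend_imm32(0x80000000):016X}")  # negative: upper bits are 1
```

A consequence is that a 32-bit immediate cannot express a 64-bit value such as 0000_0000_8000_0000h, which is one reason a full 64-bit immediate form of MOV is useful.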
PAGE 73
The AMD64 architecture does not impose data-alignment requirements for accessing data in memory. However, depending on the location of the misaligned operand with respect to the width of the data bus and other aspects of the hardware implementation (such as store-to-load forwarding mechanisms), a misaligned memory access can require more bus cycles than an aligned access. For maximum performance, avoid misaligned memory accesses.
PAGE 74
In most instructions that take two operands, the first (left-most) operand is both a source operand and the destination operand. The second (right-most) operand serves only as a source. Instructions can have one or more prefixes that modify default instruction functions or operand properties. These prefixes are summarized in Section 3.5, “Instruction Prefixes,” on page 71.
PAGE 75
The CMOVcc instructions perform the same task as MOV but work conditionally, depending on the state of status flags in the rFLAGS register. If the condition is not satisfied, the instruction has no effect and control is passed to the next instruction. The mnemonics of CMOVcc instructions indicate the condition that must be satisfied. Several mnemonics are often used for one opcode to make the mnemonics easier to remember.
PAGE 76
AMD64 Technology Table 3-4. Mnemonic 24592—Rev. 3.
PAGE 77
PUSHA or PUSHAD stores eight word-sized or doubleword-sized registers onto the stack: eAX, eCX, eDX, eBX, eSP, eBP, eSI and eDI, in that order. The stored value of eSP is sampled at the moment when the PUSHA instruction started. The resulting stack-pointer value is decremented by 16 or 32.
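The PUSHA/PUSHAD behavior described above can be modeled in a few lines. In this sketch the stack is a plain dictionary from address to value and the register file is a dictionary keyed by the eAX-style names; both are simplifications chosen for illustration, and `pusha` is a hypothetical helper.

```python
PUSH_ORDER = ("eAX", "eCX", "eDX", "eBX", "eSP", "eBP", "eSI", "eDI")

def pusha(regs: dict, stack: dict, operand_bytes: int = 4) -> None:
    """Push the eight registers in order; operand_bytes is 2 (PUSHA)
    or 4 (PUSHAD). The eSP value stored is the one sampled at the start."""
    original_sp = regs["eSP"]
    sp = original_sp
    for name in PUSH_ORDER:
        sp -= operand_bytes
        stack[sp] = original_sp if name == "eSP" else regs[name]
    regs["eSP"] = sp

regs = {"eAX": 1, "eCX": 2, "eDX": 3, "eBX": 4,
        "eSP": 0x1000, "eBP": 6, "eSI": 7, "eDI": 8}
stack = {}
pusha(regs, stack)
print(hex(regs["eSP"]))   # stack pointer lowered by 32
print(hex(stack[0xFEC]))  # slot holding the original eSP
```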
PAGE 78
rBP register from the calling procedure. If the depth operand is greater than zero, the saved frame pointer of the current procedure is pushed onto the stack (forming an array of depth frame pointers). Finally, the saved value of the frame pointer is copied to the rBP register, and the rSP register is decremented by the value of the first operand, allocating space for local variables used in the procedure.
PAGE 79
Flags are not affected by these instructions. The instructions can be used to prepare an operand for signed division (performed by the IDIV instruction) by doubling its storage size.
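The size-doubling step described above can be modeled in a few lines. This is a minimal sketch, not the processor's implementation: the function name `sign_extend` and the choice to pass bit patterns as Python integers are assumptions made for illustration.

```python
def sign_extend(value: int, from_bits: int) -> int:
    """Sign-extend a from_bits-wide bit pattern to twice its width.

    Hypothetical helper: it mimics what the sign-extension
    instructions do to the accumulator before a signed division.
    """
    to_bits = 2 * from_bits
    low_mask = (1 << from_bits) - 1
    value &= low_mask
    # If the sign bit of the narrow value is set, fill the new upper half with 1s.
    if value & (1 << (from_bits - 1)):
        value |= ((1 << to_bits) - 1) & ~low_mask
    return value


print(hex(sign_extend(0x80, 8)))   # negative byte widened to a word
print(hex(sign_extend(0x7F, 8)))   # positive byte widened to a word
```

A negative source fills the upper half with ones and a non-negative source fills it with zeros, so the widened value is numerically equal to the original.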
PAGE 80
Although the base of the numeration for ASCII-adjust instructions is assumed to be 10, the AAM and AAD instructions can be used to correct multiplication and division with other bases.
BCD Adjust
• DAA—Decimal Adjust after Addition
• DAS—Decimal Adjust after Subtraction
The DAA and DAS instructions perform corrections of addition and subtraction operations on packed BCD values.
PAGE 81
3.3.4 Load Segment Registers
These instructions load segment registers.
• LDS, LES, LFS, LGS, LSS—Load Far Pointer
• MOV segReg—Move Segment Register
• POP segReg—Pop Stack Into Segment Register
The LDS, LES, LFS, LGS, and LSS instructions atomically (with respect to interrupts only, not contending memory accesses) load the two parts of a far pointer into a segment register and a general-purpose register.
PAGE 82
LEA has a limited capability to perform multiplication of operands in general-purpose registers using scaled-index addressing. For example: lea eax, [ebx+ebx*8] loads the value of the EBX register, multiplied by 9, into the EAX register.
3.3.6 Arithmetic
The arithmetic instructions perform basic arithmetic operations, such as addition, subtraction, multiplication, and division on integer operands.
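The LEA example (lea eax, [ebx+ebx*8]) can be checked with a small model of scaled-index address arithmetic. This is a sketch under assumptions: the function name `lea32` is hypothetical, and the result is wrapped to 32 bits because the destination in the example is EAX.

```python
MASK32 = 0xFFFFFFFF


def lea32(base: int, index: int, scale: int, disp: int = 0) -> int:
    """Model of a 32-bit effective-address computation: base + index*scale + disp."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    # The effective address wraps at the destination register width.
    return (base + index * scale + disp) & MASK32


ebx = 5
eax = lea32(base=ebx, index=ebx, scale=8)  # models: lea eax, [ebx+ebx*8]
print(eax)
```

Because the scale factor is limited to 1, 2, 4, or 8, using the same register as base and index yields multiplication by 2, 3, 5, or 9.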
PAGE 83
The MUL instruction performs multiplication of unsigned integer operands. The size of operands can be byte, word, doubleword, or quadword. The product is stored in a destination which is double the size of the source operands (multiplicand and factor). The MUL instruction's mnemonic has only one operand, which is a factor. The multiplicand operand is always assumed to be an accumulator register.
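The double-width product can be illustrated with a short model. This is a sketch only: the function name `mul_unsigned` and the convention of returning the product as a (high half, low half) pair are assumptions for illustration. They correspond to the AH:AL pair for byte operands and the rDX:rAX pair for larger operands.

```python
def mul_unsigned(accumulator: int, factor: int, bits: int) -> tuple[int, int]:
    """Model of unsigned multiply: returns (high_half, low_half) of the product.

    Each half is `bits` wide, so the full product is 2*bits wide.
    """
    mask = (1 << bits) - 1
    product = (accumulator & mask) * (factor & mask)
    return (product >> bits) & mask, product & mask


high, low = mul_unsigned(0xFFFFFFFF, 2, 32)
print(hex(high), hex(low))
```

Since two n-bit unsigned values always have a product that fits in 2n bits, the double-size destination can never overflow.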
PAGE 84
3.3.7 Rotate and Shift
The rotate and shift instructions perform cyclic rotation or non-cyclic shift, by a given number of bits (called the count), in a given byte-sized, word-sized, doubleword-sized or quadword-sized operand. When the count is greater than 1, the result of the rotate and shift instructions can be considered as an iteration of the same 1-bit operation by count number of times.
PAGE 85
The SHx instructions (including SHxD) perform shift operations on unsigned operands. The SAx instructions operate with signed operands. SHL and SAL instructions effectively perform multiplication of an operand by a power of 2, in which case they work as more-efficient alternatives to the MUL instruction. Similarly, SHR and SAR instructions can be used to divide an operand (signed or unsigned, depending on the instruction used) by a power of 2.
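The relationship between shifts and powers of 2 can be sketched as follows. The function names `shl`, `shr`, and `sar` are illustrative models of the instructions of the same names, operating on bit patterns of a given width. The sketch assumes the count is smaller than the operand width and does not model flags.

```python
def shl(value: int, count: int, bits: int = 32) -> int:
    """Logical shift left: multiply by 2**count, discarding overflow bits."""
    return (value << count) & ((1 << bits) - 1)


def shr(value: int, count: int, bits: int = 32) -> int:
    """Logical shift right: unsigned divide by 2**count."""
    return (value & ((1 << bits) - 1)) >> count


def sar(value: int, count: int, bits: int = 32) -> int:
    """Arithmetic shift right: signed divide by 2**count, sign bit replicated."""
    mask = (1 << bits) - 1
    value &= mask
    if value & (1 << (bits - 1)):
        value -= 1 << bits          # reinterpret the pattern as a negative number
    return (value >> count) & mask  # Python's >> on ints is arithmetic


print(shl(3, 4))                 # 3 * 16
print(hex(sar(0xFFFFFFF9, 1)))   # -7 shifted right by 1
```

One caveat for signed values: an arithmetic right shift rounds toward negative infinity (-7 becomes -4), whereas IDIV truncates toward zero (-7 / 2 gives -3), so the two are not interchangeable for negative odd operands.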
PAGE 86
The CMP instruction is often used together with the conditional jump instructions (Jcc), conditional SET instructions (SETcc) and other instructions such as conditional loops (LOOPcc) whose behavior depends on flag state.
Test
• TEST—Test Bits
The TEST instruction is in many ways similar to the AND instruction: it performs logical conjunction of the corresponding bits of both operands, but unlike the AND instruction it leaves the operands unchanged.
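The flag-only behavior of TEST can be modeled briefly. This is a sketch: the function name `flags_after_test` and the dictionary of flag names are assumptions for illustration. The model computes the AND result only to derive flags and then discards it.

```python
def flags_after_test(a: int, b: int, bits: int = 32) -> dict[str, bool]:
    """Model of TEST: AND the operands, set flags, and discard the result."""
    mask = (1 << bits) - 1
    result = a & b & mask
    return {
        "ZF": result == 0,                             # zero flag
        "SF": bool(result >> (bits - 1)),              # sign flag = top bit of result
        "PF": bin(result & 0xFF).count("1") % 2 == 0,  # even parity of the low byte
        "CF": False,                                   # cleared by TEST
        "OF": False,                                   # cleared by TEST
    }


# Check whether bit 3 is set in a value without modifying it.
value = 0b1010
print(flags_after_test(value, 0b1000)["ZF"])  # False means the bit is set
```

A typical use is testing a register against itself or against a bit mask, followed by a conditional jump on ZF.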
PAGE 87
because these instructions operate directly on bits rather than larger data types, the semaphore arrays can be smaller than is possible when using XCHG. In such semaphore applications, bit-test instructions should be preceded by the LOCK prefix.
PAGE 88
SETcc instructions are often used to set logical indicators. Like CMOVcc instructions (page 42), SETcc instructions can replace two instructions—a conditional jump and a move. Replacing conditional jumps with conditional sets can help avoid branch-prediction penalties that may be caused by conditional jumps.
PAGE 89
Compare Strings
• CMPS—Compare Strings
• CMPSB—Compare Strings by Byte
• CMPSW—Compare Strings by Word
• CMPSD—Compare Strings by Doubleword
• CMPSQ—Compare Strings by Quadword
The CMPSx instructions compare the values of two implicit operands of the same size located at seg:[rSI] and ES:[rDI]. After the comparison, both the rSI and rDI registers are auto-incremented (if the DF flag is 0) or auto-decremented (if the DF flag is 1).
PAGE 90
The LODSx instructions load a value from the memory location seg:[rSI] to the accumulator register (AL or rAX). After the load, the rSI register is auto-incremented (if the DF flag is 0) or auto-decremented (if the DF flag is 1).
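The auto-increment and auto-decrement behavior of the string instructions can be sketched with a byte-sized load loop. This is an illustrative model, not an emulator: the function name `lodsb_sequence` is hypothetical, memory is represented as a Python `bytes` object, and segmentation is ignored.

```python
def lodsb_sequence(memory: bytes, rsi: int, count: int, df: int = 0):
    """Model `count` byte-sized loads from memory[rsi].

    Returns (values loaded into AL in order, final rSI).
    DF = 0 steps rSI upward; DF = 1 steps it downward.
    """
    step = -1 if df else 1  # byte operands move the pointer by 1
    loaded = []
    for _ in range(count):
        loaded.append(memory[rsi])  # AL <- [rSI]
        rsi += step                 # pointer update after the load
    return loaded, rsi


print(lodsb_sequence(b"abcd", 0, 3))         # forward walk, DF = 0
print(lodsb_sequence(b"abcd", 3, 2, df=1))   # backward walk, DF = 1
```

For word, doubleword, and quadword forms the pointer moves by 2, 4, or 8 instead of 1.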
PAGE 91
target offset of the JMP instruction is ignored, and the new values loaded into CS and rIP are taken from the call gate or from the TSS.
Conditional Jump
• Jcc—Jump if condition
Conditional jump instructions jump to an instruction specified by the operand, depending on the state of flags in the rFLAGS register. The operand specifies a signed relative offset from the current contents of the rIP.
PAGE 92
Table 3-6.
PAGE 93
the CALL. When the called procedure finishes execution and is exited using a return instruction, control is transferred to the return address saved on the stack. The CALL instruction has the same forms as the JMP instruction, except that CALL lacks the short-relative (1-byte offset) form.
• Relative Near Call—These specify an offset relative to the instruction following the CALL instruction.
PAGE 94
• IRETD—Interrupt Return Doubleword
• IRETQ—Interrupt Return Quadword
The INT instruction implements a software interrupt by calling an interrupt handler. The operand of the INT instruction is an immediate byte value specifying an index in the interrupt descriptor table (IDT), which contains addresses of interrupt handlers (see Section 3.7.10, “Interrupts and Exceptions,” on page 86 for further information on the IDT).
PAGE 95
For details on stack operations, see “Control Transfers” on page 76.
Set and Clear Flags
• CLC—Clear Carry Flag
• CMC—Complement Carry Flag
• STC—Set Carry Flag
• CLD—Clear Direction Flag
• STD—Set Direction Flag
• CLI—Clear Interrupt Flag
• STI—Set Interrupt Flag
These instructions change the value of a flag in the rFLAGS register that is visible to application software. Each instruction affects only one specific flag.
PAGE 96
When operating in legacy protected mode or in long mode, the RFLAGS register’s I/O privilege level (IOPL) field and the I/O-permission bitmap in the current task-state segment (TSS) are used to control access to the I/O addresses (called I/O ports). See “Input/Output” on page 90 for further information.
PAGE 97
• CMPXCHG—Compare and Exchange
• CMPXCHG8B—Compare and Exchange Eight Bytes
• CMPXCHG16B—Compare and Exchange Sixteen Bytes
• XADD—Exchange and Add
• XCHG—Exchange
The CMPXCHG instruction compares a value in the AL or rAX register with the first (destination) operand, and sets the arithmetic flags (ZF, OF, SF, AF, CF, PF) according to the result. If the compared values are equal, the source operand is loaded into the destination operand.
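The compare-and-exchange step can be modeled as a pure function. This is a sketch under assumptions: the name `cmpxchg` and the returned tuple layout are illustrative, only ZF is modeled, and the not-equal branch (the destination value is loaded into the accumulator) reflects the instruction's documented behavior rather than anything stated in the excerpt above.

```python
def cmpxchg(accumulator: int, destination: int, source: int):
    """Model of CMPXCHG.

    Returns (new_accumulator, new_destination, zf).
    """
    if accumulator == destination:
        # Equal: the source operand is stored into the destination; ZF = 1.
        return accumulator, source, True
    # Not equal: the destination value is loaded into the accumulator; ZF = 0.
    return destination, destination, False


# Try to change a lock word from 0 (free) to 1 (taken).
print(cmpxchg(accumulator=0, destination=0, source=1))  # succeeds
print(cmpxchg(accumulator=0, destination=1, source=1))  # fails, lock already taken
```

The model runs as a single Python call, so it says nothing about atomicity. On hardware, the LOCK prefix is what makes the read-compare-write sequence atomic with respect to other processors.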
PAGE 98
See “Feature Detection” on page 74 for details about using the CPUID instruction. For a full description of the CPUID instruction and its function codes, see “CPUID” in Volume 3 and the CPUID Specification, order# 25481.
3.3.16 Cache and Memory Management
Applications can use the cache and memory-management instructions to control memory reads and writes to influence the caching of read/write data.
PAGE 99
The NOP instruction performs no operation (except incrementing the instruction pointer rIP by one). It is an alternative mnemonic for the XCHG rAX, rAX instruction. Depending on the hardware implementation, the NOP instruction may use one or more cycles of processor time. 3.3.
PAGE 100
3.4.2 Canonical Address Format
Bits 63 through the most-significant implemented virtual-address bit must be all zeros or all ones in any memory reference. See “64-Bit Canonical Addresses” on page 15 for details. (This rule applies to long mode, which includes both 64-bit mode and compatibility mode.) 3.4.
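The canonical-address rule can be expressed as a short check. This is a sketch: the function name `is_canonical` is hypothetical, and the default of 48 implemented virtual-address bits is an assumption, since the actual number is implementation-specific.

```python
def is_canonical(address: int, implemented_bits: int = 48) -> bool:
    """Return True if bits 63 down to the top implemented bit are all 0s or all 1s.

    implemented_bits=48 is an assumed, implementation-specific value.
    """
    address &= (1 << 64) - 1
    # Isolate bits 63 .. (implemented_bits - 1), inclusive.
    upper = address >> (implemented_bits - 1)
    all_ones = (1 << (64 - implemented_bits + 1)) - 1
    return upper == 0 or upper == all_ones


print(is_canonical(0x00007FFFFFFFFFFF))  # top of the lower canonical range
print(is_canonical(0xFFFF800000000000))  # bottom of the upper canonical range
print(is_canonical(0x0000800000000000))  # inside the non-canonical hole
```

With 48 implemented bits, this splits the 64-bit space into a lower range, an upper range, and a large non-canonical gap between them.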
PAGE 101
• No Extension of 8-Bit and 16-Bit Results: 8-bit and 16-bit results leave the high 56 or 48 bits, respectively, of 64-bit GPR destination registers unchanged.
• Undefined High 32 Bits After Mode Change: The processor does not preserve the upper 32 bits of the 64-bit GPRs across changes from 64-bit mode to compatibility or legacy modes. In compatibility and legacy mode, the upper 32 bits of the GPRs are undefined and not accessible to software. 3.4.
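The effect of result size on a 64-bit GPR can be sketched as below. The function name `write_gpr` is hypothetical, and the 8-bit case models only the low-byte registers (such as AL), not the high-byte forms. The 32-bit case relies on the architecture's zero-extension rule for 32-bit results in 64-bit mode, which is not part of the excerpt above.

```python
MASK64 = (1 << 64) - 1


def write_gpr(old64: int, value: int, size_bits: int) -> int:
    """Model the new 64-bit register contents after a result of size_bits is written."""
    if size_bits == 64:
        return value & MASK64
    if size_bits == 32:
        # 32-bit results are zero-extended to 64 bits.
        return value & 0xFFFFFFFF
    if size_bits in (8, 16):
        # 8-bit and 16-bit results leave the upper bits unchanged.
        mask = (1 << size_bits) - 1
        return (old64 & ~mask & MASK64) | (value & mask)
    raise ValueError("size_bits must be 8, 16, 32, or 64")


rax = 0x1122334455667788
print(hex(write_gpr(rax, 0xAB, 8)))
print(hex(write_gpr(rax, 0xDEADBEEF, 32)))
```

The asymmetry matters in practice: a 32-bit write removes any dependency on the register's old upper half, while 8-bit and 16-bit writes merge with it.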
PAGE 102
• ARPL—Adjust Requestor Privilege Level. Opcode becomes the MOVSXD instruction.
• DEC (one-byte opcode only)—Decrement by 1. Opcode becomes a REX prefix. Use the two-byte DEC opcode instead.
• INC (one-byte opcode only)—Increment by 1. Opcode becomes a REX prefix. Use the two-byte INC opcode instead.
3.4.7 Instructions with 64-Bit Default Operand Size
Most instructions default to 32-bit operand size in 64-bit mode.
PAGE 103
size override prefix for 64-bit mode. For details on the operand-size prefix, see “Instruction Prefixes” in Volume 3. For details on near branches, see “Near Branches in 64-Bit Mode” on page 85. For details on instructions that implicitly reference RSP, see “Stack Operand-Size in 64-Bit Mode” on page 77. For details on opcodes and operand-size overrides, see “General-Purpose Instructions in 64-Bit Mode” in Volume 3. 3.
PAGE 104
Table 3-7. Legacy Instruction Prefixes
Prefix Group            Mnemonic   Prefix Code (Hex)   Description
Operand-Size Override   none       66                  Changes the default operand size of a memory or register operand, as shown in Table 3-3 on page 39.
Address-Size Override   none       67                  Changes the default address size of a memory operand, as shown in Table 2-1 on page 18.
Segment Override        CS         2E                  Forces use of the CS segment for memory operands.
                        DS         3E                  Forces use of the DS segment for memory operands.
PAGE 105
Segment Override Prefix. The DS segment is the default segment for most memory operands. Many instructions allow this default data segment to be overridden using one of the six segment-override prefixes shown in Table 3-7 on page 72. Data-segment overrides will be ignored when accessing data in the following cases:
• When a stack reference is made that pushes data onto or pops data off of the stack. In those cases, the SS segment is always used.
PAGE 106
3.5.2 REX Prefixes
REX prefixes are a new group of instruction-prefix bytes that can be used only in 64-bit mode. They enable the 64-bit register extensions. REX prefixes specify the following features:
• Use of an extended GPR register, shown in Figure 3-3 on page 27.
• Use of an extended XMM register, shown in Figure 4-12 on page 117.
• Use of a 64-bit (quadword) operand size, as described in “Operands” on page 36.
PAGE 107
After software has determined that the processor implementation supports the CPUID instruction, software can test for support of specific features by loading a function code (value) into the EAX register and executing the CPUID instruction. Processor feature information is returned in the EAX, EBX, ECX, and EDX registers, as described fully in “CPUID” in Volume 3. The architecture supports CPUID information about standard functions and extended functions.
PAGE 108
3.6.1 Feature Detection in a Virtualized Environment
Software writers must assume that their software may be executed as a guest in a virtualized environment. A virtualized guest may be migrated between processors of differing capabilities, so the CPUID indication of a feature's presence must be respected. Operating systems, user programs and libraries must all ensure that the CPUID instruction indicates a feature is present before using that feature.
PAGE 109
Figure 3-9 shows the relationship of the four privilege-levels to each other. The protection scheme is implemented using the segmented memory-management mechanism described in “Segmented Virtual Memory” in Volume 2.
Figure 3-9. Privilege-Level Relationships (figure labels: Memory Management, File Allocation, Interrupt Handling; Device-Drivers, Library Routines; Application Programs; Privilege 0 through Privilege 3)
3.7.
PAGE 110
Except for far branches, all instructions that implicitly reference the stack pointer default to 64-bit operand size in 64-bit mode. Table 3-8 on page 79 lists these instructions. The default 64-bit operand size eliminates the need for a REX prefix with these instructions. However, a REX prefix is still required if R8–R15 (the extended set of eight GPRs) are used as operands, because the prefix is required to address the extended registers.
PAGE 111
Table 3-8.
PAGE 112
the CALL instruction. Parameters can be pushed onto the stack by the calling procedure prior to executing the CALL instruction. Figure 3-10 shows the stack pointer before (old rSP value) and after (new rSP value) the CALL. The stack segment (SS) is not changed.
Figure 3-10. Procedure Stack, Near Call (stack contents: Parameters, ..., Return rIP; Old rSP and New rSP markers)
Far Call, Same Privilege.
PAGE 113
supported. Absolute far calls (those that reference the base of the code segment) are not supported in 64-bit mode. When a call to a more-privileged procedure occurs, the processor locates the new procedure’s stack pointer from its task-state segment (TSS).
PAGE 114
The three types of RET are:
• Near Return—Transfers control back to the calling procedure within the current code segment.
• Far Return—Transfers control back to the calling procedure outside the current code segment.
• Interprivilege-Level Far Return—A far return that changes privilege levels.
All of the RET instruction types can be used with an immediate operand indicating the number of parameter bytes present on the stack.
PAGE 115
Figure 3-14. Procedure Stack, Far Return from Same Privilege (stack contents: Parameters, ..., Return CS, Return rIP; New rSP and Old rSP markers)
Far Return, Less Privilege. Privilege-changing far RETs can only return to less-privileged code segments, otherwise a general-protection exception occurs. The full return pointer is popped off the stack and into the CS and rIP registers, and execution begins from the newly-loaded segment and offset.
PAGE 116
3.7.7 System Calls
A disadvantage of far CALLs and far RETs is that they use segment-based protection and privilege-checking. This involves significant overhead associated with loading new segment selectors and their corresponding descriptors into the segment registers. The overhead includes not only the time required to load the descriptors from memory but also the time required to perform the privilege, type, and limit checks.
PAGE 117
3.7.9 Branching in 64-Bit Mode
Near Branches in 64-Bit Mode. The long-mode architecture expands the near-branch mechanisms to accommodate branches in the full 64-bit virtual-address space. In 64-bit mode, the operand size for all near branches defaults to 64 bits, so these instructions update the full 64-bit RIP. Table 3-9 lists the near-branch instructions.
Table 3-9.
PAGE 118
Branches to 64-Bit Offsets. Because immediates are generally limited to 32 bits, the only way a full 64-bit absolute RIP can be specified in 64-bit mode is with an indirect branch. For this reason, direct forms of far branches are invalid in 64-bit mode.
3.7.10 Interrupts and Exceptions
Interrupts and exceptions are a form of control transfer operation.
PAGE 119
• Faults—A fault is a precise exception that is reported on the boundary before the interrupted instruction. Generally, faults are caused by an undesirable error condition involving the interrupted instruction, although some faults (such as page faults) are common and normal occurrences. After the service routine completes, the machine state prior to the faulting instruction is restored, and the instruction is retried.
PAGE 120
Table 3-10. Vector
PAGE 121
Figure 3-16. Procedure Stack, Interrupt to Same Privilege (interrupt handler stack contents: rFLAGS, Return CS, Return rIP, Error Code; Old rSP and New rSP markers)
Interrupt to More Privilege or in Long Mode. When an interrupt to a more-privileged handler occurs or the processor is operating in long mode the processor locates the handler’s stack pointer from the TSS. The old stack pointer (SS:rSP) is pushed onto the new stack, along with a copy of the rFLAGS register.
PAGE 122
3.8 Input/Output
I/O devices allow the processor to communicate with the outside world, usually to a human or to another system. In fact, a system without I/O has little utility. Typical I/O devices include a keyboard, mouse, LAN connection, printer, storage devices, and monitor. The speeds these devices must operate at vary greatly, and usually depend on whether the communication is to a human (slow) or to another machine (fast). There are exceptions.
PAGE 123
Figure 3-18. I/O Address Space (addresses 0000 through FFFF, that is, 0 through 2^16 - 1)
Memory-Mapped I/O. Memory-mapped I/O devices are attached to the system memory bus and respond to memory transactions as if they were memory devices, such as DRAM. Access to memory-mapped I/O devices can be performed using any instruction that accesses memory, but typically MOV instructions are used to transfer data between the processor and the device.
PAGE 124
result (speculation), and it can reorder reads ahead of writes. In the case of writes, multiple writes to memory locations in close proximity to each other can be combined into a single write or a burst of multiple writes. Writes can also be delayed, or buffered, by the processor. Application software that needs to force memory ordering to memory-mapped I/O devices can do so using the read/write barrier instructions: LFENCE, SFENCE, and MFENCE.
PAGE 125
techniques that can be implemented within a system design, and how applications can optimize their use.
3.9.1 Accessing Memory
Implementations of the AMD64 architecture commit the results of each instruction—i.e.
PAGE 126
Some system devices might be sensitive to reads. Normally, applications do not have direct access to system devices, but instead call an operating-system service routine to perform the access on the application’s behalf. In this case, it is system software’s responsibility to enforce strong read-ordering. Write Ordering. Writes affect program order because they affect the state of software-visible resources.
PAGE 127
around an MFENCE instruction, but other non-serializing instructions that do not access memory can be reordered around the MFENCE. Although they serve different purposes, other instructions can be used as read/write barriers when the order of memory accesses must be strictly enforced. These read/write barrier instructions force all prior reads and writes to complete before subsequent reads or writes are executed.
PAGE 128
3.9.3 Caches
Depending on the instruction, operands can be encoded in the instruction opcode or located in registers, I/O ports, or memory locations. An operand that is located in memory can actually be physically present in one or more locations within a system’s memory hierarchy. Memory Hierarchy.
PAGE 129
Figure 3-19. Memory Hierarchy Example (from larger size to faster access: Main Memory, L3 Cache, L2 Cache, L1 Instruction Cache and L1 Data Cache)
Write Buffering. Processor implementations can contain write-buffers attached to the internal caches. Write buffers can also be present on the interface used to communicate with the external portions of the memory hierarchy.
PAGE 130
access a memory address that is cached, the processor maintains coherency by providing the correct data back to the device and main memory. When a memory-read occurs as a result of an instruction fetch or operand access, the processor first checks the cache to see if the requested information is available. A read hit occurs if the information is available in the cache, and a read miss occurs if the information is not available.
PAGE 131
• Cache-control instructions (“Cache-Control Instructions” on page 99) are available to applications to minimize cache pollution caused by non-temporal data.
Spatial locality refers to data that resides at addresses adjacent to or very close to the data being referenced. Typically, when data is accessed, it is likely the data at nearby addresses will be accessed in a short period of time.
PAGE 132
to a PREFETCH. Refer to the Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors, order# 25112, for details relating to a particular processor family, brand or model.
- PREFETCHT0—Prefetches temporal data into the entire cache hierarchy.
- PREFETCHT1—Prefetches temporal data into the second-level (L2) and higher-level caches, but not into the L1 cache.
PAGE 133
then invalidates the line in the cache and in all other caches in the cache hierarchy that contain the line. Once invalidated, the line is available for use by the processor and can be filled with other data. 3.
PAGE 134
For data that will be used only once in a procedure, consider using non-temporal accesses. Such accesses are not burdened by the overhead of cache protocols.
3.10.6 Keep Common Operands in Registers
Keep frequently used values in registers rather than in memory. This avoids the comparatively long latencies for accessing memory. 3.10.
PAGE 135
3.10.12 Organize Data in Memory Blocks
Organize frequently accessed constants and coefficients into cache-line-size blocks and prefetch them. Procedures that access data arranged in memory-bus-sized blocks, or memory-burst-sized blocks, can make optimum use of the available memory bandwidth. 3.
PAGE 136
PAGE 137
4 128-Bit Media and Scientific Programming
This chapter describes the 128-bit media and scientific programming model. This model includes all instructions that access the 128-bit XMM registers—called the 128-bit media instructions. These instructions perform integer and floating-point operations primarily on vector operands (a few of the instructions take scalar operands).
PAGE 138
4.2 Capabilities
The 128-bit media instructions are designed to support media and scientific applications. The vector operands used by these instructions allow applications to operate in parallel on multiple elements of vectors. The elements can be integers (from bytes to quadwords) or floating-point (either single-precision or double-precision). Arithmetic operations produce signed, unsigned, and/or saturating results.
PAGE 139
the source operands. The result of the operation replaces the first source operand. There are also instructions that operate on vectors of words, doublewords, or quadwords.
Figure 4-1. Parallel Operations on Vectors of Integer Elements (operand 1 and operand 2, bits 127-0; one operation per element pair; result, bits 127-0)
4.2.
PAGE 140
instructions are often required to operate completely on the data. For example, software can change the viewing perspective of a 3D scene through transformation matrices by using floating-point instructions in the same procedure that contains integer operations on other aspects of the graphics data. It is typically much easier to write 128-bit media programs using floating-point instructions.
PAGE 141
Figure 4-4. Pack Operation (operand 1 and operand 2, bits 127-0; result, bits 127-0)
Figure 4-5 shows one of many types of shuffle operation (PSHUFD). Here, the second operand is a vector containing doubleword elements, and an immediate byte provides shuffle control for up to 256 permutations of the elements. Shuffles are useful, for example, in color imaging when computing alpha saturation of RGB values.
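The way an immediate byte encodes up to 256 doubleword arrangements can be sketched as follows. This is an illustrative model of the PSHUFD selection rule, not an intrinsic: the function name `pshufd` reuses the mnemonic for readability, and the vector is represented as a Python list whose index 0 is the least-significant doubleword.

```python
def pshufd(source: list[int], imm8: int) -> list[int]:
    """Model of a doubleword shuffle controlled by an immediate byte.

    Each 2-bit field of imm8 selects which source doubleword lands in
    the corresponding destination position (field 0 -> element 0, etc.).
    """
    if len(source) != 4:
        raise ValueError("a 128-bit vector holds four doublewords")
    return [source[(imm8 >> (2 * i)) & 0b11] for i in range(4)]


vec = [10, 11, 12, 13]
print(pshufd(vec, 0x1B))  # reverse the element order
print(pshufd(vec, 0x00))  # broadcast element 0 to all positions
```

Four 2-bit selectors give 4^4 = 256 possible control values, which matches the count stated above. Many of them repeat elements rather than permuting them.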
PAGE 142
4.2.5 Block Operations
Move instructions—along with unpack instructions—are among the most frequently used instructions in 128-bit media procedures. Figure 4-6 on page 111 shows the combined set of move operations supported by the integer and floating-point move instructions. These instructions provide a fast way to copy large amounts of data between registers or between registers and memory.
PAGE 143
Figure 4-6. (move operations among XMM registers, memory, GPRs, and MMX™ registers)
PAGE 144
Figure 4-7. Move Mask Operation (operand 1 and operand 2, bits 127-0; select; store to memory at the address in rDI)
4.2.6 Matrix and Special Arithmetic Operations
The instruction set provides a broad assortment of vector add, subtract, multiply, divide, and square-root operations for use on matrices and other data structures common to media and scientific applications.
PAGE 145
in many media algorithms such as those required for finite impulse response (FIR) filters, one of the commonly used DSP algorithms.
Figure 4-8. Multiply-Add Operation (operand 1 and operand 2, bits 127-0; multiply into an intermediate result, bits 255-0; add into the result, bits 127-0)
There is also a sum-of-absolute-differences instruction (PSADBW), shown in Figure 4-9 on page 114.
PAGE 146
Figure 4-9. Sum-of-Absolute-Differences Operation (operand 1 and operand 2, bits 127-0; absolute differences form high-order and low-order intermediate results, which are summed into the result)
There is an instruction for computing the average of unsigned bytes or words.
PAGE 147
The sequence in Figure 4-10 begins with a vector compare instruction that compares the elements of two source operands in parallel and produces a mask vector containing elements of all 1s or 0s. This mask vector is ANDed with one source operand and ANDed-Not with the other source operand to isolate the desired elements of both operands. These results are then ORed to select the relevant elements from each operand.
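The compare, AND, AND-NOT, OR sequence can be sketched element by element. This is a model under assumptions: the function name `select_greater` is hypothetical, elements are treated as non-negative values of a fixed width, and the Python conditional expression stands in for the vector compare instruction that generates the mask.

```python
def select_greater(a: list[int], b: list[int], bits: int = 16) -> list[int]:
    """Pick the larger element from each pair using mask arithmetic.

    mask   = all 1s where a > b, else all 0s   (vector compare)
    result = (a AND mask) OR (b AND NOT mask)  (AND, AND-NOT, OR)
    """
    all_ones = (1 << bits) - 1
    result = []
    for x, y in zip(a, b):
        mask = all_ones if x > y else 0        # stands in for the compare instruction
        keep_a = x & mask                      # isolates x where the compare was true
        keep_b = y & (~mask & all_ones)        # isolates y where the compare was false
        result.append(keep_a | keep_b)
    return result


print(select_greater([1, 9, 5, 0], [4, 2, 5, 7]))
```

On hardware the same four steps operate on all elements of the vector at once, so no conditional branch is executed per element.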
PAGE 148
Figure 4-11. Move Mask Operation (XMM, bits 127-0, to GPR: concatenate 16 most-significant bits)
4.3 Registers
Operands for most 128-bit media instructions are located in XMM registers or memory. Operation of the 128-bit media instructions is supported by the MXCSR control and status register.
PAGE 149
Figure 4-12. 128-Bit Media Registers (XMM data registers xmm0 through xmm15, bits 127-0, some available in all modes and some available only in 64-bit mode; 128-bit media control and status register MXCSR, bits 31-0)
Upon power-on reset, all 16 XMM registers are cleared to +0.0. However, initialization by means of the #INIT external input signal does not change the state of the XMM registers. 4.3.
PAGE 150
using the FXRSTOR or LDMXCSR instructions, and it can store the register to memory using the FXSAVE or STMXCSR instructions.
PAGE 151
denormals are zeros (DAZ) bit, the processor does not set the DE bit. (See “Denormalized (Tiny) Numbers” on page 128.) Zero-Divide Exception (ZE). Bit 2. The processor sets this bit to 1 when a non-zero number is divided by zero. Overflow Exception (OE). Bit 3. The processor sets this bit to 1 when the absolute value of a rounded result is larger than the largest representable normalized floating-point number for the destination format.
PAGE 152
• 10 = round up
• 11 = round toward zero
For details, see “Floating-Point Rounding” on page 132.
Flush-to-Zero (FZ). Bit 15. If the rounded result is tiny and the underflow mask is set, the FZ bit causes the result to be flushed to zero. This naturally causes the result to be inexact, which causes both PE and UE to be set. The sign returned with the zero is the sign of the true result. The FZ bit does not have any effect if the underflow mask is 0.
PAGE 153
4.4 Operands
Operands for a 128-bit media instruction are either referenced by the instruction's opcode or included as an immediate value in the instruction encoding. Depending on the instruction, referenced operands can be located in registers or memory. The data types of these operands include vector and scalar floating-point, and vector and scalar integer.
4.4.1 Data Types
Figure 4-14 on page 122 shows the register images of the 128-bit media data types.
PAGE 154
PAGE 155
Software can interpret the data types in ways other than those shown in Figure 4-14 on page 122— such as bit fields or fractional numbers—but the 128-bit media instructions do not directly support such interpretations and software must handle them entirely on its own.
4.4.2 Operand Sizes and Overrides
Operand sizes for 128-bit media instructions are determined by instruction opcodes.
PAGE 156
• MOVUPS—Move Unaligned Packed Single-Precision Floating-Point.
• LDDQU—Load Unaligned Double Quadword
When alignment checking is enabled (CR0.AM = 1 and rFLAGS.AC = 1) and the MXCSR misaligned exception mask (MM) bit is set to 1, a 16-byte misaligned memory access on most packed SSE instructions will not cause a #GP exception, but a #AC exception is generated instead.
PAGE 157
Table 4-1. Range of Values in 128-Bit Media Integer Data Types
Data-Type Interpretation | Byte | Word | Doubleword | Quadword | Double Quadword
Unsigned integers, Base-2 (exact) | 0 to +2^8–1 | 0 to +2^16–1 | 0 to +2^32–1 | 0 to +2^64–1 | 0 to +2^128–1
Unsigned integers, Base-10 (approx.) | 0 to 255 | 0 to 65,535 | 0 to 4.29 * 10^9 | 0 to 1.84 * 10^19 | 0 to 3.
Signed integers(1), Base-2 (exact)
Signed integers(1), Base-10 (approx.)
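The unsigned rows of Table 4-1 follow directly from the element width. As a rough sketch, the helpers below compute both ranges for a given bit width; the signed formula assumes twos-complement representation, and the function names are hypothetical.

```python
def unsigned_range(bits):
    """Smallest and largest unsigned integers representable in `bits` bits."""
    return 0, 2**bits - 1


def signed_range(bits):
    """Smallest and largest signed integers, assuming twos-complement."""
    return -(2**(bits - 1)), 2**(bits - 1) - 1


# Print the ranges for the five integer data types named in the table.
for name, bits in [("Byte", 8), ("Word", 16), ("Doubleword", 32),
                   ("Quadword", 64), ("Double Quadword", 128)]:
    print(name, unsigned_range(bits), signed_range(bits))
```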
PAGE 158
software may use fixed-point operands in which the implied binary point is located in any position. In such cases, software is responsible for managing the interpretation of such implied binary points, as well as any redundant sign bits that may occur during multiplication.
4.4.6 Floating-Point Data Types
The 128-bit media floating-point instructions take vector or scalar operands, depending on the instruction.
PAGE 159
• Single-Precision Format—This format includes a 1-bit sign, an 8-bit biased exponent whose bias is 127, and a 23-bit significand. The integer bit is implied, making a total of 24 bits in the significand.
• Double-Precision Format—This format includes a 1-bit sign, an 11-bit biased exponent whose bias is 1023, and a 52-bit significand. The integer bit is implied, making a total of 53 bits in the significand.
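To make the single-precision layout concrete, the sketch below splits a value into its three fields using Python's standard `struct` module. The function name `single_fields` is hypothetical, and the code only illustrates the bit layout described above.

```python
import struct


def single_fields(x):
    """Return (sign, biased exponent, 23-bit fraction) of x as single precision."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31                      # 1-bit sign
    biased_exponent = (bits >> 23) & 0xFF  # 8-bit exponent, bias 127
    fraction = bits & 0x7FFFFF             # 23 stored significand bits
    return sign, biased_exponent, fraction
```

The value 1.0 has an unbiased exponent of 0, so its stored exponent equals the bias, 127.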
PAGE 160
• Infinity
• Not a Number (NaN)
In common engineering and scientific usage, floating-point numbers—also called real numbers—are represented in base (radix) 10. A non-zero number consists of a sign, a normalized significand, and a signed exponent, as in:
+2.71828 e0
Both large and small numbers are representable in this notation, subject to the limits of data-type precision. For example, a million in base-10 notation appears as +1.00000 e6 and -0.
PAGE 161
Denormalization may correct the exponent by placing leading zeros in the significand. This may cause a loss of precision, because the number of significant bits in the fraction is reduced by the leading zeros. In the single-precision floating-point format, for example, normalized numbers have biased exponents ranging from 1 to 254 (the unbiased exponent range is from –126 to +127).
PAGE 162
bit determines the processor’s response, as described in “SIMD Floating-Point Exception Masking” on page 184. When a floating-point operation or exception produces a QNaN result, its value is determined by the rules in Table 4-5.
Table 4-5.
PAGE 163
Table 4-6. Supported Floating-Point Encodings
Classification | Sign | Biased Exponent | Significand(2)
SNaN | 0 | 111 ... 111 | 1.011 ... 111 to 1.000 ... 001
QNaN | 0 | 111 ... 111 | 1.111 ... 111 to 1.100 ... 000
Positive Infinity (+∞) | 0 | 111 ... 111 | 1.000 ... 000
Positive Normal | 0 | 111 ... 110 to 000 ... 001 | 1.111 ... 111 to 1.000 ... 000
Positive Denormal | 0 | 000 ... 000 | 0.111 ... 111 to 0.000 ... 001
Positive Zero | 0 | 000 ... 000 | 0.000 ...
Negative Floating-Point Numbers
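The encodings in Table 4-6 can be told apart from the exponent and fraction fields alone. The sketch below classifies a 32-bit single-precision bit pattern, assuming the standard field positions (8-bit exponent at bits 30:23, 23-bit fraction below it). The function name `classify_single` is hypothetical, and the sign bit is ignored.

```python
def classify_single(bits):
    """Classify a 32-bit single-precision pattern by exponent and fraction."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:                  # exponent all 1s: infinity or NaN
        if fraction == 0:
            return "Infinity"
        # The most-significant fraction bit separates QNaN (1) from SNaN (0).
        return "QNaN" if fraction & 0x400000 else "SNaN"
    if exponent == 0:                     # exponent all 0s: zero or denormal
        return "Zero" if fraction == 0 else "Denormal"
    return "Normal"
```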
PAGE 164
floating-point-to-integer data conversion overflows its destination integer data type, and IE exceptions are masked, the integer indefinite value is returned as the result. Table 4-7 shows the encodings of the indefinite values for each data type. For floating-point numbers, the indefinite value is a special form of QNaN. For integers, the indefinite value is the largest representable negative twos-complement number, 80...00h.
PAGE 165
in interval arithmetic, in which upper and lower bounds bracket the true result of a computation. Round toward zero takes the smaller in magnitude, that is, always truncates. The processor produces a floating-point result defined by the IEEE standard to be infinitely precise.
PAGE 166
PADDB xmm1, xmm2/mem128
Figure 4-16. Mnemonic Syntax for Typical Instruction (labels: Mnemonic; First Source Operand and Destination Operand; Second Source Operand)
This example shows the PADDB mnemonic followed by two operands, a 128-bit XMM register operand and another 128-bit XMM register or 128-bit memory operand. In most instructions that take two operands, the first (left-most) operand is both a source operand and the destination operand.
PAGE 167
• S—Signed, or Saturation, or Shift
• SD—Scalar double-precision floating-point
• SI—Signed integer
• SS—Scalar single-precision floating-point, or Signed saturation
• U—Unsigned, or Unordered, or Unaligned
• US—Unsigned saturation
• W—Word
• x—One or more variable characters in the mnemonic
For example, the mnemonic for the instruction that packs four words into eight unsigned bytes is PACKUSWB.
PAGE 168
location, the memory address must be aligned. The MOVDQU instruction does the same, except for unaligned operands. The LDDQU instruction is virtually identical in operation to the MOVDQU instruction. The LDDQU instruction moves a double quadword of data from a 128-bit memory operand into a destination XMM register. The MOVDQ2Q instruction copies the low-order 64-bit value in an XMM register to an MMX register.
PAGE 169
PAGE 170
be unaligned. Figure 4-18 shows the MASKMOVDQU operation. It is useful for the handling of end cases in block copies and block fills based on streaming stores.
Figure 4-18. MASKMOVDQU Move Mask Operation (bytes of operand 1 selected by operand 2 are stored to memory at the store address in rDI)
Move Mask.
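The byte-masked store shown in Figure 4-18 can be modeled in a few lines. This is a sketch under stated assumptions: a byte is stored when the most-significant bit of the corresponding mask byte is 1 (the usual definition of the instruction, not stated in the excerpt above), memory is modeled as a byte string, and the function name `maskmovdqu` and its parameters are hypothetical. It ignores the streaming-store (cache) behavior entirely.

```python
def maskmovdqu(memory, address, data, mask):
    """Return a copy of `memory` with the selected bytes of `data` stored.

    Byte i of `data` is written to memory[address + i] only when bit 7 of
    mask[i] is set (assumed selection rule); other bytes are left untouched.
    """
    result = bytearray(memory)
    for i in range(16):
        if mask[i] & 0x80:
            result[address + i] = data[i]
    return bytes(result)
```

For the end case of a block copy that has, say, five bytes left, a mask whose first five bytes are 80h and whose remaining bytes are 00h writes only those five bytes.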
PAGE 171
Figure 4-19. PMOVMSKB Move Mask Operation (the 16 most-significant bits of the XMM byte elements are concatenated into a GPR)
4.5.3 Data Conversion
The integer data-conversion instructions convert integer operands to floating-point operands. These instructions take 128-bit integer source operands. For data-conversion instructions that take 128-bit floating-point source operands, see “Data Conversion” on page 162.
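The PMOVMSKB operation in Figure 4-19 gathers one bit per byte. As an illustrative sketch, the function below takes 16 byte values (index 0 being the least-significant byte, an assumed convention) and builds the 16-bit mask; the name `pmovmskb` is used here only as a label for the model.

```python
def pmovmskb(vector):
    """Concatenate the most-significant bit of each byte into an integer mask."""
    mask = 0
    for i, byte in enumerate(vector):
        mask |= (byte >> 7) << i   # bit 7 of byte i becomes bit i of the mask
    return mask
```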
PAGE 172
Before executing a CVTPI2x instruction, software should ensure that the MMX registers are properly initialized so as to prevent conflict with their aliased use by x87 floating-point instructions. This may require clearing the MMX state, as described in “Accessing Operands in MMX™ Registers” on page 188.
PAGE 173
Figure 4-20 shows an example of a PACKSSDW instruction. The operation merges vector elements of 2x size into vector elements of 1x size, thus reducing the precision of the vector-element data types. Any results that would otherwise overflow or underflow are saturated (clamped) at the maximum or minimum representable value, respectively, as described in “Saturation” on page 125.
Figure 4-20.
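The pack-with-saturation behavior described above can be sketched as follows. Each operand is modeled as a list of four signed doublewords, lowest element first; the ordering of the result (operand 1 elements in the low half, operand 2 elements in the high half) is an assumption based on the usual definition of PACKSSDW, and the helper names are hypothetical.

```python
def saturate_int16(value):
    """Clamp a signed value to the signed 16-bit range."""
    return max(-32768, min(32767, value))


def packssdw(operand1, operand2):
    """Pack eight signed doublewords into eight signed words with saturation."""
    return [saturate_int16(v) for v in operand1 + operand2]
```

Values already inside the 16-bit range pass through unchanged; only out-of-range values are clamped.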
PAGE 174
into interleaved words, interleaved doublewords, and interleaved quadwords in the destination operand. The PUNPCKLBW, PUNPCKLWD, PUNPCKLDQ, and PUNPCKLQDQ instructions are analogous to their high-element counterparts except that they take elements from the low quadword of each source vector and ignore elements in the high quadword.
PAGE 175
PUNPCKLQDQ may be 64 bits, but the width of the memory access of the memory-operand forms of PUNPCKHBW, PUNPCKHWD, PUNPCKHDQ, and PUNPCKHQDQ may be 128 bits. Thus, the alignment constraints for PUNPCKLx instructions may be less restrictive than the alignment constraints for PUNPCKHx instructions. For details, see the documentation for particular hardware implementations of the architecture.
PAGE 176
Figure 4-22. PINSRW Operation (imm8 selects the word position in the xmm operand at which the word from reg32/64/mem16 is inserted)
Shuffle. These instructions reorder the elements of a vector.
PAGE 177
Figure 4-23. PSHUFD Shuffle Operation
The PSHUFHW and PSHUFLW instructions are analogous to PSHUFD, except that they fill each word of the high or low quadword, respectively, of the first operand by copying any one of the four words in the high or low quadword of the second operand. Figure 4-24 shows the PSHUFHW operation.
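A shuffle of this kind amounts to indexing the source vector with small fields taken from an immediate byte. The sketch below models PSHUFD on a list of four doublewords, lowest element first, and assumes the usual encoding in which bits 2i+1:2i of the immediate select the source element for destination element i. The function name is a label for the model only.

```python
def pshufd(source, imm8):
    """Fill each destination doubleword with the source element chosen by imm8."""
    return [source[(imm8 >> (2 * i)) & 0b11] for i in range(4)]
```

With this encoding, an immediate of E4h leaves the vector unchanged and 1Bh reverses it.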
PAGE 178
Figure 4-25. Arithmetic Operation on Vectors of Bytes
Addition.
PAGE 179
The PADDUSB and PADDUSW instructions perform saturating-add operations analogous to the PADDSB and PADDSW instructions, except on unsigned integer elements.
Subtraction.
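As a small illustration of the unsigned saturating add performed by PADDUSB, the sketch below adds two lists of byte values element by element and clamps each sum at 255 instead of letting it wrap around. The function name is a label for the model only.

```python
def paddusb(a, b):
    """Element-wise unsigned byte addition, saturating at 255."""
    return [min(x + y, 255) for x, y in zip(a, b)]
```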
PAGE 180
The PMULHW instruction multiplies each 16-bit signed integer value in the first operand by the corresponding 16-bit integer in the second operand, producing a 32-bit intermediate result. The instruction then writes the high-order 16 bits of the 32-bit intermediate result of each multiplication to the corresponding word of the destination.
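The PMULHW description above maps onto a multiply followed by an arithmetic right shift. In this sketch the elements are Python integers holding signed 16-bit values, and the result is returned as signed values as well; the function name is a label for the model only.

```python
def pmulhw(a, b):
    """Return the high-order 16 bits of each signed 16x16-bit product."""
    # Python's >> on integers is an arithmetic shift, so the sign is preserved.
    return [(x * y) >> 16 for x, y in zip(a, b)]
```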
PAGE 181
Figure 4-27. PMULUDQ Multiply Operation
See “Shift” on page 152 for shift instructions that can be used to perform multiplication and division by powers of 2.
Multiply-Add. This instruction multiplies the elements of two source vectors and adds their intermediate results in a single operation.
PAGE 182
Figure 4-28. PMADDWD Multiply-Add Operation
PMADDWD can be used with one source operand (for example, a coefficient) taken from memory and the other source operand (for example, the data to be multiplied by that coefficient) taken from an XMM register.
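The multiply-add in Figure 4-28 can be sketched as pairwise products followed by sums of adjacent pairs. The model below works on two lists of eight signed word values and returns four doubleword sums; it does not model 32-bit wraparound, and the function name is a label only.

```python
def pmaddwd(a, b):
    """Multiply corresponding words, then add each adjacent pair of products."""
    products = [x * y for x, y in zip(a, b)]
    return [products[i] + products[i + 1] for i in range(0, len(products), 2)]
```

With coefficients in one operand and data in the other, each result element is a two-term dot product.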
PAGE 183
operand1[i] = ((operand1[i] + operand2[i]) + 1) ÷ 2
where: i = 0 to n – 1
The PAVGB instruction is useful for MPEG decoding, in which motion compensation performs many byte-averaging operations between and within macroblocks. In addition to speeding up these operations, PAVGB can free up registers and make it possible to unroll the averaging loops.
Sum of Absolute Differences.
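The PAVGB averaging formula given above translates directly into code. In this sketch the operands are lists of unsigned byte values, and the division by 2 is an integer shift so that the added 1 rounds halves upward; the function name is a label for the model only.

```python
def pavgb(operand1, operand2):
    """Rounded average of corresponding unsigned bytes: (a + b + 1) / 2."""
    return [(a + b + 1) >> 1 for a, b in zip(operand1, operand2)]
```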
PAGE 184
4.5.6 Shift
The vector-shift instructions are useful for scaling vector elements to higher or lower precision, packing and unpacking vector elements, and multiplying and dividing vector elements by powers of 2.
Left Logical Shift.
PAGE 185
operand1[i] = operand1[i] ÷ 2^operand2
where: i = 0 to n – 1
The PSRLDQ instruction differs from the other three right-shift instructions because it operates on bytes rather than bits. It right-shifts the 128-bit (double quadword) value in an XMM register by the number of bytes specified in an immediate byte value. PSRLDQ can be used, for example, to move the high 8 bytes of an XMM register to the low 8 bytes of the register.
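The byte-granular shift performed by PSRLDQ can be modeled on a Python integer holding the 128-bit register value. This is a sketch: returning zero for byte counts above 15 is an assumption based on the usual definition of the instruction, and the function name is a label only.

```python
MASK128 = (1 << 128) - 1


def psrldq(value, byte_count):
    """Logical right shift of a 128-bit value by a whole number of bytes."""
    if byte_count > 15:          # assumed: all bytes shifted out
        return 0
    return (value & MASK128) >> (8 * byte_count)
```

Shifting by 8 bytes moves the high quadword into the low quadword, as in the example in the text.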
PAGE 186
Figure 4-30. PCMPEQB Compare Operation (each pair of elements is compared, and the corresponding result element is all 1s or all 0s)
For the PCMPEQx instructions, if the compared values are equal, the result mask is all 1s. If the values are not equal, the result mask is all 0s.
PAGE 187
In the above sequence, PCMPGTW, PAND, PANDN, and POR operate, in parallel, on all four elements of the vectors.
Compare and Write Minimum or Maximum.
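One common way to combine a compare mask with PAND, PANDN, and POR is a branch-free select: keep the bits of one operand where the mask is all 1s and the bits of the other where it is all 0s. The sketch below is not the manual's own sequence, which is not reproduced in this excerpt; it models the idea on lists of non-negative 16-bit values, and all function names are hypothetical.

```python
def pcmpgtw(a, b):
    """Per-element mask: FFFFh where a > b, 0000h otherwise."""
    return [0xFFFF if x > y else 0x0000 for x, y in zip(a, b)]


def select(mask, a, b):
    """(mask AND a) OR ((NOT mask) AND b), element by element."""
    return [(m & x) | (~m & 0xFFFF & y) for m, x, y in zip(mask, a, b)]


def vector_max(a, b):
    """Branch-free maximum built from a compare mask and bitwise logic."""
    return select(pcmpgtw(a, b), a, b)
```

Because every element is processed with the same bitwise operations, no per-element branch is needed.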
PAGE 188
Or
• POR—Packed Logical Bitwise OR
The POR instruction performs a logical bitwise OR of the values in the first and second operands and writes the result to the destination.
Exclusive Or
• PXOR—Packed Logical Bitwise Exclusive OR
The PXOR instruction performs a logical bitwise exclusive OR of the values in the first and second operands and writes the result to the destination.
PAGE 189
For a summary of the 64-bit media floating-point instructions, see “Instruction Summary—Floating-Point Instructions” on page 223. For a summary of the x87 floating-point instructions, see “Instruction Summary” on page 261. The instructions are organized here by functional group—such as data-transfer, vector arithmetic, and so on.
PAGE 190
The MOVAPx instructions copy a vector of four single-precision floating-point values (MOVAPS) or a vector of two double-precision floating-point values (MOVAPD) from the second operand to the first operand—i.e., from an XMM register or 128-bit memory location to another XMM register, or vice versa. A general-protection exception occurs if a memory operand is not aligned on a 16-byte boundary.
PAGE 191
PAGE 192
64-bit memory location. In the memory-to-register case, the low-order 64 bits of the destination XMM register are not modified. The MOVLPS and MOVLPD instructions copy a vector of two single-precision floating-point values (MOVLPS) or one double-precision floating-point value (MOVLPD) from a 64-bit memory location to the low-order 64 bits of an XMM register, or from the low-order 64 bits of an XMM register to a 64-bit memory location.
PAGE 193
Move Non-Temporal. The move non-temporal instructions are streaming-store instructions. They minimize pollution of the cache.
PAGE 194
Figure 4-32. MOVMSKPS Move Mask Operation (the 4 sign bits of the XMM elements are concatenated into a GPR)
4.6.3 Data Conversion
The floating-point data-conversion instructions convert floating-point operands to integer operands. These data-conversion instructions take 128-bit floating-point source operands. For data-conversion instructions that take 128-bit integer source operands, see “Data Conversion” on page 139.
PAGE 195
The CVTSS2SD instruction converts a single-precision floating-point value in the low-order 32 bits of the second operand to a double-precision floating-point value in the low-order 64 bits of the destination. The high-order 64 bits in the destination XMM register are not modified.
PAGE 196
the value is rounded, but for the CVTTPS2PI instruction such a result is truncated (rounded toward zero). The CVTPD2PI and CVTTPD2PI instructions convert two double-precision floating-point values in an XMM register or a 128-bit memory location to two 32-bit signed integer values in an MMX register.
PAGE 197
4.6.4 Data Reordering
The floating-point data-reordering instructions unpack and interleave, or shuffle the elements of vector operands.
Unpack and Interleave. These instructions interleave vector elements from the high or low halves of two floating-point source operands.
PAGE 198
• SHUFPD—Shuffle Packed Double-Precision Floating-Point
The SHUFPS instruction moves any two of the four single-precision floating-point values in the first operand to the low-order quadword of the destination and moves any two of the four single-precision floating-point values in the second operand to the high-order quadword of the destination. In each case, the value of the destination is determined by a field in the immediate-byte operand.
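The SHUFPS selection described above can be sketched with four 2-bit fields of the immediate byte. The field assignment used here (bits 1:0 and 3:2 select from the first operand, bits 5:4 and 7:6 select from the second) is an assumption based on the usual encoding; operands are lists of four values, lowest element first, and the function name is a label for the model only.

```python
def shufps(operand1, operand2, imm8):
    """Build the result from two elements of each operand, chosen by imm8."""
    return [
        operand1[imm8 & 0b11],          # low quadword, first element
        operand1[(imm8 >> 2) & 0b11],   # low quadword, second element
        operand2[(imm8 >> 4) & 0b11],   # high quadword, first element
        operand2[(imm8 >> 6) & 0b11],   # high quadword, second element
    ]
```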
PAGE 199
The ADDPS instruction adds each of four single-precision floating-point values in the first operand to the corresponding single-precision floating-point values in the second operand and writes the result in the corresponding doubleword of the destination. The ADDPD instruction performs an analogous operation for two double-precision floating-point values.
PAGE 200
point values in the third and fourth doublewords of the source operand and stores the sum in the fourth doubleword of the destination operand. The HADDPD instruction adds the two double-precision floating-point values in the quadword halves of the destination operand and stores the sum in the first quadword of the destination.
PAGE 201
source operand from the third doubleword of the source operand and stores the result in the fourth doubleword of the destination. The HSUBPD instruction subtracts the second quadword of the destination register from the first quadword of the destination operand and stores the difference in the first quadword of the destination register.
PAGE 202
PAGE 203
low-order doubleword of the destination. The three high-order doublewords of the destination XMM register are not modified. The SQRTSD instruction computes the square root of the low-order double-precision floating-point value in the second operand (an XMM register or 64-bit memory location) and writes the result in the low-order quadword of the destination. The high-order quadword of the destination XMM register is not modified.
PAGE 204
• CMPPD—Compare Packed Double-Precision Floating-Point
• CMPSS—Compare Scalar Single-Precision Floating-Point
• CMPSD—Compare Scalar Double-Precision Floating-Point
The CMPPS instruction compares each of four single-precision floating-point values in the first operand with the corresponding single-precision floating-point value in the second operand and writes the result in the corresponding 32 bits of the destination.
PAGE 205
PAGE 206
Figure 4-37. COMISD Compare Operation (the result of the compare is written to rFLAGS)
The difference between an ordered and unordered comparison has to do with the conditions under which a floating-point invalid-operation exception (IE) occurs. In an ordered comparison (COMISS or COMISD), an IE exception occurs if either of the source operands is either type of NaN (QNaN or SNaN).
PAGE 207
• ORPD—Logical Bitwise OR Packed Double-Precision Floating-Point
The ORPS instruction performs a logical bitwise OR of four single-precision floating-point values in the first operand and the corresponding four single-precision floating-point values in the second operand and writes the result in the destination. The ORPD instruction performs an analogous operation on pairs of two double-precision floating-point values.
PAGE 208
• REP—The F2h and F3h prefixes do not function as repeat prefixes for 128-bit media instructions. Instead, they are used to form the opcodes of certain 128-bit media instructions. The prefixes are ignored by all other 128-bit media instructions.
• REX—The REX prefixes affect operands that reference a GPR or XMM register when running in 64-bit mode.
PAGE 209
See “Processor Feature Identification” in Volume 2 for a full description of the CPUID instruction and its function codes. In addition, the operating system must support the FXSAVE and FXRSTOR instructions (by having set CR4.OSFXSR = 1), and it may wish to support SIMD floating-point exceptions (by having set CR4.OSXMMEXCPT = 1). For details, see “System-Control Registers” in Volume 2.
4.10 Exceptions
Types of Exceptions.
PAGE 210
A device-not-available exception (#NM) can occur if an attempt is made to execute a 128-bit media instruction when the task switch bit (TS) of the control register (CR0) is set to 1 (CR0.TS = 1).
PAGE 211
Exception Vectors. The SIMD floating-point exception is listed above as #XF (Vector 19) but it actually causes either an #XF exception or a #UD (Vector 6) exception, if an unmasked IE, DE, ZE, OE, UE, or PE exception is reported. The choice of exception vector is determined by the operating-system XMM exception support bit (OSXMMEXCPT) in control register 4 (CR4):
• When CR4.OSXMMEXCPT = 1, a #XF exception occurs.
• When CR4.
PAGE 212
Result_infinite
A result of infinite precision, which is representable when the width of the exponent and the width of the significand are both infinite.
Result_round
A result, after rounding, whose unbiased exponent is infinitely wide and whose significand is the width specified for the destination format. (Rounding is described in “Floating-Point Rounding” on page 132.)
Result_round, denormal
A result, after rounding and denormalization.
PAGE 213
Denormalized-Operand Exception (DE). The DE exception occurs when one of the source operands of an instruction is in denormalized form, as described in “Denormalized (Tiny) Numbers” on page 128.
Zero-Divide Exception (ZE). The ZE exception occurs when an instruction attempts to divide zero into a non-zero finite dividend.
Overflow Exception (OE).
PAGE 214
4.10.3 SIMD Floating-Point Exception Priority
Figure 4-12 on page 182 shows the priority with which the processor recognizes multiple, simultaneous SIMD floating-point exceptions and operations involving QNaN operands. Each exception type is characterized by its timing, as follows:
• Pre-Computation—an exception that is recognized before an instruction begins its operation.
PAGE 215
Figure 4-38. (Flowchart, in two stages. In each stage: for each exception type and each vector element, test for pre-computation exceptions and set MXCSR exception flags, then check for any unmasked exceptions. If there are any unmasked exceptions, invoke the exception service routine. Otherwise, if there are any masked exceptions, apply the default response; if not, continue execution.)
PAGE 216
4.10.4 SIMD Floating-Point Exception Masking
The six floating-point exception flags have corresponding exception-flag masks in the MXCSR register, as shown in Table 4-13.
Table 4-13.
PAGE 217
Table 4-14.
PAGE 218
Table 4-14. Masked Responses to SIMD Floating-Point Exceptions (continued)
Exception | Operation(1) | Processor Response(2)
Invalid-operation exception (IE) | Ordered or unordered scalar compare, in which one or both operands is a NaN (COMISS, COMISD, UCOMISS, UCOMISD). | Sets the result in rFLAGS to “unordered.” Clear the overflow (OF), sign (SF), and auxiliary carry (AF) flags in rFLAGS.
PAGE 219
Table 4-14. Masked Responses to SIMD Floating-Point Exceptions (continued)
Exception | Operation(1) | Processor Response(2)
Precision exception (PE) | Inexact normalized or denormalized result | Without OE or UE exception: Return rounded result. With masked OE or UE exception: Respond as for OE or UE exception. With unmasked OE or UE exception: Respond as for OE or UE exception, and invoke SIMD exception handler.
Note: 1.
PAGE 220
4.11 Saving, Clearing, and Passing State
4.11.1 Saving and Restoring State
In general, system software should save and restore 128-bit media state between task switches or other interventions in the execution of 128-bit media procedures. Virtually all modern operating systems running on x86 processors implement preemptive multitasking that handles saving and restoring of state across task switches, independent of hardware task-switch support.
PAGE 221
128-bit media procedure accesses an MMX register by means of a data-transfer or data-conversion instruction. In such cases, software should separate such procedures or dynamic link libraries (DLLs) from x87 floating-point procedures or DLLs by clearing the MMX state with the EMMS instruction, as described in “Exit Media State” on page 209. For further details, see “Mixing Media Code with x87 Code” on page 233. 4.
PAGE 222
128-bit media instructions that simulate predicated execution or conditional moves. Figure 4-10 on page 115 shows an example of a non-branching sequence that implements a two-way multiplexer. Where possible, break long dependency chains into several shorter dependency chains that can be executed in parallel. This is especially important for floating-point instructions because of their longer latencies. 4.12.
PAGE 223
cache line must be used. For further details, see the Optimization Guide for AMD Athlon™ 64 and AMD Opteron™ Processors, order# 25112.
4.12.8 Use 128-Bit Media Code for Moving Data
Movements of data between memory, GPR, XMM, and MMX registers can take advantage of the parallel vector operations supported by the 128-bit media MOVx instructions. Figure 4-6 on page 111 illustrates the range of move operations available. 4.12.
PAGE 224
PAGE 225
5 64-Bit Media Programming
This chapter describes the 64-bit media programming model. This model includes all instructions that access the MMX™ registers, including the MMX and 3DNow!™ instructions, as well as some SSE and SSE2 instructions. The 64-bit media instructions perform integer and floating-point operations primarily on vector operands (a few of the instructions take scalar operands).
PAGE 226
The MMX and 3DNow! instructions introduce no additional registers, status bits, or other processor state to the legacy x86 architecture. Instead, they use the x87 floating-point registers that have long been a part of most x86 architectures. Because of this, 64-bit media procedures require no special operating-system support or exception handlers.
PAGE 227
Figure 5-1. Parallel Integer Operations on Elements of Vectors
5.3.2 Data Conversion and Reordering
The 64-bit media instructions support conversions of various integer data types to floating-point data types, and vice versa. There are also instructions that change the ordering of vector elements or the bit-width of vector elements.
PAGE 228
Figure 5-2. Unpack and Interleave Operation
Figure 5-3 shows a shuffle operation (PSHUFW), in which one of the operands provides vector data, and an immediate byte provides shuffle control for up to 256 permutations of the data.
Figure 5-3. Shuffle Operation (1 of 256)
5.3.
PAGE 229
instructions can also perform multiply-accumulate operations. Efficient matrix multiplication is further supported with instructions that can first transpose the elements of matrix rows and columns. These transpositions can make subsequent accesses to memory or cache more efficient when performing arithmetic matrix operations.
PAGE 230
5.3.5 Branch Removal
Branching is a time-consuming operation that, unlike most 64-bit media vector operations, does not exhibit parallel behavior (there is only one branch target, not multiple targets, per branch instruction). In many media applications, a branch involves selecting between only a few (often only two) cases.
PAGE 231
5.3.6 Floating-Point (3DNow!™) Vector Operations
Floating-point vector instructions using the MMX registers were introduced by AMD with the 3DNow! technology. These instructions take 64-bit vector operands consisting of two 32-bit single-precision floating-point numbers, shown as FP single in Figure 5-6.
Figure 5-6.
PAGE 232
5.4 Registers
5.4.1 MMX™ Registers
Eight 64-bit MMX registers, mmx0–mmx7, support the 64-bit media instructions. Figure 5-7 shows these registers. They can hold operands for both vector and scalar operations on integer (MMX) and floating-point (3DNow!) data types.
Figure 5-7. (MMX™ registers mmx0 through mmx7, bits 63 to 0)
PAGE 233
5.5 Operands
Operands for a 64-bit media instruction are either referenced by the instruction's opcode or included as an immediate value in the instruction encoding. Depending on the instruction, referenced operands can be located in registers or memory. The data types of these operands include vector and scalar integer, and vector floating-point.
5.5.1 Data Types
Figure 5-8 on page 202 shows the register images of the 64-bit media data types.
PAGE 234
PAGE 235
5.5.2 Operand Sizes and Overrides
Operand sizes for 64-bit media instructions are determined by instruction opcodes. Some of these opcodes include an operand-size override prefix, but this prefix acts in a special way to modify the opcode and is considered an integral part of the opcode. The general use of the 66h operand-size override prefix described in “Instruction Prefixes” on page 71 does not apply to 64-bit media instructions.
PAGE 236
For other 64-bit media instructions, the architecture does not impose data-alignment requirements for accessing 64-bit media data in memory. Specifically, operands in physical memory do not need to be stored at addresses that are even multiples of the operand size in bytes. However, the consequence of storing operands at unaligned locations is that accesses to those operands may require more processor and bus cycles than for aligned accesses.
PAGE 237
Table 5-2.
PAGE 238
Single-Precision Format. The single-precision floating-point format supported by 64-bit media instructions is the same format as the normalized IEEE 754 single-precision format. This format includes a sign bit, an 8-bit biased exponent, and a 23-bit significand with one hidden integer bit for a total of 24 bits in the significand. The hidden integer bit is assumed to have a value of 1, and the significand field is also the fraction.
PAGE 239
generated. If all source operands are normalized numbers, these instructions never produce infinities, NaNs, or denormalized numbers as results. This aspect of 64-bit media floating-point operations does not comply with the IEEE 754 standard. Software must use only normalized operands and ensure that computations remain within valid normalized-number ranges.
No Support for Floating-Point Exceptions.
PAGE 240
PADDB mmx1, mmx2/mem64
Figure 5-10. Mnemonic Syntax for Typical Instruction (labels: Mnemonic; First Source Operand and Destination Operand; Second Source Operand)
This example shows the PADDB mnemonic followed by two operands, a 64-bit MMX register operand and another 64-bit MMX register or 64-bit memory operand. In most instructions that take two operands, the first (left-most) operand is both a source operand and the destination operand.
PAGE 241
• U—Unsigned
• US—Unsigned saturation
• W—Word
• x—One or more variable characters in the mnemonic
For example, the mnemonic for the instruction that packs four words into eight unsigned bytes is PACKUSWB. In this mnemonic, the PACK designates 2x-to-1x conversion of vector elements, the US designates unsigned results with saturation, and the WB designates vector elements of the source as words and those of the result as bytes. 5.6.
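The PACKUSWB naming can be made concrete with a small Python model. This is only a sketch: it assumes the usual definition, in which four signed words are taken from each 64-bit operand and the first operand supplies the low four result bytes. The helper names `saturate_u8` and `packuswb` are invented for the example.

```python
def saturate_u8(word: int) -> int:
    """Interpret a 16-bit pattern as signed, then clamp it to 0..255."""
    signed = word - 0x10000 if word & 0x8000 else word
    return max(0, min(255, signed))

def packuswb(op1: int, op2: int) -> int:
    """Model PACKUSWB: 2x-to-1x conversion of words to unsigned saturated bytes."""
    result = 0
    for lane in range(4):
        w1 = (op1 >> (16 * lane)) & 0xFFFF
        w2 = (op2 >> (16 * lane)) & 0xFFFF
        result |= saturate_u8(w1) << (8 * lane)        # low four bytes from op1
        result |= saturate_u8(w2) << (8 * (lane + 4))  # high four bytes from op2
    return result

# Words in op1, low to high: 0001h, 00FFh, 0100h (too large), FFFFh (negative).
packed = packuswb(0xFFFF010000FF0001, 0)
```

Values above 255 clamp to FFh and negative values clamp to 00h, which is what the US (unsigned saturation) part of the mnemonic signals.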
PAGE 242
AMD64 Technology 24592—Rev. 3.15—November 2009 The MOVQ instruction copies a 64-bit value from an MMX register or 64-bit memory location to another MMX register, or from an MMX register to another MMX register or 64-bit memory location. The MOVDQ2Q instruction copies the low-order 64-bit value in an XMM register to an MMX register. The MOVQ2DQ instruction copies a 64-bit value from an MMX register to the low-order 64 bits of an XMM register, with zero-extension to 128 bits.
PAGE 243
Figure 5-11. MASKMOVQ Move Mask Operation (operand 1 supplies the data, operand 2 selects which elements are stored, and rDI holds the store address in memory)
The MOVNTQ and MASKMOVQ instructions use weakly-ordered, write-combining buffering of write data and they minimize cache pollution. The exact method by which cache pollution is minimized depends on the hardware implementation of the instruction.
PAGE 244
AMD64 Technology 24592—Rev. 3.15—November 2009 instructions that take 128-bit source operands, see “Data Conversion” on page 139 and “Data Conversion” on page 162. Convert Integer to Floating-Point. These instructions convert integer data types into floating-point data types.
PAGE 245
Figure 5-12. PACKSSDW Pack Operation (two 64-bit source operands packed into one 64-bit result)
Conversion from higher-to-lower precision may be needed, for example, after an arithmetic operation which requires the higher-precision format to prevent possible overflow, but which requires the lower-precision format for a subsequent operation. Unpack and Interleave.
PAGE 246
Figure 5-13. PUNPCKLWD Unpack and Interleave Operation (two 64-bit source operands interleaved into one 64-bit result)
If one of the two source operands is a vector consisting of all zero-valued elements, the unpack instructions perform the function of expanding vector elements of 1x size into vector elements of 2x size (for example, word-size to doubleword-size).
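The zero-operand trick can be sketched in Python. The model below assumes the usual PUNPCKLWD definition, in which the low two words of each operand are interleaved with the first operand's word in the lower half of each doubleword. The function name and the integer stand-ins for MMX registers are assumptions for illustration.

```python
def punpcklwd(op1: int, op2: int) -> int:
    """Model PUNPCKLWD: interleave the low two words of op1 and op2."""
    result = 0
    for lane in range(2):
        w1 = (op1 >> (16 * lane)) & 0xFFFF
        w2 = (op2 >> (16 * lane)) & 0xFFFF
        result |= w1 << (32 * lane)        # op1 word lands in the low half
        result |= w2 << (32 * lane + 16)   # op2 word lands in the high half
    return result

# With an all-zero second operand, each low word of op1 is
# zero-extended to a doubleword (1x size to 2x size).
widened = punpcklwd(0xAAAABBBB12345678, 0)

# With two non-zero operands the words are simply interleaved.
mixed = punpcklwd(0x0000000000020001, 0x0000000000040003)
```

Only the low half of each source takes part. The high words (AAAAh and BBBBh above) would be handled by the corresponding high-order unpack instruction.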
PAGE 247
24592—Rev. 3.15—November 2009 AMD64 Technology The PSHUFW instruction moves any one of the four words in its second operand (an MMX register or 64-bit memory location) to specified word locations in its first operand (another MMX register). The ordering of the shuffle can occur in any of 256 possible ways, as specified by the immediate-byte operand. Figure 5-14 shows one of the 256 possible shuffle operations. PSHUFW is useful, for example, in color imaging when computing alpha saturation of RGB values.
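A minimal Python sketch of the shuffle selection follows. It assumes the usual PSHUFW encoding, in which each 2-bit field of the immediate byte, starting from the least-significant bits, names the source word copied into the corresponding destination word. The function name `pshufw` is invented for the example.

```python
def pshufw(src: int, imm8: int) -> int:
    """Model PSHUFW: build four result words, each picked by a 2-bit selector."""
    result = 0
    for lane in range(4):
        selector = (imm8 >> (2 * lane)) & 0b11
        word = (src >> (16 * selector)) & 0xFFFF
        result |= word << (16 * lane)
    return result

src = 0xDDDDCCCCBBBBAAAA            # words 3..0 = DDDD, CCCC, BBBB, AAAA

identity  = pshufw(src, 0b11100100)  # selectors 3,2,1,0: words stay in place
reversed_ = pshufw(src, 0b00011011)  # selectors 0,1,2,3: word order reversed
broadcast = pshufw(src, 0b00000000)  # every result word copies source word 0
```

Four independent 2-bit selectors give 4 x 4 x 4 x 4 = 256 combinations, which matches the 256 possible shuffles mentioned above.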
PAGE 248
5.6.6 Arithmetic
The integer vector-arithmetic instructions perform an arithmetic operation on the elements of two source vectors. Arithmetic instructions that are not specifically named as unsigned perform signed two’s-complement arithmetic.
PAGE 249
24592—Rev. 3.15—November 2009 AMD64 Technology The subtraction instructions perform operations analogous to the addition instructions. The PSUBB, PSUBW, PSUBD, and PSUBQ instructions subtract each 8-bit (PSUBB), 16-bit (PSUBW), 32-bit (PSUBD), or 64-bit (PSUBQ) integer element in the second operand from the corresponding, same-sized integer element in the first operand. The instructions then write the integer result of each subtraction to the corresponding, same-sized element of the destination.
PAGE 250
Multiply-Add
• PMADDWD—Packed Multiply Words and Add Doublewords
The PMADDWD instruction multiplies each 16-bit signed value in the first operand by the corresponding 16-bit signed value in the second operand. The instruction then adds the adjacent 32-bit intermediate results of each multiplication, and writes the 32-bit result of each addition into the corresponding doubleword of the destination.
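The multiply-then-add-adjacent behavior can be sketched in Python as follows. Integers stand in for the 64-bit operands, and the helper names are assumptions made for this illustration.

```python
def to_signed16(word: int) -> int:
    """Interpret a 16-bit pattern as a two's-complement value."""
    return word - 0x10000 if word & 0x8000 else word

def pmaddwd(op1: int, op2: int) -> int:
    """Model PMADDWD on 64-bit operands: two sums of two 16x16 products."""
    result = 0
    for pair in range(2):
        total = 0
        for k in range(2):
            shift = 16 * (2 * pair + k)
            a = to_signed16((op1 >> shift) & 0xFFFF)
            b = to_signed16((op2 >> shift) & 0xFFFF)
            total += a * b                      # 32-bit intermediate product
        result |= (total & 0xFFFFFFFF) << (32 * pair)
    return result

# op1 words (low to high): 1, 2, 3, -4; op2 words: 5, 6, 7, 8.
# Low doubleword  = 1*5 + 2*6  = 17
# High doubleword = 3*7 + -4*8 = -11
dot = pmaddwd(0xFFFC000300020001, 0x0008000700060005)
```

Because each doubleword is a two-term dot product, the instruction is a natural building block for filters and matrix products on 16-bit data.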
PAGE 251
For floating-point multiplication operations, see the PFMUL instruction on page 225. For floating-point accumulation operations, see the PFACC, PFNACC, and PFPNACC instructions on page 226.
PAGE 252
AMD64 Technology 24592—Rev. 3.15—November 2009 second operands are either an MMX register and another MMX register or 64-bit memory location, or an MMX register and an immediate-byte value. The low-order bits that are emptied by the shift operation are cleared to 0. In integer arithmetic, left logical shifts effectively multiply unsigned operands by positive powers of 2.
PAGE 253
The PCMPEQx and PCMPGTx instructions compare corresponding bytes, words, or doublewords in the first and second operands. The instructions then write a mask of all 1s or 0s for each compare into the corresponding, same-sized element of the destination. For the PCMPEQx instructions, if the compared values are equal, the result mask is all 1s. If the values are not equal, the result mask is all 0s.
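The mask-producing compare can be modeled in a few lines of Python. The sketch below covers only the byte form (PCMPEQB). The function name and the use of integers for the 64-bit operands are assumptions for illustration.

```python
def pcmpeqb(op1: int, op2: int) -> int:
    """Model PCMPEQB: each equal byte pair yields FFh, each unequal pair 00h."""
    result = 0
    for lane in range(8):
        shift = 8 * lane
        a = (op1 >> shift) & 0xFF
        b = (op2 >> shift) & 0xFF
        if a == b:
            result |= 0xFF << shift
    return result

a = 0x1122334455667788
b = 0x1100330055007700
mask = pcmpeqb(a, b)

# A mask like this can drive a branch-free select: keep bytes of `a`
# where the compare matched and a fallback value everywhere else.
fallback = 0xEEEEEEEEEEEEEEEE
selected = (a & mask) | (fallback & ~mask & 0xFFFFFFFFFFFFFFFF)
```

The select at the end combines the mask with bitwise AND and OR, which is one way such compare results are put to use instead of conditional branches.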
PAGE 254
5.6.9 Logical
The vector-logic instructions perform Boolean logic operations, including AND, OR, and exclusive OR.
And
• PAND—Packed Logical Bitwise AND
• PANDN—Packed Logical Bitwise AND NOT
The PAND instruction performs a bitwise logical AND of the values in the first and second operands and writes the result to the destination.
PAGE 255
5.6.10 Save and Restore State
These instructions save and restore the processor state for 64-bit media instructions.
Save and Restore 64-Bit Media and x87 State
• FSAVE—Save x87 and MMX State
• FNSAVE—Save No-Wait x87 and MMX State
• FRSTOR—Restore x87 and MMX State
These instructions save and restore the entire processor state for x87 floating-point instructions and 64-bit media instructions.
PAGE 256
For a summary of the 128-bit media floating-point instructions, see “Instruction Summary—Floating-Point Instructions” on page 156. For a summary of the x87 floating-point instructions, see “Instruction Summary” on page 261. The instructions are organized here by functional group—such as data-transfer, vector arithmetic, and so on.
PAGE 257
The 3DNow! PF2IW instruction converts two single-precision floating-point values in the second operand (an MMX register or a 64-bit memory location) to two 16-bit signed integer values, sign-extended to 32 bits, and writes the converted values into the first operand (an MMX register).
PAGE 258
AMD64 Technology 24592—Rev. 3.15—November 2009 Division For a description of floating-point division techniques, see “Reciprocal Estimation” on page 227. Division is equivalent to multiplication of the dividend by the reciprocal of the divisor.
PAGE 259
24592—Rev. 3.15—November 2009 AMD64 Technology vectors (one element is the real part, the other element is the imaginary part), there is a need to swap the elements of one source operand to perform the multiplication, and there is a need for mixed positive-negative accumulation to complete the parallel computation of real and imaginary results.
PAGE 260
The PFRSQRT instruction can be used together with the PFRSQIT1 instruction and the PFRCPIT2 instruction (described in “Reciprocal Estimation” on page 227) to increase the accuracy of a single-precision significand.
5.7.4 Compare
The floating-point vector-compare instructions compare two operands, and they either write a mask or they write the maximum or minimum value.
PAGE 261
5.9.1 Supported Prefixes
The following prefixes can be used with 64-bit media instructions:
• Address-Size Override—The 67h prefix affects only operands in memory. The prefix is ignored by all other 64-bit media instructions.
• Operand-Size Override—The 66h prefix is used to form the opcodes of certain 64-bit media instructions. The prefix is ignored by all other 64-bit media instructions.
PAGE 262
• MMX extensions, indicated by bit 22 of CPUID function 8000_0001h.
• 3DNow! extensions, indicated by bit 30 of CPUID function 8000_0001h.
• SSE instructions, indicated by bit 25 of CPUID function 0000_0001h.
• SSE2 instruction extensions, indicated by bit 26 of CPUID function 0000_0001h.
• SSE3 instruction extensions, indicated by bit 0 of CPUID function 0000_0001h.
• SSE4A instruction extensions, indicated by bit 6 of CPUID function 8000_0001h.
PAGE 263
• #UD—Invalid-Opcode Exception (Vector 6)
• #DF—Double-Fault Exception (Vector 8)
• #SS—Stack Exception (Vector 12)
• #GP—General-Protection Exception (Vector 13)
• #PF—Page-Fault Exception (Vector 14)
• #MF—x87 Floating-Point Exception-Pending (Vector 16)
• #AC—Alignment-Check Exception (Vector 17)
• #MC—Machine-Check Exception (Vector 18)
• #XF—SIMD Floating-Point Exception (Vector 19)—Only by the CVTPS2PI, CVTTPS2PI, CVTPD2PI, and CVTTPD2PI instructions.
PAGE 264
processor asserts the FERR# output signal. For details about the x87 floating-point exceptions and the FERR# output signal, see “x87 Floating-Point Exception Causes” on page 279.
5.12 Actions Taken on Executing 64-Bit Media Instructions
The MMX registers are mapped onto the low 64 bits of the 80-bit x87 floating-point physical registers, FPR0–FPR7, described in “Registers” on page 238. The MMX instructions do not use the x87 stack-addressing mechanism.
PAGE 265
Table 5-6. Mapping Between Internal and Software-Visible Tag Bits
Architectural State                                      Internal State (1)
State                                    Binary Value
Valid                                    00              Full (0)
Zero                                     01              Full (0)
Special (NaN, infinity, denormal) (2)    10              Full (0)
Empty                                    11              Empty (1)
Note:
1. For a more detailed description of this mapping, see “Deriving FSAVE Tag Field from FXSAVE Tag Field” in Volume 2.
2. The 64-bit media floating point (3DNow!™) instructions do not support NaNs, infinities, and denormals.
PAGE 266
AMD64 Technology 24592—Rev. 3.15—November 2009 The 64-bit media instructions and x87 floating-point instructions interpret the contents of their aliased MMX and x87 registers differently. Because of this, software should not exchange register data between 64-bit media and x87 floating-point procedures, or use conditional branches at the end of loops that might jump to code of the other type. Software must not rely on the contents of the aliased MMX and x87 registers across such code-type transitions.
PAGE 267
24592—Rev. 3.15—November 2009 AMD64 Technology Unlike FSAVE and FNSAVE, however, FXSAVE does not alter the tag bits (thus, it does not perform the state-clearing function of EMMS or FEMMS). The state of the saved MMX and x87 registers is retained, thus indicating that the registers may still be valid (or whatever other value the tag bits indicated prior to the save). To invalidate the contents of the MMX and x87 registers after FXSAVE, software must explicitly execute an FINIT instruction.
PAGE 268
5.15.3 Remove Branches
Branches can be replaced with 64-bit media instructions that simulate predicated execution or conditional moves, as described in “Branch Removal” on page 198.
Where possible, break long dependency chains into several shorter dependency chains which can be executed in parallel. This is especially important for floating-point instructions because of their longer latencies. 5.15.
PAGE 269
6 x87 Floating-Point Programming
This chapter describes the x87 floating-point programming model. This model supports all aspects of the legacy x87 floating-point model and complies with the IEEE 754 and 854 standards for binary floating-point arithmetic. In hardware implementations of the AMD64 architecture, support for specific features of the x87 programming model is indicated by the CPUID feature bits, as described in “Feature Detection” on page 278.
PAGE 270
AMD64 Technology 24592—Rev. 3.15—November 2009 number into double-extended-precision format. The processor can convert numbers back to specific formats, or leave them in double-extended-precision format when writing them to memory. Most x87 operations for addition, subtraction, multiplication, and division specify two source operands, the first of which is replaced by the result. Instructions for subtraction and division have reverse forms which swap the ordering of operands. 6.1.
PAGE 271
[Diagram: x87 data registers fpr0–fpr7 (bits 79–0); Instruction Pointer (rIP) and Data Pointer (rDP) (bits 63–0); Opcode (bits 10–0); Control Word, Status Word, and Tag Word (bits 15–0)]
Figure 6-1.
PAGE 272
Figure 6-2. x87 Physical and Stack Registers (shown with the TOP field, bits 13–11 of the x87 status word, pointing to fpr2, so that ST(0)–ST(5) map to fpr2–fpr7 and ST(6)–ST(7) map to fpr0–fpr1)
Stack Organization. The bank of eight physical data registers, FPR0–FPR7, is organized internally as a stack, ST(0)–ST(7). The stack functions like a circular modulo-8 buffer. The stack top can be set by software to start at any register position in the bank.
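The circular modulo-8 organization can be expressed as a one-line mapping. The Python sketch below is illustrative only and its function names are assumptions. The rule that a push decrements TOP and a pop increments it is the standard x87 behavior rather than something stated in this excerpt.

```python
def physical_index(top: int, i: int) -> int:
    """Return n such that stack register ST(i) is physical register FPRn."""
    return (top + i) % 8

def top_after_push(top: int) -> int:
    """A push (for example, a load) decrements TOP modulo 8."""
    return (top - 1) % 8

def top_after_pop(top: int) -> int:
    """A pop increments TOP modulo 8."""
    return (top + 1) % 8

# With TOP = 2, as in Figure 6-2, list the physical register behind each ST(i).
mapping = [physical_index(2, i) for i in range(8)]
```

With TOP = 2 the mapping wraps from fpr7 back to fpr0 for ST(6), matching the arrangement shown in Figure 6-2.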
PAGE 273
24592—Rev. 3.15—November 2009 AMD64 Technology not empty (as indicated by the register’s tag bits). To prevent overflow, the FXCH (floating-point exchange) instruction can be used to access stack registers, giving the appearance of a flat register file, but all x87 programs must be aware of the register file’s stack organization.
PAGE 274
Bits     Mnemonic
15       B
14       C3
13–11    TOP
10       C2
9        C1
8        C0
7        ES
6        SF
5        PE
4        UE
3        OE
2        ZE
1        DE
0        IE
Figure 6-3.
PAGE 275
24592—Rev. 3.15—November 2009 AMD64 Technology Overflow Exception (OE). Bit 3. The processor sets this bit to 1 when the absolute value of a rounded result is larger than the largest representable normalized floating-point number for the destination format. (See “Normalized Numbers” on page 254.) Underflow Exception (UE). Bit 4.
PAGE 276
AMD64 Technology 24592—Rev. 3.15—November 2009 operand. For details on how each instruction sets the condition codes, see “x87 Floating-Point Instruction Reference” in Volume 5. x87 Floating-Point Unit Busy (B). Bit 15. The processor sets the value of this bit equal to the calculated value of the ES bit, bit 7. This bit can be written, but the written value is ignored. The bit is included only for backward-compatibility with the 8087 coprocessor, in which it indicates that the coprocessor is busy.
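To make the bit assignments concrete, here is a small Python sketch that splits a 16-bit status-word value into named fields. It follows the Figure 6-3 layout (B in bit 15, C3 in bit 14, TOP in bits 13–11, C2–C0 in bits 10–8, ES in bit 7, SF in bit 6, and the six exception flags in bits 5–0). The function and dictionary names are assumptions made for the example.

```python
# Single-bit fields of the x87 status word and their bit positions.
FLAG_BITS = {
    "IE": 0, "DE": 1, "ZE": 2, "OE": 3, "UE": 4, "PE": 5,
    "SF": 6, "ES": 7, "C0": 8, "C1": 9, "C2": 10, "C3": 14, "B": 15,
}

def decode_fsw(fsw: int) -> dict:
    """Split a 16-bit x87 status word value into its named fields."""
    fields = {name: (fsw >> bit) & 1 for name, bit in FLAG_BITS.items()}
    fields["TOP"] = (fsw >> 11) & 0b111   # 3-bit top-of-stack pointer
    return fields

# Example value: B = 1, TOP = 2, ES = 1, ZE = 1, everything else clear.
fields = decode_fsw(0x9084)
```

In the example value B and ES agree, which is what the description of the B bit leads one to expect when reading a status word stored by the processor.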
PAGE 277
24592—Rev. 3.15—November 2009 AMD64 Technology ZE, DE, IE), which are reported in the x87 status word as described in “x87 Status Word Register (FSW)” on page 241. A bit masks its exception type when set to 1, and unmasks it when cleared to 0. Masking a type of exception causes the processor to handle all subsequent instances of the exception type in a default way. Unmasking the exception type causes the processor to branch to the #MF exception service routine when an exception occurs.
PAGE 278
Infinity Bit (Y). Bit 12. This bit is obsolete. It can be read and written, but the value has no meaning. On pre-386 processor implementations, the bit specified the affine (Y = 1) or projective (Y = 0) infinity. The AMD64 architecture uses only the affine infinity, which specifies distinct positive and negative infinity values.
6.2.4 x87 Tag Word Register (FTW)
The x87 tag word register contains a 2-bit tag field for each x87 physical data register.
PAGE 279
setting all the registers to full, and thus they may affect execution of subsequent x87 floating-point instructions. For details, see “Mixing Media Code with x87 Code” on page 233.
6.2.5 Pointers and Opcode State
The x87 instruction pointer, instruction opcode, and data pointer are part of the x87 environment (non-data processor state) that is loaded and stored by the instructions described in “x87 Environment” on page 248.
PAGE 280
AMD64 Technology • 24592—Rev. 3.15—November 2009 Opcode Field[7:0] = Second x87-opcode byte[7:0]. For example, the x87 opcode D9 F8 (floating-point partial remainder) is stored as 001_1111_1000b. The low-order three bits of the first opcode byte, D9 (1101_1001b), are stored in bits 10–8. The second opcode byte, F8 (1111_1000b), is stored in bits 7–0. The high-order five bits of the first opcode byte (1101_1b) are not needed because they are identical for all x87 instructions. Last x87 Data Pointer.
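The D9 F8 example can be reproduced with a short Python sketch. The function name is an assumption, and the range check simply encodes the observation above that the high-order five bits of the first opcode byte are 1101_1b for every x87 instruction.

```python
def x87_opcode_field(first_byte: int, second_byte: int) -> int:
    """Build the 11-bit opcode field from the two x87 opcode bytes."""
    if (first_byte & 0xF8) != 0xD8:
        raise ValueError("first byte of an x87 opcode must be D8h-DFh")
    # Bits 10-8 hold the low three bits of the first byte,
    # bits 7-0 hold the whole second byte.
    return ((first_byte & 0x07) << 8) | (second_byte & 0xFF)

field = x87_opcode_field(0xD9, 0xF8)
as_binary = format(field, "011b")
```

For D9 F8 the field is 1F8h, whose 11-bit binary form is the 001_1111_1000b value quoted above.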
PAGE 281
24592—Rev. 3.15—November 2009 Table 6-4.
PAGE 282
AMD64 Technology • 24592—Rev. 3.15—November 2009 “Instruction Prefixes” on page 277 describes the use of address-size instruction overrides by 64-bit media instructions. Register Operands. Most x87 floating-point instructions can read source operands from and write results to x87 registers. Most instructions access the ST(0)–ST(7) register stack.
PAGE 283
24592—Rev. 3.15—November 2009 AMD64 Technology Floating-Point Data Types. The floating-point data types, shown in Figure 6-8 on page 251, include 32-bit single precision, 64-bit double precision, and 80-bit double-extended precision. The default precision is double-extended precision, and all operands loaded into registers are converted into double-extended precision format.
PAGE 284
• Double-Precision Format—This format includes a 1-bit sign, an 11-bit exponent biased by 1023, and a 52-bit significand. The integer bit is implied, making a total of 53 bits in the significand.
• Double-Extended-Precision Format—This format includes a 1-bit sign, a 15-bit exponent biased by 16,383, and a 64-bit significand, which includes one explicit integer bit.
PAGE 285
24592—Rev. 3.15—November 2009 AMD64 Technology bits in the 80-bit format are reserved (ignored on loads, zeros on stores). The high bit (bit 79) is a sign bit. 79 78 72 71 S 0 Ignore or Zero Precision — 18 Digits, 72 Bits Used, 4-Bits/Digit Description Ignored on Load, Zeros on Store Sign Bit Bits 78-72 79 Figure 6-9. x87 Packed Decimal Data Type Two x87 instructions operate on the packed-decimal data type.
PAGE 286
AMD64 Technology 24592—Rev. 3.15—November 2009 In common engineering and scientific usage, floating-point numbers—also called real numbers—are represented in base (radix) 10. A non-zero number consists of a sign, a normalized significand, and a signed exponent, as in: +2.71828 e0 Both large and small numbers are representable in this notation, subject to the limits of data-type precision. For example, a million in base-10 notation appears as +1.00000 e6 and -0.0000383 is represented as -3.83000 e-5.
PAGE 287
24592—Rev. 3.15—November 2009 AMD64 Technology Denormalization may correct the exponent by placing leading zeros in the significand. This may cause a loss of precision, because the number of significant bits in the fraction is reduced by the leading zeros. In the single-precision floating-point format, for example, normalized numbers have biased exponents ranging from 1 to 254 (the unbiased exponent range is from –126 to +127).
PAGE 288
AMD64 Technology 24592—Rev. 3.15—November 2009 exception is masked. In general, when the processor encounters a QNaN as a source operand for an instruction—in an instruction other than FxCOMx, FISTx, or FSTx—the processor does not generate an exception but generates a QNaN as the result. The processor never generates an SNaN as a result of a floating-point operation.
PAGE 289
24592—Rev. 3.15—November 2009 AMD64 Technology The single-precision and double-precision formats do not include the integer bit in the significand (the value of the integer bit can be inferred from number encodings). The double-extended-precision format explicitly includes the integer in bit 63 and places the most-significant fraction bit in bit 62. Exponents of all three types are encoded in biased format, with respective biasing constants of 127, 1023, and 16,383. Table 6-8.
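The biased-exponent encoding can be inspected from Python with the standard `struct` module. The sketch below covers only the single-precision and double-precision formats, since `struct` has no 80-bit type, and the function names are assumptions. The unbiased exponent it returns is meaningful only for normalized numbers.

```python
import struct

def decode_single(x: float):
    """Return (sign, biased exponent, unbiased exponent, fraction) for a 32-bit float."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF          # 8-bit exponent field
    fraction = bits & 0x7FFFFF            # 23 stored fraction bits
    return sign, biased, biased - 127, fraction

def decode_double(x: float):
    """Return (sign, biased exponent, unbiased exponent, fraction) for a 64-bit float."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    sign = bits >> 63
    biased = (bits >> 52) & 0x7FF         # 11-bit exponent field
    fraction = bits & ((1 << 52) - 1)     # 52 stored fraction bits
    return sign, biased, biased - 1023, fraction

one_single = decode_single(1.0)
minus_two = decode_single(-2.0)
half_double = decode_double(0.5)
```

For 1.0 the stored exponent equals the bias itself (127 or 1023) and the fraction is zero, because the integer bit of 1 is implied rather than stored in these two formats.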
PAGE 290
AMD64 Technology 24592—Rev. 3.15—November 2009 Table 6-8. Supported Floating-Point Encodings (continued) Classification Sign Biased Exponent1 Significand2 Negative Zero 1 000 ... 000 0.000 ... 000 Negative Denormal 1 0.000 ... 001 000 ... 000 to 0.111 ... 111 1 1.000 ... 001 000 ... 000 to 1.111 ... 111 1 000 ... 001 1.000 ... 000 to to 111 ... 110 1.111 ... 111 Negative Infinity (-∞) 1 111 ... 111 1.000 ... 000 SNaN 1 1.000 ... 001 111 ... 111 to 1.011 ... 111 QNaN4 1 1.100 ...
PAGE 291
24592—Rev. 3.15—November 2009 Table 6-9. AMD64 Technology Unsupported Floating-Point Encodings Classification Sign Biased Exponent1 Significand2 Positive Pseudo-NaN 0 111 ... 111 0.111 ... 111 to 0.000 ... 001 Positive Pseudo-Infinity 0 111 ... 111 0.000 ... 000 Positive Unnormal 0 111 ... 110 to 000 ... 001 0.111 ... 111 to 0.000 ... 000 Negative Unnormal 1 000 ... 001 to 111 ... 110 0.000 ... 000 to 0.111 ... 111 Negative Pseudo-Infinity 1 111 ... 111 0.000 ...
PAGE 292
6.3.5 Precision
The Precision control (PC) field comprises bits 9–8 of the x87 control word (“x87 Control Word Register (FCW)” on page 244). This field specifies the precision of floating-point calculations for the FADDx, FSUBx, FMULx, FDIVx, and FSQRT instructions, as shown in Table 6-11. Table 6-11.
PAGE 293
24592—Rev. 3.15—November 2009 AMD64 Technology round up (toward +∞), round down (toward –∞), and round toward zero. Round up and round down are used in interval arithmetic, in which upper and lower bounds bracket the true result of a computation. Round toward zero takes the smaller in magnitude, that is, always truncates. The processor produces a floating-point result defined by the IEEE standard to be infinitely precise.
PAGE 294
FADD st(0), st(i)
Figure 6-10. Mnemonic Syntax for Typical Instruction (FADD is the mnemonic; st(0) is the first source operand and the destination operand; st(i) is the second source operand)
This example shows the FADD mnemonic followed by two operands, both of which are 80-bit stack-register operands. Most instructions take source operands from an x87 stack register and/or memory and write their results to a stack register or memory.
PAGE 295
• P—Pop
• PP—Pop Twice
• R—Reverse
• ST—Store
• U—Unordered
• x—One or more variable characters in the mnemonic
For example, the mnemonic for the store instruction that stores the top-of-stack and pops the stack is FSTP. In this mnemonic, the F means a floating-point instruction, the ST means a store, and the P means pop the stack. 6.4.
PAGE 296
AMD64 Technology 24592—Rev. 3.15—November 2009 The FILD instruction converts the 16-bit, 32-bit, or 64-bit source signed integer in memory into a double-extended-precision floating-point value and pushes the result onto the top-of-stack, ST(0). The FIST instruction converts and rounds the source value in the top-of-stack, ST(0), to a signed integer and copies it to the specified 16-bit or 32-bit memory location. The type of rounding is determined by the rounding control (RC) field of the x87 control word.
PAGE 297
24592—Rev. 3.15—November 2009 Table 6-13.
PAGE 298
• FLDLG2—Floating-Point Load Log10 2
• FLDLN2—Floating-Point Load Ln 2
The FLDL2E, FLDL2T, FLDLG2, and FLDLN2 instructions, respectively, push the floating-point constant value log2(e), log2(10), log10(2), or loge(2) onto the top-of-stack, ST(0).
6.4.4 Arithmetic
The arithmetic instructions support addition, subtraction, multiplication, division, change-sign, round, round to integer, partial remainder, and square root.
PAGE 299
• FISUBR—Floating-Point Integer Subtract Reverse
The FSUB instruction syntax has forms that include one or two explicit source operands. In the one-operand form, the instruction reads a 32-bit or 64-bit floating-point value from memory, converts it to the double-extended-precision format, subtracts it from ST(0), and writes the result to ST(0). In the two-operand form, both source operands are located in stack registers.
PAGE 300
• FDIVR—Floating-Point Divide Reverse
• FDIVRP—Floating-Point Divide Reverse and Pop
• FIDIVR—Floating-Point Integer Divide Reverse
The FDIV instruction syntax has forms that include one or two explicit source operands that may be single-precision or double-precision floating-point values or 16-bit or 32-bit integer values. In the one-operand form, the instruction reads a value from memory, divides ST(0) by the memory operand, and writes the result to ST(0).
PAGE 301
24592—Rev. 3.15—November 2009 AMD64 Technology quotient are calculated, guaranteeing that the remainder returned is less in magnitude than the divisor in ST(1). If the exponent difference is equal to or greater than 64, only a subset of the integer quotient bits, numbering between 32 and 63, are calculated and a partial remainder is returned. FPREM can be repeated on a partial remainder until reduction is complete. It can be used to bring the operands of transcendental functions into their proper range.
PAGE 302
AMD64 Technology 24592—Rev. 3.15—November 2009 condition-code bit in the x87 status word is set to 1, and the argument is returned as the result. If software detects an out-of-range argument, the FPREM or FPREM1 instruction can be used to reduce the magnitude of the argument before using the FSIN, FCOS, FSINCOS, or FPTAN instruction again.
PAGE 303
• FCOMP—Floating-Point Compare and Pop
• FCOMPP—Floating-Point Compare and Pop Twice
• FCOMI—Floating-Point Compare and Set Flags
• FCOMIP—Floating-Point Compare and Set Flags and Pop
The FCOM instruction syntax has forms that include zero or one explicit source operands. In the zero-operand form, the instruction compares ST(1) with ST(0) and writes the x87 status-word condition codes accordingly.
PAGE 304
AMD64 Technology • 24592—Rev. 3.15—November 2009 FUCOMIP—Floating-Point Unordered Compare and Set Flags and Pop The FUCOMx instructions perform the same operations as the FCOMx instructions, except that the FUCOMx instructions generate an invalid-operation exception (IE) only if any operand is an unsupported data type or a signaling NaN (SNaN), whereas the ordered-compare FCOMx instructions generate an invalid-operation exception if any operand is an unsupported data type or any type of NaN.
PAGE 305
Table 6-15. Condition-Code Settings for FXAM (continued)
C3   C2   C0   C1 (1)   Meaning
1    0    1    0        +empty
1    0    1    1        -empty
1    1    0    0        +denormal
1    1    0    1        -denormal
Note:
1. C1 is the sign of ST(0).
6.4.7 Stack Management
The stack management instructions move the x87 top-of-stack pointer (TOP) and clear the contents of stack registers.
PAGE 306
AMD64 Technology 24592—Rev. 3.15—November 2009 The FINIT and FNINIT instructions set all bits in the x87 control-word, status-word, and tag word registers to their default values. Assemblers issue FINIT as an FWAIT instruction followed by an FNINIT instruction. Thus, FINIT (but not FNINIT) reports pending unmasked x87 floating-point exceptions before performing the initialization.
PAGE 307
Assemblers issue FSTCW as an FWAIT instruction followed by an FNSTCW instruction. Thus, FSTCW (but not FNSTCW) reports pending unmasked x87 floating-point exceptions before storing the control word. The FSTCW instruction should be used when pending x87 floating-point exceptions are being reported (unmasked). The no-wait instruction, FNSTCW, should be used when pending x87 floating-point exceptions are not being reported (masked).
PAGE 308
Save and Restore x87 and 64-Bit Media State
• FSAVE—Save x87 and MMX State.
• FNSAVE—Save No-Wait x87 and MMX State.
• FRSTOR—Restore x87 and MMX State.
These instructions save and restore the entire processor state for x87 floating-point instructions and 64-bit media instructions. The instructions save and restore either 94 or 108 bytes of data, depending on the effective operand size.
PAGE 309
Table 6-16. Instruction Effects on rFLAGS
                                    rFLAGS Mnemonic and Bit Number
Instruction Mnemonic                OF 11   SF 7   ZF 6   AF 4   PF 2   CF 0
FCMOVcc                                            Tst           Tst    Tst
FCOMI, FCOMIP, FUCOMI, FUCOMIP                     Mod           Mod    Mod
6.6 Instruction Prefixes
Instruction prefixes, in general, are described in “Instruction Prefixes” on page 71. The following restrictions apply to the use of instruction prefixes with x87 instructions. Supported Prefixes.
PAGE 310
6.7 Feature Detection
Before executing x87 floating-point instructions, software should determine if the processor supports the technology by executing the CPUID instruction. “Feature Detection” on page 74 describes how software uses the CPUID instruction to detect feature support.
PAGE 311
PAGE 312
AMD64 Technology 24592—Rev. 3.15—November 2009 determines that an unmasked exception is pending—by checking the exception status (ES) flag in the x87 status word—and invokes the #MF exception service routine. #MF Exception Types and Flags. The #MF exceptions are of six types, five of which are mandated by the IEEE 754 standard. These six types and their bit-flags in the x87 status word are shown in Table 6-17. A stack fault (SF) exception is always accompanied by an invalid-operation exception (IE).
PAGE 313
24592—Rev. 3.15—November 2009 Table 6-18. AMD64 Technology Invalid-Operation Exception (IE) Causes Operation Condition • A source operand is an SNaN, or Any Arithmetic Operation • A source operand is an unsupported data type (pseudoNaN, pseudo-infinity, or unnormal). Arithmetic (IE exception) FADD, FADDP Source operands are infinities with opposite signs. FSUB, FSUBP, FSUBR, FSUBRP Source operands are infinities with same sign. FMUL, FMULP Source operands are zero and infinity.
PAGE 314
AMD64 Technology 24592—Rev. 3.15—November 2009 Overflow Exception (OE). The OE exception occurs when the value of a rounded floating-point result is larger than the largest representable normalized positive or negative floating-point number in the destination format, as shown in Table 6-5 on page 252. An overflow can occur through computation or through conversion of higher-precision numbers to lower-precision numbers. See “Precision” on page 260.
PAGE 315
24592—Rev. 3.15—November 2009 Table 6-19.
PAGE 316
Table 6-20. x87 Floating-Point (#MF) Exception Masks
Exception Mask and Mnemonic                   x87 Control-Word Bit (1)
Invalid-operation exception mask (IM)         0
Denormalized-operand exception mask (DM)      1
Zero-divide exception mask (ZM)               2
Overflow exception mask (OM)                  3
Underflow exception mask (UM)                 4
Precision exception mask (PM)                 5
Note:
1. See “x87 Status Word Register (FSW)” on page 241 for a summary of each exception.
PAGE 317
24592—Rev. 3.15—November 2009 Table 6-21. AMD64 Technology Masked Responses to x87 Floating-Point Exceptions Exception and Mnemonic Type of Operation1 Any Arithmetic Operation: Source operand is an SNaN. Invalid-operation exception (IE)2 Processor Response Set IE flag, and return a QNaN value.
PAGE 318
AMD64 Technology Table 6-21. 24592—Rev. 3.15—November 2009 Masked Responses to x87 Floating-Point Exceptions (continued) Type of Operation1 Exception and Mnemonic Processor Response FCOS, FPTAN, FSIN, FSINCOS: Source operand is ∞ or FPREM, FPREM1: Dividend is infinity or divisor is 0. Set IE flag, return the floating-point indefinite value3, and clear condition code C2 to 0. FCOM, FCOMP, or FCOMPP: One or both operands is a NaN or Set IE flag, and set C3–C0 condition codes to reflect the result.
PAGE 319
Table 6-21. Masked Responses to x87 Floating-Point Exceptions (continued)

Type of Operation (note 1): Round to nearest.
Processor Response:
• If sign of result is positive, set OE flag, and return +∞.
• If sign of result is negative, set OE flag, and return -∞.

Type of Operation (note 1): Round toward +∞.
Processor Response:
• If sign of result is positive, set OE flag, and return +∞.
• If sign of result is negative, set OE flag, and return finite negative number with largest magnitude.
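The two rounding modes shown in this part of the table can be summarized as a small decision function. This is only a model of the returned value under stated assumptions: it covers just the two modes visible in the excerpt, it assumes a double-precision destination (so DBL_MAX stands for the finite number with largest magnitude), and it does not represent the OE flag. The type and function names are hypothetical.

```c
#include <float.h>
#include <math.h>

/* Only the two rounding modes shown in the excerpt are modeled. */
typedef enum { RC_NEAREST, RC_UP } rounding_mode_t;

/* Models the value returned for a masked overflow, given the rounding
   mode and the sign of the result (negative != 0 means negative).
   Assumes a double-precision destination format. */
double masked_overflow_result(rounding_mode_t rc, int negative)
{
    if (rc == RC_NEAREST) {
        /* Round to nearest: infinity with the sign of the result. */
        return negative ? -INFINITY : INFINITY;
    }
    /* Round toward +infinity: a positive result becomes +infinity, a
       negative result becomes the finite negative number with the
       largest magnitude. */
    return negative ? -DBL_MAX : INFINITY;
}
```

The asymmetry in the second mode follows from the rounding direction: rounding toward +∞ can never move a negative result down to -∞, so the result stays finite.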
PAGE 320
Unmasked Responses. The processor handles unmasked exceptions as shown in Table 6-22 on page 288.

Table 6-22. Unmasked Responses to x87 Floating-Point Exceptions

Exception and Mnemonic: Invalid-operation exception (IE); Invalid-operation exception (IE) with stack fault (SF)
Processor Response (note 1): Set IE and ES flags, and call the #MF service routine (note 2). The destination and the TOP are not changed.
PAGE 321
Table 6-22. Unmasked Responses to x87 Floating-Point Exceptions (continued)

Processor Response (note 1):
• If the destination is memory, set UE and ES flags, and call the #MF service routine (note 2). The destination and the TOP are not changed.
PAGE 322
• used externally. It is recommended that system software set NE to 1. This enables optimal performance in handling x87 floating-point exceptions. If CR0.
PAGE 323
FXSAVE and FXRSTOR Instructions. Application software can save and restore the 128-bit media state, 64-bit media state, and x87 floating-point state by executing the FXSAVE and FXRSTOR instructions.
PAGE 324
based branches that depend on the condition codes for branch direction, because FNSTSW AX is often a serializing instruction.

6.10.3 Use FSINCOS Instead of FSIN and FCOS

Frequently, a piece of code that needs to compute the sine of an argument also needs to compute the cosine of that same argument.
PAGE 325
Index

Symbols
#AC exception ... 88
#BP exception ... 87
#BR exception ... 87
#DB exception ... 87
#DE exception ... 87
#DF exception .........................................................
PAGE 326
branches ... 76, 84, 101, 189
BSF instruction ... 54
BSR instruction ... 54
BSWAP instruction ... 48
BT instruction ... 54
BTC instruction ... 54
BTR instruction .................................
PAGE 327
64-bit media ... 201
general-purpose ... 36
mismatched ... 188
x87 ... 250, 256
DAZ bit ... 119
DE bit ... 118, 119, 181, 242, 281
DEC instruction ...........................
PAGE 328
FBSTP instruction ... 264
FCMOVcc instructions ... 264
FCOM instruction ... 271
FCOMI instruction ... 271
FCOMIP instruction ... 271
FCOMP instruction ... 271
FCOMPP instruction ..............................................
PAGE 329
memory-mapped ... 91
ports ... 64, 90, 123, 203
privilege level ... 92
IDIV instruction ... 50
IE bit ... 118, 180, 242, 280
IEEE 754 Standard ... 119, 127, 238, 251
IEEE-754 standard ......................................
PAGE 330
logarithms ... 265
logical instructions ... 56, 155, 174, 222
logical shift ... 53
long mode ... xxii, 6
LOOPcc instructions ... 60
LSB ... xxii
lsb .....................................
PAGE 331
multiplication ... 50
multiply-add ... 112, 197
MXCSR register ... 117

N
NaN .......

overflow ... xxiii
overflow exception (OE) ... 181, 282
overflow flag ... 36

P
PAGE 332
PFMAX instruction ... 228
PFMIN instruction ... 228
PFMUL instruction ... 225
PFNACC instruction ... 226
PFPNACC instruction ... 226
PFRCP instruction ... 227
PFRCPIT1 instruction ............................................
PAGE 333
PUNPCKLDQ instruction ... 141, 213
PUNPCKLQDQ instruction ... 141
PUNPCKLWD instruction ... 141, 213
PUSH instruction ... 44, 69
PUSHA instruction ... 44, 69
PUSHAD instruction ... 44, 69
PUSHF instruction ...................................................
PAGE 334
S
SAHF instruction ... 63
SAL instruction ... 52
SAR instruction ... 52
saturation
  128-bit media ... 125
  64-bit media ... 204
saving state ... 156, 188, 223, 234, 290
SBB instruction .............................
PAGE 335
T
tag bits ... 232, 246
tag word ... 246
task switch ... 81
task-state segment (TSS) ... 81
temporal locality ... 98
TEST instruction .....................................................
PAGE 336