Alpha Architecture Handbook
Order Number: EC–QD2KC–TE
Revision/Update Information: This is Version 4 of the Alpha Architecture Handbook.
Compaq Computer Corporation
October 1998 The information in this publication is subject to change without notice. COMPAQ COMPUTER CORPORATION SHALL NOT BE LIABLE FOR TECHNICAL OR EDITORIAL ERRORS OR OMISSIONS CONTAINED HEREIN, NOR FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES RESULTING FROM THE FURNISHING, PERFORMANCE, OR USE OF THIS MATERIAL.
Preface Chapters 1 through 8 and appendixes A through E of this book are directly derived from the Alpha System Reference Manual, Version 7 and passed engineering change orders (ECOs) that have been applied. It is an accurate representation of the described parts of the Alpha architecture. References in this handbook to the Alpha Architecture Reference Manual are to the Third Edition of that manual, EY-W938E-DP.
Chapter 1 Introduction Alpha is a 64-bit load/store RISC architecture that is designed with particular emphasis on the three elements that most affect performance: clock speed, multiple instruction issue, and multiple processors. The Alpha architects examined and analyzed current and theoretical RISC architecture design elements and developed high-performance alternatives for the Alpha architecture.
Alpha makes it easy to maintain binary compatibility across multiple implementations and easy to maintain full speed on multiple-issue implementations. For example, there are no implementation-specific pipeline timing hazards, no load-delay slots, and no branch-delay slots. The Alpha Approach to Byte Manipulation The Alpha architecture reads and writes bytes between registers and memory with the LDBU and STB instructions. (Alpha also supports word read/writes with the LDWU and STW instructions.
PALcode is written in standard machine code with some implementation-specific extensions to provide access to low-level hardware. PALcode lets Alpha implementations run the full OpenVMS Alpha, DIGITAL UNIX, and Windows NT Alpha operating systems. PALcode can provide this functionality with little overhead.
1.3 Instruction Format Overview As shown in Figure 1–1, Alpha instructions are all 32 bits in length. There are four major instruction format classes that contain 0, 1, 2, or 3 register fields. All formats have a 6-bit opcode.
Branch Instructions Conditional branch instructions can test a register for positive/negative or for zero/nonzero, and they can test integer registers for even/odd. Unconditional branch instructions can write a return address into a register. There is also a calculated jump instruction that branches to an arbitrary 64-bit address in a register. Load/Store Instructions Load and store instructions move 8-bit, 16-bit, 32-bit, or 64-bit aligned quantities from and to memory.
Floating-Point Operate Instructions The floating-point operate instructions include four complete sets of VAX and IEEE arithmetic instructions, plus instructions for performing conversions between floating-point and integer quantities. In addition to the operations found in conventional RISC architectures, Alpha includes conditional move instructions for avoiding branches and merge sign/exponent instructions for simple field manipulation.
1.6.1 Numbering All numbers are decimal unless otherwise indicated. Where there is ambiguity, numbers other than decimal are indicated with the name of the base in subscript form, for example, 10₁₆. 1.6.2 Security Holes A security hole is an error of commission, omission, or oversight in a system that allows protection mechanisms to be bypassed.
Operations that produce UNPREDICTABLE results may also produce exceptions. • An occurrence specified as UNPREDICTABLE may happen or not based on an arbitrary choice function. The choice function is subject to the same constraints as are UNPREDICTABLE results and, in particular, must not constitute a security hole.
1.6.6 Must Be Zero (MBZ) Fields specified as Must be Zero (MBZ) must never be filled by software with a non-zero value. These fields may be used at some future time. If the processor encounters a non-zero value in a field specified as MBZ, an Illegal Operand exception occurs. 1.6.7 Read As Zero (RAZ) Fields specified as Read as Zero (RAZ) return a zero when read. 1.6.8 Should Be Zero (SBZ) Fields specified as Should be Zero (SBZ) should be filled by software with a zero value.
Chapter 2 Basic Architecture 2.1 Addressing The basic addressable unit in the Alpha architecture is the 8-bit byte. Virtual addresses are 64 bits long. An implementation may support a smaller virtual address space. The minimum virtual address size is 43 bits. Virtual addresses as seen by the program are translated into physical memory addresses by the memory management mechanism. Although the data types in Section 2.
Figure 2–2: Word Format (bits 15 through 0, at address A)
A word is specified by its address, the address of the byte containing bit 0. A word is a 16-bit value. The word is only supported in Alpha by the load, store, sign-extend, extract, mask, and insert instructions. 2.2.3 Longword A longword is 4 contiguous bytes starting on an arbitrary byte boundary. The bits are numbered from right to left, 0 through 31, as shown in Figure 2–3.
A quadword is specified by its address A, the address of the byte containing bit 0. A quadword is a 64-bit value. When interpreted arithmetically, a quadword is either a two’s-complement integer with bits of increasing significance from 0 through 62 and bit 63 as the sign bit, or an unsigned integer with bits of increasing significance from 0 through 63. Note: Alpha implementations will impose a significant performance penalty when accessing quadword operands that are not naturally aligned.
Table 2–1: F_floating Load Exponent Mapping (MAP_F)
Memory <14:7>    Register <62:52>
1 1111111        1 000 1111111
1 xxxxxxx        1 000 xxxxxxx    (xxxxxxx not all 1’s)
0 xxxxxxx        0 111 xxxxxxx    (xxxxxxx not all 0’s)
0 0000000        0 000 0000000
The F_floating store instruction reorders register bits on the way to memory and does no checking of the low-order fraction bits. Register bits <61:59> and <28:0> are ignored by the store instruction.
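The exponent expansion in Table 2–1 can be sketched in a few lines. The following is a minimal illustration, not an implementation of the load instruction; the function name map_f is hypothetical, and the sketch covers only the 8-bit to 11-bit exponent mapping, not the reordering of fraction bits.

```python
def map_f(mem_exp: int) -> int:
    """Expand an 8-bit F_floating memory exponent (bits <14:7>) into the
    11-bit register exponent (bits <62:52>), row by row from Table 2-1."""
    if not 0 <= mem_exp <= 0xFF:
        raise ValueError("memory exponent must fit in 8 bits")
    high = (mem_exp >> 7) & 1   # most significant exponent bit
    low7 = mem_exp & 0x7F       # remaining seven exponent bits
    if high == 1:
        middle = 0b000          # rows 1 and 2: three zero bits are inserted
    elif low7 != 0:
        middle = 0b111          # row 3: high bit clear, low bits not all 0's
    else:
        middle = 0b000          # row 4: an all-zero exponent stays all zero
    return (high << 10) | (middle << 7) | low7
```

For example, map_f(0x01) yields the bit pattern 0 111 0000001, matching the third row of the table.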
A G_floating operand occupies 64 bits in a floating register, arranged as shown in Figure 2–8.
Figure 2–8: G_floating Register Format (register Fx; bit 63: S; bits 62:52: Exp.; bits 51:32: Fraction Hi; bits 31:0: Fraction Lo)
A G_floating datum is specified by its address A, the address of the byte containing bit 0.
The reordering of bits required for a D_floating load or store is identical to that required for a G_floating load or store. The G_floating load and store instructions are therefore used for loading or storing D_floating data. A D_floating datum is specified by its address A, the address of the byte containing bit 0. The memory form of a D_floating datum is identical to an F_floating datum except for 32 additional low significance fraction bits.
NaNs. Signaling NaNs are used to provide values for uninitialized variables and for arithmetic enhancements. Quiet NaNs provide retrospective diagnostic information regarding previous invalid or unavailable data and results. Signaling NaNs signal an invalid operation when they are an operand to an arithmetic instruction, and may generate an arithmetic exception. Quiet NaNs propagate through almost every operation without generating an arithmetic exception.
This mapping preserves both normal values and exceptional values. Note that the mapping for all 1’s differs from that of F_floating load, since for S_floating all 1’s is an exceptional value and for F_floating all 1’s is a normal value. The S_floating store instruction reorders register bits on the way to memory and does no checking of the low-order fraction bits. Register bits <61:59> and <28:0> are ignored by the store instruction. The S_floating load instruction does no checking of the input.
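The contrast with the F_floating mapping can be made concrete with a short sketch. The rows encoded below are assumed to match Table 2–2 (MAP_S), which is not reproduced in this excerpt; the text above confirms only that an all-1's exponent is treated as exceptional and therefore expands to all 1's. The function name map_s is hypothetical.

```python
def map_s(mem_exp: int) -> int:
    """Expand an 8-bit S_floating memory exponent into the 11-bit register
    exponent. Assumed rows (from Table 2-2):
      1 1111111 -> 1 111 1111111   (exceptional value is preserved)
      1 xxxxxxx -> 1 000 xxxxxxx   (xxxxxxx not all 1's)
      0 xxxxxxx -> 0 111 xxxxxxx   (xxxxxxx not all 0's)
      0 0000000 -> 0 000 0000000
    """
    if not 0 <= mem_exp <= 0xFF:
        raise ValueError("memory exponent must fit in 8 bits")
    high = (mem_exp >> 7) & 1
    low7 = mem_exp & 0x7F
    if high == 1:
        # Only the all-1's exponent gets 1's inserted; this is where the
        # mapping differs from the F_floating one.
        middle = 0b111 if low7 == 0x7F else 0b000
    else:
        middle = 0b111 if low7 != 0 else 0b000
    return (high << 10) | (middle << 7) | low7
```

Under these assumptions an all-1's memory exponent becomes an all-1's register exponent, so exceptional values stay exceptional after the load.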
A T_floating operand occupies 64 bits in a floating register, arranged as shown in Figure 2–14.
Figure 2–14: T_floating Register Format (register Fx; bit 63: S; bits 62:52: Exp.; bits 51:32: Fraction Hi; bits 31:0: Fraction Lo)
The T_floating load instruction performs no bit reordering on input, nor does it perform checking of the input data. The T_floating store instruction performs no bit reordering on output. This instruction does no checking of the data; the preceding operation should have specified a T_floating result.
Figure 2–15: X_floating Datum (at address A: Fraction_low, bits 63:0; at address A+8: S, bit 63; Exponent, bits 62:48; Fraction_high, bits 47:0)
An X_floating datum occupies two consecutive even/odd floating-point registers (such as F4/F5), as shown in Figure 2–16.
Figure 2–16: X_floating Register Format (bit 127: S; bits 126:112: Exponent; bits 111:64: Fraction_high; bits 63:0: Fraction_low; register labels: Fn OR 1, Fn)
An X_floating datum is specified by its address A, the address of the byte containing bit 0.
Figure 2–17: X_floating Big-Endian Datum Byte 0 A: S Exponent Fraction_high Byte 15 A+8: Fraction_low Figure 2–18: X_floating Big-Endian Register Format Byte Byte 0 15 S Exponent Fraction_high Fraction_low Fn OR 1 Fn 2.2.7 Longword Integer Format in Floating-Point Unit A longword integer operand occupies 32 bits in memory, arranged as shown in Figure 2–19.
Note: Alpha implementations will impose a significant performance penalty when accessing longwords that are not naturally aligned. (A naturally aligned longword datum has zero as the low-order two bits of its address.) 2.2.8 Quadword Integer Format in Floating-Point Unit A quadword integer operand occupies 64 bits in memory, arranged as shown in Figure 2–21.
• Trailing Numeric String • Leading Separate Numeric String • Packed Decimal String 2.3 Big-Endian Addressing Support Alpha implementations may include optional big-endian addressing support.
used unchanged for both conventions. Big-endian character strings have their most significant character on the left, while little-endian strings have their most significant character on the right. • The compare byte (CMPBGE) instruction is neutral about direction, doing eight byte compares in parallel. However, the code that follows the CMPBGE instruction and examines the byte mask to determine which string is larger differs, depending on whether the rightmost or leftmost unequal byte is used.
Chapter 3 Instruction Formats 3.1 Alpha Registers Each Alpha processor has a set of registers that hold the current processor state. If an Alpha system contains multiple Alpha processors, there are multiple per-processor sets of these registers. 3.1.1 Program Counter The Program Counter (PC) is a special register that addresses the instruction stream. As each instruction is decoded, the PC is advanced to the next sequential instruction. This is referred to as the updated PC.
There are some interesting cases involving R31 as a destination: • STx_C R31,disp(Rb) Although this might seem like a good way to zero out a shared location and reset the lock_flag, this instruction causes the lock_flag and virtual location {Rbv + SEXT(disp)} to become UNPREDICTABLE. • LDx_L R31,disp(Rb) This instruction produces no useful result since it causes both lock_flag and locked_physical_address to become UNPREDICTABLE.
3.1.5 Processor Cycle Counter (PCC) Register The PCC register consists of two 32-bit fields. The low-order 32 bits (PCC<31:0>) are an unsigned wrapping counter, PCC_CNT. The high-order 32 bits (PCC<63:32>), PCC_OFF, are operating system dependent in their implementation. PCC_CNT is the base clock register for measuring time intervals and is suitable for timing intervals on the order of nanoseconds. PCC_CNT increments once per N CPU cycles, where N is an implementation-specific integer in the range 1..16.
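Because PCC_CNT is a 32-bit unsigned wrapping counter, an interval is best computed modulo 2^32. The following is a minimal sketch of that arithmetic; the helper names are hypothetical, and the 64-bit PCC values are assumed to have been read already (how they are read is outside the scope of this illustration).

```python
PCC_CNT_MASK = 0xFFFF_FFFF  # PCC<31:0>

def pcc_cnt(pcc: int) -> int:
    """Extract PCC_CNT, the low-order 32 bits of a 64-bit PCC value."""
    return pcc & PCC_CNT_MASK

def elapsed_ticks(start_pcc: int, end_pcc: int) -> int:
    """Counter ticks between two PCC reads, tolerating one wraparound.
    The result is meaningful only if fewer than 2**32 ticks elapsed."""
    return (pcc_cnt(end_pcc) - pcc_cnt(start_pcc)) & PCC_CNT_MASK
```

The high-order PCC_OFF field is masked off before subtracting, so operating-system-dependent contents in PCC<63:32> do not disturb the result. Converting ticks to CPU cycles would additionally require the implementation-specific value of N.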
3.2.1 Operand Notation Tables 3–1, 3–2, and 3–3 list the notation for the operands, the operand values, and the other expression operands.
3.2.2 Instruction Operand Notation The notation used to describe instruction operands follows from the operand specifier notation used in the VAX Architecture Standard. Instruction operands are described as follows: <name>.<access type><data type> 3.2.2.1 Operand Name Notation Specifies the instruction field (Ra, Rb, Rc, or disp) and register type of the operand (integer or floating).
Table 3–5: Operand Access Type Notation (Continued)
Access Type    Meaning
r              The operand is read only.
m              The operand is both read and written.
w              The operand is write only.
3.2.2.
Table 3–7: Operators (Continued)
Operator       Meaning
||             Bit concatenation
{}             Indicates explicit operator precedence
(x)            Contents of memory location whose address is x
x<m:n>         Contents of bit field of x defined by bits n through m
x<m>           M’th bit of x
ACCESS(x,y)    Accessibility of the location whose address is x using the access mode y. Returns a Boolean value TRUE if the address is accessible, else FALSE.
Table 3–7: Operators (Continued) Operator Meaning CASE The CASE construct selects one of several actions based on the value of its argument. The form of a case is: CASE argument OF argvalue1: action_1 argvalue2: action_2 ... argvaluen:action_n [otherwise: default_action] ENDCASE If the value of argument is argvalue1 then action_1 is executed; if argument = argvalue2, then action_2 is executed, and so forth.
Table 3–7: Operators (Continued) Operator Meaning NOT Logical (ones) complement OR Logical sum PHYSICAL_ADDRESS Translation of a virtual address PRIORITY_ENCODE Returns the bit position of most significant set bit, interpreting its argument as a positive integer (=int(lg(x))).
Table 3–7: Operators (Continued) Operator Meaning TEST(x,cond) The contents of register x are tested for branch condition (cond) true. TEST returns a Boolean value TRUE if x bears the specified relation to 0, else FALSE is returned. Integer and floating test conditions are drawn from the preceding list of relational operators. XOR Logical difference ZEXT(x) X is zero-extended to the required size. 3.2.
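Several of the operators in Table 3–7 have direct equivalents in ordinary integer arithmetic. The sketch below models the bit-field selector, ZEXT, PRIORITY_ENCODE, and SEXT (which the instruction descriptions use alongside ZEXT). The function names and the explicit width parameters are assumptions made for this illustration; the handbook notation leaves the "required size" implicit.

```python
def bit_field(x: int, m: int, n: int) -> int:
    """x<m:n>: the bit field of x defined by bits n through m."""
    if n < 0 or m < n:
        raise ValueError("expected 0 <= n <= m")
    return (x >> n) & ((1 << (m - n + 1)) - 1)

def zext(x: int, from_bits: int) -> int:
    """ZEXT: keep the low from_bits bits; all higher bits become zero."""
    return x & ((1 << from_bits) - 1)

def sext(x: int, from_bits: int, to_bits: int = 64) -> int:
    """SEXT: copy bit from_bits-1 of x into every higher bit up to to_bits."""
    low_mask = (1 << from_bits) - 1
    x &= low_mask
    if x >> (from_bits - 1):                      # sign bit is set
        x |= ((1 << to_bits) - 1) & ~low_mask     # fill the upper bits with 1's
    return x

def priority_encode(x: int) -> int:
    """PRIORITY_ENCODE: position of the most significant set bit, int(lg(x))."""
    if x <= 0:
        raise ValueError("argument is interpreted as a positive integer")
    return x.bit_length() - 1
```

For instance, bit_field(0xABCD, 15, 8) selects the upper byte 0xAB, and sext(0x8000, 16) produces 0xFFFFFFFFFFFF8000.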
3.3.1 Memory Instruction Format The Memory format is used to transfer data between registers and memory, to load an effective address, and for subroutine jumps. It has the format shown in Figure 3–1. Figure 3–1: Memory Instruction Format 31 26 25 Opcode 21 20 Ra 16 15 Rb 0 Memory_disp A Memory format instruction contains a 6-bit opcode field, two 5-bit register address fields, Ra and Rb, and a 16-bit signed displacement field. The displacement field is a byte offset.
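The field layout in Figure 3–1 can be illustrated with a small decoder. This is a sketch under the bit positions shown in the figure (opcode in <31:26>, Ra in <25:21>, Rb in <20:16>, displacement in <15:0>); the function name is hypothetical, and the field values used in the example are arbitrary rather than tied to a particular instruction.

```python
def decode_memory_format(insn: int) -> tuple[int, int, int, int]:
    """Split a 32-bit Memory format instruction into
    (opcode, ra, rb, signed byte displacement)."""
    if not 0 <= insn <= 0xFFFF_FFFF:
        raise ValueError("instruction must be a 32-bit value")
    opcode = (insn >> 26) & 0x3F   # 6-bit opcode, bits <31:26>
    ra = (insn >> 21) & 0x1F       # 5-bit register field, bits <25:21>
    rb = (insn >> 16) & 0x1F       # 5-bit register field, bits <20:16>
    disp = insn & 0xFFFF           # 16-bit displacement, bits <15:0>
    if disp & 0x8000:              # sign bit set: interpret as negative
        disp -= 0x1_0000
    return opcode, ra, rb, disp

# Example with arbitrary field values: opcode 0x29, Ra = 1, Rb = 2, disp = -8.
word = (0x29 << 26) | (1 << 21) | (2 << 16) | 0xFFF8
fields = decode_memory_format(word)
```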
3.3.1.2 Memory Format Jump Instructions For computed branch instructions (JSR, RET, JMP, JSR_COROUTINE) the displacement field is used to provide branch-prediction hints as described in Section 4.3. 3.3.2 Branch Instruction Format The Branch format is used for conditional branch instructions and for PC-relative subroutine jumps. It has the format shown in Figure 3–3.
An Operate format instruction contains a 6-bit opcode field and a 7-bit function code field. Unused function codes for opcodes defined as reserved in the Version 5 Alpha architecture specification (May 1992) produce an illegal instruction trap. Those opcodes are 01, 02, 03, 04, 05, 06, 07, 0A, 0C, 0D, 0E, 14, 19, 1B, 1D, 1E, and 1F. For other opcodes, unused function codes produce UNPREDICTABLE but not UNDEFINED results; they are not security holes. There are three operand fields, Ra, Rb, and Rc.
A Floating-point Operate format instruction contains a 6-bit opcode field and an 11-bit function field. Unused function codes for those opcodes defined as reserved in the Version 5 Alpha architecture specification (May 1992) produce an illegal instruction trap. Those opcodes are 01, 02, 03, 04, 05, 06, 07, 14, 19, 1B, 1D, 1E, and 1F. For other opcodes, unused function codes produce UNPREDICTABLE but not UNDEFINED results; they are not security holes. There are three operand fields, Fa, Fb, and Fc.
Figure 3–6: PALcode Instruction Format 31 26 25 Opcode 0 PALcode Function The 26-bit PALcode function field specifies the operation. The source and destination operands for PALcode instructions are supplied in fixed registers that are specified in the individual instruction descriptions. An opcode of zero and a PALcode function of zero specify the HALT instruction.
Chapter 4 Instruction Descriptions 4.1 Instruction Set Overview This chapter describes the instructions implemented by the Alpha architecture. The instruction set is divided into the following sections: Instruction Type Section Integer load and store 4.2 Integer control 4.3 Integer arithmetic 4.4 Logical and shift 4.5 Byte manipulation 4.6 Floating-point load and store 4.7 Floating-point control 4.8 Floating-point branch 4.9 Floating-point operate 4.10 Miscellaneous 4.
• Qualifiers specific to the instructions in the group • A description of the instruction operation • Optional programming examples and optional notes on the instruction 4.1.1 Subsetting Rules An instruction that is omitted in a subset implementation of the Alpha architecture is not performed in either hardware or PALcode. System software may provide emulation routines for subsetted instructions. 4.1.2 Floating-Point Subsets Floating-point support is optional on an Alpha processor.
4.1.3 Software Emulation Rules General-purpose layered and application software that executes in User mode may assume that certain loads (LDL, LDQ, LDF, LDG, LDS, and LDT) and certain stores (STL, STQ, STF, STG, STS, and STT) of unaligned data are emulated by system software. General-purpose layered and application software that executes in User mode may assume that subsetted instructions are emulated by system software.
4.2 Memory Integer Load/Store Instructions The instructions in this section move data between the integer registers and memory. They use the Memory instruction format. The instructions are summarized in Table 4–2.
4.2.1 Load Address
Format: LDAx Ra.wq,disp.ab(Rb.ab) !Memory format
Operation: Ra ← Rbv + SEXT(disp) !LDA
Ra ← Rbv + SEXT(disp*65536) !LDAH
Exceptions: None
Instruction mnemonics:
LDA Load Address
LDAH Load Address High
Qualifiers: None
Description: The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement for LDA, and 65536 times the sign-extended 16-bit displacement for LDAH. The 64-bit result is written to register Ra.
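The two address computations can be modeled directly, and together they show why an LDAH/LDA pair can build many 32-bit constants. The sketch below is illustrative only: the function names, including split_constant, are hypothetical, and the splitting helper deliberately rejects values from 0x7FFF8000 upward, where the sign extension of both halves prevents a two-instruction sequence from producing the value.

```python
MASK64 = (1 << 64) - 1

def sext16(disp: int) -> int:
    """Interpret the low 16 bits of disp as a signed displacement."""
    disp &= 0xFFFF
    return disp - 0x1_0000 if disp & 0x8000 else disp

def lda(rbv: int, disp: int) -> int:
    """LDA: Ra <- Rbv + SEXT(disp), kept to 64 bits."""
    return (rbv + sext16(disp)) & MASK64

def ldah(rbv: int, disp: int) -> int:
    """LDAH: Ra <- Rbv + 65536 * SEXT(disp), kept to 64 bits."""
    return (rbv + sext16(disp) * 65536) & MASK64

def split_constant(value: int) -> tuple[int, int]:
    """Return (high, low) 16-bit displacements such that
    lda(ldah(0, high), low) equals value as a 64-bit register image."""
    if not -0x8000_0000 <= value < 0x7FFF_8000:
        raise ValueError("value not reachable with one LDAH/LDA pair")
    low = value & 0xFFFF
    # LDA sign-extends low, so compensate in the high half when low is negative.
    high = ((value - sext16(low)) >> 16) & 0xFFFF
    return high, low
```

For the constant 0x1234ABCD the low half 0xABCD is negative when sign-extended, so the high half is bumped to 0x1235 to compensate.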
4.2.2 Load Memory Data into Integer Register Format: LDx !Memory format Ra.wq,disp.ab(Rb.
In the case of LDQ and LDL, the source operand is fetched from memory, sign-extended, and written to register Ra. In the case of LDWU and LDBU, the source operand is fetched from memory, zero-extended, and written to register Ra. In all cases, if the data is not naturally aligned, an alignment exception is generated. Notes: • The word or byte that the LDWU or LDBU instruction fetches from memory is placed in the low (rightmost) word or byte of Ra, with the remaining 6 or 7 bytes set to zero.
4.2.3 Load Unaligned Memory Data into Integer Register Format: LDQ_U Ra.wq,disp.ab(Rb.ab) !Memory format Operation: va ← {{Rbv + SEXT(disp)} AND NOT 7} Ra ← (va)<63:0> Exceptions: Access Violation Fault on Read Translation Not Valid Instruction mnemonics: LDQ_U Load Unaligned Quadword from Memory to Register Qualifiers: None Description: The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement, then the low-order three bits are cleared.
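The address computation for LDQ_U can be sketched as follows. Memory is modeled here as a Python bytes object indexed by virtual address, and little-endian byte order is assumed; both are simplifications for illustration, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def sext16(disp: int) -> int:
    """Interpret the low 16 bits of disp as a signed displacement."""
    disp &= 0xFFFF
    return disp - 0x1_0000 if disp & 0x8000 else disp

def ldq_u_address(rbv: int, disp: int) -> int:
    """va <- {Rbv + SEXT(disp)} AND NOT 7: clear the low-order three bits."""
    return ((rbv + sext16(disp)) & MASK64) & ~7

def ldq_u(memory: bytes, rbv: int, disp: int) -> int:
    """Fetch the aligned quadword that contains the addressed byte."""
    va = ldq_u_address(rbv, disp)
    return int.from_bytes(memory[va:va + 8], "little")
```

Whatever byte the unmasked address points at, the load always returns the whole aligned quadword containing it, so no alignment exception is needed.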
4.2.4 Load Memory Data into Integer Register Locked Format: LDx_L !Memory format Ra.wq,disp.ab(Rb.
When a LDx_L instruction is executed without faulting, the processor records the target physical address in a per-processor locked_physical_address register and sets the per-processor lock_flag. If the per-processor lock_flag is (still) set when a STx_C instruction is executed (accessing within the same 16-byte naturally aligned block as the LDx_L), the store occurs; otherwise, it does not occur, as described for the STx_C instructions.
If two LDx_L instructions execute with no intervening STx_C, the second one overwrites the state of the first one. If two STx_C instructions execute with no intervening LDx_L, the second one always fails because the first clears lock_flag. • Software will not emulate unaligned LDx_L instructions.
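The interaction between lock_flag and locked_physical_address can be illustrated with a toy single-processor model. This is a sketch under stated assumptions, not a description of any implementation: the class name LockModel is hypothetical, memory is a dictionary keyed by physical address, a STx_C outside the locked 16-byte block is simply treated as a failure, and the return value 1 or 0 stands for the success indication of the store.

```python
class LockModel:
    """Toy model of the per-processor lock state used by LDx_L / STx_C."""
    BLOCK = 16  # size of the naturally aligned block, in bytes

    def __init__(self) -> None:
        self.memory: dict[int, int] = {}
        self.lock_flag = False
        self.locked_physical_address: int | None = None

    def ldx_l(self, pa: int) -> int:
        # A later LDx_L overwrites whatever state an earlier one recorded.
        self.lock_flag = True
        self.locked_physical_address = pa
        return self.memory.get(pa, 0)

    def stx_c(self, pa: int, value: int) -> int:
        locked = self.locked_physical_address
        same_block = locked is not None and pa // self.BLOCK == locked // self.BLOCK
        success = self.lock_flag and same_block
        if success:
            self.memory[pa] = value
        # The lock_flag is cleared whether or not the store happened, so a
        # second STx_C without an intervening LDx_L cannot succeed.
        self.lock_flag = False
        return 1 if success else 0
```

Running ldx_l followed by two stx_c calls on the same address shows the first store succeeding and the second failing, which is the behavior described above.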
4.2.5 Store Integer Register Data into Memory Conditional Format: STx_C !Memory format Ra.mx,disp.ab(Rb.
• The computed virtual address must specify a location within the naturally aligned 16-byte block in virtual memory accessed by the preceding LDx_L instruction. • The resultant physical address must specify a location within the naturally aligned 16-byte block in physical memory accessed by the preceding LDx_L instruction.
Software Note: If the address specified by a STx_C instruction does not match the one given in the preceding LDx_L instruction, an MB is required to guarantee ordering between the two instructions. Hardware/Software Implementation Note: STQ_C is used in the first Alpha implementations to access the MailBox Pointer Register (MBPR). In this special case, the effect of the STQ_C is well defined (that is, not UNPREDICTABLE) even though the preceding LDx_L did not specify the address of the MBPR.
4.2.6 Store Integer Register Data into Memory Format: STx !Memory format Ra.rx,disp.ab(Rb.
The Ra operand is written to memory at this address. If the data is not naturally aligned, an alignment exception is generated. Notes: • The word or byte that the STB or STW instruction stores to memory comes from the low (rightmost) byte or word of Ra. • Accesses have byte granularity. • For big-endian access with STB or STW, the byte/word remains in the rightmost part of Ra, but the va sent to memory has the indicated bits inverted. See Operation section, above.
4.2.7 Store Unaligned Integer Register Data into Memory Format: STQ_U Ra.rq,disp.ab(Rb.ab) !Memory format Operation: va ← {{Rbv + SEXT(disp)} AND NOT 7} (va)<63:0> ← Rav<63:0> Exceptions: Access Violation Fault on Write Translation Not Valid Instruction mnemonics: STQ_U Store Unaligned Quadword from Register to Memory Qualifiers: None Description: The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement, then clearing the low order three bits.
4.3 Control Instructions Alpha provides integer conditional branch, unconditional branch, branch to subroutine, and jump instructions. The PC used in these instructions is the updated PC, as described in Section 3.1.1.
Table 4–3: Control Instructions Summary (Continued)
Mnemonic         Operation
BNE              Branch if Register Not Equal to Zero
BR               Unconditional Branch
BSR              Branch to Subroutine
JMP              Jump
JSR              Jump to Subroutine
RET              Return from Subroutine
JSR_COROUTINE    Jump to Subroutine Return
4.3.1 Conditional Branch Format: Bxx Ra.rq,disp.
4.3.2 Unconditional Branch Format: BxR Ra.wq,disp.al !Branch format Operation: {update PC} Ra ← PC PC ← PC + {4*SEXT(disp)} Exceptions: None Instruction mnemonics: BR Unconditional Branch BSR Branch to Subroutine Qualifiers: None Description: The PC of the following instruction (the updated PC) is written to register Ra and then the PC is loaded with the target address. The displacement is treated as a signed longword offset.
4.3.3 Jumps Format: mnemonic Ra.wq,(Rb.ab),hint !Memory format Operation: {update PC} va ← Rbv AND {NOT 3} Ra ← PC PC ← va Exceptions: None Instruction mnemonics: JMP Jump JSR Jump to Subroutine RET Return from Subroutine JSR_COROUTINE Jump to Subroutine Return Qualifiers: None Description: The PC of the instruction following the Jump instruction (the updated PC) is written to register Ra and then the PC is loaded with the target virtual address. The new PC is supplied from register Rb.
Table 4–4: Jump Instructions Branch Prediction
disp<15:14>    Meaning          Predicted Target<15:0>    Prediction Stack Action
00             JMP              PC + {4*disp<13:0>}       –
01             JSR              PC + {4*disp<13:0>}       Push PC
10             RET              Prediction stack          Pop
11             JSR_COROUTINE    Prediction stack          Pop, push PC
The design in Table 4–4 allows specification of the low 16 bits of a likely longword target address (enough bits to start a useful I-cache access early), and also allows distinguishing call from return (and from the other two less frequent
4.4 Integer Arithmetic Instructions The integer arithmetic instructions perform add, subtract, multiply, signed and unsigned compare, and bit count operations. Count instruction (CIX) extension implementation note: The CIX extension to the architecture provides the CTLZ, CTPOP, and CTTZ instructions. Alpha processors for which the AMASK instruction returns bit 2 set implement these instructions.
4.4.1 Longword Add Format: ADDL Ra.rl,Rb.rl,Rc.wq !Operate format ADDL Ra.rl,#b.ib,Rc.wq !Operate format Operation: Rc ← SEXT( (Rav + Rbv)<31:0>) Exceptions: Integer Overflow Instruction mnemonics: ADDL Add Longword Qualifiers: Integer Overflow Enable (/V) Description: Register Ra is added to register Rb or a literal and the sign-extended 32-bit sum is written to Rc. The high order 32 bits of Ra and Rb are ignored. Rc is a proper sign extension of the truncated 32-bit sum.
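The truncate-then-sign-extend behavior of ADDL can be modeled as follows. Register values are represented as unsigned 64-bit integers, and the function names are hypothetical. The overflow check reflects one reading of the /V qualifier, namely that overflow is judged on the signed 32-bit operands; treat that helper as an assumption rather than a statement of the trap conditions.

```python
def sext32(x: int) -> int:
    """Sign-extend the low 32 bits of x into a 64-bit register image."""
    x &= 0xFFFF_FFFF
    if x & 0x8000_0000:
        x |= 0xFFFF_FFFF_0000_0000
    return x

def addl(rav: int, rbv: int) -> int:
    """Rc <- SEXT((Rav + Rbv)<31:0>); the high 32 bits of the inputs
    cannot influence the low 32 bits of the sum."""
    return sext32(rav + rbv)

def to_signed32(x: int) -> int:
    """Value of the low 32 bits of x as a signed integer."""
    x &= 0xFFFF_FFFF
    return x - 0x1_0000_0000 if x & 0x8000_0000 else x

def addl_overflows(rav: int, rbv: int) -> bool:
    """True when the signed 32-bit sum does not fit in 32 bits."""
    total = to_signed32(rav) + to_signed32(rbv)
    return not -0x8000_0000 <= total <= 0x7FFF_FFFF
```

Adding 1 to 0x7FFFFFFF wraps to the longword 0x80000000, which is then sign-extended to 0xFFFFFFFF80000000 in Rc.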
4.4.2 Scaled Longword Add Format: SxADDL Ra.rl,Rb.rq,Rc.wq !Operate format SxADDL Ra.rl,#b.ib,Rc.
4.4.3 Quadword Add Format: ADDQ Ra.rq,Rb.rq,Rc.wq !Operate format ADDQ Ra.rq,#b.ib,Rc.wq !Operate format Operation: Rc ← Rav + Rbv Exceptions: Integer Overflow Instruction mnemonics: ADDQ Add Quadword Qualifiers: Integer Overflow Enable (/V) Description: Register Ra is added to register Rb or a literal and the 64-bit sum is written to Rc. On overflow, the least significant 64 bits of the true result are written to the destination register.
4.4.4 Scaled Quadword Add
Format: SxADDQ Ra.rq,Rb.rq,Rc.wq !Operate format
SxADDQ Ra.rq,#b.ib,Rc.wq !Operate format
Operation: CASE
S4ADDQ: Rc ← LEFT_SHIFT(Rav,2) + Rbv
S8ADDQ: Rc ← LEFT_SHIFT(Rav,3) + Rbv
ENDCASE
Exceptions: None
Instruction mnemonics:
S4ADDQ Scaled Add Quadword by 4
S8ADDQ Scaled Add Quadword by 8
Qualifiers: None
Description: Register Ra is scaled by 4 (for S4ADDQ) or 8 (for S8ADDQ) and is added to register Rb or a literal, and the 64-bit sum is written to Rc.
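A common use of a scaled add is forming the address of an array element from an index and a base. The sketch below models the two operations on unsigned 64-bit register images; the function names mirror the mnemonics but are otherwise hypothetical, and the array-indexing interpretation is an illustration rather than something the description above prescribes.

```python
MASK64 = (1 << 64) - 1

def s4addq(rav: int, rbv: int) -> int:
    """Rc <- LEFT_SHIFT(Rav, 2) + Rbv, truncated to 64 bits."""
    return ((rav << 2) + rbv) & MASK64

def s8addq(rav: int, rbv: int) -> int:
    """Rc <- LEFT_SHIFT(Rav, 3) + Rbv, truncated to 64 bits."""
    return ((rav << 3) + rbv) & MASK64

# Address of element 3 in arrays starting at 0x1000:
longword_elem = s4addq(3, 0x1000)   # 4-byte elements
quadword_elem = s8addq(3, 0x1000)   # 8-byte elements
```

Since no exceptions are defined for these instructions, bits shifted out of the 64-bit result are simply discarded.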
4.4.5 Integer Signed Compare
Format: CMPxx Ra.rq,Rb.rq,Rc.wq !Operate format
CMPxx Ra.rq,#b.ib,Rc.wq !Operate format
Operation: IF Rav SIGNED_RELATION Rbv THEN Rc ← 1 ELSE Rc ← 0
Exceptions: None
Instruction mnemonics:
CMPEQ Compare Signed Quadword Equal
CMPLE Compare Signed Quadword Less Than or Equal
CMPLT Compare Signed Quadword Less Than
Qualifiers: None
Description: Register Ra is compared to Register Rb or a literal.
4.4.6 Integer Unsigned Compare
Format: CMPUxx Ra.rq,Rb.rq,Rc.wq !Operate format
CMPUxx Ra.rq,#b.ib,Rc.wq !Operate format
Operation: IF Rav UNSIGNED_RELATION Rbv THEN Rc ← 1 ELSE Rc ← 0
Exceptions: None
Instruction mnemonics:
CMPULE Compare Unsigned Quadword Less Than or Equal
CMPULT Compare Unsigned Quadword Less Than
Qualifiers: None
Description: Register Ra is compared to Register Rb or a literal.
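The difference between the unsigned and signed relations shows up only when bit 63 is set in one of the operands. The following sketch models CMPULT and CMPULE on 64-bit register images and includes a signed CMPLT purely for contrast; the function names follow the mnemonics but are otherwise hypothetical.

```python
MASK64 = (1 << 64) - 1

def to_signed64(x: int) -> int:
    """Value of a 64-bit register image as a two's-complement integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x

def cmpult(rav: int, rbv: int) -> int:
    """Rc <- 1 if Rav is less than Rbv as unsigned quadwords, else 0."""
    return 1 if (rav & MASK64) < (rbv & MASK64) else 0

def cmpule(rav: int, rbv: int) -> int:
    """Rc <- 1 if Rav is less than or equal to Rbv as unsigned quadwords."""
    return 1 if (rav & MASK64) <= (rbv & MASK64) else 0

def cmplt(rav: int, rbv: int) -> int:
    """Signed counterpart, shown only to contrast with cmpult."""
    return 1 if to_signed64(rav) < to_signed64(rbv) else 0
```

With Rbv holding all 1's, the unsigned compare of 1 against it yields 1 because all 1's is the largest unsigned quadword, while the signed compare yields 0 because the same bit pattern represents -1.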
4.4.7 Count Leading Zero
Format:
    CTLZ Rb.rq,Rc.wq    !Operate format
Operation:
    temp = 0
    FOR i FROM 63 DOWN TO 0
        IF { Rbv<i> EQ 1 } THEN BREAK
        temp = temp + 1
    END
    Rc<6:0> ← temp<6:0>
    Rc<63:7> ← 0
Exceptions: None
Instruction mnemonics:
    CTLZ    Count Leading Zero
Qualifiers: None
Description:
    The number of leading zeros in Rb, starting at the most significant bit position, is written to Rc. Ra must be R31.
4.4.8 Count Population
Format:
    CTPOP Rb.rq,Rc.wq
Operation:
    temp = 0
    FOR i FROM 0 TO 63
        IF { Rbv<i> EQ 1 } THEN temp = temp + 1
    END
    Rc<6:0> ← temp<6:0>
    Rc<63:7> ← 0
Exceptions: None
Instruction mnemonics:
    CTPOP    Count Population
Qualifiers: None
Description:
    The number of ones in Rb is written to Rc. Ra must be R31.
4.4.9 Count Trailing Zero
Format:
    CTTZ Rb.rq,Rc.wq    !Operate format
Operation:
    temp = 0
    FOR i FROM 0 TO 63
        IF { Rbv<i> EQ 1 } THEN BREAK
        temp = temp + 1
    END
    Rc<6:0> ← temp<6:0>
    Rc<63:7> ← 0
Exceptions: None
Instruction mnemonics:
    CTTZ    Count Trailing Zero
Qualifiers: None
Description:
    The number of trailing zeros in Rb, starting at the least significant bit position, is written to Rc. Ra must be R31.
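The three count operations above translate almost line for line into Python. This is a sketch under the assumption that Rbv is an unsigned 64-bit integer; the function names are hypothetical.

```python
def ctlz(rbv):
    """Sketch of CTLZ: scan from bit 63 down and stop at the first one bit."""
    temp = 0
    for i in range(63, -1, -1):
        if (rbv >> i) & 1:
            break
        temp += 1
    return temp

def ctpop(rbv):
    """Sketch of CTPOP: count the one bits in positions 0 through 63."""
    temp = 0
    for i in range(64):
        if (rbv >> i) & 1:
            temp += 1
    return temp

def cttz(rbv):
    """Sketch of CTTZ: scan from bit 0 up and stop at the first one bit."""
    temp = 0
    for i in range(64):
        if (rbv >> i) & 1:
            break
        temp += 1
    return temp
```

When the operand is zero, the loops in ctlz and cttz run to completion and return 64, which is why the result field is the seven bits Rc<6:0>.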
4.4.10 Longword Multiply Format: MULL Ra.rl,Rb.rl,Rc.wq !Operate format MULL Ra.rl,#b.ib,Rc.wq !Operate format Operation: Rc ← SEXT ((Rav * Rbv)<31:0>) Exceptions: Integer Overflow Instruction mnemonics: MULL Multiply Longword Qualifiers: Integer Overflow Enable (/V) Description: Register Ra is multiplied by register Rb or a literal and the sign-extended 32-bit product is written to Rc. The high 32 bits of Ra and Rb are ignored. Rc is a proper sign extension of the truncated 32-bit product.
4.4.11 Quadword Multiply Format: MULQ Ra.rq,Rb.rq,Rc.wq !Operate format MULQ Ra.rq,#b.ib,Rc.wq !Operate format Operation: Rc ← Rav * Rbv Exceptions: Integer Overflow Instruction mnemonics: MULQ Multiply Quadword Qualifiers: Integer Overflow Enable (/V) Description: Register Ra is multiplied by register Rb or a literal and the 64-bit product is written to register Rc. Overflow detection is based on considering the operands and the result as signed quantities.
4.4.12 Unsigned Quadword Multiply High Format: UMULH Ra.rq,Rb.rq,Rc.wq !Operate format UMULH Ra.rq,#b.ib,Rc.wq !Operate format Operation: Rc ← {Rav *U Rbv}<127:64> Exceptions: None Instruction mnemonics: UMULH Unsigned Multiply Quadword High Qualifiers: None Description: Register Ra and register Rb or a literal are multiplied as unsigned numbers to produce a 128-bit result. The high-order 64 bits are written to register Rc.
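Because Python integers have arbitrary precision, the 128-bit product can be formed directly. The following sketch pairs UMULH with the low half that MULQ produces; the function names are hypothetical and the /V overflow trap of MULQ is not modeled.

```python
MASK64 = (1 << 64) - 1

def mulq(rav, rbv):
    """Sketch of MULQ without /V: the low-order 64 bits of the product."""
    return (rav * rbv) & MASK64

def umulh(rav, rbv):
    """Sketch of UMULH: bits <127:64> of the unsigned 128-bit product."""
    return ((rav & MASK64) * (rbv & MASK64)) >> 64

def full_product(rav, rbv):
    """Recombine both halves into the 128-bit unsigned product."""
    return (umulh(rav, rbv) << 64) | mulq(rav, rbv)
```

Software that needs a full 128-bit unsigned product would issue both instructions on the same operands, as full_product does here.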
4.4.13 Longword Subtract Format: SUBL Ra.rl,Rb.rl,Rc.wq !Operate format SUBL Ra.rl,#b.ib,Rc.wq !Operate format Operation: Rc ← SEXT ((Rav - Rbv)<31:0>) Exceptions: Integer Overflow Instruction mnemonics: SUBL Subtract Longword Qualifiers: Integer Overflow Enable (/V) Description: Register Rb or a literal is subtracted from register Ra and the sign-extended 32-bit difference is written to Rc. The high 32 bits of Ra and Rb are ignored.
4.4.14 Scaled Longword Subtract Format: SxSUBL Ra.rl,Rb.rl,Rc.wq !Operate format SxSUBL Ra.rl,#b.ib,Rc.
4.4.15 Quadword Subtract Format: SUBQ Ra.rq,Rb.rq,Rc.wq !Operate format SUBQ Ra.rq,#b.ib,Rc.wq !Operate format Operation: Rc ← Rav - Rbv Exceptions: Integer Overflow Instruction mnemonics: SUBQ Subtract Quadword Qualifiers: Integer Overflow Enable (/V) Description: Register Rb or a literal is subtracted from register Ra and the 64-bit difference is written to register Rc. On overflow, the least significant 64 bits of the true result are written to the destination register.
4.4.16 Scaled Quadword Subtract Format: SxSUBQ Ra.rq,Rb.rq,Rc.wq !Operate format SxSUBQ Ra.rq,#b.ib,Rc.
4.5 Logical and Shift Instructions The logical instructions perform quadword Boolean operations. The conditional move integer instructions perform conditionals without a branch. The shift instructions perform left and right logical shift and right arithmetic shift. These are summarized in Table 4–6.
4.5.1 Logical Functions Format: mnemonic Ra.rq,Rb.rq,Rc.wq !Operate format mnemonic Ra.rq,#b.ib,Rc.
4.5.2 Conditional Move Integer Format: CMOVxx Ra.rq,Rb.rq,Rc.wq !Operate format CMOVxx Ra.rq,#b.ib,Rc.
Notes: Except that it is likely in many implementations to be substantially faster, the instruction:
        CMOVEQ  Ra,Rb,Rc
is exactly equivalent to:
        BNE     Ra,label
        OR      Rb,Rb,Rc
label:  ...
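The equivalence stated in the note can be checked with a small Python model. This is a sketch only: both functions are hypothetical, and each returns the final value of Rc given the incoming values of Ra, Rb, and Rc.

```python
def cmoveq(rav, rbv, rcv):
    """Sketch of CMOVEQ: Rc receives Rbv when Rav is zero, else Rc is unchanged."""
    return rbv if rav == 0 else rcv

def branch_sequence(rav, rbv, rcv):
    """Sketch of the BNE/OR sequence from the note."""
    if rav != 0:          # BNE Ra,label: skip the move when Ra is nonzero
        return rcv
    return rbv | rbv      # OR Rb,Rb,Rc: copy Rb into Rc
```

The conditional move reaches the same result without a change in control flow.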
4.5.3 Shift Logical
Format:
    SxL Ra.rq,Rb.rq,Rc.wq    !Operate format
    SxL Ra.rq,#b.ib,Rc.wq    !Operate format
Operation:
    Rc ← LEFT_SHIFT(Rav, Rbv<5:0>)     !SLL
    Rc ← RIGHT_SHIFT(Rav, Rbv<5:0>)    !SRL
Exceptions: None
Instruction mnemonics:
    SLL    Shift Left Logical
    SRL    Shift Right Logical
Qualifiers: None
Description:
    Register Ra is shifted logically left or right 0 to 63 bits by the count in register Rb or a literal. The result is written to register Rc.
4.5.4 Shift Arithmetic Format: SRA Ra.rq,Rb.rq,Rc.wq !Operate format SRA Ra.rq,#b.ib,Rc.wq !Operate format Operation: Rc ← ARITH_RIGHT_SHIFT(Rav, Rbv<5:0>) Exceptions: None Instruction mnemonics: SRA Shift Right Arithmetic Qualifiers: None Description: Register Ra is right shifted arithmetically 0 to 63 bits by the count in register Rb or a literal. The result is written to register Rc. The sign bit (Rav<63>) is propagated into the vacated bit positions.
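The logical and arithmetic shifts differ only in what fills the vacated positions. A minimal Python sketch follows, assuming unsigned 64-bit register images; the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def sll(rav, rbv):
    """Sketch of SLL: shift left by Rbv<5:0>, zeros enter from the right."""
    return (rav << (rbv & 0x3F)) & MASK64

def srl(rav, rbv):
    """Sketch of SRL: shift right by Rbv<5:0>, zeros enter from the left."""
    return (rav & MASK64) >> (rbv & 0x3F)

def sra(rav, rbv):
    """Sketch of SRA: shift right by Rbv<5:0>, copies of Rav<63> enter from the left."""
    rav &= MASK64
    signed = rav - (1 << 64) if rav >> 63 else rav
    return (signed >> (rbv & 0x3F)) & MASK64
```

Only the low six bits of the count are used, so a count of 64 behaves like a count of 0 in this model.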
4.6 Byte Manipulation Instructions
Alpha implementations that support the BWX extension provide the following instructions for loading, sign-extending, and storing bytes and words between a register and memory:

    Instruction    Meaning                     Described in Section
    LDBU/LDWU      Load byte/word unaligned    4.2.2
    SEXTB/SEXTW    Sign-extend byte/word       4.6.5
    STB/STW        Store byte/word             4.2.6

The AMASK instruction reports whether a particular Alpha implementation supports the BWX extension. AMASK is described in Sections 4.11.
Table 4–7: Byte-Within-Register Manipulation Instructions Summary (Continued)
    Mnemonic    Operation
    INSWH       Insert Word High
    INSLH       Insert Longword High
    INSQH       Insert Quadword High
    MSKBL       Mask Byte Low
    MSKWL       Mask Word Low
    MSKLL       Mask Longword Low
    MSKQL       Mask Quadword Low
    MSKWH       Mask Word High
    MSKLH       Mask Longword High
    MSKQH       Mask Quadword High
    SEXTB       Sign extend byte
    SEXTW       Sign extend word
    ZAP         Zero Bytes
    ZAPNOT      Zero Bytes Not
4.6.1 Compare Byte
Format:
    CMPBGE Ra.rq,Rb.rq,Rc.wq    !Operate format
    CMPBGE Ra.rq,#b.ib,Rc.wq    !Operate format
Operation:
    FOR i FROM 0 TO 7
        temp<8:0> ← {0 || Rav<i*8+7:i*8>} + {0 || NOT Rbv<i*8+7:i*8>} + 1
        Rc<i> ← temp<8>
    END
    Rc<63:8> ← 0
Exceptions: None
Instruction mnemonics:
    CMPBGE    Compare Byte
Qualifiers: None
Description:
    CMPBGE does eight parallel unsigned byte comparisons between corresponding bytes of Rav and Rbv, storing the eight results in the low eight bits of Rc.
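The byte-wise comparison can be sketched in Python by forming the 9-bit sum for each byte lane and keeping its carry bit. The function name is hypothetical and the operands are assumed to be unsigned 64-bit integers.

```python
def cmpbge(rav, rbv):
    """Sketch of CMPBGE: bit i of the result is 1 when byte i of Rav >= byte i of Rbv."""
    rc = 0
    for i in range(8):
        a = (rav >> (8 * i)) & 0xFF
        b = (rbv >> (8 * i)) & 0xFF
        # a + NOT(b) + 1 equals a - b + 256, so bit 8 is set exactly when a >= b.
        temp = a + (~b & 0xFF) + 1
        rc |= ((temp >> 8) & 1) << i
    return rc
```

Comparing zero against a quadword, as in cmpbge(0, x), sets a result bit for every byte of x that is zero. This is the idiom used to find a string terminator eight bytes at a time.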
To compare two character strings for greater/equal/less:

LOOP:
        LDQ     R3, 0(R1)       ; Pick up 8 bytes of string1
        LDA     R1, 8(R1)       ; Increment string1 pointer
        LDQ     R4, 0(R2)       ; Pick up 8 bytes of string2
        LDA     R2, 8(R2)       ; Increment string2 pointer
        CMPBGE  R31, R3, R6     ; Test for zeros in string1
        XOR     R3, R4, R5      ; Test for all equal bytes
        BNE     R6, DONE        ; Exit if a zero found
        BEQ     R5, LOOP        ; Loop if al
DONE:   CMPBGE  R31, R5, R5
4.6.2 Extract Byte Format: EXTxx Ra.rq,Rb.rq,Rc.wq !Operate format EXTxx Ra.rq,#b.ib,Rc.
Description: EXTxL shifts register Ra right by 0 to 7 bytes, inserts zeros into vacated bit positions, and then extracts 1, 2, 4, or 8 bytes into register Rc. EXTxH shifts register Ra left by 0 to 7 bytes, inserts zeros into vacated bit positions, and then extracts 2, 4, or 8 bytes into register Rc. The number of bytes to shift is specified by Rbv’<2:0>. The number of bytes to extract is specified in the function code. Remaining bytes are filled with zeros.
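As an illustration of how the low and high extracts cooperate, the following Python sketch models EXTWL and EXTWH for little-endian data and uses them to load a word from an arbitrary byte address. The function names and the bytes-based memory are hypothetical. The left-shift amount in extwh, (64 - 8*Rbv<2:0>) modulo 64, is assumed here rather than quoted from the description above.

```python
MASK64 = (1 << 64) - 1

def extwl(rav, rbv):
    """Sketch of EXTWL: shift right by Rbv<2:0> bytes and keep the low word."""
    return (rav >> ((rbv & 7) * 8)) & 0xFFFF

def extwh(rav, rbv):
    """Sketch of EXTWH: shift left by (64 - 8*Rbv<2:0>) mod 64 bits and keep the low word."""
    shift = (64 - (rbv & 7) * 8) & 0x3F
    return ((rav << shift) & MASK64) & 0xFFFF

def load_unaligned_word(memory, x):
    """Zero-extended word load built from two aligned quadword reads."""
    lo_addr = x & ~7            # quadword containing byte X
    hi_addr = (x + 1) & ~7      # quadword containing byte X+1
    q_lo = int.from_bytes(memory[lo_addr:lo_addr + 8], "little")
    q_hi = int.from_bytes(memory[hi_addr:hi_addr + 8], "little")
    return extwh(q_hi, x) | extwl(q_lo, x)
```

When the word lies within one quadword, extwh contributes either zero or the same bytes as extwl. When it straddles two quadwords, each extract supplies one byte.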
For software that is not designed to use the BWX extension, the intended sequence for loading and zero-extending a word from unaligned address X is:

        LDQ_U   R1, X(R11)      ; Ignores va<2:0>, R1 = yBAx xxxx
        LDQ_U   R2, X+1(R11)    ; Ignores va<2:0>, R2 = yBAx xxxx
        LDA     R3, X(R11)      ; R3<2:0> = (X mod 8) = 5
        EXTWL   R1, R3, R1      ; R1 = 0000 00BA
        EXTWH   R2, R3, R2      ; R2 = 0000 0000
        OR      R2, R1, R1      ; R1 = 0000 00BA

For software that is not designed to use the BWX extension, the intended sequence for loading and sign-extending a
For software that is not designed to use the BWX extension, the intended sequence for loading and zero-extending an aligned word from 10(R3) is:

        LDL     R1, 8(R3)       ; R1 = ssss BAxx
                                ; Faults if R3 is not longword aligned
        EXTWL   R1, #2, R1      ; R1 = 0000 00BA

For software that is not designed to use the BWX extension, the intended sequence for loading and sign-extending an aligned word from 10(R3) is:

        LDL     R1, 8(R3)       ; R1 = ssss BAxx
                                ; Faults if R3 is not longword aligned
        SRA     R1, #16, R1     ; R1 = ssss ssBA

Big-e
4.6.3 Byte Insert Format: INSxx Ra.rq,Rb.rq,Rc.wq !Operate format INSxx Ra.rq,#b.ib,Rc.
Qualifiers: None Description: INSxL and INSxH shift bytes from register Ra and insert them into a field of zeros, storing the result in register Rc. Register Rbv’<2:0> selects the shift amount, and the function code selects the maximum field width: 1, 2, 4, or 8 bytes. The instructions can generate a byte, word, longword, or quadword datum that is spread across two registers at an arbitrary byte alignment.
4.6.4 Byte Mask Format: MSKxx Ra.rq,Rb.rq,Rc.wq !Operate format MSKxx Ra.rq,#b.ib,Rc.
Description: MSKxL and MSKxH set selected bytes of register Ra to zero, storing the result in register Rc. Register Rbv’<2:0> selects the starting position of the field of zero bytes, and the function code selects the maximum width: 1, 2, 4, or 8 bytes. The instructions generate a byte, word, longword, or quadword field of zeros that can spread across two registers at an arbitrary byte alignment.
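The insert and mask instructions are designed to be used together when storing unaligned data. The following Python sketch models the word variants for little-endian data and combines them into an unaligned word store. All names and the bytearray-based memory are hypothetical. The byte-mask and shift details are assumptions consistent with the descriptions above, not quoted operation text.

```python
MASK64 = (1 << 64) - 1

def byte_zap(value, mask):
    """Zero byte i of value wherever bit i of mask is set."""
    for i in range(8):
        if (mask >> i) & 1:
            value &= ~(0xFF << (8 * i))
    return value & MASK64

def word_mask(rbv):
    """16-bit byte mask for a 2-byte field starting at byte Rbv<2:0>."""
    return 0x03 << (rbv & 7)

def inswl(rav, rbv):
    return byte_zap((rav << ((rbv & 7) * 8)) & MASK64, ~word_mask(rbv) & 0xFF)

def inswh(rav, rbv):
    shift = (64 - (rbv & 7) * 8) & 0x3F
    return byte_zap(rav >> shift, ~(word_mask(rbv) >> 8) & 0xFF)

def mskwl(rav, rbv):
    return byte_zap(rav, word_mask(rbv) & 0xFF)

def mskwh(rav, rbv):
    return byte_zap(rav, (word_mask(rbv) >> 8) & 0xFF)

def store_unaligned_word(memory, x, r5):
    """Read-modify-write of the one or two quadwords that hold the word at X."""
    lo_addr, hi_addr = x & ~7, (x + 1) & ~7
    q_hi = int.from_bytes(memory[hi_addr:hi_addr + 8], "little")
    q_lo = int.from_bytes(memory[lo_addr:lo_addr + 8], "little")
    new_hi = mskwh(q_hi, x) | inswh(r5, x)
    new_lo = mskwl(q_lo, x) | inswl(r5, x)
    # The high quadword is written first, so that when both addresses name
    # the same quadword the low (fully updated) image is the one that remains.
    memory[hi_addr:hi_addr + 8] = new_hi.to_bytes(8, "little")
    memory[lo_addr:lo_addr + 8] = new_lo.to_bytes(8, "little")
```

In this sketch the order of the two writes matters only when the word does not cross a quadword boundary.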
For software that is not designed to use the BWX extension, the intended sequence for storing an unaligned word R5 at X is:

        LDA     R6, X(R11)      ; R6<2:0> = (X mod 8) = 5
        LDQ_U   R2, X+1(R11)    ; Ignores va<2:0>, R2 = yBAx xxxx
        LDQ_U   R1, X(R11)      ; Ignores va<2:0>, R1 = yBAx xxxx
        INSWH   R5, R6, R4      ; R4 = 0000 0000
        INSWL   R5, R6, R3      ; R3 = 0BA0 0000
        MSKWH   R2, R6, R2      ; R2 = yBAx xxxx
        MSKWL   R1, R6, R1      ; R1 = y00x xxxx
        OR      R2, R4, R2      ; R2 = yBAx xxxx
        OR      R1, R3, R1      ; R1 = yBAx xxxx
        STQ_U   R2, X+1(R11)    ; M
        STQ_U   R1, X(R11)
4.6.5 Sign Extend
Format:
    SEXTx Rb.rq,Rc.wq    !Operate format
    SEXTx #b.ib,Rc.wq    !Operate format
Operation:
    CASE
        SEXTB: Rc ← SEXT(Rbv<07:0>)
        SEXTW: Rc ← SEXT(Rbv<15:0>)
    ENDCASE
Exceptions: None
Instruction mnemonics:
    SEXTB    Sign Extend Byte
    SEXTW    Sign Extend Word
Qualifiers: None
Description:
    The byte or word in register Rb is sign-extended to 64 bits and written to register Rc. Ra must be R31.
4.6.6 Zero Bytes Format: ZAPx Ra.rq,Rb.rq,Rc.wq !Operate format ZAPx Ra.rq,#b.ib,Rc.wq !Operate format Operation: CASE ZAP: Rc ← BYTE_ZAP(Rav, Rbv<7:0>) ZAPNOT: Rc ← BYTE_ZAP(Rav, NOT Rbv<7:0>) ENDCASE Exceptions: None Instruction mnemonics: ZAP ZAPNOT Zero Bytes Zero Bytes Not Qualifiers: None Description: ZAP and ZAPNOT set selected bytes of register Ra to zero and store the result in register Rc. Register Rb<7:0> selects the bytes to be zeroed.
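A short Python sketch of the two zero-byte operations follows. BYTE_ZAP is modeled by the hypothetical helper byte_zap, and register values are assumed to be unsigned 64-bit integers.

```python
MASK64 = (1 << 64) - 1

def byte_zap(value, mask):
    """Zero byte i of value wherever bit i of the 8-bit mask is set."""
    for i in range(8):
        if (mask >> i) & 1:
            value &= ~(0xFF << (8 * i))
    return value & MASK64

def zap(rav, rbv):
    """Sketch of ZAP: clear the bytes selected by Rbv<7:0>."""
    return byte_zap(rav, rbv & 0xFF)

def zapnot(rav, rbv):
    """Sketch of ZAPNOT: clear the bytes not selected by Rbv<7:0>."""
    return byte_zap(rav, ~rbv & 0xFF)
```

ZAPNOT with a mask of 0F (hex) keeps only the low four bytes, a common way to zero-extend a longword.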
4.7 Floating-Point Instructions Alpha provides instructions for operating on floating-point operands in each of four data formats: • F_floating (VAX single) • G_floating (VAX double, 11-bit exponent) • S_floating (IEEE single) • T_floating (IEEE double, 11-bit exponent) Data conversion instructions are also provided to convert operands between floating-point and quadword integer formats, between double and single floating, and between quadword and longword integers.
All floating-point loads and stores may take memory management faults (access control violation, translation not valid, fault on read/write, data alignment). The floating-point enable (FEN) internal processor register (IPR) allows system software to restrict access to the floating-point registers. If a floating-point instruction is implemented and FEN = 0, attempts to execute the instruction cause a floating disabled fault.
For VAX floating-point, finites do not include reserved operands or dirty zeros (this differs from the usual VAX interpretation of dirty zeros as finite). For IEEE floating-point, finites do not include infinities, NaNs, or denormals, but do include minus zero.
denormal
    An IEEE floating-point bit pattern that represents a number whose magnitude lies between zero and the smallest finite number.
dirty zero
    A VAX floating-point bit pattern that represents a zero value, but not in true-zero form.
true zero
    The value +0, represented as exactly 64 zeros in a floating-point register.
4.7.4 Encodings
Floating-point numbers are represented with three fields: sign, exponent, and fraction. The sign is 1 bit; the exponent is 8, 11, or 15 bits; and the fraction is 23, 52, 55, or 112 bits.
4.7.5 Rounding Modes All rounding modes map a true result that is exactly representable to that representable value. VAX Rounding Modes For VAX floating-point operations, two rounding modes are provided and are specified in each instruction: normal (biased) rounding and chopped rounding.
The following tables summarize the floating-point rounding modes:

    VAX Rounding Mode     Instruction Notation
    Normal rounding       (No qualifier)
    Chopped               /C

    IEEE Rounding Mode    Instruction Notation
    Normal rounding       (No qualifier)
    Dynamic rounding      /D
    Plus infinity         /D and ensure that FPCR<DYN> = ‘11’
    Minus infinity        /M
    Chopped               /C

4.7.6 Computational Models
The Alpha architecture provides a choice of floating-point computational models.
4.7.6.2 High-Performance VAX-Format Arithmetic This model provides arithmetic operations on VAX finite numbers. An imprecise arithmetic trap is generated by any operation that involves non-finite numbers, floating overflow, and divide-by-zero exceptions. This model is implemented by using VAX floating-point instructions with a trap qualifier other than /S, /SU, or /SV. Each instruction can determine whether it also traps on underflow or integer overflow.
4.7.6.5 High-Performance IEEE-Format Arithmetic This model provides arithmetic operations on IEEE finite numbers and notifies applications of all exceptional floating-point operations. An imprecise arithmetic trap is generated by any operation that involves non-finite numbers, floating overflow, divide-by-zero, and invalid operations. Underflow results are set to zero. Conversion to integer results that overflow are set to the low-order bits of the integer value.
When /U or /V mode is specified:
• Arithmetic is performed on VAX finite numbers.
• Operations give imprecise traps whenever the following occur:
    – an operand is a non-finite number
    – an underflow
    – an integer overflow
    – a floating overflow
    – a divide-by-zero
• Traps are imprecise and it is not always possible to determine which instruction triggered a trap or the operands of that instruction.
• An underflow trap produces a zero result.
A summary of the VAX trapping modes, instruction notation, and their meaning follows in Table 4–8:

Table 4–8: VAX Trapping Modes Summary
    Trap Mode                    Notation        Meaning
    Underflow disabled           No qualifier    Imprecise
                                 /S              Precise exception completion
    Underflow enabled            /U              Imprecise
                                 /SU             Precise exception completion
    Integer overflow disabled    No qualifier    Imprecise
                                 /S              Precise exception completion
    Integer overflow enabled     /V              Imprecise
                                 /SV             Precise exception completion

4.7.7.
• Traps are imprecise, and it is not always possible to determine which instruction triggered a trap or the operands of that instruction.
• An underflow trap produces a zero.
• A conversion to integer trap with an integer overflow produces the low-order bits of the integer.
• The result of any other operation that traps is UNPREDICTABLE.
When /SU or /SV mode is specified:
• Arithmetic is performed on all IEEE values, both finite and non-finite.
Table 4–9: Summary of IEEE Trapping Modes (Continued)
    Trap Mode                                        Notation    Meaning
    Integer overflow enabled and inexact disabled    /V          Imprecise
                                                     /SV         Precise exception completion
    Integer overflow enabled and inexact enabled     /SVI        Precise exception completion

4.7.7.3 Arithmetic Trap Completion
Because floating-point instructions may be pipelined, the trap PC can be an arbitrary number of instructions past the one triggering the trap.
Condition 3 allows an OS completion handler to emulate the trigger instruction with its original input operand values. Condition 4 allows the handler to re-execute instructions in the trap shadow with their original operand values. Condition 5 prevents any unusual side effects that would cause problems on repeated execution of the instructions in the trap shadow. Conditions: 1.
Table 4–10: Trap Shadow Length Rules
Floating-Point Instruction Group: Floating-point operate (except DIVx and SQRTx)
Trap Shadow Extends Until Any of the Following Occurs:
• Encountering a CALL_PAL, EXCB, or TRAPB instruction.
• The result is consumed by any instruction except floating-point STx.
• The fourth instruction† after the result is consumed by a floating-point STx instruction.
Table 4–10: Trap Shadow Length Rules (Continued)
Floating-Point Instruction Group: Floating-point SQRTx
Trap Shadow Extends Until Any of the Following Occurs:
• Encountering a CALL_PAL, EXCB, or TRAPB instruction.
• The result is consumed by any instruction.
• The result of a subsequent SQRTx instruction is consumed by any instruction.

† The length of four instructions is a conservative estimate of how far the trap shadow may extend past a consuming floating-point STx instruction.
An implementation may choose not to take an INV trap for a valid IEEE operation that involves denormal operands if: • The instruction is modified by any valid qualifier combination that includes the /S (exception completion) qualifier. • The implementation supports the DNZ (denormal operands to zero) bit and DNZ is set. • The instruction produces the result and exceptions required by Section 4.7.10, as modified by the DNZ bit described in Section 4.7.7.11.
4.7.7.7 Underflow (UNF) Arithmetic Trap An underflow occurs if the rounded result is smaller in magnitude than the smallest finite number of the destination format. If an underflow occurs, a true zero (64 bits of zero) is always stored in the result register. In the case of an IEEE operation that takes an underflow arithmetic trap, a true zero is stored even if the result after rounding would have been –0 (underflow below the negative denormal range).
4.7.7.11 IEEE Denormal Control Bits In the case of IEEE exception completion modes, the handling of denormal operands and results is controlled by the DNZ and UNDZ bits in the FPCR. These denormal control bits only affect denormal handling by IEEE instructions that are modified by any valid qualifier combination that includes the /S (exception completion) qualifier.
VAX and IEEE subsets, appropriately set the FPCR exception bits. It is UNPREDICTABLE whether floating-point operates that belong only to the VAX floating-point subset set the FPCR exception bits. Alpha floating-point hardware only transitions these exception bits from zero to one. Once set to one, these exception bits are only cleared when software writes zero into these bits by writing a new value into the FPCR. Section 4.7.2 allows certain of the FPCR bits to be subsetted.
Table 4–11: Floating-Point Control Register (FPCR) Bit Descriptions (Continued)
    Bit    Description (Meaning When Set)
    57     Integer Overflow (IOV). An integer arithmetic operation or a conversion from floating to integer overflowed the destination precision.
    56     Inexact Result (INE). A floating arithmetic or conversion operation gave a result that differed from the mathematically exact result.
    55     Underflow (UNF). A floating arithmetic or conversion operation underflowed the destination exponent.
FPCR and the instructions to access it are required for an implementation that supports floating-point (see Section 4.7.8). On implementations that do not support floating-point, the instructions that access FPCR (MF_FPCR and MT_FPCR) take an Illegal Instruction Trap. Software Note: Support for FPCR is required on a system that supports the OpenVMS Alpha operating system even if that system does not support floating-point. 4.7.8.
4.7.8.2 Default Values of the FPCR Processor initialization leaves the value of FPCR UNPREDICTABLE. Software Note: Compaq software should initialize FPCR<DYN> = 10 during program activation. Using this default, a program can be coded to use only dynamic rounding without the need to explicitly set the rounding mode to normal rounding in its start-up code. Program activation normally clears all other fields in the FPCR. However, this behavior may depend on the operating system. 4.7.8.
Compaq software may choose to initialize the software status bits and the trap disable bits to all 1’s to avoid any initial trapping when an exception condition first occurs. Or, software may choose to initialize those bits to all 0’s in order to provide a summary of the exception behavior when the program terminates. In any event, the exception bits in the FPCR are still useful to programs.
Table 4–12: IEEE Floating-Point Function Field Bit Summary
    Bits     Field    Meaning†
    15–13    TRP      Trapping modes:
                      Contents    Meaning for Opcodes 14₁₆ and 16₁₆
                      000         Imprecise (default)
                      001         Underflow enable (/U) – floating-point output
                                  Integer overflow enable (/V) – integer output
                      010         UNPREDICTABLE for opcode 16₁₆ instructions
                                  Reserved for opcode 14₁₆ instructions
                      011         UNPREDICTABLE for opcode 16₁₆ instructions
                                  Reserved for opcode 14₁₆ instructions
                      100         UNPREDICTABLE for opcode 16₁₆ instructions
                                  Reserved for op
Table 4–12: IEEE Floating-Point Function Field Bit Summary (Continued)
    Bits    Field    Meaning†
    8–5     FNC      Instruction class: †
                     Contents    Meaning for Opcode 16₁₆    Meaning for Opcode 14₁₆
                     0000        ADDx                       Reserved
                     0001        SUBx                       Reserved
                     0010        MULx                       Reserved
                     0011        DIVx                       Reserved
                     0100        CMPxUN                     ITOFS/ITOFT
                     0101        CMPxEQ                     Reserved
                     0110        CMPxLT                     Reserved
                     0111        CMPxLE                     Reserved
                     1000        Reserved                   Reserved
                     1001        Reserved                   Reserved
                     1010        Reserved                   Reserved
                     1011        Reserved                   SQRTS/SQRTT
                     1100        CVTxS                      Reserved
                     1101        Reserved                   Reserved
                     1110        CVTxT                      Reserved
                     1
Table 4–13: VAX Floating-Point Function Field Bit Summary
    Bits     Field    Meaning
    15–13    TRP      Trapping modes:
                      Contents    Meaning for Opcodes 14₁₆ and 15₁₆
                      000         Imprecise (default)
                      001         Underflow enable (/U) – floating-point output
                                  Integer overflow enable (/V) – integer output
                      010         UNPREDICTABLE for opcode 15₁₆ instructions
                                  Reserved for opcode 14₁₆ instructions
                      011         UNPREDICTABLE for opcode 15₁₆ instructions
                                  Reserved for opcode 14₁₆ instructions
                      100         /S – Exception completion enable
                      101         /SU – floating-point ou
Table 4–13: VAX Floating-Point Function Field Bit Summary (Continued)
    Bits    Field    Meaning
    8–5     FNC      Instruction class: †
                     Contents    Meaning for Opcode 15₁₆    Meaning for Opcode 14₁₆
                     0000        ADDx                       Reserved
                     0001        SUBx                       Reserved
                     0010        MULx                       Reserved
                     0011        DIVx                       Reserved
                     0100        CMPxUN                     ITOFF
                     0101        CMPxEQ                     Reserved
                     0110        CMPxLT                     Reserved
                     0111        CMPxLE                     Reserved
                     1000        Reserved                   Reserved
                     1001        Reserved                   Reserved
                     1010        Reserved                   SQRTF/SQRTG
                     1011        Reserved                   Reserved
                     1100        CVTxF                      Reserved
                     1101        CVTxD                      Reserved
                     1110        CVTxG                      Reserved
                     1111        CVTxQ                      Re
4.7.10.2 Copying NaN Values Copying a NaN value without changing its precision does not cause an invalid operation exception. 4.7.10.3 Generating NaN Values When an operation is required to produce a NaN and none of its inputs are NaN values, the result of the operation is the quiet NaN value that has the sign bit set to one, all exponent bits set to one (to indicate a NaN), the most significant fraction bit set to one (to indicate that the NaN is quiet), and all other fraction bits cleared to zero.
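The quiet NaN pattern described above can be assembled field by field. The following Python sketch is illustrative only: the function name and its parameters are hypothetical. The default widths correspond to T_floating (11-bit exponent, 52-bit fraction); with 8 and 23 it produces the 32-bit memory image of an S_floating quiet NaN.

```python
import math
import struct

def canonical_quiet_nan_bits(exp_bits=11, frac_bits=52):
    """Sign set, exponent all ones, most significant fraction bit set, rest clear."""
    sign = 1 << (exp_bits + frac_bits)
    exponent = ((1 << exp_bits) - 1) << frac_bits
    quiet = 1 << (frac_bits - 1)
    return sign | exponent | quiet

def t_floating_from_bits(bits):
    """Reinterpret a 64-bit pattern as an IEEE double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]
```

For T_floating the result is FFF8 0000 0000 0000 (hex), which the host's IEEE arithmetic recognizes as a NaN.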
4.8 Memory Format Floating-Point Instructions The instructions in this section move data between the floating-point registers and memory. They use the Memory instruction format. They do not interpret the bits moved in any way; specifically, they do not trap on non-finite values. The instructions are summarized in Table 4–14.
4.8.1 Load F_floating
Format:
    LDF Fa.wf,disp.ab(Rb.ab)    !Memory format
Operation:
    va ← {Rbv + SEXT(disp)}
    CASE
        big_endian_data: va’ ← va XOR 100₂
        little_endian_data: va’ ← va
    ENDCASE
    Fa ← (va’)<15> || MAP_F((va’)<14:7>) || (va’)<6:0> || (va’)<31:16> || 0<28:0>
Exceptions:
    Access Violation
    Fault on Read
    Alignment
    Translation Not Valid
Instruction mnemonics:
    LDF    Load F_floating
Qualifiers: None
Description:
    LDF fetches an F_floating datum from memory and writes it to register Fa.
4.8.2 Load G_floating Format: LDG Fa.wg,disp.ab(Rb.ab) !Memory format Operation: va ← {Rbv + SEXT(disp)} Fa ← (va)<15:0> || (va)<31:16> || (va)<47:32> || (va)<63:48> Exceptions: Access Violation Fault on Read Alignment Translation Not Valid Instruction mnemonics: LDG Load G_floating (Load D_floating) Qualifiers: None Description: LDG fetches a G_floating (or D_floating) datum from memory and writes it to register Fa. If the data is not naturally aligned, an alignment exception is generated.
4.8.3 Load S_floating Format: LDS !Memory format Fa.ws,disp.ab(Rb.
4.8.4 Load T_floating Format: LDT Fa.wt,disp.ab(Rb.ab) !Memory format Operation: va ← {Rbv + SEXT(disp)} Fa ← (va)<63:0> Exceptions: Access Violation Fault on Read Alignment Translation Not Valid Instruction mnemonics: LDT Load T_floating (Load Quadword Integer) Qualifiers: None Description: LDT fetches a quadword (integer or T_floating) from memory and writes it to register Fa. If the data is not naturally aligned, an alignment exception is generated.
4.8.5 Store F_floating
Format:
    STF Fa.rf,disp.ab(Rb.ab)    !Memory format
Operation:
    va ← {Rbv + SEXT(disp)}
    CASE
        big_endian_data: va’ ← va XOR 100₂
        little_endian_data: va’ ← va
    ENDCASE
    (va’)<31:0> ← Fav<44:29> || Fav<63:62> || Fav<58:45>
Exceptions:
    Access Violation
    Fault on Write
    Alignment
    Translation Not Valid
Instruction mnemonics:
    STF    Store F_floating
Qualifiers: None
Description:
    STF stores an F_floating datum from Fa to memory.
4.8.6 Store G_floating Format: STG Fa.rg,disp.ab(Rb.ab) !Memory format Operation: va ← {Rbv + SEXT(disp)} (va)<63:0> ← Fav<15:0> || Fav<31:16> || Fav<47:32> || Fav<63:48> Exceptions: Access Violation Fault on Write Alignment Translation Not Valid Instruction mnemonics: STG Store G_floating (Store D_floating) Qualifiers: None Description: STG stores a G_floating (or D_floating) datum from Fa to memory. If the data is not naturally aligned, an alignment exception is generated.
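Both LDG and STG reverse the order of the four 16-bit words in the quadword. A minimal Python sketch follows; the function names are hypothetical and the memory quadword is assumed to be given as an unsigned 64-bit integer.

```python
def swap_words(quadword):
    """Reverse the order of the four 16-bit words in a 64-bit value."""
    w0 = quadword & 0xFFFF
    w1 = (quadword >> 16) & 0xFFFF
    w2 = (quadword >> 32) & 0xFFFF
    w3 = (quadword >> 48) & 0xFFFF
    return (w0 << 48) | (w1 << 32) | (w2 << 16) | w3

def ldg_register_image(memory_quadword):
    """Sketch of LDG: Fa <- (va)<15:0> || (va)<31:16> || (va)<47:32> || (va)<63:48>."""
    return swap_words(memory_quadword)

def stg_memory_image(fav):
    """Sketch of STG: the same word reversal applied in the other direction."""
    return swap_words(fav)
```

Because the reversal is its own inverse, a value stored with STG and reloaded with LDG is unchanged.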
4.8.7 Store S_floating
Format:
    STS Fa.rs,disp.ab(Rb.ab)    !Memory format
Operation:
    va ← {Rbv + SEXT(disp)}
    CASE
        big_endian_data: va’ ← va XOR 100₂
        little_endian_data: va’ ← va
    ENDCASE
    (va’)<31:0> ← Fav<63:62> || Fav<58:29>
Exceptions:
    Access Violation
    Fault on Write
    Alignment
    Translation Not Valid
Instruction mnemonics:
    STS    Store S_floating (Store Longword Integer)
Qualifiers: None
Description:
    STS stores a longword (integer or S_floating) datum from Fa to memory.
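The bit selection performed by STS can be sketched in Python. The example assumes that a finite S_floating value held in a register has the same bit pattern as the T_floating encoding of that value, and uses the host's IEEE double encoding to build the register image. All function names are hypothetical.

```python
import struct

def sts_memory_image(fav):
    """Sketch of STS: (va)<31:0> <- Fav<63:62> || Fav<58:29>."""
    return (((fav >> 62) & 0x3) << 30) | ((fav >> 29) & 0x3FFF_FFFF)

def register_image(value):
    """64-bit pattern of value as an IEEE double (assumed register format)."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]

def single_image(value):
    """32-bit pattern of value as an IEEE single."""
    return struct.unpack("<I", struct.pack("<f", value))[0]
```

For a value such as 1.5 that is exactly representable in single precision, the stored longword matches the IEEE single encoding of the same value.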
4.8.8 Store T_floating Format: STT Fa.rt,disp.ab(Rb.ab) !Memory format Operation: va ← {Rbv + SEXT(disp)} (va)<63:0> ← Fav<63:0> Exceptions: Access Violation Fault on Write Alignment Translation Not Valid Instruction mnemonics: STT Store T_floating (Store Quadword Integer) Qualifiers: None Description: STT stores a quadword (integer or T_floating) datum from Fa to memory. If the data is not naturally aligned, an alignment exception is generated.
4.9 Branch Format Floating-Point Instructions Alpha provides six floating conditional branch instructions. These branch-format instructions test the value of a floating-point register and conditionally change the PC. They do not interpret the bits tested in any way; specifically, they do not trap on non-finite values. The test is based on the sign bit and whether the rest of the register is all zero bits. All 64 bits of the register are tested.
4.9.1 Conditional Branch Format: FBxx Fa.rq,disp.al !Branch format Operation: {update PC} va ← PC + {4*SEXT(disp)} IF TEST(Fav, Condition_based_on_Opcode) THEN PC ← va Exceptions: None Instruction mnemonics: FBEQ FBGE FBGT FBLE FBLT FBNE Floating Branch Equal Floating Branch Greater Than or Equal Floating Branch Greater Than Floating Branch Less Than or Equal Floating Branch Less Than Floating Branch Not Equal Qualifiers: None Description: Register Fa is tested.
Notes:
• To branch properly on non-finite operands, compare to F31, then branch on the result of the compare.
• The largest negative integer (8000 0000 0000 0000₁₆) is the same bit pattern as floating minus zero, so it is treated as equal to zero by the branch instructions. To branch properly on the largest negative integer, convert it to floating or move it to an integer register and do an integer branch.
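The second note can be illustrated with a small Python model of the branch tests. This is a sketch: the function names are hypothetical, only FBEQ and FBLT are shown, and it assumes that minus zero counts as zero for the ordered test as well as for the equality test.

```python
MASK64 = (1 << 64) - 1
SIGN_BIT = 1 << 63

def is_floating_zero(fav):
    """True for plus zero and minus zero: every bit except the sign is clear."""
    return (fav & MASK64 & ~SIGN_BIT) == 0

def fbeq_taken(fav):
    """Sketch of the FBEQ test."""
    return is_floating_zero(fav)

def fblt_taken(fav):
    """Sketch of the FBLT test: sign bit set and the value is not a zero."""
    return bool(fav & SIGN_BIT) and not is_floating_zero(fav)
```

In this model the pattern 8000 0000 0000 0000 (hex) takes the FBEQ branch and does not take the FBLT branch, whether it was intended as minus zero or as the largest negative integer.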
4.10 Floating-Point Operate Format Instructions The floating-point bit-operate instructions perform copy and integer convert operations on 64-bit register values. The bit-operate instructions do not interpret the bits moved in any way; specifically, they do not trap on non-finite values.
Table 4–16: Floating-Point Operate Instructions Summary (Continued)
    Mnemonic    Operation                           Subset
    Arithmetic Operations
    ADDF        Add F_floating                      VAX
    ADDG        Add G_floating                      VAX
    ADDS        Add S_floating                      IEEE
    ADDT        Add T_floating                      IEEE
    CMPGxx      Compare G_floating                  VAX
    CMPTxx      Compare T_floating                  IEEE
    CVTDG       Convert D_floating to G_floating    VAX
    CVTGD       Convert G_floating to D_floating    VAX
    CVTGF       Convert G_floating to F_floating    VAX
    CVTGQ       Convert G_floating to Quadword      VAX
    CVTQF       Convert Quadword to F_float
Table 4–16: Floating-Point Operate Instructions Summary (Continued)
    Mnemonic    Operation                 Subset
    Arithmetic Operations
    MULF        Multiply F_floating       VAX
    MULG        Multiply G_floating       VAX
    MULS        Multiply S_floating       IEEE
    MULT        Multiply T_floating       IEEE
    SQRTF       Square root F_floating    VAX
    SQRTG       Square root G_floating    VAX
    SQRTS       Square root S_floating    IEEE
    SQRTT       Square root T_floating    IEEE
    SUBF        Subtract F_floating       VAX
    SUBG        Subtract G_floating       VAX
    SUBS        Subtract S_floating       IEEE
    SUBT        Subtract T_flo
4.10.1 Copy Sign Format: CPYSy Fa.rq,Fb.rq,Fc.
4.10.2 Convert Integer to Integer Format: CVTxy Fb.rq,Fc.
4.10.3 Floating-Point Conditional Move Format: FCMOVxx Fa.rq,Fb.rq,Fc.
Notes: Except that it is likely in many implementations to be substantially faster, the instruction:
        FCMOVxx Fa,Fb,Fc
is exactly equivalent to:
        FByy    Fa,label
        CPYS    Fb,Fb,Fc
label:  ...
4.10.4 Move from/to Floating-Point Control Register Format: Mx_FPCR Fa.rq,Fa.rq,Fa.wq !Floating-point Operate format Operation: CASE MF_FPCR: Fa ← FPCR MT_FPCR: FPCR ← Fav ENDCASE Exceptions: None Instruction mnemonics: MF_FPCR MT_FPCR Move from Floating-point Control Register Move to Floating-point Control Register Qualifiers: None Description: The Floating-point Control Register (FPCR) is read from (MF_FPCR) or written to (MT_FPCR), a floating-point register.
4.10.5 VAX Floating Add
Format:
    ADDx Fa.rx,Fb.rx,Fc.wx    !Floating-point Operate format
Operation:
    Fc ← Fav + Fbv
Exceptions:
    Invalid Operation
    Overflow
    Underflow
Instruction mnemonics:
    ADDF    Add F_floating
    ADDG    Add G_floating
Qualifiers:
    Rounding: Chopped (/C)
    Trapping: Exception Completion (/S)
              Underflow Enable (/U)
Description:
    Register Fa is added to register Fb, and the sum is written to register Fc.
4.10.6 IEEE Floating Add
Format:
    ADDx Fa.rx,Fb.rx,Fc.wx    !Floating-point Operate format
Operation:
    Fc ← Fav + Fbv
Exceptions:
    Invalid Operation
    Overflow
    Underflow
    Inexact Result
Instruction mnemonics:
    ADDS    Add S_floating
    ADDT    Add T_floating
Qualifiers:
    Rounding: Dynamic (/D)
              Minus infinity (/M)
              Chopped (/C)
    Trapping: Exception Completion (/S)
              Underflow Enable (/U)
              Inexact Enable (/I)
Description:
    Register Fa is added to register Fb, and the sum is written to register Fc.
4.10.7 VAX Floating Compare
Format:
    CMPGyy Fa.rg,Fb.rg,Fc.wq    !Floating-point Operate format
Operation:
    IF Fav SIGNED_RELATION Fbv THEN
        Fc ← 4000 0000 0000 0000₁₆
    ELSE
        Fc ← 0000 0000 0000 0000₁₆
Exceptions:
    Invalid Operation
Instruction mnemonics:
    CMPGEQ    Compare G_floating Equal
    CMPGLE    Compare G_floating Less Than or Equal
    CMPGLT    Compare G_floating Less Than
Qualifiers:
    Trapping: Exception Completion (/S)
Description:
    The two operands in Fa and Fb are compared.
4.10.8 IEEE Floating Compare
Format:
    CMPTyy Fa.rx,Fb.rx,Fc.wq    !Floating-point Operate format
Operation:
    IF Fav SIGNED_RELATION Fbv THEN
        Fc ← 4000 0000 0000 0000₁₆
    ELSE
        Fc ← 0000 0000 0000 0000₁₆
Exceptions:
    Invalid Operation
Instruction mnemonics:
    CMPTEQ    Compare T_floating Equal
    CMPTLE    Compare T_floating Less Than or Equal
    CMPTLT    Compare T_floating Less Than
    CMPTUN    Compare T_floating Unordered
Qualifiers:
    Trapping: Exception Completion (/SU)
Description:
    The two operands in Fa and Fb are compared.
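The "true" result written by both compare instructions is a non-zero floating value, so it can be tested directly with a floating branch. The Python sketch below decodes that pattern under each format. The function names are hypothetical. The G_floating decoder assumes the usual VAX layout (sign, excess-1024 exponent, hidden-bit fraction of the form 0.1f) and handles only finite, nonzero inputs.

```python
import struct

COMPARE_TRUE = 0x4000_0000_0000_0000

def as_t_floating(bits):
    """Decode a register image as T_floating (IEEE double)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def as_g_floating(bits):
    """Decode a finite, nonzero G_floating register image (assumed layout)."""
    sign = -1.0 if bits >> 63 else 1.0
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    # The hidden bit makes the significand 0.1f in binary, that is 0.5 + f/2.
    return sign * (0.5 + fraction / (1 << 53)) * 2.0 ** (exponent - 1024)
```

Under these assumptions the same bit pattern reads as 2.0 in T_floating and as 0.5 in G_floating; in both cases it is a positive, non-zero value.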
4.10.9 Convert VAX Floating to Integer Format: CVTGQ Fb.rx,Fc.wq !Floating-point Operate format Operation: Fc ← {conversion of Fbv} Exceptions: Invalid Operation Integer Overflow Instruction mnemonics: CVTGQ Convert G_floating to Quadword Qualifiers: Rounding: Trapping: Chopped (/C) Exception Completion (/S) Integer Overflow Enable (/V) Description: The floating operand in register Fb is converted to a two’s-complement quadword number and written to register Fc.
4.10.10 Convert Integer to VAX Floating Format: CVTQy Fb.rq,Fc.wx !Floating-point Operate format Operation: Fc ← {conversion of Fbv<63:0>} Exceptions: None Instruction mnemonics: CVTQF CVTQG Convert Quadword to F_floating Convert Quadword to G_floating Qualifiers: Rounding: Chopped (/C) Description: The two’s-complement quadword operand in register Fb is converted to a single- or double-precision floating result and written to register Fc.
4.10.11 Convert VAX Floating to VAX Floating Format: CVTxy Fb.rx,Fc.
4.10.12 Convert IEEE Floating to Integer Format: CVTTQ Fb.rx,Fc.
4.10.13 Convert Integer to IEEE Floating Format: CVTQy Fb.rq,Fc.
4.10.14 Convert IEEE S_Floating to IEEE T_Floating Format: CVTST Fb.rx,Fc.wx ! Floating-point Operate format Operation: Fc ← {conversion of Fbv} Exceptions: Invalid Operation Instruction mnemonics: CVTST Convert S_floating to T_floating Qualifiers: Trapping: Exception Completion (/S) Description: The S_floating operand in register Fb is converted to T_floating format and written to register Fc. Register Fa must be F31. Notes: • The conversion from S_floating to T_floating is exact.
4.10.15 Convert IEEE T_Floating to IEEE S_Floating Format: CVTTS Fb.rx,Fc.
4.10.16 VAX Floating Divide Format: DIVx Fa.rx,Fb.rx,Fc.wx !Floating-point Operate format Operation: Fc ← Fav / Fbv Exceptions: Invalid Operation Division by Zero Overflow Underflow Instruction mnemonics: DIVF DIVG Divide F_floating Divide G_floating Qualifiers: Rounding: Trapping: Chopped (/C) Exception Completion (/S) Underflow Enable (/U) Description: The dividend operand in register Fa is divided by the divisor operand in register Fb and the quotient is written to register Fc.
4.10.17 IEEE Floating Divide Format: DIVx Fa.rx,Fb.rx,Fc.
4.10.18 Floating-Point Register to Integer Register Move Format: FTOIx Fa.rq,Rc.wq !Floating-point Operate format Operation: CASE: FTOIS: Rc<63:32> ← SEXT(Fav<63>) Rc<31:0> ← Fav<63:62> || Fav<58:29> FTOIT: Rc ← Fav ENDCASE Exceptions: None Instruction mnemonics: FTOIS FTOIT Floating-point to Integer Register Move, S_floating Floating-point to Integer Register Move, T_floating Qualifiers: None Description: Data in a floating-point register file is moved to an integer register file.
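The bit selection in the FTOIS case can be sketched in Python as follows. This is a minimal model of the Operation section only, treating register values as unsigned 64-bit integers; the function names `ftois` and `ftoit` are chosen here for illustration.

```python
MASK64 = (1 << 64) - 1


def ftois(fav):
    """Model of FTOIS: build Rc from selected bits of the 64-bit value Fav."""
    sign = (fav >> 63) & 0x1            # Fav<63>
    top2 = (fav >> 62) & 0x3            # Fav<63:62>
    low30 = (fav >> 29) & 0x3FFFFFFF    # Fav<58:29>
    rc_low = (top2 << 30) | low30       # Rc<31:0> = Fav<63:62> || Fav<58:29>
    rc_high = 0xFFFFFFFF if sign else 0  # Rc<63:32> = SEXT(Fav<63>)
    return ((rc_high << 32) | rc_low) & MASK64


def ftoit(fav):
    """Model of FTOIT: Rc receives Fav unchanged."""
    return fav & MASK64


if __name__ == "__main__":
    # 0x3FF0000000000000 is the 64-bit register image of 1.0.
    print(hex(ftois(0x3FF0000000000000)))
```

For the register image of 1.0 the low longword comes out as 0x3F800000, which is the 32-bit IEEE single-precision encoding of 1.0.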
4.10.19 Integer Register to Floating-Point Register Move Format: ITOFx Ra.rq,Fc.
ITOFS is exactly equivalent to the sequence: STL LDS ITOFT is exactly equivalent to the sequence: STQ LDT Software Note: ITOFF, ITOFS, and ITOFT are no slower than the corresponding store/load sequence and can be significantly faster.
4.10.20 VAX Floating Multiply Format: MULx Fa.rx,Fb.rx,Fc.wx !Floating-point Operate format Operation: Fc ← Fav * Fbv Exceptions: Invalid Operation Overflow Underflow Instruction mnemonics: MULF MULG Multiply F_floating Multiply G_floating Qualifiers: Rounding: Trapping: Chopped (/C) Exception Completion (/S) Underflow Enable (/U) Description: The multiplicand operand in register Fb is multiplied by the multiplier operand in register Fa and the product is written to register Fc.
4.10.21 IEEE Floating Multiply Format: MULx Fa.rx,Fb.rx,Fc.
4.10.22 VAX Floating Square Root Format: SQRTx Fb.rx,Fc.wx !Floating-point Operate format Operation: Fc ← Fb ** (1/2) Exceptions: Invalid operation Instruction mnemonics: SQRTF SQRTG Square root F_floating Square root G_floating Qualifiers: Rounding: Trapping: Chopped (/C) Exception Completion (/S) Underflow Enable (/U) — See Notes below Description: The square root of the floating-point operand in register Fb is written to register Fc.
4.10.23 IEEE Floating Square Root Format: SQRTx Fb.rx,Fc.
4.10.24 VAX Floating Subtract Format: SUBx Fa.rx,Fb.rx,Fc.wx !Floating-point Operate format Operation: Fc ← Fav - Fbv Exceptions: Invalid Operation Overflow Underflow Instruction mnemonics: SUBF SUBG Subtract F_floating Subtract G_floating Qualifiers: Rounding: Trapping: Chopped (/C) Exception Completion (/S) Underflow Enable (/U) Description: The subtrahend operand in register Fb is subtracted from the minuend operand in register Fa and the difference is written to register Fc.
4.10.25 IEEE Floating Subtract Format: SUBx Fa.rx,Fb.rx,Fc.
4.11 Miscellaneous Instructions Alpha provides the miscellaneous instructions shown in Table 4–17.
4.11.1 Architecture Mask Format: AMASK Rb.rq,Rc.wq !Operate format AMASK #b.ib,Rc.wq !Operate format Operation: Rc ← Rbv AND {NOT CPU_feature_mask} Exceptions: None Instruction mnemonics: AMASK Architecture Mask Qualifiers: None Description: Rbv represents a mask of the requested architectural extensions. Bits are cleared that correspond to architectural extensions that are present. Reserved bits and bits that correspond to absent extensions are copied unchanged.
• On 21164A (EV56), 21164PC (PCA56), and 21264 (EV6), AMASK correctly indicates support for architecture extensions by copying Rbv to Rc and clearing appropriate bits. Bits are assigned and placed in Appendix D for architecture extensions as ECOs for those extensions are passed. The low 8 bits are reserved for standard architecture extensions so they can be tested with a literal; application-specific extensions are assigned from bit 8 upward.
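The AMASK operation (Rc ← Rbv AND {NOT CPU_feature_mask}) can be modeled in a few lines of Python. This is a sketch under stated assumptions: the feature mask values used below are invented for illustration and do not correspond to the bit assignments in Appendix D, and `has_extension` is a hypothetical helper.

```python
MASK64 = (1 << 64) - 1


def amask(rbv, cpu_feature_mask):
    """Clear the requested bits whose extensions are present on this CPU."""
    return rbv & ~cpu_feature_mask & MASK64


def has_extension(bit, cpu_feature_mask):
    """Hypothetical helper: an extension is present if AMASK clears its bit."""
    return amask(1 << bit, cpu_feature_mask) == 0


if __name__ == "__main__":
    # Illustrative mask only: pretend extensions 0 and 2 are implemented.
    example_cpu_mask = 0b101
    print(bin(amask(0b111, example_cpu_mask)))
    print(has_extension(0, example_cpu_mask), has_extension(1, example_cpu_mask))
```

Software tests for a feature by requesting its bit and checking that the bit comes back as zero; a bit that is returned unchanged means the extension is absent or the bit is reserved.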
4.11.2 Call Privileged Architecture Library Format: CALL_PAL fnc.ir !PAL format Operation: {Stall instruction issuing until all prior instructions are guaranteed to complete without incurring exceptions.} {Trap to PALcode.} Exceptions: None Instruction mnemonics: CALL_PAL Call Privileged Architecture Library Qualifiers: None Description: The CALL_PAL instruction is not issued until all previous instructions are guaranteed to complete without exceptions.
4.11.3 Evict Data Cache Block Format: ECB (Rb.ab) ! Memory format Operation: va ← Rbv IF { va maps to memory space } THEN Prepare to reuse cache resources that are occupied by the addressed byte. END Exceptions: None Instruction mnemonics: ECB Evict Cache Block Qualifiers: None Description: The ECB instruction provides a hint that the addressed location will not be referenced again in the near future, so any cache space it occupies should be made available to cache other memory locations.
• ECB is not intended for flushing caches prior to power failure or low power operation — CFLUSH is intended for that purpose. Implementation Note: Implementations with set-associative caches are encouraged to update their allocation pointer so that the next D-stream reference that misses the cache and maps to this line is allocated into the vacated set.
4.11.4 Exception Barrier Format: EXCB !Memory format Operation: {EXCB does not appear to issue until completion of all exceptions and dependencies on the Floating-point Control Register (FPCR) from prior instructions.
4.11.5 Prefetch Data Format: FETCHx 0(Rb.ab) !Memory format Operation: va ← {Rbv} {Optionally prefetch aligned 512-byte block surrounding va.} Exceptions: None Instruction mnemonics: FETCH FETCH_M Prefetch Data Prefetch Data, Modify Intent Qualifiers: None Description: The virtual address is given by Rbv. This address is used to designate an aligned 512-byte block of data.
No exceptions are generated by FETCHx. If a Load (or Store in the case of FETCH_M) that uses the same address would fault, the prefetch request is ignored. It is UNPREDICTABLE whether a TB-miss fault is ever taken by FETCHx. Implementation Note: Implementations are encouraged to take the TB-miss fault, then continue the prefetch.
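The aligned 512-byte block that FETCHx designates can be derived from the virtual address by clearing its low nine bits. The following Python sketch shows that arithmetic only; it does not model prefetching, and the function name `fetch_block` is an assumption made for this example.

```python
FETCH_BLOCK_BYTES = 512  # block size stated in the FETCHx description


def fetch_block(va):
    """Return (first, last) byte address of the aligned 512-byte block around va."""
    base = va & ~(FETCH_BLOCK_BYTES - 1)
    return base, base + FETCH_BLOCK_BYTES - 1


if __name__ == "__main__":
    first, last = fetch_block(0x1234)
    print(hex(first), hex(last))
```

Any address inside the same block yields the same pair, so a loop that walks an array needs at most one FETCHx per 512 bytes traversed.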
4.11.6 Implementation Version Format: IMPLVER Rc !Operate format Operation: Rc ← value, which is defined in Appendix D Exceptions: None Instruction mnemonics: IMPLVER Implementation Version Description: A small integer is placed in Rc that specifies the major implementation version of the processor on which it is executed.
4.11.7 Memory Barrier Format: MB !Memory format Operation: {Guarantee that all subsequent loads or stores will not access memory until after all previous loads and stores have accessed memory, as observed by other processors.} Exceptions: None Instruction mnemonics: MB Memory Barrier Qualifiers: None Description: The use of the Memory Barrier (MB) instruction is required only in multiprocessor systems.
4.11.8 Read Processor Cycle Counter Format: RPCC Ra.wq !Memory format Operation: Ra ← {cycle counter} Exceptions: None Instruction mnemonics: RPCC Read Processor Cycle Counter Qualifiers: None Description: Register Ra is written with the processor cycle counter (PCC). The PCC register consists of two 32-bit fields. The low-order 32 bits (PCC<31:0>) are an unsigned, wrapping counter, PCC_CNT. The high-order 32 bits (PCC<63:32>), PCC_OFF, are operating-system dependent in their implementation.
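Because PCC_CNT is a 32-bit wrapping counter, an interval between two RPCC readings is best computed modulo 2^32. The Python sketch below shows the field extraction and the wrap-safe subtraction. It assumes the counter wrapped at most once between the two readings and ignores PCC_OFF, whose meaning is operating-system dependent. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def pcc_cnt(pcc):
    """Extract PCC<31:0>, the unsigned wrapping cycle counter."""
    return pcc & MASK32


def pcc_off(pcc):
    """Extract PCC<63:32>, the operating-system dependent field."""
    return (pcc >> 32) & MASK32


def elapsed_cycles(pcc_start, pcc_end):
    """Cycles between two readings, assuming at most one wrap of PCC_CNT."""
    return (pcc_cnt(pcc_end) - pcc_cnt(pcc_start)) & MASK32


if __name__ == "__main__":
    # The counter wraps from 0xFFFFFFF0 through zero to 0x10.
    print(elapsed_cycles(0xFFFFFFF0, 0x00000010))
```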
4.11.9 Trap Barrier Format: TRAPB !Memory format Operation: {TRAPB does not appear to issue until all prior instructions are guaranteed to complete without causing any arithmetic traps}.
4.11.10 Write Hint Format: WH64 (Rb.ab) !Memory format Operation: va ← Rbv IF { va maps to memory space } THEN Write UNPREDICTABLE data to the aligned 64-byte region containing the addressed byte. END Exceptions: None Instruction mnemonics: WH64 Write Hint - 64 Bytes Qualifiers: None Description: The WH64 instruction provides a hint that the current contents of the aligned 64-byte block containing the addressed byte will never be read again but will be overwritten in the near future.
Implementation Note: If the 64-byte region containing the addressed byte is not in the data cache, implementations are encouraged to allocate the region in the data cache without first reading it from memory. However, if any of the addressed bytes exist in the caches of other processors, they must be kept coherent with respect to those processors. Processors with cache blocks smaller than 64 bytes are encouraged to implement WH64 as defined.
4.11.11 Write Memory Barrier Format: WMB !Memory format Operation: {Guarantee that all preceding stores that access memory-like regions are ordered before any subsequent stores that access memory-like regions, and all preceding stores that access non-memory-like regions are ordered before any subsequent stores that access non-memory-like regions.}
The WMB instruction is the preferred method for providing high-bandwidth write streams where order must be preserved between writes in that stream. Notes: WMB is useful for ordering streams of writes to a non-memory-like region, such as to memory-mapped control registers or to a graphics frame buffer. While both MB and WMB can ensure that writes to a non-memory-like region occur in order, without being aggregated or reordered, the WMB is usually faster and is never slower than MB.
4.12 VAX Compatibility Instructions Alpha provides the instructions shown in Table 4–18 for use in translated VAX code. These instructions are not a permanent part of the architecture and will not be available in some future implementations. They are intended to preserve customer assumptions about VAX instruction atomicity in porting code from VAX to Alpha. These instructions should be generated only by the VAX-to-Alpha software translator; they should never be used in native Alpha code.
4.12.1 VAX Compatibility Instructions Format: Rx Ra.wq !Memory format Operation: Ra ← intr_flag intr_flag ← 0 !RC intr_flag ← 1 !RS Exceptions: None Instruction mnemonics: RC RS Read and Clear Read and Set Qualifiers: None Description: The intr_flag is returned in Ra and then cleared to zero (RC) or set to one (RS).
4.
4.13.1 Byte and Word Minimum and Maximum Format: MINxxx Ra.rq,Rb.rq,Rc.wq Ra.rq,#b.ib,Rc.wq ! Operate Format MAXxxx Ra.rq,Rb.rq,Rc.wq Ra.rq,#b.ib,Rc.
Instruction mnemonics:
MINUB8   Vector Unsigned Byte Minimum
MINSB8   Vector Signed Byte Minimum
MINUW4   Vector Unsigned Word Minimum
MINSW4   Vector Signed Word Minimum
MAXUB8   Vector Unsigned Byte Maximum
MAXSB8   Vector Signed Byte Maximum
MAXUW4   Vector Unsigned Word Maximum
MAXSW4   Vector Signed Word Maximum
Qualifiers: None
Description: For MINxB8, each byte of Rc is written with the smaller of the corresponding bytes of Ra or Rb. The bytes may be interpreted as signed or unsigned values.
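The byte-wise minimum described for MINxB8 can be sketched in Python as below. This is an illustrative model, not a definition: registers are treated as unsigned 64-bit integers, byte 0 is taken to be the least significant byte, and the helper names (`_lanes`, `_signed`, `_pack`) are invented for the example.

```python
def _lanes(q):
    """Split a 64-bit value into its eight bytes, least significant first."""
    return [(q >> (8 * i)) & 0xFF for i in range(8)]


def _signed(b):
    """Interpret a byte as a two's-complement signed value."""
    return b - 256 if b & 0x80 else b


def _pack(lanes):
    """Reassemble bytes (least significant first) into a 64-bit value."""
    out = 0
    for i, b in enumerate(lanes):
        out |= (b & 0xFF) << (8 * i)
    return out


def minub8(ra, rb):
    """Model of MINUB8: per-byte minimum, bytes compared as unsigned."""
    return _pack(min(a, b) for a, b in zip(_lanes(ra), _lanes(rb)))


def minsb8(ra, rb):
    """Model of MINSB8: per-byte minimum, bytes compared as signed."""
    return _pack(min(a, b, key=_signed) for a, b in zip(_lanes(ra), _lanes(rb)))


if __name__ == "__main__":
    print(hex(minub8(0x7F, 0x80)), hex(minsb8(0x7F, 0x80)))
```

The pair 0x7F and 0x80 shows why the signed and unsigned forms differ: as unsigned bytes 0x7F is smaller, but as signed bytes 0x80 is -128 and therefore the minimum.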
4.13.2 Pixel Error Format: PERR Ra.rq,Rb.rq,Rc.wq ! Operate Format Operation: temp = 0 FOR i FROM 0 TO 7 IF { Rav<i*8+7:i*8> GEU Rbv<i*8+7:i*8> } THEN temp ← temp + (Rav<i*8+7:i*8> - Rbv<i*8+7:i*8>) ELSE temp ← temp + (Rbv<i*8+7:i*8> - Rav<i*8+7:i*8>) END Rc ← temp Exceptions: None Instruction mnemonics: PERR Pixel Error Qualifiers: None Description: The absolute value of the difference between each of the bytes in Ra and Rb is calculated. The sum of the resulting bytes is written to Rc.
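The PERR computation is a sum of absolute differences over the eight byte lanes of the two operands. A minimal Python sketch, assuming unsigned 64-bit register values and unsigned byte comparison as in the Operation section, might look like this; the function name `perr` simply mirrors the mnemonic.

```python
def perr(ra, rb):
    """Sum of absolute differences of the eight unsigned bytes of ra and rb."""
    total = 0
    for i in range(8):
        a = (ra >> (8 * i)) & 0xFF   # byte i of Ra
        b = (rb >> (8 * i)) & 0xFF   # byte i of Rb
        total += a - b if a >= b else b - a
    return total


if __name__ == "__main__":
    # Byte 0: |0x05 - 0x0F| = 10, byte 1: |0x0A - 0x03| = 7.
    print(perr(0x0A05, 0x030F))
```

The largest possible result is 8 × 255 = 2040, so the sum always fits comfortably in the destination register.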
4.13.3 Pack Bytes Format: PKxB Rb.rq,Rc.
4.13.4 Unpack Bytes Format: UNPKBx Rb.rq,Rc.
Chapter 5 System Architecture and Programming Implications 5.1 Introduction Portions of the Alpha architecture have implications for programming, and the system structure, of both uniprocessor and multiprocessor implementations.
Memory coherency may be provided in different ways for each of the four physical address regions. Possible per-region policies include, but are not restricted to: • No caching No copies are kept of data in a region; all reads and writes access the actual data location (memory or I/O register), but a processor may elide multiple accesses to the same data (see Section 5.2.3).
For a byte access region, accesses to physical memory must be implemented such that independent accesses to adjacent bytes or adjacent aligned words produce the same results, regardless of the order of execution. Further, an access to a byte, an aligned word, an aligned longword, or an aligned quadword must be done in a single atomic operation.
• Address ranges may overlap, such that a write to one location changes the bits read from a different location. • Reads may have side effects, although this is strongly discouraged. • Longword granularity need not be supported and, even if the byte/word extension is implemented, byte access granularity need not be implemented. • Instruction-fetch need not be supported. • Load-locked and store-conditional need not be supported.
The following requirements must be met by all cache/write-buffer implementations. All processors must provide a coherent view of memory. • Write buffers may be used to delay and aggregate writes. From the viewpoint of another processor, buffered writes appear not to have happened yet. (Write buffers must not delay writes indefinitely. See Section 5.6.1.9.) • Write-back caches must be able to detect a later write from another processor and invalidate or update the cache contents.
5.5 Data Sharing In a multiprocessor environment, writes to shared data must be synchronized by the programmer. 5.5.1 Atomic Change of a Single Datum The ordinary STL and STQ instructions can be used to perform an atomic change of a shared aligned longword or quadword. ("Change" means that the new value is not a function of the old value.) In particular, an ordinary STL or STQ instruction can be used to change a variable that could be simultaneously accessed via an LDx_L/STx_C sequence. 5.5.
This load-locked/store-conditional paradigm may be used whenever an atomic update of a shared aligned quadword is desired, including getting the effect of atomic byte writes. 5.5.3 Atomic Update of Data Structures Before accessing shared writable data structures (those that are not a single aligned longword or quadword), the programmer can acquire control of the data structure by using an atomic update to set a software lock variable. Such a software lock can be cleared with an ordinary store instruction.
• Both conditional branches are forward branches, so they are properly predicted not to be taken (to match the common case of no contention for the lock). • The OR writes its result to a second register; this allows the OR and the BLBS to be interchanged if that would give a faster instruction schedule. • Other operate instructions (from the critical section) may be scheduled into the LDQ_L..
5.5.4 Ordering Considerations for Shared Data Structures A critical section sequence, such as shown in Section 5.5.3, is conceptually only three steps: 1. Acquire software lock 2. Critical section — read/write shared data 3. Clear software lock In the absence of explicit instructions to the contrary, the Alpha architecture allows reads and writes to be reordered. While this may allow more implementation speed and overlap, it can also create undesired side effects on shared data structures.
an ADDL2 to update a variable that is shared between a "MAIN" routine and an AST routine, if running on a single processor. In the Alpha architecture, a programmer must deal with AST shared data by using multiprocessor shared data sequences. 5.6 Read/Write Ordering This section applies to programs that run on multiple processors or on one or more processors that are interacting with DMA I/O devices.
In most systems, DMA I/O devices or other agents can read or write shared memory locations. The order of accesses by those agents is not completely specified in this document. It is possible in some systems for read accesses by I/O devices or other agents to give results indicating some reordering of accesses. However, there are guarantees that apply in all systems. See Section 5.6.4.7. A shared memory is the primary storage place for one or more locations.
there is at least one byte that is accessed by both, that is, if max(x,y) < min(x+m,y+n). 5.6.1.1 Architectural Definition of Processor Issue Sequence The issue sequence for a processor is architecturally defined with respect to a hypothetical simple implementation that contains one processor and a single shared memory, with no caches or buffers. This is the instruction execution model: 1. I-fetch: An Alpha instruction is fetched from memory. 2.
Table 5–1: Processor Issue Constraints 1st↓ 2nd → Pi:I(y,b) Pi:I(x,a) ⇐ if overlap Pi:R(y,b) Pi:W(y,b) Pi:MB Pi:IMB ⇐ if overlap ⇐ ⇐ ⇐ if overlap ⇐ ⇐ ⇐ if overlap ⇐ ⇐ ⇐ ⇐ ⇐ ⇐ ⇐ ⇐ ⇐ ⇐ ⇐ if overlap Pi:R(x,a) Pi:W(x,a) Pi:MB Pi:IMB ⇐ Where "overlap" denotes the condition max(x,y) < min(x+m,y+n). For two accesses u and v issued by processor Pi, if u precedes v by processor issue constraint, then u precedes v in BEFORE order.
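The overlap condition max(x,y) < min(x+m,y+n) used in these constraints can be checked directly. The following Python sketch treats x and y as starting byte addresses and m and n as access sizes in bytes; the function name `overlaps` is chosen for this example only.

```python
def overlaps(x, m, y, n):
    """True if the m-byte access at x and the n-byte access at y share a byte."""
    return max(x, y) < min(x + m, y + n)


if __name__ == "__main__":
    print(overlaps(0, 8, 4, 4))   # quadword at 0 and longword at 4
    print(overlaps(0, 4, 4, 4))   # two adjacent longwords
```

Adjacent accesses that merely touch at a boundary, such as longwords at addresses 0 and 4, do not overlap, so the "if overlap" entries place no ordering constraint on them.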
5.6.1.4 Definition of Location Access Constraints Location access constraints are imposed on overlapping read/write accesses. If u and v are overlapping read/write accesses, at least one of which is a write, then u and v must be comparable in the BEFORE (⇐ ) ordering, that is, either u ⇐ v or v ⇐ u. There is no direct requirement that nonoverlapping accesses be comparable in the BEFORE (⇐ ) ordering.
and accesses byte z; then the value of byte z read by v is exactly the value written by u. In this situation, u is a source of v. The only way to communicate information between different processors is for one to write a shared location and the other to read the shared location and receive the newly written value. (In this context, the sending of an interrupt from processor Pi to Pj is modeled as Pi writing to a location INTij, and Pj reading from INTij.) 5.6.1.
Representing those code sequences in the style of the litmus tests in Section 5.6.2, it is impossible for the following sequence to result:
Pi                      Pj
[U1] Pi:R<8>(x,0)       [V1] Pj:R<8>(y,0)
[U2] Pi:W<8>(y,0)       [V2] Pj:W<8>(x,0)
Analysis:
<1> By the definitions of storage and visibility, U2 is the source of V1, and V2 is the source of U1.
<2> By the definition of DP and examination of the code, U1 DP U2, and V1 DP V2.
<3> Thus, U1 DP U2, U2 is the source of V1, V1 DP V2, and V2 is the source of U1.
5.6.1.9 Timeliness Even in the absence of a barrier after the write, no write by a processor may be delayed indefinitely in the BEFORE ordering. 5.6.2 Litmus Tests Many issues about writing and reading shared data can be cast into questions about whether a write is before or after a read. These questions can be answered by rigorously checking whether any ordering satisfies the rules in Sections 5.6.1.3 through 5.6.1.8. In litmus tests 1–9 below, all initial quadword memory locations contain 1.
5.6.2.2 Litmus Test 2 (Impossible Sequence)
Initially, location x contains 1:
Pi                      Pj
[U1]Pi:W<8>(x,2)        [V1]Pj:W<8>(x,3)
                        [V2]Pj:R<8>(x,2)
                        [V3]Pj:R<8>(x,3)
Analysis:
<1> Since V1 precedes V2 in processor issue sequence, V1 is visible to V2.
<2> V2 reading 2 implies U1 is the latest (in ⇐ order) write to x visible to V2.
<3> From <1> and <2>, V1 ⇐ U1.
<4> Since U1 is visible to V2, and they are issued by different processors, U1 ⇐ V2.
<5> By the processor issue constraints, V2 ⇐ V3.
5.6.2.4 Litmus Test 4 (Sequence Okay)
Initially, locations x and y contain 1:
Pi                      Pj
[U1]Pi:W<8>(x,2)        [V1]Pj:R<8>(y,2)
[U2]Pi:W<8>(y,2)        [V2]Pj:R<8>(x,1)
Analysis:
<1> V1 reading 2 implies U2 ⇐ V1, by storage and visibility.
<2> Since V2 does not read 2, there cannot be U1 ⇐ V2.
<3> By the access order constraints, it follows from <2> that V2 ⇐ U1.
There are no conflicts in the sequence. There are no violations of the definition of BEFORE.
5.6.2.
There is V2 ⇐ U1 ⇐ U2 ⇐ U3 ⇐ V1. There are no conflicts in this sequence. There are no violations of the definition of BEFORE. In litmus tests 4, 5, and 6, writes to two different locations x and y are observed (by another processor) to occur in the opposite order from that in which they were performed. An update to y propagates quickly to Pj, but the update to x is delayed, and Pi and Pj do not both have MBs. 5.6.2.
5.6.2.9 Litmus Test 9 (Impossible Sequence)
Initially, location x contains 1:
Pi                      Pj
[U1]Pi:W<8>(x,2)        [V1]Pj:W<8>(x,3)
[U2]Pi:R<8>(x,2)        [V2]Pj:R<8>(x,3)
[U3]Pi:R<8>(x,3)        [V3]Pj:R<8>(x,2)
Analysis:
<1> V3 reading 2 implies U1 is the latest write to x visible to V3, therefore V1 ⇐ U1.
<2> U3 reading 3 implies V1 is the latest write to x visible to U3, therefore U1 ⇐ V1.
Both <1> and <2> cannot be true. Time cannot go backwards. If V3 reads 2, then U3 must read 2.
5.6.3 Implied Barriers There are no implied barriers in Alpha. If an implied barrier is needed for functionally correct access to shared data, it must be written as an explicit instruction. (Software must explicitly include any needed MB, WMB, or CALL_PAL IMB instructions.
is communicated through just one location in memory, memory barriers are not necessary. Software Note: Note that this section does not describe how to reliably communicate data from a processor to a DMA device. See Section 5.6.4.7. Leaving out the first MB removes the assurance that the shared data is written before the flag is written.
This implies that after a DMA I/O device has written some I-stream to memory (such as paging in a page from disk), the DMA device must logically execute an MB before posting a completion interrupt, and the interrupt handler software must execute a CALL_PAL IMB before the I-stream is guaranteed to be visible to the interrupted processor. Other processors must also execute CALL_PAL IMB instructions before they are guaranteed to see the new I-stream.
MB [1] ensures that the writes done to save the state of the current process happen before the ownership is passed. MB [2] ensures that the reads done to load the state of the new process happen after the ownership is picked up and hence are reliably the values written by the processor saving the old state. Leaving this MB out makes the code fail if an old value of the context remains in the second processor’s cache and invalidates from the writes done on the first processor are not delivered soon enough.
First Processor:
  :
  Pick up ownership of process context data structure memory.
  MB
  Assign new ASN or invalidate TBs.
  Save state of current process.
  Restore state of new process.
  MB
  Pass ownership of process context data structure memory. ⇒ (to Second Processor)
  :
Second Processor:
  :
  Pickup ownership of new process context data structure memory.
  MB
  Assign new ASN or invalidate TBs.
  Save state of current process.
  Restore state of new process.
  MB
  Pass ownership of old process context data structure memory.
First Processor         Second Processor
:
Write data
MB
Send interrupt    ⇒     Receive interrupt
                        MB
                        Access data
                        :
Leaving out the MB at the beginning of the interrupt-receipt routine causes the code to fail if an old value of the context remains in the second processor’s cache, and invalidates from the writes done on the first processor are not delivered soon enough. 5.6.4.7 Implications for Memory Mapped I/O Sections 5.6.4.3 and 5.6.4.
will detect the writes of the shared data before detecting the flag write, interrupt, or device register write. This implies that after a processor has prepared a data buffer to be read from memory by a DMA I/O device (such as writing a buffer to disk), the processor must execute an MB before starting the I/O. The I/O device, after receiving the start signal, must logically execute an MB before reading the data buffer, and the buffer must be located in a memory-like physical memory region.
The MB on the first processor guarantees that the write to CSR_A precedes the write to flag in memory, as perceived on other processors. (The MB does not guarantee that the write to CSR_A has completed. See Section 5.6.4.7 for a discussion of how a processor can guarantee that a write to an I/O device has completed at that device.) The MB on the second processor guarantees that the write to CSR_B will reach the I/O device after the write to CSR_A. 5.6.
5.7 Arithmetic Traps Alpha implementations are allowed to execute multiple instructions concurrently and to forward results from one instruction to another. Thus, when an arithmetic trap is detected, the PC may have advanced an arbitrarily large number of instructions past the instruction T (calculating result R) whose execution triggered the trap. When the trap is detected, any or all of these subsequent instructions may run to completion before the trap is actually taken.
Chapter 6 Common PALcode Architecture 6.1 PALcode In a family of machines, both users and operating system developers require functions to be implemented consistently. When functions conform to a common interface, the code that uses those functions can be used on several different implementations without modification. These functions range from the binary encoding of the instruction and data to the exception mechanisms and synchronization primitives.
The Alpha architecture lets these functions be implemented in standard machine code that is resident in main memory. PALcode is written in standard machine code with some implementation-specific extensions to provide access to low-level hardware. This lets an Alpha implementation make various design trade-offs based on the hardware technology being used to implement the machine. The PALcode can abstract these differences and make them invisible to system software.
• PALcode needs a hardware mechanism to transition the machine from the PALcode environment to the non-PALcode environment. This mechanism loads the PC, enables interrupts, enables mapping, and disables PALcode privileges. An Alpha implementation may also choose to provide additional functions to simplify or improve performance of some PALcode functions.
PALcode should be written modularly to facilitate the easy replacement or conditional building of each component. Such a practice simplifies the integration of CPU hardware, system platform hardware, console firmware, operating system software, and compilers.
The PALcode instructions listed in Table 6–2 and described in the following sections must be supported by all Alpha implementations:
Table 6–2: Required PALcode Instructions
Mnemonic   Type           Operation
DRAINA     Privileged     Drain aborts
HALT       Privileged     Halt processor
IMB        Unprivileged   I-stream memory barrier
6.7.1 Drain Aborts Format: CALL_PAL DRAINA !PALcode format Operation: IF PS<CM> NE 0 THEN {privileged instruction exception} {Stall instruction issuing until all prior instructions are guaranteed to complete without incurring aborts.
6.7.
6.7.3 Instruction Memory Barrier Format: CALL_PAL IMB !PALcode format Operation: {Make instruction stream coherent with data stream} Exceptions: None Instruction mnemonics: CALL_PAL IMB I-stream Memory Barrier Description: An IMB instruction must be executed after software or I/O devices write into the instruction stream or modify the instruction stream virtual address mapping, and before the new value is fetched as an instruction.
Chapter 7 Console Subsystem Overview On an Alpha system, underlying control of the system platform hardware is provided by a console subsystem. The console subsystem: • Initializes, tests, and prepares the system platform hardware for Alpha system software. • Bootstraps (loads into memory and starts the execution of) system software. • Controls and monitors the state and state transitions of each processor in a multiprocessor system.
Chapter 8 Input/Output Overview Conceptually, Alpha systems can consist of processors, memory, a processor-memory interconnect (PMI), I/O buses, bridges, and I/O devices. Figure 8–1 shows the Alpha system overview.
Figure 8–1: Alpha System Overview [diagram; component labels: Processor, Memory, Processor-Memory Interconnect, Bridge, I/O Bus, I/O Device]
As shown in Figure 8–1, processors, memory, and possibly I/O devices, are connected by a PMI.
Chapter 9 OpenVMS Alpha The following sections specify the Privileged Architecture Library (PALcode) instructions that are required to support an OpenVMS Alpha system. 9.1 Unprivileged OpenVMS Alpha PALcode The unprivileged PALcode instructions provide support for system operations to all modes of operation (kernel, executive, supervisor, and user). Table 9–1 describes the unprivileged OpenVMS Alpha PALcode instructions.
Table 9–1 : Unprivileged OpenVMS Alpha PALcode Instruction Summary (Continued) Mnemonic Operation and Description CHME Change mode to executive The CHME instruction allows a process to change its mode in a controlled manner. A change in mode also results in a change of stack pointers: the old pointer is saved, the new pointer is loaded. Registers R2..R7, PS, and PC are pushed onto the selected stack. The saved PC addresses the instruction following the CHME instruction.
IMB I-Stream memory barrier IMB ensures that the contents of an instruction cache are coherent after the instruction stream has been modified by software or I/O devices. If the instruction stream is modified and an IMB is not executed before fetching an instruction from the modified location, it is UNPREDICTABLE whether the old or new value is fetched.
INSQTILR Insert into longword queue at tail, interlocked resident The entry specified in R17 is inserted into the self-relative queue preceding the header specified in R16. The insertion is a noninterruptible operation.
RD_PS Read processor status RD_PS writes the Processor Status (PS) to register R0. READ_UNQ Read unique context READ_UNQ reads the hardware process (thread) unique context value, if previously written by WRITE_UNQ, and places that value in R0. REI Return from exception or interrupt The PS, PC, and saved R2..R7 are popped from the current stack and held in temporary registers.
REMQHIQR Remove from quadword queue at header, interlocked resident The queue entry following the header, pointed to by R16, is removed from the self-relative queue and the address of the removed entry is returned in R1. The removal is interlocked to prevent concurrent interlocked insertions or removals at the head or tail of the same queue by another process, in a multiprocessor environment.
REMQUEL    Remove from longword queue
The queue entry addressed by R16 for REMQUEL or the entry addressed by the longword addressed by R16 for REMQUEL/D is removed from the longword absolute queue, and the address of the removed entry is returned in R1. The removal is a noninterruptible operation.
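The self-relative queue operations summarized for INSQTILR and REMQHIQR can be illustrated with a small model. The following Python sketch is an illustration under stated assumptions, not the PALcode algorithm: memory is modeled as a dictionary, the link width `LINK` and all function names are hypothetical, and the interlock, alignment checks, and status returns are omitted. It assumes the usual self-relative convention in which each link holds the displacement from the entry that contains it to the neighboring entry, so an empty header has both links equal to zero.

```python
LINK = 8  # hypothetical link width in bytes; forward link at +0, backward link at +LINK


def init_header(mem, header):
    """An empty self-relative queue: both displacements are zero."""
    mem[header] = 0
    mem[header + LINK] = 0


def insert_tail(mem, header, entry):
    """Insert `entry` preceding the header (that is, at the tail)."""
    pred = header + mem[header + LINK]      # current tail, or the header if empty
    mem[entry] = header - entry             # entry forward link -> header
    mem[entry + LINK] = pred - entry        # entry backward link -> old tail
    mem[pred] = entry - pred                # old tail forward link -> entry
    mem[header + LINK] = entry - header     # header backward link -> entry


def remove_head(mem, header):
    """Remove the entry following the header; return its address or None."""
    first = header + mem[header]
    if first == header:                     # queue is empty
        return None
    succ = first + mem[first]
    mem[header] = succ - header             # header forward link -> successor
    mem[succ + LINK] = header - succ        # successor backward link -> header
    return first
```

Because every link is a displacement rather than an absolute address, the whole queue can be mapped at a different base address without rewriting any link, which is one reason this form is suited to queues shared between processes.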
9.2 Privileged OpenVMS Alpha PALcode
The privileged PALcode instructions can be called in kernel mode only. Table 9–2 describes the privileged OpenVMS Alpha PALcode instructions.
Table 9–2 : Privileged OpenVMS Alpha PALcode Instructions Summary
Mnemonic    Operation and Description
CFLUSH    Cache flush
At least the entire physical page specified by a page frame number in R16 is flushed from any data caches associated with the current processor.
STQP    Store quadword physical
The quadword contents of R17 are written to the memory location whose physical address is in R16. If the operand address in R16 is not quadword-aligned, the result is UNPREDICTABLE.
Chapter 10 Digital UNIX
The following sections specify the Privileged Architecture Library (PALcode) instructions that are required to support a Digital UNIX system.
10.1 Unprivileged Digital UNIX PALcode
Table 10–1 describes the unprivileged Digital UNIX PALcode instructions.
Table 10–1 : Unprivileged Digital UNIX PALcode Instruction Summary (Continued)
Mnemonic    Operation and Description
rdunique    Read unique
The rdunique instruction returns the process unique value.
urti    Return from user mode trap
The urti instruction pops from the user stack the registers a0 through a2, the global pointer, the new user assembler temporary register, the stack pointer, the program counter, and the processor status register.
Table 10–2 : Privileged Digital UNIX PALcode Instruction Summary (Continued)
Mnemonic    Operation and Description
retsys    Return from system call
The retsys instruction pops the return address, the user stack pointer, and the user global pointer from the kernel stack. It then saves the kernel stack pointer, sets mode to user, enables interrupts, and jumps to the address popped off the stack.
rti    Return from trap, fault or interrupt
The rti instruction pops certain registers from the kernel stack.
wrval    Write system value
The wrval instruction writes a 64-bit per-processor value.
wrvptptr    Write virtual page table pointer
The wrvptptr instruction writes a pointer to the virtual page table pointer (vptptr).
Chapter 11 Windows NT Alpha
The following sections specify the Privileged Architecture Library (PALcode) instructions that are required to support a Windows NT Alpha system.
11.1 Unprivileged Windows NT Alpha PALcode
The unprivileged PALcode instructions provide support for system operations and may be called from both kernel and user modes.
Table 11–1 : Unprivileged Windows NT Alpha PALcode Instruction Summary
Mnemonic    Operation and description
imb    Instruction memory barrier
The imb instruction guarantees that all subsequent instruction stream fetches are coherent with respect to main memory. Imb must be issued before executing code in memory that has been modified (either by stores from the processor or DMA from an I/O processor).
Table 11–2 : Privileged Windows NT Alpha PALcode Instruction Summary (Continued)
Mnemonic    Operation and description
draina    Drain all aborts including machine checks
The draina instruction drains all aborts, including machine checks, from the current processor. Draina guarantees that no abort is signaled for an instruction issued before the draina while any instruction issued subsequent to the draina is executing.
rdirql    Read the current IRQL from the PSR
The rdirql instruction returns the contents of the interrupt request level (IRQL) field of the PSR internal processor register.
rdksp    Read initial kernel stack pointer for the current thread
The rdksp instruction returns the contents of the IKSP (initial kernel stack pointer) internal processor register for the currently executing thread.
retsys    Return from system service call exception
The retsys instruction returns from a system service call exception by unwinding the trap frame and returning to the code stream that was executing when the original exception was initiated. In addition, retsys accepts a parameter to set software interrupt requests that became pending while the exception was handled.
tbia    Translation buffer invalidate all
The tbia instruction invalidates all translations and virtual cache blocks within the processor.
tbim    Translation buffer invalidate multiple
The tbim instruction invalidates multiple virtual translations for the current ASN. The translation for the virtual address must be invalidated in all processor translation buffers and virtual caches.
Appendix A Software Considerations A.1 Hardware-Software Compact The Alpha architecture, like all RISC architectures, depends on careful attention to data alignment and instruction scheduling to achieve high performance. Since there will be various implementations of the Alpha architecture, it is not obvious how compilers can generate high-performance code for all implementations.
In some cases, there are performance advantages to aligning instructions or data to cache-block boundaries, or putting data whose use is correlated into the same cache block, or trying to avoid cache conflicts by not having data whose use is correlated placed at addresses that are equal modulo the cache size. Since the Alpha architecture will have many implementations, an exact cache design cannot be outlined here.
branch-takens. If the infrequent case is rare (5%), put it far enough away that it never comes into the I-cache. If the infrequent case is extremely rare (error message code), put it on a page of rarely executed code and expect that page never to be paged in. 4. There are two functionally identical branch-format opcodes, BSR and BR, as shown in Figure A–1.
quickly as possible, second priority to predicting conditional branches based on the sign of the displacement field (backward taken, forward not-taken), and third priority to predicting subroutine return addresses by running a small prediction stack. (VAX traces show a stack of two to four entries correctly predicts most branches.) A.2.3 Improving I-Stream Density — Factor of 3 Compilers should try to use profiles to make sure almost 100% of the bytes brought into an I-cache are actually executed.
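The static rule of predicting from the sign of the displacement field (backward taken, forward not-taken) can be sketched as follows. This is a minimal illustration, assuming the branch instruction format in which the displacement is a signed 21-bit field in instruction bits <20:0>; the function names are hypothetical and no particular implementation is implied.

```python
def branch_displacement(instr):
    """Return the signed 21-bit displacement held in instruction bits <20:0>."""
    disp = instr & 0x1FFFFF
    if disp & 0x100000:          # sign bit of the 21-bit field
        disp -= 0x200000
    return disp


def predict_taken(instr):
    """Static prediction: backward branches taken, forward branches not taken."""
    return branch_displacement(instr) < 0
```

A compiler that knows this rule can arrange a loop so that its closing branch points backward and place the infrequent path behind a forward branch.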
aligned octaword boundaries whenever language rules allow. In some implementations, a series of writes that completely fill a cache block may be a factor of 10 faster than a series of writes that partially fill a cache block, when that cache block would give a read miss. This is true of write-back caches that read a partially filled cache block from memory, but optimize away the read for completely filled blocks.
data in the same cache block as the lock. For the high-sharing case, compilers should assume that almost all accesses to shared data result in cache misses all the way back to main memory, for each distinct cache block used. Such accesses will likely be a factor of 30 slower than cache hits. It is helpful to pack correlated shared data into a small number of cache blocks. It is helpful also to segregate blocks written by one processor from blocks read by others.
In a frequently executed loop, compilers could allocate the data items accessed from memory so that, on each loop iteration, all of the memory addresses accessed are either in exactly the same aligned 64-byte block or differ in bits VA<10:6>. For loops that go through arrays in a common direction with a common stride, this requires allocating the arrays, checking that the first-iteration addresses differ, and if not, inserting up to 64 bytes of padding between the arrays.
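The allocation rule for 64-byte blocks can be checked mechanically. The sketch that follows is a simplified illustration for a pair of addresses only; the function names are hypothetical, and the fixed 64-byte pad is an assumption that covers the two-array case rather than a general allocator.

```python
def block_index(va):
    """Bits VA<10:6>: the field the rule asks distinct blocks to differ in."""
    return (va >> 6) & 0x1F


def same_block(a, b):
    """True if both addresses fall in the same aligned 64-byte block."""
    return (a >> 6) == (b >> 6)


def may_conflict(a, b):
    """True if the addresses are in different blocks that share VA<10:6>."""
    return not same_block(a, b) and block_index(a) == block_index(b)


def padding_needed(first_a, first_b):
    """Bytes of padding to place before the second array (two-array case only)."""
    return 64 if may_conflict(first_a, first_b) else 0
```

For example, first-iteration addresses 0x10000 and 0x10800 lie in different blocks but share VA<10:6>, so 64 bytes of padding before the second array moves it to the next block index.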
tion addresses differ, and if they do not, inserting up to 8K bytes of padding between the arrays. This rule will avoid thrashing in direct-mapped TBs and in some large direct-mapped data caches with total sizes of 32 pages (256 KB) or more. Usually, this padding will mean zero extra bytes in the executable image, just a skip in virtual address space to the next-higher page boundary. For large caches, the rule above should be applied to the I-stream, in addition to all the D-stream references.
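The padding described here is a skip to the next-higher page boundary, which is simple address arithmetic. The sketch assumes an 8 KB page, a value inferred from the statement that 32 pages total 256 KB; `PAGE_SIZE` and the function names are hypothetical.

```python
PAGE_SIZE = 8 * 1024  # assumed page size: 256 KB / 32 pages


def next_higher_page_boundary(va):
    """Smallest page-aligned address strictly greater than va."""
    return (va // PAGE_SIZE + 1) * PAGE_SIZE


def page_padding(va):
    """Bytes skipped to reach the next-higher page boundary (1 to 8192)."""
    return next_higher_page_boundary(va) - va
```

Because the skipped range is never touched, it typically costs virtual address space only, consistent with the observation that the padding usually adds zero bytes to the executable image.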
Table A–1: Cache Block Prefetching Type Instructions Operation Prefetch with Modify Intent LDS F31, xxx (Rn) If the load operation hits a dirty, modified, Dcache block, the instruction is dismissed. Otherwise, the addressed cache block is allocated into the Dcache for write access — its dirty and modified bits are set. Prefetch, Next LDQ R31, xxx (Rn) Prefetch a cache block and mark that block in an associated cache to be evicted on the next cache fill to an associated address.
Note: The shifts often can be combined with shifts that might surround subsequent arithmetic operations (for example, to produce word overflow from the high end of a register).
In the common case, the intended sequence for loading and zero-extending a byte is:
    LDL      R1,D.lw(Rx)     !
    EXTBL    R1,#D.mod,R1    !
In the common case, the intended sequence for loading and sign-extending a byte is: LDL SLL SRA R1,D.lw(Rx) R1,#56-8*D.
16-bit quotient digit plus a 48-bit new partial dividend. Three more such steps can generate the full quotient. Having prior knowledge of the possible sizes of the divisor and dividend, normalizing away leading bytes of zeros, and performing an early-out test can reduce the average number of multiplies to about five (compared to a best case of one and a worst case of nine). A.4.
The standard NOP forms are:
    NOP     ==    BIS     R31,R31,R31
    FNOP    ==    CPYS    F31,F31,F31
These generate no exceptions. In most implementations, they should encounter no operand issue delays and no destination issue delay. Implementations are free to optimize these into no action and zero execution cycles.
A.4.4.2 Clear a Register
The standard clear register forms are:
    CLR     ==    BIS     R31,R31,Rx
    FCLR    ==    CPYS    F31,F31,Fx
These generate no exceptions.
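The NOP and clear-register forms are ordinary operate-format instructions, so their 32-bit encodings can be assembled from the instruction fields. The sketch makes several assumptions that come from the instruction-format chapters and Appendix C rather than from this passage: opcode in bits <31:26>, first source in <25:21>, second source in <20:16>, function code starting at bit 5, destination in <4:0>, BIS as opcode 11 with function 20 (hexadecimal), and CPYS as opcode 17 with function 020. The helper names are hypothetical.

```python
def encode_operate(opcode, ra, rb, function, rc):
    """Integer operate format, register Rb form (7-bit function in <11:5>)."""
    assert 0 <= function < (1 << 7)
    return (opcode << 26) | (ra << 21) | (rb << 16) | (function << 5) | rc


def encode_fp_operate(opcode, fa, fb, function, fc):
    """Floating-point operate format (11-bit function in <15:5>)."""
    assert 0 <= function < (1 << 11)
    return (opcode << 26) | (fa << 21) | (fb << 16) | (function << 5) | fc


BIS = (0x11, 0x20)     # assumed opcode and function for BIS
CPYS = (0x17, 0x020)   # assumed opcode and function for CPYS


def nop():
    return encode_operate(BIS[0], 31, 31, BIS[1], 31)       # BIS R31,R31,R31


def fnop():
    return encode_fp_operate(CPYS[0], 31, 31, CPYS[1], 31)  # CPYS F31,F31,F31


def clr(rx):
    return encode_operate(BIS[0], 31, 31, BIS[1], rx)       # BIS R31,R31,Rx
```

Under these assumptions, `nop()` yields 0x47FF041F and `fnop()` yields 0x5FFF041F; a clear differs from the NOP only in the destination field.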
The general sequence is:
    LDA     Rdst, low(R31)
    LDAH    Rdst, extra(Rdst)    ! Omit if extra=0
    LDAH    Rdst, high(Rdst)     ! Omit if high=0
A.4.4.4 Register-to-Register Move
The standard register move forms are:
    MOV     RX,RY    ==    BIS     RX,RX,RY
    FMOV    FX,FY    ==    CPYS    FX,FX,FY
These move forms generate no exceptions. In most implementations, these should encounter no functional unit issue delay.
A.4.4.
A.4.5 Exceptions and Trap Barriers The EXCB instruction allows software to guarantee that in a pipelined implementation, all previous instructions have completed any behavior that is related to exceptions or rounding modes before any instructions after the EXCB are issued. In particular, all changes to the floating-point control register (FPCR) are guaranteed to have been made, whether or not there is an associated exception.
Table A–2: Decodable Pseudo-Operations (Stylized Code Forms) (Continued)
Pseudo-Operation in Listing    Meaning    Actual Instruction Encoding
FNEG Fx, Fy    No-exception generic floating negation    CPYSN Fx, Fx, Fy
FNOP    Floating-point no-op    CPYS F31, F31, F31
MOV Lit, Rx    Move 16-bit sign-extended literal to Rx    LDA Rx,lit(R31)
MOV {Rx/Lit8}, Ry    Move Rx/8-bit zero-extended literal to Ry    BIS R31,{Rx/Lit8},Ry
MF_FPCR Fx    Move from FPCR    MF_FPCR Fx, Fx, Fx
MT_FPCR Fx    Move to FPCR    MT_FPCR Fx, Fx
SEXTL {Rx/Lit8}, Ry    Longword sign-extension of Rx storing results in Ry    ADDL R31, {Rx/Lit}, Ry
UNOP    Universal NOP for both integer and floating-point code    LDQ_U R31,0(Rx)
A.
Appendix B IEEE Floating-Point Conformance A subset of IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Standard 754-1985) is provided in the Alpha floating-point instructions. This appendix describes how to construct a complete IEEE implementation. The order of presentation parallels the order of the IEEE specification. B.1 Alpha Choices for IEEE Options Alpha supports IEEE single, double, and optionally (in software) extended double formats.
Overflow and underflow, NaNs, and infinities encountered during software binary to decimal conversion return strings that specify the conditions. Alpha hardware supports comparisons of same-format numbers. Software supports comparisons of different-format numbers. In the Alpha architecture, results are true-false in response to a predicate. Alpha hardware supports the required six predicates and the optional unordered predicate.
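One way to picture the six required predicates and the optional unordered predicate is to derive them from four compare primitives named after the CMPTEQ, CMPTLT, CMPTLE, and CMPTUN instructions listed in Appendix C. The Python sketch is only a model: it uses host floating-point comparisons and Boolean results, and the derivation by operand swapping and inversion is an illustration, not a statement of how any compiler must generate code.

```python
import math


def cmptun(a, b):
    """Unordered: true when either operand is a NaN."""
    return math.isnan(a) or math.isnan(b)


def cmpteq(a, b):
    return a == b


def cmptlt(a, b):
    return a < b


def cmptle(a, b):
    return a <= b


def predicates(a, b):
    """Six IEEE predicates plus unordered, built from the four primitives."""
    return {
        "eq": cmpteq(a, b),
        "ne": not cmpteq(a, b),     # true for unordered operands
        "lt": cmptlt(a, b),
        "le": cmptle(a, b),
        "gt": cmptlt(b, a),         # operands swapped
        "ge": cmptle(b, a),         # operands swapped
        "unordered": cmptun(a, b),
    }
```

With a NaN operand, every ordered predicate and equality come out false, while "ne" and "unordered" come out true.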
In the Alpha architecture, user signal handlers are supported by compilers and an OS completion handler (interposed between the hardware and the IEEE user), as described in the next section. B.2 Alpha Support for OS Completion Handlers Alpha floating-point trap behavior is statically controlled by the /S, /U, and /I mode qualifiers on floating-point instructions. Changing these options usually requires recompiling.
these bits may choose to complete computations involving non-finite values without the assistance of software completion. Operating systems use these FPCR bits to enable hardware completion of instructions with any valid qualifier combination that includes /S in those cases where the operating system does not require a trap to do exception signaling.
• Integer overflow (IOV) exceptions are controlled by the INVE enable mask bit (FP_C<1>), as allowed by the IEEE standard. Implementation software is responsible for setting the INVS status bit (FP_C<17>) when a CVTTQ or CVTQL instruction traps into the software completion mechanism for integer overflow.
• At process creation, all trap enable flags in the FP_C are clear.
Table B–1: Floating-Point Control (FP_C) Quadword Bit Summary (Continued) Bit Description 5 Inexact result enable (INEE) Initiate an INE exception if the result of a floating arithmetic or conversion operation differs from the mathematically exact result. 4 Underflow enable (UNFE) Initiate a UNF exception if a floating arithmetic or conversion operation underflows the destination exponent.
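The FP_C bit positions given in this appendix (INVE at bit 1, UNFE at bit 4, INEE at bit 5, INVS at bit 17) lend themselves to simple mask operations. The sketch models only the integer-overflow rule stated earlier: completion software sets INVS, and the INVE enable bit decides whether the exception is passed on. The constant and function names are hypothetical, and the real completion handler does considerably more.

```python
FP_C_INVE = 1 << 1    # invalid operation enable, FP_C<1>
FP_C_UNFE = 1 << 4    # underflow enable, FP_C<4>
FP_C_INEE = 1 << 5    # inexact result enable, FP_C<5>
FP_C_INVS = 1 << 17   # invalid operation status, FP_C<17>


def new_process_fp_c():
    """At process creation, all trap enable flags are clear."""
    return 0


def complete_integer_overflow(fp_c):
    """Model of completion for a CVTTQ or CVTQL integer overflow.

    Returns the updated FP_C value and whether the exception is enabled.
    """
    fp_c |= FP_C_INVS
    enabled = bool(fp_c & FP_C_INVE)
    return fp_c, enabled
```

A freshly created process therefore records the status bit but sees no user-level exception until it sets INVE.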
Figure B–2: IEEE Trap Handling Behavior Hardware Traps to PALcode PALcode Traps to Operating System Operating System Traps to User IEEE Trap Handler (IEEE Standard) User Signal Handler The IEEE-specified trap behavior occurs only with respect to the user signal handler (the last layer in Figure B–2); any trap-and-fixup behavior in the first three layers is outside the scope of the IEEE standard. The IEEE number system is divided into finite and non-finite numbers: The finites are normal numbers: • –MAX..
Table B–2: IEEE Floating-Point Trap Handling PALCode Alpha Instructions Hardware1 FBEQ FBNE FBLT FBLE FBGT FBGE LDS LDT STS STT CPYS CPYSN FCMOVx Bits Only – No Exceptions OS Completion Handler User Signal Handler Bits Only—No Exceptions Bits Only—No Exceptions Bits Only—No Exceptions Bits Only—No Exceptions ADDx SUBx INPUT Exceptions: Denormal operand Trap Trap Supply sum +/-Inf operand QNaN operand SNaN operand +Inf + –Inf Trap Trap Trap Trap Trap Trap Trap Trap Supply sum Supply QNaN Supp
Table B–2: IEEE Floating-Point Trap Handling (Continued) Alpha Instructions Hardware 1 PALCode OS Completion Handler User Signal Handler [Overflow3] Scale by bias adjust – MULx OUTPUT Exceptions: Exponent overflow Trap Trap Supply +/–Inf +/–MAX Exponent underflow and disabled Exponent underflow and enabled Supply +0 Supply +0 and Trap – Trap Inexact and disabled Inexact and enabled – Supply prod. and trap – Trap – Supply +/–MIN denorm +/–0 – – Denormal operand Trap Trap Supply quot.
Table B–2: IEEE Floating-Point Trap Handling (Continued) Alpha Instructions Hardware 1 PALCode OS Completion Handler User Signal Handler [Denormal Op2] [Invalid Op] [Invalid Op] CMPTLT CMPTLE INPUT Exceptions: Denormal operand Trap Trap Supply ≤ or < QNaN operand SNaN operand Trap Trap Trap Trap Supply False Supply False Denormal operand Trap Trap Supply Cvt +/-Inf operand QNaN operand SNaN operand Trap Trap Trap Trap Trap Trap Supply 0 Supply 0 Supply 0 [Denormal Op2] [Invalid Op] –
Table B–2: IEEE Floating-Point Trap Handling (Continued) Alpha Instructions Hardware 1 PALCode OS Completion Handler User Signal Handler [Invalid Op] – SQRTx INPUT Exceptions Negative nonzero operand +/–0 + Denormal operand Trap Supply +/–0 Trap Trap – Trap Supply QNan – Supply SQRT – Denormal operand Trap Trap Supply QNaN + Infinity operand – Infinity operand QNaN operand SNaN operand Trap Trap Trap Trap Trap Trap Trap Trap Supply +Inf Supply QNaN Supply QNaN Supply QNaN [Denormal Op2] [
Table B–3 shows the IEEE standard charts. In the charts, the second column is the result when the user signal handler is disabled; the third column is the result when that handler is enabled. The OS completion handler supplies the IEEE default that is specified in the second column. The contents of the Alpha registers contain sufficient information for an enabled user handler to compute the value in the third column.
Appendix C Instruction Summary This appendix summarizes all instructions and opcodes in the Alpha architecture. All values are in hexadecimal radix. C.1 Common Architecture Instruction Summary This section summarizes all common Alpha instructions. Table C–1 describes the contents of the Format and Opcode columns in Table C–2. Table C–1: Instruction Format and Opcode Notation Instruction Format Format Symbol Opcode Notation Branch Floating- point Bra F-P oo oo.
Table C–2: Common Architecture Instructions Mnemonic Format Opcode Description ADDF ADDG ADDL ADDL/V ADDQ ADDQ/V ADDS ADDT AMASK AND BEQ BGE BGT BIC BIS BLBC BLBS BLE BLT BNE BR BSR CALL_PAL CMOVEQ CMOVGE CMOVGT CMOVLBC CMOVLBS CMOVLE CMOVLT CMOVNE CMPBGE CMPEQ CMPGEQ CMPGLE CMPGLT CMPLE CMPLT CMPTEQ CMPTLE CMPTLT CMPTUN CMPULE CMPULT CPYS CPYSE CPYSN CTLZ CTPOP CTTZ CVTDG CVTGD CVTGF F-P F-P Opr 15.080 15.0A0 10.00 10.40 10.20 10.60 16.080 16.0A0 11.61 11.00 39 3E 3F 11.08 11.
Table C–2: Common Architecture Instructions (Continued) Mnemonic Format Opcode Description CVTGQ CVTLQ CVTQF CVTQG CVTQL CVTQS CVTQT CVTST CVTTQ CVTTS DIVF DIVG DIVS DIVT ECB EQV EXCB EXTBL EXTLH EXTLL EXTQH EXTQL EXTWH EXTWL FBEQ FBGE FBGT FBLE FBLT FBNE FCMOVEQ FCMOVGE FCMOVGT FCMOVLE FCMOVLT FCMOVNE FETCH FETCH_M FTOIS FTOIT IMPLVER INSBL INSLH INSLL INSQH INSQL INSWH INSWL ITOFF ITOFS ITOFT JMP JSR JSR_COROUTINE F-P F-P F-P F-P F-P F-P F-P F-P F-P F-P F-P F-P F-P F-P Mfc Opr Mfc Opr Opr Opr Opr Opr
Table C–2: Common Architecture Instructions (Continued) Mnemonic Format Opcode Description LDA LDAH LDBU LDWU LDF LDG LDL LDL_L LDQ LDQ_L LDQ_U LDS LDT MAXSB8 MAXSW4 MAXUB8 MAXUW4 MB MF_FPCR MINSB8 MINSW4 MINUB8 MINUW4 MSKBL MSKLH MSKLL MSKQH MSKQL MSKWH MSKWL MT_FPCR MULF MULG MULL MULL/V MULQ MULQ/V MULS MULT ORNOT PERR PKLB PKWB RC RET RPCC RS S4ADDL S4ADDQ S4SUBL S4SUBQ S8ADDL S8ADDQ S8SUBL Mem Mem Mem Mem Mem Mem Mem Mem Mem Mem Mem Mem Mem Opr Opr Opr Opr Mfc F-P Opr Opr Opr Opr Opr Opr Opr Opr O
Table C–2: Common Architecture Instructions (Continued) Mnemonic Format Opcode Description S8SUBQ SEXTB SEXTW SLL SQRTF SQRTG SQRTS SQRTT SRA SRL STB STF STG STS STL STL_C STQ STQ_C STQ_U STT STW SUBF SUBG SUBL SUBL/V SUBQ SUBQ/V SUBS SUBT TRAPB UMULH UNPKBL UNPKBW WH64 WMB XOR ZAP ZAPNOT Opr Opr Opr Opr F-P F-P F-P F-P Opr Opr Mem Mem Mem Mem Mem Mem Mem Mem Mem Mem Mem F-P F-P Opr 10.3B 1C.00 1C.01 12.39 14.08A 14.0AA 14.08B 14.0AB 12.3C 12.34 0E 24 25 26 2C 2E 2D 2F 0F 27 0D 15.081 15.0A1 10.09 10.
C.2 IEEE Floating-Point Instructions
Table C–3 lists the hexadecimal value of the 11-bit function code field for the IEEE floating-point instructions, with and without qualifiers. The opcode for the following instructions is 16₁₆, except for SQRTS and SQRTT, which are opcode 14₁₆.
Table C–3: IEEE Floating-Point Instruction Function Codes (Continued)
CVTTQ    None    /C     /V     /VC    /SV    /SVC   /SVI   /SVIC
         0AF     02F    1AF    12F    5AF    52F    7AF    72F
CVTTQ    /D      /VD    /SVD   /SVID  /M     /VM    /SVM   /SVIM
         0EF     1EF    5EF    7EF    06F    16F    56F    76F
Programming Note: To use CMPTxx with software completion trap handling, specify the /SU IEEE trap mode, even though an underflow trap is not possible.
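The CVTTQ function codes follow a regular pattern that can be recovered by splitting the 11-bit function field into subfields. The sketch assumes the layout trap <10:8>, rounding <7:6>, source <5:4>, and function <3:0>, which is taken from the floating-point operate format description elsewhere in the handbook, and it maps only the qualifier letters that appear in the CVTTQ rows. The dictionary and function names are hypothetical.

```python
TRAP_QUALIFIERS = {0b000: "", 0b001: "V", 0b101: "SV", 0b111: "SVI"}
ROUND_QUALIFIERS = {0b00: "C", 0b01: "M", 0b10: "", 0b11: "D"}


def decode_function_code(code):
    """Split an 11-bit function code and name its CVTTQ-style qualifier."""
    trp = (code >> 8) & 0x7
    rnd = (code >> 6) & 0x3
    src = (code >> 4) & 0x3
    fnc = code & 0xF
    letters = TRAP_QUALIFIERS[trp] + ROUND_QUALIFIERS[rnd]
    qualifier = "/" + letters if letters else "None"
    return {"trp": trp, "rnd": rnd, "src": src, "fnc": fnc, "qualifier": qualifier}
```

For example, 72F splits into trap 7, rounding 0, source 2, function F, which reads back as /SVIC, and 0AF with rounding 2 and trap 0 is the unqualified form.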
C.4 Independent Floating-Point Instructions
Table C–5 lists the hexadecimal value of the 11-bit function code field for the floating-point instructions that are not directly tied to IEEE or VAX floating point. The opcode for the following instructions is 17₁₆.
Table C–5: Independent Floating-Point Instruction Function Codes
Mnemonic    None    /V     /SV
CPYS        020
CPYSE       022
CPYSN       021
CVTLQ       010
CVTQL       030     130    530
FCMOVEQ     02A
FCMOVGE     02D
FCMOVGT     02F
FCMOVLE     02E
FCMOVLT     02C
MF_FPCR     025
MT_FPCR     024
C.
The instruction format is listed under the instruction symbol. The symbols in Table C–6 are explained in Table C–7.
C.6 Common Architecture Opcodes in Numerical Order Table C–8: Common Architecture Opcodes in Numerical Order Opcode 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10.00 10.02 10.09 10.0B 10.0F 10.12 10.1B 10.1D 10.20 10.22 10.29 10.2B 10.2D 10.32 10.3B 10.3D 10.40 10.49 10.4D 10.60 10.69 10.6D 11.00 11.08 11.14 11.16 11.20 11.
Table C–8: Common Architecture Opcodes in Numerical Order (Continued) Opcode 14.78B 14.7AB 14.7CB 14.7EB 15.000 15.001 15.002 15.003 15.01E 15.020 15.021 15.022 15.023 15.02C 15.02D 15.02F 15.03C 15.03E 15.080 15.081 15.082 15.083 15.09E 15.0A0 15.0A1 15.0A2 15.0A3 15.0A5 15.0A6 15.0A7 15.0AC 15.0AD 15.0AF 15.0BC 15.0BE 15.100 15.101 15.102 15.103 15.11E 15.120 15.121 15.122 15.123 15.12C 15.
Table C–8: Common Architecture Opcodes in Numerical Order (Continued) Opcode 16.0A0 16.0A1 16.0A2 16.0A3 16.0A4 16.0A5 16.0A6 16.0A7 16.0AC 16.0AF 16.0BC 16.0BE 16.0C0 16.0C1 16.0C2 16.0C3 16.0E0 16.0E1 16.0E2 16.0E3 16.0EC 16.0EF 16.0FC 16.0FE 16.100 16.101 16.102 16.103 16.120 16.121 16.122 16.123 16.12C 16.12F 16.140 16.141 16.142 16.143 16.160 16.161 16.162 16.163 16.16C 16.16F 16.180 16.
Table C–8: Common Architecture Opcodes in Numerical Order (Continued) Opcode 16.7A0 16.7A1 16.7A2 16.7A3 16.7AC 16.7AF 16.7BC 16.7BE 16.7C0 16.7C1 16.7C2 16.7C3 16.7E0 16.7E1 16.7E2 16.7E3 16.7EC 16.7EF 16.7FC 16.7FE 17.010 17.020 17.021 17.022 17.024 17.025 17.02A 17.02B 17.02C 17.02D 17.02E 17.02F 17.030 17.130 17.530 18.0000 18.
C.7 OpenVMS Alpha PALcode Instruction Summary Table C–9: OpenVMS Alpha Unprivileged PALcode Instructions Mnemonic Opcode Description AMOVRM AMOVRR BPT BUGCHK CHMK CHME CHMS CHMU CLRFEN GENTRAP IMB INSQHIL INSQHILR INSQHIQ INSQHIQR INSQTIL INSQTILR INSQTIQ INSQTIQR INSQUEL INSQUEL/D INSQUEQ INSQUEQ/D PROBER PROBEW RD_PS READ_UNQ REI REMQHIL REMQHILR REMQHIQ REMQHIQR REMQTIL REMQTILR REMQTIQ REMQTIQR REMQUEL REMQUEL/D REMQUEQ REMQUEQ/D RSCC SWASTEN WRITE_UNQ WR_PS_SW 00.00A1 00.00A0 00.0080 00.0081 00.
Table C–10: OpenVMS Alpha Privileged PALcode Instructions Mnemonic Opcode Description CFLUSH CSERVE DRAINA HALT LDQP MFPR_ASN MFPR_ESP MFPR_FEN MFPR_IPL MFPR_MCES MFPR_PCBB MFPR_PRBR MFPR_PTBR MFPR_SCBB MFPR_SISR MFPR_SSP MFPR_TBCHK MFPR_USP MFPR_VPTB MFPR_WHAMI MTPR_ASTEN MTPR_ASTSR MTPR_DATFX MTPR_ESP MTPR_FEN MTPR_IPIR MTPR_IPL MTPR_MCES MTPR_PERFMON MTPR_PRBR MTPR_SCBB MTPR_SIRR MTPR_SSP MTPR_TBIA MTPR_TBIAP MTPR_TBIS MTPR_TBISD MTPR_TBISI MTPR_USP MTPR_VPTB STQP SWPCTX SWPPAL WTINT 00.0001 00.
C.8 DIGITAL UNIX PALcode Instruction Summary Table C–11: DIGITAL UNIX Unprivileged PALcode Instructions Mnemonic Opcode Description bpt bugchk callsys clrfen gentrap imb rdunique urti wrunique 00.0080 00.0081 00.0083 00.00AE 00.00AA 00.0086 00.009E 00.0092 00.
C.9 Windows NT Alpha Instruction Summary Table C–13: Windows NT Alpha Unprivileged PALcode Instructions Mnemonic Opcode Description bpt callkd callsys gentrap imb kbpt rdteb 00.0080 00.00AD 00.0083 00.00AA 00.0086 00.00AC 00.
C.10 PALcode Opcodes in Numerical Order
Opcodes 00.0038₁₆ through 00.003F₁₆ are reserved for processor implementation-specific PALcode instructions. All other opcodes are reserved for use by Compaq.
Table C–15: PALcode Opcodes in Numerical Order
Opcode₁₆ Opcode₁₀ OpenVMS Alpha DIGITAL UNIX Windows NT Alpha 00.0000 00.0001 00.0002 00.0003 00.0004 00.0005 00.0006 00.0007 00.0008 00.0009 00.000A 00.000B 00.000C 00.000D 00.000E 00.000F 00.0010 00.0011 00.0012 00.0013 00.0014 00.0015 00.0016 00.0017 00.
Table C–15: PALcode Opcodes in Numerical Order (Continued) Opcode16 Opcode10 OpenVMS Alpha 00.0032 00.0033 00.0034 00.0035 00.0036 00.0037 00.0038 00.0039 00.003A 00.003C 00.003D 00.003E 00.003F 00.0080 00.0081 00.0082 00.0083 00.0084 00.0085 00.0086 00.0087 00.0088 00.0089 00.008A 00.008B 00.008C 00.008D 00.008E 00.008F 00.0090 00.0091 00.0092 00.0093 00.0094 00.0095 00.0096 00.0097 00.0098 00.0099 00.009A 00.009B 00.009C 00.009D 00.009E 00.009F 00.00A0 00.00A1 00.00A2 00.00A3 00.00A4 00.00A5 00.
Table C–15: PALcode Opcodes in Numerical Order (Continued)
Opcode₁₆    Opcode₁₀    OpenVMS Alpha    DIGITAL UNIX    Windows NT Alpha
00.00A8     00.0168     REMQHIQR         —               —
00.00A9     00.0169     REMQTIQR         —               —
00.00AA     00.0170     GENTRAP          gentrap         gentrap
00.00AB     00.0171     —                —               rdteb
00.00AC     00.0172     —                —               kbpt
00.00AD     00.0173     —                —               callkd
00.00AE     00.0174     CLRFEN           clrfen
C.11 Required PALcode Opcodes
The opcodes listed in Table C–16 are required for all Alpha implementations. The notation used is oo.
C.13 Opcodes Reserved to Compaq
The opcodes listed in Table C–18 are reserved to Compaq.
Table C–18: Opcodes Reserved for Compaq
Mnemonic         Mnemonic         Mnemonic
OPC01    01      OPC02    02      OPC03    03
OPC04    04      OPC05    05      OPC06    06
OPC07    07
Programming Note: The code points 18.4800 and 18.4C00 are reserved for adding weaker memory barrier instructions. Those code points must operate as a Memory Barrier instruction (MB 18.4000) for implementations that precede their definition as weaker memory barrier instructions.
C.15 ASCII Character Set Table C–19 shows the 7-bit ASCII character set and the corresponding hexadecimal value for each character. Table C–19: ASCII Character Set Char Hex Code Char Hex Code Char Hex Code Char Hex Code NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US 0 1 2 3 4 5 6 7 8 9 A B C D E F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F SP ! " # $ % & ' ( ) * + , .
Appendix D Registered System and Processor Identifiers This appendix contains a table of the processor type assignments, PALcode implementation information, and the architecture mask (AMASK) and implementation value (IMPLVER) assignments. D.1 Processor Type Assignments The following processor types are defined.
Table D–1: Processor Type Assignments (Continued) Major Type 5= 6= 7= 8= 9= Minor Type EV5 (21164) EV45 (21064A) EV56 (21164A) EV6 (21264) PCA56 (21164PC) 0= Reserved (Pass 1) 1= Pass 2, 2.2 (rev BA, CA) 2= Pass 2.3 (rev DA, EA) 3= Pass 3 4= Pass 3.2 5= Pass 4 0= Reserved 1= Pass 1 2= Pass 1.1 3= Pass 2 0= Reserved 1= Pass 1 2= Pass 2 0= Reserved 1= Pass 1 2= Pass 2, 2.1 3= Pass 2.2 4= Pass 2.
Table D–2: PALcode Variation Assignments Token PALcode Type Summary Table 2 DIGITAL UNIX Console Interface (III), Chapter 3 in the Alpha Architecture Reference Manual 3–127 Reserved to Compaq 128–255 Reserved to non-Compaq D.3 Architecture Mask and Implementation Values The following bits are defined for the AMASK instruction.
Appendix E Waivers and Implementation-Dependent Functionality This appendix describes waivers to the Alpha architecture and functionality that is specific to particular hardware implementations. E.1 Waivers The following waivers have been passed for the Alpha architecture. E.1.
The DECchip 21064, DECchip 21066, and DECchip 21068 implementations differ from the above specification in handling the Inexact condition for the IEEE DIVS and DIVT instructions in two ways: 1. The DIVS and DIVT instructions with the /Inexact modifier trap unconditionally and report the INE exception in the EXC_SUM register (except for NaN, infinity, and denormal inputs that result in INVs). This allows for a software calculation to determine the correct INE status. 2.
The DECchip 21264 varies from that description, with regard to the WH64 instruction, as follows: If any other memory access (ECB, LDx, LDQ_U, STx, STQ_U) is executed on the given processor between the LDx_L and the STx_C, the sequence above may always fail on some implementations; hence, no useful program should do this.
The performance monitor functions, described in Section E.2.1.2, can provide the following, depending on implementation:
• Enable the performance counters to interrupt and trap into the performance monitoring vector in the operating system.
• Disable the performance counter from interrupting. This does not necessarily mean that the counters will stop counting.
• Select which events will be monitored and set the width of the two counters.
E.2.1.2 Functions and Arguments for the DECchip 21064/21066/21068
The functions execute on a single (the current running) processor only and are described in Table E–1.
• The OpenVMS Alpha MTPR_PERFMON instruction is called with a function code in R16, a function-specific argument in R17, and status is returned in R0.
• The DIGITAL UNIX wrperfmon instruction is called with a function code in a0, a function-specific argument in a1, and status is returned in v0.
Table E–1: DECchip 21064/21066/21068 Performance Monitoring Functions (Continued) Function Register Usage Windows NT Alpha Input: a0 = 0 a0 = 1 a1 = 0 Comments Select counter 0 Select counter 1 Disable selected counter Select desired events (mux_ctl) DIGITAL UNIX Input: Output: a0 = 2 a1 = mux_ctl v0 = 1 v0 = 0 OpenVMS Alpha Input: R16 = 2 R17 = mux_ctl Output: R0 = 1 R0 = 0 Windows NT Alpha Input: a2 = PCMUX0 a2 = PCMUX1 a3 = PC0 a3 = PC1 Function code mux_ctl is the exact contents of those fiel
Table E–1: DECchip 21064/21066/21068 Performance Monitoring Functions (Continued) Function Register Usage Comments OpenVMS Alpha Input: R16 = 3 R17 = opt Output: Function code Function argument opt is: <0> = log all processes if set <1> = log only selected if set Success Failure (not generated) R0 = 1 R0 = 0 Table E–2: DECchip 21064/21066/21068 MUX Control Fields in ICCSR Register Bits Option Description 34:32 PCMUX1 Event selection, counter 1: Value Description 0 1 2 3 4 5 6 7 Total D-cach
Table E–2: DECchip 21064/21066/21068 MUX Control Fields in ICCSR Register (Continued) Bits Option Description 11:8 PCMUX0 Event selection, counter 0: 3 0 PC0 PC1 Value Description 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Total issues divided by 2 Unused Nothing issued, no valid I-stream data Unused All load instructions Unused Nothing issued, resource conflict Unused All branches (conditional, unconditional, JSR, HW_REI) Unused Total cycles Cycles while in PALcode environment Total nonissues divid
E.2.2 DECchip 21164/21164PC Performance Monitoring Unless otherwise stated, the term "21164" in this section means implementations of the 21164 at all frequencies. PALcode instructions control the DECchip 21164/21164PC on-chip performance counters. For OpenVMS Alpha, the instruction is MTPR_PERFMON; for DIGITAL UNIX and Windows NT Alpha, the instruction is wrperfmon. The instruction arguments and results are described in the following sections. The scratch register usage is operating system specific.
For the Windows NT Alpha Operating System When a counter overflows and interrupt enabling conditions are correct, the counter causes an interrupt to PALcode. The PALcode builds a frame on the kernel stack and dispatches to the kernel at the interrupt entry point. E.2.2.2 Windows NT Alpha Functions and Argument The functions for Windows NT Alpha execute on only a single (the current running) processor.
Table E–3: Bit Summary of PMCTR Register for Windows NT Alpha Bits Name Meaning 63–48 CTR0 Counter 0 value 47–32 CTR1 Counter 1 value 31 PCSEL0 Counter 0 selection: 30 Value Meaning 0 1 Cycles Issues Must be set to one1 29–16 CTR2 Counter 2 value 15–14 CTL0 Counter 0 control: 13–12 11–10 CTL1 CTL2 Value Meaning 0 1 2 3 Counter disable, interrupt disable Counter enable, interrupt disable Counter enable, interrupt at count 65536 Counter enable, interrupt at count 256 Counter 1
Table E–3: Bit Summary of PMCTR Register for Windows NT Alpha (Continued)

Bits  Name          Meaning
9–8   MODE_SELECT¹  Select modes in which to count:
                      Value  Meaning
                      0      Count all modes
                      1      Count PALmode only
                      2      Count all modes except PALmode
                      3      Count only user mode
7–4   PCSEL1        Counter 1 selection. See Table E–13.
3–0   PCSEL2        Counter 2 selection.
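The field positions in Table E–3 can be turned into a small decoder. The following is a minimal sketch in Python, not part of the handbook: the names PMCTR_FIELDS, extract_field, and decode_pmctr are hypothetical, and the sketch assumes the 64-bit PMCTR value has already been obtained by some operating-system-specific means.

```python
# Hypothetical helper names; bit positions are taken from Table E-3.
PMCTR_FIELDS = {
    "CTR0": (63, 48),        # Counter 0 value
    "CTR1": (47, 32),        # Counter 1 value
    "PCSEL0": (31, 31),      # Counter 0 selection: 0 = cycles, 1 = issues
    "CTR2": (29, 16),        # Counter 2 value
    "CTL0": (15, 14),        # Counter 0 control
    "CTL1": (13, 12),        # Counter 1 control
    "CTL2": (11, 10),        # Counter 2 control
    "MODE_SELECT": (9, 8),   # Modes in which to count
    "PCSEL1": (7, 4),        # Counter 1 selection
    "PCSEL2": (3, 0),        # Counter 2 selection
}

MODE_SELECT_MEANING = {
    0: "Count all modes",
    1: "Count PALmode only",
    2: "Count all modes except PALmode",
    3: "Count only user mode",
}


def extract_field(value, high, low):
    """Return bits high..low (inclusive) of value as an unsigned integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def decode_pmctr(value):
    """Split a 64-bit PMCTR value into its named fields."""
    return {
        name: extract_field(value, high, low)
        for name, (high, low) in PMCTR_FIELDS.items()
    }
```

For example, decode_pmctr(value)["MODE_SELECT"] can be looked up in MODE_SELECT_MEANING to report which processor modes are being counted.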
Table E–4: OpenVMS Alpha and DIGITAL UNIX Performance Monitoring Functions (Continued)

Function / Register Usage          Comments

Enable performance monitoring; start the counters from zero
  DIGITAL UNIX
    Input:   a0 = 7                Function code value
             a1 = arg              Argument from Table E–5
    Output:  v0 = 1                Success
             v0 = 0                Failure (not generated)
  OpenVMS Alpha
    Input:   R16 = 7               Function code value
             R17 = arg             Argument from Table E–5
    Output:  R0 = 1                Success
             R0 = 0                Failure (not generated)

Disable performance monitoring; do not reset counters
Table E–4: OpenVMS Alpha and DIGITAL UNIX Performance Monitoring Functions (Continued)

Function / Register Usage          Comments

Select Processor Mode options
  DIGITAL UNIX
    Input:   a0 = 3                Function code value
             a1 = arg              Argument from Table E–9
    Output:  v0 = 1                Success
             v0 = 0                Failure (not generated)
  OpenVMS Alpha
    Input:   R16 = 3               Function code value
             R17 = arg             Argument from Table E–9
    Output:  R0 = 1                Success
             R0 = 0                Failure (not generated)

Select interrupt frequencies
  DIGITAL UNIX Input: Output: OpenVMS Alpha Input: Outpu
Table E–4: OpenVMS Alpha and DIGITAL UNIX Performance Monitoring Functions (Continued)

Function / Register Usage          Comments

Write the counters
  DIGITAL UNIX
    Input:   a0 = 6                Function code value
             a1 = arg              Argument from Table E–12
    Output:  v0 = 1                Success
             v0 = 0                Failure (not generated)
  OpenVMS Alpha
    Input:   R16 = 6               Function code value
             R17 = arg             Argument from Table E–12
    Output:  R0 = 1                Success
             R0 = 0                Failure (not generated)

Table E–5: 21164/21164PC Enable Counters for OpenVMS Alpha and DIGITAL UNIX Bits Meaning Wh
Table E–7: 21164 Select Desired Events for OpenVMS Alpha and DIGITAL UNIX

Bits   Name    Meaning
63:32          MBZ
31     PCSEL0  Counter 0 selection:
                 Value  Meaning
                 0      Cycles
                 1      Issues
30:25          MBZ
24:22  CBOX2   CBOX2 event selection (only has meaning when event selection field PCSEL2 is value <15>; otherwise MBZ). CBOX2 described in Table E–16.
21:19  CBOX1   CBOX1 event selection (only has meaning when event selection field PCSEL1 is value <15>; otherwise MBZ). CBOX1 described in Table E–15.
Table E–8: 21164PC Select Desired Events for OpenVMS Alpha and DIGITAL UNIX (Continued)

Bits  Name    Meaning
7:4   PCSEL1  Counter 1 event selection. PCSEL1 described in Table E–13.
3:0   PCSEL2  Counter 2 event selection. PCSEL2 described in Table E–14.
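The constraint in Table E–7 that CBOX1 and CBOX2 are only meaningful when PCSEL1 or PCSEL2 is 15 is easy to get wrong when assembling the argument by hand. The sketch below shows one way to build a 21164 "select desired events" argument in Python. It is illustrative only: the function name select_events_arg is hypothetical, the PCSEL1 (7:4) and PCSEL2 (3:0) positions are assumed to match those shown for the 21164PC in Table E–8 and for PMCTR in Table E–3, and bits 18:8, which are not described in the text above, are simply left as zero.

```python
def select_events_arg(pcsel0, pcsel1, pcsel2, cbox1=0, cbox2=0):
    """Build a 21164 select-desired-events argument (hypothetical helper).

    Assumes PCSEL1 occupies bits 7:4 and PCSEL2 bits 3:0; all other
    bits not named in Table E-7 are left clear.
    """
    if pcsel0 not in (0, 1):
        raise ValueError("PCSEL0 is a single bit: 0 = cycles, 1 = issues")
    if not 0 <= pcsel1 <= 15 or not 0 <= pcsel2 <= 15:
        raise ValueError("PCSEL1 and PCSEL2 are 4-bit fields")
    if not 0 <= cbox1 <= 7 or not 0 <= cbox2 <= 7:
        raise ValueError("CBOX1 and CBOX2 are 3-bit fields")
    # CBOXn only has meaning when PCSELn is 15; otherwise it must be zero.
    if cbox1 != 0 and pcsel1 != 15:
        raise ValueError("CBOX1 must be zero unless PCSEL1 is 15")
    if cbox2 != 0 and pcsel2 != 15:
        raise ValueError("CBOX2 must be zero unless PCSEL2 is 15")
    return (
        (pcsel0 << 31)
        | (cbox2 << 22)
        | (cbox1 << 19)
        | (pcsel1 << 4)
        | pcsel2
    )
```

Rejecting a nonzero CBOX field when the matching PCSEL field is not 15 is a design choice of this sketch; it mirrors the "otherwise MBZ" wording rather than silently clearing the field.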
Table E–10: 21164/21164PC Select Desired Frequencies for OpenVMS Alpha and DIGITAL UNIX

Table E–10 contains the selection definitions for each of the three counters.
Table E–11: 21164/21164PC Read Counters for OpenVMS Alpha and DIGITAL UNIX

Bits   Meaning When Returned
63:48  Counter 0 returned value
47:32  Counter 1 returned value
31:30  MBZ
29:16  Counter 2 returned value
15:1   MBZ
0      Set means success; clear means failure

Table E–12: 21164/21164PC Write Counters for OpenVMS Alpha and DIGITAL UNIX

Bits   Meaning
63:48  Counter 0 written value
47:32  Counter 1 written value
31:30  MBZ
29:16  Counter 2 written value
15:0   MBZ

Table E–13: 21164/21164PC Counter
Table E–13: 21164/21164PC Counter 1 (PCSEL1) Event Selection (Continued)

The following values choose the counter 1 (PCSEL1) event selection:

Value  Meaning
9      Integer operate instructions
10     Floating point operate instructions
11     Load instructions
12     Store instructions
13     Instruction cache access
14     Data cache access
15     For the 21164, use CBOX1 event selection in Table E–15. For the 21164PC, use PM0_MUX event selection in Table E–17.
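The read and write counter layouts in Tables E–11 and E–12 place the three counter values in the same bit positions; the read form adds a success flag in bit 0. As an illustration only, the following Python sketch unpacks a returned quadword and packs a value to write. The function names are hypothetical, and the sketch assumes the caller already holds the 64-bit value returned by the read function or needs one to pass to the write function.

```python
def unpack_read_counters(returned):
    """Decode the quadword returned by the read-counters function (Table E-11)."""
    return {
        "counter0": (returned >> 48) & 0xFFFF,  # bits 63:48
        "counter1": (returned >> 32) & 0xFFFF,  # bits 47:32
        "counter2": (returned >> 16) & 0x3FFF,  # bits 29:16 (14 bits)
        "success": bool(returned & 1),          # bit 0: set means success
    }


def pack_write_counters(counter0, counter1, counter2):
    """Encode the argument for the write-counters function (Table E-12)."""
    if not 0 <= counter0 <= 0xFFFF or not 0 <= counter1 <= 0xFFFF:
        raise ValueError("counters 0 and 1 are 16-bit values")
    if not 0 <= counter2 <= 0x3FFF:
        raise ValueError("counter 2 is a 14-bit value")
    # Bits 31:30 and 15:0 are MBZ, so nothing else is set.
    return (counter0 << 48) | (counter1 << 32) | (counter2 << 16)
```

Note that counter 2 is narrower than the other two (14 bits rather than 16), which is why it gets its own range check.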
Table E–15: 21164 CBOX1 Event Selection

The following values choose the CBOX1 event selection.

Value  Meaning
0      S-cache access
1      S-cache read
2      S-cache write
3      S-cache victim
4      Unused value
5      B-cache hit
6      B-cache victim
7      System request

Table E–16: 21164 CBOX2 Event Selection

The following values choose the CBOX2 event selection.
Table E–17: 21164PC PM0_MUX Event Selection

The following values choose the PM0_MUX event selection and perform the chosen operation in Counter 0.

Value  Meaning
0      B-cache read operations
1      B-cache D read hits
2      B-cache D read fills
3      B-cache write operations
4      Undefined
5      B-cache clean write hits
6      B-cache victims
7      Read miss 2 launched

Table E–18: 21164PC PM1_MUX Event Selection

The following values choose the PM1_MUX event selection and perform the chosen operation in Counter 1.
E.2.3 21264 Performance Monitoring

PALcode instructions control the 21264 on-chip performance counters. For OpenVMS Alpha, the instruction is MTPR_PERFMON; for DIGITAL UNIX and Windows NT Alpha, the instruction is wrperfmon. The instruction arguments and results are described in the following sections. The scratch register usage is operating system specific.

Two 20-bit on-chip counters count events. Counters can be individually programmed, read, and written.
For the Windows NT Alpha Operating System

When a counter overflows and interrupt enabling conditions are correct, the counter causes an interrupt to PALcode. The PALcode builds a frame on the kernel stack and dispatches to the kernel at the interrupt entry point.

E.2.3.2 Windows NT Alpha Functions and Argument

The functions for Windows NT Alpha execute on only a single (the currently running) processor.
Table E–19: Bit Summary of PCTR_CTL Register for Windows NT Alpha Bits Name Meaning 3–2 SL1 PCTR1 input selector. If SL0 value is 0: Bit value Meaning 0000 0001 Counter 1 counts cycles. Counter 1 counts retired conditional branches. Counter 1 counts retired branch mispredicts. Counter 1 counts retired DTB single misses * 2. Counter 1 counts retired DTB double double misses. Counter 1 counts retired ITB misses. Counter 1 counts retired unaligned traps. Counter 1 counts replay traps.
Table E–20: OpenVMS Alpha and DIGITAL UNIX Performance Monitoring Functions

Function / Register Usage          Comments

Disable performance monitoring
  DIGITAL UNIX
    Input:   a0 = 0                Function code value
             a1 = arg              Argument from Table E–22
  OpenVMS Alpha
    Input:   R16 = 0               Function code value
             R17 = arg             Argument from Table E–22

Select desired events (MUX_SELECT)
  DIGITAL UNIX
    Input:   a0 = 2                Function code value
             a1 = arg              Argument from Table E–23
  OpenVMS Alpha
    Input:   R16 = 2               Function code value
             R17 = arg             Argument from Table
Table E–20: OpenVMS Alpha and DIGITAL UNIX Performance Monitoring Functions

Function / Register Usage          Comments

Write the counters
  DIGITAL UNIX
    Input:   a0 = 6                Function code value
             a1 = arg              Argument from Table E–25
  OpenVMS Alpha
    Input:   R16 = 6               Function code value
             R17 = arg             Argument from Table E–25

Enable and write selected counters
  DIGITAL UNIX
    Input:   a0 = 7                Function code value
             a1 = arg              Argument from Table E–26
  OpenVMS Alpha
    Input:   R16 = 7               Function code value
             R17 = arg             Argument from Table E–26

Table
Table E–23: 21264 Select Desired Events for OpenVMS Alpha and DIGITAL UNIX R17/a1 Bits Meaning 4 Bit value Meaning 1 0 3–2 Counter 0 counts retired instructions. Counter 0 counts cycles. Bit value Meaning 0000 0001 0010 0011 0100 0101 0110 0111 Counter 1 counts cycles. Counter 1 counts retired conditional branches. Counter 1 counts retired branch mispredicts. Counter 1 counts retired DTB single misses * 2. Counter 1 counts retired DTB double double misses. Counter 1 counts retired ITB misses.
Table E–25: 21264 Write Counters for OpenVMS Alpha and DIGITAL UNIX

R17/a1 Bits  Meaning
5–2          Reserved
1            When set, write to Counter 1
0            When set, write to Counter 0

Table E–26: 21264 Enable and Write Counters for OpenVMS Alpha and DIGITAL UNIX

R17/a1 Bits  Meaning
63–48        Reserved
47–28        Counter 0 value to write; writing zeroes clears the counter
27–26        Reserved
25–6         Counter 1 value to write; writing zeroes clears the counter
5–2          Reserved
1            When set, enable and write to Counter 1
0            When set,
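The Table E–26 layout can be illustrated with a short packing routine. This is a sketch under stated assumptions, not a definitive implementation: the function name enable_and_write_arg is hypothetical, and because the description of bit 0 is cut off above, the sketch assumes bit 0 selects Counter 0 in the same way that bit 1 selects Counter 1 (which matches the bit assignment in Table E–25).

```python
COUNTER_MAX = (1 << 20) - 1  # the 21264 counters are 20 bits wide


def enable_and_write_arg(counter0=None, counter1=None):
    """Build an R17/a1 argument for "enable and write selected counters".

    Pass a value for each counter to enable and write; pass None to
    leave that counter unselected. Assumes bit 0 selects Counter 0.
    """
    arg = 0
    if counter0 is not None:
        if not 0 <= counter0 <= COUNTER_MAX:
            raise ValueError("Counter 0 value must fit in 20 bits")
        arg |= (counter0 << 28) | 0b01  # value in bits 47-28, select bit 0
    if counter1 is not None:
        if not 0 <= counter1 <= COUNTER_MAX:
            raise ValueError("Counter 1 value must fit in 20 bits")
        arg |= (counter1 << 6) | 0b10   # value in bits 25-6, select bit 1
    return arg
```

Passing a value of zero for a counter both selects it and clears it, consistent with the note that writing zeroes clears the counter.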