AlphaServer GS80/160/320 Service Manual Order Number: EK–GS320–SV. D01 This manual is for service providers who maintain Compaq AlphaServer GS80/160/320 systems.
Revised February, 2001 © 2001 Compaq Computer Corporation. Compaq, the Compaq logo, and AlphaServer registered in U.S. Patent and Trademark Office. OpenVMS and Tru64 are trademarks of Compaq Information Technologies Group, L.P. in the United States and other countries. Portions of the software are © copyright Cimetrics Technology. Linux is a registered trademark of Linus Torvalds in several countries. UNIX is a trademark of The Open Group in the United States and other countries.
Japanese Notice

Canadian Notice
This Class A digital apparatus meets all requirements of the Canadian Interference-Causing Equipment Regulations.

Avis Canadien
Cet appareil numérique de la classe A respecte toutes les exigences du Règlement sur le matériel brouilleur du Canada.

European Union Notice
Products with the CE Marking comply with both the EMC Directive (89/336/EEC) and the Low Voltage Directive (73/23/EEC) issued by the Commission of the European Community.
Contents
Preface
Chapter 1 System Overview
1.1 GS160/320 System Cabinets
1.2 GS160/320 System Building Block
1.13.1 H-switch Clock Module
1.13.2 Dual-Output Clock Module
1.13.3 Clock Splitter Module
Chapter 2 Power-Up
2.1 Operator Control Panel
2.1.1 Control Panel Assembly
2.1.2 Fluorescent Display Messages
2.2 Power-Up Test Flow – Init. and Phase 0
2.3 Power-Up Test Flow – Phase 1
3.12.1 Compaq Analyze Using a Web Browser
3.12.2 Problem Found Report
3.12.3 Description of the Error (660)
Chapter 4 GS160/320 Component Removal and Replacement
Chapter 5 Power Cabinet Component Removal and Replacement
5.1 PCI Modules
5.1.1 Standard I/O Module Removal and Replacement
5.1.2 Console Serial Bus Node ID Module Removal and Replacement
5.1.3 Remote I/O Riser Removal and Replacement
6.16 AC Input Box Removal and Replacement
Appendix A Power Distribution Rules
A.1 GS160/320 Power Cabinet Configuration and Cabling
A.2 Expander Cabinet Configuration and Cabling
A.3 GS80 Power Cabling
Appendix B Cache Coherency
Appendix C Power-Up Diagnostic Error Table
Appendix D Firmware Updates
D.1 System Firmware That May Require Updates
D.2 Preparations for Firmware Updates
D.2.1 Partitions
D.2.2 Hardware Connections

Examples
1–1 Shutting Down a Partition
2–1 System Control Manager Power-Up Display
2–2 SCM Power-Up Display (OCP On)
2–3 Examples of the SCM Error Display
3–1 Console Power-Up Error Messages
Fptest
Fakedisk
Nettest
Booting the Firmware CD-ROM

Figures
1–11 SMC Connections
1–12 System Control Manager Block Diagram
1–13 Power System Manager Software Block Diagram
2–1 Control Panel
2–2 Power-Up Flowchart – Init. and Phase 0
2–3 Power-Up Flowchart – Phase 1
2–4 Power-Up Flowchart – Phase 2, 3, and 4
3–1 System LEDs
5–15 Terminal Server Removal
5–16 Power Supply Removal
5–17 Power Subrack Removal
5–18 AC Input Box Removal
6–1 Drawer Modules Location
B–12 Inval-to-Dirty, Full Block Write Coherency Store Flows
D–1 Connecting a Laptop to the Local Terminal Port

Tables
1 Compaq AlphaServer GS80/160/320 Documentation
1–1 Address Ranges Seen at the CPU
1–2 Address Ranges Seen in the System
4–3 FRU Cables
4–4 Power States
4–5 FRU Power Swap States
4–6 Power Color Coding
4–7 Module Color Codes
Preface Intended Audience This manual is for service providers who maintain Compaq AlphaServer GS80/160/320 systems. Document Structure This manual uses a structured documentation design. Topics are organized into small sections, usually consisting of two facing pages. Most topics begin with an abstract that provides an overview of the section, followed by an illustration or example. The facing page contains descriptions, procedures, and definitions.
• Appendix C, Power-Up Diagnostic Error Table, lists test numbers, errors, and the likely FRU if an SROM or XSROM diagnostic fails.
• Appendix D, Firmware Updates, describes methods for updating firmware and unjamming the communications link.
Information on the Internet
Visit the following Web sites for service tools and more information about the AlphaServer GS80/160/320 systems:
AlphaServer site: www.compaq.com/alphaserver/site_index.html
General Support: http://www.compaq.com/services
Console Firmware: http://ftp.digital.com/pub/Digital/Alpha/firmware/readme.html
Supported Options List: http://www.compaq.com/alphaserver/products/options.html
Operating System Patches: http://www.support.compaq.com/patches/index.html
WEBES/Compaq Analyze: http://www.
Chapter 1 System Overview The AlphaServer GS80/160/320 systems have two different design centers: one with a small footprint and up to 8 CPUs, the other with a larger footprint that expands to 32 CPUs. This chapter describes both types of systems, their components, and their system enclosures. Most of the components between the two are interchangeable.
1.1 GS160/320 System Cabinets Two cabinets are required for a GS160 system; three are required for a GS320 system.
Figure 1–1 shows the front view of the GS320 system cabinets. Systems from 1 to 32 CPUs, from 4 Gbytes to 256 Gbytes of memory, and from 13 to 27 PCI slots for I/O options can be built in these cabinets. Expander cabinets containing additional storage and/or PCI I/O capacity are optional and can bring the total number of PCI slots to 224. For storage configuration rules, see Appendix A and the AlphaServer GS80/160/320 User’s Guide.
1.2 GS160/320 System Building Block The basic system building block for these systems is the quad building block or QBB. A QBB consists of a backplane, up to four CPUs, up to four memory modules, a directory module, up to two I/O riser modules, a global port, two power modules, a power system manager module, and a clock splitter module. The maximum number of QBBs in a GS160 is four and in the GS320 is eight. Each QBB has an ID number from 0 to 7.
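The building-block limits above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the constant and function names are assumptions made for the example and do not correspond to any Compaq tool; the limits themselves are the ones stated in this section.

```python
# Illustrative sketch; names are hypothetical, limits come from the text above.
MAX_CPUS_PER_QBB = 4
MAX_MEMORY_MODULES_PER_QBB = 4
MAX_QBBS = {"GS160": 4, "GS320": 8}


def max_cpus(model):
    """Largest CPU count for a model when every QBB holds four CPUs."""
    return MAX_QBBS[model] * MAX_CPUS_PER_QBB


def max_memory_modules(model):
    """Largest memory module count when every QBB holds four modules."""
    return MAX_QBBS[model] * MAX_MEMORY_MODULES_PER_QBB


print(max_cpus("GS160"))            # 16
print(max_cpus("GS320"))            # 32
print(max_memory_modules("GS320"))  # 32
```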
Figure 1–2 shows a QBB backplane and its connectors. Module locations are identified by callouts.
(1) 48V/Vaux power connector (present only on even numbered QBBs in the rear of the system)
(2) Global port module connector. Note that there are two backplanes: the one for the front QBB has the global port connector on its back facing the rear of the cabinet and the one for the back QBB has the global port connector on its front also facing the rear of the cabinet.
1.3 GS160/320 System Box A system box contains two QBBs back to back.
Figure 1–3 shows the rear view of a system box. Each CPU and each memory module is assigned a physical ID associated with the slot in the QBB backplane in which it resides. A GS320 system can have up to four system boxes each with two QBBs. When all QBBs contain the maximum of four CPUs, a system containing 32 processors is created. Global ports must be physically close to each other and to the hierarchical switch and therefore are connected either to the front of a QBB backplane or to the back.
1.4 GS80 Rack Cabinet The AlphaServer GS80 system is in a single rack cabinet.
Figure 1–4 shows the front view of the GS80 rack system. A GS80 can have from one to eight CPUs, from 1 Gbyte to 64 Gbytes of memory, and from 13 to 41 PCI slots. Configurations depend upon options chosen for given base systems.
1.5 GS80 System Drawer The system drawer QBB is the building block for the smaller system. The drawer contains a backplane, CPU(s), memory module(s), I/O riser(s), power modules, a power system manager, a clock splitter, and, if there are two system drawers in the cabinet, a directory.
Figure 1–5 shows a system drawer backplane and the location of module and cable connectors.
1.6 Operator Control Panel The control panel is located in the front door of the power cabinet. It has a three-position On/Off switch, three pushbuttons, three status LEDs, and an ASCII/graphical vacuum fluorescent display.

Figure 1–6 Control Panel Assembly

Users control the basic state of the system by use of pushbuttons and a keyswitch on the operator control panel (OCP). LEDs and a fluorescent display provide visual evidence of the system state.
The callouts in Figure 1–6 point to these components on the control panel.
(1) Secure LED – When lit, indicates that the keyswitch is in the Secure position and the system is powered on. All pushbuttons and SCM functions are disabled, including remote access to the system.
(2) Power OK LED – When lit, indicates that the system is powered on and remote console operations are enabled. (Keyswitch in On position.
1.7 Hierarchical System Architecture AlphaServer GS80/160/320 systems are distributed shared-memory multiprocessor systems with up to eight 4-processor QBBs interconnected by an 8x8 hierarchical switch (H-switch). The system provides a single address space shared by all processors, though memory is physically distributed over all nodes (QBBs) in the system. 1.7.
Figure 1–7 shows a single QBB. CPUs access memory and I/O through the local 11-port switch. In a four-processor (4-P) system, no communication off the QBB backplane, other than I/O and system management, is necessary. Therefore, neither the global port nor the directory modules are needed. Not shown in the diagram is the console serial bus used for system management. See Section 1.8.1 for information on the console serial bus.
1.7.2 The Secondary Switch The global port performs second-level switching and, along with the directory module, tracks the state of memory in other QBBs.
Figure 1–8 shows an 8-processor, two QBB system. Such a system can be built using a rack cabinet and two drawers (a GS80 system) or a system and power cabinet and a loaded system box (an 8-P GS160 system). This configuration is the maximum for the rack GS80 system. The directory contains state information on each 64-byte (cache-block-size) chunk of main memory in the system.
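Because the directory tracks memory in 64-byte units, a byte address can be reduced to a block number with a single division. The sketch below is illustrative only; the function names are assumptions, and it says nothing about how the directory hardware actually indexes its entries.

```python
# Illustrative sketch; only the 64-byte block size comes from the text.
CACHE_BLOCK_BYTES = 64
GBYTE = 1 << 30


def block_index(addr):
    """Number of the 64-byte block that contains a byte address."""
    return addr // CACHE_BLOCK_BYTES


def blocks_tracked(memory_bytes):
    """How many 64-byte blocks make up a memory of the given size."""
    return memory_bytes // CACHE_BLOCK_BYTES


print(block_index(0x1234))         # 72
print(blocks_tracked(4 * GBYTE))   # 67108864
```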
1.7.3 The Hierarchical Switch The hierarchical switch is an 8-port switch that connects up to 8 QBBs.
Figure 1–9 is a block diagram of a 16-processor GS160 system. It consists of two full system boxes with two QBBs in each. In this case, only four of the eight ports in the hierarchical switch (H-switch) are used to pass command/address and data between nodes. Since these systems use distributed memory, the hierarchical switch is required to help maintain systemwide coherency. First, it supports multicasting.
1.7.4 Addressing The CPU chip and the rest of the system use slightly different address formats.

Table 1–1 Address Ranges Seen at the CPU

Home QBB   Memory Space Address             I/O Space Address
0          000.0000.0000 - 00f.ffff.ffff    ff0.0000.0000 - fff.ffff.ffff
1          010.0000.0000 - 01f.ffff.ffff    fe0.0000.0000 - fef.ffff.ffff
2          020.0000.0000 - 02f.ffff.ffff    fd0.0000.0000 - fdf.ffff.ffff
3          030.0000.0000 - 03f.ffff.ffff    fc0.0000.0000 - fcf.ffff.ffff
4          040.0000.0000 - 04f.ffff.ffff    fb0.0000.0000 - fbf.ffff.
The memory system functions as a single, distributed, tightly-coupled shared memory. The system’s memory address space and I/O address space are distributed in segments across a system’s QBBs. Each memory address maps to one and only one memory module, on one and only one QBB. Each I/O address maps to one and only one I/O device, on one and only one QBB. The QBB onto which a memory or I/O address maps is referred to as that address’s “Home” QBB.
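The rows of Table 1–1 follow a regular pattern: each QBB owns a 64-Gbyte (2^36-byte) memory segment counted up from zero and an equally sized I/O segment counted down from the top of the 44-bit address space. The sketch below reproduces that pattern. It assumes the pattern continues unchanged for every QBB ID from 0 to 7, which the rows shown here suggest but do not prove, and all function names are hypothetical.

```python
# Illustrative sketch of the CPU-side address map in Table 1-1.
# Assumption: the pattern in the visible rows holds for QBB IDs 0 through 7.
QBB_SPAN = 1 << 36  # 64 Gbytes per QBB segment


def fmt(addr):
    """Format a 44-bit address in the manual's dotted style."""
    h = f"{addr:011x}"
    return f"{h[:3]}.{h[3:7]}.{h[7:]}"


def memory_range(qbb):
    """First and last memory address homed on a QBB."""
    base = qbb * QBB_SPAN
    return base, base + QBB_SPAN - 1


def io_range(qbb):
    """First and last I/O address homed on a QBB."""
    base = (0xFF - qbb) * QBB_SPAN
    return base, base + QBB_SPAN - 1


def home_qbb(addr):
    """Return ('memory' or 'io', home QBB ID) for a CPU-side address."""
    top = addr >> 36  # upper 8 bits of the 44-bit address
    if top <= 7:
        return "memory", top
    if 0xF8 <= top <= 0xFF:
        return "io", 0xFF - top
    raise ValueError("address is outside the ranges modeled here")


for qbb in range(4):
    mem_lo, mem_hi = memory_range(qbb)
    io_lo, io_hi = io_range(qbb)
    print(qbb, fmt(mem_lo), "-", fmt(mem_hi), " ", fmt(io_lo), "-", fmt(io_hi))
```

Printing the first four rows this way gives the same ranges as the table, which is a quick check that the shift-by-36 reading of the dotted notation is right.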
1.8 System Management Architecture AlphaServer GS80/160/320 systems use an independent multi-drop serial bus, powered by auxiliary voltage (Vaux), to configure, monitor, and control the system and its partitions either locally or remotely. 1.8.1 Console Serial Bus The console serial bus (CSB) is controlled by the system control manager microprocessor (SCM) on the standard I/O module in the required master PCI box.
The system management console (SMC) is a front-end PC that serves as the local console for the system. See Section 1.8.2. A modem for remote control connects directly to the SMC PC through the modem port. Another modem connected to the standard I/O module is used for system-initiated service calls. The CSB uses a polled master/slave protocol in which a single master controls the network. The master, in this case the SCM, sends commands to the slaves, which respond to them.
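The polled master/slave pattern described above can be sketched as follows. This is a generic illustration of polling, not the CSB protocol itself: the class names, the node IDs, and the STATUS command are all hypothetical.

```python
# Generic polled master/slave sketch. Class names, node IDs, and the
# "STATUS" command are hypothetical; the real CSB message format is not shown.
class Slave:
    """A bus node that speaks only when the master addresses it."""

    def __init__(self, node_id, status="ok"):
        self.node_id = node_id
        self.status = status

    def respond(self, command):
        if command == "STATUS":
            return f"node {self.node_id}: {self.status}"
        return f"node {self.node_id}: unknown command"


class Master:
    """The single bus master (the role the SCM plays on the CSB)."""

    def __init__(self, slaves):
        self.slaves = {s.node_id: s for s in slaves}

    def poll(self, command="STATUS"):
        # Slaves never transmit on their own; the master addresses each in turn.
        return [self.slaves[n].respond(command) for n in sorted(self.slaves)]


master = Master([Slave(2), Slave(1, "overtemp")])
for line in master.poll():
    print(line)
```

Because only the master initiates traffic, two slaves can never collide on the shared bus, which is the usual reason for choosing polling on a multi-drop serial line.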
1.8.2 System Management Console The system management console (SMC) is a front end PC running terminal emulator(s) and connected either directly to the master SCM or to a terminal server connected to all SCMs including the master.
Figure 1–11 shows the connections made to the system management console and connections made from it to the terminal server located above the AC input boxes in the power cabinet of a GS160/320 system. By running as many terminal emulation sessions as SCM/SRM consoles, the SMC has control of each SCM/SRM/partition in the system. (Each partition, hard or soft, requires a console. See Section 1.9 for information on partitions.
1.8.3 System Control Manager (SCM) The system control manager (SCM) is primarily responsible for two tasks: remote management and serving as master of the console serial bus (CSB). The SCM is on the standard I/O module in a master PCI box.
The system control manager (SCM) uses the console serial bus to:
• Control system power-up
• Configure the system
• Monitor the system
• Update firmware
• Power on and power off the system, locally or remotely
• Halt and reset the system

Through a microprocessor and its associated resources, the SCM receives and initiates secure remote connections. It is powered by the Vaux output of the PCI power supply that is on whenever AC is applied to the PCI box.
1.8.4 Power System Manager (PSM) In addition to configuring and monitoring the QBB, the power system manager (PSM) performs tasks at the request of the system control manager. The PSM is a microprocessor controlled subsystem responsible for power management, environmental monitoring, system reset, and initialization. The PSM is a required module in all QBBs.
Figure 1–13 shows the software block diagram for the power system manager. The PSM is responsible for power on/off, environmental system management, system initialization, system reset, and system communication in a QBB. Three I²C bus controllers control three I²C buses that route throughout the QBB. The PSM also controls the serial lines to the CPUs used to communicate with SROM/XSROM code during power-up. Like all other nodes on the CSB, it is powered by Vaux.
1.8.5 PCI Backplane Manager (PBM) In addition to configuring and monitoring the PCI I/O subsystem, the PCI backplane manager (PBM) performs tasks at the request of the system control manager (SCM). The PBM is a microprocessor controlled subsystem responsible for PCI environmental monitoring, notifying the system of unsafe conditions, PCI test, reset, and initialization. The PBM is on the PCI backplane.
Figure 1–14 is a block diagram of the PCI backplane manager (PBM). It is primarily responsible for monitoring environmental sensors on the backplane and reporting unsafe conditions. The shaded part of the block diagram is powered by Vaux and is available for use whenever AC is applied. The PBM microprocessor controls the x86 bus upon which are various control and status registers, an interface to the PCA ASIC (Section 1.21.5), and an interface to the I²C bus on the PCI backplane.
1.8.6 Hierarchical Switch Power Manager (HPM) The hierarchical switch power manager (HPM) is a microprocessor controlled subsystem responsible for power management, environmental monitoring, asynchronous reset and initialization, I²C bus management, and console serial bus communication for the H-switch.
Figure 1–15 is a block diagram of the hierarchical switch power manager (HPM) module. The HPM is responsible for monitoring environmental sensors on the H-switch and reporting unsafe conditions. The HPM is powered by Vaux which is converted to +5V and +3.3V on the module. The HPM monitors two clock signals and several power supply signals that must be good and remain good during power-on and normal operation. If any fail, the H-switch is turned off.
1.9 System Partitioning Partitions allow large systems to appear as several smaller ones from a hardware point of view, a software point of view, or both. NOTE: When considering partitions, it is helpful to separate two functions resident on the standard I/O module: the SCM function and the SRM function. Although all standard I/O modules contain SCM code, only one is master of the CSB and only one other may be eligible to become master.
Applying the rules in Table 1–4 to a GS320 with eight QBBs, a customer might set up a system as shown in Table 1–5. Such a system has three hard partitions each with the required resources to run an operating system. The configuration shows that hard partitions are confined to QBB boundaries and that no resources are shared across partitions.
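The two properties named above (hard partitions are made of whole QBBs, and no QBB is shared between partitions) lend themselves to a simple consistency check. The sketch below models only those two properties, not the full rule set of Table 1–4. The function name and the example layout are hypothetical and are not taken from Table 1–5.

```python
# Illustrative check of two hard-partition properties: whole QBBs only,
# and no QBB shared between partitions. Names and layout are hypothetical.
def check_hard_partitions(partitions, qbb_count=8):
    """partitions maps a partition name to a set of QBB IDs.

    Returns a list of problem descriptions; an empty list means the
    layout satisfies the two properties modeled here.
    """
    problems = []
    owner = {}
    for name, qbbs in partitions.items():
        if not qbbs:
            problems.append(f"{name}: no QBBs assigned")
        for q in sorted(qbbs):
            if not 0 <= q < qbb_count:
                problems.append(f"{name}: QBB {q} does not exist")
            elif q in owner:
                problems.append(f"QBB {q} shared by {owner[q]} and {name}")
            else:
                owner[q] = name
    return problems


good = {"p0": {0, 1}, "p1": {2, 3, 4}, "p2": {5, 6, 7}}
bad = {"p0": {0, 1}, "p1": {1, 2}}
print(check_hard_partitions(good))  # []
print(check_hard_partitions(bad))   # ['QBB 1 shared by p0 and p1']
```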
1.9.2 Soft Partitions Soft partitions make use of the OpenVMS Galaxy firmware functions embedded in the SRM console and PALcode firmware. Soft partitions can exist inside hard partitions.

Table 1–6 Rules Affecting Soft Partitions

Rule   Rule Description
1      A soft partition requires one or more CPU(s), memory, and an attached I/O subsystem with a standard I/O module. Soft partitions may be set up in a hard partition.
2      Soft partitions are not restricted by QBB boundaries.
For a full discussion of soft partitions and how to create them, see the AlphaServer GS80/160/320 Getting Started with Partitions. Applying the rules in Table 1–6 to a GS160 with four QBBs, a customer might set up a system as shown in Table 1–7. This system has three soft partitions each with the required resources to run an instance of an operating system. Currently, only OpenVMS supports soft partitions.
1.9.3 Mixture of Hard and Soft Partitions Hard and soft partitions can exist in a single system.
For a full discussion of both hard and soft partitions and how to create them, see AlphaServer GS80/160/320 Getting Started with Partitions. Table 1–8 describes a fully loaded 32-P system with the maximum number of CPUs (32), maximum number of PCIs (16), and the maximum number of standard I/O modules (8).
1.9.4 Servicing Partitions Partitions allow parts of a system to remain up and running while other parts of the system are powered off for service. Example 1– 1 Shutting Down a Partition …[..Shut down the operating system in a given partition ..
Service can be performed on one or more partitions while other partitions remain operational. Once the operating system running in a given partition is shut down, power can be removed from the partition without removing power from other partitions. And once a QBB is in an Off state, any module in that QBB can be replaced or added without further disruption of the system. Rules for Servicing Partitions 1. Only hard partitions can be powered off. 2.
1.10 CPU Module The CPU module uses the Alpha processor chip.
A single Alpha processor chip is on the CPU module.
1.10.1 CPU Processor The Alpha processor used in these systems is the third generation of the chip. It contains 15.2 million transistors.
Figure 1–17 is a block diagram of the 21264 Alpha processor chip.
1.11 Memory Module The memory module uses SDRAM storage elements and CMOS application-specific integrated circuits (ASICs) for interface and control logic. Each memory module holds two four-DIMM memory arrays.

Figure 1–18 Memory Module (labeled parts: DIMMs, MPD, MPA, PLLs)

Memory module features include:
• Two memory arrays consisting of four DIMMs each.
• Read error detection of single-bit errors and the most prevalent 2-bit, 3-bit, and 4-bit errors in SDRAM arrays.
• Memory interleaving is used to improve memory bandwidth by “staggering” transactions on the memory arbitration bus. See memory interleaving guidelines in the AlphaServer GS80/160/320 User’s Guide.
• Read data wrapping is used to reduce apparent memory latency by allowing quadword (8 bytes) access in a prescribed order.
• A microcontroller initiates memory built-in self-test and communicates DIMM EEPROM data to the PSM.
• Short-circuit protection.

Table 1–10 Memory Configurations DIMM Min.
1.12 Power System Manager Module The PSM is the microprocessor controller for the power subsystem.
Figure 1–19 is a block diagram of the power system manager module. For a functional description of the power system manager, see Section 1.8.4.

Figure 1–20 PSM Module LEDs and Jumpers

Figure 1–20 shows the service switch, the PSM module LEDs, and jumpers.

Service Switch
When in the Normal position, the switch allows the PSM normal operational control of the QBB. When set to the Service position, 48V does not get converted to logic voltages but may still be present on the backplane.
1.13 Clock Generation Modules AlphaServer GS80/160/320 systems use synchronous data transfers at high speed. There are three clock domains: the system clock domain, the I/O clock domain, and the PCI clock domain. There are two clock generation modules for the system clock: one for systems with one or two QBBs and one for systems with more than two QBBs. The I/O clock domain reference is generated on the clock splitter module, and the PCI clock domain reference is generated on the PCI backplane. 1.13.
The H-switch clock module is mounted above the hierarchical switch and generates a global reference clock signal from which all other system clock signals are derived. Coax cables carry the clock signal to the clock splitters in each QBB and to the built-in clock splitter on the H-switch. The clock splitter produces 48 copies of the signal that are sent to master phase lock loop devices (MPLL) associated with each ASIC (or CPU) on modules, on the QBB backplane, and on the H-switch.
1.13.2 Dual-Output Clock Module The dual-output clock module is used in a GS80 or a GS160 with a single system box.
The dual-output clock module is used in 4-P or 8-P systems only. The module is mounted on the rear left side wall of the top drawer in GS80 systems and in the distribution board housing in GS160/320 systems. It generates a global reference clock signal from which all other system clock signals are derived. Equal-length coax cables carry the clock signal to the clock splitters in each QBB.
1.13.3 Clock Splitter Module The clock splitter module converts the global reference sine wave from either clock module to 48 identical copies of a positive ECL (PECL) signal that is distributed to master phase lock loops (MPLL) associated with ASICs on the system backplane and on modules in the QBB. It also generates independent clock signals for the I/O domain.
Figure 1–23 shows a functional block diagram of the clock splitter. A clock splitter module is required in each QBB. The clock splitter receives the clock sine wave from either the H-switch clock module or the dual-output clock module and converts it into 48 copies of a positive ECL clock signal. This PECL clock signal is transmitted to master phase lock loop (MPLL) modules associated with each ASIC and CPU in the system clock domain.
1.13.4 Master Phase Lock Loop The master phase lock loop daughtercard aligns ASIC clocks to the global clock reference provided to it by the clock splitter module.
Figure 1–24 shows a functional block diagram of the master phase lock loop daughtercard (MPLL). Each ASIC in the system has an associated MPLL. To keep tight clock tolerances, the MPLLs are all deskewed so that all have the same performance. To synchronize all ASICs in the system, the global reference clock is supplied to each MPLL in the system; each MPLL supplies the clock to its associated ASIC, tests the ASIC’s delay, and then aligns the internal ASIC clock to the global reference clock.
1.14 Local I/O Riser Modules There are two local I/O riser modules: one for the GS160/320 QBBs and another for the GS80 QBBs. 1.14.1 System Box Local I/O Riser Module The system box local I/O riser module provides two I/O port interfaces to the QBB and two connections for I/O cables connected to the PCI I/O subsystem through remote I/O riser modules in PCI boxes. The module may be removed while other parts of the system remain operational. There may be up to two optional local I/O risers in each QBB.
The system box local I/O riser module provides two I/O port interfaces to the QBB I/O port (IOP). Since there is a similar I/O riser in the PCI box, it is helpful to name the riser connected to the QBB the local riser and the one connected to the PCI the remote riser. Figure 1–25 shows a block diagram of the system box local riser. There are two minilink application-specific integrated circuits (MLK ASIC) on the module, one for each port.
1.14.2 System Drawer Local I/O Riser Module The GS80 drawer local I/O riser module provides two I/O port interfaces to the QBB and two connections for I/O cables connected to the PCI I/O subsystem through remote I/O riser modules. The module cannot be removed while other parts of the system remain operational.
Figure 1–26 is a block diagram of the drawer local I/O riser modules. Together the two modules, the B4172-Ax and the B4173-Ax, are functionally identical to the system box local I/O riser module; only the mechanics of the module are different. In the GS80, the I/O cables must be brought out at a right angle to the I/O riser. The two-module design makes this possible.
1.15 Power Modules Each QBB has two power modules; the H-switch has one power module. 1.15.1 Main Power Module The main power module converts 48 VDC power supplied by the power supplies to DC voltages required by the clocks and devices on the QBB.
The power module converts 48 VDC to the following required outputs:
• -1.7VP at 24 amps
• +1.7V at 45 amps
• +3.3VP at 45 amps
• +3.3V at 90 amps

Separate converters on the module put out each voltage. The VP voltages are routed to the phase lock loop clocks throughout the QBB. Other voltages power the rest of the QBB including the modules that plug into it. Figure 1–27 shows the block diagram of the main power module and its daughtercard that contains the control logic for the converters.
1.15.2 Auxiliary Power Module Like the main power module, the auxiliary power module converts 48 VDC supplied by the power supplies to DC voltages necessary for devices other than clocks on the QBB.
The power module converts 48 VDC to the following required outputs:
• +3.3V at 135 amps
• +2.5V at 45 amps

The 2.5V is used by SRAMs that make up CPU backup cache and will be used when such DIMMs are placed in the directory and memory modules. The 3.3V is current-shared with the 3.3V output from the main power module. Separate converters, some in parallel, on the module put out each voltage.
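As a rough worked example, the output ratings listed for the main and auxiliary power modules can be turned into watts with P = V × I. The sketch below rests on two assumptions that the manual does not state: that the listed currents are maximum ratings, and that current sharing on the 3.3V rail means the two ratings simply add. The dictionary and function names are hypothetical.

```python
# Rough arithmetic on the ratings listed in Sections 1.15.1 and 1.15.2.
# Assumptions (not from the manual): currents are maximum ratings, and
# current sharing on +3.3V means the two ratings add.
MAIN = {          # rail: (volts magnitude, amps)
    "-1.7VP": (1.7, 24),
    "+1.7V": (1.7, 45),
    "+3.3VP": (3.3, 45),
    "+3.3V": (3.3, 90),
}
AUX = {
    "+3.3V": (3.3, 135),
    "+2.5V": (2.5, 45),
}


def module_watts(outputs):
    """Sum of volts times amps over every output of a module."""
    return sum(volts * amps for volts, amps in outputs.values())


shared_3v3_amps = MAIN["+3.3V"][1] + AUX["+3.3V"][1]

print(round(module_watts(MAIN), 1))  # 562.8
print(round(module_watts(AUX), 1))   # 558.0
print(shared_3v3_amps)               # 225
```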
1.15.3 Hierarchical Switch Power Module The hierarchical switch power module converts 48 VDC power to the voltages necessary for the H-switch.
Figure 1–29 shows the block diagram of the hierarchical switch power supply. The module converts 48 VDC to the following required outputs:
• -1.7VP at 2.5 amps
• +3.3V at 5.5 amps
• +1.7V at 14 amps
• +5.0V at 0.15 amps
• +2.5V at 7.5 amps
• +15V at 0.5 amps
• +3.3VP at 5 amps

Only one power module is required for the H-switch; the second is redundant and either module is hot swappable.
1.15.4 Short-Circuit Protection Module The short-circuit protection module is a small daughtercard that protects against short circuits on modules and backplanes throughout the system. In some cases, the protection circuit has been designed into the module so the daughtercard is not used. There are two parts: one for remote I/O risers and one for other modules and backplanes.
Figure 1–30 shows both a block diagram of the short-circuit protection module and its interconnect in the QBB and drawer. Similar interconnects exist in the PCI where the SCP is installed on the PCI backplane, the two remote I/O risers, and the standard I/O module. At present (August 2000), the functions performed by the SCP are designed into the CPU module, the H-switch, the global port, and the GS80 backplane.
1.16 Directory Module The directory module uses DIMMs populated with SDRAMs to track ownership and state of memory addresses local to a QBB. A directory module is necessary in each QBB in systems with more than one QBB.
The directory module is associated with the local memory contained within a QBB. In systems with more than one QBB, a directory module is required in each. The directory functions as the focal point for memory coherency. It is used to store the processor ID of the current owner and node masks or presence bits of the nodes that have acquired shared copies of a cache block belonging to memory in the local QBB.
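To make the bookkeeping concrete, a directory entry can be pictured as an owner field plus one presence bit per QBB. The sketch below is a conceptual model only: the class, its field names, and the eight-bit mask width (one bit per QBB, matching the eight-QBB maximum) are assumptions for illustration, and the actual layout of the directory DIMMs is not described here.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_QBBS = 8  # assumption: one presence bit per QBB, eight QBBs at most


@dataclass
class DirectoryEntry:
    """Hypothetical model of the state kept for one cache block."""

    owner_cpu: Optional[int] = None  # processor ID of the current owner
    presence_mask: int = 0           # bit n set: QBB n holds a shared copy

    def add_sharer(self, qbb: int) -> None:
        """Record that a node has acquired a shared copy of the block."""
        if not 0 <= qbb < MAX_QBBS:
            raise ValueError(f"QBB number out of range: {qbb}")
        self.presence_mask |= 1 << qbb

    def sharers(self) -> List[int]:
        """List the QBBs whose presence bit is set."""
        return [q for q in range(MAX_QBBS) if (self.presence_mask >> q) & 1]


if __name__ == "__main__":
    entry = DirectoryEntry(owner_cpu=2)
    entry.add_sharer(1)
    entry.add_sharer(3)
    print(entry.owner_cpu, entry.sharers())
```

Keeping the sharers as a bit mask rather than a list is a common choice for this kind of structure because membership updates are single bit operations; whether the hardware does exactly this is not stated in the text.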
1.17 Global Port Module There are two global port modules, one for QBBs in the front of the system box and one for QBBs in the rear of the system box.
Figure 1–32 shows the two global port modules used in the system box. One, the B4180-Bx, is used for the QBB in the front of the system box. The other, the B4181-Bx, is for the QBB in the back of the system box. There is no global port module for drawer systems, since the function of the global port is built into the drawer backplane. For a functional description of the global port, see Section 1.7.2. The rear global port plugs into the front of the rear QBB’s backplane.
1.18 Global Port Distribution Board There are two QBB distribution boards, one for GS160/320 systems and one for GS80 systems. The distribution module connects the transmit signals from each of two global ports to the receivers on the other.
Figure 1–33 shows a simple block diagram of the B4186-Ax distribution board and the layout for both distribution boards. These boards are used in systems that have two QBBs. In systems with more than two QBBs, the hierarchical switch actively performs this switching function. The B4186-Ax module, used with the system box, is an active module because of the CSB connection. The B4185-Ax, used in GS80 systems, does not have a CSB connection and is completely passive.
1.19 Hierarchical Switch The hierarchical switch allows up to eight QBBs to communicate with each other simultaneously.
Figure 1–34 shows the hierarchical switch. For a functional description of the H-switch, see Section 1.7.3. The hierarchical switch has eight ports. Each port consists of two unidirectional buses, one in and one out, each with a 2-Gbyte/second raw bandwidth. The functions of the hierarchical switch are implemented in six ASICs, two HSAs for addresses and four HSDs for data.
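The raw figures above multiply out as follows. This is a back-of-the-envelope sketch, assuming every port drives both of its buses at the full raw rate at the same time; the names are introduced here for illustration, and sustained throughput would be lower than the raw number.

```python
# Figures taken from the paragraph above; names are hypothetical.
PORTS = 8                       # the hierarchical switch has eight ports
BUSES_PER_PORT = 2              # one inbound and one outbound bus per port
RAW_GBYTES_PER_SEC_PER_BUS = 2  # raw bandwidth of each unidirectional bus


def aggregate_raw_bandwidth(active_ports=PORTS):
    """Raw Gbytes/second summed over both directions of the active ports."""
    if not 0 <= active_ports <= PORTS:
        raise ValueError("active_ports must be between 0 and 8")
    return active_ports * BUSES_PER_PORT * RAW_GBYTES_PER_SEC_PER_BUS


if __name__ == "__main__":
    # A four-QBB system uses four ports; a full system uses all eight.
    print(aggregate_raw_bandwidth(4), aggregate_raw_bandwidth(8))
```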
1.20 H-switch Power Manager Module The H-switch power manager (HPM) is a microprocessor-controlled subsystem responsible for H-switch power management, environmental monitoring, asynchronous reset and initialize, I²C bus management, and CSB communication.
Figure 1–35 is a drawing of the location of the H-switch power manager (HPM). For a functional description of the HPM, see Section 1.8.6. In GS80 and GS160 systems with only one system box the HPM is not present. In all other system configurations that require the hierarchical switch an HPM is present. The module jumpers and LEDs are as follows. Jumpers The module has three two-position jumpers, none of which are normally installed.
1.21 PCI Subsystem The I/O subsystem consists of the local and remote I/O risers and the PCI box to which they are attached. There are two types of PCI boxes: a master PCI box and an expansion PCI box. The master PCI box has the devices necessary to test and boot the system; the expansion PCI box does not.
1.21.1 PCI I/O Subsystem Interconnect The PCI I/O subsystem is connected to QBBs through local and remote I/O riser modules and their cables.
Figure 1–36 shows all the major I/O subsystem components. The QBB backplane contains the IOP ASICs and two local I/O riser slots. The local I/O riser provides two I/O ports that are cabled to the remote I/O riser modules connected to the PCI backplane. The PCI backplane contains 14 PCI slots spread over four logical 64-bit PCI buses. Two of those buses contain four PCI slots and two contain three PCI slots.
1.21.2 PCI Backplane The PCI backplane contains the connectors for the remote I/O risers, PCI devices, and the standard I/O module. Much of the logic on the PCI backplane is dedicated to communicating with devices on the PCI buses and controlling the interface with the PCA ASICs on the remote I/O risers. The PCI backplane also contains the PBM microprocessor connected to the CSB.
Figure 1–37 shows the layout of the PCI backplane. All the PCI slots and riser slots are labeled. Note that the slot at the far right of the drawing is labeled 0-0/1. In a master PCI box the slot is occupied by the standard I/O module and PCI slot 0-0 is not available. In an expansion PCI box, which does not have a standard I/O module, PCI slot 0-1 is available. The PCI backplane manager (PBM) resides on the PCI backplane and is powered by Vaux. See Section 1.8.
1.21.3 PCI Box Configuration Each QBB can have two I/O risers supporting up to two PCI boxes. A cable connects a local I/O riser (in the QBB) to a remote I/O riser (in the PCI box). Each PCI box can have up to two remote I/O risers, creating two three-slot and two four-slot 64-bit PCI buses. Cable connectors for the two remote I/O risers are shown as Riser 0 and Riser 1 in Figure 1–38. PCI slots and logical hoses are listed in Table 1–12.
CAUTION: Installing a full-length module next to the standard I/O module requires extra care due to cabling on the standard I/O module. Logical Hoses You can have a maximum of four logical hoses per PCI box. Logical hose numbers are assigned by the firmware. Logical hoses are numbered from 0 to 63 and are assigned in blocks of eight to each QBB. QBB0 is assigned hoses 0 – 7, QBB1 hoses 8 – 15, … QBB7 hoses 56 – 63.
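The hose numbering rule above (blocks of eight per QBB) reduces to integer arithmetic. The sketch below illustrates that rule only; the function names are hypothetical, and the firmware's own assignment logic is not shown in this manual.

```python
HOSES_PER_QBB = 8  # hoses are assigned in blocks of eight
MAX_QBBS = 8       # QBB0 through QBB7, so hoses run from 0 to 63


def hose_range(qbb):
    """Logical hose numbers assigned to a QBB, e.g. QBB1 gets 8 to 15."""
    if not 0 <= qbb < MAX_QBBS:
        raise ValueError(f"QBB number out of range: {qbb}")
    first = qbb * HOSES_PER_QBB
    return range(first, first + HOSES_PER_QBB)


def qbb_for_hose(hose):
    """QBB that owns a given logical hose number."""
    if not 0 <= hose < MAX_QBBS * HOSES_PER_QBB:
        raise ValueError(f"hose number out of range: {hose}")
    return hose // HOSES_PER_QBB


if __name__ == "__main__":
    print(list(hose_range(1)))  # hoses belonging to QBB1
    print(qbb_for_hose(56))     # owner of hose 56
```

For instance, a console message that mentions hose 8 refers to the first hose of QBB1 under this rule.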
1.21.4 Standard I/O Module The standard I/O module is central to the system management architecture. It provides basic I/O devices necessary for testing and configuring the system and is the location of the system control manager (SCM) and the system console (SRM). It is located in the master PCI box in the power cabinet. At least one is required though there may be up to eight to support partitions. When the SCM is in pass through mode, it becomes the local console.
The standard I/O module provides basic I/O device interfaces to allow the system to be tested, configured, and booted.
1.21.5 PCI Remote I/O Riser Module The PCI remote I/O riser module resides in the PCI box and provides an interface between a single I/O cable and two PCI buses. The PCA ASIC is the PCI bus controller.
The remote I/O riser module provides an interface between a single I/O cable and two PCI buses. Figure 1–40 is a block diagram of the remote I/O riser module installed in the PCI I/O subsystem. The data path passes through the MLK ASIC to the PCA ASIC that controls the two PCI buses. Both the MLK ASIC and the PCA ASIC are synchronized to the PCI using a phase locked loop device that receives its reference clock from the PCI.
1.21.6 Console Serial Bus Node ID Module The console serial bus node ID module is mounted at the rear of the PCI box and contains the bulkhead connector for the CSB cable, the PCI box status LEDs, and the PCI box CSB node ID switch. This module is present in all PCI boxes in the system and is cabled to the PCI backplane and to the CSB. The CSB node ID number must be unique for each PCI box.
The PCI box console serial bus node ID module is a small board mounted at the right rear of the PCI box that provides the bulkhead interface for the CSB cable, PCI box CSB node ID switch, and PCI box status LEDs. Figure 1–41 is a drawing of the module. PCI Box ID A small switch on the CSB ID module is used to set the PCI box node ID. The ID becomes part of the CSB node address for the box.
1.21.7 Standard I/O Cable Interface The standard I/O cable interface module is located in the front top right corner of master PCI boxes.
Figure 1–42 shows the standard I/O cable interface module. It is located in the front top right corner of the PCI box. A cable inside the master PCI box connects it to the standard I/O module.
1.22 GS160/320 System Power Several components make up the power system for the AlphaServer GS160 and GS320 systems: the AC input box, the system 48V power supplies, the power subrack, the cabinet bulkhead, the power modules in the system box, the power supplies in PCI boxes, the power supplies in the H-switch, and the power managers in the system box, H-switch, and on the PCI.
Major power components are described in the following sections except for the power modules in the QBB and the power managers. For the main power module, see Section 1.15.1; for the auxiliary power module, see Section 1.15.2; for the H-switch power module, see Section 1.15.3; for the PSM, see Section 1.8.4 and Section 1.12; for the HPM, see Section 1.8.6 and Section 1.12; and for the PBM, see Section 1.8.5. Figure 1–43 is a block diagram of power distribution in a GS320 system.
1.22.1 AC Input Box (Three Phase) There are two variants of AC input boxes for system box based systems.
Figure 1–44 shows both the front and rear of the AC input box used in GS160/320 systems. Three phase AC input power is used in these systems. There are two variants: • The 30-48848-01 used in North America/Japan provides 3 phase 30 amp. 120/208V power. • The 30-48848-02 used in Europe provides 3 phase 30 amp. 380-415V power. There is no visual difference between the two variants except the power cord plug.
1.22.2 48V Power Supply The 1600 watt power supply converts AC to 48 VDC and to Vaux (8.75 VDC) from a single phase provided by the three-phase AC input box.
Figure 1–45 shows the 1600 watt 48V power supply. The power supply plugs into the power subrack. To differentiate it from the 1000 watt 48V power supply used in drawer systems, note that the back plug receptacle shapes are different (compare Figure 1–45 and Figure 1–50). Features of the power supply are: • 48 VDC and Vaux outputs. Vaux is always on when AC power is applied. 48 VDC output is controlled by an enable signal provided by the PSMs in the system.
1.22.3 Power Subrack The power subrack holds three power supplies that power a system box containing two QBBs. Since more than one may be necessary to power either a GS160 or GS320, they are color coded to match the colors assigned to system boxes.
Figure 1–46 shows the H7505 power supply subrack. It is placed in the power cabinet of GS160/320 systems. Power supplies slide into the rack from the front of the system. The H7505 uses 1600W power supplies. Loads must be properly distributed across the three phases to avoid nuisance tripping of circuit breakers. Therefore, placement of the third, redundant power supply is important. Figure 1–46 provides a chart showing the recommended placement.
1.22.4 Power Distribution Panel and Power Cabinet Bulkhead The power distribution panel is located on the rear of the power subrack. The power cabinet bulkhead is located between the power cabinet and system cabinet 1. There are two power cabinet bulkheads, one for cables running from the subracks to system cabinet 1 and another for cables running from the subracks to system boxes in system cabinet 2.
Figure 1–47 shows the location and use of the power distribution panel, which is part of the power subrack. The power distribution panel performs the “oring functions” described in Section 1.22. Figure 1–47 also shows power cabling in a power cabinet. Callouts in the figure mark the connections made by the cable connecting the AC input box and the power distribution panel on the subrack, and the cable connections made by power and signal cables from the power distribution panel and the power cabinet bulkhead.
1.23 GS80 System Power Six major components make up the power system for the AlphaServer GS80 systems: the AC input box, the 48V power supplies, the power modules in the drawer, the power managers, the PCI power supplies, and the storage power supplies.
Major power components are described in the following sections except for the power modules in the drawer and the power managers. For the main power module see Section 1.15.1, for the auxiliary power module see Section 1.15.2, for the PSM see Section 1.8.4 and Section 1.12, and for the PBM see Section 1.8.5. Figure 1–48 is a block diagram of power distribution in a GS80 system. It shows most of the major components that make up the power distribution system.
1.23.1 AC Input Box (Single Phase) There are three variants of AC input boxes for GS80 systems. Only one AC input box is required when the inlet voltage is high (200+ V) and two are required when the voltage is low (120 V).
Figure 1–49 shows both the front and rear of the AC input box used for GS80 systems. Single-phase AC input power is used in these systems. There are three variants: • The 30-48847-01 used in North America accommodates single phase 30 amp. 120V power • The 30-48205-04 used in Europe accommodates single phase 30 amp. 220-240V power • The 30-48205-03 used in North America and Japan accommodates single phase 30 amp. 200-240V power There is no visual difference between the variants except the power cord plugs.
1.23.2 48V Power Supplies The 1000 watt power supply converts AC from the drawer based system AC input box to 48 VDC and to Vaux (8.75 V DC).
Figure 1–50 shows the 1000 watt 48V power supply. The power supply plugs into the power subrack. To differentiate it from the 1600 watt power supply used in system box based systems, note that the back plug receptacle shapes are different (compare Figure 1–45 and Figure 1–50). Features of the power supply are: • 48 VDC and Vaux outputs. Vaux is always on when AC power is applied. 48 VDC output is controlled by an enable signal provided by the PSM in the system.
1.23.3 GS80 Power Subrack The power subrack holds three power supplies that power a drawer. Two subracks are needed for a two-drawer system.
Figure 1–51 shows the H7504 power supply subrack. It is placed between the drawers and the AC input boxes. Power supplies slide into the rack from the front of the system. The H7504 power subrack accepts 1000 watt power supplies. Two power supplies are needed to power one drawer, the OCP, and clock. The third power supply is redundant. When there are three power supplies in a subrack, one may be hot swapped. There is an electrical difference between the GS80 power subrack and the GS160/320 subrack.
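The redundancy rule above (two supplies carry the load, the third is spare) can be written as a small check. This is a sketch of the rule as stated in the text, with hypothetical names; it does not model the electrical behavior of the subrack.

```python
SUBRACK_SLOTS = 3      # the subrack holds three power supplies
REQUIRED_SUPPLIES = 2  # two are needed for one drawer, the OCP, and clock


def can_hot_swap(installed):
    """True if one supply can be removed and the load is still covered."""
    if not 0 <= installed <= SUBRACK_SLOTS:
        raise ValueError("installed must be between 0 and 3")
    return installed - 1 >= REQUIRED_SUPPLIES


if __name__ == "__main__":
    for count in range(SUBRACK_SLOTS + 1):
        print(count, can_hot_swap(count))
```

Only a fully populated subrack passes the check, which matches the statement that a supply may be hot swapped when three are present.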
1.24 PCI Power Supply The PCI power supply converts AC input to DC voltages required by the PCI I/O subsystem. One power supply is required; the second is redundant.
Figure 1–52 shows the PCI power supply, which receives AC power from the AC input box. The supply plugs into the front of the PCI box. One power supply is required for a PCI box; the second is redundant. The power module converts single phase AC input to the following required outputs: • +3.3V at 57A • +5.0V at 50A • +12V at 7A • -12V at 1.5A • Vaux (+5.
Chapter 2 Power-Up This chapter describes power-up testing and explains the power-up displays. The following topics are covered: • Operator Control Panel • Power-Up Test Flow – Init.
2.1 Operator Control Panel The control panel is located at the top of the power cabinet.
2.1.1 Control Panel Assembly The control panel assembly has a three position On/Off switch, three pushbuttons, three status LEDs, and an ASCII/graphical vacuum fluorescent display.
Figure 2–1 Control Panel
Users control the basic state of the system by use of pushbuttons and a keyswitch on the operator control panel (OCP).
The callouts in Figure 2–1 point to these components of the control panel. • Secure LED – When lit, indicates that the keyswitch is in the Secure position and system is powered on. All OCP pushbutton and SCM functions are disabled including remote access to the system. • Power LED – When lit, indicates that at least one QBB is powered on and that remote console operations are enabled.
2.1.2 Fluorescent Display Messages The vacuum fluorescent display is used to communicate the state and/or condition of the machine. Four 20-character lines are available.
Table 2–1 Display Messages
Message
    Description
AlphaServer GS-xxx
    Identifies the AlphaServer as a GS-80, GS-160, or a GS-320. If the OCP_TEXT environment variable is empty, this line appears; otherwise the value of the OCP_TEXT environment variable is displayed.
Message
    Description
ALERT: NO Valid MEM NO Valid CPU NO CPI & MEM NO Stdio
    Alerts provide information about system configurations that cause the system or a partition not to operate. Those conditions are no valid CPUs, no valid memory, or no standard I/O module.
CLI HALT IN/OUT
    An SCM halt in or halt out command was issued.
HALT Asserted/Deasserted
    Asserted: the Halt button is in and the halt signal is asserted. Deasserted: the Halt button is out and the halt signal is not asserted.
2.2 Power-Up Test Flow – Init. and Phase 0 After the initial setup, phase 0 tests the “local” QBBs. Figure 2– 2 Power-Up Flowchart – Init.
Power-up consists of an initialization phase followed by five test phases. The system control manager (SCM) firmware, run by the microprocessor on the standard I/O module, controls power-up. The SCM, master of the console serial bus (CSB), sends power-up control test packets over the CSB to the PSMs in each QBB. The PSMs in turn pass power-up control test packets to the CPUs in the QBBs over the PSM to CPU serial lines.
Table 2–2 lists the SROM tests run in phase 0. Table 2– 2 SROM Tests Test # Hex.
Table 2–3 lists the XSROM tests run in phase 0.
Table 2–3 XSROM Tests Run in Phase 0 (Continued)
Test # Hex  Test Name
Phase 0 Step 3 tests continued
24  Local IOP error test
25  Local MEM0 scratch/BIST/error tests
26  Local MEM1 scratch/BIST/error tests
27  Local MEM2 scratch/BIST/error tests
28  Local MEM3 scratch/BIST/error tests
29  Local DTag scratch and BIST check test
2a  Local directory scratch and BIST check test
Phase 0 Step 4 tests
2b  Local IOP BIST check test
2c  Local QSA error line test
2d  Local hose error line test
2e  Local
2.3 Power-Up Test Flow – Phase 1 Remote testing of QBBs is done in phase 1.
During phase 1, “remote” testing of each QBB in the system is conducted if there is more than one QBB in the system. Remote means testing of hardware by a system primary CPU, selected by the SCM from data collected in phase 0, across secondary (global ports) and the hierarchical switch if present. Initial soft QBB IDs are assigned in this phase. (Soft QBB IDs may change during power-up if something fails.) Soft QBB IDs are necessary to make sure that good memory exists at address 000.0000.
Table 2–4 lists the XSROM tests executed during phase 1.
Table 2–4 XSROM Tests Run in Phase 1 (Continued)
Test # Hex  Test Name
Phase 1 Step 9 tests continued
42  Remote MEM0 scratch/BIST/error line testing
43  Remote MEM1 scratch/BIST/error line testing
44  Remote MEM2 scratch/BIST/error line testing
45  Remote MEM3 scratch/BIST/error line testing
46  Remote DTag BIST check
47  Remote DIR BIST check
48  Remote IOP BIST check
49  Remote QSA error line test
4a  Remote Hose error line test
4b  Remote GP error line test
4c  Placeholder
4d  Remote direct
2.4 Power-Up Test Flow – Phases 2, 3, and 4 During the final three phases, XSROM code assures cache coherency, assures that all CPUs can access all memory, and leaves all CPUs running the SRM console.
The phase 2 test “victimizes” all cache blocks of all secondary CPUs. (The SP CPU “victimized” all its cache blocks at the end of phase 1.) A victimized cache block is one that the CPU has modified and wishes to write back to memory. Writing data back to memory assures that the contents of B-cache, DTags, and memory are coherent. The phase 3 tests assure that each CPU interacts correctly with its own B-cache and the QBB’s DTag and can access each memory array in the entire system.
2.5 Power Applied – Vaux Present When power is applied, the microprocessors on the CSB execute their built-in self-test (BIST) and the system control manager takes control of the system. Micros on the CSB are SCMs, PSMs, HPM, and PBMs.
Example 2–1 System Control Manager Power-Up Display
Master SCM
Testing SCM EEPROM – Passed
Initializing Evs
SCM Selftest Passed
Polling CSB............................
OCP will be inactive for first 12 seconds after micro reset
SCM_E0>
Querying the modem port.
Example 2–1 shows the SCM monitor display for a four QBB system with eight CPUs. Auxiliary power is applied to the system when the AC circuit breakers are put in the On position. Refer to Example 2–1. When power is applied, the eligible SCM with the lowest CSB node ID number that is connected to the OCP and running from its application image becomes master of the CSB. It checks its EEPROM and self-test, restores environment variables (EVs), and sets up data structures and CSB communication channels.
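The selection rule in the paragraph above amounts to filtering the candidates and taking the lowest node ID. The sketch below illustrates that rule under stated assumptions: the class, its fields, and the node ID values are hypothetical, and the real firmware may apply eligibility criteria that are not described here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ScmCandidate:
    """Hypothetical view of one SCM as seen at power-up."""

    csb_node_id: int
    connected_to_ocp: bool
    running_application_image: bool


def select_master(candidates: Sequence[ScmCandidate]) -> Optional[ScmCandidate]:
    """Pick the eligible SCM with the lowest CSB node ID, or None."""
    eligible = [
        c for c in candidates
        if c.connected_to_ocp and c.running_application_image
    ]
    return min(eligible, key=lambda c: c.csb_node_id, default=None)


if __name__ == "__main__":
    # Node IDs below are made up for illustration.
    scms = [
        ScmCandidate(0xE1, connected_to_ocp=True, running_application_image=True),
        ScmCandidate(0xE0, connected_to_ocp=True, running_application_image=True),
        ScmCandidate(0xD0, connected_to_ocp=False, running_application_image=True),
    ]
    master = select_master(scms)
    print(hex(master.csb_node_id) if master else "no eligible SCM")
```

Note that the candidate with the numerically lowest ID is skipped in this run because it is not connected to the OCP, which is the point of filtering before taking the minimum.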
Example 2–1 System Control Manager Power-Up Display (Continued)
SCM_E0> QBB0 Directory Module Added
Power Supply-1 present in Subrack-1
Power Supply-3 present in Subrack-1
QBB0 3.3V Main Power Converter present
QBB0 3.3V AUX Converter present
QBB0 GP added
MEM0 added to QBB0
MEM3 added to QBB0
IOR01 added in QBB0
CPU0 added to QBB0
CPU2 added to QBB0
SCM_E0> QBB1 Directory Module Added
Power Supply-1 present in Subrack-1
Power Supply-3 present in Subrack-1
QBB1 3.
The continuation of Example 2–1 shows the description of each QBB. In this particular system only QBBs 0 and 1 have local I/O risers. QBB1 has two CPUs while the rest have four. Other similarities and differences between the QBBs can be seen by further examination. • QBB0 is described. • QBB1 is described. • QBB2 is described. • QBB3 is described. • The SCM can now begin to monitor the state of the OCP switch. Once the SCM has configured the CSB, it begins to monitor the state of the OCP switch.
2.6 System Turned On Once the OCP switch is attended by the SCM firmware, the system can be turned on by the switch.
Example 2–2 shows a continuation of the SCM console display after the OCP switch has been put in the On position. Refer to Example 2–2. PCIs and QBBs are powered on. The QBBs are powered on and the Init. Phase is started. The SROM code (step 0) is run on each CPU in each QBB. The master SCM is SCM_E0 and, in this case, the slave SCM is SCMe1. The message displayed here indicates that while the slave SCM is testing its shared RAM, the master SCM recognizes that fact.
Example 2–2 SCM Power-Up Display (OCP On) (Continuation 1)
Phase 0
~I~ Enable HS Links: 0f
~I~ QbbConf(gp/io/c/m)=0000bbff Assign=0f SQbb0=00 PQbb=00 SoftQbbId=0000ba98
~I~ SysConfig: 00 00 00 00 00 00 00 00 07 1f 07 9f 37 3f 37 95
SCM_E0> ~I~ HSW4/HPM40 SysEvent: LINK0_ON Reg0:000F Reg1:D581
SCM_E0> ~I~ HSW4/HPM40 SysEvent: LINK1_ON Reg0:010F Reg1:D581
SCM_E0> ~I~ HSW4/HPM40 SysEvent: LINK2_ON Reg0:030F Reg1:D581
SCM_E0> ~I~ HSW4/HPM40 SysEvent: LINK3_ON Reg0:070F Reg1:D581
SCM
QBB0 now Testing Step-9.
QBB0 now Testing Step-A.
QBB0 now Testing Step-7.
QBB0 now Testing Step-8.
QBB0 now Testing Step-9..
QBB0 now Testing Step-A.
QBB0 now Testing Step-7.
QBB0 now Testing Step-8..
QBB0 now Testing Step-9.
QBB0 now Testing Step-A.
QBB0 now Testing Step-B.
Refer to continuation 1 of Example 2–2. Phase 0, local QBB testing begins. See Section 2.2 for information on testing done in phase 0.
Example 2–2 SCM Power-Up Display (OCP On) (Continuation 2)
Phase 2
QBB0 IO_MAP0: 0000A0C001333333
QBB1 IO_MAP1: 0000A1C101333333
QBB2 IO_MAP2: 0000000000000003
QBB3 IO_MAP3: 0000000000000003
~I~ QbbConf(gp/io/c/m)=0000bbff Assign=0f SQbb0=00 PQbb=00 SoftQbbId=0000ba98
~I~ SysConfig: 00 00 00 00 00 00 00 00 07 1f 07 9f 37 3f 37 95
SCM_E0>
QBB1 now Testing Step-C
QBB2 now Testing Step-C
QBB3 now Testing Step-C.
QBB0 now Testing Step-C.
Phase 3
Refer to continuation 2 of Example 2–2. Phase 2 begins. The pass/fail results of phase 1 are passed back to the SCM, as indicated by the ~I~ line. An I/O map built by the PSM, now the result of remote testing, is passed to the SCM monitor. Phase 2 testing is done on each QBB. Phase 2 consists of a single test. The caches of each secondary CPU are victimized – that is, written back into memory with the result that memory and caches are now coherent.
Example 2– 2 SCM Power-Up Display (OCP On) (Continuation 3) System Primary QBB0 : 0 System Primary CPU : 0 on QBB0 . Par hrd/csb CPU Mem IOR3 IOR2 IOR1 IOR0 QBB# 3210 3210 (pci_box.rio) (-) (-) (-) (-) 0/30 1/31 2/32 3/33 HSwitch HPM40 -P-P PPPP PPPP PPPP Type 4-port PCI Rise1-1 Cab 7 6 5 4 10 11 - - L - - - - P--P --PP P--P ---P --.--.--.--.- --.--.--.--.- P0.1 P1.1 --.--.- GP QBB Mod BP P0.0 P1.0 --.--.
QBB 1 memory, 3 GB
QBB 2 memory, 3 GB
QBB 3 memory, 1 GB
total memory, 11 GB
copying PALcode to 10bffe8000
copying PALcode to 20bffe8000
copying PALcode to 303ffe8000
Refer to continuation 3 of Example 2–2. An expanded system map is displayed. PALcode is loaded and started. The system configuration is displayed from the SRM console point of view. The location of the standard I/O module with both the SCM monitor code and the SRM console code is determined.
Example 2–2 SCM Power-Up Display (OCP On) (Continuation 4)
probe I/O subsystem
probing hose 0, PCI
probing PCI-to-ISA bridge, bus 1
bus 0, slot 1 -- pka -- QLogic ISP10x0
bus 0, slot 3 -- ewa -- DE500-BA Network Controller
bus 0, slot 15 -- dqa -- Acer Labs M1543C IDE
bus 0, slot 15 -- dqb -- Acer Labs M1543C IDE
probing hose 1, PCI
probing hose 2, PCI
probing hose 3, PCI
bus 0, slot 5 -- pkb -- QLogic ISP10x0
probing hose 8, PCI
probing PCI-to-ISA bridge, bus 1
bus 0, slot 1 -- pkc -- QLogic ISP10x0
bus 0
Refer to continuation 4 of Example 2–2. The I/O subsystem is mapped. Each CPU in the system is identified, started, and initialized.
Example 2–2 SCM Power-Up Display (OCP On) (Continuation 5)
starting console on CPU 7
initialized idle PCB
initializing idle process PID
lowering IPL
CPU 7 speed is 731 MHz
create powerup
starting console on CPU 8
initialized idle PCB
initializing idle process PID
lowering IPL
CPU 8 speed is 731 MHz
create powerup
Repeated for each CPU in the system.
. . .
Refer to continuation 5 of Example 2–2. Each secondary CPU starts the console, is initialized and ready to join the multiprocessor environment. The I/O subsystem is initialized. GCT/FRU is the system configuration tree/FRU table and its location in memory is 1f4000. The configuration tree/FRU table is the data structure containing information about hard and soft partitions. Note that the location in memory of the configuration tree is a fixed address in these systems.
2.7 SROM/XSROM Error Reports SROM and XSROM errors are reported to the PSM, which passes the error information on to the SCM at the end of each phase. The SCM formats the information and displays it to the console. For a full description of running SROM/XSROM tests, see Section 3.5.
Example 2–3 Examples of the SCM Error Display
Example 1
SCM_E0> test &pc0 13
Testing. Please wait...
*** Error Format: 1 Severity: Hard
Type: XSROM selftest Test: 13h Rvsn: V4.0-0
FRU1: QBB0 QSD3
FRU2: QBB0.
The SCM formats and prints SROM and XSROM errors found during power-up or when executing the diagnostics in user mode. Example 2–3 shows examples of two formats for SROM/XSROM failures.
Example 2–3 Examples of the SCM Error Display (Continued)
Example 3
SCM_E0> test &pc0 52
Testing. Please wait...
*** Error Format: 2 Severity: Hard QBB/CPU: 00/00
Type: XSROM selftest Test: 52h Error: 0108 Rvsn: V4.1-0
FRU1: QBB0.MEM1 ARR0, CFG
FRU2:
FRU3:
FRU4:
P1: 0000000000000108
P2: 0000000000000000
P3: 0000000000000000
P4: 0000000000000000
SCM_E0>
Example 3 shows a memory/directory configuration test failure.
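When many such reports have to be compared, it can help to pull the labeled fields out of the text. The following is a minimal sketch, assuming a captured console log in which each report carries the labels seen in Example 3 (Format, Severity, QBB/CPU, Type, Test, Error, Rvsn, FRU1 to FRU4, P1 to P4); the function name and the sample string are illustrative, and this is not an SCM facility.

```python
import re

# Labels as they appear in the error display; each is followed by a colon.
FIELD_RE = re.compile(
    r"\b(Format|Severity|QBB/CPU|Type|Test|Error|Rvsn|FRU[1-4]|P[1-4]):"
)


def parse_error_report(report):
    """Split one report into {label: value}; empty fields map to ''."""
    matches = list(FIELD_RE.finditer(report))
    fields = {}
    for index, match in enumerate(matches):
        # A value runs from the end of its label to the start of the next.
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(report)
        fields[match.group(1)] = report[match.end():end].strip()
    return fields


SAMPLE = (
    "*** Error Format: 2 Severity: Hard QBB/CPU: 00/00 "
    "Type: XSROM selftest Test: 52h Error: 0108 Rvsn: V4.1-0 "
    "FRU1: QBB0.MEM1 ARR0, CFG FRU2: FRU3: FRU4: "
    "P1: 0000000000000108 P2: 0000000000000000 "
    "P3: 0000000000000000 P4: 0000000000000000"
)

if __name__ == "__main__":
    parsed = parse_error_report(SAMPLE)
    print(parsed["Test"], parsed["FRU1"])
    # Test numbers are hexadecimal, so strip the trailing 'h' to convert.
    print(int(parsed["Test"].rstrip("h"), 16))
```

Because the parser keys on the labels rather than on line breaks, it works whether the report is captured as one line or several. The caller is expected to pass only the report body, without the trailing console prompt.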
Chapter 3 Troubleshooting This chapter describes various troubleshooting techniques including power-up testing and explains the power-up displays.
3.1 Troubleshooting During Power-Up Power problems may occur when powering up the system.
Table 3– 1 Power Problems (Continued) Symptom Possible Cause Indicators System does not power up/main blowers off CSB bus problems - Cable broken/disconnected along path from STD I/O CSB ID module CSB module ¬ ¬ See above - Vaux problem System does not power up/main blowers off System or part of system does not power up and/or main blowers off Message on console regarding lost connections OCP problems - OCP present signal not seen by SCM (signal cable path from STD I/O STD I/O cable interface OCP
Table 3–1 Power Problems (Continued)
Symptom: System or part of system does not power up
Possible Cause: Logic voltage problems
- PSM failure. Indicators: PSM/main/aux LEDs not normal
- Main power module failure. Indicators: PSM qbb_dc_good LED off/main power LEDs not normal
- Auxiliary power module failure. Indicators: PSM qbb_dc_good LED off/aux power LEDs not normal
3.2 System Management Console Problems If the system management console connected to the local terminal port on a standard I/O module does not operate, the problem could range from broken hardware to unequal baud rates on the serial line.
3.3 Power-Up Display and Troubleshooting During power-up, any number of messages can appear either giving warnings or specifying errors.
Example 3–1 Console Power-Up Error Messages
. . .
QBB1 now Testing Step-3
QBB2 now Testing Step-3
QBB3 now Testing Step-3...
QBB1 now Testing Step-4.
QBB1 now Testing Step-5
~E~ QBB1 Error: ~E~ PUP MEM1 NO GOOD ARRAY
*** Error Format: 2 Severity: Hard QBB/CPU: 01/00
Type: XSROM selftest Test: 26h Error: 100F Rvsn: V5.4-0
FRU1: QBB1.
Example 3–1 shows a memory error report in the middle of power-up. In this case memory failed self-test and the report shows that the most likely FRU is memory 1 in QBB1. The SROM/XSROM diagnostic reports errors using error and warning formats:
*** Designates a diagnostic error format. Depending on the type of error and the configuration, the machine will most likely power up. At a minimum the resource is dropped from the system.
### Designates a diagnostic warning format.
Table 3–3 Fluorescent Display Messages
Display Message
    Description
AlphaServer GS-xxx
    Identifies the AlphaServer as a GS-80, GS-160, or a GS-320. If the OCP_TEXT environment variable is empty, this line appears; otherwise the value of the OCP_TEXT environment variable is displayed.
Cpu- Mem- Pci-
    Indicates the number of good CPUs, memory arrays, and PCI buses attached to the system.
KeyswitchON/OFF/“ON/SECURE”
    Indicates the state of the keyswitch.
Display Message
    Description
ALERT: NO Valid MEM NO Valid CPU NO CPI & MEM NO Stdio
    Alerts provide information about system configurations that cause the system or a partition not to operate. Those conditions are no valid CPUs, no valid memory, or no standard I/O module.
CLI HALT IN/OUT
    An SCM halt in or halt out command was issued.
HALT Asserted/Deasserted
    Asserted: the halt button is in and the halt signal is asserted. Deasserted: the halt button is out and the halt signal is not asserted.
3.4 Using the SCM Monitor There are several SCM commands that set the system environment, display configuration information, and help diagnose the system. 3.4.1 SCM Commands The system control manager sets and controls the system environment through a set of commands. Table 3– 4 SCM Commands Command Description build Build FRU data (pn= part number in 2-52.4 format, sn=serial number in xxyzzabcde format, mod= module, and ali=alias).
Table 3–4 SCM Commands (Continued)
Command
    Description
help or ?
    Display the list of SCM commands
init modem
    Initialize the modem (See Table 3–6)
master
    Slave SCM command to master SCM – allows a slave SCM to pass an SCM command to the master for the master to issue
power {off, on} [-all, -partition ]
    On/off power a partition or the entire system to system/QBB/H-switch
quit
    Switch from SCM-CLI mode COM1 port
reset [-all, -partition ]
    Reset the system or a particular partition
3.4.2 Controlling Power
The SCM power on and power off commands behave differently depending on whether the system is partitioned.
Table 3–5 Power On/Off
Command: Power on
Non Partitioned System: Powers on the entire system.
Partitioned System: If QBB to I/O hose mapping already exists as indicated by the show system command, the partition owning the particular PCI box from which the command is issued will power up. Other partitions will not.
Table 3–5 describes the behavior of the power on and power off commands as they relate to whether the system is partitioned. If the system is not partitioned, the hp_count environment variable is zero and either the power on or the power off command powers the entire system on or off, including the I/O. If the system is partitioned, however, the behavior of the power commands varies as described in the table.
3.4.3 Displaying and Setting Up the System Environment Two SCM commands are used to display and set system environment variables stored in EEPROM on the standard I/O module.
Example 3–3 Set Environment Variables
SCM_E0> set hp_count 3          !setting up 3 partitions
SCM_E0> set hp_qbb_mask0 03     !partition 0 two QBBs 0 and 1
SCM_E0> set hp_qbb_mask1 04     !partition 1 with one QBB, 2
SCM_E0> set hp_qbb_mask2 08     !partition 2 with one QBB, 3
Any of these environment variables can be set using the SCM set command. In Example 3–3 the set command is used to define 3 partitions.
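The comments in Example 3–3 suggest that each hp_qbb_mask value is a hexadecimal bit mask in which bit n selects QBB n: 03 selects QBBs 0 and 1, 04 selects QBB 2, and 08 selects QBB 3. The following is a minimal Python sketch of that reading, intended for checking a planned partition layout on a workstation. The function name qbbs_in_mask is hypothetical and is not part of the SCM firmware.

```python
def qbbs_in_mask(mask_hex: str) -> list[int]:
    """Return the QBB numbers whose bits are set in a hex partition mask.

    Assumes bit n of the mask selects QBB n, as the comments in
    Example 3-3 indicate.
    """
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


# Mask values taken from Example 3-3.
partitions = {0: "03", 1: "04", 2: "08"}
for number, mask in partitions.items():
    print(f"partition {number}: QBBs {qbbs_in_mask(mask)}")
```

A QBB number that appears in more than one partition's list would indicate overlapping masks, which is worth checking before entering the set commands.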
3.4.4 Displaying Configuration Information Several show commands provide system information.
Example 3– 5 Show system SCM_E0> show system Par hrd/csb CPU QBB# 3210 (-) 0/30 PPPP (-) 1/31 PPPP (-) 2/32 PPPP (-) 3/33 PPPP Mem IOR3 IOR2 IOR1 IOR0 GP QBB Dir PS Temp 3210 (pci_box.rio) Mod BP Mod 321 (ºC) PPPP Px.x P2.0 Pf.1 Pf.0 P P P PPP 33.5 PPPP P4.0 Px.x Px.x P5.0 P P P PPP 31.0 PPPP Px.x P0.0 Px.x P3.0 P P P PP- 29.0 PPPP --.- --.- --.- --.P P P PP- 33.5 HSwitch Type Cables 7 6 5 4 3 2 1 0 Temp(ºC) HPM40 4-port - - - - P P P P 28.
Example 3– 6 Show status SCM_E0> show status AlphaServer GS320 RMC escape sequence : Local Baud/flow control : COM1 Baud/flow control : Modem Baud/flow control : COM1 mode : OCP power switch : OCP halt : OCP secure : Remote access : Remote user : Alerts : Modem password : Modem init string : Modem dial string : Modem alert string : Alert pending : Most recent alert : [esc][esc]scm or ^[^[scm 57600/soft 57600/hard 57600/hard Pass-through On Deasserted Non-Secure Disabled Not Connected Disabled atz&c1s0=1 a
Example 3–6 shows the results of the show status command, and Table 3–7 defines the entries. Show status reads the EEPROM on the standard I/O module and the state of the OCP (buttons and switch). The variables are set using various SCM commands controlling remote access to the system. See the AlphaServer GS80/160/320 User’s Guide for more information. Table 3– 7 Show Status Entries Name Description RMC escape sequence Current escape sequence to access the SCM.
Example 3– 7 Show fru SCM_E0> show fru FRUname E PBP0 00 PBP0.SIO 00 PBP0.RIO0 00 PBP0.RIO1 00 PBP1 00 PBP1.SIO 00 PBP1.RIO0 00 PBP1.RIO1 00 QBB0 00 QBB0.PSM 00 QBB0.PWR 00 QBB0.AUX 00 QBB0.CPU0 00 QBB0.CPU1 00 QBB0.CPU2 00 QBB0.CPU3 00 QBB0.MEM0 00 QBB0.MEM0.DIM0 00 QBB0.MEM0.DIM1 00 QBB0.MEM0.DIM2 00 QBB0.MEM0.DIM3 00 QBB0.MEM3 00 QBB0.MEM3.DIM0 00 QBB0.MEM3.DIM1 00 QBB0.MEM3.DIM2 00 QBB0.MEM3.DIM3 00 QBB0.DIR 00 QBB0.DIR0.DIM1 00 QBB0.DIR0.DIM2 00 QBB0.DIR0.DIM4 00 QBB0.DIR0.DIM5 00 QBB0.IO01 00 QBB0.
QBB1.GP HSW8 HSW8.HPM0 HSW8.PWR2 00 00 02 00 -B4181-BA.A02 B4187-AA.B01 54-25115-01.B04 54-30194-01.D01 NI93470534 WF08LTA111 NI93870439 NI92660628 NI94271542 .......... WFFW_LAB_PSM_DEV ................ Table 3– 8 Show fru Command Field Descriptions Field Description FRU The field-replaceable unit name and location in the hierarchy of the system.
3.4.5 Dealing With EEPROMs EEPROMs throughout the system record FRU identification and error information and store system environment and firmware information. Example 3– 8 Clear error SCM_E0> show fru . . . QBB1.DIR0.DIM7 00 ...S...S...S@..... QBB1.GP 00 B4181-BA. B01 QBB2 00 54-25043-01.D03 QBB2.PSM 40 54-25074-01.J01 QBB2.PWR 00 54-25017-01.F01 QBB2.AUX 00 54-25123-01.E01 . . . scm-E0> clear error qbb2.psm scm_E0> show fru . . . QBB1.DIR0.DIM7 00 QBB1.GP 00 QBB2 00 QBB2.PSM 00 QBB2.PWR 00 QBB2.
Example 3– 9 Build fru SCM_E0> build qbb2.psm 54-25074-01.J01 NI94570274 WF96LTA113 WF_FIRMWARE_LAB Example 3–9 is an example of the build command. The command places manufacturing information (part number and serial number) and optional information (module name and an alias) in the designated FRU’s EEPROM. The command also clears any errors logged against the module. If a field is left blank in the command, it is left blank in the EEPROM as well.
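In Example 3–8 the second column of the show fru listing (headed E) reads 40 for QBB2.PSM before the clear error command and 00 afterward. When a captured listing is long, the FRUs with a nonzero E field can be picked out with a short script. The following is a sketch in Python under two assumptions: that the E field is always two hexadecimal digits, and that any nonzero value means an error is logged. The function name and the sample text are illustrative only.

```python
import re

# The E field is assumed to be exactly two hex digits (for example 00 or 40).
_E_FIELD = re.compile(r"^[0-9A-Fa-f]{2}$")


def frus_with_errors(show_fru_output: str) -> dict[str, int]:
    """Map FRU name to E-field value for every FRU whose E field is nonzero."""
    flagged = {}
    for line in show_fru_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header line, and continuation lines.
        if len(fields) < 2 or not _E_FIELD.match(fields[1]):
            continue
        value = int(fields[1], 16)
        if value:
            flagged[fields[0]] = value
    return flagged


# Sample lines adapted from Example 3-8.
sample = """\
FRUname   E  Part#
QBB2      00 54-25043-01.D03
QBB2.PSM  40 54-25074-01.J01
QBB2.PWR  00 54-25017-01.F01
"""
print(frus_with_errors(sample))
```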
3.5 Running Diagnostics Using the SCM Monitor
SROM/XSROM diagnostic tests cannot be run under the SRM console once it is loaded. To run SROM/XSROM diagnostics in SCM user mode, they must be loaded and remain in the CPU’s I-cache or B-cache. Once XSROM tests are loaded into a CPU’s B-cache, SROM tests cannot be run on that CPU unless the SROM is loaded again. SCM examines and deposits require SROM or XSROM code to be running.
Example 3–12 shows a diagnostic session. Test masks stored in the NVRAM on the standard I/O module control power-up diagnostic testing. The SCM reads the test masks and executes the tests designated by the mask. The first quadword masks the SROM tests and XSROM tests 10 through 3a, and the second quadword masks the remainder of the XSROM tests. By default all tests are run during power-up including XSROM test 58 which loads the console into memory and all CPUs jump to that code and start.
Example 3– 12 Setting Up the Diagnostic Environment (Continued) SCM_E0> reset Powering ON on QBB-0 SCM_E0> QBB0 now Testing Step-0 PSM30 SysEvent: QBB_PULSE_RESET SysEvent Reg0: 468f SysEvent Reg1: 33ff Phase 0............................
As Example 3–12 continues, the callouts explain the progress of the diagnostic session.
The reset command is executed. Even though no XSROM tests are executed, the SCM runs through its normal power-up routine. (xsrom_mask = none)
The test &pc0 5 command now passes. SROM test 5, the B-cache march test, is run on CPU0.
The test &pc1 7 command passes. SROM test 7, the EVx D-cache and B-cache error tests, is run on CPU1.
The test &pc0 10 command passes.
Example 3– 13 Various Test Commands SCM_E0> test &pc1 5 Testing. Please wait…Test(s) passed. SCM_E0> test &pc1 80 10 0 Testing. Please wait…Test(s) passed. SCM_E0> test &pc1 80 ff 0 Testing. Please wait…Test(s) passed. SCM_E0> test &pc1 80 ffe 0 20 SCM_E0> Test(s) passed. SCM_E0> test &pc0 80 ff0000 0 Testing. Please wait…Test(s) passed.
Test &pc1 5 is an example of using a test number to run a single test. The SCM command-line interface does not accept a list of tests using this format; however, it does accept masks and continuous or multiple passes. test &pc1 80 10 0 is an example of the format used to pass a mask. “80” indicates to the SCM that a diagnostic mask will follow. In this case, 10 is the first quadword of the mask and 0 is the second. The 5th bit is set in the first quadword, so test 5 is executed on CPU1.
3.6 Using the SRM Console Several SRM commands can be used to set the system environment, power CPUs on and off, display configuration information, display error information, and test and exercise the system. 3.6.1 Displaying and Setting Up the System Environment Several SRM console commands are used to display and set system environment variables stored in EEPROM on the standard I/O module.
3.6.2 Controlling Power with the SRM Through firmware callbacks from the SRM console to the SCM monitor, the SRM can power off partitions and parts of the system.
Although you can power off a partition using the SCM power off – par x command, use the SRM power off command because it stops all CPU activity and leaves QBBs in a quiet, clean state. Example 3– 18 Power off cpu Command P00>>> power off cpu 8 powering off CPU 8 (CPU 0 in QBB 2) P00>>> QBB2 Powering off CPU0 P00>>> scm show csb . . c7 CPU3/SROM T4.2-7 c4 IOR0 32 PSM T04.7 ( 11.23/01:03) 32 XSROM T04.7 (11.23/01:55) c8 CPU0/SROM T4.2-7 c9 CPU1/SROM T4.2-7 . . T4.2 (09.
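Example 3–18 reports system CPU 8 as CPU 0 in QBB 2, which is consistent with numbering CPUs consecutively, four per QBB. The following is a minimal Python sketch of that mapping, assuming four CPU slots per QBB as in the show fru listings (CPU0 through CPU3). The function name is illustrative only.

```python
CPUS_PER_QBB = 4  # each QBB in the listings holds CPU0 through CPU3


def locate_cpu(system_cpu: int) -> tuple[int, int]:
    """Return (qbb, cpu_within_qbb) for a system-wide CPU number."""
    return divmod(system_cpu, CPUS_PER_QBB)


qbb, cpu = locate_cpu(8)
print(f"CPU 8 is CPU {cpu} in QBB {qbb}")
```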
3.6.3 Displaying Configuration Information (SRM) Several show commands provide system information. Example 3– 20 Show configuration P00>>> show config Compaq Computer Corporation Compaq AlphaServer GS320 6/731 SRM Console V5.8-1, built on May 26 2000 at 12:15:01 PALcode OpenVMS PALcode V1.81-1, Tru64 UNIX PALcode V1.75-1 Micro Firmware V5.
PCI Bus 0 PCI Bus 1 Global Port Hose 14 Hose 15 64 Bit, 33 MHz 64 Bit, 33 MHz PCI rev 2.1 compliant PCI rev 2.
0 1 4 Total Available 2 GB 4 GB 00000000000 8-Way Interleave Board Set Array 0 0 0 0 1 4 Total Available Size 512 MB 512 MB 1 GB Address QBB 1 Memory 01000000000 01000000000 8-Way Interleave Board Set Array 0 0 0 0 1 4 Total Available Size 512 MB 1 GB 1.
3 14 15 16 17 20 21 0 1 1 2 2 2 2 1 3 3 0 0 2 2 1 0 1 0 1 0 1 4-7 1-3 4-7 1-3 4-7 1-3 4-7 Slot 3 7 15 Option DE500-BA Network Con Acer Labs M1543C Acer Labs M1543C IDE 19 Acer Labs M1543C USB Slot 4 5 6 Option DEC KZPSA DECchip 21154-AA QLogic ISP10x0 Hose 0, Bus 0, PCI ewa0.0.0.3.0 dqa.0.0.15.0 dqa0.0.0.15.0 Hose 1, Bus 0, PCI pka0.7.0.4.1 00-00-F8-1B-1C-0B Bridge to Bus 1, ISA TOSHIBA CD-ROM XM-6302B KGPSA-C pkb0.7.0.6.1 dkb100.1.0.6.1 pga0.0.0.7.
19 Slot 4 6 dqd0.0.0.15.20 TOSHIBA CD-ROM XM-6302B Hose 21, Bus 0, PCI pkh0.7.0.4.21 dkh0.0.0.4.21 dkh100.1.0.4.21 dkh200.2.0.4.21 dkh300.3.0.4.21 ewc0.0.0.6.21 SCSI Bus ID 7 COMPAQ BB00911CA0 COMPAQ BB00911CA0 COMPAQ BB00911CA0 RZ1CB-CA 08-00-2B-C3-C1-C7 Acer Labs M1543C USB Option QLogic ISP10x0 DE500-BA Network Con Example 3–20 shows output from the show config command for a partition made up of four QBBs.
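Device names in the show config listing, such as pkh0.7.0.4.21, pack several location fields into one string. In the listing, pkh0.7.0.4.21 appears under Hose 21 in slot 4 with SCSI Bus ID 7, which matches the usual reading of SRM device names: a two-letter driver ID, an adapter letter, a unit number, then bus node, channel, slot, and hose. The following Python sketch splits a name on that assumption. The class and function names are hypothetical, and the field meanings should be confirmed against the console documentation.

```python
import re
from typing import NamedTuple, Optional


class SrmDevice(NamedTuple):
    driver: str          # for example dk, pk, ew, dq
    adapter: str         # adapter letter: a, b, c, ...
    unit: Optional[int]  # device unit number; None if the name omits it
    node: int            # bus node number
    channel: int         # channel number
    slot: int            # logical slot number
    hose: int            # hose number


_NAME = re.compile(r"^([a-z]{2})([a-z])(\d*)\.(\d+)\.(\d+)\.(\d+)\.(\d+)$")


def parse_device(name: str) -> SrmDevice:
    """Split a five-part device name into its assumed location fields."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a five-part device name: {name!r}")
    driver, adapter, unit, node, channel, slot, hose = match.groups()
    return SrmDevice(driver, adapter, int(unit) if unit else None,
                     int(node), int(channel), int(slot), int(hose))


print(parse_device("pkh0.7.0.4.21"))
```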
QBB0. QBB0.PSM QBB0.PWR QBB0.AUX QBB0.CLKSPLT QBB0.CPU0 QBB0.CPU1 QBB0.CPU2 QBB0.CPU3 PBP0.RIO0 PBP0.PCI1 PBP0.PCI3 PBP0.PCI7 PBP0.RIO1 PBP0.PCI5 QBB0.MEM0 QBB0.MEM0.DIM0 QBB0.MEM0.DIM1 QBB0.MEM0.DIM2 QBB0.MEM0.DIM3 QBB0.MEM0.DIM4 QBB0.MEM0.DIM5 QBB0.MEM0.DIM6 QBB0.MEM0.DIM7 QBB0.MEM3 QBB0.MEM3.DIM0 QBB0.MEM3.DIM1 QBB0.MEM3.DIM2 QBB0.MEM3.DIM3 QBB0.MEM3.DIM4 QBB0.MEM3.DIM5 QBB0.MEM3.DIM6 QBB0.MEM3.DIM7 QBB0.DIR QBB0.DIR0.DIM1 QBB0.DIR0.DIM2 QBB0.DIR0.DIM4 QBB0.DIR0.DIM5 QBB0.GP QBB1. QBB1.PSM QBB1.PWR QBB1.
QBB1.MEM1.DIM1 QBB1.MEM1.DIM2 QBB1.MEM1.DIM3 QBB1.MEM1.DIM4 QBB1.MEM1.DIM5 QBB1.MEM1.DIM6 QBB1.MEM1.DIM7 QBB1.DIR QBB1.DIR0.DIM1 QBB1.DIR0.DIM2 QBB1.DIR0.DIM3 QBB1.DIR0.DIM4 QBB1.DIR0.DIM5 QBB1.DIR0.DIM6 QBB1.DIR0.DIM7 QBB1.GP QBB2. QBB2.PSM QBB2.PWR QBB2.AUX QBB2.CLKSPLT QBB2.CPU0 QBB2.CPU1 QBB2.CPU2 QBB2.CPU3 QBB2.MEM0 QBB2.MEM0.DIM0 QBB2.MEM0.DIM1 QBB2.MEM0.DIM2 QBB2.MEM0.DIM3 QBB2.MEM0.DIM4 QBB2.MEM0.DIM5 QBB2.MEM0.DIM6 QBB2.MEM0.DIM7 QBB2.MEM3 QBB2.MEM3.DIM0 QBB2.MEM3.DIM1 QBB2.MEM3.DIM2 QBB2.MEM3.
QBB3.MEM0.DIM3 QBB3.MEM0.DIM4 QBB3.MEM0.DIM5 QBB3.MEM0.DIM6 QBB3.MEM0.DIM7 QBB3.DIR QBB3.DIR0.DIM0 QBB3.DIR0.DIM1 QBB3.DIR0.DIM2 QBB3.DIR0.DIM3 QBB3.DIR0.DIM4 QBB3.DIR0.DIM5 QBB3.DIR0.DIM6 QBB3.GP HSW4 HSW4.HPM0 HSW4.PWR2 HSW4.CLCK HSW4.MOD.SPLT CAB2.SYS PBP0.PCI0 QBB0.IOR01 PBP0.SYSFAN2 PBP0.SYSFAN1 PBP0.PS2 PBP0.PS1 PBP1.PCI0 QBB1.IOR01 PBP1.SYSFAN2 PBP1.SYSFAN1 PBP1.PS2 PBP1.PS1 P00>>> 3-40 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 54-24941-EA.
The SRM show fru command identifies a few more FRUs than the SCM show fru command. See Table 3–8 for an explanation of the fields displayed by both commands, and Table 3–9 for additional units identified by the SRM. Table 3– 9 Additional SRM Show FRU Codes Field Description FRU The field-replaceable unit name and location in the hierarchy of the system.
Example 3– 23 Csr P00>>> csr qbb0.*err_sum CSR Name -----------------------------------QBB0.QSD.QSD_ERR_SUM QBB0.QSA.QSA_CPU_ERR_SUM QBB0.QSA.QSA_MISC_ERR_SUM QBB0.QSA.QSA_TMO_ERR_SUM QBB0.QSA.QSA_ILL_CMD_ERR_SUM QBB0.DTag0.DTAG_ERR_SUM QBB0.DTag1.DTAG_ERR_SUM QBB0.DTag2.DTAG_ERR_SUM QBB0.DTag3.DTAG_ERR_SUM QBB0.Dir.DIR_ERR_SUM QBB0.Mem0.MEM_ERR_SUM QBB0.Mem1.MEM_ERR_SUM QBB0.Mem2.MEM_ERR_SUM QBB0.Mem3.MEM_ERR_SUM QBB0.IOP.IOP_QBB_ERR_SUM QBB0.IOP.IOA_ERR_SUM QBB0.IOP.IOD_ERR_SUM QBB0.
Example 3– 24 Csr P00>>> csr QBB0.QSD.CPU0_SCRATCH CSR Name -----------------------------------QBB0.QSD.CPU0_SCRATCH CSR Address ----------fffff940800 CSR Data ---------------0000000000000000 P00>>> csr QBB0.QSD.CPU0_SCRATCH 99 CSR Name -----------------------------------QBB0.QSD.CPU0_SCRATCH CSR Address ----------fffff940800 CSR Data ---------------0000000000000099 P00>>> csr QBB0.QSD.CPU0_SCRATCH CSR Name -----------------------------------QBB0.QSD.
Example 3– 25 Wf show cpu P00>>> wf show cpu CPU 0 partition 0 CPU 1 partition 0 CPU 2 partition 0 CPU 3 partition 0 CPU 4 partition 0 CPU 5 partition 0 CPU 6 partition 0 CPU 7 partition 0 CPU 8 partition 0 CPU 9 partition 0 CPU 10 partition 0 CPU 11 partition 0 CPU 12 partition 0 CPU 13 partition 0 CPU 14 partition 0 CPU 15 partition 0 P00>>> Type Type Type Type Type Type Type Type Type Type Type Type Type Type Type Type 000000090000000b 000000090000000b 000000090000000b 000000090000000b 000000090000000b
START_PFN: 00800000 PFN_COUNT: 001fffe0 PFN_TESTED: 001fffe0 BITMAP_VA: 0000000000000000 BITMAP_PA: 00000013fffc0000 2097120 good pages from 0000001000000000 to 00000013fffbffff Cluster: 4, Usage: Console START_PFN: 009fffe0 PFN_COUNT: 00000020 PFN_TESTED: 00000000 32 pages from 00000013fffc0000 to 00000013ffffffff Cluster: 5, Usage: System START_PFN: 01000000 PFN_COUNT: 001fffe0 PFN_TESTED: 001fffe0 BITMAP_VA: 0000000000000000 BITMAP_PA: 00000023fffc0000 2097120 good pages from 0000002000000000 to 00000023
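The address ranges printed for each cluster follow from START_PFN and PFN_COUNT. The figures in the listing are consistent with a page size of 8 KB (hex 2000): START_PFN 00800000 corresponds to address 0000001000000000, and 001fffe0 pages end at 00000013fffbffff. The following is a minimal Python sketch of that arithmetic; the page size is inferred from the listing and the function name is illustrative.

```python
PAGE_SIZE = 0x2000  # 8 KB per page, inferred from the figures in the listing


def cluster_range(start_pfn: int, pfn_count: int) -> tuple[int, int]:
    """Return the first and last byte address covered by a memory cluster."""
    first = start_pfn * PAGE_SIZE
    last = first + pfn_count * PAGE_SIZE - 1
    return first, last


# Values from the cluster with START_PFN 00800000 in the listing.
first, last = cluster_range(0x00800000, 0x001FFFE0)
print(f"{0x001FFFE0} pages from {first:016x} to {last:016x}")
```

The same calculation applied to the Console cluster (START_PFN 009fffe0, PFN_COUNT 00000020) gives the 32-page range that starts at 00000013fffc0000.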
3.7 Running Diagnostics from the SRM Console The test command exercises major system components sequentially. 3.7.1 Setting Up the Test Environment Prior to running SRM console controlled tests, you can create a test environment to control how tests behave.
Table 3– 10 Diagnostic Environment Variables (Continued) Environment Variable Description d_passes Specifies the number of passes to run a diagnostic. Default is 1. 0 indicates to run the diagnostic indefinitely. d_quick Specifies whether an abbreviated mode of tests should be run. Default is Off (no abbreviation). d_report Specifies the level of information provided by diagnostic error reports. The default is Summary; other values are Full or Off.
3.7.2 Background Testing and Display Several tests can be run in the background freeing the console for other operations. Displaying background test status is possible using the show_status command and the ps command. Stopping background tests is done using the kill_diags command. Example 3– 28 Running sys_exer in the Background P00>>> sys_exer Default zone extended at the expense of memzone.
Example 3– 29 Show_status P00>>> show_status ID Program -------- -----------00000001 idle 0000081c memtest 00000822 memtest 00000828 memtest 00000887 memtest 000008a0 memtest 000008b9 memtest 000008d2 memtest 000008eb memtest 00000905 memtest 0000091e memtest 00000937 memtest 00000940 memtest 00000975 exer_kid 00000976 exer_kid 00000977 exer_kid 00000978 exer_kid 0000097c exer_kid 00000983 exer_kid 000009bf nettest 000009eb nettest 00000a1a nettest P00>>> Device Pass Hard/Soft Bytes Written Bytes Read ----
Example 3– 31 Kill P00>>> memexer memtest -bs 1000000 -rb -p 0 & memtest -sa 80000000 -ea FFFDE000 -z -p 0 & memtest -sa 1000000000 -ea 103FFE4000 -z -p 0 & memtest -sa 2000000000 -ea 205FFE2000 -z -p 0 & memtest -sa 3000000000 -ea 30FFFD8000 -z -p 0 & memtest -sa 4000000000 -ea 407FFE0000 -z -p 0 & memtest -sa 5000000000 -ea 507FFE0000 -z -p 0 & memtest -sa 6000000000 -ea 607FFE0000 -z -p 0 & memtest -sa 7000000000 -ea 707FFE0000 -z -p 0 & P00>>> P00>>> show_status ID Program Device Pass Hard/
Example 3– 32 Ps P00>>> ps ID PCB -------- -------00001bca 00360320 00001bc9 00332400 00001b9a 00358f80 00001b93 003a0d80 00001b91 0039fb60 00001b7a 003995a0 00001b78 00398380 00001b61 00394460 00001b5f 00391600 00001b48 0038d500 00001b46 00389a80 00001b2f 00384940 00001b2c 00380980 00001b15 0037b780 00001b13 0037a560 Pri CPU Time Affinity CPU Program State --- -------- -------- --- ---------- ---------------------3 0 00000001 0 ps running 5 0 00000001 0 pkg0_poll waiting on tqe f0ab0 2 6691 ffffffff 8 mem
3.7.3 Testing and Exercising the System The SRM test command tests the hardware in the system or hard partition. If a system or hard partition is soft partitioned, the test command ignores the soft partition environment variables and tests the hardware in the hard partition.
Example 3– 34 Sys_exer P00>>> sys_exer Default zone extended at the expense of memzone.
Example 3–29 for the display of the show_status command related to sys_exer.
Example 3–35 Exer
P00>>> exer dk*.* -p 0 -secs 36000
P00>>> exer -l 2 dkb0
P00>>> exer -sb 1 -eb 3 -bc 4 -a 'w' -d1 '0x5a' dka100
P00>>> exer -a '?r-w-Rc' dka400
Read all SCSI type disks for the entire length of each disk. Repeat this until 36000 seconds (10 hours) have elapsed. All disks will be read concurrently. Each block read will occur at a random block number on each disk.
• A read operation reads from a specified device into a buffer.
• A write operation writes from a buffer to a specified device.
• A compare operation compares the contents of the two buffers.
The exer command uses two buffers, buffer 1 and buffer 2, to carry out the operations. A read or write operation can be performed using either buffer. A compare operation uses both buffers.
Table 3–11 Exer Options
-sb start_block    Specifies the starting block number (hex). The default is 0.
-eb end_block      Specifies the ending block number (hex). The default is 0.
-p pass_count      Specifies the number of passes. If 0, then run forever or until Ctrl/C. The default is 1.
-l blocks          Specifies the number of blocks (hex) to exercise. The -l option takes precedence over -eb. If only reading, omitting both -l and -eb defaults to reading until end-of-file.
Table 3– 11 Exer Options (Continued) - Seek to file offset prior to last read or write ? Seek to a random block offset within the specified range of blocks. s Sleep for a number of milliseconds specified by the delay qualifier. If no delay qualifier is used, sleep for 1 millisecond. Note: Times reported in verbose mode may not be accurate when this character is used.
Example 3– 36 Memexer P00>>> memexer 3 memtest -bs 1000000 -rb -p 0 & memtest -bs 1000000 -rb -p 0 & memtest -bs 1000000 -rb -p 0 & P00>>> show_status ID Program Device Pass Hard/Soft Bytes Written Bytes Read -------- ------------ ------------ ------ --------- ------------- ----------00000001 idle system 0 0 0 0 0 0000011d memtest memory 2 0 0 520093696 20093696 00000123 memtest memory 2 0 0 520093696 520093696 00000162 memtest memory 2 0 0 520093696 520093696 P00>>> kill_diags P00>>> show_status ID Program
Example 3– 37 Fptest P00>>> fptest &p4 P00>>> ps ID PCB -------- -------00000395 002be420 00000394 002ccf40 00000393 002c8580 0000029e 002c5360 0000003a 002b5660 00000036 002aab20 00000014 0027b320 00000012 001d3150 & Pri CPU Time Affinity CPU Program State --- -------- -------- --- ---------- ---------------------3 1 00000001 0 ps running 1 949 00000010 4 fptest running 2 1 00000001 0 sh_bg waiting on 002CCF40 3 7705 00000001 0 shell ready 6 0 ffffffff 0 rx_ewa0 waiting on rx_isr_ewa0 3 6812 00000001 0 sh
Example 3–38 Fakedisk
P00>>> fakedisk a 10
P00>>> fakedisk * 15
P00>>> exer -a '?r-w-Rc' -sec 15 dfa
P00>>> rm dfa
The first command creates a fake disk in memory called dfa, ten 512-byte blocks in size. The second command creates a fake disk in memory for each disk controller on each PCI in the system; each RAM disk is fifteen 512-byte blocks in size. The third command causes the following to run for 15 seconds on fake disk dfa: Set the current block address to the beginning of a random block.
Example 3–39 Nettest
P00>>> nettest ei*
P00>>> nettest -mode in ew*
P00>>> nettest -mode ex -w 10 e*
The first command runs an internal loopback test on all ei type network devices. The second runs an internal loopback test on all ew type network devices. The third runs an external loopback test on all network devices on the system, waiting 10 seconds between tests.
Nettest is the generic network device exerciser. It can test network devices in internal, external, or live network loopback mode. The test works with ports supporting MOP protocol.
3.7.4 Running SRM Loadable Diagnostics Each system comes with an Alpha Systems Firmware CD-ROM. On this CD-ROM are update firmware files, LFU, an expanded SRM console, and diagnostics. Some of these diagnostics are run at power-up. Example 3– 40 Booting the Firmware CD-ROM Place the firmware CD-ROM in the master PCI box CD-ROM device. P00>>> boot dqb0 (boot dqb0.0.0.15.12) block 0 of dqb0.0.0.15.12 is a valid boot block reading 1082 blocks from dqb0.0.0.15.
Not all diagnostics are run at power-up. In order to run loadable console controlled diagnostics, an expanded SRM console must be loaded. The loadable diagnostics test devices on the standard I/O module and a Memory Channel should one be on the system. Example 3–40 shows an example of booting the expanded SRM console. Insert the firmware CD-ROM into the CD-ROM device in the master PCI box. Boot the CD-ROM.
Example 3– 41 Acer_bridge_diag P01>>> P01>>> P01>>> P01>>> set d_trace on set d_group mfg set d_harderr continue acer_bridge_diag -h 12 acer_bridge_ acer_bridge_ acer_bridge_ acer_bridge_ acer_bridge_ 00000076 00000076 00000076 00000076 00000076 | | | Std-I/O Std-I/O Std-I/O Std-I/O Std-I/O H12 H12 H12 H12 H12 1 1 1 1 1 1 2 3 7 8 0 0 0 0 0 0 0 0 0 0 *** Hard Error - Error #1 - Acer IDE Config Compare Error Diag Name ID Device Pass Test Hard/Soft 1-JAN acer_bridge_ 00000076 Std-I/
Example 3– 42 Acer_io_diag P01>>> set d_trace on | P01>>> set d_group mfg | P01>>> set d_harderr continue | P01>>> acer_io_diag -h 12 acer_io_diag 00000075 Std-I/O H12 1 1 0 acer_io_diag 00000075 Std-I/O H12 1 3 0 acer_io_diag 00000075 Std-I/O H12 1 4 0 acer_io_diag 00000075 Std-I/O H12 1 5 0 acer_io_diag 00000075 Std-I/O H12 1 7 0 acer_io_diag 00000075 Std-I/O H12 1 8 0 acer_io_diag 00000075 Std-I/O H12 1 9 0 Cannot run this test on the Console Standard I/O acer_io_diag 00000075 Std-I/O H12 1 10 0 Cannot r
Example 3– 43 Acer_8042_diag P01>>> set d_trace on | P01>>> set d_group mfg | P01>>> set d_harderr continue | P01>>> acer_8042_diag -h 46 acer_8042_di 00000064 Std-I/O H12 acer_8042_di 00000064 Std-I/O H12 acer_8042_di 00000064 Std-I/O H12 1 1 1 1 3 4 0 0 0 0 0 0 12:00:01 12:00:01 12:00:01 *** Hard Error - Error #4 - KeyBoard BIST Failed Diag Name ID Device Pass Test Hard/Soft 1-JAN acer_8042_di 00000064 Std-I/O H12 1 4 1 0 12:00:01 *** End of Error *** acer_8042_di acer_8042_di acer_804
Example 3– 44 Isp1020_diag P01>>> set d_trace on | P01>>> set d_group mfg | P01>>> set d_harderr continue | P01>>> isp1020_diag pka isp1020_diag 00000081 pka 1 isp1020_diag 00000081 pka 1 isp1020_diag 00000081 pka 1 1 2 3 0 0 0 0 0 0 12:00:01 12:00:01 12:00:01 The set d_trace command causes the diagnostic output to display on the console. The set d_group mfg command permits the Acerlab test to be run.
Example 3– 45 Bq4285_diag P01>>> set d_trace on P01>>> set d_group mfg P01>>> set d_harderr continue P01>>> bq4285_diag -h 12 bq4285_diag 0000007e bq4285 bq4285_diag 0000007e bq4285 bq4285_diag 0000007e bq4285 bq4285_diag 0000007e bq4285 3-68 | | | H12 H12 H12 H12 1 1 1 1 1 3 4 5 0 0 0 0 0 0 0 0 12:00:01 12:00:01 12:00:01 12:00:01 The set d_trace command causes the diagnostic output to display on the console. The set d_group mfg command permits the test to be run.
Example 3– 46 Isa_misc_diag P01>>> set d_trace on | P01>>> set d_group mfg | P01>>> set d_harderr continue | P01>>> isa_misc_diag -h 12 isa_misc_dia 00000083 Std-I/O H12 isa_misc_dia 00000083 Std-I/O H12 isa_misc_dia 00000083 Std-I/O H12 1 1 1 1 2 3 0 0 0 0 0 0 12:00:01 12:00:01 12:00:01 The set d_trace command causes the diagnostic output to display on the console. The set d_group mfg command permits the test to be run.
3.7.5 Crashing the System Use the crash command to obtain a crash dump of the system. Example 3– 47 Crash P00>>> crash CPU 0 restarting DUMP: 1983738 blocks available for dumping DUMP: 118178 wanted for a partial compressed dump. DUMP: Allowing 2060017 of the 2064113 available on 0x800001 device string for dump = SCSI 1 1 0 0 0 0 0. DUMP.prom: dev SCSI 1 1 0 0 0 0 0, block 2178787 DUMP: Header to 0x800001 at 2064113 (0x1f7ef1) device string for dump = SCSI 1 1 0 0 0 0 0. DUMP.
The crash command causes an operating system to halt and write the contents of memory to a file that can later be analyzed. Crash dumps can be helpful in determining why a system has malfunctioned. If the environment variable auto_boot is on, the system will reboot; otherwise, it will remain at the SRM prompt. The syntax for this command is: crash [device] The device option specifies the name of the device to which the crash dump is written.
3.8 Troubleshooting with LEDs Diagnostic LEDs are visible only when cabinet doors are open and faceplates are removed. In some instances LEDs may be the only way to identify a power problem.
Figure 3–1 shows the location and condition (on or off) of LEDs on the OCP, AC input box, 48V power supply, PCI power supply, local I/O riser, PSM, and the main and auxiliary power modules when the system is running. The only LEDs visible when the cabinet doors are closed are those on the OCP. When the Halt LED on the OCP is lit, AC is applied to the system, Vaux is on, and the system cannot be powered on remotely. When the Power LED on the OCP is lit, the system is running.
Figure 3– 1 System LEDs (Continued) Figure 3–1 continued shows the location and normal condition (on or off) of LEDs on the CPU, HPM, H-switch power supplies, master clock, and CSB node ID module when the system is running. The following comments assume the system is powered on, all cabinet doors are open, and faceplates removed so all LEDs are visible. Note that LEDs could be off because the system/QBB was powered off remotely or that there is some other power problem. See Section 3.1.
CPU LED – If the Run LED is off when it should be on, the CPU could be broken. H-switch power manager LEDs – If the DC OK LED is off, the onboard +5V and/or +3V regulator is broken. If the Reset/Initialize LED is on, the module is in the reset state and may not have passed self-test. H-switch power supply LEDs – If the Vaux LED is off, check that Vaux is OK at other system locations. If it is, Vaux is not present at the power supply for some reason or the power supply is broken.
3.9 Dealing with a Hung System
Troubleshooting a hung system depends upon what was running at the time of the hang. In general, these systems are designed not to hang. If a transaction times out or forward progress is not made for some reason, the event is considered a fault and a running system should crash.
Table 3–12 Hung System Suggestions
1. Try logging in remotely and investigate what the system is doing.
2. Check LEDs in QBBs and power supplies, and fix anything abnormal.
3.
Troubleshooting a hang is difficult. The suggestions in Table 3–12 are intended to give you a start. There are some causes you can eliminate. Theoretically, at the hardware level, the system should not hang. Transactions are tracked such that if one is not making forward progress, a timeout is triggered, a machine check is generated, and the system crashes.
3.9.1 Troubleshooting a Diagnostic Hang SROM and XSROM diagnostics report to the SCM monitor that they are hung. Example 3– 48 Diagnostic Hang SCM_EF> . . . QBB0 now Testing Step-1 QBB1 now Testing Step-1................. . . . QBB1 now Testing Step-2...............
Refer to Example 3–48. SROM/XSROM diagnostics are expected to complete in a certain amount of time. If that time is exceeded, a timeout occurs and is reported to the SCM. CPU1 in QBB0 hangs running test 1b subtest 19. Test 1b is the local IOP configuration test ID test and is run on a local primary. When the test hangs on CPU1, a new local primary is selected, CPU3 on QBB0. It too hangs. In QBB1 the same thing is happening. CPU1 hangs running the same test/subtest.
Example 3– 48 Diagnostic Hang (Continued) SCM_EF> QBB0 now Testing Step-6 . . QBB0 Step-b Tested IO_MAP0 from QBB0: 0000c00002322233 IO_MAP1 from QBB1: 00cf000004f444f3 No connection from RIO1 in PCI Drawer f Phase 2 ***SCM: CONFIG ERROR. SOFT ID NOT DETERMINED QbbConf: 000000dd PQbb : ff SQbb0 : ff QbbCnt : 02 QbbConf(GP_IOR_CPU_MEM) SCM_EF> QBB1 now Testing Step-c QBB0 Step-c Tested Phase 3 ***SCM: CONFIG ERROR.
Despite the hang in the two QBBs, power-up continues on the CSB. The PCI box with the standard I/O module is not connected through the QBB backplane IOP – local I/O riser – near end mini link – hose – far end mini link – PCA remote I/O riser – PCI backplane to the standard I/O module. The same condition is seen during the remaining phases of power-up. The power on summary shows a failure on each CPU in the system.
3.9.2 Troubleshooting a Diagnostic Fault SROM and XSROM diagnostics report faults to the SCM monitor. Example 3– 49 A Sample Diagnostic Fault . . QBB0 now Testing Step-6 | QBB1 Step-6 Tested | . | QBB1 Step-a Tested | QBB2 Step-6 Tested | . | QBB2 Step-a Tested | QBB3 Step-6 Tested | . | QBB3 Step-a Tested | SCM_E0> ..............
Now waiting 10 seconds after HPM reset ******* Waiting 5 seconds before sending restart to PSMs SCM_E0> ... ****QBB3-Cpu0TestHang Test:53 Subtest:1 | ****QBB3-Cpu1TestHang Test:53 Subtest:1 | ****QBB3-Cpu2TestHang Test:53 Subtest:1 | ****QBB3-Cpu3TestHang Test:53 Subtest:1 | Example 3–49 shows reports sent to the SCM over the console serial bus when an unexpected fault occurs during SROM/XSROM testing. The system is a four QBB system. The system faults and is detected by the PSM.
3.10 Dealing with Corrupt Firmware Each microprocessor on the CSB runs firmware located in flash ROMs on the module or backplane close to the microprocessor. If this firmware is corrupt, a new image can be loaded into the flash ROMs by having the microprocessor running a fail-safe loader image. Only the SCM update command can be used to load the new firmware.
Initially on power-up or reset each microprocessor on the CSB runs a fail-safe loader image and the microprocessor is said to be in fail-safe loader mode (FSL mode). This FSL image resides in flash ROM in a different location than the normal firmware image run in the microprocessor. The FSL image has two functions: • it runs a checksum test on the primary firmware run by the microprocessor.
3.11 Error Detection Error detection is distributed throughout the system. Figure 3– 2 Core System Error Detectors Figure 3–2 is a block diagram showing the data error detectors in the system. There are three types of errors: • Correctable errors are detected either by the system or by a CPU.
and PALcode builds a 660 system uncorrectable error frame that is deposited in the error log. If the CPU detects the error, an error interrupt is generated for that CPU, the system crashes, and PALcode builds a 670 processor uncorrectable error frame that is deposited in the error log. • Faults are errors that compromise the coherence of the system.
3.12 Compaq Analyze Compaq Analyze is the error analysis tool used to analyze errors. The tool runs automatically in the background monitoring the active error log and processing events as they occur. For information on installing, running, and learning about Compaq Analyze, refer to the WEBES V3.0 GS80/160/320 CD-ROM. Compaq Analyze can be run manually using a Web browser or using a command-line interface. 3.12.
Figure 3–3 is an example of what you might see when running Compaq Analyze manually using a Web browser. There are two methods available for users to run Compaq Analyze. The method shown here is through a Web browser interface. Either Netscape version 3.x or higher or Internet Explorer version 4.0 or later is required. The second method is through the use of a command-line interface. Both methods are described in the Compaq Analyze User Guide on the WEBES V3.0 GS80/160/320 CD-ROM.
3.12.2 Problem Found Report Compaq Analyze runs in the background and continually analyzes binary entries in the error log. If an error entry meets error criteria, a problem found report is delivered through Compaq Analyze. The problem found report states the problem and identifies the most likely faulty FRU. One can retrieve the problem found report by selecting the Problem Found icon and opening the file.
available to help identify the source of the error. This error bit is implemented as a copy of the Valid bit in the MEM_RD_UCE_TRAP register. NOTE: To determine the array in error, a valid configuration tree is required. If this is not available, the entire memory module, including its DIMMs, will be called out. For an uncorrectable memory error, a single DIMM in error cannot be determined. At a minimum, a group of 4 DIMMs (i.e., a memory array) will be called out. No memory write errors have been identified.
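The callout rule in the note above can be summarized as a simple decision. The following is a minimal sketch, assuming a boolean that says whether a valid configuration tree is available; the function name and return strings are hypothetical and are not Compaq Analyze output.

```python
def uncorrectable_memory_callout(config_tree_valid: bool) -> str:
    """Return the FRU scope called out for an uncorrectable memory read error."""
    if not config_tree_valid:
        # Without a configuration tree the array in error cannot be determined.
        return "entire memory module, including its DIMMs"
    # A single DIMM cannot be isolated; the smallest callout is one array.
    return "memory array (group of 4 DIMMs)"


print(uncorrectable_memory_callout(True))
print(uncorrectable_memory_callout(False))
```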
Example 3– 50 Problem Found (Continued) FRU List: Warning Probability Fru Manufacturer Fru Model Fru PartNumber Fru SerialNumber Fru FirmwareRev Fru SiteLocation Fru CabinetID Fru Position Fru Chasis Fru Assembly Fru SubAssembly Fru Slot : : : : : : : : : : : : : : FRU Configuration Data Not Available {-} High | | Memory DIMM 0 | - Probability Fru Manufacturer Fru Model Fru PartNumber Fru SerialNumber Fru FirmwareRev Fru SiteLocation Fru CabinetID Fru Position Fru Chasis Fru Assembly Fru SubAssembly Fru
Fru Fru Fru Fru Chasis Assembly SubAssembly Slot Probability Fru Manufacturer Fru Model Fru PartNumber Fru SerialNumber Fru FirmwareRev Fru SiteLocation Fru CabinetID Fru Position Fru Chasis Fru Assembly Fru SubAssembly Fru Slot : : : : - : : : : : : : : : : : : : Medium Compaq Memory Module 1 - | | | Evidence: Time of Event Errorlog Entry Id WF660 Rule Revision EEprom SDD written : : : : Sat, 8 Jan 2000 11:30:38 4-346 X1.
3.12.3 Description of the Error (660) If you want to view the error log entry, select the appropriate event. Example 3–51 shows the Compaq Analyze error report associated with the problem found in Example 3–50. Example 3–51 Compaq Analyze Error Report COMMON EVENT HEADER (CEH) V OS Type OpenVM
exc addr xFFFF FFFF Exception Address Register pc[ ] xFFFF FFFF E C Exception Address ier cm x E FFFE Interrupt Enable Current Processor M
Example 3–51 Compaq Analyze Error Report (Continued) START OF SUBPACKETS IN THIS EVENT Subpkt_C _T _V Memory Error Frame Subpacket Version xFFFF FFFF FFD entity[ ] base_physical_address Entity Memory Module MEM qbb id[ ] x F Memory Error Summary Reg Base physical address QBBid QBB x Rd_err_ptr[ ] MEM ERR SUM
MEM_RD_CE_TRAP_ x Memory Correctable Read Error Trap arb bus[ ] x wrap[ ] x Data Wrapping Order addr[ ] x Data Block Address trans[ ] x Transaction Write to M
Example 3–51 Compaq Analyze Error Report (Continued) x C B Memory Uncorrectable Read Error Trap arb_bus[ ] x ArbBus Snapshot wrap[ ] MEM RD UCE TRAP x Data Wrapping Order addr[ ] x D Data Block Address x trans[ ] cid[ ] Transaction Write to M
trans[ ] cid[ ] x Transaction Write to Memory x Commander ID CPU qw_log[ ] x Quadword in Error qw_err[ ] x Uncorrectable Error qw synd[ ] x C Error Syndrome blk_corr_mis[ ] x Number of Missed Correctable Errors blk uncorr mis[ ] x Number of M
Example 3– 52 Problem Found (680) Problem Found: There is a Vital Power Failure in the Firebox at Mon Feb 14 14:39:24 EST 2000 Managed Entity: System Name System Type System Serial OS Type :wfsi21 :Compaq AlphaServer GS320 6/731 :PROTO-WF21 :Digital UNIX T4.0G-6 (Rev. 1474) Brief Description: There is a Vital Power Failure in the Firebox There is not enough power for the Firebox.
Firmware Rev Site Location Cabinet Id Position Chassis Assembly Subassmbly Slot : : : : : : : : Firebox Power Cabinet Front, Second from Top Power Subrack Ps1 Ps2 Ps3 Evidence: Time Event was Logged Time Event Occurred Unique ID Count Unique ID Prefix Rule Revision : Thu, 27 Jan 2000 06:59:34 -0500 : 27 Jan 2000 11:55:19 : 0 : 27392 : x1.0 The brief description summarizes the problem. In this case the problem is that there is not enough power to keep the two QBBs in the system box running.
COMMON EVENT HEADER (CEH) V2.
PSM System Event Frame Subpacket - Version 1 PSM_Elapsed_Time_Since_Srm_ 1,036 Seconds Since Last Console Boot Boot PSM_Info_Block x0032 00FF 7C84 0001 PSM System Event Information Not Enough 48V Regulator Power PSM_System_Event_Code[7:0] x01 Available PSM_Supplementary_Code[15:8] x00 Ps1_Vaux_Ok[16] x0 Power Supply 1 Vaux NOT Ok Ps1_48v_Ok[17] x0 Power Supply 1 48 Volts NOT Ok Ps2_48v_Ok[19] x0 Power Supply 2 48 Volts NOT Ok Ps3_Vaux_Ok[20] x0 Power Supply 3 Vaux NOT Ok Ps3_48v_Ok[21] x0 Power Supply 3 48
PSM System Event Frame Subpacket - Version 1 PSM_Elaps_Time_Since_Srm_Boot PSM_Info_Block PSM_Sys_Event_Code [7:0] PSM_Supplement_Code [15:8] Ps1_Vaux_Ok[16] Ps1_48v_Ok[17] Ps2_48v_Ok[19] Ps3_Vaux_Ok[20] Ps3_48v_Ok[21] Ps1_Temp_Ok[25] Cpu0_Dcok[40] Cpu1_Dcok[41] Cpu2_Dcok[42] Cpu3_Dcok[43] Ior2_Dcok[46] Ior3_Dcok[47] CSB_Address[55:48] 1,036 Seconds Since Last Console Boot x0033 30FF 7C84 0001 PSM System Event Information Not Enough 48V Regulator Power x01 Available x00 x0 x0 x0 x0 x0 x0 x0 x0 x0 x0 x0 x0
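The bit ranges printed beside each field name (for example, PSM_System_Event_Code[7:0]) show how the 64-bit PSM_Info_Block value breaks down. The sketch below decodes the value x0032 00FF 7C84 0001 from the first report using a subset of those ranges. The helper names are hypothetical, and only fields whose bit positions appear in the reports are included.

```python
# (high bit, low bit) for a subset of the fields listed in the reports above.
PSM_FIELDS = {
    "PSM_System_Event_Code": (7, 0),
    "PSM_Supplementary_Code": (15, 8),
    "Ps1_Vaux_Ok": (16, 16),
    "Ps1_48v_Ok": (17, 17),
    "Ps2_48v_Ok": (19, 19),
    "Ps3_Vaux_Ok": (20, 20),
    "Ps3_48v_Ok": (21, 21),
    "Ps1_Temp_Ok": (25, 25),
    "CSB_Address": (55, 48),
}


def extract(value: int, high: int, low: int) -> int:
    """Return bits high..low (inclusive) of value."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def decode_psm_info_block(value: int) -> dict:
    """Split a 64-bit PSM_Info_Block value into named fields."""
    return {name: extract(value, hi, lo) for name, (hi, lo) in PSM_FIELDS.items()}


info = decode_psm_info_block(0x0032_00FF_7C84_0001)
for name, field in info.items():
    print(f"{name:<24} x{field:X}")
```

For this value the event code decodes to x01 and the listed power-supply OK bits decode to x0, which agrees with the report text.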
Example 3– 53 620 Error Report COMMON EVENT HEADER (CEH) V2.
mm_stat opcode[9:4] x0000 0000 0000 0000 x00 Memory Management Status Register Opcode of the Instruction that Caused the Error cpu_ce_err_summ QBB0[0] x0000 0000 0000 0001 x1 System Correctable Error Summary Register QBB0 Correctable Errors Reported QBB0_csrs_to_be_logged mem0[20] x0000 0000 0010 0000 x1 Registers logged for QBB0: Memory Module 0 START OF SUBPACKETS IN THIS EVENT System Error Frame Header Subpacket - V1.
valid[63] x0 Error information is NOT valid MEM_RD_CE_TRAP_2 arb_bus[37:0] wrap[1:0] addr[31:2] trans[34:32] cid[37:35] qw_log[45:43] qw_err[47:46] qw_synd[55:48] blk_corr_mis[58:56] valid[63] x0000 0000 0000 0000 x00 0000 0000 x0 x0000 0000 x0 x0 x0 x0 x00 x0 x0 Memory Correctable Read Error Trap 2 ArbBus Snapshot Data Wrapping Order Data Block Address <35:6> Transaction = Write to Memory Commander ID = CPU0 Quadword in Error = 0 No Error Error Syndrome Number of Missed Correctable Errors Error inform
arb_bus[37:0] wrap[1:0] addr[31:2] trans[34:32] cid[37:35] qw_log[45:43] qw_err[47:46] qw_synd[55:48] blk_corr_mis[58:56] valid[63] x00 0000 0000 x0 x0000 0000 x0 x0 x0 x0 x00 x0 x0 ArbBus Snapshot Data Wrapping Order Data Block Address <35:6> Transaction = Write to Memory Commander ID = CPU0 Quadword in Error = 0 No Error Error Syndrome Number of Missed Correctable Errors Error information is NOT valid MEM_WT_CE_TRAP_2 arb_bus[37:0] wrap[1:0] addr[31:2] trans[34:32] cid[37:35] qw_log[45:43] qw_err[47:46
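When reading a trap register such as MEM_RD_CE_TRAP_2, the valid[63] bit determines whether the other fields mean anything. In the report above the register is all zeros, so the error information is not valid. The sketch below checks the valid bit before decoding. The field positions are taken from the report; the function name is hypothetical, and the left shift by 6 assumes that the addr field holds address bits <35:6>, as the "Data Block Address <35:6>" label suggests.

```python
# (high bit, low bit) as printed in the report.
TRAP_FIELDS = {
    "wrap": (1, 0),
    "addr": (31, 2),
    "trans": (34, 32),
    "cid": (37, 35),
    "qw_log": (45, 43),
    "qw_err": (47, 46),
    "qw_synd": (55, 48),
    "blk_corr_mis": (58, 56),
    "valid": (63, 63),
}


def decode_trap(raw: int):
    """Decode a 64-bit trap register; return None when valid[63] is clear."""
    fields = {
        name: (raw >> lo) & ((1 << (hi - lo + 1)) - 1)
        for name, (hi, lo) in TRAP_FIELDS.items()
    }
    if not fields["valid"]:
        return None
    # Assumption: addr carries address bits <35:6>, so shift left by 6.
    fields["block_address"] = fields["addr"] << 6
    return fields


print(decode_trap(0x0000_0000_0000_0000))  # all-zero register, as in the report
```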
Example 3– 54 630 Error Report COMMON EVENT HEADER (CEH) V2.
Chapter 4 GS160/320 Component Removal and Replacement This chapter describes the removal and replacement procedures for components in system cabinets.
4.1 System Safety These systems use a great deal of power. Take precautions when working on them.
Table 4–1 lists the various power and mechanical hazards in the system. Use caution when servicing these systems. WARNING: When the system is off and plugged into an AC outlet, Vaux is still supplied to the system. To remove all power, unplug the AC input box(es) or trip the main circuit breaker on the AC input box(es).
4.
Table 4– 2 Field-Replaceable Unit Part Numbers (Continued) Console Serial Bus Modules 54-25125-01 CSB node ID module (PCI) 54-25355-01 H-switch CSB interface 54-25371-01 CSB interface in both distribution board housing and GS80 drawer 12-45925-01 Connector, adapter, 2RJ45 (power cab frame) 12-45926-01 Connector, terminator, molded, 8 pos, PCI Modules (excluding power) B4190-xx Standard I/O module 54-25127-01 Standard I/O cable interface module B4171-xx Remote I/O riser (a.k.a.
Table 4– 2 Field-Replaceable Unit Part Numbers (Continued) Drawer Modules B4172-xx Drawer riser B4173-xx Drawer riser interface BA185-xx Drawer distribution panel (a.k.a.
Table 4– 2 Field-Replaceable Unit Part Numbers (Continued) Fans 12-23609-26 PCI fan 12-45727-01 Fan on drawer 12-47545-01 Blower (used in system cabinets 1 and 2) Table 4– 3 FRU Cables Cable Description From To 17-00083-03 Power cord (GS80 in North America) AC input box PCI or storage device 17-03212-04, 05 Signal cable Terminal server Adapter on local port of STD I/O 17-04308-05 Signal cable SMC PC Terminal server 17-00442-18 Power cord (all GS160/320 + GS80 in Europe/Japan) AC inpu
Table 4–3 FRU Cables (Continued)
Cable          Description                           From                 To
17-04713-01    50 pin sig cable                      Power cab bulkhead   Front QBBs in sys cab 1
17-04713-02    50 pin sig cable                      Power cab bulkhead   Front QBBs in sys cab 2
17-04714-01    Power cable                           Pwr subrack          Pwr cab bulkhead
17-04715-01    Power cable                           Pwr cab bulkhead     Blower
17-04715-02    Power cable                           Pwr cab bulkhead     Blower
17-04716-01    Power (48V/Vaux) –01 long –02 short   QBB                  H-switch
17-04716-02
17-04722-01    Power (48V/Vaux)                      QBB (8-P sys only)   Dual-ou
Table 4–3 FRU Cables (Continued)
Cable          Description     From                To
17-04845-01    Power cable     Power subrck (dr)   Drawer bulkhead
17-04846-01    Power harness   Power subrck (dr)   Drawer bulkhead
17-04847-01    Ribbon cable    Drawer bckplane     CSB module
17-04847-02    Ribbon cable    Drawer bckplane     CSB module
17-04847-03    Ribbon cable    Distrib. board      CSB module
17-04847-04    Ribbon cable    H-switch            H-switch CSB mod.
4.3 FRU Power States Defined With operating system support, these systems can operate in power states that allow FRUs to be removed and replaced or added while other parts of the system remain running.
CPU Memory Directory Global port Local I/O riser Clock splitter Main power mod Auxiliary power mod Power system mod System box Distribution board HS power supply (sb) (sb) (sb) (sb) (dr) (dr) (dr) (dr) HS master clock mod HS power manager CSB module Cabinet blower 48V power supply PCI power supply AC Off OS & SCM commands OS & SCM commands OS & SCM commands OCP Comments OS commands (dr) (dr) (dr) Clock module HS backplane (sb) (sb) (sb) Cold FRU Warm FRU Po
4.3.1 Hot-Swapping a FRU The hardware supports three types of FRU that can be removed while power is applied to the rest of the system: power supplies (three different kinds), CPUs, and local I/O risers.
CPU
1. Put the CPU in the hot-swap state:
• For OpenVMS: Planned feature - refer to the OpenVMS documentation
• For Tru64 UNIX: Planned feature - refer to the Tru64 UNIX documentation
2. The yellow Swap OK LED lights on the target CPU.
4.3.2 Warm-Swapping a FRU Only FRUs in partitioned GS160/320 systems can be placed in a warm-swap state. In partitioned systems a QBB can be isolated and powered off, thus putting it (and its FRUs) in a warm-swap state. To put the FRU in the warm-swap state, shut down the operating system running in the partition containing the target QBB and power off the partition. Example 4–1 Warm-Swap State (assumes the system is partitioned) 1.
4.3.3 Cold-Swapping a FRU FRUs that require a cold-swap state are all modules in GS80 systems except CPUs and I/O risers which may be hot swapped, clock modules, and cabinet blowers. GS80 Modules Except the CPUs and Local I/O Risers 1. Shut down the operating system running in the affected drawer. 2. If the system is partitioned, use the SRM power off command from the console connected to the partition containing the FRU. 3.
4.3.4 Getting a FRU into the AC Off State FRUs that require AC Off are system backplanes, clock modules, the hierarchical switch backplane and power manager, and a power subrack. The PCI is a special case where the system can be running, but all power is removed from the PCI. Dual-Output Clock, H-switch Clock, Distribution Board, H-switch, CSB Module, OCP, and HPM 1. Shut down the operating system(s). 2. Trip the master circuit breaker on all AC input boxes. AC is now removed. System Box 1.
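The FRU-to-state assignments stated in Sections 4.3.1 and 4.3.4 can be collected into a lookup table, which is convenient when checking the required power state before starting a procedure. This is a minimal sketch: the enum, the dictionary, and the FRU name strings are hypothetical, and only assignments stated explicitly in those two sections are included.

```python
from enum import Enum


class PowerState(Enum):
    HOT_SWAP = "hot swap"
    AC_OFF = "AC off"


# FRU names are lower-cased so lookups are case-insensitive.
REQUIRED_STATE = {
    "power supply": PowerState.HOT_SWAP,
    "cpu": PowerState.HOT_SWAP,
    "local i/o riser": PowerState.HOT_SWAP,
    "system backplane": PowerState.AC_OFF,
    "clock module": PowerState.AC_OFF,
    "h-switch backplane": PowerState.AC_OFF,
    "h-switch power manager": PowerState.AC_OFF,
    "power subrack": PowerState.AC_OFF,
}


def required_state(fru: str) -> PowerState:
    """Return the power state a FRU must be in before removal."""
    try:
        return REQUIRED_STATE[fru.lower()]
    except KeyError:
        raise KeyError(f"no power state recorded for {fru!r}") from None


print(required_state("CPU").value)
print(required_state("Clock module").value)
```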
4.4 System Box Module Location and Identification QBB and slot identify module locations. Since global ports must be physically close to each other, backplanes are rotated and flipped such that slot locations shift relative to the cabinet.
Figure 4–2 System Box Module Location (front) [Figure: front view of the system box module slots for 16-P and 32-P systems. Labeled positions: PSM, main power, aux power, signal connect, CPU0–CPU3, MEM0–MEM3, DIR, clock splitter, and I/O risers; QBB5 is marked orange.]
4.4.1 Power Color Codes Each system box and power subrack, and the outlets and circuit breakers of each AC input box, are color-coded to organize cabling and parts placement.
Figure 4–3 shows the front of a GS320 system showing the power system color codes assigned to each system box, power subrack, and AC outlets. The color codes also appear above the circuit breakers of AC input boxes. Cabling and power system parts placement follow this color code scheme.
4.4.2 Module Color Codes All modules that plug into a QBB are color-coded.
Each module placed in a QBB is color-coded to correspond to both system box and drawer color-coded slots. Table 4–7 lists the modules and their associated color codes. Figure 4–4 shows slot location color codes for both system boxes and drawers. For system box systems, QBBs located at the front of system cabinets have global port slots located on the rear of the backplane. Note that the system box orientation depends upon where the box is relative to the hierarchical switch.
4.5 System Box Module Access All doors on GS160 and GS320 systems have locks, and access to almost all modules requires the removal of a cover plate.
Access to system box modules both front and back requires opening the system cabinet doors and removing the system box faceplate. The orientation of the system box and its faceplate depends upon the target QBB. Figure 4–5 shows QBB1 in the front of system cabinet 1. Removal of the faceplate is similar for all QBBs. To remove the faceplate, use a Phillips head screwdriver to loosen, by a ½ turn, the three slide fasteners that hold the faceplate in place.
4.5.1 Memory, Directory, Main Power, or Auxiliary Power Module Removal and Replacement Each of these modules is a warm-swap module in GS160/320 systems.
Module Removal 1. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip step 2. 2. If the system is partitioned, from the console connected to the partition with the target FRU, shut down the operating system and power it off using the SRM power off command. Put the switch on the PSM(s) in the Service position. Note that only hard partitions can be powered off.
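Steps 1 and 2 above form a two-way branch on whether the system is partitioned. The sketch below expresses that branch as a function that returns the preparation actions in order. The function name is hypothetical, and the action strings paraphrase the procedure text; this is not a tool that ships with the system.

```python
from typing import List


def warm_swap_preparation(partitioned: bool) -> List[str]:
    """Return the preparation actions before removing a warm-swap module."""
    if not partitioned:
        return [
            "Shut down the operating system",
            "Issue the SRM power off command",
            "Put the OCP switch in the Off position",
        ]
    return [
        "From the console connected to the partition with the target FRU, "
        "shut down the operating system",
        "Power off the partition using the SRM power off command",
        "Put the PSM switch in the Service position",
    ]


for number, action in enumerate(warm_swap_preparation(partitioned=True), start=1):
    print(f"{number}. {action}")
```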
4.5.2 CPU Removal and Replacement The CPU is a hot-swap module.
Module Removal 1. If the operating system supports hot-swap CPU, enter the appropriate OS command to put the target CPU in the hot-swap state. See Section 4.3.1. Skip steps 2 and 3. 2. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip step 3. 3.
4.5.3 Power System Manager Removal and Replacement Since the firmware on a spare PSM could be out of date, replacement of the PSM may require a firmware update.
Module Removal 1. If the system is not partitioned, shut down the operating system and put the OCP switch in the Off position. Skip step 2. 2. If the system is partitioned, from the console connected to the partition with the target FRU, shut down the operating system and power it off using the SRM power off command. Note, only hard partitions can be powered off. If soft partitions are used, they must be shut down and the hard partition must be powered off from the console controlling the partition. 3.
4.5.4 Clock Splitter Module Removal and Replacement Located next to the global port, the clock splitter provides identical copies of the clock to synchronize transactions.
Module Removal 1. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip step 2. 2. If the system is partitioned, from the console connected to the partition with the target FRU, shut down the operating system and power it off using the SRM power off command. Put the switch on the PSM(s) in the Service position. Note that only hard partitions can be powered off.
4.5.5 Local I/O Riser Removal and Replacement The local I/O riser modules can be removed without removing the QBB faceplate and opening up the system box.
Module Removal 1. If the operating system supports hot-swap I/O, enter the appropriate OS command to put the target local I/O riser in the hot-swap state. See Section 4.3.1. Skip steps 2 and 3. 2. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip step 3. 3.
4.5.6 Global Port Module Removal and Replacement The global port is the module closest to the distribution board or to the H-switch in systems with more than one system box. The pins on the cables are very fragile. Figure 4–11 Global Port Removal Module Removal 1. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip step 2. 2.
4. Remove the EMI cover from either the H-switch or the distribution board housing, whichever is adjacent to the QBB you are working on. 5. Using both hands, one on each module lever, place your index finger on the catch and your thumb on the edge of the lever just below/above the arrow. First squeeze to release the lever and then pull both levers away from the module to release it from the QBB backplane. See Figure 4–11. 6.
4.5.7 Memory or Directory DIMM Removal and Replacement DIMMs for memory and for the directory are different but the procedure for removing and replacing them is the same. Be sure you are replacing the broken DIMM with the same DIMM variant.
Removal 1. Remove the target directory or memory module. Follow the procedure described in Section 4.5.1. 2. Place the module on an anti-static mat on a flat surface with the DIMMs facing up. 3. Identify the DIMM to replace. Figure 4–12 shows the physical layout of both the memory module and the directory module. 4. There are locking levers on the end of each DIMM connector. Open the levers and gently pull the DIMM from the connector. Replacement Reverse the steps outlined in the Removal procedure.
4.5.8 System Box Removal and Replacement If a QBB backplane requires replacement, the system box is replaced. This procedure requires two people.
Removal (Requires two people) 1. If the system is not partitioned, shut down the operating system and remove AC by powering off the system and tripping the main circuit breakers on the AC input boxes. Skip step 2. 2. If the system is partitioned, shut down instances of the operating system in the target QBBs in the system box. Remove all AC power by tripping the circuit breakers on the AC input box that controls the system box to be removed. See Section 4.4.1 for color-code information. 3.
4.6 GS160 Distribution Board Assembly Modules Modules in the distribution board assembly are the distribution board, the console serial bus module, and the dual-output clock module. 4.6.
Removal 1. Shut down the operating system(s), put the OCP switch in the Off position, and trip the main circuit breaker on the AC input boxes. 2. Open the rear door. 3. Remove the faceplate on QBB0. (See Section 4.5.) 4. Unscrew the two captive screws that hold the EMI cover in place over the distribution board assembly and remove the cover. The distribution board is now exposed. 5. For each blue cable from top to bottom, release the cable from the plastic cable-dressing clip. 6.
4.6.2 Dual-Output Clock Module Removal and Replacement The dual-output clock module provides the clock signal to both QBBs.
Removal 1. Shut down the operating system(s), put the OCP switch in the Off position, and trip the main circuit breaker on the AC input box. 2. Open the rear door. 3. Remove the faceplate on QBB0. (See Section 4.5.) 4. Unscrew the two captive screws that hold the EMI cover in place over the distribution board assembly and remove the cover. The dual-output clock module is now exposed. 5. Unplug the two coax clock cables. 6. Unplug the power cable. 7.
4.6.3 Console Serial Bus Module Removal and Replacement All AC power must be off when removing this module. In systems with an H-switch, a similar module is located in the H-switch enclosure.
Removal 1. Shut down the operating system(s), put the OCP switch in the Off position, and trip the main circuit breaker on all AC input boxes. 2. Open the rear door. 3. Remove the lower QBB faceplate. (See Section 4.5.) 4. Unscrew the two captive screws that hold the EMI cover in place over the distribution board assembly and remove the cover. The console serial bus module is now exposed. 5. Unplug the three cables from the module: two internal cables and one cable external to the distribution board housing.
4.7 Hierarchical Switch Assembly Modules Several parts in the hierarchical switch assembly can be replaced. 4.7.1 H-switch Removal and Replacement The H-switch FRU is the entire assembly and cables for QBBs in a GS160 system. Replacing the H-switch requires two people.
Removal (This procedure requires two people.) The hierarchical switch FRU comes cabled for a GS160 and contains clock and power cables for a GS320. This procedure describes replacing a FRU in a GS320. If your system is a GS160, omit step 6. Ground cables for the H-switch stay in the system box. 1. Shut down the operating system(s). 2. Remove all AC power by tripping the main circuit breakers on all AC input boxes. 3. Open the system cabinet front and rear door(s). 4.
4.7.2 H-switch Power Supply Removal and Replacement You can hot swap a redundant power supply in the H-switch assembly.
Removal 1. Open the system cabinet rear door(s). 2. If you are removing PS1, in Figure 4–18, remove the faceplate from QBB0 and then remove the lower left EMI cover from the H-switch housing by unscrewing the two captive screws that hold it to the housing. Check that the Swap OK LED is lit. See Section 1.15.3. 3.
4.7.3 H-switch Clock Module Removal and Replacement All AC power must be off when removing this module. It is located just above the H-switch. Access is gained from the upper left side of the H-switch housing.
Removal 1. Shut down the operating system(s), put the OCP switch in the Off position, and trip the main circuit breakers on both AC input boxes. 2. Open the rear door(s). 3. Remove the upper H-switch power supply. See Section 4.7.2. The clock module is now exposed. 4. Unplug all coax cables connected to the module making sure that the QBB ID labels are secure. (You may want to use needle-nosed pliers for this.) 5. Unplug the ribbon cable that goes to the H-switch module. 6.
4.7.4 H-switch Power Manager Removal and Replacement All AC power must be off when removing this module. It is located in the lower right side of the H-switch housing.
Removal 1. Shut down the operating system(s), put the OCP switch in the Off position, and trip the main circuit breaker on all AC input boxes. 2. Open the rear door(s). 3. In GS320 systems, skip to step 9. 4. In GS160 systems, remove both the upper and lower QBB faceplates. (See Section 4.5.) 5. Remove the upper and lower H-switch EMI covers. Unscrew the two captive screws that hold them in place and remove them. 6.
4.7.5 Console Serial Bus Module Removal and Replacement All AC power must be off when removing this module. It is located in the lower right side of the H-switch housing.
Removal 1. Shut down the operating system(s), put the OCP switch in the Off position, and trip the main circuit breaker on all AC input boxes. 2. Open the rear door(s). 3. In GS320 systems, skip to step 9. 4. In GS160 systems, remove both the upper and lower QBB faceplates. (See Section 4.5.) 5. Unscrew the four captive screws that hold the upper and lower H-switch EMI covers in place and remove both covers. 6.
4.8 System Cabinet Blower Removal and Replacement The QBBs in the cabinet from which the blower will be removed must be off.
Removal 1. If the system is partitioned and is a GS320, operating systems running in the cabinet that does not contain the failing blower can continue to run while the repair is done on the blower in the other cabinet. 2. Open the rear door of the power cabinet and trip the circuit breakers of the AC input box powering the subracks that power the system cabinet containing the failing blower. (Do not trip the main circuit breaker since peripherals may be powered by this particular AC input box.) 3.
Chapter 5 Power Cabinet Component Removal and Replacement This chapter describes the removal and replacement procedures of components and options in the GS160/320 power cabinet.
5.1 PCI Modules The PCI boxes are mounted in power or expander cabinets. Except for the power supply, service the PCI box from the rear of the cabinet.
PCI Box Access 1. Remove the I/O resources from the operating system by whatever means necessary. You may have to shut down the system or a partition or use some other means available through the particular operating system or SRM. See individual FRU removal and replacement procedures. 2. Open the front door of the cabinet and unplug the PCI power supplies. 3. Open the rear door of the cabinet. 4.
5.1.1 Standard I/O Module Removal and Replacement The standard I/O module is located at the far right of the PCI card cage. AC must be removed from the PCI box when this module is replaced.
Removal 1. If the operating system in control of the PCI containing the target FRU supports hot-swap I/O and taking the I/O resources away from it will allow it to continue to operate, follow operating system procedures to put the local I/O riser into the hot-swap state. See Section 4.3.1 and skip to step 4. 2. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip to step 4. 3.
5.1.2 Console Serial Bus Node ID Module Removal and Replacement The CSB node ID module is located in the right rear corner of the PCI box and is attached to the box from the outside. AC must be removed from the PCI box when this module is replaced. Figure 5–3 CSB Node ID Module Removal Removal 1.
2. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip to step 4. 3. If the system is partitioned, from the console connected to the partition to which the PCI box containing the target FRU is attached, shut down the operating system and power it off using the SRM power off command. Put the OCP switch in the Secure position. Pull the plugs on the target PCI power supplies. 4.
5.1.3 Remote I/O Riser Removal and Replacement Two remote I/O riser modules are located in slots marked R0 and R1 in the PCI box.
Removal 1. If the operating system in control of the PCI containing the target FRU supports hot-swap I/O and taking the I/O resources away from it will allow it to continue to operate, follow operating system procedures to put the local I/O riser into the hot-swap state. See Section 4.3.1 and skip to step 4. 2. If the system is not partitioned, shut down the operating system and issue the SCM power off command. Put the OCP switch in the Off position. Skip to step 4. 3.
5.1.4 PCI Option Removal and Replacement The PCI option to be removed may be in any of the 14 PCI slots.
Removal 1. If the operating system in control of the PCI containing the target FRU supports hot-swap I/O and taking the I/O resources away from it will allow it to continue to operate, follow operating system procedures to put the local I/O riser into the hot-swap state. See Section 4.3.1 and skip to step 4. 2. If the system is not partitioned, shut down the operating system and issue the SCM power off command. Put the OCP switch in the Off position. Skip to step 4. 3.
5.1.5 PCI Backplane Removal and Replacement To remove a PCI backplane, the entire PCI box must be removed from the system. Figure 5–6 PCI Backplane Removal Removal 1. If the operating system in control of the PCI containing the target FRU supports hot-swap I/O and taking the I/O resources away from it will allow it to continue to operate, follow operating system procedures to put the local I/O riser into the hot-swap state. See Section 4.3.1 and skip to step 4.
2. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip to step 4. 3. If the system is partitioned, from the console connected to the partition to which the PCI box containing the target FRU is attached, shut down the operating system and power it off using the SRM power off command. Put the OCP switch in the Secure position. Pull the plugs on the target PCI power supplies. 4. Remove the PCI power supplies.
5.1.6 PCI Fan Removal and Replacement The PCI fans are located in the power section of the PCI box. Figure 5–7 PCI Fan Removal Removal 1. If the operating system in control of the PCI containing the target FRU supports hot-swap I/O and taking the I/O resources away from it will allow it to continue to operate, follow operating system procedures to put the local I/O riser into the hot-swap state. See Section 4.3.1 and skip to step 4. 2.
3. If the system is partitioned, from the console connected to the partition to which the PCI box containing the target FRU is attached, shut down the operating system and power it off using the SRM power off command. Put the OCP switch in the Secure position. 4. Unplug the power supplies in the target PCI box. 5. Access to a particular fan depends upon whether the PCI box is at the top of a cabinet or underneath another PCI box.
5.1.7 DVD/CD-ROM Player Removal and Replacement The DVD/CD-ROM player is located in the front of any master PCI box. It is attached to a bracket that is removed from the PCI box when the DVD/CD-ROM player is replaced. Figure 5–8 DVD/CD-ROM Removal Removal 1.
it to continue to operate, follow operating system procedures to put the local I/O riser into the hot-swap state. See Section 4.3.1 and skip to step 4. 2. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip to step 4. 3.
5.1.8 SCSI (FIS) Disk Removal and Replacement The SCSI disk is located above the standard I/O interface module in a master PCI box.
Removal 1. If the operating system in control of the PCI containing the target FRU supports hot-swap I/O and taking the I/O resources away from it will allow it to continue to operate, follow operating system procedures to put the local I/O riser into the hot-swap state. See Section 4.3.1 and skip to step 4. 2. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip to step 4. 3.
5.1.9 Standard I/O Cable Interface Removal and Replacement The standard I/O cable interface module is located under the SCSI disk in the top right front corner of a master PCI box.
Removal 1. If the operating system in control of the PCI containing the target FRU supports hot-swap I/O and taking the I/O resources away from it will allow it to continue to operate, follow operating system procedures to put the local I/O riser into the hot-swap state. See Section 4.3.1 and skip to step 4. 2. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip to step 4. 3.
5.1.10 PCI Power Supply Removal and Replacement The PCI power supply is located in the front of PCI boxes in either the power cabinet or expander cabinets.
Removal 1. Open the front door of the power cabinet or expander cabinet depending upon where the target power supply is located. 2. Identify the broken power supply by noticing which of the two has its Power OK LED off. 3. Unplug the power supply. 4. Wait for the Vaux OK LED to go off and the Swap OK LED to come on. 5. Loosen the four captive fasteners holding the faceplate of the power supply to the box. 6. Grasp the power supply handle and firmly pull it from the box. Replacement 1.
5.1.11 Standard I/O Battery Removal and Replacement The time of year clock battery has a theoretical life of 10 years.
WARNING: Danger of explosion if battery is installed incorrectly. Replace only with the same or equivalent type recommended by the manufacturer. Dispose of used batteries according to the manufacturer’s instructions. Removal 1. Remove the standard I/O module. See Section 5.1.1. 2. Slip the battery from its holder. Notice the battery’s polarity. Replacement When you replace the battery, be sure to put it back with the correct polarity. Reverse the steps outlined in the removal for the standard I/O module.
5.2 Operator Control Panel Removal and Replacement The OCP is contained in a plastic shroud at the top of the front door. There are two designs: one that attaches to the door using Tinnerman nuts, the other using screws. The AC must be off during the removal and replacement procedure.
Removal 1. Shut down the operating system(s). 2. Open the rear door. 3. Trip the main circuit breaker on the AC input box(es). 4. Open the front door. 5. Working at the back of the open door, disconnect the power cable to the back of the OCP. 6. Unplug the signal cable(s) at the back of the OCP. 7. If the OCP assembly is connected to the door using screws, go to step 11. 8.
5.3 Terminal Server Removal and Replacement The terminal server is located just above the AC input boxes in the power cabinet and is connected to the SMC and the local terminal port on each standard I/O module in the system.
Removal Conceivably the system could be running and doing useful work while the terminal server needs replacing. Essentially what is lost is console control of the system. Assuming this is the case, there is no need to shut down operating systems. 1. Open the front door of the power cabinet. 2. Unplug the power cord connected to the back of the terminal server. 3. Unplug the signal cable connecting the terminal server to the SMC PC. 4. Open the rear door of the power cabinet. 5.
5.4 48V Power Supply Removal and Replacement Under certain conditions 48V power supplies may be hot swapped.
Removal 1. Open the front door of the power cabinet. 2. Locate the power supply that needs to be replaced: 1. Use the color codes to associate a power subrack with the QBB with the power problem. At least one of the power supplies on this subrack should be replaced. 2. If the system box has redundant power, the associated subrack will have three power supplies. The power supply with its 48V LED off is the one to replace. (The failed supply may have both the 48V LED and the Vaux LED off.) 3.
5.5 Power Subrack Removal and Replacement AC must be removed from the power subrack for it to be removed.
Removal 1. Remove the QBBs in the affected system box from use, by shutting down the instance of the operating system and using the SRM power off command. 2. Open the front and rear doors of the power cabinet. 3. At the back: if the entire system had to be brought down, trip the main circuit breaker on the AC input box powering the subrack; otherwise, trip the three circuit breakers controlling the lines to the subrack.
5.6 AC Input Box Removal and Replacement The AC input box must be unplugged in order for it to be removed.
If an AC input box failed, QBBs in one of the system cabinets are not operating. If your system is a partitioned GS320, some of the system may remain running during this repair. Removal 1. If the system is partitioned such that you can continue to run partitions in the system cabinet not affected by the target AC input box, continue to let them run. Otherwise, shut down the operating system and turn off the machine. 2. Open the front and rear doors of the power cabinet. 3.
Chapter 6 GS80 Component Removal and Replacement This chapter describes the removal and replacement procedures for components in the GS80 rack cabinet except for PCI box and storage components. See Chapter 5 for PCI box components.
6.1 Drawer Modules The GS80 system uses the same modules as the GS160/320 systems with the exception of the global port module and the I/O riser. The functions of the global port are built into the backplane on the GS80, and the design of the I/O riser is modified so that it fits in the GS80 drawer.
Figure 6–1 shows the location and color codes of modules that plug into the GS80 backplane. There is no global port module, since the functions are designed into the backplane. The I/O riser consists of a module that plugs into the backplane and a transition card that plugs into the riser. The transition card is used to bring the I/O signals to the drawer’s bulkhead where the I/O hoses are attached. Table 6–1 lists modules in the drawer and their associated color codes.
6.1.1 Accessing a Single or Top Drawer in a GS80 System Accessing a single drawer, or the top drawer in a two-drawer GS80 configuration, is relatively simple. Under most circumstances, the system or drawer must have its 48V power removed.
Under most circumstances, when replacing FRUs in any drawer, the drawer must have its 48V power off. Only two FRUs in the GS80 drawer can be hot swapped: the CPU and a local I/O riser if the operating system supports hot-swap and they are in the top or single drawer. The remaining FRUs are cold-swap or removed when AC is not present. (Cold-swap is defined as a state where Vaux and AC are present but 48V and logic voltages are not. See Section 4.3.
6.1.2 Accessing a Bottom Drawer in a GS80 System The drawers must be separated to access the bottom drawer in a two-drawer GS80 configuration. When separated, the top drawer and the distribution board channel are pushed back into the cabinet while the bottom drawer remains extended out of the front of the cabinet.
Before you open the drawers to remove FRUs in a bottom drawer, follow instructions regarding the operating and power state associated with the removal and replacement of the particular FRU. It is possible, in a nonstandard configuration, that two drawers are configured as two totally independent systems with no distribution board. If this is the case, treat the two drawers as single drawers. 1. Open the front and rear doors. 2.
6.2 Memory, Directory, Main Power, or Auxiliary Power Module Removal and Replacement Each of these modules is a cold-swap module in GS80 systems. DC power must be removed from the drawer.
Module Removal 1. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip step 2. 2. If the system is partitioned, from the console connected to the partition with the target FRU, shut down the operating system and power it off using the SRM power off command. Note, only hard partitions can be powered off. If soft partitions are used, both partitions must be shut down and the hard partition powered off. 3.
6.3 CPU Removal and Replacement Only CPUs in a single or top drawer can be hot-swapped.
Module Removal 1. If the operating system supports hot-swap CPU and the CPU in question is in the top drawer, enter the appropriate OS command to put the target CPU in the hot-swap state. See Section 4.3.1. Skip steps 2 and 3. 2. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip step 3. 3.
6.4 Power System Manager Removal and Replacement The PSM is a special removal and replacement case because its firmware may need to be updated.
Module Removal 1. If the system is not partitioned, shut down the operating system and put the OCP switch in the Off position. Skip step 2. 2. If the system is partitioned, from the console connected to the partition with the target FRU, shut down the operating system and power it off using the SRM power off command. Note, only hard partitions can be powered off. If soft partitions are used, both partitions must be shut down and the hard partition powered off. 3.
6.5 Clock Splitter Module Removal and Replacement Located next to the local I/O riser module, the clock splitter provides identical copies of the clock to synchronize transactions.
Module Removal 1. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip step 2. 2. If the system is partitioned, from the console connected to the partition with the target FRU, shut down the operating system and power it off using the SRM power off command. Note, only hard partitions can be powered off. If soft partitions are used, both partitions must be shut down and the hard partition powered off. 3.
6.6 I/O Riser Removal and Replacement Only the local I/O riser in a single or top drawer can be hot-swapped and then only when the operating system supports hot-swap I/O.
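The hot-swap conditions stated in Sections 6.1.1, 6.3, and 6.6 can be summarized as a simple check. The following is a minimal sketch only; the names `Fru`, `can_hot_swap`, and the string values used for FRU kinds and drawer positions are illustrative assumptions and do not correspond to any console or operating system command.

```python
from dataclasses import dataclass

# Illustrative labels only; these strings are assumptions, not manual terms.
HOT_SWAPPABLE_FRUS = {"cpu", "local_io_riser"}
HOT_SWAP_DRAWERS = {"single", "top"}


@dataclass(frozen=True)
class Fru:
    kind: str    # e.g. "cpu", "local_io_riser", "memory"
    drawer: str  # "single", "top", or "bottom"


def can_hot_swap(fru: Fru, os_supports_hot_swap: bool) -> bool:
    """Return True only if every condition in the text holds:
    the FRU is a CPU or local I/O riser, it sits in a single or
    top drawer, and the operating system supports hot-swap."""
    return (
        fru.kind in HOT_SWAPPABLE_FRUS
        and fru.drawer in HOT_SWAP_DRAWERS
        and os_supports_hot_swap
    )
```

Any FRU for which this check is false falls back to the cold-swap or AC-off procedures described in the individual sections.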
Module Removal 1. If the operating system supports hot-swap I/O, enter the OS command that puts the target local I/O module in the hot-swap state. See Section 4.3.1. Skip to step 4. 2. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip step 3. 3.
6.7 I/O Transition Module Removal and Replacement The local I/O riser transition module in a drawer can be hot-swapped only if the local I/O riser can be hot-swapped.
Removal 1. If the operating system supports hot-swap I/O, enter the OS command that puts the target local I/O module in the hot-swap state. See Section 4.3.1. Skip to step 4. 2. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. Skip to step 4. 3.
6.8 Drawer Backplane Removal and Replacement The AC must be removed from the drawer when the system backplane is swapped.

Figure 6–10 Drawer Backplane Removal

Removal 1. If the system is not partitioned, shut down the operating system and issue the SRM power off command. Put the OCP switch in the Off position. 2. If the system is partitioned, from the console connected to the partition with the target FRU, shut down the operating system and power it off using SRM power off.
only hard partitions can be powered off. Pull the 48V power supplies from the subrack powering the drawer with the FRU. (This removes both 48V and Vaux from the drawer.) Skip to step 4. 3. Trip the circuit breaker on the AC input box powering the drawer(s). 4. Access the drawer with the faulty backplane. See Section 6.1.1 or 6.1.2. 5. Remove all modules from the drawer. See Sections 6.2 through 6.6. 6. Disconnect the CSB ribbon cable. 7.
6.9 Dual-Output Clock Removal and Replacement The system is off when a dual-output clock module is replaced.
Removal 1. Shut down the operating system(s). 2. Put the OCP switch in the Off position. 3. Follow the procedure in Section 6.1.1 to access the top drawer. 4. Remove CPU3 and memory 2 so that you have room to access the clock. If you need more room, remove more modules. See Sections 6.2 and 6.3. 5. Remove the clock module cover plate in the rear left corner of the drawer compartment by removing the two Phillips head screws holding it in place and lifting it out of the drawer. 6.
6.10 Distribution Board Removal and Replacement The distribution board is located in the bottom of the distribution board channel.

Figure 6–12 Distribution Board Removal

When replacing the distribution board, it is not necessary to open the QBB drawer compartment of either the top or bottom drawer.
Removal 1. If the system is partitioned, it is not necessary to power down anything. (In a GS80 system so partitioned, there is no traffic across the distribution board.) Skip to step 3. Note, only hard partitions can be powered off. If soft partitions are used, they must be shut down and the hard partition powered off. 2.
6.11 Console Serial Bus Removal and Replacement The AC power to the drawer with the target CSB module must be off when replacing the console serial bus module.
Removal 1. If the system is not partitioned, shut down the operating system, issue the SRM power off command, put the OCP switch in the Off position, and trip the main circuit breaker on the AC input box(es) at the rear of the system. Skip to step 4. 2. If the system is partitioned, from the console connected to the partition with the target FRU, shut down the operating system and power it off using SRM power off. 3. Pull the 48V power supplies from the subrack powering the drawer with the FRU.
6.12 Drawer Blower Removal and Replacement The drawer must be powered off to replace the blower.
Removal 1. If the system is not partitioned, shut down the operating system and put the OCP switch in the Off position. Skip step 2. 2. If the system is partitioned and the blower in one of the drawers needs replacing, neither an operating system nor the SRM should be running. Power off the partition using the SCM power off -par x command. Put the OCP switch in the Secure position. Open the back door of the cabinet. Note, only hard partitions can be powered off.
6.13 Operator Control Panel Removal and Replacement The OCP is contained in a plastic shroud at the top of the front door. There are two designs: one that attaches to the door using Tinnerman nuts, the other using screws. The AC must be off during the removal and replacement procedure.
Removal 1. Shut down the operating system(s). 2. Open the rear door. 3. Trip the main circuit breaker on the AC input box(es). 4. Open the front door. 5. Working at the back of the open door, disconnect the power cable to the back of the OCP. 6. Unplug the signal cable(s) at the back of the OCP. 7. If the OCP assembly is connected to the door using screws, go to step 11. 8.
6.14 Power Supply Removal and Replacement Under certain conditions 48V power supplies may be hot swapped.
Removal 1. Open the front door. 2. Locate the power supply that needs to be replaced: 1. Associate a power subrack with the drawer with the power problem; the upper drawer is powered by the upper subrack, the lower drawer by the lower subrack. At least one of the power supplies on the identified subrack should be replaced. 2. If the drawer has redundant power, the associated subrack will have three power supplies. The power supply with its 48V LED off is the one to replace.
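The two identification steps above (match the drawer to its subrack, then look for the supply whose 48V LED is off) can be sketched as follows. This is only an illustration of the decision logic; the function names, the drawer and subrack labels, and the supply names such as "PS1" are assumptions made for the example, and the LED states would in practice be read by eye.

```python
from typing import Dict, List


def subrack_for_drawer(drawer: str) -> str:
    """Map a GS80 drawer to the subrack that powers it:
    upper drawer -> upper subrack, lower drawer -> lower subrack."""
    mapping = {"upper": "upper subrack", "lower": "lower subrack"}
    return mapping[drawer]


def supplies_to_replace(led_48v_on: Dict[str, bool]) -> List[str]:
    """Given the observed 48V LED state of each supply in the subrack,
    return the supplies whose 48V LED is off."""
    return [name for name, lit in led_48v_on.items() if not lit]
```

For example, a redundant subrack observed as `{"PS1": True, "PS2": False, "PS3": True}` would point to the supply labeled "PS2" in this sketch.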
6.15 Power Subrack Removal and Replacement Each power subrack powers a single drawer. AC must be removed from the subrack for it to be removed.
If a power subrack needs replacing, it is unlikely that an operating system is running in the drawer powered by it. In some cases it is possible to keep part of the system running, but we recommend that the entire system be brought down. Removal 1. Shut down the operating system and put the OCP switch in the Off position. 2. Open the front and rear doors of the cabinet. 3. At the back, trip the main circuit breaker on the AC input box(es). 4.
6.16 AC Input Box Removal and Replacement There are three variants of AC input boxes. Only one AC input box is required when the inlet voltage is high (200 – 240 V) and two are required when the voltage is low (120 V).
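The relationship between inlet voltage and the number of AC input boxes can be written as a small lookup. This is a sketch under stated assumptions: the function name is hypothetical, the voltage is taken as the nominal inlet voltage, and any value outside the two cases named in the text is rejected rather than guessed.

```python
def ac_input_boxes_required(nominal_inlet_volts: int) -> int:
    """Return the number of AC input boxes for a nominal inlet voltage.

    High inlet voltage (200 to 240 V) needs one box; low inlet
    voltage (120 V) needs two. Other values are not covered by
    the text, so they raise an error.
    """
    if 200 <= nominal_inlet_volts <= 240:
        return 1
    if nominal_inlet_volts == 120:
        return 2
    raise ValueError(f"unsupported nominal inlet voltage: {nominal_inlet_volts} V")
```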
Removal 1. If the operating system is still running, shut it down. 2. Put the OCP switch into the Off position. 3. Open the front and rear doors of the cabinet. 4. Trip the main circuit breaker on the target AC input box. 5. Unplug the main power cord from the utility power. 6. Unplug the power cords leading to power subracks, PCIs, and storage devices. Note the location of all power cords.
Appendix A Power Distribution Rules This appendix shows power distribution and cabling for the GS160/320 power cabinet and expander cabinet.
A.1 GS160/320 Power Cabinet Configuration and Cabling Cabling the GS160/320 is complex due to the large variety of options and the need to phase balance the AC input boxes to avoid nuisance circuit breaker tripping.
Figure A–1 shows the options available for use by each base configuration. Space at the top of the power cabinet is available for two options: either an optional PCI box and an optional BA356 storage device, or two optional storage devices. Each base configuration requires two AC input boxes and a master PCI box. The remaining space is used for the power subracks.
Figure A–2 GS160/320 Power Cabinet Components
Figure A–2 shows the GS160/320 components that make up the power system. For each component, the figure shows the “J” name for a cable connector. Use Figure A–2 and Figure A–3 to determine where any given cable is connected.
Figure A–3 GS160/320 Power Cabinet Cabling
Figure A–3 shows the required and optional cables in the power cabinet. Redundant cables are marked with an asterisk (*). The AC input for the GS160/320 is three phases. To avoid nuisance tripping of circuit breakers, follow the cabling diagram in Figure A–3. The physical connector locations are identified in Figure A–2.
A.2 Expander Cabinet Configuration and Cabling The power cabling in expander cabinets is described in this section.
Figure A–4 shows possible BA356 storage configurations in expander cabinets available with GS80/160/320 systems.
Figure A–5 Expander Cabinet Cable Connector Locations
Figure A–5 shows a diagram of the PCI box and AC input boxes used in expander cabinets. For each, the figure shows the “J” name for a cable connector. Use Figure A–6 and Figure A–7 to determine where any given cable in an expander cabinet is connected.
Figure A–6 Expander Cabinet H9A20-AA Variant Cabling
Figure A–6 shows the power cord connections for 120V NEMA cords used in North America. Note that this power cord is also used in the GS80. Use Figure A–5 and Figure A–6 to determine where any given cable in an expander cabinet is connected.
Figure A–7 Expander Cabinet H9A20-AB, -AC Variants Cabling
Figure A–7 shows the power cable connections for expander cabinets used in North America, Japan, and Europe. Use Figure A–7 and Figure A–5 to determine where any given power cable in such an expander cabinet is connected.
A.3 GS80 Power Cabling Cabling the GS80 can be confusing due to the sheer number of cords.

Figure A–8 GS80 Power Cabling –CA Cabinet
Figure A–8 shows the power cable connections for the GS80 –CA cabinet used in North America. Use Figure A–8 and Figure A–5 to determine where any given power cable in such a cabinet is connected.
Figure A–9 GS80 Power Cabling –CB, –CC Cabinet
Figure A–9 shows the power cable connections for the GS80 –CB, –CC cabinet used in Japan and Europe. Use Figure A–9 and Figure A–5 to determine where any given power cable in such a cabinet is connected.
Appendix B Cache Coherency Maintaining the coherency of the CPU caches, the memory space, and the I/O space is important in complex, hierarchical systems like the AlphaServer GS80/160/320 systems. This section describes how cache coherency is maintained.
B.1 Terminology Table B–1 shows the definitions of terms related to cache coherency.
B.2 Cache States The Alpha CPU chip supports five cache states and two sets of commands that affect them. The AlphaServer GS series uses both command sets and four of the five cache states. It is the AlphaServer GS series cache coherency scheme that is described in this section. The AlphaServer GS series cache states are described in Table B–2.

Table B–2 AlphaServer GS Series Cache States

Cache State   Description
Clean         The cache location holds a copy of a memory block.
B.3 Cache Commands
Two sets of commands are used to modify cache state:
• Memory space commands
• System probe commands
B.3.1 Memory Space Commands Table B–3 shows the CPU commands that change the CPU’s cache state when issued to the system. The commands are assigned a “class” which has a common effect on the cache state. It is the command class name that is associated with each cache state change represented by the arrows in Figure B–1.
Table B–3 Memory Space Commands

Command       Class   Description
RdBlk         Rd      Read a block of memory data into cache.
RdBlkMod      RdM     Read a block of memory data into cache for the purpose of modification (writing).
Fetch         -       Read a block of memory data - do not cache it.
RdBlkVic      Rd      Read a block of memory data into cache that will replace a valid (clean or dirty) block of data.
RdBlkModVic   RdM     Read a block of memory data into cache for the purpose of modification (writing).
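The command-to-class relationship in Table B–3 is a plain lookup, which can be sketched as a dictionary. Only the rows shown above are included, and the names `COMMAND_CLASS` and `command_class` are hypothetical. Fetch has no class in the table, which is represented here as `None`.

```python
from typing import Dict, Optional

# Rows taken from Table B-3 as shown above; the table may list more commands.
COMMAND_CLASS: Dict[str, Optional[str]] = {
    "RdBlk": "Rd",
    "RdBlkMod": "RdM",
    "Fetch": None,        # shown with "-" in the Class column
    "RdBlkVic": "Rd",
    "RdBlkModVic": "RdM",
}


def command_class(command: str) -> Optional[str]:
    """Return the class name for a memory space command, or None
    if the command has no class. Unknown commands raise KeyError."""
    return COMMAND_CLASS[command]
```

Grouping by class reflects the point made in the text: it is the class name, not the individual command, that labels each arrow in the state transition diagram.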
B.3.2 System Probe Commands The second set of commands that affect cache coherency are the “system probe” commands. These are commands that are issued from the system to the CPU requesting data and/or Tag status updates. Probe commands are the result of a CPU command affecting the cache of another CPU.
B.4 Cache State Transition Diagram Figure B–1 shows how both memory space commands and system probe commands cause cache block state to change. Circles in the diagram represent the state of a given cache block. Arrows represent the commands that cause a cache block to change state.

Figure B–1 Cache State Transition Diagram (state labels: Invalid, Clean, Clean/Shared, Dirty; transition labels: Rd, RdM or ItoD, CtoD, StoD, FRd, FRdM or Inval, Evict, Inval)

B.
Table B– 5 Memory Command and Cache State Interaction CPU Memory Cmnd Cache Block State RdBlk Invalid Clean Dirty Dirty-Shared Invalid Clean Dirty Dirty-Shared Invalid Clean Dirty Dirty-Shared Invalid Clean Dirty Dirty-Shared Invalid Clean Dirty Dirty-Shared Invalid Clean Dirty Dirty-Shared Any State Any State Invalid Clean Dirty Dirty-Shared Invalid Clean Dirty Dirty-Shared Invalid Clean Dirty Dirty-Shared Invalid Clean Dirty Dirty-Shared Invalid Clean Dirty Dirty-Shared RdBlkMod Fetch RdBlkVic RdBl
B.6 Virtual Channels When mapping processor request activity onto a switch-based distributed shared memory system, it is necessary to create switch packets to support processor commands, command responses, and probes. The GS80/160/320 distributed shared memory systems operate by passing message packets between QBBs. A variety of message types are used to support the wide variety of system operations.
B.7 Virtual Channels and Coherency Flow The virtual channels are useful in explaining how transactions flow through the system while maintaining cache coherency. B.7.
3. Upon reaching its home QBB, a memory space transaction arbitrates for access to a home directory bank and a home memory bank. When the transaction is granted access to the directory and memory, it accesses both the cache state and the data stored in the block’s memory location. When the cache state is accessed, it is combined with the transaction’s command type to: • determine, and atomically update, the next coherency state. • generate a response packet to the requesting processor.
B.7.
Figure B–3 shows the progress of an I/O space read or Programmed IO Read (PIO Rd) transaction through a system. The steps a transaction may take are as follows: 1. All PIO Rd transactions are issued by a source (or “requesting”) processor. The source processor in this case must be a CPU, not an IOP. 2. All PIO Rd transactions are sent to the home QBB of the IOP to which they are targeted in the QIO virtual channel.
B.7.3 I/O Space Writes

Figure B–4 I/O Space Write Transaction Flow Diagram

Figure B–4 shows the progress of an I/O space write or programmed IO write (PIO Wr) transaction through a system. The steps that a transaction may take in this progression are outlined below. 1.
B.8 Virtual Channel Ordering Rules To support cache coherency, virtual channels obey a number of ordering rules. These rules are enforced:
• To support “Sparse Vector” directories (i.e. 1 directory bit/QBB vs. 1 directory/processor)
• To enable system support of Memory Barriers.
• To minimize permutations of in flight transactions.
Q1 Full Ordering At each QBB, the main arbiter in the QSA, the QS Arb, orders all Q0 transactions to the QBB’s home memory space.
5. When ordered lists of Q1 packets from multiple HS input ports target multiple common HS output ports, the Q1 packets must appear at the output ports in a manner consistent with a single, common ordering of all incoming Q1 packets. Each output port may transmit some or all of the packets in the common ordered list. Q0 Read and Victim Ordering The system enforces ordering restrictions on reads and victims from the same processor to the same memory block.
Victim and Q1 Ordering To properly implement memory barriers, the cache coherency protocol requires that victim packets “push” Q1 packets from the H-switch arbitration point to the output of the victim’s home QS Arb. 8. The H-switch orders all incoming Q1 packets and victim packets for each of its output ports.
B.9 Coherency Data Storage Cache coherency information is stored in the following locations in the system:
• The CPU’s primary tag storage (PTag)
• The IOP tag storage
• The duplicate tag storage on the QBB (DTag)
• The transaction tracking table in the global ports (TTT)
B.9.1 CPU Primary Tag Store (PTag) There is one PTag store in each CPU processor in the system. Each PTag store has one entry per EV6 cache location.
B.9.2 IOP Tag Store Each IOP in the system implements two fully associative data caches: one “write” cache and one “read” cache. As such, each IOP also implements a fully associative tag store. Table B–7 describes information stored in the IOP tag store.

Table B–7 IOP Tag Cache Coherency Storage

IOP Tag Field   Field Description
Valid           When set, indicates that a memory block is cached at the associated cache location.
Dirty           Indicates that the cached block is writeable.
B.9.3 QBB Duplicate Tag Store (DTag) There is one DTag store in each QBB in the system. Each DTag store has one entry for each potential CPU cache location in a QBB. In other words, the DTag has enough storage to map four CPU module caches.

Table B–8 DTag Cache Coherency Storage

DTag Field   State     Description of Field
Status       Invalid   No valid data is cached at the associated cache location.
             Clean     Unmodified data is cached at the associated cache location.
B.9.4 QBB Directory There is one directory store in each QBB in the system. Each directory holds one entry for each main memory block in its QBB. For example, a 32-Gbyte memory system consisting of 64-byte blocks would require 512M directory entries (32 Gbytes divided by 64 bytes per block).

Table B–9 Directory Cache Coherency Storage

Directory Field   Field Description
Owner-ID          This is a 6-bit encoded field. It identifies which of 32 processors, 8 IOPs or single memory bank holds the most up-to-date copy of a memory block.
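The sizing figures in this section can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and "Gbyte" is assumed to mean 2^30 bytes. It computes the number of directory entries for the 32-Gbyte example and confirms that a 6-bit Owner-ID field is wide enough to distinguish 32 processors, 8 IOPs, and one memory bank.

```python
GBYTE = 2**30  # assumption: binary gigabyte
MEGA = 2**20


def directory_entries(memory_bytes: int, block_bytes: int) -> int:
    """One directory entry per main memory block."""
    return memory_bytes // block_bytes


def min_id_bits(distinct_owners: int) -> int:
    """Smallest number of bits that can encode this many distinct owners."""
    return max(1, (distinct_owners - 1).bit_length())


entries = directory_entries(32 * GBYTE, 64)
print(entries, entries // MEGA)   # 536870912 entries, i.e. 512M

owners = 32 + 8 + 1               # processors + IOPs + memory bank
print(owners, min_id_bits(owners))  # 41 owners need 6 bits
```

Because 41 possible owners exceed what 5 bits can encode (32 values), 6 bits is the smallest field width that fits, which matches the Owner-ID width given in Table B–9.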
B.9.6 Access to Coherency State Figure B–5 shows how the various coherency stores are connected in a QBB. The PTag is omitted from the diagram because it is connected to, and used exclusively by, the CPU processor. As can be seen in Figure B–5, the QSA implements two interfaces to the cache coherency stores. The information in Table B–10 describes the two interfaces.
Table B– 10 QSA Interface to the Cache Coherency Storage Interface Description GPLink - The global port link (GPLink) is the primary, clock-forwarded link between the quad switch address ASIC (QSA) and the global port address ASIC (GPA). This path is used to transmit Q0, Q0Vic, QIO, Q1 and Q2 packets, bound for remote QBBs, to the local GPA. The GPLink can transfer one address packet every two clock cycles.
B.10 Coherency Storage and Coherency Flow The following sections describe how various transactions use the system coherency storage elements as they progress through the system. B.10.1 Local Read Transactions

Figure B–6 Local Read Coherency Store Flow
Figure labels: Source/Home QBB. Q0: Visit Directory, DTag, TTT and IOP Tag store via ArbBus. Q1: Visit TTT via ArbBus. Q2: Fill steered directly to requesting processor.
Local Read, ReadVic, and Fetch commands use the system coherency storage elements as illustrated in Figure B–6 and described by the following sequence of events. 1. Each Read-type command is first issued to the ArbBus of the home QBB by means of the QS Arb. It visits: • The DTag to determine if the addressed block is dirty in the home QBB. • The directory to determine if the addressed block is dirty in another QBB. • The IOP tag store to determine if the addressed block is dirty in the home IOP.
B.10.2 Local Read Modify Transactions

Figure B–7 Local Read/Modify Transaction
Figure labels: Source/Home QBB. Q0: Visit Directory, DTag, TTT and IOP Tag store via ArbBus. Q2: Visit TTT via ArbBus. Q2: Fill steered directly to requesting processor. Q1. Q2 Response Packet. Dirty Data. QBB of Dirty Processor. Q1: Directory steers Fwd Rd directly to Dirty Processor. Visit TTT, DTag and IOP Tag store via ArbBus. QBB of Shared Processor. Q1: Visit DTag, TTT and IOP Tag store via ArbBus.
Local Read Mod and Read Vic Mod commands use the system coherency storage elements as shown in Figure B–7 and described by the following events: 1. Each Read Mod-type command is first issued to the ArbBus of the home QBB by means of the quad switch arbiter. It visits: • The Dtag to determine both if the addressed block is dirty in the home QBB and if any of the CPUs in the home QBB have copies of the block.
B.10.3 Local Change-to-Dirty, Inval-to-Dirty and Full Block Transactions

Figure B–8 Local Change-to-Change Coherency Store Flow
Figure labels: Source/Home QBB. Q0: Visit Directory, DTag, TTT and IOP Tag store via ArbBus. Q1: Visit TTT via ArbBus. Q1 Response Packet. QBB of Shared Processor. Q1: Visit DTag, TTT and IOP Tag store via ArbBus.
Local Change-to-Dirty, Shared-to-Dirty, STCChange-to-Dirty, Inval-to-Dirty and Full Block Write commands use the system coherency storage elements as shown in Figure B–8 and described by the following sequence of events. 1. Each Change-to-Dirty-type command is first issued to the ArbBus of the home QBB by means of the QS Arb. It visits: • The DTag to determine both if the Change-to-Dirty will succeed or fail, and if any of the CPUs in the home QBB have copies of the addressed block.
B.10.4 Global (Remote) Read Transactions

Figure B–9 Global (Remote) Read Transaction Storage Flow
Figure labels: Source QBB. Q0: Visit TTT via GPLink. Q0. Q0Vic. Q1: Visit TTT, and DTag or IOP Tag store, via ArbBus. Q2: Visit DTag, and TTT or IOP Tag store via ArbBus. Q1. Q2 Response Packet. Dirty Data. Home QBB. QBB of Dirty Processor. Q0: Visit Directory, DTag, TTT and IOP Tag store via ArbBus. Q1: Directory steers Fwd Rd directly to Dirty Processor.
Remote Read, ReadVic and Fetch commands use the system coherency storage elements as illustrated in Figure B–9 and described by the following sequence of events. 1. Each Read-type command first visits the TTT via the GPLink, for the purpose of creating a TTT MAF entry. 2. Each Read-type command is then issued to the ArbBus of the home QBB by means of the QS Arb. It visits: • The DTag to determine if the addressed block is dirty in the home QBB.
B.10.5 Global (Remote) Read Modify Transactions
Figure B–10 Read Mod Coherency Store Flow
Figure labels: Source QBB; Home QBB; QBB of Dirty Processor; Q0; Q0Vic; Q1; Q2; Response Packet; Dirty Data.
Q0: Visit TTT via GPLink.
Q1: Visit TTT, Dtag and IOP Tag store, via ArbBus.
Q2: Visit Dtag, TTT and IOP Tag store via ArbBus.
Q1: Directory steers Fwd Rd directly to Dirty Processor. Visit TTT, Dtag and IOP Tag Store via ArbBus.
Q0: Visit Directory, Dtag, TTT and IOP Tag store via ArbBus.
3. The Fwd Rd Mod probe packets resulting from each Read-type command are then issued to the ArbBus of the QBB of the dirty processor. The Fwd Rd Mod is sent directly to the dirty processor and visits:
• The DTag to determine if any of the CPUs in the dirty processor’s QBB have copies of the addressed block.
• The IOP tag store to determine if the IOP in the dirty processor’s QBB has a copy (clean or dirty) of the addressed block.
B.10.6 Global (Remote) Change-to-Dirty Transactions
Figure B–11 Change-to-Dirty Coherency Store Flow
Figure labels: Source QBB; Home QBB; QBB of Shared Processor; Q0; Q0Vic; Q1; Response Packet.
Q0: Visit Dtag via ArbBus. Visit TTT via GPLink.
Q1: Visit TTT, Dtag and IOP Tag store, via ArbBus.
Q0: Visit Directory, Dtag, TTT and IOP Tag store via ArbBus.
Q1: Visit Dtag, TTT and IOP Tag store via ArbBus.
3. Each Change-to-Dirty-type command is then issued to the ArbBus of the home QBB by means of the QS Arb. It visits:
• The DTag to determine if any of the CPUs in the home QBB have copies of the addressed block.
• The directory to determine if the Change-to-Dirty will succeed or fail, and to determine if any other QBBs have copies of the addressed block.
• The IOP tag store to determine if the home IOP has a copy (clean or dirty) of the addressed block.
B.10.7 Global (Remote) Inval-to-Dirty and Full Block Write Transactions
Figure B–12 Inval-to-Dirty, Full Block Write Coherency Store Flows
Figure labels: Source QBB; Home QBB; QBB of Shared Processor; Q0; Q0Vic; Q1; Response Packet.
Q0: Visit TTT via GPLink.
Q1: Visit TTT, Dtag and IOP Tag store, via ArbBus.
Q0: Visit Directory, Dtag, TTT and IOP Tag store via ArbBus.
Q1: Visit Dtag, TTT and IOP Tag store via ArbBus.
2. Each Inval-to-Dirty and Full Block Write command is then issued to the ArbBus of the home QBB by means of the QS Arb. It visits:
• The DTag to determine if any of the CPUs in the home QBB have copies of the addressed block.
• The directory to determine if any other QBBs have copies of the addressed block.
• The IOP tag store to determine if the home IOP has a copy (clean or dirty) of the addressed block.
• The TTT to determine if the addressed block is in a transient state.
3.
Appendix C Power-Up Diagnostic Error Table This appendix contains a table that lists SROM and XSROM tests and all possible errors and associated number designations. For each test error, possible FRUs are identified and a brief description of the failure is given. The FRU(s) identified represent a best guess at what is broken and may not actually be the failing piece of hardware.
Error Number Table Description
Error # column: Contains the error number that is printed as part of the failure report of the SROM/XSROM test (“Error: xxxx”).
FRU column: Lists the possible FRU(s).
Component(s) column: Lists the possible failing components associated with the FRU(s) called out in the FRU column. For example, if a callout were FRU1: QBBx.
Failure Description column: Gives any detail that helps the user decode what the Error # means and why the test failed.
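As a rough illustration of how the Error # column might be used, the sketch below pulls the error number out of a failure-report line and splits it into hex digits. It assumes only that the report contains the text “Error: xxxx” with xxxx in hexadecimal, as described above. The sample report line and the function names are hypothetical and are not taken from actual SROM/XSROM output.

```python
import re

# Assumption: the failure report contains "Error: xxxx", where xxxx is the
# 16-bit error number in hexadecimal (bits <15:0> in Table C-1).
ERROR_PATTERN = re.compile(r"Error:\s*([0-9A-Fa-f]{1,4})\b")


def parse_error_number(report_line):
    """Return the error number found in a report line, or None if absent."""
    match = ERROR_PATTERN.search(report_line)
    return int(match.group(1), 16) if match else None


def hex_digits(error_number):
    """Split a 16-bit error number into four hex digits, bits <15:12> first."""
    if not 0 <= error_number <= 0xFFFF:
        raise ValueError("error number must fit in bits <15:0>")
    return tuple((error_number >> shift) & 0xF for shift in (12, 8, 4, 0))


if __name__ == "__main__":
    # Hypothetical report line, for illustration only.
    number = parse_error_number("Test 15 failed  Error: 0211")
    print(f"{number:04X}", hex_digits(number))
```

Splitting the number into digits is useful because several tests later in Table C–1 define their error numbers digit by digit or field by field.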
Table C–1 Test Number / Error Number Test # hex ERROR # <15:0> 1 Alpha CPU chip BIST check test 0001 FRU(s) QBBx.CPUy Component(s) EV Failure Description BIST failed for I-Cache and/or D-Cache Parameters error 0001 P1: P2: P3: P4: 2 exp: P2 xor EXP_DATA [where EXP_DATA = I_CTL w/ bit 23 CLEAR] rcvd: EV6 I_CTL Read data addr: IPR Number of I_CTL 0 Alpha CPU chip D-cache test 0001 QBBx.CPUy EV Test Setup 0002 QBBx.
Test # hex ERROR # <15:0> 4 B-cache data line test 0001 FRU(s) QBBx.CPUy Component(s) Failure Description EV, Bcache Unexpected error write data pttrn to Bcache. 0002 Unexpected error verifying data pttrn written 0003 B-Cache data RAM failure B-Cache data RAM failure in a check bit 0004 Check bit n failure 0<=n<=^xf CB0n Parameters error numbers 1,2,3,4 Test runs to completion unless an unexpected error occurs. P1 mask has the following information: P1 = aabbccdd.
Test # hex ERROR # <15:0> FRU(s) 5 B-cache march test Component(s) Failure Description 0001 QBBx.CPUy EV, Bcache Test setup 0002 QBBx.
Test # hex ERROR # <15:0> 6 B-cache address test 0001 FRU(s) QBBx.
Test # hex ERROR # <15:0> 8 B-cache ECC data line test 0001 FRU(s) QBBx.CPUy Component(s) Failure Description EV, Bcache DC_STAT error bits not clear before starting test. Bcache ECC data problems 0002 DC_STAT error bits set after reading back ECC patterns. Bcache ECC data problems. Parameters all error numbers: Note: For Error numbers 2, all the following parameters apply. For Error number 1, only P1<31:0>, P3<31:0> and P4 are valid. P1: DC_STAT read data in <31:0>.
Test # hex ERROR # <15:0> A B-cache data line and C-box read chain verify test 0001 FRU(s) QBBx.CPUy Component(s) EV, Bcache Failure Description Latched Tag address did not match test address Note: This test is not testing TAG RAMs on the CPU module since a certain EV6 hook is not available. TAG DATA RAMs will be fully tested in XSROM test 54. This test will simply test basic Bcache functionality.
Test # hex ERROR # <15:0> 10 Local QSD_WHAMI (QSD Who Am I register) 0001 FRU(s) Component(s) QBBx.
Test # hex ERROR # <15:0> 12 Local QSA_SCRATCH (QSA scratch register) test 0001 FRU(s) QBBx.
Test # hex ERROR # <15:0> FRU(s) Component(s) 14 Local non-device interrupt test x = 1 for CE interrupt testing, 001x QBBx.
Test # hex ERROR # <15:0> 15 Local I/O device interrupt test 0011 FRU(s) QBBx.CPUy QBBx Component(s) Failure Description NO IRQ interrupt posted to CPU QSDz (z=0 - 3) 0021 Incorrect IRQ int posted to CPU 0101 Invalid CPU ID in error summary CSR 0201 Incorrect IO_DEV_INT_NUM in CPUx_DEV_INT 0111 No IRQ int to CPU. Invalid CPU ID in error CSR 0211 No IRQ int to CPU. Incorrect IO_DEV_INT 0121 Incorrect IRQ to CPU. Invalid CPU ID in error CSR 0221 Incorrect IRQ to CPU.
Test # hex ERROR # <15:0> 17 Local interprocessor interrupt test 0021 FRU(s) QBBx.CPUy QBBx Component(s) Failure Description Incorrect IRQ int posted to the CPU QSDz (z=0 - 3) 0101 Invalid CPU ID in CPUx_IP_INT 0201 CPUx_IP_INT not SET 1001 QSD error summary CSR is correct.
Test # hex ERROR # <15:0> 1a Local IOP data path (IOD_SCRATCH) test 0001 FRU(s) QBBx Component(s) Failure Description IOD0,1 QSD0.
Test # hex ERROR # <15:0> 1b Local Hose 0 Config (1b), Local Hose 1 Config (1c), Local Hose 2 Config (1d), Local Hose 3 Config (1e) 1c 1d 1e (7 subtests) FRU(s) Component(s) Failure Description 0001 QBBx IOA Data pattern read/write error 0002 QBBx IODz (z=0,1) Data pattern read/write error 0003 QBB#.IORy. CBLz PBPx.RIOy QBB#.IORy. CBLz 0004 QBB#.IORy. (y=0,1) IOP_HOSE x is present but not initialized MLK X=0..f; y=0,1; z=0.
Test# hex ERROR # <15:0> FRU(s) Component Failure Description
Parameters
P1: Expected pattern (written to XXX_SCRATCH CSR or ignore this field when checking IO*_ERR_SUM and IOP CSR)
P2: Received pattern (from XXX_SCRATCH CSR or received data from IO*_ERR_SUM and IOP CSR)
P3: Failing Address (of XXX_SCRATCH CSR or IO*_ERR_SUM or other IOP CSR)
P4: IO Map information passed for SCM needs
1f Placeholder
Test# hex ERROR # <15:0> 20 Local GPA scratch register test 0001 FRU(s) Component(s) System box: QBBx,GP QBBX QBBx.CPUy Failure Description Read/Write AA pattern failure to gpa_scratch GPA,GPD0 QSA,QSD0 Drawer: QBBx QBBx.
Test # hex ERROR # <15:0> 21 Local GPD scratch register test 0001 FRU(s) Component(s) Failure Description System box: QBBx,GP QBBX QBBx.CPUy Read/Write AA pattern failure to gpd_scratch GPA,GPDz (z=0 - 3) Drawer: QSA,QSDx (x=0 - 3) QBBx QBBx.
Test # hex ERROR # <15:0> 22 Local Gp-link > HS-link loopback test (QBBx is the local QBB in these callouts) 0011 FRU(s) System box: QBBx,GP QBBX.GP.CBL QBBx.CPUy HSW0 Drawer: QBBx QBBx.CBLE SCBL QBBx.
Test # hex ERROR # <15:0> FRU(s) Component(s) Failure Description 22 continued FF00 System box: HSW0 QBBx.GP.CBL QBBX is LOCAL QBB Scratch testing passed but parity errors detected on GP and HS CSRs QBBx is LOCAL QBB X=1..6 Scratch Test failed. Parity errors detected on GP CSRs. QBBx is LOCAL QBB X=1..6 Scratch Test failed. Parity errors detected on HS CSRs QBBx is LOCAL QBB X=1..6 Scratch Test failed. Parity errors detected on HS and GP CSRs. GPA, GPD0 X=1..6 Scratch Test failed.
Test# hex ERROR # <15:0> FRU(s) Component Failure Description
22 continued
Parameters for error numbers xx1x and xx2x
P1: Expected Data (wrt GPA/GPD_SCRATCH)
P2: Received Data (rd GPA/GPD_SCRATCH)
P3: Failing Address (of GPA/GPD_SCRATCH CSR)
P4: Source Soft QBB ID in bits <2:0>
Parameters for error numbers Fxxx
P1: GPA_HSL_ERR_SUM CSR read data
P2: GPD_HSL_ERR_SUM CSR read data
P3: HS_CSR0 CSR read data
P4: HS_CSR1 CSR read data
Parameters for error number 0099
P1: QSA_QBB_POP_1 CSR read data
P2:
Test # hex ERROR # <15:0> 23 Local GP performance monitor test 0001 FRU(s) QBBx.GP QBBx Component(s) PERFMON PERFMON Failure Description Default/Reset value is incorrect in REG0>REG10 0002 Default/Reset value is incorrect in Page 0..15 Counter 0..7 and Page 0..
Test# hex ERROR # <15:0> FRU(s) 24 Local IOP error testing Component(s) Failure Description m = Interrupt type. (m =1 for CE; m =2 for UCE; m =3 for SE) 0001 QBBq.
Test# hex ERROR# <15:0> FRU(s) Component(s) Failure Description 25, 26, 27, 28 Local MEM0 scratch/BIST/error testing Local MEM1 scratch/BIST/error testing Local MEM2 scratch/BIST/error testing Local MEM3 scratch/BIST/error testing (3 subtests in each test) Note: Only ONE test per MEMx (25,26,27,28 = MEM0,MEM1,MEM2,MEM3) SUBTEST 1 (MEM_SCRATCH CSR Pattern testing) 0001 QBBx.MEMx (x=0,1,2,3 based on which test is running) QBBx.
Test# hex ERROR # <15:0> FRU(s) Component Failure Description SUBTEST 3 (MEM Error Line test) 25, 26, 27, 28 (3 subtests in each test) contin ued QBBx QBBx.CPUy QSDy QBBx IOAy, QSDy, path QBBx.
Test# hex ERROR # <15:0> 25, 26, 27, 28 SUBTEST 1: Parameters for error numbers 1,2,3,4,5,6 (3 subtests in each test) contin ued FRU(s) Component Failure Description P1: Exp: Data written to MEM_SCRATCH (MEM_SCRATCH) P2: Rcvd: Data read back from MEM_SCRATCH after write bits <1:0> P3: Addr: Failing Address P4: CPU# (running this test) in SUBTEST 2: Parameters for error numbers xxxF P1: Mask of failing DIMMx (0..7) on MEMx (x=0,1,2,3) under test <07:00> = MEM0 DIMM0..
Test# hex ERROR# <15:0> 29 Local DTAG scratch and BIST check test (2 subtests) Subtest 1: DTAGx Scratch CSR (DTAG_ERR_ADDR_0) testing F001 FRU(s) QBBx Component(s) DTGx (x=0 - 3) for non-MCM backplane Failure Description Write/Read AA’s to DTAG_ERR_ADDR_0 failed DTG0-3 or DTG4-7 for MCM backplane F002 Write/Read 55s to DTAG_ERR_ADDR_0 fail F003 Write/Read FFs to DTAG_ERR_ADDR_0 fail F004 Write/Read 00s to DTAG_ERR_ADDR_0 fail F005 Float 1s through DTAG_ERR_ADDR_0 failed F006 Float 0s th
Test# hex ERROR# <15:0> 2a Local directory scratch and BIST check test (2 subtests) Subtest 1: DIR Scratch CSR (DIR_EDC_SUB_ADDR_B) testing F001 FRU(s) QBBx.
Test# hex ERROR# <15:0> 2b Local IOP BIST check test 0001 FRU(s) Component(s) Failure Description QBBx IOD0 Write Cache BIST failure (slice 0) QBBx IOD1 0002 Write Cache BIST failure (slice 1) 0003 Read Cache BIST failure (slice 0) 0004 Read Cache BIST failure (slice 1) Parameters for all error numbers P1: IOD_ERR_SUM CSR read results P3: not used 2c P2: IOD_ERR_SUM CSR address P4: not used Local QSA error line test Error# = LMNX X = 1 CE testing X = 2 UCE testing N = 1 No IRQ<0>(CE) or
Test# hex ERROR# <15:0> FRU(s) 2d Local hose error testing Component Failure Description Error #<3:0> = Error Type Error #<7:4> = MiniLink ID =m ( m =2 =Near End MLK; =1 =Far End MLK; =0 =PCA MLK) Error #<11:8> = Hose ID = h (where: 0<= h <=3 ) Error #<15:12> = QBB ID =q (where: 0<= q <=7 ) QBBq.IORx MLK qhm1 *_ERR_SUM bit y is NOT set when writing or PBP.RIO MLK *_DIAG_FORCE_ERR bit y (* = NE/FE/PCA) or PBP.
Test# hex ERROR # <15:0> FRU(s) Component Failure Description
2d continued
Parameters for error # = qhm1 (** = NE or FE or PCA)
P1: Expected Data (written to **_DIAG_FORCE_ERR_SUM CSR)
P2: Received Data (read from **_ERR_SUM CSR)
P3: Failing Address (of **_ERR_SUM CSR)
P4<63:48> may contain 4th error #, P4<47:32> may contain 3rd error #, P4<31:16> may contain 2nd error #, P4<15:0> contains 1st error #
Parameters for error # = qhmn; (1 < n < ^xC)
P1: Sender info: The data written into IOP_ERR_I
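The bit-field layout given for test 2d (QBB ID, hose ID, MiniLink ID, and error type) and the packing of up to four error numbers into P4 can be sketched as follows. This is a minimal illustration built only from the field positions listed in the table. The function and dictionary key names are hypothetical, and the sketch makes no claim about which error-type values the firmware actually reports.

```python
# Field layout for test 2d error numbers, as listed in the table:
#   <15:12> QBB ID (0..7), <11:8> hose ID (0..3),
#   <7:4> MiniLink ID, <3:0> error type.
MINILINK_NAMES = {0: "PCA MLK", 1: "Far End MLK", 2: "Near End MLK"}


def decode_hose_error(error_number):
    """Decode a test 2d error number into its q/h/m/type fields."""
    if not 0 <= error_number <= 0xFFFF:
        raise ValueError("error number must fit in bits <15:0>")
    minilink_id = (error_number >> 4) & 0xF
    return {
        "qbb_id": (error_number >> 12) & 0xF,
        "hose_id": (error_number >> 8) & 0xF,
        "minilink_id": minilink_id,
        # IDs other than 0, 1, 2 are not described in the table.
        "minilink": MINILINK_NAMES.get(minilink_id, "not listed"),
        "error_type": error_number & 0xF,
    }


def unpack_p4(p4):
    """Return the four 16-bit fields of P4, 1st error number first."""
    if not 0 <= p4 <= 0xFFFFFFFFFFFFFFFF:
        raise ValueError("P4 must fit in bits <63:0>")
    return [(p4 >> shift) & 0xFFFF for shift in (0, 16, 32, 48)]
```

For example, under this layout an error number of 3121 would read as QBB 3, hose 1, near end MiniLink, error type 1. The table says only that the upper P4 fields "may contain" further error numbers, so how an unused field is marked is left open here.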
Test# hex ERROR # <15:0> FRU(s) 2e Local GP error line test Component Failure Description Error# = LMNX X = 1 CE testing X = 2 UCE testing M=0 M=1 M=2 M=4 M=8 N = 1 No IRQ<0>(CE) or IRQ<4> (UCE) interrupts posted to the CPU N = 2 Incorrect IRQ interrupts posted to the CPU QSD Error Summary CSR (CPUx_CE/UCE_SUM) was correct Invalid CPU ID in CPUx_CE_SUM or CPUx_UCE_SUM No QBB_NUM bits set in CPUx_CE_SUM or CPUx_UCE_SUM Incorrect QBB_NUM set in CPUx_CE_SUM or CPUx_UCE_SUM More than one QBB_NUM bit se
Test # hex ERROR # <15:0> FRU(s) 30 Local directory error line test Component Failure Description Error# = LMNX X = 1 CE testing X = 2 UCE testing N = 1 No IRQ<0>(CE) or IRQ<4> (UCE) interrupts posted to the CPU N = 2 Incorrect IRQ interrupts posted to the CPU M=0 M=1 M=2 M=4 M=8 QSD Error Summary CSR (CPUx_CE/UCE_SUM) was correct Invalid CPU ID in CPUx_CE_SUM or CPUx_UCE_SUM No QBB_NUM bits set in CPUx_CE_SUM or CPUx_UCE_SUM Incorrect QBB_NUM set in CPUx_CE_SUM or CPUx_UCE_SUM More than one QBB_NUM
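The LMNX error numbers used by the error line tests (2e and 30 above) can be decoded digit by digit. The sketch below is a hedged reading of the flattened table: it assumes L is the most significant hex digit and X the least significant, and it pairs the M values 0, 1, 2, 4, 8 with the five descriptions in the order they appear. The meaning of L is not given in this excerpt, so it is returned undecoded. All names are hypothetical.

```python
X_MEANING = {1: "CE testing", 2: "UCE testing"}

N_MEANING = {
    1: "No IRQ<0> (CE) or IRQ<4> (UCE) interrupts posted to the CPU",
    2: "Incorrect IRQ interrupts posted to the CPU",
}

# Assumption: values and descriptions are paired by their order in the table.
M_MEANING = {
    0: "QSD Error Summary CSR (CPUx_CE/UCE_SUM) was correct",
    1: "Invalid CPU ID in CPUx_CE_SUM or CPUx_UCE_SUM",
    2: "No QBB_NUM bits set in CPUx_CE_SUM or CPUx_UCE_SUM",
    4: "Incorrect QBB_NUM set in CPUx_CE_SUM or CPUx_UCE_SUM",
    8: "More than one QBB_NUM bit set",
}


def decode_lmnx(error_number):
    """Decode an LMNX error number; unlisted digit values map to 'not listed'."""
    if not 0 <= error_number <= 0xFFFF:
        raise ValueError("error number must fit in bits <15:0>")
    l, m, n, x = ((error_number >> shift) & 0xF for shift in (12, 8, 4, 0))
    return {
        "L": l,  # meaning not given in this excerpt
        "M": M_MEANING.get(m, "not listed"),
        "N": N_MEANING.get(n, "not listed"),
        "X": X_MEANING.get(x, "not listed"),
    }
```

Under these assumptions, error 0412 would read as UCE testing, no interrupt posted, with an incorrect QBB_NUM in the summary CSR. The table itself remains the authority when the two disagree.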
Test# hex ERROR # <15:0> FRU(s) Component 31 Local QSD error line (FAULT) test Failure Description Error# = LMNX X<0> = 1 always to signify SE testing (no CE or UCE testing here) X = 3 IOP_QBB_ERR_SUM=0 meaning a FAULT was not reported X = 5 IOP_QBB_ERR_SUM is incorrect (not as expected) X = 9 IOP_QBB_ERR_SUM bit NOT set! N = 2 Incorrect IRQ interrupts posted to the CPU M=0 M=1 M=2 M=4 M=8 QSD Error Summary CSR (CPUx_SE_SUM) was correct Inval
Test# hex ERROR # <15:0> FRU(s) Component 32 Local DTAG error line (FAULT) test Failure Description Error# = LMNX X<0> = 1 always to signify SE testing (no CE or UCE testing here) X = 3 IOP_QBB_ERR_SUM=0 meaning a FAULT was not reported X = 5 IOP_QBB_ERR_SUM is incorrect (not as expected) X = 9 IOP_QBB_ERR_SUM bit NOT set! N = 2 Incorrect IRQ interrupts posted to the CPU M=0 M=1 M=2 M=4 M=8 QSD Error Summary CSR (CPUx_SE_SUM) was correct Inva
Test# hex ERROR # <15:0> FRU(s) Component 34 Local QBB soft QBB ID configuration test (12 subtests) 0001 QBBx Failure Description Write/Read of QSA_QBB_ID failed (step 2 of Local Soft QBB ID config process) 0002 Write/Read of IOD_CONFIG failed (step 3 of Local Soft QBB ID config process) 0003 Invalid Sub-test number received from PSM. PSM->XSROM interaction problem (bad PSM packet). 0099 QSA_PORT_MAP0..
Test # hex ERROR # <15:0> 35 Remote GPA scratch register test (QBBx is target QBB) 0001 FRU(s) System box: QBBx,GP QBBx.GP.CBL QBBX Component Read/Write AA pattern failure to gpa_scratch GPA,GPD0 QSA, QSDx HSW0 / QBBx.GP.CBL QBBx.GP QBBx GPDy HSW0 QSA,QSDx Drawer: QBBx QBBx.CBL GP,QSA,QSDx SCBL / QBBx.
Test# hex ERROR # <15:0> 35 continued FF00 FRU(s) System box: HSW0 QBBx.GP.CBL Component Failure Description QBBx is REMOTE QBB Scratch testing passed but parity errors detected on GP and HS CSRs QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on GP CSRs. QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on HS CSRs QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on HS and GP CSRs. Drawer: SCBL QBBx.CBL F3#x System box: QBBx.GP.
Test# hex ERROR # <15:0> 35 contin ued #0#x FRU(s) Component X=1..6 Scratch Test failed. No parity errors detected. System box: QBBx,GP QBBx.GP.CBL QBBX HSW0 Failure Description GPA,GPD0 QSA, QSDx Or QBBx.GP.CBL QBBx.GP QBBx, HSW0 GPDy QSA,QSDx Drawer: QBBx QBBx.CBL SCBL GP,QSA,QSDx Or QBBx.
Test# hex ERROR # <15:0> 36 Remote GPD scratch register test 0001 FRU(s) Component(s) System box: QBBx,GP QBBX QBBx.CPUy GPA,GPDz QSA,QSDx Drawer: QBBx QBBx.
Test# hex ERROR # <15:0> FRU(s) Component Failure Description 36 continued F3#x System box: QBBx.GP.CBL HSW0 QBBx.GP QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on GP CSRs. QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on HS CSRs QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on HS and GP CSRs. Drawer: QBBx.CBL SCBL, QBBx FC#x System box: HSW0 QBBx.GP.CBL Drawer: SCBL QBBx.CBL FF#x System box: HSW0 QBBx.GP.
Test # hex ERROR # <15:0> FRU(s) Component 37 Remote QBB soft QBB ID configuration test (13 subtests) 0001 QBBx (tested) Failure Description Write/Read of QSA_QBB_ID failed (step 2 of config process) 0002 Write/Read of QSA_PORT_MAP failed (step 5 of remote config process) 0003 Write/Read of IOD_CONFIG failed (step 7 of remote config process) 0004 0005 7777 Invalid Sub-test number received from PSM HSW0/SCBL QSA_QBB_POP_1 bit is NOT set FRU not determined GPA CSR (8-bits) Read h
Test# hex ERROR # <15:0>
38 See Test 23 (Local GP PerfMon CSR Access). Same error numbers reported for remote.
39 See Test 19 (Local IOA Scratch Access). Same error numbers reported for remote version.
3a See Test 1a (Local IOD Scratch access). Same error numbers reported for remote version.
3b (7 subtests) See Test 1b (Local IO Hose 0 Configuration and Path Verification test). Same error numbers reported for this remote version.
Test# hex ERROR # <15:0> FRU(s) Component Failure Description 3f continued F3#x System box: QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on GP CSRs. QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on HS CSRs QBBx.GP.CBL HSW0 QBBx.GP Drawer: QBBx.CBL SCBL QBBx FC#x System box: HSW0 QBBx.GP.CBL Drawer: SCBL QBBx.CBL FF#x System box: HSW0 QBBx.GP.CBL QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on HS and GP CSRs.
Test# hex ERROR # <15:0> 40 See Test 13 (Local QSD Scratch access). Same error numbers reported for remote version. F300 FRU(s) System box: QBBx.GP.CBL HSW0 QBBx.GP Component Failure Description QBBx is REMOTE QBB Scratch testing passed but parity errors detected on GP CSRs QBBx is REMOTE QBB Scratch testing passed but parity errors detected on HS CSRs QBBx is REMOTE QBB Scratch testing passed but parity errors detected on GP and HS CSRs QBBx is REMOTE QBB X=1..6 Scratch Test failed.
Test# hex ERROR # <15:0> FRU(s) Component Failure Description
41 See Test 24 (Local IOP Error). Same error numbers reported for this remote version.
42 See Test 25 (Local MEM0 Scratch/BIST/Error test). Same error numbers reported for this remote version.
43 See Test 26 (Local MEM1 Scratch/BIST/Error). Same error numbers reported for this remote version.
44 See Test 27 (Local MEM2 Scratch/BIST/Error). Same error numbers reported for this remote version.
Test# hex ERROR # <15:0> FRU(s) Component Failure Description
42 – 45 continued
FC#x System box: HSW0 QBBx.GP.CBL QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on HS CSRs Drawer: SCBL QBBx.CBL
FF#x System box: HSW0 QBBx.GP.CBL QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on HS and GP CSRs. Drawer: SCBL QBBx.CBL
46 See Test 29 (Local DTAG Scratch/BIST test). Same error numbers reported for remote.
F300 System box: QBBx.GP.CBL HSW0 QBBx.
Test# hex ERROR # <15:0> FRU(s) Component Failure Description
46 continued
FC#x System box: HSW0 QBBx.GP.CBL QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on HS CSRs Drawer: SCBL QBBx.CBL
FF#x System box: HSW0 QBBx.GP.CBL QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on HS and GP CSRs. Drawer: SCBL QBBx.CBL
47 See Test 2a (Local DIR Scratch/BIST test). Same error numbers reported for this remote.
F300 System box: QBBx.GP.CBL HSW0 QBBx.
Test# hex ERROR # <15:0> FRU(s) Component Failure Description
47 continued
FC#x System box: HSW0 QBBx.GP.CBL QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on HS CSRs Drawer: SCBL QBBx.CBL
FF#x System box: HSW0 QBBx.GP.CBL QBBx is REMOTE QBB X=1..6 Scratch Test failed. Parity errors detected on HS and GP CSRs. Drawer: SCBL QBBx.CBL
48 See Test 2b (Local IOP BIST check test). Same error numbers reported for this remote.
49 See Test 2c (Local QSA Error line).
Test# hex ERROR # <15:0> FRU(s) 4f See Test 32 (Local DTAG Error line Fault). Same error numbers reported for this Remote version. 50 Placeholder 51 Placeholder 52 Memory and directory configuration test x = Hard QBB_ID y = Physical MEM Port ID z = DIR/MEM DIMM ID n = MEM Array ID Component where where where where Failure Description 0<=x<=7 0<=y<=3 0<=z<=7 0<= n<=1 NOTE: ‘CFG’ means MIS-Configuration problem. CFG is NOT a component. x001 QBBx.
Test# hex 52 continued ERROR # <15:0> XynE FRU(s) Component QBBx.MEMy ARRn XynF QBBx.MEMy ARRn 7777 FRU not determined Failure Description QBB x MEM y Array n No rank is enabled CFG QBB x MEM y Array N size is reduced to its corresponding DIR DIMM size GPA CSR (8-bits) Read had bits other than <7:0> set! Possibly bad GP cable. Parameters P1: P2: P3: P4: 1st Error # - see Table above to decode. (The FRU callout is associated to FRU1.) 2nd Error # - see Table above to decode.
Test# hex ERROR # <15:0> FRU(s) Component 56 Low memory mailbox access test Failure Description NO Error/FRU callouts for this CPU/MEM test. (supported in Powerup mode ONLY) 57 Memory thrashing test NO Error/FRU callouts for this CPU/MEM test. (supported in Powerup mode ONLY) 58 Console flash ROM checksum and unload test 0001 PBP.PCIx FLSH0 No STDIO was found in the earlier Local IO Config Test, so no valid STDIO info was passed down from the PSM.
Test# hex ERROR # <15:0> FRU(s) Component Failure Description
5a-5c are placeholders for future tests if necessary.
5d CPU Hot-swap cache victimization/jump-to-console test. CPU Hot-swap support Test. Secondary Cache victim/jump to console. No errors reported.
5e-5f are placeholders for future tests if necessary.
Appendix D Firmware Updates
This appendix covers the following topics:
• System Firmware That May Require Updates
• Preparations for Firmware Updates
• Firmware Updates
• Dealing with a COM1 Port Jam
D.1 System Firmware That May Require Updates
The following firmware will likely need updates: code for each microprocessor on the console serial bus and XSROM code on PSMs.
Table D–1 Firmware Update Files and What They Update
File Name     What is updated
SCMROM.HEX    The SCM firmware located on the standard I/O module.
PSMROM.HEX    The PSM firmware on PSM modules.
HPMROM.HEX    The HPM firmware on the HPM module.
PBMROM.HEX    The PBM firmware on PCI backplanes.
WF_XSROM.
Table D–1 lists the AlphaServer GS80/160/320 firmware update files. When an update is run, each file is written into a flash ROM on the piece of hardware targeted by the update command. The most recent files are found on the latest AlphaServer firmware CD or can be obtained from the following Web site:
http://ftp.digital.com/pub/digital/Alpha/firmware/
For a full description of LFU, see Appendix B of the Compaq AlphaServer GS80/160/320 Firmware Reference Manual.
D.2 Preparations for Firmware Updates
On any given system, some preparations may be needed before system firmware updates can be performed.
D.2.1 Partitions
LFU cannot update a partitioned system. Operating systems must be shut down and, if the system is hardware partitioned, an SCM command must be issued to remove the partitions. LFU must be run from the master SCM.
Example D–1 Removing Partitions
SCM_E0> show nvram
. . .
LFU must be run in a nonpartitioned environment for the following reasons:
1. LFU cannot communicate directly over the CSB and relies on the SCM to transfer files to the microprocessors on the CSB.
2. LFU transfers files to SCMs, both master and slaves, through PCI space to shared RAM. If the system remains partitioned, PCI space is known only on a per-partition basis.
If a system is partitioned, it must be reconfigured to run LFU. Example D–1 shows the necessary preparations.
D.2.2 Hardware Connections
Use of the SCM update command requires a physical connection to the master SCM. If the system management console is used, you need not connect a laptop; you can execute the update procedures from the console.
Figure D–1 shows the connection made between a laptop and the local terminal port on the standard I/O module in the master PCI box. The CSB master SCM resides on this module. Use two nine-pin-to-MMJ connectors (H8571-J), one for the COM1 port and the other for the laptop, and connect the two using a DECconnect office cable.
NOTE: If you are using the system management console (SMC) to make firmware updates, you need not connect your laptop since the SMC is already connected to the master PCI box.
D.2.3 Laptop Operating System Preparation
When the SCM update command is used, firmware update files are downloaded from some source into the master SCM module. Certain COM1 port settings are required.
Example D–2 COM1 Port Settings for Windows NT 4.0
1. From Start go to Settings and select Control Panel.
2. From Control Panel select Ports.
3. From Ports select COM1 and Settings.
4.
Example D–3 COM1 Port Settings for Windows 2000
1. From Start go to Settings and select Control Panel.
2. From Control Panel select System.
3. From System select the Hardware tab.
4. From the Hardware tab select Device Manager.
5. Expand Ports and select Communications Port (COM1).
6. At the Communications Port (COM1) Properties, set:
   Bits per second: 57600
   Data Bits: 8
   Parity: None
   Stop Bits: 1
   Flow Control: Xon/Xoff
   then select Advanced.
7. At the Advanced Settings for COM1, deselect Use FIFO buffers.
8.
D.2.4 Terminal Emulator Settings
When the SCM update command is used, firmware update files are downloaded from a host PC COM1 port to the master SCM local port on the standard I/O module. Certain terminal emulator settings are required.
Example D–5 KEAterm V5.1 Session for PC or Laptop COM1 Port
1. From Start go to Programs and select KEA!VT and then KEA!
2. At the Session Template select Serial – click Next>.
3. At Connection Type select Serial – click Next>.
4.
Example D–6 PowerTerm 525 Settings
1. From Start go to Programs and select PowerTerm.
2. At Connect set:
   Session type: COM
   Terminal type: VT420-8
   Baud Rate: 9600
   Stop Bits: 1
   Port Number: 1
   Parity: 8/none
   Flow Control: Xon/Xoff
   – click Connect
3.
D.3 Firmware Updates
Two firmware update modes are available on GS80/160/320 systems: one using LFU and the other using the SCM update command. Note that if a microprocessor’s firmware is corrupt and it is in fail-safe loader mode, the SCM update command must be used to load healthy firmware.
D.3.1 Using LFU
LFU is a standard, fairly automatic, method used to update firmware. Currently (August, 2000), LFU must be run from the master SCM in systems that are not hardware partitioned.
***** Loadable Firmware Update Utility *****
--------------------------------------------------------------------
Function    Description
--------------------------------------------------------------------
Display     Displays the system’s configuration table.
Exit        Done exit LFU (reset).
List        Lists the device, revision, firmware name, and update revision.
Readme      Lists important release information.
Update      Replaces current firmware with loadable data image.
Verify      Compares loadable and hardware images.
Example D–7 Running LFU (Continued)
UPD> list
Device    Current Revision     Filename    Update Revision
SRM       V5.7-3525            srm_fw      V5.7-3533
micro     V5.2(03.06/01:09)    micro_fw    V5.4(03.24/01:21)
UPD> update micro
Confirm update on: micro [Y/(N)]y
WARNING: updates may take several minutes to complete for each device.
DO NOT ABORT!
micro Updating to V5.4(03.24/01:21)
Updating SCM nodes E0,E1
Update Cmd processed
Transferring hex file...........Flash ON........Flash ON........Flash ON........ Flash ON....
The LFU list command shows the current revision and update revision of the SCM firmware. Note that LFU does not list each firmware file associated with each micro that is listed in Table D–1. Instead, it lumps them into one file, micro_fw. When using LFU, all microprocessor firmware is updated each time LFU is run. The LFU update command is issued. Confirmation of the update is required. SCM update and micro update begin. SCMs are found at nodes E0 and E1.
Example D–7 Running LFU (Continued)
Updating XSROM node 30,31,32,33
Update Cmd processed
Transferring hex file..................
~I~ Flashing node 30 (please wait)
~I~ Flashing node 31 (please wait)
~I~ Flashing node 32 (please wait)
~I~ Flashing node 33 (please wait)
CSB download of .Hex file complete
Updating HPM node 40
Update Cmd procesed
Transferring hex file..................
CSB download of .Hex file complete
Updating PSM node 30,31,32,33
Update Cmd processed
Transferring hex file............
Once the SCM completes the map of the CSB, LFU provides the updated XSROM code to the SCM through shared RAM. The SCM sends the code to the PSM in each QBB. After the PSM receives the code, it then blasts it into the XSROM flash ROM.
LFU provides code for the HPM to the SCM. The SCM downloads this code over the CSB to the HPM in the H-switch.
LFU then provides code for the PSMs to the SCM. The SCM downloads this code over the CSB to RAM space in each PSM.
D.3.2 Using the SCM Update Command
The SCM update command may be used from the master SCM to update specific firmware in the system. The firmware file must be downloaded to the master SCM local terminal port. If a microprocessor is in FSL mode because its firmware is corrupt, this command must be used.
Example D–8 Using the Update Command
SCM_E0> power off –par 0
SCM_E0> sho csb
CSB Type Firmware Revision FSL Revision Power State
10 PBM T04.6 (11.03/01:09) T4.2 (09.08) ON
11 PBM T04.6 (11.
Conditions of Note When Using This Update Method
A master and a slave SCM may each be updated using this method, but a master SCM cannot update a slave. To update either a master or a slave, the device downloading the SCMROM.HEX file must be physically connected to the local port of the target standard I/O module. When updating the XSROM code, the entire system may be up and running operating systems.
Example D–8 Using the Update Command (Continued)
SCM_E0> update –csb 30
Initiate HEX file transfer from host (press ESC-ESC to abort):
Initiate transfer of the file PSMROM.HEX to the COM1 port
For KEAterm
   From the Tools menu go to File Transfer and select Send to Host
   Change the Files of type: to All Files (*.*)
   Browse for and select the file PSMROM.HEX. Click on OK.
For PowerTerm 525
   From the Communication menu select Send File…
   At Send File select the Ascii tab
   Browse for and select the file PSMROM.
Example D–8 shows a sample master SCM update of a PSM module in QBB0. The SCM update command is issued. Note that it is possible to update several PSMs at a time with the command: update –csb 30,31,32… Be sure that the terminal emulator is configured properly for the file transfer. See Section D.2.4. The PSMROM.HEX file is transferred to the COM1 port. The flash update completes.
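When several PSMs are to be updated at once, the comma-separated node list can be assembled mechanically. The helper below is a small sketch of that idea and is not part of any Compaq tool. It assumes that CSB node addresses are written as two hexadecimal digits (as in 30 or E0), that they fit in one byte, and that the option is typed with a plain ASCII hyphen. The function name is hypothetical.

```python
def build_update_command(nodes):
    """Build an SCM 'update' command line for one or more CSB nodes.

    nodes: CSB node addresses as integers, e.g. 0x30 for the PSM in QBB0.
    """
    nodes = list(nodes)
    if not nodes:
        raise ValueError("at least one CSB node is required")
    for node in nodes:
        # Assumption: a CSB node address fits in one byte.
        if not 0 <= node <= 0xFF:
            raise ValueError(f"CSB node out of range: {node!r}")
    # Assumption: plain ASCII hyphen, addresses as two uppercase hex digits.
    return "update -csb " + ",".join(f"{node:02X}" for node in nodes)


if __name__ == "__main__":
    print(build_update_command([0x30, 0x31, 0x32]))
```

The resulting string would still be typed or pasted at the SCM_E0> prompt by the operator; the sketch does not talk to the SCM.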
D.4 Dealing with a COM1 Port Jam
Occasionally, when COM1 is under software flow control, as it is when you use a terminal emulator to communicate through it, COM1 can jam. If this occurs, your emulator will have no control of the system or partition to which it is attached. You can clear the jam by clearing the communications link in the emulator and using the SCM clear port command to un-jam COM1.
In the event that your emulator appears hung, it is possible that the COM1 port is jammed. The procedure presented in Example D–9 will clear the jam. Of course, communications could have failed for some other reason that you will have to investigate if this procedure does not work. From the emulator’s perspective, it has received an XOFF. Setting the CLEAR COMM sets XON and the emulator will again transmit the characters you type. The escape sequence gets you to the SCM.
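The emulator side of this behavior can be pictured with a toy model of Xon/Xoff flow control. The sketch below is illustrative only: XON (0x11) and XOFF (0x13) are the standard ASCII control codes, but the class, its method names, and the choice to hold typed characters while paused and release them afterward are assumptions, not a description of KEAterm or PowerTerm internals.

```python
XON = 0x11   # ASCII DC1: resume transmission
XOFF = 0x13  # ASCII DC3: pause transmission


class EmulatorLink:
    """Toy model of the emulator side of a link using Xon/Xoff flow control."""

    def __init__(self):
        self.paused = False
        self.held = bytearray()

    def _resume(self):
        # Leave the paused state and hand back anything typed meanwhile.
        self.paused = False
        released = bytes(self.held)
        self.held.clear()
        return released

    def receive(self, byte):
        """Process one received byte; return any held bytes now released."""
        if byte == XOFF:
            self.paused = True
        elif byte == XON:
            return self._resume()
        return b""

    def type_chars(self, data):
        """Return the bytes actually transmitted for the typed characters."""
        if self.paused:
            self.held.extend(data)  # assumption: hold rather than discard
            return b""
        return bytes(data)

    def clear_comm(self):
        """Model the emulator's clear-communications action as a forced XON."""
        return self._resume()
```

In this model a stray XOFF leaves the link silent no matter what is typed, which matches the apparent hang, and clear_comm() restores transmission without the far end ever sending XON.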
Glossary

AC off state
One of the system power states in which all power is removed from the system. See also Hot-swap, Cold-swap, and Warm-swap states.

Clock splitter module
Module that provides the system with multiple copies of the system and I/O reference clocks.

Cold-swap state
One of the system power states in which AC power and Vaux are present in the system, but power is removed from the area being serviced. See also AC off, Hot-swap, and Warm-swap states.

Console serial bus
See CSB.
Hard partition
A partition consisting of one or more QBBs and sharing no resources with any other partition. Hard partitions are defined by using the SCM command language. See also Partition.

Hierarchical switch
See H-switch.

Hose
A logical PCI bus; or the cable between a QBB and a PCI box.

Hot-swap state
A state of the system that allows swapping of certain components while power is present in the system. See also Cold-swap, Warm-swap, and AC off states.
Memory directory module
See Directory module.

OCP
Operator control panel; used by the operator to control the system. It has a keyswitch, display screen, indicators, and buttons. The keyswitch is used to power the system up or down or to secure it from remote access. The screen displays messages during booting. Indicators show system state. The buttons reset or halt the system.
Power system manager
See PSM.

PSM
Power system manager; a module in each QBB that monitors CPUs, voltages, temperatures, and fan speeds in the QBB and then reports this information to the system control manager (SCM). The SCM can make requests of the power system manager.

QBB
Quad building block; the basic building block of the system.
SMC
System management console; a PC, software, and terminal server used to manage the system.

Soft partition
A collection of resources within a hard partition. Resources can be allocated among soft partitions. In contrast to hard partitions, a QBB can provide resources to more than one soft partition. Soft partitions are defined by using the SRM console. Also referred to as logical partitions.
Warm-swap state
One of the power states of the system in which power is removed from a specified QBB for service while other segments of the system remain fully powered. See also Hot-swap, Cold-swap, and AC off states.
Index A C AC input box GS160/320 (three phase), 1-97 GS80 (single phase), 1-107 removal and replacement (system box), 5-35 AC-off state defined, 4-10 getting into, 4-15 Addressing, 1-20 Auxiliary power module, 1-64 removal and replacement (GS160/320), 4-25 removal and replacement (GS80), 6-9 Cabling expander cabinet, A-8–A-15 GS80 cabinet, A-16–A-19 power cabinet, A-2–A-7 Cache coherency data storage, B-18–B-23 storage element use and flow, B-37 terminology, B-2 Cache state command interaction, B-7 comm
COM1 port unjam, D-23 COM2, 1-87 Compaq Analyze, 3-88–3-109 Console serial bus function, 1-22 module removal and replacement (distribution box housing), 4-45 module removal and replacement (GS80), 6-27 module removal and replacement (H-switch housing), 4-55 Console serial bus node ID module, 1-90 removal and replacement, 5-7 Control panel, 1-12, 2-2 CPU chip, 1-45 CPU module, 1-43 removal and replacement (GS160/320), 4-27 removal and replacement (GS80), 6-11 CSB.
module locations, 4-17 power subsystem, 1-94–1-103 system box, 1-7 GS80 backplane, 1-11 cabinet cabling, A-16–A-19 description, 1-9 module locations, 6-3 power subsystem, 1-104–1-111 H Halt LED, 1-13, 2-3 Halt pushbutton, 1-13, 2-3 Hierarchical switch function, 1-18 module, 1-76 removal and replacement, 4-47 Hot-swap state defined, 4-10 getting into, 4-12 HPM. See H-switch power manager module H-switch.
memory description, 1-46 PSM description, 1-49 short-circuit protection description, 1-69 standard I/O cable interface description, 1-92 standard I/O module description, 1-87 Local switch, 1-15 M Main power module, 1-62 removal and replacement (GS160/320), 4-25 removal and replacement (GS80), 6-9 Master clock module, 1-51 removal and replacement, 4-51 Master phase lock loop, 1-57 Memexer command (SRM), 3-58 Memory module, 1-46 removal and replacement (GS160/320), 4-25 removal and replacement (GS80), 6-9 Mi
removal and replacement, 5-23 PCI slots, 1-84 Power color codes, 4-19 troubleshooting, 3-2 Power cabinet cabling, A-2–A-7 Power distribution, 1-103 Power LED, 1-13, 2-3 Power modules auxiliary, 1-64 H-switch, 1-66 main, 1-63 Power off command (SRM), 3-31 Power subrack GS160/320, 1-101 GS80, 1-111 removal and replacement (GS160/320), 5-33 removal and replacement (GS80), 6-35 Power supplies GS160/320 48 VDC described, 1-99 GS80 48 VDC described, 1-109 PCI supply described, 1-113 Power supply removal and replac
SCP.
a hung system, 3-76–3-83 an operating system hang, 3-77 console, 3-5 CSB bus, 3-3 logic voltages, 3-4 OCP, 3-3 power, 3-2 using LEDs, 3-72–3-75 using the SRM console, 3-30–3-45 Vaux, 3-2 U Update COM1 settings for Windows 2000, D-9 COM1 settings for Windows 95, D-9 COM1 settings for Windows NT, D-8 connecting a laptop to the local terminal port, D-7 hardware and software preparations, D-4–D-11 KEAterm settings, D-10 partitions, D-5 PowerTerm settings, D-11 Update command (SCM), D-18–D-21 Update files, D-3