AlphaServer ES40 Service Guide Order Number: EK–ES240–SV. A01 This guide is intended for service providers and self-maintenance customers responsible for Compaq AlphaServer ES40 systems.
First Printing, July 1999 The information in this publication is subject to change without notice. COMPAQ COMPUTER CORPORATION SHALL NOT BE LIABLE FOR TECHNICAL OR EDITORIAL ERRORS OR OMISSIONS CONTAINED HEREIN, NOR FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES RESULTING FROM THE FURNISHING, PERFORMANCE, OR USE OF THIS MATERIAL.
Warning! This is a Class A product. In a domestic environment, this product may cause radio interference, in which case the user may be required to take appropriate measures. FCC Notice: This equipment generates, uses, and may emit radio frequency energy.
Contents Preface Chapter 1 1.1 1.2 1.3 1.4 1.5 1.5 1.6 1.7 1.8 1.9 1.10 1.10.1 1.10.2 1.11 1.12 1.13 1.14 1.15 1.16 System Overview System Architecture.............................................................................. 1-2 System Enclosures ................................................................................ 1-4 System Chassis—Front View/Top View................................................ 1-6 System Chassis—Rear View ............................................................
2.3.4 2.3.5 2.3.6 2.3.7 2.3.8 2.3.9 2.4 2.4.1 2.4.2 2.4.3 2.4.4 2.4.5 2.4.6 2.4.7 AlphaBIOS Menus ........................................................................ 2-10 Remote Management Console (RMC) ........................................... 2-10 Operating System Exercisers (DEC VET) .................................... 2-11 Crash Dumps ................................................................................ 2-11 Revision and Configuration Management Tool (RCM).................
4.7 4.8 4.9 4.10 4.11 4.12 4.13 4.14 4.15 4.16 4.17 4.18 4.19 4.20 4.21 4.22 exer...................................................................................................... 4-16 floppy_write......................................................................................... 4-21 grep ..................................................................................................... 4-22 hd ....................................................................................................
6.4.1 6.4.2 6.4.3 6.5 6.5.1 6.5.2 6.6 6.7 6.7.1 6.7.2 6.7.3 6.7.4 6.8 6.9 6.10 6.10.1 6.10.2 6.10.3 6.10.4 6.11 6.11.1 6.11.2 Setting the Date and Time............................................................ 6-21 Setting Up the Hard Disk............................................................. 6-22 Setting the Level of Memory Testing............................................ 6-23 Setting Automatic Booting..................................................................
Chapter 8 8.1 8.1.1 8.1.2 8.1.3 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 8.10 8.11 8.12 8.13 8.14 8.15 8.16 8.17 FRU Removal and Replacement FRUs ..................................................................................................... 8-2 Power Cords .................................................................................... 8-5 FRU Locations ................................................................................ 8-6 Important Information Before Replacing FRUs .......................
D.4 D.5 D.6 D.7 D.8 D.9 D.10 D.11 D.12 D.13 D.14 D.15 D.16 D.17 D.18 D.19 D.20 D.21 D.22 Cbox Read Register .............................................................................. D-8 Exception Address Register (EXC_ADDR) ........................................ D-10 Interrupt Enable and Current Processor Mode Register (IER_CM).. D-12 Interrupt Summary Register (ISUM) ................................................ D-14 PAL Base Register (PAL_BASE) ..................................................
4–3 4–4 4–5 4–6 4–7 4–8 4–9 4–10 4–11 4–12 4–13 4–14 4–15 4–16 4–17 4–18 4–19 4–20 4–21 4–22 4–23 4–24 5–1 6–1 6–2 6–3 6–4 6–5 7–1 7–2 7–3 7–4 7–5 7–6 7–7 7–8 7–9 7–10 clear_error ........................................................................................... 4-10 deposit and examine ........................................................................... 4-12 exer...................................................................................................... 4-16 floppy_write.........
Figures 1–1 1–2 1–3 1–4 1–5 1–6 1–7 1–8 1–9 1–10 1–11 1–12 1–13 1–14 1–15 1–16 1–17 3–1 3–2 5–1 5–2 5–3 5–4 5–5 5–6 5–7 5–8 5–9 5–10 5–11 5–12 5–13 5–14 6–1 6–2 6–3 6–4 6–5 6–6 xii System Block Diagram.......................................................................... 1-2 Compaq AlphaServer ES40 Systems .................................................... 1-4 Components Top/Front View (Pedestal/Rackmount Orientation) ........ 1-6 Rear Components (Pedestal/Rackmount Orientation).....................
6–7 6–8 6–9 6–10 6–11 6–12 6–13 6–14 6–15 6–16 7–1 7–2 7–3 7–4 8–1 8–2 8–3 8–4 8–5 8–6 8–7 8–8 8–9 8–10 8–11 8–12 8–13 8–14 8–15 8–16 8–17 8–18 8–19 8–20 8–21 8–22 8–23 B–1 B–2 B–3 B–4 AlphaBIOS Utilities Menu.................................................................. 6-29 Run Maintenance Program Dialog Box .............................................. 6-30 CPU Slot Locations (Pedestal/Rack) ................................................... 6-40 CPU Slot Locations (Tower).........................
Tables 1 1–1 2–1 2–2 2–3 2–4 2–5 3–1 3–2 3–3 3–4 4–1 4–2 4–3 5–1 5–2 5–3 6–1 6–2 7–1 7–2 7–3 8–1 8–2 A–1 B–1 B–2 B–3 B–4 C–1 D–1 D–2 D–3 D–4 D–5 D–6 D–7 D–8 D–9 D–10 xiv Compaq AlphaServer ES40 Documentation ......................................... xix Fan Descriptions ................................................................................. 1-27 Power Problems..................................................................................... 2-4 Problems Getting to Console Mode ...............
D–11 D–12 D–13 D–14 D–15 D–16 D–17 D–18 D–19 D–20 D–21 E–1 E–2 E–3 E–4 E–5 21272-CA Device Interrupt Request Register Fields......................... D-30 21272-CA Pchip Error Register Fields............................................... D-33 21272-CA Array Address Register (AAR) ............................................ D-35 DPR Locations A0:A9........................................................................ D-37 Nine Bytes Read from Power Supply ................................................
Preface Intended Audience This manual is for service providers and self-maintenance customers who are responsible for servicing Compaq AlphaServer ES40 systems. WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others. These measures include: 1. Remove any jewelry that may conduct electricity. 2.
This manual has eight chapters and five appendixes. • Chapter 1, System Overview, gives an overview of the system and describes the components. • Chapter 2, Troubleshooting, describes the troubleshooting strategy, lists service tools, utilities, and information services, and gives diagnostic tables for problem categories. • Chapter 3, Power-Up Diagnostics and Display, explains the power-up process and RMC, SROM, and SRM power-up diagnostics.
Documentation Titles
Table 1 Compaq AlphaServer ES40 Documentation
Title                               Order Number
User Documentation Kit              QA-6E88A-G8
  Owner’s Guide                     EK-ES240-UG
  User Interface Guide              EK-ES240-UI
  Basic Installation                EK-ES240-PD
  Release Notes                     EK-ES240-RN
  Documentation CD (6 languages)    AG-RF9HA-BE
Maintenance Kit                     QZ-01BAB-GZ
  Service Guide                     EK-ES240-SV
  Service Guide HTML Help           AK-RFXDA-CA
  Illustrated Parts Breakdown       EK-ES240-IP
Loose Piece Items
  Rackmount Installation Guide
  Rackmount Installation Template
  Model 1 to Model 2 Up
Chapter 1 System Overview
This chapter provides an overview of the system in these sections:
• System Architecture
• System Enclosures
• System Chassis—Front View/Top View
• System Chassis—Rear View
• I/O Ports and Slots
• Control Panel
• System Motherboard
• CPU Card
• Memory Architecture and Options
• PCI Backplane
• Remote System Management Logic
• Power Supplies
• Fans
• Removable Media Storage
• Hard Disk Drive Storage
• System Access
• Console Terminal
1.1 System Architecture The system uses a switch-based interconnect system that maintains constant performance even as the number of transactions multiplies.
This system is designed to fully exploit the potential of the Alpha 21264 chip by using a switch-based (or point-to-point) interconnect system. With a traditional bus design, the processors, memory, and I/O modules share the bus. As the number of bus users increases, the transactions interfere with one another, increasing latency and decreasing aggregate bandwidth.
1.2 System Enclosures The Compaq AlphaServer ES40 family consists of a standalone tower, a pedestal with expanded storage capacity, and a cabinet.
Model Variants AlphaServer ES40 systems are offered in two models. The entry-level model provides connectors for four DIMMs on each of the memory motherboards (MMBs) and connectors for six PCI options on the PCI backplane. To upgrade from Model 1 to Model 2, you replace the PCI backplane and the four memory motherboards.
1.
1.
1.
Rear Panel Connections ➊ Modem port—Dedicated 9-pin port for connection by modem to remote management console. ➋ ➌ ➍ ➎ COM2 serial port—Extra port to modem or any serial device. ➏ ➐ ➑ ➒ USB ports. ➓ PCI slot for VGA controller, if installed. Keyboard port—To PS/2-compatible keyboard. Mouse port—To PS/2-compatible mouse. COM1 MMJ-type serial port/terminal port —For connecting a console terminal. Parallel port—To parallel device such as a printer. SCSI breakouts.
1.5 Control Panel The control panel provides system controls and status indicators. The controls are the Power, Halt, and Reset buttons. A 16-character back-lit alphanumeric display indicates system state. The panel has two LEDs: a green Power OK indicator and an amber Halt indicator.
Figure 1–6 Control Panel
➊ Control panel display. A one-line, 16-character alphanumeric display that indicates system status during power-up and testing.
➌ Power LED (green). Lights when the power button is depressed and system power passes initial checks.
➍ Reset button. A momentary contact switch that restarts the system and reinitializes the console firmware. Power-up messages are displayed, and then the console prompt is displayed or the operating system boot messages are displayed, depending on how the startup sequence has been defined.
➎ Halt LED (amber). Lights when you press the Halt button.
➏ Halt button. Halts the system.
1.6 System Motherboard The system motherboard is located on the floor of the system card cage. It has slots for the CPUs and memory motherboards (MMBs) and has the PCI backplane interconnect.
The system motherboard has the majority of the logic for the system, including the CPU, MMB connectors, the PCI connector to I/O, the D-chips and P-chips, the logic for the remote management console (RMC), and the jumpers for the fail-safe loader (FSL). Figure 1–7 shows the location of components and connectors on the system motherboard.
1.7 CPU Card An AlphaServer ES40 can have up to four CPU cards. In addition to the Alpha 21264 chip, the CPU card has a 4-Mbyte second-level cache and a 2.2V DC-to-DC converter with heatsink that provides the required voltage to the Alpha chip. Power-up diagnostics are stored in a flash SROM on the card.
The 21264 microprocessor is a superscalar CPU with out-of-order execution and speculative execution to maximize speed and performance. It contains four integer execution units and dedicated execution units for floating-point add, multiply, and divide. It has an instruction cache and a data cache on the chip. Each cache is a 64 KB, two-way, set associative, virtually addressed cache that has 64-byte blocks. The data cache is a physically tagged, write-back cache.
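The cache geometry given above can be checked with simple arithmetic: the number of sets is the capacity divided by the product of associativity and block size. The sketch below is illustrative only; the function name is hypothetical and does not belong to any Compaq tool.

```python
def cache_sets(capacity_bytes: int, ways: int, block_bytes: int) -> int:
    """Return the number of sets in a set-associative cache."""
    return capacity_bytes // (ways * block_bytes)


# Values taken from the text: 64 KB, two-way set associative, 64-byte blocks.
sets = cache_sets(64 * 1024, 2, 64)
print(sets)  # 512
```

Each of the two on-chip caches therefore has 512 sets of two 64-byte blocks.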
1.8 Memory Architecture and Options The system has two 256-bit wide memory data buses, which can move large amounts of data simultaneously.
Memory Architecture Memory throughput in this system is maximized by the following features:
• Two independent, wide memory data buses
• Very low memory latency (120 ns) and high bandwidth with 12 ns clock
• ECC memory
Each data bus is 256 bits wide (32 bytes). The memory bus speed is 83 MHz. This yields approximately 2.6 GB/sec bandwidth per bus (32 bytes x 83 MHz ≈ 2.6 GB/sec). The maximum bandwidth across both buses is 5.2 GB/sec. The switch interconnect design takes full advantage of the capabilities of the two wide data buses.
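The bandwidth figures above follow from the bus width and clock rate. The sketch below reproduces the arithmetic under the assumption of one data transfer per clock and 1 MB = 10^6 bytes; the function name is hypothetical.

```python
def bus_bandwidth_mb_per_sec(width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth in MB/sec, assuming one transfer per clock cycle."""
    return (width_bits / 8) * clock_mhz


# One 256-bit memory data bus at 83 MHz (values from the text).
per_bus = bus_bandwidth_mb_per_sec(256, 83)  # 2656.0 MB/sec, about 2.6 GB/sec

# Two independent buses; the guide quotes 5.2 GB/sec (2 x 2.6, rounded down).
total = 2 * per_bus  # 5312.0 MB/sec

# A 12 ns clock period corresponds to roughly 83.3 MHz.
clock_mhz = 1000 / 12
```

The small differences from the quoted figures come from rounding 83.3 MHz to 83 MHz and 2.656 GB/sec to 2.6 GB/sec.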
1.9 PCI Backplane The PCI backplane has two independent 64-bit, 33 MHz PCI buses that support 64-bit PCI slots. The 64-bit PCI slots are split across the two buses. The PCI buses support 3.3 V and 5 V options.
PCI Bus Implementation • Is fully compliant with the PCI Version 2.
1.10 Remote System Management Logic The remote system management logic consists of two major elements: the System Power Controller (SPC), used to monitor and control system power supplies, regulators, and cooling apparatus; and the Remote Management Console (RMC), which facilitates remote interrogation and control of the system. The components used within the remote system management logic are powered by the AUX_5V supply, which is always present whenever AC input power is available to the system.
Dual-Port RAM (DPR) The ES40 system features a dual-port RAM—RAM that is shared between the RMC and the system motherboard logic—to ease communication between the system and the RMC. This book refers to the dual-port RAM as the DPR. The RMC reads 256 bytes of data from each FRU EEPROM at power-up and stores it in the DPR. This data contains configuration and possibly error log information. The data is accessible via the TIG chip to the firmware for configuration information during start-up.
1.10.1 System Power Controller (SPC) The System Power Controller (SPC) is responsible for sequencing the turn-on/turn-off of all power supplies and regulators, monitoring all system power supplies and regulators, generating hardware resets to all logic elements, and generating power system status signals for use by other functional units within the system. Additionally, it is responsible for emergency shutdown if the internal system temperature exceeds permissible limits.
1.10.2 Remote Management Console (RMC) The remote management console (RMC) provides a mechanism for remotely monitoring a system and manipulating it on a very low level. It also provides access to the repository for all error information in the system. This provides the operator, either remotely or locally, with the ability to monitor the system (voltages, temperature, fans, error status) and manipulate it (reset, power on/off, halt) without any interaction on the part of the operating system.
1.11 Power Supplies The power supplies provide power to components in the system box. The number of power supplies required depends on the system configuration.
One to three power supplies provide power to components in the system box. The system supports redundant power configurations to ensure continued system operation if a power supply fails. See Chapter 6 for power supply configurations. When more than one power supply is installed, the supplies share the load. The power supplies select line voltage automatically (120V or 240V and 50 Hz or 60 Hz). Power Supply LEDs Each power supply has two green LEDs that indicate the state of power to the system.
1.12 Fans The system has six hot-plug fans that provide front-to-back airflow.
The system fans are shown in Figure 1–13 and described in Table 1–1. Table 1–1 Fan Descriptions Fan Number Area Cooled Fan Failure Scenario ➊, ➋ PCI card cage Removable media Right drive cage Both fans are powered at all times. If one fan fails, all other system fans speed up to provide adequate cooling. You can replace either fan while the system is running. Power supplies Left drive cage Both fans are powered at all times.
1.13 Removable Media Storage The system box houses a CD-ROM drive ➊ and a high-density 3.5-inch floppy diskette drive ➋ and supports two additional 5.25-inch half-height drives or one additional full-height drive. The 5.25-inch half-height area has a divider that can be removed to mount one full-height 5.25-inch device.
1.14 Hard Disk Drive Storage The system chassis can have either one or two storage disk cages. You can install four 1.6-inch hard drives in each storage disk cage. See Chapter 8 for information on replacing hard disk drives.
1.15 System Access At the time of delivery, the system keys are taped inside the small front door that provides access to the operator control panel and removable media devices.
Both the tower and pedestal systems have a small front door through which the control panel and removable media devices are accessible. At the time of delivery, the system keys are taped inside this door. The tower front door has a lock that lets you secure access to the disk drives and to the rest of the system. The pedestal has two front doors, both of which can be locked. The upper door secures the disk drives and access to the rest of the system, and the lower door secures the expanded storage.
1.16 Console Terminal The console terminal can be a serial (character cell) terminal connected to the COM1 or COM2 port or a VGA monitor connected to a VGA adapter on PCI 0. A VGA monitor requires a keyboard and mouse.
Chapter 2 Troubleshooting This chapter describes the starting points for diagnosing problems on Compaq AlphaServer ES40 systems. The chapter also provides information resources.
2.1 Questions to Consider Before troubleshooting any system problem, first check the site maintenance log for the system’s service history. Be sure to ask the system manager the following questions: • Has the system been used and did it work correctly? • Have changes to hardware or updates to firmware or software been made to the system recently? If so, are the revision numbers compatible for the system? (Refer to the hardware and operating system release notes.
2.2 Diagnostic Tables System problems can be classified into the following five categories. Using these categories, you can quickly determine a starting point for diagnosis and eliminate the unlikely sources of the problem. 1. Power problems—Table 2–1 2. No access to console mode—Table 2–2 3. Console-reported failures—Table 2–3 4. Boot problems—Table 2–4 5.
Table 2–1 Power Problems Symptom Action System does not power on. • Check error messages on the OCP. • Check that AC power is plugged in. • Check that the ambient room temperature is within environmental specifications (10–40° C, 50–104° F). • Check the Power setting on the control panel. Toggle the Power button to off, then back on to clear a remote power disable. • Check that internal power supply cables are plugged in at the system motherboard.
Table 2–2 Problems Getting to Console Mode Symptom Action Reference Power-up screen is not displayed at system console. Note any error beep codes and observe the OCP display for a failure detected during self-tests. Chapter 3 Check keyboard and monitor connections. Chapter 1 Press the Return key. If the system enters console mode, check that the console environment variable is set correctly. If the console terminal is a VGA monitor, the console variable should be set to graphics.
Table 2–3 Problems Reported by the Console Symptom Action Reference No SRM messages are displayed after the “jump to console” message. Console firmware is corrupted. Load new firmware with fail-safe loader. Chapter 3 The system attempts to boot from the floppy drive after a checksum error is reported. The system automatically reverts to the fail-safe loader to load new SRM and AlphaBIOS firmware. If the fail-safe load does not work, replace the system motherboard.
Table 2–4 Boot Problems Symptom Action Reference System cannot find boot device. Check the system configuration for the correct device parameters (node ID, device name, and so on). Chapter 6 • For UNIX and OpenVMS, use the show config and show device commands. • For Windows NT, use the AlphaBIOS Display System Configuration menu and the CMOS Setup menus. Check the system configuration for the correct environment variable settings. Device does not boot.
Table 2–5 Errors Reported by the Operating System Symptom Action Reference System is hung, but SRM console is operating Press the Halt button and enter the crash command to provide a crash dump file for analysis (OpenVMS and UNIX only). Chapter 4 Refer to OpenVMS Alpha System Dump Analyzer Utility Manual for information on how to interpret OpenVMS crash dump files. Refer to the Guide to Kernel Debugging for information on using the UNIX Krash Utility.
2.3 Service Tools and Utilities This section lists some of the tools and utilities available for acceptance testing and diagnosis and gives recommendations for their use. 2.3.1 Error Handling/Logging Tools (Compaq Analyze) The Tru64 UNIX, OpenVMS, and Microsoft Windows NT operating systems provide fault management error detection, handling, notification, and logging.
2.3.4 AlphaBIOS Menus The AlphaBIOS Standard CMOS Setup menu and the Advanced CMOS Setup menu are used to configure Windows NT systems.
2.3.6 Operating System Exercisers (DEC VET) The Verifier and Exerciser Tool (DEC VET) is supported by the Tru64 UNIX, OpenVMS, and Windows NT operating systems. DEC VET is an on-line diagnostic tool used to ensure the proper installation and operation of hardware and base operating system software. Use DEC VET as part of acceptance testing to ensure that the CPU, memory, disk, tape, file system, and network are interacting properly. 2.3.
2.3.9 StorageWorks Command Console (SWCC) The StorageWorks Command Console (SWCC) is a storage management software tool that allows you to configure and monitor storage graphically from a single management console. It also has distributed capabilities that let you view multiple servers at the same time in a Microsoft Explorer-like navigation pane. The StorageWorks Command Console’s client is a graphical user interface (GUI) that can configure and monitor StorageWorks RAID Array solutions.
2.4 Information Resources Many information resources are available, including tools that can be downloaded from the Internet, firmware updates, a supported options list, and more. 2.4.1 Compaq Service Tools CD The Compaq Service Tools CD-ROM enables field engineers to upgrade customer systems with the latest version of software when the customer does not have access to Compaq Web pages. The CD-ROM Web site is: http://caspian1.zko.dec.com/service_tools/ 2.4.
• If you do not have a Web browser, you can download the files using anonymous ftp: ftp.digital.com/pub/Digital/Alpha/firmware • Individual Alpha system firmware releases that occur between releases of the firmware CD are located in the interim directory: ftp.digital.com/pub/Digital/Alpha/firmware/interim AlphaBIOS Firmware The AlphaBIOS firmware is included in the Alpha Systems Firmware Update Kit CD-ROM. 2.4.
2.4.6 Late-Breaking Technical Information You can download up-to-date files and late-breaking technical information from the Internet. The information includes firmware updates, the latest configuration utilities, software patches, lists of supported options, and more. http://www.digital.com/alphaserver/es40/es40.html 2.4.7 Supported Options A list of options supported on the system is available on the Internet: http://www.digital.com/alphaserver/es40/es40_sol.
Chapter 3 Power-Up Diagnostics and Display This chapter describes the power-up process and RMC, SROM, and SRM power-up diagnostics.
3.1 Overview of Power-Up Diagnostics The power-up process begins with the power-on of the power supplies. After the AC and DC power-up sequences are completed, the remote management console (RMC) reads EEROM information and deposits it into the DPR. The SROM minimally tests the CPUs, initializes and tests backup cache, and minimally tests memory. Finally, the SROM loads the SRM console program into memory and jumps to the first instruction in the console program.
3.2 System Power-Up Sequence The power-up sequence is described below and illustrated in Figure 3–1.
1. When the power cord is plugged into the wall outlet, 5V auxiliary AC voltage is enabled. The 5 V AUX LEDs on the power supplies are lit, and the system power controller and RMC are initialized.
2. Pressing the Power button on the control panel or subsequently issuing the power-on command from the RMC turns on power to the power supplies, CPU converters, and VTERM regulators.
Figure 3–1 Power-Up Sequence
Flowchart boxes, in order:
Apply AC power
5 V AUX LEDs on PS are lit
OCP Power button = IN
Turn on power supplies
Turn on CPU converters
Turn on VTERM regulators
Set all CPU_DCOK = True
Set SYS_DC_OK = True
Set SYS_RESET = False
Set CPU(n)_RESET = False
Set CPU(n)_RESET = False
CPU = "Alive"? (decision with branches labeled No and Yes)
Disable CPU
All CPUs reload initial Y divisor
Continue SROM power-up
Figure 3–1 Power-Up Sequence (Continued)
Flowchart boxes, in order:
SROM Power-Up
Init EV6
Test PCI
Determine Config (decision with branches labeled Bad and Good)
Reload Using Flash SROM
Init EV6
Test PCI
Release CPUs
B-Cache Tests
Memory Config and Tests
Load SRM
3.3 Power-Up Displays Power-up information is displayed on the operator control panel and on the console terminal startup screen. Messages sent from the RMC and SROM programs are displayed first, followed by messages from the SRM console. NOTE: The power-up text that is displayed on the screen depends on what kind of terminal is connected as the console terminal: VT or VGA.
• Section 3.3.1 describes the SROM power-up sequence and shows the SROM power-up messages and corresponding OCP messages. • Section 3.3.2 shows the messages that are displayed once the SROM has transferred control to the SRM console.
3.3.1 SROM Power-Up Display Example 3–1 Sample SROM Power-Up Display SROM Power-Up Display SROM V1.00 CPU #00 @ 0500 SROM program starting Reloading SROM OCP Message MHz SROM T1.
SROM Power-Up Sequence
➊ When the system powers up, the SROM code is loaded into the I-cache (instruction cache) on the first available CPU, which becomes the primary CPU. The order of precedence is CPU0, CPU1, and so on. The primary CPU attempts to access the PCI bus. If it cannot, either a hang or a failure occurs, and this is the only message displayed.
➋ The primary CPU interrogates the I²C EEROM as stored in the DPR. The primary CPU determines the optimum CPU and system configuration to jump to.
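The primary CPU selection rule in step ➊ (the first available CPU in the order CPU0, CPU1, and so on) can be sketched as below. This is a minimal illustration of the precedence rule only, not the SROM implementation; the function name and the way availability is passed in are assumptions.

```python
from typing import Iterable, Optional


def select_primary_cpu(available: Iterable[int]) -> Optional[int]:
    """Pick the lowest-numbered available CPU (CPU0 before CPU1, and so on).

    Returns None when no CPU is available.
    """
    cpus = sorted(available)
    return cpus[0] if cpus else None


# Hypothetical case: CPU0 is absent or failed, so CPU1 becomes primary.
primary = select_primary_cpu([3, 1, 2])
```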
3.3.2 SRM Console Power-Up Display When SROM power-up is complete, the primary CPU transfers control to the SRM console program. The console program continues the system initialization. Failures are reported to the console terminal through the power-up screen and a console event log. Example 3–2 SRM Power-Up Display OpenVMS PALcode V1.50-0, Tru64 UNIX PALcode V1.
SRM Power-Up Sequence ➊ The primary CPU prints a message indicating that it is running the console. Starting with this message, the power-up display is sent to any console terminal, regardless of the state of the console environment variable. If console is set to graphics, the display from this point on is saved in a memory buffer and displayed on the VGA monitor after the PCI buses are sized and the VGA device is initialized. ➋ The memory size is determined and memory is tested.
Example 3–2 SRM Power-Up Display (Continued)
entering idle loop
initializing keyboard
starting console on CPU 1
initialized idle PCB
initializing idle process PID
lowering IPL
CPU 1 speed is 2.00 ns (500MHz)
create powerup
entering idle loop
starting console on CPU 2
initialized idle PCB
initializing idle process PID
lowering IPL
CPU 2 speed is 2.00 ns (500MHz)
create powerup
starting console on CPU 3
initialized idle PCB
initializing idle process PID
lowering IPL
CPU 3 speed is 2.
SRM Power-Up Sequence (Continued)
➎ The console is started on the secondary CPUs. The example shows a four-processor system.
➏ Various diagnostics are performed.
➐ Systems running UNIX or OpenVMS display the SRM console banner and the prompt, Pnn>>>. The number n indicates the primary processor. In a multiprocessor system, the prompt could be P00>>>, P01>>>, P02>>>, or P03>>>. From the SRM prompt, you can boot the UNIX or OpenVMS operating system.
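The Pnn>>> prompt format described in step ➐ can be sketched as follows. The helper name is hypothetical and is not part of the SRM firmware; it only shows how the two-digit primary CPU number maps to the prompt strings listed in the text.

```python
def srm_prompt(primary_cpu: int) -> str:
    """Format the SRM console prompt for the given primary CPU number."""
    return f"P{primary_cpu:02d}>>>"


# Prompts for a four-processor system, one per possible primary CPU.
prompts = [srm_prompt(n) for n in range(4)]
print(prompts)  # ['P00>>>', 'P01>>>', 'P02>>>', 'P03>>>']
```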
3.3.3 Resizing SRM Console Heap The SRM console allocates enough memory for most configurations. If options were installed that require more memory than the SRM console has allocated, the console dynamically resizes itself to provide additional memory to support the configuration. The following crash/reboot cycle can occur several times until the console has allocated enough memory. An abbreviated example of the output to a serial console screen is shown in Example 3–3. 1. The console powers up. 2.
Example 3–3 Memory Resize Crash/Reboot Cycle
initialized idle PCB
initializing semaphores
initializing heap
initial heap 200c0
memory low limit = 15e000
heap = 200c0, 17fc0
initializing driver structures
initializing idle process PID
initializing file system
initializing hardware
initializing timer data structures
lowering IPL
CPU 0 speed is 500 MHz
create dead_eater
create poll
create timer
create powerup
access NVRAM
Memory size 2048 MB
testing memory
......
2048 MB of System Memory Testing the System CPU0: insufficient dynamic memory for a request of 4592 bytes Console heap space will be automatically increased in size by 64KB PID bytes name -------- ---------- ---00000000 27360 ???? 00000001 23424 idle 00000002 800 dead_eater 00000003 800 poll 00000004 800 timer 00000005 499584 powerup 00000031 129536 pwrup_diag 00000013 896 ???? 00000016 1056 ???? 00000026 128 ???? 00000017 512 ???? 00000006 2880 tt_control 00000007 800 mscp_poll 00000008 800 dup_poll 000000
SYSFAULT CPU0 - pc = 0014faac exception context saved starting at 001FD7B0 GPRs: 0: 00000000 00048FF8 16: 00000000 0000001E 1: 00000000 00150C80 17: 00000000 EFEFEFC8 2: 00000000 001202D0 18: 00000000 001FD2F8 3: 00000000 000011F0 19: 00000000 00000025 4: 00000000 0010C7B8 20: 00000801 FC000000 5: 00000000 00000020 21: 00000000 0008A8B0 6: 00000000 00000000 22: 00000000 0010ACB8 7: 00000000 00038340 23: 00000000 00000001 8: 00000000 00000000 24: 00000000 00000000 9: 00000000 00000000 25: 00000000 00000001 1
. . .
bus 0, slot 15 -- dqb -- Acer Labs M1543C IDE
starting drivers
entering idle loop
initializing keyboard
starting console on CPU 1
initialized idle PCB
initializing idle process PID
lowering IPL
CPU 1 speed is 500 MHz
create powerup
. . .
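The crash/reboot cycle in Example 3–3 can be modeled as a loop that grows the heap by a fixed step until it is large enough. The sketch below is a simplified model, assuming each cycle adds exactly one 64 KB increment (the step reported in the example output) and that the total requirement is known in advance, which the real console does not know. The function name and the byte counts in the usage line are hypothetical.

```python
HEAP_INCREMENT = 64 * 1024  # 64 KB step, as reported in the example output


def reboots_until_heap_fits(initial_heap: int, required_heap: int) -> int:
    """Count the crash/reboot cycles needed before the heap covers the requirement."""
    reboots = 0
    heap = initial_heap
    while heap < required_heap:
        heap += HEAP_INCREMENT  # console enlarges its heap, then restarts
        reboots += 1
    return reboots


# Hypothetical sizes: a 150,000-byte shortfall takes three 64 KB increments.
cycles = reboots_until_heap_fits(1_000_000, 1_150_000)
```

This shows why the cycle can repeat several times on a heavily configured system: each pass adds only one increment.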
3.3.4 SRM Console Event Log The SRM console event log helps you troubleshoot problems that do not prevent the system from coming up to the SRM console. The console event log consists of status messages received during power-up self-tests.
3.3.5 AlphaBIOS Startup Screens If the system is running the Windows NT operating system, the SRM console loads and starts the AlphaBIOS console. An initialization screen similar to Example 3–5 is displayed on the VGA monitor. The initialization includes a memory test that is displayed to the screen. Once AlphaBIOS initialization is complete, an AlphaBIOS boot screen similar to Example 3–6 is displayed. Example 3–5 AlphaBIOS Initialization Screen AlphaBIOS 5.
Example 3–6 AlphaBIOS Boot Screen
AlphaBIOS 5.68
Please select the operating system to start:
Windows NT Server 4.00
Use ↑ and ↓ to move the highlight to your choice. Press Enter to choose.
3.4 Power-Up Error Messages Error messages at power-up may be displayed by the RMC, SROM, and SRM. A few SROM messages are announced by beep codes.
3.4.1 SROM Messages with Beep Codes
Table 3–1 Error Beep Codes
Beep Code   Associated Messages   Meaning
1           Jump to Console       SROM code has completed execution. System jumps to SRM console. SRM messages should start to be displayed. If no SRM messages are displayed, it may indicate corrupted firmware. See Section 3.4.2.
1-3                               VGA monitor not plugged in.
A few SROM error messages that appear on the operator control panel are announced by audible error beep codes, as indicated in Table 3–1. For example, a 1-1-4 beep code consists of one beep, a pause (indicated by the hyphen), one beep, a pause, and a burst of four beeps. This beep code is accompanied by the message “ROM err.” Related messages are also displayed on the console terminal if the console device is connected to the serial line and the SRM console environment variable is set to serial.
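The beep code notation can be expanded mechanically. The following Python sketch is an illustration only and is not part of the firmware; the function name describe_beep_code is hypothetical. It assumes, as described above, that each number is a burst of beeps and each hyphen is a pause.

```python
def describe_beep_code(code):
    """Expand a beep code such as "1-1-4" into words.

    Each number is a burst of beeps; each hyphen is a pause.
    """
    phrases = []
    for part in code.split("-"):
        count = int(part)
        phrases.append("one beep" if count == 1 else f"a burst of {count} beeps")
    return ", a pause, ".join(phrases)


print(describe_beep_code("1-1-4"))
print(describe_beep_code("1-3-3"))
```

Writing the code out in words this way can help when relaying a beep sequence over the phone.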
3.4.2 Checksum Error If Jump to Console is the last message displayed on the OCP, the console firmware may have become corrupted. When the system detects the error, it attempts to load the fail-safe loader (FSL) program so that you can load new console firmware images. Example 3–7 Checksum Error and Fail-Safe Load Loading console Console ROM checksum error Expect: 00000000.000000FE Actual: 00000000.000000FF XORval: 00000000.
➏ ***** Loadable Firmware Update Utility *****
------------------------------------------------------------
Function    Description
------------------------------------------------------------
Display     Displays the system’s configuration table.
Exit        Done exit LFU (reset).
List        Lists the device, revision, firmware name, and update revision.
Readme      Lists important release information.
Update      Replaces current firmware with loadable data image.
Verify      Compares loadable and hardware images.
3.4.3 No MEM Error If the SROM code cannot find any usable memory, a 1-3-3 beep code is issued (one beep, a pause, a burst of three beeps, a pause, and another burst of three beeps), and the message “No MEM” is displayed on the OCP. The system does not come up to the console program. This error indicates missing or bad DIMMs.
➊ Indicates failed DIMMs. M identifies the MMB; D identifies the DIMM. In this line, DIMM 2 on MMB1 failed. ➋ Indicates that some DIMMs in this array are mismatched. All DIMMs in the affected array are marked as incompatible (incmpat). ➌ Indicates that a DIMM in this array is missing. All missing DIMMs in the affected array are marked as missing. ➍ Indicates that the DIMM data for this array is unreadable. All unreadable DIMMs in the affected array are marked as illegal.
3.4.4 RMC Error Messages Table 3–2 lists the fatal error messages that could potentially be displayed on the OCP by the remote management console during power-up. Most fatal error messages prevent the system from completing power-up. The warning messages listed in Table 3–3 require prompt attention but might not prevent the system from completing power-up or booting the operating system. Table 3–2 RMC Fatal Error Messages Message Meaning AC loss No AC power to the system. CPUn failed CPU failed.
Table 3–3 RMC Warning Messages
Message             Meaning
PSn failed          Power supply failed. “n” is 0, 1, or 2.
OverTemp Warning    System temperature is near the high threshold.
Fann failed         Fan failed. “n” is 0 through 6.
PCI door opened     Cover to PCI card cage is off. Reinstall cover.
Fan door opened     Cover to main fan area (fans 5 and 6) is off. Reinstall cover.
3.3V bulk warn      Power supply voltage over or under threshold.
5V bulk warn        Power supply voltage over or under threshold.
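When OCP messages are recorded in a service log, they can be sorted into fatal and warning classes with a simple lookup. The Python sketch below is an illustration, not an RMC interface. It includes only the table rows visible in this section (the tables in the guide may contain more), the function name classify_rmc_message is hypothetical, and the single-digit range accepted for CPUn is an assumption.

```python
import re

# (pattern, meaning) pairs copied from the rows of Tables 3-2 and 3-3 that
# appear in this section. The full tables may list further messages.
FATAL = [
    (r"AC loss", "No AC power to the system."),
    (r"CPU\d failed", "CPU failed."),  # digit range is an assumption
]
WARNING = [
    (r"PS[0-2] failed", "Power supply failed."),
    (r"OverTemp Warning", "System temperature is near the high threshold."),
    (r"Fan[0-6] failed", "Fan failed."),
    (r"PCI door opened", "Cover to PCI card cage is off. Reinstall cover."),
    (r"Fan door opened", "Cover to main fan area (fans 5 and 6) is off. Reinstall cover."),
    (r"3\.3V bulk warn", "Power supply voltage over or under threshold."),
    (r"5V bulk warn", "Power supply voltage over or under threshold."),
]


def classify_rmc_message(message):
    """Return (severity, meaning) for an OCP message, or ("unknown", "")."""
    text = message.strip()
    for severity, table in (("fatal", FATAL), ("warning", WARNING)):
        for pattern, meaning in table:
            if re.fullmatch(pattern, text):
                return severity, meaning
    return "unknown", ""


print(classify_rmc_message("Fan3 failed"))
print(classify_rmc_message("AC loss"))
```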
3.4.5 SROM Error Messages The SROM power-up identifies errors that may or may not prevent the system from coming up to the console. These errors may also prevent the system from successfully booting the operating system. Errors encountered during SROM power-up are displayed on the OCP. Some errors are also displayed on the console terminal screen if the console output is set to serial. Table 3–4 lists the SROM error messages.
Table 3–4 SROM Error Messages (Continued)
Code  SROM Message                          OCP Message
7E    Configuration error on CPU #2         CfgERR 2
7D    Configuration error on CPU #1         CfgERR 1
7C    Configuration error on CPU #0         CfgERR 0
7B    Bcache failed on CPU #3 error         BC Bad
7A    Bcache failed on CPU #2 error
79    Bcache failed on CPU #1 error
78    Bcache failed on CPU #0 error
77    Memory thrash error on CPU #3
76    Memory thrash error on CPU #2
75    Memory thrash error on CPU #1
74    Memory thrash error on CPU #0
73    Starting secondary on CPU #3 error
3.5 Forcing a Fail-Safe Floppy Load Under some circumstances, you may need to force the activation of the FSL. For example, if you install a system motherboard that has an older version of the firmware than your system requires, you may not be able to bring up the SRM console. In that case you need to force a floppy load so that you can update the SRM firmware.
1. Turn off the system. Unplug the power cord from each power supply and wait for the 5V AUX indicators to extinguish.
2. Remove enclosure covers (tower and pedestal) or the front bezel (rackmount) to access the system chassis. See Chapter 8 for illustrations.
3. Remove the fan cover and the system card cage cover to gain access to the system motherboard. See Chapter 8 for illustrations.
4. Remove MMB 1 (closest to the PCI backplane) so that you can access the function jumpers.
5.
3.6 Updating the RMC Under certain circumstances, the RMC will not function. If the problem is caused by corrupted RMC flash ROM, you need to update RMC firmware. The RMC will not function if: • No AC power is provided to any of the power supplies. • DPR does not pass its self-test (DPR is corrupted). • RMC flash ROM is corrupted.
You can update the remote management console firmware from flash ROM using the LFU. 1. Load the update medium. 2. At the UPD> prompt, exit from the update utility, and answer y to the manual update prompt. Enter update RMC to update the firmware.
Chapter 4 SRM Console Diagnostics This chapter describes troubleshooting with the SRM console. The SRM console firmware contains ROM-based diagnostics that allow you to run system-specific or device-specific exercisers. The exercisers run concurrently to provide maximum bus interaction between the console drivers and the target devices. Run the diagnostics by using commands from the SRM console. To run the diagnostics in the background, use the background operator “&” at the end of the command.
4.1 Diagnostic Command Summary Diagnostic commands are used to test the system and help diagnose failures. Table 4–1 gives a summary of the SRM diagnostic commands and related commands. See Chapter 6 for a list of SRM environment variables, and see Appendix A for a list of SRM commands most commonly used for the ES40 system. Table 4–1 Summary of Diagnostic and Related Commands Command Function buildfru Initializes I2Cbus EEPROM data structures for the named FRU.
Table 4–1 Summary of Diagnostic and Related Commands (Continued)
Command       Function
kill          Terminates a specified process.
kill_diags    Terminates all executing diagnostics.
more el       Same as cat el, but displays the console event log one screen at a time.
memexer       Runs a requested number of memory tests in the background.
memtest       Tests a specified section of memory.
net -ic       Initializes the MOP counters for the specified Ethernet port.
net -s        Displays the MOP counters for the specified Ethernet port.
4.2 buildfru The buildfru command initializes I2C bus EEPROM descriptive data structures for the named FRU and initializes its SDD and TDD error logs. This command uses data supplied on the command line to build the FRU descriptor. Buildfru is used by Manufacturing, FRU repair operations, or Field Service. Example 4–1 buildfru
P00>>> buildfru smb0.mmb0.dim1 54-24941-EA NI90200100 ➊
P00>>> buildfru smb0.cpu0 30-30158-05.AX05 NI94060554 Compaq ➋
P00>>> buildfru -s smb0.mmb0.
P00>>> buildfru
The information supplied on the buildfru command line includes the console name for the FRU, part number, serial number, model number, and optional information. The buildfru command facilitates writing the FRU information to the EEPROM on the device. Use the show fru command to display the FRU table created with buildfru. Use the show error command to display FRUs that have errors logged to them.
The ES40 FRU assembly hierarchy has three levels.
Arguments Console name for this FRU. This name reflects the position of the FRU in the assembly hierarchy. The FRU’s 2-5-2.4 part number. This ASCII string should be 16 characters (extra characters are truncated). This field should not contain any embedded spaces. If a space must be inserted, enclose the entire argument string in double quotes. This field contains the FRU revision, and in some cases an embedded space is allowed between the part number and the revision.
4.3 cat el and more el The cat el and more el commands display the contents of the console event log. In Example 4–2, the console reports that CPU 1 did not power up and fans 1 and 2 failed.
➊ CPU 1 failed.
➋ Fan 1 and Fan 2 failed.
Status and error messages are logged to the console event log at power-up, during normal system operation, and while running system tests. Standard error messages are indicated by asterisks (***). When cat el is used, the contents of the console event log scroll by. Use the Ctrl/S key combination to stop the screen from scrolling, and use Ctrl/Q to resume scrolling. The more el command allows you to view the console event log one screen at a time.
4.4 clear_error The clear_error command clears errors logged in the FRU EEPROMs as reported by the show error command. Example 4–3 clear_error P00>>> clear_error smb0 P00>>> ➊ P00>>> clear_error all P00>>> ➋ ➊ Clears all errors logged in the FRU EEPROM on the system motherboard (SMB0). ➋ Clears all errors logged to all FRU EEPROMs in the system. The clear_error command clears TDD, SDD, and checksum errors. Hardware failures and unreadable EEPROM errors are not cleared. See Table 4–2.
4.5 crash The SRM crash command forces a crash dump to the selected device for UNIX and OpenVMS systems.
P00>>> crash
CPU 0 restarting
DUMP: 19837638 blocks available for dumping.
DUMP: 118178 wanted for a partial compressed dump.
DUMP: Allowing 2060017 of the 2064113 available on 0x800001
device string for dump = SCSI 1 1 0 0 0 0 0.
DUMP.prom: dev SCSI 1 1 0 0 0 0 0, block 2178787
DUMP: Header to 0x800001 at 2064113 (0x1f7ef1)
device string for dump = SCSI 1 1 0 0 0 0 0.
DUMP.
4.6 deposit and examine The deposit command writes data to the specified address of a memory location, register, or device. The examine command displays the contents of a memory location, register, or a device.
Deposit The deposit command stores data in the location specified. If no options are given, the system uses the options from the preceding deposit command. If the specified value is too large to fit in the data size listed, the console ignores the command and issues an error. If the data is smaller than the data size, the higher order bits are filled with zeros. In Example 4–4: ➊ ➋ ➌ ➍ ➎ ➏ ➐ Clear first 512 bytes of physical memory Deposit 5 into four longwords starting at virtual memory address 1234.
-b Defines data size as byte.
-w Defines data size as word.
-l (default) Defines data size as longword.
-q Defines data size as quadword.
-o Defines data size as octaword.
-h Defines data size as hexword.
-d Instruction decode (examine command only)
-n value The number of consecutive locations to modify.
-s value The address increment size. The default is the data size.
dev_name Device name (address space) of the device to access. Device names are:
dpr Dual-port RAM.
Symbolic forms can be used for the address. They are: pc The program counter. The address space is set to GPR. + The location immediately following the last location referenced in a deposit or examine command. For physical and virtual memory, the referenced location is the last location plus the size of the reference (1 for byte, 2 for word, 4 for longword). For other address spaces, the address is the last referenced address plus 1.
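The data-size rules for deposit and the address arithmetic behind the + symbolic form can be sketched in a few lines of Python. This is an illustration of the rules as stated in this section, not console code. The function names are hypothetical, and the byte counts for quadword, octaword, and hexword (8, 16, and 32) are the usual Alpha sizes rather than values quoted in the text.

```python
# Data size in bytes for each size option. The text gives 1, 2, and 4 for
# byte, word, and longword; the remaining sizes are assumed Alpha conventions.
DATA_SIZE_BYTES = {"-b": 1, "-w": 2, "-l": 4, "-q": 8, "-o": 16, "-h": 32}


def check_deposit_value(value, size_option="-l"):
    """Return the value as zero-filled hex, or raise if it does not fit.

    Mirrors the rule that an oversized value is rejected with an error and a
    smaller value has its higher order bits filled with zeros.
    """
    size = DATA_SIZE_BYTES[size_option]
    if not 0 <= value < (1 << (8 * size)):
        raise ValueError(f"value {value:#x} does not fit in {size} byte(s)")
    width = 2 * size  # two hex digits per byte
    return f"{value:0{width}x}"


def next_location(last_address, size_option="-l", is_memory=True):
    """Address that the "+" symbolic form would refer to.

    For physical and virtual memory the step is the data size; for other
    address spaces the step is 1.
    """
    step = DATA_SIZE_BYTES[size_option] if is_memory else 1
    return last_address + step


print(check_deposit_value(5, "-l"))
print(hex(next_location(0x1234, "-l")))
```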
4.7 exer The exer command exercises one or more devices by performing specified read, write, and compare operations. Typically exer is run from the built-in console script. Advanced users may want to use the specific options described here. Note that running exer on disks can be destructive. Optionally, exer reports performance statistics.
• A read operation reads from a device that you specify into a buffer.
• A write operation writes from a buffer to a device that you specify.
P00>>> ls -l dk*.* r--dk 0/0 0 P00>>> exer dk*.* -bc 10 -sec 20 -m -a ’r’ dka0.0.0.0.0 exer completed packet IOs 8192 3325 27238400 0 166 dka0.0.0.0.0 1360288 elapsed idle 20 19 P00>>> exer -eb 64 -bc 4 -a ’?w-Rc’ dka0 A destructive write test over block numbers 0 through 100 on disk dka0. The packet size is 2048 bytes. The action string specifies the following sequence of operations: 1. Set the current block address to a random block number on the disk between 0 and 97.
7. Compare buffer1 with buffer2 and report any discrepancies. 8. Repeat the above steps until each block on the disk has been written once and read twice.
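The numbers quoted for the destructive write example follow from the hexadecimal option values. The Python sketch below reproduces that arithmetic; it is an illustration only, and the function name exer_geometry is hypothetical. It assumes that -eb is the ending block in hex (the example's "0 through 100" matches hex 64) and that the highest random starting block is the one from which a full packet still fits before the ending block.

```python
def exer_geometry(eb_hex, bc_hex="1", bs_hex="200"):
    """Derive packet size and block range from hex exer option values.

    bs_hex: block size in bytes (default 200 hex)
    bc_hex: blocks per I/O (default 1)
    eb_hex: ending block (assumed meaning of -eb)
    """
    block_size = int(bs_hex, 16)
    blocks_per_io = int(bc_hex, 16)
    end_block = int(eb_hex, 16)
    return {
        "end_block": end_block,
        "packet_bytes": block_size * blocks_per_io,
        # Assumption: last block from which a whole packet fits in the range.
        "last_start_block": end_block - blocks_per_io + 1,
    }


# Option values from: exer -eb 64 -bc 4 -a '?w-Rc' dka0
print(exer_geometry(eb_hex="64", bc_hex="4"))
```

With the default block size of 200 (hex), four blocks per I/O give the 2048-byte packet, and the random seek range of 0 through 97 is consistent with an ending block of 100.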
-bs Specifies the block size (hex) in bytes. The default is 200 (hex). -bc Specifies the number of blocks (hex) per I/O. On devices without length (tape), use the specified packet size or default to 2048. The maximum block size allowed with variable length block reads is 2048 bytes. The default is 1. -d1 String argument for eval to generate buffer1 data pattern from. Buffer1 is initialized only once before any I/O occurs. Default = all bytes set to hex 5A’s.
-a (continued) • ? Seek to a random block offset within the specified range of blocks. exer calls the program, random, to “deal” each of a set of numbers once. exer chooses a set that is a power of two and is greater than or equal to the block range. Each call to random results in a number that is then mapped to the set of numbers that are in the block range and exer seeks to that location in the filestream.
4.8 floppy_write The floppy_write script runs a write test on the floppy drive to determine whether or not you can write on the diskette. Use this script if a customer is unable to write data to the floppy. This is a destructive test, so use a blank floppy. Example 4–6 floppy_write P00>>> floppy_write Destructive Test of the Floppy started P00>>> show_status ID Program Device Pass -------- ------------ ------------ -----00000001 idle system 0 00000c37 exer_kid dva0.0.0.
4.9 grep The grep command is very similar to the UNIX grep command. It allows you to search for “regular expressions”—specific strings of characters—and prints any lines containing occurrences of the strings. Using grep is similar to using wildcards. Example 4–7 grep P00>>> show fru SMB0.CPB0.PCI1 SMB0.CPB0.PCI4 SMB0.CPB0.
Syntax grep ( [-{c|i|n|v}] [-f ] [] [...] ) Arguments Specifies the target regular expression. If any regular expression metacharacters are present, the expression should be enclosed with quotes to avoid interpretation by the shell. ... Specifies the files to be searched. If none are present, then standard input is searched. Options -c Print only the number of lines matched. -i Ignore case. By default grep is case sensitive.
4.10 hd The hd command dumps the contents of a file (byte stream) in hexadecimal and ASCII.
➊ Example 4–8 shows a hex dump to DPR location 2b00, ending at block 0. Syntax hd [-{byte|word|long|quad}] [-{sb|eb} ] [:]. Arguments [:] Specifies the file (byte stream) to be displayed.
4.11 info The info command displays registers and data structures. You can enter the command by itself or followed by a number (0, 1, 2, 3, or 4). If you do not specify a number, a list of selections is displayed and you are prompted to enter a selection.
info 0 Displays the SRM memory descriptors as described in the Alpha System Reference Manual. info 1 Displays the page table entries (PTE) used by the console and operating system to map virtual to physical memory. Valid data is displayed only after a boot operation. info 2 Dumps the Galaxy Configuration Tree (GCT) FRU table. Galaxy is a software architecture that allows multiple instances of OpenVMS to execute cooperatively on a single computer.
Example 4–11 shows an abbreviated info 2 display.
Example 4–12 shows an abbreviated info 3 display.
Example 4–13 shows an abbreviated info 4 display.
4.12 kill and kill_diags The kill and kill_diags commands terminate diagnostics that are currently executing.
4.13 memexer The memexer command runs a specified number of memory exercisers in the background. Nothing is displayed unless an error occurs. Each exerciser tests all available memory in twice the backup cache size blocks for each pass. The following example shows no errors.
If the memory configuration is very large, the console might not test all of the memory. The upper limit is 1 GB. Use the show_status command to display the progress of the tests. Use the kill or kill_diags command to terminate the test. Syntax memexer [number] Arguments [number] Number of memory exercisers to start. The default is 1. The number of exercisers, as well as the length of time for testing, depends on the context of the testing.
4.14 memtest The memtest command exercises a specified section of memory. Typically memtest is run from the built-in console script. Advanced users may want to use the specific options described here.
➊ Use the show memory command or an info 0 command to see where memory is located. ➋ Starting address ➌ Length of the section to test in bytes ➍ Passcount. In this example, the test will run for 10 passes. ➎ The test detected a failure on DIMM 3, which is located on MMB 2. Use the show_status command to display the progress of the test. Use the kill or kill_diags command to terminate the test. Memtest provides a graycode memory test.
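For reference, a Gray code is a sequence in which consecutive values differ in exactly one bit. The Python sketch below generates the standard binary-reflected Gray code. It illustrates the general idea only; this guide does not describe how the firmware generates or applies its patterns, so nothing here should be read as the memtest implementation.

```python
def gray_code(n):
    """Return the n-th value of the binary-reflected Gray code."""
    return n ^ (n >> 1)


patterns = [gray_code(i) for i in range(8)]
for value in patterns:
    print(f"{value:03b}")
```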
NOTE: If memtest is used to test large sections of memory, testing may take a while to complete. If you issue a Ctrl/C or kill PID in the middle of testing, memtest may not abort right away. For speed reasons, a check for a Ctrl/C or kill is done outside of any test loops. If this is not satisfactory, you can run concurrent memtest processes in the background with shorter lengths within the target range.
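One way to prepare the shorter concurrent runs mentioned in the note is to split the target range into equal pieces and start one background memtest per piece. The Python sketch below only builds the command strings. It is an illustration under stated assumptions: the function name memtest_commands is hypothetical, the -sa and -l values are assumed to be hexadecimal without a prefix, and -p is assumed to take the pass count as typed. The trailing & is the background operator described at the start of this chapter.

```python
def memtest_commands(start, length, pieces, passes=1):
    """Split [start, start + length) into pieces and build memtest commands.

    Returns one background command string per piece. The last piece absorbs
    any remainder so that the whole range is covered.
    """
    if pieces < 1 or length < pieces:
        raise ValueError("need at least one byte per piece")
    chunk = length // pieces
    commands = []
    for index in range(pieces):
        address = start + index * chunk
        size = chunk if index < pieces - 1 else length - chunk * (pieces - 1)
        commands.append(f"memtest -sa {address:x} -l {size:x} -p {passes} &")
    return commands


for line in memtest_commands(0x400000, 0x2000000, pieces=4, passes=10):
    print(line)
```

Check the starting address and length against the show memory or info 0 output before typing the commands, since the sketch does no validation against the actual memory map.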
Syntax memtest ( [-sa ] [-ea ] [-l ] [-bs ] [-i ] [-p ] [-d ] [-rs ] [-ba ] [-t ] [-se ] [-g ] [-rb] [-f] [-m] [-z] [-h] [-mb] ) Options -sa Start address. Default is first free space in memzone. -ea End address. Default is start address plus length size.
Options -m Timer. Prints out the run time of the pass. Default = off . -z Tests the specified memory address without allocation. Bypasses all checking but allows testing in addresses outside of the main memory heap. Also allows unaligned input. CAUTION: This flag can overwrite the console. If the system hangs, press the Reset button. -d Used only for march test (2). Uses this pattern as test pattern. Default = 5’s -h Allocates test memory from the firmware heap. -rs Used only for random test (3).
4.15 net The net command performs maintenance operations on a specified Ethernet port. Net -ic initializes the MOP counters for the specified Ethernet port, and net -s displays the current status of the port, including the contents of the MOP counters.
Syntax net [-ic] net [-s] Arguments Specifies the Ethernet port on which to operate, either ei*0 or ew*0.
4.16 nettest The nettest command tests the network ports using MOP loopback. Typically nettest is run from the built-in console script. Advanced users may want to use the specific options and environment variables described here.
Nettest performs a network test. It can test the ei* or ew* ports in internal loopback, external loopback, or live network loopback mode. Nettest contains the basic options to run MOP loopback tests. Many environment variables can be set from the console to customize nettest before nettest is started. The environment variables, a brief description, and their default values are listed in the syntax table in this section. Each variable name is preceded by e*a0_ or e*b0_ to specify the desired port.
Syntax nettest ( [-f ] [-mode ] [-p ] [-sv ] [-to ] [-w ] [] ) Arguments Specifies the Ethernet port on which to run the test. Options -f -mode Specifies the file containing the list of network station addresses to loop messages to. The default file name is lp_nodes_e*a0 for port e*a0. The default file name is lp_nodes_e*b0 for port e*b0. The files by default have their own station address.
-sv Specifies which MOP version protocol to use. If 3, then MOP V3 (DECNET Phase IV) packet format is used. If 4, then MOP V4 (DECNET Phase V IEEE 802.3) format is used. -to Specifies the time in seconds allowed for the loop messages to be returned. The default is 2 seconds. -w Specifies the time in seconds to wait between passes of the test. The default is 0 (no delay). The network device can be very CPU intensive. This option will allow other processes to run.
4.17 set sys_serial_num The set sys_serial_num command sets the system serial number. This command is used by Manufacturing for establishing the system serial number, which is then propagated to all FRU devices that have EEPROMs. The sys_serial_num environment variable can be read by the operating system. Example 4–19 set sys_serial_num P00>>> set sys_serial_num NI900100022 When the system motherboard (SMB) is replaced, you must use the set sys_serial_num command to restore the master setting.
4.18 show error The show error command reports errors logged to the FRU EEPROMs.
The output of the show error command is based on information logged to the serial control bus EEPROMs on the system FRUs. Both the operating system and the ROM-based diagnostics log errors to the EEPROMs. This functionality allows you to generate an error log from the console environment. No errors are displayed for fans or the OCP because these components do not have an EEPROM. Syntax show error All FRUs with errors are displayed.
Table 4–2 Show Error Message Translation Bit Mask (E Field) Text Message Meaning and Action 01 Hardware Failure Module failure. FRUs that are known to be connected but are unreadable are considered hardware failures. An example is power supplies. 02 TDD - Type:0 Test: 0 SubTest: Error: 0 Serious error. Run the Compaq Analyze GUI, if necessary, to determine what action to take. If you cannot run Compaq Analyze, replace the module.
4.19 show fru The show fru command displays the physical configuration of FRUs. Use show fru -e to display FRUs with errors. Example 4–21 show fru P00>>> build smb0 54-25385-01.a01 ay94412345 P00>>> show fru ➊ FRUname SMB0 SMB0.CPU0 SMB0.CPU1 SMB0.CPU2 SMB0.CPU3 SMB0.MMB0 SMB0.MMB0.DIM1 SMB0.MMB0.DIM2 SMB0.MMB0.DIM3 SMB0.MMB0.DIM4 SMB0.MMB0.DIM5 SMB0.MMB0.DIM6 SMB0.MMB1 SMB0.MMB1.DIM1 SMB0.MMB1.DIM2 SMB0.MMB1.DIM3 SMB0.MMB1.DIM4 SMB0.MMB1.DIM5 SMB0.MMB1.DIM6 SMB0.MMB2 SMB0.MMB2.DIM1 SMB0.MMB2.DIM2 SMB0.
PWR1 FAN1 FAN2 FAN3 FAN4 FAN5 FAN6 SMB0.CPB0.SBM0 ➊ 00 00 00 00 00 00 00 06 FRUname 30-49448-01. C02 70-40073-01 70-40073-01 70-40072-01 70-40071-01 70-40073-02 70-40074-01 54-12345-01 2P91600530 AY80151237 API-7650 Fan Fan Fan Fan Fan Fan The FRU name recognized by the SRM console. The name also indicates the location of that FRU in the physical hierarchy.
Table 4–3 lists bit assignments for failures that could potentially be listed in the E (error) field of the show fru command. Because the E field is only two characters wide, bits are “or’ed” together if the device has multiple errors.
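Because the bits are combined with a logical OR, a two-character E field can be separated back into individual bits. The Python sketch below shows the decoding. It is an illustration only: the function name decode_e_field is hypothetical, and it names just the two bits (01 and 02) whose meanings appear in the Table 4–2 rows shown in this chapter, on the assumption that show fru uses the same assignments. All other bits are reported by value so that they can be looked up in Table 4–3.

```python
# Bit names taken from the visible rows of Table 4-2. Further bits are
# deliberately left unnamed in this sketch.
KNOWN_BITS = {
    0x01: "Hardware Failure",
    0x02: "TDD error logged",
}


def decode_e_field(field):
    """Split a two-character hex E field into its individual error bits."""
    mask = int(field, 16)
    findings = []
    bit = 0x01
    while bit <= 0x80:
        if mask & bit:
            findings.append(KNOWN_BITS.get(bit, f"bit {bit:02X} (not named in this sketch)"))
        bit <<= 1
    return findings


print(decode_e_field("03"))
print(decode_e_field("06"))
```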
4.20 show_status The show_status command displays the progress of diagnostics. The command reports one line of information per executing diagnostic. Many of the diagnostics run in the background and provide information only if an error occurs.
➊ Process ID
➋ The SRM diagnostic for the particular device
➌ The ID of the device under test
➍ Number of diagnostic passes that have been completed
➎ Error count (hard and soft). Soft errors are not usually fatal; hard errors halt the system or prevent completion of the diagnostics.
➏ Bytes successfully written by the diagnostic.
➐ Bytes successfully read by the diagnostic.
4.21 sys_exer The sys_exer command exercises the devices displayed with the show config command. Tests are run concurrently and in the background. Nothing is displayed after the initial test startup messages unless an error occurs. Example 4–23 sys_exer P00>>> sys_exer Default zone extended at the expense of memzone.
Use the show_status command to display the progress of diagnostic tests. The diagnostics started by the sys_exer command automatically reallocate memory resources, because these tests require additional resources. Use the init command to reconfigure memory before booting an operating system. Because the sys_exer tests are run concurrently and indefinitely (until you stop them with the init command), they are useful in flushing out intermittent hardware problems.
4.22 test The test command verifies all the devices in the system. This command can be used on all supported operating systems: Tru64 UNIX, OpenVMS, and Windows NT.
• A trial diskette with files installed • A trial CD-ROM with files installed The test script tests devices in the following order: 1. Memory tests (one pass) 2. Read-only tests: DK* disks, DR* disks, DQ* disks, MK* tapes, DV* floppy. NOTE: You must install media to test disks, tapes, and the floppy drive. Since no write tests are performed, it is safe to test disks and tapes that contain data. 3. Console loopback tests if -lb argument is specified: COM2 serial port and parallel port. 4.
Chapter 5 Error Logs This chapter tells how to interpret error logs reported by the operating system.
5.1 Error Log Analysis with Compaq Analyze Compaq Analyze (CA) is a fault management diagnostic tool that is used to determine the cause of hardware failures. Compaq Analyze performs system diagnostic processing of both single and multiple error/fault events. Compaq Analyze may or may not be installed on the customer’s system with the operating system, depending on the release cycle. If CA is installed, the Compaq Analyze Director starts automatically as part of the system start-up.
5.1.1 WEB Enterprise Service (WEBES) Director Compaq Analyze uses the functionality contained in the WEBES Director, a process that executes continuously on the machine. The Director manages the processing of system error events and provides analysis message routing for the system. Compaq Analyze provides the functionality for system event analysis and translation. NOTE: WEBES was formerly known as DESTA. The initial release of Compaq Analyze, V1.0, included the common WEBES code.
5.1.2 Invoking the GUI When you invoke the Compaq Analyze GUI, the node “localhost” opens by default for all operating systems. The “localhost” is the system on which CA is running. If an event has occurred, it is listed under “localhost” Events. See Figure 5–1.
Figure 5–2 shows an example of an event screen for an ES40 system. When an error is detected, it is reported to the console with a series of problem found statements. In this case, “Correctable System Detected Error” was logged in the event log with the date and time the event occurred. To display an event or report, click on it to select it, then click on “Display Information.” The item selected opens up in the data display window. See Figure 5–3.
5.1.3 Problem Found Report After you select the Problem Found report and click on Display Information, a full description of the error is displayed and probable FRUs and their location are called out. Figure 5–3 shows the beginning of a Compaq Analyze problem found report.
identification for this event type. The Event_ID_Count gives this event's sequence number among events of the same type. Brief Description The Brief Description designator indicates whether the error event is related to the CPU, system (PCI, storage, and so on), or environmental subsystem. Callout ID The last 12 characters of the Callout ID designator can be used to determine the revision level of the analysis rule-set that is being used. Severity The Severity designator indicates the severity of the problem.
Figure 5–4 FRU List Designator
FRU List The FRU List designator lists the most probable defective FRUs. This list indicates that service needs to be administered to one or more of these FRUs. The information typically includes the FRU probability, manufacturer, system device type, system physical location, part number, serial number, and firmware revision level (if applicable). In Figure 5–4 the most probable failing FRU is DIMM 3 on MMB1. The next most probable is the system motherboard, and the least probable is MMB1.
Figure 5–5 Evidence Designator
Evidence The Evidence designator provides information that leads Compaq Analyze to identify the failing FRU and its location. A portion of the Evidence designator is shown in Figure 5–5. The evidence provided depends on the type of error that is detected.
5.2 Fault Detection and Reporting Table 5–1 provides a summary of the fault detection and correction components of Compaq AlphaServer ES40 systems. Generally, PALcode handles exceptions/interrupts as follows: 1. The PALcode determines the cause of the exception/interrupt. 2. If possible, it corrects the problem and passes control to the operating system for error notification, reporting, and logging before returning the system to normal operation. If PALcode is unable to correct the problem, it 3.
Table 5–1 Compaq AlphaServer ES40 Fault Detection and Correction Component Fault Detection/Correction Capability Alpha 21264 (EV6) microprocessor Contains error checking and correction (ECC) logic for data cycles. Check bits are associated with all data entering and exiting the microprocessor. A single-bit error on any of the four longwords being read can be corrected (per cycle). A double-bit error on any of the four longwords being read can be detected (per cycle).
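The correct-one, detect-two behavior described for the ECC logic can be demonstrated with a very small code. The Python sketch below uses a toy extended Hamming code over 4 data bits. It is not the check-bit layout used by the 21264 or the ES40 memory system, which protects much wider data; it only illustrates why a single-bit error can be corrected while a double-bit error can only be detected. All names are hypothetical.

```python
# Toy SECDED code: 4 data bits, 3 Hamming check bits, 1 overall parity bit.
# Index 0 holds overall parity; indexes 1, 2, 4 hold check bits.
DATA_POSITIONS = (3, 5, 6, 7)


def encode(nibble):
    """Encode a 4-bit value into an 8-bit code word (list of bits)."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    for check in (1, 2, 4):
        # Each check bit makes the parity of the positions it covers even.
        bits[check] = sum(bits[pos] for pos in range(1, 8) if pos & check) % 2
    bits[0] = sum(bits) % 2  # overall parity over the other seven bits
    return bits


def decode(bits):
    """Return (nibble, status); nibble is None if the word is uncorrectable."""
    bits = list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos
    odd_parity = sum(bits) % 2
    if odd_parity:
        # Odd number of flipped bits: treat as a single-bit error and fix it.
        # A syndrome of 0 means the overall parity bit itself was hit.
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        # Even parity but a nonzero syndrome: two bits are wrong.
        return None, "uncorrectable"
    else:
        status = "ok"
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status


word = encode(0b1011)
word[5] ^= 1                      # inject a single-bit error
print(decode(word))
word[2] ^= 1                      # inject a second error
print(decode(word))
```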
5.3 Machine Checks/Interrupts The exceptions that result from hardware system errors are called machine checks/interrupts. They occur when a system error is detected during the processing of a data request. During the error-handling process, errors are first handled by the appropriate PALcode error routine and then by the associated operating system error handler. PALcode transfers control to the operating system through the PAL handler.
Table 5–2 Machine Checks/Interrupts (Continued)
Error Type                          Error Descriptions
System Correctable Error (620)      System-detected ECC single-bit error. ES40-specific correctable errors.
System Uncorrectable Error (660)    A system-detected machine check that occurred as a result of an “off-chip” request to the system.
System Environmental Error (680)    System-detected machine check caused by an overtemperature condition, fan failure, or power supply failure.
5.3.1 Error Logging and Event Log Entry Format The operating system error handlers generate several entry types. Entries can be of variable length based on the number of registers within the entry. Each entry consists of an operating system header, several device frames, and an end frame. Most entries have a PAL-generated logout frame, and may contain frames for CPU, memory, and I/O. Table 5–3 shows an event structure map for a Windows NT system uncorrectable PCI target abort error.
Table 5–3 Sample Error Log Event Structure Map (ES40 with 10 PCI Slots) OFFSET(hex) 63 56 55 48 47 40 39 32 31 24 23 16 15 8 7 0 nh0000 STANDARD MICROSOFT NT OS HEADER nh+nnnn ech0000 NEW COMMON OS HEADER ech+nnnn lfh0000 lfh+nnnn lfev60000 lfev6+nnnn lfctt_A0[u] lfctt_A8[u] lfctt_B0[u] lfctt_B8[u] lfctt_C0[u] STANDARD LOGOUT FRAME HEADER COMMON PAL EV6 SECTION (first 8 QWs Zeroed) SESF<63:32> = <39:32>= SESF<31:16> = Reserved(MBZ) (MBZ) Reserved(MBZ) SESF<15:0>= 0002(hex) Cchip CPUx D
5.4 Environmental Errors Captured by SRM If an environmental error occurs while the SRM console is running, a logout frame similar to Example 5–1 is sent to the console output device. The logout frame is preceded by the message “***unexpected system event through vector 680 on CPU n.” (usually CPU 0.) For register definitions see Appendix D.
P00>>>
*** unexpected system event through vector 680 on CPU 0
os_flags         0000000000000000
cchip_dirx       0004000000000000
tig_smir         0000000000000008
tig_cpuir        000000000000000f
tig_psir         0000000000000003
lm78_isr         0000000000000000
door_open        0000000000000040 ➊
temp_warning     0000000000000000
fan_ctrl_fault   0000000000000000
power_down_code  0000000000000000
reserved_1       0000000000000000
➊ This example shows a fan door closing event.
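If the console output is captured to a file, the logout frame can be parsed to pick out the registers that are nonzero. The Python sketch below is an illustration only. The register names and values are copied from the example above, the vector names come from the Table 5–2 rows shown in this chapter, and the function name parse_logout_frame is hypothetical. Interpreting individual register bits still requires the definitions in Appendix D.

```python
import re

CONSOLE_OUTPUT = """\
*** unexpected system event through vector 680 on CPU 0
os_flags 0000000000000000
cchip_dirx 0004000000000000
tig_smir 0000000000000008
tig_cpuir 000000000000000f
tig_psir 0000000000000003
lm78_isr 0000000000000000
door_open 0000000000000040
temp_warning 0000000000000000
fan_ctrl_fault 0000000000000000
power_down_code 0000000000000000
reserved_1 0000000000000000
"""

# Vector names from the Table 5-2 rows visible in this chapter.
VECTOR_NAMES = {
    "620": "System Correctable Error",
    "660": "System Uncorrectable Error",
    "680": "System Environmental Error",
}


def parse_logout_frame(text):
    """Extract the vector, CPU number, and register values from a frame."""
    header = re.search(r"vector (\w+) on CPU (\d+)", text)
    # A register is a name followed by exactly 16 hex digits.
    pairs = re.findall(r"(\w+)\s+([0-9a-fA-F]{16})\b", text)
    return {
        "vector": header.group(1) if header else None,
        "cpu": int(header.group(2)) if header else None,
        "registers": {name: int(value, 16) for name, value in pairs},
    }


frame = parse_logout_frame(CONSOLE_OUTPUT)
print(VECTOR_NAMES.get(frame["vector"], "unknown vector"), "on CPU", frame["cpu"])
for name, value in frame["registers"].items():
    if value:
        print(f"{name} = {value:#x}")
```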
5.5 Windows NT Error Logs The Display Error Frames selection of the AlphaBIOS Utilities menu allows you to view hardware error reports for systems running Windows NT. A report is generated if a fatal error or double error halt occurs. If the System Error Logging Software for Alpha kit is installed, you will be able to see the report in the system event log after the system has booted. Figure 5–6 New Error Frame Was Detected Window AlphaBIOS 5.
Fatal Error Halts Fatal error halts are single errors that occur when the operating system is running. Only one operating system fatal (OS fatal) error at a time can exist in flash ROM. When a new OS fatal error occurs, it replaces the old error in the flash. Double Error Halts Double error halts are conditions in which the processing of a fatal error triggers a second error.
Figure 5–7 Display Error Frames Screen
Displaying an Error Frame 1. To display the error frame, enter AlphaBIOS Setup and select the Utilities menu. 2. From the Utilities menu, select Display Error Frames…. If there is no error frame in the flash ROM, a screen with the message “No Error Frame in the flash ROM” is displayed. If there is an error frame, a screen similar to Figure 5–7 is displayed.
5.5.1 Viewing a Formatted Text-Style Error Frame Press the Enter key to view a formatted text-style error frame. The error source is also displayed. For example, the Fatal Error Frame in Figure 5–8 reports a “D-Stream Error, Uncorrectable ECC.”
You can browse the entire contents of an error log by using the scroll bar, as shown in Figure 5–9.
5.5.2 Viewing a Binary Dump of the Error Frame Press the F6 key to get a binary dump of the entire error frame.
5.5.3 Saving the Error Frame to the Floppy Press F10 to save the error frame to the floppy. For the formatted text style, an ASCII (text) file is generated. For the binary dump, a raw file is generated. If the same file name already exists on the floppy, a warning message is displayed. Press Enter to continue the save.
The OS fatal and double error halt files are named as follows. The is two digits.

Type of Error Frame              File Name
Fatal error frame (Binary)       FATALERR.BIN
Fatal error frame (ASCII)        FATALERR.TXT
Double error frame (Binary)      DBLERR.BIN
Double error frame (ASCII)       DBLERR.TXT

Figure 5–12 shows an example of a formatted text file.

Figure 5–12 Formatted Text File
Error Frame Type: Fatal Error Frame.
Date: 12/04/1998, Time: 03:15:46
D-Stream Error, Uncorrectable ECC.
Number of TLVs in header Wall-Clock Time (Tag) Wall-Clock Time (Length) Wall-Clock Time (String) DSR (Tag) DSR (Length) DSR (String) OS Version (Tag) OS Version (Length) OS Version (String) OS Build Number (Tag) OS Build Number (Length) OS Build Number (String) System Serial Num.(Tag) System Serial Num.(Length) System Serial Num.
5.5.4 Deleting an Error Frame Use the DEL key to delete the error frame from the flash ROM. If you delete a new error frame, a warning message is displayed, as shown in Figure 5–13. If you delete an old error frame, a message similar to that in Figure 5–14 is displayed. Press F10 to continue a deletion. When the deletion is complete, a “Delete Complete” message is displayed.
Figure 5–14 Deleting an Old Error Frame
Chapter 6 System Configuration and Setup This chapter describes how to configure and set up Compaq AlphaServer ES40 systems.
6.1 System Consoles System console programs are located in a flash ROM on the system motherboard. From the console interface, you can set up and boot the operating system, display the system configuration, and run diagnostics. For complete information on the SRM and AlphaBIOS consoles, see the Compaq AlphaServer ES40 User Interface Guide. Figure 6–1 AlphaBIOS Setup Screen AlphaBIOS Setup Display System Configuration... AlphaBIOS Upgrade... Hard Disk Setup... CMOS Setup... Network Setup...
SRM Console Systems running the Tru64 UNIX or OpenVMS operating systems are configured from the SRM console, a command-line interface (CLI). From the CLI you can enter commands to configure the system, view the system configuration, boot the system, and run ROM-based diagnostics. AlphaBIOS Console Systems running the Windows NT operating system are configured from the AlphaBIOS console, a menu interface.
6.1.1 Switching Between Consoles Under some circumstances, you may need to switch between the system consoles. For example, error frames for Windows NT systems are viewed from the AlphaBIOS console.
6.1.2 Selecting the Console and Display Device The SRM os_type environment variable determines which user interface (SRM or AlphaBIOS) is the final console loaded on a power-up or reset. The SRM console environment variable determines to which display device (VT-type terminal or VGA monitor) the console display is sent. Selecting the Console The os_type variable selects the console. Os_type is factory configured as follows: • For Windows NT, os_type is set to nt.
You can verify the display device with the SRM show console command and change the display device with the SRM set console command. If you change the display device setting, you must reset the system (with the Reset button or the init command) to put the new setting into effect. In the following example, the user displays the current console device (a graphics device) and then resets it to a serial device. After the system initializes, output will be displayed on the serial terminal.
6.1.3 Setting the Control Panel Message If you are running Tru64 UNIX or OpenVMS, you can create a customized message to be displayed on the operator control panel after startup self-tests and diagnostics have been completed. When the operating system is running, the control panel displays the console revision. It is useful to create a customized message if you have a number of systems and you want to identify each system by a node name.
6.2 Displaying the Hardware Configuration View the system hardware configuration for UNIX and OpenVMS systems from the SRM console. View a Windows NT hardware configuration from the AlphaBIOS console. It is useful to view the hardware configuration to ensure that the system recognizes all devices, memory configuration, and network connections. Displaying a Tru64 UNIX or OpenVMS Configuration Use the following SRM console commands to view the system configuration for UNIX or OpenVMS systems.
Displaying a Windows NT Hardware Configuration View a Windows NT configuration as follows: 1. From the AlphaBIOS Setup screen, select Display System Configuration and press Enter. 2. In the Display System Configuration screen, use the arrow keys to select the configuration category you want to see.
6.3 Setting Environment Variables for Tru64 UNIX or OpenVMS Environment variables pass configuration information between the console and the operating system. Their settings determine how the system powers up, boots the operating system, and operates. • To check the setting for a specific environment variable, enter the show envar command, where the name of the environment variable is substituted for envar.
set envar The set command sets or modifies the value of an environment variable. It can also be used to create a new environment variable if the name used is unique. Environment variables pass configuration information between the console and the operating system. Their settings determine how the system powers up, boots the operating system, and operates. The syntax is: set envar value envar The name of the environment variable to be modified. value The new value of the environment variable.
Table 6–1 SRM Environment Variables Used on ES40 Systems Variable Attributes 1 Description auto_action NV,W Action the console should take following an error halt or power failure. Defined values are: boot—Attempt bootstrap. halt—Halt, enter console I/O mode. restart—Attempt restart. If restart fails, try boot. bootdef_dev NV,W Device or device list from which booting is to be attempted when no path is specified. Set at factory to disk with factory-installed software; otherwise NULL.
Table 6–1 SRM Environment Variables Used on ES40 Systems (Continued) Variable Attributes Description boot_osflags (continued) NV,W boot_flags: The hexadecimal value of the bit number or numbers to set. To specify multiple boot flags, add the flag values (logical OR). 1—Bootstrap conversationally (enables you to modify SYSGEN parameters in SYSBOOT). 2—Map XDELTA to running system. 4—Stop at initial system breakpoint. 8—Perform a diagnostic bootstrap. 10—Stop at the bootstrap breakpoints.
Table 6–1 SRM Environment Variables Used on ES40 Systems (Continued) Variable Attributes Description D—Full dump; implies s as well. By default, if Tru64 UNIX crashes, it completes a partial memory dump. Specifying D forces a full dump at system crash. boot_osflags (continued) Common settings are a, autoboot, and Da, autoboot and create full dumps if the system crashes. com1_baud NV,W Sets the baud rate of the COM1 (MMJ) port. The default baud rate is 9600.
Table 6–1 SRM Environment Variables Used on ES40 Systems (Continued) Variable Attributes Description com1_modem com2_modem NV,W Used to tell the operating system whether a modem is present on the COM1 or COM2 ports, respectively On—Modem is present. Off—Modem is not present (default value). console NV Sets the device on which power-up output is displayed. Graphics—Sets the power-up output to be displayed at a VGA monitor or device connected to the VGA module.
Table 6–1 SRM Environment Variables Used on ES40 Systems (Continued) Variable Attributes ei*0_mode or ew*0_mode (continued) ei*0_protocols or ew*0_protocols Description twisted-pair— Sets the default device to 10BaseT (twisted-pair). NV Determines which network protocols are enabled for booting and other functions. mop—Sets the network protocol to MOP for systems using the OpenVMS operating system. bootp—Sets the network protocol to bootp for systems using the Tru64 UNIX operating system.
Table 6–1 SRM Environment Variables Used on ES40 Systems (Continued) Variable Attributes Description language NV Specifies the console keyboard layout. The default is English (American). memory_test NV Specifies the extent to which memory will be tested on Tru64 UNIX. The options are: Full—Full memory test will be run. Required for OpenVMS. Partial—First 256 MB of memory will be tested. None—Only first 32 MB will be tested.
Table 6–1 SRM Environment Variables Used on ES40 Systems (Continued) Variable Attributes Description pk*0_fast NV Enables fast SCSI devices on a SCSI controller to perform in standard or fast mode. 0—Sets the default speed for devices on the controller to standard SCSI. If a controller is set to standard SCSI mode, both standard and fast SCSI devices will perform in standard mode. 1—Sets the default speed for devices on the controller to fast SCSI mode.
Table 6–1 SRM Environment Variables Used on ES40 Systems (Continued) Variable Attribute Description sys_serial_num NV Sets the system serial number, which is then propagated to all FRUs that have EEPROMs. The serial number can be read by the operating system. tt_allow_login NV Enables or disables login to the SRM console firmware on alternative console ports. 0—Disables login on alternative console ports. 1—Enables login on alternative console ports (default setting).
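The boot_flags entry in Table 6–1 says that multiple OpenVMS boot flags are specified by adding (logically OR-ing) their hexadecimal values. The following Python sketch illustrates only that arithmetic. The dictionary keys are illustrative labels invented for this example, not console keywords, and only the five flag values listed in the table are included.

```python
# Flag values exactly as listed in Table 6-1 (hexadecimal).
# The key names are illustrative labels, not SRM keywords.
BOOT_FLAGS = {
    "conversational": 0x1,         # 1: bootstrap conversationally
    "map_xdelta": 0x2,             # 2: map XDELTA to running system
    "initial_breakpoint": 0x4,     # 4: stop at initial system breakpoint
    "diagnostic_bootstrap": 0x8,   # 8: perform a diagnostic bootstrap
    "bootstrap_breakpoints": 0x10, # 10: stop at the bootstrap breakpoints
}


def combine_boot_flags(names):
    """OR the selected flag values together and return the hex string."""
    value = 0
    for name in names:
        value |= BOOT_FLAGS[name]
    return format(value, "x")


if __name__ == "__main__":
    # Conversational boot plus bootstrap breakpoints: 1 OR 10 (hex) = 11 (hex)
    print(combine_boot_flags(["conversational", "bootstrap_breakpoints"]))
```

Because each flag occupies its own bit, adding the values and OR-ing them give the same result, which is why the table describes the operation both ways.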
6.4 Setting Up a System for Windows NT Before you install and boot Windows NT for the first time, set the system date and time and set up the hard disks. Optionally, you can set the level of memory testing and set system password protection. If you are installing Windows NT from CD-ROM, use the AlphaBIOS CMOS Setup screen and the Hard Disk Setup screen to set up your system. Use the Advanced CMOS Setup screen to set the level of memory testing and to set password protection, if desired.
6.4.1 Setting the Date and Time Set the date and time from the CMOS Setup screen.
Figure 6–4 CMOS Setup Screen (screen fields: Date, Time, Floppy Drive A, Floppy Drive B, Keyboard, Auto Start, Auto Start Count; function keys: F1=Help, F3=Color, F6=Advanced, F7=Defaults, ESC=Discard Changes, F10=Save Changes)
1. Start AlphaBIOS. 2.
6.4.2 Setting Up the Hard Disk Set up the hard disk from the Hard Disk Setup screen.
Figure 6–5 Hard Disk Setup Screen (screen lists Disk 0, Disk 1, and Disk 2 on controller NCRC8xx #0 at SCSI IDs 0, 1, and 2, with their FAT and NTFS partitions; function keys: INSERT=New, DEL=Delete, F6=Format, F7=Express, ESC=Exit)
Set the date and time as described in Section 6.4.
6.4.3 Setting the Level of Memory Testing Set the level of memory testing that occurs when the system is power cycled from the Advanced CMOS Setup screen.
Figure 6–6 Advanced CMOS Setup Screen (screen fields as shown: PCI Parity Checking: Disabled; Power-up Memory Test: Partial; AlphaBIOS Password Option: Disabled; SCSI BIOS Emulation: Enabled For All; Console Selection: Windows NT Console (AlphaBIOS))
6.5 Setting Automatic Booting Windows NT systems are factory set to auto start; UNIX and OpenVMS systems are factory set to halt in the SRM console. You can change these defaults, if desired.
6.5.1 Windows NT and Auto Start On Windows NT systems the Auto Start option is enabled by default, which causes the primary operating system to start automatically whenever the machine is power cycled or reset. If more than one version of Windows NT is installed (for example, Version 4.0 and Version 5.0), the version selected as the primary operating system starts automatically if Auto Start is enabled.
6.5.2 Setting Tru64 UNIX or OpenVMS Systems to Auto Start The SRM auto_action environment variable determines the default action the system takes when the system is power cycled, reset, or experiences a failure. On systems that are factory configured for UNIX or OpenVMS, the factory setting for auto_action is halt. The halt setting causes the system to stop in the SRM console. You must then boot the operating system manually. For maximum system availability, auto_action can be set to boot or restart.
6.6 Changing the Default Boot Device It is not necessary to modify the boot file setting for Windows NT. You can change the default boot device for UNIX or OpenVMS with the set bootdef_dev command. Windows NT AlphaBIOS boots Windows NT from the operating system loader program, OSLOADER.EXE. A boot file setting is created along with the operating system selection during Windows NT setup, and this setting is usually not modified by the user. You can, however, modify this setting, if necessary.
6.7 Running AlphaBIOS-Based Utilities Depending upon the type of hardware you have, you may have to run hardware configuration utilities. Hardware configuration diskettes are shipped with your system or with options that you order.
6.7.1 Running Utilities from a VGA Monitor If you are running Windows NT, no terminal setup is required for running utilities.
Figure 6–7 AlphaBIOS Utilities Menu (Utilities submenu items: Display Error Frames..., OS Selection Setup..., Run Maintenance Program...)
Running a Utility from a VGA Monitor 1. Start the AlphaBIOS console. 2.
4. In the Run Maintenance Program dialog box, type the name of the program to be run in the Program Name field. Then Tab to the Location list box, and select the hard disk partition, floppy disk, or CD-ROM drive from which to run the program. 5. Press Enter to execute the program.
Figure 6–8 Run Maintenance Program Dialog Box (dialog displayed over the AlphaBIOS Setup menu) Program Name: arccf.
6.7.2 Setting Up Serial Mode Serial mode requires a VT320 or higher (or equivalent) terminal. To run AlphaBIOS and maintenance programs in serial mode, set the console environment variable to serial and enter the init command to reset the system. Set up the serial terminal as follows: 1. From the General menu, set the terminal mode to VTxxx mode, 8-bit controls. 2. From the Comm menu, set the character format to 8 bit, no parity, and set receive XOFF to 128 or greater.
6.7.3 Running Utilities from a Serial Terminal Utilities are run from a serial terminal the same way as from a VGA monitor. The menus are the same, but some key mappings are different.
1. Issue the alphabios command at the P00>>> prompt to start the AlphaBIOS console. 2. From the AlphaBIOS Boot screen, press F2. 3. From AlphaBIOS Setup, select Utilities, and select Run Maintenance Program from the sub-menu that is displayed. Press Enter. 4. In the Run Maintenance Program dialog box, type the name of the program to be run in the Program Name field. Then tab to the Location list box, and select the hard disk partition, floppy disk, or CD-ROM drive from which to run the program. 5.
6.7.4 Running the RAID Standalone Configuration Utility The RAID Standalone Configuration Utility is used to set up RAID disk drives and logical units. The Standalone Utility is run from the AlphaBIOS Utilities menu. The system supports KZPAC-xx Ultra SCSI RAID controllers. The KZPAC-xx kit includes the controller, RAID Array 230/Plus Subsystem software, and documentation. 1. Start AlphaBIOS Setup. If the system is in the SRM console, issue the alphabios command.
6.8 Setting SRM Security The set password and set secure commands set SRM security. The login command turns off security for the current session. The clear password command returns the system to user mode. The SRM console has two modes, user mode and secure mode. • User mode allows you to use all SRM console commands. User mode is the default mode. • Secure mode allows you to use only the boot and continue commands.
➊ Setting a password. If a password has not been set and the set password command is issued, the console prompts for a password and verification. The password and verification are not echoed. ➋ Changing a password. If a password has been set and the set password command is issued, the console prompts for the new password and verification, then prompts for the old password. The password is not changed if the validation password entered does not match the existing password stored in NVRAM.
If You Forget the Password If you forget the current password, use the login command in conjunction with the control panel Halt button to clear the password, as follows: 1. Enter the login command: P00>>> login 2. When prompted for the password, press the Halt button to the latched position and then press the Return (or Enter) key. 3. Press the Halt button to release the halt. The password is now cleared and the console cannot be put into secure mode unless you set a new password.
6.9 Setting Windows NT Security Password protection provides two levels of security for a Windows NT system: setup protection and startup protection. When system setup protection is enabled, a password is required to start AlphaBIOS Setup. When startup password protection is enabled, a password is required before the system initializes.
Startup password protection provides more comprehensive protection than setup password protection because with startup protection the system cannot be used at all until the correct password is entered. To enable password protection: 1. Start AlphaBIOS Setup, select CMOS Setup, and press Enter. 2. In the CMOS Setup screen, press F6 to enter Advanced CMOS Setup. 3.
6.10 Configuring Devices Become familiar with the configuration requirements for CPUs and memory before removing or replacing those components. See Chapter 8 for removal and replacement procedures. 6.10.
Figure 6–10 CPU Slot Locations (Tower) (slot labels as printed: CPU 3, CPU 2, CPU 1, CPU 0)
CPU Configuration Rules
1. A CPU must be installed in slot 0. The system will not power up without a CPU in slot 0.
2. CPU cards must be installed in numerical order, starting at CPU slot 0. The slots are populated from left to right on a pedestal or rackmount system and from bottom to top on a tower. See Figure 6–9 and Figure 6–10.
3. CPUs must be identical in speed and cache size.
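The CPU configuration rules above lend themselves to a quick pre-installation check. The Python sketch below is one possible way to encode them. The Cpu class, its field names, and the sample speed and cache values are assumptions made for illustration and do not come from any Compaq tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Cpu:
    """Hypothetical description of one CPU card."""
    speed_mhz: int
    cache_mb: int


def check_cpu_config(slots: Dict[int, Cpu]) -> List[str]:
    """Return a list of rule violations; an empty list means the layout is valid.

    `slots` maps a populated CPU slot number (0-3) to the card installed there.
    """
    problems = []
    # Rule: a CPU must be installed in slot 0.
    if 0 not in slots:
        problems.append("slot 0 is empty")
    # Rule: cards are installed in numerical order starting at slot 0,
    # so the populated slots must be exactly 0, 1, ..., n-1.
    if sorted(slots) != list(range(len(slots))):
        problems.append("slots are not filled in numerical order from 0")
    # Rule: all CPUs must be identical in speed and cache size.
    if len(set(slots.values())) > 1:
        problems.append("CPUs differ in speed or cache size")
    return problems


if __name__ == "__main__":
    card = Cpu(speed_mhz=500, cache_mb=4)  # illustrative values only
    print(check_cpu_config({0: card, 1: card}))  # valid layout
    print(check_cpu_config({0: card, 2: card}))  # slot 1 skipped
```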
6.10.2 Memory Configuration Become familiar with the rules for memory configuration before adding DIMMs to the system. For the Model 2 system, do not mix stacked and unstacked DIMMs within an array. Refer to Figure 6–12 or Figure 6–13 and observe the following rules for installing DIMMs. • You can install up to 16 DIMMs or up to 32 DIMMs, depending on the system model. • A set consists of 4 DIMMs. You must install all 4 DIMMs. • Fill sets in numerical order.
DIMM Information for Model 2 Systems DIMMs are manufactured with two types of SDRAMs, stacked and unstacked (see Figure 6–11). Stacked DIMMs provide twice the capacity of unstacked DIMMs, and, at the time of shipment, are the highest capacity DIMMs offered by Compaq. The system may have either stacked or unstacked DIMMs. You can mix stacked and unstacked DIMMs within the system, but not within an array.
Figure 6–12 Memory Configuration (Pedestal/Rack) (figure shows the positions of Sets 0–7 on MMB 0 through MMB 3)
Array 0: Sets 0 & 4
Array 1: Sets 1 & 5
Array 2: Sets 2 & 6
Array 3: Sets 3 & 7
Figure 6–13 Memory Configuration (Tower) (figure shows the positions of Sets 0–7 on MMB 0 through MMB 3)
Array 0: Sets 0 & 4
Array 1: Sets 1 & 5
Array 2: Sets 2 & 6
Array 3: Sets 3 & 7
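The memory rules in Section 6.10.2 (four DIMMs per set, sets filled in numerical order, and no mixing of stacked and unstacked DIMMs within an array on Model 2 systems) can be combined with the array-to-set pairing shown in Figures 6–12 and 6–13. The following Python sketch is a hypothetical checker under those stated rules only. The function name, the "stacked"/"unstacked" strings, and the max_sets parameter (4 for a 16-DIMM system, 8 for a 32-DIMM system) are assumptions for illustration.

```python
from typing import Dict, List

DIMMS_PER_SET = 4
# Array n is made up of Sets n and n+4, per Figures 6-12 and 6-13.
ARRAYS = {0: (0, 4), 1: (1, 5), 2: (2, 6), 3: (3, 7)}


def check_memory_config(sets: Dict[int, List[str]], max_sets: int = 8) -> List[str]:
    """Return rule violations for a proposed DIMM layout.

    `sets` maps a set number to the kinds of DIMMs installed in it,
    each kind being "stacked" or "unstacked".
    """
    problems = []
    for number, dimms in sorted(sets.items()):
        if not 0 <= number < max_sets:
            problems.append(f"set {number} does not exist on this model")
        # A set consists of 4 DIMMs and all 4 must be installed.
        if len(dimms) != DIMMS_PER_SET:
            problems.append(f"set {number} has {len(dimms)} DIMMs, needs 4")
    # Sets are filled in numerical order: 0, 1, 2, ...
    if sorted(sets) != list(range(len(sets))):
        problems.append("sets are not filled in numerical order")
    # Stacked and unstacked DIMMs must not be mixed within an array.
    for array, members in ARRAYS.items():
        kinds = {kind for s in members for kind in sets.get(s, [])}
        if len(kinds) > 1:
            problems.append(f"array {array} mixes stacked and unstacked DIMMs")
    return problems


if __name__ == "__main__":
    layout = {0: ["unstacked"] * 4, 1: ["stacked"] * 4}
    print(check_memory_config(layout))  # no violations expected
```

Note that Sets 0 and 1 belong to different arrays, so a system may hold unstacked DIMMs in one and stacked DIMMs in the other without breaking the mixing rule.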
6.10.
Figure 6–15 PCI Slot Locations (Tower) (10-slot system: slots 1–10; 6-slot system: slots 1–3 and 8–10)
The PCI slots are split across two independent 64-bit, 33 MHz PCI buses: PCI0 and PCI1. These buses correspond to Hose 0 and Hose 1 in the system logical configuration. The slots on each bus are listed below.

System Variant      Slots on PCI 0    Slots on PCI 1
Six-slot system     1–3               8–10
Ten-slot system     1–4               5–10

Some PCI options require drivers to be installed and configured.
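When reading a logical configuration display, it can help to translate a physical slot number into its PCI bus (hose). The Python sketch below encodes the slot table above. The function name and the "six-slot"/"ten-slot" keys are illustrative choices, not identifiers used by the console.

```python
# Slot ranges per PCI bus, taken from the slot table in this section.
# The bus number equals the hose number (PCI0 = Hose 0, PCI1 = Hose 1).
PCI_BUS_SLOTS = {
    "six-slot": {0: range(1, 4), 1: range(8, 11)},
    "ten-slot": {0: range(1, 5), 1: range(5, 11)},
}


def pci_bus_for_slot(variant: str, slot: int) -> int:
    """Return the PCI bus (hose) number that serves a physical slot."""
    for bus, slots in PCI_BUS_SLOTS[variant].items():
        if slot in slots:
            return bus
    raise ValueError(f"slot {slot} is not present on a {variant} system")


if __name__ == "__main__":
    print(pci_bus_for_slot("ten-slot", 5))  # slot 5 is on PCI1 (Hose 1)
    print(pci_bus_for_slot("six-slot", 3))  # slot 3 is on PCI0 (Hose 0)
```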
6.10.
The system can have the following power configurations: Single Power Supply. A single power supply is provided with entry-level systems, such as a system configured with: • One or two CPUs • One storage cage Two Power Supplies. Two power supplies are required if the system has more than two CPUs or if the system has a second storage cage. Redundant Power Supply. If one power supply fails, the redundant supply provides power and the system continues to operate normally.
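The single-supply and two-supply rules above reduce to a simple calculation. The following Python sketch computes only the minimum number of supplies needed to run a configuration and does not model redundancy. The function name and the accepted ranges (one to four CPUs, one or two storage cages) are assumptions inferred from this chapter.

```python
def minimum_power_supplies(cpus: int, storage_cages: int) -> int:
    """Return the minimum number of power supplies for a configuration.

    One supply covers up to two CPUs and one storage cage; two supplies
    are required for more than two CPUs or for a second storage cage.
    """
    if not 1 <= cpus <= 4:
        raise ValueError("expected 1 to 4 CPUs")
    if not 1 <= storage_cages <= 2:
        raise ValueError("expected 1 or 2 storage cages")
    return 2 if cpus > 2 or storage_cages > 1 else 1


if __name__ == "__main__":
    print(minimum_power_supplies(cpus=2, storage_cages=1))  # entry-level system
    print(minimum_power_supplies(cpus=3, storage_cages=1))  # more than two CPUs
```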
6.11 Switching Between Operating Systems The system supports three operating systems. You can install Tru64 UNIX, OpenVMS, or Windows NT. You can also switch from one operating system to another by removing the disk for the operating system that is currently installed and installing the disk for the operating system you want to run. CAUTION: The file structures of the three operating systems are incompatible.
1. Shut down the operating system and power off the system. Unplug the power cord from each power supply. 2. Remove the enclosure panels and system covers as described in Chapter 8. 3. Remove any options that are not supported on Windows NT and replace them with supported options. 4. Remove the UNIX or OpenVMS operating system disk and insert the Windows NT system disk. 5. Plug in the power supplies and power up the system. 6.
6.11.2 Switching from Windows NT to UNIX or OpenVMS Follow this procedure if you have already installed Windows NT and want to switch to UNIX or OpenVMS. CAUTION: Before switching operating systems, make a note of the boot path and location of the system disk (controller, SCSI ID number, and so on) of the operating system you are removing so that you can restore that operating system at a later date. 1. Shut down the operating system and power off the system. Unplug the power cord from each power supply.
Chapter 7 Using the Remote Management Console You can manage the system through the remote management console (RMC). The RMC is implemented through an independent microprocessor that resides on the system motherboard. The RMC also provides access to the repository for all error information in the system. This chapter explains the operation and use of the RMC.
7.1 RMC Overview The remote management console provides a mechanism for monitoring the system (voltages, temperatures, and fans) and manipulating it on a low level (reset, power on/off, halt). It also provides functionality to read and write configuration and error log information to FRU error log devices. The RMC performs monitoring and control functions to ensure the successful operation of the system.
The RMC logic is implemented using an 8-bit microprocessor, PIC17C44, as the primary control device. The firmware code is resident within the microprocessor and in flash memory. If the RMC firmware should ever become corrupted or obsolete, you can update it manually using the Loadable Firmware Update Utility. See Chapter 3 for details. The microprocessor can also communicate with the system power control logic to turn on or turn off power to the rest of the system.
7.2 Operating Modes The RMC can be configured to manage different data flow paths defined by the com1_mode environment variable. In Through mode (the default), all data and control signals flow from the system COM1 port through the RMC to the active external port. You can also set bypass modes so that the signals partially or completely bypass the RMC. The com1_mode environment variable can be set from either SRM or the RMC. See Section 7.6.1.
Through Mode Through mode is the default operating mode. The RMC routes every character of data between the internal system COM1 port and the active external port, either the local COM1 serial port (MMJ) or the 9-pin modem port. If a modem is connected, the data goes to the modem. The RMC filters the data for a specific escape sequence. If it detects the escape sequence, it connects to the RMC CLI. Figure 7–1 illustrates the data flow in Through mode.
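Conceptually, the Through-mode behavior described above is a pass-through filter that watches the character stream for one specific sequence. The Python sketch below illustrates that idea with a sliding window. It is not the RMC firmware algorithm, and the escape sequence used in the demonstration is a made-up placeholder, since this section does not state the actual sequence.

```python
from collections import deque


class EscapeDetector:
    """Watch a byte stream for an escape sequence while data passes through."""

    def __init__(self, sequence: bytes):
        if not sequence:
            raise ValueError("escape sequence must not be empty")
        self.sequence = sequence
        # Sliding window holding the most recent len(sequence) bytes.
        self.window = deque(maxlen=len(sequence))

    def feed(self, data: bytes) -> bool:
        """Consume a chunk of data; return True if the sequence completed in it."""
        detected = False
        for byte in data:
            self.window.append(byte)
            if bytes(self.window) == self.sequence:
                detected = True
                self.window.clear()  # start fresh after a match
        return detected


if __name__ == "__main__":
    # Placeholder sequence for illustration only.
    detector = EscapeDetector(b"\x1b\x1bxyz")
    print(detector.feed(b"abc\x1b"))   # sequence not yet complete
    print(detector.feed(b"\x1bxyz"))   # sequence completes across two chunks
```

Keeping the window between calls lets the detector recognize a sequence that is split across separate reads, which is the usual situation on a serial line.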
7.2.1 Bypass Modes For modem connection, you can set the operating mode so that data and control signals partially or completely bypass the RMC. The bypass modes are Snoop, Soft Bypass, and Firm Bypass.
Figure 7–2 shows the data flow in the bypass modes. Note that the internal system COM1 port is connected directly to the modem port. NOTE: You can connect a serial terminal to the modem port in any of the bypass modes. The local terminal is still connected to the RMC and can still connect to the RMC CLI to switch the COM1 mode if necessary. Snoop Mode In Snoop mode data partially bypasses the RMC.
After downloading binary files, you can set the com1_mode environment variable from the SRM console to switch back to Snoop mode or other modes for accessing the RMC, or you can hang up the current modem session and reconnect it. Firm Bypass Mode In Firm Bypass mode all data and control signals are routed directly between the system COM1 port and the external modem port. The RMC does not configure or monitor the modem.
7.3 Terminal Setup You can use the RMC from a modem hookup or the serial terminal connected to the system. As shown in Figure 7–3, a modem is connected to the dedicated 9-pin modem port ➊ and a terminal is connected to the COM1 serial port/terminal port (MMJ) ➋.
7.4 Connecting to the RMC CLI You type an escape sequence to connect to the RMC CLI. You can connect to the CLI from any of the following: a modem, the local serial console terminal, the local VGA monitor, or the system. The “system” includes the operating system, SRM, AlphaBIOS, or an application. • You can connect to the RMC CLI from the local terminal regardless of the current operating mode. • You can connect to the RMC CLI from the modem if the RMC is in Through mode, Snoop mode, or Local mode.
Connecting from the Local VGA Monitor To connect to the RMC CLI from the local VGA monitor, the console environment variable must be set to graphics and the SRM console must be running. Invoke the SRM console and enter the rmc command. P00>>> rmc You are about to connect to the Remote Management Console. Use the RMC reset command or press the front panel reset button to disconnect and to reload the SRM console.
7.5 SRM Environment Variables for COM1 Several SRM environment variables allow you to set up the COM1 serial port (MMJ) for use with the RMC. You may need to set the following environment variables from the SRM console, depending on how you decide to set up the RMC. com1_baud Sets the baud rate of the COM1 serial port and the modem port. The default is 9600. com1_flow Specifies the flow control on the serial port. The default is software.
7.6 RMC Command-Line Interface The remote management console supports setup commands and commands for managing the system. The RMC commands are listed below.

clear {alert, port}
dep
disable {alert, remote}
dump
enable {alert, remote}
env
halt {in, out}
hangup
help or ?
power {on, off}
quit
reset
send alert
set {alert, com1_mode, dial, escape, init, logout, password, user}
status

The commands for setting up and using the RMC are described in the following sections. The dep command is reserved.
Command Conventions Observe the following conventions for entering RMC commands: • Enter enough characters to distinguish the command. NOTE: The reset and quit commands are exceptions. You must enter the entire string for these commands to work. • For commands consisting of two words, enter the entire first word and at least one letter of the second word. For example, you can enter disable a for disable alert. • For commands that have parameters, you are prompted for the parameter.
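The abbreviation conventions above can be illustrated with a small resolver built from the command list in Section 7.6. This Python sketch is a model of the stated rules, not the RMC parser. Two points are assumptions: an abbreviation that matches more than one command is rejected, and a one-word abbreviation also competes with the first words of two-word commands.

```python
from typing import Optional

SINGLE = ["dep", "dump", "env", "hangup", "help", "?", "quit", "reset", "status"]
TWO_WORD = {
    "clear": ["alert", "port"],
    "disable": ["alert", "remote"],
    "enable": ["alert", "remote"],
    "halt": ["in", "out"],
    "power": ["on", "off"],
    "send": ["alert"],
    "set": ["alert", "com1_mode", "dial", "escape", "init",
            "logout", "password", "user"],
}
# reset and quit must be typed in full.
EXACT_ONLY = {"reset", "quit"}


def resolve(line: str) -> Optional[str]:
    """Expand an abbreviated RMC command, or return None if it is not unique."""
    tokens = line.split()
    if len(tokens) == 1:
        word = tokens[0]
        if word in SINGLE:
            return word
        matches = [c for c in SINGLE
                   if c.startswith(word) and c not in EXACT_ONLY]
        matches += [c for c in TWO_WORD if c.startswith(word)]
        if len(matches) == 1 and matches[0] in SINGLE:
            return matches[0]
        return None
    if len(tokens) == 2:
        first, second = tokens
        if first not in TWO_WORD:  # the first word must be entered in full
            return None
        matches = [s for s in TWO_WORD[first] if s.startswith(second)]
        if len(matches) == 1:
            return f"{first} {matches[0]}"
    return None


if __name__ == "__main__":
    print(resolve("disable a"))  # expands to the full two-word command
    print(resolve("res"))        # rejected: reset must be typed in full
```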
7.6.1 Defining the COM1 Data Flow Use the set com1_mode command from SRM or RMC to define the COM1 data flow paths. You can set com1_mode to one of the following values: through All data passes through RMC and is filtered for the escape sequence. This is the default. snoop Data partially bypasses RMC, but RMC taps into the data lines and listens passively for the escape sequence. soft_bypass Data bypasses RMC, but RMC switches automatically into Snoop mode if loss of carrier occurs.
7.6.2 Displaying the System Status The RMC status command displays the current RMC settings. Table 7–1 explains the status fields. Example 7–2 status RMC> status PLATFORM STATUS On-Chip Firmware Revision: V1.0 Flash Firmware Revision: V1.
Table 7–1 Status Command Fields

Field                          Meaning
On-Chip Firmware Revision:     Revision of RMC firmware on the microcontroller.
Flash Firmware Revision:       Revision of RMC firmware in flash ROM.
Server Power:                  ON = System is on. OFF = System is off.
System Halt:                   Asserted = System has been halted. Deasserted = Halt has been released.
RMC Power Control:             ON = System has powered on from RMC. OFF = System has powered off from RMC.
Escape Sequence:               Current escape sequence for access to RMC console.
7.6.3 Displaying the System Environment The RMC env command provides a snapshot of the system environment. Example 7–3 env RMC> env System Hardware Monitor Temperature (warnings at 45.0°C, power-off at 50.0°C) CPU0: 26.0°C Zone0: 29.0°C Fan RPM Fan1: 2295 Fan4: 2235 CPU1: 26.0°C Zone1: 30.0°C Fan2: 2295 Fan5: OFF CPU2: 27.0°C CPU3: 26.0°C Zone2: 31.
➊ ➋ CPU temperature. In this example four CPUs are present. ➌ Fan RPM. With the exception of Fan 5, all fans are powered as long as the system is powered on. Fan 5 is OFF unless Fan 6 fails. ➍ The normal power supply status is either OK (system is powered on) or OFF (system is powered off or the power supply cord is not plugged in). FAIL indicates a problem with a supply. ➎ CPU CORE voltage and CPU I/O voltage.
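If env output is captured to a log, the temperature readings can be screened against the limits printed in the Temperature header of Example 7–3 (warnings at 45.0°C, power-off at 50.0°C). The Python sketch below is a hypothetical helper. The regular expression assumes readings appear in the "name: value°C" form shown in the example, and treating a reading equal to a limit as having reached it is an assumption.

```python
import re

# Limits as printed in the Example 7-3 header; other systems may differ.
WARNING_C = 45.0
POWER_OFF_C = 50.0

# Matches readings such as "CPU0: 26.0°C" or "Zone1: 30.0°C".
READING = re.compile(r"(\w+):\s*(\d+(?:\.\d+)?)°C")


def classify_temperatures(text: str) -> dict:
    """Map each sensor name found in `text` to "ok", "warning", or "power-off"."""
    result = {}
    for sensor, value in READING.findall(text):
        temperature = float(value)
        if temperature >= POWER_OFF_C:
            state = "power-off"
        elif temperature >= WARNING_C:
            state = "warning"
        else:
            state = "ok"
        result[sensor] = state
    return result


if __name__ == "__main__":
    sample = "CPU0: 26.0°C Zone0: 29.0°C CPU1: 26.0°C Zone1: 30.0°C"
    print(classify_temperatures(sample))
```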
7.6.4 Dumping DPR Data The dump command dumps unformatted data from DPR locations 0–3FFF hex. The information might be useful for system troubleshooting. Use the DPR address table in Appendix C to analyze the data.
➊ ➋ DPR address ➌ Bytes 10:15 are the time stamp. See Appendix C for the meaning of other locations. Number of bytes dumped (in hex). In the example the dump command dumps EF bytes from address 10. The dump command allows you to dump data from the DPR. You can use this command locally or remotely if you are not able to access the SRM console because of a system crash.
7.6.5 Power On and Off, Reset, and Halt The RMC power {on, off}, halt {in, out}, and reset commands perform the same functions as the buttons on the operator control panel. Power On and Power Off The RMC power on command powers the system on, and the power off command powers the system off. The Power button on the OCP, however, has precedence. • If the system has been powered off with the Power button, the RMC cannot power the system on.
Halt In and Halt Out The halt in command halts the system. The halt out command releases the halt. When you issue either the halt in or halt out command, the terminal exits RMC and reconnects to the server’s COM1 port. Example 7–6 halt in/out
RMC> halt in
Returning to COM port
RMC> halt out
Returning to COM port
The halt out command cannot release the halt if the Halt button is latched in.
7.6.6 Configuring Remote Dial-In Before you can dial in through the RMC modem port or enable the system to call out in response to system alerts, you must configure RMC for remote dial-in. Connect your modem to the 9-pin modem port and turn it on. Connect to the RMC CLI from either the local serial terminal or the local VGA monitor to set up the parameters.
➊ Sets the password that is prompted for at the beginning of a modem session. The string cannot exceed 14 characters and is not case sensitive. For security, the password is not echoed on the screen. When prompted for verification, type the password again. ➋ Sets the initialization string. The string is limited to 31 characters and can be modified depending on the type of modem used.
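The length limits and the case rule described above can be checked before the values are typed at the RMC prompt. The following Python sketch is a hedged illustration: all function names are hypothetical, and rejecting an empty password is an assumption that the text does not state.

```python
MAX_PASSWORD_LENGTH = 14      # limit stated for the modem-session password
MAX_INIT_STRING_LENGTH = 31   # limit stated for the modem initialization string


def password_is_valid(password: str) -> bool:
    """True if the password fits the 14-character limit.

    Assumption: an empty password is rejected.
    """
    return 0 < len(password) <= MAX_PASSWORD_LENGTH


def passwords_match(entered: str, stored: str) -> bool:
    """Compare two passwords without regard to case, as the text describes."""
    return entered.casefold() == stored.casefold()


def init_string_is_valid(init_string: str) -> bool:
    """True if the initialization string fits the 31-character limit."""
    return len(init_string) <= MAX_INIT_STRING_LENGTH
```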
7.6.7 Configuring Dial-Out Alert When you are not monitoring the system from a modem connection, you can use the RMC dial-out alert feature to remain informed of system status. If dial-out alert is enabled, and the RMC detects alarm conditions within the managed system, it can call a preset pager number. You must configure remote dial-in for the dial-out feature to be enabled. See Section 7.6.6.
The elements of the dial string and alert string are shown in Table 7–2. Paging services vary, so you need to become familiar with the options provided by the paging service you will be using. The RMC supports only numeric messages. ➊ Sets the string to be used by the RMC to dial out when an alert condition occurs. The dial string must include the appropriate modem commands to dial the number. ➋ Sets the alert string, typically the phone number of the modem connected to the remote system.
Table 7–2 Elements of Dial String and Alert String Dial String The dial string is case sensitive. The RMC automatically converts all alphabetic characters to uppercase. ATXDT AT = Attention. X = Forces the modem to dial “blindly” (not seek the dial tone). Enter this character if the dial-out line modifies its dial tone when used for services such as voice mail. D = Dial T = Tone (for touch-tone) 9, The number for an outside line (in this example, 9).
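The dial-string elements listed in Table 7–2 can be assembled as in the following Python sketch. It is only an illustration of how the pieces fit together: the function name and the pager number used in the usage note are hypothetical, and the comma after the outside-line digit is copied from the table entry as shown.

```python
def build_dial_string(number: str, outside_line: str = "9",
                      blind_dial: bool = True) -> str:
    """Assemble a dial string from the Table 7-2 elements (hypothetical helper).

    AT = attention, X = dial without waiting for a dial tone,
    D = dial, T = tone dialing.
    """
    prefix = "AT" + ("X" if blind_dial else "") + "DT"
    line_access = f"{outside_line}," if outside_line else ""
    # The text says the RMC converts alphabetic characters to uppercase,
    # so the sketch does the same.
    return (prefix + line_access + number).upper()
```

For example, build_dial_string("5550100") yields ATXDT9,5550100, where 5550100 is a made-up number used only for illustration.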
7.6.8 Resetting the Escape Sequence The RMC set escape command sets a new escape sequence. The new escape sequence can be any character string, not to exceed 14 characters. A typical sequence consists of two or more control characters. It is recommended that control characters be used in preference to ASCII characters. Use the status command to verify the new escape sequence before exiting the RMC. The following example consists of two instances of the Esc key and the letters “FUN.
7.7 Resetting the RMC to Factory Defaults If the non-default RMC escape sequence has been lost or forgotten, RMC must be reset to factory settings to restore the default escape sequence. Figure 7–4 RMC Jumpers (Default Positions) 1 2 3 J24 J25 J26 J31 1 2 J3 J2 J1 PK0211 NOTE: J1, J2, and J3 are reserved.
The following procedure restores the default settings: 1. Shut down the operating system and press the Power button on the operator control panel to the OFF position. 2. Unplug the power cord from each power supply. Wait until the +5V Aux LEDs on the power supplies go off before proceeding. 3. Remove enclosure panels as described in Chapter 8. 4. Remove the system card cage cover and fan cover from the system chassis, as described in Chapter 8. 5. Remove CPU 1 as described in Chapter 8. 6.
7.8 Troubleshooting Tips Table 7–3 lists possible causes and suggested solutions for symptoms you might see. Table 7–3 RMC Troubleshooting Symptom Possible Cause Suggested Solution You cannot connect to the RMC CLI from the modem. The RMC may be in Soft Bypass or Firm Bypass mode. Issue the show com1_mode command from SRM and change the setting if necessary. If in Soft Bypass mode, you can disconnect the modem session and reconnect it. The terminal cannot communicate with the RMC correctly.
Table 7–3 RMC Troubleshooting (Continued) Symptom Possible Cause Suggested Solution RMC will not answer when modem is called. On AC power-up, RMC defers initializing the modem for 30 seconds to allow the modem to complete its internal diagnostics and initializations. Wait 30 seconds after powering up the system and RMC before attempting to dial in. After the system is powered up, the COM1 port seems to hang or you seem to be unable to execute RMC commands.
Chapter 8 FRU Removal and Replacement This chapter describes the procedures for removing and replacing FRUs on Compaq AlphaServer ES40 systems. Unless otherwise specified, install a FRU by reversing the steps shown in the removal procedures. NOTE: If you are installing or replacing CPU cards, memory DIMMs, or PCI cards, become familiar with the location of the card slots and configuration rules. See Chapter 6. CAUTION: Static electricity can damage integrated circuits.
8.1 FRUs Table 8–1 lists the FRUs by part number and description. Figure 8–1 and Figure 8–2 show the location of FRUs in the pedestal/rack view.
Table 8–1 FRU List (Continued) Part # Description CPU Modules 54-30158-03 500 MHz EV6 4 MB cached CPU 54-30158-05 Acceptable substitute for 54-24801-03 54-30158-06 500 MHz EV6 4 MB cached CPU (EV6 V2.4) 54-30158-07 500 MHz EV6 4 MB cached CPU (EV6 V2.
Table 8–1 FRU List (Continued)
Part #                     Description
30-49448-01                Power supply, 720 Watts
SN-LKQ46-Ax                Keyboard, OpenVMS
SN-LKQ47-Ax                Keyboard, Tru64 UNIX
SN-LKQ97-Ax                Keyboard, Windows NT
SN-PBQWS-WA                Mouse, 3-button
12-37977-02                Key for doors
3X-RRD32-AC 3R-A0284-AA    CD-ROM drive, half-height
RX23L-AC                   Floppy drive
8.1.1 Power Cords Tower enclosures ordered in North America include a 120 V power cord. Non-North American orders require one country-specific power cord. Pedestal systems ordered in North America include two 120 V power cords. Non-North American orders require two country-specific power cords. Table 8–2 lists the country-specific power cords for tower and pedestal systems. Table 8–2 Country-Specific Power Cords Power Cord Country Length BN26J-1K North American 120 V 75 in. 3X-BN46F-02 Japan 2.
8.1.2 FRU Locations Figure 8–1 and Figure 8–2 show the location of FRUs in the pedestal and rackmount configurations.
Figure 8–2 FRUs — Rear (Pedestal/Rack View). Labeled parts: I/O connector module (junk I/O), speaker, power harness access cover, power supplies, system motherboard.
8.1.3 Important Information Before Replacing FRUs The system must be shut down before you replace most FRUs. The exceptions are power supplies, individual fans, and hard drives. After replacing FRUs you must clear the system error information repository with the SRM clear_error all command. Tools You need the following tools to remove or replace FRUs.
Before Replacing Non Hot-Plug FRUs Follow the procedure below before replacing any non hot-plug FRU. 1. Shut down the operating system. 2. Shut down power to external options, where appropriate. 3. Turn off power to the system. 4. Unplug the power cord from each power supply. WARNING: To prevent injury, unplug the power cord from each power supply before installing components.
8.2 Removing Enclosure Panels on a Tower or Pedestal Open and remove the front door. Loosen the captive screws that allow you to remove the top and side panels.
To Remove Enclosure Panels from a Tower The enclosure panels are secured by captive screws. 1. Remove the front door. 2. To remove the top panel, loosen the top left and top right captive screws ➊. Slide the top panel back and lift it off the system. 3. To remove the left panel, loosen the captive screw ➋ at the top and the captive screw ➌ at the bottom. Slide the panel back and then tip it outward. Lift it off the system.
Figure 8–4 Enclosure Panel Removal (Pedestal)
To Remove Enclosure Panels from a Pedestal The enclosure panels are secured by captive screws. 1. Open and remove the front doors. 2. To remove the top enclosure panel, loosen top left and top right captive screws ➊. Slide the top panel back and lift it off the system. 3. To remove the right enclosure panel, loosen the captive screw shown in ➋. Slide the panel back and then tip it outward. Lift the panel from the three tabs.
8.3 Accessing the System Chassis in a Cabinet In a rackmount system, the system chassis is mounted to slides. WARNING: Pull out the stabilizer bar and extend the leveler foot to the floor before you pull out the system. This precaution prevents the cabinet from tipping over.
To Gain Access to the System Chassis 1. Open the front door of the cabinet. 2. Pull out the stabilizer bar ➊ at the bottom of the cabinet until it stops. 3. Extend the leveler foot at the end of the stabilizer bar to the floor. 4. Snap out the front bezel ➋. 5. Remove and set aside the two screws ➌ (one per side), if present, that secure the system to the cabinet. 6. Pull the system out until it locks.
8.4 Removing Covers from the System Chassis The system chassis has three covers: the fan cover, the system card cage cover, and the PCI card cage cover. Remove a cover by loosening the quarter-turn captive screw, pulling up on the ring, and sliding the cover from the system chassis. WARNING: High current area. Currents exceeding 240 VA can cause burns or eye injury. Avoid contact with parts or remove power prior to access. WARNING: Contact with moving fan can cause severe injury to fingers.
Figure 8–7 and Figure 8–8 show the location and removal of covers on the tower and pedestal/rackmount systems, respectively. The numbers in the illustrations correspond to the following: ➊ 3mm Allen captive quarter-turn screw that secures each cover. ➋ Spring-loaded ring that releases cover. Each cover has a ring. ➌ Fan area cover. ➍ System card cage cover. This area contains CPUs, memory DIMMs, MMBs, and system motherboard. To remove the system card cage cover, you must first remove the fan area cover ➌.
Figure 8–7 Covers on the System Chassis (Tower)
Figure 8–8 Covers on the System Chassis (Pedestal/Rack)
8.
WARNING: Hazardous voltages are contained within the power supply. Do not attempt to service. Return to factory for service. The power supply is a hot-plug component. As long as the system has a redundant supply, you can replace a supply while the system is running. Removing a Power Supply 1. Unplug the AC power cord. 2. Loosen the three Phillips screws ➊ that secure the power supply bracket. (Do not remove the screws.) Remove the bracket ➋. 3.
8.
The fans are hot-plug components. You can replace individual fans while the system is running. WARNING: Contact with moving fan can cause severe injury to fingers. Avoid contact or remove power prior to access. Replacing Fans 1. Remove the cover from the fan area (fans ➎ and ➏) or the PCI card cage (fans ➊,➋,➌, and ➍). 2. Pull the pop-up latch to unlock it, and lift the fan out of the system. Fan ➌ has no pop-up latch. It is held in place by fan ➍. 3.
8.
Hard drives are hot-plug components. CAUTION: Before replacing a hard disk drive, ensure that the SCSI controller and/or the operating system support hot-swapping of drives. Otherwise, shut down the operating system and return to the SRM console level before starting the replacement procedure. Removing a Hard Disk Drive 1. Access the storage drive area. 2. Push the button ➊ to release the plastic handle ➋ on the front of the drive carrier. Pull out the plastic handle toward you and slide the drive out.
8.8 CPUs You must shut the system down before adding or replacing a CPU. Figure 8–12 Removing CPU Cards WARNING: CPU cards have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module. WARNING: High current area. Currents exceeding 240 VA can cause burns or eye injury. Avoid contact with parts or remove power prior to access.
Replacing a CPU Card 1. Remove the covers from the fan area and the system card cage. 2. Pull up on the clips at each end of the card and remove the card. 3. Install the new CPU card in the connector and push down firmly on both clips simultaneously. NOTE: When installing an additional CPU, remove the blank CPU air deflector from the next available slot. Verification — SRM Console 1. Turn on power to the system. 2. During power-up, observe the screen display.
8.
WARNING: Memory DIMMs have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module. WARNING: High current area. Currents exceeding 240 VA can cause burns or eye injury. Avoid contact with parts or remove power prior to access. CAUTION: DIMMs come in two types, stacked or unstacked. See Chapter 6 before replacing DIMMs. Replacing DIMMs You must shut the system down before adding or replacing DIMMs. 1.
Figure 8–14 Aligning DIMM in MMB
4. Install the new DIMM. Align the notches on the gold fingers with the connector keys (Figure 8–14) and secure the DIMM with the clips on the MMB slot. 5. Reinstall the MMB and secure it to the system backplane with the clips. Verification — SRM Console 1. Turn on power to the system. 2. During power-up, observe the screen display for memory. 3. Issue the show memory command to display the total amount of memory in the system. Verification — AlphaBIOS Console 1.
8.10 PCI Cards Figure 8–15 Installing or Replacing a PCI Card WARNING: To prevent fire, use only modules with current limited outputs. See National Electrical Code NFPA 70 or Safety of Information Technology Equipment, Including Electrical Business Equipment EN 60 950. WARNING: High current area. Currents exceeding 240 VA can cause burns or eye injury. Avoid contact with parts or remove power prior to access.
Installing or Replacing a PCI Card You must shut the system down before adding or replacing a PCI card. 1. Remove the cover to the PCI card cage. 2. If installing a new card, remove and discard the bulkhead filler plate ➊ from the PCI slot. 3. If replacing a card, disconnect and remove the failed card. 4. Insert the new PCI card ➋ into the connector. NOTE: Some full-length PCI cards may have extender brackets for installing into ISA/EISA-style card cages.
8.
Removing the OCP Assembly You must shut the system down before removing the OCP assembly. 1. Press the two tabs ➊ on the top of the OCP assembly to release it. 2. Rotate the assembly toward you and lift it out of the two bottom tabs. 3. Disconnect the control panel cable ➋.
8.12 Removable Media Figure 8–17 Removing a 5.
Removing a 5.25-Inch Removable Media Device You must shut the system down before adding or replacing a removable media device. 1. Remove the cover to the PCI card cage. 2. Remove and set aside the four screws ➊ that secure the removable media cage. 3. Unplug the signal cable ➋ and power cable ➌ from all devices except the floppy. 4. Remove the cage. 5. Unplug the signal cable and power cable from the floppy. 6. Remove the four screws ➍ that secure the device and set aside the screws.
8.
Removing the Floppy Drive You must shut the system down before removing the floppy drive. 1. Remove the cover to the PCI card cage. 2. Remove and set aside the four screws ➊ that secure the removable media cage. 3. Unplug the signal cable ➋ and power cable ➌ from all devices except the floppy. 4. Remove the cage. 5. Unplug the signal cable and power cable from the floppy. 6. Remove the four screws ➍ that secure the floppy drive, and slide the drive out. 7.
8.
Removing the I/O Connector Assembly You must shut the system down before removing the I/O connector assembly. 1. Unplug all I/O connectors from the rear of the unit. 2. Remove the cover from the PCI card cage. 3. Unplug the 68-pin signal cable ➊. 4. Remove the two screws ➋ that secure the assembly to the back of the unit. 5. Pull the assembly out through the PCI area.
8.15 PCI Backplane Figure 8–20 Cables Connected to PCI Backplane
     Connecting Cable            Connects To:
➊    17-04785-01                 Fans
➋    17-03970-04                 Floppy
➌    17-04786-01                 Cover sensors
➍    70-31349-01                 Speaker
➎    17-04678-02                 CD-ROM
➏    17-03971-07                 OCP
➐    17-04914-01 (if present)    Storage disk cage
➑    17-04400-06                 I/O controller module
WARNING: High current area. Currents exceeding 240 VA can cause burns or eye injury. Avoid contact with parts or remove power prior to access.
Disconnecting the Cables You must shut the system down before accessing the PCI area. 1. Remove the cover to the PCI card cage. 2. Record the location of installed PCI cards. 3. Remove all external cables from the PCI bulkheads in the rear of the unit. Remove internal cables from PCI cards. 4. Unlatch and remove the cards from the card cage. 5. Disconnect cables connected to the PCI backplane. See Figure 8–20. 6. Remove the top fan (pedestal/rack orientation) or left fan (tower orientation).
Figure 8–21 Removing the PCI Backplane
Removing the PCI Backplane CAUTION: When removing the PCI backplane, be careful not to flex the board. Flexing the board may damage the BGA component connections. 1. Remove the 12 screws ➊ that secure the PCI backplane to the chassis. CAUTION: Do not remove the four additional nonwashered screws ➋. Removing them inactivates the built-in mechanism for extracting the PCI backplane from the system. 2.
8.
WARNING: CPUs and memory DIMMs have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module. CAUTION: When removing the system motherboard, be careful not to flex the board. Flexing the board may damage the BGA component connections. NOTE: Removing the system motherboard requires the removal of other FRUs. Review the removal procedures for the fans, MMBs, CPUs, and drive cage before beginning the system motherboard removal procedure. 1.
9. Unplug the five connectors ➏ on the bottom of the system motherboard. 10. Remove the three Phillips screws ➐ that secure the system motherboard. 11. A white plastic flange ➑ and two holes in the sheet metal under the flange are used to help disengage the system motherboard from the PCI backplane. Insert a screwdriver through the hole in the flange into the closest hole and pry the system motherboard away from the PCI backplane.
After installing a new motherboard: 1. Power up to the P00>>> prompt. 2. Enter the clear_error all command. 3. Enter the set sys_serial_num command to set the system serial number. For example: P00>>> set sys_serial_num NI900100022 The serial number will be propagated to all FRU devices that have EEPROMs.
8.
NOTE: Removing the power harness requires the removal of other system FRUs. Review the removal procedures for the power supplies, fans, and drive cage before beginning the harness removal procedure. 1. Remove the power supplies and any blank power supply panels. 2. Remove the cover to the PCI card cage. 3. Remove fans 4 and 3 (the inner fans). 4. Unplug the connectors to each removable media device (except the floppy). 5. Remove the four screws that secure the removable media cage.
Appendix A SRM Console Commands This appendix lists the SRM console commands that are most frequently used with the Compaq AlphaServer ES40 family of systems. Table A–1 SRM Commands Used on ES40 Systems Command Function alphabios Loads and starts the AlphaBIOS console. boot Loads and starts the operating system. buildfru Initializes I²C bus EEPROM data structures for the named FRU. cat el Displays the console event log. Same as more el, but scrolls rapidly.
Table A–1 SRM Commands Used on ES40 Systems (Continued) Command Function exer Exercises one or more devices by performing specified read, write, and compare operations. floppy_write Runs a write test on the floppy drive to determine whether you can write on the diskette. grep Searches for “regular expressions”—specific strings of characters—and prints any lines containing occurrences of the strings. hd Dumps the contents of a file (byte stream) in hexadecimal and ASCII.
Table A–1 SRM Commands Used on ES40 Systems (Continued) Command Function rmc Invokes the remote management console from the local VGA monitor. set envar Sets or modifies the value of an environment variable. show envar Displays the state of the specified environment variable. show config Displays the logical configuration at the last system initialization. show device Displays a list of controllers and bootable devices in the system. show error Reports errors logged in the FRU EEPROMs.
Appendix B Jumpers and Switches This appendix lists and describes the configuration jumpers and switches on the system motherboard and PCI board.
B.1 RMC and SPC Jumpers on System Motherboard The RMC jumpers can be used to override the RMC defaults. For example, if a high-speed modem is connected to COM1, you can disable J31 to prevent RMC from receiving characters that might cause interference. The SPC jumpers are reserved.
Table B–1 RMC/SPC Jumper Settings Jumper Description J24 1–2: Disables RMC flash update 2–3: Enables RMC flash update (default) Disabling RMC flash update prevents other operators from erasing or updating the RMC. J25 1–2: Sets RMC back to defaults 2–3: Normal RMC operating mode (default) If the RMC escape sequence is set to something other than the default, and you have forgotten the sequence, RMC must be reset to factory settings to restore the default escape sequence.
B.2 TIG/SROM Jumpers on System Motherboard TIG/SROM jumpers allow you to load the TIG if flash RAM is corrupted or load the fail-safe loader (FSL) if SRM firmware is corrupted. Figure B–2 TIG/SROM Jumpers (jumpers J21, J20, J22, and J23, each with pins 1 to 3; ten-position ON/OFF switchpack E296) NOTE: See Chapter 3 for instructions on activating the FSL.
Table B–2 TIG/SROM Jumper Descriptions Jumper Description J21 1–2: Load TIG from flash RAM (default) 2–3: Load TIG from serial ROM. This setting allows you to load the TIG if the flash RAM is corrupted. J20 Must be in default positions over pins 1 and 2 to enable FSL. FIR_FUNC2 (bit 2) 1–2 = 0, 2–3 = 1 J22 Jumper for enabling fail-safe loader (FSL) FIR_FUNC1 (bit 1) 1–2= 0, 2–3= 1 J23 Must be in default positions over pins 1 and 2 to enable FSL.
B.3 Clock Generator Switch Settings Switchpack E16 on the system motherboard sets the frequency of the main clock on the system motherboard. The settings should not be changed.
Table B–3 Clock Generator Settings
SW1     M0 (on)
SW2     M1 (on)
SW3     M2 (on)
SW4     M3 (off)
SW5     M4 (on)
SW6     M5 (off)
SW7     M6 (on)
SW8     N0 (off)
SW9     N1 (on)
SW10    XTAL_SEL (off)
B.4 Jumpers on PCI Board You can set J31 on the PCI board to force DTR so that a modem will not be disconnected if the system is power cycled. Check J13 if the system is losing time or the operating system comes up with a very inaccurate time.
Table B–4 PCI Board Jumper Descriptions Jumper Description ➊ J31 1–2: Do not force COM1 DTR 2–3: Force COM1 DTR (default) This jumper allows you to force DTR. The default position prevents disconnection of the modem on a power cycle. ➋ J20 1–2: Enable PCI 0 power management events (PME). 2–3: Disable PCI 0 PME (default) This jumper is reserved. ➌ J21 1–2: Enable PCI 1 PME 2–3: Disable PCI 1 PME (default) This jumper is reserved.
B.5 Setting Jumpers Review the material in the previous sections of this appendix before setting any system jumpers. Before setting jumpers, shut down the system and remove the power cord from each power supply. CAUTION: Static electricity can damage integrated circuits. Always use a grounded wrist strap (29-26246) and grounded work surface when working with internal parts of a computer system. Remove jewelry before working on internal parts of the system. Setting Jumpers 1.
Appendix C DPR Address Layout This appendix shows the address layout of the dual-port RAM (DPR). Use the SRM examine dpr:address command (where address is the offset from the base of the DPR) or use the RMC dump command to view locations in the DPR. See Appendix D for definitions of locations written when environmental error events occur.
C.
Table C–1 DPR Address Layout (Continued) Location Logical Written (Hex) Indicator By 16 17:1D 1E 1F 20:3F 40:5F 60:7F 80 SROM SROM SROM 20 20 20 80 SROM Used For SROM Power On Error Indication for CPU is “alive.
Table C–1 DPR Address Layout (Continued) Location Logical Written (Hex) Indicator By 81 81 SROM 82 83 84 85 86 87 88:8B 82 83 84 85 86 87 SROM SROM SROM SROM SROM SROM SROM 8C:8F 8C-8F SROM 90 91 92 90 91 92 RMC RMC RMC C-4 Used For Array 0 (AAR 0)Size (x64 Mbytes) 0 = no good memory 1 = 64 Mbyte 2 = 128 Mbyte 4 = 256 Mbyte 8 = 512 Mbyte 10 = 1 Gbyte 20 = 2 Gbyte 40 = 4 Gbyte 80 = 8 Gbyte Array 1 (AAR 1) Configuration Array 1 (AAR 1) Size (x64 Mbytes) Array 2 (AAR 2) Configuration Array 2 (AA
Table C–1 DPR Address Layout (Continued) Location Logical Written (Hex) Indicator By 93:96 97:99 9A:9F A0:A9 93 97 9A A0 RMC RMC RMC RMC AA RMC AB RMC AC AD AE AF RMC RMC RMC RMC B0 RMC B1 RMC Used For Temperature from CPU(x) in BCD Temperature Zone(x) from 3 PCI temp sensors Fan Status; Raw Fan speed value Failure registers used as part of the 680 machine check logout frame. See Appendix D.
Table C–1 DPR Address Layout (Continued) Location Logical Written (Hex) Indicator By B2 RMC B3:B9 Unused BA BB RMC RMC BC BD BE RMC RMC RMC BF C0:D8 D9 DA RMC DB:E3 E4:EC ED:F5 F6:F8 F9 FA:FB RMC RMC RMC Unused Firmware Firmware C-6 RMC TIG FA Used For Status of RMC to read SCSI backplane Definition: Bit 0 — SCSI backplane 0 Bit 1 — SCSI backplane 1 Bit 4 — Power supply 0 Bit 5 — Power supply 1 Bit 6 — Power supply 2 Unused 2 I C done, BA = finished RMC Power on Error indicates error durin
Table C–1 Location (Hex) DPR Address Layout (Continued) Logical Written Indicator By FC FC RMC FD FD RMC FE FE Firmware FF FF Firmware 100:1FF 100 RMC 200:2FF 300:3FF 400:4FF 500:5FF 600:7FF 700:7FF 800:8FF 900:9FF A00:AFF B00:BFF C00:CFF D00:DFF E00:EFF F00:FFF 200 300 400 500 600 700 800 900 A00 B00 C00 D00 E00 F00 RMC RMC RMC RMC RMC RMC RMC RMC RMC RMC RMC RMC RMC RMC Used For Command status associated with the RMC response to a request from the firmware 0 = successful completion 80
Table C–1 DPR Address Layout (Continued) Location (Hex) Logical Written Indicator By 1000:10FF 1100:11FF 1200:12FF 1300:13FF 1400:14FF 1500:15FF 1600:16FF 1700:17FF 1800:18FF 1900:19FF 1A00:1AFF 1B00:1BFF 1C00:1CFF 1D00:1DFF 1E00:1EFF 1F00:1FFF 2000:20FF 2100:21FF 2200:22FF 2300:23FF 2400:24FF 2500:25FF 2600:26FF 2700:27FF 2800:28FF 2900:29FF 2A00:2AFF 2B00:2BFF 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 1A00 1B00 1C00 1D00 1E00 1F00 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 2A00 2B00 C
Table C–1 DPR Address Layout (Continued) Location (Hex) Logical Written Indicator By 2C00:2CFF 2C00 RMC 2D00:2DFF 2D00 RMC 2E00:2FFF 2E00 RMC 3000:3008 3009:300B SROM RMC 300C:300E RMC 300F:3010 3011:30FF 3100:31FF 3200:32FF 3300:33FF 3400 3401 300F RMC Unused RMC RMC RMC SROM SROM 3402 3403:340F SROM SROM/SRM 3410:3417 SROM/SRM Used For Last Redundant Failure—ASCII character string that indicates redundant failure occurred, type, FRU, and so on.
Table C–1 Location (Hex) DPR Address Layout (Continued) Logical Written Indicator By 3418 3419 SROM/SRM SROM 341A:341E SROM 341F SROM/SRM 3420:342F 3430:343F 3440:344F 3450:349F SROM/SRM SROM/SRM SROM/SRM SROM/ RMC 34A0:34A7 SROM 34A8:34AF SROM 34B0:34B7 SROM 34B8:34CF SROM 34C0:34FF C-10 34C0 SROM Used For Waiting to jump to flag for CPU0 Shadow of value written to EV6 DC_CTL register. Shadow of most recent writes to EV6 CBOX “Write-many” chain.
Table C–1 Location (Hex) DPR Address Layout (Continued) Logical Written Indicator By 3500:35FF 3600:36FF 3700:37FF 3800:3AFF 3B00:3BFF 3C00:3CFF 3D00:3DFF 3E00:3EFF 3F00:3FFF Firmware 3600 SRM SRM RMC RMC RMC RMC RMC RMC Used For Used as the dedicated buffer in which SRM writes OCP or FRU EEROM data. Firmware will write this data, RMC will only read this data.
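Table C–1 gives the array size at location 81 as a hexadecimal count of 64-Mbyte units (1 = 64 Mbyte, 10 = 1 Gbyte, 80 = 8 Gbyte). The following Python sketch shows that conversion. It is a minimal illustration under stated assumptions: the function name is hypothetical, and the raw byte is assumed to have already been read, for example with the RMC dump command or the SRM examine command.

```python
UNIT_MBYTES = 64  # Table C-1: the size byte is expressed in 64-Mbyte units


def array_size_mbytes(raw_byte: int) -> int:
    """Convert an array size byte from the DPR to Mbytes (hypothetical helper).

    A value of 0 means the array has no good memory.
    """
    if not 0 <= raw_byte <= 0xFF:
        raise ValueError("DPR locations hold one byte each")
    return raw_byte * UNIT_MBYTES
```

A raw value of 0x10 converts to 1024 Mbytes (1 Gbyte) and 0x80 converts to 8192 Mbytes (8 Gbyte), which agrees with the values listed in the table.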
Appendix D Registers This appendix describes 21264 (EV6) internal processor registers; 21272 (Tsunami/Typhoon) system support chipset registers; and dual-port RAM (DPR) registers that are related to general logout frame errors. It also provides CPU and system uncorrectable and correctable machine logout frames and error state bit definitions of all the platform logout frame registers.
D.1 Ibox Status Register (I_STAT) The Ibox Status Register (I_STAT) is read only by PAL code and is an element in the CPU or system uncorrectable and correctable machine check error logout frame. Register layout (figure FM-05854): bits <63:31> reserved, <30> DPE, <29> TPE, <28:0> reserved.
Table D–1 Ibox Status Register Fields
Name       Bits       Type   Description
Reserved   <63:31>    RO     Reserved for Compaq.
DPE        <30>       W1C    I-cache data parity error. When set, indicates that the I-cache encountered a data parity error on instruction fetch.
TPE        <29>       W1C    I-cache tag parity error. When set, indicates that the I-cache encountered a tag parity error on instruction fetch.
Reserved   <28:0>     RO     Reserved for Compaq.
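The two error bits in Table D–1 can be extracted from a logged I_STAT value as in the following Python sketch. It is an illustration only: the function name is hypothetical, and the 64-bit value is assumed to have been copied from a machine check logout frame.

```python
DPE_BIT = 30  # I-cache data parity error (Table D-1)
TPE_BIT = 29  # I-cache tag parity error (Table D-1)


def decode_i_stat(value: int) -> dict:
    """Return the DPE and TPE flags from an I_STAT value (hypothetical helper)."""
    return {
        "DPE": bool((value >> DPE_BIT) & 1),
        "TPE": bool((value >> TPE_BIT) & 1),
    }
```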
D.2 Memory Management Status Register (MM_STAT) The Memory Management Status Register (MM_STAT) is read only by PAL code and is an element in the CPU or system uncorrectable and correctable machine check error logout frame. Register layout (figure FM-05862): bits <63:11> reserved, <10> DC_TAG_PERR, <9:4> OPCODE[5:0], <3> FOW, <2> FOR, <1> ACV, <0> WR.
Table D–2 Memory Management Status Register Fields
Name          Bits       Type   Description
Reserved      <63:11>           Reserved for Compaq.
DC_TAG_PERR   <10>       RO     This bit is set when a D-cache tag parity error occurs during the initial tag probe of a load or store instruction. The error created a synchronous fault to the D_FAULT PALcode entry point and is correctable. The virtual address associated with the error is available in the VA register.
OPCODE        <9:4>      RO     Opcode of the instruction that caused the error.
D.3 Dcache Status Register (DC_STAT) The Dcache Status Register (DC_STAT) is read only by PAL code and is an element in the CPU or system uncorrectable and correctable machine check error logout frame. Register layout (figure FM-05865): bits <63:5> reserved, <4> SEO, <3> ECC_ERR_LD, <2> ECC_ERR_ST, <1> TPERR_P1, <0> TPERR_P0.
Table D–3 Dcache Status Register Fields
Name         Bits      Type   Description
Reserved     <63:5>
SEO          <4>       W1C    Second error occurred. When set, indicates that a second D-cache store ECC error occurred within 6 cycles of the previous D-cache store ECC error.
ECC_ERR_LD   <3>       W1C    ECC error on load. When set, indicates that a single-bit ECC error occurred while processing a load from the D-cache or any fill.
ECC_ERR_ST   <2>       W1C    ECC error on store.
D.4 Cbox Read Register The Cbox Read Register is read only by PAL code and is an element in the CPU or system uncorrectable and correctable machine check error logout frame. Table D–4 Cbox Read Register Fields Name Description C_SYNDROME_1<7:0> Syndrome for the upper QW in the OW of victim that was scrubbed. See Appendix E. C_SYNDROME_0<7:0> Syndrome for the lower QW in the OW of victim that was scrubbed. See Appendix E.
Table D–4 Cbox Read Register Fields (Continued)
Name         Description
C_STAT<4:0>  (continued)
             Bits   Error Status
             01100  ISTREAM_BC_ERR
             01101  Reserved
             0111X  Reserved
             10011  DSTREAM_MEM_DBL
             10100  DSTREAM_BC_DBL
             11011  ISTREAM_MEM_DBL
             11100  ISTREAM_BC_DBL
C_STS<3:0>   If C_STAT equals xxx_MEM_ERR or xxx_BC_ERR, then C_STS contains the status of the block as follows; otherwise, the value of C_STS is X.
D.5 Exception Address Register (EXC_ADDR) The exception address register (EXC_ADDR) is a read-only register that is updated by hardware when it encounters an exception or interrupt. 63 32 PC[63:32] 31 2 1 0 PC[31:2] PAL FM-06384.
EXC_ADDR[0] is set if the associated exception occurred in PAL mode. The exception actions are: • If the exception was a fault or a synchronous trap, EXC_ADDR contains the PC of the instruction that triggered the fault or trap. • If the exception was an interrupt, EXC_ADDR contains the PC of the next instruction that would have executed if the interrupt had not occurred.
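As an illustration of the layout above, the sketch below splits a logged EXC_ADDR value into its PC and PAL-mode flag. The function name is hypothetical; the only facts used are that bit 0 is the PAL flag and that the PC occupies bits <63:2>.

```python
def decode_exc_addr(exc_addr):
    """Split a 64-bit EXC_ADDR value into (pc, pal_mode).

    Bit 0 is the PAL-mode flag; bits <63:2> hold the PC.
    """
    pal_mode = bool(exc_addr & 1)
    # Clear bits <1:0> and keep the result within 64 bits.
    pc = exc_addr & 0xFFFFFFFFFFFFFFFC
    return pc, pal_mode


if __name__ == "__main__":
    pc, pal = decode_exc_addr(0x10001)
    print(hex(pc), pal)
```

Whether the returned PC is the faulting instruction or the next instruction depends on the exception type, as described in the list above.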
D.6 Interrupt Enable and Current Processor Mode Register (IER_CM) The interrupt enable and current processor mode register (IER_CM) contains the interrupt enable and current processor mode bit fields. 63 39 38 33 32 EIEN[5:0] SLEN 31 30 29 28 14 13 12 5 4 3 2 0 CREN PCEN[1:0] SIEN[15:1] ASTEN CM[1:0] FM-05846.
Table D–5 IER_CM Register Fields Name Extent Type Description Reserved [63:39] EIEN[5:0] [38:33] RW External Interrupt Enable SLEN [32] RW Serial Line Interrupt Enable CREN [31] RW Corrected Read Error Interrupt Enable PCEN[1:0] [30:29] RW Performance Counter Interrupt Enables SIEN[15:1] [28:14] RW Software Interrupt Enables ASTEN [13] RW AST Interrupt Enable When set, enables those AST interrupt requests that are also enabled by the value in ASTER.
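The IER_CM fields can be extracted from a logged 64-bit value with a generic bit-field helper. This is a sketch under stated assumptions: the names `extract_field` and `decode_ier_cm` are hypothetical, and the CM field position (<4:3>) is taken from the register layout figure, not from the table rows shown here.

```python
# (high bit, low bit) for each IER_CM field, per Table D-5.
# CM <4:3> is read from the layout figure (assumption).
IER_CM_FIELDS = {
    "EIEN": (38, 33),
    "SLEN": (32, 32),
    "CREN": (31, 31),
    "PCEN": (30, 29),
    "SIEN": (28, 14),
    "ASTEN": (13, 13),
    "CM": (4, 3),
}


def extract_field(value, high, low):
    """Return bits <high:low> of value as an unsigned integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def decode_ier_cm(value):
    """Return a dict mapping each IER_CM field name to its value."""
    return {
        name: extract_field(value, high, low)
        for name, (high, low) in IER_CM_FIELDS.items()
    }


if __name__ == "__main__":
    sample = (0x3F << 33) | (1 << 31) | (1 << 13) | (3 << 3)
    for name, field in decode_ier_cm(sample).items():
        print(name, hex(field))
```

The same `extract_field` helper applies to any other register in this appendix once its field extents are known.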
D.7 Interrupt Summary Register (ISUM) The interrupt summary register (ISUM) is a read-only register that records all pending hardware, software, and AST interrupt requests that have their corresponding enable bit set. If a new interrupt (hardware, serial line, crd, or performance counters) occurs simultaneously with an ISUM read, the ISUM read returns zeros. That condition is normally assumed to be a passive release condition. The interrupt is signaled again when the PALcode returns to native mode.
Table D–6 ISUM Register Fields Name Extent Type Description Reserved [63:39] EI[5:0] [38:33] RO External Interrupts SL [32] RO Serial Line Interrupt CR [31] RO Corrected Read Error Interrupts PC[1:0] [30:29] RO Performance Counter Interrupts PC0 when PC[0] is set. PC1 when PC[1] is set. SI[15:1] [28:14] Reserved [13:11] ASTU, ASTS [10],[9] RO Software Interrupts RO AST Interrupts For each processor mode, the bit is set if an associated AST interrupt is pending.
D.8 PAL Base Register (PAL_BASE) The PAL base register (PAL_BASE) is a read-write register that contains the base physical address for PALcode. Its contents are cleared by chip reset but are not cleared after waking up from sleep mode or from fault reset. 63 44 43 32 PAL_BASE[43:32] 31 15 14 0 PAL_BASE[31:15] FM-05852.
Table D–7 PAL_BASE Register Fields Name Extent Type Description Reserved [63:44] RO, 0 Reserved for COMPAQ. PAL_BASE[43:15] [43:15] RW Base physical address for PALcode. Reserved [14:0] RO, 0 Reserved for COMPAQ.
D.9 Ibox Control Register (I_CTL) The Ibox control register (I_CTL) is a read-write register that controls various Ibox functions. Its contents are cleared by chip reset. 63 48 47 32 SEXT(VPTB[47]) VPTB[47:32] 31 30 29 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 3 2 1 0 VPTB[31:30] CHIP_ID[5:0] BIST_FAIL TB_MB_EN MCHK_EN CALL_PAL_R23 PCT1_EN PCT0_EN SINGLE_ISSUE_H VA_FORM_32 VA_48 SL_RCV SL_XMIT HWE BP_MODE[1:0] SBE[1:0] SDE[1:0] SPE[2:0] IC_EN[1:0] SPCE[0] FM-05853.
Table D–8 I_CTL Register Fields
Name            Extent   Type  Description
SEXT(VPTB[47])  [63:48]  RW,0  Sign extended VPTB[47].
VPTB[47:30]     [47:30]  RW,0  Virtual Page Table Base.
CHIP_ID[5:0]    [29:24]  RO    This is a read-only field that supplies the revision ID number for the 21264 part. Values are binary:
                               21264 pass 1 ID is 000000.
                               21264 pass 2 ID is 000001.
                               21264 pass 2.2 ID is 000010.
                               21264 pass 2.3 ID is 000011.
                               21264 pass 2.4 ID is 000101.
BIST_FAIL       [23]     RO,0  Indicates the status of BIST (set = pass, clear = fail).
Table D–8 I_CTL Register Fields (Continued) Name Extent Type Description PCT0_EN [18] RW,0 Enable performance counter #0. If this bit is one, the performance counter will count if EITHER the system (SPCE) or process (PPCE) performance counter enable is set. SINGLE_ISSUE_H [17] RW,0 When set, this bit forces instructions to issue only from the bottom-most entries of the IQ and FQ. VA_FORM_32 [16] RW,0 This bit controls address formatting on a read of the IVA_FORM register.
Table D–8 I_CTL Register Fields (Continued)
Name          Extent   Type  Description
SL_XMIT       [13]     WO    When set, drives a value on SromClk_H.
HWE           [12]     RW,0  If set, allow PALRES instructions to be executed in kernel mode. Note that modification of the ITB while in kernel mode/native mode may cause UNPREDICTABLE behavior.
BP_MODE[1:0]  [11:10]  RW,0  Branch Prediction Mode Selection. BP_MODE[1], if set, forces all branches to be predicted to fall through. If clear, the dynamic branch predictor is chosen.
Table D–8 I_CTL Register Fields (Continued) Name Extent Type Description SPE[2:0] [5:3] RW,0 Super Page Mode Enable. Identical to the SPE bits in the Mbox M_CTL SPE[2:0]. IC_EN[1:0] [2:1] RW,3 Icache Set Enable. At least one set must be enabled. The entire cache may be enabled by setting both bits. Zero, one, or two Icache sets can be enabled. This bit does not clear the Icache, but only disables fills to the affected set. SPCE [0] RW,0 System Performance Counting Enable.
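When reading an I_CTL value out of a logout frame, the CHIP_ID field identifies the 21264 pass. The sketch below maps the CHIP_ID values listed in Table D–8 to their pass names. The function name `chip_pass` is hypothetical, and any CHIP_ID value not listed in the table is reported as unknown rather than guessed.

```python
# CHIP_ID values (binary) and pass names as listed in Table D-8.
CHIP_ID_PASS = {
    0b000000: "pass 1",
    0b000001: "pass 2",
    0b000010: "pass 2.2",
    0b000011: "pass 2.3",
    0b000101: "pass 2.4",
}


def chip_pass(i_ctl):
    """Return the 21264 pass name for the CHIP_ID field, I_CTL bits [29:24]."""
    chip_id = (i_ctl >> 24) & 0x3F
    if chip_id in CHIP_ID_PASS:
        return CHIP_ID_PASS[chip_id]
    return "unknown (CHIP_ID=" + format(chip_id, "06b") + ")"


if __name__ == "__main__":
    print(chip_pass(0b000101 << 24))
```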
D.10 Process Context Register (PCTX) The process context register (PCTX) contains information associated with the context of a process. 63 39 38 47 46 32 ASN[7:0] 31 13 12 9 8 5 4 3 2 1 0 ASTRR[3:0] ASTER[3:0] FPE PPCE FM-05855.
The following table lists the correspondence between IPR index bits and register fields.
IPR Index Bit  Register Field
0              ASN
1              ASTER
2              ASTRR
3              PPCE
4              FPE
Table D–9 lists the PCTX register fields.
Table D–9 PCTX Register Fields
Name        Extent   Type  Description
Reserved    [63:47]
ASN[7:0]    [46:39]  RW    Address space number.
Reserved    [38:13]
ASTRR[3:0]  [12:9]   RW    AST request register, used to request AST interrupts in each of the four processor modes. To generate a particular AST interrupt, its corresponding bits in ASTRR and ASTER must be set, along with the ASTE bit in IER.
D.11 21272-CA Cchip Miscellaneous Register (MISC) This register is designed so that only writes of 1 affect it. When a 1 is written to any bit in the register, the programmer does not need to be concerned with read-modify-write or the status of any other bits in the register. Once NXM is set, the NXS field is locked. It is unlocked when software clears the NXM field. The ABW (arbitration won) field is locked if either ABW bit is set, so the first CPU to write it locks out the other CPU.
Table D–10 21272-CA Cchip Miscellaneous Register Fields
Name    Bits     Type      Initial State  Description
RES     <63:44>  MBZ, RAZ  0
DEVSUP  <43:40>  WO        0
REV     <39:32>  RO        1              Latest revision of the Cchip: 1 = Tsunami, 8 = Typhoon
NXS     <31:29>  RO        0              NXM source: the device that caused the NXM. Unpredictable if NXM not set. 0 = CPU0, 1 = CPU1, 2 = CPU2, 3 = CPU3, 4 = P-chip 0, 5 = P-chip 1
NXM     <28>     R, W1C    0              Nonexistent memory address detected. Sets DRIR<63> and locks the NXS field until it is cleared.
Table D–10 21272-CA Cchip Miscellaneous Register Fields (Continued) Name Bits Type Initial State Description IPINTR <11:8> R, W1C 0 Interprocessor interrupt pending—one bit per CPU. Pin irq<3> is asserted to the CPU corresponding to a 1 in this field. ITINTR <7:4> R, W1C 0 Interval timer interrupt pending—one bit per CPU. Pin irq<2> is asserted to the CPU corresponding to a 1 in this field. RES <3:2> MBZ, RAZ 0 Reserved. CPUID <1:0> RO - ID of the CPU performing the read.
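The write-1-to-clear (W1C) behavior described in D.11 can be modeled in a few lines. The sketch below is illustrative only: it covers just the R, W1C fields listed in Table D–10 (NXM, IPINTR, ITINTR), ignores the other write-sensitive fields of MISC, and uses hypothetical function names. It also decodes NXS, which is meaningful only while NXM is set.

```python
# R, W1C fields of MISC per Table D-10: NXM <28>, IPINTR <11:8>, ITINTR <7:4>.
W1C_MASK = (1 << 28) | (0xF << 8) | (0xF << 4)

NXS_SOURCES = {
    0: "CPU0", 1: "CPU1", 2: "CPU2", 3: "CPU3",
    4: "P-chip 0", 5: "P-chip 1",
}


def write_w1c(current, written):
    """Model a write to the W1C fields: a written 1 clears the bit,
    a written 0 leaves it unchanged, so no read-modify-write is needed."""
    return current & ~(written & W1C_MASK)


def nxm_source(misc):
    """Return the NXM source name, or None if NXM <28> is not set."""
    if not (misc >> 28) & 1:
        return None
    return NXS_SOURCES.get((misc >> 29) & 0x7, "undefined")


if __name__ == "__main__":
    misc = (4 << 29) | (1 << 28) | (0x1 << 8)
    print(nxm_source(misc))              # source while NXM is set
    misc = write_w1c(misc, 1 << 28)      # clear NXM only
    print(hex(misc), nxm_source(misc))   # IPINTR bit is untouched
```

Writing a single 1 clears only that bit, which is why the text says the programmer does not need to preserve the other bits of the register.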
D.12 21272-CA Cchip CPU Device Interrupt Request Register (DIRn, n=0,1,2,3) These registers indicate which interrupts are pending to the CPUs and indicate the presence of an I/O error condition.
Table D–11 21272-CA Device Interrupt Request Register Fields
Name  Bits     Type  Initial State  Description
ERR   <63:58>  RO    0              IRQ0 error interrupts:
                                    <63> Cchip detected MISC
                                    <62> Recommended hookup to Pchip0 error
                                    <61> Recommended hookup to Pchip1 error
RES   <57:56>  RO    0              Reserved
NXS   <55:0>   RO    0              IRQ1 PCI interrupts pending to the CPU
D.13 21272-CA Pchip Error Register (PERROR) If any of bits <11:0> are set, this register is frozen. Only bit <0> can be set thereafter. All other values are held until all bits <11:0> are clear. When an error occurs and one of the <11:0> bits is set, the associated information is captured in bits <63:16>. After the information is captured, the INV bit is cleared, but the information is not valid and should not be used if INV is set.
[Figure PK1419-99: PERROR register layout. Access: RW. Fields from high to low: SYN <63:56>, CMD <55:52>, INV <51>, ADDR <50:16>, RES <15:12>, CRE <11>, UECC <10>, RES <9>, NDS <8>, RDPE <7>, TA <6>, APE <5>, SGE <4>, DCRTO <3>, PERR <2>, SERR <1>, LOST <0>.]
Table D–12 21272-CA Pchip Error Register Fields
Name  Bits     Type                Initial State  Description
SYN   <63:56>  RO                  0              ECC syndrome of error if CRE or UECC.
CMD   <55:52>  RO                  0              PCI command of transaction when error detected if not CRE and not UECC. If CRE or UECC, then:
                                                  Value   Command
                                                  0000    DMA read
                                                  0001    DMA read-modify-write
                                                  0011    SGTE read
                                                  Others  Reserved
INV   <51>     RO Rev1, RAZ Rev0   0              Info Not Valid. Only meaningful when one of bits <11:0> is set. Indicates the validity of the SYN, CMD, and ADDR fields.
Table D–12 21272-CA Pchip Error Register Fields (Continued)
Name  Bits     Type      Initial State  Description
RES   <15:12>  MBZ, RAZ  0              Reserved.
CRE   <11>     R, W1C    0              Correctable ECC error.
UECC  <10>     R, W1C    0              Uncorrectable ECC error.
RES   <9>      MBZ, RAZ  0              Reserved.
NDS   <8>      R, W1C    0              No b_devsel_l as PCI master.
RDPE  <7>      R, W1C    0              PCI read data parity error as PCI master.
TA    <6>      R, W1C    0              Target abort as PCI master.
APE   <5>      R, W1C    0              Address parity error detected as potential PCI target.
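The rules in D.13 and Table D–12 can be combined into a small decoder for a logged PERROR value. This is a sketch, not the Compaq Analyze algorithm: the function name and result keys are hypothetical, and the names for bits <4:0> (SGE, DCRTO, PERR, SERR, LOST) are taken from the register layout figure because those table rows are not shown here.

```python
# Error bits <11:0>; names for bits <4:0> come from the layout figure (assumption).
PERROR_BITS = {
    11: "CRE", 10: "UECC", 8: "NDS", 7: "RDPE", 6: "TA", 5: "APE",
    4: "SGE", 3: "DCRTO", 2: "PERR", 1: "SERR", 0: "LOST",
}

# Meaning of CMD <55:52> when the error is CRE or UECC (Table D-12).
ECC_COMMANDS = {
    0b0000: "DMA read",
    0b0001: "DMA read-modify-write",
    0b0011: "SGTE read",
}


def decode_perror(value):
    """Decode the error bits and, when INV <51> is clear, the captured info."""
    errors = [
        name
        for bit, name in sorted(PERROR_BITS.items(), reverse=True)
        if (value >> bit) & 1
    ]
    # Captured information is usable only if an error bit is set and INV is clear.
    info_valid = bool(errors) and not ((value >> 51) & 1)
    result = {"errors": errors, "info_valid": info_valid}
    if info_valid:
        cmd = (value >> 52) & 0xF
        if "CRE" in errors or "UECC" in errors:
            result["syndrome"] = (value >> 56) & 0xFF
            result["command"] = ECC_COMMANDS.get(cmd, "Reserved")
        else:
            result["pci_command"] = cmd  # raw PCI command code
    return result


if __name__ == "__main__":
    sample = (0x31 << 56) | (0b0001 << 52) | (1 << 11)
    print(decode_perror(sample))
```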
D.14 21272-CA Array Address Registers (AAR0–AAR3) The Array Address Registers define the base address and size for each memory array.
Table D–13 21272-CA Array Address Register (AAR)
Field  Bits     Type     Init  Description
RES    <63:35>  MBZ,RAZ  0     Reserved.
ADDR   <34:24>  RW       0     Base address: bits <34:24> of the physical byte address of the first byte in the array.
RES    <23:17>  MBZ,RAZ  0
DBG    <16>     RW       0
ASIZ   <15:12>  RW       0
RES    <11:10>  MBZ,RAZ  0
TSA    <9>      RW       0
SA     <8>      RW       0
Table D–13 21272-CA Array Address Register (AAR) (Continued)
Field  Bits    Type     Init  Description
RES    <7:4>   MBZ,RAZ  0     Reserved.
ROWS   <3:2>   RW       0     Number of row bits in the SDRAMs.
BNKS   <1:0>   RW       0
D.15 DPR Registers for 680 Correctable Machine Check Logout Frames DPR Locations A0:A9 represent the information that the console will read when a 680 machine check logout frame is loaded. They provide the interrupt information obtained by the RMC through the LM78 sensors. When an error occurs, the RMC writes the bits and delivers an IRQ to the SRM console. The SRM reads the bits and clears them. On the next 680 error, the RMC writes the error into the A0:A9 locations.
Table D–14 DPR Location DPR Locations A0:A9 (Continued) Description A2 If bit is set the associated fault is active. Bit 0 CPU0_VCORE out of tolerance 1 CPU0_VIO out of tolerance 2 CPU1_VCORE out of tolerance 3 CPU1_VIO out of tolerance 4 PCI backplane LM78 1 is over temp 5 Not Used 6 Fan 4 fault 7 Fan 5 fault A3 Reserved If bit is set the associated fault is active.
Table D–14 DPR Location A6 A7 DPR Locations A0:A9 (Continued) Description These bits indicate a door has been opened. Bit 0 unused 1 CPU door is open 2 Fan door is open 3 PCI door is open 5 System CPU door is open 6 System fan door is open 7 System PCI door is open Temperature Warning Mask Bit 0 1 2 3 4 5 6 A8 Fan Controller Fault. This indicates a fan is not responding to a different RPM range as set by the RMC. (It is used to indicate that the fan failed to reach its maximum RPM at power-up).
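Each DPR location in Table D–14 is a one-byte fault mask, so a byte read from the DPR can be translated with a lookup keyed by bit number. The sketch below handles location A2 only, using the bit assignments listed for A2 in the table. The function name is hypothetical, and how the byte is obtained from the DPR is outside its scope.

```python
# Bit assignments for DPR location A2, per Table D-14. Bit 5 is not used.
DPR_A2_FAULTS = {
    0: "CPU0_VCORE out of tolerance",
    1: "CPU0_VIO out of tolerance",
    2: "CPU1_VCORE out of tolerance",
    3: "CPU1_VIO out of tolerance",
    4: "PCI backplane LM78 1 is over temp",
    6: "Fan 4 fault",
    7: "Fan 5 fault",
}


def decode_dpr_a2(byte_value):
    """Return the active faults in a DPR A2 byte, lowest bit first."""
    return [
        text
        for bit, text in sorted(DPR_A2_FAULTS.items())
        if (byte_value >> bit) & 1
    ]


if __name__ == "__main__":
    for fault in decode_dpr_a2(0x41):
        print(fault)
```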
D.16 DPR Power Supply Status Registers The RMC reads nine bytes of information from each of the three power supplies. The first byte is read from an I/O expander port, the second four bytes and the last four bytes are read from the A–D converter. Table D–15 Nine Bytes Read from Power Supply DPR Location Definition DB/E4/ED Reads I/O expander on Power Supply 0, 1, 2 Bit 0 1 2 3 4:7 PS_ID0_L PS_ID1_L Reserved (Pulled up so bit is always enabled) Thermal_Shutdown_H Tied to High within PS DC/E5/EE 3.
D.17 DPR 680 Fatal Registers The RMC is powered by an auxiliary 5V supply that is independent from the system power subsystem. When any catastrophic failures (such as overtemperature failure) occur, this error state is captured as shown in Table D–16. The information is used to populate the console data log uncorrectable error frame in Environ_QW_8.
D.18 CPU and System Uncorrectable Machine Check Logout Frame The SRM console or the Windows NT HAL builds the uncorrectable machine check logout frames and passes them to the OS error handlers. The OS error handlers further process and subsequently log the formatted error event into the system binary error log.
D.19 Console Data Log Event Environmental Error Logout Frame (680 Uncorrectable) Compaq Analyze uses the logout frame in Table D–18 for its decomposition of all 680 system environmental uncorrectable error frames.
D.20 CPU and System Correctable Machine Check Logout Frame The SRM console or the Windows NT HAL builds the correctable machine check logout frames and passes them to the OS error handlers. The OS error handlers further process and subsequently log the formatted error event into the system binary error log. The operating systems contain built-in throttling mechanisms to handle high-volume bursting of these correctable error conditions.
D.21 Environmental Error Logout Frame (680 Correctable) Table D–20 shows Environ_QW_1:7 and Environ_QW_8 error state capture information from DPR locations A0:A9 and BD:BF, respectively.
D.22 Platform Logout Frame Register Translation Compaq Analyze uses information from all logout frames for its decomposition of all error events. The error state bit definitions of all platform logout frame registers are shown in Table D–21.
Table D–21 Bit Definition of Logout Frame Registers Register Identification Bit Field Text Translation Description C_SYNDROME_0 <7:0> Syndrome for lower quadword in octaword of victim that was scrubbed as follows : <7:0>(Hex) CE CB D3 D5 D6 D9 DA DC 23 25 26 29 2A 2C 31 34 0E 0B 13 15 16 19 1A 1C E3 E5 E6 E9 EA EC Data Bit 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 <7:0>(Hex) 4F 4A 52 54 57 58 5B 5D A2 A4 A7 A8 AB AD B0 B5 8F 8A 92 94 97 98 9B 9D 62 64 6
Table D–21 Bit Definition of Logout Frame Registers (Continued) Register Identification Bit Field C_SYNDROME_0 (continued) 1 C_SYNDROME_1 <7:0> C_STAT <4:0> C_STS <7:4> <3:0> C_ADDR <42:6> Text Translation Description Data Bit <7:0>(Hex) Data Bit <7:0>(Hex) F1 30 70 62 F4 31 75 63 01 CB0 10 CB4 02 CB1 20 CB5 04 CB2 40 CB6 08 CB3 80 CB7 Syndrome for upper quadword in octaword of victim that was scrubbed (same as specified above) 1 Detected Error <4:0>(Hex) 00 No Error unless DC_STAT<3> = 1 indica
Table D–21 Bit Definition of Logout Frame Registers (Continued)
Register Identification  Bit Field  Text Translation Description
I_STAT                   <63:41>    Reserved
                         <40>       ProfileMe Mispredict Trap
                         <39>       ProfileMe Trap
                         <38>       ProfileMe Load-Store Order Trap
                         <37:34>    ProfileMe Trap Types
                         <33>       ProfileMe Icache Miss
                         <32:30>    ProfileMe Counter 0 Overcount
                         <29>       Set = Icache encountered a parity error on instruction fetch and a replay trap is performed, which generates a correctable read interrupt.
Table D–21 Bit Definition of Logout Frame Registers (Continued) Register Identification Bit Field Text Translation Description EXC_ADDR <0> <63:2> IER_CM <4:3> I_SUM <13> <28:14> <30:29> <31> <32> <38:33> <4:3> Set = exception or interrupt occurred in PAL mode Contains the PC address of the instruction that would have executed if the error interrupt did not occur.
Table D–21 Bit Definition of Logout Frame Registers (Continued) Register Identification I_CTL Bit Field Text Translation Description <2:1> <7:6> 01(Bin) and 10(Bin) for Icache set 1 or 2 enabled, respectively 01(Bin) and 10(Bin) for R8-R11 & R24-R27 and R4-R7 & R20R23 are used for PAL shadow registers, respectively Set = forces bad Icache tag parity Set = forces bad Icache data parity Clear and set for 43 bit or 48 bit virtual address format, respectively Clear or set for R23 or R27 used as CALL_PAL lin
Table D–21 Bit Definition of Logout Frame Registers (Continued) ID Bit Field Text Translation Description MISC <43:40> Suppress IRQ1 interrupts to 1(Hex) for CPU0, 2(Hex) for CPU1, 4(Hex) for CPU2, and 8(Hex) for CPU3 Cchip Cchip Revision Level : 00-07(Hex) for C2, 08-0F(Hex) for C4 0(Hex) for CPU0, 1(Hex) for CPU1, 2(Hex) for CPU2, 3(Hex) for CPU3, 4(Hex) for Pchip0, 5(Hex) for Pchip1, as device (source) which caused the NXM Set = NXM address detected, <31:29> are locked, DRIR <63> is set Write 1 = Ar
Table D–21 Bit Definition of Logout Frame Registers (Continued) ID Bit Field Text Translation Description DIRx <63> <62> <61> <60> <59> <58> <57:56> <55> <54> <53> <52> <51> <50> <49> <48> <47:44> <43:40> <39:36> <35:32> <31:28> <27:24> <23:20> <19:16> <15:12> <11:8> Internal Cchip asynchronous error [i.e.
Table D–21 Bit Definition of Logout Frame Registers (Continued) Register Identification P0 & 1_ERROR Bit Field Text Translation Description <63:56> <55:52> ECC Syndrome of CRE or UECC error - Same as EV6. When CRE or UECC failing transaction: 0000(Bin) = DMA Read; 0001(Bin) = DMA RMW; 0011(Bin) = S/G Read.
Table D–21 Bit Definition of Logout Frame Registers (Continued) Register Identification SMIR (Environ_QW_1) CPUIR (Environ_QW_2) PSIR (Environ_QW_3) Bit Field Text Translation Description <7> <6> <5> <4> <3> <2> <1> <0> <7> <6> <5> <4> <3> <2> <1> <0> <7> <6> <5> <4> <3> <2> <1> <0> Inverted Sys_Rst = System is being reset Inverted PCI_Rst1 = PCI Bus #1 is in reset Inverted PCI_Rst0 = PCI Bus #0 is in reset Set = System temperature over 50 degrees C failure unused Set = Sys_DC_Notok failure detected I
Table D–21 Bit Definition of Logout Frame Registers (Continued) Register Identification System_PS/Temp/ Fan_Fault_ LM78_ISR (Environ_QW_4) Bit Field Text Translation Description <0> <1> <2> <3> <4> Set = PS +3.
Table D–21 Bit Definition of Logout Frame Registers (Continued) Register Identification System_Doors (Environ_QW_5) System_Temperature_Warning (Environ_QW_6) Bit Field Text Translation Description <0> <1> <2> <3> <4> <5> <6> <7> <63:8> <0> <1> <2> <3> <4> Unused Set = System CPU door is open Set = System Fan door is open Set = System PCI door is open Unused Set = System CPU door is closed Set = System Fan door is closed Set = System PCI door is closed Unused Set = CPU0 temperature warning fault has occ
Table D–21 Bit Definition of Logout Frame Registers (Continued) Register Identification Bit Field Text Translation Description Fatal_Power_Down_Codes (Environ_QW_8) <0> <1> <2> <3:7> <8> <9> <10> <11> <12> <13> <14> <15> <16> <17> <18> Set = Power Supply 0 AC input fail Set = Power Supply 1 AC input fail Set = Power Supply 2 AC input fail Unused Set = Power Supply 0 DC fail Set = Power Supply 1 DC fail Set = Power Supply 2 DC fail Set = Vterm fail Set = CPU0 Regulator fail Set = CPU1 Regulator fail Set
Appendix E Isolating Failing DIMMs This appendix explains how to manually isolate a failing DIMM from the failing address and failing data bits. It also covers how to isolate single-bit errors.
E.1 Information for Isolating Failures Table E–1 lists the information needed to isolate the failure. See Appendix D for the register table for the Array Address Registers (AARs). The failing address and failing data can come from a variety of different locations such as the SROM serial line, SRM screen displays, the SRM event log, and errors detected by the 21264 (EV6) chip.
E.2 DIMM Isolation Procedure Use the procedure in this section to isolate the failing DIMM. 1. Find the failing array by using the failing address and the Array Address Registers (AARs; see Appendix D). Use the AAR base address and size to create an address range against which to compare the failing address. For example, if the AAR1 base address is 40000000 (1 GB) and its size is 10000000 (256 MB), the address range is 40000000–4FFFFFFF (1–1.25 GB). Compare the failing address against this range.
3. After finding the real array, determine whether it is the lower array set or the upper array set. Use DPR locations 80, 82, 84, and 86 listed in Table E–1. Table E–3 shows the description of these locations.
4. Use the following table to determine the proper set. Bits<27,28,29,30,31,32> are from the failing address.
Table E–4 Failing DIMM Lookup Table Data Bits Array 1 Upper Lower Set Set Array 2 Upper Lower Set Set Array 3 Upper Lower Set Set Array 4 Upper Lower Set Set 0 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M:1 D:7 M:3 D:3 M:3 D:7 1 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M:1 D:7 M:3 D:3 M:3 D:7 2 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M:1 D:7 M:3 D:3 M:3 D:7 3 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M:1 D:7 M:3 D:3 M:3 D:7 4 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M
Table E–4 Failing DIMM Lookup Table (Continued) Data Bits Array 1 Upper Lower Set Set Array 2 Upper Lower Set Set Array 3 Upper Lower Set Set Array 4 Upper Lower Set Set 32 M:0 D:1 M:0 D:5 M:2 D:1 M:2 D:5 M:0 D:3 M:0 D:7 M:2 D:3 M:2 D:7 33 M:0 D:1 M:0 D:5 M:2 D:1 M:2 D:5 M:0 D:3 M:0 D:7 M:2 D:3 M:2 D:7 34 M:0 D:1 M:0 D:5 M:2 D:1 M:2 D:5 M:0 D:3 M:0 D:7 M:2 D:3 M:2 D:7 35 M:0 D:1 M:0 D:5 M:2 D:1 M:2 D:5 M:0 D:3 M:0 D:7 M:2 D:3 M:2 D:7 36 M:0 D:1 M:0 D:5 M:2 D:1 M:
Table E–4 Failing DIMM Lookup Table (Continued) Data Bits Array 1 Upper Lower Set Set Array 2 Upper Lower Set Set Array 3 Upper Lower Set Set Array 4 Upper Lower Set Set 63 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M:1 D:7 M:3 D:3 M:3 D:7 64 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M:1 D:7 M:3 D:3 M:3 D:7 65 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M:1 D:7 M:3 D:3 M:3 D:7 66 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M:1 D:7 M:3 D:3 M:3 D:7 67 M:1 D:1 M:1 D:5 M:3 D:1 M:
Table E–4 Failing DIMM Lookup Table (Continued) Data Bits Array 1 Upper Lower Set Set Array 2 Upper Lower Set Set Array 3 Upper Lower Set Set Array 4 Upper Lower Set Set 94 M:0 D:1 M:0 D:5 M:2 D:1 M:2 D:5 M:0 D:3 M:0 D:7 M:2 D:3 M:2 D:7 95 M:0 D:1 M:0 D:5 M:2 D:1 M:2 D:5 M:0 D:3 M:0 D:7 M:2 D:3 M:2 D:7 96 M:0 D:1 M:0 D:5 M:2 D:1 M:2 D:5 M:0 D:3 M:0 D:7 M:2 D:3 M:2 D:7 97 M:0 D:1 M:0 D:5 M:2 D:1 M:2 D:5 M:0 D:3 M:0 D:7 M:2 D:3 M:2 D:7 98 M:0 D:1 M:0 D:5 M:2 D:1 M:
Table E–4 Failing DIMM Lookup Table (Continued) Data Bits Array 1 Upper Lower Set Set Array 2 Upper Lower Set Set Array 3 Upper Lower Set Set Array 4 Upper Lower Set Set 123 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M:1 D:7 M:3 D:3 M:3 D:7 124 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M:1 D:7 M:3 D:3 M:3 D:7 125 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M:1 D:7 M:3 D:3 M:3 D:7 126 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M:1 D:7 M:3 D:3 M:3 D:7 127 M:1 D:1 M:1 D:5 M:3 D:
Table E–4 Failing DIMM Lookup Table (Continued) Data Bits Array 1 Upper Lower Set Set Array 2 Upper Lower Set Set Array 3 Upper Lower Set Set Array 4 Upper Lower Set Set 152 M:0 D:2 M:0 D:6 M:2 D:2 M:2 D:6 M:0 D:4 M:0 D:8 M:2 D:4 M:2 D:8 153 M:0 D:2 M:0 D:6 M:2 D:2 M:2 D:6 M:0 D:4 M:0 D:8 M:2 D:4 M:2 D:8 154 M:0 D:2 M:0 D:6 M:2 D:2 M:2 D:6 M:0 D:4 M:0 D:8 M:2 D:4 M:2 D:8 155 M:0 D:2 M:0 D:6 M:2 D:2 M:2 D:6 M:0 D:4 M:0 D:8 M:2 D:4 M:2 D:8 156 M:0 D:2 M:0 D:6 M:2 D:
Table E–4 Failing DIMM Lookup Table (Continued) Data Bits Array 1 Upper Lower Set Set Array 2 Upper Lower Set Set Array 3 Upper Lower Set Set Array 4 Upper Lower Set Set 182 M:0 D:2 M:0 D:6 M:2 D:2 M:2 D:6 M:0 D:4 M:0 D:8 M:2 D:4 M:2 D:8 183 M:0 D:2 M:0 D:6 M:2 D:2 M:2 D:6 M:0 D:4 M:0 D:8 M:2 D:4 M:2 D:8 184 M:1 D:2 M:1 D:6 M:3 D:2 M:3 D:6 M:1 D:4 M:1 D:8 M:3 D:4 M:3 D:8 185 M:1 D:2 M:1 D:6 M:3 D:2 M:3 D:6 M:1 D:4 M:1 D:8 M:3 D:4 M:3 D:8 186 M:1 D:2 M:1 D:6 M:3 D:
Table E–4 Failing DIMM Lookup Table (Continued) Data Bits Array 1 Upper Lower Set Set Array 2 Upper Lower Set Set Array 3 Upper Lower Set Set Array 4 Upper Lower Set Set 211 M:1 D:2 M:1 D:6 M:3 D:2 M:3 D:6 M:1 D:4 M:1 D:8 M:3 D:4 M:3 D:8 212 M:1 D:2 M:1 D:6 M:3 D:2 M:3 D:6 M:1 D:4 M:1 D:8 M:3 D:4 M:3 D:8 213 M:1 D:2 M:1 D:6 M:3 D:2 M:3 D:6 M:1 D:4 M:1 D:8 M:3 D:4 M:3 D:8 214 M:1 D:2 M:1 D:6 M:3 D:2 M:3 D:6 M:1 D:4 M:1 D:8 M:3 D:4 M:3 D:8 215 M:1 D:2 M:1 D:6 M:3 D:
Table E–4 Failing DIMM Lookup Table (Continued) Data Bits Array 1 Upper Lower Set Set Array 2 Upper Lower Set Set Array 3 Upper Lower Set Set Array 4 Upper Lower Set Set 242 M:0 D:2 M:0 D:6 M:2 D:2 M:2 D:6 M:0 D:4 M:0 D:8 M:2 D:4 M:2 D:8 243 M:0 D:2 M:0 D:6 M:2 D:2 M:2 D:6 M:0 D:4 M:0 D:8 M:2 D:4 M:2 D:8 244 M:0 D:2 M:0 D:6 M:2 D:2 M:2 D:6 M:0 D:4 M:0 D:8 M:2 D:4 M:2 D:8 245 M:0 D:2 M:0 D:6 M:2 D:2 M:2 D:6 M:0 D:4 M:0 D:8 M:2 D:4 M:2 D:8 246 M:0 D:2 M:0 D:6 M:2 D:
Table E–4 Failing DIMM Lookup Table (Continued) Check Bits Array 1 Upper Lower Set Set Array 2 Upper Lower Set Set Array 3 Upper Lower Set Set Array 4 Upper Lower Set Set 0 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M:1 D:7 M:3 D:3 M:3 D:7 1 M:0 D:1 M:0 D:5 M:2 D:1 M:2 D:5 M:0 D:3 M:0 D:7 M:2 D:3 M:2 D:7 2 M:1 D:1 M:1 D:5 M:3 D:1 M:3 D:5 M:1 D:3 M:1 D:7 M:3 D:3 M:3 D:7 3 M:0 D:1 M:0 D:5 M:2 D:1 M:2 D:5 M:0 D:3 M:0 D:7 M:2 D:3 M:2 D:7 4 M:0 D:1 M:0 D:5 M:2 D:1 M:2 D:
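Step 1 of the isolation procedure (comparing the failing address against each array's address range) can be sketched as follows. The function names and the dictionary layout are hypothetical, and the sketch covers only the range comparison. It does not perform the later steps that identify the real array and the upper or lower set.

```python
def array_range(base, size):
    """Return the inclusive (first, last) byte addresses covered by an array."""
    return base, base + size - 1


def find_failing_array(failing_address, arrays):
    """Return the name of the array whose range contains failing_address.

    arrays maps a name such as "AAR1" to a (base, size) pair in bytes.
    Arrays with a size of 0 (not populated) are skipped.
    """
    for name, (base, size) in arrays.items():
        if size == 0:
            continue
        first, last = array_range(base, size)
        if first <= failing_address <= last:
            return name
    return None


if __name__ == "__main__":
    # AAR1 example from step 1: base 40000000 (1 GB), size 10000000 (256 MB).
    arrays = {
        "AAR0": (0x00000000, 0x40000000),
        "AAR1": (0x40000000, 0x10000000),
    }
    first, last = array_range(*arrays["AAR1"])
    print(format(first, "X"), format(last, "X"))
    print(find_failing_array(0x4ABCDE00, arrays))
```

The base and size values must first be derived from the AAR register contents described in Appendix D. The sketch assumes that conversion has already been done.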
E.3 EV6 Single-Bit Errors The procedure for isolating a single-bit error down to the set of DIMMs is very similar to the procedure described in the previous sections. However, you cannot isolate down to a specific data or check bit. The 21264 (EV6) chip detects and reports a C_ADDR<42:6> failing address that is accurate to the cache block (64 bytes).
Table E–5 Syndrome to Data Check Bits Table (Continued) Syndrome C_Syndrome 0 C_Syndrome 1 31 34 0E 0B 13 15 16 19 1A 1C E3 E5 E6 E9 EA EC F1 F4 4F 4A 52 54 57 58 5B 5D A2 A4 A7 A8 AB AD Data Bit 14 or 142 Data Bit 15 or 143 Data Bit 16 or 144 Data Bit 17 or 145 Data Bit 18 or 146 Data Bit 19 or 147 Data Bit 20 or 148 Data Bit 21 or 149 Data Bit 22 or 150 Data Bit 23 or 151 Data Bit 24 or 152 Data Bit 25 or 153 Data Bit 26 or 154 Data Bit 27 or 155 Data Bit 28 or 156 Data Bit 29 or 157 Data Bit 30 or 15
Table E–5 Syndrome to Data Check Bits Table (Continued) Syndrome C_Syndrome 0 C_Syndrome 1 B0 B5 8F 8A 92 94 97 98 9B 9D 62 64 67 68 6B 6D 70 75 01 02 04 08 10 20 40 80 Data Bit 46 or 174 Data Bit 47 or 175 Data Bit 48 or 176 Data Bit 49 or 177 Data Bit 50 or 178 Data Bit 51 or 179 Data Bit 52 or 180 Data Bit 53 or 181 Data Bit 54 or 182 Data Bit 55 or 183 Data Bit 56 or 184 Data Bit 57 or 185 Data Bit 58 or 186 Data Bit 59 or 187 Data Bit 60 or 188 Data Bit 61 or 189 Data Bit 62 or 190 Data Bit 63 or 1
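The single-bit syndrome values listed for C_SYNDROME_0 and C_SYNDROME_1 lend themselves to a lookup table. The sketch below is built from the syndrome values as they appear in Tables D–21 and E–5 and returns the bit position within the quadword (0 through 63) or the check bit. The function name is hypothetical. Mapping the quadword bit to the wider data-bit numbers shown in Table E–5 (for example, 14 or 142) is left to the table, and any syndrome value not listed returns None.

```python
# Syndromes for data bits 0-63 within a quadword, in bit order,
# as listed in Tables D-21 and E-5.
_DATA_SYNDROMES = (
    "CE CB D3 D5 D6 D9 DA DC 23 25 26 29 2A 2C 31 34 "
    "0E 0B 13 15 16 19 1A 1C E3 E5 E6 E9 EA EC F1 F4 "
    "4F 4A 52 54 57 58 5B 5D A2 A4 A7 A8 AB AD B0 B5 "
    "8F 8A 92 94 97 98 9B 9D 62 64 67 68 6B 6D 70 75"
).split()

SYNDROME_TABLE = {
    int(text, 16): "data bit %d" % bit
    for bit, text in enumerate(_DATA_SYNDROMES)
}
# Check bits CB0-CB7 have the one-hot syndromes 01, 02, 04, ... 80.
SYNDROME_TABLE.update({1 << n: "check bit %d" % n for n in range(8)})


def lookup_syndrome(syndrome):
    """Return the failing bit named by a single-bit syndrome, or None."""
    return SYNDROME_TABLE.get(syndrome & 0xFF)


if __name__ == "__main__":
    for value in (0x31, 0xB0, 0x10):
        print(format(value, "02X"), lookup_syndrome(value))
```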