AlphaServer 4000/4100 Service Manual Order Number: EK–4100A–SV. B01 This manual is for anyone who services an AlphaServer 4000/4100 pedestal or cabinet system. It includes troubleshooting information, configuration rules, and instructions for removal and replacement of field-replaceable units (FRUs).
First Printing, June 1997 Digital Equipment Corporation makes no representations that the use of its products in the manner described in this publication will not infringe on existing or future patent rights, nor do the descriptions contained in this publication imply the granting of licenses to make, use, or sell equipment or software in accordance with the description.
Contents

Preface

Chapter 1
1.1 AlphaServer 4100 System Drawer (BA30A)
1.2 AlphaServer 4000 System Drawer (BA30C)
1.3 AlphaServer 4000 System Drawer (BA30B)
1.4 Cabinet System
...
2.8 Console Device Determination
2.9 Console Power-Up Display
2.10 Fail-Safe Loader

Chapter 3
3.1 Troubleshooting with LEDs
3.1.1 Cabinet Power and Fan LEDs
...
System Bus ECC Error
System Bus Nonexistent Address Error
System Bus Address Parity Error
PIO Buffer Overflow Error (PIO_OVFL)
Page Table Entry Invalid Error
...
7.16 PCI Motherboard (B3051) Removal and Replacement
7.17 Server Control Module Removal and Replacement
7.18 PCI/EISA Option Removal and Replacement
7.19 Power Supply Removal and Replacement
7.20 Power Harness (4100 & early 4000) Removal and Replacement
...
Appendix C Operating the System Remotely
C.1 RCM Console Overview
C.1.1 Modem Usage
C.1.2 Entering and Leaving Command Mode
C.1.3 RCM Commands
C.1.4 Dial-Out Alerts
...

Figures
1-1 Components of the BA30A System Drawer
1-2 Cover Interlock Circuit (BA30A)
1-3 Components of the BA30C System Drawer
1-4 Cover Interlock Circuit (BA30C)
...
4-8 Pedestal Power Distribution (Europe and AP)
5-1 Error Detector Placement
7-1 System Drawer FRU Locations
7-2 Location of 4100 Power System FRUs
...

Tables
...
2-2 SROM Tests
2-3 XSROM Tests
2-4 Memory Tests
2-5 IOD Tests
...
Preface Intended Audience This manual is written for the customer service engineer. Document Structure This manual uses a structured documentation design. Topics are organized into small sections for efficient online and printed reference. Each topic begins with an abstract, followed by an illustration or example, and ends with descriptive text.
• Appendix B, SRM Console Commands and Environment Variables, summarizes the commands used to examine and alter the system configuration. • Appendix C, Operating the System Remotely, describes how to use the remote console monitor (RCM) to monitor and control the system remotely. Documentation Titles Table 1 lists titles related to AlphaServer 4000/4100 systems.
Information on the Internet Using a Web browser you can access the AlphaServer InfoCenter at: http://www.digital.com/info/alphaserver/products.html Access the latest system firmware either with a Web browser or via FTP as follows: ftp://ftp.digital.com/pub/Digital/Alpha/firmware/ Interim firmware released since the last firmware CD is located at: ftp://ftp.digital.
Chapter 1 System Overview This chapter introduces the DIGITAL AlphaServer 4000 and the DIGITAL AlphaServer 4100 systems. These systems are available in cabinets or pedestals. There are three system drawers; two, the BA30B and the BA30C, are used in the AlphaServer 4000, and the third, the BA30A, is used in the AlphaServer 4100. The pedestal system has one system drawer and up to three StorageWorks shelves.
1.1 AlphaServer 4100 System Drawer (BA30A) Components in the BA30A system drawer are located in the system bus card cage, the PCI card cage, the control panel assembly, and the power and cooling section. The drawer measures 30 cm x 45 cm (11.8 in. x 17.7 in.) and fully configured weighs approximately 45.5 kg (~100 lbs).
➊ System card cage, which holds the system motherboard and the CPU, memory, bridge, and power control modules. (The difference between the BA30A and the BA30C is the system motherboard.)
➋ PCI/EISA card cage, which holds the PCI motherboard, option cards, and server control module.
➌ Server control module, which holds the I/O connectors and remote console monitor.
➍ Control panel assembly, which includes the control panel, a floppy drive, and a CD-ROM drive.
➎ Power and cooling section.
1.2 AlphaServer 4000 System Drawer (BA30C)
Components in the BA30C system drawer are located in the system bus card cage, PCI card cage, control panel assembly, and power and cooling section. The drawer measures 30 cm x 45 cm (11.8 in. x 17.7 in.) and fully configured weighs approximately 45.5 kg (~100 lbs).
Figure 1-3 Components of the BA30C System Drawer
When the system drawer is in a pedestal, the control panel assembly is mounted in a tray at the top of the drawer.
➊ System card cage, which holds the system motherboard and the CPU, memory, bridge, and power control modules. (The difference between the BA30A and the BA30C is the system motherboard.)
➋ PCI/EISA card cage, which holds the PCI motherboard, option cards, and server control module.
➌ Server control module, which holds the I/O connectors and remote console monitor.
➍ Control panel assembly, which includes the control panel, a floppy drive, and a CD-ROM drive.
➎ Power and cooling section.
1.3 AlphaServer 4000 System Drawer (BA30B) Components in the BA30B system drawer are located in the system bus card cage, two PCI card cages, the control panel assembly, and the power and cooling section. The drawer measures 30 cm x 45 cm (11.8 in. x 17.7 in.) and fully configured weighs approximately 45.5 kg (~100 lbs).
➊ System card cage holds the system motherboard, the CPU, memory, bridge, and power control modules. ➋ PCI/EISA card cage holds the PCI/EISA motherboard for PCI/EISA 0 and PCI 1, option cards, and server control module. ➌ Server control module holds the I/O connectors and remote console. ➍ Control panel assembly holds the control panel, a floppy, and a CD-ROM. ➎ Power and cooling section contains one to three power supplies and three fans. ➏ PCI card cage holds the PCI motherboard for PCI 2 and PCI 3.
1.4 Cabinet System The AlphaServer 4000/4100 cabinet system can accommodate multiple systems in a single cabinet. There are four cabinet variations that can hold different system configurations. Differences are in power distribution and drawer mounting; from the outside the cabinets look almost identical.
Cabinet Differences

Cabinet    Power                                     Mounting                        Destination
H9A10-EB   AC input box power strips                 C channel (max drawers: 4)      North America, Asia Pacific
H9A10-EC   AC input box power strips                 C channel (max drawers: 4)      Europe
H9A10-EL   Two 120 volt H7600-AA power controllers   Pull-out tray (max drawers: 3)  North America, Asia Pacific
H9A10-EM   Two 240 volt H7600-DB power controllers   Pull-out tray (max drawers: 3)  Europe

Cabinet System Fan Tray
At the top of cabinet systems is a fan tray containing three exha
1.5 Pedestal System
The pedestal system contains one system drawer with a control panel, a CD-ROM drive, and a floppy drive. In the pedestal control panel area there is space for an optional tape or disk drive. Three StorageWorks shelves provide up to 90 Gbytes of in-cabinet storage.
Figure 1-9 Pedestal System Front
In the pedestal system, the control panel is located at the top left in a tray. See Figure 1-11. There is space for an optional device beside it.
Figure 1-10 Pedestal System Rear
1.6 Control Panel and Drives
The control panel includes the On/Off, Halt, and Reset buttons and a display. In a pedestal system the control panel is located in a tray at the top of the system drawer. In a cabinet system it is at the bottom of the system drawer with the CD-ROM drive and the floppy drive.
Figure 1-11 Control Panel Assembly (cabinet and pedestal versions, showing the control panel, CD-ROM drive, and floppy drive)
➊ On/Off button. Powers the system drawer on or off.
NOTE: The LEDs on some modules are on when the line cord is plugged in, regardless of the position of the On/Off button.
➋ Halt button. Pressing this button in (so the LED at the top of the button is on) does the following:
• If DIGITAL UNIX or OpenVMS is running, halts the operating system and returns to the SRM console. The Halt button has no effect on Windows NT.
• If the Halt button is in when the system is reset or powered up, the system halts in the SRM console, regardless of the operating system.
1.7 System Consoles There are two console programs: the SRM console and the AlphaBIOS console. SRM Console Prompt On systems running the DIGITAL UNIX or OpenVMS operating system, the following console prompt is displayed after system startup messages are displayed, or whenever the SRM console is invoked: P00>>> NOTE: The console prompt displays only after the entire power-up sequence is complete. This can take up to several minutes if the memory is very large.
SRM Console The SRM console is a command-line interface that is used to boot the DIGITAL UNIX and OpenVMS operating systems. It also provides support for examining and modifying the system state and configuring and testing the system. The SRM console can be run from a serial terminal or a graphics monitor. AlphaBIOS Console The AlphaBIOS console is a menu-based interface that supports the Microsoft Windows NT operating system.
1.8 System Architecture Alpha microprocessor chips are used in these systems. The CPU, memory, and the I/O bridge module(s) are connected to the system bus motherboard.
AlphaServer 4000/4100 systems use the Alpha chip for the CPU. The CPU, memory, and I/O bridge modules are connected to the system bus motherboard. One bridge module connects to the PCI/EISA I/O buses; a second bridge module (4000 only) connects to another pair of PCI buses. A fourth type of module, the power control module, also plugs into the system motherboard. A fully configured 4100 system drawer can have up to four CPUs, four memory pairs, and a total of eight I/O options.
1.9 System Motherboard The system motherboard is on the floor of the system card cage. It has slots for the CPU, memory, power control, and bridge modules.
The system motherboard has the logic for the system bus. It is the backplane that holds the CPU, memory, bridge, and power control modules. Figure 1-13 shows diagrams of the three motherboards used in AlphaServer 4000/4100 systems. The module locations are designated by the callouts.
1.10 CPU Types AlphaServer 4000 and 4100 systems can be configured with one of several CPU variants. Variants are differentiated by CPU speeds and the presence or absence of a backup data cache external to the Alpha microprocessor chip. Figure 1-14 CPU Module Layout Typical Uncached CPU Typical Cached CPU PKW0422C-96 Alpha Chip Composition The Alpha chip is made using state-of-the-art chip technology, has a transistor count of 9.
Chip Description

Unit          Description
Instruction   8-Kbyte cache, 4-way issue
Execution     4-way execution; 2 integer units, 1 floating-point adder, 1 floating-point multiplier
Memory        Merge logic, 8-Kbyte write-through first-level data cache, 96-Kbyte write-back second-level data cache, bus interface unit

CPU Variants

Module Variant   Clock Frequency   Onboard Cache
B3001-CA         300 MHz           None
B3002-AB         300 MHz           2 Mbytes
B3004-BA         300 MHz           2 Mbytes
B3004-AA         400 MHz           4 Mbytes
B3004-DA         466 MHz           4 Mbytes

CPU Configuration R
1.11 Memory Modules Memory modules are used only in pairs — two modules of the same size and type. Each module provides either the low half or the high half of the memory space. The 4100 system drawer can hold up to four memory module pairs. The 4000 system drawer can hold up to two memory module pairs.
Memory Variants Each memory option consists of two identical modules. Each 4100 drawer supports up to four memory options, for a total of 4 Gbytes of memory; 4000 drawers support half that. Memory modules are used only in pairs and are available in 128-Mbyte, 512-Mbyte, and 1-Gbyte sizes. The 128-Mbyte option is synchronous memory, while the larger sizes are asynchronous memory (EDO).
1.12 Memory Addressing Alpha system memory addressing is unusual: the memory address space is determined not by the amount of physical memory installed but as a multiple of the size of the memory pair in slot MEM0x.
The rules for addressing memory are as follows:
1. Address space is determined by the memory pair in slot MEM0.
2. Memory pairs need not be the same size.
3. The memory pair in slot MEM0 must be the largest of all memory pairs. Other memory pairs may be as large but none may be larger.
4. The starting address of each memory pair is N times the size of the memory pair in slot MEM0. N=0,1,2,3.
5. Memory addresses are contiguous within each module pair.
6.
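The arithmetic behind rules 3 and 4 can be sketched in a few lines of Python. This is an illustration only, not console firmware code: the function name, the use of Mbytes as the unit, and the requirement that slot MEM0 be populated are assumptions made for the example.

```python
def pair_start_addresses(pair_sizes_mb):
    """Return the starting address, in Mbytes, of each installed memory pair.

    pair_sizes_mb lists the pair size for slots MEM0, MEM1, MEM2, MEM3 in
    order; use None for an empty slot. (Hypothetical helper for illustration.)
    """
    if not pair_sizes_mb or pair_sizes_mb[0] is None:
        # Assumption: MEM0 must be populated, since it sets the address space.
        raise ValueError("slot MEM0 must be populated")
    mem0_size = pair_sizes_mb[0]
    installed = [size for size in pair_sizes_mb if size is not None]
    if max(installed) > mem0_size:
        # Rule 3: no pair may be larger than the pair in MEM0.
        raise ValueError("the pair in slot MEM0 must be the largest")
    # Rule 4: pair N starts at N times the size of the pair in MEM0.
    return {
        f"MEM{n}": n * mem0_size
        for n, size in enumerate(pair_sizes_mb)
        if size is not None
    }


print(pair_start_addresses([1024, 512, 128]))
```

With a 1-Gbyte pair in MEM0, a 512-Mbyte pair in MEM1 starts at 1024 Mbytes and a 128-Mbyte pair in MEM2 starts at 2048 Mbytes, so a smaller pair leaves unused address space between its end and the start of the next pair.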
1.13 System Bus The system bus consists of a 40-bit command/address bus, a 128-bit plus ECC data bus, and several control signals and clocks.
The system bus motherboard carries a 40-bit command/address bus, a 128-bit plus ECC data bus, several control signals and clocks, and a bus arbiter. The bus requires that all CPUs have the same high-speed oscillator providing the clock to the Alpha chip. The AlphaServer 4100 system bus connects up to four CPUs, four pairs of memory modules, and a single I/O bus bridge module. Note that the I/O bus bridges may be designated as IODn, where n is the number of the PCI bus.
1.14 System Bus to PCI Bus Bridge Module The bridge module is the physical interconnect between the system motherboard and any PCI motherboard in the system.
The system bus to PCI bus bridge module converts system bus commands and data addressed to I/O space to PCI commands and data; and converts PCI bus commands and data addressed to system memory or CPUs to system bus commands and data. An AlphaServer 4100 system has one bridge module; an AlphaServer 4000 system can have a second bridge module.
1.15 PCI I/O Subsystem
The I/O subsystem is PCI. Both the 4100 and the 4000 have two four-slot PCI buses that hold up to eight I/O options. One of these buses can be both PCI and EISA, but it holds no more than four options, up to three of which may be EISA. The 4000 can have an additional two four-slot PCI buses, allowing a total of sixteen I/O options.
Figure 1-19 PCI Block Diagram (recoverable labels: PCI Bus; Serial Interrupt Logic; PCI-3, four 64-bit slots, 4000 only)
Table 1–1 PCI Motherboard Slot Numbering

Slot   PCI0                 PCI1                         PCI2 (4000 only)   PCI3 (4000 only)
0      Reserved             Reserved                     Reserved           Reserved
1      PCI to EISA bridge   Internal CD-ROM controller   Reserved           Reserved
2      PCI or EISA slot     PCI slot                     PCI slot           PCI slot
3      PCI or EISA slot     PCI slot                     PCI slot           PCI slot
4      PCI or EISA slot     PCI slot                     PCI slot           PCI slot
5      PCI slot             PCI slot                     PCI slot           PCI slot

The logic for two PCI buses is on each PCI motherboard.
1.16 Server Control Module The server control module enables remote console connections to the system drawer. The module passes the signals for COM ports 1 and 2, the keyboard, and the mouse to the standard I/O connectors.
The server control module has two sections: the remote console monitor (RCM) and the standard I/O. See Appendix C for information on controlling the system remotely. The remote console monitor connects to a modem through the modem port on the bulkhead. The RCM requires a 12V power connection. The standard I/O ports (keyboard, mouse, COM1 and COM2 serial, and parallel ports) are on the same bulkhead.
1.17 Power Control Module The power control module controls power sequencing and monitors power supply voltage, temperature, and fans.
The power control module performs these functions:
• Controls power sequencing.
• Monitors the combined output of power supplies and shuts down power if it is not in range.
• Monitors system temperature and shuts off power if it is out of range.
• Monitors the fans in the system drawer and on the CPU modules and shuts down power if a fan fails.
• Provides visual indication of faults through LEDs.
1.18 Power Supply The system drawer power supplies provide power only to components in the drawer. One or two power supplies are required, depending on the number of CPU modules and PCI card cages; a second or third can be added for redundancy. The power system is described in detail in Chapter 4.
Description One to three power supplies provide power to components in the system drawer. (They supply power only for the drawer in which they are located.) Three power supplies provide redundant power in fully loaded AlphaServer 4000/4100 systems. These power supplies share the load, and redundant configurations are supported. They autoselect line voltage (120V to 240V). Each has 450 W output and supplies up to 75A of 3.43V, 50A of 5.
Chapter 2 Power-Up This chapter describes system power-up testing and explains the power-up displays.
2.1 Control Panel
The control panel display indicates the likely device when testing fails.
Figure 2-1 Control Panel and LCD Display (labels: potentiometer access hole; On/Off, Halt, and Reset buttons; sample display: P0 TEST 11 CPU00)
• When the On/Off button LED is on, power is applied and the system is running. When it is off, the system is not running, but power may or may not be present. If power is present, the PCM or the power LED on the system bus to PCI bus bridge module should be flashing.
Table 2–1 Control Panel Display

Field   Content      Display                   Meaning
➊       CPU number   P0–P3                     CPU reporting status
➋       Status       TEST                      Tests are executing
                     FAIL                      Failure has been detected
                     MCHK                      Machine check has occurred
                     INTR                      Error interrupt has occurred
                     CPU0–3                    CPU module number
                     MEM0–3 and L, H, or *     Memory pair number and low module, high module, or either [2]
                     IOD0                      Bridge to PCI bus 0 [3]
                     IOD1                      Bridge to PCI bus 1 [3]
                     IOD2                      Bridge to PCI bus 2 [4]
                     IOD3                      Bridge to PCI bus 3 [4]
                     FROM0                     Flash ROM
                     COMBO                     COM co
2.2 Power-Up Sequence
Console and most power-up tests reside on the I/O subsystem, not on the CPU nor on any other module on the system bus.
Figure 2-2 Power-Up Flow
1. Power-Up/Reset
2. SROM code loaded into each CPU’s I-cache
3. SROM tests execute
4. XSROM loaded into each CPU’s S-cache
5. XSROM tests execute
6. SRM console loaded into memory
7. SRM console tests execute
8. SRM console either remains in the system or loads AlphaBIOS console
Definitions
SROM. The SROM is a 128-Kbit ROM on each CPU module.
the XBUS. Sector 2 of FEPROM 0 contains a duplicate copy of the code and is used if sector 0 is bad. FEPROM. Two 1-Mbyte programmable ROMs are on the XBUS on PCI0. FEPROM 0 contains two copies of the XSROM, the OpenVMS and DIGITAL UNIX PALcode, and the SRM console and decompression code. FEPROM 1 contains the AlphaBIOS and NT HALcode. See Figure 2-3. These two FEPROMs can be flash updated. Refer to Appendix A.
For the console to run, the path from the CPU to the XSROM must be functional. The XSROM resides in FEPROM0 on the XBUS, off the EISA bus, off PCI 0, off IOD 0. See Figure 2-4. This path is minimally tested by SROM.
The SROM contents are loaded into each CPU’s I-cache and executed on power-up/reset. After testing the caches on each processor chip, the SROM code tests the path to the XSROM. Once this path is tested and deemed reliable, layers of the XSROM are loaded sequentially into the processor chip on each CPU. None of the SROM or XSROM power-up tests are run from memory; all run from the caches in the CPU chip, thus providing excellent diagnostic isolation.
2.3 SROM Power-Up Test Flow The SROM tests the CPU chip and the path to the XSROM.
The Alpha chip built-in self-test tests the I-cache at power-up and upon reset. Each CPU chip loads its SROM code into its I-cache and starts executing it. If the chip is partially functional, the SROM code continues to execute. However, if the chip cannot perform most of its functions, that CPU hangs and that CPU’s pass/fail LED remains off. If the system has more than one CPU and at least one passes both the SROM and XSROM power-up tests, the system will bring up the console.
Table 2-2 lists the tests performed by the SROM.
2.4 SROM Errors Reported The SROM reports machine checks, pending interrupt/exception errors, and errors related to corruption of FEPROM 0. If SROM errors are fatal, the particular CPU will hang and only the CPU self-test pass LEDs and/or the LEDs on the system bus to PCI bus bridge module will indicate the failure.
2.5 XSROM Power-Up Test Flow
Once the SROM has completed its tests and verified the path to the FEPROM containing the XSROM code, it loads the first 8 Kbytes of XSROM into the primary CPU’s S-cache and jumps to it.
Figure 2-6 XSROM Power-Up Flowchart (recoverable boxes: XSROM banner to OCP/console device; Clear SC_FHIT (force hit); Enable all 3 S-cache banks; Run B-cache tests, print errors to OCP/console device; Run memory tests, print trace and errors to OCP/console device; Done message to console device)
After jumping to the primary CPU’s S-cache, the code then intentionally I-caches itself and is completely register based (no D-stream for stack or data storage is used). The only D-stream accesses are writes/reads during testing. Each FEPROM has sixteen 64-Kbyte sectors. The first sector contains B-cache tests, memory tests, and a fail-safe loader. The second sector contains PALcode. The third sector contains a copy of the first sector.
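The sector arithmetic is easy to check: sixteen 64-Kbyte sectors add up to the 1-Mbyte FEPROM size. The sketch below works that out and prints where the first three sectors fall. It assumes the sectors are laid out linearly from offset 0 within the FEPROM and are numbered from 0; the manual does not state the layout, and the names used are hypothetical.

```python
SECTOR_SIZE = 64 * 1024   # each FEPROM sector is 64 Kbytes
SECTOR_COUNT = 16         # sixteen sectors per FEPROM


def sector_range(sector):
    """Return (first, last) byte offsets of a sector within one FEPROM.

    Assumes a linear layout starting at offset 0 (illustrative only).
    """
    if not 0 <= sector < SECTOR_COUNT:
        raise ValueError("sector must be in the range 0-15")
    first = sector * SECTOR_SIZE
    return first, first + SECTOR_SIZE - 1


# Sixteen 64-Kbyte sectors make up one 1-Mbyte FEPROM.
assert SECTOR_SIZE * SECTOR_COUNT == 1024 * 1024

contents = [
    "B-cache tests, memory tests, fail-safe loader",
    "PALcode",
    "copy of the first sector",
]
for number, description in enumerate(contents):
    first, last = sector_range(number)
    print(f"sector {number}: 0x{first:05x}-0x{last:05x}  {description}")
```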
Table 2-4 Memory Tests

Test 20   Memory Data test
  Logic tested:  Data path to and from memory; data path on memory and RAMs
  Description:   01 – FF. Errors are reported as an 8-bit binary field. A set bit indicates a module failure. Bit <0> indicates pass/fail of MEM0_L; <1> indicates pass/fail of MEM0_H; <2> indicates pass/fail of MEM1_L; <7> indicates pass/fail of MEM3_H.

Test 21   Memory Address test
  Logic tested:  Address path to and from memory; address path on memory and RAMs
  Description:   Same as test 20.
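The 8-bit error field can be decoded mechanically. The following sketch is an illustration, not part of the XSROM: the function name is hypothetical, and the mapping for bits <3> through <6> is an assumption extrapolated from the four bits the table names (two bits per memory pair, low module first).

```python
def failing_memory_modules(error_field):
    """List the memory modules flagged in the 8-bit error field.

    The table names bits <0>, <1>, <2>, and <7>; the mapping of bits
    <3> through <6> is assumed to follow the same pattern.
    """
    if not 0 <= error_field <= 0xFF:
        raise ValueError("error field must be an 8-bit value")
    modules = []
    for bit in range(8):
        if error_field & (1 << bit):
            pair, high_half = divmod(bit, 2)   # two bits per memory pair
            modules.append(f"MEM{pair}_{'H' if high_half else 'L'}")
    return modules


print(failing_memory_modules(0x05))   # bits <0> and <2> set
print(failing_memory_modules(0x80))   # bit <7> set
```

A value of 05 flags MEM0_L and MEM1_L; a value of 80 flags MEM3_H.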
2.6 XSROM Errors Reported
The XSROM reports B-cache test errors and memory test errors. It also reports a warning if memory is illegally configured.

Example 2-2 XSROM Errors Reported at Power-Up

B-cache Error (CPU Error)
TEST ERR on cpu0              #CPU running the test
FRU cpu0
err# 2
tst# 11
exp: 5555555555555555         #Expected data
rcv: aaaaaaaaaaaaaaaa         #Received data
adr: ffff8                    #B-cache location error occurred

Memory Error (Memory Module Indicated)
20..21..
2.7 Console Power-Up Tests
Once the SRM console is loaded, it does further testing of each IOD. Table 2-5 describes the IOD power-up tests, and Table 2-6 describes the PCI motherboard power-up tests.

Table 2-5 IOD Tests

Test Number   Test Name             Description
1             IOD CSR Access test   Read and write all CSRs in each IOD.
2             Loopback test         Dense space writes to the IOD’s PCI dense space to check the integrity of ECC lines on the IODs.
Table 2-6 PCI Motherboard Tests (B3050 only)

Test Number   Test Name                                Diagnostic Name   Description
1             PCEB                                     pceb_diag         Tests the PCI to EISA bridge chip
2             ESC                                      esc_diag          Tests the EISA system controller
3             8K NVRAM                                 nvram_diag        Tests the NVRAM
4             Real-Time Clock                          ds1287_diag       Tests the real-time clock chip
5             Keyboard and Mouse                       i8242_diag        Tests the keyboard/mouse chip
6             Flash ROM                                flash_diag        Dumps contents of flash ROM
7             Serial and Parallel Ports and Floppy     combo_diag        Tests COM ports 1 and 2,
2.8 Console Device Determination
After the SROM and XSROM have completed their tasks, the SRM console program, as it starts, determines where to send its power-up messages.
Figure 2-7 Console Device Determination Flowchart
Power-Up/Reset or P00>>> Init
• Console envar = serial?
  Yes: Enable COM port 1 and send messages as system is powering up.
  No: Continue to the next check.
• Console envar = graphics?
  Yes: Is there a VGA adapter on PCI0?
    Yes: VGA becomes the console device.
    No: Enable COM port 1 and send messages as system is powering up.
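The decision made in Figure 2-7 can be summarized as a small function. This is a sketch of the flowchart logic only, not SRM console code; the function and parameter names are hypothetical, and falling back to COM1 for any unrecognized setting is an assumption, since the flowchart shows only the serial and graphics cases.

```python
def console_device(console_envar, vga_on_pci0):
    """Pick the console device following the Figure 2-7 decision order.

    console_envar is the value of the console environment variable;
    vga_on_pci0 is True when a VGA adapter is present on PCI0.
    """
    if console_envar == "serial":
        return "COM1"
    if console_envar == "graphics" and vga_on_pci0:
        return "VGA"
    # graphics requested without a VGA adapter on PCI0; any other value
    # also falls back to COM1 here (assumption).
    return "COM1"


for envar, vga in [("serial", True), ("graphics", True), ("graphics", False)]:
    print(envar, vga, "->", console_device(envar, vga))
```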
Console Device Options
The console device can be either a serial terminal or a graphics monitor. Specifically:
• A serial terminal connected to COM1 off the server control module. The terminal connected to COM1 must be set to 9600 baud. This baud rate cannot be changed.
• A graphics monitor off an adapter on PCI0.
Systems running Windows NT must have a graphics monitor as the console device and run AlphaBIOS as the console program.
2.9 Console Power-Up Display
The entire power-up display prints to a serial terminal (if the console environment variable is set to serial), and parts of it print to the control panel display. The last several lines print to either a serial terminal or a graphics monitor.

Example 2-3 Power-Up Display
SROM V1.0 on cpu0             ➊
SROM V1.0 on cpu1
SROM V1.0 on cpu2
SROM V1.0 on cpu3
XSROM V1.0 on cpu2            ➋
XSROM V1.0 on cpu1
XSROM V1.0 on cpu3
XSROM V1.
➊ At power-up or reset, the SROM code on each CPU module is loaded into that module’s I-cache and tests the module. If all tests pass, the processor’s LED lights. If any test fails, the LED remains off and power-up testing terminates on that CPU. The first determination of the primary processor is made, and the primary processor executes a loopback test to each PCI bridge. If this test passes, the bridge LED lights. If it fails, the LED remains off and power-up continues.
Example 2-3 Power-Up Display (Continued)
starting console on CPU 0
sizing memory
0 128 MB SYNC
1 128 MB SYNC
starting console on CPU 1
starting console on CPU 2
starting console on CPU 3
probing IOD1 hose 1
bus 0 slot 1 - NCR 53C810
bus 0 slot 2 - DECchip 21041-AA
bus 0 slot 3 - NCR 53C810
bus 0 slot 4 - DECchip 21040-AA
probing IOD0 hose 0
bus 0 slot 1 - PCEB
Configuring I/O adapters...
AlphaServer 4100 Console V1.
➏ The final primary CPU determination is made. The primary CPU unloads PALcode and decompression code from the FEPROM on PCI 0 to its B-cache. The primary CPU then jumps to the PALcode to start the SRM console. The primary CPU prints a message indicating that it is running the console. Starting with this message, the power-up display is printed to the default console terminal, regardless of the state of the console environment variable.
2.10 Fail-Safe Loader The fail-safe loader is a software routine that loads the SRM console image from floppy. Once the console is running you will want to run LFU to update FEPROM 0 with a new image. NOTE: FEPROM 0 contains images of the SROM, XSROM, PAL, decompression, and SRM console code. If the fail-safe loader loads, the following conditions exist on the machine: • The SROM has passed its tests and successfully unloaded the XSROM.
Chapter 3 Troubleshooting This chapter describes troubleshooting during power-up and booting, as well as diagnostics for AlphaServer 4000/4100 systems.
3.1 Troubleshooting with LEDs During power-up, reset, initialization, or testing, diagnostics are run on CPUs, memories, bridge modules, PCI motherboards, and sometimes options. The following sections describe possible problems that can be identified by checking LEDs.
CPU LEDs • If the CPU STP LED on any CPU module is lit, that CPU chip is functioning properly. If the operating system is NT and the CPU STP LED is off, that CPU may or may not be functioning. You can use the Halt button on the OCP to prevent the AlphaBIOS console (which turns off the CPU STP LED) from booting, thus assuring the validity of the CPU STP LED. If the LED is off, replace the CPU. If the LED is lit, you can use the SRM console command alphabios to load and run the AlphaBIOS console.
3.1.
A cabinet system has three exhaust fans at the top of the cabinet. They are powered from a small power supply in the fan tray. This power supply also powers the server control module at the bottom of the PCI card cage to allow remote access to the system. A failure of the power supply is indicated only by the LEDs. No messages are displayed. There are two LEDs on the top panel: a fan LED and a power LED. • When the fan LED (amber) is flashing, a cabinet fan needs replacing.
3.2 Troubleshooting Power Problems Power problems can occur before the system is up or while the system is running. If a system stops running, make a habit of checking the PCM. Power Problem List The system will halt for the following: 1. 2. 3. 4. 5. 6. 7. 8. 9.
If Power Problem Occurs at Power-Up
If the system has a power problem on a cold start, the PCM LEDs are not valid until after DCOK_SENSE has been asserted. The cause is one of the following:
• Broken system fan
• Broken CPU fan
• Power supplied to the system is out of tolerance (a power supply could be broken and the system could still power up)
• PCM failure
• Interlock failure
• Wire problems
• Temperature problem (unlikely)
Recommended Order for Troubleshooting Failure at Power-Up
1.
3.2.1 Power Control Module LEDs The PCM has 11 LEDs visible through the system card cage. The LED display shows the relative placement of the LEDs.
Table 3-1 Power Control Module LED States

LED          State   Description
DCOK_SENSE   On      Both +5.0V and +3.43V are present and within limits.
PS0_OK       On      Power supply 0 is present and has asserted POK_H.
PS1_OK       On      Power supply 1 is present and has asserted POK_H.
             Off     Power supply 1 not present.
PS2_OK       On      Power supply 2 is present and has asserted POK_H.
             Off     Power supply 2 not present.
TEMP_OK      On      The system temperature is below 55°C.
CPUFAN_OK    On      All CPU fans are OK.
             Off     A CPU fan has failed.
3.3 Maintenance Bus (I²C Bus)
The I²C bus (referred to as the “I squared C bus”) is a small internal maintenance bus used to monitor system conditions scanned by the power control module, write the fault display, store error state, and track configuration information in the system. Although all system modules (not I/O modules) sit on the maintenance bus, only the I²C controller accesses it. Everything written or read on the I²C bus is done by the controller.
Monitor The I²C bus monitors the state of system conditions scanned by the PCM. There are two registers on the PCM: • One records the state of the fans and power supplies and is latched when there is a fault. • The other causes an interrupt on the I²C bus when a CPU or system fan fails, an overtemperature condition exists, or power supplied to the system is out of tolerance. The interrupt received by the I²C bus controller on PCI 0 alerts the system of imminent power shutdown.
3.4 Running Diagnostics — Test Command The test command runs diagnostics on the entire system, CPU devices, memory devices, and the PCI I/O subsystem. The test command runs only from the SRM console. Ctrl/C stops the test. Example 3-1 Test Command Syntax P00>>> help test FUNCTION SYNOPSIS test ([-q] [-t
3.5 Testing an Entire System A test command with no modifiers runs all exercisers for subsystems and devices on the system. I/O devices tested are supported boot devices. The test runs for 10 minutes. Example 3-2 Sample Test Command P00>>> test Console is in diagnostic mode System test, runtime 600 seconds Type ^C to stop testing Configuring system.. polling ncr0 (NCR 53C810) slot 1, bus 0 PCI, hose 1 dka500.5.0.1.
ID Program Device Pass Hard/Soft Bytes Written Bytes Read -------- ------------ ------------ ------ --------- ------------- -----------00003047 memtest memory 1 0 0 134217728 134217728 00003050 memtest memory 205 0 0 213883392 213883392 00003059 memtest memory 192 0 0 200253568 200253568 00003062 memtest memory 192 0 0 200253568 200253568 00003084 memtest memory 80 0 0 82827392 82827392 000030d8 exer_kid dkb200.2.0.
3.5.1 Testing Memory The test mem command tests individual memory devices or all memory. The test shown in Example 3-3 runs for 2 minutes. Example 3-3 Sample Test Memory Command P00>>> test memory Console is in diagnostic mode System test, runtime 120 seconds Type ^C to stop testing Starting background memory test, affinity to all CPUs.. Starting memory thrasher on each CPU.. Starting memory thrasher on each CPU.. Starting memory thrasher on each CPU.. Starting memory thrasher on each CPU..
ID Program Device Pass Hard/Soft Bytes Written Bytes Read -------- ------------ ------------ ------ --------- ------------- -----------000046d7 memtest memory 1 0 0 583008256 583008256 000046e0 memtest memory 1456 0 0 1525491840 1525491840 000046e9 memtest memory 1446 0 0 1515007360 1515007360 000046f2 memtest memory 1444 0 0 1512910464 1512910464 000046fb memtest memory 550 0 0 575597952 ID Program Device Pass Hard/Soft Bytes Written 575597952 Bytes Read ------
3.5.2 Testing PCI The test pci command tests PCI buses and devices. The test runs for 2 minutes. Example 3-4 Sample Test Command for PCI P00>>> test pci* Console is in diagnostic mode System test, runtime 120 seconds Type ^C to stop testing Configuring all PCI buses.. polling ncr0 (NCR 53C810) slot 1, bus 0 PCI, hose 1 dka500.5.0.1.1 DKa500 RRD45 SCSI Bus ID 7 1645 polling ncr1 (NCR 53C810) slot 3, bus 0 PCI, hose 1 SCSI Bus ID 7 dkb200.2.0.3.1 DKb200 RZ29B 0007 dkb400.4.0.3.
ID Program Device Pass Hard/Soft Bytes Written Bytes Read -------- ------------ ------------ ------ --------- ------------- -----------00002c29 exer_kid dkb200.2.0.3 92 0 0 0 48689152 00002c2a exer_kid dkb400.4.0.3 92 0 0 0 48689152 00002c5e exer_kid dva0.0.0.100 0 0 0 0 286720 Testing aborted. Shutting down tests. Please wait..
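The status rows printed by the test command in Examples 3-2 through 3-4 share one column order: ID, Program, Device, Pass, Hard/Soft, Bytes Written, Bytes Read. As a minimal sketch, assuming a captured console log with one status row per line, the following Python flags exercisers that reported hard or soft errors. It is an offline helper, not part of the SRM console, and the function and field names are hypothetical.

```python
from typing import NamedTuple


class ExerciserRow(NamedTuple):
    """One status row from captured `test` output (field names are illustrative)."""
    proc_id: str
    program: str
    device: str
    passes: int
    hard: int
    soft: int
    bytes_written: int
    bytes_read: int


def parse_status_rows(log_text: str) -> list[ExerciserRow]:
    """Return the status rows found in a captured console log."""
    rows = []
    for line in log_text.splitlines():
        fields = line.split()
        # A status row has 8 whitespace-separated fields: an 8-digit hex ID,
        # program, device, and five decimal counters. Header and ruler lines
        # split into a different number of fields and are skipped.
        if len(fields) != 8 or len(fields[0]) != 8:
            continue
        try:
            int(fields[0], 16)
            counters = [int(f) for f in fields[3:]]
        except ValueError:
            continue
        rows.append(ExerciserRow(fields[0], fields[1], fields[2], *counters))
    return rows


def rows_with_errors(rows: list[ExerciserRow]) -> list[ExerciserRow]:
    """Keep only rows whose hard or soft error count is nonzero."""
    return [row for row in rows if row.hard or row.soft]
```

Fed the three rows at the end of Example 3-4, parse_status_rows returns three entries and rows_with_errors returns an empty list, because every hard and soft count is 0.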
Chapter 4 Power System This chapter describes the AlphaServer 4000/4100 power system: • Power Supply • Power Control Module Features • Power Circuit and Cover Interlocks • Power-Up/Down Sequencing • Cabinet Power Configuration Rules • Pedestal Power Configuration Rules (North America and Japan) • Pedestal Power Configuration Rules (Europe and Asia Pacific)
4.1 Power Supply Power supply outputs are shown in Figure 4-1. Figure 4-1 Power Supply Outputs Misc. Signal Current share +5V/Return +3.4V/Return +3.
Power Supply Features
• 90–264 Vrms input
• 450 watts output. Output voltages are as follows:

Output Voltage  Min. Voltage  Max. Voltage  Max. Current
+5.0            4.85          5.25          50
+3.43           3.400         3.465         75
+12             11.5          12.6          11
–12             –10.9         –13.2         0.2
–5.0            –4.6          –5.5          0.2
Vaux            8.5           9.5           0.05

• Remote sense on +5.0V and +3.43V
  +5.0V is sensed on all CPUs in the system, the system bus motherboard, and the PCI bus motherboard(s).
  +3.43V is sensed on all CPUs in the system and the system bus motherboard.
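When output voltages are measured with a meter during troubleshooting, the minimum and maximum values in the output-voltage table above can be checked mechanically. The following Python is a minimal sketch of that comparison. The dictionary keys and the function name are hypothetical, and the measured values in the usage note are invented for illustration.

```python
# Limits copied from the output-voltage table above, in volts.
# The table lists the negative rails by magnitude (for example -10.9 to -13.2),
# so each pair is sorted before it is used as a numeric range.
OUTPUT_LIMITS = {
    "+5.0": (4.85, 5.25),
    "+3.43": (3.400, 3.465),
    "+12": (11.5, 12.6),
    "-12": (-10.9, -13.2),
    "-5.0": (-4.6, -5.5),
    "Vaux": (8.5, 9.5),
}


def out_of_tolerance(measured: dict[str, float]) -> list[str]:
    """Return the names of the measured outputs that fall outside their limits."""
    failing = []
    for rail, volts in measured.items():
        low, high = sorted(OUTPUT_LIMITS[rail])
        if not low <= volts <= high:
            failing.append(rail)
    return failing
```

For example, out_of_tolerance({"+5.0": 5.0, "+3.43": 3.39, "-12": -12.0}) returns ["+3.43"], because 3.39 V is below the 3.400 V minimum while the other two readings are within limits.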
4.2 Power Control Module Features The power control module (54-24117-01) is located behind the B3040-AA module, the system bus to PCI bus bridge module.
The power control module performs the following functions: • Controls the power-up/down sequencing. • Monitors the combined output of power supplies VDD (3.43V) and VCC (5.0V). It asserts DCOK_SENSE if both voltages are within range, and asserts POWER_FAULT_L, causing an immediate power shutdown, if either is not. • Monitors system temperature and asserts TEMP_FAIL if the temperature exceeds 55° C.
4.
Figure 4-3 shows the distribution of power throughout the system drawer. An open in the circuit, the PCM signal POWER_FAULT_L, or the SCM signal RSM_DC_EN_L interrupts DC power applied to the system. Opens can be caused by the On/Off button or the cover interlocks. The POWER_FAULT_L signal is asserted by the PCM if it detects a fault, and RSM_DC_EN_L is controlled remotely. A failure anywhere in the circuit results in the removal of DC power.
4.4 Power-Up/Down Sequence The On/Off button can be controlled manually or remotely. The button is on the OCP. Remote power control is provided through the remote I/O port connected to the PCI. The power-up/down sequence flow is shown below.
When AC is applied to the system, Vaux (auxiliary voltage) is asserted and is sensed by the PCM. The PCM asserts DC_ENABLE_L starting the power supplies. If there is a hard fault on power-up, the power supplies shut down immediately; otherwise, the power system powers up and remains up until the system is shut off or the PCM senses a fault. If a power fault is sensed, the power system attempts to restore power and will do so if the fault is not sensed a second time.
4.5 Cabinet Power Configuration Rules There are four cabinets with different power delivery systems. See page 1-9 for a description of differences. A barcode label designating the cabinet variation is located inside the back door in the upper left corner of the bezel holding the door. The four variations are: H9A10 -EB, -EC, -EL, -EM. Figure 4-5 Simple -EB & -EC Cabinet Power Configuration StorageWorks StorageWorks Power Strips 0.38 Arms 0.38 Arms 0.38 Arms 0.38 Arms 0.38 Arms 0.
Figure 4-6 Worst-Case -EB & -EC Cabinet Power Configuration StorageWorks Power Strips StorageWorks System Drawer System Drawer 0.38 Arms 0.38 Arms 1.83 Arms 1.83 Arms 1.83 Arms 1.83 Arms 0.38 Arms 0.38 Arms 1.83 Arms 1.83 Arms 1.83 Arms 1.83 Arms 10A System Drawer Fan Tray 10A 1.83 Arms 0.5 Arms 1.83 Arms 1.83 Arms 1.83 Arms System Drawer 10A AC Distribution Box 7.8 Arms 8.1 Arms 8.1 Arms 10A 200 - 240 Vrms 24.
Figure 4-7 -EL & -EM Single Drawer Cabinet Power Configuration (Single drawer -EM shown with H7600-DB controller) 2 Power Controllers StorageWorks 0.38 Arms 0.38 Arms 0.38 Arms 0.38 Arms 0.38 Arms 0.38 Arms 0.38 Arms 0.38 Arms 1.83 Arms 1.83 Arms 0.5 Arms StorageWorks StorageWorks StorageWorks System Drawer 1.83 Arms 0.38 Arms 0.38 Arms 0.38 Arms 0.38 Arms 0.38 Arms 0.38 Arms 0.38 Arms 0.38 Arms StorageWorks StorageWorks StorageWorks StorageWorks 240 V, 16 AMP Controller with 12 IEC C13 outlets (Europe & A.P.
Figure 4-8 -EL Three Drawer Cabinet Power Configuration (Three drawer -EL shown with H7600-AA controller) 2 Power Controllers System Drawer System Drawer 3.67 Arms 3.67 Arms 3.67 Arms 3.67 Arms 3.67 Arms StorageWorks System Drawer 0.75 Arms 0.75 Arms 1.0 Arms 3.67 Arms 3.67 Arms 3.67 Arms StorageWorks 3.67 Arms 0.75 Arms 0.75 Arms 120 V, 14 AMP Controller with 10 NEMA 5-15 outlets (N.A. & A.P.
4.6 Pedestal Power Configuration Rules (North America and Japan) Figure 4-9 Pedestal Power Distribution (N.A. and Japan) StorageWorks StorageWorks Power Strips 0.75 0.75 0.75 0.75 Arms Arms Arms Arms System Drawer 3.67 Arms 3.67 Arms 3.67 Arms 15A 100 - 120 Vrms 11.0 Arms 100 - 120 Vrms 3.0 Arms Total Power Available (Assuming a 15 A branch) Single Drawer Single StorageWorks Shelf Outlets Power Strip 4-14 PKW0406B-95 N.
4.7 Pedestal Power Configuration Rules (Europe and Asia Pacific) Figure 4-10 Pedestal Power Distribution (Europe and AP) Power Strips 0.34 0.34 0.34 0.34 StorageWorks Arms Arms Arms Arms StorageWorks System Drawer 10A 1.67 Arms 1.67 Arms 1.67 Arms 10A 200 - 240 Vrms 5.0 Arms 200 - 240 Vrms 3.0 Arms Total Power Available Single Drawer Single StorageWorks Shelf Outlets Power Strip PKW0406C-95 2200 VA per power strip 1100 VA 150 VA 10 IEC 320 receptacles max.
Chapter 5 Error Logs This chapter provides information on troubleshooting with error logs. The following topics are covered: • Using Error Logs • Using DECevent • Error Log Examples and Analysis • Troubleshooting IOD-Detected Errors • Double Error Halts and Machine Checks While in PAL Mode Error registers are described in Chapter 6.
5.1 Using Error Logs Error detection is performed by CPUs, the IOD, and the EISA to PCI bus bridge. (The IOD is the acronym used by software to refer to the system bus to PCI bus bridge.
Lines Protected Device ECC Protected System bus data lines IOD on every transaction, CPU when using the bus B-cache IOD on every transaction, CPU when using the bus Parity Protected System bus command/address lines IOD on every transaction, CPU when using the bus Duplicate tag store IOD on every transaction, CPU when using the bus B-cache index lines CPU PCI bus IOD EISA bus EISA bridge As shown in Figure 5-1 and the accompanying table, the CPU chip is isolated by transceivers (XVER) from th
5.1.1 Hard Errors There are two categories of hard errors: • System-independent errors detected by the CPU. These errors are processor machine checks handled as MCHK 670 interrupts and are: Internal EV5 or EV56 cache errors CPU B-cache module errors • System-dependent errors detected by both the CPU and IOD.
5.1.3 Error Log Events Several different events are logged by OpenVMS and DIGITAL UNIX. Windows NT does not log errors in this fashion. Table 5-1 Types of Error Log Events Error Log Event Description MCHK 670 Processor machine checks. These are synchronous errors that report precisely what happened at the time the error occurred. They are detected inside the CPU chip and are fatal errors. MCHK 660 System machine checks. These are asynchronous errors that are recorded after the error has occurred.
5.2 Using DECevent DECevent produces bit-to-text ASCII reports derived from system event entries or user-supplied event logs. The format of the reports is determined by commands, qualifiers, parameters, and keywords appended to the command. The maximum command line length is 255 characters.
5.2.1 Translating Event Files To produce a translated event report using the default event log file, SYS$ERRORLOG:ERRLOG.SYS, enter the following command: OpenVMS $ DIAGNOSE DIGITAL UNIX > dia -a The DIAGNOSE command allows DECevent to use built-in defaults. This command produces a full report, directed to the terminal screen, from the input event file, SYS$ERRORLOG:ERRLOG.SYS. The /TRANSLATE qualifier is understood on the command line. To select an alternate input file OpenVMS $ DIAGNOSE ERRORLOG.
To reverse the order of the input events OpenVMS $ DIAGNOSE/TRANSLATE/REVERSE DIGITAL UNIX > dia -R These commands reverse the order in which events are displayed. The default order is forward chronologically. 5.2.2 Filtering Events /INCLUDE and /EXCLUDE qualifiers allow you to filter input event log files. The /INCLUDE qualifier is used to create output for devices named in the command.
Use the /BEFORE and /SINCE qualifiers to select events before or after a certain date and time. OpenVMS $ DIAGNOSE/TRANSLATE/BEFORE=15-JAN-1996:10:30:00 or $ DIAGNOSE/TRANSLATE/SINCE=15-JAN-1996:10:30:00 DIGITAL UNIX > dia -t s:15-jan-1996 e:20-jan-1996 If no time is specified, the default time is 00:00:00, and all events for that day are selected. The /BEFORE and /SINCE qualifiers can be combined to select a certain period of time.
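When a time window has to be typed repeatedly, the qualifier strings can be generated rather than entered by hand. The following Python is a minimal sketch that builds an OpenVMS DIAGNOSE command line in the date format shown above and enforces the 255-character limit from Section 5.2. The function names are hypothetical, and the order in which /SINCE and /BEFORE are appended when both are given is an assumption, since the text says only that the two can be combined.

```python
from datetime import datetime
from typing import Optional

MAX_COMMAND_LENGTH = 255  # maximum command line length stated in Section 5.2

_MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
           "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def vms_timestamp(when: datetime) -> str:
    """Format a datetime as DD-MMM-YYYY:HH:MM:SS, as in the examples above."""
    month = _MONTHS[when.month - 1]
    return f"{when.day:02d}-{month}-{when.year:04d}:{when:%H:%M:%S}"


def diagnose_command(since: Optional[datetime] = None,
                     before: Optional[datetime] = None) -> str:
    """Build a DIAGNOSE/TRANSLATE command limited to a time window."""
    command = "DIAGNOSE/TRANSLATE"
    if since is not None:
        command += "/SINCE=" + vms_timestamp(since)
    if before is not None:
        # ASSUMPTION: qualifier order when combining is not given in the text.
        command += "/BEFORE=" + vms_timestamp(before)
    if len(command) > MAX_COMMAND_LENGTH:
        raise ValueError("command exceeds the 255-character limit")
    return command
```

Calling diagnose_command(before=datetime(1996, 1, 15, 10, 30)) yields DIAGNOSE/TRANSLATE/BEFORE=15-JAN-1996:10:30:00, which matches the first OpenVMS example above.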
5.2.3 Selecting Alternative Reports Table 5-2 describes the DECevent report formats. Report formats are mutually exclusive. No combinations are allowed. The default format is /Full.
5.3 Error Log Examples and Analysis The following sections provide examples and analysis of error logs. 5.3.1 MCHK 670 CPU-Detected Failure The error log in Example 5-1 shows the following: ➊ CPU1 logged the error in a system with two CPUs. ➋ During a D-ref fill, the External Interface Status Register logged an uncorrectable ECC error. (When a CPU chip does not find data it needs to perform a task in any of its caches, it requests data from off the chip to fill its D-cache. It performs a “D-ref fill.
Example 5-1 MCHK 670 Logging OS System Architecture Event sequence number Timestamp of occurrence Host name System type register 2. DIGITAL UNIX 2. Alpha 4.
Icache Par Err Stat Reg Dcache Par Err Stat Reg Virtual Address Reg Memory Mgmt Flt Sts Reg Scache Address Reg Scache Status Reg Bcache Tag Address Reg TEST_STATUS_H Pin Asserted x00000000 x00000000 xFFFFFFFE8F63BD38 x000000000166D1 Ref which caused err was a write Ref resulted in DTB miss RA Field x0000000000001B Opcode Field x0000000000002C xFFFFFF00000254BF x00000000 xFFFFFF80E98F7FFF External cache hit Parity for ds and v bits Cache block dirty Cache block valid Ext cache tag addr parity bit Tag addre
MC_Command x00000008 Device Id x0000003A CAP Error Register PCI Bus Trans Error Adr MDPA Status Register MDPA Error Syndrome Reg x00000000 x00000000 x00000000 x00000000 MDPB Status Register MDPB Error Syndrome Reg x00000000 x00000000 ** IOD SUBPACKET -> ** WHOAMI x000000BB CAP Error Register PCI Bus Trans Error Adr MDPA Status Register MDPA Error Syndrome Reg x00000000 xC0018B48 x00000000 x00000000 MDPB Status Register MDPB Error Syndrome Reg x00000000 x00000000 (no error seen) ➌ MDPA Chip Revis
Cycle 2 ECC Syndrome x00000000 Cycle 3 ECC Syndrome x00000000 PALcode Revision Palcode Rev: 1.
5.3.2 MCHK 670 CPU and IOD-Detected Failure The error log in Example 5-2 shows the following: ➊ CPU3 logged the error in a system with four CPUs. ➋ The External Interface Status Register logged an uncorrectable ECC error during a D-ref fill. (When a CPU chip does not find data it needs to perform a task in any of its caches, it requests data from off the chip to fill its D-cache. It performs a “D-ref fill.”) Bit <30> is set, indicating that the source of the error is memory or the system.
Example 5-2 MCHK 670 CPU and IOD-Detected Failure Logging OS System Architecture Event sequence number Timestamp of occurrence Host name System type register 2. DIGITAL UNIX 2. Alpha 6.
Dcache Par Err Stat Reg Virtual Address Reg Memory Mgmt Flt Sts Reg Scache Address Reg Scache Status Reg Bcache Tag Address Reg x00000000 x00000001407D6000 x00000000011A10 Ref resulted in DTB miss RA Field x0000000008 Opcode Field x00000000000023 xFFFFFF00000254BF x00000000 xFFFFFF80286F7FFF External cache hit Parity for ds and v bits Cache block dirty Cache block valid Ext cache tag addr parity bit Tag address<38:20> is x00000000000286 Ext Interface Address Reg xFFFFFF0028681A8F Fill Syndrome Reg x00000
MC Error Info Register 0 MC Error Info Register 1 x28681A80 x800FD800 ➏ MC bus trans addr <31:4> x028681A8 MC bus trans addr <39:32> x00000000 ➍ ➎ MC_Command x00000018 Device Id x0000003F MC error info valid CAP Error Register xC0000000 PCI Bus Trans Error Adr MDPA Status Register MDPA Error Syndrome Reg x000003FD x00000000 x00000000 MDPB Status Register x80000000 MDPB Error Syndrome Reg x0000000000004B x0000004B Uncorrectable ECC err det by MDPB ➌ MC error info latched MDPA Chip Revision x
MC_Command x00000018 CAP Error Register xC0000000 Device Id x0000003F MC error info valid Uncorrectable ECC err det by MDPB MC error info latched PCI Bus Trans Error Adr MDPA Status Register MDPA Error Syndrome Reg x00000000 x00000000 x00000000 MDPB Status Register x80000000 MDPB Error Syndrome Reg x0000000000004B x0000004B ➍ ➎ ➌ MDPA Chip Revision x00000000 Cycle 0 ECC Syndrome x00000000 Cycle 1 ECC Syndrome x00000000 Cycle 2 ECC Syndrome x00000000 Cycle 3 ECC Syndrome x00000000 MDPB Chip Revisi
5.3.3 MCHK 670 Read Dirty CPU-Detected Failure The error log in Example 5-3 shows the following: ➊ CPU0 logged the error in a system with two CPUs. ➋ The External Interface Status Register records an uncorrectable ECC error from the system (bit <30> set). ➌ Both IOD CAP Error Registers logged an error. ➍ The MC Error Info Registers 0 and 1 have captured the error information.
Example 5-3 MCHK 670 Read Dirty Failure Logging OS System Architecture Event sequence number Timestamp of occurrence Host name System type register Number of CPUs (mpnum) 2. DIGITAL UNIX 2. Alpha 4.
Icache Par Err Stat Reg Dcache Par Err Stat Reg Virtual Address Reg Memory Mgmt Flt Sts Reg Floating Point Instructions will cause FEN Exceptions. PAL Shadow Registers Enabled. Correctable Error Interrupts Enabled. ICACHE BIST (Self Test) Was Successful.
Bridge to PCI Transactions: Enabled Bridge REQUESTS 64 Bit Data Transactions Bridge ACCEPTS 64 Bit Data Transactions PCI Address Parity Check: Enabled MC Bus CMD/Addr Parity Check: Enabled MC Bus NXM Check: Enabled Check ALL Transactions for Errors Use MC_BMSK for 16 Byte Align Blk Mem Wrt Wrt PEND_NUM Threshold: 8.
MC-PCI Command Register x06480FF1 PCI Class Code x00000600 Module SelfTest Passed LED on Delayed PCI Bus Reads Protocol: Enabled Bridge to PCI Transactions: Enabled Bridge REQUESTS 64 Bit Data Transactions Bridge ACCEPTS 64 Bit Data Transactions PCI Address Parity Check: Enabled MC Bus CMD/Addr Parity Check: Enabled MC Bus NXM Check: Enabled Check ALL Transactions for Errors Use MC_BMSK for 16 Byte Align Blk Mem Wrt Mem Host Address Ext Reg IO Host Adr Ext Register Interrupt Ctrl Register Struct:Enabled
5.3.4 MCHK 660 IOD-Detected Failure (System Bus Error) The error log in Example 5-4 shows the following: ➊ CPU0 logged the error in a system with two CPUs. ➋ The External Interface Status Register does not record an error. ➌ Both IOD CAP Error Registers logged an error. ➍ The MC Error Info Registers 0 and 1 captured the error information. ➎ The commander at the time of the error was CPU3 (known from MC_ERR1). ➏ The command on the bus at the time was a write-back memory command.
Example 5-4 MCHK 660 IOD-Detected Failure (System Bus Error) Logging OS System Architecture Event sequence number Timestamp of occurrence Host name System type register Number of CPUs (mpnum) 2. DIGITAL UNIX 2. Alpha 6. 04-APR-1996 17:20:04 whip16 x00000016 x00000002 AlphaStation 4x00 CPU logging event (mperr) x00000000 Event validity 1. O/S claims event is valid Event severity 1. Severe Priority Entry type 100.
RA Field Scache Address Reg Scache Status Reg Bcache Tag Address Reg x0000000006 Opcode Field x00000000000029 xFFFFFF0000024EAF x00000000 xFFFFFF80FFED6FFF Parity for ds and v bits Cache block dirty Cache block valid Tag address<38:20> is x00000000000FFE Ext Interface Address Reg xFFFFFF00FC00000F Fill Syndrome Reg x0000000000C5D2 Ext Interface Status Reg xFFFFFFF004FFFFFF LD LOCK Error occurred during D-ref fill xFFFFFF000020065F ** IOD SUBPACKET -> ** WHOAMI x000000BA ➋ IOD 0 Register Subpacket De
PCI Bus Trans Error Adr MDPA Status Register x00000000 x80000000 MDPA Error Syndrome Reg x0000000000001E x1E00001E MDPA Chip Revision x00000000 MDPA Error Syndrome of uncorrectable read error Cycle 0 ECC Syndrome Cycle 1 ECC Syndrome x00000000 Cycle 2 ECC Syndrome x00000000 Cycle 3 ECC Syndrome x0000000000001E MDPB Status Register MDPB Error Syndrome Reg x00000000 x00000000 ** IOD SUBPACKET -> ** WHOAMI x000000BA MDPB Chip Revision x00000000 Cycle 0 ECC Syndrome x00000000 Cycle 1 ECC Syndrome x0000
MDPA Error Syndrome Reg x1E00001E MDPB Status Register MDPB Error Syndrome Reg x00000000 x00000000 PALcode Revision 5-30 MDPA Error Syndrome of uncorrectable read error Cycle 0 ECC Syndrome x00000000 Cycle 1 ECC Syndrome x00000000 Cycle 2 ECC Syndrome x00000000 Cycle 3 ECC Syndrome x00000000 MDPB Chip Revision x00000000 Cycle 0 ECC Syndrome x00000000 Cycle 1 ECC Syndrome x00000000 Cycle 2 ECC Syndrome x00000000 Cycle 3 ECC Syndrome x00000000 Palcode Rev: 1.
5.3.5 MCHK 660 IOD-Detected Failure (PCI Error) The error log in Example 5-5 shows the following: ➊ CPU 0 logged the error in a system with three CPUs. ➋ The External Interface Status register records that the error occurred during a D-ref Fill but does not indicate what the error is. ➌ The CAP Error register for IOD0 did not see an error. ➍ The CAP Error register for IOD1, however, records a serious error. ➎ The MC Error Info registers 0 and 1 captured the error information.
Example 5-5 MCHK 660 IOD-Detected Failure (PCI Error) Logging OS System Architecture Event sequence number Timestamp of occurrence Host name System type register Number of CPUs (mpnum) 2. DIGITAL UNIX 2. Alpha 2. 27-AUG-1996 08:15:41 mason3 x00000016 x00000003 CPU logging event (mperr) x00000000 Event validity Event severity Entry type 2.
PALTEMP21 PALTEMP22 PALTEMP23 Exception Address Reg Exception Summary Reg Exception Mask Reg PAL Base Address Reg x0000000000000008 Interrupt Summary Reg x0000000000000000 IBOX Ctrl and Status Reg Icache Par Err Stat Reg Dcache Par Err Stat Reg Virtual Address Reg Memory Mgmt Flt Sts Reg xFFFFFC000043CD40 xFFFFFC000058D540 x000000007FC67A58 xFFFFFC0000433E0C Native-mode Instruction Exception PC x3FFFFF000010CF83 x0000000000000000 x0000000000000000 x0000000000020000 Base Addr for PALcode: x0000000000200000
MC-PCI Command Register Mem Host Address Ext Reg IO Host Adr Ext Register Interrupt Ctrl Register Struct:Enabled Interrupt Request Interrupt Mask0 Register PCI Class Code x00000600 x06470FB1 Module SelfTest Passed LED on Delayed PCI Bus Reads Protocol: Enabled Bridge to PCI Transactions: Enabled Bridge WILL NOT REQUEST 64 Bit Data Trans Bridge ACCEPTS 64 Bit Data Transactions PCI Address Parity Check: Enabled MC Bus CMD/Addr Parity Check: Enabled MC Bus NXM Check: Enabled Check ALL Transactions for Errors
Mem Host Address Ext Reg IO Host Adr Ext Register Interrupt Ctrl Register Struct:Enabled Interrupt Request RD_TYPE Memory Prefetch Algorithm: Short RL_TYPE Mem Rd Line Prefetch Type: Medium RM_TYPE Mem Rd Multiple Cmd Type: Long ARB_MODE Arbitration: MC-PCI Priority Mode x00000000 HAE Sparse Mem Adr<31:27> x00000000 x00000000 PCI Upper Adr Bits<31:25> x00000000 x00000003 Write Device Interrupt Info x00800000 Interrupt Mask0 Register Interrupt Mask1 Register x00C51111 x00000000 MC Error Info Register 0
Base Address Register 6 x00000000 Expansion Rom Base Addres x00000000 Interrupt P1 x04 Interrupt P2 x01 Min Gnt x00 Max Lat x00 CONFIG Address Device and Vendor ID x000000FBC0001000 x00011069 Mylex DAC960 KZPSC RAID Controller Vendor ID: x1069 (Mylex) Device ID: x00000001 Command Register x0147 I/O Space Accesses Response: Enabled Memory Space Accesses Response: Enabled PCI Bus Master Capability: Enabled Monitor for Special Cycle Ops: DISABLED Generate Mem Wrt/Invalidate Cmds: DISABLED Parity Error Detect
Fast Back-to-Back to Different Targets, Is Not Supported in Target Device. Device Select Timing: Medium. Revision ID x02 Device Class Code x010000 Mass Storage: SCSI Bus Controller Cache Line S x00 Latency T.
Command Register Status Register x0147 I/O Space Accesses Response: Enabled Memory Space Accesses Response: Enabled PCI Bus Master Capability: Enabled Monitor for Special Cycle Ops: DISABLED Generate Mem Wrt/Invalidate Cmds: DISABLED Parity Error Detection Response: Normal Wait Cycle Address/Data Stepping: DISABLED SERR# Sys Err Driver Capability: Enabled Fast Back-to-Back to Many Target: DISABLED xE2C0 Device is 33 Mhz Capable. Device Supports User Defineable Features.
5.3.6 MCHK 630 Correctable CPU Error The error log in Example 5-6 shows the following: ➊ CPU0 logged the error in a system with two CPUs. ➋ During a D-ref fill, the External Interface Status Register shows no error but states that the “data source is b-cache.” (When a CPU chip does not find data it needs to perform a task in any of its caches, it requests data from off the chip to fill its D-cache. It performs a D-ref fill.) ➌ Both IOD CAP Error Registers logged no error.
Example 5-6 MCHK 630 Correctable CPU Error Logging OS System Architecture Event sequence number Timestamp of occurrence Host name System type register Number of CPUs (mpnum) 2. DIGITAL UNIX 2. Alpha 415. 09-MAY-1996 14:56:30 whip16 x00000016 x00000002 AlphaStation 4x00 ➊ CPU logging event (mperr) x00000000 Event validity Event severity Entry type 1. O/S claims event is valid 3. High Priority 100. CPU Machine Check Errors CPU Minor class 3.
5.3.7 MCHK 620 Correctable Error The MCHK 620 error is a correctable error detected by the IOD. The error log in Example 5-7 shows the following: ➊ CPU0 logged the error in a system with two CPUs. ➋ The External Interface Status Register is not valid. ➌ The MC Error Info Registers 0 and 1 captured the error information. ➍ The commander at the time of the error was CPU0. ➎ The command at the time of the error was a write-back memory command. The IOD detected a recoverable error on the system bus.
System Revision x00000000 Machine Check Reason Ext Interface Status Reg x0204 IOD Detected Soft Error x0000000000000000 ➋ Not Valid for 620 System Correctable Errors Ext Interface Address Reg x0000000000000000 Not Valid for 620 System Correctable Errors Fill Syndrome Reg x0000000000000000 Not Valid for 620 System Correctable Errors Interrupt Summary Reg x0000000000000000 Not Valid for 620 System Correctable Errors WHOAMI x00000000 Module Revision 0. MID 0. GID 0.
5.4 Troubleshooting IOD-Detected Errors Step 1 Read the CAP Error Registers on both PCI bridges (F9E0000880 and FBE0000880). If one or both of these registers shows an error, match the register contents with the data pattern and perform the action indicated.
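Step 1 amounts to reading one register at a fixed offset from each bridge and noting which bridge latched an error. The following Python is a minimal sketch of that bookkeeping. The IOD 0 base address (F9E0000000) appears in the INFO 5 output in Example 5-9. The IOD 1 base address is inferred from the FBE0000880 address given above, and treating any nonzero CAP Error Register value as "shows an error" is an assumption based on the x00000000 "no error seen" entries in the examples. All names are hypothetical.

```python
# IOD 0 base is taken from Example 5-9; IOD 1 base is inferred from FBE0000880.
IOD_BASE = {0: 0xF9E0000000, 1: 0xFBE0000000}
CAP_ERR_OFFSET = 0x880  # F9E0000880 - F9E0000000


def cap_err_address(iod: int) -> int:
    """Return the physical address of the CAP Error Register for an IOD."""
    return IOD_BASE[iod] + CAP_ERR_OFFSET


def iods_with_errors(cap_err_values: dict[int, int]) -> list[int]:
    """Return the IOD numbers whose CAP Error Register value is nonzero.

    ASSUMPTION: a value of zero means the bridge saw no error.
    """
    return [iod for iod, value in sorted(cap_err_values.items()) if value != 0]
```

With the values from Example 5-1 as input, iods_with_errors({0: 0x00000000, 1: 0x00000000}) returns an empty list, so the analysis would not continue with the IOD-specific steps that follow.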
5.4.1 System Bus ECC Error Step 2 Read the MC_ERR1 register and match the contents with the data pattern. Perform the action indicated.
5.4.2 System Bus Nonexistent Address Error Step 3 Determine which node (if any) should have responded to the command/address identified in MC_ERR1. Perform the action indicated.
5.4.3 System Bus Address Parity Error Step 4 Determine which node put the bad command/address on the system bus identified in MC_ERR1. Perform the action indicated.
5.4.4 PIO Buffer Overflow Error (PIO_OVFL) Step 5 Enter the value of the CAP_CTRL register bits<19:16> (Actual_PEND_NUM) in the following formula. Compare the results as indicated in Table 5-7 to determine the most likely cause of the error. When an IOD is implicated in the analysis of the error, replace the one that captured the error in its CAP Error Register.
5.4.5 Page Table Entry Invalid Error Step 6 This error is almost always a software problem. However, if the software is known to be good and the hardware is suspected, swap the IOD. 5.4.6 PCI Master Abort Step 7 Master aborts normally occur when the operating system is sizing the PCI bus. However, if the master abort occurs after the system is booted, read PCI_ERR1 and determine which PCI device should have responded to this PCI address. Replace this device. 5.4.
5.4.9 Broken Memory Step 10 Refer to the following sections. For a Read Data Substitute Error (uncorrectable ECC error) When a read data substitute (RDS) error occurs, determine which memory module pair caused the error as follows: 1. Run the memory diagnostic to see if it catches the bad memory. If so, replace the memory module that it reports as bad. 2. At the SRM console prompt, enter the show mem command.
3. When you have isolated the failing memory pair, determine which of the two modules is bad. (You cannot do this if the operating system is Windows NT.) Read the CPU FIL SYNDROME Register. If this register is non-zero, use the ECC syndrome bits in Table 5-8 to determine which module had the single-bit error.
5.4.10 Command Codes Table 5-9 shows the codes for transactions on the system bus and how they are affected by the commander in charge of the bus during the transaction. The command is a six-bit field in the command address (bits<5:0>). Bit-to-text translations give six-bit data (although the top two bits may or may not be relevant). Note that address bit<39> defines the command as being either a system space or an I/O command.
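The decoding rule described above (six command bits, with address bit<39> selecting memory or I/O space) can be written as a small lookup. The following Python is a minimal sketch that covers only the Table 5-9 rows reproduced in this section (the read and fill commands). Codes from other parts of the table return None. The function and dictionary names are hypothetical.

```python
# Rows whose top two command bits are "xx" (don't care), keyed by
# (command bits <3:0>, MC_ADR<39>).
READ_COMMANDS = {
    (0xA, 0): "Read Mod0 - Mem",
    (0xA, 1): "Read Peer0 - I/O",
    (0xB, 0): "Read Mod1 - Mem",
    (0xB, 1): "Read Peer1 - I/O",
    (0xE, 0): "Read0 - Mem",
    (0xF, 0): "Read1 - Mem",
}

# Rows that specify all six command bits; both are listed with MC_ADR<39> = 1.
FILL_COMMANDS = {
    0x2C: "FILL0 (due to Read0/Peer0)",
    0x2D: "FILL1 (due to Read1/Peer1)",
}


def decode_command(command: int, adr39: int):
    """Translate a system bus command code, or return None if it is not covered."""
    six_bits = command & 0x3F  # the command is a six-bit field, bits <5:0>
    if adr39 == 1 and six_bits in FILL_COMMANDS:
        return FILL_COMMANDS[six_bits]
    return READ_COMMANDS.get((six_bits & 0xF, adr39))
```

For example, decode_command(0x2C, 1) returns "FILL0 (due to Read0/Peer0)", and decode_command(0x0A, 1) returns "Read Peer0 - I/O".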
Table 5-9 Decoding Commands (continued) 54 MC_C MD 3210 CMD in Hex MC_ ADR <39> Description No BCache CPU BCache CPU IOD Y Y Y xx 1010 XA 0 Read Mod0 Mem xx 1010 XA 1 Read Peer0 - I/O xx 1011 XB 0 Read Mod1 Mem xx 1011 XB 1 Read Peer1 - I/O Y 10 1100 2C 1 FILL0 (due to Read0/Peer0) Y 10 1101 2D 1 FILL1 (due to Read1/Peer1) Y xx 1110 XE 0 Read0 - Mem Y Y xx 1111 XF 0 Read1 - Mem Y Y Y Y Y Y 5.4.
5.5 Double Error Halts and Machine Checks While in PAL Mode Two error cases require special attention. Neither double error halts nor machine checks while the machine is in PAL mode result in error log entries. Nevertheless, information is available that can help determine what error occurred. 5.5.1 PALcode Overview PALcode, privileged architecture library code, is used to implement a number of functions at the machine level without the use of microcode.
5.5.2 Double Error Halt A double error halt occurs under the following conditions: • A machine check occurs. • PAL completes its tasks and returns control of the system to the operating system. • A second machine check occurs before the operating system completes its tasks. The machine returns to the console and displays the following message: halt code = 6 double error halt PC = 20000004 Your system has halted due to an irrecoverable error.
cpu00 per_cpu impure area cns$flag cns$flag+4 cns$hlt cns$hlt+4 cns$mchkflag cns$mchkflag+4 cns$exc_addr cns$exc_addr+4 cns$pal_base cns$pal_base+4 cns$mm_stat cns$mm_stat+4 cns$va cns$va+4 cns$icsr cns$icsr+4 cns$ipl cns$ipl+4 cns$ps cns$ps+4 cns$itb_asn cns$itb_asn+4 cns$aster cns$aster+4 cns$astrr cns$astrr+4 cns$isr cns$isr+4 cns$ivptbr cns$ivptbr+4 cns$mcsr cns$mcsr+4 cns$dc_mode cns$dc_mode+4 cns$maf_mode cns$maf_mode+4 cns$sirr cns$sirr+4 cns$fpcsr cns$fpcsr+4 cns$icperr_stat cns$icperr_stat+4 cns$pm
cns$fill_syn      000000a7 : 0410
cns$fill_syn+4    00000000 : 0414
cns$ld_lock       0004eaef : 0418
cns$ld_lock+4     ffffff00 : 041c
Example 5-9 INFO 5 Command P00>>> info 5 cpu00 per_cpu logout area 00004838 mchk$crd_flag 00000320 : 0000 mchk$crd_flag+4 00000000 : 0004 mchk$crd_offsets 00000118 : 0008 mchk$crd_offsets+4 00001328 : 000c mchk$crd_mchk_code 00980000 : 0010 mchk$crd_mchk_code+4 00000000 : 0014 mchk$crd_ei_stat eba00003 : 0018 mchk$crd_ei_stat+4 4143040a : 001c mchk$crd_ei_addr d1200067 : 0020 mchk$crd_ei_addr+4 47f90416 : 0024 mchk$crd_fill_syn eba00003 : 0028 mchk$crd_fill_syn+4 d1200068 : 002c mc
mchk$fill_syn+4 00000000 : 018c mchk$ei_stat 04ffffff : 0190 mchk$ei_stat+4 fffffff0 : 0194 mchk$ld_lock 00005b6f : 0198 mchk$ld_lock+4 ffffff00 : 019c IOD: 0 base address: f9e0000000 WHOAMI: 0000003a PCI_REV: 06008221 CAP_CTL: 02490fb1 HAE_MEM: 00000000 HAE_IO: 00000000 INT_CTL: 00000003 INT_REQ: 00800000 INT_MASK0: 00010000 INT_MASK1: 00000000 MC_ERR0: e0000000 MC_ERR1: 800e88fd CAP_ERR: 84000000 PCI_ERR: 00000000 MDPA_STAT: 00000000 MDPA_SYN: 00000000 MDPB_STAT: 0
Example 5-10 INFO 8 Command P00>>> info 8 IOD 0 WHOAMI: 0000003a PCI_REV: 06008221 CAP_CTL: 02490fb1 HAE_MEM: 00000000 HAE_IO: 00000000 INT_CTL: 00000003 INT_REQ: 00000000 INT_MASK0: 00210000 INT_MASK1: 00000000 MC_ERR0: e0000000 MC_ERR1: 000e88fd CAP_ERR: 00000000 PCI_ERR: 00000000 MDPA_STAT: 00000000 MDPA_SYN: 00000000 MDPB_STAT: 00000000 MDPB_SYN: 00000000 INT_TARG: 0000003a INT_ADR: 00006000 INT_ADR_EXT 00000000 PERF_MON: 00406ebf PERF_CONT: 00000000 CAP_DIAG:
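In the INFO 3 and INFO 5 displays, each 64-bit register is printed as two 32-bit entries: the low longword under the register name and the high longword under the name with +4 appended. The following Python is a minimal sketch, assuming the console output has been captured with one "name value : offset" entry per line, that joins the two halves so the result can be compared with the 64-bit values shown in the error log examples. The function name is hypothetical.

```python
import re

# Matches entries such as "mchk$ei_stat+4  fffffff0 : 0194".
_ENTRY = re.compile(
    r"^\s*(\S+?)(\+4)?\s+([0-9a-fA-F]{8})\s*:\s*[0-9a-fA-F]{4}\s*$"
)


def combine_longwords(dump_text: str) -> dict[str, int]:
    """Join each name / name+4 pair of 32-bit entries into one 64-bit value."""
    halves: dict[str, list[int]] = {}
    for line in dump_text.splitlines():
        match = _ENTRY.match(line)
        if match is None:
            continue  # headings and IOD "NAME: value" lines are ignored
        name, is_high, value = match.group(1), match.group(2), int(match.group(3), 16)
        pair = halves.setdefault(name, [0, 0])
        pair[1 if is_high else 0] = value
    return {name: (high << 32) | low for name, (low, high) in halves.items()}
```

Applied to the mchk$ei_stat and mchk$ei_stat+4 entries in Example 5-9 (04ffffff and fffffff0), the function returns FFFFFFF004FFFFFF, the same form in which the External Interface Status Register appears in the error log examples.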
Chapter 6 Error Registers This chapter describes the registers used to hold error information.
6.1 External Interface Status Register - EI_STAT The EI_STAT register is a read-only register that is unlocked and cleared by any PALcode read. A read of this register also unlocks the EI_ADDR, BC_TAG_ADDR, and FILL_SYN registers subject to some restrictions. The EI_STAT register is not unlocked or cleared by reset.
In ECC mode, fill data from the B-cache or main memory can have correctable or uncorrectable errors. In parity mode, fill data parity errors are treated as uncorrectable hard errors. System address/command parity errors are always treated as uncorrectable hard errors, irrespective of the mode. The sequence for reading, unlocking, and clearing EI_STAT, EI_ADDR, BC_TAG_ADDR, and FILL_SYN is as follows: 1. Read the EI_ADDR, BC_TAG_ADDR, and FILL_SYN registers in any order. This does not unlock or clear any register. 2.
Table 6-1 External Interface Status Register Name Bits Type Description COR_ECC_ERR <31> R Correctable ECC Error. Indicates that fill data received from outside the CPU contained a correctable ECC error. EI_ES R External Interface Error Source. When set, indicates that the error source is fill data from main memory or a system address/command parity error. When clear, the error source is fill data from the B-cache.
Table 6-1 External Interface Status Register (continued) Name Bits Type Description <63:36> All ones. SEO_HRD_ERR <35> R Second External Interface Hard Error. Indicates that a fill from B-cache or main memory, or a system address/command received by the CPU has a hard error while one of the hard error bits in the EI_STAT register is already set. FIL_IRD R Fill I-Ref D-Ref. When set, indicates that the error occurred during an I-ref fill.
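As an illustration of reading Table 6-1, the following Python sketch decodes only the EI_STAT fields whose bit positions are given above: COR_ECC_ERR <31>, SEO_HRD_ERR <35>, and the all-ones field <63:36>. It is a sketch for offline analysis of logged values, not PALcode; the function name decode_ei_stat is hypothetical, and the sample value is the mchk$ei_stat quadword from Example 5-9, assuming the +4 longword is the high-order half.

```python
def decode_ei_stat(value):
    """Decode the EI_STAT fields whose positions appear in Table 6-1."""
    value &= (1 << 64) - 1  # treat the input as a 64-bit quantity
    return {
        "COR_ECC_ERR": bool((value >> 31) & 1),   # bit <31>
        "SEO_HRD_ERR": bool((value >> 35) & 1),   # bit <35>
        # Bits <63:36> are documented as all ones (28 bits).
        "upper_bits_all_ones": (value >> 36) == (1 << 28) - 1,
    }

# mchk$ei_stat = 04ffffff (low), mchk$ei_stat+4 = fffffff0 (high)
fields = decode_ei_stat(0xFFFFFFF004FFFFFF)
print(fields)
```

In this sample neither COR_ECC_ERR nor SEO_HRD_ERR is set, and the upper field reads as all ones as the table specifies.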
6.1.1 External Interface Address Register - EI_ADDR The EI_ADDR register contains the physical address associated with errors reported by the EI_STAT register. It is unlocked by a read of the EI_STAT Register. This register is meaningful only when one of the error bits is set.
Table 6-2 Loading and Locking Rules for External Interface Registers Correct -able Error Uncorrectable Error Second Hard Error 0 0 1 Load Register Lock Register Action When EI_STAT Is Read Not possible No No Clears and unlocks all registers 0 Not possible Yes No Clears and unlocks all registers 0 1 0 Yes Yes Clears and unlocks all registers 11 1 0 Yes Yes Clear bit (c) does not unlock. Transition to “0,1,0” state.
6.1.2 MC Error Information Register 0 (MC_ERR0 - Offset = 800) The low-order MC bus (system bus) address bits are latched into this register when the system bus to PCI bus bridge detects an error event. If the event is a hard error, the register bits are locked. A write to clear symptom bits in the CAP Error Register unlocks this register. When the valid bit (MC_ERR_VALID) in the CAP Error Register is clear, the contents are undefined.
6.1.3 MC Error Information Register 1 (MC_ERR1 - Offset = 840) The high-order MC bus (system bus) address bits and error symptoms are latched into this register when the system bus to PCI bus bridge detects an error. If the event is a hard error, the register bits are locked. A write to clear symptom bits in the CAP Error Register unlocks this register. When the valid bit (MC_ERR_VALID) in the CAP Error Register is clear, the contents are undefined.
Table 6-4 MC Error Information Register 1
Name Bits Type Initial State Description
VALID <31> RO 0 Logical OR of bits <30:23> in the CAP_ERR Register.
Reserved <30:21> RO 0
Dirty <20> RO 0
Reserved <19:17>
DEVICE_ID <16:14> RO 0 Slot number of bus master at the time of the error.
MC_CMD<5:0> <13:8> RO 0 Active command at the time the error was detected.
ADDR<39:32> <7:0> RO 0 Address bits <39:32> of the transaction on the system bus when an error is detected.
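The field layout in Table 6-4 can be applied to a logged MC_ERR1 value as in the following Python sketch. It is intended only as an aid for reading error log entries; the function name decode_mc_err1 and the dictionary keys are hypothetical. The sample value 800e88fd is the MC_ERR1 value shown in the IOD section of the INFO 5 display.

```python
def decode_mc_err1(value):
    """Split a 32-bit MC_ERR1 value into the fields listed in Table 6-4."""
    return {
        "valid": bool((value >> 31) & 1),      # VALID <31>
        "dirty": bool((value >> 20) & 1),      # Dirty <20>
        "device_id": (value >> 14) & 0x7,      # DEVICE_ID <16:14>
        "mc_cmd": (value >> 8) & 0x3F,         # MC_CMD<5:0> in bits <13:8>
        "addr_39_32": value & 0xFF,            # ADDR<39:32> in bits <7:0>
    }

mc_err1 = decode_mc_err1(0x800E88FD)
print(mc_err1)
```

For this sample, VALID is set, the bus master is in slot 2, the active command is 08 (hex), and address bits <39:32> are fd (hex).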
6.1.4 CAP Error Register (CAP_ERR - Offset = 880) CAP_ERR is used to log information pertaining to an error detected by the CAP or MDP ASIC. If the error is a hard error, the register is locked. All bits, except the LOST_MC_ERR bit, are locked on hard errors. CAP_ERR remains locked until software writes to the register to clear each individual error bit.
Table 6-5 CAP Error Register
Name Bits Type Initial State Description
MC_ERR VALID <31> RO 0 Logical OR of bits <30:23> in this register. When set, MC_ERR0 and MC_ERR1 are latched.
RDSB <30> RW1C 0 Uncorrectable ECC error detected by MDPB. Clear state in MDPB before clearing this bit.
RDSA <29> RW1C 0 Uncorrectable ECC error detected by MDPA. Clear state in MDPA before clearing this bit.
CRDB <28> RW1C 0 Correctable ECC error detected by MDPB. Clear state in MDPB_STAT before clearing this bit.
Table 6-5 CAP Error Register (continued) Name Bits Type Initial State LOST_MC_ERR <24> RW1C 0 Set when an error is detected but not logged because the associated symptom fields and registers are locked with the state of an earlier error. PIO_OVFL <23> RW1C 0 Set when a transaction that targets this system bus to PCI bus bridge is not serviced because the buffers are full. This is a symptom of setting the PEND_NUM field in CAP_CNTL to an incorrect value.
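The following Python sketch decodes the CAP_ERR bits named in Table 6-5 and checks that MC_ERR VALID agrees with the logical OR of bits <30:23>. It also computes the value that would be written back to clear the symptom bits, on the assumption that RW1C means write one to clear. Only bits whose positions appear in this section are named; the function names are hypothetical. The sample value 84000000 is the CAP_ERR value shown in the IOD section of the INFO 5 display.

```python
# Bit positions taken from Table 6-5; other bits in <30:23> are not named here.
NAMED_BITS = {"RDSB": 30, "RDSA": 29, "CRDB": 28, "LOST_MC_ERR": 24, "PIO_OVFL": 23}
SYMPTOM_MASK = 0xFF << 23  # bits <30:23>

def decode_cap_err(value):
    """Report the named symptom bits and whether MC_ERR VALID is consistent."""
    flags = {name: bool((value >> bit) & 1) for name, bit in NAMED_BITS.items()}
    flags["MC_ERR_VALID"] = bool((value >> 31) & 1)
    # MC_ERR VALID is documented as the logical OR of bits <30:23>.
    flags["valid_consistent"] = flags["MC_ERR_VALID"] == bool(value & SYMPTOM_MASK)
    return flags

def write_one_to_clear(value):
    """Return the write value that clears every symptom bit currently set.

    Assumption: RW1C bits are cleared by writing a one to them.
    """
    return value & SYMPTOM_MASK

cap_err = decode_cap_err(0x84000000)
clear_value = write_one_to_clear(0x84000000)
print(cap_err, hex(clear_value))
```

Note that the table requires the corresponding MDPA or MDPB state to be cleared before the RDSA, RDSB, or CRDB bits are cleared; the sketch does not model that ordering.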
6.1.5 PCI Error Status Register 1 (PCI_ERR1 - Offset = 1040) PCI_ERR1 is used by the system bus to PCI bus bridge to log bus address <31:0> pertaining to an error condition logged in CAP_ERR. This register always captures PCI address <31:0>, even for a PCI DAC cycle. When the PCI_ERR_VALID bit in CAP_ERR is clear, the contents are undefined.
Chapter 7 Removal and Replacement This chapter describes removal and replacement procedures for field-replaceable units (FRUs). 7.1 System Safety Observe the safety guidelines in this section to prevent personal injury. CAUTION: Wear an antistatic wrist strap whenever you work on a system. The AlphaServer cabinet system has a wrist strap connected to the frame at the front and rear. The pedestal system does not have an attached strap, so you will have to take one to the site.
7.2 FRU List Figure 7-1 shows the locations of FRUs in the system drawer, and Table 7-1 lists the part numbers of all field-replaceable units.
Table 7-1 Field-Replaceable Unit Part Numbers CPU Modules B3001-CA 300 MHz CPU, uncached B3002-AB 300 MHz CPU, 2 Mbyte cache B3004-BA 300 MHz CPU, 2 Mbyte cache B3004-AA 400 MHz, 4 Mbyte cache B3004-DA 466 MHz, 4 Mbyte cache Memory Modules B3020-CA 64 Mbyte synch B3030-EA 256 Mbyte asynch (EDO) B3030-FA 512 Mbyte asynch (EDO) B3030-GA 2 Gbyte asynch (EDO) Required System Drawer Modules and Display 54-23803-01 System motherboard (4100) 54-23803-02 54-23805-01 System motherboard (early 40
Table 7-1 Field-Replaceable Unit Part Numbers (continued) Fans 12-23609-21 4.5-inch fan 12-24701-34 CPU fan Power System Components 30-44712-01 Power supply (H7291-AA) 30-45353-01 Techniq AC Box (NA/Japan, H9A10-EB cabinet) 30-45353-02 Techniq AC Box (Europe/AP, H9A10-EC cabinet) 30-46788-01 Internal power source 40W/12V fan tray power (cabinet) H7600-AA Power controller (NA/Japan, H9A10-EL cabinet) H7600-DB Power controller (Europe/AP, H9A10-EM cabinet) 12-23501-01 NEMA power strip (N.A.
Table 7-1 Field-Replaceable Unit Part Numbers (continued) Server Control Module Power (Pedestal Only) 30-46485-01 110V North America 30-46485-02 220V Europe 30-46485-03 Australia/N.Z. 30-46485-04 220V U.K.
Table 7-1 Field-Replaceable Unit Part Numbers (continued) System Drawer Cables and Jumpers From To 17-04358-01 Power harness (later 4000 only) Power supply(s) 3 conns.
Table 7-1 Field-Replaceable Unit Part Numbers (continued) Pedestal Cables From To 17-04293-01 Elec harness power cable+5/+12 Power harness (17-04217-01) Ped tray bulkhead (system side) 17-04302-01 OCP signal cable OCP sig conn on PCI mbrd OCP sig conn on ped tray bulkhead (system side) 17-04305-01 Harness power cable +5/+12 Power conn on ped tray bulkhd (tray side) Both OCP DC enable pwr conn and pwr conn on optional SCSI drive 17-04306-01 SCSI signal cable (narrow) SCSI sig conn on ped tra
7.3 4100 Power System FRUs
Figure 7-2 Location of 4100 Power System FRUs
[Figure: power system FRUs identified by callouts 1 through 14. Labeled components include the power strip, PS0, PS1, PS2, Fan 0, Fan 1, Fan 2, the fan tray, the system motherboard (54-23803-01), PCM, B3040, B3050, the cabinet AC input box, FLPY, CD, SCSI, OCP, the interlock, the cabinet and pedestal trays, and the connections to the pedestal and cabinet power sources.]
Notes: Only power cables are shown. Systems have only one OCP located in either the cabinet tray or the pedestal tray.
Part Number Description 2a 17-04285-01 Power cord from AC input box to power strip. .5 meter, IEC320 to IEC320 connector used in cabinet systems only. In pedestal systems, cords match country-specific wall outlets. 1, 2, 2a H7600-AA Power controller used in place of 30-45353-01, 12-45334-02, and 17-04285-02 in the H9A10-EL cabinet in N.
7.4 4000 Power System FRUs
Figure 7-3 Location of 4000 Power System FRUs
[Figure: power system FRUs identified by callouts 1 through 14. Labeled components include the power strip, PS0, PS1, PS2, Fan 0, Fan 1, Fan 2, the fan tray, the 4000 system motherboard (54-23805-01), PCM, B3040 (two), B3050, B3051, the cabinet AC input box, FLPY, CD, SCSI, OCP, the interlock, the cabinet and pedestal trays, and the connections to the pedestal and cabinet power sources.]
Note: Only power cables are shown. Systems have only one OCP located in either the cabinet tray or the pedestal tray.
Part Number Description 2a 17-04285-01 Power cord from AC input box to power strip. .5 meter, IEC320 to IEC320 connector used in cabinet systems only. In pedestal systems, cords match country-specific wall outlets. 1, 2, 2a H7600-AA Power controller used in place of 30-45353-01, 12-45334-02, and 17-04285-02 in the H9A10-EL cabinet in N.
7.5 System Drawer Exposure (Cabinet) There are two cabinet types for these systems: the H9A10-EB/-EC cabinet and the H9A10-EL/-EM cabinet. The procedure for exposing the system drawer depends on the cabinet type. 7.5.1 Cabinet Drawer Exposure (H9A10-EB & -EC) Open both doors, disconnect cables that obstruct movement of the drawer, remove the shipping brackets, and slide the drawer out from the cabinet.
Exposing the System Bus or PCI Bus Card Cages 1. Open the front and rear doors of the cabinet. 2. At the front of the cabinet, unplug the drawer’s power supplies. 3. At the rear, remove the two Phillips screws holding the shipping bracket on the right rail so that the drawer can be pulled out. 4. Using a flathead screwdriver, disengage the lock mechanism at the lower left hand corner of the drawer. 5. Pull the drawer out part way and release the lock mechanism by removing the screwdriver.
7.5.2 Cabinet Drawer Exposure (H9A10-EL & -EM) In the H9A10-EL and -EM cabinets, the system drawer sits on a tray that slides out of the front of the cabinet. A stabilizer bar must be pulled out from the bottom to prevent the cabinet from tipping over.
CAUTION: The cabinet could tip over if a system drawer is pulled out while the stabilizer bar is not fully extended with its leveler foot on the floor. Exposing Any Section of the System Drawer in an H9A10-EL or -EM Cabinet 1. Open the front door of the cabinet. 2. Pull the stabilizer bar at the bottom of the cabinet out until it stops. 3. Extend the leveler foot at the end of the stabilizer bar to the floor. 4. Unplug the drawer’s power supplies. 5.
7.6 System Drawer Exposure (Pedestal) Figure 7-5 Exposing System Drawer (Pedestal) Pedestal Tray Cover System Bus Cover Pedestal Tray and Power Section Cover PCI Bus Cover 3.
Exposing the System Drawer 1. Open the front door and remove it by lifting and pulling it away from the system. 2. Remove the top cover. Unscrew the two Phillips head screws midway up on each side of the pedestal, tilt the cover up, and lift it away from the frame. 3.
7.7 CPU Removal and Replacement CAUTION: Several different CPU modules work in these systems. Unless you are upgrading, be sure you are replacing the broken module with the same variant. B3001 and B3002 can only be used in AlphaServer 4100 systems. Figure 7-6 Removing CPU Module CPU Module System Bus Card Cage PKW0411-96 WARNING: CPU modules and memory modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Expose the system bus card cage. Remove the two Phillips head screws holding the cover in place and slide it off the drawer. 4. Identify and remove faulty CPU. A label to the left of the system bus card cage identifies which slot contains CPU0, CPU1, CPU2, or CPU3. The CPU is held in place with levers at both ends; simultaneously raise the levers and lift the CPU from the cage.
7.
Removal 1. Follow the CPU Removal and Replacement procedure. 2. Unplug the fan from the module. 3. Remove the four Phillips head screws holding the fan to the Alpha chip’s heatsink. Replacement Reverse the above procedure. Verification If the system powers up, the CPU fan is working.
7.9 Memory Removal and Replacement CAUTION: Several different memory modules work in these systems. Be sure you are replacing the broken module with the same variant. Figure 7-8 Removing Memory Module Memory Module System Bus Card Cage PKW0408-96 WARNING: CPU modules and memory modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Expose the system bus card cage. Remove the two Phillips head screws holding the cover in place and slide it off the drawer. 4. Identify and remove the faulty module. A label to the left of the system card cage identifies which slot contains the high or low halves of memory banks. The memory module is held in place by a flathead captive screw attached to the top brace of the module.
7.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Expose the system bus card cage. Remove the two Phillips head screws holding the cover in place and slide it off the drawer. 4. Remove the faulty PCM. The PCM is located in the back left corner of the system bus card cage. A captive flathead screw and the rear card guide hold the PCM in place. Unscrew the screw and lift the module from the cage. Replacement Reverse the steps in the Removal procedure.
7.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Expose the system bus card cage. Remove the two Phillips head screws holding the cover in place and slide it off the drawer. 4. Expose the PCI bus card cage. Remove three Phillips head screws holding the cover in place and slide it off the drawer. 5. Remove all the PCI/EISA options. 6. Remove the server control module. 7. Remove the PCI motherboard. 8.
7.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Expose the system bus card cage. Remove the two Phillips head screws holding the cover in place and slide it off the drawer. 4. Expose the PCI bus card cage on the right side of the drawer. Remove three Phillips head screws holding the cover in place and slide it off the drawer. 5. Remove all the PCI options. 6. Remove the PCI motherboard. 7.
7.13 System Motherboard (4100 & early 4000) Removal and Replacement The system motherboard contains an NVRAM that holds the system serial number. Be sure to record this number before replacing the module. The serial number is on a barcode on the side of the system drawer or on the system bus card cage. The part number for the 4100 is 54-23803-01 and for the early 4000 is 54-23803-02. Figure 7-12 Removing System Motherboard PKW0414-96 Removal 1. Shut down the operating system and power down the system.
5. Expose the PCI bus card cage. Remove three Phillips head screws holding the cover in place and slide it off the drawer. 6. Remove all the PCI/EISA options. 7. Remove the server control module. 8. Remove the PCI motherboard. 9. Remove system bus to PCI bus module from the system motherboard. 10. Remove the bracket holding the power cables in place as they pass from the system bus section to the power section of the drawer. 11.
7.14 System Motherboard (4000) Removal and Replacement The system motherboard contains an NVRAM that holds the system serial number. Be sure to record this number before replacing the module. The serial number is on a barcode on the side of the system drawer or on the system bus card cage. The part number for the later 4000 is 54-23805-01. Figure 7-13 Removing System Motherboard PKW0414A-96 Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3.
6. Remove all the PCI/EISA options. 7. Remove the server control module. 8. Remove the PCI motherboards. 9. Remove both bridge modules from the system motherboard. 10. Remove the bracket holding the power cables in place as they pass from the system bus section to the power section of the drawer. 11. Disconnect all cables to the system motherboard and lay them back over the power supply section of the system drawer.
7.15 PCI/EISA Motherboard (B3050) Removal and Replacement Figure 7-14 Replacing PCI/EISA Motherboard Connection to Bridge Module PCI Motherboard PKW0409-96 Removal The PCI motherboard contains an NVRAM with ECU data and customized console environment variables. Therefore, if the console runs, execute a show * command at the console prompt and, if you have not done so earlier, record the settings for the sys_model_number and sys_type environment variables.
motherboard, these environment variables are lost and must be restored after the module swap. 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Expose the PCI bus card cage. Remove three Phillips head screws holding the cover in place and slide it off the drawer. 4. Remove all PCI and EISA options. 5. Disconnect all cables connected to the PCI motherboard. 6. Remove the server control module. 7.
7.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Expose the PCI bus card cage on the right when viewing the drawer from the rear. Remove three Phillips head screws holding the cover in place, and slide it off the drawer. 4. Remove all PCI options. 5. Disconnect all cables connected to the PCI motherboard. 6. Unscrew the two screws holding the system bus to PCI bus bridge module in the system bus card cage to the PCI motherboard. 7.
7.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Expose the PCI bus card cage. Remove three Phillips head screws holding the cover in place and slide it off the drawer. 4. Disconnect the cables connected at the bulkhead to the server control module. 5. If necessary, remove several PCI and EISA options from the bottom of the PCI card cage up until you can access the server control module. 6.
7.18 PCI/EISA Option Removal and Replacement Figure 7-17 Removing PCI/EISA Option PKW0418-96 WARNING: To prevent fire, use only modules with current limited outputs. See National Electrical Code NFPA 70 or Safety of Information Technology Equipment, Including Electrical Business Equipment EN 60 950.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Expose the PCI bus card cage. Remove three Phillips head screws holding the cover in place and slide it off the drawer. 4. Remove the faulty option. Disconnect cables connected to the option. Unscrew the small Phillips head screw securing the option to the card cage. Slide the option from the card cage. Replacement Reverse the steps in the Removal procedure.
7.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Remove the cover to the power section of the drawer. Remove the two Phillips head screws holding the cover in place and slide it off the drawer. 4. Release the power supply tray by removing the two Phillips head screws on the side of the drawer. See ➍. 5. Lift the power supply tray to release it from the sheet metal and slide it out from the drawer until it locks (about 4 inches). 6.
7.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Expose the power, system card cage, and PCI/EISA sections of the drawer by removing all covers. Unscrew the Phillips head screws holding each cover in place and slide the covers off the drawer. 4. Release the power supply tray by removing the two Phillips head screws on the side of the drawer. 5.
7.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Expose the power and system card cage sections of the drawer by removing the two covers. Unscrew the two Phillips head screws holding each cover in place and slide the covers off the drawer. 4. If you want more space to work on the fans, do this step and the next; otherwise skip to step 7. Release the power supply tray by removing the two Phillips head screws on the side of the drawer. 5.
7.22 System Drawer Fan Removal and Replacement Figure 7-21 Removing System Drawer Fan PKW0416-96 Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Expose the power system, the system card cage, and the PCI card cage sections of the drawer by removing all three covers. Unscrew the two Phillips head screws holding each cover on top of the drawer in place and slide them off the drawer.
4. Release the power supply tray by removing the two Phillips head screws on the side of the drawer. 5. Lift the power supply tray to release it from the sheet metal and slide it out from the drawer. 6. Tilt the tray to allow easier access to the fans. 7. Remove the bracket holding the power harness as it passes from the power section to the system card cage section of the drawer. Remove the three Phillips head screws holding the bracket in place. 8.
7.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Remove all three section covers to expose the interlock switch assembly. 4. Remove the two screws holding the interlock in place. 5. Push the interlock toward the opposite side of the system drawer (be sure not to twist it) and tilt it so that the switches affected by the power and system card cage covers clear the openings in the side of the drawer.
7.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Remove all three section covers to expose the interlock switch assemblies. 4. Remove the two screws holding the interlocks in place. 5. Push the interlock toward the opposite side of the system drawer (be sure not to twist it) and tilt it so that the switches affected by the power and system card cage covers clear the openings in the side of the drawer.
7.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. While you need not remove the tray containing the OCP, you do need to slide it forward to access the OCP retaining screws under the tray. The tray is attached to the power system section cover. To slide the tray forward: a. Remove the tray cover by loosening the retaining screws at the back of the tray and sliding it toward the back of the system. b.
7.26 Operator Control Panel Removal and Replacement (Pedestal) Figure 7-25 Removing OCP (Pedestal) 3.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Remove the four Phillips head screws holding the OCP tray to the system drawer. 4. Slide the tray out of the system drawer far enough to disconnect cables attached to the OCP, the floppy, and the CD-ROM drive. 5. Remove the tray from the system. 6. Move the tray to some handy work surface.
7.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Remove the four Phillips head screws holding the OCP tray to the system drawer. 4. Slide the tray out of the system drawer and disconnect cables attached to the OCP (unnecessary on a pedestal system), the floppy, and the CD-ROM drive. (In the pedestal system the OCP is in the tray above the power supplies.) 5. Move the tray to some handy work surface.
7.
Removal 1. Shut down the operating system and power down the system. 2. Expose the system drawer. 3. Remove the four Phillips head screws holding the OCP tray to the system drawer. 4. Slide the tray out of the system drawer and disconnect cables attached to the OCP (unnecessary on a pedestal system), the floppy, and the CD-ROM drive. (In the pedestal system the OCP is in the pedestal tray above the power supplies.) 5. Move the tray to some handy work surface.
7.
Removal 1. Shut down the operating system and power down the system. Unplug the AC power cable from the cabinet tray power supply. 2. If present, unplug any power cables going to the server control modules at the back of system drawers. 3. Unscrew the four Phillips head screws securing the fan tray to the top of the cabinet. 4. Loosen the four hexnuts that hold the tray to the top of the cabinet. 5.
7.
Removal 1. Remove the cabinet fan tray. 2. Disconnect the power harness from the fan fail detect module and each fan. 3. Remove the power supply cover. It is held in place by two screws that go through the AC bulkhead spot welded to the tray weldment. 4. Remove the power harness from the tray by disconnecting it from the power supply. 5. Disconnect the neutral and load leads from the power supply. 6. Remove the four screws holding the power supply to the tray.
7.
Removal 1. Remove the cabinet fan tray. 2. Disconnect the power harness from the fan you wish to replace. 3. Remove the fan finger guard. 4. Remove the two remaining screws holding the fan to the tray and remove the fan. 5. If the new fan does not have clip nuts, remove them from the old fan so that they can be reused. Replacement 1. Reverse the Removal procedure, taking care to orient the fan so that the connection to the power harness is neatly dressed. 2. Place the fan tray back in the cabinet.
7.
Removal 1. Remove the cabinet fan tray. 2. Disconnect the power harness from the fan fail detect module. 3. Remove the fan fail detect module. In early systems, the module is held in place by three screws that go through the weldment, through three standoffs, through the module to nuts. In later systems, the module snaps in place. Replacement 1. Reverse the steps in the Removal procedure. 2. Place the fan tray back in the cabinet. Verification Power up the system.
7.
Removal 1. Shut down the operating system and power down the system. 2. Remove the power cord and signal cord(s) from the StorageWorks shelf. 3. Remove the two retaining brackets holding the shelf in the mounting rail by removing the Phillips head screws holding the brackets in place. 4. Slide the shelf out of the system. Replacement Reverse the steps in the Removal procedure. Verification Power up the system.
Appendix A Running Utilities This appendix provides a brief overview of how to load and run utilities.
A.1 Running Utilities from a Graphics Monitor Start AlphaBIOS and select Utilities from the menu. The next selection depends on the utility to be run. For example, to run ECU, select Run ECU from floppy. To run RCU, select Run Maintenance Program. Figure A-1 Running a Utility from a Graphics Monitor AlphaBIOS Setup Display System Configuration... Upgrade AlphaBIOS Hard Disk Setup... CMOS Setup... Install Windows NT Utilities About AlphaBIOS... F1=Help Run ECU from floppy... OS Selection Setup...
A.2 Running Utilities from a Serial Terminal Utilities are run from a serial terminal in the same way as from a graphics monitor. The menus are the same, but some keys are different.
A.3 Running ECU The EISA Configuration Utility (ECU) is used to configure EISA options on AlphaServer systems. The ECU can be run either from a graphics monitor or a serial terminal. 1. Start AlphaBIOS Setup. If the system is in the SRM console, issue the command alphabios. (If the system has a graphics monitor, you can set the SRM environment variable console to graphics.) 2. From AlphaBIOS Setup, select Utilities, then select Run ECU from floppy… from the submenu that displays, and press Enter.
A.4 Running RAID Standalone Configuration Utility The RAID Standalone Configuration Utility is used to set up RAID disk drives and logical units. The Standalone Utility is run from the AlphaBIOS Utility menu. The AlphaServer 4100 system supports the KZPSC- xx PCI RAID controller (SWXCR). The KZPSC-xx kit includes the controller, RAID Array 230 Subsystems software, and documentation. 1. Start AlphaBIOS Setup. If the system is in the SRM console, issue the command alphabios.
A.5 Updating Firmware with LFU Start the Loadable Firmware Update (LFU) utility by issuing the lfu command at the SRM console prompt or by selecting Upgrade AlphaBIOS in the AlphaBIOS Setup screen. LFU is part of the SRM console. Example A-1 Starting LFU from the SRM Console P00>>> lfu ***** Loadable Firmware Update Utility ***** Select firmware load device (cda0, dva0, ewa0), or Press <return> to bypass loading and proceed to LFU: cda0 . . .
Use the Loadable Firmware Update (LFU) utility to update system firmware. You can start LFU from either the SRM console or the AlphaBIOS console. • From the SRM console, start LFU by issuing the lfu command. • From the AlphaBIOS console, select Upgrade AlphaBIOS from the AlphaBIOS Setup screen (see Figure A-2). A typical update procedure is: 1. Start LFU. 2. Use the LFU list command to show the revisions of modules that LFU can update and the revisions of update firmware. 3.
A.5.1 Updating Firmware from the Internal CD-ROM Insert the update CD-ROM, start LFU, and select cda0 as the load device. Example A-2 Updating Firmware from the Internal CD-ROM ***** Loadable Firmware Update Utility ***** Select firmware load device (cda0, dva0, ewa0), or Press <return> to bypass loading and proceed to LFU: cda0 ➊ Please enter the name of the options firmware files list, or Press <return> to use the default filename [AS4X00FW]: AS4X00CP ➋ Copying AS4X00CP from DKA500.5.0.1.1 .
➊ Select the device from which firmware will be loaded. The choices are the internal CD-ROM, the internal floppy disk, or a network device. In this example, the internal CD-ROM is selected. ➋ Select the file that has the firmware update, or press Enter to select the default file.
Example A-2 Updating Firmware from the Internal CD-ROM (Continued) UPD> update * ➎ WARNING: updates may take several minutes to complete for each device. Confirm update on: AlphaBIOS AlphaBIOS [Y/(N)] y DO NOT ABORT! Updating to V2.0-3... Verifying V2.0-3... PASSED. UPD> exit ➏ DO NOT ABORT! Updating to V6.40-1... Verifying V6.40-1... PASSED.
➎ The update command updates the device specified or all devices. In this example, the wildcard indicates that all devices supported by the selected update file will be updated. ➏ For each device, you are asked to confirm that you want to update the firmware. The default is no. Once the update begins, do not abort the operation. Doing so will corrupt the firmware on the module. ➐ The exit command returns you to the console from which you entered LFU (either SRM or AlphaBIOS).
A.5.2 Updating Firmware from the Internal Floppy Disk — Creating the Diskettes Create the update diskettes before starting LFU. See Section A.5.3 for an example of the update procedure. Table A-2 File Locations for Creating Update Diskettes on a PC Console Update Diskette I/O Update Diskette AS4X00FW.TXT AS4X00IO.TXT AS4X00CP.TXT RHREADME.SYS RHREADME.SYS CIPCA214.SYS RHSRMROM.SYS DFPAA246.SYS RHARCROM.SYS KZPAAA10.
Example A-3 Creating Update Diskettes on an OpenVMS System
Console Update Diskette
$ inquire ignore "Insert blank HD floppy in DVA0, then continue"
$ set verify
$ set proc/priv=all
$ init /density=hd/index=begin dva0: rhods2cp
$ mount dva0: rhods2cp
$ create /directory dva0:[as4x00]
$ copy as4x00fw.sys dva0:[as4x00]as4x00fw.sys
$ copy as4x00cp.sys dva0:[as4x00]as4x00cp.sys
$ copy rhreadme.sys dva0:[as4x00]rhreadme.sys
$ copy as4x00fw.txt dva0:[as4x00]as4x00fw.txt
$ copy as4x00cp.
A.5.3 Updating Firmware from the Internal Floppy Disk — Performing the Update Insert an update diskette (see Section A.5.2) into the internal floppy drive. Start LFU and select dva0 as the load device.
➊ Select the device from which firmware will be loaded. The choices are the internal CD-ROM, the internal floppy disk, or a network device. In this example, the internal floppy disk is selected. ➋ Select the file that has the firmware update, or press Enter to select the default file.
Example A-4 Updating Firmware from the Internal Floppy Disk (Continued) UPD> update pfi0 ➍ WARNING: updates may take several minutes to complete for each device. Confirm update on: pfi0 [Y/(N)] y ➎ DO NOT ABORT! Updating to 2.52... Verifying to 2.52... PASSED.
➍ ➎ The update command updates the device specified or all devices. ➏ The lfu command restarts the utility so that console firmware can be updated. (Another method is shown in Example A-5, where the user specifies the file AS4X00FW and is prompted to insert the second diskette.) ➐ The default update file, AS4X00CP, is selected. The console firmware can now be updated, using the same procedure as for the I/O firmware.
A.5.4 Updating Firmware from a Network Device Copy files to the local MOP server’s MOP load area, start LFU, and select ewa0 as the load device.
Before starting LFU, download the update files from the Internet (see Preface). You will need the files with the extension .SYS. Copy these files to your local MOP server’s MOP load area. ➊ Select the device from which firmware will be loaded. The choices are the internal CD-ROM, the internal floppy disk, or a network device. In this example, a network device is selected. ➋ Select the file that has the firmware update, or press Enter to select the default file.
Example A-6 Updating Firmware from a Network Device (Continued)
UPD> update * -all ➍
WARNING: updates may take several minutes to complete for each device.
AlphaBIOS DO NOT ABORT! Updating to V6.40-1... Verifying V6.40-1... PASSED.
kzpsa0 DO NOT ABORT! Updating to A11 ... Verifying A11... PASSED.
kzpsa1 DO NOT ABORT! Updating to A11 ... Verifying A11... PASSED.
srmflash DO NOT ABORT! Updating to V2.0-3... Verifying V2.0-3... PASSED.
➍ The update command updates the device specified or all devices. In this example, the wildcard indicates that all devices supported by the selected update file will be updated. Typically, LFU requests confirmation before updating each console’s or device’s firmware. The -all option removes the update confirmation requests. ➎ The exit command returns you to the console from which you entered LFU (either SRM or AlphaBIOS).
A.5.5 LFU Commands
The commands summarized in Table A-3 are used to update system firmware.
Table A-3 LFU Command Summary
Command    Function
display    Shows the system physical configuration.
exit       Terminates the LFU program.
help       Displays the LFU command list.
lfu        Restarts the LFU program.
list       Displays the inventory of update firmware on the selected device.
readme     Lists release notes for the LFU program.
update     Writes new firmware to the module.
display The display command shows the system physical configuration. Display is equivalent to issuing the SRM console command show configuration. Because it shows the slot for each module, display can help you identify the location of a device. exit The exit command terminates the LFU program, causes system initialization and testing, and returns the system to the console from which LFU was called. help The help (or ?) command displays the LFU command list, shown below.
list The list command displays the inventory of update firmware on the CD-ROM, network, or floppy. Only the devices listed at your terminal are supported for firmware updates. The list command shows three pieces of information for each device: • Current Revision — The revision of the device’s current firmware • Filename — The name of the file used to update that firmware • Update revision — The revision of the firmware update image readme The readme command lists release notes for the LFU program.
A.6 Updating Firmware from AlphaBIOS
Insert the CD-ROM or diskette with the updated firmware and select Upgrade AlphaBIOS from the main AlphaBIOS Setup screen. Use the Loadable Firmware Update (LFU) utility to perform the update. The LFU exit command causes a system reset.
Figure A-3 AlphaBIOS Setup Screen
[Screen image: the AlphaBIOS Setup menu, with the items Display System Configuration..., Upgrade AlphaBIOS, Hard Disk Setup, CMOS Setup..., Install Windows NT, Utilities, and About AlphaBIOS...]
Upgrading AlphaBIOS As new versions of Windows NT are released, it might be necessary to upgrade AlphaBIOS to the latest version. Additionally, as improvements are made to AlphaBIOS, it might be desirable to upgrade to take advantage of new AlphaBIOS features. Use this procedure to upgrade from an earlier version of AlphaBIOS: 1. Insert the diskette or CD-ROM containing the AlphaBIOS upgrade. 2.
Appendix B SRM Console Commands and Environment Variables This appendix provides a summary of the SRM console commands and environment variables. The test command is described in Chapter 3 of this document. For complete reference information on the other SRM commands and environment variables, see the AlphaServer 4000/4100 System Drawer User’s Guide.
B.1 Summary of SRM Console Commands
The SRM console commands are used to examine or modify the system state.
Table B-1 Summary of SRM Console Commands
Command        Function
alphabios      Loads and starts the AlphaBIOS console.
boot           Loads and starts the operating system.
clear envar    Resets an environment variable to its default value.
continue       Resumes program execution.
crash          Forces a crash dump at the operating system level.
deposit        Writes data to the specified address.
Table B-1 Summary of SRM Console Commands (Continued)
Command            Function
man                Displays information about the specified console command.
more               Displays a file one screen at a time.
prcache            Initializes and displays status of the PCI NVRAM.
set envar          Sets or modifies the value of an environment variable.
set host           Connects to an MSCP DUP server on a DSSI device.
set rcm_dialout    Sets a modem dialout string.
show envar         Displays the state of the specified environment variable.
B.2 Summary of SRM Environment Variables Environment variables pass configuration information between the console and the operating system. Their settings determine how the system powers up, boots the operating system, and operates. Environment variables are set or changed with the set envar command and returned to their default values with the clear envar command. Their values are viewed with the show envar command. The SRM environment variables are specific to the SRM console.
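The relationship between the set envar, clear envar, and show envar commands can be pictured as a table of default values plus a table of current values. The following Python sketch is only an illustration of that behavior. The class name EnvStore and the sample default values are assumptions made for this example and are not taken from the SRM console firmware.

```python
class EnvStore:
    """Toy model of SRM-style environment variables (hypothetical, not firmware code)."""

    def __init__(self, defaults):
        self._defaults = dict(defaults)  # default value of each variable
        self._values = dict(defaults)    # current settings

    def set(self, name, value):
        """Model of 'set envar': set or modify the value of a variable."""
        self._values[name] = value

    def clear(self, name):
        """Model of 'clear envar': return a variable to its default value."""
        if name in self._defaults:
            self._values[name] = self._defaults[name]
        else:
            # Assumption: a variable with no default is simply removed.
            self._values.pop(name, None)

    def show(self, name):
        """Model of 'show envar': return the current value, or None if unset."""
        return self._values.get(name)


# The default values below are assumed for illustration only.
env = EnvStore({"os_type": "unix", "pci_parity": "on"})
env.set("pci_parity", "off")
after_set = env.show("pci_parity")      # value after 'set'
env.clear("pci_parity")
after_clear = env.show("pci_parity")    # back to the default after 'clear'
```

In this model, clearing a variable never deletes its default, which matches the description of clear envar as a return to the default value.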
Table B-2 Environment Variable Summary (Continued)
Environment Variable    Function
memory_test             Specifies the extent to which memory will be tested. For DIGITAL UNIX systems only.
ocp_text                Overrides the default OCP display text with specified text.
os_type                 Specifies the operating system and sets the appropriate console interface.
pci_parity              Disables or enables parity checking on the PCI bus.
pk*0_fast               Enables fast SCSI mode.
B.3 Recording Environment Variables You can make copies of the table below to record environment variable settings for specific systems. Write the system name in the column provided. Enter the show* command to list the system settings.
Table B-3 Environment Variables Worksheet (Continued) Environment Variable System Name System Name System Name pk*0_soft_term sys_model_num sys_serial_num sys_type tga_sync_green tt_allow_login
Appendix C Operating the System Remotely This appendix describes how to use the remote console monitor (RCM) to monitor and control the system remotely. C.1 RCM Console Overview The remote console monitor (RCM) is used to monitor and control the system remotely. The RCM resides on the server control module and allows the system administrator to connect remotely to a managed system through a modem, using a serial terminal or terminal emulator.
C.1.1 Modem Usage To use the RCM to monitor a system remotely, first make the connections to the server control module, as shown below. Then configure the modem port for dial-in.
Modem Selection
The RCM requires a Hayes-compatible modem. The controls that the RCM sends to the modem have been selected to be acceptable to a wide selection of modems. The modems that have been tested and qualified include:
Motorola LifeStyle Series 28.8
AT&T DATAPORT 14.4/FAX
Zoom Model 360
The U.S. Robotics Sportster DATA/FAX MODEM is also supported, but requires some modification of the modem initialization and answer strings. See Section C.1.7.
Modem Configuration Procedure 1.
Dialing In to the RCM Modem Port 1. Dial the modem connected to the server control module. The RCM answers the call and after a few seconds prompts for a password with a “#” character. 2. Enter the password that was loaded using the setpass command. The user has three tries to correctly enter the password. On the third unsuccessful attempt, the connection is terminated, and as a security precaution, the modem is not answered again for 5 minutes.
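The password rule in step 2 (three tries, then no answer for 5 minutes) can be sketched as a small state holder. This Python example is a model of the described behavior only, not RCM firmware. The class name DialInGate, its method names, and the injectable clock are assumptions introduced for illustration.

```python
import time

MAX_TRIES = 3             # from the text: three tries to enter the password
LOCKOUT_SECONDS = 5 * 60  # from the text: modem is not answered again for 5 minutes


class DialInGate:
    """Hypothetical model of the RCM dial-in password check."""

    def __init__(self, password, clock=time.monotonic):
        self._password = password
        self._clock = clock          # injectable so the lockout can be simulated
        self._locked_until = 0.0

    def will_answer(self):
        """True if the modem would currently be answered."""
        return self._clock() >= self._locked_until

    def login(self, attempts):
        """Check the passwords typed during one call; return True on success."""
        if not self.will_answer():
            return False
        for count, attempt in enumerate(attempts, start=1):
            if attempt == self._password:
                return True
            if count >= MAX_TRIES:
                # Third unsuccessful attempt: drop the call and start the lockout.
                self._locked_until = self._clock() + LOCKOUT_SECONDS
                return False
        return False  # caller gave up before using all three tries
```

Passing a fake clock function makes it possible to step through the 5-minute lockout without waiting.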
C.1.2 Entering and Leaving Command Mode
Use the default escape sequence to enter RCM command mode for the first time. You can enter RCM command mode from the SRM console level, the operating system level, or an application. The RCM quit command reconnects the terminal to the system console port.
Example C-2 Entering and Leaving RCM Command Mode
^]^]rcm ➊
RCM>
RCM> quit ➋
Focus returned to COM port
Entering the RCM Firmware Console
To enter the RCM firmware console, enter the RCM escape sequence.
C.1.3 RCM Commands The RCM commands summarized below are used to control and monitor a system remotely.
Command Conventions
• The commands are not case sensitive.
• A command must be entered in full.
• If a command is entered that is not valid, the command fails with the message: *** ERROR - unknown command *** Enter a valid command.
The RCM commands are described on the following pages.
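The conventions above amount to an exact, case-insensitive match against the command list. The following Python sketch illustrates that rule. The function name parse_rcm_command is hypothetical, the command set includes only commands named in this appendix and may be incomplete, and treating an empty line as an unknown command is an assumption.

```python
# Commands named in this appendix; the real list may contain others.
RCM_COMMANDS = {
    "alert_clr", "alert_dis", "alert_ena", "disable", "enable", "halt",
    "hangup", "help", "?", "poweroff", "poweron", "quit", "reset",
    "setpass", "status",
}

UNKNOWN_COMMAND = "*** ERROR - unknown command ***"


def parse_rcm_command(line):
    """Return the normalized command name, or raise ValueError if it is not valid."""
    words = line.strip().split()
    if not words:
        # Assumption: an empty line is treated like any other invalid command.
        raise ValueError(UNKNOWN_COMMAND)
    command = words[0].lower()       # commands are not case sensitive
    if command not in RCM_COMMANDS:  # no abbreviations: must be entered in full
        raise ValueError(UNKNOWN_COMMAND)
    return command
```

For example, STATUS and status are accepted as the same command, while the abbreviation stat is rejected with the unknown-command message.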
C.1.3.1 alert_clr
The alert_clr command clears an alert condition within the RCM. The alert enable condition remains active, and the RCM will again enter the alert condition when it detects a system power failure.
RCM>alert_clr
C.1.3.2 alert_dis
The alert_dis command disables RCM dial-out capability. It also clears any outstanding alerts. The alert disable state is nonvolatile. Dial-out capability remains disabled until the alert_ena command is issued.
RCM>alert_dis
C.1.3.
C.1.3.4 disable The disable command disables remote access to the RCM modem port. RCM>disable The module’s remote access default state is DISABLED. The modem enable state is nonvolatile. When the modem is disabled, it remains disabled until the enable command is issued. If a modem connection is in progress, entering the disable command terminates it. C.1.3.5 enable The enable command enables remote access to the RCM modem port. It can take up to 10 seconds for the enable command to be executed.
C.1.3.7 halt The halt command attempts to halt the managed system. It is functionally equivalent to pressing the Halt button on the system operator control panel to the “in” position and then releasing it to the “out” position. The RCM console firmware exits command mode and reconnects the user’s terminal to the server’s COM1 serial port. RCM>halt Focus returned to COM port NOTE: Pressing the Halt button has no effect on systems running Windows NT. C.1.3.
C.1.3.10 poweron
The poweron command requests the RCM module to power on the system. For the system power to come on, the following conditions must be met:
• AC power must be present at the power supply inputs.
• The DC On/Off button must be in the “on” position.
• All system interlocks must be set correctly.
The RCM firmware console exits command mode and reconnects the user’s terminal to the system console port.
C.1.3.12 reset The reset command requests the RCM module to perform a hardware reset. It is functionally equivalent to pressing the Reset button on the system operator control panel. RCM>reset Focus returned to COM port The following events occur when the reset command is executed: • The system restarts and the system console firmware reinitializes. • The console exits RCM command mode and reconnects the user’s terminal to the server’s COM1 serial port.
If the escape sequence entered exceeds 15 characters, the command fails with the message: *** ERROR *** When changing the default escape sequence, avoid using special characters that are used by the system’s terminal emulator or applications. Control characters are not echoed when entering the escape sequence. To verify the complete escape sequence, use the status command. C.1.3.
C.1.3.15 status
The status command displays the current state of the server’s sensors, as well as the current escape sequence and alarm information.
RCM>status
Firmware Rev: V1.0
Escape Sequence: ^]^]RCM
Remote Access: ENABLE/DISABLE
Alerts: ENABLE/DISABLE
Alert Pending: YES/NO
Temp (C): 26.0
RCM Power Control: ON/OFF
External Power: ON
Server Power: OFF
RCM>
The status fields are explained in Table C-2.
Table C-2 RCM Status Command Fields
Item Description
Firmware Rev: Revision of RCM firmware.
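When the status display is captured by a terminal emulator, it can be turned into a lookup table for logging or monitoring. The following Python sketch assumes the display prints one "Item: value" field per line, as in the status example. The function name parse_status is hypothetical, and the sample values are illustrative rather than captured from a real system.

```python
def parse_status(text):
    """Split captured status output into a dictionary of item -> value."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("RCM>"):
            continue  # skip blank lines and the prompt lines
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Illustrative capture; real values depend on the system being monitored.
SAMPLE = """\
RCM>status
Firmware Rev: V1.0
Escape Sequence: ^]^]RCM
Remote Access: ENABLE
Alerts: DISABLE
Alert Pending: NO
Temp (C): 26.0
RCM Power Control: ON
External Power: ON
Server Power: OFF
RCM>
"""

status = parse_status(SAMPLE)
temperature_c = float(status["Temp (C)"])
```

Splitting on the first colon only keeps a field name such as Temp (C) intact as the key.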
C.1.4 Dial-Out Alerts The RCM can be configured to automatically dial out through the modem (usually to a paging service) when it detects a power failure within the system. When a dial-out alert is triggered, the RCM initializes the modem for dial-out, sends the dial-out string, hangs up the modem, and reconfigures the modem for dial-in. The RCM and modem must continue to be powered, and the phone line must remain active, for the dial-out alert function to operate.
Enabling the Dial-Out Alert Function: 1. Enter the set rcm_dialout command, followed by a dial-out alert string, from the SRM console (see ➊ in Example C-3). The string is a modem dial-out character string, not to exceed 47 characters, that is used by the RCM when dialing out through the modem. See the next topic for details on composing the modem dial-out string. 2. 3. Enter the RCM firmware console and enter the enable command to enable remote access dial-in.
Composing a Modem Dial-Out String
The modem dial-out string emulates a user dialing an automatic paging service. Typically, the user dials the pager phone number, waits for a tone, and then enters a series of numbers. The RCM dial-out string (Example C-4) has the following requirements:
• The entire string following the set rcm_dialout command must be enclosed by quotation marks.
• The characters ATDT must be entered after the opening quotation marks. Do not mix case.
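The requirements stated so far (quotation marks, the ATDT prefix, consistent case, and the 47-character limit given in the enabling procedure) can be checked before the string is entered at the console. The following Python sketch covers only those rules. The function name check_dialout_argument is hypothetical, the sample phone number is invented, and two points are assumptions: the 47-character limit is counted without the quotation marks, and the case rule is applied to the ATDT prefix only.

```python
MAX_DIALOUT_CHARS = 47  # from the text: the string is not to exceed 47 characters


def check_dialout_argument(arg):
    """Return a list of problems found in a set rcm_dialout argument (empty if none)."""
    problems = []
    if len(arg) >= 2 and arg.startswith('"') and arg.endswith('"'):
        body = arg[1:-1]
    else:
        problems.append("string must be enclosed in quotation marks")
        body = arg.strip('"')
    if len(body) > MAX_DIALOUT_CHARS:
        # Assumption: the limit applies to the text between the quotation marks.
        problems.append("string exceeds 47 characters")
    prefix = body[:4]
    if prefix.upper() != "ATDT":
        problems.append("string must begin with ATDT")
    elif prefix not in ("ATDT", "atdt"):
        # Assumption: "do not mix case" is checked on the prefix only.
        problems.append("do not mix case in ATDT")
    return problems


# Hypothetical pager number, used only to exercise the checks.
result = check_dialout_argument('"ATDT5550100,,,,1234#"')
```

An empty list means that none of the checked rules is violated. It does not guarantee that the paging service accepts the string.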
C.1.5 Resetting the RCM to Factory Defaults If the escape sequence has been forgotten, you can reset the controller to factory settings. Reset Procedure 1. Power down the AlphaServer system and access the server control module, as follows: Expose the PCI bus card cage. Remove three Phillips head screws holding the cover in place and slide it off the drawer.
C.1.6 Troubleshooting Guide Table C-3 lists a number of possible causes and suggested solutions for symptoms you might see. Table C-3 RCM Troubleshooting Symptom Possible Cause Suggested Solution The local terminal will not communicate with the system or the RCM. System and terminal baud rate set incorrectly. Set the system and terminal baud rates to 9600 baud. Cables not correctly installed. Review external cable installation. Modem cables may be incorrectly installed.
Table C-3 RCM Troubleshooting (Continued)
Symptom: After the system and RCM are powered up, the COM port seems to hang and then starts working after a few seconds.
Possible Cause: This delay is normal behavior.
Suggested Solution: Wait a few seconds for the COM port to start working.

Symptom: RCM installation is complete, but system will not power up.
Possible Cause: RCM Power Control: is set to DISABLE.
Suggested Solution: Enter RCM console and issue the poweron command.

Symptom: Cannot enable modem or modem will not answer.
Possible Cause: The modem is not configured correctly to work with the RCM.
Suggested Solution: Modify the modem initialization and/or answer string.
C.1.7 Modem Dialog Details
This section provides further details on the dialog between the RCM and the modem and is intended to help you reprogram your modem if necessary.
Phases of Modem Operation
The RCM is programmed to expect specific responses from the modem during four phases of operation:
• Initialization
• Ring detection
• Answer
• Hang-up
The initialization and answer command strings are stored in the RCM NVRAM.
This default initialization string works on a wide variety of modems. If your modem does not configure itself to these parameters, the initialization string will need to be modified. See the topic in this section entitled Modifying Initialization and Answer Strings. Ring Detection The RCM expects to be informed of an in-bound call by the modem signaling the RCM with the string, “2” (RING). Answer When the RCM receives the ring message from the modem, it responds with the answer string.
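The ring detection and answer phases reduce to one rule: when the modem reports "2" (RING), the RCM sends the answer string. The following Python sketch models only that rule. The function name respond_to_modem is hypothetical, the default value ATXA is the default answer string listed in this appendix, and ignoring surrounding whitespace and line terminators is an assumption.

```python
RING_CODE = "2"          # from the text: the modem signals an in-bound call with "2" (RING)
DEFAULT_ANSWER = "ATXA"  # default answer string given in this appendix


def respond_to_modem(message, answer_string=DEFAULT_ANSWER):
    """Return the string to send to the modem, or None if no reply is needed."""
    # Assumption: surrounding whitespace and line terminators are ignored.
    if message.strip() == RING_CODE:
        return answer_string
    return None
```

A modem that needs a different answer string can be modeled by passing that string as answer_string.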
RCM/Modem Interchange Overview Table C-4 summarizes the actions between the RCM and the modem from initialization to hangup.
To display all the RCM user-settable strings:
P00>>> show rcm*
rcm_answer ATXA
rcm_dialout
rcm_init AT&F0EVS0=0S12=50
P00>>>
Initialization and Answer String Substitutions
The RCM default initialization and answer strings are as follows:
Initialization String: “AT&F0EVS0=0S12=50”
Answer String: “ATXA”
The following modem requires a modified answer string.
Index ? ? command, RCM, C-10 4 4000 system drawer, 1-4, 1-6 4100 system drawer, 1-2 A alert_clr command, RCM, C-8 alert_dis command, RCM, C-8 alert_ena command, RCM, C-8 Alpha 21164 microprocessor, 1-16 Alpha chip composition, 1-20 AlphaBIOS console, 1-15 loading, 2-7 upgrading, A-28 Architecture, system, 1-16 auto_action environment variable, SRM, 2-23 Auxiliary voltage (vaux), 4-9 B B3002-AA CPU module, 1-21 B3002-AB CPU module, 1-21 B3002-BA CPU module, 1-21 B3004-AA CPU module, 1-21 B3004-DA CPU modu
CD-ROM removal and replacement, 7-60 COM1 port, 2-19 Command codes, 5-54 Command summary (SRM), B-2 Components housed in system drawer, 1-2, 1-4, 1-6 Console SRM, 2-23 Console device determination, 2-18 Console device options, 2-19 Console device, changing, 2-19 console environment variable, SRM, 2-21, 2-23 Control panel, 1-12, 2-2 display, 2-21 Halt button, 1-13 LCD potentiometer, 2-2 messages in display, 2-3 Controls Halt button, 1-13 Cover interlocks, 1-3, 1-5, 1-7, 4-7 overri
updating from network device, A-20 updating, AlphaBIOS selection, A-6 updating, SRM command, A-6 Floppy removal and replacement, 7-58 FRU list, 7-2 4000 power system, 7-10 4100 power system, 7-8 FRU part numbers, 7-3 G Graphics monitor, VGA, 2-19 H H7600-AA power controller, 1-9 H7600-DB power controller, 1-9 halt command, RCM, C-10 Halts caused by power problem, 3-6 hangup command, RCM, C-9 Hard errors, categories of, 5-4 help command (LFU), A-24, A-25 help command, RCM, C-10 I I squared C bus, 3-10 INFO
MCHK 620 correctable error, 5-44 MCHK 630 correctable CPU error, 5-41 MCHK 660 IOD detected failure, 5-27, 5-32 MCHK 670 CPU and IOD detected failure, 5-16 MCHK 670 CPU-detected failure, 5-11 MCHK 670 read dirty failure, 5-21 Memory addressing, 1-24 rules, 1-25 Memory errors corrected read data error, 5-52 read data substitute error, 5-52 Memory modules, 1-17, 1-22, 7-3 removal and replacement, 7-22 variants, 1-23 Memory operation, 1-23 Memory option configuration rules, 1-23 Memory pairs, 1-23 Memory tests, 2
voltages, 4-3 Power system components, 7-4 poweroff command, RCM, C-10 poweron command, RCM, C-11 Power-up SROM and XSROM messages during, 2-19 Power-up display, 2-20 Power-up sequence, 2-4 Power-up/down sequence, 4-8 Processor determining primary, 2-21 Processor correctable error, 5-5 Processor machine checks, 5-5 Q quit command, RCM, C-11 R RAID Standalone Configuration Utility, running, A-5 RCM, C-1 command summary, C-6 dial-out alerts, C-15 entering and leaving command mode, C-5 modem usage, C-2 reset
System bus ECC error, 5-47 System bus nonexistent address error, 5-48 System bus to PCI bus bridge module, 1-17, 1-28 System bus to PCI/EISA bus bridge module, 1-17 System consoles, 1-14 System correctable errors, 5-5 System drawer components of, 1-2, 1-4, 1-6 FRU locations, 7-2 fully configured, 1-17 remote operation, C-1 System drawer exposure original cabinet, 7-12 pedestal, 7-16 System drawer modules, 7-3 System machine checks, 5-5 System model number, displaying, 7-34 System motherboard, 1-18 System mot