EK-PM32E-PS-001

DECstation 5000 Model 100 Series Pocket Service Guide

Digital Equipment Corporation
Maynard, Massachusetts

August 1991

The information in this document is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document. The software described in this document is furnished under a license and may be used or copied only in accordance with the terms of such license.
Contents

Using This Guide
    Chapters
    Conventions

1 TROUBLESHOOTING INFORMATION
    Error Messages
        Test failure messages
        Console exception messages
        Memory test error messages
    Addresses
        Slot numbers
        Memory addresses
        Hardware physical addresses
    ULTRIX Error Logs
        Memory parity error log fields
        CPU write timeout
        Bus timeout
    Diagnostic LEDs
    Registers
        Cause register
        System registers

2 TROUBLESHOOTING TOOLS
    Self-tests
    Console Mode Tests
        Console commands
        t command
        SCSI controller (cntl) test
Figures

2-1   Troubleshooting procedure, 1 of 2
2-2   Troubleshooting procedure, 2 of 2
2-3   When the LED display is 1111 1111, 0011 1111, 0011 1110, or 0011 1101, 1 of 2
2-4   When the LED display is 1111 1111, 0011 1111, 0011 1110, or 0011 1101, 2 of 2
2-5   When the LED display is 0011 0110
2-6   When the LED display is 0010 0011, 0001 0011, 0000 0011, or 0000 0000
2-12  When hardware does not appear in the cnfg display, 2 of 3
2-13  When hardware does not appear in the cnfg display, 3 of 3
2-14  Troubleshooting memory modules
2-15  Troubleshooting SCSI controllers and devices, 1 of 2
2-16  Troubleshooting SCSI controllers and devices, 2 of 2
2-17  Troubleshooting an Ethernet controller, 1 of 2
2-26  When ULTRIX is running but the monitor has no display, 2 of 3
2-27  When ULTRIX is running but the monitor has no display, 3 of 3
3-1   DECstation 5000 Model 100 Series Major FRUs

Tables

1-1   Base system test error messages
1-2   Slot numbers in commands and messages
1-3   Memory module slot address ranges
2-4   SCSI send diagnostics error codes and descriptions
2-5   External loopback test codes and descriptions
2-6   SCC transmit and receive test codes and descriptions
2-7   Pin pairs tested by loopback connectors
2-8   SCC pins test codes and descriptions
3-1   Part numbers: Basic system components
3-2   Part numbers: Internal drives
Using This Guide

This guide contains the information that you need for field maintenance of the DECstation 5000 Model 100 Series RISC workstation. Field maintenance consists of identifying and replacing failed field replaceable units (FRUs).

Chapters

This guide contains the following chapters:

    Chapter 1  Troubleshooting Information
    Chapter 2  Troubleshooting Tools
    Chapter 3  Part Numbers

Chapter 1, Troubleshooting Information, describes the types of information that help you identify failed FRUs. Some of the troubleshooting information, such as exception messages and diagnostic LEDs, is displayed automatically by the system. Other information, such as test error messages, ULTRIX error logs, and registers, must be specifically generated or accessed by the engineer.

Chapter 2, Troubleshooting Tools, describes the tools that you use to test the system and its components.
Conventions

This guide uses the following conventions:

Monospace type   Anything that appears on your monitor screen is set in monospace type, like this.
Boldface type    Anything you are asked to type is set in boldface type, like this.
Italic type      Any part of a command that you replace with an actual value is set in italic type, like this.
1 TROUBLESHOOTING INFORMATION

Error Messages

An error message can be either an exception message that is automatically displayed when something goes wrong during normal system operation or a test failure message that is displayed when an automatic or user-initiated test fails.
Test failure messages

The test failure message format is:

    ?TFL slot_number/test_name (n:description)[module]

?TFL          Identifies a test error message
slot_number   Identifies the module that reported the error
test_name     The test that failed
n             Indicates which part of the test failed
description   Describes the failure
module        The module identification number

Table 1-1 lists the test values that can appear in the test failure message when some component part of the base system (slot number 3) fails.
Table 1-1 Base system test error messages

Test            Component Tested          Corrective Action

cache/data      CPU module                Replace the CPU module. If the
cache/fill                                problem persists, replace the
cache/isol                                system module.
cache/reload
cache/seg
fpu

mem             Memory modules            Troubleshoot according to
mem/float10                               Figure 2-14.

mem/select      Memory and system         Replace the failed memory module.
                module                    If the problem persists, replace
                                          the system module.

misc/halt       System module             Replace the system module.

ni/cllsn        Base system Ethernet      Troubleshoot according to
ni/common       controller                Figure 2-17.
ni/crc
ni/cntrs
ni/dma1
ni/dma2
ni/esar
ni/ext-lb
ni/int
ni/int-lb
ni/m-cst
ni/promisc
ni/regs
ni/setup

rtc/nvr         System module             Replace the system module.
rtc/period
rtc/regs
rtc/time
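When test failure messages are collected in a log, it can help to split them into the fields described above. The following is a minimal sketch in Python, assuming the message follows the format ?TFL slot_number/test_name (n:description)[module] exactly. The names TflMessage and parse_tfl are hypothetical, the sample message in the usage note is made up for illustration, and descriptions that contain nested parentheses are not handled.

```python
import re
from typing import NamedTuple, Optional


class TflMessage(NamedTuple):
    """Fields of a test failure message (hypothetical container)."""
    slot_number: int
    test_name: str
    part: str
    description: str
    module: Optional[str]


# Assumed layout: ?TFL slot_number/test_name (n:description)[module]
# The colon after ?TFL and the [module] suffix are treated as optional.
_TFL_PATTERN = re.compile(
    r"^\?TFL:?\s*(?P<slot>\d+)/(?P<test>[^\s(]+)\s*"
    r"\((?P<part>[^:)]+):\s*(?P<desc>[^)]*)\)\s*"
    r"(?:\[(?P<module>[^\]]*)\])?\s*$"
)


def parse_tfl(line: str) -> Optional[TflMessage]:
    """Return the parsed fields, or None if the line does not match."""
    match = _TFL_PATTERN.match(line.strip())
    if match is None:
        return None
    return TflMessage(
        slot_number=int(match["slot"]),
        test_name=match["test"],
        part=match["part"].strip(),
        description=match["desc"].strip(),
        module=match["module"],
    )
```

For example, parse_tfl("?TFL 3/rtc/nvr (1: sample text)[XX]") yields slot number 3 and test name rtc/nvr, which Table 1-1 maps to the system module.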
Console exception messages

This is a typical console exception message:

    ? PC:  0x451
    ? CR:  0x810
    ? SR:  0x30030000
    ? VA:  0x451
    ? ER:  0x100003f0
    ? MER: 0x2000

PC    The address of the exception instruction
CR    The contents of the cause register. The last term is the exception type. The exception types are as follows:
      MOD, TLBL, or TLBS: An invalid address was probably used in a console command.
Memory test error messages

This is a typical memory test error message:

    ?TFL:3/mem(PER,cause=0000001C, DBE=0040000c, Bank 2, D16-31,d23-d16)

Bank     The slot number of the problem memory module
D16-31   The module farthest from the power supply failed.
D0-15    The module nearest the power supply failed.
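The bank number and data half in a memory test error message together identify the physical module to pull. The following Python sketch extracts both fields from a message shaped like the sample above. The function name locate_failed_module is hypothetical, and the sketch assumes the "Bank n" and "D16-31" or "D0-15" tokens appear literally in the message as they do in the sample.

```python
import re
from typing import Optional, Tuple

# Mapping taken from the field descriptions for the memory test error message.
HALF_TO_MODULE = {
    "D16-31": "module farthest from the power supply",
    "D0-15": "module nearest the power supply",
}


def locate_failed_module(message: str) -> Optional[Tuple[int, str]]:
    """Return (memory slot number, which module of the pair), or None."""
    bank = re.search(r"Bank\s+(\d+)", message)
    # Case-sensitive on purpose: the lowercase d23-d16 term is a different field.
    half = re.search(r"\bD(?:16-31|0-15)\b", message)
    if bank is None or half is None:
        return None
    return int(bank.group(1)), HALF_TO_MODULE[half.group(0)]
```

Applied to the sample message, this points at memory slot 2 and the module farthest from the power supply.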
Addresses Slot numbers Table 1-2 Slot numbers in commands and messages Slot No.
Memory addresses These addresses appear in memory error printouts. Table 1-3 Memory module slot address ranges Slot No.
Hardware physical addresses These addresses appear in error printouts.
Table 1-4 (Cont.
ULTRIX Error Logs

To examine the ULTRIX error logs from the ULTRIX prompt, type

    /etc/uerf -R | more

Table 1-5 Error log event types

Code   Event Type
100    Machine check
101    Memory error
102    Disk error
103    Tape error
104    Device controller error
105    Adapter error
106    Bus error
107    Stray interrupt
108    Asynchronous write error
109    Exception or fault
113    CPU error and status information
130    Error and status registers
200    Panic (bug check)
250    Informational ASCII message
251    Operation
Memory parity error log fields

The following memory error log fields are pertinent when a memory parity error occurs:

The ERROR SYNDROME field identifies the memory parity error.

The MEM REG fields give the following memory failure information:

    HARD CNT shows how many errors recurred on both read and write operations.
    SOFT CNT shows how many errors recurred on read but cleared on write.
    TRAN CNT shows how many errors did not recur on read.
CPU write timeout

The following error and status register error log fields are pertinent when a CPU write timeout occurs:

    OS EVENT TYPE refers to the error and status registers for a CPU write timeout.
    PANIC MESSAGE indicates a CPU write timeout.
    The CAUSE register gives no information for a CPU write timeout.
    The BAD VIRT ADR register identifies the address of the timeout.
    The SIR register shows the write timeout error.
Diagnostic LEDs

Table 1-6 LED error codes

LED Error Code (1=On)   Troubleshooting Procedure

1111 1111               Troubleshoot according to Figure 2-3.
0011 1111
0011 1110
0011 1101

0011 0111               Replace the CPU module. If the LEDs display
                        0011 0111 when the power-up self-test stops,
                        replace the system module.

0011 0110               Troubleshoot according to Figure 2-5.

0010 0011               Troubleshoot according to Figure 2-6.
0001 0011
0000 0011
0000 0000

0011 1011               Troubleshoot according to Figure 2-7.
0010 1011
0001 1011
0000 1011
Registers

There are two types of registers: CPU registers and system registers. CPU register information is automatically displayed on the screen when an exception occurs. To access system registers from the console prompt (>>), enter the e command.

Cause register

The cause register is a CPU register and is displayed in exception error messages only. You cannot access the cause register independently.
Table 1-7 Cause register exception codes

Number   Mnemonic   Description
0        Int        Interrupt
1        Mod        TLB modification exception
2        TLBL       TLB miss exception (load or instruction fetch)
3        TLBS       TLB miss exception (store)
4        AdEL       Address error exception (load or instruction fetch)
5        AdES       Address error exception (store)
6        IBE        Bus error exception (instruction fetch)
7        DBE        Bus error exception (data reference: load or store)
8        Sys        Syscall exception
9        Bp         Breakpoint exception
10       RI         Reserved inst
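The exception code numbers in Table 1-7 can be recovered from a raw cause register value. The sketch below assumes the code occupies bits 5:2 of the cause register, as on MIPS R3000-family processors; this guide does not state the field position, so treat that as an assumption. The function names are hypothetical, and only the mnemonics listed in the table are included.

```python
# Mnemonics copied from Table 1-7 (codes 0 through 10 only).
EXC_MNEMONICS = {
    0: "Int", 1: "Mod", 2: "TLBL", 3: "TLBS", 4: "AdEL", 5: "AdES",
    6: "IBE", 7: "DBE", 8: "Sys", 9: "Bp", 10: "RI",
}


def exception_code(cause: int) -> int:
    """Extract the exception code, assuming it sits in bits 5:2."""
    return (cause >> 2) & 0xF


def exception_mnemonic(cause: int) -> str:
    """Map a cause register value to its Table 1-7 mnemonic."""
    return EXC_MNEMONICS.get(exception_code(cause), "unknown")
```

Under this assumption, the value cause=0000001C from the sample memory test error message decodes to code 7 (DBE), which agrees with the DBE term in that same message, and the CR value 0x810 from the sample console exception message decodes to code 4 (AdEL).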
System registers

To examine a system register from the console prompt (>>), enter the e command:

    e [options] [console_address]

Table 1-8 System registers

Register   Console Address   Description
SSR        0xBC040100        System support register
MER        0xAC400000        Memory error register
SIR        0xBC040110        System interrupt register
Mask       0xBC040120        System interrupt mask register
MSR        0xAC800000        Memory size register
EAR        0xAE000004        Error address register
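The console addresses in Table 1-8 differ from the addresses printed in the headings of Tables 1-9 and 1-10 only in their top three bits. That is consistent with the MIPS unmapped kernel segments (KSEG0 and KSEG1), where clearing the top three bits of the virtual address gives the physical address. The sketch below illustrates the arithmetic under that assumption; the function name is hypothetical.

```python
KSEG_PHYSICAL_MASK = 0x1FFFFFFF  # clears the top three address bits


def console_to_physical(console_address: int) -> int:
    """Convert an unmapped-segment console address to a physical address."""
    return console_address & KSEG_PHYSICAL_MASK
```

For example, the SIR console address 0xBC040110 becomes 0x1C040110 and the MER console address 0xAC400000 becomes 0x0C400000, matching the addresses shown with Tables 1-10 and 1-9.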
Table 1-9 Memory Error Register (MER) 0x0C400000

Bits    Access   Description
31:17            Reserved
16      R/W      Page boundary error
15      R/W      Transfer length error
14      R/W      PARDIS memory error disable
13:12            Reserved
11:8    R/W      Byte(s) with parity error
7:0              Reserved
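A value read from the MER can be split into the fields of Table 1-9 with a few shifts and masks. The following is a minimal sketch; the function name and dictionary keys are hypothetical, and the parity field is returned as a raw 4-bit mask because the guide does not say which bit corresponds to which byte.

```python
def decode_mer(mer: int) -> dict:
    """Split a Memory Error Register value into the Table 1-9 fields."""
    return {
        "page_boundary_error": bool(mer & (1 << 16)),    # bit 16
        "transfer_length_error": bool(mer & (1 << 15)),  # bit 15
        "pardis": bool(mer & (1 << 14)),                 # bit 14, PARDIS
        "parity_error_byte_mask": (mer >> 8) & 0xF,      # bits 11:8
    }
```

A nonzero parity_error_byte_mask indicates that at least one byte reported a parity error.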
Table 1-10 System Interrupt Register (SIR) 0x1C040110

Bits   Access   Description
31     R/W0C    Comm port 1 transmit page end interrupt
30     R/W0C    Comm port 1 transmit DMA memory read error
29     R/W0C    Comm port 1 receive half page interrupt
28     R/W0C    Comm port 1 receive DMA page overrun
27     R/W0C    Comm port 2 transmit page end interrupt
26     R/W0C    Comm port 2 transmit DMA memory read error
25     R/W0C    Comm port 2 receive half page interrupt
24     R/W0C    Comm port 2 receive DMA overrun
23     R/W0C    Reserved
11     R        Reserved
10     R        NRMOD manufacturing mode jumper
9      R        SCSI interrupt from 53C94 SCSI controller
8      R        Ethernet interrupt
7      R        SCC(1) serial interrupt (comm port 2 and keyboard)
6      R        SCC(0) serial interrupt (comm port 1 and mouse)
5      R        TOY interrupt
4      R        PSWARN power supply warning indicator
3      R        Reserved
2      R        SCSI data ready
1      R        PBNC
0      R        PBNO

Note
Comm port 1 is the same as serial line 2.
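To read an SIR value quickly, the set bits can be translated back into the descriptions from Table 1-10. The sketch below covers only the read-only bits 10 through 0 that are listed in the table; the names SIR_LOW_BITS and decode_sir_low are hypothetical.

```python
# Descriptions copied from Table 1-10 for the non-reserved bits 10..0.
SIR_LOW_BITS = {
    10: "NRMOD manufacturing mode jumper",
    9: "SCSI interrupt from 53C94 SCSI controller",
    8: "Ethernet interrupt",
    7: "SCC(1) serial interrupt (comm port 2 and keyboard)",
    6: "SCC(0) serial interrupt (comm port 1 and mouse)",
    5: "TOY interrupt",
    4: "PSWARN power supply warning indicator",
    2: "SCSI data ready",
    1: "PBNC",
    0: "PBNO",
}


def decode_sir_low(sir: int) -> list:
    """List descriptions of the set bits, highest bit first."""
    return [
        name
        for bit, name in sorted(SIR_LOW_BITS.items(), reverse=True)
        if sir & (1 << bit)
    ]
```

A value of 0x300, for instance, has bits 9 and 8 set, so it reports the SCSI and Ethernet interrupts.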
2 TROUBLESHOOTING TOOLS

Self-tests

The system automatically runs a power-up test sequence when you turn the power on. The system runs a quick or thorough test sequence according to the value of the testaction environment variable (q for quick, t for thorough). Use quick for normal startup and thorough for troubleshooting. You can also run a self-test sequence from the console prompt without cycling system power.
Console Mode Tests

From the console prompt (>>), enter the t command to run an individual test or the sh command to run a test script.

Console commands

From the console prompt, enter ? to see a list of available console commands and their formats.

Table 2-1 Console command functions

Command                     Function

?[cmd]                      Displays one or more console commands and formats
boot [-zseconds] [-n][bootpath] [-a][args...
ls [slot_number]            Displays the scripts and other files in a module
passwd [-c] [-s]            Sets and clears the console password
printenv [variable]         Prints environment variables
restart                     Attempts to restart the operating system software that is specified in the restart block
script name                 Creates a temporary script of console commands
setenv variable value       Sets an environment variable
sh [-b] [-e] [-l] [-v] [-S] [slot_number/script] [arg...
t command

To run a single test from the console prompt, type

    t [-l] slot_number/test_name [arg1] [...] [argn]

t             The test command
-l            The test repeats until you press Ctrl-c or reset the system with the init command or by cycling power.
slot_number   Replace with the slot number of the module to be tested.
test_name     Replace with the name of the test to be run.
arg1...argn   Specify individual test conditions.

Table 2-2 lists the tests for the base system modules.
Table 2-2 Base system module tests and utilities

Test or Utility                            Command

System module tests:
Halt button                                t 3/misc/halt [number]
Nonvolatile RAM (NVR)                      t 3/rtc/nvr [pattern]
Overheat detect                            t 3/misc/pstemp
Real-time clock period                     t 3/rtc/period
Real-time clock register                   t 3/rtc/regs
Real-time                                  t 3/rtc/time
Serial communication chip (SCC) access     t 3/scc/access
Serial communication chip (SCC) DMA        t 3/scc/dma [line] [loopback] [baud]
SCC interrupts                             t 3/scc/int [line]
SCC I/O                                    t 3/scc/io [line]
SCSI controller (cntl) test

To test the operation of a SCSI controller from the console prompt, enter

    t slot_number/scsi/cntl

Table 2-3 SCSI controller error codes

(code: description)   Meaning
(1: rd cnfg)          Values written to and read from configuration register did not match.
(2: fifo flg)         First in, first out (FIFO) load and FIFO flags did not match.
(3: cnt xfr)          Write and read operation on TCL register reported a mismatch.
(4: illg cmd)         Command was illegal and did not generate an interrupt.
SCSI send diagnostics (sdiag) test

To run the self-test for an individual SCSI device from the console prompt, enter

    t slot_number/scsi/sdiag scsi_id [d] [u] [s]

Table 2-4 SCSI send diagnostics error codes and descriptions

(code: description)   Meaning
(1: dev ol)           Test could not bring the unit on line.
(2: dev ol)           Test could not bring the unit on line.
(3: sdiag)            Device failed the send diagnostics test.
External loopback test

To check an Ethernet controller and its connections from the console prompt, install a ThickWire loopback connector and enter the following command:

    t slot_number/ni/ext-lb
Table 2-5 External loopback test codes and descriptions

(code: description)
    Meaning

(1: (LANCE-init [xxxxxxxx]))
    LANCE initialization failed. xxxxxxxx is a LANCE failure code.

(3: (xmit [xxxxxxxx, yyyyyyyy] zzzzz))
    LANCE initialization failed. xxxxxxxx,yyyyyyyy is a LANCE failure code. zzzzz describes the likely cause of the failure.

(4: rcv [xxxxxxxx,yyyyyyyy])
    System did not receive packet. xxxxxxxx, yyyyyyyy describes the receive failure.

(6: pkt-data !=)
    Transmitted packet was not received.
Transmit and receive test

To test the transmit and receive function of a serial port from the console prompt (>>), install a communications adapter with an MMJ loopback connector and enter the following command:

    t 3/scc/tx-rx [line] line loopback [baud] [parity] [bits]

line       Specify line 0, 1, 2, or 3.
loopback   Specify intl for internal or extl for external.
baud       Specify 300, 1200, 2400, 3600, 4800, 9600, 19200, or 38400.
parity     Specify none, odd, or even.
bits       Specify 8, 7, or 6 bits per character.
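Before typing a long tx-rx command at the console, it can be useful to check each argument against the values the test accepts. The Python sketch below encodes only the value lists given above; the names ALLOWED_TX_RX_VALUES and invalid_tx_rx_args are hypothetical, and the sketch does not build or send the console command itself.

```python
# Accepted values, copied from the argument descriptions for the tx-rx test.
ALLOWED_TX_RX_VALUES = {
    "line": {0, 1, 2, 3},
    "loopback": {"intl", "extl"},
    "baud": {300, 1200, 2400, 3600, 4800, 9600, 19200, 38400},
    "parity": {"none", "odd", "even"},
    "bits": {8, 7, 6},
}


def invalid_tx_rx_args(**args) -> list:
    """Return one message per argument that the test would not accept."""
    problems = []
    for name, value in args.items():
        allowed = ALLOWED_TX_RX_VALUES.get(name)
        if allowed is None:
            problems.append(f"unknown argument: {name}")
        elif value not in allowed:
            problems.append(f"{name}={value!r} is not one of {sorted(allowed)}")
    return problems
```

An empty list means every supplied argument is within the documented ranges.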
Table 2-6 SCC transmit and receive test codes and descriptions

(code: description)
    Meaning

1: LnN tx bfr not empty. status=xx
    System could not write a single character because the transmit buffer was not empty. The error occurred on line N. xx is the contents of SCC read register 0.

2: LnN char not rcvd. status=xx
    Expected CHAR AVAIL signal not received. The error occurred on line N. xx is the contents of SCC read register 0.
SCC pins test

To test the pins on a communications connector from the console prompt, install a modem loopback connector on the communications connector and enter the following command:

    t 3/scc/pins line attachment

line   Specify line 2 (right connector) or 3 (left).
Table 2-7 Pin pairs tested by loopback connectors

Loopback Connector   Pin Pairs   Tested

29-24795             4-5         RTS to CTS
                     23-6-8      SS to DSR and CD
                                 6-23 failure implies 6 broken.
                                 8-23 failure implies 8 broken.
                                 6-23 8-23 failure implies 23 broken.

H3200                4-5         RTS to CTS
                     6-20        DSR to DTR
                     12-23       SI to SS

H8571-A              4-5         RTS to CTS
                     20-6-8      DTR to DSR and CD
                                 6-20 failure implies 6 broken.
                                 8-20 failure implies 8 broken.
                                 6-20 8-20 failure implies 20 broken.
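The failure notes in Table 2-7 follow one rule for the two three-pin loops: if only one pair fails, the non-shared pin of that pair is suspect, and if both pairs fail, the shared pin is suspect. The sketch below encodes that rule for the 29-24795 and H8571-A connectors; the names SHARED_PIN and suspect_pin are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Pin that both tested pairs have in common, per Table 2-7.
SHARED_PIN = {"29-24795": 23, "H8571-A": 20}


def suspect_pin(connector: str, failed_pairs: Iterable[Tuple[int, int]]) -> Optional[int]:
    """Return the pin that Table 2-7 implies is broken, or None."""
    shared = SHARED_PIN[connector]
    failed = {frozenset(pair) for pair in failed_pairs}
    pair_6_failed = frozenset({6, shared}) in failed
    pair_8_failed = frozenset({8, shared}) in failed
    if pair_6_failed and pair_8_failed:
        return shared  # both loops fail: the common pin is suspect
    if pair_6_failed:
        return 6
    if pair_8_failed:
        return 8
    return None
```

With the 29-24795 connector, for example, a 6-23 failure alone points to pin 6, while 6-23 and 8-23 failing together point to pin 23.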
Table 2-8 SCC pins test codes and descriptions

(code: description)
    Meaning

1:LnN Invld param [xx]
    The loopback specifier was invalid. The error occurred on serial line N. xx is the first two characters of the invalid value.

2:LnN Strtup R-xx xptd=yy actl=zz | pins |
    Test failed to generate the expected SCC status bits. The error occurred on serial line N. xx is the number of the SCC register that contains the status bits. yy is the expected status bits. zz is the actual status bits.
Test scripts

To run a test script from the console prompt (>>), type

    sh [options] slot_number/test_name

sh            The shell command
options:
    -b        Executes script directly, not through a subshell
    -e        Script halts on error.
    -l        Test loops until Ctrl-c or system reset.
    -v        Echoes script to console
    -S        Suppresses script-not-found error messages
slot_number   Replace with the slot number of the module to be tested.
test_name     Replace with the name of the script to be run.
Flow Charts Start See Figure 2-21, Does the powerNo supply"Troubleshooting the power LED glow green? supply." Yes Do the diagnostic NoLEDs flicker but not count See Table 1-6, "LED error down to 0000 0000? codes." Yes Does a display No See Figure 2-9, "When the appear on the monitor?monitor has no display." Yes Yes Does the monitor display an error message? No Type test ; press Return. Does the monitor Yes now display an error See the "Error Messages" section, page 1-1.
LEDs = 0011 0110 Insert two good memory modules into slot 0 and at least 8 Mbytes of memory total in the system. Does the power-up self-test still stop and the LEDs display 0011 0110? Yes Replace the system module. No Replace the remaining memory modules one pair at a time. After each pair, type 3/mem and the slot number. Press Return. End Replace any modules that report an error after the memory test. See "Memory Test Error Messages".
Start Do the left LEDs display 0010 or 0001? Replace the option module. No Yes Find the option module in the slot number displayed on the left LEDs. No Is it a 2D graphics accelerator module? End Do the left LEDs display 0000? Yes No Yes Do the right LEDs display 0000? Yes No Is the option module Yes in slot 0 a 2D graphics accelerator module? End Replace the 2D module VSIMM. If the problem persists, replace the 2D module. No Replace the option module in slot 0.
LEDs = 0011 1011, 0010 1011, 0001 1011, or 0000 1011 Is a TURBOchannel graphics module installed? No A Yes Is a VT320 terminal available? No Replace the graphics module. Yes Remove all graphics modules. Turn on the system unit power. Connect the VT320 to the system unit. Turn on the VT320 and the system unit. On the VT320 keyboard, type setenv console s. Troubleshoot according to the error messages that appear on the monitor. Turn off the system unit power. Reinstall the graphics module(s).
A Turn off the system unit power. Disconnect the VT320 terminal from the system unit. Turn on the system unit power. Check the diagnostic LEDs. Does the power-up self-test complete successfully? Yes The terminal or cable is faulty. Isolate and replace the bad part. No Is the right LED display 1011? No Is the right LED display 0000? Yes Yes Replace the system module. End No Troubleshoot according to the LED codes.
Start Type cnfg, press Return. Type cnfg 3, press Return. Is a memory module missing from the cnfg or cnfg 3 display? Yes Look for missing memory module. Reseat any memory module(s) that do not appear in the display. No C (Continued) Is a SCSI controller or device missing from the cnfg or cnfg 3 display? No Yes Yes Check that the first device in the bus is properly connected.
(Continued) (Continued) A B Replace the cable from the SCSI controller to the first device. No Is the SCSI controller or device still missing from the display? Yes Replace the SCSI controller. Is a SCSI device missing from the cnfg or cnfg 3 display? Yes No End Make sure the power is on and all cables are connected properly. Change the SCSI ID for the missing device to an unused ID between 0 and 6. Type init slot number. Press Return.
Start Interpret any error messages to determine which memory modules reported an error. No Do all memory modules report an error? Replace those memory modules that report an error. Yes Remove all the memory modules. Insert two good memory modules in slot 0 and at least 8 Mbytes of contiguous memory total. Repeat the memory test. Does the memory test still report an error? Yes Replace the system module. No Install any additional pairs of memory modules one pair at a time.
Start Type cnfg slot number. Press Return. Does the drive appear in the cnfg display? No Check that all cables are connected to the drive and that there is a terminator on the last external drive in the bus. Yes Make sure each drive in the SCSI bus has a unique ID from 0 to 6. (continued) Type init slot number. Press Return. Yes End Does the drive appear in the cnfg display now? No Replace the drive.
(continued) Does the cntl test report an error? Yes Replace the SCSI controller that has the slot number that appears in the error message. No Does the sdiag test report Yes an error? No Troubleshoot according to the sdiag test error messages. See the SCSI drive service guide. End Replace the SCSI controller Yes in the slot number listed in the error message.
Start Does an error message list 3/scc as the test that failed or does the customer complain about a specific serial line device? Yes No End Make sure the hardware and software for the serial line device are set up properly. Run the internal loopback serial line test script. Type sh 3/test-scc-t and press Return.
Turn off the system unit power. Remove the system unit cover. Turn on the system unit power. Do any fans rotate? No If the fan assembly power cord is connected correctly, replace the power supply. Yes No End Do all three fans rotate? Yes Does the power supply still overheat? Yes Replace the power supply. No Replace the power supply fan assembly.
A (Continued) Connect the alternate terminal to the system unit. Type setenv console s. Press Return. B Troubleshoot according to the error messages that appear on the terminal. Replace any FRU that reports an error. Reconnect any SCSI devices that you disconnected. Reset the environment variables. End Replace the TURBOchannel module in the slot number that appears on the left LEDs.
3 PART NUMBERS
Figure 3-1 DECstation 5000 Model 100 Series Major FRUs

Callouts in the figure:
    System unit cover
    Bezel insert
    TURBOchannel option module connector (one of three)
    Power supply and fan assembly
    Removable media drive panel
    System unit chassis
    System module
    CPU module
    Memory module
    Locations for internal hard disk drives
Table 3-1 Part numbers: Basic system components Item Part No. Customer Order No.
Table 3-2 Part numbers: Internal drives Item Part No. Customer Order No.
Table 3-3 Part numbers: TURBOchannel option modules Item Part No. Customer Order No.
Table 3-4 Part numbers: Monitors Part No. Customer Order No.
Table 3-5 Part numbers: Input devices Item Part No. Customer Order No.
Table 3-6 Part numbers: Loopback connectors, plugs, test media, and small hardware Item Part No. Customer Order No.
Table 3-7 Part numbers: Cords, cables, and connectors

Item   Part No.   Customer Order No.

Monitor-to-system-unit power cord (U.S.
Table 3-8 Part numbers: Hardware documentation Part No. Customer Order No.