HP XC System Software Administration Guide Version 3.0
17. Troubleshooting
This chapter provides information to help you troubleshoot problems with your HP XC system. It addresses
the following topics:
• System Interconnect Troubleshooting (page 159)
• SLURM Troubleshooting (page 163)
• LSF-HPC Troubleshooting (page 165)
See also Chapter 16: Using Diagnostic Tools (page 149) for information on the available diagnostic tools,
which can also be used to locate the source of a failure.
System Interconnect Troubleshooting
This section describes the troubleshooting steps for the following supported system interconnects:
• Myrinet System Interconnect Troubleshooting (page 159)
• Quadrics System Interconnect Troubleshooting (page 160)
• InfiniBand System Interconnect Troubleshooting (page 161)
Myrinet System Interconnect Troubleshooting
The following troubleshooting information applies to the Myrinet system interconnect. Perform these steps on
any node on which you suspect a problem to determine if your HP XC system is configured properly. If these
tests pass but you are still experiencing difficulty, see Chapter 16: Using Diagnostic Tools (page 149).
1. Run the gm_board_info test:
# /opt/gm/bin/gm_board_info
This command should complete without errors and display all the nodes in the HP XC system.
2. Make sure that you are running an HP XC kernel. The HP XC kernels are identified by the presence of
XC in the kernel name:
# uname -a
Linux n16 2.4.21-15.7hp.XCsmp #1 SMP date ... GNU/Linux
3. Make sure that your system has Myrinet boards installed:
# lspci -v | grep Myrinet
05:0d.0 Network controller: MYRICOM Inc. Myrinet 2000 . . .
Subsystem: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect
4. Run the gm_debug test:
# /opt/gm/bin/gm_debug
This command should complete without errors, and every counter whose name contains the string bad
should be zero.
5. Make sure all the Myrinet RPMs are installed:
# rpm -q -a
.
.
.
gm-2.1.7_Linux-2.1hptc
m3-dist-1.0.14-1
mute-1.9.6-1
.
.
.
The version numbers for your HP XC system may differ from these.
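Steps 2 through 5 above can be collected into a few small shell helpers. The following is a minimal sketch, not part of the HP XC software: the function names are hypothetical, and it assumes only a POSIX shell plus the commands already named in the steps. Because the exact output format of gm_debug is not assumed, the step 4 helper only selects lines containing the string bad for manual inspection.

```shell
#!/bin/sh
# Illustrative helpers for steps 2 through 5. The function names are
# hypothetical; they are not shipped with HP XC.

# Step 2: succeed if a kernel release string (for example, the output
# of `uname -r`) contains "XC".
is_xc_kernel() {
    case "$1" in
        *XC*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Step 3: read `lspci -v` output on stdin and succeed if any line
# mentions Myrinet.
has_myrinet_board() {
    grep -q 'Myrinet'
}

# Step 4: read gm_debug output on stdin and print the lines containing
# "bad" so that the counters can be inspected by hand. Counter values
# are not parsed because the output format is not assumed here.
bad_counter_lines() {
    grep 'bad' || true
}

# Step 5: read `rpm -q -a` output on stdin and print the packages whose
# names start with the prefixes shown in step 5 (gm, m3-dist, mute).
myrinet_rpms() {
    grep -E '^(gm|m3-dist|mute)-' || true
}

# Example use on a node (commented out so that sourcing this file has
# no side effects):
#   is_xc_kernel "$(uname -r)" || echo "Not an HP XC kernel"
#   lspci -v | has_myrinet_board || echo "No Myrinet board found"
#   /opt/gm/bin/gm_debug | bad_counter_lines
#   rpm -q -a | myrinet_rpms
```

Each helper reads text rather than running the command itself, so it can be tried against saved output from another node before being used on a live system.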