HP XC System Software Administration Guide Version 2.1

16
Troubleshooting
This chapter provides information that helps you troubleshoot problems with your HP XC
system:
A discussion on troubleshooting the system interconnect (Section 16.1)
A list of issues regarding SLURM (Section 16.2)
A discussion on LSF-HPC issues (Section 16.3)
See also Chapter 15 for information on
available diagnostic tools, which can also be used
to locate the source of a failure.
16.1 System Interconnect Troubleshooting
This section describes the troubleshooting steps for these supported system interconnects:
Myrinet (Section 16.1.1)
Quadrics (Section 16.1.2)
InfiniBand (Section 16.1.3)
16.1.1 Myrinet System Interconnect Troubleshooting
The following troubleshooting information applies to the Myrinet system interconnect. Perform
these steps on any node on which you suspect a problem to determine if your HP XC system is
configured properly. If these tests pass but you are still experiencing difficulty, see Chapter 15.
1. Run the gm_board_info test:
# /opt/gm/bin/gm_board_info
This command should complete without errors and display all the nodes in the HP XC
system.
2. Make sure that you are running an HP XC kernel. The HP XC kernels are identified
by the presence of XC in the kernel name.
# uname -a
Linux n16 2.4.21-15.7hp.XCsmp #1 SMP date ... GNU/Linux
3. Make sure that your system has Myrinet boards installed.
# lspci -v | grep Myrinet
05:0d.0 Network controller: MYRICOM Inc. Myrinet 2000...
Subsystem: MYRICOM Inc. Myrinet 2000 Scalable Cluster Interconnect
4. Run the gm_debug test.
# /opt/gm/bin/gm_debug
This command should complete without errors; no counter whose name contains the
string bad should have a nonzero value.
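
The checks in steps 2 through 4 can be partly automated. The following is a minimal sketch, not an HP-supplied tool. The function names (is_xc_kernel, has_myrinet, nonzero_bad_counters) are hypothetical, and the counter filter assumes that gm_debug prints each counter on its own line with an integer value as the last whitespace-separated field. That output format is an assumption, so adjust the filter if the output on your system differs.

```shell
#!/bin/sh
# Sketch only: helper functions for steps 2-4 above.
# All function names are hypothetical; nothing here is part of HP XC or GM.
# Each function works on an argument or on standard input, so it can be
# tried with sample text on a machine that has no Myrinet hardware.

# Step 2: succeed if the kernel release string contains "XC".
is_xc_kernel() {
    case "$1" in
        *XC*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Step 3: succeed if lspci output on standard input mentions Myrinet.
has_myrinet() {
    grep -q 'Myrinet'
}

# Step 4: print lines that contain "bad" and end in a nonzero integer.
# ASSUMPTION: one counter per line, integer value in the last field.
nonzero_bad_counters() {
    awk '/bad/ && $NF ~ /^[0-9]+$/ && $NF + 0 != 0 { print }'
}
```

On a node, these functions might be used as is_xc_kernel "$(uname -r)", lspci -v | has_myrinet, and /opt/gm/bin/gm_debug | nonzero_bad_counters. If the last command prints nothing under the stated format assumption, no counter whose name contains bad is nonzero. Review the full gm_debug output by hand if the result is unexpected.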