Administrator's Guide

Troubleshooting Cluster Test
Table 4 Cluster Test Troubleshooting Guide
Columns: Symptom, How to diagnose, Possible solution

Symptom: A test terminates right away.
  How to diagnose: Check the message in the output window or terminal.
  Message: Cannot check out license
  Possible solution:
  - The Platform MPI license has expired. Get a new license and copy it to /opt/hpmpi/licenses.
  - The date and time on the head node are not set correctly. This often happens on fresh-from-the-factory machines. Set the date and time with the date command. See date(1) for more information.
  - License failures can also occur because the dates on the compute nodes are not consistent with the date on the head node. To fix this, select Tools > Sync Node Times.
  Message: ssh: connect to host 192.168.1.X port 22: No route to host.
  Possible solution: The admin network connection to node 192.168.1.X can't be established. Check the Ethernet cable. Restart the network daemon on that node.

Symptom: CrissCross test fails to complete.
  How to diagnose: Check the message in the output window or terminal.
  Message: Mpirun: one or more remote shell commands exited with non-zero status which may indicate a remote access problem.
  How to diagnose: Use the checkic command to find out which nodes have a broken interconnect.
  Possible solution: The interconnect between nodes can't be established. You might have a bad cable or a bad interconnect PCI card (InfiniBand), or the driver is not loaded. Restart the network daemon or openibd on the node having the problem.

Symptom: CrissCross test: a node responds with lower bandwidth than the others.
  How to diagnose: Check the interconnect cable and the link LED on the PCI card.
  Possible solution: Replace the interconnect cable, the interconnect PCI card, or both.
  How to diagnose: Check the firmware of the interconnect PCI card.
  Possible solution: Update the card firmware.
  How to diagnose: Use the diagnostics software that comes with the interconnect switch to diagnose the switch.
  Possible solution: Reseat the line cards on the interconnect switch. Update the switch firmware.

Symptom: Test4 fails to complete.
  How to diagnose: Did the CrissCross test complete successfully?
  Possible solution: If CrissCross did not complete successfully, follow the hints above for troubleshooting the CrissCross test.
  How to diagnose: Does any node shut itself down during the Test4 test?
  Possible solution: Heat-related problem: check whether all fans on the node that shut down are running at the expected speeds. If not, replace the fans on that node.
  How to diagnose: Observe the Performance Monitor to see if any node drops off or has no activity on the interconnect. See “The performance monitor” (page 33).
  Possible solution: You might need to replace bad nodes.

Symptom: Linpack can't start on a node.
  How to diagnose: Check the system date on that node. If the date is far off the current date, Linpack can't start because the hpmpi license might be treated as expired.
  Possible solution: Set the system date to the current date.

Symptom: A node shuts itself down during the Linpack test.
  How to diagnose: Heat related.
  Possible solution: Check the fans on that node. Replace the node.
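
Several entries in the table trace license and Linpack failures to compute-node dates that disagree with the head node. The following is a minimal sketch, not part of Cluster Test, of how such a disagreement could be spotted from the head node before running a test. It assumes passwordless ssh to each node and a date command on the nodes that accepts +%s. The function names, the node names in the usage note, and the 60-second tolerance are hypothetical choices made for illustration.

```python
import subprocess
import time


def node_clock_offset(node, timeout_s=10):
    """Return the node's clock minus the head node's clock, in seconds.

    Assumption: passwordless ssh from the head node, and a `date` on the
    node that supports `+%s` (seconds since the epoch).
    """
    before = time.time()
    result = subprocess.run(
        ["ssh", node, "date", "+%s"],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raises CalledProcessError if ssh or date fails
    )
    after = time.time()
    node_epoch = int(result.stdout.strip())
    # Compare against the midpoint of the round trip to reduce ssh latency bias.
    return node_epoch - (before + after) / 2


def find_skewed_nodes(offsets, tolerance_s=60):
    """Return {node: offset} for nodes whose offset exceeds tolerance_s.

    `offsets` maps node name to clock offset in seconds (node minus head).
    The 60-second default is an arbitrary illustration, not a product limit.
    """
    return {
        node: offset
        for node, offset in offsets.items()
        if abs(offset) > tolerance_s
    }
```

A possible use is to build the offsets with node_clock_offset for each node address on the admin network (for example, a hypothetical list such as ["node1", "node2"]) and pass the result to find_skewed_nodes. Any node it reports is a candidate for Tools > Sync Node Times or for setting the date manually with the date command.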