D–Troubleshooting
IB6054601-00 H D-9
System Administration Troubleshooting
The following sections provide details on locating problems related to system
administration.
Broken Intermediate Link
Sometimes some message traffic passes through the fabric while other traffic
appears to be blocked. When this happens, MPI jobs fail to run.
In large cluster configurations, switches may be attached to other switches to
supply the necessary inter-node connectivity. Problems with these inter-switch (or
intermediate) links are sometimes more difficult to diagnose than failure of the
final link between a switch and a node. The failure of an intermediate link may
allow some traffic to pass through the fabric while other traffic is blocked or
degraded.
If you notice this behavior in a multi-layer fabric, check that all switch cable
connections are correct. Statistics for managed switches are available on a
per-port basis, and may help with debugging. See your switch vendor for more
information.
Two diagnostic tools, ibhosts and ibtracert, may also be helpful. The tool
ibhosts lists all the InfiniBand nodes that the subnet manager recognizes. To
check the InfiniBand path between two nodes, use the ibtracert command.
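The two tools can also be driven from a script. The following Python sketch is
illustrative only: it assumes that ibhosts prints one line per channel adapter
in the form Ca : <GUID> ports <N> "<description>", and that the node
description contains the host name as a separate word. Neither assumption is
confirmed by this guide, so check the output on your own fabric first. The
helper names (parse_ibhosts, missing_hosts, trace_path) are hypothetical.

```python
import re
import subprocess

# Assumed ibhosts line format (verify against your installation):
#   Ca\t: 0x0002c9030000f398 ports 2 "node01 HCA-1"
HOST_LINE = re.compile(r'^Ca\s*:\s*(0x[0-9a-fA-F]+)\s+ports\s+(\d+)\s+"([^"]*)"')


def parse_ibhosts(text):
    """Return a list of (node_guid, port_count, description) tuples."""
    hosts = []
    for line in text.splitlines():
        match = HOST_LINE.match(line.strip())
        if match:
            hosts.append((match.group(1), int(match.group(2)), match.group(3)))
    return hosts


def missing_hosts(text, expected_names):
    """Return the expected names that do not appear as a word in any description."""
    seen_words = set()
    for _guid, _ports, description in parse_ibhosts(text):
        seen_words.update(description.split())
    return [name for name in expected_names if name not in seen_words]


def trace_path(src_lid, dst_lid):
    """Run ibtracert between two LIDs and return its combined output.

    Requires ibtracert on the PATH and sufficient privileges to query the fabric.
    """
    result = subprocess.run(
        ["ibtracert", str(src_lid), str(dst_lid)],
        capture_output=True, text=True, check=False,
    )
    return result.stdout + result.stderr
```

A node that is expected but missing from the ibhosts listing, or an ibtracert
run that stops partway between two nodes, narrows the search to the switches
and cables along that path.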
Performance Issues
The following sections discuss known performance issues.
Unexpected Low Bandwidth or Poor Latency
If MTRR mapping is used for write combining (instead of the PAT mechanism) and
the system has 4GB or more of memory, the MTRR mapping in the BIOS must be set
to Discrete. This setting affects where the PCI, PCIe, and HyperTransport I/O
Base Address Registers (BARs) are mapped. If the system has 4GB or more of
memory and the MTRR mapping is not set to Discrete, bandwidth will be very low
(under 250 MBps) for any traffic that normally runs near full bandwidth over
the QHT7140 and QLE7140 adapters.
Because the QLE7240 and QLE7280 adapters use SendDMA rather than PIO for larger
messages, low peak message bandwidth is no longer a symptom of this problem on
those adapters. Instead, the problem appears as poor latency with small (less
than 8K) messages.
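On Linux, the current MTRR layout can be inspected through /proc/mtrr, and the
installed memory through /proc/meminfo. The Python sketch below is a rough aid,
not a definitive check: it only lists write-combining regions and reports
whether the system reaches the 4GB threshold described above. It cannot tell
whether the BIOS is set to Discrete, and on systems that use PAT it is normal
to find no write-combining MTRR entries. The function names are hypothetical.

```python
import re

# Format of a /proc/mtrr line on Linux, for example:
#   reg01: base=0x0d0000000 ( 3328MB), size=  256MB, count=1: write-combining
MTRR_LINE = re.compile(
    r'^reg(\d+): base=(0x[0-9a-fA-F]+) \(\s*\d+MB\), '
    r'size=\s*(\d+)([KMG])B, count=\d+: (.+)$'
)
UNIT_KB = {"K": 1, "M": 1024, "G": 1024 * 1024}


def parse_mtrr(text):
    """Parse /proc/mtrr text into a list of dicts (reg, base, size_kb, type)."""
    entries = []
    for line in text.splitlines():
        match = MTRR_LINE.match(line.strip())
        if match:
            reg, base, size, unit, kind = match.groups()
            entries.append({
                "reg": int(reg),
                "base": int(base, 16),
                "size_kb": int(size) * UNIT_KB[unit],
                "type": kind.strip(),
            })
    return entries


def write_combining_regions(entries):
    """Return only the entries whose memory type is write-combining."""
    return [entry for entry in entries if entry["type"] == "write-combining"]


def mem_total_kb(meminfo_text):
    """Return the MemTotal value (in kB) from /proc/meminfo text, or None."""
    match = re.search(r'^MemTotal:\s+(\d+)\s+kB', meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def at_or_above_4gb(total_kb):
    """True if memory is at or above the 4GB threshold (taken here as 4 GiB).

    The kernel usually reports slightly less than the installed memory, so a
    system with exactly 4GB installed may fall just under this threshold.
    """
    return total_kb is not None and total_kb >= 4 * 1024 * 1024


def read_text(path):
    """Read a small text file such as /proc/mtrr or /proc/meminfo."""
    with open(path, "r") as handle:
        return handle.read()
```

For example, write_combining_regions(parse_mtrr(read_text("/proc/mtrr"))) lists
the write-combining regions on the running system. If the system has 4GB or
more of memory and shows the symptoms above, the BIOS MTRR setting is worth
reviewing.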