Troubleshooting guide

Appendix B: Isolating the Cause of a Hardware Problem
Use the following diagnostic procedures if you are unable to install the
Fabric Management System (FMS).
Two of the most commonly reported hardware failures are damaged cables and damaged
port connectors.
As mentioned in Appendix A, a high badcrc count (reported in the host or
switch hardware counters) or a serdesFaultTrap for a connected port (reported in the
switch hardware counters) strongly indicates hardware damage or failure. Our
guarantee is an *average* of fewer than 1 packet-data error (badcrc) per hour on a link
operating at full data rate. A suspected Myrinet hardware failure could be
in a Myrinet NIC, a cable, a port on a Myrinet switch, or within a Myrinet switch.
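The one-error-per-hour guarantee above can be checked with simple arithmetic on two readings of a badcrc counter. The following Python sketch is illustrative only: the function names and the sample readings are hypothetical, it does not parse any Myricom tool output, and it assumes the link was running at full data rate between the two readings.

```python
def badcrc_per_hour(count_start, count_end, elapsed_seconds):
    """Average badcrc errors per hour between two counter readings."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if count_end < count_start:
        raise ValueError("counter went backwards; was it reset between readings?")
    return (count_end - count_start) * 3600.0 / elapsed_seconds


def exceeds_guarantee(count_start, count_end, elapsed_seconds, limit_per_hour=1.0):
    """True when the average rate is not below the stated limit."""
    return badcrc_per_hour(count_start, count_end, elapsed_seconds) >= limit_per_hour


if __name__ == "__main__":
    # Hypothetical readings: 0 errors after a reset, 12 errors four hours later.
    rate = badcrc_per_hour(0, 12, 4 * 3600)
    print(f"{rate:.2f} badcrc/hour, exceeds guarantee: "
          f"{exceeds_guarantee(0, 12, 4 * 3600)}")
```

Because the guarantee is stated as an average, a longer interval between readings gives a more meaningful rate than a short one.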
If the failure is in a NIC, the cable from the NIC to the switch, or a port on a
switch (or any combination of these), it can be diagnosed quite easily using the
mx_pingpong "loopback test" or the gm_allsize "loopback test" as described below.
However, the loopback procedure will not detect a failure that lies within a switch
or on a cable connecting two switches. The FMS diagnostic tool is needed to detect
this type of failure.
Note: If you are using a mixture of Myrinet-2000 and Myrinet-1280 hardware, badcrcs
will be generated if the switch line card and the NICs are set to different speeds. Some
products have a mechanical switch on the circuit board to allow the default data rate to be
switched between SAN-2000 (2.0+2.0 Gb/s) and SAN-1280 (1.28+1.28 Gb/s). Please
refer to the Myrinet FAQ entry “I have Myrinet-2000 NICs and Myrinet-1280 switches
and my NICs and switches aren’t able to talk to each other. What do I do?” for more
details on checking and setting the speed.
The mx_pingpong "loopback test" or gm_allsize "loopback test" limits all
communication to a specific Myrinet NIC, cable, and port on a Myrinet switch.
If you are using MX, the mx_pingpong "loopback test" is performed as follows:
1. Reset the host counters:
cd <install_path>/bin/
su root
./mx_counters -c
2. On each node, run:
mx_counters | grep Bad
su root
© 2007 Myricom, Inc. DRAFT