Troubleshooting guide

If the badcrc_cnt (reported in gm_counters) increased significantly on any of the hosts
after the test, you have identified a possible hardware trouble spot in your cluster. You
must now determine whether the bad CRCs originate from the Myrinet NIC, the cable, or
the port on the Myrinet switch.
B.1. How do I determine if a cable has failed?
In most cases, an increasing Bad CRC8 or badcrc__invalid (or badcrc_cnt) count is
caused by a damaged cable. If you have spare cables, first replace the suspect cable and
then rerun the above mx_pingpong "loopback test" or gm_allsize "loopback test" to see
whether the value of Bad CRC8 or badcrc__invalid (or badcrc_cnt) continues to
increase. If replacing the cable does not eliminate the bad CRCs, then the cable is not the
cause of the hardware failure, and you must now determine whether the failure is due to
the Myrinet NIC or the port on the Myrinet switch to which it is connected.
If the Bad CRC8 or badcrc__invalid (or badcrc_cnt) does not increase after replacing
the cable, then the original cable is the damaged hardware component.
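The before/after comparison described above can be sketched in a few lines of Python. This is only an illustration: the function name increased_counters, the threshold parameter, and the sample values are hypothetical, and the counter values are assumed to have been copied by hand from the gm_counters (or MX counter) output before and after the loopback test.

```python
def increased_counters(before, after, names=("badcrc_cnt",), threshold=0):
    """Return {name: delta} for each watched counter that grew by more than threshold."""
    deltas = {}
    for name in names:
        # A counter missing from a snapshot is treated as zero.
        delta = after.get(name, 0) - before.get(name, 0)
        if delta > threshold:
            deltas[name] = delta
    return deltas


# Hypothetical readings taken before and after a test with the suspect cable.
with_old_cable = increased_counters({"badcrc_cnt": 12}, {"badcrc_cnt": 957})
print(with_old_cable)  # {'badcrc_cnt': 945}

# Hypothetical readings taken before and after a test with the replacement cable.
with_new_cable = increased_counters({"badcrc_cnt": 957}, {"badcrc_cnt": 957})
print(with_new_cable)  # {} means the count stopped increasing
```

An empty result after the cable swap corresponds to the case where the original cable was the damaged component.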
Contact help@myri.com to return the cable for repair/replacement, and you will be
assigned a "Return Material Authorization" (RMA) number. The information required
for an RMA is outlined in the Myrinet FAQ (http://www.myri.com/scs/FAQ/).
B.2. How do I determine if a port on a switch line card has failed?
To determine if a port on a Myrinet switch has failed, do the following:
With a known good cable, try connecting the NIC port to a different port on the switch
line card, and rerun the mx_pingpong "loopback test" or gm_allsize "loopback test".
If the badcrc count no longer increases, then the old switch port is the cause of the
hardware failure. Please note that if a cable is moved from one switch port to another
switch port (or from one NIC to another NIC), the topology of the network has changed.
Each MX/GM process has a relative address to each other process (something like “go to
the first switch, jump 3 ports, go to the next switch, jump -2 ports”), and if the cabling of
the network has changed, then the mapper must be re-run so that these relative addresses
can be updated.
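To see why recabling invalidates these relative addresses, consider the following minimal Python sketch. It is a toy model, not the MX/GM mapper: the links table, the switch and host names, and the follow_route function are all hypothetical, and each hop is assumed to be an offset added to the port on which the packet entered the switch.

```python
def follow_route(links, start_switch, entry_port, hops):
    """Follow relative port offsets through the switches.

    links maps (switch, exit_port) -> (next_node, entry_port_on_next_node).
    Returns the node reached, or None if a hop leads to an uncabled port.
    """
    node, port = start_switch, entry_port
    for delta in hops:
        exit_port = port + delta
        nxt = links.get((node, exit_port))
        if nxt is None:
            return None  # nothing is cabled to that port
        node, port = nxt
    return node


# Original cabling: s1 port 3 goes to s2 port 5, and hostB hangs off s2 port 3.
links = {("s1", 3): ("s2", 5), ("s2", 3): ("hostB", 0)}
route = [3, -2]  # "jump 3 ports, then jump -2 ports"
print(follow_route(links, "s1", 0, route))  # hostB

# The hostB cable is moved from s2 port 3 to s2 port 4.
recabled = {("s1", 3): ("s2", 5), ("s2", 4): ("hostB", 0)}
print(follow_route(recabled, "s1", 0, route))    # None: the old route is stale
print(follow_route(recabled, "s1", 0, [3, -1]))  # hostB: the remapped route
```

In this model the stale route no longer reaches hostB after the move, which is the situation the mapper corrects when it is re-run.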
If you’re using MX or GM-2, this change in topology will be automatically detected by
the MX/GM-2 mapper. However, if you’re using GM-1, the GM-1 mapper must be
re-run before any communication over the Myrinet network can occur.
If the port on a switch line card is identified as the point of failure, contact
help@myri.com to return this switch line card for repair/replacement. You will be
assigned a "Return Material Authorization" (RMA) number. The information required for
an RMA is outlined in the Myrinet FAQ (http://www.myri.com/scs/FAQ/).
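The isolation procedure in B.1 and B.2 can be summarized as a small decision function. This is a sketch under stated assumptions: the function name suspect_component and its parameters are hypothetical, and the final "NIC" result is reached by elimination (the cable and the switch port have both been ruled out), which this guide implies but does not state in this section.

```python
def suspect_component(increases_after_cable_swap, increases_after_port_move=None):
    """Map the loopback test outcomes to the most likely failed component.

    increases_after_cable_swap: did the bad CRC count keep increasing
        after the cable was replaced with a known good one?
    increases_after_port_move: did it keep increasing after the NIC was
        moved to a different switch port? None means not tested yet.
    """
    if not increases_after_cable_swap:
        return "cable"
    if increases_after_port_move is None:
        return "undetermined: move the cable to a different switch port and retest"
    if not increases_after_port_move:
        return "switch port"
    # Assumption: with cable and port ruled out, the NIC is what remains.
    return "NIC"


print(suspect_component(False))        # cable
print(suspect_component(True, False))  # switch port
print(suspect_component(True, True))   # NIC
```

Remember that moving a cable between switch ports changes the topology, so with GM-1 the mapper must be re-run before the second test result is meaningful.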
© 2007 Myricom, Inc. DRAFT