Troubleshooting guide

Appendix A: Determining if a Problem is Hardware- or Software-Related
Diagnosing a problem as hardware- or software-related can be difficult. The first goal is
to isolate where the problem resides:
Host computer hardware (e.g., a bad PCI slot, a defective or inadequate riser card, or a buggy BIOS)
Host computer software (e.g., an improperly configured OS)
Myrinet hardware (NIC, switch, or cable)
Myrinet software (GM driver, GM mapper, MPICH-GM, etc.)
Some of the key questions in isolating the cause of the problem are:
Did the procedures outlined in Section VIII Testing/Validation (page 27) yield
any errors?
If you installed FMS, did you see any alerts listed in the output of fm_status and
fm_show_alerts?
If you are unable to install FMS, do you see a high number of bad CRCs (packet
data errors) reported in the host or switch counters? If you suspect a Myrinet
hardware problem, you need to examine these hardware counters. Of all the
host counters, only the bad CRC count can indicate a potential hardware failure.
A small number of bad CRCs is harmless. As the number of bad CRCs increases,
however, they can degrade performance, cut off connectivity to a specific host,
and interfere with the mapper's ability to map the network.
Do you see a high Bad CRC8 count in the output of mx_counters, or a high
badcrc_cnt count in the output of gm_counters, on any of the nodes?
cd <install_path>/bin/
./mx_counters | grep "Bad CRC8"
cd <install_path>/bin/
./gm_counters | grep badcrc__invalid
If the value of badcrc__invalid is non-zero, it should be very small compared
to the value of netrecv_cnt (the total number of packets received).
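The comparison above can be sketched as a small shell helper that reports the fraction of received packets with a bad CRC. This is only an illustration: crc_ratio is a hypothetical function, not part of GM, and it assumes that each counter is printed as a name followed by an integer value on one line. The actual gm_counters layout may differ, so adjust the field handling to match what your installation prints.

```shell
#!/bin/sh
# crc_ratio (hypothetical helper, not a GM tool): reads counter lines on
# stdin and prints bad-CRC packets as a fraction of packets received.
# ASSUMPTION: each line looks like "<counter_name> <integer>"; the real
# gm_counters output format may differ.
crc_ratio() {
    awk '
        $1 ~ /^badcrc__invalid/ { bad += $NF }   # sum every matching bad-CRC counter
        $1 == "netrecv_cnt"     { recv = $NF }   # total number of packets received
        END {
            if (recv > 0) printf "%.6f\n", bad / recv
            else          print "n/a"            # no packets received, ratio undefined
        }
    '
}

# Made-up sample values for illustration only (not real counter output):
printf 'netrecv_cnt 1000000\nbadcrc__invalid 25\n' | crc_ratio
```

On a node, the same function could be fed from the real tool, for example ./gm_counters | crc_ratio, provided the output matches the assumed layout. A result that is not very small relative to 1 suggests following the isolation steps referenced below.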
For further details, refer to "How do I isolate the cause of a high Bad CRC8
count in mx_counters?" (http://www.myri.com/cgi-bin/fom?file=423) and
"How do I isolate the cause of a high badcrc_cnt count in gm_counters?"
(http://www.myri.com/cgi-bin/fom?file=58).
© 2007 Myricom, Inc. DRAFT