Troubleshooting guide
Appendix A: Determining if a Problem is Hardware or Software Related
Diagnosing a problem as hardware- or software-related can be difficult. The first goal is
to isolate where the problem resides:
• Host computer hardware (e.g., a bad PCI slot, a defective or inadequate riser
card, or a buggy BIOS)
• Host computer software (e.g., OS not configured properly)
• Myrinet hardware (NIC, switch, or cable)
• Myrinet software (GM driver, GM mapper, MPICH-GM, etc.)
Some of the key questions in isolating the cause of the problem are:
• Did the procedures outlined in Section VIII Testing/Validation (page 27) yield
any errors?
• If you installed FMS, did you see any alerts listed in the output of fm_status and
fm_show_alerts?
• If you are unable to install FMS, do you see a high number of bad CRCs
(packet-data errors) reported in the host or switch counters? If you suspect a
Myrinet hardware problem, you need to examine these hardware counters. Of all
the host counters, only bad CRCs can indicate a potential hardware failure. A
small number of bad CRCs is harmless. As the number of bad CRCs increases, they
can cause performance degradation, loss of connectivity to a specific host, and
interference with the mapper's ability to map the network.
• Do you see a high number of Bad CRC8 in the output of mx_counters or a
high number of badcrc_cnt in the output of gm_counters on any of the
nodes?
cd <install_path>/bin/
./mx_counters | grep "Bad CRC8"
cd <install_path>/bin/
./gm_counters | grep badcrc__invalid
If the value of badcrc__invalid is non-zero, it should be very small compared
to the value of netrecv_cnt (the total number of packets received).
For further details, refer to "How do I isolate the cause of a high Bad CRC8
count in mx_counters?" (http://www.myri.com/cgi-bin/fom?file=423) and
"How do I isolate the cause of a high badcrc_cnt count in gm_counters?"
(http://www.myri.com/cgi-bin/fom?file=58).
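The comparison of badcrc__invalid against netrecv_cnt described above can be sketched as a small shell helper. This is only an illustration: the function name crc_ratio is hypothetical, and it assumes each counter appears on its own line with the numeric value as the last whitespace-separated field, which may not match the actual gm_counters output on your system. Adjust the patterns to what you see.

```shell
# crc_ratio: hypothetical helper, not part of the GM distribution.
# Reads counter output on stdin and prints:
#   <bad CRC total> <packets received> <ratio>
# ASSUMPTION: one counter per line, numeric value in the last field.
# Usage (from <install_path>/bin): ./gm_counters | crc_ratio
crc_ratio() {
  awk '
    # Sum every counter whose name contains badcrc__invalid.
    /badcrc__invalid/ { bad += $NF }
    # Total number of packets received.
    /netrecv_cnt/     { recv = $NF }
    END {
      if (recv > 0)
        printf "%d %d %.6f\n", bad, recv, bad / recv
      else
        print "no netrecv_cnt value found"
    }'
}
```

A ratio that is very small is consistent with the harmless case described above. The guide does not give a numeric threshold, so judging what counts as "high" is left to the FAQ entries referenced above.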
© 2007 Myricom, Inc. DRAFT