Troubleshooting guide

Did the firmware (MX or GM) load properly on all nodes in the cluster?
Sections V, VI, and VII address software installation and troubleshooting
issues.
Were there any error messages in the system log (dmesg or /var/log/messages)
output on any of the nodes while or after loading the firmware?
Were there software run-time error messages while running the application? A
number of these run-time messages are explained in the Myrinet FAQ
(http://www.myri.com/scs/FAQ/).
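The system log check in the questions above can be partly automated. The following is a minimal sketch, not a Myricom tool: it keeps only the log lines that mention the driver and also look like errors. Both patterns are assumptions chosen for illustration; the actual message text printed by the MX or GM driver may differ, so adjust them to what appears on your nodes.

```python
import re

# Assumed, illustrative patterns. They are NOT taken from the MX or GM
# documentation; adjust them to the messages your driver actually prints.
DRIVER_PATTERN = re.compile(r"\b(mx|gm|myri)", re.IGNORECASE)
ERROR_PATTERN = re.compile(r"error|fail|timeout", re.IGNORECASE)


def find_suspect_lines(log_lines):
    """Return log lines that mention the driver and also look like errors."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if DRIVER_PATTERN.search(line) and ERROR_PATTERN.search(line)
    ]
```

The function accepts any iterable of strings, so it can be given an open /var/log/messages file object or the captured output of dmesg split into lines. An empty result does not prove the node is healthy; it only means that nothing matched the assumed patterns.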
Further Details
Host computer hardware or software problems will most likely show up as a
failure during the Myrinet hardware or software installation phase (Section III
and Section VIII Testing/Validation). They may also appear as an unexplained
performance degradation or as inconsistent performance on the nodes. Refer to
the subsection entitled “3. Run mx_dmabench or gm_debug to test the PCI
bandwidth” (page 30) in Section VIII Testing/Validation for further details.
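One way to spot the performance inconsistency described above is to compare each node's measured PCI bandwidth against the cluster median. The sketch below assumes you have already run the bandwidth test on every node and recorded one number per node by hand; it does not parse the output of mx_dmabench or gm_debug. The function name and the 10% tolerance are hypothetical choices for illustration, not values from this guide.

```python
from statistics import median


def flag_slow_nodes(bandwidth_by_node, tolerance=0.10):
    """Return nodes whose bandwidth is more than `tolerance` below the median.

    bandwidth_by_node maps a node name to one measured bandwidth value
    (any unit, as long as every node uses the same one). The default
    tolerance of 10% is an arbitrary assumption.
    """
    if not bandwidth_by_node:
        return []
    typical = median(bandwidth_by_node.values())
    cutoff = typical * (1.0 - tolerance)
    return sorted(node for node, bw in bandwidth_by_node.items() if bw < cutoff)
```

The median is used rather than the mean so that a single very slow node does not pull the reference value down. A flagged node is only a candidate for closer inspection, since identical hosts can still differ slightly from run to run.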
If there are any faulty Myrinet hardware components, these components are most easily
isolated with the Fabric Management System (FMS) as described in Section VIII
Testing/Validation. If you are unable to install FMS, you can use the troubleshooting
procedures outlined in Appendix A and Appendix B.
There are two sources of hardware counters available for Myrinet:
host counters, reported by the MX test program mx_counters or the GM test
program gm_counters; and
switch counters and traps, reported by the web interface to the Myrinet switch(es).
These hardware counters reveal important information about the health of the Myrinet
hardware and the interactions of the hardware and the software. A detailed explanation
of each of these hardware counters can be found in the Myrinet FAQ
(http://www.myri.com/scs/FAQ/), and in the M3-CLOS-ENCL/M3-SPINE-ENCL switch
tutorial (http://www.myri.com/scs/14U_switches/). If you are using the
M3-CLOS-ENCL/M3-SPINE-ENCL switches, you can use the Log feature of the web
interface (http://www.myri.com/scs/14U_switches/index-overview-web.html#log) to
monitor switch traps in real time. If you are using the M3-E* switches, Mute
(http://www.myri.com/scs/mute/) can be used to monitor the switch traps in real
time. Note that Mute has been replaced by the Fabric Management System (FMS).
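Counters are often easiest to interpret as a change between two readings taken some time apart, for example before and after an application run. The sketch below is one possible way to do this, assuming you have transcribed two readings into dictionaries that map a counter name to an integer value. It does not parse the output of mx_counters, gm_counters, or the switch web interface, and any counter names you pass in are your own; consult the Myrinet FAQ for what each real counter means.

```python
def changed_counters(before, after):
    """Return {name: delta} for counters whose value differs between readings.

    `before` and `after` map counter names to integer values. A counter that
    is missing from `before` is treated as having started at zero.
    """
    deltas = {}
    for name, new_value in after.items():
        delta = new_value - before.get(name, 0)
        if delta != 0:
            deltas[name] = delta
    return deltas
```

Counters that did not move are left out of the result, which keeps the output short on a healthy node. Whether a given increase indicates a fault depends on the counter, so the result is a starting point for the FAQ lookup rather than a diagnosis.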
© 2007 Myricom, Inc. DRAFT