Troubleshooting guide
If you are using M3-E* switches, two other hardware counters useful for diagnosing
hardware failures are the switch counters serdesFaultTrap and missedBeatTrap.
Note that these two traps can be harmless: they may merely indicate that a port on a
switch line card is unconnected. However, if the port generating these traps is
connected by a cable, the traps indicate a port failure. The symptom is a loss of
connectivity to a specific host, usually accompanied by the green LED associated with
that port failing to light.
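The rule above (a trap on an unconnected port is harmless, a trap on a cabled port
indicates a port failure) can be sketched as a small triage helper. This is only an
illustration: the PortReading record, its fields, and the classify function are
hypothetical and are not part of FMS, GM, or MX. It assumes you have already
transcribed the counter values and cabling state for each port by hand.

```python
from dataclasses import dataclass, field

# Counter names come from the guide; everything else below is a hypothetical sketch.
TRAP_COUNTERS = ("serdesFaultTrap", "missedBeatTrap")


@dataclass
class PortReading:
    port: int                 # port number on the switch line card
    cabled: bool              # True if a cable is plugged into this port
    counters: dict = field(default_factory=dict)  # counter name -> observed count


def classify(reading: PortReading) -> str:
    """Return 'ok', 'harmless', or 'suspect' for a single port."""
    trapped = any(reading.counters.get(name, 0) > 0 for name in TRAP_COUNTERS)
    if not trapped:
        return "ok"
    # A trap on an unconnected port is expected; on a cabled port it suggests a failure.
    return "suspect" if reading.cabled else "harmless"


if __name__ == "__main__":
    readings = [
        PortReading(port=3, cabled=True, counters={"serdesFaultTrap": 2}),
        PortReading(port=4, cabled=False, counters={"missedBeatTrap": 1}),
        PortReading(port=5, cabled=True),
    ]
    for r in readings:
        print(f"port {r.port}: {classify(r)}")
```

A port reported as "suspect" is a candidate for the checks described in the text: verify
connectivity to the attached host and look at the green LED for that port.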
Have you run mpi_stress and/or gm_stress on the cluster?
The recommended Myrinet-2000 diagnostic tool is the Fabric Management System
(FMS) (http://www.myri.com/scs/fms/). FMS works with either GM or MX on
Myrinet-2000 M3-E* or M3-CLOS-ENCL/M3-SPINE-ENCL switches.
If you are not able to install FMS on your cluster, then you need to follow the diagnostic
procedures described in Appendix B to isolate the malfunctioning hardware component.
If you suspect a Myrinet software problem, please check the Myrinet Software and
Customer Support webpage (http://www.myri.com/scs/) to see if there is a newer release, or
check the Myrinet FAQ (http://www.myri.com/scs/FAQ/) for any reports of known problems.
© 2007 Myricom, Inc. DRAFT