Troubleshooting guide
mpicc to compile mx/unit_test/src/mpi/mpi_stress.c. The executable mpi_stress can
then be run like any other MPI program using mpirun.ch_mx or mpirun.ch_gm.
If the GM firmware is installed on the cluster, the GM-specific stress program,
gm_stress.c, can also be used to stress the network. Full details of how to run gm_stress
can be found in the FAQ entry (http://www.myri.com/cgi-bin/fom?file=53).
8. Run fm_show_alerts for diagnostic information on any damaged/failing hardware
component.
Are there any “un-ACKed alerts” listed in the output of fm_status?
If yes, run fm_show_alerts to print a list of all active alerts, which signal possible
hardware error conditions.
Alerts are created when certain exceptional events occur and are reported to the fms.
Alerts persist within the fms until they are cleared. Clearing an alert usually requires
both that the alert be acknowledged (ACKed) and that the condition that caused it has
cleared. Once the alert has been acknowledged, it is marked as "ACKed". Once the condition
that caused the alert has cleared, it is marked as a "relic". Most alerts are deleted only
after they are both ACKed and relics.
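The lifecycle described above can be modeled in a few lines. The following Python sketch
is illustrative only: it is not FMS code, and the names Alert, ack, mark_relic, and
can_delete are hypothetical. It covers only the common case, in which an alert is deleted
after it is both ACKed and a relic.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical model of an fms alert; not part of the FMS distribution."""
    index: int            # unique index of the alert
    acked: bool = False   # set once the alert has been acknowledged
    relic: bool = False   # set once the condition that caused it has cleared

    def ack(self) -> None:
        self.acked = True

    def mark_relic(self) -> None:
        self.relic = True

    def can_delete(self) -> bool:
        # Common case only: both flags must be set before deletion.
        return self.acked and self.relic


alert = Alert(index=7)
alert.ack()
print(alert.can_delete())   # False: ACKed, but the condition has not cleared
alert.mark_relic()
print(alert.can_delete())   # True: both ACKed and a relic
```

The two flags are independent, so the order in which an alert is ACKed and becomes a
relic does not matter in this model.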
By default, fm_show_alerts prints only alerts that have not been ACKed and are not
relics. Each alert has a unique index, which can be passed to fm_ack_alert to
acknowledge the alert.
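As a rough illustration of the default view and the acknowledge step, the Python sketch
below filters a list of alerts the way the text describes and builds an fm_ack_alert
command line for each remaining index. The AlertRow type and both helper functions are
hypothetical, and passing the index as a single positional argument is an assumption;
check the usage of fm_ack_alert in your FMS installation before relying on it.

```python
from typing import List, NamedTuple


class AlertRow(NamedTuple):
    """Hypothetical record for one alert; not an FMS data structure."""
    index: int
    acked: bool
    relic: bool


def default_view(alerts: List[AlertRow]) -> List[AlertRow]:
    """Keep only alerts that are neither ACKed nor relics."""
    return [a for a in alerts if not a.acked and not a.relic]


def ack_command(index: int) -> List[str]:
    # Assumption: the alert index is given as one positional argument.
    return ["fm_ack_alert", str(index)]


alerts = [
    AlertRow(index=1, acked=False, relic=False),
    AlertRow(index=2, acked=True, relic=False),
    AlertRow(index=3, acked=False, relic=True),
]
for row in default_view(alerts):
    print(" ".join(ack_command(row.index)))
```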
Refer to http://www.myri.com/scs/fms/#alerts, as well as the file libfma/alert.def in the
FMS distribution, for a detailed listing of all possible alerts.
Example output of fm_show_alerts can also be found on the FMS webpage,
http://www.myri.com/scs/fms/#examples.
© 2007 Myricom, Inc. DRAFT