HP XC System Software Administration Guide Version 3.0
The max parameter is the maximum number of processors available to you in the lsf partition.
No warning messages appear when all the specified nodes are performing at their peak efficiency.
Using the System Interconnect Diagnostic Tools
Various tools enable you to diagnose the system interconnect. Some tools are provided by the system
interconnect manufacturer and are discussed in the Installation and Operation Guide (the hardware
documentation) for your system. Be sure to consult the appropriate Web page for these system interconnect
tools:
Myrinet http://www.myrinet.com
Quadrics http://www.quadrics.com
InfiniBand http://www.voltaire.com
Other tools have been written specifically for use with the HP XC system.
To use the diagnostic tools, you must ensure that the system interconnect is properly configured. The IP
addresses must be configured and the /etc/hosts file must be updated with the switch names, for example,
MR0N00 for the Myrinet system interconnect and QR0N00 for the Quadrics system interconnect. These topics
are discussed in the HP XC System Software Installation Guide.
Note
Link errors are common when a node boots or reboots. During boot, the system interconnect driver is initiated,
putting the system interconnect into a full reset. This puts the link into reset and always causes an error on
the switch connected to the system interconnect.
This section describes the following diagnostic tools:
• HP XC Diagnostic Tools for the Myrinet System Interconnect (page 152)
• Using Diagnostic Tools for the Quadrics System Interconnect (page 153)
• Using Diagnostic Tools for the Gigabit Ethernet System Interconnect (page 157)
HP XC Diagnostic Tools for the Myrinet System Interconnect
This section describes tools that were developed specifically for diagnosing the Myrinet system interconnect
(from Myricom, Inc.) on the HP XC system. See your system's hardware installation and operation guide for
information about standard diagnostic tools.
The gm_prodmode_mon Diagnostic Tool
This program monitors the GM2.1 switch, reads current environment parameters, and generates alerts if the
values of the following parameters are outside the operating ranges recommended by the manufacturer:
Bad CRCs The value should be zero (0).
Temperature The temperature should be less than 104°F (40°C).
Voltage The voltage should be within +/- 10 percent of nominal voltage.
Fan speed The fan speed should be above the minimum.
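The range checks listed above can be illustrated with a minimal Python sketch. This is not the gm_prodmode_mon implementation; the SwitchReading class, its field names, and the check_reading function are hypothetical, and the nominal voltage and minimum fan speed must come from the hardware documentation because this guide does not state them.

```python
from dataclasses import dataclass


@dataclass
class SwitchReading:
    """Hypothetical container for one set of switch readings."""
    bad_crcs: int
    temperature_c: float
    voltage: float
    nominal_voltage: float  # assumed to be supplied from the hardware documentation
    fan_rpm: float
    min_fan_rpm: float      # assumed to be supplied from the hardware documentation


def check_reading(r):
    """Return a list of alert strings; an empty list means all values are in range."""
    alerts = []
    # Bad CRCs: the value should be zero.
    if r.bad_crcs != 0:
        alerts.append("bad CRCs: %d (expected 0)" % r.bad_crcs)
    # Temperature: should be less than 40 degrees C (104 degrees F).
    if r.temperature_c >= 40.0:
        alerts.append("temperature: %.1f C (limit 40.0 C)" % r.temperature_c)
    # Voltage: should be within +/- 10 percent of nominal.
    if abs(r.voltage - r.nominal_voltage) > 0.10 * r.nominal_voltage:
        alerts.append("voltage: %.2f V (nominal %.2f V)" % (r.voltage, r.nominal_voltage))
    # Fan speed: should be above the minimum.
    if r.fan_rpm <= r.min_fan_rpm:
        alerts.append("fan speed: %.0f (minimum %.0f)" % (r.fan_rpm, r.min_fan_rpm))
    return alerts
```

A reading whose values are all in range yields an empty list; each parameter that is out of range contributes one alert string.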
The gm_prodmode_mon diagnostic tool searches /etc/hosts for entries whose name matches the regular
expression “MR0[NT][0-9][0-9]”.
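To preview which /etc/hosts entries would match this pattern, a short Python sketch such as the following can help. It is an illustration only, not the tool's own code: the find_switch_entries function is hypothetical, the pattern is written with ASCII hyphens in the character ranges, and matching the whole host name (rather than a substring) is an assumption.

```python
import re

# Pattern from the text above, applied to the whole name (an assumption).
SWITCH_NAME = re.compile(r"MR0[NT][0-9][0-9]")


def find_switch_entries(hosts_text):
    """Return (address, name) pairs for names that match the switch pattern."""
    entries = []
    for line in hosts_text.splitlines():
        # Drop trailing comments and skip blank lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        for name in names:
            if SWITCH_NAME.fullmatch(name):
                entries.append((address, name))
    return entries


if __name__ == "__main__":
    with open("/etc/hosts") as hosts_file:
        for address, name in find_switch_entries(hosts_file.read()):
            print(address, name)
```

Names such as MR0N00 and MR0T01 match; a Quadrics switch name such as QR0N00 does not.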
This command uses the links -dump command to obtain the current values and parses the output. The
gm_prodmode_mon diagnostic tool generates an alert if any errors are found. All alerts are logged in the
/var/log/messages file.
The format of this command is:
gm_prodmode_mon [-help] [-verbose] [-d directory-name]
The output from gm_prodmode_mon is logged to
/var/log/diag/myrinet/gm_prodmode_mon/links.log by default, but you can specify another
directory with the -d option. Progress messages for the diagnostic test are written to stdout.
This command is configured to run once each hour by a crontab file in the /etc/cron.hourly directory.
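If you need to run the tool on demand from a script rather than wait for the hourly run, the command line can be assembled as in the following Python sketch. The build_command helper is hypothetical, and the sketch assumes that the options are spelled -verbose and -d as in the command format shown in this section.

```python
def build_command(directory=None, verbose=False):
    """Assemble an argument list for gm_prodmode_mon (hypothetical helper)."""
    argv = ["gm_prodmode_mon"]
    if verbose:
        # Option spelling assumed from the command format in this section.
        argv.append("-verbose")
    if directory is not None:
        # -d selects a log directory other than the default.
        argv.extend(["-d", directory])
    return argv
```

The resulting list can be passed to subprocess.run() on a node where the tool is installed; for example, build_command("/tmp/gm_logs", verbose=True), where /tmp/gm_logs is only an example directory.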