HP XC System Software Administration Guide Version 4.0

Service: Environment
Status Information: Node sensor status
A warning or critical message indicates that one or more monitored sensors reported that a threshold
has been exceeded.
Correct the condition.
Service: Load Average
Status Information: Node Load Ave: x/y/z QueLen: n
A warning or critical message indicates that load average thresholds for the specific node have been
exceeded.
Thresholds can be set on a per-node, per-class, or per-system basis in the nagios_vars.ini file.
These values are specific to the site and depend on site load.
If the load average thresholds are reasonable, monitor for excessive activity on the node.
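To check for excessive activity by hand, the load averages can be read directly on the node. The following is a minimal sketch only: the function name load_exceeds and the limit of 8 are illustrative assumptions, not part of Nagios or of the nagios_vars.ini file. It relies on the standard Linux /proc/loadavg file, whose first field is the 1-minute load average.

```shell
# Hypothetical helper: succeeds (exit status 0) when the 1-minute load
# average in a /proc/loadavg-style line is greater than the given limit.
#   $1 = loadavg line, for example "0.42 0.35 0.30 1/123 4567"
#   $2 = limit to compare against (site-specific; chosen by the administrator)
load_exceeds() {
    printf '%s\n' "$1" | awk -v limit="$2" '{ exit !($1 + 0 > limit + 0) }'
}

# Example use on a node (the limit 8 is an arbitrary illustration):
# if load_exceeds "$(cat /proc/loadavg)" 8; then
#     echo "1-minute load average is above 8"
# fi
```

A sustained high value, rather than a single reading, is usually what indicates a problem, so the check is worth repeating over several minutes.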
Service: Nagios Monitor
Status Information: Nagios status information
Typically, the status of Nagios, the number of Nagios services located, and the last time the Nagios
status log was updated are reported.
A warning or critical message indicates that one or more of the Nagios monitor processes either failed
or reported error conditions that could degrade monitoring.
Ensure that the node can communicate with the head node.
Service: Nodeinfo
Status Information: Node process status total/user/zombie, uptime
This entry displays the total number of processes, the number of user processes, and the number of
zombie processes, as well as the uptime for the Nagios host.
A warning or critical message indicates that thresholds for the specific node were exceeded.
Thresholds can be set on a per-node, per-class, or per-system basis in the nagios_vars.ini file.
These values are specific to the site and depend on site load.
If thresholds are reasonable, monitor for excessive activity on the node.
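The process counts can be approximated by hand with the ps command. The following is a sketch under stated assumptions: the function name count_states is hypothetical, and it reports only total and zombie counts because this section does not say how Nagios classifies user processes. It relies on the standard ps behavior that a zombie process has a state beginning with Z.

```shell
# Hypothetical helper: reads one process state per line on standard input
# (the format produced by "ps -e -o stat=") and prints "total/zombie".
count_states() {
    awk '{ total++ } $1 ~ /^Z/ { zombie++ } END { printf "%d/%d\n", total, zombie }'
}

# Example use on a node:
# ps -e -o stat= | count_states
```

The uptime command reports the node uptime shown in the same status entry.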
Service: Ping Interconnect
Status Information: Node interconnect status
Nagios performs a ping command on the interconnect at regular intervals. Typically, this entry
provides the status information output from that command and the Interconnect's IP address.
A warning or critical message indicates that the specified node or system interconnect failed to respond
to the ping command in the allotted time.
Determine if the node is powered on, enabled, and responsive.
Determine if the interconnect is functional by running the ping command with the Nagios host name
for the interconnect; for example, if the Nagios host name is necs1-1, enter the following command:
# ping necs1-1
If the interconnect responds but the time it takes to respond to a ping command is excessive, the problem
may be related to the system load. If there is no response to the ping command, determine whether the
interconnect is configured improperly or is failing by running the corresponding system
interconnect diagnostic tools; see “Using the System Interconnect Diagnostic Tools” (page 240) for
more information.
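To judge whether the response time is excessive, the average round-trip time can be pulled from the ping summary. This is a sketch only: the function name ping_avg_ms is hypothetical, and it assumes the summary line format printed by the Linux iputils ping command ("rtt min/avg/max/mdev = ..."); other ping implementations format this line differently. What counts as excessive depends on the site and the interconnect type.

```shell
# Hypothetical helper: reads ping output on standard input and prints the
# average round-trip time in milliseconds, taken from the iputils summary
# line, for example:
#   rtt min/avg/max/mdev = 0.045/0.051/0.060/0.006 ms
# Splitting that line on "/" places the average in the fifth field.
ping_avg_ms() {
    awk -F'/' '/^rtt / { print $5 }'
}

# Example use, sending five packets to the host name used in the text:
# ping -c 5 necs1-1 | ping_avg_ms
```

If the function prints nothing, no summary line was found, which is consistent with the interconnect not responding at all.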
Service: Resource Monitor
Status Information: Resource monitor activity status
Typically, this entry reports the output of the SLURM squeue command.
A warning or critical message indicates that the SLURM squeue command reported errors.
See the output of the squeue command for more details. SLURM on HP XC systems is described in
Chapter 15 (page 169).
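When reviewing the squeue output, it can help to separate error messages from the normal job listing. The following sketch assumes only that squeue, like most commands, writes errors to standard error and returns a nonzero exit status on failure; the function name report_status is hypothetical.

```shell
# Hypothetical helper: runs the given command, discards its normal output,
# and reports whether it succeeded. On failure, the text the command wrote
# to standard error is included in the report.
report_status() {
    if err=$("$@" 2>&1 >/dev/null); then
        echo "OK: $1"
    else
        echo "FAILED: $1: $err"
        return 1
    fi
}

# Example use:
# report_status squeue
```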