HP XC System Software Administration Guide Version 3.1

Service: Configuration Monitor
Status Information: Node information
This message reports the total number of nodes, the number of nodes enabled, the number of nodes disabled, and
the number of nodes imaged.
No action is required.
Service: Environment
Status Information: Node sensor status
A warning or critical message indicates that one or more monitored sensors reported that a threshold has been
exceeded.
Correct the condition.
Service: Load Average
Status Information: Node Load Ave: x/y/z QueLen: n
A warning or critical message indicates that load average thresholds for the specific node have been exceeded.
Thresholds can be set on a per-node, per-class, or per-system basis in the nagios_vars.ini file. These values
are specific to the site and depend on site load.
If the load average thresholds are reasonable, monitor for excessive activity on the node.
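One way to monitor activity on the node is to compare its current load average with the thresholds in use at the site. The following is a minimal sketch and is not part of the HP XC software. The function name classify_load and the threshold values 4.0 and 8.0 are illustrative assumptions; they are not taken from the nagios_vars.ini file.

```shell
#!/bin/sh
# Classify a load average against warning and critical thresholds.
# classify_load is a hypothetical helper; the thresholds used below
# (4.0 and 8.0) are placeholder values, not HP XC defaults.
classify_load() {
    # $1 = load average, $2 = warning threshold, $3 = critical threshold
    awk -v l="$1" -v w="$2" -v c="$3" 'BEGIN {
        if (l + 0 >= c + 0)      print "CRITICAL"
        else if (l + 0 >= w + 0) print "WARNING"
        else                     print "OK"
    }'
}

# On Linux, /proc/loadavg begins with the 1-, 5-, and 15-minute load averages.
if [ -r /proc/loadavg ]; then
    read -r one five fifteen rest < /proc/loadavg
    echo "load $one/$five/$fifteen: $(classify_load "$one" 4.0 8.0)"
fi
```

Substitute the thresholds that apply to the node, class, or system in question before relying on the result.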
Service: Nagios Monitor
Status Information: Nagios status information
Typically, the status of Nagios, the number of Nagios services located, and the last time the Nagios status log was
updated are reported.
A warning or critical message indicates that one or more of the Nagios monitor processes either failed or reported
error conditions that could degrade monitoring.
Ensure that the node can communicate with the head node.
Service: Nodeinfo
Status Information: Node process status total/user/zombie, uptime
This entry displays the total number of processes, the number of user processes, and the number of zombie
processes, as well as the uptime for the Nagios host.
A warning or critical message indicates that thresholds for the specific node were exceeded.
Thresholds can be set on a per-node, per-class, or per-system basis in the nagios_vars.ini file. These values
are specific to the site and depend on site load.
If thresholds are reasonable, monitor for excessive activity on the node.
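To see the process counts on a node outside of Nagios, the process table can be summarized directly. The following is a sketch only; count_zombies and process_summary are hypothetical helper names, and the example assumes a ps command that accepts the stat output field, as the Linux procps version does.

```shell
#!/bin/sh
# Count zombie processes from a list of process state codes.
# Reads one state code per line on stdin (for example S, R+, Z) and
# prints how many begin with Z.
count_zombies() {
    awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Print the total number of processes and the number of zombies.
# Assumes "ps -e -o stat=" lists one state code per process with no header.
process_summary() {
    states=$(ps -e -o stat=)
    total=$(printf '%s\n' "$states" | wc -l)
    zombies=$(printf '%s\n' "$states" | count_zombies)
    echo "processes: total=$((total)) zombie=$zombies"
}
```

Call process_summary on the node that reported the warning and compare the counts with the thresholds configured for that node.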
Service: Ping Interconnect
Status Information: Node interconnect status
Nagios performs a ping command on the interconnect at regular intervals. Typically, this entry provides the
status information output from that command and the interconnect's IP address.
A warning or critical message indicates that the specified node or system interconnect failed to respond to the
ping command in the allotted time.
Determine if the node is powered on, enabled, and responsive.
Determine if the interconnect is functional by running the ping command with the Nagios host name for the
interconnect; for example, if the Nagios host name is necs1-1, enter the following command:
# ping necs1-1
If the interconnect responds but the time it takes to respond to a ping command is excessive, the problem may be
related to the system load. If there is no response to the ping command, determine whether the interconnect is
configured improperly or is failing by running the corresponding system interconnect diagnostic tools; see
“Using the System Interconnect Diagnostic Tools” (page 222) for more information.
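The two outcomes described above (a slow response or no response) can be told apart with a short script. This is a sketch under stated assumptions: avg_rtt and check_interconnect are hypothetical helper names, and the summary-line format parsed here is the one printed by the Linux ping command.

```shell
#!/bin/sh
# Extract the average round-trip time from ping summary output on stdin.
# Expects a line such as:
#   rtt min/avg/max/mdev = 0.045/0.052/0.061/0.007 ms
avg_rtt() {
    awk -F'=' '/min\/avg\/max/ { split($2, t, "/"); print t[2] }'
}

# Send three pings to the given host and report the result.
# $1 = Nagios host name for the interconnect (for example, necs1-1)
check_interconnect() {
    host=$1
    if out=$(ping -c 3 "$host" 2>/dev/null); then
        echo "$host responded; average round trip $(printf '%s\n' "$out" | avg_rtt) ms"
    else
        echo "$host did not respond; run the system interconnect diagnostic tools"
    fi
}
```

For example, check_interconnect necs1-1 reports either the average round-trip time or the absence of a response. What counts as an excessive round-trip time depends on the site and is not defined here.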
Service: Resource Monitor
Status Information: Resource monitor activity status
Typically, this entry reports the output of the SLURM squeue command.
A warning or critical message indicates that the SLURM squeue command reported errors.
See the output of the squeue command for more details. SLURM on HP XC systems is described in Chapter 14
(page 157).
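To reproduce the check by hand, run the squeue command and examine both its output and its exit status. The following sketch is not the Nagios plug-in itself; report_status is a hypothetical helper that runs any command and prints a Nagios-style status word based only on whether the command succeeded.

```shell
#!/bin/sh
# Run a command and report a status based on its exit code.
# Prints "OK" if the command succeeds; otherwise prints "CRITICAL"
# followed by the command name and whatever output it produced.
report_status() {
    if output=$("$@" 2>&1); then
        echo "OK"
    else
        echo "CRITICAL: $* failed: $output"
    fi
}

# Usage on a node where SLURM is installed:
#   report_status squeue
```

Any error text that squeue prints is included in the CRITICAL line, which is the information to look at first.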