HP XC System Software Administration Guide Version 3.1


Service: Configuration Monitor

Status Information: Node information

This message reports the total number of nodes, the number of nodes enabled, the number of nodes disabled, and the number of nodes imaged.

No action is required.

Service: Environment

Status Information: Node sensor status

A warning or critical message indicates that one or more monitored sensors reported that a threshold has been exceeded.

Correct the condition.

Service: Load Average

Status Information: Node Load Ave: x/y/z QueLen: n

A warning or critical message indicates that load average thresholds for the specific node have been exceeded.

Thresholds can be set on a per-node, per-class, or per-system basis in the nagios_vars.ini file. These values are specific to the site and depend on site load.

If the load average thresholds are reasonable, monitor for excessive activity on the node.
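As an illustration of monitoring node activity against a threshold, the following is a minimal sketch, not part of the XC software. The function name classify_load and the threshold values 8 and 16 are hypothetical; substitute the values your site uses. It assumes a Linux node where /proc/loadavg is readable.

```shell
#!/bin/sh
# Classify a load average against warning and critical thresholds.
# The thresholds passed in are illustrative only, not XC defaults.
classify_load() {
    # $1 = load average, $2 = warning threshold, $3 = critical threshold
    awk -v cur="$1" -v warn="$2" -v crit="$3" 'BEGIN {
        if (cur + 0 >= crit + 0)      print "CRITICAL"
        else if (cur + 0 >= warn + 0) print "WARNING"
        else                          print "OK"
    }'
}

# On Linux, the first three fields of /proc/loadavg are the
# 1-, 5-, and 15-minute load averages.
if [ -r /proc/loadavg ]; then
    read -r one five fifteen rest < /proc/loadavg
    echo "load $one/$five/$fifteen: $(classify_load "$one" 8 16)"
fi
```

Run the script on the node that reported the message. If the node is classified as WARNING or CRITICAL repeatedly, examine the processes on the node with a tool such as top.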

Service: Nagios Monitor

Status Information: Nagios status information

Typically, the status of Nagios, the number of Nagios services located, and the last time the Nagios status log was updated are reported.

A warning or critical message indicates that one or more of the Nagios monitor processes either failed or reported error conditions that could degrade monitoring.

Ensure that the node can communicate with the head node.

Service: Nodeinfo

Status Information: Node process status total/user/zombie, uptime

This entry displays the total number of processes, the number of user processes, and the number of zombie processes, as well as the uptime for the Nagios host.

A warning or critical message indicates that thresholds for the specific node were exceeded.

Thresholds can be set on a per-node, per-class, or per-system basis in the nagios_vars.ini file. These values are specific to the site and depend on site load.

If thresholds are reasonable, monitor for excessive activity on the node.
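To check process counts on a node by hand, a sketch along the following lines may help. It is an assumption about how to approximate the reported figures with the standard ps command, not the method Nagios itself uses. The function name count_zombies is hypothetical.

```shell
#!/bin/sh
# Count zombie processes from a list of process states, one per line on stdin.
# In ps output, a state that begins with "Z" marks a zombie (defunct) process.
count_zombies() {
    awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

if command -v ps >/dev/null 2>&1; then
    # -A selects all processes; "pid=" and "stat=" suppress the column headers.
    total=$(ps -A -o pid= | wc -l)
    zombies=$(ps -A -o stat= | count_zombies)
    echo "processes: total=$((total)) zombie=$zombies"
fi
```

A zombie count that stays high usually points to a parent process that is not reaping its children; identify the parent with the ppid column of ps.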

Service: Ping Interconnect

Status Information: Node interconnect status

Nagios performs a ping command on the interconnect at regular intervals. Typically, this entry provides the status information output from that command and the interconnect's IP address.

A warning or critical message indicates that the specified node or system interconnect failed to respond to the ping command in the allotted time.

Determine if the node is powered on, enabled, and responsive.

Determine if the interconnect is functional by running the ping command with the Nagios host name for the interconnect; for example, if the Nagios host name is necs1-1, enter the following command:

# ping necs1-1

If the interconnect responds but takes an excessive amount of time to respond to the ping command, the problem may be related to the system load. If there is no response to the ping command, determine whether the interconnect is configured improperly or is failing by running the corresponding system interconnect diagnostic tools; see “Using the System Interconnect Diagnostic Tools” (page 222) for more information.
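To judge whether the response time is excessive, it can help to extract the average round-trip time from the ping summary. The following sketch assumes the summary line format printed by the Linux ping command (rtt min/avg/max/mdev = ...); the function name avg_rtt is hypothetical, and what counts as excessive depends on the site.

```shell
#!/bin/sh
# Print the average round-trip time, in milliseconds, from ping output on stdin.
# The summary line looks like:
#   rtt min/avg/max/mdev = 0.045/0.052/0.061/0.007 ms
# Splitting that line on "/" puts the average value in the fifth field.
avg_rtt() {
    awk -F'/' '/^(rtt|round-trip) / { print $5 }'
}

# Example use (necs1-1 is the host name used in the text above):
#   ping -c 5 necs1-1 | avg_rtt
```

No output from avg_rtt means that ping printed no summary statistics, which is consistent with the interconnect not responding.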

Service: Resource Monitor

Status Information: Resource monitor activity status

Typically, this entry reports the output of the SLURM squeue command.

A warning or critical message indicates that the SLURM squeue command reported errors.

See the output of the squeue command for more details. SLURM on HP XC systems is described in Chapter 14 (page 157).

20.3 Messages Reported by Nagios 233