HP XC System Software Administration Guide Version 3.1


Service: Root key synchronization

Status Information: Root SSH key synchronization status

This entry provides the status of the root key synchronization.

A warning or critical message indicates that the root ssh keys for one or more hosts are out of synchronization with the head node. The ssh and pdsh commands may not work for these nodes.

Verify that the imaging is correct on the affected nodes. The most common cause of this problem is a node that failed to reimage and booted a kernel with an older set of ssh keys (/root/.ssh/*).

If none of the nodes are synchronized, determine whether the head node changed its root ssh keys.

See “Mismatched Secure Shell Keys” (page 229) for more information.
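One way to locate the affected nodes is to compare a checksum of a root key file on each node with the value from the head node. The following is a minimal sketch and not part of the HP XC tooling: the function name find_key_mismatches is hypothetical, and it assumes input lines of the form "node: checksum", which is how pdsh prefixes the output of a remote command.

```shell
# Hypothetical helper: list nodes whose key checksum differs from the
# head node's checksum.
# Usage: find_key_mismatches HEAD_CHECKSUM < lines of "node: checksum ..."
find_key_mismatches() {
    awk -v ref="$1" '{
        node = $1
        sub(/:$/, "", node)           # strip the trailing ":" that pdsh adds
        if ($2 != ref) print node     # field 2 is the checksum itself
    }'
}
```

Assuming the key of interest is /root/.ssh/id_rsa.pub (the actual file name may differ on your system), the function could be fed with something like `pdsh -a md5sum /root/.ssh/id_rsa.pub | find_key_mismatches "$(md5sum /root/.ssh/id_rsa.pub | awk '{print $1}')"` on the head node. Nodes that do not answer at all will not appear in the output, so they need to be checked separately.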

Service: Supermon Metrics Monitor

Status Information: Supermon node metrics retrieval status

This entry reports the status of the Supermon service and the number of nodes from which it collected metrics data.

A warning or critical message indicates that one or more hosts were not accessible during metrics collection or that the Nagios service_check_timeout interval expired.

These messages can occur if metrics collection cannot be completed in a reasonable time; examine the /opt/hptc/nagios/etc/nagios.cfg file for the value of the service_check_timeout parameter.

The default should be adequate for HP XC systems with fewer than 256 nodes.

Increasing the value for the service_check_timeout parameter may solve the problem for systems with more nodes.
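To check the current setting quickly, the value can be extracted from the configuration file. This is a minimal sketch: the function name get_service_check_timeout is hypothetical, and it assumes the standard Nagios "name=value" directive syntax with the value given in seconds.

```shell
# Hypothetical helper: print the service_check_timeout value (seconds)
# found in a Nagios main configuration file.
# Usage: get_service_check_timeout /opt/hptc/nagios/etc/nagios.cfg
get_service_check_timeout() {
    # Match uncommented "service_check_timeout=<digits>" lines only.
    # If the directive appears more than once, this sketch prints only
    # the last match.
    sed -n 's/^[[:space:]]*service_check_timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1
}
```

If you change the value, Nagios typically has to be restarted or reloaded before the new timeout takes effect.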

Also, verify that the supermond service is running by invoking the following command on the head node:

# service supermond status

Loss or time-outs of this service can cause per-node warnings for nodeinfo, load average, and system free space.

A non-timeout warning or critical message simply indicates that a number of monitored nodes are not responding; this is normal if the nodes are down or otherwise disabled.

Service: Syslog Alert Monitor

Status Information: Status of consolidated.log syslog monitoring

Typically, this entry reports the number of new records processed in the /hptc_cluster/adm/logs/consolidated.log file.

A warning or critical message occurs when there is insufficient time to process a huge volume of messages before the Nagios service_check_timeout period expires.

Nagios examines the recent incoming consolidated log messages and issues a warning or critical message if the number of records received since the last interval exceeds a configured threshold. The default thresholds are 2 for warnings and 20 for critical. See /opt/hptc/nagios/libexec/check_syslogalerts for details.
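The threshold logic described above can be sketched as follows. This is an illustration only and is not taken from check_syslogalerts: the function name classify_record_count is hypothetical, and reading "exceeds" as a strict greater-than comparison is an assumption. The return codes follow the standard Nagios plug-in convention (0 for OK, 1 for WARNING, 2 for CRITICAL).

```shell
# Hypothetical helper: map a count of new log records to a Nagios state.
# Usage: classify_record_count COUNT [WARN_THRESHOLD] [CRIT_THRESHOLD]
classify_record_count() {
    count="$1"
    warn="${2:-2}"     # default warning threshold cited in this guide
    crit="${3:-20}"    # default critical threshold cited in this guide
    # Assumption: a state is raised only when the count is strictly
    # greater than the threshold.
    if [ "$count" -gt "$crit" ]; then
        echo "CRITICAL"
        return 2
    elif [ "$count" -gt "$warn" ]; then
        echo "WARNING"
        return 1
    fi
    echo "OK"
    return 0
}
```

For example, `classify_record_count 5` prints WARNING with the default thresholds, because 5 is greater than 2 but not greater than 20.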

No specific action is required unless the service times out. A timeout means that the number of syslog messages collected across the system is more than the plug-in can process in the service_check_timeout period.

See the /opt/hptc/nagios/etc/nagios.cfg file for the value of the service_check_timeout parameter.

Running the following command on the node that reports the error solves the problem:

# /opt/hptc/nagios/libexec/check_syslogalerts --domain node:nagios_monitor --nsca

Otherwise, wait for the nightly log to roll over.

Service: Syslog Alerts

Status Information: Node Syslog alerts information

Typically, this entry reports the number of alerts in a specified period of time and allows you to access the most recent log.

A warning or critical message indicates that one or more rules defined in the /opt/hptc/nagios/etc/syslogAlertRules file match the specified node's consolidated log file.

Take the appropriate action based on the message.

Service: System Event Log

Status Information: Node Syslog alerts information

A warning or critical message indicates that one or more rules defined in the /opt/hptc/nagios/etc/selRules file match the specified node's firmware System Event Log.

Take the appropriate action based on the System Event Log message.
