HP XC System Software Administration Guide Version 3.1

Service: Root key synchronization
Status Information: Root SSH key synchronization status
This entry provides the status of the root key synchronization.
A warning or critical message indicates that the root ssh keys for one or more hosts are out of synchronization
with the head node. The ssh and pdsh commands may not work for these nodes.
Verify that the imaging is correct on the affected nodes. The most common cause of this problem is a
node that failed to reimage and booted a kernel with an older set of ssh keys (/root/.ssh/*).
If none of the nodes are synchronized, determine whether the head node changed its root ssh keys.
See “Mismatched Secure Shell Keys” (page 229) for more information.
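One way to narrow down which nodes hold stale keys is to compare a checksum of a key file on each node against the head node's copy. The following is a minimal sketch, not part of the HP XC tooling: the helper name find_mismatched is hypothetical, and the choice of /root/.ssh/authorized_keys as the file to compare is an assumption (the guide names only /root/.ssh/*). It expects input in the "node: checksum  path" form that pdsh produces when it runs md5sum remotely.

```shell
#!/bin/sh
# Hypothetical helper: read "node: checksum  path" lines on stdin and print
# the name of every node whose checksum differs from the reference in $1.
find_mismatched() {
    awk -v ref="$1" '{
        node = $1
        sub(/:$/, "", node)                      # strip the trailing colon pdsh adds
        if (($2 "") != (ref "")) print node      # force a string comparison
    }'
}

# Example use on the head node (assumed file name; adjust to the key file in question):
#   ref=$(md5sum /root/.ssh/authorized_keys | awk '{print $1}')
#   pdsh -a 'md5sum /root/.ssh/authorized_keys' 2>/dev/null | find_mismatched "$ref"
```

Nodes that pdsh cannot reach at all produce no output line, so they do not appear in the result; treat a node that is missing from the output as suspect as well.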
Service: Supermon Metrics Monitor
Status Information: Supermon node metrics retrieval status
This entry reports the status of the Supermon service and the number of nodes from which it collected metrics
data.
A warning or critical message indicates that one or more hosts were not accessible during metrics collection or
that the Nagios service_check_timeout interval expired.
These messages can occur if metrics collection cannot be completed in a reasonable time; examine the
/opt/hptc/nagios/etc/nagios.cfg file for the value of the service_check_timeout parameter.
The default should be adequate for HP XC systems with fewer than 256 nodes.
Increasing the value for the service_check_timeout parameter may solve the problem for systems with more
nodes.
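To see the value currently in effect before changing it, the parameter can be read directly from the configuration file. This is a small sketch under stated assumptions: the function name get_service_check_timeout is hypothetical, and it assumes the parameter appears as an uncommented service_check_timeout=<seconds> line in the file.

```shell
#!/bin/sh
# Hypothetical helper: print the service_check_timeout value (in seconds).
# Defaults to the nagios.cfg path named in this guide; pass another path as $1.
get_service_check_timeout() {
    cfg="${1:-/opt/hptc/nagios/etc/nagios.cfg}"
    # Match only uncommented lines, tolerate spaces around "=",
    # and print only the last match if there are several.
    sed -n 's/^[[:space:]]*service_check_timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$cfg" | tail -n 1
}
```

Calling get_service_check_timeout with no argument on the head node prints the configured number of seconds, or nothing if the parameter is not set in the file.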
Also, verify that the supermond service is running by invoking the following command on the head node:
# service supermond status
Loss or time-outs of this service can cause per-node warnings for nodeinfo, load average and system free
space.
A warning or critical message that is not caused by a time-out indicates only that a number of monitored nodes
are not responding; this is normal if the nodes are down or otherwise disabled.
Service: Syslog Alert Monitor
Status Information: Status of consolidated.log syslog monitoring
Typically, this entry reports the number of new records processed in the
/hptc_cluster/adm/logs/consolidated.log file.
A warning or critical message occurs when there is insufficient time to process a huge volume of messages before
the Nagios service_check_timeout period expires.
Nagios examines the recent incoming consolidated log messages and issues a warning or critical message if the
incoming rate since the last interval exceeds a configured number of records. The default values are 2 for warnings
and 20 for critical. See /opt/hptc/nagios/libexec/check_syslogalerts for details.
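The threshold rule described above can be sketched as follows. This is an illustration only and is not taken from the check_syslogalerts plug-in: the function name classify_rate is hypothetical, and it assumes that "exceeds" means strictly greater than the threshold. The return values follow the standard Nagios plug-in convention (0 for OK, 1 for WARNING, 2 for CRITICAL).

```shell
#!/bin/sh
# Hypothetical helper: map a count of new log records to a Nagios state.
# Usage: classify_rate COUNT [WARN] [CRIT]
classify_rate() {
    count="$1"
    warn="${2:-2}"     # default warning threshold quoted in this guide
    crit="${3:-20}"    # default critical threshold quoted in this guide
    if [ "$count" -gt "$crit" ]; then
        echo "CRITICAL"; return 2
    elif [ "$count" -gt "$warn" ]; then
        echo "WARNING"; return 1
    fi
    echo "OK"; return 0
}
```

With the defaults, a count of 2 is still OK, 3 through 20 yields a warning, and 21 or more is critical.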
No specific action is required unless the service times out. A time-out means that more syslog messages were
collected across the system than the plug-in can process in the service_check_timeout period.
See the /opt/hptc/nagios/etc/nagios.cfg file for the value of the service_check_timeout parameter.
Running the following command on the node reporting the error solves the problem:
# /opt/hptc/nagios/libexec/check_syslogalerts domain node:nagios_monitor nsca
Otherwise, wait for the nightly log to roll over.
Service: Syslog Alerts
Status Information: Node Syslog alerts information
Typically, this entry reports the number of alerts in a specified period of time and allows you to access the most
recent log.
A warning or critical message indicates that one or more rules defined in the
/opt/hptc/nagios/etc/syslogAlertRules file matches the specified node's consolidated log file.
Take the appropriate action based on the message.
Service: System Event Log
Status Information: Node Syslog alerts information
A warning or critical message indicates that one or more rules defined in the /opt/hptc/nagios/etc/selRules
file matches the specified node's firmware System Event Log.
Take the appropriate action based on the System Event Log message.