Extending ICE-Linux Monitoring

Figure 1. Typical Nagios Display

The specific services from the above screen capture are defined in this table:

Service

Description

Environment

Per-node temperature, fan, power supply, and other sensor information supplied by the platform.

Load Average

Per-node load averages, displayed for the 1-, 5-, and 15-minute intervals. Values are compared to per-node thresholds defined in the /opt/hptc/nagios/etc/nagios_vars.ini file.

Nodeinfo

Per-node process and user information. Total processes, user (non-system) processes, zombie processes, and user counts are compared against thresholds defined in the nagios_vars.ini file.

Syslog Alerts

Per-node alerts that match patterns found in the /opt/hptc/nagios/etc/syslogAlertRules file. Clicking a hyperlink in the status text displays previously found patterns.

System Event Log

Per-node alerts that match patterns found in the /opt/hptc/nagios/etc/selRules file. The System Event Log is collected through the management processor, either an iLO or an IPMI BMC. System events are hardware-related alerts such as memory errors and power supply faults.

System Free Space

Per-node free space status, compared to thresholds defined in the nagios_vars.ini file.

Configuration

Per-node configuration information and a quick reference for the simple node name assigned by ICE-Linux.
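The threshold comparisons described in the table, such as the one for Load Average, can be illustrated with a minimal sketch. It assumes warning and critical limits for each interval and maps the result to the standard Nagios plugin status codes (0 = OK, 1 = WARNING, 2 = CRITICAL). The threshold values and the names THRESHOLDS, check_load, and check_local_node are hypothetical; the real limits are read from nagios_vars.ini, whose layout is not shown here, and this is not the check that ICE-Linux itself ships.

```python
import os

# Nagios plugin status-code convention.
OK, WARNING, CRITICAL = 0, 1, 2

# Hypothetical (warning, critical) limits for the 1-, 5-, and 15-minute
# load averages. Illustrative values only; the real per-node thresholds
# are defined in nagios_vars.ini.
THRESHOLDS = {
    "1min": (8.0, 12.0),
    "5min": (6.0, 10.0),
    "15min": (4.0, 8.0),
}


def check_load(loads, thresholds=THRESHOLDS):
    """Compare a (1min, 5min, 15min) tuple to the limits.

    Returns (status, message), where status is the worst state found.
    """
    status = OK
    parts = []
    for (name, (warn, crit)), value in zip(thresholds.items(), loads):
        if value >= crit:
            status = max(status, CRITICAL)
        elif value >= warn:
            status = max(status, WARNING)
        parts.append(f"{name}={value:.2f}")
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[status]
    return status, f"LOAD {label} - " + ", ".join(parts)


def check_local_node():
    """Check the live load averages of the local node (Unix only)."""
    return check_load(os.getloadavg())


# Sample values: only the 5-minute average crosses its warning limit.
print(check_load((0.5, 7.2, 3.1))[1])
```

A Nagios-style plugin built on this sketch would print the message and exit with the returned status, so that the worst of the three intervals determines the state shown for the service.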

ICE-Linux gathers data using the platform-specific management processor (iLO or IPMI) as well as OS-provided metrics such as load average, nodeinfo, and system free space. Other information, such as syslog alerts, is obtained from the consolidated log file, /hptc_cluster/adm/logs/consolidated.log. The shaded rows in the table above indicate services that use mond to collect their data. This information is stored in a Round Robin Database (RRD) for future reference and to enable graphical displays. All non-shaded information is collected without agents and is therefore available for any host being monitored by ICE-Linux. Beyond the metrics displayed and tracked by Nagios, other metrics are also captured and stored in RRDs. You can find (and customize) the list of captured metrics in the nagios_vars.ini file.
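The idea behind Round Robin Database storage can be sketched in a few lines: the store has a fixed capacity, and once it is full each new sample overwrites the oldest one, so the space used never grows. The class below is a conceptual illustration only. RoundRobinSeries and its methods are hypothetical names, and the sketch does not reproduce the RRD file format, the consolidation of samples into multiple archives, or the way mond writes its data.

```python
from collections import deque


class RoundRobinSeries:
    """Fixed-capacity time series: when full, a new sample evicts the oldest."""

    def __init__(self, capacity):
        # deque(maxlen=...) discards items from the opposite end when full.
        self._samples = deque(maxlen=capacity)

    def update(self, timestamp, value):
        """Store one (timestamp, value) sample."""
        self._samples.append((timestamp, value))

    def fetch(self):
        """Return the retained samples, oldest first."""
        return list(self._samples)

    def average(self):
        """Mean of the retained values, or None if nothing is stored."""
        if not self._samples:
            return None
        return sum(value for _, value in self._samples) / len(self._samples)


# Keep only the three most recent load samples (timestamps in seconds).
series = RoundRobinSeries(capacity=3)
for timestamp, load in [(0, 0.5), (60, 1.0), (120, 1.5), (180, 2.0)]:
    series.update(timestamp, load)

print(series.fetch())
print(series.average())
```

After the fourth update the sample taken at time 0 is gone, which is the trade-off of this design: a bounded, predictable storage footprint in exchange for losing the oldest history.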