Fault Monitoring on Windows Integrity Servers
Output
These SNMP traps are logged to the IPMI SEL; consequently, they are also logged to the
Windows System Log.
Predictive Failure Monitor (PFM)
Predictive Failure Monitor is an exclusive service of the HP Integrity server that monitors
correctable errors in the system, anticipates more serious failures (predictable failures), and
generates special, customer-visible, predictive failure events and SNMP alarms when a
component reaches a designated error threshold.
PFM is available only on HP Integrity servers. The threshold values are the same across the
operating systems supported on these servers (Windows Server 2003 and HP-UX).
Types of Errors
Currently PFM monitors:
• Corrected memory errors, including double-chip sparing on sx2000 systems, double-byte
errors on zx2 systems, and single-byte errors on zx1 systems
• Corrected internal processor cache errors
• Corrected external cache errors for mx2 processors
• Corrected fabric errors for cellular systems
• Thermal trips for Intel® Dual-Core Itanium® processors (Montecito)
• Front Side Bus (FSB) errors
• Intel Cache Safe Technology performance errors
Methodology
PFM generates special pre-failure events in the IPMI system event log (SEL) for certain system
devices. It does this by monitoring the log records that firmware delivers through OS channels,
counting the occurrences of specific events, and comparing each count with a preconfigured
threshold.
Windows periodically polls the architected SAL_GET_STATE_INFO system firmware (SFW) routine.
The polling interval on HP Integrity servers is 10 minutes and is set through a registry key.
When a correctable error occurs, Windows OS receives Corrected Platform Event (CPE) and
Corrected Machine Check (CMC) records from system firmware through the
SAL_GET_STATE_INFO call. The PFM service registers with the WMI service to get copies of
these records.
When PFM receives a record, the service determines which type of error occurred and
whether to take action. Actions are defined in a rules text file (hpfmrules.cfg). If the type of
error matches an event rule, the PFM service increments the count for that error.
When a threshold is crossed, the PFM service logs an IPMI SEL event that triggers the Event
Subsystem to forward notifications through its normal channels: Windows System Log,
SNMP trap, email notification, and SMH SEL viewer.
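The count-and-compare logic described above can be illustrated with a minimal Python sketch. It is not the PFM implementation. The format of hpfmrules.cfg is not shown in this document, so the rules are modeled as a plain dictionary, and the class name, error-type strings, and event text are all hypothetical. The sketch also assumes that an event is logged once, at the moment the count reaches the threshold.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ThresholdMonitor:
    """Illustrative stand-in for rule matching and threshold counting."""

    # Hypothetical rules: error type -> count at which an event is logged.
    thresholds: dict
    counts: Counter = field(default_factory=Counter)
    events: list = field(default_factory=list)  # stand-in for the IPMI SEL

    def handle_record(self, error_type: str) -> bool:
        """Count one corrected-error record.

        Returns True only when this record brings the count up to the
        configured threshold and an event is appended.
        """
        limit = self.thresholds.get(error_type)
        if limit is None:
            return False  # no matching rule, so no action is taken
        self.counts[error_type] += 1
        if self.counts[error_type] == limit:
            # Assumption: log once, when the threshold is first reached.
            self.events.append(f"predictive failure: {error_type}")
            return True
        return False


monitor = ThresholdMonitor({"corrected_memory": 3})
results = [monitor.handle_record("corrected_memory") for _ in range(3)]
```

In this sketch only the third record returns True. A record whose type has no rule is ignored and never counted.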
Output
Besides logging a threshold event to IPMI SEL, PFM also creates log files that contain more
information about the occurrence of the error. These log files are in directory: