MCA Error Recovery: HP-UX Feature for Recovering from Machine Check Aborts
Introduction
HP Integrity servers provide superior reliability and availability. Nevertheless, even the best of
computers can occasionally experience hardware problems that lead to unplanned downtime. Some
of these problems are caused by transient events such as an alpha particle strike on memory, cache,
or a processor data structure. Intel® Itanium®-based servers support an advanced architecture that
allows the system to contain, correct, and signal machine check errors. Many of these errors are
corrected by the platform without operating system intervention. When the platform cannot correct an
error, the error is handed off to the operating system.
To further enhance the superior reliability of HP Integrity servers, the HP-UX MCA Error Recovery
feature adds the ability to recover from some of these machine check aborts (MCAs). Recovery lets the
system continue execution and so helps prevent unplanned downtime. This feature is also known as
Automated Processor Recovery (APR).
Availability
The MCA Error Recovery feature is delivered in the following set of HP-UX 11i v3 patches and is
supported on servers that use the Dual-Core Intel® Itanium® Processor 9100 Series or later processors.
• PHKL_36387
• PHKL_36388
• PHKL_36389
• PHKL_36390
• PHKL_36470
• PHKL_36534
• PHKL_36010
These patches are included in the HWEnable11i bundle on the HP-UX 11i v3 Operating Environment
(OE) media for September 2007. The patches and the September 2007 HWEnable11i bundle are
also available at the HP IT Resource Center site:
http://itrc.hp.com
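To confirm that all of the patches are present on a system, a saved patch listing can be compared
against the patch IDs above. The following is a minimal sketch, not an HP-supplied tool: the helper
name missing_patches and the file /tmp/patches.txt are hypothetical, and it assumes the listing was
produced with swlist -l patch.

```shell
#!/bin/sh
# Patch IDs copied from the list above.
MCA_PATCHES="PHKL_36387 PHKL_36388 PHKL_36389 PHKL_36390 PHKL_36470 PHKL_36534 PHKL_36010"

# missing_patches FILE (hypothetical helper): print each required patch ID
# that does not appear anywhere in FILE, a saved patch listing.
missing_patches() {
    for p in $MCA_PATCHES; do
        grep -q "$p" "$1" || echo "$p"
    done
}

# Usage on an HP-UX host (assumes swlist is on PATH):
#   swlist -l patch > /tmp/patches.txt
#   missing_patches /tmp/patches.txt
```

If the helper prints nothing, every patch ID in the list was found in the listing.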
Configuration
The MCA Error Recovery feature is enabled by default on all systems that support it. During boot, if a
system does not support the feature, a message indicating that the feature has been disabled is
printed and written to the system log (/var/adm/syslog/syslog.log).
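One way to look for that boot-time message is to search the system log. This document does not give
the exact message text, so the sketch below simply matches any line containing "mca"
(case-insensitive); the helper name mca_log_lines is hypothetical, and the pattern may need to be
adjusted once the actual wording is known.

```shell
#!/bin/sh
# mca_log_lines [FILE] (hypothetical helper): print log lines that mention
# "mca" in any letter case. FILE defaults to the system log named above.
# The search pattern is an assumption; the real message wording is not
# given in this document.
mca_log_lines() {
    grep -i "mca" "${1:-/var/adm/syslog/syslog.log}"
}

# Usage:
#   mca_log_lines                      # search the default system log
#   mca_log_lines /path/to/saved.log   # search a saved copy
```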
A new kernel tunable, mca_recovery_on(5), is also provided so that a system administrator can
disable the feature manually. Changes to the tunable do not require a reboot of the system. On
systems that do not support the MCA Error Recovery feature, the tunable is set to disabled and
cannot be changed.
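On HP-UX 11i v3, kernel tunables are normally queried and changed with kctune(1M). The sketch below
wraps that command; the helper name set_mca_recovery is hypothetical, and the values 0 (disabled) and
1 (enabled) are an assumption that should be confirmed against the mca_recovery_on(5) manpage before
use.

```shell
#!/bin/sh
# set_mca_recovery VALUE (hypothetical helper): set the mca_recovery_on
# tunable with kctune. When kctune is not on PATH (for example, when
# rehearsing on a non-HP-UX host), print the command instead of running it.
set_mca_recovery() {
    cmd="kctune mca_recovery_on=$1"
    if command -v kctune >/dev/null 2>&1; then
        $cmd
    else
        echo "would run: $cmd"
    fi
}

# Usage on an HP-UX host:
#   kctune mca_recovery_on   # show the current setting
#   set_mca_recovery 0       # assumed "disabled" value; see mca_recovery_on(5)
#   set_mca_recovery 1       # assumed "enabled" value
```

On a system that does not support the feature, the change is expected to be rejected, as described
above.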