MCA Error Recovery: HP-UX Feature for Recovering from Machine Check Aborts

Recovery Process
In the rare event that the operating system handles a machine check abort (MCA), it must ensure the
error can be fully contained and corrected before the system resumes execution. In some cases
this entails terminating a user process that was executing at the time of the machine check abort. If the
error is determined to be non-transient or not fully correctable, the system will be brought down.
If a user process is terminated as part of MCA error recovery handling, the core file created by the
process will indicate MCA Error Recovery as the reason for termination. One of two signals will be
used:
signal SIGILL with an si_code of ILL_MCARECOV
signal SIGBUS with an si_code of BUS_MCARECOV
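As an illustration of how a monitoring parent process might react to these signals, the following is a minimal Python sketch. It assumes the parent only sees the child's return code (as subprocess reports it), so it cannot read the si_code value and can only flag SIGILL or SIGBUS as a possible MCA recovery termination. The function name describe_child_exit is hypothetical.

```python
import signal

# Signals that the text says MCA error recovery may use to terminate a process.
MCA_RECOVERY_SIGNALS = {signal.SIGILL, signal.SIGBUS}


def describe_child_exit(returncode):
    """Classify a return code as reported by the subprocess module.

    A negative return code -N means the child was terminated by signal N.
    """
    if returncode >= 0:
        return "exited with status %d" % returncode
    try:
        sig = signal.Signals(-returncode)
    except ValueError:
        # Signal number with no symbolic name on this platform.
        return "killed by signal %d" % -returncode
    if sig in MCA_RECOVERY_SIGNALS:
        # The parent cannot see the child's si_code, so this is only a hint:
        # confirm against the core file or the system log.
        return "killed by %s; possible MCA error recovery" % sig.name
    return "killed by %s" % sig.name
```

A result that mentions possible MCA error recovery is only a prompt to check the core file and the system log, since SIGILL and SIGBUS have many other causes.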
When a machine check abort occurs, the platform will gather information and create an error log.
During MCA Error Recovery this log will be sent to the Diagnostics subsystem and archived in the
/var/tombstones directory.
After the successful recovery of an error, a message containing details about the recovery will be
printed and placed in the system log (/var/adm/syslog/syslog.log). If a user process was
terminated as part of the error recovery, information about the process will be included in the
message.
Dynamic Processor Resilience
After the successful recovery of an MCA, the Diagnostics subsystem will inform System Fault
Management (SFM) of the event. SFM monitors the number of recovered MCAs on each processor.
If two MCAs occur on the same processor within a predetermined time span (for example, two
months), the processor will be disabled (Dynamic Processor Resilience). SFM will also raise an
Automated Processor Recovery (APR) event 100661, which generates a message in the event log
(/var/opt/resmon/log/event.log).
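The per-processor counting rule described above can be illustrated with a small Python sketch. This is not SFM's implementation. It assumes a 60-day window (the text gives "two months" only as an example), treats the window boundary as inclusive, and uses a hypothetical event format of (processor ID, timestamp in seconds) pairs.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400
# Assumption: the text gives "two months" only as an example span; 60 days here.
WINDOW_SECONDS = 60 * SECONDS_PER_DAY
# From the text: two recovered MCAs on one processor within the span.
THRESHOLD = 2


def processors_to_disable(events, window=WINDOW_SECONDS, threshold=THRESHOLD):
    """Return the set of processor IDs whose recovered-MCA history meets the threshold.

    events: iterable of (processor_id, timestamp_seconds) pairs, in any order.
    """
    by_cpu = defaultdict(list)
    for cpu, ts in events:
        by_cpu[cpu].append(ts)

    flagged = set()
    for cpu, stamps in by_cpu.items():
        stamps.sort()
        # Compare each event with the one (threshold - 1) positions earlier.
        for i in range(threshold - 1, len(stamps)):
            if stamps[i] - stamps[i - threshold + 1] <= window:
                flagged.add(cpu)
                break
    return flagged
```

For example, a processor with recovered MCAs on day 0 and day 10 would be flagged, while one with recovered MCAs on day 0 and day 90 would not.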
Restarting a Terminated Process
Termination of a user process can happen for various reasons, including MCA error recovery. If
needed, the inittab(4) file can be used to monitor important processes and restart them if they are
terminated.
Users of ServiceGuard can group application services (individual HP-UX processes) into
"packages." They can then configure ServiceGuard to monitor those processes, specifying how many
times each should be restarted locally on the same cluster node before it is failed over to an
adoptive node.
For more detailed ServiceGuard information, see the latest edition of Managing ServiceGuard at
http://www.docs.hp.com/en/oshpux11iv3.html#Serviceguard
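The restart-then-fail-over policy described above can be sketched in a few lines of Python. This is a conceptual illustration only, not how init or ServiceGuard is implemented or configured. The function name supervise and its max_local_restarts parameter are hypothetical, and "failover" here is just a return value standing in for the action a cluster manager would take.

```python
import subprocess
import sys


def supervise(command, max_local_restarts):
    """Run command; restart it after each non-zero exit, up to a limit.

    Returns ("ok", restarts) if the command eventually exits with status 0,
    or ("failover", restarts) once the local restart budget is used up.
    """
    restarts = 0
    while True:
        # Non-zero covers both error exits and termination by a signal
        # (which subprocess reports as a negative return code).
        returncode = subprocess.call(command)
        if returncode == 0:
            return ("ok", restarts)
        if restarts >= max_local_restarts:
            # A cluster manager would move the package to another node here.
            return ("failover", restarts)
        restarts += 1
```

With max_local_restarts set to 2, a command that always fails is run three times in total (the initial run plus two local restarts) before the sketch reports "failover".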