MCA Error Recovery: HP-UX Feature for Recovering from Machine Check Aborts

Recovery Process

In the rare event that the operating system handles a machine check abort (MCA), it must ensure the
error can be fully contained and corrected before the system resumes execution. In some cases this
entails terminating the user process that was executing at the time of the machine check abort. If
the error is determined to be non-transient or not fully correctable, the system is brought down.

If a user process is terminated as part of MCA error recovery handling, the core file created by the
process will indicate MCA Error Recovery as the reason for termination. One of two signals will be
used:

• signal SIGILL with an si_code of ILL_MCARECOV
• signal SIGBUS with an si_code of BUS_MCARECOV

When a machine check abort occurs, the platform will gather information and create an error log.
During MCA Error Recovery this log will be sent to the Diagnostics subsystem and archived in the
/var/tombstones directory.

After an error is successfully recovered, a message containing details about the recovery is printed
and placed in the system log (/var/adm/syslog/syslog.log). If a user process was terminated as part
of the error recovery, information about the process is included in the message.

Dynamic Processor Resilience

After the successful recovery of an MCA, the Diagnostics subsystem will inform System Fault
Management (SFM) of the event. SFM monitors the number of recovered MCAs on each processor. If two
MCAs occur on the same processor within a predetermined time span (for example, two months), the
processor will be disabled (Dynamic Processor Resilience). SFM will also raise an Automated
Processor Recovery (APR) event 100661, which generates a message in the event log
(/var/opt/resmon/log/event.log).
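
The threshold rule described above (two recovered MCAs on one processor within a time span) can be
sketched as follows. This only illustrates the rule, not SFM's actual implementation. The names
record_recovered_mca, cpu_mca_state, and MAX_CPUS are hypothetical, and the 60-day window simply
mirrors the "two months" example.

```c
#include <stddef.h>
#include <time.h>

#define MAX_CPUS 64                             /* hypothetical limit */
#define WINDOW_SECONDS (60L * 24L * 60L * 60L)  /* 60 days, example value only */

struct cpu_mca_state {
    time_t last_mca;   /* time of the most recent recovered MCA */
    int    has_mca;    /* 0 until the first MCA is recorded */
    int    disabled;   /* 1 once the processor has been disabled */
};

static struct cpu_mca_state cpu_state[MAX_CPUS];

/* Records a recovered MCA on `cpu` at time `now`.
 * Returns 1 if this MCA causes the processor to be disabled,
 * 0 if no action is taken, and -1 if `cpu` is out of range. */
int record_recovered_mca(int cpu, time_t now)
{
    if (cpu < 0 || cpu >= MAX_CPUS)
        return -1;

    struct cpu_mca_state *s = &cpu_state[cpu];
    if (s->disabled)
        return 0;

    /* Second MCA inside the window: disable the processor. */
    if (s->has_mca && difftime(now, s->last_mca) <= (double)WINDOW_SECONDS) {
        s->disabled = 1;
        return 1;
    }

    /* First MCA, or the previous one is older than the window. */
    s->has_mca = 1;
    s->last_mca = now;
    return 0;
}
```

In this sketch an MCA that arrives after the window has expired replaces the stored timestamp, so
only two events close together in time disable a processor.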

Restarting a Terminated Process

Termination of a user process can happen for various reasons, including MCA error recovery. If
needed, the inittab(4) file can be used to monitor important processes and restart them if they are
terminated.
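
As an illustration of the inittab approach, an entry using the respawn action asks init to restart
the process whenever it terminates. The entry below is a sketch: the id "mapp", the run levels, and
the path /opt/myapp/bin/myappd are hypothetical placeholders, and the fields follow the
id:rstate:action:process layout described in inittab(4).

```
# /etc/inittab fragment (hypothetical application entry)
# id  : rstate : action  : process
mapp:34:respawn:/opt/myapp/bin/myappd
```

After /etc/inittab is edited, running "init q" asks init to re-read the file. A process started
this way should normally stay in the foreground. If it detaches into the background, init may treat
it as terminated and respawn it repeatedly.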

Users of ServiceGuard can group together application services (individual HP-UX processes) as
"packages." Users can then configure ServiceGuard to monitor those processes, specifying how many
times they should be restarted locally on the same cluster node before they are failed over to an
adoptive node.
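
The restart-then-fail-over policy described above can be illustrated with a small supervisor loop.
This is a conceptual sketch only and does not use any ServiceGuard interface. The function name
supervise and its return convention are hypothetical, and "failover" is represented simply by
returning 1 to the caller.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs argv[0] and restarts it locally after an abnormal termination,
 * up to max_restarts times.
 * Returns 0 if the process eventually exits with status 0,
 *         1 if the restart budget is exhausted (the point at which a
 *           cluster manager would fail over to another node),
 *        -1 if fork() or waitpid() fails. */
int supervise(char *const argv[], int max_restarts)
{
    int restarts = 0;

    for (;;) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            execvp(argv[0], argv);
            _exit(127);             /* exec failed in the child */
        }

        int status;
        pid_t r;
        do {
            r = waitpid(pid, &status, 0);
        } while (r < 0 && errno == EINTR);   /* retry if interrupted */
        if (r < 0)
            return -1;

        if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
            return 0;               /* clean exit: nothing to restart */

        if (restarts >= max_restarts)
            return 1;               /* local restarts exhausted */
        restarts++;
    }
}
```

With max_restarts set to 2, for example, a process that keeps failing is started three times in
total before the function reports that local recovery has been exhausted.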

For more detailed ServiceGuard information, see the latest edition of Managing ServiceGuard at
http://www.docs.hp.com/en/oshpux11iv3.html#Serviceguard