Specifications
Server Failure Recovery Scenario
The following steps describe the recovery scenario, illustrated above, that occurs when LifeKeeper
marks all communications paths to a server DEAD.
1. LCM activates eventslcm. When LifeKeeper marks all communications paths dead, the LCM
initiates the eventslcm process.
Only one activity stops the eventslcm process:
• Communication path alive. If one of the communications paths begins sending the heartbeat
signal again, the LCM stops the eventslcm process.
It is critical that you configure two or more physically independent, redundant communication
paths between each pair of servers to prevent failovers and possible system panics due to
communication failures.
2. Message to sendevent. eventslcm sends the system failure alarm by calling sendevent with
the event type machfail.
3. sendevent initiates failover recovery. The sendevent program determines that LifeKeeper can
handle the system failure event and executes the LifeKeeper failover recovery process
lcdmachfail.
4. lcdmachfail checks. The lcdmachfail process first checks to ensure that the non-responding
server was not shut down. Failovers are inhibited if the other system was shut down gracefully
before system failure. Then lcdmachfail determines all resources that have a shared
equivalency with the failed system. This is the commit point for the recovery.
5. lcdmachfail restores resources. lcdmachfail determines all resources on the backup server
that have shared equivalencies with the failed primary server. It also determines whether the
backup server is the highest priority alive server for which a given resource is configured. All
backup servers perform this check, so that only one server will attempt to recover a given
hierarchy. For each equivalent resource that passes this check, lcdmachfail invokes the
associated restore program. Then, lcdmachfail also restores each resource dependent on a
restored resource, until it brings the entire hierarchy into service on the backup server.
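
The selection logic in step 5 can be sketched in a few lines of Python. This is only an illustrative model, not LifeKeeper code: the Resource class, the function names, the server names, and the convention that a lower number means a higher priority are all assumptions made for the example. It shows how each backup server could decide independently whether it is the highest priority alive server for a resource, and how restoring one resource leads to restoring the resources that depend on it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Resource:
    """Hypothetical model of a resource with a shared equivalency."""
    name: str
    # Server name -> priority. Assumption: a lower number is a higher priority.
    priorities: Dict[str, int]
    # Names of resources that depend on this one (illustrative model only).
    dependents: List[str] = field(default_factory=list)


def is_highest_priority_alive(res: Resource, local: str, alive: Set[str]) -> bool:
    """True if `local` is the best-ranked alive server configured for `res`."""
    candidates = [s for s in res.priorities if s in alive]
    if local not in candidates:
        return False
    return min(candidates, key=lambda s: res.priorities[s]) == local


def recover(resources: Dict[str, Resource], failed: str, local: str,
            alive: Set[str]) -> List[str]:
    """Return the order in which `local` would restore resources."""
    restored: List[str] = []

    def restore(name: str) -> None:
        if name in restored:
            return
        restored.append(name)  # stands in for invoking the restore program
        for dependent in resources[name].dependents:
            restore(dependent)  # bring up the rest of the hierarchy

    for res in resources.values():
        shared_with_failed = failed in res.priorities
        if shared_with_failed and is_highest_priority_alive(res, local, alive):
            restore(res.name)
    return restored


# Hypothetical three-node cluster: "primary" has failed, two backups remain.
prio = {"primary": 1, "backup1": 10, "backup2": 20}
resources = {
    "disk": Resource("disk", dict(prio), dependents=["fs"]),
    "fs": Resource("fs", dict(prio), dependents=["app"]),
    "app": Resource("app", dict(prio)),
}
alive = {"backup1", "backup2"}

print(recover(resources, "primary", "backup1", alive))  # ['disk', 'fs', 'app']
print(recover(resources, "primary", "backup2", alive))  # []
```

Because every backup server runs the same check against the same priority data, only backup1 attempts recovery in this example, which mirrors the statement that only one server will attempt to recover a given hierarchy.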