Specifications
Resource Error Recovery Scenario
1. lkcheck runs. By default, the lkcheck process runs once every two minutes. When lkcheck
runs, it invokes the appropriate quickCheck script for each in-service resource on the system.
2. quickCheck script checks resource. The nature of the checks performed by the quickCheck
script is unique to each resource type. Typically, the script simply verifies that the resource is
available to perform its intended task by imitating a client of the resource and verifying that it
receives the expected response.
3. quickCheck script invokes sendevent. If the quickCheck script determines that the resource
is in a failed state, it initiates an event of the appropriate class and type by calling sendevent.
4. Recovery instruction search. The system event notification mechanism, sendevent, first
attempts to determine if the LCD has a resource and/or recovery for the event type or
component. To make this determination, the is_recoverable process scans the resource
hierarchy in LCD for a resource instance that corresponds to the event (in this example, the
filesys name).
The action in the next step depends upon whether the scan finds resource-level recovery
instructions:
l Not found. If resource recovery instructions are not found, is_recoverable returns to
sendevent and sendevent continues with basic event notification.
l Found. If the scan finds the resource, is_recoverable forks the recover process into the
background. The is_recoverable process returns and sendevent continues with basic event
notification, passing an advisory flag "-A" to the basic alarming event response scripts,
indicating that LifeKeeper is performing recovery.
5. Recover process initiated. Assuming that recovery continues, is_recoverable initiates the
recover process which first attempts local recovery.
6. Local recovery attempt. If the instance was found, the recover process attempts local
recovery by accessing the resource hierarchy in LCD to search the hierarchy tree for a
resource that knows how to respond to the event. For each resource type, it looks for a
recovery subdirectory containing a subdirectory named for the event class, which in turn
contains a recovery script for the event type.
The recover process runs the recovery script associated with the resource that is farthest
above the failing resource in the resource hierarchy. If the recovery script succeeds, recovery
halts. If the script fails, recover runs the script associated with the next resource, continuing
until a recovery script succeeds or until recover attempts the recovery script associated with
the failed instance.
If local recovery succeeds, the recovery process halts.
7. Inter-server recovery begins. If local recovery fails, the event then escalates to inter-server
recovery.
8. Recovery continues. Since local recovery fails, the recover process marks the failed instance
to the Out-of-Service-FAILED (OSF) state and marks all resources that depend upon the failed
resource to the Out-of-Service-UNIMPAIRED (OSU) state. The recover process then
58SteelEye LifeKeeper for Linux