Specifications
(RS-232 TTY) console can experience severe problems with LifeKeeper service. During operation,
LifeKeeper generates console messages. If your configuration has a serial console (instead of the
standard VGA console), the entire data path from LifeKeeper to the end-user terminal must be
operational in order to ensure the delivery of these console messages.
If there is any break in the data path (for example, the terminal is powered off, the modem is
disconnected, or a cable is loose), the Linux STREAMS facility queues the console message. If the
STREAMS queue becomes full, the kernel suspends LifeKeeper until the queue again has room for more
messages. This scenario could cause LifeKeeper to hang.
Note: The use of serial consoles in a LifeKeeper environment is strongly discouraged and the use of
the VGA console is recommended. If you must use a serial console, be sure that your serial console
is turned on, the cables and optional modems are connected properly, and that messages are being
displayed.
Taking the System to init state S WARNING
When LifeKeeper is operational, the system must not be taken directly to init state S. Due to the
operation of the Linux init system, such a transition causes all the LifeKeeper processes to be killed
immediately and may precipitate a fastfail. Instead, you should either stop LifeKeeper manually
(using /etc/init.d/lifekeeper stop-nofailover) or take the system first to init state 1
followed by init state S.
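The two safe paths described above can be sketched as a small decision helper. This is only an
illustration in Python, not part of LifeKeeper: the function name and its parameters are
hypothetical, and the assumption that a direct transition is acceptable when LifeKeeper is not
running is inferred from the warning rather than stated by it.

```python
# Command taken from the text above; everything else here is illustrative.
LK_STOP = "/etc/init.d/lifekeeper stop-nofailover"


def commands_to_reach_state_s(lifekeeper_running, stop_lifekeeper_first=True):
    """Return the ordered commands for reaching init state S safely.

    Hypothetical helper: it only builds the list, it does not run anything.
    """
    if not lifekeeper_running:
        # Assumption: the warning applies only while LifeKeeper is operational.
        return ["init S"]
    if stop_lifekeeper_first:
        # Option 1: stop LifeKeeper manually, then change state.
        return [LK_STOP, "init S"]
    # Option 2: go through init state 1 before init state S.
    return ["init 1", "init S"]


if __name__ == "__main__":
    for command in commands_to_reach_state_s(lifekeeper_running=True):
        print(command)
```

In both branches for a running LifeKeeper, init S is never the first command issued, which is the
point of the warning.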
Thread is Hung Messages on Shared Storage
When the device-checking threads do not complete quickly enough, LifeKeeper may place messages in
its log stating that a thread is hung. This can cause resources to be moved from one server to
another and, in the worst case, cause a server to be killed.
Explanation
The FAILFASTTIMER (in /etc/default/LifeKeeper) defines the interval, in seconds, at which each
device is checked to ensure that it is functioning properly and that all resources owned by a
particular system are still accessible by, and owned by, that system. The FAILFASTTIMER needs
to be as small as possible to guarantee this ownership and to provide the highest data reliability.
However, if a device is busy, it may not be able to respond within the specified time at peak
loads. When a device takes longer than the FAILFASTTIMER, LifeKeeper considers that device
possibly hung. If a device has not responded after 3 loops of the FAILFASTTIMER period,
LifeKeeper attempts recovery as if the device had failed. The recovery process is defined by the
tunable SCSIERROR. Depending on the setting of SCSIERROR, the action can be a sendevent to perform
local recovery, followed by a switchover if that fails, or it can cause the system to halt.
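The timing rule above can be illustrated with a short Python sketch. It is not LifeKeeper code:
the function names are hypothetical, the NAME=value file layout is an assumption about
/etc/default/LifeKeeper, the sample value of 5 seconds is made up for the example, and the exact
handling of the boundary cases (exactly one timer period, exactly three) is also an assumption.

```python
def read_tunable(text, name):
    """Return the last NAME=value assignment for `name` in `text`, or None.

    Assumes shell-style NAME=value lines, with '#' starting a comment line.
    """
    value = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        if key.strip() == name:
            value = raw.strip().strip('"')
    return value


def classify_device(seconds_without_response, failfasttimer, hung_loops=3):
    """Classify a device by how long it has gone without responding."""
    if seconds_without_response <= failfasttimer:
        return "ok"
    if seconds_without_response < hung_loops * failfasttimer:
        # Longer than one timer period: logged as possibly hung.
        return "possibly hung"
    # No response after 3 loops: recovery, as governed by SCSIERROR.
    return "recover"


if __name__ == "__main__":
    sample = "# made-up sample content\nFAILFASTTIMER=5\n"
    timer = int(read_tunable(sample, "FAILFASTTIMER"))
    for elapsed in (4, 7, 15):
        print(elapsed, classify_device(elapsed, timer))
```

With the sample value, a device silent for 7 seconds is only flagged as possibly hung, while one
silent for 15 seconds (three timer periods) triggers recovery. This shows why a larger
FAILFASTTIMER reduces spurious hung messages at the cost of slower failure detection.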
Suggested Action:
In cases where a device infrequently has a hung message printed to the error log, followed by a
message that it is no longer hung, and the number in parentheses is always 1, there should be no