Specifications
Storage and Adapter Configuration
Heavy I/O in Multipath Configurations
In multipath configurations, performing heavy I/O while paths are being manipulated can cause a system to become temporarily unresponsive. When the multipath software moves access to a LUN from one path to another, it must also move any outstanding I/Os to the new path. Rerouting the I/Os delays their response times. If additional I/Os continue to be issued during this time, they are queued in the system and can exhaust the memory available to all processes. Under very heavy I/O loads, these delays and low-memory conditions can leave the system unresponsive long enough that LifeKeeper detects the server as down and initiates a failover.
There are many factors that will affect the frequency at which this issue may be seen:

- The speed of the processor will affect how fast I/Os can be queued. A faster processor may cause the failure to be seen more frequently.
- The amount of system memory will affect how many I/Os can be queued before the system becomes unresponsive. A system with more memory may cause the failure to be seen less frequently.
- The number of LUNs in use will affect the amount of I/O that can be queued.
- The characteristics of the I/O activity will affect the volume of I/O queued. In test cases where the problem has been seen, the test was writing an unlimited amount of data to the disk. Most applications will both read and write data. As the reads are blocked waiting on the failover, writes will also be throttled, decreasing the I/O rate such that the failure may be seen less frequently.
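The interaction of these factors can be illustrated with a rough back-of-envelope sketch (not from the product documentation) using figures from the DS4000 test described in this section: 190 MB per second of write throughput, 2 GB of memory, and the default detection window of roughly 15 seconds.

```shell
# Illustrative estimate only: how much write data could queue while
# paths are being moved, before LifeKeeper's default detection window
# elapses. Figures are taken from the DS4000 test in this section.
RATE_MB_S=190    # sustained write throughput in the test
STALL_S=15       # default ~15 s failure-detection window
MEM_MB=2048      # 2 GB of system memory per server
QUEUED_MB=$((RATE_MB_S * STALL_S))
echo "~${QUEUED_MB} MB could queue vs ${MEM_MB} MB of RAM"
```

At these rates the queued writes alone (~2850 MB) exceed the servers' physical memory, which is consistent with the low-memory, unresponsive behavior described above.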
For example, during testing of the IBM DS4000 multipath configuration with RDAC, when the I/O throughput to the DS4000 was greater than 190 MB per second and path failures were simulated, LifeKeeper would (falsely) detect a failed server approximately one time out of twelve. The servers used in this test were IBM x345 servers with dual Xeon 2.8 GHz processors and 2 GB of memory, connected to a DS4400 with 8 volumes (LUNs) per server in use. To avoid the failovers, the LifeKeeper parameter LCMNUMHBEATS (in /etc/default/LifeKeeper) was increased to 16. With this change, LifeKeeper waits approximately 80 seconds before determining that an unresponsive system is dead, rather than the default wait of approximately 15 seconds.
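The arithmetic behind the tuning can be sketched as follows. The per-heartbeat interval is not stated in this section; the ~5-second value below is inferred from the stated figures (16 beats ≈ 80 seconds, and the default ≈ 15 seconds implies about 3 beats).

```shell
# Hypothetical sketch of the LCMNUMHBEATS tuning described above.
# The heartbeat interval is an inference, not a documented value;
# back up /etc/default/LifeKeeper before making any change.
CONF=/etc/default/LifeKeeper
LCMNUMHBEATS=16        # raised from the default (~3 beats ~= 15 s)
HBEAT_INTERVAL_S=5     # inferred per-heartbeat interval
echo "detection wait ~$((LCMNUMHBEATS * HBEAT_INTERVAL_S)) s"
# To persist the change (sketch, assuming the key already exists):
#   sed -i 's/^LCMNUMHBEATS=.*/LCMNUMHBEATS=16/' "$CONF"
```

Raising the heartbeat count trades faster failure detection for tolerance of transient unresponsiveness; the 80-second wait in the test was long enough to ride out the multipath I/O stalls without a false failover.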
SteelEye Protection Suite for Linux