Specifications
Storage and Adapter Configuration
Heavy I/O in Multipath Configurations
In multipath configurations, performing heavy I/O while paths are being manipulated can cause a system to become temporarily unresponsive. When the multipath software moves access to a LUN from one path to another, it must also move any outstanding I/Os to the new path. Rerouting the I/Os delays their response times. If additional I/Os continue to be issued during this time, they are queued in the system and can exhaust the memory available to all processes. Under very heavy I/O loads, these delays and low-memory conditions can leave the system unresponsive long enough that LifeKeeper detects the server as down and initiates a failover.
There are many factors that will affect the frequency at which this issue may be seen:

- The speed of the processor will affect how fast I/Os can be queued. A faster processor may cause the failure to be seen more frequently.
- The amount of system memory will affect how many I/Os can be queued before the system becomes unresponsive. A system with more memory may cause the failure to be seen less frequently.
- The number of LUNs in use will affect the amount of I/O that can be queued.
- The characteristics of the I/O activity will affect the volume of I/O queued. In test cases where the problem has been seen, the test was writing an unlimited amount of data to the disk. Most applications will both read and write data. As the reads are blocked waiting on the failover, writes will also be throttled, decreasing the I/O rate such that the failure may be seen less frequently.
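The interaction of these factors can be illustrated with a rough back-of-envelope sketch (not from the product documentation) using figures from the DS4000 test described in this section: 190 MB per second of write throughput, 2 GB of memory, and the default detection window of roughly 15 seconds.

```shell
# Illustrative estimate only: how much write data could queue while
# paths are being moved, before LifeKeeper's default detection window
# elapses. Figures are taken from the DS4000 test in this section.
RATE_MB_S=190    # sustained write throughput in the test
STALL_S=15       # default ~15 s failure-detection window
MEM_MB=2048      # 2 GB of system memory per server
QUEUED_MB=$((RATE_MB_S * STALL_S))
echo "~${QUEUED_MB} MB could queue vs ${MEM_MB} MB of RAM"
```

At these rates the queued writes alone (~2850 MB) exceed the servers' physical memory, which is consistent with the low-memory, unresponsive behavior described above.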
For example, during testing of the IBM DS4000 multipath configuration with RDAC, when the I/O throughput to the DS4000 was greater than 190 MB per second and path failures were simulated, LifeKeeper would (falsely) detect a failed server approximately one time out of twelve. The servers used in this test were IBM x345 servers with dual Xeon 2.8 GHz processors and 2 GB of memory, connected to a DS4400 with 8 volumes (LUNs) per server in use. To avoid the failovers, the LifeKeeper parameter LCMNUMHBEATS (in /etc/default/LifeKeeper) was increased to 16. With this change, LifeKeeper waits approximately 80 seconds before determining that an unresponsive system is dead, rather than the default wait of approximately 15 seconds.
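The arithmetic behind the tuning can be sketched as follows. The per-heartbeat interval is not stated in this section; the ~5-second value below is inferred from the stated figures (16 beats ≈ 80 seconds, and the default ≈ 15 seconds implies about 3 beats).

```shell
# Hypothetical sketch of the LCMNUMHBEATS tuning described above.
# The heartbeat interval is an inference, not a documented value;
# back up /etc/default/LifeKeeper before making any change.
CONF=/etc/default/LifeKeeper
LCMNUMHBEATS=16        # raised from the default (~3 beats ~= 15 s)
HBEAT_INTERVAL_S=5     # inferred per-heartbeat interval
echo "detection wait ~$((LCMNUMHBEATS * HBEAT_INTERVAL_S)) s"
# To persist the change (sketch, assuming the key already exists):
#   sed -i 's/^LCMNUMHBEATS=.*/LCMNUMHBEATS=16/' "$CONF"
```

Raising the heartbeat count trades faster failure detection for tolerance of transient unresponsiveness; the 80-second wait in the test was long enough to ride out the multipath I/O stalls without a false failover.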
SteelEye Protection Suite for Linux