VERITAS Volume Manager 3.5 Troubleshooting Guide (September 2004)
Chapter 1: Recovery from Hardware Failure
Failures on RAID-5 Volumes
Failures come in two varieties: system failures and disk failures. A system failure
means that the system has abruptly ceased to operate due to an operating system panic
or power failure. A disk failure means that the data on one or more disks has become
unavailable due to a hardware fault (such as a head crash, an electronics failure on the
disk, or a disk controller failure).
System Failures
RAID-5 volumes are designed to remain available when disks fail, with a minimum of
disk space overhead. However, many forms of RAID-5 can lose data after a system
failure. Data loss occurs because a system failure can cause the data and parity in the
RAID-5 volume to become unsynchronized. Synchronization is lost because the status of
writes that were outstanding at the time of the failure cannot be determined.
If a loss of synchronization occurs while a RAID-5 volume is being accessed, the volume is described
as having stale parity. The parity must then be reconstructed by reading all the
non-parity columns within each stripe, recalculating the parity, and writing out the
parity stripe unit in the stripe. This must be done for every stripe in the volume, so it can
take a long time to complete.
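The reconstruction described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming XOR parity and an in-memory model in which each stripe is a dictionary with a list of data stripe units and one parity stripe unit. The names xor_blocks and resync_parity are hypothetical and are not part of Volume Manager.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def resync_parity(stripes):
    """Recalculate parity for every stripe (hypothetical in-memory model).

    Each stripe is a dict with 'data' (list of bytes) and 'parity' (bytes).
    Returns the number of stripes whose parity had to be rewritten.
    """
    rewritten = 0
    for stripe in stripes:
        # Read all non-parity columns and recalculate the parity.
        fresh = xor_blocks(stripe["data"])
        if stripe["parity"] != fresh:
            # Write out the parity stripe unit for this stripe.
            stripe["parity"] = fresh
            rewritten += 1
    return rewritten


stripes = [
    {"data": [b"\x01\x02", b"\x04\x08"], "parity": b"\x05\x0a"},  # in sync
    {"data": [b"\xff\x00", b"\x0f\x0f"], "parity": b"\x00\x00"},  # stale
]
rewritten_count = resync_parity(stripes)
```

Because the loop visits every stripe and reads every data column, the cost grows with the size of the whole volume, not with the amount of data that was being written when the system failed.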
CAUTION While a RAID-5 volume without log plexes is being resynchronized,
the failure of any disk within the volume causes data in the volume to be lost.
Besides the vulnerability to failure, the resynchronization process can tax the system
resources and slow down system operation.
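The following Python sketch illustrates why a disk failure during resynchronization loses data. It assumes XOR parity and a single stripe with three data columns. The byte values and the function names are made up for illustration and do not correspond to any Volume Manager interface.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_data, parity):
    """Rebuild the single missing data block from the survivors and parity."""
    return xor_blocks(surviving_data + [parity])


def stale_parity_demo():
    # One stripe with three data columns; parity starts out in sync.
    data = [b"\x11", b"\x22", b"\x44"]
    parity = xor_blocks(data)

    # A system failure interrupts a write: column 0 reached the disk,
    # but the matching parity update did not. Parity is now stale.
    data[0] = b"\x99"

    # Before parity is resynchronized, the disk holding column 2 fails.
    original = data[2]
    rebuilt = rebuild_missing([data[0], data[1]], parity)
    return original, rebuilt


original, rebuilt = stale_parity_demo()
```

In this sketch the rebuilt block differs from the original contents of column 2, even though column 2 was not involved in the interrupted write.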
RAID-5 logs reduce the damage that can be caused by system failures, because they
maintain a copy of the data being written at the time of the failure. The process of
resynchronization consists of reading that data and parity from the logs and writing it to
the appropriate areas of the RAID-5 volume. This greatly reduces the amount of time
needed for a resynchronization of data and parity. It also means that the volume never
becomes truly stale. The data and parity for all stripes in the volume are known at all
times, so the failure of a single disk cannot result in the loss of the data within the
volume.
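As a rough sketch of log-based resynchronization, the Python below replays logged data and parity onto the affected stripes only. The log entry layout (a stripe index, a mapping of column index to data, and the parity) and the name replay_log are assumptions made for this example. The actual on-disk format of a RAID-5 log is not described here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def replay_log(stripes, log):
    """Write logged data and parity back to the volume (hypothetical model).

    stripes: list of dicts with 'data' (list of bytes) and 'parity' (bytes).
    log: list of dicts with 'stripe' (int), 'data' ({column: bytes}),
         and 'parity' (bytes) for writes outstanding at the failure.
    Returns the number of log entries replayed.
    """
    for entry in log:
        stripe = stripes[entry["stripe"]]
        for column, block in entry["data"].items():
            stripe["data"][column] = block
        stripe["parity"] = entry["parity"]
    return len(log)


# Stripe 0 was being written when the system failed: the new data for
# column 0 reached the disk, but the parity on disk is still the old value.
stripes = [
    {"data": [b"\x09", b"\x02"], "parity": b"\x03"},
    {"data": [b"\x10", b"\x20"], "parity": b"\x30"},
]
log = [{"stripe": 0, "data": {0: b"\x09"}, "parity": b"\x0b"}]
replayed = replay_log(stripes, log)
```

Only the stripes named in the log are touched, which is why this takes far less time than recalculating parity across the whole volume. Replaying the same entries a second time leaves the volume unchanged.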
Disk Failures
Disk failures can cause the data on a disk to become unavailable. In terms of a RAID-5
volume, this means that a subdisk becomes unavailable.
This can occur in two ways. An uncorrectable I/O error during a write to the disk can
cause the subdisk to be detached from the array, or a disk can be unavailable when
the system is booted (for example, because of a cabling problem or a drive that is
powered down).
When this occurs, the subdisk cannot be used to hold data and is considered stale and
detached. If the underlying disk becomes available or is replaced, the subdisk is still
considered stale and is not used.
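The state rules in the preceding paragraphs can be modeled with a small Python class. This is only an illustrative sketch: the Subdisk class, its fields, and its methods are hypothetical and do not reflect how Volume Manager stores subdisk state.

```python
from dataclasses import dataclass


@dataclass
class Subdisk:
    """Toy model of the subdisk state described above (hypothetical)."""
    name: str
    stale: bool = False
    detached: bool = False
    disk_available: bool = True

    def detach(self, disk_available=True):
        # An uncorrectable write error, or a disk missing at boot,
        # leaves the subdisk stale and detached.
        self.stale = True
        self.detached = True
        self.disk_available = disk_available

    def disk_returned(self):
        # The disk came back or was replaced. The stale and detached
        # flags are intentionally left unchanged.
        self.disk_available = True

    def usable(self):
        return self.disk_available and not (self.stale or self.detached)


sd = Subdisk("sd01")
sd.detach(disk_available=False)  # for example, drive powered down at boot
sd.disk_returned()               # drive powered back on
```

After disk_returned() is called, sd.usable() still returns False, which mirrors the statement that a subdisk remains stale and unused even when its underlying disk is available again.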