VERITAS Volume Manager 3.5 Troubleshooting Guide (August 2002)

Failures on RAID-5 Volumes

Failures come in two varieties: system failures and disk failures. A system failure means
that the system has abruptly ceased to operate due to an operating system panic or power
failure. A disk failure means that the data on some number of disks has become unavailable
due to a hardware failure (such as a head crash, electronics failure on disk, or disk
controller failure).

System Failures

RAID-5 volumes are designed to remain available, with a minimum of disk space overhead,
when disks fail. However, many forms of RAID-5 can lose data after a system failure. Data
loss occurs because a system failure causes the data and parity in the RAID-5 volume to
become unsynchronized. Loss of synchronization occurs because the status of writes that
were outstanding at the time of the failure cannot be determined.
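As a rough illustration of how an interrupted write leaves data and parity unsynchronized, the following Python sketch models one stripe as three data units plus a parity unit computed by byte-wise XOR. The helper `xor_parity` and the toy byte values are assumptions made for this example only; they do not represent Volume Manager internals.

```python
from functools import reduce

def xor_parity(units):
    """Return the byte-wise XOR of equal-length stripe units."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))

# One stripe: three data units and one parity unit (toy sizes).
data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_parity(data)
in_sync_before = xor_parity(data) == parity

# A write replaces the first data unit, but the system fails before the
# matching parity update reaches disk.
data[0] = b"\xaa\xbb"
in_sync_after = xor_parity(data) == parity
```

In this sketch `in_sync_before` is true and `in_sync_after` is false. After a restart, nothing on disk records which of the two units was written last, which is why the outcome of the outstanding write cannot be determined.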

If a loss of synchronization occurs while a RAID-5 volume is being accessed, the volume is
described as having stale parity. The parity must then be reconstructed by reading all the
non-parity columns within each stripe, recalculating the parity, and writing out the parity
stripe unit in the stripe. This must be done for every stripe in the volume, so it can take
a long time to complete.
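The reconstruction described above can be sketched in Python as follows. This is a minimal model, assuming XOR parity and an in-memory list of stripes; the names `resync_parity` and `xor_parity` and the dictionary layout are hypothetical and are not part of any Volume Manager interface.

```python
from functools import reduce

def xor_parity(units):
    """Return the byte-wise XOR of equal-length stripe units."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))

def resync_parity(stripes):
    """Rewrite the parity of every stripe; return how many were stale."""
    stale = 0
    for stripe in stripes:
        fresh = xor_parity(stripe["data"])  # read all non-parity columns
        if fresh != stripe["parity"]:
            stale += 1
        stripe["parity"] = fresh            # write out the parity stripe unit
    return stale

stripes = [
    {"data": [b"\x01", b"\x02", b"\x04"], "parity": b"\x07"},  # in sync
    {"data": [b"\x08", b"\x10", b"\x20"], "parity": b"\x00"},  # stale
]
stale_count = resync_parity(stripes)
```

The loop visits every stripe whether or not its parity was actually stale, because without a log there is no record of which stripes had writes in flight. That full scan is the reason the operation scales with the size of the volume.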

Caution: While the resynchronization of a RAID-5 volume without log plexes is being
performed, any failure of a disk within the volume causes the data in the volume to be lost.

Besides the vulnerability to failure, the resynchronization process can tax the system
resources and slow down system operation.
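To make the vulnerability in the caution concrete, the sketch below rebuilds a failed data unit from the surviving units and the parity, first with parity in sync and then with stale parity. It assumes XOR parity and toy byte values; `rebuild_missing` is a hypothetical helper written for this example.

```python
from functools import reduce

def xor_parity(units):
    """Return the byte-wise XOR of equal-length stripe units."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))

def rebuild_missing(surviving_data, parity):
    """Recover the one missing data unit from the survivors and the parity."""
    return xor_parity(surviving_data + [parity])

original = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
good_parity = xor_parity(original)

# Healthy case: the third column fails while parity is in sync.
recovered_ok = rebuild_missing(original[:2], good_parity)

# Stale case: the first column was rewritten but parity was not,
# and then the third column fails.
current = [b"\xaa\xbb", original[1], original[2]]
recovered_stale = rebuild_missing(current[:2], good_parity)
```

Here `recovered_ok` equals the lost unit, while `recovered_stale` does not. The rebuild completes without any error in both cases, so stale parity yields wrong data silently.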

RAID-5 logs reduce the damage that can be caused by system failures, because they
maintain a copy of the data being written at the time of the failure. The process of
resynchronization consists of reading that data and parity from the logs and writing it to
the appropriate areas of the RAID-5 volume. This greatly reduces the amount of time
needed for a resynchronization of data and parity. It also means that the volume never
becomes truly stale. The data and parity for all stripes in the volume are known at all
times, so the failure of a single disk cannot result in the loss of the data within the volume.
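The idea behind log-based recovery can be sketched as a toy write-ahead log. The class below is illustrative only: `LoggedStripe`, its methods, and the `crash_after_data` flag are hypothetical names, and the in-memory log is an assumption that does not reflect the actual on-disk format of a RAID-5 log plex.

```python
from functools import reduce

def xor_parity(units):
    """Return the byte-wise XOR of equal-length stripe units."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*units))

class LoggedStripe:
    """Toy stripe that logs new data and parity before updating in place."""

    def __init__(self, data):
        self.data = list(data)
        self.parity = xor_parity(self.data)
        self.log = None  # (new_data, new_parity) for the write in flight

    def write(self, index, unit, crash_after_data=False):
        new_data = self.data.copy()
        new_data[index] = unit
        # Record the intended data and parity in the log first.
        self.log = (new_data, xor_parity(new_data))
        self.data[index] = unit
        if crash_after_data:
            return  # simulated system failure: parity is never written
        self.parity = self.log[1]
        self.log = None

    def recover(self):
        """Replay the logged write, if any, instead of scanning every stripe."""
        if self.log is not None:
            self.data, self.parity = list(self.log[0]), self.log[1]
            self.log = None

stripe = LoggedStripe([b"\x01\x02", b"\x10\x20", b"\x0f\xf0"])
stripe.write(0, b"\xaa\xbb", crash_after_data=True)
stale_after_crash = xor_parity(stripe.data) != stripe.parity
stripe.recover()
in_sync_after_recover = xor_parity(stripe.data) == stripe.parity
```

Recovery in this sketch touches only the stripe that had a write in flight, which mirrors why replaying from a log takes far less time than recalculating parity for the whole volume.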