User guide

Recovering from Drive Failure
A replacement drive is not considered “online” until Automatic Data Recovery
is completed, at which time the online LED stops blinking and stays on
“solid.” When determining whether fault tolerance will be compromised, any
drive that is not yet “online” is treated as if it has “failed.” For example,
in a RAID 5 logical drive with no spare and one drive rebuilding, another
drive failure at this time would result in a “failure” condition for the
entire logical drive.
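The rule above can be sketched in a few lines of Python. This is only an
illustration, not controller firmware: the names DriveState,
TOLERATED_LOSSES, and logical_drive_status, and the "ok" and "degraded"
status labels, are hypothetical. The sketch covers only RAID 0 and RAID 5 and
counts a rebuilding drive exactly as a failed one.

```python
from enum import Enum


class DriveState(Enum):
    ONLINE = "online"
    REBUILDING = "rebuilding"  # replacement drive, recovery not yet finished
    FAILED = "failed"


# Number of non-online drives each RAID level can tolerate.
# Only the levels needed for this illustration are listed.
TOLERATED_LOSSES = {0: 0, 5: 1}


def logical_drive_status(raid_level, states):
    """Return "ok", "degraded", or "failed" for a logical drive.

    A drive that is still rebuilding counts against fault tolerance
    exactly as a failed drive does.
    """
    not_online = sum(1 for s in states if s is not DriveState.ONLINE)
    if not_online == 0:
        return "ok"
    if not_online <= TOLERATED_LOSSES[raid_level]:
        return "degraded"
    return "failed"


# RAID 5, no spare, one drive rebuilding: still readable, no margin left.
rebuilding = [DriveState.ONLINE, DriveState.ONLINE, DriveState.REBUILDING]
print(logical_drive_status(5, rebuilding))

# A second drive fails during the rebuild: the logical drive fails.
second_failure = [DriveState.ONLINE, DriveState.FAILED, DriveState.REBUILDING]
print(logical_drive_status(5, second_failure))
```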
In general, a rebuild takes approximately 15 minutes per GB. The actual
rebuild time, however, depends on the Rebuild Priority setting, the amount of
I/O activity occurring during the rebuild operation, the number of drives in
the array (RAID 5), and the disk drive speed.
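As a rough planning aid, the 15 minutes per GB figure can be turned into an
estimate. The sketch below assumes the figure applies to the capacity of the
drive being rebuilt and ignores Rebuild Priority, I/O load, array size, and
drive speed, so treat the result as a ballpark only. The function names are
hypothetical.

```python
MINUTES_PER_GB = 15  # rule-of-thumb figure quoted in this guide


def estimate_rebuild_minutes(capacity_gb):
    """Estimate rebuild time in minutes for a drive of the given capacity."""
    if capacity_gb < 0:
        raise ValueError("capacity_gb must not be negative")
    return capacity_gb * MINUTES_PER_GB


def format_duration(minutes):
    """Format a whole number of minutes as hours and minutes."""
    hours, mins = divmod(int(minutes), 60)
    return f"{hours} h {mins:02d} min"


# An 18 GB drive: 18 * 15 = 270 minutes.
print(format_duration(estimate_rebuild_minutes(18)))  # 4 h 30 min
```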
During Automatic Data Recovery, if the online LED of the replacement drive
stops blinking and all other drives in the array are still online, the
Automatic Data Recovery process may have terminated abnormally because of a
non-correctable read error from another physical drive during the recovery
process. The background Auto-Reliability Monitoring process is meant to help
prevent this problem, but it cannot do anything about certain issues, such as
SCSI bus signal integrity problems. Reboot the system; a POST message should
confirm the diagnosis. Retrying Automatic Data Recovery may help. If it does
not, the recommended course of action is to back up all data on the system,
run a surface analysis (using User Diagnostics), and restore the data.
During Automatic Data Recovery, if the online LED of the replacement drive
stops blinking and the replacement drive has failed (the amber failure LED is
illuminated or the other LEDs go out), the replacement drive is producing
unrecoverable disk errors. In this case, remove the replacement drive and
install another one.
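The two stalled-recovery cases described above can be summarized as a small
decision function. This is a sketch for illustration only. The function name,
parameter names, and result strings are hypothetical, and treating "stops
blinking" as an LED that is off (rather than on solid) is an assumption made
here, not something the controller reports in this form.

```python
def diagnose_recovery(online_led, replacement_failed, others_online):
    """Classify the state of Automatic Data Recovery on a replacement drive.

    online_led: "blinking", "solid", or "off" (assumed encoding).
    replacement_failed: True if the amber failure LED is lit or the
        other LEDs on the replacement drive have gone out.
    others_online: True if every other drive in the array is online.
    """
    if online_led not in ("blinking", "solid", "off"):
        raise ValueError(f"unknown LED state: {online_led!r}")
    if online_led == "blinking":
        return "in_progress"
    if online_led == "solid":
        return "complete"
    # The LED stopped blinking before recovery completed.
    if replacement_failed:
        # The replacement drive itself is producing unrecoverable errors.
        return "replace_replacement_drive"
    if others_online:
        # Likely a non-correctable read error from another physical drive:
        # reboot, check the POST message, then retry the recovery.
        return "reboot_and_retry"
    return "not_covered"


print(diagnose_recovery("off", replacement_failed=True, others_online=True))
print(diagnose_recovery("off", replacement_failed=False, others_online=True))
```

Any combination that the text does not describe falls through to
"not_covered" rather than guessing at a cause.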
If fault tolerance is ever compromised due to failure of multiple drives, the
condition of the logical drive will be “failed” and “unrecoverable” errors
will be returned to the host. Data loss is probable. Inserting replacement
drives at this time will not improve the condition of the logical drive. If
this occurs, first try turning the entire system off and then on again. In
some cases an intermittent drive will appear to work again (perhaps long
enough to make copies of important files) after cycling power. If a 1779 POST
message is displayed, press F2 to re-enable the logical drive(s). Remember
that data loss has likely occurred and any data on the logical drive is
suspect.