System information
Appendix
B.8 Proactive Data Protection
The most fundamental requirement for a storage system is to protect data from all kinds of failures. The RAID controller
firmware supports versatile RAID configurations for different levels of reliability requirements, including RAID 6 to tolerate
double-drive failure, and Triple Parity for extreme data availability. It provides online utilities for proactive data protection to
monitor disk health, minimize the risk of data loss, and avoid RAID degradation. RAID configurations can be recovered and
imported even if the RAID is corrupted.
• Online disk scrubbing
Bad sectors on a hard disk can be detected only when they are accessed, so they may go undetected for a long time if
the disk access pattern is uneven and the sectors lie in seldom-accessed areas. During a disk rebuild, all data on
the surviving hard disks is needed to regenerate the data of the failed disk; if the surviving disks contain bad sectors,
the affected data cannot be regenerated and is lost permanently. As the number of sectors per disk increases, this becomes
an increasingly common issue for any disk-based storage system. The firmware provides an online disk scrubbing utility that
tests the entire disk surface in a background task and recovers any bad sectors it detects.
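The scrubbing pass described above can be pictured with the following minimal Python sketch. It is an illustration only, not the firmware's implementation: the callbacks `read_block`, `write_block`, and `regenerate_block` are hypothetical stand-ins for reading a sector, rewriting it, and regenerating its contents from the other disks in the RAID.

```python
def scrub_disk(read_block, write_block, regenerate_block, num_blocks):
    """Read every block once; repair any block that fails to read.

    All three callbacks are hypothetical:
      read_block(lba)        -> bytes, raises OSError on a bad sector
      write_block(lba, data) -> rewrites the block on the disk
      regenerate_block(lba)  -> bytes rebuilt from the other RAID members
    """
    repaired = []
    for lba in range(num_blocks):
        try:
            read_block(lba)  # touching the block is what exposes a latent bad sector
        except OSError:
            # Rebuild the lost contents from redundancy and write them back.
            write_block(lba, regenerate_block(lba))
            repaired.append(lba)
    return repaired
```

Because every block is read regardless of the host access pattern, bad sectors in seldom-accessed areas are found while the redundancy needed to repair them is still intact.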
• Online parity consistency check and recovery
The ability to protect data in parity-based RAID relies on the correctness of the parity information. Under certain conditions
parity consistency can be corrupted, such as internal hard drive errors or an abnormal system power-off while
the hard drive cache is enabled. To ensure higher data reliability, the administrator can instruct the controller to perform a
parity check and recovery during disk scrubbing.
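To illustrate what a parity consistency check involves, the Python sketch below assumes single XOR parity (as in RAID 5). The firmware's actual algorithm is not described in this manual, and the function names `xor_blocks`, `check_stripe`, and `repair_parity` are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def check_stripe(data_blocks, parity_block):
    """Return True when the stored parity matches the parity computed from the data."""
    return xor_blocks(data_blocks) == parity_block

def repair_parity(data_blocks):
    """Recompute parity from the data blocks (assumes the data blocks are correct)."""
    return xor_blocks(data_blocks)
```

A stripe whose stored parity does not match the computed value is inconsistent; under the assumption that the data blocks are good, recovery consists of writing the recomputed parity back.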
• S.M.A.R.T. drive health monitoring and self-test
S.M.A.R.T. stands for Self-Monitoring, Analysis and Reporting Technology, by which a hard disk continuously monitors its
key components and collects statistics as indicators of its health. The controller polls the hard disks periodically, and
when a disk reports a warning, it alerts the administrator and starts disk cloning. The firmware can also instruct the
disk drives to execute their embedded device self-test routines, which helps users identify
defective disk drives.
• Online bad sector reallocation and recovery with over-threshold alert
Hard disks tend to develop more bad sectors the longer they are in service. When a host computer accesses a bad sector,
the controller rebuilds the data and responds to the host. In addition to the drives' own built-in bad block reallocation,
the controller uses reserved space on the hard disks to reallocate the data of bad sectors. If the number of bad sectors
exceeds the threshold specified by the administrator, an alert is sent to the administrator and disk cloning
starts automatically.
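The over-threshold behavior can be sketched as follows. This is a simplified Python model under stated assumptions, not the controller's code: the class `BadSectorMonitor`, its fields, and the alert text are all hypothetical, and "over the threshold" is taken to mean strictly greater than the configured value.

```python
class BadSectorMonitor:
    """Track reallocated bad sectors on one disk and react when a threshold is exceeded."""

    def __init__(self, threshold):
        self.threshold = threshold      # limit set by the administrator
        self.reallocated = {}           # bad LBA -> LBA in the reserved area
        self.alerts = []                # stand-in for notifications to the administrator
        self.cloning_started = False    # stand-in for starting disk cloning

    def record_bad_sector(self, bad_lba, spare_lba):
        """Remember where the data of a bad sector now lives, then check the threshold."""
        self.reallocated[bad_lba] = spare_lba
        if len(self.reallocated) > self.threshold and not self.cloning_started:
            self.alerts.append(
                f"bad sector count {len(self.reallocated)} exceeds threshold {self.threshold}"
            )
            self.cloning_started = True
```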
• Online SMART disk cloning
When a hard disk in a disk group fails, the RAID enters the degraded state, which means lower performance and a higher risk of
data loss or RAID corruption. When a hard disk is likely to become faulty, for example when its number of bad sectors
exceeds a threshold or it reports a SMART warning, the controller copies all data on that disk to a spare
disk online. Moreover, should the source disk fail during cloning, the controller starts rebuilding onto the clone target disk, and the
rebuild skips the sectors that have already been cloned. Disk cloning has proven to be one of the most effective
solutions to prevent RAID degradation.
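The hand-over from cloning to rebuilding can be sketched in Python as below. The sketch assumes disks are modeled as lists of blocks and that cloning proceeds in ascending block order; the function `clone_with_fallback`, the `rebuild_block` callback, and the `fail_at` parameter used to simulate a source failure are hypothetical.

```python
def clone_with_fallback(source, spare, rebuild_block, fail_at=None):
    """Copy source to spare; if the source fails part-way, rebuild only the remainder.

    source, spare     -- lists of blocks of equal length (hypothetical disk model)
    rebuild_block(i)  -- hypothetical callback regenerating block i from the other disks
    fail_at           -- block index at which the source disk fails, or None
    Returns (blocks_cloned, blocks_rebuilt).
    """
    cloned = 0
    for i in range(len(source)):
        if fail_at is not None and i == fail_at:
            break  # source disk is gone; stop cloning here
        spare[i] = source[i]
        cloned += 1

    rebuilt = 0
    # Blocks below `cloned` are already valid on the spare, so the rebuild skips them.
    for i in range(cloned, len(spare)):
        spare[i] = rebuild_block(i)
        rebuilt += 1
    return cloned, rebuilt
```

Skipping the already-cloned range shortens the time the RAID spends without full redundancy compared with rebuilding the whole disk from scratch.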
• Transaction log and auto parity recovery
The ability to rebuild data in parity-based data protection relies on the consistency of parity and data. However,
consistency can be lost if the system is shut down improperly while write commands are incomplete. To
maintain consistency, the controller logs write commands in NVRAM, and when the controller restarts,
the parity affected by the incomplete writes is recovered automatically.
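The idea of logging writes so that parity can be recovered after a restart can be sketched as follows. This Python model assumes single XOR parity and records only the IDs of stripes with writes in flight; the real log format is not documented here, and `LoggedArray`, `nvram_log`, and the `power_fails` flag used to simulate an improper shutdown are hypothetical.

```python
from functools import reduce

def xor_parity(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

class LoggedArray:
    def __init__(self, stripes):
        self.stripes = stripes    # each stripe: {"data": [bytes, ...], "parity": bytes}
        self.nvram_log = set()    # stand-in for NVRAM: stripes with incomplete writes

    def write(self, stripe_id, index, block, power_fails=False):
        self.nvram_log.add(stripe_id)            # 1. log the write before touching disks
        stripe = self.stripes[stripe_id]
        stripe["data"][index] = block            # 2. write the data block
        if power_fails:
            return                               # simulated crash: parity is now stale
        stripe["parity"] = xor_parity(stripe["data"])  # 3. update the parity
        self.nvram_log.discard(stripe_id)        # 4. write complete, clear the log entry

    def recover(self):
        """On restart, recompute parity for every stripe still present in the log."""
        recovered = sorted(self.nvram_log)
        for stripe_id in recovered:
            stripe = self.stripes[stripe_id]
            stripe["parity"] = xor_parity(stripe["data"])
        self.nvram_log.clear()
        return recovered
```

Only the stripes named in the log need attention after a restart, so recovery does not require a full parity check of the array.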
• Battery backup protection
The controller delays writes to the disk drives and caches the data in memory to optimize performance, but this also
creates risk because the data in the cache is lost if the system is not powered off properly. As the installed cache
memory grows, such a loss could be an unrecoverable disaster for applications. The battery backup
module retains the data in the cache memory during abnormal power loss, and when the system restarts, the data in the
cache memory is flushed to the disk drives.