System information
Appendix
B.9 Fortified Reliability and Robustness
The mission of a RAID controller is to protect user data not only from disk drive failure but also from any other hazard that
might cause data loss or system downtime. Both the hardware and the firmware of the RAID controller incorporate advanced
mechanisms to fortify data reliability and ensure system robustness. These designs are derived from more than a decade of
field experience in all kinds of real-world environments, dealing with host computers, disk drives, and hardware components.
One of the best parts of the design is that administrators can use the online utilities provided by the firmware to solve
problems themselves without calling the vendor for service.
• Seasoned redundancy design
Storage system availability is achieved through a redundant design that eliminates single points of failure. The controller is
equipped with redundant flash chips, and the firmware uses advanced algorithms for error checking and bad block reallocation
to protect the controller from defective flash blocks and extend the lifetime of the controller. The firmware also stores two
copies of the RAID metadata and the bad block reallocation map on the disk drives to avoid any data or RAID loss resulting
from bad sectors.
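As a minimal illustration of why two metadata copies help, the sketch below seals a record with a CRC32 and falls back to the second copy when the first fails its check. The record layout, the use of CRC32, and the function names are assumptions made for this example; the firmware's actual on-disk format is not described here.

```python
import zlib
from typing import Optional


def seal(payload: bytes) -> bytes:
    """Append a CRC32 so a reader can detect a damaged copy."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unseal(record: bytes) -> Optional[bytes]:
    """Return the payload if its CRC32 matches, otherwise None."""
    if len(record) < 4:
        return None
    payload, stored = record[:-4], int.from_bytes(record[-4:], "big")
    return payload if zlib.crc32(payload) == stored else None


def read_metadata(primary: bytes, secondary: bytes) -> bytes:
    """Prefer the primary copy; fall back to the secondary if it is damaged."""
    for copy in (primary, secondary):
        payload = unseal(copy)
        if payload is not None:
            return payload
    raise IOError("both metadata copies are damaged")


# Hypothetical metadata record, for illustration only.
good = seal(b"raid5;disks=4;stripe=64K")
damaged = bytes([good[0] ^ 0xFF]) + good[1:]  # simulate a bad sector
recovered = read_metadata(damaged, good)
```

Only when both copies fail their checks is the metadata considered lost, which is the situation the second copy is meant to make unlikely.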
• Multi-path support
With support for host-side multi-path solutions such as Microsoft® MPIO, system continuity is maintained because the
storage system can tolerate failures on the IO path, such as failed host bus adapters, switches, or cables, by distributing IO
over multiple IO paths. Multi-path also improves performance through dynamic load balancing and simplifies the storage
presentation process.
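The idea of distributing IO over several paths and skipping a failed one can be sketched as a round-robin selector. This is a simplified model, not how Microsoft MPIO or any particular host driver is implemented; the class name, method names, and path names are hypothetical.

```python
class MultipathDevice:
    """Round-robin IO path selection that skips paths marked as failed."""

    def __init__(self, paths):
        self.healthy = {path: True for path in paths}
        self._next = 0  # index of the path to try first on the next IO

    def mark_failed(self, path):
        self.healthy[path] = False

    def mark_restored(self, path):
        self.healthy[path] = True

    def pick_path(self):
        """Return the next healthy path, rotating to balance the load."""
        paths = list(self.healthy)
        for offset in range(len(paths)):
            index = (self._next + offset) % len(paths)
            if self.healthy[paths[index]]:
                self._next = (index + 1) % len(paths)
                return paths[index]
        raise IOError("no healthy IO path to the storage system")


# Hypothetical path names, for illustration only.
device = MultipathDevice(["hba0", "hba1"])
balanced = [device.pick_path() for _ in range(3)]
device.mark_failed("hba0")  # e.g. a cable is pulled
after_failure = [device.pick_path() for _ in range(2)]
```

While both paths are healthy the IOs alternate between them; after one path fails, all IOs continue on the remaining path without interruption.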
• Active-active redundant controller support
The controller supports a dual active-active configuration to tolerate controller failure. The host IO access and background
tasks of a failed controller are taken over online by the surviving controller. When the failed controller is replaced with a
new controller, the system returns to optimal operation by redistributing the host IO access and background tasks back to
the original controller.
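The takeover and redistribution behavior can be modeled as each workload having a preferred controller, with the peer serving it only while the preferred one is offline. The sketch below is an illustrative model under that assumption; the class, the controller labels "A" and "B", and the unit names are hypothetical and do not reflect the firmware's internal design.

```python
class ControllerPair:
    """Model of a dual active-active pair with failover and failback."""

    def __init__(self, preferred):
        # preferred maps each logical unit to its preferred controller.
        self.preferred = dict(preferred)
        self.online = {"A": True, "B": True}

    def fail(self, controller):
        self.online[controller] = False

    def replace(self, controller):
        # A replacement controller comes online; ownership fails back
        # automatically because owner() always prefers it again.
        self.online[controller] = True

    def owner(self, unit):
        """Return the controller currently serving the given unit."""
        preferred = self.preferred[unit]
        if self.online[preferred]:
            return preferred
        peer = "B" if preferred == "A" else "A"
        if self.online[peer]:
            return peer
        raise RuntimeError("both controllers are offline")


# Hypothetical logical units, for illustration only.
pair = ControllerPair({"lun0": "A", "lun1": "B"})
pair.fail("A")
during_failure = {unit: pair.owner(unit) for unit in ("lun0", "lun1")}
pair.replace("A")
after_replace = {unit: pair.owner(unit) for unit in ("lun0", "lun1")}
```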
• UPS monitoring support
The firmware can monitor the attached UPS using the SMART UPS protocol through the RS232 ports. When AC power is
lost, the firmware performs a graceful shutdown to avoid data loss. The administrator can also configure the
UPS to set the shutdown and restart policies.
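A shutdown policy of this kind can be thought of as a small decision rule applied to the UPS status. The sketch below is only an illustration: the policy fields (a grace period and a minimum battery level), their default values, and the action names are assumptions, not the firmware's actual settings, and no UPS protocol traffic is modeled.

```python
from dataclasses import dataclass


@dataclass
class ShutdownPolicy:
    grace_seconds: int = 120        # hypothetical: how long to ride out an outage
    min_battery_percent: int = 30   # hypothetical: shut down at or below this level


def decide(on_battery: bool, seconds_on_battery: int,
           battery_percent: int, policy: ShutdownPolicy) -> str:
    """Return 'run', 'wait', or 'shutdown' for the current UPS status."""
    if not on_battery:
        return "run"  # AC power is present
    if battery_percent <= policy.min_battery_percent:
        return "shutdown"  # not enough charge left to keep waiting
    if seconds_on_battery >= policy.grace_seconds:
        return "shutdown"  # the outage has outlasted the grace period
    return "wait"


policy = ShutdownPolicy()
short_outage = decide(True, 30, 90, policy)
long_outage = decide(True, 180, 80, policy)
low_battery = decide(True, 10, 25, policy)
```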
• Online array roaming
When a storage system cannot be recovered quickly, the best way to bring the data on its disk drives back online is
array roaming: the disk drives are installed in another storage system, and the RAID configurations
are recovered instantly. The background tasks previously running on the disk drives are also resumed. With
online array roaming, the administrator can install the disk drives one by one while the system is running and import the
disk groups later. This avoids disrupting the running storage system and simplifies the roaming process.
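Installing drives one by one and importing later implies that the system can tell when a disk group has all of its members present. A minimal sketch of that check follows; it assumes each drive carries metadata naming its group, its position, and the group size. These field names are hypothetical, and the sketch deliberately ignores degraded imports that a redundant RAID level might allow.

```python
from collections import defaultdict


def importable_groups(inserted_drives):
    """Return the IDs of disk groups whose members have all been inserted."""
    seen = defaultdict(set)   # group ID -> member indexes seen so far
    expected = {}             # group ID -> total number of members
    for drive in inserted_drives:
        seen[drive["group_id"]].add(drive["member_index"])
        expected[drive["group_id"]] = drive["member_count"]
    return sorted(g for g, members in seen.items()
                  if len(members) == expected[g])


# Hypothetical per-drive metadata, inserted one drive at a time.
drives = [
    {"group_id": "dg0", "member_index": 0, "member_count": 3},
    {"group_id": "dg0", "member_index": 1, "member_count": 3},
    {"group_id": "dg1", "member_index": 0, "member_count": 2},
    {"group_id": "dg0", "member_index": 2, "member_count": 3},
]
progress = [importable_groups(drives[:n]) for n in range(1, len(drives) + 1)]
```

In this example "dg0" becomes importable only after its third drive is inserted, while "dg1" remains incomplete.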
• Online array recovery
A RAID can crash because of a transient failure of multiple disk drives, even though the drives work again
after being power-cycled. Drives might stall when their firmware locks up, or become unstable as they age. The
cause could also be abnormal environmental conditions, such as poor air conditioning or vibration, or failures
of hardware components, such as connectors or cables. When any of these happens, most storage systems lose the data and
RAID configurations permanently. With online array recovery, the firmware can recognize and recover the RAID
configurations stored on the disk drives while the system is online, and get the data back as long as the disk drives can run again.
B.10 Vigilant System Monitoring
After a storage system is installed and starts serving applications, one of the most important jobs for administrators is
to monitor the system status. The hardware components in a storage system, such as disk drives, fans, or power supply units,
might degrade or fail outright, and environmental conditions might drift out of range. The firmware vigilantly watches
these hardware components and the environment, and alerts administrators promptly. It may also automatically take the
necessary countermeasures to recover from the degradation or mitigate the risks.
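Conceptually, this kind of monitoring compares each sensor reading with an allowed range and raises an alert when a reading is missing or out of range. The sketch below illustrates the idea only; the sensor names and limit values are made up for the example and are not the firmware's actual thresholds.

```python
# Hypothetical allowed ranges: sensor name -> (low, high).
LIMITS = {
    "temperature_c": (5, 55),
    "fan_rpm": (2000, 12000),
    "psu_volts": (11.4, 12.6),
}


def check(readings):
    """Return a sorted list of alert strings for bad or missing readings."""
    alerts = []
    for sensor, (low, high) in LIMITS.items():
        value = readings.get(sensor)
        if value is None:
            alerts.append(f"{sensor}: no reading")
        elif not low <= value <= high:
            alerts.append(f"{sensor}: {value} outside {low}..{high}")
    return sorted(alerts)


healthy = check({"temperature_c": 32, "fan_rpm": 6000, "psu_volts": 12.0})
degraded = check({"temperature_c": 61, "psu_volts": 12.1})
```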
• Remote monitoring by Web GUI
The web GUI displays a picture of the hardware components of the storage system and shows the status of each.
The administrator can quickly get an overview of the system status and easily see which components need to be
serviced. Because the GUI can be accessed remotely with a web browser, monitoring can be done from virtually anywhere in the
world.
• Non-volatile event logging
To help administrators track the history of all state changes, the firmware records an event log in the NVRAM of
the controller. Because the logs are recorded on the controller, no extra software is needed to keep the records. The
logs can also be downloaded to the administrator's desktop for further analysis or long-term archiving, and can be saved as
a human-readable text file or as a CSV file for spreadsheet applications.
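To show how the two export formats differ, the sketch below renders the same events once as plain text and once as CSV. The event fields, the sample events, and the function names are invented for illustration; the actual layout of the downloaded log files may differ.

```python
import csv
import io

FIELDS = ["time", "severity", "message"]

# Made-up sample events, for illustration only.
EVENTS = [
    {"time": "2024-01-05 10:02:11", "severity": "WARNING",
     "message": "Fan 2 speed below threshold"},
    {"time": "2024-01-05 10:07:43", "severity": "ERROR",
     "message": "Disk 7 failed, rebuild started"},
]


def to_text(events):
    """One human-readable line per event."""
    return "\n".join(
        f"{e['time']} [{e['severity']}] {e['message']}" for e in events)


def to_csv(events):
    """CSV with a header row; the csv module quotes fields containing commas."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(events)
    return buffer.getvalue()


text_log = to_text(EVENTS)
csv_log = to_csv(EVENTS)
```

The CSV form keeps each field in its own column, so a message that contains a comma still loads as a single cell in a spreadsheet.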