hot plug RAID memory technology for fault tolerance and scalability

figure 1: server outages during a one-year period due to memory failures

[Figure 1 plot: cumulative failures per 10,000 systems (logarithmic scale, 100 to 1,000,000) versus memory capacity (64 MB, 1 GB, 16 GB); legend: Parity, ECC. Chart annotations: “ECC for large memory systems is only about as good as parity checking is for smaller capacities” and “Nearly 50% system failures per year.” Data labels shown in the plot: 120%, 75%, 4.6%, 48%, 0.3%.]

hot plug RAID memory

To help meet the availability and scalability demands of today’s eBusiness world, HP developed a solution that lets customers use industry-standard memory technology while increasing server fault tolerance, memory capacity, and availability. Hot Plug RAID Memory provides far greater protection than standard ECC-based solutions and detects errors that would otherwise go undetected (table 1).

table 1: comparison of protection provided by parity checking, ECC, and Hot Plug RAID Memory

Error Condition      Parity       Standard ECC   RAID Memory
Single-bit           Detect       Correct        Correct
Double-bit           Unreliable   Detect         Correct
4-bit DRAM           Unreliable   Detect         Correct
8-bit DRAM           Unreliable   Unreliable     Correct
Greater than DRAM    Unreliable   Unreliable     Detect
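The parity column of table 1 can be illustrated with a small sketch. It assumes a single even-parity bit stored alongside each data word; the word value and the function names are hypothetical and chosen only for illustration. One flipped bit changes the parity and is detected, while two flipped bits cancel out and pass the check, which is why the table lists parity as unreliable for double-bit errors.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit for a word: 1 if the number of set bits is odd."""
    return bin(word).count("1") % 2


def parity_check_passes(word: int, stored_parity: int) -> bool:
    """True when the word read back still matches the parity bit stored with it."""
    return parity_bit(word) == stored_parity


original = 0b10110010            # hypothetical 8-bit data word
stored = parity_bit(original)    # parity bit written alongside the data

single_flip = original ^ 0b00000100   # one bit corrupted
double_flip = original ^ 0b00010100   # two bits corrupted

single_detected = not parity_check_passes(single_flip, stored)
double_detected = not parity_check_passes(double_flip, stored)

print("single-bit error detected:", single_detected)  # True
print("double-bit error detected:", double_detected)  # False
```

Parity only reports that something is wrong; it carries no information about which bit flipped, so even a detected error cannot be corrected.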

For years, the computer industry has used redundant array of independent disks (RAID) technology to provide fault tolerance and high availability for disk drive subsystems in servers. The technology used in Hot Plug RAID Memory is conceptually similar to RAID storage technology. However, in the context of the memory solution, RAID stands for redundant array of industry-standard DIMMs.
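To make the RAID analogy concrete, the following sketch shows how XOR parity, as used by parity-based RAID levels for disks, allows one failed member to be rebuilt from the survivors. This section does not describe the internal mechanism of Hot Plug RAID Memory, so treating a set of DIMMs like data disks plus one parity member is an assumption made for illustration; the block contents, the count of four data members, and the function names are hypothetical.

```python
from functools import reduce


def xor_parity(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the one missing block by XOR-ing the survivors with the parity block."""
    return xor_parity(surviving + [parity])


# Hypothetical contents of four data members (assumed count, for illustration only).
data = [b"\x12\x34", b"\xab\xcd", b"\x0f\xf0", b"\x55\xaa"]
parity = xor_parity(data)        # stored on the redundant member

failed_index = 2                 # pretend this member has failed or been removed
surviving = [blk for i, blk in enumerate(data) if i != failed_index]

rebuilt = rebuild_missing(surviving, parity)
print("rebuilt matches original:", rebuilt == data[failed_index])
```

Because XOR is its own inverse, the same routine serves both to compute the parity and to reconstruct any single missing member, which is what lets an array keep operating while a failed member is replaced.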

Source: Timothy J. Dell, “A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory,” IBM Microelectronics Division – Rev. 11/19/97