Specifications

CPU cooling
Heat is the big enemy of electronic components. But the rp8400’s two-level cooling scheme offers
outstanding cooling capacity at a nominal cost. The server’s turbo-cooler fans draw air directly
into the heat sinks of the CPU and cell VLSI. At the extreme operating ranges of the rp8400, the
turbo-cooler fans keep temperatures well below the maximum values allowed. Even though the
turbo-coolers may not be required under normal operating conditions, running them keeps the
silicon chips at the lowest possible temperature, which helps maximize their lifetime.
To further improve reliability of the rp8400, manageability software monitors the speeds of all
fans, including turbo-cooler fans. The rp8400 Smartfan controller can detect the first hint of
slowdown associated with bearing wear, ensuring you get plenty of warning before a fan fails.
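The fan monitoring idea above can be illustrated with a minimal sketch. It is not the Smartfan controller's actual logic, which this document does not describe. The function name, the 5% tolerance, and the five-sample window are all assumptions chosen for illustration.

```python
from statistics import mean


def fan_is_slowing(rpm_samples, baseline_rpm, tolerance=0.05, window=5):
    """Flag a fan whose recent average speed has dropped below its baseline.

    Hypothetical helper: returns True when the mean of the last `window`
    samples is more than `tolerance` (a fraction) below `baseline_rpm`.
    """
    if len(rpm_samples) < window:
        return False  # not enough readings yet to judge a trend
    recent = mean(rpm_samples[-window:])
    return recent < baseline_rpm * (1.0 - tolerance)


# Illustrative readings only, not measured rp8400 data.
healthy = [4000, 3990, 4010, 4005, 3995]
wearing = [4000, 3950, 3850, 3760, 3700, 3650]

print(fan_is_slowing(healthy, baseline_rpm=4000))
print(fan_is_slowing(wearing, baseline_rpm=4000))
```

Averaging over a window rather than reacting to a single reading avoids false alarms from momentary dips, while still catching the gradual decline typical of bearing wear.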
iCOD
Instant capacity on demand (iCOD, also referred to as “pay per use”) is a means of adding and
removing CPUs in a partition. With iCOD, you don’t need to worry about:
interleaved memory
application-locked memory
server “switchovers” due to false failures
physically handling CPU or memory boards
rebooting
iCOD is the most reliable means of reducing planned downtime for hardware upgrades.
redundant CPU power
In the rp8400, CPU power is protected through redundancy of the local dc-dc power conversion
for the CPUs.
memory protection
Main memory failures are the single largest cause of customer downtime. The rp8400 has several
features designed to reduce or eliminate failures of memory:
“chip kill” tolerance
dynamic memory resiliency (DMR)
automatic deconfigure on reboot
hardware memory scrubbing
chip kill tolerance
“Chip kill” tolerance is the ability of the system to continue to run in the face of any single-bit or
multi-bit error on a DRAM chip. The DRAMs in the rp8400 are effectively N+1 redundant for each set
of 128 DRAMs per memory word. This functionality is essential in the design of reliable memory systems, and
systems without this feature are doomed to fail at an alarming rate compared to the rp8400.
(This has been demonstrated at customer sites that use both chip kill tolerance and less reliable
architectures.)
There are many ways that DRAMs can fail, especially when a system has hundreds of them! It is
hopeless to try to design around (or explain away) this simple fact. With N+1 DRAMs, the
rp8400 memory is extremely reliable.
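To make the N+1 idea concrete, here is a simplified sketch that uses plain XOR parity as the extra chip. This is an assumption made for illustration: the document does not specify the rp8400's error-correcting code, and a real chip kill design must also locate the failed chip, which simple parity cannot do. The function names and the four-chip word are hypothetical.

```python
from functools import reduce


def make_parity(chips):
    """Compute the extra ("+1") chip as the XOR of all data chip values."""
    return reduce(lambda a, b: a ^ b, chips, 0)


def rebuild(chips, parity, failed_index):
    """Recover the value of one failed chip from the survivors and parity.

    Assumes the position of the failed chip is already known.
    """
    survivors = [v for i, v in enumerate(chips) if i != failed_index]
    return reduce(lambda a, b: a ^ b, survivors, parity)


# Four hypothetical data chips, each contributing one byte to a word.
data = [0x3C, 0xA5, 0x0F, 0x99]
parity = make_parity(data)

# Simulate a total failure of chip 2: its contents are lost.
corrupted = list(data)
corrupted[2] = 0x00

recovered = rebuild(corrupted, parity, failed_index=2)
print(recovered == data[2])
```

Because XOR is its own inverse, combining the parity value with every surviving chip leaves exactly the missing chip's contents, regardless of how many bits within that one chip were lost.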
dynamic memory resiliency
Dynamic memory resiliency is the system’s ability to de-allocate failed memory pages online. This
feature is similar to dynamic processor resiliency; if a location in memory proves to be
questionable (that is, exhibits persistent errors), the memory is de-allocated online with no
customer-visible impact. Assuming the rp8400 is equipped with adequate memory to begin with,
it is likely that the failed memory will never have to be replaced over the life of the product,
resulting in a significant reduction in both planned and unplanned downtime.
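The page de-allocation policy described above can be sketched as follows. This is a minimal illustration rather than the rp8400 or HP-UX implementation. The class name, the error threshold of three, and the page numbers are assumptions.

```python
from collections import Counter


class PageRetirer:
    """Hypothetical tracker that retires pages showing persistent errors."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # errors tolerated before retirement
        self.errors = Counter()     # page number -> error count
        self.retired = set()        # pages removed from use

    def record_error(self, page):
        """Count one error on `page`; return True if the page is retired."""
        if page in self.retired:
            return True
        self.errors[page] += 1
        if self.errors[page] >= self.threshold:
            self.retired.add(page)
        return page in self.retired

    def usable(self, pages):
        """Filter out retired pages so they are never allocated again."""
        return [p for p in pages if p not in self.retired]


retirer = PageRetirer(threshold=3)
results = [retirer.record_error(7) for _ in range(3)]  # page 7 keeps failing
retirer.record_error(2)                                # a one-off error on page 2
print(results, retirer.usable([1, 2, 7]))
```

Using a threshold separates a single transient error from a location that is genuinely questionable, so only persistently failing pages are taken out of service.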
protection for I/O
I/O errors are another significant cause of hardware failures and downtime because:
the number of I/O cards in a typical system is significant
the I/O cards themselves are a part of the system most exposed to frequent human interaction in
the data center