Specifications

components each of which runs in its own protected environment and interacts
with the other components over clean, well-defined interfaces. The second is to
provide enough computing power to each component such that it rarely, if ever,
runs under stress.
The routing system is built on top of a version of Unix that has been custom
modified for robust operation under loaded conditions. In addition to providing the
stability that comes with over 15 years of accumulated industry experience, the
Unix operating system provides protected environments (separate address
spaces) in which the routing protocols, network management, and user interface run.
This removes most opportunities for runaway applications to corrupt each other
and/or the kernel. The routing system is powered by a state-of-the-art Intel
processor that provides sufficient computing cycles to keep the processor from
being heavily loaded.
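The benefit of separate address spaces can be illustrated with a minimal sketch. It is not the routing system's actual software: the component source, the routing_table contents, and the run_component helper are hypothetical names invented for this example. A component is started as its own process, fails, and the supervising process carries on with its own state untouched.

```python
import subprocess
import sys


def run_component(source: str) -> int:
    """Run a component in its own process (its own address space).

    Returns the component's exit code; a nonzero value means it failed.
    """
    completed = subprocess.run(
        [sys.executable, "-c", source],
        stderr=subprocess.DEVNULL,  # hide the child's traceback
    )
    return completed.returncode


# State owned by the supervising process; the child has no access to it.
routing_table = {"10.0.0.0/8": "if0"}  # hypothetical contents

# Hypothetical runaway component: it dies with an unhandled exception.
exit_code = run_component("raise RuntimeError('simulated runaway component')")

print("component failed:", exit_code != 0)
print("supervisor state intact:", routing_table)
```

The failure is visible to the supervisor only as an exit code, which is the property the protected environments rely on: a fault stays inside the process that caused it.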
The embedded system itself is broken up into two independent pieces, one of
which runs on a processor in the SCB or SSB and the other on a processor on
the individual FPCs. This structure makes it difficult for errors in one of these
components to corrupt the other, or to corrupt the routing system. Additionally, as
is the case for the routing system, the SCB / SSB and the FPC processors
provide more than enough computing power for the task so that failures due to
loaded conditions should be extremely rare. Neither the SCB / SSB nor the FPC
processor handles the data traffic to be switched. This means that the operating
conditions seen by the software span a much smaller dynamic range than a
system in which the CPUs are doing the switching, making the software much
easier to test and get right.
4.1.6 Hardware Errors
The packet-forwarding engine is built using state-of-the-art hardware that uses
conservative design rules to achieve high reliability. Perhaps the single most
important contributor to the reliability of the PFE is the fact that it is implemented
using a small number of extremely highly integrated CMOS circuits. Almost all the
improvements in the reliability of digital electronic systems over the last 30 years
can be attributed to the increased use of monolithic integrated circuits, and the
Mxxx exploits this fact to the maximum extent allowed by today’s technology. A
small handful of custom ASICs, high-volume SRAMs, DRAMs, and
microprocessors implement over 95% of the system’s functionality. This
approach results in a superlative MTBF for the Mxxx.
Most performance parameters of the PFE are deliberately over-engineered to
make it extremely unlikely for any kind of traffic to overwhelm the system. Shared
memory capacity is many times the strict minimum necessary and is pooled into a
single common resource to make it effectively even larger. Input and output
packet engines are sized for a minimum of twice the line rate to avoid any
problems with runs of short packets. The route lookup engine is also centralized
and is sized to be roughly four times faster than is called for by average packet
size.
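The relationship between packet size and lookup rate behind this sizing can be worked through with a short calculation. The sketch below is illustrative only: the 2.5 Gbit/s line rate and the 400-byte average packet are assumed values chosen for the example, not figures from this specification, and framing overhead is ignored.

```python
def packets_per_second(line_rate_bps: float, packet_bytes: float) -> float:
    """Packets per second needed to keep a link full with packets of one size."""
    return line_rate_bps / (packet_bytes * 8)


def min_sustainable_packet_bytes(average_bytes: float, overspeed: float) -> float:
    """Smallest packet size an engine can sustain at full line rate.

    The engine is assumed to be sized at `overspeed` times the packet rate
    implied by the average packet size.
    """
    return average_bytes / overspeed


# Assumed values for illustration only (not from this specification).
LINE_RATE_BPS = 2.5e9
AVERAGE_PACKET_BYTES = 400
OVERSPEED = 4  # "roughly four times faster" than the average case

average_rate = packets_per_second(LINE_RATE_BPS, AVERAGE_PACKET_BYTES)
engine_capacity = OVERSPEED * average_rate
smallest = min_sustainable_packet_bytes(AVERAGE_PACKET_BYTES, OVERSPEED)

print(f"average-case load: {average_rate:,.0f} packets/s")
print(f"engine capacity:   {engine_capacity:,.0f} packets/s")
print(f"sustains line rate down to {smallest:.0f}-byte packets")
```

Because packet rate is inversely proportional to packet size, a fourfold margin over the average case lets the engine keep up with sustained runs of packets one quarter of the average size.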
All signals that cross chip boundaries are either parity- or CRC-checked for
corruption, and all data stored in external memory is either ECC- or parity-
protected. There is extensive internal consistency checking and logging built into
the ASICs. The system is designed for testability and provides full JTAG support
for boundary scan as well as full scan.
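The two checking schemes named above can be sketched in software. This is a minimal illustration, not the ASIC implementation: the specification does not state which parity convention or CRC polynomial the hardware uses, so even parity and the CRC-32 from Python's zlib module are assumptions made for the example.

```python
import zlib


def even_parity_bit(word: int) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, parity_bit: int) -> bool:
    """Check a received word against its received parity bit."""
    return even_parity_bit(word) == parity_bit


def append_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 (assumed polynomial) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the trailer."""
    payload, received = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received


# Parity: a single flipped bit is detected.
word = 0b1011
bit = even_parity_bit(word)
print(parity_ok(word, bit))           # intact word passes
print(parity_ok(word ^ 0b0001, bit))  # one flipped bit fails

# CRC: a corrupted byte is detected.
frame = append_crc(b"packet data")
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
print(crc_ok(frame), crc_ok(corrupted))
```

Parity costs one bit per word but detects only an odd number of flipped bits, whereas a CRC costs more and catches a much wider class of errors. That trade-off is a plausible reason for using both.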
The core PFE system is fully synchronous, and uses time-tested digital design
practices for timing, clocking, and signal integrity. All timing and voltage
margining was done for the worst-case process, supply, and temperature corners
to ensure that the system will function reliably under the most marginal of
environmental conditions.
The PFE features redundant fans and power supplies so that the most common
hardware failures do not bring the system down. Either of the
dual fan trays is capable of cooling the system indefinitely, while the dual power
supplies are load sharing and the system can operate on either one of them.
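The effect of this one-of-two redundancy can be estimated with the standard formula for parallel units. The sketch below assumes that the two units fail independently, and the 0.999 single-unit availability is a hypothetical number used only to show the arithmetic; this specification gives no per-unit figures.

```python
def parallel_availability(unit_availability: float, units: int = 2) -> float:
    """Availability of redundant units when any single unit is sufficient.

    Assumes the units fail independently of one another.
    """
    return 1.0 - (1.0 - unit_availability) ** units


# Hypothetical single-unit availability, for illustration only.
SINGLE_UNIT = 0.999

single = parallel_availability(SINGLE_UNIT, units=1)
dual = parallel_availability(SINGLE_UNIT, units=2)

print(f"one unit:  {single:.6f}")
print(f"two units: {dual:.6f}")
```

Under these assumptions, the pair is unavailable only when both units are down at once, so the unavailability is squared rather than halved.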
The system architecture deliberately avoids the use of switching cards to reduce
the number of backplane connections for the sake of improved reliability. Since
connectors are amongst the most frequent causes of failure, halving the number
of connections makes a significant dent in the computed failure rate. In fact, the
failure rate for the machine improves by 400 FIT simply as a result of this
packaging choice. The M20 also avoids extensive in-system redundancy,
because this would increase complexity and potentially make
the machine less reliable.
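To put the 400 FIT figure in context: one FIT is one failure per 10^9 device-hours, so under a constant failure rate the MTBF in hours is 10^9 divided by the FIT value. The sketch below shows the conversion. The 5,000 FIT baseline is a hypothetical number chosen for illustration; this section does not state the machine's total failure rate.

```python
HOURS_PER_BILLION = 1e9  # one FIT = one failure per 10^9 device-hours


def mtbf_hours(fit: float) -> float:
    """Convert a failure rate in FIT to MTBF in hours (constant failure rate)."""
    return HOURS_PER_BILLION / fit


# Hypothetical baseline; only the 400 FIT difference comes from the text.
BASELINE_FIT = 5000.0
CONNECTOR_SAVING_FIT = 400.0

with_saving = mtbf_hours(BASELINE_FIT)
without_saving = mtbf_hours(BASELINE_FIT + CONNECTOR_SAVING_FIT)

print(f"MTBF with the saving:    {with_saving:,.0f} hours")
print(f"MTBF without the saving: {without_saving:,.0f} hours")
```

Because FIT values for independent parts add, the contribution of a single design choice can be read directly off the total in this way.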