Specifications

4. AVAILABILITY
4.1 Redundancy Concerns
Juniper Networks understands the critical importance of product reliability given
the point at which the Mxxx will typically be used within a network. Juniper
Networks’ approach to reliability (as related to the Mxxx) is based on the
fundamentals of reliable distributed systems and practical knowledge of the
underlying causes of failures in modern electronic systems, especially when
these systems have a significant software component (as is the case with an
Internet Backbone Router).
While the M160 has been designed as a fully redundant “carrier class” system,
the essence of the Mxxx approach is to make individual systems as simple, fast,
and highly integrated as possible to ensure element-level reliability and to count
on network-level replication of routers to achieve network-level reliability. In
particular, intra-M20 redundancy is used only for components known to fail often,
such as fans and power supplies, and is avoided where the improvement in
reliability is marginal or negative because of the increase in system complexity.
From a practical standpoint, this is the most effective way to provide a reliable
network for network providers. Furthermore, the approach is consistent with how
many customers intend to build their next generation backbones by using loosely
coupled pairs of primary and secondary routers.
Since this approach is different from what is traditionally done in a wide-area
circuit switched network where in-box redundancy is emphasized, it is necessary
to explain why Juniper’s approach is superior for building a reliable routed
network. As a prelude, it is useful to cover two background topics: the first is an
explanation of what causes failures in modern electronic equipment; and the
second is the fundamental premise used in building a reliable system using
unreliable components.
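The second premise can be made concrete with simple availability arithmetic. The sketch below uses illustrative figures (99.9% availability per element, independent failures, and instantaneous failover), not measured Mxxx values:

```python
# Illustrative availability arithmetic: network-level replication
# turns two independently failing routers into a far more reliable
# pair. Assumes independent failures and fast failover; the 99.9%
# figure is an example, not a measured value.

def pair_availability(a):
    """Availability of a primary/secondary pair, each with availability a.

    The pair is down only when both elements are down at once:
    unavailability (1 - a) is squared.
    """
    return 1 - (1 - a) ** 2

single = 0.999                    # ~8.8 hours of downtime per year
pair = pair_availability(single)  # ~32 seconds of downtime per year

print(f"single router:        {single:.4%}")
print(f"loosely coupled pair: {pair:.6%}")
```

The same arithmetic explains the asymmetry in the text: squaring a small unavailability yields a large gain, but only if the replicas fail independently, which in-box redundancy schemes sharing software and operators often do not.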
4.1.1 Causes of Failure
The subject of what causes failure in modern electronic systems, especially when
these systems contain complex software, is widely misunderstood. Experience
has shown that the primary cause of failure in such systems can be broken down
into three broad categories: operator error, software failure, and hardware failure.
Operator error is by far the most common cause, accounting for well over 50% of
the failures in systems as diverse as the switched public telephone network and
on-line transaction processing computer systems. The next largest cause of
system failure is software failure. This typically occurs under heavily loaded
conditions, because that is where situations unanticipated by the underlying
design are most likely to arise. Finally, the fewest system failures are caused by
the failure of
hardware components within the system. Of these failures, fans, power supplies,
and connectors are the leading culprits. Electronic circuits, particularly monolithic
integrated circuits, are simply not a significant factor when they are used properly
and operated under manufacturer’s guidelines for stress factors such as voltage
and temperature. The one exception to this is soft errors in DRAM, but judicious
use of error correcting codes quickly reduces the frequency of these errors to
insignificant levels.
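As an illustration of why DRAM soft errors are so readily contained, the sketch below implements a single-error-correcting Hamming(7,4) code; production memory systems use wider SECDED codes over 64-bit words, but the principle is the same: a flipped bit is detected and corrected transparently.

```python
# Minimal sketch of single-error correction (Hamming(7,4)),
# showing how an error-correcting code recovers the original data
# after one flipped bit, as happens with DRAM soft errors.

def encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4            # parity over codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4            # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4            # parity over positions 4,5,6,7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # position of the bad bit (0 = none)
    if syndrome:
        c[syndrome - 1] ^= 1          # flip it back
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
word = encode(data)
word[4] ^= 1                          # simulate a soft error
assert decode(word) == data           # original data is recovered
```

Because correction happens on every read, a memory scrubber that periodically rewrites corrected words keeps the probability of two soft errors accumulating in one word vanishingly small.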
Good engineering practice dictates that effort should be directed to these
categories roughly in proportion to the relative frequency of failures within the
categories. Therefore, great care should be taken in building the human
interface to the system so that simple operational mistakes are unlikely to result
in network-wide failures. Next, the software should be built in a modular fashion
with clean well-understood interfaces between modules; if possible, the system
should be over-engineered so overload conditions occur rarely, thereby
decreasing the occurrence of the failures that are the hardest to track and fix.
Finally, the most commonly failing hardware components should be made
redundant to boost their reliability.
In contrast to this practice, the engineering of a reliable system is often reduced
to providing redundant copies of electronic subsystems enhanced with clever
schemes that work some of the time and are usually difficult or impossible to test
fully. This state of affairs exists partly because of historical precedent (electronic
components used to be notoriously unreliable), and partly because hardware
redundancy is easier to provide and exhibit as a showcase of the system’s
“reliability”.