4.  AVAILABILITY 
4.1  Redundancy Concerns 
Juniper Networks understands the critical importance of product reliability given 
the point at which the Mxxx will typically be used within a network. Juniper 
Networks’ approach to reliability (as related to the Mxxx) is based on the 
fundamentals of reliable distributed systems and on practical knowledge of the 
underlying causes of failures in modern electronic systems, especially when 
these systems have a significant software component (as is the case with an 
Internet Backbone Router). 
While the M160 has been designed as a fully redundant “carrier class” system, 
the essence of the Mxxx approach is to make individual systems as simple, fast, 
and highly integrated as possible to ensure element-level reliability and to count 
on network-level replication of routers to achieve network-level reliability. In 
particular, intra-M20 redundancy is used only for components known to fail often, 
such as fans and power supplies, and is avoided where the improvement in 
reliability is marginal or negative because of the increase in system complexity. 
From a practical standpoint, this is the most effective way to provide a reliable 
network for network providers. Furthermore, the approach is consistent with how 
many customers intend to build their next generation backbones by using loosely 
coupled pairs of primary and secondary routers. 
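
The reliability gain from loosely coupled router pairs can be illustrated with a 
simple availability calculation. The sketch below (Python) is purely illustrative; 
the MTBF and repair-time figures are hypothetical assumptions, not Juniper 
specifications, and it assumes independent failures with instantaneous failover 
to the secondary router. 

    # Illustrative availability calculation for network-level redundancy.
    # All figures below are hypothetical assumptions, not Juniper specifications.

    def availability(mtbf_hours: float, mttr_hours: float) -> float:
        """Steady-state availability of a single element."""
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Hypothetical single-router figures: 50,000-hour MTBF, 4-hour repair time.
    single = availability(mtbf_hours=50_000, mttr_hours=4)

    # A loosely coupled primary/secondary pair is unavailable only when both
    # routers are down at once (independent failures, instant failover assumed).
    pair = 1 - (1 - single) ** 2

    print(f"Single router availability:  {single:.6f}")
    print(f"Redundant pair availability: {pair:.9f}")

Under these assumptions the unavailability of the pair is the square of the 
single-router unavailability, which is the premise behind relying on network-level 
replication of routers rather than on elaborate in-box redundancy schemes. 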
Since this approach is different from what is traditionally done in a wide-area 
circuit-switched network, where in-box redundancy is emphasized, it is necessary 
to explain why Juniper’s approach is superior for building a reliable routed 
network. As a prelude, it is useful to cover two background topics: the first is an 
explanation of what causes failures in modern electronic equipment, and the 
second is the fundamental premise behind building a reliable system from 
unreliable components. 
4.1.1  Causes of Failure 
The subject of what causes failure in modern electronic systems, especially when 
these systems contain complex software, is widely misunderstood. Experience 
has shown that the causes of failure in such systems fall into three broad 
categories: operator error, software failure, and hardware failure. 
Operator error is by far the most common cause, accounting for well over 50% of 
the failures in systems as diverse as the switched public telephone network and 
on-line transaction processing computer systems. Software failure is the next 
largest cause; it typically occurs under heavily loaded conditions, because that 
is where situations unanticipated by the underlying design are most likely to 
arise. Finally, the fewest system failures are caused by the failure of 
hardware components within the system. Of these failures, fans, power supplies, 
and connectors are the leading culprits. Electronic circuits, particularly monolithic 
integrated circuits, are simply not a significant factor when they are used properly 
and operated under manufacturer’s guidelines for stress factors such as voltage 
and temperature. The one exception to this is soft errors in DRAM, but judicious 
use of error correcting codes quickly reduces the frequency of these errors to 
insignificant levels. 
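
The point about error-correcting codes can be quantified with a rough 
back-of-the-envelope estimate. The sketch below is a simplified model, not a 
description of the Mxxx memory system: it assumes independent single-bit soft 
errors and a SEC-DED code over 64-bit words, so only words suffering two or 
more simultaneous bit errors remain uncorrectable. 

    # Rough estimate of how a SEC-DED code suppresses DRAM soft errors.
    # All rates below are hypothetical, for illustration only.
    from math import comb

    bits_per_word = 64   # data bits protected per ECC word
    p_bit = 1e-12        # assumed probability of a bit upset per scrub interval

    # Without ECC, any single upset corrupts the word (dominant single-bit term).
    uncorrected = bits_per_word * p_bit

    # With SEC-DED, a word is lost only if two or more bits flip in the same
    # scrub interval; the dominant remaining term is the two-bit case.
    with_ecc = comb(bits_per_word, 2) * p_bit ** 2

    print(f"Per-word error rate without ECC: {uncorrected:.2e}")
    print(f"Per-word error rate with ECC:    {with_ecc:.2e}")

Under these assumptions the uncorrectable rate drops by roughly ten orders of 
magnitude, which is why the residual error frequency becomes insignificant in 
practice. 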
Good engineering practice dictates that effort should be directed to these 
categories roughly in proportion to the relative frequency of failures within the 
categories. Therefore, great care should be taken in building the human 
interface to the system, so that simple operational mistakes are unlikely to result 
in network-wide failures. Next, the software should be built in a modular fashion 
with clean, well-understood interfaces between modules; if possible, the system 
should be over-engineered so that overload conditions occur rarely, thereby 
decreasing the occurrence of the failures that are the hardest to track and fix. 
Finally, the most commonly failing hardware components should be made 
redundant to boost their reliability. 
In contrast to this practice, the engineering of a reliable system is often reduced 
to providing redundant copies of electronic subsystems enhanced with clever 
schemes that work some of the time and are usually difficult or impossible to test 
fully. This state of affairs exists partly because of historical precedent (electronic 
components used to be notoriously unreliable), and partly because hardware 
redundancy is easier to provide and exhibit as a showcase of the system’s 
“reliability”. 