Specifications

4. AVAILABILITY
4.1 Redundancy Concerns
Juniper Networks understands the critical importance of product reliability given
the point at which the Mxxx will typically be used within a network. Juniper
Networks’ approach to reliability (as related to the Mxxx) is based on the
fundamentals of reliable distributed systems and practical knowledge of the
underlying causes of failures in modern electronic systems, especially when
these systems have a significant software component (as is the case with an
Internet Backbone Router).
While the M160 has been designed as a fully redundant “carrier class” system,
the essence of the Mxxx approach is to make individual systems as simple, fast,
and highly integrated as possible to ensure element-level reliability and to count
on network-level replication of routers to achieve network-level reliability. In
particular, intra-M20 redundancy is used only for components known to fail often,
such as fans and power supplies, and is avoided where the improvement in
reliability is marginal or negative because of the increase in system complexity.
From a practical standpoint, this is the most effective way to provide a reliable
network for network providers. Furthermore, the approach is consistent with how
many customers intend to build their next generation backbones by using loosely
coupled pairs of primary and secondary routers.
Since this approach is different from what is traditionally done in a wide-area
circuit switched network where in-box redundancy is emphasized, it is necessary
to explain why Juniper’s approach is superior for building a reliable routed
network. As a prelude, it is useful to cover two background topics: the first is an
explanation of what causes failures in modern electronic equipment; and the
second is the fundamental premise used in building a reliable system using
unreliable components.
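The second premise can be made concrete with simple availability arithmetic. The sketch below uses illustrative figures (99.9% availability per element, independent failures, and instantaneous failover), not measured Mxxx values:

```python
# Illustrative availability arithmetic: network-level replication
# turns two independently failing routers into a far more reliable
# pair. Assumes independent failures and fast failover; the 99.9%
# figure is an example, not a measured value.

def pair_availability(a):
    """Availability of a primary/secondary pair, each with availability a.

    The pair is down only when both elements are down at once:
    unavailability (1 - a) is squared.
    """
    return 1 - (1 - a) ** 2

single = 0.999                    # ~8.8 hours of downtime per year
pair = pair_availability(single)  # ~32 seconds of downtime per year

print(f"single router:        {single:.4%}")
print(f"loosely coupled pair: {pair:.6%}")
```

The same arithmetic explains the asymmetry in the text: squaring a small unavailability yields a large gain, but only if the replicas fail independently, which in-box redundancy schemes sharing software and operators often do not.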
4.1.1 Causes of Failure
The subject of what causes failure in modern electronic systems, especially when
these systems contain complex software, is widely misunderstood. Experience
has shown that the primary cause of failure in such systems can be broken down
into three broad categories: operator error, software failure, and hardware failure.
Operator error is by far the most common cause, accounting for well over 50% of
the failures in systems as diverse as the switched public telephone network and
on-line transaction processing computer systems. The next largest cause of
system failure is software failure. This typically occurs under heavily loaded
conditions, because that is where situations unanticipated by the underlying
design are most likely to arise. Finally, the fewest system failures are caused by
the failure of
hardware components within the system. Of these failures, fans, power supplies,
and connectors are the leading culprits. Electronic circuits, particularly monolithic
integrated circuits, are simply not a significant factor when they are used properly
and operated under manufacturer’s guidelines for stress factors such as voltage
and temperature. The one exception to this is soft errors in DRAM, but judicious
use of error correcting codes quickly reduces the frequency of these errors to
insignificant levels.
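As an illustration of why DRAM soft errors are so readily contained, the sketch below implements a single-error-correcting Hamming(7,4) code; production memory systems use wider SECDED codes over 64-bit words, but the principle is the same: a flipped bit is detected and corrected transparently.

```python
# Minimal sketch of single-error correction (Hamming(7,4)),
# showing how an error-correcting code recovers the original data
# after one flipped bit, as happens with DRAM soft errors.

def encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4            # parity over codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4            # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4            # parity over positions 4,5,6,7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # position of the bad bit (0 = none)
    if syndrome:
        c[syndrome - 1] ^= 1          # flip it back
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
word = encode(data)
word[4] ^= 1                          # simulate a soft error
assert decode(word) == data           # original data is recovered
```

Because correction happens on every read, a memory scrubber that periodically rewrites corrected words keeps the probability of two soft errors accumulating in one word vanishingly small.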
Good engineering practice dictates that effort should be directed to these
categories roughly in proportion to the relative frequency of failures within the
categories. Therefore, great care should be taken in building the human
interface to the system so that simple operational mistakes are unlikely to result
in network-wide failures. Next, the software should be built in a modular fashion
with clean well-understood interfaces between modules; if possible, the system
should be over-engineered so overload conditions occur rarely, thereby
decreasing the occurrence of the failures that are the hardest to track and fix.
Finally, the most commonly failing hardware components should be made
redundant to boost their reliability.
In contrast to this practice, the engineering of a reliable system is often reduced
to providing redundant copies of electronic subsystems enhanced with clever
schemes that work some of the time and are usually difficult or impossible to test
fully. This state of affairs exists partly because of historical precedent (electronic
components used to be notoriously unreliable), and partly because hardware
redundancy is easier to provide and exhibit as a showcase of the system’s
“reliability”.