4.1.2 The Fundamental Premise

It is useful to recall the fundamental premise that is always made when adding

redundancy to a system to make it more reliable. This premise consists of two

parts: The first is that when redundant copies of a component are added, there

are no significant common-mode failures that affect the redundant copies. The

second is that the complexity of the control mechanism needed to resolve the

operation of the redundant copies is small enough that it does not have a material

negative impact on system reliability.

The first part of the premise has important implications for hardware and

software. For hardware, the primary implication is that physical separation and

loose coupling of redundant components generally result in a more reliable

system because there are fewer common-mode faults. For software, the primary

implication is that identical components exposed to the same inputs will crash

identically and therefore have no value in improving a system’s reliability.

Redundant software components help only if they are implemented differently or

if they are exposed to independent inputs, which makes it unlikely that they will

crash at the same time.
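
The point about identical software replicas can be illustrated with a small sketch. Everything in it is hypothetical: `handle_packet` stands in for any deterministic routine with a latent bug, and the byte strings are made-up inputs rather than real router traffic.

```python
def handle_packet(packet: bytes) -> int:
    """Hypothetical handler with a latent bug: it assumes a non-empty packet."""
    return packet[0]  # raises IndexError on an empty packet


def surviving_replicas(inputs_per_replica: list[bytes]) -> int:
    """Run one identical replica per input and count those that do not crash."""
    alive = 0
    for packet in inputs_per_replica:
        try:
            handle_packet(packet)
            alive += 1
        except IndexError:
            pass  # this replica crashed
    return alive


POISON = b""      # made-up input that triggers the bug
BENIGN = b"\x45"  # made-up input that does not

# Identical replicas fed the same input fail together.
same_input = surviving_replicas([POISON, POISON])
# Identical replicas fed independent inputs need not fail together.
independent_input = surviving_replicas([POISON, BENIGN])
```

With the same input, no replica survives, so the second copy adds nothing. With independent inputs, one replica keeps running.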

The second part of the premise implies that complex control schemes for

coordinating redundancy are not worthwhile. In fact, unless the state space of the

control mechanism can be fully characterized and exhaustively tested, it is likely

that the net effect of the redundancy will be to make the system less reliable.
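
Both parts of the premise can be put into rough numbers. The sketch below uses a simplified beta-factor style model that is an assumption of this example, not something taken from the text: `p` is the failure probability of one copy, `beta` is the fraction of failures that are common-mode, and `c` is the failure probability of the control mechanism. All values are illustrative.

```python
def redundant_pair_failure(p: float, beta: float, c: float) -> float:
    """Failure probability of two redundant copies plus their control mechanism.

    p    -- assumed failure probability of a single copy
    beta -- assumed fraction of failures that are common-mode (0.0 to 1.0)
    c    -- assumed failure probability of the control mechanism
    """
    common = beta * p                       # one event takes out both copies
    independent = ((1.0 - beta) * p) ** 2   # both copies fail separately
    pair_ok = (1.0 - common) * (1.0 - independent)
    return 1.0 - pair_ok * (1.0 - c)        # the controller must also work


P_SINGLE = 0.01  # illustrative failure probability of one copy

# Premise holds: no common-mode failures, negligible control complexity.
ideal = redundant_pair_failure(P_SINGLE, beta=0.0, c=0.0)
# First part violated: 10% of failures hit both copies at once.
coupled = redundant_pair_failure(P_SINGLE, beta=0.1, c=0.0)
# Second part violated too: the controller itself fails 2% of the time.
complex_control = redundant_pair_failure(P_SINGLE, beta=0.1, c=0.02)
```

Under these assumed numbers, `ideal` is far below `P_SINGLE`, `coupled` gives back most of that gain, and `complex_control` ends up above `P_SINGLE`. In that last case the redundant system is less reliable than a single copy.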

4.1.3 The Juniper Approach

The Mxxx were architected, designed, and implemented with a single overriding

goal in mind: to build no-compromise routers to run the Internet backbone. Every

choice, from technology, hardware components, architectural tradeoffs, technology

partners, and operating system to algorithms, management infrastructure, and user

interface, was made with the goal of building the best possible machine given

the state of the art.

Simplicity, speed, high integration, and modular design form the basis for the

reliability of a single M20 or M40 within the network. Replication of M20s and

M40s such that primary and secondary routers do not see the same traffic is the

basis for network-level reliability.

4.1.4 Operator Errors

The structure and user interface of the management software aid significantly in

the reliable operation of the Juniper Networks routers. The system has specific

features to minimize disruptions due to operator errors that in the past have been

known to cause failures, and provides assistance in recovering from failures due

to unpredictable errors.

For example, configuration changes are made using an interactive editor that

allows the state transition due to each change to be deferred until all changes

have been entered. The system then checks the set of changes for correct

semantics and either performs the changes or notifies the operator, as

appropriate. In any event, the set of changes is performed in an all-or-nothing

manner such that the system is never left in an inconsistent state. Operators may

also play non-destructive "what-if" games with some of the more complex

portions of system configuration. For example, a new routing policy can be tried

out to determine what the operational effect will be before actually activating the

policy.
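
The deferred, all-or-nothing behavior described above can be sketched as a candidate configuration that is edited privately, checked, and then activated as one unit. This is a minimal illustration and not the router's actual software: the `Router` class, its `edit`, `check`, and `commit` methods, the single validation rule, and the interface names are all hypothetical.

```python
import copy


class Router:
    """Toy router holding an active configuration (hypothetical structure)."""

    def __init__(self, interfaces, routes):
        # routes maps a prefix to the name of its outgoing interface
        self.active = {"interfaces": set(interfaces), "routes": dict(routes)}

    def edit(self):
        """Return a private candidate copy; editing it has no live effect."""
        return copy.deepcopy(self.active)

    def check(self, candidate):
        """'What-if' validation: report problems without activating anything."""
        return [
            f"route {prefix} uses unknown interface {name}"
            for prefix, name in sorted(candidate["routes"].items())
            if name not in candidate["interfaces"]
        ]

    def commit(self, candidate):
        """Activate the whole candidate, or nothing if any check fails."""
        errors = self.check(candidate)
        if not errors:
            self.active = copy.deepcopy(candidate)
        return errors


router = Router(interfaces={"if0"}, routes={"10.0.0.0/8": "if0"})

bad = router.edit()
bad["routes"]["192.168.0.0/16"] = "if9"   # if9 does not exist
bad_errors = router.commit(bad)           # rejected, active config unchanged

good = router.edit()
good["routes"]["192.168.0.0/16"] = "if1"  # inconsistent on its own...
good["interfaces"].add("if1")             # ...until this second change
good_errors = router.commit(good)         # accepted as one unit
```

Because changes accumulate in the candidate, the intermediate state in which a route points at a not-yet-defined interface never reaches the active configuration. Calling `check` without `commit` plays the role of the non-destructive "what-if" trial.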

Finally, the system provides mechanisms to authenticate and manage change

control and to help in problem diagnosis and recovery when things go wrong.

Each operator may be assigned a different set of privileges that give permission

to perform some classes of operations but not others. For example, an operator

tasked with interface installation may be prohibited from modifying routing

configuration. A sophisticated revision control mechanism enables the

operations staff to revert as well as audit problematic configuration changes.

Operations staff can determine exactly who made a particular change, what the

change was, and when it was activated, thereby allowing preventive measures to

be taken to avoid recurrence.
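
The combination of per-operator privileges and revision control might look roughly like the sketch below. It is an illustration under stated assumptions, not the product's implementation: the `ChangeControl` and `Revision` names, the privilege classes, and the flat dictionary used as a configuration are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Revision:
    operator: str           # who made the change
    change: str             # what the change was
    activated_at: datetime  # when it was activated
    config: dict            # full (flat) configuration after the change


class ChangeControl:
    def __init__(self, initial_config, privileges):
        self.privileges = privileges  # operator -> set of permitted classes
        self.history = [self._revision("system", "initial", initial_config)]

    @staticmethod
    def _revision(operator, change, config):
        return Revision(operator, change, datetime.now(timezone.utc), dict(config))

    def current(self):
        """Return a copy of the configuration currently in effect."""
        return dict(self.history[-1].config)

    def apply(self, operator, op_class, key, value):
        """Record a change if the operator holds the required privilege class."""
        if op_class not in self.privileges.get(operator, set()):
            raise PermissionError(f"{operator} may not perform {op_class} changes")
        config = self.current()
        config[key] = value
        change = f"{op_class}: {key}={value!r}"
        self.history.append(self._revision(operator, change, config))

    def rollback(self, operator, steps=1):
        """Revert to an earlier revision; the revert itself is also audited."""
        if not 1 <= steps < len(self.history):
            raise ValueError("no such revision")
        target = self.history[-1 - steps]
        self.history.append(self._revision(operator, f"rollback {steps}", target.config))
```

A short usage example under the same assumptions: an operator limited to interface work can change an interface setting but not routing, and a second operator can revert the change while the history keeps a record of both actions.

```python
control = ChangeControl(
    {"mtu": 1500},
    {"alice": {"interfaces"}, "bob": {"interfaces", "routing"}},
)
control.apply("alice", "interfaces", "mtu", 9000)
control.rollback("bob")
audit = [(r.operator, r.change) for r in control.history]
```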

4.1.5 Software Errors

Two strategies are used to avoid software errors and limit their damage when

they do occur. The first is to partition the system into a number of modular