Specifications
4.1.2 The Fundamental Premise
It is useful to recall the fundamental premise that is always made when adding
redundancy to a system to make it more reliable. This premise consists of two
parts: The first is that when redundant copies of a component are added, there
are no significant common-mode failures that affect the redundant copies. The
second is that the complexity of the control mechanism needed to resolve the
operation of the redundant copies is small enough that it does not have a material
negative impact on system reliability.
The first part of the premise has important implications for hardware and
software. For hardware, the primary implication is that physical separation and
loose coupling of redundant components generally result in a more reliable
system because there are fewer common-mode faults. For software, the primary
implication is that identical components exposed to the same inputs will crash
identically and therefore have no value in improving a system’s reliability.
Redundant software components help only if the components are implemented
differently, or if they are exposed to independent inputs, making it unlikely
that they will crash at the same time.
The second part of the premise implies that complex control schemes for
coordinating redundancy are not worthwhile. In fact, unless the state space of the
control mechanism can be fully characterized and exhaustively tested, it is likely
that the net effect of the redundancy will be to make the system less reliable.
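The two parts of the premise can be illustrated with a small calculation. The
sketch below is a toy model, not a formula from this specification: it assumes a
redundant pair in which a fraction of failures are common-mode (they take out
both copies at once), the remaining failures are independent, and the control
mechanism has its own failure probability. The function name and all three
parameters are hypothetical and chosen only for illustration.

```python
def pair_failure_probability(p, common_mode_fraction, control_failure):
    """Failure probability of a redundant pair plus its control mechanism.

    All parameters are illustrative assumptions:
    p                     -- failure probability of a single copy
    common_mode_fraction  -- share of failures that hit both copies at once
    control_failure       -- failure probability of the control mechanism
    """
    common = common_mode_fraction * p               # both copies fail together
    independent = (1 - common_mode_fraction) * p    # one copy fails on its own
    # Both copies are down if a common-mode fault occurs, or, failing that,
    # if each copy independently fails.
    both_copies_fail = common + (1 - common) * independent ** 2
    # The pair works only if the control mechanism works AND a copy survives.
    return 1 - (1 - control_failure) * (1 - both_copies_fail)


if __name__ == "__main__":
    single = 0.01
    for beta, ctrl in [(0.0, 0.0), (1.0, 0.0), (0.0, 0.02)]:
        result = pair_failure_probability(single, beta, ctrl)
        print(f"common-mode={beta}, control={ctrl}: {result:.6f} vs single {single}")
```

Under these assumptions, a pair with no common-mode faults and a perfect control
mechanism fails far less often than a single copy; a pair whose failures are all
common-mode is no better than a single copy; and a control mechanism that is
less reliable than the component it protects makes the pair worse than a single
copy.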
4.1.3 The Juniper Approach
The Mxxx were architected, designed, and implemented with a single overriding
goal in mind: to build no-compromise routers to run the Internet backbone.
Every choice, from technology, hardware components, architectural tradeoffs,
and technology partners to operating system, algorithms, management
infrastructure, and user interface, was made with the goal of building the
best possible machine given the state of the art.
Simplicity, speed, high integration, and modular design form the basis for the
reliability of a single M20 or M40 within the network. Replication of M20s and
M40s such that primary and secondary routers do not see the same traffic is the
basis for network-level reliability.
4.1.4 Operator Errors
The structure and user interface of the management software aid significantly in
the reliable operation of the Juniper Networks routers. The system has specific
features to minimize disruptions due to operator errors that in the past have been
known to cause failures, and provides assistance in recovering from failures due
to unpredictable errors.
For example, configuration changes are made using an interactive editor that
allows the state transition due to each change to be deferred until all changes
have been entered. The system then checks the set of changes for correct
semantics and either performs the changes or notifies the operator, as
appropriate. In any event, the set of changes is performed in an all-or-nothing
manner such that the system is never left in an inconsistent state. Operators may
also play non-destructive "what-if" games with some of the more complex
portions of system configuration. For example, a new routing policy can be tried
out to determine what the operational effect will be before actually activating the
policy.
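The deferred, all-or-nothing behavior described above can be sketched as a
candidate configuration that is edited privately, checked as a whole, and then
swapped in with a single step. This is a minimal illustration only; the class
and method names and the validation rule are hypothetical and do not represent
the router's actual software or command set.

```python
import copy


class ConfigError(Exception):
    """Raised when a candidate configuration fails the semantic check."""


class Router:
    def __init__(self, active):
        self.active = active                     # configuration in effect
        self.candidate = copy.deepcopy(active)   # private working copy

    def set(self, key, value):
        # Edits touch only the candidate; the running system is unaffected.
        self.candidate[key] = value

    def check(self):
        # Hypothetical semantic rule: a policy may only name known interfaces.
        interfaces = self.candidate.get("interfaces", [])
        for name in self.candidate.get("policy_interfaces", []):
            if name not in interfaces:
                raise ConfigError(f"policy refers to unknown interface {name!r}")

    def commit(self):
        self.check()                                 # raises before any change
        self.active = copy.deepcopy(self.candidate)  # one swap: all or nothing


if __name__ == "__main__":
    router = Router({"interfaces": ["if-a"], "policy_interfaces": []})
    router.set("policy_interfaces", ["if-b"])
    try:
        router.commit()
    except ConfigError as err:
        print("rejected:", err)
    print("active configuration unchanged:", router.active)
```

In this sketch, calling `check()` without `commit()` plays the role of the
"what-if" trial: the operator learns whether a set of changes would be accepted
without altering the active configuration.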
Finally, the system provides mechanisms to authenticate and manage change
control and to help in problem diagnosis and recovery when things go wrong.
Each operator may be assigned a different set of privileges that give permission
to perform some classes of operations but not others. For example, an operator
tasked with interface installation may be prohibited from modifying routing
configuration. A sophisticated revision control mechanism enables the
operations staff to revert as well as audit problematic configuration changes.
Operational staff can determine exactly who made a particular change, what the
change was, and when it was activated, thereby allowing preventive measures to
be taken to avoid recurrence.
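As a rough illustration of how per-operator privileges and revision control fit
together, the sketch below records who made each change, what it was, and when
it was activated, and allows an earlier configuration to be retrieved. Every
name here (the classes, the change classes such as "routing", the operators) is
hypothetical; the sketch shows the general idea under those assumptions rather
than the system's actual mechanism.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Revision:
    operator: str            # who made the change
    change: str              # what the change was
    activated_at: datetime   # when it was activated
    config: dict             # configuration in effect after the change


@dataclass
class ConfigHistory:
    permissions: dict                        # operator -> allowed change classes
    revisions: list = field(default_factory=list)

    def commit(self, operator, change_class, change, config):
        # Refuse changes outside the operator's assigned privileges.
        if change_class not in self.permissions.get(operator, set()):
            raise PermissionError(f"{operator} may not change {change_class}")
        self.revisions.append(
            Revision(operator, change, datetime.now(timezone.utc), dict(config)))

    def audit(self):
        # Who, what, and when for every activated change, oldest first.
        return [(r.operator, r.change, r.activated_at) for r in self.revisions]

    def rollback(self, steps=1):
        # Configuration that was in effect `steps` commits before the latest.
        return dict(self.revisions[-1 - steps].config)
```

For example, an operator whose permissions contain only "interfaces" would
receive a `PermissionError` when attempting a "routing" change, and
`rollback(1)` would return the configuration recorded by the previous commit.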
4.1.5 Software Errors
Two strategies are used to avoid software errors and limit their damage when
they do occur. The first is to partition the system into a number of modular