Providing Open Architecture High Availability Solutions

Data integrity can be verified using many methods, most of which depend on either redundancy or

summary information included within the data. Some of the methods may use sufficient

redundancy to not only detect an error, but also to correct it. However, most methods contain only

enough additional information to detect that the data is not valid. Examples of typical methods

include parity, checksums, and Cyclic Redundancy Checks (CRCs).
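
As a minimal sketch of detection-only integrity checking, the following Python appends a CRC-32 to a payload and later recomputes it. The helper names (frame, is_valid, even_parity_bit) and the 4-byte big-endian trailer layout are assumptions made for illustration, not something the text prescribes.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 of the payload (assumed layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_valid(framed: bytes) -> bool:
    """Recompute the CRC-32 and compare it with the stored trailer."""
    if len(framed) < 4:
        return False
    payload, stored = framed[:-4], framed[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


def even_parity_bit(payload: bytes) -> int:
    """Return the extra bit that makes the total count of 1 bits even."""
    ones = sum(bin(byte).count("1") for byte in payload)
    return ones % 2
```

Both checks only report that the data is invalid; neither carries enough redundancy to locate or correct the damaged bit.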

Comparison Testing

When redundant systems are employed, it is possible to have two systems make calculations in

parallel. The results are then compared, and a fault is detected if the results do not match. This

concept is also called voting, and is discussed in Section 3.5. Comparisons can be made at any

level of the system, from cycle-by-cycle comparisons on a memory bus to final output being sent

over the network.
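
The comparison step might be sketched as follows. The function name compare and the strict-majority rule used when more than two results are supplied are assumptions for illustration; with exactly two redundant systems a mismatch can be detected, but there is no way to tell which result is correct.

```python
from collections import Counter


def compare(results):
    """Compare results from redundant units.

    Returns (agreed_value, fault_detected). agreed_value is None when no
    strict majority exists. Results must be hashable (assumption).
    """
    if not results:
        raise ValueError("at least one result is required")
    counts = Counter(results)
    value, votes = counts.most_common(1)[0]
    fault_detected = len(counts) > 1          # any disagreement is a fault
    if votes <= len(results) // 2:            # no strict majority
        return None, True
    return value, fault_detected
```

For example, compare([5, 5]) reports agreement, compare([5, 6]) reports a fault with no usable value, and compare([5, 5, 6]) reports a fault while still returning the majority value 5.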

Time Testing

Time tests can be the simplest form of error detection. If an event is expected within a certain time

frame and the event does not occur, a fault is detected. This concept can be applied in hardware,

using watchdog timers, and in software, using either hardware timers or software processes.
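
A software watchdog can be sketched in a few lines. The class below is a hypothetical illustration, not a description of any particular product: the monitored code must call kick() before the timeout elapses, and a supervisor polls expired(). The injectable clock parameter is an assumption added so the behavior can be exercised without waiting.

```python
import time


class Watchdog:
    """Software watchdog: a fault is detected if kick() is not called in time."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock                # monotonic, so wall-clock changes are ignored
        self.last_kick = clock()

    def kick(self):
        """Record that the expected event occurred."""
        self.last_kick = self.clock()

    def expired(self):
        """Return True if the expected event is overdue."""
        return self.clock() - self.last_kick > self.timeout_s
```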

One specific method of time testing is commonly referred to as heartbeating. This technique,

which can be implemented in both hardware and software, uses some type of message handshaking

that is performed at a predefined interval. This technique is used to verify that the

appropriate components or subsystems still maintain some level of functionality.
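
The monitoring side of heartbeating could look like the sketch below. HeartbeatMonitor, beat, and silent_components are hypothetical names, and the transport that delivers the heartbeat messages is left out; the sketch only tracks when each component was last heard from.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat time of each component."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = {}

    def beat(self, component):
        """Called whenever a heartbeat message arrives from a component."""
        self.last_seen[component] = self.clock()

    def silent_components(self):
        """Return the components whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

A heartbeat only shows that the sender is still able to send heartbeats, which is why the text speaks of "some level of functionality" rather than full health.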

When redundant systems are used, time checks can verify that the systems are operating at the

same rate, which suggests that no timing faults are present. If redundant systems are not used,

expected times can be set based on design parameters.

While time testing is useful for fault detection, it does not always give a good indication of where a

fault occurred. For example, if a routine does not return in the expected amount of time, the

timeout alone does not reveal whether the routine failed or the processor was delayed by unexpected interrupts.

Time testing works well in deterministic systems, but can be problematic in systems where the

times of events may not be completely deterministic. In this case, the non-deterministic

performance of a system may cause time testing to sense faults that don’t really exist. On the other

hand, time testing can catch non-determinism in a system that was assumed to be deterministic.
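
One common way to reduce false alarms from timing jitter, shown here purely as an assumption and not as a technique taken from this document, is to declare a fault only after several consecutive deadline misses. The class name and the default threshold of three misses are hypothetical.

```python
class DeadlineCheck:
    """Reports a fault only after several consecutive deadline misses."""

    def __init__(self, deadline_s, misses_to_fault=3):
        self.deadline_s = deadline_s
        self.misses_to_fault = misses_to_fault
        self.consecutive_misses = 0

    def observe(self, elapsed_s):
        """Record one measured duration; return True if a fault is declared."""
        if elapsed_s > self.deadline_s:
            self.consecutive_misses += 1
        else:
            self.consecutive_misses = 0   # an on-time event clears the count
        return self.consecutive_misses >= self.misses_to_fault
```

The trade-off is slower detection: a real fault is reported only after misses_to_fault overruns, so the threshold has to be chosen against the detection time the system can tolerate.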

User and Other Observable Detection

There are some cases where the end user of the system will detect a problem that has not been

detected by the systems management functions. It must be possible for diagnostics and other tests

to be started by external command in order to resolve these issues. Ideally, as experience with

faults detected in this manner grows, the automatic fault detection of the system can be expanded to

find these problems.

Fail with Notification

This is also known as ‘don’t fail silently’. It is critical in HA systems that any fault, no matter how

small or how infrequent, is logged or recorded in some way. This allows prediction of future

failures and ensures that if the system model indicates that all is well, all is indeed well.
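
As an illustration of not failing silently, the sketch below records every fault even when a retry masks it from the caller. The logger name "ha.faults", the helper names, and the choice of OSError as the fault type are assumptions for the example.

```python
import logging
from collections import Counter

fault_log = logging.getLogger("ha.faults")   # hypothetical logger name
fault_counts = Counter()                     # per-component fault history


def report_fault(component, detail):
    """Record a fault, however minor, so trends can be analyzed later."""
    fault_counts[component] += 1
    fault_log.warning(
        "fault in %s (occurrence %d): %s",
        component, fault_counts[component], detail,
    )


def read_with_retry(read, component, attempts=3):
    """Retry a failing read, logging each failure even if a retry succeeds."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return read()
        except OSError as exc:
            last_error = exc
            report_fault(component, exc)     # never swallowed silently
    raise last_error
```

Without the call to report_fault, a read that succeeds on the second attempt would leave no trace, and the growing fault rate of a degrading component would go unnoticed.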