Providing Open Architecture High Availability Solutions
Data integrity can be verified using many methods, most of which depend on either redundancy or
summary information included within the data. Some methods carry enough redundancy not only to
detect an error but also to correct it. Most methods, however, contain only enough additional
information to detect that the data is not valid. Typical examples include parity, checksums, and
Cyclic Redundancy Checks (CRCs).
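As an illustration of detection-only integrity checks, the following minimal sketch appends a CRC-32 to a payload and verifies it on receipt. The function names (parity_bit, frame, is_valid) and the 4-byte trailer layout are assumptions made for this example, not part of any particular HA specification; zlib.crc32 is from the Python standard library.

```python
import zlib

def parity_bit(data: bytes) -> int:
    """Even-parity bit: 1 if the count of set bits is odd, else 0."""
    return sum(bin(b).count("1") for b in data) % 2

def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 to the payload (assumed layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def is_valid(framed: bytes) -> bool:
    """Recompute the CRC-32 and compare it with the stored trailer."""
    if len(framed) < 4:
        return False
    payload, stored = framed[:-4], framed[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored

good = frame(b"sensor reading 42")
# Flip a single bit in the first byte to simulate corruption.
corrupted = bytes([good[0] ^ 0x01]) + good[1:]
```

Both checks only detect that the data is not valid; neither carries enough redundancy to say which bit was wrong or to correct it.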
Comparison Testing
When redundant systems are employed, two systems can perform the same calculations in
parallel. The results are then compared, and a fault is detected if they do not match. This
concept is also called voting, and is discussed in Section 3.5. Comparisons can be made at any
level of the system, from cycle-by-cycle comparisons on a memory bus to the final output sent
over the network.
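A comparison test can be sketched as follows. The two-way compare mirrors the text directly; the majority vote across three or more replicas is an assumption added for illustration, showing how a comparison can also indicate which replica disagrees. The function names are hypothetical.

```python
from collections import Counter

def compare(results):
    """Return True if all redundant results agree (no fault detected)."""
    return all(r == results[0] for r in results[1:])

def majority_vote(results):
    """Return (agreed_value, indices_of_disagreeing_replicas).

    Raises RuntimeError when no value has a strict majority, in which
    case a fault is detected but cannot be attributed to a replica.
    Results must be hashable.
    """
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among redundant results")
    return value, [i for i, r in enumerate(results) if r != value]
```

With only two systems, a mismatch signals a fault without identifying the faulty system; with three, majority_vote([4, 5, 4]) accepts the value 4 and flags replica 1.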
Time Testing
Time tests can be the simplest form of error detection. If an event is expected within a certain time
frame and the event does not occur, a fault is detected. This concept can be applied in hardware,
using watchdog timers, and in software, using either hardware timers or software processes.
One specific method of time testing is commonly referred to as heartbeating. This technique,
which can be implemented in both hardware and software, uses some type of message handshake
performed at a predefined interval. It is used to verify that the appropriate components or
subsystems still maintain some level of functionality.
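A software heartbeat monitor might look like the minimal sketch below. The class name, the timeout parameter, and the injectable clock are assumptions made for this example; a real system would also have to deliver the heartbeat messages over some transport.

```python
import time

class HeartbeatMonitor:
    """Tracks the last heartbeat per component and reports overdue ones."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s   # maximum allowed gap between beats
        self.clock = clock           # injectable for testing
        self.last_seen = {}

    def beat(self, component):
        """Record a heartbeat received from a component."""
        self.last_seen[component] = self.clock()

    def overdue(self):
        """Return the components whose last beat is older than the timeout."""
        now = self.clock()
        return sorted(c for c, t in self.last_seen.items()
                      if now - t > self.timeout_s)
```

A monotonic clock is used by default so that wall-clock adjustments do not produce spurious timeouts. A missed heartbeat shows only that a component has stopped responding in time, not why.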
When redundant systems are used, time checks can verify that the systems are operating at the
same rate, which would indicate that no faults are present. If redundant systems are not used,
expected times can be set based on design parameters.
While time testing is useful for fault detection, it does not always give a good indication of where a
fault occurred. For example, if a routine does not return in the expected amount of time, the
timeout alone does not reveal whether the routine failed or the processor was delayed by
unexpected interrupts.
Time testing works well in deterministic systems, but can be problematic in systems where the
timing of events is not completely deterministic. In such systems, normal variation in
performance may cause time testing to report faults that do not actually exist. On the other
hand, time testing can catch non-determinism in a system that was assumed to be deterministic.
User and Other Observable Detection
There are some cases where the end user of the system will detect a problem that has not been
detected by the systems management functions. It must be possible for diagnostics and other tests
to be started by external command in order to resolve these issues. Ideally, as experience with
faults detected in this manner grows, the automatic fault detection of the system can be expanded to
find these problems.
Fail with Notification
This is also known as ‘don’t fail silently’. It is critical in HA systems that any fault, no matter how
small or how infrequent, is logged or recorded in some way. This allows prediction of future
failures and ensures that if the system model indicates that all is well, all is indeed well.
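In software, this principle can be illustrated with a wrapper that records every fault before reporting failure instead of swallowing it. The sketch below is one possible shape, not a prescribed design; the function name run_checked, the logger name, and the in-memory fault_log list are assumptions for the example.

```python
import logging

logger = logging.getLogger("ha.faults")  # hypothetical logger name

def run_checked(name, operation, fault_log):
    """Run operation(); record any fault before reporting failure.

    Returns (True, result) on success and (False, None) on a fault.
    Every fault is appended to fault_log and sent to the logger, so
    no failure passes unrecorded.
    """
    try:
        return True, operation()
    except Exception as exc:
        fault_log.append((name, repr(exc)))
        logger.error("fault in %s: %r", name, exc)
        return False, None
```

Because each fault leaves a record, the accumulated log can later be examined for patterns that precede a larger failure.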