Providing Open Architecture High Availability Solutions

6.7.4 Approach

Fault prediction uses periodic or historic information gathered about a system and its components

in an attempt to determine when and where a fault is most likely to occur. The data accumulated

about the specified components or subsystems might entail previous failure information, device

monitoring data, MTBF statistics, and applicable data gathered from associated components. Using

the data collected from the sources at hand, a statistical probability of failure can be derived.
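
As one illustration of deriving a failure probability from MTBF statistics, the sketch below assumes a constant failure rate (an exponential lifetime model) and independent components. Neither assumption comes from this document, and the function names are hypothetical.

```python
import math


def failure_probability(mtbf_hours: float, window_hours: float) -> float:
    """Probability of at least one failure within the window.

    Assumes a constant failure rate of 1/MTBF (exponential model),
    which is an assumption of this sketch, not of the source text.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if window_hours < 0:
        raise ValueError("window_hours must not be negative")
    return 1.0 - math.exp(-window_hours / mtbf_hours)


def any_failure_probability(component_probs) -> float:
    """Probability that at least one component fails.

    Assumes the components fail independently of each other.
    """
    survive_all = 1.0
    for p in component_probs:
        survive_all *= 1.0 - p
    return 1.0 - survive_all
```

Under these assumptions, a component with an MTBF of 50,000 hours has roughly a 1.4% chance of failing within a 720-hour month. Monitoring data and failure history would normally be used to adjust such a baseline figure.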

Some examples of data that could be collected are the packet and error rates of network

connections, the temperature or speed of components, or watermarks for OS parameters. Using

network rates, one can determine whether a network segment is overloaded or beginning to fail. Temperature

and/or speed of hardware components can indicate either immediate failure, as when CPUs and

other ICs overheat, or the need for preventative maintenance, as when fans slow down or disk

drives either slow down or heat up. Watermarks on memory, stack, or process use in the OS can

indicate that a process is stuck or out of control. All of these items can warn of a pending

failure, often with enough time to avoid the failure and maintain service availability.
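
A watermark check of the kind described above might be sketched as follows. The metric names, limit values, and the Watermark type are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Watermark:
    low: Optional[float] = None   # alarm if the reading falls below this
    high: Optional[float] = None  # alarm if the reading rises above this


def check_watermarks(readings: Dict[str, float],
                     limits: Dict[str, Watermark]) -> List[str]:
    """Return one alarm message per reading that is outside its limits."""
    alarms = []
    for name, value in readings.items():
        mark = limits.get(name)
        if mark is None:
            continue  # no detector limits configured for this metric
        if mark.high is not None and value > mark.high:
            alarms.append(f"{name} above high watermark: {value} > {mark.high}")
        elif mark.low is not None and value < mark.low:
            alarms.append(f"{name} below low watermark: {value} < {mark.low}")
    return alarms


# Hypothetical limits: an overheating CPU, a slowing fan, memory exhaustion.
LIMITS = {
    "cpu_temp_c": Watermark(high=85.0),
    "fan_rpm": Watermark(low=2000.0),
    "memory_used_pct": Watermark(high=90.0),
}
```

Calling check_watermarks with a dictionary of current readings and LIMITS returns the list of metrics that have crossed a watermark. An empty list means no alarm was raised.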

6.7.5 Techniques

Trends

Prediction can be done based on trends of a single variable, but frequently requires looking at two

or more variables. For example, CPU temperature trends would need to be viewed in light of

airflow and inlet air temperature to be useful for prediction.
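
The CPU temperature example can be sketched as a straight-line fit that extrapolates to a limit. This sketch trends the CPU temperature rise over inlet air temperature so that a warmer room is not mistaken for a failing component. It ignores airflow for brevity, and the linear model, function names, and limit are all assumptions.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all sample times are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def time_to_rise_limit(times, cpu_temps, inlet_temps, rise_limit):
    """Estimate when the CPU rise over inlet temperature reaches rise_limit.

    Returns the extrapolated time in the same units as `times`,
    or None if the rise over inlet is not trending upward.
    """
    rise = [cpu - inlet for cpu, inlet in zip(cpu_temps, inlet_temps)]
    slope, intercept = linear_fit(times, rise)
    if slope <= 0:
        return None
    return (rise_limit - intercept) / slope
```

If the CPU temperature climbs only because the inlet air is warming at the same rate, the rise is constant and the function returns None, which is the behavior the two-variable view is meant to provide.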

Multivariate Correlations

As noted above, most trends for failure must be viewed as composites of several variables. For

example, network retry rate is meaningless without a network load factor.
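
The retry example can be made concrete by normalizing retries against load before comparing with a baseline. The threshold factor, minimum sample size, and function names below are hypothetical.

```python
def retry_ratio(retries: int, packets_sent: int) -> float:
    """Retries per packet sent; 0.0 when nothing was sent."""
    if packets_sent <= 0:
        return 0.0
    return retries / packets_sent


def segment_suspect(retries: int, packets_sent: int, baseline_ratio: float,
                    factor: float = 3.0, min_packets: int = 1000) -> bool:
    """Flag a segment whose retry ratio is well above its baseline.

    Intervals with fewer than min_packets are ignored, because a ratio
    computed from very little traffic is too noisy to act on.
    """
    if packets_sent < min_packets:
        return False
    return retry_ratio(retries, packets_sent) > factor * baseline_ratio
```

With a baseline ratio of 0.004, 500 retries over 100,000 packets is unremarkable, while the same 500 retries over 10,000 packets flags the segment. The raw retry count alone cannot distinguish the two cases.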

Expert Systems

After a number of faults have occurred in a system, it is possible to analyze the faults for trends and

use expert systems to watch for these trends to occur again.
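
A very small rule table can stand in for the idea. The sketch below is a flat list of rules rather than a full expert system with an inference engine, and the rule names, metric names, and thresholds are invented for illustration.

```python
from typing import Callable, Dict, List, NamedTuple

Observation = Dict[str, float]


class Rule(NamedTuple):
    name: str
    condition: Callable[[Observation], bool]
    advice: str


# Hypothetical rules, imagined as distilled from previously analyzed faults.
RULES = [
    Rule(
        "fan degradation",
        lambda o: o.get("fan_rpm", float("inf")) < 2000
        and o.get("cpu_temp_c", 0.0) > 70,
        "schedule fan replacement",
    ),
    Rule(
        "runaway process",
        lambda o: o.get("memory_used_pct", 0.0) > 90
        and o.get("process_count", 0.0) > 500,
        "inspect and restart the offending process",
    ),
]


def evaluate(rules: List[Rule], observation: Observation) -> List[Rule]:
    """Return every rule whose condition matches the current observation."""
    return [rule for rule in rules if rule.condition(observation)]
```

Each rule encodes a pattern seen before an earlier fault. Evaluating the table against fresh monitoring data reports which of those patterns are recurring.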

6.7.6 Dependencies

Prediction is dependent on the system model collecting the required information and on the system

designer setting up sufficient detectors.