Providing Open Architecture High Availability Solutions

6.7.4 Approach
Fault prediction uses periodic or historical information gathered about a system and its components
to estimate when and where a fault is most likely to occur. The data accumulated
about the specified components or subsystems might include previous failure information, device
monitoring data, MTBF statistics, and applicable data gathered from associated components. From
the collected data, a statistical probability of failure can be derived.
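As one illustration of deriving a failure probability from MTBF statistics, the sketch below assumes a constant failure rate (an exponential lifetime model), which is a common simplification and not something this document prescribes. The function name and parameters are hypothetical.

```python
import math


def failure_probability(mtbf_hours: float, window_hours: float) -> float:
    """Probability of at least one failure within window_hours.

    Assumption: failures arrive at a constant rate of 1 / MTBF
    (exponential lifetime model). Real components often wear out,
    so treat the result as a rough estimate.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if window_hours < 0:
        raise ValueError("window_hours must not be negative")
    return 1.0 - math.exp(-window_hours / mtbf_hours)


if __name__ == "__main__":
    # Hypothetical component with an MTBF of 50,000 hours,
    # evaluated over a 30-day (720-hour) window.
    print(f"{failure_probability(50_000, 720):.4f}")
```

Under this model the probability depends only on the ratio of the window to the MTBF, so other collected data (monitoring data, failure history) would have to adjust the estimate separately.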
Some examples of data that could be collected are the packet and error rates of network
connections, the temperature or speed of components, or watermarks for OS parameters. From
network rates, one can infer whether a network segment is overloaded or beginning to fail. The temperature
and/or speed of hardware components can indicate either imminent failure, as when CPUs and
other ICs overheat, or the need for preventive maintenance, as when fans slow down or disk
drives either slow down or heat up. Watermarks on memory, stack, or process use in the OS can
indicate that a process is stuck or out of control. All of these items can indicate a pending
failure early enough to avoid it and maintain service availability.
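The watermark idea above can be sketched as a simple two-level threshold check. This is a minimal illustration only: the `Watermark` type, the level names, and the example thresholds are all assumptions, not values taken from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Watermark:
    """Hypothetical two-level watermark on a resource, as a fraction of capacity."""
    name: str
    warn: float      # e.g. 0.80 means 80% of capacity
    critical: float  # must be at or above warn


def classify(usage: float, mark: Watermark) -> str:
    """Return 'ok', 'warning', or 'critical' for the observed usage fraction."""
    if usage >= mark.critical:
        return "critical"
    if usage >= mark.warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Illustrative thresholds for memory use; real values are system specific.
    memory = Watermark(name="memory", warn=0.80, critical=0.95)
    for usage in (0.50, 0.85, 0.97):
        print(memory.name, usage, classify(usage, memory))
```

The warning level is what provides lead time: it is meant to trigger before the resource is exhausted, while corrective action is still possible.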
6.7.5 Techniques
Trends
Prediction can be based on the trend of a single variable, but frequently requires looking at two
or more variables. For example, CPU temperature trends would need to be viewed in light of
airflow and inlet air temperature to be useful.
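A minimal sketch of trend-based prediction follows, assuming a straight-line trend is adequate over the observation window. To reflect the point about inlet air temperature, it trends the CPU temperature rise over inlet rather than the raw CPU reading. Airflow is left out for brevity, and all names and sample values are hypothetical.

```python
def linear_trend(times, values):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two matching samples")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("sample times must not all be equal")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def time_to_limit(times, values, limit):
    """Time from the last sample until the fitted line reaches limit.

    Returns None when the trend is flat or falling.
    """
    slope, intercept = linear_trend(times, values)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - times[-1])


if __name__ == "__main__":
    hours = [0, 1, 2, 3]
    cpu_c = [50.0, 52.0, 54.0, 56.0]
    inlet_c = [20.0, 20.0, 20.0, 20.0]
    # Trend the rise over inlet so that a warmer room is not mistaken
    # for a failing cooling path.
    rise = [c - i for c, i in zip(cpu_c, inlet_c)]
    print(time_to_limit(hours, rise, limit=40.0))
```

Subtracting the inlet temperature is only one way to account for the second variable; a fuller model would also include airflow or fan speed.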
Multivariate Correlations
As noted above, most trends for failure must be viewed as composites of several variables. For
example, network retry rate is meaningless without a network load factor.
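The retry example can be sketched by normalizing retries against load before comparing them to a baseline. This is an illustrative assumption about how the two variables might be combined; the function names, the baseline ratio, and the tolerance factor are hypothetical.

```python
def retry_ratio(retries: int, packets_sent: int) -> float:
    """Retries per packet sent; 0.0 when nothing was sent."""
    if packets_sent <= 0:
        return 0.0
    return retries / packets_sent


def is_suspect(retries: int, packets_sent: int, baseline_ratio: float,
               tolerance: float = 2.0, min_packets: int = 100) -> bool:
    """Flag a link whose load-normalized retry rate exceeds the baseline.

    baseline_ratio: retry ratio observed when the link was healthy (assumed known).
    tolerance:      multiple of the baseline that counts as abnormal.
    min_packets:    ignore intervals with too little traffic to judge.
    """
    if packets_sent < min_packets:
        return False
    return retry_ratio(retries, packets_sent) > baseline_ratio * tolerance


if __name__ == "__main__":
    # The same raw retry count means different things at different loads.
    print(is_suspect(50, 100_000, baseline_ratio=0.001))  # heavy load
    print(is_suspect(50, 1_000, baseline_ratio=0.001))    # light load
```

With these sample numbers, 50 retries are unremarkable at 100,000 packets but stand out at 1,000 packets, which is why the raw count alone is not a usable predictor.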
Expert Systems
After a number of faults have occurred in a system, it is possible to analyze the faults for trends and
use expert systems to watch for these trends to recur.
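A very small rule-based sketch of this idea is shown below. It assumes that past fault analysis has already been distilled into rules by hand; the rule names, metric keys, and thresholds are invented for illustration, and a real expert system would be considerably richer.

```python
from typing import Callable, Dict, List, NamedTuple

Sample = Dict[str, float]


class Rule(NamedTuple):
    """One hand-written rule derived from earlier fault analysis (hypothetical)."""
    name: str
    condition: Callable[[Sample], bool]
    advice: str


# Illustrative rules only; the metric keys and thresholds are assumptions.
RULES: List[Rule] = [
    Rule(
        name="fan_degrading",
        condition=lambda s: s.get("fan_rpm", float("inf")) < 2000
        and s.get("cpu_temp_c", 0.0) > 70,
        advice="schedule fan replacement",
    ),
    Rule(
        name="runaway_process",
        condition=lambda s: s.get("mem_used_frac", 0.0) > 0.9
        and s.get("proc_count", 0.0) > 500,
        advice="inspect the process table",
    ),
]


def evaluate(sample: Sample, rules: List[Rule] = RULES) -> List[str]:
    """Return the names of all rules whose condition matches the sample."""
    return [rule.name for rule in rules if rule.condition(sample)]


if __name__ == "__main__":
    print(evaluate({"fan_rpm": 1500, "cpu_temp_c": 80}))
```

Missing metrics default to values that cannot trigger a rule, so a sample with no matching data simply produces no predictions.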
6.7.6 Dependencies
Prediction is dependent on the system model collecting the required information and on the system
designer setting up sufficient detectors.