Providing Open Architecture High Availability Solutions
Fault – A problem in a component where the response was either not correct or not timely.
Fault detector – A hardware or software component that checks for faults.
Fault domain – A group of components that is replaced when a fault is detected in any of the
components.
Fault management – The process of Detection, Diagnosis, Isolation, Recovery and Repair of a
faulted component. Fault management works in conjunction with configuration management to
change redundant components.
Fault prediction – Using information gathered by a management system to predict when a fault
may occur. If possible, preventative maintenance can then be performed to keep the fault from
occurring. Fault prediction relies on parameters such as time in use, temperature and error rate.
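As an illustration of fault prediction, the sketch below compares the three parameters named above (time in use, temperature and error rate) against fixed limits. The `Reading` type, the `predicted_fault_reasons` function and all limit values are hypothetical and chosen only for this example; a real management system would likely use vendor data or trend analysis rather than static thresholds.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One sample of the parameters the glossary entry names (hypothetical type)."""
    hours_in_use: float
    temperature_c: float
    error_rate: float  # errors per hour


# Assumed limits, invented for illustration only.
LIMITS = Reading(hours_in_use=50_000, temperature_c=70.0, error_rate=5.0)


def predicted_fault_reasons(reading: Reading, limits: Reading = LIMITS) -> list:
    """Return the names of the parameters that exceed their limit."""
    reasons = []
    if reading.hours_in_use > limits.hours_in_use:
        reasons.append("time in use")
    if reading.temperature_c > limits.temperature_c:
        reasons.append("temperature")
    if reading.error_rate > limits.error_rate:
        reasons.append("error rate")
    return reasons


# A component that is running hot but is otherwise within limits.
reasons = predicted_fault_reasons(
    Reading(hours_in_use=12_000, temperature_c=82.5, error_rate=0.2)
)
if reasons:
    print("schedule preventative maintenance:", ", ".join(reasons))
```

A non-empty result is the point at which preventative maintenance could be scheduled, before the fault actually occurs.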
Fault prevention – Using best known methods during the design phase of a system to prevent
faults from occurring when it is deployed.
Fault removal – Removing faults during the design phase of a system by validation/verification,
diagnosis and correction.
Fault tolerance – The system attribute of being able to operate correctly while faults are occurring.
Fencing – Removing device entries in an I/O subsystem so that the component no longer receives
inputs and can no longer change outputs.
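The effect of fencing can be sketched with a toy I/O subsystem that keeps a table of device entries. The `IOSubsystem` class and its method names are hypothetical and do not correspond to any real operating system interface; the sketch only shows that once the entry is removed, the component neither receives inputs nor changes outputs.

```python
class IOSubsystem:
    """Toy I/O subsystem with a table of device entries (hypothetical)."""

    def __init__(self):
        self.devices = {}  # device name -> inputs delivered to that device
        self.outputs = {}  # output name -> last value written

    def register(self, name):
        self.devices[name] = []

    def fence(self, name):
        # Fencing: remove the device entry. Unknown names are ignored.
        self.devices.pop(name, None)

    def deliver(self, name, data):
        # Inputs reach only devices that still have an entry.
        if name not in self.devices:
            return False
        self.devices[name].append(data)
        return True

    def write_output(self, name, key, value):
        # Only devices that still have an entry may change outputs.
        if name not in self.devices:
            return False
        self.outputs[key] = value
        return True


io = IOSubsystem()
io.register("disk-ctrl-1")
io.write_output("disk-ctrl-1", "lun0", "ready")

io.fence("disk-ctrl-1")  # a fault was detected, so isolate the component
delivered = io.deliver("disk-ctrl-1", b"request")
changed = io.write_output("disk-ctrl-1", "lun0", "corrupt")
print(delivered, changed, io.outputs)
```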
Field replaceable unit (FRU) – A hardware component that can be replaced in a repair process.
FRUs are typically boards or modules that can be easily swapped out in the field.
FM – see Fault Management.
FRU – see Field Replaceable Unit.
HA Forum – An industry group with the goal of promoting open standards for high availability
systems. The HA Forum generated this document.
Hardened driver – A software driver that has been written so that it will not lock up or return
faulty data to the OS, no matter what its associated hardware does. Hardened drivers are a key part
of an HA system, and must be written to make use of HA features within the hardware and the OS.
Hardware platform – The hardware and firmware upon which an OS, middleware and
applications are run. The hardware platform typically includes BIOS and diagnostic software.
Heartbeating – Sending a periodic signal from one component to another to show that the sending
unit is still functioning correctly.
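A minimal sketch of the receiving side of heartbeating follows. The `HeartbeatMonitor` class, the component names and the 3-second timeout are assumptions made for this example. Timestamps are passed in explicitly so the behaviour is easy to follow; a real monitor would read a monotonic clock instead.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Records the last heartbeat seen from each component (hypothetical helper)."""
    timeout_s: float  # assumed: silence longer than this marks a suspected fault
    last_seen: dict = field(default_factory=dict)

    def beat(self, component, now):
        # Called each time a heartbeat arrives from a component.
        self.last_seen[component] = now

    def suspects(self, now):
        # Components whose most recent heartbeat is older than the timeout.
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.beat("board-a", now=0.0)
monitor.beat("board-b", now=0.0)
monitor.beat("board-a", now=2.5)  # board-a keeps sending; board-b goes silent

late = monitor.suspects(now=4.0)
print(late)
```

In this run only `board-b` is reported, because its last heartbeat is more than 3 seconds old. A missing heartbeat only suggests a fault; the sender may be healthy while the channel carrying the signal has failed, so a fault manager would normally diagnose further before acting.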
Hot restart – Restarting a component while the system is still operational or partially operational,
typically with the new component picking up state information from memory rather than doing a
complete initialization.
Hot-Swap – Changing a board or other hardware component in a system without shutting the
system down.
In-band communication or message – Transferring information over the primary communications
channel or bus. The primary channel is the communications framework, protocol(s) and hardware
used for the majority of inter-process communications.