Providing Open Architecture High Availability Solutions
Often, the above capabilities include additional effort to stabilize the code and to ensure that the
software conforms to established software practices, such as code verification, code coverage
analysis, elimination of dead code and consistent error code generation.
Other examples of enhanced OS capabilities that improve reliability are discussed in the next
several sections.
9.1.1 Error Code Checking
The commonly accepted software practice of always checking a function's return code for error
indications applies to OS design as well as to application design. In a robust OS, the caller of
every function should check the return code for fault information. If a fault is detected, the caller
should log the exception and take appropriate autonomous error-handling action (fault management)
or, if warranted, halt operation until the fault can be cleared by an externally directed operation.
This fault management mechanism within the OS layer is fundamental to improved reliability.
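
The check-log-decide pattern described above can be sketched in C as follows. This is a minimal
illustration only: the status codes, the action codes and the check_status helper are hypothetical
names introduced here, not part of any particular OS.

```c
#include <stdio.h>

/* Hypothetical status codes; not taken from any particular OS. */
typedef enum { ST_OK = 0, ST_RECOVERABLE, ST_FATAL } status_t;

/* Hypothetical actions a caller may take after inspecting a status. */
typedef enum { ACT_CONTINUE, ACT_RECOVER, ACT_HALT } action_t;

static unsigned fault_log_count = 0;   /* number of faults logged so far */

/* Inspect a return code, log any fault, and decide how to react. */
static action_t check_status(const char *call, status_t st)
{
    if (st == ST_OK)
        return ACT_CONTINUE;

    /* Every fault is logged before any handling takes place. */
    fprintf(stderr, "fault: %s returned %d\n", call, (int)st);
    fault_log_count++;

    /* Recoverable faults are handled autonomously; anything else halts
       until an external operation clears the fault. */
    return (st == ST_RECOVERABLE) ? ACT_RECOVER : ACT_HALT;
}
```

Routing every return code through one helper keeps logging consistent and makes it harder for a
call site to silently ignore a fault.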
9.1.2 Hardened I/O
Hardened drivers are enhanced device drivers that do not assume that the devices they control are
always reliable or available. I/O interfaces should be designed to expect hardware failure, and
should attempt to ensure that system functionality will not be significantly degraded or locked up by
the failure of an external device. All I/O processing should scrub latent faults with some form of limit
checking for reasonable and expected values. Further, no I/O operation should wait indefinitely on
a faulted component. The access operation should have a reasonable timeout, exit and recovery
process on any pending request. The purpose of hardened drivers is to detect and isolate faults as
much as possible. Driver exceptions and faults are typically logged, as well as reported to the
middleware layer.
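
A bounded wait combined with limit checking might look like the sketch below. It assumes a
hypothetical device_t abstraction with ready and read callbacks, and it counts polls rather than
measuring real time so that the example stays self-contained. The sim_* helpers simulate a device
for illustration and are not a real driver interface.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { IO_OK = 0, IO_TIMEOUT, IO_RANGE } io_status_t;

/* Hypothetical device abstraction used only for this sketch. */
typedef struct {
    bool    (*ready)(void *ctx);   /* true when data can be read   */
    int32_t (*read)(void *ctx);    /* fetch one value from device  */
    void    *ctx;
} device_t;

/* Read one value without ever waiting indefinitely, and reject values
   outside the expected range [lo, hi]. */
static io_status_t hardened_read(const device_t *dev, unsigned max_polls,
                                 int32_t lo, int32_t hi, int32_t *out)
{
    unsigned polls = 0;

    while (!dev->ready(dev->ctx)) {
        if (++polls >= max_polls)
            return IO_TIMEOUT;     /* give up on a faulted component */
    }

    int32_t v = dev->read(dev->ctx);
    if (v < lo || v > hi)
        return IO_RANGE;           /* limit check failed */

    *out = v;
    return IO_OK;
}

/* Simulated device: becomes ready after `delay` polls. */
typedef struct { unsigned delay; int32_t value; } sim_t;

static bool sim_ready(void *ctx)
{
    sim_t *s = (sim_t *)ctx;
    if (s->delay == 0)
        return true;
    s->delay--;
    return false;
}

static int32_t sim_read(void *ctx) { return ((sim_t *)ctx)->value; }
```

A caller that receives IO_TIMEOUT or IO_RANGE would log the fault and report it to the middleware
layer instead of retrying forever.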
9.1.3 Argument Checking
Every data structure and function argument that can be rapidly checked for correctness should be
checked. If a data structure is found to be corrupted, the OS should regenerate it, if possible. In
the case of wild (invalid) arguments, the OS should return a reasonable error status, and the error
should be logged.
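
As a rough illustration, the sketch below validates both the argument and the data structure before
use. The queue_t type, its magic-number integrity marker and the status codes are assumptions made
for this example. Regeneration here simply resets the queue to an empty state, which discards its
contents. A real OS might instead rebuild from a redundant copy where one exists. The fixed capacity
is used only to keep the sketch short.

```c
#include <stddef.h>
#include <stdio.h>

#define QUEUE_MAGIC 0x51554555u   /* hypothetical integrity marker */
#define QUEUE_CAP   8             /* fixed only to keep the sketch short */

typedef enum { Q_OK = 0, Q_EINVAL, Q_EFULL, Q_REBUILT } q_status_t;

typedef struct {
    unsigned magic;               /* must equal QUEUE_MAGIC */
    size_t   count;               /* must never exceed QUEUE_CAP */
    int      items[QUEUE_CAP];
} queue_t;

static void queue_init(queue_t *q)
{
    q->magic = QUEUE_MAGIC;
    q->count = 0;
}

static q_status_t queue_push(queue_t *q, int item)
{
    /* Wild argument: return an error status and log it. */
    if (q == NULL) {
        fprintf(stderr, "queue_push: NULL queue\n");
        return Q_EINVAL;
    }

    /* Cheap consistency checks; regenerate the structure on corruption. */
    if (q->magic != QUEUE_MAGIC || q->count > QUEUE_CAP) {
        fprintf(stderr, "queue_push: corrupt queue, regenerating\n");
        queue_init(q);
        return Q_REBUILT;         /* item not stored; caller may retry */
    }

    if (q->count == QUEUE_CAP)
        return Q_EFULL;

    q->items[q->count++] = item;
    return Q_OK;
}
```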
9.1.4 Consistent Programmatic Response
An OS with consistent and deterministic behavior can aid in the rapid detection of time-interval
related faults, as the time relationships are known with greater certainty. This certainty allows fault
detection and recovery mechanisms such as hardware and software failsafes (e.g., watchdog
timers) to operate with a finer time resolution.
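
The idea can be illustrated with a tick-based software watchdog. When the worst-case interval between
kicks is known with certainty, timeout_ticks can be set only slightly above it, so a stalled task is
detected sooner. The watchdog_t type and the wd_* functions are hypothetical names for this sketch,
and the tick source is assumed to be a free-running 32-bit counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software watchdog driven by a free-running tick counter. */
typedef struct {
    uint32_t timeout_ticks;   /* longest allowed gap between kicks */
    uint32_t last_kick;       /* tick value at the most recent kick */
} watchdog_t;

/* Called by the monitored task to show that it is still making progress. */
static void wd_kick(watchdog_t *wd, uint32_t now)
{
    wd->last_kick = now;
}

/* Called by the monitor; unsigned subtraction keeps the elapsed-time
   calculation correct when the tick counter wraps around. */
static bool wd_expired(const watchdog_t *wd, uint32_t now)
{
    return (uint32_t)(now - wd->last_kick) > wd->timeout_ticks;
}
```

With nondeterministic scheduling, timeout_ticks would need a wide safety margin to avoid false
alarms, which in turn delays fault detection.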
9.1.5 Avoidance of Arbitrary Limits
Avoiding unnecessary arbitrary limits on the length or number of any data structure, by allocating
all data structures dynamically, can directly improve reliability over time. An HA system, designed
for minimal downtime and in-place upgrades, may have peak resource usage that was never foreseen
during design or test. The ability to dynamically extend OS structures eliminates this point of
failure. However, if dynamic allocation is not allowed by the system design or fails due to resource
limitations, the failure can be logged, the appropriate in-band error code communicated to the
application, and the exception notification information used by the middleware or application to
recover from the exception condition.
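
A minimal sketch of a dynamically extended table that logs an allocation failure and reports it
in-band is shown below. The table_t type, the table_add function and the doubling growth policy are
assumptions made for this example, not a prescribed design.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum { TBL_OK = 0, TBL_ENOMEM } tbl_status_t;

/* Hypothetical table that grows on demand instead of using a fixed size. */
typedef struct {
    int   *entries;
    size_t count;
    size_t capacity;
} table_t;

static tbl_status_t table_add(table_t *t, int value)
{
    if (t->count == t->capacity) {
        size_t new_cap = t->capacity ? t->capacity * 2 : 4;
        int *p = NULL;

        /* Guard against size_t overflow before asking the allocator. */
        if (t->capacity <= SIZE_MAX / (2 * sizeof(int)))
            p = (int *)realloc(t->entries, new_cap * sizeof(int));

        if (p == NULL) {
            /* Log the failure and return an in-band error code; the
               existing entries remain valid so the caller can recover. */
            fprintf(stderr, "table_add: cannot grow to %zu entries\n",
                    new_cap);
            return TBL_ENOMEM;
        }
        t->entries  = p;
        t->capacity = new_cap;
    }

    t->entries[t->count++] = value;
    return TBL_OK;
}
```

A table initialized as { NULL, 0, 0 } grows as entries are added. On TBL_ENOMEM the table is left
unchanged, which gives the middleware or application a consistent state from which to recover.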