Providing Open Architecture High Availability Solutions

Often, the above capabilities include more rigorous stabilization of the code and assurance that the

software conforms to established software practices, such as code verification,

code coverage analysis, elimination of dead code and consistent error code generation.

Other examples of enhanced OS capabilities that improve reliability are discussed in the next

several sections.

9.1.1 Error Code Checking

The commonly accepted software practice of always checking the returned function code for error

indications applies to OS design as well as to application design. In a robust OS, every function call

should have its return code checked for potential fault information. If a fault is detected, the caller should log the

exception and take appropriate autonomous error-handling action (fault management) or, if warranted,

halt operation until the fault can be cleared by an externally directed operation. This fault

management mechanism within the OS layer is fundamental to improved reliability.
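
The check, log, and handle-or-halt sequence described above can be sketched in C as follows. This is only an illustration under stated assumptions: the status codes, `checked_call`, `log_fault`, and the demo operations are hypothetical names invented for this sketch and do not come from any particular OS.

```c
#include <stdio.h>

/* Hypothetical error codes returned by lower-level operations. */
enum { ERR_NONE = 0, ERR_TRANSIENT = 1, ERR_HARD = 2 };

/* Hypothetical outcomes reported by the checking wrapper. */
typedef enum { ST_OK = 0, ST_RECOVERED, ST_HALT } status_t;

typedef int (*op_fn)(void *ctx);

int fault_log_count = 0;

/* Record the exception; a real OS would write to its fault log. */
void log_fault(const char *where, int err) {
    fault_log_count++;
    fprintf(stderr, "fault in %s: code %d\n", where, err);
}

/* Call op, always inspect its return code, log any fault, and either
 * recover autonomously (one retry for transient faults) or report that
 * operation must halt until the fault is cleared externally. */
status_t checked_call(const char *name, op_fn op, void *ctx) {
    int rc = op(ctx);
    if (rc == ERR_NONE)
        return ST_OK;
    log_fault(name, rc);
    if (rc == ERR_TRANSIENT) {
        rc = op(ctx);               /* single autonomous retry */
        if (rc == ERR_NONE)
            return ST_RECOVERED;
        log_fault(name, rc);
    }
    return ST_HALT;
}

/* Demo operations (assumptions, used only to exercise the wrapper). */
int op_ok(void *ctx) { (void)ctx; return ERR_NONE; }
int op_always_fail(void *ctx) { (void)ctx; return ERR_HARD; }
int op_fail_once(void *ctx) {
    int *calls = ctx;               /* ctx counts how often we were called */
    return (*calls)++ == 0 ? ERR_TRANSIENT : ERR_NONE;
}
```

The retry-once policy is an arbitrary choice for the sketch; the point is that no return code goes uninspected and every fault leaves a log entry.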

9.1.2 Hardened I/O

Hardened drivers are enhanced device drivers that do not assume that the devices they control are

always reliable or available. I/O interfaces should be designed to expect hardware failure and

to ensure that system functionality will not be significantly degraded or locked up by the

failure of an external device. All I/O processing should scrub latent faults with some form of limit

checking for reasonable and expected values. Further, no I/O operation should wait indefinitely on

a faulted component. Every access operation should have a reasonable timeout, exit path, and recovery

process for any pending request. The purpose of hardened drivers is to detect and isolate faults as

much as possible. Driver exceptions and faults are typically logged, as well as reported to the

middleware layer.
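
A minimal sketch of a hardened read path, combining a bounded wait with limit checking, might look like the following. The `device_t` abstraction, `hardened_read`, and the fake device are hypothetical and exist only for illustration; a poll count stands in for the clock-based timeout a real driver would more likely use.

```c
#include <stdint.h>

typedef enum { IO_OK = 0, IO_TIMEOUT, IO_RANGE } io_status_t;

/* Hypothetical device abstraction: ready() polls device status and
 * read() returns one raw sample. */
typedef struct {
    int     (*ready)(void *dev);
    int32_t (*read)(void *dev);
    void    *dev;
} device_t;

/* Read one sample without trusting the device:
 *  - never wait forever: give up after max_polls failed status polls
 *  - scrub the value: reject anything outside [lo, hi]
 * *out is written only when IO_OK is returned. */
io_status_t hardened_read(const device_t *d, unsigned max_polls,
                          int32_t lo, int32_t hi, int32_t *out) {
    unsigned polls = 0;
    while (!d->ready(d->dev)) {
        if (++polls >= max_polls)
            return IO_TIMEOUT;      /* caller logs and isolates the fault */
    }
    int32_t v = d->read(d->dev);
    if (v < lo || v > hi)
        return IO_RANGE;            /* implausible value, treat as a fault */
    *out = v;
    return IO_OK;
}

/* Fake device used to exercise the sketch (an assumption, not a real
 * driver). A negative polls_until_ready means it never becomes ready. */
typedef struct { int polls_until_ready; int32_t value; } fake_dev_t;

int fake_ready(void *p) {
    fake_dev_t *f = p;
    if (f->polls_until_ready < 0) return 0;
    if (f->polls_until_ready == 0) return 1;
    f->polls_until_ready--;
    return 0;
}

int32_t fake_read(void *p) { return ((fake_dev_t *)p)->value; }
```

Returning a distinct status for timeout and range faults lets the caller log each case separately and report it to the middleware layer.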

9.1.3 Argument Checking

Every data structure and function argument that can be rapidly checked for correctness should be

checked. In the case of data structure corruption, the OS should regenerate the data structure, if possible. In the

case of wild (out-of-range or invalid) arguments, the OS should return a reasonable error status, and the error should be

logged.
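
As an illustration of cheap argument and structure checks, the sketch below validates a small queue before using it. The `queue_t` layout, the magic value, and the status codes are assumptions made for this example; the fixed capacity is used only to keep the sketch short.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define QUEUE_MAGIC 0x51554555u   /* hypothetical integrity marker */
#define QUEUE_CAP   8

typedef struct {
    unsigned magic;
    size_t   head, count;
    int      items[QUEUE_CAP];
} queue_t;

typedef enum { Q_OK = 0, Q_EINVAL, Q_EFULL, Q_REBUILT } q_status_t;

/* Put the queue into a known-good, empty state. */
void queue_init(queue_t *q) {
    memset(q, 0, sizeof *q);
    q->magic = QUEUE_MAGIC;
}

/* Checks that are fast enough to run on every call. */
static int queue_is_sane(const queue_t *q) {
    return q->magic == QUEUE_MAGIC &&
           q->head < QUEUE_CAP && q->count <= QUEUE_CAP;
}

q_status_t queue_push(queue_t *q, int item) {
    if (q == NULL) {                         /* wild argument */
        fprintf(stderr, "queue_push: NULL queue\n");
        return Q_EINVAL;
    }
    if (!queue_is_sane(q)) {                 /* corrupted structure */
        fprintf(stderr, "queue_push: corrupt header, regenerating\n");
        queue_init(q);                       /* contents are lost */
        return Q_REBUILT;                    /* item was NOT stored */
    }
    if (q->count == QUEUE_CAP)
        return Q_EFULL;
    q->items[(q->head + q->count) % QUEUE_CAP] = item;
    q->count++;
    return Q_OK;
}
```

Here regeneration simply resets the queue to empty; whether lost contents can be reconstructed depends on the structure, and the caller is told via `Q_REBUILT` so that it can decide how to proceed.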

9.1.4 Consistent Programmatic Response

An OS with consistent and deterministic behavior can aid in the rapid detection of time-interval

related faults, as the time relationships are known with greater certainty. This certainty allows fault

detection and recovery mechanisms such as hardware and software failsafes (e.g., watchdog

timers) to operate with a finer time resolution.
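
The link between determinism and watchdog resolution can be illustrated with a small software watchdog. This is a sketch under stated assumptions: the tick units, the `watchdog_t` type, and the rule "timeout = service period + worst-case jitter" are hypothetical choices for the example, not a prescribed design.

```c
#include <stdint.h>

/* Hypothetical software watchdog driven by a monotonic tick counter. */
typedef struct {
    uint64_t last_kick;   /* tick at which the task last checked in */
    uint64_t timeout;     /* ticks allowed between check-ins */
} watchdog_t;

/* Assumed sizing rule: the timeout must cover the task's nominal
 * period plus the worst-case scheduling jitter of the OS. The smaller
 * the jitter bound, the sooner a hung task can be detected. */
uint64_t wd_timeout_for(uint64_t period, uint64_t worst_jitter) {
    return period + worst_jitter;
}

void wd_init(watchdog_t *w, uint64_t now, uint64_t timeout) {
    w->last_kick = now;
    w->timeout = timeout;
}

/* Called by the monitored task each time it completes a cycle. */
void wd_kick(watchdog_t *w, uint64_t now) { w->last_kick = now; }

/* Called by the monitor: nonzero if the task missed its deadline. */
int wd_expired(const watchdog_t *w, uint64_t now) {
    return now - w->last_kick > w->timeout;
}
```

With a period of 10 ticks, an OS whose jitter is bounded at 1 tick allows a timeout of 11 ticks, whereas a jitter bound of 50 ticks forces a timeout of 60, so the same hang is detected much later.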

9.1.5 Avoidance of Arbitrary Limits

Avoiding unnecessary arbitrary limits on the length or number of any data structure, by allocating

all data structures dynamically, can directly improve reliability over time. An HA system, designed for

minimal downtime and in-place upgrades, may have peak resource usage that was never foreseen

during design or test. The ability to dynamically extend OS structures eliminates this point of

failure. However, if dynamic allocation is not allowed by system design or fails due to resource

limitations, the failure can be logged and the appropriate in-band error code communicated to the

application. The middleware or application can then use the exception notification information to

recover from this exception condition.
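
A minimal sketch of a dynamically extended table that reports allocation failure in-band is shown below. The `table_t` type, the `max_cap` policy field (modelling the case where system design restricts growth), and the status codes are assumptions made for this illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum { TBL_OK = 0, TBL_ELIMIT, TBL_ENOMEM } tbl_status_t;

typedef struct {
    int   *slots;
    size_t used, cap;
    size_t max_cap;   /* 0 = no design-imposed limit (assumed convention) */
} table_t;

/* Append a value, growing the table on demand instead of relying on a
 * fixed compile-time size. On failure the table is left unchanged, the
 * event is logged, and an in-band error code is returned so that the
 * caller (application or middleware) can recover. */
tbl_status_t table_add(table_t *t, int value) {
    if (t->used == t->cap) {
        size_t new_cap = t->cap ? t->cap * 2 : 4;
        if (t->max_cap && new_cap > t->max_cap)
            new_cap = t->max_cap;
        if (new_cap <= t->cap) {   /* policy limit reached or size wrapped */
            fprintf(stderr, "table_add: growth not permitted\n");
            return TBL_ELIMIT;
        }
        if (new_cap > SIZE_MAX / sizeof *t->slots) {
            fprintf(stderr, "table_add: requested size too large\n");
            return TBL_ENOMEM;
        }
        int *p = realloc(t->slots, new_cap * sizeof *p);
        if (p == NULL) {           /* old block is still valid */
            fprintf(stderr, "table_add: out of memory\n");
            return TBL_ENOMEM;
        }
        t->slots = p;
        t->cap = new_cap;
    }
    t->slots[t->used++] = value;
    return TBL_OK;
}
```

Because `realloc` leaves the original block intact when it fails, existing entries survive a failed growth attempt, which is what allows the caller to recover rather than restart.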