Providing Open Architecture High Availability Solutions

Choosing smaller failure group sizes (to confine the failures to parts of the total system) can
reduce total system downtime from individual failures. (This assumes the application can be
partitioned, and total downtime can be pro-rated over the whole system.)
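
The pro-rating assumption can be illustrated with a small calculation. The sketch below is only an
illustration: the function name, the failure count, and the repair time are hypothetical, and it
assumes that all failure groups are the same size and that each failure takes down exactly one group.

```python
def prorated_downtime(failures_per_year, minutes_per_failure, num_failure_groups):
    """Return system-equivalent downtime in minutes per year.

    Assumes equal-sized failure groups, so a single failure removes
    1 / num_failure_groups of the total system for the repair time.
    """
    if num_failure_groups < 1:
        raise ValueError("at least one failure group is required")
    fraction_affected = 1 / num_failure_groups
    return failures_per_year * minutes_per_failure * fraction_affected


# Hypothetical figures: 4 failures a year, 30 minutes to recover from each.
single_group = prorated_downtime(4, 30, 1)   # every failure takes the whole system down
eight_groups = prorated_downtime(4, 30, 8)   # every failure takes down one eighth of it
```

Under these assumptions, the same four failures cost 120 minutes of downtime with a single failure
group, but only 15 pro-rated minutes when the system is split into eight groups.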

4.4.2 Repair and Testing

The speed and accuracy of repairs can have a significant impact on availability. Ideally, when a
hardware component fails, the system accurately identifies the failed field replaceable unit (FRU)
the first time. Then, once a replacement unit has been installed, it can be fully tested before
being put into service.

Typical requirements are:

• A failed FRU must be identified correctly by the system so that it can be replaced correctly on
  the first attempt 95% of the time.
• Failed FRU identification and replacement must not affect the service availability.
• A replacement FRU cannot be put in service until fully tested within the system.
• Testing a replacement FRU must not affect service availability.
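
One way to picture these requirements is as a small state machine for each FRU, in which a
replacement unit has no path into service except through a tested state. The sketch below is an
assumption made for illustration, not part of any specified interface: the state names, the Fru
class, and the first_attempt_rate helper are all hypothetical.

```python
from enum import Enum, auto


class FruState(Enum):
    IN_SERVICE = auto()
    FAILED = auto()
    REPLACED = auto()  # new unit installed, not yet tested
    TESTED = auto()    # new unit passed its tests within the system


# Hypothetical transition table: REPLACED can never go straight to IN_SERVICE.
ALLOWED = {
    FruState.IN_SERVICE: {FruState.FAILED},
    FruState.FAILED: {FruState.REPLACED},
    FruState.REPLACED: {FruState.TESTED, FruState.FAILED},
    FruState.TESTED: {FruState.IN_SERVICE},
}


class Fru:
    def __init__(self, name):
        self.name = name
        self.state = FruState.IN_SERVICE

    def transition(self, new_state):
        """Move to new_state, rejecting any transition not in ALLOWED."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.name}: {self.state.name} -> {new_state.name} is not allowed"
            )
        self.state = new_state


def first_attempt_rate(repairs):
    """Fraction of repairs fixed by the first FRU replaced.

    repairs is a list of booleans, True when the first replacement cured the fault.
    """
    if not repairs:
        raise ValueError("no repair records")
    return sum(repairs) / len(repairs)
```

With a table like this, an attempt to move a unit from REPLACED directly to IN_SERVICE raises an
error, and the 95% target can be checked against repair records with first_attempt_rate.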

The frequency of failures and the urgency of repair also affect the cost of operating a system. The
more often service personnel have to visit a system to make repairs, the higher the operating cost.
FRU reliability can be increased by choice of parts, as well as by providing redundancy built into
the FRU.

4.4.3 Upgrades and Changes

All systems inevitably need changes to add or remove hardware components, install operating
system changes, and install new application versions. Having the service down during such
upgrades will usually violate the availability requirements, so some sort of rolling upgrade strategy
is needed.

A rolling upgrade often requires that parts of the system are shut down and upgraded while the
redundant components keep the service available. Once the first parts of the system are upgraded,
they are put into service, and the remaining parts are in turn shut down, upgraded, and returned to
service.
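
The sequence just described can be sketched in a few lines. This is a minimal illustration under
stated assumptions, not a definitive procedure: the Node class, the rolling_upgrade function, and
the min_in_service parameter are hypothetical, and the upgrade itself is reduced to changing a
version string.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    version: str
    in_service: bool = True


def rolling_upgrade(nodes, new_version, min_in_service=1):
    """Upgrade nodes one at a time while the others keep the service available."""
    for node in nodes:
        # Count the redundant nodes that would carry the service meanwhile.
        others_up = sum(1 for n in nodes if n.in_service and n is not node)
        if others_up < min_in_service:
            raise RuntimeError(f"upgrading {node.name} would interrupt the service")
        node.in_service = False     # shut down this part of the system
        node.version = new_version  # stand-in for the real upgrade step
        node.in_service = True      # return it to service before moving on


# Hypothetical two-node system running version 1.0.
cluster = [Node("node-a", "1.0"), Node("node-b", "1.0")]
rolling_upgrade(cluster, "2.0")
```

The check before each shutdown is the important design point: a node is only taken out of service
if enough redundant nodes remain to carry the load, so a system without redundancy is refused
rather than taken down.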

For the Open HA Framework and the application, this implies that some level of compatibility
between versions of software is needed. The requirements may include:

• Message formats, files, and data common to different versions of software are compatible, or
  are tagged with version numbers for specific handling.
• Behavior of the different versions of software is compatible.
• Testing of the new software on the split system is possible before bringing it into service.
• A roll-back plan is in place in case the new software does not work.
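
Tagging data with version numbers, as in the first item above, might look like the following
sketch. The envelope layout, the field names, and the two versions shown are invented for
illustration; the only point is that the receiver reads the version tag and handles each known
version specifically, while rejecting versions it does not know.

```python
import json

CURRENT_VERSION = 2  # hypothetical format version used by the newer software


def encode(payload, version=CURRENT_VERSION):
    """Wrap a payload in an envelope tagged with its format version."""
    return json.dumps({"version": version, "payload": payload})


def decode(raw):
    """Return the payload in the version 2 layout, whichever version sent it."""
    message = json.loads(raw)
    version = message.get("version")
    payload = message["payload"]
    if version == 2:
        return payload
    if version == 1:
        # Hypothetical difference: version 1 used "node" and carried no "role".
        return {"node_id": payload["node"], "role": "unknown"}
    raise ValueError(f"unsupported message version: {version!r}")


# A message from old software is still readable by the new software.
old_message = encode({"node": "node-a"}, version=1)
normalized = decode(old_message)
```

During a rolling upgrade, both software versions run side by side, so a decoder of this kind lets
upgraded components accept messages from components that have not yet been upgraded.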

When hardware changes are made, the operating system and management middleware must be able
to assimilate the new configuration into the operating environment. New redundancy rules may be
needed when components are added to or removed from a system, and new hardware must be
tested before it is put into service.