Providing Open Architecture High Availability Solutions
Choosing smaller failure group sizes (to confine the failures to parts of the total system) can
reduce total system downtime from individual failures. (This assumes the application can be
partitioned, and total downtime can be pro-rated over the whole system.)
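The pro-rating idea can be illustrated with a small calculation. The sketch below is only an illustration under simplifying assumptions that do not come from the text: the number of failures per year is fixed for the whole system, every outage lasts the same time, the failure groups are equal in size, and each failure takes down exactly one group. The function name and the sample figures are hypothetical.

```python
def prorated_downtime(failures_per_year: float,
                      outage_minutes: float,
                      failure_groups: int) -> float:
    """Return system-level downtime per year, in minutes, pro-rated over
    the whole system.

    Assumes (for illustration only) equal-sized failure groups and that
    each failure removes exactly one group from service.
    """
    if failure_groups < 1:
        raise ValueError("failure_groups must be at least 1")
    fraction_lost = 1.0 / failure_groups  # share of the system down per failure
    return failures_per_year * outage_minutes * fraction_lost


# Hypothetical figures: 4 failures per year, 30 minutes per outage.
for groups in (1, 4, 8):
    print(groups, prorated_downtime(4, 30, groups))
```

Under these assumptions, splitting the system into more failure groups lowers the pro-rated downtime in proportion to the number of groups.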
4.4.2 Repair and Testing
The speed and accuracy of repairs can have a significant impact on availability. Ideally, when a
hardware component fails, the system accurately identifies the failed field replaceable unit (FRU) on
the first attempt. Then, once a replacement unit has been installed, it can be fully tested before
being put into service.
Typical requirements are:
• A failed FRU must be identified correctly by the system so that it can be replaced correctly on
the first attempt 95% of the time.
• Failed FRU identification and replacement must not affect the service availability.
• A replacement FRU must not be put into service until it has been fully tested within the system.
• Testing a replacement FRU must not affect service availability.
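The requirement that a replacement FRU be tested before it enters service can be pictured as a small state machine. The following is a minimal sketch, not part of any framework described here: the state names, the transition table, and the `Fru` class are all hypothetical and chosen only to show that a failed unit cannot return to service without passing through a test state.

```python
from enum import Enum, auto


class FruState(Enum):
    IN_SERVICE = auto()
    FAILED = auto()
    UNDER_TEST = auto()


# Allowed transitions (assumed for illustration).
ALLOWED = {
    FruState.IN_SERVICE: {FruState.FAILED},
    FruState.FAILED: {FruState.UNDER_TEST},       # replacement unit installed
    FruState.UNDER_TEST: {FruState.IN_SERVICE,    # tests passed
                          FruState.FAILED},       # tests failed
}


class Fru:
    """Tracks the service state of one field replaceable unit."""

    def __init__(self, slot: str) -> None:
        self.slot = slot
        self.state = FruState.IN_SERVICE

    def move_to(self, new_state: FruState) -> None:
        # Reject any transition that would skip the test step.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.slot}: {self.state.name} -> {new_state.name} not allowed"
            )
        self.state = new_state
```

For example, calling `move_to(FruState.IN_SERVICE)` on a unit in the `FAILED` state raises `ValueError`; the unit must first enter `UNDER_TEST`.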
The frequency of failures and the urgency of repair also affect the cost of operating a system. The
more often service personnel have to visit a system to make repairs, the higher the operating cost.
FRU reliability can be increased by choice of parts, as well as by providing redundancy built into
the FRU.
4.4.3 Upgrades and Changes
All systems inevitably need changes to add or remove hardware components, install operating
system changes, and install new application versions. Having the service down during such
upgrades will usually violate the availability requirements, so some sort of rolling upgrade strategy
is needed.
A rolling upgrade typically requires that parts of the system be shut down and upgraded while the
redundant components keep the service available. Once the first parts of the system are upgraded,
they are put into service, and the remaining parts are in turn shut down, upgraded, and returned to
service.
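The sequence described above can be sketched in a few lines. This is a simplified illustration, assuming that nodes are interchangeable, that an upgrade always succeeds, and that the system is represented as a hypothetical mapping from node name to software version. The function name and the `batch_size` parameter are not taken from any real framework.

```python
from typing import Dict, List


def rolling_upgrade(versions: Dict[str, str],
                    new_version: str,
                    batch_size: int = 1) -> List[List[str]]:
    """Upgrade nodes batch by batch and return the batches in order.

    At least one node stays in service while each batch is upgraded.
    """
    names = sorted(versions)
    if batch_size < 1 or batch_size >= len(names):
        raise ValueError("at least one node must remain in service")
    batches = []
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        # Nodes outside the batch keep the service available meanwhile.
        for name in batch:
            # Stand-in for: shut down, upgrade, return to service.
            versions[name] = new_version
        batches.append(batch)
    return batches


cluster = {"node-a": "1.0", "node-b": "1.0", "node-c": "1.0"}
print(rolling_upgrade(cluster, "2.0"))
print(cluster)
```

Between the first and last batch the system runs a mix of old and new versions, which is why the compatibility of the two versions matters.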
The implication for the Open HA Framework and the application is that some level of compatibility
between versions of software is needed. Requirements may include:
• Message formats, files, and data shared between different versions of software are
compatible, or are tagged with version numbers for specific handling.
• Behavior of the different versions of software is compatible.
• Testing of the new software on the split system is possible before bringing it into service.
• A roll-back plan is in place in case the new software does not work.
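As one illustration of tagging data with version numbers for specific handling, the sketch below dispatches on a version field carried in each message. The JSON layout, the field names (`version`, `body`, `node`, `node_id`, `load`), and the decoder functions are all hypothetical. The point is only that old and new software can exchange messages during a rolling upgrade when each message states its own format version.

```python
import json


def decode_v1(body: dict) -> dict:
    # Hypothetical version 1 format: node name only, no load figure.
    return {"node": body["node"], "load": None}


def decode_v2(body: dict) -> dict:
    # Hypothetical version 2 format: renamed field plus a load figure.
    return {"node": body["node_id"], "load": body.get("load")}


# One decoder per supported message version.
DECODERS = {1: decode_v1, 2: decode_v2}


def decode(raw: str) -> dict:
    """Decode a version-tagged message into one common internal form."""
    message = json.loads(raw)
    decoder = DECODERS.get(message.get("version"))
    if decoder is None:
        raise ValueError("unsupported message version")
    return decoder(message["body"])


print(decode('{"version": 1, "body": {"node": "cpu-1"}}'))
print(decode('{"version": 2, "body": {"node_id": "cpu-1", "load": 0.4}}'))
```

Rejecting unknown versions explicitly, rather than guessing at the format, gives the receiving side a clear signal that a roll-back or further upgrade is needed.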
When hardware changes are made, the operating system and management middleware must be able
to assimilate the new configuration into the operating environment. New redundancy rules may be
needed when components are added to or removed from a system, and new hardware must be
tested before it is put into service.