Providing Open Architecture High Availability Solutions


Revision 1.0

Published by the HA Forum

February, 2001

Summary of content (112 pages)

PAGE 1
Providing Open Architecture High Availability Solutions Revision 1.
PAGE 2
Contributing Authors Tim Anderson, High Availability Software MontaVista Software, Inc.
PAGE 3
Contents
1.0 Executive Summary ..... 9
2.0 Introduction ..... 11
3.0 High Availability Concepts and Principles ..... 13
PAGE 4
5.3.1 Introduction ..... 40
5.3.2 Concepts ..... 40
5.3.3 Approach
6.0 System Capabilities – Fault Management ..... 50
PAGE 5
7.0 Open-Architecture Systems ..... 69
7.1 Open Architecture and High Availability ..... 69
7.2 Open-Architecture Building Blocks for High Availability Systems ..... 70
8.0 Layer-Specific Capabilities – Hardware
PAGE 6
9.1.4 Consistent Programmatic Response ..... 87
9.1.5 Avoidance of Arbitrary Limits ..... 87
9.1.6 Appropriate panic() Behavior
10.0 Layer-Specific Capabilities – Management Middleware ..... 93
PAGE 7
Figures
1 Failure Classes ..... 16
2 Fault Classes ..... 17
3 System Decomposition ..... 20
4 Tree View of System Decomposition
PAGE 8
PAGE 9
Providing Open Architecture High Availability Solutions 1.0 Executive Summary This document describes the best known methods and capabilities needed for systems that are required to have almost no down-time, frequently referred to as high availability (HA) systems. It is intended as a guide to a common vocabulary and possible HA functions. The system designer must select the appropriate functions for each system based on HA requirements, design complexity and cost.
PAGE 10
4. Recovery – The system is adjusted or re-started so it functions properly
5. Repair – A faulty system component is replaced

Capabilities of Major Building Blocks (or Layers)

In order to create open architecture systems, interoperable building blocks must be available. These blocks can then be combined as needed to create a system. Section 7.0 provides an overview of how a system is divided into building blocks.
PAGE 11
Providing Open Architecture High Availability Solutions 2.0 Introduction High Availability, or HA, is the term associated with computer systems which exhibit almost no downtime. This document has been generated by the High Availability Forum (HA Forum) to make it easier to create open architecture high availability systems using Intel Architecture or other processors. This section discusses the HA Forum, and defines the scope of this document. 2.
PAGE 12
The scope of this document includes only functions and capabilities specifically related to the key parts of an HA system. These parts are:
1. Redundant hardware and software components
2. Methods for storing information about the components and their relationships
3. Methods for managing faults, from detection through recovery and repair
4. Methods for replacing and upgrading components and updating the stored information
5.
PAGE 13
Providing Open Architecture High Availability Solutions 3.0 High Availability Concepts and Principles The demand for increasingly capable hardware and software systems has grown dramatically over the past two decades. Advanced, complex, hardware and software systems have a significant presence in our everyday lives. We often take for granted the tasks and services performed and delivered by our automobiles, telephones, banking institutions, computers, and the Internet.
PAGE 14
Providing Open Architecture High Availability Solutions Table 2.
PAGE 15
System designers often build reliability into their platforms by building in correction mechanisms for latent faults that concern them. These faults, when correctable, do not produce errors or failures, since they are part of the design margins built into the system.
PAGE 16
precaution is taken to build reliable systems because they cannot tolerate the repair intervals of failures. For most other applications, however, the ability to provide near-continuous service by repairing faults and preventing their propagation is more economical and still acceptable to users.
PAGE 17
Providing Open Architecture High Availability Solutions Figure 2.
PAGE 18
To help combat the significant influence of software failures in large systems, software reuse can be applied. Reused software has been shown to improve continually in reliability, a concept known as reliability growth. Object-oriented software engineering practices further encourage software reuse, and software that is maintained and reused in this way has also shown significant reliability growth.
PAGE 19
Fault and Failure Forecasting

The ability to manage reliability futures is an instrumental part of the complete lifecycle of system availability. Understanding the operational environment, gathering field failure data, using reliability models, and analyzing and interpreting the results are all important to managing availability successfully.
PAGE 20
Providing Open Architecture High Availability Solutions Unlike hardware faults that are mostly physical faults, software faults are design faults, which are harder to visualize, classify, detect, and correct. As a result, software reliability is more difficult to realize and analyze than hardware reliability. Usually, hardware reliability theory relies on the analysis of stationary processes, because only physical faults are considered.
PAGE 21
Figure 4. Tree View of System Decomposition (nodes shown: System; Subsystem 1; Subsystem 2; components C1 through C7)

This view implies a hierarchy that describes the composed-of relationship between a system, its components, and their components. More levels in the hierarchy provide more detail, and hence, smaller components are present in the system model.
PAGE 22
Providing Open Architecture High Availability Solutions interpreters, where a given interpreter provides a collection of abstract objects to an interpreter above it. For example, a collection of hardware components that comprise a computer motherboard interprets the application that is hosted on it. The interprets view is depicted in Figure 6. Figure 6.
PAGE 23
Equation 2.

$$\lambda = \sum_{i=1}^{C} \pi_i \lambda_i$$

Where $C$ is the number of components in the system, and $\pi_i$ is the average proportion of time component $i$ is under execution. The system failure rate $\lambda$ is the sum of the component failure rates $\lambda_i$, each weighted by the component's actual sojourn time. Equation 2 may be used for both hardware and software system failure rate analysis.
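As a rough illustration of Equation 2, the following Python sketch computes a system failure rate from per-component values. The function name and the sample numbers are made up for illustration and do not come from this document.

```python
def system_failure_rate(components):
    """Return the weighted sum of component failure rates (Equation 2).

    `components` is a list of (proportion, rate) pairs, where `proportion`
    is the average fraction of time the component is under execution and
    `rate` is that component's failure rate.
    """
    total = 0.0
    for proportion, rate in components:
        if not 0.0 <= proportion <= 1.0:
            raise ValueError("proportion of time must be between 0 and 1")
        if rate < 0.0:
            raise ValueError("failure rate must not be negative")
        total += proportion * rate
    return total


# Hypothetical values: (proportion of time executing, failures per hour)
example = [(1.0, 2e-5), (0.5, 4e-5), (0.25, 8e-5)]
rate = system_failure_rate(example)
print(rate)
```

A component that runs only a quarter of the time contributes only a quarter of its failure rate, which is why the three hypothetical components above contribute equally despite very different individual rates.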
PAGE 24
Providing Open Architecture High Availability Solutions 3.4.3 System Models with Service Restoration Section 3.4.2 addressed systems in the absence of service restoration capabilities. The behavior of a system when failures are removed and service is restored was not taken into account. The understanding of this principle is important to better understand how to model the availability of a system. In this section, the influence service restoration has on the behavior of a system will be addressed.
PAGE 25
Providing Open Architecture High Availability Solutions The trend of the ‘a’ line depicts the results of a hyper-exponential model [lapr92] that is also backed by typical field data. It shows that systems typically peak in their unavailability shortly after deployment, then, through reliability growth (defect removal), eventually stabilize near their expected reliability level (c). The ‘b’ line represents the pessimistic reliability prior to defect removal and reliability growth.
PAGE 26
Providing Open Architecture High Availability Solutions One or more primary components, together with their redundant counterparts, act together to provide a reliable service. A group of such components is defined as a service group, an example of which might be the power supplies in a chassis, assuming that they were configured such that at least one was redundant. Spatial redundancy can be applied in a number of different ways as described in the next four sections. 3.5.
PAGE 27
Providing Open Architecture High Availability Solutions Clusters can be homogeneous, when they are composed of identical nodes, or heterogeneous, when the nodes can vary widely in make-up or even architecture. Nodes are managed on a black-box basis – either the node is fully functional, or the entire node is taken out of service with no attempt to diagnose or rectify failures within the node. Failures therefore remove more of the system (i.e., the whole node) instead of the single failed component.
PAGE 28
Providing Open Architecture High Availability Solutions 3.6 Making it All Work — Open vs. Proprietary The overarching goal of this high availability framework is to provide an open system environment in which multiple vendors can participate in providing system services that help achieve the expected level of availability in the systems we design. Today, systems are far too large and complex to expect that a single vendor can provide all of the functionality required of a system.
PAGE 29
Providing Open Architecture High Availability Solutions 4.0 Customer Requirements for Open HA Systems For many years computers have been controlling systems that provide critical services where continuous availability and data integrity are essential. The scale, the availability requirements, and the level of data integrity vary widely across the spectrum of such services. Consequently, the techniques and implementations that provide the high availability (HA) are also equally varied.
PAGE 30
Providing Open Architecture High Availability Solutions 4.1 Application Areas While the primary application area for an open HA framework is broadly aimed at the systems that provide telecommunications and Internet infrastructure, many other application areas share similar types of architectures and requirements, and will also benefit from an open HA framework.
PAGE 31
Figure 8. Open HA Framework Individual System Model (labels shown: Application(s), HA Management Middleware, Other Middleware, Operating System, Platform Hardware)

The framework should also provide some design guidelines for good HA implementations for both hardware and software within an open HA framework environment.

4.2.
PAGE 32
4.2.2 Compatibility and Interoperability

The definitions of an HA Framework must be clear and unambiguous. They must define the component interfaces well enough to ensure that different implementations of the same components will interoperate. The definitions may leave room for future expandability and growth, but must require standards-compliant implementations to be compatible.
PAGE 33
Providing Open Architecture High Availability Solutions 4.3.1 System Components An HA cluster is composed of several systems, and each system has many individual components. Each of these components has a specific role in the application – and each may also have different roles, effects, and participation in the overall system availability, and in an open HA framework. Figure 9.
PAGE 34
(Diagram labels: Passive Shared Bus; Active/Standby Cross Connection; two System Boards; eight I/O or Peripheral Processing Boards)

Figure 10.
PAGE 35
Providing Open Architecture High Availability Solutions 4.3.2 Application Environment The HA framework defines an operating environment for the application. Different applications will have different needs and expectations of the operating environment, which include: • Non-HA aware (active/spare) — these applications are written without any special coding to take advantage of the HA framework features. These applications must be cold restarted after a failure has occurred.
PAGE 36
Providing Open Architecture High Availability Solutions The HA framework should provide a set of APIs for the HA-aware applications to interact with the HA middleware. These APIs include: • • • • 4.
PAGE 37
Providing Open Architecture High Availability Solutions Choosing smaller failure group sizes (to confine the failures to parts of the total system) can reduce total system downtime from individual failures. (This assumes the application can be partitioned, and total downtime can be pro-rated over the whole system.) 4.4.2 Repair and Testing The speed and accuracy of repairs can have a significant impact on availability.
PAGE 38
Providing Open Architecture High Availability Solutions 4.5 HA Configuration and Cluster Management Configuration and cluster management middleware is the key controlling entity in an HA configuration and is implemented by HA middleware distributed amongst the HA cluster systems. It maintains a system model of the components that comprise the cluster, defines how faults are detected in the cluster and what action should be taken as a result.
PAGE 39
Providing Open Architecture High Availability Solutions 5.0 System Capabilities – Configuration Management 5.1 Introduction Configuration management involves knowing what types of hardware, firmware, and software components are actually in a system. It also tracks the intended configuration of the system (the system model), which may or may not match the actual system configuration.
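The gap between the intended configuration (the system model) and the actual configuration can be pictured as a simple comparison. The sketch below is a minimal illustration only: the slot names, component types, and the `compare_configuration` function are hypothetical and are not defined by this document.

```python
def compare_configuration(intended, actual):
    """Compare an intended system model with the discovered configuration.

    Both arguments map a location (for example a slot name) to the type of
    component expected or found there.
    """
    missing = sorted(loc for loc in intended if loc not in actual)
    unexpected = sorted(loc for loc in actual if loc not in intended)
    mismatched = sorted(
        loc for loc in intended
        if loc in actual and intended[loc] != actual[loc]
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


# Hypothetical system model and discovered hardware
intended = {"slot1": "cpu-board", "slot2": "io-board", "psu1": "power-supply"}
actual = {"slot1": "cpu-board", "slot2": "storage-board", "fan1": "cooling-module"}

report = compare_configuration(intended, actual)
print(report)
```

In a real system the discovered configuration would come from hardware and software inventory rather than a literal dictionary.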
PAGE 40
Providing Open Architecture High Availability Solutions 5.3 Dynamic System Model 5.3.1 Introduction The system model provides the basis for both configuration and fault management and is a critical component in meeting availability targets within an HA system. The model is typically implemented in an in-memory database within the management middleware. This complete system model may use information from models of other components, such as a model of the hardware.
PAGE 41
Providing Open Architecture High Availability Solutions It also allows for the determination of how modifying (disabling, changing, adding, etc.) a component affects the other components within the system. • Logical Dependencies – As discussed in Section 4.0, system components may also have dependencies beyond the traditional physical dependencies. For example, an application on a system CPU card may depend on a database located on a separate CPU card.
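One way to picture how a system model answers the question "what is affected if this component changes?" is a walk over the dependency relationships. The sketch below assumes the model is held as a plain dictionary; the component names and the `affected_by` function are hypothetical.

```python
from collections import deque


def affected_by(component, depends_on):
    """Return every component that depends, directly or indirectly, on `component`.

    `depends_on` maps each component name to the set of components it needs.
    """
    # Invert the relationship: for each component, who depends on it?
    dependents = {}
    for name, needs in depends_on.items():
        for needed in needs:
            dependents.setdefault(needed, set()).add(name)

    affected = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


# Hypothetical model mixing physical and logical dependencies
model = {
    "application": {"cpu_card_1", "database"},
    "database": {"cpu_card_2"},
    "cpu_card_1": {"chassis_power"},
    "cpu_card_2": {"chassis_power"},
}

impact = affected_by("cpu_card_2", model)
print(sorted(impact))
```

Here the application is affected by the loss of the second CPU card only through its logical dependency on the database, which is the kind of relationship a purely physical model would miss.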
PAGE 42
Providing Open Architecture High Availability Solutions • Verifying Required System Component Population (System Model). This technique determines what components should be present in the system to provide a particular service. It works from a defined system model, detecting which components in the system model are present, and if those components are functional. Interdependencies among the components are also tracked and analyzed. • Obtaining Detailed Information about System Components (FRU).
PAGE 43
• View or modify configuration
• Monitor applications
• Enable a system administrator to remotely access and control the system and its components
• Enable a network management system to interface to the system and its components

The key interface areas include the platform interface and the application interface. The platform interface should provide other parts of the system (i.e.
PAGE 44
Providing Open Architecture High Availability Solutions Diagnostics. Diagnostics involves testing of the system components. This may be done while the system component is on-line or off-line. Diagnostic testing may also be done destructively or nondestructively. Testing that interrupts the normal functioning of the system must be coordinated so it is either done during off-peak hours or another redundant component can handle the normal system traffic during the testing period. Autonomous.
PAGE 45
Providing Open Architecture High Availability Solutions Management (CM) services. The operator uses the services of the CM to view the health of each component, determine which component has failed, and identify the component (field replaceable unit, FRU) so that it can be replaced. Once the component has been replaced, the CM service is involved in re-establishing the system model (redundancies, etc.).
PAGE 46
Providing Open Architecture High Availability Solutions The Intelligent Platform Management Interface is an industry standard method that provides a means by which hardware-based system components may report health status information. This allows a system to be monitored even if it is not fully functional or operational.
PAGE 47
5.5.2 Alarms

Alarms are intended to convey critical system exception information in an appropriate, effective, and timely manner. Definition. Alarms are typically autonomously generated messages that are triggered by specific causative stimuli.
PAGE 48
Providing Open Architecture High Availability Solutions Web-Based Enterprise Management (WBEM). WBEM is a standard being driven by the DMTF (Distributed Management Task Force) for managing groups of computers connected in a network. There are several standards feeding into WBEM, including Common Information Model (CIM) and Intelligent Platform Management Interface (IPMI). 5.5.4 User Interface A user can see what is going on within a system from three positions: At the system.
PAGE 49
5.6.4 Approach

Systems comprise a broad range of components that need to be upgraded, including hardware, operating systems, applications, and peripherals. In systems designed for service availability, many of these components are redundant. This enables the system to transfer operation away from the component that needs to be upgraded without any outage to the service. 5.6.
PAGE 50
6.0 System Capabilities – Fault Management

Managing faults in a system is typically a five-stage process.
1. Detection – The fault is found
2. Diagnosis – The cause of the fault is determined
3. Isolation – The rest of the system is protected from the fault
4. Recovery – The system is adjusted or re-started so it functions properly
5. Repair – A faulty system component is replaced

Notification of the fault occurs at many points in this process.
PAGE 51
Figure 12. Fault Management Flow Chart (boxes shown: Detection; Prediction; (On-Line) Diagnosis; Isolation; Notification; Recovery; (Off-Line) Diagnosis; Repair)

A fault occurs when a system component is not performing as expected. The severity of the fault can be evaluated by its effect on the service availability level of the system as a whole.
PAGE 52
Providing Open Architecture High Availability Solutions 6.1.2 Objective The objective of fault detection is to detect when a fault occurs, and pass information on the fault to the components responsible for diagnosis, isolation and recovery. This information would include the location and type of fault, time of occurrence, and perhaps the most likely next affected component. For example, if a fault occurs in a multiplication subroutine, it is useful to also know which routine is expecting the result. 6.
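The information passed on by fault detection could be carried in a small record such as the one sketched below. The `FaultReport` class and its field names are assumptions made for illustration; the document names the kinds of information but not a data format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class FaultReport:
    """Hypothetical record handed from detection to diagnosis, isolation and recovery."""
    location: str                         # where the fault was detected
    fault_type: str                       # what kind of fault it is
    occurred_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )                                     # time of occurrence
    next_affected: Optional[str] = None   # most likely next affected component, if known


# Mirrors the multiplication example: the routine waiting for the result is recorded too.
report = FaultReport(
    location="multiply()",
    fault_type="arithmetic overflow",
    next_affected="compute_invoice_total()",
)
print(report)
```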
PAGE 53
Detection of faults can occur through various avenues within a system. A fault may be detected at the source of the fault itself. A variety of components are designed to trap or report their own error or out-of-tolerance conditions. These detected faults can range from slight threshold incursions to complete component or resource failures.
PAGE 54
Providing Open Architecture High Availability Solutions Data integrity can be verified using many methods, most of which depend on either redundancy or summary information included within the data. Some of the methods may use sufficient redundancy to not only detect an error, but also to correct it. However, most methods contain only enough additional information to detect that the data is not valid. Examples of typical methods include parity, checksums, and Cyclic Redundancy Checks (CRCs).
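A minimal sketch of two of the detection-only methods named above, parity and a CRC, follows. It uses `zlib.crc32` from the Python standard library; the helper names and the sample payload are made up for illustration.

```python
import zlib


def even_parity_bit(data: bytes) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def protect(data: bytes):
    """Return the data together with its CRC-32 as summary information."""
    return data, zlib.crc32(data)


def is_valid(data: bytes, crc: int) -> bool:
    """Recompute the CRC-32 and compare it with the stored value."""
    return zlib.crc32(data) == crc


payload = b"system model snapshot"
stored, crc = protect(payload)

# Simulate a single-bit error in the first byte
corrupted = bytes([stored[0] ^ 0x01]) + stored[1:]

print(is_valid(stored, crc))
print(is_valid(corrupted, crc))
```

Both checks can only report that the data is no longer valid. Correcting the error would need additional redundancy, such as an error-correcting code or a second copy of the data.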
PAGE 55
6.1.6 Dependencies

Fault detection is heavily dependent on facilities designed into the system infrastructure. If a system is not designed to provide additional information or redundancy for detection of faults, many faults may go undetected.

6.2 Diagnosis

There are two sets of operations in HA systems that use the term diagnosis. The first is the set of immediate actions taken after a fault is found to isolate the fault and recover from it.
PAGE 56
Providing Open Architecture High Availability Solutions Granularity of Diagnosis. The objectives for diagnostics in a particular system determine the granularity. From the perspective of service availability in a system with redundant components, the diagnosis at a minimum must be able to identify which system component failed. More granularity may be required to support the recovery and notification actions.
PAGE 57
Providing Open Architecture High Availability Solutions On-Line Diagnosis On-line diagnosis is done while the system is running its normal tasks. This implies that the fault which created the need for diagnosis was not fatal to the system, nor did it require that a redundant component take over for the faulted component. Once the component suspected of having a fault is removed from normal system operation, further diagnosis is considered off-line.
PAGE 58
Providing Open Architecture High Availability Solutions As noted in Section 6.0, there is a fine line between isolation and recovery. For this section, fault isolation includes actions that prevent a fault from propagating, but do NOT make the system function correctly. Actions that change a system from either an inoperative or degraded state to full operation are considered fault recovery and are covered in Section 6.4. The recovery and fault isolation steps may be combined inseparably.
PAGE 59
maximum capacity. Removing the first power module that indicated it was not able to keep up will then cause the remaining power modules to be even more overloaded, resulting in the reverse reaction, where power modules would shut down, causing the entire system to fail.
PAGE 60
Providing Open Architecture High Availability Solutions 6.3.6 Dependencies Fault isolation is dependent on the results of the diagnosis as well as the definition of the system dependency tree. The dependencies of software modules or hardware components are the active results of the mapping of the system (system model defined in Section 5.3) and the results of reliability modeling (system modeling for reliability in Section 3.
PAGE 61
Providing Open Architecture High Availability Solutions Common techniques for recovery start with the ability to have some level of redundancy. Typically redundancy for a recovery action is either in time or in space. A redundant component is one that can be connected to the same inputs and can provide the same outputs as another component.
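The two kinds of redundancy can be sketched as follows, interpreting redundancy in time as repeating the operation on the same component and redundancy in space as switching to a redundant component. The function names, the retry count, and the use of `RuntimeError` to signal a fault are assumptions made for illustration.

```python
def recover(primary, standby, request, retries=2):
    """Try the primary component a few times, then fail over to the standby."""
    for _ in range(retries):
        try:
            # Redundancy in time: repeat the operation on the same component
            return primary(request)
        except RuntimeError:
            continue
    # Redundancy in space: a redundant component accepts the same input
    return standby(request)


calls = {"primary": 0}


def faulty_primary(request):
    """Hypothetical primary component that always faults."""
    calls["primary"] += 1
    raise RuntimeError("primary component faulted")


def healthy_standby(request):
    """Hypothetical redundant component providing the same service."""
    return f"standby handled {request}"


result = recover(faulty_primary, healthy_standby, "req-1")
print(result)
```

The standby can take over only because it can be connected to the same inputs and provide the same outputs as the primary, which is the definition of a redundant component given above.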
PAGE 62
Providing Open Architecture High Availability Solutions 6.5 Repair 6.5.1 Introduction Repair, in a live system, requires some form of hot replacement. Again, a system must be designed to support this activity. To repair a failed component, a replacement is hot inserted, powered on, connected to the bus, validated through off-line diagnostics, and configured. 6.5.2 Objective The objective of this process is to return the system to its original capabilities including levels of redundancy. 6.5.
PAGE 63
Providing Open Architecture High Availability Solutions components in the system (dependencies) and the conversions of any data storage that may need to be updated as well. Also, a software upgrade should include a rollback feature that allows the system to be returned to the original operation prior to the upgrade. Diagnostics are tools for verification. The final step in the repair action is to be sure that the new component is working properly. 6.5.5 Techniques Component Replacement.
PAGE 64
6.6.2 Objective

Notification is a key capability of the fault management process. The objective of notification is to enable management middleware and other system components to access fault reporting, state change, performance, and status information that could help predict faults proactively.

6.6.3 Concepts

Notification may include information context and content on: Autonomous Notification.
PAGE 65
Providing Open Architecture High Availability Solutions System Log. Event information, exception conditions, state changes and context information should be reported to and recorded in a structured event log, such as a system log. 6.6.4 Approach State changes (whether generated by faults or not) of hardware and software resources within the defined system model may signify increased or diminished capabilities, and should generate immediate autonomous notification messages.
PAGE 66
Providing Open Architecture High Availability Solutions Information Context / Content The content and context of the notification should be appropriate to the management interface. A non-recoverable media fault reported from the I/O driver to the calling thread would typically be limited to return code error information (lightweight).
PAGE 67
Providing Open Architecture High Availability Solutions error and recovery might be communicated and captured as warning information. Based upon a frequency or rate of change threshold, this type of warning might become a stronger alert and then an alarm notification to the layers and management interfaces above it. If the disk read condition is passed up to the I/O driver in the OS layer, the driver might attempt its own form of error recovery, perhaps resetting the controller and trying again.
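The escalation from warning to alert to alarm based on a frequency threshold might look like the sketch below. The window length, the two count thresholds, and the class name are hypothetical values chosen for illustration, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class EscalatingNotifier:
    """Raise the severity of a recovered error as it becomes more frequent."""

    def __init__(self, window_seconds=60.0, alert_count=3, alarm_count=5):
        self.window_seconds = window_seconds
        self.alert_count = alert_count
        self.alarm_count = alarm_count
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one recovered error and return the severity to report."""
        self._timestamps.append(timestamp)
        # Forget events that have fallen out of the sliding window
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        count = len(self._timestamps)
        if count >= self.alarm_count:
            return "alarm"
        if count >= self.alert_count:
            return "alert"
        return "warning"


notifier = EscalatingNotifier()
levels = [notifier.record(t) for t in (0, 10, 20, 30, 40, 200)]
print(levels)
```

An isolated recovered read error stays a warning, while a burst of them inside the window is passed upward with increasing severity.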
PAGE 68
Providing Open Architecture High Availability Solutions 6.7.4 Approach Fault prediction uses periodic or historic information gathered about a system and its components in an attempt to determine when and where a fault is most likely to occur. The data accumulated about the specified components or subsystems might entail previous failure information, device monitoring data, MTBF statistics, and applicable data gathered from associated components.
PAGE 69
Providing Open Architecture High Availability Solutions 7.0 Open-Architecture Systems The preceding sections have described the requirements of and an architectural approach to building high availability systems. However, the goal of the HA Forum is not simply to describe how to build high availability systems, but to describe how to build open-architecture high availability systems.
PAGE 70
Providing Open Architecture High Availability Solutions 7.2 Open-Architecture Building Blocks for High Availability Systems The first key step in creating the ability to build high availability systems from open-architecture building blocks is to identify what those building blocks are. Each building block will provide one part of the overall technology stack in a high availability computer system.
PAGE 71
Figure 13. Open Architecture Building Block in an HA System (blocks shown: Application Software; Operating System; Management Middleware; “Other” Middleware (e.g., DBMS, Protocol Stack); Hardware Platform)

The hardware platform consists of the entire set of hardware, firmware, etc., normally provided by a hardware system vendor, ready to support an operating system like Windows or Linux.
PAGE 72
Providing Open Architecture High Availability Solutions between the operating system and the application software and between the operating system and the other middleware are shown as narrow arrows (indicating that it is not an interface that has significance to the high availability capabilities in the system). However, even with this restrictive view, the operating system will require certain capabilities to operate in a high availability system.
PAGE 73
Providing Open Architecture High Availability Solutions 8.0 Layer-Specific Capabilities – Hardware High availability hardware system architectures are created by combining fault domains into service groups in such a way that the system can continue to operate even when any particular fault domain is out of service. A wide variety of fault domain configurations are possible.
PAGE 74
• Mass storage subsystems
• Peripheral devices
• Power supplies
• Cooling modules

8.2 Communication

The fault domains within a high availability system interact with each other to create a complete system. This interaction occurs through various communication mechanisms. For the purpose of a fault domain analysis, the communication mechanisms of significance are the ones between fault domains.
PAGE 75
Figure 15. Redundant Switched Interconnects (elements shown: two Host CPUs, two Switches, twelve I/O Cards). Each box and each of the switched networks is a fault domain in this system.

For fault tolerant systems, communications among fault domains will vary depending on the specific characteristics of each fault domain.
PAGE 76
Providing Open Architecture High Availability Solutions Because some communication links are difficult to terminate at multiple points in a system, and because redundancy in the external communication paths is desirable for its own sake, high availability systems are often designed with redundant external communication links, each of which is logically part of the fault domain that includes its termination point in the system. 8.
PAGE 77
When a system contains fault domains that are effectively in a standby mode, there is a need for detection of latent faults in these domains. That is, if the primary failure detection mechanism is observation of normal operating behavior, the hardware may need to provide a separate mechanism for detection of faults in fault domains which are not normally operating. 8.3.
PAGE 78
ordered to execute a system reset operation. If this still does not clear the problem, it may be ordered to power itself off. Similarly, if a particular I/O controller has failed in a system, it may be ordered to isolate itself from the I/O bus. If this does not work, a second level of isolation may be to isolate a slot on the backplane, or even an entire I/O bus segment (at the point of a PCI to PCI bridge, for example). 8.3.
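The escalating isolation described on this page (isolate the controller, then the slot, then the bus segment) can be sketched as an ordered list of actions that is walked until one clears the fault. This is a minimal illustration only: the function `escalate`, the helper `make_action`, and the step names are hypothetical, and a real system would issue these actions through its platform management interface.

```python
from typing import Callable, List, Optional, Tuple

# Each step is (name, action); the action returns True if the fault was cleared.
Step = Tuple[str, Callable[[], bool]]


def escalate(steps: List[Step]) -> Optional[str]:
    """Run isolation steps in order and stop at the first one that works.

    Returns the name of the successful step, or None if every step failed.
    """
    for name, action in steps:
        if action():
            return name
    return None


# Record of which actions were actually attempted, for illustration.
attempted: List[str] = []


def make_action(label: str, succeeds: bool) -> Callable[[], bool]:
    """Build a stand-in action (hypothetical) with a fixed outcome."""
    def action() -> bool:
        attempted.append(label)
        return succeeds
    return action


# Assumed escalation order for a failed I/O controller.
io_controller_steps: List[Step] = [
    ("isolate controller from I/O bus", make_action("controller", False)),
    ("isolate backplane slot", make_action("slot", True)),
    ("isolate I/O bus segment", make_action("segment", True)),
]

result = escalate(io_controller_steps)
```

In this sketch the widest isolation (the bus segment) is never attempted because the slot-level step succeeds, which keeps the out-of-service portion of the system as small as possible.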
PAGE 79
8.3.5 Fault Domain Repair
One of the most complex features of high availability systems is the need to repair failed fault domains while the system continues to operate. The specific hardware capabilities required to support this depend on the design of the fault domains.
PAGE 80
Since the physical repair of a hardware domain will involve direct hands-on interaction between a system technician and the actual system hardware, having visual guidance for the repair action directly on the hardware itself is highly desirable. This often takes the form of LEDs and/or other small display devices that can be controlled through the platform management system.
PAGE 81
Typically, these will involve monitoring analog values that can reflect the health of the hardware even when a fault has not occurred. For example, a fan may be slowing down, but still functioning within specifications. This may indicate that a bearing is wearing out, and with this warning, the fan can be replaced before a fault occurs.
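The fan example can be illustrated with a simple trend estimate over periodic readings. The following is a sketch under stated assumptions, not a prescribed method: the function names, the sample readings, and the 2500 RPM lower limit are all hypothetical, and a least-squares slope is only one of several ways to detect a decline.

```python
from typing import Optional, Sequence


def slope_per_sample(readings: Sequence[float]) -> float:
    """Least-squares slope of the readings against their sample index."""
    n = len(readings)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(readings))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def samples_until_limit(readings: Sequence[float],
                        lower_limit: float) -> Optional[float]:
    """Estimate how many more samples remain before the value drops below
    lower_limit. Returns None when the readings are not declining."""
    slope = slope_per_sample(readings)
    if slope >= 0:
        return None
    margin = readings[-1] - lower_limit
    if margin <= 0:
        return 0.0
    return margin / -slope


# Hypothetical fan speed readings (RPM), still within an assumed specification.
fan_rpm = [3000.0, 2980.0, 2960.0, 2940.0, 2920.0]
remaining = samples_until_limit(fan_rpm, lower_limit=2500.0)
```

With these assumed readings the fan is losing 20 RPM per sample, so the estimate gives service personnel a window in which to schedule replacement before the fan leaves its specified range.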
PAGE 82
• Out-of-band communications
• Inventory management through FRU information
• On-line firmware/software upgrades
• Notification services for service personnel, including remote access to annunciation devices such as LEDs
• Asynchronous notification of platform events
• Logging of platform events for fault diagnosis
8.5.
PAGE 83
The Telcordia and ITU standards for alarming go well beyond defining hardware capabilities, describing a complete approach to fault management. A complete treatment of this topic is beyond the scope of this paper, but these standards are a resource that any organization working on standards for fault management should consider.
PAGE 84
are used to route Ethernet packets between nodes. Like InfiniBand, Ethernet borrows its electrical signaling from Fibre Channel. Gigabit Ethernet uses a differential pair for full duplex transmission. This differential pair is quad bundled into a set of four full duplex differential pairs (8 wires in total) with each set operating at a quarter of the base (1 GHz) frequency. The total bandwidth is 1 Gbit/s.
PAGE 85
Rapid I/O
Rapid I/O (RIO) is a new interconnect technology that is being developed as a PCI replacement for board and backplane interconnect. The RIO Trade Association controls and develops the specifications. Thus, RIO is an open standard, publicly available to all trade association members. Like InfiniBand, RIO is fabric based. Thus, switches are used to route data between end nodes.
PAGE 86
9.0 Layer-Specific Capabilities – Operating System
The operating system hosts applications and provides process scheduling and resource control for applications, middleware and device drivers. An HA-aware OS provides typical OS services as well as services that are specifically designed to provide fault-management capabilities either directly, or by escalating information to other layers for fault management and resolution.
PAGE 87
Often, the above capabilities include a greater effort to stabilize the code and to ensure that the software conforms to appropriate and established software practices, such as code verification, code coverage analysis, elimination of dead code and consistent error code generation. Other examples of enhanced OS capabilities that improve reliability are discussed in the next several sections. 9.1.
PAGE 88
9.1.6 Appropriate panic() Behavior
A catastrophic system failure routine, or panic() routine, is used when a failure occurs from which the system cannot easily recover. Frequently these failures are system data structure corruptions. The result of the panic() routine is to crash the system, with the obvious impact on availability.
PAGE 89
9.3.1 Memory Protection
Modern software divides computer memory up into regions, principally for program code and several types of data. Without a hardware Memory Management Unit (MMU), these divisions are soft, enforced only by the way that development tools lay out memory and by how programmers follow the layout discipline.
PAGE 90
Processes can also be forced to exit asynchronously by sending a process kill signal, which can aid the fault diagnosis process. The ability to recover resources from a faulted, or shed, process aids in fault recovery. The open-standard Process ID (PID) structures allow the middleware to easily determine what applications are currently loaded and running.
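As a hedged illustration of these process controls, the sketch below uses Python's standard `os` and `subprocess` modules on a POSIX system (an assumption; the paper does not name a particular OS or API). The helper names `is_running` and `shed_process` are hypothetical. It checks whether a PID is present, asks the process to exit, and escalates to a forced kill if it does not.

```python
import os
import subprocess
import sys


def is_running(pid: int) -> bool:
    """Check for a PID by sending signal 0, which tests for the process
    without delivering any signal (POSIX behaviour)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def shed_process(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Request termination (SIGTERM), escalate to SIGKILL on timeout,
    and return the exit status once the process has been reaped."""
    proc.terminate()
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()


# Stand-in for a faulted application: a child process that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
was_running = is_running(child.pid)
status = shed_process(child)
still_running = is_running(child.pid)
```

On POSIX systems a negative exit status from `Popen.wait` reports the number of the signal that ended the process, which middleware could log as part of fault diagnosis.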
PAGE 91
OSs may also support a structured and polled system MIB. This MIB typically structures the kernel information according to a published structure. This information can be directly incorporated into the element management mechanism, or parsed ad hoc by the middleware, to gather required system status and state information.
PAGE 92
or suspended and redirected upon removal. Explicit control of device drivers and their association with hardware is required so systems integrators can create and enforce policies governing the working of the system. In a traditional system, the hardware control software (typically a device driver) is very tightly associated with the hardware it controls.
PAGE 93
10.0 Layer-Specific Capabilities – Management Middleware
In availability management, hardware or software faults are not avoided but are expected to occur, and the system is designed to anticipate and work around faults before they become system failures.
PAGE 94
10.1 Collect System Data in Real Time
All critical system components must be continuously monitored and managed in a unified solution. This includes hardware, software, operating system and applications. Availability management software, therefore, must include an interface to these components. A flexible approach to use for this purpose is an object-oriented framework using managed objects.
PAGE 95
10.2 Configuring and Maintaining State-Aware Model of the Total System
Availability management requires a system-wide model that can represent all managed components in the system, changing information and the intricacies of each component’s dependencies and interdependencies. The management software needs this information to make quick and appropriate reconfigurations when necessary.
PAGE 96
Figure 17. Example Directed Graph Describing a Managed System
[Figure: directed graph showing group membership and dependency relationships among Service1, Apps, Comms, App1 to App6, O/S1 to O/S6, Host H/W1 and H/W2, Line H/W3 to H/W6, Power (PS1 to PS3) and Fans (Fans1 to Fans3).]
In addition, service group dependencies also must be modeled, monitored and managed.
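A directed dependency graph of this kind lends itself to a simple traversal that answers the question "which components are affected if this one fails?". The sketch below is illustrative only: the component names are borrowed from the labels in Figure 17, but the specific edges are assumptions (the figure's actual edges are not reproduced here), and `affected_by` is a hypothetical helper.

```python
from collections import deque
from typing import Dict, List, Set

# Edges point from a component to the components it depends on.
# These particular edges are assumed for illustration.
depends_on: Dict[str, List[str]] = {
    "App1": ["O/S1"],
    "App2": ["O/S2"],
    "O/S1": ["Host H/W1"],
    "O/S2": ["Host H/W2"],
    "Host H/W1": ["PS1", "Fans1"],
    "Host H/W2": ["PS2", "Fans2"],
}


def affected_by(failed: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Return every component that depends, directly or indirectly,
    on the failed component."""
    # Invert the edges so each component maps to its direct dependents.
    dependents: Dict[str, List[str]] = {}
    for component, deps in graph.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(component)

    # Breadth-first walk upward from the failed component.
    seen: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


impacted = affected_by("PS1", depends_on)
```

Under these assumed edges, a failure of PS1 affects Host H/W1, O/S1 and App1, while the second host and its application are untouched. Management middleware could use such a result to decide which services need to be switched to redundant components.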
PAGE 97
To provide service availability, the current state of transactions often must be maintained in a hot standby redundant component. This means that ongoing transaction data and application state data must be continuously delivered (checkpointed) to a hot standby location.
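A minimal sketch of checkpointing follows, assuming an in-process standby and sequence-numbered updates; the class names `ActiveComponent` and `StandbyReplica` and the call-state data are hypothetical. A real system would deliver checkpoints over an interconnect and would need to handle lost messages, which this sketch does not.

```python
from typing import Any, Dict, Tuple


class StandbyReplica:
    """Holds the most recent checkpointed state for a hot standby."""

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {}
        self.last_seq = 0

    def receive(self, seq: int, update: Tuple[str, Any]) -> bool:
        """Apply a checkpoint only if it is newer than anything seen so far.
        Returns True when the update was applied."""
        if seq <= self.last_seq:
            return False  # stale or duplicate checkpoint, ignore it
        key, value = update
        self.state[key] = value
        self.last_seq = seq
        return True


class ActiveComponent:
    """Applies each state change locally, then checkpoints it."""

    def __init__(self, standby: StandbyReplica) -> None:
        self.state: Dict[str, Any] = {}
        self.seq = 0
        self.standby = standby

    def update(self, key: str, value: Any) -> None:
        self.state[key] = value
        self.seq += 1
        self.standby.receive(self.seq, (key, value))


standby = StandbyReplica()
active = ActiveComponent(standby)
active.update("call-17", "ringing")
active.update("call-17", "connected")
active.update("call-42", "ringing")

# A late duplicate of an earlier checkpoint must not overwrite newer state.
duplicate_applied = standby.receive(2, ("call-17", "ringing"))
```

Because every change is delivered as it happens, the standby can take over with the same transaction state the active component last recorded.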
PAGE 98
10.5 Performing Rapid Recovery
In general, completion of the fault management cycle includes recovery from the fault, as well as reporting to administration and repair and reintegration as needed. In complex systems with both parent/child dependencies and multi-layered service dependencies, the management of fault recovery actions requires multiple factoring of hierarchies of cluster-wide dependency and availability issues.
PAGE 99
The management middleware also controls service availability groups (the collections of managed components in redundancy relationships such as 2N or N+1) and makes role assignments (active, standby, spare) within them.
10.7 Providing Administrative Access and Control
The availability functions of the management middleware described so far comprise an automatic, self-managing system.
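The role assignments within a service availability group, as described at the top of this page, can be sketched as a small table that is updated automatically when a member fails. The function names, the member names, and the promotion policy (first available standby becomes active, first spare becomes standby) are assumptions made for illustration.

```python
from typing import Dict, List

ACTIVE, STANDBY, SPARE, FAILED = "active", "standby", "spare", "failed"


def assign_roles(members: List[str], n_active: int,
                 n_standby: int) -> Dict[str, str]:
    """Assign roles in list order: actives first, then standbys, then spares."""
    roles: Dict[str, str] = {}
    for index, member in enumerate(members):
        if index < n_active:
            roles[member] = ACTIVE
        elif index < n_active + n_standby:
            roles[member] = STANDBY
        else:
            roles[member] = SPARE
    return roles


def promote(roles: Dict[str, str], from_role: str, to_role: str) -> bool:
    """Move the first member holding from_role into to_role."""
    for member, role in roles.items():
        if role == from_role:
            roles[member] = to_role
            return True
    return False


def fail(roles: Dict[str, str], member: str) -> None:
    """Mark a member failed and refill the roles it leaves vacant."""
    previous = roles[member]
    roles[member] = FAILED
    if previous == ACTIVE:
        promote(roles, STANDBY, ACTIVE)
        promote(roles, SPARE, STANDBY)
    elif previous == STANDBY:
        promote(roles, SPARE, STANDBY)


# An N+1 group with N = 2 active members, one standby and one spare.
roles = assign_roles(["A", "B", "C", "D"], n_active=2, n_standby=1)
fail(roles, "A")
```

After member A fails in this example, C takes over the active role and D moves from spare to standby, so the group keeps two active members and one standby without operator intervention.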
PAGE 100
11.0 Layer-Specific Capabilities – Applications
When reviewing capabilities for high availability systems from an application perspective, it is important to understand the application objectives. Many different applications are required to make a system. There are many ways to classify applications and the capabilities of applications to participate, control, or operate within a highly available system.
PAGE 101
11.3 State Preservation
A common method for state preservation should be provided for an application. This preservation should allow an application to restart at a known state. This preservation may require some level of replication of the data. In the case where an application is to restart on the current processor, the state could be preserved in volatile or non-volatile memory, or even on storage media such as disk or tape.
PAGE 102
It is inevitable that a situation will occur which requires a rejuvenation of the system. An application must be able to trigger this reboot operation as a last resort within a recovery operation.
11.5 Resilience
Resilience is the property of a component that allows it to continue full or partial function after one or more faults occur.
PAGE 103
12.0 Glossary
5-nines – Maintaining availability 99.999% of the time.
2-N – A method of redundancy where there is one component in standby for every component in operation.
Active – Component currently in use providing a service.
Active Fault – A fault that is currently causing an error. Active faults are not necessarily detected, although they should be in a well-designed HA system.
PAGE 104
Corrective maintenance – Maintenance for the purpose of fixing a known or expected error in the system.
Curative maintenance – Maintenance for the purpose of fixing a known error in the system.
Data Isolation – Using memory management to keep data from one program, task, or thread from interfering with data from other program areas.
PAGE 105
Fault – A problem in a component where the response was either not correct or not timely.
Fault detector – A hardware or software component that checks for faults.
Fault domain – A group of components that is replaced when a fault is detected in any of the components.
Fault management – The process of Detection, Diagnosis, Isolation, Recovery and Repair of a faulted component.
PAGE 106
Indirect Detection – Detection of a fault by a method other than directly measuring or comparing the value which is faulted. Indirect detection is used for time-based errors and where direct measurement is difficult. For example, chassis temperature can be used to indirectly detect fan speed or CPU temperature problems.
PAGE 107
Management middleware – The software within a system responsible for managing that system. This software is typically provided as a separate package, although some operating systems may also provide these functions.
Mean time to failure (MTTF) – The average time between one failure and the next one as measured (or projected) over a large number of failures.
PAGE 108
Physical Isolation – Isolating a component from the rest of the system by electrical disconnection, either using switches or by removing a board.
Platform management – Managing a hardware platform using control features of that platform. IPMI is frequently used to provide a platform management function.
Preventative maintenance – Maintenance performed on a system to prevent it from failing while it is needed in operation.
PAGE 109
Robustness – The property of a software component, particularly an OS, that incorporates tests for many error conditions and has been designed in a way which protects it from errant behavior.
Role – The function a component plays in a redundant system. Typical roles are active, standby, and unassigned or spare.
PAGE 110
System model – A computer-usable representation of the capabilities, characteristics and dependencies of all of the components that could be included in a system.
Temporal redundancy – Redundancy provided by re-performing operations. Most network protocols provide this form of redundancy, as raw network traffic is inherently subject to errors.
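Temporal redundancy, as defined above, can be illustrated by a bounded retry wrapper that re-performs an operation after a transient error. The names `with_retries` and `flaky_send` are hypothetical, and treating `OSError` as the transient failure type is an assumption made for this sketch.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(operation: Callable[[], T], attempts: int) -> T:
    """Re-perform the operation until it succeeds or the attempts run out.

    Only OSError is treated as transient here; the last error is re-raised
    if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[OSError] = None
    for _ in range(attempts):
        try:
            return operation()
        except OSError as exc:
            last_error = exc
    assert last_error is not None
    raise last_error


# Stand-in for an unreliable transfer: fails twice, then succeeds.
calls = {"count": 0}


def flaky_send() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("simulated transmission error")
    return "delivered"


outcome = with_retries(flaky_send, attempts=5)
```

The first two simulated errors are masked by repetition, so the caller sees only the successful third attempt. The bound on attempts keeps a persistent fault from being retried indefinitely.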
PAGE 111
13.0 Bibliography
Section 3:
[Ande81] Anderson, T. and Lee, P.A., Fault Tolerance – Principles and Practice, Prentice-Hall, 1981.
[DHB3’00] D. H. Brown Associates, Inc. (DHBA), Competitive Analysis of UNIX Cluster Functionality – Part One of Two Part HA Study, March 2000.
[Gray92] Gray, Jim and Reuter, Andreas, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishing, San Mateo, CA, 1992.
[Lapr92] Laprie, J.C.
PAGE 112
IPMI – Intelligent Platform Management Interface: http://developer.intel.com/design/servers/ipmi/
PICMG – PCI Industrial Computer Manufacturers Group: http://www.picmg.org/
RMON – Remote Monitoring: http://www2.ietf.org/rfc/rfc2819.txt?number=2819
SNMP – Simple Network Management Protocol: http://www.ietf.org/ids.by.wg/snmpv3.html
TMN – Telecommunications Management Network: http://www.itu.int/TMN/
X.731 – ITU recommendation X.