Providing Open Architecture High Availability Solutions
Since the physical repair of a hardware domain will involve direct hands-on interaction between a
system technician and the actual system hardware, having visual guidance for the repair action
directly on the hardware itself is highly desirable. This often takes the form of LEDs and/or other
small display devices that can be controlled through the platform management system.
On-Line Firmware Upgradability
Increasingly, hardware components contain field-programmable devices. This affords an
opportunity to perform repairs and upgrades of the hardware without having to physically remove
and replace components from the system. This can significantly reduce MTTR and eliminate
opportunities for error during a physical hardware swap. Thus,
when programmable devices are used in a system design, having the capability to upgrade the
firmware is highly desirable.
Maintenance of a System Inventory
In any system where hardware components and configuration can change over time, it is critical to
be able to answer the question, “What is currently installed?” Answering it requires a discovery capability
that can detect all installed hardware. Individually replaceable units should be able to report, at a
minimum, what they are (including revision level), what options they contain (if applicable), and
unique tracking numbers. Typically, this information should be made available to the platform
management system.
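
As an illustration only, the minimum self-description listed above (identity and revision level, options, and a unique tracking number) could be modeled as a small record kept by the platform management system. The names `UnitRecord`, `SystemInventory`, `refresh`, and `find`, and the `slot` field, are hypothetical and are not defined by this document; the sketch assumes a discovery pass has already collected the reports.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class UnitRecord:
    """Self-description reported by one replaceable unit (hypothetical fields)."""
    slot: str                       # assumed: where the unit is installed
    unit_type: str                  # what the unit is
    revision: str                   # revision level
    tracking_number: str            # unique tracking number
    options: Tuple[str, ...] = ()   # options the unit contains, if applicable


class SystemInventory:
    """Answers "What is currently installed?" from the latest discovery pass."""

    def __init__(self) -> None:
        self._by_slot: Dict[str, UnitRecord] = {}

    def refresh(self, discovered: Iterable[UnitRecord]) -> None:
        # Replace the whole inventory so that removed units do not linger.
        self._by_slot = {unit.slot: unit for unit in discovered}

    def installed(self) -> List[UnitRecord]:
        # Return records in slot order for stable reporting.
        return [self._by_slot[slot] for slot in sorted(self._by_slot)]

    def find(self, tracking_number: str) -> Optional[UnitRecord]:
        # Look up a unit by its unique tracking number.
        for unit in self._by_slot.values():
            if unit.tracking_number == tracking_number:
                return unit
        return None


inventory = SystemInventory()
inventory.refresh([
    UnitRecord("slot-2", "cpu-board", "B1", "SN-0002", ("8GB-RAM",)),
    UnitRecord("slot-1", "power-supply", "A3", "SN-0001"),
])
for unit in inventory.installed():
    print(unit.slot, unit.unit_type, unit.revision, unit.tracking_number)
```

Replacing the inventory wholesale on each discovery pass is one simple design choice; a real system might instead compare passes to report insertions and removals.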
8.3.6 Event Notification
Throughout all the capabilities of platform management, a common requirement is to provide
notification of significant events to other parts of the system, most notably the management
middleware component of a high availability system. As significant events occur (fault detections,
reconfigurations, etc.), the platform management system should provide information as to what has
occurred. Desirable capabilities of the platform management event notification function include:
• A common format of event messages for all events
• Asynchronous generation of event messages, rather than requiring software to poll devices
to learn of events
• A publish/subscribe style interface for communicating events from the platform management
system to interested software
• Automatic storage of events in non-volatile memory so they are not lost if there is no current
event listener
• A system-wide synchronized time stamp on events
• A common classification system for events to identify the severity, urgency, etc., of each event
• Inclusion of enough information with an event to permit immediate fault management action
as well as later root cause analysis of system faults
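
The capabilities listed above can be sketched in a few lines, under stated assumptions. Everything named here (`Severity`, `Event`, `EventNotifier`, the field names, and the JSON-lines log file standing in for non-volatile memory) is hypothetical and not part of any platform management interface described in this document. Subscribers are called back when an event is published, so they do not poll, although in this minimal version the callbacks run in the publisher's thread. The injected `clock` stands in for a system-wide synchronized time source.

```python
import json
import os
import time
from dataclasses import asdict, dataclass, field
from enum import IntEnum
from typing import Callable, Dict, List, Optional


class Severity(IntEnum):
    """Common classification applied to every event (assumed levels)."""
    INFO = 0
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Event:
    """One common message format for all events."""
    timestamp: float               # taken from the shared clock
    source: str                    # component that raised the event
    kind: str                      # e.g. "fault_detected", "reconfigured"
    severity: Severity
    details: Dict[str, str] = field(default_factory=dict)


class EventNotifier:
    def __init__(self, log_path: str,
                 clock: Callable[[], float] = time.time) -> None:
        self._log_path = log_path
        self._clock = clock
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, callback: Callable[[Event], None]) -> None:
        # Interested software registers once and is then called back.
        self._subscribers.append(callback)

    def publish(self, source: str, kind: str, severity: Severity,
                details: Optional[Dict[str, str]] = None) -> Event:
        event = Event(self._clock(), source, kind, severity, dict(details or {}))
        # Store the event first so it survives even with no current listener.
        record = asdict(event)
        record["severity"] = int(event.severity)
        with open(self._log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
        # Then push it to every current subscriber.
        for callback in list(self._subscribers):
            callback(event)
        return event

    def stored_events(self) -> List[dict]:
        # Read back everything recorded so far, for later analysis.
        if not os.path.exists(self._log_path):
            return []
        with open(self._log_path, encoding="utf-8") as log:
            return [json.loads(line) for line in log if line.strip()]
```

A listener registered with `subscribe` receives each subsequent `Event` object directly, while `stored_events` returns every event recorded so far, including those published before any listener existed.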
8.3.7 Additional Useful Hardware Capabilities
The platform management capabilities described above represent the minimum required to support
fault detection, diagnosis, isolation, recovery, and repair of fault domains. Beyond this minimum,
additional platform management capabilities of the hardware in high availability systems can be
provided in order to predict faults and prevent them from occurring in the first place.