VCEM Profile Failover and Profile Moves White Paper
For example, if a local array controller fails but the server is configured for SAN boot and
the workload does not access local drives, the array controller is not a critical component.
When components are configured redundantly, a partner component can assume a failed
partner’s load.
Reporting critical events from a server
When a non-redundant component, such as a single CPU, fails instantly, there is no
opportunity for the HP Systems Insight Manager agents on that server to report the failure.
There are many examples of component failures that are not usually reported by HP Systems
Insight Manager agents.
Reporting component status also requires that the server’s operating system is running and
that network communications between the server and HP Systems Insight Manager are working.
A server can fail to deliver services because its hardware has failed critically or because its
operating system has hung or crashed. But if the root cause of the failure is unknown, you
cannot be certain that replacing the server will remedy the problem. For example, if the boot
image has become unusable, a replacement server will fail in the same way. In other cases,
replacement may not be necessary at all, such as when a reboot would restore functionality.
In any case, a system experiencing this type of issue is not able to send an event to
HP Systems Insight Manager.
The HP Systems Insight Manager “System Unreachable” event
System Unreachable is an optional but commonly used system status change event. When
configured, HP Systems Insight Manager generates the “System Unreachable” event
whenever a system does not respond to a ping issued by the HP Systems Insight Manager
Hardware Status Polling task.
The System Unreachable event can be configured to cause the system to perform failover.
The drawback is that System Unreachable is not a “root cause” event. It occurs whether the
root cause is a critical hardware failure, a workload software failure, or a failure of the
intervening communications network. (In the case of a network failure, many servers could be
triggered to perform failover.)
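The following Python sketch illustrates this drawback. It is a simplified model, not HP Systems Insight Manager code: the names Event, poll, and simulate, the bay names, and the hidden_cause mapping are all hypothetical. It shows that a ping-based poll produces the same event regardless of root cause, and that a single network failure produces one event per server.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Event:
    """Hypothetical event record: it names the system but carries no root cause."""
    system: str
    kind: str = "System Unreachable"


def poll(systems: List[str], ping: Callable[[str], bool]) -> List[Event]:
    """Return one event for each system that does not answer the ping."""
    return [Event(name) for name in systems if not ping(name)]


def simulate(systems: List[str], hidden_cause: Dict[str, str]) -> List[Event]:
    """Poll with a stand-in ping: any system with a hidden cause does not answer.

    hidden_cause maps a system name to its (assumed) root cause. The cause is
    known only to the simulation; it never reaches the generated event.
    """
    return poll(systems, lambda name: name not in hidden_cause)


SYSTEMS = ["bay1", "bay2", "bay3", "bay4"]

# One server with a critical hardware failure: one event.
hardware = simulate(SYSTEMS, {"bay2": "critical hardware failure"})

# A failure of the intervening network: every server looks unreachable.
network = simulate(SYSTEMS, {name: "network failure" for name in SYSTEMS})

for label, events in (("hardware", hardware), ("network", network)):
    print(label, [e.system for e in events])
```

In this model the event for bay2 is identical in both runs, so a failover rule keyed only on this event cannot distinguish the two situations.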
Failover events and service level ratings
The ideal goal for failover events is 100% accuracy, so that failover is triggered in:
• All situations in which a critical server hardware failure occurs.
• No situations in which the server remains capable of meeting its service level objectives.
If event detection overlooks a critical server hardware failure, then services will not be
restored until the failure is discovered by other means such as a call to the help desk.
If the need to fail over is falsely detected, services are interrupted during the failover and
system restart, adversely impacting the service level and perhaps making the service
unavailable during a critical period.
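As a rough illustration of how both kinds of inaccuracy affect a service level, the sketch below adds up downtime over a period. The function name failover_outcomes, its parameters, and all numbers are hypothetical assumptions chosen for the example; they are not measurements or product defaults. It assumes that a detected failure and a false trigger each cost one failover interval, and that a missed failure costs a longer manual recovery interval.

```python
def failover_outcomes(critical_failures, detected, false_triggers,
                      failover_minutes, manual_recovery_minutes,
                      period_minutes):
    """Estimate detection rate and availability for one reporting period."""
    if not 0 <= detected <= critical_failures:
        raise ValueError("detected must be between 0 and critical_failures")
    missed = critical_failures - detected
    downtime = (
        detected * failover_minutes            # correct failovers
        + missed * manual_recovery_minutes     # failures found by other means
        + false_triggers * failover_minutes    # unnecessary failovers
    )
    return {
        "detection_rate": detected / critical_failures if critical_failures else 1.0,
        "missed": missed,
        "downtime_minutes": downtime,
        "availability": 1 - downtime / period_minutes,
    }


# Assumed example: a 30-day period, 4 critical failures of which 3 are
# detected, 2 false triggers, 10-minute failover, 120-minute manual recovery.
result = failover_outcomes(
    critical_failures=4, detected=3, false_triggers=2,
    failover_minutes=10, manual_recovery_minutes=120,
    period_minutes=30 * 24 * 60,
)
print(result)
```

With these assumed inputs, the single missed failure accounts for 120 of the 170 minutes of downtime, and the two false triggers account for 20. Substituting your own counts and recovery times gives a first estimate of whether a candidate detection event is accurate enough.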
Before relying on event-initiated failover, ensure that the accuracy of the selected failure
detection events will result in acceptable service level ratings, given your local installation,
configurations, workloads, service level objectives, and operations policies.