VCEM Profile Failover and Profile Moves White Paper

For example, if a local array controller fails but the server is configured for SAN boot and the

workload does not access local drives, the array controller is not a critical component.

When components are configured redundantly, a partner component can assume a failed

partner’s load.

Reporting critical events from a server

When a non-redundant component, such as a single CPU, fails instantly, there is no

opportunity for the HP Systems Insight Manager agents on that server to report the failure.

There are many examples of component failures that are not usually reported by HP Systems

Insight Manager agents.

Reporting component status also requires that the server’s operating system is running and that

network communications between the system and HP Systems Insight Manager are working.

A server can fail to deliver services because its hardware or its operating system (hang or

crash) failed critically. But if the root cause of the failure is unknown, you cannot be certain

that replacing the server will remedy the problem. For example, if the boot image has

become unusable, replacing the server will not remedy the problem. In other cases,

replacement may not be necessary at all, for example, where a reboot would restore

functionality. In any case, a system experiencing this type of issue is not able to send an

event to HP Systems Insight Manager.

The HP Systems Insight Manager “System Unreachable” event

System Unreachable is an optional but commonly used system status change event. When

configured in HP Systems Insight Manager, the “System Unreachable” event is raised

whenever a system does not respond to a ping issued by the HP Systems Insight Manager

Hardware Status Polling task.

The System Unreachable event can be configured to cause the system to perform failover.

The drawback is that System Unreachable is not a “root cause” event. It occurs whether the

root cause is a critical hardware failure, workload software failure, or failure of the

intervening communications network. (In the case of a network failure, many servers could be

triggered to perform failover.)
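
To illustrate why System Unreachable is not a root-cause event, the following minimal Python sketch models one polling pass. The names used here (RootCause, System, poll, and the bay names) are hypothetical and are not part of HP Systems Insight Manager. The model simply assumes that a system answers a ping only when none of the three root causes listed above is present.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class RootCause(Enum):
    HARDWARE = "critical hardware failure"
    SOFTWARE = "workload software failure"
    NETWORK = "failure of the intervening communications network"


@dataclass
class System:
    name: str
    fault: Optional[RootCause] = None  # None means the system is healthy

    def responds_to_ping(self) -> bool:
        # Assumption for this sketch: any of the three faults prevents a reply.
        return self.fault is None


def poll(systems: List[System]) -> List[str]:
    """Return the names of systems that would raise System Unreachable."""
    return [s.name for s in systems if not s.responds_to_ping()]


# Hypothetical systems, each with a different root cause (or none).
systems = [
    System("bay1", RootCause.HARDWARE),
    System("bay2", RootCause.SOFTWARE),
    System("bay3", RootCause.NETWORK),
    System("bay4"),
]
unreachable = poll(systems)
print(unreachable)
```

In this model the polling pass reports bay1, bay2, and bay3 identically, so an action keyed to the event alone cannot tell which of them has a failure that replacing the server would actually remedy.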

Failover events and service level ratings

The ideal goal for failover events is 100% accuracy, so that failover is triggered for:

• All situations in which a critical server hardware failure occurs.

• No situations in which the server remains capable of meeting its service level objectives.

If event detection overlooks a critical server hardware failure, then services will not be

restored until the failure is discovered by other means such as a call to the help desk.

If the need to fail over is falsely detected, services are interrupted during the failover and

system restart, adversely impacting the service level and perhaps making the service

unavailable during a critical period.

Before relying on event-initiated failover, ensure that the accuracy of the selected failure

detection events will result in acceptable service level ratings, given your local installation,

configurations, workloads, service level objectives, and operations policies.
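
One way to assess that accuracy is to tally, over a review period, how the selected event behaved against what actually happened. The Python sketch below is illustrative only: the Observation record, its fields, and the sample data are hypothetical, and the two rates are ordinary missed-detection and false-trigger ratios rather than metrics defined by HP Systems Insight Manager or VCEM.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    critical_failure: bool    # did a critical server hardware failure occur?
    failover_triggered: bool  # did the selected event trigger failover?


def failover_accuracy(log: List[Observation]) -> Tuple[float, float]:
    """Return (missed_failure_rate, false_failover_rate) for a review period."""
    failures = [o for o in log if o.critical_failure]
    healthy = [o for o in log if not o.critical_failure]
    # Critical failures that did not trigger failover (overlooked failures).
    missed = sum(1 for o in failures if not o.failover_triggered)
    # Failovers triggered while the server could still meet its objectives.
    false_triggers = sum(1 for o in healthy if o.failover_triggered)
    missed_rate = missed / len(failures) if failures else 0.0
    false_rate = false_triggers / len(healthy) if healthy else 0.0
    return missed_rate, false_rate


# Hypothetical sample data, not measurements from any real installation.
log = [
    Observation(True, True),
    Observation(True, False),
    Observation(False, True),
    Observation(False, False),
    Observation(False, False),
    Observation(False, False),
]
missed_rate, false_rate = failover_accuracy(log)
print(f"missed failures: {missed_rate:.0%}, false failovers: {false_rate:.0%}")
```

The ideal described above corresponds to both rates being zero. Which nonzero values are tolerable depends on the service level objectives and operations policies of the installation.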