VCEM Profile Failover and Profile Moves

When a component is operating in a degraded state and its future failure threatens physical
damage to the server or the integrity of its retained data, there is cause to fail over the
server. Examples are thermal abnormalities and certain CPU and memory error conditions. HP SIM
rates conditions that indicate impending failures as “major” events; for failover purposes,
however, these events may also be considered critical.

The server configuration and workload may further qualify what counts as a critical component
for any individual server. For example, suppose the local array controller fails. If the server
is configured for SAN boot and the workload does not access local drives, then the array
controller is not a critical component, because the server would continue to operate after the
failure without impact to the workload. As mentioned above, a component that is configured
redundantly, where a redundant partner can assume a failed partner’s load, is also not critical.
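
The array controller example can be sketched as a small predicate. This is only an
illustration of the reasoning: the ServerConfig type, its field names, and the
array_controller_is_critical function are hypothetical and are not part of HP SIM or VCEM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerConfig:
    """Hypothetical description of one server; not an HP SIM or VCEM type."""
    san_boot: bool                    # server boots from SAN rather than local drives
    workload_uses_local_drives: bool  # workload reads or writes local drives


def array_controller_is_critical(config: ServerConfig,
                                 has_redundant_partner: bool = False) -> bool:
    """Return True if a local array controller failure should count as critical."""
    # A redundant partner can assume the failed controller's load.
    if has_redundant_partner:
        return False
    # Without redundancy, the controller matters only if the server boots
    # from local drives or the workload accesses them.
    return (not config.san_boot) or config.workload_uses_local_drives


san_only = ServerConfig(san_boot=True, workload_uses_local_drives=False)
local_boot = ServerConfig(san_boot=False, workload_uses_local_drives=False)
print(array_controller_is_critical(san_only))    # SAN boot, no local drive use
print(array_controller_is_critical(local_boot))  # boots from local drives
```

The same pattern would apply to other components: criticality is evaluated per server,
from its configuration and workload, rather than fixed per component type.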

Reporting critical events from a server

When a non-redundant component, such as a single CPU, fails instantly, there is no opportunity
for the HP SIM agents on that server to report the failure. There are many examples of
component failures that, in practice, are usually not reported by HP SIM agents. Reporting
component status also requires that the server’s operating system and the network
communications between the system and HP SIM be working.

A server can fail to deliver services because its hardware or its operating system has failed
critically (for example, an operating system hang or crash). But if the root cause of the
failure is unknown, it is not certain that replacing the server will remedy the problem (for
example, if the boot image has become unusable), or that replacement is necessary at all (for
example, when a reboot would restore functionality). In any case, such a system is not able to
send an event to HP SIM.

The HP SIM “System Unreachable” event

System unreachable is an optional but commonly used system status change event. When
configured in HP SIM, the “system unreachable” event is raised whenever a system does not
respond to a ping issued by the HP SIM Hardware Status Polling task.

The system unreachable event can be configured to cause the system to fail over. The drawback
is that system unreachable is not a “root cause” event. It occurs whether the root cause is a
critical hardware failure, a workload software failure, or a failure of the intervening
communications network. (In the case of a network failure, many servers could be triggered to
fail over.)
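
The drawback can be illustrated with a minimal simulation. The names poll_unreachable and
make_ping, and the host names, are hypothetical stand-ins for one pass of a ping-based
status poll; they are not HP SIM APIs, and no real network traffic is sent.

```python
from typing import Callable, Iterable, List, Set


def poll_unreachable(systems: Iterable[str],
                     ping: Callable[[str], bool]) -> List[str]:
    """Return the systems that did not answer a ping during one polling pass."""
    return [name for name in systems if not ping(name)]


def make_ping(down_hosts: Set[str], network_up: bool = True) -> Callable[[str], bool]:
    """Build a simulated ping: a host answers only if it is up and the network works."""
    def ping(name: str) -> bool:
        return network_up and name not in down_hosts
    return ping


systems = ["blade1", "blade2", "blade3"]  # hypothetical host names

# One server has a critical hardware failure: only that server is unreachable.
hardware_failure = poll_unreachable(systems, make_ping({"blade2"}))

# The intervening network fails: every healthy server looks unreachable too.
network_failure = poll_unreachable(systems, make_ping(set(), network_up=False))

print(hardware_failure)
print(network_failure)
```

In both cases the poll sees only missing ping replies, so the event alone cannot
distinguish one failed server from a network outage that affects all of them.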

Failover events and service level ratings

The ideal for failover events is 100% accuracy, so that failover is triggered for
• all conditions where a critical server hardware failure occurs, and
• no condition where the server remains capable of meeting its service level objectives.

If event detection overlooks a critical server hardware failure, then services will not be
restored until the failure is discovered by other means, for example, a call to the help desk
from a user.

If the need to fail over is falsely sensed, then services are interrupted during the failover
and system restart, adversely impacting the service level rating and perhaps making the service
unavailable during a critical period of use.
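
The two kinds of error described above can be tallied as follows. This is a minimal sketch:
the failover_accuracy function and the sample outcomes are hypothetical illustrations, not
measured data and not an HP SIM feature.

```python
from typing import Dict, Iterable, Optional, Tuple


def failover_accuracy(
    outcomes: Iterable[Tuple[bool, bool]],
) -> Dict[str, Optional[float]]:
    """Tally failover decisions.

    Each outcome is (critical_failure_occurred, failover_triggered).
    """
    missed = false_failovers = correct = 0
    for critical, triggered in outcomes:
        if critical and not triggered:
            missed += 1            # overlooked critical failure
        elif triggered and not critical:
            false_failovers += 1   # falsely sensed need to fail over
        else:
            correct += 1           # failover matched the actual condition
    total = missed + false_failovers + correct
    return {
        "missed_failures": missed,
        "false_failovers": false_failovers,
        "accuracy": correct / total if total else None,
    }


# Hypothetical sample: one correct failover, one missed failure,
# one false failover, and one healthy server correctly left alone.
sample = [(True, True), (True, False), (False, True), (False, False)]
print(failover_accuracy(sample))
```

The ideal stated above corresponds to zero missed failures and zero false failovers.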