VCEM Profile Failover and Profile Moves

When a component is operating in a degraded state and its eventual failure threatens
physical damage to the server or the integrity of its retained data, there is cause to fail over
the server. Examples are thermal abnormalities and certain CPU and memory error
conditions. HP SIM rates conditions that indicate impending failures as “major” events;
for failover purposes, however, these events may also be considered critical.
The server configuration and workload may further qualify what counts as a critical
component for any individual server. For example, suppose the local array controller fails.
If the server is configured as SAN-boot and the workload does not access local drives, then
the array controller is not a critical component, because after its failure the server would
continue to operate without impact to the workload. As mentioned above, a component that
is configured redundantly, where a redundant partner can assume a failed partner’s load,
also is not critical.
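The criticality rule described above can be sketched in a few lines. This is only an illustrative sketch, not part of HP SIM or VCEM: the `Component` class, its field names, and the `is_critical` function are hypothetical, and a real deployment would derive these facts from the server's actual configuration.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical description of one server component."""
    name: str
    has_redundant_partner: bool = False  # a partner can assume the failed load
    used_by_workload: bool = True        # the workload depends on this component


def is_critical(component: Component) -> bool:
    """Critical only if the workload needs it and no redundant partner can take over."""
    if component.has_redundant_partner:
        return False
    return component.used_by_workload


# SAN-boot server whose workload never accesses local drives.
array_controller = Component("local array controller", used_by_workload=False)
single_cpu = Component("CPU 0")
redundant_psu = Component("power supply 1", has_redundant_partner=True)

for part in (array_controller, single_cpu, redundant_psu):
    print(part.name, "critical" if is_critical(part) else "not critical")
```

In this sketch only the single CPU is treated as critical; the array controller is excluded by the workload, and the power supply by its redundant partner.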
Reporting critical events from a server
When a non-redundant component, such as a single CPU, fails instantly, there is no
opportunity for the HP SIM agents on that server to report the failure. There are many
examples of component failures that in practice are usually not reported by HP SIM agents.
Reporting component status also requires that the server’s operating system and the network
communications between the system and HP SIM be working.
A server can fail to deliver services because its hardware or its operating system (hang or
crash) failed critically. But if the root cause of the failure is unknown, it is not certain that
replacing the server will remedy the problem (for example, if the boot image has become
unusable), or that replacement is necessary at all (for example, when a reboot would
restore functionality). In any case, such a system is not able to send an event to HP SIM.
The HP SIM “System Unreachable” event
System unreachable is an optional but commonly used system status change event. When
configured in HP SIM, the “system unreachable” event is raised whenever a system does not
respond to a ping issued by the HP SIM Hardware Status Polling task.
The system unreachable event can be configured to cause the system to fail over. The
drawback is that system unreachable is not a “root cause” event. It occurs whether the root
cause is a critical hardware failure, a workload software failure, or a failure of the intervening
communications network. (In the case of a network failure, many servers could be triggered
to fail over at once.)
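The following sketch illustrates why system unreachable is not a root cause event. It is a simplified model, not HP SIM code: the `System` class, its three health flags, and the `poll` function are hypothetical stand-ins for the ping issued by the Hardware Status Polling task.

```python
from dataclasses import dataclass


@dataclass
class System:
    """Hypothetical model of a managed server as seen by a ping-based poll."""
    name: str
    hardware_ok: bool = True
    software_ok: bool = True
    network_ok: bool = True

    def responds_to_ping(self) -> bool:
        # The poller sees only one bit: a reply arrived, or it did not.
        return self.hardware_ok and self.software_ok and self.network_ok


def poll(systems):
    """Return the names of systems for which an unreachable event would be raised."""
    return [s.name for s in systems if not s.responds_to_ping()]


# One server with a genuine hardware failure.
single_failure = poll([System("blade1", hardware_ok=False), System("blade2")])

# A network outage between the poller and three otherwise healthy servers.
network_outage = poll(
    [System(name, network_ok=False) for name in ("blade1", "blade2", "blade3")]
)

print(single_failure)
print(network_outage)
```

The poll result carries no information about which flag was false, so in the second case all three healthy servers would be candidates for failover.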
Failover events and service level ratings
The ideal for failover events is 100% accuracy, so that failover is triggered for all
conditions where a critical server hardware failure occurs, and for no condition where the
server remains capable of meeting its service level objectives.
If event detection overlooks a critical server hardware failure, then services won’t be restored
until the failure is discovered by other means, for example, a call to the help desk from a user.
If the need to fail over is falsely sensed, then services are interrupted during the failover and
system restart, adversely impacting the service level rating and perhaps making the service
unavailable during a critical period of use.
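The two kinds of error described above, overlooked failures and falsely sensed failures, can be tallied as follows. This is a minimal sketch under assumed inputs: the function name `failover_accuracy` and the observation format are hypothetical, and the sample data is invented purely for illustration.

```python
def failover_accuracy(observations):
    """Tally failover decisions.

    observations: list of (critical_failure_occurred, failover_triggered) pairs.
    """
    if not observations:
        raise ValueError("at least one observation is required")
    # Critical failure occurred but no failover was triggered.
    missed = sum(1 for failed, triggered in observations if failed and not triggered)
    # Failover was triggered although the server could still meet its objectives.
    false_failovers = sum(
        1 for failed, triggered in observations if triggered and not failed
    )
    correct = len(observations) - missed - false_failovers
    return {
        "correct": correct,
        "missed": missed,
        "false_failovers": false_failovers,
        "accuracy": correct / len(observations),
    }


# Invented sample: one correct failover, one missed failure,
# one false failover, one healthy server correctly left alone.
sample = [(True, True), (True, False), (False, True), (False, False)]
result = failover_accuracy(sample)
print(result)
```

The ideal of 100% accuracy corresponds to both `missed` and `false_failovers` being zero.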