Consolidating HP Serviceguard for Linux and Oracle RAC 10g Clusters, June 2005

heartbeat mechanisms. Heartbeats are network messages that are sent between nodes in a
cluster letting each node know that the others are “alive”.
Both Serviceguard for Linux and Oracle RAC recognize node failures by the loss of
heartbeats and act to resolve them by rebooting the affected node. Most heartbeat losses
are caused by the failure of a node itself, for example a hardware problem or
an OS crash. In these instances, both clusters readily recognize (and implicitly agree on)
which node has failed, and they adjust their membership accordingly.
There is one failure type that, without special consideration, might cause the entire cluster to
fail when two clusters are running on the same nodes. If a two-node cluster running both
Serviceguard for Linux and RAC were configured with a single, shared heartbeat network, then
the failure of that network would isolate the two nodes from each other. Both clusters have
quorum mechanisms that determine which node should keep running and
which should be reset. However, because each product uses its own independent
algorithm, the two may choose different nodes. For example, HP Serviceguard for Linux may
choose to retain node A and reboot node B, while RAC may choose to
retain node B and reboot node A. Serviceguard for Linux then reboots node B and, nearly
simultaneously, RAC reboots node A, leaving both nodes (and both clusters)
unavailable.
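The conflict described above can be illustrated with a minimal sketch. The two selection rules below are purely hypothetical stand-ins; the actual arbitration algorithms of Serviceguard and Oracle RAC are not described in this paper. The only point is that two independent rules can each retain a different node.

```python
def sg_pick_survivor(nodes):
    # Hypothetical rule standing in for Serviceguard's arbitration.
    return min(nodes)


def rac_pick_survivor(nodes):
    # Hypothetical rule standing in for Oracle RAC's arbitration.
    return max(nodes)


def nodes_left_running(nodes):
    """Return the nodes that no cluster product decided to reboot."""
    rebooted = set()
    for pick in (sg_pick_survivor, rac_pick_survivor):
        keep = pick(nodes)
        # Each product reboots every node it did not choose to retain.
        rebooted |= {n for n in nodes if n != keep}
    return set(nodes) - rebooted


print(nodes_left_running(["A", "B"]))  # prints set(): both nodes were rebooted
```

Because each rule is applied without knowledge of the other's decision, the result for two isolated nodes is an empty set of survivors.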
This possibility can be eliminated for all but the most extreme failure scenarios by
configuring redundant or multiple heartbeat networks. In that case, ALL of the networks
carrying heartbeats would have to fail before the clusters could attempt to
reboot different nodes and bring both clusters down, which would require multiple failures
within the clusters. The recommended configuration (below) has four network paths capable of carrying
heartbeats. If a user is concerned about all of these failing simultaneously, then the
Serviceguard Quorum Service can be used. The Serviceguard Quorum Service runs on a
system outside of the cluster such as another server or a PC. Experiments have shown that,
when using the Quorum Service, Serviceguard takes down a node significantly sooner than
Oracle would [1]. The Oracle membership software recognizes this and does not take down
the remaining node. Note, however, that the Quorum Service is itself connected to the
cluster over a network. If the failure that takes down all network paths between the nodes
also takes down all paths to the Quorum Service, the entire cluster could
again go down. It is impossible to protect against all multiple failures.
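The cases discussed above can be summarized in a small decision sketch for a two-node configuration. The function name, its parameters, and the outcome labels are illustrative assumptions, not part of either product; the logic only restates the reasoning in the text.

```python
def partition_outcome(heartbeat_paths_up, quorum_service_reachable):
    """Classify the outcome for two nodes running both cluster products.

    heartbeat_paths_up: list of booleans, one per heartbeat network path.
    quorum_service_reachable: True if a path to the Quorum Service survives.
    """
    if any(heartbeat_paths_up):
        # At least one heartbeat path survives, so the nodes are not isolated.
        return "no isolation"
    if quorum_service_reachable:
        # Serviceguard resets one node first; Oracle's membership software
        # recognizes this and keeps the remaining node running.
        return "one node survives"
    # Each product arbitrates on its own and may pick a different node.
    return "both nodes may be reset"


# One of four paths still up: heartbeats continue.
print(partition_outcome([False, True, False, False], False))  # no isolation
# All four paths down, Quorum Service still reachable.
print(partition_outcome([False] * 4, True))                   # one node survives
# All four paths down and the Quorum Service unreachable.
print(partition_outcome([False] * 4, False))                  # both nodes may be reset
```

Only the last case, a multiple failure that also cuts off the Quorum Service, leaves the whole configuration exposed.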
Recommended Configuration
The general recommended network configuration for Serviceguard calls for a dedicated
heartbeat LAN plus a bonded pair of NICs (using Linux Channel Bonding) that carries both
application traffic and Serviceguard heartbeats. Serviceguard and Oracle can share the dedicated
heartbeat LANs because Serviceguard’s heartbeats are sent only once per second and have no
measurable effect on RAC.
[1] When using the default timeout and heartbeat intervals, Oracle RAC and Serviceguard will both reset a node
after about 60 seconds.