Consolidating HP Serviceguard for Linux and Oracle RAC 10g Clusters, June 2005

heartbeat mechanisms. Heartbeats are network messages that are sent between nodes in a

cluster letting each node know that the others are “alive”.
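
The heartbeat idea described above can be sketched in a few lines of Python. This is only an illustration of timeout-based failure detection, not the implementation used by Serviceguard or Oracle RAC. The class name `HeartbeatMonitor`, the node names, and the 5-second timeout are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each peer node (hypothetical helper)."""
    timeout: float  # seconds of silence before a peer is presumed failed (illustrative value)
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        # Called whenever a heartbeat message arrives from `node`.
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list:
        # A peer is presumed failed once it has been silent longer than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=5.0)
monitor.record("nodeA", now=0.0)
monitor.record("nodeB", now=0.0)
monitor.record("nodeA", now=4.0)      # nodeA keeps sending; nodeB goes silent
print(monitor.failed_nodes(now=6.0))  # ['nodeB']
```

Timestamps are passed in explicitly so the sketch runs without a real network or clock.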

Both Serviceguard for Linux and Oracle RAC recognize node failures by the loss of

heartbeats and act to resolve them by rebooting the affected node. Most losses of heartbeat

are caused by a node failure, such as a hardware problem or an OS crash.

In these instances, both clusters readily recognize (and implicitly agree on)

which node has failed, and they will adjust their membership accordingly.

There is one failure type that, without special considerations, might cause the entire cluster to

fail when two clusters are running on the same nodes. If a Serviceguard for Linux and RAC

two-node cluster were configured with a single, shared heartbeat network, then the failure of

that network would result in isolation of the two nodes. Both clusters have quorum

mechanisms that define how the cluster determines which node should keep running and

which should be reset. However, because each cluster product uses its own independent

algorithm, the two may choose different nodes. For example, HP Serviceguard for Linux may

choose to retain node A and reboot the other node (node B), while RAC may choose to

retain node B and reboot node A. Serviceguard for Linux then reboots node B and, nearly

simultaneously, RAC reboots node A, resulting in both nodes (and both clusters) becoming

unavailable.
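
The scenario above can be made concrete with a small simulation. The two tie-break rules below (`retain_lowest` and `retain_highest`) are hypothetical stand-ins. The actual quorum algorithms of Serviceguard and RAC are not described here, and the sketch only assumes that they are independent and may disagree.

```python
def retain_lowest(nodes):
    # Hypothetical rule for the first cluster manager: keep the first node by name.
    return min(nodes)


def retain_highest(nodes):
    # Hypothetical rule for the second cluster manager: keep the last node by name.
    return max(nodes)


def resolve_partition(nodes, tie_breakers):
    """Return the set of nodes rebooted when each manager applies its own rule."""
    rebooted = set()
    for choose_survivor in tie_breakers:
        survivor = choose_survivor(nodes)
        rebooted.update(n for n in nodes if n != survivor)
    return rebooted


nodes = ["nodeA", "nodeB"]
down = resolve_partition(nodes, [retain_lowest, retain_highest])
print(sorted(down))  # ['nodeA', 'nodeB']: no survivor is left
```

When both managers happen to apply the same rule, only one node is rebooted, which is why the problem arises only when the two decisions differ.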

This possibility can be eliminated for all but the most extreme failure scenarios by having

redundant and/or multiple heartbeat networks. In this case, ALL of the networks carrying

heartbeats would have to fail before the clusters could attempt to reboot different nodes

and bring both clusters down, which would require multiple failures within the clusters.
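
As a rough sketch of why redundancy helps, the condition for isolation can be written as "no heartbeat network is still up". The function names and the three-network example are assumptions made for illustration only.

```python
from itertools import product


def nodes_isolated(network_up):
    # The nodes lose contact only if no heartbeat network is still working.
    return not any(network_up)


def isolating_combinations(network_count):
    """List every up/down combination of the networks that isolates the nodes."""
    states = product([True, False], repeat=network_count)
    return [s for s in states if nodes_isolated(s)]


print(nodes_isolated([False]))              # True: a single shared network has failed
print(nodes_isolated([False, True, True]))  # False: one of three networks has failed
print(len(isolating_combinations(3)))       # 1: only the all-down case isolates the nodes
```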

The recommended configuration (below) has 4 network paths capable of carrying

heartbeats. If a user is concerned about all of these failing simultaneously, then the

Serviceguard Quorum Service can be used. The Serviceguard Quorum Service runs on a

system outside of the cluster such as another server or a PC. Experiments have shown that,

when using the Quorum Service, Serviceguard takes down a node significantly sooner than

Oracle would. The Oracle membership software recognizes this and does not take down

the remaining node. Note that the Quorum Service is connected to the

cluster via a network connection. If a failure of all network paths between the nodes

also takes down all paths to the Quorum Service, then there is again the possibility

of the entire cluster going down. It is impossible to protect against all multiple failures.
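
The role of an external tie-breaker can be sketched as follows. This is not the Serviceguard Quorum Service protocol, which is not described here. The `Arbiter` class and its first-request-wins rule are hypothetical, and the sketch models the worst case in which a node that cannot reach the arbiter does not survive.

```python
class Arbiter:
    """Hypothetical external tie-breaker that grants survival to one node only."""

    def __init__(self):
        self.winner = None

    def request(self, node):
        # The first node to ask wins; every later request is refused.
        if self.winner is None:
            self.winner = node
        return self.winner == node


def surviving_nodes(requests, arbiter):
    """requests: list of (node, can_reach_arbiter) pairs in order of arrival."""
    survivors = []
    for node, reachable in requests:
        # A node that cannot reach the arbiter cannot win the tie-break.
        if reachable and arbiter.request(node):
            survivors.append(node)
    return survivors


# Heartbeat networks are down, but both nodes can still reach the arbiter.
print(surviving_nodes([("nodeA", True), ("nodeB", True)], Arbiter()))    # ['nodeA']
# The paths to the arbiter have failed as well.
print(surviving_nodes([("nodeA", False), ("nodeB", False)], Arbiter()))  # []
```

Because a single external party makes the decision, exactly one node survives as long as at least one node can still reach it.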

Recommended Configuration

The general recommended network configuration for Serviceguard calls for a dedicated

heartbeat LAN and a bonded pair of NICs (using Linux Channel Bonding) for both application

traffic and Serviceguard heartbeats. Serviceguard and Oracle can share the dedicated

heartbeat LANs since Serviceguard’s heartbeats are sent once per second and will have no

measurable effect on RAC.

When using the default timeout and heartbeat intervals, Serviceguard will typically reset a node

sooner than Oracle RAC, which waits about 60 seconds.