members. This ensures the cluster is viable and that Oracle can continue to operate on the primary
node. Some failure cases can cause the heartbeat to become erratic or unreliable, so modern clustering
products provide a second check-in mechanism, which ensures that quorum is maintained. With quorum
voting, each cluster member identifies itself by casting a vote, in the form of a simple write to a shared
voting disk, also called a quorum disk. The combination of heartbeat and quorum disk minimizes the risk
of split-brain cluster states. A split-brain state occurs when both nodes believe they hold the correct
cluster state, so both access the shared storage. Because split-brain states create the highest risk of
database corruption, preventing them is the functional core of the cluster.
The two most common examples of Oracle Cold Failover are HP ServiceGuard and Veritas Cluster
Server (VCS). Red Hat Cluster Suite’s implementation for Oracle closely models these products, so
customers who know them will find Red Hat Cluster Suite Oracle HA immediately familiar. Of course,
the devil is most definitely in the details.
5.1. Red Hat Cluster Suite HA
Red Hat Cluster Suite contains all the requisite components to implement Oracle HA: heartbeat, quorum
disk voting, fencing and a resource harness to relocate the Oracle instance when necessary. The major
differences between how Red Hat Cluster Suite is set up for RAC and how it is set up for a single
instance involve the appropriate timeout settings and the configuration of the resource harness, aptly
named rgmanager.
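As an illustrative sketch of that resource harness, a cold failover Oracle service in
/etc/cluster/cluster.conf might be defined as follows. The domain, node names, address, device and
Oracle home paths are assumptions for this example, and the oracledb resource attributes should be
verified against the agent shipped in /usr/share/cluster/oracledb.sh:

<rm>
  <failoverdomains>
    <failoverdomain name="oracle_domain" ordered="1" restricted="1">
      <failoverdomainnode name="node1" priority="1"/>
      <failoverdomainnode name="node2" priority="2"/>
    </failoverdomain>
  </failoverdomains>
  <!-- On failure, rgmanager relocates the whole service: IP, file system, then the instance -->
  <service name="oracle_ha" domain="oracle_domain" autostart="1" recovery="relocate">
    <ip address="192.168.1.100" monitor_link="1"/>
    <fs name="oradata" device="/dev/mapper/oradata" mountpoint="/u01/oradata" fstype="ext3" force_unmount="1"/>
    <oracledb name="ORCL" home="/u01/app/oracle/product/10.2.0/db_1" user="oracle" type="10g"/>
  </service>
</rm>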
5.2. Red Hat Cluster Suite Timers
When Oracle RAC is installed, Red Hat Cluster Suite must interact with Oracle Clusterware, but it
remains in control of the timeouts and, ultimately, the fencing. When Oracle HA is configured, Red Hat
Cluster Suite is likewise in charge, so the timeouts are very similar.
Tip
It is critical that the Red Hat Cluster Suite heartbeat service operates over the private, bonded
network, not the public network. If the private network fails for a node, then this node must be
removed from the cluster.
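A minimal sketch of such a bonded private interconnect, assuming RHEL-style network scripts with
hypothetical device names and addresses; note that cman heartbeats over whichever network the
clusternode names in cluster.conf resolve to, so those names should resolve to the bond0 address:

# /etc/sysconfig/network-scripts/ifcfg-bond0 (private interconnect)
DEVICE=bond0
IPADDR=192.168.100.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
# mode=1 is active-backup, the usual choice for a heartbeat network
BONDING_OPTS="mode=1 miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth1 (repeat for each slave NIC)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none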
All installations will have subtly different timeout requirements, but start with these recommended
settings:
<cluster config_version="11" name="dl585">
  <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/>
  <quorumd device="/dev/mapper/qdisk" interval="2" log_level="5" tko="8" votes="1"/>
  <cman expected_votes="3" two_node="0"/>
  <totem token="33000"/>
In this example, the quorum disk provides the first level of failure detection, with a timeout of 16
seconds; that is, eight intervals of 2 seconds. The tko parameter stands for Technical Knock Out (a
boxing metaphor) and is the number of missed intervals before a node is evicted. The CMAN heartbeat
timeout must be more than twice the quorum disk timeout; we choose 33 seconds, written in
milliseconds as token="33000". This delay gives the quorum daemon adequate time to establish which
node is the master during a failure, or if there is a load spike that might delay voting. The
expected_votes parameter is set to the number of nodes + 1, the extra vote being the quorum disk's.
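The relationships among these settings can be checked with simple arithmetic:

qdisk eviction time = interval × tko           = 2 s × 8   = 16 s
totem token         > 2 × qdisk eviction time  = 2 × 16 s  = 32 s, rounded up to 33000 ms
expected_votes      = node votes + qdisk vote  = 2 + 1     = 3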