Install guide

Chapter 5. Cold Failover Cluster Configuration

This chapter provides information on configuring a cold failover HA cluster. For information on

configuring a RAC/GFS cluster, see Chapter 4, RAC/GFS Cluster Configuration.

Long before RAC (and its progenitor, OPS) was suitable for high availability, customers still needed

Oracle databases to be more reliable. The best way to do this was with a (relatively) simple two-node

cluster that provided a second server node to take over in the event the primary node crashed. These

early clusters still required many of the shared attributes that OPS/RAC databases required, but

mandated that only one Oracle instance could be running at once; the storage was shared, but Oracle

access was not. This “failover” configuration remains in wide use today.

Note

An Oracle instance is the combination of OS resources (processes and shared memory) that

must be initiated on a server. The instance provides coherent and persistent database access

for the connecting users or clients. Oracle workloads are extremely resource intensive, so

typically there is only one instance/server. Oracle RAC consists of multiple instances (usually on

physically distinct servers), all connecting to the same set of database files. Server virtualization

now makes it possible to have more than one instance/server. However, this is not RAC unless

these instances all connect to the same set of database files. The voraciousness of most Oracle

workloads makes multiple instance/server configurations difficult to configure and optimize.

The OS clustering layer must ensure that Oracle is never running on both nodes at the same time. If this

occurs, the database will be corrupted. The two nodes must be in constant contact, either through a

voting disk, or a heartbeat network, or both. If something goes wrong with the primary node (the node

currently running Oracle), then the secondary node must be able to terminate that server, take over the

storage, and restart the Oracle database. Termination is also called fencing, and is most frequently

accomplished by the secondary node turning off the power to the primary node; this is called power-

managed fencing. There are a variety of fencing methods, but power-managed fencing is recommended.
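
The takeover sequence described above (detect a silent primary, fence it, take over the storage, restart Oracle) can be sketched as follows. This is a minimal illustration, not the logic of any particular cluster product: the class name, the three callables, and the 10-second timeout are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverController:
    """Toy model of the secondary node's takeover logic (all names hypothetical)."""

    fence_primary: Callable[[], bool]  # e.g. power off the primary; True on success
    mount_storage: Callable[[], None]  # take over the shared file systems
    start_oracle: Callable[[], None]   # start the Oracle instance on this node
    heartbeat_timeout: float = 10.0    # seconds of silence before acting (assumed value)
    log: List[str] = field(default_factory=list)

    def check(self, seconds_since_heartbeat: float) -> bool:
        """Return True if this node took over the service."""
        if seconds_since_heartbeat < self.heartbeat_timeout:
            self.log.append("primary alive")
            return False
        # Fencing must succeed before the storage is touched; otherwise both
        # nodes could end up running Oracle against the same database files.
        if not self.fence_primary():
            self.log.append("fencing failed; not taking over")
            return False
        self.log.append("primary fenced")
        self.mount_storage()
        self.log.append("storage mounted")
        self.start_oracle()
        self.log.append("oracle started")
        return True
```

The point of the ordering is that a failed fence leaves the secondary idle rather than risking two active instances; in this sketch, a controller whose fence_primary returns False never calls mount_storage or start_oracle.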

Note

The Oracle database is a fully journaled file system, and is capable of recovering all relevant

transactions. Oracle calls the journal logs redo logs. When Oracle or the server fails

unexpectedly, the database has aborted and requires crash recovery. In the failover case, this

recovery usually occurs on the secondary node, but this does not affect Oracle recovery. Whatever

node starts up Oracle after it has aborted must do recovery. Oracle HA recovery is still just single

instance recovery. In RAC, there are multiple instances, each with its own set of redo logs. When

a RAC node fails, some other RAC node must recover the failed node’s redo logs, while

continuing to provide access to the database.
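
To make the journal idea in the note concrete, the following toy sketch replays a log after a crash. It is a generic journal-replay illustration under simplifying assumptions, not Oracle's actual recovery algorithm: the RedoRecord type, the "write"/"commit" operations, and the recover function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RedoRecord:
    """One journal entry in this toy model (hypothetical structure)."""

    txn: int                     # transaction identifier
    op: str                      # "write" or "commit"
    key: Optional[str] = None    # item changed by a "write"
    value: Optional[str] = None  # new value written


def recover(datafile: Dict[str, str], redo_log: List[RedoRecord]) -> Dict[str, str]:
    """Return the state after replaying the writes of committed transactions."""
    committed = {rec.txn for rec in redo_log if rec.op == "commit"}
    recovered = dict(datafile)  # work on a copy of the on-disk state
    for rec in redo_log:
        # Writes from transactions that never committed are skipped.
        if rec.op == "write" and rec.txn in committed and rec.key is not None:
            recovered[rec.key] = rec.value
    return recovered
```

In this model it does not matter which node calls recover: the inputs are the shared data files and the log, which is why recovery on the secondary node is the same work as recovery on the original node.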

The Oracle database must be installed on a shared storage array and this file system (or these file

systems) can only be mounted on the active node. The clustering layer also has agents, or scripts, that

must be customized to the specific installation of Oracle. Once configured, this software can

automatically start the Oracle database and any other relevant services (like Oracle network listeners).

The job of any cluster product is to ensure that Oracle is only ever running on one node.
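
The role of the agents can be pictured as an ordered start of resources on the active node, with a clean back-out if any step fails. The sketch below is only an illustration: the start_service function, the resource tuples, and the particular order (file system, then listener, then database) are assumptions, not the behavior of a specific agent.

```python
from typing import Callable, List, Tuple

# A resource in this toy model: (name, start function, stop function).
Resource = Tuple[str, Callable[[], None], Callable[[], None]]


def start_service(resources: List[Resource]) -> List[str]:
    """Start resources in order; if one fails, stop those already started, in reverse."""
    started: List[Resource] = []
    try:
        for res in resources:
            res[1]()             # call the resource's start function
            started.append(res)
    except Exception:
        # Back out so that nothing is left half-started on this node.
        for _name, _start, stop in reversed(started):
            stop()
        raise
    return [name for name, _start, _stop in started]
```

Backing out in reverse order means the shared file system is unmounted last, so a failed start leaves the node in a state where the service can safely be attempted on the other node.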

Clusters are designed specifically to handle bizarre, edge-case operating conditions, but are at the mercy

of the OS components that might fail too. The heartbeat network operates over standard TCP/IP

networks, and is the primary mechanism by which the cluster nodes identify themselves to other

Red Hat Enterprise Linux 5 Configuration Example - Oracle HA on Cluster Suite