Install guide
Chapter 5. Cold Failover Cluster Configuration
This chapter provides information on configuring a cold failover HA cluster. For information on
configuring a RAC/GFS cluster, see Chapter 4, RAC/GFS Cluster Configuration.
Long before RAC (and its progenitor, OPS) was suitable for high availability, customers still needed
Oracle databases to be more reliable. The best way to do this was with a (relatively) simple two-node
cluster that provided a second server node to take over in the event the primary node crashed. These
early clusters still required many of the shared attributes that OPS/RAC databases required, but
mandated that only one Oracle instance could be running at once; the storage was shared, but Oracle
access was not. This “failover” configuration remains in wide use today.
Note
An Oracle instance is the combination of OS resources (processes and shared memory) that
must be initiated on a server. The instance provides coherent and persistent database access
for the connecting users or clients. Oracle workloads are extremely resource intensive, so
typically there is only one instance per server. Oracle RAC consists of multiple instances (usually on
physically distinct servers), all connecting to the same set of database files. Server virtualization
now makes it possible to have more than one instance per server. However, this is not RAC unless
these instances all connect to the same set of database files. The voraciousness of most Oracle
workloads makes multiple-instance-per-server configurations difficult to configure and optimize.
The OS clustering layer must ensure that Oracle is never running on both nodes at the same time. If this
occurs, the database will be corrupted. The two nodes must be in constant contact, either through a
voting disk, or a heartbeat network, or both. If something goes wrong with the primary node (the node
currently running Oracle), then the secondary node must be able to terminate that server, take over the
storage, and restart the Oracle database. Termination is also called fencing, and is most frequently
accomplished by the secondary node turning off the power to the primary node; this is called
power-managed fencing. There are a variety of fencing methods, but power-managed fencing is recommended.
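The ordering described above (detect a silent primary, fence it, and only then take over storage and restart Oracle) can be sketched as a small simulation. This is an illustrative model only, not Cluster Suite code: the class name, the 10-second timeout, and the fence callback are all hypothetical, and a real cluster delegates these steps to its own membership and fencing daemons.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SecondaryNode:
    """Hypothetical model of the standby node's failover decision."""

    # Seconds of heartbeat silence tolerated before acting (assumed value).
    heartbeat_timeout: float = 10.0
    log: List[str] = field(default_factory=list)

    def primary_is_silent(self, last_heartbeat: float, now: float) -> bool:
        # The primary is presumed failed once the silence exceeds the timeout.
        return (now - last_heartbeat) > self.heartbeat_timeout

    def failover(self, last_heartbeat: float, now: float,
                 fence: Callable[[], bool]) -> bool:
        """Return True only if the takeover sequence ran to completion."""
        if not self.primary_is_silent(last_heartbeat, now):
            return False
        # Fencing must succeed first; otherwise both nodes could end up
        # running Oracle against the same files and corrupt the database.
        if not fence():
            self.log.append("fence failed; refusing takeover")
            return False
        self.log.extend(
            ["fenced primary", "mounted shared storage", "started Oracle"]
        )
        return True
```

The point of the sketch is the guard on the fence result: the secondary never touches the shared storage unless it has confirmed that the primary is down.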
Note
The Oracle database is a fully journaled file system, and is capable of recovering all relevant
transactions. Oracle calls the journal logs redo logs. When Oracle or the server fails
unexpectedly, the database has aborted and requires crash recovery. In the failover case, this
recovery usually occurs on the secondary node, but this does not affect Oracle recovery. Whatever
node starts up Oracle after it has aborted must do recovery. Oracle HA recovery is still just single
instance recovery. In RAC, there are multiple instances, each with its own set of redo logs. When
a RAC node fails, some other RAC node must recover the failed node’s redo logs, while
continuing to provide access to the database.
The Oracle database must be installed on a shared storage array and this file system (or these file
systems) can only be mounted on the active node. The clustering layer also has agents, or scripts that
must be customized to the specific installation of Oracle. Once configured, this software can
automatically start the Oracle database and any other relevant services (like Oracle network listeners).
The job of any cluster product is to ensure that Oracle is only ever running on one node.
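As a rough illustration of what such an agent does, the sketch below models a failover service as an ordered list of resources that are started in sequence and stopped in reverse. It is not the agent shipped with any cluster product: the class names, the resource names, and the start order (file system, then database, then listener) are assumptions chosen for the example.

```python
class Resource:
    """Hypothetical stand-in for one managed resource."""

    def __init__(self, name, journal):
        self.name = name
        self.journal = journal  # shared list recording every action taken
        self.running = False

    def start(self):
        self.running = True
        self.journal.append(f"start {self.name}")

    def stop(self):
        self.running = False
        self.journal.append(f"stop {self.name}")


class OracleService:
    """Hypothetical service: resources start in order, stop in reverse."""

    def __init__(self, journal):
        # Assumed order: storage must be mounted before Oracle can open
        # its files; the listener is started last in this sketch.
        names = ("filesystem", "database", "listener")
        self.resources = [Resource(n, journal) for n in names]

    def start(self):
        for resource in self.resources:
            resource.start()

    def stop(self):
        # Reverse order so the file system is released only after
        # everything that depends on it has shut down.
        for resource in reversed(self.resources):
            resource.stop()

    def status(self):
        return all(resource.running for resource in self.resources)
```

A cluster manager would call `start()` on whichever node becomes active and `stop()` on a node that is giving up the service cleanly, which keeps the shared file system mounted on one node at a time.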
Clusters are designed specifically to handle bizarre, end-case operating conditions, but are at the mercy
of the OS components that might fail too. The heartbeat network operates over standard TCP/IP
networks, and is the primary mechanism by which the cluster nodes identify themselves to other
Red Hat Enterprise Linux 5 Configuration Example - Oracle HA on Cluster Suite