Installation guide

Section 2.1: Choosing a Hardware Configuration

2.1.3 Choosing the Type of Power Controller

The Red Hat Cluster Manager implementation consists of a generic power management layer and a set of device-specific modules that accommodate a range of power management types. When selecting the type of power controller to deploy in the cluster, it is important to recognize the implications of each device type. The following describes the types of supported power switches, followed by a summary table. For a more detailed description of the role a power switch plays in ensuring data integrity, refer to Section 2.4.2, Configuring Power Switches.
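The split between a generic layer and device-specific modules can be pictured with a minimal Python sketch. All class and function names below are hypothetical and chosen for illustration only; they are not the Red Hat Cluster Manager API. The sketch records just one property that this section states for each type: whether one cluster member can power cycle another.

```python
from abc import ABC, abstractmethod


class PowerController(ABC):
    """Hypothetical generic interface (assumption, not the real API)."""

    @abstractmethod
    def can_power_cycle_peer(self) -> bool:
        """Return True if one member can power cycle another member."""


class SerialPowerSwitch(PowerController):
    def can_power_cycle_peer(self) -> bool:
        return True  # outlets are switched over a serial cable


class NetworkPowerSwitch(PowerController):
    def can_power_cycle_peer(self) -> bool:
        return True  # outlets are switched over a network cable


class WatchdogTimer(PowerController):
    def can_power_cycle_peer(self) -> bool:
        return False  # a failed member reboots itself instead


class NoPowerController(PowerController):
    def can_power_cycle_peer(self) -> bool:
        return False  # the "None" type has no such provision


# Hypothetical registry that the generic layer uses to pick a module.
REGISTRY = {
    "serial": SerialPowerSwitch,
    "network": NetworkPowerSwitch,
    "watchdog": WatchdogTimer,
    "none": NoPowerController,
}


def load_controller(kind: str) -> PowerController:
    """Instantiate the device-specific module for the given type name."""
    try:
        return REGISTRY[kind]()
    except KeyError:
        raise ValueError(f"unsupported power controller type: {kind!r}")
```

In this sketch the generic layer depends only on the PowerController interface, so supporting a new device type means adding one class and one registry entry.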

Serial- and network-attached power switches are separate devices that enable one cluster member to power cycle another member. They resemble a power strip whose individual outlets can be turned on and off under software control through either a serial or a network cable.

Watchdog timers provide a means for failed systems to remove themselves from the cluster before another system takes over their services, rather than allowing one cluster member to power cycle another. In normal operation, the cluster software must periodically reset the timer before it expires. If the cluster software fails to reset the timer, the watchdog triggers under the assumption that the system may have hung or otherwise failed. A healthy cluster member allows a window of time to pass before concluding that another cluster member has failed (by default, this window is 12 seconds). The watchdog timer interval must be shorter than the time it takes one cluster member to conclude that another has failed. In this manner, before taking over services for a failed cluster member, a healthy system can assume that the failed member has safely removed itself from the cluster (by rebooting) and therefore poses no risk to data integrity. The underlying watchdog support is included in the core Linux kernel. Red Hat Cluster Manager uses these watchdog features through the kernel's standard APIs and configuration mechanism.
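The timing rule for watchdog timers can be illustrated with a small Python sketch. It is a toy model under stated assumptions: the names are hypothetical, time is passed in explicitly instead of being read from a clock, and no real watchdog device is touched. Only the 12-second default window comes from the text.

```python
FAILOVER_WINDOW_S = 12.0  # default failure-detection window from the text


def watchdog_interval_is_safe(watchdog_interval_s,
                              failover_window_s=FAILOVER_WINDOW_S):
    """Return True if a hung member reboots before a peer declares it failed."""
    return 0 < watchdog_interval_s < failover_window_s


class SimulatedWatchdog:
    """Toy watchdog model (hypothetical); 'now' values are in seconds."""

    def __init__(self, interval_s, now=0.0):
        self.interval_s = interval_s
        self.deadline = now + interval_s

    def reset(self, now):
        """Called periodically by the cluster software while it is healthy."""
        self.deadline = now + self.interval_s

    def has_triggered(self, now):
        """True once the timer expired without a reset (the system reboots)."""
        return now >= self.deadline
```

For example, a 10-second interval satisfies the rule against the 12-second default, while a 12-second or longer interval does not, because the healthy member could then take over services before the failed member has rebooted.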

There are two types of watchdog timers: hardware-based and software-based. Hardware-based watchdog timers typically consist of system board components such as the Intel i810 TCO chipset. This circuitry has a high degree of independence from the main system CPU. This independence is beneficial in the case of a true system hang, because the watchdog will pull down the system’s reset lead, resulting in a system reboot. Some PCI expansion cards also provide watchdog features.

The second type of watchdog timer is software-based. This category of watchdog does not have any dedicated hardware. The implementation is a kernel thread that runs periodically and initiates a system reboot if the timer duration has expired. The vulnerability of the software watchdog timer is that under certain failure scenarios, such as a system hang while interrupts are blocked, the kernel thread will not be called. As a result, under such conditions it cannot be definitively depended on for data integrity. The healthy cluster member may then take over services for a hung node, which could cause data corruption under certain scenarios.
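The difference between the two watchdog types can be sketched as a question of when the expiry check actually runs. The following Python model is a simplified illustration, not kernel code, and the function name is hypothetical. It assumes a hardware timer is checked independently of the CPU, while a software timer is checked only at the moments its kernel thread is scheduled.

```python
def first_reboot_time(last_reset_s, interval_s, check_times_s):
    """Return the first check time at which the timer is seen as expired.

    check_times_s lists the moments (in seconds) at which the expiry
    check really runs. Returns None if no check runs after expiry.
    """
    deadline = last_reset_s + interval_s
    for t in sorted(check_times_s):
        if t >= deadline:
            return t
    return None


# Hardware-style timer: checked every second regardless of CPU state.
hardware_checks = range(0, 31)

# Software timer on a system that hangs after t=3 with interrupts
# blocked: the kernel thread is never scheduled again.
software_checks = [1, 2, 3]

hardware_reboot = first_reboot_time(0, 10, hardware_checks)
software_reboot = first_reboot_time(0, 10, software_checks)
```

In this model the hardware-style timer reports a reboot time, while the software timer returns None because its check never runs after the hang. That is the case in which a healthy member could take over services for a node that is still hung.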

Finally, administrators can choose not to employ a power controller at all. If choosing the "None" type, note that there are no provisions for a cluster member to power cycle a failed member. Similarly, the failed member cannot be guaranteed to reboot itself under all failure conditions. Deploying clusters