Installation guide

Section 2.1: Choosing a Hardware Configuration
2.1.3 Choosing the Type of Power Controller
The Red Hat Cluster Manager implementation consists of a generic power management layer and
a set of device-specific modules that accommodate a range of power management types. When
selecting the type of power controller to deploy in the cluster, it is important to recognize the
implications of each device type. The following describes the supported types of power switches,
followed by a summary table. For a more detailed description of the role a power switch plays in
ensuring data integrity, refer to Section 2.4.2, Configuring Power Switches.
Serial- and network-attached power switches are separate devices that enable one cluster member
to power cycle another member. They resemble a power strip whose individual outlets can be
turned on and off under software control through either a serial or network cable.
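The split between a generic power management layer and device-specific modules can be pictured with a short Python sketch. It is illustrative only: the class and method names (PowerSwitch, RecordingSwitch, power_cycle) are hypothetical and are not taken from Red Hat Cluster Manager, and the stand-in module records commands instead of driving a real serial or network switch.

```python
from abc import ABC, abstractmethod


class PowerSwitch(ABC):
    """Generic layer: the operations the cluster needs from any power controller."""

    @abstractmethod
    def power_off(self, outlet):
        """Cut power to one outlet."""

    @abstractmethod
    def power_on(self, outlet):
        """Restore power to one outlet."""

    def power_cycle(self, outlet):
        # Cycling is defined once in the generic layer, so each device
        # module only has to implement the two primitive operations.
        self.power_off(outlet)
        self.power_on(outlet)


class RecordingSwitch(PowerSwitch):
    """Hypothetical device module that logs commands instead of sending them."""

    def __init__(self):
        self.log = []

    def power_off(self, outlet):
        self.log.append(("off", outlet))

    def power_on(self, outlet):
        self.log.append(("on", outlet))


switch = RecordingSwitch()
switch.power_cycle(2)   # one member power cycles the outlet feeding another
print(switch.log)       # [('off', 2), ('on', 2)]
```

A module for a real serial- or network-attached switch would replace the two primitive methods with whatever commands that device expects; the calling code would not change.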
Watchdog timers provide a means for failed systems to remove themselves from the cluster before
another system takes over their services, rather than allowing one cluster member to power cycle another.
In normal operation, the cluster software must periodically reset the watchdog timer before it
expires. If the cluster software fails to reset the timer, the watchdog triggers
on the assumption that the system has hung or otherwise failed. A healthy cluster member
allows a window of time to pass before concluding that another cluster member has failed (by default,
this window is 12 seconds). The watchdog timer interval must be shorter than this window.
In this manner, before taking over services for a failed cluster member, a healthy system can assume
that the failed member has safely removed itself from the
cluster (by rebooting) and therefore poses no risk to data integrity. The underlying watchdog support is
included in the core Linux kernel. Red Hat Cluster Manager uses these watchdog features through
its standard APIs and configuration mechanism.
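The timing rule and the periodic reset described above can be sketched in Python. This is a minimal illustration, not the Red Hat Cluster Manager implementation: the function names are hypothetical, and the keep-alive loop relies only on the general Linux convention that any write to an open watchdog device resets its countdown. The 12-second value is the default window quoted in the text.

```python
import io
import time

FAILURE_DETECTION_WINDOW_S = 12.0  # default window quoted in the text


def watchdog_interval_is_safe(watchdog_interval_s,
                              detection_window_s=FAILURE_DETECTION_WINDOW_S):
    """True when a hung member reboots before its peers declare it failed."""
    return 0 < watchdog_interval_s < detection_window_s


def keep_alive(device, beats, period_s, sleep=time.sleep):
    """Reset the timer `beats` times by writing to an open watchdog device."""
    for _ in range(beats):
        device.write(b"\0")  # any write resets the countdown
        device.flush()
        sleep(period_s)


# Demonstration against an in-memory stand-in, so nothing can reboot.
fake_device = io.BytesIO()
keep_alive(fake_device, beats=3, period_s=0)
print(len(fake_device.getvalue()))        # 3
print(watchdog_interval_is_safe(10.0))    # True
print(watchdog_interval_is_safe(12.0))    # False
```

On a real system the device would be opened with something like open("/dev/watchdog", "wb", buffering=0). Opening it arms the timer, and the machine resets if the writes stop, so this should only be tried on a system that can be rebooted safely.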
There are two types of watchdog timers: hardware-based and software-based. Hardware-based
watchdog timers typically consist of system board components such as the Intel® i810 TCO chipset. This
circuitry has a high degree of independence from the main system CPU. This independence is
beneficial in a true system hang, because the circuitry can still pull down the system's reset lead,
resulting in a system reboot. Some PCI expansion cards also provide watchdog features.
The second type of watchdog timer is software-based. This category of watchdog does not have any
dedicated hardware. It is implemented as a kernel thread that runs periodically and initiates a
system reboot if the timer duration has expired. The vulnerability of the software watchdog timer
is that under certain failure scenarios, such as a system hang while interrupts are blocked, the kernel
thread is not called. As a result, in such conditions it cannot be definitively depended on for
data integrity. The healthy cluster member may then take over services for a hung node that has not
rebooted, which could cause data corruption under certain scenarios.
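The weakness of a software watchdog can be shown with a toy simulation in Python. It is a simplified model under stated assumptions, not kernel code: the SoftwareWatchdog class, the simulate function, and the 10-second timeout are all hypothetical, and time advances in whole-second steps. The point it illustrates is that the deadline is only enforced if the periodic check actually gets to run.

```python
class SoftwareWatchdog:
    """Toy model: a deadline that only a periodic check can enforce."""

    def __init__(self, timeout_s, now=0):
        self.timeout_s = timeout_s
        self.deadline = now + timeout_s
        self.rebooted = False

    def reset(self, now):
        # Called by healthy cluster software before the deadline passes.
        self.deadline = now + self.timeout_s

    def check(self, now):
        # Stands in for the periodically scheduled kernel thread.
        if now >= self.deadline:
            self.rebooted = True
        return self.rebooted


def simulate(hang_at, checker_runs_after_hang, timeout_s=10, end=30):
    """Return True if the simulated node reboots itself by time `end`."""
    wd = SoftwareWatchdog(timeout_s)
    for t in range(end + 1):
        hung = t >= hang_at
        if not hung:
            wd.reset(t)      # cluster software is alive and resets the timer
        if not hung or checker_runs_after_hang:
            wd.check(t)      # the checking thread gets scheduled
    return wd.rebooted


print(simulate(hang_at=5, checker_runs_after_hang=True))   # True
print(simulate(hang_at=5, checker_runs_after_hang=False))  # False
```

In the second run the node hangs in a way that also stops the checking thread, so it never reboots. A hardware watchdog avoids this case because its countdown does not depend on the hung CPU.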
Finally, administrators can choose not to employ a power controller at all. If choosing the "None" type,
note that there are no provisions for a cluster member to power cycle a failed member. Similarly, the
failed member cannot be guaranteed to reboot itself under all failure conditions. Deploying clusters