Installation guide

Appendix A: Supplementary Hardware Information
Configuring the Software Watchdog Timer
Any cluster system can use the software watchdog timer as a data integrity provision, because no
dedicated hardware components are required. If you specified a power switch type of SW_WATCHDOG
while running the cluconfig utility, the cluster software automatically loads the corresponding
loadable kernel module, softdog.
If the cluster is configured to use the software watchdog timer, the cluster quorum daemon
(cluquorumd) periodically resets the timer interval. Should cluquorumd fail to reset the
timer, the failed cluster member will reboot itself.
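The reset-or-reboot behavior described above can be illustrated with a small simulation. This is a
toy model only, not the softdog module or cluquorumd; the class name, the function name, and the
5-second timeout used in the usage note are assumptions chosen for illustration.

```python
class SoftwareWatchdog:
    """Toy model of a software watchdog timer (hypothetical, not softdog itself)."""

    def __init__(self, timeout_s, now=0.0):
        self.timeout_s = timeout_s
        self.deadline = now + timeout_s

    def reset(self, now):
        """Called by the monitoring daemon to push the deadline forward."""
        self.deadline = now + self.timeout_s

    def expired(self, now):
        """True once the deadline has passed without a reset."""
        return now >= self.deadline


def simulate(reset_times, timeout_s, end_time):
    """Return the time at which the member would reboot, or None if it never does.

    reset_times: times (in seconds) at which the daemon resets the timer.
    """
    dog = SoftwareWatchdog(timeout_s)
    for t in sorted(reset_times):
        if dog.expired(t):
            # The daemon was too late: the timer fired before this reset.
            return dog.deadline
        dog.reset(t)
    return dog.deadline if dog.expired(end_time) else None
```

With an assumed 5-second timeout, simulate([2, 4, 6, 8], 5, 10) returns None because every reset
arrives in time, while simulate([2, 4], 5, 20) returns 9.0, the moment the last deadline passes
after the daemon stops resetting the timer.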
When using the software watchdog timer, there is a small risk that the system will hang in such a way
that the software watchdog thread will not be executed. In this unlikely scenario, the other cluster
member may take over the services of the apparently hung cluster member. Generally, this is a safe
operation; but in the unlikely event that the hung cluster member resumes, data corruption could occur.
To further reduce the chance of this happening when using the software watchdog timer,
administrators should also configure the NMI watchdog timer.
Enabling the NMI Watchdog Timer
If you are using the software watchdog timer as a data integrity provision, it is also recommended to
enable the Non-Maskable Interrupt (NMI) watchdog timer to enhance the data integrity guarantees.
The NMI watchdog timer is a different mechanism for causing the system to reboot in the event of a
hang scenario where interrupts are blocked. This NMI watchdog can be used in conjunction with the
software watchdog timer.
Unlike the software watchdog timer, which is reset by the cluster quorum daemon (cluquorumd),
the NMI watchdog timer counts system interrupts. Normally, a healthy system receives hundreds
of device and timer interrupts per second. If no interrupts occur within a 5-second interval, the
system is presumed to have hung, and the NMI watchdog timer expires, initiating a system reboot.
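The interrupt-counting rule can be sketched as follows. This is a simplified model that works on
sampled cumulative interrupt counts, not the kernel's actual NMI handler; the function name and
the sample format are assumptions made for illustration, and only the 5-second window comes from
the text above.

```python
def first_nmi_expiry(samples, window_s=5.0):
    """Return the time at which a full window with no new interrupts completes.

    samples: list of (time_s, cumulative_interrupt_count), sorted by time.
    Returns None if the count never stays flat for window_s seconds.
    """
    if not samples:
        return None
    last_change_t, last_count = samples[0]
    for t, count in samples[1:]:
        if count != last_count:
            # New interrupts were seen: the system is alive, restart the window.
            last_change_t, last_count = t, count
        elif t - last_change_t >= window_s:
            # No interrupts for the whole window: treat this as a hang.
            return last_change_t + window_s
    return None
```

For example, if the count last changes at t = 1 and then stays flat through t = 6, the function
returns 6.0; a series whose count rises at every sample returns None.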
A robust data integrity solution can be implemented by combining the health monitoring of the
cluster quorum daemon and the software watchdog timer with the low-level system status checks
of the NMI watchdog.
Correct operation of the NMI watchdog timer mechanism requires that the cluster members contain
an APIC chip on the main system board. The majority of contemporary systems do include the APIC
component. Generally, Intel-based SMP systems and Intel-based uniprocessor systems with SMP
system boards (two or more CPU slots/sockets, but only one CPU) are known to support the NMI watchdog.
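As a rough first check of the APIC requirement, the CPU feature flags reported by Linux can be
inspected. The sketch below assumes an x86 system whose /proc/cpuinfo contains a "flags" line; the
function name is hypothetical. Finding the "apic" flag only suggests that the processor reports an
on-chip APIC. It does not prove that the NMI watchdog will work on a given system board.

```python
def has_apic_flag(cpuinfo_text):
    """Return True if any 'flags' line in cpuinfo-style text lists 'apic'."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Match whole tokens only, so 'x2apic' alone does not count.
            if "apic" in value.split():
                return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("apic flag present:", has_apic_flag(f.read()))
    except OSError:
        print("could not read /proc/cpuinfo on this system")
```

The parsing is kept separate from the file access so that it can be tried against saved output
from another cluster member.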