HP XC System Software Administration Guide Version 4.0

16.14.1 Overview of LSF with SLURM Monitoring and Failover Support
LSF with SLURM failover is disabled by default. You can enable or disable LSF with SLURM
failover at any time with the controllsf command. For more information, see controllsf(8).
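For example, assuming the subcommand syntax documented in controllsf(8), a session toggling failover might look like this (verify the exact subcommands on your system):

```shell
# Illustrative session; confirm exact subcommands against controllsf(8).
controllsf show              # report the current LSF failover state
controllsf enable failover   # turn LSF with SLURM failover on
controllsf disable failover  # turn it off again
```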
Note:
At least two nodes must have the resource management roles to enable LSF with SLURM failover.
One is selected as the master (primary LSF execution host), and the others are considered backup
nodes. At any time, LSF with SLURM daemons start and run only on the master node.
The Nagios LSF failover module monitors the virtual IP associated with the primary LSF execution
host. When LSF with SLURM failover is enabled on the HP XC system and the primary LSF
execution host fails, the Nagios LSF failover module detects that the node is unresponsive and
initiates failover:
1. The Nagios module attempts to contact the node hosting the virtual IP to ensure that LSF
   with SLURM is shut down and that virtual IP hosting is disabled.
2. A new primary LSF execution host is selected from the backup nodes.
3. The Nagios module re-establishes the virtual IP on the new node.
4. The LSF with SLURM daemons are restarted on the new primary LSF execution host.
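The failover sequence above can be sketched as a small shell script. The function names here are hypothetical stand-ins for what the Nagios LSF failover module actually does; only the order of operations reflects the documented behavior.

```shell
#!/bin/sh
# Sketch of the documented failover sequence. Node names and helper
# functions are illustrative placeholders, not real HP XC tools.

BACKUPS="n16 n17"            # assumed backup resource-management nodes

shutdown_lsf() { echo "LSF stopped on $1"; }
drop_vip()     { echo "virtual IP released by $1"; }
raise_vip()    { echo "virtual IP raised on $1"; }
start_lsf()    { echo "LSF started on $1"; }

fail_over() {
    failed=$1
    shutdown_lsf "$failed"   # step 1: ensure LSF is down on the failed host
    drop_vip "$failed"       #         and virtual IP hosting is disabled
    set -- $BACKUPS          # step 2: select a new master from the backups
    new=$1
    raise_vip "$new"         # step 3: re-establish the virtual IP there
    start_lsf "$new"         # step 4: restart LSF with SLURM on the new master
}

fail_over n15
```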
LSF with SLURM monitoring and failover are implemented on the HP XC system as tools that
prepare the environment for the LSF execution host daemons on a given node, start the daemons,
then watch the node to ensure that it remains active.
After a standard installation, the HP XC system is initially configured so that:
• LSF with SLURM is started on the head node.
• LSF with SLURM failover is disabled.
• The Nagios application reports whether LSF with SLURM is up, down, or "currently shut
  down," but takes no action in any case.
The only direct interaction between LSF with SLURM and the LSF monitoring and failover tools
occurs at LSF with SLURM startup, when the daemons are started in the virtual environment,
and at failover, when the existing daemons are shut down cleanly before the virtual environment
is moved to a new host.
16.14.2 Interplay of LSF with SLURM
The LSF with SLURM product and SLURM are managed independently; one is not critically
affected if the other goes down.
• SLURM has no dependency on LSF with SLURM.
• The LSF with SLURM product needs SLURM to schedule jobs. If SLURM becomes unresponsive,
  LSF with SLURM drops its processor count to 1 and closes the HP XC virtual host. When SLURM
  is available again, LSF with SLURM adjusts its processor count accordingly and reopens the host.
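The processor-count behavior described above can be sketched as follows. The function name is hypothetical; it only models the documented reaction to SLURM availability.

```shell
#!/bin/sh
# Sketch (hypothetical helper) of how LSF with SLURM reacts to SLURM
# availability, per the behavior described above.

lsf_proc_count() {
    # $1: SLURM state ("up" or "down"); $2: processor count SLURM reports
    if [ "$1" = "up" ]; then
        echo "$2"        # SLURM reachable: use its processor count, host open
    else
        echo 1           # SLURM unresponsive: drop to 1, virtual host closed
    fi
}

lsf_proc_count up 128    # prints 128
lsf_proc_count down 128  # prints 1
```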
16.14.3 Assigning the Resource Management Nodes
You assign nodes to both SLURM and the LSF with SLURM product by assigning them the
resource management role; this role includes both the lsf and slurm_controller services.
By default, the HP XC resource management system attempts to place the SLURM controller
and the LSF execution host on the same node to constrain the use of system resources. If only
one node has the resource management role, the LSF with SLURM execution daemons and the
SLURM control daemon both run on that node.