LSF-HPC with SLURM monitoring and failover are implemented on the HP XC system as tools that prepare
the environment for the LSF execution host daemons on a given node, start the daemons, then watch the
node to ensure that it remains active.
After a standard installation, the HP XC system is initially configured so that:
• LSF-HPC with SLURM is started on the head node.
• LSF-HPC with SLURM failover is disabled.
• The Nagios application reports whether LSF-HPC with SLURM is up, down, or "currently shut down,"
but takes no action in any case.
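You can confirm this initial state from the head node with standard SLURM and LSF commands. The following check is a sketch only; the state names mentioned in the comments (for example, an LSF host state of ok) are typical values and may differ on your system.

    # Verify that the SLURM control daemon responds:
    scontrol ping
    # Summarize SLURM partition and node states:
    sinfo
    # Identify the LSF cluster and its master (execution) host:
    lsid
    # Show the state of the LSF execution host; after a standard
    # installation it is normally open (status ok):
    bhosts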
The only direct interaction between LSF-HPC with SLURM and the LSF monitoring and failover tools
occurs at LSF-HPC with SLURM startup, when the daemons are started in the virtual environment, and
at failover, when the existing daemons are shut down cleanly before the virtual environment is moved to
a new host.
15.13.2 Interplay of LSF-HPC with SLURM and SLURM
The LSF-HPC with SLURM product and SLURM are managed independently; one is not critically affected
if the other goes down.
SLURM has no dependency on LSF-HPC with SLURM.
The LSF-HPC with SLURM product needs SLURM to schedule jobs. If SLURM becomes unresponsive,
LSF-HPC with SLURM drops its processor count to 1 and closes the HP XC virtual host. When SLURM is
available again, LSF-HPC with SLURM adjusts its processor count accordingly and reopens the host.
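An administrator can observe this behavior with the same commands. The following sketch is illustrative only; the exact status strings reported depend on your SLURM and LSF-HPC versions.

    # If the SLURM controller cannot be reached, scontrol reports the
    # primary (and any backup) control daemon as down:
    scontrol ping
    # While SLURM is unresponsive, the HP XC virtual host is closed and
    # its job slot count drops to 1; when SLURM returns, the host reopens:
    bhosts -w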
15.13.3 Assigning the Resource Management Nodes
You assign nodes to both SLURM and the LSF-HPC with SLURM product by assigning them the resource
management role; this role includes both the lsf and slurm_controller services.
By default, the HP XC resource management system attempts to place the SLURM controller and the LSF
execution host on the same node to constrain the use of system resources. If only one node has the resource
management role, the LSF-HPC with SLURM execution daemons and the SLURM control daemon both
run on that node.
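To see which node currently hosts the SLURM controller, and therefore (when only one node has the resource management role) also the LSF-HPC with SLURM execution daemons, you can query both subsystems. This is a sketch; the ControlMachine and BackupController names correspond to the SLURM configuration keywords of this release, and the grep pattern is illustrative only.

    # Display the primary and, if configured, backup SLURM controller:
    scontrol show config | grep -i -e controlmachine -e backupcontroller
    # Display the LSF master host, which under LSF-HPC with SLURM is the
    # virtual execution host:
    lsid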
If two nodes are assigned the resource management role, by default, the first node becomes the primary
resource management node, and the second node is the backup resource management node.
If more than two nodes are assigned the resource management role, the first becomes the primary resource
management host, and the second becomes both the backup SLURM host and the first LSF-HPC with SLURM
failover candidate. Additional nodes with the resource management role can serve as LSF-HPC with
SLURM failover nodes if either or both of the first two nodes are down.
Resource management candidate nodes are selected in order: the head node is taken first, and the remaining
nodes follow in ASCII sort order by node name.
Example
In this example, nodes n3, n4, n15, and n16 are resource management nodes, and n16 is the head node.
The selection list is ordered as shown, and the nodes have the corresponding assignments:
1. Node n16 hosts the primary LSF-HPC with SLURM and SLURM control daemons.
(The head node is taken first.)
2. Node n15 hosts the backup SLURM control daemon and serves as the first LSF-HPC with SLURM
failover candidate.
(The remaining nodes are ASCII-sorted.)
3. n3 becomes the second choice for LSF-HPC with SLURM failover.
4. n4 becomes the third choice for LSF-HPC with SLURM failover.
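The selection order in this example can be reproduced mechanically. The following sketch is illustrative only and uses the node names from the example above; it places the head node first and ASCII-sorts the remaining resource management nodes, which is why n15 precedes n3 and n4.

    # Reproduce the selection order: head node first, then ASCII sort
    # of the remaining resource management nodes.
    HEAD=n16
    NODES="n3 n4 n15 n16"
    echo "$HEAD"
    for node in $NODES; do
        [ "$node" != "$HEAD" ] && echo "$node"
    done | LC_ALL=C sort
    # Prints n16, then n15, n3, n4 (one per line)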