If two nodes are assigned the resource management role, by default, the first node becomes the
primary resource management node, and the second node is the backup resource management
node.
If more than two nodes are assigned the resource management role, the first becomes the primary
resource management node, and the second becomes the backup SLURM host and the first LSF
with SLURM failover candidate. Additional nodes with the resource management role can serve
as LSF with SLURM failover nodes if either or both of the first two nodes are down.
Resource management candidate nodes are ordered by node name in ASCII sort order, except
that the head node is always taken first.
Example
In this example, nodes n3, n4, n15, and n16 are resource management nodes, and n16 is the
head node. The selection list is ordered as shown, and the nodes have the corresponding
assignments:
1. Node n16 hosts the primary LSF with SLURM and SLURM control daemons.
(The head node is taken first.)
2. Node n15 hosts the backup SLURM control daemon and serves as the first LSF with SLURM
failover candidate.
(The remaining nodes are ASCII-sorted.)
3. Node n3 becomes the second choice for LSF with SLURM failover.
4. Node n4 becomes the third choice for LSF with SLURM failover.
You can use the controllsf command to change these assignments.
controllsf disable headnode preferred
Specifies that the head node should be ordered at the end of the list, rather than at the head.
controllsf disable slurm affinity
Specifies that HP XC should attempt to place the SLURM and LSF with SLURM daemons
on separate nodes.
controllsf set primary nodename
Specifies that LSF with SLURM starts by default on the specified node rather than on the head node.
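For example, to order the head node last and make node n15 (from the example above) the default
primary LSF with SLURM node, you might enter commands similar to the following:
controllsf disable headnode preferred
controllsf set primary n15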
You can also change the selection of the primary and backup nodes for the SLURM control
daemon by editing the SLURM configuration file, /hptc_cluster/slurm/etc/slurm.conf.
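For example, assuming a SLURM release that uses the ControlMachine and BackupController
parameters, the slurm.conf entries for the node assignments in the example above might resemble
the following sketch:
ControlMachine=n16
BackupController=n15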
16.14.4 LSF with SLURM Failover and Running Jobs
In the event of an LSF with SLURM failover, any interactive LSF with SLURM jobs are ended
because their I/O operates through the LSF daemons. These jobs finish with an exit code of 122.
However, LSF with SLURM batch jobs run undisturbed as long as their nodes remain up.
When the HP XC LSF execution host fails and the LSF with SLURM daemons are restarted on
another node, LSF with SLURM cannot monitor the running jobs to determine whether a job is
running appropriately or is hung indefinitely.
Ensure that each LSF with SLURM queue configured in the lsb.queues file includes 122 as a
requeue exit value so that these jobs will be queued again and rerun. At a minimum, the entry for
each queue resembles the following:
REQUEUE_EXIT_VALUES=122
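For example, a minimal queue definition in the lsb.queues file might resemble the following
sketch; the queue name and description are illustrative, and only the REQUEUE_EXIT_VALUES
line is required for requeuing these jobs:
Begin Queue
QUEUE_NAME           = normal
DESCRIPTION          = Default queue
REQUEUE_EXIT_VALUES  = 122
End Queue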