
If more than two nodes are assigned the resource management role, the first becomes the primary
resource management host, and the second becomes the backup SLURM host and the first LSF-HPC
with SLURM failover candidate. Additional nodes with the resource management role can serve
as LSF-HPC with SLURM failover nodes if either or both of the first two nodes are down.
Resource management candidate nodes are ordered in ASCII sort order by node name, after the
head node, which is taken first.
Example
In this example, nodes n3, n4, n15, and n16 are resource management nodes, and n16 is the
head node. The selection list is ordered as shown, and the nodes have the corresponding
assignments:
1. Node n16 hosts the primary LSF-HPC with SLURM and SLURM control daemons.
(The head node is taken first.)
2. Node n15 hosts the backup SLURM control daemon and serves as the first LSF-HPC with
SLURM failover candidate.
(The remaining nodes are ASCII-sorted.)
3. Node n3 becomes the second choice for LSF-HPC with SLURM failover.
4. Node n4 becomes the third choice for LSF-HPC with SLURM failover.
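Because the sort is byte-wise ASCII rather than numeric, n15 sorts before n3 and n4. The
following shell sketch, using the node names from the example above, illustrates the ordering:
    # LC_ALL=C forces a byte-wise (ASCII) sort, matching the candidate ordering.
    printf 'n3\nn4\nn15\n' | LC_ALL=C sort
    n15
    n3
    n4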
You can use the controllsf command to change these assignments.
controllsf disable headnode preferred
Specifies that the head node should be ordered at the end of the list, rather than at the head.
controllsf disable slurm affinity
Specifies that HP XC should attempt to place the SLURM and LSF-HPC with SLURM
daemons on separate nodes.
controllsf set primary nodename
Specifies that, by default, LSF-HPC with SLURM starts on nodename rather than on the head node.
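For example, the following sketch, run as root, uses the options described above to keep the
head node at the end of the candidate list and to start LSF-HPC with SLURM on node n4; the
node name is taken from the example above for illustration only:
    # Order the head node at the end of the failover candidate list.
    controllsf disable headnode preferred
    # Start LSF-HPC with SLURM on node n4 instead of the head node by default.
    controllsf set primary n4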
You can also change the selection of the primary and backup nodes for the SLURM control
daemon by editing the SLURM configuration file, /hptc_cluster/slurm/etc/slurm.conf.
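For instance, the following slurm.conf excerpt is a minimal sketch that assumes the node
assignments from the example above and the ControlMachine and BackupController parameters
used by the SLURM releases shipped with HP XC; verify the parameter names against the SLURM
documentation installed on your system:
    # /hptc_cluster/slurm/etc/slurm.conf (excerpt)
    # Primary SLURM control daemon host (the head node, n16, in the example above).
    ControlMachine=n16
    # Backup SLURM control daemon host (the first failover candidate, n15).
    BackupController=n15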
16.14.4 LSF-HPC with SLURM Failover and Running Jobs
In the event of an LSF-HPC with SLURM failover, any interactive LSF-HPC with SLURM jobs
are terminated because their I/O flows through the LSF daemons. These jobs finish with an exit
code of 122.
However, LSF-HPC with SLURM batch jobs run undisturbed as long as their nodes remain up.
When the HP XC LSF execution host fails and the LSF-HPC with SLURM daemons are restarted
on another node, LSF-HPC with SLURM cannot monitor the running jobs to determine whether
a job is running correctly or is hung indefinitely.
Ensure that each LSF-HPC with SLURM queue configured in the lsb.queues file includes 122
as a requeue exit value so that these jobs will be queued again and rerun. At a minimum, the entry
for each queue resembles the following:
REQUEUE_EXIT_VALUES=122
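As an illustration, a queue definition in the lsb.queues file that includes this requeue exit
value might resemble the following sketch; the queue name and description are hypothetical,
and only the REQUEUE_EXIT_VALUES line is required by this guideline:
    Begin Queue
    # Hypothetical queue name and description; adjust to your site configuration.
    QUEUE_NAME  = normal
    DESCRIPTION = Default queue
    # Requeue and rerun batch jobs that exit with code 122 after a failover.
    REQUEUE_EXIT_VALUES = 122
    End Queue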
16.14.5 Manual LSF-HPC with SLURM Failover
Use the following procedure if you need to initiate a manual LSF-HPC with SLURM failover, that
is, to move LSF-HPC with SLURM from one node to another. You might need to perform this
operation, for example, to do maintenance on the LSF execution host.
1. Log in as the superuser (root).