HP XC System Software Administration Guide Version 2.1
By default, the HP XC resource management system attempts to place the SLURM controller
and the LSF execution host on the same node to constrain the use of system resources. If
only one node has the resource management role, the LSF-HPC execution daemons and the
SLURM control daemon both run on that node.
If two nodes are assigned the resource management role, by default, the first node becomes the
primary resource management node, and the second node is the backup resource management
node.
If more than two nodes are assigned the resource management role, the first becomes the
primary resource management host, and the second becomes the backup SLURM host and the
first LSF-HPC failover candidate. Additional nodes with the resource management role can
serve as LSF-HPC failover nodes if either or both of the first two nodes are down.
Resource management candidate nodes are ordered in ASCII sort order by node name, except
that the head node is taken first.
Example
Suppose nodes n3, n4, n15, and n16 are resource management nodes, and n16 is the head
node. The selection list is ordered as shown, and the nodes have the corresponding assignments:
1. n16 hosts the primary LSF-HPC and SLURM control daemons.
(the head node is taken first)
2. n15 hosts the backup SLURM control daemon and serves as the first LSF-HPC failover
candidate.
(the remaining nodes are ASCII-sorted)
3. n3 becomes the second choice for LSF-HPC failover.
4. n4 becomes the third choice for LSF-HPC failover.
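The ordering rule illustrated above can be sketched in a few lines of Python (this is an illustration of the selection logic only, not part of the HP XC software):

```python
# Reproduce the candidate ordering rule: the head node is taken first,
# then the remaining resource management nodes follow in ASCII sort
# order by node name.
def selection_order(nodes, head):
    return [head] + sorted(n for n in nodes if n != head)

# ASCII sorting places "n15" before "n3" because the character "1"
# sorts before "3", which is why n15 precedes n3 in the example.
print(selection_order(["n3", "n4", "n15", "n16"], "n16"))
# -> ['n16', 'n15', 'n3', 'n4']
```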
You can use the controllsf command to change these assignments.
controllsf disable headnode preferred
Specifies that the head node should be ordered at the end of the list, rather than at the head.
controllsf disable slurm affinity
Specifies that HP XC should attempt to place the SLURM and LSF-HPC daemons on
separate nodes.
controllsf set primary nodename
Specifies that LSF-HPC should start on some node other than the head node by default.
You can also change the selection of the primary and backup nodes for the SLURM control
daemon by editing the SLURM configuration file, /hptc_cluster/slurm/etc/slurm.conf.
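In SLURM releases contemporary with this software, the primary and backup controller nodes are selected with the ControlMachine and BackupController keywords. A minimal sketch, using the node names from the example above (a real slurm.conf contains many other settings, which are omitted here):

```
# /hptc_cluster/slurm/etc/slurm.conf (fragment)
ControlMachine=n16      # primary SLURM control daemon
BackupController=n15    # backup SLURM control daemon
```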
12.9.4 LSF-HPC Failover and Running Jobs
In the event of an LSF-HPC failover, LSF-HPC kills each job that was previously running.
These jobs finish with an exit code of 122.
When the HP XC LSF execution host fails and the LSF-HPC daemons are restarted on another
node, LSF-HPC cannot monitor the running jobs to determine whether a job is running
appropriately or is hung indefinitely.
Ensure that each LSF-HPC queue configured in the lsb.queues file includes 122 as a
requeue exit value so that these jobs will be requeued and rerun. At a minimum, the entry for
each queue should resemble the following:
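A sketch of such an entry, assuming a hypothetical queue named normal; the essential line is the REQUEUE_EXIT_VALUES parameter:

```
Begin Queue
QUEUE_NAME          = normal
REQUEUE_EXIT_VALUES = 122
End Queue
```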