
If more than two nodes are assigned the resource management role, the first becomes the primary
resource management host, and the second becomes the backup SLURM host and the first LSF-HPC
with SLURM failover candidate. Additional nodes with the resource management role can serve
as LSF-HPC with SLURM failover nodes if either or both of the first two nodes are down.
Resource management candidate nodes are ordered in ASCII sort order by node name, after the
head node, which is taken first.
Example
In this example, nodes n3, n4, n15, and n16 are resource management nodes, and n16 is the
head node. The selection list is ordered as shown, and the nodes have the corresponding
assignments:
1. Node n16 hosts the primary LSF-HPC with SLURM and SLURM control daemons.
(The head node is taken first.)
2. Node n15 hosts the backup SLURM control daemon and serves as the first LSF-HPC with
SLURM failover candidate.
(The remaining nodes are ASCII-sorted.)
3. Node n3 becomes the second choice for LSF-HPC with SLURM failover.
4. Node n4 becomes the third choice for LSF-HPC with SLURM failover.
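Because the sort is byte-wise ASCII rather than numeric, n15 sorts before n3 and n4. The
following shell sketch, using the node names from the example above, illustrates the ordering:
    # LC_ALL=C forces a byte-wise (ASCII) sort, matching the candidate ordering.
    printf 'n3\nn4\nn15\n' | LC_ALL=C sort
    n15
    n3
    n4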
You can use the controllsf command to change these assignments.
controllsf disable headnode preferred
Specifies that the head node should be ordered at the end of the list, rather than at the head.
controllsf disable slurm affinity
Specifies that HP XC should attempt to place the SLURM and LSF-HPC with SLURM
daemons on separate nodes.
controllsf set primary nodename
Specifies that, by default, LSF-HPC with SLURM starts on nodename rather than on the head node.
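For example, the following sketch, run as root, uses the options described above to keep the
head node at the end of the candidate list and to start LSF-HPC with SLURM on node n4; the
node name is taken from the example above for illustration only:
    # Order the head node at the end of the failover candidate list.
    controllsf disable headnode preferred
    # Start LSF-HPC with SLURM on node n4 instead of the head node by default.
    controllsf set primary n4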
You can also change the selection of the primary and backup nodes for the SLURM control
daemon by editing the SLURM configuration file, /hptc_cluster/slurm/etc/slurm.conf.
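For instance, the following slurm.conf excerpt is a minimal sketch that assumes the node
assignments from the example above and the ControlMachine and BackupController parameters
used by the SLURM releases shipped with HP XC; verify the parameter names against the SLURM
documentation installed on your system:
    # /hptc_cluster/slurm/etc/slurm.conf (excerpt)
    # Primary SLURM control daemon host (the head node, n16, in the example above).
    ControlMachine=n16
    # Backup SLURM control daemon host (the first failover candidate, n15).
    BackupController=n15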
16.14.4 LSF-HPC with SLURM Failover and Running Jobs
In the event of an LSF-HPC with SLURM failover, any interactive LSF-HPC with SLURM jobs
are terminated because their I/O flows through the LSF daemons. These jobs finish with an exit
code of 122.
However, LSF-HPC with SLURM batch jobs run undisturbed as long as their nodes remain up.
When the HP XC LSF execution host fails and the LSF-HPC with SLURM daemons are restarted
on another node, LSF-HPC with SLURM cannot monitor the running jobs to determine whether
a job is running correctly or is hung indefinitely.
Ensure that each LSF-HPC with SLURM queue configured in the lsb.queues file includes 122
as a requeue exit value so that these jobs will be queued again and rerun. At a minimum, the entry
for each queue resembles the following:
REQUEUE_EXIT_VALUES=122
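As an illustration, a queue definition in the lsb.queues file that includes this requeue exit
value might resemble the following sketch; the queue name and description are hypothetical,
and only the REQUEUE_EXIT_VALUES line is required by this guideline:
    Begin Queue
    # Hypothetical queue name and description; adjust to your site configuration.
    QUEUE_NAME  = normal
    DESCRIPTION = Default queue
    # Requeue and rerun batch jobs that exit with code 122 after a failover.
    REQUEUE_EXIT_VALUES = 122
    End Queue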
16.14.5 Manual LSF-HPC with SLURM Failover
Use the following procedure if you need to initiate a manual LSF-HPC with SLURM failover, that
is, to move LSF-HPC with SLURM from one node to another. You might need to perform this
operation, for example, to do maintenance on the LSF execution host.
1. Log in as the superuser (root).