HP XC System Software Release Notes for Version 3.1
15.3.4 Moving SLURM and LSF to Their Backup Nodes
This procedure is not yet documented in the HP XC System Software Administration Guide; it will
be included in a future version of that guide.
To move the SLURM and LSF daemons from their primary node to their backup node (for example,
to perform maintenance on the primary node), follow this procedure:
1. Log in to the backup node as root.
2. Shut down the backup slurmctld daemon:
# pkill slurmctld
3. Use the text editor of your choice to edit the /hptc_cluster/slurm/etc/slurm.conf
file. Change the value of the ControlMachine attribute to the name of the backup node, and
either comment out the BackupController attribute or change its value.
4. Save your changes and exit the text editor.
5. Shut down LSF on the primary node (you do not have to be logged in to the primary node
to do this).
Shutting down LSF on the primary node does not affect batch jobs, but it terminates
interactive LSF jobs (jobs submitted with the bsub -I option). Therefore, take the appropriate
precautions before running the following command: either warn users, or close the LSF queues
and wait for all jobs to finish.
# controllsf stop
6. Log in to the primary SLURM node and shut down the primary slurmctld daemon:
# pkill slurmctld
7. On the backup SLURM node, start the slurmctld daemon, which now runs as the primary controller:
# slurmctld
8. Start LSF locally on the backup node:
# controllsf start here
9. Enter the following command if you want to make the backup node the primary node for
LSF:
# controllsf set primary backup_nodename
If you set another node to be the BackupController for SLURM, you can log in to that node
and run the slurmctld command. This new backup node must have the resource_management
role assigned to it for this configuration to persist after future runs of the cluster_config
command.
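The slurm.conf change described in step 3 can be sketched as a small shell function. This is a minimal illustration, not part of the HP XC tooling: the function name promote_backup and the node names n16 and n15 are hypothetical, the demonstration runs against a temporary sample file rather than /hptc_cluster/slurm/etc/slurm.conf, and it assumes each attribute starts at the beginning of its line in the form Name=value.

```shell
#!/bin/sh
# Sketch only: rewrite ControlMachine and comment out BackupController.
# promote_backup, n16, and n15 are hypothetical names used for illustration.
set -eu

promote_backup() {
    conf=$1          # path to the slurm.conf file to modify
    new_primary=$2   # node that becomes the ControlMachine

    # Keep the unmodified file so the change can be reverted.
    cp "$conf" "$conf.orig"

    # Point ControlMachine at the new node and comment out BackupController.
    sed -e "s/^ControlMachine=.*/ControlMachine=$new_primary/" \
        -e "s/^BackupController=/#BackupController=/" \
        "$conf.orig" > "$conf"
}

# Demonstration on a throwaway sample file, not the live configuration.
demo=$(mktemp)
printf 'ControlMachine=n16\nBackupController=n15\nSlurmUser=slurm\n' > "$demo"
promote_backup "$demo" n15
result=$(cat "$demo")
printf '%s\n' "$result"
rm -f "$demo" "$demo.orig"
```

In the sample, ControlMachine changes from n16 to n15, the BackupController line is commented out, and unrelated lines such as SlurmUser are left untouched. To name a new backup controller instead of commenting the attribute out, edit that line by hand as described in step 3.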
To move LSF and SLURM back to the original primary node, follow the same procedure, treating
the original primary node as the backup node and the original backup node as the primary
node.
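Because the order of the steps and the node on which each one runs both matter, it can help to print the sequence with real node names filled in before starting. The following is a sketch under stated assumptions: print_failover_plan is a hypothetical helper, n16 and n15 are placeholder node names, and the function only prints the commands from the procedure above. It does not execute any of them.

```shell
#!/bin/sh
# Sketch only: print the failover steps with node names substituted.
# print_failover_plan, n16, and n15 are hypothetical; nothing is executed.
set -eu

print_failover_plan() {
    primary=$1   # node currently running the primary daemons
    backup=$2    # node that takes over

    cat <<EOF
[on $backup] pkill slurmctld
[on $backup] edit /hptc_cluster/slurm/etc/slurm.conf: ControlMachine=$backup
[on $backup] controllsf stop
[on $primary] pkill slurmctld
[on $backup] slurmctld
[on $backup] controllsf start here
[on $backup] controllsf set primary $backup (optional)
EOF
}

# Plan for moving the daemons from n16 (primary) to n15 (backup).
print_failover_plan n16 n15
```

Swapping the arguments (print_failover_plan n15 n16) prints the plan for moving the daemons back to the original primary node, which mirrors the role reversal described above.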
15.4 Manpages
discover.8
The following description of the --nowarn option is missing from the manpage:
This option is used in conjunction with the --replacenode option to instruct
the discover command not to issue the warning to have the node powered
down and set to network boot.