HP XC System Software Administration Guide Version 3.0
controllsf set primary nodename
Specifies that LSF-HPC should start, by default, on a node other than the head node.
You can also change the selection of the primary and backup nodes for the SLURM control daemon by
editing the SLURM configuration file, /hptc_cluster/slurm/etc/slurm.conf.
LSF-HPC Failover and Running Jobs
In the event of an LSF-HPC failover, LSF-HPC terminates each job that was previously running. These jobs
finish with an exit code of 122.
When the HP XC LSF execution host fails and the LSF-HPC daemons are restarted on another node,
LSF-HPC cannot monitor the running jobs to determine whether each job is running properly or is hung
indefinitely.
Ensure that each LSF-HPC queue configured in the lsb.queues file includes 122 as a requeue exit value
so that these jobs will be requeued and rerun. At a minimum, the entry for each queue should resemble the
following:
REQUEUE_EXIT_VALUES=122
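The check described above can be automated. The following is a minimal sketch, not an HP XC or LSF tool: it assumes the usual lsb.queues layout of Begin Queue / End Queue sections with a QUEUE_NAME entry, and it reports the queues whose REQUEUE_EXIT_VALUES entry does not list 122. The function name and the sample text are hypothetical, and the parser deliberately ignores continuation lines and the EXCLUDE(...) form.

```python
import re

def queues_missing_requeue_value(lsb_queues_text, exit_value=122):
    """Return the names of queues whose REQUEUE_EXIT_VALUES lacks exit_value.

    Simplified sketch: assumes Begin Queue / End Queue sections and one
    KEY = value pair per line; continuation lines are not handled.
    """
    missing = []
    in_queue = False
    name = None
    values = set()
    for raw in lsb_queues_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if re.fullmatch(r"Begin\s+Queue", line, re.IGNORECASE):
            in_queue, name, values = True, None, set()
        elif re.fullmatch(r"End\s+Queue", line, re.IGNORECASE):
            if in_queue and str(exit_value) not in values:
                missing.append(name or "<unnamed>")
            in_queue = False
        elif in_queue and "=" in line:
            key, _, val = line.partition("=")
            key = key.strip().upper()
            if key == "QUEUE_NAME":
                name = val.strip()
            elif key == "REQUEUE_EXIT_VALUES":
                values = set(val.split())
    return missing

# Hypothetical lsb.queues content used only to exercise the sketch.
SAMPLE = """
Begin Queue
QUEUE_NAME = normal
REQUEUE_EXIT_VALUES = 122
End Queue

Begin Queue
QUEUE_NAME = short
PRIORITY = 40
End Queue
"""

print(queues_missing_requeue_value(SAMPLE))
```

Any queue that the sketch reports needs 122 added to its REQUEUE_EXIT_VALUES entry before jobs in that queue can be requeued after a failover.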
LSF-HPC Monitoring
LSF-HPC is monitored and controlled by Nagios using the check_lsf plug-in.
When LSF-HPC is down, the response of the check_lsf plug-in depends on whether LSF-HPC failover is
enabled or disabled.
When LSF-HPC failover is disabled
The check_lsf plug-in returns an immediate failure notification to Nagios.
When LSF-HPC failover is enabled
The check_lsf plug-in decides if LSF-HPC is supposed to be running. If so, it acquires a list of
resource management nodes and tries to restart LSF-HPC on each of those nodes, in turn, until one
succeeds, or until the list is exhausted.
If successful, the check_lsf plug-in returns an LSF OK - restarted message.
If the restart procedure fails, the check_lsf plug-in returns a failure notification.
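The decision flow described above can be modeled as follows. This is an illustrative sketch only, not the source of the check_lsf plug-in: the function name, its parameters, and the try_restart callback are hypothetical, and the text returned when every restart attempt fails is a placeholder because this section does not give the exact wording of that notification.

```python
def respond_to_lsf_down(failover_enabled, should_be_running, nodes, try_restart):
    """Model how a monitor might respond when LSF-HPC is found to be down.

    try_restart is a hypothetical callback that attempts to start LSF-HPC
    on one node and returns True on success.
    """
    if not failover_enabled:
        # Failover disabled: report the failure immediately.
        return "LSF CRITICAL - down"
    if not should_be_running:
        # LSF-HPC was never started, so there is nothing to restart.
        return "LSF OK - currently shut down"
    for node in nodes:
        # Try each resource management node in turn until one succeeds.
        if try_restart(node):
            return "LSF OK - restarted"
    # Placeholder text: the exact failure notification is not specified here.
    return "FAILURE (placeholder): restart failed on every candidate node"

# Example run with a fake restart routine that succeeds only on "n2".
attempts = []

def fake_restart(node):
    attempts.append(node)
    return node == "n2"

result = respond_to_lsf_down(True, True, ["n1", "n2", "n3"], fake_restart)
print(result, attempts)
```

The loop stops at the first node that accepts the restart, so later nodes in the list are never contacted.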
LSF Execution Host Failure
Should the node hosting LSF-HPC become unresponsive, the Nagios check_lsf plug-in takes action.
Table 13-2 lists the Nagios messages for LSF failover monitor status:
Table 13-2. Nagios Messages for LSF Failover Monitor Status

Message                         Meaning
LSF OK - up                     The LSF-HPC environment appears to be up and
                                operational on the HP XC system.
LSF OK - currently shut down    The LSF-HPC environment has not been started on
                                the HP XC system.
LSF CRITICAL - down             LSF-HPC is not running, and LSF-HPC failover is
                                disabled.
LSF warning - restarted         The LSF-HPC environment was not running, and
                                should have been; it is being restarted. The
                                message should change to LSF OK - up the next
                                time Nagios is updated.
LSF CRITICAL - {message}        An abnormal problem occurred. The {message} text
                                provides useful diagnostic information.
Enhancing LSF-HPC
You can set environment variables to influence the operation of LSF-HPC in the HP XC system. These
environment variables affect the operation directly and set thresholds for LSF-HPC and SLURM interplay.