HP XC System Software Administration Guide Version 4.0

In these examples, 22 processors on this HP XC system are available for use by LSF with SLURM.
LSF with SLURM obtains this information from SLURM. You can verify it with the SLURM sinfo
command:
$ sinfo --Node --long
date and time
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
n[1-10,16] 11 lsf idle 2 2:1:1 2048 1 1 (null) none
The output of the sinfo command shows that 11 nodes are available, and that each node has 2
processors.
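The arithmetic behind the 22-processor figure (11 nodes with 2 processors each) can be checked with a short script. The following is a minimal sketch, not part of the HP XC software: the function name total_cpus and the SAMPLE text are illustrative assumptions, and the parser relies only on the NODES and CPUS columns shown in the sinfo output above.

```python
def total_cpus(sinfo_output: str) -> int:
    """Sum NODES * CPUS over the data rows of `sinfo --Node --long` output."""
    lines = [ln for ln in sinfo_output.splitlines() if ln.strip()]
    # Locate the header row; anything before it (such as the date line) is skipped.
    header_idx = next(i for i, ln in enumerate(lines) if ln.split()[0] == "NODELIST")
    columns = lines[header_idx].split()
    nodes_col = columns.index("NODES")
    cpus_col = columns.index("CPUS")
    total = 0
    for row in lines[header_idx + 1:]:
        fields = row.split()
        # NODES and CPUS precede the free-text REASON column, so a plain
        # whitespace split is sufficient for these two fields.
        total += int(fields[nodes_col]) * int(fields[cpus_col])
    return total


# Sample text copied from the sinfo output shown above (date line omitted).
SAMPLE = """\
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
n[1-10,16] 11 lsf idle 2 2:1:1 2048 1 1 (null) none
"""

print(total_cpus(SAMPLE))  # 22
```

On a live system the same function could be given the captured output of the sinfo command instead of the sample text.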
The LSF lshosts command and the SLURM sinfo command both report the memory for each
node as 2,048 MB. This memory value is configured for each node in
/hptc_cluster/slurm/etc/slurm.conf; it is not obtained directly from the nodes. See the
SLURM documentation for more information on configuring the slurm.conf file.
16.13 LSF with SLURM Monitoring
LSF with SLURM is monitored and controlled by Nagios using the check_lsf plug-in.
When LSF with SLURM is down, the response of the check_lsf plug-in depends on whether
LSF with SLURM failover is enabled or disabled:
When LSF with SLURM failover is disabled
    The check_lsf plug-in returns an immediate failure notification to Nagios.
When LSF with SLURM failover is enabled
    The check_lsf plug-in determines whether LSF with SLURM is supposed to be running.
    If so, it acquires a list of resource management nodes and tries to restart LSF with
    SLURM on each of those nodes, in turn, until one restart succeeds or the list is
    exhausted.
    If the restart succeeds, the check_lsf plug-in returns an LSF OK - restarted message.
    If the restart procedure fails, the check_lsf plug-in returns a failure notification.
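The decision flow described above can be sketched in a few lines of Python. This is an illustration only, not the actual check_lsf implementation: the function signature, the restart_on callback, and the wording of the final failure message are assumptions made for the example, and the other message strings are taken from this section and Table 16-2.

```python
from typing import Callable, Iterable


def check_lsf(is_up: bool,
              failover_enabled: bool,
              should_be_running: bool,
              nodes: Iterable[str],
              restart_on: Callable[[str], bool]) -> str:
    """Return a status message following the failover logic described above.

    restart_on is a hypothetical callback that attempts a restart on one
    node and returns True on success.
    """
    if is_up:
        return "LSF OK - up"
    if not failover_enabled:
        # Failover disabled: report the failure immediately, no restart attempt.
        return "LSF CRITICAL - down"
    if not should_be_running:
        return "LSF OK - currently shut down"
    # Failover enabled: try each resource management node in turn.
    for node in nodes:
        if restart_on(node):
            return "LSF OK - restarted"
    # List exhausted; the message text after the dash is an assumed placeholder.
    return "LSF CRITICAL - restart failed on all resource management nodes"


if __name__ == "__main__":
    # Hypothetical node names; the restart succeeds only on the second node.
    print(check_lsf(False, True, True, ["n16", "n15"], lambda n: n == "n15"))
```

The loop stops at the first node on which the restart succeeds, which mirrors the "in turn, until one succeeds" behavior described in the text.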
LSF Execution Host Failure
If the node hosting LSF with SLURM becomes unresponsive, the Nagios check_lsf plug-in
takes action.
Table 16-2 lists the Nagios messages for LSF failover monitor status:
Table 16-2 Nagios Messages for LSF with SLURM Failover Monitor Status

Message                        Meaning
LSF OK - up                    The LSF with SLURM environment appears to be up and
                               operational on the HP XC system
LSF OK - currently shut down   The LSF with SLURM environment has not been started
                               on the HP XC system
LSF CRITICAL - down            LSF with SLURM is not running, and LSF with SLURM
                               failover is disabled
LSF warning - restarted        The LSF with SLURM environment was not running, and
                               should have been; it is being restarted. The message
                               changes to LSF OK - up the next time Nagios is
                               updated.
LSF CRITICAL - {message}       An abnormal problem occurred. The {message} text
                               provides useful diagnostic information.
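When post-processing these messages in a script, it can help to reduce each one to a numeric severity. The sketch below uses the standard Nagios plug-in return codes (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). That check_lsf itself exits with these codes for each message is an assumption, not something stated in this guide, and the severity function is a hypothetical helper.

```python
# Standard Nagios plug-in return codes.
NAGIOS_EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def severity(message: str) -> int:
    """Map a message such as 'LSF OK - up' to a Nagios return code.

    The second word of the message is treated as the state keyword and is
    matched case-insensitively, because Table 16-2 spells 'warning' in
    lowercase. Anything unrecognized maps to UNKNOWN.
    """
    parts = message.split()
    if len(parts) >= 2 and parts[0] == "LSF":
        return NAGIOS_EXIT_CODES.get(parts[1].upper(), NAGIOS_EXIT_CODES["UNKNOWN"])
    return NAGIOS_EXIT_CODES["UNKNOWN"]


if __name__ == "__main__":
    for msg in ("LSF OK - up",
                "LSF OK - currently shut down",
                "LSF warning - restarted",
                "LSF CRITICAL - down"):
        print(msg, "->", severity(msg))
```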
16.14 LSF with SLURM Failover
This section discusses aspects of the LSF with SLURM failover mechanism.