HP XC System Software Administration Guide Version 4.0

Table 16-1 LSF with SLURM Interpretation of SLURM Node States (continued)

Node          Description
In Use        A node in any of the following states:
              ALLOCATED     The node is allocated to a job.
              COMPLETING    The node is allocated to a job that is in the
                            process of completing. The node state is removed
                            when all the job processes have ended and the
                            SLURM epilog program (if any) has ended.
              DRAINING      The node is currently running a job but will not
                            be allocated to additional jobs. The node state
                            changes to state DRAINED when the last job on it
                            completes.
Unavailable   A node that is not available for use; its status is one of the
              following:
              DOWNED        The node is not available for use.
              DRAINED       The node is not available for use per system
                            administrator request.
              UNKNOWN       The SLURM controller has just started and the
                            node state is not yet determined.
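
The rows of Table 16-1 shown above can be expressed as a simple lookup. The following is a minimal sketch, not part of the HP XC software: the names SLURM_TO_LSF_NODE and lsf_node_category are hypothetical, and the mapping covers only the states listed in this continued portion of the table, spelled as they are printed here.

```python
# Hypothetical lookup built from the rows of Table 16-1 shown above.
# States from the earlier part of the table are intentionally absent.
SLURM_TO_LSF_NODE = {
    "ALLOCATED": "In Use",
    "COMPLETING": "In Use",
    "DRAINING": "In Use",
    "DOWNED": "Unavailable",
    "DRAINED": "Unavailable",
    "UNKNOWN": "Unavailable",
}


def lsf_node_category(slurm_state):
    """Return the Node column value for a SLURM state, or None if the
    state is not one of the rows covered by this sketch."""
    return SLURM_TO_LSF_NODE.get(slurm_state.strip().upper())
```

For example, lsf_node_category("draining") returns "In Use", because a draining node is still running a job, while lsf_node_category("DRAINED") returns "Unavailable".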
16.2.1.4 LSF with SLURM Failover
The failover of the LSF component of the integrated LSF with SLURM product is of critical
concern because only one node in the HP XC system runs the LSF with SLURM daemons. During
installation, you select the primary LSF execution host from the nodes on the HP XC system that
have the resource management role. That node can also be a compute node, but this
configuration is not recommended. Other nodes that have the resource management role are
designated as potential LSF execution host backups.
To address this concern, LSF with SLURM is configured on HP XC with a virtual host name
(vhost) and a virtual IP (vIP). The virtual IP and host name are used because they can be moved
from one node to another while maintaining a consistent LSF interface. By default, the virtual
IP is an internal IP on the HP XC administration network, and the virtual host name is
lsfhost.localdomain. The LSF execution host is configured to host the vIP, and the LSF
with SLURM daemons are then started on that node.
The Nagios infrastructure contains a module that monitors the LSF with SLURM virtual IP. If it
detects a problem with the virtual IP (for example, the inability to ping it), the monitoring code
assumes the node is down and chooses a new LSF execution host from the backup candidate
nodes on which to set up the virtual IP and restart LSF with SLURM.
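
The decision logic described above can be sketched as follows. This is an illustration only, not the Nagios module shipped with HP XC: the function names (ping_ok, pick_new_host, check_and_failover) are hypothetical, the ping flags assume a Linux ping command, and choosing the first reachable backup candidate is an assumption, because the text does not say how the real module ranks candidates. Moving the virtual IP and restarting LSF with SLURM are deliberately left out.

```python
import subprocess


def ping_ok(address, timeout_s=2):
    """Return True if a single ICMP echo to address succeeds.
    Assumes a Linux-style ping (-c count, -W timeout in seconds)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), address],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # ping is missing or cannot be executed; treat as unreachable.
        return False
    return result.returncode == 0


def pick_new_host(current_host, backup_candidates, is_up):
    """Return the first candidate that is up and is not the current
    host, or None if no candidate qualifies (assumed policy)."""
    for host in backup_candidates:
        if host != current_host and is_up(host):
            return host
    return None


def check_and_failover(vip, current_host, backup_candidates,
                       vip_ok=ping_ok, is_up=ping_ok):
    """Return the host that should run LSF with SLURM.
    If the virtual IP answers, the current host is kept; otherwise a
    backup candidate is chosen. Returns None if no candidate is up."""
    if vip_ok(vip):
        return current_host
    return pick_new_host(current_host, backup_candidates, is_up)
```

The probes are passed in as parameters so that the selection logic can be exercised without sending any network traffic, for example by supplying a function that reports a fixed set of nodes as up.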
See “LSF with SLURM Failover” (page 203) for more information.
16.3 Switching the Type of LSF Installed
The HP XC system installation process offers a choice of two different types of LSF. The default
choice is LSF with SLURM. This choice requires that SLURM is installed and configured when
you run the cluster_config utility. Standard LSF is the second type of LSF that is available
to install, and it does not interact with SLURM.
If you made the wrong LSF selection while running the cluster_config utility, perform the
following procedure to remove the current type of LSF installed and install the other type of
LSF:
1. Log in as superuser (root) on the head node.