15.2.1.2 SLURM External Scheduler
The integration of LSF-HPC with SLURM includes the addition of a SLURM-based external scheduler.
Users can pass SLURM parameters with their job submissions, which enables them to make specific
topology-based allocation requests. See the HP XC System Software User's Guide for more information.
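For example, a user might request a particular node count through the external scheduler option of
the bsub command (shown here as -ext). The following invocation is illustrative only; the supported
SLURM[] option strings are documented in the HP XC System Software User's Guide:

    # Request 8 slots spread across 4 nodes (illustrative values)
    $ bsub -n 8 -ext "SLURM[nodes=4]" -I srun hostname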
15.2.1.3 SLURM lsf Partition
An lsf partition is created in SLURM; this partition contains all the nodes that LSF-HPC with SLURM
manages. This partition must be configured such that only the superuser can make allocation requests
(RootOnly=YES). This configuration prevents other users from directly accessing the resources that are
being managed by LSF-HPC with SLURM. The LSF-HPC with SLURM daemons, running as the superuser,
make allocation requests on behalf of the owner of the job being dispatched. In this way, LSF-HPC with
SLURM creates the SLURM allocations in which users' jobs run.
The lsf partition must also be configured so that nodes are shared by default (Shared=FORCE). This allows
LSF-HPC with SLURM to allocate serial jobs from different users on a per-processor basis (rather than on
a per-node basis), which makes the best use of the resources. This setting also enables LSF-HPC
with SLURM to support preemption by allowing a new job to run while an existing job is suspended on
the same resource.
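A minimal sketch of the corresponding partition entry in slurm.conf follows. The node range is a
placeholder; on an HP XC system the lsf partition is normally created and maintained by the system
configuration tools rather than edited by hand:

    # Placeholder node list; RootOnly and Shared are the settings discussed above
    PartitionName=lsf RootOnly=YES Shared=FORCE Nodes=n[1-16] State=UP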
SLURM nodes can be in various states. Table 15-1 describes how LSF-HPC with SLURM interprets each
node state.
Table 15-1 LSF-HPC with SLURM Interpretation of SLURM Node States
Node          Description

Free          A node that is configured in the lsf partition and is not
              allocated to any job. The node is in the following state:

              IDLE          The node is not allocated to any job and is
                            available for use.

In Use        A node in any of the following states:

              ALLOCATED     The node is allocated to a job.

              COMPLETING    The node is allocated to a job that is in the
                            process of completing. The node state is removed
                            when all the job processes have ended and the
                            SLURM epilog program (if any) has ended.

              DRAINING      The node is currently running a job but will not
                            be allocated to additional jobs. The node state
                            changes to DRAINED when the last job on it
                            completes.

Unavailable   A node that is not available for use; its state is one of the
              following:

              DOWN          The node is not available for use.

              DRAINED       The node is not available for use per system
                            administrator request.

              UNKNOWN       The SLURM controller has just started and the
                            node state is not yet determined.
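You can display the current state of the nodes in the lsf partition with the sinfo command. The
following output is illustrative; node names, counts, and states depend on your system:

    $ sinfo -p lsf
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    lsf          up   infinite      2  alloc n[5-6]
    lsf          up   infinite      1  drain n7
    lsf          up   infinite     13   idle n[8-20]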
15.2.1.4 LSF-HPC with SLURM Failover
The failover of the LSF component of the integrated LSF-HPC with SLURM product is of critical concern
because only one node in the HP XC system runs the LSF-HPC with SLURM daemons. During installation,
you select the primary LSF execution host from the nodes on the HP XC system that have the resource
management role; that node can also be a compute node, but this is not recommended. Other nodes
that also have the resource management role are designated as potential backup LSF execution hosts.
To address this concern, LSF-HPC with SLURM is configured on HP XC with a virtual host name (vhost)
and a virtual IP address (vIP). The virtual IP and host name are used because they can be moved from one
node to another while maintaining a consistent LSF interface. By default, the virtual IP is an internal IP address on the HP
XC administration network, and the virtual host name is lsfhost.localdomain. The LSF execution
host is configured to host the vIP, and then the LSF-HPC with SLURM daemons are started on that node.
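Because LSF clients address the execution host by its virtual host name, LSF commands report
lsfhost.localdomain as the master regardless of which node currently hosts the vIP. Output similar
to the following, shown here in abbreviated and illustrative form with a placeholder cluster name, is
typical of the lsid command:

    $ lsid
    Platform LSF HPC for SLURM, ...
    My cluster name is hpcluster
    My master name is lsfhost.localdomain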