to make specific topology-based allocation requests. See the HP XC System Software User's
Guide for more information.
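As an illustrative sketch only, such a topology-based request is typically passed to SLURM
through the LSF external scheduler option; the processor and node counts below are arbitrary,
and the exact syntax is documented in the User's Guide:

    $ bsub -n 4 -ext "SLURM[nodes=4]" -I srun hostname

This example asks LSF-HPC to allocate four processors spread across exactly four nodes for an
interactive job.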
12.1.3 SLURM lsf Partition
An lsf partition is created in SLURM; this partition contains all the nodes that LSF-HPC
manages. This partition must be configured such that only the superuser can make allocation
requests (RootOnly=YES). This configuration prevents other users from directly accessing
the resources that are being managed by LSF-HPC. The LSF-HPC daemons, running as the
superuser, make allocation requests on behalf of the owner of the job to be dispatched. This is
how LSF-HPC creates SLURM allocations for users' jobs to be run.
The lsf partition must be configured such that the nodes can be shared by default
(Shared=FORCE). Thus, LSF-HPC for SLURM can allocate serial jobs by different users on a
per-processor basis (rather than on a per-node basis) by default, which makes the best use of the
resources. This setting also allows LSF-HPC to support preemption by allowing a new job to
run while an existing job is suspended on the same resource.
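A minimal sketch of the corresponding partition definition in slurm.conf, assuming the
managed nodes are named n[1-16] (the node list is a placeholder):

    PartitionName=lsf RootOnly=YES Shared=FORCE Nodes=n[1-16]

RootOnly=YES restricts allocation requests to the superuser, and Shared=FORCE makes the
nodes shareable by default, as described above.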
SLURM nodes can be in various states. The following table describes how LSF-HPC interprets
each node state:
Table 12-1: LSF-HPC Interpretation of SLURM Node States

  Node State    Description

  Free          A node that is configured in the LSF-HPC partition and is
                not allocated to any job. The node is in the following
                state:

                IDLE        The node is not allocated to any job and is
                            available for use.

  In Use        A node in any of the following states:

                ALLOCATED   The node is allocated to a job.

                COMPLETING  The node is allocated to a job that is in the
                            process of completing. The node state is
                            removed when all the job processes have ended
                            and the SLURM epilog program (if any) has
                            ended.

                DRAINING    The node is currently running a job but will
                            not be allocated to additional jobs. The node
                            state changes to DRAINED when the last job on
                            it completes.

  Unavailable   A node that is not available for use; its status is one
                of the following:

                DOWN        The node is not available for use.

                DRAINED     The node is not available for use per
                            system administrator request.

                UNKNOWN     The SLURM controller has just started and the
                            node state is not yet determined.
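The current state of the nodes in the lsf partition can be displayed with the sinfo command.
The output below is illustrative only; the node names and counts are placeholders:

    # sinfo -p lsf
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    lsf          up   infinite      2  alloc n[1-2]
    lsf          up   infinite      1  drain n3
    lsf          up   infinite     13   idle n[4-16]

A node can be taken out of service (the DRAINED state described above) with scontrol, for
example: scontrol update NodeName=n3 State=DRAIN Reason="maintenance".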
12.1.4 LSF-HPC Failover
LSF-HPC for SLURM failover is of critical concern because only one node in the HP XC system
runs the LSF-HPC daemons. During installation, the primary LSF-HPC Execution Host is
selected from the nodes on the HP XC system that have the resource management role; although
that node could also be a compute node, this is not recommended. Other nodes that also have the
resource management role are designated as potential LSF-HPC Execution Host backups.
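Which host LSF currently considers its master can be confirmed with the standard lsid
command. The output below is illustrative (version banner omitted), and the cluster and host
names are placeholders; with the virtual hostname described next, the master is reported under
that virtual name rather than a physical node name:

    $ lsid
    My cluster name is hpxc
    My master name is lsfhost.localdomain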
To address this concern, LSF-HPC for SLURM is configured on HP XC with a virtual hostname
(vhost) and a virtual IP (vIP). The virtual IP and hostname are used because they can be
moved from one node to another, maintaining a consistent LSF interface. By default, the