to make specific topology-based allocation requests. See the HP XC System Software User's
Guide for more information.
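As an illustrative sketch only, such a topology-based request is typically passed to SLURM
through the LSF external scheduler option; the processor and node counts below are arbitrary,
and the exact syntax is documented in the User's Guide:

    $ bsub -n 4 -ext "SLURM[nodes=4]" -I srun hostname

This example asks LSF-HPC to allocate four processors spread across exactly four nodes for an
interactive job.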
12.1.3 SLURM lsf Partition
An lsf partition is created in SLURM; this partition contains all the nodes that LSF-HPC
manages. This partition must be configured such that only the superuser can make allocation
requests (RootOnly=YES). This configuration prevents other users from directly accessing
the resources that are being managed by LSF-HPC. The LSF-HPC daemons, running as the
superuser, make allocation requests on behalf of the owner of the job to be dispatched. This is
how LSF-HPC creates SLURM allocations for users' jobs to be run.
The lsf partition must be configured such that the nodes can be shared by default
(Shared=FORCE). Thus, LSF-HPC for SLURM can allocate serial jobs by different users on a
per-processor basis (rather than on a per-node basis) by default, which makes the best use of the
resources. This setting also allows LSF-HPC to support preemption by allowing a new job to
run while an existing job is suspended on the same resource.
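A minimal sketch of the corresponding partition definition in slurm.conf, assuming the
managed nodes are named n[1-16] (the node list is a placeholder):

    PartitionName=lsf RootOnly=YES Shared=FORCE Nodes=n[1-16]

RootOnly=YES restricts allocation requests to the superuser, and Shared=FORCE makes the
nodes shareable by default, as described above.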
SLURM nodes can be in various states. The following table describes how LSF-HPC interprets
each node state:
Table 12-1: LSF-HPC Interpretation of SLURM Node States

  Node State    Description

  Free          A node that is configured in the LSF-HPC partition and is
                not allocated to any job. The node is in the following
                state:

                IDLE        The node is not allocated to any job and is
                            available for use.

  In Use        A node in any of the following states:

                ALLOCATED   The node is allocated to a job.

                COMPLETING  The node is allocated to a job that is in the
                            process of completing. The node state is
                            removed when all the job processes have ended
                            and the SLURM epilog program (if any) has
                            ended.

                DRAINING    The node is currently running a job but will
                            not be allocated to additional jobs. The node
                            state changes to DRAINED when the last job on
                            it completes.

  Unavailable   A node that is not available for use; its status is one
                of the following:

                DOWN        The node is not available for use.

                DRAINED     The node is not available for use per
                            system administrator request.

                UNKNOWN     The SLURM controller has just started and the
                            node state is not yet determined.
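The current state of the nodes in the lsf partition can be displayed with the sinfo command.
The output below is illustrative only; the node names and counts are placeholders:

    # sinfo -p lsf
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    lsf          up   infinite      2  alloc n[1-2]
    lsf          up   infinite      1  drain n3
    lsf          up   infinite     13   idle n[4-16]

A node can be taken out of service (the DRAINED state described above) with scontrol, for
example: scontrol update NodeName=n3 State=DRAIN Reason="maintenance".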
12.1.4 LSF-HPC Failover
LSF-HPC for SLURM failover is of critical concern because only one node in the HP XC system
runs the LSF-HPC daemons. During installation, the primary LSF-HPC Execution Host is
selected from the nodes on the HP XC system that have the resource management role; although
that node could also be a compute node, this is not recommended. Other nodes that also have the
resource management role are designated as potential LSF-HPC Execution Host backups.
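Which host LSF currently considers its master can be confirmed with the standard lsid
command. The output below is illustrative (version banner omitted), and the cluster and host
names are placeholders; with the virtual hostname described next, the master is reported under
that virtual name rather than a physical node name:

    $ lsid
    My cluster name is hpxc
    My master name is lsfhost.localdomain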
To address this concern, LSF-HPC for SLURM is configured on HP XC with a virtual hostname
(vhost) and a virtual IP (vIP). The virtual IP and hostname are used because they can be
moved from one node to another, maintaining a consistent LSF interface. By default, the