SLURM lsf Partition
An lsf partition is created in SLURM; this partition contains all the nodes that LSF-HPC manages. This
partition must be configured such that only the superuser can make allocation requests (RootOnly=YES).
This configuration prevents other users from directly accessing the resources that are being managed by
LSF-HPC. The LSF-HPC daemons, running as the superuser, make allocation requests on behalf of the owner
of the job to be dispatched. In this way, LSF-HPC creates the SLURM allocations in which users' jobs run.
The lsf partition must be configured so that the nodes can be shared by default (Shared=FORCE). Thus,
LSF-HPC can, by default, allocate serial jobs from different users on a per-processor basis (rather than on a
per-node basis), which makes the best use of the resources. This setting also enables LSF-HPC to support preemption
by allowing a new job to run while an existing job is suspended on the same resource.
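For reference, a minimal sketch of such a partition definition in the SLURM configuration file (slurm.conf)
might look like the following; the node list n[1-128] is only a placeholder and must match the nodes that
LSF-HPC manages on your system:

    PartitionName=lsf RootOnly=YES Shared=FORCE Nodes=n[1-128] State=UP

Both attributes discussed above appear on the same line: RootOnly=YES restricts allocation requests to the
superuser, and Shared=FORCE allows serial jobs to be scheduled on a per-processor basis.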
SLURM nodes can be in various states. Table 13-1 describes how LSF-HPC interprets each node state.
Table 13-1. LSF-HPC Interpretation of SLURM Node States

Free
    A node that is configured in the lsf partition and is not allocated to any job. The node is in the
    following SLURM state:

    IDLE          The node is not allocated to any job and is available for use.

In Use
    A node in any of the following SLURM states:

    ALLOCATED     The node is allocated to a job.

    COMPLETING    The node is allocated to a job that is in the process of completing. The node state is
                  removed when all the job processes have ended and the SLURM epilog program (if any)
                  has ended.

    DRAINING      The node is currently running a job but will not be allocated to additional jobs. The
                  node state changes to DRAINED when the last job on it completes.

Unavailable
    A node that is not available for use; its state is one of the following:

    DOWN          The node is not available for use.

    DRAINED       The node is not available for use per system administrator request.

    UNKNOWN       The SLURM controller has just started and the node state is not yet determined.
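You can display the current SLURM state of each node in the lsf partition with the sinfo command. The
following output is illustrative only; the node names, counts, and states depend on your system:

    # sinfo -p lsf
    PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
    lsf          up   infinite      4  alloc n[5-8]
    lsf          up   infinite     12   idle n[9-20]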
LSF-HPC Failover
LSF-HPC failover is of critical concern because only one node in the HP XC system runs the LSF-HPC daemons.
During installation, you select the primary LSF execution host from the nodes on the HP XC system that have
the resource management role; although that node can also be a compute node, this is not recommended.
Other nodes that also have the resource management role are designated as potential LSF execution host
backups.
To address this concern, LSF-HPC is configured on HP XC with a virtual host name (vhost) and a virtual IP
(vIP). The virtual IP and host name are used because they can be moved from one node to another while
maintaining a consistent LSF interface. By default, the virtual IP is an internal IP address on the HP XC administration
network, and the virtual host name is lsfhost.localdomain. The LSF execution host is configured to
host the vIP, and then the LSF-HPC daemons are started on that node.
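Because LSF-HPC is addressed through the virtual host name, standard LSF commands report it as the master
host. The following lsid output is illustrative only; the product banner and cluster name depend on your
installation:

    $ lsid
    Platform LSF HPC ...
    My cluster name is hptclsf
    My master name is lsfhost.localdomain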
The Nagios infrastructure contains a module that monitors the LSF-HPC virtual IP. If it detects a problem with
the virtual IP (for example, the inability to ping it), the monitoring code assumes the node is down and
chooses a new LSF execution host from the backup candidate nodes on which to set up the virtual IP and
restart LSF-HPC.
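If you suspect that a failover has occurred, one way to confirm which node currently hosts the virtual IP is
a quick check such as the following sketch; it assumes the default virtual host name lsfhost.localdomain and
must be run from a node on the HP XC administration network:

    # Resolve the virtual host name to its current IP address
    getent hosts lsfhost.localdomain

    # Verify that the virtual IP is reachable
    ping -c 1 lsfhost.localdomain

    # On a resource management node, check whether the virtual IP is bound to a
    # local interface (if so, this node is the current LSF execution host)
    ip addr show | grep -F "$(getent hosts lsfhost.localdomain | awk '{print $1}')"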