HP XC System Software Administration Guide Version 3.1
14 Managing SLURM
The HP XC system uses the Simple Linux Utility for Resource Management (SLURM). This chapter
addresses the following topics:
• “Overview of SLURM” (page 157)
• “Configuring SLURM” (page 158)
• “Restricting User Access to Nodes” (page 165)
• “Job Accounting” (page 165)
• “Monitoring SLURM” (page 169)
• “Draining Nodes” (page 170)
• “Configuring the SLURM Epilog Script” (page 171)
• “Maintaining the SLURM Daemon Log” (page 172)
• “Enabling SLURM to Recognize a New Node” (page 173)
• “Removing SLURM” (page 174)
For your convenience, the HP XC Documentation CD contains the SLURM Reference Manual, which is also
available from the following Web site:
http://www.llnl.gov/LCdocs/slurm/
IMPORTANT: If SLURM was not configured during the installation of the HP XC System Software and
you want to configure it now, you must rerun the cluster_config utility. For more information, see
the HP XC System Software Installation Guide.
14.1 Overview of SLURM
SLURM provides a simple, lightweight, scalable infrastructure for managing the computing resources of
the HP XC system. SLURM includes a job launcher, srun, which offers considerable flexibility for
requesting resources and launching serial or parallel applications. SLURM also features a Pluggable
Authentication Module that, when enabled, provides additional control over access to the computing
resources.
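For example, the following command is a minimal illustration of srun usage; the node names shown in
the output are hypothetical:

$ srun -n 4 hostname
n1
n1
n2
n2

This command runs four tasks of the hostname command on the nodes that SLURM allocates. Similarly,
srun -N 2 -n 8 ./a.out requests two nodes and launches eight tasks of a (hypothetical) user program.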
SLURM uses two daemons on the HP XC system:
slurmd
This daemon runs on each compute node in the HP XC system and is responsible for the
following:
• Starting each job on its node
• Monitoring the job's resource use
• Enforcing limits (for example, memory size)
• Freeing up resources when the job completes
The slurmd daemon runs as root so that it can start and manage jobs on behalf of any user.
slurmctld
This SLURM controller daemon runs on the node with the resource manager role as a
central controller daemon. It is responsible for the following:
• Monitoring the availability of the compute nodes
• Managing node characteristics and node partitions
• Managing jobs, that is, queuing, scheduling, and maintaining the state of jobs
Primary and backup slurmctld daemons run on separate resource manager nodes.
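You can view the node, partition, and job state that the slurmctld daemon maintains with the sinfo
and squeue commands. For example, sinfo produces output similar to the following; the partition and
node names shown are hypothetical:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
lsf          up   infinite      4   idle n[1-4]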
SLURM also enables you to configure a backup slurmctld daemon. If present, the backup daemon
monitors the state of the primary slurmctld daemon and assumes its responsibilities if it detects
that the primary daemon has failed. When the primary slurmctld daemon returns to service, it regains
control of the SLURM subsystem from the backup slurmctld daemon.
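The primary and backup controllers are named by the ControlMachine and BackupController entries in
the slurm.conf file. The following is a minimal sketch with hypothetical node names; on an HP XC
system, the cluster_config utility generates the actual entries:

# Hypothetical controller nodes; cluster_config supplies the real values
ControlMachine=n15
BackupController=n16

Use the scontrol ping command to verify that both daemons are running; it reports output similar to
the following:

$ scontrol ping
Slurmctld(primary/backup) at n15/n16 are UP/UP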