HP XC System Software Administration Guide Version 3.0
12. Managing SLURM
The HP XC system uses the Simple Linux Utility for Resource Management (SLURM). This chapter addresses
the following topics:
• Overview of SLURM (page 101)
• Configuring SLURM (page 102)
• Restricting User Access to Nodes (page 109)
• Job Accounting (page 109)
• Monitoring SLURM (page 113)
• Draining Nodes (page 113)
• Configuring the SLURM Epilog Script (page 115)
• SLURM Daemon Log Maintenance (page 116)
For your convenience, the HP XC Documentation CD contains the SLURM Reference Manual, which is also
available from the following Web site:
http://www.llnl.gov/LCdocs/slurm/
Overview of SLURM
SLURM provides a simple, lightweight, scalable infrastructure for managing the computing resources of the
HP XC system. SLURM includes a job launcher, srun, that offers considerable flexibility in requesting resources
and dispatching serial or parallel applications. SLURM also features a Pluggable Authentication Module
that, when enabled, can provide more control over access to the computing resources.
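As a minimal sketch of how a resource request is expressed to the job launcher, the following Python helper assembles an srun command line. The helper name srun_command and its parameters are hypothetical and exist only for illustration; the -n (number of tasks) and -N (number of nodes) options are standard srun options. The helper only builds the argument list and does not launch anything.

```python
def srun_command(program, ntasks=1, nodes=None):
    """Build an srun argument list (hypothetical helper, for illustration).

    program -- list holding the executable and its arguments
    ntasks  -- number of tasks to launch (srun -n)
    nodes   -- optional number of nodes to request (srun -N)
    """
    cmd = ["srun", "-n", str(ntasks)]
    if nodes is not None:
        cmd += ["-N", str(nodes)]
    # The program and its arguments follow the srun options.
    cmd += list(program)
    return cmd
```

For example, srun_command(["hostname"], ntasks=4) yields the arguments for running four copies of hostname. On a node where srun is installed, the list could be passed to subprocess.run.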
SLURM uses two daemons on the HP XC system:
slurmd This daemon runs on each compute node in the HP XC system and is responsible for the
following:
• Starting each job on its node
• Monitoring the job's resource use
• Enforcing limits (for example, memory size)
• Freeing up resources when the job completes
The slurmd daemon runs as root to control starting and managing user jobs.
slurmctld This SLURM controller daemon runs on the node with the resource manager role as a central
controller daemon. It is responsible for the following:
• Monitoring the availability of the compute nodes
• Managing node characteristics and node partitions
• Managing jobs, that is, queuing and scheduling jobs and maintaining their state
Primary and backup slurmctld daemons run on separate resource manager nodes.
SLURM also enables you to configure a backup slurmctld daemon. If present, this backup daemon
monitors the state of the primary slurmctld daemon. If the backup daemon detects that the primary
slurmctld daemon has failed, it assumes the responsibilities of the primary slurmctld daemon. On
returning to service, the primary slurmctld daemon regains control of the SLURM subsystem from the
backup slurmctld daemon.
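The takeover and return-to-service behavior described above can be pictured with a toy model. This is only a sketch under stated assumptions, not SLURM's implementation: the class ControllerPair, its method names, and the timeout_s parameter are all hypothetical, and the model assumes the backup decides that the primary has failed when the primary has not responded within a timeout.

```python
class ControllerPair:
    """Toy model of a primary/backup slurmctld pair (not SLURM's actual code)."""

    def __init__(self, timeout_s=120):
        # Hypothetical timeout: how long the backup waits for the primary.
        self.timeout_s = timeout_s
        self.last_primary_response = 0.0
        self.active = "primary"

    def primary_responded(self, now):
        """Record a response from the primary at time `now` (seconds)."""
        self.last_primary_response = now
        # A primary that is in service always holds control.
        self.active = "primary"

    def backup_check(self, now):
        """One monitoring pass by the backup; returns the active controller."""
        if now - self.last_primary_response > self.timeout_s:
            # The primary is considered failed, so the backup takes over.
            self.active = "backup"
        return self.active
```

In this model, control moves to the backup only after the timeout expires, and the next response from the primary returns control to it, mirroring the sequence in the text.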
SLURM offers a set of utilities that provide information about SLURM configuration, state, and jobs, most
notably scontrol, squeue, and sinfo. See scontrol(1), squeue(1), and sinfo(1) for more
information about these utilities.
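As an illustration of working with the information these utilities report, the following Python sketch totals node counts by state from the default sinfo output. The sample text is invented for this example (partition name, node counts, and node names are assumptions); only the column headings follow the default sinfo layout. The function names are hypothetical.

```python
import shutil
import subprocess

# Illustrative sample only: the column layout follows the default sinfo
# output, but the partition name, node counts, and node names are made up.
SAMPLE_SINFO = """\
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
lsf up infinite 6 idle n[1-6]
lsf up infinite 2 down* n[7-8]
"""


def nodes_by_state(sinfo_text):
    """Sum the NODES column for each STATE value in default sinfo output."""
    lines = [ln.split() for ln in sinfo_text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    nodes_col = header.index("NODES")
    state_col = header.index("STATE")
    totals = {}
    for row in rows:
        state = row[state_col]
        totals[state] = totals.get(state, 0) + int(row[nodes_col])
    return totals


def current_sinfo_text():
    """Return live sinfo output if sinfo is on PATH, otherwise the sample."""
    if shutil.which("sinfo") is None:
        return SAMPLE_SINFO
    result = subprocess.run(["sinfo"], capture_output=True, text=True, check=True)
    return result.stdout
```

Calling nodes_by_state(current_sinfo_text()) on a system node would summarize the live partition state; elsewhere it falls back to the sample text. The parser assumes the default column layout and would need adjusting if sinfo is run with a custom output format.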
SLURM enables you to collect and analyze job accounting information. “Configuring Job Accounting”
(page 111) describes how to configure job accounting information on the HP XC system.
“SLURM Troubleshooting” (page 163) provides SLURM troubleshooting information.