HP XC System Software Administration Guide Version 3.1
14 Managing SLURM
The HP XC system uses the Simple Linux Utility for Resource Management (SLURM). This chapter
addresses the following topics:
• “Overview of SLURM” (page 157)
• “Configuring SLURM” (page 158)
• “Restricting User Access to Nodes” (page 165)
• “Job Accounting” (page 165)
• “Monitoring SLURM” (page 169)
• “Draining Nodes” (page 170)
• “Configuring the SLURM Epilog Script” (page 171)
• “Maintaining the SLURM Daemon Log” (page 172)
• “Enabling SLURM to Recognize a New Node” (page 173)
• “Removing SLURM” (page 174)
For your convenience, the HP XC Documentation CD contains the SLURM Reference Manual, which is also
available from the following Web site:
http://www.llnl.gov/LCdocs/slurm/
IMPORTANT: If SLURM was not configured during the installation of the HP XC System Software and
you want to configure it now, you must rerun the cluster_config utility. For more information, see
the HP XC System Software Installation Guide.
14.1 Overview of SLURM
SLURM provides a simple, lightweight, scalable infrastructure for managing the computing resources of
the HP XC system. SLURM includes a job launcher, srun, which offers considerable flexibility for
requesting resources and launching serial or parallel applications. SLURM also features a Pluggable
Authentication Module that, when enabled, provides additional control over access to the computing
resources.
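For example, the following command is a minimal illustration of srun usage; the node names shown in
the output are hypothetical:

$ srun -n 4 hostname
n1
n1
n2
n2

This command runs four tasks of the hostname command on the nodes that SLURM allocates. Similarly,
srun -N 2 -n 8 ./a.out requests two nodes and launches eight tasks of a (hypothetical) user program.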
SLURM uses two daemons on the HP XC system:
slurmd
This daemon runs on each compute node in the HP XC system and is responsible for the
following:
• Starting each job on its node
• Monitoring the job's resource use
• Enforcing limits (for example, memory size)
• Freeing up resources when the job completes
The slurmd daemon runs as root so that it can start and manage jobs on behalf of any user.
slurmctld
This SLURM controller daemon runs on the node with the resource manager role as a
central controller daemon. It is responsible for the following:
• Monitoring the availability of the compute nodes
• Managing node characteristics and node partitions
• Managing jobs, that is, queuing, scheduling, and maintaining the state of jobs
Primary and backup slurmctld daemons run on separate resource manager nodes.
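You can view the node, partition, and job state that the slurmctld daemon maintains with the sinfo
and squeue commands. For example, sinfo produces output similar to the following; the partition and
node names shown are hypothetical:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
lsf          up   infinite      4   idle n[1-4]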
SLURM also enables you to configure a backup slurmctld daemon. If present, the backup daemon
monitors the state of the primary slurmctld daemon and assumes its responsibilities if it detects
that the primary daemon has failed. When the primary slurmctld daemon returns to service, it regains
control of the SLURM subsystem from the backup slurmctld daemon.
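The primary and backup controllers are named by the ControlMachine and BackupController entries in
the slurm.conf file. The following is a minimal sketch with hypothetical node names; on an HP XC
system, the cluster_config utility generates the actual entries:

# Hypothetical controller nodes; cluster_config supplies the real values
ControlMachine=n15
BackupController=n16

Use the scontrol ping command to verify that both daemons are running; it reports output similar to
the following:

$ scontrol ping
Slurmctld(primary/backup) at n15/n16 are UP/UP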