HP XC System Software Administration Guide Version 4.0

21.6 SLURM Troubleshooting
This section discusses SLURM troubleshooting in two parts: configuration issues and
run-time problems.
21.6.1 SLURM Configuration Issues
SLURM consists of the following primary components:
slurmctld         A master/backup daemon.
slurmd            A slave daemon.
Command binaries  The sinfo, srun, scancel, squeue, and scontrol commands.
slurm.conf        The SLURM configuration file, /hptc_cluster/slurm/etc/slurm.conf.
                  This file contains all the information necessary to understand
                  how SLURM is configured on HP XC systems, including the following:
                    Logging (syslog is the default logging mechanism)
                    Debug level (levels range from 1 to 7; the default is 3)
                    Nodes (all nodes are listed by default)
                    Node partitions (only one by default)
                    Authentication (MUNGE is used by default)
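As a minimal sketch, the settings listed above can be read directly from the configuration file when checking how a system is configured. The helper name slurm_conf_get is hypothetical, and the keywords in the usage note (AuthType, PartitionName) are standard slurm.conf keywords that are assumed, not confirmed, to appear in the HP XC file.

```shell
#!/bin/sh
# Configuration file path taken from the text above; override SLURM_CONF to
# point at a copy of the file.
SLURM_CONF="${SLURM_CONF:-/hptc_cluster/slurm/etc/slurm.conf}"

# slurm_conf_get KEY [FILE]  (hypothetical helper)
# Print everything after "KEY=" for each non-comment line that starts with
# KEY. The keyword match is case-insensitive.
slurm_conf_get() {
    key="$1"
    file="${2:-$SLURM_CONF}"
    awk -v k="$key" '
    {
        sub(/#.*/, "")        # drop comments
        sub(/^[ \t]+/, "")    # trim leading whitespace
        sub(/[ \t]+$/, "")    # trim trailing whitespace
        i = index($0, "=")
        if (i > 1 && tolower(substr($0, 1, i - 1)) == tolower(k))
            print substr($0, i + 1)
    }' "$file"
}
```

For example, slurm_conf_get AuthType prints the configured authentication setting, and slurm_conf_get PartitionName prints one line per partition definition. The file shows what is configured on disk; the scontrol show config command reports the configuration that the running slurmctld daemon is using.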
SLURM uses the MUNGE package to authenticate users between nodes in the system. Both
MUNGE and SLURM require files that contain encrypted keys. The names of the SLURM files
are configured in the /hptc_cluster/slurm/etc/slurm.conf file. The MUNGE key
file is /opt/hptc/munge/etc/keys/.munge_key. These files must be replicated on every
node in the HP XC system, which occurs by default through SystemImager; see Chapter 11:
Distributing Software Throughout the System (page 141) for more information on software
distribution.
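Because the key file must be identical on every node, one way to confirm replication is to compare checksums. The following is a sketch only: the function name find_key_mismatches is hypothetical, and it assumes checksum lines of the form "node: checksum path", which is what pdsh produces when it prefixes md5sum output with the node name. Whether pdsh and md5sum are the appropriate tools on a given system is also an assumption.

```shell
#!/bin/sh
# find_key_mismatches REF_NODE  (hypothetical helper)
# Read "node: checksum  path" lines on standard input and print, in sorted
# order, every node whose checksum differs from that of REF_NODE.
find_key_mismatches() {
    ref_node="$1"
    awk -v ref="$ref_node" '
    {
        node = $1
        sub(/:$/, "", node)   # strip the colon that follows the node name
        sum[node] = $2
    }
    END {
        if (!(ref in sum)) {
            print "no checksum found for reference node " ref > "/dev/stderr"
            exit
        }
        for (n in sum)
            if (sum[n] != sum[ref])
                print n
    }' | sort
}
```

A possible invocation, run as root and using n16 as a placeholder for a node known to hold the correct key:

# pdsh -a md5sum /opt/hptc/munge/etc/keys/.munge_key | find_key_mismatches n16

No output means every reporting node has the same key as the reference node. Nodes that did not respond are not listed, so compare the number of input lines with the expected node count.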
SLURM and MUNGE expect the following characteristics of the system configuration. Errors
can result unless all these conditions are true:
Each node must be synchronized to the correct time. Communication errors occur if the
node clocks differ.
User authentication must be available on every node. If not, non-root users will be unable
to run jobs.
The /hptc_cluster directory must be properly shared. It is exported from the head node
and mounted on all the other nodes. If this directory is not properly shared, the slurm.conf
configuration file will not be found and errors will result.
On systems using Quadrics system interconnects, the
/opt/hptc/libelanhosts/etc/elanhosts file must be properly configured with the
spconfig command, as described in the HP XC System Software Installation Guide.
Otherwise, system interconnect errors will
occur, and you must restart SLURM. See “Configuring SLURM System Interconnect Support”
(page 172) for more information.
On systems using Quadrics system interconnects, the spconfig command might report
that a node has less memory than expected.
To verify the node's memory size, run the following command and scroll through the
output, comparing the node's memory with that of nodes of the same type:
# shownode config | less
Perform the following steps:
1. Log in to the node in question as superuser (root).
2. Run the following command: