slurm.conf The SLURM configuration file, /hptc_cluster/slurm/etc/slurm.conf.
This file contains all the information that describes how SLURM is
configured on HP XC systems, including the following:
• Logging (syslog is the default logging mechanism)
• Debug level (debug levels range from 1 to 7; the default is 3)
• Nodes (all nodes are listed by default)
• Node partitions (only one by default)
• Authentication (MUNGE is used by default)
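The following excerpt is a simplified sketch of the kinds of entries involved; the node and
partition names shown are illustrative only, so consult the slurm.conf file on your system for
the actual parameters and values:
ControlMachine=n16
AuthType=auth/munge
SlurmctldDebug=3
SlurmdDebug=3
NodeName=n[1-16] Procs=2 TmpDisk=4096
PartitionName=lsf Nodes=n[1-16] Default=YES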
SLURM uses the MUNGE package to authenticate users between nodes in the system. Both MUNGE and
SLURM require files that contain encrypted keys. The names of the SLURM files are configured in the
/hptc_cluster/slurm/etc/slurm.conf file. The MUNGE key file is
/opt/hptc/munge/etc/keys/.munge_key. These files must be replicated on every node in the HP XC
system, which should occur by default through SystemImager; see Chapter 8: Distributing Software
Throughout the System (page 79) for more information on software distribution.
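One way to verify that the key files are identical on every node is to compare their checksums
cluster-wide; for example:
# cexec -a md5sum /opt/hptc/munge/etc/keys/.munge_key
All nodes should report the same checksum for the file.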
SLURM and MUNGE expect the system configuration to have the following characteristics. Errors can
result unless all these conditions are true (the example after the list shows one way to verify the
first and third conditions):
• Each node must be synchronized to the correct time. Communication errors occur if the node clocks
differ.
• User authentication must be available on every node. If not, non-root users will be unable to run jobs.
• The /hptc_cluster directory must be properly shared. It is exported from the head node and mounted
on all the other nodes. If this directory is not properly shared, the slurm.conf configuration file will
not be found and errors will result.
• On systems using Quadrics system interconnects, the /opt/hptc/libelanhosts/etc/elanhosts
file must be properly configured with the spconfig command, as described in the HP XC System
Software Installation Guide. Otherwise, system interconnect errors will occur, and you must restart
SLURM. See “Configuring SLURM System Interconnect Support” (page 103) for more information.
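For example, the following commands report the clock setting and the /hptc_cluster mount on every
node, which makes it easy to spot a node that is out of step (adjust to your own verification
procedure as needed):
# cexec -a date
# cexec -a df /hptc_cluster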
SLURM Run-Time Troubleshooting
The following describes how to overcome problems reported by SLURM while the HP XC system is running:
Healthy node is down The most common reason for SLURM to list an apparently
healthy node as down is that a specified resource has dropped
below the level defined for the node in the
/hptc_cluster/slurm/etc/slurm.conf file.
For example, if the temporary disk space specification is
TmpDisk=4096, but the available temporary disk space
falls below 4 GB on the node, SLURM marks the node as down.
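Commands such as the following can help identify why SLURM considers a node
to be down and what resources that node is reporting (the node
name n5 is an example):
# sinfo -R
# scontrol show node n5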
SLURM refuses to operate on some nodes If SLURM refuses to operate on some or all nodes, and the
log files in /var/slurm/log report problems with
credentials, execute the following command to confirm that
all nodes display the same time:
# cexec -a date
A difference of a few seconds is inconsequential, but SLURM is
unable to recognize the credentials of nodes within the HP
XC system that are more than 5 minutes out of
synchronization.
Checking SLURM daemons Use the following command to confirm that your control
daemons are up and running:
# scontrol ping
Checking node and partition status Use the following command to examine the status of your
nodes and partitions:
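# sinfo
(The sinfo command is the standard SLURM utility for reporting partition and node states;
its output varies with the configuration of your system.)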