HP XC System Software Administration Guide Version 2.1
• Node partitions (only one by default)
• Authentication (MUNGE is used by default)
SLURM uses the MUNGE package to authenticate users between nodes in the system. Both
MUNGE and SLURM require files that contain encrypted keys. The names of the SLURM files
are configured in the /hptc_cluster/slurm/etc/slurm.conf file; the MUNGE
key file is /opt/hptc/munge/etc/keys/.munge_key. These files must be replicated
on every node in the HP XC system, which should occur by default through SystemImager;
see Chapter 7 for more information on software distribution.
SLURM and MUNGE expect the following characteristics of the system configuration. Errors
may result unless all these conditions are true:
• Each node must be synchronized to the correct time. Communication errors occur if the
node clocks differ.
• User authentication must be available on every node. If not, non-root users will be unable
to run jobs.
• The /hptc_cluster directory must be properly shared. It is exported from the head
node and mounted on all the other nodes. If this directory is not properly shared, the
slurm.conf configuration file will not be found and errors will result.
• On systems using Quadrics system interconnects, the
/opt/hptc/libelanhosts/etc/elanhosts file must be properly configured with the
spconfig command, as described in the HP XC System Software Installation Guide.
Otherwise, system interconnect errors will occur, and SLURM must be restarted. See
Section 11.6 for more information.
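The file-related conditions above can be spot-checked on a node with a short script. The following is a minimal sketch, not an HP XC utility: check_file is a hypothetical helper name, and the script assumes it is run as root so that the MUNGE key file is readable.

```shell
#!/bin/sh
# check_file is a hypothetical helper, not part of HP XC, SLURM, or MUNGE.
# It reports whether the given file exists and is readable on this node.
check_file() {
    if [ -r "$1" ]; then
        echo "OK $1"
    else
        echo "MISSING $1"
    fi
}

# Paths are the ones named in this section. If /hptc_cluster is not
# mounted on the node, the first check reports MISSING.
check_file /hptc_cluster/slurm/etc/slurm.conf
check_file /opt/hptc/munge/etc/keys/.munge_key
```

A MISSING result for slurm.conf on a node other than the head node suggests that the /hptc_cluster directory is not mounted there.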
16.2.2 SLURM Run-Time Troubleshooting
The following describes how to overcome problems reported by SLURM while the HP XC
system is running.
Healthy node is down
The most common reason for SLURM to list an apparently healthy node down is
that a specified resource has dropped below the level defined for the node in the
/hptc_cluster/slurm/etc/slurm.conf file.
For example, if the temporary disk space specification is TmpDisk=4096, but the
available temporary disk space falls below 4 GB on the node, SLURM marks the node
as down.
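The comparison in this example can be sketched as a small script run on the affected node. This is an illustration only: tmpdisk_status is a hypothetical helper name, the threshold is given in megabytes to match TmpDisk=4096, and /tmp is assumed to be the temporary file system that SLURM measures.

```shell
#!/bin/sh
# tmpdisk_status is a hypothetical helper, not a SLURM command.
# Arguments: required space in MB (the TmpDisk value) and available space in MB.
tmpdisk_status() {
    required_mb=$1
    available_mb=$2
    if [ "$available_mb" -ge "$required_mb" ]; then
        echo "ok"
    else
        echo "below threshold"
    fi
}

# Free space on /tmp in MB (fourth column of the POSIX-format df output).
# Assumption: /tmp is the file system that the TmpDisk value refers to.
avail=$(df -Pm /tmp | awk 'NR==2 {print $4}')
tmpdisk_status 4096 "$avail"
```

If the script reports that the space is below the threshold, freeing temporary disk space on the node or lowering the TmpDisk value in slurm.conf are the two obvious directions to investigate.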
SLURM refuses to operate on some nodes
If SLURM refuses to operate on some or all nodes, and the log files in
/var/slurm/log complain of problems with credentials, execute the following
command to confirm that all nodes display the same time:
# cexec -a date
A matter of a few seconds is inconsequential, but SLURM is unable to recognize the
credentials of nodes within the HP XC system that are more than five minutes out
of synchronization.
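The five-minute rule can be expressed as a simple comparison of two clock readings. The sketch below assumes that both readings are available as seconds since the epoch (for instance, from date +%s run on each node); skew_status is a hypothetical helper name, and collecting the readings from the cexec output is not shown because its output format is not described here.

```shell
#!/bin/sh
# skew_status is a hypothetical helper, not an HP XC or SLURM command.
# Arguments: reference time and node time, both in seconds since the epoch,
# plus an optional limit in seconds (default 300, that is, five minutes).
skew_status() {
    ref=$1
    node=$2
    limit=${3:-300}
    diff=$((node - ref))
    # Use the absolute difference so that slow and fast clocks are treated alike.
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi
    if [ "$diff" -le "$limit" ]; then
        echo "in sync"
    else
        echo "out of sync"
    fi
}

# Example: a node clock 301 seconds ahead of the reference exceeds the limit.
skew_status 1000 1301
```

A node reported as out of sync by such a check is a likely candidate for the credential errors described above.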
Checking SLURM daemons
Use the following command to confirm that your control daemons are up and running:
# scontrol ping