HP XC System Software Administration Guide Version 3.1
RX packets:7 errors:0 dropped:0 overruns:0 frame:0
TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:420 (420.0 b) TX bytes:240 (240.0 b)
You can try to ping other nodes that are connected to the network.
8. You can find additional information about InfiniBand in the /proc/voltaire directory. Use the
find command to display it:
# find /proc/voltaire -type f -print -exec cat {} \;
20.5 SLURM Troubleshooting
This section discusses SLURM configuration issues and run-time troubleshooting.
20.5.1 SLURM Configuration Issues
SLURM consists of the following primary components:
slurmctld
The master/backup control daemon.
slurmd
The slave daemon; an instance runs on each node.
Command binaries
The sinfo, srun, scancel, squeue, and scontrol commands.
slurm.conf
The SLURM configuration file, /hptc_cluster/slurm/etc/slurm.conf.
This file contains all the information necessary to understand how SLURM is
configured on HP XC systems, including the following:
• Logging (syslog is the default logging mechanism)
• Debug level (the debug levels range from 1 to 7; the default debug level is 3)
• Nodes (all nodes are listed by default)
• Node partitions (only one by default)
• Authentication (MUNGE is used by default)
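These settings appear in slurm.conf as keyword=value pairs. The fragment below is only an illustrative sketch of what such a file might contain; the node and partition names are hypothetical, not HP XC defaults:

```
# Authentication and debug level
AuthType=auth/munge
SlurmctldDebug=3
# Nodes, and a single default partition containing all of them
NodeName=n[1-16] Procs=2
PartitionName=all Nodes=n[1-16] Default=YES
```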
SLURM uses the MUNGE package to authenticate users between nodes in the system. Both MUNGE and
SLURM require files that contain encrypted keys. The names of the SLURM files are configured in the
/hptc_cluster/slurm/etc/slurm.conf file. The MUNGE key file is
/opt/hptc/munge/etc/keys/.munge_key. These files must be replicated on every node in the HP
XC system, which occurs by default through SystemImager; see Chapter 10: Distributing Software
Throughout the System (page 129) for more information on software distribution.
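One way to confirm that the key files really are identical everywhere is to compare checksums collected from the nodes. The sketch below assumes the checksums have already been gathered into a file of pdsh-style lines ("host: checksum path"); the node list, file name, and function name are illustrative:

```shell
#!/bin/sh
# Gather checksums from all nodes first, for example (hypothetical node list):
#   pdsh -w 'n[1-16]' md5sum /opt/hptc/munge/etc/keys/.munge_key > /tmp/keysums

# keys_match FILE: print "match" if every line of FILE carries the same
# checksum in its second field, or "MISMATCH" otherwise.
keys_match() {
    distinct=$(awk '{print $2}' "$1" | sort -u | wc -l)
    if [ "$distinct" -eq 1 ]; then
        echo match
    else
        echo MISMATCH
    fi
}
```

Seeing MISMATCH from keys_match /tmp/keysums points at a node whose key was not replicated; the same check applies to the SLURM key files named in slurm.conf.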
SLURM and MUNGE make the following assumptions about the system configuration; errors can result
unless all of these conditions are true:
• Each node must be synchronized to the correct time. Communication errors occur if the node clocks
differ.
• User authentication must be available on every node. If not, non-root users will be unable to run jobs.
• The /hptc_cluster directory must be properly shared. It is exported from the head node and
mounted on all the other nodes. If this directory is not properly shared, the slurm.conf configuration
file will not be found and errors will result.
• On systems using Quadrics system interconnects, the /opt/hptc/libelanhosts/etc/elanhosts
file must be properly configured with the spconfig command, as described in the HP XC System
Software Installation Guide. Otherwise, system interconnect errors will occur, and you must restart
SLURM. See “Configuring SLURM System Interconnect Support” (page 159) for more information.
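The first three conditions can be probed from a script. The sketch below uses the paths named above, but the function name and the one-second clock tolerance are illustrative assumptions, not SLURM limits:

```shell
#!/bin/sh
# clock_skew_ok SECS_A SECS_B: succeed if two node clocks (epoch seconds,
# e.g. from 'date +%s' on each node) differ by at most one second.
# The one-second tolerance is an illustrative choice.
clock_skew_ok() {
    skew=$(( $1 - $2 ))
    [ "$skew" -lt 0 ] && skew=$(( -skew ))
    [ "$skew" -le 1 ]
}

# On each node, the shared directory and configuration file must be visible:
#   mount | grep -q /hptc_cluster || echo "/hptc_cluster is not mounted"
#   [ -r /hptc_cluster/slurm/etc/slurm.conf ] || echo "slurm.conf not found"
```

For example, clock_skew_ok "$(date +%s)" "$(ssh n2 date +%s)" compares the local clock with that of a hypothetical node n2.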
20.5.2 SLURM Run-Time Troubleshooting
The following describes how to overcome problems reported by SLURM while the HP XC system is
running:
240 Troubleshooting