
Healthy node is down
The most common reason for SLURM to list an apparently healthy node as down is that a specified resource has dropped below the level defined for the node in the /hptc_cluster/slurm/etc/slurm.conf file. For example, if the temporary disk space specification is TmpDisk=4096 but the available temporary disk space on the node falls below 4 GB, SLURM marks the node as down.
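To see why SLURM set a node down and to check the resource in question, commands such as the following can help; sinfo -R lists the reason recorded for each down or drained node, n15 is a placeholder node name, and the df check assumes that temporary space is provided by /tmp:
# sinfo -R
# scontrol show node n15
# df -h /tmp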
SLURM refuses to operate on some nodes
If SLURM refuses to operate on some or all nodes, and the log files in /var/slurm/log report problems with credentials, execute the following command to confirm that all nodes display the same time:
# cexec -a date
A difference of a few seconds is inconsequential, but SLURM cannot recognize the credentials of nodes within the HP XC system that are more than 5 minutes out of synchronization.
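Once the clocks are synchronized again, a node that was marked down because of credential errors can be returned to service with scontrol; n15 is a placeholder for the affected node:
# scontrol update NodeName=n15 State=RESUME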
Checking SLURM daemons
Use the following command to confirm that your control daemons are up and running:
# scontrol ping
Checking node and partition status
Use the following command to examine the status of your nodes and partitions:
# sinfo --all
20.6 LSF-HPC Troubleshooting
Take the following steps if you have trouble submitting jobs or controlling LSF-HPC with SLURM; command examples for several of these steps follow the list:
• Ensure that the number of nodes in the lsf partition is less than or equal to the number of nodes reported in the XC.lic file. Sample entries follow:
  INCREMENT XC-CPUS Compaq auth.number exp. date nodes ...
  INCREMENT XC-PROCESSORS Compaq auth.number exp. date nodes ...
  The value for nodes in the XC-CPUS or XC-PROCESSORS entry specifies the number of licensed nodes for this system. If this value does not match the actual number of nodes, the LSF service may fail to start LSF. (Example below.)
• Use the lshosts command to determine the number of processors reported by LSF.
• Ensure that the date and time are synchronized throughout the HP XC system.
• Verify that the /hptc_cluster directory (file system) is properly mounted on all nodes; SLURM relies on this file system. (Example below.)
• Ensure that SLURM is configured, up, and running properly.
• Examine the SLURM log files in the /var/slurm/log/ directory on the SLURM master node for any problems. (Example below.)
• If the sinfo command reports that a node is down while its daemons are running, compare the number of processors actually available on the node with the Procs setting in the slurm.conf file. (Example below.)
• Ensure that the lsf partition is configured correctly. (Example below.)
• Verify that system licensing is operational; use the lmstat -a command.
• Ensure that munge is running on all compute nodes. (Example below.)
• If you are experiencing LSF communication problems, check for potential firewall issues. (Example below.)
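To cross-check the licensed node count, you can compare the nodes field in the license file with the size of the lsf partition and the hosts that LSF reports; the license file path below is a placeholder for wherever XC.lic is installed on your system:
# grep INCREMENT /path/to/XC.lic
# sinfo -p lsf -h -o "%D"
# lshosts
The nodes value in the XC-CPUS or XC-PROCESSORS entry should be greater than or equal to the node count that sinfo reports for the lsf partition.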
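To verify the /hptc_cluster mount on every node, you can run df cluster-wide with the cexec command used earlier; a node on which the file system is not mounted shows a different file system (typically the root file system) in the df output:
# cexec -a df /hptc_cluster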
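The following example searches the SLURM logs for errors; the file names shown assume the common slurmctld.log and slurmd.log names, so adjust them to match the SlurmctldLogFile and SlurmdLogFile settings in your slurm.conf file:
# grep -i error /var/slurm/log/slurmctld.log
# grep -i error /var/slurm/log/slurmd.log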
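To compare a down node's processor count with its Procs setting and to review the lsf partition, commands such as the following can be used; n15 is a placeholder for the node that sinfo reports as down:
# scontrol show node n15
# grep -i "procs=" /hptc_cluster/slurm/etc/slurm.conf
# scontrol show partition lsf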
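To confirm that munge is running on all compute nodes, you can search for its daemon cluster-wide; this assumes the daemon process is named munged:
# cexec -a pgrep munged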
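To look for firewall rules that could interfere with LSF communication, list the packet-filtering rules on the nodes involved; the ports LSF uses (for example, LSF_LIM_PORT) are defined in the lsf.conf file:
# iptables -L -n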