HP XC System Software Administration Guide Version 4.0

Verify that the /hptc_cluster directory (file system) is properly mounted on all nodes.
SLURM relies on this file system.
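For example, if the pdsh utility is available on your system for running commands cluster-wide (an assumption about your configuration), a check similar to the following confirms the mount on every node:
# pdsh -a 'mount | grep hptc_cluster'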
Ensure that SLURM is configured, up, and running properly.
Examine the SLURM log files in the /var/slurm/log/ directory on the SLURM master node
for any problems.
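For example, the following commands report partition and node status, verify that the SLURM controller responds, and display recent controller log entries; the log file name shown is typical but may differ on your system:
# sinfo
# scontrol ping
# tail /var/slurm/log/slurmctld.log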
If the sinfo command reports that a node is down even though its daemons are running, compare
the number of processors available on the node with the Procs setting in the slurm.conf file.
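For example, sinfo -R lists the reason each node is down; on the node in question you can count its processors and compare the result with the Procs value in slurm.conf (the slurm.conf path shown here is an assumption and may differ on your system):
# sinfo -R
# grep -c ^processor /proc/cpuinfo
# grep -i procs /hptc_cluster/slurm/etc/slurm.conf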
Ensure that the lsf partition is configured correctly.
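For example, sinfo can display only the lsf partition so that you can confirm it exists and that its nodes are in the expected state:
# sinfo -p lsf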
Verify that the system licensing is operational. Use the lmstat -a command.
Ensure that munge is running on all compute nodes.
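For example, the following test generates a credential and decodes it, which fails if the munged daemon is not running; if pdsh is available, the same test can be run across the compute nodes:
# munge -n | unmunge
# pdsh -a 'munge -n | unmunge'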
If you are experiencing LSF communication problems, check for potential firewall issues.
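For example, on the hosts involved you can check whether a packet filter is active and which rules it enforces; whether iptables is the firewall in use depends on your configuration:
# service iptables status
# iptables -L -n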
When LSF with SLURM failover is disabled and the LSF execution host (which is not the
head node) goes down, issue the controllsf command to restart LSF with SLURM on the
HP XC system:
# controllsf start
When failover is enabled, you need to intervene only when the primary LSF execution host
is not started on HP XC system startup (when the startsys command is run). Use the
controllsf command to restart LSF with SLURM:
# controllsf start
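After either restart, a standard LSF command such as lsid can confirm that the cluster is responding and identify the current LSF master host:
# lsid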
When starting LSF with SLURM after a partial system shutdown, LSF is started on the head
node if:
LSF failover is enabled.
The head node has the resource management role and no other resource management
node is up.
The head node has the resource management role and the enable headnode
preferred subcommand is set (see the example below).
LSF with SLURM was not shut down cleanly, perhaps as a result of running startsys
without running service lsf stop or controllsf stop on the head node.
LSF with SLURM starts on the head node if the other resource management nodes are
unavailable.
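The head node preference mentioned above is set with the enable headnode preferred subcommand of controllsf, for example:
# controllsf enable headnode preferred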
LSF with SLURM failover may select the node that it just released.
LSF with SLURM failover attempts to ensure that a different node is used after it removes
control from the present node. However, if all other options are exhausted, it tries the
current node again before giving up.
If you are trying to perform load balancing, log in to the primary LSF execution host
and issue the controllsf start here command from that node:
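# controllsf start here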
Rebooting a node might result in inconclusive job termination.
If a node that is running a job under LSF with SLURM is rebooted (with the reboot
command), SLURM might recognize the node as unresponsive and attempt to end the job.
However, some remnants of the job could remain, causing LSF to report the job as still
running. This issue has occurred with large jobs that use more than 100 nodes.
If you turn off power to the node instead of rebooting it, however, LSF with SLURM
reports the status as EXIT, and the node is released back to the pool of idle nodes.
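To see whether remnants of a job remain after a reboot, you can compare what SLURM and LSF each report; squeue and bjobs are standard SLURM and LSF commands:
# squeue
# bjobs -u all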
An LSF Queue RUN_WINDOW that is too short can suspend other jobs.
A job that does not complete within the RUN_WINDOW of its queue is suspended, and that
might prevent jobs on other queues from running, even if those jobs were submitted to a
higher-priority queue.
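You can display a queue's RUN_WINDOW, along with the state of jobs in the queue, with the LSF bqueues command; the queue name normal is used here only as an example:
# bqueues -l normal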