LSF-HPC Troubleshooting
Take the following steps if you have trouble submitting jobs or controlling LSF-HPC:
• Ensure that the number of nodes in the lsf partition is less than or equal to the number of nodes
reported in the XC.lic file. Sample entries follow:
INCREMENT XC-CPUS Compaq auth.number exp. date nodes ...
INCREMENT XC-PROCESSORS Compaq auth.number exp. date nodes ...
The value for nodes in the XC-CPUS or XC-PROCESSORS entry specifies the number of licensed nodes
for this system. If this value does not match the actual number of nodes, the LSF service may fail to start
LSF-HPC.
Use the lshosts command to determine the number of processors reported by LSF.
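For example, assuming the LSF partition is named lsf (the default), the following commands list the nodes
in that partition and the hosts known to LSF; compare these counts with the nodes value in the XC.lic entries:
# sinfo -p lsf
# lshosts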
• Ensure that the date is synchronized throughout the HP XC system.
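For example, if the pdsh utility is available on your system, the following command reports the current date
and time on every node so that you can spot any node whose clock has drifted:
# pdsh -a date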
• Verify that the /hptc_cluster directory (file system) is properly mounted on all nodes. SLURM relies
on this file system.
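For example, assuming pdsh is available, the following command reports whether /hptc_cluster appears
in the mount table of each node:
# pdsh -a "grep hptc_cluster /proc/mounts"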
• Ensure that SLURM is configured, up, and running properly.
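For example, the following commands summarize SLURM partition and node status and verify that the SLURM
controller is responding:
# sinfo
# scontrol ping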
• Examine the SLURM log files in the /var/slurm/log/ directory on the SLURM master node for any
problems.
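For example, the following commands list the most recently modified log files and display the end of the
controller log (the exact log file names may differ on your system):
# ls -lt /var/slurm/log/
# tail /var/slurm/log/slurmctld.log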
• If the sinfo command reports that a node is down even though its daemons are running, compare the
number of processors available on the node with the Procs setting in the slurm.conf file.
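For example, for a hypothetical node named n15, the following command shows the processor count that
SLURM recorded for the node, which you can compare with the Procs value in slurm.conf:
# scontrol show node n15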
• Ensure that the lsf partition is configured correctly.
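For example, assuming the partition is named lsf, the following command displays its configuration as
SLURM sees it:
# scontrol show partition lsf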
• Verify that the system licensing is operational. Use the lmstat -a command.
• Ensure that munge is running on all compute nodes.
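For example, assuming pdsh is available and munge is installed as a standard service, the following command
checks its status on every node:
# pdsh -a "service munge status"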
• If you are experiencing LSF communication problems, check for potential firewall issues.
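For example, the following command lists the active iptables rules on a node so that you can verify that the
ports LSF uses are not blocked (the relevant ports depend on your LSF configuration):
# iptables -L -n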
• When LSF-HPC failover is disabled and the LSF execution host (which is not the head node) goes down,
issue the controllsf command to restart LSF-HPC on the HP XC system:
# controllsf start
• When failover is enabled, you need to intervene only when the primary LSF execution host does not start
at HP XC system startup (that is, when the startsys command is run). Use the controllsf command to
restart LSF-HPC:
# controllsf start
• When starting LSF-HPC after a partial system shutdown, LSF is started on the head node if:
• LSF failover is enabled.
• The head node has the "resource management" role and no other resource management node is
up.
• The head node has the "resource management" role and the enable headnode preferred
subcommand is set.
• LSF-HPC was not shut down cleanly, perhaps as a result of running startsys without running
service lsf stop or controllsf stop on the head node.
• LSF-HPC starts on the head node if the other resource management nodes are unavailable.
• LSF-HPC failover may select the node that it just released.
LSF-HPC failover attempts to ensure that a different node is used after it removes control from the present
node. However, if all other options are exhausted, LSF-HPC failover tries the current node again before
giving up.
If you are trying to perform load balancing, log in to the primary LSF execution host and execute
the controllsf start here command from that node.