HP XC System Software Administration Guide Version 2.1
Checking node and partition status
Use the following command to check the status of your nodes and partitions:
# sinfo --all
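On a large system the sinfo listing can be long. The following is a minimal sketch of a filter that keeps only the rows reporting down or draining nodes. The function name report_unavailable is hypothetical and not part of HP XC. The sketch assumes the default sinfo column layout (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST); adjust the field numbers if your output differs.

```shell
# Hypothetical helper, not an HP XC command.
# Reads sinfo output on standard input and prints "partition state nodelist"
# for every row whose STATE column begins with "down" or "drain".
# Assumes the default column order:
#   PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
report_unavailable() {
    # NR > 1 skips the header line; $5 is STATE, $6 is NODELIST.
    awk 'NR > 1 && $5 ~ /^(down|drain)/ { print $1, $5, $6 }'
}

# Usage (run as root on a node where sinfo is available):
#   sinfo --all | report_unavailable
```

No output means that no row matched, which under the stated assumption indicates that no partition is reporting down or draining nodes.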
16.3 LSF Troubleshooting
Take the following steps if you are having trouble submitting jobs or controlling LSF-HPC:
• Ensure that the date is synchronized throughout the HP XC system.
• Check that the /hptc_cluster directory (file system) is properly mounted on all nodes.
SLURM relies on this file system.
• Ensure that SLURM is configured, up, and running properly.
• Check the SLURM log files in the /var/slurm/log/ directory on the SLURM master
node for any problems.
• If the sinfo command reports that the node is down and daemons are running, check
the number of available processors against the Procs setting in the slurm.conf file.
• Ensure that the lsf partition is configured correctly.
• Check that the system licensing is operational. Use the lmstat -a command.
• Ensure that munge is running on all compute nodes.
• If you are experiencing LSF communication problems, check for potential firewall issues.
• When LSF-HPC failover is disabled and the LSF execution host (which is not the head node)
goes down, issue the controllsf command to restart LSF-HPC on the HP XC system:
# controllsf start
• When failover is enabled, you need to intervene only when the primary LSF execution host
is not started on HP XC system startup (when the startsys command is run). Use the
controllsf command to restart LSF-HPC.
# controllsf start
• When the stopsys command is executed to shut down the HP XC system, LSF-HPC is
restarted on the head node if the head node has the lsf service and if failover is enabled.
This allows jobs to be queued.
To ensure that LSF-HPC is restarted properly when the system is started again, perform
any of the following:
- Execute the controllsf command.
# controllsf stop
- Execute the service command on the head node.
# service lsf stop
- Reboot the head node.
• When starting LSF-HPC after a partial system shutdown, LSF is started on the head node if:
- LSF failover is enabled.
- The head node has the "resource management" role.
- LSF-HPC was not shut down cleanly, perhaps as a result of running startsys
without running service lsf stop or controllsf stop on the head node.
- LSF-HPC starts on the head node if the other resource management nodes are
unavailable.
• LSF-HPC failover may select the node that it just released.
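Several of the checks in the list above can be scripted. The following is a minimal sketch that covers only checks runnable on the local node: the /hptc_cluster mount, SLURM responsiveness, licensing, and the munge daemon. The function names (is_mounted, run_check, preflight) are hypothetical helpers and not HP XC commands. The sketch also assumes that the munge daemon process is named munged and that pgrep is installed; verify both on your system before relying on the result.

```shell
#!/bin/sh
# Hypothetical local pre-flight helpers; none of these names ship with HP XC.

# Succeeds if the directory in $1 appears as a mount point in mount-style
# lines read from standard input ("<device> on <dir> type <fs> (...)").
is_mounted() {
    awk -v dir="$1" '$2 == "on" && $3 == dir { found = 1 } END { exit !found }'
}

# Runs the command given after the label, discards its output, and prints
# "OK   <label>" or "FAIL <label>". Returns nonzero when the command fails.
run_check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK   $label"
    else
        echo "FAIL $label"
        return 1
    fi
}

# Wrappers so that pipelines can be passed to run_check as a single command.
hptc_cluster_mounted() { mount | is_mounted /hptc_cluster; }
munged_running()       { pgrep -x munged; }   # assumption: daemon is "munged"

# Runs the local checks in order and returns nonzero if any of them failed.
preflight() {
    rc=0
    run_check "/hptc_cluster is mounted"  hptc_cluster_mounted || rc=1
    run_check "SLURM responds to sinfo"   sinfo --all          || rc=1
    run_check "licensing (lmstat -a)"     lmstat -a            || rc=1
    run_check "munge daemon is running"   munged_running       || rc=1
    return $rc
}

# Usage: source this file as root on the node to be checked, then run:
#   preflight
```

A FAIL line identifies which item in the list to investigate first. The sketch does not check date synchronization, firewall settings, or other nodes, so those items still need to be verified separately.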