HP XC System Software Release Notes for Version 3.1
To verify the change, submit an interactive job similar to the following:
[lsfadmin@n16 ~]$ hostname
n16
[lsfadmin@n16 ~]$ bsub -Is -n8 /bin/bash -i
Job <261> is submitted to the default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on lsf.localdomain>>
[lsfadmin@n4 ~]$ hostname
n4
[lsfadmin@n4 ~]$ srun hostname
n4
n4
n4
n4
n5
n5
n5
n5
[lsfadmin@n4 ~]$ exit
exit
[lsfadmin@n16 ~]$ hostname
n16
[lsfadmin@n16 ~]$
8.1.2 Node Reboot Might Result in Inconclusive Job Termination
If a node that is running a job under LSF-HPC with SLURM is rebooted with the reboot
command, SLURM might recognize the node as unresponsive and attempt to terminate the job.
However, some remnants of the job might remain, in which case LSF continues to report the
job as still running.
This issue has been seen with large jobs using in excess of 100 nodes.
If the node is powered off rather than rebooted, however, LSF-HPC with SLURM reports the
job status as EXIT, and the node is released back to the pool of idle nodes.
8.1.3 Short LSF Queue RUN_WINDOW Can Suspend Other Jobs
A job that does not complete within the RUN_WINDOW of its queue is suspended and might
prevent jobs in other queues from running, even if those jobs were submitted to a
higher-priority queue.
At the next instance of the queue's RUN_WINDOW, the job resumes execution and the other jobs
can be scheduled.
Consider this example:
1. Job #75 is scheduled on a queue named night.
2. The RUN_WINDOW opens for the night queue.
3. Job #75 runs on the night queue.
4. The RUN_WINDOW for the night queue ends, but Job #75 has not completed. Job #75 is
suspended.
5. Job #76 is scheduled on a higher-priority queue named main but is suspended.
6. The RUN_WINDOW for queue night opens again according to the queue definition.
7. Job #75 resumes on the night queue.
8. Job #76 runs on the main queue.
A workaround is to ensure that jobs end when the RUN_WINDOW for the queue ends. To do so,
use the LSF RUNLIMIT or TERMINATE_WHEN parameter in the lsb.queues file. For more
information, see the standard LSF documentation from Platform Computing.
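The following lsb.queues fragment is a minimal sketch of that workaround, not a
recommended configuration. The queue name night comes from the example above; the
PRIORITY value, the window times, and the commented-out RUNLIMIT value are assumptions
chosen only for illustration. Check the parameter syntax against the LSF documentation
for your release before using it.

```
# Hypothetical definition of the "night" queue from the example above.
Begin Queue
QUEUE_NAME     = night
PRIORITY       = 30              # assumed value; lower than the main queue
RUN_WINDOW     = 20:00-6:00      # assumed window; jobs run only in this period
TERMINATE_WHEN = WINDOW          # terminate, rather than suspend, when the window closes
# Alternative: cap the run time so that jobs end before the window closes.
# RUNLIMIT     = 600             # assumed value, in minutes
End Queue
```

With TERMINATE_WHEN set to WINDOW, a job such as Job #75 is terminated instead of
suspended when the window closes, so it no longer holds resources that a job in another
queue is waiting for. Changes to lsb.queues typically take effect only after the batch
configuration is reloaded, for example with badmin reconfig.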