HP XC System Software Release Notes for Version 3.0

9.1.4 Short LSF Queue RUN_WINDOW Can Suspend Other Jobs
A job that does not complete within the RUN_WINDOW of its queue is suspended and may prevent
other jobs on other queues from running, even if those other jobs were submitted to a higher
priority queue.
At the next instance of the queue's RUN_WINDOW, the job resumes execution and the other jobs
can be scheduled.
Consider this example:
1. Job #75 is scheduled on a queue named night.
2. The RUN_WINDOW opens for the night queue.
3. Job #75 runs on the night queue.
4. The RUN_WINDOW for the night queue closes, but Job #75 has not completed. Job #75 is
suspended.
5. Job #76 is scheduled on a higher priority queue named main but is suspended.
6. The RUN_WINDOW for queue night opens again according to the queue definition.
7. Job #75 resumes on the night queue.
8. Job #76 runs on the main queue.
A workaround is to ensure that jobs end when the RUN_WINDOW for the queue closes. To do so,
set the LSF RUNLIMIT or TERMINATE_WHEN parameter for the queue in the lsb.queues file. For
more information, see the standard LSF documentation from Platform Computing.
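The following lsb.queues fragment is a minimal sketch of this workaround for the night queue from the example. The priority, window times, and run limit are illustrative assumptions, not values from an HP XC installation; check the parameter syntax against the LSF documentation for your release before using it.

```text
# Illustrative queue definition; all values are assumptions for this sketch.
Begin Queue
QUEUE_NAME     = night
PRIORITY       = 30
# Jobs in this queue may run only between 20:00 and 08:00.
RUN_WINDOW     = 20:00-8:00
# Terminate, rather than suspend, jobs that are still running when the window closes.
TERMINATE_WHEN = WINDOW
# Alternative: cap the run time instead (value in minutes).
# RUNLIMIT     = 720
End Queue
```

With TERMINATE_WHEN = WINDOW, a job such as Job #75 is ended when the window closes instead of remaining suspended, so it no longer holds resources that jobs in other queues are waiting for. A RUNLIMIT achieves a similar effect only if the limit is no longer than the window.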
9.2 SLURM and Job Management
The notes in this section apply to the Simple Linux Utility for Resource Management (SLURM).
SLURM provides commands for launching, monitoring, and controlling jobs.
Refer to the HP XC System Software User's Guide for more information about using SLURM.
9.2.1 Error in slurm.epilog.clean Script
SLURM provides a slurm.epilog.clean script in the /opt/hptc/slurm/etc/ directory.
This script is not used by default in normal operation. It is provided for sites that want to
configure SLURM on the XC system to ensure that all processes relating to a user's job on the
compute nodes are terminated after the job has completed.
This script contains an error: the SLURM_BIN variable is not set. If the script is enabled,
this error causes it to terminate all user processes on the node, even if the user has a
separate job running on the same node.
Follow this procedure to correct the problem:
1. On the head node, use the text editor of your choice to edit the following file:
/opt/hptc/slurm/etc/slurm.epilog.clean
2. Add the following line to the file:
SLURM_BIN=/opt/hptc/slurm/bin
3. Save your changes and exit the file.
4. Use procedures in the HP XC System Software Administration Guide to update the golden
image and propagate the new image to all nodes.
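Steps 1 through 3 can also be scripted. The following is a minimal sketch, not an HP-supplied tool: the function name ensure_slurm_bin is hypothetical, and placing the assignment directly after the first line assumes that the script begins with an interpreter (#!) line and does not reference SLURM_BIN before that point.

```shell
#!/bin/sh
# Hypothetical helper (not part of HP XC): add the missing SLURM_BIN
# assignment to an epilog script unless one is already present.
ensure_slurm_bin() {
    esb_script="$1"
    esb_line="SLURM_BIN=/opt/hptc/slurm/bin"

    # Nothing to do if the variable is already assigned in the file.
    if grep -q '^[[:space:]]*SLURM_BIN=' "$esb_script"; then
        return 0
    fi

    # Keep a backup copy; the original file is rewritten in place below,
    # so its ownership and permissions are unchanged.
    cp -p "$esb_script" "$esb_script.orig" || return 1

    # Assumption: line 1 is the interpreter line, so the assignment is
    # inserted as line 2. An empty file receives just the assignment.
    awk -v line="$esb_line" '
        NR == 1 { print; print line; next }
        { print }
        END { if (NR == 0) print line }
    ' "$esb_script.orig" > "$esb_script"
}
```

On the head node, the function would be called as ensure_slurm_bin /opt/hptc/slurm/etc/slurm.epilog.clean. Step 4 still applies afterward: the golden image must be updated and propagated to all nodes.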
9.2.2 How to Remove SLURM
The HP XC system installation process offers a choice of two types of LSF. The default
choice, LSF-HPC with SLURM, requires that SLURM also be installed and configured. The other
choice is standard LSF, which neither requires nor interacts with SLURM. If standard LSF is
selected, SLURM should not be configured.