LSF Version 7.3 - Using Platform LSF HPC

Enforcing Resource Usage Limits for Parallel Tasks
A typical Platform LSF parallel job launches its tasks across multiple hosts. By default
you can enforce limits on the total resources used by all the tasks in the job. Because
PAM only reports the sum of parallel task resource usage, LSF does not enforce
resource usage limits on individual tasks in a parallel job.
For example, resource usage limits cannot cap the memory allocated by a single task of
a parallel job, so one task can allocate excessive memory and bring down the entire system.
For some jobs, the total resource usage may exceed a configured resource usage limit
even if no single task does, and the job is terminated when it does not need to be.
Attempting to limit individual tasks by setting a system-level swap hard limit
(RLIMIT_AS) in the system limit configuration file (/etc/security/limits.conf)
is not satisfactory, because it only prevents tasks running on that host from
allocating more memory than they should; other tasks in the job can continue to
run, with unpredictable results.
By default, custom job controls (JOB_CONTROL in lsb.queues) apply only to the
entire job, not individual parallel tasks.
Enabling resource usage limit enforcement for parallel tasks
Use the LSF_HPC_EXTENSIONS options TASK_SWAPLIMIT and TASK_MEMLIMIT in lsf.conf
to enable resource usage limit enforcement and job control for parallel tasks.
When TASK_SWAPLIMIT or TASK_MEMLIMIT is set in LSF_HPC_EXTENSIONS, LSF terminates
the entire parallel job if any single task exceeds the corresponding swap or
memory limit.
Other resource usage limits (CPU limit, process limit, run limit, and so on) continue to
be enforced for the entire job, not for individual tasks.
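As a minimal sketch of the setting described above, the following lsf.conf fragment enables both task-level options. It assumes the usual lsf.conf form of a quoted, space-separated list for LSF_HPC_EXTENSIONS; any extensions already configured at your site would need to stay in the same list, and the change only takes effect after the cluster is reconfigured.

```shell
# lsf.conf fragment (sketch): enforce swap and memory limits per parallel task.
# Assumption: no other LSF_HPC_EXTENSIONS options are in use; if they are,
# keep them in this same quoted list.
LSF_HPC_EXTENSIONS="TASK_SWAPLIMIT TASK_MEMLIMIT"
```

To enforce only one of the two limits per task, list just that option.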
For detailed information about resource usage limits in LSF, see the “Runtime Resource
Usage Limits” chapter in Administering Platform LSF.
Assumptions and behavior
To enforce resource usage limits by parallel task, you must use the LSF generic PJL
framework (PAM/TS) to launch your parallel jobs.
This feature only affects parallel jobs monitored by PAM. It has no effect on other
LSF jobs.
LSF_HPC_EXTENSIONS=TASK_SWAPLIMIT overrides the default behavior of swap limits
(bsub -v, bmod -v, or SWAPLIMIT in lsb.queues).
LSF_HPC_EXTENSIONS=TASK_MEMLIMIT overrides the default behavior of memory limits
(bsub -M, bmod -M, or MEMLIMIT in lsb.queues).
LSF_HPC_EXTENSIONS=TASK_MEMLIMIT overrides LSB_MEMLIMIT_ENFORCE=Y or
LSB_JOB_MEMLIMIT=Y in lsf.conf.
When a parallel job is terminated because of task limit enforcement, LSF sets a value
in the LSB_JOBEXIT_INFO environment variable for any post-execution
programs:
LSB_JOBEXIT_INFO=SIGNAL -29 SIG_TERM_SWAPLIMIT
LSB_JOBEXIT_INFO=SIGNAL -25 SIG_TERM_MEMLIMIT
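A post-execution program can inspect this variable to tell task limit terminations apart from other exits. The script below is only a sketch: the function name classify_jobexit_info and the message strings are hypothetical, and it assumes LSB_JOBEXIT_INFO holds one of the two values shown above when task limit enforcement ended the job. LSB_JOBID is used only for labeling the output.

```shell
#!/bin/sh
# Hypothetical post-execution helper (illustrative names and messages).

# Map an LSB_JOBEXIT_INFO value to a short, human-readable reason.
classify_jobexit_info() {
    case "$1" in
        *SIG_TERM_SWAPLIMIT*) echo "task swap limit exceeded" ;;
        *SIG_TERM_MEMLIMIT*)  echo "task memory limit exceeded" ;;
        "")                   echo "no exit information" ;;
        *)                    echo "other: $1" ;;
    esac
}

# Report the reason for the current job; fall back to an empty value
# if LSB_JOBEXIT_INFO is not set in the environment.
reason=$(classify_jobexit_info "${LSB_JOBEXIT_INFO:-}")
echo "job ${LSB_JOBID:-unknown}: $reason"
```

Matching on the signal name rather than the numeric code keeps the check readable; a site could extend the case statement to send a notification or write to a log.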