LSF Version 7.3 - Administering Platform LSF

Handling Job Exceptions in Queues
You can configure queues so that LSF detects exceptional conditions while jobs are
running and takes appropriate action automatically. You can customize which
exceptions are detected and the corresponding actions. By default, LSF does not
detect any exceptions.
Job exceptions LSF can detect
If you configure job exception handling in your queues, LSF detects the following
job exceptions:
Job underrun: a job ends too soon (run time is less than expected). Underrun
jobs are detected when a job exits abnormally.
Job overrun: a job runs too long (run time is longer than expected). By default,
LSF checks for overrun jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in
lsb.params to change how frequently LSF checks for job overrun.
Idle job: a running job consumes less CPU time than expected (in terms of
CPU time/runtime). By default, LSF checks for idle jobs every 1 minute. Use
EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for
idle jobs.
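The idle-job measure above is a ratio of CPU time to run time. The following is a minimal sketch of that arithmetic only; the function name idle_factor and the use of seconds as the unit are assumptions made for illustration and are not part of LSF.

```python
def idle_factor(cpu_time_s: float, run_time_s: float) -> float:
    """Return CPU time divided by run time (hypothetical helper, not an LSF API)."""
    if run_time_s <= 0:
        raise ValueError("run time must be positive")
    return cpu_time_s / run_time_s


# A job that has been running for 10 minutes but has used only 30 seconds of CPU.
factor = idle_factor(30.0, 600.0)
print(f"{factor:.2f}")  # prints 0.05
```

A job that keeps one CPU fully busy would have a factor near 1.0; a job that mostly waits would have a factor near 0.0.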
Configuring job exception handling (lsb.queues)
You can configure your queues to detect job exceptions. Use the following
parameters:
JOB_IDLE: Specify a threshold for idle jobs. The value should be a number between
0.0 and 1.0 representing CPU time/runtime. If the job idle factor is less than the
specified threshold, LSF invokes eadmin to trigger the action for a job idle
exception.
JOB_OVERRUN: Specify a threshold for job overrun, in minutes. If a job runs longer
than the specified run time, LSF invokes eadmin to trigger the action for a job
overrun exception.
JOB_UNDERRUN: Specify a threshold for job underrun, in minutes. If a job exits
before the specified number of minutes, LSF invokes eadmin to trigger the action
for a job underrun exception.
Example: The following queue defines thresholds for all types of job exceptions:
Begin Queue
...
JOB_UNDERRUN = 2
JOB_OVERRUN = 5
JOB_IDLE = 0.10
...
End Queue
For this queue:
A job underrun exception is triggered for jobs running less than 2 minutes.
A job overrun exception is triggered for jobs running longer than 5 minutes.
A job idle exception is triggered for jobs with an idle factor
(CPU time/runtime) less than 0.10.
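To make the three thresholds concrete, the sketch below applies the values from this example queue to a few sample jobs. It is an illustration of the rules as described in this section, not of how LSF implements them: the Job class, its field names, and the exceptions_for function are hypothetical, and the sketch assumes that underrun is checked only for jobs that exited abnormally while overrun and idle are checked only for running jobs.

```python
from dataclasses import dataclass

# Thresholds copied from the example queue definition.
JOB_UNDERRUN_MIN = 2    # minutes
JOB_OVERRUN_MIN = 5     # minutes
JOB_IDLE = 0.10         # CPU time / run time


@dataclass
class Job:
    """Hypothetical job record used only for this illustration."""
    run_time_min: float
    cpu_time_min: float
    running: bool
    exited_abnormally: bool = False


def exceptions_for(job: Job) -> list:
    """Return the names of the exceptions this job would trigger."""
    found = []
    # Underrun: the job exited abnormally before the underrun threshold.
    if (not job.running and job.exited_abnormally
            and job.run_time_min < JOB_UNDERRUN_MIN):
        found.append("underrun")
    # Overrun: the job is still running past the overrun threshold.
    if job.running and job.run_time_min > JOB_OVERRUN_MIN:
        found.append("overrun")
    # Idle: the running job's idle factor is below the idle threshold.
    if (job.running and job.run_time_min > 0
            and job.cpu_time_min / job.run_time_min < JOB_IDLE):
        found.append("idle")
    return found


# Running for 10 minutes with 0.5 minutes of CPU: idle factor 0.05.
print(exceptions_for(Job(10.0, 0.5, running=True)))  # prints ['overrun', 'idle']
```

A single job can match more than one condition at once, as the last line shows: it has exceeded the 5-minute overrun threshold and its idle factor is below 0.10.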