Platform LSF Administration Guide Version 6.2

Chapter 5
Working with Queues
Handling Job Exceptions
You can configure queues so that LSF detects exceptional conditions while jobs are
running and takes appropriate action automatically. You can customize which exceptions
are detected and the corresponding actions. By default, LSF does not detect any
exceptions.
eadmin script
When an exception is detected, LSF takes appropriate action by running the script
LSF_SERVERDIR/eadmin on the master host. You can customize eadmin to suit the
requirements of your site. For example, in some environments a job that runs for 1 hour
is an overrun job, while in other environments this is a normal job. If your
configuration treats jobs running longer than 1 hour as overrun jobs, you may
want eadmin to close the queue when LSF detects such a job and invokes eadmin.
Alternatively, eadmin could find the owner of the problem jobs and use bstop -u
to stop all jobs that belong to that user.
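The two site policies described above can be sketched as a small decision function. This is a minimal illustration only: the guide does not describe here how eadmin receives exception details, so the input records (with keys 'type', 'queue', and 'owner') and the policy names are hypothetical. The sketch only builds command lines and does not run them. It assumes the standard LSF commands badmin qclose and bstop -u, with job ID 0 meaning all matching jobs.

```python
def plan_actions(exceptions, policy="close_queue"):
    """Return LSF commands (as argument lists) for overrun exceptions.

    exceptions: list of dicts with hypothetical keys 'type', 'queue', 'owner'.
    policy: 'close_queue' or 'stop_user_jobs' (names chosen for this sketch).
    """
    commands = []
    for exc in exceptions:
        if exc["type"] != "overrun":
            continue  # this sketch only reacts to overrun jobs
        if policy == "close_queue":
            cmd = ["badmin", "qclose", exc["queue"]]
        elif policy == "stop_user_jobs":
            # Job ID 0 is assumed to select all jobs matching -u.
            cmd = ["bstop", "-u", exc["owner"], "0"]
        else:
            raise ValueError(f"unknown policy: {policy}")
        if cmd not in commands:  # avoid issuing the same command twice
            commands.append(cmd)
    return commands


# Hypothetical sample data: two overrun jobs from the same user and queue.
sample = [
    {"type": "overrun", "queue": "normal", "owner": "user1"},
    {"type": "idle", "queue": "normal", "owner": "user2"},
    {"type": "overrun", "queue": "normal", "owner": "user1"},
]
for command in plan_actions(sample, policy="stop_user_jobs"):
    print(" ".join(command))
```

A site-specific eadmin would typically be a shell script; the same decision logic applies regardless of the language used.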
Job exceptions LSF can detect
If you configure exception handling, LSF detects the following job exceptions:
Job underrun: jobs end too soon (run time is less than expected). Underrun jobs
are detected when a job exits abnormally.
Job overrun: job runs too long (run time is longer than expected).
By default, LSF checks for overrun jobs every 5 minutes. Use
EADMIN_TRIGGER_DURATION in lsb.params to change how frequently
LSF checks for job overrun.
Idle job: running job consumes less CPU time than expected (in terms of CPU
time/runtime).
By default, LSF checks for idle jobs every 5 minutes. Use
EADMIN_TRIGGER_DURATION in lsb.params to change how frequently
LSF checks for idle jobs.
Default eadmin actions
LSF sends email to the LSF administrator. The email contains the job ID, exception type
(overrun, underrun, idle job), and other job information.
An email is sent for all detected job exceptions according to the frequency configured
by EADMIN_TRIGGER_DURATION in lsb.params. For example, if
EADMIN_TRIGGER_DURATION is set to 10 minutes, and 1 overrun job and 2 idle
jobs are detected, then after 10 minutes eadmin is invoked and only one email is sent
for all three exceptions. If another overrun job is detected in the next 10 minutes,
another email is sent.
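The batching behavior in this example can be modeled with a short sketch. It is a simplified model, not LSF code: it assumes detections are stamped in whole minutes and that each trigger window ends at a multiple of EADMIN_TRIGGER_DURATION. The function name, job IDs, and detection times are hypothetical.

```python
from collections import defaultdict


def batch_exceptions(detections, trigger_minutes):
    """Group detections into trigger windows; one email per non-empty window.

    detections: list of (minute_detected, job_id, exception_type) tuples.
    Returns a dict mapping the minute at which eadmin would be invoked
    to the list of exceptions reported in that single email.
    """
    batches = defaultdict(list)
    for minute, job_id, exc_type in detections:
        # Ceiling division: a detection at minute m is reported at the
        # end of the window that contains it.
        window_index = max(1, -(-minute // trigger_minutes))
        batches[window_index * trigger_minutes].append((job_id, exc_type))
    return dict(sorted(batches.items()))


# The scenario from the text, with made-up job IDs and detection times:
# 1 overrun job and 2 idle jobs in the first 10 minutes, 1 overrun after.
detections = [
    (2, 101, "overrun"),
    (5, 102, "idle"),
    (7, 103, "idle"),
    (14, 104, "overrun"),
]
for when, items in batch_exceptions(detections, 10).items():
    print(f"minute {when}: 1 email covering {len(items)} exception(s)")
```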
Configuring job exception handling (lsb.queues)
You can configure your queues to detect job exceptions. Use the following parameters:
JOB_IDLE
Specifies a threshold for idle jobs. The value should be a number between 0.0 and 1.0
representing CPU time/runtime. If the job idle factor (CPU time/runtime) is less than
the specified threshold, LSF invokes eadmin to trigger the action for a job idle
exception.
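To illustrate the threshold comparison, the following minimal sketch computes an idle factor as CPU time divided by run time and checks it against a JOB_IDLE value. The function names and sample numbers are hypothetical and are not part of LSF, which performs this check internally.

```python
def idle_factor(cpu_time_sec: float, run_time_sec: float) -> float:
    """Return CPU time / run time (hypothetical helper, not an LSF API)."""
    if run_time_sec <= 0:
        raise ValueError("run time must be positive")
    return cpu_time_sec / run_time_sec


def is_idle(cpu_time_sec: float, run_time_sec: float, job_idle: float) -> bool:
    """True when the idle factor is below the JOB_IDLE threshold."""
    if not 0.0 <= job_idle <= 1.0:
        raise ValueError("JOB_IDLE should be between 0.0 and 1.0")
    return idle_factor(cpu_time_sec, run_time_sec) < job_idle


# A job that used 30 s of CPU over 600 s of run time has an idle factor
# of 0.05, which is below an assumed threshold of 0.10.
print(is_idle(30, 600, 0.10))
# A job that used 300 s of CPU over the same run time has a factor of 0.5.
print(is_idle(300, 600, 0.10))
```

With the assumed threshold of 0.10, the first job would raise a job idle exception and the second would not.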