LSF Version 7.3 - Administering Platform LSF

Handling Job Exceptions
You can configure hosts and queues so that LSF detects exceptional conditions
while jobs are running and takes appropriate action automatically. You can
customize which exceptions are detected and the corresponding actions. By default,
LSF does not detect any exceptions.
Run bjobs -d -m host_name to see exited jobs for a particular host.
Job exceptions LSF can detect
If you configure job exception handling in your queues, LSF detects the following
job exceptions:
Job underrun: jobs end too soon (run time is less than expected). Underrun
jobs are detected when a job exits abnormally.
Job overrun: job runs too long (run time is longer than expected). By default,
LSF checks for overrun jobs every 1 minute. Use EADMIN_TRIGGER_DURATION in
lsb.params to change how frequently LSF checks for job overrun.
Idle job: running job consumes less CPU time than expected (in terms of
CPU time/runtime). By default, LSF checks for idle jobs every 1 minute. Use
EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks
for idle jobs.
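
The following configuration fragment is a minimal sketch of how these three exceptions might be enabled for one queue. The queue-level parameter names JOB_UNDERRUN, JOB_OVERRUN, and JOB_IDLE are not named in this section; they are given here as recalled from the lsb.queues reference and should be checked against the documentation for your LSF version. The queue name and all threshold values are illustrative assumptions.

```
# lsb.queues (queue name and thresholds are illustrative assumptions)
Begin Queue
QUEUE_NAME   = normal
# Underrun: job ran for less than 2 minutes
JOB_UNDERRUN = 2
# Overrun: job has been running for more than 60 minutes
JOB_OVERRUN  = 60
# Idle: CPU time/runtime ratio is below 0.10
JOB_IDLE     = 0.10
End Queue

# lsb.params
Begin Parameters
# Check for overrun and idle jobs every 5 minutes instead of every 1 minute
EADMIN_TRIGGER_DURATION = 5
End Parameters
```

After editing these files, the change would normally be applied with badmin reconfig.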
Host exceptions LSF can detect
If you configure host exception handling, LSF can detect jobs that exit repeatedly
on a host. The host can still be available to accept jobs, but some other problem
prevents the jobs from running. Typically, jobs dispatched to such “black hole” or
“job-eating” hosts exit abnormally. By default, LSF monitors the job exit rate for
hosts, and closes the host if the rate exceeds a threshold you configure (EXIT_RATE
in lsb.hosts).
If EXIT_RATE is not specified for the host, LSF invokes eadmin if the job exit rate
for a host remains above the configured threshold for longer than 5 minutes. Use
JOB_EXIT_RATE_DURATION in lsb.params to change how frequently LSF checks the
job exit rate.
Use GLOBAL_EXIT_RATE in lsb.params to set a cluster-wide threshold in
minutes for exited jobs. If EXIT_RATE is not specified for the host in lsb.hosts,
GLOBAL_EXIT_RATE defines a default exit rate for all hosts in the cluster.
Host-level EXIT_RATE overrides the GLOBAL_EXIT_RATE value.
Jobs killed with bkill are counted in the job exit rate.
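
As an illustration of how the host-level and cluster-wide thresholds described above interact, the fragment below sketches one possible configuration. The host name hostA and every numeric value are assumptions chosen for the example, not recommended settings. Here hostA gets its own threshold, while all other hosts fall back to GLOBAL_EXIT_RATE.

```
# lsb.hosts (hostA and the threshold values are illustrative assumptions)
Begin Host
HOST_NAME   MXJ   EXIT_RATE   # Keywords
hostA       !     20
default     !     ()
End Host

# lsb.params
Begin Parameters
# Default exit rate threshold for hosts with no EXIT_RATE in lsb.hosts
GLOBAL_EXIT_RATE       = 10
# Evaluate the job exit rate over 10 minutes instead of the default 5
JOB_EXIT_RATE_DURATION = 10
End Parameters
```

Because host-level EXIT_RATE overrides GLOBAL_EXIT_RATE, hostA in this sketch is judged against 20 and every other host against 10.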
Customize job exception actions with the eadmin script
When an exception is detected, LSF takes appropriate action by running the script
LSF_SERVERDIR/eadmin on the master host.
You can customize eadmin to suit the requirements of your site. For example,
eadmin could find out the owner of the problem jobs and use bstop -u to stop all
jobs that belong to the user.
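
The customization described above might be sketched as a small shell helper that a site-specific eadmin could call. This is an illustration under stated assumptions, not the script shipped with LSF. The function name stop_jobs_of_owners and the BSTOP override variable are hypothetical, and so is the input format (one "job_id owner" pair per line on standard input); how eadmin actually receives details about problem jobs depends on your LSF version and is not covered here. The sketch also assumes that job ID 0 tells bstop to act on all jobs of the named user.

```shell
#!/bin/sh
# Hypothetical helper for a customized eadmin script.
# Input (stdin): one "job_id owner" pair per line. This format is an
# assumption for the sketch, not what LSF passes to eadmin.

# BSTOP can be overridden (for example with "echo bstop") to dry-run
# the logic on a machine without LSF.
BSTOP="${BSTOP:-bstop}"

stop_jobs_of_owners() {
    # Take the owner column, reduce it to unique names, and stop
    # each owner's jobs once.
    awk 'NF >= 2 { print $2 }' | sort -u | while read -r owner; do
        # Job ID 0 is assumed to select all jobs of the user.
        $BSTOP -u "$owner" 0
    done
}
```

For a dry run, set BSTOP="echo bstop" and pipe in a few sample lines, such as printf '101 alice\n102 bob\n' | stop_jobs_of_owners, to print the commands that would be issued without running them.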