Platform LSF Administration Guide Version 6.2

Handling Host-level Job Exceptions
Administering Platform LSF
You can configure hosts so that LSF detects exceptional conditions while jobs are
running and takes appropriate action automatically. You can customize which exceptions
are detected and the corresponding actions. By default, LSF does not detect any
exceptions.
eadmin script
When an exception is detected, LSF takes appropriate action by running the script
LSF_SERVERDIR/eadmin on the master host. You can customize eadmin to suit the
requirements of your site. For example, eadmin could find out the owner of the
problem jobs and use bstop -u to stop all jobs that belong to the user.
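The customization described above could be sketched as follows. This is a minimal illustration, not the eadmin script shipped with LSF: the function name build_stop_commands and the (job_id, owner) input format are hypothetical, and how a real eadmin obtains the list of problem jobs is not covered in this section. The sketch only builds the bstop command lines and does not run them.

```python
def build_stop_commands(problem_jobs):
    """Return one bstop command per distinct owner of the problem jobs.

    problem_jobs: iterable of (job_id, owner) pairs. This input format is
    an assumption made for the example, not something LSF provides.
    """
    # Collect each owner once, in a stable order.
    owners = sorted({owner for _job_id, owner in problem_jobs})
    # Job ID 0 asks bstop to act on all jobs that match the -u filter.
    return [["bstop", "-u", owner, "0"] for owner in owners]


if __name__ == "__main__":
    jobs = [(1201, "alice"), (1202, "bob"), (1207, "alice")]
    for cmd in build_stop_commands(jobs):
        print(" ".join(cmd))
```

A site script would pass each command list to its own process launcher after deciding whether suspending every job of a user is really the desired response.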
Host exceptions LSF can detect
If you configure exception handling, LSF can detect jobs that exit repeatedly on a host.
The host can still be available to accept jobs, but some other problem prevents the jobs
from running. Typically, jobs dispatched to such “black hole” or “job-eating” hosts exit
abnormally. LSF monitors the job exit rate for hosts, and closes the host if the rate
exceeds a threshold you configure (EXIT_RATE in lsb.hosts).
By default, LSF invokes eadmin if the job exit rate for a host remains above the
configured threshold for longer than 10 minutes. Use JOB_EXIT_RATE_DURATION in
lsb.params to change how frequently LSF checks the job exit rate.
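The threshold-and-duration rule can be illustrated with a short sketch. It is not LSF's internal algorithm: the function name, the per-minute (minute, exit_rate) sample format, and the choice to treat a run of exactly duration_minutes as sufficient are all assumptions made for the example.

```python
def rate_exceeded_for(samples, threshold, duration_minutes=10):
    """Return True if the exit rate stayed above threshold for the duration.

    samples: list of (minute, exit_rate) tuples sorted by minute.
    threshold: the configured EXIT_RATE value.
    duration_minutes: plays the role of JOB_EXIT_RATE_DURATION.
    """
    run_start = None  # minute at which the current above-threshold run began
    for minute, rate in samples:
        if rate > threshold:
            if run_start is None:
                run_start = minute
            # Assumption: a run spanning exactly the duration counts.
            if minute - run_start >= duration_minutes:
                return True
        else:
            # Any sample at or below the threshold resets the run.
            run_start = None
    return False


if __name__ == "__main__":
    steady = [(m, 25) for m in range(0, 11)]
    print(rate_exceeded_for(steady, threshold=20))
```

The reset on a low sample reflects the phrase "remains above": a host that dips back under the threshold starts a new measurement period.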
Default eadmin actions
LSF closes the host and sends email to the LSF administrator. The email contains the
host name, job exit rate for the host, and other host information. The message
eadmin: JOB EXIT THRESHOLD EXCEEDED is attached to the closed host event in
lsb.events, and is displayed by badmin hist and badmin hhist. Only one email
is sent for host exceptions.
Configuring host exception handling (lsb.hosts)
EXIT_RATE
Specifies a threshold for exited jobs. If the job exit rate stays above this threshold
for 10 minutes, or for the period specified by JOB_EXIT_RATE_DURATION, LSF invokes
eadmin to trigger a host exception.
Example
The following Host section defines a job exit rate of 20 jobs per minute for all hosts:
Begin Host
HOST_NAME   MXJ   EXIT_RATE   # Keywords
Default     !     20
End Host
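To make the layout of the Host section concrete, the sketch below reads such a block into a dictionary and looks up the exit rate threshold for a host. It is an illustration only and not LSF's own parser: the function names are hypothetical, only numeric EXIT_RATE values are handled, and the fallback to the Default row for hosts that are not listed is an assumption of the example.

```python
def parse_host_section(text):
    """Parse a 'Begin Host' ... 'End Host' block into {host: {column: value}}."""
    hosts, columns, inside = {}, None, False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line == "Begin Host":
            inside, columns = True, None
        elif line == "End Host":
            inside = False
        elif inside and columns is None:
            columns = line.split()  # first line in the block names the columns
        elif inside:
            fields = line.split()
            hosts[fields[0]] = dict(zip(columns[1:], fields[1:]))
    return hosts


def exit_rate_for(hosts, host_name):
    """Return EXIT_RATE for host_name, using the Default row as a fallback."""
    row = hosts.get(host_name) or hosts.get("Default")
    if not row or "EXIT_RATE" not in row:
        return None
    return float(row["EXIT_RATE"])


if __name__ == "__main__":
    sample = """Begin Host
HOST_NAME MXJ EXIT_RATE # Keywords
Default ! 20
End Host
"""
    print(exit_rate_for(parse_host_section(sample), "hostA"))
```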
Configuring thresholds for exception handling
JOB_EXIT_RATE_DURATION (lsb.params)
By default, LSF checks the number of exited jobs every 10 minutes. Use
JOB_EXIT_RATE_DURATION in lsb.params to change this default.