LSF Version 7.3 - Administering Platform LSF

Handling Host-level Job Exceptions
You can configure hosts so that LSF detects exceptional conditions while jobs are
running and takes appropriate action automatically. You can customize which
exceptions are detected and the corresponding actions. By default, LSF does not
detect any exceptions.
Host exceptions LSF can detect
If you configure host exception handling, LSF can detect jobs that exit repeatedly
on a host. The host can still be available to accept jobs, but some other problem
prevents the jobs from running. Typically, jobs dispatched to such “black hole” or
“job-eating” hosts exit abnormally. LSF monitors the job exit rate for hosts, and
closes the host if the rate exceeds a threshold you configure (EXIT_RATE in
lsb.hosts).
If EXIT_RATE is specified for the host, LSF invokes eadmin if the job exit rate
for the host remains above the configured threshold for longer than 5 minutes. Use
JOB_EXIT_RATE_DURATION in lsb.params to change how frequently LSF checks the
job exit rate.
Use GLOBAL_EXIT_RATE in lsb.params to set a cluster-wide threshold in minutes
for exited jobs. If EXIT_RATE is not specified for the host in lsb.hosts,
GLOBAL_EXIT_RATE defines a default exit rate for all hosts in the cluster.
Host-level EXIT_RATE overrides the GLOBAL_EXIT_RATE value.
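As an illustration of the cluster-wide setting described above, a Parameters
section in lsb.params might look like the following sketch. The value 20 is a
hypothetical number chosen only for this illustration, not a recommended
setting; choose a threshold that suits your cluster.

```
# lsb.params (illustrative fragment; the value shown is an assumption)
Begin Parameters
GLOBAL_EXIT_RATE = 20    # default exit rate threshold for hosts with no EXIT_RATE in lsb.hosts
End Parameters
```

A host that has its own EXIT_RATE value in lsb.hosts would continue to use the
host-level value rather than this default.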
Configuring host exception handling (lsb.hosts)
EXIT_RATE
Specify a threshold for exited jobs. If the job exit rate is exceeded for 5 minutes or
the period specified by JOB_EXIT_RATE_DURATION in lsb.params, LSF invokes
eadmin to trigger a host exception.
Example
The following Host section defines a job exit rate of 20 jobs for all hosts, and an exit
rate of 10 jobs on hostA.
Begin Host
HOST_NAME    MXJ    EXIT_RATE    # Keywords
Default      !      20
hostA        !      10
End Host
Configuring thresholds for host exception handling
By default, LSF checks the number of exited jobs every 5 minutes. Use
JOB_EXIT_RATE_DURATION in lsb.params to change this default.
Tuning
TIP: Tune JOB_EXIT_RATE_DURATION carefully. Shorter values may raise false alarms;
longer values may not trigger exceptions frequently enough.
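A minimal sketch of changing the check interval follows, assuming the value is
given in minutes (consistent with the 5-minute default described above). The
value 10 is hypothetical and is shown only to illustrate the syntax.

```
# lsb.params (illustrative fragment; the value shown is an assumption)
Begin Parameters
JOB_EXIT_RATE_DURATION = 10    # evaluate the job exit rate over 10 minutes instead of the default 5
End Parameters
```

A longer period such as this one trades faster detection for fewer false alarms.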