LSF Version 7.3 - Administering Platform LSF

Handling Host-level Job Exceptions

You can configure hosts so that LSF detects exceptional conditions while jobs are running and takes appropriate action automatically. You can customize which exceptions are detected and the corresponding actions. By default, LSF does not detect any exceptions.

Host exceptions LSF can detect

If you configure host exception handling, LSF can detect jobs that exit repeatedly on a host. The host may still be available to accept jobs, but some other problem prevents the jobs from running. Typically, jobs dispatched to such “black hole” or “job-eating” hosts exit abnormally. LSF monitors the job exit rate for hosts and closes the host if the rate exceeds a threshold you configure (EXIT_RATE in lsb.hosts).

If EXIT_RATE is specified for the host, LSF invokes eadmin if the job exit rate for the host remains above the configured threshold for longer than 5 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change how frequently LSF checks the job exit rate.

Use GLOBAL_EXIT_RATE in lsb.params to set a cluster-wide threshold in minutes for exited jobs. If EXIT_RATE is not specified for the host in lsb.hosts, GLOBAL_EXIT_RATE defines a default exit rate for all hosts in the cluster. Host-level EXIT_RATE overrides the GLOBAL_EXIT_RATE value.
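As a rough illustration of the cluster-wide default described above, a Parameters section in lsb.params might contain a fragment like the following. The value 15 is an assumption chosen only for illustration, not a recommended setting. Hosts that define their own EXIT_RATE in lsb.hosts would continue to use that host-level value.

```
Begin Parameters
# Illustrative value only: default exit-rate threshold for hosts
# that have no EXIT_RATE of their own in lsb.hosts
GLOBAL_EXIT_RATE = 15
End Parameters
```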

Configuring host exception handling (lsb.hosts)

EXIT_RATE

Specify a threshold for exited jobs. If the job exit rate exceeds this threshold for 5 minutes, or for the period specified by JOB_EXIT_RATE_DURATION in lsb.params, LSF invokes eadmin to trigger a host exception.

Example

The following Host section defines a job exit rate of 20 jobs for all hosts, and an exit rate of 10 jobs on hostA.

Begin Host
HOST_NAME    MXJ    EXIT_RATE    # Keywords
Default      !      20
hostA        !      10
End Host

Configuring thresholds for host exception handling

By default, LSF checks the number of exited jobs every 5 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change this default.

Tuning tip: Tune JOB_EXIT_RATE_DURATION carefully. Shorter values may raise false alarms, while longer values may not trigger exceptions frequently enough.
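As a minimal sketch of changing the check period, the following lsb.params fragment lengthens it from the default of 5 minutes. The value 10 is an assumption used only for illustration; choose a value suited to your workload. Changes to lsb.params typically take effect only after the batch system is reconfigured, for example with badmin reconfig.

```
Begin Parameters
# Illustrative value only: period, in minutes, over which the
# job exit rate is evaluated (default is 5)
JOB_EXIT_RATE_DURATION = 10
End Parameters
```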