LSF Version 7.3 - Administering Platform LSF

Managing Jobs
In some environments, a job that runs for 1 hour is an overrun job, while in
other environments this is a normal job. If your configuration considers jobs
running longer than 1 hour to be overrun jobs, you may want to close the queue
when LSF detects a job that has run longer than 1 hour and invokes eadmin.
Default eadmin actions
For host-level exceptions, LSF closes the host and sends email to the LSF
administrator. The email contains the host name, job exit rate for the host, and
other host information. The message eadmin: JOB EXIT THRESHOLD EXCEEDED is
attached to the closed host event in lsb.events, and displayed by badmin hist
and badmin hhist. Only one email is sent for host exceptions.
For job exceptions, LSF sends email to the LSF administrator. The email contains
the job ID, exception type (overrun, underrun, idle job), and other job information.
An email is sent for all detected job exceptions according to the frequency
configured by EADMIN_TRIGGER_DURATION in lsb.params. For example, if
EADMIN_TRIGGER_DURATION is set to 5 minutes, and 1 overrun job and 2 idle
jobs are detected, eadmin is invoked after 5 minutes and only one email is sent.
If another overrun job is detected in the next 5 minutes, another email is sent.
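The batching behavior in the example above can be illustrated with a small simulation. This is a simplified model, not LSF's implementation: the function name emails_sent, the tuple layout, and the idea of fixed windows counted from minute 0 are all assumptions made for illustration.

```python
from collections import defaultdict

def emails_sent(detections, trigger_minutes):
    """Group job exception detections into notification windows.

    detections: list of (minute_detected, job_id, exception_type) tuples.
    Returns one batch per window that saw at least one detection; each
    batch stands for a single email (hypothetical model, not LSF code).
    """
    windows = defaultdict(list)
    for minute, job_id, exc_type in detections:
        # All detections that fall in the same window share one email.
        windows[int(minute // trigger_minutes)].append((job_id, exc_type))
    return [windows[w] for w in sorted(windows)]

# 1 overrun job and 2 idle jobs in the first 5 minutes,
# then another overrun job in the next 5 minutes.
detections = [
    (1, 101, "overrun"),
    (2, 102, "idle"),
    (4, 103, "idle"),
    (7, 104, "overrun"),
]
batches = emails_sent(detections, trigger_minutes=5)
for number, batch in enumerate(batches, start=1):
    print(f"email {number}: {batch}")
```

With a 5-minute window, the first three exceptions share one email and the later overrun job produces a second one, matching the example in the text.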
Handling job initialization failures
By default, LSF handles job exceptions for jobs that exit after they have started
running. You can also configure LSF to handle jobs that exit during initialization
because of an execution environment problem, or because of a user action or LSF
policy.
LSF detects that the jobs are exiting before they actually start running, and takes
appropriate action when the job exit rate exceeds the threshold for specific hosts
(EXIT_RATE in lsb.hosts) or for all hosts (GLOBAL_EXIT_RATE in lsb.params).
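As a rough sketch of the threshold check described above, the following function compares a host's exit rate against a per-host or cluster-wide threshold. The function name and arguments are hypothetical, and the precedence rule (a host-specific value is used when present, otherwise the global value) is an assumption of this sketch rather than something stated in this section.

```python
def exceeds_exit_threshold(host, exit_rate, host_thresholds, global_threshold=None):
    """Return True if exit_rate is above the threshold that applies to host.

    host_thresholds models per-host EXIT_RATE values (lsb.hosts);
    global_threshold models GLOBAL_EXIT_RATE (lsb.params).
    Assumption: the per-host value, when set, takes precedence.
    """
    threshold = host_thresholds.get(host, global_threshold)
    if threshold is None:
        # No threshold configured for this host: nothing to exceed.
        return False
    return exit_rate > threshold

host_thresholds = {"hostA": 10}  # hypothetical per-host setting
print(exceeds_exit_threshold("hostA", 12, host_thresholds, global_threshold=20))
print(exceeds_exit_threshold("hostB", 12, host_thresholds, global_threshold=20))
```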
Use EXIT_RATE_TYPE in lsb.params to include job initialization failures in the
exit rate calculation. The following table summarizes the exit rate types you can
configure:
Exit rate type ...            Includes ...

JOBEXIT                       - Local exited jobs
                              - Remote job initialization failures
                              - Parallel job initialization failures on hosts
                                other than the first execution host
                              - Jobs exited by user action (e.g., bkill, bstop,
                                etc.) or LSF policy (e.g., load threshold
                                exceeded, job control action, advance
                                reservation expired, etc.)

JOBEXIT_NONLSF                - Local exited jobs
(This is the default when     - Remote job initialization failures
EXIT_RATE_TYPE is not set)    - Parallel job initialization failures on hosts
                                other than the first execution host
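To make the difference between the two exit rate types listed above concrete, the following sketch encodes the table rows as a lookup and counts which exits would contribute to the exit rate. The category labels (such as local_exit) and the function counted_exits are hypothetical names chosen for illustration; only the type names JOBEXIT and JOBEXIT_NONLSF and the default come from the table.

```python
# Exit categories included by each exit rate type, per the table above.
# The category labels are illustrative names, not LSF identifiers.
INCLUDED_EXITS = {
    "JOBEXIT": {
        "local_exit",
        "remote_init_failure",
        "parallel_init_failure_non_first_host",
        "user_or_policy_exit",
    },
    "JOBEXIT_NONLSF": {
        "local_exit",
        "remote_init_failure",
        "parallel_init_failure_non_first_host",
    },
}

# JOBEXIT_NONLSF applies when EXIT_RATE_TYPE is not set.
DEFAULT_EXIT_RATE_TYPE = "JOBEXIT_NONLSF"

def counted_exits(exit_categories, exit_rate_type=None):
    """Count the exits that contribute to the exit rate for the given type."""
    included = INCLUDED_EXITS[exit_rate_type or DEFAULT_EXIT_RATE_TYPE]
    return sum(1 for category in exit_categories if category in included)

exits = ["local_exit", "user_or_policy_exit", "remote_init_failure"]
print(counted_exits(exits))             # default type
print(counted_exits(exits, "JOBEXIT"))  # also counts user/policy exits
```

Under this model, a job ended by bkill counts toward the exit rate only when the type is JOBEXIT.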