LSF Version 7.3 - Administering Platform LSF
Handling Job Exceptions
Job exits excluded from exit rate calculation
By default, jobs that exit for non-host-related reasons (user actions and LSF
policies) are not counted in the exit rate calculation. Only jobs that exit because of
what LSF considers host-related problems are used to calculate a host exit rate.
The following cases are not included in the exit rate calculations:
◆ bkill, bkill -r
◆ brequeue
◆ RERUNNABLE jobs killed when a host is unavailable
◆ Resource usage limit exceeded (for example, PROCESSLIMIT or CPULIMIT)
◆ Queue-level job control action TERMINATE and TERMINATE_WHEN
◆ Checkpointing a job with the kill option (bchkpnt -k)
◆ Rerunnable job migration
◆ Job killed when an advance reservation has expired
◆ Remote lease job start fails
◆ Any job with an exit code listed in SUCCESS_EXIT_VALUES, which defines the
exit values that are deemed successful
Excluding LSF and user-related job exits
To explicitly exclude jobs that exit because of user actions or LSF-related policies
from the exit rate calculation, set EXIT_RATE_TYPE = JOBEXIT_NONLSF in
lsb.params. JOBEXIT_NONLSF tells LSF to include all job exits except those that
are related to user action or LSF policy. This is the default value for
EXIT_RATE_TYPE.
To include all job exit cases in the exit rate count, you must set
EXIT_RATE_TYPE = JOBEXIT in lsb.params. JOBEXIT considers all job exits.
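As an illustration, a minimal lsb.params fragment that switches from the default to counting every job exit might look like the sketch below. The Begin Parameters and End Parameters lines are the usual wrapper for that file. The comment lines are additions for illustration only, and a real file will normally contain other parameters as well.

```
# lsb.params (sketch; only the relevant parameter is shown)
Begin Parameters
# Count all job exits, including those caused by user actions and LSF policies.
EXIT_RATE_TYPE = JOBEXIT
End Parameters
```

Changes to lsb.params normally take effect only after mbatchd is reconfigured, typically with badmin reconfig. This section does not state that step, so treat it as an assumption about the usual workflow.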
Jobs killed by a signal external to LSF are still counted towards the exit rate.
Jobs killed because of a job control SUSPEND or RESUME action are also
counted towards the exit rate. This is because LSF cannot distinguish between jobs
killed by a SUSPEND action and jobs killed by external signals.
If both JOBEXIT and JOBEXIT_NONLSF are defined, JOBEXIT_NONLSF is used.
Local jobs
When EXIT_RATE_TYPE=JOBINIT, various job initialization failures are
counted in the exit rate calculation, including:
◆ Host-related failures; for example, incorrect user account, user permissions,
incorrect directories for checkpointable jobs, host name resolution failed, or
other execution environment problems
◆ Job-related failures; for example, pre-execution or setup problems, or a job
file that was not created
Exit rate type ...    Includes ...
JOBINIT               Local job initialization failures
                      Parallel job initialization failures on the first execution host
HPCINIT               Job initialization failures for Platform LSF HPC jobs
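To show how the exit rate types might be combined, the sketch below keeps the default handling of job exits and additionally counts job initialization failures. It assumes that several exit rate types may be listed on one line, separated by spaces. This section does not state that syntax, so check it against the EXIT_RATE_TYPE entry in the configuration reference before relying on it.

```
# lsb.params (sketch; combining values on one line is an assumption)
Begin Parameters
# Default exit handling plus local job initialization failures.
EXIT_RATE_TYPE = JOBEXIT_NONLSF JOBINIT
End Parameters
```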