LSF Version 7.3 - Administering Platform LSF
Managing Jobs
Parallel jobs
By default, or when EXIT_RATE_TYPE=JOBEXIT_NONLSF, a job initialization
failure on the first execution host does not count in the job exit rate calculation.
Job initialization failures on hosts other than the first execution host are counted
in the exit rate calculation.
When EXIT_RATE_TYPE=JOBINIT, job initialization failures on the first
execution host are counted in the job exit rate calculation. Job initialization failures
on hosts other than the first execution host are not counted in the exit rate
calculation.
TIP: For parallel job exit exceptions to be counted for all hosts, specify EXIT_RATE_TYPE=HPCINIT
or EXIT_RATE_TYPE=JOBEXIT_NONLSF JOBINIT.
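The parallel-job rules above can be summarized in a short Python sketch. This is an illustration only, not LSF code: the function name init_failure_counted and the two lookup sets are hypothetical, and the treatment of HPCINIT (counted on every host) is an assumption inferred from the TIP rather than something the text states directly. Only the EXIT_RATE_TYPE values named in this section are modeled.

```python
# Sketch only: encodes the parallel-job rules described in this section.
# Names are hypothetical; this is not part of LSF.

# Values under which an initialization failure on the FIRST execution host counts.
FIRST_HOST_TYPES = {"JOBINIT", "HPCINIT"}        # HPCINIT: assumed from the TIP
# Values under which an initialization failure on any OTHER execution host counts.
OTHER_HOST_TYPES = {"JOBEXIT_NONLSF", "HPCINIT"}  # HPCINIT: assumed from the TIP


def init_failure_counted(exit_rate_type, on_first_host):
    """Return True if a parallel job's initialization failure is counted.

    exit_rate_type is the space-separated EXIT_RATE_TYPE value; an empty
    string stands for the default, which behaves like JOBEXIT_NONLSF.
    """
    types = set(exit_rate_type.split()) or {"JOBEXIT_NONLSF"}
    relevant = FIRST_HOST_TYPES if on_first_host else OTHER_HOST_TYPES
    return bool(types & relevant)


if __name__ == "__main__":
    for value in ("", "JOBINIT", "JOBEXIT_NONLSF JOBINIT"):
        print(repr(value),
              init_failure_counted(value, on_first_host=True),
              init_failure_counted(value, on_first_host=False))
```

With the combined value JOBEXIT_NONLSF JOBINIT, the sketch reports failures as counted on both the first host and the other hosts, which matches the TIP.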
Remote jobs
By default, or when EXIT_RATE_TYPE=JOBEXIT_NONLSF, job initialization
failures are counted as exited jobs on the remote execution host and are included in
the exit rate calculation for that host. To include only local job initialization failures
on the execution cluster in the exit rate calculation, set EXIT_RATE_TYPE to
include only JOBINIT or HPCINIT.
Scaling and tuning job exit rate by number of slots
On large, multiprocessor hosts, use ENABLE_EXIT_RATE_PER_SLOT=Y in
lsb.params to scale the job exit rate so that the host is closed only when the job exit
rate is high enough in proportion to the number of processors on the host. This
prevents a relatively low exit rate from closing a host inappropriately.
Use a float value for GLOBAL_EXIT_RATE in lsb.params to tune the exit rate on
multislot hosts. The actual calculated exit rate value is never less than 1.
Example: exit rate of 5 on single processor and multiprocessor hosts
On a single-processor host, a job exit rate of 5 is much more severe than on a
20-processor host. If a stream of jobs to a single-processor host is consistently
failing, it is reasonable to close the host or take some other action after 5 failures.
On the other hand, for the same stream of jobs on a 20-processor host, it is possible
that 19 of the processors are busy doing other work that is running fine. To close
this host after only 5 failures would be wrong because effectively less than 5% of the
jobs on that host are actually failing.
Example: float value for GLOBAL_EXIT_RATE on multislot hosts
Using a float value for GLOBAL_EXIT_RATE allows the exit rate to be less than the
number of slots on the host. For example, on a host with 4 slots,
GLOBAL_EXIT_RATE=0.25 gives an exit rate of 1. The same value on an 8-slot
host gives an exit rate of 2, and so on. On a single-slot host, the value is never less
than 1.
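The arithmetic in this example can be reproduced with a minimal Python sketch. It assumes the scaled value is GLOBAL_EXIT_RATE multiplied by the number of slots, with a floor of 1. That formula is inferred from the figures above and is not confirmed as the exact LSF calculation (any rounding LSF applies is not modeled). The function name scaled_exit_rate is hypothetical.

```python
# Sketch only: reproduces the per-slot scaling figures from the example.
# Assumption: scaled rate = GLOBAL_EXIT_RATE * slots, never less than 1.


def scaled_exit_rate(global_exit_rate, slots):
    """Return the assumed per-host exit rate when ENABLE_EXIT_RATE_PER_SLOT=Y."""
    if slots < 1:
        raise ValueError("a host must have at least one slot")
    return max(1.0, global_exit_rate * slots)


if __name__ == "__main__":
    for slots in (1, 4, 8):
        print(slots, "slots:", scaled_exit_rate(0.25, slots))
```

Under this assumption, GLOBAL_EXIT_RATE=0.25 yields 1 on a 4-slot host, 2 on an 8-slot host, and 1 on a single-slot host, matching the values quoted in the example.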
For more information
◆ See Handling Host-level Job Exceptions on page 98 for information about
configuring host-level job exceptions.
◆ See Handling Job Exceptions in Queues on page 110 for information about
configuring job exceptions in queues.