LSF Version 7.3 - Administering Platform LSF
Job Requeue and Job Rerun
Exclusive Job Requeue
You can configure automatic job requeue so that a failed job is not rerun on the
same host.
Limitations
◆ If mbatchd is restarted, this feature might not work properly, because LSF
forgets which hosts have been excluded. If a job ran on a host and exited with
an exclusive exit code before mbatchd was restarted, the job could be
dispatched to the same host again after mbatchd is restarted.
◆ Exclusive job requeue does not work for MultiCluster jobs or parallel jobs.
◆ A job terminated by a signal is not requeued.
Configure exclusive job requeue
1 Set REQUEUE_EXIT_VALUES in the queue definition (lsb.queues) and
define the exit code using parentheses and the keyword EXCLUDE:
EXCLUDE(exit_code...)
exit_code has the following form:
"[all] [~number ...] | [number ...]"
The reserved keyword all specifies all exit codes. Exit codes are typically
between 0 and 255. Use a tilde (~) to exclude specified exit codes from the
list.
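The exit_code form above can be illustrated with a small Python sketch that expands a specification string into the set of exit codes it names. This is not LSF code: the function name expand_exit_codes is hypothetical, the range 0 to 255 for "all" is an assumption based on the typical range stated above, and rejecting a tilde without a preceding "all" is a simplification made for this sketch.

```python
def expand_exit_codes(spec, universe=range(256)):
    """Expand an exit_code string such as "all ~1 ~2" or "20 30" into a set.

    Illustrative only. `universe` is an assumption: the codes that the
    keyword "all" stands for (0-255 here, the typical range).
    """
    tokens = spec.split()
    if not tokens:
        return set()

    if tokens[0] == "all":
        # Start from every code, then remove each ~number that follows.
        codes = set(universe)
        for tok in tokens[1:]:
            if not tok.startswith("~"):
                raise ValueError("expected ~number after 'all', got %r" % tok)
            codes.discard(int(tok[1:]))
        return codes

    # Otherwise the string is a plain list of numbers.
    codes = set()
    for tok in tokens:
        if tok.startswith("~"):
            # Simplification for this sketch: ~ is only accepted after "all".
            raise ValueError("~number is only handled after 'all' in this sketch")
        codes.add(int(tok))
    return codes
```

For example, expand_exit_codes("20 30") yields the two codes 20 and 30, while expand_exit_codes("all ~1 ~2") yields every code in the assumed range except 1 and 2.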
Jobs are requeued to the head of the queue. The output from the failed run is
not saved, and the user is not notified by LSF.
When a job exits with any of the specified exit codes, it is requeued, but it is not
dispatched to the same host again.
Begin Queue
...
REQUEUE_EXIT_VALUES=30 EXCLUDE(20)
HOSTS=hostA hostB hostC
...
End Queue
A job in this queue can be dispatched to hostA, hostB, or hostC.
If a job running on hostA exits with value 30 and is requeued, it can be
dispatched to hostA, hostB, or hostC. However, if a job running on hostA exits
with value 20 and is requeued, it can only be dispatched to hostB or hostC.
If the job runs on hostB and exits with a value of 20 again, it can only be
dispatched to hostC. Finally, if the job runs on hostC and exits with a value
of 20, it cannot be dispatched to any of the hosts, so it remains pending
indefinitely.
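The walk-through above can be traced with a short Python sketch that tracks, for a single job, which hosts remain eligible after each run. This is a model of the described behavior, not LSF's implementation: the function names eligible_hosts and simulate are hypothetical, and the per-job set of excluded hosts is an assumed stand-in for whatever state mbatchd keeps internally.

```python
def eligible_hosts(queue_hosts, excluded):
    """Return the queue's hosts that the job may still be dispatched to."""
    return [h for h in queue_hosts if h not in excluded]


def simulate(queue_hosts, exclusive_codes, runs):
    """Trace one job through a series of (host, exit_code) runs.

    Returns the list of eligible hosts after each run. Illustrative model
    only; `exclusive_codes` plays the role of the codes inside EXCLUDE(...).
    """
    excluded = set()  # hosts this job must not return to (kept in memory only)
    history = []
    for host, code in runs:
        if host in excluded:
            raise ValueError("%s is excluded for this job" % host)
        if code in exclusive_codes:
            # Exclusive exit code: the job is requeued, but not to this host.
            excluded.add(host)
        history.append(eligible_hosts(queue_hosts, excluded))
    return history


hosts = ["hostA", "hostB", "hostC"]
runs = [("hostA", 30), ("hostA", 20), ("hostB", 20), ("hostC", 20)]
for (host, code), remaining in zip(runs, simulate(hosts, {20}, runs)):
    print(host, code, remaining)
```

The last run leaves an empty list of eligible hosts, which corresponds to the job remaining pending. In this model the excluded set lives only in memory, so resetting it to an empty set mirrors the limitation that the excluded hosts are forgotten when mbatchd is restarted.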