Platform LSF Administration Guide Version 6.2
Chapter 24
Job Requeue and Job Rerun
Exclusive Job Requeue
About exclusive job requeue
You can configure automatic job requeue so that a failed job is not rerun on the same
host.
Limitations
◆ If mbatchd is restarted, this feature might not work properly, since LSF forgets which hosts have been excluded. If a job ran on a host and exited with an exclusive exit code before mbatchd was restarted, the job could be dispatched to the same host again after mbatchd is restarted.
◆ Exclusive job requeue does not work for MultiCluster jobs or parallel jobs.
◆ A job terminated by a signal is not requeued.
Configuring exclusive job requeue
Set REQUEUE_EXIT_VALUES in the queue definition (lsb.queues) and specify the exclusive exit codes in parentheses after the keyword EXCLUDE, as shown:
EXCLUDE(exit_code...)
When a job exits with any of the specified exit codes, it will be requeued, but it will not
be dispatched to the same host again.
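To make the two kinds of exit code concrete, the following Python sketch splits a REQUEUE_EXIT_VALUES string into ordinary requeue codes and exclusive requeue codes. It is only an illustration of the syntax described above, not how LSF itself parses lsb.queues. The function name parse_requeue_exit_values is hypothetical, and the sketch assumes the value contains nothing but space-separated integers and EXCLUDE(...) groups.

```python
import re

def parse_requeue_exit_values(value):
    """Split a REQUEUE_EXIT_VALUES string into (plain, exclusive) exit-code sets.

    Hypothetical helper for illustration only. It assumes the value holds
    space-separated integer exit codes and EXCLUDE(...) groups, nothing else.
    """
    exclusive = set()
    # Collect the codes listed inside every EXCLUDE(...) group.
    for group in re.findall(r"EXCLUDE\(([^)]*)\)", value):
        exclusive.update(int(code) for code in group.split())
    # Whatever remains outside the groups is an ordinary requeue code.
    remainder = re.sub(r"EXCLUDE\([^)]*\)", " ", value)
    plain = {int(code) for code in remainder.split()}
    return plain, exclusive

plain, exclusive = parse_requeue_exit_values("30 EXCLUDE(20)")
print("requeue on any host:", sorted(plain))
print("requeue on a different host:", sorted(exclusive))
```

A job that exits with a code in the first set is requeued without host restrictions. A job that exits with a code in the second set is requeued and kept off the host it just ran on.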
Example
Begin Queue
...
REQUEUE_EXIT_VALUES=30 EXCLUDE(20)
HOSTS=hostA hostB hostC
...
End Queue
A job in this queue can be dispatched to hostA, hostB, or hostC.
If a job running on hostA exits with value 30 and is requeued, it can be dispatched to hostA, hostB, or hostC. However, if a job running on hostA exits with value 20 and is requeued, it can only be dispatched to hostB or hostC.
If the job runs on hostB and exits with a value of 20 again, it can only be dispatched to hostC. Finally, if the job runs on hostC and exits with a value of 20, it cannot be dispatched to any of the hosts, so it remains pending indefinitely.
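The sequence in this example can be traced with a small Python model. This is a sketch under simplifying assumptions, not LSF's scheduler: the names eligible_hosts and simulate are hypothetical, the excluded-host set is kept in memory for a single job, and the model ignores the limitations listed earlier (for instance, an mbatchd restart would clear the set).

```python
def eligible_hosts(queue_hosts, excluded):
    """Return the queue hosts the job may still be dispatched to."""
    return [host for host in queue_hosts if host not in excluded]

def simulate(queue_hosts, exclusive_codes, runs):
    """Trace one job through successive (host, exit_code) runs.

    Returns the list of eligible hosts after each run. Hypothetical model:
    a host is excluded only when the job exits on it with an exclusive code.
    """
    excluded = set()
    history = []
    for host, exit_code in runs:
        if exit_code in exclusive_codes:
            excluded.add(host)
        history.append(eligible_hosts(queue_hosts, excluded))
    return history

hosts = ["hostA", "hostB", "hostC"]
runs = [("hostA", 30), ("hostA", 20), ("hostB", 20), ("hostC", 20)]
history = simulate(hosts, {20}, runs)
for (host, exit_code), remaining in zip(runs, history):
    print(f"ran on {host}, exit {exit_code}: eligible hosts = {remaining}")
```

After the last run the eligible list is empty, which corresponds to the job staying in the pending state because no host in the queue can accept it.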