LSF Version 7.3 - Administering Platform LSF

About Job Requeue
A networked computing environment is vulnerable to failures and temporary
conditions in network services or processor resources. For example, you might get
NFS stale handle errors, disk full errors, process table full errors, or network
connectivity problems. Your application can also be subject to external conditions
such as software license problems, or it can fail occasionally because of a bug in
the application itself.
Such errors are temporary: they tend to occur at one time but not another, or on
one host but not another. Without automatic recovery, all of your jobs could exit
because of temporary errors, and you might not find out until 12 hours later.
LSF provides a way to automatically recover from temporary errors. You can
configure certain exit values so that if a job exits with one of those values, the
job is automatically requeued as if it had not yet been dispatched. The job is then
retried later. You can also configure your queue so that a requeued job is not
scheduled to hosts on which the job previously failed to run.
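
The requeue decision described above can be modeled in a few lines of Python.
This is only an illustrative sketch of the behavior, not LSF code and not LSF
configuration syntax. In LSF itself the exit values are set per queue (the
REQUEUE_EXIT_VALUES parameter in lsb.queues). The names Queue, Job,
handle_exit, and eligible_hosts are hypothetical and chosen for this example;
the state names PEND, DONE, and EXIT follow the usual LSF job states.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    """Hypothetical stand-in for a queue's requeue settings (not an LSF API)."""
    # Exit values that send the job back to the queue.
    requeue_exit_values: set = field(default_factory=set)
    # Exit values that requeue the job AND bar the host it just failed on.
    exclusive_exit_values: set = field(default_factory=set)


@dataclass
class Job:
    """Hypothetical job record tracking state and hosts to avoid."""
    name: str
    state: str = "PEND"
    excluded_hosts: set = field(default_factory=set)


def handle_exit(queue, job, host, exit_value):
    """Update the job after it exits on `host`; return the new state."""
    if exit_value == 0:
        job.state = "DONE"
    elif exit_value in queue.exclusive_exit_values:
        # Requeue, and remember not to try this host again.
        job.excluded_hosts.add(host)
        job.state = "PEND"
    elif exit_value in queue.requeue_exit_values:
        # Requeue as if the job had not yet been dispatched.
        job.state = "PEND"
    else:
        # Not a configured requeue value: the failure is final.
        job.state = "EXIT"
    return job.state


def eligible_hosts(job, hosts):
    """Return the hosts the job may still be scheduled on, in order."""
    return [h for h in hosts if h not in job.excluded_hosts]


if __name__ == "__main__":
    queue = Queue(requeue_exit_values={99}, exclusive_exit_values={20})
    job = Job("nightly_sim")
    print(handle_exit(queue, job, "hostA", 20))      # job is requeued
    print(eligible_hosts(job, ["hostA", "hostB"]))   # hostA is skipped
```

In this model, a job that exits with a value in neither set fails for good,
which matches the idea that only the exit values you configure trigger a retry.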