LSF Version 7.3 - Administering Platform LSF

About Job Requeue

A networked computing environment is vulnerable to failures and temporary
conditions in network services or processor resources. For example, you might get
NFS stale handle errors, disk full errors, process table full errors, or network
connectivity problems. Your application can also be subject to external conditions
such as software license problems, or to occasional failures caused by a bug in
the application itself.

Such errors are temporary: they tend to occur at one time but not another, or on
one host but not another. Without automatic recovery, all your jobs could exit
because of a temporary error, and you might not find out until 12 hours later.

LSF provides a way to automatically recover from temporary errors. You can
configure certain exit values so that if a job exits with one of those values, the
job is automatically requeued as if it had not yet been dispatched. The job is
then retried later. You can also configure your queue so that a requeued job is
not scheduled on hosts where it previously failed to run.
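
The requeue behavior described above can be pictured with a small model. The
following Python sketch is an illustration only and is not LSF code: the Job
class, the handle_exit and eligible_hosts functions, the state labels, and the
exit values 99 and 100 are all hypothetical names and values chosen for the
example. It shows a job returning to the pending state when it exits with a
configured value and, optionally, avoiding the host on which it failed.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical job record for illustration; not an LSF data structure."""
    name: str
    failed_hosts: set = field(default_factory=set)
    state: str = "PEND"  # illustrative state labels: PEND, DONE, EXIT


def handle_exit(job, host, exit_code, requeue_values, exclude_failed_hosts=False):
    """Decide what happens to a job that has just exited on `host`."""
    if exit_code == 0:
        job.state = "DONE"
    elif exit_code in requeue_values:
        # Treated as a temporary error: the job goes back to the queue
        # as if it had not yet been dispatched.
        if exclude_failed_hosts:
            job.failed_hosts.add(host)
        job.state = "PEND"
    else:
        # Any other nonzero exit value is treated as a real failure.
        job.state = "EXIT"
    return job.state


def eligible_hosts(job, hosts):
    """Return the hosts on which the requeued job may still be scheduled."""
    return [h for h in hosts if h not in job.failed_hosts]
```

For example, with requeue values {99, 100} and host exclusion enabled, a job
that exits with 99 on hostA returns to the pending state, and
eligible_hosts(job, ["hostA", "hostB"]) then leaves only hostB as a candidate.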