Platform LSF Administration Guide Version 6.2

About Job Requeue

Administering Platform LSF



A networked computing environment is vulnerable to failures and temporary
conditions in network services or processor resources. For example, you might get NFS
stale handle errors, disk full errors, process table full errors, or network connectivity
problems. Your application can also be subject to external conditions such as
software license problems, or to an occasional failure caused by a bug in the
application itself.

Such errors are temporary and tend to occur at one time but not another, or on
one host but not another. You might be upset to learn that all your jobs exited because of
temporary errors and that you did not find out until 12 hours later.

LSF provides a way to recover automatically from temporary errors. You can configure
certain exit values so that if a job exits with one of those values, the job is
automatically requeued as if it had not yet been dispatched. The job is then retried
later. You can also configure your queue so that a requeued job is not
scheduled on hosts where it previously failed to run.
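
As an illustration of the two behaviors described above, the following is a minimal sketch of a queue definition in lsb.queues. It assumes the REQUEUE_EXIT_VALUES queue parameter and its EXCLUDE() keyword; the queue name and the exit values 99, 100, and 20 are hypothetical and chosen only for the example. Check the lsb.queues reference for your LSF version before relying on it.

```
# lsb.queues (sketch; queue name and exit values are hypothetical)
Begin Queue
QUEUE_NAME          = requeue_demo
# A job that exits with 99 or 100 is requeued and retried later.
# A job that exits with 20 is also requeued, but is not dispatched
# again to the host on which it exited with that value.
REQUEUE_EXIT_VALUES = 99 100 EXCLUDE(20)
DESCRIPTION         = Requeue jobs that fail with temporary errors
End Queue
```

For this to be useful, the job itself must exit with one of the configured values when it detects a temporary error. Changes to lsb.queues typically take effect only after the batch system is reconfigured, for example with badmin reconfig.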