Platform LSF Administration Guide Version 6.2

Chapter 24
Job Requeue and Job Rerun
Automatic Job Rerun
Job requeue vs. job rerun
Automatic job requeue occurs when a job finishes and has a specified exit code (usually
indicating some type of failure).
Automatic job rerun occurs when the execution host becomes unavailable while a job is
running. It does not occur if the job itself fails.
About job rerun
When a job is rerun or restarted, it is first returned to the queue from which it was
dispatched with the same options as the original job. The priority of the job is set
sufficiently high to ensure the job gets dispatched before other jobs in the queue. The
job uses the same job ID number. It is executed when a suitable host is available, and an
email message is sent to the job owner informing the user of the restart.
Automatic job rerun can be enabled at the job level, by the user, or at the queue level,
by the LSF administrator. If automatic job rerun is enabled, the following conditions
cause LSF to rerun the job:
The execution host becomes unavailable while a job is running
The system fails while a job is running
When LSF reruns a job, it returns the job to the submission queue, with the same job
ID. LSF dispatches the job as if it were a new submission, even if the job has been
checkpointed.
Execution host fails
If the execution host fails, LSF dispatches the job to another host. You receive a mail
message informing you of the host failure and the requeuing of the job.
LSF system fails
If the LSF system fails, LSF requeues the job when the system restarts.
Configuring queue-level job rerun
To enable automatic job rerun at the queue level, set RERUNNABLE in lsb.queues to yes.
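As an illustration, a queue definition with rerun enabled might look like the following fragment. The queue name, priority, and description are hypothetical placeholders; only the RERUNNABLE line comes from the text above.

```
# lsb.queues (fragment)
# QUEUE_NAME, PRIORITY, and DESCRIPTION are illustrative values only.
Begin Queue
QUEUE_NAME  = rerun_q
PRIORITY    = 30
# Enable automatic job rerun for jobs dispatched from this queue.
RERUNNABLE  = yes
DESCRIPTION = Jobs are rerun if the execution host becomes unavailable
End Queue
```

Changes to lsb.queues normally take effect only after the batch system is reconfigured, typically with badmin reconfig.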
Submitting a rerunnable job
To enable automatic job rerun at the job level, use bsub -r.
Interactive batch jobs (bsub -I) cannot be rerunnable.
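A minimal sketch of a job-level submission follows. The job script name ./my_job.sh is a hypothetical placeholder. The check for bsub is added only so the sketch can be tried on a host without LSF installed.

```shell
#!/bin/sh
# Hypothetical job script; substitute your own command.
JOB_CMD="./my_job.sh"

# -r marks the job as rerunnable at the job level.
SUBMIT_CMD="bsub -r $JOB_CMD"

if command -v bsub >/dev/null 2>&1; then
    # Intentionally unquoted so the string splits into command and arguments.
    $SUBMIT_CMD
else
    # No LSF on this host: show what would have been submitted.
    echo "bsub not found; would run: $SUBMIT_CMD"
fi
```

Because the job is not interactive, its output is delivered through the usual batch mechanisms rather than to the terminal.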
Disabling post-execution for rerunnable jobs
Running post-execution commands when a rerunnable job restarts is not always desirable;
for example, the post-exec may remove certain files or perform other cleanup that should
happen only if the job finishes successfully. Use LSB_DISABLE_RERUN_POST_EXEC=Y in
lsf.conf to prevent the post-exec from running when a job is rerun.
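For reference, the setting would appear in lsf.conf as a single line, as in the fragment below. The comment is explanatory only, and the rest of the file is omitted.

```
# lsf.conf (fragment)
# Skip the post-execution command when a job is rerun.
LSB_DISABLE_RERUN_POST_EXEC=Y
```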