LSF Version 7.3 - Administering Platform LSF
Job Requeue and Job Rerun
Automatic Job Rerun
Job requeue vs. job rerun
Automatic job requeue occurs when a job finishes and has a specified exit code
(usually indicating some type of failure).
Automatic job rerun occurs when the execution host becomes unavailable while a
job is running. It does not occur if the job itself fails.
About job rerun
When a job is rerun or restarted, it is first returned to the queue from which it was
dispatched with the same options as the original job. The priority of the job is set
sufficiently high to ensure the job gets dispatched before other jobs in the queue.
The job uses the same job ID number. It is executed when a suitable host is available,
and an email message is sent to the job owner informing the user of the restart.
Automatic job rerun can be enabled at the job level, by the user, or at the queue level,
by the LSF administrator. If automatic job rerun is enabled, the following
conditions cause LSF to rerun the job:
◆ The execution host becomes unavailable while a job is running
◆ The system fails while a job is running
When LSF reruns a job, it returns the job to the submission queue, with the same
job ID. LSF dispatches the job as if it were a new submission, even if the job has been
checkpointed.
Execution host fails
If the execution host fails, LSF dispatches the job to another host. You receive an
email message informing you of the host failure and the requeuing of the job.
LSF system fails
If the LSF system fails, LSF requeues the job when the system restarts.
Configure queue-level job rerun
1 To enable automatic job rerun at the queue level, set RERUNNABLE in
lsb.queues to yes.
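As a rough illustration, a queue definition in lsb.queues with automatic rerun enabled might look like the fragment below. The queue name rerun_q is a hypothetical placeholder, and a real queue definition would normally carry other parameters as well. Changes to lsb.queues typically take effect only after the batch configuration is reloaded (for example, with badmin reconfig).

```
# lsb.queues (fragment)
# rerun_q is a hypothetical queue name used only for this sketch.
Begin Queue
QUEUE_NAME = rerun_q
RERUNNABLE = yes
End Queue
```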
Submit a rerunnable job
1 To enable automatic job rerun at the job level, use bsub -r.
Interactive batch jobs (bsub -I) cannot be rerunnable.
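A minimal sketch of a job-level submission follows. The script name my_job.sh is a hypothetical placeholder. The check for bsub is an addition made for this sketch so that it fails gracefully on a machine where the LSF environment has not been set up.

```shell
#!/bin/sh
# Hypothetical job script; substitute your own command.
job_cmd="./my_job.sh"

if command -v bsub >/dev/null 2>&1; then
    # -r marks the job as rerunnable at the job level.
    bsub -r "$job_cmd"
else
    # LSF is not on PATH (for example, the LSF environment is not sourced).
    echo "bsub not found; would run: bsub -r $job_cmd"
fi
```

Because interactive batch jobs cannot be rerunnable, the sketch submits a non-interactive job and does not combine -r with -I.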