LSF Version 7.3 - Platform LSF Configuration Reference

Feature: Job checkpoint and restart
The job checkpoint and restart feature enables you to stop jobs and then restart them from
the point at which they stopped, which optimizes resource usage. LSF can periodically capture
the state of a running job and the data required to restart it. This feature provides fault tolerance
and allows LSF administrators and users to migrate jobs from one host to another to achieve
load balancing.
Contents
About job checkpoint and restart
Scope
Configuration to enable job checkpoint and restart
Job checkpoint and restart behavior
Configuration to modify job checkpoint and restart
Job checkpoint and restart commands
About job checkpoint and restart
Checkpointing enables LSF users to restart a job on the same execution host or to migrate a job to a different execution
host. LSF controls checkpointing and restart by means of interfaces named echkpnt and erestart. By default, when a
user specifies a checkpoint directory using bsub -k or bmod -k or submits a job to a queue that has a checkpoint
directory specified, echkpnt sends checkpoint instructions to an executable named echkpnt.default.
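
As a minimal sketch of the submission step described above, the following Python helper assembles a bsub command line that passes a checkpoint directory with -k. The helper name bsub_checkpoint_argv, the directory /share/chkpnt, and the job command my_job are hypothetical placeholders for illustration; only the bsub -k option itself comes from the text above. The helper only builds the argument list and does not invoke LSF.

```python
import shlex
from typing import List


def bsub_checkpoint_argv(checkpoint_dir: str, command: List[str]) -> List[str]:
    """Build the argument list for submitting a checkpointable job.

    Hypothetical helper: it only assembles arguments and never runs bsub.
    """
    if not checkpoint_dir:
        raise ValueError("a checkpoint directory is required for bsub -k")
    # -k names the checkpoint directory, which makes the job checkpointable.
    return ["bsub", "-k", checkpoint_dir] + list(command)


# /share/chkpnt and my_job are placeholder values, not LSF defaults.
argv = bsub_checkpoint_argv("/share/chkpnt", ["my_job"])
print(shlex.join(argv))
```

Keeping the arguments as a list rather than a single string avoids quoting problems if the directory name contains spaces.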
When LSF checkpoints a job, the echkpnt interface creates a checkpoint file in the directory checkpoint_dir/job_ID,
and then checkpoints and resumes the job. The job continues to run, even if checkpointing fails.
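
The behavior described above can be modeled with a short Python sketch. It is not the echkpnt implementation: the function names, the dictionary-based job record, and the write_checkpoint callback are assumptions made for illustration. The sketch shows two points from the text: the checkpoint location is checkpoint_dir/job_ID, and the job returns to the running state whether or not the checkpoint succeeds.

```python
from pathlib import PurePosixPath
from typing import Any, Callable, Dict


def checkpoint_path(checkpoint_dir: str, job_id: int) -> PurePosixPath:
    """Return the per-job checkpoint location, checkpoint_dir/job_ID."""
    return PurePosixPath(checkpoint_dir) / str(job_id)


def checkpoint_and_resume(
    job: Dict[str, Any],
    checkpoint_dir: str,
    write_checkpoint: Callable[[PurePosixPath], None],
) -> bool:
    """Attempt a checkpoint, then resume the job in every case.

    write_checkpoint is a hypothetical callback standing in for whatever
    actually captures the job state. Returns True if the checkpoint succeeded.
    """
    target = checkpoint_path(checkpoint_dir, int(job["id"]))
    try:
        write_checkpoint(target)
        job["last_checkpoint"] = str(target)
        succeeded = True
    except OSError:
        # A failed checkpoint is recorded but does not stop the job.
        succeeded = False
    # The job continues to run even if checkpointing failed.
    job["state"] = "RUN"
    return succeeded
```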
When LSF restarts a stopped job, the erestart interface recovers job state information from the checkpoint file,
including information about the execution environment, and restarts the job from the point at which the job stopped.
At job restart, LSF
1. Resubmits the job to its original queue and assigns a new job ID
2. Dispatches the job when a suitable host becomes available (not necessarily the original execution host)
3. Re-creates the execution environment based on information from the checkpoint file
4. Restarts the job from its most recent checkpoint
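
The four restart steps above can be sketched as a small Python model. This is an illustration under stated assumptions, not the erestart implementation: the Checkpoint and Job classes, their fields, the ID counter, and the rule of picking the first available host are all hypothetical simplifications.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Checkpoint:
    """Hypothetical stand-in for the state stored in a checkpoint file."""
    queue: str
    environment: Dict[str, str]
    progress: int  # abstract marker for how far the job had run


@dataclass
class Job:
    job_id: int
    queue: str
    host: Optional[str] = None
    environment: Dict[str, str] = field(default_factory=dict)
    progress: int = 0


def restart_job(chk: Checkpoint, old_job_id: int, hosts: List[str]) -> Job:
    """Model the restart sequence; hosts lists the currently suitable hosts."""
    # 1. Resubmit to the original queue under a new job ID.
    new_ids = itertools.count(old_job_id + 1)
    job = Job(job_id=next(new_ids), queue=chk.queue)
    # 2. Dispatch to a suitable host, which may differ from the original one.
    if not hosts:
        raise RuntimeError("no suitable host available yet")
    job.host = hosts[0]
    # 3. Re-create the execution environment from the checkpoint data.
    job.environment = dict(chk.environment)
    # 4. Continue from the most recent checkpoint rather than from the start.
    job.progress = chk.progress
    return job
```

In this model, a job checkpointed on one host can be restarted with a host list that contains only a different host, which corresponds to job migration.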