LSF Version 7.3 - Platform LSF Configuration Reference

Feature: Job checkpoint and restart

The job checkpoint and restart feature enables you to stop jobs and then restart them from the point at which they stopped, which optimizes resource usage. LSF can periodically capture the state of a running job and the data required to restart it. This feature provides fault tolerance and allows LSF administrators and users to migrate jobs from one host to another to achieve load balancing.

Contents

• About job checkpoint and restart
• Scope
• Configuration to enable job checkpoint and restart
• Job checkpoint and restart behavior
• Configuration to modify job checkpoint and restart
• Job checkpoint and restart commands

About job checkpoint and restart

Checkpointing enables LSF users to restart a job on the same execution host or to migrate a job to a different execution host. LSF controls checkpointing and restart by means of interfaces named echkpnt and erestart. By default, when a user specifies a checkpoint directory using bsub -k or bmod -k or submits a job to a queue that has a checkpoint directory specified, echkpnt sends checkpoint instructions to an executable named echkpnt.default.
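As an illustration of the submission step described above, the following Python sketch assembles a bsub -k command line without running it. The helper name build_checkpoint_submit, the directory /scratch/ckpt, the job ./my_job, and the 30-minute period are hypothetical. The sketch assumes that the -k value is passed as one quoted argument containing the checkpoint directory, optionally followed by a checkpoint period in minutes. Verify the exact syntax against the bsub reference for your LSF version.

```python
import shlex
from typing import List, Optional


def build_checkpoint_submit(
    checkpoint_dir: str,
    job_command: List[str],
    period_minutes: Optional[int] = None,
) -> List[str]:
    """Return an argv list that submits a checkpointable job (hypothetical helper).

    Assumption: bsub -k takes a single argument of the form
    "checkpoint_dir [checkpoint_period]", with the period in minutes.
    """
    k_value = checkpoint_dir
    if period_minutes is not None:
        if period_minutes <= 0:
            raise ValueError("period_minutes must be positive")
        # Directory and period travel together as one argument to -k.
        k_value = f"{checkpoint_dir} {period_minutes}"
    return ["bsub", "-k", k_value, *job_command]


if __name__ == "__main__":
    # Print the command for inspection; nothing is submitted to LSF here.
    argv = build_checkpoint_submit("/scratch/ckpt", ["./my_job"], period_minutes=30)
    print(shlex.join(argv))
```

Building the argument list rather than a single string keeps the directory and period together as one -k argument when the list is later handed to a process launcher such as subprocess.run.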

When LSF checkpoints a job, the echkpnt interface creates a checkpoint file in the directory checkpoint_dir/job_ID, and then checkpoints and resumes the job. The job continues to run, even if checkpointing fails.

When LSF restarts a stopped job, the erestart interface recovers job state information from the checkpoint file, including information about the execution environment, and restarts the job from the point at which the job stopped.

At job restart, LSF:

• Resubmits the job to its original queue and assigns a new job ID
• Dispatches the job when a suitable host becomes available (not necessarily the original execution host)
• Re-creates the execution environment based on information from the checkpoint file
• Restarts the job from its most recent checkpoint
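
The restart sequence above can be sketched from the user's side as follows. This Python sketch only derives the checkpoint_dir/job_ID location and assembles a restart command line, without running it. The helper names, the directory /scratch/ckpt, and the job ID 1234 are hypothetical. The sketch assumes that the restart command is brestart and that it takes the checkpoint directory followed by the original job ID. Confirm both against the command reference for your LSF version.

```python
import posixpath
from typing import List


def checkpoint_job_dir(checkpoint_dir: str, job_id: int) -> str:
    """Return the per-job checkpoint location, checkpoint_dir/job_ID."""
    return posixpath.join(checkpoint_dir, str(job_id))


def build_restart(checkpoint_dir: str, job_id: int) -> List[str]:
    """Return an argv list that restarts a checkpointed job (hypothetical helper).

    Assumption: brestart is invoked as "brestart checkpoint_dir job_ID",
    where job_ID is the ID the job had when it was checkpointed.
    """
    return ["brestart", checkpoint_dir, str(job_id)]


if __name__ == "__main__":
    # Show where the checkpoint data is expected and the command to run.
    print(checkpoint_job_dir("/scratch/ckpt", 1234))
    print(" ".join(build_restart("/scratch/ckpt", 1234)))
```

Because LSF assigns a new job ID at restart, a wrapper built on this sketch would need to record the new ID separately from the one used to locate the checkpoint data.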
