Platform LSF Administration Guide Version 6.2

Checkpointing Jobs
Checkpointing a job involves capturing the state of an executing job, along with the
data necessary to restart it, so that the work done to reach the current stage is not
wasted. The job state information is saved in a checkpoint file. There are several
reasons to checkpoint a job.
Fault tolerance
To provide job fault tolerance, checkpoints are taken at regular intervals (periodically)
during the job’s execution. If the job is killed or migrated, or if the job fails for a reason
other than host failure, the job can be restarted from its last checkpoint without
repeating the work done up to that point.
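The idea of periodic checkpointing and restart can be illustrated with a minimal, self-contained sketch. This is not how LSF implements checkpointing; it is a hypothetical application-level example in Python in which a job saves its own state to a checkpoint file at a fixed step interval and resumes from that file after a simulated failure. The names run_job, save_checkpoint, and demo, the JSON file format, and the step-based interval are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(state, path):
    """Write the job state to the checkpoint file (hypothetical JSON format)."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    # Rename into place so a crash mid-write does not leave a partial checkpoint.
    os.replace(tmp, path)


def run_job(total_steps, checkpoint_path, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every few steps.

    Returns (result, number_of_steps_executed_in_this_run).
    """
    # Restart from the last checkpoint if one exists, otherwise from the beginning.
    state = {"next_step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    executed = 0
    for step in range(state["next_step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated job failure")
        state["total"] += step
        state["next_step"] = step + 1
        executed += 1
        # Take a checkpoint at regular intervals.
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(state, checkpoint_path)
    return state["total"], executed


def demo():
    """Run a job that fails partway through, then restart it from its checkpoint."""
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "job.ckpt")
        try:
            run_job(100, ckpt, checkpoint_every=10, fail_at=57)
        except RuntimeError:
            pass  # the job failed; its last checkpoint is still on disk
        return run_job(100, ckpt, checkpoint_every=10)


if __name__ == "__main__":
    print(demo())
```

In this sketch the first run fails at step 57, after its last checkpoint was taken at step 50. The restarted run therefore executes only the remaining 50 steps instead of all 100, which is the saving that periodic checkpointing is meant to provide.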
Migration
Checkpointing enables a migrating job to continue from its last checkpoint rather than
restarting from the beginning. Jobs can be migrated when a host fails or when a host
becomes unavailable due to load.
Load balancing
Checkpointing a job and restarting it (migrating) on another host provides load
balancing by moving load (jobs) from a heavily loaded host to a lightly loaded host.
In this section
“Approaches to Checkpointing” on page 395
“Checkpointing a Job” on page 399