Platform LSF Administration Guide Version 6.2

Checkpointing Jobs

Administering Platform LSF

Checkpointing a job captures the state of an executing job, together with the data
necessary to restart it, so that the work done to reach the current stage is not wasted.
The job state information is saved in a checkpoint file. There are several reasons to
checkpoint a job.

Fault tolerance

To provide job fault tolerance, checkpoints are taken at regular intervals (periodically)
during the job’s execution. If the job is killed or migrated, or if the job fails for a reason
other than host failure, the job can be restarted from its last checkpoint without wasting
the work done to reach its current stage.
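The idea of periodic checkpoints can be illustrated with a minimal application-level sketch. This is not how LSF implements checkpointing; the function names, the JSON checkpoint format, and the `fail_at` parameter used to simulate a failure are all assumptions made for illustration.

```python
import json
import os


def save_checkpoint(state, path):
    # Write to a temporary file and then rename it, so that a crash
    # during the write does not leave a truncated checkpoint file.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_job(total_steps, checkpoint_path, checkpoint_every=100, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at is a hypothetical knob that simulates a job failure at a
    given step; it is not part of any real checkpointing interface.
    """
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated job failure")
        state["total"] += state["step"]
        state["step"] += 1
        # Save the job state at regular intervals.
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, checkpoint_path)
    return state["total"]
```

If a run fails partway through, calling `run_job` again with the same checkpoint path resumes from the last saved step, so only the work done since that checkpoint is repeated.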

Migration

Checkpointing enables a migrated job to continue from its last checkpoint rather than
restart from the beginning. Jobs can be migrated when a host fails or when a host
becomes unavailable due to load.

Load balancing

Checkpointing a job and restarting it on another host (migrating it) provides load
balancing by moving load (jobs) from a heavily loaded host to a lightly loaded host.
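As a rough illustration of this kind of decision, the sketch below picks one job to move from the most loaded host to the least loaded one. It is a toy model, not LSF's scheduling policy: the function name, the load values, and the `high` and `low` thresholds are all hypothetical.

```python
def pick_migration(host_load, jobs_by_host, high=0.8, low=0.5):
    """Return (job, source_host, target_host) for one migration, or None.

    host_load maps host name -> load in [0, 1]; jobs_by_host maps host
    name -> list of job names. Thresholds are illustrative assumptions.
    """
    source = max(host_load, key=host_load.get)  # most heavily loaded host
    target = min(host_load, key=host_load.get)  # most lightly loaded host
    # Migrate only if the source is heavily loaded and the target is not.
    if host_load[source] < high or host_load[target] > low:
        return None
    # Nothing to move if the source host has no jobs.
    if not jobs_by_host.get(source):
        return None
    return jobs_by_host[source][0], source, target
```

In a real migration the selected job would be checkpointed on the source host and restarted from that checkpoint on the target host.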

In this section

◆ “Approaches to Checkpointing” on page 395

◆ “Checkpointing a Job” on page 399