LSF Version 7.3 - Administering Platform LSF
Checkpoint and restart options
You can implement job checkpoint and restart at one of the following levels.
◆ Kernel level: provided by your operating system, enabled by default
◆ User level: provided by special LSF libraries that you link to your
application object files
◆ Application level: provided by your site-specific applications and supported
by LSF through the use of application-specific echkpnt and erestart executables
NOTE: For a detailed description of the job checkpoint and restart feature and how to configure
it, see the Platform LSF Configuration Reference.
Checkpoint directory and files
The job checkpoint and restart feature requires that a job be made checkpointable
at the job, application profile, or queue level. LSF users can make a job
checkpointable by submitting it with bsub -k and specifying a checkpoint
directory and, optionally, a checkpoint period, an initial checkpoint period, and
a checkpoint method. Administrators can make all jobs in a queue or an application
profile checkpointable by specifying a checkpoint directory for the queue or
application profile.
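As an illustration of job-level submission, the following sketch assembles the
argument passed to bsub -k. It assumes the option string has the form
"checkpoint_dir [init=initial_period] [period] [method=method_name]"; confirm
this against the bsub reference for your LSF version. The helper name
chkpnt_opt, the path /share/chkpnt, the job name my_job, and the period values
are hypothetical. The command is printed rather than run, so the sketch works on
a host without LSF.

```shell
#!/bin/sh
# chkpnt_opt is a hypothetical helper, not part of LSF.
# Assumed option layout (verify in the bsub reference for your version):
#   "checkpoint_dir [init=initial_period] [period] [method=method_name]"
# Pass an empty string for any optional part you want to leave out.
chkpnt_opt() {
    dir=$1
    init=$2
    period=$3
    method=$4
    opt=$dir
    if [ -n "$init" ]; then opt="$opt init=$init"; fi
    if [ -n "$period" ]; then opt="$opt $period"; fi
    if [ -n "$method" ]; then opt="$opt method=$method"; fi
    printf '%s\n' "$opt"
}

# Print (do not run) a submission that uses an illustrative initial period
# of 60 and a checkpoint period of 240, with the default checkpoint method.
printf 'bsub -k "%s" my_job\n' "$(chkpnt_opt /share/chkpnt 60 240 '')"
```

The optional parts are kept as separate arguments so that omitting one does not
shift the meaning of the others.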
Requirements
The following requirements apply to a checkpoint directory specified at the queue
or application profile level:
◆ The specified checkpoint directory must already exist. LSF does not create the
checkpoint directory.
◆ The user account that submits the job must have read and write permissions for
the checkpoint directory.
◆ For the job to restart on another execution host, both the original and new
hosts must have network connectivity to the checkpoint directory.
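The first two requirements above can be checked before submitting jobs. The
sketch below is one possible pre-flight check, not an LSF utility: the function
name check_chkpnt_dir is hypothetical, and it tests only what the current user
can see on the current host. To cover the third requirement, run it as the
submitting user on each host where the job might start or restart.

```shell
#!/bin/sh
# check_chkpnt_dir is a hypothetical helper, not an LSF command.
# It reports whether the directory exists and whether the current
# user can read and write it on this host.
check_chkpnt_dir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir does not exist (LSF does not create it)"
        return 1
    fi
    if [ ! -r "$dir" ] || [ ! -w "$dir" ]; then
        echo "denied: $(id -un) needs read and write permission on $dir"
        return 1
    fi
    echo "ok: $dir"
}
```

For example, check_chkpnt_dir /share/chkpnt (a hypothetical path) prints an
"ok:" line and returns 0 only when both checks pass.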
Behavior
Specifying a checkpoint directory at the queue level or in an application profile
enables checkpointing.
◆ All jobs submitted to the queue or application profile are checkpointable. LSF
writes the checkpoint files, which contain job state information, to the
checkpoint directory. The checkpoint directory can contain checkpoint files for
multiple jobs.
NOTE: LSF does not delete the checkpoint files; you must perform file maintenance manually.
◆ If the administrator specifies a checkpoint period, in minutes, LSF creates a
checkpoint file every chkpnt_period minutes during job execution.
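Because LSF does not delete checkpoint files, a site may want a way to find old
ones for manual review. The sketch below only lists candidates and deletes
nothing. The function name list_stale_chkpnt and the 30-day default are
hypothetical, and it makes no assumption about how LSF names or arranges files
inside the checkpoint directory. An old file may still belong to a job that has
not yet been restarted, so review the list before removing anything.

```shell
#!/bin/sh
# list_stale_chkpnt is a hypothetical helper, not an LSF command.
# It prints regular files under the given directory that were last
# modified more than the given number of days ago (default 30).
# It does not delete anything.
list_stale_chkpnt() {
    dir=$1
    days=${2:-30}
    find "$dir" -type f -mtime +"$days" -print
}
```

For example, list_stale_chkpnt /share/chkpnt 14 lists files under the
hypothetical directory /share/chkpnt that have not changed in over 14 days.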