LSF Version 7.3 - Administering Platform LSF

brestart -q priority mydir 123
Job <456> is submitted to queue <priority>
LSF assigns a new job ID of 456, submits the job to the queue named "priority," and
restarts the job.
Once job 456 is running, you can change the checkpoint period using the bchkpnt
command:
bchkpnt -p 360 456
Job <456> is being checkpointed
NOTE: For a detailed description of the commands used with the job checkpoint and restart
feature, see the Platform LSF Configuration Reference.
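The two commands above can be chained in a script by reading the new job ID from the
brestart message. The following is a minimal sketch, not an LSF-supplied tool: the
function name parse_job_id is hypothetical, and it assumes the message has exactly the
"Job <ID> is submitted to queue <name>" form shown above.

```shell
#!/bin/sh
# Extract the job ID from an LSF message such as
#   Job <456> is submitted to queue <priority>
# Prints nothing if the text does not start with "Job <digits>".
parse_job_id() {
    printf '%s\n' "$1" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Hypothetical usage on a host where LSF is installed (not executed here).
# It assumes brestart writes the message to standard output:
#   out=$(brestart -q priority mydir 123)
#   new_id=$(parse_job_id "$out")
#   bchkpnt -p 360 "$new_id"
```

Parsing the message avoids hard-coding the new job ID, which LSF assigns at restart and
which is not known in advance.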
Checkpoint and restart executables
LSF controls checkpointing and restart by means of interfaces named echkpnt and
erestart. By default, when a user specifies a checkpoint directory using bsub -k or
bmod -k or submits a job to a queue that has a checkpoint directory specified,
echkpnt sends checkpoint instructions to an executable named echkpnt.default.
For application-level job checkpoint and restart, you can specify customized
checkpoint and restart executables for each application that you use. The optional
parameter LSB_ECHKPNT_METHOD specifies a checkpoint executable used for all jobs
in the cluster. An LSF user can override this value when submitting a job.
NOTE: For a detailed description of how to write and configure application-level checkpoint and
restart executables, see the Platform LSF Configuration Reference.
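The precedence described above (job-level value, then the cluster-wide
LSB_ECHKPNT_METHOD value, then the default) can be sketched as a small shell function.
This is an illustration only and not how LSF is implemented. The function names and the
method names myapp and siteapp are hypothetical, and the echkpnt.<method> naming for
custom executables is an assumption extrapolated from the echkpnt.default name above.

```shell
#!/bin/sh
# resolve_chkpnt_method JOB_METHOD CLUSTER_METHOD
# Prints the effective checkpoint method: the job-level value if set,
# otherwise the cluster-wide value, otherwise "default".
resolve_chkpnt_method() {
    job_method=$1
    cluster_method=$2
    if [ -n "$job_method" ]; then
        printf '%s\n' "$job_method"
    elif [ -n "$cluster_method" ]; then
        printf '%s\n' "$cluster_method"
    else
        printf 'default\n'
    fi
}

# chkpnt_executable JOB_METHOD CLUSTER_METHOD
# Prints the executable name that would receive checkpoint instructions,
# assuming the echkpnt.<method> naming convention.
chkpnt_executable() {
    printf 'echkpnt.%s\n' "$(resolve_chkpnt_method "$1" "$2")"
}
```

With both arguments empty, the sketch prints echkpnt.default, which matches the default
behavior described above.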
Job restart
LSF can restart a checkpointed job on a host other than the original execution host
using the information saved in the checkpoint file to recreate the execution
environment. Only jobs that have been checkpointed successfully can be restarted
from a checkpoint file. When a job restarts, LSF performs the following actions:
1 LSF resubmits the job to its original queue as a new job and assigns a new
job ID.
2 When a suitable host becomes available, LSF dispatches the job.
3 LSF recreates the execution environment from the checkpoint file.
4 LSF restarts the job from its last checkpoint.
You can restart a job manually from the command line using brestart, automatically
through configuration, or by migrating the job to a different host using bmig.
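Because only successfully checkpointed jobs can be restarted, a wrapper script may want
to check for checkpoint data before calling brestart. The following is a minimal sketch
under stated assumptions: the function name build_restart_cmd is hypothetical, and the
layout in which checkpoint data for a job is stored under checkpoint_dir/job_ID is an
assumption that should be verified for your installation. A non-empty directory is only
a coarse check and does not prove that the checkpoint completed successfully.

```shell
#!/bin/sh
# build_restart_cmd CHKPNT_DIR JOB_ID [QUEUE]
# Prints the brestart command line for a job, or returns 1 if no
# checkpoint data is found. The command is printed, not executed.
build_restart_cmd() {
    chkpnt_dir=$1
    job_id=$2
    queue=$3
    # Assumption: checkpoint data for a job lives in CHKPNT_DIR/JOB_ID.
    job_dir="$chkpnt_dir/$job_id"
    if [ ! -d "$job_dir" ] || [ -z "$(ls -A "$job_dir" 2>/dev/null)" ]; then
        echo "no checkpoint data found in $job_dir" >&2
        return 1
    fi
    if [ -n "$queue" ]; then
        # Restart in a specific queue, as in: brestart -q priority mydir 123
        printf 'brestart -q %s %s %s\n' "$queue" "$chkpnt_dir" "$job_id"
    else
        # Without -q, the job is resubmitted to its original queue.
        printf 'brestart %s %s\n' "$chkpnt_dir" "$job_id"
    fi
}
```

Printing the command rather than running it lets you review the arguments first. On a
host with LSF installed, the printed line can then be run as is.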