LSF Version 7.3 - Administering Platform LSF
Job Checkpoint, Restart, and Migration
◆ If the administrator specifies an initial checkpoint period (in minutes) in an
application profile, the first checkpoint does not happen until the initial period
has elapsed. LSF then creates a checkpoint file every chkpnt_period after the
initial checkpoint period for as long as the job runs.
◆ If a user specifies a checkpoint directory, initial checkpoint period, checkpoint
method, or checkpoint period at the job level with bsub -k, or modifies the job
with bmod, the job-level values override the queue-level and application profile
values.
The brestart command restarts checkpointed jobs that have stopped running.
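The effect of an initial checkpoint period on checkpoint timing can be illustrated with a short Python sketch. It is not LSF code: the function name checkpoint_times and its parameters are hypothetical, and it assumes that the first checkpoint is taken exactly when the initial period elapses and that, when no initial period is set, the first checkpoint occurs one regular period after the job starts.

```python
def checkpoint_times(run_minutes, chkpnt_period, init_period=0):
    """Return the minutes (since job start) at which checkpoints would occur.

    Assumption (not confirmed by LSF documentation): the first checkpoint is
    taken exactly when the initial period elapses, and later checkpoints
    follow every chkpnt_period minutes while the job is still running.
    """
    if chkpnt_period <= 0:
        raise ValueError("chkpnt_period must be positive")
    # Assumption: with no initial period, the first checkpoint happens one
    # regular period after the job starts.
    first = init_period if init_period > 0 else chkpnt_period
    times = []
    t = first
    while t <= run_minutes:
        times.append(t)
        t += chkpnt_period
    return times


# A 10-hour job with a 4-hour initial period and a 2-hour checkpoint period.
print(checkpoint_times(run_minutes=600, chkpnt_period=120, init_period=240))
# prints [240, 360, 480, 600]
```

With these inputs no checkpoint is taken during the first four hours, which is the behavior an initial checkpoint period is meant to provide for jobs with a long startup phase.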
Precedence of checkpointing options
If checkpoint-related configuration is specified in both the queue and an
application profile, the application profile setting overrides the queue-level
configuration.
If checkpoint-related configuration is specified in the queue, application profile,
and at job level:
◆ Application-level and job-level parameters are merged. If the same parameter
is defined at both job-level and in the application profile, the job-level value
overrides the application profile value.
◆ The merged result of job-level and application profile settings overrides
queue-level configuration.
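The precedence rules above can be sketched as a two-step merge. This is an illustrative Python model, not LSF's implementation: the function effective_checkpoint_config, the dictionary keys, and the sample paths are hypothetical. It also assumes that the override is applied per parameter, so a queue-level value is still used for any parameter that neither the job nor the application profile defines; the text above does not state whether LSF behaves this way.

```python
def effective_checkpoint_config(queue=None, app_profile=None, job=None):
    """Resolve checkpoint settings from the three configuration levels.

    Each argument is a dict of checkpoint parameters (hypothetical keys:
    "dir", "init_period", "period", "method"). Missing or None values are
    treated as "not defined at this level".
    """
    def defined(level):
        return {k: v for k, v in (level or {}).items() if v is not None}

    # Step 1: merge application profile and job-level parameters; where both
    # define the same parameter, the job-level value wins.
    merged = {**defined(app_profile), **defined(job)}

    # Step 2: the merged result overrides queue-level configuration.
    # Assumption: applied per parameter, so queue values fill any gaps.
    return {**defined(queue), **merged}


queue = {"dir": "/share/queue_ckpt", "period": 240}
app = {"dir": "/share/app_ckpt", "init_period": 60, "method": "app_method"}
job = {"period": 30}

config = effective_checkpoint_config(queue, app, job)
print(config["dir"], config["period"], config["init_period"], config["method"])
# prints /share/app_ckpt 30 60 app_method
```

Here the directory, initial period, and method come from the application profile, while the checkpoint period comes from the job, because the job-level value overrides both other levels.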
Checkpointing MultiCluster jobs
To enable checkpointing of MultiCluster jobs, define a checkpoint directory in both
the send-jobs and receive-jobs queues (CHKPNT in lsb.queues), or in an
application profile (CHKPNT_DIR, CHKPNT_PERIOD, CHKPNT_INITPERIOD,
CHKPNT_METHOD in lsb.applications) of both the submission cluster and the
execution cluster. LSF uses the directory specified in the execution cluster.
Checkpointing is not supported if a job runs on a leased host.
Example
The following example shows a queue configured for periodic checkpointing in
lsb.queues:
Begin Queue
...
QUEUE_NAME=checkpoint
CHKPNT=mydir 240
DESCRIPTION=Automatically checkpoints jobs every 4 hours to mydir
...
End Queue
NOTE: The bqueues command displays the checkpoint period in seconds; the lsb.queues
CHKPNT parameter defines the checkpoint period in minutes.
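Because the two units are easy to confuse, the conversion can be made explicit. The following Python sketch is a hypothetical helper, not part of LSF: parse_queue_chkpnt is an invented name, and it assumes the two-field form shown in the example above (a directory without spaces, optionally followed by a period in minutes).

```python
def parse_queue_chkpnt(value):
    """Split an lsb.queues CHKPNT value into (directory, period_in_seconds).

    Assumption: the value has the form "chkpnt_dir [chkpnt_period]", with the
    period given in minutes, as in the queue example above.
    """
    parts = value.split()
    if not parts:
        raise ValueError("CHKPNT value is empty")
    directory = parts[0]
    # Convert the configured minutes to seconds; None means no period is set.
    period_seconds = int(parts[1]) * 60 if len(parts) > 1 else None
    return directory, period_seconds


print(parse_queue_chkpnt("mydir 240"))
# prints ('mydir', 14400)
```

For the example queue, the 240 minutes configured in lsb.queues correspond to 14400 seconds.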
If the command bchkpnt -k 123 is used to checkpoint and kill job 123, you can
restart the job using the brestart command as shown in the following example: