LSF Version 7.3 - Administering Platform LSF

Job Checkpoint, Restart, and Migration
If the administrator specifies an initial checkpoint period (in minutes) in an
application profile, the first checkpoint does not happen until the initial period
has elapsed. After the initial checkpoint period, LSF creates a checkpoint file
every chkpnt_period for as long as the job runs.
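The timing described above can be illustrated with a minimal Python sketch. It is not LSF code: the function name checkpoint_times and its arguments are hypothetical, all values are in minutes, and it assumes the first checkpoint is taken exactly when the initial period elapses.

```python
def checkpoint_times(init_period, chkpnt_period, run_time):
    """Return the minutes since job start at which checkpoints would occur.

    Assumption: the first checkpoint happens exactly when the initial
    period elapses, then one follows every chkpnt_period minutes.
    """
    if init_period <= 0 or chkpnt_period <= 0:
        raise ValueError("periods must be positive")
    times = []
    t = init_period
    while t <= run_time:
        times.append(t)
        t += chkpnt_period
    return times


# Initial period of 60 minutes, then every 240 minutes, for a 600-minute job.
print(checkpoint_times(60, 240, 600))  # [60, 300, 540]
```

Under these assumptions, no checkpoint is taken during the first 60 minutes of the job.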
If a user specifies a checkpoint directory, initial checkpoint period, checkpoint
method, or checkpoint period at the job level with bsub -k, or modifies the job
with bmod, the job-level values override the queue-level and application profile
values.
The brestart command restarts checkpointed jobs that have stopped running.
Precedence of checkpointing options
If checkpoint-related configuration is specified in both the queue and an
application profile, the application profile setting overrides the queue-level
configuration.
If checkpoint-related configuration is specified in the queue, application profile,
and at job level:
Application-level and job-level parameters are merged. If the same parameter
is defined at both job-level and in the application profile, the job-level value
overrides the application profile value.
The merged result of job-level and application profile settings overrides
queue-level configuration.
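The precedence rules above can be sketched in Python as a layered merge. This is an illustration only, not how LSF is implemented: the function resolve_checkpoint_settings and the dictionary keys are hypothetical, and the sketch assumes that a queue-level value still applies when neither the job nor the application profile sets that parameter.

```python
def resolve_checkpoint_settings(queue=None, app_profile=None, job=None):
    """Combine checkpoint settings from three levels (hypothetical model).

    Step 1: merge application profile and job-level settings; the
            job-level value wins when both define the same parameter.
    Step 2: the merged result overrides queue-level settings.
    Assumption: queue values fill in parameters set nowhere else.
    """
    merged = dict(app_profile or {})
    merged.update(job or {})       # job level overrides application profile
    result = dict(queue or {})
    result.update(merged)          # merged result overrides queue level
    return result


queue = {"chkpnt_dir": "mydir", "chkpnt_period": 240}
app_profile = {"chkpnt_period": 120, "chkpnt_initperiod": 60}
job = {"chkpnt_period": 30}

print(resolve_checkpoint_settings(queue, app_profile, job))
# {'chkpnt_dir': 'mydir', 'chkpnt_period': 30, 'chkpnt_initperiod': 60}
```

In this example the job-level period of 30 minutes wins, the initial period comes from the application profile, and the directory falls through from the queue.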
Checkpointing MultiCluster jobs
To enable checkpointing of MultiCluster jobs, define a checkpoint directory in both
the send-jobs and receive-jobs queues (CHKPNT in lsb.queues), or in an
application profile (CHKPNT_DIR, CHKPNT_PERIOD, CHKPNT_INITPERIOD,
CHKPNT_METHOD in lsb.applications) of both the submission cluster and the
execution cluster. LSF uses the directory specified in the execution cluster.
Checkpointing is not supported if a job runs on a leased host.
Example
The following example shows a queue configured for periodic checkpointing in
lsb.queues:
Begin Queue
...
QUEUE_NAME=checkpoint
CHKPNT=mydir 240
DESCRIPTION=Automatically checkpoints jobs every 4 hours to mydir
...
End Queue
NOTE: The bqueues command displays the checkpoint period in seconds; the lsb.queues
CHKPNT parameter defines the checkpoint period in minutes.
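As a quick check of the unit difference in the note, the following Python sketch converts the configured period to seconds. The helper name chkpnt_minutes_to_seconds is hypothetical and is not part of LSF.

```python
def chkpnt_minutes_to_seconds(minutes):
    """Convert a checkpoint period given in minutes to seconds."""
    return minutes * 60


# CHKPNT=mydir 240 configures a period of 240 minutes (4 hours).
print(chkpnt_minutes_to_seconds(240))  # 14400
```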
If the command bchkpnt -k 123 is used to checkpoint and kill job 123, you can
restart the job using the brestart command as shown in the following example: