Platform LSF Administration Guide Version 6.2

Chapter 25
Job Checkpoint, Restart, and Migration
Restarting Checkpointed Jobs
LSF can restart a checkpointed job on a host other than the original execution host using
the information saved in the checkpoint file to recreate the execution environment. Only
jobs that have been checkpointed successfully can be restarted from a checkpoint file.
When a job is restarted, LSF performs the following actions:
1. LSF re-submits the job to its original queue as a new job and a new job ID is
   assigned.
2. When a suitable host is available, the job is dispatched.
3. The execution environment is recreated from the checkpoint file.
4. The job is restarted from its last checkpoint.
A restart can be initiated manually from the command line, automatically through
configuration, or when a job is migrated.
Requirements
LSF can restart a job from its last checkpoint on the execution host, or on another host
if the job is migrated. To restart a job on another host, both hosts must:
- Be binary compatible
- Run the same dot version of the operating system. Unpredictable results may occur
  if the two hosts are not running exactly the same OS version.
- Have access to the executable
- Have access to all open files (LSF must locate them with an absolute path name)
- Have access to the checkpoint file
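
The access requirements above can be checked by hand before restarting a job on a different host. The following is a minimal sketch only: check_restart_paths is a hypothetical helper, not an LSF command, and it assumes you already know which paths the job uses (the executable, its open files, and the checkpoint file or directory). It does not check binary compatibility or the OS version.

```shell
#!/bin/sh
# check_restart_paths: hypothetical helper, NOT part of LSF.
# Run it on the candidate restart host. Each argument is a path the job
# needs (executable, open files, checkpoint file or directory).
# Returns 0 if every path is absolute and readable on this host, 1 otherwise.
check_restart_paths() {
    status=0
    for path in "$@"; do
        case "$path" in
            /*) ;;  # absolute path name, as the requirements call for
            *)  echo "not an absolute path: $path" >&2
                status=1
                continue ;;
        esac
        if [ ! -r "$path" ]; then
            echo "not readable on $(uname -n): $path" >&2
            status=1
        fi
    done
    return "$status"
}
```

Call the function on the target host with the job's paths as arguments; a nonzero return status means at least one path is relative or not readable there.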
Manually restarting jobs
Use the brestart command to manually restart a checkpointed job. To restart a job
from its last checkpoint, specify the checkpoint directory and the job ID of the
checkpointed job. For example, to restart a checkpointed job with job ID 123 from
checkpoint directory my_dir:
% brestart my_dir 123
Job <456> is submitted to default queue <default>
The brestart command allows you to change many of the original submission
options. For example, to restart a checkpointed job with job ID 123 from checkpoint
directory my_dir and have it start from a queue named priority:
% brestart -q priority my_dir 123
Job <456> is submitted to queue <priority>
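
Because the restarted job receives a new job ID, scripts that drive brestart usually need to capture that ID. The following is a minimal sketch, assuming the output has the "Job <N> is submitted ..." form shown in the examples above; new_job_id is a hypothetical helper, not an LSF command.

```shell
#!/bin/sh
# new_job_id: hypothetical helper, NOT part of LSF.
# Reads brestart output on standard input and prints the job ID from the
# first line of the form "Job <N> is submitted ...". Prints nothing if no
# such line is found.
new_job_id() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p' | head -n 1
}

# Intended use (commented out so that this sketch has no side effects):
#   new_id=$(brestart my_dir 123 | new_job_id)
```

Check that the captured value is non-empty before using it, since the function prints nothing when the output does not match the assumed format.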