LSF Version 7.3 - Platform LSF Configuration Reference
• Migrates the job to a new host using bmig
All checkpoint and restart executables run under the user account of the user who submits the
job.
Note:
By default, LSF redirects standard error and standard output to /dev/null and discards
the data.
Checkpoint directory and files
LSF identifies checkpoint files by the checkpoint directory and job ID. For example:
bsub -k my_dir
Job <123> is submitted to default queue <default>
LSF writes the checkpoint file to my_dir/123.
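As a minimal sketch of the naming rule above, the checkpoint location can be derived from the checkpoint directory and the job ID that bsub reports. The function name checkpoint_path is hypothetical and not part of LSF; it assumes the submission message has the form shown in the example.

```shell
# Hypothetical helper: build the checkpoint path from the checkpoint
# directory and the "Job <ID> is submitted ..." message printed by bsub.
checkpoint_path() {
    ckpt_dir=$1
    bsub_output=$2
    # Pull the numeric job ID out of the leading "Job <ID>" token.
    job_id=$(printf '%s\n' "$bsub_output" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p')
    # Fail if the message did not contain a job ID.
    [ -n "$job_id" ] || return 1
    printf '%s/%s\n' "$ckpt_dir" "$job_id"
}

# prints my_dir/123
checkpoint_path my_dir 'Job <123> is submitted to default queue <default>'
```

The helper returns a non-zero status when the message does not start with a job ID, so a calling script can detect a failed submission.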
LSF maintains all of the checkpoint files for a single job in one location. When a job restarts,
LSF creates both a new subdirectory based on the new job ID and a symbolic link from the
old to the new directory. For example, when job 123 restarts on a new host as job 456, LSF
creates my_dir/456 and a symbolic link from my_dir/123 to my_dir/456.
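The directory layout after a restart can be pictured with a small simulation. This is only an illustration of the end state described above, not of how LSF produces it: the function simulate_restart_layout, the file name chkpnt.data, and the assumption that the existing checkpoint data ends up under the new job ID are all hypothetical.

```shell
# Hypothetical simulation of the layout after job 123 restarts as job 456.
# LSF performs these steps itself; nothing here is an LSF command.
simulate_restart_layout() {
    ckpt_dir=$1
    old_id=$2
    new_id=$3
    # Assumption: the existing checkpoint data ends up under the new job ID.
    mv "$ckpt_dir/$old_id" "$ckpt_dir/$new_id"
    # The old path becomes a symbolic link to the new directory.
    ln -s "$new_id" "$ckpt_dir/$old_id"
}

demo_root=$(mktemp -d)
mkdir -p "$demo_root/my_dir/123"
: > "$demo_root/my_dir/123/chkpnt.data"    # placeholder checkpoint file

simulate_restart_layout "$demo_root/my_dir" 123 456

# The old path now resolves to the new directory.
readlink "$demo_root/my_dir/123"
```

Because the old path remains valid through the link, anything that refers to my_dir/123 continues to reach the checkpoint files after the restart.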
Precedence of job, queue, application, and cluster-level checkpoint values
LSF handles checkpoint and restart values as follows:
1. Checkpoint directory and checkpoint period: values specified at the job level override
   values for the queue. Values specified in an application profile also override queue-level
   configuration.
   If checkpoint-related configuration is specified in the queue, in the application profile,
   and at the job level:
   • Application-level and job-level parameters are merged. If the same parameter is defined
     at both the job level and in the application profile, the job-level value overrides the
     application profile value.
   • The merged result of job-level and application profile settings overrides queue-level
     configuration.
2. Checkpoint and restart executables: the value for checkpoint_method specified at the job
   level overrides both the application-level CHKPNT_METHOD and the cluster-level value for
   LSB_ECHKPNT_METHOD specified in lsf.conf or as an environment variable.
3. Configuration parameters and environment variables: values specified as environment
   variables override the values specified in lsf.conf.
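The precedence rules above amount to picking the first configured value from an ordered list of levels. The sketch below models that for a single parameter; the function first_set and all variable names are hypothetical, and an empty string stands for "not configured at this level". It does not address whether queue-level values fill in parameters that the merged job and application settings leave unset, because the rules above do not say.

```shell
# Hypothetical helper: print the first non-empty argument.
# Callers pass values in precedence order, highest first.
first_set() {
    for value in "$@"; do
        if [ -n "$value" ]; then
            printf '%s\n' "$value"
            return 0
        fi
    done
    return 1    # nothing configured at any level
}

# Checkpoint period: job level, then application profile, then queue.
job_period=240
app_period=
queue_period=360
first_set "$job_period" "$app_period" "$queue_period"    # prints 240

# Checkpoint method: job level, then application-level CHKPNT_METHOD,
# then LSB_ECHKPNT_METHOD from the environment, then from lsf.conf.
job_method=
app_method=
env_method=fluent
conf_method=myapp
first_set "$job_method" "$app_method" "$env_method" "$conf_method"    # prints fluent
```

Keeping the ordering in the argument list makes each rule visible at the call site: changing which level wins means reordering the arguments, not editing the helper.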
If the command line is…    And…                        Then…

bsub -k "my_dir 240"       In lsb.queues,              LSF saves the checkpoint file to
                           CHKPNT=other_dir 360        my_dir/job_ID every 240 minutes

bsub -k "my_dir fluent"    In lsf.conf,                LSF invokes echkpnt.fluent at job
                           LSB_ECHKPNT_METHOD=myapp    checkpoint and erestart.fluent at job
                                                       restart