Platform LSF Administration Guide Version 6.2
Chapter 25
Job Checkpoint, Restart, and Migration
Administering Platform LSF
stderr and stdout are ignored by LSF. You can save these to a file by setting LSB_ECHKPNT_KEEP_OUTPUT=y in lsf.conf or as an environment variable.
Return values for erestart.method_name
erestart.method_name creates the file checkpoint_dir/$LSB_JOBID/.restart_cmd and writes in this file the command to restart the job or process group in the form:
LSB_RESTART_CMD=restart_command
For example, if the command to restart a job is my_restart my_job, erestart.method_name writes the following line to the .restart_cmd file:
LSB_RESTART_CMD=my_restart my_job
erestart then reads the .restart_cmd file and uses the command specified with
LSB_RESTART_CMD as the command to restart the job.
Writing to the file is optional. Return 0 if erestart.method_name succeeds in writing the job restart command to the file checkpoint_dir/$LSB_JOBID/.restart_cmd, or if it purposefully writes nothing to the file. A non-zero value indicates that erestart.method_name was not able to restart the job.
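The file-writing step and its return values can be sketched as a small shell function. This is an illustration only, not the code shipped with LSF: the function name write_restart_cmd and its two arguments are hypothetical, and how a real erestart.method_name learns the checkpoint directory is not covered in this section, so the sketch simply takes it as a parameter. LSB_JOBID is assumed to be set in the environment.

```shell
#!/bin/sh
# Sketch of the file-writing step of a custom erestart.method_name.
# ASSUMPTION: write_restart_cmd and its arguments are hypothetical names
# used for illustration; LSB_JOBID is expected in the environment.
write_restart_cmd() {
    chkpnt_dir=$1        # stands in for checkpoint_dir in the text
    restart_command=$2   # for example: "my_restart my_job"
    target="$chkpnt_dir/$LSB_JOBID/.restart_cmd"

    # Write the single LSB_RESTART_CMD line that erestart reads back.
    if printf 'LSB_RESTART_CMD=%s\n' "$restart_command" > "$target"; then
        return 0    # success: the restart command was recorded
    fi
    return 1        # non-zero: signals that the job could not be restarted
}
```

A method script built on this sketch might call write_restart_cmd with its checkpoint directory and the string my_restart my_job, then exit with the function's status so that a failed write is reported as a non-zero return value.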
For user-level checkpointing, erestart.method_name must collect the exit code from the job and then exit with the same exit code. Otherwise, the job’s exit status is not reported correctly to LSF. Kernel-level checkpointing works differently and does not need this information from erestart.method_name to restart the job.
erestart.method_name:
◆ Must have access to the original command line used to start the job.
◆ Must return; it should not run the application to restart the job.
Note
sbatchd treats any information that echkpnt writes to stderr as an echkpnt failure. However, not all errors are fatal. If the chkpnt explicitly writes "Checkpoint done" to stdout or stderr, sbatchd assumes echkpnt has succeeded.
Configuring LSF to recognize the custom echkpnt and erestart
You can set the following parameters in lsf.conf or as environment variables. If set in lsf.conf, these parameters apply globally to the cluster and are the default values. Parameters specified as environment variables override the parameters specified in lsf.conf.
If you set parameters in lsf.conf, reconfigure your cluster with lsadmin reconfig and badmin mbdrestart so that changes take effect.
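As a sketch of the environment-variable route, the following sets the two parameters named in this chapter for the current shell before a job is submitted. The method name my_method is a placeholder chosen for illustration, not a method supplied with LSF; LSB_ECHKPNT_METHOD itself is described in step 1 below.

```shell
# Environment-variable form; per the text, these override lsf.conf.
# ASSUMPTION: "my_method" is a placeholder method name for illustration.
LSB_ECHKPNT_METHOD=my_method
LSB_ECHKPNT_KEEP_OUTPUT=y
export LSB_ECHKPNT_METHOD LSB_ECHKPNT_KEEP_OUTPUT

# If the same settings go into lsf.conf instead, the text says to run:
#   lsadmin reconfig
#   badmin mbdrestart
```

Setting the values in the environment affects only jobs submitted from that shell, while the lsf.conf form sets the cluster-wide defaults.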
1 Set LSB_ECHKPNT_METHOD=method_name in lsf.conf or as an environment variable
OR
When you submit the job, specify the checkpoint and restart method. For example: