LSF Version 7.3 - Platform LSF Configuration Reference
•
A non-zero value indicates that erestart.application failed to write to
the .restart_cmd file.
•
A return value of 0 indicates that erestart.application successfully wrote to
the .restart_cmd file, or that the executable intentionally did not write to the
file.
•
Your executables must recognize the syntax used by the echkpnt and erestart
interfaces, which communicate with your executables by means of a common syntax.
•
echkpnt.application syntax:
echkpnt [-c] [-f] [-k | -s] [-d checkpoint_dir] [-x] process_group_ID
Restriction:
The -k and -s options are mutually exclusive.
•
erestart.application syntax:
erestart [-c] [-f] checkpoint_dir
Option or variable
Description Operating systems
-c Copies all files in use by the checkpointed process to the
checkpoint directory.
Some, such as SGI
systems running IRIX and
Altix
-f Forces a job to be checkpointed even under non-
checkpointable conditions, which are specific to the
checkpoint implementation used. This option could create
checkpoint files that do not provide for successful restart.
Some, such as SGI
systems running IRIX and
Altix
-k Kills a job after successful checkpointing. If checkpoint fails,
the job continues to run.
All operating systems that
LSF supports
-s Stops a job after successful checkpointing. If checkpoint fails,
the job continues to run.
Some, such as SGI
systems running IRIX and
Altix
-d checkpoint_dir
Specifies the checkpoint directory as a relative or absolute
path.
All operating systems that
LSF supports
-x Identifies the cpr (checkpoint and restart) process as type
HID. This identifies the set of processes to checkpoint as a
process hierarchy (tree) rooted at the current PID.
Some, such as SGI
systems running IRIX and
Altix
process_group_ID
ID of the process or process group to checkpoint. All operating systems that
LSF supports
Job checkpoint and restart behavior
LSF invokes the echkpnt interface when a job is
•
Automatically checkpointed based on a configured checkpoint period
•
Manually checkpointed with bchkpnt
•
Migrated to a new host with bmig
After checkpointing, LSF invokes the erestart interface to restart the job. LSF also invokes
the erestart interface when a user
•
Manually restarts a job using brestart
Feature: Job checkpoint and restart
96 Platform LSF Configuration Reference