LSF Version 7.3 - Platform LSF Configuration Reference

Configuration file: lsb.queues
Parameter and syntax: JOB_CONTROLS=SUSPEND[CHKPNT] TERMINATE[CHKPNT]
Behavior:
    LSF checkpoints jobs before suspending or terminating them.
    When suspending a job, LSF checkpoints the job and then stops it by sending the SIGSTOP signal.
    When terminating a job, LSF checkpoints the job and then kills it.
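As an illustration of the setting above, a queue definition might look like the following fragment. This is a minimal sketch, not a recommended configuration: the queue name chkpnt_queue and the directory /share/chkpnt are hypothetical, and the CHKPNT line assumes that the queue also sets a checkpoint directory and a 30-minute checkpoint period so that its jobs are checkpointable.

```text
# lsb.queues (fragment); queue name and checkpoint directory are hypothetical
Begin Queue
QUEUE_NAME   = chkpnt_queue
# Assumed queue-level checkpoint directory and period (minutes)
CHKPNT       = /share/chkpnt 30
# Checkpoint each job before it is suspended or terminated
JOB_CONTROLS = SUSPEND[CHKPNT] TERMINATE[CHKPNT]
DESCRIPTION  = Checkpoint jobs before suspending or terminating them
End Queue
```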
Configuration to copy open job files to the checkpoint directory
For hosts that use the Cray operating system, LSF administrators can configure LSF at the host
level to copy all open job files to the checkpoint directory every time the job is checkpointed.
Configuration file: lsb.hosts
Parameter and syntax:
    HOST_NAME    CHKPNT
    host_name    C
Behavior:
    LSF copies all open job files to the checkpoint directory when a job is checkpointed.
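A host entry using this setting might look like the fragment below. It is only a sketch: cray_host1 is a hypothetical host name, and the other columns that a real Host section normally carries are omitted for brevity.

```text
# lsb.hosts (fragment); cray_host1 is a hypothetical host name
Begin Host
HOST_NAME     CHKPNT
# C = copy all open job files to the checkpoint directory at each checkpoint
cray_host1    C
End Host
```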
Job checkpoint and restart commands
Commands for submission
Command:
    bsub -k "checkpoint_dir [checkpoint_period] [method=echkpnt_application]"
Description:
    Specifies a relative or absolute path for the checkpoint directory and makes the job checkpointable.
    If the specified checkpoint directory does not already exist, LSF creates it.
    If a user specifies a checkpoint period (in minutes), LSF creates a checkpoint file every checkpoint_period minutes during job execution.
    The command-line values for the checkpoint directory and checkpoint period override the values specified for the queue.
    If a user specifies an echkpnt_application, LSF runs the corresponding checkpoint and restart executables. For example, for bsub -k "my_dir method=fluent", LSF runs echkpnt.fluent at job checkpoint and erestart.fluent at job restart.
    The command-line value for echkpnt_application overrides the value specified by LSB_ECHKPNT_METHOD in lsf.conf or as an environment variable. Users can override LSB_ECHKPNT_METHOD and use the default checkpoint and restart executables by defining method=default.
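The combinations of checkpoint directory, period, and method described above can be sketched as follows. This is an illustration under stated assumptions: my_job is a hypothetical job command, my_dir is a hypothetical checkpoint directory, and chkpnt_arg is a hypothetical helper that is not part of LSF. The script only prints the bsub commands so that the quoted -k argument can be inspected; it does not submit anything.

```shell
#!/bin/sh
# Hypothetical helper (not part of LSF): assemble the argument for bsub -k.
#   $1 = checkpoint directory
#   $2 = optional checkpoint period in minutes
#   $3 = optional echkpnt_application (or "default")
chkpnt_arg() {
    arg="$1"
    if [ -n "${2:-}" ]; then arg="$arg $2"; fi
    if [ -n "${3:-}" ]; then arg="$arg method=$3"; fi
    printf '%s\n' "$arg"
}

# Directory only: the job becomes checkpointable.
k1=$(chkpnt_arg my_dir)
# Directory plus a 30-minute checkpoint period.
k2=$(chkpnt_arg my_dir 30)
# Directory plus an application-specific method (echkpnt.fluent / erestart.fluent).
k3=$(chkpnt_arg my_dir "" fluent)
# Override LSB_ECHKPNT_METHOD and fall back to the default executables.
k4=$(chkpnt_arg my_dir 30 default)

# Print, rather than run, the resulting submission commands.
for k in "$k1" "$k2" "$k3" "$k4"; do
    echo "bsub -k \"$k\" my_job"
done
```

Printing the commands first makes it easy to confirm that the whole -k value is passed as a single quoted argument, which is how the syntax above is written.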
Feature: Job checkpoint and restart