LSF Version 7.3 - Using Platform LSF HPC
LSF installs echkpnt.fluent and erestart.fluent, which are special versions of
echkpnt and erestart that allow checkpointing with FLUENT. Use bsub -a fluent
to make sure your job uses these files.
Checkpoint directories
When you submit a checkpointing job, you specify a checkpoint directory.
Before the job starts running, LSF sets the environment variable LSB_CHKPNT_DIR.
The value of LSB_CHKPNT_DIR is a subdirectory of the checkpoint directory
specified on the command line. This subdirectory is identified by the job ID and
contains only files related to the submitted job.
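As a rough illustration of this layout, the following shell sketch builds the path that LSB_CHKPNT_DIR would hold for one job. The checkpoint directory /tmp/fluent_chkpnt and the job ID 1234 are made-up values, and the exact path form (checkpoint_dir/job_ID) is an assumption based on the description above. In a real job, LSF sets the variable for you.

```shell
#!/bin/sh
# Hypothetical values, for illustration only.
checkpoint_dir="/tmp/fluent_chkpnt"   # directory that would be given to bsub -k
job_id="1234"                         # job ID that LSF would assign

# Assumed layout: one subdirectory per job, named after the job ID.
job_subdir="${checkpoint_dir}/${job_id}"
echo "LSB_CHKPNT_DIR would be: ${job_subdir}"
```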
Checkpoint trigger files
When you checkpoint a FLUENT job, LSF creates a checkpoint trigger file (check) in
the job subdirectory, which causes FLUENT to checkpoint and continue running. A
special option is used to create a different trigger file (exit) to cause FLUENT to
checkpoint and exit the job.
FLUENT uses the LSB_CHKPNT_DIR environment variable to determine the
location of checkpoint trigger files. It checks the job subdirectory periodically while
running the job. FLUENT does not perform any checkpointing unless it finds the LSF
trigger file in the job subdirectory. FLUENT removes the trigger file after checkpointing
the job.
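The trigger-file handshake described above can be sketched in shell. This is a simulation under stated assumptions, not FLUENT's actual implementation: a temporary directory stands in for the job subdirectory that LSB_CHKPNT_DIR points to, the poll_once function is a hypothetical stand-in for FLUENT's periodic check, and the touch commands play the role of LSF creating the check and exit trigger files.

```shell
#!/bin/sh
job_subdir=$(mktemp -d)   # stand-in for the directory in $LSB_CHKPNT_DIR
log=""                    # records what the simulated application did

# One polling pass, as the application side might perform it (hypothetical).
poll_once() {
    if [ -f "$job_subdir/exit" ]; then
        log="${log}checkpoint-and-exit;"
        rm -f "$job_subdir/exit"    # trigger file is removed after checkpointing
        return 1                    # tell the caller to stop running
    elif [ -f "$job_subdir/check" ]; then
        log="${log}checkpoint-and-continue;"
        rm -f "$job_subdir/check"   # trigger file is removed after checkpointing
    fi
    return 0                        # keep running
}

poll_once                   # no trigger file present: no checkpoint is taken
touch "$job_subdir/check"   # LSF side: request a checkpoint
poll_once
touch "$job_subdir/exit"    # LSF side: request a checkpoint followed by exit
poll_once || echo "job exits after checkpoint"
echo "$log"
rm -rf "$job_subdir"        # clean up the temporary directory
```

The first pass records nothing because no trigger file exists, which mirrors the rule that no checkpoint is taken without an LSF trigger file.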
Restarting jobs
If a job is restarted, LSF attempts to restart the job with the -restart option
appended to the original FLUENT command. FLUENT uses the checkpointed data
and case files to restart the process from that checkpoint, rather than repeating the entire
process.
Each time a job is restarted, it is assigned a new job ID, and a new job subdirectory is
created in the checkpoint directory. Files in the checkpoint directory are never deleted
by LSF, but you may choose to remove old files once the FLUENT job is finished and
the job history is no longer required.
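The following shell sketch illustrates both points: how the restart command relates to the original command, and how job subdirectories accumulate until you remove them yourself. The FLUENT command line, the job IDs 1234 and 1250, and the temporary directory are made-up values for illustration. Only the -lsf and -restart options come from the description above.

```shell
#!/bin/sh
checkpoint_dir=$(mktemp -d)     # stand-in for the directory given to bsub -k
mkdir "$checkpoint_dir/1234"    # subdirectory of the original job (made-up ID)
mkdir "$checkpoint_dir/1250"    # subdirectory created for the restarted job (made-up ID)

# Hypothetical original FLUENT command; the restart appends -restart to it.
original_cmd="fluent 3d -g -i run.jou -lsf"
restart_cmd="$original_cmd -restart"
echo "restart command: $restart_cmd"

# LSF does not delete old subdirectories. Once the job is finished and its
# history is no longer needed, they can be removed by hand.
rm -rf "$checkpoint_dir/1234"
remaining=$(ls "$checkpoint_dir")
echo "remaining subdirectories: $remaining"
rm -rf "$checkpoint_dir"        # clean up the temporary directory
```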
Submitting FLUENT jobs
Use bsub to submit the job, including parameters required for checkpointing.
The syntax for the bsub command to submit a FLUENT job is:
bsub [-R fluent] -a fluent
[-k checkpoint_dir | -k "checkpoint_dir [checkpoint_period]"]
[bsub options] FLUENT command [FLUENT options] -lsf
-R fluent
Optional. Specify the fluent shared resource if the FLUENT application is
installed only on certain hosts in the cluster.
-a fluent
Use the esub for FLUENT jobs. It automatically sets the checkpoint method to
fluent so that the job uses the checkpoint and restart programs for FLUENT jobs,
echkpnt.fluent and erestart.fluent.
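As an illustration of the syntax, the following shell sketch assembles a submission command and prints it without submitting anything. The checkpoint directory, the checkpoint period, and the FLUENT command line are hypothetical values. Check the bsub -k documentation for the units of the period, and substitute the FLUENT options your site uses.

```shell
#!/bin/sh
# Dry run: build the command string and print it; nothing is submitted.
checkpoint_dir="/share/fluent_chkpnt"   # hypothetical shared directory
checkpoint_period=60                    # hypothetical period
fluent_cmd="fluent 3d -g -i run.jou"    # hypothetical FLUENT command line

submit_cmd="bsub -R fluent -a fluent -k \"$checkpoint_dir $checkpoint_period\" $fluent_cmd -lsf"
echo "$submit_cmd"
```

The quotes around the directory and period keep them together as the single argument to -k, matching the second form shown in the syntax.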
The checkpointing feature for FLUENT jobs requires all of the following parameters: