LSF Version 7.3 - Administering Platform LSF

Job Checkpoint, Restart, and Migration
Requirements
To allow restart of a checkpointed job on a different host than the host on which the
job originally ran, both the original and the new hosts must:
- Be binary compatible
- Run the same dot version of the operating system for predictable results
- Have network connectivity and read/execute permissions to the checkpoint
  and restart executables (in LSF_SERVERDIR by default)
- Have network connectivity and read/write permissions to the checkpoint
  directory and the checkpoint file
- Have access to all files open during job execution so that LSF can locate them
  using an absolute path name
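The permission requirements above can be spot-checked on each host before relying on cross-host restart. The following is a minimal sketch, not part of LSF: the function name check_ckpt_access is hypothetical, and it assumes the checkpoint and restart executables are named echkpnt and erestart. It does not verify binary compatibility or operating system versions.

```shell
#!/bin/sh
# Hypothetical helper: report whether this host can reach the checkpoint
# executables and the checkpoint directory with the required permissions.
# Arguments: $1 = directory holding the executables (LSF_SERVERDIR by default)
#            $2 = checkpoint directory used by the job
# Assumption: the executables are named echkpnt and erestart.
check_ckpt_access() {
    serverdir=$1
    ckptdir=$2
    status=0
    for exe in echkpnt erestart; do
        # Each executable must be readable and executable from this host.
        if [ ! -r "$serverdir/$exe" ] || [ ! -x "$serverdir/$exe" ]; then
            echo "missing read/execute permission: $serverdir/$exe"
            status=1
        fi
    done
    # The checkpoint directory must exist and be readable and writable.
    if [ ! -d "$ckptdir" ] || [ ! -r "$ckptdir" ] || [ ! -w "$ckptdir" ]; then
        echo "missing read/write permission: $ckptdir"
        status=1
    fi
    return $status
}
```

Run the function on both the original and the new host with the same arguments; a nonzero return status on either host indicates that at least one of the permission requirements is not met there.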
Job migration
Job migration is the process of moving a checkpointable or rerunnable job from one
host to another. This facilitates load balancing by moving jobs from a
heavily-loaded host to a lightly-loaded host.
You can initiate job migration manually on demand (bmig) or automatically. To
initiate job migration automatically, configure a migration threshold at job
submission, or at the host, queue, or application profile level.
NOTE: For a detailed description of the job migration feature and how to configure it, see the
Platform LSF Configuration Reference.
Manual job migration
The bmig command migrates checkpointable or rerunnable jobs on demand. Jobs
can be manually migrated by the job owner, queue administrator, and LSF
administrator.
For example, to migrate a job with job ID 123 to the first available host:
bmig 123
Job <123> is being migrated
Automatic job migration
Automatic job migration assumes that if a job is system-suspended (SSUSP) for an
extended period of time, the execution host is probably heavily loaded. Specifying
a migration threshold at job submission (bsub -mig) or configuring an application
profile-level, queue-level, or host-level migration threshold lets LSF move such a
job to another host, which allows the job to progress and reduces the load on the
original host. You can use bmig at any time to override a configured migration
threshold, or bmod -mig to change a job-level migration threshold.
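The job-level commands mentioned above might be used as in the following sketch. It assumes that thresholds are given in minutes and that the job is made rerunnable with bsub -r. The job command ./myjob, the job ID 123, the run wrapper, and the DRY_RUN variable are illustrative and not part of LSF. By default the wrapper only prints each command, so the sketch can be tried on a host without LSF.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: print the command, and execute it only
# when DRY_RUN=0 is set in the environment.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Submit a rerunnable job (-r) with a job-level migration threshold
# of 30 minutes (assumed unit).
run bsub -r -mig 30 ./myjob

# Change the job-level migration threshold of job 123 to 45 minutes.
run bmod -mig 45 123

# Migrate job 123 on demand, without waiting for any configured threshold.
run bmig 123
```

Set DRY_RUN=0 to execute the commands against a real cluster after replacing the job command and job ID with your own.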
For example, at the queue level, in lsb.queues:
Begin Queue
...
MIG=30 # Migration threshold set to 30 mins
DESCRIPTION=Migrate suspended jobs after 30 mins
...
End Queue