Platform LSF Administration Guide Version 6.2

Migrating Jobs
Migration is the process of moving a checkpointable or rerunnable job from one host
to another.
Checkpointing lets a migrated job keep the progress it has already made, because the
job restarts from its last checkpoint. Rerunnable jobs that are not checkpointable
restart from the beginning. You can migrate jobs manually from the command line, or
have LSF migrate them automatically through configuration. When a job is migrated,
LSF performs the following actions:
1. Stops the job if it is running
2. Checkpoints the job if it is checkpointable
3. Kills the job on the current host
4. Restarts or reruns the job on the next available host, bypassing all pending jobs
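The sequence above can be sketched as a small Python model. This is only an illustration of the ordering and the conditions on each step, not LSF code: the function name, its parameters, and the action labels are hypothetical, and the choice to prefer a checkpoint restart over a rerun when a job is both checkpointable and rerunnable is an assumption.

```python
def migration_actions(running, checkpointable, rerunnable=False):
    """Return the ordered actions for migrating one job (illustrative model only)."""
    if not (checkpointable or rerunnable):
        # Only checkpointable or rerunnable jobs can be migrated.
        raise ValueError("only checkpointable or rerunnable jobs can be migrated")
    actions = []
    if running:
        actions.append("stop")        # step 1: only if the job is running
    if checkpointable:
        actions.append("checkpoint")  # step 2: only if the job is checkpointable
    actions.append("kill")            # step 3: always, on the current host
    # Step 4: assumed here that a checkpoint, when one exists, is preferred to a rerun.
    if checkpointable:
        actions.append("restart from checkpoint")
    else:
        actions.append("rerun from beginning")
    return actions


print(migration_actions(running=True, checkpointable=True))
print(migration_actions(running=True, checkpointable=False, rerunnable=True))
```

The first call models a running checkpointable job, which goes through all four steps. The second models a running job that is rerunnable only, which skips the checkpoint step and starts over on the new host.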
Requirements
To migrate a checkpointable job to another host, both hosts must:
- Be binary compatible
- Run the same dot version of the operating system. Unpredictable results may occur
  if both hosts are not running the exact same OS version.
- Have access to the executable
- Have access to all open files (LSF must locate them with an absolute path name)
- Have access to the checkpoint file
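As a rough aid, the requirements above can be written as a checklist function. Everything in this sketch is hypothetical: HostInfo is not an LSF data structure, binary compatibility is reduced to comparing a made-up label, and file access is modeled as a set of paths you have already verified by other means. It shows how the conditions combine, not how LSF evaluates them.

```python
import posixpath
from dataclasses import dataclass, field


@dataclass
class HostInfo:
    """Hypothetical description of a host, filled in by the administrator."""
    name: str
    binary_type: str   # illustrative label, e.g. "linux-x86_64"
    os_version: str    # full OS version string, including the dot version
    readable_paths: set = field(default_factory=set)


def unmet_requirements(src, dst, executable, open_files, checkpoint_file):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if src.binary_type != dst.binary_type:
        problems.append("hosts are not binary compatible")
    if src.os_version != dst.os_version:
        problems.append("hosts run different OS versions")
    for path in open_files:
        # Open files must be locatable with an absolute path name.
        if not posixpath.isabs(path):
            problems.append(f"open file is not an absolute path: {path}")
    for host in (src, dst):
        for path in [executable, checkpoint_file, *open_files]:
            if path not in host.readable_paths:
                problems.append(f"{host.name} cannot access {path}")
    return problems


# Example values are invented for illustration.
shared = {"/apps/my_job", "/data/input.dat", "/chkpnt/123"}
host_b = HostInfo("hostB", "linux-x86_64", "2.6.9", set(shared))
host_c = HostInfo("hostC", "linux-x86_64", "2.6.18", set(shared))
print(unmet_requirements(host_b, host_c, "/apps/my_job",
                         ["/data/input.dat"], "/chkpnt/123"))
```

With the invented values shown, the only problem reported is the OS version mismatch between the two hosts.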
Manually migrating jobs
Use the bmig command to manually migrate jobs. Any checkpointable or rerunnable
job can be migrated. A job can be migrated manually by its owner, the queue
administrator, or the LSF administrator. For example, to migrate a job with job ID 123:
% bmig 123
Job <123> is being migrated
% bhist -l 123
Job Id <123>, User <user1>, Command <my_job>
Tue Feb 29 16:50:27: Submitted from host <hostA> to Queue <default>,
                     CWD <$HOME/tmp>, Checkpoint directory <chkpnt_dir/123>;
Tue Feb 29 16:50:28: Started on <hostB>, Pid <4705>;
Tue Feb 29 16:53:42: Migration requested;
Tue Feb 29 16:54:03: Migration checkpoint initiated (actpid 4746);
Tue Feb 29 16:54:15: Migration checkpoint succeeded (actpid 4746);
Tue Feb 29 16:54:15: Pending: Migrating job is waiting for reschedule;
Tue Feb 29 16:55:16: Started on <hostC>, Pid <10354>.
Summary of time in seconds spent in various states by Tue Feb 29 16:57:26
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  62       0        357      0        0        0        419
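The summary line can be cross-checked against the event timestamps. The sketch below transcribes the state changes from the bhist -l output above and adds up the seconds spent in each state. It assumes that all events fall on the same day and that the job still counts as running between the migration request and the successful checkpoint; the function and variable names are illustrative.

```python
from datetime import datetime

# (time of day, state entered), transcribed from the bhist -l output above
EVENTS = [
    ("16:50:27", "PEND"),  # submitted
    ("16:50:28", "RUN"),   # started on hostB
    ("16:54:15", "PEND"),  # checkpoint succeeded, waiting for reschedule
    ("16:55:16", "RUN"),   # started on hostC
]
SUMMARY_TIME = "16:57:26"  # time at which the summary was taken


def seconds_by_state(events, end):
    """Sum the seconds spent in each state, assuming same-day timestamps."""
    fmt = "%H:%M:%S"
    totals = {}
    # Each state lasts until the next event, or until the summary time.
    boundaries = events[1:] + [(end, None)]
    for (start, state), (stop, _) in zip(events, boundaries):
        delta = datetime.strptime(stop, fmt) - datetime.strptime(start, fmt)
        totals[state] = totals.get(state, 0) + int(delta.total_seconds())
    return totals


totals = seconds_by_state(EVENTS, SUMMARY_TIME)
print(totals)                # {'PEND': 62, 'RUN': 357}
print(sum(totals.values()))  # 419
```

The 62 seconds of pending time consist of 1 second before the first start and 61 seconds waiting to be rescheduled after the checkpoint, which matches the PEND, RUN, and TOTAL columns in the summary.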