Platform LSF Administration Guide Version 6.2

Chapter 25
Job Checkpoint, Restart, and Migration
Administering Platform LSF
Approaches to Checkpointing
LSF provides support for most checkpoint and restart implementations through
uniform interfaces, echkpnt and erestart. All interaction between LSF and the
checkpoint implementations is handled by these commands. See the echkpnt(8) and
erestart(8) man pages for more information.
Checkpoint and restart implementations are categorized based on the facility that
performs the checkpoint and the amount of knowledge an executable has of the
checkpoint. Commonly, checkpoint and restart implementations are grouped as
kernel-level, user-level, and application-level.
Kernel-level checkpointing
Kernel-level checkpointing is provided by the operating system and can be applied to
arbitrary jobs running on the system. This approach is transparent to the application:
no source code changes are required, and you do not need to re-link your application
with checkpoint libraries.
To support kernel-level checkpoint and restart, LSF provides echkpnt and erestart
executables that invoke OS-specific system calls.
Kernel-level checkpointing is currently supported on:
Cray UNICOS
IRIX 6.4 and later
NEC SX-4 and SX-5
See the chkpnt(1) man page on Cray systems and the cpr(1) man page on IRIX
systems for the limitations of their checkpoint implementations.
User-level checkpointing
On systems that do not support kernel-level checkpointing, LSF provides a method to
checkpoint jobs called user-level checkpointing. To implement user-level checkpointing,
you must have access to your application's object files (.o files), and they must be
re-linked with a set of libraries provided by LSF in LSF_LIBDIR. This approach is
transparent to your application: its code does not have to be changed, and the
application does not know that a checkpoint and restart has occurred.
Application-level checkpointing
The application-level approach applies to applications that are specially written
to accommodate checkpoint and restart. The application writer must also provide an
echkpnt and an erestart to interface with LSF. For more details, see the echkpnt(8)
and erestart(8) man pages. The application checkpoints itself either periodically or
in response to signals sent by other processes. When restarted, the application itself
must look for the checkpoint files and restore its state.
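
The behavior described above can be illustrated with a minimal sketch of an
application that checkpoints itself. It is not part of LSF and does not show the
echkpnt or erestart interface. The class name CheckpointedCounter, the JSON file
format, the helper install_handler, and the choice of SIGUSR1 as the checkpoint
signal are all assumptions made for illustration. A real application would use
whatever signal and file layout its own echkpnt and erestart agree on.

```python
import json
import os
import signal
import tempfile


class CheckpointedCounter:
    """Toy application (hypothetical): sums 0..limit-1 and checkpoints itself."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.next_i = 0          # next integer to add
        self.total = 0           # running sum so far
        self.checkpoint_requested = False
        self.restore()           # on (re)start, look for an existing checkpoint

    def restore(self):
        """Restore state from the checkpoint file; return False if none exists."""
        try:
            with open(self.checkpoint_path) as f:
                state = json.load(f)
        except FileNotFoundError:
            return False
        self.next_i = state["next_i"]
        self.total = state["total"]
        return True

    def checkpoint(self):
        """Write state to a temporary file, then rename it into place, so a
        crash in the middle of a write never leaves a partial checkpoint."""
        directory = os.path.dirname(self.checkpoint_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"next_i": self.next_i, "total": self.total}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.checkpoint_path)
        self.checkpoint_requested = False

    def request_checkpoint(self, signum=None, frame=None):
        """Signal handler: only set a flag; the main loop does the actual write."""
        self.checkpoint_requested = True

    def run(self, limit, every=0, stop_after=None):
        """Do the work. Checkpoint every `every` iterations (0 disables the
        periodic case) or when a checkpoint was requested. `stop_after` ends
        the run early and exists only to simulate an interrupted job."""
        steps = 0
        while self.next_i < limit:
            self.total += self.next_i
            self.next_i += 1
            steps += 1
            periodic = every and self.next_i % every == 0
            if periodic or self.checkpoint_requested:
                self.checkpoint()
            if stop_after is not None and steps >= stop_after:
                break
        return self.total


def install_handler(app):
    """Assumption: SIGUSR1 is the signal another process sends to ask for a
    checkpoint. The choice of signal is up to the application writer."""
    signal.signal(signal.SIGUSR1, app.request_checkpoint)
```

In this sketch, work done after the last checkpoint is repeated after a restart,
which is the usual trade-off of this approach: checkpointing more often costs more
I/O but loses less work. The signal handler only sets a flag so that the state is
always written from a consistent point in the main loop.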