User’s guide

D2.1.4 IST-033576
B Berkeley Lab Checkpoint/Restart (BLCR) User’s Guide
B.1 About Berkeley Lab Checkpoint/Restart
Checkpoint/Restart allows you to save one or more processes to a file and later
restart them from that file. There are three main uses for this:
Scheduling: Checkpointing allows a program to be safely stopped at any
point in its execution, so that some other program can run in its place.
The original program can then be resumed later from where it stopped.
Process Migration: If a compute node appears likely to crash, or there is
some other reason for shutting it down (routine maintenance, hardware
upgrade, etc.), checkpoint/restart allows any processes running on it
to be moved to a different node (or saved until the original node is available
again).
Failure recovery: A long-running program can be checkpointed periodically,
so that if it crashes due to hardware, system software, or some other
non-deterministic cause, it can be restarted from a point in its execution
more recent than the beginning.
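The failure-recovery use above can be sketched as a small driver script. This is an illustrative sketch, not part of BLCR: the helper names (checkpoint_command, restart_command, checkpoint_periodically) are hypothetical. It assumes the BLCR command-line tools cr_checkpoint and cr_restart are installed, that the target process was started in a way that permits checkpointing (for example under cr_run), and that cr_checkpoint writes its context file under the default name context.PID in the current directory.

```python
import subprocess
import time
from typing import Callable, List


def checkpoint_command(pid: int) -> List[str]:
    """Build the command line that checkpoints one process by PID."""
    # Assumption: with no further options, cr_checkpoint writes a context
    # file named context.<pid> and lets the process keep running.
    return ["cr_checkpoint", str(pid)]


def restart_command(pid: int) -> List[str]:
    """Build the command line that restarts from the default context file."""
    return ["cr_restart", f"context.{pid}"]


def checkpoint_periodically(
    pid: int,
    interval_s: float,
    count: int,
    run: Callable = subprocess.run,
    sleep: Callable = time.sleep,
) -> int:
    """Take up to `count` checkpoints of `pid`, one every `interval_s` seconds.

    Stops early if a checkpoint command reports a non-zero exit status
    (for instance because the process has already exited). Returns the
    number of checkpoints that succeeded.
    """
    taken = 0
    for _ in range(count):
        sleep(interval_s)
        result = run(checkpoint_command(pid))
        if result.returncode != 0:
            break
        taken += 1
    return taken
```

With these assumptions, checkpoint_periodically(pid, 3600, 24) would checkpoint a job once an hour for a day. After a crash, running the command built by restart_command(pid) would resume the job from the most recent checkpoint, so at most about one interval of work is lost. The run and sleep parameters exist only so the loop can be exercised without BLCR installed.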
Berkeley Lab Checkpoint/Restart (BLCR) provides checkpoint/restart on Linux
systems. BLCR can be used either with processes on a single computer, or with
parallel jobs (such as MPI applications) that may be running across multiple
machines in a cluster of Linux nodes.
Note: Checkpointing parallel jobs requires a library which has integrated
BLCR support. At the present time, the only MPI implementations which support
checkpoint/restart with BLCR are LAM/MPI (version 7.x), MVAPICH2 (version
0.9.8 or newer), and MPICH-V (version 1.0.0 or newer). The development branch
of Open MPI also includes support, intended for inclusion in its 1.3 release.
Work is underway to add support to other MPI implementations, so consult
your MPI’s documentation for the latest information.
B.2 Checkpoint/restarting within a BLCR-aware batch control system
One way to use BLCR is with a batch scheduler system (a.k.a. "job controller",
"queue manager", etc.) that knows how to use the BLCR tools to checkpoint and
restart the jobs under its control. You can simply tell such a system to "suspend"
or "checkpoint" a job, and then later to "resume" or "restart" it.