User’s guide

D2.1.4 IST-033576

B Berkeley Lab Checkpoint/Restart (BLCR) User’s Guide

B.1 About Berkeley Lab Checkpoint/Restart

Checkpoint/Restart allows you to save one or more processes to a file and later
restart them from that file. There are three main uses for this:

• Scheduling: Checkpointing allows a program to be safely stopped at any point
  in its execution, so that some other program can run in its place. The
  original program can then be run again later.

• Process Migration: If a compute node appears likely to crash, or there is
  some other reason for shutting it down (routine maintenance, hardware
  upgrade, etc.), checkpoint/restart allows any processes running on it to be
  moved to a different node (or saved until the original node is available
  again).

• Failure recovery: A long-running program can be checkpointed periodically,
  so that if it crashes due to hardware, system software, or some other
  non-deterministic cause, it can be restarted from a point in its execution
  more recent than the beginning.
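As a rough illustration of the failure-recovery use, the sketch below takes a fixed number of checkpoints of a running process at regular intervals by invoking the BLCR command-line tools. It rests on several assumptions that are not stated in this section: that `cr_checkpoint` and `cr_restart` are on the PATH, that `cr_checkpoint` accepts `-f FILE` and `--term` as described in its manual page (check the version you have installed), and that the target program was started in a checkpointable way. The function names and the `context.<pid>.<n>` file naming are hypothetical and chosen only for this example.

```python
import subprocess
import time


def checkpoint_command(pid, context_file, terminate=False):
    """Build the argument list for one cr_checkpoint call.

    Assumption: -f names the context file and --term stops the process
    after the checkpoint; confirm both against your cr_checkpoint man page.
    """
    argv = ["cr_checkpoint", "-f", context_file]
    if terminate:
        argv.append("--term")
    argv.append(str(pid))
    return argv


def restart_command(context_file):
    """Build the argument list to restart from a saved context file."""
    return ["cr_restart", context_file]


def checkpoint_periodically(pid, interval_s, count,
                            run=subprocess.check_call, sleep=time.sleep):
    """Checkpoint `pid` `count` times, waiting `interval_s` seconds before each.

    `run` and `sleep` are injectable so the loop can be exercised without
    BLCR installed. Returns the context files written, newest last.
    """
    written = []
    for i in range(count):
        sleep(interval_s)
        # Hypothetical naming scheme: one file per checkpoint.
        context_file = "context.%d.%d" % (pid, i)
        run(checkpoint_command(pid, context_file))
        written.append(context_file)
    return written
```

After a crash, the most recent entry in the returned list would be the file to pass to `restart_command`. Calling `checkpoint_command` with `terminate=True` corresponds to the scheduling use, where the program is stopped once its state has been saved.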

Berkeley Lab Checkpoint/Restart (BLCR) provides checkpoint/restart on Linux
systems. BLCR can be used either with processes on a single computer, or with
parallel jobs (such as MPI applications) that may be running across multiple
machines in a cluster of Linux nodes.

Note: Checkpointing parallel jobs requires a library with integrated BLCR
support. At the present time, the only MPI implementations that support
checkpoint/restart with BLCR are LAM/MPI (version 7.x), MVAPICH2 (version
0.9.8 or newer), and MPICH-V (version 1.0.0 or newer). The development branch
of Open MPI also includes support, intended for inclusion in its 1.3 release.
Work is underway to add support to other MPI implementations, so consult your
MPI’s documentation for the latest information.

B.2 Checkpoint/restarting within a BLCR-aware batch control system

One way to use BLCR is with a batch scheduler system (a.k.a. "job controller",
"queue manager", etc.) that knows how to use the BLCR tools to checkpoint and
restart the jobs under its control. You can simply tell such a system to
"suspend" or "checkpoint" a job, and then later to "resume" or "restart" it.

39/49 XtreemOS–Integrated Project