Platform LSF Administration Guide Version 6.2
Creating Custom echkpnt and erestart for Application-level Checkpointing
Different applications may implement checkpointing differently and may therefore require their own echkpnt and erestart programs. You can write custom echkpnt and erestart programs to checkpoint your specific applications and tell LSF which program to use for each application.
◆ “Writing custom echkpnt and erestart programs”
◆ “Configuring LSF to recognize the custom echkpnt and erestart”
Writing custom echkpnt and erestart programs
Programming language
You can write your own echkpnt and erestart interfaces in C or Fortran.
Name
Name the programs echkpnt.method_name and erestart.method_name, where method_name identifies the application that the programs checkpoint and restart.
For example, if your custom echkpnt is for my_app, you would have echkpnt.my_app and erestart.my_app.
Location
Place echkpnt.method_name and erestart.method_name in LSF_SERVERDIR. You can specify a different directory with LSB_ECHKPNT_METHOD_DIR, either as an environment variable or in lsf.conf.
The combination of method name (LSB_ECHKPNT_METHOD, set in lsf.conf or as an environment variable) and location (LSB_ECHKPNT_METHOD_DIR) must be unique in the cluster. For example, you may have two echkpnt programs with the same name, such as echkpnt.mymethod, as long as they reside in different directories defined with LSB_ECHKPNT_METHOD_DIR.
The checkpoint method directory should be accessible to all users who need to run the custom echkpnt and erestart programs.
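The naming and location rules above can be illustrated with a short C helper. This is a minimal sketch, not LSF's own lookup logic: the function name checkpoint_program_path is hypothetical, it consults only environment variables (it does not read lsf.conf), and it assumes LSF_SERVERDIR has been exported into the environment.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build "<dir>/<prefix>.<method_name>", where prefix is
 * "echkpnt" or "erestart". The directory is LSB_ECHKPNT_METHOD_DIR if it is
 * set in the environment, otherwise LSF_SERVERDIR (assumed to be exported).
 * Returns 0 on success, -1 if no directory is known, the method name
 * contains a '/', or the result does not fit in buf.
 */
int checkpoint_program_path(const char *prefix, const char *method_name,
                            char *buf, size_t buflen)
{
    const char *dir = getenv("LSB_ECHKPNT_METHOD_DIR");
    int n;

    /* A method name is a file-name suffix, so it must not contain a slash. */
    if (strchr(method_name, '/') != NULL)
        return -1;

    if (dir == NULL || dir[0] == '\0')
        dir = getenv("LSF_SERVERDIR");
    if (dir == NULL || dir[0] == '\0')
        return -1;

    n = snprintf(buf, buflen, "%s/%s.%s", dir, prefix, method_name);
    if (n < 0 || (size_t)n >= buflen)
        return -1;
    return 0;
}
```

With LSB_ECHKPNT_METHOD_DIR unset and LSF_SERVERDIR set to a directory such as /opt/lsf/etc (an example path, not an LSF default), calling the helper with "echkpnt" and "my_app" yields /opt/lsf/etc/echkpnt.my_app.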
Supported syntax for echkpnt
Your echkpnt.method_name must recognize commands in the following syntax, as these are the options echkpnt uses to communicate with your echkpnt.method_name:
echkpnt [-c] [-f] [-k | -s] [-d checkpoint_dir] [-x] process_group_ID
For more details on echkpnt syntax, see the echkpnt(8) man page.
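As an illustration of accepting this syntax, the following C sketch parses the command line with the POSIX getopt function. The struct echkpnt_args type and the parse_echkpnt_args function are hypothetical names introduced here; the sketch only records which options were given and does not perform any checkpointing. The meaning of each option is described in the echkpnt(8) man page.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical container for the parsed echkpnt.method_name command line. */
struct echkpnt_args {
    int c_flag, f_flag, k_flag, s_flag, x_flag;
    const char *checkpoint_dir;   /* argument of -d, or NULL if not given */
    long process_group_id;        /* the single positional argument */
};

/* Returns 0 if the command line matches the syntax, -1 otherwise. */
int parse_echkpnt_args(int argc, char **argv, struct echkpnt_args *out)
{
    int opt;
    char *end;

    memset(out, 0, sizeof *out);
    out->checkpoint_dir = NULL;
    optind = 1;   /* restart scanning at the first argument */
    opterr = 0;   /* report errors through the return value only */

    while ((opt = getopt(argc, argv, "cfksd:x")) != -1) {
        switch (opt) {
        case 'c': out->c_flag = 1; break;
        case 'f': out->f_flag = 1; break;
        case 'k': out->k_flag = 1; break;
        case 's': out->s_flag = 1; break;
        case 'x': out->x_flag = 1; break;
        case 'd': out->checkpoint_dir = optarg; break;
        default:  return -1;      /* unknown option or missing -d argument */
        }
    }

    /* The synopsis shows -k and -s as alternatives. */
    if (out->k_flag && out->s_flag)
        return -1;

    /* Exactly one positional argument must remain: process_group_ID. */
    if (optind != argc - 1)
        return -1;
    out->process_group_id = strtol(argv[optind], &end, 10);
    if (end == argv[optind] || *end != '\0')
        return -1;
    return 0;
}
```

A real echkpnt.method_name would call a function like this from main and then carry out the application-specific checkpoint according to the recorded options.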
Supported syntax for erestart
Your erestart.method_name must recognize commands in the following syntax, as these are the options erestart uses to communicate with your erestart.method_name:
erestart [-c] [-f] checkpoint_dir
For more details, see the erestart(8) man page.
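The erestart syntax can be handled in the same way. The sketch below is a minimal example under the same assumptions as any getopt-based parser: struct erestart_args and parse_erestart_args are hypothetical names, and the code only validates and records the arguments without restarting anything.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>
#include <unistd.h>

/* Hypothetical container for the parsed erestart.method_name command line. */
struct erestart_args {
    int c_flag;
    int f_flag;
    const char *checkpoint_dir;   /* the single positional argument */
};

/* Returns 0 if the command line matches the syntax, -1 otherwise. */
int parse_erestart_args(int argc, char **argv, struct erestart_args *out)
{
    int opt;

    memset(out, 0, sizeof *out);
    out->checkpoint_dir = NULL;
    optind = 1;   /* restart scanning at the first argument */
    opterr = 0;   /* report errors through the return value only */

    while ((opt = getopt(argc, argv, "cf")) != -1) {
        switch (opt) {
        case 'c': out->c_flag = 1; break;
        case 'f': out->f_flag = 1; break;
        default:  return -1;      /* option not in the supported syntax */
        }
    }

    /* Exactly one positional argument must remain: checkpoint_dir. */
    if (optind != argc - 1)
        return -1;
    out->checkpoint_dir = argv[optind];
    return 0;
}
```

For example, the argument list -f /tmp/ckpt is accepted with f_flag set, while a command line with no checkpoint directory is rejected.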
Return values for echkpnt.method_name
If echkpnt.method_name successfully checkpoints the job, it exits with 0. A non-zero exit value indicates that the job checkpoint failed.
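To make the exit-value convention concrete, the following C sketch shows how a calling process could run a program and treat exit value 0 as success and anything else as failure. It is an illustration only and does not describe how LSF itself invokes echkpnt.method_name; the function name run_and_check is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] with the given NULL-terminated argument
 * vector and wait for it to finish.
 * Returns 1 if the program exited with 0 (success), 0 if it exited with a
 * non-zero value or terminated abnormally, and -1 if it could not be
 * started or waited for.
 */
int run_and_check(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);               /* exec failed: report a non-zero value */
    }

    /* Retry if the wait is interrupted by a signal. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

A custom echkpnt.method_name fits this convention by returning 0 from main only after the checkpoint has completed, and a non-zero value on any failure path.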