LSF Version 7.3 - Using Platform LSF HPC
Configuring application profiles for the blaunch framework
You can configure an application profile in lsb.applications to control the
behavior of a parallel or distributed application when a remote task exits. Specify a value
for RTASK_GONE_ACTION in the application profile to define what LSF does
when a remote task exits.
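By default, LSF does nothing when a remote task exits with a zero or non-zero
value, and shuts down the entire job when a remote task crashes.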
RTASK_GONE_ACTION has the following syntax:
RTASK_GONE_ACTION="[KILLJOB_TASKDONE | KILLJOB_TASKEXIT] [IGNORE_TASKCRASH]"
Where:
◆ IGNORE_TASKCRASH
  If a remote task crashes, LSF does nothing. The job continues to launch the next task.
◆ KILLJOB_TASKDONE
  If a remote task exits with a zero value, LSF terminates all tasks in the job.
◆ KILLJOB_TASKEXIT
  If a remote task exits with a non-zero value, LSF terminates all tasks in the job.
For example:
RTASK_GONE_ACTION="IGNORE_TASKCRASH KILLJOB_TASKEXIT"
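As a rough sketch of where this setting lives, the fragment below shows an application profile in lsb.applications that carries the example value. The profile name djob_app is hypothetical and chosen only for illustration. A real profile usually contains other parameters as well.

```
# lsb.applications (sketch; the profile name "djob_app" is an assumption)
Begin Application
NAME = djob_app
# Ignore crashed remote tasks, but kill the job if a task exits non-zero
RTASK_GONE_ACTION = "IGNORE_TASKCRASH KILLJOB_TASKEXIT"
End Application
```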
RTASK_GONE_ACTION only applies to the blaunch distributed application framework.
When defined in an application profile, the LSB_DJOB_RTASK_GONE_ACTION
environment variable is set when running bsub -app for the specified application.
You can also use the environment variable LSB_DJOB_RTASK_GONE_ACTION to
override the value set in the application profile.
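The override described above might look like the following at submission time. This is a minimal sketch: the profile name djob_app, the slot count, and the task command ./my_task are all hypothetical, and the script only prints the command when bsub is not available on the machine.

```shell
#!/bin/sh
# Hypothetical application profile name (assumption, not from the manual)
APP_PROFILE="djob_app"

# Override the RTASK_GONE_ACTION value from the profile for jobs
# submitted from this shell
LSB_DJOB_RTASK_GONE_ACTION="KILLJOB_TASKEXIT"
export LSB_DJOB_RTASK_GONE_ACTION

# Submit a blaunch job only if LSF is installed; ./my_task is a placeholder
if command -v bsub >/dev/null 2>&1; then
    bsub -app "$APP_PROFILE" -n 4 blaunch ./my_task
else
    echo "bsub not found; would run: bsub -app $APP_PROFILE -n 4 blaunch ./my_task"
fi
```

Exporting the variable affects every later submission from the same shell, so unset it once the override is no longer wanted.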
By default, LSF shuts down the entire job if the connection to the task RES is lost,
or if a validation timeout or heartbeat timeout occurs. You can configure an
application profile in lsb.applications so that only the current tasks are shut
down, not the entire job.
Use DJOB_COMMFAIL_ACTION="KILL_TASKS" to define the behavior of LSF
when it detects a communication failure between itself and one or more tasks. If
DJOB_COMMFAIL_ACTION is not defined, LSF terminates all tasks and shuts down the
entire job. If it is set to KILL_TASKS, LSF tries to kill all the current tasks of
the parallel or distributed job that are associated with the communication failure.
DJOB_COMMFAIL_ACTION only applies to the blaunch distributed application framework.
When defined in an application profile, the LSB_DJOB_COMMFAIL_ACTION
environment variable is set when running bsub -app for the specified application.
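For illustration, an application profile that limits the response to a communication failure could be sketched as follows. The profile name djob_app is again hypothetical.

```
# lsb.applications (sketch; the profile name "djob_app" is an assumption)
Begin Application
NAME = djob_app
# On a communication failure, kill only the affected current tasks
# instead of shutting down the entire job
DJOB_COMMFAIL_ACTION = "KILL_TASKS"
End Application
```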
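Default behavior when a remote task exits (RTASK_GONE_ACTION not defined):
Remote task event                 LSF action
Task exits with zero value        Does nothing
Task exits with non-zero value    Does nothing
Task crashes                      Shuts down the entire job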