LSF Version 7.3 - Administering Platform LSF
Configuring Pre- and Post-Execution Commands
Rerunnable jobs may rerun after they have actually finished because the host
became unavailable before post-execution processing finished, but the mbatchd
considers the job still in RUN state.
Job preemption is delayed until post-execution processing is finished.
Post-execution on SGI cpusets
Post-execution processing on SGI cpusets behaves differently from previous
releases. If JOB_INCLUDE_POSTPROC=Y is specified in lsb.applications or
cluster wide in lsb.params, post-execution processing is not attached to the job
cpuset, and Platform LSF does not release the cpuset until post-execution
processing has finished.
Preventing job overlap on hosts
You can use JOB_INCLUDE_POSTPROC to ensure that there is no execution
overlap among running jobs. For example, you may have pre-execution processing
that creates a user execution environment at the desktop (mounting a disc for the
user, creating rlogin permissions, and so on). You then configure post-execution
processing to clean up the user execution environment set up by the pre-exec.
If the post-execution for one job is still running when a second job is dispatched,
pre-execution processing that sets up the user environment for the next job may not
be able to run correctly because the previous job’s environment has not yet been
cleaned up by its post-exec.
You should configure jobs to run exclusively to prevent the actual jobs from
overlapping, but in this case you also need to configure post-execution processing
to be included in job finish status reporting (JOB_INCLUDE_POSTPROC=Y).
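The setup described above might be configured along the following lines. This is
a minimal sketch rather than a tested configuration: the queue name, the
application profile name, and the script paths are hypothetical, and it assumes
that exclusive execution is enabled at the queue level with EXCLUSIVE=Y.

```
# lsb.queues (queue name is hypothetical)
Begin Queue
QUEUE_NAME = desktop
EXCLUSIVE  = Y
End Queue

# lsb.applications (profile name and script paths are hypothetical)
Begin Application
NAME                 = desktop_env
PRE_EXEC             = /path/to/setup_env.sh
POST_EXEC            = /path/to/cleanup_env.sh
JOB_INCLUDE_POSTPROC = Y
End Application
```

Under these assumptions, a job submitted with something like
bsub -x -q desktop -app desktop_env myjob would not be reported as finished
until cleanup_env.sh has completed, so the next job's pre-exec starts from a
clean environment.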
Setting a post-execution timeout
Configure JOB_POSTPROC_TIMEOUT in an application profile in
lsb.applications or cluster wide in lsb.params to control how long
post-execution processing is allowed to run.
JOB_POSTPROC_TIMEOUT specifies a timeout in minutes for job
post-execution processing. If post-execution processing takes longer than the
timeout, sbatchd reports that post-execution has failed (POST_ERR status) and
kills the process group of the job’s post-execution processes.
The specified timeout must be greater than zero.
If JOB_INCLUDE_POSTPROC is enabled in the application profile or cluster wide
in lsb.params, and sbatchd kills the post-execution processes because the timeout
has been reached, the CPU time of the post-execution processing is set to 0, and the
job CPU time will not include the CPU time of the post-execution processing.
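As an illustration, a cluster-wide timeout could be set as in the sketch below.
The value of 10 minutes is an arbitrary example, not a recommended setting.

```
# lsb.params
Begin Parameters
JOB_INCLUDE_POSTPROC = Y
JOB_POSTPROC_TIMEOUT = 10    # minutes; must be greater than zero
End Parameters
```

With these example values, post-execution processing that runs longer than
10 minutes would be killed by sbatchd and reported with POST_ERR status, and its
CPU time would be recorded as 0.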
Controlling how many times pre-execution commands are retried
By default, if job pre-execution fails, LSF retries the job automatically.
Configure MAX_PREEXEC_RETRY to limit the number of times LSF retries job
pre-execution. Pre-execution retry can be configured cluster-wide (lsb.params),
at the queue level (lsb.queues), and at the application level
(lsb.applications). MAX_PREEXEC_RETRY in lsb.applications overrides
lsb.queues, and lsb.queues overrides lsb.params configuration.
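The override order can be illustrated with the following sketch, in which the
queue name, the application profile name, and all retry values are hypothetical
and chosen only to make the precedence visible.

```
# lsb.params: cluster-wide setting
Begin Parameters
MAX_PREEXEC_RETRY = 5
End Parameters

# lsb.queues: overrides lsb.params for jobs in this queue
Begin Queue
QUEUE_NAME        = normal
MAX_PREEXEC_RETRY = 3
End Queue

# lsb.applications: overrides lsb.queues for jobs using this profile
Begin Application
NAME              = myapp
MAX_PREEXEC_RETRY = 1
End Application
```

With these example values, a job submitted to the normal queue with the myapp
application profile would use the application-level setting of 1. A job in the
normal queue without that profile would use the queue-level setting of 3, and a
job in a queue that does not set MAX_PREEXEC_RETRY would fall back to the
cluster-wide setting of 5.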