LSF Version 7.3 - Platform LSF Configuration Reference
Example termination cause: bchkpnt -k
  Termination reason in bacct -l: On the first run: Completed <exit>; TERM_CHKPNT
  Example bhist output:
    Wed Apr 16 16:00:48: Checkpoint succeeded (actpid 931249);
    Wed Apr 16 16:01:03: Exited with exit code 137. The CPU time used is 0.0 seconds;

Example termination cause: kill -9 <RES> and job
  Termination reason in bacct -l: Completed <exit>; TERM_EXTERNAL_SIGNAL
  Example bhist output:
    Thu Mar 13 17:30:43: Exited by signal 15. The CPU time used is 0.1 seconds;

Example termination cause: Others
  Termination reason in bacct -l: Completed <exit>;
  Example bhist output:
    Thu Mar 13 17:30:43: Exited with 3; The CPU time used is 0.1 seconds;
Job termination by LSF exit information
LSF also provides additional information in the POST_EXEC of the job. Use this information
to detect conditions where LSF has terminated the job and take the appropriate action.
The job exit information in the POST_EXEC is defined in two parts:
• LSB_JOBEXIT_STAT: the raw wait3() output, which can be decoded with the wait macros in /usr/include/sys/wait.h
• LSB_JOBEXIT_INFO: defined only if the job exit was due to a defined LSF reason.
Queue-level POST_EXEC commands should be written by the cluster administrator to
perform whatever task is necessary for specific exit situations.
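As a rough illustration of the two variables described above, a POST_EXEC helper might decode LSB_JOBEXIT_STAT with Python's os.WIFEXITED, os.WEXITSTATUS, os.WIFSIGNALED, and os.WTERMSIG, which mirror the C wait macros on Unix systems. This is a minimal sketch, not part of LSF: the function names decode_wait_status and read_job_exit are hypothetical, the value is assumed to be a decimal integer, and treating exit codes above 128 as "128 + signal number" is the common shell convention rather than something LSF guarantees.

```python
import os


def decode_wait_status(raw_status):
    """Decode a raw wait status the way the C wait macros would (Unix only)."""
    if os.WIFSIGNALED(raw_status):
        # The process itself was terminated by a signal.
        return {"kind": "signaled", "signal": os.WTERMSIG(raw_status)}
    if os.WIFEXITED(raw_status):
        code = os.WEXITSTATUS(raw_status)
        result = {"kind": "exited", "exit_code": code}
        if code > 128:
            # Shell convention (an assumption here): 128 + signal number.
            result["shell_signal"] = code - 128
        return result
    return {"kind": "other"}


def read_job_exit(environ=os.environ):
    """Return (decoded status or None, LSB_JOBEXIT_INFO or None).

    Hypothetical helper: both variables are read from the environment
    that the POST_EXEC command runs in.
    """
    raw = environ.get("LSB_JOBEXIT_STAT")
    info = environ.get("LSB_JOBEXIT_INFO")  # unset unless LSF had a defined reason
    status = decode_wait_status(int(raw)) if raw is not None else None
    return status, info
```

A queue-level POST_EXEC script could call read_job_exit() and branch on whether the second value is set, since LSB_JOBEXIT_INFO is defined only when the exit was due to a defined LSF reason.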
Note:
System-level enforced limits such as CPU and memory (listed above)
cannot be shown in LSB_JOBEXIT_INFO, because the operating system
performs the action, not LSF. Set the appropriate parameters in the
queue or at job submission so that LSF enforces the limits, which
makes this information available to LSF.
Common LSB_JOBEXIT_STAT and LSB_JOBEXIT_INFO values
The following table lists common scenarios that are covered and not covered by
LSB_JOBEXIT_INFO:
Example termination cause: Job killed with SIGINT (bkill -s INT 520)
  LSB_JOBEXIT_STAT: 33280
  LSB_JOBEXIT_INFO: SIGNAL 2 INT
  Example bhist output:
    Fri Feb 14 16:48:00: Exited with exit code 130. The CPU time used is 0.2 seconds;

Example termination cause: Job killed with SIGTERM (bkill -s TERM 521)
  LSB_JOBEXIT_STAT: 36608
  LSB_JOBEXIT_INFO: SIGNAL 15 TERM
  Example bhist output:
    Fri Feb 14 16:49:50: Exited with exit code 143. The CPU time used is 0.2 seconds;

Example termination cause: Job killed with SIGKILL (bkill -s KILL 522)
  LSB_JOBEXIT_STAT: 33280
  LSB_JOBEXIT_INFO: SIGNAL -14 SIG_TERM_USER
  Example bhist output:
    Fri Feb 14 16:51:03: Exited with exit code 130. The CPU time used is 0.2 seconds;
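The LSB_JOBEXIT_INFO values in the table above all appear as three whitespace-separated fields (a keyword, a number, and a name). A POST_EXEC helper could split them along those lines. This is only a sketch under that assumption: the three-field layout is inferred from the example rows, not from a documented format, and JobExitInfo and parse_jobexit_info are hypothetical names.

```python
from typing import NamedTuple, Optional


class JobExitInfo(NamedTuple):
    keyword: str   # e.g. "SIGNAL"
    number: int    # e.g. 2, 15, or -14 in the example rows
    name: str      # e.g. "INT", "TERM", "SIG_TERM_USER"


def parse_jobexit_info(value: Optional[str]) -> Optional[JobExitInfo]:
    """Split an LSB_JOBEXIT_INFO string into its three fields.

    Returns None when the variable is unset or empty, or when it does
    not match the assumed "KEYWORD NUMBER NAME" layout.
    """
    if not value:
        return None
    parts = value.split()
    if len(parts) != 3:
        return None
    keyword, number, name = parts
    try:
        return JobExitInfo(keyword, int(number), name)
    except ValueError:
        # The middle field was not an integer; treat as unrecognized.
        return None
```

Returning None for anything unexpected keeps the caller's logic simple: a POST_EXEC script can fall back to LSB_JOBEXIT_STAT alone whenever the parsed value is missing.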