LSF Version 7.3 - Administering Platform LSF
Job termination by LSF exit information
The job exit information available to POST_EXEC is defined in two parts:
◆ LSB_JOBEXIT_STAT: the raw wait3() output, which can be decoded using the
wait macros in /usr/include/sys/wait.h
◆ LSB_JOBEXIT_INFO: defined only if the job exit was due to a defined LSF
reason
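As a minimal sketch of what the wait macros do, the following shell function decodes a raw status value arithmetically. It assumes the traditional UNIX status layout (terminating signal in the low 7 bits, exit code in the next 8 bits), which is what the macros implement on Linux and most UNIX systems. The function name decode_wait_status is illustrative and is not part of LSF.

```shell
#!/bin/sh
# Illustrative helper (not an LSF command): decode a raw wait status.
# Assumes the traditional layout used by WIFEXITED/WEXITSTATUS/WTERMSIG.
decode_wait_status() {
    status=$1
    low7=$(( status & 127 ))            # terminating signal, if any
    low8=$(( status & 255 ))            # 127 here marks a stopped process
    high8=$(( (status >> 8) & 255 ))    # exit code (or stop signal)
    if [ "$low7" -eq 0 ]; then
        echo "exited $high8"
    elif [ "$low8" -eq 127 ]; then
        echo "stopped $high8"
    else
        echo "signaled $low7"
    fi
}

decode_wait_status 33280    # prints: exited 130
decode_wait_status 36608    # prints: exited 143
```

With this layout, an LSB_JOBEXIT_STAT of 33280 corresponds to an exit code of 130, and 36608 to an exit code of 143.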
Queue-level POST_EXEC commands should be written by the cluster
administrator to perform whatever task is necessary for specific exit situations.
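A queue-level POST_EXEC command could branch on the two variables along the following lines. This is a sketch only: the function name classify_exit and the category labels it prints are hypothetical, and the patterns assume LSB_JOBEXIT_INFO values of the form shown in the examples in this section (for example, SIGNAL -14 SIG_TERM_USER).

```shell
#!/bin/sh
# Hypothetical helper: map an LSB_JOBEXIT_INFO value to a coarse category.
# The labels on the right are made up for this sketch.
classify_exit() {
    case "$1" in
        "")                  echo "no-lsf-reason" ;;            # variable undefined or empty
        *SIG_TERM_USER*)     echo "killed-by-user" ;;
        *SIG_KILL_REQUEUE*)  echo "requeued" ;;
        *SIG_CHKPNT*)        echo "checkpoint-or-migration" ;;
        SIGNAL*)             echo "signaled" ;;                 # any other signal reason
        *)                   echo "other" ;;
    esac
}

# Inside POST_EXEC, both variables are read from the environment.
category=$(classify_exit "${LSB_JOBEXIT_INFO:-}")
echo "exit status=${LSB_JOBEXIT_STAT:-unknown} category=$category"
```

A real handler would replace the final echo with whatever action the site needs for each category, such as cleaning up scratch space or notifying the user.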
TIP: Limits enforced at the system level, such as the CPU and memory limits listed above, cannot
be reported in LSB_JOBEXIT_INFO because the operating system, not LSF, performs the action. Set
the appropriate parameters in the queue or at job submission so that LSF enforces the limits itself,
which makes this information available to LSF.
Common LSB_JOBEXIT_STAT and LSB_JOBEXIT_INFO values
The following table lists common scenarios that are covered and not covered by
LSB_JOBEXIT_INFO:
Example termination cause     LSB_JOBEXIT_STAT  LSB_JOBEXIT_INFO             Example bhist output
----------------------------  ----------------  ---------------------------  --------------------------------
Job killed with SIGINT        33280             SIGNAL 2 INT                 Fri Feb 14 16:48:00: Exited with
bkill -s INT 520                                                             exit code 130. The CPU time
                                                                             used is 0.2 seconds;

Job killed with SIGTERM       36608             SIGNAL 15 TERM               Fri Feb 14 16:49:50: Exited with
bkill -s TERM 521                                                            exit code 143. The CPU time
                                                                             used is 0.2 seconds;

Job killed with SIGKILL       33280             SIGNAL -14 SIG_TERM_USER     Fri Feb 14 16:51:03: Exited with
bkill -s KILL 522                                                            exit code 130. The CPU time
                                                                             used is 0.2 seconds;

Automatic migration when MIG  33280             SIGNAL -1 SIG_CHKPNT         Fri Feb 14 17:32:17: Job has
is defined at queue level                                                    been requeued;
                                                                             Fri Feb 14 17:32:17: Pending:
                                                                             Migrating job is waiting for
                                                                             rescheduling;

bsub -I "hostname;exit 130"   33280             Undefined                    Fri Feb 14 14:41:51: Exited with
                                                                             exit code 130. The CPU time
                                                                             used is 0.2 seconds;

Job killed with bkill         33280             SIGNAL -14 SIG_TERM_USER     Fri Feb 14 14:45:51: Exited with
bkill 210                                                                    exit code 130. The CPU time
                                                                             used is 0.2 seconds;

Job requeued with brequeue    33280             SIGNAL -23 SIG_KILL_REQUEUE  Fri Feb 14 14:48:15: Signal
brequeue -r                                                                  <REQUEUE_PEND> requested
Job <211> is being requeued                                                  by user or administrator
                                                                             <iayaz>;
                                                                             Fri Feb 14 14:48:18: Exited with
                                                                             exit code 130. The CPU time
                                                                             used is 0.2 seconds;