LSF Version 7.3 - Administering Platform LSF

Understanding Platform LSF Job Exit Information
Example output of bacct and bhist
Job termination by LSF exit information
LSF also provides additional exit information to the job's post-execution
command (POST_EXEC). Use this information to detect conditions where LSF has
terminated the job, and take the appropriate action.
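As a rough illustration of acting on this information, the sketch below shows how a post-execution script might summarize a job's exit. It assumes that LSF sets the environment variables LSB_JOBEXIT_STAT (the raw wait status, as a decimal string) and LSB_JOBEXIT_INFO (present only when LSF itself signalled the job) for the post-execution command. Neither variable is named in this section, so check them against your LSF version. The function name describe_exit is hypothetical.

```python
import os


def describe_exit(env):
    """Summarize how a job ended, from a mapping of environment variables."""
    # Assumption: LSB_JOBEXIT_INFO is set only when LSF terminated the job.
    info = env.get("LSB_JOBEXIT_INFO")
    # Assumption: LSB_JOBEXIT_STAT holds the raw wait status as decimal text.
    raw = env.get("LSB_JOBEXIT_STAT")

    summary = {"terminated_by_lsf": info is not None, "reason": info}
    if raw is not None:
        status = int(raw)
        # Traditional UNIX wait status layout:
        # low 7 bits = terminating signal, bits 8-15 = exit code.
        summary["signal"] = status & 0x7F
        summary["exit_code"] = (status >> 8) & 0xFF
    return summary


if __name__ == "__main__":
    print(describe_exit(os.environ))
```

A post-execution script could use the terminated_by_lsf flag to decide, for example, whether to clean up scratch files or notify the job owner.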
Each entry below lists an example termination cause, the termination reason in
bacct -l, and example bhist output.

Example termination cause: bkill -s KILL, bkill job_ID
  Termination reason in bacct -l: Completed <exit>; TERM_OWNER or TERM_ADMIN
  Example bhist output:
    Thu Mar 13 17:32:05: Signal <KILL> requested by user or administrator <user2>;
    Thu Mar 13 17:32:06: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause: bkill -r
  Termination reason in bacct -l: Completed <exit>; TERM_FORCE_ADMIN or
    TERM_FORCE_OWNER when sbatchd is not reachable. Otherwise, TERM_USER or
    TERM_ADMIN
  Example bhist output:
    Thu Mar 13 17:32:05: Signal <KILL> requested by user or administrator <user2>;
    Thu Mar 13 17:32:06: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause: TERMINATE_WHEN
  Termination reason in bacct -l: Completed <exit>;
    TERM_LOAD/TERM_WINDOWS/TERM_PREEMPT
  Example bhist output:
    Thu Mar 13 17:33:16: Signal <KILL> requested by user or administrator <user2>;
    Thu Mar 13 17:33:18: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause: Memory limit reached
  Termination reason in bacct -l: Completed <exit>; TERM_MEMLIMIT
  Example bhist output:
    Thu Mar 13 19:31:13: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause: Run limit reached
  Termination reason in bacct -l: Completed <exit>; TERM_RUNLIMIT
  Example bhist output:
    Thu Mar 13 20:18:32: Exited by signal 2. The CPU time used is 0.1 seconds.

Example termination cause: CPU limit
  Termination reason in bacct -l: Completed <exit>; TERM_CPULIMIT
  Example bhist output:
    Thu Mar 13 18:47:13: Exited by signal 24. The CPU time used is 62.0 seconds;

Example termination cause: Swap limit
  Termination reason in bacct -l: Completed <exit>; TERM_SWAPLIMIT
  Example bhist output:
    Thu Mar 13 18:47:13: Exited by signal 24. The CPU time used is 62.0 seconds;

Example termination cause: Regular job exits when host crashes
  Termination reason in bacct -l: Rusage 0, Completed <exit>; TERM_ZOMBIE
  Example bhist output:
    Thu Jun 12 15:49:02: Unknown; unable to reach the execution host;
    Thu Jun 12 16:10:32: Running;
    Thu Jun 12 16:10:38: Exited with exit code 143. The CPU time used is 0.0 seconds;

Example termination cause: brequeue -r
  Termination reason in bacct -l: For each requeue, Completed <exit>;
    TERM_REQUEUE_ADMIN or TERM_REQUEUE_OWNER
  Example bhist output:
    Thu Mar 13 17:46:39: Signal <REQUEUE_PEND> requested by user or administrator <user2>;
    Thu Mar 13 17:46:56: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause: bchkpnt -k
  Termination reason in bacct -l: On the first run: Completed <exit>; TERM_CHKPNT
  Example bhist output:
    Wed Apr 16 16:00:48: Checkpoint succeeded (actpid 931249);
    Wed Apr 16 16:01:03: Exited with exit code 137. The CPU time used is 0.0 seconds;

Example termination cause: kill -9 <RES> and job
  Termination reason in bacct -l: Completed <exit>; TERM_EXTERNAL_SIGNAL
  Example bhist output:
    Thu Mar 13 17:30:43: Exited by signal 15. The CPU time used is 0.1 seconds;

Example termination cause: Others
  Termination reason in bacct -l: Completed <exit>;
  Example bhist output:
    Thu Mar 13 17:30:43: Exited with 3; The CPU time used is 0.1 seconds;
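
The bhist exit lines in the examples above follow two recurring shapes, "Exited by signal N" and "Exited with exit code N", each followed by the CPU time. The sketch below is one possible way to pull those fields out of a single unwrapped bhist line. The regular expression is inferred only from the examples in this section, not from a documented bhist format, and the name parse_exit_line is hypothetical. The "128 + signal number" reading of exit codes above 128 is a common shell convention rather than something this section states.

```python
import re

# Pattern inferred from the example lines in this section (an assumption,
# not a documented bhist output format).
EXIT_RE = re.compile(
    r"Exited (?:by signal (?P<signal>\d+)|with exit code (?P<code>\d+))\."
    r" The CPU time used is (?P<cpu>[\d.]+) seconds"
)


def parse_exit_line(line):
    """Return exit details from one bhist line, or None if it does not match."""
    match = EXIT_RE.search(line)
    if match is None:
        return None

    result = {"cpu_seconds": float(match.group("cpu"))}
    if match.group("signal") is not None:
        result["signal"] = int(match.group("signal"))
    else:
        code = int(match.group("code"))
        result["exit_code"] = code
        # Common shell convention: codes above 128 mean 128 + signal number.
        if code > 128:
            result["implied_signal"] = code - 128
    return result


if __name__ == "__main__":
    sample = ("Thu Jun 12 16:10:38: Exited with exit code 143. "
              "The CPU time used is 0.0 seconds;")
    print(parse_exit_line(sample))
```

Under that convention, the exit codes 143 and 137 in the examples correspond to signals 15 and 9. Lines in other shapes, such as "Exited with 3;", do not match this pattern and return None.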