LSF Version 7.3 - Administering Platform LSF

LSF Job Termination Reason Logging
Understanding LSF job exit codes
LSF monitors a job while it runs and reports the exit code returned by the job
itself. On UNIX platforms, LSF collects this exit code through the wait3()
system call. The exit code that LSF reports is derived from these system exit
values. Use bhist to see the exit code for your job.

Example output of bacct and bhist
Example termination cause:       bkill -s KILL
                                 bkill job_ID
Termination reason in bacct -l:  Completed <exit>; TERM_OWNER or TERM_ADMIN
Example bhist output:
  Thu Mar 13 17:32:05: Signal <KILL> requested by user or administrator <user2>;
  Thu Mar 13 17:32:06: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause:       bkill -r
Termination reason in bacct -l:  Completed <exit>; TERM_FORCE_ADMIN or
                                 TERM_FORCE_OWNER when sbatchd is not reachable.
                                 Otherwise, TERM_USER or TERM_ADMIN
Example bhist output:
  Thu Mar 13 17:32:05: Signal <KILL> requested by user or administrator <user2>;
  Thu Mar 13 17:32:06: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause:       TERMINATE_WHEN
Termination reason in bacct -l:  Completed <exit>; TERM_LOAD/TERM_WINDOWS/TERM_PREEMPT
Example bhist output:
  Thu Mar 13 17:33:16: Signal <KILL> requested by user or administrator <user2>;
  Thu Mar 13 17:33:18: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause:       Memory limit reached
Termination reason in bacct -l:  Completed <exit>; TERM_MEMLIMIT
Example bhist output:
  Thu Mar 13 19:31:13: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause:       Run limit reached
Termination reason in bacct -l:  Completed <exit>; TERM_RUNLIMIT
Example bhist output:
  Thu Mar 13 20:18:32: Exited by signal 2. The CPU time used is 0.1 seconds.

Example termination cause:       CPU limit
Termination reason in bacct -l:  Completed <exit>; TERM_CPULIMIT
Example bhist output:
  Thu Mar 13 18:47:13: Exited by signal 24. The CPU time used is 62.0 seconds;

Example termination cause:       Swap limit
Termination reason in bacct -l:  Completed <exit>; TERM_SWAPLIMIT
Example bhist output:
  Thu Mar 13 18:47:13: Exited by signal 24. The CPU time used is 62.0 seconds;

Example termination cause:       Regular job exits when host crashes
Termination reason in bacct -l:  Rusage 0, Completed <exit>; TERM_ZOMBIE
Example bhist output:
  Thu Jun 12 15:49:02: Unknown; unable to reach the execution host;
  Thu Jun 12 16:10:32: Running;
  Thu Jun 12 16:10:38: Exited with exit code 143. The CPU time used is 0.0 seconds;

Example termination cause:       brequeue -r
Termination reason in bacct -l:  For each requeue, Completed <exit>;
                                 TERM_REQUEUE_ADMIN or TERM_REQUEUE_OWNER
Example bhist output:
  Thu Mar 13 17:46:39: Signal <REQUEUE_PEND> requested by user or administrator <user2>;
  Thu Mar 13 17:46:56: Exited by signal 2. The CPU time used is 0.1 seconds;

Example termination cause:       bchkpnt -k
Termination reason in bacct -l:  On the first run: Completed <exit>; TERM_CHKPNT
Example bhist output:
  Wed Apr 16 16:00:48: Checkpoint succeeded (actpid 931249);
  Wed Apr 16 16:01:03: Exited with exit code 137. The CPU time used is 0.0 seconds;

Example termination cause:       kill -9 <RES> and job
Termination reason in bacct -l:  Completed <exit>; TERM_EXTERNAL_SIGNAL
Example bhist output:
  Thu Mar 13 17:30:43: Exited by signal 15. The CPU time used is 0.1 seconds;

Example termination cause:       Others
Termination reason in bacct -l:  Completed <exit>;
Example bhist output:
  Thu Mar 13 17:30:43: Exited with 3; The CPU time used is 0.1 seconds;
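
The bhist examples above report a job exit in one of two forms: "Exited by
signal N" or "Exited with exit code N". The following Python sketch shows one
way to classify such an event and interpret the number. It is illustrative
only and is not part of LSF: the function names (parse_exit_event,
implied_signal, describe) are hypothetical, and the treatment of exit codes
above 128 as 128 plus a signal number is an assumption based on the common
UNIX shell convention, which this section does not state. Signal names come
from the machine running the script and may differ from those on the
execution host.

```python
import re
import signal

# The two exit forms that appear in the example bhist output.
_BY_SIGNAL = re.compile(r"Exited by signal (\d+)")
_WITH_CODE = re.compile(r"Exited with (?:exit code )?(\d+)")


def parse_exit_event(text):
    """Classify one bhist event.

    Returns ("signal", n), ("exit_code", n), or None when the event
    does not describe a job exit. Wrapped output is joined first.
    """
    text = " ".join(text.split())
    match = _BY_SIGNAL.search(text)
    if match:
        return ("signal", int(match.group(1)))
    match = _WITH_CODE.search(text)
    if match:
        return ("exit_code", int(match.group(1)))
    return None


def implied_signal(exit_code):
    """Return exit_code - 128 when the code is above 128, else None.

    ASSUMPTION: codes above 128 encode 128 + signal number.
    """
    return exit_code - 128 if exit_code > 128 else None


def signal_name(number):
    """Look up the local platform's name for a signal number."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return "unknown"


def describe(event):
    """Turn a parsed event into a short human-readable string."""
    kind, value = event
    if kind == "signal":
        return f"terminated by signal {value} ({signal_name(value)})"
    sig = implied_signal(value)
    if sig is not None:
        return f"exit code {value}, assumed signal {sig} ({signal_name(sig)})"
    return f"exit code {value} returned by the application"


if __name__ == "__main__":
    samples = [
        "Thu Mar 13 17:32:06: Exited by signal 2. The CPU time used is 0.1 seconds;",
        "Thu Jun 12 16:10:38: Exited with exit code 143. The CPU time used is 0.0 seconds;",
        "Thu Mar 13 17:30:43: Exited with 3; The CPU time used is 0.1 seconds;",
    ]
    for sample in samples:
        print(describe(parse_exit_event(sample)))
```

Under that assumption, the exit codes 143 and 137 in the examples correspond
to signals 15 and 9, while the exit code 3 in the "Others" example is treated
as a value returned by the application itself.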