Platform LSF Administration Guide Version 6.2

Chapter 43
Error and Event Logging
Notes
If a queue-level JOB_CONTROL is configured, LSF cannot determine the result
of the configured action. The termination reason therefore reflects only what
LSF expects the reason to be, not the actual outcome of the action.
LSF is not guaranteed to catch external signals sent directly to the job.
In MultiCluster, a brequeue request sent from the submission cluster is translated
to TERM_OWNER or TERM_ADMIN in the remote execution cluster. The
termination reason in the email notification sent from the execution cluster, as well
as that in lsb.acct, is set to TERM_OWNER or TERM_ADMIN.
Understanding LSF job exit codes
LSF monitors a job while it is running and returns the exit code returned by the
job itself. On UNIX platforms, LSF collects this exit code through the wait3()
system call. The exit code is derived from the exit values reported by the
operating system. Use bhist to see the exit code for your job.
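
The following Python sketch illustrates how a wait status collected with wait3() separates a normal exit code from termination by a signal. It is not LSF code: the helper names describe_wait_status and run_and_reap are hypothetical, and the phrases it prints only imitate the wording seen in bhist output. It assumes a UNIX platform.

```python
import os


def describe_wait_status(status: int) -> str:
    """Turn a raw wait status into a bhist-like phrase (wording is illustrative)."""
    if os.WIFSIGNALED(status):
        # The process was killed by a signal; report the signal number.
        return f"Exited by signal {os.WTERMSIG(status)}"
    if os.WIFEXITED(status):
        # The process called exit(); report its exit code.
        return f"Exited with exit code {os.WEXITSTATUS(status)}"
    return "Unknown"


def run_and_reap(argv):
    """Fork, exec argv in the child, and reap it with wait3()."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; never return into the parent's code.
            os._exit(127)
    while True:
        # wait3() returns (pid, status, rusage) for any child that has ended.
        reaped, status, rusage = os.wait3(0)
        if reaped == pid:
            return status, rusage


if __name__ == "__main__":
    status, rusage = run_and_reap(["sh", "-c", "exit 3"])
    print(describe_wait_status(status))
    cpu_seconds = rusage.ru_utime + rusage.ru_stime
    print(f"The CPU time used is {cpu_seconds:.1f} seconds")
```

The same call returns both the wait status and the resource usage of the child, so one wait3() call is enough to produce both an exit description and a CPU time figure.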
Termination cause       Termination reason in     Example bhist output
                        bacct -l

TERMINATE_WHEN          Completed <exit>;         Thu Mar 13 17:33:16: Signal <KILL>
                        TERM_LOAD/TERM_WINDOWS/   requested by user or administrator <user2>;
                        TERM_PREEMPT              Thu Mar 13 17:33:18: Exited by signal 2.
                                                  The CPU time used is 0.1 seconds;

Memory limit reached    Completed <exit>;         Thu Mar 13 19:31:13: Exited by signal 2.
                        TERM_MEMLIMIT             The CPU time used is 0.1 seconds;

Run limit reached       Completed <exit>;         Thu Mar 13 20:18:32: Exited by signal 2.
                        TERM_RUNLIMIT             The CPU time used is 0.1 seconds.

CPU limit               Completed <exit>;         Thu Mar 13 18:47:13: Exited by signal 24.
                        TERM_CPULIMIT             The CPU time used is 62.0 seconds;

Swap limit              Completed <exit>;         Thu Mar 13 18:47:13: Exited by signal 24.
                        TERM_SWAPLIMIT            The CPU time used is 62.0 seconds;

Regular job exits when  Rusage 0,                 Thu Jun 12 15:49:02: Unknown; unable to
host crashes            Completed <exit>;         reach the execution host;
                        TERM_ZOMBIE               Thu Jun 12 16:10:32: Running;
                                                  Thu Jun 12 16:10:38: Exited with exit code 143.
                                                  The CPU time used is 0.0 seconds;

brequeue -r             For each requeue,         Thu Mar 13 17:46:39: Signal <REQUEUE_PEND>
                        Completed <exit>;         requested by user or administrator <user2>;
                        TERM_REQUEUE_ADMIN or     Thu Mar 13 17:46:56: Exited by signal 2.
                        TERM_REQUEUE_OWNER        The CPU time used is 0.1 seconds;

bchkpnt -k              On the first run:         Wed Apr 16 16:00:48: Checkpoint succeeded
                        Completed <exit>;         (actpid 931249);
                        TERM_CHKPNT               Wed Apr 16 16:01:03: Exited with exit code 137.
                                                  The CPU time used is 0.0 seconds;

Kill -9 <RES> and job   Completed <exit>;         Thu Mar 13 17:30:43: Exited by signal 15.
                        TERM_EXTERNAL_SIGNAL      The CPU time used is 0.1 seconds;

Job terminated          Completed <exit>;         Thu Mar 13 17:30:43: Exited with 123;
abnormally in SLURM     TERM_SLURM

Others                  Completed <exit>;         Thu Mar 13 17:30:43: Exited with 3;
                                                  The CPU time used is 0.1 seconds;
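
When reading many bhist lines like the examples in this table, it can help to pull out the exit code or signal number mechanically. The Python sketch below is one way to do that; parse_exit_line and ExitInfo are hypothetical names, and the pattern covers only the phrasings that appear in this table. The rule that an exit code above 128 suggests termination by signal (code minus 128) is a common shell convention assumed here for illustration, not something this section states.

```python
import re
from typing import NamedTuple, Optional


class ExitInfo(NamedTuple):
    kind: str               # "signal" or "exit_code"
    value: int              # the number printed in the bhist line
    signal: Optional[int]   # signal reported, or inferred from the code, else None


# Matches "Exited by signal N", "Exited with exit code N", and "Exited with N".
_EXIT_RE = re.compile(r"Exited (?:by signal (\d+)|with(?: exit code)? (\d+))")


def parse_exit_line(text: str) -> Optional[ExitInfo]:
    """Extract the exit code or signal from a bhist-style exit message."""
    # Collapse line wraps so "signal\n24." is read as "signal 24."
    flat = " ".join(text.split())
    match = _EXIT_RE.search(flat)
    if match is None:
        return None
    if match.group(1) is not None:
        sig = int(match.group(1))
        return ExitInfo("signal", sig, sig)
    code = int(match.group(2))
    # Assumption: shell convention, exit codes above 128 mean 128 + signal.
    inferred = code - 128 if code > 128 else None
    return ExitInfo("exit_code", code, inferred)


if __name__ == "__main__":
    line = "Thu Jun 12 16:10:38: Exited with exit code 143. The CPU time used is 0.0 seconds;"
    print(parse_exit_line(line))
```

Under that assumption, exit code 143 in the TERM_ZOMBIE row maps to signal 15 and exit code 137 in the TERM_CHKPNT row maps to signal 9. Treat the inferred signal as a hint only, since a job can also return such a value on its own.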