Platform LSF Administration Guide Version 6.2
Chapter 43
Error and Event Logging
Administering Platform LSF
Notes
◆ If a queue-level JOB_CONTROL is configured, LSF cannot determine the result of the action. The termination reason reported only reflects what LSF expects the termination reason to be.
◆ LSF is not guaranteed to catch external signals sent directly to the job.
◆ In MultiCluster, a brequeue request sent from the submission cluster is translated to TERM_OWNER or TERM_ADMIN in the remote execution cluster. The termination reason in both the email notification sent from the execution cluster and in lsb.acct is set to TERM_OWNER or TERM_ADMIN.
Understanding LSF job exit codes
LSF monitors a job while it is running and reports the exit code returned by the job itself. On UNIX platforms, LSF collects this exit code through the wait3() system call. The exit code is derived from the system exit values. Use bhist to see the exit code for your job.
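As a rough illustration of how a wait3()-style status distinguishes a normal exit from termination by a signal, the following Python sketch decodes a raw POSIX wait status. The function names describe_wait_status and run_and_describe are hypothetical, and the sketch uses Python's os.wait4 (a per-process relative of wait3()); it is not a description of how LSF itself is implemented.

```python
import os


def describe_wait_status(status: int) -> str:
    """Describe a raw POSIX wait status in wording similar to bhist output."""
    if os.WIFSIGNALED(status):
        # The process was killed by a signal; there is no exit code.
        return f"exited by signal {os.WTERMSIG(status)}"
    if os.WIFEXITED(status):
        # The process called exit() or returned from main().
        return f"exited with exit code {os.WEXITSTATUS(status)}"
    return "stopped or continued; not a final status"


def run_and_describe(argv: list[str]) -> str:
    """Fork, exec argv, wait for the child, and describe how it ended (UNIX only)."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; 127 is the usual "command not found" code.
            os._exit(127)
    # wait4 returns (pid, status, rusage), like wait3 but for one specific child.
    _pid, status, _rusage = os.wait4(pid, 0)
    return describe_wait_status(status)
```

For example, run_and_describe(["sh", "-c", "exit 3"]) would be expected to report an exit code, while a child killed with SIGTERM would be reported as exited by a signal.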
Termination cause       Termination reason in bacct -l  Example bhist output

TERMINATE_WHEN          Completed <exit>;               Thu Mar 13 17:33:16: Signal <KILL> requested by
                        TERM_LOAD/TERM_WINDOWS/         user or administrator <user2>;
                        TERM_PREEMPT                    Thu Mar 13 17:33:18: Exited by signal 2.
                                                        The CPU time used is 0.1 seconds;

Memory limit reached    Completed <exit>;               Thu Mar 13 19:31:13: Exited by signal 2.
                        TERM_MEMLIMIT                   The CPU time used is 0.1 seconds;

Run limit reached       Completed <exit>;               Thu Mar 13 20:18:32: Exited by signal 2.
                        TERM_RUNLIMIT                   The CPU time used is 0.1 seconds.

CPU limit               Completed <exit>;               Thu Mar 13 18:47:13: Exited by signal 24.
                        TERM_CPULIMIT                   The CPU time used is 62.0 seconds;

Swap limit              Completed <exit>;               Thu Mar 13 18:47:13: Exited by signal 24.
                        TERM_SWAPLIMIT                  The CPU time used is 62.0 seconds;

Regular job exits when  Rusage 0,                       Thu Jun 12 15:49:02: Unknown; unable to reach
host crashes            Completed <exit>;               the execution host;
                        TERM_ZOMBIE                     Thu Jun 12 16:10:32: Running;
                                                        Thu Jun 12 16:10:38: Exited with exit code 143.
                                                        The CPU time used is 0.0 seconds;

brequeue -r             For each requeue,               Thu Mar 13 17:46:39: Signal <REQUEUE_PEND>
                        Completed <exit>;               requested by user or administrator <user2>;
                        TERM_REQUEUE_ADMIN or           Thu Mar 13 17:46:56: Exited by signal 2.
                        TERM_REQUEUE_OWNER              The CPU time used is 0.1 seconds;

bchkpnt -k              On the first run:               Wed Apr 16 16:00:48: Checkpoint succeeded
                        Completed <exit>;               (actpid 931249);
                        TERM_CHKPNT                     Wed Apr 16 16:01:03: Exited with exit code 137.
                                                        The CPU time used is 0.0 seconds;

Kill -9 <RES> and job   Completed <exit>;               Thu Mar 13 17:30:43: Exited by signal 15.
                        TERM_EXTERNAL_SIGNAL            The CPU time used is 0.1 seconds;

Job terminated          Completed <exit>;               Thu Mar 13 17:30:43: Exited with 123;
abnormally in SLURM     TERM_SLURM

Others                  Completed <exit>;               Thu Mar 13 17:30:43: Exited with 3;
                                                        The CPU time used is 0.1 seconds;
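The example bhist lines in the table follow a few recurring phrasings ("Exited by signal N", "Exited with exit code N", "Exited with N"). The sketch below shows one possible way to pull the number out of such a line in Python. The names parse_bhist_exit, implied_signal, and signal_name are hypothetical helpers, the regular expression is based only on the sample lines shown here, and the "exit code above 128 means 128 plus a signal number" rule is the common UNIX shell convention, assumed rather than taken from this table.

```python
import re
import signal

# Matches the three phrasings seen in the sample bhist output above.
# \s+ tolerates line wrapping inside the phrase.
_EXIT_RE = re.compile(
    r"Exited\s+(?:by\s+signal\s+(?P<sig>\d+)"
    r"|with\s+(?:exit\s+code\s+)?(?P<code>\d+))"
)


def parse_bhist_exit(text: str):
    """Return ("signal", n) or ("exit_code", n), or None if no exit is reported."""
    match = _EXIT_RE.search(text)
    if match is None:
        return None
    if match.group("sig") is not None:
        return ("signal", int(match.group("sig")))
    return ("exit_code", int(match.group("code")))


def implied_signal(exit_code: int):
    """Assumed shell convention: an exit code above 128 encodes 128 + signal."""
    return exit_code - 128 if exit_code > 128 else None


def signal_name(signum: int) -> str:
    """Look up the local platform's name for a signal number, if it has one."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return f"signal {signum}"
```

Under that assumed convention, the exit code 143 in the host-crash row corresponds to signal 15 and the exit code 137 in the bchkpnt -k row corresponds to signal 9. Signal names for other numbers vary by platform, so signal_name reports whatever the local system defines.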