LSF Version 7.3 - Administering Platform LSF

Understanding Platform LSF Job Exit Information
A job can explicitly exit with an exit code greater than 128, which can be
confused with the corresponding UNIX signal. Make sure that applications you
write do not use exit codes greater than 128.
System signal exit values
When you send a signal that terminates the job, LSF reports either the signal or the
signal_value+128. If the return status is greater than 128, and the job was
terminated with a signal, then return_status-128=signal. For example, return status
133 means that the job was terminated with signal 5 (SIGTRAP on most systems,
133-128=5). A job with exit status 130 was terminated with signal 2 (SIGINT on
most systems, 130-128 = 2).
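The subtraction above can be sketched in a few lines of Python. This is a minimal illustration, not part of LSF; the function name signal_from_exit_status is hypothetical. Because a job can also exit explicitly with a code greater than 128, treat the result only as a candidate signal number.

```python
def signal_from_exit_status(status):
    """Return the signal number implied by a 128+N exit status, or None."""
    # Values of 128 or less are ordinary exit codes, not signal encodings.
    if status <= 128:
        return None
    # Assumption: the job really was signaled; an application that exits
    # explicitly with a code above 128 produces the same number.
    return status - 128


for status in (133, 130, 1):
    print(status, "->", signal_from_exit_status(status))
```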
Some operating systems define exit codes as 0-255. As a result, negative exit values
or values greater than 255 may wrap around within that range. The most common
example is a program that exits with -1, which LSF reports as "exit code 255".
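The wrap-around can be illustrated by keeping only the low 8 bits of the value. This sketch assumes a system that stores the exit code in a single byte (0-255), as described above; the helper name wrapped_exit_code is hypothetical.

```python
def wrapped_exit_code(value):
    """Map an arbitrary integer onto the 0-255 range by keeping the low 8 bits."""
    # Assumption: the operating system truncates the exit value to one byte.
    return value & 0xFF


for value in (-1, 255, 256, 300):
    print(value, "->", wrapped_exit_code(value))
```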
How or why the job may have been signaled, or exited with a certain exit code, can
be application and/or system specific. The application or system logs might be able
to give a better description of the problem.
TIP: Termination signals are operating system dependent, so signal 5 may not be SIGTRAP and 11
may not be SIGSEGV on all UNIX and Linux systems. You need to pay attention to the execution
host type in order to correctly translate the exit value if the job has been signaled.
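One way to check the mapping on a given host is to ask the operating system for its own signal names. The following sketch uses the Python standard signal module; the helper name signal_name is hypothetical. It reports the names for the host where it runs, so run it on the execution host type, not on the submission host.

```python
import signal


def signal_name(signum):
    """Return this host's name for signum, or a placeholder if unknown."""
    try:
        # signal.Signals is built from the signal numbers of the local platform.
        return signal.Signals(signum).name
    except ValueError:
        # The number is not a named signal on this platform.
        return f"unknown signal {signum}"


for signum in (2, 5, 11):
    print(signum, signal_name(signum))
```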
bhist and bjobs output
In most cases, bjobs and bhist show the application exit value (128 + signal). In
some cases, bjobs and bhist show the actual signal value.
If LSF sends catchable signals to the job, it displays the exit value. For example, if
you run bkill jobID to kill the job, LSF passes SIGINT, which causes the job to exit
with exit code 130 (SIGINT is 2 on most systems, 128+2 = 130).
If LSF sends uncatchable signals to the job, then the entire process group for the job
exits with the corresponding signal. For example, if you run
bkill -s SEGV jobID
to kill the job, bjobs and bhist show
Exited by signal 7
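Because bjobs and bhist can report either form, a script that reads their output has to handle both. The following is a minimal sketch that assumes the message wording is exactly "Exited by signal N" or "Exited with exit code N", as quoted in this section; other LSF versions may word these messages differently. The function name parse_exit_line is hypothetical.

```python
import re

# Assumption: these two patterns match the message forms quoted in this section.
EXIT_CODE_RE = re.compile(r"Exited with exit code (\d+)")
SIGNAL_RE = re.compile(r"Exited by signal (\d+)")


def parse_exit_line(line):
    """Return (kind, value, signal) for an exit message, or None if no match."""
    match = SIGNAL_RE.search(line)
    if match:
        # The actual signal value is reported directly.
        signum = int(match.group(1))
        return ("signal", signum, signum)
    match = EXIT_CODE_RE.search(line)
    if match:
        code = int(match.group(1))
        # Codes above 128 may encode a signal as 128 + N; otherwise no signal.
        return ("exit_code", code, code - 128 if code > 128 else None)
    return None


print(parse_exit_line("Exited by signal 7"))
print(parse_exit_line("Exited with exit code 130."))
```

The third element of the result is only a candidate signal number for the exit-code form, since an application can exit explicitly with a code greater than 128.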
Example
The following example shows a job that exited with exit code 139, which means that
the job was terminated with signal 11 (SIGSEGV on most systems, 139-128=11).
This typically indicates a segmentation fault in the application, often accompanied by a core dump.
bjobs -l 2012
Job <2012>, User , Project , Status , Queue , Command
Fri Dec 27 22:47:28: Submitted from host , CWD <$HOME>;
Fri Dec 27 22:47:37: Started on , Execution Home , Execution CWD ;
Fri Dec 27 22:48:02: Exited with exit code 139. The CPU time used is 0.2 seconds.
SCHEDULING PARAMETERS:
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
cpuspeed bandwidth