LSF Version 7.3 - Platform LSF Configuration Reference
that were running when the systems went down are assumed to have exited, and email is sent
to the submitting user. Pending jobs remain in their queues, and are scheduled as hosts become
available.
Exited jobs
A job might terminate abnormally for various reasons. Job termination can happen from any
state. An abnormally terminated job goes into EXIT state. The situations where a job
terminates abnormally include:
•
The job is cancelled by its owner or the LSF administrator while pending, or after being
dispatched to a host.
•
The job cannot be dispatched before it reaches its termination deadline, and is therefore
aborted by LSF.
•
The job fails to start successfully. For example, the user specifies the wrong executable
when submitting the job.
•
The job exits with a non-zero exit status.
You can configure hosts so that LSF detects an abnormally high rate of job exit from a host.
See Administering Platform LSF for more information.
Application and system exit values
LSF monitors a job while it runs and reports the exit code returned by the job itself. On UNIX
platforms, LSF collects this exit code through the wait3() system call. The exit code is a result
of the system exit values. Use bhist or bjobs to see the exit code for your job.
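As an illustration of how a parent process can collect a child's status with wait3(), here is a minimal Python sketch. It is not LSF's implementation; the function name run_and_collect is hypothetical, and the sketch assumes a UNIX system where os.fork() and os.wait3() are available.

```python
import os
import sys


def run_and_collect(argv):
    """Fork, exec argv in the child, and decode the status from wait3()."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process with the requested program.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Only reached if exec failed; 127 is the shell convention for that.
            os._exit(127)

    # Parent: wait3() reaps any child, so loop until it is the one we started.
    while True:
        reaped_pid, status, _rusage = os.wait3(0)
        if reaped_pid == pid:
            break

    if os.WIFEXITED(status):
        return ("exited", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        return ("signaled", os.WTERMSIG(status))
    return ("other", status)


if __name__ == "__main__":
    # Hypothetical job: a child Python process that exits with code 3.
    child = [sys.executable, "-c", "import sys; sys.exit(3)"]
    print(run_and_collect(child))
```

The raw status from wait3() distinguishes a normal exit from termination by a signal, which is the distinction the following sections rely on.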
Application exit values
The most common cause of abnormal LSF job termination is an application exit value. If your
application exits with an explicit exit value less than 128, bjobs and bhist display the actual
exit code of the application; for example, Exited with exit code 3. Refer to the application
code for the meaning of exit code 3.
It is possible for a job to explicitly exit with an exit code greater than 128, which can be confused
with the corresponding UNIX signal. Make sure that applications you write do not use exit
codes greater than 128.
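One way to follow this advice is to define an application's exit codes in a single place and check them at startup. The sketch below is only an illustration: the ExitCode names and values and the validate_exit_codes helper are hypothetical, not part of LSF.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Hypothetical application exit codes, all kept below 128."""
    OK = 0
    BAD_INPUT = 3
    MISSING_LICENSE = 4


def validate_exit_codes(codes):
    """Raise ValueError if a code could be mistaken for signal_value+128."""
    for code in codes:
        if not 0 <= int(code) < 128:
            raise ValueError(f"exit code {int(code)} is outside the range 0-127")


# Check the whole table once, for example when the application starts.
validate_exit_codes(ExitCode)
```

An application would then finish with a call such as sys.exit(ExitCode.BAD_INPUT), so that the code shown by bjobs or bhist maps back to a named condition.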
System signal exit values
When you send a signal that terminates the job, LSF reports either the signal or the
signal_value+128. If the return status is greater than 128, and the job was terminated with a
signal, then return_status-128=signal. For example, return status 133 means that the job was
terminated with signal 5 (SIGTRAP on most systems, 133-128=5). A job with exit status 130 was
terminated with signal 2 (SIGINT on most systems, 130-128=2).
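The arithmetic above can be sketched in Python as follows. The function name decode_return_status is hypothetical, and the sketch assumes that any status greater than 128 came from a signal; an application that explicitly exits with such a code would be misread. Signal names are looked up on the local system and can differ between platforms.

```python
import signal


def decode_return_status(return_status):
    """Interpret a return status using the signal_value+128 convention."""
    if return_status > 128:
        signum = return_status - 128
        try:
            # Signal numbering is system specific; look the name up locally.
            name = signal.Signals(signum).name
        except ValueError:
            name = None  # No signal with this number on this system.
        return {"kind": "signal", "signal": signum, "name": name}
    return {"kind": "exit", "code": return_status}


if __name__ == "__main__":
    for status in (3, 130, 133):
        print(status, decode_return_status(status))
```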
Some operating systems define exit codes as 0-255. As a result, negative exit values or values
greater than 255 may wrap around within that range. The most common example is a program
that exits with -1, which LSF reports as "exit code 255".
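The wrap-around can be modeled as keeping only the low 8 bits of the exit value. The following Python sketch assumes a system that limits exit codes to 0-255; the function names wrap_exit_value and observed_exit_code are hypothetical, and observed_exit_code is intended for POSIX systems.

```python
import subprocess
import sys


def wrap_exit_value(value):
    """Model an exit value wrapping into the range 0-255."""
    # Python's % always yields a non-negative result for a positive modulus.
    return value % 256


def observed_exit_code(value):
    """Run a child Python process that exits with value; return what the parent sees."""
    child = [sys.executable, "-c", f"import sys; sys.exit({value})"]
    return subprocess.run(child).returncode


if __name__ == "__main__":
    print(wrap_exit_value(-1))     # 255
    print(wrap_exit_value(256))    # 0
    print(observed_exit_code(-1))  # expected to match wrap_exit_value(-1) on POSIX
```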
How or why a job was signaled, or exited with a particular exit code, can be application
specific, system specific, or both. The application or system logs might give a better
description of the problem.