LSF Version 7.3 - Administering Platform LSF

Host failure
If an LSF server host fails, jobs running on that host are lost. No other jobs are
affected. Jobs can be submitted so that they are automatically rerun from the
beginning or restarted from a checkpoint on another host if they are lost because of
a host failure.
If all of the hosts in a cluster go down, all running jobs are lost. When a host comes
back up and takes over as master, it reads the lsb.events file to get the state of all
batch jobs. Jobs that were running when the systems went down are assumed to
have exited, and email is sent to the submitting user. Pending jobs remain in their
queues, and are scheduled as hosts become available.
Exited jobs
A job might terminate abnormally for various reasons. Job termination can happen
from any state. An abnormally terminated job goes into EXIT state. The situations
where a job terminates abnormally include:
- The job is cancelled by its owner or the LSF administrator while pending, or
  after being dispatched to a host.
- The job cannot be dispatched before it reaches its termination deadline,
  and is therefore aborted by LSF.
- The job fails to start successfully. For example, the user specifies the wrong
  executable when submitting the job.
- The job exits with a non-zero exit status.
You can configure hosts so that LSF detects an abnormally high rate of job exit from
a host. See Administering Platform LSF for more information.
Application and system exit values
LSF monitors a job while it is running and returns the exit code returned by the job
itself. LSF collects this exit code through the wait3() system call on UNIX platforms.
The exit code is a result of the system exit values. Use bhist or bjobs to see the
exit code for your job.
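As an illustration of how an exit code can be collected with wait3(), the following minimal Python sketch forks a child process and reaps it with os.wait3(), which wraps the same system call on UNIX. It is not LSF's implementation; the function name run_and_collect is hypothetical and exists only for this example.

```python
import os


def run_and_collect(exit_code):
    """Fork a child that exits with exit_code, then reap it with wait3().

    Hypothetical helper for illustration only; UNIX platforms only.
    """
    pid = os.fork()
    if pid == 0:
        # Child: terminate immediately with the requested exit value.
        os._exit(exit_code)

    # Parent: wait3() returns (pid, status, resource usage) for any child,
    # so loop until the child started above has been reaped.
    while True:
        reaped_pid, status, _rusage = os.wait3(0)
        if reaped_pid == pid:
            break

    # Decode the raw wait status into something readable.
    if os.WIFEXITED(status):
        return ("exited", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        return ("signaled", os.WTERMSIG(status))
    return ("other", status)


if __name__ == "__main__":
    print(run_and_collect(3))  # prints ('exited', 3)
```

The raw status returned by wait3() encodes both normal exits and termination by a signal, which is why the sketch decodes it with the W* helper functions before reporting a value.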
Application exit values
The most common cause of abnormal LSF job termination is an application
system exit value. If your application had an explicit exit value less than 128,
bjobs and bhist display the actual exit code of the application; for example,
Exited with exit code 3. You would have to refer to the application code for the
meaning of exit code 3.
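A script that post-processes job reports might pull the exit code out of the text and check whether it falls in the application range. The sketch below assumes the report contains the phrase shown in the example above ("Exited with exit code 3"); the exact layout of real bjobs or bhist output may differ. The names application_exit_code and is_application_value are hypothetical, and treating 0 as outside the range is an assumption of this example.

```python
import re

# Phrase taken from the example in the text; the surrounding output
# format of bjobs/bhist is an assumption of this sketch.
EXIT_PATTERN = re.compile(r"Exited with\s+exit code\s+(\d+)")


def application_exit_code(report_text):
    """Return the exit code found in report_text, or None if absent."""
    match = EXIT_PATTERN.search(report_text)
    if match is None:
        return None
    return int(match.group(1))


def is_application_value(code):
    """True for a non-zero exit value less than 128.

    According to the text, such values are displayed as the actual
    exit code of the application.
    """
    return 0 < code < 128


if __name__ == "__main__":
    sample = "Job <1234> ... Exited with exit code 3. ..."  # made-up sample line
    code = application_exit_code(sample)
    print(code, is_application_value(code))  # prints: 3 True
```

The meaning of a particular value, such as 3, still has to come from the application's own code or documentation.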
Error condition     LSF exit code  Operating system  System exit code equivalent  Meaning
LSF internal error  -127, 127      all               N/A                          RES returns -127 or 127 for all internal problems.
Out of memory       N/A            all               N/A                          Exit code depends on the error handling of the application itself.
LSF job states      0              all               N/A                          Exit code 0 is returned for all job states.