LSF Version 7.3 - Using Platform LSF HPC
The PJL wrapper starts the PJL (for example, mpirun).
Instead of starting tasks directly, PJL starts TS on each host selected to run the
parallel job.
TS starts the task.
Each TS reports its task PID and host name back to PAM. Now PAM can perform job
control and resource usage collection through RES.
TaskStarter also collects the exit status of the task and reports it to PAM. When PJL
exits, PAM exits with the same termination status as the PJL.
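As a rough illustration of the last step, the sketch below shows one way a wrapper process could start a launcher and end with the same status. It is not PAM's implementation. The helper name run_and_mirror_status is hypothetical, and the mapping of a signal death to 128 plus the signal number is a common shell convention assumed here, not something this guide states.

```python
import subprocess
import sys


def run_and_mirror_status(pjl_argv):
    """Run a launcher command and return the status a wrapper should exit with.

    Hypothetical helper for illustration only; it is not part of LSF.
    """
    completed = subprocess.run(pjl_argv)
    rc = completed.returncode
    # On POSIX, subprocess reports "killed by signal N" as -N.
    # Assumption: map that to 128 + N, as most shells do.
    return rc if rc >= 0 else 128 - rc


# Stand-in for a real PJL such as mpirun: a child process that exits with 3.
fake_pjl = [sys.executable, "-c", "import sys; sys.exit(3)"]
status = run_and_mirror_status(fake_pjl)
```

A real wrapper would pass the returned value to sys.exit() so that its caller sees the launcher's status.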
Integration methods
There are two ways to integrate the PJL.
In the first method, PAM rewrites the PJL command line to insert TS in the correct
position and sets the callback information that TS uses to communicate with PAM.
Use this method when:
◆ You always use the same number of PJL arguments
◆ The job in the PJL command line is the executable application that starts the parallel
tasks
For details, see “Integration Method 1” on page 52.
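The command-line rewrite in the first method can be pictured with a minimal sketch. It assumes the job executable always sits at a known position in the PJL command line, which is why a fixed number of PJL arguments matters. The function insert_task_starter, the example mpirun arguments, and the bare TaskStarter token are illustrative assumptions; the callback information that PAM actually passes to TS is omitted.

```python
def insert_task_starter(pjl_argv, job_index, ts_argv):
    """Return a copy of pjl_argv with ts_argv inserted before the job executable.

    job_index is the position of the job executable in pjl_argv. It is only
    predictable when the launcher always takes the same number of arguments.
    """
    if not 0 < job_index < len(pjl_argv):
        raise ValueError("job_index must point at the job inside pjl_argv")
    return list(pjl_argv[:job_index]) + list(ts_argv) + list(pjl_argv[job_index:])


# Illustrative command line: launcher, two launcher arguments, then the job.
original = ["mpirun", "-np", "4", "./a.out", "input.dat"]
rewritten = insert_task_starter(original, job_index=3, ts_argv=["TaskStarter"])
print(rewritten)
# ['mpirun', '-np', '4', 'TaskStarter', './a.out', 'input.dat']
```

If the job were a script that starts the tasks later, or the argument count varied, no single index would be correct, which is the case the second method covers.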
In the second method, you rewrite or wrap the PJL to include TS and the callback
information that TS uses to communicate with PAM. This method of integration is the
most flexible, but may be more difficult to implement.
Use this method when:
◆ The number of PJL arguments is uncertain
◆ Parallel tasks have a complex startup sequence
◆ The job in the PJL command line could be a script instead of the executable
application that starts the parallel tasks
For details, see “Integration Method 2” on page 54.
Error handling
If PAM cannot start the PJL, no tasks are started and PAM exits.
If PAM does not receive all the TS registration messages (host name and PID)
within the timeout specified by LSF_HPC_PJL_LOADENV_TIMEOUT in
lsf.conf, it assumes that the job cannot be executed. It kills the PJL, kills any
tasks that have already started successfully, and exits. The default for
LSF_HPC_PJL_LOADENV_TIMEOUT is 300 seconds.
If TS cannot start the task, it reports this to PAM and exits. Once all tasks have
reported, PAM checks that every task has started. If any task did not start, PAM kills
the PJL, sends a message to kill all the remote tasks that have been successfully
started, and exits.
If TS terminates before it can report the exit status of the task to PAM, PAM never
receives all the exit statuses. It then exits when the PJL exits.
If the PJL exits before every TS has reported the exit status of its task, PAM
assumes the parallel job has completed and communicates with RES, which signals
the tasks.
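The registration timeout described above can be sketched as a deadline-bounded wait. This is only a model of the behavior, not PAM's code. The names parse_timeout and collect_registrations are hypothetical, the in-process queue stands in for whatever channel TS uses to reach PAM, and the parser assumes simple NAME=value lines with # comments in lsf.conf.

```python
import queue
import time

KEY = "LSF_HPC_PJL_LOADENV_TIMEOUT"
DEFAULT_TIMEOUT = 300  # seconds; the default stated in this section


def parse_timeout(conf_text, default=DEFAULT_TIMEOUT):
    """Read the timeout from lsf.conf-style text (assumed NAME=value lines)."""
    for line in conf_text.splitlines():
        name, sep, value = line.split("#", 1)[0].partition("=")
        if sep and name.strip() == KEY:
            return int(value.strip().strip('"'))
    return default


def collect_registrations(inbox, expected, timeout):
    """Wait for `expected` (host, pid) messages, giving up at the deadline.

    Returns (registrations, complete). When complete is False, the caller
    would kill the launcher and every task already registered.
    """
    deadline = time.monotonic() + timeout
    registered = []
    while len(registered) < expected:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return registered, False
        try:
            registered.append(inbox.get(timeout=remaining))
        except queue.Empty:
            return registered, False
    return registered, True


# Two tasks register in time, so the wait completes.
inbox = queue.Queue()
inbox.put(("hostA", 101))
inbox.put(("hostB", 102))
tasks, complete = collect_registrations(inbox, expected=2, timeout=1.0)
```

Tracking a single deadline, rather than restarting the timer for each message, keeps the total wait bounded by the configured value no matter how many tasks register late.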