LSF Version 7.3 - Using Platform LSF HPC

The PJL wrapper starts the PJL (for example, mpirun).

Instead of starting tasks directly, PJL starts TS on each host selected to run the parallel job.

TS starts the task.

Each TS reports its task PID and host name back to PAM. Now PAM can perform job control and resource usage collection through RES.

TaskStarter also collects the exit status of the task and reports it to PAM. When the PJL exits, PAM exits with the same termination status as the PJL.
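The reporting sequence that TS follows can be illustrated with a minimal Python sketch. This is not the TaskStarter implementation: the `run_task` function and the `report` callback are hypothetical names, and the callback only stands in for whatever channel TS actually uses to reach PAM.

```python
import socket
import subprocess
import sys


def run_task(command, report):
    """Start one task, report its identity, then report its exit status.

    `report` is a hypothetical callback standing in for the channel to PAM.
    """
    proc = subprocess.Popen(command)
    # Registration message: the host name and PID of the task just started.
    report({"type": "register", "host": socket.gethostname(), "pid": proc.pid})
    # Wait for the task to finish, then pass its exit status along.
    status = proc.wait()
    report({"type": "exit", "pid": proc.pid, "status": status})
    return status


if __name__ == "__main__":
    messages = []
    # The "task" here is a tiny Python process that exits with status 3.
    run_task([sys.executable, "-c", "raise SystemExit(3)"], messages.append)
    for message in messages:
        print(message)
```

The registration message is sent before the wait, which is what lets the receiving side act on the PID while the task is still running.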

Integration methods

There are two ways to integrate the PJL.

In method 1, PAM rewrites the PJL command line to insert TS in the correct position and sets the callback information that TS needs to communicate with PAM.

Use this method when:

◆ You always use the same number of PJL arguments

◆ The job in the PJL command line is the executable application that starts the parallel tasks

For details, see “Integration Method 1” on page 52.
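The following Python sketch suggests why this method depends on a fixed number of PJL arguments: the rewrite has to know where the PJL options end and the job begins. The function name `insert_task_starter` and the TS arguments shown are illustrative placeholders, not the actual PAM logic or TS syntax.

```python
def insert_task_starter(command, num_pjl_args, ts_command):
    """Return a copy of `command` with `ts_command` placed before the job.

    command       -- full PJL command line as a list of strings
    num_pjl_args  -- how many arguments after the PJL name belong to the PJL
    ts_command    -- the TS invocation, including its callback details
    """
    split = 1 + num_pjl_args  # the PJL executable plus its own arguments
    if num_pjl_args < 0 or split >= len(command):
        raise ValueError("no job left on the command line after the PJL arguments")
    return command[:split] + ts_command + command[split:]


if __name__ == "__main__":
    rewritten = insert_task_starter(
        ["mpirun", "-np", "4", "./my_app", "input.dat"],
        num_pjl_args=2,
        # Placeholder TS invocation; the real option syntax is not shown here.
        ts_command=["TS", "callback=pamhost:12345"],
    )
    print(" ".join(rewritten))
```

If the argument count were wrong by one, TS would land in the middle of the PJL options or after the application name, which is the situation the second method is meant to handle.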

In method 2, you rewrite or wrap the PJL to include TS and the callback information that TS needs to communicate with PAM. This method of integration is the most flexible, but may be more difficult to implement.

Use this method when:

◆ The number of PJL arguments is uncertain

◆ Parallel tasks have a complex startup sequence

◆ The job in the PJL command line could be a script instead of the executable application that starts the parallel tasks

For details, see “Integration Method 2” on page 54.

Error handling

If PAM cannot start PJL, no tasks are started and PAM exits.

If PAM does not receive all the TS registration messages (host name and PID) within the timeout specified by LSF_HPC_PJL_LOADENV_TIMEOUT in lsf.conf, it assumes that the job cannot be executed. It kills the PJL, kills all the tasks that have been successfully started (if any), and exits. The default for LSF_HPC_PJL_LOADENV_TIMEOUT is 300 seconds.
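The registration timeout can be pictured as a single deadline that covers all expected messages, rather than a separate timer per task. The sketch below is an illustration under that assumption; `collect_registrations` and the in-memory queue are hypothetical stand-ins and do not describe how PAM is implemented.

```python
import queue
import time


def collect_registrations(inbox, expected, timeout):
    """Wait until `expected` registrations arrive or `timeout` seconds pass.

    Returns (registrations, complete). When `complete` is False, the caller
    would kill the PJL and every task listed in `registrations`.
    """
    deadline = time.monotonic() + timeout
    registrations = []
    while len(registrations) < expected:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return registrations, False
        try:
            # Block only for the time left before the shared deadline.
            registrations.append(inbox.get(timeout=remaining))
        except queue.Empty:
            return registrations, False
    return registrations, True


if __name__ == "__main__":
    inbox = queue.Queue()
    inbox.put(("hostA", 101))  # only one of the two expected tasks registers
    registered, complete = collect_registrations(inbox, expected=2, timeout=0.2)
    print(registered, complete)
```

Returning the partial list matters because those are the tasks that were started successfully and still need to be killed.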

If TS cannot start the task, it reports this to PAM and exits. If all tasks report, PAM checks to make sure all tasks have started. If any task does not start, PAM kills the PJL, sends a message to kill all the remote tasks that have been successfully started, and exits.
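The all-or-nothing startup check described above can be sketched as follows. The report format, with a PID of `None` marking a task that TS could not start, is an assumption made for this example, as is the name `plan_after_startup`.

```python
def plan_after_startup(reports):
    """Decide what to do once every TS has reported.

    reports -- list of (host, pid) tuples; pid is None when TS could not
               start the task on that host (an assumed convention)
    Returns ("continue", []) when every task started, otherwise
    ("abort", tasks_to_kill) listing the tasks that did start.
    """
    started = [(host, pid) for host, pid in reports if pid is not None]
    if len(started) == len(reports):
        return "continue", []
    return "abort", started


if __name__ == "__main__":
    action, to_kill = plan_after_startup(
        [("hostA", 101), ("hostB", None), ("hostC", 303)]
    )
    print(action, to_kill)
```

A single failed start is enough to abort, so the tasks that did start are the ones returned for cleanup.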

If TS terminates before it can report the exit status of the task to PAM, PAM never receives all the exit statuses. It then exits when the PJL exits.

If the PJL exits before all TS have registered the exit status of the tasks, then PAM assumes the parallel job is completed, and communicates with RES, which signals the tasks.