LSF Version 7.3 - Using Platform LSF HPC
Tuning PAM Scalability and Fault Tolerance
To improve performance and scalability for large parallel jobs, tune the following
parameters.
Parameters for PAM (lsf.conf)
For better performance, you can adjust the following parameters in lsf.conf. Values
set in the user's environment override the values in lsf.conf.
LSF_HPC_PJL_LOADENV_TIMEOUT
Timeout value, in seconds, for the PJL to load or unload the environment; for example,
the time IBM POE needs to load or unload adapter windows.
At job startup, the PJL times out if the first task fails to register within the specified
timeout value. At job shutdown, the PJL times out if it fails to exit after the last
Taskstarter termination report within the specified timeout value.
Default: LSF_HPC_PJL_LOADENV_TIMEOUT=300
LSF_PAM_RUSAGE_UPD_FACTOR
This factor adjusts the resource usage update interval according to the following
calculation:
RUSAGE_UPDATE_INTERVAL + num_tasks * 1 * LSF_PAM_RUSAGE_UPD_FACTOR
Without this factor, PAM updates resource usage for each task every
SBD_SLEEP_TIME + num_tasks * 1 seconds (by default, SBD_SLEEP_TIME=15).
For large parallel jobs, this interval is too long. LSF_PAM_RUSAGE_UPD_FACTOR scales
the num_tasks term, so a value below 1 keeps updates frequent as the number of
parallel tasks increases.
Default: LSF_PAM_RUSAGE_UPD_FACTOR=0.01
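As a rough illustration of the calculation above, the following Python sketch evaluates
the update interval for a few job sizes. It assumes that RUSAGE_UPDATE_INTERVAL equals
the default SBD_SLEEP_TIME of 15 seconds, which the text implies but does not state
outright. The function name rusage_update_interval and the task counts are hypothetical
and are not part of LSF.

```python
def rusage_update_interval(num_tasks, base_interval=15.0, upd_factor=0.01):
    """Return the update interval in seconds.

    Implements: RUSAGE_UPDATE_INTERVAL + num_tasks * 1 * LSF_PAM_RUSAGE_UPD_FACTOR
    base_interval stands in for RUSAGE_UPDATE_INTERVAL (assumed to be 15 here).
    upd_factor stands in for LSF_PAM_RUSAGE_UPD_FACTOR (documented default 0.01).
    """
    return base_interval + num_tasks * 1 * upd_factor


if __name__ == "__main__":
    # Compare a factor of 1 (the unscaled per-task term) with the default 0.01.
    for n in (16, 1024, 4096):
        unscaled = rusage_update_interval(n, upd_factor=1.0)
        scaled = rusage_update_interval(n)
        print(f"{n:>5} tasks: factor 1 -> {unscaled:.2f} s, factor 0.01 -> {scaled:.2f} s")
```

Under these assumptions, a 1024-task job would wait about 1039 seconds between updates
with an unscaled per-task term, and about 25 seconds with the default factor.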