Platform LSF Administration Guide Version 6.2

Performance Tuning for Interactive Batch Jobs
Administering Platform LSF
The paging rate load index can be used as a threshold to either stop sending more jobs
to the host, or to suspend an already running batch job to give priority to interactive
users.
This parameter can be used in different configuration files to achieve different purposes.
By defining a paging rate threshold in lsf.cluster.cluster_name, the host becomes busy
from LIM’s point of view whenever the threshold is exceeded; therefore, LIM no longer
advises that jobs run on this host.
By including paging rate in queue or host scheduling conditions, jobs can be prevented
from starting on machines with a heavy paging rate, or can be suspended or even killed
if they are interfering with the interactive user on the console.
A job suspended due to the pg threshold will not be resumed, even if the resume conditions
are met, unless the machine is interactively idle for more than PG_SUSP_IT seconds.
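The resume rule for jobs suspended on the pg threshold can be sketched as follows. This
is a minimal Python illustration, not LSF code: the function name and its arguments are
hypothetical, and the value 180 is only an example setting for PG_SUSP_IT. It assumes
the interactive idle time is available in minutes (the unit of the it index) while
PG_SUSP_IT is given in seconds.

```python
def may_resume(resume_cond_met: bool, suspended_by_pg: bool,
               idle_minutes: float, pg_susp_it_seconds: float) -> bool:
    """Return True if a suspended job may resume (hypothetical helper)."""
    if not resume_cond_met:
        # The ordinary resume conditions must hold in every case.
        return False
    if suspended_by_pg:
        # Extra gate for pg suspensions: the host must have been
        # interactively idle for more than PG_SUSP_IT seconds.
        return idle_minutes * 60 > pg_susp_it_seconds
    return True


# Example setting only: PG_SUSP_IT = 180 seconds.
print(may_resume(True, True, idle_minutes=2, pg_susp_it_seconds=180))
print(may_resume(True, True, idle_minutes=4, pg_susp_it_seconds=180))
```

With these example values, two minutes of idle time (120 seconds) is not enough to
resume the job, while four minutes (240 seconds) is.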
Interactive idle time (it)
Strict control can be achieved using the idle time (it) index. This index measures the
number of minutes since any interactive terminal activity. Interactive terminals include
hard-wired ttys, rlogin and lslogin sessions, and X shell windows such as xterm.
On some hosts, LIM also detects mouse and keyboard activity.
This index is typically used to prevent batch jobs from interfering with interactive
activities. By defining the suspending condition in the queue as it<1 && pg>50, a job
from this queue will be suspended if the machine is not interactively idle and the paging
rate is higher than 50 pages per second. Furthermore, by defining the resuming
condition as it>5 && pg<10 in the queue, a suspended job from the queue will not
resume unless the host has been interactively idle for more than five minutes and the
paging rate is less than ten pages per second.
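The gap between the suspending and resuming conditions keeps a job from flapping
between states. The following Python sketch models that behavior for a single job; it is
an illustration of the two conditions quoted above, not LSF code, and the function names
and sample load values are hypothetical.

```python
def stop_cond(it: float, pg: float) -> bool:
    """Suspending condition from the text: it<1 && pg>50."""
    return it < 1 and pg > 50


def resume_cond(it: float, pg: float) -> bool:
    """Resuming condition from the text: it>5 && pg<10."""
    return it > 5 and pg < 10


def next_state(suspended: bool, it: float, pg: float) -> bool:
    """Return True if the job is suspended after this load sample."""
    if suspended:
        # A suspended job stays suspended until the resume condition holds.
        return not resume_cond(it, pg)
    # A running job is suspended only when the stop condition holds.
    return stop_cond(it, pg)


# Hypothetical (it minutes, pg pages/second) samples over time.
samples = [(0, 60), (3, 8), (6, 8), (0, 20)]
suspended = False
history = []
for it, pg in samples:
    suspended = next_state(suspended, it, pg)
    history.append(suspended)

print(history)  # [True, True, False, False]
```

In the second sample the paging rate has already dropped, but the job stays suspended
because the host has been idle for only three minutes. In the last sample a user is
active again, but the job keeps running because the paging rate is below 50.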
The it index is only non-zero if no interactive users are active. Setting the it threshold
to five minutes allows a reasonable amount of think time for interactive users, while
making the machine available for load sharing if the users are logged in but absent.
For lower priority batch queues, it is appropriate to set an it suspending threshold of
two minutes and a scheduling threshold of ten minutes in the lsb.queues file. Jobs in
these queues are suspended while the execution host is in use, and resume after the host
has been idle for a longer period. For hosts where all batch jobs, no matter how
important, should be suspended, set a per-host suspending threshold in the lsb.hosts
file.
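The effect of pairing a two-minute suspending threshold with a ten-minute scheduling
threshold can be sketched in Python as follows. This is only a model of the behavior
described above: the constant and function names are hypothetical, and the exact
boundary comparisons (< versus <=) are assumptions rather than documented LSF behavior.

```python
IT_STOP_MIN = 2    # suspending threshold on it, in minutes (from the text)
IT_SCHED_MIN = 10  # scheduling threshold on it, in minutes (from the text)


def host_action(it_minutes: float) -> str:
    """Classify what a low priority queue does on a host (hypothetical)."""
    if it_minutes < IT_STOP_MIN:
        # The host is in interactive use: running jobs are suspended.
        return "suspend"
    if it_minutes < IT_SCHED_MIN:
        # Recently used host: nothing new starts and nothing resumes yet.
        return "hold"
    # The host has been idle long enough to start or resume jobs.
    return "run"


for minutes in (0, 5, 10):
    print(minutes, host_action(minutes))
```

The "hold" band between the two thresholds is what makes jobs resume only after the
host has been idle for a longer period than the one that triggered the suspension.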
CPU run queue length (r15s, r1m, r15m)
Running more than one CPU-bound process on a machine (or more than one process
per CPU for multiprocessors) can reduce the total throughput because of operating
system overhead, and can also interfere with interactive users. Some jobs, such as
compilations, can create more than one CPU-intensive process.
Queues should normally set CPU run queue scheduling thresholds below 1.0, so that
hosts already running compute-bound jobs are left alone. LSF scales the run queue
thresholds for multiprocessor hosts by using the effective run queue lengths, so
multiprocessors automatically run one job per processor in this case.
For the concept of effective run queue lengths, see lsfintro(1).