LSF Version 7.3 - Administering Platform LSF

Performance Tuning for Interactive Batch Jobs
The it (idle time) index is non-zero only if no interactive users are active. Setting
the it threshold to five minutes allows a reasonable amount of think time for
interactive users, while making the machine available for load sharing if the users
are logged in but absent.
For lower priority batch queues, it is appropriate to set an it suspending threshold
of two minutes and a scheduling threshold of ten minutes in the lsb.queues file.
Jobs in these queues are suspended while the execution host is in use, and resume
after the host has been idle for a longer period. For hosts where all batch jobs, no
matter how important, should be suspended, set a per-host suspending threshold
in the lsb.hosts file.
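
As a rough illustration, the queue-level it thresholds described above might be
written in lsb.queues as shown here. This is a sketch only: the queue name and
priority are hypothetical, and it assumes the loadSched/loadStop form for load index
thresholds, with it measured in minutes.

```
Begin Queue
# hypothetical queue name and illustrative priority
QUEUE_NAME  = low_batch
PRIORITY    = 20
# loadSched/loadStop: dispatch to hosts idle for at least 10 minutes,
# suspend jobs when the host idle time drops below 2 minutes
it          = 10/2
DESCRIPTION = Low priority jobs that yield to interactive users
End Queue
```

The gap between the two values is what makes jobs resume only after the host has
been idle for a longer period than the one that triggered the suspension.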
CPU run queue length (r15s, r1m, r15m)
Running more than one CPU-bound process on a machine (or more than one
process per CPU for multiprocessors) can reduce the total throughput because of
operating system overhead, and can interfere with interactive users. Some jobs,
such as compilations, can create more than one CPU-intensive process.
Queues should normally set CPU run queue scheduling thresholds below 1.0, so
that hosts already running compute-bound jobs are left alone. LSF scales the run
queue thresholds for multiprocessor hosts by using the effective run queue lengths,
so multiprocessors automatically run one job per processor in this case.
For short to medium-length jobs, the r1m index should be used. For longer jobs, you
might want to add an r15m threshold. The exception is high priority queues,
where turnaround time is more important than total throughput. For high priority
queues, an r1m scheduling threshold of 2.0 is appropriate.
See Load Indices on page 239 for the concept of effective run queue length.
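
The run queue guidance above could be expressed in lsb.queues roughly as follows.
This is a minimal sketch: the queue names are hypothetical, the value 0.8 is an
illustrative choice of "below 1.0", and it assumes that a single value with no
slash sets only the scheduling threshold.

```
# Ordinary queue: leave hosts that already run compute-bound jobs alone
Begin Queue
QUEUE_NAME = normal
# r1m scheduling threshold below 1.0 (illustrative value)
r1m        = 0.8
# optional r15m threshold for longer jobs (illustrative value)
r15m       = 0.8
End Queue

# High priority queue: turnaround time matters more than total throughput
Begin Queue
QUEUE_NAME = priority
r1m        = 2.0
End Queue
```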
CPU utilization (ut)
The ut parameter measures the amount of CPU time being used. When all the CPU
time on a host is in use, there is little to gain from sending another job to that host
unless the host is much more powerful than others on the network. A ut threshold
of 90% prevents jobs from going to a host where the CPU does not have spare
processing cycles.
If a host has very high pg but low ut, then it may be desirable to suspend some jobs
to reduce the contention.
Some commands report ut as a percentage from 0 to 100; others report it as a
decimal fraction between 0 and 1. The configuration parameter in the
lsf.cluster.cluster_name file and the other configuration files take a fraction in the
range from 0 to 1, while the bsub -R resource requirement string takes an integer
from 1 to 100.
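
For example, the 90% ut threshold mentioned earlier would be written as a fraction
in a queue definition. The fragment below is a sketch with a hypothetical queue
name, assuming the threshold is set as a scheduling threshold in lsb.queues.

```
Begin Queue
QUEUE_NAME = normal
# fraction form used in configuration files: 0.9 corresponds to 90%
ut         = 0.9
End Queue
```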
The command bhist shows the execution history of batch jobs, including the time
spent waiting in queues or suspended because of system load.
The command bjobs -p shows why a job is pending.
Scheduling conditions and resource thresholds
Three parameters, RES_REQ, STOP_COND and RESUME_COND, can be
specified in the definition of a queue. Scheduling conditions are a more general way
of specifying job dispatching conditions at the queue level. These parameters take
resource requirement strings as values, which allows you to specify conditions in a
more flexible manner than using the loadSched or loadStop thresholds.
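
A queue definition using these three parameters might look like the following
sketch. The queue name and all numeric values are illustrative assumptions, chosen
to echo the it and r1m thresholds discussed earlier; the pg limit is a hypothetical
addition that shows how several indices can be combined in one condition.

```
Begin Queue
QUEUE_NAME  = low_batch
# dispatch only to hosts idle for more than 10 minutes with a short run queue
RES_REQ     = select[it > 10 && r1m < 0.8]
# suspend jobs when interactive users return or paging becomes heavy
STOP_COND   = select[it < 2 || pg > 50]
# resume once the host has been idle again for more than 10 minutes
RESUME_COND = select[it > 10]
End Queue
```

Combining indices with && and || in a single condition is the kind of flexibility
that individual loadSched and loadStop values cannot express.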