Platform LSF Administration Guide Version 6.2

Chapter 39
Tuning the Cluster
Administering Platform LSF
% lshosts -l
HOST_NAME: hostD
...
LOAD_THRESHOLDS:
r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
   -  3.5     -   -  15   -   -   -    -   2M   1M
HOST_NAME: hostA
...
LOAD_THRESHOLDS:
r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
   -  3.5     -   -  15   -   -   -    -   2M   1M
% lsload
HOST_NAME  status  r15s  r1m  r15m   ut     pg  ls  it  tmp  swp  mem
hostD      ok       0.0  0.0   0.0   0%    0.0   6   0  30M  32M  10M
hostA      busy     1.9  2.1   1.9  47%  *69.6  21   0  38M  96M  60M
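In this example, the hosts have the following characteristics:
hostD is ok: none of its load indices exceeds a configured threshold.
hostA is busy: the pg (paging rate) index is 69.6, above the threshold of 15. lsload
marks the index that exceeds its threshold with an asterisk.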
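The comparison between reported load and configured thresholds can be sketched in a few lines of Python. This is a simplified model for illustration, not LIM's actual implementation. The function name exceeded_indices and the dictionary layout are hypothetical, and the sketch assumes that the run queue, utilization, paging, I/O and login indices count as exceeded when they rise above the threshold, while it, tmp, swp and mem count as exceeded when they fall below it.

```python
# Indices where a larger value means more load. For the remaining indices
# (it, tmp, swp, mem) a smaller value means less of the resource is available.
# Assumption: this split mirrors how LSF orders its built-in load indices.
INCREASING = {"r15s", "r1m", "r15m", "ut", "pg", "io", "ls"}


def exceeded_indices(load, thresholds):
    """Return the names of load indices that violate their threshold."""
    violated = []
    for name, limit in thresholds.items():
        if limit is None or name not in load:
            continue  # "-" in lshosts output: no threshold configured
        value = load[name]
        over = value > limit if name in INCREASING else value < limit
        if over:
            violated.append(name)
    return violated


# Thresholds from the lshosts -l example (swp and mem in MB).
thresholds = {"r1m": 3.5, "pg": 15, "swp": 2, "mem": 1}

# Load values from the lsload example (swp and mem in MB).
host_d = {"r1m": 0.0, "pg": 0.0, "swp": 32, "mem": 10}
host_a = {"r1m": 2.1, "pg": 69.6, "swp": 96, "mem": 60}

print("hostD:", exceeded_indices(host_d, thresholds))
print("hostA:", exceeded_indices(host_a, thresholds))
```

With the values from the example, only the pg index of hostA violates its threshold, which matches the busy status that lsload reports.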
If LIM often reports a host as busy
If LIM often reports a host as busy when the CPU utilization and run queue lengths
are relatively low and the system is responding quickly, the most likely cause is the paging
rate threshold. Try raising the pg threshold.
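To check whether paging is really what makes hosts busy, you can scan saved lsload output for the asterisk that marks an index over its threshold. The following Python sketch does this for text in the format of the lsload example. The function name starred_indices is hypothetical, and the parser assumes every data row has the same number of columns as the header row.

```python
def starred_indices(lsload_output):
    """Map each host name to the load indices that lsload marked with '*'."""
    lines = [ln for ln in lsload_output.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    result = {}
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip rows that do not match the header layout
        row = dict(zip(header, fields))
        result[row["HOST_NAME"]] = [
            name for name, value in row.items() if value.startswith("*")
        ]
    return result


# Sample text copied from the lsload example.
SAMPLE = """\
HOST_NAME status r15s r1m r15m ut pg ls it tmp swp mem
hostD ok 0.0 0.0 0.0 0% 0.0 6 0 30M 32M 10M
hostA busy 1.9 2.1 1.9 47% *69.6 21 0 38M 96M 60M
"""

flagged = starred_indices(SAMPLE)
# Hosts whose only starred index is the paging rate.
paging_only = [host for host, names in flagged.items() if names == ["pg"]]
print(paging_only)
```

Hosts that show up in paging_only repeatedly while their r1m and ut values stay low are candidates for a higher pg threshold.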
Different operating systems assign subtly different meanings to the paging rate statistic,
so the threshold needs to be set at different levels for different host types. In particular,
HP-UX systems need to be configured with significantly higher pg values; try starting
at a value of 50.
There is a point of diminishing returns. As the paging rate rises, eventually the system
spends too much time waiting for pages and the CPU utilization decreases. Paging rate
is the factor that most directly affects perceived interactive response. If a system is
paging heavily, it feels very slow.
If interactive jobs slow down response
If you find that interactive jobs slow down system response too much while LIM still
reports your host as ok, reduce the CPU run queue length thresholds (r15s, r1m, r15m).
Likewise, increase the CPU run queue length thresholds if hosts become busy at low loads.
Multiprocessor systems
On multiprocessor systems, the CPU run queue length thresholds (r15s, r1m, r15m) are
compared to the effective run queue lengths as displayed by the lsload -E command.
CPU run queue lengths should be configured as the load limit for a single processor.
Sites with a variety of uniprocessor and multiprocessor machines can use a standard
value for r15s, r1m, and r15m in the configuration files, and the multiprocessor
machines will automatically run more jobs.
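The effect of a single per-processor threshold can be illustrated with a short Python sketch. It assumes, as a simplification, that the effective run queue length is the raw run queue length divided by the number of processors; the exact normalization LSF applies may differ. The names effective_run_queue and is_busy are hypothetical.

```python
R1M_THRESHOLD = 3.5  # per-processor limit, as in the lshosts -l example


def effective_run_queue(raw_length, processors):
    """Approximate the effective run queue length.

    Assumption: plain division by the processor count. LSF's real
    normalization may differ from this simplified model.
    """
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return raw_length / processors


def is_busy(raw_length, processors, threshold=R1M_THRESHOLD):
    """Compare the effective run queue length with the threshold."""
    return effective_run_queue(raw_length, processors) > threshold


# The same raw load on a uniprocessor and on a four-processor host.
for host, raw, ncpus in [("uniprocessor", 8.0, 1), ("four-way", 8.0, 4)]:
    print(host, effective_run_queue(raw, ncpus), is_busy(raw, ncpus))
```

Under this assumption, a raw run queue length of 8 puts the uniprocessor host over the 3.5 threshold, while the four-processor host has an effective length of 2.0 and stays below it, so it can accept more jobs with the same configured value.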