Platform LSF Administration Guide Version 6.2
Chapter 4
Working with Hosts
Administering Platform LSF
Tuning
Tune JOB_EXIT_RATE_DURATION carefully. Shorter values may raise false alarms;
longer values may not trigger exceptions frequently enough.
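
The trade-off can be illustrated with a small sketch. It is not LSF code: the
function name first_exception_minute, the per-minute sampling, and the sample
data are all assumptions made for illustration. It treats the duration as the
number of consecutive minutes the exit rate must stay above the threshold.

```python
def first_exception_minute(exits_per_minute, threshold, duration):
    """Return the 0-based minute at which the rate has stayed above
    `threshold` for `duration` consecutive minutes, or None if it never does.

    Hypothetical model for illustration only; not how LSF computes it.
    """
    run = 0  # consecutive minutes above the threshold so far
    for minute, rate in enumerate(exits_per_minute):
        run = run + 1 if rate > threshold else 0
        if run >= duration:
            return minute
    return None


# Invented sample data: a brief burst versus a sustained failure.
burst = [0, 1, 12, 14, 1, 0, 0, 0, 0, 0]
sustained = [0, 1, 12, 14, 13, 15, 12, 14, 13, 12]
threshold = 10

for duration in (2, 5):
    print(
        "duration", duration,
        "burst:", first_exception_minute(burst, threshold, duration),
        "sustained:", first_exception_minute(sustained, threshold, duration),
    )
```

In this model, a duration of 2 flags the brief burst as well as the sustained
failure, while a duration of 5 ignores the burst but reports the sustained
failure three minutes later.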
Example
In the diagram, the job exit rate of hostA exceeds the configured threshold. LSF
monitors hostA from time t1 to time t2 (t2 = t1 + JOB_EXIT_RATE_DURATION in
lsb.params). At t2, the exit rate is still high, and a host exception is
detected. At t3 (EADMIN_TRIGGER_DURATION in lsb.params), LSF invokes eadmin and
the host exception is handled. By default, LSF closes hostA and sends email to
the LSF administrator. Since hostA is closed and cannot accept any new jobs, the
exit rate drops quickly.
[Figure: hostA exit rate over time, compared with the threshold. Time axis marks:
t0, t1, t2, t3. Axes: Time, Exit rate.]
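
The sequence of events in this example can be sketched as a short simulation.
This is a simplified model, not LSF's implementation: the function name
exception_timeline and the sample rates are hypothetical, time is counted in
whole minutes, and the sketch assumes that t3 falls EADMIN_TRIGGER_DURATION
after t2, which the text above suggests but does not state outright.

```python
def exception_timeline(rates, threshold, exit_rate_duration, trigger_duration):
    """Return (t1, t2, t3) as 0-based minute indexes, or None where an
    event never happens.

    t1: exit rate first exceeds the threshold (monitoring starts)
    t2: rate still high exit_rate_duration later (exception detected)
    t3: assumed to be trigger_duration after t2 (eadmin invoked)
    """
    t1 = t2 = t3 = None
    for minute, rate in enumerate(rates):
        if t2 is None:
            if rate <= threshold:
                t1 = None            # rate recovered; stop monitoring
            elif t1 is None:
                t1 = minute          # rate just crossed the threshold
            elif minute - t1 >= exit_rate_duration:
                t2 = minute          # still high: host exception detected
        elif minute - t2 >= trigger_duration:
            t3 = minute              # eadmin handles the exception
            break
    return t1, t2, t3


# Invented exit rates for hostA, one sample per minute.
host_a_rates = [2, 3, 12, 13, 14, 15, 14, 13, 6, 2]
print(exception_timeline(host_a_rates, threshold=10,
                         exit_rate_duration=3, trigger_duration=2))
```

With these sample values the sketch reports t1 = 2, t2 = 5, and t3 = 7. The drop
in the last two samples stands in for the effect of closing hostA.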