LSF Version 7.3 - Administering Platform LSF

Achieving Performance and Scalability
This instructs mbatchd to check every two hours whether the events file has logged
1000 batch job completions. Together, the two parameters control how often the
events file is switched:
After two hours, mbatchd checks the number of completed batch jobs. If 1000
completed jobs have been logged, it switches the events file.
If 1000 jobs complete after five minutes, mbatchd does not switch the events file
until the end of the two-hour period.
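The interaction of the two parameters can be sketched in a few lines of Python. This is an illustrative model only, not mbatchd's actual implementation: the function name should_switch and its arguments are hypothetical, and the defaults simply mirror the example values (1000 completed jobs, a two-hour period of 7200 seconds).

```python
def should_switch(completed_jobs, seconds_since_last_check,
                  max_job_num=1000, min_switch_period=7200):
    """Return True if the events file would be switched at this moment.

    Hypothetical model: the job count is only examined once the
    switch period has elapsed.
    """
    if seconds_since_last_check < min_switch_period:
        return False
    return completed_jobs >= max_job_num


# 1000 jobs complete after five minutes: no switch yet.
print(should_switch(1000, 5 * 60))       # False
# The two-hour period ends with 1000 completed jobs logged: switch.
print(should_switch(1000, 2 * 60 * 60))  # True
# The period ends but only 400 jobs have completed: no switch.
print(should_switch(400, 2 * 60 * 60))   # False
```

The sketch treats the period as a gate: reaching the job threshold early never triggers a switch on its own.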
TIP: For large clusters, set MIN_SWITCH_PERIOD to a value equal to or greater than 600 seconds.
This causes mbatchd to fork a child process that handles event switching, thereby reducing the
load on mbatchd. mbatchd terminates the child process and appends delta events to new events
after the MIN_SWITCH_PERIOD has elapsed. If you define a value less than 600 seconds, mbatchd
does not fork a child process for event switching.
Automatic load updating
Periodically, the LIM daemons exchange load information. In large clusters, let LSF
automatically load the information by dynamically adjusting the period based on
the load.
IMPORTANT: For automatic tuning of the loading interval, make sure the parameter
EXINTERVAL is not defined in the lsf.cluster.cluster_name file. Do not configure your cluster
to load the information at specific intervals.
Managing the I/O performance of the info directory
In large clusters, users submit large numbers of jobs. Since each job generally has
a job file, a large number of job files are stored in the
LSF_SHAREDIR/cluster_name/logdir/info directory at any time. When the total
size of the job files reaches a certain point, you will notice a significant delay when
performing I/O operations in the info directory.
This delay is caused by a limit in the total size of files that can reside in a file server
directory. This limit is dependent on the file system implementation. A high load
on the file server delays the master batch daemon operations, and therefore slows
down the overall cluster throughput.
You can prevent this delay by creating and using subdirectories under the parent
directory. Each new subdirectory is subject to the file size limit, but the parent
directory is not subject to the total file size of its subdirectories. Since the total file
size of the info directory is divided among its subdirectories, your cluster can
process more job operations before reaching the total size limit of the job files.
If your cluster has a lot of jobs resulting in a large info directory, you can tune your
cluster by enabling LSF to create subdirectories in the info directory. Use
MAX_INFO_DIRS in lsb.params to create the subdirectories and enable mbatchd to
distribute the job files evenly throughout the subdirectories.
Syntax MAX_INFO_DIRS=num_subdirs
Where num_subdirs specifies the number of subdirectories that you want to create
under the LSF_SHAREDIR/cluster_name/logdir/info directory. Valid values are
positive integers between 1 and 1024. By default, MAX_INFO_DIRS is not defined.
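To make the effect of MAX_INFO_DIRS concrete, the following Python sketch models how job files might be spread across subdirectories. It is a rough illustration, not LSF code: the helper info_subdir is hypothetical, and placing a job file by job ID modulo the number of subdirectories is an assumption, since the text states only that mbatchd distributes the files evenly. The valid range of 1 to 1024 comes from the syntax description above.

```python
from collections import Counter

MAX_INFO_DIRS_LIMIT = 1024  # upper bound stated in the syntax description


def info_subdir(job_id, max_info_dirs):
    """Pick a subdirectory index for a job file (hypothetical helper)."""
    if not 1 <= max_info_dirs <= MAX_INFO_DIRS_LIMIT:
        raise ValueError("MAX_INFO_DIRS must be between 1 and 1024")
    # Assumption: modulo placement; the text only says "evenly".
    return job_id % max_info_dirs


# Spread 10,000 consecutive job IDs over 4 subdirectories.
counts = Counter(info_subdir(job_id, 4) for job_id in range(1, 10001))
print(sorted(counts.items()))
# [(0, 2500), (1, 2500), (2, 2500), (3, 2500)]
```

Under this model each subdirectory holds a quarter of the job files, so each one stays further below the per-directory limit than a single info directory would.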