LSF Version 7.3 - Administering Platform LSF

Achieving Performance and Scalability

This instructs mbatchd to check every two hours whether the events file has logged

1000 batch job completions. Together, the two parameters control the frequency of

events file switching as follows:

◆ After two hours, mbatchd checks the number of completed batch jobs. If 1000

completed jobs have been logged, it switches the events file.

◆ If 1000 jobs complete after five minutes, mbatchd does not switch the events file

until the end of the two-hour period.
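
The interaction between the two parameters can be modeled in a few lines. This is a minimal sketch, not mbatchd's actual implementation: it assumes that the check runs exactly at multiples of the switch period and that the count of logged completions resets after each switch. The function name `switch_times` and its arguments are hypothetical.

```python
def switch_times(completion_times, max_job_num=1000, min_switch_period=7200,
                 horizon=None):
    """Return the times (in seconds) at which the modeled events file switches.

    completion_times: seconds (since the last switch) at which jobs completed.
    horizon: last moment to simulate; defaults to one period past the last job.
    """
    times = sorted(completion_times)
    if horizon is None:
        horizon = (times[-1] if times else 0) + min_switch_period

    switches = []
    next_job = 0   # index of the first completion not yet counted
    logged = 0     # completions logged since the last switch (assumed to reset)
    check = min_switch_period
    while check <= horizon:
        # Count every completion that happened up to this periodic check.
        while next_job < len(times) and times[next_job] <= check:
            logged += 1
            next_job += 1
        # The switch happens only at a check, and only if enough jobs are logged.
        if logged >= max_job_num:
            switches.append(check)
            logged = 0
        check += min_switch_period
    return switches
```

With the defaults above, `switch_times([300] * 1000)` returns `[7200]`: the 1000 completions at the five-minute mark do not trigger a switch until the two-hour check.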

TIP: For large clusters, set MIN_SWITCH_PERIOD to a value equal to or greater than 600 seconds. This

causes mbatchd to fork a child process that handles event switching, thereby reducing the load

on mbatchd. mbatchd terminates the child process and appends delta events to the new events

file after the MIN_SWITCH_PERIOD has elapsed. If you define a value less than 600 seconds, mbatchd

does not fork a child process for event switching.
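
As a quick way to audit a configuration against this tip, the sketch below reads MIN_SWITCH_PERIOD from the text of an lsb.params file and reports whether the 600-second threshold is met. It assumes parameters appear as `NAME = VALUE` lines with `#` starting a comment. The helper names are hypothetical, and treating an undefined parameter as "no fork" is an assumption of this sketch.

```python
import re

FORK_THRESHOLD_SECONDS = 600  # threshold stated in the tip above


def read_param(config_text, name):
    """Return the last value assigned to `name`, or None if it is not defined."""
    value = None
    for line in config_text.splitlines():
        active = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.fullmatch(r"(\w+)\s*=\s*(\S+)", active)
        if match and match.group(1) == name:
            value = match.group(2)
    return value


def forks_child_for_switching(config_text):
    """True if MIN_SWITCH_PERIOD is defined and is at least 600 seconds."""
    raw = read_param(config_text, "MIN_SWITCH_PERIOD")
    # Assumption: an undefined parameter is treated as "no child process".
    return raw is not None and int(raw) >= FORK_THRESHOLD_SECONDS


# Illustrative file contents, not taken from a real cluster.
sample = """
Begin Parameters
MAX_JOB_NUM = 1000
MIN_SWITCH_PERIOD = 7200   # two hours
End Parameters
"""
print(forks_child_for_switching(sample))
```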

Automatic load updating

Periodically, the LIM daemons exchange load information. In large clusters, let LSF

automatically load the information by dynamically adjusting the period based on

the load.

IMPORTANT: For automatic tuning of the loading interval, make sure the parameter

EXINTERVAL is not defined in the lsf.cluster.cluster_name file. Do not configure your cluster

to load the information at specific intervals.
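
One way to confirm that EXINTERVAL is not defined is to scan the cluster file for an active assignment. The following is a minimal sketch that assumes `NAME=VALUE` lines and `#` comments. The function name `exinterval_lines` is hypothetical, and the sample text is illustrative only.

```python
def exinterval_lines(cluster_file_text):
    """Return (line_number, line) pairs where EXINTERVAL is actively defined."""
    hits = []
    for number, line in enumerate(cluster_file_text.splitlines(), start=1):
        active = line.split("#", 1)[0].strip()  # ignore commented-out text
        if "=" not in active:
            continue
        key = active.split("=", 1)[0].strip()
        if key == "EXINTERVAL":
            hits.append((number, line.strip()))
    return hits


# Illustrative contents: one commented-out and one active definition.
sample_cluster_file = """Begin Parameters
# EXINTERVAL=15
EXINTERVAL = 30
End Parameters"""

for number, line in exinterval_lines(sample_cluster_file):
    print(f"line {number}: remove or comment out '{line}' to allow automatic tuning")
```

An empty result means the file leaves the interval to LSF.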

Managing the I/O performance of the info directory

In large clusters, users submit large numbers of jobs. Since each

job generally has a job file, this results in a large number of job files stored in the

LSF_SHAREDIR/cluster_name/logdir/info directory at any time. When the total

size of the job files reaches a certain point, you will notice a significant delay when

performing I/O operations in the info directory.

This delay is caused by a limit on the total size of files that can reside in a file server

directory. This limit is dependent on the file system implementation. A high load

on the file server delays the master batch daemon operations, and therefore slows

down the overall cluster throughput.

You can prevent this delay by creating and using subdirectories under the parent

directory. Each new subdirectory is subject to the file size limit, but the parent

directory is not subject to the total file size of its subdirectories. Since the total file

size of the info directory is divided among its subdirectories, your cluster can

process more job operations before reaching the total size limit of the job files.

If your cluster has a lot of jobs resulting in a large info directory, you can tune your

cluster by enabling LSF to create subdirectories in the info directory. Use

MAX_INFO_DIRS in lsb.params to create the subdirectories and enable mbatchd to

distribute the job files evenly throughout the subdirectories.

Syntax

MAX_INFO_DIRS=num_subdirs

Where num_subdirs specifies the number of subdirectories that you want to create

under the LSF_SHAREDIR/cluster_name/logdir/info directory. Valid values are

positive integers between 1 and 1024. By default, MAX_INFO_DIRS is not defined.
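
To show why spreading job files across subdirectories keeps each directory small, here is a minimal sketch. It validates the range given above, treating both 1 and 1024 as valid (an assumption), and then assigns each job file to a subdirectory by job ID modulo num_subdirs. The modulo rule and the integer subdirectory names are also assumptions made for illustration; the text above states only that mbatchd distributes the job files evenly. All function names are hypothetical.

```python
from collections import Counter


def validate_max_info_dirs(num_subdirs):
    """Check the documented range, assumed here to include both endpoints."""
    if not isinstance(num_subdirs, int) or not 1 <= num_subdirs <= 1024:
        raise ValueError("MAX_INFO_DIRS must be an integer between 1 and 1024")
    return num_subdirs


def subdir_for_job(job_id, num_subdirs):
    """Pick a subdirectory name for a job file (assumed modulo placement)."""
    return str(job_id % validate_max_info_dirs(num_subdirs))


def files_per_subdir(job_ids, num_subdirs):
    """Count how many job files land in each subdirectory."""
    return Counter(subdir_for_job(job_id, num_subdirs) for job_id in job_ids)


# 10,000 consecutive job IDs spread over 10 subdirectories.
counts = files_per_subdir(range(1, 10001), 10)
print(len(counts), max(counts.values()))
```

In this model, each of the 10 subdirectories holds 1,000 job files instead of a single info directory holding all 10,000.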