Platform LSF HPC
Interruptible backfill
Designed to improve cluster utilization, the new interruptible backfill scheduling policy allows reserved job slots to be used by low-priority small jobs, which are terminated when the higher-priority large jobs are about to start.
An interruptible backfill job:
Starts as a regular job and is killed when it exceeds the queue runtime limit
OR
Is started for backfill whenever there is a backfill time slice longer than the specified minimal time, and is killed before the slot-reservation job is about to start
Interruptible backfill applies to compute-intensive serial or single-node parallel jobs that can run for a long time, yet are able to checkpoint and resume from an arbitrary computation point.
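Interruptible backfill is enabled per queue in lsb.queues. The fragment below is only a minimal sketch: the queue name and values are placeholders, and the exact spelling of the INTERRUPTIBLE_BACKFILL parameter and its units should be confirmed against the lsb.queues reference for this release.

    Begin Queue
    QUEUE_NAME             = ib_backfill
    PRIORITY               = 30
    RUNLIMIT               = 60       # queue-level run limit, required for backfill
    BACKFILL               = Y        # allow this queue to use reserved slots
    INTERRUPTIBLE_BACKFILL = 1        # minimal backfill time slice (assumed syntax)
    DESCRIPTION            = Low-priority jobs that may be killed when reserving jobs start
    End Queue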
Resource granularity
Allows greater flexibility in how numeric resources are reserved by jobs. Resources may be reserved on a per-slot, per-job, or per-host basis.
The cluster-wide RESOURCE_RESERVE_PER_SLOT parameter in lsb.params is obsolete. Resource reservation is now configured on a per-resource basis as PER_JOB, PER_SLOT, or PER_HOST in the ReservationUsage section in lsb.resources. This configuration overrides RESOURCE_RESERVE_PER_SLOT if it also exists.
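For illustration, a ReservationUsage section in lsb.resources might look like the following sketch. The resource names are placeholders for numeric resources defined in the cluster; only the PER_JOB, PER_SLOT, and PER_HOST methods come from the description above.

    Begin ReservationUsage
    RESOURCE     METHOD
    mem          PER_SLOT
    tmp          PER_JOB
    app_lic      PER_HOST
    End ReservationUsage

A resource listed here is then reserved once per slot, once per job, or once per execution host, regardless of how many slots the job takes on that host.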
Time-based slot reservation and greedy slot reservation
Existing LSF slot reservation works in simple environments, where host-based MXJ limits are the only constraint on job slot requests. In complex environments, where more than one constraint exists:
Estimated job start times become inaccurate
The scheduler makes reservation decisions that can postpone the estimated job start time or decrease cluster utilization.
Current slot reservation (RESERVE_BY_STARTTIME) resolves several reservation issues across multiple candidate host groups, but it cannot help in other cases:
Special topology requests, such as span[ptile=n] (see the example after this list)
Reservations are calculated and displayed only if the host has free slots. Reservations may change or disappear if there are no free CPUs; for example, if a backfill job takes all reserved CPUs.
For HPC machines containing many internal nodes, the host-level number of reserved slots is not enough for administrators and end users to tell which CPUs a job is reserving and waiting for.
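As an example of a topology request, a submission such as the following (job size and application name are placeholders) asks for 64 slots laid out 4 per host, a constraint that host-level slot counting alone cannot express:

    bsub -n 64 -R "span[ptile=4]" ./parallel_app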
With time-based reservation, a set of pending jobs gets a future allocation and an estimated start time. The size of the job set is determined by cluster capacity and by LSB_TIME_RESERVE_NUMJOBS in lsf.conf.
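For example, the number of pending jobs that are given a future allocation could be capped with a single line in lsf.conf; the value below is an arbitrary placeholder, and the default and valid range should be taken from the lsf.conf reference:

    LSB_TIME_RESERVE_NUMJOBS=64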
If a job cannot be placed in a future allocation, the scheduler uses greedy slot reservation to reserve slots. Existing LSF slot reservation is a simple greedy algorithm:
It considers only the currently available resources and the minimal number of requested job slots, and reserves as many slots as it is allowed
For multiple exclusive candidate host groups, the scheduler goes through those groups and makes the reservation on the group that has the most available slots
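For reference, slot reservation itself is turned on per queue; greedy reservation then applies to jobs that do not receive a future allocation. The fragment below is a minimal sketch of such a queue in lsb.queues, assuming the SLOT_RESERVE syntax shown; verify the exact form against the lsb.queues reference for this release.

    Begin Queue
    QUEUE_NAME   = normal
    PRIORITY     = 40
    SLOT_RESERVE = MAX_RESERVE_TIME[60]   # keep a pending job's reservation for a bounded time
    DESCRIPTION  = Default queue with slot reservation enabled
    End Queue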