Open Source Object Storage for Unstructured Data: Ceph on HP ProLiant SL4540 Gen8 Servers

When balancing PG usage across all pools, the proportion of PGs allocated should be based on which pool contains the most
objects. The data pool for the object gateway would therefore typically get the lion's share of the placement groups. If there are
multiple pools with high numbers of objects (e.g., a few RBD pools), tuning PG counts becomes more complicated.
Currently, pg_num and pgp_num must be the same; remember to set both values when pools need tuning.
The ×100 ratio can actually vary between about 50 and 100, where lower counts may help lower-powered systems. For
the HP ProLiant SL4540 Gen8 Server under test, plenty of compute resources are available, so a higher number works.
Powers of two are documented as slightly more performant. It is not practical to jump heavily utilized pools a full power
of two every time OSDs are added, but keep this type of growth in mind for planning.
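As a sketch, the sizing guidance above (roughly 100 PGs per OSD, divided by the replica count, rounded up to a power of two) can be worked through in shell arithmetic. The OSD count, ratio, and replica count below are illustrative assumptions, not tested values:

```shell
# Illustrative PG sizing: ~100 PGs per OSD, divided across replicas,
# rounded up to the next power of two.
osds=60        # e.g., one fully loaded 1x60 node (assumption)
ratio=100      # PGs per OSD; can drop toward 50 on weaker CPUs
replicas=3
raw=$(( osds * ratio / replicas ))            # 2000
pg=1
while [ "$pg" -lt "$raw" ]; do pg=$(( pg * 2 )); done
echo "$pg"                                    # 2048
# Apply to a pool; remember both values must match:
#   ceph osd pool set <pool> pg_num 2048
#   ceph osd pool set <pool> pgp_num 2048
```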
PG allocations must maintain a minimum PG count per OSD for the cluster. Running 'ceph -s' will warn if the cluster is under the threshold.
PG count in a pool cannot be lowered. To reduce it, a pool must be deleted and remade, either discarding its data (if the data
isn't important) or first copying the pool contents to another pool through RADOS. Increasing placement groups is therefore not directly reversible.
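The copy-before-delete workaround can be sketched as a short command sequence. The pool names and PG counts here are purely illustrative, and rados cppool performs a full (and potentially slow) object copy, so client writes should be quiesced for the duration:

```shell
# Illustrative only: shrink a pool's PG count by copying it out and back.
ceph osd pool create mypool.new 512 512        # new pool with lower pg_num/pgp_num
rados cppool mypool mypool.new                 # full object copy
ceph osd pool delete mypool mypool --yes-i-really-really-mean-it
ceph osd pool rename mypool.new mypool
```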
Recent Inktank installation documentation has recommended even higher counts of 100-200 PGs per OSD; for large
enterprise clusters with ample CPU and high OSD counts, these higher PG counts may show benefits. HP has not tested with
this tuning in mind.
Higher PG counts take more CPU and rebalance time in exchange for better cluster distribution of objects. Changing PG
count also incurs a rebalance.
Adding extra PGs to a critical pool in anticipation of future OSD expansion can make sense, or PG headroom can be left available for RBD
pool(s). Best practice depends on current and planned cluster use.
SSD journal usage
If data requires significant PUT performance, consider SSDs for data journaling.
Advantages
Separation of the highly sequential journal data from object data (which is distributed across the data partition as
RADOS objects land in their placement groups) means significantly less seeking to the front of the drive for a journal
commit and then seeking elsewhere to write data. It also means that all bandwidth on the spinning media is going to data
IO, approximately doubling bandwidth of PUTs/writes.
Using an SSD device for the journal keeps storage relatively dense because multiple journals can go to the same higher
bandwidth device while not incurring rotating media seek penalties.
Disadvantages
Each SSD in this configuration is more expensive than a drive that could be put in the slot. Journal SSDs reduce the
maximum amount of object storage on the node.
Tying a separate device to multiple OSDs as a journal and using xfs (the default file system with ceph-deploy) means
that loss of the journal device is a loss of all dependent OSDs. With a high enough replica and OSD count this isn’t a
significant additional risk to data durability, but it does mean architecting with that expectation in mind. The btrfs file
system avoids this limitation, but it is not mature enough for some enterprises.
OSDs can’t be hot swapped with separate data and journal devices.
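With the ceph-deploy tooling of this generation, a separate journal device is specified per OSD by appending it after the data disk. The host and device names below are illustrative:

```shell
# Illustrative ceph-deploy syntax: HOST:DATA-DISK:JOURNAL-DEVICE.
# Four OSDs share journal partitions on one SSD (sdk); losing sdk
# therefore takes down all four OSDs when using xfs.
ceph-deploy osd create node1:sdb:/dev/sdk1 node1:sdc:/dev/sdk2 \
                       node1:sdd:/dev/sdk3 node1:sde:/dev/sdk4
```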
Configuration recommendations
For bandwidth, four spinning disks to one SSD is a recommended performance ratio. It’s possible to go with a higher ratio
of spinning to solid state, but that increases the number of OSDs affected by an SSD failure. Also, the SSD could become a
bottleneck; larger ratios of disks to SSD journal should be balanced versus peak spinning media performance.
Journals don't require a lot of capacity, but larger SSDs do provide extra wear leveling. Journal space reserved by
Ceph should cover 10-20 seconds of writes. If each spinning disk peaks at ~150MB/sec, then 4GB of capacity in a given
journal partition is more than a spinning disk will need to meet that buffer requirement.
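The sizing rule above works out as simple arithmetic; the 150 MB/sec peak and the 20-second upper bound are the figures from the text:

```shell
# Journal sizing: reserve 10-20 seconds of a disk's peak write rate.
peak_mb_s=150   # assumed peak throughput for one spinning disk
buffer_s=20     # upper end of the 10-20 second window
echo $(( peak_mb_s * buffer_s ))   # 3000 MB, comfortably under a 4GB partition
```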
A RAID 1 of SSDs is not recommended. Wear leveling makes it likely that paired SSDs will wear out, and need replacement, at similar times. Doubling
the SSDs per node also reduces storage density and increases price per gigabyte. At massive storage scale, it's better to
expect drive failure and plan so failure is easily recoverable and tolerable.
Choose SSDs that match data usage. Consider the number of times the entire device will be written per day versus the
capabilities of the device. If write bandwidth required is in occasional bursts, SLC flash doesn’t make sense.
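A rough endurance check can be sketched the same way. The workload, journal count, and SSD size below are assumptions for illustration, not measured values:

```shell
# Rough drive-writes-per-day (DWPD) estimate for a shared journal SSD.
osds=4          # journals sharing the SSD (assumption)
avg_mb_s=5      # assumed average sustained write rate per OSD
ssd_gb=400      # assumed journal SSD capacity
daily_gb=$(( osds * avg_mb_s * 86400 / 1024 ))   # ~1687 GB written per day
echo $(( daily_gb / ssd_gb ))                    # ~4 full device writes per day
```

Compare the result against the vendor's rated endurance; a bursty workload that averages far lower may not justify SLC flash.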
Logical drive configuration
For a 1x60 configuration, significant CPU cycles must be reserved for 60 OSDs on a single compute node. A fully loaded 1x60 HP ProLiant
SL4540 Gen8 Server could reduce CPU usage by configuring RAID 0 volumes across two drives at a time, resulting in
30 OSDs. Configuring multiple drives in a RAID array can reduce CPU cost for colder storage, in exchange for reduced
storage efficiency to provide reliability. It can also provide more CPU headroom for error handling, or additional resources if
cluster design dictates CPU resource usage outside of cluster-specific tasks.
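On HP Smart Array controllers, this pairing can be expressed with the hpssacli tool. The controller slot and drive bay numbers below are illustrative and vary by system:

```shell
# Illustrative: build one RAID 0 logical drive from two physical drives,
# repeated for each pair to present 30 OSD block devices instead of 60.
hpssacli ctrl slot=0 create type=ld drives=1I:1:1,1I:1:2 raid=0
```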