Open Source Object Storage for Unstructured Data: Ceph on HP ProLiant SL4540 Gen8 Servers

When balancing PG usage across all pools, the proportion of PGs allocated should be based on which pool contains the most
objects. The data pool for the object gateway would therefore typically get the lion's share of the placement groups. If there are
multiple pools with high numbers of objects (e.g., a few RBD pools), tuning PG counts becomes more complicated.
Currently, pg_num and pgp_num must be the same; remember to set both values when pools need tuning.
The ×100 ratio can actually vary between about 50 and 100, where lower counts may help lower-powered systems. For
the HP ProLiant SL4540 Gen8 Server under test, plenty of compute resources are available, so a higher number works.
Powers of two are documented as slightly more performant. It is not practical to jump heavily utilized pools a full power
of two every time OSDs are added, but keep this type of growth in mind for planning.
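As a sketch, the sizing guidance above (roughly 100 PGs per OSD, divided by the replica count, rounded up to a power of two) can be worked through in shell arithmetic. The OSD count, ratio, and replica count below are illustrative assumptions, not tested values:

```shell
# Illustrative PG sizing: ~100 PGs per OSD, divided across replicas,
# rounded up to the next power of two.
osds=60        # e.g., one fully loaded 1x60 node (assumption)
ratio=100      # PGs per OSD; can drop toward 50 on weaker CPUs
replicas=3
raw=$(( osds * ratio / replicas ))            # 2000
pg=1
while [ "$pg" -lt "$raw" ]; do pg=$(( pg * 2 )); done
echo "$pg"                                    # 2048
# Apply to a pool; remember both values must match:
#   ceph osd pool set <pool> pg_num 2048
#   ceph osd pool set <pool> pgp_num 2048
```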
PG allocations must maintain a minimum PG count per OSD for the cluster. Running 'ceph -s' will warn if the cluster is under the threshold.
PG count in a pool cannot be lowered. To reduce it, a pool must be deleted and remade, either discarding its data (if the data
isn't important) or first copying the pool contents to another pool through RADOS. Increasing placement groups is therefore not directly reversible.
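The copy-before-delete workaround can be sketched as a short command sequence. The pool names and PG counts here are purely illustrative, and rados cppool performs a full (and potentially slow) object copy, so client writes should be quiesced for the duration:

```shell
# Illustrative only: shrink a pool's PG count by copying it out and back.
ceph osd pool create mypool.new 512 512        # new pool with lower pg_num/pgp_num
rados cppool mypool mypool.new                 # full object copy
ceph osd pool delete mypool mypool --yes-i-really-really-mean-it
ceph osd pool rename mypool.new mypool
```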
Recent Inktank installation documentation has recommended even higher counts of 100-200 PGs per OSD; for large
enterprise clusters with ample CPU and high OSD counts, these higher PG counts may show benefits. HP has not tested with
this tuning in mind.
Higher PG counts take more CPU and rebalance time in exchange for better cluster distribution of objects. Changing PG
count also incurs a rebalance.
Adding extra PGs to a critical pool in anticipation of future OSD expansion can make sense, or PG headroom can be left available for RBD
pool(s). Best practice depends on current and planned cluster use.
SSD journal usage
If data requires significant PUT performance, consider SSDs for data journaling.
Advantages
Separation of the highly sequential journal data from object data (which is distributed across the data partition as
RADOS objects land in their placement groups) means significantly less seeking to the front of the drive for a journal
commit and then seeking elsewhere to write data. It also means that all bandwidth on the spinning media is going to data
IO, approximately doubling bandwidth of PUTs/writes.
Using an SSD device for the journal keeps storage relatively dense because multiple journals can go to the same higher
bandwidth device while not incurring rotating media seek penalties.
Disadvantages
Each SSD in this configuration is more expensive than a drive that could be put in the slot. Journal SSDs reduce the
maximum amount of object storage on the node.
Tying a separate device to multiple OSDs as a journal and using xfs (the default file system with ceph-deploy) means
that loss of the journal device is a loss of all dependent OSDs. With a high enough replica and OSD count this isn’t a
significant additional risk to data durability, but it does mean architecting with that expectation in mind. The btrfs file
system avoids this limitation, but it is not mature enough for some enterprises.
OSDs can’t be hot swapped with separate data and journal devices.
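With the ceph-deploy tooling of this generation, a separate journal device is specified per OSD by appending it after the data disk. The host and device names below are illustrative:

```shell
# Illustrative ceph-deploy syntax: HOST:DATA-DISK:JOURNAL-DEVICE.
# Four OSDs share journal partitions on one SSD (sdk); losing sdk
# therefore takes down all four OSDs when using xfs.
ceph-deploy osd create node1:sdb:/dev/sdk1 node1:sdc:/dev/sdk2 \
                       node1:sdd:/dev/sdk3 node1:sde:/dev/sdk4
```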
Configuration recommendations
For bandwidth, four spinning disks to one SSD is a recommended performance ratio. It’s possible to go with a higher ratio
of spinning to solid state, but that increases the number of OSDs affected by an SSD failure. Also, the SSD could become a
bottleneck; larger ratios of disks to SSD journal should be balanced versus peak spinning media performance.
Journals don't require a lot of capacity, but larger SSDs do provide extra wear leveling. Journal space reserved by
Ceph should cover 10-20 seconds of writes. If each spinning disk peaks at ~150MB/sec, then 4GB of capacity in a given
journal partition is more than a spinning disk will need to meet that buffer requirement.
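The sizing rule above works out as simple arithmetic; the 150 MB/sec peak and the 20-second upper bound are the figures from the text:

```shell
# Journal sizing: reserve 10-20 seconds of a disk's peak write rate.
peak_mb_s=150   # assumed peak throughput for one spinning disk
buffer_s=20     # upper end of the 10-20 second window
echo $(( peak_mb_s * buffer_s ))   # 3000 MB, comfortably under a 4GB partition
```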
A RAID 1 of SSDs is not recommended. Wear leveling makes it likely that paired SSDs will wear out, and need replacement, at similar times. Doubling
the SSDs per node also reduces storage density and increases price per gigabyte. At massive storage scale, it's better to
expect drive failure and plan so failure is easily recoverable and tolerable.
Choose SSDs that match data usage. Consider the number of times the entire device will be written per day versus the
capabilities of the device. If write bandwidth required is in occasional bursts, SLC flash doesn’t make sense.
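A rough endurance check can be sketched the same way. The workload, journal count, and SSD size below are assumptions for illustration, not measured values:

```shell
# Rough drive-writes-per-day (DWPD) estimate for a shared journal SSD.
osds=4          # journals sharing the SSD (assumption)
avg_mb_s=5      # assumed average sustained write rate per OSD
ssd_gb=400      # assumed journal SSD capacity
daily_gb=$(( osds * avg_mb_s * 86400 / 1024 ))   # ~1687 GB written per day
echo $(( daily_gb / ssd_gb ))                    # ~4 full device writes per day
```

Compare the result against the vendor's rated endurance; a bursty workload that averages far lower may not justify SLC flash.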
Logical drive configuration
For a 1x60 configuration, significant CPU cycles must be reserved for 60 OSDs on a single compute node. A fully loaded 1x60 HP ProLiant
SL4540 Gen8 Server could reduce CPU usage by configuring RAID 0 volumes across two drives at a time, resulting in
30 OSDs. Configuring multiple drives in a RAID array can reduce CPU cost for colder storage, in exchange for reduced
storage efficiency to provide reliability. It can also provide more CPU headroom for error handling, or additional resources if
cluster design dictates CPU resource usage outside of cluster-specific tasks.
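On HP Smart Array controllers, this pairing can be expressed with the hpssacli tool. The controller slot and drive bay numbers below are illustrative and vary by system:

```shell
# Illustrative: build one RAID 0 logical drive from two physical drives,
# repeated for each pair to present 30 OSD block devices instead of 60.
hpssacli ctrl slot=0 create type=ld drives=1I:1:1,1I:1:2 raid=0
```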