It’s a good idea to build some performance headroom into estimates. Complex application loads are harder to gauge than a
simple canned test load, and production systems shouldn’t run near their limits so they can better cope with failures and
unexpected load spikes.
Some other things to remember about disk performance (a rough sizing sketch follows this list):
• Replica count means multiple media writes for each object PUT.
• Peak write performance of spinning media without separate journals is roughly halved, because journal and data
partition writes go to the same device.
• With a single 10GbE port, the bandwidth bottleneck is at the port rather than the controller or drives on any fully
disk-populated HP ProLiant SL4540 Gen8 Server node; the controller is capable of about 3 GB/sec, while the effective
peak node bandwidth on a 10GbE link is in the 900 MB/sec-1 GB/sec range out of a theoretical 1.25 GB/sec maximum.
• At smaller object sizes, the bottleneck tends to be the object gateway’s ops/sec capability before network or disk; in
some cases, the bottleneck can be the client’s ability to execute object operations.
• Given the fairly randomly distributed IO load for object data, best-case average performance from spinning media is
about 90-100 MB/sec. Real-world object gateway performance is closer to the 60-70 MB/sec average range per disk;
this is also impacted by object gateways not providing a particularly deep IO queue in observed tests. Peak disk
performance can be higher, which is why a 4:1 spinning-to-SSD journal ratio is recommended.
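The interaction of these limits can be approximated with a simple model. The sketch below (Python; the function name,
parameter names, and default figures are illustrative assumptions drawn from the numbers in this section, not
measurements) estimates effective per-node write bandwidth as the minimum of disk, controller, and network limits,
accounting for journal co-location on spinning media.

# Rough per-node write bandwidth estimate for an OSD node.
# All default figures are illustrative assumptions taken from this section, not measurements.
def node_write_bandwidth_mbs(spinning_disks,
                             per_disk_mbs=65.0,      # observed 60-70 MB/sec average per disk
                             journals_on_ssd=True,   # False halves disk throughput (journal + data on same device)
                             controller_mbs=3000.0,  # controller capable of about 3 GB/sec
                             network_mbs=950.0):     # effective 10GbE, roughly 900 MB/sec-1 GB/sec
    disk_mbs = spinning_disks * per_disk_mbs
    if not journals_on_ssd:
        disk_mbs /= 2.0  # journal and data writes share the same spindles
    return min(disk_mbs, controller_mbs, network_mbs)

# Example: a 2x25 node with 20 spinning disks
print(node_write_bandwidth_mbs(20))                        # 950.0 -> network-bound
print(node_write_bandwidth_mbs(20, journals_on_ssd=False)) # 650.0 -> disk-bound

Replica count then multiplies media writes cluster-wide, so aggregate client write throughput is roughly the sum of the
per-node figures divided by the replica count.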
Capacity versus object count
If the use case focuses on many small objects, it may be necessary to get involved in the details of the file systems mounted
on each OSD. Because RADOS objects are represented as files, each requires an inode to be allocated. Depending on the file
system used and the average object size, it may be necessary to change formatting options to maximize disk usage.
As an example, we’ll refer to the limits for the sample reference configuration. The ceph-deploy program sets up xfs file
systems with 5% of capacity as the maximum usable for inodes (xfs dynamically allocates inodes as needed). Using a 2 KB
xfs inode size on 3 TB drives configured as RAID 0 results in about 73.2 million inodes available per drive. Clearly
these settings would max out inode usage with 1 KB objects well before the drive was full of object data.
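The 73.2 million figure follows directly from the format parameters; a minimal Python sketch reproducing it, assuming
decimal terabytes and the 5% and 2 KB values quoted above:

# Available inodes per drive given the xfs format parameters quoted in the text.
drive_capacity = 3 * 10**12      # 3 TB drive (decimal bytes)
inode_max_pct = 0.05             # ceph-deploy default: at most 5% of capacity for inodes
inode_size = 2048                # 2 KB xfs inode size

max_inodes = drive_capacity * inode_max_pct / inode_size
print(f"{max_inodes / 1e6:.1f} million inodes")   # ~73.2 million

# With 1 KB objects, those inodes are exhausted long before the drive is full:
print(f"{max_inodes * 1024 / 1e9:.0f} GB of 1 KB objects on a 3 TB drive")  # ~75 GB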
If inode limitations are a concern, plan file system format parameters before installing Ceph on the cluster. Installing the
OSDs is more involved when custom file system settings are used; consult the official Ceph documentation for details.
Allocating disks to OSD hosts
Choose the server configuration that fits the use case; for the OSD hosts we’ll cover choices using the HP ProLiant SL4540
Gen8 Server. The 3x15 configuration maximizes per-node disk utilization on smaller network pipes, or offers a greater
network-bandwidth-to-disk ratio. Using 3x15 HP ProLiant SL4540 Gen8 Servers increases compute density in the rack, but
is the least dense choice for storage. The 2x25 and 1x60 configurations progressively improve storage density in the rack
at the expense of compute density, and are therefore good choices for progressively ‘colder’ storage.
Take the drive pool from the first step and divide it into the desired HP ProLiant SL4540 Gen8 Server node configuration. If
SSD journals have been chosen, they’ll reduce capacity per node accordingly. As an example, an HP ProLiant SL4540 Gen8
Server with a 4:1 ratio of spinning disks to SSDs would have 12 spinning disks per node on a 3x15, and 20 spinning disks
per node on a 2x25. SSD journals are not recommended on a 1x60 density-optimized configuration: replacing a spinning
media slot with SSDs is counter to the focus on density, and the attempt to increase drive write performance runs into
server architectural limitations, for example the ratio of disk to network bandwidth.
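As a sanity check on the drive split, a small Python sketch (the helper name and ratio handling are our own) that divides a
node’s drive bays between spinning media and SSD journals at the 4:1 ratio used above:

# Split a node's drive bays between spinning media and SSD journals at a given
# spinning-to-SSD ratio (4:1 in this reference configuration).
def split_drive_bays(bays_per_node, spinning_per_ssd=4):
    groups = bays_per_node // (spinning_per_ssd + 1)
    ssds = groups
    spinning = bays_per_node - ssds
    return spinning, ssds

print(split_drive_bays(15))  # 3x15 node -> (12, 3): 12 spinning disks, 3 SSD journals
print(split_drive_bays(25))  # 2x25 node -> (20, 5): 20 spinning disks, 5 SSD journals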
As part of designing towards homogeneity, adjust drive counts to divide storage evenly across compute nodes where
possible. Once the number of disks is chosen, decide how storage will be configured in logical volumes (see Logical Drive
Configuration under Cluster Tuning) and select system CPU and memory to match the number of OSDs.
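A node sizing sketch along those lines is shown below (Python). The per-OSD CPU and memory figures are illustrative rules
of thumb we assume for the example, not recommendations from this document; substitute values appropriate to the Ceph
release and workload.

# Illustrative node sizing from OSD count. The per-OSD figures are assumptions
# for the example, not values taken from this reference architecture.
def size_osd_node(osd_count,
                  ghz_per_osd=1.0,     # assumed: ~1 GHz of CPU per OSD daemon
                  ram_gb_per_osd=2.0): # assumed: ~2 GB RAM per OSD daemon
    return {
        "cpu_ghz": osd_count * ghz_per_osd,
        "ram_gb": osd_count * ram_gb_per_osd,
    }

# Example: a 2x25 node running one OSD per spinning disk (20 OSDs)
print(size_osd_node(20))  # {'cpu_ghz': 20.0, 'ram_gb': 40.0}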
Choosing a network infrastructure
Consider the desired storage bandwidth calculated above, the overhead of replication traffic, and the network configuration
of the object gateway’s data network (number of ports/total bandwidth). Details of traffic segmentation, load balancer
configuration, VLAN setup, and other networking configuration best practices are very use-case specific and outside the
scope of this document.
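To make the replication overhead concrete, a small Python sketch (the function and parameter names are ours) estimating
the extra network traffic generated behind each client write, assuming the primary OSD forwards one copy to each
additional replica:

# Network traffic generated by client object writes, assuming each write is
# forwarded by the primary OSD to (replicas - 1) other OSDs.
def write_traffic_mbs(client_write_mbs, replicas=3):
    replication_mbs = client_write_mbs * (replicas - 1)
    return {
        "client_to_cluster_mbs": client_write_mbs,
        "osd_to_osd_replication_mbs": replication_mbs,
        "total_network_mbs": client_write_mbs + replication_mbs,
    }

# Example: 500 MB/sec of client PUT traffic with 3x replication
print(write_traffic_mbs(500))
# {'client_to_cluster_mbs': 500, 'osd_to_osd_replication_mbs': 1000, 'total_network_mbs': 1500}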
Typical configurations for data traffic use one or two 1GbE or 10GbE networks. Cold object storage use cases may be
satisfied with data access over lower bandwidth, but consider that 10GbE is also useful for faster rebuild and recovery
between OSDs: replicating 1TB of data across a 1GbE network takes about three hours, while with 10GbE it would take
about 20 minutes. If more network ports are needed, an additional NIC can be placed in the server’s PCIe slot.
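Those rebuild-time figures follow directly from effective link bandwidth; a quick check in Python, using effective rates we
assume of roughly 95 MB/sec for 1GbE and 900 MB/sec for 10GbE:

# Time to move 1 TB of data at assumed effective link rates.
data_bytes = 1 * 10**12  # 1 TB

for label, effective_mbs in [("1GbE", 95), ("10GbE", 900)]:
    seconds = data_bytes / (effective_mbs * 10**6)
    print(f"{label}: {seconds / 60:.0f} minutes")
# 1GbE:  ~175 minutes (about 3 hours)
# 10GbE: ~19 minutes (about 20 minutes)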
Network redundancy (active/passive configurations, redundant switching) is not recommended, as scale-out
configurations gain significant reliability from compute and disk node redundancy and proper failure domain
configuration. Reflect the network topology (where the switches and rack interconnects are) in the CRUSH map to
define how replicas are distributed.