It’s a good idea to build some performance headroom into estimates. Complex application loads are harder to gauge than a
simple canned test load, and production systems shouldn’t run near their limits so they can better cope with failures and
unexpected load spikes.
Some other things to remember about disk performance (a rough sizing sketch follows this list):
• Replica count means multiple media writes for each object PUT.
• Peak write performance of spinning media without separate journals is roughly halved, because journal and data
partition writes go to the same device.
• With a single 10GbE port, the bandwidth bottleneck is at the port rather than the controller or drives on any fully
disk-populated HP ProLiant SL4540 Gen8 Server node; the controller is capable of about 3 GB/sec, while the effective
peak node bandwidth on a 10GbE link is in the 900 MB/sec-1 GB/sec range out of a theoretical 1.25 GB/sec maximum.
• At smaller object sizes, the bottleneck tends to be the object gateway’s ops/sec capability before network or disk; in
some cases, the bottleneck can be the client’s ability to execute object operations.
• Given the fairly randomly distributed IO load for object data, best-case average performance from spinning media is
about 90-100 MB/sec. Real-world object gateway performance is closer to the 60-70 MB/sec average range per disk;
this is also impacted by object gateways not providing a particularly deep IO queue in observed tests. Peak disk
performance can be higher, which is why a 4:1 spinning-to-SSD journal ratio is recommended.
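The interaction of these limits can be approximated with a simple model. The sketch below (Python; the function name,
parameter names, and default figures are illustrative assumptions drawn from the numbers in this section, not
measurements) estimates effective per-node write bandwidth as the minimum of disk, controller, and network limits,
accounting for journal co-location on spinning media.

# Rough per-node write bandwidth estimate for an OSD node.
# All default figures are illustrative assumptions taken from this section, not measurements.
def node_write_bandwidth_mbs(spinning_disks,
                             per_disk_mbs=65.0,      # observed 60-70 MB/sec average per disk
                             journals_on_ssd=True,   # False halves disk throughput (journal + data on same device)
                             controller_mbs=3000.0,  # controller capable of about 3 GB/sec
                             network_mbs=950.0):     # effective 10GbE, roughly 900 MB/sec-1 GB/sec
    disk_mbs = spinning_disks * per_disk_mbs
    if not journals_on_ssd:
        disk_mbs /= 2.0  # journal and data writes share the same spindles
    return min(disk_mbs, controller_mbs, network_mbs)

# Example: a 2x25 node with 20 spinning disks
print(node_write_bandwidth_mbs(20))                        # 950.0 -> network-bound
print(node_write_bandwidth_mbs(20, journals_on_ssd=False)) # 650.0 -> disk-bound

Replica count then multiplies media writes cluster-wide, so aggregate client write throughput is roughly the sum of the
per-node figures divided by the replica count.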
Capacity versus object count
If the use case focuses on many small objects, it may be necessary to get involved in the details of the file systems mounted
on each OSD. Because RADOS objects are represented as files, each requires an inode to be allocated. Depending on the file
system used and the average object size, it may be necessary to change formatting options to maximize disk usage.
As an example, we’ll refer to the limits for the sample reference configuration. The ceph-deploy program sets up xfs file
systems with 5% of capacity as the maximum usable for inodes (xfs dynamically allocates inodes as needed). Using a 2 KB
xfs inode size on 3 TB drives configured as RAID 0 results in about 73.2 million inodes available per drive. Clearly
these settings would max out inode usage with 1 KB objects well before the drive was full of object data.
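The 73.2 million figure follows directly from the format parameters; a minimal Python sketch reproducing it, assuming
decimal terabytes and the 5% and 2 KB values quoted above:

# Available inodes per drive given the xfs format parameters quoted in the text.
drive_capacity = 3 * 10**12      # 3 TB drive (decimal bytes)
inode_max_pct = 0.05             # ceph-deploy default: at most 5% of capacity for inodes
inode_size = 2048                # 2 KB xfs inode size

max_inodes = drive_capacity * inode_max_pct / inode_size
print(f"{max_inodes / 1e6:.1f} million inodes")   # ~73.2 million

# With 1 KB objects, those inodes are exhausted long before the drive is full:
print(f"{max_inodes * 1024 / 1e9:.0f} GB of 1 KB objects on a 3 TB drive")  # ~75 GB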
If inode limitations are a concern, plan file system format parameters before installing Ceph on the cluster. Installing the
OSDs is more involved when custom file system settings are used; consult the official Ceph documentation for details.
Allocating disks to OSD hosts
Choose the server configuration that fits the use case; for the OSD hosts we’ll cover choices using the HP ProLiant SL4540
Gen8 Server. The 3x15 configuration maximizes per-node disk utilization on smaller network pipes, or offers a greater
network-bandwidth-to-disk ratio. Using 3x15 HP ProLiant SL4540 Gen8 Servers increases compute density in the rack, but
is the least dense choice for storage. The 2x25 and 1x60 configurations progressively improve storage density in the rack
at the expense of compute density, and are therefore good choices for progressively ‘colder’ storage.
Take the drive pool from the first step and divide it into the desired HP ProLiant SL4540 Gen8 Server node configuration. If
SSD journals have been chosen, they’ll reduce capacity per node accordingly. As an example, an HP ProLiant SL4540 Gen8
Server with a 4:1 ratio of spinning disks to SSDs would have 12 spinning disks per node on a 3x15, and 20 spinning disks
per node on a 2x25. SSD journals are not recommended on a 1x60 density-optimized configuration: replacing a spinning
media slot with SSDs is counter to the focus on density, and the attempt to increase drive write performance runs into
server architectural limitations, for example the ratio of disk to network bandwidth.
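As a sanity check on the drive split, a small Python sketch (the helper name and ratio handling are our own) that divides a
node’s drive bays between spinning media and SSD journals at the 4:1 ratio used above:

# Split a node's drive bays between spinning media and SSD journals at a given
# spinning-to-SSD ratio (4:1 in this reference configuration).
def split_drive_bays(bays_per_node, spinning_per_ssd=4):
    groups = bays_per_node // (spinning_per_ssd + 1)
    ssds = groups
    spinning = bays_per_node - ssds
    return spinning, ssds

print(split_drive_bays(15))  # 3x15 node -> (12, 3): 12 spinning disks, 3 SSD journals
print(split_drive_bays(25))  # 2x25 node -> (20, 5): 20 spinning disks, 5 SSD journals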
As part of designing towards homogeneity, adjust drive counts to divide storage evenly across compute nodes where
possible. Once the number of disks is chosen, decide how storage will be configured in logical volumes (see Logical Drive
Configuration under Cluster Tuning) and select system CPU and memory to match the number of OSDs.
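A node sizing sketch along those lines is shown below (Python). The per-OSD CPU and memory figures are illustrative rules
of thumb we assume for the example, not recommendations from this document; substitute values appropriate to the Ceph
release and workload.

# Illustrative node sizing from OSD count. The per-OSD figures are assumptions
# for the example, not values taken from this reference architecture.
def size_osd_node(osd_count,
                  ghz_per_osd=1.0,     # assumed: ~1 GHz of CPU per OSD daemon
                  ram_gb_per_osd=2.0): # assumed: ~2 GB RAM per OSD daemon
    return {
        "cpu_ghz": osd_count * ghz_per_osd,
        "ram_gb": osd_count * ram_gb_per_osd,
    }

# Example: a 2x25 node running one OSD per spinning disk (20 OSDs)
print(size_osd_node(20))  # {'cpu_ghz': 20.0, 'ram_gb': 40.0}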
Choosing a network infrastructure
Consider the desired storage bandwidth calculated above, the overhead of replication traffic, and the network configuration
of the object gateway’s data network (number of ports/total bandwidth). Details of traffic segmentation, load balancer
configuration, VLAN setup, and other networking configuration best practices are very use-case specific and outside the
scope of this document.
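To make the replication overhead concrete, a small Python sketch (the function and parameter names are ours) estimating
the extra network traffic generated behind each client write, assuming the primary OSD forwards one copy to each
additional replica:

# Network traffic generated by client object writes, assuming each write is
# forwarded by the primary OSD to (replicas - 1) other OSDs.
def write_traffic_mbs(client_write_mbs, replicas=3):
    replication_mbs = client_write_mbs * (replicas - 1)
    return {
        "client_to_cluster_mbs": client_write_mbs,
        "osd_to_osd_replication_mbs": replication_mbs,
        "total_network_mbs": client_write_mbs + replication_mbs,
    }

# Example: 500 MB/sec of client PUT traffic with 3x replication
print(write_traffic_mbs(500))
# {'client_to_cluster_mbs': 500, 'osd_to_osd_replication_mbs': 1000, 'total_network_mbs': 1500}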
Typical configurations for data traffic use one or two 1GbE or 10GbE networks. Cold object storage use cases may be
satisfied with data access over lower bandwidth, but consider that 10GbE is also useful for faster rebuild and recovery
between OSDs: replicating 1TB of data across a 1GbE network takes about three hours, while with 10GbE it would take
about 20 minutes. If more network ports are needed, an additional NIC can be placed in the server’s PCIe slot.
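Those rebuild-time figures follow directly from effective link bandwidth; a quick check in Python, using effective rates we
assume of roughly 95 MB/sec for 1GbE and 900 MB/sec for 10GbE:

# Time to move 1 TB of data at assumed effective link rates.
data_bytes = 1 * 10**12  # 1 TB

for label, effective_mbs in [("1GbE", 95), ("10GbE", 900)]:
    seconds = data_bytes / (effective_mbs * 10**6)
    print(f"{label}: {seconds / 60:.0f} minutes")
# 1GbE:  ~175 minutes (about 3 hours)
# 10GbE: ~19 minutes (about 20 minutes)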
Network redundancy (active/passive configurations, redundant switching) is not recommended, as scale-out
configurations gain significant reliability from compute and disk node redundancy and proper failure domain
configuration. Reflect the network topology (where the switches and rack interconnects are) in the CRUSH map to
define how replicas are distributed.