Open Source Object Storage for Unstructured Data: Ceph on HP ProLiant SL4540 Gen8 Servers

Configuration guidance
This section covers how to create a Ceph cluster that fits your business needs. The basic strategy for building a cluster is this:
with a desired capacity and workload in mind, understand where the performance bottlenecks are for the use case and what
failure domains the cluster configuration introduces.
Building your own cluster
General configuration recommendations
The slowest performer is the weakest link for performance in a pool. Typically, OSD hosts should be configured with the
same quantity, type, and configuration of storage. There are reasons to deviate from this guidance (pools limited to specific
drives or hosts, or federation being more important than performance), but it is a good design principle.
A minimum-size cluster has at least three compute nodes hosting OSDs to distribute the three replicas. A minimum
recommended cluster would have at least six compute nodes. The additional nodes provide more room for unstructured
data growth, distribute the per-node operational load, and make each component less of a bottleneck, as the sketch below illustrates.
If the minimum recommended cluster size sounds large, consider whether Ceph is the right solution. Smaller amounts of
storage that don’t grow at unstructured data scales could stay on traditional block and file, or leverage an object interface
on a file-focused storage target. Smaller Ceph clusters do make sense if the use case requires features of Swift/S3
RESTful interfaces. If the planned solution starts small but scales quickly past the minimum cluster size, then it will
benefit from the features of Ceph on HP hardware.
Ceph clusters can scale to exabyte levels, and you can easily add storage as needed. But failure domain impacts must be
considered as hardware is added. Even three-way replication may reach an unacceptable data durability level with
enough OSDs. Also, what was a sufficient failure domain in the initial CRUSH map may no longer represent the cluster
well as network and power elements are added. Design assuming elements will fail at scale.
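The durability concern can be sanity-checked with a crude model. The sketch below is only a rough illustration: the annual failure rate, recovery window, and the assumption that any three concurrent OSD failures destroy some placement group are all simplifying assumptions, not measured values, but it shows how the loss probability grows rapidly with OSD count at a fixed replica count.

```python
# Crude durability estimate: chance per year that some placement group loses
# all three replicas before recovery completes. Every input is an
# illustrative assumption, not a measured or vendor-supplied figure.
from math import comb

def annual_loss_probability(osds, afr=0.03, recovery_hours=8.0, replicas=3):
    """Rough model: a first OSD failure occurs, then (replicas - 1) more OSDs
    fail within the recovery window. Assumes the cluster has enough PGs that
    any such combination wipes out some PG's copies."""
    p_window = afr * recovery_hours / (365 * 24)   # one OSD failing during recovery
    concurrent = comb(osds - 1, replicas - 1) * p_window ** (replicas - 1)
    return min(1.0, osds * afr * concurrent)

for osds in (18, 90, 900):
    print(f"{osds:4d} OSDs -> ~{annual_loss_probability(osds):.1e} annual loss probability")
```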
Cluster sizing
Compute and memory
For the OSD hosts, the recommendation is to reserve 1GHz of an Intel Xeon core per OSD daemon. If other
tasks run on these cluster nodes, consider the sample data in the CPU results chart under the canned tests as a near-optimal
baseline, and select CPU resources accordingly. Balance the power of the CPU selected for the hardware against the failure
domain impact of losing that processing power. Even if there are enough free CPU cycles to run VMs or other Linux
services on cluster components, more functionality will be lost if a box running multiple services goes down.
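A quick arithmetic check of the 1GHz-per-OSD guidance can be done as below; the OSD count per node and the CPU specification are placeholder assumptions, not the reference configuration.

```python
# CPU headroom check using the ~1GHz-per-OSD rule of thumb above.
# OSD count and CPU specification are placeholder assumptions.

OSDS_PER_NODE = 15      # assumed OSDs (one per data drive) on the node
CORES = 8               # assumed Xeon core count
GHZ_PER_CORE = 2.4      # assumed clock speed

total_ghz = CORES * GHZ_PER_CORE
reserved_for_osds = OSDS_PER_NODE * 1.0   # ~1GHz per OSD daemon
headroom = total_ghz - reserved_for_osds

print(f"Total compute:            {total_ghz:.1f} GHz")
print(f"Reserved for {OSDS_PER_NODE} OSDs:     {reserved_for_osds:.1f} GHz")
print(f"Headroom for other tasks: {headroom:.1f} GHz")
```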
From the official Ceph recommendations, monitors should reserve about 1GB of RAM per daemon instance. The object
gateway does not require much memory to buffer object load either; in total, the sample reference configuration only needed to
reserve a few GB on top of other OS and application requirements.
The general memory recommendation is about 2GB of memory per OSD. Normal IO usage is rated at about 500MB of RAM per
OSD daemon instance, and observations have not shown much memory load during normal operation. During recovery,
however, OSDs may use significantly more memory. The canned tests show that extra RAM has a noticeable positive impact
through file system cache on smaller object IOs, so additional memory can benefit performance too.
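Putting the memory guidance together, a per-node minimum can be estimated as follows; the daemon counts and OS reservation are assumptions used only for illustration.

```python
# Memory sizing sketch from the guidance above: ~2GB per OSD, ~1GB per
# monitor, a few GB for the OS and object gateway. Counts are assumptions.

OSDS_PER_NODE = 15      # assumed
MONITORS_ON_NODE = 1    # assumed co-located monitor daemon
OS_AND_GATEWAY_GB = 4   # assumed OS + radosgw reservation

osd_gb = OSDS_PER_NODE * 2      # 2GB per OSD leaves room for recovery spikes
mon_gb = MONITORS_ON_NODE * 1   # ~1GB per monitor daemon
minimum_gb = osd_gb + mon_gb + OS_AND_GATEWAY_GB

print(f"Minimum RAM per node: ~{minimum_gb} GB")
print("RAM beyond this acts as file system cache and helps smaller object IO.")
```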
Choosing disks
Choose how many drives are needed to meet capacity and performance SLAs. That may simply be the number of drives that
meets capacity requirements, but more spindles may be needed for performance or cluster homogeneity reasons.
Object storage requirements tend to be driven primarily by capacity, so consider required capacity first. Replica count has the
biggest impact on the difference between raw and usable capacity. Additional configuration losses, such as journal
capacity, file system format, and logical volume reserved sectors, also factor into storage efficiency, but their impact is
significantly smaller than replication. A good estimate ratio to use with the sample reference configuration is 1:3.2 for
usable to raw storage, as the worked example below shows.
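Here is a worked example of the capacity math using the 1:3.2 usable-to-raw ratio; the usable target and per-drive capacity are assumed values.

```python
# Convert a usable-capacity target into raw capacity and drive count using
# the ~1:3.2 usable-to-raw ratio above. Target and drive size are assumed.

USABLE_TB_TARGET = 100      # assumed business requirement
USABLE_TO_RAW = 3.2         # ratio from the sample reference configuration
DRIVE_TB = 4                # assumed per-drive capacity

raw_tb = USABLE_TB_TARGET * USABLE_TO_RAW
drives = -(-raw_tb // DRIVE_TB)     # ceiling division

print(f"Raw capacity needed: {raw_tb:.0f} TB")
print(f"{DRIVE_TB} TB drives required: {int(drives)}")
```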
A replica count of three or greater allows for more distribution of object copies to service reads, and also provides a
quorum on object coherency. Importantly, two disks failing cannot cause data loss at these replica levels.
Choose the types of drives to meet requirements, balanced against price and performance sensitivity, and decide whether SSDs will
be used for journal data. Extrapolate from the performance results versus the business use case to help make this selection. HP
drive qualification helps maintain homogeneity here, as drives of the same class and capacity are tuned to have similar
performance characteristics regardless of vendor. Unstructured data may not require the performance and 24x7 nature of
enterprise-class drives. If this is true for the use case, choose drives that trade performance and availability for cost/GB. As
an example, HP midline drives are rated for about 550 TB/year of workload and are available with both SAS and SATA interfaces.
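To check whether midline drives fit a given workload, the sustained write rate can be converted to TB written per drive per year and compared against that roughly 550 TB/year rating; the ingest rate and drive count below are assumptions.

```python
# Compare an assumed workload against a midline-class rating of ~550 TB/year.
# The cluster ingest rate and drive count below are assumptions.

CLUSTER_WRITE_MBPS = 400     # assumed average ingest across the cluster (MB/s)
REPLICAS = 3                 # each object write lands on three drives
DRIVES = 80                  # assumed total drive count
RATING_TB_PER_YEAR = 550     # midline drive workload rating cited above

seconds_per_year = 365 * 24 * 3600
tb_per_drive_year = CLUSTER_WRITE_MBPS * REPLICAS * seconds_per_year / DRIVES / 1e6

print(f"Estimated workload per drive: ~{tb_per_drive_year:.0f} TB/year "
      f"(rating: {RATING_TB_PER_YEAR} TB/year)")
```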