Open Source Object Storage for Unstructured Data: Ceph on HP ProLiant SL4540 Gen8 Servers

Configuration guidance
This section covers how to create a Ceph cluster that fits your business needs. The basic strategy for building a cluster is this:
with a desired capacity and workload in mind, understand where the performance bottlenecks are for the use case and what
failure domains the cluster configuration introduces.
Building your own cluster
General configuration recommendations
The slowest performer is the weakest link for performance in a pool. Typically, OSD hosts should be configured with the
same quantity, type, and configuration of storage. There are reasons to deviate from this guidance (pools limited to specific
drives or hosts, or federation being more important than performance), but it is a good design principle.
A minimum-size cluster has at least three compute nodes hosting OSDs to distribute the three replicas. A minimum
recommended cluster would have at least six compute nodes. The additional nodes provide more room for unstructured
data growth, distribute the per-node operational load, and make each component less of a bottleneck, as the sketch below illustrates.
If the minimum recommended cluster size sounds large, consider whether Ceph is the right solution. Smaller amounts of
storage that don’t grow at unstructured data scales could stay on traditional block and file, or leverage an object interface
on a file-focused storage target. Smaller Ceph clusters do make sense if the use case requires features of Swift/S3
RESTful interfaces. If the planned solution starts small but scales quickly past the minimum cluster size, then it will
benefit from the features of Ceph on HP hardware.
Ceph clusters can scale to exabyte levels, and you can easily add storage as needed. But failure domain impacts must be
considered as hardware is added. Even three-way replication may reach an unacceptable data durability level with
enough OSDs. Also, what was a sufficient failure domain in the initial CRUSH map may no longer represent the cluster
well as network and power elements are added. Design assuming elements will fail at scale.
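The durability concern can be sanity-checked with a crude model. The sketch below is only a rough illustration: the annual failure rate, recovery window, and the assumption that any three concurrent OSD failures destroy some placement group are all simplifying assumptions, not measured values, but it shows how the loss probability grows rapidly with OSD count at a fixed replica count.

```python
# Crude durability estimate: chance per year that some placement group loses
# all three replicas before recovery completes. Every input is an
# illustrative assumption, not a measured or vendor-supplied figure.
from math import comb

def annual_loss_probability(osds, afr=0.03, recovery_hours=8.0, replicas=3):
    """Rough model: a first OSD failure occurs, then (replicas - 1) more OSDs
    fail within the recovery window. Assumes the cluster has enough PGs that
    any such combination wipes out some PG's copies."""
    p_window = afr * recovery_hours / (365 * 24)   # one OSD failing during recovery
    concurrent = comb(osds - 1, replicas - 1) * p_window ** (replicas - 1)
    return min(1.0, osds * afr * concurrent)

for osds in (18, 90, 900):
    print(f"{osds:4d} OSDs -> ~{annual_loss_probability(osds):.1e} annual loss probability")
```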
Cluster sizing
Compute and memory
For the OSD hosts, the recommendation is to reserve 1GHz of an Intel Xeon core per OSD daemon. If other
tasks run on these cluster nodes, consider the sample data in the CPU results chart under the canned tests as a near-optimal
baseline, and select CPU resources accordingly. Balance the power of the CPU selected for the hardware against the failure
domain impact of losing that processing power. Even if there are enough free CPU cycles to run VMs or other Linux
services on cluster components, more functionality will be lost if a box running multiple services goes down.
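A quick arithmetic check of the 1GHz-per-OSD guidance can be done as below; the OSD count per node and the CPU specification are placeholder assumptions, not the reference configuration.

```python
# CPU headroom check using the ~1GHz-per-OSD rule of thumb above.
# OSD count and CPU specification are placeholder assumptions.

OSDS_PER_NODE = 15      # assumed OSDs (one per data drive) on the node
CORES = 8               # assumed Xeon core count
GHZ_PER_CORE = 2.4      # assumed clock speed

total_ghz = CORES * GHZ_PER_CORE
reserved_for_osds = OSDS_PER_NODE * 1.0   # ~1GHz per OSD daemon
headroom = total_ghz - reserved_for_osds

print(f"Total compute:            {total_ghz:.1f} GHz")
print(f"Reserved for {OSDS_PER_NODE} OSDs:     {reserved_for_osds:.1f} GHz")
print(f"Headroom for other tasks: {headroom:.1f} GHz")
```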
From the official Ceph recommendations, monitors should reserve about 1GB of RAM per daemon instance. The object
gateway does not require much memory to buffer object load either; in total, the sample reference configuration only needed to
reserve a few GB on top of other OS and application requirements.
The general memory recommendation is about 2GB of memory per OSD. Normal IO usage is rated at about 500MB of RAM per
OSD daemon instance, and observations have not shown much memory load during normal operation. During recovery,
however, OSDs may use significantly more memory. The canned tests show that extra RAM has a noticeable positive impact
through file system cache on smaller object IOs, so additional memory can benefit performance too.
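Putting the memory guidance together, a per-node minimum can be estimated as follows; the daemon counts and OS reservation are assumptions used only for illustration.

```python
# Memory sizing sketch from the guidance above: ~2GB per OSD, ~1GB per
# monitor, a few GB for the OS and object gateway. Counts are assumptions.

OSDS_PER_NODE = 15      # assumed
MONITORS_ON_NODE = 1    # assumed co-located monitor daemon
OS_AND_GATEWAY_GB = 4   # assumed OS + radosgw reservation

osd_gb = OSDS_PER_NODE * 2      # 2GB per OSD leaves room for recovery spikes
mon_gb = MONITORS_ON_NODE * 1   # ~1GB per monitor daemon
minimum_gb = osd_gb + mon_gb + OS_AND_GATEWAY_GB

print(f"Minimum RAM per node: ~{minimum_gb} GB")
print("RAM beyond this acts as file system cache and helps smaller object IO.")
```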
Choosing disks
Choose how many drives are needed to meet capacity and performance SLAs. That may simply be the number of drives that
meets capacity requirements, but more spindles may be needed for performance or cluster homogeneity reasons.
Object storage requirements tend to be driven primarily by capacity, so consider required capacity first. Replica count has the
biggest impact on the difference between raw and usable capacity. Additional configuration losses, such as journal
capacity, file system format, and logical volume reserved sectors, also factor into storage efficiency, but their impact is
significantly smaller than replication. A good estimate ratio to use with the sample reference configuration is 1:3.2 for
usable to raw storage, as the worked example below shows.
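Here is a worked example of the capacity math using the 1:3.2 usable-to-raw ratio; the usable target and per-drive capacity are assumed values.

```python
# Convert a usable-capacity target into raw capacity and drive count using
# the ~1:3.2 usable-to-raw ratio above. Target and drive size are assumed.

USABLE_TB_TARGET = 100      # assumed business requirement
USABLE_TO_RAW = 3.2         # ratio from the sample reference configuration
DRIVE_TB = 4                # assumed per-drive capacity

raw_tb = USABLE_TB_TARGET * USABLE_TO_RAW
drives = -(-raw_tb // DRIVE_TB)     # ceiling division

print(f"Raw capacity needed: {raw_tb:.0f} TB")
print(f"{DRIVE_TB} TB drives required: {int(drives)}")
```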
A replica count of three or greater allows for more distribution of object copies to service reads, and also provides a
quorum on object coherency. Importantly, two disks failing cannot cause data loss at these replica levels.
Choose the types of drives to meet requirements, balanced against price and performance sensitivity, and decide whether SSDs will
be used for journal data. Extrapolate from the performance results versus the business use case to help make this selection. HP
drive qualification helps maintain homogeneity here, as drives of the same class and capacity are tuned to have similar
performance characteristics regardless of vendor. Unstructured data may not require the performance and 24x7 nature of
enterprise-class drives. If this is true for the use case, choose drives that trade performance and availability for cost/GB. As
an example, HP midline drives are rated for about 550 TB/year of workload and are available with both SAS and SATA interfaces.
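To check whether midline drives fit a given workload, the sustained write rate can be converted to TB written per drive per year and compared against that roughly 550 TB/year rating; the ingest rate and drive count below are assumptions.

```python
# Compare an assumed workload against a midline-class rating of ~550 TB/year.
# The cluster ingest rate and drive count below are assumptions.

CLUSTER_WRITE_MBPS = 400     # assumed average ingest across the cluster (MB/s)
REPLICAS = 3                 # each object write lands on three drives
DRIVES = 80                  # assumed total drive count
RATING_TB_PER_YEAR = 550     # midline drive workload rating cited above

seconds_per_year = 365 * 24 * 3600
tb_per_drive_year = CLUSTER_WRITE_MBPS * REPLICAS * seconds_per_year / DRIVES / 1e6

print(f"Estimated workload per drive: ~{tb_per_drive_year:.0f} TB/year "
      f"(rating: {RATING_TB_PER_YEAR} TB/year)")
```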