Open Source Object Storage for Unstructured Data: Ceph on HP ProLiant SL4540 Gen8 Servers
Table Of Contents
- Executive summary
- Introduction
- Overview
- Solution components
- Workload testing
- Configuration guidance
- Bill of materials
- Summary
- Appendix A: Sample Reference Ceph Configuration File
- Appendix B: Sample Reference Pool Configuration
- Appendix C: Syntactical Conventions for command samples
- Appendix D: Server Preparation
- Appendix E: Cluster Installation
- Naming Conventions
- Ceph Deploy Setup
- Ceph Node Setup
- Create a Cluster
- Add Object Gateways
- Apache/FastCGI W/100-Continue
- Configure Apache/FastCGI
- Enable SSL
- Install Ceph Object Gateway
- Add gateway configuration to Ceph
- Redeploy Ceph Configuration
- Create Data Directory
- Create Gateway Configuration
- Enable the Configuration
- Add Ceph Object Gateway Script
- Generate Keyring and Key for the Gateway
- Restart Services and Start the Gateway
- Create a Gateway User
- Appendix F: Newer Ceph Features
- Appendix G: Helpful Commands
- Appendix H: Workload Tool Detail
- Glossary
- For more information

Figure 4: Replication
This model offloads replication to the OSD hosts; the client only has to drive data for the primary write. The figure above shows
the three copies of an object used in the sample reference configuration; the Primary OSD will drive as many replicas as are
defined by the target pool.
Peering and Sets:
The Ceph Storage Cluster was designed to store at least two copies of an object, which is the minimum requirement for data
safety. For high availability, a Ceph Storage Cluster should store more than two copies of an object
(e.g., size = 3 and min size = 2) so that it can continue to run in a degraded state while maintaining data safety.
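As a minimal sketch of how these values are applied (the pool name data is illustrative; the pools used in the sample reference configuration are listed in Appendix B):
    ceph osd pool set data size 3        # keep three copies of each object
    ceph osd pool set data min_size 2    # continue serving I/O while two copies remain intact
    # Equivalent cluster-wide defaults can be set in ceph.conf before pools are created:
    #   osd pool default size = 3
    #   osd pool default min size = 2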
Referring back to the replication diagram, the Ceph OSD Daemons are not specifically named (e.g., osd.0, osd.1, etc.), but
rather referred to as Primary, Secondary, and so forth. By convention, the Primary is the first OSD in the Acting Set; it is
responsible for coordinating the peering process for each placement group where it acts as the Primary, and it is the ONLY
OSD that will accept client-initiated writes to objects for a given placement group where it acts as the Primary.
When a series of OSDs are responsible for a placement group, they’re referred to as an Acting Set. An Acting Set may refer to
the Ceph OSD Daemons that are currently responsible for the placement group, or to the Ceph OSD Daemons that were
responsible for a particular placement group as of some epoch.
This behavior also applies to the removal or failure of an OSD: object copies are remapped from the remaining members of the
Acting Set onto free cluster storage.
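The Up Set and Acting Set that CRUSH has computed for a placement group can be inspected directly; a sketch, with a purely illustrative placement group ID and output:
    ceph pg map 1.6c
    # osdmap e13 pg 1.6c (1.6c) -> up [1,0,2] acting [1,0,2]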
Rebalancing:
When a Ceph OSD Daemon is added to a Ceph Storage Cluster, the cluster map is updated with the new OSD. This also
changes object placement, because the cluster map is an input to the CRUSH placement calculation. The process of
redistributing objects across the Ceph cluster is termed rebalancing.
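As a sketch, recovery and backfill activity during rebalancing can be followed with the standard status commands (no cluster-specific names are assumed):
    ceph -w          # stream cluster events, including recovery/backfill progress
    ceph status      # one-shot summary of placement group and OSD states
    ceph osd tree    # show the OSD hierarchy as CRUSH sees it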
Data Consistency:
Ceph OSDs can compare object metadata in one placement group with its replicas in placement groups
stored on other OSDs. Scrubbing (usually performed daily) catches OSD bugs or file system errors. OSDs can also perform
deeper scrubbing by comparing data in objects bit-for-bit. Deep scrubbing (usually performed weekly) finds bad sectors on a
disk that weren’t apparent in a lighter scrub.
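Scrubs can also be triggered on demand, and the intervals are tunable in ceph.conf; the placement group ID below is illustrative and the intervals shown are the upstream defaults:
    ceph pg scrub 1.6c          # light scrub of a single placement group
    ceph pg deep-scrub 1.6c     # bit-for-bit deep scrub of the same placement group
    # Interval tuning in the [osd] section of ceph.conf:
    #   osd scrub min interval = 86400       ; at most daily, when load permits
    #   osd deep scrub interval = 604800     ; deep scrub weekly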
Failure Domain, CRUSH map: An important element of maintaining fault tolerance is determining where data should be
placed for optimal resiliency. Rather than depend on expensive hardware redundancy for all interconnects, the scale-out
design principle is to assume some components will fail and to keep the data available by properly partitioning failure
domains.
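As a sketch of how failure domains are expressed, the compiled CRUSH map can be extracted, decompiled to text, edited, and injected back into the cluster (the file names are arbitrary):
    ceph osd getcrushmap -o crushmap.bin        # extract the compiled CRUSH map
    crushtool -d crushmap.bin -o crushmap.txt   # decompile to editable text
    # Edit crushmap.txt; for example, changing a replicated rule step to
    #   step chooseleaf firstn 0 type rack
    # places each replica in a different rack rather than just a different host.
    crushtool -c crushmap.txt -o crushmap.new   # recompile the edited map
    ceph osd setcrushmap -i crushmap.new        # inject the updated CRUSH map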
For a simpler configuration, it is acceptable to assume that distributing copies across servers will provide reliability. As clusters scale
out, it becomes important to separate replicas across racks, power sources, network switches, or even data centers to reduce
the likelihood that a single failure event will have untenable impacts. The default CRUSH map generated by ceph-deploy cluster install