If the Ceph Client “knew” which Ceph OSD Daemon had which object, it would create a tight coupling between the Ceph Client and the Ceph
OSD Daemon. Instead, the CRUSH algorithm maps each object to a placement group and then maps each placement group
to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to rebalance dynamically when new Ceph OSD
Daemons and the underlying OSD devices come online. The following diagram depicts how CRUSH maps objects to
placement groups, and placement groups to OSDs.
Figure 3: Mapping Objects to OSDs
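To make the two-step mapping concrete, the sketch below is purely illustrative and is not Ceph's actual CRUSH code: it hashes an object name to a placement group and then deterministically derives an ordered set of OSDs for that group. The pool size, OSD count, and hash choices are assumptions for the example.

    # Illustrative sketch of the two-step mapping (object -> placement group -> OSDs).
    # Not Ceph's real CRUSH implementation; PG_COUNT, OSD_COUNT, and the hash
    # function are hypothetical placeholders.
    import hashlib

    PG_COUNT = 128   # placement groups in the pool (assumption)
    OSD_COUNT = 12   # OSDs in the cluster (assumption)
    REPLICAS = 3     # copies of each object

    def object_to_pg(object_name):
        # Step 1: hash the object name to a stable placement group id.
        digest = hashlib.md5(object_name.encode()).hexdigest()
        return int(digest, 16) % PG_COUNT

    def pg_to_osds(pg_id):
        # Step 2: deterministically derive REPLICAS distinct OSDs for the
        # placement group, standing in for the CRUSH calculation. Because the
        # result depends only on the pg id and cluster state, the mapping can
        # be recomputed anywhere without a central lookup table.
        osds, attempt = [], 0
        while len(osds) < REPLICAS:
            h = hashlib.md5(("%d:%d" % (pg_id, attempt)).encode()).hexdigest()
            candidate = int(h, 16) % OSD_COUNT
            if candidate not in osds:
                osds.append(candidate)
            attempt += 1
        return osds

    pg = object_to_pg("my-bucket/photo-0001.jpg")
    print(pg, pg_to_osds(pg))  # the first OSD in the list acts as the primary

Because both clients and OSD Daemons can run this calculation independently, neither side needs to ask a central directory where an object lives.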
Ceph leverages existing storage technology beneath the OSD Daemon. When RADOS object data is written, it's currently written as a file within a directory on the OSD. There's more to it than that (the metadata must also be committed separately, and Ceph reserves some storage for journaling), but this distribution of files across the file system is essentially how object data and placement groups are implemented.
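The on-disk layout described above can be pictured with a small sketch. The following Python fragment is a toy model only: the directory names, paths, and the sidecar metadata file are assumptions, not Ceph's exact FileStore format (which uses extended attributes and a journal).

    # Toy model of the FileStore idea: object data becomes an ordinary file in a
    # per-placement-group directory on the OSD, with metadata committed separately.
    import json, os

    OSD_ROOT = "/tmp/osd.0"  # stand-in for an OSD's data directory (assumption)

    def write_object(pg_id, name, data, metadata):
        pg_dir = os.path.join(OSD_ROOT, "pg_%d" % pg_id)
        os.makedirs(pg_dir, exist_ok=True)
        with open(os.path.join(pg_dir, name), "wb") as f:
            f.write(data)  # the RADOS object's data, stored as a plain file
        # Separate metadata commit; real FileStore uses xattrs and a journal
        # rather than a sidecar JSON file.
        with open(os.path.join(pg_dir, name + ".meta"), "w") as f:
            json.dump(metadata, f)

    write_object(42, "photo-0001.jpg", b"...image bytes...", {"version": 1})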
Scaling/Consistency/Failure Handling
With an understanding of the roles within the cluster and of how data is stored, it's also important to understand how the integrity of data is protected and maintained.
Replication
In addition to the benefit of data locality, replication provides the failure tolerance required at large scale. Like Ceph Clients, Ceph OSD Daemons use the CRUSH algorithm, but the Ceph OSD Daemon uses it to compute where replicas of objects should be stored (and for rebalancing). For the recommended configuration, there are 3 copies of any object data written: one on the Primary OSD for the placement group and two replicas. This replication level is user configurable; the default without modifying ceph.conf is 2 replicas.
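For reference, the pool defaults can be raised to the recommended 3 copies in ceph.conf before pools are created. The snippet below uses the standard osd pool default options; option names should be verified against the installed Ceph release.

    [global]
        # 3 copies of each object: one primary plus two replicas
        osd pool default size = 3
        # continue serving I/O while at least this many copies exist
        osd pool default min size = 2

Existing pools can also be adjusted at runtime with the ceph osd pool set <pool-name> size 3 command.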
In a typical write scenario, a client uses the CRUSH algorithm to compute where to store an object, maps the object to a pool
and placement group, then looks at the CRUSH map to identify the primary OSD for the placement group.
The client writes the object to the identified placement group in the primary OSD. Then the primary OSD, using its own copy of the CRUSH map, identifies the secondary and tertiary OSDs for replication purposes, replicates the object to the appropriate placement groups in those OSDs, and responds to the client once it has confirmed that the object and its replicas were stored successfully.
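The write path above can be summarized in a short simulation. The Python sketch below is illustrative only (the OSD ids and acting set are made up); it captures the key ordering property: the primary fans the object out to the secondary and tertiary OSDs and acknowledges the client only after all copies are stored.

    # Minimal simulation of the replicated write path: client -> primary ->
    # secondary/tertiary, with the acknowledgement gated on all copies landing.

    class OSD:
        def __init__(self, osd_id):
            self.osd_id = osd_id
            self.store = {}  # (pg_id, object_name) -> data

        def store_replica(self, pg_id, name, data):
            self.store[(pg_id, name)] = data
            return True  # confirm this copy was persisted

    def client_write(osds, pg_id, acting, name, data):
        # acting[0] is the primary for the placement group; the remaining
        # entries are the replica OSDs, all derived from the CRUSH map.
        ok = osds[acting[0]].store_replica(pg_id, name, data)
        for replica_id in acting[1:]:
            # The primary, using its own copy of the CRUSH map, replicates
            # the object before acknowledging the client.
            ok = ok and osds[replica_id].store_replica(pg_id, name, data)
        return ok  # the ack reaches the client only after all copies are stored

    osds = {i: OSD(i) for i in range(12)}
    acting_set = [7, 1, 10]  # hypothetical CRUSH result for this placement group
    assert client_write(osds, 42, acting_set, "photo-0001.jpg", b"bytes")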