Open Source Object Storage for Unstructured Data: Ceph on HP ProLiant SL4540 Gen8 Servers

A cluster network offloads replication traffic from the data network and provides an isolated failure domain. With the tested replication settings, there are two replication writes on the cluster network for every client IO. That is a significant amount of traffic to isolate from the data network.
It is also recommended to reserve a separate 1GbE network for management, as it carries a different class and purpose of traffic than cluster IO.
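As a minimal sketch of this separation (the subnets shown are illustrative assumptions, not the tested addressing), the data and cluster networks are declared in the [global] section of ceph.conf:

    [global]
    # client-facing (data) traffic
    public network = 192.168.10.0/24
    # replication and recovery traffic
    cluster network = 192.168.20.0/24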
Matching object gateways to traffic
Start by identifying the typical object size and IO pattern, then compare them to the sample reference configuration results. The object gateway limits depend on the object traffic, so accurate scaling requires testing and characterization with a load representative of the use case. Here are some considerations when determining how many object gateways to select for the cluster:
Per-operation processing in the object gateway tends to limit small object transfer rates. File system caching for GETs tends to have the biggest performance impact at these small sizes.
For larger object and cluster sizes, gateway network bandwidth is the typical limiting factor for performance.
In testing across object sizes, HP observed peaks of roughly 3000-5000 ops/sec per object gateway; that range was seen at small object sizes. The maximum practical bandwidth observed was in the 900MB/sec-1GB/sec range on a 10GbE link.
Load balancing makes sense at scale to improve latency, IOPS, and bandwidth. Consider placing at least three object gateways behind a load balancer; a sketch of one such configuration follows this list.
Very cold storage or environments with limited clients may only ever need a single gateway.
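As one illustrative approach to fronting multiple gateways, the sketch below uses HAProxy for round-robin balancing. The choice of HAProxy, and the hostnames, addresses, and port, are assumptions for illustration, not part of the tested configuration:

    defaults
        mode http
        timeout connect 5s
        timeout client 30s
        timeout server 30s

    frontend rgw_frontend
        # single client-facing endpoint for all object traffic
        bind *:80
        default_backend rgw_backend

    backend rgw_backend
        balance roundrobin
        # three object gateway hosts (example addresses)
        server rgw1 192.168.10.11:80 check
        server rgw2 192.168.10.12:80 check
        server rgw3 192.168.10.13:80 check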
Because the monitor process has relatively lightweight resource requirements, a monitor can run on the same hardware used for an object gateway. Performance and failure domain requirements may dictate that not every monitor host is an object gateway, and vice versa. To maximize client traffic per object gateway, or to meet the strictest failure domain requirements, it is recommended that the two roles be hosted on separate hardware.
Planning monitor count
Use a minimum of three monitors for a production setup. While it is possible to run with just one monitor, that is not recommended for an enterprise deployment, as larger counts are important for quorum and redundancy. With multiple sites, it makes sense to extend the monitor count higher to maintain a quorum with a site down. A quorum requires a strict majority of monitors; for example, five monitors split three and two across two sites can still form a quorum of three if the two-monitor site fails.
Use physical servers rather than VMs, so that monitor failures stay on separate hardware. Do not run a monitor on the same server as OSDs; Ceph documentation recommends avoiding that because the monitor's use of fsync() can impact OSD performance.
The sample reference configuration did not stress DL360p monitor resources in a 200 OSD cluster. Therefore, there are no scaling recommendations for monitors based on cluster size.
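As a minimal sketch of declaring three monitors (the hostnames and addresses are illustrative assumptions), ceph.conf would include:

    [global]
    mon initial members = mon1, mon2, mon3
    mon host = 192.168.10.2, 192.168.10.3, 192.168.10.4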
Cluster installation
Hardware platform and OS preparation details are contained in Appendix D: Server Preparation. Most installation details are
broken out in Appendix E: Cluster Installation.
Even for more complicated clusters, the quick Ceph deployment flow using ceph-deploy is a good starting point for cluster installation. There is community work with more advanced configuration management tools to further automate cluster installation (e.g., Chef, Juju, Puppet), but those details are outside the scope of this document.
Whether or not Ceph’s quick start instructions are used, it is recommended to use ceph-deploy over manual configuration where possible, as the steps tend to be simpler to execute and maintain. Do expect to make manual configuration changes to ceph.conf regardless: object gateway configuration is currently not supported under ceph-deploy, and cluster use and configuration may dictate that non-default parameters be added.
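As a minimal sketch of the quick ceph-deploy flow run from an admin node (the hostnames and device paths are illustrative assumptions; see Appendix E for the tested steps):

    # create a new cluster definition with an initial monitor host
    ceph-deploy new mon1
    # install Ceph packages on the cluster nodes
    ceph-deploy install mon1 osd1 osd2 osd3
    # create the initial monitor(s) and gather keys
    ceph-deploy mon create-initial
    # prepare and activate an OSD on a storage node
    ceph-deploy osd prepare osd1:/dev/sdb
    ceph-deploy osd activate osd1:/dev/sdb1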
Cluster tuning
This section contains tuning guidance that HP considered important to general system configuration. It is not an ‘optimal performance’ guide; there are many settings that modify operating behavior, and the goal was easily reproducible performance with a good baseline configuration.
Placement groups
The tested ratio (for the sum of all pools) recommended by online documentation is:

    total_placement_group_count = (# OSDs * 100) / replica count
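As a worked example under that formula (the pool name below is hypothetical): a 200 OSD cluster with a replica count of 3 gives (200 * 100) / 3 ≈ 6667 placement groups in total, and Ceph documentation suggests rounding a pool's placement group count up to a power of two, which here would be 8192. The count is supplied when a pool is created:

    # create a pool with 8192 placement groups (pg_num and pgp_num)
    ceph osd pool create mypool 8192 8192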
Some tuning heuristics: