Solution Reference Architecture
Open source object storage for unstructured data
Ceph on HP ProLiant SL4540 Gen8 Servers
Executive summary
Explosive data growth, the expansion of Big Data and unstructured data, and the pervasiveness of mobile devices continually pressure traditional file and block storage architectures. To deal with these trends, businesses are exploring emerging storage architectures such as object storage, looking for cost-effective solutions that keep up with capacity growth while still meeting the service level agreements their business and customers require.
Introduction
This reference architecture describes a Ceph cluster deployed on HP hardware. It details why and how to build a Ceph cluster with HP hardware to solve unstructured, cloud, and backup/archival storage problems.
• 1GbE networking running on an HP 2920 switch, carrying HP Integrated Lights-Out (iLO) and corporate management traffic
• Rack and power components
In this configuration the HP ProLiant SL4540 Gen8 Servers are 'object storage nodes': the servers where the scale-out storage hard drives reside. The HP ProLiant DL360p Gen8 Servers are 'management nodes' for the cluster.
Overview
Business problem
Businesses are looking for better and more cost-effective ways to manage their exploding data storage requirements. In recent years, the amount of storage businesses require has increased dramatically. Oil and gas exploration data, patient medical records, user- and machine-generated content, and many other sources generate massive amounts of data every day.
To access the storage, a RESTful interface is used to provide better client independence and remove state-tracking load from the server. HTTP is typically used as the transport mechanism to connect applications to the data, so it is very easy to connect any device over the network to the object store. The IO interface is designed for static data. There are no file handles, locking concerns, or reservations on objects.
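As an illustration of how thin that interface is, the two operations below show an upload and a download over plain HTTP. The gateway host, bucket, and object names are placeholders, and a real Ceph object gateway would additionally require S3 or Swift authentication headers:
# Store an object with a single HTTP PUT (authentication headers omitted for brevity)
curl -X PUT --data-binary @report.pdf http://<gateway host>/<bucket>/report.pdf
# Retrieve the same object with an HTTP GET
curl http://<gateway host>/<bucket>/report.pdf -o report.pdf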
Figure 1: Cluster Access Methods
The core of mapping an HTTP GET/PUT or block read/write to Ceph objects from any of the access methods is CRUSH (Controlled Replication Under Scalable Hashing). It is the algorithm Ceph uses to compute object storage locations. As the figure shows, all access methods are converted into some number of Ceph native objects on the back end.
If the Ceph Client "knew" which Ceph OSD Daemon had which object, it would create a tight coupling between the Ceph Client and the Ceph OSD Daemon. Instead, the CRUSH algorithm maps each object to a placement group and then maps each placement group to one or more Ceph OSD Daemons. This layer of indirection allows Ceph to rebalance dynamically when new Ceph OSD Daemons and the underlying OSD devices come online.
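On a running cluster, this object-to-PG-to-OSD mapping can be inspected directly. A minimal sketch, assuming the default 'data' pool and an arbitrary object name; the output shown is illustrative, not captured from the sample configuration:
# Ask CRUSH where an object named 'myobject' in pool 'data' would be placed
ceph osd map data myobject
# Example output: the object hashes to a placement group, which maps to an acting set of OSDs
# osdmap e1234 pool 'data' (0) object 'myobject' -> pg 0.5f2e1c3a (0.3a) -> up [12,47,3] acting [12,47,3]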
Figure 4: Replication
This model offloads replication to the OSD hosts; the client only has to drive data for the primary write. The figure shows the three copies of an object used in the sample reference configuration; the primary OSD will drive as many replicas as the target pool defines.
A default CRUSH map will be functional, but more complicated configurations require customizing the CRUSH map. Tuning the map for the cluster's failure domains helps optimize performance, improves reliability and availability, and aids manageability.
Value of a purpose-built enterprise hardware platform
An important part of planning Ceph cluster architecture is determining what kind of hardware it runs on.
idea exchange. Inktank is the company delivering Ceph; its goal is to drive widespread adoption of software-defined storage (SDS) with Ceph and to help customers scale storage to the exabyte level and beyond in a cost-effective way.
Enterprise solutions and support
While Ceph is in use for a variety of business cases, there is ongoing work to support the needs of enterprise deployments beyond hardening alone.
Releases that meet risk and feature requirements
As involvement in Ceph has increased, the project has kept a rapid pace with both new feature introduction and bug-fix/stability work. Multi-site federation had its first complete release in Emperor. There are many compelling features coming down the road for Ceph, such as erasure coding, cache tiering, and improvements to CephFS.
Figure 6: Sample Reference Configuration Block Diagram
Figure 7: Sample Reference Configuration Rack Diagram
Solution components
Component choices
This section describes in more detail the reasoning behind some of the hardware and software components chosen for the sample reference configuration. Decisions about component sizing in the cluster (compute, memory, storage, networking topology) are described under Configuration guidance.
Operating system
Ubuntu is the Linux distribution that has been tested most extensively with Ceph, and Ubuntu 12.04 LTS was used for the sample reference configuration.
Management nodes
The 1U HP ProLiant DL360p Gen8 Server is a dual-socket server with a choice of Intel® Xeon® E5-2600 v2 or E5-2600 processors, up to 768GB of memory, and two expansion slots. Network connectivity can be provided through a FlexibleLOM in a 4x1GbE NIC configuration or a 2x10GbE configuration. For storage, various configurations are available with LFF or SFF drives behind an HP Smart Array P420i controller.
The sample reference configuration fits in a single rack, but is scalable in some important ways. The rack reserves space for further HP ProLiant SL4540 Gen8 Server scaling or other datacenter equipment. It is relatively simple to extend this configuration to multiple racks by replicating elements of the BOM and distributing monitors and object gateways across the racks.
Block testing
• Test phases for random IO are 8k read, write, and a 70% read/30% write mix. Test phases for sequential IO are 256k read and write. All block IO is submitted to the same 4TB RADOS block device, mapped to all three traffic generators (see the sketch below for how such a device could be created and mapped).
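A minimal sketch of creating and mapping the device on each traffic generator; the image name is an assumption, and on this Ceph release the rbd size argument is given in megabytes:
# Create a 4TB image (4194304 MB) in the default 'rbd' pool
rbd create blocktest --size 4194304
# Map it through the kernel RBD driver on the traffic generator
sudo rbd map blocktest
# Confirm the device node assigned to the mapping (e.g., /dev/rbd1)
rbd showmapped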
General points
The detailed analysis will help with cluster planning decisions for a given target workload/use case, but a few general points can be derived from the data:
• Reads are significantly more performant than writes at the same size
• Writes mixed with reads have a noticeable impact on read performance
• Object IO maximum latency can be significant, although maximum-latency cases are atypical
Object testing
There are two IO sizes of note
Bandwidth & IOPS
[Figure: Bandwidth (MB/sec) and operations per second by object size (1k through 128m) for 100% PUTs, 100% GETs, and a 90% GET/10% PUT mix]
• GET Ops/sec on the sample reference configuration
Latency
Object storage latencies are higher than those of typical SAN storage. Some of that is expected given the architecture (HTTP server, networking), but those factors do not account for all of the performance impact. Minimum latency data for object IO is less interesting (it is still relatively long compared to block), so those graphs are not presented.
CPU%
[Figure: Average and peak CPU utilization on the object gateway and OSD hosts for PUT, GET, and mixed workloads, by object size (1K through 128M)]
The results show that
Block testing
HP presents less data around RBD traffic than object IO, partly because there is more public content around tuning and performance for RBD. One reason is that RBD testing is easier to set up: no object gateway or object storage access code is required, and block storage benchmarking tools are easy to get and well understood.
Latency
[Figure: Minimum, maximum, and average block IO latency (ms) by test type: 256K sequential writes, 256K sequential reads, 8K random reads, 8K random writes, and 8K mix test reads]
Configuration guidance
This section covers how to create a Ceph cluster to fit your business needs. The basic strategy of building a cluster is this: with a desired capacity and workload in mind, understand where performance bottlenecks are for the use case, and what failure domains the cluster configuration introduces.
It is a good idea to build some performance headroom into estimates. Complex application loads are not as easy to gauge as a simple canned test load, and production systems should not run near the edge, so that they can better cope with failures and unexpected load spikes.
• A cluster network offloads replication traffic from the data network and provides an isolated failure domain (the ceph.conf sketch below shows how the two networks are declared). With the tested replication settings, there are two replication writes on the cluster network for every client IO. That is a significant amount of traffic to isolate from the data network.
• It is recommended to reserve a separate 1GbE network for management, as it supports a different class and purpose of traffic than cluster IO.
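A minimal sketch of how the two networks are declared in ceph.conf; the public subnet matches the one in Appendix A, while the cluster subnet is left as a placeholder:
[global]
# Client-facing (data) network
public_network = 10.9.25.0/24
# Replication/recovery network between OSD hosts
cluster_network = <cluster subnet>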
• When balancing PG usage across all pools, the proportion of PGs allocated should be based on which pool contains the most objects. So the data pool for the object gateway would typically get the lion's share of the placement groups. If there are multiple pools with high numbers of objects (e.g., a few RBD pools), tuning PG count becomes more complicated.
• Right now pg_num and pgp_num must be the same.
A pool-creation sketch with explicit placement group counts follows this list.
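A minimal sketch of creating the object gateway data pool with explicit placement group counts; the pool name matches the one used in the installation appendix, and the rule of thumb in the comment is a common community guideline rather than a statement from this document:
# Rule of thumb: (number of OSDs x 100) / replica count, rounded up to a power of two
sudo ceph osd pool create .rgw.buckets <pg_num> <pgp_num>
# pg_num can be increased later (never decreased); raise pgp_num to match afterwards
sudo ceph osd pool set .rgw.buckets pg_num <new pg_num>
sudo ceph osd pool set .rgw.buckets pgp_num <new pg_num>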
In production, configuring a logical drive is generally straightforward: if it is in the node, it is in use. If it is desirable to use only a subset of the drives present in the system, the recommendation is to configure logical drives only for the drives to be used. Array accelerator cache is divided among configured logical drives, so unused logical drives will consume caching resources.
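A minimal sketch of configuring logical drives with the HP Array Configuration Utility CLI (hpacucli); the controller slot and drive bay address are assumptions and will differ per node:
# List the physical drives behind the controller in slot 0
sudo hpacucli ctrl slot=0 physicaldrive all show
# Create a single-drive RAID 0 logical drive for one OSD data disk
sudo hpacucli ctrl slot=0 create type=ld drives=1I:1:3 raid=0
# Review the resulting logical drive layout
sudo hpacucli ctrl slot=0 logicaldrive all show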
Bill of materials
This BOM reproduces the sample reference configuration.
Note: HP ProLiant servers ship with an IEC-IEC power cord for rack mounting.
HP ProLiant SL4540 Gen8 Server
Qty   Part Number   Description
5     663600-B22    HP ProLiant SL454x 2x Node Chassis
10    664644-B22    HP 2xSL4540 Gen8 Tray Node Svr
10    684373-L21    HP SL4540 Gen8 Intel® Xeon® E5-2470 (2.
HP Networking Cables
Qty   Part Number   Description
20    263474-B23    HP IP CAT5 Qty-8 12ft/3.7m Cable
6     263474-B22    HP IP CAT5 Qty-8 6ft/2m Cable
3     JD096C        HP X240 10G SFP+ to SFP+ 1.
Summary
With the rapid growth of unstructured data and backup/archival storage, traditional storage solutions are falling short in their ability to scale or to serve this data efficiently. The cost per gigabyte of SAN and NAS at scale is undesirable, and those solutions provide performance features this data does not really require. Tape has better cost at scale, but does not always meet latency requirements for data access.
Appendix A: Sample Reference Ceph Configuration File
The ceph.conf file used for the sample reference configuration.
mon_initial_members = hp-cephmon01, hp-cephmon02, hp-cephmon03
mon_host = 10.9.25.17,10.9.25.18,10.9.25.19
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 3
osd_pool_default_min_size = 2
public_network = 10.9.25.0/24
cluster_network = 10.9.
Appendix B: Sample Reference Pool Configuration
Pool dump for the sample reference configuration.
Appendix C: Syntactical Conventions for Command Samples
Angle-bracketed text indicates a substitution for a literal value. Example: ssh <host name> indicates that the host name of the target Ceph node should be substituted when executing the ssh command. Single quotes around an OS command indicate shorthand for a command. Example: 'ceph -s'.
Appendix D: Server Preparation
This section describes a few steps that need to be performed prior to OS installation, as well as steps required to customize the OS after installation.
Install HP Service Pack for ProLiant
HP Service Pack for ProLiant (SPP) is a comprehensive systems software and firmware update solution, which is delivered as a single ISO image.
You should observe the dark blue box change to indicate "Enabled". Press Escape twice and then press 'F10' to save your changes and exit the utility.
Configuring a Mirrored Boot Device
While not required, mirroring your boot device is a good practice. For this reference architecture we created two partitions on each drive, one for the root file system and one for swap. We then mirrored each pair of partitions.
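A minimal sketch of the mirroring step using mdadm; the drive and partition names are assumptions and should match however the two boot drives were partitioned:
# Mirror the root partitions of the two boot drives
sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
# Mirror the swap partitions
sudo mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
# Persist the array definitions and rebuild the initramfs so the mirror assembles at boot
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
sudo update-initramfs -u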
Upgrading Ubuntu
Update the kernel and packages to the latest versions; as of this writing, Ubuntu 12.04.3 with kernel 3.8.0-37 is running on the sample reference configuration. If you need to upgrade the kernel before installing Ceph, use 'apt-get dist-upgrade' to do an intelligent package upgrade. Despite the name, 'apt-get dist-upgrade' upgrades kernel packages, not the distribution itself.
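A minimal sketch of the upgrade sequence; the exact kernel version installed depends on the current 12.04 updates archive:
# Refresh package lists and apply all upgrades, including new kernel packages
sudo apt-get update
sudo apt-get dist-upgrade
# After rebooting into the new kernel, confirm the running version
uname -r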
sudo reboot
Use 'modinfo mlx4_en' after rebooting and verify that the server is using the v2.1.xx driver. Remember that if the kernel is upgraded, you must also recompile the Mellanox drivers against the new kernel and rebuild the initramfs. Don't forget to re-run the patch script to regenerate the config.mk file before building.
Appendix E: Cluster Installation
Because the Ceph documentation website can change over time, the installation flow has been sourced from the Ceph documentation used when configuring the sample reference configuration. The sourced instructions have been modified to include the customizations made, and are fixed to the Ubuntu and Ceph releases chosen.
Your public key has been saved in /ceph-client/.ssh/id_rsa.pub.
Copy the key to each Ceph Node
ssh-copy-id ceph@<host name>
…
ssh-copy-id ceph@<host name>
Modify the ~/.ssh/config file on the ceph-deploy admin node so that it logs in to Ceph Nodes as the user created (e.g., ceph).
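A minimal sketch of such a ~/.ssh/config on the admin node; the host name patterns follow the naming used elsewhere in this document and would be replaced with your own:
# Always log in to cluster nodes as the 'ceph' deployment user
Host hp-osdhost*
    User ceph
Host hp-cephmon*
    User ceph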
Install Ceph Software
This step pulls down the Ceph distribution packages and installs them onto all cluster role servers. If using ceph-deploy to install the Ceph packages and using a proxy server to reach the internet, edit the wgetrc proxy configuration on all Ubuntu nodes. Otherwise, 'ceph-deploy install' will get stuck trying to fetch the release key with wget. Aptitude should be configured with the proper proxy settings during OS installation.
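A minimal sketch of the wgetrc change; the proxy host and port are placeholders:
# /etc/wgetrc (or ~/.wgetrc) on each node that reaches the internet through a proxy
use_proxy = on
http_proxy = http://<proxy host>:<port>/
https_proxy = http://<proxy host>:<port>/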
ssh ${tgtsys} sudo parted -s ${tgtdrv} mklabel gpt
# Journal partition boundaries: four 4GB journal partitions on the journal device
p_layout=( 0G 4G 8G 12G 16G )
start_idx=0
end_idx=1
while [ ${end_idx} -lt ${#p_layout[@]} ]; do
    # Create partition cephjournalN spanning the next 4GB slice
    ssh ${tgtsys} sudo parted ${tgtdrv} -s mkpart cephjournal${end_idx} ${p_layout[${start_idx}]} ${p_layout[${end_idx}]}
    (( start_idx=end_idx ))
    (( end_idx++ ))
done
Sample script for adding OSDs to the cluster.
#!/bin/bash
destbox=${1}
if [ -z "${destbox}" ]; then
    echo "No target system."
Create the default pool here so the object gateway install doesn't create one with sub-optimal default placement group settings. Remember to balance object gateway usage with the amount of RBD storage required.
sudo ceph osd pool create .rgw.buckets <pg_num> <pgp_num>
Add Object Gateways
The ceph-deploy package does not support object gateways, but changes to the configuration are driven from the staging directory created in the cluster installation step above.
Restart Apache
sudo service apache2 restart
Install Ceph Object Gateway
The Ceph packages don't pull down the object gateway software by default, so add that now.
sudo apt-get install radosgw
Add gateway configuration to Ceph
HP recommends that this step and the configuration step be executed from the deployment directory used for cluster creation. For each object gateway host, there is a separate section defining it.
Allow from all
AuthBasicAuthoritative Off
AllowEncodedSlashes On
ErrorLog /var/log/apache2/error.log
CustomLog /var/log/apache2/access.log combined
ServerSignature Off
ServerName <FQDN of this gateway host>
ServerAlias *.ldev.net
ServerAdmin cloudplay@hp-cephmon01.ldev.net
DocumentRoot /var/www
RewriteEngine On
RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*) /s3gw.
Generate Keyring and Key for the Gateway
Here a keyring is created on the object gateway install system. These steps also set up read access for administrative ease of use, and attach the gateway user to the cluster and keyring file. For simplicity, this configuration doesn't merge gateway keyring files across object gateways.
• sudo ceph-authtool --create-keyring /etc/ceph/keyring.radosgw.gateway
• sudo chmod +r /etc/ceph/keyring.radosgw.
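Once the gateway is up, an S3 user is typically created before clients can authenticate; a minimal sketch, with an arbitrary uid and display name:
sudo radosgw-admin user create --uid=testuser --display-name="Test User"
# The command prints the generated access_key and secret_key for use by S3 clients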
Appendix F: Newer Ceph Features
While the sample reference configuration here used the Dumpling release, Ceph continues to make feature releases of significant technologies. This section lists features already released in stable code bases, or coming soon. There are many features on the Inktank roadmap; the ones selected here are from the Emperor and Firefly releases.
Multi-Site
The Ceph Emperor release has fully functional support for multi-site clusters.
Appendix G: Helpful Commands
These are commands for administering the Ceph cluster that were useful during testing.
Removing configured objects
For a POC/test configuration, there may be reasons to tear down the cluster to recreate it, change the OSD configuration, and so on. An example is swapping out spinning media for SSDs.
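A minimal sketch of removing a single OSD before recreating it; osd.1 is used as an example, and the stop command assumes the Upstart-managed ceph-osd jobs used on Ubuntu 12.04:
# Mark the OSD out so data migrates off it, then stop the daemon on its host
ceph osd out osd.1
sudo stop ceph-osd id=1
# Remove it from the CRUSH map, delete its authentication key, and remove it from the cluster
ceph osd crush remove osd.1
ceph auth del osd.1
ceph osd rm 1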
# id   weight   type name                up/down   reweight
-1     546      root default
-2     54.6         host hp-osdhost01
0      2.73             osd.0            up        1
1      2.73             osd.1            up        1
2      2.73             osd.2            up        1
3      2.73             osd.3            up        1
4      2.73             osd.4            up        1
5      2.73             osd.5            up        1
6      2.73             osd.6            up        1
7      2.73             osd.7            up        1
8      2.73             osd.8            up        1
9      2.73             osd.9            up        1
10     2.73             osd.10           up        1
11     2.73             osd.11           up        1
12     2.73             osd.12           up        1
13     2.73             osd.13           up        1
14     2.73             osd.14           up        1
15     2.73             osd.15           up        1
16     2.73             osd.16           up        1
17     2.73             osd.17           up        1
18     2.73             osd.18           up        1
19     2.73             osd.
Appendix H: Workload Tool Detail
Getput
Installing
Getput is written in Python and is available as source from the getput repository on GitHub. Getput v0.0.7, the version available on GitHub at the time of this writing, requires python-swiftclient v1.6.0 or later. On Ubuntu 12.04, you should be able to use these instructions to install this package.
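A minimal sketch of one way to satisfy that dependency on Ubuntu 12.04; pip is used on the assumption that the distribution's packaged python-swiftclient is older than v1.6.0, and the getput repository URL is left as a placeholder:
# Install pip, then a sufficiently recent swift client
sudo apt-get install python-pip
sudo pip install 'python-swiftclient>=1.6.0'
# Fetch getput itself from its GitHub repository
git clone <getput repository URL>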
• --iodepth=<depth>: Number of IO units to keep in flight against the file. A depth of 8 was used for testing.
• --name=<name>: In this context, /dev/rbd1 was used to specify both the job name and the device file being targeted.
• --rwmixwrite=<% mix writes>: Percentage of the mixed workload to make writes. The mix test used 30.
• --rwmixread=<% mix reads>: Percentage of the mixed workload to make reads. The mix test used 70.
A complete command line assembled from these options is sketched below.
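A minimal sketch of an 8k mixed-workload invocation assembled from the options above; the flags not covered in this appendix (direct IO, IO engine, runtime) are assumptions about a typical raw-device test rather than the exact command used:
sudo fio --name=/dev/rbd1 --filename=/dev/rbd1 --rw=randrw --rwmixread=70 --rwmixwrite=30 \
    --bs=8k --iodepth=8 --direct=1 --ioengine=libaio --runtime=300 --time_based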
listen https_proxy :443
    option tcplog
    mode tcp
    option ssl-hello-chk
    balance source
    server hp-cephmon01 192.168.0.10 check
    server hp-cephmon02 192.168.0.11 check
    server hp-cephmon03 192.168.0.
Glossary
• Cold, warm and hot storage—Temperature in data management refers to the frequency and performance of data access in storage. Cold storage is rarely accessed and can be kept on the slowest tier of storage. As the storage 'heat' increases, both the sustained bandwidth and the instantaneous (latency, IOPS) performance requirements increase.
• CRUSH—Controlled Replication Under Scalable Hashing. The algorithm Ceph uses to compute object storage locations.
For more information
With increased density, efficiency, serviceability, and flexibility, the HP ProLiant SL4540 Gen8 Server is the perfect solution for scale-out storage needs. To learn more, visit hp.com/servers/sl4540. To support the management and access features of object storage, and to operate seamlessly as part of HP Converged Infrastructure, the HP ProLiant DL360p Gen8 series brings the power, density, and performance required.