504 OSD Ceph cluster on converged microserver ethernet drives

sage

When Ceph was originally designed a decade ago, the concept was that “intelligent” disk drives with some modest processing capability could store objects instead of blocks and take an active role in replicating, migrating, or repairing data within the system.  In contrast to conventional disk drives, a smart object-based drive could coordinate with other drives in the system in a peer-to-peer fashion to build a more scalable storage system.

Today an ethernet-attached hard disk drive from WDLabs is making this architecture a reality. WDLabs has taken over 500 drives from the early production line and assembled them into a 4 PB (3.6 PiB) Ceph cluster running Jewel and the prototype BlueStore storage backend.  WDLabs has been validating the case for an open source compute environment inside the storage device, and is beginning to understand the use cases as partners such as Red Hat work with the early units.  This test seeks to demonstrate that the second generation converged microserver is a viable solution for distributed storage use cases like Ceph. Building an open platform that can run open source software is a key underpinning of the concept.

The collaboration between WDLabs, Red Hat, and SuperMicro on the large scale 4 PB cluster will help drive further learning and improvements to this new and potentially disruptive product.  By allowing storage services to run directly on the drive, an entire tier of conventional servers can be eliminated, simplifying the overall stack up through the top-of-rack switch, and paying dividends through space efficiency gains and component and power cost reductions.

The ARM community has contributed quite a bit of code to Ceph over the past two years to make it run well on the previous-generation 32-bit and new 64-bit architectures.  Our work is by no means complete, but we think the results are quite encouraging!

The WDLabs Converged Microserver He8

The Converged Microserver He8 is a microcomputer built on the existing production Ultrastar® He8 platform. The host used in the Ceph cluster is a dual-core ARM Cortex-A9 processor running at 1.3 GHz with 1 GB of memory, soldered directly onto the drive's PCB (pictured). Options include 2 GB of memory and ECC protection. The SoC also includes the ARM NEON coprocessor, which helps with erasure code computations, along with XOR and crypto engines.

[caption id="attachment_7711" align="aligncenter" width="293"]WaspV3_PCBA-1The drive PCB includes the standard disk controller hardware, as well as an additional ARM SoC running Debian Jessie (and Ceph). The connector passes ethernet instead of SATA.[/caption]

The network interface consists of two 1 GbE SGMII ports, with the ability to reach 2.5 GbE in a compatible chassis.  The physical connector is identical to existing SAS/SATA devices, but with a new pinout that is being standardized and adopted by other drive manufacturers. Disk shelves are already available from multiple chassis vendors.

[caption id="attachment_7757" align="aligncenter" width="520"]Wasp-8TB-Top-BotThe drive has a standard 3.5" form factor.[/caption]

The default operating system is Debian Linux 8.0 (Jessie) and PXE boot is supported.

504 OSD test cluster and test setup

The 4 PB, 504-node Converged Microserver He8 cluster is anchored by 42 SuperMicro 1048-RT chassis that each feed 10 GbE to the top-of-rack public network through 3 SuperMicro SSE-X3348T switches. Another identical switch interconnects the private back-end network for internal cluster tasks.  Each drive has two 1 Gbps interfaces: one for the public network and one for the cluster (replication) network.  The monitor (just one in this test) is running on a conventional server.

[caption id="attachment_7713" align="aligncenter" width="225"]Front view - 25 new enclosures-smallCluster from front: 25 1u SuperMicro enclosures[/caption]

Client machines have been set up to apply workloads to the system, but so far they have not been able to fully saturate the cluster with traffic.  There are 18 x86 machines, each with a 10 Gbps interface.  The workload is generated by 'rados bench' with the default write size (4 MB) and 128-196 threads per client, running for 5 minutes for each data point.
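Each data point came from a run along these lines; the pool name below is a placeholder, and the exact thread count varied between 128 and 196:

# Hypothetical invocation: 5-minute (300 s) write test with the default 4 MB
# object size and 128 concurrent operations, keeping the objects for reads.
rados bench -p benchpool 300 write -t 128 --no-cleanup

# Sequential read test against the objects written above.
rados bench -p benchpool 300 seq -t 128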

The cluster is running an early build of Ceph Jewel (v10.1.0), one ceph-osd per drive, using the new experimental BlueStore backend to more efficiently utilize the raw disk (/dev/sda4). The configuration is relatively straightforward, although we did some minor tuning to reduce the memory consumption on the devices:

osd map cache size = 40
osd map max advance = 32
osd map message max = 32
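For context, opting the OSDs into the experimental BlueStore backend on a Jewel-era build looked roughly like the fragment below; the option names reflect that era's experimental gating and are shown as a sketch, not a recommendation:

# Sketch: enable the (then experimental) BlueStore backend for newly
# created OSDs on a Jewel-era build.
cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
enable experimental unrecoverable data corrupting features = bluestore rocksdb
[osd]
osd objectstore = bluestore
EOF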

The ceph status output looks like this:

cluster 4f095735-f4b2-44c7-b318-566fc6f1d47c
 health HEALTH_OK
 monmap e1: 1 mons at {mon0=192.168.102.249:6789/0}
        election epoch 3, quorum 0 mon0
 osdmap e276: 504 osds: 504 up, 504 in
        flags sortbitwise
  pgmap v3799: 114752 pgs, 4 pools, 2677 GB data, 669 kobjects
        150 TB used, 3463 TB / 3621 TB avail
          114752 active+clean

[caption id="attachment_7712" align="aligncenter" width="225"]The whole cluster-smallAssembling and testing the cluster[/caption]

Performance

The first set of tests shows the total read and write bandwidth of the cluster, as seen by 'rados bench'.  Currently only 180 Gbps of client bandwidth is attached, which is why throughput scaling starts to level off around 180 nodes.

write scaling

If we look at the normalized write bandwidth per OSD, we are getting decent (although not breathtaking) throughput on each drive:

write bw per drive

The read performance is very similar:

read bw per drive

Scaling up the client workload but keeping a small set of 20 OSDs in the cluster shows us that each drive caps out around 90 MB/sec:

write saturate

We also did a few experiments using erasure coding instead of replication.  We expect capacity- and density-optimized clusters to be a compelling use case for these drives.  The following plot shows the total write bandwidth over the cluster with various erasure codes (a sketch of the pool setup commands follows the plots).  The throughput is lower than with pure replication, but already sufficient for many use cases.  There is still a lot of idle CPU time on the drives under load, so until we improve the network and client capabilities we won't know where the bottleneck in the system lies.

ec writes

Replicated writes are a bit faster:

rep writes
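For reference, setting up pools like the ones compared above involves defining an erasure code profile and then creating pools against it. A minimal Jewel-era sketch, with illustrative names, k/m values, and placement group counts (not necessarily those used in these tests):

# Illustrative 4+2 erasure code profile; with one OSD per drive the
# "host" failure domain corresponds to an individual microserver.
ceph osd erasure-code-profile set ec-4-2 k=4 m=2 ruleset-failure-domain=host

# Erasure-coded pool using that profile (PG counts are placeholders).
ceph osd pool create ec-bench 4096 4096 erasure ec-4-2

# Replicated pool (3 copies by default) for the comparison runs.
ceph osd pool create rep-bench 4096 4096 replicated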

What next

The Converged Microserver He8 and compatible chassis are now available in limited volumes through the WDLabs store by contacting a WDLabs representative. WDLabs is now looking at the next generation solution based on early customer adoption and feedback, and is partnering with key suppliers and customers who can help evolve the product.

The Ceph development community is excited to see Ceph running on different hardware platforms and architectures, and we enjoyed working with the WDLabs team to demonstrate the viability of this deployment model. There are many opportunities going forward to optimize the performance and behavior of Ceph on low-power, converged platforms, and we look forward to further collaboration with the hardware community to prove out next generation scale-out storage architectures.