Recap: In Episode 3 of this blog series, we covered RHCS cluster scale-out performance and observed that adding 60% more hardware resources yielded 95% higher IOPS, which demonstrates the scale-out nature of a Red Hat Ceph Storage cluster.
This is the fourth episode of the performance blog series on RHCS 3.2 BlueStore running on an all-flash cluster. In this episode, we review the performance results of various RHCS configurations to help us make rational decisions about building a Red Hat Ceph Storage configuration that delivers the best possible performance with the hardware and software currently available in the lab. For your convenience, this blog post is divided into multiple short benchmarking sections. Enjoy reading!
Historically, for the Ceph FileStore OSD backend, the general recommendation was to deploy 4 OSDs per NVMe device. With recent improvements in the Ceph OSD subsystem, and more specifically with the general availability of the BlueStore OSD backend, it is possible to achieve much higher performance per OSD, so deploying more than 2 OSDs per NVMe device provides diminishing returns. In this benchmark, we wanted to test our hypothesis that 2 OSDs per NVMe device is the best configuration.
Graph-1 shows no change in aggregated IOPS between the 2 OSDs/NVMe and 4 OSDs/NVMe configurations. Interestingly, CPU utilization was better (lower) with the 2 OSDs/NVMe configuration. In addition, 2 OSDs/NVMe resulted in lower average latency than 4 OSDs/NVMe, as shown in Graph-2. This validates our hypothesis that the RHCS 3.2 BlueStore backend requires fewer OSDs per NVMe device (2 OSDs/NVMe) than the FileStore backend, which needed 4 OSDs/NVMe.
RHCS 3.2 Bluestore with 2 OSDs/NVMe configuration delivers the following advantages compared to 4 OSDs/NVMe
This test aims to find the optimal ratio of physical CPU cores to NVMe devices for performance. From the results in Graph 3, we can see that for random write and mixed workloads (70% read, 30% write), IOPS increase as we add CPU cores. This is in line with the results of our previous small-block tests, where the CPU was the limiting factor. For random reads there is almost no gain when moving from 6 to 8 CPU cores. This also makes sense: if you recall the graph showing media utilization for the 4 KB random read workload, the media was already quite busy at around 88% utilization.
From these results we can infer that, with RHCS 3.2 BlueStore, 6 CPU cores per NVMe device is the optimal ratio for a variety of workload patterns. Increasing the core count beyond 6 provides diminishing returns and might not justify the cost involved. It is also worth noting that with the previous generation of the Ceph OSD backend, FileStore, this ratio used to be 10 CPU cores per NVMe device. Hence RHCS 3.2 with BlueStore not only delivers improved performance but also requires fewer system resources.
RHCS 3.2: IOPS increase by physical cores per NVMe device, 4 KB block size

| Workload | IOPS increase from 4 to 6 cores | IOPS increase from 6 to 8 cores |
|---|---|---|
| Randread | ▲ 31.96% | ▲ 0.51% |
| Randrw | ▲ 48.71% | ▲ 17.15% |
| Randwrite | ▲ 51.66% | ▲ 19.60% |

Table 1. IOPS percentage increase from 4 to 6 to 8 physical cores. Block size 4 KB.
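
To make the diminishing returns concrete, the two steps in Table 1 can be compounded into a single 4-to-8-core gain and broken down into an average gain per added core. The following is a minimal sketch, assuming each percentage in Table 1 is measured relative to the previous step (6 cores relative to 4, and 8 cores relative to 6). The function names are illustrative and not part of any Ceph tooling.

```python
# Step-wise IOPS gains copied from Table 1, expressed as fractions:
# (gain from 4 to 6 cores, gain from 6 to 8 cores).
TABLE_1 = {
    "randread": (0.3196, 0.0051),
    "randrw": (0.4871, 0.1715),
    "randwrite": (0.5166, 0.1960),
}


def cumulative_gain(step_gains):
    """Compound successive relative gains into one overall relative gain."""
    factor = 1.0
    for gain in step_gains:
        factor *= 1.0 + gain
    return factor - 1.0


def gain_per_added_core(gain, cores_added):
    """Average relative gain per extra core within a single step."""
    return gain / cores_added


if __name__ == "__main__":
    for workload, (gain_4_to_6, gain_6_to_8) in TABLE_1.items():
        total = cumulative_gain((gain_4_to_6, gain_6_to_8))
        print(
            f"{workload:<10} 4->8 cores: {total:6.1%}  "
            f"per core (4->6): {gain_per_added_core(gain_4_to_6, 2):5.1%}  "
            f"per core (6->8): {gain_per_added_core(gain_6_to_8, 2):5.1%}"
        )
```

Under that assumption, random write gains about 26% per added core in the first step but only about 10% per added core in the second, which is the pattern the 6 cores/NVMe recommendation rests on.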
Based on extensive testing by the Red Hat Storage performance teams and the Ceph community, it is evident that 2x replicated pools perform better than 3x replicated pools, because with 2x replication the Ceph OSDs have to write less data. The more relevant question is how IOPS, average latency, and tail latency compare between 2x and 3x replicated pools on an all-flash Ceph cluster. This test captures all of the above.
In Graph 4 we can observe that there is a performance tax to pay when using 3 replicas. For workloads that include writes, the 3x replicated pool delivered roughly 27% to 29% lower IOPS and roughly 50% to 60% higher average and tail latency than the 2x replicated pool (Table 2), while random read performance was nearly unchanged. However, selecting a replica size is not as simple as it sounds. The pool replica size must be chosen based on the underlying media. Flash media typically offers a higher MTBF and a much lower MTTR than HDD media, so it is generally considered safe to choose 2x replicated pools for flash media and 3x replicated pools for HDD media, but your mileage will vary based on the use case you are designing your storage system for.
RHCS 3.2: Replica 3 vs. Replica 2, 4 KB block size

| Workload | IOPS | Avg Lat | P95% Lat | P99% Lat |
|---|---|---|---|---|
| Rand Read | ▼ -2.25% | ▲ 2.09% | ▲ 1.68% | ▲ 3.10% |
| Rand RW (70R/30W) | ▼ -27.43% | ▲ 60.56% | ▲ 53.42% | ▲ 51.39% |
| Random Write | ▼ -29.08% | ▲ 51.45% | ▲ 52.49% | ▲ 50.96% |

Table 2. Replica 3 vs. Replica 2. Block size 4 KB.
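
A percentage change depends on which configuration is taken as the baseline, so a 29% drop in one direction is not a 29% gain in the other. The sketch below is a small worked conversion, assuming the values in Table 2 express the 3x pool relative to the 2x pool (as the table title and the direction of the arrows suggest). The helper name `invert_relative_change` is illustrative.

```python
def invert_relative_change(change):
    """Given B relative to A as a fraction, return A relative to B.

    If B = A * (1 + change), then A = B / (1 + change),
    so A relative to B is 1 / (1 + change) - 1.
    """
    if change <= -1.0:
        raise ValueError("change must be greater than -100%")
    return 1.0 / (1.0 + change) - 1.0


# Values copied from Table 2 (assumed: 3x replica relative to 2x replica).
TABLE_2 = {
    "Rand Read": {"iops": -0.0225, "avg_lat": 0.0209},
    "Rand RW (70R/30W)": {"iops": -0.2743, "avg_lat": 0.6056},
    "Random Write": {"iops": -0.2908, "avg_lat": 0.5145},
}

if __name__ == "__main__":
    for workload, row in TABLE_2.items():
        iops_2x_vs_3x = invert_relative_change(row["iops"])
        lat_2x_vs_3x = invert_relative_change(row["avg_lat"])
        print(
            f"{workload:<18} 2x vs 3x: IOPS {iops_2x_vs_3x:+.1%}, "
            f"avg latency {lat_2x_vs_3x:+.1%}"
        )
```

Read this way, the random write row corresponds to the 2x pool delivering about 41% higher IOPS and about 34% lower average latency than the 3x pool.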
For RBD workloads on Ceph BlueStore, the size of the BlueStore cache can have a material impact on performance. Onode caching in BlueStore is hierarchical. If an Onode is not cached, it is read from the DB disk, populated into the KV cache, and finally populated into the BlueStore Onode cache. As you can imagine, a direct hit in the Onode cache is much faster than reading from disk or from the KV cache.
When all Onodes in a data set fit into BlueStore’s block cache, the Onodes are never read from disk and thus never have to be populated into the KV cache at all. This is the optimal scenario for RBD. The worst-case scenario, on the other hand, is one where Onodes are read from disk, which populates both the RocksDB KV cache and the BlueStore Onode cache with fresh data and forces out older Onodes, which may have to be read back in from disk later.
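
The lookup order described above can be illustrated with a toy two-level cache model. This is only a sketch under simplifying assumptions: both caches use plain LRU eviction, capacities are counted in Onodes rather than bytes, and the caches start cold, so every Onode is read from disk once on first access. It is not how BlueStore is implemented, and the names `LRUCache` and `simulate` are made up for this example.

```python
from collections import OrderedDict


class LRUCache:
    """A tiny LRU cache that only tracks which keys are resident."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._keys = OrderedDict()

    def hit(self, key):
        """Return True and refresh recency if the key is cached."""
        if key in self._keys:
            self._keys.move_to_end(key)
            return True
        return False

    def insert(self, key):
        """Cache the key, evicting the least recently used entry if full."""
        self._keys[key] = True
        self._keys.move_to_end(key)
        if len(self._keys) > self.capacity:
            self._keys.popitem(last=False)


def simulate(accesses, onode_capacity, kv_capacity):
    """Count where each Onode lookup is served from in the toy hierarchy."""
    onode_cache = LRUCache(onode_capacity)
    kv_cache = LRUCache(kv_capacity)
    counts = {"onode_hits": 0, "kv_hits": 0, "disk_reads": 0}
    for onode in accesses:
        if onode_cache.hit(onode):
            counts["onode_hits"] += 1
            continue
        if kv_cache.hit(onode):
            counts["kv_hits"] += 1
        else:
            # Miss in both caches: read from disk and populate the KV cache.
            counts["disk_reads"] += 1
            kv_cache.insert(onode)
        # Either way, the Onode ends up in the Onode cache.
        onode_cache.insert(onode)
    return counts


if __name__ == "__main__":
    # Synthetic workload: scan 100 Onodes five times in the same order.
    accesses = list(range(100)) * 5
    print("working set fits:", simulate(accesses, onode_capacity=128, kv_capacity=16))
    print("working set too big:", simulate(accesses, onode_capacity=50, kv_capacity=20))
```

With the synthetic scan used here, a cache large enough for the whole working set touches the disk only on the first pass, while an undersized cache keeps evicting Onodes just before they are needed again and goes back to disk on every access.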
In our tests, increasing the BlueStore cache size from 4 GB to 8 GB improved random read-write (70/30) performance by up to 30% in IOPS and reduced P95 tail latency by up to 52% (Table 3).
RHCS 3.2: BlueStore 8 GB cache vs. 4 GB cache, 4 KB block size

| Workload | IOPS | Avg Lat | P95% Lat | P99% Lat |
|---|---|---|---|---|
| Rand Read | ▲ 14.43% | ▼ -24.57% | ▼ -25.43% | ▼ -61.76% |
| Rand RW (70R/30W) | ▲ 30.52% | ▼ -32.62% | ▼ -52.12% | ▼ -11.60% |
| Random Write | ▲ 15.40% | ▼ -19.10% | ▼ -24.31% | ▼ -28.68% |

Table 3. BlueStore 8 GB cache vs. 4 GB cache. Block size 4 KB.
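
Taken together, the ratios discussed so far (2 OSDs per NVMe device, 6 physical cores per NVMe device, and an 8 GB BlueStore cache) translate into a rough per-node budget. The sketch below is back-of-the-envelope arithmetic only. It assumes the 8 GB cache applies per OSD, and the node with 4 NVMe devices is a hypothetical example rather than the lab configuration. `NodeSizing` and `size_node` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSizing:
    """Rough per-node resource budget derived from per-NVMe ratios."""

    osds: int
    physical_cores: int
    bluestore_cache_gb: int


def size_node(nvme_devices, osds_per_nvme=2, cores_per_nvme=6, cache_gb_per_osd=8):
    """Apply the ratios from this post to a node with `nvme_devices` drives.

    Defaults mirror the best-performing settings reported above; the cache
    figure covers only the BlueStore cache (assumed to be per OSD), not the
    total memory an OSD or the operating system needs.
    """
    if nvme_devices < 1:
        raise ValueError("a node needs at least one NVMe device")
    osds = nvme_devices * osds_per_nvme
    return NodeSizing(
        osds=osds,
        physical_cores=nvme_devices * cores_per_nvme,
        bluestore_cache_gb=osds * cache_gb_per_osd,
    )


if __name__ == "__main__":
    # Hypothetical node with 4 NVMe data devices.
    print(size_node(nvme_devices=4))
```

The result is a starting point for capacity planning, not a guarantee: the right ratios depend on the workload mix, as the per-workload differences in the tables show.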
With this test, we wanted to answer a simple question: “Does the Intel Optane P4800X, when used as a BlueStore metadata device, provide value for money?”
To answer this, we ran the same test twice. In round 1, the BlueStore metadata (RocksDB and WAL) partitions were co-located with the BlueStore data partition on the same device (P4500). In round 2, the BlueStore metadata (RocksDB and WAL) partitions were placed on an Intel Optane P4800X device, with the BlueStore data partition on the P4500.
Graph-6 shows the IOPS, average latency, and P99 latency for 8 KB block tests across random read, random write, and random read-write patterns. As highlighted in the graph, using the Intel Optane P4800X improves IOPS and significantly improves P99 tail latency, especially for random write workloads. With BlueStore metadata on the Intel Optane P4800X, we saw up to a 52% increase in IOPS and 51% to 76% lower P99 tail latency compared with BlueStore metadata partitions co-located on the NVMe data device (Table 4).
Achieving predictable performance is critical for production databases, and database workloads are extremely sensitive to tail latencies. With the Intel Optane P4800X drive, we observed minimal variation in the 99th percentile latency during the tests. In other words, we not only observed lower P99 latency but also achieved predictable and consistent tail latency results, which is of utmost importance to database workloads.
RHCS 3.2: Optane vs. no Optane, 8 KB block size

| Workload | IOPS | Avg Lat | P95% Lat | P99% Lat |
|---|---|---|---|---|
| Rand Read | ▲ 29.79% | ▼ -17.54% | ▼ -60.26% | ▼ -75.57% |
| Rand RW (70R/30W) | ▲ 52.12% | ▼ -43.16% | ▼ -48.96% | ▼ -51.36% |
| Random Write | ▲ 37.92% | ▼ -27.40% | ▼ -49.35% | ▼ -70.14% |

Table 4. Optane vs. no Optane. Block size 8 KB.
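
Since the tables in this post lean heavily on P95 and P99 latency, it may help to see why a tail percentile can tell a different story from the average. The sketch below uses the nearest-rank percentile definition on two synthetic latency samples. The sample values are invented for illustration and are not measurements from these tests, and benchmark tools may compute percentiles with a different method.

```python
import math
from statistics import mean


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    # Synthetic latencies in milliseconds (made up for this example).
    steady = [1.0] * 100
    spiky = [0.8] * 98 + [5.0, 20.0]

    for name, samples in (("steady", steady), ("spiky", spiky)):
        print(
            f"{name:<7} avg={mean(samples):.2f} ms  "
            f"p95={percentile_nearest_rank(samples, 95):.2f} ms  "
            f"p99={percentile_nearest_rank(samples, 99):.2f} ms"
        )
```

The two synthetic samples have nearly the same average, yet only the P99 value exposes the occasional slow requests in the second one, which is why tail latency matters so much for database workloads.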