The Ceph Blog

INTRODUCTION

One of the strangest things about the holidays is how productive I am. Maybe it’s the fact that Minnesota is bitterly cold this time of year, and the only entertaining things to do outside often involve subzero winds rushing at your face. Or perhaps it’s the desire to escape from all other forms of life after 4-5 consecutive holiday celebrations. In any event, this is the time of year when (assuming I’m not shoveling 3 feet of snow) I tend to get a bit more done. Luckily for you, that means we’ve got a couple of new articles already in the works.

One of the things that people have periodically asked me is what IO scheduler they should be using to get maximum performance with Ceph. In case you don’t know, an IO scheduler is an algorithm, usually built around a queue or set of queues, into which block IO requests are placed and reordered or merged before being sent to the underlying storage device. In this article we’ll take a look at how Ceph performs with some of the common Linux IO schedulers in a couple of different scenarios. Without further ado, let’s get to work:

 

On many Linux systems there are three IO schedulers to choose from. Here’s a very brief (and somewhat incomplete) explanation of each:

  • CFQ: Puts IO requests into per-process queues and allocates time slices for each queue.
  • Deadline: Assigns deadlines to IO requests and puts them into queues that are sorted by their deadlines.
  • NOOP: Puts IO requests into a simple FIFO queue. Any scheduling is performed at another layer.

People often recommend Deadline or NOOP when using SSDs or a controller with write-back cache. CFQ tends to excel with spinning disks on desktops or workstations where a user is directly interacting with the system. It’s not clear that CFQ would help something like Ceph though. Rather than making recommendations based on that second-hand knowledge, I figured it was time to go looking for some answers and see how these schedulers really perform.
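
For reference, the active scheduler can be checked and changed per block device through sysfs. Here’s a generic sketch; the device name is just an example, and the change only lasts until the next reboot:

    # Show the available schedulers for a device; the one in brackets is active.
    cat /sys/block/sdb/queue/scheduler
    noop deadline [cfq]

    # Switch that device to deadline for the current boot.
    echo deadline > /sys/block/sdb/queue/scheduler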

 

SYSTEM SETUP

We are going to use the SAS2208 controller for these tests. It supports JBOD, multiple single-drive RAID0, and a single large multi-disk RAID0 configuration. Unfortunately, different controllers have different IO reordering capabilities, so these results may not be representative of other controllers. Hopefully they will at least provide a starting point and perhaps a guess as to how similar configurations may perform.

Hardware being used in this setup includes:

  • Chassis: Supermicro 4U 36-drive SC847A
  • Motherboard: Supermicro X9DRH-7F
  • Disk Controller: On-board SAS2208
  • CPUs: 2x Intel Xeon E5-2630L (2.0GHz, 6-core)
  • RAM: 8x 4GB Supermicro ECC Registered DDR1333 (32GB total)
  • Disks: 8x 7200RPM Seagate Constellation ES 1TB Enterprise SATA
  • NIC: Intel X520-DA2 10GbE

As far as software goes, these tests will use:

  • OS: Ubuntu 12.04
  • Kernel: 3.6.3 from Ceph’s GitBuilder archive
  • Tools: blktrace, collectl, perf
  • Ceph: “next” branch from just before the 0.56 Bobtail release.

TEST SETUP

As in previous articles, we are running tests directly on the SC847A using localhost TCP socket connections. We are performing both read and write tests. A 10G journal partition was set up at the beginning of each device. The following controller modes were tested:

  • JBOD Mode (Acts like a standard SAS controller.  Does not use on-board cache.)
  • 8xRAID0 mode (A single drive RAID0 group for each OSD. Uses on-board write-back cache.)
  • RAID0 Mode (A single OSD on a multi-disk RAID0 group.  Uses on-board write-back cache.)

To generate results, we are using Ceph’s trusty built-in benchmarking command, “RADOS bench”, which writes a new object for every chunk of data to be written out (some day I’ll get to the promised smalliobench article!). RADOS bench has certain benefits and drawbacks. On one hand, it gives you a very clear picture of how fast OSDs can write out and read back objects at various sizes. What it does not test is how quickly small IOs to existing large objects are performed. For that reason and others, these results are not necessarily reflective of how RBD will ultimately perform.

As in our previous articles, we are running 8 concurrent instances of RADOS bench and aggregating the results to ensure that the benchmark client itself is not a bottleneck. We are instructing each instance of RADOS bench to write to its own pool with 2048 PGs each. This is done to ensure that, later on during the read tests, each instance of RADOS bench reads unique objects that were not previously pulled into page cache by one of the other instances. You may also notice that we are using a power-of-2 number of PGs per pool. Due to the way that Ceph implements PG splitting, having a power-of-2 number of PGs (especially at low PG counts!) may improve how evenly data is distributed across OSDs. At larger PG counts this may not be as important.
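
Creating those pools is a one-liner per instance. A rough sketch of what that setup might look like (the pool name here is made up for illustration):

    # One pool per RADOS bench instance, each with 2048 placement groups.
    ceph osd pool create radosbench0 2048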

RADOS bench gives you some flexibility regarding how big objects should be, how many to keep in flight concurrently, and how long tests should run. We’ve settled on 5-minute tests using the following permutations (a sample invocation is sketched after the list):

  • 4KB Objects, 16 Concurrent Operations (2 per rados bench instance)
  • 4KB Objects, 256 Concurrent Operations (32 per rados bench instance)
  • 128KB Objects, 16 Concurrent Operations (2 per rados bench instance)
  • 128KB Objects, 256 Concurrent Operations (32 per rados bench instance)
  • 4MB Objects, 16 Concurrent Operations (2 per rados bench instance)
  • 4MB Objects, 256 Concurrent Operations (32 per rados bench instance)
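
As a concrete example, the first permutation (4KB objects, 16 concurrent operations, so 2 in flight per instance) maps to roughly the following pair of commands for each of the 8 instances. The pool name matches the hypothetical one above, and the flags are illustrative rather than a transcript of our test scripts:

    # Write phase: 4KB objects, 2 ops in flight, 300 seconds.
    rados -p radosbench0 bench 300 write -b 4096 -t 2

    # Read phase: sequentially read back the objects written above.
    rados -p radosbench0 bench 300 seq -t 2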

For each permutation, we run the same test using BTRFS, XFS, or EXT4 for the underlying OSD file system and CFQ, Deadline, or NOOP for the IO scheduler. File systems are reformatted and mkcephfs is re-run between every test to ensure that fragmentation from previous tests does not affect the outcome. Keep in mind that this may be misleading when trying to use these results to judge how a production cluster would perform: each file system ages differently and may perform quite differently over time. Despite this, reformatting between tests is necessary to ensure that the comparisons are fair.

We left most Ceph tunables in their default state for these tests, with two exceptions. “filestore xattr use omap” was enabled to ensure that EXT4 worked properly, and CephX authentication was disabled since it was not necessary for these tests.
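
In ceph.conf terms, those two changes amount to something like the following snippet (option names as used around the Bobtail era; treat it as a sketch rather than our exact configuration):

    [global]
        ; Store xattrs in omap so EXT4's limited xattr size isn't a problem.
        filestore xattr use omap = true

        ; Disable CephX authentication.
        auth cluster required = none
        auth service required = none
        auth client required = none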

We did pass certain mkfs and mount options to the underlying file systems where it made sense. In response to the Bobtail performance preview, Christoph Hellwig pointed out that Ceph would likely benefit from using the inode64 mount option with XFS, and mentioned a couple of other tunable options that might be worth trying. We didn’t have time to explore all of them, but did enable inode64 for these tests. The options we used are listed below, and the XFS case is spelled out as commands after the list.

  • mkfs.btrfs options: -l 16k -n 16k
  • btrfs mount options: -o noatime
  • mkfs.xfs options: -f -i size=2048 (plus -d su=64k,sw=8 for RAID0 tests)
  • xfs mount options: -o inode64,noatime
  • mkfs.ext4 options: (-b 4096 -E stride=16,stripe-width=128 for RAID0 tests)
  • ext4 mount options: -o noatime,user_xattr
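
Spelled out for the XFS case, a single reformat-and-remount cycle between runs would look roughly like the commands below. The device, partition, and mount point names are made up for illustration, and the mkcephfs step assumes the cluster layout is already described in /etc/ceph/ceph.conf:

    # Recreate the OSD data file system so aging from the previous run can't leak in.
    mkfs.xfs -f -i size=2048 /dev/sdb2
    mount -o inode64,noatime /dev/sdb2 /srv/osd.0

    # Rebuild the Ceph cluster from scratch before the next test.
    mkcephfs -a -c /etc/ceph/ceph.conf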

During the tests, collectl was used to record various system performance statistics.

 

4KB RADOS BENCH WRITE RESULTS


 

Well, so far it looks like some of our suspicions are holding true. In modes where WB cache is being used, Deadline and NOOP seem to have some advantages over CFQ. That probably makes sense considering that the controller can do its own reordering. In JBOD mode, though, the situation is reversed, with CFQ tending to perform better.

With more concurrent operations, CFQ continues to do well in JBOD mode. In the two RAID modes, the race has tightened, but CFQ continues to do poorly with EXT4 in the single-OSD RAID0 configuration.

 

4KB RADOS BENCH READ RESULTS

With few concurrent 4K reads, it looks like NOOP and Deadline tend to be a better choice for BTRFS, but the EXT4 and XFS results are a little more muddled. In the 8xRAID0 mode it actually looks like CFQ may be pulling ahead, but more samples should probably be taken to make sure.

Wow! There are some crazy trends here. BTRFS performance seems to be pretty consistent across IO schedulers now, but XFS and EXT4 are showing some pretty big differences. EXT4 in the single-disk-per-OSD setups seems to do far better with CFQ than with either Deadline or NOOP. With XFS, on the other hand, CFQ seems to be doing far worse than Deadline or NOOP in the JBOD configuration.

 

128KB RADOS BENCH WRITE RESULTS

With few concurrent 128k writes, it looks like BTRFS is tending to favor Deadline and NOOP. CFQ seems to pull out a solitary win with EXT4. In the single-OSD RAID0 configuration, though, XFS and EXT4 do significantly better with Deadline and NOOP.

Adding more concurrent 128k writes seems to push CFQ back into the lead in the JBOD tests. Deadline and NOOP seem to generally be better in the modes with WB cache, especially with EXT4.

 

128KB RADOS BENCH READ RESULTS

The results here are kind of muddled. I think the only clear thing is that EXT4 really seems to favor CFQ in the multi-OSD modes, while BTRFS seems to favor Deadline and NOOP in the big RAID0 configuration.

With more concurrent reads, it’s still tough to make any significant conclusions other than to say that EXT4 read performance continues to be highest in multi-OSD configurations with CFQ.

 

4MB RADOS BENCH WRITE RESULTS

It’s tough to make any strong conclusions here except that EXT4 seems to do better with Deadline and NOOP in the big RAID0 configuration.

Again, tough to make any real strong conclusions here, other than maybe that EXT4 is doing better with CFQ in JBOD mode, and does worst with Deadline in the 8-OSD, single disk, RAID0 mode.

 

4MB RADOS BENCH READ RESULTS

Looks like XFS and BTRFS are slightly favoring Deadline and NOOP across the board, while EXT4 favors CFQ in the multi-OSD configurations, but favors Deadline and NOOP in the 1-OSD RAID0 configuration.

The big obvious thing here is the dramatic performance drop with Deadline and NOOP when using EXT4 with the 8-OSD RAID0 configuration. Otherwise, Deadline and NOOP seem to maybe do slightly better than CFQ.

 

RESULTS SUMMARY

Alright, that wasn’t nearly as intense as the last article. A proverbial walk in the park. Still, it’s not exactly easy to draw meaningful trends from the scatter plots shown above. Let’s take a look at the averages for each IO scheduler and examine how they compare at different IO sizes. I’m color-coding the results based on how the mean and standard deviation ranges for the results compare. Specifically, I take the scheduler with the highest mean throughput and compare its 1-standard-deviation range to that of the lowest performing scheduler. If the ranges overlap, no color coding is done. If the ranges are distinct, the highest performing scheduler’s mean is colored green and the lowest is colored red. The middle scheduler’s mean is color coded based on how it compares to the other two. This probably isn’t precise enough for a scientific journal, but I figure for a blog post it isn’t entirely unreasonable.
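
If you want to reproduce that comparison, the per-scheduler mean and standard deviation can be pulled out of a plain column of throughput samples with something as simple as the awk one-liner below (the file name is hypothetical, and this computes the population standard deviation):

    # Mean and standard deviation of one scheduler's throughput samples (MB/s, one per line).
    awk '{ s += $1; ss += $1 * $1; n++ }
         END { m = s / n; printf "mean=%.1f sd=%.1f\n", m, sqrt(ss / n - m * m) }' cfq_4kb_write.txt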

I think our hypothesis that Deadline and NOOP are better for modes with WB cache is roughly correct so far. Interestingly, CFQ does pull off a number of wins in JBOD mode. There are some mixed results in the 8xRAID0 mode, but it looks like the only really major CFQ win there is for EXT4 reads. In RAID0 mode CFQ doesn’t show any advantages at all as far as I can tell.

The results look pretty consistent with the 4K results. CFQ tends to do pretty well in JBOD mode, but is often beaten out by Deadline and NOOP in the other modes, except for 8xRAID0 EXT4 reads.

With large IOs, the only place where CFQ seems to be a consistent win is with EXT4 in the JBOD and 8xRAID0 modes. Otherwise, it looks like if you want to optimize for large IOs you are best off with Deadline, and in some cases NOOP.

CONCLUSION

Well, there you have it. If you’ve ever wondered what IO scheduler to use with Ceph, this will hopefully provide some insight. Namely, if you use EXT4 or have a JBOD configuration, CFQ might be worth looking at. In many of the other configurations, you may want to choose Deadline (or sometimes NOOP). In some cases, you may have to sacrifice read or write performance to improve the other. Given that every controller is different, these results may not be universal, but they at least fall roughly in line with the conventional wisdom regarding these schedulers. This concludes our article, but keep your eyes peeled for the next one, where we will examine how different Ceph tunables affect performance. Until next time, Ceph enthusiasts!

Comments: Ceph Bobtail Performance – IO Scheduler Comparison

  1. Hi,

    thanks for the benchmarks, very interesting. Were you at the default “nr_requests” (/sys/block/sda/queue/nr_requests)?

    Because the default is 128 and it’s quite low. It would be interesting to test with values like 4096, 8192… 64k (I usually get the most out of my drives around 8192).

    Posted by nwrk
    January 23, 2013 at 12:09 am
    • Hi nwrk,

      Yep, default nr_requests. That’s a good idea, I’ll see if I can get some tests going with higher values. Thanks for the suggestion!

      Mark

      Posted by MarkNelson
      January 23, 2013 at 1:46 am
  2. Hi,

    Some points: you should be seeing better performance out of your 8xRAID0 case.

    Your RAID0 with 8 physical disks (assuming each is directly connected to the HBA) doesn’t account for the actual bandwidth at its disposal. Each physical disk normally has a SCSI queue depth of 32 *each*; now you’ve aggregated that but kept the default SCSI queue depth of 32, when it should really be somewhere around 32 * 8.

    You’ve likely traded throughput for response time. The benchmark would be more informative if you detailed both throughput and response time; it would put the tradeoffs into better perspective, as not everyone is focused on throughput (think transaction-focused applications with hard deadlines).

    Additionally, the block queue is of great importance, with its depth governed by nr_requests. That’s where the scheduler has the best chance to reorder IOs for better performance; the larger the queue, the higher the impact of the scheduler algorithm. As with any queue, this is also a tradeoff between response time and throughput.

    Food for thought. Good work.

    Posted by Peter Petrakis
    February 5, 2013 at 8:59 pm
    • Hi Peter,

      Thanks for the feedback!

      For the 8xRAID0 configuration, which tests do you think should perform better? I.e. for 256 concurrent 4MB objects, we are getting somewhere around 520MB/s = ~1040MB/s with journal writes / 8 = ~130MB/s to the drive. That’s pretty close to the throughput these 7200rpm SATA disks can do (a little over 140MB/s).

      Regarding queue depth: Doesn’t megaraid_sas throttle the queue depth at the controller level anyway? I don’t think it’s going to let you have a per LUN queue depth of 256. Just a quick look through the code seems to indicate possibly a per LUN limit of 64, but I don’t trust that I’m right about that.

      Definitely tweaking nr_requests could be interesting. Too many things to test!

      Mark

      Posted by MarkNelson
      February 6, 2013 at 4:41 pm
  3. You should be seeing better read performance compared to JBOD; it’s like the HBA cache is behaving as write-through. You have a 1G HBA cache and a combined individual disk cache of 512MB, so there’s plenty of room in there for multiple blocks to hang out before being evicted (written back) on conflict, which really confuses me in the read cases, especially in the 256-way 4M read case.

    Concerning queue depth, megaraid has a modest feedback loop to throttle on timeouts. Timeouts aren’t cheap either; they average 4-6ms turnaround time simply to acknowledge them.

    Concerning dynamic queue depth… that isn’t as sexy as it sounds.

    megasas_check_and_restore_queue_depth(struct megasas_instance *instance)
    {
            unsigned long flags;

            if (instance->flag & MEGASAS_FW_BUSY
                && time_after(jiffies, instance->last_time + 5 * HZ)
                && atomic_read(&instance->fw_outstanding) < instance->throttlequeuedepth + 1) {

                    instance->host->can_queue =
                            instance->max_fw_cmds - MEGASAS_INT_CMDS;
            }

    where max_fw_cmds is:
    instance->max_fw_cmds = instance->instancet->read_fw_status_reg(reg_set) & 0x00FFF

    … so who knows :)

    throttlequeuedepth defaults to 16 and becomes the current queue depth after an error condition has occurred. So this is simply putting the HBA back into a sane state after error handling; it’s a fail-safe, not a performance tweak.

    The spec for these disks is 150MB/s; getting 130 leaves 16% left to gain, which is significant. Also, throughput isn’t the whole story; just looking at iostat can show response times. If I have a JBOD and an 8xRAID0 pushing the same throughput and make a purchasing decision to get the JBOD based just on that metric, I can get burned should my application require fast turnaround times.

    iotime4M[JBOD(one disk)] = 9ms + 4.16ms + 4MB/150MB/s = 39.9ms
    iotime4M[8xRAID0] = 9ms + 4.16ms + (4MB/8)/150MB/s = 16.5ms

    If you have a transaction budget of a 100ms, which one would you rather have?

    It’s a RAID controller first; it has no idea Ceph is its only customer. That means all the best practices of deploying RAID apply, especially partition alignment.

    “Published test results indicate a performance penalty of about 5-30% for improper alignment”
    http://www.ibm.com/developerworks/linux/library/l-4kb-sector-disks/index.html?ca=dgr-lnxw074KB-Disksdth-LX

    There are lots of variables, start somewhere, write it all down, and assumptions can become lost opportunities. Keep it up :)

    Posted by Peter Petrakis
    February 7, 2013 at 3:04 am


© 2013, Inktank Storage, Inc. All rights reserved.