Planet Ceph

Aggregated news from external sources

  • November 27, 2013
    Quick update Ceph: from Argonaut to Cuttlefish

    Memory leaks disappeared and CPU load dramatically reduced. Yay!

    The upgrade started during week 39.

    The first graph shows the amount of RAM used before and after the Ceph upgrade.
    As you might know, there were numerous memory leaks…

  • November 26, 2013

    Thanks to the hard work of the puppet-openstack
    community, Puppet was the preferred method of deployment for OpenStack
    in the latest OpenStack User Survey.

    If you’d like to join in on the fun and contribute, read on!
    First things first, a bit of context:

    • OpenStack is a modular cloud orchestration platform,
      self-described as “Open source software for building private and
      public clouds”.
    • puppet-openstack is a Stackforge project that centralizes the
      development of Puppet modules related to OpenStack. puppet-openstack
      is also an actual module allowing the installation and
      configuration of core OpenStack services.
    • Stackforge is used to host OpenStack-related projects so that they
      can benefit from the same continuous integration infrastructure and
      review system that the main OpenStack projects, such as Nova, use.

    Now that we have the basics out of the way: if you’d like to contribute
    to OpenStack in general, it’s not mandatory to have any programming or
    networking knowledge. There are always things like documentation and
    translation that need manpower.

    For contributing to puppet-openstack in particular, however, you need
    to be (or become!) familiar with Ruby, Puppet,
    rspec-puppet and, of course, OpenStack.

    The contribution process for puppet-openstack is slightly different from
    committing code to the primary OpenStack projects (such as Nova). I won’t
    be highlighting the differences here for the sake of simplicity; this is a topic
    for another blog post!

    I recently started contributing as part of the
    new puppet-ceph initiative so this blog post more or less describes
    what I had to go through to get my first contribution in.

    Okay, sign me up.

    If you want to join in on the fun, the basic instructions for signing up
    are pretty well documented on the Openstack

    In a nutshell:

    Getting started

    Let’s say I want to develop for puppet-ceph (!), I’ll keep these
    resources handy:

    • The Launchpad project for bugs/issues/fixes/feature/backlog
      documentation and
      discussion: (each project
      has its own Launchpad project)
    • The developer documentation will prove useful to prepare your
      development environment and beyond. For puppet modules,
      documentation is provided both on the Openstack
      and directly in the README files.

    Clone the project

    You’re going to need the Puppet module source to work on it. You can
    either clone it from GitHub:

    git clone

    or from Gerrit:

    git clone

    Make sure you have ruby, rubygems and bundle installed

    First of all, you’ll need ruby and bundle to manage ruby
    packages (gems).
    These will be required, especially when the time comes to run
    spec/integration/lint tests.

    If you already have them you can skip this part!

    On Ubuntu:

    apt-get install ruby rubygems ruby-bundler

    On Debian:

    apt-get install ruby rubygems bundler

    Install development dependencies

    With the help of bundle, fetch and install the gem dependencies
    documented in the Gemfile located at the root of the repository.

    bundle install

    Create your branch and do your stuff

    Create a branch with a name relevant to what you’re doing

    git checkout -b feature/my_feature

    Now you can do your modifications.
    Don’t forget to add new spec tests or modify existing ones to match the
    modifications you made to the module.

    Test your stuff

    You’ve added or modified some code, now you want to test it:

    Test for puppet syntax (puppet-lint):

    bundle exec rake lint

    Run spec tests (rspec-puppet):

    bundle exec rake spec

    If you try to push code that doesn’t pass the tests, Jenkins will not
    let you through, so better make sure everything is okay before sending
    something for review!

    Tests are successful? Add and commit your stuff

    git add [file]
    git commit

    Make sure your commit message follows the right format!

    Send your stuff for review

    git review

    That’s it! Your code was sent to Gerrit for review by the community
    and the core reviewers!

    Jenkins or someone -1’d my code. Help!

    Maybe you made a typo, or something far worse, that you’d like to fix; this is
    done by submitting another patch set.

    Make the changes you want, add the files again but instead of using
    ‘git commit’, use ‘git commit --amend’.
    This will essentially modify the initial commit.

    After amending your commit, send the code back for a new review with
    ‘git review’ once more.
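
    The amend-and-resubmit loop described above can be sketched as a small
    shell function. This is only an illustration: `run` and `resubmit` are
    hypothetical helper names (not part of git or git-review), and
    `manifests/init.pp` is a placeholder path. By default the helper only
    prints the commands (DRY_RUN=1) so you can check them first.

```shell
#!/bin/sh
# Hypothetical helper: print the command when DRY_RUN=1 (the default here),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: re-run the tests, amend the existing commit and
# send a new patch set for review. "$@" is the list of files you changed.
resubmit() {
    run bundle exec rake lint &&
    run bundle exec rake spec &&
    run git add "$@" &&
    run git commit --amend &&
    run git review
}

# Placeholder file name, for illustration only.
resubmit manifests/init.pp
```

    Set DRY_RUN=0 in the environment to actually run the commands instead of
    printing them.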

  • November 26, 2013
    Back from the summit: Ceph/OpenStack integration

    The summit was exciting and full of good things and announcements. We had great Cinder sessions and an amazing Ceph/OpenStack integration session. I led the Ceph/OpenStack integration session with Josh Durgin (Inktank). We had good participation from the audience. I would especially like to thank Sage Weil, Haomai Wang and Edward Hope-Morley for their good input. The main purpose of… Read more →

  • November 25, 2013
    Ceph: find who’s mapping a RBD device

    Curious? Wanna know who has a RBD device mapped?

    Important note: this method only works with the Emperor version of Ceph and above.

    Grab the image information:

    $ rbd info boot
    rbd image 'boot':
    size 10240 MB in 2560 ob…

  • November 22, 2013
    Map/unmap RBD device on boot/shutdown

    Quick how-to on mapping/unmapping a RBD device during startup and shutdown.

    We are going to use an init script provided by the ceph package.
    During the boot sequence, the init script first looks at /etc/rbdmap and will map devices accordingly.

  • November 21, 2013
    Manage a multi-datacenter crush map with the command line

    A new datacenter is added to the crush map of a Ceph cluster:

    # ceph osd crush add-bucket fsf datacenter
    added bucket fsf type datacenter to crush map
    # ceph osd crush move fsf root=default
    moved item id -13 name … Continue reading

  • November 21, 2013
    Measure Ceph RBD performance in a quantitative way (part II)

    This is the 2nd post about Ceph RBD performance. In part 1, we talked about random IO performance on Ceph. This time we share the sequential read/write testing data. In case you forgot our hardware configuration, we use 40x 1TB SATA disks as data disks plus 12 SSDs as journals. 4x 10Gb links connect the storage cluster with the clients, which provides enough network bandwidth. The figures below show the sequential read (SR) and sequential write (SW) performance with QD=64 and CAP=60MB/s per VM. As the number of volumes/VMs increases, the per-VM throughput drops gradually. The SR max total throughput of 2759MB/sec happens at VM=80 and the SW peak total throughput of 1487MB/sec happens at VM=50. However, considering our pre-defined QoS requirement (per-VM throughput is larger than 90% of the pre-defined target), we pick VM=40 for SR and VM=30 for SW, which results in reported metrics of 2263MB/sec (SR) and 1197MB/sec (SW).

    Is the result good enough? With an approach similar to the one we used for random IO, we measure the native disk performance as the reference. We observe ~160MB/sec sequential bandwidth per disk for both read and write, and we measure ~900MB/sec for a single 10Gb NIC. In theory, the 40 SATA disks are expected to deliver 6400MB/sec for read and 3200MB/sec for write (replica=2), and 4x 10Gb can deliver ~3600MB/sec bandwidth. Thus the final Ceph efficiency is 57% for SR and 37% for SW, as in the table below. Compared to the random IO testing result, this is not a perfect result.
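
    The theoretical ceilings quoted above follow from simple arithmetic, which
    can be checked with a few lines of shell. This is a sketch using only the
    figures from the post; the variable names are made up for illustration.
    The write efficiency is computed against the disk write ceiling. The read
    efficiency is left out because the post does not say which ceiling the 57%
    figure is measured against.

```shell
#!/bin/sh
# Figures quoted in the post (variable names are illustrative).
disks=40            # SATA data disks
disk_bw=160         # MB/sec sequential bandwidth per disk (read or write)
replicas=2          # replica=2: each client write lands on two disks
nics=4              # 10Gb links
nic_bw=900          # MB/sec measured for a single 10Gb NIC
sw_measured=1197    # MB/sec reported sequential write throughput

read_ceiling=$((disks * disk_bw))            # 6400 MB/sec
write_ceiling=$((read_ceiling / replicas))   # 3200 MB/sec
net_ceiling=$((nics * nic_bw))               # 3600 MB/sec

# Integer percentage, rounded down.
sw_efficiency=$((sw_measured * 100 / write_ceiling))   # 37

echo "disk read ceiling:  ${read_ceiling} MB/sec"
echo "disk write ceiling: ${write_ceiling} MB/sec"
echo "network ceiling:    ${net_ceiling} MB/sec"
echo "SW efficiency:      ${sw_efficiency}%"
```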

    Let’s take a look at the Ceph architecture to understand the data better. The figure below illustrates a conceptual Ceph cluster, which has M disks, each mapped with N-1 objects. The size of a volume disk is marked as Volume_Size. Assuming the object size is Size_O, each virtual disk volume is composed of Volume_Size/Size_O objects. To simplify the problem, some components (e.g. PG and replica impact) are ignored on purpose. The IO requests from virtual disk volumes are distributed to different objects based on the CRUSH algorithm and become the real reads/writes hitting the disks. Because several objects map to the same physical disk, the original logically sequential IO streams mix together (green, orange, blue and red blocks), and the real IO pattern on each physical disk becomes random, with disk seeks happening. As a result, latency becomes much longer and total throughput drops a lot.
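
    To make the Volume_Size/Size_O relation concrete, here is a minimal sketch
    of how a logical offset in a volume falls into an object. The 10240MB
    volume size is an assumed example value (the post does not give one), and
    `object_index` is a hypothetical helper. Which disk each object lands on is
    decided by CRUSH and is not modelled here.

```shell
#!/bin/sh
# Assumed example values; the post does not give the volume size.
volume_size_mb=10240   # Volume_Size
object_size_mb=4       # Size_O

# Each volume is composed of Volume_Size/Size_O objects.
objects_per_volume=$((volume_size_mb / object_size_mb))

# Hypothetical helper: index of the object holding a given offset
# (in MB) of the volume.
object_index() {
    echo $(( $1 / object_size_mb ))
}

echo "objects per volume: ${objects_per_volume}"      # 2560
echo "offset 0MB -> object $(object_index 0)"         # object 0
echo "offset 3MB -> object $(object_index 3)"         # object 0
echo "offset 4MB -> object $(object_index 4)"         # object 1
```

    A sequential stream therefore changes object every Size_O of data, and each
    change may land on a different disk.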

    The blktrace results support our assumption. We collect ~37K IO traces in two experiments. In the left figure, we run 40 VMs, all of them generating fully sequential reads. In the right figure, we run 20 sequential IO VMs and 20 random IO VMs at the same time. Even in the all-sequential case, 26% of the IOs are non-adjacent, which means seeks happen. With half the load being random IO, the ratio of IOs that need a seek increases to 59%. In a real production environment, we believe the random IO stream ratio would be higher, which we expect to have more impact on sequential IO stream performance.

    The figure below shows the per-VM BW and latency analysis for the sequential read/write pattern under different FIO queue depths (QD) and volume/VM numbers. There are several findings:

    • Read performance is better than write, because a write generates twice as much physical IO as a read.
    • The SSD journal doesn’t bring the same benefit as we observed in the random IO tests. We believe this is because sequential IO has a much higher bandwidth, which fills the cache space very quickly.
    • For the QD=8 cases, the read latency starts at 4ms and ends at 15ms. The write latency starts at 9ms and ends at 28ms. Further study shows the physical IO pattern becomes more random with a higher VM number, which is reasonable.
    • For the QD=64 cases, the read latency starts at 17ms and ends at 116ms. The write latency starts at 36ms and ends at 247ms. The large starting latency is abnormal because at this point the storage cluster is far from fully loaded. Latency breakdown tests tell us most of the latency comes from the client side, which shows a potential optimization opportunity.

    We believe the low sequential IO performance is a challenge not only for Ceph, but for all other distributed storage systems with a similar design. Per our understanding, there are potentially two general ways to improve sequential IO performance: make random IO run faster, or optimize the IO pattern to increase the sequential IO percentage.

    • Firstly, let’s look at the possibility of reducing the random IO latency. One way is to optimize the existing software code. Remember we observed 36ms latency for SW with only one VM at QD=64? That latency is extremely high considering the storage cluster is far from fully loaded at that moment. We believe the extra latency comes from software issues (locking, single-threaded queues, etc.). By removing those software bottlenecks, we should be able to shorten the IO latency and improve the total throughput. The other way is to speed up the file store part, either by adding more memory or by using SSDs as a write-back/write-through cache. The 2nd option seems more interesting because of its cost advantage compared to DRAM, and because the journal usually does not use all the SSD space. However, this needs more experiments to understand the bandwidth requirement if we use the same SSD for journal and cache. There are some related blueprints (BP) about this, like the Cache Tier (although it uses the SSD cache at a different level, per my understanding) and new NAND interface support.
    • The 2nd way is to change the IO pattern so that there is a higher percentage of adjacent IO and less seeking. There are two cases where the logically sequential IO stream is interrupted. The 1st is when the logical address within a volume increases and the mapping moves from one object to another (either on the same physical storage node or on a different one). The 2nd is when the stream is interrupted by other IO reads/writes to the same hard disk from a different volume. Accordingly, we can try two tuning methods. The 1st is to use a bigger object size, so that jumping to a different object/node becomes less likely. The 2nd is to use better mapping rules. By default, we put all the disks into the same pool, so the virtual address space of one virtual volume is distributed across all the physical disks. In this case, many IO streams from different volumes hit the same hard disk, resulting in a more fragmented access pattern. A better way is to create multiple pools, each with a limited number of disks and serving fewer volumes. Fewer IO streams then share the same hard disks, resulting in a more sequential pattern. The figure below demonstrates the two configurations. The idea is from the best paper “Copysets: Reducing the Frequency of Data Loss in Cloud Storage” at ATC 2013. The authors’ original goal is to reduce the impact of disk loss, which is also valid for the Ceph case. We suggest keeping this as a “best practice” and adding some functionality to Ceph to ease the administrator’s work.

    We did some tests to verify the two tuning options, as shown in the figure below. The left bar (4MB object) is the default configuration with a 4MB object size and one pool for all disks. The 2nd bar (32MB object) uses a 32MB object size with one pool for all disks. The rightmost bar (32MB object + new mapping rule) uses a 32MB object size and 10 pools with 4 disks each. With 40 VMs/volumes and the same load, the average per-VM bandwidth increases from 43MB/sec to 57MB/sec (33% gain) and then to 64MB/sec (a further 13% gain). This seems to be a pretty good start. As a next step, we will continue to try different tuning parameters to understand the tradeoffs and identify optimization opportunities to achieve high sequential throughput. For example, Sage Weil from Inktank suggests we should turn on the RBD client cache, which is expected to increase the read/write package size and thus reduce the latency of each IO.
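
    A rough way to see why both tunings help is to count object switches and
    sharing volumes for the three configurations above. This is a
    back-of-the-envelope sketch, not a model of CRUSH: the helper names are
    hypothetical, the 1024MB span is an assumed example, and volumes are
    assumed to be spread evenly over the pools.

```shell
#!/bin/sh
# Configuration from the test described above.
volumes=40
disks=40

# Hypothetical helper: number of objects a sequential pass over a span of
# one volume touches. $1 = span in MB, $2 = object size in MB.
objects_touched() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Hypothetical helpers: disks and volumes per pool, assuming an even
# spread. $1 = number of pools.
disks_per_pool() {
    echo $(( disks / $1 ))
}
volumes_per_pool() {
    echo $(( volumes / $1 ))
}

span_mb=1024   # assumed example: a 1GB sequential run

echo "4MB objects:  $(objects_touched $span_mb 4) objects per ${span_mb}MB"    # 256
echo "32MB objects: $(objects_touched $span_mb 32) objects per ${span_mb}MB"   # 32
echo "1 pool:   $(disks_per_pool 1) disks, up to $(volumes_per_pool 1) volumes sharing each disk"    # 40, 40
echo "10 pools: $(disks_per_pool 10) disks, up to $(volumes_per_pool 10) volumes sharing each disk"  # 4, 4
```

    Under these assumptions, the larger object size cuts the number of object
    switches per gigabyte by 8x, and the 10-pool layout cuts the number of
    volumes that can interleave on any one disk from 40 to 4.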

    In summary, the default sequential IO performance of Ceph is not promising enough. Although applying some tuning best-known methods (BKM) improves performance, further study and optimization are still required. If you have any suggestions or comments on this topic, please mail me to let us know. Again, thanks to the team for their work providing the data and helping with review. In the next part, I hope we can share more about how to use SSDs for Ceph.

    BTW, I delivered the session “Is Open Source Good Enough? A Deep Study of Swift and Ceph Performance” at this month’s OpenStack conference in Hong Kong. Thanks to everyone who came to my talk, especially considering it was the last session on the last day. 🙂

  • November 19, 2013
    Ceph RBD objects placement

    Quick script to evaluate the placement of the objects contained in a RBD image.


    # USAGE
    # ./rbd-loc <pool> <image>

    if [ -z ${1} ] || [ -z ${2} ];
    echo "USAGE: ./rbd-loc <…

  • November 19, 2013
    Mixing Ceph and LVM volumes in OpenStack

    Ceph pools are defined to collocate volumes and instances in OpenStack Havana. For volumes that do not need the resilience provided by Ceph, a LVM cinder backend is defined in /etc/cinder/cinder.conf:

    [lvm]
    volume_group=cinder-volumes
    volume_driver=cinder.volume.drivers.lvm.LVMISCSIDriver
    volume_backend_name=LVM

    and appended to the list … Continue reading

  • November 18, 2013
    Creating a Ceph OSD from a designated disk partition

    When a new Ceph OSD is set up with ceph-disk on a designated disk partition (say /dev/sdc3), it will not be prepared and the sgdisk command must be run manually:

    # osd_uuid=$(uuidgen)
    # partition_number=3
    # ptype_tobe=89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be
    # sgdisk --change-name="${partition_number}:ceph … Continue reading

  • November 16, 2013
    Display the default Ceph configuration

    The ceph-conf command line queries the /etc/ceph/ceph.conf file.

    # ceph-conf --lookup fsid
    571bb920-6d85-44d7-9eca-1bc114d1cd75

    The --show-config option can be used to display the config of a running daemon:

    ceph -n osd.123 --show-config

    When no name is specified, it will show the … Continue reading

  • November 13, 2013
    Migrating from ganeti to OpenStack via Ceph

    On ganeti, shut down the instance and activate its disks:

    z2-8:~# gnt-instance shutdown nerrant
    Waiting for job 1089813 for nerrant…
    z2-8:~# gnt-instance activate-disks nerrant

    On an OpenStack Havana installation using a Ceph cinder backend, create a volume with the same … Continue reading
