Planet Ceph

Aggregated news from external sources

  • December 5, 2013
    Ceph Installation :: Part-2

    CEPH Storage Cluster :: Installing Ceph-Deploy ( ceph-mon1 ). Update your repository and install ceph-deploy on the ceph-mon1 node. [ceph@ceph-mon1 ~]$ sudo yum update && sudo yum install ceph-deploy Loaded plugins: downloadonly, fastestmirror, security Load…

  • December 5, 2013
    Ceph Installation :: Part-1
    Ceph Installation Step by Step

    This quick start setup helps to deploy Ceph with 3 monitors and 2 OSD nodes, with 4 OSDs on each node. We are using commodity hardware running CentOS 6.4.

    Ceph-mon1 : First Monitor + Ceph-deploy machine (will be used to deploy ceph to other nodes )

    Ceph-mon2 : Second Monitor ( for monitor quorum )

    Ceph-mon3 : Third Monitor ( for monitor quorum )

    Ceph-node1 : OSD node 1 with 10G x 1 for OS , 440G x 4 for 4 OSDs

    Ceph-node2 : OSD node 2 with 10G x 1 for OS , 440G x 4 for 4 OSDs

    Ceph-Deploy Version is 1.3.2 , Ceph Version 0.67.4 ( Dumpling )

    Preflight Checklist 

    All the Ceph Nodes may require some basic configuration work prior to deploying a Ceph Storage Cluster.

    CEPH node setup

    • Create a user on each Ceph Node.
    sudo useradd -d /home/ceph -m ceph
    sudo passwd ceph
    • Add root privileges for the user on each Ceph Node ( append a sudoers drop-in rather than overwriting /etc/sudoers ).
    echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph
    sudo chmod 0440 /etc/sudoers.d/ceph
    • Configure your ceph-deploy node ( ceph-mon1 ) with password-less SSH access to each Ceph Node. Leave the passphrase empty and repeat this step for the ceph and root users.
    [ceph@ceph-admin ~]$ ssh-keygen
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/ceph/.ssh/id_rsa):
    Created directory '/home/ceph/.ssh'.
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/ceph/.ssh/id_rsa.
    Your public key has been saved in /home/ceph/.ssh/id_rsa.pub.
    The key fingerprint is:
    The key's randomart image is:

    • Copy the key to each Ceph Node. ( Repeat this step for ceph and root users )
    [ceph@ceph-mon1 ~]$ ssh-copy-id ceph@ceph-node2
    The authenticity of host 'ceph-node2 (' can't be established.
    RSA key fingerprint is ac:31:6f:e7:bb:ed:f1:18:9e:6e:42:cc:48:74:8e:7b.
    Are you sure you want to continue connecting (yes/no)? y
    Please type 'yes' or 'no': yes
    Warning: Permanently added 'ceph-node2,' (RSA) to the list of known hosts.
    ceph@ceph-node2's password:
    Now try logging into the machine, with "ssh 'ceph@ceph-node2'", and check in: .ssh/authorized_keys
    to make sure we haven't added extra keys that you weren't expecting.
    [ceph@ceph-mon1 ~]$
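    A small ~/.ssh/config on the deploy node also saves retyping the user on every hop. A minimal sketch for this lab ( hostnames from the layout above ):

    ```shell
    # ~/.ssh/config on ceph-mon1 -- lets ssh and ceph-deploy reach every
    # node as the "ceph" user without specifying it each time.
    Host ceph-mon2 ceph-mon3 ceph-node1 ceph-node2
        User ceph
    ```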
    • Ensure connectivity by pinging the other nodes by hostname. For convenience we have used the local hosts file: update the hosts file of every node with the details of the other nodes. PS : Use of DNS is recommended.
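    If you go the local hosts file route as we did, the entries look something like this ( the addresses below are made up for illustration; substitute your own ):

    ```shell
    # Append to /etc/hosts on every node (example addresses, not real ones)
    192.168.1.101   ceph-mon1
    192.168.1.102   ceph-mon2
    192.168.1.103   ceph-mon3
    192.168.1.111   ceph-node1
    192.168.1.112   ceph-node2
    ```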
    • Packages are cryptographically signed with the release.asc key. Add the release key to your system’s list of trusted keys to avoid a security warning:
    sudo rpm --import ';a=blob_plain;f=keys/release.asc'
    • Ceph may require additional third party libraries. To add the EPEL repository, execute the following:
    su -c 'rpm -Uvh'
    sudo yum install snappy leveldb gdisk python-argparse gperftools-libs
    • Install the release packages. Dumpling is the most recent stable release of Ceph ( at the time of writing this wiki ).
    su -c 'rpm -Uvh'
    • Adding Ceph to YUM : create a repository file for Ceph at /etc/yum.repos.d/ceph.repo
    [ceph]
    name=Ceph packages for $basearch

    [ceph-noarch]
    name=Ceph noarch packages

    [ceph-source]
    name=Ceph source packages
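    For reference, a complete ceph.repo for Dumpling on el6 looks roughly like the following. The baseurl and gpgkey values here are my recollection of the ceph.com layout at the time, so verify them against the official docs before use:

    ```shell
    # /etc/yum.repos.d/ceph.repo -- URLs assumed, check against ceph.com
    [ceph]
    name=Ceph packages for $basearch
    baseurl=http://ceph.com/rpm-dumpling/el6/$basearch
    enabled=1
    gpgcheck=1
    gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc

    [ceph-noarch]
    name=Ceph noarch packages
    baseurl=http://ceph.com/rpm-dumpling/el6/noarch
    enabled=1
    gpgcheck=1
    gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc

    [ceph-source]
    name=Ceph source packages
    baseurl=http://ceph.com/rpm-dumpling/el6/SRPMS
    enabled=0
    gpgcheck=1
    gpgkey=https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc
    ```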
    • For best results, create directories on your nodes for holding the configuration generated by Ceph. These should get auto-created by Ceph, however in my case it gave me problems, so I am creating them manually.
    mkdir -p /etc/ceph /var/lib/ceph/{tmp,mon,mds,bootstrap-osd} /var/log/ceph
    • By default, daemons bind to ports within the 6800:7100 range. You may configure this range at your discretion. Before configuring your IP tables, check the default iptables configuration. Since we are performing a test deployment we can disable iptables on the Ceph nodes. Before moving to production this needs to be attended to.
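    On a CentOS 6 test box, disabling iptables is two commands; for production, opening the monitor port and the OSD range is the better-behaved sketch ( 6789 as the default monitor port is an assumption of mine, not stated above ):

    ```shell
    # Test deployments only -- do not do this in production.
    sudo service iptables stop
    sudo chkconfig iptables off

    # Production sketch: open the monitor port and the OSD port range.
    sudo iptables -A INPUT -p tcp --dport 6789 -j ACCEPT
    sudo iptables -A INPUT -p tcp --match multiport --dports 6800:7100 -j ACCEPT
    sudo service iptables save
    ```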

    Please Follow Ceph Installation :: Part-2 for next step in installation

  • December 5, 2013
    Ceph Storage :: Introduction

    What is CEPH

    Ceph is an open-source, massively scalable, software-defined storage system which provides object, block and file system storage from a single clustered platform. Ceph’s main goal is to be completely distributed without a single point of failure, scalable to the exabyte level, and freely available. The data is replicated, making it fault tolerant. Ceph runs on commodity hardware. The system is designed to be self-healing, self-managing and self awesome 🙂

    CEPH Internals

    • OSD: An Object Storage Daemon (OSD) stores data, handles data replication, recovery, backfilling and rebalancing, and provides some monitoring information to Ceph Monitors by checking other Ceph OSD Daemons for a heartbeat. A Ceph Storage Cluster requires at least two Ceph OSD Daemons to achieve an active + clean state when the cluster makes two copies of your data.
    • Monitor: A Ceph Monitor maintains maps of the cluster state, including the monitor map, the OSD map, the Placement Group (PG) map, and the CRUSH map. Ceph maintains a history (called an “epoch”) of each state change in the Monitors, Ceph OSD Daemons, and PGs.
    • MDS: A Ceph Metadata Server (MDS) stores metadata on behalf of the Ceph Filesystem . Ceph Metadata Servers make it feasible for POSIX file system users to execute basic commands like ls, find, etc. without placing an enormous burden on the Ceph Storage Cluster.
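    Once a cluster is up, each of the maps described above can be dumped from the standard ceph CLI, for example:

    ```shell
    ceph mon dump   # monitor map: fsid, monitor addresses, current epoch
    ceph osd dump   # OSD map: pools, replication sizes, osd up/in state
    ceph pg dump    # PG map: per-placement-group state and statistics
    ```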
    Note :: Please use the official Inktank and Ceph community resources as your primary source of information on Ceph. This entire blog is an attempt to help beginners in setting up a Ceph cluster and to share my troubleshooting with you.

  • December 2, 2013
    Ceph and Swift: Why we are not fighting.

    I have just come back from the OpenStack summit in Hong Kong. As always it was a blast talking to lot of people and listening to presentations or designing the future of the software we all love. While chatting with different people there was a recurrent question coming up to me: people wanted to know whether “Ceph is better than… Read more →

  • December 1, 2013
    Ceph performance: interesting things going on

    The Ceph developer summit is already behind us and wow! so many good things are around the corner!
    During this online event, we discussed the future of the Firefly release (planned for February 2014).
    During the last OpenStack summit in Hong Kong, I …

  • November 27, 2013
    Quick update Ceph: from Argonaut to Cuttlefish

    Memory leaks disappeared and CPU load dramatically reduced. Yay!

    The upgrade started during the week 39.

    The first graph shows the amount of RAM used before and after the Ceph upgrade.
    As you might know, there were numerous memory leaks…

  • November 26, 2013

    Thanks to the hard work of the puppet-openstack
    community, Puppet was the preferred method of deployment for Openstack
    in the latest Openstack User Survey.

    If you’d like to join in on the fun and contribute, read on !
    First things first, a bit of context:

    • Openstack is a modular cloud orchestration platform,
      self-described as “Open source software for building private and
      public clouds”.
    • puppet-openstack is a Stackforge project that centralizes the
      development of puppet modules related to Openstack. puppet-openstack
      is also an actual module allowing the installation and
      configuration of core Openstack services.
    • Stackforge is used to host Openstack-related projects so that they
      can benefit from the same continuous integration infrastructure and
      review system that the main Openstack projects use such as Nova.

    Now that we have the basics out of the way, if you’d like to contribute
    to Openstack in general, it’s not mandatory to have any programming or
    networking knowledge. There are always things like documentation and
    translation that need manpower.

    For contributing to puppet-openstack in particular, however, it is
    required to be (or become!) familiar with ruby, puppet,
    puppet-rspec and of course, Openstack.

    The contribution process for puppet-openstack is slightly different from
    committing code to primary Openstack projects (such as Nova) and I won’t
    be highlighting the differences here for the sake of simplicity – this is
    a topic for another blog post !

    I recently started contributing as part of the
    new puppet-ceph initiative so this blog post more or less describes
    what I had to go through to get my first contribution in.

    Okay, sign me up.

    If you want to join in on the fun, the basic instructions for signing up
    are pretty well documented on the Openstack

    In a nutshell:

    Getting started

    Let’s say I want to develop for puppet-ceph (!), I’ll keep these
    resources handy:

    • The Launchpad project for bugs/issues/fixes/feature/backlog
      documentation and
      discussion: (each project
      has its own launchpad project)
    • The developer documentation will prove useful to prepare your
      development environment and beyond. For puppet modules,
      documentation is provided both on the Openstack
      and directly in the README files.

    Clone the project

    You’re going to need the puppet module source to work on it. You can
    either clone it from Github:

    git clone

    or from Gerrit:

    git clone

    Make sure you have ruby, rubygems and bundle installed

    First of all, you’ll need ruby and bundle to manage ruby
    packages (gems).
    These will be required, especially when the time will come to do
    spec/integration/lint tests.

    If you already have them you can skip this part !

    On Ubuntu:

    apt-get install ruby rubygems ruby-bundler

    On Debian:

    apt-get install ruby rubygems bundler

    Install development dependencies

    With the help of bundle, fetch and install the gem dependencies
    documented in the Gemfile located at the root of the repository.

    bundle install

    Create your branch and do your stuff

    Create a branch with a name relevant to what you’re doing

    git checkout -b feature/my_feature

    Now you can do your modifications.
    Don’t forget to add new spec tests or modify existing ones to match the
    modifications you made to the module.

    Test your stuff

    You’ve added or modified some code, now you want to test it:

    Test for puppet syntax (puppet-lint):

    bundle exec rake lint

    Run spec tests (puppet-rspec)

    bundle exec rake spec

    If you try to push code that doesn’t pass the tests, jenkins will not
    let you through – better make sure everything is okay before sending
    something for review!

    Tests are successful ? Add and commit your stuff

    git add [file]
    git commit

    Make sure your commit message follows the right format !

    Send your stuff for review

    git review

    That’s it ! Your code was sent to gerrit for review by the community
    and the core reviewers !

    Jenkins or someone -1’d my code. Help !

    Maybe you did a typo or something far worse you’d like to fix – this is
    done by submitting another patch set.

    Do the changes you want to do and add the files again, but instead of
    using 'git commit', use 'git commit --amend'.
    This will essentially modify the initial commit.

    After amending your commit, send the code back for a new review with
    'git review' once more.

  • November 26, 2013
    Back from the summit: Ceph/OpenStack integration

    The summit was exciting and full of good things and announcements. We had great Cinder sessions and an amazing Ceph/OpenStack integration session. I’ve led the Ceph/OpenStack integration session with Josh Durgin (Inktank). We had a good participation from the audience. I would like to specially thank Sage Weil, Haomai Wang, Edward Hope-Morley for their good inputs. The main purpose of… Read more →

  • November 25, 2013
    Ceph: find who’s mapping a RBD device

    Curious? Wanna know who has a RBD device mapped?

    Important note: this method only works with the Emperor version of Ceph and above.

    Grab the image information:

    $ rbd info boot
    rbd image 'boot':
    size 10240 MB in 2560 ob…
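    The rest of the trick boils down to asking the cluster who is watching the image’s header object. Sketched from memory for a format-1 image named boot in the default rbd pool ( the header object naming differs for format 2 ):

    ```shell
    # Format-1 images keep their header in an object named "<image>.rbd"
    # in the same pool as the image.
    rbd info boot
    # The watcher entry printed here shows the address of the client
    # that currently has the device mapped.
    rados -p rbd listwatchers boot.rbd
    ```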

  • November 22, 2013
    Map/unmap RBD device on boot/shutdown

    Quick how-to on mapping/unmapping a RBD device during startup and shutdown.

    We are going to use an init script provided by the ceph package.
    During the boot sequence, the init script first looks at /etc/rbdmap and will map devices accordingly.
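    An /etc/rbdmap entry is one image per line. A made-up example ( pool, image and keyring path are placeholders, not from the post ):

    ```shell
    # /etc/rbdmap
    # Format: poolname/imagename  id=client,keyring=/path/to/keyring
    rbd/boot  id=admin,keyring=/etc/ceph/ceph.client.admin.keyring
    ```

    On a sysvinit system the script is then enabled with chkconfig rbdmap on.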

  • November 21, 2013
    Manage a multi-datacenter crush map with the command line

    A new datacenter is added to the crush map of a Ceph cluster:
    # ceph osd crush add-bucket fsf datacenter
    added bucket fsf type datacenter to crush map
    # ceph osd crush move fsf root=default
    moved item id -13 name … Continue reading
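    After bucket moves like these, the resulting hierarchy is easy to sanity-check with the standard CLI:

    ```shell
    # Show the CRUSH hierarchy (roots, datacenters, hosts, osds) and weights
    ceph osd tree
    # Or decompile the full map to readable text for inspection
    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    ```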

  • November 21, 2013
    Measure Ceph RBD performance in a quantitative way (part II)

    This is the 2nd post about Ceph RBD performance. In part 1 we talked about random IO performance on Ceph; this time we share the sequential read/write testing data. In case you forgot our hardware configuration, we use 40x 1TB SATA disks for data plus 12 SSDs as journal, and 4x 10Gb links connect the storage cluster with the clients, which provides enough network bandwidth. The figures below show the SR and SW performance with QD=64 and CAP=60MB/s per VM. As the number of volumes/VMs increases, the per-VM throughput drops gradually. The SR max total throughput of 2759MB/sec happens at VM=80 and the SW peak total throughput of 1487MB/sec happens at VM=50. However, considering our pre-defined QoS requirement (per-VM throughput larger than 90% of the pre-defined target), we pick VM=40 for SR and VM=30 for SW, which results in reported metrics of 2263MB/sec (SR) and 1197MB/sec (SW).

    Is the result good enough? With a similar approach to the one we used for random IO, we measure the native disk performance as the reference. We observe ~160MB/sec sequential bandwidth per disk for both read and write, and we measure ~900MB/sec for a single 10Gb NIC. In theory, the 40 SATA disks are expected to deliver 6400MB/sec for read and 3200MB/sec for write (replica=2), and 4x 10Gb can deliver ~3600MB/sec of bandwidth. Thus the final Ceph efficiency is 57% for SR and 37% for SW, as in the table below. Compared to the random IO testing result, this is not a perfect result.

    Let’s take a look at the Ceph architecture to understand the data better. The figure below illustrates a conceptual Ceph cluster which has M disks, each mapped with N-1 objects. The size of a volume is marked as Volume_Size. Assuming the object size is Size_O, each virtual disk volume is composed of Volume_Size/Size_O objects. To simplify the problem, some components (e.g. PG and replica impact) are ignored on purpose. The IO requests from virtual disk volumes are distributed to different objects based on the CRUSH algorithm and become the real reads/writes hitting the disks. Because several objects map to the same physical disks, the originally sequential logical IO streams mix together (green, orange, blue and red blocks), and the real IO pattern on each physical disk becomes random, with disk seeks happening. As a result, latency becomes much longer and total throughput drops a lot.

    The blktrace results prove our assumptions. We collected ~37K IO traces in two experiments. In the left figure, we run 40 VMs, all generating fully sequential reads. In the right figure, we run 20 sequential-IO VMs and 20 random-IO VMs at the same time. Even in the all-sequential-IO case, 26% of IOs are non-adjacent – which means seeks happen. When half the load is random IO, the seek-requiring IO ratio increases to 59%. In a real production environment, we believe the random IO stream ratio would be higher, which is expected to impact sequential IO stream performance even more.

    Below figure shows the per-VM BW and latency analysis for sequential read/write pattern under different FIO queue size (QD) and volume/VM number. There are several findings:

    • Read performance is better than write – because a write generates twice the physical IO compared to a read.
    • The SSD journal doesn’t bring the same benefit we observed in the random IO tests. We believe this is because sequential IO has a much higher bandwidth throughput, which fills the cache space very quickly.
    • For the QD=8 cases, the read latency starts at 4ms and ends at 15ms; the write latency starts at 9ms and ends at 28ms. Further study shows the physical IO pattern becomes more random with a higher VM number, which is reasonable.
    • For the QD=64 cases, the read latency starts at 17ms and ends at 116ms; the write latency starts at 36ms and ends at 247ms. The larger starting latency is abnormal because at this point the storage cluster is far from fully loaded. Latency breakdown tests tell us most of the latency comes from the client side, which shows a potential optimization opportunity.
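    For the curious, one of these test points can be approximated with fio. A sketch under assumptions – the post does not give the exact job file, and the block size and device path here are made up:

    ```shell
    # Sequential read at QD=64 against a mapped RBD device (hypothetical
    # /dev/rbd0 and 64k block size; adjust to match your own setup).
    fio --name=seqread --filename=/dev/rbd0 --rw=read --bs=64k \
        --ioengine=libaio --direct=1 --iodepth=64 \
        --runtime=300 --time_based
    ```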

    We believe the low sequential IO performance issue is a challenge not only for Ceph, but for all other distributed storage systems with a similar design. Per our understanding, there are potentially two general ways to improve sequential IO performance: make random IO run faster, or optimize the IO pattern to increase the sequential IO percentage.

    • Firstly, let’s look at the possibility of reducing the random IO latency. One way is to optimize the existing software code. Remember we observed 36ms latency for SW with only one VM at QD=64? That latency is extremely high considering the storage cluster is far from fully loaded at that moment. We believe the extra latency comes from software issues (locking, single-thread queues etc.). By removing those software bottlenecks, we should be able to shorten the IO latency and improve the total throughput. The other way is to speed up the file store part, either by adding more memory or by using SSD as a write-back/through cache. The 2nd option seems more interesting because of the cost advantage compared to DRAM, and usually the journal may not use all the SSD space. However, this needs more experiments to understand the bandwidth requirement if we use the same SSD for journal and cache. There are some related blueprints talking about this – like the Cache Tier (although it uses the SSD cache at a different level, per my understanding) and new NAND interface support etc.
    • The 2nd way is to change the IO pattern so there is a higher adjacent-IO percentage with fewer seeks. There are two cases where a logical sequential IO stream is interrupted. The 1st is when the logical address within the same volume increases and the mapping moves from one object to another (either on the same physical storage node or a different one). The 2nd is when it is interrupted by other reads/writes to the same hard disk but from a different volume. Accordingly, we can try two tuning methods. The 1st is to use a bigger object size – the possibility of jumping to a different object/node becomes smaller. The 2nd is to use better mapping rules. By default, we put all the disks into the same pool, so the virtual address space of one virtual volume is distributed across all the physical disks. In this case, many IO streams from different volumes hit the same hard disk, resulting in a more fragmented access pattern. The better way is to create multiple pools, each with a limited number of disks serving fewer volumes. Then fewer IO streams share the same hard disks, resulting in a better sequential pattern. The figure below demonstrates the two configurations. The idea is from the best paper “Copysets: Reducing the Frequency of Data Loss in Cloud Storage” at ATC 2013. The authors’ original goal was to reduce the impact of disk loss, which is also valid for the Ceph case. We suggest keeping this as a “best practice” and adding some function to Ceph to ease the administrator’s work.
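    The multiple-pool idea can be sketched with the CRUSH CLI. Everything below is illustrative – bucket, rule and pool names, weights and the ruleset id are all made up, and the exact syntax should be double-checked against your Ceph release:

    ```shell
    # Group 4 of the OSDs under a dedicated CRUSH bucket, build a rule
    # rooted there, and point a new pool at that rule, so fewer IO
    # streams share each disk. Repeat per group of disks.
    ceph osd crush add-bucket group0 host
    ceph osd crush set osd.0 1.0 root=default host=group0   # likewise osd.1..osd.3
    ceph osd crush rule create-simple group0-rule group0 osd
    ceph osd pool create pool0 128 128
    ceph osd pool set pool0 crush_ruleset 1   # the ruleset id reported above
    ```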

    We did some tests to verify the two tuning options, as in the figure below. The left bar (4MB object) is the default configuration with 4MB object size and one pool for all disks. The 2nd bar (32MB object) is the 32MB object size with one pool for all disks. And the rightmost bar (32MB object + new mapping rule) is the 32MB object size with 10 pools of 4 disks each. With 40 VMs/volumes and the same pressure load, the average per-VM bandwidth increases from 43MB/sec to 57MB/sec (33% gain) and then 64MB/sec (13% further gain). This seems to be a pretty good start. As a next step, we will continue to try different tuning parameters to understand the tradeoffs and identify optimization opportunities to achieve a high sequential throughput. For example, Sage Weil from Inktank suggested we should turn on the RBD client cache, which is expected to increase the read/write request size and thus reduce the latency of each IO.

    In summary, the default sequential IO performance of Ceph is not promising enough. Although applying some tuning BKMs makes the performance better, further study and optimization are still required. If you have any suggestions or comments on this topic, please mail me ( to let us know. Again, thanks for the team’s work to provide the data and help review. In the next part, I hope we can share more about how to use SSDs with Ceph.

    BTW, I delivered the session “Is Open Source Good Enough? A Deep Study of Swift and Ceph Performance” at this month’s Hong Kong OpenStack conference (link). Thanks to all the guys who came to my talk, especially considering it was the last session on the last day. 🙂
