Planet Ceph

Aggregated news from external sources

  • May 8, 2015
    Ceph : Reduce OSD Scrub Priority

    Let’s assume that, on a nice sunny day, you receive complaints that your Ceph storage cluster is not performing as well as it did yesterday. After checking the cluster status, you find that placement group scrubbing is in progress and, depending on your scenario, you would like to decrease its priority. Here is how you can do it.

    Note : OSD disk thread I/O priority can only be changed if the disk scheduler is cfq.

    • Check the disk scheduler; if it is not cfq, you can change it to cfq dynamically.
    $ sudo cat /sys/block/sda/queue/scheduler
    noop [deadline] cfq
    $ echo cfq | sudo tee /sys/block/sda/queue/scheduler
    • Next, check the current values of the OSD disk thread I/O priority; the default values should be as shown below.
    $ sudo ceph daemon osd.0 config get osd_disk_thread_ioprio_class
    { "osd_disk_thread_ioprio_class": ""}
    $ sudo ceph daemon osd.0 config get osd_disk_thread_ioprio_priority
    { "osd_disk_thread_ioprio_priority": "-1"}
    • Reduce the OSD disk thread I/O priority by executing:
    $ sudo ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 7'
    $ sudo ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'
    • Finally, recheck the osd_disk_thread_ioprio settings:
    $ sudo ceph daemon osd.0 config get osd_disk_thread_ioprio_class
    { "osd_disk_thread_ioprio_class": "idle"}
    $ sudo ceph daemon osd.0 config get osd_disk_thread_ioprio_priority
    { "osd_disk_thread_ioprio_priority": "7"}

    This should reduce OSD scrubbing priority, which is useful for slowing down scrubbing on an OSD that is busy handling client operations. Once the coast is clear, it is a good idea to revert the changes.
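    Since the I/O priority settings only take effect under cfq, it can help to check the scheduler on every disk of an OSD host rather than just sda. The following is a minimal sketch; the function names are illustrative, and it assumes the sysfs scheduler file marks the active scheduler with square brackets, as in the output shown in the first step.

```shell
#!/bin/sh
# Print the active scheduler from a line such as "noop [deadline] cfq".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Report every block device whose active scheduler is not cfq.
# The sysfs root is a parameter (default /sys/block) so the function
# can also be pointed at a test directory tree.
report_non_cfq() {
    sys_root="${1:-/sys/block}"
    for f in "$sys_root"/*/queue/scheduler; do
        [ -r "$f" ] || continue
        sched=$(active_scheduler "$(cat "$f")")
        if [ "$sched" != "cfq" ]; then
            dev=${f#"$sys_root"/}   # strip the sysfs root
            dev=${dev%%/*}          # keep only the device name
            echo "$dev: $sched"
        fi
    done
}
```

    Calling report_non_cfq with no argument lists the devices that still need to be switched before the injectargs commands will have any effect.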

  • May 7, 2015
    Improving Ceph python scripts tests

    The Ceph command line and ceph-disk helper are python scripts for which there are integration tests. It would be useful to add unit tests and pep8 checks. It can be done by creating a python module instead … Continue reading

  • May 5, 2015
    v9.0.0 released

    This is the first development release for the Infernalis cycle, and the first Ceph release to sport a version number from the new numbering scheme. The “9” indicates this is the 9th release cycle–I (for Infernalis) is the 9th letter. The first “0” indicates this is a development release (“1” will mean release candidate and …Read more

  • May 4, 2015
    OpenVZ: Kernel 3.10 With Rbd Module

    3.X Kernel for OpenVZ is out and it is compiled with rbd module:

    root@debian:~# uname -a
    Linux debian 3.10.0-3-pve #1 SMP Thu Jun 12 13:50:49 CEST 2014 x86_64 GNU/Linux

    root@debian:~# modinfo rbd
    filename: /lib/modules/3.10.0-3-pve/kernel/drive…

  • May 4, 2015
    Ceph using Monitor key/value store

    Ceph monitors make use of leveldb to store cluster maps, users and keys.
    Since the store is present, Ceph developers thought about exposing this through the monitors interface.
    So monitors have a built-in capability that allows you to store blobs of …

  • April 27, 2015
    v0.87.2 Giant released

    This is the second (and possibly final) point release for Giant. We recommend all v0.87.x Giant users upgrade to this release. NOTABLE CHANGES ceph-objectstore-tool: only output unsupported features when incompatible (#11176 David Zafman) common: do not implicitly unlock rwlock on destruction (Federico Simoncelli) common: make wait timeout on empty queue configurable (#10818 Samuel Just) crush: …Read more

  • April 27, 2015
    Ceph: manually repair object


    Debugging scrubbing errors can be tricky and you don’t necessarily know how to proceed.

    Assuming you have a cluster state similar to this o…

  • April 25, 2015
    Ceph Loves Jumbo Frames

    Who doesn’t love a high-performing Ceph storage cluster? To get one you need to tame it: I mean not only Ceph tuning, but the network needs to be tuned as well. The quickest way to tune your network is to enable Jumbo Frames.

    What are they :

    • They are Ethernet frames with a payload of more than 1500 bytes (the standard MTU)
    • They can significantly improve network performance by making data transmission more efficient
    • They require Gigabit Ethernet
    • Most enterprise network devices support Jumbo Frames
    • Some people also call them ‘Giants’

    Enabling Jumbo Frames

    • Make sure your switch port is configured to accept Jumbo frames
    • On the server side, set your network interface MTU to 9000
    # ifconfig eth0 mtu 9000
    • Make the change permanent by updating the network interface file and restarting network services
    # echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-eth0
    • Confirm that the MTU is used between two specific devices
    # ip route get {IP-address}
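    A complementary check, not part of the original steps, is to send a ping that is not allowed to fragment: it only succeeds if every hop on the path accepts the larger frame. The sketch below assumes Linux iputils ping and IPv4; the function names are illustrative. The payload is the MTU minus 28 bytes (20 for the IPv4 header, 8 for the ICMP header).

```shell
#!/bin/sh
# Largest ICMP payload that fits in one unfragmented IPv4 packet of the
# given MTU: MTU minus 20 bytes (IPv4 header) and 8 bytes (ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Ping with fragmentation prohibited (-M do); this fails if any device
# on the path has a smaller MTU than requested. Default MTU is 9000.
check_path_mtu() {
    host="$1"
    mtu="${2:-9000}"
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$host"
}
```

    For example, check_path_mtu {IP-address} sends three 8972-byte payloads to the other node.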

    In my production Ceph cluster, I have seen improvements after enabling Jumbo Frames on both the Ceph and the OpenStack nodes.

  • April 17, 2015
    Stretching Ceph networks

    This is a quick note about Ceph networks, so do not expect anything lengthy here :).

    Usually Ceph networks are presented as cluster public and cluster private.
    However it is never mentioned that you can use a separate network for the monitors.
    This mi…

  • April 15, 2015
    Ceph Pool Migration

    You have probably already faced the need to migrate all objects from one pool to another, especially to change parameters that cannot be modified on an existing pool: for example, to migrate from a replicated pool to an EC pool, to change the EC profile, or to reduce the number of PGs…
    There are different methods, depending on the contents of the pool (RBD, objects), size…

    The simple way

    The simplest and safest method is to copy all objects with the “rados cppool” command.
    However, it requires access to the pool to be read-only during the copy.

    For example for migrating to an EC pool :

    ceph osd pool create $pool.new 4096 4096 erasure default
    rados cppool $pool $pool.new
    ceph osd pool rename $pool $pool.old
    ceph osd pool rename $pool.new $pool

    But it does not work in all cases. For example with EC pools : “error copying pool testpool => newpool: (95) Operation not supported”.

    Using Cache Tier

    This must be used with caution; test it before using it on a production cluster. It worked for my needs, but I cannot say that it works in all cases.

    I find this method interesting because it allows transparent operation, reduces downtime and avoids duplicating all the data. The principle is simple: use the cache tier, but in reverse order.

    At the beginning, we have 2 pools : the current “testpool”, and the new one, “newpool”.

    Setup cache tier

    Configure the existing pool as a cache pool :

    ceph osd tier add newpool testpool --force-nonempty
    ceph osd tier cache-mode testpool forward

    In ceph osd dump you should see something like this :

    --> pool 58 'testpool' replicated size 3 .... tier_of 80 

    Now, all new objects will be created on the new pool.

    Now we can force all objects to move to the new pool :

    rados -p testpool cache-flush-evict-all

    Switch all clients to the new pool

    (You can also do this step earlier. For example, just after the cache pool creation.)
    Until all the data has been flushed to the new pool, you need to specify an overlay so that objects are also searched for on the old pool :

    ceph osd tier set-overlay newpool testpool

    In ceph osd dump you should see something like this :

    --> pool 80 'newpool' replicated size 3 .... tiers 58 read_tier 58 write_tier 58

    With the overlay, all operations will be forwarded to the old testpool.

    Now you can switch all the clients to access objects on the new pool.


    When all the data has been migrated, you can remove the overlay and the old “cache” pool :

    ceph osd tier remove-overlay newpool
    ceph osd tier remove newpool testpool
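    The cache tier steps can be collected into a dry-run helper that only prints the commands, in the order presented above, so that they can be reviewed before anything touches the cluster. This is a sketch: the function name and its two arguments are illustrative and not part of Ceph, and as noted above the set-overlay step can also be moved earlier.

```shell
#!/bin/sh
# Hypothetical helper: prints (does not run) the cache tier migration
# commands for an old pool ($1) and a new pool ($2), in the same order
# as the walkthrough.
print_tier_migration_plan() {
    old="$1"
    new="$2"
    cat <<EOF
ceph osd tier add $new $old --force-nonempty
ceph osd tier cache-mode $old forward
rados -p $old cache-flush-evict-all
ceph osd tier set-overlay $new $old
ceph osd tier remove-overlay $new
ceph osd tier remove $new $old
EOF
}
```

    Running print_tier_migration_plan testpool newpool reproduces the six commands used in this walkthrough.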

    In-use object

    During eviction you may encounter errors such as :

    failed to evict testrbd.rbd: (16) Device or resource busy

    Listing the watchers on the object can help :

    rados -p testpool listwatchers testrbd.rbd
    watcher= client.5520194 cookie=1
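    When several objects are busy, it can be handy to pull just the client IDs out of the listwatchers output. A minimal sketch, assuming the output format shown above (a field beginning with "client."); the function name is illustrative.

```shell
#!/bin/sh
# Read "rados listwatchers" output on stdin and print only the
# client identifiers (fields that start with "client.").
watcher_clients() {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^client\./) print $i }'
}
```

    For example, piping rados -p testpool listwatchers testrbd.rbd into watcher_clients prints one client ID per watcher.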

    Using Rados Export/Import

    For this, you need to use a temporary local directory.

    rados export --create testpool tmp_dir
    [exported]     rb.0.4975.2ae8944a.000000002391
    [exported]     rb.0.4975.2ae8944a.000000004abc
    [exported]     rb.0.4975.2ae8944a.0000000018ce
    rados import tmp_dir newpool
    # Stop All IO
    # And redo a sync of modified objects
    rados export --workers 5 testpool tmp_dir
    rados import --workers 5 tmp_dir newpool
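
    After the final import, one way to sanity-check the result is to compare the object listings of both pools. The sketch below is an addition, not part of the original procedure; the function name and file names are illustrative. It takes two files with one object name per line (for instance produced with rados -p testpool ls and rados -p newpool ls) and prints the names present in the first but missing from the second.

```shell
#!/bin/sh
# Print object names listed in file $1 that do not appear in file $2.
# Both inputs are sorted first because comm requires sorted input.
missing_objects() {
    a=$(mktemp)
    b=$(mktemp)
    sort "$1" > "$a"
    sort "$2" > "$b"
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

    An empty result from missing_objects old.txt new.txt suggests that every object name in the old pool also exists in the new one; it does not compare object contents.
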
  • April 13, 2015
    v0.94.1 Hammer released

    This bug fix release fixes a few critical issues with CRUSH. The most important addresses a bug in feature bit enforcement that may prevent pre-hammer clients from communicating with the cluster during an upgrade. This only manifests in some cases (for example, when the ‘rack’ type is in use in the CRUSH map, and possibly …Read more

  • April 13, 2015
    Ceph: analyse journal write pattern

    Simple trick to analyse the write patterns applied to your Ceph journal.

    Assuming your journal device is /dev/sdb1, checking for 10 seconds:

    $ iostat -dmx /dev/sdb1 10 | awk '/[0-9]/ {print $8}'

    Now converting sectors to KiB.
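
    A minimal sketch of that conversion, assuming iostat's 512-byte sectors, so that two sectors make one KiB. The function name is illustrative, and the command the original post goes on to use may differ.

```shell
#!/bin/sh
# Convert sector counts (one per line on stdin) to KiB,
# assuming 512-byte sectors: KiB = sectors / 2.
sectors_to_kib() {
    awk '/[0-9]/ { print $1 / 2 }'
}
```

    The output of the iostat pipeline above can be piped into sectors_to_kib; a value of 16 sectors, for instance, becomes 8 KiB.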