Planet Ceph

Aggregated news from external sources

  • February 2, 2015
    OpenStack and Ceph: RBD discard

    {% img center http://sebastien-han.fr/images/openstack-ceph-discard.jpg OpenStack and Ceph: RBD space reclamation %}

    Only Magic card players might recognize that post picture 🙂 (if you’re interested)

    I have been waiting for this for quite a while …

  • January 29, 2015
    Ceph: recover a RBD image from a dead cluster

    Many years ago I came across a script made by Shawn Moore and Rodney Rymer from Catawba University.
    The purpose of this tool is to reconstruct an RBD image from its objects.
    Imagine your cluster is dead: all the monitors got wiped out and you don’t have a backup (I know, what could possibly happen?).
    However, all your objects remain intact.

    I’ve always wanted to blog about this tool, simply to advocate it and make sure that people can use it.
    Hopefully this will be good publicity for the tool :-).

    Backing up RBD images

    Before we dive into the recovery process, I’d like to take a few lines to describe what is important to back up and how to back it up (a rough backup sketch follows the list below):

    • Keep track of all the images across all the pools
    • Store their properties (shown by rbd info <pool>/<image>)
    • Store the RBD headers
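
    Here is a rough sketch of such a backup loop (my addition, not from the original post). It assumes a single pool called rbd and format 1 images, whose header lives in the <image>.rbd object; adapt the pool list and output paths to your setup:

    ```bash
    # Minimal backup sketch, assuming one pool named "rbd" and format 1 images only
    for pool in rbd; do
        for image in $(rbd -p "$pool" ls); do
            # Keep track of the image and its properties
            rbd info "$pool/$image" > "backup_${pool}_${image}.info"
            # Store the RBD header object (format 1 images use "<image>.rbd")
            rados -p "$pool" get "${image}.rbd" "backup_${pool}_${image}.header"
        done
    done
    ```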

    Recover

    In the context of this exercise I will simply:

    • Create an RBD image
    • Map it on a machine
    • Put an XFS filesystem on top of it
    • Touch a simple file

    ```bash
    $ rbd create -s 10240 leseb
    $ rbd info leseb
    rbd image 'leseb':
    size 10240 MB in 2560 objects
    order 22 (4096 kB objects)
    block_name_prefix: rb.0.1066.74b0dc51
    format: 1

    $ sudo rbd -p rbd map leseb
    /dev/rbd0

    $ sudo rbd showmapped
    id pool image snap device
    0 rbd leseb - /dev/rbd0

    $ sudo mkfs.xfs /dev/rbd0
    log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
    log stripe unit adjusted to 32KiB
    meta-data=/dev/rbd0              isize=256    agcount=17, agsize=162816 blks
             =                       sectsz=512   attr=2, projid32bit=0
    data     =                       bsize=4096   blocks=2621440, imaxpct=25
             =                       sunit=1024   swidth=1024 blks
    naming   =version 2              bsize=4096   ascii-ci=0
    log      =internal log           bsize=4096   blocks=2560, version=2
             =                       sectsz=512   sunit=8 blks, lazy-count=1
    realtime =none                   extsz=4096   blocks=0, rtextents=0

    $ sudo mount /dev/rbd0 /mnt
    $ echo "foo" > /mnt/bar
    $ sudo umount /mnt
    $ sudo rbd unmap /dev/rbd0
    ```
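
    Note (my addition): the two pieces of information you will need for the restore, the object name prefix and the image size in bytes, are already visible in the rbd info output above, so it is worth recording them right away:

    ```bash
    # Record the object prefix and the exact image size in bytes for the restore step
    rbd info leseb | awk '/block_name_prefix/ { print $2 }'    # rb.0.1066.74b0dc51
    echo $((10240 * 1024 * 1024))                              # 10240 MB = 10737418240 bytes
    ```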

    Prepare a directory on a server to restore your image:

    ```bash
    $ mkdir recover_leseb
    $ wget -O rbd_restore https://raw.githubusercontent.com/smmoore/ceph/master/rbd_restore.sh
    $ chmod +x rbd_restore
    ```

    Then I need to collect all the RBD object files. On my setup I only had one OSD server, which made the gathering easier:

    ```bash
    $ cd recover_leseb
    ~/recover_leseb$ for block in $(find /var/lib/ceph/osd/ -type f -name 'rb.0.1066.74b0dc51.*'); do cp $block . ; done
    ~/recover_leseb$ bash rbd_restore leseb rb.0.1066.74b0dc51 10737418240
    ~/recover_leseb$ file leseb
    leseb: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
    ~/recover_leseb$ du -h leseb
    11M leseb
    ```
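
    For the curious, the idea behind the restore script can be sketched as follows (my illustration, not the script itself). For a format 1 image with order 22, each object rb.0.1066.74b0dc51.<hex index> holds one 4 MB extent, so it just needs to be written back into a sparse file at offset index * 4 MB:

    ```bash
    # Rough sketch of the reassembly logic, assuming format 1 and 4 MB (order 22) objects
    prefix=rb.0.1066.74b0dc51
    image_size=10737418240
    obj_size=$((4 * 1024 * 1024))
    truncate -s "$image_size" leseb_sketch             # sparse destination of the original size
    for obj in ${prefix}.*; do
        rest=${obj#${prefix}.}                         # drop the prefix, keep "<hex index>..."
        idx=$((16#${rest:0:12}))                       # first 12 hex digits = extent index
        # write the object back at byte offset idx * obj_size
        dd if="$obj" of=leseb_sketch bs="$obj_size" seek="$idx" conv=notrunc 2>/dev/null
    done
    ```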

    Hmm, looks like we have something interesting here 🙂
    Let’s see if it really worked:

    ```bash
    ~/recover_leseb$ losetup -f
    /dev/loop0

    ~/recover_leseb$ losetup /dev/loop0 leseb
    ~/recover_leseb$ mount /dev/loop0 /mnt/
    ~/recover_leseb$ df -h /mnt
    Filesystem Size Used Avail Use% Mounted on
    /dev/loop0 10G 33M 10G 1% /mnt

    ~/recover_leseb$ ls /mnt/
    bar
    ~/recover_leseb$ cat /mnt/bar
    foo
    ```

    HELL YEAH!

  • January 29, 2015
    Multiple Clusters on the Same Hardware: OSD Isolation With LXC

    Ceph makes it easy to create multiple clusters on the same hardware thanks to cluster naming. If you want better isolation you can use LXC, for example to run a different version of Ceph in each cluster.

    For this you will need access to…

  • January 27, 2015
    Replace Apache by Civetweb on the RadosGW

    Since Firefly you can try out the lightweight embedded web server Civetweb instead of Apache.
    Activating it is very simple: there is nothing extra to install, just add this line to your ceph.conf:

    [client.radosgw.gateway]
    rgw frontends…
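
    The excerpt is cut off above; for reference, a typical civetweb frontend line looks like the following (the port number is illustrative, not taken from the truncated post):

    [client.radosgw.gateway]
    rgw frontends = "civetweb port=7480"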

  • January 25, 2015
    Ceph and KRBD discard

    {% img center http://sebastien-han.fr/images/ceph-krbd-discard-support.jpg Ceph and KRBD discard %}

    Space reclamation mechanism for the Kernel RBD module.
    Having this kind of support is really crucial for operators and eases your capacity planning.
    RBD images are sparse, thus their size right after creation is 0 MB.
    The main issue with sparse images is that they grow to eventually reach their full provisioned size.
    The thing is, Ceph doesn’t know anything about what is happening on top of that block device, especially if you have a filesystem.
    You can easily fill the entire filesystem and then delete everything; Ceph will still believe that the block is fully used and will keep reporting that usage.
    However, thanks to discard support on the block device, the filesystem can send discard requests down to the block layer.
    In the end, the storage will free up the unused blocks.

    This feature was added in kernel 3.18.

    Let’s create an RBD image:

    ```bash
    $ rbd create -s 10240 leseb
    $ rbd info leseb
    rbd image 'leseb':
    size 10240 MB in 2560 objects
    order 22 (4096 kB objects)
    block_name_prefix: rb.0.1066.74b0dc51
    format: 1

    $ rbd diff rbd/leseb | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
    0 MB
    ```
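
    A quick note on this one-liner (my addition): rbd diff prints one line per allocated extent with its offset, length and type, so the awk expression simply sums the lengths (column 2) and converts bytes to MB. Spelled out a bit more explicitly:

    ```bash
    # Same idea as the one-liner above, with the columns spelled out
    # (assumed rbd diff columns: Offset Length Type)
    rbd diff rbd/leseb | awk '
        { used_bytes += $2 }                                   # $2 is the extent length in bytes
        END { printf "%.3f MB\n", used_bytes / 1024 / 1024 }
    '
    ```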

    Map it to a host and put a filesystem on top of it:

    ```bash
    $ sudo rbd -p rbd map leseb
    /dev/rbd0

    $ sudo rbd showmapped
    id pool image snap device
    0 rbd leseb - /dev/rbd0

    $ sudo mkfs.xfs /dev/rbd0
    log stripe unit (4194304 bytes) is too large (maximum is 256KiB)
    log stripe unit adjusted to 32KiB
    meta-data=/dev/rbd0              isize=256    agcount=17, agsize=162816 blks
             =                       sectsz=512   attr=2, projid32bit=0
    data     =                       bsize=4096   blocks=2621440, imaxpct=25
             =                       sunit=1024   swidth=1024 blks
    naming   =version 2              bsize=4096   ascii-ci=0
    log      =internal log           bsize=4096   blocks=2560, version=2
             =                       sectsz=512   sunit=8 blks, lazy-count=1
    realtime =none                   extsz=4096   blocks=0, rtextents=0

    $ sudo mount /dev/rbd0 /mnt
    ```

    Ok we are all set now, so let’s write some data:

    ```bash
    $ dd if=/dev/zero of=/mnt/leseb bs=1M count=128
    128+0 records in
    128+0 records out
    134217728 bytes (134 MB) copied, 2.88215 s, 46.6 MB/s

    $ df -h /mnt/
    Filesystem Size Used Avail Use% Mounted on
    /dev/rbd0 10G 161M 9.9G 2% /mnt
    ```

    Then we check the size of the image again:

    ```bash
    $ rbd diff rbd/leseb | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
    142.406 MB
    ```

    We now have 128 MB of data and roughly 14.4 MB of filesystem data/metadata.
    Let’s check that discard is properly enabled on the device:

    ```bash
    root@ceph-mon0:~# cat /sys/block/rbd0/queue/discard_*
    4194304
    4194304
    1
    ```
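
    For reference (my addition), the shell expands discard_* alphabetically, so assuming the standard block-queue sysfs attributes the three values above map to:

    ```bash
    cat /sys/block/rbd0/queue/discard_granularity   # 4194304: smallest discardable unit, here the 4 MB RBD object size
    cat /sys/block/rbd0/queue/discard_max_bytes     # 4194304: largest single discard request the device accepts
    cat /sys/block/rbd0/queue/discard_zeroes_data   # 1: discarded ranges read back as zeroes
    ```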

    Now let’s check the default behaviour when discard is not triggered: we delete our 128 MB file, which frees up space on the filesystem.
    Unfortunately Ceph doesn’t notice anything and still believes that those 128 MB of data are there.

    ```bash
    $ rm /mnt/leseb
    $ rbd diff rbd/leseb | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
    142.406 MB
    ```

    Now let’s run the fstrim command on the mounted filesystem to instruct the block device to free up unused space:

    ```bash
    $ fstrim /mnt/
    $ rbd diff rbd/leseb | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
    10.6406 MB
    ```

    Et voilà ! Ceph freed up our 128 MB.

    If you want to discard blocks on the fly and let the filesystem issue discards continuously, you can mount the filesystem with the discard option:

    ```bash
    $ mount -o discard /dev/rbd0 /mnt/

    $ mount | grep rbd
    /dev/rbd0 on /mnt type xfs (rw,discard)
    ```

    Note that using the discard mount option can be a real performance killer, so you generally want to trigger the fstrim command through a daily cron job instead.
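
    As an illustration (my addition; the schedule, mount point and log path are arbitrary), such a cron entry could look like this:

    ```bash
    # /etc/cron.d/fstrim-rbd (illustrative): trim the RBD-backed filesystem daily at 03:00
    0 3 * * * root /sbin/fstrim /mnt >> /var/log/fstrim-rbd.log 2>&1
    ```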

  • January 24, 2015
    Ceph Meetup Helsinki : 22 Jan 2015

    Ceph Meetup Helsinki, Finland, 22nd Jan 2015. It has been a good start to 2015. We, the geeks of the “Helsinki Metropolitan Area”, express our sincere thanks to the Red Hat folks for arranging an unofficial “Ceph Day” s…

  • January 23, 2015
    OpenStack Cinder with Ceph under the hood

    {% img center http://sebastien-han.fr/images/cinder-ceph-under-the-hood.jpg OpenStack Cinder with Ceph under the hood %}

    What’s happening under the hood while playing with Cinder and Ceph?
    The answer, in a table :-).

    ACTION …

  • January 16, 2015
    Ceph reset perf counter

    {% img center http://sebastien-han.fr/images/ceph-reset-perf-counter.jpg Ceph reset perf counter %}

    OSD performance counters tend to stack up and sometimes the value shown is not really representative of the current environment.
    Thus it is quite usefu…

  • January 14, 2015
    v0.91 released

    We are quickly approaching the Hammer feature freeze but have a few more dev releases to go before we get there. The headline items are subtree-based quota support in CephFS (ceph-fuse/libcephfs client support only for now), a rewrite of the watch/notify librados API used by RBD and RGW, OSDMap checksums to ensure that maps are …Read more

  • January 14, 2015
    v0.80.8 Firefly released

    This is a long-awaited bugfix release for firefly. It includes several important (but relatively rare) OSD peering fixes, fixes for performance issues when snapshots are trimmed, several RGW fixes, a paxos corner case fix, and some packaging updates. We recommend that all v0.80.x firefly users upgrade when it is convenient to do so. NOTABLE CHANGES

  • January 14, 2015
    Fix nova-scheduler issue with RBD and UUID not found

    While playing with Ceph on DevStack I noticed that after several rebuilds I ended up with the following error from nova-scheduler:

    Secret not found: rbd no secret matches uuid '3092b632-4e9f-40ca-9430-bbf60cefae36'

    Actually this error is reported b…

  • January 2, 2015
    Teuthology docker targets hack (4/5)

    The teuthology container hack is improved by adding a flag to retrieve packages from a user specified repository instead of gitbuilder.ceph.com. The user can build packages from sources and run a job, which will implicitly save a docker image with … Continue reading
