For the past few months I have been working towards a way to use Ceph for virtual machine images in Apache CloudStack. This integration is important to end users because it allows them to use Ceph’s distributed block device (RBD) to speed up provisioning of virtual machines.
We (my company) have been long-time contributors to Ceph (since version 0.17!), and will be using it in our own cloud product. Support for Ceph didn’t exist in CloudStack… So we built it!
I’m co-owner of a Dutch webhosting company called PCextreme B.V. As CTO, my role covers our Research & Development, which enables me to play with Ceph (a lot).
Quite some time ago we became convinced we wanted to use Ceph with RBD in our VPS product, but we weren’t sure how. Were we going to write our own cloud management software? OpenStack seemed like a good choice since it already had RBD integration, but while looking at OpenStack we came across CloudStack. I’m not going to get into the OpenStack vs. CloudStack debate; we simply decided that CloudStack suited us better. However, it lacked RBD support!
To make this integration work, a few things needed to be done:
This work has been completed and merged, and will all be part of the new CloudStack 4.0 release, which is slated for the end of October. Between now and then, we’d like people to try it!
To get started, take a look at the related documentation. If you encounter any problems, feel free to ask for help on the Ceph or CloudStack mailing lists. Or join the #ceph (OFTC) or #cloudstack (Freenode) IRC channels; I’m idling there most of the time.
In my (rather brief) time digging into Ceph and working with the community, most discussions generally boil down to two questions: “How does Ceph work?” and “What can I do with Ceph?” The first question has garnered a fair amount of attention in our outreach efforts. Ross Turk’s post “More Than an Object Store” does a fantastic job summarizing Ceph’s magic. The second question is what I will address below.
So what can you do with Ceph? For those who like to read the ending first, the answer turns out to be “a blindingly awesome ton.” Thankfully that doesn’t spoil it for the rest of us, because it’s the details that make it fun. In an email discussion of these details, it was Inktank’s chief suit, Bryan Bogensberger, who managed to succinctly summarize many of the available options while still citing examples and supporting data. (How do you like that, a business guy who has a solid handle on the tech. How lucky are we!?) Without immediately overwhelming you with all the supporting details, his list was as follows:
- Posted by sage
- October 16th, 2012
Another development release of Ceph is ready, v0.53. We are getting pretty close to what will be frozen for the next stable release (bobtail), so if you would like a preview, give this one a go. Notable changes include:
- librbd: image locking
- rbd: fix the list command when there are more than 1024 (format 2) images
- osd: backfill reservation framework (to avoid flooding new osds with backfill data)
- osd, mon: honor new ‘nobackfill’ and ‘norecover’ osdmap flags
- osd: new ‘deep scrub’ will compare object content across replicas (once per week by default)
- osd: crush performance improvements
- osd: some performance improvements related to request queuing
- osd: capability syntax improvements, bug fixes
- osd: misc recovery fixes
- osd: fix memory leak on certain error paths
- osd: default journal size to 1 GB
- crush: default root of tree type is now ‘root’ instead of ‘pool’ (to avoid confusion with RADOS pools)
- ceph-fuse: fix handling for .. in root directory
- librados: some locking fixes
- mon: some election bug fixes
- mon: some additional on-disk metadata to facilitate future mon changes (post-bobtail)
- mon: throttle osd flapping based on osd history (limits osdmap “thrashing” on overloaded or unhappy clusters)
- mon: new ‘osd crush create-or-move …’ command
- radosgw: fix copy-object vs attributes
- radosgw: fix bug in bucket stat updates
- mds: fix ino release on abort session close, relative getattr path, mds shutdown, other misc items
- upstart: stop jobs on shutdown
- common: thread pool sizes can now be adjusted at runtime
- build fixes for Fedora 18, CentOS/RHEL 6
The latest version of OpenStack, Folsom, was recently released. This release makes block devices in general, and Ceph block devices (RBD) in particular, much easier to use. If you’re not familiar with OpenStack terminology, there are a few things you should know before proceeding:
- instance – a virtual machine
- image – a template for a virtual machine
- volume – a block device
- Cinder – OpenStack service for managing block devices (replaces nova-volume from previous versions)
- Glance – OpenStack service for storing images and metadata about them (image type, size, owner, etc.)
In previous releases, you could create volumes and attach them to virtual machines, and you could even boot from them, but there was no way to put data on them without going and doing it manually yourself: getting a bootable image onto a volume meant copying it there by hand.
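With Folsom, Cinder takes care of that copy for you: a volume can be created directly from a Glance image and then used as the boot device. As a rough, hypothetical sketch (not from the original post; the credentials, Keystone URL, and image UUID below are placeholders), the Folsom-era Python Cinder client makes it a single call:

```python
# Hypothetical sketch: create a bootable Cinder volume straight from a
# Glance image using python-cinderclient (v1 API, Folsom era).
# The credentials, Keystone URL, and image UUID are placeholders.
from cinderclient.v1 import client

cinder = client.Client('admin', 'secret', 'demo',
                       auth_url='http://keystone.example.com:5000/v2.0')

# Cinder copies the image contents onto the new volume, so the
# resulting volume is immediately bootable.
volume = cinder.volumes.create(10,                       # size in GB
                               display_name='boot-vol',
                               imageRef='11111111-2222-3333-4444-555555555555')
print(volume.id, volume.status)
```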
Here at Inktank our developers have been toiling away at their desks, profiling and optimizing Ceph to make it one of the fastest distributed storage solutions on the planet. One question we often get asked is how to build an optimally performing Ceph cluster. This isn’t always an easy question to answer, because it depends on many factors, including funding, capacity requirements, density requirements, and existing infrastructure. There are, however, some basic investigations that can be done to start getting an idea of which components in a Ceph storage node matter.
The wise and benevolent management at Inktank (Hi Guys!) agreed to allow me to go on a shopping spree with the corporate credit card to answer these questions. Without further encouragement, I went to one of our hardware vendors and immediately put in an order for a 36-drive Supermicro SC847A chassis along with 36 SATA drives, 9 Intel 520 SSDs, a variety of controllers, and all of the other random bits needed to actually make this thing work.
As Ceph development continues to move forward at an astonishing rate, we’re working hard to share both our passion for what’s here and our vision of things to come through as many conduits as we can manage. If you are interested in hearing about the latest Ceph development work, asking questions of some of the folks behind it, or just want to tell us about the awesome things you are building with Ceph, keep an eye on our marathon event schedule and stop on by.
In the immediate future you can find us both at the Open World Forum coming up this week in Paris and next week at the newly streamlined OpenStack Summit in San Diego.
Open World Forum
In Paris Ross Turk will be speaking on several panels. For those of you who don’t know Ross, any presentation from this seasoned Open Source veteran is well worth the time away from precious bits and/or internet cat pictures, so make sure you catch all of his appearances! Ross will be delivering both of the following talks:
- Posted by rturk
- October 2nd, 2012
Today, Dmitry Ukov wrote a great post on the Mirantis Blog entitled Object Storage approaches for OpenStack Cloud: Understanding Swift and Ceph. Dmitry’s overview of Ceph was a solid introduction for anyone needing an object store for their OpenStack deployment, and it was an interesting read. Thanks, Dmitry!
Naturally, since I spend most of my days thinking about Ceph, I couldn’t resist going a bit deeper with a few of Dmitry’s ideas. Here we go:
Ceph is More Than Just An Object Store
Ceph is a great object store. If you strip it down to its bare minimum, that’s what it is. Comparing the entire Ceph platform with Swift is apples and oranges, though, since Ceph can be much more than just an object store. Bonus points to the first person who writes a jingle that best accompanies that last part there.
The Ceph Object Store (also called RADOS) is a collection of two kinds of services: object storage daemons (ceph-osd) and monitors (ceph-mon). The monitors’ primary function is to keep track of which nodes are operational at any given time, while the OSDs perform the actual data storage and retrieval. A cluster can have anywhere from a handful to thousands of OSDs, but a small number of monitors (usually 3, 5, or 7) is enough for most clusters. There’s a client library, librados, that allows applications to store and retrieve objects.
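To make that last point concrete, here is a minimal sketch (mine, not Dmitry’s) of storing and fetching an object through librados with the Python binding. It assumes a running cluster, a readable /etc/ceph/ceph.conf, and an existing pool named ‘data’:

```python
# Minimal librados sketch: connect, write one object, read it back.
# The config path and the 'data' pool are assumptions for this example.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

ioctx = cluster.open_ioctx('data')             # I/O context for one pool
ioctx.write_full('greeting', b'hello RADOS')   # store an object
print(ioctx.read('greeting'))                  # fetch it back

ioctx.close()
cluster.shutdown()
```

Everything else in the platform – the S3/Swift-compatible gateway, RBD, and the Ceph filesystem – is built on top of exactly this kind of object access.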
- Posted by sage
- October 18th, 2011
v0.37 is ready. Notable changes this time around:
- radosgw: backend on-disk format changes
- radosgw: improved logging
- radosgw: package improvements (init script, fixed deps)
- osd: bug fixes!
- teuthology: btrfs testing
If you are currently storing data with radosgw, you will need to export and reimport your data as the backend storage strategy has changed to improve scaling.
Other work not directly in the release includes work with the Chef cookbooks (will hit ceph-cookbooks.git soon), an RBD backend for Glance (OpenStack), and ongoing work improving the libvirt support for qemu/KVM + RBD. We’ve also been fighting with the ceph.spec file to get something that will build on all of Fedora, RHEL/CentOS, openSUSE, and SLES (with mixed success).
You can get v0.37 from:
- Posted by yehuda
- October 10th, 2011
Just a quick update on the current status of RBD.
The main recent development is that librbd (the userspace library) can ack writes immediately (instead of waiting for them to actually commit), to better mimic the behavior of a normal disk.
Why do this? A long long time ago, when you issued a write to a disk, it would ACK the write when the data was written. No more. Now the ACK means the data is either in the drive’s cache or on disk. You don’t know the data is safe/durable until you issue a separate flush command. Now RBD behaves similarly: writes are acked immediately (up to some number of bytes, at least), and a flush will wait for all previous writes to commit. The only real difference between this and a real drive cache is that a real drive will try to coalesce small writes into a single operation, while RBD sends them all straight through to the backend cluster.
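To illustrate the semantics, here is a small hypothetical sketch using the librbd Python binding (the ‘rbd’ pool and ‘myimage’ image are made-up names): a write returns once it has been acked, and flush() is the durability barrier that waits for everything before it to commit.

```python
# Hypothetical sketch of the ack-vs-commit distinction with librbd's
# Python binding. The 'rbd' pool and 'myimage' image are assumptions.
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

with rbd.Image(ioctx, 'myimage') as image:
    image.write(b'some data', 0)   # returns when acked; the data may still
                                   # only be buffered, not yet committed
    image.flush()                  # blocks until all prior writes commit

ioctx.close()
cluster.shutdown()
```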
To make this work with qemu/KVM you need:
This is not yet implemented in the kernel RBD driver. As a result, effective performance using that device is still relatively poor. We hope to have similar behavior ready when the v3.2 merge window opens.
- Posted by sage
- October 30th, 2010
v0.22.2 is out with a few minor bug fixes:
- cfuse: fix truncation issue
- osd: fix decoding of legacy (0.21 and earlier) coll_t (which caused problems for people upgrading)
- osd: handle missing objects on snap reads
- filestore: escape xattr chunk names
Not too much here, but the decoding error would bite anyone upgrading from v0.21.