It’s been several weeks, but v0.40 is ready. This has mostly been a stabilization release, so there isn’t too much new here. Some notable additions include:
There’s a much longer list of bugs fixed, but I’m not sure it’s worth listing here. Lots of stuff in the OSD, for the most part. Notably, there is only one high priority OSD bug in the tracker right now, and it is just awaiting confirmation from the nightly QA run that the fix is correct.
The main thing that didn’t make the cut is the “backfill” work, which is about to be merged into master for v0.41. This revamps the way OSDs handle recovery when the entire PG has to be replicated to a new location, significantly reducing memory requirements and improving recovery speed. For v0.41, we’re also working on mechanisms to improve visibility into the health of the cluster and addressing some performance issues.
To download v0.40:
v0.39 has been tagged and uploaded. There was a lot of bug fixing going on that isn’t terribly exciting. That aside, the highlights include:
The monitor and network config changes are worth mentioning. We simplified monitor bootstrapping to make it easier to use tools like Chef or Juju to bring up a fresh cluster. At the same time we made monitor cluster expansion almost trivial, and fixed an important performance problem when a monitor was down for a long time and then came back up.
Specifying the network config for daemons is also simpler now that you can constrain the choice to a specific subnet. That means that when you have a cluster with a public and a private network (for, say, the OSDs), you can force ceph-osd to choose an IP for each interface from the appropriate subnet without explicitly setting the IP in ceph.conf for each daemon.
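For example, a ceph.conf sketch along these lines (assuming the public network / cluster network option spellings, with made-up subnets) might look like:

```ini
[global]
    ; clients and monitors reach the daemons on this subnet
    public network = 192.168.1.0/24
    ; replication and recovery traffic between OSDs stays on this subnet
    cluster network = 10.0.0.0/24
```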
Ceph now builds on FreeBSD, thanks to some porting work by Stanislav Sedov.
There were a lot of small fixes to the OSD. A few bugs remain, however, in strange recovery corner cases. Some of the core recovery code is being rewritten for v0.40; the rewrite will vastly simplify things and make the system more performant and less of a memory hog during recovery (see the wip-backfill branch in ceph.git).
For v0.40 we are also working on the RBD image cloning (“layering”), and it’s going to be pretty slick. And the vastly improved ceph.spec file is almost ready and should land in v0.40 as well.
To download v0.39:
It’s a week delayed, but v0.38 is ready. The highlights:
The big upcoming items for v0.39 are RBD layering (image cloning), further improvements to radosgw’s Swift support, and some monitor failure recovery and bootstrapping improvements. We’re also continuing work on the automation bits that the Chef cookbooks and Juju charms will use, and a Crowbar barclamp was also just posted on github. Several patches are still working their way into libvirt and qemu to improve support for RBD authentication.
You can get v0.38 from the usual places:
A while back we worked on radosgw doing atomic reads and writes.
The first issue was making sure that two or more concurrent writers that write to the same object don’t end up with an inconsistent object. That is the “atomic PUT” issue.
We also wanted to be able to make sure that when one client reads an object via radosgw while another client writes to the same object, the result is consistent. That is, when reading an object a client should get either the old or the new version of the object, and never a mix of the two. That is the “atomic GET” issue.
Radosgw is built directly on top of RADOS and is a prime example of a librados user. The basic issue is that radosgw streams the objects from or to the RADOS objects with a series of relatively small reads or writes. For the atomic PUT and atomic GET we didn’t want to introduce locking. Locking would solve the issue, but implementing it on top of RADOS would not have been trivial, and would have affected scalability and the relative simplicity of the gateway. The Ceph distributed file system implements locking in the metadata server (as part of its POSIX file locking support), and introducing that in the gateway would require holding state on each object and synchronizing it between the different gateway instances. We didn’t want to reimplement the MDS again.
When radosgw reads or writes an object it can issue multiple read or write librados requests to the RADOS backend. One RADOS feature is that each single operation is atomic. The problem is that for sufficiently large objects (and the threshold is not very large) we issue multiple write operations, so two concurrent writers could end up producing an interleaved object.
The solution for the atomic PUT is to write the object into a temporary object. Once the temp object is completely written, we issue a single librados clone-range operation that atomically clones the entire temp object to the destination. Once the data is there we remove the temp object. This is equivalent to writing to a temporary file and renaming it over the target when we finish.
Since the RADOS backend is distributed, we need to make sure that both the temp object and the target object will be located in the same placement group (and on the same OSD). Usually the object location is determined by the object name, but for this purpose we used the “object locator” feature, which allows us to provide an alternative string that is fed into the hash function. In this case we use the target object name as the object locator for the temporary object, ensuring that both objects end up in the same placement group on the same node so that the clone operation can work.
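As a rough sketch with the librados C++ API (the object names, the single write_full call, and the exact clone_range signature are simplifications and from-memory assumptions, not radosgw’s actual code):

```cpp
#include <rados/librados.hpp>
#include <string>

// Hedged sketch of the atomic PUT path: write a temp object colocated with the
// destination, then atomically clone it over the destination and clean up.
int atomic_put(librados::IoCtx& io, const std::string& dst_oid,
               librados::bufferlist& data)
{
  std::string tmp_oid = dst_oid + ".tmp";   // illustrative temp name

  // Use the destination name as the object locator so the temp object lands
  // in the same placement group (and on the same OSD) as the destination.
  io.locator_set_key(dst_oid);

  // Stream the data into the temp object (a single write here for brevity;
  // radosgw actually issues a series of smaller writes).
  int r = io.write_full(tmp_oid, data);
  if (r < 0)
    return r;

  // Atomically clone the whole temp object over the destination...
  r = io.clone_range(dst_oid, 0, tmp_oid, 0, data.length());
  if (r < 0)
    return r;

  // ...and remove the temp object.
  return io.remove(tmp_oid);
}
```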
With atomic PUT we know that the objects are consistent. However, this doesn’t help with clients reading when an object is being written. Since there can be multiple librados read operations for a single GET, some of the reads may happen before the object is replaced and some may happen after that, leading to an inconsistent “torn” result.
In addition to atomic operations, RADOS has a nice feature called compound operations which allow you to send a few operations that are bundled together and applied atomically. If one of the operations fails, nothing is applied. We use this for atomic PUT in order to set both data and metadata on the target object in a single atomic operation.
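For example, a minimal sketch of such a compound write with the librados C++ API (the xattr name here is invented, not radosgw’s actual attribute):

```cpp
#include <rados/librados.hpp>
#include <string>

// Data and a metadata xattr applied in one atomic step: either everything in
// the operation takes effect, or none of it does.
int put_with_meta(librados::IoCtx& io, const std::string& oid,
                  librados::bufferlist& data, librados::bufferlist& meta)
{
  librados::ObjectWriteOperation op;
  op.write_full(data);                    // the object's new contents
  op.setxattr("user.rgw.example", meta);  // illustrative metadata attribute
  return io.operate(oid, &op);            // all applied, or none of it
}
```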
For the atomic GET we introduce an object “tag,” which is a random value that we generate for each PUT and store as an object attribute (xattr). When radosgw writes to an object it first checks for an existing object and fetches its tag (which it can do atomically). If the object exists it clones it to a new object with the tag as a suffix (taking necessary steps to avoid name collisions) and the original object name as the locator. The compound clone operation looks like:
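Roughly, and with illustrative names (the xattr name and the src_cmpxattr guard call are assumptions based on how I recall the librados C++ API of the time), it is a guard plus a clone bundled into one compound write:

```cpp
#include <cerrno>
#include <rados/librados.hpp>
#include <string>

// Hedged sketch: preserve the current version of an object at <name>_<tag>,
// guarded so it only happens if the tag we read earlier is still in place.
int preserve_old_version(librados::IoCtx& io, const std::string& orig_name,
                         const std::string& tag, uint64_t size)
{
  librados::bufferlist tag_bl;
  tag_bl.append(tag);                     // tag previously read from orig_name

  librados::ObjectWriteOperation op;
  // 1: guard on the *source* object's tag (call name as I recall it)
  op.src_cmpxattr(orig_name, "user.rgw.tag", LIBRADOS_CMPXATTR_OP_EQ, tag_bl);
  // 2: clone the original object's data into the new copy
  op.clone_range(0, orig_name, 0, size);

  io.locator_set_key(orig_name);          // keep the copy in the same PG
  int r = io.operate(orig_name + "_" + tag, &op);
  if (r == -ECANCELED) {
    // guard failed: the object was rewritten, so reread the tag and retry
  }
  return r;
}
```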
The first operation is a guard to make sure that the object hasn’t been rewritten since we first read it. (Had it been rewritten, we would need to restart the whole operation and reread the tag.) We put the same guard on the write of the new object instance, to make sure that there was no racing operation.
A client that reads the object also starts by reading the tag, and puts the same guard before each subsequent read operation. If the guard fails, the client knows that the object has been rewritten. However, it also knows that since it has been rewritten, the object it started reading can now be found at <name>_<tag>. So, reading an object named foo looks like this:
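Sketched with the librados C++ API (again with made-up xattr and variable names, and a simplified fallback path):

```cpp
#include <cerrno>
#include <rados/librados.hpp>
#include <string>

// One guarded read from the GET path: read the current tag, then read data
// only if the tag is still in place; otherwise fall back to the renamed copy.
int guarded_read(librados::IoCtx& io, uint64_t off, uint64_t len,
                 librados::bufferlist* out)
{
  // Read the current tag first.
  librados::bufferlist tag_bl;
  int r = io.getxattr("foo", "user.rgw.tag", tag_bl);
  if (r < 0)
    return r;
  std::string tag(tag_bl.c_str(), tag_bl.length());

  // Each subsequent read carries the guard.
  librados::ObjectReadOperation op;
  op.cmpxattr("user.rgw.tag", LIBRADOS_CMPXATTR_OP_EQ, tag_bl);
  op.read(off, len, out, NULL);
  r = io.operate("foo", &op, NULL);

  if (r == -ECANCELED) {
    // "foo" was rewritten mid-GET; the version we started reading now lives
    // at foo_<tag>, which was placed using "foo" as its locator.
    io.locator_set_key("foo");
    r = io.read("foo_" + tag, *out, len, off);
  }
  return r;
}
```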
The final component is an intent log. Since we end up creating multiple instances of the same object under different names, we need to make sure that these objects are cleaned up after some reasonable amount of time. We added a log object in which we record each such object that needs to be removed. After a sufficient amount of time (long enough that even very slow GETs will have finished), a process iterates over the log and removes old objects.
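A tiny sketch of the recording side, assuming a made-up log object name and record format (radosgw’s actual intent log differs):

```cpp
#include <rados/librados.hpp>
#include <ctime>
#include <string>

// Record an "intent to remove" for a leftover object instance by appending a
// line to a shared log object; a later cleanup pass reads the log and removes
// entries older than the grace period.
int log_removal_intent(librados::IoCtx& io, const std::string& leftover_oid)
{
  librados::bufferlist entry;
  entry.append(std::to_string(std::time(NULL)) + " " + leftover_oid + "\n");
  // append() lets many writers add records without clobbering each other
  return io.append("intent_log", entry, entry.length());
}
```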
The name Ceph comes from cephalopod, a class of mollusks that includes the octopus and squid. The suggestion came from Carlos Maltzahn, a professor in our research group at UCSC, sometime in 2006. My memory is a bit hazy, but if memory serves the reasoning had something to do with their high level of intelligence and many-tentacled–ahem, “distributed”–physiology [insert hand waving here].
Here are some fun facts (and links).
Amusingly, I was searching for a story I heard a while back about an octopus at an aquarium that would sneak out of its tank to steal fish/food from a nearby tank and then return home, but ended up on this page on snopes about the prevalence of the story.
v0.37 is ready. Notable changes this time around:
If you are currently storing data with radosgw, you will need to export and reimport your data as the backend storage strategy has changed to improve scaling.
Other work not directly in the release includes work with the Chef cookbooks (will hit ceph-cookbooks.git soon), an RBD backend for Glance (OpenStack), and ongoing work improving the libvirt support for qemu/KVM + RBD. We’ve also been fighting with the ceph.spec file to get something that will build on all of Fedora, RHEL/CentOS, openSUSE, and SLES (with mixed success).
You can get v0.37 from:
Just a quick update on the current status of RBD.
The main recent development is that librbd (the userspace library) can ack writes immediately (instead of waiting for them to actually commit), to better mimic the behavior of a normal disk.
Why do this? A long long time ago, when you issued a write to a disk, it would ACK the write when the data was written. No more. Now, the ACK means the data is either in the drive’s cache or on disk. You don’t know the data is safe/durable until you issue a separate flush command. Now RBD behaves similarly: writes are acked immediately (up to some number of bytes, at least), and a flush will wait for all previous writes to commit. The only real difference between this and a real drive cache is that a real drive will try to coalesce small writes into a single operation, while RBD sends them all straight through to the backend cluster.
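In librbd C++ API terms, the contract looks roughly like this (a hedged sketch; the image name is made up and the ioctx is assumed to point at the right pool):

```cpp
#include <rados/librados.hpp>
#include <rbd/librbd.hpp>

// A write may be acknowledged before it is durable; flush() returns only once
// everything previously written has actually committed to the cluster.
int write_then_flush(librados::IoCtx& io)
{
  librbd::RBD rbd;
  librbd::Image image;
  int r = rbd.open(io, image, "myimage");
  if (r < 0)
    return r;

  librados::bufferlist bl;
  bl.append("some data");
  ssize_t ret = image.write(0, bl.length(), bl);  // acked quickly, like a disk with a write cache
  if (ret < 0)
    return (int)ret;

  return image.flush();                           // blocks until all prior writes are committed
}
```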
To make this work with qemu/KVM you need:
This is not yet implemented in the kernel RBD driver. As a result, effective performance using that device is still relatively poor. We hope to have similar behavior ready when the v3.2 merge window opens.
It’s been three weeks and v0.36 is ready. The most visible change this time around is that the daemons and tools have been renamed. Everything that used to start with ‘c’ now starts with ‘ceph-’, and libceph is now libcephfs. Nothing earth shattering, but we’re trying to clean these things up where we can and deal with the pain sooner rather than later. (If you have any naming or tool usage pet peeves, let us know.)
Notable changes since v0.35:
The biggest item here is probably the librbd async write change, which affects qemu/KVM virtual machines using the RBD virtual disks. Typical physical disks have a write cache and don’t actually ensure your data is physically written to the platter until you issue a flush command (which modern file systems are now careful to do at critical points). In contrast, RBD wouldn’t acknowledge a write until it was written to the backend storage (all N replicas), which meant high latency writes and seemingly poor performance (even though throughput was theoretically very good). librbd now buffers writes so that it behaves more like a disk, resulting in vastly improved performance for most typical workloads (like dd). You still need to use the latest upstream qemu version to ensure that flush commands are properly handled, so this is still off by default; see this post for more information. We haven’t made the same change to the Linux kernel RBD driver, but it’s coming soon.
We took an extra week this cycle due to a few trips (Yehuda, Bryan, and I were in Israel for a few days, and then I was at SDC last week), and may do that again this sprint. Tommi and Bryan will be at the OpenStack conference and design summit (don’t miss Tommi’s talk on RBD on Friday!) next week, and you’ll see us in the Dell booth provisioning a Ceph cluster with Chef and Dell’s Crowbar.
For v0.37, the focus is on Chef cookbooks, OpenStack integration, radosgw scalability improvements, and libvirt integration.
You can get v0.36 from:
WARNING: There is a disk format change in this release that requires a bit of extra care to upgrade safely. Please see below.
Notable changes since v0.34 include:
The big change this time around is the way the OSD is storing objects on the local machine. When directories of objects get large, they are “pre-hashed” into subdirectories. This is necessary groundwork that will facilitate splitting and merging of PGs when pools grow or shrink dramatically in size. There is an “on disk” format change to do this, so a bit of care is needed to upgrade.
Please follow this basic procedure:
We’ve done a lot of testing to make sure this works properly, but there are some awkward changes that make it difficult to test every scenario. If you have important data in your cluster, make a backup before upgrading.
Where to get v0.35:
We spent some time this week working on our technical roadmap for the next few months. It’s all been mostly translated into issues and priorities in the tracker (here’s a sorted priority list), but from that level of gory detail it’s hard to see the forest for the trees.
At a high level, the basic priorities are:
A bit more specifically, the priority list goes something like:
There are several mostly parallel goals we are pursuing here, so figuring out which pieces to work on at any point in time is always a bit of a challenge (especially when it comes to features vs. bugs vs. QA). There will inevitably be some shift in what ends up in each release (every 2-3 weeks).
In case it isn’t clear from the above, improving stability and testing coverage continues to be a key goal. We tend to focus on bugs we currently see (or users are seeing), and expand testing on the core systems first. Although there isn’t much about the file system on this list, cfuse and kclient testing are already fairly well covered in our test suite, and that continues to expand. It just isn’t our primary focus at this point.
Questions/comments welcome! If any of these areas interests you in particular (from a technical, business, or potential employment perspective), we would of course love to hear from you.