v0.41 is ready! There are a few key things in this release:
For v0.42 we’re working on improved journal performance for the OSD, better encoding for data structures (to ease upgrades and downgrades), rgw performance improvements, and an efficient key/value object interface.
You can download v0.41 from the usual places:
Ceph had its arms in a little bit of everything at SCALE 10x last weekend.
Between attending sessions, mingling at the OpenStack party, hanging out at the Ceph booth, hanging out at the OpenStack booth, and Sage’s presentation, we were busy!
One of our engineers, nicknamed TV, enjoyed the btrfs talk and made a note that the next version of Oracle’s Unbreakable Linux, which is coming out mid-February, will use btrfs by default. He also said the Canonical Juju charm school talk was *packed*, with people spilling out the doors. Good chance Juju could become “a thing”.
And the best for last, Sage’s session on Sunday afternoon was full and finished with several questions from the crowd. In case you missed it, you can download the pdf here. An audio/visual recording of the SCALE 10x presentations will be posted soon.
Coming up next:
2/2/12 – The upcoming OpenStack Meetup in San Francisco will be hosted by DreamHost and facilitated by Piston Cloud. The theme will be the Ceph storage project and how it’s currently being implemented in production. http://ceph.newdream.net/openstack
Ceph will be at the SCALE conference this Fri, Sat, and Sun to promote the Ceph project and our involvement with OpenStack, and to recruit the great talent that we know will be swarming to this conference. We have several activities planned for the SCALE conference, so we hope to see you there!
The Southern California Linux Expo (SCALE) is an annual Linux, Open-Source, and Free Software conference held in Los Angeles. Now celebrating its tenth year, this community-organized event will be held January 20-22, 2012 at the Hilton Los Angeles Airport hotel. SCALE offers over 100 seminars and presentation sessions, and an exhibit hall where non-profit and commercial organizations will demonstrate the latest developments in the open-source and free software realm.
The Ceph team is looking forward to meeting you at the SCALE conference in Los Angeles January 20-22, 2012.
Use the discount code “CEPH” to get 40% off registration. http://www.socallinuxexpo.org/scale10x
If you’ve ever been to a developer summit, technology user group meeting, or an online technology forum, you may have noticed that there’s a scarcity of women. We’ve noticed that too, so we’re doing our part to support women in technology and to encourage them to get involved in the open source revolution. The Ada Initiative is a non-profit organization dedicated to increasing participation of women in open technology and culture. This week we’re matching individual donations to Ada, so please donate today. For more information, you can read our interview with Valerie Aurora, founder of the Ada Initiative: https://adainitiative.org/2012/01/interview-with-ceph-first-ada-initiative-bronze-sponsor
It’s been several weeks, but v0.40 is ready. This has mostly been a stabilization release, so there isn’t too much new here. Some notable additions include:
There’s a much longer list of bugs fixed, but I’m not sure it’s worth listing here. Lots of stuff in the OSD, for the most part. Notably, there is only one high priority OSD bug in the tracker right now, and it is just awaiting confirmation from the nightly QA run that the fix is correct.
The main thing that didn’t make the cut is the “backfill” work, which is about to be merged into master for v0.41. This revamps the way OSDs handle recovery when the entire PG has to be replicated to a new location, significantly reducing memory requirements and improving recovery speed. For v0.41, we’re also working on mechanisms to improve visibility into the health of the cluster and addressing some performance issues.
To download v0.40:
v0.39 has been tagged and uploaded. There was a lot of bug fixing going on that isn’t terribly exciting. That aside, the highlights include:
The monitor and network config changes are worth mentioning. We simplified monitor bootstrapping to make it easier to use tools like Chef or Juju to bring up a fresh cluster. At the same time we made monitor cluster expansion almost trivial, and fixed an important performance problem when a monitor was down for a long time and then came back up.
Specifying the network config for daemons is also simple now that you can constrain the choice to a specific subnet. That means that when you have a whole cluster with a public and private network for, say, the OSDs, you can force ceph-osd to choose an ip for each interface from the appropriate subnet without explicitly setting the IP in the ceph.conf for each daemon.
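As a rough sketch of what that looks like in ceph.conf (the subnets here are illustrative, not a recommendation):

```ini
[global]
    ; clients and monitors reach daemons over the public network
    public network = 192.168.0.0/24

[osd]
    ; OSD replication and recovery traffic stays on the back-side network
    cluster network = 10.0.0.0/24
```

Each ceph-osd then binds an address from each subnet on whatever interface matches, so no per-daemon IP entries are needed.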
Ceph now builds on FreeBSD, thanks to some porting work by Stanislav Sedov.
There were a lot of small fixes to the OSD. A few bugs remain, however, in strange recovery corner cases. Some of the core recovery code is being rewritten for v0.40; this will vastly simplify things and make the system faster and less of a memory hog during recovery (see the wip-backfill branch in ceph.git).
For v0.40 we are also working on RBD image cloning (“layering”), and it’s going to be pretty slick. The vastly improved ceph.spec file is almost ready and should land in v0.40 as well.
To download v0.39:
It’s a week delayed, but v0.38 is ready. The highlights:
The big upcoming items for v0.39 are RBD layering (image cloning), further improvements to radosgw’s Swift support, and some monitor failure recovery and bootstrapping improvements. We’re also continuing work on the automation bits that the Chef cookbooks and Juju charms will use, and a Crowbar barclamp was also just posted on github. Several patches are still working their way into libvirt and qemu to improve support for RBD authentication.
You can get v0.38 from the usual places:
A while back we worked on radosgw doing atomic reads and writes.
The first issue was making sure that two or more concurrent writers that write to the same object don’t end up with an inconsistent object. That is the “atomic PUT” issue.
We also wanted to be able to make sure that when one client reads an object via radosgw while another client writes to the same object, the result is consistent. That is, when reading an object a client should get either the old or the new version of the object, and never a mix of the two. That is the “atomic GET” issue.
Radosgw is built directly on top of RADOS and is a prime example of a librados user. The basic issue is that radosgw streams the objects from or to the RADOS objects with a series of relatively small reads or writes. For the atomic PUT and atomic GET we didn’t want to introduce locking. Locking would solve the issue, but implementing it on top of RADOS would not have been trivial, and would have affected scalability and the relative simplicity of the gateway. The Ceph distributed file system implements locking in the metadata server (as part of its POSIX file locking support), and introducing that in the gateway would require holding state on each object and synchronizing it between the different gateway instances. We didn’t want to reimplement the MDS again.
When radosgw reads or writes an object it can issue multiple read or write librados requests to the RADOS backend. One RADOS feature is that each single operation is atomic. The problem is that for a sufficiently large object (which need not be very large) we issue multiple write operations, and could end up with an interleaved object.
The solution for the atomic PUT is to write the object into a temporary object. Once the temp object is completely written, we issue a single librados clone-range operation that atomically clones the entire temp object to the destination. Once the data is there we remove the temp object. This is equivalent to writing to a temporary file and renaming it over the target when we finish.
Since the RADOS backend is distributed, we need to make sure that both the temp object and the target object will be located in the same placement group (and on the same OSD). Usually the object location is determined by the object name, but for this purpose we used the “object locator” feature, which allows us to provide an alternative string that is fed into the hash function. In this case we use the target object name as the object locator for the temporary object, ensuring that both objects end up in the same placement group on the same node so that the clone operation can work.
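To make the flow concrete, here is a toy in-memory sketch of the PUT path, with a dict standing in for a placement group and a plain assignment standing in for the clone-range operation (all names are illustrative, not the actual radosgw code):

```python
import hashlib

class FakePG:
    """Toy stand-in for a RADOS placement group: a dict of objects."""
    def __init__(self):
        self.objects = {}

def pg_for(locator, pgs):
    # Placement is driven by the locator string, not the object name,
    # so the temp object and its target below land in the same PG.
    return pgs[int(hashlib.md5(locator.encode()).hexdigest(), 16) % len(pgs)]

def atomic_put(pgs, name, chunks):
    tmp = name + ".tmp"
    pg = pg_for(name, pgs)               # locator = target name for both objects
    pg.objects[tmp] = b""
    for c in chunks:                     # many small writes to the temp object
        pg.objects[tmp] += c
    pg.objects[name] = pg.objects[tmp]   # single atomic "clone-range" stand-in
    del pg.objects[tmp]                  # clean up the temp object

pgs = [FakePG() for _ in range(8)]
atomic_put(pgs, "photo.jpg", [b"part1-", b"part2"])
print(pg_for("photo.jpg", pgs).objects["photo.jpg"])  # b'part1-part2'
```

A concurrent reader of `photo.jpg` only ever sees the old object or the fully cloned new one, never a partially written temp object.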
With atomic PUT we know that the objects are consistent. However, this doesn’t help with clients reading when an object is being written. Since there can be multiple librados read operations for a single GET, some of the reads may happen before the object is replaced and some may happen after that, leading to an inconsistent “torn” result.
In addition to atomic operations, RADOS has a nice feature called compound operations, which allow you to send a few operations that are bundled together and applied atomically. If one of the operations fails, nothing is applied. We use this for atomic PUT in order to set both data and metadata on the target object in a single atomic operation.
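The all-or-nothing behavior is the key property. A minimal model of it, with steps staged on a copy and committed only if every step succeeds (this is a sketch of the semantics, not of RADOS internals):

```python
class CompoundOp:
    """Toy model of a RADOS compound operation: a list of steps that is
    applied atomically -- if any step raises, nothing is applied."""
    def __init__(self):
        self.steps = []
    def add(self, fn):
        self.steps.append(fn)
        return self
    def apply(self, obj):
        shadow = dict(obj)      # stage all changes on a copy
        for step in self.steps:
            step(shadow)        # any failure aborts before the commit
        obj.clear()
        obj.update(shadow)      # commit every step at once

obj = {}
op = CompoundOp()
op.add(lambda o: o.update(data=b"bytes"))   # write the object data
op.add(lambda o: o.update(etag="abc123"))   # ...and its metadata
op.apply(obj)
```

If the metadata step had failed, the data write would never become visible either, which is exactly what the PUT path relies on.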
For the atomic GET we introduce an object “tag,” which is a random value that we generate for each PUT and store as an object attribute (xattr). When radosgw writes to an object it first checks for an existing object and fetches its tag (which it can do atomically). If the object exists it clones it to a new object with the tag as a suffix (taking necessary steps to avoid name collisions) and the original object name as the locator. The compound clone operation looks like:
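Roughly, in hedged pseudocode (the exact operation names here are illustrative, not the radosgw source):

```
op = new_compound_operation()
op.cmpxattr("tag", EQUAL, tag)            # guard: fail if foo's tag changed
op.clone("foo_<tag>", from="foo")         # preserve the current version
op.setxattr("foo_<tag>", "tag", tag)      # tag the preserved copy
execute(op, locator="foo")                # all-or-nothing, in foo's PG
```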
The first operation is a guard to make sure that the object hasn’t been rewritten since we first read it. (Had it been rewritten, we need to restart the whole operation and reread the tag.) We put the same guard when we write the new object instance, to make sure that there was no racing operation.
A client that reads the object also starts by reading the tag, and puts the same guard before each subsequent read operation. If the guard fails, the client knows that the object has been rewritten. However, it also knows that since it has been rewritten, the object that it started reading can now be found at <name>_<tag>. So, reading of an object named foo looks like this:
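In hedged pseudocode (again, operation names are illustrative):

```
tag = getxattr("foo", "tag")              # read the current tag first
for each chunk to read:
    op = new_compound_operation()
    op.cmpxattr("tag", EQUAL, tag)        # guard: has foo been rewritten?
    op.read(offset, length)
    if execute(op, locator="foo") fails on the guard:
        # foo was replaced mid-read; the version we started with
        # is preserved as foo_<tag> in the same PG
        read the remaining chunks from "foo_<tag>"
```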
The final component is an intent log. Since we end up creating multiple instances of the same object under different names, we need to make sure that these objects are cleaned up after some reasonable amount of time. We added a log object in which we record each object that needs to be removed. After a sufficient amount of time (however long we expect very slow GETs to still succeed), a process iterates over the log and removes old objects.
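A minimal sketch of that cleanup pass, again over a toy dict-backed store (the grace period and all names are assumptions for illustration):

```python
INTENT_LOG = []   # (timestamp, object_name) pairs, oldest first
GRACE = 3600.0    # longest time we expect a slow GET to take (illustrative)

def log_intent(name, now):
    """Record a preserved clone instance so it can be removed later."""
    INTENT_LOG.append((now, name))

def trim_intent_log(store, now):
    """Remove clone objects older than the grace period."""
    while INTENT_LOG and now - INTENT_LOG[0][0] > GRACE:
        _, name = INTENT_LOG.pop(0)
        store.pop(name, None)       # any slow readers are done with it by now

store = {"foo_abc": b"old data", "foo": b"new data"}
log_intent("foo_abc", now=0.0)
trim_intent_log(store, now=7200.0)   # well past the grace period
print(sorted(store))                 # ['foo']
```

Because entries are appended in order, the trim can stop at the first entry still inside the grace window.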
The name Ceph comes from cephalopod, a class of mollusks that includes the octopus and squid. The suggestion came from Carlos Maltzahn, a professor in our research group at UCSC, sometime in 2006. My memory is a bit hazy, but if memory serves the reasoning had something to do with their high level of intelligence and many-tentacled–ahem, “distributed”–physiology [insert hand waving here].
Here are some fun facts (and links).
Amusingly, I was searching for a story I heard a while back about an octopus at an aquarium that would sneak out of its tank to steal fish/food from a nearby tank and then return home, but ended up on this page on snopes about the prevalence of the story.