Archives:

Getting Involved with Ceph

The Ceph community is made up of many individuals with a wide variety of backgrounds, from FOSS hacker to corporate architect. We feel very fortunate to have such a great, and active, community. Even more so lately, as we have been fielding a number of questions on how best to become a more active participant in the Ceph community. With that in mind we decided it was time to sketch out a brief menu of different engagement opportunities to make it easy for anyone (not just developers) to take part in our digital revolution.


v0.54 released

The v0.54 development release is ready!  This will be the last development release before v0.55 “bobtail,” our next long-term stable release, is ready.  Notable changes this time around include:

  • osd: use entire device if journal is a block device
  • osd: new caps structure (see below)
  • osd: backfill target reservations (improve performance during recovery)
  • ceph-fuse: many fixes (including memory leaks, hangs)
  • radosgw: REST API for managing usage stats
  • radosgw: many small fixes, cleanups (coverity)
  • mds: misc fixes for multi-mds clusters
  • rbd: ls -l
  • ceph-disk-prepare: support for external journals, default mount/mkfs options, etc.
  • ceph-debugpack: misc improvements

There isn’t anything especially exciting here; most of the big stuff is landing in v0.55, which will become bobtail. Most of our effort over the next few weeks will go into making sure that v0.55 bobtail is rock solid and performs well.

INTRODUCTION

Hello again!

If you are new around these parts you may want to start out by reading the first article in this series available here.

For the rest of you, I am sure you are no doubt aware by now of the epic battle that Mark Shuttleworth and I are waging over who can generate more page hits on the ceph.com website.  I’ve made a totally original and in no way inaccurate illustration to document the saga for future generations:

Shuttleworth is going down!

Our Very First Ceph Day

Last Friday we had our very first day-long workshop dedicated to Ceph…in beautiful Amsterdam! The Ceph project has had a nice, long string of “firsts” lately and it was exciting to witness this one in person.

The event was organized by Inktank and 42on, a new Ceph company and this month’s Featured Contributor! The team at 42on did an amazing job organizing the venue, managing registration, and making sure that everybody had food, drinks, desks, and power. It simply wouldn’t have happened without their hard work and dedication to the community.


v0.38 released

It’s a week delayed, but v0.38 is ready.  The highlights:

  • osd: some peering refactoring
  • osd: “replay” period is per-pool (now only affects fs data pool)
  • osd: clean up old osdmaps
  • osd: allow admin to revert lost objects to prior versions (or delete)
  • mkcephfs: generate reasonable crush map based on “host” and “rack” fields in [osd.NN] sections of ceph.conf
  • radosgw: bucket index improvements
  • radosgw: improved swift support
  • rbd: misc command line tool fixes
  • debian: misc packaging fixes (including dependency breakage on upgrades)
  • ceph: query daemon perfcounters via command line tool

The big upcoming items for v0.39 are RBD layering (image cloning), further improvements to radosgw’s Swift support, and some monitor failure recovery and bootstrapping improvements.  We’re also continuing work on the automation bits that the Chef cookbooks and Juju charms will use, and a Crowbar barclamp was also just posted on GitHub.  Several patches are still working their way into libvirt and qemu to improve support for RBD authentication.

You can get v0.38 from the usual places:

Atomicity of RESTful radosgw operations

A while back we worked on radosgw doing atomic reads and writes.

The first issue was making sure that two or more concurrent writers that write to the same object don’t end up with an inconsistent object. That is the “atomic PUT” issue.

We also wanted to be able to make sure that when one client reads an object via radosgw while another client writes to the same object, the result is consistent. That is, when reading an object a client should get either the old or the new version of the object, and never a mix of the two. That is the “atomic GET” issue.

Radosgw is built directly on top of RADOS and is a prime example of a librados user. The basic issue is that radosgw streams the objects from or to the RADOS objects with a series of relatively small reads or writes. For the atomic PUT and atomic GET we didn’t want to introduce locking. Locking would solve the issue, but implementing it on top of RADOS would not have been trivial, and would have affected scalability and the relative simplicity of the gateway. The Ceph distributed file system implements locking in the metadata server (as part of its POSIX file locking support), and introducing that in the gateway would require holding state on each object and synchronizing it between the different gateway instances. We didn’t want to reimplement the MDS again.

Atomic PUT

When radosgw reads or writes an object it can issue multiple read or write librados requests to the RADOS backend. One RADOS feature is that each single operation is atomic. The problem is that for a sufficiently large object (and the threshold is not very large) we issue multiple write operations, so two concurrent writers could leave behind an object whose contents are interleaved from both.

The solution for the atomic PUT is to write the data into a temporary object. Once the temp object is completely written, we issue a single librados clone-range operation that atomically clones the entire temp object to the destination. Once the data is there we remove the temp object. This is equivalent to writing to a temporary file and renaming it over the target when we finish.

Since the RADOS backend is distributed, we need to make sure that both the temp object and the target object will be located in the same placement group (and on the same OSD). Usually the object location is determined by the object name, but for this purpose we used the “object locator” feature, which allows us to provide an alternative string that is fed into the hash function. In this case we use the target object name as the object locator for the temporary object, ensuring that both objects end up in the same placement group on the same node so that the clone operation can work.
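
The temp-object pattern and the locator trick can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: ToyStore, atomic_put, the tiny chunk size, and the SHA-1 placement hash are all hypothetical stand-ins, not the librados API and not the code radosgw actually runs.

```python
import hashlib
import uuid


class ToyStore:
    """In-memory stand-in for an object store (hypothetical, not librados)."""

    def __init__(self, pg_num=8):
        self.pg_num = pg_num
        self.objects = {}   # object name -> bytearray
        self.locators = {}  # object name -> locator string used for placement

    def placement_group(self, name):
        # Placement hashes the locator when one was given, else the name.
        key = self.locators.get(name, name)
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.pg_num

    def write(self, name, offset, data, locator=None):
        # Each single write is atomic, but a large PUT needs several of them.
        if locator is not None:
            self.locators[name] = locator
        buf = self.objects.setdefault(name, bytearray())
        if len(buf) < offset:
            buf.extend(b"\0" * (offset - len(buf)))
        buf[offset:offset + len(data)] = data

    def clone(self, src, dst):
        # One atomic step: dst takes on the full contents of src.
        if self.placement_group(src) != self.placement_group(dst):
            raise ValueError("clone requires both objects in the same PG")
        self.objects[dst] = bytearray(self.objects[src])

    def remove(self, name):
        self.objects.pop(name, None)
        self.locators.pop(name, None)


def atomic_put(store, name, data, chunk_size=4):
    """Stream data into a temp object, then clone it over the target."""
    temp = f"{name}.tmp.{uuid.uuid4().hex}"
    # Using the target name as locator keeps temp and target co-located.
    store.write(temp, 0, b"", locator=name)
    for offset in range(0, len(data), chunk_size):
        store.write(temp, offset, data[offset:offset + chunk_size])
    store.clone(temp, name)
    store.remove(temp)
```

A reader of the target name only ever sees the state before or after the clone step, never the partially streamed temp object.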

Atomic GET

With atomic PUT we know that the objects are consistent. However, this doesn’t help with clients reading when an object is being written. Since there can be multiple librados read operations for a single GET, some of the reads may happen before the object is replaced and some may happen after that, leading to an inconsistent “torn” result.

In addition to atomic operations, RADOS has a nice feature called compound operations, which allows you to send a few operations that are bundled together and applied atomically. If one of the operations fails, nothing is applied. We use this for atomic PUT in order to set both data and metadata on the target object in a single atomic operation.

For the atomic GET we introduce an object “tag,” which is a random value that we generate for each PUT and store as an object attribute (xattr). When radosgw writes to an object it first checks for an existing object and fetches its tag (which it can do atomically). If the object exists it clones it to a new object with the tag as a suffix (taking necessary steps to avoid name collisions) and the original object name as the locator. The compound clone operation looks like:

  1. check to see if object <name> tag attribute is <tag>
  2. clone to <name>_<tag>

The first operation is a guard to make sure that the object hasn’t been rewritten since we first read it. (Had it been rewritten, we need to restart the whole operation and reread the tag.) We put the same guard when we write the new object instance, to make sure that there was no racing operation.
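
A minimal sketch of the guard-then-clone compound operation, again on a toy in-memory model. Every name here (ToyStore, compound, guard_tag, clone_to, preserve_old_version) is hypothetical and chosen for illustration; the real gateway builds this from librados compound operations, whose API is not shown.

```python
import copy


class GuardFailed(Exception):
    """Raised when an object's tag no longer matches the expected value."""


class ToyStore:
    """Hypothetical in-memory store with per-object tags (not librados)."""

    def __init__(self):
        self.data = {}  # object name -> bytes
        self.tags = {}  # object name -> tag string (models the xattr)

    def compound(self, steps):
        # Run every step on a scratch copy; commit only if all succeed,
        # so a failing guard leaves the store untouched.
        scratch_data, scratch_tags = copy.deepcopy((self.data, self.tags))
        for step in steps:
            step(scratch_data, scratch_tags)
        self.data, self.tags = scratch_data, scratch_tags


def guard_tag(name, expected):
    def step(data, tags):
        if tags.get(name) != expected:
            raise GuardFailed(name)
    return step


def clone_to(src, dst):
    def step(data, tags):
        data[dst] = data[src]
        tags[dst] = tags[src]
    return step


def preserve_old_version(store, name):
    """Clone <name> to <name>_<tag>; reread the tag and retry if it changed."""
    while True:
        tag = store.tags.get(name)
        if tag is None:
            return None  # nothing to preserve, object does not exist yet
        old_name = f"{name}_{tag}"
        try:
            store.compound([guard_tag(name, tag), clone_to(name, old_name)])
            return old_name
        except GuardFailed:
            continue  # a racing writer replaced the object; start over
```

The sketch omits the name-collision handling mentioned above and simply assumes tags are unique per object.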

A client that reads the object also starts by reading the tag, and puts the same guard before each subsequent read operation. If the guard fails, the client knows that the object has been rewritten. However, it also knows that since the object has been rewritten, the version it started reading can now be found at <name>_<tag>. So, reading an object named foo looks like this:

  • read object foo tag -> 123
  • verify object foo tag is “123”; read object foo (offset = 0, size = 512K) -> ok, read 512K
  • verify object foo tag is “123”; read object foo (offset = 512K, size = 512K) -> not ok, object was replaced
  • read object foo_123 (offset = 512K, size = 512K) -> ok, read 512K
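
The read sequence above can be sketched as a loop that falls back to the preserved copy when the guard fails. This is a toy model: ToyStore, guarded_read, replace, and the after_chunk hook (used only to simulate a racing writer) are hypothetical names, and the writer side is simplified to a single step.

```python
class Replaced(Exception):
    """Raised when a guarded read finds a different tag than expected."""


class ToyStore:
    """Hypothetical in-memory store with per-object tags (not librados)."""

    def __init__(self):
        self.data = {}
        self.tags = {}

    def guarded_read(self, name, tag, offset, size):
        # Guard and read are modeled as one atomic step.
        if self.tags[name] != tag:
            raise Replaced(name)
        return self.data[name][offset:offset + size]

    def read(self, name, offset, size):
        return self.data[name][offset:offset + size]

    def replace(self, name, new_data, new_tag):
        # Simplified writer: keep the old version as <name>_<tag>,
        # then install the new data and tag.
        old_tag = self.tags.get(name)
        if old_tag is not None:
            old_name = f"{name}_{old_tag}"
            self.data[old_name] = self.data[name]
            self.tags[old_name] = old_tag
        self.data[name] = new_data
        self.tags[name] = new_tag


def atomic_get(store, name, chunk_size, after_chunk=None):
    """Read an object in chunks, never mixing two versions."""
    tag = store.tags[name]
    source, guarded = name, True
    out = bytearray()
    offset = 0
    while True:
        try:
            if guarded:
                chunk = store.guarded_read(source, tag, offset, chunk_size)
            else:
                chunk = store.read(source, offset, chunk_size)
        except Replaced:
            # The version we started on now lives at <name>_<tag>.
            source, guarded = f"{name}_{tag}", False
            continue
        if not chunk:
            break
        out += chunk
        offset += len(chunk)
        if after_chunk is not None:
            after_chunk()  # test hook: lets a writer race with this reader
    return bytes(out)
```

If a writer replaces foo between the first and second chunk, the reader still returns the complete old version, and the next GET returns the complete new one.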

The final component is an intent log. Since we end up creating multiple instances of the same object under different names, we need to make sure that these objects are cleaned up after some reasonable amount of time. We added a log object in which we record each such object that needs to be removed. After a sufficient amount of time (however long we expect very slow GETs to still succeed), a process iterates over the log and removes old objects.
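
A rough sketch of the intent log idea: record each leftover object with a timestamp, and have a periodic pass delete entries older than a grace period. IntentLog, its methods, and the grace period value are assumptions made for illustration; the post does not describe the actual log format or the cleanup process.

```python
import time


class IntentLog:
    """Hypothetical log of objects that should be removed later."""

    def __init__(self, grace_seconds):
        # How long a slow GET may still need the old object.
        self.grace_seconds = grace_seconds
        self.entries = []  # list of (timestamp, object name)

    def record(self, name, now=None):
        stamp = time.time() if now is None else now
        self.entries.append((stamp, name))

    def collect(self, objects, now=None):
        """Remove logged objects older than the grace period.

        `objects` is a dict standing in for the object store.
        Returns the names that were removed.
        """
        now = time.time() if now is None else now
        keep, removed = [], []
        for stamp, name in self.entries:
            if now - stamp >= self.grace_seconds:
                objects.pop(name, None)
                removed.append(name)
            else:
                keep.append((stamp, name))
        self.entries = keep
        return removed
```

For example, with a 60-second grace period, an entry logged at t=0 is removed by a pass at t=70, while an entry logged at t=50 survives until a later pass.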

Cephalopods

The name Ceph comes from cephalopod, a class of mollusks that includes the octopus and squid.  The suggestion came from Carlos Maltzahn, a professor in our research group at UCSC, sometime in 2006.  My memory is a bit hazy, but if memory serves the reasoning had something to do with their high level of intelligence and many-tentacled–ahem, “distributed”–physiology [insert hand waving here].

Here are some fun facts (and links).

  • Cephalopods have the most complex nervous system of all the invertebrates.
  • Some can fly up to 50m through the air, squirting water to help propel themselves.
  • Most have chromatophores, pigment-containing cells in their skin that are used for camouflage.  Check out this incredible video clip from Science Friday.
  • Cephalopods have advanced vision, but most are color blind.
  • They can detect gravity with statocysts.
  • They have an ink sac that they squirt into the water to confuse predators.
  • Most have no bones and can squeeze themselves through extremely small holes (search youtube for ‘octopus escape’ for some crazy videos).

Amusingly, I was searching for a story I heard a while back about an octopus at an aquarium that would sneak out of its tank to steal fish/food from a nearby tank and then return home, but ended up on this page on snopes about the prevalence of the story.

 

v0.23.1 released

This release includes some bug fixes for v0.23, although there’s nothing here that too many people have been hitting, fortunately.

  • cfuse/libceph: fix crash with clustered mds restart
  • cfuse/libceph: fix hard link caching
  • cfuse/libceph: fix lssnap
  • msgr: fix various races
  • msgr: fix IPv6 address parsing buffer overflow

v0.24 is still a few weeks away, and will include OSD recovery improvements, background scrubbing, and MDS clustering and performance improvements, among other things.

Relevant URLs:

S3-compatible object storage with radosgw

The radosgw has been around for a while, but it hasn’t been well publicized or documented, so I thought I’d mention it here.  The idea is this:

  • Ceph’s architecture is based on a robust, scalable distributed object store called RADOS.
  • Amazon’s S3 has shown that a simple object-based storage interface is a convenient way to write applications, even when that interface is very restrictive.
  • Providing access to Ceph’s object store via an S3-compatible interface is easy with librados.

The result is radosgw, a FastCGI-based proxy that exposes Ceph’s object store via a REST (HTTP-based) interface.  Radosgw implements a subset of Amazon’s API (some Amazon-specific features of ACLs and object versioning aren’t supported), but the subset it does implement aims to be fully compatible.  That means that most existing apps that are designed for S3 can be seamlessly migrated to a Ceph-based object store, provided they allow the hostname to be configured (many hard-code s3.amazonaws.com).

It should be noted that this approach has some fundamental limitations:

  • librados provides direct parallel access to storage nodes; radosgw is a single endpoint proxy that sits in front of your storage cluster.  That may actually be a good thing, depending on your security model.
  • The REST-based storage interface is much more restricted than that provided by librados.  librados allows partial object updates, has no object size limits, supports extensible object classes, fine-grained snapshots, and more.
  • The radosgw security model emulates S3’s, and is implemented as a layer on top of librados.  Accessing the same objects via the native librados library will not reflect S3-style ACLs created via radosgw.

Check it out!

v0.23 released

Another month, and v0.23 is out.  The main milestone here is that clustered MDS is pretty stable.  Stable enough that, if you’re interested and willing, we’d like you to try it and let us know what problems you have.  Notably, clustered recovery is not yet well tested (that’s v0.24), so don’t do this unless you’re feeling adventurous.  Directory fragmentation (splitting and merging) is also working, although still off by default.  If you’d like to try that too, add ‘mds bal frag = true’ to your [mds] section.

Other notable changes this time around:

  • osd: use new btrfs snapshot ioctls (2.6.37), parallel journaling
  • mds: clustering, replay fixes
  • mon: better commit batches, lower latency updates
  • objecter: bug fixes
  • osd: spread data across multiple xattrs; assert on io/enospc errors
  • osd: start up despite corrupt pg logs
  • ceph: new gui (ceph -g)

The general focus for v0.24 will be continuing OSD stability and clustered MDS recovery.

Relevant URLs:

© 2013, Inktank Storage, Inc. All rights reserved.