Archives: July 2011

v0.49 released

This release is a bit less exciting than most because it is the first development release since argonaut, and much of our time has been spent working on stability.  Most of those fixes have been backported and slated for the next argonaut point release (v0.48.1).  I’ll include both below; see the 0.48.1 release notes (when they’re available later this week) to see what changes with argonaut.

  • mon: ‘ceph osd crush move’ command lets you rearrange your CRUSH hierarchy
  • osd: scrub efficiency improvement
  • osd: capability grammar improvements
  • osd: many bug fixes
  • msgr: various messenger bug fixes
  • librados: several bug fixes (rare races, locking errors)
  • mon: several bug fixes (rare races causing crashes)
  • log: fix in-memory buffering behavior (to only write log messages on crash)
  • ceph-disk-prepare: creates and labels GPT partitions
  • rados: ability to copy, rename pools

There is also lots of work going on with RBD to get the layering working.  This didn’t quite make the 0.50 cutoff, but will be testable in the 0.51 release (or sooner, for those interested in testing the release candidate).  The devops deployment work with Chef and upstart is also progressing nicely, although it is still not quite ready for wide use.  We’ve also been working on some OSD threading and peering improvements that will appear in v0.50.

For those of you using our Debian/Ubuntu packages, please note that the URL is now slightly different for the development release.   The stable (e.g., argonaut) release will remain at the old URL (http://ceph.com/debian) while the development releases will live at http://ceph.com/testing.

You can get this latest development release at:

 

v0.48 “argonaut” released

We’re pleased to announce the release of Ceph v0.48, code-named “argonaut.”  This release will be the basis of our first long-term stable branch.  Although we will continue to make releases every 3-4 weeks, this stable release will be maintained with bug fixes and select non-destabilizing feature additions for much longer than that.  Argonaut is recommended for production users of rados and librados, rbd and librbd, and radosgw.

The upgrade to v0.48 argonaut from previous versions includes a disk-format upgrade.  Please note:

  • You will not be able to downgrade from v0.48 to a previous version.
  • Each ceph-osd will need some time to convert its local data before rejoining the cluster.  If you need to maintain availability, you will need to do a “rolling upgrade” by restarting daemons on each host or rack in sequence and allowing the cluster to recover before moving on to the next one.  Note that for non-btrfs file systems especially this can be slow (many hours); plan accordingly.
  • The ceph tool’s -s and -w commands from previous versions are incompatible with this version.  Upgrade your client tools with your monitor if you rely on those commands.
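
The rolling upgrade described above can be sketched roughly as follows.  This is a dry-run outline, not a tested upgrade procedure: restart_host and cluster_recovered are hypothetical placeholder hooks (they are not part of Ceph), and POLL_SECONDS is an illustrative variable.  Fill them in with whatever restarts the daemons and checks cluster health in your environment.

```shell
#!/bin/sh
# Dry-run sketch of a rolling restart. Both hooks below are hypothetical
# placeholders, not Ceph commands: replace their bodies with whatever
# restarts the daemons on a host and reports cluster health for you.

restart_host() {
    # Placeholder: restart the Ceph daemons on host "$1".
    # As written, this only prints what would be done.
    echo "restart $1"
}

cluster_recovered() {
    # Placeholder: return 0 once the cluster has finished recovering.
    # One option (an assumption about your setup) is to check that
    # `ceph health` reports HEALTH_OK. As written, it always succeeds.
    return 0
}

rolling_restart() {
    # Restart one host (or rack) at a time, and do not move on to the
    # next one until the cluster has recovered from the previous restart.
    for host in "$@"; do
        restart_host "$host" || return 1
        until cluster_recovered; do
            sleep "${POLL_SECONDS:-30}"
        done
        echo "recovered after $host"
    done
}
```

Calling rolling_restart with a list of host names walks them in order.  The important part is the until loop: since data conversion on non-btrfs file systems can take hours, the wait between hosts may be long, and the loop deliberately has no timeout.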

The highlights for this release include:


v0.31 released

We’ve released v0.31. Notable changes include:

  • librados, libceph: can now access multiple clusters in same process
  • osd: snapshot rollback fixes
  • osd: scrub race
  • mds: fixed lock starvation issue
  • client: cache ref counting fixes
  • client: snap writeback, umount hang, cache pressure, other fixes
  • radosgw: atomic PUT

There is also the usual mix of bug fixes and code cleanup all over the tree. Much of the work has also been focused on teuthology, our Ceph test/QA framework.

For the current sprint (v0.32) we are working on:

  • osd: prehashing of PG objects (to facilitate efficient PG splits/merges)
  • instrumentation (simpler query-based interface, collectd plugins)
  • Chef scripts
  • mds: clustering fixes
  • uclient: chasing down a few hard to hit bugs
  • radosgw: coherent caching of bucket acls
  • more qa (teuthology, tests)

v0.31 can be found in the usual places:

Linus vs FUSE

I can’t decide whether Linus is amused or annoyed by the extent to which people hang on his every word, or go nuts over his random rants about this or that. People still talk about his pronouncement about O_DIRECT and tripping monkeys (which has now found a home on the open(2) man page). The latest hullabaloo is about his decree that all FUSE-based file systems are toys.

Clearly, as many have pointed out, calling all such systems “toys” isn’t completely fair. But then it wouldn’t be fun to say it if it were strictly true. There are real systems (big and fast) built on FUSE, just as there are such systems built with Java, Visual Basic, COBOL, and every other platform/technology we love to mock.

I haven’t seen PLFS come up yet in the discussion, but I think it’s worth mentioning just because it is such a good example of optimizing for the cases that actually matter for your workload. For those not familiar, PLFS (parallel log-structured file system) is a FUSE-based file system built at LANL for their huge many-thousand node clusters that turns all random IO sequential by building a mess of intermediate indices. It sounds like it would be a disaster, but in practice it speeds up their workloads by several orders of magnitude, simply because the underlying parallel file systems on which it is stacked are so bad at those workloads.

Anyway, there are just a few points I wanted to make about the kernel vs userspace file systems, having implemented the Ceph client using both. At the risk of stating the obvious:

  • There is nothing you can do in userspace that you can’t also do in the kernel. Sure, development can be harder in the kernel, but you have unparalleled access to the system. The only significant technical disadvantage of a kernel implementation is fault isolation: a buggy FUSE-based file system won’t take down the system with it.
  • Implementation is easier with FUSE. At least for something basic. There are some key problems that are harder to solve because of limitations in the interface.
  • Memory management is easier in the kernel. AB is right when he says that the memory management and file system need to work together. The problem is that it is difficult to push memory management into userspace when you are not the only tenant on the machine. (I suspect that in most of the big production environments where userspace file systems are used, the fs either is the sole tenant or is given some fixed amount of RAM to work with.) The kernel VM, on the other hand, will apply cache pressure dynamically based on the demands of all users of the system. Trying to do that in userspace is extremely awkward at best.
  • Managing cache coherency is easier in the kernel. Some people don’t care about this (e.g., see NFS, or any of the “toys” Linus was referring to), but we do. This is mainly a result of the limited FUSE interface. You can probably avoid the issue by simply not using the kernel dentry and page caches and reimplementing it all in userspace. That’s a simple enough approach, but is slow, and fails to leverage years of work invested in the core Linux VFS code.
  • FUSE may be partly to blame. Jeff Darcy has made the point that many of the FUSE shortcomings aren’t inherent to userspace storage, but artifacts of the current interface and kernel politics. Maybe that’s the case, but that is the world we live in. No file system that doesn’t work on Linux (or maybe *BSD) is relevant. And for what it’s worth, most of the people I see complaining about kernel community intransigence haven’t even tried to work upstream; it’s easier than you think, as long as the code you’re pushing isn’t crap.

Which is better for any given project in the end is probably more of a business decision: technical investment, performance, time to market, ease of deployment. If you’re talking purely about the technical limitations of the environment, however, it’s hard to beat the kernel.

Or, if you can, implement both. It makes these sorts of debates that much more fun.

v0.21 released

It’s been a while, but v0.21 is ready.  Most of the work this time around has been on stability. There is one key new feature, however: RBD, the rados block device, which lets you create a virtual disk backed by objects stored in the Ceph cluster.  The images can be mapped natively by the ceph kernel module or via a driver in qemu/KVM.  Although neither of those drivers is upstream yet, the server side functionality and admin tools are in place.

Changes since v0.20 include:

  • improved logging infrastructure
  • log rotate
  • mkfs improvements
  • rbd tool, and rados class
  • mds: return ENOTEMPTY when removing directory with snapshots
  • mds: lazy io support (experimental)
  • msgr: send messages directly to connection handles (more efficient)
  • faster atomic_t via libatomic-ops
  • mon: recovery improvements, fixes (e.g. when one mon is down for a long time)
  • mon: warn on monitor clock drift
  • osd: large object support
  • osd: heartbeat improvements, fixes
  • osd: journaling fixes, improvements (bugs, better use of direct io)
  • osd: snapshot rollback op (for rbd)
  • radosgw fixes, improvements
  • many memory leaks and other bugs fixed

The project roadmap has been updated and is available via the issue tracker.

Relevant URLs:

v0.10 released

We’ve released v0.10.  The big items this time around:

  • kernel client: some cleanup, unaligned memory access fixes
  • much debugging of MDS recovery: kernel client will now correctly untar, compile kernel with MDS server running in a 60 second restart loop.
  • a few misc mds fixes
  • osd recovery fixes
  • userspace client: many bug fixes, now quite stable
  • librados improvements

Also,

  • libceph: a thin wrapper around the POSIXy ceph interface

which is being used to write a file system ‘Broker’ for the Hypertable distributed database project.  We’re also planning on (finally) getting the Hadoop ceph client in working order.

We’re also continuing to work on the librados object storage layer, including a standalone fastcgi-based gateway exposing an S3-compatible restful interface, the goal being a drop-in replacement for apps using S3. (It won’t let you use the rados snapshots or object classes, though, and won’t scale as efficiently.)

As far as testing goes, we’re filling up a 100TB cluster locally and will start failure testing on that shortly.  And this past week we’ve been thoroughly testing (single-node) MDS recovery.  Next up is looping OSD restarts and power cycling.

Major todo items coming up next:

  • client authentication
  • additional metadata to facilitate catastrophic rebuild of fs hierarchy
  • stabilize clustered mds

We’ve also sent the Linux kernel client code off to LKML and -fsdevel again, and are continuing to work toward a merge into the mainline kernel.

UPDATE: Here are the relevant URLs:

Snapshot progress

If things seem a bit slow lately, it’s because I’ve been primarily working
on implementing the snapshot mechanism for the last few weeks.  This is
coming along pretty well: I can take snapshots and access snapshotted
content.  The interaction with recursive accounting has been tricky
because delayed propagation means changes may propagate into recent
snapshots as they work their way up the hierarchy, but I think I have
that one nailed.

Here’s how it works:

$ tar jxf ~/src/linux-2.6.24.tar.bz2 &
[1] 18715
$ mkdir linux-2.6.24/.snap/1   # create a few snapshots
$ mkdir linux-2.6.24/.snap/2
$ mkdir linux-2.6.24/.snap/3
$ kill %1
$ ls -al linux-2.6.24/.snap    # see that dir sizes increased over time
total 3
drwxr-xr-x 1 sage sage 1205808 Jul 24 10:23 ./
drwxr-xr-x 1 sage sage 1205808 Jul 24 10:23 ../   # live copy
drwxr-xr-x 1 sage sage 1028511 Jul 24 10:23 1/
drwxr-xr-x 1 sage sage 1144455 Jul 24 10:23 2/
drwxr-xr-x 1 sage sage 1177913 Jul 24 10:23 3/
[1]+  Terminated              tar jxf ~/src/linux-2.6.24.tar.bz2
$ ls linux-2.6.24/.snap/1/Documentation/ | wc
23      24     472
$ ls linux-2.6.24/.snap/3/Documentation/ | wc
32      33     680

Etc.  The ‘.snap’ hidden dir is accessible from anywhere (like .snapshot
on a Netapp).  Snapshots can be created for any directory at any time,
however, and recursively apply to all nested content.

Still left to do:

  • properly handle directory renames (which interact in interesting ways with the snapshot realm tree).
  • snapshot deletion
  • garbage collection (metadata and data)
  • update kernel client (I’m currently working just with the fuse client for faster prototyping)

Next up: snapshots!

One of the last intrusive additions I have planned is a flexible snapshot mechanism.  I haven’t been able to figure out how to map writeable snapshots onto the current object and metadata storage model, unfortunately, so it’ll be read-only snapshots for now.  Ceph snapshots will be significantly more flexible than what you find with WAFL or ZFS, though.  The goal is to get behavior like:

$ cd any/random/directory
$ ls .snapshot
$ mkdir .snapshot/foo      # create a snapshot
$ ls .snapshot
foo
$ cd a/deeper/dir
$ ls .snapshot
foo
$ mkdir .snapshot/bar      # create another one
$ ls .snapshot
foo    bar
$

That is, users can create snapshots, from a standard shell, for any subtree of the directory hierarchy.  (In contrast, most proprietary vendors’ snapshots are for entire volumes only, while ZFS can only snapshot predefined subvolumes.)  And snapshots will be visible via a hidden .snapshot (or similar) directory from any directory.  Something similarly convenient (rmdir?) will be used to delete snapshots from the command line. The naming will be a bit more complicated than in the above example to avoid name collisions, but that is the basic idea.

© 2013, Inktank Storage, Inc. All rights reserved.