
v0.64 released

A new development release of Ceph is out. Notable changes include:

  • osd: monitor both front and back interfaces
  • osd: verify both front and back network are working before rejoining cluster
  • osd: fix memory/network inefficiency during deep scrub
  • osd: fix incorrect mark-down of osds
  • mon: fix start fork behavior
  • mon: fix election timeout
  • mon: better trim/compaction behavior
  • mon: fix units in ‘ceph df’ output
  • mon, osd: misc memory leaks
  • librbd: make default options/features for newly created images (e.g., via qemu-img) configurable
  • mds: many fixes for mds clustering
  • mds: fix rare hang after client restart
  • ceph-fuse: add ioctl support
  • ceph-fuse/libcephfs: fix for cap release/hang
  • rgw: handle deep uri resources
  • rgw: fix CORS bugs
  • ceph-disk: add ‘[un]suppress-active DEV’ command
  • debian: rgw: stop daemon on uninstall
  • debian: fix upstart behavior with upgrades

You can get v0.64 from the usual locations:

v0.61.3 released

This is a much-anticipated point release for the v0.61 Cuttlefish stable series.  It resolves a number of issues, primarily with monitor stability and leveldb trimming.  All v0.61.x users are encouraged to upgrade.

Upgrading from bobtail:

  • There is one known problem with mon upgrades from bobtail.  If the ceph-mon conversion on startup is aborted or fails for some reason, we do not correctly error out, but instead continue with (in certain cases) odd results.  Please be careful if you have to restart the mons during the upgrade.  A 0.61.4 release with a fix will be out shortly.
  • In the meantime, for current cuttlefish users, 0.61.3 is safe to use.

Notable changes since v0.61.2:

  • mon: paxos state trimming fix (resolves runaway disk usage)
  • mon: finer-grained compaction on trim
  • mon: discard messages from disconnected clients (lowers load)
  • mon: leveldb compaction and other stats available via admin socket
  • mon: async compaction (lower overhead)
  • mon: fix bug incorrectly marking osds down with insufficient failure reports
  • osd: fixed small bug in pg request map
  • osd: avoid rewriting pg info on every osdmap
  • osd: avoid internal heartbeat timeouts when scrubbing very large objects
  • osd: fix narrow race with journal replay
  • mon: fixed narrow pg split race
  • rgw: fix leaked space when copying object
  • rgw: fix iteration over large/untrimmed usage logs
  • rgw: fix locking issue with ops log socket
  • rgw: require matching version of librados
  • librbd: make image creation defaults configurable (e.g., create format 2 images via qemu-img)
  • fix units in ‘ceph df’ output
  • debian: fix prerm/postinst hooks to start/stop daemons appropriately
  • upstart: allow uppercase daemon names (and thus hostnames)
  • sysvinit: fix enumeration of local daemons by type
  • sysvinit: fix osd weight calculation when using -a
  • fix build on unsigned char platforms (e.g., arm)

See the full release notes for more details.

You can get v0.61.3 from the usual places:

Hi! My name is Eleanor, and I’m working on Ceph as an intern for Inktank this summer. My task for the summer is to use the Ceph API to create a lock-free, distributed key-value store suitable for storing large sets of small key-value pairs. I’ve just finished my first year at Pomona College, where I’m majoring in Computer Science. I had previously explored concurrency with a Computer Science professor at the University of Utah, but this is my first experience with file systems, my first experience with a startup, and my first experience working on an open source software project. At the beginning of the summer, I was somewhat terrified. On my first day, as Sam walked me through how to use Github, I worried that I was in over my head. There were so many skills and so much vocabulary that came naturally to everyone around me but with which I had little to no familiarity. But the Ceph team proved to be extraordinarily welcoming and supportive as I got up to speed.

As a warm-up exercise to gain familiarity with the API, I began the summer by creating an object map benchmarking tool. Librados objects are the basic unit of storage in Ceph. Objects have a number of properties:


v0.47.3 released

This is a bugfix release with one major fix and a few small ones:

  • osd: disable use of the FIEMAP ioctl by default as its use was leading to corruption for RBD users
  • a few minor compile/build/specfile fixes

I was going to wait for v0.48, but that is still several days away.  If you are using RBD in production, you should either add ‘filestore fiemap = false’ to your ceph.conf file or upgrade.

You can get this release from the usual places, with the exception of Debian sid and wheezy packages; the upstream repos were sufficiently broken to make pbuilder cranky so I left them out.

v0.30 released

We’re pushing out v0.30. Highlights include:

  • librbd: Fixed race/crash
  • mds: misc clustered mds fixes
  • mds: misc rename journaling/replay fixes
  • mds: fixed flock deadlock when processes die during lock wait
  • osd: snaptrimmer fixes, misc races, recovery bugs
  • auth: fixed cephx race/crash
  • librados: rados bench fix
  • librados: flush
  • radosgw: multipart uploads
  • debian: gceph moved to separate package
  • lots of g_conf refactoring, removing of globals, and related cleanup
  • qa: lots

The focus this time around continues to be on QA, bug fixes, and cleanup.

Relevant URLs:

v0.29.1 released

We’ve released 0.29.1 with a few fixes. The main thing is a fix for a race condition in librbd that was biting people using rbd with qemu/kvm.

  • librbd: fix for race/crash
  • osd: fix memory leak
  • osd: fix clone size accounting
  • mkcephfs: fix ceph.conf reference

Relevant URLs:

  • Direct download at: http://ceph.newdream.net/downloads/ceph-0.29.1.tar.gz
  • Debian/Ubuntu packages: see http://ceph.newdream.net/wiki/Debian

v0.29 released

Ceph v0.29 is ready.  Notable changes since v0.28.2 include:

  • mds: some fixes for multiple clients accessing the same directory
  • obsync: supports rados/rgw backend
  • osd: fix bug causing recovering objects to be excluded from object listing
  • rados: import/export support for xattrs, incremental updates
  • radosgw: misc fixes
  • libceph: readdir bug fixes
  • osd: fix for various heartbeat failures

Mainly we saw continued stabilization of the OSD peering code, which is now working quite well for us.  For v0.30 we’re continuing to clean up a few OSD corner cases and working on clustered MDS problems.

Relevant URLs:

RADOS snapshots

Some interesting issues came up when we started considering how to expose the RADOS snapshot functionality to librados users.  The object store exposes a pretty low-level interface to control when objects are cloned (i.e. when an object snapshot is taken via the btrfs copy-on-write ioctls).  The basic design in Ceph is that the client provides a “SnapContext” with each write operation that indicates which snapshots logically exist for the given object; if the version already stored by the OSD is older than the newest snapshot in the SnapContext, a clone is created before the write is applied.  It is the Ceph MDS’s responsibility to keep track of which snapshots apply to which objects (remember, Ceph lets you snapshot any subdirectory) and to do all the synchronization that ensures mounted clients have up to date SnapContexts.
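The clone-before-write rule described above can be sketched as a small, self-contained toy model. This is only an illustration under stated assumptions: the types toy_snap_context and toy_object, the function toy_write, and the fixed-size clone table are hypothetical stand-ins, not the actual OSD data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the real Ceph structures. */
typedef uint64_t toy_snapid_t;

#define TOY_MAX_CLONES 8

struct toy_snap_context {
    toy_snapid_t seq;   /* newest snapshot id the writer knows about */
};

struct toy_object {
    toy_snapid_t snap_seq;                    /* newest snapshot already accounted for */
    int data;                                 /* current ("head") contents */
    int clone_data[TOY_MAX_CLONES];           /* contents preserved by each clone */
    toy_snapid_t clone_snap[TOY_MAX_CLONES];  /* snapshot id each clone belongs to */
    size_t num_clones;
};

/*
 * Apply a write to the head of the object.
 * Returns 1 if a clone was created first, 0 if no clone was needed,
 * and -1 (without writing) if the toy clone table is full.
 */
int toy_write(struct toy_object *obj, const struct toy_snap_context *snapc,
              int new_data)
{
    int cloned = 0;

    /* The stored version predates the newest snapshot: preserve it first. */
    if (obj->snap_seq < snapc->seq) {
        if (obj->num_clones == TOY_MAX_CLONES)
            return -1;
        obj->clone_snap[obj->num_clones] = snapc->seq;
        obj->clone_data[obj->num_clones] = obj->data;
        obj->num_clones++;
        obj->snap_seq = snapc->seq;
        cloned = 1;
    }

    obj->data = new_data;
    return cloned;
}
```

In this model, the first write that carries a newer snapshot id triggers exactly one clone; later writes with the same context simply overwrite the head.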

In creating a raw object storage interface, how is that underlying functionality best exposed?  One option is to expose some functions that allow users to create, manipulate, and possibly store SnapContexts, and manually specify a context for each write (or a snapshot id to read).  This exposes the same functionality Ceph makes use of, but essentially drops all of the issues with synchronization and storage in the librados user’s lap.  How should one go about keeping multiple processes accessing the RADOS store in sync (i.e. agreeing on which snapshots exist) to get the semantics people want?

Our solution is to introduce some basic snapshot accounting to RADOS.  We allow per-pool snapshots to be created via RADOS itself, and include that snap information in the OSDMap (the global data structure used to synchronize the activities of OSDs and clients).  If a client performs a write and does not manually specify a SnapContext (as Ceph does), an appropriate context will be generated from the pool snapshot information in the OSDMap.
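As a rough sketch of that fallback, the following toy code builds a default context from a pool's snapshot list when the caller does not supply one. Everything here is an assumption made for illustration: the names toy_pool_info, toy_snap_context and toy_default_snapc are hypothetical, and the newest-first ordering of the generated list is a choice of this sketch rather than a statement about the real encoding.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_snapid_t;

#define TOY_MAX_SNAPS 16

/* Per-pool snapshot info, as it might be carried in a toy "OSDMap". */
struct toy_pool_info {
    toy_snapid_t snaps[TOY_MAX_SNAPS];  /* ids in creation (ascending) order */
    size_t num_snaps;                   /* must be <= TOY_MAX_SNAPS */
};

struct toy_snap_context {
    toy_snapid_t seq;                   /* newest snapshot id, 0 if none */
    toy_snapid_t snaps[TOY_MAX_SNAPS];  /* newest first (sketch's convention) */
    size_t num_snaps;
};

/*
 * Choose the context for a write: use the caller's context if one was
 * given (manual != NULL), otherwise derive one from the pool snapshots.
 */
void toy_default_snapc(const struct toy_pool_info *pool,
                       const struct toy_snap_context *manual,
                       struct toy_snap_context *out)
{
    if (manual != NULL) {
        *out = *manual;
        return;
    }

    out->num_snaps = pool->num_snaps;
    for (size_t i = 0; i < pool->num_snaps; i++)
        out->snaps[i] = pool->snaps[pool->num_snaps - 1 - i];
    out->seq = pool->num_snaps > 0 ? out->snaps[0] : 0;
}
```

A client that never builds its own context therefore picks up every pool snapshot recorded in the map it currently holds.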

Snapshot creation is done via the monitor, either via a librados API call or an administrator command like ‘ceph osd pool mksnap poolname snapname’.  This updates the OSDMap to include the new snap for that pool, and that map propagates across the cluster.

int rados_snap_create(rados_pool_t pool, const char *snapname);
int rados_snap_remove(rados_pool_t pool, const char *snapname);
int rados_snap_list(rados_pool_t pool, rados_snapid_t *snaps, int maxlen);
int rados_snap_get_name(rados_pool_t pool, rados_snapid_t id, char *name, int maxlen);

To read an existing snapshot, a new RADOS pool context is opened and a specific snapshot id is selected (the id can be obtained via rados_snap_list above):

rados_pool_t snapped_pool;
rados_open_pool("data", &snapped_pool);
rados_set_snap(snapped_pool, 2);

Subsequent reads via the snapped_pool handle will return data from snapid 2, and any attempts to write will return -EROFS (Read-only file system).  Reading and writing via other rados_pool_t handles will be unaffected.  By default any newly opened pool handle will be “positioned” at the “head”: the current, writeable version of the object pool.
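The per-handle behavior can be modeled in a few lines. This is a toy sketch, not the librados implementation: toy_pool_handle, TOY_SNAP_HEAD and the functions below are hypothetical names, and the sentinel value chosen for "head" is an assumption of the sketch.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "the current, writeable version". */
#define TOY_SNAP_HEAD UINT64_MAX

struct toy_pool_handle {
    uint64_t snap;   /* snapshot this handle reads from, or TOY_SNAP_HEAD */
};

/* A newly opened handle is positioned at the head. */
void toy_open(struct toy_pool_handle *h)
{
    h->snap = TOY_SNAP_HEAD;
}

/* Reposition this handle only; other handles are unaffected. */
void toy_set_snap(struct toy_pool_handle *h, uint64_t snapid)
{
    h->snap = snapid;
}

/* Returns 0 if a write may proceed, -EROFS if the handle is on a snapshot. */
int toy_check_write(const struct toy_pool_handle *h)
{
    return h->snap == TOY_SNAP_HEAD ? 0 : -EROFS;
}
```

Because the snapshot position lives in the handle rather than in any shared state, a process can keep one handle on a snapshot for reading and another on the head for writing.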

Map propagation is fast, but not synchronous: it is possible for one client to create a snapshot and for another client to then perform a write that does not preserve some data in the new snap.  So we do not completely solve the synchronization problem for you: pool snapshots alone do not give you a global, ‘instantaneous’ point-in-time snapshot.  Doing so in a large distributed environment with many clients and many servers, operating in parallel, is a challenge in any system.

From the perspective of the client creating the snapshot, however, the snapshot is ordered with respect to IO performed before and after rados_snap_create.   RADOS already does some synchronization with respect to OSDMap updates to ensure that readers, writers and OSDs all agree on the current state of a placement group when performing IO.  Any IO initiated after the snapshot is created will be tagged with the new OSDMap version, and any OSD will make sure it has either the same or a newer version of the map before performing that IO.  Other clients will not see a clear ordering unless the librados user takes steps to coordinate clients such that they all obtain the updated OSDMap (describing the new snapshot) before performing new IO.
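The map-version check described above can be illustrated with a minimal model. The names below (toy_osd, toy_request, toy_osd_check, toy_osd_receive_map) are hypothetical, and the real OSD logic involves far more state; this only shows the "same or newer map before performing IO" rule.

```c
#include <stdint.h>

typedef uint32_t toy_epoch_t;

struct toy_osd {
    toy_epoch_t map_epoch;      /* newest OSDMap version this OSD holds */
};

struct toy_request {
    toy_epoch_t client_epoch;   /* OSDMap version the IO was tagged with */
};

enum toy_action {
    TOY_PROCESS,       /* OSD's map is the same or newer: perform the IO */
    TOY_WAIT_FOR_MAP   /* OSD's map is older: catch up before the IO */
};

/* Decide whether the OSD may perform the IO with the map it has now. */
enum toy_action toy_osd_check(const struct toy_osd *osd,
                              const struct toy_request *req)
{
    return osd->map_epoch >= req->client_epoch ? TOY_PROCESS
                                               : TOY_WAIT_FOR_MAP;
}

/* Record a received map; an older map never replaces a newer one. */
void toy_osd_receive_map(struct toy_osd *osd, toy_epoch_t epoch)
{
    if (epoch > osd->map_epoch)
        osd->map_epoch = epoch;
}
```

A client that created a snapshot at epoch N tags its later IO with N, so in this model an OSD still at N-1 defers the request until it has learned about the snapshot. A second client still tagging requests with N-1 gets no such guarantee, which is the ordering gap the paragraph describes.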

If there is demand, we may still expose an API to manipulate raw SnapContexts for advanced users wanting different snapshot schedules for different objects.  It will be their responsibility to manage all client synchronization in that case, as that snapshot information won’t be propagated via the OSDMap.

For anybody wanting perfect cluster-wide point-in-time snapshots without any client coordination… well, sorry.  Experience with file system snapshots has shown that proper synchronization is never something that the storage system alone can get right due to caching at all layers of the system.  NFS client write-back caches make server-based snapshots (e.g., NetApp filers) imperfect.  Snapshots in local file systems utilize some kernel machinery to momentarily quiesce all IO while the snapshot is created, but even then applications may not have the on-disk files (as seen by the OS) in a consistent state.  Coordination with applications is always necessary for any fully ‘correct’ solution, so we won’t try to solve the whole problem based on some false sense of what ‘correct’ is.

Recursive accounting

This is somewhat old news, but the recursive accounting changes have been merged into both the ‘unstable’ and ‘master’ branches, and the feature is documented in the wiki.

I’m extremely curious what people think of this feature (useful? confusing?).  It takes liberties with two common behaviors of directories: first, with the “rbytes” mount option, the directory size is suddenly related to the directory’s recursive contents, and may appear very large.  Second, doing “cat dir” will dump the directory’s full stats instead of returning -EISDIR (Is a directory).  I’m hoping the latter behavior change is harmless, given that until relatively recently reading a directory dumped the encoded directory contents to your terminal…

© 2013, Inktank Storage, Inc. All rights reserved.