The Ceph Blog

Earlier Posts

v0.25 released

We’ve just tagged the v0.25 release.  Most of the work here is in the OSD cluster, a new librbd library (refactoring existing RBD infrastructure), and a librados API refresh.

The librados changes are an attempt to clean up the API warts sooner rather than later.  If there are any issues with the new interface, we’d like to hear about them!

The new librbd library sits on top of librados and captures the RBD striping, snapshotting, and other functionality, presenting a simple block device-like interface.  The qemu/KVM driver is being rewritten in terms of librbd, which will vastly simplify the upstream qemu code and allow us to fix bugs and add functionality without being tied to a specific version of qemu/KVM.

Other changes since v0.24 include:
  • osd: fix map churn while peering
  • osd: “watch/notify” framework for RBD synchronization
  • osd: ability to read from closest replica
  • osd: many bug fixes
  • osd: improved recovery behavior (tolerate missing objects)
  • mds: misc clustering fixes
  • mds: fix respawn
  • mds: misc bug fixes
  • mds: “hot standby” behavior
  • /etc/ceph/keyring instead of keyring.bin
  • ability to log to syslog

The focus for v0.26 will remain on stability, primarily with the OSD cluster, RBD, and radosgw.  Internally, we’re focusing on building out our QA and performance testing infrastructure.

Relevant URLs:
  • Direct download: http://ceph.newdream.net/download/ceph-0.25.tar.gz
  • Debian and Ubuntu packages: see http://ceph.newdream.net/wiki/Debian

v0.24.3 released

We’ve released v0.24.3 with more bug fixes, including a fix for a bug that could lose data in certain cases when OSDs restart during recovery. It’s pretty much all OSD stuff, which is where we’re currently focusing our testing efforts.

  • osd: misc crashes, slowness
  • osd: fix bug that loses backlog (and potentially data)
  • osd: scrub fixes
  • osd: snap_trimmer fixes
  • mds: fix bug with multi-client interaction/slowness

SCALE 9x

I’ll be giving a talk at SCALE 9x targeted toward system administrators and users.  It’ll be Sunday, February 27th at 4:30pm in the Century AB room.  Hope to see you there!

UPDATE: Here are the slides, as ODF and PDF.

v0.24.2 released

This is a bugfix release.  Changes since v0.24.1 include:

  • osd: fix journal ordering bug (crash)
  • osd: fix long sync delay
  • osd: don’t overflow journal size
  • osd: snapshot trimming bugs
  • osd: fix msgr connection issues after osd restart
  • osd: don’t crash on no-journal case
  • mds: fix double-pinning of stray inodes (crash)
  • mds: don’t block signals after restart
  • mds: fix journaling of root inode layout policy
  • mds: fix journal dump
  • mds: C_Gather locking fix
  • msgr: fix connection cleanup on non-daemons
  • monclient: fix locking

v0.24.1 released

v0.24.1 has been released, with a number of bug fixes from v0.24.  These include:

  • msgr: fix races during connection teardown
  • mds: fix bug during directory removal
  • mds: fix replay issue when mds restarts immediately after mkfs
  • filestore: fix journal ordering problem (triggered under load)
  • osd: fix recovery issue
  • osd: several scrub bug fixes

This is also the first time I’ve built Ubuntu packages (for lucid and maverick), as the libcrypto++ dependency resolves to a different library version on Ubuntu and Debian sid.  If anyone has any problems there, please let us know.  libcrypto++ is unfortunately also a hassle under Red Hat, as it is not included in RHEL and was only recently added to Fedora.  We plan to start building RHEL/CentOS and Fedora packages soon, and will shortly be updating the wiki with information on gathering all the dependencies needed to build from source.

RBD upstream updates

QEMU-RBD

The QEMU-RBD block device has been merged upstream into the QEMU project. QEMU-RBD was originally created by Christian Brunner, and is binary compatible with the native Linux RBD driver. It allows the creation of QEMU block devices that are striped over objects in RADOS, the Ceph distributed object store. As with the corresponding Linux device driver, the QEMU driver gets all the RBD goodies: thin provisioning, reliability, scalability, and snapshots!
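To make the striping idea concrete, here is a minimal sketch of how a byte range on a block device could be split across fixed-size objects. The 4 MiB object size and the `object_extents` helper are assumptions made for illustration; they are not taken from the RBD driver itself.

```python
from typing import Iterator, Tuple

# Assumed object size (4 MiB), chosen only for this illustration.
OBJECT_SIZE = 4 * 1024 * 1024


def object_extents(offset: int, length: int,
                   object_size: int = OBJECT_SIZE) -> Iterator[Tuple[int, int, int]]:
    """Yield (object_index, offset_in_object, length) pieces covering a byte range."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    end = offset + length
    while offset < end:
        index = offset // object_size    # which object holds this byte
        within = offset % object_size    # where in that object the piece starts
        chunk = min(object_size - within, end - offset)
        yield index, within, chunk
        offset += chunk


if __name__ == "__main__":
    mib = 1024 * 1024
    # A 6 MiB write starting 3 MiB into the device touches three objects.
    for piece in object_extents(3 * mib, 6 * mib):
        print(piece)
```

Because each piece maps to an independent object, a large read or write can be spread across many storage nodes, and objects that are never written need not exist at all, which is one way thin provisioning can fall out of such a layout.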

libvirt

libvirt is a virtualization library that allows controlling virtual machines (such as QEMU-based VMs, but also others) using a single API. There are many tools already built around it (e.g., virsh, virt-manager, etc.), and adding the ability to configure RBD devices via the library makes RBD work in the existing tools. With the help of the Sheepdog project (who also merged their QEMU block device upstream into QEMU recently), we were able to get RBD (and Sheepdog, and also nbd) support upstream into libvirt. Basically a new “network” disk type was added, and there are currently 3 possible types for such a disk: nbd, sheepdog, or rbd. For each you can specify a host name. E.g., for rbd the host name(s) would hold the IP address and TCP port for the Ceph cluster monitor(s).
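As a rough illustration of the “network” disk type, the sketch below builds a disk element with Python's standard XML library. The `network_disk_xml` helper, the image name `rbd/myimage`, the monitor address, and the target device are placeholders invented for this example; check the libvirt documentation for the exact schema your version accepts.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

SUPPORTED_PROTOCOLS = ("nbd", "sheepdog", "rbd")


def network_disk_xml(protocol: str, name: str,
                     hosts: Iterable[Tuple[str, int]],
                     target_dev: str = "vda") -> str:
    """Return a <disk type='network'> element as a string (hypothetical helper)."""
    if protocol not in SUPPORTED_PROTOCOLS:
        raise ValueError("unsupported protocol: %r" % protocol)
    disk = ET.Element("disk", {"type": "network", "device": "disk"})
    source = ET.SubElement(disk, "source", {"protocol": protocol, "name": name})
    # For rbd, each host entry names one cluster monitor.
    for host, port in hosts:
        ET.SubElement(source, "host", {"name": host, "port": str(port)})
    ET.SubElement(disk, "target", {"dev": target_dev, "bus": "virtio"})
    return ET.tostring(disk, encoding="unicode")


if __name__ == "__main__":
    # Placeholder pool/image name and monitor address.
    print(network_disk_xml("rbd", "rbd/myimage", [("192.168.0.10", 6789)]))
```

Listing more than one host entry gives the client several monitors to try, so the guest does not depend on any single monitor being reachable.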

libvirt support for the Linux native kernel rbd driver is also in the works, which will allow rbd to be used with non-qemu VMs supported by libvirt (e.g., Xen, VirtualBox, VMware, etc.).

Linux Kernel

As we posted before, the native Linux RBD device was merged into the upcoming Linux kernel release (2.6.37), which will be out in a few weeks. Since the original merge we’ve modified the RBD sysfs interface to conform better to the sysfs requirements. Originally, the RBD driver was based on another Linux block device called osdblk and inherited its sysfs interface, which was monolithic and kept a single sysfs entry per config option for all the devices. This was both wrong and cumbersome, as we needed to specify the device id for each operation. The new interface moves the rbd sysfs subdir to a better location (/sys/bus/rbd) and creates a subdir per device, so that all operations for a single device are grouped together and there’s no need to specify the device name. We also create a subdir per snapshot under the device that holds all its information, and we dropped the one-big-list-for-all entry.

All in all, it was a relatively big change to introduce well into the release cycle, but we believe it was worth it.
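The per-device layout described above lends itself to simple scripting. The following is a minimal sketch that walks one subdirectory per device and reads each attribute file. The `devices` subdirectory under /sys/bus/rbd and the `read_rbd_devices` helper are assumptions for illustration; the attribute names you see will depend on your kernel.

```python
import os
from typing import Dict


def read_rbd_devices(devices_dir: str = "/sys/bus/rbd/devices") -> Dict[str, Dict[str, str]]:
    """Return {device_id: {attribute: value}}, one entry per device subdirectory."""
    result: Dict[str, Dict[str, str]] = {}
    if not os.path.isdir(devices_dir):
        return result  # rbd module not loaded, or a different layout
    for dev_id in sorted(os.listdir(devices_dir)):
        dev_path = os.path.join(devices_dir, dev_id)
        if not os.path.isdir(dev_path):
            continue
        attrs: Dict[str, str] = {}
        for entry in sorted(os.listdir(dev_path)):
            path = os.path.join(dev_path, entry)
            if not os.path.isfile(path):
                continue  # skip nested directories such as per-snapshot subdirs
            try:
                with open(path) as fh:
                    attrs[entry] = fh.read().strip()
            except OSError:
                continue  # write-only or otherwise unreadable attribute
        result[dev_id] = attrs
    return result


if __name__ == "__main__":
    for dev_id, attrs in read_rbd_devices().items():
        print(dev_id, attrs)
```

On a machine without any mapped RBD devices the function simply returns an empty dictionary.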

v0.24 released

We’ve released v0.24, just in time for the holidays!  Big changes this time around include:

  • mds: many fixes with clustered failure recovery
  • mds: bloom filter to reduce directory reads
  • mds: configurable directory hash functions (for fragmentation)
  • rbd: import/export tools are smart about holes (i.e., use FIEMAP)
  • osd: many recovery improvements, mostly making data available more quickly
  • osd: automatic background scrubbing when load is low
  • osd: fixes with dedicated backend replication network
  • osd: use new (2.6.37) btrfs ioctls for async snapshot creation
  • replaced openssl dependency with libcrypto++ (licensing issue)
  • librados: “zero-copy” reads
  • misc bug fixes, man pages, and code cleanup

The focus for the next release (v0.25) is on OSD and MDS stability, directory fragmentation recovery, and fsck preliminaries; see the roadmap for more details.

v0.23.1 released

This release includes some bug fixes for v0.23, although there’s nothing here that too many people have been hitting, fortunately.

  • cfuse/libceph: fix crash with clustered mds restart
  • cfuse/libceph: fix hard link caching
  • cfuse/libceph: fix lssnap
  • msgr: fix various races
  • msgr: fix IPv6 address parsing buffer overflow

v0.24 is still a few weeks away, and will include OSD recovery improvements, background scrubbing, and MDS clustering and performance improvements, among other things.

S3-compatible object storage with radosgw

The radosgw has been around for a while, but it hasn’t been well publicized or documented, so I thought I’d mention it here.  The idea is this:

  • Ceph’s architecture is based on a robust, scalable distributed object store called RADOS.
  • Amazon’s S3 has shown that a simple object-based storage interface is a convenient way to write applications, even when that interface is very restrictive.
  • Providing access to Ceph’s object store via an S3-compatible interface is easy with librados.

The result is radosgw, a FastCGI-based proxy that exposes Ceph’s object store via a REST (HTTP-based) interface.  Radosgw implements a subset of Amazon’s API (some Amazon-specific features of ACLs and object versioning aren’t supported), but the subset it does implement aims to be fully compatible.  That means that most existing apps that are designed for S3 can be seamlessly migrated to a Ceph-based object store, provided they allow the hostname to be configured (many hard-code s3.amazonaws.com).
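The point about configurable hostnames can be sketched as follows: a client that takes its endpoint as a parameter works equally well against Amazon or a radosgw instance. The `object_request` helper, the path-style `/bucket/key` URL layout, and the `radosgw.example.com` hostname are assumptions for illustration, and the request is left unsigned, so a real client would still need to add S3 authentication headers.

```python
from urllib.parse import quote
from urllib.request import Request


def object_request(endpoint: str, bucket: str, key: str,
                   method: str = "GET") -> Request:
    """Build (but do not send) a path-style request for bucket/key at endpoint."""
    base = endpoint.rstrip("/")
    # Percent-encode the bucket and key; keep "/" inside keys as a separator.
    url = "{}/{}/{}".format(base, quote(bucket, safe=""), quote(key, safe="/"))
    return Request(url, method=method)


if __name__ == "__main__":
    # Same client code, two different endpoints (the second is a placeholder).
    for endpoint in ("http://s3.amazonaws.com", "http://radosgw.example.com"):
        req = object_request(endpoint, "photos", "2011/cat pictures.jpg")
        print(req.get_method(), req.full_url)
```

An application that hard-codes the hostname would need a code change at the first argument; one that reads it from configuration can be pointed at the gateway without modification.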

It should be noted that this approach has some fundamental limitations:

  • librados provides direct parallel access to storage nodes; radosgw is a single endpoint proxy that sits in front of your storage cluster.  That may actually be a good thing, depending on your security model.
  • The REST-based storage interface is much more restricted than that provided by librados.  librados allows partial object updates, has no object size limits, supports extensible object classes, fine-grained snapshots, and more.
  • The radosgw security model emulates S3’s, and is implemented as a layer on top of librados.  Accessing the same objects via the native librados library will not reflect S3-style ACLs created via radosgw.

Check it out!

v0.23 released

Another month, and v0.23 is out.  The main milestone here is that clustered MDS is pretty stable.  Stable enough that, if you’re interested and willing, we’d like you to try it and let us know what problems you have.  Notably, clustered recovery is not yet well tested (that’s v0.24), so don’t do this unless you’re feeling adventurous.  Directory fragmentation (splitting and merging) is also working, although still off by default.  If you’d like to try that too, add ‘mds bal frag = true’ to your [mds] section.
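To show where the fragmentation option goes, the sketch below embeds a small config fragment and checks it with Python's `configparser`. This is only a stand-in for illustration: Ceph uses its own config parser, and the `frag_enabled` helper is hypothetical.

```python
import configparser

# Example fragment of a ceph.conf with fragmentation switched on.
CEPH_CONF = """
[mds]
; directory fragmentation is off unless this line is present
mds bal frag = true
"""


def frag_enabled(conf_text: str) -> bool:
    """Return True if the [mds] section sets 'mds bal frag' to a true value."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # Treat a missing section or option as "off", matching the stated default.
    return parser.getboolean("mds", "mds bal frag", fallback=False)


if __name__ == "__main__":
    print(frag_enabled(CEPH_CONF))
```

Removing the line, or setting it to false, leaves fragmentation disabled.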

Other notable changes this time around:

  • osd: use new btrfs snapshot ioctls (2.6.37), parallel journaling
  • mds: clustering, replay fixes
  • mon: better commit batches, lower latency updates
  • objecter: bug fixes
  • osd: spread data across multiple xattrs; assert on io/enospc errors
  • osd: start up despite corrupt pg logs
  • ceph: new gui (ceph -g)

The general focus for v0.24 will be continuing OSD stability and clustered MDS recovery.

© 2013, Inktank Storage, Inc. All rights reserved.