Archives: December 2009

Ceph Over Fibre for VMware

We always love it when Ceph users choose to share what they have been doing with the community. Recently, a couple of regulars in the #ceph IRC channel were good enough to give us a very detailed look at how they were using Ceph to power their VMware infrastructure. So, without further ado, read on for a great visual representation and quick summary of Chris Holcombe and Robert Blair’s pet Ceph project!


INTRODUCTION

Hello again!

War. War never changes. Some of you may have been following my bitter rivalry with Mark Shuttleworth. Now, I am perfectly aware that I share nearly as much blame as he does for this entire debacle. We’ve both done things that can’t be undone and we’re just going to have to get past it. (Come on Slashdot? flamebait? You really need an incredibly obvious click-baiting descriptor.) Anyway, I think it’s time to finally bury the hatchet. Let bygones be bygones and all that? I say we all sit down, calmly work through our differences, and find peace in mutual… Oh who am I kidding. The only way to resolve this is through a fight to the death!

Oh no, not me and Shuttleworth. I wouldn’t stand a chance. I’ve heard rumors that Unity can now plant subliminal messages in your dreams. How am I supposed to fight when I can’t even sleep? No, this must be resolved through aquatic lifeform combat. Can the champion Argonaut defend his title from the likes of the upstart challenger Bobtail? Will competitive fighting arcade games from the early 90s make a comeback? Will Protendo and Kobatashi ever be able to reclaim their lost honor? Let the battle commence!

 
Round 1: Fight!


What’s New in the Land of OSD?

It’s been a few months since the last named release, Argonaut, and we’ve been busy! Well, in retrospect, most of the time was spent on finding a cephalopod name that starts with “b”, but once we got that done, we still had a few weeks left to devote to technical improvements. In particular, the OSD has seen some new and interesting developments.

OSD Internals Overview

Let’s start with some background for those not familiar with Ceph internals. Objects in a Ceph Object Store are placed into pools, each of which is composed of some number of placement groups (PGs). An object “foo” in pool “bar” would be mapped onto a set of OSDs as follows:
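
As a rough picture of that mapping, here is a small Python sketch. It is a simplification, not Ceph's code: the object name is hashed to pick one PG in the pool, and the PG is then mapped to an ordered set of OSDs. Ceph uses its own hash function for the first step and the CRUSH algorithm for the second; this sketch swaps in an MD5-based hash and a simple rendezvous-hashing ranking, and every function name, the PG count, and the OSD list are made up for illustration.

```python
from __future__ import annotations

import hashlib


def stable_hash(text: str) -> int:
    """Deterministic 32-bit hash (a stand-in for Ceph's real object-name hash)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:4], "big")


def object_to_pg(object_name: str, pool: str, pg_num: int) -> tuple[str, int]:
    """Step 1: hash the object name to choose one of the pool's PGs."""
    return (pool, stable_hash(object_name) % pg_num)


def pg_to_osds(pg: tuple[str, int], osds: list[int], replicas: int) -> list[int]:
    """Step 2: map the PG to an ordered set of OSDs.

    Stand-in for CRUSH: rank every OSD by a hash of (PG, OSD id) and keep
    the first `replicas` entries (rendezvous hashing).
    """
    pool, seed = pg
    ranked = sorted(
        osds,
        key=lambda osd: stable_hash(f"{pool}.{seed}.{osd}"),
        reverse=True,
    )
    return ranked[:replicas]


# Hypothetical cluster: pool "bar" with 64 PGs, six OSDs, three replicas.
pg = object_to_pg("foo", "bar", pg_num=64)
acting = pg_to_osds(pg, osds=[0, 1, 2, 3, 4, 5], replicas=3)
print(pg, acting)
```

The useful property to notice is that both steps are pure functions of their inputs, so any client can compute where an object lives without asking a central lookup table.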


v0.55.1 released

There were some packaging and init script issues with v0.55, so a small point release is out. It fixes a few odds and ends:

  • init-ceph: typo (new ‘fs type’ stuff was broken)
  • debian: fixed conflicting upstart and sysvinit scripts
  • auth: fixed default auth settings
  • osd: dropped some broken asserts
  • librbd: fix locking bug, race with ‘flatten’

You can get this release from the usual locations:

Deploying Ceph with a Crowbar

We have seen users deploying Ceph in a number of different ways, which is just plain awesome! I have spoken with people deploying with mkcephfs, ceph-deploy, Juju, Chef, and even the beginnings of some Puppet work. However, thanks to collaboration between Inktank and Dell, there is a really solid deployment pathway using Dell’s Crowbar tool and a Ceph “barclamp.”

For those not familiar with Dell Crowbar, it is an open source cloud deployment framework that originated as a way for Dell to support their OpenStack and Hadoop powered solutions. Since its inception, and eventual open-sourcing at OSCON 2011, it has come a long way, growing into the full-featured solution that we see today. Crowbar uses packages called “barclamps” that allow individuals to create ready-made ways to deploy the tools they want (like Chef’s “recipes” or Juju’s “charms”). These barclamps include custom UI for config, dependency graphs, and even localization support. Using it as one of the powerful devops vehicles to deploy Ceph seemed like the next logical step.


Monitoring a Ceph Cluster

credit: xkcd.org

Ok, so you have gone through the five minute quickstart guide, learned a bit about Ceph, and stood a pre-production server up to test real data and operations… now what? Over the past couple of weeks we have gotten quite a few questions about monitoring and troubleshooting a Ceph cluster once you have one. Thankfully, our documentation has been getting a ton of polish. However, we figured a quick rundown of some of the more frequently useful troubleshooting tools might be helpful.

The first step to fixing a problem is understanding that you actually _have_ a problem in the first place. To that end there are a number of health and monitoring tools available to keep a hairy eyeball on Ceph. These tools can be run in interactive mode or as a series of one-off status queries and watch commands. To run the ceph tool in interactive mode, type ‘ceph’ at the command line with no arguments. For example:

  • ceph
  • ceph> health
  • ceph> status
  • ceph> quorum_status
  • ceph> mon_status
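
The same queries can also be run one at a time from a script, which is handy for cron jobs or alerting. Here is a minimal Python sketch. It assumes the ‘ceph’ binary is on the PATH and that the health query prints a line starting with a status word such as HEALTH_OK, HEALTH_WARN, or HEALTH_ERR. The helper names and the sample output strings are made up for illustration.

```python
from __future__ import annotations

import subprocess

# The queries from the interactive session above.
STATUS_QUERIES = ["health", "status", "quorum_status", "mon_status"]


def parse_health(output: str) -> tuple[str, str]:
    """Split a health line into (status word, detail text)."""
    stripped = output.strip()
    if not stripped:
        return ("", "")
    parts = stripped.split(None, 1)
    detail = parts[1] if len(parts) > 1 else ""
    return (parts[0], detail)


def is_healthy(health_output: str) -> bool:
    """True only when the status word is exactly HEALTH_OK."""
    return parse_health(health_output)[0] == "HEALTH_OK"


def run_query(query: str) -> str:
    """Run `ceph <query>` once and return stdout (needs a reachable cluster)."""
    result = subprocess.run(
        ["ceph", query], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical sample output; a real check would pass run_query("health").
print(parse_health("HEALTH_WARN 1 pgs degraded"))
```

Keeping the parsing separate from the subprocess call makes the check easy to exercise without a live cluster.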


v0.55 released

We had originally planned to make v0.55 a long-term stable release, but a lot of last-minute changes and fixes went into this cycle, so we are going to wait another cycle and make v0.56 bobtail.   A lot of work went into v0.55, however.  If you aren’t running argonaut (v0.48.*), please give v0.55 a try and help us make sure it is rock solid!

WARNING: The default authentication behavior changed.  Please read below before upgrading or your cluster may not start.

Notable changes since v0.54 include:

v0.39 released

v0.39 has been tagged and uploaded.  There was a lot of bug fixing going on that isn’t terribly exciting.  That aside, the highlights include:

  • mon: rearchitected bootstrap (mkfs)
  • mon: vastly simplified mon cluster expansion
  • config: choose daemon ip based on subnet instead of explicitly
  • hadoop: misc hadoop client fixes
  • osd: many bugs fixed
  • make: pretty V=0 mode
  • radosgw: swift support improvements
  • radosgw, objecter: perfcounter instrumentation
  • we now build on FreeBSD
  • debian: packaging cleanup

The monitor and network config changes are worth mentioning.  We simplified monitor bootstrapping to make it easier to use tools like Chef or Juju to bring up a fresh cluster.  At the same time we made monitor cluster expansion almost trivial, and fixed an important performance problem when a monitor was down for a long time and then came back up.

Specifying the network config for daemons is also simple now that you can constrain the choice to a specific subnet.  That means that when you have a whole cluster with a public and private network for, say, the OSDs, you can force ceph-osd to choose an ip for each interface from the appropriate subnet without explicitly setting the IP in the ceph.conf for each daemon.
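
To picture what subnet-based selection means, here is a small Python illustration. This is not Ceph's code: the addresses, the subnets, and the function name are invented, and the real daemon's handling of corner cases (multiple matches, IPv6) may differ.

```python
from __future__ import annotations

import ipaddress


def pick_address(local_addrs: list[str], subnet: str) -> str | None:
    """Return the first local address that falls inside `subnet`, if any."""
    network = ipaddress.ip_network(subnet)
    for addr in local_addrs:
        if ipaddress.ip_address(addr) in network:
            return addr
    return None


# Hypothetical OSD host with a loopback, a public and a private interface.
host_addrs = ["127.0.0.1", "192.168.1.17", "10.0.0.17"]

public_ip = pick_address(host_addrs, "192.168.1.0/24")
private_ip = pick_address(host_addrs, "10.0.0.0/24")
print(public_ip, private_ip)
```

Because the rule is stated per subnet rather than per daemon, the same two settings can be shared by every host in the cluster.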

Ceph now builds on FreeBSD, thanks to some porting work by Stanislav Sedov.

There were a lot of small fixes to the OSD. A few bugs remain, however, in strange recovery corner cases. Some of the core recovery code is being rewritten for v0.40, which will vastly simplify things and make the system faster and less of a memory hog during recovery (see the wip-backfill branch in ceph.git).

For v0.40 we are also working on RBD image cloning (“layering”), and it’s going to be pretty slick. The vastly improved ceph.spec file is almost ready and should land in v0.40 as well.

To download v0.39:

 

RBD upstream updates

QEMU-RBD

The QEMU-RBD block device has been merged upstream into the QEMU project. QEMU-RBD was created originally by Christian Brunner, and is binary compatible with the Linux native RBD driver. It allows the creation of QEMU block devices that are striped over objects in RADOS, the Ceph distributed object store. As with the corresponding Linux device driver, the QEMU driver gets all the RBD goodies: thin provisioning, reliability, scalability, and snapshots!

libvirt

libvirt is a virtualization library that allows controlling virtual machines (such as QEMU based VMs, but also others) using a single API. There are many tools already built around it (e.g., virsh, virt-manager, etc.), and adding the ability to configure RBD devices via the library makes RBD work in the existing tools. With the help of the Sheepdog project (who also merged their QEMU block device upstream into QEMU recently), we were able to get RBD (and Sheepdog, and also nbd) support upstream into libvirt. Basically a new “network” disk type was added, and there are currently 3 possible types for such a disk: nbd, sheepdog, or rbd. For each you can specify a host name. E.g., for rbd the host name(s) would hold the IP address and TCP port for the Ceph cluster monitor(s).
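
To make the “network” disk idea concrete, here is a hedged Python sketch that builds such a disk element with the standard library. The element and attribute names (disk type, source protocol, host name and port) follow the libvirt domain XML format as we understand it; the image name, monitor addresses, port 6789, target device, and the function name are illustrative assumptions, so check the libvirt documentation before pasting the result into a real domain definition.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def rbd_disk_xml(
    image: str, monitors: list[tuple[str, int]], target_dev: str = "vda"
) -> str:
    """Build a libvirt-style <disk type='network'> element for an RBD image."""
    disk = ET.Element("disk", {"type": "network", "device": "disk"})
    source = ET.SubElement(disk, "source", {"protocol": "rbd", "name": image})
    # One <host> per Ceph monitor: its address and TCP port.
    for host, port in monitors:
        ET.SubElement(source, "host", {"name": host, "port": str(port)})
    ET.SubElement(disk, "target", {"dev": target_dev, "bus": "virtio"})
    return ET.tostring(disk, encoding="unicode")


# Hypothetical image "rbd/vm-disk" served by two monitors.
disk_xml = rbd_disk_xml(
    "rbd/vm-disk", [("192.168.1.10", 6789), ("192.168.1.11", 6789)]
)
print(disk_xml)
```

Swapping the protocol value for nbd or sheepdog would give the other two network disk types mentioned above.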

libvirt support for the Linux native kernel rbd driver is also in the works, which will allow rbd to be used with non-qemu VMs supported by libvirt (e.g., Xen, VirtualBox, VMware, etc.).

Linux Kernel

As we posted before, the RBD native Linux device was merged into the upcoming Linux kernel version (2.6.37), which will be out in a few weeks. Since the original merge we’ve modified the RBD sysfs interface to conform better to the sysfs requirements. Originally, the RBD driver was based on another Linux block device called osdblk and inherited its sysfs interface, which was monolithic and kept a single sysfs entry per config option for all the devices. This was both wrong and cumbersome, as we needed to specify the device id for each operation. The new interface moves the sysfs rbd subdir to a better location (/sys/bus/rbd) and creates a subdir per device, so that all operations for a single device are grouped together and there’s no need to specify the device name. We also create a subdir per snapshot under the device that holds all its information, and we dropped the one-big-list-for-all entry.
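
One nice side effect of the per-device layout is that it is easy to walk from a script. The Python sketch below assumes the per-device subdirs live under /sys/bus/rbd/devices (the text above only names /sys/bus/rbd, so treat that path as an assumption) and makes no assumption about which attribute files exist: it simply reads every readable file in each device directory and skips subdirectories such as the per-snapshot ones. The function name is made up.

```python
from __future__ import annotations

from pathlib import Path


def read_rbd_devices(root: str = "/sys/bus/rbd/devices") -> dict[str, dict[str, str]]:
    """Return {device id: {attribute file name: contents}} for each device dir."""
    devices: dict[str, dict[str, str]] = {}
    base = Path(root)
    if not base.is_dir():
        return devices  # rbd module not loaded, or a different layout
    for dev_dir in sorted(base.iterdir()):
        if not dev_dir.is_dir():
            continue
        attrs: dict[str, str] = {}
        for entry in sorted(dev_dir.iterdir()):
            if not entry.is_file():
                continue  # skip subdirs (e.g. per-snapshot directories)
            try:
                attrs[entry.name] = entry.read_text().strip()
            except OSError:
                continue  # write-only control files cannot be read
        devices[dev_dir.name] = attrs
    return devices


for dev_id, attrs in read_rbd_devices().items():
    print(dev_id, attrs)
```

On a machine without any mapped RBD devices the loop simply prints nothing.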

All in all, it was a relatively big change to introduce well into the release cycle, but we believe it was worth it.

v0.24 released

We’ve released v0.24, just in time for the holidays!  Big changes this time around include:

  • mds: many fixes with clustered failure recovery
  • mds: bloom filter to reduce directory reads
  • mds: configurable directory hash functions (for fragmentation)
  • rbd: import/export tools are smart about holes (i.e., use FIEMAP)
  • osd: many recovery improvements, mostly making data available more quickly
  • osd: automatic background scrubbing when load is low
  • osd: fixes with dedicated backend replication network
  • osd: use new (2.6.37) btrfs ioctls for async snapshot creation
  • replaced openssl dependency with libcrypto++ (licensing issue)
  • librados: “zero-copy” reads
  • misc bug fixes, man pages, and code cleanup

The focus for the next release (v0.25) is on OSD and MDS stability, directory fragmentation recovery, and fsck preliminaries; see the roadmap for more details.

Relevant URLs:

© 2013, Inktank Storage, Inc. All rights reserved.