The Ceph architecture can be pretty neatly broken into two key layers. The first is RADOS, a reliable autonomic distributed object store, which provides an extremely scalable storage service for variably sized objects. The Ceph file system is built on top of that underlying abstraction: file data is striped over objects, and the MDS (metadata server) cluster provides distributed access to a POSIX file system namespace (directory hierarchy) that’s ultimately backed by more objects.
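To make the striping idea concrete, here is a minimal sketch of how a byte offset in a file could be mapped to an object and an offset within it. It assumes the simplest possible layout, with fixed-size objects filled one after another. The 4 MiB object size, the `locate` and `object_name` helpers, and the naming scheme are all illustrative assumptions, not Ceph's actual layout code.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed object size for this sketch (4 MiB); not taken from the post. */
#define OBJECT_SIZE ((uint64_t)4 << 20)

struct object_extent {
    uint64_t object_no;   /* index of the object holding the byte */
    uint64_t object_off;  /* offset of the byte inside that object */
};

/* Map a file offset to (object number, offset in object). */
static struct object_extent locate(uint64_t file_off)
{
    struct object_extent e;
    e.object_no = file_off / OBJECT_SIZE;
    e.object_off = file_off % OBJECT_SIZE;
    return e;
}

/*
 * Hypothetical naming scheme: "<inode in hex>.<object number in hex>".
 * Returns 0 on success, -1 if the name does not fit in the buffer.
 */
static int object_name(char *out, size_t cap, uint64_t ino, uint64_t object_no)
{
    int n = snprintf(out, cap, "%llx.%08llx",
                     (unsigned long long)ino,
                     (unsigned long long)object_no);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

Under these assumptions, byte 4 MiB + 5 of a file lands at offset 5 of object number 1, the file's second object.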
Until now, RADOS’ only user has been Ceph. But if the success of Amazon’s S3 (Simple Storage Service) has shown anything, it’s that there is ample use (and demand) for a reliable and scalable object-based storage interface.
The underlying storage abstraction provided by RADOS is relatively simple:
A key design feature of RADOS is that the OSDs are able to operate with relative autonomy when it comes to recovering from failures or migrating data in response to cluster expansion. By minimizing the role of the central cluster coordinator (actually a small Paxos cluster managing key cluster state), the overall system is extremely scalable. A small system of a few nodes can seamlessly grow to hundreds or thousands of nodes (or contract again) as needed.
The API provided by librados will be quite simple. Something along the lines of:
/* initialization */
int rados_initialize(int argc, const char **argv);
/* open and close a handle to a named object pool */
int rados_open_pool(const char *name, rados_pool_t *pool);
void rados_close_pool(rados_pool_t pool);
/* write or read len bytes at offset off within the object named by oid */
int rados_write(rados_pool_t pool, struct ceph_object *oid, const char *buf, off_t off, size_t len);
int rados_read(rados_pool_t pool, struct ceph_object *oid, char *buf, off_t off, size_t len);
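As a rough illustration of how a caller might use these calls, the sketch below pairs the proposed prototypes with a tiny in-memory stand-in so that it compiles on its own. Everything beyond the six prototypes is an assumption made for the example: the definitions of `rados_pool_t` and `struct ceph_object`, the fixed-size `store` array, the `find_slot` and `demo_roundtrip` helpers, and the return convention (bytes transferred on success, -1 on error). None of it is the real librados implementation.

```c
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Stand-ins, NOT the real librados: types and storage are assumptions. */
typedef int rados_pool_t;              /* assumed: opaque pool handle */
struct ceph_object { uint64_t id; };   /* assumed: hypothetical object name */

#define MAX_OBJECTS 8
#define MAX_OBJECT_LEN 4096

struct slot { int used; uint64_t id; size_t size; char data[MAX_OBJECT_LEN]; };
static struct slot store[MAX_OBJECTS]; /* a single in-memory "pool" */

int rados_initialize(int argc, const char **argv)
{
    (void)argc; (void)argv;
    memset(store, 0, sizeof store);    /* start with an empty store */
    return 0;
}

int rados_open_pool(const char *name, rados_pool_t *pool)
{
    if (!name || !pool) return -1;
    *pool = 0;                         /* this mock has exactly one pool */
    return 0;
}

void rados_close_pool(rados_pool_t pool) { (void)pool; }

/* Find the slot for oid; optionally claim a free slot if it does not exist. */
static struct slot *find_slot(const struct ceph_object *oid, int create)
{
    struct slot *free_slot = NULL;
    for (int i = 0; i < MAX_OBJECTS; i++) {
        if (store[i].used && store[i].id == oid->id) return &store[i];
        if (!store[i].used && !free_slot) free_slot = &store[i];
    }
    if (!create || !free_slot) return NULL;
    free_slot->used = 1;
    free_slot->id = oid->id;
    free_slot->size = 0;
    return free_slot;
}

int rados_write(rados_pool_t pool, struct ceph_object *oid,
                const char *buf, off_t off, size_t len)
{
    (void)pool;
    if (off < 0 || (size_t)off > MAX_OBJECT_LEN) return -1;
    if (len > MAX_OBJECT_LEN - (size_t)off) return -1;
    struct slot *s = find_slot(oid, 1);
    if (!s) return -1;
    memcpy(s->data + off, buf, len);
    if ((size_t)off + len > s->size) s->size = (size_t)off + len;
    return (int)len;
}

int rados_read(rados_pool_t pool, struct ceph_object *oid,
               char *buf, off_t off, size_t len)
{
    (void)pool;
    struct slot *s = find_slot(oid, 0);
    if (!s || off < 0) return -1;
    if ((size_t)off >= s->size) return 0;      /* read past end of object */
    if (len > s->size - (size_t)off) len = s->size - (size_t)off;
    memcpy(buf, s->data + off, len);
    return (int)len;
}

/* Caller's view: open a pool, write an object, read it back. */
int demo_roundtrip(void)
{
    const char *args[] = { "demo" };
    struct ceph_object oid = { 42 };
    rados_pool_t pool;
    char out[16] = { 0 };

    if (rados_initialize(1, args) < 0) return -1;
    if (rados_open_pool("data", &pool) < 0) return -1;
    if (rados_write(pool, &oid, "hello", 0, 5) != 5) return -1;
    int n = rados_read(pool, &oid, out, 0, sizeof out - 1);
    rados_close_pool(pool);
    return (n == 5 && memcmp(out, "hello", 5) == 0) ? 0 : -1;
}
```

Calling `demo_roundtrip()` from a `main` function exercises the whole sequence against the stand-in store. The caller-side code in that function is the part meant to carry over to a real library; the storage behind it would be replaced wholesale.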
An asynchronous I/O interface will also be exposed, as well as a buffering/caching facility (currently in use by the Ceph FUSE client) with the ability to selectively flush or invalidate sets of objects (e.g., the set of objects a file is striped over).
What are the benefits of using this sort of interface? Clearly, anything you can do with objects you can also do with files in a distributed fs (like Ceph): just create a file at /foo/$poolname/$objectname.
One goal is to make applications that currently use the S3 client library trivially portable to librados, allowing users to maintain control of the full storage stack.