<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Ceph &#187; Dev notes</title>
	<atom:link href="http://ceph.com/category/dev-notes/feed/" rel="self" type="application/rss+xml" />
	<link>http://ceph.com</link>
	<description></description>
	<lastBuildDate>Mon, 17 Jun 2013 23:09:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>New Ceph Backend to Lower Disk Requirements</title>
		<link>http://ceph.com/dev-notes/new-ceph-backend-to-lower-disk-requirements/</link>
		<comments>http://ceph.com/dev-notes/new-ceph-backend-to-lower-disk-requirements/#comments</comments>
		<pubDate>Tue, 11 Jun 2013 16:20:11 +0000</pubDate>
		<dc:creator>scuttlemonkey</dc:creator>
				<category><![CDATA[Dev notes]]></category>
		<category><![CDATA[ceph]]></category>
		<category><![CDATA[erasure encoding]]></category>
		<category><![CDATA[scality]]></category>

		<guid isPermaLink="false">http://ceph.com/?p=3500</guid>
		<description><![CDATA[I get a fair number of questions on the current Ceph blueprints, especially those coming from the community. Loic Dachary, one of the owners of the Erasure Encoding blueprint, has done a great job taking a look at some of the issues at hand. When evaluating Ceph to run a new storage service, the replication factor [...]]]></description>
			<content:encoded><![CDATA[<p>I get a fair number of questions on the current <a href="http://wiki.ceph.com/01Planning/02Blueprints">Ceph blueprints</a>, especially those coming from the community.  Loic Dachary, one of the owners of the Erasure Encoding blueprint, has done a great job taking a look at some of the issues at hand.</p>
<p><span id="more-3500"></span></p>
<blockquote><p>
When evaluating Ceph to run a new storage service, the replication factor only matters once the hardware provisioned at the start is almost full. That may happen months after the first user starts to store data. In the meantime, a new erasure-encoded storage backend that reduces hardware requirements by up to 50% <a href="http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/Erasure_encoding_as_a_storage_backend">is being developed</a> in Ceph.</p>
<p>Saving disk space from the beginning does not matter: the space is not used anyway. The question is when the erasure-encoded backend will be ready to double the usable capacity of the storage already in place.</p>
<p>When looking for a new storage solution, hardware requirements are an important factor. If Ceph is configured with three replicas, 1PB of usable storage requires 3PB of raw storage. Users are expected to occupy an increasing amount of disk space over time:</p>
<pre class="code">
            ^
       10PB |
            |
            |
        6PB |
            |                                          /--
            |                                     /----
        4PB |                                /----
            |                           /----   usage
            |                      /----
        2PB |                 /----
            |             /---
            |        /----
            |   /----
            +----------------+----------------+------------>
                          A months          B months
</pre>
<p>Hardware provisioning is expected to follow the usage curve. In the following, 4PB are provisioned initially, an additional 2PB after A months of operation, and so on.</p>
<pre class="code">
            ^
       10PB |                                 +-----------
            |                                 |
            |                                 |
        6PB |                +----------------+
            |                |    provisioning         /---
            |                |                    /----
        4PB +----------------+               /----
            |                           /----  usage
            |                      /----
        2PB |                 /----
            |             /---
            |        /----
            |   /----
            +----------------+----------------+------------>
                          A months          B months
</pre>
<p>An erasure-encoded Ceph backend could reduce the requirements for raw storage: 1PB of usable storage fits in 1.5PB of raw storage. If it were available, the curve would not grow as fast and the need to provision more hardware would arise later.</p>
<pre class="code">
            ^
       10PB |
            |
            |
        6PB |                                    +---------
            |                                    |
            |                                    |
        4PB +------------------------------------+
            |                  provisioning
            |                                    /---------
        2PB |                          /---------  usage
            |                /---------
            |        /-------
            |   /----
            +----------------+----------------+------------>
                          A months          B months
</pre>
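<p>The arithmetic behind these curves can be sketched in a few lines of shell. This is only an illustration: the function name usable_tb and the choice of terabytes as the unit are assumptions, while the 3x and 1.5x overheads are the ones quoted above.</p>

```shell
#!/bin/sh
# Usable capacity for a given amount of raw storage.
# The overhead factor is expressed in tenths so integer arithmetic suffices:
# 30 = three replicas (3.0x), 15 = the erasure-encoded example (1.5x).
usable_tb() {
    raw_tb=$1
    overhead_x10=$2
    echo $(( raw_tb * 10 / overhead_x10 ))
}

raw=4000   # the 4PB provisioned initially, expressed in TB
echo "replicated:      $(usable_tb "$raw" 30) TB usable"
echo "erasure encoded: $(usable_tb "$raw" 15) TB usable"
```

<p>With the same 4PB of raw disk, the erasure-encoded case holds roughly twice as much user data, which is why the next purchase can be postponed.</p>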
<p>The implementation of an erasure-encoded backend for Ceph started in May 2012 and, once released, it will progressively lower the disk space requirements. In the example above, it will save money if it arrives before A months. Even if it arrives later, it will still save money by reducing the storage footprint and making better use of the existing hardware.</p>
<pre class="code">
            ^
       10PB |
            |
            |
        6PB |
            |
            |
        4PB +-----------------
            |
            |
        2PB |
            |
            |
            |
            +----------------+
                          A months
</pre>
<p>In any case, having erasure encoding from the start does not save any money because the provisioned hardware is completely empty. Up to A months, the investment to provision 4PB has to be made anyway.</p>
<p>Originally posted by <a href="http://dachary.org/?p=2048">Loic Dachary</a>.</p>
</blockquote>
<pre class="outline">scuttlemonkey out</pre>
]]></content:encoded>
			<wfw:commentRss>http://ceph.com/dev-notes/new-ceph-backend-to-lower-disk-requirements/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Incremental Snapshots with RBD</title>
		<link>http://ceph.com/dev-notes/incremental-snapshots-with-rbd/</link>
		<comments>http://ceph.com/dev-notes/incremental-snapshots-with-rbd/#comments</comments>
		<pubDate>Tue, 14 May 2013 14:02:34 +0000</pubDate>
		<dc:creator>scuttlemonkey</dc:creator>
				<category><![CDATA[Dev notes]]></category>
		<category><![CDATA[backup]]></category>
		<category><![CDATA[cloudstack]]></category>
		<category><![CDATA[eucalyptus]]></category>
		<category><![CDATA[opennebula]]></category>
		<category><![CDATA[openstack]]></category>
		<category><![CDATA[rbd]]></category>
		<category><![CDATA[snapshots]]></category>

		<guid isPermaLink="false">http://ceph.com/?p=3258</guid>
		<description><![CDATA[While Ceph has a wide range of use cases, the most frequent application that we are seeing is that of block devices as data store for public and private clouds managed by OpenStack, CloudStack, Eucalyptus, and OpenNebula. This means that we frequently get questions about things like geographic replication, backup, and disaster recovery (or some [...]]]></description>
			<content:encoded><![CDATA[<p>While Ceph has a wide range of use cases, the most frequent application that we are seeing is that of block devices as the data store for public and private clouds managed by OpenStack, CloudStack, Eucalyptus, and OpenNebula.  This means that we frequently get questions about things like geographic replication, backup, and disaster recovery (or some combination thereof, given the amount of overlap on these topics).  While a full-featured, robust solution to geo-replication is currently being hammered out, there are a number of different approaches already being tinkered with (like Sebastien Han&#8217;s <a href="http://www.sebastien-han.fr/blog/2013/01/28/ceph-geo-replication-sort-of/">setup with DRBD</a> or the upcoming work <a href="http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/12238">using RGW</a>).</p>
<p>However, since one of the primary focuses in managing a cloud is the manipulation of images, the solution to disaster recovery and general backup can often be quite simple.  Incremental snapshots can fill this, and several other, roles quite well.  To that end, I wanted to share a few thoughts from RBD developer Josh Durgin for those of you who may have missed his <a href="http://www.openstack.org/summit/portland-2013/session-videos/presentation/new-features-for-ceph-with-cinder-and-beyond">great talk</a> at the OpenStack Developer Summit a few weeks ago.</p>
<p><span id="more-3258"></span></p>
<p>For the purposes of disaster recovery, the idea is that you could run two simultaneous Ceph clusters in different geographic locations and instead of copying a new snapshot each time, you could simply generate and transfer a delta.  The incantation would look something like this:</p>
<pre class="code">
rbd export-diff --from-snap snap1 pool/image@snap2 pool_image_snap1_to_snap2.diff
</pre>
<p>This creates a simple binary file that stores the following information:</p>
<ul>
<li>original snapshot name (if applicable)</li>
<li>end snapshot name</li>
<li>size of the image at ending snapshot</li>
<li>the diff between snapshots</li>
</ul>
<p>The format of this file can be seen in the <a href="http://ceph.com/docs/master/dev/rbd-diff/">RBD doc</a>.</p>
<p>After exporting a diff you could either simply back up the file somewhere offsite or import the diff on top of the existing image on a remote Ceph cluster.</p>
<pre class="code">
rbd import-diff /path/to/diff backup_image
</pre>
<p>This will write the contents of the differential to the backup image and create a snapshot with the same name as the original ending snapshot. It will fail and do nothing if a snapshot with this name already exists. Since overwriting the same data is idempotent, it&#8217;s safe to have an import-diff interrupted in the middle.</p>
<p>These commands can work with stdin and stdout as well, so you could do something like:</p>
<pre class="code">
rbd export-diff --from-snap snap1 pool/image@snap2 - | ssh user@second_cluster rbd import-diff - pool2/image
</pre>
<p>You can see which extents changed (in plain text, json, or xml) via:</p>
<pre class="code">
rbd diff --from-snap snap1 pool/image@snap2 --format plain
</pre>
<p>There are a couple of limitations in the current implementation, however.</p>
<ol>
<li>There&#8217;s no guarantee you&#8217;re importing a diff onto an image in the right state (i.e. the same image at the same snapshot as the diff was exported from).</li>
<li>There&#8217;s no way to inspect the diff files to see what snapshots they refer to, so you&#8217;d have to depend on the filename containing that information.</li>
</ol>
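<p>Since the diff files cannot be inspected, it helps to generate their names consistently. The sketch below does that with two hypothetical helper functions (diff_name and export_diff_cmd are not part of rbd); it only prints the export-diff command shown earlier, so it can be tried without a cluster.</p>

```shell
#!/bin/sh
# Build a diff file name that records the pool, image, and both snapshot names.
diff_name() {
    pool=$1; image=$2; from=$3; to=$4
    echo "${pool}_${image}_${from}_to_${to}.diff"
}

# Print the export command rather than running it (no cluster needed).
export_diff_cmd() {
    pool=$1; image=$2; from=$3; to=$4
    echo "rbd export-diff --from-snap $from $pool/$image@$to $(diff_name "$pool" "$image" "$from" "$to")"
}

export_diff_cmd pool image snap1 snap2
```

<p>Keeping both snapshot names in the file name also makes it easier to check by hand that a diff is being applied to an image in the right state.</p>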
<p>While the implementation is still relatively simple, you can see how this could be quite useful in managing not only cloud images, but any of your Ceph block devices.  This functionality hit the streets with the recent &#8216;<a href="http://ceph.com/releases/v0-61-cuttlefish-released/">cuttlefish</a>&#8217; stable release; if you have questions or enhancement requests, please let us know.</p>
<p>To learn more about some of the new things coming in future versions of Ceph you can check out the current <a href="http://www.inktank.com/about-inktank/roadmap/">published roadmap</a> of work Inktank is planning on contributing.  Also if you missed the  virtual <a href="http://ceph.com/events/ceph-developer-summit/">Ceph Developer Summit</a>, the videos have been posted for review.  In the meantime, if you have questions, comments, or anything for the good of the cause feel free to stop by our <a href="irc://irc.oftc.net/ceph">irc channel</a> or drop a note to one of the <a href="http://ceph.com/resources/mailing-list-irc/">mailing lists</a>.  </p>
<pre class="outline">scuttlemonkey out</pre>
]]></content:encoded>
			<wfw:commentRss>http://ceph.com/dev-notes/incremental-snapshots-with-rbd/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Adding Support for RBD to stgt</title>
		<link>http://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/</link>
		<comments>http://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/#comments</comments>
		<pubDate>Thu, 21 Mar 2013 10:51:26 +0000</pubDate>
		<dc:creator>dmick</dc:creator>
				<category><![CDATA[Dev notes]]></category>

		<guid isPermaLink="false">http://ceph.com/?p=3177</guid>
		<description><![CDATA[tgt, the Linux SCSI target framework (well, one of them) is an iSCSI target implementation whose goals include implementing a large portion of the SCSI emulation code in userland. tgt can provide iSCSI over Ethernet or iSER (iSCSI extensions for RDMA) over Infiniband. It can emulate various SCSI target types (really &#8220;command sets&#8221;): SBC (normal [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://stgt.sourceforge.net">tgt</a>, the Linux SCSI target framework (well, one of them), is an iSCSI target implementation whose goals include implementing a large portion of the SCSI emulation code in userland. tgt can provide iSCSI over Ethernet or iSER (iSCSI extensions for RDMA) over InfiniBand.  It can emulate various SCSI target types (really &#8220;command sets&#8221;):</p>
<ul>
<li>SBC (normal &#8220;disk&#8221; type devices)</li>
<li>SMC (&#8220;jukebox&#8221; media changer)</li>
<li>MMC (CD/DVD drive)</li>
<li>SSC (tape device)</li>
<li>OSD (the &#8216;object storage device&#8217;)</li>
</ul>
<p>It can use either a raw block device or a file as backing storage for any of these device types.</p>
<p><span id="more-3177"></span></p>
<p>Well, since Ceph provides a distributed reliable storage pool on the network, having a way to access that storage as an iSCSI device seems natural; this way clients that speak iSCSI don&#8217;t even need to be aware that their storage is on the Ceph cluster (except to know that it&#8217;s highly available and safe).  Virtual machine providers and cloud software of many types can speak iSCSI, and if Ceph could export storage as an iSCSI device, it would be easy to glue all those providers to a Ceph cluster.</p>
<p>To that end, I&#8217;ve written a backend for tgt that can use a RADOS block device (<a href="http://ceph.com/docs/master/rbd/rbd/">rbd</a>) image as the storage for the iSCSI target device.  Now, you may be saying, &#8220;but wait, I can already create an RBD image on the Ceph cluster, map it in my kernel as a block device, and use tgt or LIO or other iSCSI tools to export it as an iSCSI target&#8221;, and that&#8217;s correct; people do this with success today.  However, the completely-userland approach has several benefits:</p>
<ul>
<li>the userland code for rbd typically leads the kernel implementation with respect to new features.  At the time of this writing, userland rbd can use copy-on-write cloning and &#8216;fancy striping&#8217;, which are still being implemented in the kernel</li>
<li>userland code can be compiled and installed on older kernels that may not have the kernel rbd module available at all, or may have an older, less-stable version</li>
<li>avoiding the kernel can be useful for throttling memory/bandwidth, management in general, delegating access, security, avoiding kernel crashes, etc.</li>
<li>without risk of memory deadlock, we can perform much better caching in the userland librbd</li>
</ul>
<p>Of course these advantages don&#8217;t come for free; there can also be a cost in performance.  The tgt project has taken some care to try to mitigate performance effects, but your mileage may vary.  However, the ease of the port, and the ease of modifying it for new features makes this a worthwhile effort even in the face of possible performance hits.</p>
<h3>The bs_rbd backing-store driver</h3>
<p>Adding rbd support to tgt was fairly simple due to its modular design and simple backing-store drivers.  I started from the bs_rdwr backing-store driver, which backs the daemon-provided instances with either a file in a filesystem or a block device, using normal open/close/read/write functions. To that I added the initialization to open a connection to the RADOS cluster using librados, and a small function to parse the rbd pool, image name, and snapshot name out of the arguments (see below). Then the POSIX calls were translated into librbd calls such as rbd_open and rbd_close.  librbd is very similar to POSIX file operations in both its synchronous and asynchronous forms, so the translation was obvious and easy.</p>
<p>The patch to add bs_rbd has been accepted into the <a href="http://github.com/fujita/tgt">mainline repository</a> as of mid-February 2013, along with some very brief README information on how to use it.  Here&#8217;s a little expansion on that brief usage:</p>
<p>tgtd is configured by the tgtadm command; to select an RBD image as the backend storage for a tgtd instance, you use the --bstype rbd option to tell tgtd that it should access the storage using bs_rbd. Also, use the --backing-store option to select the (already-existing) rbd image in the usual Ceph syntax: --backing-store [pool/]image[@snap] selects an rbd image named &#8216;image&#8217;, optionally in pool &#8216;pool&#8217;, and optionally a read-only snapshot of that image, @snap.  You can create the image in the usual way, using the rbd command-line tool.</p>
<p>You must give the device you&#8217;re creating a name; a typical name form would be an &#8216;IQN&#8217; (iSCSI qualified name), of the form: iqn.&#60;year&#62;-&#60;month&#62;.&#60;domain&#62;:&#60;domain-specified-string&#62; but no particular form seems to be required, so &#8216;testrbd&#8217; works just as well.  In my testing I created a target named simply &#8220;rbd&#8221;.</p>
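<p>If you do want an IQN-style name, it is easy to assemble one from its parts. The helper below is a hypothetical convenience function, not something tgt provides, and com.example is a placeholder domain (IQNs conventionally use the reversed form of a domain name).</p>

```shell
#!/bin/sh
# Assemble a name of the form iqn.YEAR-MONTH.DOMAIN:STRING.
# The month is passed as a two-digit string, e.g. 03.
make_iqn() {
    year=$1; month=$2; domain=$3; name=$4
    printf 'iqn.%s-%s.%s:%s\n' "$year" "$month" "$domain" "$name"
}

make_iqn 2013 03 com.example testrbd
```

<p>The resulting string can be passed to --targetname in place of a short name such as &#8216;rbd&#8217;.</p>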
<h3>Using bs_rbd with tgtd</h3>
<p>So a typical setup using manual commands might go like this:  First, create an image on your running Ceph cluster:</p>
<pre class="code">rbd create iscsi-image --size 500       # a 500 MB image named iscsi-image</pre>
<p>tgtadm/tgtd will access the cluster using the configuration supplied via the default Ceph configuration files (by default, /etc/ceph/$cluster.conf, ~/.ceph/$cluster.conf, and ./$cluster.conf, where $cluster is, by default, &#8216;ceph&#8217;), or by the CEPH_CONF environment variable; make sure your configuration is accessible through one of those settings.</p>
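<p>A quick way to see which configuration file would be picked up is to walk the same list of locations by hand. The sketch below is an approximation rather than Ceph&#8217;s actual lookup code: the function name find_ceph_conf is made up, and checking CEPH_CONF before the default paths is an assumption.</p>

```shell
#!/bin/sh
# Print the first configuration file that is readable, or fail if none is.
# Checking CEPH_CONF first is an assumption of this sketch.
find_ceph_conf() {
    cluster=${1:-ceph}
    for candidate in "${CEPH_CONF:-}" "/etc/ceph/$cluster.conf" "$HOME/.ceph/$cluster.conf" "./$cluster.conf"; do
        [ -n "$candidate" ] || continue
        if [ -r "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

find_ceph_conf || echo "no Ceph configuration file found"
```

<p>If this prints nothing useful on the host running tgtd, fix the configuration before going any further.</p>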
<p>Next, create a new target for the tgtd daemon to emulate:</p>
<pre class="code">tgtadm --lld iscsi --mode target --op new --tid 1 --targetname rbd</pre>
<p>Create a LUN on this target bound to an rbd image:</p>
<pre class="code">tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 0 --backing-store iscsi-image --bstype rbd</pre>
<p>Allow access to that LUN:</p>
<pre class="code">tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL</pre>
<p>Verify that the image can be seen by a local iscsi initiator.  Here I&#8217;m using iscsiadm, part of the open-iscsi package:</p>
<pre class="code">iscsiadm -m discovery -t st -p localhost</pre>
<p>Log into the node, which will create a /dev/sdX block device:</p>
<pre class="code">iscsiadm -m node --login</pre>
<p>Now you can access the device locally as /dev/sdX using iSCSI.  You can also perform the last two steps from a different network host, specifying -p &#60;tgtd-hostname&#62;, of course.</p>
<p>When you&#8217;re done, you can terminate the session and remove the device:</p>
<pre class="code">iscsiadm -m node --logout</pre>
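<p>The target-side steps above can be collected into a small script. This is a sketch only: the run and setup_rbd_target functions and the DRY_RUN variable are inventions for illustration, and the commands and options are exactly the ones used above. By default it just prints what it would do.</p>

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints each command instead of executing it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create an rbd image and export it as LUN 0 of a new iSCSI target.
# Stops at the first command that fails.
setup_rbd_target() {
    image=$1; size_mb=$2; tid=$3; target=$4
    run rbd create "$image" --size "$size_mb" || return 1
    run tgtadm --lld iscsi --mode target --op new --tid "$tid" --targetname "$target" || return 1
    run tgtadm --lld iscsi --mode logicalunit --op new --tid "$tid" --lun 0 --backing-store "$image" --bstype rbd || return 1
    run tgtadm --lld iscsi --op bind --mode target --tid "$tid" -I ALL || return 1
}

setup_rbd_target iscsi-image 500 1 rbd
```

<p>Setting DRY_RUN=0 in the environment runs the commands for real, which of course requires a working Ceph cluster and a running tgtd.</p>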
<h3>Details of the bs_rbd backing-store driver, possible future work</h3>
<p>As a first implementation, I wrote the bs_rbd driver to handle up to 20 simultaneous rbd images (just an arbitrary fixed-size-array limit), and used the bs_rdwr module as a starting point, so that I/O is synchronous to the RADOS cluster.  However, tgtd itself maintains a thread worker pool of, by default, 16 threads, so while I/O blocks in the RADOS cluster, the daemon itself maintains multiple outstanding requests.  The thread count can be adjusted with -t or --nr-iothreads when creating the LUN.</p>
<p>It&#8217;s possible that using librbd&#8217;s asynchronous-I/O support would improve performance or CPU utilization; this is something that could be the basis of experiments.  I chose the simpler implementation as a working proof-of-concept; performance studies and experiments would be welcomed.</p>
<p>The driver links against librbd and librados, so those must be installed on your machine to build, and you must select the configuration option CEPH_RBD by, for example, &#8220;make CEPH_RBD=1&#8221;.</p>
<h3>Try it!</h3>
<p>So that&#8217;s the story of rbd support in tgt!  Please try it out and let us know what you think; report any bugs to the stgt mailing list and the ceph development list:  stgt@vger.kernel.org and ceph-devel@vger.kernel.org.</p>
]]></content:encoded>
			<wfw:commentRss>http://ceph.com/dev-notes/adding-support-for-rbd-to-stgt/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Ceph&#8217;s New Monitor Changes</title>
		<link>http://ceph.com/dev-notes/cephs-new-monitor-changes/</link>
		<comments>http://ceph.com/dev-notes/cephs-new-monitor-changes/#comments</comments>
		<pubDate>Thu, 07 Mar 2013 23:11:16 +0000</pubDate>
		<dc:creator>joao</dc:creator>
				<category><![CDATA[Dev notes]]></category>

		<guid isPermaLink="false">http://ceph.com/?p=3066</guid>
		<description><![CDATA[Back in May 2012, after numerous hours confined to a couple of planes since departing Lisbon, I arrived in Los Angeles to meet most of the folks from Inktank. During my stay I had the chance to meet everybody on the team, attend the company&#8217;s launch party and start a major and well deserved rework [...]]]></description>
			<content:encoded><![CDATA[<p>Back in May 2012, after numerous hours confined to a couple of planes since departing Lisbon, I arrived in Los Angeles to meet most of the folks from Inktank.  During my stay I had the chance to meet everybody on the team, attend the company&#8217;s launch party and start a major and well deserved rework of some key aspects of the Ceph Monitor.  These changes were merged into Ceph for v0.58.</p>
<p>Before getting into details on the changes, let me give some background on how the Monitor works.</p>
<p><span id="more-3066"></span></p>
<h3>Monitor Architecture</h3>
<p>As you may already know, the Monitor is a critical piece in any Ceph cluster: without at least one monitor, the cluster just won&#8217;t do anything useful.  And by that I mean nothing will happen. Ever.</p>
<p>Think of the monitors as the central piece of the cluster that keeps track of who and where the other pieces of the cluster are and what is happening with them. Through a single monitor, a client is able to obtain the location of the remaining monitors, find the object storage daemons (OSDs) or the metadata servers (MDSs), or figure out where the data lies; and it is to the monitors that OSDs and MDSs report.</p>
<p>The monitor tracks a lot of information essential to the cluster&#8217;s operation, much of which is at some point provided by the other components in the system.  Some of this information is kept in the form of maps (OSDMap and PGMap, to name a couple), and each map may have multiple versions.  For instance, the OSDMap contains the location of the OSDs, the CRUSH map, and numerous statistics; the PGMap keeps track of placement groups (PGs) and where they are located at any given moment, with different versions providing different insights on the cluster history.  So one might want to consider having multiple monitors in the same cluster, not only to guarantee redundancy of this information in case the monitor&#8217;s data store suffers a terrible death, but also to guarantee availability if something should happen to the monitor (power or network failure on the monitor&#8217;s server or rack, for instance).</p>
<p>However, keeping multiple monitors means that the information must be equally shared by them all.  Any potential inconsistencies, be they lost or corrupted versions, could lead to incorrect cluster behavior or even data loss.  In order to enforce the consistency requirements throughout the monitor cluster, Ceph resorts to Paxos (<a href="http://en.wikipedia.org/wiki/Paxos_(computer_science)">http://en.wikipedia.org/wiki/Paxos_(computer_science)</a>), a distributed consensus algorithm.  Each time a map is modified, a new version is created and run through a quorum of monitors.  When a majority acknowledges the change, and only then, the new version is considered committed.  Throughout the documentation and the mailing list archives one can find numerous reasons to maintain more than one monitor (and an odd number at that), but I believe that Mike Lowe described it best in an email to the list (<a href="http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-February/000224.html">http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-February/000224.html</a>):</p>
<p><em>&#8220;Think of yourself as a mob boss and the mon is your mob accountant.  While you may have all of the account numbers where you have stashed your ill gotten gains only your accountant knows which bank those account numbers belong to.  If you or somebody else whacks your sole accountant then your money is gone.  Oh, and your accountants may lie to you so best to have an odd number and let the majority rule.&#8221;</em></p>
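<p>The preference for an odd number follows from simple majority arithmetic, which the sketch below prints for small clusters. The function names are made up for this illustration; nothing here talks to a real monitor.</p>

```shell
#!/bin/sh
# Smallest number of monitors that forms a majority of n.
majority() {
    echo $(( $1 / 2 + 1 ))
}

# Number of monitors that can fail while a majority remains.
tolerated_failures() {
    echo $(( $1 - $(majority "$1") ))
}

for n in 1 2 3 4 5; do
    echo "$n monitors: majority $(majority "$n"), tolerates $(tolerated_failures "$n") failure(s)"
done
```

<p>Going from three monitors to four raises the majority from two to three without tolerating any additional failure, so the fourth monitor buys nothing in terms of availability.</p>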
<h3>ARCHITECTURAL REWORK</h3>
<p>There were three major architectural reworks on the monitor, which I will explain in further detail in the next sections:</p>
<ul>
<li>Shifting from the legacy file-based data store onto a Key/Value store;</li>
<li>Introducing a single Paxos instance instead of an instance per monitor service; and,</li>
<li>Performing a store-wide sync to catch up with cluster state</li>
</ul>
<h3>K/V STORE INSTEAD OF THE LEGACY FILE-BASED STORE</h3>
<p>Up until v0.58, the monitor&#8217;s data store comprised a set of files and directories. This approach benefited from the simplicity of inspecting the data store with tools like &#8216;ls&#8217;, &#8216;cat&#8217;, and the like.  However, this simplicity also enabled some creative problem-solving approaches, and every now and then we would hear from someone whose monitors had crashed and burned because there had been some tampering with monmap versions, for instance.  Let me make this clear:  black boxing the monitor is not a workaround to avoid this kind of approach (once users understand that they should not do it, they won&#8217;t ever repeat the feat), but black boxing does provide other benefits that are far more important.</p>
<p>The file-based monitor store has some other drawbacks that cannot be avoided by simply informing the user.  For instance, there is no way to atomically change a set of files.  This issue is not uncommon in file systems, and several techniques have been developed to work around it, but they are annoying and there are several circumstances in which they don&#8217;t really work out.  Take how the monitor applies a version on its data store: it reads V, the latest committed version in the store; creates a new file under foo/V+1 with the new version&#8217;s contents; and writes V+1 to the latest committed version file.  Now say that the store runs out of space when writing the new version&#8217;s contents to disk.  There is a chance that only part of the contents ever reached disk, and we might end up with a corrupted version.  And you say, &#8220;but we didn&#8217;t mark that version as being the last committed version, so there&#8217;s no problem, right?!?&#8221;.  Well, that is true, to some extent.  The real story is that, during recovery, the monitor <b>might</b> check if there is an uncommitted version in the store and, if so, try to run it through Paxos; in this case that version might be corrupted.  So one would say that this is a bug, the store&#8217;s fault, and one would certainly be correct: it could be avoided by stashing the version&#8217;s crc before writing the version out, and checking that the crc matches the version read back before doing anything with it.</p>
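<p>To make the crc idea concrete, here is a toy version of it in shell. It is not how the monitor store was implemented: the directory layout, the function names, and the use of the cksum utility are all assumptions chosen to keep the sketch short.</p>

```shell
#!/bin/sh
store=$(mktemp -d)   # stand-in for the monitor's data store directory

# Checksum of a string, computed before anything is written to disk.
crc_of() {
    printf '%s' "$1" | cksum | cut -d' ' -f1
}

# Stash the crc first, then write the version's contents.
write_version() {
    ver=$1; data=$2
    crc_of "$data" > "$store/$ver.crc"
    printf '%s' "$data" > "$store/$ver"
}

# Succeed only if the contents on disk match the stashed crc.
verify_version() {
    ver=$1
    [ -r "$store/$ver" ] || return 1
    [ -r "$store/$ver.crc" ] || return 1
    [ "$(crc_of "$(cat "$store/$ver")")" = "$(cat "$store/$ver.crc")" ]
}

write_version 2 "contents of version 2"
verify_version 2 || echo "version 2 is corrupted"

# Simulate a write cut short by a full disk, then check again.
printf 'cont' > "$store/2"
verify_version 2 || echo "version 2 is corrupted"
```

<p>Only the second check should complain. A recovery path that verified uncommitted versions this way could discard the truncated one instead of proposing it.</p>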
<p>Sure, we could have kept on working on the file-based store, adding features as we deemed necessary, and that&#8217;s what we probably would have done if we weren&#8217;t about to perform a major rework on the monitor.  Therefore, instead of focusing on extending the existing file-based store, we decided it was time to move on to a key/value store with all the properties we were looking for, and given that we have already been using one such store in Ceph, we just went ahead and used leveldb (<a href="http://code.google.com/p/leveldb/">http://code.google.com/p/leveldb/</a>).</p>
<p>In fact, the legacy file-based store acted much like a key/value store when it came to data placement.  It mainly consisted of files holding data, the filename acting as the key, the data acting as the value.  Thus, moving to a key/value store didn&#8217;t pose much of an ordeal, and it gave us something we were really looking forward to using in the remaining architectural rework of the monitor: transactions, that is, the ability to perform multiple modification operations in one single atomic batch.</p>
<h3>PAXOS &#038; MONITOR SERVICES</h3>
<div id="attachment_3068" class="wp-caption alignright" style="width: 251px"><a href="http://ceph.com/wp-content/uploads/2013/03/2013-03-07_17-57-12.png"><img src="http://ceph.com/wp-content/uploads/2013/03/2013-03-07_17-57-12-241x220.png" alt="Figure 1" title="mon_fig1" width="241" height="220" class="size-medium wp-image-3068" /></a><p class="wp-caption-text">Figure 1</p></div>
<p>We have already discussed how the monitor maintains map consistency throughout the cluster by resorting to Paxos, but we didn&#8217;t give much detail on it.  Without getting into excruciating detail, the monitor can be seen as being divided into six services, each responsible for handling one kind of information: authentication, logging, MDS, Monitor, PG and OSD maps.  Each of these services is what we call a &#8216;Paxos Service&#8217;, given that they pretty much behave as Paxos machines, each one maintaining a Paxos instance (see Figure 1).  This means that at any given moment, it would be theoretically possible to have six parallel modifications going on, provided each one was of a different type.  In reality they are not really parallel, as the monitor only handles one message at a time, but it is possible to keep multiple concurrent Paxos proposals.</p>
<p>Having a Paxos instance per service guarantees that each service will keep track of its own versions, and will be responsible for their maintenance, which may differ from service to service depending on different requirements and criteria.  Basically, this approach confers a great deal of autonomy to each service, at the expense of some redundancy by having multiple Paxos instances when just one would be enough.  In Figure 1 we roughly depict how each service used to perform its read and write operations on the monitor data store.  In a nutshell, most modifications would be made through the service&#8217;s Paxos instance, while reads would be performed directly by the service.  We say most modifications because we would only resort to Paxos when dealing with a new version on the cluster.  There were several other modifications that would be done directly on the store, as long as they were considered as not affecting the global Paxos state, such as version trimming (i.e., getting rid of old, unnecessary versions).</p>
<p>Disregarding the conceptual architecture and diving for a moment into the implementation point-of-view, each service also involved quite a bit of effort when accessing its own data, as they were required to some extent to explicitly use the file-based monitor store interface to access their allocated namespace within the file system.  Little to no abstraction was provided.</p>
<p>This brings us to our ultimate goal with the architectural rework: use one single Paxos instance across all services, while keeping their autonomy and sandboxing their store accesses to their own namespace using a clean and simple-to-use interface.</p>
<h3>ONE PAXOS TO RULE THEM ALL</h3>
<div id="attachment_3070" class="wp-caption alignright" style="width: 244px"><a href="http://ceph.com/wp-content/uploads/2013/03/2013-03-07_17-58-59.png"><img src="http://ceph.com/wp-content/uploads/2013/03/2013-03-07_17-58-59-234x220.png" alt="Figure 2" title="mon_fig2" width="234" height="220" class="size-medium wp-image-3070" /></a><p class="wp-caption-text">Figure 2</p></div>
<p>Although using a single Paxos instance makes sense, it involved some serious reworking on how the services perceive their world, as well as how Paxos is used within the monitor.</p>
<p>Instead of keeping up with the previous approach of using Paxos solely to run new versions of a specific service through the other monitors in the cluster, we now use it to perform any change whatsoever across the cluster, thus guaranteeing that all the monitors are constantly in sync &#8212; and this means trimming too, which is now enforced to happen at the same time across the cluster.  Therefore, with the Single Paxos approach we make sure that every write is run through Paxos prior to being applied to the store, although services can read the store directly (see Figure 2).</p>
<p>This approach posed one major challenge: given that service versions (and by that we mean, for instance, map epochs) were directly associated with the Paxos version, ranging from [1,n] in incremental fashion, how would we now deal with this given that we have only a single Paxos instance?  Would we end up with gaps in map epochs? Were this a headline, I could easily refer to Betteridge&#8217;s law of headlines (<a href="http://en.wikipedia.org/wiki/Betteridge's_law_of_headlines">http://en.wikipedia.org/wiki/Betteridge&#8217;s_law_of_headlines</a>); given it&#8217;s not, I will just have to answer No! and explain why.</p>
<p>In dissociating Paxos from the services, the Paxos version became analogous to a global version, representing a given proposal&#8217;s version instead of a map epoch.  The services kept their responsibility of managing their own versions, and are absolutely oblivious to the fact that there is only one single Paxos instance &#8212; they really don&#8217;t care, they just push their changes up the chain, and propose them to the cluster.  The same goes for Paxos.  By leveraging the new key/value store&#8217;s capability to perform transactions, not only are we able to abstract the services from however Paxos deals with versions, but we are also able to abstract Paxos from whatever the services propose, which didn&#8217;t happen before &#8212; the Paxos/service relation was so tight that a Paxos proposal took the form of set version ‘foo&#8217; with contents ‘bar&#8217; for service ‘baz&#8217;.</p>
<div id="attachment_3071" class="wp-caption alignright" style="width: 180px"><a href="http://ceph.com/wp-content/uploads/2013/03/2013-03-07_17-59-49.png"><img src="http://ceph.com/wp-content/uploads/2013/03/2013-03-07_17-59-49-170x220.png" alt="Figure 3" title="mon_fig3" width="170" height="220" class="size-medium wp-image-3071" /></a><p class="wp-caption-text">Figure 3</p></div>
<p>With the support of transactions, however, we can make a service generate a transaction containing the operations it wants to perform on its namespace &#8212; which will be properly adjusted to reflect the service&#8217;s namespace without the service being aware.  The transaction will then be encoded into a byte array (Ceph has all the data structures allowing this to happen effortlessly), and submitted to Paxos.  Take Figure 3, where we depict this process.  Once the service&#8217;s transaction reaches Paxos, a new transaction will be created, reflecting the new Paxos version.  In Figure 3 we can see that Paxos will create a new version 42 with the contents of the service&#8217;s encoded transaction &#8212; Paxos won&#8217;t care what the contents really are though; they are meaningless from its point-of-view.  Once the proposal is acknowledged by a majority of monitors, each monitor will perform one single transaction composed of the Paxos transaction&#8217;s operations and the service&#8217;s transaction&#8217;s operations &#8212; all of them applied in one single atomic batch.</p>
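<p>The flow in Figure 3 can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: JSON text stands in for Ceph&#8217;s own encoding into a byte array, a plain dictionary stands in for the store, and the function names and key layout (<code>propose</code>, <code>commit</code>, <code>paxos/last_committed</code>) are invented for the example.</p>
<pre class="code">import json


def encode(ops):
    """Serialize a list of operations (JSON stands in for Ceph's own encoding)."""
    return json.dumps(ops)


def propose(paxos_version, service_ops):
    """Build the Paxos-level operations that carry an opaque service transaction."""
    return [
        ["put", "paxos", str(paxos_version), encode(service_ops)],
        ["put", "paxos", "last_committed", str(paxos_version)],
    ]


def commit(store, paxos_ops):
    """Apply the Paxos operations and the carried service operations as one batch."""
    service_ops = json.loads(paxos_ops[0][3])
    staged = dict(store)
    for op, prefix, key, value in paxos_ops + service_ops:
        if op == "put":
            staged[prefix + "/" + key] = value
        elif op == "erase":
            staged.pop(prefix + "/" + key, None)
        else:
            raise ValueError("unknown operation: " + op)
    return staged  # the caller swaps this in only if nothing failed


# Hypothetical service transaction: a new OSD map epoch 7.
service_ops = [
    ["put", "osdmap", "7", "contents of epoch 7"],
    ["put", "osdmap", "last_committed", "7"],
]
store = commit({}, propose(42, service_ops))</pre>
<p>Note that <code>propose</code> never looks inside the service&#8217;s operations; it only stores them under the new Paxos version, which mirrors how Paxos treats the encoded transaction as meaningless payload.</p>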
<p>This approach is also used for pretty much any operation that must be applied throughout the cluster in a consistent manner.  For instance, while we used to let each service, on each monitor, decide when to trim its versions, we now delegate that decision only to the Leader of the monitor quorum.  Periodically, the Leader will assess which versions, either Paxos or service-specific, need to be trimmed, generating a transaction composed of erase() operations over Paxos versions, alongside service-specific versions (if any).  Similarly to what happens with other modifications, this transaction is proposed through Paxos, which will create a new version containing the encoded proposed transaction, finally applying it throughout the cluster.</p>
<p>One might have noticed that we just stated that trimming versions is also a Paxos proposal that will lead to a new Paxos version.  Well, that is by design, and comes in wonderfully useful when recovering drifted monitors.</p>
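<p>A minimal Python sketch of how such a trim transaction might be assembled follows.  The <code>keep</code> parameter, the <code>first_committed</code> key and the helper name are assumptions for the sake of the example; the actual trimming policy lives in the monitor code and is not reproduced here.</p>
<pre class="code">def build_trim_ops(prefix, first_committed, last_committed, keep):
    """Return erase operations for all but the newest `keep` versions."""
    trim_to = max(first_committed, last_committed - keep + 1)
    ops = [["erase", prefix, str(v), None] for v in range(first_committed, trim_to)]
    if ops:
        # Record the new oldest version so readers know where history starts.
        ops.append(["put", prefix, "first_committed", str(trim_to)])
    return ops


# The Leader bundles Paxos and service-specific trims into one proposal.
trim_ops = build_trim_ops("paxos", 1, 10, 4) + build_trim_ops("osdmap", 3, 5, 4)</pre>
<p>Here the Paxos versions 1 through 6 are erased, while the hypothetical osdmap service, holding only versions 3 to 5, contributes nothing.  The resulting list would then be proposed like any other transaction.</p>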
<p>A monitor is considered as having drifted if it is behind a given number of Paxos versions.  If this number is small enough such that its last committed version is within the interval of available versions on the remaining cluster monitors, then the monitor is able to recover without much effort, simply by obtaining the missing Paxos versions and re-applying them on the store &#8212; some of these versions can simply add new information to the store, or erase old versions; regardless, the monitor will obtain a consistent state with the remaining cluster.</p>
<p>However, at times there is a chance that the monitor drifted so much that it no longer shares any Paxos version with the remaining cluster.  At this point, the monitor must perform a store-wide synchronization.</p>
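<p>The decision between the two recovery paths described above boils down to an interval check.  The Python sketch below is an illustration only; the function name and the returned labels are made up, and the real monitor weighs more state than three integers.</p>
<pre class="code">def recovery_plan(my_last_committed, peer_first_committed, peer_last_committed):
    """Decide how a drifted monitor catches up with the rest of the cluster."""
    available = range(peer_first_committed, peer_last_committed + 1)
    if my_last_committed in available:
        # We still share history: fetch and re-apply only the missing versions.
        missing = list(range(my_last_committed + 1, peer_last_committed + 1))
        return ("catch_up", missing)
    # No shared Paxos version is left: the whole store must be synchronized.
    return ("full_sync", [])


near = recovery_plan(40, 30, 42)
far = recovery_plan(10, 30, 42)</pre>
<p>In this example <code>near</code> asks for versions 41 and 42, whereas <code>far</code> falls back to a store-wide synchronization because version 10 was trimmed from the peers long ago.</p>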
<h3>STORE-WIDE SYNCHRONIZATION</h3>
<p>Prior to v0.58, when a Paxos service drifted beyond a given number of versions, a mechanism called slurp would be triggered.  In a nutshell, this mechanism consisted of establishing a connection with the quorum Leader and obtaining every single version the Leader had, for every service that had drifted.  Such an approach was adequate for the one-Paxos-per-service architecture, but wouldn&#8217;t fare so well on a single Paxos architecture.  The reason is simple and follows the behavior of Paxos as it was described in the previous section: Paxos versions no longer represent service versions, and only synchronizing them would certainly lead to a corrupted state, with lots and lots of information missing.</p>
<p>So we got rid of slurp.  Instead, we leveraged leveldb&#8217;s snapshots and iterators, and we now perform a store-wide sync.  This means that once a monitor (hereafter known as Requester) finds out it has drifted beyond salvaging, it will request some other monitor (hereafter known as Provider) to perform a sync.  The Provider will then take a snapshot of its store and iterate over it, bundling all the key/values it can find into transactions and sending them to the Requester.  The Requester will apply each received transaction and once it receives the last chunk it will be ready to join the cluster.</p>
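<p>The Provider/Requester exchange can be approximated with the following Python sketch.  It assumes a dictionary in place of leveldb, a sorted copy in place of a real snapshot, and a hypothetical <code>chunk_size</code> measured in keys; none of these names come from the monitor code.</p>
<pre class="code">def sync_chunks(store, chunk_size):
    """Provider side: snapshot the store and yield it as put-only transactions."""
    snapshot = sorted(store.items())  # a sorted copy stands in for a leveldb snapshot
    chunks = [snapshot[i:i + chunk_size] for i in range(0, len(snapshot), chunk_size)]
    if not chunks:
        chunks = [[]]  # an empty store still sends one (empty) final chunk
    for index, chunk in enumerate(chunks):
        ops = [["put", key, value] for key, value in chunk]
        yield {"ops": ops, "last": index == len(chunks) - 1}


def apply_sync(chunks):
    """Requester side: apply each chunk in order until the last one arrives."""
    store = {}
    for chunk in chunks:
        for op, key, value in chunk["ops"]:
            store[key] = value
        if chunk["last"]:
            return store
    raise RuntimeError("sync ended before the last chunk arrived")


provider = {"paxos/41": "a", "paxos/42": "b", "osdmap/7": "c", "osdmap/8": "d", "auth/3": "e"}
chunks = list(sync_chunks(provider, 2))
requester = apply_sync(chunks)</pre>
<p>Because the chunks are cut from a snapshot, writes that land on the Provider while the sync is in flight do not leak into a half-transferred store; the Requester picks those up afterwards through the regular Paxos catch-up.</p>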
<p>The great thing about this new mechanism is that, unlike slurp, the Requester doesn&#8217;t really need to synchronize directly from the quorum&#8217;s Leader.  Instead, it may synchronize from any given monitor in the quorum, and there may be any given number of syncs being performed simultaneously, without overloading the Leader.</p>
<h3>BUT, BUT&#8230; IS UPGRADING COMPLEX? IS IT POSSIBLE TO REVERT?</h3>
<p>Well&#8230; no and kind of.</p>
<p>Sometime around Bobtail, the monitor started recording a Global Version for each version a service proposed through its Paxos instance.  After some time running, the monitor would be holding a mapping from service-specific Paxos versions to a Global Version, and would then set a flag on its store stating just that: we are now able to map any Paxos version to a global id.</p>
<p>This was slipped into the monitor in order to allow us to upgrade a monitor from the one-Paxos-per-service to the single Paxos architecture.  So, basically, as long as one has been running a Bobtail monitor for some time, upgrading to the new monitor should be as simple as restarting it and waiting for the store conversion to finish.  This conversion will be triggered automatically, and may take some time if the store is big enough.  So, no, upgrading is not complex, granted you are coming from Bobtail; otherwise, you will have to upgrade to Bobtail and take it from there.</p>
<p>If upgrading fails for some reason, we would really appreciate it if you&#8217;d let us know on the <a href="http://ceph.com/resources/mailing-list-irc/">mailing-list and/or on IRC</a>.  In any case, there is no need for despair.  Given that during conversion we only perform read operations on the legacy file-based store, and we convert everything into a leveldb sub-directory on the monitor&#8217;s data directory, you can easily revert to your original data store simply by running your old monitor.  However, if the upgrade did not fail, that is, if you successfully upgrade all your monitors and they form a quorum, from that point onward there is no going back (unless you are okay with just reverting to an older state).  Furthermore, you should be aware that this upgrade does not allow for mixed monitor clusters, so there is no point in trying to upgrade just part of your monitor cluster: it won&#8217;t work, as the post-rework code is unable to understand the pre-rework way of doing business.</p>
<h3>SUMMARY</h3>
<p>Over the past ten months the Ceph Monitor underwent a major rework, from its backend data store, which moved from a file-based format to a key/value store supporting atomic transactions, to the way versions are created, unifying all services under a single Paxos instance and sandboxing their access to the data store.  This rework allowed us to remove some limitations of the previous architecture and, by dissociating Paxos from the monitor services, to create an architecture that lets us disseminate information throughout the cluster in a seamless way and simplifies how new capabilities can be built around and within the monitor.  Future versions of the monitor may for instance include a generic key/value store, such that a user could stash and retrieve information deemed necessary, while benefiting from the distributed and high-availability nature of the monitor.  There&#8217;s work being done towards such an implementation, taking advantage of the mechanisms now in place, leveraging Paxos as a conduit of modifications throughout the cluster.</p>
<p>If you would like to know more, feel free to take a look at the commit messages of the patches introducing the whole architectural rework (<a href="https://github.com/ceph/ceph/commit/a5e2dcb33d915dca26558909647e2e56ed1c23f4">single Paxos</a>, <a href="https://github.com/ceph/ceph/commit/86f6a342715e50cbd304e73d38af74ccfcfffbc4">trimming through Paxos</a>, and <a href="https://github.com/ceph/ceph/commit/cab3411b4a06a8cd9bac3feac49dc423981cc808">store sync</a>), dive into the <a href="https://github.com/ceph/ceph">source code</a>, or chat us up on <a href="http://ceph.com/resources/mailing-list-irc/">the mailing list or IRC</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://ceph.com/dev-notes/cephs-new-monitor-changes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>CephFS MDS Status Discussion</title>
		<link>http://ceph.com/dev-notes/cephfs-mds-status-discussion/</link>
		<comments>http://ceph.com/dev-notes/cephfs-mds-status-discussion/#comments</comments>
		<pubDate>Tue, 05 Mar 2013 16:45:48 +0000</pubDate>
		<dc:creator>gfarnum</dc:creator>
				<category><![CDATA[Dev notes]]></category>

		<guid isPermaLink="false">http://ceph.com/?p=3042</guid>
		<description><![CDATA[There have been a lot of questions lately about the current status of the Ceph MDS and when to expect a stable release. Inktank has been having some internal discussions around CephFS release development, and I&#8217;d like to share them with you and ask for feedback! A couple quick notes: first, this blog post is [...]]]></description>
			<content:encoded><![CDATA[<p>There have been a lot of questions lately about the current status of the Ceph MDS and when to expect a stable release. Inktank has been having some internal discussions around CephFS release development, and I&#8217;d like to share them with you and ask for feedback!</p>
<p>A couple quick notes: first, this blog post is from the perspective of Inktank&#8217;s development. We aren&#8217;t the <a href="https://github.com/ceph/ceph/commit/b7e698a52bf7838f8e37842074c510a6561f165b">only</a> ones generating metadata server (MDS) patches, and other parties might make contributions with different priorities! Second, this is a discussion about MDS development — look for a blog about what the MDS does and how it works coming soon!</p>
<p><span id="more-3042"></span></p>
<p><strong>Current Status</strong></p>
<p>Over the past year, we at Inktank have regretfully stepped back from the filesystem. We still believe its feature set and capabilities will revolutionize storage, but we realized it required a lot more work to become a stable product than RBD and RGW, so we focused our efforts on the software we could give to customers. That is still Inktank&#8217;s organizational focus, but at the turn of the year something wonderful (for me personally) happened! We created an internal CephFS team, and Sam Lang and I have been devoting an increasing amount of our time to work on the MDS and filesystem development. This renewed focus has emphasized what kinds of issues remain. There are a few brave organizations using CephFS in testing or production capacities, but the more important its use is to them the less functionality they rely on. For community members my recommendation has been to test CephFS under your workload for two weeks, inject some failures (node restarts, etc), and if it works through that then it should continue working. Some people have run systems for months without issues, but others run into trouble on their second or third command. Basically, if your workload happens to look like one of the test suites we regularly run it should be good, but if it deviates even a little there are hidden traps lying in wait from bugs that we haven&#8217;t yet discovered.</p>
<p>Initially our goal was to stabilize the features the filesystem already has and develop fsck, but through recent discussions we realized we hadn&#8217;t sat down and figured out what our users and customers actually needed CephFS to do in order to put it into production deployments. More than that, while we&#8217;ve viewed CephFS for years as this big ball of awesomeness with features like snapshotting and multiple active servers and unlimited directory sizes, we don&#8217;t know which of those features are actually necessary for a first release — and the bugs we&#8217;ve been working on have reminded us that some of them require a lot more stability work than others.</p>
<p>Keeping that thought in mind, we&#8217;re now starting a discussion with customers, users, and the community at large about what a &#8220;<a href="http://en.wikipedia.org/wiki/Minimum_viable_product">minimum viable product</a>&#8221; for CephFS would look like. Our starting point is just what&#8217;s easy from a development perspective, and we would welcome feedback from users on whether this works for them, or how it would need to change before they could deploy it.</p>
<p><strong>Minimum Viable Product Proposal</strong></p>
<p>As we put more engineering resources back into CephFS, we are looking at what we would consider as the minimal useful feature set in order to get CephFS into the hands of production users as soon as possible. We are currently considering it to be a single active MDS, with a maximum number of entries in a single directory, no fsck, and no snapshots. This delivers a POSIX-compliant filesystem that can be mounted on thousands of clients, scales to arbitrarily large data throughputs, allows an unlimited number of files in the hierarchy, is location-aware, and can be used through a number of interfaces (Ganesha NFS, Hadoop, the in-kernel and FUSE-based clients, Samba, etc).</p>
<p>Let me break down what each of those assertions means in more detail.</p>
<p><em>Single active MDS</em></p>
<p>One of CephFS&#8217; flagship features is its horizontal scalability across very large numbers of metadata server daemons. This will continue to be a flagship feature in the future, but right now it introduces significant system instability so it will not be a part of our initial supported release.</p>
<p><em>However</em>, the standby and active standby features are very stable and will be part of the first release. This means that the fast failover features (30 seconds or much less, depending on user settings, hardware, and tolerance for unnecessary failovers) will function, allowing users to provision as many servers as they wish in case of a hardware failure, or to take over during maintenance.</p>
<p>The primary limits implied by a single-MDS configuration are the number of metadata operations/second the system can handle, the number of simultaneous client connections it can handle, and the amount of metadata the MDS can store in memory. This last provides a limit on how many files can be in use simultaneously with good performance, but <em>not</em> on total number of files in the system (which remains effectively unlimited). As always, the amount of RAM and CPU available to the MDS node will have a dramatic impact on where precisely these limits fall.</p>
<p>Aside: We currently default to 100,000 inodes in the cache, but that is extremely conservative and fits inside the low hundreds of MB of RAM. We don&#8217;t yet have recent specific values on memory consumption per inode.</p>
<p><em>Maximum number of entries in a single directory</em></p>
<p>CephFS includes preliminary support for directory &#8220;fragmenting&#8221; (or sharding), which allows us to both split up a single directory on-disk and to split it up between multiple MDS servers. Again though, while the code exists it requires a significant amount of validation and debugging, so our first release will not provide support and we will need to limit the total number of entries allowed within a single directory. This is a soft limit open to negotiation — the MDS needs to be able to hold the whole directory in-memory whenever it is read off disk, and if the directory holds more entries than the MDS cache can hold the cache efficiency and overall performance will naturally degrade (and, if more than one directory is in use they will feed back on each other).</p>
<p>However, &#8220;manually&#8221; sharding directories by splitting them up according to any given heuristic (which splits them finely enough) works just fine, and this limit does not imply a limit on the total number of files in the system. (As long as one considers the maximum amount of active metadata discussed above.)</p>
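<p>One possible heuristic for such manual sharding, sketched in Python, is to derive subdirectory names from a hash of the filename. This is only an example of a heuristic that splits finely; the function name, the mount point and the choice of SHA-1 are assumptions, and nothing here is CephFS-specific.</p>
<pre class="code">import hashlib
import posixpath


def sharded_path(root, filename, levels=2, width=2):
    """Place `filename` under `levels` hash-derived subdirectories of `root`."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return posixpath.join(root, *parts, filename)


# Hypothetical mount point and file name.
path = sharded_path("/mnt/cephfs/data", "report-0001.csv")</pre>
<p>With the defaults above, each level has 256 possible subdirectories, so files spread over 65,536 leaf directories and any single directory stays comfortably small.</p>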
<p><em>No fsck</em></p>
<p>As with many distributed systems, CephFS does not currently provide an fsck. An initial design exists but has not yet been implemented. CephFS does of course inherit RADOS&#8217; underlying reliability methods, which include a periodic scrub of the data for consistency between replicas, checksum-based validity checks (upcoming in the Cuttlefish release), and replication of data and recovery when it degrades. Unfortunately this does not completely protect CephFS: objects which are completely lost will translate into file holes, and will not necessarily trigger alarms. Any lost directories in the filesystem hierarchy (which will be detected) must be repaired manually.</p>
<p>On the positive side, due to Ceph&#8217;s design, any portions of the hierarchy which have not been damaged will continue to function even if data has been lost.</p>
<p><em>No snapshots</em></p>
<p>While CephFS has preliminary support for snapshots of directory hierarchies, it too requires significant hardening and debugging. We will not support them in our initial release. When they are released, it will be a pioneering feature among distributed filesystems.</p>
<p><em>POSIX-compliant</em></p>
<p>CephFS is POSIX-compliant, always has been, and always will be. It provides proper consistency (rather than the close-to-open consistency NFS provides), and supports even less commonly used features such as file locking.</p>
<p>Hard links are also supported, although in their current implementation each link requires a small bit of MDS memory and so there is an implied limit based on your available memory. We have designed but not implemented a new solution to avoid this problem.</p>
<p><em>Can be used by thousands of clients simultaneously</em></p>
<p>In past testing (during Sage Weil&#8217;s PhD thesis work), MDS servers have had no trouble handling 1000 clients each, and while we haven&#8217;t tested recently we expect that number to have improved rather than degraded.</p>
<p><em>Scales to arbitrary data throughputs</em></p>
<p>In CephFS, once the client has opened a file, the MDS does not play a further role in the data path. That means that if your clients can send the data, and your OSDs can write the data, the single-MDS limit will not directly impact the aggregate bandwidth available. The implied limits are those based on how much each client can send out and the total number of active files the MDS can handle.</p>
<p><em>Allows an unlimited number of files in the hierarchy</em></p>
<p>Although there are limits to the size of a single directory as discussed above, Ceph does not require that every file in the system take up MDS memory at all times. This means that unlike many other systems, it does not and never will have a hard limit on the total number of files available.</p>
<p><em>Is location aware</em></p>
<p>CephFS is built on RADOS, which has a failure domain-based layout engine (which by default naturally maps onto the physical host, rack, row, room layouts of the data center). CephFS allows clients to query this layout data for files and optionally to read from local replicas. Systems which are interested in location awareness will also appreciate the ability to set custom layouts on every file, specifying the underlying object size and pool (which further dictates the striping strategy in use).</p>
<p><em>Many interfaces</em></p>
<p>Native Ceph clients are available in the upstream Linux kernel and in userspace as both a library and a FUSE module. In addition to the regular interface options available through those standard mechanisms, the library has been integrated into the Ganesha NFS server and Samba, fully integrates with Hadoop, and can be integrated into any custom application.</p>
<p>&nbsp;</p>
<p><strong>Feedback</strong></p>
<p>As I said, we would love to get your feedback on these ideas. I&#8217;m starting a discussion thread on ceph-users as this blog goes up; you can comment here; or you can drop by IRC and ping any of us. We&#8217;d appreciate any information you can provide!</p>
<p>-Greg out</p>
]]></content:encoded>
			<wfw:commentRss>http://ceph.com/dev-notes/cephfs-mds-status-discussion/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Deploying Ceph with Juju</title>
		<link>http://ceph.com/dev-notes/deploying-ceph-with-juju/</link>
		<comments>http://ceph.com/dev-notes/deploying-ceph-with-juju/#comments</comments>
		<pubDate>Thu, 21 Feb 2013 17:55:03 +0000</pubDate>
		<dc:creator>scuttlemonkey</dc:creator>
				<category><![CDATA[Dev notes]]></category>
		<category><![CDATA[howto]]></category>

		<guid isPermaLink="false">http://ceph.com/?p=1483</guid>
		<description><![CDATA[The last few weeks have been very exciting for Inktank and Ceph. There have been a number of community examples of how people are deploying or using Ceph in the wild. From the ComodIT orchestration example, to the unique approach of Synnefo delivering unified storage with Ceph and many others that haven&#8217;t made it to [...]]]></description>
			<content:encoded><![CDATA[<p>The last few weeks have been very exciting for Inktank and Ceph.   There have been a number of community examples of how people are deploying or using Ceph in the wild.  From the <a href="http://ceph.com/community/deploying-ceph-with-comodit/">ComodIT orchestration example</a>, to the unique approach of <a href="http://ceph.com/community/ceph-comes-to-synnefo-and-ganeti/">Synnefo delivering unified storage</a> with Ceph and many others that haven&#8217;t made it to the blog yet.  It is a great time to be doing things with Ceph!</p>
<p>We at Inktank have been just as excited as anyone in the community and have been playing with a number of deployment and orchestration tools.  Today I wanted to share an experiment of my own for the general consumption of the community, deploying Ceph with Canonical&#8217;s relatively new deployment tool, &#8216;<a href="https://juju.ubuntu.com/">Juju</a>,&#8217; that is taking cloud deployments by storm.  If you follow this guide to the end you should end up with something that looks like this:</p>
<p><a href="http://ceph.com/wp-content/uploads/2012/11/2013-02-19_15-50-45.png"><img src="http://ceph.com/wp-content/uploads/2012/11/2013-02-19_15-50-45-261x220.png" alt="" title="2013-02-19_15-50-45" width="261" height="220" class="alignnone size-medium wp-image-2992" /></a><br />
<span id="more-1483"></span></p>
<p>Juju is a &#8220;<a href="https://juju.ubuntu.com/docs/faq.html">next generation service deployment and orchestration framework</a>&#8221;. The cool part about Juju is you can use just about anything to build your Juju “charms” (recipes), from bash and your favorite scripting language all the way up to Chef and Puppet. A good portion of the knowledge for the Ceph charms developed by Clint Byrum and James Page actually came from both the Chef cookbooks and the work on ceph-deploy, which we’ll cover in later installments.</p>
<p>For the purposes of this experiment I decided to build the environment using Amazon’s EC2, but you can also use an OpenStack deployment or your own bare metal in conjunction with Canonical&#8217;s <a href="https://maas.ubuntu.com">MAAS</a> product. The client machine used to spin up the bootstrap environment and then later spin up all the other servers will be an Ubuntu Quantal (12.10) image, but could be any Ubuntu box, including your laptop. The rest of the working machines will be spun up using Quantal as well.</p>
<p>Juju is very generous about spinning up new boxes (typically one per service) so I chose to make all of my boxes spin up using the &#8216;t1.micro&#8217; machine size so anyone playing with this guide wouldn’t incur massive EC2 charges. Now, on to the meat!</p>
<h3>Getting Started</h3>
<p>As I said, start the process by spinning up an Ubuntu 12.10 image as your client; this way you don’t have to dump a bunch of software/config on your local machine. This will be the client you use to spin everything else up. Once you have your base Ubuntu install, let’s add the PPA and install Juju.</p>
<pre class="code">&gt; sudo apt-add-repository ppa:juju/pkgs
&gt; sudo apt-get update &amp;&amp; sudo apt-get install juju</pre>
<p>Now that we have Juju installed we need to tell it to generate a config file.</p>
<pre class="code">&gt; juju bootstrap</pre>
<p>This will throw an error, but creates ~/.juju/environments.yaml for you to edit. Since we’re using EC2 we need to tell Juju about our credentials so it can spin up new machines and deploy new services.  You&#8217;ll notice that I&#8217;m using the default-series of &#8216;quantal&#8217; for all of my node machines.  This is important since this tells juju where and how to grab the important bits of each charm.</p>
<pre class="code">&gt; vi ~/.juju/environments.yaml

default: cephtest
environments:
  cephtest:
    type: ec2
    access-key: YOUR-ACCESS-KEY-GOES-HERE
    secret-key: YOUR-SECRET-KEY-GOES-HERE
    control-bucket: (generated by juju)
    admin-secret: (generated by juju)
    default-series: quantal
    juju-origin: ppa
    ssl-hostname-verification: true</pre>
<h3>Setting up the Bootstrap Environment</h3>
<p>Now that Juju can interact with EC2 directly we need to get a bootstrap environment set up that will hold our configs and deploy our services. Since I can’t set the global configs yet, I need to tell it manually that this box needs to be a &#8216;t1.micro&#8217; instance.</p>
<pre class="code">&gt; juju bootstrap --constraints "instance-type=t1.micro"</pre>
<p>This will take a few minutes to spin up the machine and get the environment set up.  Once this is completed you should be able to see the machine via the &#8216;juju status&#8217; command.</p>
<pre class="code">&gt; juju status

2012-11-07 13:06:30,645 INFO Connecting to environment...
2012-11-07 13:06:42,313 INFO Connected to environment.
machines:
  0:
    agent-state: running
    dns-name: ec2-23-20-70-201.compute-1.amazonaws.com
    instance-id: i-d79492ab
    instance-state: running
services: {}
2012-11-07 13:06:42,408 INFO 'status' command finished successfully
</pre>
<p>Now we have a bootstrap environment and we can tell it that all boxes should default to &#8216;t1.micro&#8217; unless otherwise specified. There are a number of settings that you can monkey with; take a look at the <a href="https://juju.ubuntu.com/docs/constraints.html">constraints doc</a> for more details.</p>
<pre class="code">&gt; juju set-constraints instance-type=t1.micro</pre>
<h3>Make it Pretty!</h3>
<p>For those who like to see a visual representation of what&#8217;s happening, or just feel like letting someone else watch what&#8217;s going on, Juju now has a GUI that you can use.  While I wouldn&#8217;t recommend using the GUI as a replacement for the command line to deploy the charms below, you can certainly use it to watch what&#8217;s happening.  For more mature charms (and in the future) this GUI should be more than capable of managing your resources.  In any case, it&#8217;s neat to have pretty pictures as you tapdance on the CLI.</p>
<p>If you would like to install the GUI feel free to grab my version of the &#8216;juju-gui&#8217; charm (at the time of this article the main charm wasn&#8217;t on quantal yet):</p>
<pre class="code">&gt; juju deploy cs:~pmcgarry/quantal/juju-gui</pre>
<p>Once that completes (and it could take a while for everything to download and install) you&#8217;ll need to &#8216;expose&#8217; it so you can get to it:</p>
<pre class="code">&gt; juju expose juju-gui</pre>
<p>This will give you the ability to access the box publicly via a web browser at the ec2 address shown in &#8216;juju status&#8217;.  The default user name and password are &#8216;admin&#8217; and the &#8216;admin-secret&#8217; value from your ~/.juju/environments.yaml file.  Feel free to leave that up while you do the rest of this work to watch the magic happen.</p>
<h3>Prep for Ceph Deployment</h3>
<p>Our Juju environment is now ready to start spinning up our Ceph cluster; we just need to do a little leg work so Juju has all the important details up-front. First we need to grab a few Ceph tools:</p>
<pre class="code">&gt; sudo apt-get install ceph-common &#038;&#038; sudo apt-get install uuid</pre>
<p>We need to generate a uuid and auth key for Ceph to use.</p>
<pre class="code">&gt; uuid</pre>
<p>Insert this as the $fsid below.</p>
<pre class="code">&gt; ceph-authtool /dev/stdout --name=$NAME --gen-key</pre>
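<p>Insert the key this prints as the $monitor-secret below. ($NAME is the name of the entity the key belongs to; Ceph&#8217;s monitor key is conventionally named &#8216;mon.&#8217;)</p>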
<p>Now we need to drop these (and a few other) values into our yaml file:</p>
<pre class="code">&gt; vi ceph.yaml

ceph:
    source: http://ceph.com/debian-bobtail/ quantal main
    fsid: d78ae656-7476-11e2-a532-1231390a9d4b
    monitor-secret: AQDcNRlR6MMZNRAAWw3iAobsJ1MLoFBLJYo4yg==

ceph-osd:
    source: http://ceph.com/debian-bobtail/ quantal main
    osd-devices: /dev/xvdf

ceph-radosgw:
    source: http://ceph.com/debian-bobtail/ quantal main</pre>
<p>You&#8217;ll notice we&#8217;re also passing a &#8216;source&#8217; item to Juju. This tells the charm where to grab the appropriate code for Ceph, in this case the <a href="http://ceph.com/resources/downloads/">latest release</a> (Bobtail 0.56.3 when this was written) from Ceph.com.</p>
<h3>Tail Those Logs!</h3>
<p>Since a good portion of this setup is experimental it&#8217;s a good idea to tail the logs.  Thankfully, Juju makes this extremely easy for you to do.  Simply open a second term window, ssh to your client machine, and type:</p>
<pre class="code">&gt; juju debug-log</pre>
<p>This will aggregate all of the logs from your cluster into a single output for easy browsing in case something goes wrong.</p>
<h3>Deploying Ceph Monitors</h3>
<p>Time to start deploying our Ceph cluster! In this case we’re going to deploy the first three machines with ceph-mon (Ceph monitors) since we typically recommend at least three in order to reach a quorum. You&#8217;ll want to wait until all three machines are up before moving on.</p>
<pre class="code">&gt; juju deploy -n 3 --config ceph.yaml cs:~pmcgarry/quantal/ceph</pre>
<p>You’ll notice that while these charms are in the charm store (cs:) they are off on my own user space. This is because I had to make a few tweaky changes for these charms to deploy happily on ec2 and use bobtail and quantal.  These charms are still a bit new so if you have tweaks or changes feel free to give me a shout, or play with the main Ceph charms on <a href="http://jujucharms.com">jujucharms.com</a>.  In the future you’ll be able to deploy using just ‘ceph’ instead of anyone&#8217;s user space.</p>
<pre class="code">EXAMPLE: &gt; juju deploy -n 3 --config ceph.yaml ceph</pre>
<p>This could take a while, so just keep checking &#8216;juju status&#8217; until you have the machines running AND the agents set to &#8216;started.&#8217;  You should also see the debug-log go through a flurry of activity when it starts getting close to the end.</p>
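<p>If you would rather not eyeball the output, the check described above can be scripted. The following is a minimal Python sketch, assuming you have already parsed the YAML that &#8216;juju status&#8217; prints into a dictionary shaped like the output shown in this article. The function name and the sample data are made up for illustration.</p>

```python
def deployment_ready(status, service):
    """Return True once every machine is running and every unit of
    `service` reports agent-state 'started'."""
    machines = status.get("machines", {})
    units = status.get("services", {}).get(service, {}).get("units", {})
    machines_up = all(
        m.get("agent-state") == "running" and m.get("instance-state") == "running"
        for m in machines.values()
    )
    # An empty unit list means the service has not been deployed yet.
    units_up = bool(units) and all(
        u.get("agent-state") == "started" for u in units.values()
    )
    return machines_up and units_up


# Hand-written stand-in for parsed status output (hypothetical values).
sample = {
    "machines": {
        0: {"agent-state": "running", "instance-state": "running"},
        1: {"agent-state": "running", "instance-state": "running"},
    },
    "services": {
        "ceph": {"units": {"ceph/0": {"agent-state": "pending", "machine": 1}}}
    },
}
print(deployment_ready(sample, "ceph"))  # prints False: the unit is still pending
```

<p>Calling something like this in a loop with a short sleep between runs gives the same result as re-running &#8216;juju status&#8217; by hand.</p>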
<p>Once we have the monitors up and running you can take a look at what your deployment looks like. If you want to you can even ssh in to one of the machines using Juju’s built-in ssh tool.</p>
<pre class="code">&gt; juju status

machines:
  0:
    agent-state: running
    dns-name: ec2-50-16-15-64.compute-1.amazonaws.com
    instance-id: i-2b45f657
    instance-state: running
  1:
    agent-state: running
    dns-name: ec2-50-19-23-167.compute-1.amazonaws.com
    instance-id: i-3b368547
    instance-state: running
  2:
    agent-state: running
    dns-name: ec2-107-22-128-107.compute-1.amazonaws.com
    instance-id: i-1f368563
    instance-state: running
  3:
    agent-state: running
    dns-name: ec2-174-129-51-96.compute-1.amazonaws.com
    instance-id: i-15368569
    instance-state: running
services:
  ceph:
    charm: cs:~pmcgarry/quantal/ceph-0
    relations:
      mon:
      - ceph
    units:
      ceph/0:
        agent-state: started
        machine: 1
        public-address: ec2-50-19-23-167.compute-1.amazonaws.com
      ceph/1:
        agent-state: started
        machine: 2
        public-address: ec2-107-22-128-107.compute-1.amazonaws.com
      ceph/2:
        agent-state: started
        machine: 3
        public-address: ec2-174-129-51-96.compute-1.amazonaws.com</pre>
<pre class="code">&gt; juju ssh ceph/0 sudo ceph -s

   health HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds
   monmap e2: 3 mons at {ceph232118103=10.243.121.227:6789/0,ceph501969115=10.245.210.114:6789/0,ceph5423414494=10.245.89.32:6789/0}, election epoch 6, quorum 0,1,2 ceph232118103,ceph501969115,ceph5423414494
   osdmap e1: 0 osds: 0 up, 0 in
    pgmap v2: 192 pgs: 192 creating; 0 bytes data, 0 KB used, 0 KB / 0 KB avail
   mdsmap e1: 0/0/1 up</pre>
<p>From this status we can see that there are three monitors up (&#8220;monmap e2: 3 mons at {&#8230;}&#8221;) and no OSDs (&#8220;osdmap e1: 0 osds: 0 up, 0 in&#8221;). Time to spin up some homes for those bits!</p>
<h3>Deploying OSDs</h3>
<p>Once our monitors look healthy it’s time to spin up some OSDs. Feel free to drop in as many as you please; for the purposes of this experiment I chose to spin up three.</p>
<pre class="code">&gt; juju deploy -n 3 --config ceph.yaml cs:~pmcgarry/quantal/ceph-osd</pre>
<p>That will take a little bit to complete so you may want to go grab an infusion of caffeine at this point. One thing to keep in mind is that earlier in our ceph.yaml we defined the physical devices for our OSDs as /dev/xvdf.  If you are familiar with EC2 you will know that this device doesn&#8217;t exist yet, so our OSD deploy command will spin up and configure the boxes, but the OSDs won&#8217;t have a disk to use until we attach one.</p>
<p>When you get back, if you take a look with juju status you should now see a bunch of new machines and a new section called ceph-osd:</p>
<pre class="code">&gt; juju status

…
  ceph-osd:
    charm: cs:~pmcgarry/quantal/ceph-osd-0
    relations: {}
    units:
      ceph-osd/0:
        agent-state: started
        machine: 4
        public-address: ec2-174-129-82-169.compute-1.amazonaws.com
      ceph-osd/1:
        agent-state: started
        machine: 5
        public-address: ec2-50-16-0-95.compute-1.amazonaws.com
      ceph-osd/2:
        agent-state: started
        machine: 6
        public-address: ec2-75-101-175-213.compute-1.amazonaws.com
    </pre>
<p>Now we need to actually give it the disks it needs.  Via your EC2 console (or using ec2 command line tools) you need to spin up 3 EBS volumes and attach one to each of your OSD machines. If you need help there is a pretty decent, concise walkthrough at:</p>
<p><a href="http://www.webmastersessions.com/how-to-attach-ebs-volume-to-amazon-ec2-instance">http://www.webmastersessions.com/how-to-attach-ebs-volume-to-amazon-ec2-instance</a></p>
<p>Once you have the volumes attached we need to tell Juju to go back and use them:</p>
<pre class="code">&gt; juju set ceph-osd "osd-devices=/dev/xvdf"</pre>
<p>This will trigger a rescan and get your OSDs functioning.  All that’s left now is to connect our monitor cluster with the new pool of OSDs.</p>
<pre class="code">&gt; juju add-relation ceph-osd ceph</pre>
<p>We can ssh into one of the Ceph boxes and take a look at our cluster now:</p>
<pre class="code">&gt; juju ssh ceph/0

&gt; sudo ceph -s

   health HEALTH_OK
   monmap e2: 3 mons at {ceph232118103=10.243.121.227:6789/0,ceph501969115=10.245.210.114:6789/0,ceph5423414494=10.245.89.32:6789/0}, election epoch 6, quorum 0,1,2 ceph232118103,ceph501969115,ceph5423414494
   osdmap e10: 3 osds: 3 up, 3 in
    pgmap v115: 208 pgs: 208 active+clean; 0 bytes data, 3102 MB used, 27584 MB / 30686 MB avail
   mdsmap e1: 0/0/1 up
</pre>
<p>Congratulations, you now have a Ceph cluster! Feel free to write a few apps against it, show it to all of your friends, or just nuke it and start refining your chops for a production deployment.</p>
<h3>Extra Credit</h3>
<p>Since that Juju GUI screen looked so empty I decided I wanted to play a bit more with the tools at my disposal.  If you would like to take this exercise a bit further we can also add a few RADOS Gateway machines and load-balance them behind an haproxy machine.  This takes only a few more commands with Juju:</p>
<pre class="code">&gt; juju deploy -n 3 --config ceph.yaml cs:~pmcgarry/quantal/ceph-radosgw
&gt; juju expose ceph-radosgw

&gt; juju deploy cs:~pmcgarry/quantal/haproxy
&gt; juju expose haproxy

&gt; juju add-relation ceph-radosgw haproxy</pre>
<p>That should be it!  You&#8217;ll notice that I have my own copy of the haproxy charm. This is simply because it isn&#8217;t technically released for quantal yet, but my (unmodified) version seems to run just fine.</p>
<h3>Troubleshooting</h3>
<p>Juju actually makes troubleshooting and iterative development VERY easy (one of my favorite things about it).  If you would like to delve deeper into playing with Juju I highly recommend reading their docs, which are quite good.  However, one of the most useful tools (beyond the debug-log I mentioned earlier) is the ability to step through the hooks as Juju tries to run them.  For example, let&#8217;s say we tried to deploy Ceph and &#8216;juju status&#8217; was telling us there was an &#8216;install-error.&#8217; We could use our second term window to execute the following:</p>
<pre class="code">&gt; juju debug-hooks ceph/0</pre>
<p>This allows us to debug the execution of the hooks on a specific machine (in this case ceph/0).  Now in our main window we can type:</p>
<pre class="code">&gt; juju resolved --retry ceph/0</pre>
<p>We get a preformatted setup in our &#8216;debug-hooks&#8217; window with an indication at the bottom that we&#8217;re on the &#8220;install&#8221; hook.  From here we can change to the hooks directory and rerun the install hook:</p>
<pre class="code">&gt; cd hooks
&gt; ./install</pre>
<p>From here we can troubleshoot errors on this box before going back and pushing a patch to Launchpad.net. I won&#8217;t try to recreate the expansive documentation on the jujucharms site, but fiddling with Juju has been far less frustrating than some other orchestration frameworks I have poked at recently.  Good luck, and happy charming!</p>
<h3>Cleaning Up</h3>
<p>If you would like to close up shop you can either destroy just the services (if you want to keep the machines running for deploying other Juju tests):</p>
<pre class="code">&gt; juju destroy-service ceph
&gt; juju destroy-service ceph-osd
&gt; juju destroy-service ceph-radosgw
&gt; juju destroy-service haproxy</pre>
<p>&#8230;or just drop some dynamite on the whole thing (this will kill everything but your client machine, including your bootstrap environment):</p>
<pre class="code">&gt; juju destroy-environment</pre>
<h3>Wrap Up</h3>
<p>You are now a seasoned veteran of Ceph deployment, what more could you want? If you do have questions, comments, or anything for the good of the cause we would love to hear about it. Currently the best way to get help or give feedback is in our <a href="irc://irc.oftc.net:6667/ceph">#Ceph irc channel</a> but our <a href="http://ceph.com/resources/mailing-list-irc/">mailing lists</a> are also pretty active. For Juju-specific feedback you can also hit up the <a href="irc://irc.freenode.net:6667/juju">#Juju irc channel</a>. If you see any egregious errors in this writeup or would like to know more about Ceph community plans feel free to send email to patrick at inktank dot com.</p>
<p>
<pre class="outline">scuttlemonkey out</pre></p>
]]></content:encoded>
			<wfw:commentRss>http://ceph.com/dev-notes/deploying-ceph-with-juju/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>What&#8217;s New in the Land of OSD?</title>
		<link>http://ceph.com/dev-notes/whats-new-in-the-land-of-osd/</link>
		<comments>http://ceph.com/dev-notes/whats-new-in-the-land-of-osd/#comments</comments>
		<pubDate>Tue, 18 Dec 2012 13:54:21 +0000</pubDate>
		<dc:creator>sajust</dc:creator>
				<category><![CDATA[Dev notes]]></category>

		<guid isPermaLink="false">http://ceph.com/?p=2083</guid>
		<description><![CDATA[It’s been a few months since the last named release, Argonaut, and we’ve been busy! Well, in retrospect, most of the time was spent on finding a cephalopod name that starts with “b”, but once we got that done, we still had a few weeks left to devote to technical improvements. In particular, the OSD [...]]]></description>
			<content:encoded><![CDATA[<p>It’s been a few months since the last named release, Argonaut, and we’ve been busy!  Well, in retrospect, most of the time was spent on finding a cephalopod name that starts with “b”, but once we got that done, we still had a few weeks left to devote to technical improvements.  In particular, the OSD has seen some new and interesting developments.</p>
<h3>OSD Internals Overview</h3>
<p>Let’s start with some background for those not familiar with Ceph internals.  Objects in a Ceph Object Store are placed into pools, each of which is comprised of some number of placement groups (PGs).  An object “foo” in pool “bar” would be mapped onto a set of OSDs as follows:</p>
<p><img src="http://ceph.com/wp-content/uploads/2012/12/pg-placement1.png" alt="" title="pg-placement1" width="568" height="286" class="aligncenter size-full wp-image-2194" /><span id="more-2083"></span></p>
<p>The first mapping hashes foo to 0x3F4AE323 and maps “bar” to its pool id: 3.  The next mapping maps this to PG 3.23 (PG 0x23 in pool 3; the PG number is written in hex) by taking 0x3F4AE323 mod 256 (the number of PGs in pool “bar”).  This pgid is then mapped onto the OSDs [24, 3, 12] via CRUSH.  OSD 24 is the primary; 3 and 12 are the replicas.  PGs serve several critical roles in the ceph-osd design.  First, they are the unit of placement.  If we calculated placement directly on a per-object basis, changes in the cluster might require us to recalculate the location of each and every object!  This way, we only need to re-run CRUSH on a per-PG basis when the cluster changes.  Second, writes on objects are sequenced on a per-PG basis.  Each PG contains an ordered log of all operations on objects in that PG.  Finally, recovery is done on a per-PG basis.  By comparing their PG logs, two OSDs can agree on which objects need to be recovered to which OSD.</p>
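<p>The arithmetic in the second mapping is easy to check by hand. Here is a small Python sketch that takes the hash value from the example as given. The function name is hypothetical, and the real placement code is more involved than a plain modulo; the CRUSH step is not modelled at all.</p>

```python
def pg_for(object_hash, pool_id, pg_num):
    """Map an object's hash to a PG id string of the form 'pool.pg',
    with the PG number written in hex."""
    pg = object_hash % pg_num
    return "{}.{:x}".format(pool_id, pg)


# Values from the example: hash of "foo", pool id of "bar", PGs in "bar".
print(pg_for(0x3F4AE323, 3, 256))  # prints 3.23
```

<p>Because 256 is 0x100, the modulo simply keeps the last two hex digits of the hash, which is why the PG number matches the tail of 0x3F4AE323.</p>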
<h3>Scrub</h3>
<p><a href="http://ceph.com/wp-content/uploads/2013/12/scrubbing-bubbles-image.jpg"><img src="http://ceph.com/wp-content/uploads/2013/12/scrubbing-bubbles-image.jpg" alt="" title="scrubbing-bubbles-image" width="125" height="100" class="alignright size-full wp-image-2096" /></a>With that out of the way, let’s move to some work on keeping your cluster’s data honest.  It turns out that data redundancy isn’t particularly useful if you fail to notice a corrupted object until you finally go to read it, possibly months after the last copy has finally become unreadable.  To deal with this, Ceph has long included a “scrub” feature which, during periods of low IO, chooses PGs in sequence and compares their contents across replicas.  Alas, our implementation suffered from two shortcomings.  The first is that we compared the set of objects contained in each PG across replicas as well as object metadata, but not the object contents.  In the upcoming Bobtail release, we hash the object contents as we scan and compare the hashes from across replicas to detect corrupt copies.</p>
<p>The second shortcoming is that, in the name of simplicity, we essentially scrubbed an entire PG at once.  The tricky part of efficiently performing a scrub is that comparing the contents of the primary and the replica is only useful if the scans are performed at the same version!  Scrubbing while writes are in flight might result in a scrub scanning the replica at version 200 and scanning the primary at 197 because the replica happens to be a bit ahead of the primary.  A simple way to ensure that the versions match is simply to stop writes on the entire PG and wait for them to flush before scanning the primary and replica stores.  In fact, the Argonaut approach is a bit more sophisticated &#8212; we scan the primary and replica collections without stopping writes, and then stop writes to rescan any objects which changed in the meantime.  However, for a large PG, that last step could take a long time, so a better approach is needed.  Enter ChunkyScrub!  In Bobtail, we scrub a PG in chunks, only pausing writes on the set of objects we are currently scrubbing.  This way, no object has writes blocked for long.</p>
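<p>To make the idea concrete, here is a toy Python model of a chunked scrub. It is only a sketch: the two stores are plain dictionaries, SHA-1 is an arbitrary stand-in for whatever hash the OSD really uses, and the write blocking is shown as comments rather than implemented.</p>

```python
import hashlib


def digest(data):
    return hashlib.sha1(data).hexdigest()


def scrub_in_chunks(primary, replica, chunk_size):
    """Compare two object stores (dicts of name to bytes) a chunk of
    object names at a time and return the names whose contents differ."""
    names = sorted(set(primary) | set(replica))
    inconsistent = []
    for start in range(0, len(names), chunk_size):
        chunk = names[start:start + chunk_size]
        # A real OSD would block writes to just these objects here,
        # so both copies are scanned at the same version.
        for name in chunk:
            a = primary.get(name)
            b = replica.get(name)
            if a is None or b is None or digest(a) != digest(b):
                inconsistent.append(name)
        # ...and let writes to this chunk resume here.
    return inconsistent


primary = {"obj1": b"aaa", "obj2": b"bbb", "obj3": b"ccc"}
replica = {"obj1": b"aaa", "obj2": b"bXb", "obj3": b"ccc"}
print(scrub_in_chunks(primary, replica, chunk_size=2))  # prints ['obj2']
```

<p>The smaller the chunk, the shorter any one object is blocked, at the cost of more rounds of coordination between primary and replica.</p>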
<h3>OSD Internals Refactor</h3>
<p>The ceph-osd daemon internals received a bit of a rework as well.  As mentioned above, PGs act as the unit of sequencing for object operations. This is reflected in the code: each PG the OSD is responsible for maps onto a PG object.  The OSD object’s primary responsibility is to shuffle messages from clients and other OSDs over to the appropriate PG object.  A happy consequence of this is that operations on different objects on the same OSD can be processed independently (and in parallel!) as long as the objects are in different PGs.  There is, however, one annoying detail which tends to prevent us from fully exploiting this opportunity for parallelism: that pesky OSDMap.</p>
<p>The OSDMap is required for the “CRUSH MAGIC” arrow in the above diagram to work.  CRUSH really takes two inputs: a pgid and a description of the cluster.  These together determine the OSDs on which the PG will reside.  This description is encoded in the OSDMap.  Changes to the cluster, such as the death of an OSD, are encoded into a new OSDMap by the ceph-mon cluster and sent out to the OSDs.  The maps are given sequential epoch numbers.  Essentially every decision within an OSD depends on the contents of the OSDMap.  Complicating the situation even further is that OSDMap updates don’t reach all OSDs at the same time.  The ceph-mon cluster sends out maps as they are created to a few OSDs, and then the OSDs gossip the new map around to other OSDs as they discover OSDs with old maps.  Every OSD-OSD (including regular heartbeats) or OSD-client message includes the sender’s current OSDMap epoch, allowing the receiver to respond with whatever maps the sender is missing.  So, how do we handle an OSDMap update arriving while other threads are busy with client requests for various PGs?</p>
<p>Originally, the OSD halted the threads responsible for handling PG requests (including client IO) while updating the global map.  This was a useful simplification since each PG might need to update local state due to the map change, and it would be complicated to coordinate that update with in-progress operations.  However, it was also a somewhat expensive simplification since halting all IO during the map switchover tends to be costly.  Bobtail includes a rework of how the OSD processes PG messages.  First, the PG internal code has been reworked to rely as little as possible on global OSD state.  Second, each PG has its own notion of the “current” OSDMap epoch distinct from that of other PGs and from the OSD as a whole.  Each PG’s internal map state is updated to the current OSDMap epoch before processing a message.  The OSD can therefore update its OSDMap related state without bothering the PG threads and then publish the new map epoch atomically for PG thread consumption once it’s ready.  The end result of all of this is that the OSD should handle map changes much more efficiently.  This might not seem like much, but map changes tend to happen quickly when the cluster is experiencing heavy load due to OSD failures &#8212; exactly when you don’t want extraneous overhead!</p>
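<p>A rough Python sketch of the per-PG epoch idea follows. The class and method names are invented for illustration and threading is left out entirely, so a plain assignment stands in for the atomic publish; the point is only that each PG catches up lazily, right before it handles its next message.</p>

```python
class OSD:
    """Holds the published OSDMap epoch that PGs catch up to."""

    def __init__(self):
        self.published_epoch = 1

    def publish(self, epoch):
        # Stand-in for atomically publishing a fully prepared map.
        self.published_epoch = epoch


class PG:
    def __init__(self, osd):
        self.osd = osd
        self.epoch = osd.published_epoch
        self.log = []

    def handle(self, message):
        # Catch up on map changes one epoch at a time before doing any work.
        target = self.osd.published_epoch
        while self.epoch != target:
            self.epoch += 1
            self.log.append("advance to e{}".format(self.epoch))
        self.log.append("process {} at e{}".format(message, self.epoch))


osd = OSD()
pg_a, pg_b = PG(osd), PG(osd)
osd.publish(3)
pg_a.handle("write foo")
print(pg_a.log)    # pg_a stepped through e2 and e3, then processed the write
print(pg_b.epoch)  # prints 1: pg_b has had no work, so it has not advanced
```

<p>Nothing in <em>publish</em> touches the PGs, which is what lets the OSD switch maps without pausing IO on PGs that are busy.</p>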
<h3>Filestore Performance</h3>
<p><a href="http://ceph.com/wp-content/uploads/2013/12/bar_chart.jpg"><img src="http://ceph.com/wp-content/uploads/2013/12/bar_chart-220x140.jpg" alt="" title="bar_chart" width="220" height="140" class="alignleft size-thumbnail wp-image-2099" /></a>Another area that got a fresh coat of paint is the backend IO system synchronization design.  The ceph-osd daemon uses standard file systems such as xfs or btrfs as its backing store.  However, as you might imagine, it’s much simpler to work in terms of transactions on an abstract data store than to work directly on top of a file system (particularly considering the differences between xfs, btrfs, and ext4).  Thus, the ceph-osd daemon talks to the file system via the FileStore, which presents a uniform transactional interface in terms of objects and flat collections on top of the user’s underlying filesystem.<br />
The journal is crucial to providing these transactional guarantees.  In xfs (btrfs is somewhat different), the FileStore writes out each transaction to the journal prior to applying it to the file system.  Each write must pass through:</p>
<ul>
<li>OSD op thread (responsible for handling client requests)</li>
<li>FileStore journal thread (responsible for appending writes to the journal)</li>
<li>FileStore work queue (responsible for applying writes to the backing file system)</li>
<li>Messenger (responsible for managing inter-node communication) for the client reply.</li>
</ul>
<p>It’s crucial to maximize throughput and minimize latency in this pipeline if we want to avoid torpedoing performance.  To approach this problem, we took a shiny new server with 192GB of memory and started running benchmarks against the FileStore module in isolation mounted on a ramdisk.  We were able to shove small writes through at a rate of around 6k IOPS.  This is pretty good if we plan on running on a ~150 IOPS spinning disk.  It is considerably less good if we plan on running on 20k+ IOPS SSDs.  So, we went to work.  Instrumenting our Mutex object to add up time spent waiting on each lock yielded several promising “problem locks”, each of which, for reasons of simplicity, protected several unrelated structures.  Restructuring the code for finer synchronization around these structures, along with disabling in-memory logging, bumped us up to around 22k IOPS.  For the next release, we’ll continue on to attacking latency and throughput bottlenecks in the upper layers of the OSD daemon.</p>
<h3>Recovery QOS</h3>
<p>One of Ceph’s nicer properties is self-healing.  The death of OSD 10 eventually triggers a new OSDMap to be generated with OSD 10 marked down and out, which in turn triggers any PGs which had lived on OSD 10 to rebalance to a new set of OSDs.  Of course, there is no escaping the fact that recovering OSD 10’s PGs to new OSDs must involve copying the objects from the surviving replicas of those PGs to new OSDs.  With Argonaut’s default settings, this looks like a long series of 1MB transfers from surviving replicas to new replicas.  So, how might these large transfers interact with, say, the flurry of latency sensitive 4k writes generated by the VMs running on RBD on your cluster?</p>
<p>Argonaut already has some facilities you may be familiar with for limiting the impact of recovery on client workloads.  Most prominently, “osd recovery max active” limits the number of concurrent recovery operations any single OSD will start.  Regrettably, this only limits the number started at any single OSD.  It does not, for example, prevent 20 OSDs from simultaneously pushing “osd recovery max active” objects each to a single OSD you have just added to your cluster!  That’s where Bobtail’s new “osd max backfills” configurable comes in.  “osd max backfills” defines a limit on how many PGs are allowed to recover to or from a single OSD at any one time.</p>
<p>Argonaut also includes a simple mechanism for prioritizing Messages for processing at the OSD.  Each message is tagged with a numerical priority.  Messages are processed in order first by priority, and then by time of arrival.  So all messages of priority 128 will be processed before any of priority 63.  This is useful for some pieces of the OSD.  For example, replies from replicas to the primary indicating that the replica has persisted a client op are given a high priority to reduce client op latency since they are quick to process.  However, if we give client messages a higher priority than recovery messages using this mechanism, any significant amount of client io will tend to starve recovery.  You really don’t want that since the longer a PG goes without re-replicating its data, the more likely a second or third OSD death takes it out completely!</p>
<p>Bobtail introduces a more flexible Message prioritization scheme.  Messages can be sent such that a message with priority 40 will be allowed through at twice the rate of messages with priority 20, but won’t starve them.  This has been leveraged to allow recovery messages to be sent with a lower priority than client io messages without causing starvation.  As a nice bonus, consider trying to read an object which has not yet been recovered to the primary.  The primary must complete the recovery operation on that object before serving the read.  Now, we can give that recovery operation the priority of client io to allow it to bypass any lower priority recovery operations queued at other OSDs!  Future work in this area will focus on reducing the burden of coordinating a PG’s recovery operations on the PG’s primary OSD.  This burden still has an impact on client io coming in to that OSD.</p>
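<p>One way to get the &#8220;twice the rate, but no starvation&#8221; behaviour is stride scheduling, sketched below in Python. To be clear, this is a technique swapped in for illustration and not a description of how the OSD implements it; the class and its methods are hypothetical.</p>

```python
from collections import deque
from fractions import Fraction


class WeightedQueue:
    """Stride-style scheduler: each priority level is served at a rate
    proportional to its priority, so no backlogged level is starved."""

    def __init__(self):
        self.queues = {}   # maps priority to a deque of messages
        self.passes = {}   # maps priority to virtual time already consumed

    def enqueue(self, priority, message):
        queue = self.queues.setdefault(priority, deque())
        if not queue:
            # A level that was idle rejoins at the current virtual time
            # instead of cashing in credit it built up while empty.
            active = [self.passes[p] for p, q in self.queues.items() if q]
            floor = min(active) if active else Fraction(0)
            self.passes[priority] = max(self.passes.get(priority, floor), floor)
        queue.append(message)

    def dequeue(self):
        ready = [p for p, q in self.queues.items() if q]
        if not ready:
            return None
        # Lowest pass goes first; ties go to the higher priority.
        chosen = min(ready, key=lambda p: (self.passes[p], -p))
        self.passes[chosen] += Fraction(1, chosen)
        return self.queues[chosen].popleft()


wq = WeightedQueue()
for i in range(30):
    wq.enqueue(40, "client")
    wq.enqueue(20, "recovery")
first = [wq.dequeue() for _ in range(30)]
print(first.count("client"), first.count("recovery"))  # prints 20 10
```

<p>Contrast this with strict priority ordering, where all 30 client messages would drain before the first recovery message was looked at.</p>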
<p>So, those are some of the new developments in the OSD.  And that’s just the OSD!  RBD is getting layering and write-back cache!  CephFS is getting substantial stability and performance enhancements!  And if we stay on track, our next release will have even more exclamation points!  That is, it will if we have time left after we come up with a cephalopod name that starts with “c”&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://ceph.com/dev-notes/whats-new-in-the-land-of-osd/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Atomicity of RESTful radosgw operations</title>
		<link>http://ceph.com/dev-notes/atomicity-of-restful-radosgw-operations/</link>
		<comments>http://ceph.com/dev-notes/atomicity-of-restful-radosgw-operations/#comments</comments>
		<pubDate>Mon, 07 Nov 2011 21:42:26 +0000</pubDate>
		<dc:creator>yehuda</dc:creator>
				<category><![CDATA[Dev notes]]></category>
		<category><![CDATA[RADOS]]></category>
		<category><![CDATA[RGW]]></category>

		<guid isPermaLink="false">http://ceph.newdream.net/?p=343</guid>
		<description><![CDATA[A while back we worked on radosgw doing atomic reads and writes. The first issue was making sure that two or more concurrent writers that write to the same object don’t end up with an inconsistent object. That is the &#8220;atomic PUT&#8221; issue. We also wanted to be able to make sure that when one [...]]]></description>
			<content:encoded><![CDATA[<p>A while back we worked on radosgw doing atomic reads and writes.</p>
<p>The first issue was making sure that two or more concurrent writers that write to the same object don’t end up with an inconsistent object. That is the &#8220;atomic PUT&#8221; issue.</p>
<p>We also wanted to be able to make sure that when one client reads an object via radosgw while another client writes to the same object, the result is consistent. That is, when reading an object a client should get either the old or the new version of the object, and never a mix of the two. That is the &#8220;atomic GET&#8221; issue.</p>
<p>Radosgw is built directly on top of RADOS and is a prime example of a librados user. The basic issue is that radosgw streams the objects from or to the RADOS objects with a series of relatively small reads or writes. For the atomic PUT and atomic GET we didn&#8217;t want to introduce locking. Locking would solve the issue, but implementing it on top of RADOS would not have been trivial, and would have affected scalability and the relative simplicity of the gateway. The Ceph distributed file system implements locking in the metadata server (as part of its POSIX file locking support), and introducing that in the gateway would require holding state on each object and synchronizing it between the different gateway instances. We didn’t want to reimplement the MDS again.</p>
<p><strong>Atomic PUT</strong></p>
<p>When radosgw reads or writes an object it can issue multiple read or write librados requests to the RADOS backend. One RADOS feature is that each single operation is atomic. The problem is that for a sufficiently large object (and the threshold is not very large) we issue multiple write operations, so concurrent writers could end up with an interleaved object.</p>
<p>The solution for the atomic PUT is to write the object into a temporary object. Once the temp object is completely written, we issue a single librados clone-range operation that atomically clones the entire temp object to the destination. Once the data is there we remove the temp object. This is equivalent to writing to a temporary file and renaming it over the target when we finish.</p>
<p>Since the RADOS backend is distributed, we need to make sure that both the temp object and the target object will be located in the same placement group (and on the same OSD). Usually the object location is determined by the object name, but for this purpose we used the &#8220;object locator&#8221; feature, which allows us to provide an alternative string that is fed into the hash function. In this case we use the target object name as the object locator for the temporary object, ensuring that both objects end up in the same placement group on the same node so that the clone operation can work.</p>
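<p>The temp-object-plus-locator scheme can be sketched with a toy in-memory model. This is not librados code: <code>ToyStore</code>, <code>placement_group</code>, <code>atomic_put</code> and the placement-group count are all hypothetical names and values made up for illustration, and the hash is only a stand-in for the real placement function.</p>

```python
import hashlib
import uuid

NUM_PGS = 8  # hypothetical placement-group count for this toy model


def placement_group(name, locator=None):
    """Hash the locator if given, else the object name, to pick a PG."""
    key = (locator if locator is not None else name).encode()
    return int(hashlib.sha1(key).hexdigest(), 16) % NUM_PGS


class ToyStore:
    """In-memory stand-in for a RADOS pool: one dict of objects per PG."""

    def __init__(self):
        self.pgs = [dict() for _ in range(NUM_PGS)]

    def append(self, name, data, locator=None):
        pg = self.pgs[placement_group(name, locator)]
        pg[name] = pg.get(name, b"") + data

    def clone(self, src, dst, locator):
        # Both names are placed through the same locator, so they share
        # a PG and the copy is a single step inside that PG.
        pg = self.pgs[placement_group(src, locator)]
        pg[dst] = pg[src]

    def remove(self, name, locator=None):
        del self.pgs[placement_group(name, locator)][name]

    def read(self, name, locator=None):
        return self.pgs[placement_group(name, locator)].get(name)


def atomic_put(store, target, chunks):
    """Stream chunks into a temp object, then clone it over the target."""
    temp = f"{target}.tmp.{uuid.uuid4().hex}"
    for chunk in chunks:                       # many small writes
        store.append(temp, chunk, locator=target)
    store.clone(temp, target, locator=target)  # the single atomic step
    store.remove(temp, locator=target)
    return temp


store = ToyStore()
temp_name = atomic_put(store, "foo", [b"aaaa", b"bbbb"])
print(store.read("foo"))
```

<p>Because the temp object is placed using the target name as its locator, it lands in the placement group the target would hash to on its own, which is what lets the final clone be a local, single-step operation in this model.</p>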
<p><strong>Atomic GET</strong></p>
<p>With atomic PUT we know that the objects are consistent. However, this doesn’t help with clients reading while an object is being written. Since there can be multiple librados read operations for a single GET, some of the reads may happen before the object is replaced and some may happen after that, leading to an inconsistent &#8220;torn&#8221; result.</p>
<p>In addition to atomic operations, RADOS has a nice feature called compound operations, which allows you to send a few operations that are bundled together and applied atomically. If one of the operations fails, nothing is applied. We use this for atomic PUT in order to set both data and metadata on the target object in a single atomic operation.</p>
<p>For the atomic GET we introduce an object &#8220;tag,&#8221; which is a random value that we generate for each PUT and store as an object attribute (xattr). When radosgw writes to an object it first checks for an existing object and fetches its tag (which it can do atomically). If the object exists it clones it to a new object with the tag as a suffix (taking necessary steps to avoid name collisions) and the original object name as the locator. The compound clone operation looks like:</p>
<ol>
<li>check to see if object &lt;name&gt; tag attribute is &lt;tag&gt;</li>
<li>clone to &lt;name&gt;_&lt;tag&gt;</li>
</ol>
<p>The first operation is a guard to make sure that the object hasn&#8217;t been rewritten since we first read it. (If it has been rewritten, we need to restart the whole operation and reread the tag.) We apply the same guard when we write the new object instance, to make sure that there was no racing operation.</p>
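<p>A minimal sketch of the guarded writer path, again as a toy in-memory model rather than real librados calls. <code>ToyObjects</code>, <code>guarded_clone</code>, <code>guarded_write</code> and <code>put</code> are hypothetical names; each guarded method stands in for one compound operation that either applies completely or not at all.</p>

```python
import uuid


class GuardFailed(Exception):
    """Raised when the guard of a compound operation does not hold."""


class ToyObjects:
    """In-memory stand-in: name -> {"tag": str, "data": bytes}."""

    def __init__(self):
        self.objects = {}

    def get_tag(self, name):
        obj = self.objects.get(name)
        return obj["tag"] if obj else None

    def guarded_clone(self, name, expected_tag):
        # Compound op: (1) check the tag, (2) clone to name_tag.
        if self.get_tag(name) != expected_tag:
            raise GuardFailed(name)
        self.objects[f"{name}_{expected_tag}"] = dict(self.objects[name])

    def guarded_write(self, name, expected_tag, data):
        # Same guard, then set the data and a fresh tag together.
        if self.get_tag(name) != expected_tag:
            raise GuardFailed(name)
        self.objects[name] = {"tag": uuid.uuid4().hex, "data": data}


def put(store, name, data):
    while True:
        tag = store.get_tag(name)
        try:
            if tag is not None:
                store.guarded_clone(name, tag)   # keep the old instance
            store.guarded_write(name, tag, data)
            return
        except GuardFailed:
            continue  # the object was rewritten; reread the tag and retry


store = ToyObjects()
put(store, "foo", b"v1")
old_tag = store.get_tag("foo")
put(store, "foo", b"v2")
print(sorted(store.objects) == sorted(["foo", f"foo_{old_tag}"]))
```

<p>After the second <code>put</code>, the previous instance is still reachable under the name built from the old tag, which is what a reader that started on that version relies on.</p>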
<p>A client that reads the object also starts by reading the tag, and puts the same guard before each subsequent read operation. If the guard fails, the client knows that the object has been rewritten. However, it also knows that since it has been rewritten, the object that it started reading can now be found at &lt;name&gt;_&lt;tag&gt;. So, reading of an object named foo looks like this:</p>
<ul>
<li>read object foo tag -&gt; 123</li>
<li>verify object foo tag is &#8220;123&#8221;; read object foo (offset = 0, size = 512K) -&gt; ok, read 512K</li>
<li>check object foo tag is &#8220;123&#8221;; read object foo (offset = 512K, size = 512K) -&gt; not ok, object was replaced</li>
<li>read object foo_123 (offset = 512K, size = 512K) -&gt; ok, read 512K</li>
</ul>
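<p>The read sequence above can be replayed in a toy model. This is an illustrative sketch, not radosgw code: <code>ToyObjects</code>, <code>atomic_get</code>, the <code>between_reads</code> hook and the 4-byte chunk size are assumptions chosen to keep the example small. The hook simulates a writer replacing the object between two chunk reads.</p>

```python
CHUNK = 4  # tiny chunk size so the example stays readable


class GuardFailed(Exception):
    """Raised when the tag guard in front of a read does not hold."""


class ToyObjects:
    def __init__(self):
        self.objects = {}  # name -> {"tag": str, "data": bytes}

    def get_tag(self, name):
        return self.objects[name]["tag"]

    def guarded_read(self, name, expected_tag, offset, size):
        # Compound op: check the tag, then read, as one atomic step.
        obj = self.objects[name]
        if obj["tag"] != expected_tag:
            raise GuardFailed(name)
        return obj["data"][offset:offset + size]

    def read(self, name, offset, size):
        return self.objects[name]["data"][offset:offset + size]

    def replace(self, name, new_tag, data):
        # Writer side: keep the old instance as name_oldtag, then overwrite.
        old = self.objects[name]
        self.objects[f"{name}_{old['tag']}"] = old
        self.objects[name] = {"tag": new_tag, "data": data}


def atomic_get(store, name, between_reads=None):
    tag = store.get_tag(name)
    source, guarded = name, True
    out, offset = b"", 0
    while True:
        try:
            if guarded:
                chunk = store.guarded_read(source, tag, offset, CHUNK)
            else:
                chunk = store.read(source, offset, CHUNK)
        except GuardFailed:
            # Object was replaced; the version we started on is at name_tag.
            source, guarded = f"{name}_{tag}", False
            continue
        if not chunk:
            return out
        out += chunk
        offset += len(chunk)
        if between_reads:
            between_reads()
            between_reads = None  # fire the simulated race only once


store = ToyObjects()
store.objects["foo"] = {"tag": "123", "data": b"oldoldol"}
result = atomic_get(
    store, "foo",
    between_reads=lambda: store.replace("foo", "456", b"newnewne"),
)
print(result)
```

<p>The reader that started on tag 123 gets the whole old version even though the object was replaced halfway through, while a reader that starts afterwards sees only the new one.</p>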
<p>The final component is an intent log. Since we end up creating multiple instances of the same object under different names, we need to make sure that these objects are cleaned up after some reasonable amount of time. We added a log object in which we record each such object that needs to be removed. After a sufficient amount of time (however long we expect very slow GETs to still succeed), a process iterates over the log and removes old objects.</p>
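<p>As a rough sketch of the cleanup idea, assuming a simple list of timestamped entries: the <code>IntentLog</code> class, its method names and the one-hour age threshold are hypothetical and do not reflect the actual radosgw log format.</p>

```python
import time


class IntentLog:
    """Toy intent log: records (timestamp, object name) entries."""

    def __init__(self):
        self.entries = []

    def record(self, name, now=None):
        self.entries.append((time.time() if now is None else now, name))

    def collect(self, objects, min_age, now=None):
        """Remove logged objects at least min_age seconds old."""
        now = time.time() if now is None else now
        removed, kept = [], []
        for stamp, name in self.entries:
            if now - stamp >= min_age:
                objects.pop(name, None)  # already gone is fine
                removed.append(name)
            else:
                kept.append((stamp, name))
        self.entries = kept
        return removed


objects = {"foo": b"v3", "foo_123": b"v1", "foo_456": b"v2"}
log = IntentLog()
log.record("foo_123", now=0)
log.record("foo_456", now=3000)

# One hour is an assumed bound on how long a slow GET may take.
removed = log.collect(objects, min_age=3600, now=4000)
print(removed, sorted(objects))
```

<p>Only the instance that has aged past the threshold is removed; the more recent one stays in the log until a later pass.</p>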
]]></content:encoded>
			<wfw:commentRss>http://ceph.com/dev-notes/atomicity-of-restful-radosgw-operations/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>RBD Status Update</title>
		<link>http://ceph.com/dev-notes/rbd-status-update/</link>
		<comments>http://ceph.com/dev-notes/rbd-status-update/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 02:34:24 +0000</pubDate>
		<dc:creator>yehuda</dc:creator>
				<category><![CDATA[Dev notes]]></category>
		<category><![CDATA[RBD]]></category>
		<category><![CDATA[Updates]]></category>

		<guid isPermaLink="false">http://ceph.newdream.net/?p=324</guid>
		<description><![CDATA[Just a quick update on the current status of RBD. The main recent development is that librbd (the userspace library) can ack writes immediately (instead of waiting for them to actually commit), to better mimic the behavior of a normal disk. Why do this? A long long time ago, when you issued a write to [...]]]></description>
			<content:encoded><![CDATA[<p><em>Just a quick update on the current status of RBD.<br />
</em><br />
The main recent development is that librbd (the userspace library) can ack writes immediately (instead of waiting for them to actually commit), to better mimic the behavior of a normal disk.</p>
<p>Why do this? A long long time ago, when you issued a write to a disk, it would ACK the write when the data was written. No more. Now, the ACK means the data is either in the drive&#8217;s cache or on disk. You don&#8217;t know data is safe/durable until you issue a separate flush command. Now RBD behaves similarly: writes are acked immediately (up to some number of bytes, at least), and a flush will wait for all previous writes to commit. The only real difference between this and a real drive cache is that a real drive will try to coalesce small writes into a single operation, while RBD sends them all straight through to the backend cluster.</p>
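<p>The ack-early behavior can be pictured with a small single-threaded model. This is a sketch of the idea only, not librbd code: <code>WritebackWindow</code> and its fields are hypothetical, and waiting for the backend is simulated by retiring the oldest in-flight write.</p>

```python
from collections import deque


class WritebackWindow:
    """Toy model: ack writes early while uncommitted bytes fit the window."""

    def __init__(self, window_bytes):
        self.window = window_bytes
        self.in_flight = deque()   # sizes of writes sent but not committed
        self.in_flight_bytes = 0
        self.commits_waited = 0    # how often a caller had to wait

    def _wait_for_commit(self):
        # Stand-in for blocking until the backend commits the oldest write.
        self.in_flight_bytes -= self.in_flight.popleft()
        self.commits_waited += 1

    def write(self, size):
        # Every write goes straight to the backend (no coalescing).
        self.in_flight.append(size)
        self.in_flight_bytes += size
        # Ack immediately unless the window is exceeded.
        while self.in_flight_bytes > self.window:
            self._wait_for_commit()
        return "ack"

    def flush(self):
        # A flush returns only after all previous writes have committed.
        while self.in_flight:
            self._wait_for_commit()


w = WritebackWindow(8_000_000)
for _ in range(3):
    w.write(4_000_000)
waits_before_flush = w.commits_waited
w.flush()
print(waits_before_flush, w.commits_waited, w.in_flight_bytes)
```

<p>With an 8,000,000-byte window, the first two 4,000,000-byte writes are acked without waiting, the third has to wait for one commit, and the flush drains the rest.</p>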
<p>To make this work with qemu/KVM you need:</p>
<ul>
<li>Ceph v0.35 or later.</li>
<li>Set rbd_writeback_window to a number of bytes (something on the order of what you&#8217;d expect a physical disk cache to be; say, 8 MB). This means using a qemu drive string like
<pre>rbd:rbd/myimage:rbd_writeback_window=8000000</pre>
</li>
<li>You need qemu with commit 7a3f5fe, which wires up the qemu flush function properly.  It is not included in v0.15, but should be in the next release.</li>
</ul>
<p>This is not yet implemented in the kernel RBD driver. As a result, effective performance using that device is still relatively poor. We hope to have similar behavior ready when the v3.2 merge window opens.</p>
]]></content:encoded>
			<wfw:commentRss>http://ceph.com/dev-notes/rbd-status-update/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Linus vs FUSE</title>
		<link>http://ceph.com/dev-notes/linus-vs-fuse/</link>
		<comments>http://ceph.com/dev-notes/linus-vs-fuse/#comments</comments>
		<pubDate>Fri, 08 Jul 2011 21:47:46 +0000</pubDate>
		<dc:creator>sage</dc:creator>
				<category><![CDATA[Dev notes]]></category>

		<guid isPermaLink="false">http://ceph.newdream.net/?p=279</guid>
		<description><![CDATA[I can&#8217;t decide whether Linus is amused or annoyed by the extent to which people hang on his every word, or go nuts over his random rants about this or that. People still talk about his pronouncement about O_DIRECT and tripping monkeys (which has now found a home on the open(2) man page). The latest [...]]]></description>
			<content:encoded><![CDATA[<p>I can&#8217;t decide whether Linus is amused or annoyed by the extent to which people hang on his every word, or go nuts over his random rants about this or that.  People still talk about his pronouncement about <a href="http://lkml.org/lkml/2002/5/11/58">O_DIRECT and tripping monkeys</a> (which has now found a home on <a href="http://linux.die.net/man/2/open">the open(2) man page</a>).  The latest hullabaloo is about his decree that all <a href="https://lkml.org/lkml/2011/6/9/462">FUSE-based file systems are toys</a>.</p>
<p>Clearly, as <a href="http://www.gluster.com/2011/06/28/linus-torvalds-doesnt-understand-user-space-storage/">many</a> <a href="http://cloudfs.org/2011/06/user-space-filesystems/">have</a> <a href="http://zaitcev.livejournal.com/210078.html">pointed</a> <a href="http://cloudfs.org/2011/06/user-space-file-systems-again/">out</a>, calling all such systems &#8220;toys&#8221; isn&#8217;t completely fair.  But then it wouldn&#8217;t be fun to say it if it were strictly true.  There are real systems (big and fast) built on FUSE, just as there are such systems built with Java, Visual Basic, COBOL, and every other platform/technology we love to mock.</p>
<p>I haven&#8217;t seen <a href="http://institutes.lanl.gov/plfs/">PLFS</a> come up yet in the discussion, but I think it&#8217;s worth mentioning just because it is such a good example of optimizing for the cases that actually matter for your workload.  For those not familiar, PLFS (parallel log-structured file system) is a FUSE-based file system built at LANL for their huge many-thousand node clusters that turns all random IO sequential by building a mess of intermediate indices.  It sounds like it would be a disaster, but in practice it speeds up their workloads by <em>several</em> orders of magnitude, simply because the underlying parallel file systems on which it is stacked are so bad at those workloads.</p>
<p>Anyway, there are just a few points I wanted to make about the kernel vs userspace file systems, having implemented the Ceph client using both.  At the risk of stating the obvious:</p>
<ul>
<li><strong>There is nothing you can do in userspace that you can&#8217;t also do in the kernel</strong>.  Sure, development can be harder in the kernel, but you have unparalleled access to the system.  The only significant technical disadvantage of a kernel implementation is fault isolation: a buggy FUSE-based file system won&#8217;t take down the system with it.</li>
<li><strong>Implementation is easier with FUSE</strong>.  At least for something basic.  There are some key problems that are harder to solve because of limitations in the interface.</li>
<li><strong>Memory management is easier in the kernel</strong>.  AB is right when <a href="http://www.gluster.com/2011/06/28/linus-torvalds-doesnt-understand-user-space-storage/">he says</a> that the memory management and file system need to work together.  The problem is that it is difficult to push memory management into userspace when you are not the only tenant on the machine.  (I suspect that in most of the big production environments where userspace file systems are used, the fs either is the sole tenant or is given some fixed amount of RAM to work with.)  The kernel VM, on the other hand, will apply cache pressure dynamically based on the demands of all users of the system.  Trying to do that in userspace is extremely awkward at best.</li>
<li><strong>Managing cache coherency is easier in the kernel</strong>.  Some people don&#8217;t care about this (e.g., see NFS, or any of the &#8220;toys&#8221; Linus was referring to), but we do.  This is mainly a result of the limited FUSE interface.  You can probably avoid the issue by simply not using the kernel dentry and page caches and reimplementing it all in userspace.  That&#8217;s a simple enough approach, but is slow, and fails to leverage years of work invested in the core Linux VFS code.</li>
<li><strong>FUSE may be partly to blame</strong>.  Jeff Darcy has made the point that many of the FUSE shortcomings aren&#8217;t inherent to userspace storage, but artifacts of the current interface and kernel politics.  Maybe that&#8217;s the case, but that is the world we live in.  No file system that doesn&#8217;t work on Linux (or maybe *BSD) is relevant.  And for what it&#8217;s worth, most of the people I see complaining about kernel community intransigence haven&#8217;t even tried to work upstream; it&#8217;s easier than you think, as long as the code you&#8217;re pushing isn&#8217;t crap.</li>
</ul>
<p>Which is better for any given project in the end is probably more of a business decision: technical investment, performance, time to market, ease of deployment.  If you&#8217;re talking purely about the technical limitations of the environment, however, it&#8217;s hard to beat the kernel.</p>
<p>Or, if you can, implement both.  It makes these sorts of debates that much more fun.</p>
]]></content:encoded>
			<wfw:commentRss>http://ceph.com/dev-notes/linus-vs-fuse/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
