Community Update: Welcome to 2016!

It has been quite a while since a coordinated Ceph update has made it to the Ceph blog, so I figured it was time to gather all of the various threads and make sure they were in a single place for consumption.

Quite a lot is happening in the Ceph world and, whatever part of the project you are involved with, there is more than likely a place for you to deepen your engagement with the community. So, let’s do the highlight reel:


v10.0.2 released

This development release includes a raft of changes and improvements for Jewel. Key additions include CephFS scrub/repair improvements, an AIX and Solaris port of librados, many librbd journaling additions and fixes, extended per-pool options, an NBD driver for RBD (rbd-nbd) that allows librbd to present a kernel-level block device on Linux, multitenancy support for RGW, RGW bucket lifecycle support, RGW support for Swift static large objects (SLO), and RGW support for Swift bulk delete.

There are also lots of smaller optimizations and performance fixes going in all over the tree, particularly in the OSD and common code.


  • auth: fail if rotating key is missing (do not spam log) (pr#6473, Qiankun Zheng)
  • auth: fix crash when bad keyring is passed (pr#6698, Dunrong Huang)
  • auth: make keyring without mon entity type return -EACCES (pr#5734, Xiaowei Chen)
  • buffer: make usable outside of ceph source again (pr#6863, Josh Durgin)
  • build: cmake check fixes (pr#6787, Orit Wasserman)
  • build: fix bz2-dev dependency (pr#6948, Samuel Just)
  • build: Gentoo: _FORTIFY_SOURCE fix. (issue#13920, pr#6739, Robin H. Johnson)
  • build/ops: systemd ceph-disk unit must not assume /bin/flock (issue#13975, pr#6803, Loic Dachary)
  • ceph-detect-init: Ubuntu >= 15.04 uses systemd (pr#6873, James Page)
  • cephfs-data-scan: scan_frags (pr#5941, John Spray)
  • cephfs-data-scan: scrub tag filtering (#12133 and #12145) (issue#12133, issue#12145, pr#5685, John Spray)
  • ceph-fuse: add process to ceph-fuse --help (pr#6821, Wei Feng)
  • ceph-kvstore-tool: handle bad out file on command line (pr#6093, Kefu Chai)
  • ceph-mds: add --help/-h (pr#6850, Cilang Zhao)
  • ceph_objectstore_bench: fix race condition, bugs (issue#13516, pr#6681, Igor Fedotov)
  • add BuildRequires: systemd (issue#13860, pr#6692, Nathan Cutler)
  • client: a better check for MDS availability (pr#6253, John Spray)
  • client: close mds sessions in shutdown() (pr#6269, John Spray)
  • client: don’t invalidate page cache when inode is no longer used (pr#6380, Yan, Zheng)
  • client: modify a word in log (pr#6906, YongQiang He)
  • cls/ fix misused metadata_name_from_key (issue#13922, pr#6661, Xiaoxi Chen)
  • cmake: Add common/ to CMakeLists.txt (pr#6805, Pete Zaitcev)
  • cmake: add to librgw.a (pr#6786, Orit Wasserman)
  • cmake: add to libcommon (pr#6823, Orit Wasserman)
  • cmake: define STRERROR_R_CHAR_P for GNU-specific strerror_r (pr#6751, Ilya Dryomov)
  • cmake: update for recent librbd changes (pr#6715, John Spray)
  • cmake: update for recent rbd changes (pr#6818, Mykola Golub)
  • common: add generic plugin infrastructure (pr#6696, Sage Weil)
  • common: add latency perf counter for finisher (pr#6175, Xinze Chi)
  • common: buffer: add cached_crc and cached_crc_adjust counts to perf dump (pr#6535, Ning Yao)
  • common: buffer: remove unneeded list destructor (pr#6456, Michal Jarzabek)
  • common/ order of initialisers (pr#6838, Michal Jarzabek)
  • common: don’t reverse hobject_t hash bits when zero (pr#6653, Piotr Dałek)
  • common: log: Assign LOG_DEBUG priority to syslog calls (issue#13993, pr#6815, Brad Hubbard)
  • common: log: predict log message buffer allocation size (pr#6641, Adam Kupczyk)
  • common: optimize debug logging code (pr#6441, Adam Kupczyk)
  • common: perf counter for bufferlist history total alloc (pr#6198, Xinze Chi)
  • common: reduce CPU usage by making stringstream in stringify function thread local (pr#6543, Evgeniy Firsov)
  • common: re-enable backtrace support (pr#6771, Jason Dillaman)
  • common: SubProcess: fix multiple definition bug (pr#6790, Yunchuan Wen)
  • common: use namespace instead of subclasses for buffer (pr#6686, Michal Jarzabek)
  • macro fix (pr#6769, Igor Podoski)
  • doc: admin/build-doc: add lxml dependencies on debian (pr#6610, Ken Dreyer)
  • doc/cephfs/posix: update (pr#6922, Sage Weil)
  • doc: CodingStyle: fix broken URLs (pr#6733, Kefu Chai)
  • doc: correct typo ‘restared’ to ‘restarted’ (pr#6734, Yilong Zhao)
  • doc/dev/index: refactor/reorg (pr#6792, Nathan Cutler)
  • doc/dev/index.rst: begin writing Contributing to Ceph (pr#6727, Nathan Cutler)
  • doc/dev/index.rst: fix headings (pr#6780, Nathan Cutler)
  • doc: dev: introduction to tests (pr#6910, Loic Dachary)
  • doc: file must be empty when writing layout fields of file use “setfattr” (pr#6848, Cilang Zhao)
  • doc: Fixed incorrect name of a “List Multipart Upload Parts” Response Entity (issue#14003, pr#6829, Lenz Grimmer)
  • doc: Fixes a spelling error (pr#6705, Jeremy Qian)
  • doc: fix typo in cephfs/quota (pr#6745, Drunkard Zhang)
  • doc: fix typo in developer guide (pr#6943, Nathan Cutler)
  • doc: INSTALL redirect to online documentation (pr#6749, Loic Dachary)
  • doc: little improvements for troubleshooting scrub issues (pr#6827, Mykola Golub)
  • doc: Modified a note section in rbd-snapshot doc. (pr#6908, Nilamdyuti Goswami)
  • doc: note that cephfs auth stuff is new in jewel (pr#6858, John Spray)
  • doc: osd: s/schedued/scheduled/ (pr#6872, Loic Dachary)
  • doc: remove unnecessary period in headline (pr#6775, Marc Koderer)
  • doc: rst style fix for pools document (pr#6816, Drunkard Zhang)
  • doc: Update list of admin/build-doc dependencies (issue#14070, pr#6934, Nathan Cutler)
  • init-ceph: do umount when the path exists. (pr#6866, Xiaoxi Chen)
  • journal: disconnect watch after watch error (issue#14168, pr#7113, Jason Dillaman)
  • journal: fire replay complete event after reading last object (issue#13924, pr#6762, Jason Dillaman)
  • journal: support replaying beyond skipped splay objects (pr#6687, Jason Dillaman)
  • librados: aix gcc librados port (pr#6675, Rohan Mars)
  • librados: avoid malloc(0) (which can return NULL on some platforms) (issue#13944, pr#6779, Dan Mick)
  • librados: clean up Objecter.h (pr#6731, Jie Wang)
  • librados: include/rados/librados.h: fix typo (pr#6741, Nathan Cutler)
  • librbd: automatically flush IO after blocking write operations (issue#13913, pr#6742, Jason Dillaman)
  • librbd: better handling of exclusive lock transition period (pr#7204, Jason Dillaman)
  • librbd: check for presence of journal before attempting to remove (issue#13912, pr#6737, Jason Dillaman)
  • librbd: clear error when older OSD doesn’t support image flags (issue#14122, pr#7035, Jason Dillaman)
  • librbd: correct include guard in RenameRequest.h (pr#7143, Jason Dillaman)
  • librbd: correct issues discovered during teuthology testing (issue#14108, issue#14107, pr#6974, Jason Dillaman)
  • librbd: correct issues discovered when cache is disabled (issue#14123, pr#6979, Jason Dillaman)
  • librbd: correct race conditions discovered during unit testing (issue#14060, pr#6923, Jason Dillaman)
  • librbd: disable copy-on-read when not exclusive lock owner (issue#14167, pr#7129, Jason Dillaman)
  • librbd: do not ignore self-managed snapshot release result (issue#14170, pr#7043, Jason Dillaman)
  • librbd: ensure copy-on-read requests are complete prior to closing parent image (pr#6740, Jason Dillaman)
  • librbd: ensure librados callbacks are flushed prior to destroying (issue#14092, pr#7040, Jason Dillaman)
  • librbd: fix journal iohint (pr#6917, Jianpeng Ma)
  • librbd: fix known test case race condition failures (issue#13969, pr#6800, Jason Dillaman)
  • librbd: fix merge-diff for >2GB diff-files (issue#14030, pr#6889, Yunchuan Wen)
  • librbd: fix test case race condition for journaling ops (pr#6877, Jason Dillaman)
  • librbd: fix tracepoint parameter in diff_iterate (pr#6892, Yunchuan Wen)
  • librbd: image refresh code paths converted to async state machines (pr#6859, Jason Dillaman)
  • librbd: include missing header for bool type (pr#6798, Mykola Golub)
  • librbd: initial collection of state machine unit tests (pr#6703, Jason Dillaman)
  • librbd: integrate journaling for maintenance operations (pr#6625, Jason Dillaman)
  • librbd: journaling-related lock dependency cleanup (pr#6777, Jason Dillaman)
  • librbd: not necessary to hold owner_lock while releasing snap id (issue#13914, pr#6736, Jason Dillaman)
  • librbd: only send signal when AIO completions queue empty (pr#6729, Jianpeng Ma)
  • librbd: optionally validate new RBD pools for snapshot support (issue#13633, pr#6925, Jason Dillaman)
  • librbd: partial revert of commit 9b0e359 (issue#13969, pr#6789, Jason Dillaman)
  • librbd: properly handle replay of snap remove RPC message (issue#14164, pr#7042, Jason Dillaman)
  • librbd: reduce verbosity of common error condition logging (issue#14234, pr#7114, Jason Dillaman)
  • librbd: simplify IO method signatures for 32bit environments (pr#6700, Jason Dillaman)
  • librbd: support eventfd for AIO completion notifications (pr#5465, Haomai Wang)
  • mailmap: add UMCloud affiliation (pr#6820, Jiaying Ren)
  • mailmap: Jewel updates (pr#6750, Abhishek Lekshmanan)
  • makefiles: remove bz2-dev from dependencies (issue#13981, pr#6939, Piotr Dałek)
  • mds: add ‘p’ flag in auth caps to control setting pool in layout (pr#6567, John Spray)
  • mds: fix client capabilities during reconnect (client.XXXX isn’t responding to mclientcaps(revoke)) (issue#11482, pr#6432, Yan, Zheng)
  • mds: fix setvxattr (broken in a536d114) (issue#14029, pr#6941, John Spray)
  • mds: repair the command option “--hot-standby” (pr#6454, Wei Feng)
  • mds: tear down connections from tell commands (issue#14048, pr#6933, John Spray)
  • mon: fix ceph df pool available calculation for 0-weighted OSDs (pr#6660, Chengyuan Li)
  • mon: fix routed_request_tids leak (pr#6102, Ning Yao)
  • mon: support min_down_reporter by subtree level (default by host) (pr#6709, Xiaoxi Chen)
  • mount.ceph: memory leaks (pr#6905, Qiankun Zheng)
  • osd: add osd op queue latency perfcounter (pr#5793, Haomai Wang)
  • osd: Allow repair of history.last_epoch_started using config (pr#6793, David Zafman)
  • osd: avoid duplicate op->mark_started in ReplicatedBackend (pr#6689, Jacek J. Łakis)
  • osd: cancel failure reports if we fail to rebind network (pr#6278, Xinze Chi)
  • osd: correctly handle small osd_scrub_interval_randomize_ratio (pr#7147, Samuel Just)
  • osd: defer decoding of MOSDRepOp/MOSDRepOpReply (pr#6503, Xinze Chi)
  • osd: don’t update epoch and rollback_info objects attrs if there is no need (pr#6555, Ning Yao)
  • osd: dump number of missing objects for each peer with pg query (pr#6058, Guang Yang)
  • osd: enable perfcounters on sharded work queue mutexes (pr#6455, Jacek J. Łakis)
  • osd: FileJournal: reduce locking scope in write_aio_bl (issue#12789, pr#5670, Zhi Zhang)
  • osd: FileStore: remove __SWORD_TYPE dependency (pr#6263, John Coyle)
  • osd: fix FileStore::_destroy_collection error return code (pr#6612, Ruifeng Yang)
  • osd: fix incorrect throttle in WBThrottle (pr#6713, Zhang Huan)
  • osd: fix MOSDRepScrub reference counter in replica_scrub (pr#6730, Jie Wang)
  • osd: fix rollback_info_trimmed_to before index() (issue#13965, pr#6801, Samuel Just)
  • osd: fix trivial scrub bug (pr#6533, Li Wang)
  • osd: KeyValueStore: don’t queue NULL context (pr#6783, Haomai Wang)
  • osd: make backend and block device code a bit more generic (pr#6759, Sage Weil)
  • osd: move newest decode version of MOSDOp and MOSDOpReply to the front (pr#6642, Jacek J. Łakis)
  • osd: pg_pool_t: add dictionary for pool options (issue#13077, pr#6081, Mykola Golub)
  • osd: reduce memory consumption of some structs (pr#6475, Piotr Dałek)
  • osd: release the message throttle when OpRequest unregistered (issue#14248, pr#7148, Samuel Just)
  • osd: remove __SWORD_TYPE dependency (pr#6262, John Coyle)
  • osd: slightly reduce actual size of pg_log_entry_t (pr#6690, Piotr Dałek)
  • osd: support pool level recovery_priority and recovery_op_priority (pr#5953, Guang Yang)
  • osd: use pg id (without shard) when referring the PG (pr#6236, Guang Yang)
  • packaging: add build dependency on python devel package (pr#7205, Josh Durgin)
  • pybind/cephfs: add symlink and its unit test (pr#6323, Shang Ding)
  • pybind: decode empty string in conf_parse_argv() correctly (pr#6711, Josh Durgin)
  • pybind: Implementation of rados_ioctx_snapshot_rollback (pr#6878, Florent Manens)
  • pybind: port the rbd bindings to Cython (issue#13115, pr#6768, Hector Martin)
  • pybind: support ioctx:exec (pr#6795, Noah Watkins)
  • qa: erasure-code benchmark plugin selection (pr#6685, Loic Dachary)
  • qa/krbd: Expunge generic/247 (pr#6831, Douglas Fuller)
  • qa/workunits/cephtool/ false positive fail on /tmp/obj1. (pr#6837, Robin H. Johnson)
  • qa/workunits/cephtool/ no ./ (pr#6748, Sage Weil)
  • qa/workunits/rbd: rbd-nbd test should use sudo for map/unmap ops (issue#14221, pr#7101, Jason Dillaman)
  • rados: bench: fix off-by-one to avoid writing past object_size (pr#6677, Tao Chang)
  • rbd: add --object-size option, deprecate --order (issue#12112, pr#6830, Vikhyat Umrao)
  • rbd: add RBD pool mirroring configuration API + CLI (pr#6129, Jason Dillaman)
  • rbd: fix build with “--without-rbd” (issue#14058, pr#6899, Piotr Dałek)
  • rbd: journal: configuration via conf, cli, api and some fixes (pr#6665, Mykola Golub)
  • rbd: merge_diff test should use new --object-size parameter instead of --order (issue#14106, pr#6972, Na Xie, Jason Dillaman)
  • rbd-nbd: network block device (NBD) support for RBD (pr#6657, Yunchuan Wen, Li Wang)
  • rbd: output formatter may not be closed upon error (issue#13711, pr#6706, xie xingguo)
  • rgw: add a missing cap type (pr#6774, Yehuda Sadeh)
  • rgw: add an inspection to the field of type when assigning user caps (pr#6051, Kongming Wu)
  • rgw: add LifeCycle feature (pr#6331, Ji Chen)
  • rgw: add support for Static Large Objects of Swift API (issue#12886, issue#13452, pr#6643, Yehuda Sadeh, Radoslaw Zarzynski)
  • rgw: fix a glaring syntax error (pr#6888, Pavan Rallabhandi)
  • rgw: fix the build failure (pr#6927, Kefu Chai)
  • rgw: multitenancy support (pr#6784, Yehuda Sadeh, Pete Zaitcev)
  • rgw: Remove unused code in PutMetadataAccount:execute (pr#6668, Pete Zaitcev)
  • rgw: remove unused variable in RGWPutMetadataBucket::execute (pr#6735, Radoslaw Zarzynski)
  • rgw/rgw_resolve: fallback to res_query when res_nquery not implemented (pr#6292, John Coyle)
  • rgw: static large objects (Radoslaw Zarzynski, Yehuda Sadeh)
  • rgw: swift bulk delete (Radoslaw Zarzynski)
  • systemd: start/stop/restart ceph services by daemon type (issue#13497, pr#6276, Zhi Zhang)
  • sysvinit: allow custom cluster names (pr#6732, Richard Chan)
  • test/encoding/ fix (pr#6714, Igor Podoski)
  • test: fix (pr#6697, Xinze Chi)
  • test/librados/ clean up EC pools’ crush rules too (issue#13878, pr#6788, Loic Dachary, Dan Mick)
  • tests: allow object corpus readable test to skip specific incompat instances (pr#6932, Igor Podoski)
  • tests: ceph-helpers assert success getting backfills (pr#6699, Loic Dachary)
  • tests: ceph_test_keyvaluedb_iterators: fix broken test (pr#6597, Haomai Wang)
  • tests: fix failure for (issue#13986, pr#6890, Loic Dachary, Ning Yao)
  • tests: fix race condition testing auto scrub (issue#13592, pr#6724, Xinze Chi, Loic Dachary)
  • tests: flush op work queue prior to destroying MockImageCtx (issue#14092, pr#7002, Jason Dillaman)
  • tests: --osd-scrub-load-threshold=2000 for more consistency (issue#14027, pr#6871, Loic Dachary)
  • tests: to display full osd logs on error (issue#13986, pr#6857, Loic Dachary)
  • test: use sequential journal_tid for object cacher test (issue#13877, pr#6710, Josh Durgin)
  • tools: add cephfs-table-tool ‘take_inos’ (pr#6655, John Spray)
  • tools: Fix layout handing in cephfs-data-scan (#13898) (pr#6719, John Spray)
  • tools: support printing part cluster map in readable fashion (issue#13079, pr#5921, Bo Cai)
  • add mstart, mstop, mrun wrappers for running multiple vstart-style test clusters out of src tree (pr#6901, Yehuda Sadeh)

Many years ago I came across a script written by Shawn Moore and Rodney Rymer from Catawba university.
The purpose of this tool is to reconstruct an RBD image.
Imagine your cluster is dead: all the monitors got wiped and you don’t have a backup (I know, what could possibly go wrong?).
However, all your objects remain intact.

I’ve always wanted to blog about this tool, simply to advocate it and make sure that people can use it.
Hopefully this will be good publicity for it :-).


Ceph makes it easy to run multiple clusters on the same hardware by giving each cluster its own name. If you want better isolation, you can use LXC, for example to run a different version of Ceph in each cluster.

For this you will need access to the physical disks from inside the container. Allow access to the device in the container’s cgroup configuration and create the device node with mknod:

# Retrieve the major and minor number for a device :
$ ls -l /dev/sda5
brw-rw---T 1 root disk 8, 5 janv. 26 18:47 /dev/sda5

$ mknod /var/lib/lxc/container-cluster1/rootfs/dev/sda5 b 8 5
$ echo "lxc.cgroup.devices.allow = b 8:5 rwm" >> /var/lib/lxc/container-cluster1/config
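
The major and minor numbers can also be read with stat instead of being copied by hand from the ls output. Here is a minimal sketch, assuming GNU stat (its %t and %T formats print the numbers in hexadecimal). The function names dev_numbers and lxc_allow_line are hypothetical helpers, not part of LXC or Ceph; the device and container paths are the ones from the example above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of LXC or Ceph.

# Print "major minor" in decimal for a device node.
# Assumes GNU stat: %t and %T print major and minor in hexadecimal.
dev_numbers() {
    nums=$(stat -c '%t %T' "$1") || return 1
    set -- $nums
    echo "$((0x$1)) $((0x$2))"
}

# Format the cgroup rule for a block device with the given major and minor.
lxc_allow_line() {
    printf 'lxc.cgroup.devices.allow = b %s:%s rwm\n' "$1" "$2"
}

# Usage (as root), for the device and container from the example above:
#   set -- $(dev_numbers /dev/sda5)
#   mknod /var/lib/lxc/container-cluster1/rootfs/dev/sda5 b "$1" "$2"
#   lxc_allow_line "$1" "$2" >> /var/lib/lxc/container-cluster1/config
```

Deriving both the mknod arguments and the cgroup rule from the same pair of numbers keeps the two from drifting apart.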


Since Firefly you can try the lightweight embedded web server Civetweb instead of Apache.
Activating it is very simple: there is nothing more to install, just add this line to your ceph.conf:

rgw frontends = "civetweb port=80"

If you have already installed Apache, remember to stop it before activating Civetweb, or make sure it does not listen on the same port.

Then:

/etc/init.d/radosgw restart
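
Before restarting, it can help to confirm that nothing else (Apache, for instance) is still bound to the port. This is a rough sketch, assuming the ss utility from iproute2 is available; port_in_use is a hypothetical helper that reads `ss -ltn` output on stdin.

```shell
#!/bin/sh
# Hypothetical helper: succeed if any listening socket in `ss -ltn` output
# (read from stdin) is bound to the given TCP port.
port_in_use() {
    awk -v port="$1" '
        # Column 4 of `ss -ltn` is the local address, e.g. 0.0.0.0:80 or [::]:80
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Usage:
#   if ss -ltn | port_in_use 80; then
#       echo "port 80 is already taken; stop Apache or move it first" >&2
#   else
#       /etc/init.d/radosgw restart
#   fi
```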

Ceph and KRBD discard


This post covers the space reclamation mechanism of the kernel RBD module.
This kind of support is crucial for operators and eases capacity planning.
RBD images are sparse, so right after creation an image uses 0 MB.
The main issue with sparse images is that they grow until they eventually reach their full size.
The thing is, Ceph knows nothing about what is happening on top of the block device, especially if it holds a filesystem.
You can easily fill the entire filesystem and then delete everything; Ceph will still believe that the blocks are fully used and will keep reporting that usage.
However, thanks to discard support on the block device, the filesystem can send discard commands down to the block device.
In the end, the storage frees up the blocks.
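
In practice, discards reach the block device either continuously (the filesystem is mounted with the discard option) or in batches (running fstrim on the mount point). Below is a minimal sketch of the batched variant, assuming an RBD image that is already mapped and mounted. The mount point /mnt/rbd and the helper names run and trim_mount are hypothetical; with DRY_RUN=1 the command is only printed.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Ask the filesystem mounted at $1 to discard its unused blocks.
trim_mount() {
    run fstrim -v "$1"
}

# Usage (as root), with a hypothetical mount point:
#   trim_mount /mnt/rbd
```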

Ceph Meetup Helsinki, Finland: 22 Jan 2015

It has been a good start to 2015. We, the geeks of the “Helsinki Metropolitan Area”, express our sincere thanks to the Red Hat folks for arranging an unofficial “Ceph Day” style Ceph meetup.

From my point of view, expecting any Ceph event in Finland used to be like daydreaming. Hopefully not anymore.

Here is my presentation from the meetup. I hope you enjoy it and learn something new from it. Read it carefully: “There is something new for YOU”.


Introducing Try-Ceph

Since this was the first presentation on Ceph, and in order to keep the audience awake and drawn into Ceph, I gave a LIVE DEMONSTRATION of Ceph, which the audience really enjoyed.

So what is Try-Ceph?

It’s the shortest and quickest way to get your TEST Ceph cluster up and running in just 10 minutes --yes-i-am-really-really-serious

It’s a TWO-step process

Step – 1 : # git clone
Step – 2 :  # vagrant up

Check out the documentation

Check out the recorded session


Ceph reset perf counter


OSD performance counters tend to accumulate, and sometimes the values shown are not really representative of the current environment.
It is therefore quite useful to reset the counters so that they reflect only recent activity.
This feature was added in Ceph 0.90, so you will have to wait for the Hammer release.

This action can be triggered via the admin socket:

$ sudo ceph daemon osd.0 perf reset all
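
To reset every OSD on a host in one go, the same command can be looped over the local admin sockets. This is a sketch, assuming the default socket directory /var/run/ceph and the default cluster name ceph. The helpers asok_to_daemon and reset_local_osds are hypothetical, and the `all` argument is used on the assumption that perf reset takes either `all` or a single counter name.

```shell
#!/bin/sh
# Hypothetical helper: map an admin socket path to a daemon name,
# e.g. /var/run/ceph/ceph-osd.3.asok -> osd.3 (assumes cluster name "ceph").
asok_to_daemon() {
    name=${1##*/}         # strip the directory
    name=${name#ceph-}    # strip the cluster prefix
    echo "${name%.asok}"  # strip the suffix
}

# Reset the perf counters of every OSD with a socket in the given directory.
reset_local_osds() {
    for sock in "${1:-/var/run/ceph}"/ceph-osd.*.asok; do
        [ -S "$sock" ] || continue   # skip when the glob matched nothing
        sudo ceph daemon "$(asok_to_daemon "$sock")" perf reset all
    done
}

# Usage:
#   reset_local_osds
```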

v0.91 released

We are quickly approaching the Hammer feature freeze but have a few more dev releases to go before we get there. The headline items are subtree-based quota support in CephFS (ceph-fuse/libcephfs client support only for now), a rewrite of the watch/notify librados API used by RBD and RGW, OSDMap checksums to ensure that maps are always consistent inside the cluster, new API calls in librados and librbd for IO hinting modeled after posix_fadvise, and improved storage of per-PG state.

We expect two more releases before the first Hammer release candidate (v0.93).


  • The ‘category’ field for objects has been removed. This was originally added to track PG stat summations over different categories of objects for use by radosgw. It no longer has any known users and is prone to abuse because it can lead to a pg_stat_t structure that is unbounded. The librados API calls that accept this field now ignore it, and the OSD no longer tracks the per-category summations.
  • The output for ‘rados df’ has changed. The ‘category’ level has been eliminated, so there is now a single stat object per pool. The structure of the JSON output is different, and the plaintext output has one less column.
  • The ‘rados create <objectname> [category]’ optional category argument is no longer supported or recognized.
  • The Python bindings’ Rados class no longer has a __del__ method; it was causing problems on interpreter shutdown and use of threads. If your code has Rados objects with limited lifetimes and you’re concerned about locked resources, call Rados.shutdown() explicitly.
  • There is a new version of the librados watch/notify API with vastly improved semantics. Any applications using this interface are encouraged to migrate to the new API. The old API calls are marked as deprecated and will eventually be removed.
  • The librados rados_unwatch() call used to be safe to call on an invalid handle. The new version has undefined behavior when passed a bogus value (for example, when rados_watch() returns an error and handle is not defined).
  • The structure of the formatted ‘pg stat’ command is changed for the portion that counts states by name to avoid using the ‘+’ character (which appears in state names) as part of the XML token (it is not legal).



© 2016, Red Hat, Inc. All rights reserved.