The Ceph Blog

Featured Post

v0.80.10 Firefly released

This is a bugfix release for Firefly.

We recommend that all Firefly users upgrade.

For more detailed information, see the complete changelog.
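For those upgrading in place, the usual Ceph order applies: monitors first, then OSDs, then MDSs and radosgw. A minimal sketch for one Debian/Ubuntu node, assuming ceph.com packages and upstart-managed daemons (adjust for your distro and init system):

$ sudo apt-get update && sudo apt-get install ceph   # pull in the 0.80.10 packages
$ sudo restart ceph-mon-all                          # on monitor nodes, one at a time
$ sudo restart ceph-osd-all                          # on OSD nodes, after all mons are done
$ ceph --version                                     # confirm the installed version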

NOTABLE CHANGES

  • build/ops: ceph.spec.in: package mkcephfs on EL6 (issue#11955, pr#4924, Ken Dreyer)
  • build/ops: debian: ceph-test and rest-bench debug packages should require their respective binary packages (issue#11673, pr#4766, Ken Dreyer)
  • build/ops: run RGW as root (issue#11453, pr#4638, Ken Dreyer)
  • common: messages/MWatchNotify: include an error code in the message (issue#9193, pr#3944, Sage Weil)
  • common: Rados.shutdown() dies with Illegal instruction (core dumped) (issue#10153, pr#3963, Federico Simoncelli)
  • common: SimpleMessenger: allow RESETSESSION whenever we forget an endpoint (issue#10080, pr#3915, Greg Farnum)
  • common: WorkQueue: make wait timeout on empty queue configurable (issue#10817, pr#3941, Samuel Just)
  • crush: set_choose_tries = 100 for erasure code rulesets (issue#10353, pr#3824, Loic Dachary)
  • doc: backport ceph-disk man page to Firefly (issue#10724, pr#3936, Nilamdyuti Goswami)
  • doc: Fix ceph command manpage to match ceph -h (issue#10676, pr#3996, David Zafman)
  • fs: mount.ceph: avoid spurious error message (issue#10351, pr#3927, Yan, Zheng)
  • librados: Fix memory leak in python rados bindings (issue#10723, pr#3935, Josh Durgin)
  • librados: fix resources leakage in RadosClient::connect() (issue#10425, pr#3828, Radoslaw Zarzynski)
  • librados: Translate operation flags from C APIs (issue#10497, pr#3930, Matt Richards)
  • librbd: acquire cache_lock before refreshing parent (issue#5488, pr#4206, Jason Dillaman)
  • librbd: snap_remove should ignore -ENOENT errors (issue#11113, pr#4245, Jason Dillaman)
  • mds: fix assertion caused by system clock backwards (issue#11053, pr#3970, Yan, Zheng)
  • mon: ignore osd failures from before up_from (issue#10762, pr#3937, Sage Weil)
  • mon: MonCap: take EntityName instead when expanding profiles (issue#10844, pr#3942, Joao Eduardo Luis)
  • mon: Monitor: fix timecheck rounds period (issue#10546, pr#3932, Joao Eduardo Luis)
  • mon: OSDMonitor: do not trust small values in osd epoch cache (issue#10787, pr#3823, Sage Weil)
  • mon: OSDMonitor: fallback to json-pretty in case of invalid formatter (issue#9538, pr#4475, Loic Dachary)
  • mon: PGMonitor: several stats output error fixes (issue#10257, pr#3826, Joao Eduardo Luis)
  • objecter: fix map skipping (issue#9986, pr#3952, Ding Dinghua)
  • osd: cache tiering: fix the atime logic of the eviction (issue#9915, pr#3949, Zhiqiang Wang)
  • osd: cancel_pull: requeue waiters (issue#11244, pr#4415, Samuel Just)
  • osd: check that source OSD is valid for MOSDRepScrub (issue#9555, pr#3947, Sage Weil)
  • osd: DBObjectMap: lock header_lock on sync() (issue#9891, pr#3948, Samuel Just)
  • osd: do not ignore deleted pgs on startup (issue#10617, pr#3933, Sage Weil)
  • osd: ENOENT on clone (issue#11199, pr#4385, Samuel Just)
  • osd: erasure-code-profile set races with erasure-code-profile rm (issue#11144, pr#4383, Loic Dachary)
  • osd: FAILED assert(soid < scrubber.start || soid >= scrubber.end) (issue#11156, pr#4185, Samuel Just)
  • osd: FileJournal: fix journalq population in do_read_entry() (issue#6003, pr#3960, Samuel Just)
  • osd: fix negative degraded objects during backfilling (issue#7737, pr#4021, Guang Yang)
  • osd: get the current atime of the object in cache pool for eviction (issue#9985, pr#3950, Sage Weil)
  • osd: load_pgs: we need to handle the case where an upgrade from earlier versions which ignored non-existent pgs resurrects a pg with a prehistoric osdmap (issue#11429, pr#4556, Samuel Just)
  • osd: ObjectStore: Don't use largest_data_off to calc data_align. (issue#10014, pr#3954, Jianpeng Ma)
  • osd: osd_types: op_queue_age_hist and fs_perf_stat should be in osd_stat_t::o… (issue#10259, pr#3827, Samuel Just)
  • osd: PG::actingset should be used when checking the number of acting OSDs for… (issue#11454, pr#4453, Guang Yang)
  • osd: PG::all_unfound_are_queried_or_lost for non-existent osds (issue#10976, pr#4416, Mykola Golub)
  • osd: PG: always clear_primary_state (issue#10059, pr#3955, Samuel Just)
  • osd: PGLog.h: 279: FAILED assert(log.log.size() == log_keys_debug.size()) (issue#10718, pr#4382, Samuel Just)
  • osd: PGLog: include rollback_info_trimmed_to in (read|write)_log (issue#10157, pr#3964, Samuel Just)
  • osd: pg stuck stale after create with activation delay (issue#11197, pr#4384, Samuel Just)
  • osd: ReplicatedPG: fail a non-blocking flush if the object is being scrubbed (issue#8011, pr#3943, Samuel Just)
  • osd: ReplicatedPG::on_change: clean up callbacks_for_degraded_object (issue#8753, pr#3940, Samuel Just)
  • osd: ReplicatedPG::scan_range: an object can disappear between the list and t… (issue#10150, pr#3962, Samuel Just)
  • osd: requeue blocked op before flush it was blocked on (issue#10512, pr#3931, Sage Weil)
  • rgw: check for timestamp for s3 keystone auth (issue#10062, pr#3958, Abhishek Lekshmanan)
  • rgw: civetweb should use unique request id (issue#11720, pr#4780, Orit Wasserman)
  • rgw: don't allow negative / invalid content length (issue#11890, pr#4829, Yehuda Sadeh)
  • rgw: fail s3 POST auth if keystone not configured (issue#10698, pr#3966, Yehuda Sadeh)
  • rgw: flush xml header on get acl request (issue#10106, pr#3961, Yehuda Sadeh)
  • rgw: generate new tag for object when setting object attrs (issue#11256, pr#4571, Yehuda Sadeh)
  • rgw: generate the "Date" HTTP header for civetweb (issue#11871, #11891, pr#4851, Radoslaw Zarzynski)
  • rgw: keystone token cache does not work correctly (issue#11125, pr#4414, Yehuda Sadeh)
  • rgw: merge manifests correctly when there's prefix override (issue#11622, pr#4697, Yehuda Sadeh)
  • rgw: send appropriate op to cancel bucket index pending operation (issue#10770, pr#3938, Yehuda Sadeh)
  • rgw: shouldn't need to disable rgw_socket_path if frontend is configured (issue#11160, pr#4275, Yehuda Sadeh)
  • rgw: Swift API. Dump container's custom metadata. (issue#10665, pr#3934, Dmytro Iurchenko)
  • rgw: Swift API. Support for X-Remove-Container-Meta-{key} header. (issue#10475, pr#3929, Dmytro Iurchenko)
  • rgw: use correct objv_tracker for bucket instance (issue#11416, pr#4379, Yehuda Sadeh)
  • tests: force checkout of submodules (issue#11157, pr#4079, Loic Dachary)
  • tools: Backport ceph-objectstore-tool changes to firefly (issue#12327, pr#3866, David Zafman)
  • tools: ceph-objectstore-tool: Output only unsupported features when incompatible (issue#11176, pr#4126, David Zafman)
  • tools: ceph-objectstore-tool: Use exit status 11 for incompatible import attemp… (issue#11139, pr#4129, David Zafman)
  • tools: Fix do_autogen.sh so that -L is allowed (issue#11303, pr#4247, Alfredo Deza)


Earlier Posts

v9.0.2 released

This development release features more of the OSD work queue unification, randomized osd scrub times, a huge pile of librbd fixes, more MDS repair and snapshot fixes, and a significant amount of work on the tests and build infrastructure.

NOTABLE CHANGES

  • buffer: some cleanup (Michal Jarzabek)
  • build: cmake: fix nss linking (Danny Al-Gaaf)
  • build: cmake: misc fixes (Orit Wasserman, Casey Bodley)
  • build: install-deps: misc fixes (Loic Dachary)
  • build: make_dist_tarball.sh (Sage Weil)
  • ceph-detect-init: added Linux Mint (Michal Jarzabek)
  • ceph-detect-init: robust init system detection (Owen Synge)
  • ceph-disk: ensure 'zap' only operates on a full disk (#11272 Loic Dachary)
  • ceph-disk: misc fixes to respect init system (Loic Dachary, Owen Synge)
  • ceph-disk: support NVMe device partitions (#11612 Ilja Slepnev)
  • ceph: fix 'df' units (Zhe Zhang)
  • ceph: fix parsing in interactive cli mode (#11279 Kefu Chai)
  • ceph-objectstore-tool: many many changes (David Zafman)
  • ceph-post-file: misc fixes (Joey McDonald, Sage Weil)
  • client: avoid sending unnecessary FLUSHSNAP messages (Yan, Zheng)
  • client: exclude setfilelock when calculating oldest tid (Yan, Zheng)
  • client: fix error handling in check_pool_perm (John Spray)
  • client: fsync waits only for inode's caps to flush (Yan, Zheng)
  • client: invalidate kernel dcache when cache size exceeds limits (Yan, Zheng)
  • client: make fsync wait for unsafe dir operations (Yan, Zheng)
  • client: pin lookup dentry to avoid inode being freed (Yan, Zheng)
  • common: detect overflow of int config values (#11484 Kefu Chai)
  • common: fix json parsing of utf8 (#7387 Tim Serong)
  • common: fix leak of pthread_mutexattr (#11762 Ketor Meng)
  • crush: respect default replicated ruleset config on map creation (Ilya Dryomov)
  • deb, rpm: move ceph-objectstore-tool to ceph (Ken Dreyer)
  • doc: man page updates (Kefu Chai)
  • doc: misc updates (#11396 Nilamdyuti, Francois Lafont, Ken Dreyer, Kefu Chai)
  • init-radosgw: merge with sysv version; fix enumeration (Sage Weil)
  • librados: add config observer (Alistair Strachan)
  • librbd: add const for single-client-only features (Josh Durgin)
  • librbd: add deep-flatten operation (Jason Dillaman)
  • librbd: avoid blocking aio API methods (#11056 Jason Dillaman)
  • librbd: fix fast diff bugs (#11553 Jason Dillaman)
  • librbd: fix image format detection (Zhiqiang Wang)
  • librbd: fix lock ordering issue (#11577 Jason Dillaman)
  • librbd: flatten/copyup fixes (Jason Dillaman)
  • librbd: lockdep, helgrind validation (Jason Dillaman, Josh Durgin)
  • librbd: only update image flags while hold exclusive lock (#11791 Jason Dillaman)
  • librbd: return result code from close (#12069 Jason Dillaman)
  • librbd: tolerate old osds when getting image metadata (#11549 Jason Dillaman)
  • mds: do not add snapped items to bloom filter (Yan, Zheng)
  • mds: fix handling for missing mydir dirfrag (#11641 John Spray)
  • mds: fix rejoin (Yan, Zheng)
  • mds: fix stray reintegration (Yan, Zheng)
  • mds: fix suicide beacon (John Spray)
  • mds: misc repair improvements (John Spray)
  • mds: misc snapshot fixes (Yan, Zheng)
  • mds: respawn instead of suicide on blacklist (John Spray)
  • misc coverity fixes (Danny Al-Gaaf)
  • mon: add 'mon_metadata <id>' command (Kefu Chai)
  • mon: add 'node ls …' command (Kefu Chai)
  • mon: disallow ec pools as tiers (#11650 Samuel Just)
  • mon: fix mds beacon replies (#11590 Kefu Chai)
  • mon: fix ‘pg ls’ sort order, state names (#11569 Kefu Chai)
  • mon: normalize erasure-code profile for storage and comparison (Loic Dachary)
  • mon: optionally specify osd id on ‘osd create’ (Mykola Golub)
  • mon: 'osd tree' fixes (Kefu Chai)
  • mon: prevent pool with snapshot state from being used as a tier (#11493 Sage Weil)
  • mon: refine check_remove_tier checks (#11504 John Spray)
  • mon: remove spurious who arg from 'mds rm …' (John Spray)
  • msgr: async: misc fixes (Haomai Wang)
  • msgr: xio: fix ip and nonce (Raju Kurunkad)
  • msgr: xio: improve lane assignment (Vu Pham)
  • msgr: xio: misc fixes (Vu Pham, Casey Bodley)
  • osd: avoid transaction append in some cases (Sage Weil)
  • osdc/Objecter: allow per-pool calls to op_cancel_writes (John Spray)
  • osd: eliminate txn append, ECSubWrite copy (Samuel Just)
  • osd: filejournal: cleanup (David Zafman)
  • osd: fix check_for_full (Henry Chang)
  • osd: fix dirty accounting in make_writeable (Zhiqiang Wang)
  • osd: fix osdmap dump of blacklist items (John Spray)
  • osd: fix snap flushing from cache tier (again) (#11787 Samuel Just)
  • osd: fix snap handling on promotion (#11296 Sam Just)
  • osd: handle log split with overlapping entries (#11358 Samuel Just)
  • osd: keyvaluestore: misc fixes (Varada Kari)
  • osd: make suicide timeouts individually configurable (Samuel Just)
  • osd: move scrub in OpWQ (Samuel Just)
  • osd: pool size change triggers new interval (#11771 Samuel Just)
  • osd: randomize scrub times (#10973 Kefu Chai)
  • osd: refactor scrub and digest recording (Sage Weil)
  • osd: refuse first write to EC object at non-zero offset (Jianpeng Ma)
  • osd: stripe over small xattrs to fit in XFS's 255 byte inline limit (Sage Weil, Ning Yao)
  • osd: sync object_map on syncfs (Samuel Just)
  • osd: take excl lock if op is rw (Samuel Just)
  • osd: WBThrottle cleanups (Jianpeng Ma)
  • pycephfs: many fixes for bindings (Haomai Wang)
  • rados: bench: add --no-verify option to improve performance (Piotr Dalek)
  • rados: misc bench fixes (Dmitry Yatsushkevich)
  • rbd: add disk usage tool (#7746 Jason Dillaman)
  • rgw: always check if token is expired (#11367 Anton Aksola, Riku Lehto)
  • rgw: conversion tool to repair broken multipart objects (#12079 Yehuda Sadeh)
  • rgw: do not enclose bucket header in quotes (#11860 Wido den Hollander)
  • rgw: error out if frontend did not send all data (#11851 Yehuda Sadeh)
  • rgw: fix assignment of copy obj attributes (#11563 Yehuda Sadeh)
  • rgw: fix reset_loc (#11974 Yehuda Sadeh)
  • rgw: improve content-length env var handling (#11419 Robin H. Johnson)
  • rgw: only scan for objects not in a namespace (#11984 Yehuda Sadeh)
  • rgw: remove trailing :port from HTTP_HOST header (Sage Weil)
  • rgw: shard work over multiple librados instances (Pavan Rallabhandi)
  • rgw: swift: enforce Content-Type in response (#12157 Radoslaw Zarzynski)
  • rgw: use attrs from source bucket on copy (#11639 Javier M. Mellid)
  • rocksdb: pass options as single string (Xiaoxi Chen)
  • rpm: many spec file fixes (Owen Synge, Ken Dreyer)
  • tests: fixes for rbd xstests (Douglas Fuller)
  • tests: fix tiering health checks (Loic Dachary)
  • tests for low-level performance (Haomai Wang)
  • tests: many ec non-regression improvements (Loic Dachary)
  • tests: many many ec test improvements (Loic Dachary)
  • upstart: throttle restarts (#11798 Sage Weil, Greg Farnum)


Ceph is becoming more and more popular in China. Intel and Red Hat jointly held Beijing Ceph Day at Intel's RYC office on June 6th, 2015. The event attracted roughly 200 developers and end users from more than 120 companies. Ten technical sessions shared Ceph's transformative power, and the day also focused on the current problems of Ceph and on how we can grow the Ceph ecosystem in China.

Keynote Speech

Ziya Ma, General Manager of Intel's Big Data Technology team (BDT), introduced Intel's investments in Ceph. She started from the data big bang, pointing out that data needs are growing at a rate unsustainable with today's infrastructure and labor costs, so a fundamental transformation in storage infrastructure is needed to meet the new challenges. As the most popular OpenStack block backend, Ceph is attracting more and more interest; for example, Fujitsu delivered the Ceph-based storage product CD10K. Intel BDT's investments in Ceph include: Ceph performance analysis and tuning on different platforms; development and optimization of key features like cache tiering, erasure coding, and NewStore; toolkit development (COSBench, VSM, and CeTune); and promotion of Ceph-based scale-out storage solutions with local customers in China. She announced the founding of the China Ceph user group, a Chinese mailing list, and the next Ceph Day, to be held in Shanghai in October.

Ceph community director Patrick McGarry from Red Hat introduced community updates and recent development status. He emphasized that the Ceph community's focus hasn't changed after Red Hat's acquisition of Inktank, and that Ceph will provide better support for RHEL/Fedora/CentOS. He encouraged developers to attend the first Ceph hackathon, to be held in Hillsboro in August, which will focus on performance, RBD, and RGW. On the development side, he introduced the CephFS improvements in the Hammer release (366 commits to the MDS module and 20K lines of code changed), and we can expect CephFS to be production ready in the next release.

Ceph Development

NewStore: Xiaoxi Chen from Intel introduced the design and implementation of NewStore, a new storage backend for Ceph targeted at the next release. By decoupling the mapping from object name to actual storage path, NewStore can manage data flexibly. Compared to FileStore, NewStore can avoid the journal write for create, append, and overwrite operations without losing atomicity and consistency. This not only improves performance but also cuts TCO for customers. The initial performance data shared in the talk looks quite promising, and attendees were very interested in NewStore and are looking forward to trying it when it is ready.

Cache Tiering Optimization: Active community code contributor Dr. Li Wang from Ubuntukylin introduced their Ceph optimization work on the Tianhe-2 supercomputer platform, including CephFS inline data, RBD image offline recovery, and cache tiering optimization. Cache tiering, an important feature since Emperor, is designed to improve a Ceph cluster's performance by leveraging a small set of fast devices as a cache. However, the current eviction algorithm is based on the latest access time, which is not very efficient in some scenarios. Dr. Wang proposed a temperature-based cache management algorithm that evicts objects based on both their access time and their access frequency. The Beijing Ceph Day user survey showed cache tiering was one of the two features attendees were most interested in trying (the other being erasure coding), and it still needs more optimization to be production ready.
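As a rough illustration (our own formulation, not necessarily the exact algorithm presented), such a temperature can be modeled as an exponentially decayed access count, T(obj) = Σ_i exp(−λ · (now − t_i)) over the object's recent access times t_i: every access heats the object up, idle time cools it down, and eviction removes the coldest objects first, capturing both recency and frequency in a single score.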

Ceph-dokan Windows client: Ceph currently has no driver that can be used directly on Windows. Zhisheng Meng from UCloud introduced Ceph-Dokan, which implements a Win32 FS API compatible Windows client with the help of Cygwin and MinGW. The next steps are to support CephX, provide librados and librbd DLLs, and get the work merged into Ceph master.

Ecosystem

Ceph and Containers: Container technology is widely adopted in cloud computing environments. Independent open source contributor Haomai Wang introduced Ceph and container integration work. He compared the pros and cons of the VM+RBD and container+RBD usage models: the latter has better performance in general but needs more improvement on security. In Kubernetes, several containers compose a pod and use files as storage, so it appears more suitable to use a filesystem rather than RBD as the container backend. He also introduced the latest CephFS improvements and the progress of CephFS deployment and development with Nova and Kubernetes integration.

Ceph toolkit: As the only female speaker, Chendi Xue from Intel presented a new Ceph profiling and tuning tool called CeTune. It is designed to help system engineers deploy and benchmark a Ceph cluster quickly and easily. CeTune benchmarks the Ceph RBD, object, and CephFS interfaces with fio, COSBench, and other pluggable workloads. It monitors not only system metrics like CPU utilization, memory usage, and I/O statistics but also Ceph performance metrics like perf counters and LTTng trace data. CeTune analyzes these data offline to reveal system and software stack bottlenecks, and it provides web-based visualization of all the processed data to make analysis and tuning easier.

Ceph and Big Data: With the rise of IaaS, cloud storage is becoming more and more popular. However, this introduces a new problem for big data analytics frameworks like MapReduce, which usually store data in a specific distributed file system; it requires lots of data movement from IaaS storage to HDFS. Yuan Zhou from Intel introduced how to run Hadoop over Ceph RGW. He presented the detailed design of Hadoop over Ceph object storage, following the approach OpenStack Sahara takes with Swift, using a new RGWFS driver and an RGW proxy component. Some early benchmarking data for various solutions and deployments were shared, including VM vs. container vs. bare metal and HDFS vs. Swift.

User Experience Sharing

Ceph and OpenStack integration experience sharing: Dexin Wu and Yuting Wu from AWCloud shared their experience integrating Ceph with OpenStack. One key takeaway: although the Hammer release brought significant performance improvements, it still cannot fully utilize the capability of SSD devices. We also still need features like cluster-level QoS and multi-geo disaster recovery. They shared a performance tuning example in which the throughput of a 100-OSD cluster was improved from 2,000 to 9,000 IOPS through Ceph parameter tuning and redeployment.

One Ceph, two ways of thinking: Xiaoyi Zhang from Perfect World (a top internet gaming vendor in China) shared end-user feedback on Ceph and offered some optimization proposals. From Perfect World's point of view, Ceph has many advantages: high availability, high reliability, high durability, and almost unlimited capacity expansion. He described how they improved recovery performance by tuning read_ahead_kb on the hard drives, how they reconfigured ceph.conf and leveraged bcache to improve cluster stability and performance, and how they deployed multiple directories on a single PCIe SSD as dedicated OSD storage spaces to improve all-SSD performance.
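For readers who want to experiment with the readahead tuning mentioned above, the knob is a standard Linux sysfs attribute; the device name and value here are illustrative, not the settings Perfect World used:

$ cat /sys/block/sda/queue/read_ahead_kb                    # current readahead, in KB
$ echo 2048 | sudo tee /sys/block/sda/queue/read_ahead_kb   # larger readahead can help big sequential recovery reads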

Ceph based products

Hao Zhou from SanDisk introduced their Ceph-based all-flash product, InfiniFlash, and related optimizations. InfiniFlash provides up to 512TB in a 3U chassis, with up to 780K IOPS and 7GB/s of bandwidth. He described optimization efforts such as thread pool sharding and tuning of lock ordering and granularity.

Panel Discussion

As the last session of Beijing Ceph Day, the panel discussion covered two topics: what the current problems of Ceph are, and how we can accelerate the development of Ceph in China. Most concerns were about performance, management, documentation, and localization. People offered many suggestions on how to grow the Ceph ecosystem in China, e.g., that the community needs more contributions and sharing from users, developers, and partners. Developers can benefit from the real usage scenarios and issues encountered by end users to make Ceph more stable and mature, while end users become more familiar with Ceph through the engagement.

Technical Slides

All the slides can be downloaded from http://www.slideshare.net/inktank_ceph.

Onsite pictures

Registration

Beijing Ceph Day Registration

Agenda

Beijing Ceph Day Agenda

Keynote Speech

Keynotes

Audience

Audience

 

Media Coverage

Beijing Ceph Day was a great success; here are some media coverage reports:

http://www.csdn.net/article/2015-06-08/2824891

http://code.csdn.net/news/2825020

http://www.csdn.net/article/2015-07-03/2825121

Beijing Ceph Day User Survey Results

We ran a Ceph survey during Beijing Ceph Day. Our purpose was to get a general understanding of the Ceph deployment status in China and to collect feedback and suggestions for our next development and optimization work. We designed a 16-question questionnaire, including three open questions, and received 110 valid responses during the event. We would like to share the survey results with you.

Summary:

  1. Attendee role: Most attendees are private cloud providers, followed by public cloud service providers.
  2. Cloud OS: OpenStack is still the dominant cloud OS (59%).
  3. Other storage deployed: 26% use commercial storage; HDFS is also very popular.
  4. Ceph deployment phase: Most deployments are still very early; 46% are under QA and testing, while 30% are already in production.
  5. Ceph cluster scale: Most clusters are 10-50 nodes.
  6. Ceph interface used: RBD is the most used (50%), followed by object storage (23%) and CephFS (16%); 6% use the native RADOS API.
  7. Ceph version: The most popular Ceph version is Hammer (31%).
  8. Replication model: 3x replication is the most common (49%).
  9. Features interested in or would like to try next: Cache tiering (26%) and erasure coding (19%) are very attractive to customers, followed by full-SSD optimization.
  10. Performance metric most cared about: Stability is still the No. 1 concern (30%).
  11. Deployment tools: Most people use ceph-deploy (50%).
  12. Monitoring and management: 35% use Calamari for monitoring and management, while 33% use nothing.
  13. Top three issues with Ceph: (1) performance, (2) complexity, and (3) too many immature features.
  14. Suggestions for Ceph's development and optimization (open question): (1) documentation, (2) stability.
  15. Major reasons to choose Ceph: (1) unified storage, (2) acceptable performance, (3) active community.
  16. QoS requirements: Diverse requirements.

(Per-question survey result charts omitted. Q16 asked: What's your QoS requirement in your environment?)

v0.94.2 Hammer released

This Hammer point release fixes a few critical bugs in RGW that can prevent objects starting with an underscore from behaving properly and that prevent garbage collection of deleted objects when using the Civetweb standalone mode.

All v0.94.x Hammer users are strongly encouraged to upgrade, and to make note of the repair procedure below if RGW is in use.

UPGRADING FROM PREVIOUS HAMMER RELEASE

Bug #11442 introduced a change that made RGW objects that start with an underscore incompatible with previous versions. The fix for that bug reverts to the previous behavior. To be able to access objects that start with an underscore and were created in prior Hammer releases, run the following command for each affected bucket after the upgrade:

$ radosgw-admin bucket check --check-head-obj-locator \
                             --bucket=<bucket> [--fix]
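If the tool follows its usual convention, running the command without --fix only reports the affected objects, while adding --fix performs the actual repair; a report-only first pass is a sensible precaution.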

NOTABLE CHANGES

  • build: compilation error: No high-precision counter available (armhf, powerpc..) (#11432, James Page)
  • ceph-dencoder links to libtcmalloc, and shouldn't (#10691, Boris Ranto)
  • ceph-disk: disk zap sgdisk invocation (#11143, Owen Synge)
  • ceph-disk: using a new disk as a journal disk makes ceph-disk prepare fail (#10983, Loic Dachary)
  • ceph-objectstore-tool should be in the ceph server package (#11376, Ken Dreyer)
  • librados: can get stuck in redirect loop if osdmap epoch == last_force_op_resend (#11026, Jianpeng Ma)
  • librbd: A retransmit of proxied flatten request can result in -EINVAL (Jason Dillaman)
  • librbd: ImageWatcher should cancel in-flight ops on watch error (#11363, Jason Dillaman)
  • librbd: Objectcacher setting max object counts too low (#7385, Jason Dillaman)
  • librbd: Periodic failure of TestLibRBD.DiffIterateStress (#11369, Jason Dillaman)
  • librbd: Queued AIO reference counters not properly updated (#11478, Jason Dillaman)
  • librbd: deadlock in image refresh (#5488, Jason Dillaman)
  • librbd: notification race condition on snap_create (#11342, Jason Dillaman)
  • mds: Hammer uclient checking (#11510, John Spray)
  • mds: remove caps from revoking list when caps are voluntarily released (#11482, Yan, Zheng)
  • messenger: double clear of pipe in reaper (#11381, Haomai Wang)
  • mon: Total size of OSDs is an order of magnitude less than it is supposed to be (#11534, Zhe Zhang)
  • osd: don't check order in finish_proxy_read (#11211, Zhiqiang Wang)
  • osd: handle old semi-deleted pgs after upgrade (#11429, Samuel Just)
  • osd: object creation by write cannot use an offset on an erasure coded pool (#11507, Jianpeng Ma)
  • rgw: Improve rgw HEAD request by avoiding read the body of the first chunk (#11001, Guang Yang)
  • rgw: civetweb is hitting a limit (number of threads 1024) (#10243, Yehuda Sadeh)
  • rgw: civetweb should use unique request id (#10295, Orit Wasserman)
  • rgw: critical fixes for hammer (#11447, #11442, Yehuda Sadeh)
  • rgw: fix swift COPY headers (#10662, #10663, #11087, #10645, Radoslaw Zarzynski)
  • rgw: improve performance for large object (multiple chunks) GET (#11322, Guang Yang)
  • rgw: init-radosgw: run RGW as root (#11453, Ken Dreyer)
  • rgw: keystone token cache does not work correctly (#11125, Yehuda Sadeh)
  • rgw: make quota/gc thread configurable for starting (#11047, Guang Yang)
  • rgw: make RGW's swift responses return last-modified, content-length, and x-trans-id headers (#10650, Radoslaw Zarzynski)
  • rgw: merge manifests correctly when there's prefix override (#11622, Yehuda Sadeh)
  • rgw: quota not respected in POST object (#11323, Sergey Arkhipov)
  • rgw: restore buffer of multipart upload after EEXIST (#11604, Yehuda Sadeh)
  • rgw: shouldn't need to disable rgw_socket_path if frontend is configured (#11160, Yehuda Sadeh)
  • rgw: swift: Response header of GET request for container does not contain X-Container-Object-Count, X-Container-Bytes-Used and x-trans-id headers (#10666, Dmytro Iurchenko)
  • rgw: swift: Response header of POST request for object does not contain content-length and x-trans-id headers (#10661, Radoslaw Zarzynski)
  • rgw: swift: response for GET/HEAD on container does not contain the X-Timestamp header (#10938, Radoslaw Zarzynski)
  • rgw: swift: response for PUT on /container does not contain the mandatory Content-Length header when FCGI is used (#11036, #10971, Radoslaw Zarzynski)
  • rgw: swift: wrong handling of empty metadata on Swift container (#11088, Radoslaw Zarzynski)
  • tests: TestFlatIndex.cc races with TestLFNIndex.cc (#11217, Xinze Chi)
  • tests: ceph-helpers kill_daemons fails when kill fails (#11398, Loic Dachary)

For more detailed information, see the complete changelog.


v9.0.1 released

This development release is delayed a bit due to tooling changes in the build environment. As a result, the next one (v9.0.2) will have a bit more work than usual.

Highlights here include lots of RGW Swift fixes, RBD feature work surrounding the new object map feature, more CephFS snapshot fixes, and a few important CRUSH fixes.

NOTABLE CHANGES

  • auth: cache/reuse crypto lib key objects, optimize msg signature check (Sage Weil)
  • build: allow tcmalloc-minimal (Thorsten Behrens)
  • build: do not build ceph-dencoder with tcmalloc (#10691 Boris Ranto)
  • build: fix pg ref disabling (William A. Kennington III)
  • build: install-deps.sh improvements (Loic Dachary)
  • build: misc fixes (Boris Ranto, Ken Dreyer, Owen Synge)
  • ceph-authtool: fix return code on error (Gerhard Muntingh)
  • ceph-disk: fix zap sgdisk invocation (Owen Synge, Thorsten Behrens)
  • ceph-disk: pass --cluster arg on prepare subcommand (Kefu Chai)
  • ceph-fuse, libcephfs: drop inode when rmdir finishes (#11339 Yan, Zheng)
  • ceph-fuse, libcephfs: fix uninline (#11356 Yan, Zheng)
  • ceph-monstore-tool: fix store-copy (Huangjun)
  • common: add perf counter descriptions (Alyona Kiseleva)
  • common: fix throttle max change (Henry Chang)
  • crush: fix crash from invalid 'take' argument (#11602 Shiva Rkreddy, Sage Weil)
  • crush: fix divide-by-2 in straw2 (#11357 Yann Dupont, Sage Weil)
  • deb: fix rest-bench-dbg and ceph-test-dbg dependencies (Ken Dreyer)
  • doc: document region hostnames (Robin H. Johnson)
  • doc: update release schedule docs (Loic Dachary)
  • init-radosgw: run radosgw as root (#11453 Ken Dreyer)
  • librados: fadvise flags per op (Jianpeng Ma)
  • librbd: allow additional metadata to be stored with the image (Haomai Wang)
  • librbd: better handling for dup flatten requests (#11370 Jason Dillaman)
  • librbd: cancel in-flight ops on watch error (#11363 Jason Dillaman)
  • librbd: default new images to format 2 (#11348 Jason Dillaman)
  • librbd: fast diff implementation that leverages object map (Jason Dillaman)
  • librbd: fix snapshot creation when other snap is active (#11475 Jason Dillaman)
  • librbd: new diff_iterate2 API (Jason Dillaman)
  • librbd: object map rebuild support (Jason Dillaman)
  • logrotate.d: prefer service over invoke-rc.d (#11330 Win Hierman, Sage Weil)
  • mds: avoid getting stuck in XLOCKDONE (#11254 Yan, Zheng)
  • mds: fix integer truncation on large client ids (Henry Chang)
  • mds: many snapshot and stray fixes (Yan, Zheng)
  • mds: persist completed_requests reliably (#11048 John Spray)
  • mds: separate safe_pos in Journaler (#10368 John Spray)
  • mds: snapshot rename support (#3645 Yan, Zheng)
  • mds: warn when clients fail to advance oldest_client_tid (#10657 Yan, Zheng)
  • misc cleanups and fixes (Danny Al-Gaaf)
  • mon: fix average utilization calc for 'osd df' (Mykola Golub)
  • mon: fix variance calc in 'osd df' (Sage Weil)
  • mon: improve callout to crushtool (Mykola Golub)
  • mon: prevent bucket deletion when referenced by a crush rule (#11602 Sage Weil)
  • mon: prime pg_temp when CRUSH map changes (Sage Weil)
  • monclient: flush_log (John Spray)
  • msgr: async: many many fixes (Haomai Wang)
  • msgr: simple: fix clear_pipe (#11381 Haomai Wang)
  • osd: add latency perf counters for tier operations (Xinze Chi)
  • osd: avoid multiple hit set insertions (Zhiqiang Wang)
  • osd: break PG removal into multiple iterations (#10198 Guang Yang)
  • osd: check scrub state when handling map (Jianpeng Ma)
  • osd: fix endless repair when object is unrecoverable (Jianpeng Ma, Kefu Chai)
  • osd: fix pg resurrection (#11429 Samuel Just)
  • osd: ignore non-existent osds in unfound calc (#10976 Mykola Golub)
  • osd: increase default max open files (Owen Synge)
  • osd: prepopulate needs_recovery_map when only one peer has missing (#9558 Guang Yang)
  • osd: relax reply order on proxy read (#11211 Zhiqiang Wang)
  • osd: skip promotion for flush/evict op (Zhiqiang Wang)
  • osd: write journal header on clean shutdown (Xinze Chi)
  • qa: run-make-check.sh script (Loic Dachary)
  • rados bench: misc fixes (Dmitry Yatsushkevich)
  • rados: fix error message on failed pool removal (Wido den Hollander)
  • radosgw-admin: add ‘bucket check’ function to repair bucket index (Yehuda Sadeh)
  • rbd: allow unmapping by spec (Ilya Dryomov)
  • rbd: deprecate --new-format option (Jason Dillaman)
  • rgw: do not set content-type if length is 0 (#11091 Orit Wasserman)
  • rgw: don't use end_marker for namespaced object listing (#11437 Yehuda Sadeh)
  • rgw: fail if parts not specified on multipart upload (#11435 Yehuda Sadeh)
  • rgw: fix GET on swift account when limit == 0 (#10683 Radoslaw Zarzynski)
  • rgw: fix broken stats in container listing (#11285 Radoslaw Zarzynski)
  • rgw: fix bug in domain/subdomain splitting (Robin H. Johnson)
  • rgw: fix civetweb max threads (#10243 Yehuda Sadeh)
  • rgw: fix copy metadata, support X-Copied-From for swift (#10663 Radoslaw Zarzynski)
  • rgw: fix locator for objects starting with _ (#11442 Yehuda Sadeh)
  • rgw: fix multipart upload in retry path (#11604 Yehuda Sadeh)
  • rgw: fix quota enforcement on POST (#11323 Sergey Arkhipov)
  • rgw: fix return code on missing upload (#11436 Yehuda Sadeh)
  • rgw: force content type header on responses with no body (#11438 Orit Wasserman)
  • rgw: generate new object tag when setting attrs (#11256 Yehuda Sadeh)
  • rgw: issue aio for first chunk before flush cached data (#11322 Guang Yang)
  • rgw: make read user buckets backward compat (#10683 Radoslaw Zarzynski)
  • rgw: merge manifests properly with prefix override (#11622 Yehuda Sadeh)
  • rgw: return 412 on bad limit when listing buckets (#11613 Yehuda Sadeh)
  • rgw: send ETag, Last-Modified for swift (#11087 Radoslaw Zarzynski)
  • rgw: set content length on container GET, PUT, DELETE, HEAD (#10971, #11036 Radoslaw Zarzynski)
  • rgw: support end marker on swift container GET (#10682 Radoslaw Zarzynski)
  • rgw: swift: fix account listing (#11501 Radoslaw Zarzynski)
  • rgw: swift: set content-length on keystone tokens (#11473 Hervé Rousseau)
  • rgw: use correct oid for gc chains (#11447 Yehuda Sadeh)
  • rgw: use unique request id for civetweb (#10295 Orit Wasserman)
  • rocksdb, leveldb: fix compact_on_mount (Xiaoxi Chen)
  • rocksdb: add perf counters for get/put latency (Xinxin Shu)
  • rpm: add suse firewall files (Tim Serong)
  • rpm: misc systemd and suse fixes (Owen Synge, Nathan Cutler)


Ceph Developer Summit: Jewel

Hey Cephers, welcome to another Ceph Developer Summit cycle! As Infernalis filters down through the fancy new testing hardware and QA processes, it's time to start thinking about what 'Jewel' will hold in store for us (beyond Sage's hope for a robust and ready CephFS!!!).

Blueprint submissions are now open for any and all work that you would like to contribute or request of community developers. Please submit as soon as possible to ensure that it gets a CDS slot. We know this is still a little early, but the community has asked for a bit more lead time from finished schedule to actual event, so we're trying to push the submissions cycle forward a bit.

This cycle we are in the middle of our wiki transition, so the process will be a bit different; I ask you to be patient with us. This cycle will be the first to utilize the Redmine wiki (on tracker.ceph.com), but migration is ongoing, so it will be a little rough.

The link below will take you to the edit page for the Jewel blueprints. From that page you just need to add your title in the format [[My Awesome Blueprint]] and save the page. You can then click that link and enter your information. There is a sample blueprint page there to get you started, but please don't hesitate to ask 'scuttlemonkey' on IRC or 'pmcgarry at redhat dot com' via email if you have any issues. We really appreciate your patience on this.

The rough schedule (updated) of CDS and Jewel in general should look something like this:

Date        Milestone
26 MAY      Blueprint submissions begin
12 JUN      Blueprint submissions end
17 JUN      Summit agenda announced
01 JUL      Ceph Developer Summit: Day 1
02 JUL      Ceph Developer Summit: Day 2 (if needed)
NOV 2015    Jewel released

As always, this event will be an online event (utilizing the BlueJeans system) so that everyone can attend from their own timezone. If you are interested in submitting a blueprint or collaborating on an existing blueprint, please click the big red button below!

 

Submit Blueprint

scuttlemonkey out

v9.0.0 released

This is the first development release for the Infernalis cycle, and the first Ceph release to sport a version number from the new numbering scheme. The "9" indicates this is the 9th release cycle; "I" (for Infernalis) is the 9th letter. The first "0" indicates this is a development release ("1" will mean release candidate and "2" will mean stable release), and the final "0" indicates this is the first such development release.

A few highlights include:

  • a new 'ceph daemonperf' command to watch perfcounter stats in realtime (see the example after this list)
  • reduced MDS memory usage
  • many MDS snapshot fixes
  • librbd can now store options in the image itself
  • many fixes for RGW Swift API support
  • OSD performance improvements
  • many doc updates and misc bug fixes
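
To try the new command, point it at a daemon running on the local host; the daemon name below is only an example:

$ ceph daemonperf osd.0    # live, top-style view of osd.0's perf counters

The new librbd image options are exposed via the rbd image-meta subcommands (assuming your build includes them; the pool, image, and key below are illustrative):

$ rbd image-meta set rbd/myimage conf_rbd_cache true    # persist a config override with the image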


v0.87.2 Giant released

This is the second (and possibly final) point release for Giant.

We recommend all v0.87.x Giant users upgrade to this release.

NOTABLE CHANGES

  • ceph-objectstore-tool: only output unsupported features when incompatible (#11176 David Zafman)
  • common: do not implicitly unlock rwlock on destruction (Federico Simoncelli)
  • common: make wait timeout on empty queue configurable (#10818 Samuel Just)
  • crush: pick ruleset id that matches and rule id (Xiaoxi Chen)
  • crush: set_choose_tries = 100 for new erasure code rulesets (#10353 Loic Dachary)
  • librados: check initialized atomic safely (#9617 Josh Durgin)
  • librados: fix failed tick_event assert (#11183 Zhiqiang Wang)
  • librados: fix looping on skipped maps (#9986 Ding Dinghua)
  • librados: fix op submit with timeout (#10340 Samuel Just)
  • librados: pybind: fix memory leak (#10723 Billy Olsen)
  • librados: pybind: keep reference to callbacks (#10775 Josh Durgin)
  • librados: translate operation flags from C APIs (Matthew Richards)
  • libradosstriper: fix write_full on ENOENT (#10758 Sebastien Ponce)
  • libradosstriper: use strtoll instead of strtol (Dongmao Zhang)
  • mds: fix assertion caused by system time moving backwards (#11053 Yan, Zheng)
  • mon: allow injection of random delays on writes (Joao Eduardo Luis)
  • mon: do not trust small osd epoch cache values (#10787 Sage Weil)
  • mon: fail non-blocking flush if object is being scrubbed (#8011 Samuel Just)
  • mon: fix division by zero in stats dump (Joao Eduardo Luis)
  • mon: fix get_rule_avail when no osds (#10257 Joao Eduardo Luis)
  • mon: fix timeout rounds period (#10546 Joao Eduardo Luis)
  • mon: ignore osd failures before up_from (#10762 Dan van der Ster, Sage Weil)
  • mon: paxos: reset accept timeout before writing to store (#10220 Joao Eduardo Luis)
  • mon: return if fs exists on 'fs new' (Joao Eduardo Luis)
  • mon: use EntityName when expanding profiles (#10844 Joao Eduardo Luis)
  • mon: verify cross-service proposal preconditions (#10643 Joao Eduardo Luis)
  • mon: wait for osdmon to be writeable when requesting proposal (#9794 Joao Eduardo Luis)
  • mount.ceph: avoid spurious error message about /etc/mtab (#10351 Yan, Zheng)
  • msg/simple: allow RESETSESSION when we forget an endpoint (#10080 Greg Farnum)
  • msg/simple: discard delay queue before incoming queue (#9910 Sage Weil)
  • osd: clear_primary_state when leaving Primary (#10059 Samuel Just)
  • osd: do not ignore deleted pgs on startup (#10617 Sage Weil)
  • osd: fix FileJournal wrap to get header out first (#10883 David Zafman)
  • osd: fix PG leak in SnapTrimWQ (#10421 Kefu Chai)
  • osd: fix journalq population in do_read_entry (#6003 Samuel Just)
  • osd: fix operator== for op_queue_age_hit and fs_perf_stat (#10259 Samuel Just)
  • osd: fix rare assert after split (#10430 David Zafman)
  • osd: get pgid ancestor from last_map when building past intervals (#10430 David Zafman)
  • osd: include rollback_info_trimmed_to in {read,write}_log (#10157 Samuel Just)
  • osd: lock header_lock in DBObjectMap::sync (#9891 Samuel Just)
  • osd: requeue blocked op before flush it was blocked on (#10512 Sage Weil)
  • osd: tolerate missing object between list and attr get on backfill (#10150 Samuel Just)
  • osd: use correct atime for eviction decision (Xinze Chi)
  • rgw: flush XML header on get ACL request (#10106 Yehuda Sadeh)
  • rgw: index swift keys appropriately (#10471 Hemant Bruman, Yehuda Sadeh)
  • rgw: send cancel for bucket index pending ops (#10770 Baijiaruo, Yehuda Sadeh)
  • rgw: swift: support X-Remove-Container-Meta-{key} (#10475 Dmytro Iurchenko)

For more detailed information, see the complete changelog.


v0.94.1 Hammer released

This bug fix release fixes a few critical issues with CRUSH. The most important addresses a bug in feature bit enforcement that may prevent pre-hammer clients from communicating with the cluster during an upgrade. This only manifests in some cases (for example, when the 'rack' type is in use in the CRUSH map, and possibly other cases), but for safety we strongly recommend that all users use 0.94.1 instead of 0.94 when upgrading.

There is also a fix in the new straw2 buckets when OSD weights are 0.

We recommend that all v0.94 users upgrade.

NOTABLE CHANGES

  • crush: fix divide-by-0 in straw2 (#11357 Sage Weil)
  • crush: fix has_v4_buckets (#11364 Sage Weil)
  • osd: fix negative degraded objects during backfilling (#7737 Guang Yang)

For more detailed information, see the complete changelog.

 


v0.94 Hammer released

This major release is expected to form the basis of the next long-term stable series. It is intended to supersede v0.80.x Firefly.

Highlights since Giant include:

  • RADOS Performance: a range of improvements have been made in the OSD and client-side librados code that improve the throughput on flash backends and improve parallelism and scaling on fast machines.
  • Simplified RGW deployment: the ceph-deploy tool now has a new 'ceph-deploy rgw create HOST' command that quickly deploys an instance of the S3/Swift gateway using the embedded Civetweb server. This is vastly simpler than the previous Apache-based deployment. There are a few rough edges (e.g., around SSL support), but we encourage users to try the new method.
  • RGW object versioning: RGW now supports the S3 object versioning API, which preserves old versions of objects instead of overwriting them.
  • RGW bucket sharding: RGW can now shard the bucket index for large buckets across multiple objects, improving performance for very large buckets.
  • RBD object maps: RBD now has an object map function that tracks which parts of the image are allocated, improving performance for clones and for commands like export and delete.
  • RBD mandatory locking: RBD has a new mandatory locking framework (still disabled by default) that adds additional safeguards to prevent multiple clients from using the same image at the same time.
  • RBD copy-on-read: RBD now supports copy-on-read for image clones, improving performance for some workloads.
  • CephFS snapshot improvements: Many many bugs have been fixed with CephFS snapshots. Although they are still disabled by default, stability has improved significantly.
  • CephFS recovery tools: We have built some journal recovery and diagnostic tools. Stability and performance of single-MDS systems is vastly improved in Giant, and more improvements have been made now in Hammer. Although we still recommend caution when storing important data in CephFS, we do encourage testing for non-critical workloads so that we can better gauge the feature, usability, performance, and stability gaps.
  • CRUSH improvements: We have added a new straw2 bucket algorithm that reduces the amount of data migration required when changes are made to the cluster.
  • Shingled erasure codes (SHEC): The OSDs now have experimental support for shingled erasure codes, which allow a small amount of additional storage to be traded for improved recovery performance.
  • RADOS cache tiering: A series of changes have been made in the cache tiering code that improve performance and reduce latency.
  • Experimental RDMA support: There is now experimental support for RDMA via the Accelio (libxio) library.
  • New administrator commands: The 'ceph osd df' command shows pertinent details on OSD disk utilizations. The 'ceph pg ls …' command makes it much simpler to query PG states while diagnosing cluster issues. (See the examples after this list.)
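
A quick taste of the new commands (the host and state arguments below are illustrative):

$ ceph osd df                      # per-OSD utilization, weight, and variance
$ ceph pg ls stale                 # list PGs, optionally filtered by state
$ ceph-deploy rgw create gwhost1   # stand up a Civetweb-backed RGW on node gwhost1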


© 2015, Inktank Storage, Inc. All rights reserved.