Archives:

When Ceph was originally designed a decade ago, the concept was that “intelligent” disk drives with some modest processing capability could store objects instead of blocks and take an active role in replicating, migrating, or repairing data within the system.  In contrast to conventional disk drives, a smart object-based drive could coordinate with other drives in the system in a peer-to-peer fashion to build a more scalable storage system.

Today an Ethernet-attached hard disk drive from WDLabs is making this architecture a reality. WDLabs has taken over 500 drives from the early production line and assembled them into a 4 PB (3.6 PiB) Ceph cluster running Jewel and the prototype BlueStore storage backend. WDLabs has been working to validate the need for an open source compute environment within the storage device and is now beginning to understand the use cases as thought leaders such as Red Hat work with the early units. This test seeks to demonstrate that the second-generation converged microserver has become a viable solution for distributed storage use cases like Ceph. Building an open platform that can run open source software is a key underpinning of the concept.

The Ceph project would like to congratulate the following students on their acceptance to the 2016 Google Summer of Code program, and the Ceph project:

  • Shehbaz Jaffer: BlueStore
  • Victor Araujo: End-to-end Performance Visualization
  • Aburudha Bose: Improve Overall Python Infrastructure
  • Zhao Junwang: Over-the-wire Encryption Support
  • Oleh Prypin: Python 3 Support for Ceph

These five students represent the best of the almost 70 project submissions that we fielded from students around the world. For those not familiar with the Google Summer of Code program, this means that Google will generously fund these students during their summer work.

Thanks to everyone who applied this year; the selection process was made very challenging by the number of highly qualified applicants. We look forward to mentoring students to a successful summer of coding and Open Source, both this year and in the years to come.

v10.2.0 Jewel released

This major release of Ceph will be the foundation for the next long-term stable release. There have been many major changes since the Infernalis (9.2.x) and Hammer (0.94.x) releases, and the upgrade process is non-trivial. Please read these release notes carefully.

v0.87.2 Giant released

This is the second (and possibly final) point release for Giant.

We recommend all v0.87.x Giant users upgrade to this release.

NOTABLE CHANGES

  • ceph-objectstore-tool: only output unsupported features when incompatible (#11176 David Zafman)
  • common: do not implicitly unlock rwlock on destruction (Federico Simoncelli)
  • common: make wait timeout on empty queue configurable (#10818 Samuel Just)
  • crush: pick ruleset id that matches and rule id (Xiaoxi Chen)
  • crush: set_choose_tries = 100 for new erasure code rulesets (#10353 Loic Dachary)
  • librados: check initialized atomic safely (#9617 Josh Durgin)
  • librados: fix failed tick_event assert (#11183 Zhiqiang Wang)
  • librados: fix looping on skipped maps (#9986 Ding Dinghua)
  • librados: fix op submit with timeout (#10340 Samuel Just)
  • librados: pybind: fix memory leak (#10723 Billy Olsen)
  • librados: pybind: keep reference to callbacks (#10775 Josh Durgin)
  • librados: translate operation flags from C APIs (Matthew Richards)
  • libradosstriper: fix write_full on ENOENT (#10758 Sebastien Ponce)
  • libradosstriper: use strtoll instead of strtol (Dongmao Zhang)
  • mds: fix assertion caused by system time moving backwards (#11053 Yan, Zheng)
  • mon: allow injection of random delays on writes (Joao Eduardo Luis)
  • mon: do not trust small osd epoch cache values (#10787 Sage Weil)
  • mon: fail non-blocking flush if object is being scrubbed (#8011 Samuel Just)
  • mon: fix division by zero in stats dump (Joao Eduardo Luis)
  • mon: fix get_rule_avail when no osds (#10257 Joao Eduardo Luis)
  • mon: fix timeout rounds period (#10546 Joao Eduardo Luis)
  • mon: ignore osd failures before up_from (#10762 Dan van der Ster, Sage Weil)
  • mon: paxos: reset accept timeout before writing to store (#10220 Joao Eduardo Luis)
  • mon: return if fs exists on ‘fs new’ (Joao Eduardo Luis)
  • mon: use EntityName when expanding profiles (#10844 Joao Eduardo Luis)
  • mon: verify cross-service proposal preconditions (#10643 Joao Eduardo Luis)
  • mon: wait for osdmon to be writeable when requesting proposal (#9794 Joao Eduardo Luis)
  • mount.ceph: avoid spurious error message about /etc/mtab (#10351 Yan, Zheng)
  • msg/simple: allow RESETSESSION when we forget an endpoint (#10080 Greg Farnum)
  • msg/simple: discard delay queue before incoming queue (#9910 Sage Weil)
  • osd: clear_primary_state when leaving Primary (#10059 Samuel Just)
  • osd: do not ignore deleted pgs on startup (#10617 Sage Weil)
  • osd: fix FileJournal wrap to get header out first (#10883 David Zafman)
  • osd: fix PG leak in SnapTrimWQ (#10421 Kefu Chai)
  • osd: fix journalq population in do_read_entry (#6003 Samuel Just)
  • osd: fix operator== for op_queue_age_hit and fs_perf_stat (#10259 Samuel Just)
  • osd: fix rare assert after split (#10430 David Zafman)
  • osd: get pgid ancestor from last_map when building past intervals (#10430 David Zafman)
  • osd: include rollback_info_trimmed_to in {read,write}_log (#10157 Samuel Just)
  • osd: lock header_lock in DBObjectMap::sync (#9891 Samuel Just)
  • osd: requeue blocked op before flush it was blocked on (#10512 Sage Weil)
  • osd: tolerate missing object between list and attr get on backfill (#10150 Samuel Just)
  • osd: use correct atime for eviction decision (Xinze Chi)
  • rgw: flush XML header on get ACL request (#10106 Yehuda Sadeh)
  • rgw: index swift keys appropriately (#10471 Hemant Bruman, Yehuda Sadeh)
  • rgw: send cancel for bucket index pending ops (#10770 Baijiaruo, Yehuda Sadeh)
  • rgw: swift: support X_Remove_Container-Meta-{key} (#01475 Dmytro Iurchenko)

For more detailed information, see the complete changelog.

Ceph Loves Jumbo Frames

Who doesn’t love a high-performing Ceph storage cluster? To get one you need to tame it: not only does Ceph need tuning, the network does too. The quickest way to tune your network is to enable Jumbo Frames.

What are they?

  • They are Ethernet frames with a payload larger than the standard 1500-byte MTU
  • They can significantly improve network performance by making data transmission more efficient
  • They require Gigabit Ethernet
  • Most enterprise network devices support Jumbo Frames
  • Some people also call them ‘Giants’

Enabling Jumbo Frames

  • Make sure your switch port is configured to accept Jumbo Frames
  • On the server side, set your network interface MTU to 9000:

# ifconfig eth0 mtu 9000

  • Make the change permanent by updating the network interface file and restarting the network service:

# echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-eth0

  • Confirm the MTU used between two specific devices:

# ip route get {IP-address}
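
To check that jumbo frames actually work end to end, a quick test (just a sketch, assuming the eth0 interface used above) is to send a do-not-fragment ping sized for a 9000-byte MTU: 8972 bytes of ICMP payload plus 28 bytes of IP and ICMP headers add up to 9000.

# ping -M do -s 8972 -c 3 {IP-address}
# ip link show eth0 | grep mtu

If the ping fails or reports “Frag needed”, some hop on the path (switch port, bond or peer interface) is still running with a 1500-byte MTU; the second command confirms the MTU currently set on the local interface.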

In my production Ceph cluster, I have seen improvements after enabling Jumbo Frames on both the Ceph and the OpenStack nodes.

Stretching Ceph networks

This is a quick note about Ceph networks, so do not expect anything lengthy here :).

Usually Ceph networks are presented as a public network and a cluster (private) network.
However, it is rarely mentioned that you can also use a separate network for the monitors.
This might sound obvious to some people, but it is completely possible.
The only requirement, of course, is that this monitor network is reachable from all the Ceph nodes.

We can then easily imagine 4 VLANs:

  • Ceph monitor
  • Ceph public
  • Ceph cluster
  • Ceph heartbeat
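
As a rough sketch of what this can look like in ceph.conf (the subnets and the monitor name below are only examples, not a recommendation):

[global]
# client-facing traffic
public network = 10.1.0.0/24
# OSD replication and recovery traffic
cluster network = 10.2.0.0/24

# The monitor can sit on its own subnet, as long as every Ceph node
# and client can reach this address
[mon.a]
host = mon-a
mon addr = 10.0.0.11:6789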

I know this does not sound like much, but I’ve been asked this question so many times :).

Ceph Pool Migration

You have probably already been faced with migrating all the objects from one pool to another, especially to change parameters that cannot be modified on an existing pool. For example, to migrate from a replicated pool to an EC pool, to change the EC profile, or to reduce the number of PGs…
There are different methods, depending on the contents of the pool (RBD, objects), its size…

The simple way

The simplest and safest method is to copy all objects with the “rados cppool” command.
However, the pool needs to be read-only during the copy.

For example, to migrate to an EC pool:

pool=testpool
ceph osd pool create $pool.new 4096 4096 erasure default
rados cppool $pool $pool.new
ceph osd pool rename $pool $pool.old
ceph osd pool rename $pool.new $pool

But it does not work in all cases. For example, with EC pools: “error copying pool testpool => newpool: (95) Operation not supported”.
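
When the copy does succeed, it is worth a quick sanity check before renaming the pools. A minimal sketch, using the pool names from the example above, is to compare object counts:

# Object count of the original pool
rados -p $pool ls | wc -l

# Object count of the copy; the two numbers should match before you swap the names
rados -p $pool.new ls | wc -l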

Using Cache Tier

This must be used with caution; run tests before using it on a production cluster. It worked for my needs, but I cannot say that it works in all cases.

I find this method interesting because it allows a transparent operation, reduces downtime and avoids duplicating all the data. The principle is simple: use the cache tier, but in reverse order.

At the beginning, we have 2 pools: the current “testpool”, and the new one, “newpool”.

Setup cache tier

Configure the existing pool as a cache pool:

ceph osd tier add newpool testpool --force-nonempty
ceph osd tier cache-mode testpool forward

In ceph osd dump you should see something like this:

--> pool 58 'testpool' replicated size 3 .... tier_of 80 

Now all new objects will be created on the new pool.

We can then force all existing objects to move to the new pool:

rados -p testpool cache-flush-evict-all
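
Flushing and evicting a large pool can take some time. A simple way to follow progress (just a sketch) is to watch the per-pool statistics: the object count of the old pool should shrink while the new one grows:

# Per-pool usage and object counts, refreshed every 10 seconds
watch -n 10 'rados df'

# Or check how many objects are left in the old (cache) pool
rados -p testpool ls | wc -l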

Switch all clients to the new pool

(You can also do this step earlier, for example just after creating the cache tier.)
Until all the data has been flushed to the new pool, you need to set an overlay so that objects are also looked up on the old pool:

ceph osd tier set-overlay newpool testpool

In ceph osd dump you should see something like this:

--> pool 80 'newpool' replicated size 3 .... tiers 58 read_tier 58 write_tier 58

With the overlay, all operations will be forwarded to the old testpool.

Now you can switch all the clients to access objects on the new pool.

Finish

When all the data has been migrated, you can remove the overlay and the old “cache” pool:

ceph osd tier remove-overlay newpool
ceph osd tier remove newpool testpool
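
At this point the tier relationship should be gone, and once you are confident that everything is in newpool, the old pool can be removed. A sketch, assuming the pool names used above; double-check before running it, as pool deletion is irreversible:

# The pools should no longer show tier_of / read_tier / write_tier entries
ceph osd dump | grep -E "testpool|newpool"

# Then delete the now-empty old pool
ceph osd pool delete testpool testpool --yes-i-really-really-mean-it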

In-use object

During eviction you may see errors like this:

....
rb.0.59189e.2ae8944a.000000000001   
rb.0.59189e.2ae8944a.000000000023   
rb.0.59189e.2ae8944a.000000000006   
testrbd.rbd 
failed to evict testrbd.rbd: (16) Device or resource busy
rb.0.59189e.2ae8944a.000000000000   
rb.0.59189e.2ae8944a.000000000026   
...

Listing the watchers on the object can help:

rados -p testpool listwatchers testrbd.rbd
watcher=10.20.6.39:0/3318181122 client.5520194 cookie=1
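
The watcher output tells you which client still has the object open; for an RBD header object like this one, it usually means the image is still mapped or attached somewhere. A possible way forward (only a sketch, use with care) is to unmap the image on that client or, as a last resort, blacklist the stale client so its watch expires:

# On the client shown in the watcher output (the device name here is an example)
rbd unmap /dev/rbd0

# Last resort, on the cluster: blacklist the stale client address
# (this disrupts any other I/O from that client, use with caution)
ceph osd blacklist add 10.20.6.39:0/3318181122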

Using Rados Export/Import

For this, you need to use a temporary local directory.

rados export --create testpool tmp_dir
[exported]     rb.0.4975.2ae8944a.000000002391
[exported]     rb.0.4975.2ae8944a.000000004abc
[exported]     rb.0.4975.2ae8944a.0000000018ce
...

rados import tmp_dir newpool

# Stop All IO
# And redo a sync of modified objects

rados export --workers 5 testpool tmp_dir
rados import --workers 5 tmp_dir newpool
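
Once the final import is done and I/O is still stopped, you can swap the pools by renaming them so that clients keep using the same pool name, and then clean up the temporary directory. A sketch, following the same pattern as the cppool example above:

ceph osd pool rename testpool testpool.old
ceph osd pool rename newpool testpool

# Remove the temporary export directory once you no longer need it
rm -rf tmp_dir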

v0.94.1 Hammer released

This bug fix release fixes a few critical issues with CRUSH. The most important addresses a bug in feature bit enforcement that may prevent pre-hammer clients from communicating with the cluster during an upgrade. This only manifests in some cases (for example, when the ‘rack’ type is in use in the CRUSH map, and possibly other cases), but for safety we strongly recommend that all users use 0.94.1 instead of 0.94 when upgrading.

There is also a fix in the new straw2 buckets when OSD weights are 0.

We recommend that all v0.94 users upgrade.

NOTABLE CHANGES

  • crush: fix divide-by-0 in straw2 (#11357 Sage Weil)
  • crush: fix has_v4_buckets (#11364 Sage Weil)
  • osd: fix negative degraded objects during backfilling (#7737 Guang Yang)

For more detailed information, see the complete changelog.

 


© 2016, Red Hat, Inc. All rights reserved.