
When Ceph was originally designed a decade ago, the concept was that “intelligent” disk drives with some modest processing capability could store objects instead of blocks and take an active role in replicating, migrating, or repairing data within the system.  In contrast to conventional disk drives, a smart object-based drive could coordinate with other drives in the system in a peer-to-peer fashion to build a more scalable storage system.

Today an Ethernet-attached hard disk drive from WDLabs is making this architecture a reality. WDLabs has taken over 500 drives from the early production line and assembled them into a 4 PB (3.6 PiB) Ceph cluster running Jewel and the prototype BlueStore storage backend. WDLabs has been working on validating the need for an open source compute environment within the storage device and is now beginning to understand the use cases as thought leaders such as Red Hat work with the early units. This test seeks to demonstrate that the second-generation converged microserver has become a viable solution for distributed storage use cases like Ceph. Building an open platform that can run open source software is a key underpinning of the concept.

The collaboration between WDLabs, Red Hat, and SuperMicro on the large scale 4 PB cluster will help drive further learning and improvements to this new and potentially disruptive product.  By allowing storage services to run directly on the drive, an entire tier of conventional servers can be eliminated, simplifying the overall stack up through the top-of-rack switch, and paying dividends through space efficiency gains and component and power cost reductions.

The ARM community has contributed quite a bit of code to Ceph over the past two years to make it run well on the previous-generation 32-bit and new 64-bit architectures.  Our work is by no means complete, but we think the results are quite encouraging!

The WDLabs Converged Microserver He8

The Converged Microserver He8 is a microcomputer built on the existing production Ultrastar® He8 platform. The host used in the Ceph cluster is a dual-core ARM Cortex-A9 processor running at 1.3 GHz with 1 GB of memory, soldered directly onto the drive’s PCB (pictured below). Options include 2 GB of memory and ECC protection. It also includes the ARM NEON coprocessor to help with erasure code computations, as well as XOR and crypto engines.

[Photo: the Converged Microserver He8 PCB]

The drive PCB includes the standard disk controller hardware, as well as an additional ARM SoC running Debian Jessie (and Ceph). The connector passes Ethernet instead of SATA.

The interface is dual 1 GbE SGMII ports with the ability to reach 2.5 GbE in a compatible chassis.  The physical connector is identical to existing SAS/SATA devices, but with a new pinout that is being standardized and adopted by other drive manufacturers. Disk shelves are already available from multiple chassis vendors.

[Photo: top and bottom of the 8 TB drive]

The drive has a standard 3.5″ form factor.

The default operating system is Debian Linux 8.0 (Jessie) and PXE boot is supported.

504 OSD test cluster and test setup

The 4 PB 504 node Converged Microserver He8 cluster is anchored by 42 SuperMicro 1048-RT chassis that feed 10 GbE each to the top-of-rack public network through 3 SuperMicro SSE-X3348T switches. Another identical switch interconnects the private back end network for internal cluster related tasks.  Each drive has two 1Gbps interfaces, one for the public network, one for the cluster (replication) network.  The monitor (just one in this test) is running on a conventional server.

[Photo: cluster from the front, 25 1U SuperMicro enclosures]

Clients have been installed to apply workloads to the system but so far have not been able to fully flood the cluster with traffic.  There are 18 x86 machines, each with 10 Gbps interfaces.  The workload is generated by ‘rados bench,’ with the default write size (4 MB) and 128-196 threads per client, running for 5 minutes for each data point.
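
For reference, each data point was driven by an invocation along these lines; a minimal sketch, where the pool name and the exact thread count are illustrative rather than taken from the test harness:

# 5-minute write pass: 4 MB objects (the default size), 128 concurrent ops, objects kept for the read pass
rados -p testbench bench 300 write -b 4194304 -t 128 --no-cleanup

# matching sequential-read pass over the objects written above
rados -p testbench bench 300 seq -t 128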

The cluster is running an early build of Ceph Jewel (v10.1.0), one ceph-osd per drive, using the new experimental BlueStore backend to more efficiently utilize the raw disk (/dev/sda4). The configuration is relatively straightforward, although we did some minor tuning to reduce the memory consumption on the devices:

osd map cache size = 40
osd map max advance = 32
osd map message max = 32
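
For context, these options live in the [osd] section of ceph.conf. A minimal sketch is shown below; the osd objectstore line is our reading of how the experimental BlueStore backend is selected in this Jewel preview (it also requires explicitly opting in to experimental features, which is not shown here):

[osd]
# trim the OSDMap caches to keep memory usage low on the 1 GB microservers
osd map cache size = 40
osd map max advance = 32
osd map message max = 32
# experimental BlueStore backend consuming the raw partition directly
osd objectstore = bluestore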

The ceph status output looks like this:

    cluster 4f095735-f4b2-44c7-b318-566fc6f1d47c
     health HEALTH_OK
     monmap e1: 1 mons at {mon0=192.168.102.249:6789/0}
            election epoch 3, quorum 0 mon0
     osdmap e276: 504 osds: 504 up, 504 in
            flags sortbitwise
      pgmap v3799: 114752 pgs, 4 pools, 2677 GB data, 669 kobjects
            150 TB used, 3463 TB / 3621 TB avail
              114752 active+clean
[Photo: the whole cluster]

Assembling and testing the cluster

 

Performance

The first set of tests shows the total read and write bandwidth of the cluster, as seen by ‘rados bench’.  Currently only 180 Gbps of client bandwidth is attached, which is why throughput scaling starts to level off around 180 nodes.

[Figure: write bandwidth scaling]

If we look at the normalized write bandwidth per OSD, we are getting decent (although not breathtaking) throughput on each drive.

[Figure: write bandwidth per drive]

The read performance is very similar:

[Figure: read bandwidth per drive]

Scaling up the client workload but keeping a small set of 20 OSDs in the cluster shows us that each drive caps out around 90 MB/sec:

[Figure: per-drive write saturation]

 

We also did a few experiments using erasure coding instead of replication.  We expect that capacity- and density-optimized clusters will be a compelling use case for these drives.  The following plot shows the total write bandwidth over the cluster with various erasure codes.  The throughput is lower than for pure replication, but already sufficient for many use cases.  There is still a lot of idle CPU time on the drives under load, so until we improve the network and client capabilities we won’t know where the bottleneck in the system is.

[Figure: erasure-coded write bandwidth]

Replicated writes are a bit faster:

[Figure: replicated write bandwidth]
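
For reference, the erasure-coded pools in these tests are built from an erasure code profile; a minimal sketch follows, where the profile name, k/m values, and PG count are illustrative rather than the exact parameters used here:

# define a profile with 4 data chunks and 2 coding chunks
ceph osd erasure-code-profile set wd_k4m2 k=4 m=2

# create an erasure-coded pool using that profile
ceph osd pool create ecbench 4096 4096 erasure wd_k4m2

# and a replicated pool of the same size for comparison
ceph osd pool create repbench 4096 4096 replicated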

What next

The Converged Microserver He8 and compatible chassis are now available in limited volumes through the WDLabs store by contacting a WDLabs representative. WDLabs is now looking at the next-generation solution based on early customer adoption and feedback, and is partnering with key suppliers and customers who can help evolve the product.

The Ceph development community is excited to see Ceph running on different hardware platforms and architectures, and we enjoyed working with the WDLabs team to demonstrate the viability of this deployment model. There are many opportunities going forward to optimize the performance and behavior of Ceph on low-power, converged platforms, and we look forward to further collaboration with the hardware community to prove out next generation scale-out storage architectures.

The Ceph project would like to congratulate the following students on their acceptance to the 2016 Google Summer of Code program and to the Ceph project:

  • Shehbaz Jaffer: BlueStore
  • Victor Araujo: End-to-end Performance Visualization
  • Aburudha Bose: Improve Overall Python Infrastructure
  • Zhao Junwang: Over-the-wire Encryption Support
  • Oleh Prypin: Python 3 Support for Ceph

These five students represent the best of the almost 70 project submissions that we fielded from students around the world. For those not familiar with the Google Summer of Code program, this means that Google will generously fund these students during their summer work.

Thanks to everyone who applied this year; the selection process was made very challenging by the number of highly qualified applicants. We look forward to mentoring students to a successful summer of coding and Open Source, both this year and in the years to come.

v10.2.0 Jewel released

This major release of Ceph will be the foundation for the next long-term stable release. There have been many major changes since the Infernalis (9.2.x) and Hammer (0.94.x) releases, and the upgrade process is non-trivial. Please read these release notes carefully.


v0.87.2 Giant released

This is the second (and possibly final) point release for Giant.

We recommend all v0.87.x Giant users upgrade to this release.

NOTABLE CHANGES

  • ceph-objectstore-tool: only output unsupported features when incompatible (#11176 David Zafman)
  • common: do not implicitly unlock rwlock on destruction (Federico Simoncelli)
  • common: make wait timeout on empty queue configurable (#10818 Samuel Just)
  • crush: pick ruleset id that matches and rule id (Xiaoxi Chen)
  • crush: set_choose_tries = 100 for new erasure code rulesets (#10353 Loic Dachary)
  • librados: check initialized atomic safely (#9617 Josh Durgin)
  • librados: fix failed tick_event assert (#11183 Zhiqiang Wang)
  • librados: fix looping on skipped maps (#9986 Ding Dinghua)
  • librados: fix op submit with timeout (#10340 Samuel Just)
  • librados: pybind: fix memory leak (#10723 Billy Olsen)
  • librados: pybind: keep reference to callbacks (#10775 Josh Durgin)
  • librados: translate operation flags from C APIs (Matthew Richards)
  • libradosstriper: fix write_full on ENOENT (#10758 Sebastien Ponce)
  • libradosstriper: use strtoll instead of strtol (Dongmao Zhang)
  • mds: fix assertion caused by system time moving backwards (#11053 Yan, Zheng)
  • mon: allow injection of random delays on writes (Joao Eduardo Luis)
  • mon: do not trust small osd epoch cache values (#10787 Sage Weil)
  • mon: fail non-blocking flush if object is being scrubbed (#8011 Samuel Just)
  • mon: fix division by zero in stats dump (Joao Eduardo Luis)
  • mon: fix get_rule_avail when no osds (#10257 Joao Eduardo Luis)
  • mon: fix timeout rounds period (#10546 Joao Eduardo Luis)
  • mon: ignore osd failures before up_from (#10762 Dan van der Ster, Sage Weil)
  • mon: paxos: reset accept timeout before writing to store (#10220 Joao Eduardo Luis)
  • mon: return if fs exists on ‘fs new’ (Joao Eduardo Luis)
  • mon: use EntityName when expanding profiles (#10844 Joao Eduardo Luis)
  • mon: verify cross-service proposal preconditions (#10643 Joao Eduardo Luis)
  • mon: wait for osdmon to be writeable when requesting proposal (#9794 Joao Eduardo Luis)
  • mount.ceph: avoid spurious error message about /etc/mtab (#10351 Yan, Zheng)
  • msg/simple: allow RESETSESSION when we forget an endpoint (#10080 Greg Farnum)
  • msg/simple: discard delay queue before incoming queue (#9910 Sage Weil)
  • osd: clear_primary_state when leaving Primary (#10059 Samuel Just)
  • osd: do not ignore deleted pgs on startup (#10617 Sage Weil)
  • osd: fix FileJournal wrap to get header out first (#10883 David Zafman)
  • osd: fix PG leak in SnapTrimWQ (#10421 Kefu Chai)
  • osd: fix journalq population in do_read_entry (#6003 Samuel Just)
  • osd: fix operator== for op_queue_age_hit and fs_perf_stat (#10259 Samuel Just)
  • osd: fix rare assert after split (#10430 David Zafman)
  • osd: get pgid ancestor from last_map when building past intervals (#10430 David Zafman)
  • osd: include rollback_info_trimmed_to in {read,write}_log (#10157 Samuel Just)
  • osd: lock header_lock in DBObjectMap::sync (#9891 Samuel Just)
  • osd: requeue blocked op before flush it was blocked on (#10512 Sage Weil)
  • osd: tolerate missing object between list and attr get on backfill (#10150 Samuel Just)
  • osd: use correct atime for eviction decision (Xinze Chi)
  • rgw: flush XML header on get ACL request (#10106 Yehuda Sadeh)
  • rgw: index swift keys appropriately (#10471 Hemant Bruman, Yehuda Sadeh)
  • rgw: send cancel for bucket index pending ops (#10770 Baijiaruo, Yehuda Sadeh)
  • rgw: swift: support X_Remove_Container-Meta-{key} (#01475 Dmytro Iurchenko)

For more detailed information, see the complete changelog.


Ceph Loves Jumbo Frames

Who doesn’t love a high-performing Ceph storage cluster? To get one you need to tame it: not only does Ceph need tuning, but the network needs tuning too. The quickest way to tune your network is to enable Jumbo Frames.

What are they?

  • They are Ethernet frames with a payload larger than the standard 1500-byte MTU.
  • They can significantly improve network performance by making data transmission more efficient.
  • They require Gigabit Ethernet or faster.
  • Most enterprise network devices support Jumbo Frames.
  • Some people also call them ‘Giants’.

Enabling Jumbo Frames

  • Make sure your switch port is configured to accept Jumbo Frames.
  • On the server side, set your network interface MTU to 9000:

# ifconfig eth0 mtu 9000

  • Make the change permanent by updating the network interface config file and restarting network services:

# echo "MTU=9000" >> /etc/sysconfig/network-scripts/ifcfg-eth0

  • Confirm the MTU in use between two specific devices:

# ip route get {IP-address}
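
As an extra check (not part of the original list), you can send a ping with the don’t-fragment flag and a payload sized for a 9000-byte MTU; 8972 data bytes plus 28 bytes of ICMP/IP headers add up to a 9000-byte packet, so the ping fails if jumbo frames are not working end to end:

# ping -M do -s 8972 {IP-address}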

In my production Ceph cluster, I have seen improvements after enabling Jumbo Frames on both the Ceph and the OpenStack nodes.

Stretching Ceph networks

This is a quick note about Ceph networks, so do not expect anything lengthy here :).

Usually Ceph networks are presented as the public network and the private cluster network.
However it is never mentioned that you can use a separate network for the monitors.
This might sound obvious for some people but it is completely possible.
The only requirement of course is to have this monitor network accessible from all the Ceph nodes.

We can then easily imagine 4 VLANs, as sketched below:

  • Ceph monitor
  • Ceph public
  • Ceph cluster
  • Ceph heartbeat
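
For illustration, here is a minimal ceph.conf sketch with made-up subnets and hostnames: the monitor address sits on its own VLAN, separate from the public and cluster networks (heartbeat traffic normally rides the public and cluster interfaces, so no dedicated option is shown for it):

[global]
# client-facing traffic
public network = 10.10.1.0/24
# replication and recovery traffic
cluster network = 10.10.2.0/24

[mon.mon0]
host = mon0
# monitor VLAN, reachable from all Ceph nodes
mon addr = 10.10.0.1:6789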

I know this does not sound much, but I’ve been hearing this question so many times :).

Ceph Pool Migration

You have probably already been faced with migrating all objects from one pool to another, especially to change parameters that cannot be modified on an existing pool: for example, to migrate from a replicated pool to an EC pool, to change the EC profile, or to reduce the number of PGs…
There are different methods, depending on the contents of the pool (RBD, objects), size…

The simple way

The simplest and safest method is to copy all objects with the “rados cppool” command.
However, the pool needs to be read-only during the copy.

For example, to migrate to an EC pool:

pool=testpool
ceph osd pool create $pool.new 4096 4096 erasure default
rados cppool $pool $pool.new
ceph osd pool rename $pool $pool.old
ceph osd pool rename $pool.new $pool

But it does not work in all cases. For example, with EC pools: “error copying pool testpool => newpool: (95) Operation not supported”.

Using Cache Tier

This must be used with caution; test it before using it on a production cluster. It worked for my needs, but I cannot say that it works in all cases.

I find this method interesting because it allows a transparent operation, reduces downtime, and avoids duplicating all the data. The principle is simple: use cache tiering, but in reverse order.

At the beginning, we have two pools: the current “testpool”, and the new one, “newpool”.

Setup cache tier

Configure the existing pool as the cache pool:

ceph osd tier add newpool testpool --force-nonempty
ceph osd tier cache-mode testpool forward

In ceph osd dump you should see something like this:

--> pool 58 'testpool' replicated size 3 .... tier_of 80 

Now, all new objects will be created in the new pool.

We can then force all existing objects to be flushed to the new pool:

rados -p testpool cache-flush-evict-all
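
To follow the progress of the eviction, one option (not from the original post) is to watch the object count of the old pool drop with the usual reporting commands:

# objects remaining per pool; “testpool” should trend toward zero
rados df
ceph df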

Switch all clients to the new pool

(You can also do this step earlier. For example, just after the cache pool creation.)
Until all the data has been flushed to the new pool, you need to set an overlay so that objects are still looked up in the old pool:

ceph osd tier set-overlay newpool testpool

In ceph osd dump you should see something like this:

--> pool 80 'newpool' replicated size 3 .... tiers 58 read_tier 58 write_tier 58

With the overlay, all operations will be forwarded to the old testpool.

Now you can switch all the clients to access objects on the new pool.

Finish

When all data has been migrated, you can remove the overlay and the old “cache” pool:

ceph osd tier remove-overlay newpool
ceph osd tier remove newpool testpool

In-use object

During eviction you may see some errors:

....
rb.0.59189e.2ae8944a.000000000001   
rb.0.59189e.2ae8944a.000000000023   
rb.0.59189e.2ae8944a.000000000006   
testrbd.rbd 
failed to evict testrbd.rbd: (16) Device or resource busy
rb.0.59189e.2ae8944a.000000000000   
rb.0.59189e.2ae8944a.000000000026   
...

Listing the watchers on the object can help:

rados -p testpool listwatchers testrbd.rbd
watcher=10.20.6.39:0/3318181122 client.5520194 cookie=1

Using Rados Export/Import

For this, you need to use a temporary local directory.

rados export --create testpool tmp_dir
[exported]     rb.0.4975.2ae8944a.000000002391
[exported]     rb.0.4975.2ae8944a.000000004abc
[exported]     rb.0.4975.2ae8944a.0000000018ce
...

rados import tmp_dir newpool

# Stop All IO
# And redo a sync of modified objects

rados export --workers 5 testpool tmp_dir
rados import --workers 5 tmp_dir newpool

v0.94.1 Hammer released

This bug fix release fixes a few critical issues with CRUSH. The most important addresses a bug in feature bit enforcement that may prevent pre-hammer clients from communicating with the cluster during an upgrade. This only manifests in some cases (for example, when the ‘rack’ type is in use in the CRUSH map, and possibly other cases), but for safety we strongly recommend that all users use 0.94.1 instead of 0.94 when upgrading.

There is also a fix in the new straw2 buckets when OSD weights are 0.

We recommend that all v0.94 users upgrade.

NOTABLE CHANGES

  • crush: fix divide-by-0 in straw2 (#11357 Sage Weil)
  • crush: fix has_v4_buckets (#11364 Sage Weil)
  • osd: fix negative degraded objects during backfilling (#7737 Guang Yang)

For more detailed information, see the complete changelog.

 

