{% img center http://sebastien-han.fr/images/openstack-summit-vancouver.jpg OpenStack Summit Vancouver: thanks for your votes %}

Bonjour, bonjour !
Quick post to let you know that my talk submission has been accepted, so I’d like to thank you all for voting.
As a reminder, our talk (Josh Durgin and I) is scheduled for Tuesday, May 19, from 11:15am to 11:55am.

Also note that the summit has other Ceph talks!

See you in Vancouver!

The schedule for the upcoming OpenStack Summit 2015 in Vancouver is finally available. Sage and I submitted a presentation about “Storage security in a critical enterprise OpenStack environment”. The submission was accepted and the talk is scheduled for Monday, May 18th, 15:40 – 16:20.

There are also some other talks related to Ceph available:
Check out the links or the schedule for dates and times of the talks.

See you in Vancouver!

Calculating the storage overhead of a replicated pool in Ceph is easy.
You divide the amount of space you have by the “size” (number of replicas) parameter of your storage pool.

Let’s work with some rough numbers: 64 OSDs of 4TB each.

Raw size: 64 * 4  = 256TB
Size 2  : 256 / 2 = 128TB
Size 3  : 256 / 3 = 85.33TB

Replicated pools are expensive in terms of overhead: Size 2 provides the same resilience and overhead as RAID-1.
Size 3 provides more resilience than RAID-1, at the cost of even more overhead.

Explaining what Erasure coding is about gets complicated quickly.

I like to compare replicated pools to RAID-1 and Erasure coded pools to RAID-5 (or RAID-6) in the sense that there are data chunks and recovery/parity/coding chunks.

What’s appealing with erasure coding is that it can provide the same (or better) resiliency than replicated pools but with less storage overhead – at the cost of the computing it requires.

Ceph has had erasure coding support for a good while already and interesting documentation is available:

The thing with erasure coded pools, though, is that you’ll need a cache tier in front of them to be able to use them in most cases.

This makes for a perfect synergy of slower/larger/less expensive drives for your erasure coded pool and faster, more expensive drives in front as your cache tier.
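
As a rough sketch of what that can look like (the pool names, PG counts and the k=4/m=2 profile below are only placeholders to adapt to your cluster, and a real cache tier also needs sizing parameters such as target_max_bytes):

# define an erasure code profile and create an erasure coded pool with it
ceph osd erasure-code-profile set myprofile k=4 m=2
ceph osd pool create ecpool 128 128 erasure myprofile

# create a replicated pool and configure it as a writeback cache tier in front
ceph osd pool create cachepool 128 128
ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool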

To calculate the overhead of an erasure coded pool, you need to know the ‘k’ and ‘m’ values of your erasure code profile.

chunk

  When the encoding function is called, it returns chunks of the same size. Data chunks which can be concatenated to reconstruct the original object and coding chunks which can be used to rebuild a lost chunk.

K

  The number of data chunks, i.e. the number of chunks in which the original object is divided. For instance if K = 2 a 10KB object will be divided into K objects of 5KB each.

M

  The number of coding chunks, i.e. the number of additional chunks computed by the encoding functions. If there are 2 coding chunks, it means 2 OSDs can be out without losing data.

The formula to calculate the overhead is:

nOSD * k / (k+m) * OSD Size
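
For example, with the same 64 OSDs of 4TB and a k=4, m=2 profile:

64 * 4 / (4 + 2) * 4 = 170.67TB

That is two-thirds of the raw capacity usable while tolerating the loss of two OSDs, versus one-third usable for a size 3 replicated pool.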

Finally, let’s look at a range of erasure code profile configurations for the same 64 OSDs of 4 TB, with k from 1 to 10 (rows) and m from 1 to 4 (columns); the values are the usable capacity in TB:

| k \ m | m=1    | m=2    | m=3    | m=4    |
|-------|--------|--------|--------|--------|
| 1   | 128.00 | 85.33  | 64.00  | 51.20  |
| 2   | 170.67 | 128.00 | 102.40 | 85.33  |
| 3   | 192.00 | 153.60 | 128.00 | 109.71 |
| 4   | 204.80 | 170.67 | 146.29 | 128.00 |
| 5   | 213.33 | 182.86 | 160.00 | 142.22 |
| 6   | 219.43 | 192.00 | 170.67 | 153.60 |
| 7   | 224.00 | 199.11 | 179.20 | 162.91 |
| 8   | 227.56 | 204.80 | 186.18 | 170.67 |
| 9   | 230.40 | 209.45 | 192.00 | 177.23 |
| 10  | 232.73 | 213.33 | 196.92 | 182.86 |
| Raw | 256    | 256    | 256    | 256    |

RadosGW: Simple Replication Example

This is a simple example of a federated gateways configuration to set up asynchronous replication between two Ceph clusters.

(This configuration is based on the Ceph documentation: http://ceph.com/docs/master/radosgw/federated-config/)

Here I use only one region (“default”) and two zones (“main” and “fallback”), one for each cluster.

Note that in this example, I use 3 placement targets (default, hot, cold) that correspond respectively to the pools .main.rgw.buckets, .main.rgw.hot.buckets and .main.rgw.cold.buckets.
Be careful to replace the tags {MAIN_USER_ACCESS}, {MAIN_USER_SECRET}, {FALLBACK_USER_ACESS} and {FALLBACK_USER_SECRET} with the corresponding values.

First I created the region and zone files, which will be required on both clusters:

The region file “region.conf.json”:

{ "name": "default",
  "api_name": "default",
  "is_master": "true",
  "endpoints": [
        "http:\/\/s3.mydomain.com:80\/"],
  "master_zone": "main",
  "zones": [
        { "name": "main",
          "endpoints": [
                "http:\/\/s3.mydomain.com:80\/"],
          "log_meta": "true",
          "log_data": "true"},
        { "name": "fallback",
          "endpoints": [
                "http:\/\/s3-fallback.mydomain.com:80\/"],
          "log_meta": "true",
          "log_data": "true"}],
  "placement_targets": [
        { "name": "default-placement",
          "tags": []},
        { "name": "cold-placement",
          "tags": []},
        { "name": "hot-placement",
          "tags": []}],
  "default_placement": "default-placement"}

The zone file for the main zone, “zone-main.conf.json”:

{ "domain_root": ".main.domain.rgw",
  "control_pool": ".main.rgw.control",
  "gc_pool": ".main.rgw.gc",
  "log_pool": ".main.log",
  "intent_log_pool": ".main.intent-log",
  "usage_log_pool": ".main.usage",
  "user_keys_pool": ".main.users",
  "user_email_pool": ".main.users.email",
  "user_swift_pool": ".main.users.swift",
  "user_uid_pool": ".main.users.uid",
  "system_key": {
      "access_key": "{MAIN_USER_ACCESS}",
      "secret_key": "{MAIN_USER_SECRET}"},
  "placement_pools": [
        { "key": "default-placement",
          "val": { "index_pool": ".main.rgw.buckets.index",
              "data_pool": ".main.rgw.buckets",
              "data_extra_pool": ".main.rgw.buckets.extra"}},
        { "key": "cold-placement",
          "val": { "index_pool": ".main.rgw.buckets.index",
              "data_pool": ".main.rgw.cold.buckets",
              "data_extra_pool": ".main.rgw.buckets.extra"}},
        { "key": "hot-placement",
          "val": { "index_pool": ".main.rgw.buckets.index",
              "data_pool": ".main.rgw.hot.buckets",
              "data_extra_pool": ".main.rgw.buckets.extra"}}]}

And the zone file for the fallback zone, “zone-fallback.conf.json”:

{ "domain_root": ".fallback.domain.rgw",
  "control_pool": ".fallback.rgw.control",
  "gc_pool": ".fallback.rgw.gc",
  "log_pool": ".fallback.log",
  "intent_log_pool": ".fallback.intent-log",
  "usage_log_pool": ".fallback.usage",
  "user_keys_pool": ".fallback.users",
  "user_email_pool": ".fallback.users.email",
  "user_swift_pool": ".fallback.users.swift",
  "user_uid_pool": ".fallback.users.uid",
  "system_key": {
    "access_key": "{FALLBACK_USER_ACESS}",
    "secret_key": "{FALLBACK_USER_SECRET}"
         },
  "placement_pools": [
        { "key": "default-placement",
          "val": { "index_pool": ".fallback.rgw.buckets.index",
              "data_pool": ".fallback.rgw.buckets",
              "data_extra_pool": ".fallback.rgw.buckets.extra"}},
        { "key": "cold-placement",
          "val": { "index_pool": ".fallback.rgw.buckets.index",
              "data_pool": ".fallback.rgw.cold.buckets",
              "data_extra_pool": ".fallback.rgw.buckets.extra"}},
        { "key": "hot-placement",
          "val": { "index_pool": ".fallback.rgw.buckets.index",
              "data_pool": ".fallback.rgw.hot.buckets",
              "data_extra_pool": ".fallback.rgw.buckets.extra"}}]}

On the first cluster (MAIN)

I created the pools:

ceph osd pool create .rgw.root 16 16
ceph osd pool create .main.rgw.root 16 16
ceph osd pool create .main.domain.rgw 16 16
ceph osd pool create .main.rgw.control 16 16
ceph osd pool create .main.rgw.gc 16 16
ceph osd pool create .main.rgw.buckets 512 512
ceph osd pool create .main.rgw.hot.buckets 512 512
ceph osd pool create .main.rgw.cold.buckets 512 512
ceph osd pool create .main.rgw.buckets.index 32 32
ceph osd pool create .main.rgw.buckets.extra 16 16
ceph osd pool create .main.log 16 16
ceph osd pool create .main.intent-log 16 16
ceph osd pool create .main.usage 16 16
ceph osd pool create .main.users 16 16
ceph osd pool create .main.users.email 16 16
ceph osd pool create .main.users.swift 16 16
ceph osd pool create .main.users.uid 16 16

I configured the region and zones, and added the system users:

  radosgw-admin region set --name client.radosgw.main < region.conf.json
  radosgw-admin zone set --rgw-zone=main --name client.radosgw.main < zone-main.conf.json
  radosgw-admin zone set --rgw-zone=fallback --name client.radosgw.main < zone-fallback.conf.json
  radosgw-admin regionmap update --name client.radosgw.main

  radosgw-admin user create --uid="main" --display-name="Zone main" --name client.radosgw.main --system --access-key={MAIN_USER_ACCESS} --secret={MAIN_USER_SECRET}
  radosgw-admin user create --uid="fallback" --display-name="Zone fallback" --name client.radosgw.main --system --access-key={FALLBACK_USER_ACESS} --secret={FALLBACK_USER_SECRET}

Set up the RadosGW config in ceph.conf on cluster MAIN:

  [client.radosgw.main]
  host = ceph-main-radosgw-01
  rgw region = default
  rgw region root pool = .rgw.root
  rgw zone = main
  rgw zone root pool = .main.rgw.root
  rgw frontends = "civetweb port=80"
  rgw dns name = s3.mydomain.com
  keyring = /etc/ceph/ceph.client.radosgw.keyring
  rgw_socket_path = /var/run/ceph/radosgw.sock

I needed to create a keyring for [client.radosgw.main] in /etc/ceph/ceph.client.radosgw.keyring; see the documentation.
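
For reference, this roughly follows the standard RadosGW keyring setup, something like the commands below (adjust the capabilities to your needs); the same steps, with client.radosgw.fallback, apply on the FALLBACK cluster:

ceph-authtool --create-keyring /etc/ceph/ceph.client.radosgw.keyring
chmod +r /etc/ceph/ceph.client.radosgw.keyring
ceph-authtool /etc/ceph/ceph.client.radosgw.keyring -n client.radosgw.main --gen-key
ceph-authtool -n client.radosgw.main --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.client.radosgw.keyring
ceph auth add client.radosgw.main -i /etc/ceph/ceph.client.radosgw.keyring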

Then, start/restart radosgw for cluster MAIN.
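
The exact command depends on how radosgw is packaged on your distribution; typically one of:

/etc/init.d/radosgw start        # Debian/Ubuntu packages
/etc/init.d/ceph-radosgw start   # RHEL/CentOS packages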

On the other Ceph cluster (FALLBACK)

I created the pools:

ceph osd pool create .rgw.root 16 16
ceph osd pool create .fallback.rgw.root 16 16
ceph osd pool create .fallback.domain.rgw 16 16
ceph osd pool create .fallback.rgw.control 16 16
ceph osd pool create .fallback.rgw.gc 16 16
ceph osd pool create .fallback.rgw.buckets 512 512
ceph osd pool create .fallback.rgw.hot.buckets 512 512
ceph osd pool create .fallback.rgw.cold.buckets 512 512
ceph osd pool create .fallback.rgw.buckets.index 32 32
ceph osd pool create .fallback.rgw.buckets.extra 16 16
ceph osd pool create .fallback.log 16 16
ceph osd pool create .fallback.intent-log 16 16
ceph osd pool create .fallback.usage 16 16
ceph osd pool create .fallback.users 16 16
ceph osd pool create .fallback.users.email 16 16
ceph osd pool create .fallback.users.swift 16 16
ceph osd pool create .fallback.users.uid 16 16

I configured the region and zones, and added the system users:

radosgw-admin region set --name client.radosgw.fallback < region.conf.json
radosgw-admin zone set --rgw-zone=fallback --name client.radosgw.fallback < zone-fallback.conf.json
radosgw-admin zone set --rgw-zone=main --name client.radosgw.fallback < zone-main.conf.json
radosgw-admin regionmap update --name client.radosgw.fallback

radosgw-admin user create --uid="fallback" --display-name="Zone fallback" --name client.radosgw.fallback --system --access-key={FALLBACK_USER_ACESS} --secret={FALLBACK_USER_SECRET}
radosgw-admin user create --uid="main" --display-name="Zone main" --name client.radosgw.fallback --system --access-key={MAIN_USER_ACCESS} --secret={MAIN_USER_SECRET}

Set up the RadosGW config in ceph.conf on cluster FALLBACK:

[client.radosgw.fallback]
host = ceph-fallback-radosgw-01
rgw region = default
rgw region root pool = .rgw.root
rgw zone = fallback
rgw zone root pool = .fallback.rgw.root
rgw frontends = "civetweb port=80"
rgw dns name = s3-fallback.mydomain.com
keyring = /etc/ceph/ceph.client.radosgw.keyring
rgw_socket_path = /var/run/ceph/radosgw.sock

Also, I needed to create a keyring for [client.radosgw.fallback] in /etc/ceph/ceph.client.radosgw.keyring (same steps as on MAIN) and start radosgw for cluster FALLBACK.

Finally, set up the RadosGW agent

/etc/ceph/radosgw-agent/default.conf:

src_zone: main
source: http://s3.mydomain.com:80
src_access_key: {MAIN_USER_ACCESS}
src_secret_key: {MAIN_USER_SECRET}
dest_zone: fallback
destination: http://s3-fallback.mydomain.com:80
dest_access_key: {FALLBACK_USER_ACESS}
dest_secret_key: {FALLBACK_USER_SECRET}
log_file: /var/log/radosgw/radosgw-sync.log

Then start the agent:

/etc/init.d/radosgw-agent start

After that, there is still a little suspense…
Then I try to create a bucket with some data on s3.mydomain.com and verify that it is properly synchronized.

For debugging, you can enable logs on the RadosGW on each side, and start the agent in verbose mode with radosgw-agent -v -c /etc/ceph/radosgw-agent/default.conf
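
To get those gateway logs, the usual knobs are the debug settings in the [client.radosgw.main] (and fallback) section of ceph.conf, for example (restart radosgw afterwards); the values below are just an illustration:

debug rgw = 20
debug ms = 1
log file = /var/log/ceph/client.radosgw.main.log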

These steps work for me, though getting the setup right is not always obvious: whenever I set up a sync it rarely works on the first try, but it always ends up running.

I’ve just drafted a new release of python-cephclient
on PyPi: v0.1.0.5.

After learning about the ceph-rest-api I just had
to do something fun with it.

In fact, it’s going to become very handy for me as I might start to develop
with it for things like nagios monitoring scripts.

The changelog:

dmsimard:

  • Add missing dependency on the requests library
  • Some PEP8 and code standardization cleanup
  • Add root “PUT” methods
  • Add mon “PUT” methods
  • Add mds “PUT” methods
  • Add auth “PUT” methods

Donald Talton:

  • Add osd “PUT” methods

Please try it out and let me know if you have any feedback!

Pull requests are welcome :)

v0.80.9 Firefly released

This is a bugfix release for firefly. It fixes a performance regression in librbd, an important CRUSH misbehavior (see below), and several RGW bugs. We have also backported support for flock/fcntl locks to ceph-fuse and libcephfs.

We recommend that all Firefly users upgrade.

For more detailed information, see the complete changelog.

ADJUSTING CRUSH MAPS

  • This point release fixes several issues with CRUSH that trigger excessive data migration when adjusting OSD weights. These are most obvious when a very small weight change (e.g., a change from 0 to .01) triggers a large amount of movement, but the same set of bugs can also lead to excessive (though less noticeable) movement in other cases.

    However, because the bug may already have affected your cluster, fixing it may trigger movement back to the more correct location. For this reason, you must manually opt-in to the fixed behavior.

    In order to set the new tunable to correct the behavior:

    ceph osd crush set-tunable straw_calc_version 1

    Note that this change will have no immediate effect. However, from this point forward, any ‘straw’ bucket in your CRUSH map that is adjusted will get non-buggy internal weights, and that transition may trigger some rebalancing.

    read more…

When a teuthology target (i.e. machine) is provisioned with teuthology-lock for the purpose of testing Ceph, there is no way to choose the kernel. But an alternate kernel can be installed afterwards using the following:

cat > kernel.yaml <<EOF
interactive-on-error: true
roles:
- - mon.a
  - client.0
kernel:
   branch: testing
tasks:
- interactive:
EOF

Assuming the target on which the new kernel is to be installed is vpm083, running

$ teuthology  --owner loic@dachary.org \
  kernel.yaml <(teuthology-lock --list-targets vpm083)
...
2015-03-09 17:47 INFO:teuthology.task.internal:Starting timer...
2015-03-09 17:47 INFO:teuthology.run_tasks:Running task interactive...
Ceph test interactive mode, use ctx to interact with the cluster
>>>

will install an alternate kernel and reboot the machine:

[ubuntu@vpm083 ~]$ uname -a
Linux vpm083 3.19.0-ceph-00029-gaf5b96e #1 SMP Thu Mar 5 01:04:25 GNU/Linux
[ubuntu@vpm083 ~]$ lsb_release -a
LSB Version:	:base-4.0-amd64:base-4.0-noarch:
Distributor ID:	RedHatEnterpriseServer
Description:  release 6.5 (Santiago)
Release:	6.5
Codename:	Santiago

Command line arguments to the kernel may be added to /boot/grub/grub.conf. For instance loop.max_part=16 to allow partition creation on /dev/loop devices:

default=0
timeout=5
splashimage=(hd0,0)/boot/grub/splash.xpm.gz
hiddenmenu
title rhel-6.5-cloudinit (3.19.0-ceph-00029-gaf5b96e)
        root (hd0,0)
        kernel /boot/vmlinuz-3.19.0 ro root=LABEL=79d3d2d4  loop.max_part=16
        initrd /boot/initramfs-3.19.0.img

When handling a Ceph OSD, it is convenient to assign it a symbolic name that can be chosen even before it is created. That’s what the uuid argument for ceph osd create is for. Without a uuid argument, a random uuid will be assigned to the OSD and can be used later. Since ceph osd create with a given uuid is idempotent, it can also be used to look up the id of a given OSD.

$ osd_uuid=b2e780fc-ec82-4a91-a29d-20cd9159e5f6
# convert the OSD uuid into an OSD id
$ ceph osd create $osd_uuid
0
# convert the OSD id into an OSD uuid
$ ./ceph --format json osd dump | jq '.osds[] | select(.osd==0) | .uuid'
"b2e780fc-ec82-4a91-a29d-20cd9159e5f6"

I recently had the opportunity to work on a Firefly cluster (0.80.8) in which power outages caused a failure of two OSDs. As with lots of things in technology, that’s not the whole story. The manner in which the power outages and OSD failures occurred put the cluster into a state with 5 placement groups (PGs) into an incomplete state. Before I got involved, the failed OSDs had been ejected from the cluster and new OSDs re-deployed in their place.

The good news is that one of the ‘failed’ OSDs was still readable for the most part and this allowed us to use a new tool to recover the PG contents.

WARNING: THIS IS A RISKY PROCESS! Do not attempt this on a production cluster without engaging Red Hat Ceph support. You could cause irreversible data loss in your cluster.
read more…
