Archives:

v10.2.1 Jewel released

This is the first bugfix release for Jewel. It contains fixes for several annoying packaging and init-system issues and a range of important bugfixes across RBD, RGW, and CephFS.

We recommend that all v10.2.x users upgrade.

For more detailed information, see the complete changelog.


v0.94.7 Hammer released

This Hammer point release fixes several minor bugs. It also includes a backport of an improved ‘ceph osd reweight-by-utilization’ command for handling OSDs with higher-than-average utilizations.

We recommend that all hammer v0.94.x users upgrade.

For more detailed information, see the complete changelog.


Docker container in 10 commands

If you are on this page, then you definitely know what Docker is, so I will not take your time with an introduction.

Let's do Docker!!!

  • Install the Docker packages on your Linux host; in my case it's CentOS.
# yum install -y docker-io

  • Start the Docker service and enable it at startup.
# service docker start ; chkconfig docker on
  • Pull the CentOS Docker image
# docker pull centos:latest
  • Check Docker images
[root@karan-ws ~]# docker images
REPOSITORY          TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
docker.io/centos    latest              fd44297e2ddb        5 weeks ago         215.7 MB
[root@karan-ws ~]#
  • Create a Docker container
[root@karan-ws ~]# docker create -ti --name="mona" centos bash
c7f9eb6b32eba38242b9d9ced309314f8eee720dbf29c656885aa0cbfff15aa6
[root@karan-ws ~]#
  • Start your Docker container
# docker start mona
  • Get the IP address of your newly created Docker container
[root@karan-ws ~]# docker inspect mona | grep -i ipaddress
        "IPAddress": "172.17.0.1",
[root@karan-ws ~]#
  • Attach (log in) to your Docker container
[root@karan-ws ~]# docker attach mona

[root@c7f9eb6b32eb /]#
[root@c7f9eb6b32eb /]# cat /etc/redhat-release
CentOS Linux release 7.1.1503 (Core)
[root@c7f9eb6b32eb /]# df -h
Filesystem                                                                                          Size  Used Avail Use% Mounted on
/dev/mapper/docker-253:1-16852579-c7f9eb6b32eba38242b9d9ced309314f8eee720dbf29c656885aa0cbfff15aa6  9.8G  268M  9.0G   3% /
tmpfs                                                                                               1.6G     0  1.6G   0% /dev
shm                                                                                                  64M     0   64M   0% /dev/shm
tmpfs                                                                                               1.6G     0  1.6G   0% /run
tmpfs                                                                                               1.6G     0  1.6G   0% /tmp
/dev/vda1                                                                                            10G  1.6G  8.5G  16% /etc/hosts
tmpfs                                                                                               1.6G     0  1.6G   0% /run/secrets
tmpfs                                                                                               1.6G     0  1.6G   0% /proc/kcore
[root@c7f9eb6b32eb /]#

To detach from the Docker container use Ctrl+P+Q; avoid the exit command, as it will stop the container and exit.
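
If you do exit by accident, the container stops but is not lost; it can simply be started and attached again (a quick sketch using the same container name as above, not part of the original walkthrough):

# docker start mona
# docker attach mona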

  • List containers
[root@karan-ws ~]# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
c7f9eb6b32eb        centos:latest       "bash"              9 minutes ago       Up 28 seconds                           mona
[root@karan-ws ~]#
  • Stop and destroy the container
[root@karan-ws ~]# docker stop mona ; docker kill mona
mona
mona
[root@karan-ws ~]#
[root@karan-ws ~]# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
[root@karan-ws ~]#
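
Note that stop and kill only stop the container; it still exists until it is removed and will show up with the -a flag. A small optional step (not in the original walkthrough) to actually delete it:

[root@karan-ws ~]# docker ps -a
[root@karan-ws ~]# docker rm mona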

These are elementary Docker operations that you can perform to get a feel for Docker container technology. In future posts I will cover more advanced Docker topics. Stay tuned!!!

Ceph: Monitor Troubleshooting

Ceph monitor IP change

While playing with your Ceph cluster, you might have seen the HEALTH_WARN cluster status.

A cluster warning can occur for several reasons, due to a malfunctioning component such as a MON, OSD, PG, or MDS.

In my case I saw a warning due to the Ceph monitors, which looked like this:

health HEALTH_WARN 1 mons down, quorum 0,1 ceph-mon1,ceph-mon2
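
To confirm which monitor is out of quorum, a quick look with the standard health commands helps (illustrative, not from my original log):

# ceph health detail
# ceph mon stat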

At first I tried restarting the MON service, but no luck.

[root@ceph-mon3 ~]# service ceph status mon
=== mon.ceph-mon3 ===
mon.ceph-mon3: not running.
[root@ceph-mon3 ~]# service ceph start mon
=== mon.ceph-mon3 ===
Starting Ceph mon.ceph-mon3 on ceph-mon3...
Invalid argument: /var/lib/ceph/mon/ceph-ceph-mon3/store.db: does not exist (create_if_missing is false)
IO error: /var/lib/ceph/mon/ceph-ceph-mon3/store.db/000001.dbtmp: Input/output error
2015-05-22 11:44:38.065906 7fad6c6967a0 -1 failed to create new leveldb store
failed: 'ulimit -n 131072;  /usr/bin/ceph-mon -i ceph-mon3 --pid-file /var/run/ceph/mon.ceph-mon3.pid -c /etc/ceph/ceph.conf --cluster ceph '
Starting ceph-create-keys on ceph-mon3...
[root@ceph-mon3 ~]#
[root@ceph-mon3 ~]# service ceph status mon
=== mon.ceph-mon3 ===
mon.ceph-mon3: not running.
[root@ceph-mon3 ~]#

The error message that I received was not something normal, so I started playing rough with my cluster by moving the monitor store.db files. Be cautious!!!

mv /var/lib/ceph/mon/ceph-ceph-mon3/store.db /var/lib/ceph/mon/ceph-ceph-mon3/store.db.orig

And this broke the MON really badly, so now I know another way to cause a new error. YAY :-)

[root@ceph-mon3 ceph-ceph-mon3]# service ceph start mon
=== mon.ceph-mon3 ===
Starting Ceph mon.ceph-mon3 on ceph-mon3...
2015-05-22 11:59:45.385826 7faa43dfb7a0 -1 unable to read magic from mon data.. did you run mkcephfs?
failed: 'ulimit -n 131072;  /usr/bin/ceph-mon -i ceph-mon3 --pid-file /var/run/ceph/mon.ceph-mon3.pid -c /etc/ceph/ceph.conf --cluster ceph '
Starting ceph-create-keys on ceph-mon3...
[root@ceph-mon3 ceph-ceph-mon3]#

Show Time begins :-)

Then I started doing the real work by reading the monitor logs, and what I found was that the monitor IP addresses were incorrect; they needed to be in a different address range.

To fix this, we first need to change the monitor IP addresses to the correct range.

Changing Ceph Monitor IP Address

  • Get the monitor map; you can see that the current IP range is 80.50.X.X, and we need to change this to the correct range.
[root@ceph-mon1 ~]# ceph mon getmap -o /tmp/monmap
got monmap epoch 3
[root@ceph-mon1 ~]#
[root@ceph-mon1 ~]# monmaptool --print /tmp/monmap
monmaptool: monmap file /tmp/monmap
epoch 3
fsid 98d89661-f616-49eb-9ccf-84d720e179c0
last_changed 2015-05-18 14:42:01.287460
created 2015-05-18 14:41:00.514879
0: 80.50.50.35:6789/0 mon.ceph-mon1
1: 80.50.50.37:6789/0 mon.ceph-mon2
2: 80.50.50.39:6789/0 mon.ceph-mon3
[root@ceph-mon1 ~]#
  • Remove the monitor nodes from the monitor map
[root@ceph-mon1 ~]# monmaptool --rm ceph-mon1 /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: removing ceph-mon1
monmaptool: writing epoch 3 to /tmp/monmap (2 monitors)
[root@ceph-mon1 ~]# monmaptool --rm ceph-mon2 /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: removing ceph-mon2
monmaptool: writing epoch 3 to /tmp/monmap (1 monitors)
[root@ceph-mon1 ~]# monmaptool --rm ceph-mon3 /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: removing ceph-mon3
monmaptool: writing epoch 3 to /tmp/monmap (0 monitors)
[root@ceph-mon1 ~]#
[root@ceph-mon1 ~]#
[root@ceph-mon1 ~]# monmaptool --print /tmp/monmap
monmaptool: monmap file /tmp/monmap
epoch 3
fsid 98d89661-f616-49eb-9ccf-84d720e179c0
last_changed 2015-05-18 14:42:01.287460
created 2015-05-18 14:41:00.514879
[root@ceph-mon1 ~]#
  • Add the correct hostnames and IP addresses for the monitor nodes
[root@ceph-mon1 ~]# monmaptool --add ceph-mon1-ib 10.100.1.101:6789 /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: writing epoch 3 to /tmp/monmap (1 monitors)
[root@ceph-mon1 ~]#
[root@ceph-mon1 ~]#
[root@ceph-mon1 ~]# monmaptool --add ceph-mon2-ib 10.100.1.102:6789 /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: writing epoch 3 to /tmp/monmap (2 monitors)
[root@ceph-mon1 ~]#
[root@ceph-mon1 ~]# monmaptool --add ceph-mon3-ib 10.100.1.103:6789 /tmp/monmap
monmaptool: monmap file /tmp/monmap
monmaptool: writing epoch 3 to /tmp/monmap (3 monitors)
[root@ceph-mon1 ~]#
[root@ceph-mon1 ~]#
[root@ceph-mon1 ~]# monmaptool --print /tmp/monmap
monmaptool: monmap file /tmp/monmap
epoch 3
fsid 98d89661-f616-49eb-9ccf-84d720e179c0
last_changed 2015-05-18 14:42:01.287460
created 2015-05-18 14:41:00.514879
0: 10.100.1.101:6789/0 mon.ceph-mon1-ib
1: 10.100.1.102:6789/0 mon.ceph-mon2-ib
2: 10.100.1.103:6789/0 mon.ceph-mon3-ib
[root@ceph-mon1 ~]#
  • Before injecting the new monmap, bring down the monitor services and then inject the monmap.
[root@ceph-mon1 ~]# service ceph stop mon
=== mon.ceph-mon1 ===
Stopping Ceph mon.ceph-mon1 on ceph-mon1...kill 441540...done
[root@ceph-mon1 ~]#
[root@ceph-mon1 ceph]# ceph-mon -i ceph-mon1 --inject-monmap /tmp/monmap
[root@ceph-mon1 ceph]#
  • Repeat these steps for the other monitors in your cluster. To save some time you can copy the new monmap file from the first monitor node (ceph-mon1) to the other monitor nodes and simply inject this new monmap into their Ceph monitor instances; a sketch of this loop appears after the code block below.
  • Finally bring up the monitor services on all the monitor nodes.
[root@ceph-mon1 ceph]# service ceph start mon
=== mon.ceph-mon1 ===
Starting Ceph mon.ceph-mon1 on ceph-mon1...
Starting ceph-create-keys on ceph-mon1...
[root@ceph-mon1 ceph]#
[root@ceph-mon1 ceph]# service ceph status mon
=== mon.ceph-mon1 ===
mon.ceph-mon1: running {"version":"0.80.9"}
[root@ceph-mon1 ceph]#
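
As mentioned above, the same stop/inject/start sequence has to be repeated on the remaining monitors. A rough sketch of that loop, assuming passwordless root SSH from ceph-mon1 and the same monitor IDs as above:

for mon in ceph-mon2 ceph-mon3 ; do
    # copy the corrected monmap, then stop the mon, inject, and start it again
    scp /tmp/monmap root@$mon:/tmp/monmap
    ssh root@$mon "service ceph stop mon ; ceph-mon -i $mon --inject-monmap /tmp/monmap ; service ceph start mon"
done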

If you still see monitor problems, you can redeploy the monitor node:

[root@ceph-mon1 ceph]# ceph-deploy --overwrite-conf  mon create ceph-mon3
  • Finally, your cluster should attain HEALTH_OK status:
[root@ceph-mon3 ceph]# ceph -s
    cluster 98d89661-f616-49eb-9ccf-84d720e179c0
     health HEALTH_OK
     monmap e4: 3 mons at {ceph-mon1=10.100.1.101:6789/0,ceph-mon2=10.100.1.102:6789/0,ceph-mon3=10.100.1.103:6789/0}, election epoch 18, quorum 0,1,2 ceph-mon1,ceph-mon2,ceph-mon3
     osdmap e244: 55 osds: 54 up, 54 in
      pgmap v693: 192 pgs, 3 pools, 0 bytes data, 0 objects
            5327 MB used, 146 TB / 146 TB avail
                 192 active+clean
[root@ceph-mon3 ceph]#

This should give you some idea of Ceph monitor troubleshooting. You can also follow the more detailed steps in the Ceph documentation.

OpenStack Summit Vancouver

Finally back from Vancouver, back from an interesting week at the OpenStack Summit, with a packed schedule of interesting presentations, work sessions, and meetings.
Together with Sage Weil, I presented “Storage security in a critical enterprise OpenStack environment”, with a focus on Ceph. You can find the slides on SlideShare. There is also a video available on the OpenStack YouTube channel.

Ceph Developer Summit: Jewel

Hey Cephers, welcome to another Ceph Developer Summit cycle! As Infernalis filters down through the fancy new testing hardware and QA processes it’s time to start thinking about what ‘Jewel’ will hold in store for us (beyond Sage’s hope for a robust and ready CephFS!!!).

Blueprint submissions are now open for any and all work that you would like to contribute or request of community developers. Please submit as soon as possible to ensure that it gets a CDS slot. We know this is still a little early, but the community has asked for a bit more lead time between the finished schedule and the actual event, so we’re trying to push the submissions cycle forward a bit.

This cycle we are in the middle of our wiki transition, so we will have a bit of a different process, which I ask you to be patient with. This cycle will be the first to utilize the Redmine wiki (on tracker.ceph.com), but migration is ongoing, so it will be a little rough.

The link below will take you to the edit page for the Jewel blueprints. From that page you just need to add in your title in the format of [[My Awesome Blueprint]] and save the page. You can then just click that link and enter your information. There is a sample blueprint page there to get you started, but please don’t hesitate to ask ‘scuttlemonkey’ on IRC or ‘pmcgarry at redhat dot com’ via email if you have any issues. We really appreciate your patience on this.

The rough schedule (updated) of CDS and Jewel in general should look something like this:

Date        Milestone
26 MAY      Blueprint submissions begin
12 JUN      Blueprint submissions end
17 JUN      Summit agenda announced
01 JUL      Ceph Developer Summit: Day 1
02 JUL      Ceph Developer Summit: Day 2 (if needed)
NOV 2015    Jewel released

As always, this event will be an online event (utilizing the BlueJeans system) so that everyone can attend from their own timezone. If you are interested in submitting a blueprint or collaborating on an existing blueprint, please click the big red button below!

 

Submit Blueprint

scuttlemonkey out

Intel 520 SSD Journal

A quick check of my Intel 520 SSD, which has been running for two years in a small cluster.

smartctl -a /dev/sda
=== START OF INFORMATION SECTION ===
Model Family:     Intel 520 Series SSDs
Device Model:     INTEL SSDSC2CW060A3
Serial Number:    CVCV305200NB060AGN
LU WWN Device Id: 5 001517 8f36af9db
Firmware Version: 400i
User Capacity:    60 022 480 896 bytes [60,0 GB]
Sector Size:      512 bytes logical/physical

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       910315h+05m+29.420s
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       13
170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       13
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x000f   117   117   050    Pre-fail  Always       -       153797776
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       13
225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1367528
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       65535
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       3
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       65535
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   093   093   000    Old_age   Always       -       0
241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1367528
242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       56808
249 NAND_Writes_1GiB        0x0013   100   100   000    Pre-fail  Always       -       33624

9 – Power on hours count

The cluster has been running for two years.

170 – Available_Reservd_Space

100%

174 – Unexpected power loss

13 => due to power losses on the cluster. Everything has always restarted well. :)

187 – Uncorrectable error count

? Still within the limit, so OK.

233 – Media Wearout Indicator

093 => decreases progressively. I do not know if it is completely reliable, but it is usually a good indicator.

241 – Host Writes 32MiB

1367528 => ~42 TB written by the host.
This corresponds to about 60 GB per day for 3 OSDs, which seems normal.

249 – NAND Writes 1GiB

33624 => ~33 TB written to NAND.
Write amplification = 0.79, which is pretty good.

The drive is 60.0 GB, so each LBA has been written about 560 times.
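
The rough arithmetic behind those figures (host writes are reported in 32 MiB units, NAND writes in 1 GiB units; a quick back-of-the-envelope check, not part of the SMART output):

host_writes_32mib=1367528 ; nand_writes_gib=33624
echo "host writes (TiB)   : $(echo "$host_writes_32mib * 32 / 1024 / 1024" | bc -l)"               # ~41.7 TiB
echo "write amplification : $(echo "$nand_writes_gib / ($host_writes_32mib * 32 / 1024)" | bc -l)" # ~0.79
echo "full-drive rewrites : $(echo "$nand_writes_gib / 60" | bc -l)"                               # ~560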

For clusters with a little more load, the Intel DC S3700 models remain my favorite, but in my case the Intel 520 does its job very well.

Ceph Jerasure and ISA plugins benchmarks

In Ceph, a pool can be configured to use erasure coding instead of replication to save space. When used with Intel processors, the default Jerasure plugin that computes the erasure code can be replaced by the ISA plugin for better write performance. Here is how they compare on an Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz.
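
For context, selecting the ISA plugin is just a matter of creating an erasure-code profile and using it when the pool is created; a minimal sketch (the profile name, pool name, k/m values, and failure domain below are arbitrary examples, not from the benchmark):

$ ceph osd erasure-code-profile set isaprofile plugin=isa k=2 m=1 ruleset-failure-domain=host
$ ceph osd pool create ecpool 128 128 erasure isaprofile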

Encoding and decoding used 4KB objects, which is the default stripe width. Two variants of the Jerasure plugin were used: generic (jerasure_generic) and SIMD (jerasure_sse4), the latter being used when running on an Intel processor with SIMD instructions.
This benchmark was run after compiling from source, using:

$ ( cd src ; make ceph_erasure_code_benchmark )
$ TOTAL_SIZE=$((4 * 1024 * 1024 * 1024)) \
CEPH_ERASURE_CODE_BENCHMARK=src/ceph_erasure_code_benchmark \
PLUGIN_DIRECTORY=src/.libs \
  qa/workunits/erasure-code/bench.sh fplot | \
  tee qa/workunits/erasure-code/bench.js

and displayed with

firefox qa/workunits/erasure-code/bench.html

RadosGW Big Index

$ rados -p .default.rgw.buckets.index listomapkeys .dir.default.1970130.1 | wc -l
166768275

With each key containing between 100 and 250 bytes, this makes a very big object for RADOS (several GB), especially when migrating it from one OSD to another (this will lock all writes). Moreover, the OSD containing this object will use a lot of memory.

Since the Hammer release it is possible to shard the bucket index. However, you cannot shard an existing bucket index; you can only set it up for new buckets.
This is a very good thing for scalability.

Setting up index max shards

You can specify the default number of shards for new buckets:

  • Per zone, in the region map (see the sketch after this list for one way to apply the change):
$ radosgw-admin region get
...
"zones": [
    {
        "name": "default",
        "endpoints": [
            "http:\/\/storage.example.com:80\/"
        ],
        "log_meta": "true",
        "log_data": "true",
        "bucket_index_max_shards": 8             <===
    },
...
  • In the radosgw section of ceph.conf (this overrides the per-zone value):
...
[client.radosgw.gateway]
rgw bucket index max shards = 8
....
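
As mentioned in the first bullet, the per-zone value lives in the region map. One way it could be applied (a hedged sketch, assuming the Hammer-era radosgw-admin region commands) is to dump the region, edit it, and push it back:

$ radosgw-admin region get > region.json
$ # edit region.json and set "bucket_index_max_shards" to the desired value
$ radosgw-admin region set < region.json
$ radosgw-admin regionmap update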

Verification:

$ radosgw-admin metadata get bucket:mybucket | grep bucket_id
            "bucket_id": "default.1970130.1"

$ radosgw-admin metadata get bucket.instance:mybucket:default.1970130.1 | grep num_shards
            "num_shards": 8,

$ rados -p .rgw.buckets.index ls | grep default.1970130.1
.dir.default.1970130.1.0
.dir.default.1970130.1.1
.dir.default.1970130.1.2
.dir.default.1970130.1.3
.dir.default.1970130.1.4
.dir.default.1970130.1.5
.dir.default.1970130.1.6
.dir.default.1970130.1.7

Bucket listing impact:

A simple test with ~200k objects in a bucket:

num_shards   time (s)
0            25
8            36
128          109

So, do not use thousands of shards per bucket if you do not need them, because bucket listing will become very slow.
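
For reference, such a timing can be reproduced by simply timing a full listing of the bucket; a possible sketch, assuming s3cmd is already configured against the gateway (the bucket name is the one used above):

$ time s3cmd ls s3://mybucket > /dev/null
$ time radosgw-admin bucket list --bucket=mybucket > /dev/null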

Link to the blueprint:

https://wiki.ceph.com/Planning/Blueprints/Hammer/rgw%3A_bucket_index_scalability
