The Ceph Blog

Ceph blog stories provide high-level spotlights on our customers all over the world

  • June 11, 2015
    v9.0.1 released

    This development release is delayed a bit due to tooling changes in the build environment. As a result the next one (v9.0.2) will have a bit more work than is usual. Highlights here include lots of RGW Swift fixes, RBD feature work surrounding the new object map feature, more CephFS snapshot fixes, and a few …Read more

  • June 6, 2015
    Teuthology docker targets hack (5/5)

    The teuthology container hack is improved to run teuthology-suite. For instance: ./virtualenv/bin/teuthology-suite \ --distro ubuntu \ --suite-dir $HOME/software/ceph/ceph-qa-suite \ --config-file docker-integration/teuthology.yaml \ --machine-type container \ --owner loic@dachary.org \ --filter 'rados:basic/{clusters/fixed-2.yaml fs/btrfs.yaml \ msgr-failures/few.yaml tasks/rados_cls_all.yaml}' \ --suite rados/basic --ceph ANY \ … Continue reading

  • June 3, 2015
    Ceph: activate RBD readahead

    {% img center http://sebastien-han.fr/images/ceph-rbd-readhead.jpg Ceph: activate RBD readahead %}

    RBD readahead was introduced with Giant.

    During the boot sequence of a virtual machine, if librbd detects contiguous reads, it will read ahead on the OSDs and fill up the RBD cache with the content.
    When the OS issues those reads, they are served from the librbd cache.
    Parameters that control the readahead:

    rbd readahead trigger requests = 10 # number of sequential requests necessary to trigger readahead.
    rbd readahead max bytes = 524288 # maximum size of a readahead request, in bytes.
    rbd readahead disable after bytes = 52428800 # readahead is disabled once this many bytes have been read from the image.
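
    These are librbd client options; a minimal sketch of where they might live (an assumption about deployment, not stated in the post) is the [client] section of ceph.conf on the hypervisor running the guests:

    ```bash
    # /etc/ceph/ceph.conf on the hypervisor (sketch)
    [client]
    rbd readahead trigger requests = 10
    rbd readahead max bytes = 524288
    rbd readahead disable after bytes = 52428800
    ```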
    

    Testing procedure

    The way I tested this is rather simple: I measured the time it took to SSH into the virtual machine.
    I ran this test 10 times, with and without readahead, in order to get an average value.

    Execution script:

    ```bash
    for i in $(seq 1 10)
    do
      nova delete leseb > /dev/null 2>&1       # drop the previous test instance
      sleep 5
      nova boot --flavor m1.small --image 19dab28e-5d13-4d13-9fd4-dbc597fdccb7 leseb > /dev/null 2>&1
      time ./checkup.sh 10.0.0.2 22            # time how long until SSH is reachable
    done
    ```

    Checkup script:

    ```bash
    #!/bin/bash
    host=$1
    port=$2
    max=1000000
    counter=1

    while true
    do
      # Try to open a TCP connection to the SSH port; break out as soon as it succeeds.
      python -c "import socket; s = socket.socket(socket.AF_INET, socket.SOCK_STREAM); s.connect(('$host', $port))" > /dev/null 2>&1 && break || \
      echo -n "."

      if [[ ${counter} == ${max} ]]; then
        echo "Could not connect"
        exit 1
      fi
      (( counter++ ))
    done
    ```

    Boot time comparison

    At some point, I tried to look at the virtual machine logs and analyse the block sizes.
    I was hoping that using a more accurate value for rbd_readahead_max_bytes would bring me some benefit.
    So I queried the admin socket, hoping to get something useful about the reads that happen during the boot sequence:

    ```bash
    $ sudo ceph --admin-daemon /var/run/ceph/guests/ceph-client.cinder.463407.139639582721120.asok perf dump
    ...
        "flush": 0,
        "aio_rd": 5477,
        "aio_rd_bytes": 117972992,
        "aio_rd_latency": {
            "avgcount": 5477,
            "sum": 16.090880101
        },
    ...
    ```

    Unfortunately I did not seem to get anything interesting; ideally I would have gotten average read sizes.
    My last resort was to log every single read entry from librbd.
    I used one of my previous articles as a reference.
    Over 9903 reads during the boot sequence, the average read block size turned out to be 98304 bytes.
    I eventually decided to give that value a try.
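
    For reference, a minimal sketch of how such an average could be computed, assuming the read sizes have already been extracted from the librbd debug log into a file with one byte count per line (read_sizes.txt is a hypothetical file name):

    ```bash
    # Average read block size over all logged reads (one size in bytes per line).
    awk '{ sum += $1; n++ } END { if (n > 0) printf "%d reads, average block size %d bytes\n", n, sum / n }' read_sizes.txt
    ```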

    Here are the results:

    {% img center http://sebastien-han.fr/images/ceph-rbd-readahead-boot-time-comp.jpg Ceph RBD readahead boot time comparison %}

    My second optimisation attempt was clearly the most successful, since we are now almost below 23 seconds to boot a virtual machine.
    In the meantime, the default values are not that bad and sound pretty reasonable.
    Thus sticking with the defaults should not be an issue.

  • June 1, 2015
    Ceph OSD daemon config diff

    {% img center http://sebastien-han.fr/images/ceph-osd-config-diff.jpg Ceph OSD daemon config diff %}

    Quick tip: simply check the diff between the configuration applied through your ceph.conf and the default values on an OSD.

    ```bash
    $ sudo ceph daemon …
    ```
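
    The excerpt is cut off above; as an illustration only (osd.0 is just an example daemon id, not taken from the post), the full form of such a command, run on the node hosting the OSD, would look something like:

    ```bash
    # Show only the options that differ from the built-in defaults for this daemon.
    sudo ceph daemon osd.0 config diff
    ```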

  • May 31, 2015
    Docker Containers in Just 10 Commands

    Docker container in 10 commands

    If you are on this page, then you definitely know what Docker is, so I will not take your time with the introduction part.

    Let's do Docker!!!

    • Install the Docker packages on your Linux host; in my case it's CentOS.
    # yum install -y docker-io
    

    • Start the Docker service and enable it as a startup service.
    # service docker start ; chkconfig docker on
    
    • Pull the CentOS Docker image
    # docker pull centos:latest
    
    • Check docker images
    [root@karan-ws ~]# docker images
    REPOSITORY          TAG                 IMAGE ID            CREATED             VIRTUAL SIZE
    docker.io/centos    latest              fd44297e2ddb        5 weeks ago         215.7 MB
    [root@karan-ws ~]#
    
    • Create Docker container
    [root@karan-ws ~]# docker create -ti --name="mona" centos bash
    c7f9eb6b32eba38242b9d9ced309314f8eee720dbf29c656885aa0cbfff15aa6
    [root@karan-ws ~]#
    
    • Start your docker container
    # docker start mona
    
    • Get IP address of your newly created docker container
    [root@karan-ws ~]# docker inspect mona | grep -i ipaddress
            "IPAddress": "172.17.0.1",
    [root@karan-ws ~]#
    
    • Attach (login) to your docker container
    [root@karan-ws ~]# docker attach mona
    
    [root@c7f9eb6b32eb /]#
    [root@c7f9eb6b32eb /]# cat /etc/redhat-release
    CentOS Linux release 7.1.1503 (Core)
    [root@c7f9eb6b32eb /]# df -h
    Filesystem                                                                                          Size  Used Avail Use% Mounted on
    /dev/mapper/docker-253:1-16852579-c7f9eb6b32eba38242b9d9ced309314f8eee720dbf29c656885aa0cbfff15aa6  9.8G  268M  9.0G   3% /
    tmpfs                                                                                               1.6G     0  1.6G   0% /dev
    shm                                                                                                  64M     0   64M   0% /dev/shm
    tmpfs                                                                                               1.6G     0  1.6G   0% /run
    tmpfs                                                                                               1.6G     0  1.6G   0% /tmp
    /dev/vda1                                                                                            10G  1.6G  8.5G  16% /etc/hosts
    tmpfs                                                                                               1.6G     0  1.6G   0% /run/secrets
    tmpfs                                                                                               1.6G     0  1.6G   0% /proc/kcore
    [root@c7f9eb6b32eb /]#
    

    To detach from the Docker container use Ctrl+p followed by Ctrl+q; avoid using the exit command, as it will stop the container when you exit.

    • List container
    [root@karan-ws ~]# docker ps
    CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
    c7f9eb6b32eb        centos:latest       "bash"              9 minutes ago       Up 28 seconds                           mona
    [root@karan-ws ~]#
    
    • Stop and destroy container
    [root@karan-ws ~]# docker stop mona ; docker kill mona
    mona
    mona
    [root@karan-ws ~]#
    [root@karan-ws ~]# docker ps
    CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
    [root@karan-ws ~]#
    

    These are the elementary Docker operations you can perform to get a feel for Docker container technology. In future posts I will cover more advanced Docker topics. Stay tuned!!!

  • May 30, 2015
    Ceph: Monitor Troubleshooting

    Ceph monitor IP change

    While playing with your Ceph cluster, you might have seen the HEALTH_WARN cluster status.

    A cluster warning can occur because of several kinds of component malfunction, such as MON, OSD, PG and MDS issues.

    In my case I saw a warning due to the Ceph monitors, which looked like this:

    health HEALTH_WARN 1 mons down, quorum 0,1 ceph-mon1,ceph-mon2
    

    At first I tried restarting the MON service, but no luck.

    [root@ceph-mon3 ~]# service ceph status mon
    === mon.ceph-mon3 ===
    mon.ceph-mon3: not running.
    [root@ceph-mon3 ~]# service ceph start mon
    === mon.ceph-mon3 ===
    Starting Ceph mon.ceph-mon3 on ceph-mon3...
    Invalid argument: /var/lib/ceph/mon/ceph-ceph-mon3/store.db: does not exist (create_if_missing is false)
    IO error: /var/lib/ceph/mon/ceph-ceph-mon3/store.db/000001.dbtmp: Input/output error
    2015-05-22 11:44:38.065906 7fad6c6967a0 -1 failed to create new leveldb store
    failed: 'ulimit -n 131072;  /usr/bin/ceph-mon -i ceph-mon3 --pid-file /var/run/ceph/mon.ceph-mon3.pid -c /etc/ceph/ceph.conf --cluster ceph '
    Starting ceph-create-keys on ceph-mon3...
    [root@ceph-mon3 ~]#
    [root@ceph-mon3 ~]# service ceph status mon
    === mon.ceph-mon3 ===
    mon.ceph-mon3: not running.
    [root@ceph-mon3 ~]#
    

    The error message I received was not a normal one, so I started playing rough with my cluster by moving the monitor store.db files. !!! Be cautious !!!

    mv /var/lib/ceph/mon/ceph-ceph-mon3/store.db /var/lib/ceph/mon/ceph-ceph-mon3/store.db.orig
    

    And this broke the MON really badly, so now I know another way to cause yet another error. YAY 🙂

    [root@ceph-mon3 ceph-ceph-mon3]# service ceph start mon
    === mon.ceph-mon3 ===
    Starting Ceph mon.ceph-mon3 on ceph-mon3...
    2015-05-22 11:59:45.385826 7faa43dfb7a0 -1 unable to read magic from mon data.. did you run mkcephfs?
    failed: 'ulimit -n 131072;  /usr/bin/ceph-mon -i ceph-mon3 --pid-file /var/run/ceph/mon.ceph-mon3.pid -c /etc/ceph/ceph.conf --cluster ceph '
    Starting ceph-create-keys on ceph-mon3...
    [root@ceph-mon3 ceph-ceph-mon3]#
    

    Show Time begins 🙂

    Then I started doing the real work by reading the monitor logs, and what I found was that the monitor IP addresses were incorrect; they needed to be in a different address range.

    To fix this, we first need to change the monitor IP addresses to the correct range.

    Changing Ceph Monitor IP Address

    • Get the monitor map. You can see the current IP range is 80.50.X.X; we need to change this to the correct range.
    [root@ceph-mon1 ~]# ceph mon getmap -o /tmp/monmap
    got monmap epoch 3
    [root@ceph-mon1 ~]#
    [root@ceph-mon1 ~]# monmaptool --print /tmp/monmap
    monmaptool: monmap file /tmp/monmap
    epoch 3
    fsid 98d89661-f616-49eb-9ccf-84d720e179c0
    last_changed 2015-05-18 14:42:01.287460
    created 2015-05-18 14:41:00.514879
    0: 80.50.50.35:6789/0 mon.ceph-mon1
    1: 80.50.50.37:6789/0 mon.ceph-mon2
    2: 80.50.50.39:6789/0 mon.ceph-mon3
    [root@ceph-mon1 ~]#
    
    • Remove monitor nodes from monitor map
    [root@ceph-mon1 ~]# monmaptool --rm ceph-mon1 /tmp/monmap
    monmaptool: monmap file /tmp/monmap
    monmaptool: removing ceph-mon1
    monmaptool: writing epoch 3 to /tmp/monmap (2 monitors)
    [root@ceph-mon1 ~]# monmaptool --rm ceph-mon2 /tmp/monmap
    monmaptool: monmap file /tmp/monmap
    monmaptool: removing ceph-mon2
    monmaptool: writing epoch 3 to /tmp/monmap (1 monitors)
    [root@ceph-mon1 ~]# monmaptool --rm ceph-mon3 /tmp/monmap
    monmaptool: monmap file /tmp/monmap
    monmaptool: removing ceph-mon3
    monmaptool: writing epoch 3 to /tmp/monmap (0 monitors)
    [root@ceph-mon1 ~]#
    [root@ceph-mon1 ~]#
    [root@ceph-mon1 ~]# monmaptool --print /tmp/monmap
    monmaptool: monmap file /tmp/monmap
    epoch 3
    fsid 98d89661-f616-49eb-9ccf-84d720e179c0
    last_changed 2015-05-18 14:42:01.287460
    created 2015-05-18 14:41:00.514879
    [root@ceph-mon1 ~]#
    
    • Add the correct hostname and IP address for monitor nodes
    [root@ceph-mon1 ~]# monmaptool --add ceph-mon1-ib 10.100.1.101:6789 /tmp/monmap
    monmaptool: monmap file /tmp/monmap
    monmaptool: writing epoch 3 to /tmp/monmap (1 monitors)
    [root@ceph-mon1 ~]#
    [root@ceph-mon1 ~]#
    [root@ceph-mon1 ~]# monmaptool --add ceph-mon2-ib 10.100.1.102:6789 /tmp/monmap
    monmaptool: monmap file /tmp/monmap
    monmaptool: writing epoch 3 to /tmp/monmap (2 monitors)
    [root@ceph-mon1 ~]#
    [root@ceph-mon1 ~]# monmaptool --add ceph-mon3-ib 10.100.1.103:6789 /tmp/monmap
    monmaptool: monmap file /tmp/monmap
    monmaptool: writing epoch 3 to /tmp/monmap (3 monitors)
    [root@ceph-mon1 ~]#
    [root@ceph-mon1 ~]#
    [root@ceph-mon1 ~]# monmaptool --print /tmp/monmap
    monmaptool: monmap file /tmp/monmap
    epoch 3
    fsid 98d89661-f616-49eb-9ccf-84d720e179c0
    last_changed 2015-05-18 14:42:01.287460
    created 2015-05-18 14:41:00.514879
    0: 10.100.1.101:6789/0 mon.ceph-mon1-ib
    1: 10.100.1.102:6789/0 mon.ceph-mon2-ib
    2: 10.100.1.103:6789/0 mon.ceph-mon3-ib
    [root@ceph-mon1 ~]#
    
    • Before injecting the new monmap, bring down the monitor services, and then inject the monmap.
    [root@ceph-mon1 ~]# service ceph stop mon
    === mon.ceph-mon1 ===
    Stopping Ceph mon.ceph-mon1 on ceph-mon1...kill 441540...done
    [root@ceph-mon1 ~]#
    [root@ceph-mon1 ceph]# ceph-mon -i ceph-mon1 --inject-monmap /tmp/monmap
    [root@ceph-mon1 ceph]#
    
    • Repeat these steps for the other monitors in your cluster. To save some time, you can copy the new monmap file from the first monitor node (ceph-mon1) to the other monitor nodes and simply inject it into their Ceph monitor instances; a sketch of this follows the code block below.
    • Finally, bring up the monitor services on all the monitor nodes.
    [root@ceph-mon1 ceph]# service ceph start mon
    === mon.ceph-mon1 ===
    Starting Ceph mon.ceph-mon1 on ceph-mon1...
    Starting ceph-create-keys on ceph-mon1...
    [root@ceph-mon1 ceph]#
    [root@ceph-mon1 ceph]# service ceph status mon
    === mon.ceph-mon1 ===
    mon.ceph-mon1: running {"version":"0.80.9"}
    [root@ceph-mon1 ceph]#
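
    As mentioned above, a rough sketch of repeating the procedure on the remaining monitors (the hostnames ceph-mon2 and ceph-mon3 are this cluster's; adjust ids and paths to your own):

    ```bash
    # On ceph-mon1: copy the corrected monmap to the other monitor nodes.
    scp /tmp/monmap root@ceph-mon2:/tmp/monmap
    scp /tmp/monmap root@ceph-mon3:/tmp/monmap

    # On each remaining monitor node (shown here for ceph-mon2):
    service ceph stop mon                              # stop the local monitor first
    ceph-mon -i ceph-mon2 --inject-monmap /tmp/monmap  # inject the new monmap
    service ceph start mon                             # bring the monitor back up
    ```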
    

    If you still see monitor problems, you can redeploy the monitor node:

    [root@ceph-mon1 ceph]# ceph-deploy --overwrite-conf  mon create ceph-mon3
    
    • Finally, your cluster should reach HEALTH_OK status.
    [root@ceph-mon3 ceph]# ceph -s
        cluster 98d89661-f616-49eb-9ccf-84d720e179c0
         health HEALTH_OK
         monmap e4: 3 mons at {ceph-mon1=10.100.1.101:6789/0,ceph-mon2=10.100.1.102:6789/0,ceph-mon3=10.100.1.103:6789/0}, election epoch 18, quorum 0,1,2 ceph-mon1,ceph-mon2,ceph-mon3
         osdmap e244: 55 osds: 54 up, 54 in
          pgmap v693: 192 pgs, 3 pools, 0 bytes data, 0 objects
                5327 MB used, 146 TB / 146 TB avail
                     192 active+clean
    [root@ceph-mon3 ceph]#
    

    This should give you some idea of Ceph monitor troubleshooting. You can also follow the more detailed steps described in the Ceph documentation.

  • May 29, 2015
    OpenStack Summit Vancouver

    Finally back from Vancouver, back from an interesting week at the OpenStack Summit including a quite packed schedule with many interesting presentations, work sessions and meetings. I presented together with Sage Weil about “Storage security in a …

  • May 26, 2015
    Ceph Developer Summit: Jewel

    Hey Cephers, welcome to another Ceph Developer Summit cycle! As Infernalis filters down through the fancy new testing hardware and QA processes it’s time to start thinking about what ‘Jewel’ will hold in store for us (beyond Sage’s hope for a robust and ready CephFS!!!). Blueprint submissions are now open for any and all work …Read more

  • May 19, 2015
    OpenStack Summit Vancouver: Ceph and OpenStack current integration and roadmap

    Date: 19/05/2015

    Video:

    {% youtube PhxVPEZeHp4 %}

    Slides:

    Download the slides here.

  • May 19, 2015
    Intel 520 SSD Journal

    A quick check of my Intel 520 SSDs, which have been running for two years in a small cluster.

    smartctl -a /dev/sda
    === START OF INFORMATION SECTION ===
    Model Family:     Intel 520 Series SSDs
    Device Model:     INTEL SSDSC2CW060A3
    Serial Number:    CVCV305200NB060AGN
    LU WWN Device Id: 5 001517 8f36af9db
    Firmware Version: 400i
    User Capacity:    60 022 480 896 bytes [60,0 GB]
    Sector Size:      512 bytes logical/physical
    
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
      9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       910315h+05m+29.420s
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       13
    170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
    171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
    172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
    174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       13
    184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
    187 Uncorrectable_Error_Cnt 0x000f   117   117   050    Pre-fail  Always       -       153797776
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       13
    225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1367528
    226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       65535
    227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       3
    228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       65535
    232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
    233 Media_Wearout_Indicator 0x0032   093   093   000    Old_age   Always       -       0
    241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1367528
    242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       56808
    249 NAND_Writes_1GiB        0x0013   100   100   000    Pre-fail  Always       -       33624
    

    9 – Power on hours count

    The cluster has been running for two years.

    170 Available_Reservd_Space

    100%

    174 – Unexpected power loss

    13 => due to power losses on the cluster. Everything has always restarted fine. 🙂

    187 – Uncorrectable error count

    The raw value looks scary, but the normalized value (117) is still well above the threshold (050), so it is within the limit.

    233 Media Wearout Indicator

    093 => decreases progressively. I do not know if it is completely reliable, but it is usually a good indicator.

    241 – Host Writes 32MiB

    1367528 => about 42 TB written by the host.
    This corresponds to roughly 60 GB per day for 3 OSDs, which seems normal.

    249 – NAND Writes 1GiB

    33624 => about 33 TB written to NAND.
    Write amplification = 0.79, which is pretty good.

    The drive is 60.0 GB, which means each LBA has been written about 560 times.
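
    A minimal sketch of these back-of-the-envelope calculations from the raw SMART values above:

    ```bash
    # Host writes: attribute 241 counts 32 MiB units.
    echo "1367528 * 32 / 1024 / 1024" | bc -l      # ≈ 41.7 TiB written by the host

    # NAND writes: attribute 249 counts 1 GiB units.
    echo "33624 / 1024" | bc -l                    # ≈ 32.8 TiB written to NAND

    # Write amplification = NAND writes / host writes.
    echo "33624 / (1367528 * 32 / 1024)" | bc -l   # ≈ 0.79

    # Rough writes per LBA: NAND writes (GiB) over drive capacity (60 GB).
    echo "33624 / 60" | bc -l                      # ≈ 560
    ```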

    For clusters with a little more load, the Intel DC S3700 remains my favourite, but in my case the Intel 520s do their job very well.

  • May 12, 2015
    Ceph Jerasure and ISA plugins benchmarks

    In Ceph, a pool can be configured to use erasure coding instead of replication to save space. When used with Intel processors, the default Jerasure plugin that computes erasure code can be replaced by the ISA plugin for better write … Continue reading
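
    As an illustration of what switching plugins involves (a sketch only; the profile name, k/m values and pool name below are arbitrary examples, not taken from the article), the plugin is selected through an erasure-code profile:

    ```bash
    # Create an erasure-code profile that uses the ISA plugin instead of the default jerasure.
    ceph osd erasure-code-profile set isaprofile plugin=isa k=2 m=1

    # Create an erasure-coded pool backed by that profile.
    ceph osd pool create ecpool 128 128 erasure isaprofile
    ```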

  • May 12, 2015
    RadosGW Big Index
    $ rados -p .default.rgw.buckets.index listomapkeys .dir.default.1970130.1 | wc -l
    166768275
    

    With each key containing between 100 and 250 bytes, this makes a very big object for RADOS (several GB)… especially when migrating it from one OSD to another (which will lock all writes); moreover, the OSD containing this object will use a lot of memory…

    Since the Hammer release it is possible to shard the bucket index. However, you cannot shard an existing index; you can only set sharding up for new buckets.
    This is a very good thing for scalability.

    Setting up index max shards

    You can specify the default number of shards for new buckets:

    • Per zone, in the region map:
    $ radosgw-admin region get
    ...
    "zones": [
        {
            "name": "default",
            "endpoints": [
                "http:\/\/storage.example.com:80\/"
            ],
            "log_meta": "true",
            "log_data": "true",
            "bucket_index_max_shards": 8             <===
        },
    ...
    
    • In the radosgw section of ceph.conf (this overrides the per-zone value):
    ...
    [client.radosgw.gateway]
    rgw bucket index max shards = 8
    ....
    

    Verification:

    $ radosgw-admin metadata get bucket:mybucket | grep bucket_id
                "bucket_id": "default.1970130.1"
    
    $ radosgw-admin metadata get bucket.instance:mybucket:default.1970130.1 | grep num_shards
                "num_shards": 8,
    
    $ rados -p .rgw.buckets.index ls | grep default.1970130.1
    .dir.default.1970130.1.0
    .dir.default.1970130.1.1
    .dir.default.1970130.1.2
    .dir.default.1970130.1.3
    .dir.default.1970130.1.4
    .dir.default.1970130.1.5
    .dir.default.1970130.1.6
    .dir.default.1970130.1.7
    

    Bucket listing impact:

    A simple test with ~200k objects in a bucket:

    num_shards   time (s)
    0            25
    8            36
    128          109
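
    The article does not show how the listing was timed; one simple way to measure it (a sketch, using mybucket from the verification step above) would be:

    ```bash
    # Time a full listing of the bucket through radosgw-admin.
    time radosgw-admin bucket list --bucket=mybucket > /dev/null
    ```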

    So, do not use thousands of shards on a bucket if you do not need them, because bucket listing will become very slow…

    Link to the blueprint:

    https://wiki.ceph.com/Planning/Blueprints/Hammer/rgw%3A_bucket_index_scalability
