The Ceph Blog

Ceph blog stories provide high-level spotlights on our customers all over the world

  • May 26, 2015
    Ceph Developer Summit: Jewel

    Hey Cephers, welcome to another Ceph Developer Summit cycle! As Infernalis filters down through the fancy new testing hardware and QA processes it’s time to start thinking about what ‘Jewel’ will hold in store for us (beyond Sage’s hope for a robust and ready CephFS!!!). Blueprint submissions are now open for any and all work …Read more

  • May 19, 2015
    OpenStack Summit Vancouver: Ceph and OpenStack current integration and roadmap

    Date: 19/05/2015

    Video:

    {% youtube PhxVPEZeHp4 %}

    Slides:

    Download the slides here.

  • May 19, 2015
    Intel 520 SSD Journal

    A quick check of my Intel 520 SSD that has been running for 2 years in a small cluster.

    smartctl -a /dev/sda
    === START OF INFORMATION SECTION ===
    Model Family:     Intel 520 Series SSDs
    Device Model:     INTEL SSDSC2CW060A3
    Serial Number:    CVCV305200NB060AGN
    LU WWN Device Id: 5 001517 8f36af9db
    Firmware Version: 400i
    User Capacity:    60 022 480 896 bytes [60,0 GB]
    Sector Size:      512 bytes logical/physical
    
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
      9 Power_On_Hours_and_Msec 0x0032   000   000   000    Old_age   Always       -       910315h+05m+29.420s
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       13
    170 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
    171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
    172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
    174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       13
    184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
    187 Uncorrectable_Error_Cnt 0x000f   117   117   050    Pre-fail  Always       -       153797776
    192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       13
    225 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1367528
    226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       65535
    227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       3
    228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       65535
    232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
    233 Media_Wearout_Indicator 0x0032   093   093   000    Old_age   Always       -       0
    241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1367528
    242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       56808
    249 NAND_Writes_1GiB        0x0013   100   100   000    Pre-fail  Always       -       33624
    

    9 – Power on hours count

    The cluster has been running for 2 years.

    170 – Available_Reservd_Space

    100%

    174 – Unexpected power loss

    13 => Due to power losses on the cluster. Everything has always restarted cleanly. 🙂

    187 – Uncorrectable error count

    The raw value looks odd, but the normalized value (117) is still above the threshold (50), so it is OK.

    233 – Media_Wearout_Indicator

    093 => decreases progressively. I do not know if it is completely reliable, but it is usually a good indicator.

    241 – Host Writes 32MiB

    1367528 => 42 TB written by the host.
    This corresponds to about 60 GB per day for 3 OSDs, which seems normal.

    249 – NAND Writes 1GiB

    33624 => 33 TB written to NAND.
    Write amplification = 0.79, which is pretty good.

    The drive is 60 GB, so each LBA has been written about 560 times.
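
    As a sanity check, the write amplification and per-LBA figures can be recomputed from the raw SMART values above. A small shell sketch, assuming smartctl output in the same format as shown:

    # recompute write amplification from the SMART attributes above (sketch)
    HOST_32MIB=$(smartctl -A /dev/sda | awk '$2 == "Host_Writes_32MiB" {print $NF; exit}')
    NAND_GIB=$(smartctl -A /dev/sda | awk '$2 == "NAND_Writes_1GiB" {print $NF}')
    # host writes in GiB = count * 32 MiB / 1024
    echo "write amplification: $(echo "scale=2; $NAND_GIB / ($HOST_32MIB * 32 / 1024)" | bc)"
    # on a 60 GB drive, roughly how many times each LBA has been written
    echo "LBA overwrites: $(( NAND_GIB / 60 ))"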

    For clusters with a little more load, the Intel DC S3700 remains my favorite, but in my case the Intel 520s do their job very well.

  • May 12, 2015
    Ceph Jerasure and ISA plugins benchmarks

    In Ceph, a pool can be configured to use erasure coding instead of replication to save space. When used with Intel processors, the default Jerasure plugin that computes erasure code can be replaced by the ISA plugin for better write … Continue reading
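
    The full post benchmarks the two plugins; for context, switching to the ISA plugin is done through an erasure-code profile when the pool is created. A minimal sketch, with made-up profile/pool names and arbitrary k/m values:

    $ ceph osd erasure-code-profile set myisa plugin=isa k=2 m=1
    $ ceph osd pool create ecpool 128 128 erasure myisa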

  • May 12, 2015
    RadosGW Big Index
    $ rados -p .default.rgw.buckets.index listomapkeys .dir.default.1970130.1 | wc -l
    166768275
    

    With each key containing between 100 and 250 bytes, this makes a very big object for RADOS (several GB)… especially when migrating it from one OSD to another (this locks all writes). Moreover, the OSD containing this object will use a lot of memory…

    Since the Hammer release it is possible to shard the bucket index. However, you cannot shard an existing bucket index; you can only set it up for new buckets.
    This is very good for scalability.

    Setting up index max shards

    You can specify the default number of shards for new buckets:

    • Per zone, in the regionmap (a write-back sketch follows these two examples):
    $ radosgw-admin region get
    ...
    "zones": [
        {
            "name": "default",
            "endpoints": [
                "http:\/\/storage.example.com:80\/"
            ],
            "log_meta": "true",
            "log_data": "true",
            "bucket_index_max_shards": 8             <===
        },
    ...
    
    • In the radosgw section of ceph.conf (this overrides the per-zone value):
    ...
    [client.radosgw.gateway]
    rgw bucket index max shards = 8
    ....
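
    For the per-zone value, the modified region description has to be written back and the region map refreshed before new buckets pick it up. This is a hedged sketch of that workflow (the file name is arbitrary; depending on your setup, radosgw-admin may also need --name to pick the right gateway credentials):

    $ radosgw-admin region get > region.json
    # edit region.json and set "bucket_index_max_shards" for the zone
    $ radosgw-admin region set --infile region.json
    $ radosgw-admin regionmap update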
    

    Verification:

    $ radosgw-admin metadata get bucket:mybucket | grep bucket_id
                "bucket_id": "default.1970130.1"
    
    $ radosgw-admin metadata get bucket.instance:mybucket:default.1970130.1 | grep num_shards
                "num_shards": 8,
    
    $ rados -p .rgw.buckets.index ls | grep default.1970130.1
    .dir.default.1970130.1.0
    .dir.default.1970130.1.1
    .dir.default.1970130.1.2
    .dir.default.1970130.1.3
    .dir.default.1970130.1.4
    .dir.default.1970130.1.5
    .dir.default.1970130.1.6
    .dir.default.1970130.1.7
    

    Bucket listing impact:

    A simple test with ~200k objects in a bucket:

    num_shards    time (s)
    0             25
    8             36
    128           109
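
    One hypothetical way to reproduce such a timing, assuming an S3 client like s3cmd configured against the gateway and the bucket name used above:

    $ time s3cmd ls s3://mybucket > /dev/null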

    So, do not use thousands of shards per bucket if you do not need them, because bucket listing will become very slow…

    Link to the blueprint:

    https://wiki.ceph.com/Planning/Blueprints/Hammer/rgw%3A_bucket_index_scalability

  • May 8, 2015
    Ceph : Reduce OSD Scrub Priority

    Let's assume that on a nice sunny day you receive complaints that your Ceph storage cluster is not performing as well as it did yesterday. After checking the cluster status, you find that placement group scrubbing is going on and, depending on your scenario, you would like to decrease its priority. Here is how you can do it.

    Note: the OSD disk thread I/O priority can only be changed if the disk scheduler is cfq.

    • Check the disk scheduler; if it is not cfq, you can change it to cfq dynamically.
    $ sudo cat /sys/block/sda/queue/scheduler
    noop [deadline] cfq
    $ echo cfq | sudo tee /sys/block/sda/queue/scheduler
    
    • Next, check the current values of the OSD disk thread I/O priority; the defaults should be as shown below.
    $ sudo ceph daemon osd.0 config get osd_disk_thread_ioprio_class
    { "osd_disk_thread_ioprio_class": ""}
    
    $ sudo ceph daemon osd.0 config get osd_disk_thread_ioprio_priority
    { "osd_disk_thread_ioprio_priority": "-1"}
    
    • Reduce the OSD disk thread I/O priority by executing:
    $ sudo ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 7'
    $ sudo ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'
    
    • Finally, recheck the osd_disk_thread_ioprio values:
    $ sudo ceph daemon osd.0 config get osd_disk_thread_ioprio_class
    { "osd_disk_thread_ioprio_class": "idle"}
    
    $ sudo ceph daemon osd.0 config get osd_disk_thread_ioprio_priority
    { "osd_disk_thread_ioprio_priority": "7"}
    

    This should reduce OSD scrubbing priority and is useful for slowing down scrubbing on an OSD that is busy handling client operations. Once the coast is clear, it's a good idea to revert the changes.
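
    Note that injectargs changes are not persistent across OSD restarts. If you want to keep the lower scrub priority permanently, a minimal ceph.conf sketch using the same option names as above might be:

    [osd]
    osd disk thread ioprio class = idle
    osd disk thread ioprio priority = 7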

  • May 7, 2015
    Improving Ceph python scripts tests

    The Ceph command line and ceph-disk helper are python scripts for which there are integration tests (ceph-disk.sh and test.sh). It would be useful to add unit tests and pep8 checks. It can be done by creating a python module instead … Continue reading

  • May 5, 2015
    v9.0.0 released

    This is the first development release for the Infernalis cycle, and the first Ceph release to sport a version number from the new numbering scheme. The "9" indicates this is the 9th release cycle; "I" (for Infernalis) is the 9th letter. The first "0" indicates this is a development release ("1" will mean release candidate and …Read more

  • May 4, 2015
    OpenVZ: Kernel 3.10 With Rbd Module

    The 3.x kernel for OpenVZ is out and it is compiled with the rbd module:

    root@debian:~# uname -a
    Linux debian 3.10.0-3-pve #1 SMP Thu Jun 12 13:50:49 CEST 2014 x86_64 GNU/Linux

    root@debian:~# modinfo rbd
    filename: /lib/modules/3.10.0-3-pve/kernel/drive…
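
    With the module available, mapping an image from this kernel might look like the following sketch (the pool and image names are made up):

    # load the kernel RBD client and map an image
    modprobe rbd
    rbd map rbd/myimage --id admin
    # the mapped device appears as /dev/rbd0 and can be used like any block device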

  • May 4, 2015
    Ceph using Monitor key/value store

    Ceph monitors make use of leveldb to store cluster maps, users and keys.
    Since the store is already present, Ceph developers thought about exposing it through the monitor interface.
    So monitors have a built-in capability that allows you to store blobs of …
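
    The excerpt stops here, but the capability it describes is presumably exposed through the config-key commands; a minimal sketch of using them (the key name is made up):

    # store an arbitrary blob under a key
    $ ceph config-key put foo/bar "hello world"
    # read it back
    $ ceph config-key get foo/bar
    # list stored keys
    $ ceph config-key list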

  • April 27, 2015
    v0.87.2 Giant released

    This is the second (and possibly final) point release for Giant. We recommend all v0.87.x Giant users upgrade to this release. NOTABLE CHANGES ceph-objectstore-tool: only output unsupported features when incompatible (#11176 David Zafman) common: do not implicitly unlock rwlock on destruction (Federico Simoncelli) common: make wait timeout on empty queue configurable (#10818 Samuel Just) crush: …Read more

  • April 27, 2015
    Ceph: manually repair object

    {% img center http://sebastien-han.fr/images/ceph-manually-repair-objects.jpg Ceph: manually repair object %}

    Debugging scrubbing errors can be tricky and you don't necessarily know how to proceed.

    Assuming you have a cluster state similar to this o…
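
    The excerpt is truncated, but the usual starting point when a placement group is flagged inconsistent is something like the following sketch (the PG id is hypothetical):

    # identify the inconsistent placement group
    $ ceph health detail | grep inconsistent
    # then ask the primary OSD to repair it
    $ ceph pg repair 17.1c1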
