Planet Ceph

Aggregated news from external sources

  • January 1, 2015
    Building Ceph Debian GNU/Linux packages

    The following script explains how to create Debian GNU/Linux packages for Ceph from a clone of the sources. releasedir=/tmp/release rm -fr $releasedir mkdir -p $releasedir # # remove all files not under git so they are not # included in … Continue reading
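
    In broad strokes this follows the standard Debian packaging flow. A rough sketch, assuming the checked-out Ceph tree already ships a debian/ directory and that devscripts/equivs are available (these exact commands are not from the original post):

    $ git clone --recursive https://github.com/ceph/ceph.git && cd ceph
    $ sudo mk-build-deps --install debian/control    # pull in the build dependencies declared by the packaging
    $ dpkg-buildpackage -us -uc                      # build unsigned .deb packages into the parent directory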

  • December 23, 2014
    Difference Between ‘Ceph Osd Reweight’ and ‘Ceph Osd Crush Reweight’

    From Gregory and Craig on the mailing list…

    “ceph osd crush reweight” sets the CRUSH weight of the OSD. This
    weight is an arbitrary value (generally the size of the disk in TB or
    something) and controls how much data the system tries to allocate to
    the OSD.

    “ceph osd reweight” sets an override weight on the OSD. This value is
    in the range 0 to 1, and forces CRUSH to re-place (1-weight) of the
    data that would otherwise live on this drive. It does *not* change the
    weights assigned to the buckets above the OSD, and is a corrective
    measure in case the normal CRUSH distribution isn’t working out quite
    right. (For instance, if one of your OSDs is at 90% and the others are
    at 50%, you could reduce this weight to try and compensate for it.)

    Note that ‘ceph osd reweight’ is not a persistent setting. When an OSD
    gets marked out, the osd weight will be set to 0. When it gets marked in
    again, the weight will be changed to 1.

    Because of this ‘ceph osd reweight’ is a temporary solution. You should
    only use it to keep your cluster running while you’re ordering more
    hardware.

    http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040961.html
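
    To make the distinction concrete, here is a minimal illustration (the OSD id and the values are arbitrary examples, not taken from the thread above):

    # CRUSH weight: arbitrary scale, usually the size of the disk in TB
    $ ceph osd crush reweight osd.0 1.0

    # Override weight: between 0 and 1; 0.8 asks CRUSH to re-place about 20%
    # of the data that would otherwise land on osd.0
    $ ceph osd reweight 0 0.8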

    I asked myself this question when one of my OSDs was marked down (on my old Cuttlefish cluster) and I noticed that only the drive on the same host seemed to fill up. That seemed normal, since the weight of the host had not changed in the crushmap.

    Testing

    Testing on a simple cluster (Giant), with this crushmap:

    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
      step emit
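
    For reference, the decompiled crushmap shown above can be obtained from a running cluster with something like this (the file paths are arbitrary examples):

    $ ceph osd getcrushmap -o /tmp/crushmap              # dump the binary crushmap
    $ crushtool -d /tmp/crushmap -o /tmp/crushmap.txt    # decompile it to the text form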

    Take the example of the 8 pgs on pool 3:

    $ ceph pg dump | grep '^3.' | awk '{print $1,$15;}'
    dumped all in format plain
    3.4 [0,2]
    3.5 [4,1]
    3.6 [2,0]
    3.7 [2,1]
    3.0 [2,1]
    3.1 [0,2]
    3.2 [2,1]
    3.3 [2,4]
    

    Now I try ceph osd out:

    $ ceph osd out 0    # This is equivalent to "ceph osd reweight 0 0"
    marked out osd.0.
    
    $ ceph osd tree
    # id  weight  type name   up/down reweight
    -1    0.2 root default
    -2    0.09998     host ceph-01
    0 0.04999         osd.0   up  0       # <-- reweight is set to "0"
    4 0.04999         osd.4   up  1   
    -3    0.04999     host ceph-02
    1 0.04999         osd.1   up  1   
    -4    0.04999     host ceph-03
    2 0.04999         osd.2   up  1   
    
    $ ceph pg dump | grep '^3.' | awk '{print $1,$15;}'
    dumped all in format plain
    3.4 [2,4]  # <--  [0,2] (pg moved to osd.4)
    3.5 [4,1]
    3.6 [2,1]  # <--  [2,0] (pg moved to osd.1)
    3.7 [2,1]
    3.0 [2,1]
    3.1 [2,1]  # <--  [0,2] (pg moved to osd.1)
    3.2 [2,1]
    3.3 [2,4]
    

    Now I try ceph osd crush reweight:

    $ ceph osd crush reweight osd.0 0
    reweighted item id 0 name 'osd.0' to 0 in crush map
    
    $ ceph osd tree
    # id  weight  type name   up/down reweight
    -1    0.15    root default
    -2    0.04999     host ceph-01            # <-- the weight of the host changed
    0 0               osd.0   up  1       # <-- crush weight is set to "0"
    4 0.04999         osd.4   up  1   
    -3    0.04999     host ceph-02
    1 0.04999         osd.1   up  1   
    -4    0.04999     host ceph-03
    2 0.04999         osd.2   up  1   
    
    $ ceph pg dump | grep '^3.' | awk '{print $1,$15;}'
    dumped all in format plain
    3.4 [4,2]  # <--  [0,2] (pg moved to osd.4)
    3.5 [4,1]
    3.6 [2,4]  # <--  [2,0] (pg moved to osd.4)
    3.7 [2,1]
    3.0 [2,1]
    3.1 [4,2]  # <--  [0,2] (pg moved to osd.4)
    3.2 [2,1]
    3.3 [2,1]
    

    This does not seem very logical, because the weight assigned to the bucket “host ceph-01” is still higher than that of the others. This would probably be different with more pgs…

    Trying with more pgs

    # Add more pg on my testpool
    $ ceph osd pool set testpool pg_num 128
    set pool 3 pg_num to 128
    
    # Check repartition
    $ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
    osd.0=48 pgs
    osd.1=78 pgs
    osd.2=77 pgs
    osd.4=53 pgs
    
    $ ceph osd reweight 0 0
    reweighted osd.0 to 0 (802)
    $ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
    osd.0=0 pgs
    osd.1=96 pgs
    osd.2=97 pgs
    osd.4=63 pgs
    

    The distribution seems fair. So why, in the same situation, was the distribution not the same with Cuttlefish?

    $ ceph osd reweight 0 1
    reweighted osd.0 to 0 (802)
    $ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
    osd.0=0 pgs
    osd.1=96 pgs
    osd.2=97 pgs
    osd.4=63 pgs
    
    $ ceph osd crush reweight osd.0 0
    reweighted osd.0 to 0 (802)
    
    $ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
    osd.0=0 pgs
    osd.1=87 pgs
    osd.2=88 pgs
    osd.4=81 pgs
    

    With crush reweight, everything is normal.

    Trying with legacy crush tunables

    $ ceph osd crush tunables legacy
    adjusted tunables profile to legacy
    root@ceph-01:~/ceph-deploy# for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
    osd.0=0 pgs
    osd.1=87 pgs
    osd.2=88 pgs
    osd.4=81 pgs
    
    $ ceph osd crush reweight osd.0 0.04999
    reweighted item id 0 name 'osd.0' to 0.04999 in crush map
    
    $ ceph osd tree
    # id  weight  type name   up/down reweight
    -1    0.2 root default
    -2    0.09998     host ceph-01
    0 0.04999         osd.0   up  0   
    4 0.04999         osd.4   up  1   
    -3    0.04999     host ceph-02
    1 0.04999         osd.1   up  1   
    -4    0.04999     host ceph-03
    2 0.04999         osd.2   up  1   
    
    $ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
    osd.0=0 pgs
    osd.1=78 pgs
    osd.2=77 pgs
    osd.4=101 pgs   # <--- with the legacy tunables, all the pgs from osd.0 and osd.4 end up here (on host ceph-01)
    

    So this is an evolution of the distribution algorithm: it now prefers a more global redistribution when an OSD is marked out, instead of redistributing preferentially to nearby OSDs. Indeed, the old behaviour can cause problems when there are only a few OSDs per host and they are nearly full.

    With the legacy tunables, when some OSDs are marked out, the data tends to get redistributed to nearby OSDs instead of across the entire hierarchy.
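
    To check which tunables profile a cluster is using, and to move away from legacy, something along these lines should work (a sketch; note that changing tunables can trigger a significant amount of data movement):

    $ ceph osd crush show-tunables      # display the current CRUSH tunables
    $ ceph osd crush tunables optimal   # select the best profile for the running release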

    To view the number of pgs per osd:

    http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd/
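
    A quick way to get a similar count directly from the command line (a sketch, in the spirit of the for loop used above; it simply counts how often each OSD id appears in the acting sets of pool 3):

    $ ceph pg dump 2>/dev/null | grep '^3\.' | awk '{print $15}' | tr -d '[]' | tr ',' '\n' | sort -n | uniq -c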

  • December 21, 2014
    A make check bot for Ceph contributors

    The automated make check bot for Ceph runs on Ceph pull requests. It is still experimental and will not be triggered by all pull requests yet. It does the following: Create a docker container (using ceph-test-helper.sh) Checkout the merge of … Continue reading

  • December 19, 2014
    v0.90 released

    This is the last development release before Christmas. There are some API cleanups for librados and librbd, and lots of bug fixes across the board for the OSD, MDS, RGW, and CRUSH. The OSD also gets support for discard (potentially helpful on SSDs, although it is off by default), and there are several improvements to …Read more

  • December 18, 2014
    Use Trim/discard With Rbd Kernel Client (Since Kernel 3.18)

    Realtime:

    mount -o discard /dev/rbd0 /mnt/myrbd

    Using batch:

    fstrim /mnt/myrbd
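
    For periodic batch trimming, a simple cron entry is one option (a sketch; the file name, schedule and mount point are arbitrary examples):

    # /etc/cron.d/fstrim-myrbd: trim the RBD-backed filesystem every Sunday at 02:00
    0 2 * * 0  root  /sbin/fstrim /mnt/myrbd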

    Test

    The empty FS:

    $ rbd create rbd/myrbd --size=20480
    $ mkfs.xfs /dev/rbd0
    $ rbd diff rbd/myrbd | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }…

  • December 17, 2014
    Ceph: collect Kernel RBD logs

    Quick tip to collect Kernel RBD logs.

    Make sure your kernel is compiled with CONFIG_DYNAMIC_DEBUG (and CONFIG_DEBUG_FS) enabled:

    $ sudo cat /boot/config-`uname -r` | grep DYNAMIC_DEBUG
    CONFIG_DYNAMIC_DEBUG=y

    Then mount debugfs:

    $ sudo mount -t debugfs none /sys/kernel/debug

    Set the console log level to 9:

    $ echo 9 | sudo tee /proc/sysrq-trigger

    Then choose the module that you want to log:

    $ sudo echo 'module rbd +p' | sudo tee -a /sys/kernel/debug/dynamic_debug/control

    Looking at dmesg will show the corresponding logs.
    You can use this script from the Ceph repo as well to enable all of them:

    #!/bin/sh -x

    p() {
        echo "$*" > /sys/kernel/debug/dynamic_debug/control
    }

    echo 9 > /proc/sysrq-trigger
    p 'module ceph +p'
    p 'module libceph +p'
    p 'module rbd +p'
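
    Once done, the extra logging can be switched back off through the same control file; -p removes what +p enabled (a small sketch):

    $ echo 'module rbd -p'     | sudo tee -a /sys/kernel/debug/dynamic_debug/control
    $ echo 'module libceph -p' | sudo tee -a /sys/kernel/debug/dynamic_debug/control
    $ echo 'module ceph -p'    | sudo tee -a /sys/kernel/debug/dynamic_debug/control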

  • December 15, 2014
    Ceph: Cache Tier Statistics

    Quick tip on how to retrieve cache statistics from a cache pool.

    Simply use the admin socket:

    $ sudo ceph daemon osd.{id} perf …
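
    A minimal sketch of the idea, assuming the cache tier lives on osd.0 (the exact counter names vary between releases, so the grep pattern is only a guess):

    $ sudo ceph daemon osd.0 perf dump | grep -E 'tier|promote|flush|evict'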

  • December 15, 2014
    Teuthology docker targets hack (3/4)

    The teuthology container hack is improved so each Ceph command is run via docker exec -i which can read from stdin as of docker 1.4 released in December 2014. It can run the following job machine_type: container os_type: ubuntu os_version: … Continue reading

  • December 13, 2014
    Why are by-partuuid symlinks missing or outdated ?

    The ceph-disk script manages Ceph devices and relies on the content of the /dev/disk/by-partuuid directory, which is updated by udev rules. For instance: a new partition is created with /sbin/sgdisk --largest-new=1 --change-name=1:ceph data --partition-guid=1:83c14a9b-0493-4ccf-83ff-e3e07adae202 --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be -- /dev/loop4 the kernel is … Continue reading
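
    If the symlinks are missing, re-running the udev rules for the disk is one way to regenerate them (a sketch; /dev/sdb is an assumed device name, not one from the post):

    $ sudo udevadm trigger --action=add --sysname-match='sdb*'
    $ sudo udevadm settle
    $ ls -l /dev/disk/by-partuuid/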

  • December 11, 2014
    DevStack and remote Ceph cluster

    Introducing the ability to connect DevStack to a remote Ceph cluster.
    So DevStack won’t bootstrap any Ceph cluster, it will simply connec…

  • December 9, 2014
    How many PGs in each OSD of a Ceph cluster ?

    To display how many PGs are in each OSD of a Ceph cluster: $ ceph --format xml pg dump | \ xmlstarlet sel -t -m "//pg_stats/pg_stat/acting" -v osd -n | \ sort -n | uniq -c 332 0 312 1 299 … Continue reading

  • December 9, 2014
    OpenStack: import existing Ceph volumes in Cinder

    This method can be useful while migrating from one OpenStack to another.

    Imagine you have operatin…
