Archives: December 2014

From Gregory and Craig on the mailing list:

“ceph osd crush reweight” sets the CRUSH weight of the OSD. This
weight is an arbitrary value (generally the size of the disk in TB or
something) and controls how much data the system tries to allocate to
the OSD.

“ceph osd reweight” sets an override weight on the OSD. This value is
in the range 0 to 1, and forces CRUSH to re-place (1-weight) of the
data that would otherwise live on this drive. It does *not* change the
weights assigned to the buckets above the OSD, and is a corrective
measure in case the normal CRUSH distribution isn’t working out quite
right. (For instance, if one of your OSDs is at 90% and the others are
at 50%, you could reduce this weight to try and compensate for it.)

Note that ‘ceph osd reweight’ is not a persistent setting. When an OSD
gets marked out, the osd weight will be set to 0. When it gets marked in
again, the weight will be changed to 1.

Because of this ‘ceph osd reweight’ is a temporary solution. You should
only use it to keep your cluster running while you’re ordering more
hardware.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040961.html
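
For reference, the two commands look like this (a minimal sketch; the values 1.82 and 0.8 are only illustrative, 1.82 being the size of the disk in TB and 0.8 an override between 0 and 1):

$ ceph osd crush reweight osd.0 1.82   # long-term CRUSH weight, reflects the disk capacity
$ ceph osd reweight 0 0.8              # temporary override, re-places (1-0.8) = 20% of the data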

I asked myself this question when one of my OSDs was marked down (on my old Cuttlefish cluster) and I noticed that only the drives of the local machine seemed to fill up. That seemed normal, since the weight of the host had not changed in the CRUSH map.

Testing

Testing on a simple cluster (Giant), with this rule in the crushmap:

ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit

Take the example of the 8 PGs on pool 3:

$ ceph pg dump | grep '^3.' | awk '{print $1,$15;}'
dumped all in format plain
3.4 [0,2]
3.5 [4,1]
3.6 [2,0]
3.7 [2,1]
3.0 [2,1]
3.1 [0,2]
3.2 [2,1]
3.3 [2,4]

Now I try ceph osd out:

$ ceph osd out 0    # This is equivalent to "ceph osd reweight 0 0"
marked out osd.0.

$ ceph osd tree
# id  weight  type name   up/down reweight
-1    0.2 root default
-2    0.09998     host ceph-01
0 0.04999         osd.0   up  0       # <-- reweight has been set to "0"
4 0.04999         osd.4   up  1   
-3    0.04999     host ceph-02
1 0.04999         osd.1   up  1   
-4    0.04999     host ceph-03
2 0.04999         osd.2   up  1   

$ ceph pg dump | grep '^3.' | awk '{print $1,$15;}'
dumped all in format plain
3.4 [2,4]  # <--  [0,2] (pg moved to osd.4)
3.5 [4,1]
3.6 [2,1]  # <--  [2,0] (pg moved to osd.1)
3.7 [2,1]
3.0 [2,1]
3.1 [2,1]  # <--  [0,2] (pg moved to osd.1)
3.2 [2,1]
3.3 [2,4]

Now I try the equivalent at the CRUSH level, with ceph osd crush reweight:

$ ceph osd crush reweight osd.0 0
reweighted item id 0 name 'osd.0' to 0 in crush map

$ ceph osd tree
# id  weight  type name   up/down reweight
-1    0.15    root default
-2    0.04999     host ceph-01            # <-- the weight of the host changed
0 0               osd.0   up  1       # <-- crush weight is set to "0"
4 0.04999         osd.4   up  1   
-3    0.04999     host ceph-02
1 0.04999         osd.1   up  1   
-4    0.04999     host ceph-03
2 0.04999         osd.2   up  1   

$ ceph pg dump | grep '^3.' | awk '{print $1,$15;}'
dumped all in format plain
3.4 [4,2]  # <--  [0,2] (pg moved to osd.4)
3.5 [4,1]
3.6 [2,4]  # <--  [2,0] (pg moved to osd.4)
3.7 [2,1]
3.0 [2,1]
3.1 [4,2]  # <--  [0,2] (pg moved to osd.4)
3.2 [2,1]
3.3 [2,1]

This does not seem very logical, because the weight assigned to the bucket “host ceph-01” is still higher than that of the other hosts. This would probably be different with more PGs…

Trying with more PGs

# Add more pg on my testpool
$ ceph osd pool set testpool pg_num 128
set pool 3 pg_num to 128

# Check repartition
$ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
osd.0=48 pgs
osd.1=78 pgs
osd.2=77 pgs
osd.4=53 pgs

$ ceph osd reweight 0 0
reweighted osd.0 to 0 (802)
$ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
osd.0=0 pgs
osd.1=96 pgs
osd.2=97 pgs
osd.4=63 pgs

The distribution seems fair. So why, in the same situation, was the distribution not the same with Cuttlefish?

$ ceph osd reweight 0 1
reweighted osd.0 to 0 (802)
$ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
osd.0=0 pgs
osd.1=96 pgs
osd.2=97 pgs
osd.4=63 pgs

$ ceph osd crush reweight osd.0 0
reweighted item id 0 name 'osd.0' to 0 in crush map

$ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
osd.0=0 pgs
osd.1=87 pgs
osd.2=88 pgs
osd.4=81 pgs

With crush reweight, everything is normal.

Trying with legacy CRUSH tunables

$ ceph osd crush tunables legacy
adjusted tunables profile to legacy
root@ceph-01:~/ceph-deploy# for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
osd.0=0 pgs
osd.1=87 pgs
osd.2=88 pgs
osd.4=81 pgs

$ ceph osd crush reweight osd.0 0.04999
reweighted item id 0 name 'osd.0' to 0.04999 in crush map

$ ceph osd tree
# id  weight  type name   up/down reweight
-1    0.2 root default
-2    0.09998     host ceph-01
0 0.04999         osd.0   up  0   
4 0.04999         osd.4   up  1   
-3    0.04999     host ceph-02
1 0.04999         osd.1   up  1   
-4    0.04999     host ceph-03
2 0.04999         osd.2   up  1   

$ for i in 0 1 2 4; do echo  "osd.$i=$(ceph pg dump 2>/dev/null | grep '^3.' | awk '{print $15;}' | grep $i | wc -l) pgs"; done
osd.0=0 pgs
osd.1=78 pgs
osd.2=77 pgs
osd.4=101 pgs   # <--- all PGs from osd.0 and osd.4 end up here (on host ceph-01) when using the legacy tunables

So this is an evolution of the placement algorithm: it now prefers a more global redistribution when an OSD is marked out (instead of redistributing preferentially by proximity). Indeed, the old behaviour can cause problems when there are not many OSDs per host and they are nearly full.

With the legacy tunables, when some OSDs are marked out, the data tends to get redistributed to nearby OSDs instead of across the entire hierarchy.
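
To go back to the current defaults after this kind of test, the tunables profile can be switched back (note: "optimal" selects the best tunables supported by the running version and may trigger data movement):

$ ceph osd crush tunables optimal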

To view the number of PGs per OSD:

http://cephnotes.ksperis.com/blog/2015/02/23/get-the-number-of-placement-groups-per-osd/

A make check bot for Ceph contributors

The automated make check bot for Ceph runs on Ceph pull requests. It is still experimental and will not be triggered by all pull requests yet.

It does the following:

A use case for developers is:

  • write a patch and send a pull request
  • switch to another branch and work on another patch while the bot is running
  • if the bot reports failure, switch back to the original branch and repush a fix: the bot will notice the repush and run again

It also helps reviewers who can wait until the bot succeeds before looking at the patch closely.

To debug an error, run-make-check.sh can be executed locally on the branch of the pull request, after merging or rebasing against the destination branch.
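
For example, a minimal sketch of that workflow (the pull request number 1234 and the remote name origin are only placeholders):

$ git fetch origin pull/1234/head:pr-1234   # fetch the pull request branch
$ git checkout pr-1234
$ git rebase origin/master                  # rebase on the destination branch (here master)
$ ./run-make-check.sh                       # run the same checks as the bot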

It can also be run in a container for CentOS 7 or Ubuntu 14.04. Each container needs about 10GB of disk. They are run using a dedicated Ceph clone so that ongoing development does not disturb them.

The preparation of the container uses install-deps.sh and takes a long time (from five to thirty minutes or more depending on the bandwidth). It is however done once and reused as long as its dependencies (ceph.spec.in, debian/control, etc.) are not modified. The second step, including make -jX check, takes six minutes on a 64GB RAM, 250GB SSD, 24 core server and fifteen minutes on a 16GB RAM, 250GB spinner, 4 core laptop. The -jX is set to half of the number of processors reported by /proc/cpuinfo (i.e. make -j4 if there are 8 processors and make -j12 if there are 24 processors).
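
As a rough sketch of that computation (the variable name is illustrative):

$ cpus=$(grep -c '^processor' /proc/cpuinfo)   # processors reported by /proc/cpuinfo
$ make -j$((cpus / 2)) check                   # half of them, e.g. -j4 for 8 processors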

The bot runs in a container so that cleaning up a failed test or aborting if it takes too long (30 minutes) can be done by removing the container (for instance with docker stop ceph-ubuntu-14.04).

The bot is triggered from a GitLab CI instance based on a mirror of the git repository. Both still need to be polished.

v0.90 released

This is the last development release before Christmas. There are some API cleanups for librados and librbd, and lots of bug fixes across the board for the OSD, MDS, RGW, and CRUSH. The OSD also gets support for discard (potentially helpful on SSDs, although it is off by default), and there are several improvements to ceph-disk.

The next two development releases will be getting a slew of new functionality for hammer. Stay tuned!

UPGRADING

  • Previously, the formatted output of ‘ceph pg stat -f …’ was a full pg dump that included all metadata about all PGs in the system. It is now a concise summary of high-level PG stats, just like the unformatted ‘ceph pg stat’ command.
  • All JSON dumps of floating point values were incorrectly surrounding the value with quotes. These quotes have been removed. Any consumer of structured JSON output that was consuming the floating point values previously had to interpret the quoted string and will most likely need to be fixed to take the unquoted number.

NOTABLE CHANGES

read more…

Realtime:

mount -o discard /dev/rbd0 /mnt/myrbd

Using batch:

fstrim /mnt/myrbd

Test

The empty FS:

$ rbd create rbd/myrbd --size=20480
$ mkfs.xfs /dev/rbd0
$ rbd diff rbd/myrbd | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
14.4062 MB

With a big file…:

$ mount /dev/rbd0 /mnt/myrbd
$ dd if=/dev/zero of=/mnt/myrbd/testfile bs=1M count=1024
$ rbd diff rbd/myrbd | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
1038.41 MB

When the big file has been removed (with the filesystem not mounted with “discard”):

$ rm /mnt/myrbd/testfile
$ rbd diff rbd/myrbd | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
1038.41 MB

Launch FS trim:

$ fstrim /mnt/myrbd
$ rbd diff rbd/myrbd | awk '{ SUM += $2 } END { print SUM/1024/1024 " MB" }'
10.6406 MB

Benchmark discard option

Without “discard”:

$ mount /dev/rbd0 /mnt/rbd0

$ mkdir testdir; cd testdir
$ dd if=/dev/zero of=mainfile bs=1M count=200
$ split -b 4048 -a 7 mainfile; sync               # 4k file / ~51k files
$ cd ..
$ time rm -rf testdir; time sync
real    0m2.780s
user    0m0.096s
sys     0m2.632s

real    0m0.130s
user    0m0.004s
sys     0m0.016s

# total: < 3s

With “discard”:

$ mount -o discard /dev/rbd1 /mnt/rbd1

$ mkdir testdir; cd testdir
$ dd if=/dev/zero of=mainfile bs=1M count=200
$ split -b 4048 -a 7 mainfile; sync               # 4k file / ~51k files
$ cd ..
$ time rm -rf testdir; time sync
real    1m51.471s
user    0m0.104s
sys     0m2.084s

real    0m47.262s
user    0m0.000s
sys     0m0.008s

# total: ~1m56

This is on 2 OSDs without SSD journals…

In the case of intensive use of the filesystem, with many small files, it may be more advantageous to use fstrim, for example once a day.
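
For example, a daily trim could be scheduled with a cron entry like this (the path and schedule are only illustrative):

# /etc/cron.d/fstrim-myrbd : trim the RBD-backed filesystem once a day at 03:00
0 3 * * * root /sbin/fstrim /mnt/myrbd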

(Ceph OSDs have supported trim on block devices since 0.46.)

Teuthology docker targets hack (3/4)

The teuthology container hack is improved so that each Ceph command is run via docker exec -i, which can read from stdin as of Docker 1.4, released in December 2014.
It can run the following job

machine_type: container
os_type: ubuntu
os_version: "14.04"
suite_path: /home/loic/software/ceph/ceph-qa-suite
roles:
- - mon.a
  - osd.0
  - osd.1
  - client.0
overrides:
  install:
    ceph:
      branch: master
  ceph:
    wait-for-scrub: false
tasks:
- install:
- ceph:

in under one minute when repeated a second time, because the bulk of the installation can be reused.

{duration: 50.01510691642761, flavor: basic,
  owner: loic@dachary.org, success: true}


The docker exec -i commands are run with

        self.p = subprocess.Popen(self.args,
                                  stdin=self.stdin_r,
                                  stdout=stdout, stderr=stderr,
                                  close_fds=True,)

The stdin is set when the command is created, as an os.pipe, so that it can be written to immediately, even before the command is actually run (which may happen at a later time if the thread is still busy finishing a previous command). The stdout and stderr are consumed immediately after the command is run and copied over to the arguments provided by the caller:

        while ( self.file_copy(self.p.stdout, self.stdout) or
                self.file_copy(self.p.stderr, self.stderr) ):

All other file descriptors are closed (with close_fds=True), otherwise the child process will hang until they are all closed.

The ceph-disk script manages Ceph devices and relies on the content of the /dev/disk/by-partuuid directory, which is updated by udev rules. For instance:

  • a new partition is created with /sbin/sgdisk --largest-new=1 --change-name="1:ceph data" --partition-guid=1:83c14a9b-0493-4ccf-83ff-e3e07adae202 --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be -- /dev/loop4
  • the kernel is notified of the change with partprobe or partx and fires a udev event
  • the udev daemon receives UDEV [249708.246769] add /devices/virtual/block/loop4/loop4p1 (block) and the /lib/udev/rules.d/60-persistent-storage.rules script creates the corresponding symlink.

Let's say the partition table is removed later (with sudo sgdisk --zap-all --clear --mbrtogpt -- /dev/loop4 for instance) and the kernel is not notified with partprobe or partx. If the first partition is created again and the kernel is notified as above, it will fail to notice any difference and will not send a udev event. As a result /dev/disk/by-partuuid will contain a symlink that is outdated.
The problem can be fixed by manually removing the stale symlink from /dev/disk/by-partuuid, clearing the partition table and notifying the kernel again. The events sent to udev can be displayed with:

# udevadm monitor
...
KERNEL[250902.072077] change   /devices/virtual/block/loop4 (block)
UDEV  [250902.100779] change   /devices/virtual/block/loop4 (block)
KERNEL[250902.101235] remove   /devices/virtual/block/loop4/loop4p1 (block)
UDEV  [250902.101421] remove   /devices/virtual/block/loop4/loop4p1 (block)
...
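
A minimal sketch of the manual fix described above, reusing /dev/loop4 and the partition GUID from the example (adapt them to the actual device and stale symlink):

# remove the stale symlink, clear the partition table, then notify the kernel
rm /dev/disk/by-partuuid/83c14a9b-0493-4ccf-83ff-e3e07adae202
sgdisk --zap-all --clear --mbrtogpt -- /dev/loop4
partprobe /dev/loop4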

The environment and scripts used for a block device can be displayed with:

# udevadm test /block/sdb/sdb1
...
udev_rules_apply_to_event: IMPORT '/sbin/blkid -o udev -p /dev/sdb/sdb1' /lib/udev/rules.d/60-ceph-partuuid-workaround.rules:28
udev_event_spawn: starting '/sbin/blkid -o udev -p /dev/sdb1'
...

DevStack and remote Ceph cluster


Introducing the ability to connect DevStack to a remote Ceph cluster: DevStack won't bootstrap any Ceph cluster, it will simply connect to a remote one.

read more...

How many PGs in each OSD of a Ceph cluster?

To display how many PGs there are in each OSD of a Ceph cluster:

$ ceph --format xml pg dump | \
   xmlstarlet sel -t -m "//pg_stats/pg_stat/acting" -v osd -n | \
   sort -n | uniq -c
    332 0
    312 1
    299 2
    326 3
    291 4
    295 5
    316 6
    311 7
    301 8
    313 9

Where xmlstarlet loops over each PG acting set (-m "//pg_stats/pg_stat/acting") and displays the OSDs it contains (-v osd), one per line (-n). The first column is the number of PGs in which the OSD in the second column appears.
To restrict the display to the PGs belonging to a given pool:

ceph --format xml pg dump |  \
  xmlstarlet sel -t -m "//pg_stats/pg_stat[starts-with(pgid,'0.')]/acting" -v osd -n | \
  sort -n | uniq -c

Where 0. is the prefix of each PG that belongs to pool 0.
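
A similar count can be obtained from the JSON output with jq, assuming jq is installed and that the acting sets sit under pg_stats in this release's JSON layout:

ceph --format json pg dump | \
  jq -r '.pg_stats[].acting[]' | \
  sort -n | uniq -c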

© 2015, Red Hat, Inc. All rights reserved.