The contents of this wiki are no longer actively maintained. The most current documentation is available at http://ceph.com/docs.

Benchmark


Revision as of 15:30, 2 August 2012

This HOWTO describes how to benchmark Ceph.

** THIS PAGE IS NOT FINISHED YET **

Benchmark tools

- ceph tools
- rest-bench
- dd
- pssh or pdsh
- fio
- seekwatcher
- bonnie++

Preparation

This page was written on a ceph installation based on the Ceph and OpenStack Debian GNU/Linux howto. The OpenStack integration can be ignored.

# apt-get install blktrace python-matplotlib fio rest-bench collectl perftest linux-tools-3.2 
# wget https://oss.oracle.com/~mason/seekwatcher/seekwatcher-0.12.tar.bz2
# tar -xjf seekwatcher-0.12.tar.bz2
# sed -i '755d' seekwatcher-0.12/seekwatcher  # small fix (deletes line 755) to work with this Debian version of python-matplotlib
# parallel-scp -H "ceph1 ceph2 ceph3 ceph4 ceph5 ceph6" seekwatcher-0.12/seekwatcher /usr/local/bin/seekwatcher

Benchmark procedure

Some information about the Ceph setup:

Node names are: ceph1 ceph2 ceph3 ceph4 ceph5 ceph6
OSD id convention is <nodeid><diskid> (e.g. the second disk on the third node is 32)
OSD disks are mounted on /srv/ceph/osd<nodeid><diskid> (e.g. the second disk on the third node is mounted at /srv/ceph/osd32)
OSDs are on disks sda2 and sdb2 on each server
All my OSD ids are 11, 12, 31, 32, 41, 42, 51, 52, 61 and 62
My mon nodes are ceph2, ceph4 and ceph5
RADOSGW runs on ceph1
ceph3 acts as a client for the benchmarks that use the network

Note: no information about the crush map, the pgnum, the pool size or the hardware is given here.
The results shown as examples in the following instructions are not meant to be interpreted. Benchmark results must be associated with a hardware configuration to be interpreted, as shown in the Benchmark Results section.

OSD benchmark

The goal of this benchmark is to assess the read and write speed of each OSD. These are compared with the read and write speed of the underlying device. The expected result is that there is no significant difference.

Start by running this command on a ceph node to get information about the disk read speed:

# parallel-ssh -t0 -iP -H "ceph1 ceph2 ceph4 ceph5 ceph6" "hdparm -t /dev/sda ; hdparm -t /dev/sdb" 
[1] 15:25:55 [SUCCESS] ceph1
/dev/sda:
 Timing buffered disk reads: 476 MB in  3.01 seconds = 158.14 MB/sec
/dev/sdb:
 Timing buffered disk reads: 494 MB in  3.01 seconds = 164.30 MB/sec
[2] 15:25:56 [SUCCESS] ceph2
/dev/sda:
 Timing buffered disk reads: 368 MB in  3.00 seconds = 122.54 MB/sec
/dev/sdb:
 Timing buffered disk reads: 362 MB in  3.08 seconds = 117.54 MB/sec
[2] 15:25:56 [SUCCESS] ceph4
/dev/sda:
 Timing buffered disk reads: 296 MB in  3.01 seconds =  98.50 MB/sec
/dev/sdb:
 Timing buffered disk reads: 360 MB in  3.01 seconds = 119.52 MB/sec
...

So the disk read speed is ~120MB/sec.
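
To reduce the hdparm output to a single number, the reported rates can be averaged. This is a minimal sketch that assumes the output format shown above; the function name hdparm_mean is only illustrative and is not part of any tool.

```shell
# Average the "Timing buffered disk reads" rates (MB/sec) read from stdin.
# The rate is the next-to-last field of each matching line.
hdparm_mean() {
    awk '/Timing buffered disk reads/ { sum += $(NF - 1); n++ }
         END { if (n > 0) printf "%.1f\n", sum / n }'
}
```

The parallel-ssh command above can be piped straight into hdparm_mean.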

A similar test can be done using the OSD device with the following command:

# ceph osd tell $id bench

The following command is run on one of the nodes of the Ceph cluster:

# for id in 11 12 21 22 41 42 51 52 61 62; do ceph osd tell $id bench ; done

Each id uniquely identifies an OSD.
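
Since the ids follow the <nodeid><diskid> convention, the list can be generated instead of typed. A small sketch, where osd_ids is a hypothetical helper and the node and disk numbers are whatever your own setup uses:

```shell
# Print every <nodeid><diskid> combination on one line, space separated.
# $1: space-separated node ids, $2: space-separated disk ids.
osd_ids() {
    oi_ids=""
    for oi_node in $1; do
        for oi_disk in $2; do
            oi_ids="$oi_ids${oi_ids:+ }$oi_node$oi_disk"
        done
    done
    printf '%s\n' "$oi_ids"
}

# Possible use with the loop above (not executed here):
#   for id in $(osd_ids "1 2 4 5 6" "1 2"); do ceph osd tell $id bench ; done
```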

The bench results can be read from the logs:

# grep bench /var/log/ceph/ceph.log
2012-07-25 13:09:51.183160 osd.11 169.254.6.21:6800/16801 1185 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 18.328763 sec at 57209 KB/sec
2012-07-25 13:09:48.771528 osd.12 169.254.6.21:6803/16909 1258 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 15.902182 sec at 65939 KB/sec
2012-07-25 13:10:04.727132 osd.21 169.254.6.22:6800/4199 695 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 31.841067 sec at 32931 KB/sec
2012-07-25 13:10:05.624053 osd.41 169.254.6.24:6800/4481 464 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 32.707697 sec at 32058 KB/sec
2012-07-25 13:10:03.294986 osd.61 169.254.6.26:6800/3339 1216 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 30.320444 sec at 34583 KB/sec
2012-07-25 13:10:01.767423 osd.22 169.254.6.22:6803/4278 667 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 28.865160 sec at 36326 KB/sec
2012-07-25 13:10:02.157689 osd.62 169.254.6.26:6803/3422 1196 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 29.168095 sec at 35949 KB/sec
2012-07-25 13:10:02.397218 osd.42 169.254.6.24:6803/4552 464 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 29.466679 sec at 35585 KB/sec
2012-07-25 13:10:03.850164 osd.52 169.254.6.25:6803/4005 408 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 30.889020 sec at 33946 KB/sec
2012-07-25 13:10:04.240440 osd.51 169.254.6.25:6800/3936 380 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 31.295116 sec at 33506 KB/sec

Or wait for the results with:

# ceph -w 
2012-07-25 13:09:51.183160 osd.11 169.254.6.21:6800/16801 1185 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 18.328763 sec at 57209 KB/sec
2012-07-25 13:09:48.771528 osd.12 169.254.6.21:6803/16909 1258 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 15.902182 sec at 65939 KB/sec
2012-07-25 13:10:04.727132 osd.21 169.254.6.22:6800/4199 695 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 31.841067 sec at 32931 KB/sec
2012-07-25 13:10:05.624053 osd.41 169.254.6.24:6800/4481 464 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 32.707697 sec at 32058 KB/sec
2012-07-25 13:10:03.294986 osd.61 169.254.6.26:6800/3339 1216 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 30.320444 sec at 34583 KB/sec
2012-07-25 13:10:01.767423 osd.22 169.254.6.22:6803/4278 667 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 28.865160 sec at 36326 KB/sec
2012-07-25 13:10:02.157689 osd.62 169.254.6.26:6803/3422 1196 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 29.168095 sec at 35949 KB/sec
2012-07-25 13:10:02.397218 osd.42 169.254.6.24:6803/4552 464 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 29.466679 sec at 35585 KB/sec
2012-07-25 13:10:03.850164 osd.52 169.254.6.25:6803/4005 408 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 30.889020 sec at 33946 KB/sec
2012-07-25 13:10:04.240440 osd.51 169.254.6.25:6800/3936 380 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 31.295116 sec at 33506 KB/sec

The write speed of the OSDs is ~35MB/s, except for two OSDs (osd.11 and osd.12) at ~60MB/s.
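
The per-OSD rates can be pulled out of the log lines rather than read by eye. A sketch assuming the log format shown above (OSD name in the third field, KB/sec value next to last); osd_bench_rates is an illustrative name, and the conversion uses 1 MB = 1024 KB:

```shell
# Print "<osd> <MB/s>" for each OSD bench log line read from stdin.
osd_bench_rates() {
    awk '/bench: wrote/ { printf "%s %.1f\n", $3, $(NF - 1) / 1024 }'
}

# Possible use (not executed here):
#   grep bench /var/log/ceph/ceph.log | osd_bench_rates
```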

To sum up, we have:

test name  bandwidth read  bandwidth write (StdDev)  latency read  latency write (StdDev)
OSD        120MB/s         35MB/s

Then run another benchmark tool to confirm the results:

# bonnie++ -q -u root -d /srv/ceph/osd<nodeid><diskid> -m ceph<nodeid>-osd<nodeid><diskid>

Copy and paste all the output from this command into a single file (e.g. bench_result.csv):

# bon_csv2txt < bench_result.csv > bench_result.txt
# bon_csv2html < bench_result.csv > bench_result.html

Bonnie++ results:

# cat bench_result.txt
Version      1.96   ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ceph1-osd11  32160M  1154  99 113556  11 50219   6  1263  46 136658   9 526.3  26
Latency              7025us   74123us     764ms    1304ms   84768us   98644us
ceph1-osd12  32160M  1179  99 150674  14 65643   8  1348  49 172851  11 547.8  25
Latency              6899us   95868us    1179ms    1152ms   12513us   93919us
ceph3-osd31  32160M  1113  99 138012  13 59340   8  2659  98 164279  11 520.5  26
Latency              7723us     175ms    1162ms    9901us    1190ms   91338us
ceph3-osd32  32160M  1167  99 141259  13 61504   8  1291  47 164216  11 623.4  12
Latency              6995us   93199us     949ms     832ms   82967us   60616us
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
ceph1-osd11      16 27412  95 +++++ +++ 29144  95 27344  95 +++++ +++ 28128  94
Latency               424us     115us     182us     280us      17us     129us
ceph1-osd12      16 27130  94 +++++ +++ 29535  96 27534  96 +++++ +++ 28379  96
Latency               448us     114us     146us     272us      18us     135us
ceph3-osd31      16 27091  95 +++++ +++ 29012  95 27166  96 +++++ +++ 28130  96
Latency               324us     115us     146us     412us      26us     141us
ceph3-osd32      16 27235  95 +++++ +++ 29312  96 27270  95 +++++ +++ 28303  95
Latency               456us     118us     386us     234us      18us     453us

fio can also be used for this kind of test:

# fio --size=32160m --directory=/srv/ceph/osd<nodeid><diskid> --direct=1 --name=rw_latency --rw=rw --blocksize=4m
# fio --size=32160m --directory=/srv/ceph/osd<nodeid><diskid> --direct=1 --name=rw_throutput --rw=rw --blocksize=1024m 
# fio --size=32160m --directory=/srv/ceph/osd<nodeid><diskid> --direct=1 --name=random_rw --rw=randrw

Note the 32160m size: a good value is at least twice the RAM size.
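
The size can be derived from the amount of RAM instead of being hard-coded. A sketch under assumptions: fio_size_mb and ram_kb are hypothetical helper names, and ram_kb relies on the Linux /proc/meminfo layout. A machine reporting 16465920 kB of RAM (an assumed figure, roughly 16G) yields the 32160 used above.

```shell
# Print a fio --size value in megabytes: twice the RAM size given in kB.
fio_size_mb() {
    echo $(( $1 * 2 / 1024 ))
}

# Read MemTotal (in kB) from /proc/meminfo on Linux.
ram_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Possible use (not executed here):
#   fio --size="$(fio_size_mb "$(ram_kb)")m" --direct=1 --name=rw_latency --rw=rw --blocksize=4m ...
```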

RADOS benchmark

The goal of this benchmark is to figure out the read / write throughput and latency of rados objects. Writing to a rados object will be slower than writing to an OSD, because it is replicated N times. Reading a rados object is expected to be faster than reading from an OSD because reads use all replicas.

Prepare a pool for the benchmark:

# ceph osd pool create pbench 768

The number of PGs must be set to between 50 and 100 times the number of OSDs.
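
As a quick check of that rule, the acceptable range can be computed from the OSD count. A tiny sketch (pg_range is an illustrative name): with the 10 OSDs of this setup the range is 500 to 1000, and the 768 used above falls inside it.

```shell
# Print the lower and upper PG counts (50x and 100x) for a given number of OSDs.
pg_range() {
    echo "$(( $1 * 50 )) $(( $1 * 100 ))"
}
```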

Clean the disk cache on all Ceph nodes with:

# echo 3 > /proc/sys/vm/drop_caches
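
Typing this on every node gets tedious, and parallel-ssh (already used on this page) can do it in one go. A sketch with several assumptions: drop_caches_all and the DRY_RUN variable are hypothetical, the host list is supplied by the caller, and a sync is added first so that dirty pages are written out before the caches are dropped.

```shell
# Drop the page cache on every host in "$1" (space-separated list).
# With DRY_RUN set, only print the command that would be run.
drop_caches_all() {
    dc_hosts="$1"
    dc_cmd='sync; echo 3 > /proc/sys/vm/drop_caches'
    if [ -n "${DRY_RUN:-}" ]; then
        printf "parallel-ssh -i -t0 -H \"%s\" '%s'\n" "$dc_hosts" "$dc_cmd"
    else
        parallel-ssh -i -t0 -H "$dc_hosts" "$dc_cmd"
    fi
}

# Possible use (not executed here):
#   drop_caches_all "ceph1 ceph2 ceph4 ceph5 ceph6"
```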

On the client node start the bench with:

# rados bench -p pbench 900 write
   899      16     17042     17026   75.7464        60  0.441918  0.844101
2012-07-25 18:31:51.752487min lat: 0.112051 max lat: 2.77934 avg lat: 0.844125
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   900      16     17056     17040   75.7245        56  0.197815  0.844125
   901      14     17056     17042   75.6493         8  0.599423  0.844166
 Total time run:         901.565911
Total writes made:      17056
Write size:             4194304
Bandwidth (MB/sec):     75.673

Stddev Bandwidth:       13.5071
Max bandwidth (MB/sec): 116
Min bandwidth (MB/sec): 8
Average Latency:        0.845532
Stddev Latency:         0.573691
Max latency:            3.38332
Min latency:            0.112051
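
The figures that go into the summary table can be extracted from the end of the rados bench output. A sketch assuming the summary labels shown above; rados_bench_summary is an illustrative name, and a "-" is printed for the standard deviation when the output has none (as in the read test).

```shell
# Extract bandwidth, its standard deviation and the average latency
# from rados bench output read on stdin.
rados_bench_summary() {
    awk -F': *' '
        { sub(/[ \t]+$/, "", $2) }                  # drop trailing blanks
        $1 == "Bandwidth (MB/sec)" { bw = $2 }
        $1 == "Stddev Bandwidth"   { sd = $2 }
        $1 == "Average Latency"    { lat = $2 }
        END { printf "bandwidth=%s stddev=%s latency=%s\n", bw, (sd == "" ? "-" : sd), lat }'
}

# Possible use (not executed here):
#   rados bench -p pbench 900 write | rados_bench_summary
```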

Then clean the disk cache again on all Ceph nodes with:

# echo 3 > /proc/sys/vm/drop_caches

And start the sequential read bench on the client:

# rados bench -p pbench 900 seq
...
   611      16     17010     16994   111.241       104   1.05852  0.574897
   612      16     17037     17021   111.236       108   1.17321  0.574932
   613      16     17056     17040   111.178        76   1.01611  0.574903
 Total time run:        613.339616
Total reads made:     17056
Read size:            4194304
Bandwidth (MB/sec):    111.234

Average Latency:       0.575252
Max latency:           1.65182
Min latency:           0.07418

To sum up, here are the corresponding results:

test name  bandwidth read  bandwidth write (StdDev)  latency read  latency write (StdDev)
RADOS      111.234MB/s     75.673MB/s (13.5071)      0.575252s     0.845532s (0.573691)

RBD block device benchmark

The goal of this benchmark is to figure out the read / write throughput and latency of RBD. Reading and writing to RBD is expected to be marginally slower than the same operation on a rados object. The difference accounts for the overhead of the block device abstraction layer.

Prepare a volume rbd on the client:

# rbd create --pool pbench --size 100000 rbd --secret client.admin
# rbd map --pool pbench rbd --secret client.admin
# rbd showmapped
id      pool    image   snap    device
0       pbench  rbd     -       /dev/rbd0
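
The device name may differ on your client, so it can be looked up from the showmapped output rather than assumed. A sketch based on the column layout shown above (id, pool, image, snap, device); rbd_device is a hypothetical helper name.

```shell
# Print the device mapped for pool "$1" and image "$2",
# reading `rbd showmapped` output from stdin.
rbd_device() {
    awk -v pool="$1" -v image="$2" '$2 == pool && $3 == image { print $5 }'
}

# Possible use (not executed here):
#   dev=$(rbd showmapped | rbd_device pbench rbd)
```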

The volume rbd is created with default values (i.e. the stripe size is 4M; the latency test uses this value).

Then begin the benchmark on the rbd volume, /dev/rbd0 here.

* Warning: this destroys all your data on /dev/rbd0 *

Clean the disk cache on all Ceph nodes and the rbd client with:

# echo 3 > /proc/sys/vm/drop_caches

Start the latency write bench on the rbd client:

# seekwatcher -t rbd-latency-write.trace -o rbd-latency-write.png -p 'dd if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct' -d /dev/rbd0
blktrace -d /dev/rbd0 -o rbd-latency-write.trace -b 2048 -a complete
running :dd if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct:
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB) copied, 299.768 s, 14.0 MB/s
done running dd if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct
....

So dd ran for 299.768s and wrote 1000 blocks, which gives a latency of ~300ms per 4M block.
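
The arithmetic is simply the elapsed time divided by the number of blocks, since dd with direct I/O issues one block at a time. As a sketch (dd_latency_ms is an illustrative name):

```shell
# Print the average per-block latency in milliseconds.
# $1: elapsed seconds reported by dd, $2: number of blocks (count=).
dd_latency_ms() {
    awk -v secs="$1" -v blocks="$2" 'BEGIN { printf "%.0f\n", secs * 1000 / blocks }'
}
```

For the write run above, dd_latency_ms 299.768 1000 prints 300.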


Clean the disk cache on all Ceph nodes and the rbd client with:

# echo 3 > /proc/sys/vm/drop_caches

Start the latency read bench on the rbd client:

# seekwatcher -t rbd-latency-read.trace -o rbd-latency-read.png -p 'dd if=/dev/rbd0 of=/dev/null bs=4M count=1000 iflag=direct' -d /dev/rbd0
blktrace -d /dev/rbd0 -o rbd-latency-read.trace -b 2048 -a complete
running :dd if=/dev/rbd0 of=/dev/null bs=4M count=1000 iflag=direct:
...

So dd ran for 39.0792s and read 1000 blocks, which gives a latency of ~39ms per 4M block.


Clean the disk cache on all Ceph nodes and the rbd client with:

# echo 3 > /proc/sys/vm/drop_caches

Start the throughput write bench on the rbd client:

# seekwatcher -t rbd-throughput-write.trace -o rbd-throughput-write.png -p 'dd if=/dev/zero of=/dev/rbd0 bs=1G count=1 oflag=direct' -d /dev/rbd0
blktrace -d /dev/rbd0 -o rbd-throughput-write.trace -b 2048 -a complete
running :dd if=/dev/zero of=/dev/rbd0 bs=1G count=1 oflag=direct:
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 45.2403 s, 23.7 MB/s
done running dd if=/dev/zero of=/dev/rbd0 bs=1G count=1 oflag=direct
...

The measured throughput is then 23.7 MB/s


Clean the disk cache on all Ceph nodes and the rbd client with:

# echo 3 > /proc/sys/vm/drop_caches

Start the throughput read bench on the rbd client:

# seekwatcher -t rbd-throughput-read.trace -o rbd-throughput-read.png -p 'dd if=/dev/rbd0 of=/dev/null bs=1G count=1 iflag=direct' -d /dev/rbd0
blktrace -d /dev/rbd0 -o rbd-throughput-read.trace -b 2048 -a complete
running :dd if=/dev/rbd0 of=/dev/null bs=1G count=1 iflag=direct:
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 9.17204 s, 117 MB/s
done running dd if=/dev/rbd0 of=/dev/null bs=1G count=1 iflag=direct
....

The measured throughput is then 117 MB/s
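
The throughput figures can be recomputed from the dd summary line (bytes copied divided by elapsed seconds), which is handy when collecting results from several runs. A sketch assuming the dd output format shown above, with 1 MB = 1000000 bytes as dd reports it; dd_throughput_mb is an illustrative name.

```shell
# Print the throughput in MB/s computed from a dd summary line on stdin.
# Bytes are the first field; elapsed seconds are the fourth field from the end.
dd_throughput_mb() {
    awk '/bytes .*copied/ { printf "%.1f\n", $1 / $(NF - 3) / 1000000 }'
}
```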


All graphs can be merged into a single one with:

# seekwatcher -t rbd-latency-read.trace -t rbd-latency-write.trace -t rbd-throughput-read.trace -t rbd-throughput-write.trace -l rbd-latency-read -l rbd-latency-write -l rbd-throughput-read -l rbd-throughput-write -o all.png

To sum up, here are the corresponding results:

test name  bandwidth read  bandwidth write (StdDev)  latency read  latency write (StdDev)
RBD        117MB/s         23.7MB/s                  39ms          300ms

REST API benchmark

The goal of this benchmark is to figure out the read / write throughput and latency of the REST API (i.e. the Amazon S3/Swift-compatible frontend).

Clean the disk cache on all Ceph nodes with:

# echo 3 > /proc/sys/vm/drop_caches

Then start the write bench on the client, against the radosgw server:

# rest-bench --api-host=ceph1.fqdn --bucket=bench --access-key=<KEY> --secret=<SECRET> --show-time --seconds 900 write
...
2012-07-25 17:44:21.220392   901      16      3978      3962   17.5878         0         -   3.62207
2012-07-25 17:44:22.220455   902      16      3978      3962   17.5683         0         -   3.62207
2012-07-25 17:44:23.220515   903      16      3978      3962   17.5488         0         -   3.62207
2012-07-25 17:44:24.220571   904      16      3979      3963   17.5338       0.8   7.22069   3.62298
2012-07-25 17:44:25.220752 Total time run:         904.396359
Total writes made:      3979
Write size:             4194304
Bandwidth (MB/sec):     17.598 

Stddev Bandwidth:       15.9672
Max bandwidth (MB/sec): 64
Min bandwidth (MB/sec): 0
Average Latency:        3.63591
Stddev Latency:         2.57325
Max latency:            13.9555
Min latency:            0.512203


Clean the disk cache on all Ceph nodes with:

# echo 3 > /proc/sys/vm/drop_caches

Then start the read bench on the client, against the radosgw server:

# rest-bench --api-host=ceph1.fqdn --bucket=bench --access-key=<KEY> --secret=<SECRET> --seconds 900 seq
2012-07-25 17:51:27.486578   390      16      3974      3958     40.59        32   3.77318   1.56981
2012-07-25 17:51:28.486690   391      16      3979      3963   40.5373        20   4.31407   1.57008
2012-07-25 17:51:29.486806   392      16      3979      3963   40.4339         0         -   1.57008
2012-07-25 17:51:30.487002 Total time run:        392.475304
Total reads made:     3979
Read size:            4194304
Bandwidth (MB/sec):    40.553

Average Latency:       1.57773
Max latency:           5.80392
Min latency:           0.039847


To sum up, here are the corresponding results:

test name  bandwidth read  bandwidth write (StdDev)  latency read  latency write (StdDev)
RADOSGW    40.553MB/s      17.598MB/s (15.9672)      1.57773s      3.63591s (2.57325)

On multiple clients

All the previous tests started on a client (i.e. rados bench, rbd and rest-api) can be launched on multiple clients in parallel with one of the following commands:

parallel-ssh -i -t0 -H "ceph-client1 ceph-client2 ceph-client3" '<bench command>'

or

pdsh -u root -Rssh -w "ceph-client1,ceph-client2,ceph-client3" '<bench command>'

For the REST benchmark, each client needs its own bucket name, so you can use:

parallel-ssh -i -t0 -H "ceph-client1 ceph-client2 ceph-client3" 'p=$(hostname -s) ; rest-bench --api-host=radosgw --bucket=$p --access-key=BENCH --secret=BENCH --seconds 30 write'

Benchmark Results

First Example

The benchmark examples were run on the following hardware / software combination:

2 Dell PowerEdge 1950, each with:
 - 2×Intel(R) Xeon(R) CPU L5410 @ 1.33GHz
 - 2 OSD disks of 250GB
 - 16G RAM
 - a 1Gbit link between servers
 - Debian wheezy
 - ceph 0.48

4 DCS5125, each with:
 - 2×AMD Athlon(tm) II X2 260u  @ 1.8GHz
 - 2 OSD disks of ~1TB
 - 8G RAM
 - a 1Gbit link between servers
 - Debian wheezy
 - ceph 0.48


Note: in the following tables, the column values are per client. For instance, 20MB/s in the bandwidth read column for RADOS (6) means that each of the 6 clients has a 20MB/s read throughput. The total is therefore 6 x 20MB/s = 120MB/s, which is consistent with the bandwidth read for a single client (111.764MB/s).
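
The same multiplication applies to any row measured with several clients. As a sketch (aggregate_bw is an illustrative name):

```shell
# Print the aggregate bandwidth: per-client value ($1) times number of clients ($2).
aggregate_bw() {
    awk -v per_client="$1" -v clients="$2" 'BEGIN { printf "%g\n", per_client * clients }'
}
```

For the RADOS (6) read figure, aggregate_bw 20 6 prints 120.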

With 2 OSD daemons per server, replication x2, the default crush map and default pgnum; OSD journals and data are on the same disks:

test name (nb clients)  bandwidth read  bandwidth write (StdDev)  latency read  latency write (StdDev)
OSD                     130MB/s         85MB/s                    NA            NA
RADOS                   111.764MB/s     91.480MB/s (18.7785)      0.572556s     0.699563s (0.543912)
RADOS (6)               20MB/s          17MB/s (10MB/s)           ...           3.38062s (2.39645)
RBD                     117MB/s         72MB/s                    40.5899ms     275.557ms
RBD (6)                 74.8MB/s        10.3MB/s                  ...           ...
RADOSGW                 79.206MB/s      16.860MB/s (14.6564)      0.805336s     3.78915s (2.74434)

With 2 OSD daemons per server, replication x2, the default crush map and default pgnum; OSD journals are on a tmpfs:

Test name (nb clients)                             Bandwidth read  Bandwidth write (StdDev)  Latency read  Latency write (StdDev)
OSD                                                130MB/s         90MB/s                    NA            NA
RADOS                                              111.854MB/s     109.139MB/s (6.16)        0.572126s     0.586355s (0.457155)
RADOS (6)                                          ~60MB/s         27MB/s (13MB/s)           ~1s           2.32s (1.8)
RBD                                                117MB/s         107MB/s                   40ms          94ms
RBD (6)                                            67.5MB/s        16.6MB/s                  185ms         300ms
RADOSGW                                            70.301MB/s      30.4MB/s (28.5897)        0.91s         2.10s (2.8)
RADOSGW (with apache and fastcgi with ceph patch)  78.6MB/s        32.639MB/s (29.7339)      0.81s         1.66s (1.72)

The limit of ~110MB/s appears to be the network speed of the client (1Gbit link).
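
As a rough sanity check, the raw ceiling of a link in MB/s is its speed in Mbit/s divided by 8, before any Ethernet, IP or TCP overhead. A sketch (link_mb_per_s is an illustrative name): a 1Gbit link tops out at 125MB/s, so a measured ~110MB/s is plausible for a saturated link.

```shell
# Print the raw ceiling of a link in MB/s, given its speed in Mbit/s.
link_mb_per_s() {
    echo $(( $1 / 8 ))
}
```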

Some results according to the number of PG

Some measurements of the CPU percentage and memory utilization of each OSD during a 90-second rados benchmark, with different numbers of PGs for the bench pool:

Graphs: %CPU utilization (Result-avg-cpu-64-write.png), resident memory utilization (Result-avg-res-64-write.png), virtual memory utilization (Result-avg-virt-64-write.png)

Other graphs can be found at http://dl.sileht.net/public/enovance/ceph-pgnum/

Helper scripts to do the monitoring, run the bench and generate the graphs are at http://dl.sileht.net/public/enovance/ceph-pgnum/script/ (warning: they need to be edited to change parameters, e.g. the node list).

Other example

The benchmark examples were run on the following hardware:

- ...
- ...

The software configuration is ....

test name (nb clients)  bandwidth read  bandwidth write (StdDev)  latency read  latency write (StdDev)
OSD
RADOS
RADOS (n)
RBD
RBD (n)
RADOSGW

Note: all of these results come from sequential operations.
