Benchmark
From Ceph wiki
Revision as of 15:30, 2 August 2012
This HOWTO is about benchmarking ceph.
** THIS PAGE IS NOT FINISHED YET **
Benchmark tools
- ceph tools
- rest-bench
- dd
- pssh or pdsh
- fio
- seekwatcher
- bonnie++
Preparation
This page was written on a ceph installation based on the Ceph and OpenStack Debian GNU/Linux howto. The OpenStack integration can be ignored.
# apt-get install blktrace python-matplotlib fio rest-bench collectl perftest linux-tools-3.2
# wget https://oss.oracle.com/~mason/seekwatcher/seekwatcher-0.12.tar.bz2
# tar -xjf seekwatcher-0.12.tar.bz2
# sed -i '755d' seekwatcher-0.12/seekwatcher   # small fix to work with this Debian version of python-matplotlib
# parallel-scp -H "ceph1 ceph2 ceph3 ceph4 ceph5 ceph6" seekwatcher-0.12/seekwatcher /usr/local/bin/seekwatcher
Benchmark procedure
Some information about the ceph setup:
Node names are: ceph1 ceph2 ceph3 ceph4 ceph5 ceph6
OSD id convention is <nodeid><diskid> (e.g. the second disk on the third node is 32)
OSD disks are mounted on /srv/ceph/osd<nodeid><diskid> (e.g. the second disk on the third node is mounted at /srv/ceph/osd32)
OSDs are on disks sda2 and sdb2 on each server
All my OSD ids are 11, 12, 31, 32, 41, 42, 51, 52, 61 and 62
My mon nodes are ceph2, ceph4 and ceph5
RADOSGW runs on ceph1
ceph3 acts as a client for the benchmarks that use the network.
Note:
No information about the crush map, the pgnum, the pool size or the hardware is given here.
The results shown as examples in the following instructions are not meant to be interpreted.
Benchmark results must be associated with a hardware configuration to be interpreted, as shown in #Benchmark Results
OSD benchmark
The goal of this benchmark is to assess the read and write speed of each OSD. These are compared with the read and write speed of the underlying device. The expected result is that there is no significant difference.
Start by running this command on a ceph node to get information about the disk read speed:
# parallel-ssh -t0 -iP -H "ceph1 ceph2 ceph4 ceph5 ceph6" "hdparm -t /dev/sda ; hdparm -t /dev/sdb"
[1] 15:25:55 [SUCCESS] ceph1
/dev/sda:
 Timing buffered disk reads: 476 MB in 3.01 seconds = 158.14 MB/sec
/dev/sdb:
 Timing buffered disk reads: 494 MB in 3.01 seconds = 164.30 MB/sec
[2] 15:25:56 [SUCCESS] ceph2
/dev/sda:
 Timing buffered disk reads: 368 MB in 3.00 seconds = 122.54 MB/sec
/dev/sdb:
 Timing buffered disk reads: 362 MB in 3.08 seconds = 117.54 MB/sec
[2] 15:25:56 [SUCCESS] ceph4
/dev/sda:
 Timing buffered disk reads: 296 MB in 3.01 seconds = 98.50 MB/sec
/dev/sdb:
 Timing buffered disk reads: 360 MB in 3.01 seconds = 119.52 MB/sec
...
So the disk read speed is ~120MB/sec.
A similar test can be done using the OSD device with the following command:
# ceph osd tell $id bench
The following command is run on one of the nodes of the ceph cluster:
# for id in 11 12 21 22 41 42 51 52 61 62; do ceph osd tell $id bench ; done
Each id uniquely identifies an OSD.
The bench results can be read from the logs:
# grep bench /var/log/ceph/ceph.log
2012-07-25 13:09:51.183160 osd.11 169.254.6.21:6800/16801 1185 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 18.328763 sec at 57209 KB/sec
2012-07-25 13:09:48.771528 osd.12 169.254.6.21:6803/16909 1258 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 15.902182 sec at 65939 KB/sec
2012-07-25 13:10:04.727132 osd.21 169.254.6.22:6800/4199 695 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 31.841067 sec at 32931 KB/sec
2012-07-25 13:10:05.624053 osd.41 169.254.6.24:6800/4481 464 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 32.707697 sec at 32058 KB/sec
2012-07-25 13:10:03.294986 osd.61 169.254.6.26:6800/3339 1216 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 30.320444 sec at 34583 KB/sec
2012-07-25 13:10:01.767423 osd.22 169.254.6.22:6803/4278 667 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 28.865160 sec at 36326 KB/sec
2012-07-25 13:10:02.157689 osd.62 169.254.6.26:6803/3422 1196 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 29.168095 sec at 35949 KB/sec
2012-07-25 13:10:02.397218 osd.42 169.254.6.24:6803/4552 464 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 29.466679 sec at 35585 KB/sec
2012-07-25 13:10:03.850164 osd.52 169.254.6.25:6803/4005 408 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 30.889020 sec at 33946 KB/sec
2012-07-25 13:10:04.240440 osd.51 169.254.6.25:6800/3936 380 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 31.295116 sec at 33506 KB/sec
Or wait for the results with:
# ceph -w
2012-07-25 13:09:51.183160 osd.11 169.254.6.21:6800/16801 1185 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 18.328763 sec at 57209 KB/sec
2012-07-25 13:09:48.771528 osd.12 169.254.6.21:6803/16909 1258 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 15.902182 sec at 65939 KB/sec
2012-07-25 13:10:04.727132 osd.21 169.254.6.22:6800/4199 695 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 31.841067 sec at 32931 KB/sec
2012-07-25 13:10:05.624053 osd.41 169.254.6.24:6800/4481 464 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 32.707697 sec at 32058 KB/sec
2012-07-25 13:10:03.294986 osd.61 169.254.6.26:6800/3339 1216 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 30.320444 sec at 34583 KB/sec
2012-07-25 13:10:01.767423 osd.22 169.254.6.22:6803/4278 667 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 28.865160 sec at 36326 KB/sec
2012-07-25 13:10:02.157689 osd.62 169.254.6.26:6803/3422 1196 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 29.168095 sec at 35949 KB/sec
2012-07-25 13:10:02.397218 osd.42 169.254.6.24:6803/4552 464 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 29.466679 sec at 35585 KB/sec
2012-07-25 13:10:03.850164 osd.52 169.254.6.25:6803/4005 408 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 30.889020 sec at 33946 KB/sec
2012-07-25 13:10:04.240440 osd.51 169.254.6.25:6800/3936 380 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 31.295116 sec at 33506 KB/sec
The write speed of the OSDs is about 35MB/s, except for two OSDs at about 60MB/s.
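The per-OSD figures can also be extracted from the log lines instead of being read by eye. The following is a minimal sketch, assuming the log format shown above, where the rate in KB/sec is the second-to-last field. The function name `osd_bench_summary` is hypothetical and not part of the ceph tools.

```shell
# Hypothetical helper (not part of ceph): read "bench:" log lines on stdin
# and print one "osd.<id> <rate> MB/s" line per OSD.
# Assumption: the rate in KB/sec is the second-to-last field, as in the
# log lines shown above.
osd_bench_summary() {
    awk '/bench: wrote/ { printf "%s %.1f MB/s\n", $3, $(NF-1) / 1024 }'
}

# Example with two of the log lines shown above.
osd_bench_summary <<'EOF'
2012-07-25 13:09:51.183160 osd.11 169.254.6.21:6800/16801 1185 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 18.328763 sec at 57209 KB/sec
2012-07-25 13:10:04.727132 osd.21 169.254.6.22:6800/4199 695 : [INF] bench: wrote 1024 MB in blocks of 4096 KB in 31.841067 sec at 32931 KB/sec
EOF
```

On a live cluster the same function could be fed with `grep bench /var/log/ceph/ceph.log | osd_bench_summary`.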
To sum up, we have:
| test name | bandwidth read | bandwidth write (StdDev) | latency read | latency write (StdDev) |
|---|---|---|---|---|
| OSD | 120MB/s | 35MB/s | | |
Then run another benchmark tool to confirm the results:
# bonnie++ -q -u root -d /srv/ceph/osd<nodeid><diskid> -m ceph<nodeid>-osd<nodeid><diskid>
Copy and paste all the output of this command into a single file (e.g. bench_result.csv), then convert it:
# bon_csv2txt < bench_result.csv > bench_result.txt
# bon_csv2html < bench_result.csv > bench_result.html
Bonnie++ results:
# cat bench_result.txt
Version 1.96 ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
ceph1-osd11 32160M 1154 99 113556 11 50219 6 1263 46 136658 9 526.3 26
Latency 7025us 74123us 764ms 1304ms 84768us 98644us
ceph1-osd12 32160M 1179 99 150674 14 65643 8 1348 49 172851 11 547.8 25
Latency 6899us 95868us 1179ms 1152ms 12513us 93919us
ceph3-osd31 32160M 1113 99 138012 13 59340 8 2659 98 164279 11 520.5 26
Latency 7723us 175ms 1162ms 9901us 1190ms 91338us
ceph3-osd32 32160M 1167 99 141259 13 61504 8 1291 47 164216 11 623.4 12
Latency 6995us 93199us 949ms 832ms 82967us 60616us
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
ceph1-osd11 16 27412 95 +++++ +++ 29144 95 27344 95 +++++ +++ 28128 94
Latency 424us 115us 182us 280us 17us 129us
ceph1-osd12 16 27130 94 +++++ +++ 29535 96 27534 96 +++++ +++ 28379 96
Latency 448us 114us 146us 272us 18us 135us
ceph3-osd31 16 27091 95 +++++ +++ 29012 95 27166 96 +++++ +++ 28130 96
Latency 324us 115us 146us 412us 26us 141us
ceph3-osd32 16 27235 95 +++++ +++ 29312 96 27270 95 +++++ +++ 28303 95
Latency 456us 118us 386us 234us 18us 453us
fio can be used too, for this kind of test:
# fio --size=32160m --directory=/srv/ceph/osd<nodeid><diskid> --direct=1 --name=rw_latency --rw=rw --blocksize=4m
# fio --size=32160m --directory=/srv/ceph/osd<nodeid><diskid> --direct=1 --name=rw_throutput --rw=rw --blocksize=1024m
# fio --size=32160m --directory=/srv/ceph/osd<nodeid><diskid> --direct=1 --name=random_rw --rw=randrw
Note the size of 32160m: a good value is at least twice the RAM size.
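That size can be derived from the machine itself. The following is a minimal sketch, assuming a Linux node where /proc/meminfo reports MemTotal in kB. The function name `fio_size_mb` is hypothetical.

```shell
# Hypothetical helper: print twice the RAM size in MB, for use as the
# fio --size value.
# Assumption: the file given as argument (default /proc/meminfo) has a
# "MemTotal: <n> kB" line, as on Linux.
fio_size_mb() {
    meminfo=${1:-/proc/meminfo}
    awk '/^MemTotal:/ { printf "%d\n", $2 * 2 / 1024 }' "$meminfo"
}

# Usage on an OSD node:
#   fio --size=$(fio_size_mb)m --directory=/srv/ceph/osd<nodeid><diskid> --direct=1 --name=rw_latency --rw=rw --blocksize=4m
```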
RADOS benchmark
The goal of this benchmark is to figure out the read / write throughput and latency of rados objects. Writing to a rados object will be slower than writing to an OSD, because it is replicated N times. Reading a rados object is expected to be faster than reading from an OSD because reads use all replicas.
Prepare a pool for the benchmark:
# ceph osd pool create pbench 768
The number of PGs should be between 50 and 100 times the number of OSDs.
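For the 10 OSDs of this setup, that rule gives a range of 500 to 1000, which is why 768 is used above. A minimal sketch of the arithmetic (the function name `pg_range` is hypothetical):

```shell
# Hypothetical helper: print the lower and upper bound for the number of
# PGs, following the "50 to 100 times the number of OSDs" rule above.
pg_range() {
    osds=$1
    echo "$((osds * 50)) $((osds * 100))"
}

# 10 OSDs, as in the setup described on this page.
pg_range 10
```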
Clean the disk cache on all Ceph nodes with:
# echo 3 > /proc/sys/vm/drop_caches
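To run this on every node at once, the command can be wrapped in parallel-ssh, as done for hdparm earlier. The following is a minimal sketch: the function name `drop_caches_command` is hypothetical, it only prints the command rather than running it, and the added `sync` is an assumption on my side (dirty pages are not dropped, so flushing them first makes the cache drop more complete).

```shell
# Hypothetical helper: print (not run) the parallel-ssh command that
# flushes and drops the page cache on every node given as argument.
drop_caches_command() {
    printf "parallel-ssh -i -t0 -H \"%s\" 'sync ; echo 3 > /proc/sys/vm/drop_caches'\n" "$*"
}

# Print the command for the OSD nodes of this setup.
drop_caches_command ceph1 ceph2 ceph4 ceph5 ceph6
```

The printed line can be reviewed and then pasted into a root shell.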
On the client node start the bench with:
# rados bench -p pbench 900 write
  899      16     17042     17026   75.7464        60  0.441918  0.844101
2012-07-25 18:31:51.752487min lat: 0.112051 max lat: 2.77934 avg lat: 0.844125
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
  900      16     17056     17040   75.7245        56  0.197815  0.844125
  901      14     17056     17042   75.6493         8  0.599423  0.844166
Total time run:         901.565911
Total writes made:      17056
Write size:             4194304
Bandwidth (MB/sec):     75.673
Stddev Bandwidth:       13.5071
Max bandwidth (MB/sec): 116
Min bandwidth (MB/sec): 8
Average Latency:        0.845532
Stddev Latency:         0.573691
Max latency:            3.38332
Min latency:            0.112051
Then clean the disk cache again on all Ceph nodes with:
# echo 3 > /proc/sys/vm/drop_caches
And start the sequential read bench on the client:
# rados bench -p pbench 900 seq
...
  611      16     17010     16994   111.241       104   1.05852  0.574897
  612      16     17037     17021   111.236       108   1.17321  0.574932
  613      16     17056     17040   111.178        76   1.01611  0.574903
Total time run:        613.339616
Total reads made:      17056
Read size:             4194304
Bandwidth (MB/sec):    111.234
Average Latency:       0.575252
Max latency:           1.65182
Min latency:           0.07418
To sum up, here are the corresponding results:
| test name | bandwidth read | bandwidth write (StdDev) | latency read | latency write (StdDev) |
|---|---|---|---|---|
| RADOS | 111.234MB/s | 75.673MB/s (13.5071) | 0.575252s | 0.845532s (0.573691) |
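The bandwidth and latency cells of this table come straight from the summary lines printed at the end of each run. The following is a minimal sketch for extracting them, assuming the summary format shown above with one "label: value" pair per line. The function name `rados_summary` is hypothetical.

```shell
# Hypothetical helper: read the output of a rados bench run on stdin and
# print "<bandwidth>MB/s <average latency>s".
# Assumption: the summary lines look like "Bandwidth (MB/sec): 111.234"
# and "Average Latency: 0.575252", as shown above.
rados_summary() {
    awk -F': *' '
        /^Bandwidth \(MB\/sec\)/ { bw = $2 }
        /^Average Latency/        { lat = $2 }
        END { printf "%sMB/s %ss\n", bw, lat }
    '
}

# Example with the summary of the sequential read run shown above.
rados_summary <<'EOF'
Total time run:        613.339616
Total reads made:      17056
Read size:             4194304
Bandwidth (MB/sec):    111.234
Average Latency:       0.575252
Max latency:           1.65182
Min latency:           0.07418
EOF
```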
RBD block device benchmark
The goal of this benchmark is to figure out the read / write throughput and latency of RBD. Reading and writing to RBD is expected to be marginally slower than the same operation on a rados object. The difference accounts for the overhead of the block device abstraction layer.
Prepare a volume rbd on the client:
# rbd create --pool pbench --size 100000 rbd --secret client.admin
# rbd map --pool pbench rbd --secret client.admin
# rbd showmapped
id pool image snap device
0 pbench rbd - /dev/rbd0
The volume rbd is created with the default values (i.e. the stripe size is 4M, and the latency tests use this value as their block size).
Then begin the benchmark on the rbd volume, /dev/rbd0 here.
* Warning: this destroys all your data on /dev/rbd0 *
Clean the disk cache on all Ceph nodes and the rbd client with:
# echo 3 > /proc/sys/vm/drop_caches
Start the latency write bench on the rbd client:
# seekwatcher -t rbd-latency-write.trace -o rbd-latency-write.png -p 'dd if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct' -d /dev/rbd0
blktrace -d /dev/rbd0 -o rbd-latency-write.trace -b 2048 -a complete
running :dd if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct:
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB) copied, 299.768 s, 14.0 MB/s
done running dd if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct
....
dd ran for 299.768 s to write 1000 blocks, so the average write latency is about 300ms per block.
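The latency figure is simply the elapsed time reported by dd divided by the number of blocks. A minimal sketch of that division (the function name `avg_latency_ms` is hypothetical):

```shell
# Hypothetical helper: average latency per block in milliseconds, given
# the elapsed time in seconds reported by dd and the block count.
avg_latency_ms() {
    awk -v secs="$1" -v blocks="$2" 'BEGIN { printf "%.1f\n", secs * 1000 / blocks }'
}

# Write test above: 299.768 s for 1000 blocks of 4M.
avg_latency_ms 299.768 1000
```

Because the blocks are written one after the other with oflag=direct, this is an average over sequential requests, not a per-request distribution.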
Clean the disk cache on all Ceph nodes and the rbd client with:
# echo 3 > /proc/sys/vm/drop_caches
Start the latency read bench on the rbd client:
# seekwatcher -t rbd-latency-read.trace -o rbd-latency-read.png -p 'dd if=/dev/rbd0 of=/dev/null bs=4M count=1000 iflag=direct' -d /dev/rbd0
blktrace -d /dev/rbd0 -o rbd-latency-read.trace -b 2048 -a complete
running :dd if=/dev/rbd0 of=/dev/null bs=4M count=1000 iflag=direct:
...
dd ran for 39.0792 s to read 1000 blocks, so the average read latency is about 39ms per block.
Clean the disk cache on all Ceph nodes and the rbd client with:
# echo 3 > /proc/sys/vm/drop_caches
Start the throughput write bench on the rbd client:
# seekwatcher -t rbd-throughput-write.trace -o rbd-throughput-write.png -p 'dd if=/dev/zero of=/dev/rbd0 bs=1G count=1 oflag=direct' -d /dev/rbd0
blktrace -d /dev/rbd0 -o rbd-throughput-write.trace -b 2048 -a complete
running :dd if=/dev/zero of=/dev/rbd0 bs=1G count=1 oflag=direct:
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 45.2403 s, 23.7 MB/s
done running dd if=/dev/zero of=/dev/rbd0 bs=1G count=1 oflag=direct
...
The measured throughput is then 23.7 MB/s
Clean the disk cache on all Ceph nodes and the rbd client with:
# echo 3 > /proc/sys/vm/drop_caches
Start the throughput read bench on the rbd client:
# seekwatcher -t rbd-throughput-read.trace -o rbd-throughput-read.png -p 'dd if=/dev/rbd0 of=/dev/null bs=1G count=1 iflag=direct' -d /dev/rbd0
blktrace -d /dev/rbd0 -o rbd-throughput-read.trace -b 2048 -a complete
running :dd if=/dev/rbd0 of=/dev/null bs=1G count=1 iflag=direct:
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 9.17204 s, 117 MB/s
done running dd if=/dev/rbd0 of=/dev/null bs=1G count=1 iflag=direct
....
The measured throughput is then 117 MB/s
All graphs can be merged into a single one with:
# seekwatcher -t rbd-latency-read.trace -t rbd-latency-write.trace -t rbd-throughput-read.trace -t rbd-throughput-write.trace -l rbd-latency-read -l rbd-latency-write -l rbd-throughput-read -l rbd-throughput-write -o all.png
To sum up, here are the corresponding results:
| test name | bandwidth read | bandwidth write (StdDev) | latency read | latency write (StdDev) |
|---|---|---|---|---|
| RBD | 117MB/s | 23.7 MB/s | 39ms | 300ms |
REST API benchmark
The goal of this benchmark is to figure out the read / write throughput and latency of the REST API (i.e. the Amazon / Swift compatible frontend served by radosgw).
Clean the disk cache on all Ceph nodes with:
# echo 3 > /proc/sys/vm/drop_caches
Then start the REST write bench on the client, against the radosgw server:
# rest-bench --api-host=ceph1.fqdn --bucket=bench --access-key=<KEY> --secret=<SECRET> --show-time --seconds 900 write
...
2012-07-25 17:44:21.220392 901 16 3978 3962 17.5878 0 - 3.62207
2012-07-25 17:44:22.220455 902 16 3978 3962 17.5683 0 - 3.62207
2012-07-25 17:44:23.220515 903 16 3978 3962 17.5488 0 - 3.62207
2012-07-25 17:44:24.220571 904 16 3979 3963 17.5338 0.8 7.22069 3.62298
2012-07-25 17:44:25.220752 Total time run: 904.396359
Total writes made:      3979
Write size:             4194304
Bandwidth (MB/sec):     17.598
Stddev Bandwidth:       15.9672
Max bandwidth (MB/sec): 64
Min bandwidth (MB/sec): 0
Average Latency:        3.63591
Stddev Latency:         2.57325
Max latency:            13.9555
Min latency:            0.512203
Clean the disk cache on all Ceph nodes with:
# echo 3 > /proc/sys/vm/drop_caches
Then start the REST read bench on the client, against the radosgw server:
# rest-bench --api-host=ceph1.fqdn --bucket=bench --access-key=<KEY> --secret=<SECRET> --seconds 900 seq
2012-07-25 17:51:27.486578 390 16 3974 3958 40.59 32 3.77318 1.56981
2012-07-25 17:51:28.486690 391 16 3979 3963 40.5373 20 4.31407 1.57008
2012-07-25 17:51:29.486806 392 16 3979 3963 40.4339 0 - 1.57008
2012-07-25 17:51:30.487002 Total time run: 392.475304
Total reads made:      3979
Read size:             4194304
Bandwidth (MB/sec):    40.553
Average Latency:       1.57773
Max latency:           5.80392
Min latency:           0.039847
To sum up, here are the corresponding results:
| test name | bandwidth read | bandwidth write (StdDev) | latency read | latency write (StdDev) |
|---|---|---|---|---|
| RADOSGW | 40.553MB/s | 17.598MB/s (15.9672) | 1.57773s | 3.63591s (2.57325) |
On multiple clients
All the previous tests started from a client (i.e. rados bench, rbd and rest-api) can be launched on multiple clients in parallel with one of the following commands:
parallel-ssh -i -t0 -H "ceph-client1 ceph-client2 ceph-client3" '<bench command>'
or
pdsh -u root -Rssh -w "ceph-client1,ceph-client2,ceph-client3" '<bench command>'
For the rest benchmark we need to change the bucket name for each client, so you can use:
parallel-ssh -i -t0 -H "ceph-client1 ceph-client2 ceph-client3" 'p=$(hostname -s) ; rest-bench --api-host=radosgw --bucket=$p --access-key=BENCH --secret=BENCH --seconds 30 write'
Benchmark Results
First Example
The benchmark examples were run on the following hardware / software combination:
2 Dell PowerEdge 1950, each with:
- 2×Intel(R) Xeon(R) CPU L5410 @ 1.33GHz
- 2 OSD disks of 250GB
- 16G RAM
- a 1Gbit link between servers
- Debian wheezy
- ceph 0.48
4 DCS5125, each with:
- 2×AMD Athlon(tm) II X2 260u @ 1.8GHz
- 2 OSD disks of ~1TB
- 8G RAM
- a 1Gbit link between servers
- Debian wheezy
- ceph 0.48
Note: in the following tables, the column values are per client. For instance, 20MB/s in the bandwidth read column for RADOS (6) means that each of the 6 clients has a 20MB/s read throughput. The total is therefore 6 x 20MB/s = 120MB/s, which is consistent with the bandwidth read for a single client (111.764MB/s).
With 2 OSD daemons per server, replication x2, the default crush map and the default pgnum; the journals and data of each OSD are on the same disk:
| test name (nb clients) | bandwidth read | bandwidth write (StdDev) | latency read | latency write (StdDev) |
|---|---|---|---|---|
| OSD | 130MB/s | 85MB/s | NA | NA |
| RADOS | 111.764MB/s | 91.480MB/s (18.7785) | 0.572556s | 0.699563s (0.543912) |
| RADOS (6) | 20MB/s | 17MB/s (10MB/s) | ... | 3.38062s (2.39645) |
| RBD | 117MB/s | 72MB/s | 40.5899ms | 275.557ms |
| RBD (6) | 74.8MB/s | 10.3MB/s | ... | ... |
| RADOSGW | 79.206MB/s | 16.860MB/s (14.6564) | 0.805336s | 3.78915s (2.74434) |
With 2 OSD daemons per server, replication x2, the default crush map and the default pgnum; the OSD journals are on a tmpfs:
| Test name (nb clients) | Bandwidth read | Bandwidth write (StdDev) | Latency read | Latency write (StdDev) |
|---|---|---|---|---|
| OSD | 130MB/s | 90MB/s | NA | NA |
| RADOS | 111.854MB/s | 109.139MB/s (6.16) | 0.572126s | 0.586355s (0.457155) |
| RADOS (6) | ~60MB/s | 27MB/s (13MB/s) | ~1s | 2.32s (1.8) |
| RBD | 117MB/s | 107MB/s | 40ms | 94ms |
| RBD (6) | 67.5MB/s | 16.6MB/s | 185ms | 300ms |
| RADOSGW | 70.301MB/s | 30.4MB/s (28.5897) | 0.91s | 2.10s (2.8) |
| RADOSGW (with apache and fastcgi with ceph patch) | 78.6MB/s | 32.639MB/s (29.7339) | 0.81s | 1.66s (1.72) |
The limit of ~110MB/s seems to be the network speed of the client (a 1Gbit link).
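As a rough check of that limit, a 1Gbit link carries at most 10^9 / 8 bytes per second before any protocol overhead. The following is a minimal sketch of the conversion, assuming the benchmark tools count a MB as 2^20 bytes (which matches the rados bench figures above: 17056 writes of 4194304 bytes in 901.57 s give 75.673). The function name `link_limit_mb` is hypothetical.

```shell
# Hypothetical helper: theoretical upper bound of a network link in MB/s,
# given its speed in Mbit/s, ignoring all protocol overhead.
# Assumption: 1 Mbit = 10^6 bits and 1 MB = 2^20 bytes.
link_limit_mb() {
    awk -v mbit="$1" 'BEGIN { printf "%.1f\n", mbit * 1000000 / 8 / 1048576 }'
}

# 1Gbit link, as on the servers used here.
link_limit_mb 1000
```

Ethernet, IP and TCP framing take a few percent off that bound, so measured values slightly above 110MB/s are about what a saturated 1Gbit client link is expected to deliver.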
Some results according to the number of PG
Some measurements of the CPU percentage and the memory utilization of each OSD during a 90-second rados benchmark, with different numbers of PGs for the bench pool:
%CPU utilization, Resident memory utilization, Virtual memory utilization
Other graphs can be found at http://dl.sileht.net/public/enovance/ceph-pgnum/
Helper scripts for the monitoring, the bench and the graph generation are available at http://dl.sileht.net/public/enovance/ceph-pgnum/script/ (warning: they need to be edited to change parameters, e.g. the node list).
Other example
The benchmark examples were run on the following hardware:
- ... - ...
The software configuration is ....
| test name (nb clients) | bandwidth read | bandwidth write (StdDev) | latency read | latency write (StdDev) |
|---|---|---|---|---|
| OSD | | | | |
| RADOS | | | | |
| RADOS (n) | | | | |
| RBD | | | | |
| RBD (n) | | | | |
| RADOSGW | | | | |
Note: all of these results come from sequential operations.