Benchmarking Ceph erasure code plugins

loic

The erasure code implementation in Ceph relies on the jerasure library. It is packaged into a plugin that is dynamically loaded by erasure coded pools.
The ceph_erasure_code_benchmark tool was implemented to help benchmark competing erasure code plugin implementations and to find the best parameters for a given plugin. It shows the jerasure technique cauchy_good with a packet size of 3072 to be the most efficient on an Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz when compiled with gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5). The test was done assuming each object is spread over six OSDs and two extra OSDs are used for parity (K=6 and M=2).

  • Encoding: 4.2GB/s
  • Decoding: no processing necessary (because the code is systematic)
  • Recovering the loss of one OSD: 10GB/s
  • Recovering the loss of two OSDs: 3.2GB/s

The processing is done on the primary OSDs and is therefore distributed across the Ceph cluster. Encoding and decoding are an order of magnitude faster than the throughput of typical storage hardware.
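To put K=6 and M=2 in perspective: each object is cut into K=6 data chunks plus M=2 coding chunks, any 6 of the 8 chunks are enough to rebuild the object, and the storage overhead is (K+M)/K = 1.33. A quick back of the envelope sketch (the 4MB object size is only an example, it is not what the benchmark uses):

# with K=6 data chunks and M=2 coding chunks, an object is cut into 8 chunks
# and any 6 of them are enough to rebuild it
K=6 M=2 OBJECT=$((4 * 1024 * 1024))   # hypothetical 4MB object
echo "chunk size       : $((OBJECT / K)) bytes"
echo "bytes stored     : $(( (K + M) * (OBJECT / K) )) bytes"
echo "storage overhead : $(echo "scale=2; ($K + $M) / $K" | bc)"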

Ceph is compiled from sources with:

./autogen.sh ; ./configure ; make

which builds the ceph_erasure_code_benchmark tool.
The results of the erasure code bench script (which relies on ceph_erasure_code_benchmark) were produced on an Intel(R) Xeon(R) CPU E3-1245 V2 @ 3.40GHz, with Ceph compiled with gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5).

CEPH_ERASURE_CODE_BENCHMARK=src/ceph_erasure_code_benchmark \
PLUGIN_DIRECTORY=src/.libs \
qa/workunits/erasure-code/bench.sh

They can be interpreted as follows:

seconds   KB       plugin   k  m  work.   iter.  size     eras.
0.612510  1048576  example  2  1  encode  1024   1048576  0
0.317254  1048576  example  2  1  decode  1024   1048576  1

The first line used the example plugin to encode 1048576KB (1GB) in 0.612510 seconds, i.e. ~1.6GB/s. The measure was done by iterating 1024 times over a 1048576 byte (1MB) buffer. The second line used the example plugin to decode 1048576KB (1GB) with 1 chunk erased (last column) in 0.317254 seconds, i.e. ~3.1GB/s. The measure was done by iterating 1024 times over a 1048576 byte (1MB) buffer that had been encoded once.
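The figures can be recomputed from any line of the output: the total amount of data processed is iterations * size bytes, and the rate is that total divided by the elapsed seconds. For instance, with a little awk (not part of bench.sh):

# fields: $1 seconds, $3 plugin, $6 workload, $7 iterations, $8 size in bytes
echo "0.612510 1048576 example 2 1 encode 1024 1048576 0" |
    awk '{ printf("%s %s: %.2f GB/s\n", $3, $6, $7 * $8 / $1 / (1024 * 1024 * 1024)) }'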
When using the Jerasure Ceph plugin and a Reed-Solomon technique to sustain the loss of two OSDs (i.e. K=6 and M=2), the results are:

seconds   KB       plugin    k  m  work.   iter.  size     eras.
0.103921  1048576  jerasure  6  2  decode  1024   1048576  1
0.277644  1048576  jerasure  6  2  decode  1024   1048576  2
0.238322  1048576  jerasure  6  2  encode  1024   1048576  0

The first line shows that if 1 OSD is lost (erased), it can be recovered at a rate of 10GB/s (1/0.103921). If 2 OSDs are lost, recovering both of them can be done at a rate of 3.6GB/s (1/0.277644). Encoding can be done at a rate of 4.2GB/s (1/0.238322).
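Since each run processes 1GB in total (the KB column), these rates are simply the inverse of the elapsed seconds, which can be double checked with:

# 1GB processed per run, so GB/s ~= 1 / seconds; prints 9.62, 3.60 and 4.19
echo "scale=2; 1/0.103921; 1/0.277644; 1/0.238322" | bc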
The corresponding jerasure technique is cauchy_good with a packet size of 3072:

--parameter erasure-code-packetsize=3072 --parameter erasure-code-technique=cauchy_good
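For reference, assuming a Ceph release where erasure code profiles are available (this is not what the benchmark above uses), the same parameters would translate into a profile along these lines (the profile and pool names are made up):

# hypothetical profile and pool, mirroring the benchmark parameters above
ceph osd erasure-code-profile set cauchygood \
    plugin=jerasure k=6 m=2 \
    technique=cauchy_good packetsize=3072
ceph osd pool create ecpool 128 128 erasure cauchygood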

A single call was then profiled, reducing the number of iterations from 1024 to 10 because valgrind makes the run significantly slower:

valgrind --tool=callgrind src/ceph_erasure_code_benchmark \
    --plugin jerasure \
    --workload encode \
    --iterations 10 \
    --size 1048576 \
    --parameter erasure-code-k=6 \
    --parameter erasure-code-m=2 \
    --parameter erasure-code-directory=.libs \
    --parameter erasure-code-technique=cauchy_good \
    --parameter erasure-code-packetsize=3072

It shows that 97% of the time is spent in table lookups.
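The callgrind output written by the run above (callgrind.out.<pid>, the file name depends on the process id) can be inspected with the stock valgrind tooling, for instance:

# annotate the profile produced by the valgrind run above
callgrind_annotate callgrind.out.* | head -30

kcachegrind gives a more convenient graphical view of the same data.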