Custom data placement with CRUSH
From Ceph wiki
Latest revision as of 16:06, 14 November 2010
Data replicas are placed on OSDs via a CRUSH placement rule. By default, all OSDs are placed in a single pool, and replicas are placed on N (2, by default) pseudorandom nodes. This is fine for small, simple clusters.
For other situations, it is problematic. For example, if each host has two disks, we may want to run two independent cosd daemons, one for each disk. That way, a single disk failure only takes out a single disk's worth of data.
We could use btrfs to combine the disks into a single volume, but it will be less reliable overall as btrfs normally only replicates metadata. Or if we have 3 or more disks we could use MD, DM, or some other hardware or software RAID. But let's assume for now we don't want to use RAID.
CRUSH hierarchy
CRUSH works by describing the storage cluster in a hierarchy that reflects its physical organization. For example, let's say each host has three disks, each rack has 30 hosts, and we have some number of racks. The result is a hierarchy of racks, hosts, and devices.
Generally speaking, it is more important that this hierarchy reflect the underlying hardware infrastructure than physical location (although location may be important too). That is, if each rack has its own power circuits and network gear, then the hierarchy should be by rack.
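As an illustration only, the example hierarchy can be modelled as a nested structure. The sketch below is plain Python, not Ceph code; the function name build_hierarchy and the naming scheme (rack0, host0, device0) are assumptions made for this example.

```python
def build_hierarchy(num_racks, hosts_per_rack=30, disks_per_host=3):
    """Return {rack: {host: [devices]}} for a uniform cluster (illustrative only)."""
    hierarchy = {}
    device_id = 0
    host_id = 0
    for r in range(num_racks):
        hosts = {}
        for _ in range(hosts_per_rack):
            # Each host owns the next `disks_per_host` consecutive devices.
            devices = [f"device{device_id + d}" for d in range(disks_per_host)]
            device_id += disks_per_host
            hosts[f"host{host_id}"] = devices
            host_id += 1
        hierarchy[f"rack{r}"] = hosts
    return hierarchy


cluster = build_hierarchy(num_racks=2)
print(len(cluster))               # number of racks
print(len(cluster["rack0"]))      # hosts in one rack
print(cluster["rack0"]["host0"])  # devices on one host
```

Each level of nesting corresponds to one bucket type in the CRUSH hierarchy (rack, host), with devices as the leaves.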
CRUSH rules
We can then write a rule that describes how replicas are placed in that hierarchy. In the above example, we definitely want to separate replicas across hosts: we don't want to end up with both replicas of data object X on two disks in the same machine, because if that machine crashes we lose access to both replicas. We may also want to separate replicas across racks, so that a single power circuit failure doesn't take out both replicas. What policy you settle on depends on the size and structure of your hierarchy.
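To make the separation goal concrete, here is a small Python sketch that checks whether a given set of replicas spans distinct hosts or racks. The location table and the function names are hypothetical; Ceph itself does not expose such a helper.

```python
def failure_domains(placement, location, level):
    """Return the set of buckets at `level` ('host' or 'rack') that hold a replica."""
    return {location[device][level] for device in placement}


def is_separated(placement, location, level):
    """True if every replica lands in a different bucket at `level`."""
    return len(failure_domains(placement, location, level)) == len(placement)


# Hypothetical device locations: two disks share host0, two racks in total.
location = {
    "device0": {"host": "host0", "rack": "rack0"},
    "device1": {"host": "host0", "rack": "rack0"},
    "device2": {"host": "host1", "rack": "rack0"},
    "device3": {"host": "host2", "rack": "rack1"},
}

print(is_separated(["device0", "device1"], location, "host"))  # same machine
print(is_separated(["device0", "device2"], location, "host"))  # different hosts
print(is_separated(["device0", "device2"], location, "rack"))  # same rack
print(is_separated(["device0", "device3"], location, "rack"))  # different racks
```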
Manipulating a CRUSH map
The CRUSH map (the hierarchy and rules) can be viewed and manipulated with crushtool. To decode a map into plaintext (for editing, etc.):
$ crushtool -d file -o file.txt
To recompile a plaintext map:
$ crushtool -c file.txt -o file
Building an initial map
You can also construct an initial map to get started with crushtool. The hierarchy above used three disks per host; for this command, say each host has two disks, each rack has 30 hosts, and there are 120 disks/OSDs in total.
$ crushtool --num_osds 120 -o file --build host straw 2 rack straw 30 root straw 0
The '--build' option is followed by some number of triples, each consisting of name, algorithm, and size. The name should be a useful, descriptive name for that type of item (in this case, 'host', 'rack', and 'root'). The algorithm is one of 'straw', 'uniform', 'tree', or 'list'. Straw is a good all-around default, as it has ideal rebalancing characteristics. It can be a bit slow for large buckets, however, so consider 'tree' if size is large. Finally, size is the number of child items in each bucket of that type; a size of 0 means a single bucket holding however many items there are. In this example, each host has two disks, each rack has 30 hosts, root has however many racks we end up with, and we use the straw algorithm for everything.
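The arithmetic behind the triples can be sketched in Python. This is not how crushtool is implemented; it only counts buckets per layer, assuming that a size of 0 means one bucket holding everything and that any leftover items form a final, smaller bucket. The function name layer_counts is made up for this example.

```python
import math


def layer_counts(num_osds, layers):
    """Count the buckets created at each layer.

    layers: list of (name, algorithm, size) triples, in the order given to --build.
    Assumes size 0 means 'one bucket holding everything' and that leftover
    items form a final, smaller bucket.
    """
    counts = []
    items = num_osds
    for name, _alg, size in layers:
        buckets = 1 if size == 0 else math.ceil(items / size)
        counts.append((name, buckets))
        items = buckets  # the next layer groups the buckets just created
    return counts


layers = [("host", "straw", 2), ("rack", "straw", 30), ("root", "straw", 0)]
print(layer_counts(120, layers))  # [('host', 60), ('rack', 2), ('root', 1)]
```

So 120 OSDs at two per host give 60 hosts, which at 30 per rack give 2 racks under a single root.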
You can view or modify the resulting map with
$ crushtool -d file -o file.txt
$ vi file.txt # or whatever
Writing rules
By default, the generated map has only a single rule, called 'data'. The initial OSDMap, however, contains 4 pools by default (data, metadata, casdata, and rbd), which are set to use CRUSH rules 0..3, respectively. The default data rule can simply be duplicated for the other pools, with the name and ruleset adjusted accordingly.
Rules normally follow the pattern
rule data {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take root
    step choose firstn 0 type device
    step emit
}
That is, when placing object replicas, we start at the root of the hierarchy and choose N items of type 'device'. (The '0' means choose as many items as there are replicas requested. Rules are written to be general for some range of N, 1-10 in this case, as set by min_size and max_size.)
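Duplicating the default rule for the four default pools, with the name and ruleset adjusted, can be done by hand or generated. The Python sketch below only produces text to paste into the decoded map; the template mirrors the rule pattern shown above, and the helper name rules_for_pools is an assumption of this example.

```python
RULE_TEMPLATE = """rule {name} {{
    ruleset {ruleset}
    type replicated
    min_size 1
    max_size 10
    step take root
    step choose firstn 0 type device
    step emit
}}"""


def rules_for_pools(pool_names):
    """Return one copy of the default rule per pool, numbered from ruleset 0."""
    return "\n".join(
        RULE_TEMPLATE.format(name=name, ruleset=index)
        for index, name in enumerate(pool_names)
    )


rules_text = rules_for_pools(["data", "metadata", "casdata", "rbd"])
print(rules_text)
```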
In our original example, if we want to separate replicas across hosts, the steps would be
    step take root
    step chooseleaf firstn 0 type host
    step emit
which chooses leaf items under N hosts. Alternatively, if we want replicas separated across racks,
    step take root
    step chooseleaf firstn 0 type rack
    step emit
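The effect of a chooseleaf step can be imitated with a toy model in Python. This is deliberately not the real CRUSH algorithm: it ignores weights and bucket algorithms and simply ranks candidates by a hash of the object name. The tree, the names, and the functions are all hypothetical; the point is only that N distinct buckets of the given type are picked first, then one device beneath each.

```python
import hashlib

# Toy hierarchy: 3 racks, 2 hosts per rack, 1 device per host (names are made up).
TREE = {
    "root": ["rack0", "rack1", "rack2"],
    "rack0": ["host0", "host1"],
    "rack1": ["host2", "host3"],
    "rack2": ["host4", "host5"],
    "host0": ["device0"], "host1": ["device1"], "host2": ["device2"],
    "host3": ["device3"], "host4": ["device4"], "host5": ["device5"],
}


def type_of(name):
    """Derive the bucket type from the name, e.g. 'rack2' -> 'rack'."""
    return name.rstrip("0123456789")


def draw(obj, item):
    """Deterministic pseudorandom score for an (object, item) pair."""
    return int(hashlib.sha256(f"{obj}:{item}".encode()).hexdigest(), 16)


def leaves(tree, bucket):
    """All devices beneath `bucket`; a name with no children is itself a device."""
    children = tree.get(bucket)
    if children is None:
        return [bucket]
    return [leaf for child in children for leaf in leaves(tree, child)]


def buckets_of_type(tree, start, bucket_type):
    """All buckets of `bucket_type` at or below `start`."""
    if type_of(start) == bucket_type:
        return [start]
    return [b for child in tree.get(start, [])
            for b in buckets_of_type(tree, child, bucket_type)]


def chooseleaf_firstn(tree, obj, n, bucket_type, start="root"):
    """Pick n distinct buckets of bucket_type, then one device under each."""
    candidates = buckets_of_type(tree, start, bucket_type)
    chosen = sorted(candidates, key=lambda b: draw(obj, b), reverse=True)[:n]
    return [max(leaves(tree, b), key=lambda d: draw(obj, d)) for b in chosen]


print(chooseleaf_firstn(TREE, "objectX", 2, "rack"))
print(chooseleaf_firstn(TREE, "objectX", 2, "host"))
```

Because the buckets are chosen before the devices, two replicas can never share a rack in the first call or a host in the second, which is the property the rules above are after.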
Using a custom CRUSH map during mkfs
You can specify a CRUSH map to use during mkfs with either
$ mkcephfs --crushmap file ...
or
$ mkcephfs --crushmapsrc file.txt ...
Changing the CRUSH map for an online system
You can get the currently active CRUSH map with
$ ceph osd getcrushmap -o file
You can either use that as a basis for editing, or construct a fresh map with crushtool.
When you're done, you can import a new crush map into the system with
$ ceph osd setcrushmap -i file
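Put together, one full edit cycle on a live cluster uses four of the commands shown on this page. The Python sketch below only assembles the command lines and prints them; it does not run anything. The function name and the default file names are placeholders chosen for this example.

```python
def crush_edit_commands(binary="file", text="file.txt"):
    """Return the command lines for one fetch/decode/recompile/inject cycle.

    The map is edited by hand between the decode and recompile steps.
    """
    return [
        ["ceph", "osd", "getcrushmap", "-o", binary],  # fetch the active map
        ["crushtool", "-d", binary, "-o", text],       # decode to plaintext
        ["crushtool", "-c", text, "-o", binary],       # recompile after editing
        ["ceph", "osd", "setcrushmap", "-i", binary],  # inject the new map
    ]


for argv in crush_edit_commands():
    print("$", " ".join(argv))
```

Each list could be handed to subprocess.run if you want to script the cycle, with the manual edit happening between the second and third commands.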
Example crush map
A CRUSH map with 12 hosts, 4 hosts per rack and 3 racks.
# Generated with: crushtool --num_osds 12 -o file --build host straw 1 rack straw 4 root straw 0
# Decoded with: crushtool -d file -o file.txt
# device0 matches osd0, device1 matches osd1, etc, etc
# begin crush map
# devices
device 0 device0
device 1 device1
device 2 device2
device 3 device3
device 4 device4
device 5 device5
device 6 device6
device 7 device7
device 8 device8
device 9 device9
device 10 device10
device 11 device11
# types
type 0 device
type 1 host
type 2 rack
type 3 root
# buckets
host host0 {
    id -1
    alg straw
    hash 0
    item device0 weight 1.000
}
host host1 {
    id -2
    alg straw
    hash 0
    item device1 weight 1.000
}
host host2 {
    id -3
    alg straw
    hash 0
    item device2 weight 1.000
}
host host3 {
    id -4
    alg straw
    hash 0
    item device3 weight 1.000
}
host host4 {
    id -5
    alg straw
    hash 0
    item device4 weight 1.000
}
host host5 {
    id -6
    alg straw
    hash 0
    item device5 weight 1.000
}
host host6 {
    id -7
    alg straw
    hash 0
    item device6 weight 1.000
}
host host7 {
    id -8
    alg straw
    hash 0
    item device7 weight 1.000
}
host host8 {
    id -9
    alg straw
    hash 0
    item device8 weight 1.000
}
host host9 {
    id -10
    alg straw
    hash 0
    item device9 weight 1.000
}
host host10 {
    id -11
    alg straw
    hash 0
    item device10 weight 1.000
}
host host11 {
    id -12
    alg straw
    hash 0
    item device11 weight 1.000
}
rack rack0 {
    id -13
    alg straw
    hash 0
    item host0 weight 1.000
    item host1 weight 1.000
    item host2 weight 1.000
    item host3 weight 1.000
}
rack rack1 {
    id -14
    alg straw
    hash 0
    item host4 weight 1.000
    item host5 weight 1.000
    item host6 weight 1.000
    item host7 weight 1.000
}
rack rack2 {
    id -15
    alg straw
    hash 0
    item host8 weight 1.000
    item host9 weight 1.000
    item host10 weight 1.000
    item host11 weight 1.000
}
root root {
    id -16
    alg straw
    hash 0
    item rack0 weight 4.000
    item rack1 weight 4.000
    item rack2 weight 4.000
}
# rules
rule data {
    ruleset 1
    type replicated
    min_size 2
    max_size 2
    step take root
    step chooseleaf firstn 0 type rack
    step emit
}
# end crush map
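When editing a decoded map by hand, it is easy to leave a bucket's weight out of step with its contents. In the example map each bucket's weight in its parent equals the sum of its own item weights (rack0 holds four hosts of weight 1.000 and appears in root with weight 4.000). The Python sketch below checks that property on a shortened, hypothetical excerpt in the same syntax; the regular expressions and function names are assumptions of this example, not part of crushtool.

```python
import re

BUCKET_RE = re.compile(r"(\w+)\s+(\w+)\s*\{(.*?)\}", re.S)
ITEM_RE = re.compile(r"item\s+(\w+)\s+weight\s+([\d.]+)")

# Shortened, hypothetical excerpt in the decoded-map syntax.
SAMPLE = """
host host0 {
    id -1
    alg straw
    hash 0
    item device0 weight 1.000
}
host host1 {
    id -2
    alg straw
    hash 0
    item device1 weight 1.000
}
rack rack0 {
    id -13
    alg straw
    hash 0
    item host0 weight 1.000
    item host1 weight 1.000
}
root root {
    id -16
    alg straw
    hash 0
    item rack0 weight 2.000
}
"""


def parse_buckets(text):
    """Return {bucket_name: {item_name: weight}}; blocks without items (rules) are skipped."""
    buckets = {}
    for _type, name, body in BUCKET_RE.findall(text):
        items = {item: float(weight) for item, weight in ITEM_RE.findall(body)}
        if items:
            buckets[name] = items
    return buckets


def weight_mismatches(buckets):
    """List (parent, child, stated_weight, summed_weight) where the two differ."""
    bad = []
    for parent, items in buckets.items():
        for child, weight in items.items():
            if child in buckets:
                total = sum(buckets[child].values())
                if abs(total - weight) > 1e-9:
                    bad.append((parent, child, weight, total))
    return bad


print(weight_mismatches(parse_buckets(SAMPLE)))  # [] when weights are consistent
```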