The contents of this wiki are no longer actively maintained. The most current documentation is available at http://ceph.com/docs.

Custom data placement with CRUSH

From Ceph wiki

Revision as of 16:06, 14 November 2010 by Smurf (Talk | contribs)

Data replicas are placed on OSDs via a CRUSH placement rule. By default, all OSDs are placed in a single pool, and replicas are placed on N (2, by default) pseudorandomly chosen OSDs. This is fine for small, simple clusters.

In other situations, this default is problematic. For example, if each host has two disks, we may want to run an independent cosd daemon for each disk. That way, a single disk failure only takes out a single disk's worth of data. With the default rule, however, both replicas of an object could land on two disks in the same host.

We could use btrfs to combine the disks into a single volume, but that would be less reliable overall, as btrfs normally replicates only metadata. Or, if we have 3 or more disks, we could use MD, DM, or some other hardware or software RAID. But let's assume for now that we don't want to use RAID.

CRUSH hierarchy

CRUSH works by describing the storage cluster in a hierarchy that reflects its physical organization. For example, let's say each host has three disks, each rack has 30 hosts, and we have some number of racks. The result is a hierarchy of racks, hosts, and devices.

Generally speaking, it is more important that this hierarchy reflect the underlying hardware infrastructure than physical location (although location may be important too). That is, if each rack has its own power circuits and network gear, then the hierarchy should be by rack.

CRUSH rules

We can then write a rule that describes how replicas are placed in that hierarchy. In the above example, we definitely want to separate replicas across hosts, so that we don't end up with both replicas of data object X on two disks in the same machine; if that machine crashes, we lose access to both replicas. We may also want to separate replicas across racks, so that a single power circuit failure doesn't take out both replicas. What policy you settle on depends on the size and structure of your hierarchy.
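
The host-separation policy can be made concrete with a small shell check. This is only a sketch: the function names host_of and replicas_separated are made up for illustration, and it assumes OSD ids are assigned to hosts in consecutive runs (so with 2 disks per host, osd0 and osd1 share a host).

```shell
# Assumption: OSD ids are assigned to hosts in consecutive runs,
# e.g. with 2 disks per host, osd0 and osd1 both live on host 0.
host_of() {   # usage: host_of OSD_ID DISKS_PER_HOST
    echo $(( $1 / $2 ))
}

# Succeeds (exit status 0) when the two replicas sit on different hosts.
replicas_separated() {   # usage: replicas_separated OSD_A OSD_B DISKS_PER_HOST
    [ "$(host_of "$1" "$3")" -ne "$(host_of "$2" "$3")" ]
}
```

With 2 disks per host, replicas_separated 0 1 2 fails (both replicas share one machine), while replicas_separated 0 2 2 succeeds.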

Manipulating a CRUSH map

The CRUSH map (the hierarchy and rules) can be viewed and manipulated with crushtool. To decode a map into plaintext (for editing, etc.):

$ crushtool -d file -o file.txt

To recompile a plaintext map:

$ crushtool -c file.txt -o file
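
As a sanity check, the two commands above can be chained into a round trip: decode, recompile, decode again, and compare the two plaintext versions. The helper name crush_roundtrip is hypothetical, and the sketch assumes that decoding is deterministic, so identical maps produce identical text.

```shell
# Decode a compiled map, recompile the text, and decode the result again.
# If nothing was lost along the way, the two plaintext files are identical.
crush_roundtrip() {   # usage: crush_roundtrip COMPILED_MAP
    map=$1
    tmp=$(mktemp -d) || return 1
    crushtool -d "$map" -o "$tmp/a.txt" &&
    crushtool -c "$tmp/a.txt" -o "$tmp/recompiled" &&
    crushtool -d "$tmp/recompiled" -o "$tmp/b.txt" &&
    diff "$tmp/a.txt" "$tmp/b.txt"
    status=$?          # status of the last step that ran
    rm -rf "$tmp"
    return $status
}
```

Calling crush_roundtrip file returns 0 when the map survives the round trip and prints a diff otherwise.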

Building an initial map

You can also construct an initial map with crushtool to get started. For this example, say we have 2 disks per host, 30 hosts per rack, and 120 total disks/OSDs.

$ crushtool --num_osds 120 -o file --build host straw 2 rack straw 30 root straw 0

The '--build' option is followed by some number of triples, each consisting of name, algorithm, and size. The name should be a useful, descriptive name for that type of item (in this case, 'host', 'rack', and 'root'). The algorithm is one of 'straw', 'uniform', 'tree', or 'list'. Straw is a good all-around default, as it has ideal rebalancing characteristics. It can be a bit slow for large buckets, however, so consider 'tree' if size is large. Finally, size is the number of child items in each bucket of that type; a size of 0 puts all remaining items into a single bucket. In this example, each host has two disks, each rack has 30 hosts, the root has however many racks we end up with, and we use the straw algorithm for everything.
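
To see how the triples stack up, the number of buckets in each layer can be worked out with shell arithmetic. This is an illustrative sketch, not part of crushtool: layer_counts is a made-up helper, and it assumes that a partly filled last bucket still counts as a bucket.

```shell
# Print how many buckets each --build layer would contain.
# Arguments mirror the command line: NUM_OSDS, then NAME ALG SIZE triples.
layer_counts() {
    items=$1; shift
    while [ $# -ge 3 ]; do
        name=$1; size=$3; shift 3
        if [ "$size" -eq 0 ]; then
            count=1                                  # size 0: one bucket holds everything
        else
            count=$(( (items + size - 1) / size ))   # round up for a partly filled bucket
        fi
        echo "$name $count"
        items=$count                                 # this layer feeds the next one
    done
}
```

For the command above, layer_counts 120 host straw 2 rack straw 30 root straw 0 reports 60 hosts, 2 racks, and 1 root.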

You can view or modify the resulting map with

$ crushtool -d file -o file.txt
$ vi file.txt                   # or whatever

Writing rules

By default, the generated map has only a single rule, called 'data'. The initial OSDMap, however, contains 4 pools (data, metadata, casdata, and rbd), which are set to use CRUSH rulesets 0..3, respectively. The default data rule can simply be duplicated for the other pools, with the name and ruleset adjusted accordingly.

Rules normally follow the pattern

rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take root
        step choose firstn 0 type device
        step emit
}

That is, when placing object replicas, we start at the root of the hierarchy and choose N items of type 'device'. (The '0' means to choose as many items as there are replicas. Rules are written to be general for some range of N, 1-10 in this case, as set by min_size and max_size.)
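
Since the four default pools expect rulesets 0..3, the pattern above can be stamped out once per pool and pasted into the rules section of a decoded map. A minimal sketch; the function name default_rules is made up, and the rule body is copied unchanged from the pattern above.

```shell
# Emit one rule per default pool, with rulesets numbered 0..3
# in the order data, metadata, casdata, rbd.
default_rules() {
    ruleset=0
    for pool in data metadata casdata rbd; do
        cat <<EOF
rule $pool {
        ruleset $ruleset
        type replicated
        min_size 1
        max_size 10
        step take root
        step choose firstn 0 type device
        step emit
}
EOF
        ruleset=$((ruleset + 1))
    done
}
```

Running default_rules >> file.txt appends the four rules to a plaintext map; remove the original 'data' rule first so that it is not defined twice.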

In our original example, if we want to separate replicas across hosts, the steps would be

        step take root
        step chooseleaf firstn 0 type host
        step emit

which chooses leaf items under N hosts. Alternatively, if we want replicas separated across racks,

        step take root
        step chooseleaf firstn 0 type rack
        step emit

Using a custom CRUSH map during mkfs

You can specify a CRUSH map to use during mkfs, either as a compiled map

$ mkcephfs --crushmap file ...

or as a plaintext map

$ mkcephfs --crushmapsrc file.txt ...

Changing the CRUSH map for an online system

You can get the currently active CRUSH map with

$ ceph osd getcrushmap -o file

You can either use that as a basis for editing, or construct a fresh map with crushtool.

When you're done, you can import a new crush map into the system with

$ ceph osd setcrushmap -i file
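
Put together, the whole edit cycle for a running cluster might look like the sketch below. The function name edit_live_crushmap is hypothetical; it uses only the commands shown on this page, plus whatever editor $EDITOR names (falling back to vi). Each step runs only if the previous one succeeded, so a map that fails to compile is never installed.

```shell
# Fetch the active map, edit it as plaintext, and push the result back.
edit_live_crushmap() {
    work=$(mktemp -d) || return 1
    ceph osd getcrushmap -o "$work/map" &&
    crushtool -d "$work/map" -o "$work/map.txt" &&
    ${EDITOR:-vi} "$work/map.txt" &&
    crushtool -c "$work/map.txt" -o "$work/map.new" &&
    ceph osd setcrushmap -i "$work/map.new"
    status=$?          # status of the last step that ran
    rm -rf "$work"
    return $status
}
```

It is worth keeping a copy of the original map somewhere safe before running something like this, so the old placement can be restored.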

Example crush map

A CRUSH map with 12 hosts, 4 hosts per rack and 3 racks.

# Generated with: crushtool --num_osds 12 -o file --build host straw 1 rack straw 4 root straw 0
# Decoded with: crushtool -d file -o file.txt

# device0 matches osd0, device1 matches osd1, etc, etc

# begin crush map

# devices
device 0 device0
device 1 device1
device 2 device2
device 3 device3
device 4 device4
device 5 device5
device 6 device6
device 7 device7
device 8 device8
device 9 device9
device 10 device10
device 11 device11

# types
type 0 device
type 1 host
type 2 rack
type 3 root

# buckets
host host0 {
	id -1
	alg straw
	hash 0
	item device0 weight 1.000
}
host host1 {
	id -2
	alg straw
	hash 0
	item device1 weight 1.000
}
host host2 {
	id -3
	alg straw
	hash 0
	item device2 weight 1.000
}
host host3 {
	id -4
	alg straw
	hash 0
	item device3 weight 1.000
}
host host4 {
	id -5
	alg straw
	hash 0
	item device4 weight 1.000
}
host host5 {
	id -6
	alg straw
	hash 0
	item device5 weight 1.000
}
host host6 {
	id -7
	alg straw
	hash 0
	item device6 weight 1.000
}
host host7 {
	id -8
	alg straw
	hash 0
	item device7 weight 1.000
}
host host8 {
	id -9
	alg straw
	hash 0
	item device8 weight 1.000
}
host host9 {
	id -10
	alg straw
	hash 0
	item device9 weight 1.000
}
host host10 {
	id -11
	alg straw
	hash 0
	item device10 weight 1.000
}
host host11 {
	id -12
	alg straw
	hash 0
	item device11 weight 1.000
}

rack rack0 {
	id -13
	alg straw
	hash 0
	item host0 weight 1.000
	item host1 weight 1.000
	item host2 weight 1.000
	item host3 weight 1.000
}
rack rack1 {
	id -14
	alg straw
	hash 0
	item host4 weight 1.000
	item host5 weight 1.000
	item host6 weight 1.000
	item host7 weight 1.000
}
rack rack2 {
	id -15
	alg straw
	hash 0
	item host8 weight 1.000
	item host9 weight 1.000
	item host10 weight 1.000
	item host11 weight 1.000
}
root root {
	id -16
	alg straw
	hash 0
	item rack0 weight 4.000
	item rack1 weight 4.000
	item rack2 weight 4.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 2
	max_size 2
	step take root
	step chooseleaf firstn 0 type rack
	step emit
}

# end crush map
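
A quick way to confirm that a decoded map has the shape you expect is to count its buckets by type. This is a sketch using awk; count_buckets is a made-up helper, and it assumes every bucket starts with a line of the form "TYPE NAME {", as in the map above.

```shell
# Count bucket declarations of each type in a decoded (plaintext) map.
# Bucket lines look like "TYPE NAME {"; rule blocks are skipped.
count_buckets() {   # usage: count_buckets MAP.txt
    awk '$3 == "{" && $1 != "rule" { n[$1]++ }
         END { for (t in n) print t, n[t] }' "$1" | sort
}
```

For the example map above, count_buckets file.txt should report 12 hosts, 3 racks, and 1 root.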