Admin Guide :: Replacing a Failed Disk in a Ceph Cluster
¶
Replacing a Failed Disk from Ceph a Cluster ¶
Do you have a ceph cluster , great , you are awesome ; so very soon you would face this .
- Check your cluster health
# ceph status
cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec
monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2
830, quorum 0,1,2 node01-ib,node06-ib,node11-ib
mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
osdmap e8871: 153 osds: 152 up, 152 in
pgmap v1409465: 66256 pgs, 30 pools, 201 TB data, 51906 kobjects
90439 GB used, 316 TB / 413 TB avail
66250 active+clean
6 stale+peering
- Login to any ceph node and search for failed disk
#ceph osd tree | grep -i down
9 2.63 osd.9 up 1
17 2.73 osd.17 up 1
30 2.73 osd.30 up 1
53 2.73 osd.53 up 1
65 2.73 osd.65 up 1
78 2.73 osd.78 up 1
89 2.73 osd.89 up 1
99 2.73 osd.99 down 0
113 2.73 osd.113 up 1
128 2.73 osd.128 up 1
141 2.73 osd.141 up 1
- Now you have identified which node's OSD is failed and what's the OSD number. Login to that node and check if that OSD is mounted , IT would be NOT ( as its failed )
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 47G 6.0G 38G 14% /
tmpfs 12G 0 12G 0% /dev/shm
/dev/sdd1 2.8T 197G 2.5T 8% /var/lib/ceph/osd/ceph-30
/dev/sde1 2.8T 172G 2.6T 7% /var/lib/ceph/osd/ceph-53
/dev/sdc1 2.8T 264G 2.5T 10% /var/lib/ceph/osd/ceph-17
/dev/sdh1 2.8T 227G 2.5T 9% /var/lib/ceph/osd/ceph-89
/dev/sdf1 2.8T 169G 2.6T 7% /var/lib/ceph/osd/ceph-65
/dev/sdi1 2.8T 150G 2.6T 6% /var/lib/ceph/osd/ceph-113
/dev/sdb1 2.7T 1.3T 1.3T 51% /var/lib/ceph/osd/ceph-9
/dev/sdj1 2.8T 1.6T 1.2T 58% /var/lib/ceph/osd/ceph-128
/dev/sdg1 2.8T 237G 2.5T 9% /var/lib/ceph/osd/ceph-78
/dev/sdk1 2.8T 1.5T 1.3T 53% /var/lib/ceph/osd/ceph-141
- Go to your datacenter with a new physical drive and replace the drive physically , i assume depending on enterprise server that you are using it should be hot swappable , These days almost all servers support hot swapping of disks , but still you should check for your server model . In this example i was using HPDL380 server.
- Once you have replaced the drive physically , wait for some time so that new drive gets stable.
- Login to your node and make OSD out from cluster . Please remember , the OSD was already DOWN and OUT as soon as disk was failed . Ceph takes care of OSD and if its not available it marks it down and moves it out of cluster.
# ceph osd out osd.99
osd.99 is already out.
# service ceph stop osd.99
/etc/init.d/ceph: osd.99 not found (/etc/ceph/ceph.conf defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78 , /var/lib/ceph defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78)
# service ceph status osd.99
/etc/init.d/ceph: osd.99 not found (/etc/ceph/ceph.conf defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78 , /var/lib/ceph defines osd.9 osd.30 osd.17 osd.128 osd.65 osd.141 osd.89 osd.53 osd.113 osd.78)
# service ceph status
=== osd.9 ===
osd.9: running {"version":"0.72.1"}
=== osd.30 ===
osd.30: running {"version":"0.72.1"}
=== osd.17 ===
osd.17: running {"version":"0.72.1"}
=== osd.128 ===
osd.128: running {"version":"0.72.1"}
=== osd.65 ===
osd.65: running {"version":"0.72.1"}
=== osd.141 ===
osd.141: running {"version":"0.72.1"}
=== osd.89 ===
osd.89: running {"version":"0.72.1"}
=== osd.53 ===
osd.53: running {"version":"0.72.1"}
=== osd.113 ===
osd.113: running {"version":"0.72.1"}
=== osd.78 ===
osd.78: running {"version":"0.72.1"}
- Now remove this failed OSD from Crush Map , as soon as its removed from crush map , ceph starts making PG copies that were located on this failed disk and it places these PG on other disks. So a recovery process will start.
# ceph osd crush remove osd.99
removed item id 99 name 'osd.99' from crush map
# ceph status
cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
health HEALTH_WARN 43 pgs backfill; 56 pgs backfilling; 9 pgs peering; 82 pgs recovering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 192 pgs st
uck unclean; 4 requests are blocked > 32 sec; recovery 373488/106903578 objects degraded (0.349%)
monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2
836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
osdmap e8946: 153 osds: 152 up, 152 in
pgmap v1409604: 66256 pgs, 30 pools, 201 TB data, 51916 kobjects
1
1 GB used, 316 TB / 413 TB avail
1
1 /106903578 objects degraded (0.349%)
1 active
66060 active+clean
1
1
1 active+remapped+wait_backfill
3 peering
1 active+remapped
1
1
1 active+remapped+backfilling
6 stale+peering
4 active+clean+scrubbing+deep
1
1
1 active+recovering
recovery io 159 MB/s, 39 objects/s
- (optional) Check disk statistics , it looks nice and after some times ( depending on data that your FAILED disk holds ) it completes
# dstat 10
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
2 3 95 1 0 0|2223k 5938k| 0 0 |1090B 2425B|5853 11k
14 58 1 25 0 2| 130M 627M| 219M 57M|6554B 0 | 28k 111k
14 57 1 26 0 2| 106M 743M| 345M 32M| 0 4096B| 35k 73k
13 61 1 23 0 2| 138M 680M| 266M 67M| 83k 0 | 31k 82k
14 52 1 31 0 2| 99M 574M| 230M 32M| 48k 6963B| 27k 78k
14 51 2 31 0 2| 99M 609M| 291M 31M| 0 0 | 29k 83k
11 57 1 28 0 2| 118M 636M| 214M 57M|9830B 0 | 26k 92k
12 49 4 34 0 1| 97M 432M| 166M 48M| 35k 0 | 22k 100k
13 44 3 38 0 1| 95M 422M| 183M 46M| 0 0 | 22k 88k
13 52 3 30 0 2| 96M 510M| 207M 44M| 0 0 | 25k 109k
14 49 3 32 0 2| 96M 568M| 276M 37M| 16k 0 | 27k 72k
9 54 4 31 0 2| 109M 520M| 136M 45M| 0 0 | 20k 89k
14 44 5 35 0 1| 76M 444M| 192M 13M| 0 0 | 22k 54k
15 47 3 34 0 2| 101M 452M| 141M 20M|3277B 13k| 21k 79k
17 48 3 31 0 1| 108M 445M| 181M 16M| 0 200k| 23k 69k
17 48 3 30 0 1| 154M 406M| 138M 23M| 0 0 | 21k 75k
17 53 3 27 0 1| 169M 399M| 115M 23M| 0 396k| 21k 81k
13 45 4 36 0 1| 161M 330M| 131M 20M| 0 397k| 20k 90k
11 51 5 33 0 1| 116M 416M| 145M 1177k| 0 184k| 20k 69k
14 50 4 31 0 1| 144M 376M| 124M 8752k| 0 0 | 20k 72k
14 42 6 37 0 1| 142M 340M| 138M 19M| 0 0 | 19k 79k
15 47 6 32 0 1| 111M 427M| 129M 11M| 0 819B| 19k 66k
15 50 5 29 0 1| 163M 413M| 139M 5709k| 58k 0 | 20k 90k
14 49 4 32 0 1| 155M 395M| 91M 12M| 0 0 | 18k 93k
18 43 7 31 0 1| 166M 338M| 84M 6493k| 0 0 | 17k 81k
14 49 5 32 0 1| 179M 335M| 98M 3824k| 0 0 | 18k 91k
13 46 9 31 0 1| 157M 299M| 72M 14M| 0 0 | 17k 125k
17 42 9 30 0 1| 188M 269M| 82M 11M| 16k 0 | 16k 102k
22 35 15 27 0 1| 158M 167M|8932k 287k| 0 0 | 13k 88k
7 20 46 26 0 0| 118M 12M| 250k 392k| 0 82k|9333 61k
7 17 60 16 0 0| 124M 1638B| 236k 225k| 0 0 |7512 64k
7 16 63 14 0 0| 117M 1005k| 247k 238k| 0 0 |7429 60k
3 9 82 5 0 0| 41M 17M| 225k 225k| 0 0 |6049 27k
4 8 81 7 0 0| 56M 7782B| 227k 225k| 0 6144B|5933 33k
4 9 79 7 0 0| 60M 9011B| 248k 245k| 0 9011B|6457 36k
4 9 79 7 0 0| 58M 236k| 231k 230k| 0 14k|6210 35k
# ceph status
cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec
monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2
836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
osdmap e9045: 153 osds: 152 up, 152 in
pgmap v1409957: 66256 pgs, 30 pools, 200 TB data, 51705 kobjects
90448 GB used, 316 TB / 413 TB avail
66250 active+clean
6 stale+peering
- Once data recovery is done , go ahead delete keyrings for that OSD and finally remove OSD
# ceph auth del osd.99
updated
# ceph osd rm osd.99
removed osd.99
#ceph status
cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec
monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2
836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
osdmap e9046: 152 osds: 152 up, 152 in
pgmap v1409971: 66256 pgs, 30 pools, 200 TB data, 51705 kobjects
90445 GB used, 316 TB / 413 TB avail
66250 active+clean
6 stale+peering
- Remove entry of this OSD from ceph.conf ( if its present ) , make sure you keep all the nodes ceph.conf file updated . You can push the new configuration file to entire cluster using # ceph admin command.
- Time to create new OSD for the physical disk that we have inserted , You would see, ceph will create new OSD with the same OSD number , that was failed , as we have removed failed OSD cleanly , if you see a different OSD number , it means that you have not cleanly removed failed OSD.
# ceph osd create
99
# ceph status
cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
health HEALTH_WARN 6 pgs peering; 6 pgs stale; 6 pgs stuck inactive; 6 pgs stuck stale; 6 pgs stuck unclean; 2 requests are blocked > 32 sec
monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2
836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
osdmap e9047: 153 osds: 152 up, 152 in
pgmap v1409988: 66256 pgs, 30 pools, 200 TB data, 51705 kobjects
90442 GB used, 316 TB / 413 TB avail
66250 active+clean
6 stale+peering
- List the disk , zap it and deploy it again
# ceph-deploy disk list node14
[ceph_deploy.cli][INFO ] Invoked (1.3.2): /usr/bin/ceph-deploy disk list node14
[node14][DEBUG ] connected to host: node14
[node14][DEBUG ] detect platform information from remote host
[node14][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: CentOS 6.4 Final
[ceph_deploy.osd][DEBUG ] Listing disks on node14...
[node14][INFO ] Running command: ceph-disk list
[node14][DEBUG ] /dev/sda :
[node14][DEBUG ] /dev/sda1 other, ext4, mounted on /
[node14][DEBUG ] /dev/sda2 swap, swap
[node14][DEBUG ] /dev/sdb :
[node14][DEBUG ] /dev/sdb1 ceph data, active, cluster ceph, osd.9, journal /dev/sdb2
[node14][DEBUG ] /dev/sdb2 ceph journal, for /dev/sdb1
[node14][DEBUG ] /dev/sdc :
[node14][DEBUG ] /dev/sdc1 ceph data, active, cluster ceph, osd.17, journal /dev/sdc2
[node14][DEBUG ] /dev/sdc2 ceph journal, for /dev/sdc1
[node14][DEBUG ] /dev/sdd :
[node14][DEBUG ] /dev/sdd1 ceph data, active, cluster ceph, osd.30, journal /dev/sdd2
[node14][DEBUG ] /dev/sdd2 ceph journal, for /dev/sdd1
[node14][DEBUG ] /dev/sde :
[node14][DEBUG ] /dev/sde1 ceph data, active, cluster ceph, osd.53, journal /dev/sde2
[node14][DEBUG ] /dev/sde2 ceph journal, for /dev/sde1
[node14][DEBUG ] /dev/sdf :
[node14][DEBUG ] /dev/sdf1 ceph data, active, cluster ceph, osd.65, journal /dev/sdf2
[node14][DEBUG ] /dev/sdf2 ceph journal, for /dev/sdf1
[node14][DEBUG ] /dev/sdg :
[node14][DEBUG ] /dev/sdg1 ceph data, active, cluster ceph, osd.78, journal /dev/sdg2
[node14][DEBUG ] /dev/sdg2 ceph journal, for /dev/sdg1
[node14][DEBUG ] /dev/sdh :
[node14][DEBUG ] /dev/sdh1 ceph data, active, cluster ceph, osd.89, journal /dev/sdh2
[node14][DEBUG ] /dev/sdh2 ceph journal, for /dev/sdh1
[node14][DEBUG ] /dev/sdi other, btrfs
[node14][DEBUG ] /dev/sdj :
[node14][DEBUG ] /dev/sdj1 ceph data, active, cluster ceph, osd.113, journal /dev/sdj2
[node14][DEBUG ] /dev/sdj2 ceph journal, for /dev/sdj1
[node14][DEBUG ] /dev/sdk :
[node14][DEBUG ] /dev/sdk1 ceph data, active, cluster ceph, osd.128, journal /dev/sdk2
[node14][DEBUG ] /dev/sdk2 ceph journal, for /dev/sdk1
[node14][DEBUG ] /dev/sdl :
[node14][DEBUG ] /dev/sdl1 ceph data, active, cluster ceph, osd.141, journal /dev/sdl2
[node14][DEBUG ] /dev/sdl2 ceph journal, for /dev/sdl1
# ceph-deploy disk zap node14:sdi
[ceph_deploy.cli][INFO ] Invoked (1.3.2): /usr/bin/ceph-deploy disk zap node14:sdi
[ceph_deploy.osd][DEBUG ] zapping /dev/sdi on node14
[node14][DEBUG ] connected to host: node14
[node14][DEBUG ] detect platform information from remote host
[node14][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: CentOS 6.4 Final
[node14][DEBUG ] zeroing last few blocks of device
[node14][INFO ] Running command: sgdisk --zap-all --clear --mbrtogpt -- /dev/sdi
[node14][DEBUG ] Creating new GPT entries.
[node14][DEBUG ] GPT data structures destroyed! You may now partition the disk using fdisk or
[node14][DEBUG ] other utilities.
[node14][DEBUG ] The operation has completed successfully.
# ceph-deploy --overwrite-conf osd prepare node14:sdi
[ceph_deploy.cli][INFO ] Invoked (1.3.2): /usr/bin/ceph-deploy --overwrite-conf osd prepare node14:sdi
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks node14:/dev/sdi:
[node14][DEBUG ] connected to host: node14
[node14][DEBUG ] detect platform information from remote host
[node14][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: CentOS 6.4 Final
[ceph_deploy.osd][DEBUG ] Deploying osd to node14
[node14][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[node14][INFO ] Running command: udevadm trigger --subsystem-match=block --action=add
[ceph_deploy.osd][DEBUG ] Preparing host node14 disk /dev/sdi journal None activate False
[node14][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdi
[node14][ERROR ] INFO:ceph-disk:Will colocate journal with data on /dev/sdi
[node14][DEBUG ] The operation has completed successfully.
[node14][DEBUG ] The operation has completed successfully.
[node14][DEBUG ] meta-data=/dev/sdi1 isize=2048 agcount=32, agsize=22884224 blks
[node14][DEBUG ] = sectsz=512 attr=2, projid32bit=0
[node14][DEBUG ] data = bsize=4096 blocks=732295168, imaxpct=5
[node14][DEBUG ] = sunit=64 swidth=64 blks
[node14][DEBUG ] naming =version 2 bsize=4096 ascii-ci=0
[node14][DEBUG ] log =internal log bsize=4096 blocks=357568, version=2
[node14][DEBUG ] = sectsz=512 sunit=64 blks, lazy-count=1
[node14][DEBUG ] realtime =none extsz=4096 blocks=0, rtextents=0
[node14][DEBUG ] The operation has completed successfully.
[ceph_deploy.osd][DEBUG ] Host node14 is now ready for osd use.
- Check the new OSD
# ceph osd tree
9 2.63 osd.9 up 1
17 2.73 osd.17 up 1
30 2.73 osd.30 up 1
53 2.73 osd.53 up 1
65 2.73 osd.65 up 1
78 2.73 osd.78 up 1
89 2.73 osd.89 up 1
113 2.73 osd.113 up 1
128 2.73 osd.128 up 1
141 2.73 osd.141 up 1
99 2.73 osd.99 up 1
# ceph status
cluster c452b7df-0c0b-4005-8feb-fc3bb92407f5
health HEALTH_WARN 186 pgs backfill; 12 pgs backfilling; 6 pgs peering; 57 pgs recovering; 887 pgs recovery_wait; 6 pgs stale; 6 pgs stuck inactive; 6 pgs
stuck stale; 283 pgs stuck unclean; 2 requests are blocked > 32 sec; recovery 784023/106982434 objects degraded (0.733%)
monmap e6: 3 mons at {node01-ib=10.168.100.101:6789/0,node06-ib=10.168.100.106:6789/0,node11-ib=10.168.100.111:6789/0}, election epoch 2
836, quorum 0,1,2 node01-ib,node06-ib,node11-ib
mdsmap e58: 1/1/1 up {0=node01-ib=up:active}
osdmap e9190: 153 osds: 153 up, 153 in
pgmap v1413041: 66256 pgs, 30 pools, 200 TB data, 51840 kobjects
90504 GB used, 319 TB / 416 TB avail
784023/106982434 objects degraded (0.733%)
65108 active+clean
186 active+remapped+wait\_backfill
887 active+recovery\_wait
12 active+remapped+backfilling
6 stale+peering
57 active+recovering
recovery io 383 MB/s, 95 objects/s
- You would notice ceph will start putting PG ( data ) on this new OSD , so as to rebalance data and to make this new osd to participate cluster.
Yes at this stage you are done with replacement.