Using bcache to Accelerate Ceph OSD Performance

Overview

In a Ceph deployment we usually put the OSD journal on an SSD while the OSD data disk is an ordinary SATA disk. In practice the SATA disk frequently turns out to be the bottleneck that limits OSD performance, so can we squeeze more out of the SSD to speed up the OSD?

The answer is yes: an SSD caching layer can accelerate the SATA disk that serves as the OSD data disk. The common options are:

  • flashcache
  • bcache

Published benchmarks comparing these two caching schemes show bcache performing much better, so this article describes how to use bcache to accelerate an OSD.

Test Environment

Ceph version: Jewel 10.2.9

Each physical machine used as a Ceph node has the following disks:

  1. SSD - 745 GB, three disks
  2. SATA - 3.7 TB, nine disks

The disks are laid out as follows:

  1. Each SSD is split into three 10 GB partitions, used as the journals for three SATA-disk OSDs
  2. The remaining space on each SSD becomes a single partition, used by bcache to cache three SATA disks

Before partitioning, the disks look like this:

root@ceph0:~/yangguanjun# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
sdb 8:16 0 745.2G 0 disk
sdc 8:32 0 745.2G 0 disk
sdd 8:48 0 745.2G 0 disk

After partitioning:

root@ceph0:~# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
...
sdb 8:16 0 745.2G 0 disk
├─sdb1 8:17 0 10G 0 part
├─sdb2 8:18 0 10G 0 part
├─sdb3 8:19 0 10G 0 part
└─sdb4 8:20 0 715.2G 0 part
sdc 8:32 0 745.2G 0 disk
├─sdc1 8:33 0 10G 0 part
├─sdc2 8:34 0 10G 0 part
├─sdc3 8:35 0 10G 0 part
└─sdc4 8:36 0 715.2G 0 part
sdd 8:48 0 745.2G 0 disk
├─sdd1 8:49 0 10G 0 part
├─sdd2 8:50 0 10G 0 part
├─sdd3 8:51 0 10G 0 part
└─sdd4 8:52 0 715.2G 0 part
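
The partitioning commands themselves are not shown; a sketch of how this layout could be created with sgdisk is below (device name and partition labels are assumptions, repeat for each SSD):

# Hypothetical partitioning sketch -- the exact commands are not given in the
# article. Three 10G journal partitions plus one partition spanning the rest,
# shown here for /dev/sdb; repeat for /dev/sdc and /dev/sdd.
sgdisk --new=1:0:+10G --change-name=1:"ceph journal" /dev/sdb
sgdisk --new=2:0:+10G --change-name=2:"ceph journal" /dev/sdb
sgdisk --new=3:0:+10G --change-name=3:"ceph journal" /dev/sdb
sgdisk --largest-new=4 --change-name=4:"bcache cache" /dev/sdb
partprobe /dev/sdb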

Device Performance

To understand the environment, first measure the raw performance of each device. The fio results are:

Disk type   read       write      randread     randwrite
SATA        155 MB/s   158 MB/s   126 IOPS     219 IOPS
SSD         508 MB/s   426 MB/s   69.2k IOPS   45.5k IOPS
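
The fio parameters are not given in the article; the sketch below shows one way such a 4K random-write test could be run (the device path, block size, queue depth and runtime are assumptions, and the test overwrites the raw device):

# Hypothetical fio job -- parameters are assumed, not taken from the article.
# Destructive: writes directly to the raw device; swap --filename per device.
fio --name=randwrite-test \
    --filename=/dev/sde \
    --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k \
    --iodepth=32 --numjobs=1 \
    --runtime=60 --time_based \
    --group_reporting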

The bcache setup for the SATA and SSD disks follows this earlier article: http://www.yangguanjun.com/2018/03/26/lvm-sata-ssd-bcache/

The command to set up bcache is:

# make-bcache -B /dev/sde -C /dev/sdc4

The block devices after setup:

# lsblk
...
sdc 8:32 0 745.2G 0 disk
├─sdc1 8:33 0 10G 0 part
├─sdc2 8:34 0 10G 0 part
├─sdc3 8:35 0 10G 0 part
└─sdc4 8:36 0 715.2G 0 part
├─bcache0 253:0 0 3.7T 0 disk
...
sde 8:64 0 3.7T 0 disk
└─bcache0 253:0 0 3.7T 0 disk

Then test the performance of the bcache device:

bcache cache mode       randwrite
bcache [writethrough]   218 IOPS
bcache [writeback]      38.8k IOPS

The results show that with bcache in writeback mode the cached device is very fast, only slightly slower than the raw SSD. The data is later flushed back to the SATA disk, and iostat shows the SATA disk staying busy for quite a while afterwards.
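
The writethrough/writeback rows above correspond to bcache's cache_mode setting, which is switched through sysfs. A minimal sketch, assuming the device is bcache0:

# switch bcache0 to writeback (bcache defaults to writethrough)
echo writeback > /sys/block/bcache0/bcache/cache_mode

# verify -- the active mode is shown in brackets
cat /sys/block/bcache0/bcache/cache_mode
# writethrough [writeback] writearound none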

Deploying the OSD

By default, deploying the bcache device directly with ceph-deploy fails with the following error:

# ceph-deploy osd prepare ceph0:/dev/bcache0:/dev/sdd1
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.37): /usr/bin/ceph-deploy osd prepare ceph0:/dev/bcache0:/dev/sdd1
[ceph_deploy.cli][INFO ] ceph-deploy options:
[ceph_deploy.cli][INFO ] username : None
[ceph_deploy.cli][INFO ] disk : [('ceph0', '/dev/bcache0', '/dev/sdd1')]
[ceph_deploy.cli][INFO ] dmcrypt : False
[ceph_deploy.cli][INFO ] verbose : False
[ceph_deploy.cli][INFO ] bluestore : None
[ceph_deploy.cli][INFO ] overwrite_conf : False
[ceph_deploy.cli][INFO ] subcommand : prepare
[ceph_deploy.cli][INFO ] dmcrypt_key_dir : /etc/ceph/dmcrypt-keys
[ceph_deploy.cli][INFO ] quiet : False
[ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f81bab7b200>
[ceph_deploy.cli][INFO ] cluster : ceph
[ceph_deploy.cli][INFO ] fs_type : xfs
[ceph_deploy.cli][INFO ] func : <function osd at 0x7f81bab6a938>
[ceph_deploy.cli][INFO ] ceph_conf : None
[ceph_deploy.cli][INFO ] default_release : False
[ceph_deploy.cli][INFO ] zap_disk : False
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks ceph0:/dev/bcache0:/dev/sdd1
[ceph0][DEBUG ] connected to host: ceph0
[ceph0][DEBUG ] detect platform information from remote host
[ceph0][DEBUG ] detect machine type
[ceph0][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO ] Distro info: CentOS Linux 7.4.1708 Core
[ceph_deploy.osd][DEBUG ] Deploying osd to ceph0
[ceph0][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.osd][DEBUG ] Preparing host ceph0 disk /dev/bcache0 journal /dev/sdd1 activate False
[ceph0][DEBUG ] find the location of an executable
[ceph0][INFO ] Running command: /usr/sbin/ceph-disk -v prepare --cluster ceph --fs-type xfs -- /dev/bcache0 /dev/sdd1
[ceph0][WARNIN] command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
[ceph0][WARNIN] command: Running command: /usr/bin/ceph-osd --check-allows-journal -i 0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph --setuser ceph --setgroup ceph
[ceph0][WARNIN] command: Running command: /usr/bin/ceph-osd --check-wants-journal -i 0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph --setuser ceph --setgroup ceph
[ceph0][WARNIN] command: Running command: /usr/bin/ceph-osd --check-needs-journal -i 0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph --setuser ceph --setgroup ceph
[ceph0][WARNIN] get_dm_uuid: get_dm_uuid /dev/bcache0 uuid path is /sys/dev/block/252:8/dm/uuid
[ceph0][WARNIN] command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=osd_journal_size
[ceph0][WARNIN] get_dm_uuid: get_dm_uuid /dev/bcache0 uuid path is /sys/dev/block/252:8/dm/uuid
[ceph0][WARNIN] get_dm_uuid: get_dm_uuid /dev/bcache0 uuid path is /sys/dev/block/252:8/dm/uuid
[ceph0][WARNIN] get_dm_uuid: get_dm_uuid /dev/bcache0 uuid path is /sys/dev/block/252:8/dm/uuid
[ceph0][WARNIN] command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
[ceph0][WARNIN] command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
[ceph0][WARNIN] command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
[ceph0][WARNIN] command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
[ceph0][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdd1 uuid path is /sys/dev/block/8:51/dm/uuid
[ceph0][WARNIN] prepare_device: Journal /dev/sdd1 is a partition
[ceph0][WARNIN] get_dm_uuid: get_dm_uuid /dev/sdd1 uuid path is /sys/dev/block/8:51/dm/uuid
[ceph0][WARNIN] prepare_device: OSD will not be hot-swappable if journal is not the same device as the osd data
[ceph0][WARNIN] command: Running command: /usr/sbin/blkid -o udev -p /dev/sdd1
[ceph0][WARNIN] prepare_device: Journal /dev/sdd1 was not prepared with ceph-disk. Symlinking directly.
[ceph0][WARNIN] get_dm_uuid: get_dm_uuid /dev/bcache0 uuid path is /sys/dev/block/252:8/dm/uuid
[ceph0][WARNIN] set_data_partition: Creating osd partition on /dev/bcache0
[ceph0][WARNIN] get_dm_uuid: get_dm_uuid /dev/bcache0 uuid path is /sys/dev/block/252:8/dm/uuid
[ceph0][WARNIN] ptype_tobe_for_name: name = data
[ceph0][WARNIN] get_dm_uuid: get_dm_uuid /dev/bcache0 uuid path is /sys/dev/block/252:8/dm/uuid
[ceph0][WARNIN] create_partition: Creating data partition num 1 size 0 on /dev/bcache0
[ceph0][WARNIN] command_check_call: Running command: /usr/sbin/sgdisk --largest-new=1 --change-name=1:ceph data --partition-guid=1:cf599e47-c299-46c1-a385-aff4b0d25f1f --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/bcache0
[ceph0][WARNIN] Caution: invalid main GPT header, but valid backup; regenerating main header
[ceph0][WARNIN] from backup!
[ceph0][WARNIN]
[ceph0][WARNIN] Invalid partition data!
[ceph0][WARNIN] '/usr/sbin/sgdisk --largest-new=1 --change-name=1:ceph data --partition-guid=1:cf599e47-c299-46c1-a385-aff4b0d25f1f --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/bcache0' failed with status code 2
[ceph0][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy.osd][ERROR ] Failed to execute command: /usr/sbin/ceph-disk -v prepare --cluster ceph --fs-type xfs -- /dev/bcache0 /dev/sdd1
[ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs

Searching turned up the references below: ceph-deploy does not yet support bcache devices. By default it tries to create a partition on the bcache device, but a bcache device cannot be partitioned and then mounted, so the command fails.

http://tracker.ceph.com/issues/13278

https://github.com/ceph/ceph/pull/16327

Solution 1

Modify the ceph-disk code by applying the patch from the links above. Unfortunately I did not get this to work; strangely, the modified code never seemed to be executed. Being short on time I did not dig further, but this approach should certainly be workable.

Solution 2

Manually format the bcache device, mount it, and then let ceph-deploy deploy against the directory. The steps are:

# mkdir /var/lib/ceph/osd/ceph-0
# mkfs.xfs /dev/bcache0
# mount /dev/bcache0 /var/lib/ceph/osd/ceph-0

Then retry the deployment; it fails with:

# ceph-deploy --overwrite-conf osd prepare ceph0:/var/lib/ceph/osd/ceph-0:/dev/sdd1
...
[ceph0][WARNIN] command: Running command: /usr/bin/timeout 300 ceph-osd --cluster ceph --mkfs --mkkey -i 195 --monmap /var/lib/ceph/osd/ceph-0/activate.monmap --osd-data /var/lib/ceph/osd/ceph-0 --osd-journal /var/lib/ceph/osd/ceph-0/journal --osd-uuid 9ff0983b-74e2-4fc1-8ba8-cfb688024284 --keyring /var/lib/ceph/osd/ceph-0/keyring --setuser ceph --setgroup ceph
[ceph0][WARNIN] Traceback (most recent call last):
[ceph0][WARNIN] File "/usr/sbin/ceph-disk", line 9, in <module>
[ceph0][WARNIN] load_entry_point('ceph-disk==1.0.0', 'console_scripts', 'ceph-disk')()
[ceph0][WARNIN] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5553, in run
[ceph0][WARNIN] main(sys.argv[1:])
[ceph0][WARNIN] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 5504, in main
[ceph0][WARNIN] args.func(args)
[ceph0][WARNIN] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3631, in main_activate
[ceph0][WARNIN] init=args.mark_init,
[ceph0][WARNIN] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3451, in activate_dir
[ceph0][WARNIN] (osd_id, cluster) = activate(path, activate_key_template, init)
[ceph0][WARNIN] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3556, in activate
[ceph0][WARNIN] keyring=keyring,
[ceph0][WARNIN] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 3010, in mkfs
[ceph0][WARNIN] '--setgroup', get_ceph_group(),
[ceph0][WARNIN] File "/usr/lib/python2.7/site-packages/ceph_disk/main.py", line 2957, in ceph_osd_mkfs
[ceph0][WARNIN] raise Error('%s failed : %s' % (str(arguments), error))
[ceph0][WARNIN] ceph_disk.main.Error: Error: ['ceph-osd', '--cluster', 'ceph', '--mkfs', '--mkkey', '-i', u'195', '--monmap', '/var/lib/ceph/osd/ceph-0/activate.monmap', '--osd-data', '/var/lib/ceph/osd/ceph-0', '--osd-journal', '/var/lib/ceph/osd/ceph-0/journal', '--osd-uuid', u'9ff0983b-74e2-4fc1-8ba8-cfb688024284', '--keyring', '/var/lib/ceph/osd/ceph-0/keyring', '--setuser', 'ceph', '--setgroup', 'ceph'] failed : parse error setting 'osd_deep_scrub_interval' to '2592000 // every mouth'
[ceph0][WARNIN] 2018-04-26 16:02:27.180328 7f3085e23940 -1 filestore(/var/lib/ceph/osd/ceph-0) mkfs: write_version_stamp() failed: (13) Permission denied
[ceph0][WARNIN] 2018-04-26 16:02:27.180346 7f3085e23940 -1 OSD::mkfs: ObjectStore::mkfs failed with error -13
[ceph0][WARNIN] 2018-04-26 16:02:27.180441 7f3085e23940 -1 ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0: (13) Permission denied
[ceph0][WARNIN]
[ceph0][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: /usr/sbin/ceph-disk -v activate --mark-init systemd --mount /var/lib/ceph/osd/ceph-0

The output points to a permissions problem. After running the following two commands, the deployment succeeds:

root@ceph0:/var/lib/ceph/osd# chown -R ceph:ceph /dev/sdd1
root@ceph0:/var/lib/ceph/osd# chown -R ceph:ceph ceph-0
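
Note that the chown on the /dev/sdd1 device node does not survive a reboot, because udev recreates device nodes at boot. One way to make the ownership persistent is a custom udev rule; a sketch (the rule file name and the KERNEL match are assumptions, not from the article):

# Hypothetical udev rule to restore journal-partition ownership after reboot;
# the file name and the KERNEL match are assumptions.
cat > /etc/udev/rules.d/99-ceph-journal.rules <<'EOF'
ACTION=="add|change", KERNEL=="sdd1", OWNER="ceph", GROUP="ceph", MODE="0660"
EOF
udevadm control --reload-rules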

Starting the Ceph OSD at Boot

Loading bcache at boot

Add a module-load script so the bcache module is loaded at boot:

# cat /etc/sysconfig/modules/bcache.modules
#! /bin/sh

/sbin/modinfo -F filename bcache > /dev/null 2>&1
if [ $? -eq 0 ]; then
    /sbin/modprobe -f bcache
fi

# chmod 755 /etc/sysconfig/modules/bcache.modules
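
After a reboot, a quick sanity check that the module is loaded and the bcache device was re-registered:

# simple verification after a reboot
lsmod | grep bcache
ls /sys/block/ | grep bcache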

Mounting the OSD directory automatically

Add an automatic mount for the disk so that the Ceph OSD comes back up after a reboot:

root@ceph0:~# blkid
/dev/sdd1: PARTUUID="81732128-2073-4d93-8582-377f4f9a701f"
...
/dev/bcache0: UUID="88ef6ba6-fd13-478c-8b56-82951183f1d3" TYPE="xfs"

root@ceph0:~# cat /etc/fstab
#
# /etc/fstab
# Created by anaconda on Wed May 24 15:20:03 2017
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
LABEL=/ / ext4 defaults 1 1
UUID=88ef6ba6-fd13-478c-8b56-82951183f1d3 /var/lib/ceph/osd/ceph-0 xfs defaults 0 2
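
With this fstab entry in place, a quick manual check after a reboot might look like the following (OSD id 0 is assumed):

# remount everything from fstab and confirm the OSD directory and service
mount -a
df -h /var/lib/ceph/osd/ceph-0
systemctl status ceph-osd@0
ceph osd tree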