cephfs vs rbd: direct IO testing and analysis

Overview

While benchmarking cephfs, its direct IO performance turned out to be poor, whereas rbd's direct IO performance is good. To understand why, this post runs a comparative test and analyzes the difference.

Performance testing

rbd device

# rbd create test -p cephfs_data3 --size=1024
# rbd feature disable cephfs_data3/test exclusive-lock, object-map, fast-diff, deep-flatten
# rbd map cephfs_data3/test

# dd if=/dev/zero of=/dev/rbd0 bs=1024M count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 1.72135 s, 624 MB/s

cephfs

# mount -t ceph 10.10.2.1:6789:/foo mike -o name=foo,secret=AQCddEtZC8n5KRAAPw5qd3BWLzlgqiEuRR5AYg==
# cd mike/
# dd if=/dev/zero of=file bs=1024M count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 7.31779 s, 147 MB/s

Question: for direct IO, why is the gap between rbd and cephfs so large?

Analysis

cephfs

Code analysis

The write path of the cephfs kernel client starts from its file operations:

const struct file_operations ceph_file_fops = {    // file operations for cephfs files
.open = ceph_open,
.release = ceph_release,
.llseek = ceph_llseek,
.read_iter = ceph_read_iter,
.write_iter = ceph_write_iter,
.mmap = ceph_mmap,
.fsync = ceph_fsync,
.lock = ceph_lock,
.flock = ceph_flock,
.splice_read = generic_file_splice_read,
.splice_write = iter_file_splice_write,
.unlocked_ioctl = ceph_ioctl,
.compat_ioctl = ceph_ioctl,
.fallocate = ceph_fallocate,
};

The direct IO case is handled as follows:

static ssize_t ceph_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
...
if (iocb->ki_flags & IOCB_DIRECT)
written = ceph_direct_read_write(iocb, &data, snapc, // direct IO handler
&prealloc_cf);
else
written = ceph_sync_write(iocb, &data, pos, snapc);
...
}

static ssize_t
ceph_direct_read_write(struct kiocb *iocb, struct iov_iter *iter,
struct ceph_snap_context *snapc,
struct ceph_cap_flush **pcf)
{
...
while (iov_iter_count(iter) > 0) { // loop, sending the data one chunk at a time
...
req = ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout, // build an osd request
vino, pos, &size, 0,
/*include a 'startsync' command*/
write ? 2 : 1,
write ? CEPH_OSD_OP_WRITE :
CEPH_OSD_OP_READ,
flags, snapc,
ci->i_truncate_seq,
ci->i_truncate_size,
false);
...
ret = ceph_osdc_start_request(req->r_osdc, req, false); // submit the osd write request
if (!ret)
ret = ceph_osdc_wait_request(&fsc->client->osdc, req); // wait for the osd request to complete
...
}
...
}

So from this logic: a cephfs direct IO write is split into individual osd write requests, issued one at a time, and each osd request must complete before the next one is sent. This fully serialized pattern cannot exploit the parallelism of the distributed cluster, which is why cephfs direct IO write performance is low.

Verification

Following the cephfs kernel client debug method, enable logging on the cephfs client and issue a 12M direct IO write. With the default 4M object size and stripe unit, this should be split into three serialized osd requests.

Command: dd if=/dev/zero of=file bs=12M count=1 oflag=direct

Log output:

[260262.367080] aio_write ffff8801f403b418 10000002393.fffffffffffffffe 0~12582912 getting caps. i_size 0
[260262.367085] aio_write ffff8801f403b418 10000002393.fffffffffffffffe 0~12582912 got cap refs on Fwb // getting caps took 5 us
[260262.367086] sync_direct_read_write (write) on file ffff88142f83ba00 0~12582912

// first 4M osd write
[260262.367092] ceph_osdc_alloc_request req ffff881257eb8000
[260262.367173] ----- ffff88142f83a200 to osd23 42=osd_op len 201+0+4194304 -----
[260262.367177] wait_request_timeout req ffff881257eb8000 tid 4
[260262.394196] ===== ffff88142f83b400 4 from osd23 43=osd_opreply len 182+0 (1765759831 0 0) =====

// second 4M osd write
[260262.394264] ceph_osdc_alloc_request req ffff881257eb8000
[260262.394379] ----- ffff88142f83b400 to osd3 42=osd_op len 201+0+4194304 -----
[260262.394380] wait_request_timeout req ffff881257eb8000 tid 5
[260262.433310] ===== ffff88142f83be00 1 from osd3 43=osd_opreply len 182+0 (93139192 0 0) =====

// third 4M osd write
[260262.433426] ceph_osdc_alloc_request req ffff881257eb8000
[260262.433534] ----- ffff88142f83a200 to osd26 42=osd_op len 201+0+4194304 -----
[260262.433535] wait_request_timeout req ffff881257eb8000 tid 6
[260262.464063] ===== ffff88142f83b700 1 from osd26 43=osd_opreply len 182+0 (3847263551 0 0) =====

[260262.464151] aio_write ffff8801f403b418 10000002393.fffffffffffffffe 0~12582912 dropping cap refs on Fwb
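
A quick cross-check against the 1G result from earlier: in the log above each 4M request takes roughly 27-39 ms round trip, and the requests are strictly serialized, so a 1G direct write needs about 256 of them. A rough estimate in shell, assuming ~30 ms per request:

# echo $((1073741824 / 4194304))
256                              // 256 serialized 4M requests for a 1G write
# echo "256 * 0.03" | bc
7.68                             // ~7.7 s expected, close to the measured 7.3 s (147 MB/s)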

Testing a larger object size

Increase both the object size and the stripe unit of the cephfs file to 64M, then re-test:

# touch file
# setfattr -n ceph.file.layout.object_size -v 67108864 file
# setfattr -n ceph.file.layout.stripe_unit -v 67108864 file
# getfattr -n ceph.file.layout file
# file: file
ceph.file.layout="stripe_unit=67108864 stripe_count=1 object_size=67108864 pool=cephfs_data"

# dd if=/dev/zero of=file bs=1024M count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB) copied, 6.77491 s, 158 MB/s

Compared with the 4M object size and stripe unit, performance did not improve.
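
A side note: the file layout xattrs can only be changed while the file is still empty, which is why the file is created with touch before setfattr above. To have newly created files pick up a larger layout automatically, the layout can also be set on the parent directory; a minimal sketch, assuming the mount point mike/ from earlier:

# setfattr -n ceph.dir.layout.object_size -v 67108864 mike/
# setfattr -n ceph.dir.layout.stripe_unit -v 67108864 mike/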

Analyzing per-op latency on the osd side

Get the location of the cephfs file's data:

# cephfs file show_location
WARNING: This tool is deprecated. Use the layout.* xattrs to query and modify layouts.
location.file_offset: 0
location.object_offset:0
location.object_no: 0
location.object_size: 67108864
location.object_name: 100000720de.00000000
location.block_offset: 0
location.block_size: 67108864
location.osd: 25 // the data lives on osd 25
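
For reference, a cephfs data object is named "<inode number in hex>.<object index in hex>", which is where 100000720de.00000000 above comes from. A small sketch for computing the object that covers an arbitrary byte offset, reusing the inode and the 64M object size shown above (the 192M offset is purely illustrative):

# printf '%s.%08x\n' 100000720de $((201326592 / 67108864))
100000720de.00000003             // the object covering byte offset 192M of this file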

Enable the op tracker on osd 25:

Edit /etc/ceph/ceph.conf and add the following two lines in the [osd] section:
osd_enable_op_tracker = true
osd_op_history_size = 100

Then restart the osd service:
# systemctl restart ceph-osd@25.service
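
Depending on the Ceph version, the same settings can usually also be injected into the running daemon, which avoids the restart; a hedged alternative:

# ceph tell osd.25 injectargs '--osd_enable_op_tracker=true --osd_op_history_size=100'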

Re-run the dd command, then collect the op history on the osd:

# ceph daemon osd.25 dump_historic_ops
{
"num to keep": 20,
"duration to keep": 600,
"Ops": [
{
"description": "osd_op(mds.0.75:987357 13.844f3494 200.00000000 [] snapc 0=[] ondisk+write+known_if_redirected+full_force e452)", // mds操作
"initiated_at": "2017-06-27 09:55:39.196572",
"age": 8.391089,
"duration": 0.001696, // 花费1ms
"type_data": [
"commit sent; apply or cleanup",
[
{
"time": "2017-06-27 09:55:39.196572",
"event": "initiated"
},
{
"time": "2017-06-27 09:55:39.196629",
"event": "queued_for_pg"
},
{
"time": "2017-06-27 09:55:39.196735",
"event": "reached_pg"
},
{
"time": "2017-06-27 09:55:39.196799",
"event": "started"
},
{
"time": "2017-06-27 09:55:39.196861",
"event": "waiting for subops from 1,18"
},
{
"time": "2017-06-27 09:55:39.196962",
"event": "commit_queued_for_journal_write"
},
{
"time": "2017-06-27 09:55:39.197011",
"event": "write_thread_in_journal_buffer"
},
{
"time": "2017-06-27 09:55:39.197206",
"event": "journaled_completion_queued"
},
{
"time": "2017-06-27 09:55:39.197280",
"event": "op_commit"
},
{
"time": "2017-06-27 09:55:39.197738",
"event": "op_applied"
},
{
"time": "2017-06-27 09:55:39.198125",
"event": "sub_op_commit_rec from 1"
},
{
"time": "2017-06-27 09:55:39.198225",
"event": "sub_op_commit_rec from 18"
},
{
"time": "2017-06-27 09:55:39.198236",
"event": "commit_sent"
},
{
"time": "2017-06-27 09:55:39.198268",
"event": "done"
}
]
]
},
{
"description": "osd_op(mds.0.75:987359 9.a30e34e0 100000720de.00000000 [] snapc 1=[] ondisk+write+known_if_redirected+full_force e452)", // mds操作
"initiated_at": "2017-06-27 09:55:39.200560",
"age": 8.387101,
"duration": 0.016664, // 花费16ms
"type_data": [
"commit sent; apply or cleanup",
[
{
"time": "2017-06-27 09:55:39.200560",
"event": "initiated"
},
{
"time": "2017-06-27 09:55:39.200617",
"event": "queued_for_pg"
},
{
"time": "2017-06-27 09:55:39.200683",
"event": "reached_pg"
},
{
"time": "2017-06-27 09:55:39.200721",
"event": "started"
},
{
"time": "2017-06-27 09:55:39.200813",
"event": "waiting for subops from 7,15"
},
{
"time": "2017-06-27 09:55:39.200921",
"event": "commit_queued_for_journal_write"
},
{
"time": "2017-06-27 09:55:39.200955",
"event": "write_thread_in_journal_buffer"
},
{
"time": "2017-06-27 09:55:39.201047",
"event": "journaled_completion_queued"
},
{
"time": "2017-06-27 09:55:39.201104",
"event": "op_commit"
},
{
"time": "2017-06-27 09:55:39.201955",
"event": "sub_op_commit_rec from 15"
},
{
"time": "2017-06-27 09:55:39.202224",
"event": "sub_op_commit_rec from 7"
},
{
"time": "2017-06-27 09:55:39.202231",
"event": "commit_sent"
},
{
"time": "2017-06-27 09:55:39.217210",
"event": "op_applied"
},
{
"time": "2017-06-27 09:55:39.217224",
"event": "done"
}
]
]
},
{
"description": "osd_op(client.372486.1:1048 9.a30e34e0 100000720de.00000000 [] snapc 1=[] ondisk+write+ordersnap e452)",
"initiated_at": "2017-06-27 09:55:39.228517",
"age": 8.359144,
"duration": 0.450798, // 花费450ms
"type_data": [
"commit sent; apply or cleanup",
{
"client": "client.372486",
"tid": 1048
},
[
{
"time": "2017-06-27 09:55:39.228517”, // 收到请求,创建OpRequest
"event": "initiated"
},
{
"time": "2017-06-27 09:55:39.345624”, // 花费117ms,simple messager,接收64MB数据的时延
"event": "queued_for_pg"
},
{
"time": "2017-06-27 09:55:39.345691",
"event": "reached_pg"
},
{
"time": "2017-06-27 09:55:39.345738",
"event": "started"
},
{
"time": "2017-06-27 09:55:39.348961",
"event": "waiting for subops from 7,15"
},
{
"time": "2017-06-27 09:55:39.381379”, // 花费33ms,- FileJournal::submit_entry 这部分的延时应该是reserve filestore throttle等待的时间
"event": "commit_queued_for_journal_write"
},
{
"time": "2017-06-27 09:55:39.381441",
"event": "write_thread_in_journal_buffer"
},
{
"time": "2017-06-27 09:55:39.524101”, // 花费43ms,这部分延时是写journal的花费时间
"event": "journaled_completion_queued"
},
{
"time": "2017-06-27 09:55:39.524162”, // commit到journal成功
"event": "op_commit"
},
{
"time": "2017-06-27 09:55:39.570412”, // 花费46ms,apply到filestore成功
"event": "op_applied"
},
{
"time": "2017-06-27 09:55:39.658194",
"event": "sub_op_commit_rec from 7"
},
{
"time": "2017-06-27 09:55:39.679285”, // sub从发送到收到reply花费约330ms,subop commit返回
"event": "sub_op_commit_rec from 15"
},
{
"time": "2017-06-27 09:55:39.679298”, // 返回给客户端
"event": "commit_sent"
},
{
"time": "2017-06-27 09:55:39.679315",
"event": "done"
}
]
]
}
]
}

Conclusion:
For a write to one large object, the osd-side processing time is long. Each stage of the pipeline, simple messenger -> pg -> journal -> filestore, runs serially for that op, and the SimpleMessenger alone takes 117 ms just to receive the 64M payload (over a 10G network). So the performance is low and hard to improve.
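
To put the 117 ms in perspective, the theoretical minimum wire time for 64M over a 10G link is only about 54 ms, so even the messenger stage alone already runs well above the ideal:

# echo "scale=1; 64*1024*1024*8*1000 / (10*1000*1000*1000)" | bc
53.6                             // i.e. ~54 ms minimum just to move 64MB over the wire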

rbd device

Mapping from rbd to objects

Create an rbd device with a 32M object size:

# rbd create foxtst -p cephfs_data3 --size=1024 --object-size 32M
# rbd feature disable cephfs_data3/foxtst exclusive-lock, object-map, fast-diff, deep-flatten
# rbd map cephfs_data3/foxtst
/dev/rbd0
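
The resulting layout can be double-checked with rbd info (output omitted here, and its exact format varies between versions); a 32M object size corresponds to order 25:

# rbd info cephfs_data3/foxtst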

First write 64M of data to the rbd device:

# dd if=/dev/zero of=/dev/rbd0 bs=64M count=1 oflag=direct

Get the rbd image id:

# rados -p cephfs_data3 get rbd_id.foxtst myfile
# cat myfile
56092238e1f29

List the rados objects backing the rbd device; since only 64M of data was written, there are exactly two 32M data objects:

# rados ls -p cephfs_data3 | grep 56092238e1f29
rbd_data.56092238e1f29.0000000000000001
rbd_header.56092238e1f29
rbd_data.56092238e1f29.0000000000000000
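
rbd data objects are named rbd_data.<image id>.<object index as 16 hex digits>, so the object covering a given device offset can be computed directly; for example, for offset 32M with 32M objects:

# printf 'rbd_data.%s.%016x\n' 56092238e1f29 $((33554432 / 33554432))
rbd_data.56092238e1f29.0000000000000001    // the object covering device offset 32M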

Find the locations of these two objects:

# ceph osd map cephfs_data3 rbd_data.56092238e1f29.0000000000000000
osdmap e467 pool 'cephfs_data3' (15) object 'rbd_data.56092238e1f29.0000000000000000' -> pg 15.47ba6daa (15.1aa) -> up ([21,14,2], p21) acting ([21,14,2], p21)
# ceph osd map cephfs_data3 rbd_data.56092238e1f29.0000000000000001
osdmap e467 pool 'cephfs_data3' (15) object 'rbd_data.56092238e1f29.0000000000000001' -> pg 15.46295241 (15.41) -> up ([4,23,13], p4) acting ([4,23,13], p4)

Performance testing

After enabling the op tracker on the corresponding osd nodes, run the test:

# dd if=/dev/zero of=/dev/rbd0 bs=64M count=1 oflag=direct

Collect the op history on the corresponding osds:

# ceph daemon osd.21 dump_historic_ops > osd-21-dump_historic_ops
# ceph daemon osd.4 dump_historic_ops > osd-4-dump_historic_ops

Analyzing these ops leads to the following observations:

  1. A write to a single 32M object is split into many ops; the split size is determined by the device's max_sectors_kb setting, which defaults to 512 (see the example after this list).
  2. The ops for a single 32M object are sent to the osd in order, but the client does not wait for one op to complete before sending the next.
  3. The ops for a single 32M object are processed as a pipeline: simple messenger -> pg -> journal -> filestore. Many small requests are handled more efficiently this way, and there is none of the cephfs problem where a large object write sits in the simple messenger until the whole payload has been received before moving to the next stage.
  4. The ops for the two 32M objects proceed in parallel: with bs=64M the write spans two objects placed on different osds, and ops on different osds have no ordering relationship between them.
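
The split size can be inspected, and tuned within the hardware limit, through sysfs; a sketch, assuming the device is /dev/rbd0:

# cat /sys/block/rbd0/queue/max_sectors_kb
512                              // default: requests are split into 512K chunks
# cat /sys/block/rbd0/queue/max_hw_sectors_kb      // upper bound allowed by the driver
# echo 4096 > /sys/block/rbd0/queue/max_sectors_kb // raise the split size, up to the hw limit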

As shown in the figure below:
(figure: rbd-dd-io-flow)

So the direct IO write performance of an rbd device improves steadily as bs grows.
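
A quick way to see that trend (a sketch; the exact numbers vary by cluster, and writing to /dev/rbd0 destroys its contents):

# for bs in 4M 16M 64M 256M; do dd if=/dev/zero of=/dev/rbd0 bs=$bs count=2 oflag=direct 2>&1 | tail -1; done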
