Analysis of an IO hang during CephFS testing

Problem

To verify the usability of CephFS, we ran several groups of tests with different stripe settings. While running the automated tests, fio would always hang after making some progress, and at that point ceph -s reported a very large read rate, although no traffic of that magnitude could be seen in the actual environment.

ceph output

# ceph -w
cluster 16f59233-16da-4864-8c5a-1fec71d119ad
health HEALTH_OK
monmap e1: 3 mons at {cephcluster-server1=10.10.1.6:6789/0,cephcluster-server2=10.10.1.7:6789/0,cephcluster-server3=10.10.1.8:6789/0}
election epoch 12, quorum 0,1,2 cephcluster-server1,cephcluster-server2,cephcluster-server3
fsmap e17: 1/1/1 up {0=mds-daemon-38=up:active}, 1 up:standby
osdmap e180: 30 osds: 30 up, 30 in
flags sortbitwise,require_jewel_osds
pgmap v55504: 1088 pgs, 3 pools, 81924 MB data, 10910 objects
241 GB used, 108 TB / 109 TB avail
1088 active+clean
client io 68228 MB/s rd, 1066 op/s rd, 0 op/s wr

Analysis: the client io figure reported by Ceph is actually correct. As the later analysis shows, the cephfs client keeps looping, retrying the read of the same object; a single read is 64 MB, and the object data is cached in the OSD's memory, so the read rate Ceph computes is huge.

Checking the network

iftop shows only about 10 MB/s of traffic on the network:

# iftop -B -i eth4
25.5MB 50.9MB 76.4MB 102MB 127MB
└──────────────────────────┴─────────────────────────┴─────────────────────────┴─────────────────────────┴──────────────────────────
cephcluster-server2.****.com => 10.10.1.18 4.46MB 4.51MB 4.52MB
<= 220KB 221KB 222KB

Comparing the stripe settings

The bug is triggered when reading a 64 MB object that is not striped:

# getfattr -n ceph.file.layout dir4/tstfile
# file: dir4/tstfile
ceph.file.layout="stripe_unit=67108864 stripe_count=1 object_size=67108864 pool=cephfs_data"
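A layout like this is typically arranged by setting it on the parent directory before the files are created, since new files inherit the directory layout. A minimal sketch using the CephFS layout vxattrs, reusing dir4 as the example name:

# setfattr -n ceph.dir.layout.stripe_unit -v 67108864 dir4    # 64 MB stripe unit
# setfattr -n ceph.dir.layout.stripe_count -v 1 dir4          # no striping across objects
# setfattr -n ceph.dir.layout.object_size -v 67108864 dir4    # one 64 MB object per stripe
# setfattr -n ceph.dir.layout.pool -v cephfs_data dir4
# getfattr -n ceph.file.layout dir4/tstfile                   # verify what a new file inherited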

fio with bs=64M hangs

# fio -filename=/home/yangguanjun/cephfs/dir4/tstfile -size=20G -thread -group_reporting -direct=1 -ioengine=libaio -bs=64M -rw=read -iodepth=64 -iodepth_batch_submit=8 -iodepth_batch_complete=8 -name=write_64k_64q
write_64k_64q: (g=0): rw=read, bs=64M-64M/64M-64M/64M-64M, ioengine=libaio, iodepth=64
fio-2.2.8
Starting 1 thread
^Cbs: 1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158050441d:06h:59m:55s]
fio: terminating on signal 2
...

fio with bs=32M hangs

# fio -filename=/home/yangguanjun/cephfs/dir4/tstfile -size=20G -thread -group_reporting -direct=1 -ioengine=libaio -bs=32M -rw=read -iodepth=64 -iodepth_batch_submit=8 -iodepth_batch_complete=8 -name=write_64k_64q
write_64k_64q: (g=0): rw=read, bs=32M-32M/32M-32M/32M-32M, ioengine=libaio, iodepth=64
fio-2.2.8
Starting 1 thread
^Cbs: 1 (f=1): [R(1)] [inf% done] [0KB/0KB/0KB /s] [0/0/0 iops] [eta 1158050441d:07h:00m:09s]

dd with bs=16M hangs

# dd if=tstfile of=/dev/null bs=16M count=200

root 12763 0.0 0.0 124344 664 pts/0 D+ 14:29 0:00 dd if=tstfile of=/dev/null bs=16M count=200
# ceph -w
cluster 16f59233-16da-4864-8c5a-1fec71d119ad
health HEALTH_OK
monmap e1: 3 mons at {cephcluster-server1=10.10.1.6:6789/0,cephcluster-server2=10.10.1.7:6789/0,cephcluster-server3=10.10.1.8:6789/0}
election epoch 12, quorum 0,1,2 cephcluster-server1,cephcluster-server2,cephcluster-server3
fsmap e23: 1/1/1 up {0=mds-daemon-37=up:active}
osdmap e183: 30 osds: 30 up, 30 in
flags sortbitwise,require_jewel_osds
pgmap v63806: 1088 pgs, 3 pools, 93200 MB data, 13729 objects
274 GB used, 108 TB / 109 TB avail
1088 active+clean
client io 1330 MB/s rd, 41 op/s rd, 0 op/s wr

2017-07-05 14:30:11.483974 mon.0 [INF] pgmap v63806: 1088 pgs: 1088 active+clean; 93200 MB data, 274 GB used, 108 TB / 109 TB avail; 1330 MB/s rd, 41 op/s
2017-07-05 14:30:16.484856 mon.0 [INF] pgmap v63807: 1088 pgs: 1088 active+clean; 93200 MB data, 274 GB used, 108 TB / 109 TB avail; 1212 MB/s rd, 37 op/s
2017-07-05 14:30:21.485295 mon.0 [INF] pgmap v63808: 1088 pgs: 1088 active+clean; 93200 MB data, 274 GB used, 108 TB / 109 TB avail; 1206 MB/s rd, 37 op/s
2017-07-05 14:30:26.485953 mon.0 [INF] pgmap v63809: 1088 pgs: 1088 active+clean; 93200 MB data, 274 GB used, 108 TB / 109 TB avail; 1212 MB/s rd, 37 op/s

Ceph cluster node information

On one of the ceph cluster nodes we can see read traffic going to the test node:

# iftop -n -i eth4
10.10.1.7 => 10.10.1.18 12.3Mb 11.6Mb 11.6Mb
<= 285Kb 284Kb 284Kb

After enabling the source port display:
10.10.1.7:6810 => 10.10.1.18 12.5Mb 11.5Mb 11.5Mb
<= 278Kb 255Kb 255Kb

# netstat -nap | grep 6810
tcp 0 0 10.10.1.7:6810 0.0.0.0:* LISTEN 79360/ceph-osd
tcp 0 0 10.10.1.7:6810 0.0.0.0:* LISTEN 79360/ceph-osd
tcp 0 0 10.10.1.7:6810 10.10.1.7:57266 ESTABLISHED 79360/ceph-osd

# ps aux | grep -w 79360
ceph 79360 85.9 0.5 4338588 709312 ? Ssl Jul04 1556:54 /usr/bin/ceph-osd -f --cluster ceph --id 15 --setuser ceph --setgroup ceph

# df
/dev/sdg1 3905070088 9079428 3895990660 1% /var/lib/ceph/osd/ceph-15

Checking disk sdg: no IO

# iostat -kx 2 sdg
Linux 3.10.0-514.10.2.el7.x86_64 (cephcluster-server2.****.com) 07/05/2017 _x86_64_ (32 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
12.47 0.00 1.64 0.03 0.00 85.86

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg 0.01 0.07 2.48 2.01 95.78 140.98 105.44 0.08 17.79 9.69 27.79 3.44 1.54

avg-cpu: %user %nice %system %iowait %steal %idle
0.80 0.00 2.96 0.02 0.00 96.22

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.86 0.00 2.99 0.00 0.00 96.15

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Checking the file's location

The location information of the test file shows that its very first chunk of data sits on osd 15:

# cephfs tstfile show_location
WARNING: This tool is deprecated. Use the layout.* xattrs to query and modify layouts.
location.file_offset: 0
location.object_offset:0
location.object_no: 0
location.object_size: 67108864
location.object_name: 10000000805.00000000
location.block_offset: 0
location.block_size: 67108864
location.osd: 15

# ceph osd map cephfs_data 10000000805.00000000
osdmap e188 pool 'cephfs_data' (1) object '10000000805.00000000' -> pg 1.12d00fd5 (1.1d5) -> up ([15,0,26], p15) acting ([15,0,26], p15)
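As a side note, the object name 10000000805.00000000 is simply the file's inode number in hex followed by an 8-digit object index, so the same lookup can be done without the deprecated cephfs tool. A sketch, with values matching the file above:

# stat -c %i tstfile
1099511629829
# printf '%x.00000000\n' 1099511629829
10000000805.00000000
# ceph osd map cephfs_data 10000000805.00000000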

Stopping ceph osd 15

# systemctl stop ceph-osd@15.service

After a while, the node hosting osd 15 no longer exchanges any data with the test node. Instead, traffic to the test node now appears on the node hosting osd 0, and it is exactly osd 0 that is talking to the test node. The map of the test file's first object now looks like this:

# ceph osd map cephfs_data 10000000805.00000000
osdmap e185 pool 'cephfs_data' (1) object '10000000805.00000000' -> pg 1.12d00fd5 (1.1d5) -> up ([0,26], p0) acting ([0,26], p0)

Pinning down the problem

The problem is now easy to reproduce:

  • Create a file with stripe_unit=67108864 stripe_count=1 object_size=67108864
  • dd if=/dev/zero of=foxfile bs=64M count=1 succeeds
  • dd if=foxfile of=/dev/null bs=64M count=1 hangs
  • Check that there is data traffic between the test node and the osd holding the corresponding object (a reproduction sketch follows this list)
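Put together as commands, a reproduction sketch looks like this; it assumes a directory that already carries the 64 MB single-stripe layout shown earlier, again called dir4:

# dd if=/dev/zero of=dir4/foxfile bs=64M count=1     # write completes normally
# dd if=dir4/foxfile of=/dev/null bs=64M count=1     # read hangs, process stuck in D state
# cephfs dir4/foxfile show_location                  # deprecated but handy: find the osd holding object 0
# iftop -n -i eth4                                   # on that osd's node: traffic to the client, yet no disk IO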

strace shows that dd is indeed stuck in the read call:

# strace dd if=foxfile of=/dev/null bs=64M count=1
...
open("foxfile", O_RDONLY) = 3
dup2(3, 0) = 0
close(3) = 0
lseek(0, 0, SEEK_CUR) = 0
open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
dup2(3, 1) = 1
close(3) = 0
mmap(NULL, 67121152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f01712d8000
read(0,

Collecting client logs

See the earlier article "cephfs kernel client debug".
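If that article is not at hand: on a kernel client the libceph/ceph debug output can usually be enabled through dynamic debug, roughly as sketched below (it assumes a kernel built with CONFIG_DYNAMIC_DEBUG and debugfs mounted); the messages then land in the kernel log, which is where the client log excerpt later in this post comes from.

# echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control    # messenger / osd_client debug output
# echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control       # cephfs client debug output
# dmesg -w                                                              # or watch /var/log/messages
# echo 'module libceph -p' > /sys/kernel/debug/dynamic_debug/control    # turn it back off afterwards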

Collecting osd logs

# ceph daemon /var/run/ceph/ceph-osd.<osdid>.asok config set debug_osd "20/20"
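20/20 is very verbose, so it is worth checking the current value first and turning it back down once the logs are captured; the 1/5 shown here is assumed to be the default, adjust to whatever the cluster was running with:

# ceph daemon /var/run/ceph/ceph-osd.<osdid>.asok config get debug_osd
# ceph daemon /var/run/ceph/ceph-osd.<osdid>.asok config set debug_osd "1/5"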

Analyzing the logs

The cephfs client log keeps repeating the following pattern:

Jul  6 11:30:29 cephfs-client kernel: [ 2010.231008] ceph_sock_data_ready on ffff88202c987030 state = 5, queueing work
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231012] get_osd ffff88202c987000 3 -> 4
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231015] queue_con_delay ffff88202c987030 0
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231034] try_read start on ffff88202c987030 state 5
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231035] try_read tag 1 in_base_pos 0
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231036] try_read got tag 8
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231036] prepare_read_ack ffff88202c987030
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231041] ceph_sock_data_ready on ffff88202c987030 state = 5, queueing work
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231041] get_osd ffff88202c987000 4 -> 5
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231053] queue_con_delay ffff88202c987030 0
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231062] got ack for seq 1 type 42 at ffff881008668100
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231066] ceph_msg_put ffff881008668100 (was 2)
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231081] prepare_read_tag ffff88202c987030
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231083] try_read start on ffff88202c987030 state 5
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231086] try_read tag 1 in_base_pos 0
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231088] try_read got tag 7
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231088] prepare_read_message ffff88202c987030
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231089] read_partial_message con ffff88202c987030 msg (null)
Jul 6 11:30:29 cephfs-client kernel: [ 2010.231090] try_read done on ffff88202c987030 ret -5

This shows that the cephfs client keeps trying to read data from the osd, but after receiving the reply it treats the msg as null (read_partial_message con ... msg (null)), and try_read fails with -5 (EIO).

The ceph osd log keeps repeating the following:

2017-07-06 11:30:29.185056 7f1788eb2700  1 osd.20 188 ms_handle_reset con 0x7f1748c39d80 session 0x7f170f4f31c0
2017-07-06 11:30:29.185683 7f17719fc700 10 osd.20 188 new session 0x7f174d435380 con=0x7f1748c39f00 addr=10.10.1.18:0/144324133
2017-07-06 11:30:29.185839 7f17719fc700 20 osd.20 188 should_share_map client.24357 10.10.1.18:0/144324133 188
2017-07-06 11:30:29.185854 7f17719fc700 15 osd.20 188 enqueue_op 0x7f174d42f100 prio 127 cost 0 latency 0.000047 osd_op(client.24357.1:3 1.1bdf9243 1000000095b.00000000 [read 0~67108864 [1@-1]] snapc 0=[] RETRY=6 ondisk+retry+read+known_if_redirected e188) v4
2017-07-06 11:30:29.185891 7f1777bff700 10 osd.20 188 dequeue_op 0x7f174d42f100 prio 127 cost 0 latency 0.000085 osd_op(client.24357.1:3 1.1bdf9243 1000000095b.00000000 [read 0~67108864 [1@-1]] snapc 0=[] RETRY=6 ondisk+retry+read+known_if_redirected e188) v4 pg pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean]
2017-07-06 11:30:29.185937 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] handle_message: 0x7f174d42f100
2017-07-06 11:30:29.185951 7f1777bff700 20 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] op_has_sufficient_caps pool=1 (cephfs_data ) owner=0 need_read_cap=1 need_write_cap=0 need_class_read_cap=0 need_class_write_cap=0 -> yes
2017-07-06 11:30:29.185970 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] do_op osd_op(client.24357.1:3 1.1bdf9243 1000000095b.00000000 [read 0~67108864 [1@-1]] snapc 0=[] RETRY=6 ondisk+retry+read+known_if_redirected e188) v4 may_read -> read-ordered flags ondisk+retry+read+known_if_redirected
2017-07-06 11:30:29.185990 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] get_object_context: found obc in cache: 0x7f171e423180
2017-07-06 11:30:29.185997 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] get_object_context: 0x7f171e423180 1:c249fbd8:::1000000095b.00000000:head rwstate(none n=0 w=0) oi: 1:c249fbd8:::1000000095b.00000000:head(188'4499 mds.0.20:66259 dirty|data_digest|omap_digest s 67108864 uv 4499 dd cdba94a2 od ffffffff) ssc: 0x7f172dcfce40 snapset: 1=[]:[]+head
2017-07-06 11:30:29.186010 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] find_object_context 1:c249fbd8:::1000000095b.00000000:head @head oi=1:c249fbd8:::1000000095b.00000000:head(188'4499 mds.0.20:66259 dirty|data_digest|omap_digest s 67108864 uv 4499 dd cdba94a2 od ffffffff)
2017-07-06 11:30:29.186026 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] execute_ctx 0x7f1776beed00
2017-07-06 11:30:29.186035 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] do_op 1:c249fbd8:::1000000095b.00000000:head [read 0~67108864 [1@-1]] ov 188'4499
2017-07-06 11:30:29.186041 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] taking ondisk_read_lock
2017-07-06 11:30:29.186047 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] do_osd_op 1:c249fbd8:::1000000095b.00000000:head [read 0~67108864 [1@-1]]
2017-07-06 11:30:29.186056 7f1777bff700 10 osd.20 pg_epoch: 188 pg[1.43( v 188'6597 (180'3500,188'6597] local-les=157 n=10 ec=136 les/c/f 157/157/0 151/156/136) [20,10,7] r=0 lpr=156 crt=188'6595 lcod 188'6596 mlcod 188'6596 active+clean] do_osd_op read 0~67108864 [1@-1]

The analysis shows that the cephfs client sends a 64 MB read request and the ceph osd reads the 64 MB of data and returns it, but the cephfs client treats the msg as null and retries (note the RETRY=6 in the osd log). The same request is repeated over and over, so the IO on the cephfs client hangs.

Reporting the bug

The bug was reported to the ceph community:
http://tracker.ceph.com/issues/20528

Root cause: on the cephfs client side, the kernel messenger limits the maximum size of a received message to 16 MB, so the reply carrying a 64 MB read is rejected with -EIO.

In net/ceph/messenger.c:

static int read_partial_message(struct ceph_connection *con)
{
    ...
    front_len = le32_to_cpu(con->in_hdr.front_len);
    if (front_len > CEPH_MSG_MAX_FRONT_LEN)
        return -EIO;
    middle_len = le32_to_cpu(con->in_hdr.middle_len);
    if (middle_len > CEPH_MSG_MAX_MIDDLE_LEN)
        return -EIO;
    data_len = le32_to_cpu(con->in_hdr.data_len);
    if (data_len > CEPH_MSG_MAX_DATA_LEN)
        return -EIO;

So when using cephfs, a file's stripe_unit must not be configured larger than 16 MB.
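In practice that means capping stripe_unit at 16 MB when setting layouts. A sketch, where the directory name and the other values are only illustrative and object_size just has to remain a multiple of stripe_unit:

# setfattr -n ceph.dir.layout.stripe_unit -v 16777216 dir_safe    # 16 MB, at the limit described above
# setfattr -n ceph.dir.layout.stripe_count -v 4 dir_safe
# setfattr -n ceph.dir.layout.object_size -v 67108864 dir_safe    # 64 MB objects, striped in 16 MB units
# getfattr -n ceph.dir.layout dir_safe                            # verify

Directories without an explicit layout use the usual default of 4 MB objects with stripe_count 1, which stays well under the limit.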
