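A small CephFS problem in a Rook-deployed Ceph cluster

Problem description

In a Ceph cluster deployed with Rook, once CephFS was configured, reading and writing files failed under a directory with a fairly long name.

The symptoms are as follows:

With the ceph-fuse client: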

root@ceph3:/mnt/test/volumes/kubernetes/kubernetes/kubernetes-dynamic-pvc-9adbb10c-f86a-11e8-96ff-9247c38478e0# cat fox 
cat: fox: Input/output error

With the kernel client:

root@ceph3:/mnt/test/volumes/kubernetes/kubernetes/kubernetes-dynamic-pvc-9adbb10c-f86a-11e8-96ff-9247c38478e0# cat fox 
cat: fox: File name too long

Analysis

Enable debug logging for the ceph-fuse client:

# ceph-fuse -m 10.10.15.89:6790,10.10.15.198:6790 /mnt/test/ -n client.admin --keyring=./yangguanjun/keyring --debug-client=20

Then check the log under /var/log/ceph/. It reports error -36:

2018-12-05 18:42:04.908612 7fd5955bc700  3 client.5213 ll_read 0x565512bf8760 0x10000000007  0~4096 
2018-12-05 18:42:04.910114 7fd599dc5700 10 client.5213 ms_handle_connect on 172.16.1.54:6800/33886
2018-12-05 18:42:04.910887 7fd5955bc700 10 client.5213 check_pool_perm on pool 2 ns fsvolumens_kubernetes-dynamic-pvc-9adbb10c-f86a-11e8-96ff-9247c38478e0 rd_err = -36 wr_err = -36
2018-12-05 18:42:04.910907 7fd5955bc700 10 client.5213 check_pool_perm on pool 2 ns fsvolumens_kubernetes-dynamic-pvc-9adbb10c-f86a-11e8-96ff-9247c38478e0 rd_err = -36 wr_err = -36

The Linux header include/uapi/asm-generic/errno.h contains the following definition:

#define ENAMETOOLONG    36  /* File name too long */
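
The same mapping can be checked without opening the kernel headers. The following is a minimal sketch in Python, assuming a Linux host (errno numbers differ between platforms); describe_errno is a hypothetical helper name, not part of Ceph.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value.

    Ceph logs report errors as negative numbers (for example -36),
    so the sign is dropped before the lookup.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    # On Linux, -36 resolves to ENAMETOOLONG, matching the header above.
    print(describe_errno(-36))
```

This also explains why the two clients print different messages: the kernel client surfaces the error as "File name too long", while ceph-fuse reports a generic I/O error.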

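The error shows up with both ceph-fuse and the kernel client, so it is not specific to either client. The OSD side is the more likely source.

Searching the Ceph source code turns up the following checks on the OSD side:

File: osd/PrimaryLogPG.cc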

/** do_op - do an op
 * pg lock will be held (if multithreaded)
 * osd_lock NOT held.
 */
void PrimaryLogPG::do_op(OpRequestRef& op)
{
  ...
  // object name too long?
  if (m->get_oid().name.size() > cct->_conf->osd_max_object_name_len) {
    dout(4) << "do_op name is longer than "
            << cct->_conf->osd_max_object_name_len
            << " bytes" << dendl;
    osd->reply_op_error(op, -ENAMETOOLONG);
    return;
  }
  if (m->get_hobj().get_key().size() > cct->_conf->osd_max_object_name_len) {
    dout(4) << "do_op locator is longer than "
            << cct->_conf->osd_max_object_name_len
            << " bytes" << dendl;
    osd->reply_op_error(op, -ENAMETOOLONG);
    return;
  }
  if (m->get_hobj().nspace.size() > cct->_conf->osd_max_object_namespace_len) {
    dout(4) << "do_op namespace is longer than "
            << cct->_conf->osd_max_object_namespace_len
            << " bytes" << dendl;
    osd->reply_op_error(op, -ENAMETOOLONG);
    return;
  }
  ...
}

So raise the OSD debug level:

[root@rook-ceph-tools /]# ceph tell osd.0 config set debug_osd 5
Set debug_osd to 5/5
[root@rook-ceph-tools /]# ceph tell osd.1 config set debug_osd 5
Set debug_osd to 5/5

Then reproduce the problem and collect the OSD logs:

# grep "longer than" *
rook-ceph-osd-0-85f5bf454f-64w7d-ceph3.log:2018-12-05 11:14:38.864707 7fbe34194700 4 osd.0 pg_epoch: 24 pg[2.15( empty local-lis/les=21/22 n=0 ec=20/20 lis/c 21/21 les/c/f 22/22/0 21/21/20) [0,1] r=0 lpr=21 crt=0'0 mlcod 0'0 active+clean] do_op namespace is longer than 64 bytes
rook-ceph-osd-0-85f5bf454f-64w7d-ceph3.log:2018-12-05 11:14:38.864853 7fbe3819c700 4 osd.0 pg_epoch: 24 pg[2.15( empty local-lis/les=21/22 n=0 ec=20/20 lis/c 21/21 les/c/f 22/22/0 21/21/20) [0,1] r=0 lpr=21 crt=0'0 mlcod 0'0 active+clean] do_op namespace is longer than 64 bytes

The corresponding check in the code is:

if (m->get_hobj().nspace.size() > cct->_conf->osd_max_object_namespace_len)

Check the OSD configuration:

[root@rook-ceph-osd-0-85f5bf454f-64w7d-ceph3 ceph]# cat ceph.conf
...
osd max object name len = 256
osd max object namespace len = 64
...
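
Putting the client log and this configuration side by side makes the mismatch concrete. The sketch below only mirrors the OSD's comparison in Python for illustration; namespace_too_long is a hypothetical helper, and the namespace string is copied from the check_pool_perm lines in the ceph-fuse log.

```python
OSD_MAX_OBJECT_NAMESPACE_LEN = 64  # value from the ceph.conf shown above

# Namespace copied from the check_pool_perm lines in the ceph-fuse log.
NAMESPACE = "fsvolumens_kubernetes-dynamic-pvc-9adbb10c-f86a-11e8-96ff-9247c38478e0"


def namespace_too_long(nspace, limit=OSD_MAX_OBJECT_NAMESPACE_LEN):
    """Mirror the OSD check: reject when the byte length exceeds the limit."""
    return len(nspace.encode("utf-8")) > limit


if __name__ == "__main__":
    # Prints: 70 True
    print(len(NAMESPACE.encode("utf-8")), namespace_too_long(NAMESPACE))
```

The namespace is 70 bytes long, 6 bytes over the configured limit, so every read and write in that namespace is rejected with -ENAMETOOLONG.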

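(o゜▽゜)o☆[BINGO!] That is the cause: the namespace is longer than the configured 64-byte limit. But why is this setting there in the first place?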

Rook's source contains the following code, in pkg/daemon/ceph/osd/device.go:

func writeConfigFile(cfg *osdConfig, context *clusterd.Context, cluster *cephconfig.ClusterInfo, location string) error {
    cephConfig := cephconfig.CreateDefaultCephConfig(context, cluster, cfg.rootPath)
    if isBluestore(cfg) {
        cephConfig.GlobalConfig.OsdObjectStore = config.Bluestore
    } else {
        cephConfig.GlobalConfig.OsdObjectStore = config.Filestore
    }
    cephConfig.CrushLocation = location

    if cfg.dir || isFilestoreDevice(cfg) {
        // using the local file system requires some config overrides
        // http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/#not-recommended
        cephConfig.GlobalConfig.OsdMaxObjectNameLen = 256
        cephConfig.GlobalConfig.OsdMaxObjectNamespaceLen = 64
    }
    ...
}

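So these two settings are added whenever an OSD is backed by a directory or uses FileStore.

Ceph's notes on FileStore point out that on ext4 the limit on xattr length restricts FileStore OSDs from starting. When object names are known to be short in a given use case, the following two options can be set in order to run FileStore on ext4: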

osd max object name len = 256
osd max object namespace len = 64

Reference: http://docs.ceph.com/docs/jewel/rados/configuration/filesystem-recommendations/#not-recommended

Looking at our own setup, the OSD section of cluster.yaml is:

...
storage:
  useAllNodes: false
  useAllDevices: false
  deviceFilter:
  config:
    storeType: bluestore
  nodes:
  - name: ceph2
    devices:
    - name: sdb1
  - name: ceph3
    devices:
    - name: sdb1

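However, Rook does not support partitions as OSD devices. When a partition is specified, Rook's device check skips every disk that has partitions and falls back to creating the OSD in the directory /var/lib/rook/osd<id>/, as shown here: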

# ll  /var/lib/rook/osd0/
total 3344
drwxr--r-- 3 root root 4096 Dec 5 16:44 ./
drwxr-xr-x 5 root root 4096 Dec 5 16:44 ../
lrwxrwxrwx 1 root root 34 Dec 5 16:44 block -> /var/lib/rook/osd0/bluestore-block
lrwxrwxrwx 1 root root 31 Dec 5 16:44 block.db -> /var/lib/rook/osd0/bluestore-db
lrwxrwxrwx 1 root root 32 Dec 5 16:44 block.wal -> /var/lib/rook/osd0/bluestore-wal
-rw-r--r-- 1 root root 2 Dec 5 16:44 bluefs
-rw-r--r-- 1 root root 472432779264 Dec 5 19:22 bluestore-block
-rw-r--r-- 1 root root 1073741824 Dec 5 16:44 bluestore-db
-rw-r--r-- 1 root root 603979776 Dec 5 19:27 bluestore-wal
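
In the listing, block resolves to a regular file (bluestore-block) rather than a disk, which is the sign of a directory-backed OSD. A small check along these lines can confirm it on other nodes. This is only a sketch: block_backing_type is a hypothetical helper, and the path assumes the /var/lib/rook/osd<id>/ layout shown above.

```python
import os
import stat


def block_backing_type(osd_dir):
    """Classify what the OSD's `block` symlink resolves to."""
    target = os.path.realpath(os.path.join(osd_dir, "block"))
    mode = os.stat(target).st_mode
    if stat.S_ISBLK(mode):
        return "block device"
    if stat.S_ISREG(mode):
        return "regular file"
    return "other"


if __name__ == "__main__":
    osd_dir = "/var/lib/rook/osd0"  # path from the listing above
    # Only query the path when it exists, so the script is safe elsewhere.
    if os.path.exists(os.path.join(osd_dir, "block")):
        print(block_backing_type(osd_dir))
```

A result of "regular file" means the OSD was created in a directory, so the cfg.dir branch in writeConfigFile applied and the shorter limits were written to ceph.conf.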

Solution

Temporary workaround

Set osd_max_object_namespace_len to a larger value.

[root@rook-ceph-tools /]# ceph tell osd.0 config set osd_max_object_namespace_len 256
Set osd_max_object_namespace_len to 256
[root@rook-ceph-tools /]# ceph tell osd.1 config set osd_max_object_namespace_len 256
Set osd_max_object_namespace_len to 256
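
With more than a couple of OSDs, typing the command per daemon gets tedious. The sketch below generates the same ceph tell invocation for a list of OSD ids. The helper names are hypothetical, the ids 0 and 1 are simply the ones from this cluster, and it defaults to a dry run that only prints the commands.

```python
import subprocess


def build_tell_command(osd_id, value, option="osd_max_object_namespace_len"):
    """Build the same `ceph tell` invocation used above for one OSD."""
    return ["ceph", "tell", f"osd.{osd_id}", "config", "set", option, str(value)]


def apply_to_osds(osd_ids, value, dry_run=True):
    """Run (or, with dry_run, only print) the command for each OSD id."""
    for osd_id in osd_ids:
        cmd = build_tell_command(osd_id, value)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: prints the two commands shown above without executing them.
    apply_to_osds([0, 1], 256)
```

This changes the running daemons only. Since the ceph.conf written by Rook still carries the 64-byte limit, the value will most likely revert when an OSD restarts, which is why this counts as a temporary workaround.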

Recommended fix

Use BlueStore for the Ceph OSDs, and in the cluster.yaml of the Rook Ceph cluster, give each OSD a whole disk.

...
storage:
  useAllNodes: false
  useAllDevices: false
  deviceFilter:
  config:
    storeType: bluestore
  nodes:
  - name: ceph2
    devices:
    - name: sdb
  - name: ceph3
    devices:
    - name: sdb

Note: each specified disk must have a GPT header, and all of its partitions must be deleted.
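
The original mistake (naming sdb1 instead of sdb) is easy to catch before applying cluster.yaml. The following is a rough pre-flight sketch, not anything Rook provides: looks_like_partition is a hypothetical helper that relies purely on common Linux naming conventions (sdb1, vda2, nvme0n1p1), so treat it as a heuristic and confirm against the actual device layout on the node.

```python
import re

# Naming-convention heuristics only; they do not inspect the real devices.
_PARTITION_PATTERNS = (
    re.compile(r"^(?:sd|hd|vd|xvd)[a-z]+\d+$"),      # sdb1, vda2
    re.compile(r"^(?:nvme\d+n\d+|mmcblk\d+)p\d+$"),  # nvme0n1p1, mmcblk0p1
)


def looks_like_partition(device_name):
    """Return True when the name follows a typical partition pattern."""
    return any(p.match(device_name) for p in _PARTITION_PATTERNS)


def find_partition_entries(nodes):
    """Given {node name: [device names]}, list (node, device) pairs to fix."""
    return [
        (node, dev)
        for node, devices in nodes.items()
        for dev in devices
        if looks_like_partition(dev)
    ]


if __name__ == "__main__":
    # Device names from the first cluster.yaml in this post.
    original = {"ceph2": ["sdb1"], "ceph3": ["sdb1"]}
    for node, dev in find_partition_entries(original):
        print(f"{node}: {dev} looks like a partition; use the whole disk")
```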
