Analyzing Ceph's osd_enable_op_tracker Configuration

Overview

This post analyzes Ceph's OpTracker implementation, explains the osd_enable_op_tracker option, and compares performance with it enabled and disabled.

Code Analysis

op_tracker definition

The osd_enable_op_tracker option controls the OSD's op_tracker member; every OSD initializes one OpTracker instance:

// op_tracker definition
class OSD : public Dispatcher, public md_config_obs_t {
  ...
  // -- op tracking --
  OpTracker op_tracker;
  ...
};

// op_tracker initialization
OSD::OSD(...) :
  ...
  op_tracker(cct, cct->_conf->osd_enable_op_tracker,  // whether to enable the op tracker
             cct->_conf->osd_num_op_tracker_shard),   // number of shards; each shard has its own
                                                      // mutex, so op updates don't bottleneck on one lock
  ...
{
  monc->set_messenger(client_messenger);
  op_tracker.set_complaint_and_threshold(cct->_conf->osd_op_complaint_time,  // ops older than this trigger a warning log
                                         cct->_conf->osd_op_log_threshold);  // threshold for op logging
  op_tracker.set_history_size_and_duration(cct->_conf->osd_op_history_size,      // max size of the op history
                                           cct->_conf->osd_op_history_duration); // max time an op is kept in history
}
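The complaint settings above drive the OSD's slow-request warnings: an op whose age exceeds osd_op_complaint_time is reported, and low-priority ops use a multiplier so they warn later. A minimal sketch of that check (illustrative stand-in types, not the actual Ceph code):

```cpp
// Illustrative only: mimics how osd_op_complaint_time gates slow-op warnings.
struct TrackedOpStub {
    double initiated_at;                   // seconds since epoch
    double warn_interval_multiplier = 1.0; // raised for low-priority ops
};

// Return true if the op is old enough to warrant a "slow request" complaint.
bool should_complain(const TrackedOpStub &op, double now,
                     double complaint_time /* osd_op_complaint_time */) {
    return (now - op.initiated_at) > complaint_time * op.warn_interval_multiplier;
}
```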

OpTracker is defined below. It maintains a sharded list and maps each op onto one of the shards; every shard has its own mutex, so the tracking operations do not become a lock-contention bottleneck.

Each list element is a TrackedOp:

class OpTracker {
  ...
  struct ShardedTrackingData {
    Mutex ops_in_flight_lock_sharded;
    xlist<TrackedOp *> ops_in_flight_sharded;
    explicit ShardedTrackingData(string lock_name):
      ops_in_flight_lock_sharded(lock_name.c_str()) {}
  };
  vector<ShardedTrackingData*> sharded_in_flight_list;
  uint32_t num_optracker_shards;
  OpHistory history;
  ...
};
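The sharding idea can be sketched as a standalone class (a hypothetical simplification: the real code uses xlist and Ceph's own Mutex, and stores TrackedOp pointers rather than bare sequence numbers):

```cpp
#include <atomic>
#include <cstdint>
#include <list>
#include <mutex>
#include <vector>

// Simplified model of OpTracker's sharded in-flight list: ops are spread
// across shards by sequence number, so concurrent register calls rarely
// contend on the same mutex.
class ShardedTracker {
    struct Shard {
        std::mutex lock;               // per-shard ops_in_flight_lock_sharded
        std::list<uint64_t> in_flight; // per-shard ops_in_flight_sharded
    };
    std::vector<Shard> shards;
    std::atomic<uint64_t> seq{0};
public:
    explicit ShardedTracker(uint32_t num_shards) : shards(num_shards) {}

    // Mirrors register_inflight_op: pick the shard by seq % num_shards.
    uint64_t register_op() {
        uint64_t s = ++seq;
        Shard &sh = shards[s % shards.size()];
        std::lock_guard<std::mutex> g(sh.lock);
        sh.in_flight.push_back(s);
        return s % shards.size(); // shard index, returned for illustration
    }

    size_t shard_size(uint32_t i) {
        std::lock_guard<std::mutex> g(shards[i].lock);
        return shards[i].in_flight.size();
    }
};
```

With 4 shards, consecutive sequence numbers land on shards 1, 2, 3, 0, 1, ... so in-flight ops stay evenly distributed.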

TrackedOp definition

A TrackedOp represents a tracked operation and points back to the OpTracker it belongs to:

class TrackedOp {
private:
  friend class OpHistory;
  friend class OpTracker;
  xlist<TrackedOp*>::item xitem;
protected:
  OpTracker *tracker; /// the tracker we are associated with
  ...
};

Every op inside a Ceph OSD is a subclass of TrackedOp:

struct OpRequest : public TrackedOp {
  friend class OpTracker;
  ...
};

An OpRequest is created during OSD dispatch, and its fields are initialized according to whether the op tracker is enabled:

// osd/OSD.cc <<ms_fast_dispatch>>
OpRequestRef op = op_tracker.create_request<OpRequest, Message*>(m);
// osd/OSD.cc <<_dispatch>>
OpRequestRef op = op_tracker.create_request<OpRequest, Message*>(m);

class OpTracker {
  ...
  template <typename T, typename U>
  typename T::Ref create_request(U params)
  {
    typename T::Ref retval(new T(params, this),  // construct the OpRequest
                           RemoveOnDelete(this));
    retval->tracking_start(); // put the OpRequest on the OpTracker's shard list and set
                              // is_tracked = true; runs after the OpRequest constructor
    return retval;
  }
};

class TrackedOp {
  ...
  void tracking_start() {
    if (tracker->register_inflight_op(&xitem)) {
      events.push_back(make_pair(initiated_at, "initiated"));
      is_tracked = true;
    }
  }
};

OpRequest::OpRequest(Message *req, OpTracker *tracker) :
  TrackedOp(tracker, req->get_recv_stamp()), // TrackedOp's is_tracked defaults to false
  rmw_flags(0), request(req),
  hit_flag_points(0), latest_flag_point(0),
  send_map_update(false), sent_epoch(0),
  hitset_inserted(false) {
  if (req->get_priority() < tracker->cct->_conf->osd_client_op_priority) {
    // don't warn as quickly for low priority ops
    warn_interval_multiplier = tracker->cct->_conf->osd_recovery_op_warn_multiple;
  }
  if (req->get_type() == CEPH_MSG_OSD_OP) {
    reqid = static_cast<MOSDOp*>(req)->get_reqid();
  } else if (req->get_type() == MSG_OSD_SUBOP) {
    reqid = static_cast<MOSDSubOp*>(req)->reqid;
  } else if (req->get_type() == MSG_OSD_REPOP) {
    reqid = static_cast<MOSDRepOp*>(req)->reqid;
  }
  // is_tracked is still false at this point, so these event logs are not emitted
  tracker->mark_event(this, "header_read", request->get_recv_stamp());
  tracker->mark_event(this, "throttled", request->get_throttle_stamp());
  tracker->mark_event(this, "all_read", request->get_recv_complete_stamp());
  tracker->mark_event(this, "dispatched", request->get_dispatch_stamp());
}
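Note the ordering: the mark_event() calls inside the OpRequest constructor run before tracking_start(), so they are dropped because is_tracked is still false. A tiny stand-in model of that behavior (simplified, hypothetical types):

```cpp
#include <string>
#include <utility>
#include <vector>

// Illustrates why constructor-time mark_event() calls are dropped:
// is_tracked only becomes true in tracking_start(), which create_request()
// invokes after the constructor has finished.
struct MiniOp {
    bool is_tracked = false;
    std::vector<std::pair<double, std::string>> events;

    void mark_event(double t, const std::string &evt) {
        if (!is_tracked)
            return; // events marked before tracking_start() are silently dropped
        events.emplace_back(t, evt);
    }
    void tracking_start() {
        is_tracked = true;
        mark_event(0.0, "initiated");
    }
};
```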

How OpTracker records ops

When an op request is created, it is placed onto one of the OpTracker's shard lists:

bool OpTracker::register_inflight_op(xlist<TrackedOp*>::item *i)
{
  RWLock::RLocker l(lock);
  if (!tracking_enabled)
    return false;

  uint64_t current_seq = seq.inc();
  uint32_t shard_index = current_seq % num_optracker_shards;
  ShardedTrackingData* sdata = sharded_in_flight_list[shard_index];
  assert(NULL != sdata);
  {
    Mutex::Locker locker(sdata->ops_in_flight_lock_sharded);
    sdata->ops_in_flight_sharded.push_back(i);
    sdata->ops_in_flight_sharded.back()->seq = current_seq;
  }
  return true;
}

Afterwards, ceph daemon <osd.id> dump_ops_in_flight and dump_historic_ops show the in-flight and recently completed ops on that OSD.

TrackedOp mark event

When the op tracker is enabled, TrackedOp writes events to the OSD log according to the configured debug level. This logging costs performance, but it also shows which stage of an op is expensive, helping us locate system bottlenecks:

void TrackedOp::mark_event(const string &event)
{
  if (!is_tracked)
    return;
  utime_t now = ceph_clock_now(g_ceph_context);
  {
    Mutex::Locker l(lock);
    events.push_back(make_pair(now, event)); // events records every event of this op
  }
  tracker->mark_event(this, event); // mark the event and print the log
  _event_marked(); // hook for custom post-mark actions; currently a no-op
}

void OpTracker::mark_event(TrackedOp *op, const string &dest, utime_t time)
{
  if (!op->is_tracked)
    return;
  return _mark_event(op, dest, time);
}

void OpTracker::_mark_event(TrackedOp *op, const string &evt,
                            utime_t time)
{
  dout(5) << "seq: " << op->seq
          << ", time: " << time << ", event: " << evt
          << ", op: ";
  op->_dump_op_descriptor_unlocked(*_dout);
  *_dout << dendl;
}

Performance Testing

Using the RGW cosbench environment set up earlier, I compared performance with osd_enable_op_tracker set to false and to true.

Test parameters: two cosbench drivers, each writing 5,000,000 objects into its own bucket, workers=100, runtime=2000.

debug_optracker = 0/0 & osd_enable_op_tracker = false

Writing the two buckets produced the following results:

   Op-Type  Op-Count   Byte-Count  Avg-ResTime  Avg-ProcTime  Throughput    Bandwidth   Succ-Ratio
1  write    2.57 mops  82.15 GB    77.79 ms     77.64 ms      1284.26 op/s  41.1 MB/S   100%
2  write    1.32 mops  42.28 GB    151.05 ms    150.8 ms      661.69 op/s   21.17 MB/S  100%

(The second write run was noticeably slower; I suspect test variance or other I/O interfering at the time. The test was not repeated.)

Deleting the previously written buckets produced the following results:

   Op-Type         Op-Count  Byte-Count  Avg-ResTime  Avg-ProcTime  Throughput     Bandwidth  Succ-Ratio
1  cleanup-delete  5 mops    0 B         8.03 ms      8.03 ms       12468.13 op/s  0 B/S      100%
2  cleanup-delete  5 mops    0 B         4.79 ms      4.79 ms       20898.72 op/s  0 B/S      100%

debug_optracker = 0/0 & osd_enable_op_tracker = true

Writing the two buckets produced the following results:

   Op-Type  Op-Count   Byte-Count  Avg-ResTime  Avg-ProcTime  Throughput    Bandwidth   Succ-Ratio
1  write    2.67 mops  85.47 GB    74.79 ms     74.59 ms      1335.65 op/s  42.74 MB/S  100%
2  write    2.47 mops  79.02 GB    80.91 ms     80.76 ms      1234.71 op/s  39.51 MB/S  100%

Deleting the previously written buckets produced the following results:

   Op-Type         Op-Count  Byte-Count  Avg-ResTime  Avg-ProcTime  Throughput     Bandwidth  Succ-Ratio
1  cleanup-delete  5 mops    0 B         10.3 ms      10.3 ms       9709.31 op/s   0 B/S      100%
2  cleanup-delete  5 mops    0 B         8.63 ms      8.63 ms       11581.92 op/s  0 B/S      100%

(Deletion response times are still noticeably slower than in the previous run; it is unclear whether this is test variance or something else.)

Conclusion

The comparison shows that the false/true results are within test variance, confirming that setting osd_enable_op_tracker to true has little impact on Ceph performance.

Enabling debug_optracker

To trace the time an op spends in each stage inside the OSD, we can enable debug_optracker and then find the corresponding event logs in the OSD's log file.

To enable or disable debug_optracker:

ceph daemon <osd-id> config set debug_optracker 5/5   # enable debug log
ceph daemon <osd-id> config set debug_optracker 0/0   # disable debug log

You will then see log lines like the following in that OSD's log file (one write op shown as an example):

2017-05-09 11:15:58.925362 7f556af0f700  5 -- op tracker -- seq: 682036, time: 2017-05-09 11:15:58.925362, event: queued_for_pg, op: osd_op(client.1027809.0:366525835 31.36c4e4e0 (undecoded) ondisk+write+known_if_redirected e2611)
2017-05-09 11:15:58.925388 7f55b27ff700 5 -- op tracker -- seq: 682036, time: 2017-05-09 11:15:58.925388, event: reached_pg, op: osd_op(client.1027809.0:366525835 31.36c4e4e0 (undecoded) ondisk+write+known_if_redirected e2611)
2017-05-09 11:15:58.925509 7f55b27ff700 5 -- op tracker -- seq: 682036, time: 2017-05-09 11:15:58.925509, event: started, op: osd_op(client.1027809.0:366525835 31.36c4e4e0 f31f020f-aaa2-4570-b6f7-19258064e5b5.1027809.13_myobjects449924 [create 0~0 [excl],setxattr user.rgw.idtag (55),writefull 0~32000,setxattr user.rgw.manifest (600),setxattr user.rgw.acl (145),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33),call rgw.obj_store_pg_ver,setxattr user.rgw.source_zone (4)] snapc 0=[] ondisk+write+known_if_redirected e2611)
2017-05-09 11:15:58.925595 7f55b27ff700 5 -- op tracker -- seq: 682036, time: 2017-05-09 11:15:58.925595, event: waiting for subops from 29,76, op: osd_op(client.1027809.0:366525835 31.36c4e4e0 f31f020f-aaa2-4570-b6f7-19258064e5b5.1027809.13_myobjects449924 [create 0~0 [excl],setxattr user.rgw.idtag (55),writefull 0~32000,setxattr user.rgw.manifest (600),setxattr user.rgw.acl (145),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33),call rgw.obj_store_pg_ver,setxattr user.rgw.source_zone (4)] snapc 0=[] ondisk+write+known_if_redirected e2611)
2017-05-09 11:15:58.925735 7f55b27ff700 5 -- op tracker -- seq: 682036, time: 2017-05-09 11:15:58.925735, event: commit_queued_for_journal_write, op: osd_op(client.1027809.0:366525835 31.36c4e4e0 f31f020f-aaa2-4570-b6f7-19258064e5b5.1027809.13_myobjects449924 [create 0~0 [excl],setxattr user.rgw.idtag (55),writefull 0~32000,setxattr user.rgw.manifest (600),setxattr user.rgw.acl (145),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33),call rgw.obj_store_pg_ver,setxattr user.rgw.source_zone (4)] snapc 0=[] ondisk+write+known_if_redirected e2611)
2017-05-09 11:15:58.925773 7f561cfec700 5 -- op tracker -- seq: 682036, time: 2017-05-09 11:15:58.925772, event: write_thread_in_journal_buffer, op: osd_op(client.1027809.0:366525835 31.36c4e4e0 f31f020f-aaa2-4570-b6f7-19258064e5b5.1027809.13_myobjects449924 [create 0~0 [excl],setxattr user.rgw.idtag (55),writefull 0~32000,setxattr user.rgw.manifest (600),setxattr user.rgw.acl (145),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33),call rgw.obj_store_pg_ver,setxattr user.rgw.source_zone (4)] snapc 0=[] ondisk+write+known_if_redirected e2611)
2017-05-09 11:15:58.925947 7f561c7eb700 5 -- op tracker -- seq: 682036, time: 2017-05-09 11:15:58.925946, event: journaled_completion_queued, op: osd_op(client.1027809.0:366525835 31.36c4e4e0 f31f020f-aaa2-4570-b6f7-19258064e5b5.1027809.13_myobjects449924 [create 0~0 [excl],setxattr user.rgw.idtag (55),writefull 0~32000,setxattr user.rgw.manifest (600),setxattr user.rgw.acl (145),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33),call rgw.obj_store_pg_ver,setxattr user.rgw.source_zone (4)] snapc 0=[] ondisk+write+known_if_redirected e2611)
2017-05-09 11:15:58.925962 7f5614fff700 5 -- op tracker -- seq: 682036, time: 2017-05-09 11:15:58.925961, event: op_commit, op: osd_op(client.1027809.0:366525835 31.36c4e4e0 f31f020f-aaa2-4570-b6f7-19258064e5b5.1027809.13_myobjects449924 [create 0~0 [excl],setxattr user.rgw.idtag (55),writefull 0~32000,setxattr user.rgw.manifest (600),setxattr user.rgw.acl (145),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33),call rgw.obj_store_pg_ver,setxattr user.rgw.source_zone (4)] snapc 0=[] ondisk+write+known_if_redirected e2611)
2017-05-09 11:15:58.926400 7f55d3be3700 5 -- op tracker -- seq: 682036, time: 2017-05-09 11:15:58.926400, event: sub_op_commit_rec from 76, op: osd_op(client.1027809.0:366525835 31.36c4e4e0 f31f020f-aaa2-4570-b6f7-19258064e5b5.1027809.13_myobjects449924 [create 0~0 [excl],setxattr user.rgw.idtag (55),writefull 0~32000,setxattr user.rgw.manifest (600),setxattr user.rgw.acl (145),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33),call rgw.obj_store_pg_ver,setxattr user.rgw.source_zone (4)] snapc 0=[] ondisk+write+known_if_redirected e2611)
2017-05-09 11:15:58.926440 7f55b27ff700 5 -- op tracker -- seq: 682036, time: 2017-05-09 11:15:58.926439, event: sub_op_commit_rec from 29, op: osd_op(client.1027809.0:366525835 31.36c4e4e0 f31f020f-aaa2-4570-b6f7-19258064e5b5.1027809.13_myobjects449924 [create 0~0 [excl],setxattr user.rgw.idtag (55),writefull 0~32000,setxattr user.rgw.manifest (600),setxattr user.rgw.acl (145),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33),call rgw.obj_store_pg_ver,setxattr user.rgw.source_zone (4)] snapc 0=[] ondisk+write+known_if_redirected e2611)
2017-05-09 11:15:58.926456 7f55b27ff700 5 -- op tracker -- seq: 682036, time: 2017-05-09 11:15:58.926456, event: commit_sent, op: osd_op(client.1027809.0:366525835 31.36c4e4e0 f31f020f-aaa2-4570-b6f7-19258064e5b5.1027809.13_myobjects449924 [create 0~0 [excl],setxattr user.rgw.idtag (55),writefull 0~32000,setxattr user.rgw.manifest (600),setxattr user.rgw.acl (145),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33),call rgw.obj_store_pg_ver,setxattr user.rgw.source_zone (4)] snapc 0=[] ondisk+write+known_if_redirected e2611)
2017-05-09 11:15:58.926793 7f5613ffd700 5 -- op tracker -- seq: 682036, time: 2017-05-09 11:15:58.926792, event: op_applied, op: osd_op(client.1027809.0:366525835 31.36c4e4e0 f31f020f-aaa2-4570-b6f7-19258064e5b5.1027809.13_myobjects449924 [create 0~0 [excl],setxattr user.rgw.idtag (55),writefull 0~32000,setxattr user.rgw.manifest (600),setxattr user.rgw.acl (145),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33),call rgw.obj_store_pg_ver,setxattr user.rgw.source_zone (4)] snapc 0=[] ondisk+write+known_if_redirected e2611)
2017-05-09 11:15:58.926825 7f5613ffd700 5 -- op tracker -- seq: 682036, time: 2017-05-09 11:15:58.926824, event: done, op: osd_op(client.1027809.0:366525835 31.36c4e4e0 f31f020f-aaa2-4570-b6f7-19258064e5b5.1027809.13_myobjects449924 [create 0~0 [excl],setxattr user.rgw.idtag (55),writefull 0~32000,setxattr user.rgw.manifest (600),setxattr user.rgw.acl (145),setxattr user.rgw.content_type (25),setxattr user.rgw.etag (33),call rgw.obj_store_pg_ver,setxattr user.rgw.source_zone (4)] snapc 0=[] ondisk+write+known_if_redirected e2611)

Enabling debug_optracker writes a large volume of OSD log messages and severely degrades OSD performance, so it is best to turn it on only occasionally, while debugging, to collect data for bottleneck analysis.

dump_historic_ops

Even without enabling debug_optracker, part of an OpRequest's event information can still be obtained through dump_historic_ops, which is enough to break down where an op spent its time:

# ceph daemon osd.11 dump_historic_ops
{
    "num to keep": 20,
    "duration to keep": 600,
    "Ops": [
        {
            "description": "osd_op(client.1027809.0:458241652 17.e6e072e1 .dir.f31f020f-aaa2-4570-b6f7-19258064e5b5.859142.3.77 [] snapc 0=[] ack+ondisk+write+known_if_redirected e2850)",
            "initiated_at": "2017-05-09 18:54:50.987435",
            "age": 412.664363,
            "duration": 0.338724,
            "type_data": [
                "commit sent; apply or cleanup",
                {
                    "client": "client.1027809",
                    "tid": 458241652
                },
                [
                    {
                        "time": "2017-05-09 18:54:50.987435",
                        "event": "initiated"
                    },
                    {
                        "time": "2017-05-09 18:54:50.987453",
                        "event": "queued_for_pg"
                    },
                    {
                        "time": "2017-05-09 18:54:50.987483",
                        "event": "reached_pg"
                    },
                    {
                        "time": "2017-05-09 18:54:50.987520",
                        "event": "waiting for rw locks"
                    },
                    {
                        "time": "2017-05-09 18:54:51.052956",
                        "event": "reached_pg"
                    },
                    {
                        "time": "2017-05-09 18:54:51.052986",
                        "event": "waiting for rw locks"
                    },
                    {
                        "time": "2017-05-09 18:54:51.197523",
                        "event": "reached_pg"
                    },
                    {
                        "time": "2017-05-09 18:54:51.197562",
                        "event": "started"
                    },
                    {
                        "time": "2017-05-09 18:54:51.197812",
                        "event": "waiting for subops from 33,43"
                    },
                    {
                        "time": "2017-05-09 18:54:51.198068",
                        "event": "commit_queued_for_journal_write"
                    },
                    {
                        "time": "2017-05-09 18:54:51.198094",
                        "event": "write_thread_in_journal_buffer"
                    },
                    {
                        "time": "2017-05-09 18:54:51.198195",
                        "event": "journaled_completion_queued"
                    },
                    {
                        "time": "2017-05-09 18:54:51.198330",
                        "event": "op_commit"
                    },
                    {
                        "time": "2017-05-09 18:54:51.198454",
                        "event": "sub_op_commit_rec from 43"
                    },
                    {
                        "time": "2017-05-09 18:54:51.198851",
                        "event": "sub_op_commit_rec from 33"
                    },
                    {
                        "time": "2017-05-09 18:54:51.198865",
                        "event": "commit_sent"
                    },
                    {
                        "time": "2017-05-09 18:54:51.326034",
                        "event": "op_applied"
                    },
                    {
                        "time": "2017-05-09 18:54:51.326158",
                        "event": "done"
                    }
                ]
            ]
        },
        ...
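Event timelines like the one above make it easy to locate the slow stage: most of this op's 0.34 s duration went into the retries around "waiting for rw locks". A small helper to compute per-stage deltas from such a timeline (illustrative code operating on already-parsed timestamps, not the raw JSON):

```cpp
#include <string>
#include <utility>
#include <vector>

// Given an op's event timeline (seconds, event name) as dumped by
// dump_historic_ops, return the longest gap between consecutive events
// as (duration, name-of-event-reached).
std::pair<double, std::string>
slowest_stage(const std::vector<std::pair<double, std::string>> &events) {
    std::pair<double, std::string> worst{0.0, ""};
    for (size_t i = 1; i < events.size(); ++i) {
        double dt = events[i].first - events[i - 1].first;
        if (dt > worst.first)
            worst = {dt, events[i].second};
    }
    return worst;
}
```

Run on the (condensed) timeline above, the worst gap is the ~0.14 s wait before the op re-reached the PG after contending for rw locks.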