
Low IOPS with NFS-Ganesha on CephFS

IOPS is very low when NFS-Ganesha exports CephFS; this post records the troubleshooting process. Ceph version 14.2.11; NFS-Ganesha version 2.8.4.

Symptom (1)

NFS-Ganesha exports CephFS. The filesystem is mounted in two ways (ceph-fuse and NFSv4), and each mount is benchmarked with the following fio job:

[global]
ioengine=libaio
buffered=0
direct=1
iodepth=128
rw=randwrite
bs=4k
numjobs=1
thread
size=300G
runtime=120

[file1]
filename=/mnt/test/testfile.empty102
  • ceph-fuse: iops=1500
  • nfsv4: iops=955

Investigation (2)

Pool stats (ceph osd pool stats)

- During the fuse benchmark, the cephfs data pool write bandwidth is 6 MiB/s; the metadata pool has almost no writes (1.6 KiB/s, or "nothing is going on");
- During the nfsv4 benchmark, the cephfs data pool write bandwidth is 3.8 MiB/s; the metadata pool write bandwidth is 1.9 MiB/s.

Note that the data pool write bandwidth (3.8) is 2x the metadata pool write bandwidth (1.9). Re-running the nfsv4 benchmark with bs=512k, the data pool write bandwidth becomes 256x the metadata pool's; with bs=1024k it is 512x, and so on.

OSD logs

Given the write-bandwidth ratio between the data pool and metadata pool above, a hypothesis: during the nfsv4 benchmark, if the data pool and metadata pool iops were 1:1, then the metadata pool io-size should be around 2 KiB (1.9 MiB/s ÷ 955 iops ≈ 2 KiB).
To verify this, collect the logs of all OSDs, grep the lines containing ' lat ' (logged at level 15), group them by pool, and sum them up. The per-pool totals are essentially equal, differing by less than 1%. In other words, the data pool and metadata pool iops really are 1:1, and the metadata pool io-size is about 1930 bytes, close to 2 KiB.

MDS perf counters

Before the nfs benchmark:

ceph daemon /var/run/ceph/mds.{name}.asok perf reset all

After the nfs benchmark:

ceph daemon /var/run/ceph/mds.{name}.asok perf dump

The perf dump shows the MDS doing essentially no write operations, so attribute (attr) modification can be ruled out.

MDS logs

5 mds.0.log _submit_thread 154413621245~1910 : EUpdate cap update [metablob 0x1, 2 dirs]
  • fuse benchmark: only 38 _submit_thread log lines.
  • nfsv4 benchmark: 114243 _submit_thread log lines; divided by the 120-second run time that is about 952 lines per second, which matches the iops. This again shows that the data pool and metadata pool iops are 1:1: for every data write the MDS writes its journal once, 1910 bytes each time, close to 2 KiB.
mds.0.locker handle_client_caps  on 0x1000000002e tid 1097259 follows 1 op update flags 0x1
...
mds.0.locker _do_cap_update dirty Fw issued pAsxLsXsxFsxcrwb wanted pAsxXsxFsxcrwb on [inode 0x1000000002e [2,head] /perf-test-13/testfile.empty102
...

This log line appears roughly as many times as the _submit_thread line.

Setting mds_log_pause

The mds_log_pause option can be used to control whether the MDS writes its journal:

ceph daemon /var/run/ceph/mds.{name}.asok config set mds_log_pause {true or false}
  • nfsv4 benchmark: setting it to true makes fio's iops drop to 0 immediately; setting it back to false makes iops rise again immediately. This can be reproduced repeatedly.
  • fuse benchmark: with it set to true, the fio run keeps going normally.

This shows once more that during the nfsv4 benchmark every data write makes the MDS write its journal once, and if journal writes are blocked, file writes cannot proceed; the data pool and metadata pool iops are 1:1.

dump journal events

cephfs-journal-tool --rank=<fs>:<rank> event get list > journal.events

tail journal.events

2024-07-16 11:49:02.641177 0x24357f3b04 UPDATE: (cap update)
testfile.empty102
2024-07-16 11:49:02.643862 0x24357f4170 UPDATE: (cap update)
testfile.empty102
2024-07-16 11:49:20.385734 0x24357f47dc UPDATE: (cap update)
testfile.empty102
2024-07-16 11:49:35.607546 0x24357f4e22 SESSION: ()
2024-07-16 11:49:35.618058 0x24357f4fe8 SESSION: ()
2024-07-16 11:50:02.584154 0x24357f5067 SESSION: ()
2024-07-16 11:50:02.595054 0x24357f522d SESSION: ()
  • nfsv4 benchmark: within the test window there is a huge number of UPDATE: (cap update) events; the count is incomplete because the list output is truncated, yet it still reaches 92640, and the total should likewise be 114243.
  • fuse benchmark: only a few dozen UPDATE: (cap update) lines.

So at this point we can conclude that the low nfsv4 iops is caused by frequent journal flushes, and the journal events are cap updates.

cap update handling flow

The CephFS client (nfs-ganesha) sends a capability update (CEPH_CAP_OP_UPDATE) request to the MDS.

  • Create an EUpdate and put it on MDLog::pending_events (code excerpted from ceph-14.2.11 mds):
void Locker::dispatch(const Message::const_ref &m) -->
void Locker::handle_client_caps(const MClientCaps::const_ref &m) -->
bool Locker::_do_cap_update(CInode *in, Capability *cap,
                            int dirty, snapid_t follows,
                            const MClientCaps::const_ref &m, const MClientCaps::ref &ack,
                            bool *need_flush)
{
  // ......

  // do the update.
  EUpdate *le = new EUpdate(mds->mdlog, "cap update");
  mds->mdlog->start_entry(le);

  // ......

  mds->mdlog->submit_entry(le, new C_Locker_FileUpdate_finish(this, in, mut, update_flags,
                                                              ack, client));
}
  • Process the EUpdate: encode it into a journal entry and flush it to the journal, which is stored in the metadata pool (code excerpted from ceph-14.2.11 mds):
void MDLog::_submit_thread()
{
  while (!mds->is_daemon_stopping()) {
    // ......

    map<uint64_t,list<PendingEvent> >::iterator it = pending_events.begin();

    // ......

    PendingEvent data = it->second.front();

    // ......

    if (data.le) {
      LogEvent *le = data.le;
      LogSegment *ls = le->_segment;
      // encode it, with event type
      bufferlist bl;
      le->encode_with_header(bl, features);

      uint64_t write_pos = journaler->get_write_pos();

      le->set_start_off(write_pos);
      if (le->get_type() == EVENT_SUBTREEMAP)
        ls->offset = write_pos;

      dout(5) << "_submit_thread " << write_pos << "~" << bl.length()
              << " : " << *le << dendl;

      // journal it.
      const uint64_t new_write_pos = journaler->append_entry(bl);  // bl is destroyed.
      ls->end = new_write_pos;

      MDSLogContextBase *fin;
      if (data.fin) {
        fin = dynamic_cast<MDSLogContextBase*>(data.fin);
        ceph_assert(fin);
        fin->set_write_pos(new_write_pos);
      } else {
        fin = new C_MDL_Flushed(this, new_write_pos);
      }

      journaler->wait_for_flush(fin);

      if (data.flush)
        journaler->flush();

      if (logger)
        logger->set(l_mdl_wrpos, ls->end);

      delete le;
    } else {

      // ......

    }

    // ......
  }
}

Who sends the capability update (CEPH_CAP_OP_UPDATE) request

Turn on libcephfs client logging on the nfs-ganesha side by setting debug_client=20 in ceph.conf.
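
For reference, a minimal ceph.conf fragment on the ganesha host looks like the following (assuming the option sits in the generic [client] section, as it does for any libcephfs consumer):

[client]
debug_client = 20

With that enabled, the client log shows: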

2024-07-16 19:22:07.468 7f0b003e8700 10 client.230209841 get_caps 0x10000000031.head(...) have pAsxLsXsxFsxcrwb need AsFw want Fb revoking -
2024-07-16 19:22:07.468 7f0b003e8700 10 client.230209841 snaprealm snaprealm(0x1 nref=3 c=0 seq=1 parent=0x0 my_snaps=[] cached_snapc=1=[])
2024-07-16 19:22:07.468 7f0b003e8700 7 client.230209841 wrote to 4674625536, leaving file size at 23498956800
2024-07-16 19:22:07.468 7f0b003e8700 10 mark_caps_dirty 0x10000000031.head(...) - -> Fw
2024-07-16 19:22:07.468 7f0b003e8700 3 client.230209841 ll_write 0x7f0b0ce62cc0 4674621440~4096 = 4096
2024-07-16 19:22:07.468 7f0b003e8700 3 client.230209841 ll_fsync 0x7f0b0ce62cc0 0x10000000031
2024-07-16 19:22:07.468 7f0b003e8700 8 client.230209841 _fsync(0x7f0b0ce62cc0, data+metadata)
2024-07-16 19:22:07.468 7f0b003e8700 8 client.230209841 _fsync on 0x10000000031.head(...) (data+metadata)
2024-07-16 19:22:07.468 7f0b003e8700 15 inode.get on 0x7f0b14457f00 0x10000000031.head now 11
2024-07-16 19:22:07.468 7f0b003e8700 10 client.230209841 _flush 0x10000000031.head(...)
2024-07-16 19:22:07.468 7f0b003e8700 15 client.230209841 using return-valued form of _fsync
2024-07-16 19:22:07.468 7f0b003e8700 10 client.230209841 check_caps on 0x10000000031.head(...) wanted pAsxXsxFsxcrwb used Fcb issued pAsxLsXsxFsxcrwb revoking - flags=3
2024-07-16 19:22:07.468 7f0b003e8700 10 client.230209841 cap mds.0 issued pAsxLsXsxFsxcrwb implemented pAsxLsXsxFsxcrwb revoking -
2024-07-16 19:22:07.468 7f0b003e8700 10 client.230209841 mark_caps_flushing (more) Fw 0x10000000031.head(...)
2024-07-16 19:22:07.468 7f0b003e8700 10 mark_caps_clean 0x10000000031.head(...)
2024-07-16 19:22:07.468 7f0b003e8700 10 client.230209841 send_cap 0x10000000031.head(...) mds.0 seq 2 used Fcb want pAsxXsxFsxcrwb flush Fw retain pAsxLsxXsxFsxcrwbl held pAsxLsXsxFsxcrwb revoking - dropping -
2024-07-16 19:22:07.468 7f0b003e8700 15 client.230209841 auth cap, requesting max_size 320360443904
2024-07-16 19:22:07.468 7f0b003e8700 15 client.230209841 waiting on data to flush

From the log we can tell that ceph_ll_fsync() is called with syncdataonly=false, because it has to sync data+metadata:


int Client::_fsync(Inode *in, bool syncdataonly)
{
  // ......

  ldout(cct, 8) << "_fsync on " << *in << " " << (syncdataonly ? "(dataonly)":"(data+metadata)") << dendl;

  if (!syncdataonly && in->dirty_caps) {
    check_caps(in, CHECK_CAPS_NODELAY|CHECK_CAPS_SYNCHRONOUS);
    if (in->flushing_caps)
      flush_tid = last_flush_tid;
  } else ldout(cct, 10) << "no metadata needs to commit" << dendl;

  // ......
}

As shown, with syncdataonly=false it calls check_caps():

void Client::check_caps(Inode *in, unsigned flags) -->
void Client::send_cap(Inode *in, MetaSession *session, Cap *cap,
                      int flags, int used, int want, int retain,
                      int flush, ceph_tid_t flush_tid)
{
  int op = CEPH_CAP_OP_UPDATE;

  // ......

  auto m = MClientCaps::create(op,
                               in->ino,
                               0,
                               cap->cap_id, cap->seq,
                               cap->implemented,
                               want,
                               flush,
                               cap->mseq,
                               cap_epoch_barrier);
  // ......

  session->con->send_message2(std::move(m));
}
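
For reference, the public libcephfs entry point that ganesha calls sits directly on top of Client::_fsync(). The short C sketch below is illustrative only (the wrapper function is not part of ganesha or libcephfs); it shows how the syncdataonly flag selects between the two paths traced above:

#include <cephfs/libcephfs.h>

/*
 * Declared in include/cephfs/libcephfs.h:
 *   int ceph_ll_fsync(struct ceph_mount_info *cmount, struct Fh *fh,
 *                     int syncdataonly);
 *
 * syncdataonly == 0 -> Client::_fsync(in, false): data + metadata; dirty Fw
 *                      caps are flushed to the MDS (CEPH_CAP_OP_UPDATE),
 *                      which journals an EUpdate per call.
 * syncdataonly != 0 -> Client::_fsync(in, true): file data only; no cap
 *                      flush and no MDS journal write.
 */
static int fsync_data_only(struct ceph_mount_info *cmount, struct Fh *fh)
{
    /* illustrative wrapper: sync file data but skip the metadata round trip */
    return ceph_ll_fsync(cmount, fh, 1);
}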

Root cause (3)

So the root cause is that ganesha passes syncdataonly=false when calling libcephfs's ceph_ll_fsync. Trying to control this parameter through fio's fsync and fdatasync options has no effect, whereas with ceph-fuse those two options do work. The comparison:

fio options               | nfs-ganesha        | ceph-fuse
--direct=1                | sync data+metadata | no sync
--direct=1 --fsync=1      | sync data+metadata | sync data+metadata
--direct=1 --fdatasync=1  | sync data+metadata | sync data-only

Moreover, with --direct=1 --fsync=1 (which also has to sync data+metadata), ceph-fuse behaves exactly like nfsv4: heavy write bandwidth on the metadata pool and very low iops…

Reading ganesha's code shows that every call to ceph_ll_fsync passes syncdataonly=false. Brute-force changing the argument to true in the call inside ceph_fsal_write2() in src/FSAL/FSAL_CEPH/handle.c brings iops up to the ceph-fuse level, with no write bandwidth on the metadata pool. The question is: is this change safe?
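
A rough sketch of that brute-force change is shown below. The surrounding variable names (write_arg, export, my_fd) only approximate ganesha 2.8's FSAL_CEPH code and are not an exact diff; the point is simply which argument gets flipped:

/* src/FSAL/FSAL_CEPH/handle.c, inside ceph_fsal_write2() */
if (write_arg->fsal_stable) {
    /* upstream: syncdataonly=false -> flush dirty caps as well, which costs
     * one MDS journal write per fsync */
    /* retval = ceph_ll_fsync(export->cmount, my_fd->fd, false); */

    /* experiment: syncdataonly=true -> sync file data only, no cap flush,
     * so the metadata pool stays idle */
    retval = ceph_ll_fsync(export->cmount, my_fd->fd, true);
}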

Next steps (4)

  • Evaluate whether the change above is safe;
  • Look at how ceph-fuse supports fdatasync;
  • libcephfs sends capability update requests to the MDS; why don't they show up in the MDS perf counters?
  • Why is a capability update as large as 1900+ bytes?