nvme-rdma: Unexpected IB/RoCE End to End Flow Control credit behaviour

Samuel Jones sjones at kalrayinc.com
Tue Nov 8 13:21:28 PST 2022


Hi all,

I'm doing some performance analysis on an NVMeoF setup using the kernel nvme-rdma driver and a Mellanox board. I'm seeing some unexpected behaviour with respect to the End to End flow control credits in the IB/RoCE standard, and I am reaching out in case anyone has an explanation for this. I'm not sure it's a real problem, but I can't seem to make it make sense.

My issue can be reproduced using two x86 servers with Mellanox NICs and the stock Linux nvme-rdma stack. I'm using 5.4 and 5.14, but I have seen it with other kernel versions too. My understanding is that the IB credits reflect the number of available buffers in the receiver's receive queue. nvme-rdma fills this queue up to the NVMe queue depth at connection establishment, and each time it receives a completion entry for a recv queue entry it immediately posts another recv queue entry.
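To be concrete, the repost-on-completion pattern I'm describing looks roughly like this at the verbs level. This is a simplified userspace sketch of my understanding, not the actual driver code: it assumes an already-connected QP, a registered MR and a pre-allocated array of recv buffers, and it leaves out error handling and the NVMe protocol itself.

#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Pre-post one recv WR per queue entry, up to the NVMe queue depth. */
static int prepost_recvs(struct ibv_qp *qp, struct ibv_mr *mr,
                         char *bufs, size_t buf_sz, int queue_depth)
{
    for (int i = 0; i < queue_depth; i++) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)(bufs + i * buf_sz),
            .length = buf_sz,
            .lkey   = mr->lkey,
        };
        struct ibv_recv_wr wr = {
            .wr_id   = i,
            .sg_list = &sge,
            .num_sge = 1,
        };
        struct ibv_recv_wr *bad;

        if (ibv_post_recv(qp, &wr, &bad))
            return -1;
    }
    return 0;
}

/* On each recv completion, repost the same buffer immediately, so the
 * receive queue should never drain by more than the handful of
 * completions currently in flight. */
static int handle_recv_completion(struct ibv_qp *qp, struct ibv_mr *mr,
                                  char *bufs, size_t buf_sz,
                                  struct ibv_wc *wc)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)(bufs + wc->wr_id * buf_sz),
        .length = buf_sz,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = {
        .wr_id   = wc->wr_id,
        .sg_list = &sge,
        .num_sge = 1,
    };
    struct ibv_recv_wr *bad;

    /* ... process the NVMe completion carried in the recv buffer ... */
    return ibv_post_recv(qp, &wr, &bad);
}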

As such, when I observe the credit values exchanged between the two partners, I would expect the credits to remain at a "high" value, relatively close to the NVMe queue depth. Obviously a little variance is to be expected, as the responder may not handle completion queue entries as fast as they arrive. However, what I observe is rather different: the initial credit value is high (close to queue depth); it then decreases as the sender sends more datagrams until it eventually reaches zero. Once it reaches zero, the sender sends another datagram and, somehow, the next ACK sent by the receiver contains a credit value that is once again high (close to queue depth). This saw-tooth credit pattern repeats continuously over the life of the connection.

I can't make sense of this: given that the system is limited by the NVMe queue depth, I can't see how it is possible for a sender to saturate the receiver's recv queue, even if the receiver goes AWOL for a *long* time. This credit behaviour could imply that the receiver is not actually handling the completion queue entries promptly, but I find that very difficult to believe. I don't understand the intricacies of how ib-core handles completion queue entries, but I think the performance impact would be catastrophic if this were the case. It could also imply that new recv buffers are not being committed to the card promptly, but as far as I can tell ib_post_recv in the mlx5 driver is pretty straightforward. I even started wondering if this could be a Mellanox driver or firmware issue, but I ran an SPDK-based bench on the exact same setup and the credit behaviour was exactly what I would expect: the credits stay high throughout the duration of the bench. The SPDK code follows the same approach as the kernel driver in terms of recv queue usage, except of course it goes through libibverbs instead of ib-core.
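For what it's worth, the SPDK-style processing I compared against boils down to an explicit poll loop, something like the sketch below (again a simplified, hypothetical example, reusing the handle_recv_completion() helper sketched above):

#include <infiniband/verbs.h>

/* Drain the CQ in batches and repost each recv buffer before moving on;
 * with this scheme the advertised credits should stay close to queue depth. */
static void poll_and_repost(struct ibv_cq *cq, struct ibv_qp *qp,
                            struct ibv_mr *mr, char *bufs, size_t buf_sz)
{
    struct ibv_wc wcs[32];
    int n;

    while ((n = ibv_poll_cq(cq, 32, wcs)) > 0) {
        for (int i = 0; i < n; i++) {
            if (wcs[i].status != IBV_WC_SUCCESS)
                continue; /* real code would tear down the connection */
            if (wcs[i].opcode == IBV_WC_RECV)
                handle_recv_completion(qp, mr, bufs, buf_sz, &wcs[i]);
        }
    }
}

As far as I understand, the kernel path does essentially the same reposting, just driven by ib-core's CQ polling rather than an application poll loop, which is why I struggle to see how it could fall so far behind.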

Anyone have any insight to share on this stuff?
Hopefully yours

Samuel Jones
Datacenter SW Development Manager • Kalray
Phone:
sjones at kalrayinc.com • [ https://www.kalrayinc.com/ | www.kalrayinc.com ]








