nvmeof Issues with Zen 3/Ryzen 5000 Initiator
Jonathan Wright
jonathan at knownhost.com
Wed May 26 13:47:05 PDT 2021
I've been testing NVMe over Fabrics for the past few weeks and the
performance has been nothing short of incredible, though I'm running
into some major issues that seem to be specific to AMD Zen 3 Ryzen
chips (in my case I'm testing with a 5900X).
Target:
Supermicro X10 board
Xeon E5-2620v4
Intel E810 NIC
Problematic Client/initiator:
ASRock X570 board
Ryzen 9 5900x
Intel E810 NIC
Stable Client/initiator:
Supermicro X10 board
Xeon E5-2620v4
Intel E810 NIC
I'm using the same 2 E810 NICs and pair of 25G DACs in both cases. The
NICs are directly connected with the DACs and there is no switch in the
equation. To trigger the issue I'm simply running fio with something
like this:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
--name=test --filename=/dev/nvme0n1 --bs=4k --iodepth=64 --size=10G
--readwrite=randread --time_based --runtime=1200
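(For completeness, the target is a standard nvmet RDMA port and the
initiator side is just a plain nvme-cli connect.  The address and NQN
below are placeholders rather than my real values:)

# 10.0.0.1 = target's E810 IP, <target-nqn> = the exported subsystem NQN
nvme connect -t rdma -a 10.0.0.1 -s 4420 -n <target-nqn>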
I'm primarily using RDMA/iWARP right now but I've also tested RoCE2
which presents the same issues/symptoms. Primary testing has been done
on Ubuntu 20.04.2, with CentOS 8 in the mix as well to try and rule out
a weird distro-specific issue. All tests used the latest ice/irdma
drivers from Intel (1.5.8 and 1.5.2, respectively).
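(In case it's relevant, the loaded driver versions and the transport the
RDMA device is presenting are easy to confirm on both ends; nothing
exotic here:)

# confirm which ice/irdma versions are actually loaded
modinfo ice | grep -i '^version'
modinfo irdma | grep -i '^version'
# confirm what the RDMA device reports (iWARP vs RoCE/Ethernet)
ibv_devinfo | grep -E 'hca_id|transport|link_layer'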
I've not yet tested a Ryzen 5900X target with an Intel initiator, but I
plan to in order to see whether it exhibits the same instability.
The issue presents itself as a connectivity loss between the two hosts -
but there is no actual connectivity problem. The timing is also somewhat
inconsistent: sometimes it shows up after 1-2 minutes of testing,
sometimes instantly, and sometimes close to 10 minutes in.
Target dmesg sample:
[ 3867.598007] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
[ 3867.598384] nvmet: ctrl 1 fatal error occurred!
Initiator dmesg sample:
<snip>
[ 348.122160] nvme nvme4: I/O 86 QID 17 timeout
[ 348.122224] nvme nvme4: I/O 87 QID 17 timeout
[ 348.122290] nvme nvme4: I/O 88 QID 17 timeout
[ 348.122354] nvme nvme4: I/O 89 QID 17 timeout
[ 348.122417] nvme nvme4: I/O 90 QID 17 timeout
[ 348.122480] nvme nvme4: I/O 91 QID 17 timeout
[ 348.122544] nvme nvme4: I/O 92 QID 17 timeout
[ 348.122607] nvme nvme4: I/O 93 QID 17 timeout
[ 348.122670] nvme nvme4: I/O 94 QID 17 timeout
[ 348.122733] nvme nvme4: I/O 95 QID 17 timeout
[ 348.122796] nvme nvme4: I/O 96 QID 17 timeout
<snip>
[ 380.387212] nvme nvme4: creating 24 I/O queues.
[ 380.573925] nvme nvme4: Successfully reconnected (1 attempts)
All the while the underlying connectivity keeps working just fine.
There's a long delay between the timeouts and the successful reconnect;
I haven't timed it precisely, but it seems to be about 5 minutes. That
has luckily given me plenty of time to test connectivity, which has
consistently checked out on all fronts.
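(For anyone trying to reproduce the timing: the reconnect cadence and
give-up window can be pinned down explicitly at connect time with the
standard nvme-cli fabrics options - the values below are only examples,
not what I'm running:)

# make the reconnect behaviour explicit (values are examples)
nvme connect -t rdma -a 10.0.0.1 -s 4420 -n <target-nqn> \
    --reconnect-delay=10 --ctrl-loss-tmo=600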
I'm testing with a single Micron 9300 Pro 7.68TB right now, which can
push about 850k read IOPS. On the Intel target/initiator combo I can
run it "balls to the wall" for hours on end with zero issues. On the
AMD initiator I can generally trigger the disconnect/drop within 5
minutes. Here's where things get weird - if I limit the test to 200K
IOPS or less then it's relatively stable on the AMD side, and I've not
seen any drops with that limitation in place.
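(For anyone wanting to reproduce the capped runs: fio's rate limiting
is the straightforward way to hold the job at or under 200K, i.e. the
same job as above with a rate_iops cap:)

# same random-read job as above, capped at roughly 200k IOPS
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
    --name=test --filename=/dev/nvme0n1 --bs=4k --iodepth=64 --size=10G \
    --readwrite=randread --time_based --runtime=1200 --rate_iops=200000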
Here are some things I've tried which make no difference (or make things
worse):
Ubuntu 20.04.2 kernel 5.4
Ubuntu 20.04.2 kernel 5.8
Ubuntu 20.04.2 kernel 5.10
CentOS 8 kernel 4.18
CentOS 8 kernel 5.10 (from elrepo)
CentOS 8 kernel 5.12 (from elrepo) - whole system actually freezes upon
"nvme connect" command on this one
With and without native multipath (toggling sketched after this list)
With and without round-robin on native multipath
Different NVMe drive models
With and without PFC
10G DAC
25G DAC
25G DAC negotiated at 10G
With and without a switch
iWARP and RoCE2
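(For the two multipath entries above: native multipath is the nvme_core
module option, and round-robin is the per-subsystem iopolicy. The
subsystem number in the sysfs path below is illustrative:)

# boot with native NVMe multipath on or off:
#   nvme_core.multipath=Y   /   nvme_core.multipath=N
# switch a subsystem's I/O policy to round-robin
echo round-robin > /sys/class/nvme-subsystem/nvme-subsys0/iopolicy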
I did do some testing with NVMe over TCP as well, but I cannot reach the
>200k IOPS threshold with it, which seems to be important for triggering
the issue. I did not experience the drops over TCP.
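(The TCP runs are the same setup over the tcp transport, with the same
placeholders as the rdma connect above:)

nvme connect -t tcp -a 10.0.0.1 -s 4420 -n <target-nqn>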
I can't seem to draw any conclusion other than this being something
specific to Zen 3, but I'm not sure why. Is there somewhere I should be
looking aside from "dmesg" to get some useful debug info? According to
the irdma driver there are no RDMA packets getting lost, dropped, or
errored, etc. Common things like rping and ib_read_bw/ib_write_bw tests
all run indefinitely without error.
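(Those checks are nothing exotic - roughly the standard librdmacm and
perftest invocations, with the same placeholder address as above:)

# on the target (server side)
rping -s -a 10.0.0.1 -v
ib_read_bw
# on the initiator (client side)
rping -c -a 10.0.0.1 -v
ib_read_bw 10.0.0.1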
I would appreciate any help or advice with this, or pointers on how I
can help confirm whether this is indeed specific to Zen 3.
--
Jonathan Wright
KnownHost, LLC
https://www.knownhost.com