nvmeof Issues with Zen 3/Ryzen 5000 Initiator
Sagi Grimberg
sagi at grimberg.me
Thu May 27 14:36:33 PDT 2021
> I've been testing NVMe over Fabrics for the past few weeks and the
> performance has been nothing short of incredible, though I'm running
> into some major issues that seem to be specifically related to AMD Zen
> 3 Ryzen chips (in my case I'm testing with 5900x).
>
> Target:
> Supermicro X10 board
> Xeon E5-2620v4
> Intel E810 NIC
>
> Problematic Client/initiator:
> ASRock X570 board
> Ryzen 9 5900x
> Intel E810 NIC
>
> Stable Client/initiator:
> Supermicro X10 board
> Xeon E5-2620v4
> Intel E810 NIC
>
> I'm using the same 2 E810 NICs and pair of 25G DACs in both cases. The
> NICs are directly connected with the DACs and there is no switch in the
> equation. To trigger the issue I'm simply running an fio command
> similar to this:
>
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
> --name=test --filename=/dev/nvme0n1 --bs=4k --iodepth=64 --size=10G
> --readwrite=randread --time_based --runtime=1200
>
> I'm primarily using RDMA/iWARP right now but I've also tested RoCEv2,
> which presents the same issues/symptoms. Primary testing has been done
> with Ubuntu 20.04.2, with CentOS 8 in the mix as well just to try and
> rule out a weird distro-specific issue. All tests used the latest
> ice/irdma drivers from Intel (1.5.8 and 1.5.2 respectively).
CCing Shiraz Saleem who maintains irdma.
>
> I've not yet tested a Ryzen 5900x target with an Intel initiator but I
> plan to, to see if it exhibits the same instability.
>
> The issue presents itself as a connectivity loss between the two hosts -
> but there is no connectivity issue. The issue is also somewhat
> inconsistent. Sometimes it will show up after 1-2 minutes of testing,
> sometimes instantly, and sometimes close to 10 minutes in.
>
> Target dmesg sample:
> [ 3867.598007] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
> [ 3867.598384] nvmet: ctrl 1 fatal error occurred!
>
> Initiator dmesg sample:
> <snip>
> [ 348.122160] nvme nvme4: I/O 86 QID 17 timeout
> [ 348.122224] nvme nvme4: I/O 87 QID 17 timeout
> [ 348.122290] nvme nvme4: I/O 88 QID 17 timeout
> [ 348.122354] nvme nvme4: I/O 89 QID 17 timeout
> [ 348.122417] nvme nvme4: I/O 90 QID 17 timeout
> [ 348.122480] nvme nvme4: I/O 91 QID 17 timeout
> [ 348.122544] nvme nvme4: I/O 92 QID 17 timeout
> [ 348.122607] nvme nvme4: I/O 93 QID 17 timeout
> [ 348.122670] nvme nvme4: I/O 94 QID 17 timeout
> [ 348.122733] nvme nvme4: I/O 95 QID 17 timeout
> [ 348.122796] nvme nvme4: I/O 96 QID 17 timeout
> <snip>
> [ 380.387212] nvme nvme4: creating 24 I/O queues.
> [ 380.573925] nvme nvme4: Successfully reconnected (1 attempts)
>
> All the while the underlying connectivity is working just fine. There's
> a long delay between the timeout and the successful reconnect. I
> haven't timed it but it seems like about 5 minutes. This has luckily
> given me plenty of time to test connectivity, which has consistently
> been just fine on all fronts.
Seems like a loss of connectivity from the driver's perspective.
While this is happening, can you try an rdma application like
ib_send_bw/ib_send_lat or something?
I'd also suggest running both workloads concurrently and seeing if they
both suffer from the connectivity issue; that will help rule out
whether this is something specific to the nvme-rdma driver.
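Something like this should do it (just a sketch - I'm assuming the
perftest tools are installed, <rdma_dev> stands for whatever device name
ibv_devices reports for the E810, and -D runs the test for a fixed
number of seconds):

  # on the target
  ib_send_bw -d <rdma_dev> -D 60
  # on the initiator, while the fio job is running
  ib_send_bw -d <rdma_dev> -D 60 <target_ip>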
>
> I'm testing with a single Micron 9300 Pro 7.68TB right now which can
> push about 850k read IOPS. On the Intel target/initiator combo I can
> run it "balls to the walls" for hours on end with 0 issues. On the AMD
> initiator I can trigger the disconnect/drop generally within 5 minutes.
> Here's where things get weird - if I limit the test to 200k IOPS or less
> then it's relatively stable on the AMD and I've not seen any drops when
> this limitation is in place.
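If it helps to flip between the capped and uncapped cases with the same
job, fio can apply the limit itself (a sketch, assuming fio's rate
limiting is how you're capping it; --rate_iops is the option I mean):

  fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
      --name=test --filename=/dev/nvme0n1 --bs=4k --iodepth=64 --size=10G \
      --readwrite=randread --time_based --runtime=1200 --rate_iops=200000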
>
> Here are some things I've tried which make no difference (or make things
> worse):
>
> Ubuntu 20.04.2 kernel 5.4
> Ubuntu 20.04.2 kernel 5.8
> Ubuntu 20.04.2 kernel 5.10
> CentOS 8 kernel 4.18
> CentOS 8 kernel 5.10 (from elrepo)
> CentOS 8 kernel 5.12 (from elrepo) - the whole system actually freezes
> upon the "nvme connect" command on this one
> With and without multipath (native)
> With and without round-robin on multipath (native)
> Different NVMe drive models
> With and without PFC
> 10G DAC
> 25G DAC
> 25G DAC negotiated at 10G
> With and without a switch
> iWARP and RoCEv2
Looks like this probably always existed...
>
> I did do some testing with TCP/IP but cannot reach the >200k IOPS
> threshold with it, which seems to be important for triggering the issue.
> I did not experience the drops with TCP/IP.
>
> I can't seem to draw any conclusion other than this being something
> specific to Zen 3, but I'm not sure why. Is there somewhere I should be
> looking aside from "dmesg" to get some useful debug info? According to
> the irdma driver there are no rdma packets getting
> lost/dropped/erroring, etc. Common things like rping and
> ib_read_bw/ib_write_bw tests all run indefinitely without error.
Ah, that is an important detail.
I think a packet sniffer can help here if this is the case; IIRC
there should be a way to sniff rdma traffic using tcpdump, but I don't
remember the details. Perhaps the Intel folks can help you there...
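For what it's worth, iWARP is plain TCP on the wire, so in principle a
capture on the netdev would show the NVMe-oF connection - something
along these lines (4420 is the default NVMe over Fabrics port, and the
interface name is a placeholder):

  tcpdump -i <eth_iface> -w nvmeof.pcap 'tcp port 4420'

Whether the data path actually shows up there depends on how much the
RNIC offloads, and for RoCEv2 (UDP port 4791) the same caveat applies,
so the Intel folks would know the right way to capture on the E810.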