nvme-tcp bricks my computer
Sagi Grimberg
sagi at grimberg.me
Wed Feb 3 17:36:53 EST 2021
>> I'm running "nvme discover" over a TCP connection. The nvme-tcp module freezes completely and bricks my computer.
>>
>> Steps:
>> $ sudo modprobe nvme-tcp
>> $ sudo nvme discover -t tcp -a [IP address] -s 8009
>> <System Bricked!>
>> Only a reboot (Alt-SysRq-B) can recover the system.
>
> Do you have a stack trace to share?
> *<MB> No. I'm not able to collect data since my computer freezes
> completely. I had a window open that was showing the syslog at the
> time the freeze occurred. I took a couple of pictures (attached).*
OK. Are you able to scroll up a bit to see the RIP line in the
stack trace, and whether this is a NULL dereference or something else?
>> Conditions to reproduce the problem:
>> The Discovery Controller must support sending Discovery Log Change Notifications. That is, bit 31 of the Identity's OAES field returned by the discovery controller must be set to 1. If OAES[31]=0, then everything is OK.
>
> What is the discovery log page returned by the nvme discovery
> controller? Does it include referrals? There was an issue fixed
> in nvme-cli with respect to referrals (although nothing that is
> related to any oaes changes).
> *<MB> The driver never makes it to asking for the discovery log page. It
> freezes as soon as it receives the "Identity" message.*
You mean Identify Controller? That is strange, because I'm not sure it
should make any difference.
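For reference, the reproduction condition quoted above comes down to a single bit test on the 32-bit OAES value. A minimal shell sketch, assuming the OAES value has already been obtained by some other means (for example from the target side, since the host freezes); the function name `oaes_has_disc_change` is made up for illustration and is not part of nvme-cli or the kernel:

```shell
# Hypothetical helper: report whether bit 31 of an OAES value is set.
# Per the report above, bit 31 means the discovery controller supports
# Discovery Log Change Notifications.
oaes_has_disc_change() {
    # $1: OAES value, decimal or 0x-prefixed hex
    if [ $(( ($1 >> 31) & 1 )) -eq 1 ]; then
        echo yes
    else
        echo no
    fi
}
```

Calling `oaes_has_disc_change 0x80000000` prints `yes`, while a value with only lower bits set, such as `0x00000100`, prints `no`.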
>> Systems tested:
>> 1) Ubuntu 20.04, Linux 5.8, nvme 1.13.21
>> 2) Fedora 33, Linux 5.10, nvme 1.11.1
>
> Are these the default kernels that come with the distribution?
> *<MB> Yes.*
>
> *On Ubuntu 20.04:*
> *$ uname -rsvpi*
> *Linux 5.8.0-41-generic #46~20.04.1-Ubuntu SMP Mon Jan 18 17:52:23 UTC 2021 x86_64 x86_64*
>
> *On Fedora 33:*
> *$ uname -rsvpi*
> *Linux 5.10.11-200.fc33.x86_64 #1 SMP Wed Jan 27 20:21:22 UTC 2021 x86_64 x86_64*
>
> Does this happen with the latest upstream?
> *<MB> I did compile the latest upstream kernel modules, but
> Ubuntu/Fedora won't let me modprobe them. I'm not a kernel expert. There
> seems to be some security in place that prevents one from loading a
> kernel module that did not come with the official release. I tried
> several things suggested on Google to work around this but could never
> get the latest kernel modules loaded. By the way, Fedora 33 is pretty
> close to the latest upstream (i.e. Linux 5.10) and I see the same issue
> there.*
We definitely haven't received such a bug report for this kernel.
Does this happen if you connect directly to a normal NVMe controller?
> *<MB> As I said earlier, everything works fine until I change the
> Discovery Controller to return OAES[31]=1 in the Identity message.
> From what I see in the nvme-tcp code, this tells the driver to enable
> AER/AEN. I think that's where the issue is. Since I'm not a kernel
> expert, I cannot diagnose the problem further than that.*
It appears that without discovery log change events we never submit an
async event, so the problem must be somewhere in that path.
Can you share your kernel config file?
> I'm assuming this is not the Linux nvmet target, correct?
> *<MB> I don't know what that means: nvmet?*
What is your target implementation? Is it nvmet, the NVMe target that
is built into Linux?