nvme-tcp: kernel NULL pointer dereference, address: 0000000000000034

Thu Mar 16 01:57:15 PDT 2023

>>>> I'm running tests where I connect/disconnect to/from a few I/O controllers
>>>> using the nvme_tcp driver. I use nvmet_tcp with a null_blk device to simulate the
>>>> target. The kernel module crashes (trace below) while trying to connect over
>>>> TCP. This happens on Fedora 37 and Ubuntu 22.04. I also recompiled the kernel
>>>> using the latest nvme-6.4 branch and I'm still seeing the crash.
>>>>
>>>> I'm not sure how to debug this further. Any suggestions?
>>>
>>> Never seen anyone try to use poll queues with nvme tcp before. It doesn't look
>>> like that would work for a connect command since there's no bdev at this point,
>>> and polling needs a bdev.
>>
>> Thanks for pointing me in the right direction.
>> I wrote a test program that exercises all the different options available.
>> The crash went away once I removed "nr-poll-queues=4".
>> But this raises the question: should a user-space program be given the ability
>> to crash the kernel by simply providing the wrong (or weird) arguments?
> 
> Right, we certainly don't want to let an easy kernel crash like this exist now
> that we know it's there. I'm just considering a couple different ways to fix it.
> We could just reject user polling options for nvme fabrics, or we could make
> polling work with just a request_queue instead of needing a bdev.

Polling did not originally need a bdev, but the introduction of bio_poll
made one required.

We can't just reject user polling for fabrics; it exists for RDMA as
well as TCP.
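
For anyone following along, here is a rough userspace sketch of the failure
mode discussed in this thread. It is a toy model, not kernel code: every
name in it (toy_bdev, toy_queue, toy_request, toy_poll) is hypothetical and
does not correspond to a real kernel structure or function. It only
illustrates the idea that a connect command has no bdev attached, so a poll
path that needs one has to check for that case rather than dereference a
NULL pointer.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the kernel's structures. */
struct toy_bdev {
    int id;
};

struct toy_queue {
    int nr_poll_queues;
};

struct toy_request {
    struct toy_queue *q;
    struct toy_bdev *bdev; /* NULL for a fabrics connect command */
};

/*
 * Toy poll routine. It needs a bdev to do its work, so it returns
 * -EINVAL when none is attached instead of dereferencing NULL.
 * On success it returns the id of the bdev it polled.
 */
static int toy_poll(const struct toy_request *rq)
{
    if (rq == NULL || rq->bdev == NULL)
        return -EINVAL;
    return rq->bdev->id;
}
```

In this model, a request built for a connect (bdev left NULL) gets -EINVAL
back from toy_poll(), while a request with a bdev attached returns that
bdev's id. Whether the real fix should bail out like this or make polling
work from the request_queue alone is exactly the open question above.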