kernel NULL pointer on nvmet with stress rescan_controller/reset_controller test during I/O

Tue Mar 7 22:41:23 PST 2017

On 03/06/2017 07:52 PM, Sagi Grimberg wrote:
>> Hi
>
> Hi Yi,
>
>> I can always reproduce this issue during a stress test of
>> rescan_controller/reset_controller. Could you help check it? Thanks.
>>
>> Reproduce steps on Initiator side:
>> #fio -filename=/dev/nvme0n1 -iodepth=1 -thread -rw=randwrite 
>> -ioengine=psync 
>> -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 
>> -bs_unaligned -runtime=1200 -size=-group_reporting -name=mytest 
>> -numjobs=60 &
>> #num=0
>> while [ $num -lt 200 ]
>> do
>>         echo "-------------------------------$num"
>>         echo 1 >/sys/block/nvme0n1/device/rescan_controller || exit 1
>>         echo 1 >/sys/block/nvme0n1/device/reset_controller || exit 1
>>         ((num++))
>> done
>
> nvmet-rdma makes sure that no inflight IO nor completions are pending
> when destroying the queue. The below looks like we got a recv completion
> event after we freed all the tasks for the queue (which happens after
> ib_drain_qp and rdma_destroy_qp).
>
> So this is definitely weird. Which device are you using? I ran
> the exact scenario on my VM and didn't see any NULL deref...
Here is the device I used:
07:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

Could you run this test with more cycles? I can always reproduce this
issue in fewer than 200 iterations.
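
To make a longer run easier, the rescan/reset loop from the reproduce steps
quoted above could be wrapped roughly as follows. This is only a sketch: the
function name stress_controller, its two arguments, and the default of 1000
cycles are assumptions for illustration, not part of the original test. The
sysfs path is the one from the steps above.

```shell
#!/bin/bash
# Hypothetical helper (name and arguments are illustrative assumptions).
# Usage: stress_controller [DEVICE_DIR] [CYCLES]
#   DEVICE_DIR  directory holding rescan_controller and reset_controller
#   CYCLES      number of rescan+reset iterations (assumed default: 1000)
stress_controller() {
    local dev_dir=${1:-/sys/block/nvme0n1/device}
    local cycles=${2:-1000}
    local num=0

    while [ "$num" -lt "$cycles" ]; do
        echo "-------------------------------$num"
        # Stop at the first failed write, as the original loop does.
        echo 1 > "$dev_dir/rescan_controller" || return 1
        echo 1 > "$dev_dir/reset_controller" || return 1
        num=$((num + 1))
    done
}
```

With the fio job already running in the background, something like
"stress_controller /sys/block/nvme0n1/device 1000" would then run the same
loop for more cycles.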

Thanks
Yi