[bug report] NVMe/IB: reset_controller needs more than 1 min
Sagi Grimberg
sagi at grimberg.me
Mon Dec 13 01:04:59 PST 2021
>>>>>> Hello
>>>>>>
>>>>>> Gentle ping here; this issue still exists on the latest 5.13-rc7
>>>>>>
>>>>>> # time nvme reset /dev/nvme0
>>>>>>
>>>>>> real 0m12.636s
>>>>>> user 0m0.002s
>>>>>> sys 0m0.005s
>>>>>> # time nvme reset /dev/nvme0
>>>>>>
>>>>>> real 0m12.641s
>>>>>> user 0m0.000s
>>>>>> sys 0m0.007s
>>>>>
>>>>> Strange that even normal resets take so long...
>>>>> What device are you using?
>>>>
>>>> Hi Sagi
>>>>
>>>> Here is the device info:
>>>> Mellanox Technologies MT27700 Family [ConnectX-4]
>>>>
>>>>>
>>>>>> # time nvme reset /dev/nvme0
>>>>>>
>>>>>> real 1m16.133s
>>>>>> user 0m0.000s
>>>>>> sys 0m0.007s
>>>>>
>>>>> There seems to be a spurious command timeout here, but maybe this
>>>>> is because the queues take so long to connect that the target's
>>>>> keep-alive timer expires.
>>>>>
>>>>> Does this patch help?
>>>>
>>>> The issue still exists; let me know if you need any more testing. :)
>>>
>>> Hi Sagi
>>> Ping, this issue can still be reproduced on the latest
>>> linux-block/for-next. Do you have a chance to recheck it? Thanks.
>>
>> Can you check if it happens with the below patch:
>
> Hi Sagi
> It is still reproducible with the change; here is the log:
>
> # time nvme reset /dev/nvme0
>
> real 0m12.973s
> user 0m0.000s
> sys 0m0.006s
> # time nvme reset /dev/nvme0
>
> real 1m15.606s
> user 0m0.000s
> sys 0m0.007s
Does it speed up if you use fewer queues (i.e. connect with -i 4)?
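
For example, a reconnect with the queue count capped at 4 could look
like this (the transport address, service id and NQN below are just
placeholders for your setup):

# nvme disconnect -d /dev/nvme0
# nvme connect -t rdma -a <target addr> -s 4420 -n <subsys nqn> -i 4
# time nvme reset /dev/nvme0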
>
> # dmesg | grep nvme
> [ 900.634877] nvme nvme0: resetting controller
> [ 909.026958] nvme nvme0: creating 40 I/O queues.
> [ 913.604297] nvme nvme0: mapped 40/0/0 default/read/poll queues.
> [ 917.600993] nvme nvme0: resetting controller
> [ 988.562230] nvme nvme0: I/O 2 QID 0 timeout
> [ 988.567607] nvme nvme0: Property Set error: 881, offset 0x14
> [ 988.608181] nvme nvme0: creating 40 I/O queues.
> [ 993.203495] nvme nvme0: mapped 40/0/0 default/read/poll queues.
>
> BTW, this issue cannot be reproduced in my NVMe/RoCE environment.
Then I think that we need the rdma folks to help here...
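
If you get a chance, a function_graph trace of the queue setup during
the slow reset would also help narrow down where the time goes. A rough
sketch, assuming tracefs is available under /sys/kernel/debug (the
filter pattern is just a guess at the interesting functions):

# cd /sys/kernel/debug/tracing
# echo 'nvme_rdma_*' > set_ftrace_filter
# echo function_graph > current_tracer
# time nvme reset /dev/nvme0
# cat trace > /tmp/nvme_reset_trace.txt
# echo nop > current_tracer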