[bug report] reset_controller operation on NVMe/IB need more than 10s

Yi Zhang yi.zhang at redhat.com
Mon Aug 31 01:25:18 EDT 2020


Changing the subject.

I finally found that the OOM was not triggered by reset_controller 
itself. My 500 reset_controller iterations did not finish within 1 hour, 
which triggered the test harness watchdog and led to the OOM.

So the real issue is that the reset_controller operation now needs more 
than 10s [1]. By bisecting I found that this was introduced by commit 
[2]; without that patch the operation takes only about 2 seconds [3].

[1]
# time echo 1 >/sys/block/nvme0n1/device/nvme0/reset_controller

real    0m10.392s
user    0m0.000s
sys    0m0.000s

[2]
commit 5ec5d3bddc6b912b7de9e3eb6c1f2397faeca2bc
Author: Max Gurtovoy <maxg at mellanox.com>
Date:   Tue May 19 17:05:56 2020 +0300

     nvme-rdma: add metadata/T10-PI support

     For capable HCAs (e.g. ConnectX-5/ConnectX-6) this will allow end-to-end
     protection information passthrough and validation for NVMe over RDMA
     transport. Metadata offload support was implemented over the new RDMA
     signature verbs API and it is enabled for capable controllers.

     Signed-off-by: Max Gurtovoy <maxg at mellanox.com>
     Signed-off-by: Israel Rukshin <israelr at mellanox.com>
     Signed-off-by: Christoph Hellwig <hch at lst.de>

[3] without the above patch
# time echo 1 >/sys/block/nvme0n1/device/nvme0/reset_controller

real    0m2.132s
user    0m0.000s
sys    0m0.000s
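
As a rough check, assuming every reset takes about as long as the single 
runs above: 500 resets at ~10.4s each is roughly 5200s, which is past a 
1 hour (3600s) limit, while 500 resets at ~2.1s each is roughly 1070s. A 
minimal sketch of a loop that measures this directly is below. The 
function name time_resets and its two arguments are hypothetical, not 
part of the actual test harness.

```shell
# Hypothetical helper, not part of any existing test suite.
# Writes "1" to the given reset_controller attribute COUNT times
# and prints the total elapsed wall-clock time in whole seconds.
time_resets() {
    path=$1     # e.g. /sys/block/nvme0n1/device/nvme0/reset_controller
    count=$2    # number of resets to issue
    i=0
    start=$(date +%s)
    while [ "$i" -lt "$count" ]; do
        # Stop at the first failed write so a dead controller is noticed.
        echo 1 > "$path" || return 1
        i=$((i + 1))
    done
    end=$(date +%s)
    echo "$count resets in $((end - start))s"
}
```

Run as root, for example: 
time_resets /sys/block/nvme0n1/device/nvme0/reset_controller 500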



On 8/28/20 11:57 PM, Sagi Grimberg wrote:
>
>> Hello
>>
>>  From 5.8-rc1 stress reset_controller operation will lead system OOM 
>> on both target/host side.
>> This issue cannot be reproduced on v5.7 and still can be reproduced 
>> on v5.9-rc1, let me know if you need more info, thanks.
>
> Does kmemleak complain during the test?
>



