[PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs

Tue Mar 16 06:18:29 GMT 2021


On 2021/3/16 13:04, Sagi Grimberg wrote:
> 
>> Does the problem exist on the latest version?
> 
> This was seen on 5.4 stable, not upstream but nothing prevents
> this from happening in upstream code.
> 
>>
>> We also found similar deadlocks in the older version.
>> However, with the latest code, it does not block grabbing the nshead srcu
>> when the ctrl is frozen.
>> Related patches:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=fe2008640ae36e3920cf41507a84fb5d3227435a
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=ed00aabd5eb9fb44d6aff1173234a2e911b9fead
>> I am not sure they are the same problem.
> 
> It's not the same problem.
> 
> When we tear down the I/O queues, we freeze the namespaces' request queues.
> This means that concurrent mpath submit_bio calls can now block with
> the srcu lock taken.

What is the call trace of ->submit_bio()?
The requeue work or normal submit bio?

> 
> When another path calls nvme_mpath_set_live, it needs to wait for
> the srcu to sync before kicking the requeue work (to make sure
> the updated current_path is visible).
> 
> And this is where the hang is: the only thing that will free it
> is the offending controller reconnecting (and unfreezing the queue)
> or disconnecting (automatically or manually). Both can take
> a very long time, or even forever in some cases.
> .
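
To make the dependency chain described above easier to follow, here is a
toy userspace model of it. This is only a sketch, not kernel code: the
Path and MpathHead classes and the set_live_can_finish() helper are
hypothetical names made up for illustration, and the SRCU read-side
section is reduced to a plain list of readers. It assumes, as described
in the quoted text, that a submitter blocked on a frozen queue keeps its
SRCU read-side section open, and that nvme_mpath_set_live on another path
has to wait for all readers to leave before it can kick the requeue work.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare paths by identity, not by field values
class Path:
    name: str
    frozen: bool = False  # request queue frozen during I/O queue teardown


@dataclass
class MpathHead:
    # Paths whose submitters are currently inside the SRCU read-side section.
    readers: list = field(default_factory=list)

    def submit_bio(self, path):
        """Enter the read-side section; return True if the bio went through."""
        self.readers.append(path)
        if path.frozen:
            # Blocked on the frozen queue while still counted as a reader.
            return False
        self.readers.remove(path)
        return True

    def unfreeze(self, path):
        """Controller reconnected: blocked submitters on this path proceed."""
        path.frozen = False
        self.readers = [p for p in self.readers if p is not path]

    def set_live_can_finish(self):
        """Stand-in for the srcu sync: it completes only with no readers left."""
        return not self.readers


# Walk through the scenario: path A is torn down, path B comes up.
head = MpathHead()
path_a, path_b = Path("ctrl-a"), Path("ctrl-b")
path_a.frozen = True
head.submit_bio(path_a)                   # blocks with the read lock held
stuck = not head.set_live_can_finish()    # set_live on path B cannot proceed
head.unfreeze(path_a)                     # only a reconnect releases it
released = head.set_live_can_finish()
```

In this model the only call that lets set_live_can_finish() succeed is
unfreeze() on the frozen path, which corresponds to the reconnect case;
the disconnect case is not modelled.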