[PATCH 0/3 rfc] Fix nvme-tcp and nvme-rdma controller reset hangs

Tue Mar 16 05:04:45 GMT 2021

> Does the problem exist on the latest version?

This was seen on 5.4 stable, not upstream, but nothing prevents
this from happening in upstream code.

> 
> We also found similar deadlocks in the older version.
> However, with the latest code, it does not block grabbing the nshead srcu
> when the ctrl is frozen.
> related patches:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=fe2008640ae36e3920cf41507a84fb5d3227435a 
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41 
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/block/blk-core.c?id=ed00aabd5eb9fb44d6aff1173234a2e911b9fead 
> 
> I am not sure they are the same problem.

It's not the same problem.

When we tear down the I/O queues, we freeze the namespaces' request queues.
This means that concurrent mpath submit_bio calls can now block with
the srcu read lock held.

When another path calls nvme_mpath_set_live, it needs to wait for
the srcu to sync before kicking the requeue work (to make sure
the updated current_path is visible).

And this is where the hang is: the only thing that will free it
is the offending controller either reconnecting (which unfreezes the
queue) or disconnecting (automatically or manually). Both can take
a very long time, or even forever in some cases.