[PATCH v2 RFC 6/6] nvme-core: fix deadlock in disconnect during scan_work and/or ana_work
Christoph Hellwig
hch at lst.de
Wed Jun 24 02:43:09 EDT 2020
On Tue, Jun 23, 2020 at 05:18:53PM -0700, Sagi Grimberg wrote:
> From: Anton Eidelman <anton at lightbitslabs.com>
>
> A deadlock happens in the following scenario with multipath:
> 1) scan_work(nvme0) detects a new nsid while nvme0
> is an optimized path to it; path nvme1 happens to be
> inaccessible.
>
> 2) Before scan_work is complete, nvme0 disconnect is initiated:
> nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING.
>
> 3) scan_work(1) attempts to submit IO,
> but nvme_path_is_optimized() observes nvme0 is not LIVE.
> Since nvme1 is a possible path, IO is requeued and scan_work hangs.

I'm really worried about adding another flag outside the state machine. If
we really need a multi-step deletion, we should add states such as
NVME_CTRL_DELETE_START and NVME_CTRL_DELETE_CONT and run this through
the state machine.
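
For illustration only, here is a minimal user-space sketch of what running a two-step deletion through a state machine could look like. It is not code from this thread or from the kernel: the enum, struct, and function names are hypothetical, and the choice of which deletion phase still accepts I/O is an assumption made just to show that the path check can be derived from the state alone, with no extra flag.

```c
#include <stdbool.h>

/*
 * Hypothetical, simplified controller states. This is NOT the kernel's
 * enum nvme_ctrl_state; the two DELETE_* names only mirror the ones
 * suggested in the reply above.
 */
enum ctrl_state {
	CTRL_LIVE,
	CTRL_DELETE_START,	/* assumption: deletion requested, I/O still allowed */
	CTRL_DELETE_CONT,	/* assumption: teardown continues, no new I/O */
	CTRL_DEAD,
};

struct ctrl {
	enum ctrl_state state;
};

/*
 * Allow only the forward transitions
 * LIVE -> DELETE_START -> DELETE_CONT -> DEAD.
 * Returns true and updates the state if the transition is legal.
 */
static bool ctrl_change_state(struct ctrl *c, enum ctrl_state new_state)
{
	bool ok = false;

	switch (new_state) {
	case CTRL_DELETE_START:
		ok = (c->state == CTRL_LIVE);
		break;
	case CTRL_DELETE_CONT:
		ok = (c->state == CTRL_DELETE_START);
		break;
	case CTRL_DEAD:
		ok = (c->state == CTRL_DELETE_CONT);
		break;
	case CTRL_LIVE:
		/* No way back to LIVE once deletion has started. */
		break;
	}

	if (ok)
		c->state = new_state;
	return ok;
}

/* The path decision depends only on the state, not on a side flag. */
static bool ctrl_path_usable(const struct ctrl *c)
{
	return c->state == CTRL_LIVE || c->state == CTRL_DELETE_START;
}
```

In this sketch a caller would move the controller to CTRL_DELETE_START first, let outstanding work such as a scan finish, and only then advance to CTRL_DELETE_CONT. A real implementation would also need locking around the state change, which is omitted here.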