[PATCH v2 RFC 6/6] nvme-core: fix deadlock in disconnect during scan_work and/or ana_work
Sagi Grimberg
sagi at grimberg.me
Wed Jun 24 03:13:12 EDT 2020
>> From: Anton Eidelman <anton at lightbitslabs.com>
>>
>> A deadlock happens in the following scenario with multipath:
>> 1) scan_work(nvme0) detects a new nsid while nvme0
>> is an optimized path to it; path nvme1 happens to be
>> inaccessible.
>>
>> 2) Before scan_work is complete, nvme0 disconnect is initiated:
>> nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING.
>>
>> 3) scan_work(1) attempts to submit IO,
>> but nvme_path_is_optimized() observes nvme0 is not LIVE.
>> Since nvme1 is a possible path, IO is requeued and scan_work hangs.
>
> I'm really worried about another flag outside the state machine. If
> we really need a multi-step deletion we should have
> NVME_CTRL_DELETE_START, NVME_CTRL_DELETE_CONT or so states and run
> this via the state machine.

Let me look into this. I'll send out v3 of the other 5 patches and
will follow up with this as a follow-up series. Going via the state
machine is slightly more delicate, but I agree that it's a better
approach.
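
To illustrate what running a multi-step deletion through the state machine
might look like, here is a minimal user-space sketch. It is not the nvme-core
code: the enum, ctrl_change_state() and ctrl_may_submit_io() are hypothetical
names made up for this example, and the meaning given to the two DELETE_*
states (I/O still allowed in the first step, refused in the second) is an
assumption about how such a split could avoid the scan_work hang, not
something settled in this thread.

```c
#include <stdbool.h>

/*
 * Hypothetical, simplified controller states. Only the DELETE_START and
 * DELETE_CONT names echo the suggestion quoted above; their semantics
 * here are assumed for illustration.
 */
enum ctrl_state {
	CTRL_LIVE,
	CTRL_RESETTING,
	CTRL_DELETE_START,	/* teardown begun, in-flight work may still issue I/O */
	CTRL_DELETE_CONT,	/* teardown continues, no further I/O is accepted */
	CTRL_DEAD,
};

/*
 * Single place where transitions are validated, so deletion progress is
 * tracked by the state itself rather than by a separate flag.
 * Returns true and updates *state only if the transition is allowed.
 */
static bool ctrl_change_state(enum ctrl_state *state, enum ctrl_state new_state)
{
	bool ok = false;

	switch (new_state) {
	case CTRL_LIVE:
		ok = (*state == CTRL_RESETTING);
		break;
	case CTRL_RESETTING:
		ok = (*state == CTRL_LIVE);
		break;
	case CTRL_DELETE_START:
		ok = (*state == CTRL_LIVE || *state == CTRL_RESETTING);
		break;
	case CTRL_DELETE_CONT:
		/* The second deletion step is reachable only from the first. */
		ok = (*state == CTRL_DELETE_START);
		break;
	case CTRL_DEAD:
		ok = (*state == CTRL_DELETE_CONT);
		break;
	}

	if (ok)
		*state = new_state;
	return ok;
}

/*
 * Assumed policy: a path stays usable during the first deletion step so
 * that work such as a scan can complete instead of being requeued.
 */
static bool ctrl_may_submit_io(enum ctrl_state state)
{
	return state == CTRL_LIVE || state == CTRL_DELETE_START;
}
```

In this sketch a disconnect would first move the controller to
CTRL_DELETE_START, wait for scan and ANA work to finish while I/O is still
permitted, and only then move to CTRL_DELETE_CONT, at which point new I/O is
refused outright rather than requeued.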