[PATCHv2] nvme-pci: fix stuck reset on concurrent DPC and HP
Nilay Shroff
nilay at linux.ibm.com
Fri Mar 7 23:51:19 PST 2025
On 3/8/25 4:56 AM, Keith Busch wrote:
> From: Keith Busch <kbusch at kernel.org>
>
> The PCIe DPC handling has the nvme driver quiesce the device, attempt to
> restart it, then wait for that restart to complete.
>
> The DPC event also toggles the PCIe link. If the slot doesn't have
> out-of-band presence detection, this will trigger a pciehp
> re-enumeration.
>
> The DPC's error handling that calls nvme_error_resume is holding the
> device lock while this happens. This lock prevents pciehp's request to
> disconnect the driver from proceeding.
>
> Meanwhile the nvme's reset_work can't make forward progress because its
> device isn't there anymore with admin IO, and the timeout handler won't
> do anything to fix it because the device is undergoing error handling.
>
> End result: deadlocked.
>
> Fix this by having the timeout handler disable the nvme queues for a
> disconnected PCIe device. We're relying on an IO timeout to unblock
> this, which is a minute by default.
>
> Signed-off-by: Keith Busch <kbusch at kernel.org>
Looks good to me:
Tested-by: Nilay Shroff <nilay at linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay at linux.ibm.com>
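
For readers following the thread, the decision described in the commit
message can be modeled in a few lines of userspace C. This is only a toy
sketch, not the patch and not kernel code: struct model_dev, enum
timeout_action and both functions are hypothetical names made up for
illustration. It assumes the timeout handler sees two facts about the
device: whether it is still present on the bus, and whether PCI error
handling (such as DPC recovery) is in progress.

```c
#include <stdbool.h>

/* Hypothetical model types; none of these exist in the kernel. */
enum timeout_action {
	TIMEOUT_RESET_TIMER,     /* re-arm the timer and keep waiting */
	TIMEOUT_DISABLE_QUEUES,  /* fail outstanding I/O so waiters unblock */
	TIMEOUT_NORMAL_RECOVERY  /* usual abort/reset path */
};

struct model_dev {
	bool present;         /* device is still reachable on the bus */
	bool error_recovery;  /* PCI error handling is in progress */
};

/*
 * Behavior described as the problem: during error handling the timeout
 * handler only re-arms the timer, even if the device is gone, so the
 * stuck admin command never completes.
 */
enum timeout_action timeout_before_fix(const struct model_dev *d)
{
	if (d->error_recovery)
		return TIMEOUT_RESET_TIMER;
	return TIMEOUT_NORMAL_RECOVERY;
}

/*
 * Behavior described as the fix (assumed ordering): a disconnected
 * device gets its queues disabled first, which lets the blocked reset
 * finish and the device lock be released.
 */
enum timeout_action timeout_after_fix(const struct model_dev *d)
{
	if (!d->present)
		return TIMEOUT_DISABLE_QUEUES;
	if (d->error_recovery)
		return TIMEOUT_RESET_TIMER;
	return TIMEOUT_NORMAL_RECOVERY;
}
```

With present = false and error_recovery = true (the concurrent DPC and
hotplug case), the first function keeps re-arming the timer while the
second disables the queues. As the commit message notes, this still
depends on an IO timeout firing, which is a minute by default.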