[PATCHv2] nvme-pci: fix stuck reset on concurrent DPC and HP

Nilay Shroff nilay at linux.ibm.com
Fri Mar 7 23:51:19 PST 2025



On 3/8/25 4:56 AM, Keith Busch wrote:
> From: Keith Busch <kbusch at kernel.org>
> 
> The PCIe DPC handling has the nvme driver quiesce the device, attempt to
> restart it, then wait for that restart to complete.
> 
> The DPC event also toggles the PCIe link. If the slot doesn't have
> out-of-band presence detection, this will trigger a pciehp
> re-enumeration.
> 
> The DPC's error handling that calls nvme_error_resume is holding the
> device lock while this happens. This lock prevents pciehp's request to
> disconnect the driver from proceeding.
> 
> Meanwhile the nvme's reset_work can't make forward progress because its
> device isn't there anymore with admin IO, and the timeout handler won't
> do anything to fix it because the device is undergoing error handling.
> 
> End result: deadlocked.
> 
> Fix this by having the timeout handler disable the nvme queues for a
> disconnected PCIe device. We're relying on an IO timeout to unblock
> this, which is a minute by default.
> 
> Signed-off-by: Keith Busch <kbusch at kernel.org>

Looks good to me:

Tested-by: Nilay Shroff <nilay at linux.ibm.com>
Reviewed-by: Nilay Shroff <nilay at linux.ibm.com>
