[Bug Report] PCIe errinject and hot-unplug causes nvme driver hang

Keith Busch kbusch at kernel.org
Mon Apr 22 06:52:25 PDT 2024


On Mon, Apr 22, 2024 at 04:00:54PM +0300, Sagi Grimberg wrote:
> > pci_rescan_remove_lock then it shall be able to recover the pci error and hence
> > pending IOs could be finished. Later, when the hot-unplug task starts, it could
> > make forward progress and clean up all resources used by the nvme disk.
> > 
> > So does it make sense if we unconditionally cancel the pending IOs from
> > nvme_remove() before it makes forward progress to remove namespaces?
> 
> The driver attempts to allow inflight I/O to complete successfully if the
> device is still present in the remove stage. I am not sure we want to
> unconditionally fail these I/Os. Keith?

We have a timeout handler to clean this up, but I think it was another
PPC-specific patch that made the timeout handler do nothing if PCIe error
recovery is in progress. That seems questionable: we should be able to
run error handling and timeouts concurrently. I think the error
handling just needs to synchronize the request_queues in the
"error_detected" path.
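
As a rough illustration of the idea above, here is a toy userspace model in
C. It is not driver code: every name in it (toy_queue, toy_timeout_skip,
toy_timeout_always, toy_sync_queue, toy_error_detected) is hypothetical, and
the single-threaded "pending work" flag is only an assumption standing in for
real timeout work. It contrasts a timeout handler that bails out during
recovery with an error_detected path that first flushes pending timeout work.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model; every name here is hypothetical, none is a kernel API. */
enum toy_eh { TOY_EH_DONE, TOY_EH_RESET_TIMER };

struct toy_queue {
	bool channel_offline;  /* stands in for "PCIe error recovery in progress" */
	bool timeout_pending;  /* timeout work is queued but has not run yet */
	int inflight;          /* requests still waiting on the device */
};

/* Behaviour questioned above: do nothing while recovery is in progress. */
static enum toy_eh toy_timeout_skip(struct toy_queue *q)
{
	if (q->channel_offline)
		return TOY_EH_RESET_TIMER;  /* inflight requests stay stuck */
	q->inflight = 0;
	return TOY_EH_DONE;
}

/* Alternative: the timeout handler always cancels what is outstanding. */
static enum toy_eh toy_timeout_always(struct toy_queue *q)
{
	assert(q->inflight >= 0);
	q->inflight = 0;
	return TOY_EH_DONE;
}

/* Models "synchronize the request_queue": finish pending timeout work. */
static void toy_sync_queue(struct toy_queue *q)
{
	if (q->timeout_pending) {
		toy_timeout_always(q);
		q->timeout_pending = false;
	}
}

/* Models the "error_detected" path: mark recovery, then sync the queue. */
static void toy_error_detected(struct toy_queue *q)
{
	q->channel_offline = true;
	toy_sync_queue(q);
	/* a real driver would go on to disable the device at this point */
}
```

In this model, a queue with stuck requests stays stuck under
toy_timeout_skip() once the channel is offline, while toy_error_detected()
leaves no pending timeout work and no inflight requests behind.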


