[Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
Keith Busch
kbusch at kernel.org
Mon Apr 22 07:35:04 PDT 2024
On Mon, Apr 22, 2024 at 07:52:25AM -0600, Keith Busch wrote:
> On Mon, Apr 22, 2024 at 04:00:54PM +0300, Sagi Grimberg wrote:
> > > pci_rescan_remove_lock then it shall be able to recover the PCI error and hence
> > > pending I/Os could be finished. Later, when the hot-unplug task starts, it could
> > > make forward progress and clean up all resources used by the nvme disk.
> > >
> > > So does it make sense if we unconditionally cancel the pending I/Os from
> > > nvme_remove() before it makes forward progress to remove namespaces?
> >
> > The driver attempts to allow inflight I/Os to complete successfully if the
> > device is still present in the remove stage. I am not sure we want to
> > unconditionally fail these I/Os. Keith?
>
> We have a timeout handler to clean this up, but I think it was another
> PPC-specific patch that has the timeout handler do nothing if PCIe error
> recovery is in progress. That seems questionable: we should be able to
> run error handling and timeouts concurrently, but I think the error
> handling just needs to synchronize the request_queues in the
> "error_detected" path.
This:
---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685d..38d0215fe53fc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1286,13 +1286,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 	u32 csts = readl(dev->bar + NVME_REG_CSTS);
 	u8 opcode;
 
-	/* If PCI error recovery process is happening, we cannot reset or
-	 * the recovery mechanism will surely fail.
-	 */
-	mb();
-	if (pci_channel_offline(to_pci_dev(dev->dev)))
-		return BLK_EH_RESET_TIMER;
-
 	/*
 	 * Reset immediately if the controller is failed
 	 */
@@ -3300,6 +3293,7 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
 			return PCI_ERS_RESULT_DISCONNECT;
 		}
 		nvme_dev_disable(dev, false);
+		nvme_sync_queues(&dev->ctrl);
 		return PCI_ERS_RESULT_NEED_RESET;
 	case pci_channel_io_perm_failure:
 		dev_warn(dev->ctrl.device,
--