[PATCH 1/1] nvme-pci: fix hang during error recovery when the PCI device is isolated

Niklas Schnelle schnelle at linux.ibm.com
Tue Jul 12 07:36:27 PDT 2022


On Tue, 2022-07-12 at 08:17 -0600, Keith Busch wrote:
> On Tue, Jul 12, 2022 at 02:44:53PM +0200, Niklas Schnelle wrote:
> > On s390 and powerpc PCI devices are isolated when an error is detected
> > and driver->err_handler->error_detected is called with an inaccessible
> > PCI device and PCI channel state set to pci_channel_io_frozen
> > (see Step 1 in Documentation/PCI/pci-error-recovery.rst).
> > 
> > In the case of NVMe devices nvme_error_detected() then calls
> > nvme_dev_disable(dev, false) and requests a reset. After a successful
> > reset the device is accessible again and nvme_slot_reset() resets the
> > controller and queues nvme_reset_work() which then recovers the
> > controller.
> > 
> > Since commit b98235d3a471 ("nvme-pci: harden drive presence detect in
> > nvme_dev_disable()") however nvme_dev_disable() no longer freezes the
> > queues if pci_device_is_present() returns false. This is the case for an
> > isolated PCI device. In principle this makes sense as there are no
> > accessible hardware queues to run. The problem, though, is that for
> > a controller that was previously live with online queues,
> > nvme_reset_work() calls nvme_wait_freeze() which, without the freeze
> > having been initiated, hangs forever. Fix this by starting the
> > freeze in nvme_slot_reset(), which is the earliest point where we
> > know the device should be accessible again.
> > 
> > Fixes: b98235d3a471 ("nvme-pci: harden drive presence detect in nvme_dev_disable()")
> > Signed-off-by: Niklas Schnelle <schnelle at linux.ibm.com>
> 
> Oh, we've messed up the expected sequence. The mistaken assumption is that
> a device not being present means we're about to unbind from it, but it can
> appear that way during normal error handling and reset too, so we need to
> preserve the previous handling.
> 
> The offending commit really just wants to avoid the register access (which we
> shouldn't have to do, but hey, broken hardware...). So let's keep the sequence
> the same as before and just skip the register read. Does this work for you?

Ah, thanks for the explanation! I had actually tested a similar patch
but wasn't sure whether nvme_start_freeze() also does register accesses
to start the HW queues, or whether calling it makes sense on a
dead/isolated device at all. On the other hand, this code very
explicitly handles dead devices, so I guess this was kept in mind.

So yes, the patch below works for me.

> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index fdfee3e590db..c40e82cee735 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2690,9 +2772,11 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  	struct pci_dev *pdev = to_pci_dev(dev->dev);
>  
>  	mutex_lock(&dev->shutdown_lock);
> -	if (pci_device_is_present(pdev) && pci_is_enabled(pdev)) {
> -		u32 csts = readl(dev->bar + NVME_REG_CSTS);
> +	if (pci_is_enabled(pdev)) {
> +		u32 csts = ~0;
>  
> +		if (pci_device_is_present(pdev))
> +			csts = readl(dev->bar + NVME_REG_CSTS);
>  		if (dev->ctrl.state == NVME_CTRL_LIVE ||
>  		    dev->ctrl.state == NVME_CTRL_RESETTING) {
>  			freeze = true;
> --




