[PATCH 1/1] nvme-pci: fix hang during error recovery when the PCI device is isolated

Hannes Reinecke hare at suse.de
Tue Jul 12 06:49:29 PDT 2022


On 7/12/22 14:44, Niklas Schnelle wrote:
> On s390 and powerpc PCI devices are isolated when an error is detected
> and driver->err_handler->error_detected is called with an inaccessible
> PCI device and PCI channel state set to pci_channel_io_frozen
> (see Step 1 in Documentation/PCI/pci-error-recovery.rst).
> 
> In the case of NVMe devices nvme_error_detected() then calls
> nvme_dev_disable(dev, false) and requests a reset. After a successful
> reset the device is accessible again and nvme_slot_reset() resets the
> controller and queues nvme_reset_work() which then recovers the
> controller.
> 
> Since commit b98235d3a471 ("nvme-pci: harden drive presence detect in
> nvme_dev_disable()") however nvme_dev_disable() no longer freezes the
> queues if pci_device_is_present() returns false. This is the case for an
> isolated PCI device. In principle this makes sense as there are no
> accessible hardware queues to run. The problem though is that for
> a previously live reset controller with online queues nvme_reset_work()
> calls nvme_wait_freeze() which, without the freeze having been
> initiated, then hangs forever. Fix this by starting the freeze in
> nvme_slot_reset() which is the earliest point where we know the device
> should be accessible again.
> 
> Fixes: b98235d3a471 ("nvme-pci: harden drive presence detect in nvme_dev_disable()")
> Signed-off-by: Niklas Schnelle <schnelle at linux.ibm.com>
> ---
>   drivers/nvme/host/pci.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 193b44755662..7c0c61b74c30 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -3399,6 +3399,7 @@ static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
>   	dev_info(dev->ctrl.device, "restart after slot reset\n");
>   	pci_restore_state(pdev);
>   	nvme_reset_ctrl(&dev->ctrl);
> +	nvme_start_freeze(&dev->ctrl);
>   	return PCI_ERS_RESULT_RECOVERED;
>   }
>   
I am not sure if that's the right fix.
 From your description the hang occurs as nvme_reset_ctrl() is calling 
nvme_wait_freeze() without an corresponding nvme_start_freeze().
So why are you calling it _after_ the call to nvme_reset_ctrl()?

Cheers,

Hannes



More information about the Linux-nvme mailing list