[Bug Report] PCIe errinject and hot-unplug causes nvme driver hang

Nilay Shroff nilay at linux.ibm.com
Tue Apr 23 02:52:46 PDT 2024



On 4/22/24 20:05, Keith Busch wrote:
> On Mon, Apr 22, 2024 at 07:52:25AM -0600, Keith Busch wrote:
>> On Mon, Apr 22, 2024 at 04:00:54PM +0300, Sagi Grimberg wrote:
>>>> pci_rescan_remove_lock then it shall be able to recover the pci error and hence
>>>> pending IOs could be finished. Later when hot-unplug task starts, it could
>>>> forward progress and cleanup all resources used by the nvme disk.
>>>>
>>>> So does it make sense if we unconditionally cancel the pending IOs from
>>>> nvme_remove() before it forward progress to remove namespaces?
>>>
>>> The driver attempts to allow inflights I/O to complete successfully, if the
>>> device
>>> is still present in the remove stage. I am not sure we want to
>>> unconditionally fail these
>>> I/Os.    Keith?
>>
>> We have a timeout handler to clean this up, but I think it was another
>> PPC specific patch that has the timeout handler do nothing if pcie error
>> recovery is in progress. Which seems questionable, we should be able to
>> concurrently run error handling and timeouts, but I think the error
>> handling just needs to synchronize the request_queue's in the
>> "error_detected" path.
> 
> This:
> 
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 8e0bb9692685d..38d0215fe53fc 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1286,13 +1286,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
>  	u32 csts = readl(dev->bar + NVME_REG_CSTS);
>  	u8 opcode;
>  
> -	/* If PCI error recovery process is happening, we cannot reset or
> -	 * the recovery mechanism will surely fail.
> -	 */
> -	mb();
> -	if (pci_channel_offline(to_pci_dev(dev->dev)))
> -		return BLK_EH_RESET_TIMER;
> -
>  	/*
>  	 * Reset immediately if the controller is failed
>  	 */
> @@ -3300,6 +3293,7 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>  			return PCI_ERS_RESULT_DISCONNECT;
>  		}
>  		nvme_dev_disable(dev, false);
> +		nvme_sync_queues(&dev->ctrl);
>  		return PCI_ERS_RESULT_NEED_RESET;
>  	case pci_channel_io_perm_failure:
>  		dev_warn(dev->ctrl.device,
> --
> 
I tested the above patch; however, it doesn't solve the issue.
I tested it for the two cases listed below:

1. Platform which doesn't support pci-error-recovery:
-----------------------------------------------------
On this platform, when nvme_timeout() is invoked it falls through
nvme_should_reset()
  -> nvme_warn_reset()
    -> goto disable

When nvme_timeout() jumps to the disable label, it tries to set the
controller state to RESETTING, but that fails because the (logical)
hot-unplug/nvme_remove() of the disk has already started on another
thread, and hence the controller state has already changed to
DELETING/DELETING_NOIO. Since it cannot set the controller state to
RESETTING, nvme_timeout() returns BLK_EH_DONE. In summary, because
nvme_timeout() could not cancel the pending IO, hot-unplug/nvme_remove()
cannot make forward progress and keeps waiting for the request queue to
be frozen.

2. Platform supporting pci-error-recovery:
------------------------------------------
Similarly, on this platform, as explained for the case above, when
nvme_timeout() is invoked it falls through nvme_should_reset()
-> nvme_warn_reset() -> goto disable, and here as well nvme_timeout()
returns BLK_EH_DONE. Note that although this platform supports
pci-error-recovery, we never get into nvme_error_detected() because the
pci-error-recovery thread is blocked acquiring the mutex
"pci_lock_rescan_remove". That mutex is acquired by the hot-unplug
thread before it invokes nvme_remove(), and nvme_remove() is currently
waiting for the request queue to be frozen. For reference, I have
already captured the task hang traces in a previous email of this
thread, where both hangs (the pci-error-recovery thread as well as
hot-unplug/nvme_remove()) can be observed.

I understand that we don't want to unconditionally cancel pending IO
from nvme_remove(): if the disk has not been physically hot-unplugged,
we still want to wait for the in-flight IO to finish. Looking at the
cases above, nvme_timeout() seems to be the right place to cancel
in-flight/pending IO when the controller is in a terminal state (i.e.
DELETING or DELETING_NOIO). With that idea in mind, I have worked out
the patch below:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685..e45a54d84649 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1286,6 +1286,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
        u32 csts = readl(dev->bar + NVME_REG_CSTS);
        u8 opcode;
 
+       if (nvme_state_terminal(&dev->ctrl))
+               goto disable;
+
        /* If PCI error recovery process is happening, we cannot reset or
         * the recovery mechanism will surely fail.
         */
@@ -1390,8 +1393,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
        return BLK_EH_RESET_TIMER;
 
 disable:
-       if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
+       if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+               if (nvme_state_terminal(&dev->ctrl)) {
+                       nvme_dev_disable(dev, false);
+                       nvme_sync_queues(&dev->ctrl);
+               }
                return BLK_EH_DONE;
+       }
 
        nvme_dev_disable(dev, false);
        if (nvme_try_sched_reset(&dev->ctrl))

I have tested the above patch against all the cases discussed. Please let
me know if this looks good or if there are any further comments.

Thanks,
--Nilay
