[Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
Nilay Shroff
nilay at linux.ibm.com
Tue Apr 23 02:52:46 PDT 2024
On 4/22/24 20:05, Keith Busch wrote:
> On Mon, Apr 22, 2024 at 07:52:25AM -0600, Keith Busch wrote:
>> On Mon, Apr 22, 2024 at 04:00:54PM +0300, Sagi Grimberg wrote:
>>>> pci_rescan_remove_lock then it shall be able to recover the pci error and hence
>>>> pending IOs could be finished. Later, when the hot-unplug task starts, it could
>>>> make forward progress and clean up all the resources used by the nvme disk.
>>>>
>>>> So does it make sense if we unconditionally cancel the pending IOs from
>>>> nvme_remove() before it proceeds to remove the namespaces?
>>>
>>> The driver attempts to allow inflight I/O to complete successfully, if
>>> the device is still present in the remove stage. I am not sure we want
>>> to unconditionally fail these I/Os. Keith?
>>
>> We have a timeout handler to clean this up, but I think it was another
>> PPC-specific patch that has the timeout handler do nothing if PCIe error
>> recovery is in progress. That seems questionable; we should be able to
>> run error handling and timeouts concurrently, but I think the error
>> handling just needs to synchronize the request_queues in the
>> "error_detected" path.
>
> This:
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 8e0bb9692685d..38d0215fe53fc 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1286,13 +1286,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
>  	u32 csts = readl(dev->bar + NVME_REG_CSTS);
>  	u8 opcode;
>  
> -	/* If PCI error recovery process is happening, we cannot reset or
> -	 * the recovery mechanism will surely fail.
> -	 */
> -	mb();
> -	if (pci_channel_offline(to_pci_dev(dev->dev)))
> -		return BLK_EH_RESET_TIMER;
> -
>  	/*
>  	 * Reset immediately if the controller is failed
>  	 */
> @@ -3300,6 +3293,7 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>  			return PCI_ERS_RESULT_DISCONNECT;
>  		}
>  		nvme_dev_disable(dev, false);
> +		nvme_sync_queues(&dev->ctrl);
>  		return PCI_ERS_RESULT_NEED_RESET;
>  	case pci_channel_io_perm_failure:
>  		dev_warn(dev->ctrl.device,
> --
>
I tested the above patch; however, it does not solve the issue.
I tested it for the two cases listed below:

1. Platform which doesn't support pci-error-recovery:
-----------------------------------------------------
On this platform, when nvme_timeout() is invoked it falls through
nvme_should_reset() -> nvme_warn_reset() -> goto disable.
When nvme_timeout() jumps to the disable label, it tries to set the
controller state to RESETTING, but that cannot succeed because the
(logical) hot-unplug/nvme_remove() of the disk has already been started
on another thread, so the controller state has already changed to
DELETING/DELETING_NOIO. As nvme_timeout() cannot set the controller
state to RESETTING, it returns BLK_EH_DONE. In summary, because
nvme_timeout() cannot cancel the pending IO, hot-unplug/nvme_remove()
cannot make forward progress and keeps waiting for the request queue
to be frozen.
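
For reference, the RESETTING transition is rejected by the controller
state machine in nvme_change_ctrl_state(); the snippet below is an
abridged paraphrase of the relevant case from drivers/nvme/host/core.c
(not verbatim, please double-check against the actual tree):

	case NVME_CTRL_RESETTING:
		switch (old_state) {
		case NVME_CTRL_NEW:
		case NVME_CTRL_LIVE:
			/* only NEW and LIVE may move to RESETTING */
			changed = true;
			break;
		default:
			/* DELETING/DELETING_NOIO (hot-unplug already in
			 * progress) ends up here, so the transition fails
			 * and nvme_timeout() returns BLK_EH_DONE without
			 * cancelling the expired request.
			 */
			break;
		}
		break;
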
2. Platform supporting pci-error-recovery:
------------------------------------------
Similarly, on this platform, as explained for the case above, when
nvme_timeout() is invoked it falls through nvme_should_reset()
-> nvme_warn_reset() -> goto disable, and here as well nvme_timeout()
returns BLK_EH_DONE. Please note that even though this platform supports
pci-error-recovery, we never get into nvme_error_detected() because the
pci-error-recovery thread is blocked acquiring the pci_rescan_remove_lock
mutex (via pci_lock_rescan_remove()). That mutex is acquired by the
hot-unplug thread before it invokes nvme_remove(), and nvme_remove() is
currently waiting for the request queue to be frozen. For reference, I
have already captured the task hang traces in a previous email of this
thread, where both hangs (the pci-error-recovery thread as well as
hot-unplug/nvme_remove()) can be observed. The resulting lock/IO
dependency cycle is sketched below.
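
To make the dependency cycle explicit, a rough sketch of the two threads
(based on the hang traces from the earlier email; the exact call chains
are in those traces):

  hot-unplug thread                      pci-error-recovery thread
  -----------------                      -------------------------
  takes pci_rescan_remove_lock
  nvme_remove()
    waits for the request queue          blocks on pci_rescan_remove_lock,
    to be frozen, i.e. for the           so nvme_error_detected() is never
    in-flight IO to complete             called, the error is never
                                         recovered, and the in-flight IO
                                         never completes
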
I understand that we don't want to cancel pending IO from nvme_remove()
unconditionally, because if the disk is not physically hot-unplugged we
still want to wait for the in-flight IO to finish. Looking through the
above cases, I think nvme_timeout() may be the right code path from which
to cancel in-flight/pending IO when the controller is in a terminal state
(i.e. DELETING or DELETING_NOIO). Keeping this idea in mind, I have
worked out the patch below:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8e0bb9692685..e45a54d84649 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1286,6 +1286,9 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 	u32 csts = readl(dev->bar + NVME_REG_CSTS);
 	u8 opcode;
 
+	if (nvme_state_terminal(&dev->ctrl))
+		goto disable;
+
 	/* If PCI error recovery process is happening, we cannot reset or
 	 * the recovery mechanism will surely fail.
 	 */
@@ -1390,8 +1393,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 	return BLK_EH_RESET_TIMER;
 
 disable:
-	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
+	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+		if (nvme_state_terminal(&dev->ctrl)) {
+			nvme_dev_disable(dev, false);
+			nvme_sync_queues(&dev->ctrl);
+		}
 		return BLK_EH_DONE;
+	}
 
 	nvme_dev_disable(dev, false);
 	if (nvme_try_sched_reset(&dev->ctrl))
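
For completeness, nvme_state_terminal() is the existing helper in
drivers/nvme/host/nvme.h; roughly (paraphrased from memory, please check
the exact code in the tree), it reports DELETING, DELETING_NOIO and DEAD
as terminal states:

static inline bool nvme_state_terminal(struct nvme_ctrl *ctrl)
{
	switch (nvme_ctrl_state(ctrl)) {
	case NVME_CTRL_NEW:
	case NVME_CTRL_LIVE:
	case NVME_CTRL_RESETTING:
	case NVME_CTRL_CONNECTING:
		/* controller can still make progress */
		return false;
	case NVME_CTRL_DELETING:
	case NVME_CTRL_DELETING_NOIO:
	case NVME_CTRL_DEAD:
		/* controller is going away and will never recover */
		return true;
	default:
		WARN_ONCE(1, "unhandled ctrl state:%d", nvme_ctrl_state(ctrl));
		return true;
	}
}
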
I have tested the above patch against all possible cases. Please let me know
if this looks good or if there are any further comments.
Thanks,
--Nilay