[PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset

Mon Mar 11 05:58:21 PDT 2024


On 3/11/24 10:11, Keith Busch wrote:
> On Sun, Mar 10, 2024 at 12:35:06AM +0530, Nilay Shroff wrote:
>> On 3/9/24 21:14, Keith Busch wrote:
>>> Your patch may observe a ctrl in "RESETTING" state from
>>> error_detected(), then disable the controller, which quiesces the admin
>>> queue. Meanwhile, reset_work may proceed to CONNECTING state and try
>>> nvme_submit_sync_cmd(), which blocks forever because no one is going to
>>> unquiesce that admin queue.
>>>
>> OK I think I got your point. However, it seems that even without my patch
>> the above mentioned deadlock could still be possible. 
> 
> I sure hope not. The current design should guarnatee forward progress on
> initialization failed devices.
> 
>> Without my patch, if error_detcted() observe a ctrl in "RESETTING" state then 
>> it still invokes nvme_dev_disable(). The only difference with my patch is that 
>> error_detected() returns the PCI_ERS_RESULT_NEED_RESET instead of PCI_ERS_RESULT_DISCONNECT.
> 
> There's one more subtle difference: that condition disables with the
> 'shutdown' parameter set to 'true' which accomplishes a couple things:
> all entered requests are flushed to their demise via the final
> unquiesce, and all request_queue's are killed which forces error returns
> for all new request allocations. No thread will be left waiting for
> something that won't happen.
> 
Aaargh, I didn't notice that subtle difference. I got your all points..
After thinking for a while (as you suggested) it seems that we potentially
require to contain the race between reset_work and error_detected code path
as both could run in parallel on different cpu when pci recovery initiates. 
So I'm thinking that if we can hold on the error_detected() code path to proceed
until reset_work is finished ? Particularly, in error_detcted() function if 
we fall through the pci_channel_io_frozen case and the ctrl state is already
RESETTING (so that means that reset_work shall be running) then hold on 
invoking nvme_dev_diable(dev, false) until reset_work is finished. The changes
should be something as below:

@@ -3295,10 +3304,13 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
        case pci_channel_io_frozen:
                dev_warn(dev->ctrl.device,
                        "frozen state error detected, reset controller\n");
-               if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
-                       nvme_dev_disable(dev, true);
-                       return PCI_ERS_RESULT_DISCONNECT;
+               if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
+                       if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+                               nvme_dev_disable(dev, true);
+                               return PCI_ERS_RESULT_DISCONNECT;
+                       }
                }
+               flush_work(&dev->ctrl.reset_work);
                nvme_dev_disable(dev, false);
                return PCI_ERS_RESULT_NEED_RESET;
        case pci_channel_io_perm_failure:

The flush_work() would ensure that we don't disable the ctrl if reset_work 
is running. If the rest_work is *not* running currently then flush_work() should
return immediately. Moreover, if reset_work is scheduled or start running after
flush_work() returns then reset_work should not be able to get upto the CONNECTING
state because pci recovery is in progress and so it should fail early.

On the reset_work side other than detecting pci error recovery, I think we also 
need one another change where in case the ctrl state is set to CONNECTING and we 
detect the pci error recovery in progress then before returning from the reset_work
we set the ctrl state to RESETTING so that error_detected() could forward progress.
The changes should be something as below:

@@ -2776,6 +2776,16 @@ static void nvme_reset_work(struct work_struct *work)
  out_unlock:
        mutex_unlock(&dev->shutdown_lock);
  out:
+       /*
+        * If PCI recovery is ongoing then let it finish first
+        */
+       if (pci_channel_offline(to_pci_dev(dev->dev))) {
+               dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
+               if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING)
+                       WRITE_ONCE(dev->ctrl.state, NVME_CTRL_RESETTING);
+               return;
+       }
+
        /*
         * Set state to deleting now to avoid blocking nvme_wait_reset(), which
         * may be holding this pci_dev's device lock.

Please let me know if you find anything not good with the above changes.

Thanks,
--Nilay