[PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset

Nilay Shroff nilay at linux.ibm.com
Wed Mar 13 04:59:35 PDT 2024



On 3/12/24 20:00, Keith Busch wrote:
> On Mon, Mar 11, 2024 at 06:28:21PM +0530, Nilay Shroff wrote:
>> @@ -3295,10 +3304,13 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>>         case pci_channel_io_frozen:
>>                 dev_warn(dev->ctrl.device,
>>                         "frozen state error detected, reset controller\n");
>> -               if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>> -                       nvme_dev_disable(dev, true);
>> -                       return PCI_ERS_RESULT_DISCONNECT;
>> +               if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
>> +                       if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>> +                               nvme_dev_disable(dev, true);
>> +                               return PCI_ERS_RESULT_DISCONNECT;
>> +                       }
>>                 }
>> +               flush_work(&dev->ctrl.reset_work);
> 
> I was messing with a similar idea. I wasn't sure if EEH calls the error
> handler inline with the error, in which case this would try to flush the
> work within the same work, which obviously doesn't work. As long as its
> called from a different thread, then this should be fine.
The EEH recovery happens from a different thread, so flush_work() should
work here as expected.
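
For reference, the ordering this relies on looks roughly like the sketch below.
This is not the nvme code, only a minimal model of the workqueue usage with
illustrative names:

#include <linux/workqueue.h>

/* stands in for nvme_reset_work(): tears down and re-enables the ctrl */
static void reset_work_fn(struct work_struct *work)
{
}
static DECLARE_WORK(reset_work, reset_work_fn);

/* reset/subsystem-reset path: only queues the work and returns */
static void trigger_reset(void)
{
	schedule_work(&reset_work);
}

/*
 * EEH recovery thread: a different thread of execution than reset_work_fn(),
 * so blocking here until the queued work completes cannot self-deadlock.
 * Calling flush_work() from inside reset_work_fn() itself would be the
 * problematic case described above.
 */
static void error_detected(void)
{
	flush_work(&reset_work);
}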

>> @@ -2776,6 +2776,16 @@ static void nvme_reset_work(struct work_struct *work)
>>   out_unlock:
>>         mutex_unlock(&dev->shutdown_lock);
>>   out:
>> +       /*
>> +        * If PCI recovery is ongoing then let it finish first
>> +        */
>> +       if (pci_channel_offline(to_pci_dev(dev->dev))) {
>> +               dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
>> +               if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING)
>> +                       WRITE_ONCE(dev->ctrl.state, NVME_CTRL_RESETTING);
> 
> This may break the state machine, like if the device was hot removed
> during all this error handling. This will force the state back to
> RESETTING when it should be DEAD.
Agreed, we shouldn't force the ctrl state back to RESETTING here.
> 
> I think what you need is just allow a controller to reset from a
> connecting state. Have to be careful that wouldn't break any other
> expectations, though.
OK, got your point. I have reworked the ctrl state machine as you suggested and
tested the changes. The updated state machine code is shown below:

@@ -546,10 +546,11 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
                break;
        case NVME_CTRL_RESETTING:
                switch (old_state) {
                case NVME_CTRL_NEW:
                case NVME_CTRL_LIVE:
+               case NVME_CTRL_CONNECTING:
                        changed = true;
                        fallthrough;
                default:
                        break;
                }
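
For illustration only, the effect of that hunk boils down to the transition
check below. This is a stripped-down model, not the driver code, and it only
shows the RESETTING target state:

enum ctrl_state { CTRL_NEW, CTRL_LIVE, CTRL_CONNECTING, CTRL_RESETTING, CTRL_DEAD };

/* returns true when a controller in old_state may enter RESETTING */
static bool can_enter_resetting(enum ctrl_state old_state)
{
	switch (old_state) {
	case CTRL_NEW:
	case CTRL_LIVE:
	case CTRL_CONNECTING:	/* newly allowed transition */
		return true;
	default:
		return false;
	}
}

With CONNECTING accepted as an old state, the error handler can move the ctrl
to RESETTING even when it races with a reset that is still in the connecting
phase.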

The correspondingly updated reset_work function is shown below. Here we ensure
that, even though PCI error recovery is in progress, if we cannot move the ctrl
state to RESETTING then reset_work still makes forward progress and the ctrl
ends up in the DEAD state.

@@ -2774,10 +2774,19 @@ static void nvme_reset_work(struct work_struct *work)
        return;
 
  out_unlock:
        mutex_unlock(&dev->shutdown_lock);
  out:
+       /*
+        * If PCI recovery is ongoing then let it finish first
+        */
+       if (pci_channel_offline(to_pci_dev(dev->dev))) {
+               if (nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+                       dev_warn(dev->ctrl.device, "PCI recovery is ongoing, let it finish\n");
+                       return;
+               }
+       }
        /*
         * Set state to deleting now to avoid blocking nvme_wait_reset(), which
         * may be holding this pci_dev's device lock.
         */
        dev_warn(dev->ctrl.device, "Disabling device after reset failure: %d\n",
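
Putting it together, the error exit of reset_work now behaves roughly as in the
flattened sketch below (same logic as the hunk above, just collapsed into one
condition; not the complete function):

	/* out: */
	if (pci_channel_offline(to_pci_dev(dev->dev)) &&
	    nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
		/* PCI/EEH recovery owns the device now; let it finish */
		return;
	}
	/*
	 * No PCI recovery pending, or the state change failed (e.g. the
	 * device was hot removed meanwhile): fall through to the existing
	 * failure handling, which disables the device and leaves the ctrl
	 * DEAD.
	 */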


I have now also included in my testing the hot removal of the NVMe adapter while
EEH recovery is in progress, and the EEH recovery code handles this case well:
when EEH recovery is in progress and we hot remove the adapter being recovered,
the EEH handler stops the recovery and sets the PCI channel state to
"pci_channel_io_perm_failure".


The logs collected for this case are shown below:
-----------------------------------------------
# nvme list-subsys 
nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
               iopolicy=numa

# lspci 
0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X

# dmesg
[  561.639102] EEH: Recovering PHB#18-PE#10000
[  561.639120] EEH: PE location: N/A, PHB location: N/A
[  561.639128] EEH: Frozen PHB#18-PE#10000 detected
<snip>
<snip>
[  561.639428] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures.
[  561.639441] EEH: Notify device drivers to shutdown
[  561.639450] EEH: Beginning: 'error_detected(IO frozen)'
[  561.639458] PCI 0018:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
[  561.639462] nvme nvme0: frozen state error detected, reset controller
[  561.719078] nvme 0018:01:00.0: enabling device (0000 -> 0002)
[  561.719318] nvme nvme0: PCI recovery is ongoing so let it finish

<!!!! WHILE EEH RECOVERY IS IN PROGRESS WE HOT REMOVE THE NVMe ADAPTER !!!!>

[  563.850328] rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
<snip>
[  571.879092] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'need reset'
[  571.879097] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
<snip>
<snip>
[  571.881761] EEH: Reset without hotplug activity
[  574.039807] EEH: PHB#18-PE#10000: Slot inactive after reset: 0x2 (attempt 1)
[  574.309091] EEH: Failure 2 resetting PHB#18-PE#10000 (attempt 2)
[  574.309091] 
[  574.579094] EEH: Failure 2 resetting PHB#18-PE#10000 (attempt 3)
[  574.579094] 
[  574.579101] eeh_handle_normal_event: Cannot reset, err=-5
[  574.579104] EEH: Unable to recover from failure from PHB#18-PE#10000.
[  574.579104] Please try reseating or replacing it
<snip>
<snip>
[  574.581314] EEH: Beginning: 'error_detected(permanent failure)'
[  574.581320] PCI 0018:01:00.0#10000: EEH: no device

# lspci
<empty>

# nvme list-subsys
<empty>


Another case tested: when reset_work is ongoing after a subsystem-reset command
and the user immediately hot removes the NVMe adapter, EEH recovery is *not*
triggered and the ctrl makes forward progress to the "DEAD" state.

The logs collected for this case are shown below:
-----------------------------------------------
# nvme list-subsys 
nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
               iopolicy=numa

# lspci 
0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X

# nvme subsystem-reset /dev/nvme0

<!!!! HOT REMOVE THE NVMe ADAPTER !!!!>

# dmesg
[ 9967.143886] eeh_handle_normal_event: Cannot find PCI bus for PHB#18-PE#10000
[ 9967.224078] eeh_handle_normal_event: Cannot find PCI bus for PHB#18-PE#10000
<snip>
[ 9967.223858] nvme 0018:01:00.0: enabling device (0000 -> 0002)
[ 9967.224140] nvme nvme0: Disabling device after reset failure: -19

# lspci
<empty>

# nvme list-subsys
<empty>


Please let me know if the above changes look good to you. If so, I will spin a
new version of the patch and send it for review.

Thanks,
--Nilay


