[PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
Nilay Shroff
nilay at linux.ibm.com
Thu Mar 21 22:02:01 PDT 2024
Hi Keith,
A gentle ping. I'm not sure whether you've had a chance to review my last email on this subject.
Please let me know your feedback/thoughts.
Thanks,
--Nilay
On 3/13/24 17:29, Nilay Shroff wrote:
>
>
> On 3/12/24 20:00, Keith Busch wrote:
>> On Mon, Mar 11, 2024 at 06:28:21PM +0530, Nilay Shroff wrote:
>>> @@ -3295,10 +3304,13 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>>> case pci_channel_io_frozen:
>>> dev_warn(dev->ctrl.device,
>>> "frozen state error detected, reset controller\n");
>>> - if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>>> - nvme_dev_disable(dev, true);
>>> - return PCI_ERS_RESULT_DISCONNECT;
>>> + if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
>>> + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>>> + nvme_dev_disable(dev, true);
>>> + return PCI_ERS_RESULT_DISCONNECT;
>>> + }
>>> }
>>> + flush_work(&dev->ctrl.reset_work);
>>
>> I was messing with a similar idea. I wasn't sure if EEH calls the error
>> handler inline with the error, in which case this would try to flush the
>> work within the same work, which obviously doesn't work. As long as it's
>> called from a different thread, then this should be fine.
> The EEH recovery happens from a different thread, so the flush_work() here
> should work as expected.
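>
> To illustrate the workqueue rule Keith is pointing at, here is a minimal
> sketch (my_work_fn and flush_from_other_thread are made-up names, and the
> nvme_dev reference just assumes the driver's existing definitions):
>
>     #include <linux/workqueue.h>
>
>     /* Hypothetical work item, for illustration only. */
>     static void my_work_fn(struct work_struct *work)
>     {
>             /*
>              * BUG: flushing the work item we are currently running in
>              * waits for that same item to finish, so it never returns.
>              */
>             flush_work(work);
>     }
>
>     /*
>      * Fine: called from a different thread, e.g. the EEH recovery
>      * thread invoking nvme_error_detected().
>      */
>     static void flush_from_other_thread(struct nvme_dev *dev)
>     {
>             flush_work(&dev->ctrl.reset_work);
>     }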
>
>>> @@ -2776,6 +2776,16 @@ static void nvme_reset_work(struct work_struct *work)
>>> out_unlock:
>>> mutex_unlock(&dev->shutdown_lock);
>>> out:
>>> + /*
>>> + * If PCI recovery is ongoing then let it finish first
>>> + */
>>> + if (pci_channel_offline(to_pci_dev(dev->dev))) {
>>> + dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
>>> + if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING)
>>> + WRITE_ONCE(dev->ctrl.state, NVME_CTRL_RESETTING);
>>
>> This may break the state machine, like if the device was hot removed
>> during all this error handling. This will force the state back to
>> RESETTING when it should be DEAD.
> Agreed, we shouldn't force the ctrl state to RESETTING here.
>>
>> I think what you need is just to allow a controller to reset from a
>> connecting state. Have to be careful that wouldn't break any other
>> expectations, though.
> Yeah, OK, got your point. I have reworked the ctrl state machine as you
> suggested and tested the changes. The updated state machine code is shown
> below:
>
> @@ -546,10 +546,11 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> break;
> case NVME_CTRL_RESETTING:
> switch (old_state) {
> case NVME_CTRL_NEW:
> case NVME_CTRL_LIVE:
> + case NVME_CTRL_CONNECTING:
> changed = true;
> fallthrough;
> default:
> break;
> }
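>
> With that additional transition, the expected flow during EEH recovery is
> roughly the following (a sketch of the intended sequence, not a verbatim
> trace from the actual run):
>
>     /*
>      * nvme subsystem-reset        ctrl: LIVE -> RESETTING
>      * nvme_reset_work runs        ctrl: RESETTING -> CONNECTING
>      * MMIO to the device fails, EEH freezes the PE and invokes
>      * nvme_error_detected(IO frozen), whose
>      *     nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)
>      * now succeeds from CONNECTING instead of failing and disabling
>      * the controller, so it can flush reset_work and let EEH recovery
>      * proceed.
>      */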
>
> The accordingly updated reset_work function is shown below. Here we ensure that
> even while PCI error recovery is in progress, if we can't move the ctrl state
> to RESETTING then we still let reset_work make forward progress and set the ctrl
> state to DEAD. (A sketch of the pci_channel_offline() helper this check relies
> on follows the hunk.)
>
> @@ -2774,10 +2774,19 @@ static void nvme_reset_work(struct work_struct *work)
> return;
>
> out_unlock:
> mutex_unlock(&dev->shutdown_lock);
> out:
> + /*
> + * If PCI recovery is ongoing then let it finish first
> + */
> + if (pci_channel_offline(to_pci_dev(dev->dev))) {
> + if (nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> + dev_warn(dev->ctrl.device, "PCI recovery is ongoing, let it finish\n");
> + return;
> + }
> + }
> /*
> * Set state to deleting now to avoid blocking nvme_wait_reset(), which
> * may be holding this pci_dev's device lock.
> */
> dev_warn(dev->ctrl.device, "Disabling device after reset failure: %d\n",
>
>
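> For reference, pci_channel_offline() is the generic PCI helper the new check
> relies on; as far as I recall it is roughly this inline from
> include/linux/pci.h (quoted from memory, please double-check against the tree):
>
>     static inline int pci_channel_offline(struct pci_dev *pdev)
>     {
>             /*
>              * True once error recovery has marked the channel frozen
>              * or permanently failed.
>              */
>             return (pdev->error_state != pci_channel_io_normal);
>     }
>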
> Now I have also tested hot removal of the NVMe adapter while EEH recovery is in
> progress, and the EEH recovery code handles this case well: if we hot-remove the
> adapter being recovered while EEH recovery is in progress, the EEH handler stops
> the recovery and sets the PCI channel state to "pci_channel_io_perm_failure".
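>
> Once the channel state is pci_channel_io_perm_failure, the driver's
> error_detected handler simply asks the PCI core to disconnect the device,
> roughly like this (from memory of the existing nvme_error_detected(), not
> part of this patch):
>
>     case pci_channel_io_perm_failure:
>             /* The channel is permanently gone, request disconnect. */
>             return PCI_ERS_RESULT_DISCONNECT;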
>
>
> Collected logs for this case (shown below):
> -----------------------------------------------
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
> hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
> iopolicy=numa
>
> # lspci
> 0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
>
> # dmesg
> [ 561.639102] EEH: Recovering PHB#18-PE#10000
> [ 561.639120] EEH: PE location: N/A, PHB location: N/A
> [ 561.639128] EEH: Frozen PHB#18-PE#10000 detected
> <snip>
> <snip>
> [ 561.639428] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures.
> [ 561.639441] EEH: Notify device drivers to shutdown
> [ 561.639450] EEH: Beginning: 'error_detected(IO frozen)'
> [ 561.639458] PCI 0018:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
> [ 561.639462] nvme nvme0: frozen state error detected, reset controller
> [ 561.719078] nvme 0018:01:00.0: enabling device (0000 -> 0002)
> [ 561.719318] nvme nvme0: PCI recovery is ongoing so let it finish
>
> <!!!! WHILE EEH RECOVERY IS IN PROGRESS WE HOT REMOVE THE NVMe ADAPTER !!!!>
>
> [ 563.850328] rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
> <snip>
> [ 571.879092] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'need reset'
> [ 571.879097] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
> <snip>
> <snip>
> [ 571.881761] EEH: Reset without hotplug activity
> [ 574.039807] EEH: PHB#18-PE#10000: Slot inactive after reset: 0x2 (attempt 1)
> [ 574.309091] EEH: Failure 2 resetting PHB#18-PE#10000 (attempt 2)
> [ 574.309091]
> [ 574.579094] EEH: Failure 2 resetting PHB#18-PE#10000 (attempt 3)
> [ 574.579094]
> [ 574.579101] eeh_handle_normal_event: Cannot reset, err=-5
> [ 574.579104] EEH: Unable to recover from failure from PHB#18-PE#10000.
> [ 574.579104] Please try reseating or replacing it
> <snip>
> <snip>
> [ 574.581314] EEH: Beginning: 'error_detected(permanent failure)'
> [ 574.581320] PCI 0018:01:00.0#10000: EEH: no device
>
> # lspci
> <empty>
>
> # nvme list-subsys
> <empty>
>
>
> Another case tested: when reset_work is ongoing after the subsystem-reset command
> and the user immediately hot-removes the NVMe adapter, EEH recovery is *not*
> triggered and the ctrl makes forward progress to the "DEAD" state.
>
> Collected logs for this case (shown below):
> -----------------------------------------------
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
> hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
> iopolicy=numa
>
> # lspci
> 0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
>
> # nvme subsystem-reset /dev/nvme0
>
> <!!!! HOT REMOVE THE NVMe ADAPTER !!!!>
>
> # dmesg
> [ 9967.143886] eeh_handle_normal_event: Cannot find PCI bus for PHB#18-PE#10000
> [ 9967.224078] eeh_handle_normal_event: Cannot find PCI bus for PHB#18-PE#10000
> <snip>
> [ 9967.223858] nvme 0018:01:00.0: enabling device (0000 -> 0002)
> [ 9967.224140] nvme nvme0: Disabling device after reset failure: -19
>
> # lspci
> <empty>
>
> # nvme list-subsys
> <empty>
>
>
> Please let me know if the above changes look good to you. If they do, I will
> spin a new version of the patch and send it for review.
>
> Thanks,
> --Nilay
>
>