[PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
Nilay Shroff
nilay at linux.ibm.com
Thu Mar 21 22:02:01 PDT 2024
Hi Keith,
A gentle ping. I'm not sure whether you've had a chance to review my last email on this subject.
Please let me know your feedback/thoughts.
Thanks,
--Nilay
On 3/13/24 17:29, Nilay Shroff wrote:
>
>
> On 3/12/24 20:00, Keith Busch wrote:
>> On Mon, Mar 11, 2024 at 06:28:21PM +0530, Nilay Shroff wrote:
>>> @@ -3295,10 +3304,13 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>>> case pci_channel_io_frozen:
>>> dev_warn(dev->ctrl.device,
>>> "frozen state error detected, reset controller\n");
>>> - if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>>> - nvme_dev_disable(dev, true);
>>> - return PCI_ERS_RESULT_DISCONNECT;
>>> + if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
>>> + if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>>> + nvme_dev_disable(dev, true);
>>> + return PCI_ERS_RESULT_DISCONNECT;
>>> + }
>>> }
>>> + flush_work(&dev->ctrl.reset_work);
>>
>> I was messing with a similar idea. I wasn't sure if EEH calls the error
>> handler inline with the error, in which case this would try to flush the
>> work within the same work, which obviously doesn't work. As long as it's
>> called from a different thread, then this should be fine.
> The EEH recovery happens from a different thread, so the flush_work() here
> should work as expected.
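>
> To illustrate the workqueue rule Keith is pointing at, here is a minimal
> sketch (my_work_fn and flush_from_other_thread are made-up names, and the
> nvme_dev reference just assumes the driver's existing definitions):
>
>     #include <linux/workqueue.h>
>
>     /* Hypothetical work item, for illustration only. */
>     static void my_work_fn(struct work_struct *work)
>     {
>             /*
>              * BUG: flushing the work item we are currently running in
>              * waits for that same item to finish, so it never returns.
>              */
>             flush_work(work);
>     }
>
>     /*
>      * Fine: called from a different thread, e.g. the EEH recovery
>      * thread invoking nvme_error_detected().
>      */
>     static void flush_from_other_thread(struct nvme_dev *dev)
>     {
>             flush_work(&dev->ctrl.reset_work);
>     }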
>
>>> @@ -2776,6 +2776,16 @@ static void nvme_reset_work(struct work_struct *work)
>>> out_unlock:
>>> mutex_unlock(&dev->shutdown_lock);
>>> out:
>>> + /*
>>> + * If PCI recovery is ongoing then let it finish first
>>> + */
>>> + if (pci_channel_offline(to_pci_dev(dev->dev))) {
>>> + dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
>>> + if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING)
>>> + WRITE_ONCE(dev->ctrl.state, NVME_CTRL_RESETTING);
>>
>> This may break the state machine, like if the device was hot removed
>> during all this error handling. This will force the state back to
>> RESETTING when it should be DEAD.
> Agreed, we shouldn't force the ctrl state to RESETTING here.
>>
>> I think what you need is just to allow a controller to reset from a
>> connecting state. Have to be careful that wouldn't break any other
>> expectations, though.
> Yeah, OK, got your point. I have reworked the ctrl state machine as you
> suggested and tested the changes. The updated state machine code is shown
> below:
>
> @@ -546,10 +546,11 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
> break;
> case NVME_CTRL_RESETTING:
> switch (old_state) {
> case NVME_CTRL_NEW:
> case NVME_CTRL_LIVE:
> + case NVME_CTRL_CONNECTING:
> changed = true;
> fallthrough;
> default:
> break;
> }
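>
> With that additional transition, the expected flow during EEH recovery is
> roughly the following (a sketch of the intended sequence, not a verbatim
> trace from the actual run):
>
>     /*
>      * nvme subsystem-reset        ctrl: LIVE -> RESETTING
>      * nvme_reset_work runs        ctrl: RESETTING -> CONNECTING
>      * MMIO to the device fails, EEH freezes the PE and invokes
>      * nvme_error_detected(IO frozen), whose
>      *     nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)
>      * now succeeds from CONNECTING instead of failing and disabling
>      * the controller, so it can flush reset_work and let EEH recovery
>      * proceed.
>      */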
>
> The accordingly updated reset_work function is shown below. Here we ensure that
> even while PCI error recovery is in progress, if we can't move the ctrl state
> to RESETTING then we still let reset_work make forward progress and set the ctrl
> state to DEAD. (A sketch of the pci_channel_offline() helper this check relies
> on follows the hunk.)
>
> @@ -2774,10 +2774,19 @@ static void nvme_reset_work(struct work_struct *work)
> return;
>
> out_unlock:
> mutex_unlock(&dev->shutdown_lock);
> out:
> + /*
> + * If PCI recovery is ongoing then let it finish first
> + */
> + if (pci_channel_offline(to_pci_dev(dev->dev))) {
> + if (nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> + dev_warn(dev->ctrl.device, "PCI recovery is ongoing, let it finish\n");
> + return;
> + }
> + }
> /*
> * Set state to deleting now to avoid blocking nvme_wait_reset(), which
> * may be holding this pci_dev's device lock.
> */
> dev_warn(dev->ctrl.device, "Disabling device after reset failure: %d\n",
>
>
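> For reference, pci_channel_offline() is the generic PCI helper the new check
> relies on; as far as I recall it is roughly this inline from
> include/linux/pci.h (quoted from memory, please double-check against the tree):
>
>     static inline int pci_channel_offline(struct pci_dev *pdev)
>     {
>             /*
>              * True once error recovery has marked the channel frozen
>              * or permanently failed.
>              */
>             return (pdev->error_state != pci_channel_io_normal);
>     }
>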
> Now I have also tested hot removal of the NVMe adapter while EEH recovery is in
> progress, and the EEH recovery code handles this case well: if we hot-remove the
> adapter being recovered while EEH recovery is in progress, the EEH handler stops
> the recovery and sets the PCI channel state to "pci_channel_io_perm_failure".
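>
> Once the channel state is pci_channel_io_perm_failure, the driver's
> error_detected handler simply asks the PCI core to disconnect the device,
> roughly like this (from memory of the existing nvme_error_detected(), not
> part of this patch):
>
>     case pci_channel_io_perm_failure:
>             /* The channel is permanently gone, request disconnect. */
>             return PCI_ERS_RESULT_DISCONNECT;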
>
>
> Collected logs for this case (shown below):
> -----------------------------------------------
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
> hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
> iopolicy=numa
>
> # lspci
> 0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
>
> # dmesg
> [ 561.639102] EEH: Recovering PHB#18-PE#10000
> [ 561.639120] EEH: PE location: N/A, PHB location: N/A
> [ 561.639128] EEH: Frozen PHB#18-PE#10000 detected
> <snip>
> <snip>
> [ 561.639428] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures.
> [ 561.639441] EEH: Notify device drivers to shutdown
> [ 561.639450] EEH: Beginning: 'error_detected(IO frozen)'
> [ 561.639458] PCI 0018:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
> [ 561.639462] nvme nvme0: frozen state error detected, reset controller
> [ 561.719078] nvme 0018:01:00.0: enabling device (0000 -> 0002)
> [ 561.719318] nvme nvme0: PCI recovery is ongoing so let it finish
>
> <!!!! WHILE EEH RECOVERY IS IN PROGRESS WE HOT REMOVE THE NVMe ADAPTER !!!!>
>
> [ 563.850328] rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
> <snip>
> [ 571.879092] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'need reset'
> [ 571.879097] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
> <snip>
> <snip>
> [ 571.881761] EEH: Reset without hotplug activity
> [ 574.039807] EEH: PHB#18-PE#10000: Slot inactive after reset: 0x2 (attempt 1)
> [ 574.309091] EEH: Failure 2 resetting PHB#18-PE#10000 (attempt 2)
> [ 574.309091]
> [ 574.579094] EEH: Failure 2 resetting PHB#18-PE#10000 (attempt 3)
> [ 574.579094]
> [ 574.579101] eeh_handle_normal_event: Cannot reset, err=-5
> [ 574.579104] EEH: Unable to recover from failure from PHB#18-PE#10000.
> [ 574.579104] Please try reseating or replacing it
> <snip>
> <snip>
> [ 574.581314] EEH: Beginning: 'error_detected(permanent failure)'
> [ 574.581320] PCI 0018:01:00.0#10000: EEH: no device
>
> # lspci
> <empty>
>
> # nvme list-subsys
> <empty>
>
>
> Another case tested: when reset_work is ongoing after the subsystem-reset command
> and the user immediately hot-removes the NVMe adapter, EEH recovery is *not*
> triggered and the ctrl makes forward progress to the "DEAD" state.
>
> Collected logs for this case (shown below):
> -----------------------------------------------
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
> hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
> iopolicy=numa
>
> # lspci
> 0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
>
> # nvme subsystem-reset /dev/nvme0
>
> <!!!! HOT REMOVE THE NVMe ADAPTER !!!!>
>
> # dmesg
> [ 9967.143886] eeh_handle_normal_event: Cannot find PCI bus for PHB#18-PE#10000
> [ 9967.224078] eeh_handle_normal_event: Cannot find PCI bus for PHB#18-PE#10000
> <snip>
> [ 9967.223858] nvme 0018:01:00.0: enabling device (0000 -> 0002)
> [ 9967.224140] nvme nvme0: Disabling device after reset failure: -19
>
> # lspci
> <empty>
>
> # nvme list-subsys
> <empty>
>
>
> Please let me know if the above changes look good to you. If they do, I will
> spin a new version of the patch and send it for review.
>
> Thanks,
> --Nilay
>
>