[PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
Nilay Shroff
nilay at linux.ibm.com
Thu Feb 29 04:27:56 PST 2024
Hi Keith,
On 2/28/24 16:49, Nilay Shroff wrote:
>
>
> On 2/27/24 23:59, Keith Busch wrote:
>> On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
>>> If the nvme subsystem reset causes the loss of communication to the nvme
>>> adapter then EEH could potentially recover the adapter. The detection of
>>> communication loss to the adapter only happens when the nvme driver
>>> attempts to read an MMIO register.
>>>
>>> The nvme subsystem reset command writes 0x4E564D65 to the NSSR register and
>>> schedules an adapter reset. If the nvme subsystem reset caused the loss
>>> of communication to the nvme adapter then either an IO timeout event or
>>> the adapter reset handler could detect it. If the IO timeout event detects the
>>> loss of communication then the EEH handler is able to recover the
>>> communication to the adapter. This change was implemented in 651438bb0af5
>>> (nvme-pci: Fix EEH failure on ppc). However, if the adapter communication
>>> loss is detected in the nvme reset work handler then EEH is unable to
>>> successfully finish the adapter recovery.
>>>
>>> This patch ensures that,
>>> - the nvme driver reset handler observes that the pci channel is offline after
>>> a failed MMIO read and avoids marking the controller state DEAD, and
>>> thus gives a fair chance to the EEH handler to recover the nvme adapter.
>>>
>>> - if the nvme controller is already in the RESETTING state and a pci channel
>>> frozen error is detected then the nvme driver pci-error-handler code sends the
>>> correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler
>>> so that the EEH handler can proceed with the pci slot reset.
>>
>> A subsystem reset takes the link down. I'm pretty sure the proper way to
>> recover from it requires pcie hotplug support.
>>
> Yes, you're correct. We require pcie hotplugging to recover. However, the powerpc EEH
> handler is able to recover the pcie adapter without physically removing and
> re-inserting the adapter, or in other words, it can reset the adapter without
> hotplug activity. In fact, powerpc EEH can isolate the pcie slot and reset it
> (i.e. reset the PCI device by holding the PCI #RST line high for two seconds),
> followed by setting up the device config space (the base address registers
> (BARs), latency timer, cache line size, interrupt line, and so on).
>
> You may find more information about EEH recovery here:
> https://www.kernel.org/doc/Documentation/powerpc/eeh-pci-error-recovery.txt
>
> Typically, when a pcie error is detected and EEH is able to recover the device,
> the EEH handler code goes through the sequence below (assuming the driver is EEH aware):
>
> eeh_handle_normal_event()
> eeh_set_channel_state()-> set state to pci_channel_io_frozen
> eeh_report_error()
> nvme_error_detected() -> channel state "pci_channel_io_frozen"; returns PCI_ERS_RESULT_NEED_RESET
> eeh_slot_reset() -> recovery successful
> nvme_slot_reset() -> returns PCI_ERS_RESULT_RECOVERED
> eeh_set_channel_state()-> set state to pci_channel_io_normal
> nvme_error_resume()
>
> In case a pcie error is detected and EEH is unable to recover the device,
> the EEH handler code goes through the sequence below (a sketch of the driver-side
> hooks involved follows both sequences):
>
> eeh_handle_normal_event()
> eeh_set_channel_state()-> set state to pci_channel_io_frozen
> eeh_report_error()
> nvme_error_detected() -> channel state pci_channel_io_frozen; returns PCI_ERS_RESULT_NEED_RESET
> eeh_slot_reset() -> recovery failed
> eeh_report_failure()
> nvme_error_detected()-> channel state pci_channel_io_perm_failure; returns PCI_ERS_RESULT_DISCONNECT
> eeh_set_channel_state()-> set state to pci_channel_io_perm_failure
> nvme_remove()
>
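
For reference, the driver-side hooks that both sequences above call into look
roughly like the sketch below. This is only a simplified illustration of the
shape of the handlers in drivers/nvme/host/pci.c, not the exact upstream code;
the stub bodies exist only to keep the snippet self-contained.

#include <linux/pci.h>

/* Stub: the real driver restarts the controller here and only then
 * reports PCI_ERS_RESULT_RECOVERED back to the EEH core. */
static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
{
	return PCI_ERS_RESULT_RECOVERED;
}

/* Stub: the real driver resumes I/O once recovery has finished. */
static void nvme_error_resume(struct pci_dev *pdev)
{
}

static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
					    pci_channel_state_t state)
{
	switch (state) {
	case pci_channel_io_normal:
		return PCI_ERS_RESULT_CAN_RECOVER;
	case pci_channel_io_frozen:
		/*
		 * Ask the EEH core to reset the slot. With the proposed
		 * patch this is also the answer when the controller is
		 * already RESETTING (e.g. right after subsystem-reset).
		 */
		return PCI_ERS_RESULT_NEED_RESET;
	case pci_channel_io_perm_failure:
		/* Recovery failed: eeh_report_failure() -> nvme_remove() */
		return PCI_ERS_RESULT_DISCONNECT;
	}
	return PCI_ERS_RESULT_NEED_RESET;
}

static const struct pci_error_handlers nvme_err_handler = {
	.error_detected	= nvme_error_detected,
	.slot_reset	= nvme_slot_reset,
	.resume		= nvme_error_resume,
};
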
>
> If we execute the command "nvme subsystem-reset ..." and adapter communication is
> lost, then in the current code (under nvme_reset_work()) we simply disable the device
> and mark the controller DEAD. However, we may have a chance to recover the controller
> if the driver is EEH aware and EEH recovery is underway. We already handle one such case
> in nvme_timeout(). So this patch ensures that if we fall through nvme_reset_work()
> post subsystem-reset and EEH recovery is in progress, then we give the
> EEH mechanism a chance to recover the adapter. If the EEH recovery is unsuccessful, then
> we anyway fall through the code path mentioned above, where we invoke nvme_remove() at
> the end and delete the erring controller.
>
BTW, a similar issue was fixed earlier in 651438bb0af5 (nvme-pci: Fix EEH failure on ppc).
That fix was needed because the controller health check polling was removed in
b2a0eb1a0ac72869 (nvme-pci: Remove watchdog timer). In fact, today we may be able to recover
the NVMe adapter if a subsystem-reset or any other PCI error occurs while some
I/O request is in flight. The recovery is possible because the in-flight I/O request would
eventually time out, and nvme_timeout() has special code (added in 651438bb0af5) that
gives EEH a chance to recover the adapter, as sketched below.
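
The check added there is roughly of the following shape (a sketch of the early
part of nvme_timeout(), simplified and possibly not matching the current
upstream lines exactly):

	/*
	 * If PCI error recovery (EEH) is already in progress, do not
	 * escalate this timeout into a controller reset; just restart
	 * the timer and let the recovery mechanism finish.
	 */
	if (pci_channel_offline(to_pci_dev(dev->dev)))
		return BLK_EH_RESET_TIMER;
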
However, later, in 1e866afd4bcdd (nvme: ensure subsystem reset is single threaded), the
nvme subsystem reset code was reworked, so now when the user executes the subsystem-reset
command, the kernel first writes 0x4E564D65 to the nvme NSSR register and then schedules the
adapter reset. It is quite possible that when subsystem-reset is executed there is no I/O in
flight, and hence we may never hit nvme_timeout(). Later, when the adapter reset code (under
nvme_reset_work()) starts execution, it accesses MMIO registers. Hence, IMO,
nvme_reset_work() also potentially needs changes similar to those in nvme_timeout() so
that EEH recovery becomes possible.
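
Concretely, the patch adds an early bail-out of the same flavour to the reset
path. A minimal sketch of the idea (simplified; see the patch itself for the
exact placement inside nvme_reset_work()):

	/*
	 * A failed MMIO read left the PCI channel offline, which means
	 * EEH recovery is in progress. Do not mark the controller DEAD;
	 * let nvme_slot_reset()/nvme_error_resume() restart it instead.
	 */
	if (pci_channel_offline(to_pci_dev(dev->dev))) {
		dev_warn(dev->ctrl.device,
			 "PCI recovery is ongoing so let it finish\n");
		return;
	}
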
> With the proposed patch, we find that EEH recovery is successful post subsystem-reset.
> Please find below the relevant output:
> # lspci
> 0524:28:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CM7 2.5" (rev 01)
>
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
> hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
> iopolicy=numa
> \
> +- nvme0 pcie 0524:28:00.0 live
>
> # nvme subsystem-reset /dev/nvme0
>
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
> hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
> iopolicy=numa
> \
> +- nvme0 pcie 0524:28:00.0 resetting
>
> [10556.034082] EEH: Recovering PHB#524-PE#280000
> [10556.034108] EEH: PE location: N/A, PHB location: N/A
> [10556.034112] EEH: Frozen PHB#524-PE#280000 detected
> [10556.034115] EEH: Call Trace:
> [10556.034117] EEH: [c000000000051068] __eeh_send_failure_event+0x7c/0x15c
> [10556.034304] EEH: [c000000000049bcc] eeh_dev_check_failure.part.0+0x27c/0x6b0
> [10556.034310] EEH: [c008000004753d3c] nvme_pci_reg_read32+0x80/0xac [nvme]
> [10556.034319] EEH: [c0080000045f365c] nvme_wait_ready+0xa4/0x18c [nvme_core]
> [10556.034333] EEH: [c008000004754750] nvme_dev_disable+0x370/0x41c [nvme]
> [10556.034338] EEH: [c008000004757184] nvme_reset_work+0x1f4/0x3cc [nvme]
> [10556.034344] EEH: [c00000000017bb8c] process_one_work+0x1f0/0x4f4
> [10556.034350] EEH: [c00000000017c24c] worker_thread+0x3bc/0x590
> [10556.034355] EEH: [c00000000018a87c] kthread+0x138/0x140
> [10556.034358] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18
> [10556.034363] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
> [10556.034368] EEH: Notify device drivers to shutdown
> [10556.034371] EEH: Beginning: 'error_detected(IO frozen)'
> [10556.034376] PCI 0524:28:00.0#280000: EEH: Invoking nvme->error_detected(IO frozen)
> [10556.034379] nvme nvme0: frozen state error detected, reset controller
> [10556.102654] nvme 0524:28:00.0: enabling device (0000 -> 0002)
> [10556.103171] nvme nvme0: PCI recovery is ongoing so let it finish
> [10556.142532] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'need reset'
> [10556.142535] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
> [...]
> [...]
> [10556.148172] EEH: Reset without hotplug activity
> [10558.298672] EEH: Beginning: 'slot_reset'
> [10558.298692] PCI 0524:28:00.0#280000: EEH: Invoking nvme->slot_reset()
> [10558.298696] nvme nvme0: restart after slot reset
> [10558.301925] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'recovered'
> [10558.301928] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
> [10558.301939] EEH: Notify device driver to resume
> [10558.301944] EEH: Beginning: 'resume'
> [10558.301947] PCI 0524:28:00.0#280000: EEH: Invoking nvme->resume()
> [10558.331051] nvme nvme0: Shutdown timeout set to 10 seconds
> [10558.356679] nvme nvme0: 16/0/0 default/read/poll queues
> [10558.357026] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'none'
> [10558.357028] EEH: Finished:'resume'
> [10558.357035] EEH: Recovery successful.
>
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
> hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
> iopolicy=numa
> \
> +- nvme0 pcie 0524:28:00.0 live
>
>
Thanks,
--Nilay