[PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset

Nilay Shroff nilay at linux.ibm.com
Wed Feb 28 03:19:50 PST 2024



On 2/27/24 23:59, Keith Busch wrote:
> On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
>> If the nvme subsystem reset causes the loss of communication to the nvme
>> adapter then EEH could potentially recover the adapter. The detection of
>> communication loss to the adapter only happens when the nvme driver
>> attempts to read an MMIO register.
>>
>> The nvme subsystem reset command writes 0x4E564D65 to the NSSR register and
>> schedules an adapter reset. If the nvme subsystem reset causes the loss of
>> communication to the nvme adapter then either the IO timeout event or the
>> adapter reset handler could detect it. If the IO timeout event detects the
>> loss of communication then the EEH handler is able to recover the
>> communication to the adapter. This change was implemented in 651438bb0af5
>> (nvme-pci: Fix EEH failure on ppc). However, if the adapter communication
>> loss is detected in the nvme reset work handler then EEH is unable to
>> successfully finish the adapter recovery.
>>
>> This patch ensures that:
>> - the nvme driver reset handler observes that the pci channel is offline after
>>   a failed MMIO read and avoids marking the controller state DEAD, thus
>>   giving the EEH handler a fair chance to recover the nvme adapter.
>>
>> - if the nvme controller is already in RESETTING state and a pci channel frozen
>>   error is detected, then the nvme driver pci-error-handler code sends the
>>   correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler
>>   so that the EEH handler can proceed with the pci slot reset.
> 
> A subsystem reset takes the link down. I'm pretty sure the proper way to
> recover from it requires pcie hotplug support.
> 
Yes, you're correct. We require pcie hotplugging to recover. However, the powerpc EEH
handler is able to recover the pcie adapter without physically removing and
re-inserting it, or in other words, it can reset the adapter without any hotplug
activity. In fact, powerpc EEH isolates the pcie slot and resets it (i.e. resets
the PCI device by holding the PCI #RST line high for two seconds), followed by
setting up the device config space (the base address registers (BARs), latency
timer, cache line size, interrupt line, and so on).

You may find more information about EEH recovery here: 
https://www.kernel.org/doc/Documentation/powerpc/eeh-pci-error-recovery.txt
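
For context on what "EEH aware" means below: a driver opts into this recovery flow
by registering error-recovery callbacks with the PCI core. A trimmed-down sketch of
the wiring (not the verbatim nvme code; the callback bodies are sketched further
down in this mail):

#include <linux/pci.h>

/* Recovery callbacks invoked by EEH/AER; bodies are sketched further below. */
static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
					    pci_channel_state_t state);
static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev);
static void nvme_error_resume(struct pci_dev *pdev);

static const struct pci_error_handlers nvme_err_handler = {
	.error_detected	= nvme_error_detected,	/* frozen/perm_failure notification */
	.slot_reset	= nvme_slot_reset,	/* called once the slot has been reset */
	.resume		= nvme_error_resume,	/* recovery finished, resume normal I/O */
};

static struct pci_driver nvme_driver = {
	.name		= "nvme",
	/* .id_table / .probe / .remove elided */
	.err_handler	= &nvme_err_handler,
};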

Typically, when a pcie error is detected and EEH is able to recover the device,
the EEH handler code goes through the sequence below (assuming the driver is EEH aware):

eeh_handle_normal_event()
  eeh_set_channel_state()-> set state to pci_channel_io_frozen 
     eeh_report_error() 
       nvme_error_detected() -> channel state "pci_channel_io_frozen"; returns PCI_ERS_RESULT_NEED_RESET
         eeh_slot_reset() -> recovery successful 
           nvme_slot_reset() -> returns PCI_ERS_RESULT_RECOVERED
             eeh_set_channel_state()-> set state to pci_channel_io_normal
               nvme_error_resume()

In case a pcie error is detected and EEH is unable to recover the device,
the EEH handler code goes through the sequence below:

eeh_handle_normal_event()
  eeh_set_channel_state()-> set state to pci_channel_io_frozen
    eeh_report_error()
      nvme_error_detected() -> channel state pci_channel_io_frozen; returns PCI_ERS_RESULT_NEED_RESET
        eeh_slot_reset() -> recovery failed
          eeh_report_failure()
            nvme_error_detected()-> channel state pci_channel_io_perm_failure; returns PCI_ERS_RESULT_DISCONNECT
              eeh_set_channel_state()-> set state to pci_channel_io_perm_failure
                nvme_remove()
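
For reference, here is a simplified (not verbatim) sketch of the driver-side
callbacks that produce the return codes shown in the two sequences above;
struct nvme_dev, nvme_dev_disable() and nvme_reset_ctrl() are the existing
driver/core helpers:

static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
					    pci_channel_state_t state)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	switch (state) {
	case pci_channel_io_normal:
		return PCI_ERS_RESULT_CAN_RECOVER;
	case pci_channel_io_frozen:
		/* Quiesce the controller and ask the platform for a slot reset. */
		dev_warn(dev->ctrl.device,
			 "frozen state error detected, reset controller\n");
		nvme_dev_disable(dev, false);
		return PCI_ERS_RESULT_NEED_RESET;
	case pci_channel_io_perm_failure:
		/* Recovery failed; the PCI core will detach the device (nvme_remove()). */
		return PCI_ERS_RESULT_DISCONNECT;
	}
	return PCI_ERS_RESULT_NEED_RESET;
}

static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	/* Slot reset succeeded: restore config space and schedule a controller reset. */
	dev_info(dev->ctrl.device, "restart after slot reset\n");
	pci_restore_state(pdev);
	nvme_reset_ctrl(&dev->ctrl);
	return PCI_ERS_RESULT_RECOVERED;
}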

                                           
If we execute the command "nvme subsystem-reset ..." and the adapter communication is
lost, then in the current code (under nvme_reset_work()) we simply disable the device
and mark the controller DEAD. However, we may have a chance to recover the controller
if the driver is EEH aware and EEH recovery is underway. We already handle one such case
in nvme_timeout(). So this patch ensures that if we fall through nvme_reset_work()
post subsystem-reset and the EEH recovery is in progress, then we give the EEH
mechanism a chance to recover the adapter. If the EEH recovery is unsuccessful, we
anyway fall through the code path I mentioned above, where we invoke nvme_remove()
at the end and delete the erring controller.
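
To make that concrete, here is a minimal sketch of the idea. The helper name
nvme_pci_recovery_in_progress() is mine, not the patch's; the real change lives
directly in nvme_reset_work()'s error path in drivers/nvme/host/pci.c:

/*
 * Hypothetical helper (name invented for illustration): called from
 * nvme_reset_work()'s error path after a failed MMIO read. Returns true when
 * a platform recovery (EEH/AER) is in flight, in which case the controller
 * must be left in RESETTING instead of being torn down and marked DEAD, so
 * that nvme_slot_reset()/nvme_error_resume() can bring it back.
 */
static bool nvme_pci_recovery_in_progress(struct nvme_dev *dev)
{
	struct pci_dev *pdev = to_pci_dev(dev->dev);

	/* pci_channel_offline() is true while EEH has the slot frozen/isolated. */
	if (!pci_channel_offline(pdev))
		return false;

	dev_warn(dev->ctrl.device,
		 "PCI recovery is ongoing so let it finish\n");
	return true;
}

The reset work's error path would then effectively do
"if (nvme_pci_recovery_in_progress(dev)) return;" before falling through to the
existing disable-and-mark-DEAD teardown.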

With the proposed patch, we find that EEH recovery is successful post subsystem-reset. 
Please find below the relevant output:
# lspci 
0524:28:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CM7 2.5" (rev 01)

# nvme list-subsys
nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
               iopolicy=numa
\
 +- nvme0 pcie 0524:28:00.0 live

# nvme subsystem-reset /dev/nvme0

# nvme list-subsys
nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
               iopolicy=numa
\
 +- nvme0 pcie 0524:28:00.0 resetting

[10556.034082] EEH: Recovering PHB#524-PE#280000
[10556.034108] EEH: PE location: N/A, PHB location: N/A
[10556.034112] EEH: Frozen PHB#524-PE#280000 detected
[10556.034115] EEH: Call Trace:
[10556.034117] EEH: [c000000000051068] __eeh_send_failure_event+0x7c/0x15c
[10556.034304] EEH: [c000000000049bcc] eeh_dev_check_failure.part.0+0x27c/0x6b0
[10556.034310] EEH: [c008000004753d3c] nvme_pci_reg_read32+0x80/0xac [nvme]
[10556.034319] EEH: [c0080000045f365c] nvme_wait_ready+0xa4/0x18c [nvme_core]
[10556.034333] EEH: [c008000004754750] nvme_dev_disable+0x370/0x41c [nvme]
[10556.034338] EEH: [c008000004757184] nvme_reset_work+0x1f4/0x3cc [nvme]
[10556.034344] EEH: [c00000000017bb8c] process_one_work+0x1f0/0x4f4
[10556.034350] EEH: [c00000000017c24c] worker_thread+0x3bc/0x590
[10556.034355] EEH: [c00000000018a87c] kthread+0x138/0x140
[10556.034358] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18
[10556.034363] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[10556.034368] EEH: Notify device drivers to shutdown
[10556.034371] EEH: Beginning: 'error_detected(IO frozen)'
[10556.034376] PCI 0524:28:00.0#280000: EEH: Invoking nvme->error_detected(IO frozen)
[10556.034379] nvme nvme0: frozen state error detected, reset controller
[10556.102654] nvme 0524:28:00.0: enabling device (0000 -> 0002)
[10556.103171] nvme nvme0: PCI recovery is ongoing so let it finish
[10556.142532] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'need reset'
[10556.142535] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[...]
[...]
[10556.148172] EEH: Reset without hotplug activity
[10558.298672] EEH: Beginning: 'slot_reset'
[10558.298692] PCI 0524:28:00.0#280000: EEH: Invoking nvme->slot_reset()
[10558.298696] nvme nvme0: restart after slot reset
[10558.301925] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'recovered'
[10558.301928] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
[10558.301939] EEH: Notify device driver to resume
[10558.301944] EEH: Beginning: 'resume'
[10558.301947] PCI 0524:28:00.0#280000: EEH: Invoking nvme->resume()
[10558.331051] nvme nvme0: Shutdown timeout set to 10 seconds
[10558.356679] nvme nvme0: 16/0/0 default/read/poll queues
[10558.357026] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'none'
[10558.357028] EEH: Finished:'resume'
[10558.357035] EEH: Recovery successful.

# nvme list-subsys
nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
               iopolicy=numa
\
 +- nvme0 pcie 0524:28:00.0 live


Thanks,
--Nilay


