[PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset

Nilay Shroff nilay at linux.ibm.com
Sat Mar 9 06:29:11 PST 2024



On 3/8/24 21:11, Keith Busch wrote:
> On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
>> @@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work)
>>   out_unlock:
>>  	mutex_unlock(&dev->shutdown_lock);
>>   out:
>> +	/*
>> +	 * If PCI recovery is ongoing then let it finish first
>> +	 */
>> +	if (pci_channel_offline(to_pci_dev(dev->dev))) {
>> +		dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
>> +		return;
>> +	}
>> +
>>  	/*
>>  	 * Set state to deleting now to avoid blocking nvme_wait_reset(), which
>>  	 * may be holding this pci_dev's device lock.
>> @@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>>  	case pci_channel_io_frozen:
>>  		dev_warn(dev->ctrl.device,
>>  			"frozen state error detected, reset controller\n");
>> -		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>> -			nvme_dev_disable(dev, true);
>> -			return PCI_ERS_RESULT_DISCONNECT;
>> +		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
>> +			if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>> +				nvme_dev_disable(dev, true);
>> +				return PCI_ERS_RESULT_DISCONNECT;
>> +			}
>>  		}
>>  		nvme_dev_disable(dev, false);
>>  		return PCI_ERS_RESULT_NEED_RESET;
> 
> I get what you're trying to do, but it looks racy. The reset_work may
> finish before pci sets channel offline, or the error handling work
> happens to see RESETTING state, but then transitions to CONNECTING state
> after and deadlocks on the '.resume()' side. You are counting on a very
> specific sequence tied to the PCIe error handling module, and maybe you
> are able to count on that sequence for your platform in this unique
> scenario, but these link errors could happen anytime.
> 
I am not sure about the deadlock on the '.resume()' side that you mentioned above.
Did you mean that the deadlock occurs because someone is holding this pci_dev's device lock?
Or because the flush_work() called from nvme_error_resume() would never return?

If the reset_work finishes before PCI sets the channel offline, and nvme_timeout() then
detects the PCI channel offline, we would fall through the sequence of events below:
eeh_event_handler()
  ->nvme_error_detected() => sets ctrl state to RESETTING
    ->nvme_slot_reset()   => schedules reset_work
      ->nvme_error_resume() => waits until reset_work finishes
      "Device recovers"

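For reference, the nvme_error_resume() step above essentially just flushes the reset
work, which is why it blocks until reset_work completes. A simplified sketch of that
callback (paraphrased from drivers/nvme/host/pci.c, not a verbatim copy):

static void nvme_error_resume(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	/* Blocks until any queued/running reset_work has finished. */
	flush_work(&dev->ctrl.reset_work);
}
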
If the error handling work happens to see the RESETTING state (meaning that reset_work
has been scheduled or may have already started running), then the controller state can't
transition to CONNECTING. In this case, the reset_work would not finish, because it
would try accessing PCI iomem and that access would fail before the controller state is
set to CONNECTING: reset_work invokes nvme_dev_disable() and nvme_pci_enable() before
setting the controller state to CONNECTING, and both of these functions access PCI
iomem space.
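
As an illustration, a minimal sketch of how such a failed iomem access shows up and
could be detected; the helper name nvme_pci_reachable() is hypothetical and not part
of the driver:

/* Hypothetical helper, for illustration only. */
static bool nvme_pci_reachable(struct nvme_dev *dev)
{
	/* On a frozen EEH slot every MMIO read returns all ones. */
	if (readl(dev->bar + NVME_REG_CSTS) == ~0U)
		return false;

	/* The channel stays offline while PCI error recovery is running. */
	if (pci_channel_offline(to_pci_dev(dev->dev)))
		return false;

	return true;
}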

> And nvme subsystem reset is just odd, it's not clear how it was intended
> to be handled. It takes the links down so seems like it requires
> re-enumeration from a pcie hotplug driver, and that's kind of how it was
> expected to work here, but your platform has a special way to contain
> the link event and bring things back up the way they were before. And
> the fact you *require* IO to be in flight just so the timeout handler
> can dispatch a non-posted transaction 30 seconds later to trigger EEH is
> also odd. Why can't EEH just detect the link down event directly?
I think EEH could detect the link down event; however, the EEH recovery
mechanism piggy-backs on the PCI hotplug infrastructure so that device drivers
do not need to be specifically modified to support EEH recovery on the ppc
architecture. Hence EEH recovery is triggered when the driver tries to
access PCI IO memory space, and that starts the kernel's generic PCI error
recovery.
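
For context, that generic PCI error recovery drives the driver's pci_error_handlers
callbacks, roughly as registered in drivers/nvme/host/pci.c (abridged from memory,
not a verbatim copy):

static const struct pci_error_handlers nvme_err_handler = {
	.error_detected	= nvme_error_detected,
	.slot_reset	= nvme_slot_reset,
	.resume		= nvme_error_resume,
};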

> 
> This driver unfortunately doesn't handle errors during a reset well.
> Trying to handle that has been problematic, so the driver just bails if
> anything goes wrong at this critical initialization point. Maybe we need
> to address the reset/initialization failure handling more generically
> and delegate the teardown or retry decision to something else. Messing
> with that is pretty fragile right now, though.
> 
> Or you could just re-enumerate the slot.
Just another thought: how about serializing the reset_work and the error handling work?
A rough sketch of the idea is below.
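
This is only a very rough sketch, assuming a hypothetical dev->recovery_lock that does
not exist in the current driver: take the same mutex in the reset path and in the error
handling path so the two cannot interleave.

static void nvme_reset_work(struct work_struct *work)
{
	struct nvme_dev *dev =
		container_of(work, struct nvme_dev, ctrl.reset_work);

	mutex_lock(&dev->recovery_lock);	/* hypothetical lock */
	/* ... existing reset/initialization sequence ... */
	mutex_unlock(&dev->recovery_lock);
}

static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
					    pci_channel_state_t state)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	mutex_lock(&dev->recovery_lock);	/* hypothetical lock */
	/* ... existing frozen/normal/perm_failure handling ... */
	mutex_unlock(&dev->recovery_lock);

	/* Return value would depend on 'state', as it does today. */
	return PCI_ERS_RESULT_NEED_RESET;
}
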
> 
> I don't know, sorry my message is not really helping much to get this
> fixed.
Yeah, I know this is fragile right now, so it would require detailed
review and discussion.

Thanks,
--Nilay
