[PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset

Nilay Shroff nilay at linux.ibm.com
Sat Mar 9 11:05:06 PST 2024



On 3/9/24 21:14, Keith Busch wrote:
> On Sat, Mar 09, 2024 at 07:59:11PM +0530, Nilay Shroff wrote:
>> On 3/8/24 21:11, Keith Busch wrote:
>>> On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
>>>> @@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work)
>>>>   out_unlock:
>>>>  	mutex_unlock(&dev->shutdown_lock);
>>>>   out:
>>>> +	/*
>>>> +	 * If PCI recovery is ongoing then let it finish first
>>>> +	 */
>>>> +	if (pci_channel_offline(to_pci_dev(dev->dev))) {
>>>> +		dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
>>>> +		return;
>>>> +	}
>>>> +
>>>>  	/*
>>>>  	 * Set state to deleting now to avoid blocking nvme_wait_reset(), which
>>>>  	 * may be holding this pci_dev's device lock.
>>>> @@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>>>>  	case pci_channel_io_frozen:
>>>>  		dev_warn(dev->ctrl.device,
>>>>  			"frozen state error detected, reset controller\n");
>>>> -		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>>>> -			nvme_dev_disable(dev, true);
>>>> -			return PCI_ERS_RESULT_DISCONNECT;
>>>> +		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
>>>> +			if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>>>> +				nvme_dev_disable(dev, true);
>>>> +				return PCI_ERS_RESULT_DISCONNECT;
>>>> +			}
>>>>  		}
>>>>  		nvme_dev_disable(dev, false);
>>>>  		return PCI_ERS_RESULT_NEED_RESET;
>>>
>>> I get what you're trying to do, but it looks racy. The reset_work may
>>> finish before pci sets channel offline, or the error handling work
>>> happens to see RESETTING state, but then transitions to CONNECTING state
>>> after and deadlocks on the '.resume()' side. You are counting on a very
>>> specific sequence tied to the PCIe error handling module, and maybe you
>>> are able to count on that sequence for your platform in this unique
>>> scenario, but these link errors could happen anytime.
>>>
>> I am not sure about the deadlock on the '.resume()' side you mentioned above.
>> Did you mean that the deadlock occurs because someone is holding this pci_dev's
>> device lock? Or because the flush_work() from nvme_error_resume() would never
>> return?
> 
> Your patch may observe a ctrl in "RESETTING" state from
> error_detected(), then disable the controller, which quiesces the admin
> queue. Meanwhile, reset_work may proceed to CONNECTING state and try
> nvme_submit_sync_cmd(), which blocks forever because no one is going to
> unquiesce that admin queue.
> 
OK, I think I got your point. However, it seems that even without my patch
the above-mentioned deadlock is still possible.
Without my patch, if error_detected() observes a ctrl in "RESETTING" state, it
still invokes nvme_dev_disable(). The only difference with my patch is that
error_detected() returns PCI_ERS_RESULT_NEED_RESET instead of PCI_ERS_RESULT_DISCONNECT.
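For reference, the pre-patch frozen-state handling (reconstructed from the hunk
quoted above) is roughly the following, so nvme_dev_disable() runs on this path
regardless of whether the state change succeeds:

	case pci_channel_io_frozen:
		dev_warn(dev->ctrl.device,
			"frozen state error detected, reset controller\n");
		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
			nvme_dev_disable(dev, true);
			return PCI_ERS_RESULT_DISCONNECT;
		}
		nvme_dev_disable(dev, false);
		return PCI_ERS_RESULT_NEED_RESET;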

Regarding the deadlock, it appears to me that reset_work races with nvme_dev_disable(),
and we may want to extend the shutdown_lock in reset_work so that nvme_dev_disable()
can't interfere with the admin queue while reset_work is accessing it. I think we can
fix this case. I will send PATCH v2 with this fix for review; however, please let me
know if you have any other concerns before I spin a new patch.
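Just to make the idea concrete, here is a rough sketch only (not the actual
nvme_reset_work(); the exact placement of the lock will be in the v2 patch) of
keeping shutdown_lock held across the admin-command phase of reset_work, so a
concurrent nvme_dev_disable() (which also takes shutdown_lock) has to wait
instead of quiescing the admin queue in the middle of initialization:

	/* sketch only -- illustrates extending the shutdown_lock coverage */
	mutex_lock(&dev->shutdown_lock);
	result = nvme_pci_enable(dev);
	if (result)
		goto out_unlock;
	nvme_unquiesce_admin_queue(&dev->ctrl);

	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING)) {
		result = -EBUSY;
		goto out_unlock;
	}

	/* admin commands (e.g. Identify) now run with the lock still held */
	result = nvme_init_ctrl_finish(&dev->ctrl, was_suspend);
	if (result)
		goto out_unlock;

	mutex_unlock(&dev->shutdown_lock);
	/* ... rest of the reset path (IO queue setup, LIVE transition) ... */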

Thanks,
--Nilay