[PATCH] nvme-pci: fix stuck reset on concurrent DPC and HP
Nilay Shroff
nilay at linux.ibm.com
Fri Mar 7 23:27:50 PST 2025
On 3/7/25 8:54 PM, Keith Busch wrote:
> On Fri, Mar 07, 2025 at 06:28:28PM +0530, Nilay Shroff wrote:
>> Though one question: IMO, the DPC error handler shall invoke nvme_error_detected() prior
>> to nvme_error_resume(). And we already disable the device (and cancel in-flight IO) in
>> nvme_error_detected() and so wouldn't that help?
>
> The sequence is error_detected, slot_reset, error_resume.
>
> The slot_reset schedules the nvme controller reset. That work sends
> admin IO, like identify controller.
>
> If the pciehp removal starts after reset work's controller
> initialization, then nothing stops the work from sending new admin
> commands, and nothing will complete them. This causes the error_resume
> to wait for something that will never happen.
OK, that makes sense. This appears to be a tight race condition and may not be
limited to one platform; it should be possible even on PPC.
Thanks,
--Nilay