[PATCH] nvme-pci: fix stuck reset on concurrent DPC and HP
Keith Busch
kbusch at kernel.org
Fri Mar 7 07:24:48 PST 2025
On Fri, Mar 07, 2025 at 06:28:28PM +0530, Nilay Shroff wrote:
> Though one question: IMO, the DPC error handler shall invoke nvme_error_detected() prior
> to nvme_error_resume(). And we already disable the device (and cancel in-flight IO) in
> nvme_error_detected() and so wouldn't that help?
The sequence is error_detected, slot_reset, then error_resume.
The slot_reset schedules the nvme controller reset. That reset work
sends admin IO, such as Identify Controller.
If the pciehp removal starts after the reset work's controller
initialization, then nothing stops the work from sending new admin
commands, and nothing will complete them. The error_resume then ends
up waiting for something that will never happen.