[PATCH] nvme-pci: fix resume after AER recovery

Keith Busch kbusch at kernel.org
Mon Jan 30 10:43:28 PST 2023


On Mon, Jan 30, 2023 at 11:35:36AM -0700, Keith Busch wrote:
> On Mon, Jan 30, 2023 at 11:14:49AM +0100, Christoph Hellwig wrote:
> > All I/O on an nvme controller hangs after injecting a malformed TLP error
> > using aer-inject with an error file like:
> > 
> > --- snip ---
> > AER
> > PCI_ID WWWW:XX.YY.Z
> > UNCOR_STATUS COMP_TIME
> > HEADER_LOG 0 1 2 3
> > --- snip ---
> > 
> > This is because in this case the ->resume method will be called after
> > ->error_detected and not ->slot_reset, leaving the controller in disabled
> > state and the queue frozen.  Fix this by doing a controller reset to
> > resume as well.
> 
> Why isn't slot_reset being called after error_detected? Driver should be
> returning PCI_ERS_RESULT_NEED_RESET, which should have the pcie error
> handling always invoke the slot_reset callback.

Are you using an older kernel that doesn't have

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=387c72cdd7fb6bef650fb078d0f6ae9682abf631

?
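
For reference, the callback ordering discussed above can be sketched as a
small user-space model. This is only an illustration, not the code in
drivers/pci/pcie/err.c: the enum values mirror the kernel's pci_ers_result_t
names, while model_recovery() and trace_add() are hypothetical helpers, and
the model assumes ->mmio_enabled reports the device as recovered.

```c
#include <string.h>

/* Illustrative stand-ins for the kernel's pci_ers_result_t values. */
enum ers_result {
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
};

/* Hypothetical helper: append a callback name to a space-separated trace. */
static void trace_add(char *trace, size_t len, const char *name)
{
	if (trace[0] != '\0')
		strncat(trace, " ", len - strlen(trace) - 1);
	strncat(trace, name, len - strlen(trace) - 1);
}

/*
 * Simplified model of which driver callbacks run, given the value the
 * driver returned from ->error_detected.  Assumption: ->mmio_enabled
 * succeeds, so the CAN_RECOVER path never escalates to a reset.
 */
static void model_recovery(enum ers_result detected, char *trace, size_t len)
{
	trace[0] = '\0';
	trace_add(trace, len, "error_detected");

	switch (detected) {
	case ERS_DISCONNECT:
		return;			/* recovery abandoned, no ->resume */
	case ERS_CAN_RECOVER:
		trace_add(trace, len, "mmio_enabled");
		break;			/* no reset, so no ->slot_reset */
	case ERS_NEED_RESET:
		trace_add(trace, len, "slot_reset");
		break;
	case ERS_RECOVERED:
		break;
	}
	trace_add(trace, len, "resume");
}
```

In this model only the ERS_NEED_RESET case reaches slot_reset before resume;
the ERS_CAN_RECOVER case goes to resume without any reset in between.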
