[Regression] Bug 216400 - Firmware activation starting AEN processing prevents further AER commands sent to the NVMe controller.

Keith Busch kbusch at kernel.org
Mon Aug 29 09:29:26 PDT 2022


On Mon, Aug 29, 2022 at 12:14:21PM +0300, Sagi Grimberg wrote:
> 
> 
> On 8/26/22 15:19, Thorsten Leemhuis wrote:
> > Hi, this is your Linux kernel regression tracker.
> > 
> > I noticed a regression report in bugzilla.kernel.org that afaics nobody
> > acted upon since it was reported. That's why I decided to forward it by
> > mail to those that afaics should handle this.
> > 
> > To quote from https://bugzilla.kernel.org/show_bug.cgi?id=216400 :
> > 
> > >   lixingyuan 2022-08-23 01:14:50 UTC
> > > 
> > > This bug is related to these two commits:
> > > 
> > > 1. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.0-rc2&id=4c75f877853cfa81b12374a07208e07b077f39b8
> > > 
> > > This code sets the controller state to NVME_CTRL_RESETTING while handling the firmware activation starting AEN.
> > > 
> > > 2. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.0-rc2&id=0fa0f99fc84e41057cbdd2efbfe91c6b2f47dd9d
> > > 
> > > When submitting a new AER command to the controller, this code checks whether the controller state is NVME_CTRL_LIVE. This causes the problem: once the firmware activation starting AEN has been processed, the controller state is already NVME_CTRL_RESETTING, so no new AER commands are sent to the controller.
> 
> I see.
> 
> I can modify this code to check in the drivers instead of the core.
> 
> Keith, pci does not risk submitting an async event on a freed admin
> queue? If not, I can add a proper check there as well...

I don't think we'd attempt to issue an admin command while the queue is down,
at least not in the pci driver.

I think it should be sufficient to requeue the ctrl->async_event_work after the
activation work completes, no?
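
To make the sequence concrete, here is a minimal userspace sketch of the state
handling as described in the report, with the requeue idea as a switch. It is
not kernel code: every name in it (model_ctrl, model_async_event_work, and so
on) is hypothetical and only loosely mirrors the driver, and the single-threaded
call order stands in for the real workqueue scheduling.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the controller; these are not kernel types. */
enum ctrl_state { CTRL_LIVE, CTRL_RESETTING };

struct model_ctrl {
    enum ctrl_state state;
    bool aer_outstanding;   /* an AER command is pending at the controller */
    bool aen_work_pending;  /* stands in for ctrl->async_event_work being queued */
};

/* Models the async event work: submit an AER only if the controller is LIVE. */
static void model_async_event_work(struct model_ctrl *c)
{
    c->aen_work_pending = false;
    if (c->state != CTRL_LIVE)
        return;             /* state check from the report: no AER is submitted */
    c->aer_outstanding = true;
}

/* Models completion of a "firmware activation starting" AEN. */
static void model_fw_activation_aen(struct model_ctrl *c)
{
    c->aer_outstanding = false;   /* the completed AER has been consumed */
    c->state = CTRL_RESETTING;    /* state changes while handling the AEN */
    c->aen_work_pending = true;   /* the event work is queued again ... */
    model_async_event_work(c);    /* ... but runs while RESETTING: no new AER */
}

/* Models the end of the activation work; `requeue` toggles the proposed idea. */
static void model_fw_activation_done(struct model_ctrl *c, bool requeue)
{
    assert(c->state == CTRL_RESETTING);
    c->state = CTRL_LIVE;
    if (requeue) {
        c->aen_work_pending = true;
        model_async_event_work(c); /* now LIVE, so the AER is submitted again */
    }
}
```

Calling model_fw_activation_aen() and then model_fw_activation_done() with
requeue set to false leaves aer_outstanding false, which is the reported
regression. With requeue set to true, the AER is outstanding again once the
controller is back to LIVE.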


