[Regression] Bug 216400 - Firmware activation starting AEN processing prevents further AER commands sent to the NVMe controller.
Keith Busch
kbusch at kernel.org
Mon Aug 29 09:29:26 PDT 2022
On Mon, Aug 29, 2022 at 12:14:21PM +0300, Sagi Grimberg wrote:
>
>
> On 8/26/22 15:19, Thorsten Leemhuis wrote:
> > Hi, this is your Linux kernel regression tracker.
> >
> > I noticed a regression report in bugzilla.kernel.org that afaics nobody
> > acted upon since it was reported. That's why I decided to forward it by
> > mail to those that afaics should handle this.
> >
> > To quote from https://bugzilla.kernel.org/show_bug.cgi?id=216400 :
> >
> > > lixingyuan 2022-08-23 01:14:50 UTC
> > >
> > > This bug is related to these two commits:
> > >
> > > 1. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.0-rc2&id=4c75f877853cfa81b12374a07208e07b077f39b8
> > >
> > > > This code sets the controller state to NVME_CTRL_RESETTING while handling the firmware activation starting AEN.
> > >
> > > 2. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.0-rc2&id=0fa0f99fc84e41057cbdd2efbfe91c6b2f47dd9d
> > >
> > > > When submitting a new AER command to the controller, this code checks whether the controller state is NVME_CTRL_LIVE. This causes the problem: once the firmware activation starting AEN has been processed, the controller state is already NVME_CTRL_RESETTING, so no new AER commands are sent to the controller.
>
> I see.
>
> I can modify this code to check in the drivers instead of the core.
>
> Keith, does pci not risk submitting an async event on a freed admin
> queue? If not, I can add a proper check there as well...
I don't think we'd attempt to issue an admin command while the queue is down,
at least not in the pci driver.
I think it should be sufficient to requeue the ctrl->async_event_work after the
activation work completes, no?
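
To make the ordering concrete, here is a rough user-space sketch in C of the
sequence described in the report and of the requeue idea. It is only a model,
not kernel code: the model_* names, the struct, and its boolean flags are
hypothetical. Only the LIVE/RESETTING states and the idea of requeueing
ctrl->async_event_work come from this thread, and the sketch assumes the core
re-arms the AER after each completed event.

```c
#include <stdbool.h>

/* Hypothetical user-space model of the controller; not the kernel's struct. */
enum model_state { MODEL_CTRL_LIVE, MODEL_CTRL_RESETTING };

struct model_ctrl {
    enum model_state state;
    bool aer_outstanding;     /* an AER command is pending at the controller */
    bool async_event_queued;  /* stands in for ctrl->async_event_work */
    bool fw_act_queued;       /* stands in for the firmware activation work */
};

/* Models the async event work: submit a new AER only while LIVE. */
static void model_async_event_work(struct model_ctrl *c)
{
    c->async_event_queued = false;
    if (c->state != MODEL_CTRL_LIVE)
        return;               /* request dropped: the reported regression */
    c->aer_outstanding = true;
}

/* Models handling of the "firmware activation starting" AEN. */
static void model_handle_fw_act_starting(struct model_ctrl *c)
{
    c->aer_outstanding = false;        /* the AER completed to deliver the event */
    c->state = MODEL_CTRL_RESETTING;   /* state change described in the report */
    c->fw_act_queued = true;
    c->async_event_queued = true;      /* assumed: AER is re-armed per event */
}

/* Models the activation work; requeue_aer selects the suggested behaviour. */
static void model_fw_act_work(struct model_ctrl *c, bool requeue_aer)
{
    c->fw_act_queued = false;
    c->state = MODEL_CTRL_LIVE;        /* activation finished */
    if (requeue_aer)
        c->async_event_queued = true;  /* requeue after activation completes */
}

/* Runs one activation cycle; returns whether an AER is outstanding afterwards. */
static bool model_run(struct model_ctrl *c, bool requeue_aer)
{
    model_handle_fw_act_starting(c);
    if (c->async_event_queued)
        model_async_event_work(c);     /* runs while RESETTING, so it bails out */
    if (c->fw_act_queued)
        model_fw_act_work(c, requeue_aer);
    if (c->async_event_queued)
        model_async_event_work(c);     /* only queued when requeue_aer is true */
    return c->aer_outstanding;
}
```

In this model, model_run() with requeue_aer set to false ends with no AER
outstanding once activation finishes, while passing true re-arms the AER after
the controller is back in the LIVE state.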