[Regression] Bug 216400 - Firmware activation starting AEN processing prevents further AER commands sent to the NVMe controller.

Sagi Grimberg sagi at grimberg.me
Tue Aug 30 00:03:14 PDT 2022


>> On 8/26/22 15:19, Thorsten Leemhuis wrote:
>>> Hi, this is your Linux kernel regression tracker.
>>>
>>> I noticed a regression report in bugzilla.kernel.org that afaics nobody
>>> acted upon since it was reported. That's why I decided to forward it by
>>> mail to those that afaics should handle this.
>>>
>>> To quote from https://bugzilla.kernel.org/show_bug.cgi?id=216400 :
>>>
>>>>    lixingyuan 2022-08-23 01:14:50 UTC
>>>>
>>>> This bug is related to these two commits:
>>>>
>>>> 1. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.0-rc2&id=4c75f877853cfa81b12374a07208e07b077f39b8
>>>>
>>>> This code sets the controller state to NVME_CTRL_RESETTING while handling the firmware activation starting AEN.
>>>>
>>>> 2. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.0-rc2&id=0fa0f99fc84e41057cbdd2efbfe91c6b2f47dd9d
>>>>
>>>> When submitting a new AER command to the controller, this code checks whether the controller state is NVME_CTRL_LIVE. This causes the problem: once the firmware activation starting AEN has been processed, the controller state is already NVME_CTRL_RESETTING, so no new AER commands are sent to the controller.
>>
>> I see.
>>
>> I can modify this code to check in the drivers instead of the core.
>>
>> Keith, does pci not risk submitting an async event on a freed admin
>> queue? If not, I can add a proper check there as well...
> 
> I don't think we'd attempt to issue an admin command while the queue is down,
> at least not in pci driver.
> 
> I think it should be sufficient to requeue the ctrl->async_event_work after the
> activation work completes, no?

I don't know, I don't have access to hw to test this at the moment...
Care to send a patch?
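
For anyone following along, here is a rough user-space model of the sequence
discussed above. It is not driver code and has not been checked against
hardware. The names (struct ctrl, the CTRL_* states, fw_activation_done() and
its requeue_aen_work flag) are made up for illustration; only the idea of
gating AER submission on the LIVE state and requeueing async_event_work after
activation comes from the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the controller states (model only). */
enum ctrl_state { CTRL_LIVE, CTRL_RESETTING };

struct ctrl {
	enum ctrl_state state;
	bool aer_outstanding;	/* an AER command is posted to the controller */
};

/* Models async_event_work: posts an AER only while the controller is LIVE. */
void async_event_work(struct ctrl *c)
{
	if (c->state != CTRL_LIVE)
		return;		/* AER is not sent: the reported regression */
	c->aer_outstanding = true;
}

/* Models handling of the "firmware activation starting" AEN. */
void handle_fw_activation_aen(struct ctrl *c)
{
	c->aer_outstanding = false;	/* the AEN consumed the posted AER */
	c->state = CTRL_RESETTING;	/* activation work is about to run */
	async_event_work(c);		/* re-arm attempt while RESETTING */
}

/* Models the end of the activation work; the flag is hypothetical. */
void fw_activation_done(struct ctrl *c, bool requeue_aen_work)
{
	c->state = CTRL_LIVE;
	if (requeue_aen_work)
		async_event_work(c);	/* the requeue suggested above */
}
```

In this model, calling fw_activation_done() with requeue_aen_work set to
false leaves aer_outstanding false even though the controller is LIVE again,
while passing true re-arms the AER once activation has finished.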


