[PATCH] nvme-pci: fix sleeping function called from interrupt context

Sagi Grimberg sagi at grimberg.me
Tue Dec 19 05:53:58 PST 2023



On 12/18/23 19:57, Keith Busch wrote:
> On Mon, Dec 18, 2023 at 10:43:45AM -0700, Keith Busch wrote:
>> On Mon, Dec 18, 2023 at 04:11:38PM +0100, Maurizio Lombardi wrote:
>>> p'a 15. 12. 2023 v 18:25 odes'ilatel Keith Busch <kbusch at kernel.org> napsal:
>>>> What async event occured to cause this? It looks like the only AEN
>>>> handling that cancels anything is the FW activation one, but that only
>>>> applies to fabrics, not pci. Still, that AEN hanlder sets the state to
>>>> "RESETTING", but we only cancel work when transitioning to "LIVE" from a
>>>> CONNECTING state.
>>>
>>> It looks like the stack trace is a bit imprecise.
>>> It's not the call to nvme_change_ctrl_state() that triggers the error
>>> because, as
>>> you correctly noticed, moving the controller to the resetting state
>>> doesn't block.
>>>
>>> The problem is the call to nvme_auth_stop().
>>>
>>>          case NVME_AER_NOTICE_FW_ACT_STARTING:
>>>                  /*
>>>                   * We are (ab)using the RESETTING state to prevent subsequent
>>>                   * recovery actions from interfering with the controller's
>>>                   * firmware activation.
>>>                   */
>>>                  if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING)) {
>>>                          nvme_auth_stop(ctrl);     <----------------
>>>                          requeue = false;
>>>                          queue_work(nvme_wq, &ctrl->fw_act_work);
>>>                  }
>>>
>>> void nvme_auth_stop(struct nvme_ctrl *ctrl)
>>> {
>>>          cancel_work_sync(&ctrl->dhchap_auth_work);
>>> }
>>
>> Oh, got it; thank you for clarifying.
>>
>> In that case, let's move the nvme_auth_stop() call to fw_act_work. The
>> AER handler doesn't depend on auth_work being sync'ed.
> 
> I also want to mention nvme_auth_stop() ought to be a no-op for PCI,
> right? The nvme_ctrl struct seems a bit overburdened at this point, I'll
> take a shot at splitting optional transport specifics out of it.

Exactly what I was thinking. However we should make
nvme_handle_aen_notice() allowed to be called from an interrupt
context even for fabrics, so we should move it out of there anyways.



More information about the Linux-nvme mailing list