[PATCH] nvme-pci: check for valid request when polling for completions

Daniel Wagner dwagner at suse.de
Mon Apr 28 06:38:16 PDT 2025


Hi Keith,

On Tue, Sep 03, 2024 at 02:07:29PM -0600, Keith Busch wrote:
> On Tue, Sep 03, 2024 at 09:14:27AM -0600, Keith Busch wrote:
> > On Tue, Sep 03, 2024 at 08:25:08AM +0200, Hannes Reinecke wrote:
> > > On 9/2/24 19:04, Sagi Grimberg wrote:
> > > > On 02/09/2024 16:07, Hannes Reinecke wrote:
> > > > > When polling for completions from the timeout handler we traverse
> > > > > over _all_ cqes, and the fetching the request via blk_mq_tag_to_rq().
> > > > > Unfortunately that function will always return a request, even if
> > > > > that request is already completed.
> > > > > So we need to check if the command is still in flight before
> > > > > attempting to complete it.
> > 
> > So the very same command was completed in some other context? We've
> > disabled the queue's interrupt here, there should be no other context
> > that can concurrently complete it. The timeout poll check is supposed to
> > check only unseen cqes, not "all" of them. Is disable_irq() not a
> > sufficient barrier for accessing the cq head or something?
> 
> Ooo, I think I see a problem. Does your device have more than one
> namespace? I think we need to lock this queue for that condition because
> the timeout work executes per-namespace, and we're not locking that
> today. If you do have a muliti-namespace controller, does the below fix
> your observation? 
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 6cd9395ba9ec3..2c73ccd21afe3 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1109,9 +1109,11 @@ static void nvme_poll_irqdisable(struct nvme_queue *nvmeq)
>  
>  	WARN_ON_ONCE(test_bit(NVMEQ_POLLED, &nvmeq->flags));
>  
> +	spin_lock(&nvmeq->cq_poll_lock);
>  	disable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
>  	nvme_poll_cq(nvmeq, NULL);
>  	enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
> +	spin_unlock(&nvmeq->cq_poll_lock);
>  }

Ah sorry, this one got lost in my todo list.

We got positive feedback from our customer. This indeed fixes the
problem. Are you going to send out a proper patch, or do you want me to
do it?

Thanks,
Daniel



More information about the Linux-nvme mailing list