6.13/regression/bisected - new nvme timeout errors

Mikhail Gavrilov mikhail.v.gavrilov at gmail.com
Sun Mar 23 02:49:40 PDT 2025


On Thu, Mar 6, 2025 at 8:19 PM Keith Busch <kbusch at kernel.org> wrote:
>
> On Wed, Jan 15, 2025 at 02:58:04AM +0500, Mikhail Gavrilov wrote:
> > Hi,
> > During the 6.13 development cycle I spotted strange new nvme errors in
> > the log which I had never seen before.
> >
> > [87774.010474] nvme nvme1: I/O tag 0 (3000) opcode 0x1 (I/O Cmd) QID 1 timeout, aborting req_op:WRITE(1) size:131072
>
> ...
>
> > I still haven't found a stable way to reproduce this.
> > But I'm pretty sure that if this error doesn't appear within two
> > days, then we can assume that the kernel isn't affected by the
> > problem.
> > So I bisected with that assumption and found this commit:
> >
> > beadf0088501d9dcf2454b05d90d5d31ea3ba55f is the first bad commit
> > commit beadf0088501d9dcf2454b05d90d5d31ea3ba55f
> > Author: Christoph Hellwig <hch at lst.de>
> > Date:   Wed Nov 13 16:20:41 2024 +0100
> >
> >     nvme-pci: reverse request order in nvme_queue_rqs
>
> The patch here uses the order received to dispatch commands in
> consecutive submission queue entries, which is supposed to be the
> desired behavior for any device. I did some testing on mainline, and it
> sure looks like the order the driver does this is optimal, so I'm not
> sure what's going on with your observation.
>
> Do you have a scheduler enabled on your device?
>
> How are you generating IO? Is it a pattern I should be able to replicate
> with 'fio'?

Sorry for the long silence; I was busy testing an off-list patch that
Maurizio Lombardi sent me.

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 950289405ef2..990d9ef1eef9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -989,6 +989,9 @@ static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct rq_list *rqlist)
 {
        struct request *req;

+       if (rq_list_empty(rqlist))
+               return;
+
        spin_lock(&nvmeq->sq_lock);
        while ((req = rq_list_pop(rqlist))) {
                struct nvme_iod *iod = blk_mq_rq_to_pdu(req);

I want to say right away that this patch solved my problem. I
double-checked, and both times the system ran for 4 days without
errors like these:

nvme nvme0: I/O tag 960 (33c0) opcode 0x2 (I/O Cmd) QID 7 timeout, aborting req_op:READ(0) size:16384
nvme nvme0: I/O tag 961 (33c1) opcode 0x2 (I/O Cmd) QID 7 timeout, aborting req_op:READ(0) size:8192
nvme nvme0: Abort status: 0x0
* * *
nvme nvme0: I/O tag 960 (33c0) opcode 0x2 (I/O Cmd) QID 7 timeout, reset controller

and without the subsequent freezing of some applications.
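
For context, here is roughly how the patched function reads. The tail of
the function (nvme_sq_copy_cmd() and nvme_write_sq_db()) is paraphrased
from memory rather than copied from the source, so treat it as a sketch.
As I understand the change, an empty rqlist no longer takes sq_lock and
rings the submission queue doorbell with nothing new queued:

static void nvme_submit_cmds(struct nvme_queue *nvmeq, struct rq_list *rqlist)
{
        struct request *req;

        /* New early return: nothing to submit, leave the queue alone. */
        if (rq_list_empty(rqlist))
                return;

        spin_lock(&nvmeq->sq_lock);
        while ((req = rq_list_pop(rqlist))) {
                struct nvme_iod *iod = blk_mq_rq_to_pdu(req);

                /* Copy the prepared command into the next SQ entry. */
                nvme_sq_copy_cmd(nvmeq, &iod->cmd);
        }
        /* Ring the doorbell once for the whole batch. */
        nvme_write_sq_db(nvmeq, true);
        spin_unlock(&nvmeq->sq_lock);
}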

> Do you have a scheduler enabled on your device?
No, nvme0 is a server-tier 30 TB SSD.

$ head /sys/block/*/queue/scheduler
==> /sys/block/nvme0n1/queue/scheduler <==
[none] mq-deadline kyber bfq

==> /sys/block/nvme1n1/queue/scheduler <==
[none] mq-deadline kyber bfq

==> /sys/block/sda/queue/scheduler <==
none mq-deadline kyber [bfq]

==> /sys/block/sdb/queue/scheduler <==
none mq-deadline kyber [bfq]
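
In case you want the regression retried with a scheduler enabled, it can
be switched per device at runtime; the command below is only an
illustration, I have not rerun the test that way:

$ # example only: enable mq-deadline on nvme0n1
$ echo mq-deadline | sudo tee /sys/block/nvme0n1/queue/scheduler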

> How are you generating IO? Is it a pattern I should be able to replicate with 'fio'?

It's complicated, because it happens only when I update (podman pull) a
large (16 GB) container image; a rough fio approximation is sketched
below, after the podman details.

$ podman images
REPOSITORY                                           TAG             IMAGE ID      CREATED       SIZE
dev-image-store.my-company.com/local-stand/service  25.2100-latest  513139713b48  19 hours ago  16.5 GB
dev-image-store.my-company.com/local-stand/service  25.1220-latest  920708d4b3dc  9 days ago    16.8 GB
dev-image-store.my-company.com/local-stand/service  25.1220         67d2af91f323  10 days ago   16.8 GB
dev-image-store.my-company.com/local-stand/service  25.2100         24e6817bff52  3 weeks ago   16.4 GB

Podman stores its images on the 30TB BTRFS partition located in /home.
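
If it helps, a very rough fio approximation of that workload might look
like the job below. It is only a guess at the pattern (large buffered
sequential writes onto the btrfs filesystem, matching the 131072-byte
writes in the original timeout message), not a confirmed reproducer, and
the target directory is a placeholder:

$ fio --name=podman-pull-approx --directory=/home/fio-test \
      --rw=write --bs=128k --size=16g --numjobs=4 \
      --ioengine=io_uring --iodepth=32 --end_fsync=1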

-- 
Best Regards,
Mike Gavrilov.


