[bug report] block/005 hangs with NVMe device and linux-block/for-next

Ming Lei ming.lei at redhat.com
Tue Nov 2 03:48:30 PDT 2021


On Tue, Nov 02, 2021 at 09:02:47AM +0000, Shinichiro Kawasaki wrote:
> Let me add linux-nvme, Keith and Christoph to the CC list.
> 
> -- 
> Best Regards,
> Shin'ichiro Kawasaki
> 
> 
> On Nov 02, 2021 / 17:28, Shin'ichiro Kawasaki wrote:
> > On Nov 02, 2021 / 11:44, Ming Lei wrote:
> > > On Tue, Nov 02, 2021 at 02:22:15AM +0000, Shinichiro Kawasaki wrote:
> > > > On Nov 01, 2021 / 17:01, Jens Axboe wrote:
> > > > > On 11/1/21 6:41 AM, Jens Axboe wrote:
> > > > > > On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
> > > > > >> I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f) and
> > > > > >> observed a process hang during a blktests block/005 run on an NVMe device.
> > > > > >> The kernel message reported "INFO: task check:1224 blocked for more than 122
> > > > > >> seconds." with the call trace [1]. So far, the hang is 100% reproducible on my
> > > > > >> system. This hang is not observed with HDDs or null_blk devices.
> > > > > >>
> > > > > >> I bisected and found the commit 4f5022453acd ("nvme: wire up completion batching
> > > > > >> for the IRQ path") triggers the hang. When I revert this commit from the
> > > > > >> for-next branch tip, the hang disappears. The block/005 test case switches the IO
> > > > > >> scheduler during IO, and the completion path change made by the commit appears to
> > > > > >> affect the scheduler switch. Comments toward a solution would be appreciated.
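[Editorial note: the pattern block/005 exercises can be approximated outside blktests with a
loop that rewrites the scheduler sysfs attribute while fio drives direct IO against the device.
This is only a rough sketch under assumptions; the device name, scheduler list and fio
parameters below are illustrative, not the exact blktests invocation:

#!/bin/bash
# Rough approximation of the block/005 pattern: cycle the IO scheduler
# while direct IO is in flight. Device name and runtime are assumptions.
dev=nvme0n1

# Direct random reads in the background, similar in spirit to the fio
# workload the test case starts.
fio --name=repro --filename=/dev/$dev --direct=1 --rw=randread \
    --bs=4k --iodepth=32 --numjobs=4 --time_based --runtime=30 \
    --group_reporting &
fio_pid=$!

# Switch between the available schedulers while the IO keeps running.
while kill -0 "$fio_pid" 2>/dev/null; do
    for sched in none mq-deadline kyber; do
        echo "$sched" > /sys/block/"$dev"/queue/scheduler
        sleep 1
    done
done
wait "$fio_pid"
]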
> > > > > > 
> > > > > > I'll take a look at this.
> > > > > 
> > > > > I've tried running various things most of the day, and I cannot
> > > > > reproduce this issue nor do I see what it could be. Even if requests are
> > > > > split between batched completion and one-by-one completion, it works
> > > > > just fine for me. No special care needs to be taken for put_many() on
> > > > > the queue reference, as the wake_up() happens for the ref going to zero.
> > > > > 
> > > > > Tell me more about your setup. What do the runtimes of the test look
> > > > > like? Do you have all schedulers enabled? What kind of NVMe device is
> > > > > this?
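[Editorial note: the requested details can usually be gathered directly, assuming nvme-cli is
installed and blktests is checked out; the device name below is an assumption:

# Identify the controller, model, firmware and namespaces of the test device.
nvme list

# Time one run of the test case from the blktests checkout.
time sudo ./check block/005
]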
> > > > 
> > > > Thank you for spending your time on this. With a kernel that does not hang,
> > > > the test case completes in around 20 seconds. When the hang happens, the check
> > > > script process stops in blk_mq_freeze_queue_wait() at the scheduler change, and
> > > > the fio workload processes stop in __blkdev_direct_IO_simple(). The test case
> > > > never ends, so I need to reboot the system before the next trial. While waiting
> > > > for the test case to complete, the kernel repeats the same INFO message every 2 minutes.
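[Editorial note: when it is stuck like that, the blocked stacks can be captured on demand
rather than waiting for the next hung-task report; a minimal sketch, assuming root and the
usual debug options (magic SysRq, stack traces) are enabled:

# The 2-minute interval matches the default hung task timeout.
cat /proc/sys/kernel/hung_task_timeout_secs

# Dump every task in uninterruptible sleep (check, fio, ...) to dmesg.
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# Or look at one stuck process directly, e.g. the check script.
cat /proc/"$(pgrep -o -x check)"/stack
]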
> > > > 
> > > > Regarding the scheduler, I compiled the kernel with mq-deadline and kyber.
> > > > 
> > > > The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned namespace and
> > > > a regular namespace, and the hang is observed with both namespaces. I have
> > > > not yet tried other NVMe devices, so I will try them.
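[Editorial note: the namespace type and the schedulers actually selectable on it are visible
from sysfs, which may help correlate the two namespaces; the device names and the config file
path below are assumptions:

# Zoned namespaces report host-managed here, regular namespaces report none.
cat /sys/block/nvme0n1/queue/zoned

# Schedulers selectable for this queue; the active one is in brackets.
cat /sys/block/nvme0n1/queue/scheduler

# Confirm which schedulers were built for the running kernel.
grep -E 'CONFIG_MQ_IOSCHED_(DEADLINE|KYBER)|CONFIG_IOSCHED_BFQ' /boot/config-"$(uname -r)"
]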
> > > > 
> > > > > 
> > > > > FWIW, this is upstream now, so testing with Linus -git would be
> > > > > preferable.
> > > > 
> > > > I see. I have switched from the linux-block for-next branch to Linus' upstream
> > > > tree. At git hash 879dbe9ffebc, the hang is still observed.
> > > 
> > > Can you post the blk-mq debugfs log after the hang is triggered?
> > > 
> > > (cd /sys/kernel/debug/block/nvme0n1 && find . -type f -exec grep -aH . {} \;)
> > 
> > Thanks Ming. When I ran the command above, the grep command stopped when it
> > opened tag-related files in the debugfs tree. That grep command appeared to be
> > hanging as well, so I used the find command below instead to exclude the tag-related files.
> > 
> > # find . -type f -not -name '*tag*' -exec grep -aH . {} \;
> > 
> > Here I share the captured log.
> > 
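[Editorial note: in a log captured that way, the state files are probably the most interesting
pieces for this kind of hang; assuming the standard blk-mq debugfs layout, something like the
following pulls them out without touching the tag files that hung:

cd /sys/kernel/debug/block/nvme0n1

# Queue-level flags, e.g. whether the queue is being frozen/quiesced.
cat state

# Per-hctx state and any requests parked on the dispatch list;
# SCHED_RESTART set here while nothing ever drains would point at a
# missed restart after the batched completions.
grep -aH . hctx*/state hctx*/dispatch 2>/dev/null
]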

It is a bit odd, since batched completion shouldn't be triggered when an I/O
scheduler is in use, but blk_mq_end_request_batch() does not restart the hctx,
so maybe you can try the following patch:


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 07eb1412760b..4c0c9af9235e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -846,16 +846,20 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
 		rq_qos_done(rq->q, rq);
 
 		if (nr_tags == TAG_COMP_BATCH || cur_hctx != rq->mq_hctx) {
-			if (cur_hctx)
+			if (cur_hctx) {
 				blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
+				blk_mq_sched_restart(cur_hctx);
+			}
 			nr_tags = 0;
 			cur_hctx = rq->mq_hctx;
 		}
 		tags[nr_tags++] = rq->tag;
 	}
 
-	if (nr_tags)
+	if (nr_tags) {
 		blk_mq_flush_tag_batch(cur_hctx, tags, nr_tags);
+		blk_mq_sched_restart(cur_hctx);
+	}
 }
 EXPORT_SYMBOL_GPL(blk_mq_end_request_batch);
 

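[Editorial note: one way to give the patch a try, assuming the diff above is saved as
batch-restart.patch in a tree that reproduces the hang:

# Apply on top of the tree that shows the hang, rebuild and boot it.
git apply batch-restart.patch
make -j"$(nproc)" && sudo make modules_install install

# Then re-run the failing case from the blktests checkout.
sudo ./check block/005
]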

-- 
Ming



