Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5

Wed Nov 1 04:23:05 PDT 2023

On Wed, Nov 01, 2023 at 11:15:02AM +0100, Hannes Reinecke wrote:
> On 11/1/23 04:24, Ming Lei wrote:
> > On Wed, Nov 01, 2023 at 03:14:22AM +0100, Marek Marczykowski-Górecki wrote:
> > > On Wed, Nov 01, 2023 at 09:27:24AM +0800, Ming Lei wrote:
> > > > On Tue, Oct 31, 2023 at 11:42 PM Marek Marczykowski-Górecki
> > > > <marmarek at invisiblethingslab.com> wrote:
> > > > > 
> > > > > On Tue, Oct 31, 2023 at 03:01:36PM +0100, Jan Kara wrote:
> > > > > > On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> > > > > > > Then tried:
> > > > > > >   - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
> > > > > > >   - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
> > > > > > >   - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> > > > > > > 
> > > > > > > I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
> > > > > > > and I can't reproduce the issue there. I'm confused...
> > > > > > 
> > > > > > And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
> > > > > > causing hangs is most likely just a coincidence. Rather something either in
> > > > > > the block layer or in the storage driver has problems with handling bios
> > > > > > with sufficiently high order pages attached. This is going to be a bit
> > > > > > painful to debug I'm afraid. How long does it take for you trigger the
> > > > > > hang? I'm asking to get rough estimate how heavy tracing we can afford so
> > > > > > that we don't overwhelm the system...
> > > > > 
> > > > > Sometimes it freezes just after logging in, but in worst case it takes
> > > > > me about 10min of more or less `tar xz` + `dd`.
> > > > 
> > > > blk-mq debugfs is usually helpful for hang issue in block layer or
> > > > underlying drivers:
> > > > 
> > > > (cd /sys/kernel/debug/block && find . -type f -exec grep -aH . {} \;)
> > > > 
> > > > BTW,  you can just collect logs of the exact disks if you know what
> > > > are behind dm-crypt,
> > > > which can be figured out by `lsblk`, and it has to be collected after
> > > > the hang is triggered.
> > > 
> > > dm-crypt lives on the nvme disk, this is what I collected when it
> > > hanged:
> > > 
> > ...
> > > nvme0n1/hctx4/cpu4/default_rq_list:000000000d41998f {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=65, .internal_tag=-1}
> > > nvme0n1/hctx4/cpu4/default_rq_list:00000000d0d04ed2 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT, .state=idle, .tag=70, .internal_tag=-1}
> > 
> > Two requests stays in sw queue, but not related with this issue.
> > 
> > > nvme0n1/hctx4/type:default
> > > nvme0n1/hctx4/dispatch_busy:9
> > 
> > non-zero dispatch_busy means BLK_STS_RESOURCE is returned from
> > nvme_queue_rq() recently and mostly.
> > 
> > > nvme0n1/hctx4/active:0
> > > nvme0n1/hctx4/run:20290468
> > 
> > ...
> > 
> > > nvme0n1/hctx4/tags:nr_tags=1023
> > > nvme0n1/hctx4/tags:nr_reserved_tags=0
> > > nvme0n1/hctx4/tags:active_queues=0
> > > nvme0n1/hctx4/tags:bitmap_tags:
> > > nvme0n1/hctx4/tags:depth=1023
> > > nvme0n1/hctx4/tags:busy=3
> > 
> > Just three requests in-flight, two are in sw queue, another is in hctx->dispatch.
> > 
> > ...
> > 
> > > nvme0n1/hctx4/dispatch:00000000b335fa89 {.op=WRITE, .cmd_flags=NOMERGE, .rq_flags=DONTPREP|IO_STAT, .state=idle, .tag=78, .internal_tag=-1}
> > > nvme0n1/hctx4/flags:alloc_policy=FIFO SHOULD_MERGE
> > > nvme0n1/hctx4/state:SCHED_RESTART
> > 
> > The request staying in hctx->dispatch can't move on, and nvme_queue_rq()
> > returns -BLK_STS_RESOURCE constantly, and you can verify with
> > the following bpftrace when the hang is triggered:
> > 
> > 	bpftrace -e 'kretfunc:nvme_queue_rq  { @[retval, kstack]=count() }'
> > 
> > It is very likely that memory allocation inside nvme_queue_rq()
> > can't be done successfully, then blk-mq just have to retry by calling
> > nvme_queue_rq() on the above request.
> > 
> And that is something I've been wondering (for quite some time now):
> What _is_ the appropriate error handling for -ENOMEM?

It is just my guess.

Actually it shouldn't fail since the sgl allocation is backed with
memory pool, but there is also dma pool allocation and dma mapping.

> At this time, we assume it to be a retryable error and re-run the queue
> in the hope that things will sort itself out.

It should not be hard to figure out why nvme_queue_rq() can't move on.

> But if they don't we're stuck.
> Can we somehow figure out if we make progress during submission, and (at
> least) issue a warning once we detect a stall?

It needs counting on request retry, and people often hate to add something
to request or bio in fast path. Also this kind of issue is easy to show
in blk-mq debugfs or bpftrace.

Thanks, 
Ming