[PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled

Tue Aug 13 05:17:59 PDT 2024

On Fri, Aug 09, 2024 at 10:53:16PM GMT, Ming Lei wrote:
> On Fri, Aug 09, 2024 at 09:22:11AM +0200, Daniel Wagner wrote:
> > On Thu, Aug 08, 2024 at 01:26:41PM GMT, Ming Lei wrote:
> > > Isolated CPUs are removed from queue mapping in this patchset, when someone
> > > submit IOs from the isolated CPU, what is the correct hctx used for handling
> > > these IOs?
> > 
> > No, every possible CPU gets a mapping. What this patch series does, is
> > to limit/aligns the number of hardware context to the number of
> > housekeeping CPUs. There is still a complete ctx-hctc mapping. So
> 
> OK, then I guess patch 1~7 aren't supposed to belong to this series,
> cause you just want to reduce nr_hw_queues, meantime spread
> house-keeping CPUs first for avoiding queues with all isolated cpu
> mask.

I tried to explain the reason for these patches in the cover letter. The
idea here is that it makes the later changes simpler, because we only
have to touch one place. Furthermore, the caller just needs to provide
an affinity mask; the rest of the code is then generic. This makes it
possible to replace the open-coded mapping code in hisi, for example.
Overall, I think the resulting code is nicer and cleaner.

> OK, Looks I missed the point in patch 15 in which you added isolated cpu
> into mapping manually, just wondering why not take the current two-stage
> policy to cover both house-keeping and isolated CPUs in
> group_cpus_evenly()?

Patch #15 explains why this approach didn't work in the current form.
blk_mq_map_queues will map all isolated CPUs to the first hctx.

> Such as spread house-keeping CPUs first, then isolated CPUs, just like
> what we did for present & non-present cpus.

I've experimented with this approach and it didn't work (see above).

> When blk_mq_hctx_notify_offline() is running, the current CPU isn't
> offline yet, and the hctx is active, same with the managed irq, so it is fine
> to wait until all in-flight IOs originated from this hctx completed
> there.

But if for some reason these never complete (as in my case),
this blocks forever. Wouldn't it make sense to abort the wait after a
while?

> The reason is why these requests can't be completed? And the forward
> progress is provided by blk-mq. And these requests are very likely
> allocated & submitted from CPU6.

Yes, I can confirm that the in-flight requests have been allocated and
submitted by the CPU that is being offlined.

Here is a log snippet from a different debug session. CPU 1 and 2 are
already offline, and CPU 3 is being offlined. The CPU mapping for hctx1 is

        hctx1: default 1 3

I've added a printk to my hack timeout handler:

 blk_mq_hctx_notify_offline:3600 hctx 1 force timeout request
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3
 blk_mq_hctx_timeout_rq:3556 state 1 rq cpu 3

That means these requests have been allocated on CPU 3 and are still
marked as in flight. As a next step, I am trying to figure out why they
are not completed.

> Can you figure out what is effective mask for irq of hctx2?  It is
> supposed to be cpu6. And block debugfs for vda should provide helpful
> hint.

The effective mask for the above debug output is

queue mapping for /dev/vda
        hctx0: default 0 2
        hctx1: default 1 3
        hctx2: default 4 6
        hctx3: default 5 7

PCI name is 00:02.0: vda
        irq 27 affinity 0-1 effective 0  virtio0-config
        irq 28 affinity 0 effective 0  virtio0-req.0
        irq 29 affinity 1 effective 1  virtio0-req.1
        irq 30 affinity 4 effective 4  virtio0-req.2
        irq 31 affinity 5 effective 5  virtio0-req.3

Maybe there is still something off with qemu and the IRQ routing, and the
interrupts have been delivered to the wrong CPU.
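
One mechanical way to cross-check the two tables above is sketched below
in plain userspace C. It only encodes the numbers from the debug output;
struct hctx_info, vda_table and hctx_stranded() are hypothetical names
made up for this example, not kernel symbols. It also assumes that the
IRQ listing is still valid after CPU 1 and 2 went offline, which the
output above does not tell us, so treat the result as a hint and not as
a diagnosis.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One row per hctx, copied from the debugfs/IRQ output in this mail. */
struct hctx_info {
	int hctx;
	unsigned int cpu_mask;	/* bit N set: CPU N is mapped to this hctx */
	int irq;
	int irq_effective_cpu;	/* "effective" column of the IRQ listing */
};

const struct hctx_info vda_table[4] = {
	{ 0, (1u << 0) | (1u << 2), 28, 0 },
	{ 1, (1u << 1) | (1u << 3), 29, 1 },
	{ 2, (1u << 4) | (1u << 6), 30, 4 },
	{ 3, (1u << 5) | (1u << 7), 31, 5 },
};

/*
 * Return true when the hctx still has an online CPU that can submit IO
 * while the CPU listed as the effective IRQ target is offline.
 * online_mask uses the same bit layout as cpu_mask.
 */
bool hctx_stranded(const struct hctx_info *h, unsigned int online_mask)
{
	assert(h != NULL);

	bool has_online_submitter = (h->cpu_mask & online_mask) != 0;
	bool irq_target_online =
		(online_mask & (1u << h->irq_effective_cpu)) != 0;

	return has_online_submitter && !irq_target_online;
}
```

With CPUs 0-7 and CPU 1 and 2 marked offline, only the hctx1 row is
flagged: CPU 3 can still submit, but irq 29 lists CPU 1 as its effective
target.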

> > going offline have already been shutdown, thus no progress?) and
> > blk_mq_hctx_notifiy_offline isn't doing anything in this scenario.
> 
> RH has internal cpu hotplug stress test, but not see such report so
> far.

Is this stress test running on real hardware? If so, it adds to my
theory that the interrupt might be lost in certain situations when
running under qemu.

> > Couldn't we do something like:
> 
> I usually won't thinking about any solution until root-cause is figured
> out, :-)

I agree, though sometimes it is also okay to have some defensive
programming in place, such as an upper limit on the wait before giving up.
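
As a rough illustration of what such an upper limit could look like,
here is a minimal userspace sketch in C. It is not the kernel code:
wait_drained_bounded(), busy_fn and the two sample callbacks are
hypothetical names made up for this example, and a simple poll count
stands in for a real timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for "does this hctx still have in-flight requests?" */
typedef bool (*busy_fn)(void *ctx);

/*
 * Poll busy() until it reports idle or max_polls polls have been made.
 * Returns true if the queue drained, false if we gave up.
 */
bool wait_drained_bounded(busy_fn busy, void *ctx, unsigned int max_polls)
{
	assert(busy != NULL);

	for (unsigned int i = 0; i < max_polls; i++) {
		if (!busy(ctx))
			return true;
		/* A real implementation would sleep between polls here. */
	}
	return false;
}

/* Sample callback: reports busy for the first *ctx polls, then idle. */
bool countdown_busy(void *ctx)
{
	int *left = ctx;

	if (*left > 0) {
		(*left)--;
		return true;
	}
	return false;
}

/* Sample callback: requests that never complete. */
bool always_busy(void *ctx)
{
	(void)ctx;
	return true;
}
```

With a counter starting at 3, wait_drained_bounded(countdown_busy, &n, 10)
returns true, while always_busy makes it return false once max_polls is
used up. The caller would then have to decide what to do with the
requests that are still in flight, which is the hard part this sketch
leaves open.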

But yeah, let's focus on figuring out what's wrong.