[PATCH v3 15/15] blk-mq: use hk cpus only when isolcpus=io_queue is enabled

Fri Aug 9 00:22:11 PDT 2024

On Thu, Aug 08, 2024 at 01:26:41PM GMT, Ming Lei wrote:
> Isolated CPUs are removed from queue mapping in this patchset, when someone
> submit IOs from the isolated CPU, what is the correct hctx used for handling
> these IOs?

No, every possible CPU gets a mapping. What this patch series does, is
to limit/aligns the number of hardware context to the number of
housekeeping CPUs. There is still a complete ctx-hctc mapping. So
whenever an user thread on an isolated CPU is issuing an IO a
housekeeping CPU will also be involved (with the additional overhead,
which seems to be okay for these users).

Without hardware queue on the isolated CPUs ensures we really never get
any unexpected IO on those CPUs unless userspace does it own its own.
It's a safety net.

Just to illustrate it, the non isolcpus configuration (default) map
for an 8 CPU setup:

queue mapping for /dev/vda
        hctx0: default 0
        hctx1: default 1
        hctx2: default 2
        hctx3: default 3
        hctx4: default 4
        hctx5: default 5
        hctx6: default 6
        hctx7: default 7

and with isolcpus=io_queue,2-3,6-7

queue mapping for /dev/vda
        hctx0: default 0 2
        hctx1: default 1 3
        hctx2: default 4 6
        hctx3: default 5 7

> From current implementation, it depends on implied zero filled
> tag_set->map[type].mq_map[isolated_cpu], so hctx 0 is used.
> 
> During CPU offline, in blk_mq_hctx_notify_offline(),
> blk_mq_hctx_has_online_cpu() returns true even though the last cpu in
> hctx 0 is offline because isolated cpus join hctx 0 unexpectedly, so IOs in
> hctx 0 won't be drained.
> 
> However managed irq core code still shutdowns the hw queue's irq because all
> CPUs in this hctx are offline now. Then IO hang is triggered, isn't
> it?

Thanks for the explanation. I was able to reproduce this scenario, that
is a hardware context with two CPUs which go offline. Initially, I used
fio for creating the workload but this never hit the hanger. Instead
some background workload from systemd-journald is pretty reliable to
trigger the hanger you describe.

Example:

  hctx2: default 4 6

CPU 0 stays online, CPU 1-5 are offline. CPU 6 is offlined:

  smpboot: CPU 5 is now offline
  blk_mq_hctx_has_online_cpu:3537 hctx3 offline
  blk_mq_hctx_has_online_cpu:3537 hctx2 offline

and there is no forward progress anymore, the cpuhotplug state machine
is blocked and an IO is hanging:

  # grep busy /sys/kernel/debug/block/*/hctx*/tags | grep -v busy=0
  /sys/kernel/debug/block/vda/hctx2/tags:busy=61

and blk_mq_hctx_notify_offline busy loops forever:

   task:cpuhp/6         state:D stack:0     pid:439   tgid:439   ppid:2      flags:0x00004000
   Call Trace:
    <TASK>
    __schedule+0x79d/0x15c0
    ? lockdep_hardirqs_on_prepare+0x152/0x210
    ? kvm_sched_clock_read+0xd/0x20
    ? local_clock_noinstr+0x28/0xb0
    ? local_clock+0x11/0x30
    ? lock_release+0x122/0x4a0
    schedule+0x3d/0xb0
    schedule_timeout+0x88/0xf0
    ? __pfx_process_timeout+0x10/0x10d
    msleep+0x28/0x40
    blk_mq_hctx_notify_offline+0x1b5/0x200
    ? cpuhp_thread_fun+0x41/0x1f0
    cpuhp_invoke_callback+0x27e/0x780
    ? __pfx_blk_mq_hctx_notify_offline+0x10/0x10
    ? cpuhp_thread_fun+0x42/0x1f0
    cpuhp_thread_fun+0x178/0x1f0
    smpboot_thread_fn+0x12e/0x1c0
    ? __pfx_smpboot_thread_fn+0x10/0x10
    kthread+0xe8/0x110
    ? __pfx_kthread+0x10/0x10
    ret_from_fork+0x33/0x40
    ? __pfx_kthread+0x10/0x10
    ret_from_fork_asm+0x1a/0x30
    </TASK>

I don't think this is a new problem this code introduces. This problem
exists for any hardware context which has more than one CPU. As far I
understand it, the problem is that there is no forward progress possible
for the IO itself (I assume the corresponding resources for the CPU
going offline have already been shutdown, thus no progress?) and
blk_mq_hctx_notifiy_offline isn't doing anything in this scenario.

Couldn't we do something like:

+static bool blk_mq_hctx_timeout_rq(struct request *rq, void *data)
+{
+       blk_mq_rq_timed_out(rq);
+       return true;
+}
+
+static void blk_mq_hctx_timeout_rqs(struct blk_mq_hw_ctx *hctx)
+{
+       struct blk_mq_tags *tags = hctx->sched_tags ?
+                       hctx->sched_tags : hctx->tags;
+       blk_mq_all_tag_iter(tags, blk_mq_hctx_timeout_rq, NULL);
+}
+
+
 static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 {
        struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
                        struct blk_mq_hw_ctx, cpuhp_online);
+       int i;

        if (blk_mq_hctx_has_online_cpu(hctx, cpu))
                return 0;
@@ -3551,9 +3589,16 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
         * requests.  If we could not grab a reference the queue has been
         * frozen and there are no requests.
         */
+       i = 0;
        if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
-               while (blk_mq_hctx_has_requests(hctx))
+               while (blk_mq_hctx_has_requests(hctx) && i++ < 10)
                        msleep(5);
+               if (blk_mq_hctx_has_requests(hctx)) {
+                       pr_info("%s:%d hctx %d force timeout request\n",
+                               __func__, __LINE__, hctx->queue_num);
+                       blk_mq_hctx_timeout_rqs(hctx);
+               }
+

This guarantees forward progress and it worked in my test scenario, got
the corresponding log entries

  blk_mq_hctx_notify_offline:3598 hctx 2 force timeout request

and the hotplug state machine continued. Didn't see an IO error either,
but I haven't looked closely, this is just a POC.

BTW, when looking at the tag allocator, I didn't see any hctx state
checks for the batched alloction path. Don't we need to check if the
corresponding hardware context is active there too?

@ -486,6 +487,15 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
        if (data->nr_tags > 1) {
                rq = __blk_mq_alloc_requests_batch(data);
                if (rq) {
+                       if (unlikely(test_bit(BLK_MQ_S_INACTIVE,
+                                             &data->hctx->state))) {
+                               blk_mq_put_tag(blk_mq_tags_from_data(data),
+                                              rq->mq_ctx, rq->tag);
+                               msleep(3);
+                               goto retry;
+                       }
                        blk_mq_rq_time_init(rq, alloc_time_ns);
                        return rq;
                }

But given this is the hotpath and the hotplug path is very unlikely to
be used at all, at least for the majority of users, I would suggest to
try to get blk_mq_hctx_notify_offline to guarantee forward progress?.
This would make the hotpath an 'if' less.