[PATCH v10 13/13] docs: add io_queue flag to isolcpus

Aaron Tomlin atomlin at atomlin.com
Sun Apr 12 15:50:33 PDT 2026


On Sat, Apr 11, 2026 at 08:52:00PM +0800, Ming Lei wrote:
> > The critical issue lies at the invocation of group_cpus_evenly(). Without
> > this patchset, the core logic lacks the necessary constraints to respect
> > CPU isolation. It is entirely possible, and indeed happens in practice, for
> > an isolated CPU to be assigned to a CPU mask group.
> 
> It is one bug report? No, because it doesn't show any trouble from user
> viewpoint.

Hi Ming,

The lack of a formal bug report does not negate the fact that the current
behaviour silently breaks the fundamental contract of CPU isolation from
the administrator's perspective.

To illustrate the user-visible impact, the following demonstrates the
difference between relying on isolcpus=managed_irq and isolcpus=io_queue
under 7.0.0-rc3-00065-gd80965e205a5, which includes this series.

With isolcpus=managed_irq, the Broadcom MPI3 storage controller driver
(mpi3mr) allocates a full complement of 48 operational queue pairs.
Consequently, the resulting MSI-X vectors are spread across all cores and
mapped directly onto the isolated CPUs (2-47), thereby breaching isolation.

    # uname -r
    7.0.0-rc3-00065-gd80965e205a5

    # tr ' ' '\n' < /proc/cmdline | grep isolcpus=
    isolcpus=managed_irq,domain,2-47

    # cat /sys/devices/system/cpu/isolated
    2-47

    # dmesg | grep -A 6 'MSI-X vectors supported:'
    [   2.981705] mpi3mr0: MSI-X vectors supported: 128, no of cores: 48,
    [   2.981705] mpi3mr0: MSI-X vectors requested: 49 poll_queues 0
    [   3.001915] mpi3mr0: trying to create 48 operational queue pairs
    [   3.011214] mpi3mr0: allocating operational queues through segmented queues 
    [   3.101903] mpi3mr0: successfully created 48 operational queue pairs(default/polled) queue = (2/0)
    [   3.111468] mpi3mr0: controller initialization completed successfully

    # awk '/mpi3mr0/ { print $1" "$NF }' /proc/interrupts
    78: mpi3mr0-msix0
    79: mpi3mr0-msix1
    80: mpi3mr0-msix2
    81: mpi3mr0-msix3
    82: mpi3mr0-msix4
    83: mpi3mr0-msix5
    84: mpi3mr0-msix6
    85: mpi3mr0-msix7
    86: mpi3mr0-msix8
    87: mpi3mr0-msix9
    88: mpi3mr0-msix10
    89: mpi3mr0-msix11
    90: mpi3mr0-msix12
    ...
    122: mpi3mr0-msix44
    123: mpi3mr0-msix45
    124: mpi3mr0-msix46
    125: mpi3mr0-msix47
    126: mpi3mr0-msix48

    # grep -H '' /proc/irq/{119,120,121,122}/{effective,smp}_affinity_list
    /proc/irq/119/effective_affinity_list:42
    /proc/irq/119/smp_affinity_list:42
    /proc/irq/120/effective_affinity_list:43
    /proc/irq/120/smp_affinity_list:43
    /proc/irq/121/effective_affinity_list:44
    /proc/irq/121/smp_affinity_list:44
    /proc/irq/122/effective_affinity_list:45
    /proc/irq/122/smp_affinity_list:45


Now, with isolcpus=io_queue,domain,2-47, the allocation is structurally
restricted at the source: the driver creates only two operational queue
pairs, confining all resulting interrupts exclusively to the housekeeping
CPUs (0 and 1):

    # uname -r
    7.0.0-rc3-00065-gd80965e205a5

    # tr ' ' '\n' < /proc/cmdline | grep isolcpus=
    isolcpus=io_queue,domain,2-47

    # cat /sys/devices/system/cpu/isolated
    2-47

    # dmesg | grep -A 6 'MSI-X vectors supported:'
    [   3.284850] mpi3mr0: MSI-X vectors supported: 128, no of cores: 48,
    [   3.284851] mpi3mr0: MSI-X vectors requested: 49 poll_queues 0
    [   3.305492] mpi3mr0: allocated vectors (3) are less than configured (49)
    [   3.316528] mpi3mr0: trying to create 2 operational queue pairs
    [   3.328013] mpi3mr0: allocating operational queues through segmented queues
    [   3.340697] mpi3mr0: successfully created 2 operational queue pairs(default/polled) queue = (2/0)
    [   3.350664] mpi3mr0: controller initialization completed successfully

    # awk '/mpi3mr0/ { print $1" "$NF }' /proc/interrupts
    79: mpi3mr0-msix0
    80: mpi3mr0-msix1
    81: mpi3mr0-msix2

    # grep -H '' /proc/irq/{79,80,81}/{effective,smp}_affinity_list
    /proc/irq/79/effective_affinity_list:1
    /proc/irq/79/smp_affinity_list:1
    /proc/irq/80/effective_affinity_list:1
    /proc/irq/80/smp_affinity_list:1
    /proc/irq/81/effective_affinity_list:0
    /proc/irq/81/smp_affinity_list:0

> Sebastian explains/shows how "isolcpus=managed_irq" works perfectly in the
> following link:
> 
> https://lore.kernel.org/all/20260401110232.ET5RxZfl@linutronix.de/
> 
> You have reviewed it...
> 
> What matters is that IO won't interrupt isolated CPU.

isolcpus=managed_irq acts as a "best effort" avoidance mechanism rather
than a strict, unbreakable constraint. This is acknowledged in the proposed
changes to Documentation/core-api/irq/managed_irq.rst [1].

[1]: https://lore.kernel.org/all/20260401110232.ET5RxZfl@linutronix.de/

The following is an excerpt of irq_do_set_affinity() from
kernel/irq/manage.c:

 232 int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask, bool force)
 233 {
 234         struct cpumask *tmp_mask = this_cpu_ptr(&__tmp_mask);
  :
 262         if (irqd_affinity_is_managed(data) &&
 263             housekeeping_enabled(HK_TYPE_MANAGED_IRQ)) {
 264                 const struct cpumask *hk_mask;
 265 
 266                 hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
 267 
 268                 cpumask_and(tmp_mask, mask, hk_mask);
 269                 if (!cpumask_intersects(tmp_mask, cpu_online_mask))
 270                         prog_mask = mask;
 271                 else
 272                         prog_mask = tmp_mask;
 273         } else {
 274                 prog_mask = mask;
 275         }

    1.  If the requested mask consists only of isolated CPUs (e.g., 2-47),
        it will have zero intersection with the hk_mask (which contains
        only the housekeeping CPUs). Consequently, the resulting tmp_mask
        becomes completely empty.

    2.  Because the tmp_mask is empty, it cannot intersect with the
        cpu_online_mask.

    3.  The kernel therefore takes the fallback path: it abandons the
        empty, filtered tmp_mask and reverts to the originally requested
        mask, which contains only isolated CPUs. The interrupt is thus
        routed directly to an isolated CPU, demonstrating that
        managed_irq cannot guarantee isolation.

> > The newer implementation of irq_create_affinity_masks() introduced by this
> > series resolves this. It considers the new CPU mask added to the IRQ
> > affinity descriptor. When group_mask_cpus_evenly() is called, this mask is
> > evaluated [1], guaranteeing that isolated CPUs are entirely excluded from
> > the mask groups.
> > 
> > [1]: https://lore.kernel.org/lkml/20260401222312.772334-8-atomlin@atomlin.com/
> 
> Not at all.
> 
> isolated CPU is still included in each group's cpu mask, please see patch
> 9:

You are entirely correct. The actual structural exclusion preventing the
interrupts from landing on those cores occurs subsequently via
irq_spread_hk_filter() in irq_create_affinity_masks() as per patch 12 [2].

[2]: https://lore.kernel.org/lkml/20260401222312.772334-13-atomlin@atomlin.com/


Kind regards,
-- 
Aaron Tomlin

