IRQ thread timeouts and affinity

Thierry Reding thierry.reding at gmail.com
Fri Oct 10 08:03:01 PDT 2025


On Fri, Oct 10, 2025 at 03:18:13PM +0100, Marc Zyngier wrote:
> On Fri, 10 Oct 2025 14:50:57 +0100,
> Thierry Reding <thierry.reding at gmail.com> wrote:
> > 
> > On Thu, Oct 09, 2025 at 07:11:20PM +0100, Marc Zyngier wrote:
> > > On Thu, 09 Oct 2025 18:04:58 +0100,
> > > Marc Zyngier <maz at kernel.org> wrote:
> > > > 
> > > > On Thu, 09 Oct 2025 17:05:15 +0100,
> > > > Thierry Reding <thierry.reding at gmail.com> wrote:
> > > > > 
> > > > > On Thu, Oct 09, 2025 at 03:30:56PM +0100, Marc Zyngier wrote:
> > > > > > Hi Thierry,
> > > > > > 
> > > > > > On Thu, 09 Oct 2025 12:38:55 +0100,
> > > > > > Thierry Reding <thierry.reding at gmail.com> wrote:
> > > > > > > 
> > > > > > > Which brings me to the actual question: what is the right way to solve
> > > > > > > this? I had, maybe naively, assumed that the default CPU affinity, which
> > > > > > > includes all available CPUs, would be sufficient to have interrupts
> > > > > > > balanced across all of those CPUs, but that doesn't appear to be the
> > > > > > > case. At least not with the GIC (v3) driver which selects one CPU (CPU 0
> > > > > > > in this particular case) from the affinity mask to set the "effective
> > > > > > > affinity", which then dictates where IRQs are handled and where the
> > > > > > > corresponding IRQ thread function is run.
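
For illustration, the general shape of an irqchip ->irq_set_affinity
callback that collapses a multi-CPU mask onto a single effective target
looks roughly like the sketch below -- not the literal GICv3 code, just
the pattern:

    #include <linux/cpumask.h>
    #include <linux/irq.h>

    static int example_set_affinity(struct irq_data *d,
                                    const struct cpumask *mask_val,
                                    bool force)
    {
            /* Pick a single online CPU out of the requested mask... */
            unsigned int cpu = cpumask_first_and(mask_val, cpu_online_mask);

            if (cpu >= nr_cpu_ids)
                    return -EINVAL;

            /*
             * ...and record it as the effective affinity: both the hard
             * IRQ and its threaded handler will now run on that one CPU.
             */
            irq_data_update_effective_affinity(d, cpumask_of(cpu));

            return IRQ_SET_MASK_OK_DONE;
    }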
> > > > > > 
> > > > > > There's a (GIC-specific) answer to that, and that's the "1 of N"
> > > > > > distribution model. The problem is that it is a massive headache (it
> > > > > > completely breaks with per-CPU context).
> > > > > 
> > > > > Heh, that started out as a very promising first paragraph but turned
> > > > > ugly very quickly... =)
> > > > > 
> > > > > > We could try and hack this in somehow, but defining a reasonable API
> > > > > > is complicated. The set of CPUs receiving 1:N interrupts is a *global*
> > > > > > set, which means you cannot have one interrupt targeting CPUs 0-1, and
> > > > > > another targeting CPUs 2-3. You can only have a single set for all 1:N
> > > > > > interrupts. How would you define such a set in a platform agnostic
> > > > > > manner so that a random driver could use this? I definitely don't want
> > > > > > to have a GIC-specific API.
> > > > > 
> > > > > I see. I've been thinking that maybe the only way to solve this is to
> > > > > use some sort of policy. A very simple policy might be: use CPU 0 as
> > > > > the "default" interrupt target (much like it is now) because, as you
> > > > > said, there might be built-in assumptions that break when the interrupt
> > > > > is handled elsewhere. But then let individual drivers opt into the 1:N
> > > > > set, which would perhaps span all available CPUs except the first one.
> > > > > From an API PoV this would just be a flag that's passed to
> > > > > request_irq() (or one of its derivatives).
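
As a sketch of what that opt-in could look like from a driver's side --
IRQF_ONE_OF_N is a made-up name and value, not an existing kernel flag,
and the QSPI handler names are placeholders:

    /* Hypothetical flag -- not in mainline; value picked from unused space. */
    #define IRQF_ONE_OF_N   0x00400000

    err = request_threaded_irq(irq, qspi_isr, qspi_thread_fn,
                               IRQF_ONESHOT | IRQF_ONE_OF_N,
                               dev_name(dev), qspi);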
> > > > 
> > > > The $10k question is: how do you pick the victim CPUs? I can't see how
> > > > to do it in a reasonable way unless we decide that interrupts whose
> > > > affinity matches cpu_possible_mask are 1:N. And then we're left
> > > > wondering what to do about CPU hotplug.
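
So the heuristic would boil down to something like this sketch:

    /*
     * Sketch: an affinity mask covering every possible CPU is read as
     * "no particular preference" and makes the SPI a 1:N candidate.
     */
    if (cpumask_equal(irq_data_get_affinity_mask(d), cpu_possible_mask))
            /* route this interrupt 1:N instead of to a single CPU */;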
> > > 
> > > For fun and giggles, here's the result of a 5 minute hack. It enables
> > > 1:N distribution on SPIs that have an "all cpus" affinity. It works on
> > > one machine, doesn't on another -- no idea why yet. YMMV.
> > > 
> > > This is of course conditioned on your favourite HW supporting the 1:N
> > > feature, and it is likely that things will catch fire quickly. It will
> > > probably make your overall interrupt latency *worse*, but maybe less
> > > variable. Let me know.
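
If I read the idea correctly, the core of such a hack is the
Interrupt_Routing_Mode bit in GICD_IROUTERn: with IRM set, the
distributor picks a participating CPU itself instead of delivering to
one fixed target. Roughly -- a sketch, not the actual patch:

    #include <linux/irqchip/arm-gic-v3.h>
    #include <asm/arch_gicv3.h>

    /*
     * GICD_IROUTER_SPI_MODE_ANY is the IRM bit as named in
     * include/linux/irqchip/arm-gic-v3.h; setting it routes the SPI to
     * any participating CPU (1:N) rather than to one programmed target.
     */
    static void example_route_one_of_n(void __iomem *dist_base, u32 irq)
    {
            gic_write_irouter(GICD_IROUTER_SPI_MODE_ANY,
                              dist_base + GICD_IROUTER + irq * 8);
    }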
> > 
> > You might be onto something here. Mind you, I've only done very limited
> > testing, but the system does boot and the QSPI-related timeouts are gone
> > completely.
> 
> Hey, progress.
> 
> > Here's some snippets from the boot log that might be interesting:
> > 
> > [    0.000000] GICv3: GIC: Using split EOI/Deactivate mode
> > [    0.000000] GIC: enabling workaround for GICv3: NVIDIA erratum T241-FABRIC-4
> > [    0.000000] GIC: enabling workaround for GICv3: ARM64 erratum 2941627
> > [    0.000000] GICv3: 960 SPIs implemented
> > [    0.000000] GICv3: 320 Extended SPIs implemented
> > [    0.000000] Root IRQ handler: gic_handle_irq
> > [    0.000000] GICv3: GICv3 features: 16 PPIs, 1:N
> > [    0.000000] GICv3: CPU0: found redistributor 20000 region 0:0x0000000022100000
> > [...]
> > [    0.000000] GICv3: using LPI property table @0x0000000101500000
> > [    0.000000] GICv3: CPU0: using allocated LPI pending table @0x0000000101540000
> > [...]
> > 
> > There's a bunch of ITS info that I dropped, as well as the same
> > redistributor and LPI property table block for each of the 288 CPUs.
> > 
> > /proc/interrupts is much too big to paste here, but it looks like the
> > QSPI interrupts now end up evenly distributed across the first 72 CPUs
> > in this system. Not sure why 72, but possibly because this is a 4-node
> > NUMA system with 72 CPUs each, so the CPU mask might've been restricted
> > to just the first node.
> 
> It could well be that your firmware sets GICR_CTLR.DPG1NS on the 3
> other nodes, and the patch I gave you doesn't try to change that.
> Check with [1], which does the right thing on that front (it fixed a
> similar problem on my slightly more modest 12 CPU machine).
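
For reference, DPG1NS is bit 25 of GICR_CTLR in the GICv3 architecture
spec (mainline has no named define for it as far as I know); undoing the
firmware setting amounts to clearing that bit on every redistributor so
all CPUs stay eligible for 1:N delivery:

    #include <linux/io.h>
    #include <linux/irqchip/arm-gic-v3.h>

    #define GICR_CTLR_DPG1NS        (1U << 25)  /* from the spec, not mainline */

    static void example_allow_one_of_n(void __iomem *rbase)
    {
            u32 ctlr = readl_relaxed(rbase + GICR_CTLR);

            /* Clear DPG1NS so this redistributor accepts 1:N Group 1 NS SPIs. */
            writel_relaxed(ctlr & ~GICR_CTLR_DPG1NS, rbase + GICR_CTLR);
    }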
> 
> > On the face of it this looks quite promising. Where do we go from here?
> 
> For a start, you really should consider sending me one of these
> machines. I have plans for it ;-)

I'm quite happy with someone else hosting this device; I don't think the
electrical installation at home could handle it.

It has proven to be quite well suited for kernel builds...

> > Any areas that we need to test more exhaustively to see if this breaks?
> 
> CPU hotplug is the main area of concern, and I'm pretty sure it breaks
> this distribution mechanism (or the other way around). Another thing
> is that if firmware isn't aware that 1:N interrupts can (or should)
> wake-up a CPU from sleep, bad things will happen. Given that nobody
> uses 1:N, you can bet that any bit of privileged SW (TF-A,
> hypervisors) is likely to be buggy (I've already spotted bugs in KVM
> around this).

Okay, I can find out whether CPU hotplug is a common use-case on these
devices, and see if we can run some tests with it.

> The other concern is the shape of the API we would expose to drivers,
> because I'm not sure we want this sort of "scatter-gun" approach for
> all SPIs, and I don't know how that translates to other architectures.
> 
> Thomas should probably weigh in here.

Yes, it would be interesting to understand how we can make use of this
in a more generic way.

Thierry