[PATCH] irqchip/gic-v3-its: Don't acquire rt_spin_lock in allocate_vpe_l1_table()

Marc Zyngier maz at kernel.org
Thu Jan 8 00:26:13 PST 2026


On Wed, 07 Jan 2026 21:53:53 +0000,
Waiman Long <longman at redhat.com> wrote:
> 
> When running a PREEMPT_RT debug kernel on a 2-socket Grace arm64 system,
> the following bug report was produced at bootup time.
> 
>   BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
>   in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/72
>   preempt_count: 1, expected: 0
>   RCU nest depth: 1, expected: 1
>    :
>   CPU: 72 UID: 0 PID: 0 Comm: swapper/72 Tainted: G        W           6.19.0-rc4-test+ #4 PREEMPT_{RT,(full)}
>   Tainted: [W]=WARN
>   Call trace:
>     :
>    rt_spin_lock+0xe4/0x408
>    rmqueue_bulk+0x48/0x1de8
>    __rmqueue_pcplist+0x410/0x650
>    rmqueue.constprop.0+0x6a8/0x2b50
>    get_page_from_freelist+0x3c0/0xe68
>    __alloc_frozen_pages_noprof+0x1dc/0x348
>    alloc_pages_mpol+0xe4/0x2f8
>    alloc_frozen_pages_noprof+0x124/0x190
>    allocate_slab+0x2f0/0x438
>    new_slab+0x4c/0x80
>    ___slab_alloc+0x410/0x798
>    __slab_alloc.constprop.0+0x88/0x1e0
>    __kmalloc_cache_noprof+0x2dc/0x4b0
>    allocate_vpe_l1_table+0x114/0x788
>    its_cpu_init_lpis+0x344/0x790
>    its_cpu_init+0x60/0x220
>    gic_starting_cpu+0x64/0xe8
>    cpuhp_invoke_callback+0x438/0x6d8
>    __cpuhp_invoke_callback_range+0xd8/0x1f8
>    notify_cpu_starting+0x11c/0x178
>    secondary_start_kernel+0xc8/0x188
>    __secondary_switched+0xc0/0xc8
> 
> This happens because allocate_vpe_l1_table() calls kzalloc() to
> allocate a cpumask_t when the first CPU of the second node of the
> 72-cpu Grace system is brought up from the
> CPUHP_AP_MIPS_GIC_TIMER_STARTING state, inside the starting section of

Surely *not* that particular state. The trace shows
gic_starting_cpu(), i.e. CPUHP_AP_IRQ_GIC_STARTING.

> the CPU hotplug bringup pipeline, where interrupts are disabled. This
> is an atomic context where sleeping is not allowed, so acquiring a
> sleeping rt_spin_lock within kzalloc() may lead to a system hang if
> there is lock contention.
> 
> To work around this issue, the newly introduced vpe_alloc_cpumask()
> helper allocates cpumasks from a static buffer when running a
> PREEMPT_RT kernel. The static buffer is currently 4 kbytes in size.
> As only one cpumask is needed per node, that should be big enough as
> long as (cpumask_size() * nr_node_ids) does not exceed 4k.

What role does the node play here? The GIC topology has nothing to do
with NUMA. It may be true on your particular toy, but that's
definitely not true architecturally. You could, at worst, end up with
one such cpumask per *CPU*. That'd be a braindead system, but this
code is written to support the architecture, not any particular
implementation.
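
To put numbers on it, a back-of-the-envelope sketch (plain userspace
C, all figures invented for illustration, assuming 64-bit longs since
cpumask_size() rounds up to unsigned longs):

#include <stdio.h>

int main(void)
{
	unsigned int nr_cpu_ids = 512;	/* assumed CONFIG_NR_CPUS */
	unsigned int nr_nodes = 2;	/* the Grace box above */
	/* bytes per cpumask, rounded up to unsigned longs */
	unsigned int mask_bytes = ((nr_cpu_ids + 63) / 64) * 8;

	printf("per-node sizing   : %u bytes\n", nr_nodes * mask_bytes);
	printf("per-CPU worst case: %u bytes\n", nr_cpu_ids * mask_bytes);
	return 0;
}

With 512 possible CPUs that is 32K for the per-CPU worst case, eight
times the 4k the patch reserves.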

> 
> Signed-off-by: Waiman Long <longman at redhat.com>
> ---
>  drivers/irqchip/irq-gic-v3-its.c | 26 +++++++++++++++++++++++++-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index ada585bfa451..9185785524dc 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -2896,6 +2896,30 @@ static bool allocate_vpe_l2_table(int cpu, u32 id)
>  	return true;
>  }
>  
> +static void *vpe_alloc_cpumask(void)
> +{
> +	/*
> +	 * With PREEMPT_RT kernel, we can't call any k*alloc() APIs as they
> +	 * may acquire a sleeping rt_spin_lock in an atomic context. So use
> +	 * a pre-allocated buffer instead.
> +	 */
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> +		static unsigned long mask_buf[512];
> +		static atomic_t	alloc_idx;
> +		int idx, mask_size = cpumask_size();
> +		int nr_cpumasks = sizeof(mask_buf)/mask_size;
> +
> +		/*
> +		 * Fetch an allocation index and if it points to a buffer within
> +		 * mask_buf[], return that. Fall back to kzalloc() otherwise.
> +		 */
> +		idx = atomic_fetch_inc(&alloc_idx);
> +		if (idx < nr_cpumasks)
> +			return &mask_buf[idx * mask_size/sizeof(long)];
> +	}

Err, no. That's horrible. I can see three more appealing ways to
address this:

- you give RT a generic allocator that works for (small) atomic
  allocations. I appreciate that's not easy, and even probably
  contrary to the RT goals. But I'm also pretty sure that the GIC code
  is not the only pile of crap being caught doing that.

- you pre-compute upfront how many cpumasks you are going to require,
  based on the actual GIC topology. You do that on CPU0, outside of
  the hotplug constraints, and allocate what you need. This is
  difficult as you need to ensure the RD<->CPU matching without the
  CPUs having booted, which means wading through the DT/ACPI gunk to
  try and guess what you have.

- you delay the allocation of L1 tables to a context where you can
  perform allocations, and before we have a chance of running a guest
  on this CPU. That's probably the simplest option (though dealing
  with late onlining while guests are already running could be
  interesting...). A rough sketch follows this list.
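
For the third option, here is a very rough sketch of what I mean. The
function allocate_vpe_l1_table_for() is a made-up stand-in for the
real per-RD allocation work, and the real thing would have to close
the late-onlining window mentioned above:

#include <linux/cpuhotplug.h>
#include <linux/init.h>

/*
 * Sketch only: move the L1 table allocation out of the STARTING
 * section (atomic, IRQs off) into an ONLINE step, where we run in
 * task context and sleeping allocations are fine.
 */
static int allocate_vpe_l1_table_for(unsigned int cpu); /* hypothetical */

static int its_vpe_l1_online(unsigned int cpu)
{
	/* task context, interrupts enabled: kzalloc(GFP_KERNEL) is safe */
	return allocate_vpe_l1_table_for(cpu);
}

static int __init its_vpe_l1_defer_init(void)
{
	/* a dynamic ONLINE state; must run before this CPU can host a vPE */
	return cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
				 "irqchip/its-vpe:online",
				 its_vpe_l1_online, NULL);
}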

But I'm always going to say no to something that is a poor hack and
ultimately falls back to the same broken behaviour.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


