[PATCH] irqchip/gic-v4.1: Fix GICv4.1 doorbell affinity

Fri Jan 26 02:19:30 PST 2024

On Fri, 26 Jan 2024 09:17:52 +0000,
Kunkun Jiang <jiangkunkun at huawei.com> wrote:
> 
> commit dd3f050a216e ("irqchip/gic-v4.1: Implement the v4.1 flavour
> of VMOVP") make an optimization, VMOVP can be skipped if moving
> VPE to a cpu whose RD is sharing its VPE table with the current one.
> But when skipping VMOVP, the affinity recorded in irq_data is still
> updated. This causes the doorbell affinity recorfed in the irq_data
> to be inconsistent with the actual.
> 
> In corner case, this may result in lost interrupts:
> 0. Each cpu die shares a VPE table and contains 32 CPUs
>    die0(CPU0-31) die1(CPU32-63)...
> 1. VPE resides on CPU32, doorbell affinity to CPU32.
> 2. Move VPE to CPU33, skip VMOVP, doorbell still affinity to CPU32.
>    The affinity recorded in irq_data is CPU33.
> 3. Manually offline CPU32 on the host side:
>    'echo 0 > /sys/devices/system/cpu/cpu32/online'
> 4. Core code cannot move the doorbell affinity to CPU32, since the
>    record in irq_data is CPU33.
> 5. Subsequent doorbell interrupts will be lost.
> 
> So affinity recoreded in irq_data should not be updated when skipping
> VMOVP.
> 
> Fixes: dd3f050a216e ("irqchip/gic-v4.1: Implement the v4.1 flavour of VMOVP")
> Signed-off-by: Kunkun Jiang <jiangkunkun at huawei.com>
> ---
>  drivers/irqchip/irq-gic-v3-its.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index d097001c1e3e..4b1dbb697959 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -3850,8 +3850,9 @@ static int its_vpe_set_affinity(struct irq_data *d,
>  	its_send_vmovp(vpe);
>  	its_vpe_db_proxy_move(vpe, from, cpu);
>  
> -out:
>  	irq_data_update_effective_affinity(d, cpumask_of(cpu));
> +
> +out:

That looks wrong. You are lying to the core code by saying that it's
all OK, and yet haven't done *anything*. This stuff is obviously
buggy, but I don't think this is right.

In your example, you don't even solve the problem: if CPUs 32 and 33
are part of the same ITS affinity group, you won't issue a VMOVP
either, so this doesn't fix anything.

At this stage, I think the VMOVP optimisation is wrong and that we
should drop it.

	M.

-- 
Without deviation from the norm, progress is not possible.