[PATCH] arm64/entry: Mask DAIF in cpu_switch_to(), call_on_irq_stack()

Ada Couprie Diaz <ada.coupriediaz@arm.com>
Fri Jul 18 07:28:14 PDT 2025


`cpu_switch_to()` and `call_on_irq_stack()` manipulate SP to switch
to different stacks, and switch the Shadow Call Stack pointer as well
if it is enabled.
Those two stack changes cannot be done atomically, and both functions
can be interrupted by SErrors or debug exceptions, which, though
unlikely, is very much broken: if interrupted, we can end up with a
mismatched stack and Shadow Call Stack, leading to clobbered stacks.

In `cpu_switch_to()`, it can happen when SP_EL0 points to the new task,
but x18 still points to the old task's SCS. When the interrupt handler
tries to save the task's SCS pointer, it will save the old task's SCS
pointer (x18) into the new task struct (pointed to by SP_EL0),
clobbering it.
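
Concretely, the window looks like this (simplified sketch of the tail
of `cpu_switch_to()` before this patch; annotations mine):

	// x0 = prev task, x1 = next task
	msr	sp_el0, x1	// SP_EL0 now identifies the next task
				// <- an SError/debug exception taken here
				//    stashes x18, which still holds prev's
				//    SCS pointer, into the task found via
				//    SP_EL0, clobbering next's task struct
	scs_save x0		// save prev's SCS pointer (x18) into prev
	scs_load_current	// x18 = next task's SCS pointer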

In `call_on_irq_stack()`, it can happen when switching from the task
stack to the IRQ stack and when switching back. In both cases, we can
be interrupted when the SCS pointer points to the IRQ SCS, but SP
points to the task stack. The nested interrupt handler pushes its
return addresses onto the IRQ SCS. It then detects that SP points to
the task stack, calls `call_on_irq_stack()` and clobbers the task SCS
pointer with the IRQ SCS pointer, which it will then also use itself!
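
Again as a simplified sketch (entry of `call_on_irq_stack()` before
this patch; annotations mine):

	get_current_task x16
	scs_save x16		// save the task's SCS pointer
	ldr_this_cpu scs_sp, irq_shadow_call_stack_ptr, x17
				// x18 (scs_sp) now points to the IRQ SCS
				// <- a nested interrupt taken here pushes
				//    its return addresses onto the IRQ SCS,
				//    sees SP still on the task stack,
				//    re-enters call_on_irq_stack(), and its
				//    scs_save overwrites the task's saved
				//    SCS pointer with the IRQ SCS pointer
	...
	ldr_this_cpu x16, irq_stack_ptr, x17
	add	sp, x16, #IRQ_STACK_SIZE	// SP moves only here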

This leads to tasks returning to addresses on the wrong SCS,
or even on the IRQ SCS, triggering kernel panics via CONFIG_VMAP_STACK
or FPAC if enabled.

This is possible on a default config, but unlikely.
However, with CONFIG_ARM64_PSEUDO_NMI enabled, DAIF is left unmasked
and the GIC is instead responsible for filtering, based on priority,
which interrupts the CPU should receive.
Given the goal of emulating NMIs, pseudo-NMIs can be received by the CPU
even in `cpu_switch_to()` and `call_on_irq_stack()`, possibly *very*
frequently depending on the system configuration and workload, leading
to unpredictable kernel panics.

Completely mask DAIF in `cpu_switch_to()` and restore it when
returning.
Do the same in `call_on_irq_stack()`, but restore DAIF before the
branch to the handler and mask it again afterwards, so that the
handler itself runs with the original DAIF state.
Mask DAIF even if CONFIG_SHADOW_CALL_STACK is not enabled, for
consistency of behaviour between all configurations.

Introduce and use an assembly macro for saving and masking DAIF, as
the existing one (`save_and_disable_irq`) saves DAIF but only masks
I and F.
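
For reference (not part of the patch), the difference between the two
macros is just the DAIFSet immediate, where bit 3 = D, bit 2 = A,
bit 1 = I and bit 0 = F:

	msr	daifset, #3	// save_and_disable_irq: masks I and F
	msr	daifset, #0xf	// save_and_disable_daif: masks D, A, I, F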

Signed-off-by: Ada Couprie Diaz <ada.coupriediaz@arm.com>
Reported-by: Cristian Prundeanu <cpru@amazon.com>
Fixes: 59b37fe52f49 ("arm64: Stash shadow stack pointer in the task struct on interrupt")
---
Hi,
I spent some time evaluating the performance impact of this change to
make sure that it would be OK to mask DAIF in those functions.
They have very few instructions, so they have few chances to be
interrupted to begin with, and the impact should be minimal.

Disclaimer: I am no benchmarking or performance analysis expert.
I'm happy to take additional input/validation of the findings below!

I ran the following benchmarks on 4-core VMs running on recent
commercial hardware, trying to maximize task switches:
 - `stress-ng --switch` with 100 processes [0],
 - `hackbench -T` and `hackbench -P`, both with 400 tasks [1].

Comparing the effect on the base defconfig:
 - `stress-ng` is nearly identical: the median switch time with the fix
   was reduced by 0.1%, the average raised by 0.04%.
 - `hackbench` results are slightly different: medians were reduced by
   0.3-0.4%, with some high task-time outliers raising the averages by
   1.7-1.8%.
 - Both benchmarks have almost identical distributions.
 - The effects seem mostly minimal, possibly in the noise.

Comparing the effects with pNMI+SCS, pNMI enabled:
 - `stress-ng` is slightly slower: median +1.9%, average +1.4%.
 - `hackbench` is similar: medians +0.8-0.9%, averages +0.3% to +0.6%.
 - The distribution of times is wider in both cases.
 - There seems to be a small performance impact; however, without the
   fix there is a high likelihood of triggering the race condition and
   panicking at some point.

I also tried to benchmark the performance impact on `memcached`, as a
workload reported to crash with pNMI and SCS enabled.
I used `mc-crusher` [2] as recommended by the `memcached`
documentation, specifically the `binconf`, `slab_rebal_torture` and
`slab_rebal_torture5` configurations, measuring the average number of
get/set operations per second each minute.
Those were also run on a 4-core VM, but on an older machine.

Comparing the effects on the base defconfig:
 - `binconf` is slightly worse: -0.8% ops/s on average.
 - `slab_rebal_torture` is slightly better: +0.7% ops/s on average.
 - `slab_rebal_torture5` is almost identical: +0.1% ops/s on average.
 - There is much less variation in ops/s.

Comparing the effects with pNMI+SCS, pNMI enabled:
 - `binconf` is slightly better: +0.5% ops/s on average.
 - `slab_rebal_torture` as well: +0.5% ops/s on average.
 - `slab_rebal_torture5` is slightly worse: -0.6% ops/s on average.
 - The spread of values is similar.

The `mc-crusher` performance results seem to confirm that the change
has little impact in practice and might very well be lost in the noise.

Given those results, I feel it is OK to mask DAIF in
`call_on_irq_stack()` and `cpu_switch_to()` in all configurations, for
consistency of behaviour, and that it is not notably detrimental in
the cases where it does fix the race condition.

My apologies if this is all *very long*. I felt it was important to
explain the mechanisms triggering the issues, as well as to justify
the performance impact given the functions affected (though the latter
has less of a place in the commit message itself).
I might be wrong on both counts, so I'm happy to trim the commit
message if needed, or to be corrected on the performance impact!

Thanks,
Ada

[0]: https://github.com/ColinIanKing/stress-ng
[1]: https://man.archlinux.org/man/hackbench.8
[2]: https://github.com/memcached/mc-crusher
---
 arch/arm64/include/asm/assembler.h | 5 +++++
 arch/arm64/kernel/entry.S          | 6 ++++++
 2 files changed, 11 insertions(+)

diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index ad63457a05c5..c56c21bb1eec 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -41,6 +41,11 @@
 /*
  * Save/restore interrupts.
  */
+	.macro save_and_disable_daif, flags
+	mrs	\flags, daif
+	msr	daifset, #0xf
+	.endm
+
 	.macro	save_and_disable_irq, flags
 	mrs	\flags, daif
 	msr	daifset, #3
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 5ae2a34b50bd..30dcb719685b 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -825,6 +825,7 @@ SYM_CODE_END(__bp_harden_el1_vectors)
  *
  */
 SYM_FUNC_START(cpu_switch_to)
+	save_and_disable_daif x11
 	mov	x10, #THREAD_CPU_CONTEXT
 	add	x8, x0, x10
 	mov	x9, sp
@@ -848,6 +849,7 @@ SYM_FUNC_START(cpu_switch_to)
 	ptrauth_keys_install_kernel x1, x8, x9, x10
 	scs_save x0
 	scs_load_current
+	restore_irq x11
 	ret
 SYM_FUNC_END(cpu_switch_to)
 NOKPROBE(cpu_switch_to)
@@ -874,6 +876,7 @@ NOKPROBE(ret_from_fork)
  * Calls func(regs) using this CPU's irq stack and shadow irq stack.
  */
 SYM_FUNC_START(call_on_irq_stack)
+	save_and_disable_daif x9
 #ifdef CONFIG_SHADOW_CALL_STACK
 	get_current_task x16
 	scs_save x16
@@ -888,8 +891,10 @@ SYM_FUNC_START(call_on_irq_stack)
 
 	/* Move to the new stack and call the function there */
 	add	sp, x16, #IRQ_STACK_SIZE
+	restore_irq x9
 	blr	x1
 
+	save_and_disable_daif x9
 	/*
 	 * Restore the SP from the FP, and restore the FP and LR from the frame
 	 * record.
@@ -897,6 +902,7 @@ SYM_FUNC_START(call_on_irq_stack)
 	mov	sp, x29
 	ldp	x29, x30, [sp], #16
 	scs_load_current
+	restore_irq x9
 	ret
 SYM_FUNC_END(call_on_irq_stack)
 NOKPROBE(call_on_irq_stack)

base-commit: 347e9f5043c89695b01e66b3ed111755afcf1911
-- 
2.43.0



