[PATCH v2] tracing: Fix inconsistency in irq tracking on NMIs

Peter Zijlstra peterz at infradead.org
Tue Jul 1 05:54:59 PDT 2025


On Wed, Jun 25, 2025 at 02:08:22PM +0200, Gabriele Monaco wrote:
> The irq_enable/irq_disable tracepoints fire only when there's an actual
> transition (enabled->disabled and vice versa); this needs special care
> in NMIs, as they can potentially start with interrupts already disabled.
> The current implementation takes care of this by tracking the lockdep
> state on nmi_entry as well as using the variable tracing_irq_cpu to
> synchronise with other calls (e.g. local_irq_disable/enable).
> 
> This can be racy in case of NMIs when lockdep is enabled, and can lead
> to missing events when lockdep is disabled.
> 
> Remove dependency on the lockdep status in the NMI common entry/exit
> code and adapt the tracing code to make sure that:
> 
> - The first call disabling interrupts fires the tracepoint
> - The first non-NMI call enabling interrupts fires the tracepoint
> - The last NMI call enabling interrupts fires the tracepoint unless
>   interrupts were disabled before the NMI
> - All other calls don't fire

I'm not at all convinced this is correct. Nor can I understand anything
much about what you wrote above.


>  arch/arm64/kernel/entry-common.c |  5 ++---
>  kernel/entry/common.c            |  5 ++---
>  kernel/trace/trace_preemptirq.c  | 12 +++++++-----
>  3 files changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
> index 7c1970b341b8c..7f1844123642e 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -213,10 +213,9 @@ static void noinstr arm64_exit_nmi(struct pt_regs *regs)
>  	bool restore = regs->lockdep_hardirqs;
>  
>  	ftrace_nmi_exit();
> -	if (restore) {
> -		trace_hardirqs_on_prepare();
> +	trace_hardirqs_on_prepare();
> +	if (restore)
>  		lockdep_hardirqs_on_prepare();
> -	}
>  
>  	ct_nmi_exit();
>  	lockdep_hardirq_exit();
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index a8dd1f27417cf..e234f264fb495 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -343,10 +343,9 @@ void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
>  {
>  	instrumentation_begin();
>  	ftrace_nmi_exit();
> -	if (irq_state.lockdep) {
> -		trace_hardirqs_on_prepare();
> +	trace_hardirqs_on_prepare();
> +	if (irq_state.lockdep)
>  		lockdep_hardirqs_on_prepare();
> -	}
>  	instrumentation_end();
>  
>  	ct_nmi_exit();
> diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
> index 0c42b15c38004..fa45474fc54f1 100644
> --- a/kernel/trace/trace_preemptirq.c
> +++ b/kernel/trace/trace_preemptirq.c
> @@ -58,7 +58,11 @@ static DEFINE_PER_CPU(int, tracing_irq_cpu);
>   */
>  void trace_hardirqs_on_prepare(void)
>  {
> -	if (this_cpu_read(tracing_irq_cpu)) {
> +	int tracing_count = this_cpu_read(tracing_irq_cpu);
> +
> +	if (in_nmi() && tracing_count > 1)
> +		this_cpu_dec(tracing_irq_cpu);
> +	else if (tracing_count) {
>  		trace(irq_enable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
>  		tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
>  		this_cpu_write(tracing_irq_cpu, 0);
> @@ -89,8 +93,7 @@ NOKPROBE_SYMBOL(trace_hardirqs_on);
>   */
>  void trace_hardirqs_off_finish(void)
>  {
> -	if (!this_cpu_read(tracing_irq_cpu)) {
> -		this_cpu_write(tracing_irq_cpu, 1);
> +	if (this_cpu_inc_return(tracing_irq_cpu) == 1) {
>  		tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
>  		trace(irq_disable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
>  	}
> @@ -103,8 +106,7 @@ void trace_hardirqs_off(void)
>  {
>  	lockdep_hardirqs_off(CALLER_ADDR0);
>  
> -	if (!this_cpu_read(tracing_irq_cpu)) {
> -		this_cpu_write(tracing_irq_cpu, 1);
> +	if (this_cpu_inc_return(tracing_irq_cpu) == 1) {
>  		tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
>  		trace(irq_disable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
>  	}

So what about lovely things like:

trace_hardirqs_on_prepare()
  if (tracing_irq_cpu) {
    tracer_hardirqs_on();
    <NMI>
      trace_hardirqs_off_finish();
        if (this_cpu_inc_return(tracing_irq_cpu) == 1) // will be > 1, no event


So now we've traced IRQs are on, start an NMI, and lose the IRQs off
event. Well done.


This was all safe in that it would occasionally emit a duplicate state,
but no state was wrong. Both your attempts have broken things.

How about you fix your tool to accept duplicate state in the face of NMI
instead?
