NOHZ tick-stop error with ath10k SDIO
Fabio Estevam
festevam at gmail.com
Sat Sep 4 14:10:45 PDT 2021
Hi Thomas,

Thanks for your response.

On Fri, Sep 3, 2021 at 5:07 AM Thomas Gleixner <tglx at linutronix.de> wrote:
> Looked once more at the trace output. It seems to be incomplete. The
> last recording of softirq raise was at 379568us ~= 0.38s post boot, but
> the splat comes about 20 seconds post boot. Did your kernel trigger a
> WARN_ON before that splat? If so, that might have disabled tracing.

You are right. The WARN_ON only triggers after hostapd runs, which happens
at a much later stage.

> As you are triggering this manually by invoking hostapd and the machine
> should be still functional afterwards, can you please replace Paul's
> debug patch with the one below? Please remove the command line option
> and do the following:
>
> # echo 1 >/sys/kernel/debug/tracing/events/irq/softirq_raise/enable
> # echo 1 >/sys/kernel/debug/tracing/events/irq/softirq_entry/enable
> # echo 1 > /proc/sys/kernel/stack_tracer_enabled
> # hostapd ...
>
> Once the warning triggered do:
>
> # cat /sys/kernel/debug/tracing/trace >trace.txt
>
> That should give us the full trace data and hopefully a better
> understanding of the problem.

I did as suggested and here is trace.txt:
https://pastebin.com/VUfLRJ8a

Also, while investigating this problem I saw a commit that fixed a
similar issue:
e63052a5dd3c ("mlx5e: add add missing BH locking around napi_schdule()").

I then tried the same approach on the ath10k sdio driver:

diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index b746052737e0..eb705214f3f0 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -1363,8 +1363,11 @@ static void ath10k_rx_indication_async_work(struct work_struct *work)
 		ep->ep_ops.ep_rx_complete(ar, skb);
 	}
-	if (test_bit(ATH10K_FLAG_CORE_REGISTERED, &ar->dev_flags))
+	if (test_bit(ATH10K_FLAG_CORE_REGISTERED, &ar->dev_flags)) {
+		local_bh_disable();
 		napi_schedule(&ar->napi);
+		local_bh_enable();
+	}
 }

With this change, I no longer get the "NOHZ tick-stop error: Non-RCU local
softirq work is pending, handler #08!!!" error messages after launching
hostapd.
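
For readers following the thread, here is a rough userspace model of why the
BH bracket might make the message go away. It is only a sketch and not kernel
code: every model_* name is hypothetical. It assumes that napi_schedule()
called from process context with BHs enabled only sets the NET_RX_SOFTIRQ
pending bit (bit 3, mask 0x08, which would match "handler #08"), and that
local_bh_enable() runs pending softirqs when the outermost BH-disabled section
ends.

```c
/* Userspace model only; all names are hypothetical stand-ins for kernel concepts. */
enum { MODEL_NET_RX_SOFTIRQ = 3 };      /* bit 3, i.e. mask 0x08 */

unsigned int model_pending;             /* models the per-CPU softirq pending mask */
int model_bh_disable_depth;             /* nesting depth of BH-disabled sections */
int model_polls_run;                    /* how many times the modeled NAPI poll ran */

/* Models softirq processing: clear the NET_RX bit and "run" the poll. */
void model_run_pending(void)
{
	if (model_pending & (1u << MODEL_NET_RX_SOFTIRQ)) {
		model_pending &= ~(1u << MODEL_NET_RX_SOFTIRQ);
		model_polls_run++;
	}
}

void model_local_bh_disable(void)
{
	model_bh_disable_depth++;
}

/* Leaving the outermost BH-disabled section processes whatever is pending. */
void model_local_bh_enable(void)
{
	if (--model_bh_disable_depth == 0 && model_pending)
		model_run_pending();
}

/* Assumption: this only marks the softirq pending; nothing runs it here. */
void model_napi_schedule(void)
{
	model_pending |= 1u << MODEL_NET_RX_SOFTIRQ;
}

/* Returns the pending mask left behind when the worker finishes. */
unsigned int model_worker(int use_bh_bracket)
{
	model_pending = 0;
	model_bh_disable_depth = 0;

	if (use_bh_bracket)
		model_local_bh_disable();
	model_napi_schedule();
	if (use_bh_bracket)
		model_local_bh_enable();

	return model_pending;
}
```

In this model, model_worker(0) leaves mask 0x08 pending, which is the state
the NOHZ code would complain about if the CPU then tried to stop the tick,
while model_worker(1) leaves nothing pending because the poll already ran
when the bracket closed.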

Is this a proper fix?

Thanks,

Fabio Estevam