NOHZ tick-stop error with ath10k SDIO

Fabio Estevam festevam at gmail.com
Sat Sep 4 14:10:45 PDT 2021


Hi Thomas,

Thanks for your response.

On Fri, Sep 3, 2021 at 5:07 AM Thomas Gleixner <tglx at linutronix.de> wrote:

> Looked once more at the trace output. It seems to be incomplete. The
> last recording of softirq raise was at 379568us ~= 0.38s post boot, but
> the splat comes about 20 seconds post boot. Did your kernel trigger a
> WARN_ON before that splat? If so, that might have disabled tracing.

You are right. The WARN_ON only happens after hostapd runs, which is at a
much later stage.

> As you are triggering this manually by invoking hostapd and the machine
> should be still functional afterwards, can you please replace Paul's
> debug patch with the one below? Please remove the command line option
> and do the following:
>
> # echo 1 >/sys/kernel/debug/tracing/events/irq/softirq_raise/enable
> # echo 1 >/sys/kernel/debug/tracing/events/irq/softirq_entry/enable
> # echo 1 > /proc/sys/kernel/stack_tracer_enabled
> # hostapd ...
>
> Once the warning triggered do:
>
> # cat /sys/kernel/debug/tracing/trace >trace.txt
>
> That should give us the full trace data and hopefully a better
> understanding of the problem.

I did as suggested and here is trace.txt:
https://pastebin.com/VUfLRJ8a

Also, while investigating this problem I saw a commit that fixed a
similar issue:
e63052a5dd3c ("mlx5e: add add missing BH locking around napi_schdule()").

I then tried the same approach on the ath10k sdio driver:

diff --git a/drivers/net/wireless/ath/ath10k/sdio.c b/drivers/net/wireless/ath/ath10k/sdio.c
index b746052737e0..eb705214f3f0 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -1363,8 +1363,11 @@ static void ath10k_rx_indication_async_work(struct work_struct *work)
         ep->ep_ops.ep_rx_complete(ar, skb);
     }

-    if (test_bit(ATH10K_FLAG_CORE_REGISTERED, &ar->dev_flags))
+    if (test_bit(ATH10K_FLAG_CORE_REGISTERED, &ar->dev_flags)) {
+        local_bh_disable();
         napi_schedule(&ar->napi);
+        local_bh_enable();
+    }
 }

With this change applied, I no longer get the "NOHZ tick-stop error:
Non-RCU local softirq work is pending, handler #08!!!" error messages
after launching hostapd.
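
For illustration, here is a toy userspace model of why the BH-disabled
section may make a difference. It is not kernel code: the function names
mirror the kernel's, but the bodies are simplified assumptions, and
rx_work_unprotected() / rx_work_bh_protected() are hypothetical helpers
that stand in for the work function before and after the patch. The
model assumes that napi_schedule() called from process context only sets
the NET_RX_SOFTIRQ pending bit (bit 3 in mainline, which matches mask
0x08 in the message) and that local_bh_enable() runs pending softirqs
when the outermost BH-disabled section ends.

```c
#include <assert.h>

/* Toy model only; simplified assumptions, not the kernel implementation. */
enum { NET_RX_SOFTIRQ = 3 };              /* bit 3 -> pending mask 0x08 */
#define NET_RX_MASK (1u << NET_RX_SOFTIRQ)

static unsigned int softirq_pending;      /* stands in for the per-CPU mask */
static int bh_disable_depth;              /* nesting depth of BH-disable */
static int napi_polls;                    /* how often NET_RX "ran" */

/* Model of softirq processing: handle NET_RX and clear its pending bit. */
static void run_pending_softirqs(void)
{
    if (softirq_pending & NET_RX_MASK) {
        softirq_pending &= ~NET_RX_MASK;
        napi_polls++;
    }
}

static void local_bh_disable(void)
{
    bh_disable_depth++;
}

/* Leaving the outermost BH-disabled section runs whatever is pending. */
static void local_bh_enable(void)
{
    assert(bh_disable_depth > 0);
    if (--bh_disable_depth == 0 && softirq_pending)
        run_pending_softirqs();
}

/* Model of napi_schedule(): only marks NET_RX as pending. */
static void napi_schedule(void)
{
    softirq_pending |= NET_RX_MASK;
}

/* Hypothetical: work function before the patch. Returns the pending
 * mask that an idle-entry check would see afterwards. */
static unsigned int rx_work_unprotected(void)
{
    napi_schedule();
    return softirq_pending;
}

/* Hypothetical: work function after the patch. */
static unsigned int rx_work_bh_protected(void)
{
    local_bh_disable();
    napi_schedule();
    local_bh_enable();
    return softirq_pending;
}
```

In this model rx_work_unprotected() returns 0x08, because nothing
processes the raised softirq, while rx_work_bh_protected() returns 0,
because the pending bit is handled as soon as BH is re-enabled. Whether
the real driver behaves this way is exactly what the question below is
asking.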

Is this a proper fix?

Thanks,

Fabio Estevam


