KASAN on arm causes recursive fault?

Thu Jul 11 03:18:58 PDT 2024

Hi arm-linux developers,
I have a device with an i.MX 6 DualLite CPU (dual-core Cortex-A9,
32-bit). To investigate a kernel issue I enabled KASAN (generic,
out-of-line). It produces legitimate reports, but it also produces
strange recursive faults.

The error starts like this:
[  160.506495] Insufficient stack space to handle exception!
[  160.506526] Task stack:     [0xa1000000..0xa1004000]
[  160.516925] IRQ stack:      [0xa0808000..0xa080c000]
[  160.521910] Overflow stack: [0x818f1000..0x818f2000]
[  160.526896] Internal error: kernel stack overflow: 0 [#1] SMP ARM
[  160.533015] Modules linked in: brcmfmac(O) brcmutil(O) bluetooth(O) cfg80211(O) compat(O) ad5592r ad5592r_base veml6040 ci_hdrc_imx ci_hdrc usbmisc_imx ehci_hcd imx_sdma virt_dma uio_pdrv_genirq uio
[  160.550862] CPU: 1 PID: 1009 Comm: hciconfig Tainted: G           O       6.1.83 #1
[  160.560899] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[  160.567445] PC is at __asan_load4+0x30/0x88
[  160.571677] LR is at do_translation_fault+0x2c/0x114

Then these frames are repeated 20+ times:
[  169.067504]  __dabt_svc from __asan_load4+0x30/0x88
[  169.072428]  __asan_load4 from do_translation_fault+0x2c/0x114
[  169.078312]  do_translation_fault from do_DataAbort+0x4c/0xec
[  169.084109]  do_DataAbort from __dabt_svc+0x4c/0x80

Eventually ending like this:
[  169.217995]  __dabt_svc from __schedule+0x484/0xb7c
[  169.222920]  __schedule from __cond_resched+0x48/0x64
[  169.228011]  __cond_resched from unmap_page_range+0x76c/0xc2c
[  169.233802]  unmap_page_range from unmap_vmas+0x15c/0x18c
[  169.239239]  unmap_vmas from exit_mmap+0x118/0x24c
[  169.244071]  exit_mmap from mmput+0x58/0x19c
[  169.248383]  mmput from do_exit+0x49c/0xea4
[  169.252611]  do_exit from do_group_exit+0x48/0xe4
[  169.257362]  do_group_exit from __wake_up_parent+0x0/0x2c
[  169.262819] Code: e3530002 9a000011 e1a011a1 e281145f (e1d110d0)
[  169.268934] ---[ end trace 0000000000000000 ]---
[  169.273568] note: hciconfig[1009] exited with irqs disabled
[  169.279714] Fixing recursive fault but reboot is needed!
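
For reference, the bracketed word in the Code: line (e1d110d0) is the
instruction at the faulting PC. It decodes to "ldrsb r1, [r1]", and the
two words before it to "lsr r1, r1, #3" and "add r1, r1, #0x5f000000",
which looks like the generic KASAN shadow byte lookup inside
__asan_load4. A minimal user-space sketch of that address calculation
follows. The offset is an assumption read off the instruction
immediate, not taken from the .config, and shadow_addr() and
shadow_example() are hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Generic KASAN maps 8 bytes of memory to 1 shadow byte. */
#define KASAN_SHADOW_SCALE_SHIFT 3

/*
 * ASSUMPTION: inferred from the "add r1, r1, #0x5f000000" immediate
 * (e281145f) in the Code: line; not confirmed from the kernel config.
 */
#define ASSUMED_KASAN_SHADOW_OFFSET 0x5f000000u

/* Address of the shadow byte that has to be read to check addr. */
static inline uint32_t shadow_addr(uint32_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + ASSUMED_KASAN_SHADOW_OFFSET;
}

/* Usage: shadow range covering the task stack from the report. */
void shadow_example(void)
{
	/* Task stack: [0xa1000000..0xa1004000] */
	assert(shadow_addr(0xa1000000u) == 0x73200000u);
	assert(shadow_addr(0xa1004000u) == 0x73200800u);
}
```

Under that assumption, the 16 KiB task stack would be covered by 2 KiB
of shadow memory starting at 0x73200000. The sketch only shows which
kind of access faults; it does not explain why the shadow page is
unmapped at that moment.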

There are a few variations for the end of the stack, but __schedule is
always the last frame before the abort. The fault sometimes happens
right after boot, sometimes later, but within 30 minutes the device is
usually done for.

It hinders the investigation I'm doing, so I would like to understand
why it's happening and get rid of it. Unfortunately I'm not a kernel
developer, so I need some assistance.

Things done so far:
I searched for information and found two relevant emails: one
reporting the same issue (without a solution), and one in the original
patchset introducing KASAN for arm that mentions it (in relation to
LPAE, though, which I don't have enabled):
https://lore.kernel.org/all/660da7ca-4fd8-459a-8d9b-cace5d9e5ad3@app.fastmail.com/
https://lore.kernel.org/all/20171011082227.20546-1-liuwenliang@huawei.com/

Since do_translation_fault is KASAN instrumented, and it's
__asan_load4 that causes the next fault, I tried to break the loop by
disabling instrumentation on do_translation_fault by adding
__no_sanitize_address to the function definition.
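
The shape of that change can be sketched in user space as below. This
is only an illustration of where the annotation goes, not the actual
kernel code: fault_handler_stub() is a hypothetical stand-in for
do_translation_fault, and the fallback macro definition is an
assumption (in the kernel, __no_sanitize_address comes from the
compiler headers).

```c
#include <assert.h>

/*
 * ASSUMPTION: stand-in for the kernel's __no_sanitize_address macro,
 * defined here so the sketch builds outside the kernel tree.
 */
#ifndef __no_sanitize_address
#define __no_sanitize_address __attribute__((no_sanitize_address))
#endif

/*
 * Hypothetical stand-in for a fault handler. With the attribute, the
 * compiler emits no sanitizer check for the load from *fsr in this
 * function; callers and callees are still instrumented as usual.
 */
__no_sanitize_address
int fault_handler_stub(const unsigned int *fsr)
{
	return (int)(*fsr & 0xfu);
}
```

The attribute applies per function, so only the memory accesses
compiled into the annotated function lose their checks.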

This works and the fault goes away completely. The problem is that I
can't judge the side effects of my change, so a few questions:
- Is there a better way to make the recursive fault go away?
- Is disabling the instrumentation in do_translation_fault a viable
workaround? Is KASAN still functional, or have I removed some essential
part of the mechanism?
- Is landing in do_translation_fault from __schedule considered
normal? Or is it already an indication of an error I should pay more
attention to?

Thanks for your help in advance!

Regards,
   Gyorgy