cacheflush completely broken, suspecting PAN+LPAE

Michał Pecio michal.pecio at gmail.com
Tue Nov 12 01:32:29 PST 2024


Hi Linus,

On Tue, 12 Nov 2024 02:15:19 +0100, Linus Walleij wrote:
> We are trying to locate the issue, which I think is the same as this
> but not sure:
> https://bugzilla.kernel.org/show_bug.cgi?id=219247

You can verify by asking the reporter to run the crashing program under
strace. If SIGSEGV follows a failed cacheflush, it's my bug most likely.

A straightforward repro of this bug:
gdb
GUILE_JIT_THRESHOLD=0 gdb
GUILE_JIT_THRESHOLD=-1 gdb

Expected outcome, in that order: segfault, segfault, working command prompt.

> I have been trying to replicate it on a Chromebook but didn't get so
> far yet because the installation is pretty idiosyncratic :/ also, there
> it only appears in a single Qt program and is not as predictable as here.

My bug also appears in a single program ;) This system works fine, but
any JIT is broken by this kind of bug. The failure may be random if the
caches resynchronize by a fluke, but with gdb it was every time so far.

> But. It appears the code is issuing cacheflush() which I guess ends
> up in arm_syscall() here:
> 
>         case NR(cacheflush):
>                 return do_cache_op(regs->ARM_r0, regs->ARM_r1, regs->ARM_r2);
> 
> To here:
> 
> static inline int
> do_cache_op(unsigned long start, unsigned long end, int flags)
> {
>         if (end < start || flags)
>                 return -EINVAL;
> 
>         if (!access_ok((void __user *)start, end - start))
>                 return -EFAULT;
> 
>         return __do_cache_op(start, end);
> }

Yep. I added printks here, and it is specifically the call to
flush_icache_range() from __do_cache_op() that returns -EFAULT.

> Here userspace access should be fine because we have entered a
> syscall from userspace. I tried to emulate the situation with this
> program:
> 
> #include <stdlib.h>
> #include <stdio.h>
> #include <errno.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/mman.h>
> 
> #define NR_cacheflush 0xf0002
> 
> /* libgcc */
> extern void __clear_cache(void *, void *);
> 
> int main (int argc, char **argv) {
>   void *addr;
>   int ret;
> 
>   printf("Test()\n");
>   addr = mmap(NULL, 0x1000, PROT_READ|PROT_WRITE|PROT_EXEC,
>               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
>   if (addr == MAP_FAILED) {
>     printf("mmap() failed\n");
>     exit(1);
>   }

This seems incomplete: there is no call to __clear_cache(). But if you
add one at the end then yes, it should fail. Since __clear_cache()
returns void, confirm the failure with strace.
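
A completed version of that test could look roughly like the sketch
below. It is only an illustration: try_cacheflush() is a name made up
here, the syscall number is the 0xf0002 from your define, and the raw
syscall is compiled only on 32-bit ARM Linux. It calls the compiler
builtin behind __clear_cache() first, then issues the syscall directly
so that the error code shows up even without strace.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* ARM-private syscall number, same value as in the quoted program. */
#define NR_cacheflush 0xf0002

/*
 * Hypothetical helper: map one RWX page, dirty it, and request an
 * I/D cache sync. Returns 0 on success, -1 if mmap() failed, or the
 * positive errno from the raw cacheflush syscall (32-bit ARM only).
 */
static int try_cacheflush(void)
{
    size_t len = 0x1000;
    char *addr;
    int ret = 0;

    addr = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        return -1;
    }

    /* Write to the page so it is actually populated. */
    memset(addr, 0, len);

    /* The libgcc path: returns void, so a failure is silent here. */
    __builtin___clear_cache(addr, addr + len);

#if defined(__arm__) && defined(__linux__)
    /* The raw syscall: a kernel-side error becomes visible as errno. */
    if (syscall(NR_cacheflush, addr, addr + len, 0) != 0) {
        ret = errno;
        fprintf(stderr, "cacheflush: %s\n", strerror(ret));
    }
#endif

    munmap(addr, len);
    return ret;
}
```

Call try_cacheflush() from main() and print the result. On an affected
kernel I would expect it to return EFAULT; on other architectures the
syscall part is skipped and only the builtin runs.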

> I added prints in the cacheflush trap:
> 
> diff --git a/arch/arm/kernel/traps.c b/arch/arm/kernel/traps.c
> index 480e307501bb..400650519bd1 100644
> --- a/arch/arm/kernel/traps.c
> +++ b/arch/arm/kernel/traps.c
> @@ -592,11 +592,14 @@ __do_cache_op(unsigned long start, unsigned long end)
>  static inline int
>  do_cache_op(unsigned long start, unsigned long end, int flags)
>  {
> +       pr_info("%s(%08lx-%08lx)\n", __func__, start, end);
>         if (end < start || flags)
>                 return -EINVAL;
> 
> -       if (!access_ok((void __user *)start, end - start))
> +       if (!access_ok((void __user *)start, end - start)) {
> +               pr_err("ACCESS NOT OK\n");
>                 return -EFAULT;
> +       }
> 
>         return __do_cache_op(start, end);
>  }

You also need to check what __do_cache_op() returns.

Regards,
Michal


