[QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

Gang Li ligang.bdlg at bytedance.com
Fri May 5 05:28:35 PDT 2023


Hi,

I found that in `ghes_unmap`, which runs under a spinlock, arm64 and x86
use different strategies for flushing the TLB.

# arm64 call trace:
```
holding a spin lock
ghes_unmap
  clear_fixmap
   __set_fixmap
    flush_tlb_kernel_range
```

# x86 call trace:
```
holding a spin lock
ghes_unmap
  clear_fixmap
   __set_fixmap
    mmu.set_fixmap
     native_set_fixmap
      __native_set_fixmap
       set_pte_vaddr
        set_pte_vaddr_p4d
         __set_pte_vaddr
          flush_tlb_one_kernel
```

As we can see, on arm64 ghes_unmap eventually calls
flush_tlb_kernel_range, which broadcasts the TLB invalidation to all
CPUs. On x86, however, ghes_unmap calls flush_tlb_one_kernel, which only
invalidates the entry on the local CPU.

Why does arm64 need to broadcast TLB invalidation in ghes_unmap, when
only one CPU has accessed this memory area?

Mark Rutland explained in
https://lore.kernel.org/lkml/369d1be2-d418-1bfb-bfc2-b25e4e542d76@bytedance.com/:

> The architecture (arm64) allows a CPU to allocate TLB entries at any
> time for any reason, for any valid translation table entries reachable
> from the root in TTBR{0,1}_ELx. That can be due to speculation,
> prefetching, and/or other reasons.
>
> Due to that, it doesn't matter whether or not a CPU explicitly
> accesses a memory location -- TLB entries can be allocated regardless.
> Consequently, the spinlock doesn't make any difference.

So arm64 broadcasts TLB invalidation in ghes_unmap because TLB entries
can be allocated at any time, regardless of whether a CPU explicitly
accesses the memory.

Why doesn't x86 broadcast TLB invalidation in ghes_unmap? Is there a
difference between x86 and arm64 in their TLB allocation and
invalidation strategies?

Thanks,
Gang Li
