[PATCH V3] riscv: asid: Fixup stale TLB entry cause application crash

Thu Dec 8 19:15:19 PST 2022

On Fri, Dec 9, 2022 at 11:13 AM Guo Ren <guoren at kernel.org> wrote:
>
> On Fri, Dec 9, 2022 at 7:30 AM Palmer Dabbelt <palmer at rivosinc.com> wrote:
> >
> > On Fri, 18 Nov 2022 12:57:21 PST (-0800), geomatsi at gmail.com wrote:
> > > Hi Guo Ren,
> > >
> > >
> > >> After use_asid_allocator is enabled, the userspace application will
> > >> crash by stale TLB entries. Because only using cpumask_clear_cpu without
> > >> local_flush_tlb_all couldn't guarantee CPU's TLB entries were fresh.
> > >> Then set_mm_asid would cause the user space application to get a stale
> > >> value by stale TLB entry, but set_mm_noasid is okay.
> > >
> > > ... [snip]
> > >
> > >> +    /*
> > >> +     * The mm_cpumask indicates which harts' TLBs contain the virtual
> > >> +     * address mapping of the mm. Compared to noasid, using asid
> > >> +     * can't guarantee that stale TLB entries are invalidated because
> > >> +     * the asid mechanism wouldn't flush TLB for every switch_mm for
> > >> +     * performance. So when using asid, keep all CPUs footmarks in
> > >> +     * cpumask() until mm reset.
> > >> +     */
> > >> +    cpumask_set_cpu(cpu, mm_cpumask(next));
> > >> +    if (static_branch_unlikely(&use_asid_allocator)) {
> > >> +            set_mm_asid(next, cpu);
> > >> +    } else {
> > >> +            cpumask_clear_cpu(cpu, mm_cpumask(prev));
> > >> +            set_mm_noasid(next);
> > >> +    }
> > >>  }
> > >
> > > I observe similar user-space crashes on my SMP systems with enabled ASID.
> > > My attempt to fix the issue was a bit different, see the following patch:
> > >
> > > https://lore.kernel.org/linux-riscv/20220829205219.283543-1-geomatsi@gmail.com/
> > >
> > > In brief, the idea was borrowed from flush_icache_mm handling:
> > > - keep track of CPUs not running the task
> > > - perform per-ASID TLB flush on such CPUs only if the task is switched there
> >
> > That way looks better to me: leaking hartids in the ASID allocator might
> > make the crashes go away, but it's just going to end up trending towards
> > flushing everything and that doesn't seem like the right long-term
> > solution.
> The penalty in switch_mm is too heavy!!!
>  - If the system has multiple NUMA nodes, it will cause switch_mm_fast
> flush unnecessary harts.
>  - If flush_range is just 1 entry, it would case flush_tlb_all_asid.
>
> switch_mm_fast:
>         csr_write(CSR_SATP, virt_to_pfn(mm->pgd) |
>                   ((cntx & asid_mask) << SATP_ASID_SHIFT) |
>                   satp_mode);
>
>         if (need_flush_tlb)
>                 local_flush_tlb_all();
> +#ifdef CONFIG_SMP
> +        else {
> +                cpumask_t *mask = &mm->context.tlb_stale_mask;+
> +
> +               if (cpumask_test_cpu(cpu, mask)) {
> +                        cpumask_clear_cpu(cpu, mask);
> +                        local_flush_tlb_all_asid(cntx & asid_mask);
> // penalty in switch_mm fast path
> +                }
> +        }
> +#endif
>
> And See:
>  static void __sbi_tlb_flush_range(struct mm_struct *mm, unsigned long start,
>                                   unsigned long size, unsigned long stride)
>  {
> +       struct cpumask *pmask = &mm->context.tlb_stale_mask;
>         struct cpumask *cmask = mm_cpumask(mm);
>         unsigned int cpuid;
>         bool broadcast;
> @@ -44,6 +29,15 @@ static void __sbi_tlb_flush_range(struct mm_struct
> *mm, unsigned long start,
>         if (static_branch_unlikely(&use_asid_allocator)) {
>                 unsigned long asid = atomic_long_read(&mm->context.id);
>
> +               /*
> +                * TLB will be immediately flushed on harts concurrently
> +                * executing this MM context. TLB flush on other harts
> +                * is deferred until this MM context migrates there.
> +                */
> +               cpumask_setall(pmask);
>                  ^^^^^^^^^^^^^^^^^^^^^^^ It would flush all harts for
> all NUMA nodes!!! Most of them deferred to switch_mm_fast. The penalty
> contains unnecessary harts!
> +               cpumask_clear_cpu(cpuid, pmask);
> +               cpumask_andnot(pmask, pmask, cmask);
>
> Please reconsider a bit, and make a smart decision. Just penalty the
> harts who touched the mm, not all. And only flush the whole TLB when
                                                           ^^^^ ^^^^ Don't
> some entries are needed.
>
> The __sbi_tlb_flush_range is the slow path; keeping the fast path
> performance is worth more than improving a slow one.
>
> >
> > So I've got that one on for-next, sorry I missed it before.
> >
> > Thanks!
> >
> > >
> > > Your patch also works fine in my tests fixing those crashes. I have a
> > > question though, regarding removed cpumask_clear_cpu. How CPUs no more
> > > running the task are removed from its mm_cpumask ? If they are not
> > > removed, then flush_tlb_mm/flush_tlb_page will broadcast unnecessary
> > > TLB flushes to those CPUs when ASID is enabled.
> > >
> > > Regards,
> > > Sergey
>
>
>
> --
> Best Regards
>  Guo Ren


-- 
Best Regards
 Guo Ren