[RFC PATCH 0/3] um: clean up mm creation - another attempt

Benjamin Berg benjamin at sipsolutions.net
Wed Jan 17 11:54:35 PST 2024


On Wed, 2024-01-17 at 19:45 +0000, Anton Ivanov wrote:
> On 17/01/2024 17:17, Benjamin Berg wrote:
> > Hi,
> > 
> > On Wed, 2023-09-27 at 11:52 +0200, Benjamin Berg wrote:
> > > [SNIP]
> > > Once we are there, we can look for optimizations. The fundamental
> > > problem is that page faults (even minor ones) are extremely expensive
> > > for us.
> > > 
> > > Just throwing out ideas on what we could do:
> > >     1. SECCOMP as that reduces the amount of context switches.
> > >        (Yes, I know I should resubmit the patchset)
> > >     2. Maybe we can disable/cripple page access tracking? If we
> > >        initially mark all pages as accessed by userspace (i.e.
> > >        pte_mkyoung), we avoid a minor page fault on first access.
> > >        Doing that will mess with page eviction, though.
> > >     3. Do DAX (direct_access) for files, i.e. mmap files directly in
> > >        the host kernel rather than through UM.
> > >        With a hostfs like file system, one should be able to add an
> > >        intermediate block device that maps host files to physical pages,
> > >        then do DAX in the FS.
> > >        For disk images, the existing iomem infrastructure should be
> > >        usable; this should work with any DAX-enabled filesystem (ext2,
> > >        ext4, xfs, virtiofs, erofs).
> > 
> > So, I experimented quite a bit over Christmas (including getting DAX to
> > work with virtiofs). At the end of all this my conclusion is that
> > insufficient page table synchronization is our main problem.
> > 
> > Basically, right now we rely on the flush_tlb_* functions from the
> > kernel, but these are only called when TLB entries are removed, *not*
> > when new PTEs are added (there is also update_mmu_cache, but it isn't
> > enough either). Effectively this means that new page table entries will
> > often only be synced because the userspace code runs into an
> > unnecessary segfault.
> > 
> > Really, what we need is a set_pte_at() implementation that marks the
> > memory range for synchronization. Then we can make sure we sync it
> > before switching to the userspace process (the equivalent of running
> > flush_tlb_mm_range right now).
> > 
> > I think we should:
> >   * Rewrite the userspace syscall code
> >     - Support delaying the execution of syscalls
> >     - Only support mmap/munmap/mprotect and LDT
> >     - Do simple compression of consecutive syscalls here
> >     - Drop the hand-written assembler
> >   * Improve the tlb.c code
> >     - remove the HVC abstraction
> 
> Cool. That was not working particularly well. I tried to improve it a
> few times, but ripping it out and replacing it is probably a better idea.

Hm, now I realise that we still want mmap() syscall compression for the
kernel itself in tlb.c.

> >     - never force immediate syscall execution
> >   * Let set_pte_at() track which memory ranges that need syncing
> >   * At that point we should be able to:
> >     - drop copy_context_skas0
> >     - make flush_tlb_* no-ops
> >     - drop flush_tlb_page from handle_page_fault
> >     - move unmap() from flush_thread to init_new_context
> >       (or do it as part of start_userspace)
> > 
> > So, I did try this using nasty hacks and IIRC one of my runs was going
> > from 21s to 16s and another from 63s to 56s. Which seems like a nice
> > improvement.
> 
> Excellent. I assume you were using hostfs as usual, right? If so, the
> difference is likely to be even more noticeable on ubd.

Yes, I was mostly testing hostfs. Initially I also tried virtiofs with
DAX, but I went back as it didn't reduce the page fault count once I
made some other adjustments.

Benjamin

> 
> > 
> > Benjamin
> > 
> > 
> > PS: As for DAX, it doesn't really seem to help performance. It didn't
> > seem to lower the number of page faults in UML. And, from my
> > perspective, it isn't really worth it just for the memory sharing.
> > 
> > PPS: dirty/young tracking only seemed to cause a small share of the
> > page faults in the grand scheme of things, so it's probably not worth
> > following up on.
> > 
> 




More information about the linux-um mailing list