[RFC PATCH 0/3] um: clean up mm creation - another attempt
Benjamin Berg
benjamin at sipsolutions.net
Wed Jan 17 11:54:35 PST 2024
On Wed, 2024-01-17 at 19:45 +0000, Anton Ivanov wrote:
> On 17/01/2024 17:17, Benjamin Berg wrote:
> > Hi,
> >
> > On Wed, 2023-09-27 at 11:52 +0200, Benjamin Berg wrote:
> > > [SNIP]
> > > Once we are there, we can look for optimizations. The fundamental
> > > problem is that page faults (even minor ones) are extremely expensive
> > > for us.
> > >
> > > Just throwing out ideas on what we could do:
> > > 1. SECCOMP, as that reduces the number of context switches.
> > > (Yes, I know I should resubmit the patchset)
> > > 2. Maybe we can disable/cripple page access tracking? If we
> > >    initially mark all pages as accessed by userspace (i.e.
> > >    pte_mkyoung), then we avoid a minor page fault on first access
> > >    (see the sketch after this list). Doing that will mess with
> > >    page eviction though.
> > > 3. Do DAX (direct_access) for files, i.e. mmap files directly in
> > >    the host kernel rather than through UML.
> > >    With a hostfs-like file system, one should be able to add an
> > >    intermediate block device that maps host files to physical pages,
> > >    then do DAX in the FS.
> > >    For disk images, the existing iomem infrastructure should be
> > >    usable; this should work with any DAX-enabled filesystem (ext2,
> > >    ext4, xfs, virtiofs, erofs).
> >
> > So, I experimented quite a bit over Christmas (including getting DAX to
> > work with virtiofs). At the end of all this my conclusion is that
> > insufficient page table synchronization is our main problem.
> >
> > Basically, right now we rely on the flush_tlb_* functions from the
> > kernel, but these are only called when TLB entries are removed, *not*
> > when new PTEs are added (there is also update_mmu_cache, but it isn't
> > enough either). Effectively this means that new page table entries will
> > often only be synced because the userspace code runs into an
> > unnecessary segfault.
> >
> > Really, what we need is a set_pte_at() implementation that marks the
> > memory range for synchronization. Then we can make sure we sync it
> > before switching to the userspace process (the equivalent of running
> > flush_tlb_mm_range right now).
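Roughly what I have in mind (sync_start/sync_end and um_track_pte() are
names I am making up here purely for illustration):

	/* Widen a per-mm window of addresses whose PTEs changed, so
	 * only that range has to be synced to the host before we
	 * return to userspace. */
	static inline void um_track_pte(struct mm_struct *mm,
					unsigned long addr)
	{
		struct mm_context *ctx = &mm->context;

		ctx->sync_start = min(ctx->sync_start, addr);
		ctx->sync_end = max(ctx->sync_end, addr + PAGE_SIZE);
	}

	#define set_pte_at(mm, addr, ptep, pte)		\
		do {					\
			set_pte(ptep, pte);		\
			um_track_pte(mm, addr);		\
		} while (0)

Before switching to the userspace process we would then sync just
[sync_start, sync_end) (much like flush_tlb_mm_range() does today) and
reset the window.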
> >
> > I think we should:
> > * Rewrite the userspace syscall code
> > - Support delaying the execution of syscalls
> > - Only support mmap/munmap/mprotect and LDT
> > - Do simple compression of consecutive syscalls here
> > - Drop the hand-written assembler
> > * Improve the tlb.c code
> > - remove the HVC abstraction
>
> Cool. That was not working particularly well. I tried to improve it a
> few times, but ripping it out and replacing it is probably a better idea.
Hm, now I realise that we still want mmap() syscall compression for the
kernel itself in tlb.c (rough sketch below, after the quoted list).
> > - never force immediate syscall execution
> > * Let set_pte_at() track which memory ranges need syncing
> > * At that point we should be able to:
> > - drop copy_context_skas0
> > - make flush_tlb_* no-ops
> > - drop flush_tlb_page from handle_page_fault
> > - move unmap() from flush_thread to init_new_context
> > (or do it as part of start_userspace)
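On the mmap() compression mentioned above, something along these lines;
the struct and helper names are invented for illustration:

	/* Merge consecutive mmap operations: extend a pending host
	 * mmap() when the next mapping is contiguous in both virtual
	 * address and file offset and has identical protection and fd. */
	struct pending_mmap {
		unsigned long addr;
		unsigned long len;
		unsigned long offset;
		int prot;
		int fd;
	};

	static bool try_merge_mmap(struct pending_mmap *op,
				   unsigned long addr, unsigned long len,
				   int prot, int fd, unsigned long offset)
	{
		if (op->len == 0 || op->fd != fd || op->prot != prot ||
		    op->addr + op->len != addr ||
		    op->offset + op->len != offset)
			return false;	/* caller flushes op, starts anew */

		op->len += len;
		return true;
	}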
> >
> > So, I did try this using nasty hacks and IIRC one of my runs went
> > from 21s to 16s and another from 63s to 56s, which seems like a nice
> > improvement.
>
> Excellent. I assume you were using hostfs as usual, right? If so, the
> difference is likely to be even more noticeable on ubd.
Yes, I was mostly testing hostfs. Initially also virtiofs with DAX, but
I went back as that didn't improve the page fault count once I made
some other adjustments.
Benjamin
>
> >
> > Benjamin
> >
> >
> > PS: As for DAX, it doesn't really seem to help performance. It didn't
> > seem to lower the number of page faults in UML. And, from my
> > perspective, it isn't worth it just for the memory sharing.
> >
> > PPS: dirty/young tracking seemed to cause only a small share of the
> > page faults in the grand scheme, so probably not something worth
> > following up on.
> >
>