[RFC PATCH 0/3] um: clean up mm creation - another attempt

Wed Jan 17 09:17:49 PST 2024

Hi,

On Wed, 2023-09-27 at 11:52 +0200, Benjamin Berg wrote:
> [SNIP]
> Once we are there, we can look for optimizations. The fundamental
> problem is that page faults (even minor ones) are extremely expensive
> for us.
> 
> Just throwing out ideas on what we could do:
>    1. SECCOMP as that reduces the amount of context switches.
>       (Yes, I know I should resubmit the patchset)
>    2. Maybe we can disable/cripple page access tracking? If we assume
>       initially mark all pages as accessed by userspace (i.e.
>       pte_mkyoung), then we avoid a minor page fault on first access.
>       Doing that will mess with page eviction though.
>    3. Do DAX (direct_access) for files. i.e. mmap files directly in the
>       host kernel rather than through UM.
>       With a hostfs like file system, one should be able to add an
>       intermediate block device that maps host files to physical pages,
>       then do DAX in the FS.
>       For disk images, the existing iomem infrastructure should be
>       usable, this should work with any DAX enabled filesystems (ext2,
>       ext4, xfs, virtiofs, erofs).

So, I experimented quite a bit over Christmas (including getting DAX to
work with virtiofs). At the end of all this my conclusion is that
insufficient page table synchronization is our main problem.

Basically, right now we rely on the flush_tlb_* functions from the
kernel, but these are only called when TLB entries are removed, *not*
when new PTEs are added (there is also update_mmu_cache, but it isn't
enough either). Effectively this means that new page table entries will
often only be synced because the userspace code runs into an
unnecessary segfault.

Really, what we need is a set_pte_at() implementation that marks the
memory range for synchronization. Then we can make sure we sync it
before switching to the userspace process (the equivalent of running
flush_tlb_mm_range right now).
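
To make the idea a bit more concrete, here is a minimal standalone
sketch of the bookkeeping I have in mind. It is not actual UML code:
struct sync_range, mark_for_sync(), needs_sync(), take_pending() and
PAGE_SIZE_SKETCH are all made-up names, and it assumes a single pending
range per mm is good enough (a real implementation may well want
something finer grained).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page size for the sketch; the kernel would use PAGE_SIZE. */
#define PAGE_SIZE_SKETCH 4096UL

/* Hypothetical per-mm state: one pending range, empty when start == end. */
struct sync_range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/* What a set_pte_at()-like hook could call: widen the pending range so
 * that it covers the page containing addr. */
static void mark_for_sync(struct sync_range *r, unsigned long addr)
{
	unsigned long page = addr & ~(PAGE_SIZE_SKETCH - 1);

	assert(r->start <= r->end);

	if (r->start == r->end) {
		/* Nothing pending yet, start a new range. */
		r->start = page;
		r->end = page + PAGE_SIZE_SKETCH;
		return;
	}
	if (page < r->start)
		r->start = page;
	if (page + PAGE_SIZE_SKETCH > r->end)
		r->end = page + PAGE_SIZE_SKETCH;
}

static bool needs_sync(const struct sync_range *r)
{
	return r->start < r->end;
}

/* What would run before switching to the userspace process: hand out
 * the range that has to be synced and reset the bookkeeping. */
static struct sync_range take_pending(struct sync_range *r)
{
	struct sync_range out = *r;

	r->start = r->end = 0;
	return out;
}
```

The point of the single range is that set_pte_at() stays cheap, and the
actual work is deferred to the one place where we know we are about to
run userspace.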

I think we should:
 * Rewrite the userspace syscall code
   - Support delaying the execution of syscalls
   - Only support mmap/munmap/mprotect and LDT
   - Do simple compression of consecutive syscalls here
   - Drop the hand-written assembler
 * Improve the tlb.c code
   - remove the HVC abstraction
   - never force immediate syscall execution
 * Let set_pte_at() track which memory ranges need syncing
 * At that point we should be able to:
   - drop copy_context_skas0
   - make flush_tlb_* no-ops
   - drop flush_tlb_page from handle_page_fault
   - move unmap() from flush_thread to init_new_context
     (or do it as part of start_userspace)
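
For the "delay and compress consecutive syscalls" part, something along
the following lines is what I am thinking of. Again, this is only a
standalone sketch under my own assumptions: struct pending_op, struct
op_queue, queue_op() and MAX_PENDING are hypothetical names, LDT
handling is left out, and the only compression done is merging an
operation into the previous one when both are adjacent and otherwise
identical.

```c
#include <assert.h>
#include <stddef.h>

enum op_type { OP_MMAP, OP_MUNMAP, OP_MPROTECT };

/* Hypothetical description of one delayed syscall. */
struct pending_op {
	enum op_type type;
	unsigned long addr;
	unsigned long len;
	int prot;		/* unused (0) for OP_MUNMAP */
	unsigned long offset;	/* backing offset, only used for OP_MMAP */
};

#define MAX_PENDING 16

struct op_queue {
	struct pending_op ops[MAX_PENDING];
	size_t count;
};

/* Queue an operation for later execution. Returns 0 on success and -1
 * if the queue is full (the caller would then have to run the queue). */
static int queue_op(struct op_queue *q, struct pending_op op)
{
	assert(op.len > 0);

	if (q->count > 0) {
		struct pending_op *last = &q->ops[q->count - 1];
		int mergeable = last->type == op.type &&
				last->addr + last->len == op.addr &&
				last->prot == op.prot;

		/* mmap also needs the backing offsets to be consecutive. */
		if (mergeable && op.type == OP_MMAP)
			mergeable = last->offset + last->len == op.offset;

		if (mergeable) {
			last->len += op.len;
			return 0;
		}
	}

	if (q->count == MAX_PENDING)
		return -1;

	q->ops[q->count++] = op;
	return 0;
}
```

Only looking at the last queued entry keeps this trivial, and it should
already catch the common case of page-by-page updates over a contiguous
range.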

So, I did try this using nasty hacks and IIRC one of my runs went from
21s to 16s and another from 63s to 56s, which seems like a nice
improvement.

Benjamin

PS: As for DAX, it doesn't really seem to help performance. It didn't
seem to lower the number of page faults in UML. And, from my
perspective, it isn't really worth it just for the memory sharing.

PPS: dirty/young tracking seemed to cause only a small number of page
faults in the grand scheme of things. So probably not something worth
following up on.