[RFC PATCH 0/3] um: clean up mm creation - another attempt
Anton Ivanov
anton.ivanov at cambridgegreys.com
Wed Jan 17 11:45:20 PST 2024
On 17/01/2024 17:17, Benjamin Berg wrote:
> Hi,
>
> On Wed, 2023-09-27 at 11:52 +0200, Benjamin Berg wrote:
>> [SNIP]
>> Once we are there, we can look for optimizations. The fundamental
>> problem is that page faults (even minor ones) are extremely expensive
>> for us.
>>
>> Just throwing out ideas on what we could do:
>> 1. SECCOMP as that reduces the amount of context switches.
>> (Yes, I know I should resubmit the patchset)
>> 2. Maybe we can disable/cripple page access tracking? If we
>> initially mark all pages as accessed by userspace (i.e.
>> pte_mkyoung), then we avoid a minor page fault on first access.
>> Doing that will mess with page eviction though.
>> 3. Do DAX (direct_access) for files. i.e. mmap files directly in the
>> host kernel rather than through UM.
>> With a hostfs like file system, one should be able to add an
>> intermediate block device that maps host files to physical pages,
>> then do DAX in the FS.
>> For disk images, the existing iomem infrastructure should be
>> usable, this should work with any DAX enabled filesystems (ext2,
>> ext4, xfs, virtiofs, erofs).
>
> So, I experimented quite a bit over Christmas (including getting DAX to
> work with virtiofs). At the end of all this my conclusion is that
> insufficient page table synchronization is our main problem.
>
> Basically, right now we rely on the flush_tlb_* functions from the
> kernel, but these are only called when TLB entries are removed, *not*
> when new PTEs are added (there is also update_mmu_cache, but it isn't
> enough either). Effectively this means that new page table entries will
> often only be synced because the userspace code runs into an
> unnecessary segfault.
>
> Really, what we need is a set_pte_at() implementation that marks the
> memory range for synchronization. Then we can make sure we sync it
> before switching to the userspace process (the equivalent of running
> flush_tlb_mm_range right now).
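
To make the set_pte_at() idea above concrete, here is a minimal
userspace C sketch. It is not taken from any patch in this series. The
names sync_range, mark_range_for_sync and take_pending_range are
hypothetical, and it assumes a single pending range per mm is enough;
a real implementation would likely keep this in the UML mm context and
call the marking helper from set_pte_at().

```c
#include <stdbool.h>

#define PAGE_SIZE 4096UL

/* Hypothetical per-mm record of the address range whose PTEs changed. */
struct sync_range {
    unsigned long start; /* inclusive */
    unsigned long end;   /* exclusive; start == end means nothing pending */
};

/* Would be called from a set_pte_at() hook: grow the pending range so
 * that it covers the page containing addr. */
static void mark_range_for_sync(struct sync_range *r, unsigned long addr)
{
    unsigned long page = addr & ~(PAGE_SIZE - 1);

    if (r->start == r->end) {
        /* Nothing pending yet: start a new one-page range. */
        r->start = page;
        r->end = page + PAGE_SIZE;
        return;
    }
    if (page < r->start)
        r->start = page;
    if (page + PAGE_SIZE > r->end)
        r->end = page + PAGE_SIZE;
}

/* Would be called before switching to the userspace process: hand out
 * the pending range (if any) and reset the tracker. */
static bool take_pending_range(struct sync_range *r,
                               unsigned long *start, unsigned long *end)
{
    if (r->start == r->end)
        return false;
    *start = r->start;
    *end = r->end;
    r->start = r->end = 0;
    return true;
}
```

A single range over-approximates when two distant pages are touched,
so the sync would then walk everything in between; that trade-off is
an assumption of this sketch, not a claim about the eventual design.
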
>
> I think we should:
>  * Rewrite the userspace syscall code
>    - Support delaying the execution of syscalls
>    - Only support mmap/munmap/mprotect and LDT
>    - Do simple compression of consecutive syscalls here
>    - Drop the hand-written assembler
>  * Improve the tlb.c code
>    - remove the HVC abstraction
Cool. That was not working particularly well. I tried to improve it a
few times, but ripping it out and replacing it is probably a better idea.
>    - never force immediate syscall execution
>  * Let set_pte_at() track which memory ranges need syncing
>  * At that point we should be able to:
>    - drop copy_context_skas0
>    - make flush_tlb_* no-ops
>    - drop flush_tlb_page from handle_page_fault
>    - move unmap() from flush_thread to init_new_context
>      (or do it as part of start_userspace)
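
As an illustration of the "delay and compress consecutive syscalls"
items in the list above, here is a small C sketch of a queue that
merges adjacent operations before they would be executed. It is only a
model under stated assumptions: the types op_kind, queued_op and
op_queue, the helper names, and the QUEUE_MAX limit are all
hypothetical, and LDT updates are left out.

```c
#include <stdbool.h>
#include <stddef.h>

enum op_kind { OP_MMAP, OP_MUNMAP, OP_MPROTECT };

/* One delayed address-space operation (hypothetical layout). */
struct queued_op {
    enum op_kind kind;
    unsigned long addr;
    unsigned long len;
    int prot;             /* unused for OP_MUNMAP */
    int fd;               /* OP_MMAP only */
    unsigned long offset; /* OP_MMAP only */
};

#define QUEUE_MAX 64

struct op_queue {
    struct queued_op ops[QUEUE_MAX];
    size_t count;
};

/* Two operations can be merged when they are of the same kind, cover
 * adjacent addresses, and agree on every other relevant parameter. */
static bool can_merge(const struct queued_op *last,
                      const struct queued_op *next)
{
    if (last->kind != next->kind || last->addr + last->len != next->addr)
        return false;
    switch (next->kind) {
    case OP_MUNMAP:
        return true;
    case OP_MPROTECT:
        return last->prot == next->prot;
    case OP_MMAP:
        return last->prot == next->prot && last->fd == next->fd &&
               last->offset + last->len == next->offset;
    }
    return false;
}

/* Queue an operation, extending the previous entry where possible.
 * Returns false when the queue is full; a caller would then have to
 * flush the queue and retry. */
static bool queue_op(struct op_queue *q, struct queued_op op)
{
    if (q->count > 0 && can_merge(&q->ops[q->count - 1], &op)) {
        q->ops[q->count - 1].len += op.len;
        return true;
    }
    if (q->count == QUEUE_MAX)
        return false;
    q->ops[q->count++] = op;
    return true;
}
```

Only the most recent entry is considered for merging, which matches
the "consecutive" wording and keeps the ordering of operations intact.
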
>
> So, I did try this using nasty hacks and IIRC one of my runs was going
> from 21s to 16s and another from 63s to 56s. Which seems like a nice
> improvement.
Excellent. I assume you were using hostfs as usual, right? If so, the
difference is likely to be even more noticeable on ubd.
>
> Benjamin
>
>
> PS: As for DAX, it doesn't really seem to help performance. It didn't
> seem to lower the number of page faults in UML. And, from my
> perspective, it isn't really worth it just for the memory sharing.
>
> PPS: dirty/young tracking seemed to cause only a small number of
> page faults in the grand scheme. So probably not something worth
> following up on.
>
--
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/