[RFC PATCH 00/17] KVM: arm64: Parallelize stage 2 fault handling

Fri Apr 15 17:04:33 PDT 2022

On Fri, Apr 15, 2022 at 04:35:24PM -0700, David Matlack wrote:
> On Fri, Apr 15, 2022 at 2:59 PM Oliver Upton <oupton at google.com> wrote:
> >
> > Presently KVM only takes a read lock for stage 2 faults if it believes
> > the fault can be fixed by relaxing permissions on a PTE (write unprotect
> > for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> > predictably can pile up all the vCPUs in a sufficiently large VM.
> >
> > The x86 port of KVM has what it calls the TDP MMU. Basically, it is an
> > MMU protected by the combination of a read-write lock and RCU, allowing
> > page walkers to traverse in parallel.
> >
> > This series is strongly inspired by the mechanics of the TDP MMU,
> > making use of RCU to protect parallel walks. Note that the TLB
> > invalidation mechanics are a bit different between x86 and ARM, so we
> > need to use the 'break-before-make' sequence to split/collapse a
> > block/table mapping, respectively.
> 
> An alternative (or perhaps "v2" [1]) is to make x86's TDP MMU
> arch-neutral and port it to support ARM's stage-2 MMU. This is based
> on a few observations:
> 
> - The problems that motivated the development of the TDP MMU are not
> x86-specific (e.g. parallelizing faults during the post-copy phase of
> Live Migration).
> - The synchronization in the TDP MMU (read/write lock, RCU for PT
> freeing, atomic compare-exchanges for modifying PTEs) is complex, but
> would be equivalent across architectures.
> - Eventually RISC-V is going to want similar performance (my
> understanding is RISC-V MMU is already a copy-paste of the ARM MMU),
> and it'd be a shame to re-implement TDP MMU synchronization a third
> time.
> - The TDP MMU includes support for various performance features that
> would benefit other architectures, such as eager page splitting,
> deferred zapping, lockless write-protection resolution, and (coming
> soon) in-place huge page promotion.
> - And then there's the obvious wins from less code duplication in KVM
> (e.g. get rid of the RISC-V MMU copy, increased code test coverage,
> ...).

I definitely agree with the observation -- we're all trying to solve the
same set of issues. And I completely agree that a good long term goal
would be to create some common parts for all architectures. Less work
for us ARM folks it would seem ;-)

What's top of mind is how we paper over the architectural differences
between all of the architectures, especially when we need to do entirely
different things because of the arch.

For example, I whine about break-before-make a lot throughout this
series which is somewhat unique to ARM. I don't think we can do eager
page splitting on the base architecture w/o doing the TLBI for every
block. Not only that, we can't do a direct valid->valid change without
first making an invalid PTE visible to hardware. Things get even more
exciting when hardware revisions relax break-before-make requirements.

There's also significant architectural differences between KVM on x86
and KVM for ARM. Our paging code runs both in the host kernel and the
hyp/lowvisor, and does:

 - VM two dimensional paging (stage 2 MMU)
 - Hyp's own MMU (stage 1 MMU)
 - Host kernel isolation (stage 2 MMU)

each with its own quirks. The 'not exactly in the kernel' part will make
instrumentation a bit of a hassle too.

None of this is meant to disagree with you in the slightest. I firmly
agree we need to share as many parts between the architectures as
possible. I'm just trying to call out a few of the things relating to
ARM that will make this annoying so that way whoever embarks on the
adventure will see it.

> The side of this I haven't really looked into yet is ARM's stage-2
> MMU, and how amenable it would be to being managed by the TDP MMU. But
> I assume it's a conventional page table structure mapping GPAs to
> HPAs, which is the most important overlap.
> 
> That all being said, an arch-neutral TDP MMU would be a larger, more
> complex code change than something like this series (hence my "v2"
> caveat above). But I wanted to get this idea out there since the
> rubber is starting to hit the road on improving ARM MMU scalability.

All for it. I cc'ed you on the series for this exact reason, I wanted to
grab your attention to spark the conversation :)

--
Thanks,
Oliver