KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory

Wed Aug 10 02:37:26 PDT 2022

Hi,

On Tue, Aug 09, 2022 at 01:43:32PM -0500, Oliver Upton wrote:
> Hi Alex,
> 
> On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote:
> 
> [...]
> 
> > > > To summarize the approaches we've discussed so far:
> > > > 
> > > > 1. Pinning the entire guest memory
> > > > - Heavy handed and not ideal.
> > > > - Tried this approach in v5 of the SPE series [1], patches #2-#12.
> > > > 
> > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2
> > > > faults reported by SPE.
> > > > - Not feasible, because the entire contents of the buffer must be discarded is
> > > >   PMBSR_EL1.DL is set to 1 when taking the fault.
> > > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA,
> > > >   not the IPA.
> > > > 
> > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*:
> > > > - There is the corner case described above, when profiling becomes enabled as a
> > > >   result of an ERET to EL0. This can happen when the buffer is enabled and
> > > >   PMSCR_EL1.{E0SPE,E1SPE} = {1,0};
> > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE
> > > >   stage 2 faults when draining the buffer, which is performed with profiling
> > > >   disabled.
> > > > - Also requires KVM to walk the guest's stage 1 tables.
> > > > 
> > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by
> > > > SPE.
> > > > - Gets rid of the corner case at 3.
> > > > - Same approach to buffer unpinning as 3.
> > > > - Introduces a blackout window before the first record is written.
> > > > - Also requires KVM to walk the guest's stage 1 tables.
> > > > 
> > > > As for the corner case at 3, I proposed either:
> > > > 
> > > > a) Mandate that guest operating systems must never modify the buffer
> > > > translation entries if the buffer is enabled and
> > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}.
> > > > 
> > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE,
> > > > but **only** for this corner case. For all other cases, the buffer is pinned
> > > > when profiling becomes enabled, to eliminate the blackout window. Guest
> > > > operating systems can be modified to not change the translation entries for the
> > > > buffer if this blackout window is not desirable.
> > > > 
> > > > Pinning as a result of the **first** stage 2 fault should work, because there
> > > > are no prior records that would have to be discarded if PMSBR_EL1.DL = 1.
> > > > 
> > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome.
> > > 
> > > Thanks Alex for pulling together all of the context here.
> > > 
> > > Unless there's any other strong opinions on the topic, it seems to me
> > > that option #4 (pin on S2 fault) is probably the best approach for
> > > the initial implementation. No amount of tricks in KVM can work around
> > > the fact that SPE has some serious issues w.r.t. virtualization. With
> > > that, we should probably document the behavior of SPE as a known erratum
> > > of KVM.
> > > 
> > > If folks complain about EL1 profile blackout, eagerly pinning when
> > > profiling is enabled could layer on top quite easily by treating it as
> > > a synthetic S2 fault and triggering the implementation of #4. Having
> > 
> > I'm not sure I follow, I understand what you mean by "treating it as a
> > synthetic S2 fault", would you mind elaborating?
> 
> Assuming approach #4 is implemented, we will already have an SPE fault
> handler that walks stage-1 and pins the buffer. At that point,
> implementing approach #3 would be relatively easy. When EL1 sets
> PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer.

I see, that makes sense, thanks,

> 
> > > said that I don't believe it is a hard requirement for enabling some
> > > flavor of SPE for guests.
> > > 
> > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to
> > > be done eventually.
> > > 
> > > Do you feel like this is an OK route forward, or have I missed
> > > something?
> > 
> > I've been giving this some thought, and I prefer approach #3 because with
> > #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it
> > will be impossible to distinguish between a valid stage 2 fault (a fault
> > caused by the guest reprogramming the buffer and enabling profiling) and
> > KVM messing something up when pinning the buffer. I believe this to be
> > important, as experience has shown me that pinning the buffer at stage 2 is
> > not trivial and there isn't a mechanism today in Linux to do that
> > (explanation and examples here [1]).
> 
> How does eagerly pinning avoid stage-2 aborts, though? As you note in
> [1], page pinning does not avoid the possibility of the MMU notifiers
> being called on a given range. Want to make sure I'm following, what
> is your suggestion for approach #3 to handle the profile buffer when
> only enabled at EL0?
> 
> > With approach #4, it would be impossible to figure out if the results of a
> > profiling operations inside a guest are representative of the workload or
> > not, because those SPE stage 2 faults triggered by a bug in KVM can happen
> > multiple times per profiling session, introducing multiple blackout windows
> > that can skew the results.
> > 
> > If you're proposing that the blackout window when the first record is
> > written be documented as an erratum for KVM, then why no got a step further
> > and document as an erratum that changing the buffer translation tables
> > after the buffer has been enabled will lead to an SPE Serror? That will
> > allow us to always pin the buffer when profiling is enabled.
> 
> Ah, there are certainly more errata in virtualizing SPE beyond what I
> had said :) Preserving the stage-1 translations while profiling is
> active is a good recommendation, although I'm not sure that we've
> completely eliminated the risk of stage-2 faults. 
> 
> It seems impossible to blame the guest for all stage-2 faults that happen
> in the middle of a profiling session. In addition to host mm driven changes
> to stage-2, live migration is a busted as well. You'd need to build out
> stage-2 on the target before resuming the guest and guarantee that the
> appropriate pages have been demanded from the source (in case of post-copy).
> 
> So, are we going to inject an SError for stage-2 faults outside of guest
> control as well? An external abort reported as an SPE buffer management
> event seems to be gracefully handled by the Linux driver, but that behavior
> is disallowed by SPEv1p3.
> 
> To sum up the point I'm getting at: I agree that there are ways to
> reduce the risk of stage-2 faults in the middle of profiling, but I
> don't believe the current architecture allows KVM to virtualize the
> feature to the letter of the specification.

I believe there's some confusion here: emulating SPE **does not work** if
stage 2 faults are triggered in the middle of a profiling session. Being
able to have a memory range never unmapped from stage 2 is a
**prerequisite** and is **required** for SPE emulation, it's not a nice to
have.

A stage 2 fault before the first record is written is acceptable because
there are no other records already written which need to be thrown away.
Stage 2 faults after at least one record has been written are unacceptable
because it means that the contents of the buffer needs to thrown away.

Does that make sense to you?

I believe it is doable to have addresses always mapped at stage 2 with some
changes to KVM, but that's not what this thread is about. This thread is
about how and when to pin the buffer.

As long as we're all agreed that buffer memory needs "pinning" (as in the
IPA are never unmapped from stage 2 until KVM decides otherwise as part of
SPE emulation), I believe that live migration is tangential to figuring out
how and when the buffer should be "pinned". I'm more than happy to start a
separate thread about live migration after we figure out how we should go
about "pinning" the buffer, I think your insight would be most helpful :)

Thanks,
Alex

> 
> --
> Thanks,
> Oliver