KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory

Wed Jul 27 09:06:30 PDT 2022

On Wed, Jul 27, 2022 at 11:38:53AM +0100, Alexandru Elisei wrote:
> Hi Marc,
> 
> On Wed, Jul 27, 2022 at 10:52:34AM +0100, Marc Zyngier wrote:
> > On Wed, 27 Jul 2022 10:30:59 +0100,
> > Marc Zyngier <maz at kernel.org> wrote:
> > > 
> > > On Tue, 26 Jul 2022 18:51:21 +0100,
> > > Oliver Upton <oliver.upton at linux.dev> wrote:
> > > > 
> > > > Doesn't pinning the buffer also imply pinning the stage 1 tables
> > > > responsible for its translation as well? I agree that pinning the buffer
> > > > is likely the best way forward as pinning the whole of guest memory is
> > > > entirely impractical.
> > 
> > Huh, I just realised that you were talking about S1. I don't think we
> > need to do this. As long as the translation falls into a mapped
> > region (pinned or not), we don't need to worry.

Right, but my issue is what happens when a fragment of the S1 tables
becomes unmapped at S2. We were discussing the idea of faulting once on
the buffer at the beginning of profiling, but it seems to me that the
fault could just as easily happen at runtime and get tripped up by what
Alex points out below:

> PMBSR_EL1.DL might be set to 1 as a result of stage 2 fault reported by SPE,
> which means the last record written is incomplete. Records have a variable
> size, so it's impossible for KVM to revert to the end of the last known
> good record without parsing the buffer (references here [1]). And even if
> KVM would know the size of a record, there's this bit in the Arm ARM which
> worries me (ARM DDI 0487H.a, page D10-5177):
> 
> "The architecture does not require that a sample record is written
> sequentially by the SPU, only that:
> [..]
> - On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates
>   whether PMBPTR_EL1 points to the first byte after the last complete
>   sample record."
> 
> So there might be gaps in the buffer, meaning that the entire buffer would
> have to be discarded if DL is set as a result of a stage 2 fault.
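
To make the consequence concrete, here is a rough sketch of the decision
KVM would face on a buffer management event. The field positions are my
reading of the PMBSR_EL1 layout in the Arm ARM; the enum and function
names are made up for illustration and are not existing KVM code:

```c
#include <stdint.h>

/* PMBSR_EL1 fields (assumed positions; check against the Arm ARM). */
#define PMBSR_S            (UINT64_C(1) << 17)  /* service: event pending */
#define PMBSR_DL           (UINT64_C(1) << 19)  /* part of a record was lost */
#define PMBSR_EC_SHIFT     26
#define PMBSR_EC_MASK      UINT64_C(0x3f)
#define PMBSR_EC_FAULT_S2  UINT64_C(0x25)       /* stage 2 data abort */

/* Hypothetical outcomes, named for this sketch only. */
enum spe_fault_action {
	SPE_NOT_S2_FAULT,    /* not a stage 2 fault, nothing for us to do */
	SPE_RESUME,          /* PMBPTR_EL1 is at a record boundary */
	SPE_DISCARD_BUFFER,  /* DL set: buffer contents cannot be trusted */
};

static enum spe_fault_action spe_classify_s2_fault(uint64_t pmbsr)
{
	uint64_t ec = (pmbsr >> PMBSR_EC_SHIFT) & PMBSR_EC_MASK;

	if (!(pmbsr & PMBSR_S) || ec != PMBSR_EC_FAULT_S2)
		return SPE_NOT_S2_FAULT;

	/*
	 * Records are variable-size and need not be written sequentially,
	 * so with DL set there is no known-good end of buffer to rewind to.
	 */
	if (pmbsr & PMBSR_DL)
		return SPE_DISCARD_BUFFER;

	return SPE_RESUME;
}
```

Only the DL=0 case lets us fix up the mapping and carry on without the
guest noticing anything beyond the delay.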

To avoid thrashing across more threads, I'm going to pull back in some
context from your original reply, Marc:

> > > > Live migration also throws a wrench in this. IOW, there are still potential
> > > > sources of blackout unattributable to guest manipulation of the SPU.
> > >
> > > Can you shine some light on this? I appreciate that you can't play the
> > > R/O trick on the SPE buffer as it invalidates the above discussion,
> > > but it should be relatively easy to track these pages and never reset
> > > them as clean until the vcpu is stopped. Unless you foresee other
> > > issues?

Right, we can play tricks on pre-copy to avoid write protecting the SPE
buffer. My concern was more around post-copy, where userspace could've
decided to leave the buffer behind and demand it back on the resulting
S2 fault.
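
For the pre-copy side, the trick would look something like the toy model
below: pages backing the SPE buffer are reported dirty on every pass and
never write-protected until the vCPU is stopped. This is a standalone
sketch over a 64-page bitmap; the struct and function are hypothetical
and do not correspond to KVM's actual dirty logging code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy dirty log covering 64 guest pages, one bit per page. */
struct dirty_log {
	uint64_t dirty;       /* pages written since the last harvest */
	uint64_t spe_buffer;  /* pages backing the SPE buffer (assumed tracked) */
};

/*
 * Harvest one pre-copy pass. Returns the pages userspace must copy and
 * stores in *write_protect the pages that may be made read-only again.
 */
static uint64_t harvest_dirty(struct dirty_log *log, bool vcpu_stopped,
			      uint64_t *write_protect)
{
	/* SPE writes do not fault on a writable page, so assume dirty. */
	uint64_t to_send = log->dirty | log->spe_buffer;

	if (vcpu_stopped) {
		/* The SPU is quiescent: the final pass can clean everything. */
		*write_protect = to_send;
		log->dirty = 0;
	} else {
		/* Leave the buffer writable so profiling never takes an S2 fault. */
		*write_protect = to_send & ~log->spe_buffer;
		log->dirty = log->spe_buffer;
	}

	return to_send;
}
```

The cost is that the buffer pages get resent on every pass, which seems
acceptable. Nothing comparable helps with post-copy, since there the
page may simply not be present on the destination yet.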

> > > To be clear, I don't worry too much about these blind windows. The
> > > architecture doesn't really give us the right tools to make it work
> > > reliably, making this a best effort only. Unless we pin the whole
> > > guest and forego migration and other fault-driven mechanisms.
> > >
> > > Maybe that is a choice we need to give to the user: cheap, fast,
> > > reliable. Pick two.

As long as we crisply document the errata in KVM's virtualized SPE (and
inform the guest), that sounds reasonable. I'm just uneasy about
proceeding with an implementation w/ so many gotchas unless all parties
involved are aware of the quirks.

--
Thanks,
Oliver