KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory

Wed Jul 27 03:38:53 PDT 2022

Hi Marc,

On Wed, Jul 27, 2022 at 10:52:34AM +0100, Marc Zyngier wrote:
> On Wed, 27 Jul 2022 10:30:59 +0100,
> Marc Zyngier <maz at kernel.org> wrote:
> > 
> > On Tue, 26 Jul 2022 18:51:21 +0100,
> > Oliver Upton <oliver.upton at linux.dev> wrote:
> > > 
> > > Doesn't pinning the buffer also imply pinning the stage 1 tables
> > > responsible for its translation as well? I agree that pinning the buffer
> > > is likely the best way forward as pinning the whole of guest memory is
> > > entirely impractical.
> 
> Huh, I just realised that you were talking about S1. I don't think we
> need to do this. As long as the translation falls into a mapped
> region (pinned or not), we don't need to worry.
> 
> If we get a S2 translation fault from SPE, we just go and map it. And
> TBH the pinning here is just an optimisation against things like swap,
> KSM and similar things. The only thing we need to make sure is that
> the fault is handled in the context of the vcpu that owns this SPU.
> 
> Alex, can you think of anything that would cause a problem (other than
> performance and possible blackout windows) if we didn't do any pinning
> at all and just handled the SPE interrupts as normal page faults?

PMBSR_EL1.DL might be set to 1 as a result of a stage 2 fault reported by
SPE, which means the last record written is incomplete. Records have a
variable size, so it's impossible for KVM to revert to the end of the last
known good record without parsing the buffer (references here [1]). And
even if KVM knew the size of a record, there's this bit in the Arm ARM
which worries me (ARM DDI 0487H.a, page D10-5177):

"The architecture does not require that a sample record is written
sequentially by the SPU, only that:
[..]
- On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates
  whether PMBPTR_EL1 points to the first byte after the last complete
  sample record."

So there might be gaps in the buffer, meaning that the entire buffer would
have to be discarded if DL is set as a result of a stage 2 fault.
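
To make the DL concern concrete, here is a minimal sketch of how a fault
handler could classify a buffer management event from the PMBSR_EL1 value.
This is not KVM code: the function name and the enum are hypothetical, and
the field positions (S at bit 17, DL at bit 19, EC at bits [31:26], EC 0x25
for a stage 2 fault) are my reading of the register layout and should be
checked against the Arm ARM before relying on them.

```c
#include <stdint.h>

/* PMBSR_EL1 fields used here (assumed layout, see note above). */
#define PMBSR_S            (1ULL << 17)  /* service: a management event is pending */
#define PMBSR_DL           (1ULL << 19)  /* data loss: last record is incomplete */
#define PMBSR_EC_SHIFT     26
#define PMBSR_EC_MASK      0x3fULL
#define PMBSR_EC_FAULT_S2  0x25ULL       /* stage 2 fault on a buffer write */

/* Hypothetical outcomes, named for illustration only. */
enum spe_fault_action {
	SPE_NOT_S2_FAULT,    /* not a stage 2 fault, handle elsewhere */
	SPE_MAP_AND_RESUME,  /* map the page at stage 2, resume profiling */
	SPE_DISCARD_BUFFER,  /* DL set: buffer contents cannot be trusted */
};

static enum spe_fault_action spe_classify_s2_fault(uint64_t pmbsr)
{
	uint64_t ec = (pmbsr >> PMBSR_EC_SHIFT) & PMBSR_EC_MASK;

	if (!(pmbsr & PMBSR_S) || ec != PMBSR_EC_FAULT_S2)
		return SPE_NOT_S2_FAULT;

	/*
	 * With DL set, PMBPTR_EL1 does not point after the last complete
	 * record, and records need not be written sequentially, so there
	 * is no safe point to rewind to without parsing the buffer.
	 */
	if (pmbsr & PMBSR_DL)
		return SPE_DISCARD_BUFFER;

	return SPE_MAP_AND_RESUME;
}
```

The point of the sketch is the last branch: without pinning, every stage 2
fault with DL set turns into a full buffer discard rather than a cheap
"map and continue".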

Also, I'm not sure if you're aware of this, but SPE reports the guest VA in
PMBPTR_EL1 (not the IPA) on a fault, so KVM would have to walk the guest's
stage 1 tables to service the faults, which would add to the overhead of
servicing the fault. Don't know if that makes a difference, just thought I
should mention it as another peculiarity of SPE.
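
For a rough idea of what that walk costs, below is a simplified, standalone
sketch of a software stage 1 walk. It assumes a 4KB granule, 48-bit VAs and
a single TTBR, and it ignores permissions, TCR_EL1 configuration and
everything else a real walker has to respect. The names (s1_walk_va_to_ipa,
read_ipa_fn, toy_mem, toy_read) are made up for illustration and are not
existing KVM helpers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_VALID      (1ULL << 0)
#define DESC_TABLE      (1ULL << 1)  /* table at levels 0-2, page at level 3 */
#define DESC_ADDR_MASK  0x0000fffffffff000ULL  /* output address bits [47:12] */

/* Hypothetical callback: read one 64-bit descriptor at a guest IPA. */
typedef bool (*read_ipa_fn)(void *ctx, uint64_t ipa, uint64_t *desc);

/* Walk the guest's stage 1 tables to turn a VA into an IPA. */
static bool s1_walk_va_to_ipa(read_ipa_fn read_ipa, void *ctx,
			      uint64_t ttbr, uint64_t va, uint64_t *ipa)
{
	uint64_t table = ttbr & DESC_ADDR_MASK;

	for (int level = 0; level <= 3; level++) {
		unsigned int shift = 39 - 9 * level;  /* 39, 30, 21, 12 */
		uint64_t index = (va >> shift) & 0x1ff;
		uint64_t mask = (1ULL << shift) - 1;
		uint64_t desc;

		/* Each read is itself an IPA access that may need stage 2. */
		if (!read_ipa(ctx, table + index * 8, &desc))
			return false;
		if (!(desc & DESC_VALID))
			return false;

		if (level == 3) {
			if (!(desc & DESC_TABLE))
				return false;  /* reserved encoding */
			*ipa = (desc & DESC_ADDR_MASK) | (va & mask);
			return true;
		}
		if (!(desc & DESC_TABLE)) {
			if (level == 0)
				return false;  /* no level 0 blocks here */
			/* 1GB (level 1) or 2MB (level 2) block mapping. */
			*ipa = (desc & DESC_ADDR_MASK & ~mask) | (va & mask);
			return true;
		}
		table = desc & DESC_ADDR_MASK;
	}
	return false;
}

/* Toy flat guest memory, used only to exercise the walker. */
struct toy_mem {
	uint64_t base;       /* IPA of words[0] */
	uint64_t *words;
	size_t nwords;
	unsigned int reads;  /* number of descriptor reads performed */
};

static bool toy_read(void *ctx, uint64_t ipa, uint64_t *desc)
{
	struct toy_mem *mem = ctx;

	if (ipa < mem->base || (ipa - mem->base) / 8 >= mem->nwords)
		return false;
	*desc = mem->words[(ipa - mem->base) / 8];
	mem->reads++;
	return true;
}
```

With a page mapping this takes four descriptor reads per fault, and each of
those reads goes through an IPA that may not be mapped at stage 2 either.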

[1] https://lore.kernel.org/all/Yl7KewpTj+7NSonf@monolith.localdoman/

Thanks,
Alex