KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory

Tue Apr 19 06:51:05 PDT 2022

The approach I've taken so far in adding support for SPE in KVM [1] relies
on pinning the entire VM memory to avoid SPE triggering stage 2 faults
altogether. I've taken this approach because:

1. SPE reports the guest VA on a stage 2 fault, similar to stage 1 faults,
and at the moment KVM has no way to resolve the VA to IPA translation.  The
AT instruction is not useful here, because PAR_EL1 doesn't report the IPA
in the case of a stage 2 fault on a stage 1 translation table walk.

2. The stage 2 fault is reported asynchronously via an interrupt, which
means there will be a window where profiling is stopped, from the moment
SPE triggers the fault until the PE takes the interrupt. This blackout
window is obviously not present when running on bare metal, as there is no
second stage of address translation being performed.

I've been thinking about this approach and I was considering translating
the VA reported by SPE to the IPA instead, thus treating the SPE stage 2
data aborts more like regular (MMU) data aborts. As I see it, this approach
has several merits over memory pinning:

- The stage 1 translation table walker is also needed for nested
  virtualization, to emulate AT S1* instructions executed by the L1
  guest hypervisor.

- Walking the guest's translation tables is less of a departure from the
  way KVM manages physical memory for a virtual machine today.
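
To make the idea of a software stage 1 walk more concrete, here is a
minimal sketch of translating a VA to an IPA. It is not KVM code. It
assumes a 4KB granule, 48-bit VAs and a 4-level table, ignores
permissions, attributes and TCR_EL1 configuration, and treats level 0
block descriptors as faults. The names s1_walk, read_ipa_fn, sim_read_ipa
and sim_setup are hypothetical, and guest memory is simulated with a small
array so the example is self-contained.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define IDX_BITS   9
#define LEVELS     4
#define DESC_VALID (1ULL << 0)
#define DESC_TABLE (1ULL << 1)  /* levels 0-2: table; level 3: page */
#define ADDR_MASK  0x0000FFFFFFFFF000ULL  /* output address bits [47:12] */

enum walk_result { WALK_OK, WALK_S1_FAULT, WALK_S2_FAULT };

/* Hypothetical accessor: reads one descriptor at a guest IPA and returns
 * false if that IPA is not currently mapped at stage 2. */
typedef bool (*read_ipa_fn)(uint64_t ipa, uint64_t *desc);

static enum walk_result s1_walk(uint64_t ttbr, uint64_t va,
                                read_ipa_fn read_ipa, uint64_t *ipa)
{
    uint64_t table = ttbr & ADDR_MASK;

    for (int level = 0; level < LEVELS; level++) {
        unsigned int shift = PAGE_SHIFT + IDX_BITS * (LEVELS - 1 - level);
        uint64_t idx = (va >> shift) & ((1ULL << IDX_BITS) - 1);
        uint64_t mask = (1ULL << shift) - 1;
        uint64_t desc;

        /* The table itself may be unmapped at stage 2. A real walker
         * would have to fault the page in and retry. */
        if (!read_ipa(table + idx * sizeof(desc), &desc))
            return WALK_S2_FAULT;
        if (!(desc & DESC_VALID))
            return WALK_S1_FAULT;

        if (level == LEVELS - 1 || !(desc & DESC_TABLE)) {
            /* Leaf entry: a page at level 3, a block at level 1 or 2. */
            if (level == 0 || (level == LEVELS - 1 && !(desc & DESC_TABLE)))
                return WALK_S1_FAULT;
            *ipa = (desc & ADDR_MASK & ~mask) | (va & mask);
            return WALK_OK;
        }
        table = desc & ADDR_MASK;
    }
    return WALK_S1_FAULT;  /* not reached */
}

/* Simulated guest memory: page i lives at IPA i * 4KB. */
#define NPAGES 4
static uint64_t guest_mem[NPAGES][512];
static bool s2_mapped[NPAGES];

static bool sim_read_ipa(uint64_t ipa, uint64_t *desc)
{
    uint64_t page = ipa >> PAGE_SHIFT;

    if (page >= NPAGES || !s2_mapped[page])
        return false;
    *desc = guest_mem[page][(ipa & 0xFFF) / sizeof(*desc)];
    return true;
}

/* Builds a 4-level mapping for one VA (tables in pages 0..3, TTBR = 0)
 * that points at out_page_ipa, and returns that VA. */
static uint64_t sim_setup(uint64_t out_page_ipa)
{
    for (int i = 0; i < NPAGES; i++)
        s2_mapped[i] = true;
    guest_mem[0][1] = 0x1000ULL | DESC_TABLE | DESC_VALID;
    guest_mem[1][2] = 0x2000ULL | DESC_TABLE | DESC_VALID;
    guest_mem[2][3] = 0x3000ULL | DESC_TABLE | DESC_VALID;
    guest_mem[3][4] = out_page_ipa | DESC_TABLE | DESC_VALID;
    return (1ULL << 39) | (2ULL << 30) | (3ULL << 21) | (4ULL << 12) | 0x234;
}
```

The WALK_S2_FAULT case is the interesting one for SPE: the walker can
itself hit memory that is not mapped at stage 2, so it needs a way to
resolve that fault and restart the walk.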

I had a discussion with Mark offline about this approach and he expressed a
very sensible concern: when a guest is profiling, there is a blackout
window where profiling is stopped which doesn't happen on bare metal (point
2 above).

My questions are:

1. Is having this blackout window, regardless of its size, unacceptable?
If it is, then I'll continue with the memory pinning approach.

2. If having a blackout window is acceptable, how large can this window be
before it becomes too much? I can try to take some performance measurements
to evaluate the blackout window when using a stage 1 walker in relation to
the buffer write speed on different hardware. I have access to an N1SDP
machine and an Ampere Altra for this.

[1] https://lore.kernel.org/all/20211117153842.302159-1-alexandru.elisei@arm.com/

Thanks,
Alex