[RFC PATCH v4 00/39] KVM: arm64: Add Statistical Profiling Extension (SPE) support
Suzuki K Poulose
suzuki.poulose at arm.com
Wed Sep 22 03:11:44 PDT 2021
On 25/08/2021 17:17, Alexandru Elisei wrote:
> This is v4 of the SPE series posted at [1]. v2 can be found at [2], and the
> original series at [3].
>
> Statistical Profiling Extension (SPE) is an optional feature added in
> ARMv8.2. It allows sampling at regular intervals of the operations executed
> by the PE and storing a record of each operation in a memory buffer. A high
> level overview of the extension is presented in an article on arm.com [4].
>
> This is another complete rewrite of the series, and nothing is set in
> stone. If you think of a better way to do things, please suggest it.
>
>
> Features added
> ==============
>
> The rewrite enabled me to add support for several features not
> present in the previous iteration:
>
> - Support for heterogeneous systems, where only some of the CPUs support SPE.
> This is accomplished via the KVM_ARM_VCPU_SUPPORTED_CPUS VCPU ioctl.
>
> - Support for VM migration with the KVM_ARM_VCPU_SPE_CTRL(KVM_ARM_VCPU_SPE_STOP)
> VCPU ioctl.
>
> - The requirement for userspace to mlock() the guest memory has been removed,
> and now userspace can make changes to memory contents after the memory is
> mapped at stage 2.
>
> - Better debugging of guest memory pinning by printing a warning when we
> get an unexpected read or write fault. This helped me catch several bugs
> during development, and it has already proven very useful. Many thanks to
> James, who suggested it when reviewing v3.
>
>
> Missing features
> ================
>
> I've tried to keep the series as small as possible to make it easier to review,
> while implementing the core functionality needed for the SPE emulation. As such,
> I've chosen to not implement several features:
>
> - Host profiling a guest which has the SPE feature bit set (see open
> questions).
>
> - No errata workarounds have been implemented yet, and there are quite a few of
> them for Neoverse N1 and Neoverse V1.
>
> - Disabling CONFIG_NUMA_BALANCING is a hack to get KVM SPE to work and I am
> investigating other ways to get around automatic numa balancing, like
> requiring userspace to disable it via set_mempolicy(). I am also going to
> look at how VFIO gets around it. Suggestions welcome.
>
> - There's plenty of room for optimization. Off the top of my head, using
> block mappings at stage 2, batch pinning of pages (similar to what VFIO
> does), optimize the way KVM keeps track of pinned pages (using a linked
> list triples the memory usage), context-switch the SPE registers on
> vcpu_load/vcpu_put on VHE if the host is not profiling, locking
> optimizations, etc, etc.
>
> - ...and others. I'm sure I'm missing at least a few things which are
> important for someone.
>
>
> Known issues
> ============
>
> This is an RFC, so keep in mind that there will almost definitely be scary
> bugs. For example, below is a list of known issues which don't affect the
> correctness of the emulation, and which I'm planning to fix in a future
> iteration:
>
> - With CONFIG_PROVE_LOCKING=y, lockdep complains about lock contention when
> the VCPU executes the dcache clean pending ops.
>
> - With CONFIG_PROVE_LOCKING=y, KVM will hit a BUG at
> kvm_lock_all_vcpus()->mutex_trylock(&vcpu->mutex) with more than 48
> VCPUs.
>
> This BUG statement can also be triggered with mainline. To reproduce it,
> compile kvmtool from this branch [5] and follow the instructions in the
> kvmtool commit message.
>
> One workaround could be to stop trying to lock all VCPUs when locking a
> memslot and document the fact that it is required that no VCPUs are run
> before the ioctl completes, otherwise bad things might happen to the VM.
>
>
> Open questions
> ==============
>
> 1. Implementing support for host profiling a guest with the SPE feature
> means setting the profiling buffer owning regime to EL2. While that is in
> effect, PMBIDR_EL1.P will equal 1. This has two consequences: if the guest
> probes SPE during this time, the driver will fail; and the guest will be
> able to determine when it is profiled. I see two options here:
This doesn't mean that EL2 owns the SPE. It only tells you that a
higher EL owns the SPE. It could just as well be EL3 (e.g.,
MDCR_EL3.NSPB == 0 or 1). So I think this is architecturally correct,
as long as we trap the guest's accesses to the other SPE registers and
inject an UNDEF.
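
To illustrate the point about PMBIDR_EL1.P, here is a minimal, self-contained
sketch of how a guest-side probe might interpret the register. It is not code
from this series or from the SPE driver. The field positions (Align in bits
[3:0], P in bit 4, F in bit 5) follow the architecture; the helper names and
the -1 return value are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* PMBIDR_EL1 fields: Align [3:0], P [4], F [5]. */
#define PMBIDR_EL1_ALIGN_MASK	0xfULL
#define PMBIDR_EL1_P		(1ULL << 4)
#define PMBIDR_EL1_F		(1ULL << 5)

/*
 * P == 1 only says that the profiling buffer is owned by a higher
 * exception level. It does not say whether that level is EL2 or EL3.
 */
static inline bool spe_buffer_owned_by_higher_el(uint64_t pmbidr)
{
	return (pmbidr & PMBIDR_EL1_P) != 0;
}

/* Minimum buffer alignment in bytes, encoded as log2 in the Align field. */
static inline uint64_t spe_buffer_align_bytes(uint64_t pmbidr)
{
	return 1ULL << (pmbidr & PMBIDR_EL1_ALIGN_MASK);
}

/*
 * Hypothetical probe step: refuse to use SPE while a higher EL owns
 * the buffer. Returns 0 on success and -1 otherwise (illustrative
 * error value, not a real errno).
 */
static inline int spe_probe_sketch(uint64_t pmbidr)
{
	if (spe_buffer_owned_by_higher_el(pmbidr))
		return -1;
	return 0;
}
```

A guest that sees P == 1 cannot tell who the owner is, which is why the value
by itself stays within what the architecture allows.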
Thanks
Suzuki