[RFC PATCH v4 00/39] KVM: arm64: Add Statistical Profiling Extension (SPE) support

Suzuki K Poulose suzuki.poulose at arm.com
Wed Sep 22 03:11:44 PDT 2021


On 25/08/2021 17:17, Alexandru Elisei wrote:
> This is v4 of the SPE series posted at [1]. v2 can be found at [2], and the
> original series at [3].
> 
> Statistical Profiling Extension (SPE) is an optional feature added in
> ARMv8.2. It allows sampling at regular intervals of the operations executed
> by the PE and storing a record of each operation in a memory buffer. A high
> level overview of the extension is presented in an article on arm.com [4].
> 
> This is another complete rewrite of the series, and nothing is set in
> stone. If you think of a better way to do things, please suggest it.
> 
> 
> Features added
> ==============
> 
> The rewrite enabled me to add support for several features not
> present in the previous iteration:
> 
> - Support for heterogeneous systems, where only some of the CPUs support SPE.
>    This is accomplished via the KVM_ARM_VCPU_SUPPORTED_CPUS VCPU ioctl.
> 
> - Support for VM migration with the KVM_ARM_VCPU_SPE_CTRL(KVM_ARM_VCPU_SPE_STOP)
>    VCPU ioctl.
> 
> - The requirement for userspace to mlock() the guest memory has been removed,
>    and now userspace can make changes to memory contents after the memory is
>    mapped at stage 2.
> 
> - Better debugging of guest memory pinning by printing a warning when we
>    get an unexpected read or write fault. This helped me catch several bugs
>    during development, it has already proven very useful. Many thanks to
>    James who suggested when reviewing v3.
> 
> 
> Missing features
> ================
> 
> I've tried to keep the series as small as possible to make it easier to review,
> while implementing the core functionality needed for the SPE emulation. As such,
> I've chosen to not implement several features:
> 
> - Host profiling a guest which has the SPE feature bit set (see open
>    questions).
> 
> - No errata workarounds have been implemented yet, and there are quite a few of
>    them for Neoverse N1 and Neoverse V1.
> 
> - Disabling CONFIG_NUMA_BALANCING is a hack to get KVM SPE to work and I am
>    investigating other ways to get around automatic numa balancing, like
>    requiring userspace to disable it via set_mempolicy(). I am also going to
>    look at how VFIO gets around it. Suggestions welcome.
> 
> - There's plenty of room for optimization. Off the top of my head, using
>    block mappings at stage 2, batch pinning of pages (similar to what VFIO
>    does), optimize the way KVM keeps track of pinned pages (using a linked
>    list triples the memory usage), context-switch the SPE registers on
>    vcpu_load/vcpu_put on VHE if the host is not profiling, locking
>    optimizations, etc, etc.
> 
> - ...and others. I'm sure I'm missing at least a few things which are
>    important for someone.
> 
> 
> Known issues
> ============
> 
> This is an RFC, so keep in mind that almost definitely there will be scary
> bugs. For example, below is a list of known issues which don't affect the
> correctness of the emulation, and which I'm planning to fix in a future
> iteration:
> 
> - With CONFIG_PROVE_LOCKING=y, lockdep complains about lock contention when
>    the VCPU executes the dcache clean pending ops.
> 
> - With CONFIG_PROVE_LOCKING=y, KVM will hit a BUG at
>    kvm_lock_all_vcpus()->mutex_trylock(&vcpu->mutex) with more than 48
>    VCPUs.
> 
> This BUG statement can also be triggered with mainline. To reproduce it,
> compile kvmtool from this branch [5] and follow the instruction in the
> kvmtool commit message.
> 
> One workaround could be to stop trying to lock all VCPUs when locking a
> memslot and document the fact that it is required that no VCPUs are run
> before the ioctl completes, otherwise bad things might happen to the VM.
> 
> 
> Open questions
> ==============
> 
> 1. Implementing support for host profiling a guest with the SPE feature
> means setting the profiling buffer owning regime to EL2. While that is in
> effect,  PMBIDR_EL1.P will equal 1. This has two consequences: if the guest
> probes SPE during this time, the driver will fail; and the guest will be
> able to determine when it is profiled. I see two options here:

This doesn't mean the EL2 is owning the SPE. It only tells you that a
higher level EL is owning the SPE. It could as well be EL3. (e.g, 
MDCR_EL3.NSPB == 0 or 1). So I think this is architecturally correct,
as long as we trap the guest access to other SPE registers and inject
and UNDEF.


Thanks
Suzuki



More information about the linux-arm-kernel mailing list