[PATCH 00/14] KVM: arm64: Parallel stage-2 fault handling

Marc Zyngier maz at kernel.org
Tue Sep 6 03:00:09 PDT 2022


On Tue, 30 Aug 2022 20:41:18 +0100,
Oliver Upton <oliver.upton at linux.dev> wrote:
> 
> Presently KVM only takes a read lock for stage 2 faults if it believes
> the fault can be fixed by relaxing permissions on a PTE (write unprotect
> for dirty logging). Otherwise, stage 2 faults grab the write lock, which
> predictably can pile up all the vCPUs in a sufficiently large VM.
> 
> Like the TDP MMU for x86, this series loosens the locking around
> manipulations of the stage 2 page tables to allow parallel faults. RCU
> and atomics are exploited to safely build/destroy the stage 2 page
> tables in light of multiple software observers.
> 
> Patches 1-2 are a cleanup to the way we collapse page tables, with the
> added benefit of narrowing the window of time a range of memory is
> unmapped.
> 
> Patches 3-7 are minor cleanups and refactorings to the way KVM reads
> PTEs and traverses the stage 2 page tables to make it amenable to
> concurrent modification.
> 
> Patches 8-9 use RCU to punt page table cleanup out of the vCPU fault
> path, which should also improve fault latency a bit.
> 
> Patches 10-14 implement the meat of this series, extending the
> 'break-before-make' sequence with atomics to realize locking on PTEs.
> Effectively a cmpxchg() is used to 'break' a PTE, thereby serializing
> changes to a given PTE.
> 
> Finally, patch 15 flips the switch on all the new code and starts
> grabbing the read side of the MMU lock for stage 2 faults.
> 
> Applies to 6.0-rc3. Tested with KVM selftests and benchmarked with
> dirty_log_perf_test, scaling from 1 to 48 vCPUs with 4GB of memory per
> vCPU backed by THP.
> 
>   ./dirty_log_perf_test -s anonymous_thp -m 2 -b 4G -v ${NR_VCPUS}
> 
> Time to dirty memory:
> 
>         +-------+---------+------------------+
>         | vCPUs | 6.0-rc3 | 6.0-rc3 + series |
>         +-------+---------+------------------+
>         |     1 | 0.89s   | 0.92s            |
>         |     2 | 1.13s   | 1.18s            |
>         |     4 | 2.42s   | 1.25s            |
>         |     8 | 5.03s   | 1.36s            |
>         |    16 | 8.84s   | 2.09s            |
>         |    32 | 19.60s  | 4.47s            |
>         |    48 | 31.39s  | 6.22s            |
>         +-------+---------+------------------+
> 
> It is also worth mentioning that the time to populate memory has
> improved:
> 
>         +-------+---------+------------------+
>         | vCPUs | 6.0-rc3 | 6.0-rc3 + series |
>         +-------+---------+------------------+
>         |     1 | 0.19s   | 0.18s            |
>         |     2 | 0.25s   | 0.21s            |
>         |     4 | 0.38s   | 0.32s            |
>         |     8 | 0.64s   | 0.40s            |
>         |    16 | 1.22s   | 0.54s            |
>         |    32 | 2.50s   | 1.03s            |
>         |    48 | 3.88s   | 1.52s            |
>         +-------+---------+------------------+
> 
> RFC: https://lore.kernel.org/kvmarm/20220415215901.1737897-1-oupton@google.com/
> 
> RFC -> v1:
>  - Factored out page table teardown from kvm_pgtable_stage2_map()
>  - Use the RCU callback to tear down a subtree, instead of scheduling a
>    callback for every individual table page.
>  - Reorganized series to (hopefully) avoid intermediate breakage.
>  - Dropped the use of page headers, instead stuffing KVM metadata into
>    page::private directly
> 
> Oliver Upton (14):
>   KVM: arm64: Add a helper to tear down unlinked stage-2 subtrees
>   KVM: arm64: Tear down unlinked stage-2 subtree after break-before-make
>   KVM: arm64: Directly read owner id field in stage2_pte_is_counted()
>   KVM: arm64: Read the PTE once per visit
>   KVM: arm64: Split init and set for table PTE
>   KVM: arm64: Return next table from map callbacks
>   KVM: arm64: Document behavior of pgtable visitor callback
>   KVM: arm64: Protect page table traversal with RCU
>   KVM: arm64: Free removed stage-2 tables in RCU callback
>   KVM: arm64: Atomically update stage 2 leaf attributes in parallel
>     walks
>   KVM: arm64: Make changes block->table to leaf PTEs parallel-aware
>   KVM: arm64: Make leaf->leaf PTE changes parallel-aware
>   KVM: arm64: Make table->block changes parallel-aware
>   KVM: arm64: Handle stage-2 faults in parallel
> 
>  arch/arm64/include/asm/kvm_pgtable.h  |  59 ++++-
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c |   7 +-
>  arch/arm64/kvm/hyp/nvhe/setup.c       |   4 +-
>  arch/arm64/kvm/hyp/pgtable.c          | 360 ++++++++++++++++----------
>  arch/arm64/kvm/mmu.c                  |  65 +++--
>  5 files changed, 325 insertions(+), 170 deletions(-)

This fails to build on -rc4:

  MODPOST vmlinux.symvers
  MODINFO modules.builtin.modinfo
  GEN     modules.builtin
  CC      .vmlinux.export.o
  LD      .tmp_vmlinux.kallsyms1
ld: Unexpected GOT/PLT entries detected!
ld: Unexpected run-time procedure linkages detected!
ld: ID map text too big or misaligned
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_walk':
(.hyp.text+0xdc0c): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xdc1c): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_get_leaf':
(.hyp.text+0xdc80): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xdc90): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_hyp_map':
(.hyp.text+0xddb0): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xddc0): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_hyp_unmap':
(.hyp.text+0xde44): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xde50): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_hyp_destroy':
(.hyp.text+0xdf40): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xdf50): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_map':
(.hyp.text+0xe16c): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xe17c): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_set_owner':
(.hyp.text+0xe264): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xe274): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_unmap':
(.hyp.text+0xe2d4): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xe2e4): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_flush':
(.hyp.text+0xe5b4): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xe5c4): undefined reference to `__kvm_nvhe___rcu_read_unlock'
ld: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.o: in function `__kvm_nvhe_kvm_pgtable_stage2_destroy':
(.hyp.text+0xe6f0): undefined reference to `__kvm_nvhe___rcu_read_lock'
ld: (.hyp.text+0xe700): undefined reference to `__kvm_nvhe___rcu_read_unlock'
make[3]: *** [Makefile:1169: vmlinux] Error 1
make[2]: *** [debian/rules:7: build-arch] Error 2

as this drags the RCU read-lock into EL2, and that's not going to
work... The following fixes it, but I wonder how you tested it.

Thanks,

	M.

diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
index dc839db86a1a..adf170122daf 100644
--- a/arch/arm64/include/asm/kvm_pgtable.h
+++ b/arch/arm64/include/asm/kvm_pgtable.h
@@ -580,7 +580,7 @@ enum kvm_pgtable_prot kvm_pgtable_stage2_pte_prot(kvm_pte_t pte);
  */
 enum kvm_pgtable_prot kvm_pgtable_hyp_pte_prot(kvm_pte_t pte);
 
-#if defined(__KVM_NVHE_HYPERVISOR___)
+#if defined(__KVM_NVHE_HYPERVISOR__)
 
 static inline void kvm_pgtable_walk_begin(void) {}
 static inline void kvm_pgtable_walk_end(void) {}

-- 
Without deviation from the norm, progress is not possible.



More information about the linux-arm-kernel mailing list