[RFC PATCH v2 00/58] KVM: Arm SMMUv3 driver for pKVM
Mostafa Saleh
smostafa at google.com
Thu Dec 12 10:03:24 PST 2024
This is v2 of the series sent last year:
https://lore.kernel.org/kvmarm/20230201125328.2186498-1-jean-philippe@linaro.org/
pKVM overview:
=============
The pKVM hypervisor, recently introduced on arm64, provides a separation
of privileges between the host and hypervisor parts of KVM, where the
hypervisor is trusted by guests but the host is not [1][2]. The host is
initially trusted during boot, but its privileges are reduced after KVM
is initialized so that, if an adversary later gains access to the large
attack surface of the host, it cannot access guest data.
Currently with pKVM, the host can still instruct DMA-capable devices
like the GPU to access guest and hypervisor memory, which undermines
this isolation. Preventing DMA attacks requires an IOMMU, owned by the
hypervisor.
This series adds a hypervisor driver for the Arm SMMUv3 IOMMU. Since the
hypervisor part of pKVM (called nVHE here) is minimal, moving the whole
host SMMU driver into nVHE isn't really an option. It is too large and
complex and requires infrastructure from all over the kernel. We add a
reduced nVHE driver that deals with populating the SMMU tables and the
command queue, and the host driver still deals with probing and some
initialization.
Some of the pKVM infrastructure that this series depends on is not
upstream yet, so this should be considered a forward-looking RFC on how
DMA isolation can be supported in pKVM, or in other similar confidential
computing solutions, and not a ready-to-merge solution.
This is discussed further in the dependencies section below.
Patches overview
================
The patches are split as follows:
Patches 1-10: Mostly about splitting the current SMMUv3 driver and
io-pgtable-arm library, so the code can be re-used in the KVM driver
either inside the kernel or the hypervisor.
Most of these patches are best reviewed with git's --color-moved.
Patches 11-24: Introduce the hypervisor core code for IOMMUs which is
not specific to SMMUv3: the hypercall handlers and the common logic in
the hypervisor.
They also introduce the key functions __pkvm_host_{un}use_dma_page, which
are used to track DMA-mapped pages; more on this in the design section.
Patches 25-41: Add the hypervisor part of the KVM SMMUv3 driver, which
is called by the hypervisor core IOMMU code. These are para-virtualized
operations such as attach/detach, map/unmap...
Patches 42-54: Add the kernel part of the KVM SMMUv3 driver. It probes
the IOMMUs, initialises them and passes the list of SMMUs to the
hypervisor. It also implements the kernel iommu_ops and registers the
IOMMUs with the kernel.
Patches 55-58: Two extra optimizations, introduced at the end to avoid
complicating the start of the series. One optimises iommu_map_sg, and
the other batches TLB invalidation, which I noticed to be a problem
while testing as my HW doesn’t support range invalidation.
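To illustrate why batching matters without range invalidation, here is a
minimal sketch of the general idea, not the code from the patches. It only
counts commands and syncs; the names (toy_cmdq, toy_batch, TOY_BATCH_ENTRIES)
and the batch size are made up for the example, and I assume one
invalidation command per page followed by a sync.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical batch size, chosen only for this example. */
#define TOY_BATCH_ENTRIES 8
#define TOY_PAGE_SIZE     4096u

/* Toy command queue: only counts what would be sent to the SMMU. */
struct toy_cmdq {
	unsigned long cmds;	/* per-page invalidation commands */
	unsigned long syncs;	/* sync commands waited on */
};

struct toy_batch {
	uint64_t iova[TOY_BATCH_ENTRIES];
	size_t num;
};

/* Push all pending commands, then wait once for the whole batch. */
static void toy_batch_submit(struct toy_cmdq *q, struct toy_batch *b)
{
	if (!b->num)
		return;
	q->cmds += b->num;
	q->syncs++;
	b->num = 0;
}

/* Queue one invalidation, flushing first if the batch is full. */
static void toy_batch_add(struct toy_cmdq *q, struct toy_batch *b,
			  uint64_t iova)
{
	if (b->num == TOY_BATCH_ENTRIES)
		toy_batch_submit(q, b);
	b->iova[b->num++] = iova;
}

/* Without range invalidation: one command per page, batched syncs. */
void toy_inv_batched(struct toy_cmdq *q, uint64_t iova, size_t pgcount)
{
	struct toy_batch b = { .num = 0 };
	size_t i;

	for (i = 0; i < pgcount; i++)
		toy_batch_add(q, &b, iova + i * TOY_PAGE_SIZE);
	toy_batch_submit(q, &b);
}

/* Naive variant: every per-page command is followed by its own sync. */
void toy_inv_unbatched(struct toy_cmdq *q, uint64_t iova, size_t pgcount)
{
	size_t i;

	(void)iova;
	for (i = 0; i < pgcount; i++) {
		q->cmds++;
		q->syncs++;
	}
}
```

Both variants issue the same number of invalidation commands; the batched
one only reduces how often the caller has to wait for a sync.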
A development branch is available at:
https://android-kvm.googlesource.com/linux/+log/refs/heads/for-upstream/pkvm-smmu
Design
======
We've explored four solutions so far. Only two of them are described
here, the ones I believe are the most promising as they offer private IO
spaces; the others were discussed in the cover letter of v1.
1. Paravirtual I/O page tables
This is the solution implemented in this series. The host creates
IOVA->HPA mappings with two hypercalls map_pages() and unmap_pages(), and
the hypervisor populates the page tables. Page tables are abstracted into
IOMMU domains, which allow multiple devices to share the same address
space. Another four hypercalls, alloc_domain(), attach_dev(), detach_dev()
and free_domain(), manage the domains. The semantics of those hypercalls
are almost identical to the IOMMU ops, which makes the kernel driver part
simpler.
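As a rough mental model of these six operations, here is a toy,
single-threaded sketch in plain C. It is not the hypercall ABI from the
series: every name (toy_*), the argument lists, the flat array standing in
for a page table, and the rule that free_domain() fails while devices are
attached are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* All names and sizes are hypothetical, for illustration only. */
#define TOY_PAGE_SIZE  4096u
#define TOY_NR_DOMAINS 4
#define TOY_NR_DEVICES 4
#define TOY_NR_SLOTS   16	/* IOVA pages covered by one toy domain */

struct toy_domain {
	int allocated;
	int nr_devices;			/* devices currently attached */
	uint64_t hpa[TOY_NR_SLOTS];	/* flat "page table", 0 = unmapped */
};

static struct toy_domain toy_domains[TOY_NR_DOMAINS];
static int toy_dev_domain[TOY_NR_DEVICES] = { -1, -1, -1, -1 }; /* -1: detached */

static struct toy_domain *toy_get(int id)
{
	if (id < 0 || id >= TOY_NR_DOMAINS || !toy_domains[id].allocated)
		return NULL;
	return &toy_domains[id];
}

int toy_alloc_domain(int id)
{
	if (id < 0 || id >= TOY_NR_DOMAINS || toy_domains[id].allocated)
		return -1;
	memset(&toy_domains[id], 0, sizeof(toy_domains[id]));
	toy_domains[id].allocated = 1;
	return 0;
}

int toy_free_domain(int id)
{
	struct toy_domain *d = toy_get(id);
	if (!d || d->nr_devices)	/* assumed: refuse while devices attached */
		return -1;
	d->allocated = 0;
	return 0;
}

int toy_attach_dev(int dev, int id)
{
	struct toy_domain *d = toy_get(id);
	if (!d || dev < 0 || dev >= TOY_NR_DEVICES || toy_dev_domain[dev] != -1)
		return -1;
	toy_dev_domain[dev] = id;
	d->nr_devices++;
	return 0;
}

int toy_detach_dev(int dev, int id)
{
	struct toy_domain *d = toy_get(id);
	if (!d || dev < 0 || dev >= TOY_NR_DEVICES || toy_dev_domain[dev] != id)
		return -1;
	toy_dev_domain[dev] = -1;
	d->nr_devices--;
	return 0;
}

/* hpa == 0 clears the range; any other value maps contiguous pages. */
static int toy_update(int id, uint64_t iova, uint64_t hpa, size_t pgcount)
{
	struct toy_domain *d = toy_get(id);
	uint64_t first = iova / TOY_PAGE_SIZE, i;
	if (!d || first >= TOY_NR_SLOTS || pgcount > TOY_NR_SLOTS - first)
		return -1;
	for (i = 0; i < pgcount; i++)
		d->hpa[first + i] = hpa ? hpa + i * TOY_PAGE_SIZE : 0;
	return 0;
}

int toy_map_pages(int id, uint64_t iova, uint64_t hpa, size_t pgcount)
{
	return hpa ? toy_update(id, iova, hpa, pgcount) : -1;
}

int toy_unmap_pages(int id, uint64_t iova, size_t pgcount)
{
	return toy_update(id, iova, 0, pgcount);
}

/* Address a device access resolves to; 0 models a translation fault. */
uint64_t toy_translate(int dev, uint64_t iova)
{
	struct toy_domain *d = (dev < 0 || dev >= TOY_NR_DEVICES) ?
			       NULL : toy_get(toy_dev_domain[dev]);
	uint64_t slot = iova / TOY_PAGE_SIZE;
	if (!d || slot >= TOY_NR_SLOTS || !d->hpa[slot])
		return 0;
	return d->hpa[slot] + iova % TOY_PAGE_SIZE;
}
```

Two devices attached to the same toy domain see the same IOVA->HPA
mappings, which is the address-space sharing that domains provide.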
Some key points in the hypervisor design:
a- Tracking mapped pages: the hypervisor must prevent pages mapped in the
IOMMU from being donated to a protected guest or the hypervisor, and
must not allow a protected guest/hypervisor page to be mapped in an
IOMMU domain.
For that we rely on the vmemmap refcount: each time a page is mapped,
its ownership is checked and its refcount is incremented, and each
time it's successfully unmapped the refcount is decremented. Any
memory donation is denied for refcounted pages.
b- Locking: The io-pgtable-arm code is lockless under some guarantees
about how the IOMMU code behaves. However, with pKVM the kernel is not
trusted, and a malicious kernel can issue concurrent requests causing
memory corruption or UAF, so the page table has to be locked in the
hypervisor.
c- Memory management: The hypervisor needs a way to allocate pages for
the pv page tables. For that an IOMMU pool is created which can be
topped up from a hypercall, and the IOMMU hypercalls return encoded
memory requests which can be fulfilled by the kernel driver.
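The page tracking rule from point (a) can be sketched as a toy model. This
is not the hypervisor code: the names (toy_page, toy_use_dma_page(), ...)
are hypothetical, the "vmemmap" is a small static array, and the locking
from point (b) is left out because the example is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical owners; the real hypervisor tracks more page states. */
enum toy_owner { TOY_OWNER_HOST, TOY_OWNER_GUEST, TOY_OWNER_HYP };

struct toy_page {
	enum toy_owner owner;
	unsigned int dma_refcount;	/* number of live IOMMU mappings */
};

#define TOY_NR_PAGES 8

/* Zero-initialised: every page starts host-owned with no mappings. */
static struct toy_page toy_vmemmap[TOY_NR_PAGES];

/* Map path: only host-owned pages may be mapped; take a reference. */
int toy_use_dma_page(size_t pfn)
{
	if (pfn >= TOY_NR_PAGES || toy_vmemmap[pfn].owner != TOY_OWNER_HOST)
		return -1;
	toy_vmemmap[pfn].dma_refcount++;
	return 0;
}

/* Unmap path: drop the reference taken when the page was mapped. */
int toy_unuse_dma_page(size_t pfn)
{
	if (pfn >= TOY_NR_PAGES || toy_vmemmap[pfn].dma_refcount == 0)
		return -1;
	toy_vmemmap[pfn].dma_refcount--;
	return 0;
}

/* Donation is denied while the page is still mapped for DMA. */
int toy_donate_page(size_t pfn, enum toy_owner new_owner)
{
	if (pfn >= TOY_NR_PAGES)
		return -1;
	if (toy_vmemmap[pfn].owner != TOY_OWNER_HOST ||
	    toy_vmemmap[pfn].dma_refcount)
		return -1;
	toy_vmemmap[pfn].owner = new_owner;
	return 0;
}
```

In this model a page must be fully unmapped from every domain before it
can be donated, and once donated it can no longer be mapped by the host.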
2. Nested SMMUv3 translation (with emulation)
Another approach is to rely on nested translation support, which is
optional in SMMUv3. It requires an architecturally accurate emulation
of SMMUv3, including cmdq emulation, which can be complicated.
With this approach, we can use the same page tables as the CPU stage-2,
which adds more constraints on HW (SMMUv3 features must match the CPU)
and the ability of the devices to handle faults as the CPU part relies
on lazy mapping and has no guarantees about pages being mapped.
Or we can use a shadow IOMMU page table instead.
I have a prototype for nested translation that is not ready to be posted yet:
https://android-kvm.googlesource.com/linux/+log/refs/heads/smostafa/android15-6.6-smmu-nesting-wip
The trade off between the 2 approaches can be roughly summarised as:
Paravirtualization:
- Compatible with more HW (and IOMMUs).
- Better DMA performance due to shorter table walks/less TLB pressure
- Needs extra complexity to squeeze the last bit of optimization (around
unmap, and map_sg).
Nested emulation:
- Faster map_pages (not sure about unmap because it requires cmdq
emulation for TLB invalidation if DVM not used).
- Needs extra complexity for architecturally emulating SMMUv3.
I believe that the first approach looks more promising with this trade
off. However, I plan to complete the nested emulation and post it with
a comparison with this approach in terms of performance, and maybe this
topic can be discussed in an upcoming conference.
Dependencies
============
This series depends on some parts of pKVM that are not upstreamed yet,
some of which are currently posted[3][4]. However, to avoid spamming the
list with many changes that are not relevant to IOMMU/SMMUv3, they are
not included here and the patches are simply developed on top of them.
This series also depends on another series reworking the io-pgtable walker[5].
Performance
===========
Measured with CONFIG_DMA_MAP_BENCHMARK on a 4-core Morello board.
Numbers represent the average time needed for one dma_map/dma_unmap call
in μs, lower is better.
The pKVM driver is compared with the kernel driver, which is not quite a
fair comparison as the kernel driver doesn't fulfil pKVM DMA isolation
requirements. However, these numbers are provided just to give a rough
idea of what the overhead looks like.
                    Kernel driver    pKVM driver
4K - 1 thread       0.1/0.7          0.3/1.3
4K - 4 threads      0.1/1.1          0.5/3.3
1M - 1 thread       0.8/21.5         2.6/27.3
1M - 4 threads      1.1/45.7         3.6/46.2
And tested as follows:
echo dma_map_benchmark > /sys/bus/pci/devices/0000\:06\:00.0/driver_override
echo 0000:06:00.0 > /sys/bus/pci/devices/0000\:06\:00.0/driver/unbind
echo 0000:06:00.0 > /sys/bus/pci/drivers/dma_map_benchmark/bind
./dma_map_benchmark -t $threads -g $nr_pages
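For a quick read of the numbers above, the relative overhead can be
computed per row. The small helper below is only a convenience sketch: the
struct and function names are made up, the values are the ones from the
table converted to nanoseconds, and ratios are truncated integer
percentages (100 means "same as the kernel driver").

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One table row, times in nanoseconds (table values are in us). */
struct toy_sample {
	const char *name;
	unsigned int kernel_map_ns, kernel_unmap_ns;
	unsigned int pkvm_map_ns, pkvm_unmap_ns;
};

static const struct toy_sample toy_samples[] = {
	{ "4K - 1 thread",   100,   700,  300,  1300 },
	{ "4K - 4 threads",  100,  1100,  500,  3300 },
	{ "1M - 1 thread",   800, 21500, 2600, 27300 },
	{ "1M - 4 threads", 1100, 45700, 3600, 46200 },
};

/* pKVM time as a percentage of the kernel driver time, truncated. */
unsigned int toy_ratio_percent(unsigned int pkvm_ns, unsigned int kernel_ns)
{
	return pkvm_ns * 100u / kernel_ns;
}

/* Print one line per table row with map and unmap ratios. */
void toy_print_overheads(void)
{
	size_t i;

	for (i = 0; i < sizeof(toy_samples) / sizeof(toy_samples[0]); i++) {
		const struct toy_sample *s = &toy_samples[i];

		printf("%-15s map %u%% unmap %u%%\n", s->name,
		       toy_ratio_percent(s->pkvm_map_ns, s->kernel_map_ns),
		       toy_ratio_percent(s->pkvm_unmap_ns, s->kernel_unmap_ns));
	}
}
```

The map path shows the largest relative cost, while the unmap cost for 1M
with 4 threads is almost the same for both drivers.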
Future work
===========
- Add IDENTITY_DOMAIN support. I already have some patches for that, but
didn’t want to complicate this series; I can send them separately.
- Complete the comparison with the nesting support and find the most
suitable solution for upstream.
Main changes since v1
=====================
- Patches are reordered to split the introduction of the KVM IOMMU
code and the SMMUv3 driver.
- KVM EL2 code is closer to the EL1 code, where domains are decoupled
from IOMMUs.
- SMMUv3 new features (stage-1 support, IRQ and EVTQ in the kernel).
- Adaptions to the new SMMUv3 cleanups.
- Rework tracking of mapped pages to improve performance.
- Rework locking to improve performance.
- Rework unmap to improve performance.
- Adding iotlb_gather to optimize unmap.
- Add new operations to optimize map_sg operation.
- Driver registration is done dynamically instead of being statically
checked.
- Memory allocation for page table pages is changed to use a separate
pool and HVCs instead of the shared mc, which required atomic allocation.
- Support for higher order page allocation.
- Support for non-coherent SMMUs.
- Support for DABT and MMIO emulation.
[1] https://lore.kernel.org/kvmarm/20220519134204.5379-1-will@kernel.org/
[2] https://www.youtube.com/watch?v=9npebeVFbFw
[3] https://lore.kernel.org/kvmarm/20241203103735.2267589-1-qperret@google.com/
[4] https://lore.kernel.org/all/20241202154742.3611749-1-tabba@google.com/
[5] https://lore.kernel.org/linux-iommu/20241028213146.238941-1-robdclark@gmail.com/T/#t
Jean-Philippe Brucker (23):
iommu/io-pgtable-arm: Split the page table driver
iommu/io-pgtable-arm: Split initialization
iommu/io-pgtable: Add configure() operation
iommu/arm-smmu-v3: Move some definitions to arm64 include/
iommu/arm-smmu-v3: Extract driver-specific bits from probe function
iommu/arm-smmu-v3: Move some functions to arm-smmu-v3-common.c
iommu/arm-smmu-v3: Move queue and table allocation to
arm-smmu-v3-common.c
iommu/arm-smmu-v3: Move firmware probe to arm-smmu-v3-common
iommu/arm-smmu-v3: Move IOMMU registration to arm-smmu-v3-common.c
KVM: arm64: pkvm: Add pkvm_udelay()
KVM: arm64: pkvm: Add __pkvm_host_add_remove_page()
KVM: arm64: pkvm: Support SCMI power domain
KVM: arm64: iommu: Support power management
KVM: arm64: iommu: Add SMMUv3 driver
KVM: arm64: smmu-v3: Initialize registers
KVM: arm64: smmu-v3: Setup command queue
KVM: arm64: smmu-v3: Reset the device
KVM: arm64: smmu-v3: Support io-pgtable
iommu/arm-smmu-v3-kvm: Add host driver for pKVM
iommu/arm-smmu-v3-kvm: Pass a list of SMMU devices to the hypervisor
iommu/arm-smmu-v3-kvm: Validate device features
iommu/arm-smmu-v3-kvm: Allocate structures and reset device
iommu/arm-smmu-v3-kvm: Probe power domains
Mostafa Saleh (35):
iommu/arm-smmu-v3: Move common irq code to common file
KVM: arm64: Add __pkvm_{use, unuse}_dma()
KVM: arm64: Introduce IOMMU driver infrastructure
KVM: arm64: pkvm: Add IOMMU hypercalls
KVM: arm64: iommu: Add a memory pool for the IOMMU
KVM: arm64: iommu: Add domains
KVM: arm64: iommu: Add {attach, detach}_dev
KVM: arm64: iommu: Add map/unmap() operations
KVM: arm64: iommu: support iommu_iotlb_gather
KVM: arm64: Support power domains
KVM: arm64: iommu: Support DABT for IOMMU
KVM: arm64: smmu-v3: Setup stream table
KVM: arm64: smmu-v3: Setup event queue
KVM: arm64: smmu-v3: Add {alloc/free}_domain
KVM: arm64: smmu-v3: Add TLB ops
KVM: arm64: smmu-v3: Add context descriptor functions
KVM: arm64: smmu-v3: Add attach_dev
KVM: arm64: smmu-v3: Add detach_dev
iommu/io-pgtable: Generalize walker interface
iommu/io-pgtable-arm: Add post table walker callback
drivers/iommu: io-pgtable: Add IO_PGTABLE_QUIRK_UNMAP_INVAL
KVM: arm64: smmu-v3: Add map/unmap pages and iova_to_phys
KVM: arm64: smmu-v3: Add DABT handler
KVM: arm64: Add function to topup generic allocator
KVM: arm64: Add macro for SMCCC call with all returns
iommu/arm-smmu-v3-kvm: Add function to topup IOMMU allocator
iommu/arm-smmu-v3-kvm: Add IOMMU ops
iommu/arm-smmu-v3-kvm: Add map, unmap and iova_to_phys operations
iommu/arm-smmu-v3-kvm: Support PASID operations
iommu/arm-smmu-v3-kvm: Add IRQs for the driver
iommu/arm-smmu-v3-kvm: Enable runtime PM
drivers/iommu: Add deferred map_sg operations
KVM: arm64: iommu: Add hypercall for map_sg
iommu/arm-smmu-v3-kvm: Implement sg operations
iommu/arm-smmu-v3-kvm: Support command queue batching
arch/arm64/include/asm/arm-smmu-v3-common.h | 592 +++++++
arch/arm64/include/asm/kvm_asm.h | 9 +
arch/arm64/include/asm/kvm_host.h | 48 +-
arch/arm64/include/asm/kvm_hyp.h | 2 +
arch/arm64/kvm/Makefile | 2 +-
arch/arm64/kvm/arm.c | 8 +-
arch/arm64/kvm/hyp/hyp-constants.c | 1 +
arch/arm64/kvm/hyp/include/nvhe/iommu.h | 91 ++
arch/arm64/kvm/hyp/include/nvhe/mem_protect.h | 3 +
arch/arm64/kvm/hyp/include/nvhe/mm.h | 1 +
arch/arm64/kvm/hyp/include/nvhe/pkvm.h | 37 +
.../arm64/kvm/hyp/include/nvhe/trap_handler.h | 2 +
arch/arm64/kvm/hyp/nvhe/Makefile | 6 +-
arch/arm64/kvm/hyp/nvhe/alloc_mgt.c | 2 +
arch/arm64/kvm/hyp/nvhe/hyp-main.c | 114 ++
arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c | 1390 +++++++++++++++++
.../arm64/kvm/hyp/nvhe/iommu/io-pgtable-arm.c | 153 ++
arch/arm64/kvm/hyp/nvhe/iommu/iommu.c | 490 ++++++
arch/arm64/kvm/hyp/nvhe/mem_protect.c | 133 +-
arch/arm64/kvm/hyp/nvhe/mm.c | 17 +
arch/arm64/kvm/hyp/nvhe/power/hvc.c | 47 +
arch/arm64/kvm/hyp/nvhe/power/scmi.c | 231 +++
arch/arm64/kvm/hyp/nvhe/setup.c | 9 +
arch/arm64/kvm/hyp/nvhe/timer-sr.c | 42 +
arch/arm64/kvm/iommu.c | 89 ++
arch/arm64/kvm/mmu.c | 20 +
arch/arm64/kvm/pkvm.c | 20 +
drivers/gpu/drm/msm/msm_iommu.c | 5 +-
drivers/iommu/Kconfig | 9 +
drivers/iommu/Makefile | 2 +-
drivers/iommu/arm/arm-smmu-v3/Makefile | 7 +
.../arm/arm-smmu-v3/arm-smmu-v3-common.c | 824 ++++++++++
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c | 1093 +++++++++++++
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 989 +-----------
drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 758 +++------
drivers/iommu/io-pgtable-arm-common.c | 929 +++++++++++
drivers/iommu/io-pgtable-arm.c | 1061 +------------
drivers/iommu/io-pgtable-arm.h | 30 -
drivers/iommu/io-pgtable.c | 15 +
drivers/iommu/iommu.c | 53 +-
include/kvm/arm_smmu_v3.h | 46 +
include/kvm/iommu.h | 59 +
include/kvm/power_domain.h | 24 +
include/linux/io-pgtable-arm.h | 233 +++
include/linux/io-pgtable.h | 38 +-
include/linux/iommu.h | 43 +-
46 files changed, 7169 insertions(+), 2608 deletions(-)
create mode 100644 arch/arm64/include/asm/arm-smmu-v3-common.h
create mode 100644 arch/arm64/kvm/hyp/include/nvhe/iommu.h
create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/arm-smmu-v3.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/io-pgtable-arm.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/iommu/iommu.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/power/hvc.c
create mode 100644 arch/arm64/kvm/hyp/nvhe/power/scmi.c
create mode 100644 arch/arm64/kvm/iommu.c
create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-common.c
create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-kvm.c
create mode 100644 drivers/iommu/io-pgtable-arm-common.c
delete mode 100644 drivers/iommu/io-pgtable-arm.h
create mode 100644 include/kvm/arm_smmu_v3.h
create mode 100644 include/kvm/iommu.h
create mode 100644 include/kvm/power_domain.h
create mode 100644 include/linux/io-pgtable-arm.h
--
2.47.0.338.g60cca15819-goog