[RFC PATCH v1] KVM: arm64: Introduce KVM_CAP_ARM_SIGBUS_ON_SEA
Marc Zyngier
maz at kernel.org
Fri Nov 1 02:03:16 PDT 2024
On Thu, 31 Oct 2024 21:21:04 +0000,
Jiaqi Yan <jiaqiyan at google.com> wrote:
>
> Currently KVM handles SEA in guest by injecting async SError into
> guest directly, bypassing VMM, usually results in guest kernel panic.
>
> One major situation of guest SEA is when vCPU consumes uncorrectable
> memory error on the physical memory. Although SError and guest kernel
> panic effectively stops the propagation of corrupted memory, it is not
> easy for VMM and guest to recover from memory error in a more graceful
> manner.
>
> Alternatively KVM can send a SIGBUS BUS_OBJERR to VMM/vCPU, just like
> how core kernel signals SIGBUS BUS_OBJERR to the poison consuming
> thread.
Can you elaborate on why the delivery of a signal is preferable to
simply exiting back to userspace with a description of the error?
Signals are usually not generated by KVM, and are a pretty twisted way
to generate an exit.
> In addition to the benifit that KVM's handling for SEA becomes aligned
> with core kernel behavior
> - The blast radius in VM can be limited to only the consuming thread
> in guest, instead of entire guest kernel, unless the consumption is
> from guest kernel.
> - VMM now has the chance to do its duties to stop the VM from repeatedly
> consuming corrupted data. For example, VMM can unmap the guest page
> from stage-2 table to intercept forseen memory poison consumption,
Not quite. The VMM doesn't manage stage-2. It can remove the page from
the VMA if it has it mapped, but that's it. The kernel deals with S2.
Which brings me to the next subject: when the kernel unmaps the page
at S2, it is likely to use CMOs. Can these CMOs create RAS error
themselves?
> and for every consumption injects SEA to EL1 with synthetic memory
> error CPER.
Why do we need to involve ACPI here? I would expect the production of
an architected error record instead. Or at least be given the option.
> Introduce a new KVM ARM capability KVM_CAP_ARM_SIGBUS_ON_SEA. VMM
> can opt in this new capability if it prefers SIGBUS than SError
> injection during VM init. Now SEA handling in KVM works as follows:
> 1. Delegate to APEI/GHES to see if SEA can be claimed by them.
> 2. If APEI failed to claim the SEA and KVM_CAP_ARM_SIGBUS_ON_SEA is
> enabled for the VM, and the SEA is NOT about translation table,
> send SIGBUS BUS_OBJERR signal with host virtual address.
And what if it is? S1 PTs are backed by userspace memory, like
anything else. I don't think we should have a different treatment of
those, because the HW wouldn't treat them differently either.
> 3. Otherwise directly inject async SError to guest.
>
> Tested on a machine running Siryn AmpereOne processor. A in-house VMM
> that opts in KVM_CAP_ARM_SIGBUS_ON_SEA started a VM. A dummy application
> in VM allocated some memory buffer. The test used EINJ to inject an
> uncorrectable recoverable memory error at a page in the allocated memory
> buffer. The dummy application then consumed the memory error. Some hack
> was done to make core kernel's memory_failure triggered by poison
> generation to fail, so KVM had to deal with the SEA guest abort due to
> poison consumption. vCPU thread in VMM received SIGBUS BUS_OBJERR with
> valid host virtual address of the poisoned page. VMM then injected a SEA
> into guest using KVM_SET_VCPU_EVENTS with ext_dabt_pending=1. At last
> the dummy application in guest was killed by SIGBUS BUS_OBJERR, while the
> guest survived and continued to run.
>
> Signed-off-by: Jiaqi Yan <jiaqiyan at google.com>
> ---
> arch/arm64/include/asm/kvm_host.h | 2 +
> arch/arm64/include/asm/kvm_ras.h | 20 ++++----
> arch/arm64/kvm/Makefile | 2 +-
> arch/arm64/kvm/arm.c | 5 ++
> arch/arm64/kvm/kvm_ras.c | 77 +++++++++++++++++++++++++++++++
> arch/arm64/kvm/mmu.c | 8 +---
> include/uapi/linux/kvm.h | 1 +
> 7 files changed, 98 insertions(+), 17 deletions(-)
> create mode 100644 arch/arm64/kvm/kvm_ras.c
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index bf64fed9820ea..eb37a2489411a 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -334,6 +334,8 @@ struct kvm_arch {
> /* Fine-Grained UNDEF initialised */
> #define KVM_ARCH_FLAG_FGU_INITIALIZED 8
> unsigned long flags;
> + /* Instead of injecting SError into guest, SIGBUS VMM */
> +#define KVM_ARCH_FLAG_SIGBUS_ON_SEA 9
nit: why do you put this definition out of sequence (below 'flags')?
>
> /* VM-wide vCPU feature set */
> DECLARE_BITMAP(vcpu_features, KVM_VCPU_MAX_FEATURES);
> diff --git a/arch/arm64/include/asm/kvm_ras.h b/arch/arm64/include/asm/kvm_ras.h
> index 87e10d9a635b5..4bb7a424e3f6c 100644
> --- a/arch/arm64/include/asm/kvm_ras.h
> +++ b/arch/arm64/include/asm/kvm_ras.h
> @@ -11,15 +11,17 @@
> #include <asm/acpi.h>
>
> /*
> - * Was this synchronous external abort a RAS notification?
> - * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> + * Handle synchronous external abort (SEA) in the following order:
> + * 1. Delegate to APEI/GHES to see if SEA can be claimed by them. If so, we
> + * are all done.
> + * 2. If userspace opts in KVM_CAP_ARM_SIGBUS_ON_SEA, and if the SEA is NOT
> + * about translation table, send SIGBUS
> + * - si_code is BUS_OBJERR.
> + * - si_addr will be 0 when accurate HVA is unavailable.
> + * 3. Otherwise, directly inject an async SError to guest.
> + *
> + * Note this applies to both ESR_ELx_EC_IABT_* and ESR_ELx_EC_DABT_*.
> */
> -static inline int kvm_handle_guest_sea(phys_addr_t addr, u64 esr)
> -{
> - /* apei_claim_sea(NULL) expects to mask interrupts itself */
> - lockdep_assert_irqs_enabled();
> -
> - return apei_claim_sea(NULL);
> -}
> +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu);
>
> #endif /* __ARM64_KVM_RAS_H__ */
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 3cf7adb2b5038..c4a3a6d4870e6 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -23,7 +23,7 @@ kvm-y += arm.o mmu.o mmio.o psci.o hypercalls.o pvtime.o \
> vgic/vgic-v3.o vgic/vgic-v4.o \
> vgic/vgic-mmio.o vgic/vgic-mmio-v2.o \
> vgic/vgic-mmio-v3.o vgic/vgic-kvm-device.o \
> - vgic/vgic-its.o vgic/vgic-debug.o
> + vgic/vgic-its.o vgic/vgic-debug.o kvm_ras.o
>
> kvm-$(CONFIG_HW_PERF_EVENTS) += pmu-emul.o pmu.o
> kvm-$(CONFIG_ARM64_PTR_AUTH) += pauth.o
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 48cafb65d6acf..bb97ad678dbec 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -151,6 +151,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
> }
> mutex_unlock(&kvm->slots_lock);
> break;
> + case KVM_CAP_ARM_SIGBUS_ON_SEA:
> + r = 0;
> + set_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA, &kvm->arch.flags);
Shouldn't this be somehow gated on the VM being RAS aware?
> + break;
> default:
> break;
> }
> @@ -339,6 +343,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> case KVM_CAP_ARM_SYSTEM_SUSPEND:
> case KVM_CAP_IRQFD_RESAMPLE:
> case KVM_CAP_COUNTER_OFFSET:
> + case KVM_CAP_ARM_SIGBUS_ON_SEA:
> r = 1;
> break;
> case KVM_CAP_SET_GUEST_DEBUG2:
> diff --git a/arch/arm64/kvm/kvm_ras.c b/arch/arm64/kvm/kvm_ras.c
> new file mode 100644
> index 0000000000000..3225462bcbcda
> --- /dev/null
> +++ b/arch/arm64/kvm/kvm_ras.c
> @@ -0,0 +1,77 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include <linux/bitops.h>
> +#include <linux/kvm_host.h>
> +
> +#include <asm/kvm_emulate.h>
> +#include <asm/kvm_ras.h>
> +#include <asm/system_misc.h>
> +
> +/*
> + * For synchrnous external instruction or data abort, not on translation
> + * table walk or hardware update of translation table, is FAR_EL2 valid?
> + */
> +static inline bool kvm_vcpu_sea_far_valid(const struct kvm_vcpu *vcpu)
> +{
> + return !(vcpu->arch.fault.esr_el2 & ESR_ELx_FnV);
> +}
> +
> +/*
> + * Was this synchronous external abort a RAS notification?
> + * Returns '0' for errors handled by some RAS subsystem, or -ENOENT.
> + */
> +static int kvm_delegate_guest_sea(phys_addr_t addr, u64 esr)
> +{
> + /* apei_claim_sea(NULL) expects to mask interrupts itself */
> + lockdep_assert_irqs_enabled();
> + return apei_claim_sea(NULL);
> +}
> +
> +void kvm_handle_guest_sea(struct kvm_vcpu *vcpu)
> +{
> + bool sigbus_on_sea;
> + int idx;
> + u64 vcpu_esr = kvm_vcpu_get_esr(vcpu);
> + u8 fsc = kvm_vcpu_trap_get_fault(vcpu);
> + phys_addr_t fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> + gfn_t gfn = fault_ipa >> PAGE_SHIFT;
> + /* When FnV is set, send 0 as si_addr like what do_sea() does. */
> + unsigned long hva = 0UL;
> +
> + /*
> + * For RAS the host kernel may handle this abort.
> + * There is no need to SIGBUS VMM, or pass the error into the guest.
> + */
> + if (kvm_delegate_guest_sea(fault_ipa, vcpu_esr) == 0)
> + return;
> +
> + sigbus_on_sea = test_bit(KVM_ARCH_FLAG_SIGBUS_ON_SEA,
> + &(vcpu->kvm->arch.flags));
> +
> + /*
> + * In addition to userspace opt-in, SIGBUS only makes sense if the
> + * abort is NOT about translation table walk and NOT about hardware
> + * update of translation table.
> + */
> + sigbus_on_sea &= (fsc == ESR_ELx_FSC_EXTABT || fsc == ESR_ELx_FSC_SECC);
> +
> + /* Pass the error directly into the guest. */
> + if (!sigbus_on_sea) {
> + kvm_inject_vabt(vcpu);
> + return;
> + }
> +
> + if (kvm_vcpu_sea_far_valid(vcpu)) {
> + idx = srcu_read_lock(&vcpu->kvm->srcu);
> + hva = gfn_to_hva(vcpu->kvm, gfn);
> + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> + }
> +
> + /*
> + * Send a SIGBUS BUS_OBJERR to vCPU thread (the userspace thread that
> + * runs KVM_RUN) or VMM, which aligns with what host kernel do_sea()
> + * does if apei_claim_sea() fails.
> + */
> + arm64_notify_die("synchronous external abort",
> + current_pt_regs(), SIGBUS, BUS_OBJERR, hva, vcpu_esr);
This is the point where I really think we should simply trigger an
exit with all that syndrome information stashed in kvm_run, like any
other event requiring userspace help.
Also: where is the documentation?
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
More information about the linux-arm-kernel
mailing list