[PATCH 22/30] KVM: arm64: Return -EFAULT from VCPU_RUN on access to a poisoned pte

Fri Jan 9 09:35:15 PST 2026

On Fri, Jan 09, 2026 at 03:29:38PM +0000, Quentin Perret wrote:
> On Friday 09 Jan 2026 at 14:57:10 (+0000), Will Deacon wrote:
> > On Tue, Jan 06, 2026 at 03:54:06PM +0000, Quentin Perret wrote:
> > > On Monday 05 Jan 2026 at 15:49:30 (+0000), Will Deacon wrote:
> > > > +int __pkvm_vcpu_in_poison_fault(struct pkvm_hyp_vcpu *hyp_vcpu)
> > > > +{
> > > > +	struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(hyp_vcpu);
> > > > +	kvm_pte_t pte;
> > > > +	s8 level;
> > > > +	u64 ipa;
> > > > +	int ret;
> > > > +
> > > > +	switch (kvm_vcpu_trap_get_class(&hyp_vcpu->vcpu)) {
> > > > +	case ESR_ELx_EC_DABT_LOW:
> > > > +	case ESR_ELx_EC_IABT_LOW:
> > > > +		if (kvm_vcpu_trap_is_translation_fault(&hyp_vcpu->vcpu))
> > > > +			break;
> > > > +		fallthrough;
> > > > +	default:
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > > > +	ipa = kvm_vcpu_get_fault_ipa(&hyp_vcpu->vcpu);
> > > > +	ipa |= kvm_vcpu_get_hfar(&hyp_vcpu->vcpu) & GENMASK(11, 0);
> > > 
> > > Why is all the above needed? Could we simplify by having the host pass
> > > the IPA to the hcall?
> > 
> > I was just a little nervous about exposing an oracle here if we take the
> > gfn as an argument as it would provide the host with a pretty easy
> > mechanism to monitor the page access pattern of a guest after the initial
> > donation had occurred.
> 
> Aha, I see what you mean. I guess if we scope that hcall to only
> discover if a gfn is poisoned we're not exposing too much, but
> contextualizing the call to the fault also sounds good to me. Perhaps a
> small comment would help?

Good idea, I'll add something to the hypercall.
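
For the comment, the point to capture is that the IPA comes from the vCPU's
own fault state and never from a host-supplied argument. As a rough
user-space model of the address reconstruction quoted above (the function
and macro names here are made up for illustration, and it assumes the fault
IPA is page-aligned with 4KiB pages so that the low 12 bits come from the
faulting address):

```c
#include <stdint.h>

/* Same value as GENMASK(11, 0) in the kernel: the offset within a 4KiB page. */
#define MODEL_PAGE_OFFSET_MASK	0xfffULL

/*
 * Hypothetical stand-in for the two lines in __pkvm_vcpu_in_poison_fault():
 * take the page-aligned fault IPA and fill in the page offset from the
 * faulting address recorded for the vCPU.
 */
static uint64_t model_fault_ipa(uint64_t page_aligned_ipa, uint64_t hfar)
{
	return page_aligned_ipa | (hfar & MODEL_PAGE_OFFSET_MASK);
}
```

Since both inputs are latched at EL2 when the guest faults, the host can only
ask about the address the vCPU actually trapped on.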

> > > > diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
> > > > index d1926cb08c76..14865907610c 100644
> > > > --- a/arch/arm64/kvm/pkvm.c
> > > > +++ b/arch/arm64/kvm/pkvm.c
> > > > @@ -417,10 +417,13 @@ int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > > >  			return -EINVAL;
> > > >  
> > > >  		/*
> > > > -		 * We raced with another vCPU.
> > > > +		 * We either raced with another vCPU or the guest PTE
> > > > +		 * has been poisoned by an erroneous host access.
> > > >  		 */
> > > > -		if (mapping)
> > > > -			return -EAGAIN;
> > > > +		if (mapping) {
> > > > +			ret = kvm_call_hyp_nvhe(__pkvm_vcpu_in_poison_fault);
> > > 
> > > It's not too bad, but it's a shame we now issue that every time we have
> > > such a race (which is frequent-ish). Could we perhaps only issue it if
> > > at least one page has been forcefully reclaimed since boot?
> > 
> > On the plus side, it avoids an unconditional walk from the fault path
> > at EL2 (which is what we have in Android!).
> > 
> > It's a bit fiddly to implement your idea in the host, since the forceful
> > reclaim happens in a really terrible context but I could track it at EL2
> > and make __pkvm_vcpu_in_poison_fault() return early instead?
> 
> I guess EL2 could easily publish something in the host kvm struct as
> well if we really wanted to, it's pinned as shared with EL2 and
> accessible from the hyp_vm, which we retrieve in the force reclaim path.

Ah yeah, that sounds do-able.
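
Something along these lines, perhaps. This is only an untested user-space
model of the idea, not a proposed patch: every struct, field and function
name below is hypothetical, and the real thing would need
READ_ONCE()/WRITE_ONCE() (or similar) on the shared flag.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical flag living in the host kvm struct that is shared with EL2. */
struct model_host_kvm {
	bool any_forced_reclaim;
};

/* Hypothetical hyp-side VM, which can reach the shared host struct. */
struct model_hyp_vm {
	struct model_host_kvm *host_kvm;
};

/* Counts how often the modelled hypercall is actually issued. */
static int model_hcall_count;

/* Stand-in for the poison-fault hypercall; here it always reports poison. */
static bool model_hcall_poisoned(void)
{
	model_hcall_count++;
	return true;
}

/* EL2 side: publish the fact that a page was forcefully reclaimed. */
static void model_force_reclaim(struct model_hyp_vm *vm)
{
	vm->host_kvm->any_forced_reclaim = true;
}

/*
 * Host side: called when the map finds an existing mapping. Only pay for
 * the hypercall if a forceful reclaim has ever happened; otherwise this
 * can only be a race with another vCPU.
 */
static int model_map_found_mapping(struct model_host_kvm *kvm,
				   bool (*poison_hcall)(void))
{
	if (!kvm->any_forced_reclaim)
		return -EAGAIN;

	return poison_hcall() ? -EFAULT : -EAGAIN;
}
```

The common racing case then stays on the old -EAGAIN path without leaving
the host.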

> > It's also
> > worth bearing in mind that we've already serialised the concurrent fault
> > and done a GUP by this point, so performance is somewhat of a lost
> > cause...
> 
> That is very true, so happy to keep all these micro-optimization for
> later.

Sounds good to me. We'll have a bunch of performance work once the
functionality is there, so I'll leave this part as-is for now.

Will