[PATCH 22/30] KVM: arm64: Return -EFAULT from VCPU_RUN on access to a poisoned pte

Fri Jan 9 07:29:38 PST 2026

On Friday 09 Jan 2026 at 14:57:10 (+0000), Will Deacon wrote:
> On Tue, Jan 06, 2026 at 03:54:06PM +0000, Quentin Perret wrote:
> > On Monday 05 Jan 2026 at 15:49:30 (+0000), Will Deacon wrote:
> > > +int __pkvm_vcpu_in_poison_fault(struct pkvm_hyp_vcpu *hyp_vcpu)
> > > +{
> > > +	struct pkvm_hyp_vm *vm = pkvm_hyp_vcpu_to_hyp_vm(hyp_vcpu);
> > > +	kvm_pte_t pte;
> > > +	s8 level;
> > > +	u64 ipa;
> > > +	int ret;
> > > +
> > > +	switch (kvm_vcpu_trap_get_class(&hyp_vcpu->vcpu)) {
> > > +	case ESR_ELx_EC_DABT_LOW:
> > > +	case ESR_ELx_EC_IABT_LOW:
> > > +		if (kvm_vcpu_trap_is_translation_fault(&hyp_vcpu->vcpu))
> > > +			break;
> > > +		fallthrough;
> > > +	default:
> > > +		return -EINVAL;
> > > +	}
> > > +
> > > +	ipa = kvm_vcpu_get_fault_ipa(&hyp_vcpu->vcpu);
> > > +	ipa |= kvm_vcpu_get_hfar(&hyp_vcpu->vcpu) & GENMASK(11, 0);
> > 
> > Why is all the above needed? Could we simplify by having the host pass
> > the IPA to the hcall?
> 
> I was just a little nervous about exposing an oracle here if we take the
> gfn as an argument as it would provide the host with a pretty easy
> mechanism to monitor the page access pattern of a guest after the initial
> donation had occurred.

Aha, I see what you mean. I guess if we scope that hcall to only
discover if a gfn is poisoned we're not exposing too much, but
contextualizing the call to the fault also sounds good to me. Perhaps a
small comment would help?
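
As a rough illustration of what such a comment could say, here is a
user-space mock of the IPA derivation from the quoted hunk. The names
MOCK_GENMASK() and mock_fault_ipa() are made up for this sketch and are
not kernel symbols, and it assumes the fault IPA is page-aligned with
4KiB pages:

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-in for the kernel's GENMASK(h, l) on 64-bit values. */
#define MOCK_GENMASK(h, l) \
	(((~UINT64_C(0)) >> (63 - (h))) & ((~UINT64_C(0)) << (l)))

/*
 * Derive the IPA from the vCPU's own fault state rather than taking a
 * gfn from the host, so that the hypercall cannot be used as an oracle
 * to probe the state of arbitrary guest pages.
 *
 * fault_ipa: page-aligned IPA of the fault (assumed 4KiB granule)
 * far:       faulting address, which supplies the in-page offset
 */
static inline uint64_t mock_fault_ipa(uint64_t fault_ipa, uint64_t far)
{
	/* Document the alignment assumption made by this sketch. */
	assert((fault_ipa & MOCK_GENMASK(11, 0)) == 0);

	return fault_ipa | (far & MOCK_GENMASK(11, 0));
}
```

The point is only that the host never gets to choose the address being
queried; the real helper reads both values from the hyp vCPU state.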

> > > +	guest_lock_component(vm);
> > > +	ret = kvm_pgtable_get_leaf(&vm->pgt, ipa, &pte, &level);
> > > +	if (ret)
> > > +		goto unlock;
> > > +
> > > +	if (level != KVM_PGTABLE_LAST_LEVEL) {
> > > +		ret = -EINVAL;
> > > +		goto unlock;
> > > +	}
> > > +
> > > +	ret = guest_pte_is_poisoned(pte);
> > > +unlock:
> > > +	guest_unlock_component(vm);
> > > +	return ret;
> > > +}
> > > +
> > >  int __pkvm_host_share_hyp(u64 pfn)
> > >  {
> > >  	u64 phys = hyp_pfn_to_phys(pfn);
> > > diff --git a/arch/arm64/kvm/pkvm.c b/arch/arm64/kvm/pkvm.c
> > > index d1926cb08c76..14865907610c 100644
> > > --- a/arch/arm64/kvm/pkvm.c
> > > +++ b/arch/arm64/kvm/pkvm.c
> > > @@ -417,10 +417,13 @@ int pkvm_pgtable_stage2_map(struct kvm_pgtable *pgt, u64 addr, u64 size,
> > >  			return -EINVAL;
> > >  
> > >  		/*
> > > -		 * We raced with another vCPU.
> > > +		 * We either raced with another vCPU or the guest PTE
> > > +		 * has been poisoned by an erroneous host access.
> > >  		 */
> > > -		if (mapping)
> > > -			return -EAGAIN;
> > > +		if (mapping) {
> > > +			ret = kvm_call_hyp_nvhe(__pkvm_vcpu_in_poison_fault);
> > 
> > It's not too bad, but it's a shame we now issue that every time we have
> > such a race (which is frequent-ish). Could we perhaps only issue it if
> > at least one page has been forcefully reclaimed since boot?
> 
> On the plus side, it avoids an unconditional walk from the fault path
> at EL2 (which is what we have in Android!).
> 
> It's a bit fiddly to implement your idea in the host, since the forceful
> reclaim happens in a really terrible context but I could track it at EL2
> and make __pkvm_vcpu_in_poison_fault() return early instead?

I guess EL2 could easily publish something in the host kvm struct as
well if we really wanted to; it's pinned as shared with EL2 and
accessible from the hyp_vm, which we retrieve in the force reclaim path.
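
A minimal user-space sketch of that idea, under the assumption that
forceful reclaim is the only way a guest PTE ends up poisoned. Every
name here (struct mock_vm, mock_note_force_reclaim(),
mock_in_poison_fault()) is hypothetical and the page-table walk is
replaced by a boolean argument:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for state that EL2 shares with the host. */
struct mock_vm {
	atomic_bool page_force_reclaimed;	/* set once, never cleared */
	unsigned long nr_walks;			/* counts the slow path */
};

/* Would be called from the (hypothetical) force-reclaim path at EL2. */
static inline void mock_note_force_reclaim(struct mock_vm *vm)
{
	assert(vm != NULL);
	atomic_store_explicit(&vm->page_force_reclaimed, true,
			      memory_order_release);
}

/*
 * Stand-in for the poison check: returns 0 without walking anything if
 * no page was ever forcefully reclaimed, otherwise performs the mocked
 * walk and reports whether the PTE is poisoned.
 */
static inline int mock_in_poison_fault(struct mock_vm *vm, bool pte_poisoned)
{
	assert(vm != NULL);
	if (!atomic_load_explicit(&vm->page_force_reclaimed,
				  memory_order_acquire))
		return 0;

	vm->nr_walks++;
	return pte_poisoned;
}
```

With this shape the common race with another vCPU only costs a flag
read until the first forceful reclaim has happened.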

> It's also
> worth bearing in mind that we've already serialised the concurrent fault
> and done a GUP by this point, so performance is somewhat of a lost
> cause...

That is very true, so happy to keep all these micro-optimizations for
later.

Thanks,
Quentin