[PATCH 6/8] KVM: arm64: Propagate stage-2 map failure on host->guest donation
Fuad Tabba
tabba at google.com
Tue Apr 28 07:36:43 PDT 2026
Hi Will,
On Tue, 28 Apr 2026 at 14:45, Will Deacon <will at kernel.org> wrote:
>
> On Tue, Apr 28, 2026 at 11:30:06AM +0100, Fuad Tabba wrote:
> > __pkvm_host_donate_guest() flips the host stage-2 PTE for the donated
> > page to a non-valid annotation (KVM_HOST_INVALID_PTE_TYPE_DONATION,
> > owner = PKVM_ID_GUEST) via host_stage2_set_owner_metadata_locked()
> > and then calls kvm_pgtable_stage2_map() to install the matching guest
> > stage-2 mapping. The map's return value was wrapped in WARN_ON() and
> > otherwise discarded.
> >
> > At EL2 in nVHE/pKVM, WARN_ON() is not warn-and-continue: it expands
> > to a BRK that enters the invalid-host-el2 vector and branches to
> > hyp_panic(), declared __noreturn. WARN_ON of a reachable failure at
> > EL2 is a panic primitive, not a debug aid.
> >
> > kvm_pgtable_stage2_map() can fail in reachable ways even at PAGE_SIZE
> > granularity: __pkvm_host_donate_guest() verifies PKVM_NOPAGE for the
> > guest IPA before the map, meaning no valid stage-2 entry exists. The
> > walker must allocate new page-table pages from the vcpu memcache to
> > install the mapping, returning -ENOMEM if exhausted. The host
> > controls the vcpu memcache via the topup interface, so an
> > under-provisioned donation request converts a recoverable error into
> > a fatal hyp panic.
> >
> > Capture the stage-2 map return value and propagate it. The walker
> > may have installed partial leaf entries for the IPA before failing,
> > so unmap the range to clear them; otherwise the guest would retain
> > stage-2 access to a page the host is about to reclaim as
> > PKVM_PAGE_OWNED. Then roll back the host stage-2 mutation: the only
> > forward mutation is host_stage2_set_owner_metadata_locked() flipping
> > the host vmemmap from PKVM_PAGE_OWNED to PKVM_NOPAGE and the host
> > stage-2 PTE from idmap to invalid+annotation.
> > host_stage2_set_owner_locked(_, _, PKVM_ID_HOST) restores both.
> >
> > The rollback calls host_stage2_set_owner_locked() under WARN_ON.
> > This is the correct use: host_stage2_set_owner_metadata_locked()
> > just wrote the host leaf PTE as an invalid+annotation entry, so the
> > reverse idmap rewrite cannot require new page-table allocation — it
> > rewrites the leaf in-place. The WARN_ON asserts an impossible state
> > under correct EL2 execution, semantically distinct from the misuse
> > being fixed.
> >
> > Fixes: 1e579adca177 ("KVM: arm64: Introduce __pkvm_host_donate_guest()")
> > Signed-off-by: Fuad Tabba <tabba at google.com>
> > ---
> > arch/arm64/kvm/hyp/nvhe/mem_protect.c | 27 ++++++++++++++++++++++++---
> > 1 file changed, 24 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > index 7044913a0758..b8c57a95e9bf 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > @@ -1391,9 +1391,30 @@ int __pkvm_host_donate_guest(u64 pfn, u64 gfn, struct pkvm_hyp_vcpu *vcpu)
> > meta = host_stage2_encode_gfn_meta(vm, gfn);
> > WARN_ON(host_stage2_set_owner_metadata_locked(phys, PAGE_SIZE,
> > PKVM_ID_GUEST, meta));
> > - WARN_ON(kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys,
> > - pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED),
> > - &vcpu->vcpu.arch.pkvm_memcache, 0));
> > + ret = kvm_pgtable_stage2_map(&vm->pgt, ipa, PAGE_SIZE, phys,
> > + pkvm_mkstate(KVM_PGTABLE_PROT_RWX, PKVM_PAGE_OWNED),
> > + &vcpu->vcpu.arch.pkvm_memcache, 0);
> > + if (ret) {
> > + /*
> > + * Stage-2 map can fail mid-walk (e.g. -ENOMEM from the
> > + * memcache), leaving partial leaf entries installed in the
> > + * guest stage-2. Tear them down before rolling back the host
> > + * stage-2; otherwise the guest would retain access to a page
> > + * the host is about to reclaim as PKVM_PAGE_OWNED.
> > + */
> > + kvm_pgtable_stage2_unmap(&vm->pgt, ipa, PAGE_SIZE);
>
> Whoa, whoa, whoa.
>
> First of all, this is mapping a single page, so the comment talking about
> "leaf entries" (plural) is bogus. If an operation to map a single page
> fails, then it makes no sense to try unmapping the mapping which we
> failed to create. What do you expect it to do?
>
> On the other hand, if we extend this to handle ranges in future (which
> presumably we'll want as part of the THP support) then wouldn't this
> mean that a concurrent vCPU could have transiently written to the pages
> that _did_ get mapped, and now we're going to give those back to the
> host? That's really not ok! We're relying on these WARN_ON()s being
> fatal and they shouldn't fail because we perform all the permission
> checks first, in a separate pass.
>
> If you want to improve this, then I think the options are either:
>
> 1. Check that the the memcache is topped up first
> 2. Poison the page (similar to the forced-reclaim path)
> 3. Tell the host about the pages it's lost and maybe it can leak them
>
> (I vote for (1))
>
> But we absolutely cannot do the simple rollback for the ownership
> changes; that's why the code is written to do the checks up-front. Your
> other patches at the end of this series have different flavours of the
> same issue.
>
> If it's just about keeping the LLM happy, then either fix the LLM or
> make these BUG_ON() (in conjunction with (1) above).
You're right, and no, it's not just about keeping em happy :D
The "leaf entries" comment on the single-page donate path is wrong, and
more importantly the rollback model is unsound. The right fix is to tighten
the precondition, not to handle a failure that shouldn't be reachable.
V2 will drop two patches (in addition to the HCR_EL2 one), and will be
as follows:
1. host->guest share and host->guest donate (kept, rewritten): add a
memcache-sufficiency check during the existing pre-check pass
(option 1) and return -ENOMEM cleanly without touching any state.
Restore the WARN_ON() on the subsequent kvm_pgtable_stage2_map() —
with the topup precheck it asserts an established invariant rather
than ignoring a reachable error.
For the single-page donate, "topped up" is
KVM_PGTABLE_LAST_LEVEL - vm->pgt.start_level (mirroring host EL1's
kvm_mmu_cache_min_pages). For multi-page share I plan to use the
conservative nr_pages * (LAST_LEVEL - start_level) bound and flag it
as conservative in the commit message; happy to compute a tighter
alignment-aware bound if you'd prefer.
2. guest->host share and guest->host unshare (dropped): on reflection
these shouldn't have been in V1 at all. As I did note in the commit
message, failure path is not reachable. Again, too eager to make
use of the new prompts.
I'll repost this as V2 if you agree.
Cheers,
/fuad
>
> Will
More information about the linux-arm-kernel
mailing list