[PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model

Fuad Tabba tabba at google.com
Fri Mar 6 07:44:44 PST 2026


On Fri, 6 Mar 2026 at 15:34, Marc Zyngier <maz at kernel.org> wrote:
>
> Hi Fuad,
>
> On Fri, 06 Mar 2026 14:02:19 +0000,
> Fuad Tabba <tabba at google.com> wrote:
> >
> > As promised in my recent patch series fixing a couple of urgent bugs in
> > user_mem_abort() [1], here is the actual refactoring to finally clean up this
> > monolith.
> >
> > If you look through the Fixes: history of user_mem_abort(), you will start to
> > see a very clear pattern of whack-a-mole caused by the sheer size and
> > complexity of the function. For example:
> > - We keep leaking struct page references on early error returns because the
> >   cleanup logic is hard to track (e.g., 5f9466b50c1b and the atomic fault leak
> >   I just fixed in the previous series).
> > - We have had uninitialized memcache pointers (157dbc4a321f) because the
> >   initialization flow jumps around unpredictably.
> > - We have had subtle TOCTOU and locking boundary bugs (like 13ec9308a857 and
> >   f587661f21eb) because we drop the mmap_read_lock midway through the function
> >   but leave the vma pointer and mmu_seq floating around in the same lexical
> >   scope, tempting people to use them.
> >
> > The bulk of the work is in the first 6 patches, which perform a strict,
> > no-logic-change structural refactoring of user_mem_abort() into a clean,
> > sequential dispatcher.
> >
> > We introduce a state object, struct kvm_s2_fault, which encapsulates
> > both the input parameters and the intermediate state. Then,
> > user_mem_abort() is broken down into focused, standalone helpers:
> > - kvm_s2_resolve_vma_size(): Determines the VMA shift and page size.
> > - kvm_s2_fault_pin_pfn(): Handles faulting in the physical page.
> >   - kvm_s2_fault_get_vma_info(): A tightly-scoped sub-helper that isolates the
> >     mmap_read_lock, VMA lookup, and metadata snapshotting.
> > - kvm_s2_fault_compute_prot(): Computes stage-2 protections and evaluates
> >   permission/execution constraints.
> > - kvm_s2_fault_map(): Manages the KVM MMU lock, mmu_seq retry loops, MTE, and
> >   the final stage-2 mapping.
> >
> > This structural change makes the "danger zone" foolproof. By isolating
> > the mmap_read_lock region inside a tightly-scoped sub-helper
> > (kvm_s2_fault_get_vma_info), the vma pointer is confined. It snapshots
> > the required metadata into the kvm_s2_fault structure before dropping
> > the lock. Because the pointer's scope ends when the sub-helper returns,
> > accessing a stale VMA in the mapping phase is impossible by design.
> >
> > The remaining patches are localized cleanups. With the logic
> > finally extracted into digestible helpers, these patches take the
> > opportunity to streamline struct initialization, drop redundant struct
> > variables, simplify nested math, and hoist validation checks (like MTE)
> > out of the lock-heavy mapping phase.
> >
> > I think there are still more opportunities to tidy things up, but
> > I'll stop here to see what you think.
>
> Thanks a lot for going through this, this looks like a very valuable
> starting point. From a high-level perspective, the end result is even
> more shocking:
>
> <quote>
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                           struct kvm_s2_trans *nested,
>                           struct kvm_memory_slot *memslot, unsigned long hva,
>                           bool fault_is_perm)
> {
>         bool write_fault = kvm_is_write_fault(vcpu);
>         bool logging_active = memslot_is_logging(memslot);
>         struct kvm_s2_fault fault = {
>                 .vcpu = vcpu,
>                 .fault_ipa = fault_ipa,
>                 .nested = nested,
>                 .memslot = memslot,
>                 .hva = hva,
>                 .fault_is_perm = fault_is_perm,
>                 .ipa = fault_ipa,
>                 .logging_active = logging_active,
>                 .force_pte = logging_active,
>                 .prot = KVM_PGTABLE_PROT_R,
>                 .fault_granule = fault_is_perm ? kvm_vcpu_trap_get_perm_fault_granule(vcpu) : 0,
>                 .write_fault = write_fault,
>                 .exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu),
>                 .topup_memcache = !fault_is_perm || (logging_active && write_fault),
>         };
> </quote>
>
> This kvm_s2_fault structure is a mess of architectural state (the
> fault itself), hypervisor context (logging_active, force_pte...), and
> implementation details (topup_memcache...). Effectively, everything
> that was a variable in u_m_a() is now part of the structure that is
> passed around.
>
> The task at hand is now to completely nuke this monster. The
> architectural fault information should be further split, and directly
> passed as a parameter to user_mem_abort(). We have a lot of redundant
> state that is computed, recomputed, derived, and that should probably
> be moved into the exact place where it matters.

Thanks Marc! Like I said, I stopped after 13 patches to see what you
think, and if this is the right approach. But yes, more can be done
here.
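For anyone skimming the thread, the overall shape the series aims for is
roughly the following. This is a simplified, standalone sketch, not the
actual KVM code: the struct fields, helper names, and return values here
are illustrative stand-ins. The point is the pattern the cover letter
describes: a single state object carries the fault context, and the
dispatcher is a flat sequence of phases that each either fill in more
state or return early.

```c
#include <stdbool.h>

/* Illustrative stand-in for struct kvm_s2_fault: inputs and
 * intermediate state live in one object handed to each phase. */
struct s2_fault {
	unsigned long fault_ipa;	/* input */
	bool write_fault;		/* input */
	unsigned long pfn;		/* filled by the pin phase */
	unsigned int prot;		/* filled by the prot phase */
	bool mapped;			/* set by the map phase */
};

static int fault_pin_pfn(struct s2_fault *f)
{
	if (!f->fault_ipa)
		return -1;		/* early return: nothing to clean up yet */
	f->pfn = f->fault_ipa >> 12;	/* pretend translation */
	return 0;
}

static int fault_compute_prot(struct s2_fault *f)
{
	f->prot = 0x1 | (f->write_fault ? 0x2 : 0);	/* R, maybe W */
	return 0;
}

static int fault_map(struct s2_fault *f)
{
	f->mapped = true;
	return 0;
}

/* The dispatcher: a flat sequence of focused phases with obvious
 * bail-out points, instead of one monolithic function. */
static int handle_fault(unsigned long ipa, bool write)
{
	struct s2_fault f = { .fault_ipa = ipa, .write_fault = write };
	int ret;

	ret = fault_pin_pfn(&f);
	if (ret)
		return ret;
	ret = fault_compute_prot(&f);
	if (ret)
		return ret;
	return fault_map(&f);
}
```

Because each phase only sees the state object, the cleanup obligations
at every early return are visible at a glance, which is exactly the
class of leak the Fixes: history keeps tripping over.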

>
> I'll have a play with it,

Sounds good, and if you'd like me to tackle any particular part of
this, let me know.
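The "danger zone" confinement that the cover letter describes for
kvm_s2_fault_get_vma_info() can be sketched in miniature like this.
Again, this is an illustrative analogue, not the KVM code (a pthread
mutex standing in for mmap_read_lock, toy types standing in for the
real ones): the only function that may hold a vma pointer takes the
lock, snapshots the metadata later phases need, and returns, so the
pointer cannot leak into the mapping phase.

```c
#include <pthread.h>

/* Illustrative stand-ins, not the actual kernel types. */
struct vma { unsigned long start, pgsz; };

struct fault_info {
	/* snapshot of vma metadata, valid after the lock is dropped */
	unsigned long vm_start;
	unsigned long page_size;
};

static pthread_mutex_t mmap_lock = PTHREAD_MUTEX_INITIALIZER;
static struct vma the_vma = { 0x4000, 4096 };

/* The only scope in which a struct vma pointer exists: look it up
 * under the lock, copy out what later phases need, unlock, return.
 * The pointer dies here, so no later phase can use it stale. */
static int get_vma_info(struct fault_info *info)
{
	struct vma *vma;

	pthread_mutex_lock(&mmap_lock);
	vma = &the_vma;			/* lookup under the lock */
	info->vm_start = vma->start;	/* snapshot */
	info->page_size = vma->pgsz;
	pthread_mutex_unlock(&mmap_lock);
	return 0;
}
```

The design choice is that correctness comes from scoping rather than
discipline: nothing outside this helper can name the vma, so the
TOCTOU bugs the cover letter cites become unrepresentable.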

I have to admit, breaking this down into pieces and seeing how much
tidier and easier to understand it became felt oddly satisfying.

Cheers,
/fuad

> Thanks,
>
>         M.
>
> --
> Without deviation from the norm, progress is not possible.
