[PATCH v1 00/13] KVM: arm64: Refactor user_mem_abort() into a state-object model
Fuad Tabba
tabba at google.com
Fri Mar 6 07:44:44 PST 2026
On Fri, 6 Mar 2026 at 15:34, Marc Zyngier <maz at kernel.org> wrote:
>
> Hi Fuad,
>
> On Fri, 06 Mar 2026 14:02:19 +0000,
> Fuad Tabba <tabba at google.com> wrote:
> >
> > As promised in my recent patch series fixing a couple of urgent bugs in
> > user_mem_abort() [1], here is the actual refactoring to finally clean up this
> > monolith.
> >
> > If you look through the Fixes: history of user_mem_abort(), you will start to
> > see a very clear pattern of whack-a-mole caused by the sheer size and
> > complexity of the function. For example:
> > - We keep leaking struct page references on early error returns because the
> > cleanup logic is hard to track (e.g., 5f9466b50c1b and the atomic fault leak
> > I just fixed in the previous series).
> > - We have had uninitialized memcache pointers (157dbc4a321f) because the
> > initialization flow jumps around unpredictably.
> > - We have had subtle TOCTOU and locking boundary bugs (like 13ec9308a857 and
> > f587661f21eb) because we drop the mmap_read_lock midway through the function
> > but leave the vma pointer and mmu_seq floating around in the same lexical
> > scope, tempting people to use them.
> >
> > The bulk of the work is in the first 6 patches, which perform a strict,
> > no-logic-change structural refactoring of user_mem_abort() into a clean,
> > sequential dispatcher.
> >
> > We introduce a state object, struct kvm_s2_fault, which encapsulates
> > both the input parameters and the intermediate state. Then,
> > user_mem_abort() is broken down into focused, standalone helpers:
> > - kvm_s2_resolve_vma_size(): Determines the VMA shift and page size.
> > - kvm_s2_fault_pin_pfn(): Handles faulting in the physical page.
> > - kvm_s2_fault_get_vma_info(): A tightly-scoped sub-helper that isolates the
> > mmap_read_lock, VMA lookup, and metadata snapshotting.
> > - kvm_s2_fault_compute_prot(): Computes stage-2 protections and evaluates
> > permission/execution constraints.
> > - kvm_s2_fault_map(): Manages the KVM MMU lock, mmu_seq retry loops, MTE, and
> > the final stage-2 mapping.
> >
> > This structural change makes the "danger zone" foolproof. By isolating
> > the mmap_read_lock region inside a tightly-scoped sub-helper
> > (kvm_s2_fault_get_vma_info), the vma pointer is confined. It snapshots
> > the required metadata into the kvm_s2_fault structure before dropping
> > the lock. Because the pointer's scope ends when the sub-helper returns,
> > accessing a stale VMA in the mapping phase is impossible by design.
> >
> > The remaining patches are localized cleanups. With the logic
> > finally extracted into digestible helpers, these patches take the
> > opportunity to streamline struct initialization, drop redundant struct
> > variables, simplify nested math, and hoist validation checks (like MTE)
> > out of the lock-heavy mapping phase.
> >
> > I think there are still more opportunities to tidy things up, but
> > I'll stop here to see what you think.
>
> Thanks a lot for going through this, this looks like a very valuable
> starting point. From a high-level perspective, the end result is even
> more shocking:
>
> <quote>
> static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>                           struct kvm_s2_trans *nested,
>                           struct kvm_memory_slot *memslot, unsigned long hva,
>                           bool fault_is_perm)
> {
>         bool write_fault = kvm_is_write_fault(vcpu);
>         bool logging_active = memslot_is_logging(memslot);
>         struct kvm_s2_fault fault = {
>                 .vcpu = vcpu,
>                 .fault_ipa = fault_ipa,
>                 .nested = nested,
>                 .memslot = memslot,
>                 .hva = hva,
>                 .fault_is_perm = fault_is_perm,
>                 .ipa = fault_ipa,
>                 .logging_active = logging_active,
>                 .force_pte = logging_active,
>                 .prot = KVM_PGTABLE_PROT_R,
>                 .fault_granule = fault_is_perm ? kvm_vcpu_trap_get_perm_fault_granule(vcpu) : 0,
>                 .write_fault = write_fault,
>                 .exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu),
>                 .topup_memcache = !fault_is_perm || (logging_active && write_fault),
>         };
> </quote>
>
> This kvm_s2_fault structure is a mess of architectural state (the
> fault itself), hypervisor context (logging_active, force_pte...), and
> implementation details (topup_memcache...). Effectively, it's a
> container for *everything*: every local variable in u_m_a() is now
> part of the structure that is passed around.
>
> The task at hand is now to completely nuke this monster. The
> architectural fault information should be further split, and directly
> passed as a parameter to user_mem_abort(). We have a lot of redundant
> state that is computed, recomputed, and derived, and that should
> probably be moved into the exact place where it matters.
Thanks Marc! Like I said, I stopped after 13 patches to see what you
think, and if this is the right approach. But yes, more can be done
here.
>
> I'll have a play with it,
Sounds good, and if you'd like me to tackle any particular part of
this, let me know.
I have to admit, breaking this down into pieces and seeing how much
tidier and easier to understand it became felt oddly satisfying.
Cheers,
/fuad
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.