[PATCH] KVM: arm64: Drop mte_allowed check during memslot creation

Wed Feb 26 01:58:26 PST 2025

Marc Zyngier <maz at kernel.org> writes:

> On Mon, 24 Feb 2025 16:44:06 +0000,
> Aneesh Kumar K.V <aneesh.kumar at kernel.org> wrote:
>> 
>> Marc Zyngier <maz at kernel.org> writes:
>> 
>> > On Mon, 24 Feb 2025 14:39:16 +0000,
>> > Catalin Marinas <catalin.marinas at arm.com> wrote:
>> >>
>> >> On Mon, Feb 24, 2025 at 12:24:14PM +0000, Marc Zyngier wrote:
>> >> > On Mon, 24 Feb 2025 11:05:33 +0000,
>> >> > Catalin Marinas <catalin.marinas at arm.com> wrote:
>> >> > > On Mon, Feb 24, 2025 at 03:09:38PM +0530, Aneesh Kumar K.V (Arm) wrote:
>> >> > > > This change is needed because, without it, users are not able to use MTE
>> >> > > > with VFIO passthrough (currently the mapping is either Device or
>> >> > > > NonCacheable, for which tag access checks are not applied), as shown
>> >> > > > below (kvmtool VMM).
>> >> > >
>> >> > > Another nit: "users are not able to use VFIO passthrough when MTE is
>> >> > > enabled". At a first read, the above sounded to me like one wants to
>> >> > > enable MTE for VFIO passthrough mappings.
>> >> >
>> >> > What the commit message doesn't spell out is how MTE and VFIO are
>> >> > interacting here. I also don't understand the reference to Device or
>> >> > NC memory here.
>> >>
>> >> I guess it's saying that the guest cannot turn MTE on (Normal Tagged)
>> >> for these ranges anyway since Stage 2 is Device or Normal NC. So we
>> >> don't break any use-case specific to VFIO.
>> >>
>> >> > Isn't the issue that DMA doesn't check/update tags, and therefore it
>> >> > makes little sense to prevent non-tagged memory being associated with
>> >> > a memslot?
>> >>
>> >> The issue is that some MMIO memory range that does not support MTE
>> >> (well, all MMIO) could be mapped by the guest as Normal Tagged and we
>> >> have no clue what the hardware does as tag accesses, hence we currently
>> >> prevent it altogether. It's not about DMA.
>> >>
>> >> This patch still prevents such MMIO+MTE mappings but moves the decision
>> >> to user_mem_abort() and it's slightly more relaxed - only rejecting it
>> >> if !VM_MTE_ALLOWED _and_ the Stage 2 is cacheable. The side-effect is
>> >> that it allows device assignment into the guest since Stage 2 is not
>> >> Normal Cacheable (at least for now, we have some patches from Ankit but they
>> >> handle the MTE case).
>> >
>> > The other side effect is that it also allows non-tagged cacheable
>> > memory to be given to the MTE-enabled guest, and the guest has no way
>> > to distinguish between what is tagged and what's not.
>> >
>> >>
>> >> > My other concern is that this gives pretty poor consistency to the
>> >> > guest, which cannot know what can be tagged and what cannot, and
>> >> > breaks a guarantee that the guest should be able to rely on.
>> >>
>> >> The guest should not set Normal Tagged on anything other than what it
>> >> gets as standard RAM. We are not changing this here. KVM then needs to
>> >> prevent a broken/malicious guest from setting MTE on other (physical)
>> >> ranges that don't support MTE. Currently it can only do this by forcing
>> >> Device or Normal NC (or disable MTE altogether). Later we'll add
>> >> FEAT_MTE_PERM to permit Stage 2 Cacheable but trap on tag accesses.
>> >>
>> >> The ABI change is just for the VMM, the guest shouldn't be aware as
>> >> long as it sticks to the typical recommendations for MTE - only enable
>> >> on standard RAM.
>> >
>> > See above. You fall into the same trap with standard memory, since you
>> > now allow userspace to mix things at will, and only realise something
>> > has gone wrong on access (and -EFAULT is not very useful).
>> >
>> >>
>> >> Does any VMM rely on the memory slot being rejected on registration if
>> >> it does not support MTE? After this change, we'd get an exit to the VMM
>> >> on guest access with MTE turned on (even if it's not mapped as such at
>> >> Stage 1).
>> >
>> > I really don't know what userspace expects w.r.t. mixing tagged and
>> > non-tagged memory. But I don't expect anything good to come out of it,
>> > given that we provide zero information about the fault context.
>> >
>> > Honestly, if we are going to change this, then let's make sure we give
>> > enough information for userspace to go and fix the mess. Not just "it
>> > all went wrong".
>> >
>> 
>> What if we trigger a memory fault exit with the TAGACCESS flag, allowing
>> the VMM to use the GPA to retrieve additional details and print extra
>> information to aid in analysis? BTW, we will do this on the first fault
>> in cacheable, non-tagged memory even if there is no tagaccess in that
>> region. This can be further improved using the NoTagAccess series I
>> posted earlier, which ensures the memory fault exit occurs only on
>> actual tag access.
>> 
>> Something like below?
>
> Something like that, only with:
>
> - a capability informing userspace of this behaviour
>
> - a per-VM (or per-VMA) flag as a buy-in for that behaviour
>

If we’re looking for a capability based control, could we tie that up to
FEAT_MTE_PERM? That’s what I did here:

https://lore.kernel.org/all/20250110110023.2963795-1-aneesh.kumar@kernel.org

That patch set also addresses the issue mentioned here. Let me know if
you think this is a better approach.

> - the relaxation is made conditional on the memslot not being memory
> (i.e. really MMIO-only).
>
> and keep the current behaviour otherwise.
>
> Thanks,

-aneesh
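
For illustration only, the policy discussed in this thread can be modelled
in a few lines of C. This is a sketch under stated assumptions, not kernel
code: struct mte_ctx, memslot_accepted() and fault_outcome() are
hypothetical names, and the vmm_opted_in field merely stands in for the
capability and per-VM (or per-VMA) buy-in requested above, whose actual
form is not settled here.

```c
#include <stdbool.h>

/* Hypothetical model of the inputs; none of these are kernel identifiers. */
struct mte_ctx {
	bool vm_mte_enabled;   /* the VM was created with MTE enabled */
	bool vma_mte_allowed;  /* backing VMA has VM_MTE_ALLOWED set */
	bool s2_cacheable;     /* Stage 2 mapping would be Normal Cacheable */
	bool memslot_is_mmio;  /* memslot is really MMIO-only, not memory */
	bool vmm_opted_in;     /* assumed per-VM buy-in for the relaxed behaviour */
};

enum fault_outcome { FAULT_MAP_OK, FAULT_EXIT_TO_VMM };

/*
 * Memslot registration: keep the current behaviour (reject a memslot
 * that lacks VM_MTE_ALLOWED on an MTE-enabled VM) unless the VMM has
 * opted in and the memslot is MMIO-only.
 */
static inline bool memslot_accepted(const struct mte_ctx *c)
{
	if (!c->vm_mte_enabled || c->vma_mte_allowed)
		return true;
	return c->vmm_opted_in && c->memslot_is_mmio;
}

/*
 * Fault time: reject only when the VM has MTE, the VMA lacks
 * VM_MTE_ALLOWED, and the Stage 2 mapping would be cacheable.
 * Device and Normal NC mappings pass, since tag accesses do not
 * apply to them.
 */
static inline enum fault_outcome fault_outcome(const struct mte_ctx *c)
{
	if (c->vm_mte_enabled && !c->vma_mte_allowed && c->s2_cacheable)
		return FAULT_EXIT_TO_VMM;
	return FAULT_MAP_OK;
}
```

In this model a VFIO region (MMIO-only, Stage 2 not cacheable) on an
opted-in VM is accepted at registration and maps without an exit, while
cacheable memory without VM_MTE_ALLOWED is still rejected at registration
unless it is MMIO-only, in which case the access exits to the VMM.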