[RFC PATCH 0/2] MTE support for KVM guest

Fri Jun 26 13:24:05 EDT 2020

Hi guys,

On 24/06/2020 17:24, Catalin Marinas wrote:
> On Wed, Jun 24, 2020 at 03:59:35PM +0100, Steven Price wrote:
>> On 24/06/2020 15:21, Catalin Marinas wrote:
>>> On Wed, Jun 24, 2020 at 12:16:28PM +0100, Steven Price wrote:
>>>> On 23/06/2020 18:48, Catalin Marinas wrote:
>>>>> This causes potential issues since we can't guarantee that all the
>>>>> Cacheable memory slots allocated by the VMM support MTE. If they do not,
>>>>> the arch behaviour is "unpredictable". We also can't trust the guest to
>>>>> not enable MTE on such Cacheable mappings.
>>>>
>>>> Architecturally it seems dodgy to export any address that isn't "normal
>>>> memory" (i.e. with tag storage) to the guest as Normal Cacheable. Although
>>>> I'm a bit worried this might cause a regression in some existing case.
>>>
>>> What I had in mind is some persistent memory that may be given to the
>>> guest for direct access. This is allowed to be cacheable (write-back)
>>> but may not have tag storage.
>>
>> At the moment we don't have a good idea what would happen if/when the guest
>> (or host) attempts to use that memory as tagged. If we have a relatively
>> safe hardware behaviour (e.g. the tags are silently dropped/read-as-zero)
>> then that's not a big issue. But if the accesses cause some form of abort
>> then we need to understand how that would be handled.
> 
> The architecture is not prescriptive here, the behaviour is
> "unpredictable". It could mean tags read-as-zero/write-ignored or an
> SError.

This surely is the same as treating a VFIO device as memory and performing some
unsupported operation on it.

I thought the DT 'which memory ranges' description for MTE was removed. Wouldn't the rules
for a guest be the same? If you enable MTE, everything described as memory must support
MTE. Something like persistent memory then can't be described as memory. We have the
same problem on the host.

>>>>> 1. As in your current patches, assume any Cacheable at Stage 2 can have
>>>>>      MTE enabled at Stage 1. In addition, we need to check whether the
>>>>>      physical memory supports MTE and it could be something simple like
>>>>>      pfn_valid(). Is there a way to reject a memory slot passed by the
>>>>>      VMM?
>>>>
>>>> Yes pfn_valid() should have been in there. At the moment pfn_to_page() is
>>>> called without any checks.
>>>>
>>>> The problem with attempting to reject a memory slot is that the memory
>>>> backing that slot can change. So checking at the time the slot is created
>>>> isn't enough (although it might be a useful error checking feature).
>>>
>>> But isn't the slot changed as a result of another VMM call? So we could
>>> always have such check in place.
>>
>> Once you have created a memslot the guest's view of memory follows the user
>> space's address space. This is the KVM_CAP_SYNC_MMU capability. So there's
>> nothing stopping a VMM adding a memslot backed with perfectly reasonable
>> memory then mmap()ing over the top of it some memory which isn't MTE
>> compatible. KVM gets told the memory is being removed (via mmu notifiers)
>> but I think it waits for the next fault before (re)creating the stage 2
>> entries.

(indeed, stage2 is pretty lazy)
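
To make the scenario Steven describes concrete, here is a minimal userspace-only
sketch of the "mmap() over the top" step. It makes no KVM calls at all: the
anonymous mapping merely stands in for memory a VMM would have registered as a
memslot, the second anonymous mapping stands in for the non-MTE-capable
replacement, and the helper name backing_replaced_in_place() is made up for
illustration. All it shows is that the address KVM was given stays the same
while the backing behind it changes.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper, userspace only: no KVM ioctls are issued here.
 * Returns 1 if the mapping kept its address but lost its old contents,
 * 0 if not, and -1 if mmap() failed.
 */
int backing_replaced_in_place(size_t len)
{
	unsigned char *slot, *again;
	int ret;

	/* Stand-in for memory a VMM would register as a memslot. */
	slot = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (slot == MAP_FAILED)
		return -1;
	slot[0] = 0xab;

	/* Map something else over the same range: same address, new backing. */
	again = mmap(slot, len, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (again == MAP_FAILED) {
		munmap(slot, len);
		return -1;
	}

	/* Fresh anonymous memory reads as zero, so the old byte is gone. */
	ret = (again == slot && again[0] == 0);
	munmap(again, len);
	return ret;
}
```

In the real case the second mapping could be anything the VMM can mmap(), which
is why a check made only when the slot is created is not enough.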

> OK, so that's where we could kill the guest if the VMM doesn't play
> nicely. It means that we need the check when setting up the stage 2
> entry. I guess it's fine if we only have the check at that point and
> ignore it on KVM_SET_USER_MEMORY_REGION. It would be nice if we returned
> on error on slot setup but

> we may not know (yet) whether the VMM intends to enable MTE for the guest.

We don't. Memory slots take the VM-fd, whereas the easy-to-add feature bits are per-vcpu.
Packing features into the 'type' that create-vm takes is a problem once we run out of bits,
although the existing user is the IPA space size, and MTE is a property of the memory system.

The meaning of the flag is then "I described this as memory, only let the guest access
memory through this range that is MTE capable". What do we do when that is violated? Telling
the VMM is the nicest, but it's not something we ever expect to happen. I guess an abort is
what real hardware would do (if firmware magically turned off MTE while it was in use).

This would need to be kvm's inject_abt64(), as otherwise the vcpu may take the stage2
fault again, forever. For kvm_set_spte_hva() we can't inject an abort (which vcpu?), so
not mapping the page and waiting for the guest to access it is the only option...
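
The policy described above can be sketched as a small standalone decision
function. This is only a model of the reasoning, not kernel code: the enum, the
function name s2_mte_policy() and its three parameters are all hypothetical, and
the real checks would live wherever the stage 2 entry is set up.

```c
#include <stdbool.h>

/* Hypothetical outcomes when KVM is about to create a stage 2 entry. */
enum s2_action {
	S2_MAP,			/* install the stage 2 entry */
	S2_INJECT_ABORT,	/* report an abort to the faulting vcpu */
	S2_LEAVE_UNMAPPED,	/* no vcpu to blame: wait for a guest access */
};

/*
 * vm_wants_mte:       the VMM asked for MTE for this guest
 * page_has_tags:      the backing page is MTE capable
 * have_faulting_vcpu: true on the stage 2 fault path, false for
 *                     notifier-driven updates such as kvm_set_spte_hva()
 */
enum s2_action s2_mte_policy(bool vm_wants_mte, bool page_has_tags,
			     bool have_faulting_vcpu)
{
	/* Nothing to enforce, or the page is fine: map it as usual. */
	if (!vm_wants_mte || page_has_tags)
		return S2_MAP;

	/*
	 * Violation on the fault path: inject an abort, as otherwise the
	 * vcpu would take the same stage 2 fault again, forever.
	 */
	if (have_faulting_vcpu)
		return S2_INJECT_ABORT;

	/* Violation with no vcpu context: don't map, let the guest fault. */
	return S2_LEAVE_UNMAPPED;
}
```

The third case eventually turns into the second one when the guest next
touches the page.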

Thanks,

James