[PATCH v11 2/6] arm64: kvm: Introduce MTE VM feature

Mon May 10 11:35:07 PDT 2021

On Fri, May 07, 2021 at 07:25:39PM +0100, Catalin Marinas wrote:
> On Thu, May 06, 2021 at 05:15:25PM +0100, Steven Price wrote:
> > On 04/05/2021 18:40, Catalin Marinas wrote:
> > > On Thu, Apr 29, 2021 at 05:06:41PM +0100, Steven Price wrote:
> > > > Given the changes to set_pte_at() which means that tags are restored from
> > > > swap even if !PROT_MTE, the only race I can see remaining is the creation of
> > > > new PROT_MTE mappings. As you mention an attempt to change mappings in the
> > > > VMM memory space should involve a mmu notifier call which I think serialises
> > > > this. So the remaining issue is doing this in a separate address space.
> > > > 
> > > > So I guess the potential problem is:
> > > > 
> > > >   * allocate memory MAP_SHARED but !PROT_MTE
> > > >   * fork()
> > > >   * VM causes a fault in parent address space
> > > >   * child does a mprotect(PROT_MTE)
> > > > 
> > > > With the last two potentially racing. Sadly I can't see a good way of
> > > > handling that.
> > > 
> > > Ah, the mmap lock doesn't help as they are different processes
> > > (mprotect() acquires it as a writer).
> > > 
> > > I wonder whether this is racy even in the absence of KVM. If both parent
> > > and child do an mprotect(PROT_MTE), one of them may be reading stale
> > > tags for a brief period.
> > > 
> > > Maybe we should revisit whether shared MTE pages are of any use, though
> > > it's an ABI change (not bad if no-one is relying on this). However...
[...]
> > > Thinking about this, we have a similar problem with the PG_dcache_clean
> > > and two processes doing mprotect(PROT_EXEC). One of them could see the
> > > flag set and skip the I-cache maintenance while the other executes
> > > stale instructions. change_pte_range() could acquire the page lock if
> > > the page is VM_SHARED (my preferred core mm fix). It doesn't immediately
> > > solve the MTE/KVM case but we could at least take the page lock via
> > > user_mem_abort().
[...]
> > This is the real issue I see - the race in PROT_MTE case is either a data
> > leak (syncing after setting the bit) or data loss (syncing before setting
> > the bit).
[...]
> > But without serialising through a spinlock (in mte_sync_tags()) I haven't
> > been able to come up with any way of closing the race. But with the change
> > to set_pte_at() to call mte_sync_tags() even if the PTE isn't PROT_MTE that
> > is likely to seriously hurt performance.
> 
> Yeah. We could add another page flag as a lock though I think it should
> be the core code that prevents the race.
> 
> If we are to do it in the arch code, maybe easier with a custom
> ptep_modify_prot_start/end() where we check if it's VM_SHARED and
> VM_MTE, take a (big) lock.

I think in the general case we don't even need VM_SHARED. For example,
we have two processes mapping a file, read-only. An mprotect() call in
both processes will race on the page->flags via the corresponding
set_pte_at(). I think an mprotect() with a page fault in different
processes can also race.

The PROT_EXEC case can be easily fixed, as you said already. The
PROT_MTE with MAP_PRIVATE I think can be made safe by a similar
approach: test flag, clear tags, set flag. A subsequent write would
trigger a CoW, so different page anyway.

Anyway, I don't think ptep_modify_prot_start/end would buy us much, it
probably makes the code even harder to read.

> In the core code, something like below (well, a partial hack, not tested
> and it doesn't handle huge pages but just to give an idea):
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 94188df1ee55..6ba96ff141a6 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -114,6 +113,10 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  			}
>  
>  			oldpte = ptep_modify_prot_start(vma, addr, pte);
> +			if (vma->vm_flags & VM_SHARED) {
> +				page = vm_normal_page(vma, addr, oldpte);
> +				lock_page(page);
> +			}
>  			ptent = pte_modify(oldpte, newprot);
>  			if (preserve_write)
>  				ptent = pte_mk_savedwrite(ptent);
> @@ -138,6 +141,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>  				ptent = pte_mkwrite(ptent);
>  			}
>  			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> +			if (page)
> +				unlock_page(page);
>  			pages++;
>  		} else if (is_swap_pte(oldpte)) {
>  			swp_entry_t entry = pte_to_swp_entry(oldpte);

That's bogus: lock_page() might sleep but this whole code sequence is
under the ptl spinlock. There are some lock_page_* variants but that
would involve either a busy loop on this path or some bailing out,
waiting for a release.

Options:

1. Change the mte_sync_tags() code path to set the flag after clearing
   and avoid reading stale tags. We document that mprotect() on
   MAP_SHARED may lead to tag loss. Maybe we can intercept this in the
   arch code and return an error.

2. Figure out some other locking in the core code. However, if
   mprotect() in one process can race with a handle_pte_fault() in
   another, on the same shared mapping, it's not trivial.
   filemap_map_pages() would take the page lock before calling
   do_set_pte(), so mprotect() would need the same page lock.

3. Use another PG_arch_3 bit as a lock to spin on in the arch code (i.e.
   set it around the other PG_arch_* bit setting).

I ran out of ideas.

-- 
Catalin