[PATCH 1/2] mm: Allow architectures to request 'old' entries when prefaulting

Mon Dec 28 23:35:06 EST 2020

Got it at last, sorry it's taken so long.

On Tue, 29 Dec 2020, Kirill A. Shutemov wrote:
> On Tue, Dec 29, 2020 at 01:05:48AM +0300, Kirill A. Shutemov wrote:
> > On Mon, Dec 28, 2020 at 10:47:36AM -0800, Linus Torvalds wrote:
> > > On Mon, Dec 28, 2020 at 4:53 AM Kirill A. Shutemov <kirill at shutemov.name> wrote:
> > > >
> > > > So far I only found one more pin leak and always-true check. I don't see
> > > > how can it lead to crash or corruption. Keep looking.

Those mods look good in themselves, but, as you expected,
made no difference to the corruption I was seeing.

> > > 
> > > Well, I noticed that the nommu.c version of filemap_map_pages() needs
> > > fixing, but that's obviously not the case Hugh sees.
> > > 
> > > No,m I think the problem is the
> > > 
> > >         pte_unmap_unlock(vmf->pte, vmf->ptl);
> > > 
> > > at the end of filemap_map_pages().
> > > 
> > > Why?
> > > 
> > > Because we've been updating vmf->pte as we go along:
> > > 
> > >                 vmf->pte += xas.xa_index - last_pgoff;
> > > 
> > > and I think that by the time we get to that "pte_unmap_unlock()",
> > > vmf->pte potentially points to past the edge of the page directory.
> > 
> > Well, if it's true we have bigger problem: we set up an pte entry without
> > relevant PTL.
> > 
> > But I *think* we should be fine here: do_fault_around() limits start_pgoff
> > and end_pgoff to stay within the page table.

Yes, Linus's patch had made no difference,
the map_pages loop is safe in that respect.

> > 
> > It made mw looking at the code around pte_unmap_unlock() and I think that
> > the bug is that we have to reset vmf->address and NULLify vmf->pte once we
> > are done with faultaround:
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> 
> Ugh.. Wrong place. Need to sleep.
> 
> I'll look into your idea tomorrow.
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 87671284de62..e4daab80ed81 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2987,6 +2987,8 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf, unsigned long address,
>  	} while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
>  	rcu_read_unlock();
> +	vmf->address = address;
> +	vmf->pte = NULL;
>  	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
>  
>  	return ret;
> -- 

And that made no (noticeable) difference either.  But at last
I realized, it's absolutely on the right track, but missing the
couple of early returns at the head of filemap_map_pages(): add

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3025,14 +3025,12 @@ vm_fault_t filemap_map_pages(struct vm_f
 
 	rcu_read_lock();
 	head = first_map_page(vmf, &xas, end_pgoff);
-	if (!head) {
-		rcu_read_unlock();
-		return 0;
-	}
+	if (!head)
+		goto out;
 
 	if (filemap_map_pmd(vmf, head)) {
-		rcu_read_unlock();
-		return VM_FAULT_NOPAGE;
+		ret = VM_FAULT_NOPAGE;
+		goto out;
 	}
 
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
@@ -3066,9 +3064,9 @@ unlock:
 		put_page(head);
 	} while ((head = next_map_page(vmf, &xas, end_pgoff)) != NULL);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+out:
 	rcu_read_unlock();
 	vmf->address = address;
-	vmf->pte = NULL;
 	WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
 
 	return ret;
--

and then the corruption is fixed.  It seems miraculous that the
machines even booted with that bad vmf->address going to __do_fault():
maybe that tells us what a good job map_pages does most of the time.

You'll see I've tried removing the "vmf->pte = NULL;" there. I did
criticize earlier that vmf->pte was being left set, but was either
thinking back to some earlier era of mm/memory.c, or else confusing
with vmf->prealloc_pte, which is NULLed when consumed: I could not
find anywhere in mm/memory.c which now needs vmf->pte to be cleared,
and I seem to run fine without it (even on i386 HIGHPTE).

So, the mystery is solved; but I don't think any of these patches
should be applied.  Without thinking through Linus's suggestions
re do_set_pte() in particular, I do think this map_pages interface
is too ugly, and given us lots of trouble: please take your time
to go over it all again, and come up with a cleaner patch.

I've grown rather jaded, and questioning the value of the rework:
I don't think I want to look at or test another for a week or so.

Hugh