[v5 PATCH] arm64: mm: force write fault for atomic RMW instructions

David Hildenbrand david at redhat.com
Tue Jul 2 06:50:35 PDT 2024


>>>>> Changing to this new behaviour would only be a partial solution for your use
>>>>> case, since you would still have 2 faults. But it would remove the cost of the
>>>>> shattering and ensure you have a THP immediately after the write fault. But I
>>>>> can't think of a reason why this wouldn't be a generally useful change
>>>>> regardless? Any thoughts?
>>>>
>>>> The "let's read before we write" as used by QEMU migration code is the desire
>>>> to not waste memory by populating the zeropages. Deferring consuming memory
>>>> until really required.
>>>>
>>>>       /*
>>>>        * We read one byte of each page; this will preallocate page tables if
>>>>        * required and populate the shared zeropage on MAP_PRIVATE anonymous
>>>>        * memory where no page was populated yet. This might require adaption
>>>>        * when supporting other mappings, like shmem.
>>>>        */
>>>
>>> So QEMU is concerned with preallocating page tables? I would have thought you
>>> could make that a lot more efficient with an explicit MADV_POPULATE_PGTABLE
>>> call? (i.e. 1 kernel call vs 1 call per 2M, allocate all the pages in one trip
>>> through the allocator, fewer pud/pmd lock/unlocks, etc).
>>
>> I think we are only concerned about the "shared zeropage" part. Everything else
>> is just unnecessary detail that adds confusion here :) One requires the other.
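
To make that concrete, here is a minimal sketch of the "read one byte of each
page" technique (illustrative only, not the actual QEMU code; the function name
and parameters are made up). Each read fault preallocates page tables if needed
and, on MAP_PRIVATE anonymous memory, maps the shared zeropage instead of
consuming real memory:

    #include <stddef.h>

    /*
     * Illustrative sketch: the volatile qualifier keeps the compiler
     * from optimizing the per-page reads away.
     */
    static void prefault_pages(const volatile char *start, size_t len,
                               size_t page_size)
    {
        for (size_t off = 0; off < len; off += page_size)
            (void)start[off]; /* read fault: maps the shared zeropage */
    }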
> 

BTW, I was quoting the wrong QEMU code. The relevant QEMU commit that first added
that handling is:

commit 211ea74022f51164a7729030b28eec90b6c99a08
Author: Peter Lieven <pl at kamp.de>
Date:   Mon Jun 10 12:14:20 2013 +0200

     migration: do not overwrite zero pages
     
     on incoming migration do not memset pages to zero if they already read as zero.
     this will allocate a new zero page and consume memory unnecessarily. even
     if we madvise a MADV_DONTNEED later this will only deallocate the memory
     asynchronously.

(note that the MADV_DONTNEED handling in that form does not really apply
anymore AFAIK)
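
The idea in that commit, as a minimal sketch (simplified, not the actual QEMU
code; buffer_is_zero() below is a naive stand-in for QEMU's optimized helper of
the same name): if an incoming page is all zeros, only touch the destination
when it does not already read as zero, because writing -- even writing zeros --
to a zeropage-backed page triggers a COW fault and allocates real memory:

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Naive stand-in for QEMU's optimized buffer_is_zero(). */
    static bool buffer_is_zero(const void *buf, size_t len)
    {
        const unsigned char *p = buf;

        for (size_t i = 0; i < len; i++)
            if (p[i])
                return false;
        return true;
    }

    /* Place one incoming page without needlessly allocating memory. */
    static void place_incoming_page(char *dst, const char *src,
                                    size_t page_size)
    {
        if (buffer_is_zero(src, page_size)) {
            /* Skip the write if the destination already reads as zero. */
            if (!buffer_is_zero(dst, page_size))
                memset(dst, 0, page_size);
            return;
        }
        memcpy(dst, src, page_size);
    }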


> Sorry I don't quite follow your comment. As I understand it, the zeropage
> concept is intended as a memory-saving mechanism for applications that read
> memory but never write it.

"never written" -- then we wouldn't support COW faults on it, right? :P

> I don't think that really applies in your migration
> case, because you are definitely going to write all the memory eventually, I
> think?

Especially with memory ballooning and things like free page reporting, we might
be migrating a whole bunch of zero memory and only have to make sure that the
destination is actually zero. We don't want to consume memory for that.

I recently fixed that handling for s390x where customers were running into
precisely that issue (and zeropages in s390x VMs were disabled for historical
reasons).

> So I guess you are not interested in the "memory-saving" property, but in
> the side-effect, which is the pre-allocation of pagetables? (if you just wanted
> the memory-saving property, why not just skip the "read one byte of each page"
> op?) It's not important though, so let's not go down the rabbit hole.

There are cases where, during an incoming migration, some memory pages might
already have been populated and might not be zero. And IIRC there are some cases
(postcopy -> uffd) where I think it is important that we actually do have a page
in place. But I forgot some of the details around userfaultfd handling in QEMU.

> 
>>
>> Note that this is from migration code where we're supposed to write a single
>> page we received from the migration source right now (not more). And we want to
>> avoid allocating memory whenever possible (usually for overcommit).
>>
>>
>>
>>>
>>> TBH I always assumed in the past that the huge zero page is only useful because
>>> it's a placeholder for a real THP that would be populated on write. But that's
>>> obviously not the case at the moment. So other than a hack to preallocate the
>>> pgtables with only 1 fault per 2M, what other benefits does it have?
>>
>> I don't quite understand that question. [2] has some details on why the huge
>> zeropage was added -- without it, with THP enabled, we would never have
>> received huge zeropages but always anon THPs directly on read.
>>
>>>
>>>>
>>>>
>>>> Without THP this works as expected. With THP this currently also works as
>>>> expected, but of course at the price [1] of not getting anon THP
>>>> immediately, which we usually don't care about. As you note, khugepaged might
>>>> fix this up later.
>>>>
>>>> If we disable the huge zeropage, we would get anon THPs when reading instead of
>>>> small zeropages.
>>>
>>> I wasn't aware of that behaviour either. Although that sounds like another
>>> reason why allocating a THP over the huge zero page on write fault should be the
>>> "more consistent" behaviour.
>>
>> Reading [2], I think the huge zeropage was added to avoid the allocation of
>> THPs on read. Maybe really only for large readable regions; not sure why exactly.
> 
> I might raise this on the THP call tomorrow, if Kyril joins, and get his view.

Makes sense.

-- 
Cheers,

David / dhildenb



