[v5 PATCH] arm64: mm: force write fault for atomic RMW instructions

Tue Jul 2 05:36:43 PDT 2024

On 02/07/2024 12:22, David Hildenbrand wrote:
> On 02.07.24 12:26, Ryan Roberts wrote:
>> On 01/07/2024 20:43, Catalin Marinas wrote:
>>> On Fri, Jun 28, 2024 at 11:20:43AM -0700, Yang Shi wrote:
>>>> On 6/28/24 10:24 AM, Catalin Marinas wrote:
>>>>> This patch does feel a bit like working around a non-optimal user choice
>>>>> in kernel space. Who knows, madvise() may even be quicker if you do a
>>>>> single call for a larger VA vs touching each page.
>>>>
>>>> IMHO, I don't think so. I viewed this patch to solve or workaround some ISA
>>>> inefficiency in kernel. Two faults are not necessary if we know we are
>>>> definitely going to write the memory very soon, right?
>>>
>>> I agree the Arm architecture behaviour is not ideal here and any
>>> timelines for fixing it in hardware, if they do happen, are far into the
>>> future. Purely from a kernel perspective, what I want though is make
>>> sure that longer term (a) we don't create additional maintenance burden
>>> and (b) we don't keep dead code around.
>>>
>>> Point (a) could be mitigated if the architecture is changed so that any
>>> new atomic instructions added to this range would also come with
>>> additional syndrome information so that we don't have to update the
>>> decoding patterns.
>>>
>>> Point (b), however, depends on the OpenJDK and the kernel versions in
>>> distros. Nick Gasson kindly provided some information on the OpenJDK
>>> changes. The atomic_add(0) change happened in early 2022, about 5-6
>>> months after MADV_POPULATE_WRITE support was added to the kernel. What's
>>> interesting is Ampere already contributed MADV_POPULATE_WRITE support to
>>> OpenJDK a few months ago:
>>>
>>> https://github.com/openjdk/jdk/commit/a65a89522d2f24b1767e1c74f6689a22ea32ca6a
>>>
>>> The OpenJDK commit lacks explanation but what I gathered from the diff
>>> is that this option is the preferred one in the presence of THP (which
>>> most/all distros enable by default). If we merge your proposed kernel
>>> patch, it will take time before it makes its way into distros. I'm
>>> hoping that by that time, distros would have picked a new OpenJDK
>>> version already that doesn't need the atomic_add(0) pattern. If that's
>>> the case, we end up with some dead code in the kernel that's almost
>>> never exercised.
>>>
>>> I don't follow OpenJDK development but I heard that updates are dragging
>>> quite a lot. I can't tell whether people have picked up the
>>> atomic_add(0) feature and whether, by the time a kernel patch would make
>>> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.
>>>
>>> There's a point (c) as well on the overhead of reading the faulting
>>> instruction. I hope that's negligible but I haven't measured it.
>>>
>>
>> Just to add to this, I note the existing kernel behaviour is that if a write
>> fault happens in a region that has a (RO) huge zero page mapped at PMD level,
>> then the PMD is shattered, the PTE of the fault address is populated with a
>> writable page and the remaining PTEs are populated with order-0 zero pages
>> (read-only).
> 
> That also recently popped up in [1]. CCing Jinjiang. Ever since I
> replied there, I also thought some more about that handling in regard to the
> huge zeropage.
> 
>>
>> This seems like odd behaviour to me. Surely it would be less effort and more
>> aligned with the app's expectations to notice the huge zero page in the PMD,
>> remove it, and install a THP, as would have been done if pmd_none() was true? I
>> don't think there is a memory bloat argument here because, IIUC, with the
>> current behaviour, khugepaged would eventually upgrade it to a THP anyway?
> 
> One detail: depending on the setting of khugepaged_max_ptes_none. zeropages
> are treated like pte_none. But in the common case, that setting is left alone.

Ahh, got it. So in the common case, khugepaged won't actually collapse
unless/until a bunch more write faults occur in the 2M region, and in that case
there is a risk that changing this behaviour could lead to a memory bloat
regression.

> 
>>
>> Changing to this new behaviour would only be a partial solution for your use
>> case, since you would still have 2 faults. But it would remove the cost of the
>> shattering and ensure you have a THP immediately after the write fault. But I
>> can't think of a reason why this wouldn't be a generally useful change
>> regardless? Any thoughts?
> 
> The "let's read before we write" as used by QEMU migration code is the desire
> to not waste memory by populating the zeropages. Deferring consuming memory
> until really required.
> 
>     /*
>      * We read one byte of each page; this will preallocate page tables if
>      * required and populate the shared zeropage on MAP_PRIVATE anonymous memory
>      * where no page was populated yet. This might require adaption when
>      * supporting other mappings, like shmem.
>      */

So QEMU is concerned with preallocatiing page tables? I would have thought you
could make that a lot more efficient with an explicit MADV_POPULATE_PGTABLE
call? (i.e. 1 kernel call vs 1 call per 2M, allocate all the pages in one trip
through the allocator, fewer pud/pmd lock/unlocks, etc).

TBH I always assumed in the past the that huge zero page is only useful because
its a placeholder for a real THP that would be populated on write. But that's
obviously not the case at the moment. So other than a hack to preallocate the
pgtables with only 1 fault per 2M, what other benefits does it have?

> 
> 
> Without THP this works as expected. With THP this currently also works as
> expected, but of course with the price [1] of not getting anon THP
> immediately, which usually we don't care about. As you note, khugepaged might
> fix this up later.
> 
> If we disable the huge zeropage, we would get anon THPs when reading instead of
> small zeropages.

I wasn't aware of that behaviour either. Although that sounds like another
reason why allocating a THP over the huge zero page on write fault should be the
"more consistent" behaviour.

> 
> As reply to [1], I suggested using preallcoation (using MADV_POPULATE_WRITE)
> when we really care about that performance difference, which would also
> avoid the huge zeropage completely, but it's also not quite optimal in some cases.

I could imagine some cases could benefit from a MADV_POPULATE_WRITE_ON_FAULT,
which would just mark the VMA so that any read fault is upgraded to write.

> 
> 
> I don't really know what to do here: changing the handling for the huge zeropage
> only unconditionally does not sound too wrong, but the change in behavior
> might (or might not) be desired for some use cases.
> 
> Reading from unpopulated memory can be a clear sign that really the shared zeropage
> is desired (as for QEMU), and concurrent memory preallcoation/population should
> ideally use MADV_POPULATE_WRITE. Maybe there are some details buried in [2]
> regarding
> the common use cases for the huge zeropage back than.

The current huge zero page behavior on write fault sounds wonky to me. But I
agree there are better and more complete solutions to the identified use cases.
So unless something pops up where the change is a clear benefit, I guess better
to be safe and leave as is.

> 
> [1] https://lkml.kernel.org/r/740d7379-3e3d-4c8c-4350-6c496969db1f@huawei.com
> [2] https://lwn.net/Articles/517465/
>