[v5 PATCH] arm64: mm: force write fault for atomic RMW instructions

Yang Shi yang at os.amperecomputing.com
Tue Jul 9 10:56:55 PDT 2024



On 7/4/24 3:03 AM, Catalin Marinas wrote:
> On Tue, Jul 02, 2024 at 03:21:41PM -0700, Yang Shi wrote:
>> On 7/1/24 12:43 PM, Catalin Marinas wrote:
>>> I don't follow OpenJDK development but I heard that updates are dragging
>>> quite a lot. I can't tell whether people have picked up the
>>> atomic_add(0) feature and whether, by the time a kernel patch would make
>>> it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.
>> As Christopher said, there may be similar uses of atomics in other
>> applications, so I don't worry too much about the dead-code problem IMHO.
>> OpenJDK is just the use case that we know of. There may be unknown unknowns.
>> And the distros typically backport patches from the mainline kernel, so
>> there will be combos like old kernel + backported patch + old OpenJDK.
> That's a somewhat valid argument I heard internally as well. People tend
> to change or patch kernel versions more often than OpenJDK versions
> because of the risk of breaking their Java stack. But, arguably, one can
> backport the madvise() OpenJDK change since it seems to have other
> benefits on x86 as well.
>
>> AFAICT, the users do expect similar behavior to x86 (one fault instead of
>> two faults). Actually we noticed this problem due to a customer report.
> It's not a correctness problem, only a performance one. A big part of that
> could be mitigated by some adjustment to how THP pages are allocated on
> a write fault (though we'd still have an unnecessary read fault and some
> TLBI). See Ryan's sub-thread.

Yes, the TLBI is still sub-optimal and it is quite noticeable.
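
For readers following along, the two user-space pretouch patterns being
compared look roughly like the sketch below. This is a minimal
illustration, not OpenJDK's actual code: the mapping size is made up,
and MADV_POPULATE_WRITE needs Linux 5.14+.

    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define HEAP_SZ (64UL << 20)    /* illustrative size */

    /* Old-style pretouch: an atomic RMW that adds 0 to each page.
     * On x86 this takes a single write fault per page; on arm64
     * (without this patch) the load half takes a read fault first,
     * then the store takes a permission fault plus an extra TLBI. */
    static void pretouch_atomic(char *base, size_t len, size_t pgsz)
    {
        for (size_t off = 0; off < len; off += pgsz)
            __atomic_fetch_add((uint32_t *)(base + off), 0,
                               __ATOMIC_RELAXED);
    }

    int main(void)
    {
        char *base = mmap(NULL, HEAP_SZ, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return 1;

        pretouch_atomic(base, HEAP_SZ, (size_t)sysconf(_SC_PAGESIZE));

        /* Newer pattern: have the kernel write-populate the range
         * directly, avoiding the per-page fault dance altogether. */
        madvise(base, HEAP_SZ, MADV_POPULATE_WRITE);
        return 0;
    }

Either approach populates the heap writable up front; the difference is
only in how many faults and TLB invalidations it costs on arm64.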

>
>>> There's a point (c) as well on the overhead of reading the faulting
>>> instruction. I hope that's negligible but I haven't measured it.
>> I think I showed benchmark data requested by Anshuman in the earlier email
>> discussion.
> Do you mean this:
>
> https://lore.kernel.org/r/328c4c86-96c8-4896-8b6d-94f2facdac9a@os.amperecomputing.com
>
> I haven't figured out what the +24% case is in there; it seems pretty
> large.

I ran the test for many more iterations and didn't see such an outlier
anymore.

>> To rule out the worst case, I also ran the test 100 iterations with
>> 160 threads then compared the worst case:
>>
>>       N          Min          Max       Median          Avg       Stddev
>>     100        34770        84979        65536      63537.7    10358.873
>>     100        38077        87652        65536     63119.02    8792.7399

>
> What you haven't benchmarked (I think) is the case where the instruction
> is in an exec-only mapping. The subsequent instruction read will fault
> and it adds to the overhead. Currently exec-only mappings are not
> widespread, but I heard some people are planning to move in this
> direction as a default build configuration.

I tested exec-only on QEMU TCG, but I don't have hardware that supports
EPAN. I don't think a performance benchmark on QEMU TCG makes sense
since it is quite slow; such a small overhead is unlikely to be
measurable there.
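
"Exec-only" here means a mapping with PROT_EXEC but without PROT_READ.
For anyone with EPAN-capable hardware, a minimal sketch to set up that
case (the two opcodes encode "mov w0, #42; ret"; the rest is ordinary
mmap/mprotect usage):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);
        uint32_t code[] = { 0x52800540,   /* mov w0, #42 */
                            0xd65f03c0 }; /* ret */

        void *p = mmap(NULL, pgsz, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        memcpy(p, code, sizeof(code));
        __builtin___clear_cache((char *)p, (char *)p + sizeof(code));

        /* Drop readability: with PROT_EXEC alone the page is
         * exec-only, so a fault handler trying to read the faulting
         * instruction via get_user() would itself take a fault. */
        mprotect(p, pgsz, PROT_EXEC);

        int (*fn)(void) = (int (*)(void))p;
        printf("%d\n", fn());    /* prints 42 */
        return 0;
    }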

>
> It could be worked around with a new flavour of get_user() that uses the
> non-T LDR instruction, provided the user mapping is readable by the kernel
> (that's the case with EPAN, prior to PIE, and I think we can change this
> for PIE configurations as well). But it adds to the complexity of this
> patch when the kernel already offers a MADV_POPULATE_WRITE solution.
>



