[v5 PATCH] arm64: mm: force write fault for atomic RMW instructions

Thu Jul 4 03:03:49 PDT 2024

On Tue, Jul 02, 2024 at 03:21:41PM -0700, Yang Shi wrote:
> On 7/1/24 12:43 PM, Catalin Marinas wrote:
> > I don't follow OpenJDK development but I heard that updates are dragging
> > quite a lot. I can't tell whether people have picked up the
> > atomic_add(0) feature and whether, by the time a kernel patch would make
> > it into distros, they'd also move to the MADV_POPULATE_WRITE pattern.
> 
> As Christopher said there may be similar use of atomic in other
> applications, so I don't worry too much about dead code problem IMHO.
> OpenJDK is just the usecase that we know. There may be unknown unknowns. And
> the distros typically backport patches from mainline kernel to their kernel
> so there should be combos like old kernel + backported patch + old OpenJDK.

That's a somewhat valid argument I heard internally as well. People tend
to change or patch kernel versions more often than OpenJDK versions
because of the risk of breaking their Java stack. But, arguably, one can
backport the madvise() OpenJDK change since it seems to have other
benefits on x86 as well.

> AFAICT, the users do expect similar behavior as x86 (one fault instead of
> two faults). Actually we noticed this problem due to a customer report.

It's not a correctness problem, only a performance one. Big part of that
could be mitigated by some adjustment to how THP pages are allocated on
a write fault (though we'd still have an unnecessary read fault and some
TLBI). See Ryan's sub-thread.

> > There's a point (c) as well on the overhead of reading the faulting
> > instruction. I hope that's negligible but I haven't measured it.
> 
> I think I showed benchmark data requested by Anshuman in the earlier email
> discussion.

Do you mean this:

https://lore.kernel.org/r/328c4c86-96c8-4896-8b6d-94f2facdac9a@os.amperecomputing.com

I haven't figured out what the +24% case is in there, it seems pretty
large.

What you haven't benchmarked (I think) is the case where the instruction
is in an exec-only mapping. The subsequent instruction read will fault
and it adds to the overhead. Currently exec-only mappings are not
widespread but I heard some people planning to move in this direction as
a default build configuration.

It could be worked around with a new flavour of get_user() that uses the
non-T LDR instruction and the user mapping is readable by the kernel
(that's the case with EPAN, prior to PIE and I think we can change this
for PIE configurations as well). But it adds to the complexity of this
patch when the kernel already offers a MADV_POPULATE_WRITE solution.

-- 
Catalin