[v5 PATCH] arm64: mm: force write fault for atomic RMW instructions

Dev Jain dev.jain at arm.com
Wed Aug 21 04:32:40 PDT 2024


On 8/21/24 15:48, Catalin Marinas wrote:
> On Thu, Jul 11, 2024 at 11:17:51AM -0700, Yang Shi wrote:
>> On 7/11/24 10:43 AM, Catalin Marinas wrote:
>>> On Wed, Jul 10, 2024 at 11:43:18AM -0700, Yang Shi wrote:
>>>> On 7/10/24 2:22 AM, Catalin Marinas wrote:
>>>>>> diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
>>>>>> index 642bdf908b22..d30265d424e4 100644
>>>>>> --- a/arch/arm64/mm/mmap.c
>>>>>> +++ b/arch/arm64/mm/mmap.c
>>>>>> @@ -19,7 +19,7 @@ static pgprot_t protection_map[16] __ro_after_init = {
>>>>>>  	[VM_WRITE]					= PAGE_READONLY,
>>>>>>  	[VM_WRITE | VM_READ]				= PAGE_READONLY,
>>>>>>  	/* PAGE_EXECONLY if Enhanced PAN */
>>>>>> -	[VM_EXEC]					= PAGE_READONLY_EXEC,
>>>>>> +	[VM_EXEC]					= PAGE_EXECONLY,
>>>>>>  	[VM_EXEC | VM_READ]				= PAGE_READONLY_EXEC,
>>>>>>  	[VM_EXEC | VM_WRITE]				= PAGE_READONLY_EXEC,
>>>>>>  	[VM_EXEC | VM_WRITE | VM_READ]			= PAGE_READONLY_EXEC,
>>>>> In theory you'd need to change the VM_SHARED | VM_EXEC entry as well.
>>>>> Otherwise it looks fine.
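
[ For reference, the VM_SHARED | VM_EXEC entry mentioned above sits further
down in the same protection_map table. Assuming that entry still maps to
PAGE_READONLY_EXEC, the equivalent change would look roughly like the hunk
below; it is an illustration, not part of the posted patch.

 	/* PAGE_EXECONLY if Enhanced PAN */
-	[VM_SHARED | VM_EXEC]				= PAGE_READONLY_EXEC,
+	[VM_SHARED | VM_EXEC]				= PAGE_EXECONLY,
]
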
>>>> Thanks. I just ran the same benchmark: the modified page_fault1_thread
>>>> (triggering read faults) in 100 iterations with 160 threads on 160 cores.
>>>> This should be the worst contention case; I collected the max data (worst
>>>> latency). It shows the patch may incur ~30% overhead for the exec-only
>>>> case. The overhead should come just from the permission fault.
>>>>
>>>>     N           Min           Max        Median           Avg        Stddev
>>>> x 100        163840        219083        184471        183262     12593.229
>>>> + 100        211198        285947        233608     238819.98     15253.967
>>>> Difference at 95.0% confidence
>>>>         55558 +/- 3877
>>>>         30.3161% +/- 2.11555%
>>>>
>>>> This is a very extreme benchmark; I don't think any real-life workload
>>>> will spend that much time (sys vs user) in page faults, particularly
>>>> read faults.
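
[ For readers unfamiliar with the test, a rough single-threaded sketch of a
read-fault loop in the spirit of the modified page_fault1_thread mentioned
above follows. The mapping size, iteration count, and the lack of thread
setup, CPU pinning and per-iteration accounting are simplifications and
assumptions, not the actual benchmark code.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define MAP_SIZE (128UL << 20)	/* arbitrary 128M mapping per iteration */
#define PG_SIZE  4096UL

int main(void)
{
	unsigned long sum = 0;

	for (int iter = 0; iter < 100; iter++) {
		char *c = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (c == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* Loads only: each page is populated through a read fault. */
		for (unsigned long off = 0; off < MAP_SIZE; off += PG_SIZE)
			sum += *(volatile char *)(c + off);
		munmap(c, MAP_SIZE);
	}

	printf("%lu\n", sum);
	return 0;
}
]
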
>>>>
>>>> With my atomic fault benchmark (populate 1G of memory with atomic
>>>> instructions, then manipulate the values stored in that memory over 100
>>>> iterations so the user time is much longer than the sys time), I saw
>>>> around 13% overhead on sys time due to the permission fault, but no
>>>> noticeable change in user or real time.
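
[ To make the shape of that test concrete, a hedged sketch of such an
"atomic fault" loop is below. The 1G size and 100 iterations follow the
description above; everything else is an assumption, not the benchmark
actually used. Without the patch, the first atomic RMW touch of each page
can be handled as a read fault followed by a second, write fault.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define MEM_SIZE   (1UL << 30)	/* 1G */
#define PG_SIZE    4096UL
#define ITERATIONS 100

int main(void)
{
	char *buf = mmap(NULL, MEM_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long sum = 0;

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Populate: the first access to each page is an atomic RMW. */
	for (unsigned long off = 0; off < MEM_SIZE; off += PG_SIZE)
		__atomic_fetch_add((long *)(buf + off), 1, __ATOMIC_RELAXED);

	/* Manipulate the stored values so user time dominates sys time. */
	for (int it = 0; it < ITERATIONS; it++)
		for (unsigned long off = 0; off < MEM_SIZE; off += PG_SIZE)
			sum += ++*(volatile long *)(buf + off);

	printf("%lu\n", sum);
	munmap(buf, MEM_SIZE);
	return 0;
}
]
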
>>> Thanks for running these tests.
>>>
>>>> So the permission fault does incur noticeable overhead for read faults on
>>>> exec-only, but it may not be that bad for real-life workloads.
>>> So you are saying 3 faults are not that bad for real-life workloads but
>>> the 2-fault behaviour we have currently is problematic for OpenJDK. For
>>> the OpenJDK case, I don't think the faulting is the main issue affecting
>>> run-time performance; rather, over a longer run (or with more iterations
>>> in this benchmark after faulting in), you'd notice the breaking up of
>>> the initial THP pages.
>> I meant the extra permission fault for exec-only should be OK since the
>> current implementation can't force a write fault for exec-only anyway. It
>> does incur noticeable overhead for read faults, but I'm not aware of any
>> real-life workloads that are sensitive to read faults. The benchmark is a
>> very extreme worst case.
> I agree microbenchmarks like this are not realistic, but I wouldn't be
> surprised if we hear of some large file mmap() reads or very sparse
> array reads being affected.
>
> Sorry, I forgot the details from the previous discussions. Are there any
> benchmarks that show OpenJDK performing badly _and_ what the improvement
> is with this patch? We should do a similar comparison with the THP kernel
> fix. If the THP fix gets us close to the same level of improvement, I
> don't think we should bother with this patch.

Just to note, I am working on replacing the huge zero page with a THP;
I will post that out (after some testing) soon.



