[v5 PATCH] arm64: mm: force write fault for atomic RMW instructions

Dev Jain dev.jain at arm.com
Wed Aug 21 04:32:40 PDT 2024


On 8/21/24 15:48, Catalin Marinas wrote:
> On Thu, Jul 11, 2024 at 11:17:51AM -0700, Yang Shi wrote:
>> On 7/11/24 10:43 AM, Catalin Marinas wrote:
>>> On Wed, Jul 10, 2024 at 11:43:18AM -0700, Yang Shi wrote:
>>>> On 7/10/24 2:22 AM, Catalin Marinas wrote:
>>>>>> diff --git a/arch/arm64/mm/mmap.c b/arch/arm64/mm/mmap.c
>>>>>> index 642bdf908b22..d30265d424e4 100644
>>>>>> --- a/arch/arm64/mm/mmap.c
>>>>>> +++ b/arch/arm64/mm/mmap.c
>>>>>> @@ -19,7 +19,7 @@ static pgprot_t protection_map[16] __ro_after_init = {
>>>>>>  	[VM_WRITE]					= PAGE_READONLY,
>>>>>>  	[VM_WRITE | VM_READ]				= PAGE_READONLY,
>>>>>>  	/* PAGE_EXECONLY if Enhanced PAN */
>>>>>> -	[VM_EXEC]					= PAGE_READONLY_EXEC,
>>>>>> +	[VM_EXEC]					= PAGE_EXECONLY,
>>>>>>  	[VM_EXEC | VM_READ]				= PAGE_READONLY_EXEC,
>>>>>>  	[VM_EXEC | VM_WRITE]				= PAGE_READONLY_EXEC,
>>>>>>  	[VM_EXEC | VM_WRITE | VM_READ]			= PAGE_READONLY_EXEC,
>>>>> In theory you'd need to change the VM_SHARED | VM_EXEC entry as well.
>>>>> Otherwise it looks fine.
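
[ For reference, the VM_SHARED | VM_EXEC entry mentioned above sits further
down in the same protection_map table. Assuming that entry still maps to
PAGE_READONLY_EXEC, the equivalent change would look roughly like the hunk
below; it is an illustration, not part of the posted patch.

 	/* PAGE_EXECONLY if Enhanced PAN */
-	[VM_SHARED | VM_EXEC]				= PAGE_READONLY_EXEC,
+	[VM_SHARED | VM_EXEC]				= PAGE_EXECONLY,
]
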
>>>> Thanks. I just ran the same benchmark: the modified page_fault1_thread
>>>> (triggering read faults) in 100 iterations with 160 threads on 160 cores.
>>>> This should be the worst contention case; I collected the max data (worst
>>>> latency). It shows the patch may incur ~30% overhead for the exec-only
>>>> case. The overhead should come just from the permission fault.
>>>>
>>>>     N           Min           Max        Median           Avg        Stddev
>>>> x 100        163840        219083        184471        183262     12593.229
>>>> + 100        211198        285947        233608     238819.98     15253.967
>>>> Difference at 95.0% confidence
>>>>         55558 +/- 3877
>>>>         30.3161% +/- 2.11555%
>>>>
>>>> This is a very extreme benchmark; I don't think any real-life workload
>>>> will spend that much time (sys vs user) in page faults, particularly
>>>> read faults.
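
[ For readers unfamiliar with the test, a rough single-threaded sketch of a
read-fault loop in the spirit of the modified page_fault1_thread mentioned
above follows. The mapping size, iteration count, and the lack of thread
setup, CPU pinning and per-iteration accounting are simplifications and
assumptions, not the actual benchmark code.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define MAP_SIZE (128UL << 20)	/* arbitrary 128M mapping per iteration */
#define PG_SIZE  4096UL

int main(void)
{
	unsigned long sum = 0;

	for (int iter = 0; iter < 100; iter++) {
		char *c = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (c == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* Loads only: each page is populated through a read fault. */
		for (unsigned long off = 0; off < MAP_SIZE; off += PG_SIZE)
			sum += *(volatile char *)(c + off);
		munmap(c, MAP_SIZE);
	}

	printf("%lu\n", sum);
	return 0;
}
]
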
>>>>
>>>> With my atomic fault benchmark (populate 1G of memory with atomic
>>>> instructions, then manipulate the values stored in that memory over 100
>>>> iterations so the user time is much longer than the sys time), I saw
>>>> around 13% overhead on sys time due to the permission fault, but no
>>>> noticeable change in user or real time.
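
[ To make the shape of that test concrete, a hedged sketch of such an
"atomic fault" loop is below. The 1G size and 100 iterations follow the
description above; everything else is an assumption, not the benchmark
actually used. Without the patch, the first atomic RMW touch of each page
can be handled as a read fault followed by a second, write fault.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define MEM_SIZE   (1UL << 30)	/* 1G */
#define PG_SIZE    4096UL
#define ITERATIONS 100

int main(void)
{
	char *buf = mmap(NULL, MEM_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long sum = 0;

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Populate: the first access to each page is an atomic RMW. */
	for (unsigned long off = 0; off < MEM_SIZE; off += PG_SIZE)
		__atomic_fetch_add((long *)(buf + off), 1, __ATOMIC_RELAXED);

	/* Manipulate the stored values so user time dominates sys time. */
	for (int it = 0; it < ITERATIONS; it++)
		for (unsigned long off = 0; off < MEM_SIZE; off += PG_SIZE)
			sum += ++*(volatile long *)(buf + off);

	printf("%lu\n", sum);
	munmap(buf, MEM_SIZE);
	return 0;
}
]
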
>>> Thanks for running these tests.
>>>
>>>> So the permission fault does incur noticeable overhead for read faults on
>>>> exec-only, but it may not be that bad for real-life workloads.
>>> So you are saying 3 faults are not that bad for real-life workloads but
>>> the 2-fault behaviour we have currently is problematic for OpenJDK. For
>>> the OpenJDK case, I don't think the faulting is the main issue affecting
>>> run-time performance; rather, over a longer run (or with more iterations
>>> in this benchmark after faulting in), you'd notice the breaking up of
>>> the initial THP pages.
>> I meant the extra permission fault for exec-only should be OK since the
>> current implementation can't force a write fault for exec-only anyway. It
>> does incur noticeable overhead for read faults, but I'm not aware of any
>> real-life workloads that are sensitive to read faults. The benchmark is a
>> very extreme worst case.
> I agree microbenchmarks like this are not realistic, but I wouldn't be
> surprised if we hear of some large file mmap() reads or very sparse
> array reads being affected.
>
> Sorry, I forgot the details from the previous discussions. Are there any
> benchmarks that show OpenJDK performing badly _and_ what the improvement
> is with this patch? We should do a similar comparison with the THP kernel
> fix. If the THP fix gets us close to the same level of improvement, I
> don't think we should bother with this patch.

Just to note, I am working on replacing the huge zero page with a THP;
I will post that out (after some testing) soon.



