[PATCH 1/2] mm: make faultaround produce old ptes

Wed Nov 29 03:24:05 PST 2017

On 11/29/2017 4:21 PM, Will Deacon wrote:
> On Wed, Nov 29, 2017 at 11:35:28AM +0530, Vinayak Menon wrote:
>> On 11/29/2017 1:15 AM, Linus Torvalds wrote:
>>> On Mon, Nov 27, 2017 at 9:07 PM, Vinayak Menon <vinmenon at codeaurora.org> wrote:
>>>> Making the faultaround ptes old results in a unixbench regression for some
>>>> architectures [3][4]. But on some architectures it is not found to cause
>>>> any regression. So by default produce young ptes and provide an option for
>>>> architectures to make the ptes old.
>>> Ugh. This hidden random behavior difference annoys me.
>>>
>>> It should also be better documented in the code if we end up doing it.
>> Okay.
>>> The reason x86 seems to prefer young pte's is simply that a TLB lookup
>>> of an old entry basically causes a micro-fault that then sets the
>>> accessed bit (using a locked cycle) and then a restart.
>>>
>>> Those microfaults are not visible to software, but they are pretty
>>> expensive in hardware, probably because they basically serialize
>>> execution as if a real page fault had happened.
>>>
>>> HOWEVER - and this is the part that annoys me most about the hidden
>>> behavior - I suspect it ends up being very dependent on
>>> microarchitectural details in addition to the actual load. So it might
>>> be more true on some cores than others, and it might be very
>>> load-dependent. So hiding it as some architectural helper function
>>> really feels wrong to me. It would likely be better off as a real
>>> flag, and then maybe we could make the default behavior be set by
>>> architecture (or even dynamically by the architecture bootup code if
>>> it turns out to be enough of an issue).
>>>
>>> And I'm actually somewhat suspicious of your claim that it's not
>>> noticeable on arm64. It's entirely possible that the serialization
>>> cost of the hardware access flag is much lower, but I thought that in
>>> virtualization you actually end up taking a SW fault, which in turn
>>> would be much more expensive. In fact, I don't even find that
>>> "Hardware Accessed" bit in my armv8 docs at all, so I'm guessing it's
>>> new to 8.1? So this is very much not about architectures at all, but
>>> about small details in microarchitectural behavior.
>> The experiments were done on v8.2 hardware with CONFIG_ARM64_HW_AFDBM enabled.
>> I have tried with CONFIG_ARM64_HW_AFDBM "disabled", and the unixbench score drops down,
>> probably due to the SW faults.
> Sure, but I think the point is that just because a CPU implements hardware
> access/dirty management (DBM -- added in 8.1), it doesn't mean it's going
> to be efficient on all implementations, and so having this keyed off the
> architecture isn't the right thing to do.
>
> If we had a flag, as suggested, then we could set that by default on CPUs
> that implement hardware DBM and clear it on a case-by-case basis if
> implementations pop up where it's a performance issue, although I think
> it's more likely that setting the dirty bit is the expensive one since
> it's not allowed to be performed speculatively.

Yes, I agree that a flag will be better. I will send a v2 with these changes.

Thanks,
Vinayak