Questions about TLB flushing and lru_gen_look_around

Yu Zhao yuzhao at google.com
Thu Sep 12 20:59:00 PDT 2024


Hi Phil,

On Thu, Sep 12, 2024 at 7:03 AM Phil Elwell <phil at raspberrypi.com> wrote:
>
> Hi,
>
> I've spent many hours recently trying to diagnose a problem that
> manifests as a CPU spin, under load and memory pressure, that can last
> for many seconds. The problem can be seen on our downstream kernels
> from 6.5 onwards, when built for ARCH=arm, running on a Pi 3B (BCM2837
> - quad A53). I've not tested a pure Linux 6.5, but this is not a bug
> report.
>
> Pi 3B has limited RAM (1GB), and it was discovered that restricting
> this further to 512MB made the spins more frequent, as did adding
> other processes. Running an ARM64 kernel in the same configuration
> leads to normal OOM behaviour.
>
> I traced the spin to a loop in __copy_to_user_memcpy where
> pin_page_for_write fails repeatedly, sometimes for hundreds of
> thousands of times. The pin is failing because the user page in
> question is marked as being old (L_PTE_YOUNG is unset). When this
> happens, the code tries to freshen the page using __put_user, but in
> this case it is not triggering the required page fault. Digging
> deeper, it can be seen that the PTE in the ARM's shadow hardware PTE
> is 0 as expected, but clearly the MMU is not seeing this otherwise it
> would be faulting; a TLB flush for that PTE fixes it.
>
> The TLB non-coherency for that PTE can be attributed to a call to
> ptep_test_and_clear_young from lru_gen_look_around, which clears the
> L_PTE_YOUNG bit in the Linux PTE

Yes, it does that.

> and zeroes the hardware PTE

I don't see how it can happen, or why it's needed. Could you explain?

> but doesn't call flush_tlb_cache.

Correct, and this is because that arch-specific API currently doesn't
require TLB flushes, from the MM's POV. None of the current callers
flushes, and I doubt any of them other than MGLRU is used on arm
(32-bit) at all.
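
To illustrate the failure mode described in the quoted report, here is a
toy user-space C model (not kernel code). It assumes, as Phil reports,
that clearing the young bit also zeroes the hardware PTE. The names
toy_mmu, toy_access_faults, toy_test_and_clear_young and toy_flush_tlb
are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PTE_YOUNG 0x1u      /* stands in for L_PTE_YOUNG */

/* Hypothetical model: one page, one TLB slot. */
struct toy_mmu {
    unsigned int linux_pte;     /* software PTE, carries the young bit */
    unsigned int hw_pte;        /* what a table walk would load; 0 faults */
    unsigned int tlb_entry;     /* cached copy of hw_pte */
    bool tlb_valid;
};

/* Returns true if an access to the page would fault. */
static bool toy_access_faults(struct toy_mmu *m)
{
    assert(m != NULL);
    if (m->tlb_valid)
        return false;           /* TLB hit: hw_pte is never consulted */
    if (m->hw_pte == 0)
        return true;            /* walk finds an invalid entry: fault */
    m->tlb_entry = m->hw_pte;   /* walk succeeds: cache the translation */
    m->tlb_valid = true;
    return false;
}

/* Clears the young bit and zeroes the hardware PTE, with no TLB flush. */
static bool toy_test_and_clear_young(struct toy_mmu *m)
{
    assert(m != NULL);
    bool was_young = (m->linux_pte & TOY_PTE_YOUNG) != 0;
    m->linux_pte &= ~TOY_PTE_YOUNG;
    m->hw_pte = 0;
    return was_young;
}

/* Drops the cached translation so the next access walks the tables. */
static void toy_flush_tlb(struct toy_mmu *m)
{
    assert(m != NULL);
    m->tlb_valid = false;
}
```

In this model, an access made after toy_test_and_clear_young() still
hits the cached entry and does not fault; only after toy_flush_tlb()
does the access see the zeroed hardware PTE and fault.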

> Two possible "fixes" are:
>
> a. Replace ptep_test_and_clear_young with ptep_clear_flush_young,
> which includes the TLB flush.
> b. After the loop over the page range from "start" to "end", include a
> call to flush_tlb_range from "start" to "end" if the "young" count is
> non-zero.
>
> My questions are:
>
> 1. Which bit of code is meant to take care of TLB coherency where
> lru_gen_look_around has made changes?

None, since the API doesn't explicitly require it (or at least the MM
assumes it doesn't), as I mentioned above.

> 2. Between the two patches a) and b), which is preferable? b) would
> seem better if IPIs are needed to broadcast the TLB flushes, but it
> seems that BCM2837 has new enough CPU cores not to require such
> broadcasts.

Could this be fixed within arm? If not, we would have to update the
requirement of that arch-specific API. This would affect other archs
that don't require TLB flushes, assuming they exist. And we would need
to fix all callers of ptep_test_and_clear_young() in MM.
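
For comparison, the flush cost of the two options proposed in the quoted
text can be sketched with a toy user-space C model (not kernel code).
Option (a) is modelled on the generic ptep_clear_flush_young(), which
flushes once per PTE found young; option (b) flushes the whole range
once if any PTE was young. The names toy_stats, toy_scan_flush_each and
toy_scan_flush_range are hypothetical, and flushes are only counted,
not performed.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PTE_YOUNG 0x1u      /* stands in for L_PTE_YOUNG */

struct toy_stats {
    int young;                  /* PTEs that had the young bit set */
    int flushes;                /* TLB flush operations issued */
};

/* Clears the young bit; returns 1 if it was set, 0 otherwise. */
static int toy_clear_young(unsigned int *pte)
{
    int was_young = (*pte & TOY_PTE_YOUNG) != 0;
    *pte &= ~TOY_PTE_YOUNG;
    return was_young;
}

/* Option (a): one flush for every PTE that was young. */
static struct toy_stats toy_scan_flush_each(unsigned int *ptes, size_t n)
{
    struct toy_stats s = { 0, 0 };
    assert(ptes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (toy_clear_young(&ptes[i])) {
            s.young++;
            s.flushes++;
        }
    }
    return s;
}

/* Option (b): scan the range, then flush once if anything was young. */
static struct toy_stats toy_scan_flush_range(unsigned int *ptes, size_t n)
{
    struct toy_stats s = { 0, 0 };
    assert(ptes != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        s.young += toy_clear_young(&ptes[i]);
    if (s.young > 0)
        s.flushes = 1;
    return s;
}
```

Both variants clear the same young bits; they differ only in how many
flush operations they issue, which matters most when each flush has to
be broadcast to other CPUs.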

> 3. walk_pte_range has a similar loop, but it seems it doesn't need to
> be patched to fix my spin, possibly because it isn't called.

Correct.

> If a
> patch to lru_gen_look_around is needed, might one be needed here as
> well?

No, because that code is disabled unless the hardware can set the
A-bit itself, e.g., arm64 v8.2.

Thanks.



More information about the linux-rpi-kernel mailing list