Questions about TLB flushing and lru_gen_look_around
Phil Elwell
phil at raspberrypi.com
Thu Sep 12 06:03:13 PDT 2024
Hi,
I've spent many hours recently trying to diagnose a problem that
manifests as a CPU spin, under load and memory pressure, that can last
for many seconds. The problem can be seen on our downstream kernels
from 6.5 onwards, when built for ARCH=arm, running on a Pi 3B (BCM2837
- quad A53). I've not tested a pure Linux 6.5, but this is not a bug
report.
Pi 3B has limited RAM (1GB), and it was discovered that restricting
this further to 512MB made the spins more frequent, as did adding
other processes. Running an ARM64 kernel in the same configuration
leads to normal OOM behaviour.
I traced the spin to a loop in __copy_to_user_memcpy where
pin_page_for_write fails repeatedly, sometimes hundreds of
thousands of times. The pin is failing because the user page in
question is marked as being old (L_PTE_YOUNG is unset). When this
happens, the code tries to freshen the page using __put_user, but in
this case it is not triggering the required page fault. Digging
deeper, it can be seen that the ARM's shadow hardware PTE is 0 as
expected, but clearly the MMU is not seeing this, otherwise it would
be faulting; a TLB flush for that PTE fixes it.
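To make the failure mode above concrete, here is a minimal userspace
sketch. It is a toy model, not kernel code: struct toy_mmu and every
toy_* function are hypothetical names invented for illustration, and
the model assumes a single page with a single cached TLB entry. It
only shows why a stale TLB entry stops the freshening write from
faulting, so the retry loop never sees the page become young.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model (hypothetical): one page, one PTE, one TLB entry. */
struct toy_mmu {
    bool linux_young;   /* stands in for L_PTE_YOUNG in the Linux PTE */
    bool hw_pte_valid;  /* hardware PTE is non-zero */
    bool tlb_valid;     /* a translation for the page is cached */
};

/*
 * Clear the young bit and zero the hardware PTE. The TLB entry is
 * only dropped when 'flush' is true. Returns the previous young state.
 */
static bool toy_test_and_clear_young(struct toy_mmu *m, bool flush)
{
    bool was_young = m->linux_young;

    m->linux_young = false;
    m->hw_pte_valid = false;
    if (flush)
        m->tlb_valid = false;
    return was_young;
}

/*
 * Model of a write to the user page. A cached translation is used
 * as-is (no fault). Otherwise the table is walked and an invalid
 * hardware PTE faults, which makes the page young again.
 * Returns true if a fault was taken.
 */
static bool toy_user_write(struct toy_mmu *m)
{
    bool faulted = false;

    if (!m->tlb_valid) {
        if (!m->hw_pte_valid) {
            m->linux_young = true;
            m->hw_pte_valid = true;
            faulted = true;
        }
        m->tlb_valid = true;
    }
    return faulted;
}

/*
 * Retry loop in the spirit of the one described above: while the page
 * is old the pin fails, and each failure attempts a freshening write.
 * Returns the number of failed attempts, capped at max_tries.
 */
static unsigned long toy_copy_attempts(struct toy_mmu *m,
                                       unsigned long max_tries)
{
    unsigned long failures = 0;

    assert(m != NULL);
    while (!m->linux_young && failures < max_tries) {
        failures++;
        toy_user_write(m);
    }
    return failures;
}
```

In this model, clearing the young bit without a flush leaves the loop
running until it hits the cap, whereas clearing it with a flush lets
the first write fault and the loop exits after a single failure.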
The TLB non-coherency for that PTE can be attributed to a call to
ptep_test_and_clear_young from lru_gen_look_around, which clears the
L_PTE_YOUNG bit in the Linux PTE and zeroes the hardware PTE but
doesn't flush the TLB. Two possible "fixes" are:
a. Replace ptep_test_and_clear_young with ptep_clear_flush_young,
which includes the TLB flush.
b. After the loop over the page range from "start" to "end", include a
call to flush_tlb_range from "start" to "end" if the "young" count is
non-zero.
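The difference between a) and b) can be sketched with another toy
userspace model. Again, this is not kernel code: struct toy_range,
TOY_NPTES and the toy_option_* functions are hypothetical, and
flush_ops simply counts how many TLB maintenance operations the model
issues. Option a) is modelled as flushing an entry only when it was
young, on the assumption that it behaves like the generic
ptep_clear_flush_young.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPTES 8  /* hypothetical size of the look-around window */

struct toy_range {
    bool young[TOY_NPTES];      /* per-PTE young bit */
    bool tlb_valid[TOY_NPTES];  /* per-page cached translation */
    unsigned int flush_ops;     /* TLB maintenance operations issued */
};

/* Option a: clear each young PTE and flush that single entry at once. */
static unsigned int toy_option_a(struct toy_range *r,
                                 size_t start, size_t end)
{
    unsigned int young = 0;

    assert(start <= end && end <= TOY_NPTES);
    for (size_t i = start; i < end; i++) {
        if (!r->young[i])
            continue;
        r->young[i] = false;
        r->tlb_valid[i] = false;  /* per-page flush */
        r->flush_ops++;
        young++;
    }
    return young;
}

/* Option b: clear in the loop, then one range flush if anything was young. */
static unsigned int toy_option_b(struct toy_range *r,
                                 size_t start, size_t end)
{
    unsigned int young = 0;

    assert(start <= end && end <= TOY_NPTES);
    for (size_t i = start; i < end; i++) {
        if (!r->young[i])
            continue;
        r->young[i] = false;
        young++;
    }
    if (young) {
        /* single flush covering the whole range from start to end */
        for (size_t i = start; i < end; i++)
            r->tlb_valid[i] = false;
        r->flush_ops++;
    }
    return young;
}
```

With four young PTEs in the window, the model issues four flush
operations for a) and one for b), at the cost of b) also dropping
translations for pages that were not young.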
My questions are:
1. Which bit of code is meant to take care of TLB coherency where
lru_gen_look_around has made changes?
2. Between the two patches a) and b), which is preferable? b) would
seem better if IPIs are needed to broadcast the TLB flushes, but it
seems that BCM2837 has new enough CPU cores not to require IPIs for
such broadcasts.
3. walk_pte_range has a similar loop, but it seems it doesn't need to
be patched to fix my spin, possibly because it isn't called. If a
patch to lru_gen_look_around is needed, might one be needed here as
well?
Thanks for your time,
Phil