[RFC PATCH 0/6] variable-order, large folios for anonymous memory
Ryan Roberts
ryan.roberts at arm.com
Wed Mar 22 05:03:31 PDT 2023
Hi Matthew,
On 17/03/2023 10:57, Ryan Roberts wrote:
> Hi All,
>
> [...]
>
> Bug(s)
> ======
>
> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are
> relating to invalid kernel addresses (which usually look like either NULL +
> small offset or mostly zeros with a few mid-order bits set + a small offset) or
> lockdep complaining about a bad unlock balance. Call stacks are often in
> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
> email example oopses out separately if anyone wants to review them). My hunch is
> that struct pages adjacent to the folio are being corrupted, but don't have hard
> evidence.
>
> When adding the workaround patch, which prevents madvise_free_pte_range() from
> attempting to split a large folio, I never see any issues. Although I'm not
> putting the system under memory pressure so guess I might see the same types of
> problem crop up under swap, etc.
>
> I've reviewed most of the code within split_folio() and can't find any smoking
> gun, but I wonder if there are implicit assumptions about the large folio being
> PMD sized that I'm obviously breaking now?
>
> The code in madvise_free_pte_range():
>
> 	if (folio_test_large(folio)) {
> 		if (folio_mapcount(folio) != 1)
> 			goto out;
> 		folio_get(folio);
> 		if (!folio_trylock(folio)) {
> 			folio_put(folio);
> 			goto out;
> 		}
> 		pte_unmap_unlock(orig_pte, ptl);
> 		if (split_folio(folio)) {
> 			folio_unlock(folio);
> 			folio_put(folio);
> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> 			goto out;
> 		}
> 		...
> 	}
I've noticed that it's split_folio() on an order-1 folio that causes my
problems. I also see that the page cache readahead code explicitly avoids
allocating order-1 folios:
void page_cache_ra_order(struct readahead_control *ractl,
		struct file_ra_state *ra, unsigned int new_order)
{
	...
	while (index <= limit) {
		unsigned int order = new_order;
		/* Align with smaller pages if needed */
		if (index & ((1UL << order) - 1)) {
			order = __ffs(index);
			if (order == 1)
				order = 0;
		}
		/* Don't allocate pages past EOF */
		while (index + (1UL << order) - 1 > limit) {
			if (--order == 1)
				order = 0;
		}
		err = ra_alloc_folio(ractl, index, mark, order, gfp);
		if (err)
			break;
		index += 1UL << order;
	}
	...
}
Matthew, what is the reason for this? I suspect it's guarding against the same
problem I'm seeing.
If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
any oops/panic/etc. I'd just like to understand the root cause.
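For anyone who wants to poke at the order selection outside the kernel, here is
a minimal userspace sketch of the loop quoted above. The names pick_order(),
lowest_set_bit() and count_order1() are hypothetical helpers for illustration,
not kernel functions, and lowest_set_bit() is only a stand-in for __ffs(). The
sketch assumes the caller passes new_order >= 2; an order-1 request on an
aligned index that fits below the limit would pass straight through both checks.

```c
#include <stdio.h>

/*
 * Stand-in for the kernel's __ffs(): index of the lowest set bit.
 * Must not be called with x == 0 (the loop would never terminate).
 */
static unsigned int lowest_set_bit(unsigned long x)
{
	unsigned int bit = 0;

	while (!(x & 1UL)) {
		x >>= 1;
		bit++;
	}
	return bit;
}

/*
 * Hypothetical helper mirroring the order selection in the quoted loop.
 * Assumes index <= limit and new_order >= 2.
 */
static unsigned int pick_order(unsigned long index, unsigned long limit,
			       unsigned int new_order)
{
	unsigned int order = new_order;

	/* Align with smaller pages if needed, skipping order 1 */
	if (index & ((1UL << order) - 1)) {
		order = lowest_set_bit(index);
		if (order == 1)
			order = 0;
	}
	/* Don't go past the limit; step from order 2 straight to order 0 */
	while (index + (1UL << order) - 1 > limit) {
		if (--order == 1)
			order = 0;
	}
	return order;
}

/* Walk [0, limit] as the quoted loop does and count order-1 picks. */
static unsigned int count_order1(unsigned long limit, unsigned int new_order)
{
	unsigned long index = 0;
	unsigned int hits = 0;

	while (index <= limit) {
		unsigned int order = pick_order(index, limit, new_order);

		printf("index %lu: order %u\n", index, order);
		if (order == 1)
			hits++;
		index += 1UL << order;
	}
	return hits;
}
```

Calling count_order1(21, 4) walks indices 0, 16, 20 and 21 and never selects
order 1, which matches my reading of the kernel loop.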
Thanks,
Ryan
>
> Will normally skip my large folios because they have a mapcount > 1, due to
> incrementing mapcount for each pte, unlike PMD mapped pages. But on occasion it
> will see a mapcount of 1 and proceed. So I guess this is racing against reclaim
> or CoW in this case?
>
> I also see it's doing a dance to take the folio lock and drop the ptl. Perhaps my
> large anon folio is not using the folio lock in the same way as a THP would and
> we are therefore not getting the expected serialization?
>
> I'd really appreciate any suggestions for how to progress here!
>