[RFC PATCH 0/6] variable-order, large folios for anonymous memory
Ryan Roberts
ryan.roberts at arm.com
Wed Mar 22 07:25:27 PDT 2023
On 22/03/2023 13:36, Yin, Fengwei wrote:
>
>
> On 3/22/2023 8:03 PM, Ryan Roberts wrote:
>> Hi Matthew,
>>
>> On 17/03/2023 10:57, Ryan Roberts wrote:
>>> Hi All,
>>>
>>> [...]
>>>
>>> Bug(s)
>>> ======
>>>
>>> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
>>> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these are
>>> relating to invalid kernel addresses (which usually look like either NULL +
>>> small offset or mostly zeros with a few mid-order bits set + a small offset) or
>>> lockdep complaining about a bad unlock balance. Call stacks are often in
>>> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
>>> email example oopses out separately if anyone wants to review them). My hunch is
>>> that struct pages adjacent to the folio are being corrupted, but I don't have
>>> hard evidence.
>>>
>>> When I add the workaround patch, which prevents madvise_free_pte_range() from
>>> attempting to split a large folio, I never see any issues. However, I'm not
>>> putting the system under memory pressure, so I guess the same types of problem
>>> might crop up under swap, etc.
>>>
>>> I've reviewed most of the code within split_folio() and can't find any smoking
>>> gun, but I wonder if there are implicit assumptions about the large folio being
>>> PMD sized that I'm obviously breaking now?
>>>
>>> The code in madvise_free_pte_range():
>>>
>>> 	if (folio_test_large(folio)) {
>>> 		if (folio_mapcount(folio) != 1)
>>> 			goto out;
>>> 		folio_get(folio);
>>> 		if (!folio_trylock(folio)) {
>>> 			folio_put(folio);
>>> 			goto out;
>>> 		}
>>> 		pte_unmap_unlock(orig_pte, ptl);
>>> 		if (split_folio(folio)) {
>>> 			folio_unlock(folio);
>>> 			folio_put(folio);
>>> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>>> 			goto out;
>>> 		}
>>> 		...
>>> 	}
>>
>> I've noticed that it's folio_split() with a folio order of 1 that causes my
>> problems. And I also see that the page cache code explicitly avoids allocating
>> order-1 folios:
>>
>> void page_cache_ra_order(struct readahead_control *ractl,
>> 		struct file_ra_state *ra, unsigned int new_order)
>> {
>> 	...
>>
>> 	while (index <= limit) {
>> 		unsigned int order = new_order;
>>
>> 		/* Align with smaller pages if needed */
>> 		if (index & ((1UL << order) - 1)) {
>> 			order = __ffs(index);
>> 			if (order == 1)
>> 				order = 0;
>> 		}
>> 		/* Don't allocate pages past EOF */
>> 		while (index + (1UL << order) - 1 > limit) {
>> 			if (--order == 1)
>> 				order = 0;
>> 		}
>> 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
>> 		if (err)
>> 			break;
>> 		index += 1UL << order;
>> 	}
>>
>> 	...
>> }
>>
>> Matthew, what is the reason for this? I suspect it's guarding against the same
>> problem I'm seeing.
>>
>> If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
>> any oops/panic/etc. I'd just like to understand the root cause.
> I checked the struct folio definition. The _deferred_list is in the third page
> struct. My understanding is that, to support folio split, the folio order must
> be >= 2. Thanks.
Yep, looks like we have found the root cause - thanks for your help!
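To make the constraint concrete, here is a quick userspace sketch of the
arithmetic (illustrative only, not kernel code; the "third struct page" index
comes from Fengwei's observation above):

#include <stdio.h>

/*
 * A folio of order N owns 2^N struct pages. The deferred-split list
 * (_deferred_list) lives in the third struct page (subpage index 2), so
 * any folio smaller than order-2 does not own the memory that the split
 * path writes to; it would scribble on a neighbouring folio's struct
 * page, which matches the corruption I'm seeing.
 */
int main(void)
{
	const int deferred_list_subpage = 2;	/* third struct page */
	int order;

	for (order = 0; order <= 4; order++) {
		int nr_subpages = 1 << order;

		printf("order-%d: %2d struct pages -> _deferred_list %s\n",
		       order, nr_subpages,
		       nr_subpages > deferred_list_subpage ?
				"owned by this folio" :
				"lands in an adjacent folio's struct page!");
	}
	return 0;
}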
I've updated calc_anonymous_folio_order() to only use a non-0 order if THP is
available and, in that case, to never allocate order-1. I think that both fixes
the problem and manages the dependency we have on THP:
static void calc_anonymous_folio_order(struct vm_fault *vmf,
				       int *order_out,
				       unsigned long *addr_out)
{
	/*
	 * The aim here is to determine what size of folio we should allocate
	 * for this fault. Factors include:
	 * - Folio must be naturally aligned within VA space
	 * - Folio must not breach boundaries of vma
	 * - Folio must be fully contained inside one pmd entry
	 * - Folio must not overlap any non-none ptes
	 * - Order must not be higher than *order_out upon entry
	 *
	 * Additionally, we do not allow order-1 since this breaks assumptions
	 * elsewhere in the mm; THP pages must be at least order-2 (since they
	 * store state up to the 3rd struct page subpage), and these pages must
	 * be THP in order to correctly use pre-existing THP infrastructure such
	 * as folio_split().
	 *
	 * As a consequence of relying on the THP infrastructure, if the system
	 * does not support THP, we always fallback to order-0.
	 *
	 * Note that the caller may or may not choose to lock the pte. If
	 * unlocked, the calculation should be considered an estimate that will
	 * need to be validated under the lock.
	 */

	struct vm_area_struct *vma = vmf->vma;
	int nr;
	int order;
	unsigned long addr;
	pte_t *pte;
	pte_t *first_set = NULL;
	int ret;

	if (has_transparent_hugepage()) {
		order = min(*order_out, PMD_SHIFT - PAGE_SHIFT);

		for (; order > 1; order--) {
			nr = 1 << order;
			addr = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
			pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);

			/* Check vma bounds. */
			if (addr < vma->vm_start ||
			    addr + nr * PAGE_SIZE > vma->vm_end)
				continue;

			/* Ptes covered by order already known to be none. */
			if (pte + nr <= first_set)
				break;

			/* Already found set pte in range covered by order. */
			if (pte <= first_set)
				continue;

			/* Need to check if all the ptes are none. */
			ret = check_all_ptes_none(pte, nr);
			if (ret == nr)
				break;

			first_set = pte + ret;
		}

		if (order == 1)
			order = 0;
	} else {
		order = 0;
	}

	*order_out = order;
	*addr_out = order > 0 ? addr : vmf->address;
}
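
For reference, a caller in the fault path would then use it roughly like this
(a sketch only, not the actual patch; vma_alloc_folio() is the existing mm
helper for allocating a folio at a given VA, and the gfp flags and surrounding
plumbing here are just illustrative):

static struct folio *alloc_anon_folio_sketch(struct vm_fault *vmf)
{
	int order = PMD_SHIFT - PAGE_SHIFT;	/* upper bound on entry */
	unsigned long addr;

	calc_anonymous_folio_order(vmf, &order, &addr);

	/* order is now either 0 or at least 2; never 1. */
	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order, vmf->vma, addr,
			       order > 0);
}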
>
>
> Regards
> Yin, Fengwei
>
>>
>> Thanks,
>> Ryan
>>
>>
>>
>>>
>>> This code will normally skip my large folios because they have a mapcount > 1,
>>> due to incrementing the mapcount for each pte, unlike PMD-mapped pages. But on
>>> occasion it will see a mapcount of 1 and proceed. So I guess this is racing
>>> against reclaim or CoW in this case?
>>>
>>> I also see it's doing a dance to take the folio lock and drop the ptl. Perhaps my
>>> large anon folio is not using the folio lock in the same way as a THP would and
>>> we are therefore not getting the expected serialization?
>>>
>>> I'd really appreciate any suggestions for how to progress here!
>>>
>>