[RFC PATCH 0/6] variable-order, large folios for anonymous memory

Wed Mar 22 06:36:04 PDT 2023


On 3/22/2023 8:03 PM, Ryan Roberts wrote:
> Hi Matthew,
> 
> On 17/03/2023 10:57, Ryan Roberts wrote:
>> Hi All,
>>
>> [...]
>>
>> Bug(s)
>> ======
>>
>> When I run this code without the last (workaround) patch, with DEBUG_VM et al,
>> PROVE_LOCKING and KASAN enabled, I see occasional oopses. Mostly these relate
>> to invalid kernel addresses (which usually look like either NULL +
>> small offset or mostly zeros with a few mid-order bits set + a small offset) or
>> lockdep complaining about a bad unlock balance. Call stacks are often in
>> madvise_free_pte_range(), but I've seen them in filesystem code too. (I can
>> email example oopses out separately if anyone wants to review them). My hunch is
>> that struct pages adjacent to the folio are being corrupted, but I don't have
>> hard evidence.
>>
>> When adding the workaround patch, which prevents madvise_free_pte_range() from
>> attempting to split a large folio, I never see any issues. That said, I'm not
>> putting the system under memory pressure, so I guess I might see the same types
>> of problem crop up under swap, etc.
>>
>> I've reviewed most of the code within split_folio() and can't find any smoking
>> gun, but I wonder if there are implicit assumptions about the large folio being
>> PMD sized that I'm obviously breaking now?
>>
>> The code in madvise_free_pte_range():
>>
>> 	if (folio_test_large(folio)) {
>> 		if (folio_mapcount(folio) != 1)
>> 			goto out;
>> 		folio_get(folio);
>> 		if (!folio_trylock(folio)) {
>> 			folio_put(folio);
>> 			goto out;
>> 		}
>> 		pte_unmap_unlock(orig_pte, ptl);
>> 		if (split_folio(folio)) {
>> 			folio_unlock(folio);
>> 			folio_put(folio);
>> 			orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
>> 			goto out;
>> 		}
>> 		...
>> 	}
> 
> I've noticed that it's split_folio() with a folio order of 1 that causes my
> problems. And I also see that the page cache code explicitly never allocates
> order-1 folios:
> 
> void page_cache_ra_order(struct readahead_control *ractl,
> 		struct file_ra_state *ra, unsigned int new_order)
> {
> 	...
> 
> 	while (index <= limit) {
> 		unsigned int order = new_order;
> 
> 		/* Align with smaller pages if needed */
> 		if (index & ((1UL << order) - 1)) {
> 			order = __ffs(index);
> 			if (order == 1)
> 				order = 0;
> 		}
> 		/* Don't allocate pages past EOF */
> 		while (index + (1UL << order) - 1 > limit) {
> 			if (--order == 1)
> 				order = 0;
> 		}
> 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
> 		if (err)
> 			break;
> 		index += 1UL << order;
> 	}
> 
> 	...
> }
> 
> Matthew, what is the reason for this? I suspect it's guarding against the same
> problem I'm seeing.
> 
> If I explicitly prevent order-1 allocations for anon pages, I'm unable to cause
> any oops/panic/etc. I'd just like to understand the root cause.
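
For reference, the order selection in the quoted loop can be modeled in userspace. This is only a sketch, not kernel code: pick_order() and lowest_set_bit() are hypothetical names, lowest_set_bit() stands in for __ffs(), and I am assuming the caller passes index <= limit and a new_order other than 1 (the elided parts of the function are not modeled).

```c
#include <assert.h>

/* Userspace stand-in for the kernel's __ffs(): index of the lowest set bit.
 * Like __ffs(), the result is undefined for x == 0. */
static unsigned int lowest_set_bit(unsigned long x)
{
	return (unsigned int)__builtin_ctzl(x);
}

/* Hypothetical helper mirroring the order selection in the quoted loop.
 * Assumes index <= limit, as the loop condition guarantees. */
static unsigned int pick_order(unsigned long index, unsigned long limit,
			       unsigned int new_order)
{
	unsigned int order = new_order;

	assert(index <= limit);

	/* Align with smaller pages if needed; order 1 is demoted to 0 */
	if (index & ((1UL << order) - 1)) {
		order = lowest_set_bit(index);
		if (order == 1)
			order = 0;
	}
	/* Shrink until the folio fits before limit, skipping order 1 */
	while (index + (1UL << order) - 1 > limit) {
		if (--order == 1)
			order = 0;
	}
	return order;
}
```

With these assumptions, pick_order(4, 15, 4) gives 2 and pick_order(2, 15, 4) gives 0; both places where the order could become 1 drop it to 0 instead.
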
I checked the struct folio definition: _deferred_list lives in the third struct
page. My understanding is that, to support folio split, the folio order must be
>= 2. Thanks.
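
A minimal userspace sketch of that reasoning, not kernel code. The helper names and DEFERRED_LIST_PAGE_IDX are hypothetical; the only input is the observation above that _deferred_list sits in the third struct page (index 2).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constant: index of the struct page that holds _deferred_list,
 * per the observation above (third struct page => index 2). */
#define DEFERRED_LIST_PAGE_IDX 2U

/* Number of struct pages backing a folio of the given order. */
static unsigned long folio_nr_struct_pages(unsigned int order)
{
	return 1UL << order;
}

/* True if the folio itself owns the struct page that _deferred_list lives in.
 * If it does not, touching the list would write into a struct page that
 * belongs to a neighbouring page. */
static bool folio_owns_deferred_list(unsigned int order)
{
	return folio_nr_struct_pages(order) > DEFERRED_LIST_PAGE_IDX;
}
```

An order-1 folio has only two struct pages, so folio_owns_deferred_list(1) is false, while order 2 and above return true. If that model is right, it would fit the hunch that struct pages adjacent to the folio are being corrupted.
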


Regards
Yin, Fengwei

> 
> Thanks,
> Ryan
> 
> 
> 
>>
>> Will normally skip my large folios because they have a mapcount > 1, due to
>> incrementing mapcount for each pte, unlike PMD mapped pages. But on occasion it
>> will see a mapcount of 1 and proceed. So I guess this is racing against reclaim
>> or CoW in this case?
>>
>> I also see it's doing a dance to take the folio lock and drop the ptl. Perhaps my
>> large anon folio is not using the folio lock in the same way as a THP would and
>> we are therefore not getting the expected serialization?
>>
>> I'd really appreciate any suggestions for how to progress here!
>>
>