[PATCH 10/12] fs/dax: Properly refcount fs dax pages
Dan Williams
dan.j.williams at intel.com
Thu Oct 24 16:52:50 PDT 2024
Alistair Popple wrote:
[..]
> >
> > Was there a discussion I missed about why the conversion to typical
> > folios allows the page->share accounting to be dropped.
>
> The problem with keeping it is we now treat DAX pages as "normal"
> pages according to vm_normal_page(). As such we use the normal paths
> for unmapping pages.
>
> Specifically page->share accounting relies on PAGE_MAPPING_DAX_SHARED
> aka PAGE_MAPPING_ANON which causes folio_test_anon(), PageAnon(),
> etc. to return true leading to all sorts of issues in at least the
> unmap paths.
Oh, I missed that PAGE_MAPPING_DAX_SHARED aliases with
PAGE_MAPPING_ANON.
> There hasn't been a previous discussion on this, but given this is
> only used to print warnings it seemed easier to get rid of it. I
> probably should have called that out more clearly in the commit
> message though.
>
> > I assume this is because the page->mapping validation was dropped, which
> > I think might be useful to keep at least for one development cycle to
> > make sure this conversion is not triggering any of the old warnings.
> >
> > Otherwise, the ->share field of 'struct page' can also be cleaned up.
>
> Yes, we should also clean up the ->share field, unless you have an
> alternate suggestion to solve the above issue.
kmalloc mininimum alignment is 8, so there is room to do this?
---
diff --git a/fs/dax.c b/fs/dax.c
index c62acd2812f8..a70f081c32cb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -322,7 +322,7 @@ static unsigned long dax_end_pfn(void *entry)
static inline bool dax_page_is_shared(struct page *page)
{
- return page->mapping == PAGE_MAPPING_DAX_SHARED;
+ return folio_test_dax_shared(page_folio(page));
}
/*
@@ -331,14 +331,14 @@ static inline bool dax_page_is_shared(struct page *page)
*/
static inline void dax_page_share_get(struct page *page)
{
- if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
+ if (!dax_page_is_shared(page)) {
/*
* Reset the index if the page was already mapped
* regularly before.
*/
if (page->mapping)
page->share = 1;
- page->mapping = PAGE_MAPPING_DAX_SHARED;
+ page->mapping = (void *)PAGE_MAPPING_DAX_SHARED;
}
page->share++;
}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1b3a76710487..21b355999ce0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -666,13 +666,14 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted)
#define PAGE_MAPPING_ANON 0x1
#define PAGE_MAPPING_MOVABLE 0x2
#define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
-#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
+/* to be removed once typical page refcounting for dax proves stable */
+#define PAGE_MAPPING_DAX_SHARED 0x4
+#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE | PAGE_MAPPING_DAX_SHARED)
/*
* Different with flags above, this flag is used only for fsdax mode. It
* indicates that this page->mapping is now under reflink case.
*/
-#define PAGE_MAPPING_DAX_SHARED ((void *)0x1)
static __always_inline bool folio_mapping_flags(const struct folio *folio)
{
@@ -689,6 +690,11 @@ static __always_inline bool folio_test_anon(const struct folio *folio)
return ((unsigned long)folio->mapping & PAGE_MAPPING_ANON) != 0;
}
+static __always_inline bool folio_test_dax_shared(const struct folio *folio)
+{
+ return ((unsigned long)folio->mapping & PAGE_MAPPING_DAX_SHARED) != 0;
+}
+
static __always_inline bool PageAnon(const struct page *page)
{
return folio_test_anon(page_folio(page));
---
...and keep the validation around at least for one post conversion
development cycle?
> > It does have implications for the dax dma-idle tracking thought, see
> > below.
> >
> >> {
> >> - unsigned long pfn;
> >> + unsigned long order = dax_entry_order(entry);
> >> + struct folio *folio = dax_to_folio(entry);
> >>
> >> - if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> >> + if (!dax_entry_size(entry))
> >> return;
> >>
> >> - for_each_mapped_pfn(entry, pfn) {
> >> - struct page *page = pfn_to_page(pfn);
> >> -
> >> - WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> >> - if (dax_page_is_shared(page)) {
> >> - /* keep the shared flag if this page is still shared */
> >> - if (dax_page_share_put(page) > 0)
> >> - continue;
> >> - } else
> >> - WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> >> - page->mapping = NULL;
> >> - page->index = 0;
> >> - }
> >> + /*
> >> + * We don't hold a reference for the DAX pagecache entry for the
> >> + * page. But we need to initialise the folio so we can hand it
> >> + * out. Nothing else should have a reference either.
> >> + */
> >> + WARN_ON_ONCE(folio_ref_count(folio));
> >
> > Per above I would feel more comfortable if we kept the paranoia around
> > to ensure that all the pages in this folio have dropped all references
> > and cleared ->mapping and ->index.
> >
> > That paranoia can be placed behind a CONFIG_DEBUB_VM check, and we can
> > delete in a follow-on development cycle, but in the meantime it helps to
> > prove the correctness of the conversion.
>
> I'm ok with paranoia, but as noted above the issue is that at a minimum
> page->mapping (and probably index) now needs to be valid for any code
> that might walk the page tables.
A quick look seems to say the confusion is limited to aliasing
PAGE_MAPPING_ANON.
> > [..]
> >> @@ -1189,11 +1165,14 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> >> struct inode *inode = iter->inode;
> >> unsigned long vaddr = vmf->address;
> >> pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
> >> + struct page *page = pfn_t_to_page(pfn);
> >> vm_fault_t ret;
> >>
> >> *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
> >>
> >> - ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
> >> + page_ref_inc(page);
> >> + ret = dax_insert_pfn(vmf, pfn, false);
> >> + put_page(page);
> >
> > Per above I think it is problematic to have pages live in the system
> > without a refcount.
>
> I'm a bit confused by this - the pages have a reference taken on them
> when they are mapped. They only live in the system without a refcount
> when the mm considers them free (except for the bit between getting
> created in dax_associate_entry() and actually getting mapped but as
> noted I will fix that).
>
> > One scenario where this might be needed is invalidate_inode_pages() vs
> > DMA. The invaldation should pause and wait for DMA pins to be dropped
> > before the mapping xarray is cleaned up and the dax folio is marked
> > free.
>
> I'm not really following this scenario, or at least how it relates to
> the comment above. If the page is pinned for DMA it will have taken a
> refcount on it and so the page won't be considered free/idle per
> dax_wait_page_idle() or any of the other mm code.
[ tl;dr: I think we're ok, analysis below, but I did talk myself into
the proposed dax_busy_page() changes indeed being broken and needing to
remain checking for refcount > 1, not > 0 ]
It's not the mm code I am worried about. It's the filesystem block
allocator staying in-sync with the allocation state of the page.
fs/dax.c is charged with converting idle storage blocks to pfns to
mapped folios. Once they are mapped, DMA can pin the folio, but nothing
in fs/dax.c pins the mapping. In the pagecache case the page reference
is sufficient to keep the DMA-busy page from being reused. In the dax
case something needs to arrange for DMA to be idle before
dax_delete_mapping_entry().
However, looking at XFS it indeed makes that guarantee. First it does
xfs_break_dax_layouts() then it does truncate_inode_pages() =>
dax_delete_mapping_entry().
It follows that that the DMA-idle condition still needs to look for the
case where the refcount is > 1 rather than 0 since refcount == 1 is the
page-mapped-but-DMA-idle condition.
[..]
> >> @@ -1649,9 +1627,10 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> >> loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
> >> bool write = iter->flags & IOMAP_WRITE;
> >> unsigned long entry_flags = pmd ? DAX_PMD : 0;
> >> - int err = 0;
> >> + int ret, err = 0;
> >> pfn_t pfn;
> >> void *kaddr;
> >> + struct page *page;
> >>
> >> if (!pmd && vmf->cow_page)
> >> return dax_fault_cow_page(vmf, iter);
> >> @@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> >> if (dax_fault_is_synchronous(iter, vmf->vma))
> >> return dax_fault_synchronous_pfnp(pfnp, pfn);
> >>
> >> - /* insert PMD pfn */
> >> + page = pfn_t_to_page(pfn);
> >
> > I think this is clearer if dax_insert_entry() returns folios with an
> > elevated refrence count that is dropped when the folio is invalidated
> > out of the mapping.
>
> I presume this comment is for the next line:
>
> + page_ref_inc(page);
>
> I can move that into dax_insert_entry(), but we would still need to
> drop it after calling vmf_insert_*() to ensure we get the 1 -> 0
> transition when the page is unmapped and therefore
> freed. Alternatively we can make it so vmf_insert_*() don't take
> references on the page, and instead ownership of the reference is
> transfered to the mapping. Personally I prefered having those
> functions take their own reference but let me know what you think.
Oh, the model I was thinking was that until vmf_insert_XXX() succeeds
then the page was never allocated because it was never mapped. What
happens with the code as proposed is that put_page() triggers page-free
semantics on vmf_insert_XXX() failures, right?
There is no need to invoke the page-free / final-put path on
vmf_insert_XXX() error because the storage-block / pfn never actually
transitioned into a page / folio.
> > [..]
> >> @@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page)
> >> lock_page(page);
> >> }
> >> EXPORT_SYMBOL_GPL(zone_device_page_init);
> >> -
> >> -#ifdef CONFIG_FS_DAX
> >> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
> >> -{
> >> - if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
> >> - return false;
> >> -
> >> - /*
> >> - * fsdax page refcounts are 1-based, rather than 0-based: if
> >> - * refcount is 1, then the page is free and the refcount is
> >> - * stable because nobody holds a reference on the page.
> >> - */
> >> - if (folio_ref_sub_return(folio, refs) == 1)
> >> - wake_up_var(&folio->_refcount);
> >> - return true;
> >
> > It follow from the refcount disvussion above that I think there is an
> > argument to still keep this wakeup based on the 2->1 transitition.
> > pagecache pages are refcount==1 when they are dma-idle but still
> > allocated. To keep the same semantics for dax a dax_folio would have an
> > elevated refcount whenever it is referenced by mapping entry.
>
> I'm not sold on keeping it as it doesn't seem to offer any benefit
> IMHO. I know both Jason and Christoph were keen to see it go so it be
> good to get their feedback too. Also one of the primary goals of this
> series was to refcount the page normally so we could remove the whole
> "page is free with a refcount of 1" semantics.
The page is still free at refcount 0, no argument there. But, by
introducing a new "page refcount is elevated while mapped" (as it
should), it follows that "page is DMA idle at refcount == 1", right?
Otherwise, the current assumption that fileystems can have
dax_layout_busy_page_range() poll on the state of the pfn in the mapping
is broken because page refcount == 0 also means no page to mapping
association.
More information about the linux-arm-kernel
mailing list