[PATCH 01/13] PCI/P2PDMA: Release the per-cpu ref of pgmap when vm_insert_page() fails

Alistair Popple apopple at nvidia.com
Sun Jan 11 15:21:22 PST 2026


On 2026-01-10 at 02:03 +1100, Bjorn Helgaas <helgaas at kernel.org> wrote...
> On Fri, Jan 09, 2026 at 11:41:51AM +1100, Alistair Popple wrote:
> > On 2026-01-09 at 02:55 +1100, Bjorn Helgaas <helgaas at kernel.org> wrote...
> > > On Thu, Jan 08, 2026 at 02:23:16PM +1100, Alistair Popple wrote:
> > > > On 2025-12-20 at 15:04 +1100, Hou Tao <houtao at huaweicloud.com> wrote...
> > > > > From: Hou Tao <houtao1 at huawei.com>
> > > > > 
> > > > > When vm_insert_page() fails in p2pmem_alloc_mmap(), p2pmem_alloc_mmap()
> > > > > doesn't invoke percpu_ref_put() to free the per-cpu ref of pgmap
> > > > > acquired after gen_pool_alloc_owner(), and memunmap_pages() will hang
> > > > > forever when trying to remove the PCIe device.
> > > > > 
> > > > > Fix it by adding the missed percpu_ref_put().
> > ...
> 
> > > Looking at this again, I'm confused about why in the normal, non-error
> > > case, we do the percpu_ref_tryget_live_rcu(ref), followed by another
> > > percpu_ref_get(ref) for each page, followed by just a single
> > > percpu_ref_put() at the exit.
> > > 
> > > So we do ref_get() "1 + number of pages" times but we only do a single
> > > ref_put().  Is there a loop of ref_put() for each page elsewhere?
> > 
> > Right, the per-page ref_put() happens when the page is freed (ie. the struct
> > page refcount drops to zero) - in this case free_zone_device_folio() will call
> > p2pdma_folio_free() which has the corresponding percpu_ref_put().
> 
> I don't see anything that looks like a loop to call ref_put() for each
> page in free_zone_device_folio() or in p2pdma_folio_free(), but this
> is all completely out of my range, so I'll take your word for it :)  

That's brave :-)

What happens is that the core mm takes over managing the page lifetime once
vm_insert_page() has been (successfully) called to map the page:

	VM_WARN_ON_ONCE_PAGE(!page_ref_count(page), page);
	set_page_count(page, 1);
	ret = vm_insert_page(vma, vaddr, page);
	if (ret) {
		gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
		return ret;
	}
	percpu_ref_get(ref);
	put_page(page);

In the above sequence vm_insert_page() takes a page ref for each page it maps
into the user page tables with folio_get(). This reference is dropped when the
user page table entry is removed, typically by the loop in zap_pte_range().

Normally the user page table mapping is the only thing holding a reference so
it ends up calling folio_put()->free_zone_device_folio()->...->ref_put() one page
at a time as the PTEs are removed from the page tables. At least that's what
happens conceptually - the TLB batching code makes it hard to actually see where
the folio_put() is called in this sequence.

Note the extra set_page_count(page, 1) and put_page(page) in the above sequence
are just there to keep vm_insert_page() happy - it complains if you try to
insert a page with a zero page ref.
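
As a rough illustration of why the counts balance, here is a toy user-space
model of the success path. It is only a sketch, not kernel code: plain
integers stand in for the percpu_ref and the struct page refcount, the
pgmap count starts from a baseline of zero, and every toy_* name is
hypothetical. The comments name the kernel call each step is meant to mirror.

```c
/* Toy user-space model, NOT kernel code. All toy_* names are hypothetical. */
#include <assert.h>
#include <stddef.h>

struct toy_pgmap { long ref; };	/* stands in for the pgmap percpu_ref */
struct toy_page { int refcount; struct toy_pgmap *pgmap; };

/*
 * Models folio_put(): when the last page ref goes away the page is freed
 * and the per-page pgmap ref is dropped, as p2pdma_folio_free() does.
 */
static void toy_put_page(struct toy_page *page)
{
	if (--page->refcount == 0)
		page->pgmap->ref--;
}

/* Models the success path of the mmap loop for n pages. */
static void toy_mmap(struct toy_pgmap *pgmap, struct toy_page *pages, size_t n)
{
	pgmap->ref++;				/* percpu_ref_tryget_live_rcu() */
	for (size_t i = 0; i < n; i++) {
		pages[i].pgmap = pgmap;
		pages[i].refcount = 1;		/* set_page_count(page, 1) */
		pages[i].refcount++;		/* vm_insert_page() takes a ref */
		pgmap->ref++;			/* percpu_ref_get(ref), per page */
		toy_put_page(&pages[i]);	/* put_page(): drop temporary ref */
	}
	pgmap->ref--;				/* the single percpu_ref_put() at exit */
}

/* Models zap_pte_range(): each removed PTE drops the mapping's page ref. */
static void toy_unmap(struct toy_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		toy_put_page(&pages[i]);
}
```

In this model, mapping n pages leaves the pgmap count at n (1 + n gets, one
put), and unmapping them brings it back to zero one page at a time.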

And looking at that sequence there is another minor bug - in the failure
path we are exiting the loop with the failed page ref count set to
1 from set_page_count(page, 1). That needs to be reset to zero with
set_page_count(page, 0) to avoid the VM_WARN_ON_ONCE_PAGE() if the page gets
reused. I will send a fix for that.
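
For illustration only, the failure path with both changes applied might be
modelled like this in user space. This is a sketch under assumptions, not
the actual patch: the sketch_* names are hypothetical, plain integers stand
in for the real refcounts, and the pgmap count is assumed to already hold
the one reference taken before the loop.

```c
/* Toy user-space model, NOT the real fix. sketch_* names are hypothetical. */
#include <assert.h>
#include <stdbool.h>

struct sketch_page { int refcount; };

/*
 * Models one loop iteration, including the error exit.
 * Returns 0 on success, -1 when the simulated vm_insert_page() fails.
 */
static int sketch_map_page(struct sketch_page *page, long *pgmap_ref,
			   bool insert_fails)
{
	page->refcount = 1;		/* set_page_count(page, 1) */
	if (insert_fails) {		/* vm_insert_page() returned an error */
		page->refcount = 0;	/* reset: set_page_count(page, 0) */
		(*pgmap_ref)--;		/* patch under review: percpu_ref_put() */
		return -1;
	}
	page->refcount++;		/* vm_insert_page() took its ref */
	(*pgmap_ref)++;			/* percpu_ref_get(ref) */
	page->refcount--;		/* put_page(page) */
	return 0;
}
```

With one page mapped and a second one failing, the model ends with the
failed page back at a zero refcount and the pgmap count holding only the
reference owned by the page that is still mapped.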

 - Alistair

> Bjorn


