[PATCH v4 00/25] fs/dax: Fix ZONE_DEVICE page reference counts
Alistair Popple
apopple at nvidia.com
Mon Dec 16 21:12:43 PST 2024
Main updates since v3:
- Rebased onto next-20241216
- Fixed a bunch of build breakages reported by John Hubbard and the
kernel test robot due to various combinations of CONFIG options.
- Split the rmap changes into a separate patch as suggested by David H.
- Reworded the description for the P2PDMA change.
Main updates since v2:
- Rename the DAX specific dax_insert_XXX functions to vmf_insert_XXX
and have them pass the vmf struct.
- Separate out the device DAX changes.
- Restore the page share mapping counting and associated warnings.
- Rework truncate to require file-systems to have previously called
dax_break_layout() to remove the address space mapping for a
page. This found several bugs which are fixed by the first half of
the series. The motivation for this was initially to allow the FS
DAX page-cache mappings to hold a reference on the page.
However, that turned out to be a dead end (see the comments on patch
21). Since the rework uncovered those bugs and I think it is an
overall improvement, I have left it here.
Device and FS DAX pages have always maintained their own page
reference counts without following the normal rules for page reference
counting. In particular pages are considered free when the refcount
hits one rather than zero and refcounts are not added when mapping the
page.
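To make the difference between the two conventions concrete, here is a
minimal userspace sketch. It is a toy model only, not kernel code:
struct toy_page and every toy_*/legacy_*/normal_* helper below are
hypothetical names made up for illustration, and they assume nothing
beyond the "free at one" versus "free at zero" rule described above.

```c
#include <stdbool.h>

/* Toy stand-in for a page; only the reference count is modelled. */
struct toy_page {
	int refcount;
};

/* Legacy DAX convention as described above: idle/free at refcount 1. */
static inline bool legacy_dax_page_is_idle(const struct toy_page *page)
{
	return page->refcount == 1;
}

/* Normal convention: a page is free once its refcount reaches 0. */
static inline bool normal_page_is_free(const struct toy_page *page)
{
	return page->refcount == 0;
}

/* Take one reference (same operation under both conventions). */
static inline void toy_page_get(struct toy_page *page)
{
	page->refcount++;
}

/* Drop one reference; true if the page became idle (legacy rule). */
static inline bool toy_page_put_legacy(struct toy_page *page)
{
	return --page->refcount == 1;
}

/* Drop one reference; true if the page became free (normal rule). */
static inline bool toy_page_put_normal(struct toy_page *page)
{
	return --page->refcount == 0;
}
```

In this model a page with a refcount of one counts as idle under the
legacy rule but still in use under the normal rule, which is why generic
code needs special cases for such pages.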
Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
mechanism for allowing GUP to hold references on the page (see
get_dev_pagemap). However there doesn't seem to be any reason why FS
DAX pages need their own reference counting scheme.
By treating the refcounts on these pages the same way as normal pages
we can remove a lot of special checks. In particular pXd_trans_huge()
becomes the same as pXd_leaf(), although I haven't made that change
here. It also frees up a valuable software-defined PTE bit on
architectures that have devmap PTE bits defined.
It also almost certainly allows further clean-up of the devmap managed
functions, but I have left that as a future improvement. It also
enables support for compound ZONE_DEVICE pages which is one of my
primary motivators for doing this work.
Signed-off-by: Alistair Popple <apopple at nvidia.com>
---
Cc: lina at asahilina.net
Cc: zhang.lyra at gmail.com
Cc: gerald.schaefer at linux.ibm.com
Cc: dan.j.williams at intel.com
Cc: vishal.l.verma at intel.com
Cc: dave.jiang at intel.com
Cc: logang at deltatee.com
Cc: bhelgaas at google.com
Cc: jack at suse.cz
Cc: jgg at ziepe.ca
Cc: catalin.marinas at arm.com
Cc: will at kernel.org
Cc: mpe at ellerman.id.au
Cc: npiggin at gmail.com
Cc: dave.hansen at linux.intel.com
Cc: ira.weiny at intel.com
Cc: willy at infradead.org
Cc: djwong at kernel.org
Cc: tytso at mit.edu
Cc: linmiaohe at huawei.com
Cc: david at redhat.com
Cc: peterx at redhat.com
Cc: linux-doc at vger.kernel.org
Cc: linux-kernel at vger.kernel.org
Cc: linux-arm-kernel at lists.infradead.org
Cc: linuxppc-dev at lists.ozlabs.org
Cc: nvdimm at lists.linux.dev
Cc: linux-cxl at vger.kernel.org
Cc: linux-fsdevel at vger.kernel.org
Cc: linux-mm at kvack.org
Cc: linux-ext4 at vger.kernel.org
Cc: linux-xfs at vger.kernel.org
Cc: jhubbard at nvidia.com
Cc: hch at lst.de
Cc: david at fromorbit.com
Alistair Popple (25):
fuse: Fix dax truncate/punch_hole fault path
fs/dax: Return unmapped busy pages from dax_layout_busy_page_range()
fs/dax: Don't skip locked entries when scanning entries
fs/dax: Refactor wait for dax idle page
fs/dax: Create a common implementation to break DAX layouts
fs/dax: Always remove DAX page-cache entries when breaking layouts
fs/dax: Ensure all pages are idle prior to filesystem unmount
fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag
mm/gup.c: Remove redundant check for PCI P2PDMA page
mm/mm_init: Move p2pdma page refcount initialisation to p2pdma
mm: Allow compound zone device pages
mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings
mm/memory: Add vmf_insert_page_mkwrite()
rmap: Add support for PUD sized mappings to rmap
huge_memory: Add vmf_insert_folio_pud()
huge_memory: Add vmf_insert_folio_pmd()
memremap: Add is_device_dax_page() and is_fsdax_page() helpers
gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
proc/task_mmu: Ignore ZONE_DEVICE pages
mm/mlock: Skip ZONE_DEVICE PMDs during mlock
fs/dax: Properly refcount fs dax pages
device/dax: Properly refcount device dax pages when mapping
mm: Remove pXX_devmap callers
mm: Remove devmap related functions and page table bits
Revert "riscv: mm: Add support for ZONE_DEVICE"
Documentation/mm/arch_pgtable_helpers.rst | 6 +-
arch/arm64/Kconfig | 1 +-
arch/arm64/include/asm/pgtable-prot.h | 1 +-
arch/arm64/include/asm/pgtable.h | 24 +-
arch/powerpc/Kconfig | 1 +-
arch/powerpc/include/asm/book3s/64/hash-4k.h | 6 +-
arch/powerpc/include/asm/book3s/64/hash-64k.h | 7 +-
arch/powerpc/include/asm/book3s/64/pgtable.h | 52 +---
arch/powerpc/include/asm/book3s/64/radix.h | 14 +-
arch/powerpc/mm/book3s64/hash_pgtable.c | 3 +-
arch/powerpc/mm/book3s64/pgtable.c | 8 +-
arch/powerpc/mm/book3s64/radix_pgtable.c | 5 +-
arch/powerpc/mm/pgtable.c | 2 +-
arch/riscv/Kconfig | 1 +-
arch/riscv/include/asm/pgtable-64.h | 20 +-
arch/riscv/include/asm/pgtable-bits.h | 1 +-
arch/riscv/include/asm/pgtable.h | 17 +-
arch/x86/Kconfig | 1 +-
arch/x86/include/asm/pgtable.h | 51 +---
arch/x86/include/asm/pgtable_types.h | 5 +-
drivers/dax/device.c | 15 +-
drivers/gpu/drm/nouveau/nouveau_dmem.c | 3 +-
drivers/nvdimm/pmem.c | 4 +-
drivers/pci/p2pdma.c | 19 +-
fs/dax.c | 357 ++++++++++++++-----
fs/ext4/inode.c | 43 +--
fs/fuse/dax.c | 35 +--
fs/fuse/virtio_fs.c | 3 +-
fs/proc/task_mmu.c | 18 +-
fs/userfaultfd.c | 2 +-
fs/xfs/xfs_inode.c | 40 +-
fs/xfs/xfs_inode.h | 3 +-
fs/xfs/xfs_super.c | 18 +-
include/linux/dax.h | 37 ++-
include/linux/huge_mm.h | 22 +-
include/linux/memremap.h | 28 +-
include/linux/migrate.h | 4 +-
include/linux/mm.h | 40 +--
include/linux/mm_types.h | 14 +-
include/linux/mmzone.h | 12 +-
include/linux/page-flags.h | 6 +-
include/linux/pfn_t.h | 20 +-
include/linux/pgtable.h | 21 +-
include/linux/rmap.h | 15 +-
lib/test_hmm.c | 3 +-
mm/Kconfig | 4 +-
mm/debug_vm_pgtable.c | 59 +---
mm/gup.c | 176 +---------
mm/hmm.c | 12 +-
mm/huge_memory.c | 233 +++++++-----
mm/internal.h | 2 +-
mm/khugepaged.c | 2 +-
mm/madvise.c | 8 +-
mm/mapping_dirty_helpers.c | 4 +-
mm/memory-failure.c | 6 +-
mm/memory.c | 126 ++++---
mm/memremap.c | 59 +--
mm/migrate_device.c | 9 +-
mm/mlock.c | 2 +-
mm/mm_init.c | 23 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 5 +-
mm/page_vma_mapped.c | 5 +-
mm/pagewalk.c | 14 +-
mm/pgtable-generic.c | 7 +-
mm/rmap.c | 56 +++-
mm/swap.c | 2 +-
mm/truncate.c | 16 +-
mm/userfaultfd.c | 5 +-
mm/vmscan.c | 5 +-
70 files changed, 922 insertions(+), 928 deletions(-)
base-commit: e25c8d66f6786300b680866c0e0139981273feba
--
git-series 0.9.1