[PATCH v2 00/61] Introducing the Maple Tree

Liam Howlett liam.howlett at oracle.com
Tue Aug 17 08:47:03 PDT 2021


The maple tree is an RCU-safe, range-based B-tree designed to use modern
processor caches efficiently.  There are a number of places in the kernel
where a non-overlapping, range-based tree would be beneficial, especially
one with a simple interface.  The first user covered in this patch set is
the vm_area_struct, where three data structures are replaced by the maple
tree: the augmented rbtree, the vma cache, and the linked list of VMAs in
the mm_struct.  The long-term goal is to reduce or remove mmap_sem
contention.
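As a rough illustration of what "range-based" means here, the following
userspace sketch models a single leaf node.  Sorted, inclusive range ends
(pivots) sit next to the entries they bound, and a NULL entry marks a gap,
so one structure answers both "which entry contains this address" and
"which entry comes next", questions previously spread across the rbtree,
the vma cache and the linked list.  The names struct range_leaf,
leaf_load() and leaf_find() are hypothetical; this is a simplification,
not the kernel's node layout or API (those live in
include/linux/maple_tree.h).

```c
#include <assert.h>
#include <stddef.h>

#define LEAF_SLOTS 16	/* leaf branching factor quoted for the maple tree */

/*
 * Hypothetical, simplified leaf (not the kernel's struct maple_node).
 * Slot 0 covers [min, pivot[0]]; slot i > 0 covers
 * [pivot[i - 1] + 1, pivot[i]].  A NULL slot is a gap.
 */
struct range_leaf {
	unsigned long min;
	unsigned long pivot[LEAF_SLOTS];	/* inclusive range ends, sorted */
	void *slot[LEAF_SLOTS];
	unsigned int nr;			/* slots in use */
};

/* Return the entry whose range contains index, or NULL for a gap. */
static void *leaf_load(const struct range_leaf *leaf, unsigned long index)
{
	assert(leaf->nr > 0 && leaf->nr <= LEAF_SLOTS);
	if (index < leaf->min)
		return NULL;
	for (unsigned int i = 0; i < leaf->nr; i++)
		if (index <= leaf->pivot[i])
			return leaf->slot[i];
	return NULL;	/* past the last range in this leaf */
}

/*
 * Return the first non-NULL entry whose range ends at or after index,
 * in the spirit of find_vma().  Scanning forward through the slots
 * takes the place of following a per-entry next pointer.
 */
static void *leaf_find(const struct range_leaf *leaf, unsigned long index)
{
	assert(leaf->nr > 0 && leaf->nr <= LEAF_SLOTS);
	for (unsigned int i = 0; i < leaf->nr; i++)
		if (index <= leaf->pivot[i] && leaf->slot[i])
			return leaf->slot[i];
	return NULL;
}
```

For example, with entries stored for [0x1000, 0x2fff] and [0x4000, 0x4fff]
and everything else left as gaps, leaf_load() on 0x3800 lands in a gap,
while leaf_find() on 0x3800 returns the [0x4000, 0x4fff] entry, much as
find_vma() returns the first VMA ending above an address.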

The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes.  With the higher branching factor, the tree is significantly shorter
than the rbtree, so a walk incurs fewer cache misses.  Removing the linked
list between consecutive entries also reduces cache misses and avoids the
need to pull in the previous and next VMA during many tree alterations.
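A back-of-the-envelope comparison of tree height, assuming every node is
completely full (so these are minimum heights) and comparing against a
perfectly balanced binary tree (an rbtree may be up to roughly twice as
tall).  The helper names are hypothetical; the fanouts of 16 and 10 come
from the paragraph above.

```c
#include <assert.h>

/*
 * Minimum height of a binary tree holding n entries: the smallest h
 * with 2^h - 1 >= n (a perfectly balanced tree).
 */
static unsigned int binary_levels(unsigned long n)
{
	unsigned int h = 0;
	unsigned long cap = 0;	/* entries in a full tree of height h */

	while (cap < n) {
		cap = cap * 2 + 1;
		h++;
	}
	return h;
}

/*
 * Minimum height of a B-tree holding n entries when every leaf holds
 * leaf_fanout entries and every internal node has internal_fanout
 * children.
 */
static unsigned int btree_levels(unsigned long n, unsigned long leaf_fanout,
				 unsigned long internal_fanout)
{
	unsigned int h = 1;
	unsigned long cap = leaf_fanout;	/* one full leaf */

	assert(leaf_fanout >= 1 && internal_fanout >= 2);
	while (cap < n) {
		cap *= internal_fanout;	/* add a level of full internal nodes */
		h++;
	}
	return h;
}
```

Under these assumptions, 1000 entries need at least 10 binary levels but
only 3 B-tree levels (btree_levels(1000, 16, 10)), and 100000 entries need
17 versus 5; each level saved is potentially one fewer cache miss per
lookup.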

This patch series is based on next-20210811 with "remap_file_pages: Use
vma_lookup() instead of find_vma()" applied.

Link:
https://github.com/oracle/linux-uek/releases/tag/howlett%2Fmaple%2F20210816

Performance on a 144-core x86 machine:

While still using the mmap_sem, performance on real-world workloads seems
fairly similar, although micro-benchmarks show variations.

Performance increases in the following micro-benchmarks (Hmean):
- wis malloc1-threads: Increase of 13% to 840%
- wis page_fault1-threads: Increase of 1% to 14%
- wis brk1-threads: Disregard; this test is invalid.


Performance decreases in the following micro-benchmarks (Hmean):
- wis brk1-processes: Decrease of 45% due to the RCU usage now required

Mixed:
- wis pthread_mutex1-threads: +11% to -3%
- wis signal1-threads: +6% to -12%
- wis malloc1-processes: +9% to -18% (-18% at 2 processes, increasing
  with more processes)
- wis page_fault3-threads: +8% to -22%

kernbench:
Amean     user-2        884.88 (   0.00%)      882.61 *   0.26%*
Amean     syst-2        157.38 (   0.00%)      161.23 *  -2.45%*
Amean     elsp-2        526.17 (   0.00%)      527.53 *  -0.26%* 
Amean     user-4        919.90 (   0.00%)      910.87 *   0.98%*
Amean     syst-4        166.21 (   0.00%)      170.06 *  -2.32%*
Amean     elsp-4        278.01 (   0.00%)      276.83 *   0.42%*
Amean     user-8        973.23 (   0.00%)      970.73 *   0.26%*
Amean     syst-8        176.70 (   0.00%)      181.00 *  -2.44%*
Amean     elsp-8        152.24 (   0.00%)      153.33 *  -0.72%*
Amean     user-16      1040.15 (   0.00%)     1045.90 *  -0.55%*
Amean     syst-16       185.13 (   0.00%)      191.08 *  -3.21%*
Amean     elsp-16        85.47 (   0.00%)       86.70 *  -1.44%*
Amean     user-32      1189.60 (   0.00%)     1187.91 *   0.14%*
Amean     syst-32       210.02 (   0.00%)      219.46 *  -4.49%*
Amean     elsp-32        53.86 (   0.00%)       53.91 *  -0.08%*
Amean     user-64      1222.05 (   0.00%)     1230.00 *  -0.65%*
Amean     syst-64       213.37 (   0.00%)      223.57 *  -4.78%*
Amean     elsp-64        32.87 (   0.00%)       33.42 *  -1.68%*
Amean     user-128     1618.73 (   0.00%)     1614.52 *   0.26%*
Amean     syst-128      259.72 (   0.00%)      272.95 *  -5.09%*
Amean     elsp-128       25.91 (   0.00%)       25.93 *  -0.08%*


gitcheckout:
Amean     User           0.00 (   0.00%)        0.00 *   0.00%*
Amean     System         8.09 (   0.00%)        7.90 *   2.25%*
Amean     Elapsed       22.89 (   0.00%)       22.50 *   1.74%*
Amean     CPU           93.53 (   0.00%)       93.67 *  -0.14%*
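The column headers were not included in the tables above.  Assuming the
first column is the baseline and the second is with this series applied,
the starred percentages are consistent with a relative gain of
(baseline - new) / baseline * 100, so a negative value means the series
took more time.  pct_gain() is a hypothetical helper used only to check
that arithmetic against a few kernbench rows.

```c
#include <assert.h>

/*
 * Relative gain in percent of new_val against base.  Positive when
 * new_val is lower (for a time, less was spent); negative when it is
 * higher.
 */
static double pct_gain(double base, double new_val)
{
	assert(base != 0.0);
	return (base - new_val) / base * 100.0;
}
```

For example, syst-2 gives (157.38 - 161.23) / 157.38 * 100, about -2.45%,
matching the table.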


v2 changes:
- Split out unlock_range() into its own cleanup patch, already upstream
- Split off vma_lookup() into its own 22-patch series, already upstream
- Fixed locking issue when brk does not change but succeeds. Thanks Suren
  Baghdasaryan
- Moved locking in brk much earlier to match mmap_sem
- Fixed RCU locking issue in mm/khugepaged.  Thanks Hillf Danton
- RCU fixes in userfaultfd_release, mlock, munmap, task_mmu, and nommu
- Removed mm_populate_vma() and related patches from this set
- Removed the inlining of remove_vma_list() from this set, as the function
  itself is removed
- Converted all comments to C-style comments as suggested by Peter Zijlstra
- Converted the comments in test_maple_tree.c to C-style comments as well
- Changed #defines to hex as requested by Peter Zijlstra
- Fixed whitespace error in mas_set_range().  Thanks Peter Zijlstra
- Added asserts for mas->depth and for the mte_pivot() range check.  Thanks
  Peter Zijlstra
- Updated comments for mas_alloc_req() and friends.  Thanks Peter Zijlstra
- Added back the parent pointer decoding support and added explanations on how
  the encoding/decoding works.  Thanks Peter Zijlstra
- Expanded the maximum maple tree height to 31 and added a BUG_ON when
  exceeding that value.  There should be no way to reach a height of 31.
  Thanks Peter Zijlstra
- Added comment on harmless race in mmget_not_zero() - Thanks Suren Baghdasaryan
- Removed debug statement left in during testing - Thanks Suren Baghdasaryan
- Fixed locking in dup_mmap() - Thanks Suren Baghdasaryan
- Changed the RCU locking in areas that may sleep.
- Added RCU stress testing and fixed the maple tree specific issues it
  exposed.  Thanks Paul McKenney for helping with this.
- Large documentation update.


Patch organization:
Patches 1-4 are radix tree test suite additions for maple tree support.

Patch 5 adds the maple tree.  Test code is 37000 lines.

Patches 6-11 are the removal of the rbtree from the mm_struct.

Patches 12-18 are the removal of the vmacache from the kernel.

Patches 19-60 are the removal of the vma linked list from the mm_struct.

Patch 61 is a small cleanup from the removal of the vma linked list.

Liam R. Howlett (61):
  radix tree test suite: Add pr_err define
  radix tree test suite: Add kmem_cache_set_non_kernel()
  radix tree test suite: Add allocation counts and size to kmem_cache
  radix tree test suite: Add support for slab bulk APIs
  Maple Tree: Add new data structure
  mm: Start tracking VMAs with maple tree
  mm/mmap: Use the maple tree in find_vma() instead of the rbtree.
  mm/mmap: Use the maple tree for find_vma_prev() instead of the rbtree
  mm/mmap: Use maple tree for unmapped_area{_topdown}
  kernel/fork: Use maple tree for dup_mmap() during forking
  mm: Remove rb tree.
  xen/privcmd: Optimized privcmd_ioctl_mmap() by using vma_lookup()
  mm: Optimize find_exact_vma() to use vma_lookup()
  mm/khugepaged: Optimize collapse_pte_mapped_thp() by using
    vma_lookup()
  mm/mmap: Change do_brk_flags() to expand existing VMA and add
    do_brk_munmap()
  mm: Use maple tree operations for find_vma_intersection() and
    find_vma()
  mm/mmap: Use advanced maple tree API for mmap_region()
  mm: Remove vmacache
  mm/mmap: Move mmap_region() below do_munmap()
  mm/mmap: Convert count_vma_pages_range() to use ma_state
  mm/mmap: Reorganize munmap to use maple states
  mm/mmap: Change do_brk_munmap() to use do_mas_align_munmap()
  mm: Introduce vma_next() and vma_prev()
  arch/arm64: Remove mmap linked list from vdso.
  arch/parisc: Remove mmap linked list from kernel/cache
  arch/powerpc: Remove mmap linked list from mm/book3s32/tlb
  arch/powerpc: Remove mmap linked list from mm/book3s64/subpage_prot
  arch/s390: Use maple tree iterators instead of linked list.
  arch/x86: Use maple tree iterators for vdso/vma
  arch/xtensa: Use maple tree iterators for unmapped area
  drivers/misc/cxl: Use maple tree iterators for cxl_prefault_vma()
  drivers/tee/optee: Use maple tree iterators for __check_mem_type()
  fs/binfmt_elf: Use maple tree iterators for fill_files_note()
  fs/coredump: Use maple tree iterators in place of linked list
  fs/exec: Use vma_next() instead of linked list
  fs/proc/base: Use maple tree iterators in place of linked list
  fs/proc/task_mmu: Stop using linked list and highest_vm_end
  fs/userfaultfd: Stop using vma linked list.
  ipc/shm: Stop using the vma linked list
  kernel/acct: Use maple tree iterators instead of linked list
  kernel/events/core: Use maple tree iterators instead of linked list
  kernel/events/uprobes: Use maple tree iterators instead of linked list
  kernel/sched/fair: Use maple tree iterators instead of linked list
  kernel/sys: Use maple tree iterators instead of linked list
  arch/um/kernel/tlb: Stop using linked list
  bpf: Remove VMA linked list
  mm/gup: Use maple tree navigation instead of linked list
  mm/khugepaged: Use maple tree iterators instead of vma linked list
  mm/ksm: Use maple tree iterators instead of vma linked list
  mm/madvise: Use vma_next instead of vma linked list
  mm/memcontrol: Stop using mm->highest_vm_end
  mm/mempolicy: Use maple tree iterators instead of vma linked list
  mm/mlock: Use maple tree iterators instead of vma linked list
  mm/mprotect: Use maple tree navigation instead of vma linked list
  mm/mremap: Use vma_next() instead of vma linked list
  mm/msync: Use vma_next() instead of vma linked list
  mm/oom_kill: Use maple tree iterators instead of vma linked list
  mm/pagewalk: Use vma_next() instead of vma linked list
  mm/swapfile: Use maple tree iterator instead of vma linked list
  mm: Remove the vma linked list
  mm/mmap: Drop range_has_overlap() function

 Documentation/core-api/index.rst              |     1 +
 Documentation/core-api/maple-tree.rst         |   508 +
 MAINTAINERS                                   |    12 +
 arch/arm64/kernel/vdso.c                      |     5 +-
 arch/parisc/kernel/cache.c                    |    15 +-
 arch/powerpc/mm/book3s32/tlb.c                |     5 +-
 arch/powerpc/mm/book3s64/subpage_prot.c       |    15 +-
 arch/s390/configs/debug_defconfig             |     1 -
 arch/s390/mm/gmap.c                           |     8 +-
 arch/um/kernel/tlb.c                          |    16 +-
 arch/x86/entry/vdso/vma.c                     |    12 +-
 arch/x86/kernel/tboot.c                       |     2 +-
 arch/xtensa/kernel/syscall.c                  |     4 +-
 drivers/firmware/efi/efi.c                    |     2 +-
 drivers/misc/cxl/fault.c                      |     6 +-
 drivers/tee/optee/call.c                      |    15 +-
 drivers/xen/privcmd.c                         |     2 +-
 fs/binfmt_elf.c                               |     5 +-
 fs/coredump.c                                 |    13 +-
 fs/exec.c                                     |     9 +-
 fs/proc/base.c                                |     7 +-
 fs/proc/task_mmu.c                            |    48 +-
 fs/proc/task_nommu.c                          |    55 +-
 fs/userfaultfd.c                              |    34 +-
 include/linux/maple_tree.h                    |   474 +
 include/linux/mm.h                            |    54 +-
 include/linux/mm_types.h                      |    31 +-
 include/linux/mm_types_task.h                 |     5 -
 include/linux/sched.h                         |     1 -
 include/linux/sched/mm.h                      |     9 +
 include/linux/vm_event_item.h                 |     4 -
 include/linux/vmacache.h                      |    28 -
 include/linux/vmstat.h                        |     6 -
 include/trace/events/maple_tree.h             |   227 +
 include/trace/events/mmap.h                   |    71 +
 init/main.c                                   |     2 +
 ipc/shm.c                                     |    13 +-
 kernel/acct.c                                 |     8 +-
 kernel/bpf/task_iter.c                        |     6 +-
 kernel/debug/debug_core.c                     |    12 -
 kernel/events/core.c                          |     7 +-
 kernel/events/uprobes.c                       |    25 +-
 kernel/fork.c                                 |    61 +-
 kernel/sched/fair.c                           |    14 +-
 kernel/sys.c                                  |     6 +-
 lib/Kconfig.debug                             |    15 +-
 lib/Makefile                                  |     3 +-
 lib/maple_tree.c                              |  6779 +++
 lib/test_maple_tree.c                         | 37000 ++++++++++++++++
 mm/Makefile                                   |     2 +-
 mm/debug.c                                    |    14 +-
 mm/gup.c                                      |     7 +-
 mm/huge_memory.c                              |     4 +-
 mm/init-mm.c                                  |     4 +-
 mm/internal.h                                 |    81 +-
 mm/khugepaged.c                               |    11 +-
 mm/ksm.c                                      |    26 +-
 mm/madvise.c                                  |     2 +-
 mm/memcontrol.c                               |     6 +-
 mm/memory.c                                   |    33 +-
 mm/mempolicy.c                                |    41 +-
 mm/mlock.c                                    |    21 +-
 mm/mmap.c                                     |  2129 +-
 mm/mprotect.c                                 |    13 +-
 mm/mremap.c                                   |    13 +-
 mm/msync.c                                    |     2 +-
 mm/nommu.c                                    |   120 +-
 mm/oom_kill.c                                 |     5 +-
 mm/pagewalk.c                                 |     2 +-
 mm/swapfile.c                                 |     9 +-
 mm/util.c                                     |    32 -
 mm/vmacache.c                                 |   117 -
 mm/vmstat.c                                   |     4 -
 tools/testing/radix-tree/.gitignore           |     2 +
 tools/testing/radix-tree/Makefile             |    13 +-
 tools/testing/radix-tree/generated/autoconf.h |     1 +
 tools/testing/radix-tree/linux.c              |   160 +-
 tools/testing/radix-tree/linux/kernel.h       |     1 +
 tools/testing/radix-tree/linux/maple_tree.h   |     7 +
 tools/testing/radix-tree/linux/slab.h         |     4 +
 tools/testing/radix-tree/maple.c              |    59 +
 .../radix-tree/trace/events/maple_tree.h      |     8 +
 82 files changed, 46975 insertions(+), 1639 deletions(-)
 create mode 100644 Documentation/core-api/maple-tree.rst
 create mode 100644 include/linux/maple_tree.h
 delete mode 100644 include/linux/vmacache.h
 create mode 100644 include/trace/events/maple_tree.h
 create mode 100644 lib/maple_tree.c
 create mode 100644 lib/test_maple_tree.c
 delete mode 100644 mm/vmacache.c
 create mode 100644 tools/testing/radix-tree/linux/maple_tree.h
 create mode 100644 tools/testing/radix-tree/maple.c
 create mode 100644 tools/testing/radix-tree/trace/events/maple_tree.h

-- 
2.30.2


