From thehajime at gmail.com  Sun Nov  2 01:49:25 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sun,  2 Nov 2025 18:49:25 +0900
Subject: [PATCH v12 00/13] nommu UML
Message-ID: <cover.1762075876.git.thehajime@gmail.com>

This patchset is another spin of the nommu mode addition to UML.  Any
feedback on it would be appreciated.

There are still several known limitations, listed below.

- memory mapped by loadable modules is not distinguished from
  userspace memory.
- CONFIG_SMP is disabled, as host_fs handling doesn't work with thread
  local storage.

-- Hajime

v12:
- rebase with the latest uml/next branch
- disable SMP and tls as those do not work with host_fs handling ([11/13])

v11:
- clean up userspace return routine and integrate to userspace() ([04/13])
- fix direction flag issue on using nolibc memcpy ([04/13])
- fix a crash issue when using usermode helper ([06/13])
- test with out-of-tree kunit-uapi patches (which uses umh)
 - https://lore.kernel.org/all/20250626-kunit-kselftests-v4-0-48760534fef5 at linutronix.de/
 - https://lore.kernel.org/all/20250626195714.2123694-3-benjamin at sipsolutions.net/
- https://lore.kernel.org/all/cover.1758181109.git.thehajime at gmail.com/

v10:
- fix wrong comment on gs register handling ([09/13])
- remove unnecessary code of early syscall implementation ([04/13])
- https://lore.kernel.org/all/cover.1750594487.git.thehajime at gmail.com/

v9:
- rebase with the latest uml/next branch
- add performance numbers of new SECCOMP mode, and update results ([12/13])
- add a workaround for upstream change on MMU dependency to PCI drivers ([10/13])
- https://lore.kernel.org/all/cover.1750294482.git.thehajime at gmail.com/

v8:
- rebase with the latest uml/next branch
- clean up segv_handler to align with the latest uml ([9/12])
- https://lore.kernel.org/all/cover.1745980082.git.thehajime at gmail.com/

v7:
- properly handle FP register upon signal delivery [10/13]
- update benchmark result with new FP register handling [12/13]
- fix arch_has_single_step() for !MMU case [07/13]
- revert stack alignment as it is in uml/fixes tree [10/13]
- https://lore.kernel.org/all/cover.1737348399.git.thehajime at gmail.com/

v6:
- rebase to the latest uml/next tree
- more clean up on mmu/nommu for signal handling [10/13]
- rename functions of mcontext routines [06,10/13]
- added Acked-by tag for binfmt_elf_fdpic [02/13]
- https://lore.kernel.org/linux-um/cover.1736853925.git.thehajime at gmail.com/

v5:
- clean up stack manipulation code [05,06,07,10/13]
- https://lore.kernel.org/linux-um/cover.1733998168.git.thehajime at gmail.com/

v4:
- add arch/um/nommu, arch/x86/um/nommu to contain !MMU specific code
- remove zpoline patch
- drop binfmt_elf_fdpic patch
- reduce ifndef CONFIG_MMU if possible
- split to elf header cleanup patch [01/13]
- fix kernel test robot warnings [06/13]
- fix coding styles [07/13]
- move task_top_of_stack definition [05/13]
- https://lore.kernel.org/linux-um/cover.1733652929.git.thehajime at gmail.com/

v3:
- https://lore.kernel.org/linux-um/cover.1733199769.git.thehajime at gmail.com/
- add seccomp-based syscall hook in addition to zpoline [06/13]
- remove RFC, add a line to MAINTAINERS file
- fix kernel test robot warnings [02/13,08/13,10/13]
- add base-commit tag to cover letter
- pull the latest uml/next
- clean up SIGSEGV handling [10/13]
- detect fsgsbase availability with elf aux vector [08/13]
- simplify vdso code with macros [09/13]

RFC v2:
- https://lore.kernel.org/linux-um/cover.1731290567.git.thehajime at gmail.com/
- base branch is now uml/linux.git instead of torvalds/linux.git.
- reorganize the patch series to clean up
- fixed various coding style issues
- clean up exec code path [07/13]
- fixed the crash/SIGSEGV case on userspace programs [10/13]
- add seccomp filter to limit syscall caller address [06/13]
- detect fsgsbase availability with sigsetjmp/siglongjmp [08/13]
- remove unrelated changes
- remove unneeded ifndef CONFIG_MMU
- convert UML_CONFIG_MMU to CONFIG_MMU as using uml/linux.git
- proposed a patch for the maple-tree issue (resolving a limitation in RFC v1)
  https://lore.kernel.org/linux-mm/20241108222834.3625217-1-thehajime at gmail.com/

RFC:
- https://lore.kernel.org/linux-um/cover.1729770373.git.thehajime at gmail.com/

Hajime Tazaki (13):
  x86/um: nommu: elf loader for fdpic
  um: decouple MMU specific code from the common part
  um: nommu: memory handling
  x86/um: nommu: syscall handling
  um: nommu: seccomp syscalls hook
  x86/um: nommu: process/thread handling
  um: nommu: configure fs register on host syscall invocation
  x86/um/vdso: nommu: vdso memory update
  x86/um: nommu: signal handling
  um: change machine name for uname output
  um: nommu: disable SMP on nommu UML
  um: nommu: add documentation of nommu UML
  um: nommu: plug nommu code into build system

 Documentation/virt/uml/nommu-uml.rst   | 180 ++++++++++++++++++++++
 MAINTAINERS                            |   1 +
 arch/um/Kconfig                        |  14 +-
 arch/um/Makefile                       |  10 ++
 arch/um/configs/x86_64_nommu_defconfig |  54 +++++++
 arch/um/include/asm/futex.h            |   4 +
 arch/um/include/asm/mmu.h              |   8 +
 arch/um/include/asm/mmu_context.h      |   2 +
 arch/um/include/asm/ptrace-generic.h   |   8 +-
 arch/um/include/asm/uaccess.h          |   7 +-
 arch/um/include/shared/kern_util.h     |   6 +
 arch/um/include/shared/os.h            |  16 ++
 arch/um/kernel/Makefile                |   5 +-
 arch/um/kernel/mem-pgtable.c           |  55 +++++++
 arch/um/kernel/mem.c                   |  38 +----
 arch/um/kernel/process.c               |  38 +++++
 arch/um/kernel/skas/process.c          |  37 -----
 arch/um/kernel/um_arch.c               |   3 +
 arch/um/nommu/Makefile                 |   3 +
 arch/um/nommu/os-Linux/Makefile        |   7 +
 arch/um/nommu/os-Linux/seccomp.c       |  87 +++++++++++
 arch/um/nommu/os-Linux/signal.c        |  24 +++
 arch/um/nommu/trap.c                   | 201 +++++++++++++++++++++++++
 arch/um/os-Linux/Makefile              |   3 +-
 arch/um/os-Linux/internal.h            |   8 +
 arch/um/os-Linux/mem.c                 |   4 +
 arch/um/os-Linux/process.c             | 139 ++++++++++++++++-
 arch/um/os-Linux/signal.c              |  11 +-
 arch/um/os-Linux/skas/process.c        | 127 ----------------
 arch/um/os-Linux/start_up.c            |  25 ++-
 arch/um/os-Linux/util.c                |   3 +-
 arch/x86/um/Kconfig                    |   2 +-
 arch/x86/um/Makefile                   |   7 +-
 arch/x86/um/asm/elf.h                  |   8 +-
 arch/x86/um/asm/syscall.h              |   6 +
 arch/x86/um/nommu/Makefile             |   8 +
 arch/x86/um/nommu/do_syscall_64.c      |  75 +++++++++
 arch/x86/um/nommu/entry_64.S           | 114 ++++++++++++++
 arch/x86/um/nommu/os-Linux/Makefile    |   6 +
 arch/x86/um/nommu/os-Linux/mcontext.c  |  26 ++++
 arch/x86/um/nommu/syscalls.h           |  18 +++
 arch/x86/um/nommu/syscalls_64.c        | 121 +++++++++++++++
 arch/x86/um/shared/sysdep/mcontext.h   |   5 +
 arch/x86/um/shared/sysdep/ptrace.h     |   2 +-
 arch/x86/um/vdso/vma.c                 |  17 ++-
 fs/Kconfig.binfmt                      |   2 +-
 46 files changed, 1322 insertions(+), 223 deletions(-)
 create mode 100644 Documentation/virt/uml/nommu-uml.rst
 create mode 100644 arch/um/configs/x86_64_nommu_defconfig
 create mode 100644 arch/um/kernel/mem-pgtable.c
 create mode 100644 arch/um/nommu/Makefile
 create mode 100644 arch/um/nommu/os-Linux/Makefile
 create mode 100644 arch/um/nommu/os-Linux/seccomp.c
 create mode 100644 arch/um/nommu/os-Linux/signal.c
 create mode 100644 arch/um/nommu/trap.c
 create mode 100644 arch/x86/um/nommu/Makefile
 create mode 100644 arch/x86/um/nommu/do_syscall_64.c
 create mode 100644 arch/x86/um/nommu/entry_64.S
 create mode 100644 arch/x86/um/nommu/os-Linux/Makefile
 create mode 100644 arch/x86/um/nommu/os-Linux/mcontext.c
 create mode 100644 arch/x86/um/nommu/syscalls.h
 create mode 100644 arch/x86/um/nommu/syscalls_64.c


base-commit: 8e03c195cc4d82100291500f772f85c686653748
-- 
2.43.0


From thehajime at gmail.com  Sun Nov  2 01:49:26 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sun,  2 Nov 2025 18:49:26 +0900
Subject: [PATCH v12 01/13] x86/um: nommu: elf loader for fdpic
In-Reply-To: <cover.1762075876.git.thehajime@gmail.com>
References: <cover.1762075876.git.thehajime@gmail.com>
Message-ID: <d0537389ac18acb03ae3bdaf25473d56f7e74f4f.1762075876.git.thehajime@gmail.com>

As UML now supports the CONFIG_MMU=n case, it has to use an alternate
ELF loader, the FDPIC ELF loader.  This commit adds the necessary
definitions to the arch code, as UML has not used this loader so far.
It also updates the Kconfig file to allow BINFMT_ELF_FDPIC under a !MMU
environment.

Cc: Eric Biederman <ebiederm at xmission.com>
Cc: Kees Cook <kees at kernel.org>
Cc: Alexander Viro <viro at zeniv.linux.org.uk>
Cc: Christian Brauner <brauner at kernel.org>
Cc: Jan Kara <jack at suse.cz>
Cc: linux-mm at kvack.org
Cc: linux-fsdevel at vger.kernel.org
Acked-by: Kees Cook <kees at kernel.org>
Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Ricardo Koller <ricarkol at google.com>
---
 arch/um/include/asm/mmu.h            | 5 +++++
 arch/um/include/asm/ptrace-generic.h | 6 ++++++
 arch/x86/um/asm/elf.h                | 8 ++++++--
 fs/Kconfig.binfmt                    | 2 +-
 4 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index 07d48738b402..82a919132aff 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -21,6 +21,11 @@ typedef struct mm_context {
 	spinlock_t sync_tlb_lock;
 	unsigned long sync_tlb_range_from;
 	unsigned long sync_tlb_range_to;
+
+#ifdef CONFIG_BINFMT_ELF_FDPIC
+	unsigned long   exec_fdpic_loadmap;
+	unsigned long   interp_fdpic_loadmap;
+#endif
 } mm_context_t;
 
 #define INIT_MM_CONTEXT(mm)						\
diff --git a/arch/um/include/asm/ptrace-generic.h b/arch/um/include/asm/ptrace-generic.h
index 86d74f9d33cf..62e9916078ec 100644
--- a/arch/um/include/asm/ptrace-generic.h
+++ b/arch/um/include/asm/ptrace-generic.h
@@ -29,6 +29,12 @@ struct pt_regs {
 
 #define PTRACE_OLDSETOPTIONS 21
 
+#ifdef CONFIG_BINFMT_ELF_FDPIC
+#define PTRACE_GETFDPIC		31
+#define PTRACE_GETFDPIC_EXEC	0
+#define PTRACE_GETFDPIC_INTERP	1
+#endif
+
 struct task_struct;
 
 extern long subarch_ptrace(struct task_struct *child, long request,
diff --git a/arch/x86/um/asm/elf.h b/arch/x86/um/asm/elf.h
index 62ed5d68a978..33f69f1eac10 100644
--- a/arch/x86/um/asm/elf.h
+++ b/arch/x86/um/asm/elf.h
@@ -9,6 +9,7 @@
 #include <skas.h>
 
 #define CORE_DUMP_USE_REGSET
+#define ELF_FDPIC_CORE_EFLAGS  0
 
 #ifdef CONFIG_X86_32
 
@@ -190,8 +191,11 @@ extern int arch_setup_additional_pages(struct linux_binprm *bprm,
 
 extern unsigned long um_vdso_addr;
 #define AT_SYSINFO_EHDR 33
-#define ARCH_DLINFO	NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr)
-
+#define ARCH_DLINFO						\
+do {								\
+	NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr);		\
+	NEW_AUX_ENT(AT_MINSIGSTKSZ, 0);			\
+} while (0)
 #endif
 
 typedef unsigned long elf_greg_t;
diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index 1949e25c7741..0a92bebd5f75 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -58,7 +58,7 @@ config ARCH_USE_GNU_PROPERTY
 config BINFMT_ELF_FDPIC
 	bool "Kernel support for FDPIC ELF binaries"
 	default y if !BINFMT_ELF
-	depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
+	depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
 	select ELFCORE
 	help
 	  ELF FDPIC binaries are based on ELF, but allow the individual load
-- 
2.43.0


From thehajime at gmail.com  Sun Nov  2 01:49:27 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sun,  2 Nov 2025 18:49:27 +0900
Subject: [PATCH v12 02/13] um: decouple MMU specific code from the common part
In-Reply-To: <cover.1762075876.git.thehajime@gmail.com>
References: <cover.1762075876.git.thehajime@gmail.com>
Message-ID: <08489faaad68a17037e1f24b2a39d8fc3b021c61.1762075876.git.thehajime@gmail.com>

This splits the memory and process related code into common and MMU
specific parts, in order to avoid ifdefs in .c files and duplication
between the MMU and !MMU cases.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
---
 arch/um/kernel/Makefile         |   5 +-
 arch/um/kernel/mem-pgtable.c    |  55 ++++++++++++++
 arch/um/kernel/mem.c            |  35 ---------
 arch/um/kernel/process.c        |  38 ++++++++++
 arch/um/kernel/skas/process.c   |  37 ---------
 arch/um/os-Linux/Makefile       |   3 +-
 arch/um/os-Linux/process.c      | 129 ++++++++++++++++++++++++++++++++
 arch/um/os-Linux/skas/process.c | 127 -------------------------------
 8 files changed, 227 insertions(+), 202 deletions(-)
 create mode 100644 arch/um/kernel/mem-pgtable.c

diff --git a/arch/um/kernel/Makefile b/arch/um/kernel/Makefile
index be60bc451b3f..76d36751973e 100644
--- a/arch/um/kernel/Makefile
+++ b/arch/um/kernel/Makefile
@@ -16,9 +16,10 @@ always-$(KBUILD_BUILTIN) := vmlinux.lds
 
 obj-y = config.o exec.o exitcode.o irq.o ksyms.o mem.o \
 	physmem.o process.o ptrace.o reboot.o sigio.o \
-	signal.o sysrq.o time.o tlb.o trap.o \
-	um_arch.o umid.o kmsg_dump.o capflags.o skas/
+	signal.o sysrq.o time.o \
+	um_arch.o umid.o kmsg_dump.o capflags.o
 obj-y += load_file.o
+obj-$(CONFIG_MMU) += mem-pgtable.o tlb.o trap.o skas/
 
 obj-$(CONFIG_BLK_DEV_INITRD) += initrd.o
 obj-$(CONFIG_GPROF)	+= gprof_syms.o
diff --git a/arch/um/kernel/mem-pgtable.c b/arch/um/kernel/mem-pgtable.c
new file mode 100644
index 000000000000..549da1d3bff0
--- /dev/null
+++ b/arch/um/kernel/mem-pgtable.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
+ */
+
+#include <linux/stddef.h>
+#include <linux/module.h>
+#include <linux/memblock.h>
+#include <linux/swap.h>
+#include <linux/slab.h>
+#include <asm/page.h>
+#include <asm/pgalloc.h>
+#include <as-layout.h>
+#include <init.h>
+#include <kern.h>
+#include <kern_util.h>
+#include <mem_user.h>
+#include <os.h>
+#include <um_malloc.h>
+
+
+/* Allocate and free page tables. */
+
+pgd_t *pgd_alloc(struct mm_struct *mm)
+{
+	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
+
+	if (pgd) {
+		memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t));
+		memcpy(pgd + USER_PTRS_PER_PGD,
+		       swapper_pg_dir + USER_PTRS_PER_PGD,
+		       (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
+	}
+	return pgd;
+}
+
+static const pgprot_t protection_map[16] = {
+	[VM_NONE]					= PAGE_NONE,
+	[VM_READ]					= PAGE_READONLY,
+	[VM_WRITE]					= PAGE_COPY,
+	[VM_WRITE | VM_READ]				= PAGE_COPY,
+	[VM_EXEC]					= PAGE_READONLY,
+	[VM_EXEC | VM_READ]				= PAGE_READONLY,
+	[VM_EXEC | VM_WRITE]				= PAGE_COPY,
+	[VM_EXEC | VM_WRITE | VM_READ]			= PAGE_COPY,
+	[VM_SHARED]					= PAGE_NONE,
+	[VM_SHARED | VM_READ]				= PAGE_READONLY,
+	[VM_SHARED | VM_WRITE]				= PAGE_SHARED,
+	[VM_SHARED | VM_WRITE | VM_READ]		= PAGE_SHARED,
+	[VM_SHARED | VM_EXEC]				= PAGE_READONLY,
+	[VM_SHARED | VM_EXEC | VM_READ]			= PAGE_READONLY,
+	[VM_SHARED | VM_EXEC | VM_WRITE]		= PAGE_SHARED,
+	[VM_SHARED | VM_EXEC | VM_WRITE | VM_READ]	= PAGE_SHARED
+};
+DECLARE_VM_GET_PAGE_PROT
diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index dc938715ec9d..52cd906e3896 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -6,7 +6,6 @@
 #include <linux/stddef.h>
 #include <linux/module.h>
 #include <linux/memblock.h>
-#include <linux/mm.h>
 #include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/init.h>
@@ -214,45 +213,11 @@ void free_initmem(void)
 {
 }
 
-/* Allocate and free page tables. */
-
-pgd_t *pgd_alloc(struct mm_struct *mm)
-{
-	pgd_t *pgd = __pgd_alloc(mm, 0);
-
-	if (pgd)
-		memcpy(pgd + USER_PTRS_PER_PGD,
-		       swapper_pg_dir + USER_PTRS_PER_PGD,
-		       (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
-
-	return pgd;
-}
-
 void *uml_kmalloc(int size, int flags)
 {
 	return kmalloc(size, flags);
 }
 
-static const pgprot_t protection_map[16] = {
-	[VM_NONE]					= PAGE_NONE,
-	[VM_READ]					= PAGE_READONLY,
-	[VM_WRITE]					= PAGE_COPY,
-	[VM_WRITE | VM_READ]				= PAGE_COPY,
-	[VM_EXEC]					= PAGE_READONLY,
-	[VM_EXEC | VM_READ]				= PAGE_READONLY,
-	[VM_EXEC | VM_WRITE]				= PAGE_COPY,
-	[VM_EXEC | VM_WRITE | VM_READ]			= PAGE_COPY,
-	[VM_SHARED]					= PAGE_NONE,
-	[VM_SHARED | VM_READ]				= PAGE_READONLY,
-	[VM_SHARED | VM_WRITE]				= PAGE_SHARED,
-	[VM_SHARED | VM_WRITE | VM_READ]		= PAGE_SHARED,
-	[VM_SHARED | VM_EXEC]				= PAGE_READONLY,
-	[VM_SHARED | VM_EXEC | VM_READ]			= PAGE_READONLY,
-	[VM_SHARED | VM_EXEC | VM_WRITE]		= PAGE_SHARED,
-	[VM_SHARED | VM_EXEC | VM_WRITE | VM_READ]	= PAGE_SHARED
-};
-DECLARE_VM_GET_PAGE_PROT
-
 void mark_rodata_ro(void)
 {
 	unsigned long rodata_start = PFN_ALIGN(__start_rodata);
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 63b38a3f73f7..b07c1f120910 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -25,6 +25,7 @@
 #include <linux/tick.h>
 #include <linux/threads.h>
 #include <linux/resume_user_mode.h>
+#include <linux/start_kernel.h>
 #include <asm/current.h>
 #include <asm/mmu_context.h>
 #include <asm/switch_to.h>
@@ -307,3 +308,40 @@ unsigned long __get_wchan(struct task_struct *p)
 
 	return 0;
 }
+
+extern void start_kernel(void);
+
+static int __init start_kernel_proc(void *unused)
+{
+	block_signals_trace();
+
+	start_kernel();
+	return 0;
+}
+
+char cpu_irqstacks[NR_CPUS][THREAD_SIZE] __aligned(THREAD_SIZE);
+
+int __init start_uml(void)
+{
+	stack_protections((unsigned long) &cpu_irqstacks[0]);
+	set_sigstack(cpu_irqstacks[0], THREAD_SIZE);
+
+	init_new_thread_signals();
+
+	init_task.thread.request.thread.proc = start_kernel_proc;
+	init_task.thread.request.thread.arg = NULL;
+	return start_idle_thread(task_stack_page(&init_task),
+				 &init_task.thread.switch_buf);
+}
+
+static DEFINE_SPINLOCK(initial_jmpbuf_spinlock);
+
+void initial_jmpbuf_lock(void)
+{
+	spin_lock_irq(&initial_jmpbuf_spinlock);
+}
+
+void initial_jmpbuf_unlock(void)
+{
+	spin_unlock_irq(&initial_jmpbuf_spinlock);
+}
diff --git a/arch/um/kernel/skas/process.c b/arch/um/kernel/skas/process.c
index 4a7673b0261a..d643854942bc 100644
--- a/arch/um/kernel/skas/process.c
+++ b/arch/um/kernel/skas/process.c
@@ -17,31 +17,6 @@
 #include <skas.h>
 #include <kern_util.h>
 
-extern void start_kernel(void);
-
-static int __init start_kernel_proc(void *unused)
-{
-	block_signals_trace();
-
-	start_kernel();
-	return 0;
-}
-
-char cpu_irqstacks[NR_CPUS][THREAD_SIZE] __aligned(THREAD_SIZE);
-
-int __init start_uml(void)
-{
-	stack_protections((unsigned long) &cpu_irqstacks[0]);
-	set_sigstack(cpu_irqstacks[0], THREAD_SIZE);
-
-	init_new_thread_signals();
-
-	init_task.thread.request.thread.proc = start_kernel_proc;
-	init_task.thread.request.thread.arg = NULL;
-	return start_idle_thread(task_stack_page(&init_task),
-				 &init_task.thread.switch_buf);
-}
-
 unsigned long current_stub_stack(void)
 {
 	if (current->mm == NULL)
@@ -65,15 +40,3 @@ void current_mm_sync(void)
 
 	um_tlb_sync(current->mm);
 }
-
-static DEFINE_SPINLOCK(initial_jmpbuf_spinlock);
-
-void initial_jmpbuf_lock(void)
-{
-	spin_lock_irq(&initial_jmpbuf_spinlock);
-}
-
-void initial_jmpbuf_unlock(void)
-{
-	spin_unlock_irq(&initial_jmpbuf_spinlock);
-}
diff --git a/arch/um/os-Linux/Makefile b/arch/um/os-Linux/Makefile
index 70c73c22f715..051679d78aae 100644
--- a/arch/um/os-Linux/Makefile
+++ b/arch/um/os-Linux/Makefile
@@ -8,7 +8,8 @@ KCOV_INSTRUMENT                := n
 
 obj-y = execvp.o file.o helper.o irq.o main.o mem.o process.o \
 	registers.o sigio.o signal.o start_up.o time.o tty.o \
-	umid.o user_syms.o util.o skas/
+	umid.o user_syms.o util.o
+obj-$(CONFIG_MMU) += skas/
 
 CFLAGS_signal.o += -Wframe-larger-than=4096
 
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index 3a2a84ab9325..c50fa865d8c7 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -6,6 +6,7 @@
 
 #include <stdio.h>
 #include <stdlib.h>
+#include <stdbool.h>
 #include <unistd.h>
 #include <errno.h>
 #include <signal.h>
@@ -17,10 +18,16 @@
 #include <sys/prctl.h>
 #include <sys/wait.h>
 #include <asm/unistd.h>
+#include <linux/threads.h>
 #include <init.h>
 #include <longjmp.h>
 #include <os.h>
 #include <skas/skas.h>
+#include <as-layout.h>
+#include <kern_util.h>
+
+int using_seccomp;
+static int unscheduled_userspace_iterations;
 
 void os_alarm_process(int pid)
 {
@@ -209,3 +216,125 @@ int os_futex_wake(void *uaddr)
 				NULL, NULL, 0));
 	return r < 0 ? -errno : r;
 }
+
+int is_skas_winch(int pid, int fd, void *data)
+{
+	return pid == getpgrp();
+}
+
+void new_thread(void *stack, jmp_buf *buf, void (*handler)(void))
+{
+	(*buf)[0].JB_IP = (unsigned long) handler;
+	(*buf)[0].JB_SP = (unsigned long) stack + UM_THREAD_SIZE -
+		sizeof(void *);
+}
+
+#define INIT_JMP_NEW_THREAD 0
+#define INIT_JMP_CALLBACK 1
+#define INIT_JMP_HALT 2
+#define INIT_JMP_REBOOT 3
+
+void switch_threads(jmp_buf *me, jmp_buf *you)
+{
+	unscheduled_userspace_iterations = 0;
+
+	if (UML_SETJMP(me) == 0)
+		UML_LONGJMP(you, 1);
+}
+
+static jmp_buf initial_jmpbuf;
+
+static __thread void (*cb_proc)(void *arg);
+static __thread void *cb_arg;
+static __thread jmp_buf *cb_back;
+
+int start_idle_thread(void *stack, jmp_buf *switch_buf)
+{
+	int n;
+
+	set_handler(SIGWINCH);
+
+	/*
+	 * Can't use UML_SETJMP or UML_LONGJMP here because they save
+	 * and restore signals, with the possible side-effect of
+	 * trying to handle any signals which came when they were
+	 * blocked, which can't be done on this stack.
+	 * Signals must be blocked when jumping back here and restored
+	 * after returning to the jumper.
+	 */
+	n = setjmp(initial_jmpbuf);
+	switch (n) {
+	case INIT_JMP_NEW_THREAD:
+		(*switch_buf)[0].JB_IP = (unsigned long) uml_finishsetup;
+		(*switch_buf)[0].JB_SP = (unsigned long) stack +
+			UM_THREAD_SIZE - sizeof(void *);
+		break;
+	case INIT_JMP_CALLBACK:
+		(*cb_proc)(cb_arg);
+		longjmp(*cb_back, 1);
+		break;
+	case INIT_JMP_HALT:
+		kmalloc_ok = 0;
+		return 0;
+	case INIT_JMP_REBOOT:
+		kmalloc_ok = 0;
+		return 1;
+	default:
+		printk(UM_KERN_ERR "Bad sigsetjmp return in %s - %d\n",
+		       __func__, n);
+		fatal_sigsegv();
+	}
+	longjmp(*switch_buf, 1);
+
+	/* unreachable */
+	printk(UM_KERN_ERR "impossible long jump!");
+	fatal_sigsegv();
+	return 0;
+}
+
+void initial_thread_cb_skas(void (*proc)(void *), void *arg)
+{
+	jmp_buf here;
+
+	cb_proc = proc;
+	cb_arg = arg;
+	cb_back = &here;
+
+	initial_jmpbuf_lock();
+	if (UML_SETJMP(&here) == 0)
+		UML_LONGJMP(&initial_jmpbuf, INIT_JMP_CALLBACK);
+	initial_jmpbuf_unlock();
+
+	cb_proc = NULL;
+	cb_arg = NULL;
+	cb_back = NULL;
+}
+
+void halt_skas(void)
+{
+	initial_jmpbuf_lock();
+	UML_LONGJMP(&initial_jmpbuf, INIT_JMP_HALT);
+	/* unreachable */
+}
+
+static bool noreboot;
+
+static int __init noreboot_cmd_param(char *str, int *add)
+{
+	*add = 0;
+	noreboot = true;
+	return 0;
+}
+
+__uml_setup("noreboot", noreboot_cmd_param,
+"noreboot\n"
+"    Rather than rebooting, exit always, akin to QEMU's -no-reboot option.\n"
+"    This is useful if you're using CONFIG_PANIC_TIMEOUT in order to catch\n"
+"    crashes in CI\n\n");
+
+void reboot_skas(void)
+{
+	initial_jmpbuf_lock();
+	UML_LONGJMP(&initial_jmpbuf, noreboot ? INIT_JMP_HALT : INIT_JMP_REBOOT);
+	/* unreachable */
+}
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index d6c22f8aa06d..01814ad82f5d 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -18,7 +18,6 @@
 #include <sys/stat.h>
 #include <sys/socket.h>
 #include <asm/unistd.h>
-#include <as-layout.h>
 #include <init.h>
 #include <kern_util.h>
 #include <mem.h>
@@ -29,16 +28,10 @@
 #include <sysdep/stub.h>
 #include <sysdep/mcontext.h>
 #include <linux/futex.h>
-#include <linux/threads.h>
 #include <timetravel.h>
 #include <asm-generic/rwonce.h>
 #include "../internal.h"
 
-int is_skas_winch(int pid, int fd, void *data)
-{
-	return pid == getpgrp();
-}
-
 static const char *ptrace_reg_name(int idx)
 {
 #define R(n) case HOST_##n: return #n
@@ -426,8 +419,6 @@ static int __init init_stub_exe_fd(void)
 }
 __initcall(init_stub_exe_fd);
 
-int using_seccomp;
-
 /**
  * start_userspace() - prepare a new userspace process
  * @mm_id: The corresponding struct mm_id
@@ -540,7 +531,6 @@ int start_userspace(struct mm_id *mm_id)
 	return err;
 }
 
-static int unscheduled_userspace_iterations;
 extern unsigned long tt_extra_sched_jiffies;
 
 void userspace(struct uml_pt_regs *regs)
@@ -789,120 +779,3 @@ void userspace(struct uml_pt_regs *regs)
 		}
 	}
 }
-
-void new_thread(void *stack, jmp_buf *buf, void (*handler)(void))
-{
-	(*buf)[0].JB_IP = (unsigned long) handler;
-	(*buf)[0].JB_SP = (unsigned long) stack + UM_THREAD_SIZE -
-		sizeof(void *);
-}
-
-#define INIT_JMP_NEW_THREAD 0
-#define INIT_JMP_CALLBACK 1
-#define INIT_JMP_HALT 2
-#define INIT_JMP_REBOOT 3
-
-void switch_threads(jmp_buf *me, jmp_buf *you)
-{
-	unscheduled_userspace_iterations = 0;
-
-	if (UML_SETJMP(me) == 0)
-		UML_LONGJMP(you, 1);
-}
-
-static jmp_buf initial_jmpbuf;
-
-static __thread void (*cb_proc)(void *arg);
-static __thread void *cb_arg;
-static __thread jmp_buf *cb_back;
-
-int start_idle_thread(void *stack, jmp_buf *switch_buf)
-{
-	int n;
-
-	set_handler(SIGWINCH);
-
-	/*
-	 * Can't use UML_SETJMP or UML_LONGJMP here because they save
-	 * and restore signals, with the possible side-effect of
-	 * trying to handle any signals which came when they were
-	 * blocked, which can't be done on this stack.
-	 * Signals must be blocked when jumping back here and restored
-	 * after returning to the jumper.
-	 */
-	n = setjmp(initial_jmpbuf);
-	switch (n) {
-	case INIT_JMP_NEW_THREAD:
-		(*switch_buf)[0].JB_IP = (unsigned long) uml_finishsetup;
-		(*switch_buf)[0].JB_SP = (unsigned long) stack +
-			UM_THREAD_SIZE - sizeof(void *);
-		break;
-	case INIT_JMP_CALLBACK:
-		(*cb_proc)(cb_arg);
-		longjmp(*cb_back, 1);
-		break;
-	case INIT_JMP_HALT:
-		kmalloc_ok = 0;
-		return 0;
-	case INIT_JMP_REBOOT:
-		kmalloc_ok = 0;
-		return 1;
-	default:
-		printk(UM_KERN_ERR "Bad sigsetjmp return in %s - %d\n",
-		       __func__, n);
-		fatal_sigsegv();
-	}
-	longjmp(*switch_buf, 1);
-
-	/* unreachable */
-	printk(UM_KERN_ERR "impossible long jump!");
-	fatal_sigsegv();
-	return 0;
-}
-
-void initial_thread_cb_skas(void (*proc)(void *), void *arg)
-{
-	jmp_buf here;
-
-	cb_proc = proc;
-	cb_arg = arg;
-	cb_back = &here;
-
-	initial_jmpbuf_lock();
-	if (UML_SETJMP(&here) == 0)
-		UML_LONGJMP(&initial_jmpbuf, INIT_JMP_CALLBACK);
-	initial_jmpbuf_unlock();
-
-	cb_proc = NULL;
-	cb_arg = NULL;
-	cb_back = NULL;
-}
-
-void halt_skas(void)
-{
-	initial_jmpbuf_lock();
-	UML_LONGJMP(&initial_jmpbuf, INIT_JMP_HALT);
-	/* unreachable */
-}
-
-static bool noreboot;
-
-static int __init noreboot_cmd_param(char *str, int *add)
-{
-	*add = 0;
-	noreboot = true;
-	return 0;
-}
-
-__uml_setup("noreboot", noreboot_cmd_param,
-"noreboot\n"
-"    Rather than rebooting, exit always, akin to QEMU's -no-reboot option.\n"
-"    This is useful if you're using CONFIG_PANIC_TIMEOUT in order to catch\n"
-"    crashes in CI\n\n");
-
-void reboot_skas(void)
-{
-	initial_jmpbuf_lock();
-	UML_LONGJMP(&initial_jmpbuf, noreboot ? INIT_JMP_HALT : INIT_JMP_REBOOT);
-	/* unreachable */
-}
-- 
2.43.0


From thehajime at gmail.com  Sun Nov  2 01:49:29 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sun,  2 Nov 2025 18:49:29 +0900
Subject: [PATCH v12 04/13] x86/um: nommu: syscall handling
In-Reply-To: <cover.1762075876.git.thehajime@gmail.com>
References: <cover.1762075876.git.thehajime@gmail.com>
Message-ID: <c4609c653db31ec3a6595bb03f34d2331cf9d543.1762075876.git.thehajime@gmail.com>

This commit introduces an entry point for the syscall interface in !MMU
mode.  The entry function, __kernel_vsyscall, is a kernel-wide global
symbol accessible from any location.

Although it is outside the scope of this commit, the entry point can
also be exposed via the vdso image, which is directly accessible from
userspace.  A standard library (i.e., libc) can use this entry point to
implement its syscall wrappers; it can also be used by hooking syscalls
for unmodified userspace applications/libraries, which is implemented
in a subsequent commit.

This only supports the 64-bit mode of the x86 architecture.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Ricardo Koller <ricarkol at google.com>
---
 arch/x86/um/Makefile              |   4 ++
 arch/x86/um/asm/syscall.h         |   6 ++
 arch/x86/um/nommu/Makefile        |   8 +++
 arch/x86/um/nommu/do_syscall_64.c |  32 +++++++++
 arch/x86/um/nommu/entry_64.S      | 112 ++++++++++++++++++++++++++++++
 arch/x86/um/nommu/syscalls.h      |  16 +++++
 6 files changed, 178 insertions(+)
 create mode 100644 arch/x86/um/nommu/Makefile
 create mode 100644 arch/x86/um/nommu/do_syscall_64.c
 create mode 100644 arch/x86/um/nommu/entry_64.S
 create mode 100644 arch/x86/um/nommu/syscalls.h

diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile
index b42c31cd2390..227af2a987e2 100644
--- a/arch/x86/um/Makefile
+++ b/arch/x86/um/Makefile
@@ -32,6 +32,10 @@ obj-y += syscalls_64.o vdso/
 subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o \
 	../lib/memmove_64.o ../lib/memset_64.o
 
+ifneq ($(CONFIG_MMU),y)
+obj-y += nommu/
+endif
+
 endif
 
 subarch-$(CONFIG_MODULES) += ../kernel/module.o
diff --git a/arch/x86/um/asm/syscall.h b/arch/x86/um/asm/syscall.h
index d6208d0fad51..bb4f6f011667 100644
--- a/arch/x86/um/asm/syscall.h
+++ b/arch/x86/um/asm/syscall.h
@@ -20,4 +20,10 @@ static inline int syscall_get_arch(struct task_struct *task)
 #endif
 }
 
+#ifndef CONFIG_MMU
+extern void do_syscall_64(struct pt_regs *regs);
+extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
+			      int64_t a4, int64_t a5, int64_t a6);
+#endif
+
 #endif /* __UM_ASM_SYSCALL_H */
diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile
new file mode 100644
index 000000000000..d72c63afffa5
--- /dev/null
+++ b/arch/x86/um/nommu/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+ifeq ($(CONFIG_X86_32),y)
+	BITS := 32
+else
+	BITS := 64
+endif
+
+obj-y = do_syscall_$(BITS).o entry_$(BITS).o
diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c
new file mode 100644
index 000000000000..292d7c578622
--- /dev/null
+++ b/arch/x86/um/nommu/do_syscall_64.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/kernel.h>
+#include <linux/ptrace.h>
+#include <kern_util.h>
+#include <asm/syscall.h>
+#include <os.h>
+
+__visible void do_syscall_64(struct pt_regs *regs)
+{
+	int syscall;
+
+	syscall = PT_SYSCALL_NR(regs->regs.gp);
+	UPT_SYSCALL_NR(&regs->regs) = syscall;
+
+	if (likely(syscall >= 0 && syscall < NR_syscalls)) {
+		unsigned long ret;
+
+		ret = (*sys_call_table[syscall])(UPT_SYSCALL_ARG1(&regs->regs),
+						 UPT_SYSCALL_ARG2(&regs->regs),
+						 UPT_SYSCALL_ARG3(&regs->regs),
+						 UPT_SYSCALL_ARG4(&regs->regs),
+						 UPT_SYSCALL_ARG5(&regs->regs),
+						 UPT_SYSCALL_ARG6(&regs->regs));
+		PT_REGS_SET_SYSCALL_RETURN(regs, ret);
+	}
+
+	PT_REGS_SYSCALL_RET(regs) = regs->regs.gp[HOST_AX];
+
+	/* handle tasks and signals at the end */
+	interrupt_end();
+}
diff --git a/arch/x86/um/nommu/entry_64.S b/arch/x86/um/nommu/entry_64.S
new file mode 100644
index 000000000000..485c578aae64
--- /dev/null
+++ b/arch/x86/um/nommu/entry_64.S
@@ -0,0 +1,112 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/errno.h>
+
+#include <linux/linkage.h>
+#include <asm/percpu.h>
+#include <asm/desc.h>
+
+#include "../entry/calling.h"
+
+#ifdef CONFIG_SMP
+#error need to stash these variables somewhere else
+#endif
+
+#define UM_GLOBAL_VAR(x) .data; .align 8; .globl x; x:; .quad 0
+
+UM_GLOBAL_VAR(current_top_of_stack)
+UM_GLOBAL_VAR(current_ptregs)
+
+.code64
+.section .entry.text, "ax"
+
+.align 8
+#undef ENTRY
+#define ENTRY(x) .text; .globl x; .type x,%function; x:
+#undef END
+#define END(x)   .size x, . - x
+
+/*
+ * %rcx has the return address (we set it before entering __kernel_vsyscall).
+ *
+ * Registers on entry:
+ * rax  system call number
+ * rcx  return address
+ * rdi  arg0
+ * rsi  arg1
+ * rdx  arg2
+ * r10  arg3
+ * r8   arg4
+ * r9   arg5
+ *
+ * (note: we are allowed to mess with r11: r11 is callee-clobbered
+ * register in C ABI)
+ */
+ENTRY(__kernel_vsyscall)
+
+	movq	%rsp, %r11
+
+	/* Point rsp to the top of the ptregs array, so we can
+           just fill it with a bunch of push'es. */
+	movq	current_ptregs, %rsp
+
+	/* 8 bytes * 20 registers (plus 8 for the push) */
+	addq	$168, %rsp
+
+	/* Construct struct pt_regs on stack */
+	pushq	$0		/* pt_regs->ss (index 20) */
+	pushq   %r11		/* pt_regs->sp */
+	pushfq			/* pt_regs->flags */
+	pushq	$0		/* pt_regs->cs */
+	pushq	%rcx		/* pt_regs->ip */
+	pushq	%rax		/* pt_regs->orig_ax */
+
+	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
+
+	mov %rsp, %rdi
+
+	/*
+	 * Switch to current top of stack, so "current->" points
+	 * to the right task.
+	 */
+	movq	current_top_of_stack, %rsp
+
+	call	do_syscall_64
+
+	jmp	userspace
+
+END(__kernel_vsyscall)
+
+/*
+ * common userspace returning routine
+ *
+ * all procedures like syscalls, signal handlers, umh processes, will go
+ * through this routine to properly configure registers/stacks.
+ *
+ * void userspace(struct uml_pt_regs *regs)
+ */
+ENTRY(userspace)
+
+	/* clear direction flag to meet ABI */
+	cld
+	/* align the stack for x86_64 ABI */
+	and     $-0x10, %rsp
+	/* Handle any immediate reschedules or signals */
+	call	interrupt_end
+
+	movq	current_ptregs, %rsp
+
+	POP_REGS
+
+	addq	$8, %rsp	/* skip orig_ax */
+	popq	%rcx		/* pt_regs->ip */
+	addq	$8, %rsp	/* skip cs */
+	addq	$8, %rsp	/* skip flags */
+	popq	%rsp
+
+	/*
+	* not return w/ ret but w/ jmp as the stack is already popped before
+	* entering __kernel_vsyscall
+	*/
+	jmp	*%rcx
+
+END(userspace)
diff --git a/arch/x86/um/nommu/syscalls.h b/arch/x86/um/nommu/syscalls.h
new file mode 100644
index 000000000000..a2433756b1fc
--- /dev/null
+++ b/arch/x86/um/nommu/syscalls.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __UM_NOMMU_SYSCALLS_H
+#define __UM_NOMMU_SYSCALLS_H
+
+
+#define task_top_of_stack(task) \
+({									\
+	unsigned long __ptr = (unsigned long)task->stack;	\
+	__ptr += THREAD_SIZE;			\
+	__ptr;					\
+})
+
+extern long current_top_of_stack;
+extern long current_ptregs;
+
+#endif
-- 
2.43.0


From thehajime at gmail.com  Sun Nov  2 01:49:28 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sun,  2 Nov 2025 18:49:28 +0900
Subject: [PATCH v12 03/13] um: nommu: memory handling
In-Reply-To: <cover.1762075876.git.thehajime@gmail.com>
References: <cover.1762075876.git.thehajime@gmail.com>
Message-ID: <ca690557be98a0b11744fc8aa664933552742266.1762075876.git.thehajime@gmail.com>

This commit adds memory operations to UML for the !MMU environment.

Some parts of the original UML code that rely on CONFIG_MMU are excluded
from compilation when !CONFIG_MMU.  Additionally, the generic
implementations of uaccess, futex, and memcpy/strnlen/strncpy can be used,
because user space and kernel space share one address space in
!CONFIG_MMU mode.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Ricardo Koller <ricarkol at google.com>
---
 arch/um/Makefile                  | 4 ++++
 arch/um/include/asm/futex.h       | 4 ++++
 arch/um/include/asm/mmu.h         | 3 +++
 arch/um/include/asm/mmu_context.h | 2 ++
 arch/um/include/asm/uaccess.h     | 7 ++++---
 arch/um/kernel/mem.c              | 3 ++-
 arch/um/os-Linux/mem.c            | 4 ++++
 arch/um/os-Linux/process.c        | 4 ++--
 8 files changed, 25 insertions(+), 6 deletions(-)
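
Note for readers: with !CONFIG_MMU, create_mem_file() returns -1 and
os_map_memory() adds MAP_ANONYMOUS, so physical memory is no longer backed
by a temporary file.  A minimal userspace sketch of that kind of mapping
follows.  map_anon_shared() is a hypothetical helper, not part of this
patch, and it omits MAP_FIXED so that it can run standalone.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not in the patch): map len bytes of shared memory
 * without a backing file.  With MAP_ANONYMOUS the fd argument is not used
 * to find a file, and -1 is the conventional value to pass.
 * Returns NULL on failure.
 */
static void *map_anon_shared(size_t len, int r, int w, int x)
{
	int prot = (r ? PROT_READ : 0) | (w ? PROT_WRITE : 0) |
		   (x ? PROT_EXEC : 0);
	void *loc = mmap(NULL, len, prot, MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	return loc == MAP_FAILED ? NULL : loc;
}
```

Anonymous mappings start out zero-filled, so a caller can use the returned
region right away and release it with munmap() when done.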

diff --git a/arch/um/Makefile b/arch/um/Makefile
index 7be0143b5ba3..5371c9a1b11e 100644
--- a/arch/um/Makefile
+++ b/arch/um/Makefile
@@ -46,6 +46,10 @@ ARCH_INCLUDE	:= -I$(srctree)/$(SHARED_HEADERS)
 ARCH_INCLUDE	+= -I$(srctree)/$(HOST_DIR)/um/shared
 KBUILD_CPPFLAGS += -I$(srctree)/$(HOST_DIR)/um
 
+ifneq ($(CONFIG_MMU),y)
+core-y += $(ARCH_DIR)/nommu/
+endif
+
 # -Dvmap=kernel_vmap prevents anything from referencing the libpcap.o symbol so
 # named - it's a common symbol in libpcap, so we get a binary which crashes.
 #
diff --git a/arch/um/include/asm/futex.h b/arch/um/include/asm/futex.h
index 780aa6bfc050..785fd6649aa2 100644
--- a/arch/um/include/asm/futex.h
+++ b/arch/um/include/asm/futex.h
@@ -7,8 +7,12 @@
 #include <asm/errno.h>
 
 
+#ifdef CONFIG_MMU
 int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr);
 int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 			      u32 oldval, u32 newval);
+#else
+#include <asm-generic/futex.h>
+#endif
 
 #endif
diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index 82a919132aff..c0b9ce3215c4 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -22,10 +22,13 @@ typedef struct mm_context {
 	unsigned long sync_tlb_range_from;
 	unsigned long sync_tlb_range_to;
 
+#ifndef CONFIG_MMU
+	unsigned long   end_brk;
 #ifdef CONFIG_BINFMT_ELF_FDPIC
 	unsigned long   exec_fdpic_loadmap;
 	unsigned long   interp_fdpic_loadmap;
 #endif
+#endif /* !CONFIG_MMU */
 } mm_context_t;
 
 #define INIT_MM_CONTEXT(mm)						\
diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h
index c727e56ba116..528b217da285 100644
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -18,11 +18,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 {
 }
 
+#ifdef CONFIG_MMU
 #define init_new_context init_new_context
 extern int init_new_context(struct task_struct *task, struct mm_struct *mm);
 
 #define destroy_context destroy_context
 extern void destroy_context(struct mm_struct *mm);
+#endif
 
 #include <asm-generic/mmu_context.h>
 
diff --git a/arch/um/include/asm/uaccess.h b/arch/um/include/asm/uaccess.h
index 1c6e0ae41b0c..b9677758e759 100644
--- a/arch/um/include/asm/uaccess.h
+++ b/arch/um/include/asm/uaccess.h
@@ -23,6 +23,7 @@
 #define __addr_range_nowrap(addr, size) \
 	((unsigned long) (addr) <= ((unsigned long) (addr) + (size)))
 
+#ifdef CONFIG_MMU
 extern unsigned long raw_copy_from_user(void *to, const void __user *from, unsigned long n);
 extern unsigned long raw_copy_to_user(void __user *to, const void *from, unsigned long n);
 extern unsigned long __clear_user(void __user *mem, unsigned long len);
@@ -34,9 +35,6 @@ static inline int __access_ok(const void __user *ptr, unsigned long size);
 
 #define INLINE_COPY_FROM_USER
 #define INLINE_COPY_TO_USER
-
-#include <asm-generic/uaccess.h>
-
 static inline int __access_ok(const void __user *ptr, unsigned long size)
 {
 	unsigned long addr = (unsigned long)ptr;
@@ -70,5 +68,8 @@ do {									\
 	barrier();							\
 	current->thread.segv_continue = NULL;				\
 } while (0)
+#endif
+
+#include <asm-generic/uaccess.h>
 
 #endif
diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 52cd906e3896..1b9e7c62412d 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -71,7 +71,8 @@ void __init arch_mm_preinit(void)
 	 * to be turned on.
 	 */
 	brk_end = PAGE_ALIGN((unsigned long) sbrk(0));
-	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 0);
+	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1,
+		   !IS_ENABLED(CONFIG_MMU));
 	memblock_free((void *)brk_end, uml_reserved - brk_end);
 	uml_reserved = brk_end;
 	min_low_pfn = PFN_UP(__pa(uml_reserved));
diff --git a/arch/um/os-Linux/mem.c b/arch/um/os-Linux/mem.c
index 72f302f4d197..4f5d9a94f8e2 100644
--- a/arch/um/os-Linux/mem.c
+++ b/arch/um/os-Linux/mem.c
@@ -213,6 +213,10 @@ int __init create_mem_file(unsigned long long len)
 {
 	int err, fd;
 
+	/* NOMMU kernel uses -1 as a fd for further use (e.g., mmap) */
+	if (!IS_ENABLED(CONFIG_MMU))
+		return -1;
+
 	fd = create_tmp_file(len);
 
 	err = os_set_exec_close(fd);
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index c50fa865d8c7..ddb5258d7720 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -100,8 +100,8 @@ int os_map_memory(void *virt, int fd, unsigned long long off, unsigned long len,
 	prot = (r ? PROT_READ : 0) | (w ? PROT_WRITE : 0) |
 		(x ? PROT_EXEC : 0);
 
-	loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED,
-		     fd, off);
+	loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED |
+		     (!IS_ENABLED(CONFIG_MMU) ? MAP_ANONYMOUS : 0), fd, off);
 	if (loc == MAP_FAILED)
 		return -errno;
 	return 0;
-- 
2.43.0


From thehajime at gmail.com  Sun Nov  2 01:49:31 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sun,  2 Nov 2025 18:49:31 +0900
Subject: [PATCH v12 06/13] x86/um: nommu: process/thread handling
In-Reply-To: <cover.1762075876.git.thehajime@gmail.com>
References: <cover.1762075876.git.thehajime@gmail.com>
Message-ID: <94b1c9a65af9d22e3f21d28bc0fad2f94e1e86cb.1762075876.git.thehajime@gmail.com>

Since the ptrace facility isn't used by UML under !MMU, processes and
threads are invoked through a different code path: no external process
is used, and some registers (the fs segment register for TLS, etc.) need
to be configured properly on every context switch.

Signals aren't delivered on non-ptrace syscall entry/exit, so we also
need to handle pending signals ourselves.

ptrace-related syscalls are not tested yet, so arch_has_single_step() is
marked as unsupported in the !MMU environment.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Ricardo Koller <ricarkol at google.com>
---
 arch/um/include/asm/ptrace-generic.h |  2 +-
 arch/x86/um/Makefile                 |  3 +-
 arch/x86/um/nommu/Makefile           |  2 +-
 arch/x86/um/nommu/entry_64.S         |  2 ++
 arch/x86/um/nommu/syscalls.h         |  2 ++
 arch/x86/um/nommu/syscalls_64.c      | 50 ++++++++++++++++++++++++++++
 6 files changed, 58 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/um/nommu/syscalls_64.c
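
Note for readers: the per-switch bookkeeping done by arch_switch_to() in
this patch can be modelled in plain C as below.  This is only a sketch.
Every model_* name is hypothetical, MODEL_THREAD_SIZE stands in for
THREAD_SIZE, and model_fs stands in for the FS base that the real code
sets through arch_prctl().

```c
#include <stddef.h>

#define MODEL_THREAD_SIZE 16384UL	/* assumption: stand-in for THREAD_SIZE */

/* Hypothetical, trimmed-down view of a task for this sketch. */
struct model_task {
	void *stack;		/* base of the task's kernel stack */
	void *mm;		/* NULL for kernel threads */
	unsigned long fs_base;	/* saved userspace FS base, 0 if never set */
};

static unsigned long model_top_of_stack;	/* mirrors current_top_of_stack */
static unsigned long model_fs;			/* stands in for the FS base */

static void model_switch_to(const struct model_task *to)
{
	/* The stack bookkeeping is updated on every switch. */
	model_top_of_stack = (unsigned long)to->stack + MODEL_THREAD_SIZE;

	/* Kernel threads and tasks that never set FS keep the old value. */
	if (to->fs_base == 0 || to->mm == NULL)
		return;

	model_fs = to->fs_base;
}
```

Switching to a kernel thread (mm == NULL) therefore leaves the FS base of
the previously running user task in place.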

diff --git a/arch/um/include/asm/ptrace-generic.h b/arch/um/include/asm/ptrace-generic.h
index 62e9916078ec..5aa38fe6b2fb 100644
--- a/arch/um/include/asm/ptrace-generic.h
+++ b/arch/um/include/asm/ptrace-generic.h
@@ -14,7 +14,7 @@ struct pt_regs {
 	struct uml_pt_regs regs;
 };
 
-#define arch_has_single_step()	(1)
+#define arch_has_single_step()	(IS_ENABLED(CONFIG_MMU))
 
 #define EMPTY_REGS { .regs = EMPTY_UML_PT_REGS }
 
diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile
index 227af2a987e2..53c9ebb3c41c 100644
--- a/arch/x86/um/Makefile
+++ b/arch/x86/um/Makefile
@@ -27,7 +27,8 @@ subarch-y += ../kernel/sys_ia32.o
 
 else
 
-obj-y += syscalls_64.o vdso/
+obj-y += vdso/
+obj-$(CONFIG_MMU) += syscalls_64.o
 
 subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o \
 	../lib/memmove_64.o ../lib/memset_64.o
diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile
index ebe47d4836f4..4018d9e0aba0 100644
--- a/arch/x86/um/nommu/Makefile
+++ b/arch/x86/um/nommu/Makefile
@@ -5,4 +5,4 @@ else
 	BITS := 64
 endif
 
-obj-y = do_syscall_$(BITS).o entry_$(BITS).o os-Linux/
+obj-y = do_syscall_$(BITS).o entry_$(BITS).o syscalls_$(BITS).o os-Linux/
diff --git a/arch/x86/um/nommu/entry_64.S b/arch/x86/um/nommu/entry_64.S
index 485c578aae64..a58922fc81e5 100644
--- a/arch/x86/um/nommu/entry_64.S
+++ b/arch/x86/um/nommu/entry_64.S
@@ -86,6 +86,8 @@ END(__kernel_vsyscall)
  */
 ENTRY(userspace)
 
+	/* set stack and pt_regs to the current task */
+	call	arch_set_stack_to_current
 	/* clear direction flag to meet ABI */
 	cld
 	/* align the stack for x86_64 ABI */
diff --git a/arch/x86/um/nommu/syscalls.h b/arch/x86/um/nommu/syscalls.h
index a2433756b1fc..ce16bf8abd59 100644
--- a/arch/x86/um/nommu/syscalls.h
+++ b/arch/x86/um/nommu/syscalls.h
@@ -13,4 +13,6 @@
 extern long current_top_of_stack;
 extern long current_ptregs;
 
+void arch_set_stack_to_current(void);
+
 #endif
diff --git a/arch/x86/um/nommu/syscalls_64.c b/arch/x86/um/nommu/syscalls_64.c
new file mode 100644
index 000000000000..d56027ebc651
--- /dev/null
+++ b/arch/x86/um/nommu/syscalls_64.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2003 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
+ * Copyright 2003 PathScale, Inc.
+ *
+ * Licensed under the GPL
+ */
+
+#include <linux/sched.h>
+#include <linux/sched/mm.h>
+#include <linux/syscalls.h>
+#include <linux/uaccess.h>
+#include <asm/prctl.h> /* XXX This should get the constants from libc */
+#include <registers.h>
+#include <os.h>
+#include "syscalls.h"
+
+void arch_set_stack_to_current(void)
+{
+	current_top_of_stack = task_top_of_stack(current);
+	current_ptregs = (long)task_pt_regs(current);
+}
+
+void arch_switch_to(struct task_struct *to)
+{
+	/*
+	 * !CONFIG_MMU doesn't use ptrace, thus
+	 * the FS_BASE register is configured here.
+	 */
+	current_top_of_stack = task_top_of_stack(to);
+	current_ptregs = (long)task_pt_regs(to);
+
+	if ((to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] == 0) ||
+	    (to->mm == NULL))
+		return;
+
+	/* this changes the FS on every context switch */
+	arch_prctl(to, ARCH_SET_FS,
+		   (void __user *) to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)]);
+}
+
+SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
+		unsigned long, prot, unsigned long, flags,
+		unsigned long, fd, unsigned long, off)
+{
+	if (off & ~PAGE_MASK)
+		return -EINVAL;
+
+	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
+}
-- 
2.43.0


From thehajime at gmail.com  Sun Nov  2 01:49:30 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sun,  2 Nov 2025 18:49:30 +0900
Subject: [PATCH v12 05/13] um: nommu: seccomp syscalls hook
In-Reply-To: <cover.1762075876.git.thehajime@gmail.com>
References: <cover.1762075876.git.thehajime@gmail.com>
Message-ID: <ad735edb645ba2b004fe68c4727e69f7beb8ea46.1762075876.git.thehajime@gmail.com>

This commit adds a syscall hook based on seccomp.

The seccomp filter raises SIGSYS in the UML process.  The (UML) kernel
catches the signal and jumps to the syscall entry point,
__kernel_vsyscall, to hook the original syscall instruction.

SIGSYS is raised for syscalls executed from addresses between
uml_reserved and high_physmem, the range where userspace memory is
located.

It also renames the existing static function sigsys_handler() in
start_up.c to avoid a name conflict with the new handler.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Kenichi Yasukata <kenichi.yasukata at gmail.com>
---
 arch/um/include/shared/kern_util.h    |  2 +
 arch/um/include/shared/os.h           | 10 +++
 arch/um/kernel/um_arch.c              |  3 +
 arch/um/nommu/Makefile                |  3 +
 arch/um/nommu/os-Linux/Makefile       |  7 +++
 arch/um/nommu/os-Linux/seccomp.c      | 87 +++++++++++++++++++++++++++
 arch/um/nommu/os-Linux/signal.c       | 16 +++++
 arch/um/os-Linux/signal.c             |  8 +++
 arch/um/os-Linux/start_up.c           |  4 +-
 arch/x86/um/nommu/Makefile            |  2 +-
 arch/x86/um/nommu/os-Linux/Makefile   |  6 ++
 arch/x86/um/nommu/os-Linux/mcontext.c | 15 +++++
 arch/x86/um/shared/sysdep/mcontext.h  |  4 ++
 13 files changed, 164 insertions(+), 3 deletions(-)
 create mode 100644 arch/um/nommu/Makefile
 create mode 100644 arch/um/nommu/os-Linux/Makefile
 create mode 100644 arch/um/nommu/os-Linux/seccomp.c
 create mode 100644 arch/um/nommu/os-Linux/signal.c
 create mode 100644 arch/x86/um/nommu/os-Linux/Makefile
 create mode 100644 arch/x86/um/nommu/os-Linux/mcontext.c
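
Note for readers: classic BPF loads 32-bit words, so the filter in
os_setup_seccomp() compares the 64-bit instruction pointer against the
range in two halves.  The decision it reaches can be modelled in plain C
as below.  classify_ip() and the verdict enum are hypothetical names used
only for this sketch; the real filter returns SECCOMP_RET_ALLOW or
SECCOMP_RET_TRAP.

```c
#include <stdint.h>

enum verdict { VERDICT_ALLOW, VERDICT_TRAP };

/*
 * Hypothetical model of the filter: trap when start <= ip < end,
 * allow otherwise.  Each comparison works on 32-bit halves, mirroring
 * the order of the BPF statements in the patch.
 */
static enum verdict classify_ip(uint64_t ip, uint64_t start, uint64_t end)
{
	uint32_t ip_hi = (uint32_t)(ip >> 32), ip_lo = (uint32_t)ip;
	uint32_t start_hi = (uint32_t)(start >> 32), start_lo = (uint32_t)start;
	uint32_t end_hi = (uint32_t)(end >> 32), end_lo = (uint32_t)end;

	/* ip is at or above the end of the range */
	if (ip_hi > end_hi)
		return VERDICT_ALLOW;
	if (ip_hi == end_hi && ip_lo >= end_lo)
		return VERDICT_ALLOW;

	/* ip is below the start of the range */
	if (ip_hi < start_hi)
		return VERDICT_ALLOW;
	if (ip_hi == start_hi && ip_lo < start_lo)
		return VERDICT_ALLOW;

	/* ip is inside [start, end): syscall issued from userspace memory */
	return VERDICT_TRAP;
}
```

For example, with start = 0x100000000 and end = 0x200001000, an ip of
0x200000fff is trapped while 0x200001000 is allowed, since the end of the
range is exclusive.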

diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
index 38321188c04c..7798f16a4677 100644
--- a/arch/um/include/shared/kern_util.h
+++ b/arch/um/include/shared/kern_util.h
@@ -63,6 +63,8 @@ extern void segv_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs
 extern void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs,
 		  void *mc);
 extern void fatal_sigsegv(void) __attribute__ ((noreturn));
+extern void sigsys_handler(int sig, struct siginfo *si, struct uml_pt_regs *regs,
+			   void *mc);
 
 void um_idle_sleep(void);
 
diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index b26e94292fc1..5451f9b1f41e 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -356,4 +356,14 @@ static inline void os_local_ipi_enable(void) { }
 static inline void os_local_ipi_disable(void) { }
 #endif /* CONFIG_SMP */
 
+/* seccomp.c */
+#ifdef CONFIG_MMU
+static inline int os_setup_seccomp(void)
+{
+	return 0;
+}
+#else
+extern int os_setup_seccomp(void);
+#endif
+
 #endif
diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c
index e2b24e1ecfa6..27c13423d9aa 100644
--- a/arch/um/kernel/um_arch.c
+++ b/arch/um/kernel/um_arch.c
@@ -423,6 +423,9 @@ void __init setup_arch(char **cmdline_p)
 		add_bootloader_randomness(rng_seed, sizeof(rng_seed));
 		memzero_explicit(rng_seed, sizeof(rng_seed));
 	}
+
+	/* install seccomp filter */
+	os_setup_seccomp();
 }
 
 void __init arch_cpu_finalize_init(void)
diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile
new file mode 100644
index 000000000000..baab7c2f57c2
--- /dev/null
+++ b/arch/um/nommu/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y := os-Linux/
diff --git a/arch/um/nommu/os-Linux/Makefile b/arch/um/nommu/os-Linux/Makefile
new file mode 100644
index 000000000000..805e26ccf63b
--- /dev/null
+++ b/arch/um/nommu/os-Linux/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y := seccomp.o signal.o
+USER_OBJS := $(obj-y)
+
+include $(srctree)/arch/um/scripts/Makefile.rules
+USER_CFLAGS+=-I$(srctree)/arch/um/os-Linux
diff --git a/arch/um/nommu/os-Linux/seccomp.c b/arch/um/nommu/os-Linux/seccomp.c
new file mode 100644
index 000000000000..d1cfa6e3d632
--- /dev/null
+++ b/arch/um/nommu/os-Linux/seccomp.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <signal.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>   /* For SYS_xxx definitions */
+#include <init.h>
+#include <as-layout.h>
+#include <os.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+
+int __init os_setup_seccomp(void)
+{
+	int err;
+	unsigned long __userspace_start = uml_reserved,
+		__userspace_end = high_physmem;
+
+	struct sock_filter filter[] = {
+		/* if (IP_high > __userspace_end) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JGT + BPF_K, __userspace_end >> 32,
+			 /*true-skip=*/0, /*false-skip=*/1),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* if (IP_high == __userspace_end && IP_low >= __userspace_end) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __userspace_end >> 32,
+			 /*true-skip=*/0, /*false-skip=*/3),
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer)),
+		BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_end,
+			 /*true-skip=*/0, /*false-skip=*/1),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* if (IP_high < __userspace_start) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_start >> 32,
+			 /*true-skip=*/1, /*false-skip=*/0),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* if (IP_high == __userspace_start && IP_low < __userspace_start) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __userspace_start >> 32,
+			 /*true-skip=*/0, /*false-skip=*/3),
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer)),
+		BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_start,
+			 /*true-skip=*/1, /*false-skip=*/0),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* other address; trap  */
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRAP),
+	};
+	struct sock_fprog prog = {
+		.len = ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	err = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	if (err)
+		os_warn("PR_SET_NO_NEW_PRIVS (err=%d, errno=%d)\n",
+		       err, errno);
+
+	err = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
+		      SECCOMP_FILTER_FLAG_TSYNC, &prog);
+	if (err) {
+		os_warn("SECCOMP_SET_MODE_FILTER (err=%d, errno=%d)\n",
+		       err, errno);
+		exit(1);
+	}
+
+	set_handler(SIGSYS);
+
+	os_info("seccomp: setup filter syscalls in the range: 0x%lx-0x%lx\n",
+		__userspace_start, __userspace_end);
+
+	return 0;
+}
+
diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-Linux/signal.c
new file mode 100644
index 000000000000..19043b9652e2
--- /dev/null
+++ b/arch/um/nommu/os-Linux/signal.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <signal.h>
+#include <kern_util.h>
+#include <os.h>
+#include <sysdep/mcontext.h>
+#include <sys/ucontext.h>
+
+void sigsys_handler(int sig, struct siginfo *si,
+		    struct uml_pt_regs *regs, void *ptr)
+{
+	mcontext_t *mc = (mcontext_t *) ptr;
+
+	/* hook syscall via SIGSYS */
+	set_mc_sigsys_hook(mc);
+}
diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
index 327fb3c52fc7..2f6795cd884c 100644
--- a/arch/um/os-Linux/signal.c
+++ b/arch/um/os-Linux/signal.c
@@ -20,6 +20,7 @@
 #include <um_malloc.h>
 #include <sys/ucontext.h>
 #include <timetravel.h>
+#include <linux/compiler_attributes.h>
 #include "internal.h"
 
 void (*sig_info[NSIG])(int, struct siginfo *, struct uml_pt_regs *, void *mc) = {
@@ -31,6 +32,7 @@ void (*sig_info[NSIG])(int, struct siginfo *, struct uml_pt_regs *, void *mc) =
 	[SIGSEGV]	= segv_handler,
 	[SIGIO]		= sigio_handler,
 	[SIGCHLD]	= sigchld_handler,
+	[SIGSYS]	= sigsys_handler,
 };
 
 static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc)
@@ -182,6 +184,11 @@ static void sigusr1_handler(int sig, struct siginfo *unused_si, mcontext_t *mc)
 	uml_pm_wake();
 }
 
+__weak void sigsys_handler(int sig, struct siginfo *unused_si,
+			   struct uml_pt_regs *regs, void *mc)
+{
+}
+
 void register_pm_wake_signal(void)
 {
 	set_handler(SIGUSR1);
@@ -193,6 +200,7 @@ static void (*handlers[_NSIG])(int sig, struct siginfo *si, mcontext_t *mc) = {
 	[SIGILL] = sig_handler,
 	[SIGFPE] = sig_handler,
 	[SIGTRAP] = sig_handler,
+	[SIGSYS] = sig_handler,
 
 	[SIGIO] = sig_handler,
 	[SIGWINCH] = sig_handler,
diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c
index 054ac03bbf5e..33e039d2c1bf 100644
--- a/arch/um/os-Linux/start_up.c
+++ b/arch/um/os-Linux/start_up.c
@@ -239,7 +239,7 @@ extern unsigned long *exec_fp_regs;
 
 __initdata static struct stub_data *seccomp_test_stub_data;
 
-static void __init sigsys_handler(int sig, siginfo_t *info, void *p)
+static void __init _sigsys_handler(int sig, siginfo_t *info, void *p)
 {
 	ucontext_t *uc = p;
 
@@ -274,7 +274,7 @@ static int __init seccomp_helper(void *data)
 			sizeof(seccomp_test_stub_data->sigstack));
 
 	sa.sa_flags = SA_ONSTACK | SA_NODEFER | SA_SIGINFO;
-	sa.sa_sigaction = (void *) sigsys_handler;
+	sa.sa_sigaction = (void *) _sigsys_handler;
 	sa.sa_restorer = NULL;
 	if (sigaction(SIGSYS, &sa, NULL) < 0)
 		exit(2);
diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile
index d72c63afffa5..ebe47d4836f4 100644
--- a/arch/x86/um/nommu/Makefile
+++ b/arch/x86/um/nommu/Makefile
@@ -5,4 +5,4 @@ else
 	BITS := 64
 endif
 
-obj-y = do_syscall_$(BITS).o entry_$(BITS).o
+obj-y = do_syscall_$(BITS).o entry_$(BITS).o os-Linux/
diff --git a/arch/x86/um/nommu/os-Linux/Makefile b/arch/x86/um/nommu/os-Linux/Makefile
new file mode 100644
index 000000000000..4571e403a6ff
--- /dev/null
+++ b/arch/x86/um/nommu/os-Linux/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y = mcontext.o
+USER_OBJS := mcontext.o
+
+include $(srctree)/arch/um/scripts/Makefile.rules
diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c
new file mode 100644
index 000000000000..b62a6195096f
--- /dev/null
+++ b/arch/x86/um/nommu/os-Linux/mcontext.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/ucontext.h>
+#define __FRAME_OFFSETS
+#include <asm/ptrace.h>
+#include <sysdep/ptrace.h>
+#include <sysdep/mcontext.h>
+
+extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
+			      int64_t a4, int64_t a5, int64_t a6);
+
+void set_mc_sigsys_hook(mcontext_t *mc)
+{
+	mc->gregs[REG_RCX] = mc->gregs[REG_RIP];
+	mc->gregs[REG_RIP] = (unsigned long) __kernel_vsyscall;
+}
diff --git a/arch/x86/um/shared/sysdep/mcontext.h b/arch/x86/um/shared/sysdep/mcontext.h
index 6fe490cc5b98..9a0d6087f357 100644
--- a/arch/x86/um/shared/sysdep/mcontext.h
+++ b/arch/x86/um/shared/sysdep/mcontext.h
@@ -17,6 +17,10 @@ extern int get_stub_state(struct uml_pt_regs *regs, struct stub_data *data,
 extern int set_stub_state(struct uml_pt_regs *regs, struct stub_data *data,
 			  int single_stepping);
 
+#ifndef CONFIG_MMU
+extern void set_mc_sigsys_hook(mcontext_t *mc);
+#endif
+
 #ifdef __i386__
 
 #define GET_FAULTINFO_FROM_MC(fi, mc) \
-- 
2.43.0


From thehajime at gmail.com  Sun Nov  2 01:49:32 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sun,  2 Nov 2025 18:49:32 +0900
Subject: [PATCH v12 07/13] um: nommu: configure fs register on host syscall invocation
In-Reply-To: <cover.1762075876.git.thehajime@gmail.com>
References: <cover.1762075876.git.thehajime@gmail.com>
Message-ID: <86fc0b173ac530454a0f0e33f5100e0b60e37730.1762075876.git.thehajime@gmail.com>

Userspace on UML/!MMU also needs to configure the %fs register while it
is running in order to access its thread structure correctly, so host
syscalls implemented in the os-Linux drivers may be confused when they
are called with the userspace value.  Thus the %fs register has to be
configured via arch_prctl(ARCH_SET_FS) on every host syscall.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Ricardo Koller <ricarkol at google.com>
---
 arch/um/include/shared/os.h       |  6 +++
 arch/um/os-Linux/process.c        |  6 +++
 arch/um/os-Linux/start_up.c       | 21 +++++++++
 arch/x86/um/nommu/do_syscall_64.c | 37 ++++++++++++++++
 arch/x86/um/nommu/syscalls_64.c   | 71 +++++++++++++++++++++++++++++++
 5 files changed, 141 insertions(+)
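
Note for readers: the save/switch/restore pattern that do_syscall_64()
applies around host calls can be sketched in userspace as below, assuming
an x86_64 Linux host.  get_fs_base(), set_fs_base(), call_with_host_fs()
and host_getpid() are hypothetical helpers for illustration only.  The
patch itself uses os_arch_prctl(), or wrfsbase() when FSGSBASE is
available.

```c
#define _GNU_SOURCE
#include <asm/prctl.h>		/* ARCH_GET_FS, ARCH_SET_FS */
#include <sys/syscall.h>	/* SYS_arch_prctl, SYS_getpid */
#include <unistd.h>

/* Read the current FS base through the arch_prctl syscall. */
static long get_fs_base(unsigned long *base)
{
	return syscall(SYS_arch_prctl, ARCH_GET_FS, base);
}

/* Set the FS base through the arch_prctl syscall. */
static long set_fs_base(unsigned long base)
{
	return syscall(SYS_arch_prctl, ARCH_SET_FS, base);
}

/* Example host call for the sketch. */
static long host_getpid(void)
{
	return syscall(SYS_getpid);
}

/*
 * Run fn() with the FS base set to host_fs, then put back whatever FS
 * base was active on entry (the "userspace configured" one).
 */
static long call_with_host_fs(unsigned long host_fs, long (*fn)(void))
{
	unsigned long saved_fs;
	long ret;

	if (get_fs_base(&saved_fs))
		return -1;

	set_fs_base(host_fs);
	ret = fn();
	set_fs_base(saved_fs);

	return ret;
}
```

In a plain process the host and the saved FS base are the same value, so
the sketch is only safe to run with host_fs taken from get_fs_base().
Pointing FS at an arbitrary address would break the host libc's TLS
accesses, which is the problem this patch works around.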

diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index 5451f9b1f41e..0ac87507e05e 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -189,6 +189,7 @@ extern void check_host_supports_tls(int *supports_tls, int *tls_min);
 extern void get_host_cpu_features(
 	void (*flags_helper_func)(char *line),
 	void (*cache_helper_func)(char *line));
+extern int host_has_fsgsbase;
 
 /* mem.c */
 extern int create_mem_file(unsigned long long len);
@@ -213,6 +214,11 @@ extern int os_protect_memory(void *addr, unsigned long len,
 extern int os_unmap_memory(void *addr, int len);
 extern int os_drop_memory(void *addr, int length);
 extern int can_drop_memory(void);
+extern int os_arch_prctl(int pid, int option, unsigned long *arg);
+#ifndef CONFIG_MMU
+extern long long host_fs;
+#endif
+
 
 void os_set_pdeathsig(void);
 
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index ddb5258d7720..dacf63ac33c8 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -18,6 +18,7 @@
 #include <sys/prctl.h>
 #include <sys/wait.h>
 #include <asm/unistd.h>
+#include <sys/syscall.h>   /* For SYS_xxx definitions */
 #include <linux/threads.h>
 #include <init.h>
 #include <longjmp.h>
@@ -179,6 +180,11 @@ int __init can_drop_memory(void)
 	return ok;
 }
 
+int os_arch_prctl(int pid, int option, unsigned long *arg2)
+{
+	return syscall(SYS_arch_prctl, option, arg2);
+}
+
 void init_new_thread_signals(void)
 {
 	set_handler(SIGSEGV);
diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c
index 33e039d2c1bf..c0afe5d8b559 100644
--- a/arch/um/os-Linux/start_up.c
+++ b/arch/um/os-Linux/start_up.c
@@ -20,6 +20,8 @@
 #include <sys/resource.h>
 #include <asm/ldt.h>
 #include <asm/unistd.h>
+#include <sys/auxv.h>
+#include <asm/hwcap2.h>
 #include <init.h>
 #include <os.h>
 #include <smp.h>
@@ -37,6 +39,8 @@
 #include <skas.h>
 #include "internal.h"
 
+int host_has_fsgsbase;
+
 static void ptrace_child(void)
 {
 	int ret;
@@ -460,6 +464,20 @@ __uml_setup("seccomp=", uml_seccomp_config,
 "    This is insecure and should only be used with a trusted userspace\n\n"
 );
 
+static void __init check_fsgsbase(void)
+{
+	unsigned long auxv = getauxval(AT_HWCAP2);
+
+	os_info("Checking FSGSBASE instructions...");
+	if (auxv & HWCAP2_FSGSBASE) {
+		host_has_fsgsbase = 1;
+		os_info("OK\n");
+	} else {
+		host_has_fsgsbase = 0;
+		os_info("disabled\n");
+	}
+}
+
 void __init os_early_checks(void)
 {
 	int pid;
@@ -488,6 +506,9 @@ void __init os_early_checks(void)
 	using_seccomp = 0;
 	check_ptrace();
 
+	/* probe fsgsbase instruction */
+	check_fsgsbase();
+
 	pid = start_ptraced_child();
 	if (init_pid_registers(pid))
 		fatal("Failed to initialize default registers");
diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c
index 292d7c578622..9bc630995df9 100644
--- a/arch/x86/um/nommu/do_syscall_64.c
+++ b/arch/x86/um/nommu/do_syscall_64.c
@@ -2,10 +2,38 @@
 
 #include <linux/kernel.h>
 #include <linux/ptrace.h>
+#include <asm/fsgsbase.h>
+#include <asm/prctl.h>
 #include <kern_util.h>
 #include <asm/syscall.h>
 #include <os.h>
 
+static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2)
+{
+	if (!host_has_fsgsbase)
+		return os_arch_prctl(pid, option, arg2);
+
+	switch (option) {
+	case ARCH_SET_FS:
+		wrfsbase(*arg2);
+		break;
+	case ARCH_SET_GS:
+		wrgsbase(*arg2);
+		break;
+	case ARCH_GET_FS:
+		*arg2 = rdfsbase();
+		break;
+	case ARCH_GET_GS:
+		*arg2 = rdgsbase();
+		break;
+	default:
+		pr_warn("%s: unsupported option: 0x%x\n", __func__, option);
+		break;
+	}
+
+	return 0;
+}
+
 __visible void do_syscall_64(struct pt_regs *regs)
 {
 	int syscall;
@@ -13,6 +41,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
 	syscall = PT_SYSCALL_NR(regs->regs.gp);
 	UPT_SYSCALL_NR(&regs->regs) = syscall;
 
+	/* set fs register to the original host one */
+	os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
+
 	if (likely(syscall < NR_syscalls)) {
 		unsigned long ret;
 
@@ -29,4 +60,10 @@ __visible void do_syscall_64(struct pt_regs *regs)
 
 	/* handle tasks and signals at the end */
 	interrupt_end();
+
+	/* restore back fs register to userspace configured one */
+	os_x86_arch_prctl(0, ARCH_SET_FS,
+		      (void *)(current->thread.regs.regs.gp[FS_BASE
+						     / sizeof(unsigned long)]));
+
 }
diff --git a/arch/x86/um/nommu/syscalls_64.c b/arch/x86/um/nommu/syscalls_64.c
index d56027ebc651..19d23686fc5b 100644
--- a/arch/x86/um/nommu/syscalls_64.c
+++ b/arch/x86/um/nommu/syscalls_64.c
@@ -13,8 +13,70 @@
 #include <asm/prctl.h> /* XXX This should get the constants from libc */
 #include <registers.h>
 #include <os.h>
+#include <asm/thread_info.h>
+#include <asm/mman.h>
 #include "syscalls.h"
 
+/*
+ * The guest libc can change FS, which confuses the host libc.
+ * In fact, changing FS directly is not supported (check
+ * man arch_prctl). So, whenever we make a host syscall,
+ * we should be changing FS to the original FS (not the
+ * one set by the guest libc). This original FS is stored
+ * in host_fs.
+ */
+long long host_fs = -1;
+
+long arch_prctl(struct task_struct *task, int option,
+		unsigned long __user *arg2)
+{
+	long ret = -EINVAL;
+	unsigned long *ptr = arg2, tmp;
+
+	switch (option) {
+	case ARCH_SET_FS:
+		if (host_fs == -1)
+			os_arch_prctl(0, ARCH_GET_FS, (void *)&host_fs);
+		ret = 0;
+		break;
+	case ARCH_SET_GS:
+		ret = 0;
+		break;
+	case ARCH_GET_FS:
+	case ARCH_GET_GS:
+		ptr = &tmp;
+		break;
+	}
+
+	ret = os_arch_prctl(0, option, ptr);
+	if (ret)
+		return ret;
+
+	switch (option) {
+	case ARCH_SET_FS:
+		current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] =
+			(unsigned long) arg2;
+		break;
+	case ARCH_SET_GS:
+		current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)] =
+			(unsigned long) arg2;
+		break;
+	case ARCH_GET_FS:
+		ret = put_user(current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)], arg2);
+		break;
+	case ARCH_GET_GS:
+		ret = put_user(current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)], arg2);
+		break;
+	}
+
+	return ret;
+}
+
+SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
+{
+	return arch_prctl(current, option, (unsigned long __user *) arg2);
+}
+
 void arch_set_stack_to_current(void)
 {
 	current_top_of_stack = task_top_of_stack(current);
@@ -48,3 +110,12 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
 
 	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
 }
+
+static int __init um_nommu_setup_hostfs(void)
+{
+	/* initialize the host_fs value at boottime */
+	os_arch_prctl(0, ARCH_GET_FS, (void *)&host_fs);
+
+	return 0;
+}
+arch_initcall(um_nommu_setup_hostfs);
-- 
2.43.0


From thehajime at gmail.com  Sun Nov  2 01:49:33 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sun,  2 Nov 2025 18:49:33 +0900
Subject: [PATCH v12 08/13] x86/um/vdso: nommu: vdso memory update
In-Reply-To: <cover.1762075876.git.thehajime@gmail.com>
References: <cover.1762075876.git.thehajime@gmail.com>
Message-ID: <8036933c8c46dbf1ec32b8b57ecebc94c2cdb2ca.1762075876.git.thehajime@gmail.com>

In !MMU mode, the address of the vdso page is directly accessible from
userspace.  This commit sets the vdso entry point (um_vdso_addr) to the
address of the allocated page itself.

This commit also configures the memory permission of the vdso page so
that it is executable.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Ricardo Koller <ricarkol at google.com>
---
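The permission change described in the commit message above can be
sketched outside the kernel.  The helper below is a hypothetical
stand-in modeled on the (r, w, x) argument order of the
os_protect_memory() call in this patch; its name and body are
assumptions for illustration, not the actual UML implementation, and it
simply forwards to mprotect(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical userspace stand-in: translate (r, w, x) flags into
 * PROT_* bits and apply them with mprotect(2).
 * Returns 0 on success, or a negative errno value on failure.
 */
int protect_memory_sketch(void *addr, size_t len, int r, int w, int x)
{
	int prot = (r ? PROT_READ : 0) | (w ? PROT_WRITE : 0) |
		   (x ? PROT_EXEC : 0);

	assert(addr != NULL && len > 0);

	if (mprotect(addr, len, prot) < 0)
		return -errno;
	return 0;
}
```

Calling protect_memory_sketch(page, len, 1, 0, 1) on a page-aligned
mapping corresponds to the readable, non-writable, executable setting
that this patch requests for the vdso page.
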
 arch/x86/um/vdso/vma.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
index 51a2b9f2eca9..0799b3fe7521 100644
--- a/arch/x86/um/vdso/vma.c
+++ b/arch/x86/um/vdso/vma.c
@@ -9,6 +9,7 @@
 #include <asm/page.h>
 #include <asm/elf.h>
 #include <linux/init.h>
+#include <os.h>
 
 unsigned long um_vdso_addr;
 static struct page *um_vdso;
@@ -20,18 +21,29 @@ static int __init init_vdso(void)
 {
 	BUG_ON(vdso_end - vdso_start > PAGE_SIZE);
 
-	um_vdso_addr = task_size - PAGE_SIZE;
-
 	um_vdso = alloc_page(GFP_KERNEL);
 	if (!um_vdso)
 		panic("Cannot allocate vdso\n");
 
 	copy_page(page_address(um_vdso), vdso_start);
 
+#ifdef CONFIG_MMU
+	um_vdso_addr = task_size - PAGE_SIZE;
+#else
+	/* this is fine with NOMMU as everything is accessible */
+	um_vdso_addr = (unsigned long)page_address(um_vdso);
+	os_protect_memory((void *)um_vdso_addr, vdso_end - vdso_start, 1, 0, 1);
+#endif
+
+	pr_info("vdso_start=%lx um_vdso_addr=%lx pg_um_vdso=%lx\n",
+	       (unsigned long)vdso_start, um_vdso_addr,
+	       (unsigned long)page_address(um_vdso));
+
 	return 0;
 }
 subsys_initcall(init_vdso);
 
+#ifdef CONFIG_MMU
 int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 {
 	struct vm_area_struct *vma;
@@ -53,3 +65,4 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 
 	return IS_ERR(vma) ? PTR_ERR(vma) : 0;
 }
+#endif
-- 
2.43.0


From thehajime at gmail.com  Sun Nov  2 01:49:34 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sun,  2 Nov 2025 18:49:34 +0900
Subject: [PATCH v12 09/13] x86/um: nommu: signal handling
In-Reply-To: <cover.1762075876.git.thehajime@gmail.com>
References: <cover.1762075876.git.thehajime@gmail.com>
Message-ID: <32debc0728ce22cd4db50cdf1cd4e8db430ad402.1762075876.git.thehajime@gmail.com>

This commit updates the behavior of signal handling in the !MMU
environment.  It adds alignment code for the signal frame, as the frame
is used in userspace as-is.

Floating point registers are saved on entry to and restored on exit
from the syscall routine so that signal handlers can read/write the
contents of the registers.

It also adds a follow-up routine for SIGSEGV: signal delivery runs in
the same stack frame, so we have to avoid an endless SIGSEGV loop.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
---
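The alignment requirement behind the aligned(16) attribute in this
patch can be illustrated with a small sketch.  FXSAVE/FXRSTOR need a
16-byte aligned memory operand, so the FP save area must start on a
16-byte boundary.  The structures below are hypothetical, simplified
layouts (field sizes chosen for illustration only); they are not the
real struct uml_pt_regs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block with the attribute used in this patch. */
struct regs_sketch {
	unsigned long gp[26];
	int is_user;
	/* FXSAVE/FXRSTOR need a 16-byte aligned memory operand */
	unsigned long fp[] __attribute__((aligned(16)));
};

/* Same layout without the attribute: fp is only long-aligned. */
struct regs_unaligned_sketch {
	unsigned long gp[26];
	int is_user;
	unsigned long fp[];
};

/* Byte offset of the FP save area in the aligned layout. */
size_t fp_offset_aligned(void)
{
	return offsetof(struct regs_sketch, fp);
}

/* Byte offset of the FP save area in the layout without the attribute. */
size_t fp_offset_unaligned(void)
{
	return offsetof(struct regs_unaligned_sketch, fp);
}

/* Return 1 if ptr satisfies the 16-byte alignment FXSAVE expects. */
int is_fxsave_aligned(const void *ptr)
{
	assert(ptr != NULL);
	return ((uintptr_t)ptr % 16) == 0;
}
```

The attribute also raises the alignment of the enclosing structure, so
a compiler-placed instance keeps fp on a 16-byte boundary; dynamically
allocated instances still depend on the allocator returning suitably
aligned memory.
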
 arch/um/include/shared/kern_util.h    |   4 +
 arch/um/nommu/Makefile                |   2 +-
 arch/um/nommu/os-Linux/signal.c       |   8 +
 arch/um/nommu/trap.c                  | 201 ++++++++++++++++++++++++++
 arch/um/os-Linux/signal.c             |   3 +-
 arch/x86/um/nommu/do_syscall_64.c     |   6 +
 arch/x86/um/nommu/os-Linux/mcontext.c |  11 ++
 arch/x86/um/shared/sysdep/mcontext.h  |   1 +
 arch/x86/um/shared/sysdep/ptrace.h    |   2 +-
 9 files changed, 235 insertions(+), 3 deletions(-)
 create mode 100644 arch/um/nommu/trap.c

diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
index 7798f16a4677..46c8d6336ca1 100644
--- a/arch/um/include/shared/kern_util.h
+++ b/arch/um/include/shared/kern_util.h
@@ -70,4 +70,8 @@ void um_idle_sleep(void);
 
 void kasan_map_memory(void *start, size_t len);
 
+#ifndef CONFIG_MMU
+extern void nommu_relay_signal(void *ptr);
+#endif
+
 #endif
diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile
index baab7c2f57c2..096221590cfd 100644
--- a/arch/um/nommu/Makefile
+++ b/arch/um/nommu/Makefile
@@ -1,3 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-y := os-Linux/
+obj-y := trap.o os-Linux/
diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-Linux/signal.c
index 19043b9652e2..6febb178dcda 100644
--- a/arch/um/nommu/os-Linux/signal.c
+++ b/arch/um/nommu/os-Linux/signal.c
@@ -5,6 +5,7 @@
 #include <os.h>
 #include <sysdep/mcontext.h>
 #include <sys/ucontext.h>
+#include <as-layout.h>
 
 void sigsys_handler(int sig, struct siginfo *si,
 		    struct uml_pt_regs *regs, void *ptr)
@@ -14,3 +15,10 @@ void sigsys_handler(int sig, struct siginfo *si,
 	/* hook syscall via SIGSYS */
 	set_mc_sigsys_hook(mc);
 }
+
+void nommu_relay_signal(void *ptr)
+{
+	mcontext_t *mc = (mcontext_t *) ptr;
+
+	set_mc_relay_signal(mc);
+}
diff --git a/arch/um/nommu/trap.c b/arch/um/nommu/trap.c
new file mode 100644
index 000000000000..430297517455
--- /dev/null
+++ b/arch/um/nommu/trap.c
@@ -0,0 +1,201 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/hardirq.h>
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/sched/debug.h>
+#include <asm/current.h>
+#include <asm/tlbflush.h>
+#include <arch.h>
+#include <as-layout.h>
+#include <kern_util.h>
+#include <os.h>
+#include <skas.h>
+
+/*
+ * Note this is constrained to return 0, -EFAULT, -EACCES, -ENOMEM by
+ * segv().
+ */
+int handle_page_fault(unsigned long address, unsigned long ip,
+		      int is_write, int is_user, int *code_out)
+{
+	/* !MMU has no pagefault */
+	return -EFAULT;
+}
+
+static void show_segv_info(struct uml_pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+	struct faultinfo *fi = UPT_FAULTINFO(regs);
+
+	if (!unhandled_signal(tsk, SIGSEGV))
+		return;
+
+	pr_warn_ratelimited("%s%s[%d]: segfault at %lx ip %p sp %p error %x",
+			    task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
+			    tsk->comm, task_pid_nr(tsk), FAULT_ADDRESS(*fi),
+			    (void *)UPT_IP(regs), (void *)UPT_SP(regs),
+			    fi->error_code);
+}
+
+static void bad_segv(struct faultinfo fi, unsigned long ip)
+{
+	current->thread.arch.faultinfo = fi;
+	force_sig_fault(SIGSEGV, SEGV_ACCERR, (void __user *) FAULT_ADDRESS(fi));
+}
+
+void fatal_sigsegv(void)
+{
+	force_fatal_sig(SIGSEGV);
+	do_signal(&current->thread.regs);
+	/*
+	 * This is to tell gcc that we're not returning - do_signal
+	 * can, in general, return, but in this case, it's not, since
+	 * we just got a fatal SIGSEGV queued.
+	 */
+	os_dump_core();
+}
+
+/**
+ * segv_handler() - the SIGSEGV handler
+ * @sig:	the signal number
+ * @unused_si:	the signal info struct; unused in this handler
+ * @regs:	the ptrace register information
+ *
+ * The handler first extracts the faultinfo from the UML ptrace regs struct.
+ * If the userfault did not happen in a UML userspace process, bad_segv is called.
+ * Otherwise the signal did happen in a cloned userspace process, handle it.
+ */
+void segv_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs,
+		  void *mc)
+{
+	struct faultinfo *fi = UPT_FAULTINFO(regs);
+
+	/* !MMU specific part; detection of userspace */
+	/* mark is_user=1 when the IP is from userspace code. */
+	if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) < high_physmem)
+		regs->is_user = 1;
+
+	if (UPT_IS_USER(regs) && !SEGV_IS_FIXABLE(fi)) {
+		show_segv_info(regs);
+		bad_segv(*fi, UPT_IP(regs));
+		return;
+	}
+	segv(*fi, UPT_IP(regs), UPT_IS_USER(regs), regs, mc);
+
+	/* !MMU specific part; relay the signal to the userspace handler */
+	relay_signal(sig, unused_si, regs, mc);
+}
+
+/*
+ * We give a *copy* of the faultinfo in the regs to segv.
+ * This must be done, since nesting SEGVs could overwrite
+ * the info in the regs. A pointer to the info then would
+ * give us bad data!
+ */
+unsigned long segv(struct faultinfo fi, unsigned long ip, int is_user,
+		   struct uml_pt_regs *regs, void *mc)
+{
+	int si_code;
+	int err;
+	int is_write = FAULT_WRITE(fi);
+	unsigned long address = FAULT_ADDRESS(fi);
+
+	if (!is_user && regs)
+		current->thread.segv_regs = container_of(regs, struct pt_regs, regs);
+
+	if (current->mm == NULL) {
+		show_regs(container_of(regs, struct pt_regs, regs));
+		panic("Segfault with no mm");
+	} else if (!is_user && address > PAGE_SIZE && address < TASK_SIZE) {
+		show_regs(container_of(regs, struct pt_regs, regs));
+		panic("Kernel tried to access user memory at addr 0x%lx, ip 0x%lx",
+		       address, ip);
+	}
+
+	if (SEGV_IS_FIXABLE(&fi))
+		err = handle_page_fault(address, ip, is_write, is_user,
+					&si_code);
+	else {
+		err = -EFAULT;
+		/*
+		 * A thread accessed NULL, we get a fault, but CR2 is invalid.
+		 * This code is used in __do_copy_from_user() of TT mode.
+		 * XXX tt mode is gone, so maybe this isn't needed any more
+		 */
+		address = 0;
+	}
+
+	if (!err)
+		goto out;
+	else if (!is_user && arch_fixup(ip, regs))
+		goto out;
+
+	if (!is_user) {
+		show_regs(container_of(regs, struct pt_regs, regs));
+		panic("Kernel mode fault at addr 0x%lx, ip 0x%lx",
+		      address, ip);
+	}
+
+	show_segv_info(regs);
+
+	if (err == -EACCES) {
+		current->thread.arch.faultinfo = fi;
+		force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
+	} else {
+		WARN_ON_ONCE(err != -EFAULT);
+		current->thread.arch.faultinfo = fi;
+		force_sig_fault(SIGSEGV, si_code, (void __user *) address);
+	}
+
+out:
+	if (regs)
+		current->thread.segv_regs = NULL;
+
+	return 0;
+}
+
+void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs *regs,
+		  void *mc)
+{
+	int code, err;
+
+	/* !MMU specific part; detection of userspace */
+	/* mark is_user=1 when the IP is from userspace code. */
+	if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) < high_physmem)
+		regs->is_user = 1;
+
+	if (!UPT_IS_USER(regs)) {
+		if (sig == SIGBUS)
+			pr_err("Bus error - the host /dev/shm or /tmp mount likely just ran out of space\n");
+		panic("Kernel mode signal %d", sig);
+	}
+	/* if is_user==1, set return to userspace sig handler to relay signal */
+	nommu_relay_signal(mc);
+
+	arch_examine_signal(sig, regs);
+
+	/* Is the signal layout for the signal known?
+	 * Signal data must be scrubbed to prevent information leaks.
+	 */
+	code = si->si_code;
+	err = si->si_errno;
+	if ((err == 0) && (siginfo_layout(sig, code) == SIL_FAULT)) {
+		struct faultinfo *fi = UPT_FAULTINFO(regs);
+
+		current->thread.arch.faultinfo = *fi;
+		force_sig_fault(sig, code, (void __user *)FAULT_ADDRESS(*fi));
+	} else {
+		pr_err("Attempted to relay unknown signal %d (si_code = %d) with errno %d\n",
+		       sig, code, err);
+		force_sig(sig);
+	}
+}
+
+void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs,
+	   void *mc)
+{
+	do_IRQ(WINCH_IRQ, regs);
+}
diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
index 2f6795cd884c..28754f56c42b 100644
--- a/arch/um/os-Linux/signal.c
+++ b/arch/um/os-Linux/signal.c
@@ -41,9 +41,10 @@ static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc)
 	int save_errno = errno;
 
 	r.is_user = 0;
+	if (mc)
+		get_regs_from_mc(&r, mc);
 	if (sig == SIGSEGV) {
 		/* For segfaults, we want the data from the sigcontext. */
-		get_regs_from_mc(&r, mc);
 		GET_FAULTINFO_FROM_MC(r.faultinfo, mc);
 	}
 
diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c
index 9bc630995df9..cf5a347ee9b1 100644
--- a/arch/x86/um/nommu/do_syscall_64.c
+++ b/arch/x86/um/nommu/do_syscall_64.c
@@ -44,6 +44,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
 	/* set fs register to the original host one */
 	os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
 
+	/* save fp registers */
+	asm volatile("fxsaveq %0" : "=m"(*(struct _xstate *)regs->regs.fp));
+
 	if (likely(syscall < NR_syscalls)) {
 		unsigned long ret;
 
@@ -61,6 +64,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
 	/* handle tasks and signals at the end */
 	interrupt_end();
 
+	/* restore fp registers */
+	asm volatile("fxrstorq %0" : : "m"((current->thread.regs.regs.fp)));
+
 	/* restore back fs register to userspace configured one */
 	os_x86_arch_prctl(0, ARCH_SET_FS,
 		      (void *)(current->thread.regs.regs.gp[FS_BASE
diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c
index b62a6195096f..afa20f1e235a 100644
--- a/arch/x86/um/nommu/os-Linux/mcontext.c
+++ b/arch/x86/um/nommu/os-Linux/mcontext.c
@@ -4,10 +4,21 @@
 #include <asm/ptrace.h>
 #include <sysdep/ptrace.h>
 #include <sysdep/mcontext.h>
+#include <os.h>
+#include "../syscalls.h"
 
 extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
 			      int64_t a4, int64_t a5, int64_t a6);
 
+void set_mc_relay_signal(mcontext_t *mc)
+{
+	/* configure stack and userspace returning routine as
+	 * instruction pointer
+	 */
+	mc->gregs[REG_RSP] = (unsigned long) current_top_of_stack;
+	mc->gregs[REG_RIP] = (unsigned long) userspace;
+}
+
 void set_mc_sigsys_hook(mcontext_t *mc)
 {
 	mc->gregs[REG_RCX] = mc->gregs[REG_RIP];
diff --git a/arch/x86/um/shared/sysdep/mcontext.h b/arch/x86/um/shared/sysdep/mcontext.h
index 9a0d6087f357..82a5f38b350f 100644
--- a/arch/x86/um/shared/sysdep/mcontext.h
+++ b/arch/x86/um/shared/sysdep/mcontext.h
@@ -19,6 +19,7 @@ extern int set_stub_state(struct uml_pt_regs *regs, struct stub_data *data,
 
 #ifndef CONFIG_MMU
 extern void set_mc_sigsys_hook(mcontext_t *mc);
+extern void set_mc_relay_signal(mcontext_t *mc);
 #endif
 
 #ifdef __i386__
diff --git a/arch/x86/um/shared/sysdep/ptrace.h b/arch/x86/um/shared/sysdep/ptrace.h
index 572ea2d79131..6ed6bb1ca50e 100644
--- a/arch/x86/um/shared/sysdep/ptrace.h
+++ b/arch/x86/um/shared/sysdep/ptrace.h
@@ -53,7 +53,7 @@ struct uml_pt_regs {
 	int is_user;
 
 	/* Dynamically sized FP registers (holds an XSTATE) */
-	unsigned long fp[];
+	unsigned long fp[] __attribute__((aligned(16)));
 };
 
 #define EMPTY_UML_PT_REGS { }
-- 
2.43.0


From thehajime at gmail.com  Sun Nov  2 01:49:35 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sun,  2 Nov 2025 18:49:35 +0900
Subject: [PATCH v12 10/13] um: change machine name for uname output
In-Reply-To: <cover.1762075876.git.thehajime@gmail.com>
References: <cover.1762075876.git.thehajime@gmail.com>
Message-ID: <e37819e93cf5a593fa1b16444b3c7aa6b7885301.1762075876.git.thehajime@gmail.com>

This commit exposes the MMU/!MMU mode in the output of uname(2) so
that users can tell which mode of UML is currently running.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
---
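From inside a guest, the mode could be checked with a sketch like the
one below.  The "um(nommu)" prefix is an assumption derived from the
UTS_MACHINE value set in this patch; the exact string reported by a
running kernel has not been verified here, and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/utsname.h>

/*
 * Copy the machine field reported by uname(2) into buf.
 * Returns 0 on success, -1 on failure or if buf is too small.
 */
int read_machine(char *buf, size_t len)
{
	struct utsname u;

	assert(buf != NULL);

	if (uname(&u) < 0)
		return -1;
	if (strlen(u.machine) >= len)
		return -1;
	strcpy(buf, u.machine);
	return 0;
}

/*
 * Return 1 if the machine string starts with the nommu marker.
 * The marker "um(nommu)" is an assumption based on this patch.
 */
int is_nommu_uml(const char *machine)
{
	static const char marker[] = "um(nommu)";

	assert(machine != NULL);
	return strncmp(machine, marker, strlen(marker)) == 0;
}
```

A caller would pass the buffer filled by read_machine() to
is_nommu_uml(); on a regular host the second function simply returns 0.
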
 arch/um/Makefile        | 6 ++++++
 arch/um/os-Linux/util.c | 3 ++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/um/Makefile b/arch/um/Makefile
index 5371c9a1b11e..9bc8fc149514 100644
--- a/arch/um/Makefile
+++ b/arch/um/Makefile
@@ -153,6 +153,12 @@ export CFLAGS_vmlinux := $(LINK-y) $(LINK_WRAPS) $(LD_FLAGS_CMDLINE) $(CC_FLAGS_
 CLEAN_FILES += linux x.i gmon.out
 MRPROPER_FILES += $(HOST_DIR)/include/generated
 
+ifeq ($(CONFIG_MMU),y)
+UTS_MACHINE := "um"
+else
+UTS_MACHINE := "um\(nommu\)"
+endif
+
 archclean:
 	@find . \( -name '*.bb' -o -name '*.bbg' -o -name '*.da' \
 		-o -name '*.gcov' \) -type f -print | xargs rm -f
diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c
index e3ad71a0d13c..5fb26f5dfcb6 100644
--- a/arch/um/os-Linux/util.c
+++ b/arch/um/os-Linux/util.c
@@ -64,7 +64,8 @@ void setup_machinename(char *machine_out)
 	}
 # endif
 #endif
-	strcpy(machine_out, host.machine);
+	strcat(machine_out, "/");
+	strcat(machine_out, host.machine);
 }
 
 void setup_hostinfo(char *buf, int len)
-- 
2.43.0


From thehajime at gmail.com  Sun Nov  2 01:49:36 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sun,  2 Nov 2025 18:49:36 +0900
Subject: [PATCH v12 11/13] um: nommu: disable SMP on nommu UML
In-Reply-To: <cover.1762075876.git.thehajime@gmail.com>
References: <cover.1762075876.git.thehajime@gmail.com>
Message-ID: <54839396f81bc2755728a53912bd8fcb19b889a1.1762075876.git.thehajime@gmail.com>

CONFIG_SMP doesn't work with nommu UML since the handling of the host
fs register conflicts with thread-local storage (more specifically,
the variable signals_enabled).

Thus this commit disables the CONFIG option and the TLS variables.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
---
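To make the conflict concrete: on x86_64 Linux, thread-local variables
are addressed relative to the FS base, so host code that touches a
__thread variable while FS still holds a guest value reads the wrong
memory.  The sketch below only reads the current FS base with the raw
arch_prctl syscall; the function name is hypothetical, and on other
platforms it just reports failure.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

#if defined(__x86_64__) && defined(__linux__)
#include <asm/prctl.h>	/* ARCH_GET_FS */
#endif

/* A thread-local variable; on x86_64 Linux it is accessed via FS. */
__thread int tls_counter_sketch;

/*
 * Read the calling thread's FS base into *fs.
 * Returns 0 on success, -1 on failure or on unsupported platforms.
 */
int read_fs_base(unsigned long *fs)
{
	assert(fs != NULL);
#if defined(__x86_64__) && defined(__linux__)
	return syscall(SYS_arch_prctl, ARCH_GET_FS, fs) == 0 ? 0 : -1;
#else
	(void)fs;
	return -1;
#endif
}
```

With __thread defined to nothing, as this patch does for !MMU builds,
a variable such as tls_counter_sketch becomes an ordinary global and no
longer depends on the FS base.
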
 arch/um/os-Linux/internal.h | 8 ++++++++
 arch/x86/um/Kconfig         | 2 +-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/um/os-Linux/internal.h b/arch/um/os-Linux/internal.h
index bac9fcc8c14c..25cb5cc931c1 100644
--- a/arch/um/os-Linux/internal.h
+++ b/arch/um/os-Linux/internal.h
@@ -6,6 +6,14 @@
 #include <stub-data.h>
 #include <signal.h>
 
+/* NOMMU doesn't work with the thread-local storage used by CONFIG_SMP,
+ * because the fs register is switched to host_fs on every user/kernel
+ * transition.  So, disable TLS until NOMMU supports SMP.
+ */
+#ifndef CONFIG_MMU
+#define __thread
+#endif
+
 /*
  * elf_aux.c
  */
diff --git a/arch/x86/um/Kconfig b/arch/x86/um/Kconfig
index c52fb5cb8d21..2bc18ecad783 100644
--- a/arch/x86/um/Kconfig
+++ b/arch/x86/um/Kconfig
@@ -13,7 +13,7 @@ config UML_X86
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select DCACHE_WORD_ACCESS
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
-	select UML_SUBARCH_SUPPORTS_SMP if X86_CX8
+	select UML_SUBARCH_SUPPORTS_SMP if X86_CX8 && MMU
 
 config 64BIT
 	bool "64-bit kernel" if "$(SUBARCH)" = "x86"
-- 
2.43.0


From thehajime at gmail.com  Sun Nov  2 01:49:37 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sun,  2 Nov 2025 18:49:37 +0900
Subject: [PATCH v12 12/13] um: nommu: add documentation of nommu UML
In-Reply-To: <cover.1762075876.git.thehajime@gmail.com>
References: <cover.1762075876.git.thehajime@gmail.com>
Message-ID: <5a831d893431c15a1bc2833cedc5a45cdfa44cb9.1762075876.git.thehajime@gmail.com>

This commit adds initial documentation for the !MMU mode of UML.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
---
 Documentation/virt/uml/nommu-uml.rst | 180 +++++++++++++++++++++++++++
 MAINTAINERS                          |   1 +
 2 files changed, 181 insertions(+)
 create mode 100644 Documentation/virt/uml/nommu-uml.rst

diff --git a/Documentation/virt/uml/nommu-uml.rst b/Documentation/virt/uml/nommu-uml.rst
new file mode 100644
index 000000000000..f049bbc697d1
--- /dev/null
+++ b/Documentation/virt/uml/nommu-uml.rst
@@ -0,0 +1,180 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+UML has been built with CONFIG_MMU since day 0.  This patchset
+introduces a nommu mode for UML, from a different angle than what the
+Linux Kernel Library tried.
+
+.. contents:: :local:
+
+What is it for?
+===============
+
+- Alleviate the overhead of the syscall hook implemented with ptrace(2)
+- Exercise nommu code over UML (and over KUnit)
+- Reduce the dependency on host facilities
+
+
+How does it work?
+=================
+
+To illustrate how this feature works, the following shows how syscalls
+are called in the nommu UML environment.
+
+- the kernel boots and installs a seccomp filter that traps ``syscall``
+  instructions issued from userspace memory, based on the address of
+  the instruction pointer
+- (userspace starts)
+- userspace calls the ``vfork``/``execve`` syscalls
+- ``SIGSYS`` signal raised, handler calls syscall entry point ``__kernel_vsyscall``
+- the handler function in ``sys_call_table[]`` is called, and the rest
+  follows how UML syscalls work
+- return to userspace
+
+
+What are the differences from MMU-full UML?
+===========================================
+
+The current nommu implementation adds three features which MMU-full
+UML doesn't have:
+
+- kernel address space is directly accessible from userspace
+  - so, ``uaccess()`` always returns 1
+  - generic implementation of memcpy/strcpy/futex is also used
+- alternate syscall entrypoint without ptrace
+- alternate syscall hook
+  - hook syscall by seccomp filter
+
+These modifications allow us to use unmodified userspace binaries
+with nommu UML.
+
+
+History
+=======
+
+This feature was originally introduced by Ricardo Koller at Open
+Source Summit NA 2020, and was then integrated with the syscall
+translation functionality, along with a cleanup of the original code.
+
+Building and running
+====================
+
+::
+
+   make ARCH=um x86_64_nommu_defconfig
+   make ARCH=um
+
+will build UML with ``CONFIG_MMU=n`` applied.
+
+KUnit tests can be run with the following command::
+
+   ./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_MMU=n
+
+To run a typical Linux distribution, we need a nommu-aware userspace.
+We can use a stock version of Alpine Linux with nommu builds of
+busybox and musl-libc.
+
+
+Preparing root filesystem
+=========================
+
+nommu UML requires a standard library which is aware of nommu
+kernels.  We have tested custom-built musl-libc and busybox, both of
+which have built-in support for nommu kernels.
+
+There are no Linux distributions available for nommu on the x86_64
+architecture, so we need to prepare our own image for the root
+filesystem.  We use Alpine Linux as a base distribution and replace
+busybox and musl-libc on top of that.  The following are the steps to
+prepare the filesystem for the quick start::
+
+     container_id=$(docker create ghcr.io/thehajime/alpine:3.20.3-um-nommu)
+     docker start $container_id
+     docker wait $container_id
+     docker export $container_id > alpine.tar
+     docker rm $container_id
+
+     mnt=$(mktemp -d)
+     dd if=/dev/zero of=alpine.ext4 bs=1 count=0 seek=1G
+     sudo chmod og+wr "alpine.ext4"
+     yes 2>/dev/null | mkfs.ext4 "alpine.ext4" || true
+     sudo mount "alpine.ext4" $mnt
+     sudo tar -xf alpine.tar -C $mnt
+     sudo umount $mnt
+
+This will create a file image, ``alpine.ext4``, which contains nommu
+builds of busybox and musl on the Alpine Linux root filesystem.  The
+file can be passed via the ``ubd0=`` argument on the UML command line::
+
+  ./vmlinux ubd0=./alpine.ext4 rw mem=1024m loglevel=8 init=/sbin/init
+
+We plan to upstream apk packages for busybox and musl so that we can
+follow the proper procedure to set up the root filesystem.
+
+
+Quick start with docker
+=======================
+
+There is a docker image that you can quickly start with a simple step::
+
+  docker run -it -v /dev/shm:/dev/shm --rm ghcr.io/thehajime/alpine:3.20.3-um-nommu
+
+This will launch a UML instance with a pre-configured root filesystem.
+
+Benchmark
+=========
+
+The following shows an example performance measurement conducted with
+lmbench and a (self-crafted) getpid benchmark (on the v6.17-rc5
+uml/next tree).
+
+.. csv-table:: lmbench (usec)
+  :header: ,native,um,um-mmu(s),um-nommu(s)
+
+  select-10    ,0.5319,36.1214,24.2795,2.9174
+  select-100   ,1.6019,34.6049,28.8865,3.8080
+  select-1000  ,12.2588,43.6838,48.7438,12.7872
+  syscall      ,0.1644,35.0321,53.2119,2.5981
+  read         ,0.3055,31.5509,45.8538,2.7068
+  write        ,0.2512,31.3609,29.2636,2.6948
+  stat         ,1.8894,43.8477,49.6121,3.1908
+  open/close   ,3.2973,77.5123,68.9431,6.2575
+  fork+sh      ,1110.3000,7359.5000,4618.6667,439.4615
+  fork+execve  ,510.8182,2834.0000,2461.1667,139.7848
+
+.. csv-table:: do_getpid bench (nsec)
+  :header: ,native,um,um-mmu(s),um-nommu(s)
+
+  getpid , 161 , 34477 , 26242 , 2599
+
+(um-nommu(s) is nommu UML with the seccomp syscall hook, and um-mmu(s)
+is MMU UML in SECCOMP mode)
+
+Limitations
+===========
+
+generic nommu limitations
+-------------------------
+Since this port is a nommu kernel, the implementation inherits the
+characteristics of other nommu kernels (riscv, arm, etc.), described
+below.
+
+- vfork(2) should be used instead of fork(2)
+- ELF loader only loads PIE (position independent executable) binaries
+- all processes share a single address space
+- mmap(2) offers a subset of functionalities (e.g., MAP_FIXED is
+  unsupported)
+
+Thus, we have limited options for userspace programs.  We have tested
+Alpine Linux with musl-libc, which supports nommu kernels.
+
+supported architecture
+----------------------
+The current implementation of nommu UML only works on the x86_64
+SUBARCH.  We have not tested it in a 32-bit environment.
+
+
+Further reading about NOMMU UML
+===============================
+
+- NOMMU UML (original code by Ricardo Koller)
+ - https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf
diff --git a/MAINTAINERS b/MAINTAINERS
index 3da2c26a796b..2f227f56d04e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -26764,6 +26764,7 @@ USER-MODE LINUX (UML)
 M:	Richard Weinberger <richard at nod.at>
 M:	Anton Ivanov <anton.ivanov at cambridgegreys.com>
 M:	Johannes Berg <johannes at sipsolutions.net>
+M:	Hajime Tazaki <thehajime at gmail.com>
 L:	linux-um at lists.infradead.org
 S:	Maintained
 W:	http://user-mode-linux.sourceforge.net
-- 
2.43.0


From thehajime at gmail.com  Sun Nov  2 01:49:38 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sun,  2 Nov 2025 18:49:38 +0900
Subject: [PATCH v12 13/13] um: nommu: plug nommu code into build system
In-Reply-To: <cover.1762075876.git.thehajime@gmail.com>
References: <cover.1762075876.git.thehajime@gmail.com>
Message-ID: <efea26f968ad9049116cee46fb55f450df50033d.1762075876.git.thehajime@gmail.com>

Plug the nommu kernel into the um build.  A defconfig is also provided.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Ricardo Koller <ricarkol at google.com>
---
 arch/um/Kconfig                        | 14 ++++++-
 arch/um/configs/x86_64_nommu_defconfig | 54 ++++++++++++++++++++++++++
 2 files changed, 66 insertions(+), 2 deletions(-)
 create mode 100644 arch/um/configs/x86_64_nommu_defconfig

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 097c6a6265ef..4907fd2db512 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -34,16 +34,19 @@ config UML
 	select ARCH_SUPPORTS_LTO_CLANG_THIN
 	select TRACE_IRQFLAGS_SUPPORT
 	select TTY # Needed for line.c
-	select HAVE_ARCH_VMAP_STACK
+	select HAVE_ARCH_VMAP_STACK if MMU
 	select HAVE_RUST
 	select ARCH_HAS_UBSAN
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_SYSCALL_TRACEPOINTS
 	select THREAD_INFO_IN_TASK
 	select SPARSE_IRQ
+	select UACCESS_MEMCPY if !MMU
+	select GENERIC_STRNLEN_USER if !MMU
+	select GENERIC_STRNCPY_FROM_USER if !MMU
 
 config MMU
-	bool
+	bool "MMU-based Paged Memory Management Support" if 64BIT
 	default y
 
 config UML_DMA_EMULATION
@@ -225,8 +228,15 @@ config MAGIC_SYSRQ
 	  The keys are documented in <file:Documentation/admin-guide/sysrq.rst>. Don't say Y
 	  unless you really know what this hack does.
 
+config ARCH_FORCE_MAX_ORDER
+	int "Order of maximal physically contiguous allocations" if EXPERT
+	default "10" if MMU
+	default "16" if !MMU
+
 config KERNEL_STACK_ORDER
 	int "Kernel stack size order"
+	default 3 if !MMU
+	range 3 10 if !MMU
 	default 2 if 64BIT
 	range 2 10 if 64BIT
 	default 1 if !64BIT
diff --git a/arch/um/configs/x86_64_nommu_defconfig b/arch/um/configs/x86_64_nommu_defconfig
new file mode 100644
index 000000000000..02cb87091c9f
--- /dev/null
+++ b/arch/um/configs/x86_64_nommu_defconfig
@@ -0,0 +1,54 @@
+CONFIG_SYSVIPC=y
+CONFIG_POSIX_MQUEUE=y
+CONFIG_NO_HZ=y
+CONFIG_HIGH_RES_TIMERS=y
+CONFIG_BSD_PROCESS_ACCT=y
+CONFIG_IKCONFIG=y
+CONFIG_IKCONFIG_PROC=y
+CONFIG_LOG_BUF_SHIFT=14
+CONFIG_CGROUPS=y
+CONFIG_BLK_CGROUP=y
+CONFIG_CGROUP_SCHED=y
+CONFIG_CGROUP_DEVICE=y
+CONFIG_CGROUP_CPUACCT=y
+# CONFIG_PID_NS is not set
+CONFIG_CC_OPTIMIZE_FOR_SIZE=y
+# CONFIG_MMU is not set
+CONFIG_HOSTFS=y
+CONFIG_MAGIC_SYSRQ=y
+CONFIG_SSL=y
+CONFIG_NULL_CHAN=y
+CONFIG_PORT_CHAN=y
+CONFIG_PTY_CHAN=y
+CONFIG_TTY_CHAN=y
+CONFIG_CON_CHAN="pts"
+CONFIG_SSL_CHAN="pts"
+CONFIG_MODULES=y
+CONFIG_MODULE_UNLOAD=y
+CONFIG_IOSCHED_BFQ=m
+CONFIG_BINFMT_MISC=m
+CONFIG_NET=y
+CONFIG_PACKET=y
+CONFIG_UNIX=y
+CONFIG_INET=y
+CONFIG_DEVTMPFS=y
+CONFIG_DEVTMPFS_MOUNT=y
+CONFIG_BLK_DEV_UBD=y
+CONFIG_BLK_DEV_LOOP=m
+CONFIG_BLK_DEV_NBD=m
+CONFIG_DUMMY=m
+CONFIG_TUN=m
+CONFIG_PPP=m
+CONFIG_SLIP=m
+CONFIG_LEGACY_PTY_COUNT=32
+CONFIG_UML_RANDOM=y
+CONFIG_EXT4_FS=y
+CONFIG_QUOTA=y
+CONFIG_AUTOFS_FS=m
+CONFIG_ISO9660_FS=m
+CONFIG_JOLIET=y
+CONFIG_NLS=y
+CONFIG_DEBUG_KERNEL=y
+CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
+CONFIG_FRAME_WARN=1024
+CONFIG_IPV6=y
-- 
2.43.0


From jlayton at kernel.org  Wed Nov  5 12:24:50 2025
From: jlayton at kernel.org (Jeff Layton)
Date: Wed, 05 Nov 2025 15:24:50 -0500
Subject: [PATCH] vfs: remove the excl argument from the ->create()
 inode_operation
Message-ID: <20251105-create-excl-v1-1-a4cce035cc55@kernel.org>

Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
the "excl" argument to the ->create() inode_operation is always set to
true. Remove it, and fix up all of the create implementations.

Signed-off-by: Jeff Layton <jlayton at kernel.org>
---
The latest directory delegation patchset has a patch in it to clean up
the arguments for vfs_create() [1]. If that looks sane, then I think
this would be the next logical step.

Full disclosure: I did use claude code to generate the first
approximation, but I had to fix a number of things that it missed.  I
probably could have given it better prompts. In any case, I'm not sure
how to properly attribute this (or if I even need to).

[1]: https://lore.kernel.org/linux-nfs/20251105-dir-deleg-ro-v5-9-7ebc168a88ac at kernel.org/
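
For readers skimming the diffstat, the shape of the change can be
sketched with stand-in types.  Everything below is hypothetical
scaffolding (opaque structs, a one-member operations table, an example
filesystem); only the parameter list of the hook mirrors what this
patch leaves in place after dropping "bool excl".

```c
#include <assert.h>
#include <stddef.h>

/* Opaque stand-ins for the kernel types; only pointers are used here. */
struct mnt_idmap;
struct inode;
struct dentry;
typedef unsigned short umode_t;

/* Shape of the hook after this patch: no trailing "bool excl". */
typedef int (*create_fn)(struct mnt_idmap *idmap, struct inode *dir,
			 struct dentry *dentry, umode_t mode);

/* Hypothetical operations table holding just the one hook. */
struct inode_operations_sketch {
	create_fn create;
};

/* Hypothetical filesystem implementation that reports success. */
int examplefs_create(struct mnt_idmap *idmap, struct inode *dir,
		     struct dentry *dentry, umode_t mode)
{
	(void)idmap;
	(void)dir;
	(void)dentry;
	(void)mode;
	return 0;
}

const struct inode_operations_sketch examplefs_dir_iops = {
	.create = examplefs_create,
};

/* Caller side: the hook is invoked without an exclusivity flag. */
int call_create(const struct inode_operations_sketch *ops, umode_t mode)
{
	assert(ops != NULL && ops->create != NULL);
	return ops->create(NULL, NULL, NULL, mode);
}
```

An implementation that previously ignored the flag only loses one
parameter, which is what most of the one-line hunks in this patch do.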
---
 fs/9p/vfs_inode.c       | 2 +-
 fs/9p/vfs_inode_dotl.c  | 2 +-
 fs/affs/affs.h          | 2 +-
 fs/affs/namei.c         | 2 +-
 fs/afs/dir.c            | 4 ++--
 fs/bad_inode.c          | 2 +-
 fs/bfs/dir.c            | 2 +-
 fs/btrfs/inode.c        | 2 +-
 fs/ceph/dir.c           | 2 +-
 fs/coda/dir.c           | 2 +-
 fs/ecryptfs/inode.c     | 2 +-
 fs/efivarfs/inode.c     | 2 +-
 fs/exfat/namei.c        | 2 +-
 fs/ext2/namei.c         | 2 +-
 fs/ext4/namei.c         | 2 +-
 fs/f2fs/namei.c         | 2 +-
 fs/fat/namei_msdos.c    | 2 +-
 fs/fat/namei_vfat.c     | 2 +-
 fs/fuse/dir.c           | 2 +-
 fs/gfs2/inode.c         | 5 ++---
 fs/hfs/dir.c            | 2 +-
 fs/hfsplus/dir.c        | 2 +-
 fs/hostfs/hostfs_kern.c | 2 +-
 fs/hpfs/namei.c         | 2 +-
 fs/hugetlbfs/inode.c    | 2 +-
 fs/jffs2/dir.c          | 4 ++--
 fs/jfs/namei.c          | 2 +-
 fs/minix/namei.c        | 2 +-
 fs/namei.c              | 4 ++--
 fs/nfs/dir.c            | 4 ++--
 fs/nfs/internal.h       | 2 +-
 fs/nilfs2/namei.c       | 2 +-
 fs/ntfs3/namei.c        | 2 +-
 fs/ocfs2/dlmfs/dlmfs.c  | 3 +--
 fs/ocfs2/namei.c        | 3 +--
 fs/omfs/dir.c           | 2 +-
 fs/orangefs/namei.c     | 3 +--
 fs/overlayfs/dir.c      | 2 +-
 fs/ramfs/inode.c        | 2 +-
 fs/smb/client/cifsfs.h  | 2 +-
 fs/smb/client/dir.c     | 2 +-
 fs/ubifs/dir.c          | 2 +-
 fs/udf/namei.c          | 2 +-
 fs/ufs/namei.c          | 3 +--
 fs/vboxsf/dir.c         | 2 +-
 fs/xfs/xfs_iops.c       | 3 +--
 include/linux/fs.h      | 4 ++--
 ipc/mqueue.c            | 2 +-
 mm/shmem.c              | 2 +-
 49 files changed, 55 insertions(+), 61 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 69f378a837753e934c20b599660f8a756127e40a..595244d57cba62869b9af8b909af67d3c61e7f6c 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -643,7 +643,7 @@ v9fs_create(struct v9fs_session_info *v9ses, struct inode *dir,
 
 static int
 v9fs_vfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		struct dentry *dentry, umode_t mode, bool excl)
+		struct dentry *dentry, umode_t mode)
 {
 	struct v9fs_session_info *v9ses = v9fs_inode2v9ses(dir);
 	u32 perm = unixmode2p9mode(v9ses, mode);
diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
index 0b404e8484d22e2cbe60d846e0fa653001cdc4b1..de8fe9954d433c9b14ff5dd72ba13c3d5a67ebe7 100644
--- a/fs/9p/vfs_inode_dotl.c
+++ b/fs/9p/vfs_inode_dotl.c
@@ -218,7 +218,7 @@ int v9fs_open_to_dotl_flags(int flags)
  */
 static int
 v9fs_vfs_create_dotl(struct mnt_idmap *idmap, struct inode *dir,
-		     struct dentry *dentry, umode_t omode, bool excl)
+		     struct dentry *dentry, umode_t omode)
 {
 	return v9fs_vfs_mknod_dotl(idmap, dir, dentry, omode, 0);
 }
diff --git a/fs/affs/affs.h b/fs/affs/affs.h
index ac4e9a02910b72d63c8ec5291347b54518e67f4b..665be23c42cfa206dc0a2c9ffa119b7c3c747389 100644
--- a/fs/affs/affs.h
+++ b/fs/affs/affs.h
@@ -167,7 +167,7 @@ extern int	affs_hash_name(struct super_block *sb, const u8 *name, unsigned int l
 extern struct dentry *affs_lookup(struct inode *dir, struct dentry *dentry, unsigned int);
 extern int	affs_unlink(struct inode *dir, struct dentry *dentry);
 extern int	affs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool);
+			struct dentry *dentry, umode_t mode);
 extern struct dentry *affs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 			struct dentry *dentry, umode_t mode);
 extern int	affs_rmdir(struct inode *dir, struct dentry *dentry);
diff --git a/fs/affs/namei.c b/fs/affs/namei.c
index f883be50db122d3b09f0ae4d24618bd49b55186b..5591e1b5a2f68fc7600115e241f01f81d3aac010 100644
--- a/fs/affs/namei.c
+++ b/fs/affs/namei.c
@@ -243,7 +243,7 @@ affs_unlink(struct inode *dir, struct dentry *dentry)
 
 int
 affs_create(struct mnt_idmap *idmap, struct inode *dir,
-	    struct dentry *dentry, umode_t mode, bool excl)
+	    struct dentry *dentry, umode_t mode)
 {
 	struct super_block *sb = dir->i_sb;
 	struct inode	*inode;
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 89d36e3e5c7999c2e448b78e86896d8893a8a7a9..09224aca8cad37ad273fd0c1ac292f0c15e078b5 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -32,7 +32,7 @@ static bool afs_lookup_one_filldir(struct dir_context *ctx, const char *name, in
 static bool afs_lookup_filldir(struct dir_context *ctx, const char *name, int nlen,
 			      loff_t fpos, u64 ino, unsigned dtype);
 static int afs_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl);
+		      struct dentry *dentry, umode_t mode);
 static struct dentry *afs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 				struct dentry *dentry, umode_t mode);
 static int afs_rmdir(struct inode *dir, struct dentry *dentry);
@@ -1637,7 +1637,7 @@ static const struct afs_operation_ops afs_create_operation = {
  * create a regular file on an AFS filesystem
  */
 static int afs_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	struct afs_operation *op;
 	struct afs_vnode *dvnode = AFS_FS_I(dir);
diff --git a/fs/bad_inode.c b/fs/bad_inode.c
index 0ef9bcb744dd620bf47caa024d97a1316ff7bc89..5701361cf98155a61cb75a4ec602e8fc615eb3ae 100644
--- a/fs/bad_inode.c
+++ b/fs/bad_inode.c
@@ -29,7 +29,7 @@ static const struct file_operations bad_file_ops =
 
 static int bad_inode_create(struct mnt_idmap *idmap,
 			    struct inode *dir, struct dentry *dentry,
-			    umode_t mode, bool excl)
+			    umode_t mode)
 {
 	return -EIO;
 }
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index c375e22c4c0c15ba27307d266adfe3f093b90ab8..6beb8605c523cc2c7250d7b1a61508e103f0f3fd 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -76,7 +76,7 @@ const struct file_operations bfs_dir_operations = {
 };
 
 static int bfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	int err;
 	struct inode *inode;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3b1b3a0553eea06229255ad0284d76074bdb958a..8e06baeabae594850607366ea4f4f0fa41e3b464 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6816,7 +6816,7 @@ static int btrfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int btrfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index d18c0eaef9b7e7be7eb517c701d6c4af08fd78ac..308903dc0780dbed2382228005d0221f185c61ee 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -976,7 +976,7 @@ static int ceph_mknod(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int ceph_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	return ceph_mknod(idmap, dir, dentry, mode, 0);
 }
diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index ca99900172657d80a479b2eb27f50effdf834995..554e7fd44e5df1aae6da2c41a492a02ae9e0d616 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -134,7 +134,7 @@ static inline void coda_dir_drop_nlink(struct inode *dir)
 
 /* creation routines: create, mknod, mkdir, link, symlink */
 static int coda_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *de, umode_t mode, bool excl)
+		       struct dentry *de, umode_t mode)
 {
 	int error;
 	const char *name=de->d_name.name;
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index ba15e7359dfa6e150b577205991010873a633511..9a1ba68b16f3d6c4551e2d75e1e27309159c062e 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -262,7 +262,7 @@ int ecryptfs_initialize_file(struct dentry *ecryptfs_dentry,
 static int
 ecryptfs_create(struct mnt_idmap *idmap,
 		struct inode *directory_inode, struct dentry *ecryptfs_dentry,
-		umode_t mode, bool excl)
+		umode_t mode)
 {
 	struct inode *ecryptfs_inode;
 	int rc;
diff --git a/fs/efivarfs/inode.c b/fs/efivarfs/inode.c
index 2891614abf8d554f563319187b6d54c2bc006a91..043b3e3a4f0adefe27855f8156b946c1dc4bd184 100644
--- a/fs/efivarfs/inode.c
+++ b/fs/efivarfs/inode.c
@@ -75,7 +75,7 @@ static bool efivarfs_valid_name(const char *str, int len)
 }
 
 static int efivarfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			   struct dentry *dentry, umode_t mode, bool excl)
+			   struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode = NULL;
 	struct efivar_entry *var;
diff --git a/fs/exfat/namei.c b/fs/exfat/namei.c
index 7eb9c67fd35f4c54e18061a948806f20455675cf..c272a522c571044fd0cdc7630be30bdcec2ab8e5 100644
--- a/fs/exfat/namei.c
+++ b/fs/exfat/namei.c
@@ -543,7 +543,7 @@ static int exfat_add_entry(struct inode *inode, const char *path,
 }
 
 static int exfat_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct super_block *sb = dir->i_sb;
 	struct inode *inode;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index bde617a66cecd4a2bf12a713a2297bb4fee45916..edea7784ad39acd4afffc7f5ae6e50a20c04999d 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -101,7 +101,7 @@ struct dentry *ext2_get_parent(struct dentry *child)
  */
 static int ext2_create (struct mnt_idmap * idmap,
 			struct inode * dir, struct dentry * dentry,
-			umode_t mode, bool excl)
+			umode_t mode)
 {
 	struct inode *inode;
 	int err;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 2cd36f59c9e363124ee949f742adccd88447295a..a1e77390a7ce300db02db9af90e45d69efabfea5 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2806,7 +2806,7 @@ static int ext4_add_nondir(handle_t *handle,
  * with d_instantiate().
  */
 static int ext4_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	handle_t *handle;
 	struct inode *inode;
diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index b882771e469971dcf4e7a42416f9fbb8a5d9bf39..9bcbb8b521501b22d0fe2238b7729c342e95baa4 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -351,7 +351,7 @@ static struct inode *f2fs_new_inode(struct mnt_idmap *idmap,
 }
 
 static int f2fs_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	struct f2fs_sb_info *sbi = F2FS_I_SB(dir);
 	struct inode *inode;
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index 0b920ee40a7f9fe3c57af5d939d3efedf001a3d9..905ffa9e5b99f1507734d99b7c16dcad21d7b5b5 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -262,7 +262,7 @@ static int msdos_add_entry(struct inode *dir, const unsigned char *name,
 
 /***** Create a file */
 static int msdos_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct super_block *sb = dir->i_sb;
 	struct inode *inode = NULL;
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 5dbc4cbb8fce3d9b891cbc597f876c2c7b8d6aa0..8396b1ec4ec582fcdfadbcb12b04694ef0b8c5fc 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -754,7 +754,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct dentry *dentry,
 }
 
 static int vfat_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	struct super_block *sb = dir->i_sb;
 	struct inode *inode;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 667774cc72a1d49796f531fcb342d2e4878beb85..b7a2cee9b18313f88e745c5bb406bcc72866e390 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -889,7 +889,7 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int fuse_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *entry, umode_t mode, bool excl)
+		       struct dentry *entry, umode_t mode)
 {
 	return fuse_mknod(idmap, dir, entry, mode, 0);
 }
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 8a7ed80d9f2d6e829b240629bdd18b5e0d30b5fc..b8e399dd1182b6ede0bcf1aa78bd7f9f2dca8b2b 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -942,15 +942,14 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
  * @dir: The directory in which to create the file
  * @dentry: The dentry of the new file
  * @mode: The mode of the new file
- * @excl: Force fail if inode exists
  *
  * Returns: errno
  */
 
 static int gfs2_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
-	return gfs2_create_inode(dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, excl);
+	return gfs2_create_inode(dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, 1);
 }
 
 /**
diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c
index 86a6b317b474a95f283f6a0908582efadde80892..c585942aa985686ca428d2d17f4401aa845a0eb8 100644
--- a/fs/hfs/dir.c
+++ b/fs/hfs/dir.c
@@ -190,7 +190,7 @@ static int hfs_dir_release(struct inode *inode, struct file *file)
  * the directory and the name (and its length) of the new file.
  */
 static int hfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 	int res;
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 1b3e27a0d5e038b559bd19b37d769078b2996d1b..c5ea04e078340a91b992095e189e978a3345f03c 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -518,7 +518,7 @@ static int hfsplus_mknod(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int hfsplus_create(struct mnt_idmap *idmap, struct inode *dir,
-			  struct dentry *dentry, umode_t mode, bool excl)
+			  struct dentry *dentry, umode_t mode)
 {
 	return hfsplus_mknod(&nop_mnt_idmap, dir, dentry, mode, 0);
 }
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 1e1acf5775ab5f6daf13bb917966d05f410d5ff5..18ca8cb9aa15e4015582ee5bd3db968c6b32de4b 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -593,7 +593,7 @@ static struct inode *hostfs_iget(struct super_block *sb, char *name)
 }
 
 static int hostfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			 struct dentry *dentry, umode_t mode, bool excl)
+			 struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 	char *name;
diff --git a/fs/hpfs/namei.c b/fs/hpfs/namei.c
index 353e13a615f56664638f08a3408f90a727f5458b..809113d8248d50c0eaa57047b6c4bd87b9a5c6be 100644
--- a/fs/hpfs/namei.c
+++ b/fs/hpfs/namei.c
@@ -129,7 +129,7 @@ static struct dentry *hpfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int hpfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	const unsigned char *name = dentry->d_name.name;
 	unsigned len = dentry->d_name.len;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 9c94ed8c3ab0028772b7afb5d03a91d280c38106..0fd0d73e450bdedd92b953b9dd00f6babe1246e7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1001,7 +1001,7 @@ static struct dentry *hugetlbfs_mkdir(struct mnt_idmap *idmap, struct inode *dir
 
 static int hugetlbfs_create(struct mnt_idmap *idmap,
 			    struct inode *dir, struct dentry *dentry,
-			    umode_t mode, bool excl)
+			    umode_t mode)
 {
 	return hugetlbfs_mknod(idmap, dir, dentry, mode | S_IFREG, 0);
 }
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index dd91f725ded69ccb3a240aafd72a4b552f21bcd9..e77c84e43621a8c53e9852843f18cc3514315650 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -25,7 +25,7 @@
 static int jffs2_readdir (struct file *, struct dir_context *);
 
 static int jffs2_create (struct mnt_idmap *, struct inode *,
-		         struct dentry *, umode_t, bool);
+			 struct dentry *, umode_t);
 static struct dentry *jffs2_lookup (struct inode *,struct dentry *,
 				    unsigned int);
 static int jffs2_link (struct dentry *,struct inode *,struct dentry *);
@@ -161,7 +161,7 @@ static int jffs2_readdir(struct file *file, struct dir_context *ctx)
 
 
 static int jffs2_create(struct mnt_idmap *idmap, struct inode *dir_i,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct jffs2_raw_inode *ri;
 	struct jffs2_inode_info *f, *dir_f;
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 65a218eba8faf9508f5727515b812f6de2661618..48111f8d3efe40becadd857c56c84ed09de867ef 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -60,7 +60,7 @@ static inline void free_ea_wmap(struct inode *inode)
  *
  */
 static int jfs_create(struct mnt_idmap *idmap, struct inode *dip,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	int rc = 0;
 	tid_t tid;		/* transaction id */
diff --git a/fs/minix/namei.c b/fs/minix/namei.c
index 8938536d8d3cf65c7e57f88f1819689365951fea..6540574f54781eab487074de7fe10ed38b1a8d1e 100644
--- a/fs/minix/namei.c
+++ b/fs/minix/namei.c
@@ -64,7 +64,7 @@ static int minix_tmpfile(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int minix_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	return minix_mknod(&nop_mnt_idmap, dir, dentry, mode, 0);
 }
diff --git a/fs/namei.c b/fs/namei.c
index d5ab28947b2b6c6e19c7bb4a9140ccec407dc07c..83da60fc298e523096e881b25c727d14f9553476 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3493,7 +3493,7 @@ int vfs_create(struct mnt_idmap *idmap, struct dentry *dentry, umode_t mode,
 	error = try_break_deleg(dir, di);
 	if (error)
 		return error;
-	error = dir->i_op->create(idmap, dir, dentry, mode, true);
+	error = dir->i_op->create(idmap, dir, dentry, mode);
 	if (!error)
 		fsnotify_create(dir, dentry);
 	return error;
@@ -3802,7 +3802,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 		}
 
 		error = dir_inode->i_op->create(idmap, dir_inode, dentry,
-						mode, open_flag & O_EXCL);
+						mode);
 		if (error)
 			goto out_dput;
 	}
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 46d9c65d50f83fc1dc73f3d7f5868b84132bb0fd..7fe18efcd37b08030c7a4e17832801abfc19a3bd 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -2377,9 +2377,9 @@ static int nfs_do_create(struct inode *dir, struct dentry *dentry,
 }
 
 int nfs_create(struct mnt_idmap *idmap, struct inode *dir,
-	       struct dentry *dentry, umode_t mode, bool excl)
+	       struct dentry *dentry, umode_t mode)
 {
-	return nfs_do_create(dir, dentry, mode, excl ? O_EXCL : 0);
+	return nfs_do_create(dir, dentry, mode, O_EXCL);
 }
 EXPORT_SYMBOL_GPL(nfs_create);
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 2ecd38e1d17a8053a9134702588d57efc35f49e9..b122c4f34f7b53c5102a8b5138efe269af433c81 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -398,7 +398,7 @@ extern unsigned long nfs_access_cache_scan(struct shrinker *shrink,
 struct dentry *nfs_lookup(struct inode *, struct dentry *, unsigned int);
 void nfs_d_prune_case_insensitive_aliases(struct inode *inode);
 int nfs_create(struct mnt_idmap *, struct inode *, struct dentry *,
-	       umode_t, bool);
+	       umode_t);
 struct dentry *nfs_mkdir(struct mnt_idmap *, struct inode *, struct dentry *,
 			 umode_t);
 int nfs_rmdir(struct inode *, struct dentry *);
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index 40f4b1a28705b6e0eb8f0978cf3ac18b43aa1331..31d1d466c03048aaaab23f64c3f413c095939770 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -86,7 +86,7 @@ nilfs_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
  * with d_instantiate().
  */
 static int nilfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 	struct nilfs_transaction_info ti;
diff --git a/fs/ntfs3/namei.c b/fs/ntfs3/namei.c
index 82c8ae56beee6d79046dd6c8f02ff0f35e9a1ad3..49fe635b550d3f51f81138649b47c9c831a73e3b 100644
--- a/fs/ntfs3/namei.c
+++ b/fs/ntfs3/namei.c
@@ -105,7 +105,7 @@ static struct dentry *ntfs_lookup(struct inode *dir, struct dentry *dentry,
  * ntfs_create - inode_operations::create
  */
 static int ntfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	return ntfs_create_inode(idmap, dir, dentry, NULL, S_IFREG | mode, 0,
 				 NULL, 0, NULL);
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index cccaa1d6fbbac13ebcaf14a9183277890708e643..bd4b2269598b49c6f88dd8d201e246ee5ed855a6 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -454,8 +454,7 @@ static struct dentry *dlmfs_mkdir(struct mnt_idmap * idmap,
 static int dlmfs_create(struct mnt_idmap *idmap,
 			struct inode *dir,
 			struct dentry *dentry,
-			umode_t mode,
-			bool excl)
+			umode_t mode)
 {
 	int status = 0;
 	struct inode *inode;
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index c90b254da75eb5b90d2af5e37d41e781efe8b836..7443f468f45657cf68779a02e4edf4e38fb70f59 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -666,8 +666,7 @@ static struct dentry *ocfs2_mkdir(struct mnt_idmap *idmap,
 static int ocfs2_create(struct mnt_idmap *idmap,
 			struct inode *dir,
 			struct dentry *dentry,
-			umode_t mode,
-			bool excl)
+			umode_t mode)
 {
 	int ret;
 
diff --git a/fs/omfs/dir.c b/fs/omfs/dir.c
index 2ed541fccf331d796805dd1594fbf05c1f7f3b9a..a09a98f7e30bc66deca60725f9462d081b5e4784 100644
--- a/fs/omfs/dir.c
+++ b/fs/omfs/dir.c
@@ -286,7 +286,7 @@ static struct dentry *omfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int omfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	return omfs_add_node(dir, dentry, mode | S_IFREG);
 }
diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c
index bec5475de094dada6bb29eaf8520a875880f3bab..0ebaa7f000f26f1c1ecffd22cfe4272f20a783ed 100644
--- a/fs/orangefs/namei.c
+++ b/fs/orangefs/namei.c
@@ -18,8 +18,7 @@
 static int orangefs_create(struct mnt_idmap *idmap,
 			struct inode *dir,
 			struct dentry *dentry,
-			umode_t mode,
-			bool exclusive)
+			umode_t mode)
 {
 	struct orangefs_inode_s *parent = ORANGEFS_I(dir);
 	struct orangefs_kernel_op_s *new_op;
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index a5e9ddf3023b3942fafb9adb2770f26780a1b86b..0f70b3835f4a08c29d6bba8ae9143df55895e56b 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -704,7 +704,7 @@ static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev,
 }
 
 static int ovl_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	return ovl_create_object(dentry, (mode & 07777) | S_IFREG, 0, NULL);
 }
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 41f9995da7cab0d11395cb40a98fb4936d52597f..b6502aaa4fb44d27c939da9fae4449af7edd28d4 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -129,7 +129,7 @@ static struct dentry *ramfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int ramfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	return ramfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFREG, 0);
 }
diff --git a/fs/smb/client/cifsfs.h b/fs/smb/client/cifsfs.h
index e9534258d1efd0bb34f36bf2c725c64d0a8ca8f4..294c66cea2eca3344e09cd77619761e9cb79a807 100644
--- a/fs/smb/client/cifsfs.h
+++ b/fs/smb/client/cifsfs.h
@@ -50,7 +50,7 @@ extern void cifs_sb_deactive(struct super_block *sb);
 extern const struct inode_operations cifs_dir_inode_ops;
 extern struct inode *cifs_root_iget(struct super_block *);
 extern int cifs_create(struct mnt_idmap *, struct inode *,
-		       struct dentry *, umode_t, bool excl);
+		       struct dentry *, umode_t);
 extern int cifs_atomic_open(struct inode *, struct dentry *,
 			    struct file *, unsigned, umode_t);
 extern struct dentry *cifs_lookup(struct inode *, struct dentry *,
diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c
index da5597dbf5b9f140c6801158ac2357fa911c52ab..b00bc214db9f0e9533f481f41ac99ac8937610ac 100644
--- a/fs/smb/client/dir.c
+++ b/fs/smb/client/dir.c
@@ -566,7 +566,7 @@ cifs_atomic_open(struct inode *inode, struct dentry *direntry,
 }
 
 int cifs_create(struct mnt_idmap *idmap, struct inode *inode,
-		struct dentry *direntry, umode_t mode, bool excl)
+		struct dentry *direntry, umode_t mode)
 {
 	int rc;
 	unsigned int xid = get_xid();
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 3c3d3ad4fa6cb719e9ec08fa2164c55371c017c1..4840a6f7974e254eba4ca249357e968764e326e0 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -303,7 +303,7 @@ static int ubifs_prepare_create(struct inode *dir, struct dentry *dentry,
 }
 
 static int ubifs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 	struct ubifs_info *c = dir->i_sb->s_fs_info;
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index 5f2e9a892bffa9579143cedf71d80efa7ad6e9fb..f83b5564cbc4c68c02c07bb3ab2109bfabdc799d 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -371,7 +371,7 @@ static int udf_add_nondir(struct dentry *dentry, struct inode *inode)
 }
 
 static int udf_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode = udf_new_inode(dir, mode);
 
diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
index 5b3c85c9324298f4ff6aa3d4feeb962ce5ede539..5012e056200aca671364d34a7faf647e6747e1d2 100644
--- a/fs/ufs/namei.c
+++ b/fs/ufs/namei.c
@@ -70,8 +70,7 @@ static struct dentry *ufs_lookup(struct inode * dir, struct dentry *dentry, unsi
  * with d_instantiate(). 
  */
 static int ufs_create (struct mnt_idmap * idmap,
-		struct inode * dir, struct dentry * dentry, umode_t mode,
-		bool excl)
+		struct inode * dir, struct dentry * dentry, umode_t mode)
 {
 	struct inode *inode;
 
diff --git a/fs/vboxsf/dir.c b/fs/vboxsf/dir.c
index 42bedc4ec7af7709c564a7174805d185ce86f854..9ce4310c891044db17b6af98c06e3130002a7dda 100644
--- a/fs/vboxsf/dir.c
+++ b/fs/vboxsf/dir.c
@@ -298,7 +298,7 @@ static int vboxsf_dir_create(struct inode *parent, struct dentry *dentry,
 
 static int vboxsf_dir_mkfile(struct mnt_idmap *idmap,
 			     struct inode *parent, struct dentry *dentry,
-			     umode_t mode, bool excl)
+			     umode_t mode)
 {
 	return vboxsf_dir_create(parent, dentry, mode, false, excl, NULL);
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index caff0125faeac093c1c05a722d3588e3f2e99926..2bc7faac35678b5b78acd6a50695a0d7b1c9a263 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -293,8 +293,7 @@ xfs_vn_create(
 	struct mnt_idmap	*idmap,
 	struct inode		*dir,
 	struct dentry		*dentry,
-	umode_t			mode,
-	bool			flags)
+	umode_t			mode)
 {
 	return xfs_generic_create(idmap, dir, dentry, mode, 0, NULL);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 64323e618724bc20dc101db13035b042f5f88e4d..b9a32e10078f5a1a0bbeb0d8913ac3e4b5b3a85d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2345,8 +2345,8 @@ struct inode_operations {
 
 	int (*readlink) (struct dentry *, char __user *,int);
 
-	int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,
-		       umode_t, bool);
+	int (*create) (struct mnt_idmap *, struct inode *, struct dentry *,
+		       umode_t);
 	int (*link) (struct dentry *,struct inode *,struct dentry *);
 	int (*unlink) (struct inode *,struct dentry *);
 	int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 093551fe66a7eb884fc34ef853a0ca92b95770af..9ae28c79fe0578bf96b2d22daed45b48aba0b946 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -610,7 +610,7 @@ static int mqueue_create_attr(struct dentry *dentry, umode_t mode, void *arg)
 }
 
 static int mqueue_create(struct mnt_idmap *idmap, struct inode *dir,
-			 struct dentry *dentry, umode_t mode, bool excl)
+			 struct dentry *dentry, umode_t mode)
 {
 	return mqueue_create_attr(dentry, mode, NULL);
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index b9081b817d28f3db1fbdd90ed3f04b6904d6ff18..8fdc9cbecb908e127f8173ca8888b5e038354fed 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3912,7 +3912,7 @@ static struct dentry *shmem_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int shmem_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	return shmem_mknod(idmap, dir, dentry, mode | S_IFREG, 0);
 }

---
base-commit: 76ddfe7d66d631e5e31ef4e5dd59797fa03acbf7
change-id: 20251105-create-excl-2b366d9bf3bb

Best regards,
-- 
Jeff Layton <jlayton at kernel.org>


From neilb at ownmail.net  Wed Nov  5 13:23:24 2025
From: neilb at ownmail.net (NeilBrown)
Date: Thu, 06 Nov 2025 08:23:24 +1100
Subject: [PATCH] vfs: remove the excl argument from the ->create() inode_operation
In-Reply-To: <20251105-create-excl-v1-1-a4cce035cc55@kernel.org>
References: <20251105-create-excl-v1-1-a4cce035cc55@kernel.org>
Message-ID: <176237780417.634289.15818324160940255011@noble.neil.brown.name>

On Thu, 06 Nov 2025, Jeff Layton wrote:
> Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
> the "excl" argument to the ->create() inode_operation is always set to
> true. Remove it, and fix up all of the create implementations.

nonono


> @@ -3802,7 +3802,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>  		}
>  
>  		error = dir_inode->i_op->create(idmap, dir_inode, dentry,
> -						mode, open_flag & O_EXCL);
> +						mode);

"open_flag & O_EXCL" is not the same as "true".

It is true that "all calls to vfs_create() pass true for 'excl'"
The same is NOT true for inode_operations.create.

NeilBrown


From jlayton at kernel.org  Thu Nov  6 04:07:48 2025
From: jlayton at kernel.org (Jeff Layton)
Date: Thu, 06 Nov 2025 07:07:48 -0500
Subject: [PATCH] vfs: remove the excl argument from the ->create()
 inode_operation
In-Reply-To: <176237780417.634289.15818324160940255011@noble.neil.brown.name>
References: <20251105-create-excl-v1-1-a4cce035cc55@kernel.org>
	 <176237780417.634289.15818324160940255011@noble.neil.brown.name>
Message-ID: <6758176514cdd6e2ceacb3bd0e4d63fb8784b7c6.camel@kernel.org>

On Thu, 2025-11-06 at 08:23 +1100, NeilBrown wrote:
> On Thu, 06 Nov 2025, Jeff Layton wrote:
> > Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
> > the "excl" argument to the ->create() inode_operation is always set to
> > true. Remove it, and fix up all of the create implementations.
> 
> nonono
> 
> 
> > @@ -3802,7 +3802,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> >  		}
> >  
> >  		error = dir_inode->i_op->create(idmap, dir_inode, dentry,
> > -						mode, open_flag & O_EXCL);
> > +						mode);
> 
> "open_flag & O_EXCL" is not the same as "true".
> 
> It is true that "all calls to vfs_create() pass true for 'excl'"
> The same is NOT true for inode_operations.create.
> 

I don't think this is a problem, actually:

Almost all of the existing ->create() operations ignore the "excl"
bool. There are only two that I found that do not: NFS and GFS2. Both
of those have an ->atomic_open() operation though, so lookup_open()
will never call ->create() for those filesystems. This means that
->create() _is_ always called with excl == true.

-- 
Jeff Layton <jlayton at kernel.org>


From jlayton at kernel.org  Thu Nov  6 10:01:20 2025
From: jlayton at kernel.org (Jeff Layton)
Date: Thu, 06 Nov 2025 13:01:20 -0500
Subject: [PATCH] vfs: remove the excl argument from the ->create()
 inode_operation
In-Reply-To: <6758176514cdd6e2ceacb3bd0e4d63fb8784b7c6.camel@kernel.org>
References: <20251105-create-excl-v1-1-a4cce035cc55@kernel.org>
		 <176237780417.634289.15818324160940255011@noble.neil.brown.name>
	 <6758176514cdd6e2ceacb3bd0e4d63fb8784b7c6.camel@kernel.org>
Message-ID: <f5927a9bb985b9ad241bc5f9fc32acfd35340222.camel@kernel.org>

On Thu, 2025-11-06 at 07:07 -0500, Jeff Layton wrote:
> On Thu, 2025-11-06 at 08:23 +1100, NeilBrown wrote:
> > On Thu, 06 Nov 2025, Jeff Layton wrote:
> > > Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
> > > the "excl" argument to the ->create() inode_operation is always set to
> > > true. Remove it, and fix up all of the create implementations.
> > 
> > nonono
> > 
> > 
> > > @@ -3802,7 +3802,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> > >  		}
> > >  
> > >  		error = dir_inode->i_op->create(idmap, dir_inode, dentry,
> > > -						mode, open_flag & O_EXCL);
> > > +						mode);
> > 
> > "open_flag & O_EXCL" is not the same as "true".
> > 
> > It is true that "all calls to vfs_create() pass true for 'excl'"
> > The same is NOT true for inode_operations.create.
> > 
> 
> I don't think this is a problem, actually:
> 
> Almost all of the existing ->create() operations ignore the "excl"
> bool. There are only two that I found that do not: NFS and GFS2. Both
> of those have an ->atomic_open() operation though, so lookup_open()
> will never call ->create() for those filesystems. This means that
> ->create() _is_ always called with excl == true.

How about this for a revised changelog, which makes the above clear:

    vfs: remove the excl argument from the ->create() inode_operation
    
    Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
    the "excl" argument to the ->create() inode_operation is always set to
    true in vfs_create().
    
    There is another call to ->create() in lookup_open() that can set it to
    either true or false. All of the ->create() operations in the kernel
    ignore the excl argument, except for NFS and GFS2. Both NFS and GFS2
    have an ->atomic_open() operation, however, so lookup_open() will never
    call ->create() on those filesystems.
    
    Remove the "excl" argument from the ->create() operation, and fix up the
    filesystems accordingly.
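
The argument in the changelog above can be sketched as a toy user-space
model. Everything here (struct toy_fs, toy_lookup_open_path(),
toy_excl_can_matter(), and the three example entries) is hypothetical and
only mirrors the reasoning; it is not kernel code and makes no claim about
how lookup_open() is actually written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy user-space model; these are NOT the kernel's types or functions. */
struct toy_fs {
	const char *name;
	bool has_atomic_open;	/* models a non-NULL ->atomic_open() */
	bool create_reads_excl;	/* models a ->create() that looks at "excl" */
};

enum toy_path { TOY_ATOMIC_OPEN, TOY_CREATE };

/* Models the choice lookup_open() makes when it has to create a file. */
static enum toy_path toy_lookup_open_path(const struct toy_fs *fs)
{
	assert(fs != NULL);
	return fs->has_atomic_open ? TOY_ATOMIC_OPEN : TOY_CREATE;
}

/*
 * "excl" can only change behaviour if lookup_open() reaches ->create()
 * AND that ->create() actually reads the flag.
 */
static bool toy_excl_can_matter(const struct toy_fs *fs)
{
	return toy_lookup_open_path(fs) == TOY_CREATE && fs->create_reads_excl;
}

/* The classes of filesystem described in the changelog (assumed values). */
static const struct toy_fs toy_nfs_like  = { "nfs-like",  true,  true  };
static const struct toy_fs toy_gfs2_like = { "gfs2-like", true,  true  };
static const struct toy_fs toy_ext4_like = { "ext4-like", false, false };
```

In this model no entry makes toy_excl_can_matter() return true: the two
that read the flag never reach the ->create() path, and the one that
reaches it ignores the flag.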

Maybe we also need some comments or updates to Documentation/ to make
it clear that ->create() always implies O_EXCL semantics?
-- 
Jeff Layton <jlayton at kernel.org>


From neilb at ownmail.net  Thu Nov  6 16:00:34 2025
From: neilb at ownmail.net (NeilBrown)
Date: Fri, 07 Nov 2025 11:00:34 +1100
Subject: [PATCH] vfs: remove the excl argument from the ->create() inode_operation
In-Reply-To: <f5927a9bb985b9ad241bc5f9fc32acfd35340222.camel@kernel.org>
References: <20251105-create-excl-v1-1-a4cce035cc55@kernel.org>,
 <176237780417.634289.15818324160940255011@noble.neil.brown.name>,
 <6758176514cdd6e2ceacb3bd0e4d63fb8784b7c6.camel@kernel.org>,
 <f5927a9bb985b9ad241bc5f9fc32acfd35340222.camel@kernel.org>
Message-ID: <176247363419.634289.473957828516111884@noble.neil.brown.name>

On Fri, 07 Nov 2025, Jeff Layton wrote:
> On Thu, 2025-11-06 at 07:07 -0500, Jeff Layton wrote:
> > On Thu, 2025-11-06 at 08:23 +1100, NeilBrown wrote:
> > > On Thu, 06 Nov 2025, Jeff Layton wrote:
> > > > Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
> > > > the "excl" argument to the ->create() inode_operation is always set to
> > > > true. Remove it, and fix up all of the create implementations.
> > > 
> > > nonono
> > > 
> > > 
> > > > @@ -3802,7 +3802,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> > > >  		}
> > > >  
> > > >  		error = dir_inode->i_op->create(idmap, dir_inode, dentry,
> > > > -						mode, open_flag & O_EXCL);
> > > > +						mode);
> > > 
> > > "open_flag & O_EXCL" is not the same as "true".
> > > 
> > > It is true that "all calls to vfs_create() pass true for 'excl'"
> > > The same is NOT true for inode_operations.create.
> > > 
> > 
> > I don't think this is a problem, actually:
> > 
> > Almost all of the existing ->create() operations ignore the "excl"
> > bool. There are only two that I found that do not: NFS and GFS2. Both
> > of those have an ->atomic_open() operation though, so lookup_open()
> > will never call ->create() for those filesystems. This means that
> > ->create() _is_ always called with excl == true.
> 
> How about this for a revised changelog, which makes the above clear:
> 
>     vfs: remove the excl argument from the ->create() inode_operation
>     
>     Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
>     the "excl" argument to the ->create() inode_operation is always set to
>     true in vfs_create().
>     
>     There is another call to ->create() in lookup_open() that can set it to
>     either true or false. All of the ->create() operations in the kernel
>     ignore the excl argument, except for NFS and GFS2. Both NFS and GFS2
>     have an ->atomic_open() operation, however, so lookup_open() will never
>     call ->create() on those filesystems.
>     
>     Remove the "excl" argument from the ->create() operation, and fix up the
>     filesystems accordingly.

Thanks, that is a substantial improvement.  I see your point now and I
think this is a really nice cleanup to make - thanks.

I think the commit message could be improved further by leading with the
detail that is central - that most ->create functions ignore 'excl'.

 With two exceptions, ->create() methods provided by filesystems ignore
 the "excl" flag.  Those exceptions are NFS and GFS2 which both also
 provide ->atomic_open.

 excl is always true when ->create is called from vfs_create() (since
 commit......) so the only time it can be false is when it is called by
 lookup_open() for filesystems that do not provide ->atomic_open.

 So the excl flag to ->create is either ignored or true.  So we can
 remove it and change NFS and GFS2 to act as though it were true.

> 
> Maybe we also need some comments or updates to Documentation/ to make
> it clear that ->create() always implies O_EXCL semantics?

Definitely, something in porting.rst and something in vfs.rst.

It would be worth saying somewhere that if the fs needs to mediate
non-exclusive creation, it must provide atomic_open().
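
A rough user-space sketch of that split, assuming a toy "backing store"
that is just a list of names. All identifiers (struct toy_dir,
toy_create(), toy_atomic_open(), TOY_O_EXCL) are made up for
illustration and do not correspond to any real filesystem's code:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define TOY_O_EXCL 0x1	/* stand-in for O_EXCL; value is arbitrary */
#define TOY_MAX_FILES 8

/* Toy "backing store": a flat list of names. Not a kernel structure. */
struct toy_dir {
	const char *names[TOY_MAX_FILES];
	int count;
};

static bool toy_exists(const struct toy_dir *dir, const char *name)
{
	for (int i = 0; i < dir->count; i++)
		if (strcmp(dir->names[i], name) == 0)
			return true;
	return false;
}

/* Models ->create() without "excl": always exclusive. */
static int toy_create(struct toy_dir *dir, const char *name)
{
	assert(dir != NULL && name != NULL);
	if (toy_exists(dir, name))
		return -EEXIST;
	if (dir->count == TOY_MAX_FILES)
		return -ENOSPC;
	dir->names[dir->count++] = name;
	return 0;
}

/*
 * Models ->atomic_open(): the one place that may see a non-exclusive
 * create, so it tolerates a file that appeared behind the kernel's back.
 * *created reports whether this call made the file.
 */
static int toy_atomic_open(struct toy_dir *dir, const char *name,
			   int flags, bool *created)
{
	int err = toy_create(dir, name);

	*created = (err == 0);
	if (err == -EEXIST && !(flags & TOY_O_EXCL))
		return 0;	/* open the existing file instead */
	return err;
}
```

The point of the sketch is only that the exclusive/non-exclusive
decision lives in one place: toy_create() never has to ask, and
toy_atomic_open() is where an unexpected -EEXIST gets mediated.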

Thanks,
NeilBrown


> -- 
> Jeff Layton <jlayton at kernel.org>
> 


From jlayton at kernel.org  Fri Nov  7 07:05:03 2025
From: jlayton at kernel.org (Jeff Layton)
Date: Fri, 07 Nov 2025 10:05:03 -0500
Subject: [PATCH v2] vfs: remove the excl argument from the ->create()
 inode_operation
Message-ID: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org>

With two exceptions, ->create() methods provided by filesystems ignore
the "excl" flag.  Those exceptions are NFS and GFS2 which both also
provide ->atomic_open.

Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
the "excl" argument to the ->create() inode_operation is always set to
true in vfs_create(). The ->create() call in lookup_open() sets it
according to the O_EXCL open flag, but is never called if the filesystem
provides ->atomic_open().

The excl flag is therefore always either ignored or true.  Remove it,
and change NFS and GFS2 to act as if it were always true.

Signed-off-by: Jeff Layton <jlayton at kernel.org>
---
Note that this is based on top of the dir delegation series [1]. LMK
if the Documentation/ updates are too wordy.

Full disclosure: I did use Claude code to generate the first
approximation of this patch, but I had to fix a number of things that it
missed.  I probably could have given it better prompts. In any case, I'm
not sure how to properly attribute this (or if I even need to).

[1]: https://lore.kernel.org/linux-nfs/20251105-dir-deleg-ro-v5-0-7ebc168a88ac at kernel.org/
---
Changes in v2:
- better describe why the argument isn't needed in the changelog
- updates to Documentation/
- Link to v1: https://lore.kernel.org/r/20251105-create-excl-v1-1-a4cce035cc55 at kernel.org
---
 Documentation/filesystems/porting.rst | 12 ++++++++++++
 Documentation/filesystems/vfs.rst     | 13 ++++++++++---
 fs/9p/vfs_inode.c                     |  2 +-
 fs/9p/vfs_inode_dotl.c                |  2 +-
 fs/affs/affs.h                        |  2 +-
 fs/affs/namei.c                       |  2 +-
 fs/afs/dir.c                          |  4 ++--
 fs/bad_inode.c                        |  2 +-
 fs/bfs/dir.c                          |  2 +-
 fs/btrfs/inode.c                      |  2 +-
 fs/ceph/dir.c                         |  2 +-
 fs/coda/dir.c                         |  2 +-
 fs/ecryptfs/inode.c                   |  2 +-
 fs/efivarfs/inode.c                   |  2 +-
 fs/exfat/namei.c                      |  2 +-
 fs/ext2/namei.c                       |  2 +-
 fs/ext4/namei.c                       |  2 +-
 fs/f2fs/namei.c                       |  2 +-
 fs/fat/namei_msdos.c                  |  2 +-
 fs/fat/namei_vfat.c                   |  2 +-
 fs/fuse/dir.c                         |  2 +-
 fs/gfs2/inode.c                       |  5 ++---
 fs/hfs/dir.c                          |  2 +-
 fs/hfsplus/dir.c                      |  2 +-
 fs/hostfs/hostfs_kern.c               |  2 +-
 fs/hpfs/namei.c                       |  2 +-
 fs/hugetlbfs/inode.c                  |  2 +-
 fs/jffs2/dir.c                        |  4 ++--
 fs/jfs/namei.c                        |  2 +-
 fs/minix/namei.c                      |  2 +-
 fs/namei.c                            |  4 ++--
 fs/nfs/dir.c                          |  4 ++--
 fs/nfs/internal.h                     |  2 +-
 fs/nilfs2/namei.c                     |  2 +-
 fs/ntfs3/namei.c                      |  2 +-
 fs/ocfs2/dlmfs/dlmfs.c                |  3 +--
 fs/ocfs2/namei.c                      |  3 +--
 fs/omfs/dir.c                         |  2 +-
 fs/orangefs/namei.c                   |  3 +--
 fs/overlayfs/dir.c                    |  2 +-
 fs/ramfs/inode.c                      |  2 +-
 fs/smb/client/cifsfs.h                |  2 +-
 fs/smb/client/dir.c                   |  2 +-
 fs/ubifs/dir.c                        |  2 +-
 fs/udf/namei.c                        |  2 +-
 fs/ufs/namei.c                        |  3 +--
 fs/vboxsf/dir.c                       |  2 +-
 fs/xfs/xfs_iops.c                     |  3 +--
 include/linux/fs.h                    |  4 ++--
 ipc/mqueue.c                          |  2 +-
 mm/shmem.c                            |  2 +-
 51 files changed, 77 insertions(+), 64 deletions(-)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 7233b04668fcce75f1ed170329a2cd18110a7d89..d71a3f5c626e578f0370986975ca50292c8e15c3 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1309,3 +1309,15 @@ a different length, use
 	vfs_parse_fs_qstr(fc, key, &QSTR_LEN(value, len))
 
 instead.
+
+---
+
+**mandatory**
+
+The ->create() operation has dropped the bool "excl" argument. This operation
+should now always provide O_EXCL semantics (i.e. fail with -EEXIST if the file
+exists). If the filesystem needs to handle the case where another entity could
+create the file on the backing store after a negative lookup or revalidate
+(e.g. it's a network filesystem and another client could create the file after
+a negative lookup), then it will require ->atomic_open() in addition to
+->create().
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 4f13b01e42eb5e2ad9d60cbbce7e47d09ad831e6..7a55e491e0c87a0d18909bd181754d6d68318059 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -467,7 +467,7 @@ As of kernel 2.6.22, the following members are defined:
 .. code-block:: c
 
 	struct inode_operations {
-		int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool);
+		int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t);
 		struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
 		int (*link) (struct dentry *,struct inode *,struct dentry *);
 		int (*unlink) (struct inode *,struct dentry *);
@@ -505,7 +505,10 @@ otherwise noted.
 	if you want to support regular files.  The dentry you get should
 	not have an inode (i.e. it should be a negative dentry).  Here
 	you will probably call d_instantiate() with the dentry and the
-	newly created inode
+	newly created inode. This operation should always provide O_EXCL
+	semantics (i.e. it should fail with -EEXIST if the file exists).
+	If the filesystem needs to mediate non-exclusive creation,
+	then the filesystem must also provide an ->atomic_open() operation.
 
 ``lookup``
 	called when the VFS needs to look up an inode in a parent
@@ -654,7 +657,11 @@ otherwise noted.
 	handled by f_op->open().  If the file was created, FMODE_CREATED
 	flag should be set in file->f_mode.  In case of O_EXCL the
 	method must only succeed if the file didn't exist and hence
-	FMODE_CREATED shall always be set on success.
+	FMODE_CREATED shall always be set on success. This method is
+	usually needed on filesystems where the dentry to be created could
+	unexpectedly become positive after the kernel has looked it up or
+	revalidated it (e.g. another host racing in and creating the file
+	on an NFS server).
 
 ``tmpfile``
 	called in the end of O_TMPFILE open().  Optional, equivalent to
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 69f378a837753e934c20b599660f8a756127e40a..595244d57cba62869b9af8b909af67d3c61e7f6c 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -643,7 +643,7 @@ v9fs_create(struct v9fs_session_info *v9ses, struct inode *dir,
 
 static int
 v9fs_vfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		struct dentry *dentry, umode_t mode, bool excl)
+		struct dentry *dentry, umode_t mode)
 {
 	struct v9fs_session_info *v9ses = v9fs_inode2v9ses(dir);
 	u32 perm = unixmode2p9mode(v9ses, mode);
diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
index 0b404e8484d22e2cbe60d846e0fa653001cdc4b1..de8fe9954d433c9b14ff5dd72ba13c3d5a67ebe7 100644
--- a/fs/9p/vfs_inode_dotl.c
+++ b/fs/9p/vfs_inode_dotl.c
@@ -218,7 +218,7 @@ int v9fs_open_to_dotl_flags(int flags)
  */
 static int
 v9fs_vfs_create_dotl(struct mnt_idmap *idmap, struct inode *dir,
-		     struct dentry *dentry, umode_t omode, bool excl)
+		     struct dentry *dentry, umode_t omode)
 {
 	return v9fs_vfs_mknod_dotl(idmap, dir, dentry, omode, 0);
 }
diff --git a/fs/affs/affs.h b/fs/affs/affs.h
index ac4e9a02910b72d63c8ec5291347b54518e67f4b..665be23c42cfa206dc0a2c9ffa119b7c3c747389 100644
--- a/fs/affs/affs.h
+++ b/fs/affs/affs.h
@@ -167,7 +167,7 @@ extern int	affs_hash_name(struct super_block *sb, const u8 *name, unsigned int l
 extern struct dentry *affs_lookup(struct inode *dir, struct dentry *dentry, unsigned int);
 extern int	affs_unlink(struct inode *dir, struct dentry *dentry);
 extern int	affs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool);
+			struct dentry *dentry, umode_t mode);
 extern struct dentry *affs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 			struct dentry *dentry, umode_t mode);
 extern int	affs_rmdir(struct inode *dir, struct dentry *dentry);
diff --git a/fs/affs/namei.c b/fs/affs/namei.c
index f883be50db122d3b09f0ae4d24618bd49b55186b..5591e1b5a2f68fc7600115e241f01f81d3aac010 100644
--- a/fs/affs/namei.c
+++ b/fs/affs/namei.c
@@ -243,7 +243,7 @@ affs_unlink(struct inode *dir, struct dentry *dentry)
 
 int
 affs_create(struct mnt_idmap *idmap, struct inode *dir,
-	    struct dentry *dentry, umode_t mode, bool excl)
+	    struct dentry *dentry, umode_t mode)
 {
 	struct super_block *sb = dir->i_sb;
 	struct inode	*inode;
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 89d36e3e5c7999c2e448b78e86896d8893a8a7a9..09224aca8cad37ad273fd0c1ac292f0c15e078b5 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -32,7 +32,7 @@ static bool afs_lookup_one_filldir(struct dir_context *ctx, const char *name, in
 static bool afs_lookup_filldir(struct dir_context *ctx, const char *name, int nlen,
 			      loff_t fpos, u64 ino, unsigned dtype);
 static int afs_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl);
+		      struct dentry *dentry, umode_t mode);
 static struct dentry *afs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 				struct dentry *dentry, umode_t mode);
 static int afs_rmdir(struct inode *dir, struct dentry *dentry);
@@ -1637,7 +1637,7 @@ static const struct afs_operation_ops afs_create_operation = {
  * create a regular file on an AFS filesystem
  */
 static int afs_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	struct afs_operation *op;
 	struct afs_vnode *dvnode = AFS_FS_I(dir);
diff --git a/fs/bad_inode.c b/fs/bad_inode.c
index 0ef9bcb744dd620bf47caa024d97a1316ff7bc89..5701361cf98155a61cb75a4ec602e8fc615eb3ae 100644
--- a/fs/bad_inode.c
+++ b/fs/bad_inode.c
@@ -29,7 +29,7 @@ static const struct file_operations bad_file_ops =
 
 static int bad_inode_create(struct mnt_idmap *idmap,
 			    struct inode *dir, struct dentry *dentry,
-			    umode_t mode, bool excl)
+			    umode_t mode)
 {
 	return -EIO;
 }
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index c375e22c4c0c15ba27307d266adfe3f093b90ab8..6beb8605c523cc2c7250d7b1a61508e103f0f3fd 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -76,7 +76,7 @@ const struct file_operations bfs_dir_operations = {
 };
 
 static int bfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	int err;
 	struct inode *inode;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3b1b3a0553eea06229255ad0284d76074bdb958a..8e06baeabae594850607366ea4f4f0fa41e3b464 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6816,7 +6816,7 @@ static int btrfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int btrfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index d18c0eaef9b7e7be7eb517c701d6c4af08fd78ac..308903dc0780dbed2382228005d0221f185c61ee 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -976,7 +976,7 @@ static int ceph_mknod(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int ceph_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	return ceph_mknod(idmap, dir, dentry, mode, 0);
 }
diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index ca99900172657d80a479b2eb27f50effdf834995..554e7fd44e5df1aae6da2c41a492a02ae9e0d616 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -134,7 +134,7 @@ static inline void coda_dir_drop_nlink(struct inode *dir)
 
 /* creation routines: create, mknod, mkdir, link, symlink */
 static int coda_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *de, umode_t mode, bool excl)
+		       struct dentry *de, umode_t mode)
 {
 	int error;
 	const char *name=de->d_name.name;
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index ba15e7359dfa6e150b577205991010873a633511..9a1ba68b16f3d6c4551e2d75e1e27309159c062e 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -262,7 +262,7 @@ int ecryptfs_initialize_file(struct dentry *ecryptfs_dentry,
 static int
 ecryptfs_create(struct mnt_idmap *idmap,
 		struct inode *directory_inode, struct dentry *ecryptfs_dentry,
-		umode_t mode, bool excl)
+		umode_t mode)
 {
 	struct inode *ecryptfs_inode;
 	int rc;
diff --git a/fs/efivarfs/inode.c b/fs/efivarfs/inode.c
index 2891614abf8d554f563319187b6d54c2bc006a91..043b3e3a4f0adefe27855f8156b946c1dc4bd184 100644
--- a/fs/efivarfs/inode.c
+++ b/fs/efivarfs/inode.c
@@ -75,7 +75,7 @@ static bool efivarfs_valid_name(const char *str, int len)
 }
 
 static int efivarfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			   struct dentry *dentry, umode_t mode, bool excl)
+			   struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode = NULL;
 	struct efivar_entry *var;
diff --git a/fs/exfat/namei.c b/fs/exfat/namei.c
index 7eb9c67fd35f4c54e18061a948806f20455675cf..c272a522c571044fd0cdc7630be30bdcec2ab8e5 100644
--- a/fs/exfat/namei.c
+++ b/fs/exfat/namei.c
@@ -543,7 +543,7 @@ static int exfat_add_entry(struct inode *inode, const char *path,
 }
 
 static int exfat_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct super_block *sb = dir->i_sb;
 	struct inode *inode;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index bde617a66cecd4a2bf12a713a2297bb4fee45916..edea7784ad39acd4afffc7f5ae6e50a20c04999d 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -101,7 +101,7 @@ struct dentry *ext2_get_parent(struct dentry *child)
  */
 static int ext2_create (struct mnt_idmap * idmap,
 			struct inode * dir, struct dentry * dentry,
-			umode_t mode, bool excl)
+			umode_t mode)
 {
 	struct inode *inode;
 	int err;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 2cd36f59c9e363124ee949f742adccd88447295a..a1e77390a7ce300db02db9af90e45d69efabfea5 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2806,7 +2806,7 @@ static int ext4_add_nondir(handle_t *handle,
  * with d_instantiate().
  */
 static int ext4_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	handle_t *handle;
 	struct inode *inode;
diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index b882771e469971dcf4e7a42416f9fbb8a5d9bf39..9bcbb8b521501b22d0fe2238b7729c342e95baa4 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -351,7 +351,7 @@ static struct inode *f2fs_new_inode(struct mnt_idmap *idmap,
 }
 
 static int f2fs_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	struct f2fs_sb_info *sbi = F2FS_I_SB(dir);
 	struct inode *inode;
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index 0b920ee40a7f9fe3c57af5d939d3efedf001a3d9..905ffa9e5b99f1507734d99b7c16dcad21d7b5b5 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -262,7 +262,7 @@ static int msdos_add_entry(struct inode *dir, const unsigned char *name,
 
 /***** Create a file */
 static int msdos_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct super_block *sb = dir->i_sb;
 	struct inode *inode = NULL;
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 5dbc4cbb8fce3d9b891cbc597f876c2c7b8d6aa0..8396b1ec4ec582fcdfadbcb12b04694ef0b8c5fc 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -754,7 +754,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct dentry *dentry,
 }
 
 static int vfat_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	struct super_block *sb = dir->i_sb;
 	struct inode *inode;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 667774cc72a1d49796f531fcb342d2e4878beb85..b7a2cee9b18313f88e745c5bb406bcc72866e390 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -889,7 +889,7 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int fuse_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *entry, umode_t mode, bool excl)
+		       struct dentry *entry, umode_t mode)
 {
 	return fuse_mknod(idmap, dir, entry, mode, 0);
 }
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 8a7ed80d9f2d6e829b240629bdd18b5e0d30b5fc..b8e399dd1182b6ede0bcf1aa78bd7f9f2dca8b2b 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -942,15 +942,14 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
  * @dir: The directory in which to create the file
  * @dentry: The dentry of the new file
  * @mode: The mode of the new file
- * @excl: Force fail if inode exists
  *
  * Returns: errno
  */
 
 static int gfs2_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
-	return gfs2_create_inode(dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, excl);
+	return gfs2_create_inode(dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, 1);
 }
 
 /**
diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c
index 86a6b317b474a95f283f6a0908582efadde80892..c585942aa985686ca428d2d17f4401aa845a0eb8 100644
--- a/fs/hfs/dir.c
+++ b/fs/hfs/dir.c
@@ -190,7 +190,7 @@ static int hfs_dir_release(struct inode *inode, struct file *file)
  * the directory and the name (and its length) of the new file.
  */
 static int hfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 	int res;
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 1b3e27a0d5e038b559bd19b37d769078b2996d1b..c5ea04e078340a91b992095e189e978a3345f03c 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -518,7 +518,7 @@ static int hfsplus_mknod(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int hfsplus_create(struct mnt_idmap *idmap, struct inode *dir,
-			  struct dentry *dentry, umode_t mode, bool excl)
+			  struct dentry *dentry, umode_t mode)
 {
 	return hfsplus_mknod(&nop_mnt_idmap, dir, dentry, mode, 0);
 }
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 1e1acf5775ab5f6daf13bb917966d05f410d5ff5..18ca8cb9aa15e4015582ee5bd3db968c6b32de4b 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -593,7 +593,7 @@ static struct inode *hostfs_iget(struct super_block *sb, char *name)
 }
 
 static int hostfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			 struct dentry *dentry, umode_t mode, bool excl)
+			 struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 	char *name;
diff --git a/fs/hpfs/namei.c b/fs/hpfs/namei.c
index 353e13a615f56664638f08a3408f90a727f5458b..809113d8248d50c0eaa57047b6c4bd87b9a5c6be 100644
--- a/fs/hpfs/namei.c
+++ b/fs/hpfs/namei.c
@@ -129,7 +129,7 @@ static struct dentry *hpfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int hpfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	const unsigned char *name = dentry->d_name.name;
 	unsigned len = dentry->d_name.len;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 9c94ed8c3ab0028772b7afb5d03a91d280c38106..0fd0d73e450bdedd92b953b9dd00f6babe1246e7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1001,7 +1001,7 @@ static struct dentry *hugetlbfs_mkdir(struct mnt_idmap *idmap, struct inode *dir
 
 static int hugetlbfs_create(struct mnt_idmap *idmap,
 			    struct inode *dir, struct dentry *dentry,
-			    umode_t mode, bool excl)
+			    umode_t mode)
 {
 	return hugetlbfs_mknod(idmap, dir, dentry, mode | S_IFREG, 0);
 }
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index dd91f725ded69ccb3a240aafd72a4b552f21bcd9..e77c84e43621a8c53e9852843f18cc3514315650 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -25,7 +25,7 @@
 static int jffs2_readdir (struct file *, struct dir_context *);
 
 static int jffs2_create (struct mnt_idmap *, struct inode *,
-		         struct dentry *, umode_t, bool);
+			 struct dentry *, umode_t);
 static struct dentry *jffs2_lookup (struct inode *,struct dentry *,
 				    unsigned int);
 static int jffs2_link (struct dentry *,struct inode *,struct dentry *);
@@ -161,7 +161,7 @@ static int jffs2_readdir(struct file *file, struct dir_context *ctx)
 
 
 static int jffs2_create(struct mnt_idmap *idmap, struct inode *dir_i,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct jffs2_raw_inode *ri;
 	struct jffs2_inode_info *f, *dir_f;
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 65a218eba8faf9508f5727515b812f6de2661618..48111f8d3efe40becadd857c56c84ed09de867ef 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -60,7 +60,7 @@ static inline void free_ea_wmap(struct inode *inode)
  *
  */
 static int jfs_create(struct mnt_idmap *idmap, struct inode *dip,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	int rc = 0;
 	tid_t tid;		/* transaction id */
diff --git a/fs/minix/namei.c b/fs/minix/namei.c
index 8938536d8d3cf65c7e57f88f1819689365951fea..6540574f54781eab487074de7fe10ed38b1a8d1e 100644
--- a/fs/minix/namei.c
+++ b/fs/minix/namei.c
@@ -64,7 +64,7 @@ static int minix_tmpfile(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int minix_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	return minix_mknod(&nop_mnt_idmap, dir, dentry, mode, 0);
 }
diff --git a/fs/namei.c b/fs/namei.c
index d5ab28947b2b6c6e19c7bb4a9140ccec407dc07c..83da60fc298e523096e881b25c727d14f9553476 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3493,7 +3493,7 @@ int vfs_create(struct mnt_idmap *idmap, struct dentry *dentry, umode_t mode,
 	error = try_break_deleg(dir, di);
 	if (error)
 		return error;
-	error = dir->i_op->create(idmap, dir, dentry, mode, true);
+	error = dir->i_op->create(idmap, dir, dentry, mode);
 	if (!error)
 		fsnotify_create(dir, dentry);
 	return error;
@@ -3802,7 +3802,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 		}
 
 		error = dir_inode->i_op->create(idmap, dir_inode, dentry,
-						mode, open_flag & O_EXCL);
+						mode);
 		if (error)
 			goto out_dput;
 	}
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 46d9c65d50f83fc1dc73f3d7f5868b84132bb0fd..7fe18efcd37b08030c7a4e17832801abfc19a3bd 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -2377,9 +2377,9 @@ static int nfs_do_create(struct inode *dir, struct dentry *dentry,
 }
 
 int nfs_create(struct mnt_idmap *idmap, struct inode *dir,
-	       struct dentry *dentry, umode_t mode, bool excl)
+	       struct dentry *dentry, umode_t mode)
 {
-	return nfs_do_create(dir, dentry, mode, excl ? O_EXCL : 0);
+	return nfs_do_create(dir, dentry, mode, O_EXCL);
 }
 EXPORT_SYMBOL_GPL(nfs_create);
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 2ecd38e1d17a8053a9134702588d57efc35f49e9..b122c4f34f7b53c5102a8b5138efe269af433c81 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -398,7 +398,7 @@ extern unsigned long nfs_access_cache_scan(struct shrinker *shrink,
 struct dentry *nfs_lookup(struct inode *, struct dentry *, unsigned int);
 void nfs_d_prune_case_insensitive_aliases(struct inode *inode);
 int nfs_create(struct mnt_idmap *, struct inode *, struct dentry *,
-	       umode_t, bool);
+	       umode_t);
 struct dentry *nfs_mkdir(struct mnt_idmap *, struct inode *, struct dentry *,
 			 umode_t);
 int nfs_rmdir(struct inode *, struct dentry *);
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index 40f4b1a28705b6e0eb8f0978cf3ac18b43aa1331..31d1d466c03048aaaab23f64c3f413c095939770 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -86,7 +86,7 @@ nilfs_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
  * with d_instantiate().
  */
 static int nilfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 	struct nilfs_transaction_info ti;
diff --git a/fs/ntfs3/namei.c b/fs/ntfs3/namei.c
index 82c8ae56beee6d79046dd6c8f02ff0f35e9a1ad3..49fe635b550d3f51f81138649b47c9c831a73e3b 100644
--- a/fs/ntfs3/namei.c
+++ b/fs/ntfs3/namei.c
@@ -105,7 +105,7 @@ static struct dentry *ntfs_lookup(struct inode *dir, struct dentry *dentry,
  * ntfs_create - inode_operations::create
  */
 static int ntfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	return ntfs_create_inode(idmap, dir, dentry, NULL, S_IFREG | mode, 0,
 				 NULL, 0, NULL);
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index cccaa1d6fbbac13ebcaf14a9183277890708e643..bd4b2269598b49c6f88dd8d201e246ee5ed855a6 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -454,8 +454,7 @@ static struct dentry *dlmfs_mkdir(struct mnt_idmap * idmap,
 static int dlmfs_create(struct mnt_idmap *idmap,
 			struct inode *dir,
 			struct dentry *dentry,
-			umode_t mode,
-			bool excl)
+			umode_t mode)
 {
 	int status = 0;
 	struct inode *inode;
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index c90b254da75eb5b90d2af5e37d41e781efe8b836..7443f468f45657cf68779a02e4edf4e38fb70f59 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -666,8 +666,7 @@ static struct dentry *ocfs2_mkdir(struct mnt_idmap *idmap,
 static int ocfs2_create(struct mnt_idmap *idmap,
 			struct inode *dir,
 			struct dentry *dentry,
-			umode_t mode,
-			bool excl)
+			umode_t mode)
 {
 	int ret;
 
diff --git a/fs/omfs/dir.c b/fs/omfs/dir.c
index 2ed541fccf331d796805dd1594fbf05c1f7f3b9a..a09a98f7e30bc66deca60725f9462d081b5e4784 100644
--- a/fs/omfs/dir.c
+++ b/fs/omfs/dir.c
@@ -286,7 +286,7 @@ static struct dentry *omfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int omfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	return omfs_add_node(dir, dentry, mode | S_IFREG);
 }
diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c
index bec5475de094dada6bb29eaf8520a875880f3bab..0ebaa7f000f26f1c1ecffd22cfe4272f20a783ed 100644
--- a/fs/orangefs/namei.c
+++ b/fs/orangefs/namei.c
@@ -18,8 +18,7 @@
 static int orangefs_create(struct mnt_idmap *idmap,
 			struct inode *dir,
 			struct dentry *dentry,
-			umode_t mode,
-			bool exclusive)
+			umode_t mode)
 {
 	struct orangefs_inode_s *parent = ORANGEFS_I(dir);
 	struct orangefs_kernel_op_s *new_op;
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index a5e9ddf3023b3942fafb9adb2770f26780a1b86b..0f70b3835f4a08c29d6bba8ae9143df55895e56b 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -704,7 +704,7 @@ static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev,
 }
 
 static int ovl_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	return ovl_create_object(dentry, (mode & 07777) | S_IFREG, 0, NULL);
 }
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 41f9995da7cab0d11395cb40a98fb4936d52597f..b6502aaa4fb44d27c939da9fae4449af7edd28d4 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -129,7 +129,7 @@ static struct dentry *ramfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int ramfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	return ramfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFREG, 0);
 }
diff --git a/fs/smb/client/cifsfs.h b/fs/smb/client/cifsfs.h
index e9534258d1efd0bb34f36bf2c725c64d0a8ca8f4..294c66cea2eca3344e09cd77619761e9cb79a807 100644
--- a/fs/smb/client/cifsfs.h
+++ b/fs/smb/client/cifsfs.h
@@ -50,7 +50,7 @@ extern void cifs_sb_deactive(struct super_block *sb);
 extern const struct inode_operations cifs_dir_inode_ops;
 extern struct inode *cifs_root_iget(struct super_block *);
 extern int cifs_create(struct mnt_idmap *, struct inode *,
-		       struct dentry *, umode_t, bool excl);
+		       struct dentry *, umode_t);
 extern int cifs_atomic_open(struct inode *, struct dentry *,
 			    struct file *, unsigned, umode_t);
 extern struct dentry *cifs_lookup(struct inode *, struct dentry *,
diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c
index da5597dbf5b9f140c6801158ac2357fa911c52ab..b00bc214db9f0e9533f481f41ac99ac8937610ac 100644
--- a/fs/smb/client/dir.c
+++ b/fs/smb/client/dir.c
@@ -566,7 +566,7 @@ cifs_atomic_open(struct inode *inode, struct dentry *direntry,
 }
 
 int cifs_create(struct mnt_idmap *idmap, struct inode *inode,
-		struct dentry *direntry, umode_t mode, bool excl)
+		struct dentry *direntry, umode_t mode)
 {
 	int rc;
 	unsigned int xid = get_xid();
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 3c3d3ad4fa6cb719e9ec08fa2164c55371c017c1..4840a6f7974e254eba4ca249357e968764e326e0 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -303,7 +303,7 @@ static int ubifs_prepare_create(struct inode *dir, struct dentry *dentry,
 }
 
 static int ubifs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 	struct ubifs_info *c = dir->i_sb->s_fs_info;
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index 5f2e9a892bffa9579143cedf71d80efa7ad6e9fb..f83b5564cbc4c68c02c07bb3ab2109bfabdc799d 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -371,7 +371,7 @@ static int udf_add_nondir(struct dentry *dentry, struct inode *inode)
 }
 
 static int udf_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode = udf_new_inode(dir, mode);
 
diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
index 5b3c85c9324298f4ff6aa3d4feeb962ce5ede539..5012e056200aca671364d34a7faf647e6747e1d2 100644
--- a/fs/ufs/namei.c
+++ b/fs/ufs/namei.c
@@ -70,8 +70,7 @@ static struct dentry *ufs_lookup(struct inode * dir, struct dentry *dentry, unsi
  * with d_instantiate(). 
  */
 static int ufs_create (struct mnt_idmap * idmap,
-		struct inode * dir, struct dentry * dentry, umode_t mode,
-		bool excl)
+		struct inode * dir, struct dentry * dentry, umode_t mode)
 {
 	struct inode *inode;
 
diff --git a/fs/vboxsf/dir.c b/fs/vboxsf/dir.c
index 42bedc4ec7af7709c564a7174805d185ce86f854..9ce4310c891044db17b6af98c06e3130002a7dda 100644
--- a/fs/vboxsf/dir.c
+++ b/fs/vboxsf/dir.c
@@ -298,7 +298,7 @@ static int vboxsf_dir_create(struct inode *parent, struct dentry *dentry,
 
 static int vboxsf_dir_mkfile(struct mnt_idmap *idmap,
 			     struct inode *parent, struct dentry *dentry,
-			     umode_t mode, bool excl)
+			     umode_t mode)
 {
-	return vboxsf_dir_create(parent, dentry, mode, false, excl, NULL);
+	return vboxsf_dir_create(parent, dentry, mode, false, true, NULL);
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index caff0125faeac093c1c05a722d3588e3f2e99926..2bc7faac35678b5b78acd6a50695a0d7b1c9a263 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -293,8 +293,7 @@ xfs_vn_create(
 	struct mnt_idmap	*idmap,
 	struct inode		*dir,
 	struct dentry		*dentry,
-	umode_t			mode,
-	bool			flags)
+	umode_t			mode)
 {
 	return xfs_generic_create(idmap, dir, dentry, mode, 0, NULL);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 64323e618724bc20dc101db13035b042f5f88e4d..b9a32e10078f5a1a0bbeb0d8913ac3e4b5b3a85d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2345,8 +2345,8 @@ struct inode_operations {
 
 	int (*readlink) (struct dentry *, char __user *,int);
 
-	int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,
-		       umode_t, bool);
+	int (*create) (struct mnt_idmap *, struct inode *, struct dentry *,
+		       umode_t);
 	int (*link) (struct dentry *,struct inode *,struct dentry *);
 	int (*unlink) (struct inode *,struct dentry *);
 	int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 093551fe66a7eb884fc34ef853a0ca92b95770af..9ae28c79fe0578bf96b2d22daed45b48aba0b946 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -610,7 +610,7 @@ static int mqueue_create_attr(struct dentry *dentry, umode_t mode, void *arg)
 }
 
 static int mqueue_create(struct mnt_idmap *idmap, struct inode *dir,
-			 struct dentry *dentry, umode_t mode, bool excl)
+			 struct dentry *dentry, umode_t mode)
 {
 	return mqueue_create_attr(dentry, mode, NULL);
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index b9081b817d28f3db1fbdd90ed3f04b6904d6ff18..8fdc9cbecb908e127f8173ca8888b5e038354fed 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3912,7 +3912,7 @@ static struct dentry *shmem_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int shmem_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	return shmem_mknod(idmap, dir, dentry, mode | S_IFREG, 0);
 }

---
base-commit: 76ddfe7d66d631e5e31ef4e5dd59797fa03acbf7
change-id: 20251105-create-excl-2b366d9bf3bb

Best regards,
-- 
Jeff Layton <jlayton at kernel.org>


From neilb at ownmail.net  Fri Nov  7 14:29:43 2025
From: neilb at ownmail.net (NeilBrown)
Date: Sat, 08 Nov 2025 09:29:43 +1100
Subject: [PATCH v2] vfs: remove the excl argument from the ->create()
 inode_operation
In-Reply-To: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org>
References: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org>
Message-ID: <176255458305.634289.5577159882824096330@noble.neil.brown.name>

On Sat, 08 Nov 2025, Jeff Layton wrote:
> With two exceptions, ->create() methods provided by filesystems ignore
> the "excl" flag.  Those exceptions are NFS and GFS2 which both also
> provide ->atomic_open.
> 
> Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
> the "excl" argument to the ->create() inode_operation is always set to
> true in vfs_create(). The ->create() call in lookup_open() sets it
> according to the O_EXCL open flag, but is never called if the filesystem
> provides ->atomic_open().
> 
> The excl flag is therefore always either ignored or true.  Remove it,
> and change NFS and GFS2 to act as if it were always true.
> 
> Signed-off-by: Jeff Layton <jlayton at kernel.org>
> ---
> Note that this is based on top of the dir delegation series [1]. LMK
> if the Documentation/ updates are too wordy.

Patch is very nice.  I don't think the documentation is too wordy.
I think it is good that the two changes to the different files say
essentially the same thing but use different words.  That helps.

Reviewed-by: NeilBrown <neil at brown.name>

> 
> Full disclosure: I did use Claude code to generate the first
> approximation of this patch, but I had to fix a number of things that it
> missed.  I probably could have given it better prompts. In any case, I'm
> not sure how to properly attribute this (or if I even need to).

My understanding is that if you fully understand (and can defend) the
code change with all its motivations and implications as well as if you
had written it yourself, then you don't need to attribute whatever fancy
text editor or IDE (e.g.  Claude) that you used to help produce the
patch.

Thanks,
NeilBrown


From corbet at lwn.net  Fri Nov  7 14:35:17 2025
From: corbet at lwn.net (Jonathan Corbet)
Date: Fri, 07 Nov 2025 15:35:17 -0700
Subject: LLM disclosure (was: [PATCH v2] vfs: remove the excl argument from
 the ->create() inode_operation)
In-Reply-To: <176255458305.634289.5577159882824096330@noble.neil.brown.name>
References: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org>
 <176255458305.634289.5577159882824096330@noble.neil.brown.name>
Message-ID: <87ikfl1nfe.fsf@trenco.lwn.net>

NeilBrown <neilb at ownmail.net> writes:

> On Sat, 08 Nov 2025, Jeff Layton wrote:

>> Full disclosure: I did use Claude code to generate the first
>> approximation of this patch, but I had to fix a number of things that it
>> missed.  I probably could have given it better prompts. In any case, I'm
>> not sure how to properly attribute this (or if I even need to).
>
> My understanding is that if you fully understand (and can defend) the
> code change with all its motivations and implications as well as if you
> had written it yourself, then you don't need to attribute whatever fancy
> text editor or IDE (e.g.  Claude) that you used to help produce the
> patch.

The proposed policy for such things is here, under review right now:

  https://lore.kernel.org/all/20251105231514.3167738-1-dave.hansen at linux.intel.com/

jon


From jlayton at kernel.org  Fri Nov  7 15:19:24 2025
From: jlayton at kernel.org (Jeff Layton)
Date: Fri, 07 Nov 2025 18:19:24 -0500
Subject: LLM disclosure (was: [PATCH v2] vfs: remove the excl argument
 from the ->create() inode_operation)
In-Reply-To: <87ikfl1nfe.fsf@trenco.lwn.net>
References: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org>
	 <176255458305.634289.5577159882824096330@noble.neil.brown.name>
	 <87ikfl1nfe.fsf@trenco.lwn.net>
Message-ID: <f5a2c41e4f272fef9f1525e17b494dd4b4bcb529.camel@kernel.org>

On Fri, 2025-11-07 at 15:35 -0700, Jonathan Corbet wrote:
> NeilBrown <neilb at ownmail.net> writes:
> 
> > On Sat, 08 Nov 2025, Jeff Layton wrote:
> 
> > > Full disclosure: I did use Claude code to generate the first
> > > approximation of this patch, but I had to fix a number of things that it
> > > missed.  I probably could have given it better prompts. In any case, I'm
> > > not sure how to properly attribute this (or if I even need to).
> > 
> > My understanding is that if you fully understand (and can defend) the
> > code change with all its motivations and implications as well as if you
> > had written it yourself, then you don't need to attribute whatever fancy
> > text editor or IDE (e.g.  Claude) that you used to help produce the
> > patch.
> 
> The proposed policy for such things is here, under review right now:
> 
>   https://lore.kernel.org/all/20251105231514.3167738-1-dave.hansen at linux.intel.com/
> 
> jon

Thanks Jon.

I'm guessing that this would fall under the "menial task"
classification, and therefore doesn't need attribution. This seems
applicable:

+ - Purely mechanical transformations like variable renaming

This is a little different, but it's a similar rote task.
-- 
Jeff Layton <jlayton at kernel.org>


From neilb at ownmail.net  Fri Nov  7 15:37:30 2025
From: neilb at ownmail.net (NeilBrown)
Date: Sat, 08 Nov 2025 10:37:30 +1100
Subject: LLM disclosure (was: [PATCH v2] vfs: remove the excl argument
 from the ->create() inode_operation)
In-Reply-To: <f5a2c41e4f272fef9f1525e17b494dd4b4bcb529.camel@kernel.org>
References: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org>,
 <176255458305.634289.5577159882824096330@noble.neil.brown.name>,
 <87ikfl1nfe.fsf@trenco.lwn.net>,
 <f5a2c41e4f272fef9f1525e17b494dd4b4bcb529.camel@kernel.org>
Message-ID: <176255865045.634289.1814933499430115577@noble.neil.brown.name>

On Sat, 08 Nov 2025, Jeff Layton wrote:
> On Fri, 2025-11-07 at 15:35 -0700, Jonathan Corbet wrote:
> > NeilBrown <neilb at ownmail.net> writes:
> > 
> > > On Sat, 08 Nov 2025, Jeff Layton wrote:
> > 
> > > > Full disclosure: I did use Claude code to generate the first
> > > > approximation of this patch, but I had to fix a number of things that it
> > > > missed.  I probably could have given it better prompts. In any case, I'm
> > > > not sure how to properly attribute this (or if I even need to).
> > > 
> > > My understanding is that if you fully understand (and can defend) the
> > > code change with all its motivations and implications as well as if you
> > > had written it yourself, then you don't need to attribute whatever fancy
> > > text editor or IDE (e.g.  Claude) that you used to help produce the
> > > patch.
> > 
> > The proposed policy for such things is here, under review right now:
> > 
> >   https://lore.kernel.org/all/20251105231514.3167738-1-dave.hansen at linux.intel.com/
> > 
> > jon
> 
> Thanks Jon.
> 
> I'm guessing that this would fall under the "menial task"
> classification, and therefore doesn't need attribution. This seems
> applicable:
> 
> + - Purely mechanical transformations like variable renaming
> 
> This is a little different, but it's a similar rote task.
> -- 
> Jeff Layton <jlayton at kernel.org>
> 

The bit I particularly liked was:

+
+Even if your tool use is out of scope you should still always consider
+if it would help reviewing your contribution if the reviewer knows
+about the tool that you used.
+

"would it help the reviewer"?  I agree that is a key question.  In your
case I cannot see how it would help.

Thanks,
NeilBrown


From asmadeus at codewreck.org  Fri Nov  7 22:12:10 2025
From: asmadeus at codewreck.org (Dominique Martinet)
Date: Sat, 8 Nov 2025 15:12:10 +0900
Subject: [PATCH v2] vfs: remove the excl argument from the ->create()
 inode_operation
In-Reply-To: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org>
References: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org>
Message-ID: <aQ7fOmknHIxcxuha@codewreck.org>

Jeff Layton wrote on Fri, Nov 07, 2025 at 10:05:03AM -0500:
> With two exceptions, ->create() methods provided by filesystems ignore
> the "excl" flag.  Those exceptions are NFS and GFS2 which both also
> provide ->atomic_open.
> 
> Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
> the "excl" argument to the ->create() inode_operation is always set to
> true in vfs_create(). The ->create() call in lookup_open() sets it
> according to the O_EXCL open flag, but is never called if the filesystem
> provides ->atomic_open().
> 
> The excl flag is therefore always either ignored or true.  Remove it,
> and change NFS and GFS2 to act as if it were always true.
> 
> Signed-off-by: Jeff Layton <jlayton at kernel.org>

Good cleanup, just one whitespace nitpick below but:
Reviewed-by: Dominique Martinet <asmadeus at codewreck.org>


> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 4f13b01e42eb5e2ad9d60cbbce7e47d09ad831e6..7a55e491e0c87a0d18909bd181754d6d68318059 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -505,7 +505,10 @@ otherwise noted.
>  	if you want to support regular files.  The dentry you get should
>  	not have an inode (i.e. it should be a negative dentry).  Here
>  	you will probably call d_instantiate() with the dentry and the
> -	newly created inode
> +        newly created inode. This operation should always provide O_EXCL

This and the block below change halfway from tab (old text) to spaces
(your patch)

Looks like the file has a few space-indented sections though so it won't
be the first if that goes in as is, the html-rendering doesn't seem to
care :)

Cheers,
-- 
Dominique Martinet | Asmadeus


From thehajime at gmail.com  Sat Nov  8 00:05:35 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sat,  8 Nov 2025 17:05:35 +0900
Subject: [PATCH v13 00/13] nommu UML
Message-ID: <cover.1762588860.git.thehajime@gmail.com>

This patchset is another spin of the nommu mode addition to UML.  It
would be nice to hear your opinions on it.

There are still several limitations/issues which we already found;
here is the list of those issues.

- memory mapped by loadable modules is not distinguished from
  userspace memory.
- CONFIG_SMP is disabled as host_fs handling doesn't work with thread
  local storage.

-- Hajime

v13:
- rebase with the latest uml/next branch, fixing a conflict ([06/13])

v12:
- rebase with the latest uml/next branch
- disable SMP and TLS as those don't work with host_fs handling ([11/13])
- https://lore.kernel.org/all/cover.1762075876.git.thehajime at gmail.com/

v11:
- clean up userspace return routine and integrate to userspace() ([04/13])
- fix direction flag issue on using nolibc memcpy ([04/13])
- fix a crash issue when using usermode helper ([06/13])
- test with out-of-tree kunit-uapi patches (which uses umh)
 - https://lore.kernel.org/all/20250626-kunit-kselftests-v4-0-48760534fef5 at linutronix.de/
 - https://lore.kernel.org/all/20250626195714.2123694-3-benjamin at sipsolutions.net/
- https://lore.kernel.org/all/cover.1758181109.git.thehajime at gmail.com/

v10:
- fix wrong comment on gs register handling ([09/13])
- remove unnecessary code of early syscall implementation ([04/13])
- https://lore.kernel.org/all/cover.1750594487.git.thehajime at gmail.com/

v9:
- rebase with the latest uml/next branch
- add performance numbers of new SECCOMP mode, and update results ([12/13])
- add a workaround for upstream change on MMU dependency to PCI drivers ([10/13])
- https://lore.kernel.org/all/cover.1750294482.git.thehajime at gmail.com/

v8:
- rebase with the latest uml/next branch
- clean up segv_handler to align with the latest uml ([9/12])
- https://lore.kernel.org/all/cover.1745980082.git.thehajime at gmail.com/

v7:
- properly handle FP register upon signal delivery [10/13]
- update benchmark result with new FP register handling [12/13]
- fix arch_has_single_step() for !MMU case [07/13]
- revert stack alignment as it is in uml/fixes tree [10/13]
- https://lore.kernel.org/all/cover.1737348399.git.thehajime at gmail.com/

v6:
- rebase to the latest uml/next tree
- more clean up on mmu/nommu for signal handling [10/13]
- rename functions of mcontext routines [06,10/13]
- added Acked-by tag for binfmt_elf_fdpic [02/13]
- https://lore.kernel.org/linux-um/cover.1736853925.git.thehajime at gmail.com/

v5:
- clean up stack manipulation code [05,06,07,10/13]
- https://lore.kernel.org/linux-um/cover.1733998168.git.thehajime at gmail.com/

v4:
- add arch/um/nommu, arch/x86/um/nommu to contain !MMU specific code
- remove zpoline patch
- drop binfmt_elf_fdpic patch
- reduce ifndef CONFIG_MMU if possible
- split to elf header cleanup patch [01/13]
- fix kernel test robot warnings [06/13]
- fix coding styles [07/13]
- move task_top_of_stack definition [05/13]
- https://lore.kernel.org/linux-um/cover.1733652929.git.thehajime at gmail.com/

v3:
- https://lore.kernel.org/linux-um/cover.1733199769.git.thehajime at gmail.com/
- add seccomp-based syscall hook in addition to zpoline [06/13]
- remove RFC, add a line to MAINTAINERS file
- fix kernel test robot warnings [02/13,08/13,10/13]
- add base-commit tag to cover letter
- pull the latest uml/next
- clean up SIGSEGV handling [10/13]
- detect fsgsbase availability with elf aux vector [08/13]
- simplify vdso code with macros [09/13]

RFC v2:
- https://lore.kernel.org/linux-um/cover.1731290567.git.thehajime at gmail.com/
- base branch is now uml/linux.git instead of torvalds/linux.git.
- reorganize the patch series to clean up
- fixed various coding style issues
- clean up exec code path [07/13]
- fixed the crash/SIGSEGV case on userspace programs [10/13]
- add seccomp filter to limit syscall caller address [06/13]
- detect fsgsbase availability with sigsetjmp/siglongjmp [08/13]
- remove unrelated changes
- remove unneeded ifndef CONFIG_MMU
- convert UML_CONFIG_MMU to CONFIG_MMU as using uml/linux.git
- proposed a patch of maple-tree issue (resolving a limitation in RFC v1)
  https://lore.kernel.org/linux-mm/20241108222834.3625217-1-thehajime at gmail.com/

RFC:
- https://lore.kernel.org/linux-um/cover.1729770373.git.thehajime at gmail.com/

Hajime Tazaki (13):
  x86/um: nommu: elf loader for fdpic
  um: decouple MMU specific code from the common part
  um: nommu: memory handling
  x86/um: nommu: syscall handling
  um: nommu: seccomp syscalls hook
  x86/um: nommu: process/thread handling
  um: nommu: configure fs register on host syscall invocation
  x86/um/vdso: nommu: vdso memory update
  x86/um: nommu: signal handling
  um: change machine name for uname output
  um: nommu: disable SMP on nommu UML
  um: nommu: add documentation of nommu UML
  um: nommu: plug nommu code into build system

 Documentation/virt/uml/nommu-uml.rst   | 180 ++++++++++++++++++++++
 MAINTAINERS                            |   1 +
 arch/um/Kconfig                        |  14 +-
 arch/um/Makefile                       |  10 ++
 arch/um/configs/x86_64_nommu_defconfig |  54 +++++++
 arch/um/include/asm/futex.h            |   4 +
 arch/um/include/asm/mmu.h              |   8 +
 arch/um/include/asm/mmu_context.h      |   2 +
 arch/um/include/asm/ptrace-generic.h   |   8 +-
 arch/um/include/asm/uaccess.h          |   7 +-
 arch/um/include/shared/kern_util.h     |   6 +
 arch/um/include/shared/os.h            |  16 ++
 arch/um/kernel/Makefile                |   5 +-
 arch/um/kernel/mem-pgtable.c           |  55 +++++++
 arch/um/kernel/mem.c                   |  38 +----
 arch/um/kernel/process.c               |  38 +++++
 arch/um/kernel/skas/process.c          |  37 -----
 arch/um/kernel/um_arch.c               |   3 +
 arch/um/nommu/Makefile                 |   3 +
 arch/um/nommu/os-Linux/Makefile        |   7 +
 arch/um/nommu/os-Linux/seccomp.c       |  87 +++++++++++
 arch/um/nommu/os-Linux/signal.c        |  24 +++
 arch/um/nommu/trap.c                   | 201 +++++++++++++++++++++++++
 arch/um/os-Linux/Makefile              |   3 +-
 arch/um/os-Linux/internal.h            |   8 +
 arch/um/os-Linux/mem.c                 |   4 +
 arch/um/os-Linux/process.c             | 139 ++++++++++++++++-
 arch/um/os-Linux/signal.c              |  11 +-
 arch/um/os-Linux/skas/process.c        | 127 ----------------
 arch/um/os-Linux/start_up.c            |  25 ++-
 arch/um/os-Linux/util.c                |   3 +-
 arch/x86/um/Kconfig                    |   2 +-
 arch/x86/um/Makefile                   |   7 +-
 arch/x86/um/asm/elf.h                  |   8 +-
 arch/x86/um/asm/syscall.h              |   6 +
 arch/x86/um/nommu/Makefile             |   8 +
 arch/x86/um/nommu/do_syscall_64.c      |  75 +++++++++
 arch/x86/um/nommu/entry_64.S           | 114 ++++++++++++++
 arch/x86/um/nommu/os-Linux/Makefile    |   6 +
 arch/x86/um/nommu/os-Linux/mcontext.c  |  26 ++++
 arch/x86/um/nommu/syscalls.h           |  18 +++
 arch/x86/um/nommu/syscalls_64.c        | 121 +++++++++++++++
 arch/x86/um/shared/sysdep/mcontext.h   |   5 +
 arch/x86/um/shared/sysdep/ptrace.h     |   2 +-
 arch/x86/um/vdso/vma.c                 |  17 ++-
 fs/Kconfig.binfmt                      |   2 +-
 46 files changed, 1322 insertions(+), 223 deletions(-)
 create mode 100644 Documentation/virt/uml/nommu-uml.rst
 create mode 100644 arch/um/configs/x86_64_nommu_defconfig
 create mode 100644 arch/um/kernel/mem-pgtable.c
 create mode 100644 arch/um/nommu/Makefile
 create mode 100644 arch/um/nommu/os-Linux/Makefile
 create mode 100644 arch/um/nommu/os-Linux/seccomp.c
 create mode 100644 arch/um/nommu/os-Linux/signal.c
 create mode 100644 arch/um/nommu/trap.c
 create mode 100644 arch/x86/um/nommu/Makefile
 create mode 100644 arch/x86/um/nommu/do_syscall_64.c
 create mode 100644 arch/x86/um/nommu/entry_64.S
 create mode 100644 arch/x86/um/nommu/os-Linux/Makefile
 create mode 100644 arch/x86/um/nommu/os-Linux/mcontext.c
 create mode 100644 arch/x86/um/nommu/syscalls.h
 create mode 100644 arch/x86/um/nommu/syscalls_64.c


base-commit: 293f71435d14f5b5c46fc3398695fa265c69363d
-- 
2.43.0


From thehajime at gmail.com  Sat Nov  8 00:05:36 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sat,  8 Nov 2025 17:05:36 +0900
Subject: [PATCH v13 01/13] x86/um: nommu: elf loader for fdpic
In-Reply-To: <cover.1762588860.git.thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
Message-ID: <59210140957e95ab0df73125bfdb035913a468b1.1762588860.git.thehajime@gmail.com>

To support the CONFIG_MMU=n case, UML has to use an alternate ELF
loader, the FDPIC ELF loader.  This commit adds the necessary
definitions to the arch code, as UML has not used this loader so far.
It also updates the Kconfig file to allow BINFMT_ELF_FDPIC in a !MMU
environment.

Cc: Eric Biederman <ebiederm at xmission.com>
Cc: Kees Cook <kees at kernel.org>
Cc: Alexander Viro <viro at zeniv.linux.org.uk>
Cc: Christian Brauner <brauner at kernel.org>
Cc: Jan Kara <jack at suse.cz>
Cc: linux-mm at kvack.org
Cc: linux-fsdevel at vger.kernel.org
Acked-by: Kees Cook <kees at kernel.org>
Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Ricardo Koller <ricarkol at google.com>
---
 arch/um/include/asm/mmu.h            | 5 +++++
 arch/um/include/asm/ptrace-generic.h | 6 ++++++
 arch/x86/um/asm/elf.h                | 8 ++++++--
 fs/Kconfig.binfmt                    | 2 +-
 4 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index 07d48738b402..82a919132aff 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -21,6 +21,11 @@ typedef struct mm_context {
 	spinlock_t sync_tlb_lock;
 	unsigned long sync_tlb_range_from;
 	unsigned long sync_tlb_range_to;
+
+#ifdef CONFIG_BINFMT_ELF_FDPIC
+	unsigned long   exec_fdpic_loadmap;
+	unsigned long   interp_fdpic_loadmap;
+#endif
 } mm_context_t;
 
 #define INIT_MM_CONTEXT(mm)						\
diff --git a/arch/um/include/asm/ptrace-generic.h b/arch/um/include/asm/ptrace-generic.h
index 86d74f9d33cf..62e9916078ec 100644
--- a/arch/um/include/asm/ptrace-generic.h
+++ b/arch/um/include/asm/ptrace-generic.h
@@ -29,6 +29,12 @@ struct pt_regs {
 
 #define PTRACE_OLDSETOPTIONS 21
 
+#ifdef CONFIG_BINFMT_ELF_FDPIC
+#define PTRACE_GETFDPIC		31
+#define PTRACE_GETFDPIC_EXEC	0
+#define PTRACE_GETFDPIC_INTERP	1
+#endif
+
 struct task_struct;
 
 extern long subarch_ptrace(struct task_struct *child, long request,
diff --git a/arch/x86/um/asm/elf.h b/arch/x86/um/asm/elf.h
index 22d0111b543b..388fe669886c 100644
--- a/arch/x86/um/asm/elf.h
+++ b/arch/x86/um/asm/elf.h
@@ -9,6 +9,7 @@
 #include <skas.h>
 
 #define CORE_DUMP_USE_REGSET
+#define ELF_FDPIC_CORE_EFLAGS  0
 
 #ifdef CONFIG_X86_32
 
@@ -158,8 +159,11 @@ extern int arch_setup_additional_pages(struct linux_binprm *bprm,
 
 extern unsigned long um_vdso_addr;
 #define AT_SYSINFO_EHDR 33
-#define ARCH_DLINFO	NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr)
-
+#define ARCH_DLINFO						\
+do {								\
+	NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr);		\
+	NEW_AUX_ENT(AT_MINSIGSTKSZ, 0);			\
+} while (0)
 #endif
 
 typedef unsigned long elf_greg_t;
diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index 1949e25c7741..0a92bebd5f75 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -58,7 +58,7 @@ config ARCH_USE_GNU_PROPERTY
 config BINFMT_ELF_FDPIC
 	bool "Kernel support for FDPIC ELF binaries"
 	default y if !BINFMT_ELF
-	depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU)
+	depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU)
 	select ELFCORE
 	help
 	  ELF FDPIC binaries are based on ELF, but allow the individual load
-- 
2.43.0


From thehajime at gmail.com  Sat Nov  8 00:05:37 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sat,  8 Nov 2025 17:05:37 +0900
Subject: [PATCH v13 02/13] um: decouple MMU specific code from the common part
In-Reply-To: <cover.1762588860.git.thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
Message-ID: <df871d488f69a512ad762bf22d63e6615946566c.1762588860.git.thehajime@gmail.com>

This splits the memory- and process-related code into common and
MMU-specific parts, in order to avoid ifdefs in .c files and
duplication between the MMU and !MMU code.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
---
 arch/um/kernel/Makefile         |   5 +-
 arch/um/kernel/mem-pgtable.c    |  55 ++++++++++++++
 arch/um/kernel/mem.c            |  35 ---------
 arch/um/kernel/process.c        |  38 ++++++++++
 arch/um/kernel/skas/process.c   |  37 ---------
 arch/um/os-Linux/Makefile       |   3 +-
 arch/um/os-Linux/process.c      | 129 ++++++++++++++++++++++++++++++++
 arch/um/os-Linux/skas/process.c | 127 -------------------------------
 8 files changed, 227 insertions(+), 202 deletions(-)
 create mode 100644 arch/um/kernel/mem-pgtable.c

diff --git a/arch/um/kernel/Makefile b/arch/um/kernel/Makefile
index be60bc451b3f..76d36751973e 100644
--- a/arch/um/kernel/Makefile
+++ b/arch/um/kernel/Makefile
@@ -16,9 +16,10 @@ always-$(KBUILD_BUILTIN) := vmlinux.lds
 
 obj-y = config.o exec.o exitcode.o irq.o ksyms.o mem.o \
 	physmem.o process.o ptrace.o reboot.o sigio.o \
-	signal.o sysrq.o time.o tlb.o trap.o \
-	um_arch.o umid.o kmsg_dump.o capflags.o skas/
+	signal.o sysrq.o time.o \
+	um_arch.o umid.o kmsg_dump.o capflags.o
 obj-y += load_file.o
+obj-$(CONFIG_MMU) += mem-pgtable.o tlb.o trap.o skas/
 
 obj-$(CONFIG_BLK_DEV_INITRD) += initrd.o
 obj-$(CONFIG_GPROF)	+= gprof_syms.o
diff --git a/arch/um/kernel/mem-pgtable.c b/arch/um/kernel/mem-pgtable.c
new file mode 100644
index 000000000000..549da1d3bff0
--- /dev/null
+++ b/arch/um/kernel/mem-pgtable.c
@@ -0,0 +1,55 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
+ */
+
+#include <linux/stddef.h>
+#include <linux/module.h>
+#include <linux/memblock.h>
+#include <linux/swap.h>
+#include <linux/slab.h>
+#include <asm/page.h>
+#include <asm/pgalloc.h>
+#include <as-layout.h>
+#include <init.h>
+#include <kern.h>
+#include <kern_util.h>
+#include <mem_user.h>
+#include <os.h>
+#include <um_malloc.h>
+
+
+/* Allocate and free page tables. */
+
+pgd_t *pgd_alloc(struct mm_struct *mm)
+{
+	pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
+
+	if (pgd) {
+		memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t));
+		memcpy(pgd + USER_PTRS_PER_PGD,
+		       swapper_pg_dir + USER_PTRS_PER_PGD,
+		       (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
+	}
+	return pgd;
+}
+
+static const pgprot_t protection_map[16] = {
+	[VM_NONE]					= PAGE_NONE,
+	[VM_READ]					= PAGE_READONLY,
+	[VM_WRITE]					= PAGE_COPY,
+	[VM_WRITE | VM_READ]				= PAGE_COPY,
+	[VM_EXEC]					= PAGE_READONLY,
+	[VM_EXEC | VM_READ]				= PAGE_READONLY,
+	[VM_EXEC | VM_WRITE]				= PAGE_COPY,
+	[VM_EXEC | VM_WRITE | VM_READ]			= PAGE_COPY,
+	[VM_SHARED]					= PAGE_NONE,
+	[VM_SHARED | VM_READ]				= PAGE_READONLY,
+	[VM_SHARED | VM_WRITE]				= PAGE_SHARED,
+	[VM_SHARED | VM_WRITE | VM_READ]		= PAGE_SHARED,
+	[VM_SHARED | VM_EXEC]				= PAGE_READONLY,
+	[VM_SHARED | VM_EXEC | VM_READ]			= PAGE_READONLY,
+	[VM_SHARED | VM_EXEC | VM_WRITE]		= PAGE_SHARED,
+	[VM_SHARED | VM_EXEC | VM_WRITE | VM_READ]	= PAGE_SHARED
+};
+DECLARE_VM_GET_PAGE_PROT
diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 39c4a7e21c6f..f3258680bfbe 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -6,7 +6,6 @@
 #include <linux/stddef.h>
 #include <linux/module.h>
 #include <linux/memblock.h>
-#include <linux/mm.h>
 #include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/init.h>
@@ -107,45 +106,11 @@ void free_initmem(void)
 {
 }
 
-/* Allocate and free page tables. */
-
-pgd_t *pgd_alloc(struct mm_struct *mm)
-{
-	pgd_t *pgd = __pgd_alloc(mm, 0);
-
-	if (pgd)
-		memcpy(pgd + USER_PTRS_PER_PGD,
-		       swapper_pg_dir + USER_PTRS_PER_PGD,
-		       (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
-
-	return pgd;
-}
-
 void *uml_kmalloc(int size, int flags)
 {
 	return kmalloc(size, flags);
 }
 
-static const pgprot_t protection_map[16] = {
-	[VM_NONE]					= PAGE_NONE,
-	[VM_READ]					= PAGE_READONLY,
-	[VM_WRITE]					= PAGE_COPY,
-	[VM_WRITE | VM_READ]				= PAGE_COPY,
-	[VM_EXEC]					= PAGE_READONLY,
-	[VM_EXEC | VM_READ]				= PAGE_READONLY,
-	[VM_EXEC | VM_WRITE]				= PAGE_COPY,
-	[VM_EXEC | VM_WRITE | VM_READ]			= PAGE_COPY,
-	[VM_SHARED]					= PAGE_NONE,
-	[VM_SHARED | VM_READ]				= PAGE_READONLY,
-	[VM_SHARED | VM_WRITE]				= PAGE_SHARED,
-	[VM_SHARED | VM_WRITE | VM_READ]		= PAGE_SHARED,
-	[VM_SHARED | VM_EXEC]				= PAGE_READONLY,
-	[VM_SHARED | VM_EXEC | VM_READ]			= PAGE_READONLY,
-	[VM_SHARED | VM_EXEC | VM_WRITE]		= PAGE_SHARED,
-	[VM_SHARED | VM_EXEC | VM_WRITE | VM_READ]	= PAGE_SHARED
-};
-DECLARE_VM_GET_PAGE_PROT
-
 void mark_rodata_ro(void)
 {
 	unsigned long rodata_start = PFN_ALIGN(__start_rodata);
diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c
index 63b38a3f73f7..b07c1f120910 100644
--- a/arch/um/kernel/process.c
+++ b/arch/um/kernel/process.c
@@ -25,6 +25,7 @@
 #include <linux/tick.h>
 #include <linux/threads.h>
 #include <linux/resume_user_mode.h>
+#include <linux/start_kernel.h>
 #include <asm/current.h>
 #include <asm/mmu_context.h>
 #include <asm/switch_to.h>
@@ -307,3 +308,40 @@ unsigned long __get_wchan(struct task_struct *p)
 
 	return 0;
 }
+
+extern void start_kernel(void);
+
+static int __init start_kernel_proc(void *unused)
+{
+	block_signals_trace();
+
+	start_kernel();
+	return 0;
+}
+
+char cpu_irqstacks[NR_CPUS][THREAD_SIZE] __aligned(THREAD_SIZE);
+
+int __init start_uml(void)
+{
+	stack_protections((unsigned long) &cpu_irqstacks[0]);
+	set_sigstack(cpu_irqstacks[0], THREAD_SIZE);
+
+	init_new_thread_signals();
+
+	init_task.thread.request.thread.proc = start_kernel_proc;
+	init_task.thread.request.thread.arg = NULL;
+	return start_idle_thread(task_stack_page(&init_task),
+				 &init_task.thread.switch_buf);
+}
+
+static DEFINE_SPINLOCK(initial_jmpbuf_spinlock);
+
+void initial_jmpbuf_lock(void)
+{
+	spin_lock_irq(&initial_jmpbuf_spinlock);
+}
+
+void initial_jmpbuf_unlock(void)
+{
+	spin_unlock_irq(&initial_jmpbuf_spinlock);
+}
diff --git a/arch/um/kernel/skas/process.c b/arch/um/kernel/skas/process.c
index 4a7673b0261a..d643854942bc 100644
--- a/arch/um/kernel/skas/process.c
+++ b/arch/um/kernel/skas/process.c
@@ -17,31 +17,6 @@
 #include <skas.h>
 #include <kern_util.h>
 
-extern void start_kernel(void);
-
-static int __init start_kernel_proc(void *unused)
-{
-	block_signals_trace();
-
-	start_kernel();
-	return 0;
-}
-
-char cpu_irqstacks[NR_CPUS][THREAD_SIZE] __aligned(THREAD_SIZE);
-
-int __init start_uml(void)
-{
-	stack_protections((unsigned long) &cpu_irqstacks[0]);
-	set_sigstack(cpu_irqstacks[0], THREAD_SIZE);
-
-	init_new_thread_signals();
-
-	init_task.thread.request.thread.proc = start_kernel_proc;
-	init_task.thread.request.thread.arg = NULL;
-	return start_idle_thread(task_stack_page(&init_task),
-				 &init_task.thread.switch_buf);
-}
-
 unsigned long current_stub_stack(void)
 {
 	if (current->mm == NULL)
@@ -65,15 +40,3 @@ void current_mm_sync(void)
 
 	um_tlb_sync(current->mm);
 }
-
-static DEFINE_SPINLOCK(initial_jmpbuf_spinlock);
-
-void initial_jmpbuf_lock(void)
-{
-	spin_lock_irq(&initial_jmpbuf_spinlock);
-}
-
-void initial_jmpbuf_unlock(void)
-{
-	spin_unlock_irq(&initial_jmpbuf_spinlock);
-}
diff --git a/arch/um/os-Linux/Makefile b/arch/um/os-Linux/Makefile
index f8d672d570d9..40e3e0eab6a0 100644
--- a/arch/um/os-Linux/Makefile
+++ b/arch/um/os-Linux/Makefile
@@ -8,7 +8,8 @@ KCOV_INSTRUMENT                := n
 
 obj-y = elf_aux.o execvp.o file.o helper.o irq.o main.o mem.o process.o \
 	registers.o sigio.o signal.o start_up.o time.o tty.o \
-	umid.o user_syms.o util.o skas/
+	umid.o user_syms.o util.o
+obj-$(CONFIG_MMU) += skas/
 
 CFLAGS_signal.o += -Wframe-larger-than=4096
 
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index 3a2a84ab9325..c50fa865d8c7 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -6,6 +6,7 @@
 
 #include <stdio.h>
 #include <stdlib.h>
+#include <stdbool.h>
 #include <unistd.h>
 #include <errno.h>
 #include <signal.h>
@@ -17,10 +18,16 @@
 #include <sys/prctl.h>
 #include <sys/wait.h>
 #include <asm/unistd.h>
+#include <linux/threads.h>
 #include <init.h>
 #include <longjmp.h>
 #include <os.h>
 #include <skas/skas.h>
+#include <as-layout.h>
+#include <kern_util.h>
+
+int using_seccomp;
+static int unscheduled_userspace_iterations;
 
 void os_alarm_process(int pid)
 {
@@ -209,3 +216,125 @@ int os_futex_wake(void *uaddr)
 				NULL, NULL, 0));
 	return r < 0 ? -errno : r;
 }
+
+int is_skas_winch(int pid, int fd, void *data)
+{
+	return pid == getpgrp();
+}
+
+void new_thread(void *stack, jmp_buf *buf, void (*handler)(void))
+{
+	(*buf)[0].JB_IP = (unsigned long) handler;
+	(*buf)[0].JB_SP = (unsigned long) stack + UM_THREAD_SIZE -
+		sizeof(void *);
+}
+
+#define INIT_JMP_NEW_THREAD 0
+#define INIT_JMP_CALLBACK 1
+#define INIT_JMP_HALT 2
+#define INIT_JMP_REBOOT 3
+
+void switch_threads(jmp_buf *me, jmp_buf *you)
+{
+	unscheduled_userspace_iterations = 0;
+
+	if (UML_SETJMP(me) == 0)
+		UML_LONGJMP(you, 1);
+}
+
+static jmp_buf initial_jmpbuf;
+
+static __thread void (*cb_proc)(void *arg);
+static __thread void *cb_arg;
+static __thread jmp_buf *cb_back;
+
+int start_idle_thread(void *stack, jmp_buf *switch_buf)
+{
+	int n;
+
+	set_handler(SIGWINCH);
+
+	/*
+	 * Can't use UML_SETJMP or UML_LONGJMP here because they save
+	 * and restore signals, with the possible side-effect of
+	 * trying to handle any signals which came when they were
+	 * blocked, which can't be done on this stack.
+	 * Signals must be blocked when jumping back here and restored
+	 * after returning to the jumper.
+	 */
+	n = setjmp(initial_jmpbuf);
+	switch (n) {
+	case INIT_JMP_NEW_THREAD:
+		(*switch_buf)[0].JB_IP = (unsigned long) uml_finishsetup;
+		(*switch_buf)[0].JB_SP = (unsigned long) stack +
+			UM_THREAD_SIZE - sizeof(void *);
+		break;
+	case INIT_JMP_CALLBACK:
+		(*cb_proc)(cb_arg);
+		longjmp(*cb_back, 1);
+		break;
+	case INIT_JMP_HALT:
+		kmalloc_ok = 0;
+		return 0;
+	case INIT_JMP_REBOOT:
+		kmalloc_ok = 0;
+		return 1;
+	default:
+		printk(UM_KERN_ERR "Bad sigsetjmp return in %s - %d\n",
+		       __func__, n);
+		fatal_sigsegv();
+	}
+	longjmp(*switch_buf, 1);
+
+	/* unreachable */
+	printk(UM_KERN_ERR "impossible long jump!");
+	fatal_sigsegv();
+	return 0;
+}
+
+void initial_thread_cb_skas(void (*proc)(void *), void *arg)
+{
+	jmp_buf here;
+
+	cb_proc = proc;
+	cb_arg = arg;
+	cb_back = &here;
+
+	initial_jmpbuf_lock();
+	if (UML_SETJMP(&here) == 0)
+		UML_LONGJMP(&initial_jmpbuf, INIT_JMP_CALLBACK);
+	initial_jmpbuf_unlock();
+
+	cb_proc = NULL;
+	cb_arg = NULL;
+	cb_back = NULL;
+}
+
+void halt_skas(void)
+{
+	initial_jmpbuf_lock();
+	UML_LONGJMP(&initial_jmpbuf, INIT_JMP_HALT);
+	/* unreachable */
+}
+
+static bool noreboot;
+
+static int __init noreboot_cmd_param(char *str, int *add)
+{
+	*add = 0;
+	noreboot = true;
+	return 0;
+}
+
+__uml_setup("noreboot", noreboot_cmd_param,
+"noreboot\n"
+"    Rather than rebooting, exit always, akin to QEMU's -no-reboot option.\n"
+"    This is useful if you're using CONFIG_PANIC_TIMEOUT in order to catch\n"
+"    crashes in CI\n\n");
+
+void reboot_skas(void)
+{
+	initial_jmpbuf_lock();
+	UML_LONGJMP(&initial_jmpbuf, noreboot ? INIT_JMP_HALT : INIT_JMP_REBOOT);
+	/* unreachable */
+}
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index d6c22f8aa06d..01814ad82f5d 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -18,7 +18,6 @@
 #include <sys/stat.h>
 #include <sys/socket.h>
 #include <asm/unistd.h>
-#include <as-layout.h>
 #include <init.h>
 #include <kern_util.h>
 #include <mem.h>
@@ -29,16 +28,10 @@
 #include <sysdep/stub.h>
 #include <sysdep/mcontext.h>
 #include <linux/futex.h>
-#include <linux/threads.h>
 #include <timetravel.h>
 #include <asm-generic/rwonce.h>
 #include "../internal.h"
 
-int is_skas_winch(int pid, int fd, void *data)
-{
-	return pid == getpgrp();
-}
-
 static const char *ptrace_reg_name(int idx)
 {
 #define R(n) case HOST_##n: return #n
@@ -426,8 +419,6 @@ static int __init init_stub_exe_fd(void)
 }
 __initcall(init_stub_exe_fd);
 
-int using_seccomp;
-
 /**
  * start_userspace() - prepare a new userspace process
  * @mm_id: The corresponding struct mm_id
@@ -540,7 +531,6 @@ int start_userspace(struct mm_id *mm_id)
 	return err;
 }
 
-static int unscheduled_userspace_iterations;
 extern unsigned long tt_extra_sched_jiffies;
 
 void userspace(struct uml_pt_regs *regs)
@@ -789,120 +779,3 @@ void userspace(struct uml_pt_regs *regs)
 		}
 	}
 }
-
-void new_thread(void *stack, jmp_buf *buf, void (*handler)(void))
-{
-	(*buf)[0].JB_IP = (unsigned long) handler;
-	(*buf)[0].JB_SP = (unsigned long) stack + UM_THREAD_SIZE -
-		sizeof(void *);
-}
-
-#define INIT_JMP_NEW_THREAD 0
-#define INIT_JMP_CALLBACK 1
-#define INIT_JMP_HALT 2
-#define INIT_JMP_REBOOT 3
-
-void switch_threads(jmp_buf *me, jmp_buf *you)
-{
-	unscheduled_userspace_iterations = 0;
-
-	if (UML_SETJMP(me) == 0)
-		UML_LONGJMP(you, 1);
-}
-
-static jmp_buf initial_jmpbuf;
-
-static __thread void (*cb_proc)(void *arg);
-static __thread void *cb_arg;
-static __thread jmp_buf *cb_back;
-
-int start_idle_thread(void *stack, jmp_buf *switch_buf)
-{
-	int n;
-
-	set_handler(SIGWINCH);
-
-	/*
-	 * Can't use UML_SETJMP or UML_LONGJMP here because they save
-	 * and restore signals, with the possible side-effect of
-	 * trying to handle any signals which came when they were
-	 * blocked, which can't be done on this stack.
-	 * Signals must be blocked when jumping back here and restored
-	 * after returning to the jumper.
-	 */
-	n = setjmp(initial_jmpbuf);
-	switch (n) {
-	case INIT_JMP_NEW_THREAD:
-		(*switch_buf)[0].JB_IP = (unsigned long) uml_finishsetup;
-		(*switch_buf)[0].JB_SP = (unsigned long) stack +
-			UM_THREAD_SIZE - sizeof(void *);
-		break;
-	case INIT_JMP_CALLBACK:
-		(*cb_proc)(cb_arg);
-		longjmp(*cb_back, 1);
-		break;
-	case INIT_JMP_HALT:
-		kmalloc_ok = 0;
-		return 0;
-	case INIT_JMP_REBOOT:
-		kmalloc_ok = 0;
-		return 1;
-	default:
-		printk(UM_KERN_ERR "Bad sigsetjmp return in %s - %d\n",
-		       __func__, n);
-		fatal_sigsegv();
-	}
-	longjmp(*switch_buf, 1);
-
-	/* unreachable */
-	printk(UM_KERN_ERR "impossible long jump!");
-	fatal_sigsegv();
-	return 0;
-}
-
-void initial_thread_cb_skas(void (*proc)(void *), void *arg)
-{
-	jmp_buf here;
-
-	cb_proc = proc;
-	cb_arg = arg;
-	cb_back = &here;
-
-	initial_jmpbuf_lock();
-	if (UML_SETJMP(&here) == 0)
-		UML_LONGJMP(&initial_jmpbuf, INIT_JMP_CALLBACK);
-	initial_jmpbuf_unlock();
-
-	cb_proc = NULL;
-	cb_arg = NULL;
-	cb_back = NULL;
-}
-
-void halt_skas(void)
-{
-	initial_jmpbuf_lock();
-	UML_LONGJMP(&initial_jmpbuf, INIT_JMP_HALT);
-	/* unreachable */
-}
-
-static bool noreboot;
-
-static int __init noreboot_cmd_param(char *str, int *add)
-{
-	*add = 0;
-	noreboot = true;
-	return 0;
-}
-
-__uml_setup("noreboot", noreboot_cmd_param,
-"noreboot\n"
-"    Rather than rebooting, exit always, akin to QEMU's -no-reboot option.\n"
-"    This is useful if you're using CONFIG_PANIC_TIMEOUT in order to catch\n"
-"    crashes in CI\n\n");
-
-void reboot_skas(void)
-{
-	initial_jmpbuf_lock();
-	UML_LONGJMP(&initial_jmpbuf, noreboot ? INIT_JMP_HALT : INIT_JMP_REBOOT);
-	/* unreachable */
-}
-- 
2.43.0


From thehajime at gmail.com  Sat Nov  8 00:05:38 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sat,  8 Nov 2025 17:05:38 +0900
Subject: [PATCH v13 03/13] um: nommu: memory handling
In-Reply-To: <cover.1762588860.git.thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
Message-ID: <28512370a78b53783655667300bc4464fd338029.1762588860.git.thehajime@gmail.com>

This commit adds memory handling for UML in a !MMU environment.

The parts of the original UML code that rely on CONFIG_MMU are excluded
from compilation when !CONFIG_MMU.  Additionally, the generic
implementations of uaccess, futex, and memcpy/strnlen/strncpy can be
used, because user space and kernel space share one address space in
!CONFIG_MMU mode.
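
As a rough illustration of the mapping change in this patch (this is not
code from the series): without a backing memory file, the file mapping
becomes an anonymous one and the fd argument is -1.  The helper name
map_anon_shared() is hypothetical, and the sketch leaves out MAP_FIXED
so that it can run at any address.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper, not part of the patch: map `len` bytes of
 * anonymous shared memory.  With MAP_ANONYMOUS the fd argument is
 * ignored and portable code passes -1, which is also the value that
 * create_mem_file() returns in !CONFIG_MMU mode.
 */
static void *map_anon_shared(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}
```

A caller would treat a NULL return as failure and release the region
with munmap() when done; anonymous mappings start out zero-filled.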

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Ricardo Koller <ricarkol at google.com>
---
 arch/um/Makefile                  | 4 ++++
 arch/um/include/asm/futex.h       | 4 ++++
 arch/um/include/asm/mmu.h         | 3 +++
 arch/um/include/asm/mmu_context.h | 2 ++
 arch/um/include/asm/uaccess.h     | 7 ++++---
 arch/um/kernel/mem.c              | 3 ++-
 arch/um/os-Linux/mem.c            | 4 ++++
 arch/um/os-Linux/process.c        | 4 ++--
 8 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/um/Makefile b/arch/um/Makefile
index 7be0143b5ba3..5371c9a1b11e 100644
--- a/arch/um/Makefile
+++ b/arch/um/Makefile
@@ -46,6 +46,10 @@ ARCH_INCLUDE	:= -I$(srctree)/$(SHARED_HEADERS)
 ARCH_INCLUDE	+= -I$(srctree)/$(HOST_DIR)/um/shared
 KBUILD_CPPFLAGS += -I$(srctree)/$(HOST_DIR)/um
 
+ifneq ($(CONFIG_MMU),y)
+core-y += $(ARCH_DIR)/nommu/
+endif
+
 # -Dvmap=kernel_vmap prevents anything from referencing the libpcap.o symbol so
 # named - it's a common symbol in libpcap, so we get a binary which crashes.
 #
diff --git a/arch/um/include/asm/futex.h b/arch/um/include/asm/futex.h
index 780aa6bfc050..785fd6649aa2 100644
--- a/arch/um/include/asm/futex.h
+++ b/arch/um/include/asm/futex.h
@@ -7,8 +7,12 @@
 #include <asm/errno.h>
 
 
+#ifdef CONFIG_MMU
 int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr);
 int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr,
 			      u32 oldval, u32 newval);
+#else
+#include <asm-generic/futex.h>
+#endif
 
 #endif
diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index 82a919132aff..c0b9ce3215c4 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -22,10 +22,13 @@ typedef struct mm_context {
 	unsigned long sync_tlb_range_from;
 	unsigned long sync_tlb_range_to;
 
+#ifndef CONFIG_MMU
+	unsigned long   end_brk;
 #ifdef CONFIG_BINFMT_ELF_FDPIC
 	unsigned long   exec_fdpic_loadmap;
 	unsigned long   interp_fdpic_loadmap;
 #endif
+#endif /* !CONFIG_MMU */
 } mm_context_t;
 
 #define INIT_MM_CONTEXT(mm)						\
diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h
index c727e56ba116..528b217da285 100644
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -18,11 +18,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
 {
 }
 
+#ifdef CONFIG_MMU
 #define init_new_context init_new_context
 extern int init_new_context(struct task_struct *task, struct mm_struct *mm);
 
 #define destroy_context destroy_context
 extern void destroy_context(struct mm_struct *mm);
+#endif
 
 #include <asm-generic/mmu_context.h>
 
diff --git a/arch/um/include/asm/uaccess.h b/arch/um/include/asm/uaccess.h
index 0df9ea4abda8..031b357800b7 100644
--- a/arch/um/include/asm/uaccess.h
+++ b/arch/um/include/asm/uaccess.h
@@ -18,6 +18,7 @@
 #define __addr_range_nowrap(addr, size) \
 	((unsigned long) (addr) <= ((unsigned long) (addr) + (size)))
 
+#ifdef CONFIG_MMU
 extern unsigned long raw_copy_from_user(void *to, const void __user *from, unsigned long n);
 extern unsigned long raw_copy_to_user(void __user *to, const void *from, unsigned long n);
 extern unsigned long __clear_user(void __user *mem, unsigned long len);
@@ -29,9 +30,6 @@ static inline int __access_ok(const void __user *ptr, unsigned long size);
 
 #define INLINE_COPY_FROM_USER
 #define INLINE_COPY_TO_USER
-
-#include <asm-generic/uaccess.h>
-
 static inline int __access_ok(const void __user *ptr, unsigned long size)
 {
 	unsigned long addr = (unsigned long)ptr;
@@ -63,5 +61,8 @@ do {									\
 	barrier();							\
 	current->thread.segv_continue = NULL;				\
 } while (0)
+#endif
+
+#include <asm-generic/uaccess.h>
 
 #endif
diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index f3258680bfbe..e599b637c5fb 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -71,7 +71,8 @@ void __init arch_mm_preinit(void)
 	 * to be turned on.
 	 */
 	brk_end = PAGE_ALIGN((unsigned long) sbrk(0));
-	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 0);
+	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1,
+		   !IS_ENABLED(CONFIG_MMU));
 	memblock_free((void *)brk_end, uml_reserved - brk_end);
 	uml_reserved = brk_end;
 	min_low_pfn = PFN_UP(__pa(uml_reserved));
diff --git a/arch/um/os-Linux/mem.c b/arch/um/os-Linux/mem.c
index 72f302f4d197..4f5d9a94f8e2 100644
--- a/arch/um/os-Linux/mem.c
+++ b/arch/um/os-Linux/mem.c
@@ -213,6 +213,10 @@ int __init create_mem_file(unsigned long long len)
 {
 	int err, fd;
 
+	/* NOMMU kernel uses -1 as a fd for further use (e.g., mmap) */
+	if (!IS_ENABLED(CONFIG_MMU))
+		return -1;
+
 	fd = create_tmp_file(len);
 
 	err = os_set_exec_close(fd);
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index c50fa865d8c7..ddb5258d7720 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -100,8 +100,8 @@ int os_map_memory(void *virt, int fd, unsigned long long off, unsigned long len,
 	prot = (r ? PROT_READ : 0) | (w ? PROT_WRITE : 0) |
 		(x ? PROT_EXEC : 0);
 
-	loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED,
-		     fd, off);
+	loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED |
+		     (!IS_ENABLED(CONFIG_MMU) ? MAP_ANONYMOUS : 0), fd, off);
 	if (loc == MAP_FAILED)
 		return -errno;
 	return 0;
-- 
2.43.0


From thehajime at gmail.com  Sat Nov  8 00:05:39 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sat,  8 Nov 2025 17:05:39 +0900
Subject: [PATCH v13 04/13] x86/um: nommu: syscall handling
In-Reply-To: <cover.1762588860.git.thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
Message-ID: <b23f1464f32b0701b298b0f43bc0aa0e5de1e6f5.1762588860.git.thehajime@gmail.com>

This commit introduces a syscall entry point for !MMU mode.  The entry
function, __kernel_vsyscall, is a kernel-wide global symbol that is
accessible from any location.

Although it is outside the scope of this commit, the symbol can also be
exposed via a vdso image that is directly accessible from userspace.  A
standard library (i.e., libc) can use this entry point to implement its
syscall wrappers; it can also be reached by hooking the syscall
instructions of unmodified userspace applications/libraries, which is
implemented in a subsequent commit.

Only the 64-bit mode of the x86 architecture is supported.
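
The entry code in this patch points %rsp 168 bytes past current_ptregs
and then fills the register frame with pushes.  The sketch below shows
where that number comes from, assuming the usual x86-64 struct pt_regs
field order.  The struct and function names are hypothetical and are
not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative mirror of the register frame that __kernel_vsyscall
 * builds.  Fields are listed from the lowest address to the highest,
 * so the last field is the first one pushed.
 */
struct vsyscall_frame {
	uint64_t r15, r14, r13, r12, bp, bx;
	uint64_t r11, r10, r9, r8;
	uint64_t ax, cx, dx, si, di;
	uint64_t orig_ax;	/* syscall number, from %rax */
	uint64_t ip;		/* return address, from %rcx */
	uint64_t cs;
	uint64_t flags;
	uint64_t sp;		/* caller's stack pointer, saved via %r11 */
	uint64_t ss;
};

/* Byte offset from the frame base to the point where pushing starts. */
static size_t vsyscall_frame_top(void)
{
	return sizeof(struct vsyscall_frame);
}
```

With 21 slots of 8 bytes each, the first push (ss) lands at offset 160
and the last push (r15) at offset 0, which is the address later handed
to do_syscall_64().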

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Ricardo Koller <ricarkol at google.com>
---
 arch/x86/um/Makefile              |   4 ++
 arch/x86/um/asm/syscall.h         |   6 ++
 arch/x86/um/nommu/Makefile        |   8 +++
 arch/x86/um/nommu/do_syscall_64.c |  32 +++++++++
 arch/x86/um/nommu/entry_64.S      | 112 ++++++++++++++++++++++++++++++
 arch/x86/um/nommu/syscalls.h      |  16 +++++
 6 files changed, 178 insertions(+)
 create mode 100644 arch/x86/um/nommu/Makefile
 create mode 100644 arch/x86/um/nommu/do_syscall_64.c
 create mode 100644 arch/x86/um/nommu/entry_64.S
 create mode 100644 arch/x86/um/nommu/syscalls.h

diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile
index f9ea75bf43ac..39693807755a 100644
--- a/arch/x86/um/Makefile
+++ b/arch/x86/um/Makefile
@@ -31,6 +31,10 @@ obj-y += mem_64.o syscalls_64.o vdso/
 subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o \
 	../lib/memmove_64.o ../lib/memset_64.o
 
+ifneq ($(CONFIG_MMU),y)
+obj-y += nommu/
+endif
+
 endif
 
 subarch-$(CONFIG_MODULES) += ../kernel/module.o
diff --git a/arch/x86/um/asm/syscall.h b/arch/x86/um/asm/syscall.h
index d6208d0fad51..bb4f6f011667 100644
--- a/arch/x86/um/asm/syscall.h
+++ b/arch/x86/um/asm/syscall.h
@@ -20,4 +20,10 @@ static inline int syscall_get_arch(struct task_struct *task)
 #endif
 }
 
+#ifndef CONFIG_MMU
+extern void do_syscall_64(struct pt_regs *regs);
+extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
+			      int64_t a4, int64_t a5, int64_t a6);
+#endif
+
 #endif /* __UM_ASM_SYSCALL_H */
diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile
new file mode 100644
index 000000000000..d72c63afffa5
--- /dev/null
+++ b/arch/x86/um/nommu/Makefile
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+ifeq ($(CONFIG_X86_32),y)
+	BITS := 32
+else
+	BITS := 64
+endif
+
+obj-y = do_syscall_$(BITS).o entry_$(BITS).o
diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c
new file mode 100644
index 000000000000..292d7c578622
--- /dev/null
+++ b/arch/x86/um/nommu/do_syscall_64.c
@@ -0,0 +1,32 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/kernel.h>
+#include <linux/ptrace.h>
+#include <kern_util.h>
+#include <asm/syscall.h>
+#include <os.h>
+
+__visible void do_syscall_64(struct pt_regs *regs)
+{
+	int syscall;
+
+	syscall = PT_SYSCALL_NR(regs->regs.gp);
+	UPT_SYSCALL_NR(&regs->regs) = syscall;
+
+	if (likely(syscall < NR_syscalls)) {
+		unsigned long ret;
+
+		ret = (*sys_call_table[syscall])(UPT_SYSCALL_ARG1(&regs->regs),
+						 UPT_SYSCALL_ARG2(&regs->regs),
+						 UPT_SYSCALL_ARG3(&regs->regs),
+						 UPT_SYSCALL_ARG4(&regs->regs),
+						 UPT_SYSCALL_ARG5(&regs->regs),
+						 UPT_SYSCALL_ARG6(&regs->regs));
+		PT_REGS_SET_SYSCALL_RETURN(regs, ret);
+	}
+
+	PT_REGS_SYSCALL_RET(regs) = regs->regs.gp[HOST_AX];
+
+	/* handle tasks and signals at the end */
+	interrupt_end();
+}
diff --git a/arch/x86/um/nommu/entry_64.S b/arch/x86/um/nommu/entry_64.S
new file mode 100644
index 000000000000..485c578aae64
--- /dev/null
+++ b/arch/x86/um/nommu/entry_64.S
@@ -0,0 +1,112 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <asm/errno.h>
+
+#include <linux/linkage.h>
+#include <asm/percpu.h>
+#include <asm/desc.h>
+
+#include "../entry/calling.h"
+
+#ifdef CONFIG_SMP
+#error need to stash these variables somewhere else
+#endif
+
+#define UM_GLOBAL_VAR(x) .data; .align 8; .globl x; x:; .quad 0
+
+UM_GLOBAL_VAR(current_top_of_stack)
+UM_GLOBAL_VAR(current_ptregs)
+
+.code64
+.section .entry.text, "ax"
+
+.align 8
+#undef ENTRY
+#define ENTRY(x) .text; .globl x; .type x,%function; x:
+#undef END
+#define END(x)   .size x, . - x
+
+/*
+ * %rcx has the return address (we set it before entering __kernel_vsyscall).
+ *
+ * Registers on entry:
+ * rax  system call number
+ * rcx  return address
+ * rdi  arg0
+ * rsi  arg1
+ * rdx  arg2
+ * r10  arg3
+ * r8   arg4
+ * r9   arg5
+ *
+ * (note: we are allowed to clobber r11: it is a callee-clobbered
+ * register in the C ABI)
+ */
+ENTRY(__kernel_vsyscall)
+
+	movq	%rsp, %r11
+
+	/* Point rsp to the top of the ptregs array, so we can
+           just fill it with a bunch of push'es. */
+	movq	current_ptregs, %rsp
+
+	/* 8 bytes * 20 registers (plus 8 for the push) */
+	addq	$168, %rsp
+
+	/* Construct struct pt_regs on stack */
+	pushq	$0		/* pt_regs->ss (index 20) */
+	pushq   %r11		/* pt_regs->sp */
+	pushfq			/* pt_regs->flags */
+	pushq	$0		/* pt_regs->cs */
+	pushq	%rcx		/* pt_regs->ip */
+	pushq	%rax		/* pt_regs->orig_ax */
+
+	PUSH_AND_CLEAR_REGS rax=$-ENOSYS
+
+	mov %rsp, %rdi
+
+	/*
+	 * Switch to current top of stack, so "current->" points
+	 * to the right task.
+	 */
+	movq	current_top_of_stack, %rsp
+
+	call	do_syscall_64
+
+	jmp	userspace
+
+END(__kernel_vsyscall)
+
+/*
+ * common userspace returning routine
+ *
+ * all paths back to userspace (syscalls, signal handlers, umh processes)
+ * go through this routine to properly configure registers/stacks.
+ *
+ * void userspace(struct uml_pt_regs *regs)
+ */
+ENTRY(userspace)
+
+	/* clear direction flag to meet ABI */
+	cld
+	/* align the stack for x86_64 ABI */
+	and     $-0x10, %rsp
+	/* Handle any immediate reschedules or signals */
+	call	interrupt_end
+
+	movq	current_ptregs, %rsp
+
+	POP_REGS
+
+	addq	$8, %rsp	/* skip orig_ax */
+	popq	%rcx		/* pt_regs->ip */
+	addq	$8, %rsp	/* skip cs */
+	addq	$8, %rsp	/* skip flags */
+	popq	%rsp
+
+	/*
+	* not return w/ ret but w/ jmp as the stack is already popped before
+	* entering __kernel_vsyscall
+	*/
+	jmp	*%rcx
+
+END(userspace)
diff --git a/arch/x86/um/nommu/syscalls.h b/arch/x86/um/nommu/syscalls.h
new file mode 100644
index 000000000000..a2433756b1fc
--- /dev/null
+++ b/arch/x86/um/nommu/syscalls.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __UM_NOMMU_SYSCALLS_H
+#define __UM_NOMMU_SYSCALLS_H
+
+
+#define task_top_of_stack(task) \
+({									\
+	unsigned long __ptr = (unsigned long)task->stack;	\
+	__ptr += THREAD_SIZE;			\
+	__ptr;					\
+})
+
+extern long current_top_of_stack;
+extern long current_ptregs;
+
+#endif
-- 
2.43.0


From thehajime at gmail.com  Sat Nov  8 00:05:40 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sat,  8 Nov 2025 17:05:40 +0900
Subject: [PATCH v13 05/13] um: nommu: seccomp syscalls hook
In-Reply-To: <cover.1762588860.git.thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
Message-ID: <9e3438cf6d6c26a708c428267375b102b434d5d6.1762588860.git.thehajime@gmail.com>

This commit adds a syscall hook based on seccomp.

The seccomp filter raises SIGSYS in the UML process.  The signal is
captured in the (UML) kernel, which then jumps to the syscall entry
point, __kernel_vsyscall, to hook the original syscall instruction.

SIGSYS is raised only for syscalls executed from addresses between
uml_reserved and high_physmem, which is where userspace memory is
located.

It also renames the existing static function sigsys_handler() in
start_up.c to avoid a name conflict with the new handler.
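
The decision made by the filter can be sketched in plain C as follows.
Classic BPF loads 32-bit words, so the 64-bit instruction pointer has to
be compared as a high half and a low half.  This is only an illustration
of that logic, not code from the patch, and the function name
ip_is_trapped() is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when a syscall issued at `ip` would be trapped (SIGSYS),
 * i.e. when start <= ip < end, using only 32-bit comparisons in the
 * same order as the four checks in the filter.
 */
static bool ip_is_trapped(uint64_t ip, uint64_t start, uint64_t end)
{
	uint32_t ip_hi = (uint32_t)(ip >> 32), ip_lo = (uint32_t)ip;
	uint32_t start_hi = (uint32_t)(start >> 32), start_lo = (uint32_t)start;
	uint32_t end_hi = (uint32_t)(end >> 32), end_lo = (uint32_t)end;

	if (ip_hi > end_hi)
		return false;	/* above the range: allow */
	if (ip_hi == end_hi && ip_lo >= end_lo)
		return false;	/* at or above the end: allow */
	if (ip_hi < start_hi)
		return false;	/* below the range: allow */
	if (ip_hi == start_hi && ip_lo < start_lo)
		return false;	/* below the start: allow */
	return true;		/* inside the range: trap */
}
```

Syscalls issued by the UML kernel itself come from outside the range and
are allowed, while syscalls issued from userspace memory are trapped.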

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Kenichi Yasukata <kenichi.yasukata at gmail.com>
---
 arch/um/include/shared/kern_util.h    |  2 +
 arch/um/include/shared/os.h           | 10 +++
 arch/um/kernel/um_arch.c              |  3 +
 arch/um/nommu/Makefile                |  3 +
 arch/um/nommu/os-Linux/Makefile       |  7 +++
 arch/um/nommu/os-Linux/seccomp.c      | 87 +++++++++++++++++++++++++++
 arch/um/nommu/os-Linux/signal.c       | 16 +++++
 arch/um/os-Linux/signal.c             |  8 +++
 arch/um/os-Linux/start_up.c           |  4 +-
 arch/x86/um/nommu/Makefile            |  2 +-
 arch/x86/um/nommu/os-Linux/Makefile   |  6 ++
 arch/x86/um/nommu/os-Linux/mcontext.c | 15 +++++
 arch/x86/um/shared/sysdep/mcontext.h  |  4 ++
 13 files changed, 164 insertions(+), 3 deletions(-)
 create mode 100644 arch/um/nommu/Makefile
 create mode 100644 arch/um/nommu/os-Linux/Makefile
 create mode 100644 arch/um/nommu/os-Linux/seccomp.c
 create mode 100644 arch/um/nommu/os-Linux/signal.c
 create mode 100644 arch/x86/um/nommu/os-Linux/Makefile
 create mode 100644 arch/x86/um/nommu/os-Linux/mcontext.c

diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
index 38321188c04c..7798f16a4677 100644
--- a/arch/um/include/shared/kern_util.h
+++ b/arch/um/include/shared/kern_util.h
@@ -63,6 +63,8 @@ extern void segv_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs
 extern void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs,
 		  void *mc);
 extern void fatal_sigsegv(void) __attribute__ ((noreturn));
+extern void sigsys_handler(int sig, struct siginfo *si, struct uml_pt_regs *regs,
+			   void *mc);
 
 void um_idle_sleep(void);
 
diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index b26e94292fc1..5451f9b1f41e 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -356,4 +356,14 @@ static inline void os_local_ipi_enable(void) { }
 static inline void os_local_ipi_disable(void) { }
 #endif /* CONFIG_SMP */
 
+/* seccomp.c */
+#ifdef CONFIG_MMU
+static inline int os_setup_seccomp(void)
+{
+	return 0;
+}
+#else
+extern int os_setup_seccomp(void);
+#endif
+
 #endif
diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c
index e2b24e1ecfa6..27c13423d9aa 100644
--- a/arch/um/kernel/um_arch.c
+++ b/arch/um/kernel/um_arch.c
@@ -423,6 +423,9 @@ void __init setup_arch(char **cmdline_p)
 		add_bootloader_randomness(rng_seed, sizeof(rng_seed));
 		memzero_explicit(rng_seed, sizeof(rng_seed));
 	}
+
+	/* install seccomp filter */
+	os_setup_seccomp();
 }
 
 void __init arch_cpu_finalize_init(void)
diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile
new file mode 100644
index 000000000000..baab7c2f57c2
--- /dev/null
+++ b/arch/um/nommu/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y := os-Linux/
diff --git a/arch/um/nommu/os-Linux/Makefile b/arch/um/nommu/os-Linux/Makefile
new file mode 100644
index 000000000000..805e26ccf63b
--- /dev/null
+++ b/arch/um/nommu/os-Linux/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y := seccomp.o signal.o
+USER_OBJS := $(obj-y)
+
+include $(srctree)/arch/um/scripts/Makefile.rules
+USER_CFLAGS+=-I$(srctree)/arch/um/os-Linux
diff --git a/arch/um/nommu/os-Linux/seccomp.c b/arch/um/nommu/os-Linux/seccomp.c
new file mode 100644
index 000000000000..d1cfa6e3d632
--- /dev/null
+++ b/arch/um/nommu/os-Linux/seccomp.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <errno.h>
+#include <signal.h>
+#include <sys/prctl.h>
+#include <sys/syscall.h>   /* For SYS_xxx definitions */
+#include <init.h>
+#include <as-layout.h>
+#include <os.h>
+#include <linux/filter.h>
+#include <linux/seccomp.h>
+
+int __init os_setup_seccomp(void)
+{
+	int err;
+	unsigned long __userspace_start = uml_reserved,
+		__userspace_end = high_physmem;
+
+	struct sock_filter filter[] = {
+		/* if (IP_high > __userspace_end) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JGT + BPF_K, __userspace_end >> 32,
+			 /*true-skip=*/0, /*false-skip=*/1),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* if (IP_high == __userspace_end && IP_low >= __userspace_end) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __userspace_end >> 32,
+			 /*true-skip=*/0, /*false-skip=*/3),
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer)),
+		BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_end,
+			 /*true-skip=*/0, /*false-skip=*/1),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* if (IP_high < __userspace_start) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_start >> 32,
+			 /*true-skip=*/1, /*false-skip=*/0),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* if (IP_high == __userspace_start && IP_low < __userspace_start) allow; */
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer) + 4),
+		BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __userspace_start >> 32,
+			 /*true-skip=*/0, /*false-skip=*/3),
+		BPF_STMT(BPF_LD + BPF_W + BPF_ABS,
+			 offsetof(struct seccomp_data, instruction_pointer)),
+		BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_start,
+			 /*true-skip=*/1, /*false-skip=*/0),
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
+
+		/* other address; trap  */
+		BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRAP),
+	};
+	struct sock_fprog prog = {
+		.len = ARRAY_SIZE(filter),
+		.filter = filter,
+	};
+
+	err = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+	if (err)
+		os_warn("PR_SET_NO_NEW_PRIVS (err=%d, errno=%d)\n",
+		       err, errno);
+
+	err = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
+		      SECCOMP_FILTER_FLAG_TSYNC, &prog);
+	if (err) {
+		os_warn("SECCOMP_SET_MODE_FILTER (err=%d, errno=%d)\n",
+		       err, errno);
+		exit(1);
+	}
+
+	set_handler(SIGSYS);
+
+	os_info("seccomp: setup filter syscalls in the range: 0x%lx-0x%lx\n",
+		__userspace_start, __userspace_end);
+
+	return 0;
+}
+
diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-Linux/signal.c
new file mode 100644
index 000000000000..19043b9652e2
--- /dev/null
+++ b/arch/um/nommu/os-Linux/signal.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <signal.h>
+#include <kern_util.h>
+#include <os.h>
+#include <sysdep/mcontext.h>
+#include <sys/ucontext.h>
+
+void sigsys_handler(int sig, struct siginfo *si,
+		    struct uml_pt_regs *regs, void *ptr)
+{
+	mcontext_t *mc = (mcontext_t *) ptr;
+
+	/* hook syscall via SIGSYS */
+	set_mc_sigsys_hook(mc);
+}
diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
index 327fb3c52fc7..2f6795cd884c 100644
--- a/arch/um/os-Linux/signal.c
+++ b/arch/um/os-Linux/signal.c
@@ -20,6 +20,7 @@
 #include <um_malloc.h>
 #include <sys/ucontext.h>
 #include <timetravel.h>
+#include <linux/compiler_attributes.h>
 #include "internal.h"
 
 void (*sig_info[NSIG])(int, struct siginfo *, struct uml_pt_regs *, void *mc) = {
@@ -31,6 +32,7 @@ void (*sig_info[NSIG])(int, struct siginfo *, struct uml_pt_regs *, void *mc) =
 	[SIGSEGV]	= segv_handler,
 	[SIGIO]		= sigio_handler,
 	[SIGCHLD]	= sigchld_handler,
+	[SIGSYS]	= sigsys_handler,
 };
 
 static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc)
@@ -182,6 +184,11 @@ static void sigusr1_handler(int sig, struct siginfo *unused_si, mcontext_t *mc)
 	uml_pm_wake();
 }
 
+__weak void sigsys_handler(int sig, struct siginfo *unused_si,
+			   struct uml_pt_regs *regs, void *mc)
+{
+}
+
 void register_pm_wake_signal(void)
 {
 	set_handler(SIGUSR1);
@@ -193,6 +200,7 @@ static void (*handlers[_NSIG])(int sig, struct siginfo *si, mcontext_t *mc) = {
 	[SIGILL] = sig_handler,
 	[SIGFPE] = sig_handler,
 	[SIGTRAP] = sig_handler,
+	[SIGSYS] = sig_handler,
 
 	[SIGIO] = sig_handler,
 	[SIGWINCH] = sig_handler,
diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c
index 054ac03bbf5e..33e039d2c1bf 100644
--- a/arch/um/os-Linux/start_up.c
+++ b/arch/um/os-Linux/start_up.c
@@ -239,7 +239,7 @@ extern unsigned long *exec_fp_regs;
 
 __initdata static struct stub_data *seccomp_test_stub_data;
 
-static void __init sigsys_handler(int sig, siginfo_t *info, void *p)
+static void __init _sigsys_handler(int sig, siginfo_t *info, void *p)
 {
 	ucontext_t *uc = p;
 
@@ -274,7 +274,7 @@ static int __init seccomp_helper(void *data)
 			sizeof(seccomp_test_stub_data->sigstack));
 
 	sa.sa_flags = SA_ONSTACK | SA_NODEFER | SA_SIGINFO;
-	sa.sa_sigaction = (void *) sigsys_handler;
+	sa.sa_sigaction = (void *) _sigsys_handler;
 	sa.sa_restorer = NULL;
 	if (sigaction(SIGSYS, &sa, NULL) < 0)
 		exit(2);
diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile
index d72c63afffa5..ebe47d4836f4 100644
--- a/arch/x86/um/nommu/Makefile
+++ b/arch/x86/um/nommu/Makefile
@@ -5,4 +5,4 @@ else
 	BITS := 64
 endif
 
-obj-y = do_syscall_$(BITS).o entry_$(BITS).o
+obj-y = do_syscall_$(BITS).o entry_$(BITS).o os-Linux/
diff --git a/arch/x86/um/nommu/os-Linux/Makefile b/arch/x86/um/nommu/os-Linux/Makefile
new file mode 100644
index 000000000000..4571e403a6ff
--- /dev/null
+++ b/arch/x86/um/nommu/os-Linux/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-y = mcontext.o
+USER_OBJS := mcontext.o
+
+include $(srctree)/arch/um/scripts/Makefile.rules
diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c
new file mode 100644
index 000000000000..b62a6195096f
--- /dev/null
+++ b/arch/x86/um/nommu/os-Linux/mcontext.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <sys/ucontext.h>
+#define __FRAME_OFFSETS
+#include <asm/ptrace.h>
+#include <sysdep/ptrace.h>
+#include <sysdep/mcontext.h>
+
+extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
+			      int64_t a4, int64_t a5, int64_t a6);
+
+void set_mc_sigsys_hook(mcontext_t *mc)
+{
+	mc->gregs[REG_RCX] = mc->gregs[REG_RIP];
+	mc->gregs[REG_RIP] = (unsigned long) __kernel_vsyscall;
+}
diff --git a/arch/x86/um/shared/sysdep/mcontext.h b/arch/x86/um/shared/sysdep/mcontext.h
index 6fe490cc5b98..9a0d6087f357 100644
--- a/arch/x86/um/shared/sysdep/mcontext.h
+++ b/arch/x86/um/shared/sysdep/mcontext.h
@@ -17,6 +17,10 @@ extern int get_stub_state(struct uml_pt_regs *regs, struct stub_data *data,
 extern int set_stub_state(struct uml_pt_regs *regs, struct stub_data *data,
 			  int single_stepping);
 
+#ifndef CONFIG_MMU
+extern void set_mc_sigsys_hook(mcontext_t *mc);
+#endif
+
 #ifdef __i386__
 
 #define GET_FAULTINFO_FROM_MC(fi, mc) \
-- 
2.43.0


From thehajime at gmail.com  Sat Nov  8 00:05:41 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sat,  8 Nov 2025 17:05:41 +0900
Subject: [PATCH v13 06/13] x86/um: nommu: process/thread handling
In-Reply-To: <cover.1762588860.git.thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
Message-ID: <ad45749c65e4d7316b1d52b0ef0ba3553c2f3ed9.1762588860.git.thehajime@gmail.com>

Since the ptrace facility isn't used in !MMU mode of UML, processes and
threads are invoked through a different code path: no external process
is used, and some registers (e.g., the fs segment register for TLS)
need to be properly configured on every context switch.

Signals aren't delivered on non-ptrace syscall entry/exit, so we also
need to handle pending signals ourselves.

ptrace-related syscalls are not tested yet, so arch_has_single_step()
is marked as unsupported in the !MMU environment.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Ricardo Koller <ricarkol at google.com>
---
 arch/um/include/asm/ptrace-generic.h |  2 +-
 arch/x86/um/Makefile                 |  3 +-
 arch/x86/um/nommu/Makefile           |  2 +-
 arch/x86/um/nommu/entry_64.S         |  2 ++
 arch/x86/um/nommu/syscalls.h         |  2 ++
 arch/x86/um/nommu/syscalls_64.c      | 50 ++++++++++++++++++++++++++++
 6 files changed, 58 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/um/nommu/syscalls_64.c

diff --git a/arch/um/include/asm/ptrace-generic.h b/arch/um/include/asm/ptrace-generic.h
index 62e9916078ec..5aa38fe6b2fb 100644
--- a/arch/um/include/asm/ptrace-generic.h
+++ b/arch/um/include/asm/ptrace-generic.h
@@ -14,7 +14,7 @@ struct pt_regs {
 	struct uml_pt_regs regs;
 };
 
-#define arch_has_single_step()	(1)
+#define arch_has_single_step()	(IS_ENABLED(CONFIG_MMU))
 
 #define EMPTY_REGS { .regs = EMPTY_UML_PT_REGS }
 
diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile
index 39693807755a..98dc57afff83 100644
--- a/arch/x86/um/Makefile
+++ b/arch/x86/um/Makefile
@@ -26,7 +26,8 @@ subarch-y += ../kernel/sys_ia32.o
 
 else
 
-obj-y += mem_64.o syscalls_64.o vdso/
+obj-y += mem_64.o vdso/
+obj-$(CONFIG_MMU) += syscalls_64.o
 
 subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o \
 	../lib/memmove_64.o ../lib/memset_64.o
diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile
index ebe47d4836f4..4018d9e0aba0 100644
--- a/arch/x86/um/nommu/Makefile
+++ b/arch/x86/um/nommu/Makefile
@@ -5,4 +5,4 @@ else
 	BITS := 64
 endif
 
-obj-y = do_syscall_$(BITS).o entry_$(BITS).o os-Linux/
+obj-y = do_syscall_$(BITS).o entry_$(BITS).o syscalls_$(BITS).o os-Linux/
diff --git a/arch/x86/um/nommu/entry_64.S b/arch/x86/um/nommu/entry_64.S
index 485c578aae64..a58922fc81e5 100644
--- a/arch/x86/um/nommu/entry_64.S
+++ b/arch/x86/um/nommu/entry_64.S
@@ -86,6 +86,8 @@ END(__kernel_vsyscall)
  */
 ENTRY(userspace)
 
+	/* set stack and pt_regs to the current task */
+	call	arch_set_stack_to_current
 	/* clear direction flag to meet ABI */
 	cld
 	/* align the stack for x86_64 ABI */
diff --git a/arch/x86/um/nommu/syscalls.h b/arch/x86/um/nommu/syscalls.h
index a2433756b1fc..ce16bf8abd59 100644
--- a/arch/x86/um/nommu/syscalls.h
+++ b/arch/x86/um/nommu/syscalls.h
@@ -13,4 +13,6 @@
 extern long current_top_of_stack;
 extern long current_ptregs;
 
+void arch_set_stack_to_current(void);
+
 #endif
diff --git a/arch/x86/um/nommu/syscalls_64.c b/arch/x86/um/nommu/syscalls_64.c
new file mode 100644
index 000000000000..d56027ebc651
--- /dev/null
+++ b/arch/x86/um/nommu/syscalls_64.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2003 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
+ * Copyright 2003 PathScale, Inc.
+ *
+ * Licensed under the GPL
+ */
+
+#include <linux/sched.h>
+#include <linux/sched/mm.h>
+#include <linux/syscalls.h>
+#include <linux/uaccess.h>
+#include <asm/prctl.h> /* XXX This should get the constants from libc */
+#include <registers.h>
+#include <os.h>
+#include "syscalls.h"
+
+void arch_set_stack_to_current(void)
+{
+	current_top_of_stack = task_top_of_stack(current);
+	current_ptregs = (long)task_pt_regs(current);
+}
+
+void arch_switch_to(struct task_struct *to)
+{
+	/*
+	 * ptrace isn't used with !CONFIG_MMU, so the stack pointers and
+	 * the FS_BASE register are configured here.
+	 */
+	current_top_of_stack = task_top_of_stack(to);
+	current_ptregs = (long)task_pt_regs(to);
+
+	if ((to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] == 0) ||
+	    (to->mm == NULL))
+		return;
+
+	/* this changes the FS on every context switch */
+	arch_prctl(to, ARCH_SET_FS,
+		   (void __user *) to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)]);
+}
+
+SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
+		unsigned long, prot, unsigned long, flags,
+		unsigned long, fd, unsigned long, off)
+{
+	if (off & ~PAGE_MASK)
+		return -EINVAL;
+
+	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
+}
-- 
2.43.0


From thehajime at gmail.com  Sat Nov  8 00:05:42 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sat,  8 Nov 2025 17:05:42 +0900
Subject: [PATCH v13 07/13] um: nommu: configure fs register on host syscall invocation
In-Reply-To: <cover.1762588860.git.thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
Message-ID: <5b4fab636ab8cbd1db025a0561fe9993990fc869.1762588860.git.thehajime@gmail.com>

As userspace on UML/!MMU also needs to configure the %fs register while
it is running in order to access its thread structure correctly, host
syscalls implemented in os-Linux drivers may be confused when they are
called with the userspace value.  Thus the %fs register has to be
reconfigured via arch_prctl(ARCH_SET_FS) on every host syscall.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Ricardo Koller <ricarkol at google.com>
---
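The %fs switching described above can be illustrated from plain host
userspace.  The following is only a minimal sketch under assumptions,
not the code of this patch: it targets x86-64 Linux, the helper names
get_fs_base(), set_fs_base(), call_with_host_fs() and host_getpid() are
hypothetical, and the "guest" value used in the example is simply the
current FS base, so nothing is actually changed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for syscall() */
#endif
#include <assert.h>
#include <unistd.h>
#include <sys/syscall.h>	/* SYS_arch_prctl */
#include <asm/prctl.h>		/* ARCH_GET_FS, ARCH_SET_FS (x86 only) */

/* hypothetical helper: read the FS base of the calling thread */
int get_fs_base(unsigned long *base)
{
	return (int)syscall(SYS_arch_prctl, ARCH_GET_FS, base);
}

/* hypothetical helper: write the FS base of the calling thread */
int set_fs_base(unsigned long base)
{
	return (int)syscall(SYS_arch_prctl, ARCH_SET_FS, base);
}

/*
 * hypothetical helper: run fn() with the host FS base installed, then
 * put the guest FS base back.  This mirrors the "switch on entry,
 * restore on exit" pairing, nothing more.
 */
long call_with_host_fs(unsigned long host_fs, unsigned long guest_fs,
		       long (*fn)(void))
{
	long ret;

	if (set_fs_base(host_fs))
		return -1;
	ret = fn();		/* host libc sees its own TLS here */
	if (set_fs_base(guest_fs))
		return -1;
	return ret;
}

/* example host call: a libc function that may rely on TLS */
long host_getpid(void)
{
	return (long)getpid();
}
```

A caller would read the FS base once with get_fs_base() and pass it as
both host_fs and guest_fs to call_with_host_fs(); with two different
values, the guest one must point to a valid thread area before any libc
code runs again.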
 arch/um/include/shared/os.h       |  6 +++
 arch/um/os-Linux/process.c        |  6 +++
 arch/um/os-Linux/start_up.c       | 21 +++++++++
 arch/x86/um/nommu/do_syscall_64.c | 37 ++++++++++++++++
 arch/x86/um/nommu/syscalls_64.c   | 71 +++++++++++++++++++++++++++++++
 5 files changed, 141 insertions(+)

diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index 5451f9b1f41e..0ac87507e05e 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -189,6 +189,7 @@ extern void check_host_supports_tls(int *supports_tls, int *tls_min);
 extern void get_host_cpu_features(
 	void (*flags_helper_func)(char *line),
 	void (*cache_helper_func)(char *line));
+extern int host_has_fsgsbase;
 
 /* mem.c */
 extern int create_mem_file(unsigned long long len);
@@ -213,6 +214,11 @@ extern int os_protect_memory(void *addr, unsigned long len,
 extern int os_unmap_memory(void *addr, int len);
 extern int os_drop_memory(void *addr, int length);
 extern int can_drop_memory(void);
+extern int os_arch_prctl(int pid, int option, unsigned long *arg);
+#ifndef CONFIG_MMU
+extern long long host_fs;
+#endif
+
 
 void os_set_pdeathsig(void);
 
diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c
index ddb5258d7720..dacf63ac33c8 100644
--- a/arch/um/os-Linux/process.c
+++ b/arch/um/os-Linux/process.c
@@ -18,6 +18,7 @@
 #include <sys/prctl.h>
 #include <sys/wait.h>
 #include <asm/unistd.h>
+#include <sys/syscall.h>   /* For SYS_xxx definitions */
 #include <linux/threads.h>
 #include <init.h>
 #include <longjmp.h>
@@ -179,6 +180,11 @@ int __init can_drop_memory(void)
 	return ok;
 }
 
+int os_arch_prctl(int pid, int option, unsigned long *arg2)
+{
+	return syscall(SYS_arch_prctl, option, arg2);
+}
+
 void init_new_thread_signals(void)
 {
 	set_handler(SIGSEGV);
diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c
index 33e039d2c1bf..c0afe5d8b559 100644
--- a/arch/um/os-Linux/start_up.c
+++ b/arch/um/os-Linux/start_up.c
@@ -20,6 +20,8 @@
 #include <sys/resource.h>
 #include <asm/ldt.h>
 #include <asm/unistd.h>
+#include <sys/auxv.h>
+#include <asm/hwcap2.h>
 #include <init.h>
 #include <os.h>
 #include <smp.h>
@@ -37,6 +39,8 @@
 #include <skas.h>
 #include "internal.h"
 
+int host_has_fsgsbase;
+
 static void ptrace_child(void)
 {
 	int ret;
@@ -460,6 +464,20 @@ __uml_setup("seccomp=", uml_seccomp_config,
 "    This is insecure and should only be used with a trusted userspace\n\n"
 );
 
+static void __init check_fsgsbase(void)
+{
+	unsigned long auxv = getauxval(AT_HWCAP2);
+
+	os_info("Checking FSGSBASE instructions...");
+	if (auxv & HWCAP2_FSGSBASE) {
+		host_has_fsgsbase = 1;
+		os_info("OK\n");
+	} else {
+		host_has_fsgsbase = 0;
+		os_info("disabled\n");
+	}
+}
+
 void __init os_early_checks(void)
 {
 	int pid;
@@ -488,6 +506,9 @@ void __init os_early_checks(void)
 	using_seccomp = 0;
 	check_ptrace();
 
+	/* probe fsgsbase instruction */
+	check_fsgsbase();
+
 	pid = start_ptraced_child();
 	if (init_pid_registers(pid))
 		fatal("Failed to initialize default registers");
diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c
index 292d7c578622..9bc630995df9 100644
--- a/arch/x86/um/nommu/do_syscall_64.c
+++ b/arch/x86/um/nommu/do_syscall_64.c
@@ -2,10 +2,38 @@
 
 #include <linux/kernel.h>
 #include <linux/ptrace.h>
+#include <asm/fsgsbase.h>
+#include <asm/prctl.h>
 #include <kern_util.h>
 #include <asm/syscall.h>
 #include <os.h>
 
+static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2)
+{
+	if (!host_has_fsgsbase)
+		return os_arch_prctl(pid, option, arg2);
+
+	switch (option) {
+	case ARCH_SET_FS:
+		wrfsbase(*arg2);
+		break;
+	case ARCH_SET_GS:
+		wrgsbase(*arg2);
+		break;
+	case ARCH_GET_FS:
+		*arg2 = rdfsbase();
+		break;
+	case ARCH_GET_GS:
+		*arg2 = rdgsbase();
+		break;
+	default:
+		pr_warn("%s: unsupported option: 0x%x", __func__, option);
+		break;
+	}
+
+	return 0;
+}
+
 __visible void do_syscall_64(struct pt_regs *regs)
 {
 	int syscall;
@@ -13,6 +41,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
 	syscall = PT_SYSCALL_NR(regs->regs.gp);
 	UPT_SYSCALL_NR(&regs->regs) = syscall;
 
+	/* set fs register to the original host one */
+	os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
+
 	if (likely(syscall < NR_syscalls)) {
 		unsigned long ret;
 
@@ -29,4 +60,10 @@ __visible void do_syscall_64(struct pt_regs *regs)
 
 	/* handle tasks and signals at the end */
 	interrupt_end();
+
+	/* restore back fs register to userspace configured one */
+	os_x86_arch_prctl(0, ARCH_SET_FS,
+		      (void *)(current->thread.regs.regs.gp[FS_BASE
+						     / sizeof(unsigned long)]));
+
 }
diff --git a/arch/x86/um/nommu/syscalls_64.c b/arch/x86/um/nommu/syscalls_64.c
index d56027ebc651..19d23686fc5b 100644
--- a/arch/x86/um/nommu/syscalls_64.c
+++ b/arch/x86/um/nommu/syscalls_64.c
@@ -13,8 +13,70 @@
 #include <asm/prctl.h> /* XXX This should get the constants from libc */
 #include <registers.h>
 #include <os.h>
+#include <asm/thread_info.h>
+#include <asm/mman.h>
 #include "syscalls.h"
 
+/*
+ * The guest libc can change FS, which confuses the host libc.
+ * In fact, changing FS directly is not supported (check
+ * man arch_prctl). So, whenever we make a host syscall,
+ * we should be changing FS to the original FS (not the
+ * one set by the guest libc). This original FS is stored
+ * in host_fs.
+ */
+long long host_fs = -1;
+
+long arch_prctl(struct task_struct *task, int option,
+		unsigned long __user *arg2)
+{
+	long ret = -EINVAL;
+	unsigned long *ptr = arg2, tmp;
+
+	switch (option) {
+	case ARCH_SET_FS:
+		if (host_fs == -1)
+			os_arch_prctl(0, ARCH_GET_FS, (void *)&host_fs);
+		ret = 0;
+		break;
+	case ARCH_SET_GS:
+		ret = 0;
+		break;
+	case ARCH_GET_FS:
+	case ARCH_GET_GS:
+		ptr = &tmp;
+		break;
+	}
+
+	ret = os_arch_prctl(0, option, ptr);
+	if (ret)
+		return ret;
+
+	switch (option) {
+	case ARCH_SET_FS:
+		current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] =
+			(unsigned long) arg2;
+		break;
+	case ARCH_SET_GS:
+		current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)] =
+			(unsigned long) arg2;
+		break;
+	case ARCH_GET_FS:
+		ret = put_user(current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)], arg2);
+		break;
+	case ARCH_GET_GS:
+		ret = put_user(current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)], arg2);
+		break;
+	}
+
+	return ret;
+}
+
+SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2)
+{
+	return arch_prctl(current, option, (unsigned long __user *) arg2);
+}
+
 void arch_set_stack_to_current(void)
 {
 	current_top_of_stack = task_top_of_stack(current);
@@ -48,3 +110,12 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
 
 	return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
 }
+
+static int __init um_nommu_setup_hostfs(void)
+{
+	/* initialize the host_fs value at boottime */
+	os_arch_prctl(0, ARCH_GET_FS, (void *)&host_fs);
+
+	return 0;
+}
+arch_initcall(um_nommu_setup_hostfs);
-- 
2.43.0


From thehajime at gmail.com  Sat Nov  8 00:05:43 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sat,  8 Nov 2025 17:05:43 +0900
Subject: [PATCH v13 08/13] x86/um/vdso: nommu: vdso memory update
In-Reply-To: <cover.1762588860.git.thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
Message-ID: <a689fc371d0207ba83c9e0d2e9335be7095b73be.1762588860.git.thehajime@gmail.com>

In !MMU mode, the address of the vdso is accessible from userspace.
This commit implements the entry point by pointing it at the address of
the allocated vdso page.

This commit also configures the memory permissions of the vdso page so
that it is executable.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Ricardo Koller <ricarkol at google.com>
---
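The permission change described above can be sketched on the host side
as follows.  This is an illustration under assumptions, not the code of
this patch: it assumes that os_protect_memory(addr, len, 1, 0, 1)
amounts to an mprotect() call with read and execute permission, and
map_exec_image() is a hypothetical helper name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for MAP_ANONYMOUS */
#endif
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

/*
 * hypothetical helper: copy an image into a fresh page and make that
 * page read+execute, similar in spirit to copy_page() followed by the
 * permission change in init_vdso().
 * Returns the page address, or NULL on failure.
 */
void *map_exec_image(const void *image, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	void *p;

	if (len > page)		/* the image must fit in a single page */
		return NULL;

	p = mmap(NULL, page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	memcpy(p, image, len);

	/* drop write permission and add execute: r=1, w=0, x=1 */
	if (mprotect(p, page, PROT_READ | PROT_EXEC)) {
		munmap(p, page);
		return NULL;
	}
	return p;
}
```

The page is populated while it is still writable and only then switched
to read+execute, so the sketch never needs a mapping that is writable
and executable at the same time.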
 arch/x86/um/vdso/vma.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c
index 51a2b9f2eca9..0799b3fe7521 100644
--- a/arch/x86/um/vdso/vma.c
+++ b/arch/x86/um/vdso/vma.c
@@ -9,6 +9,7 @@
 #include <asm/page.h>
 #include <asm/elf.h>
 #include <linux/init.h>
+#include <os.h>
 
 unsigned long um_vdso_addr;
 static struct page *um_vdso;
@@ -20,18 +21,29 @@ static int __init init_vdso(void)
 {
 	BUG_ON(vdso_end - vdso_start > PAGE_SIZE);
 
-	um_vdso_addr = task_size - PAGE_SIZE;
-
 	um_vdso = alloc_page(GFP_KERNEL);
 	if (!um_vdso)
 		panic("Cannot allocate vdso\n");
 
 	copy_page(page_address(um_vdso), vdso_start);
 
+#ifdef CONFIG_MMU
+	um_vdso_addr = task_size - PAGE_SIZE;
+#else
+	/* this is fine with NOMMU as everything is accessible */
+	um_vdso_addr = (unsigned long)page_address(um_vdso);
+	os_protect_memory((void *)um_vdso_addr, vdso_end - vdso_start, 1, 0, 1);
+#endif
+
+	pr_info("vdso_start=%lx um_vdso_addr=%lx pg_um_vdso=%lx",
+	       (unsigned long)vdso_start, um_vdso_addr,
+	       (unsigned long)page_address(um_vdso));
+
 	return 0;
 }
 subsys_initcall(init_vdso);
 
+#ifdef CONFIG_MMU
 int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 {
 	struct vm_area_struct *vma;
@@ -53,3 +65,4 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 
 	return IS_ERR(vma) ? PTR_ERR(vma) : 0;
 }
+#endif
-- 
2.43.0


From thehajime at gmail.com  Sat Nov  8 00:05:44 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sat,  8 Nov 2025 17:05:44 +0900
Subject: [PATCH v13 09/13] x86/um: nommu: signal handling
In-Reply-To: <cover.1762588860.git.thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
Message-ID: <c3d41609286993562214359a6158997e9de06551.1762588860.git.thehajime@gmail.com>

This commit updates signal handling in the !MMU environment.  It adds
alignment code for the signal frame, as the frame is used as-is in
userspace.

Floating point registers are carefully saved and restored on syscall
entry/exit so that signal handlers can read and write their contents.

It also adds a follow-up routine for SIGSEGV: signal delivery runs in
the same stack frame, so we have to avoid an endless SIGSEGV loop.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
---
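In trap.c of this patch, segv_handler() and relay_signal() decide
whether a fault came from userspace by comparing the instruction
pointer with uml_reserved and high_physmem.  A minimal standalone
sketch of that check, with ip_is_user() as a hypothetical helper name
and the bounds passed in as plain arguments:

```c
#include <assert.h>

/*
 * hypothetical helper mirroring the range check in segv_handler() and
 * relay_signal(): an instruction pointer strictly between the two
 * bounds is treated as userspace code.
 */
int ip_is_user(unsigned long ip, unsigned long user_start,
	       unsigned long user_end)
{
	return ip > user_start && ip < user_end;
}
```

Both bounds are exclusive in this form, so an instruction pointer equal
to either bound is not treated as userspace.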
 arch/um/include/shared/kern_util.h    |   4 +
 arch/um/nommu/Makefile                |   2 +-
 arch/um/nommu/os-Linux/signal.c       |   8 +
 arch/um/nommu/trap.c                  | 201 ++++++++++++++++++++++++++
 arch/um/os-Linux/signal.c             |   3 +-
 arch/x86/um/nommu/do_syscall_64.c     |   6 +
 arch/x86/um/nommu/os-Linux/mcontext.c |  11 ++
 arch/x86/um/shared/sysdep/mcontext.h  |   1 +
 arch/x86/um/shared/sysdep/ptrace.h    |   2 +-
 9 files changed, 235 insertions(+), 3 deletions(-)
 create mode 100644 arch/um/nommu/trap.c

diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h
index 7798f16a4677..46c8d6336ca1 100644
--- a/arch/um/include/shared/kern_util.h
+++ b/arch/um/include/shared/kern_util.h
@@ -70,4 +70,8 @@ void um_idle_sleep(void);
 
 void kasan_map_memory(void *start, size_t len);
 
+#ifndef CONFIG_MMU
+extern void nommu_relay_signal(void *ptr);
+#endif
+
 #endif
diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile
index baab7c2f57c2..096221590cfd 100644
--- a/arch/um/nommu/Makefile
+++ b/arch/um/nommu/Makefile
@@ -1,3 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 
-obj-y := os-Linux/
+obj-y := trap.o os-Linux/
diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-Linux/signal.c
index 19043b9652e2..6febb178dcda 100644
--- a/arch/um/nommu/os-Linux/signal.c
+++ b/arch/um/nommu/os-Linux/signal.c
@@ -5,6 +5,7 @@
 #include <os.h>
 #include <sysdep/mcontext.h>
 #include <sys/ucontext.h>
+#include <as-layout.h>
 
 void sigsys_handler(int sig, struct siginfo *si,
 		    struct uml_pt_regs *regs, void *ptr)
@@ -14,3 +15,10 @@ void sigsys_handler(int sig, struct siginfo *si,
 	/* hook syscall via SIGSYS */
 	set_mc_sigsys_hook(mc);
 }
+
+void nommu_relay_signal(void *ptr)
+{
+	mcontext_t *mc = (mcontext_t *) ptr;
+
+	set_mc_relay_signal(mc);
+}
diff --git a/arch/um/nommu/trap.c b/arch/um/nommu/trap.c
new file mode 100644
index 000000000000..430297517455
--- /dev/null
+++ b/arch/um/nommu/trap.c
@@ -0,0 +1,201 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/hardirq.h>
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/sched/debug.h>
+#include <asm/current.h>
+#include <asm/tlbflush.h>
+#include <arch.h>
+#include <as-layout.h>
+#include <kern_util.h>
+#include <os.h>
+#include <skas.h>
+
+/*
+ * Note this is constrained to return 0, -EFAULT, -EACCES, -ENOMEM by
+ * segv().
+ */
+int handle_page_fault(unsigned long address, unsigned long ip,
+		      int is_write, int is_user, int *code_out)
+{
+	/* !MMU has no pagefault */
+	return -EFAULT;
+}
+
+static void show_segv_info(struct uml_pt_regs *regs)
+{
+	struct task_struct *tsk = current;
+	struct faultinfo *fi = UPT_FAULTINFO(regs);
+
+	if (!unhandled_signal(tsk, SIGSEGV))
+		return;
+
+	pr_warn_ratelimited("%s%s[%d]: segfault at %lx ip %p sp %p error %x",
+			    task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG,
+			    tsk->comm, task_pid_nr(tsk), FAULT_ADDRESS(*fi),
+			    (void *)UPT_IP(regs), (void *)UPT_SP(regs),
+			    fi->error_code);
+}
+
+static void bad_segv(struct faultinfo fi, unsigned long ip)
+{
+	current->thread.arch.faultinfo = fi;
+	force_sig_fault(SIGSEGV, SEGV_ACCERR, (void __user *) FAULT_ADDRESS(fi));
+}
+
+void fatal_sigsegv(void)
+{
+	force_fatal_sig(SIGSEGV);
+	do_signal(&current->thread.regs);
+	/*
+	 * This is to tell gcc that we're not returning - do_signal
+	 * can, in general, return, but in this case, it's not, since
+	 * we just got a fatal SIGSEGV queued.
+	 */
+	os_dump_core();
+}
+
+/**
+ * segv_handler() - the SIGSEGV handler
+ * @sig:	the signal number
+ * @unused_si:	the signal info struct; unused in this handler
+ * @regs:	the ptrace register information
+ *
+ * The handler first extracts the faultinfo from the UML ptrace regs struct.
+ * If the userfault did not happen in an UML userspace process, bad_segv is called.
+ * Otherwise the signal did happen in a cloned userspace process, handle it.
+ */
+void segv_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs,
+		  void *mc)
+{
+	struct faultinfo *fi = UPT_FAULTINFO(regs);
+
+	/* !MMU specific part; detection of userspace */
+	/* mark is_user=1 when the IP is from userspace code. */
+	if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) < high_physmem)
+		regs->is_user = 1;
+
+	if (UPT_IS_USER(regs) && !SEGV_IS_FIXABLE(fi)) {
+		show_segv_info(regs);
+		bad_segv(*fi, UPT_IP(regs));
+		return;
+	}
+	segv(*fi, UPT_IP(regs), UPT_IS_USER(regs), regs, mc);
+
+	/* !MMU specific part; relay the signal to userspace */
+	relay_signal(sig, unused_si, regs, mc);
+}
+
+/*
+ * We give a *copy* of the faultinfo in the regs to segv.
+ * This must be done, since nesting SEGVs could overwrite
+ * the info in the regs. A pointer to the info then would
+ * give us bad data!
+ */
+unsigned long segv(struct faultinfo fi, unsigned long ip, int is_user,
+		   struct uml_pt_regs *regs, void *mc)
+{
+	int si_code;
+	int err;
+	int is_write = FAULT_WRITE(fi);
+	unsigned long address = FAULT_ADDRESS(fi);
+
+	if (!is_user && regs)
+		current->thread.segv_regs = container_of(regs, struct pt_regs, regs);
+
+	if (current->mm == NULL) {
+		show_regs(container_of(regs, struct pt_regs, regs));
+		panic("Segfault with no mm");
+	} else if (!is_user && address > PAGE_SIZE && address < TASK_SIZE) {
+		show_regs(container_of(regs, struct pt_regs, regs));
+		panic("Kernel tried to access user memory at addr 0x%lx, ip 0x%lx",
+		       address, ip);
+	}
+
+	if (SEGV_IS_FIXABLE(&fi))
+		err = handle_page_fault(address, ip, is_write, is_user,
+					&si_code);
+	else {
+		err = -EFAULT;
+		/*
+		 * A thread accessed NULL, we get a fault, but CR2 is invalid.
+		 * This code is used in __do_copy_from_user() of TT mode.
+		 * XXX tt mode is gone, so maybe this isn't needed any more
+		 */
+		address = 0;
+	}
+
+	if (!err)
+		goto out;
+	else if (!is_user && arch_fixup(ip, regs))
+		goto out;
+
+	if (!is_user) {
+		show_regs(container_of(regs, struct pt_regs, regs));
+		panic("Kernel mode fault at addr 0x%lx, ip 0x%lx",
+		      address, ip);
+	}
+
+	show_segv_info(regs);
+
+	if (err == -EACCES) {
+		current->thread.arch.faultinfo = fi;
+		force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
+	} else {
+		WARN_ON_ONCE(err != -EFAULT);
+		current->thread.arch.faultinfo = fi;
+		force_sig_fault(SIGSEGV, si_code, (void __user *) address);
+	}
+
+out:
+	if (regs)
+		current->thread.segv_regs = NULL;
+
+	return 0;
+}
+
+void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs *regs,
+		  void *mc)
+{
+	int code, err;
+
+	/* !MMU specific part; detection of userspace */
+	/* mark is_user=1 when the IP is from userspace code. */
+	if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) < high_physmem)
+		regs->is_user = 1;
+
+	if (!UPT_IS_USER(regs)) {
+		if (sig == SIGBUS)
+			pr_err("Bus error - the host /dev/shm or /tmp mount likely just ran out of space\n");
+		panic("Kernel mode signal %d", sig);
+	}
+	/* if is_user==1, set return to userspace sig handler to relay signal */
+	nommu_relay_signal(mc);
+
+	arch_examine_signal(sig, regs);
+
+	/* Is the signal layout for the signal known?
+	 * Signal data must be scrubbed to prevent information leaks.
+	 */
+	code = si->si_code;
+	err = si->si_errno;
+	if ((err == 0) && (siginfo_layout(sig, code) == SIL_FAULT)) {
+		struct faultinfo *fi = UPT_FAULTINFO(regs);
+
+		current->thread.arch.faultinfo = *fi;
+		force_sig_fault(sig, code, (void __user *)FAULT_ADDRESS(*fi));
+	} else {
+		pr_err("Attempted to relay unknown signal %d (si_code = %d) with errno %d\n",
+		       sig, code, err);
+		force_sig(sig);
+	}
+}
+
+void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs,
+	   void *mc)
+{
+	do_IRQ(WINCH_IRQ, regs);
+}
diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c
index 2f6795cd884c..28754f56c42b 100644
--- a/arch/um/os-Linux/signal.c
+++ b/arch/um/os-Linux/signal.c
@@ -41,9 +41,10 @@ static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc)
 	int save_errno = errno;
 
 	r.is_user = 0;
+	if (mc)
+		get_regs_from_mc(&r, mc);
 	if (sig == SIGSEGV) {
 		/* For segfaults, we want the data from the sigcontext. */
-		get_regs_from_mc(&r, mc);
 		GET_FAULTINFO_FROM_MC(r.faultinfo, mc);
 	}
 
diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c
index 9bc630995df9..cf5a347ee9b1 100644
--- a/arch/x86/um/nommu/do_syscall_64.c
+++ b/arch/x86/um/nommu/do_syscall_64.c
@@ -44,6 +44,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
 	/* set fs register to the original host one */
 	os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs);
 
+	/* save fp registers */
+	asm volatile("fxsaveq %0" : "=m"(*(struct _xstate *)regs->regs.fp));
+
 	if (likely(syscall < NR_syscalls)) {
 		unsigned long ret;
 
@@ -61,6 +64,9 @@ __visible void do_syscall_64(struct pt_regs *regs)
 	/* handle tasks and signals at the end */
 	interrupt_end();
 
+	/* restore fp registers */
+	asm volatile("fxrstorq %0" : : "m"((current->thread.regs.regs.fp)));
+
 	/* restore back fs register to userspace configured one */
 	os_x86_arch_prctl(0, ARCH_SET_FS,
 		      (void *)(current->thread.regs.regs.gp[FS_BASE
diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c
index b62a6195096f..afa20f1e235a 100644
--- a/arch/x86/um/nommu/os-Linux/mcontext.c
+++ b/arch/x86/um/nommu/os-Linux/mcontext.c
@@ -4,10 +4,21 @@
 #include <asm/ptrace.h>
 #include <sysdep/ptrace.h>
 #include <sysdep/mcontext.h>
+#include <os.h>
+#include "../syscalls.h"
 
 extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3,
 			      int64_t a4, int64_t a5, int64_t a6);
 
+void set_mc_relay_signal(mcontext_t *mc)
+{
+	/* configure stack and userspace returning routine as
+	 * instruction pointer
+	 */
+	mc->gregs[REG_RSP] = (unsigned long) current_top_of_stack;
+	mc->gregs[REG_RIP] = (unsigned long) userspace;
+}
+
 void set_mc_sigsys_hook(mcontext_t *mc)
 {
 	mc->gregs[REG_RCX] = mc->gregs[REG_RIP];
diff --git a/arch/x86/um/shared/sysdep/mcontext.h b/arch/x86/um/shared/sysdep/mcontext.h
index 9a0d6087f357..82a5f38b350f 100644
--- a/arch/x86/um/shared/sysdep/mcontext.h
+++ b/arch/x86/um/shared/sysdep/mcontext.h
@@ -19,6 +19,7 @@ extern int set_stub_state(struct uml_pt_regs *regs, struct stub_data *data,
 
 #ifndef CONFIG_MMU
 extern void set_mc_sigsys_hook(mcontext_t *mc);
+extern void set_mc_relay_signal(mcontext_t *mc);
 #endif
 
 #ifdef __i386__
diff --git a/arch/x86/um/shared/sysdep/ptrace.h b/arch/x86/um/shared/sysdep/ptrace.h
index 572ea2d79131..6ed6bb1ca50e 100644
--- a/arch/x86/um/shared/sysdep/ptrace.h
+++ b/arch/x86/um/shared/sysdep/ptrace.h
@@ -53,7 +53,7 @@ struct uml_pt_regs {
 	int is_user;
 
 	/* Dynamically sized FP registers (holds an XSTATE) */
-	unsigned long fp[];
+	unsigned long fp[] __attribute__((aligned(16)));
 };
 
 #define EMPTY_UML_PT_REGS { }
-- 
2.43.0


From thehajime at gmail.com  Sat Nov  8 00:05:45 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sat,  8 Nov 2025 17:05:45 +0900
Subject: [PATCH v13 10/13] um: change machine name for uname output
In-Reply-To: <cover.1762588860.git.thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
Message-ID: <7cfc1ecdcb8fe15edd92d3b1539994e28f3b6d5a.1762588860.git.thehajime@gmail.com>

Display the MMU/!MMU mode in the machine name reported by uname(2) so
that users can tell which mode of UML is currently running.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
---
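For anyone who wants to check the result from inside the guest, a minimal
C sketch follows.  It assumes that the machine field reads
"um/<host machine>" on MMU builds and "um(nommu)/<host machine>" on !MMU
builds; that is inferred from the Makefile and strcat() changes in this
patch, not from captured output, and both helper names are hypothetical.

```c
#include <string.h>
#include <sys/utsname.h>

/* Hypothetical helper: nonzero if a uname(2) machine string carries the
 * "um(nommu)" prefix that this patch is assumed to produce on !MMU UML.
 */
static int machine_is_nommu_uml(const char *machine)
{
	static const char prefix[] = "um(nommu)";

	return strncmp(machine, prefix, sizeof(prefix) - 1) == 0;
}

/* Hypothetical helper: query the running kernel.
 * Returns 1 for nommu UML, 0 for anything else, -1 if uname(2) fails.
 */
static int running_on_nommu_uml(void)
{
	struct utsname u;

	if (uname(&u) != 0)
		return -1;
	return machine_is_nommu_uml(u.machine);
}
```

Matching only the prefix keeps the check independent of the host machine
name that is appended after the slash.
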
 arch/um/Makefile        | 6 ++++++
 arch/um/os-Linux/util.c | 3 ++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/um/Makefile b/arch/um/Makefile
index 5371c9a1b11e..9bc8fc149514 100644
--- a/arch/um/Makefile
+++ b/arch/um/Makefile
@@ -153,6 +153,12 @@ export CFLAGS_vmlinux := $(LINK-y) $(LINK_WRAPS) $(LD_FLAGS_CMDLINE) $(CC_FLAGS_
 CLEAN_FILES += linux x.i gmon.out
 MRPROPER_FILES += $(HOST_DIR)/include/generated
 
+ifeq ($(CONFIG_MMU),y)
+UTS_MACHINE := "um"
+else
+UTS_MACHINE := "um\(nommu\)"
+endif
+
 archclean:
 	@find . \( -name '*.bb' -o -name '*.bbg' -o -name '*.da' \
 		-o -name '*.gcov' \) -type f -print | xargs rm -f
diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c
index e3ad71a0d13c..5fb26f5dfcb6 100644
--- a/arch/um/os-Linux/util.c
+++ b/arch/um/os-Linux/util.c
@@ -64,7 +64,8 @@ void setup_machinename(char *machine_out)
 	}
 # endif
 #endif
-	strcpy(machine_out, host.machine);
+	strcat(machine_out, "/");
+	strcat(machine_out, host.machine);
 }
 
 void setup_hostinfo(char *buf, int len)
-- 
2.43.0


From thehajime at gmail.com  Sat Nov  8 00:05:46 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sat,  8 Nov 2025 17:05:46 +0900
Subject: [PATCH v13 11/13] um: nommu: disable SMP on nommu UML
In-Reply-To: <cover.1762588860.git.thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
Message-ID: <a9d5519fe7696def8c00a1bba7b3c6c832d5dfba.1762588860.git.thehajime@gmail.com>

CONFIG_SMP doesn't work with nommu UML because the handling of the host
fs register conflicts with thread-local storage (more specifically,
with the variable signals_enabled).

Thus, this commit disables CONFIG_SMP and the TLS variables on nommu UML.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
---
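As a userspace illustration of what defining __thread away does to a TLS
variable, the sketch below compares a real __thread variable with one
whose storage-class keyword expands to nothing.  This is not UML code:
all names are hypothetical, and it assumes a compiler that supports
__thread and a libc that provides pthreads.

```c
#include <pthread.h>
#include <stddef.h>

/* Expands to nothing, mimicking the effect of "#define __thread". */
#define DEMO_NO_TLS

static __thread int per_thread_flag = 1;	/* one copy per thread */
static DEMO_NO_TLS int shared_flag = 1;		/* plain global, one copy */

static void *clear_flags(void *arg)
{
	(void)arg;
	per_thread_flag = 0;	/* touches only the worker's copy */
	shared_flag = 0;	/* visible to every thread */
	return NULL;
}

/* Run clear_flags() in a second thread; returns 0 on success. */
static int run_worker(void)
{
	pthread_t worker;

	if (pthread_create(&worker, NULL, clear_flags, NULL) != 0)
		return -1;
	return pthread_join(worker, NULL) == 0 ? 0 : -1;
}
```

After run_worker() returns in the main thread, per_thread_flag still reads
1 there while shared_flag reads 0.  A variable that loses its TLS
qualifier is shared by every thread, which is presumably acceptable only
because the Kconfig change in this patch keeps SMP off for !MMU.
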
 arch/um/os-Linux/internal.h | 8 ++++++++
 arch/x86/um/Kconfig         | 2 +-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/um/os-Linux/internal.h b/arch/um/os-Linux/internal.h
index bac9fcc8c14c..25cb5cc931c1 100644
--- a/arch/um/os-Linux/internal.h
+++ b/arch/um/os-Linux/internal.h
@@ -6,6 +6,14 @@
 #include <stub-data.h>
 #include <signal.h>
 
+/* NOMMU doesn't work with the thread-local storage used by CONFIG_SMP
+ * because the fs register is switched via the host_fs variable on
+ * user/kernel context transitions, so disable TLS until NOMMU supports SMP.
+ */
+#ifndef CONFIG_MMU
+#define __thread
+#endif
+
 /*
  * elf_aux.c
  */
diff --git a/arch/x86/um/Kconfig b/arch/x86/um/Kconfig
index bdd7c8e39b01..f12e2e4e0a12 100644
--- a/arch/x86/um/Kconfig
+++ b/arch/x86/um/Kconfig
@@ -12,7 +12,7 @@ config UML_X86
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select DCACHE_WORD_ACCESS
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
-	select UML_SUBARCH_SUPPORTS_SMP if X86_CX8
+	select UML_SUBARCH_SUPPORTS_SMP if X86_CX8 && MMU
 
 config 64BIT
 	bool "64-bit kernel" if "$(SUBARCH)" = "x86"
-- 
2.43.0


From thehajime at gmail.com  Sat Nov  8 00:05:47 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sat,  8 Nov 2025 17:05:47 +0900
Subject: [PATCH v13 12/13] um: nommu: add documentation of nommu UML
In-Reply-To: <cover.1762588860.git.thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
Message-ID: <16940d31af89a3127acf29d23e10dcb9b7b9f4e3.1762588860.git.thehajime@gmail.com>

This commit adds initial documentation for the !MMU mode of UML.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
---
 Documentation/virt/uml/nommu-uml.rst | 180 +++++++++++++++++++++++++++
 MAINTAINERS                          |   1 +
 2 files changed, 181 insertions(+)
 create mode 100644 Documentation/virt/uml/nommu-uml.rst

diff --git a/Documentation/virt/uml/nommu-uml.rst b/Documentation/virt/uml/nommu-uml.rst
new file mode 100644
index 000000000000..f049bbc697d1
--- /dev/null
+++ b/Documentation/virt/uml/nommu-uml.rst
@@ -0,0 +1,180 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+UML has been built with CONFIG_MMU since day 0.  This patchset
+introduces a nommu mode to UML, taking a different angle from what the
+Linux Kernel Library tried.
+
+.. contents:: :local:
+
+What is it for?
+===============
+
+- Alleviate the overhead of the ptrace(2)-based syscall hook
+- Exercise nommu code over UML (and over KUnit)
+- Depend less on host facilities
+
+
+How does it work?
+=================
+
+To illustrate how this feature works, the following shows how syscalls
+are handled in the nommu UML environment.
+
+- boot the kernel and install a seccomp filter that traps ``syscall``
+  instructions issued from userspace memory, based on the address of
+  the instruction pointer
+- (userspace starts)
+- userspace calls ``vfork``/``execve`` syscalls
+- ``SIGSYS`` signal is raised; its handler calls the syscall entry point ``__kernel_vsyscall``
+- call the handler function in ``sys_call_table[]`` and follow the
+  regular UML syscall path
+- return to userspace
+
+
+What are the differences from MMU-full UML?
+===========================================
+
+The current nommu implementation adds three features that
+MMU-full UML doesn't have:
+
+- kernel address space is directly accessible from userspace
+  (so ``uaccess()`` always returns 1, and the generic implementation
+  of memcpy/strcpy/futex is used)
+- alternate syscall entrypoint without ptrace
+- alternate syscall hook
+  (syscalls are hooked by a seccomp filter)
+
+These modifications allow us to use unmodified userspace
+binaries with nommu UML.
+
+
+History
+=======
+
+This feature was originally introduced by Ricardo Koller at Open
+Source Summit NA 2020, and was then integrated with the syscall
+translation functionality, along with cleanups to the original code.
+
+Building and running
+====================
+
+::
+
+   make ARCH=um x86_64_nommu_defconfig
+   make ARCH=um
+
+will build UML with ``CONFIG_MMU=n`` applied.
+
+KUnit tests can run with the following command::
+
+   ./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_MMU=n
+
+To run a typical Linux distribution, we need a nommu-aware userspace.
+We can use a stock version of Alpine Linux with nommu builds of
+busybox and musl-libc.
+
+
+Preparing root filesystem
+=========================
+
+nommu UML requires a standard library that is aware of nommu
+kernels.  We have tested custom-built musl-libc and busybox,
+both of which have built-in support for nommu kernels.
+
+There are no Linux distributions available for nommu on the x86_64
+architecture, so we need to prepare our own image for the root
+filesystem.  We use Alpine Linux as a base distribution and replace
+busybox and musl-libc on top of that.  The following steps prepare
+the filesystem for a quick start::
+
+     container_id=$(docker create ghcr.io/thehajime/alpine:3.20.3-um-nommu)
+     docker start $container_id
+     docker wait $container_id
+     docker export $container_id > alpine.tar
+     docker rm $container_id
+
+     mnt=$(mktemp -d)
+     dd if=/dev/zero of=alpine.ext4 bs=1 count=0 seek=1G
+     sudo chmod og+wr "alpine.ext4"
+     yes 2>/dev/null | mkfs.ext4 "alpine.ext4" || true
+     sudo mount "alpine.ext4" $mnt
+     sudo tar -xf alpine.tar -C $mnt
+     sudo umount $mnt
+
+This will create a file image, ``alpine.ext4``, which contains nommu
+builds of busybox and musl on the Alpine Linux root filesystem.  The
+file can be passed via the ``ubd0=`` argument on the UML command line::
+
+  ./vmlinux ubd0=./alpine.ext4 rw mem=1024m loglevel=8 init=/sbin/init
+
+We plan to upstream apk packages for busybox and musl so that we can
+follow the proper procedure to set up the root filesystem.
+
+
+Quick start with docker
+=======================
+
+There is a docker image that lets you get started quickly in a single step::
+
+  docker run -it -v /dev/shm:/dev/shm --rm ghcr.io/thehajime/alpine:3.20.3-um-nommu
+
+This will launch a UML instance with a pre-configured root filesystem.
+
+Benchmark
+=========
+
+The following shows an example performance measurement conducted with
+lmbench and a (self-crafted) getpid benchmark (with the v6.17-rc5
+uml/next tree).
+
+.. csv-table:: lmbench (usec)
+  :header: ,native,um,um-mmu(s),um-nommu(s)
+
+  select-10    ,0.5319,36.1214,24.2795,2.9174
+  select-100   ,1.6019,34.6049,28.8865,3.8080
+  select-1000  ,12.2588,43.6838,48.7438,12.7872
+  syscall      ,0.1644,35.0321,53.2119,2.5981
+  read         ,0.3055,31.5509,45.8538,2.7068
+  write        ,0.2512,31.3609,29.2636,2.6948
+  stat         ,1.8894,43.8477,49.6121,3.1908
+  open/close   ,3.2973,77.5123,68.9431,6.2575
+  fork+sh      ,1110.3000,7359.5000,4618.6667,439.4615
+  fork+execve  ,510.8182,2834.0000,2461.1667,139.7848
+
+.. csv-table:: do_getpid bench (nsec)
+  :header: ,native,um,um-mmu(s),um-nommu(s)
+
+  getpid , 161 , 34477 , 26242 , 2599
+
+(um-nommu(s) is nommu UML with the seccomp syscall hook, and um-mmu(s)
+is MMU-full UML in SECCOMP mode)
+
+Limitations
+===========
+
+generic nommu limitations
+-------------------------
+Since this port is a nommu kernel, the
+implementation inherits the characteristics of other nommu kernels
+(riscv, arm, etc.), described below.
+
+- vfork(2) should be used instead of fork(2)
+- the ELF loader only loads PIE (position independent executable) binaries
+- all processes share a single address space
+- mmap(2) offers a subset of its functionality (e.g., MAP_FIXED is
+  unsupported)
+
+Thus, the options for userspace programs are limited.  We have tested
+Alpine Linux with musl-libc, which supports nommu kernels.
+
+supported architecture
+----------------------
+The current implementation of nommu UML only works on the x86_64 SUBARCH.
+We have not tested it in a 32-bit environment.
+
+
+Further reading about NOMMU UML
+===============================
+
+- NOMMU UML (original code by Ricardo Koller)
+ - https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf
diff --git a/MAINTAINERS b/MAINTAINERS
index 3da2c26a796b..2f227f56d04e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -26764,6 +26764,7 @@ USER-MODE LINUX (UML)
 M:	Richard Weinberger <richard at nod.at>
 M:	Anton Ivanov <anton.ivanov at cambridgegreys.com>
 M:	Johannes Berg <johannes at sipsolutions.net>
+M:	Hajime Tazaki <thehajime at gmail.com>
 L:	linux-um at lists.infradead.org
 S:	Maintained
 W:	http://user-mode-linux.sourceforge.net
-- 
2.43.0


From thehajime at gmail.com  Sat Nov  8 00:05:48 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Sat,  8 Nov 2025 17:05:48 +0900
Subject: [PATCH v13 13/13] um: nommu: plug nommu code into build system
In-Reply-To: <cover.1762588860.git.thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
Message-ID: <ae577803cd463b830d120f15b316ad8c373d4735.1762588860.git.thehajime@gmail.com>

Plug the nommu kernel into the um build.  A defconfig is also provided.

Signed-off-by: Hajime Tazaki <thehajime at gmail.com>
Signed-off-by: Ricardo Koller <ricarkol at google.com>
---
 arch/um/Kconfig                        | 14 ++++++-
 arch/um/configs/x86_64_nommu_defconfig | 54 ++++++++++++++++++++++++++
 2 files changed, 66 insertions(+), 2 deletions(-)
 create mode 100644 arch/um/configs/x86_64_nommu_defconfig

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 097c6a6265ef..4907fd2db512 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -34,16 +34,19 @@ config UML
 	select ARCH_SUPPORTS_LTO_CLANG_THIN
 	select TRACE_IRQFLAGS_SUPPORT
 	select TTY # Needed for line.c
-	select HAVE_ARCH_VMAP_STACK
+	select HAVE_ARCH_VMAP_STACK if MMU
 	select HAVE_RUST
 	select ARCH_HAS_UBSAN
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_SYSCALL_TRACEPOINTS
 	select THREAD_INFO_IN_TASK
 	select SPARSE_IRQ
+	select UACCESS_MEMCPY if !MMU
+	select GENERIC_STRNLEN_USER if !MMU
+	select GENERIC_STRNCPY_FROM_USER if !MMU
 
 config MMU
-	bool
+	bool "MMU-based Paged Memory Management Support" if 64BIT
 	default y
 
 config UML_DMA_EMULATION
@@ -225,8 +228,15 @@ config MAGIC_SYSRQ
 	  The keys are documented in <file:Documentation/admin-guide/sysrq.rst>. Don't say Y
 	  unless you really know what this hack does.
 
+config ARCH_FORCE_MAX_ORDER
+	int "Order of maximal physically contiguous allocations" if EXPERT
+	default "10" if MMU
+	default "16" if !MMU
+
 config KERNEL_STACK_ORDER
 	int "Kernel stack size order"
+	default 3 if !MMU
+	range 3 10 if !MMU
 	default 2 if 64BIT
 	range 2 10 if 64BIT
 	default 1 if !64BIT
diff --git a/arch/um/configs/x86_64_nommu_defconfig b/arch/um/configs/x86_64_nommu_defconfig
new file mode 100644
index 000000000000..02cb87091c9f
--- /dev/null
+++ b/arch/um/configs/x86_64_nommu_defconfig
@@ -0,0 +1,54 @@
+CONFIG_SYSVIPC=y
+CONFIG_POSIX_MQUEUE=y
+CONFIG_NO_HZ=y
+CONFIG_HIGH_RES_TIMERS=y
+CONFIG_BSD_PROCESS_ACCT=y
+CONFIG_IKCONFIG=y
+CONFIG_IKCONFIG_PROC=y
+CONFIG_LOG_BUF_SHIFT=14
+CONFIG_CGROUPS=y
+CONFIG_BLK_CGROUP=y
+CONFIG_CGROUP_SCHED=y
+CONFIG_CGROUP_DEVICE=y
+CONFIG_CGROUP_CPUACCT=y
+# CONFIG_PID_NS is not set
+CONFIG_CC_OPTIMIZE_FOR_SIZE=y
+# CONFIG_MMU is not set
+CONFIG_HOSTFS=y
+CONFIG_MAGIC_SYSRQ=y
+CONFIG_SSL=y
+CONFIG_NULL_CHAN=y
+CONFIG_PORT_CHAN=y
+CONFIG_PTY_CHAN=y
+CONFIG_TTY_CHAN=y
+CONFIG_CON_CHAN="pts"
+CONFIG_SSL_CHAN="pts"
+CONFIG_MODULES=y
+CONFIG_MODULE_UNLOAD=y
+CONFIG_IOSCHED_BFQ=m
+CONFIG_BINFMT_MISC=m
+CONFIG_NET=y
+CONFIG_PACKET=y
+CONFIG_UNIX=y
+CONFIG_INET=y
+CONFIG_DEVTMPFS=y
+CONFIG_DEVTMPFS_MOUNT=y
+CONFIG_BLK_DEV_UBD=y
+CONFIG_BLK_DEV_LOOP=m
+CONFIG_BLK_DEV_NBD=m
+CONFIG_DUMMY=m
+CONFIG_TUN=m
+CONFIG_PPP=m
+CONFIG_SLIP=m
+CONFIG_LEGACY_PTY_COUNT=32
+CONFIG_UML_RANDOM=y
+CONFIG_EXT4_FS=y
+CONFIG_QUOTA=y
+CONFIG_AUTOFS_FS=m
+CONFIG_ISO9660_FS=m
+CONFIG_JOLIET=y
+CONFIG_NLS=y
+CONFIG_DEBUG_KERNEL=y
+CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
+CONFIG_FRAME_WARN=1024
+CONFIG_IPV6=y
-- 
2.43.0


From bagasdotme at gmail.com  Sat Nov  8 01:19:33 2025
From: bagasdotme at gmail.com (Bagas Sanjaya)
Date: Sat, 8 Nov 2025 16:19:33 +0700
Subject: [PATCH v2] vfs: remove the excl argument from the ->create()
 inode_operation
In-Reply-To: <aQ7fOmknHIxcxuha@codewreck.org>
References: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org>
 <aQ7fOmknHIxcxuha@codewreck.org>
Message-ID: <aQ8LJfKC0R-4ehLU@archie.me>

On Sat, Nov 08, 2025 at 03:12:10PM +0900, Dominique Martinet wrote:
> Jeff Layton wrote on Fri, Nov 07, 2025 at 10:05:03AM -0500:
> > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> > index 4f13b01e42eb5e2ad9d60cbbce7e47d09ad831e6..7a55e491e0c87a0d18909bd181754d6d68318059 100644
> > --- a/Documentation/filesystems/vfs.rst
> > +++ b/Documentation/filesystems/vfs.rst
> > @@ -505,7 +505,10 @@ otherwise noted.
> >  	if you want to support regular files.  The dentry you get should
> >  	not have an inode (i.e. it should be a negative dentry).  Here
> >  	you will probably call d_instantiate() with the dentry and the
> > -	newly created inode
> > +        newly created inode. This operation should always provide O_EXCL
> 
> This and the block below change halfway from tab (old text) to spaces
> (your patch)
> 
> Looks like the file has a few space-indented sections though so it won't
> be the first if that goes in as is, the html-rendering doesn't seem to
> care :)

FYI: I'm using Vim. My important settings (in ~/.vimrc) are:

```
set nojoinspaces
set textwidth=0
set backspace=2
```

However, ftplugins override these for each file type, so you essentially have
to "fork" the relevant ftplugin file for each type if you want your settings
to take precedence. For example, in the case of reST, copy
/usr/share/vim/vim91/ftplugin/rst.vim to ~/.vim/ftplugin/rst and override the
already defined options there:

```
...
" keep tabs as-is
setlocal comments=fb:.. commentstring=..\ %s noexpandtab
...
if exists("g:rst_style") && g:rst_style != 0
    setlocal noexpandtab shiftwidth=8 softtabstop=0 tabstop=8
endif
...
```

Thanks.

-- 
An old man doll... just what I always wanted! - Clara

From hch at infradead.org  Mon Nov 10 01:14:26 2025
From: hch at infradead.org (Christoph Hellwig)
Date: Mon, 10 Nov 2025 01:14:26 -0800
Subject: [PATCH v13 00/13] nommu UML
In-Reply-To: <cover.1762588860.git.thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
Message-ID: <aRGs8lPjH22NqMZc@infradead.org>

On Sat, Nov 08, 2025 at 05:05:35PM +0900, Hajime Tazaki wrote:
> This patchset is another spin of nommu mode addition to UML.  It would
> be nice to hear about your opinions on that.

I've not seen any explanation of the use case and/or benefits anywhere
in this cover letter or the patches.  Without that it's usually pretty
hard to get maintainers and reviewers excited.


From thehajime at gmail.com  Mon Nov 10 04:18:05 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Mon, 10 Nov 2025 21:18:05 +0900
Subject: [PATCH v13 00/13] nommu UML
In-Reply-To: <aRGs8lPjH22NqMZc@infradead.org>
References: <cover.1762588860.git.thehajime@gmail.com>
	<aRGs8lPjH22NqMZc@infradead.org>
Message-ID: <m2framxerm.wl-thehajime@gmail.com>


Hello,

On Mon, 10 Nov 2025 18:14:26 +0900,
Christoph Hellwig wrote:
> 
> On Sat, Nov 08, 2025 at 05:05:35PM +0900, Hajime Tazaki wrote:
> > This patchset is another spin of nommu mode addition to UML.  It would
> > be nice to hear about your opinions on that.
> 
> I've not seen any explanation of the use case and/or benefits anywhere
> in this cover letter or the patches.  Without that it's usually pretty
> hard to get maintainers and reviewers excited.

Thank you for the comment.  I tried to include this explanation in the
documentation patch [12/13]; I have copied the relevant text below.

  What is it for?
  ===============
  
  - Alleviate the overhead of the ptrace(2)-based syscall hook
  - Exercise nommu code over UML (and over KUnit)
  - Depend less on host facilities

The first item is about speed, the second is about broader testing, and
the last is about extensibility in the future.

Early versions of this patchset included this information, as well as
the whole documentation, in the cover letter, but I removed it as the
versions grew.  I can restore it to the cover letter if it helps.

-- Hajime


From jlayton at kernel.org  Mon Nov 10 05:13:00 2025
From: jlayton at kernel.org (Jeff Layton)
Date: Mon, 10 Nov 2025 08:13:00 -0500
Subject: [PATCH v3] vfs: remove the excl argument from the ->create()
 inode_operation
Message-ID: <20251110-create-excl-v3-1-836a61d14fb0@kernel.org>

With three exceptions, ->create() methods provided by filesystems ignore
the "excl" flag.  Those exceptions are NFS, GFS2 and vboxsf, which all also
provide ->atomic_open.

Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
the "excl" argument to the ->create() inode_operation is always set to
true in vfs_create(). The ->create() call in lookup_open() sets it
according to the O_EXCL open flag, but is never called if the filesystem
provides ->atomic_open().

The excl flag is therefore always either ignored or true.  Remove it,
and change NFS, GFS2 and vboxsf to act as if it were always true.

Reviewed-by: Dominique Martinet <asmadeus at codewreck.org>
Reviewed-by: NeilBrown <neil at brown.name>
Signed-off-by: Jeff Layton <jlayton at kernel.org>
---
Minor update to fix vboxsf case that I somehow missed in the first
version, and some minor whitespace cleanup in the docs. Reminder that
this should be applied on top of the directory delegation series [1].

[1]: https://lore.kernel.org/linux-nfs/20251105-dir-deleg-ro-v5-0-7ebc168a88ac at kernel.org/
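
For readers less familiar with the semantics that ->create() is now
expected to provide unconditionally, here is a userspace sketch of what
O_EXCL means as seen through open(2).  It is not kernel code and the
helper name is hypothetical; it only mirrors the "fail with -EEXIST if
the file exists" rule described in the changelog.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create path exclusively.
 * Returns 0 on success or a negative errno, kernel style.
 */
static int create_exclusive(const char *path)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

	if (fd < 0)
		return -errno;
	close(fd);
	return 0;
}
```

Calling create_exclusive() twice on a fresh path returns 0 the first time
and -EEXIST the second time.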
---
Changes in v3:
- fix use of excl in vboxsf_dir_mkfile()
- fix tab prefixes in Documentation/filesystems/vfs.rst
- Link to v2: https://lore.kernel.org/r/20251107-create-excl-v2-1-f678165d7f3f at kernel.org

Changes in v2:
- better describe why the argument isn't needed in the changelog
- updates to Documentation/
- Link to v1: https://lore.kernel.org/r/20251105-create-excl-v1-1-a4cce035cc55 at kernel.org
---
 Documentation/filesystems/porting.rst | 12 ++++++++++++
 Documentation/filesystems/vfs.rst     | 13 ++++++++++---
 fs/9p/vfs_inode.c                     |  2 +-
 fs/9p/vfs_inode_dotl.c                |  2 +-
 fs/affs/affs.h                        |  2 +-
 fs/affs/namei.c                       |  2 +-
 fs/afs/dir.c                          |  4 ++--
 fs/bad_inode.c                        |  2 +-
 fs/bfs/dir.c                          |  2 +-
 fs/btrfs/inode.c                      |  2 +-
 fs/ceph/dir.c                         |  2 +-
 fs/coda/dir.c                         |  2 +-
 fs/ecryptfs/inode.c                   |  2 +-
 fs/efivarfs/inode.c                   |  2 +-
 fs/exfat/namei.c                      |  2 +-
 fs/ext2/namei.c                       |  2 +-
 fs/ext4/namei.c                       |  2 +-
 fs/f2fs/namei.c                       |  2 +-
 fs/fat/namei_msdos.c                  |  2 +-
 fs/fat/namei_vfat.c                   |  2 +-
 fs/fuse/dir.c                         |  2 +-
 fs/gfs2/inode.c                       |  5 ++---
 fs/hfs/dir.c                          |  2 +-
 fs/hfsplus/dir.c                      |  2 +-
 fs/hostfs/hostfs_kern.c               |  2 +-
 fs/hpfs/namei.c                       |  2 +-
 fs/hugetlbfs/inode.c                  |  2 +-
 fs/jffs2/dir.c                        |  4 ++--
 fs/jfs/namei.c                        |  2 +-
 fs/minix/namei.c                      |  2 +-
 fs/namei.c                            |  4 ++--
 fs/nfs/dir.c                          |  4 ++--
 fs/nfs/internal.h                     |  2 +-
 fs/nilfs2/namei.c                     |  2 +-
 fs/ntfs3/namei.c                      |  2 +-
 fs/ocfs2/dlmfs/dlmfs.c                |  3 +--
 fs/ocfs2/namei.c                      |  3 +--
 fs/omfs/dir.c                         |  2 +-
 fs/orangefs/namei.c                   |  3 +--
 fs/overlayfs/dir.c                    |  2 +-
 fs/ramfs/inode.c                      |  2 +-
 fs/smb/client/cifsfs.h                |  2 +-
 fs/smb/client/dir.c                   |  2 +-
 fs/ubifs/dir.c                        |  2 +-
 fs/udf/namei.c                        |  2 +-
 fs/ufs/namei.c                        |  3 +--
 fs/vboxsf/dir.c                       |  4 ++--
 fs/xfs/xfs_iops.c                     |  3 +--
 include/linux/fs.h                    |  4 ++--
 ipc/mqueue.c                          |  2 +-
 mm/shmem.c                            |  2 +-
 51 files changed, 78 insertions(+), 65 deletions(-)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 7233b04668fcce75f1ed170329a2cd18110a7d89..d71a3f5c626e578f0370986975ca50292c8e15c3 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1309,3 +1309,15 @@ a different length, use
 	vfs_parse_fs_qstr(fc, key, &QSTR_LEN(value, len))
 
 instead.
+
+---
+
+**mandatory**
+
+The ->create() operation has dropped the bool "excl" argument. This operation
+should now always provide O_EXCL semantics (i.e. fail with -EEXIST if the file
+exists). If the filesystem needs to handle the case where another entity could
+create the file on the backing store after a negative lookup or revalidate
+(e.g. it's a network filesystem and another client could create the file after
+a negative lookup), then it will require ->atomic_open() in addition to
+->create().
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 4f13b01e42eb5e2ad9d60cbbce7e47d09ad831e6..0752ed2b6475ab2b42482fde6dff870110a33eac 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -467,7 +467,7 @@ As of kernel 2.6.22, the following members are defined:
 .. code-block:: c
 
 	struct inode_operations {
-		int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool);
+		int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t);
 		struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
 		int (*link) (struct dentry *,struct inode *,struct dentry *);
 		int (*unlink) (struct inode *,struct dentry *);
@@ -505,7 +505,10 @@ otherwise noted.
 	if you want to support regular files.  The dentry you get should
 	not have an inode (i.e. it should be a negative dentry).  Here
 	you will probably call d_instantiate() with the dentry and the
-	newly created inode
+	newly created inode. This operation should always provide O_EXCL
+	semantics (i.e. it should fail with -EEXIST if the file exists).
+	If the filesystem needs to mediate non-exclusive creation,
+	then the filesystem must also provide an ->atomic_open() operation.
 
 ``lookup``
 	called when the VFS needs to look up an inode in a parent
@@ -654,7 +657,11 @@ otherwise noted.
 	handled by f_op->open().  If the file was created, FMODE_CREATED
 	flag should be set in file->f_mode.  In case of O_EXCL the
 	method must only succeed if the file didn't exist and hence
-	FMODE_CREATED shall always be set on success.
+	FMODE_CREATED shall always be set on success. This method is
+	usually needed on filesystems where the dentry to be created could
+	unexpectedly become positive after the kernel has looked it up or
+	revalidated it (e.g. another host racing in and creating the file
+	on an NFS server).
 
 ``tmpfile``
 	called in the end of O_TMPFILE open().  Optional, equivalent to
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 69f378a837753e934c20b599660f8a756127e40a..595244d57cba62869b9af8b909af67d3c61e7f6c 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -643,7 +643,7 @@ v9fs_create(struct v9fs_session_info *v9ses, struct inode *dir,
 
 static int
 v9fs_vfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		struct dentry *dentry, umode_t mode, bool excl)
+		struct dentry *dentry, umode_t mode)
 {
 	struct v9fs_session_info *v9ses = v9fs_inode2v9ses(dir);
 	u32 perm = unixmode2p9mode(v9ses, mode);
diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
index 0b404e8484d22e2cbe60d846e0fa653001cdc4b1..de8fe9954d433c9b14ff5dd72ba13c3d5a67ebe7 100644
--- a/fs/9p/vfs_inode_dotl.c
+++ b/fs/9p/vfs_inode_dotl.c
@@ -218,7 +218,7 @@ int v9fs_open_to_dotl_flags(int flags)
  */
 static int
 v9fs_vfs_create_dotl(struct mnt_idmap *idmap, struct inode *dir,
-		     struct dentry *dentry, umode_t omode, bool excl)
+		     struct dentry *dentry, umode_t omode)
 {
 	return v9fs_vfs_mknod_dotl(idmap, dir, dentry, omode, 0);
 }
diff --git a/fs/affs/affs.h b/fs/affs/affs.h
index ac4e9a02910b72d63c8ec5291347b54518e67f4b..665be23c42cfa206dc0a2c9ffa119b7c3c747389 100644
--- a/fs/affs/affs.h
+++ b/fs/affs/affs.h
@@ -167,7 +167,7 @@ extern int	affs_hash_name(struct super_block *sb, const u8 *name, unsigned int l
 extern struct dentry *affs_lookup(struct inode *dir, struct dentry *dentry, unsigned int);
 extern int	affs_unlink(struct inode *dir, struct dentry *dentry);
 extern int	affs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool);
+			struct dentry *dentry, umode_t mode);
 extern struct dentry *affs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 			struct dentry *dentry, umode_t mode);
 extern int	affs_rmdir(struct inode *dir, struct dentry *dentry);
diff --git a/fs/affs/namei.c b/fs/affs/namei.c
index f883be50db122d3b09f0ae4d24618bd49b55186b..5591e1b5a2f68fc7600115e241f01f81d3aac010 100644
--- a/fs/affs/namei.c
+++ b/fs/affs/namei.c
@@ -243,7 +243,7 @@ affs_unlink(struct inode *dir, struct dentry *dentry)
 
 int
 affs_create(struct mnt_idmap *idmap, struct inode *dir,
-	    struct dentry *dentry, umode_t mode, bool excl)
+	    struct dentry *dentry, umode_t mode)
 {
 	struct super_block *sb = dir->i_sb;
 	struct inode	*inode;
diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 89d36e3e5c7999c2e448b78e86896d8893a8a7a9..09224aca8cad37ad273fd0c1ac292f0c15e078b5 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -32,7 +32,7 @@ static bool afs_lookup_one_filldir(struct dir_context *ctx, const char *name, in
 static bool afs_lookup_filldir(struct dir_context *ctx, const char *name, int nlen,
 			      loff_t fpos, u64 ino, unsigned dtype);
 static int afs_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl);
+		      struct dentry *dentry, umode_t mode);
 static struct dentry *afs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 				struct dentry *dentry, umode_t mode);
 static int afs_rmdir(struct inode *dir, struct dentry *dentry);
@@ -1637,7 +1637,7 @@ static const struct afs_operation_ops afs_create_operation = {
  * create a regular file on an AFS filesystem
  */
 static int afs_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	struct afs_operation *op;
 	struct afs_vnode *dvnode = AFS_FS_I(dir);
diff --git a/fs/bad_inode.c b/fs/bad_inode.c
index 0ef9bcb744dd620bf47caa024d97a1316ff7bc89..5701361cf98155a61cb75a4ec602e8fc615eb3ae 100644
--- a/fs/bad_inode.c
+++ b/fs/bad_inode.c
@@ -29,7 +29,7 @@ static const struct file_operations bad_file_ops =
 
 static int bad_inode_create(struct mnt_idmap *idmap,
 			    struct inode *dir, struct dentry *dentry,
-			    umode_t mode, bool excl)
+			    umode_t mode)
 {
 	return -EIO;
 }
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index c375e22c4c0c15ba27307d266adfe3f093b90ab8..6beb8605c523cc2c7250d7b1a61508e103f0f3fd 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -76,7 +76,7 @@ const struct file_operations bfs_dir_operations = {
 };
 
 static int bfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	int err;
 	struct inode *inode;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3b1b3a0553eea06229255ad0284d76074bdb958a..8e06baeabae594850607366ea4f4f0fa41e3b464 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6816,7 +6816,7 @@ static int btrfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int btrfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index d18c0eaef9b7e7be7eb517c701d6c4af08fd78ac..308903dc0780dbed2382228005d0221f185c61ee 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -976,7 +976,7 @@ static int ceph_mknod(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int ceph_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	return ceph_mknod(idmap, dir, dentry, mode, 0);
 }
diff --git a/fs/coda/dir.c b/fs/coda/dir.c
index ca99900172657d80a479b2eb27f50effdf834995..554e7fd44e5df1aae6da2c41a492a02ae9e0d616 100644
--- a/fs/coda/dir.c
+++ b/fs/coda/dir.c
@@ -134,7 +134,7 @@ static inline void coda_dir_drop_nlink(struct inode *dir)
 
 /* creation routines: create, mknod, mkdir, link, symlink */
 static int coda_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *de, umode_t mode, bool excl)
+		       struct dentry *de, umode_t mode)
 {
 	int error;
 	const char *name=de->d_name.name;
diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index ba15e7359dfa6e150b577205991010873a633511..9a1ba68b16f3d6c4551e2d75e1e27309159c062e 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -262,7 +262,7 @@ int ecryptfs_initialize_file(struct dentry *ecryptfs_dentry,
 static int
 ecryptfs_create(struct mnt_idmap *idmap,
 		struct inode *directory_inode, struct dentry *ecryptfs_dentry,
-		umode_t mode, bool excl)
+		umode_t mode)
 {
 	struct inode *ecryptfs_inode;
 	int rc;
diff --git a/fs/efivarfs/inode.c b/fs/efivarfs/inode.c
index 2891614abf8d554f563319187b6d54c2bc006a91..043b3e3a4f0adefe27855f8156b946c1dc4bd184 100644
--- a/fs/efivarfs/inode.c
+++ b/fs/efivarfs/inode.c
@@ -75,7 +75,7 @@ static bool efivarfs_valid_name(const char *str, int len)
 }
 
 static int efivarfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			   struct dentry *dentry, umode_t mode, bool excl)
+			   struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode = NULL;
 	struct efivar_entry *var;
diff --git a/fs/exfat/namei.c b/fs/exfat/namei.c
index 7eb9c67fd35f4c54e18061a948806f20455675cf..c272a522c571044fd0cdc7630be30bdcec2ab8e5 100644
--- a/fs/exfat/namei.c
+++ b/fs/exfat/namei.c
@@ -543,7 +543,7 @@ static int exfat_add_entry(struct inode *inode, const char *path,
 }
 
 static int exfat_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct super_block *sb = dir->i_sb;
 	struct inode *inode;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index bde617a66cecd4a2bf12a713a2297bb4fee45916..edea7784ad39acd4afffc7f5ae6e50a20c04999d 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -101,7 +101,7 @@ struct dentry *ext2_get_parent(struct dentry *child)
  */
 static int ext2_create (struct mnt_idmap * idmap,
 			struct inode * dir, struct dentry * dentry,
-			umode_t mode, bool excl)
+			umode_t mode)
 {
 	struct inode *inode;
 	int err;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 2cd36f59c9e363124ee949f742adccd88447295a..a1e77390a7ce300db02db9af90e45d69efabfea5 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2806,7 +2806,7 @@ static int ext4_add_nondir(handle_t *handle,
  * with d_instantiate().
  */
 static int ext4_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	handle_t *handle;
 	struct inode *inode;
diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index b882771e469971dcf4e7a42416f9fbb8a5d9bf39..9bcbb8b521501b22d0fe2238b7729c342e95baa4 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -351,7 +351,7 @@ static struct inode *f2fs_new_inode(struct mnt_idmap *idmap,
 }
 
 static int f2fs_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	struct f2fs_sb_info *sbi = F2FS_I_SB(dir);
 	struct inode *inode;
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index 0b920ee40a7f9fe3c57af5d939d3efedf001a3d9..905ffa9e5b99f1507734d99b7c16dcad21d7b5b5 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -262,7 +262,7 @@ static int msdos_add_entry(struct inode *dir, const unsigned char *name,
 
 /***** Create a file */
 static int msdos_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct super_block *sb = dir->i_sb;
 	struct inode *inode = NULL;
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 5dbc4cbb8fce3d9b891cbc597f876c2c7b8d6aa0..8396b1ec4ec582fcdfadbcb12b04694ef0b8c5fc 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -754,7 +754,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct dentry *dentry,
 }
 
 static int vfat_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	struct super_block *sb = dir->i_sb;
 	struct inode *inode;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 667774cc72a1d49796f531fcb342d2e4878beb85..b7a2cee9b18313f88e745c5bb406bcc72866e390 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -889,7 +889,7 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int fuse_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *entry, umode_t mode, bool excl)
+		       struct dentry *entry, umode_t mode)
 {
 	return fuse_mknod(idmap, dir, entry, mode, 0);
 }
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 8a7ed80d9f2d6e829b240629bdd18b5e0d30b5fc..b8e399dd1182b6ede0bcf1aa78bd7f9f2dca8b2b 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -942,15 +942,14 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
  * @dir: The directory in which to create the file
  * @dentry: The dentry of the new file
  * @mode: The mode of the new file
- * @excl: Force fail if inode exists
  *
  * Returns: errno
  */
 
 static int gfs2_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
-	return gfs2_create_inode(dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, excl);
+	return gfs2_create_inode(dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, 1);
 }
 
 /**
diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c
index 86a6b317b474a95f283f6a0908582efadde80892..c585942aa985686ca428d2d17f4401aa845a0eb8 100644
--- a/fs/hfs/dir.c
+++ b/fs/hfs/dir.c
@@ -190,7 +190,7 @@ static int hfs_dir_release(struct inode *inode, struct file *file)
  * the directory and the name (and its length) of the new file.
  */
 static int hfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 	int res;
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 1b3e27a0d5e038b559bd19b37d769078b2996d1b..c5ea04e078340a91b992095e189e978a3345f03c 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -518,7 +518,7 @@ static int hfsplus_mknod(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int hfsplus_create(struct mnt_idmap *idmap, struct inode *dir,
-			  struct dentry *dentry, umode_t mode, bool excl)
+			  struct dentry *dentry, umode_t mode)
 {
 	return hfsplus_mknod(&nop_mnt_idmap, dir, dentry, mode, 0);
 }
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 1e1acf5775ab5f6daf13bb917966d05f410d5ff5..18ca8cb9aa15e4015582ee5bd3db968c6b32de4b 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -593,7 +593,7 @@ static struct inode *hostfs_iget(struct super_block *sb, char *name)
 }
 
 static int hostfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			 struct dentry *dentry, umode_t mode, bool excl)
+			 struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 	char *name;
diff --git a/fs/hpfs/namei.c b/fs/hpfs/namei.c
index 353e13a615f56664638f08a3408f90a727f5458b..809113d8248d50c0eaa57047b6c4bd87b9a5c6be 100644
--- a/fs/hpfs/namei.c
+++ b/fs/hpfs/namei.c
@@ -129,7 +129,7 @@ static struct dentry *hpfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int hpfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	const unsigned char *name = dentry->d_name.name;
 	unsigned len = dentry->d_name.len;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 9c94ed8c3ab0028772b7afb5d03a91d280c38106..0fd0d73e450bdedd92b953b9dd00f6babe1246e7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1001,7 +1001,7 @@ static struct dentry *hugetlbfs_mkdir(struct mnt_idmap *idmap, struct inode *dir
 
 static int hugetlbfs_create(struct mnt_idmap *idmap,
 			    struct inode *dir, struct dentry *dentry,
-			    umode_t mode, bool excl)
+			    umode_t mode)
 {
 	return hugetlbfs_mknod(idmap, dir, dentry, mode | S_IFREG, 0);
 }
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index dd91f725ded69ccb3a240aafd72a4b552f21bcd9..e77c84e43621a8c53e9852843f18cc3514315650 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -25,7 +25,7 @@
 static int jffs2_readdir (struct file *, struct dir_context *);
 
 static int jffs2_create (struct mnt_idmap *, struct inode *,
-		         struct dentry *, umode_t, bool);
+			 struct dentry *, umode_t);
 static struct dentry *jffs2_lookup (struct inode *,struct dentry *,
 				    unsigned int);
 static int jffs2_link (struct dentry *,struct inode *,struct dentry *);
@@ -161,7 +161,7 @@ static int jffs2_readdir(struct file *file, struct dir_context *ctx)
 
 
 static int jffs2_create(struct mnt_idmap *idmap, struct inode *dir_i,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct jffs2_raw_inode *ri;
 	struct jffs2_inode_info *f, *dir_f;
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 65a218eba8faf9508f5727515b812f6de2661618..48111f8d3efe40becadd857c56c84ed09de867ef 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -60,7 +60,7 @@ static inline void free_ea_wmap(struct inode *inode)
  *
  */
 static int jfs_create(struct mnt_idmap *idmap, struct inode *dip,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	int rc = 0;
 	tid_t tid;		/* transaction id */
diff --git a/fs/minix/namei.c b/fs/minix/namei.c
index 8938536d8d3cf65c7e57f88f1819689365951fea..6540574f54781eab487074de7fe10ed38b1a8d1e 100644
--- a/fs/minix/namei.c
+++ b/fs/minix/namei.c
@@ -64,7 +64,7 @@ static int minix_tmpfile(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int minix_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	return minix_mknod(&nop_mnt_idmap, dir, dentry, mode, 0);
 }
diff --git a/fs/namei.c b/fs/namei.c
index d5ab28947b2b6c6e19c7bb4a9140ccec407dc07c..83da60fc298e523096e881b25c727d14f9553476 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3493,7 +3493,7 @@ int vfs_create(struct mnt_idmap *idmap, struct dentry *dentry, umode_t mode,
 	error = try_break_deleg(dir, di);
 	if (error)
 		return error;
-	error = dir->i_op->create(idmap, dir, dentry, mode, true);
+	error = dir->i_op->create(idmap, dir, dentry, mode);
 	if (!error)
 		fsnotify_create(dir, dentry);
 	return error;
@@ -3802,7 +3802,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 		}
 
 		error = dir_inode->i_op->create(idmap, dir_inode, dentry,
-						mode, open_flag & O_EXCL);
+						mode);
 		if (error)
 			goto out_dput;
 	}
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 46d9c65d50f83fc1dc73f3d7f5868b84132bb0fd..7fe18efcd37b08030c7a4e17832801abfc19a3bd 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -2377,9 +2377,9 @@ static int nfs_do_create(struct inode *dir, struct dentry *dentry,
 }
 
 int nfs_create(struct mnt_idmap *idmap, struct inode *dir,
-	       struct dentry *dentry, umode_t mode, bool excl)
+	       struct dentry *dentry, umode_t mode)
 {
-	return nfs_do_create(dir, dentry, mode, excl ? O_EXCL : 0);
+	return nfs_do_create(dir, dentry, mode, O_EXCL);
 }
 EXPORT_SYMBOL_GPL(nfs_create);
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 2ecd38e1d17a8053a9134702588d57efc35f49e9..b122c4f34f7b53c5102a8b5138efe269af433c81 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -398,7 +398,7 @@ extern unsigned long nfs_access_cache_scan(struct shrinker *shrink,
 struct dentry *nfs_lookup(struct inode *, struct dentry *, unsigned int);
 void nfs_d_prune_case_insensitive_aliases(struct inode *inode);
 int nfs_create(struct mnt_idmap *, struct inode *, struct dentry *,
-	       umode_t, bool);
+	       umode_t);
 struct dentry *nfs_mkdir(struct mnt_idmap *, struct inode *, struct dentry *,
 			 umode_t);
 int nfs_rmdir(struct inode *, struct dentry *);
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index 40f4b1a28705b6e0eb8f0978cf3ac18b43aa1331..31d1d466c03048aaaab23f64c3f413c095939770 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -86,7 +86,7 @@ nilfs_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
  * with d_instantiate().
  */
 static int nilfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 	struct nilfs_transaction_info ti;
diff --git a/fs/ntfs3/namei.c b/fs/ntfs3/namei.c
index 82c8ae56beee6d79046dd6c8f02ff0f35e9a1ad3..49fe635b550d3f51f81138649b47c9c831a73e3b 100644
--- a/fs/ntfs3/namei.c
+++ b/fs/ntfs3/namei.c
@@ -105,7 +105,7 @@ static struct dentry *ntfs_lookup(struct inode *dir, struct dentry *dentry,
  * ntfs_create - inode_operations::create
  */
 static int ntfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	return ntfs_create_inode(idmap, dir, dentry, NULL, S_IFREG | mode, 0,
 				 NULL, 0, NULL);
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index cccaa1d6fbbac13ebcaf14a9183277890708e643..bd4b2269598b49c6f88dd8d201e246ee5ed855a6 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -454,8 +454,7 @@ static struct dentry *dlmfs_mkdir(struct mnt_idmap * idmap,
 static int dlmfs_create(struct mnt_idmap *idmap,
 			struct inode *dir,
 			struct dentry *dentry,
-			umode_t mode,
-			bool excl)
+			umode_t mode)
 {
 	int status = 0;
 	struct inode *inode;
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index c90b254da75eb5b90d2af5e37d41e781efe8b836..7443f468f45657cf68779a02e4edf4e38fb70f59 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -666,8 +666,7 @@ static struct dentry *ocfs2_mkdir(struct mnt_idmap *idmap,
 static int ocfs2_create(struct mnt_idmap *idmap,
 			struct inode *dir,
 			struct dentry *dentry,
-			umode_t mode,
-			bool excl)
+			umode_t mode)
 {
 	int ret;
 
diff --git a/fs/omfs/dir.c b/fs/omfs/dir.c
index 2ed541fccf331d796805dd1594fbf05c1f7f3b9a..a09a98f7e30bc66deca60725f9462d081b5e4784 100644
--- a/fs/omfs/dir.c
+++ b/fs/omfs/dir.c
@@ -286,7 +286,7 @@ static struct dentry *omfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int omfs_create(struct mnt_idmap *idmap, struct inode *dir,
-		       struct dentry *dentry, umode_t mode, bool excl)
+		       struct dentry *dentry, umode_t mode)
 {
 	return omfs_add_node(dir, dentry, mode | S_IFREG);
 }
diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c
index bec5475de094dada6bb29eaf8520a875880f3bab..0ebaa7f000f26f1c1ecffd22cfe4272f20a783ed 100644
--- a/fs/orangefs/namei.c
+++ b/fs/orangefs/namei.c
@@ -18,8 +18,7 @@
 static int orangefs_create(struct mnt_idmap *idmap,
 			struct inode *dir,
 			struct dentry *dentry,
-			umode_t mode,
-			bool exclusive)
+			umode_t mode)
 {
 	struct orangefs_inode_s *parent = ORANGEFS_I(dir);
 	struct orangefs_kernel_op_s *new_op;
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index a5e9ddf3023b3942fafb9adb2770f26780a1b86b..0f70b3835f4a08c29d6bba8ae9143df55895e56b 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -704,7 +704,7 @@ static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev,
 }
 
 static int ovl_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	return ovl_create_object(dentry, (mode & 07777) | S_IFREG, 0, NULL);
 }
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 41f9995da7cab0d11395cb40a98fb4936d52597f..b6502aaa4fb44d27c939da9fae4449af7edd28d4 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -129,7 +129,7 @@ static struct dentry *ramfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int ramfs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	return ramfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFREG, 0);
 }
diff --git a/fs/smb/client/cifsfs.h b/fs/smb/client/cifsfs.h
index e9534258d1efd0bb34f36bf2c725c64d0a8ca8f4..294c66cea2eca3344e09cd77619761e9cb79a807 100644
--- a/fs/smb/client/cifsfs.h
+++ b/fs/smb/client/cifsfs.h
@@ -50,7 +50,7 @@ extern void cifs_sb_deactive(struct super_block *sb);
 extern const struct inode_operations cifs_dir_inode_ops;
 extern struct inode *cifs_root_iget(struct super_block *);
 extern int cifs_create(struct mnt_idmap *, struct inode *,
-		       struct dentry *, umode_t, bool excl);
+		       struct dentry *, umode_t);
 extern int cifs_atomic_open(struct inode *, struct dentry *,
 			    struct file *, unsigned, umode_t);
 extern struct dentry *cifs_lookup(struct inode *, struct dentry *,
diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c
index da5597dbf5b9f140c6801158ac2357fa911c52ab..b00bc214db9f0e9533f481f41ac99ac8937610ac 100644
--- a/fs/smb/client/dir.c
+++ b/fs/smb/client/dir.c
@@ -566,7 +566,7 @@ cifs_atomic_open(struct inode *inode, struct dentry *direntry,
 }
 
 int cifs_create(struct mnt_idmap *idmap, struct inode *inode,
-		struct dentry *direntry, umode_t mode, bool excl)
+		struct dentry *direntry, umode_t mode)
 {
 	int rc;
 	unsigned int xid = get_xid();
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 3c3d3ad4fa6cb719e9ec08fa2164c55371c017c1..4840a6f7974e254eba4ca249357e968764e326e0 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -303,7 +303,7 @@ static int ubifs_prepare_create(struct inode *dir, struct dentry *dentry,
 }
 
 static int ubifs_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode;
 	struct ubifs_info *c = dir->i_sb->s_fs_info;
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index 5f2e9a892bffa9579143cedf71d80efa7ad6e9fb..f83b5564cbc4c68c02c07bb3ab2109bfabdc799d 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -371,7 +371,7 @@ static int udf_add_nondir(struct dentry *dentry, struct inode *inode)
 }
 
 static int udf_create(struct mnt_idmap *idmap, struct inode *dir,
-		      struct dentry *dentry, umode_t mode, bool excl)
+		      struct dentry *dentry, umode_t mode)
 {
 	struct inode *inode = udf_new_inode(dir, mode);
 
diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
index 5b3c85c9324298f4ff6aa3d4feeb962ce5ede539..5012e056200aca671364d34a7faf647e6747e1d2 100644
--- a/fs/ufs/namei.c
+++ b/fs/ufs/namei.c
@@ -70,8 +70,7 @@ static struct dentry *ufs_lookup(struct inode * dir, struct dentry *dentry, unsi
  * with d_instantiate(). 
  */
 static int ufs_create (struct mnt_idmap * idmap,
-		struct inode * dir, struct dentry * dentry, umode_t mode,
-		bool excl)
+		struct inode * dir, struct dentry * dentry, umode_t mode)
 {
 	struct inode *inode;
 
diff --git a/fs/vboxsf/dir.c b/fs/vboxsf/dir.c
index 42bedc4ec7af7709c564a7174805d185ce86f854..330dade582d081e965c0e365bd2f96ae31d92ccc 100644
--- a/fs/vboxsf/dir.c
+++ b/fs/vboxsf/dir.c
@@ -298,9 +298,9 @@ static int vboxsf_dir_create(struct inode *parent, struct dentry *dentry,
 
 static int vboxsf_dir_mkfile(struct mnt_idmap *idmap,
 			     struct inode *parent, struct dentry *dentry,
-			     umode_t mode, bool excl)
+			     umode_t mode)
 {
-	return vboxsf_dir_create(parent, dentry, mode, false, excl, NULL);
+	return vboxsf_dir_create(parent, dentry, mode, false, true, NULL);
 }
 
 static struct dentry *vboxsf_dir_mkdir(struct mnt_idmap *idmap,
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index caff0125faeac093c1c05a722d3588e3f2e99926..2bc7faac35678b5b78acd6a50695a0d7b1c9a263 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -293,8 +293,7 @@ xfs_vn_create(
 	struct mnt_idmap	*idmap,
 	struct inode		*dir,
 	struct dentry		*dentry,
-	umode_t			mode,
-	bool			flags)
+	umode_t			mode)
 {
 	return xfs_generic_create(idmap, dir, dentry, mode, 0, NULL);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 64323e618724bc20dc101db13035b042f5f88e4d..b9a32e10078f5a1a0bbeb0d8913ac3e4b5b3a85d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2345,8 +2345,8 @@ struct inode_operations {
 
 	int (*readlink) (struct dentry *, char __user *,int);
 
-	int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,
-		       umode_t, bool);
+	int (*create) (struct mnt_idmap *, struct inode *, struct dentry *,
+		       umode_t);
 	int (*link) (struct dentry *,struct inode *,struct dentry *);
 	int (*unlink) (struct inode *,struct dentry *);
 	int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 093551fe66a7eb884fc34ef853a0ca92b95770af..9ae28c79fe0578bf96b2d22daed45b48aba0b946 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -610,7 +610,7 @@ static int mqueue_create_attr(struct dentry *dentry, umode_t mode, void *arg)
 }
 
 static int mqueue_create(struct mnt_idmap *idmap, struct inode *dir,
-			 struct dentry *dentry, umode_t mode, bool excl)
+			 struct dentry *dentry, umode_t mode)
 {
 	return mqueue_create_attr(dentry, mode, NULL);
 }
diff --git a/mm/shmem.c b/mm/shmem.c
index b9081b817d28f3db1fbdd90ed3f04b6904d6ff18..8fdc9cbecb908e127f8173ca8888b5e038354fed 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3912,7 +3912,7 @@ static struct dentry *shmem_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 
 static int shmem_create(struct mnt_idmap *idmap, struct inode *dir,
-			struct dentry *dentry, umode_t mode, bool excl)
+			struct dentry *dentry, umode_t mode)
 {
 	return shmem_mknod(idmap, dir, dentry, mode | S_IFREG, 0);
 }

---
base-commit: 76ddfe7d66d631e5e31ef4e5dd59797fa03acbf7
change-id: 20251105-create-excl-2b366d9bf3bb

Best regards,
-- 
Jeff Layton <jlayton at kernel.org>


From johannes at sipsolutions.net  Tue Nov 11 00:01:25 2025
From: johannes at sipsolutions.net (Johannes Berg)
Date: Tue, 11 Nov 2025 09:01:25 +0100
Subject: [PATCH v13 00/13] nommu UML
In-Reply-To: <m2framxerm.wl-thehajime@gmail.com>
References: <cover.1762588860.git.thehajime@gmail.com>
		<aRGs8lPjH22NqMZc@infradead.org> <m2framxerm.wl-thehajime@gmail.com>
Message-ID: <0a84c16f862026c82271c0adbc91d98b812a78b4.camel@sipsolutions.net>

On Mon, 2025-11-10 at 21:18 +0900, Hajime Tazaki wrote:
> 
>   What is it for ?
>   ================
>   
>   - Alleviate syscall hook overhead implemented with ptrace(2)
>   - To exercises nommu code over UML (and over KUnit)
>   - Less dependency to host facilities

FWIW, in some way, this order of priorities is exactly why this hasn't
been going anywhere, and every time I looked at it I got somewhat
annoyed by what seems to me like choices made to support especially the
first bullet.

I suspect that the first and third bullet are not even really true any
more, since you moved to seccomp (per our request), yet I think design
choices influenced by them persist.

People are definitely interested in the second bullet, mostly for kunit,
and I'd be willing to support them in that to some extent.

However, I'm not yet convinced that all of the complexities presented in
this patchset (such as completely separate seccomp implementation) are
actually necessary in support of _just_ the second bullet. These seem to
me like design choices necessary to support the _first_ bullet [1].

[1] and then I suppose the third, which I'm reading as "doesn't need
seccomp or ptrace", but I'm not really quite sure what you meant


I've thought about what would happen if we stuck to creating a (single)
separate process on the host to execute userspace, and just used
CLONE_VM for it. That way, it's still no-MMU with full memory access,
but there's some implicit isolation between the kernel and userspace
processes which will likely remove complexities around FP/SSE/AVX
handling, may completely remove the need for a separate seccomp
implementation, etc.
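
A minimal sketch of the CLONE_VM idea, not code from this series: the
child below is a separate host process that shares the parent's address
space, so a store made by the child is visible to the parent. The names
run_clone_vm_demo(), child_fn() and shared_value are made up for
illustration, and the 64 KiB stack size is an arbitrary assumption.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Written by the child, read by the parent; shared because of CLONE_VM. */
volatile int shared_value;

/* Runs in the child process, on the stack handed to clone(). */
int child_fn(void *arg)
{
	(void)arg;
	shared_value = 42;
	return 0;
}

/* Returns the value the child stored, or -1 on failure. */
int run_clone_vm_demo(void)
{
	const size_t stack_size = 64 * 1024;	/* assumed size */
	char *stack = malloc(stack_size);
	pid_t pid;

	if (!stack)
		return -1;

	/* Pass the top of the buffer: assumes a downward-growing stack. */
	pid = clone(child_fn, stack + stack_size, CLONE_VM | SIGCHLD, NULL);
	if (pid < 0 || waitpid(pid, NULL, 0) < 0) {
		free(stack);
		return -1;
	}

	free(stack);
	return shared_value;
}
```

Without CLONE_VM the child would get its own copy of shared_value and
the parent would still read 0 after waitpid().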

It would, on the other hand, make it completely non-viable to achieve
the first and third bullets, so given your pursuit of those, on some
level I understand the design right now. I'm yet to be convinced,
however, that those are even worthy goals for (upstream) UML. What use
case would that enable that we really need? Especially considering that
over a longer perspective, NOMMU architectures _are_ on their way out,
and UML will certainly follow once that happens, it won't be the last
remaining NOMMU architecture.

So the only value I see in this is for testing over the next couple of
years, which really doesn't need any sort of significant optimisation or
less reliance on host facilities.

Where do you see this differently?

johannes


From xuanzhuo at linux.alibaba.com  Tue Nov 11 03:12:12 2025
From: xuanzhuo at linux.alibaba.com (Xuan Zhuo)
Date: Tue, 11 Nov 2025 19:12:12 +0800
Subject: [PATCH net v5 2/2] virtio-net: correct hdr_len handling for tunnel gso
In-Reply-To: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com>
References: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com>
Message-ID: <20251111111212.102083-3-xuanzhuo@linux.alibaba.com>

The commit a2fb4bc4e2a6a03 ("net: implement virtio helpers to handle UDP
GSO tunneling.") introduces support for the UDP GSO tunnel feature in
virtio-net.

The virtio spec says:

    If the \field{gso_type} has the VIRTIO_NET_HDR_GSO_UDP_TUNNEL_IPV4 bit or
    VIRTIO_NET_HDR_GSO_UDP_TUNNEL_IPV6 bit set, \field{hdr_len} accounts for
    all the headers up to and including the inner transport.

The commit did not update the hdr_len to include the inner transport.

I observed that the "hdr_len" is 116 for this packet:

    17:36:18.241105 52:55:00:d1:27:0a > 2e:2c:df:46:a9:e1, ethertype IPv4 (0x0800), length 2912: (tos 0x0, ttl 64, id 45197, offset 0, flags [none], proto UDP (17), length 2898)
        192.168.122.100.50613 > 192.168.122.1.4789: [bad udp cksum 0x8106 -> 0x26a0!] VXLAN, flags [I] (0x08), vni 1
    fa:c3:ba:82:05:ee > ce:85:0c:31:77:e5, ethertype IPv4 (0x0800), length 2862: (tos 0x0, ttl 64, id 14678, offset 0, flags [DF], proto TCP (6), length 2848)
        192.168.3.1.49880 > 192.168.3.2.9898: Flags [P.], cksum 0x9266 (incorrect -> 0xaa20), seq 515667:518463, ack 1, win 64, options [nop,nop,TS val 2990048824 ecr 2798801412], length 2796

116 = 14(mac) + 20(ip) + 8(udp) + 8(vxlan) + 14(inner mac) + 20(inner ip) + 32(inner tcp)
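
The sum above can be written out as a small sketch. This is only an
illustration of the arithmetic, not kernel code: struct tnl_hdr_sizes,
tnl_hdr_len() and captured_packet_hdr_len() are hypothetical names, and
the sizes are the ones read off the capture above.

```c
#include <stddef.h>

/* Hypothetical per-layer header sizes for one tunnelled packet, in bytes. */
struct tnl_hdr_sizes {
	size_t outer_mac;
	size_t outer_ip;
	size_t outer_udp;
	size_t vxlan;
	size_t inner_mac;
	size_t inner_ip;
	size_t inner_transport;
};

/* All headers up to and including the inner transport header. */
size_t tnl_hdr_len(const struct tnl_hdr_sizes *s)
{
	return s->outer_mac + s->outer_ip + s->outer_udp + s->vxlan +
	       s->inner_mac + s->inner_ip + s->inner_transport;
}

/* Header sizes of the VXLAN/TCP packet shown in the capture. */
size_t captured_packet_hdr_len(void)
{
	const struct tnl_hdr_sizes s = {
		.outer_mac = 14,
		.outer_ip = 20,
		.outer_udp = 8,
		.vxlan = 8,
		.inner_mac = 14,
		.inner_ip = 20,
		.inner_transport = 32,	/* TCP header including options */
	};

	return tnl_hdr_len(&s);
}
```

captured_packet_hdr_len() yields 116, matching the observed hdr_len.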

Fixes: a2fb4bc4e2a6a03 ("net: implement virtio helpers to handle UDP GSO tunneling.")
Signed-off-by: Xuan Zhuo <xuanzhuo at linux.alibaba.com>
---
 include/linux/virtio_net.h | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 3cd8b2ebc197..432b17979d17 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -232,12 +232,23 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
 			return -EINVAL;
 
 		if (hdrlen_negotiated) {
-			hdr_len = skb_transport_offset(skb);
+			if (sinfo->gso_type & (SKB_GSO_UDP_TUNNEL |
+					       SKB_GSO_UDP_TUNNEL_CSUM)) {
+				hdr_len = skb_inner_transport_offset(skb);
+
+				if (hdr->gso_type == VIRTIO_NET_HDR_GSO_UDP_L4)
+					hdr_len += sizeof(struct udphdr);
+				else
+					hdr_len += inner_tcp_hdrlen(skb);
+			} else {
+				hdr_len = skb_transport_offset(skb);
+
+				if (hdr->gso_type == VIRTIO_NET_HDR_GSO_UDP_L4)
+					hdr_len += sizeof(struct udphdr);
+				else
+					hdr_len += tcp_hdrlen(skb);
+			}
 
-			if (hdr->gso_type == VIRTIO_NET_HDR_GSO_UDP_L4)
-				hdr_len += sizeof(struct udphdr);
-			else
-				hdr_len += tcp_hdrlen(skb);
 		} else {
 			/* This is a hint as to how much should be linear. */
 			hdr_len = skb_headlen(skb);
@@ -421,11 +432,8 @@ virtio_net_hdr_tnl_from_skb(const struct sk_buff *skb,
         vhdr->hash_hdr.hash_report = 0;
         vhdr->hash_hdr.padding = 0;
 
-	/* Let the basic parsing deal with plain GSO features. */
-	skb_shinfo(skb)->gso_type &= ~tnl_gso_type;
 	ret = virtio_net_hdr_from_skb(skb, hdr, true, false, hdrlen_negotiated,
 				      vlan_hlen);
-	skb_shinfo(skb)->gso_type |= tnl_gso_type;
 	if (ret)
 		return ret;
 
-- 
2.32.0.3.g01195cf9f


From xuanzhuo at linux.alibaba.com  Tue Nov 11 03:12:10 2025
From: xuanzhuo at linux.alibaba.com (Xuan Zhuo)
Date: Tue, 11 Nov 2025 19:12:10 +0800
Subject: [PATCH net v5 0/2] virtio-net: fix for VIRTIO_NET_F_GUEST_HDRLEN
Message-ID: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com>

The commit be50da3e9d4a ("net: virtio_net: implement exact header length
guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN
feature in virtio-net.

This feature requires virtio-net to set hdr_len to the actual header
length of the packet when transmitting, that is, the number of bytes
from the start of the packet to the beginning of the transport-layer
payload.

However, in practice, hdr_len was being set using skb_headlen(skb),
which is clearly incorrect. This patch set fixes that issue.

As discussed in [0], this version checks whether
VIRTIO_NET_F_GUEST_HDRLEN is negotiated.

[0]: http://lore.kernel.org/all/20251029030913.20423-1-xuanzhuo at linux.alibaba.com

Xuan Zhuo (2):
  virtio-net: correct hdr_len handling for VIRTIO_NET_F_GUEST_HDRLEN
  virtio-net: correct hdr_len handling for tunnel gso

 arch/um/drivers/vector_transports.c |  1 +
 drivers/net/tun_vnet.h              |  4 +--
 drivers/net/virtio_net.c            |  9 +++++--
 include/linux/virtio_net.h          | 40 +++++++++++++++++++++++------
 net/packet/af_packet.c              |  5 ++--
 5 files changed, 45 insertions(+), 14 deletions(-)

--
2.32.0.3.g01195cf9f


From xuanzhuo at linux.alibaba.com  Tue Nov 11 03:12:11 2025
From: xuanzhuo at linux.alibaba.com (Xuan Zhuo)
Date: Tue, 11 Nov 2025 19:12:11 +0800
Subject: [PATCH net v5 1/2] virtio-net: correct hdr_len handling for VIRTIO_NET_F_GUEST_HDRLEN
In-Reply-To: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com>
References: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com>
Message-ID: <20251111111212.102083-2-xuanzhuo@linux.alibaba.com>

The commit be50da3e9d4a ("net: virtio_net: implement exact header length
guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN
feature in virtio-net.

This feature requires virtio-net to set hdr_len to the actual header
length of the packet when transmitting, that is, the number of bytes
from the start of the packet to the beginning of the transport-layer
payload.

However, in practice, hdr_len was being set using skb_headlen(skb),
which is clearly incorrect. This commit fixes that issue.
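
As a rough illustration of what "exact header length" means for a plain
TCP packet, the sketch below adds the TCP header length, taken from the
data-offset field, to the offset of the transport header. It works on a
raw byte buffer rather than an skb; tcp_header_len() and
exact_hdr_len() are hypothetical helpers, not functions from the
kernel.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Length of a TCP header in bytes. The upper four bits of byte 12 hold
 * the data offset, counted in 32-bit words.
 */
size_t tcp_header_len(const uint8_t *tcp)
{
	return (size_t)(tcp[12] >> 4) * 4;
}

/*
 * Bytes from the start of the packet to the start of the TCP payload:
 * everything before the transport header plus the TCP header itself.
 */
size_t exact_hdr_len(size_t transport_offset, const uint8_t *tcp)
{
	return transport_offset + tcp_header_len(tcp);
}
```

For an untagged IPv4 packet with no IP options the transport offset is
14 + 20 = 34, so a TCP header carrying 12 bytes of options (data offset
8) gives 34 + 32 = 66. skb_headlen(), by contrast, reports how much of
the packet is in the linear area, which need not match this value.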

Fixes: be50da3e9d4a ("net: virtio_net: implement exact header length guest feature")
Signed-off-by: Xuan Zhuo <xuanzhuo at linux.alibaba.com>
---
 arch/um/drivers/vector_transports.c |  1 +
 drivers/net/tun_vnet.h              |  4 ++--
 drivers/net/virtio_net.c            |  9 +++++++--
 include/linux/virtio_net.h          | 26 +++++++++++++++++++++-----
 net/packet/af_packet.c              |  5 +++--
 5 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/arch/um/drivers/vector_transports.c b/arch/um/drivers/vector_transports.c
index 0794d23f07cb..03c5baa1d0c1 100644
--- a/arch/um/drivers/vector_transports.c
+++ b/arch/um/drivers/vector_transports.c
@@ -121,6 +121,7 @@ static int raw_form_header(uint8_t *header,
 		vheader,
 		virtio_legacy_is_little_endian(),
 		false,
+		false,
 		0
 	);
 
diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
index 81662328b2c7..0d376bc70dd7 100644
--- a/drivers/net/tun_vnet.h
+++ b/drivers/net/tun_vnet.h
@@ -214,7 +214,7 @@ static inline int tun_vnet_hdr_from_skb(unsigned int flags,
 
 	if (virtio_net_hdr_from_skb(skb, hdr,
 				    tun_vnet_is_little_endian(flags), true,
-				    vlan_hlen)) {
+				    false, vlan_hlen)) {
 		struct skb_shared_info *sinfo = skb_shinfo(skb);
 
 		if (net_ratelimit()) {
@@ -244,7 +244,7 @@ tun_vnet_hdr_tnl_from_skb(unsigned int flags,
 
 	if (virtio_net_hdr_tnl_from_skb(skb, tnl_hdr, has_tnl_offload,
 					tun_vnet_is_little_endian(flags),
-					vlan_hlen)) {
+					false, vlan_hlen)) {
 		struct virtio_net_hdr_v1 *hdr = &tnl_hdr->hash_hdr.hdr;
 		struct skb_shared_info *sinfo = skb_shinfo(skb);
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0369dda5ed60..b335c88a8cd6 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3317,9 +3317,13 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool orphan)
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
 	struct virtnet_info *vi = sq->vq->vdev->priv;
 	struct virtio_net_hdr_v1_hash_tunnel *hdr;
-	int num_sg;
 	unsigned hdr_len = vi->hdr_len;
+	bool hdrlen_negotiated;
 	bool can_push;
+	int num_sg;
+
+	hdrlen_negotiated = virtio_has_feature(vi->vdev,
+					       VIRTIO_NET_F_GUEST_HDRLEN);
 
 	pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
 
@@ -3339,7 +3343,8 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool orphan)
 		hdr = &skb_vnet_common_hdr(skb)->tnl_hdr;
 
 	if (virtio_net_hdr_tnl_from_skb(skb, hdr, vi->tx_tnl,
-					virtio_is_little_endian(vi->vdev), 0))
+					virtio_is_little_endian(vi->vdev),
+					hdrlen_negotiated, 0))
 		return -EPROTO;
 
 	if (vi->mergeable_rx_bufs)
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index b673c31569f3..3cd8b2ebc197 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -211,16 +211,15 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
 					  struct virtio_net_hdr *hdr,
 					  bool little_endian,
 					  bool has_data_valid,
+					  bool hdrlen_negotiated,
 					  int vlan_hlen)
 {
 	memset(hdr, 0, sizeof(*hdr));   /* no info leak */
 
 	if (skb_is_gso(skb)) {
 		struct skb_shared_info *sinfo = skb_shinfo(skb);
+		u16 hdr_len;
 
-		/* This is a hint as to how much should be linear. */
-		hdr->hdr_len = __cpu_to_virtio16(little_endian,
-						 skb_headlen(skb));
 		hdr->gso_size = __cpu_to_virtio16(little_endian,
 						  sinfo->gso_size);
 		if (sinfo->gso_type & SKB_GSO_TCPV4)
@@ -231,6 +230,21 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
 			hdr->gso_type = VIRTIO_NET_HDR_GSO_UDP_L4;
 		else
 			return -EINVAL;
+
+		if (hdrlen_negotiated) {
+			hdr_len = skb_transport_offset(skb);
+
+			if (hdr->gso_type == VIRTIO_NET_HDR_GSO_UDP_L4)
+				hdr_len += sizeof(struct udphdr);
+			else
+				hdr_len += tcp_hdrlen(skb);
+		} else {
+			/* This is a hint as to how much should be linear. */
+			hdr_len = skb_headlen(skb);
+		}
+
+		hdr->hdr_len = __cpu_to_virtio16(little_endian, hdr_len);
+
 		if (sinfo->gso_type & SKB_GSO_TCP_ECN)
 			hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
 	} else
@@ -384,6 +398,7 @@ virtio_net_hdr_tnl_from_skb(const struct sk_buff *skb,
 			    struct virtio_net_hdr_v1_hash_tunnel *vhdr,
 			    bool tnl_hdr_negotiated,
 			    bool little_endian,
+			    bool hdrlen_negotiated,
 			    int vlan_hlen)
 {
 	struct virtio_net_hdr *hdr = (struct virtio_net_hdr *)vhdr;
@@ -395,7 +410,7 @@ virtio_net_hdr_tnl_from_skb(const struct sk_buff *skb,
 						    SKB_GSO_UDP_TUNNEL_CSUM);
 	if (!tnl_gso_type)
 		return virtio_net_hdr_from_skb(skb, hdr, little_endian, false,
-					       vlan_hlen);
+					       hdrlen_negotiated, vlan_hlen);
 
 	/* Tunnel support not negotiated but skb ask for it. */
 	if (!tnl_hdr_negotiated)
@@ -408,7 +423,8 @@ virtio_net_hdr_tnl_from_skb(const struct sk_buff *skb,
 
 	/* Let the basic parsing deal with plain GSO features. */
 	skb_shinfo(skb)->gso_type &= ~tnl_gso_type;
-	ret = virtio_net_hdr_from_skb(skb, hdr, true, false, vlan_hlen);
+	ret = virtio_net_hdr_from_skb(skb, hdr, true, false, hdrlen_negotiated,
+				      vlan_hlen);
 	skb_shinfo(skb)->gso_type |= tnl_gso_type;
 	if (ret)
 		return ret;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 173e6edda08f..6982f4ab1c73 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2093,7 +2093,8 @@ static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
 		return -EINVAL;
 	*len -= vnet_hdr_sz;
 
-	if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)&vnet_hdr, vio_le(), true, 0))
+	if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)&vnet_hdr,
+				    vio_le(), true, false, 0))
 		return -EINVAL;
 
 	return memcpy_to_msg(msg, (void *)&vnet_hdr, vnet_hdr_sz);
@@ -2361,7 +2362,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 	if (vnet_hdr_sz &&
 	    virtio_net_hdr_from_skb(skb, h.raw + macoff -
 				    sizeof(struct virtio_net_hdr),
-				    vio_le(), true, 0)) {
+				    vio_le(), true, false, 0)) {
 		if (po->tp_version == TPACKET_V3)
 			prb_clear_blk_fill_status(&po->rx_ring);
 		goto drop_n_account;
-- 
2.32.0.3.g01195cf9f


From mst at redhat.com  Tue Nov 11 03:33:04 2025
From: mst at redhat.com (Michael S. Tsirkin)
Date: Tue, 11 Nov 2025 06:33:04 -0500
Subject: [PATCH net v5 1/2] virtio-net: correct hdr_len handling for
 VIRTIO_NET_F_GUEST_HDRLEN
In-Reply-To: <20251111111212.102083-2-xuanzhuo@linux.alibaba.com>
References: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com>
 <20251111111212.102083-2-xuanzhuo@linux.alibaba.com>
Message-ID: <20251111062859-mutt-send-email-mst@kernel.org>

On Tue, Nov 11, 2025 at 07:12:11PM +0800, Xuan Zhuo wrote:
> The commit be50da3e9d4a ("net: virtio_net: implement exact header length
> guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN
> feature in virtio-net.
> 
> This feature requires virtio-net to set hdr_len to the actual header
> length of the packet when transmitting, the number of
> bytes from the start of the packet to the beginning of the
> transport-layer payload.
> 
> However, in practice, hdr_len was being set using skb_headlen(skb),
> which is clearly incorrect. This commit fixes that issue.
> 
> Fixes: be50da3e9d4a ("net: virtio_net: implement exact header length guest feature")
> Signed-off-by: Xuan Zhuo <xuanzhuo at linux.alibaba.com>
> ---
>  arch/um/drivers/vector_transports.c |  1 +
>  drivers/net/tun_vnet.h              |  4 ++--
>  drivers/net/virtio_net.c            |  9 +++++++--
>  include/linux/virtio_net.h          | 26 +++++++++++++++++++++-----
>  net/packet/af_packet.c              |  5 +++--
>  5 files changed, 34 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/um/drivers/vector_transports.c b/arch/um/drivers/vector_transports.c
> index 0794d23f07cb..03c5baa1d0c1 100644
> --- a/arch/um/drivers/vector_transports.c
> +++ b/arch/um/drivers/vector_transports.c
> @@ -121,6 +121,7 @@ static int raw_form_header(uint8_t *header,
>  		vheader,
>  		virtio_legacy_is_little_endian(),
>  		false,
> +		false,
>  		0
>  	);
>  
> diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h
> index 81662328b2c7..0d376bc70dd7 100644
> --- a/drivers/net/tun_vnet.h
> +++ b/drivers/net/tun_vnet.h
> @@ -214,7 +214,7 @@ static inline int tun_vnet_hdr_from_skb(unsigned int flags,
>  
>  	if (virtio_net_hdr_from_skb(skb, hdr,
>  				    tun_vnet_is_little_endian(flags), true,
> -				    vlan_hlen)) {
> +				    false, vlan_hlen)) {
>  		struct skb_shared_info *sinfo = skb_shinfo(skb);
>  
>  		if (net_ratelimit()) {
> @@ -244,7 +244,7 @@ tun_vnet_hdr_tnl_from_skb(unsigned int flags,
>  
>  	if (virtio_net_hdr_tnl_from_skb(skb, tnl_hdr, has_tnl_offload,
>  					tun_vnet_is_little_endian(flags),
> -					vlan_hlen)) {
> +					false, vlan_hlen)) {
>  		struct virtio_net_hdr_v1 *hdr = &tnl_hdr->hash_hdr.hdr;
>  		struct skb_shared_info *sinfo = skb_shinfo(skb);
>  
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 0369dda5ed60..b335c88a8cd6 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3317,9 +3317,13 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool orphan)
>  	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
>  	struct virtnet_info *vi = sq->vq->vdev->priv;
>  	struct virtio_net_hdr_v1_hash_tunnel *hdr;
> -	int num_sg;
>  	unsigned hdr_len = vi->hdr_len;
> +	bool hdrlen_negotiated;
>  	bool can_push;
> +	int num_sg;
> +
> +	hdrlen_negotiated = virtio_has_feature(vi->vdev,
> +					       VIRTIO_NET_F_GUEST_HDRLEN);
>  
>  	pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
>  
> @@ -3339,7 +3343,8 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool orphan)
>  		hdr = &skb_vnet_common_hdr(skb)->tnl_hdr;
>  
>  	if (virtio_net_hdr_tnl_from_skb(skb, hdr, vi->tx_tnl,
> -					virtio_is_little_endian(vi->vdev), 0))
> +					virtio_is_little_endian(vi->vdev),
> +					hdrlen_negotiated, 0))
>  		return -EPROTO;
>  
>  	if (vi->mergeable_rx_bufs)
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index b673c31569f3..3cd8b2ebc197 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -211,16 +211,15 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
>  					  struct virtio_net_hdr *hdr,
>  					  bool little_endian,
>  					  bool has_data_valid,
> +					  bool hdrlen_negotiated,
>  					  int vlan_hlen)

Took me a while to figure out why tun passes false here.

The reason is that this flag really only deals with the guest
hdrlen.  So how about naming it guest_hdrlen to mirror the spec,
or xmit_hdrlen if you prefer?


>  {
>  	memset(hdr, 0, sizeof(*hdr));   /* no info leak */
>  
>  	if (skb_is_gso(skb)) {
>  		struct skb_shared_info *sinfo = skb_shinfo(skb);
> +		u16 hdr_len;
>  
> -		/* This is a hint as to how much should be linear. */
> -		hdr->hdr_len = __cpu_to_virtio16(little_endian,
> -						 skb_headlen(skb));
>  		hdr->gso_size = __cpu_to_virtio16(little_endian,
>  						  sinfo->gso_size);
>  		if (sinfo->gso_type & SKB_GSO_TCPV4)
> @@ -231,6 +230,21 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
>  			hdr->gso_type = VIRTIO_NET_HDR_GSO_UDP_L4;
>  		else
>  			return -EINVAL;
> +
> +		if (hdrlen_negotiated) {
> +			hdr_len = skb_transport_offset(skb);
> +
> +			if (hdr->gso_type == VIRTIO_NET_HDR_GSO_UDP_L4)
> +				hdr_len += sizeof(struct udphdr);
> +			else
> +				hdr_len += tcp_hdrlen(skb);
> +		} else {
> +			/* This is a hint as to how much should be linear. */
> +			hdr_len = skb_headlen(skb);
> +		}
> +
> +		hdr->hdr_len = __cpu_to_virtio16(little_endian, hdr_len);
> +
>  		if (sinfo->gso_type & SKB_GSO_TCP_ECN)
>  			hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
>  	} else
> @@ -384,6 +398,7 @@ virtio_net_hdr_tnl_from_skb(const struct sk_buff *skb,
>  			    struct virtio_net_hdr_v1_hash_tunnel *vhdr,
>  			    bool tnl_hdr_negotiated,
>  			    bool little_endian,
> +			    bool hdrlen_negotiated,
>  			    int vlan_hlen)
>  {
>  	struct virtio_net_hdr *hdr = (struct virtio_net_hdr *)vhdr;
> @@ -395,7 +410,7 @@ virtio_net_hdr_tnl_from_skb(const struct sk_buff *skb,
>  						    SKB_GSO_UDP_TUNNEL_CSUM);
>  	if (!tnl_gso_type)
>  		return virtio_net_hdr_from_skb(skb, hdr, little_endian, false,
> -					       vlan_hlen);
> +					       hdrlen_negotiated, vlan_hlen);
>  
>  	/* Tunnel support not negotiated but skb ask for it. */
>  	if (!tnl_hdr_negotiated)
> @@ -408,7 +423,8 @@ virtio_net_hdr_tnl_from_skb(const struct sk_buff *skb,
>  
>  	/* Let the basic parsing deal with plain GSO features. */
>  	skb_shinfo(skb)->gso_type &= ~tnl_gso_type;
> -	ret = virtio_net_hdr_from_skb(skb, hdr, true, false, vlan_hlen);
> +	ret = virtio_net_hdr_from_skb(skb, hdr, true, false, hdrlen_negotiated,
> +				      vlan_hlen);
>  	skb_shinfo(skb)->gso_type |= tnl_gso_type;
>  	if (ret)
>  		return ret;
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 173e6edda08f..6982f4ab1c73 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -2093,7 +2093,8 @@ static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
>  		return -EINVAL;
>  	*len -= vnet_hdr_sz;
>  
> -	if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)&vnet_hdr, vio_le(), true, 0))
> +	if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)&vnet_hdr,
> +				    vio_le(), true, false, 0))
>  		return -EINVAL;
>  
>  	return memcpy_to_msg(msg, (void *)&vnet_hdr, vnet_hdr_sz);
> @@ -2361,7 +2362,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>  	if (vnet_hdr_sz &&
>  	    virtio_net_hdr_from_skb(skb, h.raw + macoff -
>  				    sizeof(struct virtio_net_hdr),
> -				    vio_le(), true, 0)) {
> +				    vio_le(), true, false, 0)) {
>  		if (po->tp_version == TPACKET_V3)
>  			prb_clear_blk_fill_status(&po->rx_ring);
>  		goto drop_n_account;
> -- 
> 2.32.0.3.g01195cf9f


From thehajime at gmail.com  Wed Nov 12 00:52:56 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Wed, 12 Nov 2025 17:52:56 +0900
Subject: [PATCH v13 00/13] nommu UML
In-Reply-To: <0a84c16f862026c82271c0adbc91d98b812a78b4.camel@sipsolutions.net>
References: <cover.1762588860.git.thehajime@gmail.com>
	<aRGs8lPjH22NqMZc@infradead.org>
	<m2framxerm.wl-thehajime@gmail.com>
	<0a84c16f862026c82271c0adbc91d98b812a78b4.camel@sipsolutions.net>
Message-ID: <m2bjl7y6mv.wl-thehajime@gmail.com>


On Tue, 11 Nov 2025 17:01:25 +0900,
Johannes Berg wrote:
> 
> On Mon, 2025-11-10 at 21:18 +0900, Hajime Tazaki wrote:
> > 
> >   What is it for ?
> >   ================
> >   
> >   - Alleviate syscall hook overhead implemented with ptrace(2)
> >   - To exercises nommu code over UML (and over KUnit)
> >   - Less dependency to host facilities
> 
> FWIW, in some way, this order of priorities is exactly why this hasn't
> been going anywhere, and every time I looked at it I got somewhat
> annoyed by what seems to me like choices made to support especially the
> first bullet.

over the past versions, I've been emphasizing that the 2nd bullet (testing)
is the primary use case, as I saw several actual cases from mm folks,

https://lists.infradead.org/pipermail/maple-tree/2024-November/003775.html
https://lore.kernel.org/all/cb1cf0be-871d-4982-9a1b-5fdd54deec8d at lucifer.local/

and I think this is not limited to mm code.

the other 2 bullets are additional benefits which we observed in a
comment, and in our own experience.

https://lore.kernel.org/all/20241122121826.GA26024 at lst.de/
[2] https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf

but those are not the primary goal, so I'm not pushing this aspect
with usecases.

> I suspect that the first and third bullet are not even really true any
> more, since you moved to seccomp (per our request), yet I think design
> choices influenced by them persist.

this observation is not true; the first bullet still holds even
when using seccomp.  please look at the benchmark result in patch
[12/13], quoted below.

summary: most of the tests show that um-nommu+seccomp is x4 to x20 faster
than um-mmu+seccomp (and ptrace).

.. csv-table:: lmbench (usec)
  :header: ,native,um,um-mmu(s),um-nommu(s)

  select-10    ,0.5319,36.1214,24.2795,2.9174
  select-100   ,1.6019,34.6049,28.8865,3.8080
  select-1000  ,12.2588,43.6838,48.7438,12.7872
  syscall      ,0.1644,35.0321,53.2119,2.5981
  read         ,0.3055,31.5509,45.8538,2.7068
  write        ,0.2512,31.3609,29.2636,2.6948
  stat         ,1.8894,43.8477,49.6121,3.1908
  open/close   ,3.2973,77.5123,68.9431,6.2575
  fork+sh      ,1110.3000,7359.5000,4618.6667,439.4615
  fork+execve  ,510.8182,2834.0000,2461.1667,139.7848

.. csv-table:: do_getpid bench (nsec)
  :header: ,native,um,um-mmu(s),um-nommu(s)

  getpid , 161 , 34477 , 26242 , 2599

the 1st bullet mentioning ptrace(2) is somewhat misleading now.  it might
be rephrased as "a separate process handling userspace", instead of
"ptrace".
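
The ratios behind the "x4 to x20" summary can be rechecked with a small
sketch like the one below.  The numbers are copied from the um-mmu(s)
and um-nommu(s) columns of the lmbench table above; the struct and
function names are made up for illustration only.

```c
#include <stddef.h>

/* One lmbench row: um-mmu(s) and um-nommu(s) latencies in usec. */
struct bench_row {
	const char *name;
	double mmu_seccomp;
	double nommu_seccomp;
};

/* Values copied from the lmbench table above. */
static const struct bench_row lmbench_rows[] = {
	{ "select-10",   24.2795,   2.9174 },
	{ "select-100",  28.8865,   3.8080 },
	{ "select-1000", 48.7438,   12.7872 },
	{ "syscall",     53.2119,   2.5981 },
	{ "read",        45.8538,   2.7068 },
	{ "write",       29.2636,   2.6948 },
	{ "stat",        49.6121,   3.1908 },
	{ "open/close",  68.9431,   6.2575 },
	{ "fork+sh",     4618.6667, 439.4615 },
	{ "fork+execve", 2461.1667, 139.7848 },
};

#define LMBENCH_NROWS (sizeof(lmbench_rows) / sizeof(lmbench_rows[0]))

/* How many times faster um-nommu(s) is than um-mmu(s) for one row. */
static inline double speedup(const struct bench_row *r)
{
	return r->mmu_seccomp / r->nommu_seccomp;
}

/* Smallest and largest speedup over n rows (n must be at least 1). */
static inline void speedup_range(const struct bench_row *rows, size_t n,
				 double *min, double *max)
{
	size_t i;

	*min = *max = speedup(&rows[0]);
	for (i = 1; i < n; i++) {
		double s = speedup(&rows[i]);

		if (s < *min)
			*min = s;
		if (s > *max)
			*max = s;
	}
}
```

with these inputs the smallest ratio comes from select-1000 (about 3.8)
and the largest from syscall (about 20.5).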

# when I started this patchset, the seccomp patch wasn't upstream yet,
  so saying ptrace(2) wasn't that wrong.

> People are definitely interested in the second bullet, mostly for kunit,
> and I'd be willing to support them in that to some extent.

so (again) the 2nd bullet is the primary use case at this stage.

> However, I'm not yet convinced that all of the complexities presented in
> this patchset (such as completely separate seccomp implementation) are
> actually necessary in support of _just_ the second bullet. These seem to
> me like design choices necessary to support the _first_ bullet [1].

a separate seccomp implementation is indeed needed due to the design
choice we made, which is to use a single process to host the (um)
userspace.  I think there is no reason to unify the seccomp part because
the signal handlers and filter installation do different jobs.

I don't see why you see this as a _complexity_, as functionally the two
seccomp handling paths don't interfere with each other.  we have prepared
separate sub-directories for nommu to avoid unnecessary if/else
clauses in .c/.h files.  we haven't seen any functional regressions
since the RFC version (which was on the 6.12 kernel).

> [1] and then I suppose the third, which I'm reading as "doesn't need
> seccomp or ptrace", but I'm not really quite sure what you meant
> 
> I've thought about what would happen if we stuck to creating a (single)
> separate process on the host to execute userspace, and just used
> CLONE_VM for it. That way, it's still no-MMU with full memory access,
> but there's some implicit isolation between the kernel and userspace
> processes which will likely remove complexities around FP/SSE/AVX
> handling, may completely remove the need for a separate seccomp
> implementation, etc.

this would be doable I think, but we went a different way, as
using separate host processes (with ptrace/seccomp) is slow and adds
complexity through the synchronization between processes, which we
think is not easy to maintain in the future.

this was natural for us (not sure about maintainers): when we add a new
functionality, we consider several options to implement it, and take the
option which is faster, simpler, and cheaper to maintain.

the avoidance of separate processes is probably the core of the design
choice we made for nommu UML.  I'm not strongly pushing the benefits
of the 1st/3rd bullets, but I thought describing the characteristics of
what _this_ patchset can do would be useful; thus they are in the document.

additionally, if the design choice we made introduces any breakages on
existing code, or maintenance burdens, I would understand your concern
on the complexity, but I don't think this is the case.

> It would, on the other hand, make it completely non-viable to achieve
> the first and third bullets, so given your pursuit of those, one some
> level I understand the design right now. I'm yet to be convinced,
> however, that those are even worthy goals for (upstream) UML, what use
> case would that enable that we really need?

the use cases for those are inherited from the original implementation,
[2] above, which is running UML on containers with less host dependency
and speedups.  but again, this is not the primary goal at this stage.

if you think that the document should not describe the potential
benefits/usecases which are not related to the primary goal of the
functionality, I'd agree to remove those descriptions.

> Especially considering that
> over a longer perspective, NOMMU architectures _are_ on their way out,
> and UML will certainly follow once that happens, it won't be the last
> remaining NOMMU architecture.

I'm aware of this nommu removal discussion, but I also saw opinions
that do not support this direction.  This patchset is still
useful even now.

> So the only value I see in this is for testing over the net couple of
> years, which really doesn't need any sort of significant optimisation or
> less reliance on host facilities.

I agree with the former, but not the latter.

- there is a value with a real usecase,
- there are different ways to implement it but this went with the
  one with potential (additional) benefits,
- without breakages to the existing (MMU) uml code.

with that, we're proposing this patchset.

> Where do you see this differently?

thanks for the careful questions.
I hope my answer clarifies your concerns.

I also wish to understand the concerns of maintainers about the single
process design of nommu for um userspace; the codebase is still
young, so it may have unexpected influence on others.  but this is exactly
the reason why I also put myself in MAINTAINERS, in order to take care
of this patchset even though it is small (1.3k loc).


-- Hajime


From tiwei.bie at linux.dev  Wed Nov 12 08:36:51 2025
From: tiwei.bie at linux.dev (Tiwei Bie)
Date: Thu, 13 Nov 2025 00:36:51 +0800
Subject: [PATCH v13 00/13] nommu UML
In-Reply-To: <m2bjl7y6mv.wl-thehajime@gmail.com>
References: <m2bjl7y6mv.wl-thehajime@gmail.com>
Message-ID: <20251112163651.3689244-1-tiwei.bie@linux.dev>

On Wed, 12 Nov 2025 17:52:56 +0900, Hajime Tazaki wrote:
[...]
> > However, I'm not yet convinced that all of the complexities presented in
> > this patchset (such as completely separate seccomp implementation) are
> > actually necessary in support of _just_ the second bullet. These seem to
> > me like design choices necessary to support the _first_ bullet [1].
> 
> separate seccomp implementation is indeed needed due to the design
> choice we made, to use a single process to host a (um) userspace.  I
> think there is no reason to unify the seccomp part because the
> signal handlers and filter installation do the different jobs.
> 
> I don't see why you see this as a _complexity_, as functionally both
> seccomp handling don't interfere each other.  we have prepared
> separate sub-directories for nommu to avoid unnecessary if/else
> clauses in .c/.h files.

I have the same concern about the complexities introduced by this
patch set. The new processing paths it introduces (such as the
separate handling for FP/SSE/AVX, FS, signal, syscall, ...) add a
lot of unnecessary complexities. I think Johannes's suggestion is
a great idea.

> we haven't seen any functional regressions
> since this RFC version (which was 6.12 kernel).

I took a quick look at the code. It appears that patch 02/13 will
break the mmu build when UML_TIME_TRAVEL_SUPPORT is enabled.

Regards,
Tiwei


From kuninori.morimoto.gx at renesas.com  Wed Nov 12 18:25:26 2025
From: kuninori.morimoto.gx at renesas.com (Kuninori Morimoto)
Date: Thu, 13 Nov 2025 02:25:26 +0000
Subject: [PATCH] um: drivers: virtio: use string choices helper
Message-ID: <87h5uywtwp.wl-kuninori.morimoto.gx@renesas.com>

Remove hard-coded strings by using the string helper functions

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx at renesas.com>
---
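Illustration only, not part of the patch: a userspace sketch of what the
replacement does.  The two helpers below are stand-ins modelled on
<linux/string_choices.h>, and format_vq_suspend_msg() is a hypothetical
wrapper that builds the same text dev_info() would print.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in: true maps to "enabled", false to "disabled". */
static inline const char *str_enabled_disabled(bool v)
{
	return v ? "enabled" : "disabled";
}

/* Inverted variant used by the patch: true maps to "disabled". */
static inline const char *str_disabled_enabled(bool v)
{
	return str_enabled_disabled(!v);
}

/* Hypothetical helper: formats the message body without the device prefix. */
static inline int format_vq_suspend_msg(char *buf, size_t len,
					bool no_vq_suspend)
{
	return snprintf(buf, len, "%s VQ suspend",
			str_disabled_enabled(no_vq_suspend));
}
```

The resulting text matches the old "%sabled" format: "disabled VQ
suspend" when no_vq_suspend is true, "enabled VQ suspend" otherwise.
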
 arch/um/drivers/virtio_uml.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/um/drivers/virtio_uml.c b/arch/um/drivers/virtio_uml.c
index de7867ae220d0..6cf1152a1a4e6 100644
--- a/arch/um/drivers/virtio_uml.c
+++ b/arch/um/drivers/virtio_uml.c
@@ -24,6 +24,7 @@
 #include <linux/of.h>
 #include <linux/platform_device.h>
 #include <linux/slab.h>
+#include <linux/string_choices.h>
 #include <linux/virtio.h>
 #include <linux/virtio_config.h>
 #include <linux/virtio_ring.h>
@@ -1151,8 +1152,7 @@ void virtio_uml_set_no_vq_suspend(struct virtio_device *vdev,
 		return;
 
 	vu_dev->no_vq_suspend = no_vq_suspend;
-	dev_info(&vdev->dev, "%sabled VQ suspend\n",
-		 no_vq_suspend ? "dis" : "en");
+	dev_info(&vdev->dev, "%s VQ suspend\n", str_disabled_enabled(no_vq_suspend));
 }
 
 static void vu_of_conn_broken(struct work_struct *wk)
-- 
2.43.0


From pabeni at redhat.com  Thu Nov 13 06:39:35 2025
From: pabeni at redhat.com (Paolo Abeni)
Date: Thu, 13 Nov 2025 15:39:35 +0100
Subject: [PATCH net v5 1/2] virtio-net: correct hdr_len handling for
 VIRTIO_NET_F_GUEST_HDRLEN
In-Reply-To: <20251111111212.102083-2-xuanzhuo@linux.alibaba.com>
References: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com>
 <20251111111212.102083-2-xuanzhuo@linux.alibaba.com>
Message-ID: <25b05194-63cd-4265-8d2c-e174d801fc3a@redhat.com>

On 11/11/25 12:12 PM, Xuan Zhuo wrote:
> The commit be50da3e9d4a ("net: virtio_net: implement exact header length
> guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN
> feature in virtio-net.
> 
> This feature requires virtio-net to set hdr_len to the actual header
> length of the packet when transmitting, the number of
> bytes from the start of the packet to the beginning of the
> transport-layer payload.
> 
> However, in practice, hdr_len was being set using skb_headlen(skb),
> which is clearly incorrect. This commit fixes that issue.
> 
> Fixes: be50da3e9d4a ("net: virtio_net: implement exact header length guest feature")
> Signed-off-by: Xuan Zhuo <xuanzhuo at linux.alibaba.com>

IMHO this looks more like a new feature - namely,
VIRTIO_NET_F_GUEST_HDRLEN support - than a fix.

[...]
> @@ -2361,7 +2362,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
>  	if (vnet_hdr_sz &&
>  	    virtio_net_hdr_from_skb(skb, h.raw + macoff -
>  				    sizeof(struct virtio_net_hdr),
> -				    vio_le(), true, 0)) {
> +				    vio_le(), true, false, 0)) {
>  		if (po->tp_version == TPACKET_V3)
>  			prb_clear_blk_fill_status(&po->rx_ring);
>  		goto drop_n_account;
To reduce the diffstat, what about creating a __virtio_net_hdr_from_skb()
variant (please find a better name) that takes the extra `hdrlen_negotiated`
argument, defining virtio_net_hdr_from_skb() as a wrapper around that helper
with the extra arg == false, and using the helper in the few places that
really could use hdrlen?


From pabeni at redhat.com  Thu Nov 13 06:50:13 2025
From: pabeni at redhat.com (Paolo Abeni)
Date: Thu, 13 Nov 2025 15:50:13 +0100
Subject: [PATCH net v5 2/2] virtio-net: correct hdr_len handling for
 tunnel gso
In-Reply-To: <20251111111212.102083-3-xuanzhuo@linux.alibaba.com>
References: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com>
 <20251111111212.102083-3-xuanzhuo@linux.alibaba.com>
Message-ID: <f79bf201-c4fe-41a9-9ccb-b93271d83183@redhat.com>

On 11/11/25 12:12 PM, Xuan Zhuo wrote:
> The commit a2fb4bc4e2a6a03 ("net: implement virtio helpers to handle UDP
> GSO tunneling.") introduces support for the UDP GSO tunnel feature in
> virtio-net.
> 
> The virtio spec says:
> 
>     If the \field{gso_type} has the VIRTIO_NET_HDR_GSO_UDP_TUNNEL_IPV4 bit or
>     VIRTIO_NET_HDR_GSO_UDP_TUNNEL_IPV6 bit set, \field{hdr_len} accounts for
>     all the headers up to and including the inner transport.
> 
> The commit did not update the hdr_len to include the inner transport.
> 
> I observed that the "hdr_len" is 116 for this packet:
> 
>     17:36:18.241105 52:55:00:d1:27:0a > 2e:2c:df:46:a9:e1, ethertype IPv4 (0x0800), length 2912: (tos 0x0, ttl 64, id 45197, offset 0, flags [none], proto UDP (17), length 2898)
>         192.168.122.100.50613 > 192.168.122.1.4789: [bad udp cksum 0x8106 -> 0x26a0!] VXLAN, flags [I] (0x08), vni 1
>     fa:c3:ba:82:05:ee > ce:85:0c:31:77:e5, ethertype IPv4 (0x0800), length 2862: (tos 0x0, ttl 64, id 14678, offset 0, flags [DF], proto TCP (6), length 2848)
>         192.168.3.1.49880 > 192.168.3.2.9898: Flags [P.], cksum 0x9266 (incorrect -> 0xaa20), seq 515667:518463, ack 1, win 64, options [nop,nop,TS val 2990048824 ecr 2798801412], length 2796
> 
> 116 = 14(mac) + 20(ip) + 8(udp) + 8(vxlan) + 14(inner mac) + 20(inner ip) + 32(innner tcp)
> 
> Fixes: a2fb4bc4e2a6a03 ("net: implement virtio helpers to handle UDP GSO tunneling.")
> Signed-off-by: Xuan Zhuo <xuanzhuo at linux.alibaba.com>
> ---
>  include/linux/virtio_net.h | 24 ++++++++++++++++--------
>  1 file changed, 16 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index 3cd8b2ebc197..432b17979d17 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -232,12 +232,23 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
>  			return -EINVAL;
>  
>  		if (hdrlen_negotiated) {
> -			hdr_len = skb_transport_offset(skb);
> +			if (sinfo->gso_type & (SKB_GSO_UDP_TUNNEL |
> +					       SKB_GSO_UDP_TUNNEL_CSUM)) {

I'm personally not a huge fan of adding UDP tunnel specific checks to the
generic code, did you try something along the lines suggested here:

https://lore.kernel.org/netdev/CAF6piCLkv6kFqoq7OQfJ=Su9AVHSQ9J7DzaumOSf5xuf9w-kyA at mail.gmail.com/

?

Thanks,

Paolo


From mst at redhat.com  Thu Nov 13 07:59:17 2025
From: mst at redhat.com (Michael S. Tsirkin)
Date: Thu, 13 Nov 2025 10:59:17 -0500
Subject: [PATCH net v5 1/2] virtio-net: correct hdr_len handling for
 VIRTIO_NET_F_GUEST_HDRLEN
In-Reply-To: <25b05194-63cd-4265-8d2c-e174d801fc3a@redhat.com>
References: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com>
 <20251111111212.102083-2-xuanzhuo@linux.alibaba.com>
 <25b05194-63cd-4265-8d2c-e174d801fc3a@redhat.com>
Message-ID: <20251113105844-mutt-send-email-mst@kernel.org>

On Thu, Nov 13, 2025 at 03:39:35PM +0100, Paolo Abeni wrote:
> On 11/11/25 12:12 PM, Xuan Zhuo wrote:
> > The commit be50da3e9d4a ("net: virtio_net: implement exact header length
> > guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN
> > feature in virtio-net.
> > 
> > This feature requires virtio-net to set hdr_len to the actual header
> > length of the packet when transmitting, the number of
> > bytes from the start of the packet to the beginning of the
> > transport-layer payload.
> > 
> > However, in practice, hdr_len was being set using skb_headlen(skb),
> > which is clearly incorrect. This commit fixes that issue.
> > 
> > Fixes: be50da3e9d4a ("net: virtio_net: implement exact header length guest feature")
> > Signed-off-by: Xuan Zhuo <xuanzhuo at linux.alibaba.com>
> 
> IMHO this looks like more a new feature - namely,
> VIRTIO_NET_F_GUEST_HDRLEN support - than a fix.


I mean if guest negotiates VIRTIO_NET_F_GUEST_HDRLEN but the header
length is wrong then yes it is broken and this is a fix.


> [...]
> > @@ -2361,7 +2362,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
> >  	if (vnet_hdr_sz &&
> >  	    virtio_net_hdr_from_skb(skb, h.raw + macoff -
> >  				    sizeof(struct virtio_net_hdr),
> > -				    vio_le(), true, 0)) {
> > +				    vio_le(), true, false, 0)) {
> >  		if (po->tp_version == TPACKET_V3)
> >  			prb_clear_blk_fill_status(&po->rx_ring);
> >  		goto drop_n_account;
> To reduce the diffstat, what about creating a __virtio_net_hdr_from_skb()
> variant (please find a better name) allowing the extra `hdrlen_negotiated`
> argument, define virtio_net_hdr_from_skb() as a wrapper of such helper
> withthe extra arg == false, and use the helper in the few places that
> really could use hdrlen?


From thehajime at gmail.com  Thu Nov 13 22:47:34 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Fri, 14 Nov 2025 15:47:34 +0900
Subject: [PATCH v13 00/13] nommu UML
In-Reply-To: <20251112163651.3689244-1-tiwei.bie@linux.dev>
References: <m2bjl7y6mv.wl-thehajime@gmail.com>
	<20251112163651.3689244-1-tiwei.bie@linux.dev>
Message-ID: <m2a50pxg8p.wl-thehajime@gmail.com>


On Thu, 13 Nov 2025 01:36:51 +0900,
Tiwei Bie wrote:

> > we haven't seen any functional regressions
> > since this RFC version (which was 6.12 kernel).
> 
> I took a quick look at the code. It appears that patch 02/13 will
> break the mmu build when UML_TIME_TRAVEL_SUPPORT is enabled.

thanks, that was my mistake when moving the chunk.
will fix it and add it to my local tests.

-- Hajime


From qi.zheng at linux.dev  Fri Nov 14 03:11:16 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Fri, 14 Nov 2025 19:11:16 +0800
Subject: [PATCH 2/7] arc: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <cover.1763117269.git.zhengqi.arch@bytedance.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID: <6a4192f5cef3049f123f08cb04ef5cd0179c3281.1763117269.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+). To resolve this problem,
first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
PT_RECLAIM feature, which resolves this problem.

Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
Cc: Vineet Gupta <vgupta at kernel.org>
---
 arch/arc/Kconfig               | 1 +
 arch/arc/include/asm/pgalloc.h | 9 ++++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index f27e6b90428e4..47db93952386d 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -54,6 +54,7 @@ config ARC
 	select HAVE_ARCH_JUMP_LABEL if ISA_ARCV2 && !CPU_ENDIAN_BE32
 	select TRACE_IRQFLAGS_SUPPORT
 	select HAVE_EBPF_JIT if ISA_ARCV2
+	select MMU_GATHER_RCU_TABLE_FREE
 
 config LOCKDEP_SUPPORT
 	def_bool y
diff --git a/arch/arc/include/asm/pgalloc.h b/arch/arc/include/asm/pgalloc.h
index dfae070fe8d55..b1c6619435613 100644
--- a/arch/arc/include/asm/pgalloc.h
+++ b/arch/arc/include/asm/pgalloc.h
@@ -72,7 +72,8 @@ static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4dp, pud_t *pudp)
 	set_p4d(p4dp, __p4d((unsigned long)pudp));
 }
 
-#define __pud_free_tlb(tlb, pmd, addr)  pud_free((tlb)->mm, pmd)
+#define __pud_free_tlb(tlb, pud, addr)	\
+	tlb_remove_ptdesc((tlb), virt_to_ptdesc(pud))
 
 #endif
 
@@ -83,10 +84,12 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmdp)
 	set_pud(pudp, __pud((unsigned long)pmdp));
 }
 
-#define __pmd_free_tlb(tlb, pmd, addr)  pmd_free((tlb)->mm, pmd)
+#define __pmd_free_tlb(tlb, pmd, addr)	\
+	tlb_remove_ptdesc((tlb), virt_to_ptdesc(pmd))
 
 #endif
 
-#define __pte_free_tlb(tlb, pte, addr)  pte_free((tlb)->mm, pte)
+#define __pte_free_tlb(tlb, pte, addr)	\
+	tlb_remove_ptdesc((tlb), page_ptdesc(pte))
 
 #endif /* _ASM_ARC_PGALLOC_H */
-- 
2.20.1


From qi.zheng at linux.dev  Fri Nov 14 03:11:14 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Fri, 14 Nov 2025 19:11:14 +0800
Subject: [PATCH 0/7] enable PT_RECLAIM on all 64-bit architectures
Message-ID: <cover.1763117269.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

Hi all,

This series aims to enable PT_RECLAIM on all 64-bit architectures.

On a 64-bit system, madvise(MADV_DONTNEED) may leave behind a large number of
empty PTE page table pages (100GB+ in some cases). To resolve this problem, we
need to enable PT_RECLAIM, which depends on MMU_GATHER_RCU_TABLE_FREE.

Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all 64-bit
architectures, and finally makes PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE
&& 64BIT. This way, PT_RECLAIM can be enabled by default on all 64-bit
architectures.
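
For readers who want to observe the effect from userspace, here is a
rough sketch (not part of the series). It assumes a Linux host that
exposes the VmPTE field in /proc/self/status; the helper names
read_vmpte_kb() and pte_pages_demo() are made up for this example. It
touches one byte per `stride` bytes of an anonymous mapping so that each
touched address needs its own PTE page, then drops the data pages with
MADV_DONTNEED and samples VmPTE at each step.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

/* Return the VmPTE value (in kB) from /proc/self/status, or -1 if absent. */
static long read_vmpte_kb(void)
{
	FILE *f = fopen("/proc/self/status", "r");
	char line[256];
	long kb = -1;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "VmPTE: %ld kB", &kb) == 1)
			break;
	}
	fclose(f);
	return kb;
}

/*
 * Map len bytes, touch one byte every stride bytes, then release the
 * data pages with MADV_DONTNEED. kb[0..2] receive VmPTE before the
 * touches, after the touches, and after the madvise() call.
 * Returns 0 on success, -1 if mmap() or madvise() fails.
 */
static int pte_pages_demo(size_t len, size_t stride, long kb[3])
{
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int ret = 0;

	if (p == MAP_FAILED)
		return -1;

	/* Best effort: keep THP from mapping the range with huge pages. */
	(void)madvise(p, len, MADV_NOHUGEPAGE);

	kb[0] = read_vmpte_kb();
	for (size_t off = 0; off < len; off += stride)
		p[off] = 1;
	kb[1] = read_vmpte_kb();

	if (madvise(p, len, MADV_DONTNEED))
		ret = -1;
	kb[2] = read_vmpte_kb();

	munmap(p, len);
	return ret;
}
```

Calling pte_pages_demo(1UL << 30, 2UL << 20, kb) with a 2MB stride (the
range covered by one PTE page with 4K pages, as on x86-64) should show
VmPTE grow after the touches. Without PT_RECLAIM the third sample is
expected to stay near the second, since the now-empty PTE pages are only
freed on munmap or exit; with PT_RECLAIM it may drop back.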

Comments and suggestions are welcome!

Thanks,
Qi

Qi Zheng (7):
  alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
  arc: mm: enable MMU_GATHER_RCU_TABLE_FREE
  loongarch: mm: enable MMU_GATHER_RCU_TABLE_FREE
  mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
  parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
  um: mm: enable MMU_GATHER_RCU_TABLE_FREE
  mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT

 arch/alpha/Kconfig                   | 1 +
 arch/alpha/include/asm/tlb.h         | 8 +++++---
 arch/arc/Kconfig                     | 1 +
 arch/arc/include/asm/pgalloc.h       | 9 ++++++---
 arch/loongarch/Kconfig               | 1 +
 arch/loongarch/include/asm/pgalloc.h | 6 ++++--
 arch/mips/Kconfig                    | 1 +
 arch/mips/include/asm/pgalloc.h      | 6 ++++--
 arch/parisc/Kconfig                  | 1 +
 arch/parisc/include/asm/tlb.h        | 6 ++++--
 arch/um/Kconfig                      | 1 +
 arch/x86/Kconfig                     | 1 -
 mm/Kconfig                           | 6 +-----
 13 files changed, 30 insertions(+), 18 deletions(-)

-- 
2.20.1


From qi.zheng at linux.dev  Fri Nov 14 03:11:17 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Fri, 14 Nov 2025 19:11:17 +0800
Subject: [PATCH 3/7] loongarch: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <cover.1763117269.git.zhengqi.arch@bytedance.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID: <146b5a0207052b38d04caac6b20756a61c2189b3.1763117269.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+). To resolve this problem,
first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
PT_RECLAIM feature, which resolves this problem.

Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
Cc: Huacai Chen <chenhuacai at kernel.org>
Cc: WANG Xuerui <kernel at xen0n.name>
---
 arch/loongarch/Kconfig               | 1 +
 arch/loongarch/include/asm/pgalloc.h | 6 ++++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 5b1116733d881..3bf2f2a9cd647 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -210,6 +210,7 @@ config LOONGARCH
 	select USER_STACKTRACE_SUPPORT
 	select VDSO_GETRANDOM
 	select ZONE_DMA32
+	select MMU_GATHER_RCU_TABLE_FREE
 
 config 32BIT
 	bool
diff --git a/arch/loongarch/include/asm/pgalloc.h b/arch/loongarch/include/asm/pgalloc.h
index 1c63a9d9a6d35..0539d04bf1525 100644
--- a/arch/loongarch/include/asm/pgalloc.h
+++ b/arch/loongarch/include/asm/pgalloc.h
@@ -79,7 +79,8 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 	return pmd;
 }
 
-#define __pmd_free_tlb(tlb, x, addr)	pmd_free((tlb)->mm, x)
+#define __pmd_free_tlb(tlb, x, addr)	\
+	tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
 
 #endif
 
@@ -99,7 +100,8 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
 	return pud;
 }
 
-#define __pud_free_tlb(tlb, x, addr)	pud_free((tlb)->mm, x)
+#define __pud_free_tlb(tlb, x, addr)	\
+	tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
 
 #endif /* __PAGETABLE_PUD_FOLDED */
 
-- 
2.20.1


From qi.zheng at linux.dev  Fri Nov 14 03:11:18 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Fri, 14 Nov 2025 19:11:18 +0800
Subject: [PATCH 4/7] mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <cover.1763117269.git.zhengqi.arch@bytedance.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID: <d69204bb10f4d08eb5d5ae673d2329e7df44af72.1763117269.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+). To resolve this problem,
first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
PT_RECLAIM feature, which resolves this problem.

Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
Cc: Thomas Bogendoerfer <tsbogend at alpha.franken.de>
---
 arch/mips/Kconfig               | 1 +
 arch/mips/include/asm/pgalloc.h | 6 ++++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index e8683f58fd3e2..0ee8820a354c4 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -108,6 +108,7 @@ config MIPS
 	select TRACE_IRQFLAGS_SUPPORT
 	select ARCH_HAS_ELFCORE_COMPAT
 	select HAVE_ARCH_KCSAN if 64BIT
+	select MMU_GATHER_RCU_TABLE_FREE
 
 config MIPS_FIXUP_BIGPHYS_ADDR
 	bool
diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h
index 942af87f1cddb..c00f445045f43 100644
--- a/arch/mips/include/asm/pgalloc.h
+++ b/arch/mips/include/asm/pgalloc.h
@@ -72,7 +72,8 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 	return pmd;
 }
 
-#define __pmd_free_tlb(tlb, x, addr)	pmd_free((tlb)->mm, x)
+#define __pmd_free_tlb(tlb, x, addr)	\
+	tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
 
 #endif
 
@@ -98,7 +99,8 @@ static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
 	set_p4d(p4d, __p4d((unsigned long)pud));
 }
 
-#define __pud_free_tlb(tlb, x, addr)	pud_free((tlb)->mm, x)
+#define __pud_free_tlb(tlb, x, addr)	\
+	tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
 
 #endif /* __PAGETABLE_PUD_FOLDED */
 
-- 
2.20.1


From qi.zheng at linux.dev  Fri Nov 14 03:11:19 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Fri, 14 Nov 2025 19:11:19 +0800
Subject: [PATCH 5/7] parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <cover.1763117269.git.zhengqi.arch@bytedance.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID: <3a88790a662c2b84066c77772d20bd1f5f687f8b.1763117269.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+). To resolve this problem,
first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
PT_RECLAIM feature, which resolves this problem.

Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
Cc: "James E.J. Bottomley" <James.Bottomley at HansenPartnership.com>
Cc: Helge Deller <deller at gmx.de>
---
 arch/parisc/Kconfig           | 1 +
 arch/parisc/include/asm/tlb.h | 6 ++++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 47fd9662d8005..946cbe21a4118 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -92,6 +92,7 @@ config PARISC
 	select TRACE_IRQFLAGS_SUPPORT
 	select HAVE_FUNCTION_DESCRIPTORS if 64BIT
 	select PCI_MSI_ARCH_FALLBACKS if PCI_MSI
+	select MMU_GATHER_RCU_TABLE_FREE
 
 	help
 	  The PA-RISC microprocessor is designed by Hewlett-Packard and used
diff --git a/arch/parisc/include/asm/tlb.h b/arch/parisc/include/asm/tlb.h
index 44235f367674d..ab7d4113df61a 100644
--- a/arch/parisc/include/asm/tlb.h
+++ b/arch/parisc/include/asm/tlb.h
@@ -5,8 +5,10 @@
 #include <asm-generic/tlb.h>
 
 #if CONFIG_PGTABLE_LEVELS == 3
-#define __pmd_free_tlb(tlb, pmd, addr)	pmd_free((tlb)->mm, pmd)
+#define __pmd_free_tlb(tlb, pmd, addr)	\
+	tlb_remove_ptdesc((tlb), virt_to_ptdesc(pmd))
 #endif
-#define __pte_free_tlb(tlb, pte, addr)	pte_free((tlb)->mm, pte)
+#define __pte_free_tlb(tlb, pte, addr)	\
+	tlb_remove_ptdesc((tlb), page_ptdesc(pte))
 
 #endif
-- 
2.20.1


From qi.zheng at linux.dev  Fri Nov 14 03:11:20 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Fri, 14 Nov 2025 19:11:20 +0800
Subject: [PATCH 6/7] um: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <cover.1763117269.git.zhengqi.arch@bytedance.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID: <27f173b0fc6fdf92104721fc3daba8d7d9d31e2f.1763117269.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+). To resolve this problem,
first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
PT_RECLAIM feature, which resolves this problem.

Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
Cc: Richard Weinberger <richard at nod.at>
Cc: Anton Ivanov <anton.ivanov at cambridgegreys.com>
Cc: Johannes Berg <johannes at sipsolutions.net>
---
 arch/um/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 097c6a6265ef3..47a41bc77bb24 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -41,6 +41,7 @@ config UML
 	select HAVE_SYSCALL_TRACEPOINTS
 	select THREAD_INFO_IN_TASK
 	select SPARSE_IRQ
+	select MMU_GATHER_RCU_TABLE_FREE
 
 config MMU
 	bool
-- 
2.20.1


From qi.zheng at linux.dev  Fri Nov 14 03:11:21 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Fri, 14 Nov 2025 19:11:21 +0800
Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT
In-Reply-To: <cover.1763117269.git.zhengqi.arch@bytedance.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that PT_RECLAIM can
be enabled by default on all architectures that support
MMU_GATHER_RCU_TABLE_FREE.

Considering that a large number of PTE page table pages (such as 100GB+)
can only be caused on a 64-bit system, let PT_RECLAIM also depend on
64BIT.

Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
---
 arch/x86/Kconfig | 1 -
 mm/Kconfig       | 6 +-----
 2 files changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eac2e86056902..96bff81fd4787 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -330,7 +330,6 @@ config X86
 	select FUNCTION_ALIGNMENT_4B
 	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
 	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
-	select ARCH_SUPPORTS_PT_RECLAIM		if X86_64
 	select ARCH_SUPPORTS_SCHED_SMT		if SMP
 	select SCHED_SMT			if SMP
 	select ARCH_SUPPORTS_SCHED_CLUSTER	if SMP
diff --git a/mm/Kconfig b/mm/Kconfig
index a5a90b169435d..e795fbd69e50c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK
 	  The architecture has hardware support for userspace shadow call
           stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
 
-config ARCH_SUPPORTS_PT_RECLAIM
-	def_bool n
-
 config PT_RECLAIM
 	bool "reclaim empty user page table pages"
 	default y
-	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
-	select MMU_GATHER_RCU_TABLE_FREE
+	depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT
 	help
 	  Try to reclaim empty user page table pages in paths other than munmap
 	  and exit_mmap path.
-- 
2.20.1


From qi.zheng at linux.dev  Fri Nov 14 03:11:15 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Fri, 14 Nov 2025 19:11:15 +0800
Subject: [PATCH 1/7] alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <cover.1763117269.git.zhengqi.arch@bytedance.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID: <66cd5b21aecc3281318b66a3a4aae078c4b9d37b.1763117269.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+). To resolve this problem,
first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
PT_RECLAIM feature, which resolves this problem.

Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
Cc: Richard Henderson <richard.henderson at linaro.org>
Cc: Matt Turner <mattst88 at gmail.com>
---
 arch/alpha/Kconfig           | 1 +
 arch/alpha/include/asm/tlb.h | 8 +++++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 80367f2cf821c..681ed894d9e72 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -40,6 +40,7 @@ config ALPHA
 	select MMU_GATHER_NO_RANGE
 	select SPARSEMEM_EXTREME if SPARSEMEM
 	select ZONE_DMA
+	select MMU_GATHER_RCU_TABLE_FREE
 	help
 	  The Alpha is a 64-bit general-purpose processor designed and
 	  marketed by the Digital Equipment Corporation of blessed memory,
diff --git a/arch/alpha/include/asm/tlb.h b/arch/alpha/include/asm/tlb.h
index 4f79e331af5ea..4fe5a901720f0 100644
--- a/arch/alpha/include/asm/tlb.h
+++ b/arch/alpha/include/asm/tlb.h
@@ -4,7 +4,9 @@
 
 #include <asm-generic/tlb.h>
 
-#define __pte_free_tlb(tlb, pte, address)		pte_free((tlb)->mm, pte)
-#define __pmd_free_tlb(tlb, pmd, address)		pmd_free((tlb)->mm, pmd)
- 
+#define __pte_free_tlb(tlb, pte, address)	\
+	tlb_remove_ptdesc((tlb), page_ptdesc(pte))
+#define __pmd_free_tlb(tlb, pmd, address)	\
+	tlb_remove_ptdesc((tlb), virt_to_ptdesc(pmd))
+
 #endif
-- 
2.20.1


From qi.zheng at linux.dev  Fri Nov 14 03:20:02 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Fri, 14 Nov 2025 19:20:02 +0800
Subject: [PATCH 2/7] arc: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <6a4192f5cef3049f123f08cb04ef5cd0179c3281.1763117269.git.zhengqi.arch@bytedance.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <6a4192f5cef3049f123f08cb04ef5cd0179c3281.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID: <5199c367-aabb-43e7-951e-452657dcdddc@linux.dev>


On 11/14/25 7:11 PM, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch at bytedance.com>
> 
> On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
> empty PTE page table pages (such as 100GB+). To resolve this problem,
> first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
> PT_RECLAIM feature, which resolves this problem.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
> Cc: Vineet Gupta <vgupta at kernel.org>
> ---
>   arch/arc/Kconfig               | 1 +
>   arch/arc/include/asm/pgalloc.h | 9 ++++++---
>   2 files changed, 7 insertions(+), 3 deletions(-)

Strangely, it seems that only ARC does not define CONFIG_64BIT?

Does the ARC architecture support 64-bit? Did I miss something?

> 
> diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
> index f27e6b90428e4..47db93952386d 100644
> --- a/arch/arc/Kconfig
> +++ b/arch/arc/Kconfig
> @@ -54,6 +54,7 @@ config ARC
>   	select HAVE_ARCH_JUMP_LABEL if ISA_ARCV2 && !CPU_ENDIAN_BE32
>   	select TRACE_IRQFLAGS_SUPPORT
>   	select HAVE_EBPF_JIT if ISA_ARCV2
> +	select MMU_GATHER_RCU_TABLE_FREE
>   
>   config LOCKDEP_SUPPORT
>   	def_bool y
> diff --git a/arch/arc/include/asm/pgalloc.h b/arch/arc/include/asm/pgalloc.h
> index dfae070fe8d55..b1c6619435613 100644
> --- a/arch/arc/include/asm/pgalloc.h
> +++ b/arch/arc/include/asm/pgalloc.h
> @@ -72,7 +72,8 @@ static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4dp, pud_t *pudp)
>   	set_p4d(p4dp, __p4d((unsigned long)pudp));
>   }
>   
> -#define __pud_free_tlb(tlb, pmd, addr)  pud_free((tlb)->mm, pmd)
> +#define __pud_free_tlb(tlb, pud, addr)	\
> +	tlb_remove_ptdesc((tlb), virt_to_ptdesc(pud))
>   
>   #endif
>   
> @@ -83,10 +84,12 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmdp)
>   	set_pud(pudp, __pud((unsigned long)pmdp));
>   }
>   
> -#define __pmd_free_tlb(tlb, pmd, addr)  pmd_free((tlb)->mm, pmd)
> +#define __pmd_free_tlb(tlb, pmd, addr)	\
> +	tlb_remove_ptdesc((tlb), virt_to_ptdesc(pmd))
>   
>   #endif
>   
> -#define __pte_free_tlb(tlb, pte, addr)  pte_free((tlb)->mm, pte)
> +#define __pte_free_tlb(tlb, pte, addr)	\
> +	tlb_remove_ptdesc((tlb), page_ptdesc(pte))
>   
>   #endif /* _ASM_ARC_PGALLOC_H */


From chenhuacai at kernel.org  Fri Nov 14 06:17:55 2025
From: chenhuacai at kernel.org (Huacai Chen)
Date: Fri, 14 Nov 2025 22:17:55 +0800
Subject: [PATCH 3/7] loongarch: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <146b5a0207052b38d04caac6b20756a61c2189b3.1763117269.git.zhengqi.arch@bytedance.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com> <146b5a0207052b38d04caac6b20756a61c2189b3.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID: <CAAhV-H6HL+mXeuLqgo5BOVBB0_GHTUmn7_7NTzdUpLX7NbuQ5w@mail.gmail.com>

Hi, Qi Zheng,

We usually use LoongArch rather than loongarch, but if you want to
keep consistency for all patches, just do it.

On Fri, Nov 14, 2025 at 7:13 PM Qi Zheng <qi.zheng at linux.dev> wrote:
>
> From: Qi Zheng <zhengqi.arch at bytedance.com>
>
> On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
> empty PTE page table pages (such as 100GB+). To resolve this problem,
> first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
> PT_RECLAIM feature, which resolves this problem.
>
> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
> Cc: Huacai Chen <chenhuacai at kernel.org>
> Cc: WANG Xuerui <kernel at xen0n.name>
> ---
>  arch/loongarch/Kconfig               | 1 +
>  arch/loongarch/include/asm/pgalloc.h | 6 ++++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
> index 5b1116733d881..3bf2f2a9cd647 100644
> --- a/arch/loongarch/Kconfig
> +++ b/arch/loongarch/Kconfig
> @@ -210,6 +210,7 @@ config LOONGARCH
>         select USER_STACKTRACE_SUPPORT
>         select VDSO_GETRANDOM
>         select ZONE_DMA32
> +       select MMU_GATHER_RCU_TABLE_FREE
Please use alphabetical order.

>
>  config 32BIT
>         bool
> diff --git a/arch/loongarch/include/asm/pgalloc.h b/arch/loongarch/include/asm/pgalloc.h
> index 1c63a9d9a6d35..0539d04bf1525 100644
> --- a/arch/loongarch/include/asm/pgalloc.h
> +++ b/arch/loongarch/include/asm/pgalloc.h
> @@ -79,7 +79,8 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
>         return pmd;
>  }
>
> -#define __pmd_free_tlb(tlb, x, addr)   pmd_free((tlb)->mm, x)
> +#define __pmd_free_tlb(tlb, x, addr)   \
> +       tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
I think we can define it in one line.

>
>  #endif
>
> @@ -99,7 +100,8 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
>         return pud;
>  }
>
> -#define __pud_free_tlb(tlb, x, addr)   pud_free((tlb)->mm, x)
> +#define __pud_free_tlb(tlb, x, addr)   \
> +       tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
The same.

Other patches have the same problem.

Huacai

>
>  #endif /* __PAGETABLE_PUD_FOLDED */
>
> --
> 2.20.1
>


From qi.zheng at linux.dev  Fri Nov 14 07:55:11 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Fri, 14 Nov 2025 23:55:11 +0800
Subject: [PATCH 3/7] loongarch: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <CAAhV-H6HL+mXeuLqgo5BOVBB0_GHTUmn7_7NTzdUpLX7NbuQ5w@mail.gmail.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <146b5a0207052b38d04caac6b20756a61c2189b3.1763117269.git.zhengqi.arch@bytedance.com>
 <CAAhV-H6HL+mXeuLqgo5BOVBB0_GHTUmn7_7NTzdUpLX7NbuQ5w@mail.gmail.com>
Message-ID: <fc08c1a8-f469-43f1-93c0-7fcd2b1c477c@linux.dev>

Hi Huacai,

On 11/14/25 10:17 PM, Huacai Chen wrote:
> Hi, Qi Zheng,
> 
> We usually use LoongArch rather than loongarch, but if you want to
> keep consistency for all patches, just do it.

OK, will change to use LoongArch.

> 
> On Fri, Nov 14, 2025 at 7:13 PM Qi Zheng <qi.zheng at linux.dev> wrote:
>>
>> From: Qi Zheng <zhengqi.arch at bytedance.com>
>>
>> On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
>> empty PTE page table pages (such as 100GB+). To resolve this problem,
>> first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
>> PT_RECLAIM feature, which resolves this problem.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
>> Cc: Huacai Chen <chenhuacai at kernel.org>
>> Cc: WANG Xuerui <kernel at xen0n.name>
>> ---
>>   arch/loongarch/Kconfig               | 1 +
>>   arch/loongarch/include/asm/pgalloc.h | 6 ++++--
>>   2 files changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
>> index 5b1116733d881..3bf2f2a9cd647 100644
>> --- a/arch/loongarch/Kconfig
>> +++ b/arch/loongarch/Kconfig
>> @@ -210,6 +210,7 @@ config LOONGARCH
>>          select USER_STACKTRACE_SUPPORT
>>          select VDSO_GETRANDOM
>>          select ZONE_DMA32
>> +       select MMU_GATHER_RCU_TABLE_FREE
> Please use alphabetical order.

OK, will do.

> 
>>
>>   config 32BIT
>>          bool
>> diff --git a/arch/loongarch/include/asm/pgalloc.h b/arch/loongarch/include/asm/pgalloc.h
>> index 1c63a9d9a6d35..0539d04bf1525 100644
>> --- a/arch/loongarch/include/asm/pgalloc.h
>> +++ b/arch/loongarch/include/asm/pgalloc.h
>> @@ -79,7 +79,8 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
>>          return pmd;
>>   }
>>
>> -#define __pmd_free_tlb(tlb, x, addr)   pmd_free((tlb)->mm, x)
>> +#define __pmd_free_tlb(tlb, x, addr)   \
>> +       tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
> I think we can define it in one line.

will do.

> 
>>
>>   #endif
>>
>> @@ -99,7 +100,8 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
>>          return pud;
>>   }
>>
>> -#define __pud_free_tlb(tlb, x, addr)   pud_free((tlb)->mm, x)
>> +#define __pud_free_tlb(tlb, x, addr)   \
>> +       tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
> The same.
> 
> Other patches have the same problem.

Got it, will convert them all to the one-line type.

Thanks,
Qi

> 
> Huacai
> 
>>
>>   #endif /* __PAGETABLE_PUD_FOLDED */
>>
>> --
>> 2.20.1
>>


From linmag7 at gmail.com  Fri Nov 14 11:13:55 2025
From: linmag7 at gmail.com (Magnus Lindholm)
Date: Fri, 14 Nov 2025 20:13:55 +0100
Subject: [PATCH 1/7] alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <66cd5b21aecc3281318b66a3a4aae078c4b9d37b.1763117269.git.zhengqi.arch@bytedance.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com> <66cd5b21aecc3281318b66a3a4aae078c4b9d37b.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID: <CA+=Fv5SGu_Y-zwryrQiTQDy32SipMk_dfjZezth1=aZmnDKNeA@mail.gmail.com>

Hi,

I applied your patches to a fresh pull of torvalds/linux.git repo but was unable
to build the kernel (on Alpha) with this patch applied.

I made the following changes in order to get it to build on Alpha:

diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
index 7e9455a18aae..6761b0c282bf 100644
--- a/mm/pt_reclaim.c
+++ b/mm/pt_reclaim.c
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/hugetlb.h>
-#include <asm-generic/tlb.h>
 #include <asm/pgalloc.h>
+#include <asm/tlb.h>

 #include "internal.h"


/Magnus
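
A likely reason the include change helps, sketched as a single C file:
the generic pte_free_tlb() macro only names __pte_free_tlb(), and a
macro body is expanded where it is used, not where it is defined. The
arch-level definition therefore has to be visible in the file that calls
pte_free_tlb(), which is what including <asm/tlb.h> instead of
<asm-generic/tlb.h> would provide. The three sections below merely stand
in for the real headers and for mm/pt_reclaim.c. The struct layout and
the simplified tlb_remove_ptdesc() are assumptions made so the example
builds outside the kernel.

```c
/* --- stands in for include/asm-generic/tlb.h --- */
struct mmu_gather {
	int freed_tables;
	int removed;
};

#define pte_free_tlb(tlb, ptep, address)			\
	do {							\
		(tlb)->freed_tables = 1;			\
		__pte_free_tlb(tlb, ptep, address);		\
	} while (0)

/* Simplified placeholder: just count how many tables were queued. */
static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt)
{
	(void)pt;
	tlb->removed++;
}

/* --- stands in for the arch <asm/tlb.h>, seen after the generic part --- */
#define __pte_free_tlb(tlb, pte, address)	\
	tlb_remove_ptdesc((tlb), (pte))

/* --- stands in for mm/pt_reclaim.c --- */
static void free_pte(struct mmu_gather *tlb, void *pte, unsigned long addr)
{
	(void)addr;
	/* Expands here, so __pte_free_tlb must already be defined. */
	pte_free_tlb(tlb, pte, addr);
}
```

Deleting the middle section reproduces the shape of the failure: the
compiler reaches the expansion inside free_pte() with no definition of
__pte_free_tlb in scope and treats it as an undeclared function.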


From vgupta at kernel.org  Fri Nov 14 15:10:02 2025
From: vgupta at kernel.org (Vineet Gupta)
Date: Fri, 14 Nov 2025 15:10:02 -0800
Subject: [PATCH 2/7] arc: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <5199c367-aabb-43e7-951e-452657dcdddc@linux.dev>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <6a4192f5cef3049f123f08cb04ef5cd0179c3281.1763117269.git.zhengqi.arch@bytedance.com>
 <5199c367-aabb-43e7-951e-452657dcdddc@linux.dev>
Message-ID: <4e120357-6fa3-436a-8474-b07b473381b6@kernel.org>

On 11/14/25 03:20, Qi Zheng wrote:
> Strangely, it seems that only ARC does not define CONFIG_64BIT?
>
> Does the ARC architecture support 64-bit? Did I miss something?

ARC is 32-bit only!

-Vineet


From lkp at intel.com  Fri Nov 14 16:51:44 2025
From: lkp at intel.com (kernel test robot)
Date: Sat, 15 Nov 2025 08:51:44 +0800
Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on
 MMU_GATHER_RCU_TABLE_FREE && 64BIT
In-Reply-To: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com>
References: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID: <202511150845.XqOxPJxe-lkp@intel.com>

Hi Qi,

kernel test robot noticed the following build errors:

[auto build test ERROR on deller-parisc/for-next]
[also build test ERROR on uml/next tip/x86/core akpm-mm/mm-everything linus/master v6.18-rc5 next-20251114]
[cannot apply to uml/fixes vgupta-arc/for-next vgupta-arc/for-curr]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/alpha-mm-enable-MMU_GATHER_RCU_TABLE_FREE/20251114-191543
base:   https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git for-next
patch link:    https://lore.kernel.org/r/0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch%40bytedance.com
patch subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT
config: arm64-randconfig-004-20251115 (https://download.01.org/0day-ci/archive/20251115/202511150845.XqOxPJxe-lkp at intel.com/config)
compiler: aarch64-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251115/202511150845.XqOxPJxe-lkp at intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp at intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511150845.XqOxPJxe-lkp at intel.com/

All errors (new ones prefixed by >>):

   In file included from mm/pt_reclaim.c:3:
   mm/pt_reclaim.c: In function 'free_pte':
>> include/asm-generic/tlb.h:731:3: error: implicit declaration of function '__pte_free_tlb'; did you mean 'pte_free_tlb'? [-Werror=implicit-function-declaration]
      __pte_free_tlb(tlb, ptep, address);  \
      ^~~~~~~~~~~~~~
   mm/pt_reclaim.c:31:2: note: in expansion of macro 'pte_free_tlb'
     pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
     ^~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +731 include/asm-generic/tlb.h

a00cc7d9dd93d6 Matthew Wilcox         2017-02-24  701  
a00cc7d9dd93d6 Matthew Wilcox         2017-02-24  702  #define tlb_remove_pud_tlb_entry(tlb, pudp, address)			\
a00cc7d9dd93d6 Matthew Wilcox         2017-02-24  703  	do {								\
2631ed00b04988 Peter Zijlstra (Intel  2020-06-25  704) 		tlb_flush_pud_range(tlb, address, HPAGE_PUD_SIZE);	\
a00cc7d9dd93d6 Matthew Wilcox         2017-02-24  705  		__tlb_remove_pud_tlb_entry(tlb, pudp, address);		\
a00cc7d9dd93d6 Matthew Wilcox         2017-02-24  706  	} while (0)
a00cc7d9dd93d6 Matthew Wilcox         2017-02-24  707  
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  708  /*
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  709   * For things like page tables caches (ie caching addresses "inside" the
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  710   * page tables, like x86 does), for legacy reasons, flushing an
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  711   * individual page had better flush the page table caches behind it. This
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  712   * is definitely how x86 works, for example. And if you have an
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  713   * architected non-legacy page table cache (which I'm not aware of
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  714   * anybody actually doing), you're going to have some architecturally
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  715   * explicit flushing for that, likely *separate* from a regular TLB entry
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  716   * flush, and thus you'd need more than just some range expansion..
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  717   *
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  718   * So if we ever find an architecture
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  719   * that would want something that odd, I think it is up to that
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  720   * architecture to do its own odd thing, not cause pain for others
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  721   * http://lkml.kernel.org/r/CA+55aFzBggoXtNXQeng5d_mRoDnaMBE5Y+URs+PHR67nUpMtaw at mail.gmail.com
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  722   *
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  723   * For now w.r.t page table cache, mark the range_size as PAGE_SIZE
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  724   */
b5bc66b7131087 Aneesh Kumar K.V       2016-12-12  725  
a90744bac57c3c Nicholas Piggin        2018-07-13  726  #ifndef pte_free_tlb
9e1b32caa525cb Benjamin Herrenschmidt 2009-07-22  727  #define pte_free_tlb(tlb, ptep, address)			\
^1da177e4c3f41 Linus Torvalds         2005-04-16  728  	do {							\
2631ed00b04988 Peter Zijlstra (Intel  2020-06-25  729) 		tlb_flush_pmd_range(tlb, address, PAGE_SIZE);	\
22a61c3c4f1379 Peter Zijlstra         2018-08-23  730  		tlb->freed_tables = 1;				\
9e1b32caa525cb Benjamin Herrenschmidt 2009-07-22 @731  		__pte_free_tlb(tlb, ptep, address);		\
^1da177e4c3f41 Linus Torvalds         2005-04-16  732  	} while (0)
a90744bac57c3c Nicholas Piggin        2018-07-13  733  #endif
^1da177e4c3f41 Linus Torvalds         2005-04-16  734  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


From lkp at intel.com  Fri Nov 14 17:12:35 2025
From: lkp at intel.com (kernel test robot)
Date: Sat, 15 Nov 2025 09:12:35 +0800
Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on
 MMU_GATHER_RCU_TABLE_FREE && 64BIT
In-Reply-To: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com>
References: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID: <202511150832.iAyO0SAW-lkp@intel.com>

Hi Qi,

kernel test robot noticed the following build errors:

[auto build test ERROR on deller-parisc/for-next]
[also build test ERROR on uml/next tip/x86/core akpm-mm/mm-everything linus/master v6.18-rc5 next-20251114]
[cannot apply to uml/fixes vgupta-arc/for-next vgupta-arc/for-curr]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/alpha-mm-enable-MMU_GATHER_RCU_TABLE_FREE/20251114-191543
base:   https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git for-next
patch link:    https://lore.kernel.org/r/0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch%40bytedance.com
patch subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT
config: arm64-randconfig-002-20251115 (https://download.01.org/0day-ci/archive/20251115/202511150832.iAyO0SAW-lkp at intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251115/202511150832.iAyO0SAW-lkp at intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp at intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511150832.iAyO0SAW-lkp at intel.com/

All errors (new ones prefixed by >>):

>> mm/pt_reclaim.c:31:2: error: call to undeclared function '__pte_free_tlb'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
      31 |         pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
         |         ^
   include/asm-generic/tlb.h:731:3: note: expanded from macro 'pte_free_tlb'
     731 |                 __pte_free_tlb(tlb, ptep, address);             \
         |                 ^
   1 error generated.


vim +/__pte_free_tlb +31 mm/pt_reclaim.c

6375e95f381e3d Qi Zheng 2024-12-04  27  
6375e95f381e3d Qi Zheng 2024-12-04  28  void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
6375e95f381e3d Qi Zheng 2024-12-04  29  	      pmd_t pmdval)
6375e95f381e3d Qi Zheng 2024-12-04  30  {
6375e95f381e3d Qi Zheng 2024-12-04 @31  	pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
6375e95f381e3d Qi Zheng 2024-12-04  32  	mm_dec_nr_ptes(mm);
6375e95f381e3d Qi Zheng 2024-12-04  33  }
6375e95f381e3d Qi Zheng 2024-12-04  34  
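
For anyone reading the error above, here is a minimal single-file sketch of
the failure mode. It assumes the generic header only supplies the wrapper
macro, while the architecture header supplies the hook that the macro expands
to. All toy_* names are hypothetical stand-ins, not kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct mmu_gather (hypothetical, heavily simplified). */
struct toy_gather {
	int freed_tables;	/* set once a page table has been queued */
	int tables_queued;	/* number of tables handed to the arch hook */
};

/*
 * Stand-in for the generic header: the wrapper macro only *names* the
 * arch hook. Nothing is checked until the macro is expanded at a call site.
 */
#define toy_pte_free_tlb(tlb, ptep)			\
	do {						\
		(tlb)->freed_tables = 1;		\
		toy_arch_pte_free_tlb(tlb, ptep);	\
	} while (0)

/*
 * Stand-in for the arch header, which pulls in the generic one and then
 * provides the hook. Removing this definition reproduces the class of
 * error reported above: the call site expands the macro and finds no
 * declaration for the hook.
 */
static inline void toy_arch_pte_free_tlb(struct toy_gather *tlb, void *ptep)
{
	assert(tlb != NULL);
	(void)ptep;
	tlb->tables_queued++;
}

/* Stand-in for the caller in mm/pt_reclaim.c. */
int toy_free_pte(struct toy_gather *tlb, void *ptep)
{
	toy_pte_free_tlb(tlb, ptep);
	return tlb->tables_queued;
}
```

The point of the sketch is only the ordering: the call site compiles when the
hook is visible before the macro is expanded, which is what including the
arch header (rather than the generic one directly) is meant to guarantee.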

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


From qi.zheng at linux.dev  Sat Nov 15 01:06:51 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Sat, 15 Nov 2025 17:06:51 +0800
Subject: [PATCH 1/7] alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <CA+=Fv5SGu_Y-zwryrQiTQDy32SipMk_dfjZezth1=aZmnDKNeA@mail.gmail.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <66cd5b21aecc3281318b66a3a4aae078c4b9d37b.1763117269.git.zhengqi.arch@bytedance.com>
 <CA+=Fv5SGu_Y-zwryrQiTQDy32SipMk_dfjZezth1=aZmnDKNeA@mail.gmail.com>
Message-ID: <d58a6475-15f2-4e7c-b384-146623ce55fc@linux.dev>

Hi Magnus,

On 11/15/25 3:13 AM, Magnus Lindholm wrote:
> Hi,
> 
> I applied your patches to a fresh pull of torvalds/linux.git repo but was unable
> to build the kernel (on Alpha) with this patch applied.
> 
> I made the following changes in order to get it to build on Alpha:

Thanks! Will fix it in the next version.

> 
> diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
> index 7e9455a18aae..6761b0c282bf 100644
> --- a/mm/pt_reclaim.c
> +++ b/mm/pt_reclaim.c
> @@ -1,7 +1,7 @@
>   // SPDX-License-Identifier: GPL-2.0
>   #include <linux/hugetlb.h>
> -#include <asm-generic/tlb.h>
>   #include <asm/pgalloc.h>
> +#include <asm/tlb.h>
> 
>   #include "internal.h"
> 
> 
> /Magnus


From qi.zheng at linux.dev  Sat Nov 15 01:08:35 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Sat, 15 Nov 2025 17:08:35 +0800
Subject: [PATCH 2/7] arc: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <4e120357-6fa3-436a-8474-b07b473381b6@kernel.org>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <6a4192f5cef3049f123f08cb04ef5cd0179c3281.1763117269.git.zhengqi.arch@bytedance.com>
 <5199c367-aabb-43e7-951e-452657dcdddc@linux.dev>
 <4e120357-6fa3-436a-8474-b07b473381b6@kernel.org>
Message-ID: <f7bd2dc1-e7a5-4e80-9cad-2acac8065876@linux.dev>


On 11/15/25 7:10 AM, Vineet Gupta wrote:
> On 11/14/25 03:20, Qi Zheng wrote:
>> Strangely, it seems that only ARC does not define CONFIG_64BIT?
>>
>> Does the ARC architecture support 64-bit? Did I miss something?
> 
> ARC is 32-bit only !

Got it! Will drop this patch in the next version.

Thanks!

> 
> -Vineet


From qi.zheng at linux.dev  Sun Nov 16 22:41:10 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Mon, 17 Nov 2025 14:41:10 +0800
Subject: [PATCH 3/7] loongarch: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <CAAhV-H6HL+mXeuLqgo5BOVBB0_GHTUmn7_7NTzdUpLX7NbuQ5w@mail.gmail.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <146b5a0207052b38d04caac6b20756a61c2189b3.1763117269.git.zhengqi.arch@bytedance.com>
 <CAAhV-H6HL+mXeuLqgo5BOVBB0_GHTUmn7_7NTzdUpLX7NbuQ5w@mail.gmail.com>
Message-ID: <8fbeb3e8-7c30-46f6-a0a4-289efbf45ac0@linux.dev>

Hi Huacai,

On 11/14/25 10:17 PM, Huacai Chen wrote:
> Hi, Qi Zheng,

[...]

>>
>> -#define __pmd_free_tlb(tlb, x, addr)   pmd_free((tlb)->mm, x)
>> +#define __pmd_free_tlb(tlb, x, addr)   \
>> +       tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
> I think we can define it in one line.

Do we need to change __pte_free_tlb() to a single-line format
as well?

Thanks,
Qi


>>


From chenhuacai at kernel.org  Sun Nov 16 22:57:47 2025
From: chenhuacai at kernel.org (Huacai Chen)
Date: Mon, 17 Nov 2025 14:57:47 +0800
Subject: [PATCH 3/7] loongarch: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <8fbeb3e8-7c30-46f6-a0a4-289efbf45ac0@linux.dev>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <146b5a0207052b38d04caac6b20756a61c2189b3.1763117269.git.zhengqi.arch@bytedance.com>
 <CAAhV-H6HL+mXeuLqgo5BOVBB0_GHTUmn7_7NTzdUpLX7NbuQ5w@mail.gmail.com> <8fbeb3e8-7c30-46f6-a0a4-289efbf45ac0@linux.dev>
Message-ID: <CAAhV-H73vNhjZqsDq0KGJD_PC2LtM39XAG-aZtvE=tphrQ8dJA@mail.gmail.com>

On Mon, Nov 17, 2025 at 2:42 PM Qi Zheng <qi.zheng at linux.dev> wrote:
>
> Hi Huacai,
>
> On 11/14/25 10:17 PM, Huacai Chen wrote:
> > Hi, Qi Zheng,
>
> [...]
>
> >>
> >> -#define __pmd_free_tlb(tlb, x, addr)   pmd_free((tlb)->mm, x)
> >> +#define __pmd_free_tlb(tlb, x, addr)   \
> >> +       tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
> > I think we can define it in one line.
>
> Do we need to change __pte_free_tlb() to a single-line format
> as well?
Yes, there is no 80 columns limit now.

Huacai

>
> Thanks,
> Qi
>
>
> >>
>
>


From david at kernel.org  Mon Nov 17 08:53:42 2025
From: david at kernel.org (David Hildenbrand (Red Hat))
Date: Mon, 17 Nov 2025 17:53:42 +0100
Subject: [PATCH 0/7] enable PT_RECLAIM on all 64-bit architectures
In-Reply-To: <cover.1763117269.git.zhengqi.arch@bytedance.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID: <83e88171-54cb-4112-a344-f6a7d7f13784@kernel.org>

On 14.11.25 12:11, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch at bytedance.com>
> 
> Hi all,
> 
> This series aims to enable PT_RECLAIM on all 64-bit architectures.
> 
> On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE
> page table pages (such as 100GB+). To resolve this problem, we need to enable
> PT_RECLAIM, which depends on MMU_GATHER_RCU_TABLE_FREE.
> 

Makes sense!

> Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all 64-bit
> architectures, and finally makes PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE
> && 64BIT. This way, PT_RECLAIM can be enabled by default on all 64-bit
> architectures.

Could we then even go ahead and stop making PT_RECLAIM user-selectable?

-- 
Cheers

David


From david at kernel.org  Mon Nov 17 08:57:58 2025
From: david at kernel.org (David Hildenbrand (Red Hat))
Date: Mon, 17 Nov 2025 17:57:58 +0100
Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on
 MMU_GATHER_RCU_TABLE_FREE && 64BIT
In-Reply-To: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID: <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org>

On 14.11.25 12:11, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch at bytedance.com>

Subject: s/&&/&/

> 
> Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that PT_RECLAIM can
> be enabled by default on all architectures that support
> MMU_GATHER_RCU_TABLE_FREE.
> 
> Considering that a large number of PTE page table pages (such as 100GB+)
> can only be caused on a 64-bit system, let PT_RECLAIM also depend on
> 64BIT.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
> ---
>   arch/x86/Kconfig | 1 -
>   mm/Kconfig       | 6 +-----
>   2 files changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index eac2e86056902..96bff81fd4787 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -330,7 +330,6 @@ config X86
>   	select FUNCTION_ALIGNMENT_4B
>   	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
>   	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
> -	select ARCH_SUPPORTS_PT_RECLAIM		if X86_64
>   	select ARCH_SUPPORTS_SCHED_SMT		if SMP
>   	select SCHED_SMT			if SMP
>   	select ARCH_SUPPORTS_SCHED_CLUSTER	if SMP
> diff --git a/mm/Kconfig b/mm/Kconfig
> index a5a90b169435d..e795fbd69e50c 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK
>   	  The architecture has hardware support for userspace shadow call
>             stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>   
> -config ARCH_SUPPORTS_PT_RECLAIM
> -	def_bool n
> -
>   config PT_RECLAIM
>   	bool "reclaim empty user page table pages"
>   	default y
> -	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
> -	select MMU_GATHER_RCU_TABLE_FREE
> +	depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT

Why would we have MMU_GATHER_RCU_TABLE_FREE without MMU? (can we drop 
the MMU part)

Why do we care about SMP in the first place? (can we drop SMP)

But I also wonder why we need "MMU_GATHER_RCU_TABLE_FREE && 64BIT":

Would it be harmful on 32bit (sure, we might not reclaim as much, but 
still there is memory to be reclaimed?)?

If all 64BIT support MMU_GATHER_RCU_TABLE_FREE (as you previously 
state), why can't we only check for 64BIT?

-- 
Cheers

David


From development at efficientek.com  Mon Nov 17 18:31:31 2025
From: development at efficientek.com (Glenn Washburn)
Date: Mon, 17 Nov 2025 20:31:31 -0600
Subject: No more non-root networking modes?
Message-ID: <20251117203131.479a1cfd@crass-HP-ZBook-15-G2>

Hi all,

I'm just now noticing that earlier this year obsolete networking
transports were removed (commit 65eaac591b75). This included SLiRP,
which was to my knowledge the only transport that supported network
access as an unprivileged user (no privileged access needed to setup
the transport either). Am I wrong about that? If not, I'm curious as to
why functionality that could not be achieved by other means was
dropped? And if there are any work arounds? I can understand why other
transports that need privileged access to setup, but were inferior to
other existing transports would be removed. I write this as someone who
is currently using the SLiRP transport.

Glenn


From tiwei.bie at linux.dev  Mon Nov 17 22:08:55 2025
From: tiwei.bie at linux.dev (Tiwei Bie)
Date: Tue, 18 Nov 2025 14:08:55 +0800
Subject: No more non-root networking modes?
In-Reply-To: <20251117203131.479a1cfd@crass-HP-ZBook-15-G2>
References: <20251117203131.479a1cfd@crass-HP-ZBook-15-G2>
Message-ID: <20251118060855.3714863-1-tiwei.bie@linux.dev>

On Mon, 17 Nov 2025 20:31:31 -0600, Glenn Washburn wrote:
> Hi all,
> 
> I'm just now noticing that earlier this year obsolete networking
> transports were removed (commit 65eaac591b75). This included SLiRP,
> which was to my knowledge the only transport that supported network
> access as an unprivileged user (no privileged access needed to setup
> the transport either). Am I wrong about that? If not, I'm curious as to
> why functionality that could not be achieved by other means was
> dropped? And if there are any work arounds? I can understand why other
> transports that need privileged access to setup, but were inferior to
> other existing transports would be removed. I write this as someone who
> is currently using the SLiRP transport.

vec also supports networking without privileged access:

https://www.kernel.org/doc/html/v6.17/virt/uml/user_mode_linux_howto_v2.html#vde-vector-transport
https://lore.kernel.org/all/bfa07f4d-16a3-476b-9314-b8052ec198b1 at antgroup.com/

Regards,
Tiwei


From arch0.zheng at gmail.com  Tue Nov 18 03:53:50 2025
From: arch0.zheng at gmail.com (Qi Zheng)
Date: Tue, 18 Nov 2025 19:53:50 +0800
Subject: [PATCH 0/7] enable PT_RECLAIM on all 64-bit architectures
In-Reply-To: <83e88171-54cb-4112-a344-f6a7d7f13784@kernel.org>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <83e88171-54cb-4112-a344-f6a7d7f13784@kernel.org>
Message-ID: <f7f0ca8d-bca2-4a3e-8c45-85cba1b0ff18@gmail.com>


On 11/18/25 12:53 AM, David Hildenbrand (Red Hat) wrote:
> On 14.11.25 12:11, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch at bytedance.com>
>>
>> Hi all,
>>
>> This series aims to enable PT_RECLAIM on all 64-bit architectures.
>>
>> On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of 
>> empty PTE
>> page table pages (such as 100GB+). To resolve this problem, we need to 
>> enable
>> PT_RECLAIM, which depends on MMU_GATHER_RCU_TABLE_FREE.
>>
> 
> Makes sense!
> 
>> Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all 
>> 64-bit
>> architectures, and finally makes PT_RECLAIM depend on 
>> MMU_GATHER_RCU_TABLE_FREE
>> && 64BIT. This way, PT_RECLAIM can be enabled by default on all 64-bit
>> architectures.
> 
> Could we then even go ahead and stop making PT_RECLAIM user-selectable?

OK, will change to:

config PT_RECLAIM
	def_bool y
	depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT

> 


From qi.zheng at linux.dev  Tue Nov 18 04:02:30 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Tue, 18 Nov 2025 20:02:30 +0800
Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on
 MMU_GATHER_RCU_TABLE_FREE && 64BIT
In-Reply-To: <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com>
 <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org>
Message-ID: <195baf7c-1f4e-46a4-a4aa-e68e7d00c0f9@linux.dev>


On 11/18/25 12:57 AM, David Hildenbrand (Red Hat) wrote:
> On 14.11.25 12:11, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch at bytedance.com>
> 
> Subject: s/&&/&/

will do.

> 
>>
>> Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that PT_RECLAIM 
>> can
>> be enabled by default on all architectures that support
>> MMU_GATHER_RCU_TABLE_FREE.
>>
>> Considering that a large number of PTE page table pages (such as 100GB+)
>> can only be caused on a 64-bit system, let PT_RECLAIM also depend on
>> 64BIT.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
>> ---
>>   arch/x86/Kconfig | 1 -
>>   mm/Kconfig       | 6 +-----
>>   2 files changed, 1 insertion(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index eac2e86056902..96bff81fd4787 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -330,7 +330,6 @@ config X86
>>       select FUNCTION_ALIGNMENT_4B
>>       imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
>>       select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
>> -    select ARCH_SUPPORTS_PT_RECLAIM        if X86_64
>>       select ARCH_SUPPORTS_SCHED_SMT        if SMP
>>       select SCHED_SMT            if SMP
>>       select ARCH_SUPPORTS_SCHED_CLUSTER    if SMP
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index a5a90b169435d..e795fbd69e50c 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK
>>         The architecture has hardware support for userspace shadow call
>>             stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>> -config ARCH_SUPPORTS_PT_RECLAIM
>> -    def_bool n
>> -
>>   config PT_RECLAIM
>>       bool "reclaim empty user page table pages"
>>       default y
>> -    depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
>> -    select MMU_GATHER_RCU_TABLE_FREE
>> +    depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT
> 
> Why would we have MMU_GATHER_RCU_TABLE_FREE without MMU? (can we drop 
> the MMU part)

OK.

> 
> Why do we care about SMP in the first place? (can we drop SMP)

OK.

> 
> But I also wonder why we need "MMU_GATHER_RCU_TABLE_FREE && 64BIT":
> 
> Would it be harmful on 32bit (sure, we might not reclaim as much, but 
> still there is memory to be reclaimed?)?

This is also fine on 32bit, but the benefits are not significant, so I
chose to enable it only on 64-bit.

I actually tried enabling MMU_GATHER_RCU_TABLE_FREE on all
architectures, and apart from sparc32 being a bit troublesome (because
it uses mm->page_table_lock for synchronization within
__pte_free_tlb()), the modifications were relatively simple.
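
A rough back-of-the-envelope sketch of why the 32-bit benefit is small. The
numbers below are assumptions for illustration and are not taken from this
thread: 4 KiB pages, 4-byte PTEs on a classic two-level 32-bit layout, 8-byte
PTEs on 64-bit, and a 47-bit user address space in the 64-bit case.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Upper bound on the memory taken by last-level (PTE) page tables when
 * `span` bytes of address space have been touched at least once per table.
 */
uint64_t pte_table_bytes(uint64_t span, uint64_t page_size, uint64_t entry_size)
{
	uint64_t entries, per_table, tables;

	assert(entry_size != 0 && page_size >= entry_size);

	entries = page_size / entry_size;	/* PTEs held by one table page */
	per_table = entries * page_size;	/* bytes mapped by one table page */
	tables = (span + per_table - 1) / per_table;	/* round up */

	return tables * page_size;
}
```

Under those assumptions, pte_table_bytes(1ULL << 32, 4096, 4) comes to 4 MiB
for a whole 4 GiB address space, while pte_table_bytes(1ULL << 47, 4096, 8)
comes to 256 GiB for a 128 TiB one, so the amount that can be left behind as
empty tables differs by several orders of magnitude.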

> 
> If all 64BIT support MMU_GATHER_RCU_TABLE_FREE (as you previously 
> state), why can't we only check for 64BIT?

OK, will do.

Thanks,
Qi

> 


From qi.zheng at linux.dev  Tue Nov 18 23:31:17 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 15:31:17 +0800
Subject: [PATCH v2 0/7] enable PT_RECLAIM on all 64-bit architectures
Message-ID: <cover.1763537007.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

Changelog in v2:
 - fix compilation errors (reported by Magnus Lindholm and kernel test robot)
 - adjust some code style (suggested by Huacai Chen)
 - make PT_RECLAIM user-unselectable (suggested by David Hildenbrand)
 - rebase onto the next-20251119

Hi all,

This series aims to enable PT_RECLAIM on all 64-bit architectures.

On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE
page table pages (such as 100GB+). To resolve this problem, we need to enable
PT_RECLAIM, which depends on MMU_GATHER_RCU_TABLE_FREE.

Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all 64-bit
architectures, and finally makes PT_RECLAIM depend on 64BIT. This way,
PT_RECLAIM can be enabled by default on all 64-bit architectures.

BTW, PT_RECLAIM works well on all 32-bit architectures as well. Although the
benefit isn't significant, there's still memory that can be reclaimed. Perhaps
PT_RECLAIM can be enabled on all 32-bit architectures in the future.
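
To make the problem statement concrete, here is a minimal userspace sketch
(not part of the series) that samples the VmPTE line of /proc/self/status
before populating a mapping, after populating it, and after
madvise(MADV_DONTNEED). The function names are hypothetical. The 2 MiB stride
assumes 4 KiB pages with 512 PTEs per table, and transparent huge pages may
change what is observed.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Parse the "VmPTE:" value (in kB) out of /proc/<pid>/status text; -1 if absent. */
long parse_vm_pte_kb(const char *status_text)
{
	const char *p = strstr(status_text, "VmPTE:");
	long kb;

	if (!p || sscanf(p, "VmPTE: %ld kB", &kb) != 1)
		return -1;
	return kb;
}

/* Read the current process's VmPTE value in kB; -1 on any failure. */
long read_vm_pte_kb(void)
{
	char buf[8192];
	size_t n;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	return parse_vm_pte_kb(buf);
}

/*
 * Map `len` bytes, touch one byte per assumed PTE table, then drop the pages
 * with MADV_DONTNEED. kb[0..2] receive VmPTE before, after populating, and
 * after the madvise() call. Returns 0 on success, -1 on failure.
 */
int dontneed_demo(size_t len, long kb[3])
{
	const size_t stride = 2UL << 20;	/* assumed span of one PTE table */
	size_t off;
	char *p;

	kb[0] = read_vm_pte_kb();
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	for (off = 0; off < len; off += stride)
		p[off] = 1;			/* forces a PTE table to exist */
	kb[1] = read_vm_pte_kb();

	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}
	kb[2] = read_vm_pte_kb();

	munmap(p, len);
	return 0;
}
```

Calling dontneed_demo(1UL << 30, kb) from a small main() and printing the
three values should show whether the page-table footprint drops after
MADV_DONTNEED on a given kernel. I would expect it to stay near the populated
value where PT_RECLAIM is not in effect, but that is something to measure
rather than assume.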

Comments and suggestions are welcome!

Thanks,
Qi

Qi Zheng (7):
  mm: change mm/pt_reclaim.c to use asm/tlb.h instead of
    asm-generic/tlb.h
  alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
  LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE
  mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
  parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
  um: mm: enable MMU_GATHER_RCU_TABLE_FREE
  mm: enable PT_RECLAIM on all 64-bit architectures

 arch/alpha/Kconfig                   | 1 +
 arch/alpha/include/asm/tlb.h         | 6 +++---
 arch/loongarch/Kconfig               | 1 +
 arch/loongarch/include/asm/pgalloc.h | 7 +++----
 arch/mips/Kconfig                    | 1 +
 arch/mips/include/asm/pgalloc.h      | 7 +++----
 arch/parisc/Kconfig                  | 1 +
 arch/parisc/include/asm/tlb.h        | 4 ++--
 arch/um/Kconfig                      | 1 +
 arch/x86/Kconfig                     | 1 -
 mm/Kconfig                           | 9 ++-------
 mm/pt_reclaim.c                      | 2 +-
 12 files changed, 19 insertions(+), 22 deletions(-)

-- 
2.20.1


From qi.zheng at linux.dev  Tue Nov 18 23:31:18 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 15:31:18 +0800
Subject: [PATCH v2 1/7] mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h
In-Reply-To: <cover.1763537007.git.zhengqi.arch@bytedance.com>
References: <cover.1763537007.git.zhengqi.arch@bytedance.com>
Message-ID: <e9d510106b5bf72a9b577b8c5ad161fd3c29c2b6.1763537007.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

Generally, asm/tlb.h includes asm-generic/tlb.h, so change
mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h. This also
fixes compilation errors on some architectures (such as alpha) when
CONFIG_PT_RECLAIM is enabled.

Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
---
 mm/pt_reclaim.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
index 0d9cfbf4fe5d8..46771cfff8239 100644
--- a/mm/pt_reclaim.c
+++ b/mm/pt_reclaim.c
@@ -2,7 +2,7 @@
 #include <linux/hugetlb.h>
 #include <linux/pgalloc.h>
 
-#include <asm-generic/tlb.h>
+#include <asm/tlb.h>
 
 #include "internal.h"
 
-- 
2.20.1


From qi.zheng at linux.dev  Tue Nov 18 23:31:19 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 15:31:19 +0800
Subject: [PATCH v2 2/7] alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <cover.1763537007.git.zhengqi.arch@bytedance.com>
References: <cover.1763537007.git.zhengqi.arch@bytedance.com>
Message-ID: <54381c49729449b9c3a09e78a69bf14b4b107774.1763537007.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+). To resolve this problem,
first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
PT_RECLAIM feature, which resolves this problem.

Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
Cc: Richard Henderson <richard.henderson at linaro.org>
Cc: Matt Turner <mattst88 at gmail.com>
---
 arch/alpha/Kconfig           | 1 +
 arch/alpha/include/asm/tlb.h | 6 +++---
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 80367f2cf821c..6c7dbf0adad62 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -38,6 +38,7 @@ config ALPHA
 	select OLD_SIGSUSPEND
 	select CPU_NO_EFFICIENT_FFS if !ALPHA_EV67
 	select MMU_GATHER_NO_RANGE
+	select MMU_GATHER_RCU_TABLE_FREE
 	select SPARSEMEM_EXTREME if SPARSEMEM
 	select ZONE_DMA
 	help
diff --git a/arch/alpha/include/asm/tlb.h b/arch/alpha/include/asm/tlb.h
index 4f79e331af5ea..ad586b898fd6b 100644
--- a/arch/alpha/include/asm/tlb.h
+++ b/arch/alpha/include/asm/tlb.h
@@ -4,7 +4,7 @@
 
 #include <asm-generic/tlb.h>
 
-#define __pte_free_tlb(tlb, pte, address)		pte_free((tlb)->mm, pte)
-#define __pmd_free_tlb(tlb, pmd, address)		pmd_free((tlb)->mm, pmd)
- 
+#define __pte_free_tlb(tlb, pte, address)	tlb_remove_ptdesc((tlb), page_ptdesc(pte))
+#define __pmd_free_tlb(tlb, pmd, address)	tlb_remove_ptdesc((tlb), virt_to_ptdesc(pmd))
+
 #endif
-- 
2.20.1
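
For readers less familiar with the gather mechanism, here is a toy sketch of
the behavioural difference between the old and new macro bodies in the patch
above: freeing a page table immediately versus queueing it in the gather for
a later release. This is an illustration under simplifying assumptions, not
the kernel implementation. All toy_* names are hypothetical, and the real
code defers the release until it is safe against concurrent lockless
page-table walkers, which this sketch does not model.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_BATCH 8

/* Hypothetical, heavily simplified stand-in for struct mmu_gather. */
struct toy_gather {
	void *tables[TOY_BATCH];	/* tables queued for later release */
	int nr;				/* number of queued tables */
	int freed;			/* tables actually released so far */
};

/* Old style: release at once, in the spirit of pte_free((tlb)->mm, pte). */
void toy_free_now(struct toy_gather *tlb, void *table)
{
	free(table);
	tlb->freed++;
}

/* Release everything queued so far. */
void toy_flush(struct toy_gather *tlb)
{
	int i;

	for (i = 0; i < tlb->nr; i++) {
		free(tlb->tables[i]);
		tlb->freed++;
	}
	tlb->nr = 0;
}

/* New style: only queue the table, in the spirit of tlb_remove_ptdesc(). */
void toy_remove_table(struct toy_gather *tlb, void *table)
{
	assert(tlb->nr <= TOY_BATCH);
	if (tlb->nr == TOY_BATCH)
		toy_flush(tlb);		/* batch full, make room first */
	tlb->tables[tlb->nr++] = table;
}
```

In the sketch a queued table stays valid until toy_flush() runs, which is the
property a concurrent reader would rely on. How and when the kernel performs
the corresponding release is outside what this toy shows.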


From qi.zheng at linux.dev  Tue Nov 18 23:31:20 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 15:31:20 +0800
Subject: [PATCH v2 3/7] LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <cover.1763537007.git.zhengqi.arch@bytedance.com>
References: <cover.1763537007.git.zhengqi.arch@bytedance.com>
Message-ID: <0e12d201cc18a970c28c84030a0d79f5bda492ca.1763537007.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+). To resolve this problem,
first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
PT_RECLAIM feature, which resolves this problem.

Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
Cc: Huacai Chen <chenhuacai at kernel.org>
Cc: WANG Xuerui <kernel at xen0n.name>
---
 arch/loongarch/Kconfig               | 1 +
 arch/loongarch/include/asm/pgalloc.h | 7 +++----
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 5b1116733d881..57d3e199605dc 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -186,6 +186,7 @@ config LOONGARCH
 	select IRQ_LOONGARCH_CPU
 	select LOCK_MM_AND_FIND_VMA
 	select MMU_GATHER_MERGE_VMAS if MMU
+	select MMU_GATHER_RCU_TABLE_FREE
 	select MODULES_USE_ELF_RELA if MODULES
 	select NEED_PER_CPU_EMBED_FIRST_CHUNK
 	select NEED_PER_CPU_PAGE_FIRST_CHUNK
diff --git a/arch/loongarch/include/asm/pgalloc.h b/arch/loongarch/include/asm/pgalloc.h
index 08dcc698ec184..248f62d0b590e 100644
--- a/arch/loongarch/include/asm/pgalloc.h
+++ b/arch/loongarch/include/asm/pgalloc.h
@@ -55,8 +55,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 	return pte;
 }
 
-#define __pte_free_tlb(tlb, pte, address)	\
-	tlb_remove_ptdesc((tlb), page_ptdesc(pte))
+#define __pte_free_tlb(tlb, pte, address)	tlb_remove_ptdesc((tlb), page_ptdesc(pte))
 
 #ifndef __PAGETABLE_PMD_FOLDED
 
@@ -79,7 +78,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 	return pmd;
 }
 
-#define __pmd_free_tlb(tlb, x, addr)	pmd_free((tlb)->mm, x)
+#define __pmd_free_tlb(tlb, x, addr)	tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
 
 #endif
 
@@ -99,7 +98,7 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
 	return pud;
 }
 
-#define __pud_free_tlb(tlb, x, addr)	pud_free((tlb)->mm, x)
+#define __pud_free_tlb(tlb, x, addr)	tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
 
 #endif /* __PAGETABLE_PUD_FOLDED */
 
-- 
2.20.1


From qi.zheng at linux.dev  Tue Nov 18 23:31:21 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 15:31:21 +0800
Subject: [PATCH v2 4/7] mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <cover.1763537007.git.zhengqi.arch@bytedance.com>
References: <cover.1763537007.git.zhengqi.arch@bytedance.com>
Message-ID: <1ef6075dca55a0ace4a6de6350531e4bc513080e.1763537007.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+). To resolve this problem,
first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
PT_RECLAIM feature, which resolves this problem.

Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
Cc: Thomas Bogendoerfer <tsbogend at alpha.franken.de>
---
 arch/mips/Kconfig               | 1 +
 arch/mips/include/asm/pgalloc.h | 7 +++----
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index e8683f58fd3e2..8b16dd4db7c08 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -99,6 +99,7 @@ config MIPS
 	select IRQ_FORCED_THREADING
 	select ISA if EISA
 	select LOCK_MM_AND_FIND_VMA
+	select MMU_GATHER_RCU_TABLE_FREE
 	select MODULES_USE_ELF_REL if MODULES
 	select MODULES_USE_ELF_RELA if MODULES && 64BIT
 	select PERF_USE_VMALLOC
diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h
index 942af87f1cddb..9a7e5af16c00b 100644
--- a/arch/mips/include/asm/pgalloc.h
+++ b/arch/mips/include/asm/pgalloc.h
@@ -48,8 +48,7 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 extern void pgd_init(void *addr);
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
 
-#define __pte_free_tlb(tlb, pte, address)	\
-	tlb_remove_ptdesc((tlb), page_ptdesc(pte))
+#define __pte_free_tlb(tlb, pte, address)	tlb_remove_ptdesc((tlb), page_ptdesc(pte))
 
 #ifndef __PAGETABLE_PMD_FOLDED
 
@@ -72,7 +71,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 	return pmd;
 }
 
-#define __pmd_free_tlb(tlb, x, addr)	pmd_free((tlb)->mm, x)
+#define __pmd_free_tlb(tlb, x, addr)	tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
 
 #endif
 
@@ -98,7 +97,7 @@ static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
 	set_p4d(p4d, __p4d((unsigned long)pud));
 }
 
-#define __pud_free_tlb(tlb, x, addr)	pud_free((tlb)->mm, x)
+#define __pud_free_tlb(tlb, x, addr)	tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
 
 #endif /* __PAGETABLE_PUD_FOLDED */
 
-- 
2.20.1


From qi.zheng at linux.dev  Tue Nov 18 23:31:22 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 15:31:22 +0800
Subject: [PATCH v2 5/7] parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <cover.1763537007.git.zhengqi.arch@bytedance.com>
References: <cover.1763537007.git.zhengqi.arch@bytedance.com>
Message-ID: <74f0e72f11347656a9de0d4b9e2bccc17e4338a7.1763537007.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+). To resolve this problem,
first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
PT_RECLAIM feature, which resolves this problem.

Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
Cc: "James E.J. Bottomley" <James.Bottomley at HansenPartnership.com>
Cc: Helge Deller <deller at gmx.de>
---
 arch/parisc/Kconfig           | 1 +
 arch/parisc/include/asm/tlb.h | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 47fd9662d8005..62d5a89d5c7bc 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -79,6 +79,7 @@ config PARISC
 	select GENERIC_CLOCKEVENTS
 	select CPU_NO_EFFICIENT_FFS
 	select THREAD_INFO_IN_TASK
+	select MMU_GATHER_RCU_TABLE_FREE
 	select NEED_DMA_MAP_STATE
 	select NEED_SG_DMA_LENGTH
 	select HAVE_ARCH_KGDB
diff --git a/arch/parisc/include/asm/tlb.h b/arch/parisc/include/asm/tlb.h
index 44235f367674d..4501fee0a8fa4 100644
--- a/arch/parisc/include/asm/tlb.h
+++ b/arch/parisc/include/asm/tlb.h
@@ -5,8 +5,8 @@
 #include <asm-generic/tlb.h>
 
 #if CONFIG_PGTABLE_LEVELS == 3
-#define __pmd_free_tlb(tlb, pmd, addr)	pmd_free((tlb)->mm, pmd)
+#define __pmd_free_tlb(tlb, pmd, addr)	tlb_remove_ptdesc((tlb), virt_to_ptdesc(pmd))
 #endif
-#define __pte_free_tlb(tlb, pte, addr)	pte_free((tlb)->mm, pte)
+#define __pte_free_tlb(tlb, pte, addr)	tlb_remove_ptdesc((tlb), page_ptdesc(pte))
 
 #endif
-- 
2.20.1


From qi.zheng at linux.dev  Tue Nov 18 23:31:23 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 15:31:23 +0800
Subject: [PATCH v2 6/7] um: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <cover.1763537007.git.zhengqi.arch@bytedance.com>
References: <cover.1763537007.git.zhengqi.arch@bytedance.com>
Message-ID: <16ab9e6ce0febaf2fc383b7e09e3f1fb2ad63a40.1763537007.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+). To resolve this problem,
first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
PT_RECLAIM feature, which resolves this problem.

Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
Cc: Richard Weinberger <richard at nod.at>
Cc: Anton Ivanov <anton.ivanov at cambridgegreys.com>
Cc: Johannes Berg <johannes at sipsolutions.net>
---
 arch/um/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 097c6a6265ef3..47a41bc77bb24 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -41,6 +41,7 @@ config UML
 	select HAVE_SYSCALL_TRACEPOINTS
 	select THREAD_INFO_IN_TASK
 	select SPARSE_IRQ
+	select MMU_GATHER_RCU_TABLE_FREE
 
 config MMU
 	bool
-- 
2.20.1


From qi.zheng at linux.dev  Tue Nov 18 23:31:24 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 15:31:24 +0800
Subject: [PATCH v2 7/7] mm: enable PT_RECLAIM on all 64-bit architectures
In-Reply-To: <cover.1763537007.git.zhengqi.arch@bytedance.com>
References: <cover.1763537007.git.zhengqi.arch@bytedance.com>
Message-ID: <caacf08b765ef00770b7c75afdb5e5754485b2aa.1763537007.git.zhengqi.arch@bytedance.com>

From: Qi Zheng <zhengqi.arch at bytedance.com>

Now that MMU_GATHER_RCU_TABLE_FREE is enabled on all 64-bit architectures,
make PT_RECLAIM depend on 64BIT, thereby enabling PT_RECLAIM on all
64-bit architectures.

Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
---
 arch/x86/Kconfig | 1 -
 mm/Kconfig       | 9 ++-------
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eac2e86056902..96bff81fd4787 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -330,7 +330,6 @@ config X86
 	select FUNCTION_ALIGNMENT_4B
 	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
 	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
-	select ARCH_SUPPORTS_PT_RECLAIM		if X86_64
 	select ARCH_SUPPORTS_SCHED_SMT		if SMP
 	select SCHED_SMT			if SMP
 	select ARCH_SUPPORTS_SCHED_CLUSTER	if SMP
diff --git a/mm/Kconfig b/mm/Kconfig
index d548976d0e0ad..94eec5c0cad96 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1448,14 +1448,9 @@ config ARCH_HAS_USER_SHADOW_STACK
 	  The architecture has hardware support for userspace shadow call
           stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
 
-config ARCH_SUPPORTS_PT_RECLAIM
-	def_bool n
-
 config PT_RECLAIM
-	bool "reclaim empty user page table pages"
-	default y
-	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
-	select MMU_GATHER_RCU_TABLE_FREE
+	def_bool y
+	depends on 64BIT
 	help
 	  Try to reclaim empty user page table pages in paths other than munmap
 	  and exit_mmap path.
-- 
2.20.1


From david at kernel.org  Wed Nov 19 02:13:37 2025
From: david at kernel.org (David Hildenbrand (Red Hat))
Date: Wed, 19 Nov 2025 11:13:37 +0100
Subject: [PATCH 0/7] enable PT_RECLAIM on all 64-bit architectures
In-Reply-To: <f7f0ca8d-bca2-4a3e-8c45-85cba1b0ff18@gmail.com>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <83e88171-54cb-4112-a344-f6a7d7f13784@kernel.org>
 <f7f0ca8d-bca2-4a3e-8c45-85cba1b0ff18@gmail.com>
Message-ID: <afecde77-a4af-40f1-a905-9de8a1bdd783@kernel.org>

On 18.11.25 12:53, Qi Zheng wrote:
> 
> 
> On 11/18/25 12:53 AM, David Hildenbrand (Red Hat) wrote:
>> On 14.11.25 12:11, Qi Zheng wrote:
>>> From: Qi Zheng <zhengqi.arch at bytedance.com>
>>>
>>> Hi all,
>>>
>>> This series aims to enable PT_RECLAIM on all 64-bit architectures.
>>>
>>> On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
>>> empty PTE page table pages (such as 100GB+). To resolve this problem,
>>> we need to enable PT_RECLAIM, which depends on
>>> MMU_GATHER_RCU_TABLE_FREE.
>>>
>>
>> Makes sense!
>>
>>> Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all
>>> 64-bit architectures, and finally makes PT_RECLAIM depend on
>>> MMU_GATHER_RCU_TABLE_FREE && 64BIT. This way, PT_RECLAIM can be enabled
>>> by default on all 64-bit architectures.
>>
>> Could we then even go ahead and stop making PT_RECLAIM user-selectable?
> 
> OK, will change to:

That was more of a question: is there any scenario where we have run
into issues with it so far?

-- 
Cheers

David


From david at kernel.org  Wed Nov 19 02:19:51 2025
From: david at kernel.org (David Hildenbrand (Red Hat))
Date: Wed, 19 Nov 2025 11:19:51 +0100
Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on
 MMU_GATHER_RCU_TABLE_FREE && 64BIT
In-Reply-To: <195baf7c-1f4e-46a4-a4aa-e68e7d00c0f9@linux.dev>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com>
 <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org>
 <195baf7c-1f4e-46a4-a4aa-e68e7d00c0f9@linux.dev>
Message-ID: <9386032c-9840-49da-83f9-74b112f3e752@kernel.org>

On 18.11.25 13:02, Qi Zheng wrote:
> 
> 
> On 11/18/25 12:57 AM, David Hildenbrand (Red Hat) wrote:
>> On 14.11.25 12:11, Qi Zheng wrote:
>>> From: Qi Zheng <zhengqi.arch at bytedance.com>
>>
>> Subject: s/&&/&/
> 
> will do.
> 
>>
>>>
>>> Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that PT_RECLAIM
>>> can
>>> be enabled by default on all architectures that support
>>> MMU_GATHER_RCU_TABLE_FREE.
>>>
>>> Considering that a large number of PTE page table pages (such as 100GB+)
>>> can only be caused on a 64-bit system, let PT_RECLAIM also depend on
>>> 64BIT.
>>>
>>> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
>>> ---
>>>  arch/x86/Kconfig | 1 -
>>>  mm/Kconfig       | 6 +-----
>>>  2 files changed, 1 insertion(+), 6 deletions(-)
>>>
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index eac2e86056902..96bff81fd4787 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -330,7 +330,6 @@ config X86
>>>  	select FUNCTION_ALIGNMENT_4B
>>>  	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
>>>  	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
>>> -	select ARCH_SUPPORTS_PT_RECLAIM		if X86_64
>>>  	select ARCH_SUPPORTS_SCHED_SMT		if SMP
>>>  	select SCHED_SMT			if SMP
>>>  	select ARCH_SUPPORTS_SCHED_CLUSTER	if SMP
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index a5a90b169435d..e795fbd69e50c 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK
>>>  	  The architecture has hardware support for userspace shadow call
>>>            stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>>> -config ARCH_SUPPORTS_PT_RECLAIM
>>> -	def_bool n
>>> -
>>>  config PT_RECLAIM
>>>  	bool "reclaim empty user page table pages"
>>>  	default y
>>> -	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
>>> -	select MMU_GATHER_RCU_TABLE_FREE
>>> +	depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT
>>
>> How would we have MMU_GATHER_RCU_TABLE_FREE without MMU? (can we drop
>> the MMU part?)
> 
> OK.
> 
>>
>> Why do we care about SMP in the first place? (can we drop SMP?)
> 
> OK.
> 
>>
>> But I also wonder why we need "MMU_GATHER_RCU_TABLE_FREE && 64BIT":
>>
>> Would it be harmful on 32bit (sure, we might not reclaim as much, but
>> still there is memory to be reclaimed?)?
> 
> This is also fine on 32bit, but the benefits are not significant, so I
> chose to enable it only on 64-bit.

Right. Address space is smaller, but also memory is smaller. Not that I
think we strictly *must* support 32bit, I merely wonder why we
wouldn't just enable it here.

OTOH, if there is a good reason we cannot enable it, we can definitely 
just keep it 64bit only.

> 
> I actually tried enabling MMU_GATHER_RCU_TABLE_FREE on all
> architectures, and apart from sparc32 being a bit troublesome (because
> it uses mm->page_table_lock for synchronization within
> __pte_free_tlb()), the modifications were relatively simple.
> 
>>
>> If all 64BIT support MMU_GATHER_RCU_TABLE_FREE (as you previously
>> stated), why can't we only check for 64BIT?
> 
> OK, will do.

This was also more of a question for discussion:

Would it make sense to have

config PT_RECLAIM
	def_bool y
	depends on MMU_GATHER_RCU_TABLE_FREE

(a) Would we want to make it configurable (why?)
(b) Do we really care about SMP (why?)
(c) Do we want to limit to 64bit (why?)
(d) Do we really need the MMU check in addition to
     MMU_GATHER_RCU_TABLE_FREE?


-- 
Cheers

David


From qi.zheng at linux.dev  Wed Nov 19 02:37:47 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 18:37:47 +0800
Subject: [PATCH 0/7] enable PT_RECLAIM on all 64-bit architectures
In-Reply-To: <afecde77-a4af-40f1-a905-9de8a1bdd783@kernel.org>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <83e88171-54cb-4112-a344-f6a7d7f13784@kernel.org>
 <f7f0ca8d-bca2-4a3e-8c45-85cba1b0ff18@gmail.com>
 <afecde77-a4af-40f1-a905-9de8a1bdd783@kernel.org>
Message-ID: <9c884aeb-c1ec-4fe0-8495-639344633569@linux.dev>


On 11/19/25 6:13 PM, David Hildenbrand (Red Hat) wrote:
> On 18.11.25 12:53, Qi Zheng wrote:
>>
>>
>> On 11/18/25 12:53 AM, David Hildenbrand (Red Hat) wrote:
>>> On 14.11.25 12:11, Qi Zheng wrote:
>>>> From: Qi Zheng <zhengqi.arch at bytedance.com>
>>>>
>>>> Hi all,
>>>>
>>>> This series aims to enable PT_RECLAIM on all 64-bit architectures.
>>>>
>>>> On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
>>>> empty PTE page table pages (such as 100GB+). To resolve this problem,
>>>> we need to enable PT_RECLAIM, which depends on
>>>> MMU_GATHER_RCU_TABLE_FREE.
>>>>
>>>
>>> Makes sense!
>>>
>>>> Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all
>>>> 64-bit architectures, and finally makes PT_RECLAIM depend on
>>>> MMU_GATHER_RCU_TABLE_FREE && 64BIT. This way, PT_RECLAIM can be enabled
>>>> by default on all 64-bit architectures.
>>>
>>> Could we then even go ahead and stop making PT_RECLAIM user-selectable?
>>
>> OK, will change to:
> 
> Was more of a question: is there any scenario where we ran so far into 
> issues with it?

No, I haven't received any reports of related issues, either within the
company or in the community.

> 


From qi.zheng at linux.dev  Wed Nov 19 03:02:01 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 19:02:01 +0800
Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on
 MMU_GATHER_RCU_TABLE_FREE && 64BIT
In-Reply-To: <9386032c-9840-49da-83f9-74b112f3e752@kernel.org>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com>
 <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org>
 <195baf7c-1f4e-46a4-a4aa-e68e7d00c0f9@linux.dev>
 <9386032c-9840-49da-83f9-74b112f3e752@kernel.org>
Message-ID: <956c7ca1-bce8-4eed-8a86-bc8adfc708b8@linux.dev>

Hi David,

On 11/19/25 6:19 PM, David Hildenbrand (Red Hat) wrote:
> On 18.11.25 13:02, Qi Zheng wrote:
>>
>>
>> On 11/18/25 12:57 AM, David Hildenbrand (Red Hat) wrote:
>>> On 14.11.25 12:11, Qi Zheng wrote:
>>>> From: Qi Zheng <zhengqi.arch at bytedance.com>
>>>
>>> Subject: s/&&/&/
>>
>> will do.
>>
>>>
>>>>
>>>> Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that PT_RECLAIM
>>>> can
>>>> be enabled by default on all architectures that support
>>>> MMU_GATHER_RCU_TABLE_FREE.
>>>>
>>>> Considering that a large number of PTE page table pages (such as 
>>>> 100GB+)
>>>> can only be caused on a 64-bit system, let PT_RECLAIM also depend on
>>>> 64BIT.
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
>>>> ---
>>>>  arch/x86/Kconfig | 1 -
>>>>  mm/Kconfig       | 6 +-----
>>>>  2 files changed, 1 insertion(+), 6 deletions(-)
>>>>
>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>>> index eac2e86056902..96bff81fd4787 100644
>>>> --- a/arch/x86/Kconfig
>>>> +++ b/arch/x86/Kconfig
>>>> @@ -330,7 +330,6 @@ config X86
>>>>  	select FUNCTION_ALIGNMENT_4B
>>>>  	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
>>>>  	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
>>>> -	select ARCH_SUPPORTS_PT_RECLAIM		if X86_64
>>>>  	select ARCH_SUPPORTS_SCHED_SMT		if SMP
>>>>  	select SCHED_SMT			if SMP
>>>>  	select ARCH_SUPPORTS_SCHED_CLUSTER	if SMP
>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>> index a5a90b169435d..e795fbd69e50c 100644
>>>> --- a/mm/Kconfig
>>>> +++ b/mm/Kconfig
>>>> @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK
>>>>  	  The architecture has hardware support for userspace shadow call
>>>>            stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>>>> -config ARCH_SUPPORTS_PT_RECLAIM
>>>> -	def_bool n
>>>> -
>>>>  config PT_RECLAIM
>>>>  	bool "reclaim empty user page table pages"
>>>>  	default y
>>>> -	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
>>>> -	select MMU_GATHER_RCU_TABLE_FREE
>>>> +	depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT
>>>
>>> How would we have MMU_GATHER_RCU_TABLE_FREE without MMU? (can we drop
>>> the MMU part?)
>>
>> OK.
>>
>>>
>>> Why do we care about SMP in the first place? (can we drop SMP?)
>>
>> OK.
>>
>>>
>>> But I also wonder why we need "MMU_GATHER_RCU_TABLE_FREE && 64BIT":
>>>
>>> Would it be harmful on 32bit (sure, we might not reclaim as much, but
>>> still there is memory to be reclaimed?)?
>>
>> This is also fine on 32bit, but the benefits are not significant, so I
>> chose to enable it only on 64-bit.
> 
> Right. Address space is smaller, but also memory is smaller. Not that I
> think we strictly *must* support 32bit, I merely wonder why we
> wouldn't just enable it here.
> 
> OTOH, if there is a good reason we cannot enable it, we can definitely 
> just keep it 64bit only.

The only difficulty is this:

> 
>>
>> I actually tried enabling MMU_GATHER_RCU_TABLE_FREE on all
>> architectures, and apart from sparc32 being a bit troublesome (because
>> it uses mm->page_table_lock for synchronization within
>> __pte_free_tlb()), the modifications were relatively simple.

On sparc32:

void pte_free(struct mm_struct *mm, pgtable_t ptep)
{
         struct page *page;

         page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
         spin_lock(&mm->page_table_lock);
         if (page_ref_dec_return(page) == 1)
                 pagetable_dtor(page_ptdesc(page));
         spin_unlock(&mm->page_table_lock);

         srmmu_free_nocache(ptep, SRMMU_PTE_TABLE_SIZE);
}

#define __pte_free_tlb(tlb, pte, addr)  pte_free((tlb)->mm, pte)

To enable MMU_GATHER_RCU_TABLE_FREE on sparc32, we need to implement
__tlb_remove_table(), and call the pte_free() above in __tlb_remove_table().

However, the __tlb_remove_table() does not have an mm parameter:

void __tlb_remove_table(void *_table)

so we need to use another lock instead of mm->page_table_lock.

I have already sent the v2 [1], and perhaps after that I can enable
PT_RECLAIM on all 32-bit architectures as well.

[1]. 
https://lore.kernel.org/all/cover.1763537007.git.zhengqi.arch at bytedance.com/

>>
>>>
>>> If all 64BIT support MMU_GATHER_RCU_TABLE_FREE (as you previously
>>> stated), why can't we only check for 64BIT?
>>
>> OK, will do.
> 
> This was also more of a question for discussion:
> 
> Would it make sense to have
> 
> config PT_RECLAIM
> 	def_bool y
> 	depends on MMU_GATHER_RCU_TABLE_FREE

Makes sense.

> 
> (a) Would we want to make it configurable (why?)

No, it was just out of caution before.

> (b) Do we really care about SMP (why?)

No. It was only there because the following situation cannot occur
on !SMP:

pte_offset_map
traversing the PTE page table

<preemption or hardirq>

call madvise(MADV_DONTNEED)

so there is no need to free PTE pages via RCU in that case.

> (c) Do we want to limit to 64bit (why?)

No, it was only because the benefit is greater on 64-bit.

> (d) Do we really need the MMU check in addition to
>     MMU_GATHER_RCU_TABLE_FREE

No, I was worried about compilation issues before, but now it seems that
my worries were unnecessary.

Thanks,
Qi

> 
> 


From david at kernel.org  Wed Nov 19 03:35:10 2025
From: david at kernel.org (David Hildenbrand (Red Hat))
Date: Wed, 19 Nov 2025 12:35:10 +0100
Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on
 MMU_GATHER_RCU_TABLE_FREE && 64BIT
In-Reply-To: <956c7ca1-bce8-4eed-8a86-bc8adfc708b8@linux.dev>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com>
 <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org>
 <195baf7c-1f4e-46a4-a4aa-e68e7d00c0f9@linux.dev>
 <9386032c-9840-49da-83f9-74b112f3e752@kernel.org>
 <956c7ca1-bce8-4eed-8a86-bc8adfc708b8@linux.dev>
Message-ID: <6a22ff95-28c1-4c1d-a1a8-6a391bcc8c86@kernel.org>

On 19.11.25 12:02, Qi Zheng wrote:
> Hi David,
> 
> On 11/19/25 6:19 PM, David Hildenbrand (Red Hat) wrote:
>> On 18.11.25 13:02, Qi Zheng wrote:
>>>
>>>
>>> On 11/18/25 12:57 AM, David Hildenbrand (Red Hat) wrote:
>>>> On 14.11.25 12:11, Qi Zheng wrote:
>>>>> From: Qi Zheng <zhengqi.arch at bytedance.com>
>>>>
>>>> Subject: s/&&/&/
>>>
>>> will do.
>>>
>>>>
>>>>>
>>>>> Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that PT_RECLAIM
>>>>> can
>>>>> be enabled by default on all architectures that support
>>>>> MMU_GATHER_RCU_TABLE_FREE.
>>>>>
>>>>> Considering that a large number of PTE page table pages (such as
>>>>> 100GB+)
>>>>> can only be caused on a 64-bit system, let PT_RECLAIM also depend on
>>>>> 64BIT.
>>>>>
>>>>> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
>>>>> ---
>>>>>  arch/x86/Kconfig | 1 -
>>>>>  mm/Kconfig       | 6 +-----
>>>>>  2 files changed, 1 insertion(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>>>> index eac2e86056902..96bff81fd4787 100644
>>>>> --- a/arch/x86/Kconfig
>>>>> +++ b/arch/x86/Kconfig
>>>>> @@ -330,7 +330,6 @@ config X86
>>>>>  	select FUNCTION_ALIGNMENT_4B
>>>>>  	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
>>>>>  	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
>>>>> -	select ARCH_SUPPORTS_PT_RECLAIM		if X86_64
>>>>>  	select ARCH_SUPPORTS_SCHED_SMT		if SMP
>>>>>  	select SCHED_SMT			if SMP
>>>>>  	select ARCH_SUPPORTS_SCHED_CLUSTER	if SMP
>>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>>> index a5a90b169435d..e795fbd69e50c 100644
>>>>> --- a/mm/Kconfig
>>>>> +++ b/mm/Kconfig
>>>>> @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK
>>>>>  	  The architecture has hardware support for userspace shadow call
>>>>>            stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>>>>> -config ARCH_SUPPORTS_PT_RECLAIM
>>>>> -	def_bool n
>>>>> -
>>>>>  config PT_RECLAIM
>>>>>  	bool "reclaim empty user page table pages"
>>>>>  	default y
>>>>> -	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
>>>>> -	select MMU_GATHER_RCU_TABLE_FREE
>>>>> +	depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT
>>>>
>>>> How would we have MMU_GATHER_RCU_TABLE_FREE without MMU? (can we drop
>>>> the MMU part?)
>>>
>>> OK.
>>>
>>>>
>>>> Why do we care about SMP in the first place? (can we drop SMP?)
>>>
>>> OK.
>>>
>>>>
>>>> But I also wonder why we need "MMU_GATHER_RCU_TABLE_FREE && 64BIT":
>>>>
>>>> Would it be harmful on 32bit (sure, we might not reclaim as much, but
>>>> still there is memory to be reclaimed?)?
>>>
>>> This is also fine on 32bit, but the benefits are not significant, so I
>>> chose to enable it only on 64-bit.
>>
>> Right. Address space is smaller, but also memory is smaller. Not that I
>> think we strictly *must* support 32bit, I merely wonder why we
>> wouldn't just enable it here.
>>
>> OTOH, if there is a good reason we cannot enable it, we can definitely
>> just keep it 64bit only.
> 
> The only difficulty is this:
> 
>>
>>>
>>> I actually tried enabling MMU_GATHER_RCU_TABLE_FREE on all
>>> architectures, and apart from sparc32 being a bit troublesome (because
>>> it uses mm->page_table_lock for synchronization within
>>> __pte_free_tlb()), the modifications were relatively simple.
> 
> in sparc32:
> 
> void pte_free(struct mm_struct *mm, pgtable_t ptep)
> {
>           struct page *page;
> 
>           page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
>           spin_lock(&mm->page_table_lock);
>           if (page_ref_dec_return(page) == 1)
>                   pagetable_dtor(page_ptdesc(page));
>           spin_unlock(&mm->page_table_lock);
> 
>           srmmu_free_nocache(ptep, SRMMU_PTE_TABLE_SIZE);
> }
> 
> #define __pte_free_tlb(tlb, pte, addr)  pte_free((tlb)->mm, pte)
> 
> To enable MMU_GATHER_RCU_TABLE_FREE on sparc32, we need to implement
> __tlb_remove_table(), and call the pte_free() above in __tlb_remove_table().
> 
> However, the __tlb_remove_table() does not have an mm parameter:
> 
> void __tlb_remove_table(void *_table)
> 
> so we need to use another lock instead of mm->page_table_lock.
> 
> I have already sent the v2 [1], and perhaps after that I can enable
> PT_RECLAIM on all 32-bit architectures as well.
> 

I guess if we just make it depend on MMU_GATHER_RCU_TABLE_FREE that will 
be fine.

> [1].
> https://lore.kernel.org/all/cover.1763537007.git.zhengqi.arch at bytedance.com/
> 
>>>
>>>>
>>>> If all 64BIT support MMU_GATHER_RCU_TABLE_FREE (as you previously
>>>> stated), why can't we only check for 64BIT?
>>>
>>> OK, will do.
>>
>> This was also more of a question for discussion:
>>
>> Would it make sense to have
>>
>> config PT_RECLAIM
>> 	def_bool y
>> 	depends on MMU_GATHER_RCU_TABLE_FREE
> 
> make sense.
> 
>>
>> (a) Would we want to make it configurable (why?)
> 
> No, it was just out of caution before.
> 
>> (b) Do we really care about SMP (why?)
> 
> No. Simply because the following situation is impossible to occur:
> 
> pte_offset_map
> traversing the PTE page table
> 
> <preemption or hardirq>
> 
> call madvise(MADV_DONTNEED)
> 
> so there's no need to free PTE page via RCU.
> 
>> (c) Do we want to limit to 64bit (why?)
> 
> No, just because the profit is greater at 64-BIT.

I was briefly wondering if on 32bit (but maybe also on 64bit with 
configurable user page table levels?) we could have the scenario that we 
only have two page table levels.

So reclaiming the PMD level (corresponding to the highest level) would 
be impossible. But for that to happen one would have to discard the 
whole address range through MADV_DONTNEED (impossible I guess) :)

-- 
Cheers

David


From david at kernel.org  Wed Nov 19 03:38:46 2025
From: david at kernel.org (David Hildenbrand (Red Hat))
Date: Wed, 19 Nov 2025 12:38:46 +0100
Subject: [PATCH v2 7/7] mm: enable PT_RECLAIM on all 64-bit architectures
In-Reply-To: <caacf08b765ef00770b7c75afdb5e5754485b2aa.1763537007.git.zhengqi.arch@bytedance.com>
References: <cover.1763537007.git.zhengqi.arch@bytedance.com>
 <caacf08b765ef00770b7c75afdb5e5754485b2aa.1763537007.git.zhengqi.arch@bytedance.com>
Message-ID: <9b55623a-4606-4610-a0fe-55b8cd6b95e7@kernel.org>

On 19.11.25 08:31, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch at bytedance.com>
> 
> Now, the MMU_GATHER_RCU_TABLE_FREE is enabled on all 64-bit architectures,
> so make PT_RECLAIM depend on 64BIT, thereby enabling PT_RECLAIM on all
> 64-bit architectures.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
> ---
>   arch/x86/Kconfig | 1 -
>   mm/Kconfig       | 9 ++-------
>   2 files changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index eac2e86056902..96bff81fd4787 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -330,7 +330,6 @@ config X86
>   	select FUNCTION_ALIGNMENT_4B
>   	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
>   	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
> -	select ARCH_SUPPORTS_PT_RECLAIM		if X86_64
>   	select ARCH_SUPPORTS_SCHED_SMT		if SMP
>   	select SCHED_SMT			if SMP
>   	select ARCH_SUPPORTS_SCHED_CLUSTER	if SMP
> diff --git a/mm/Kconfig b/mm/Kconfig
> index d548976d0e0ad..94eec5c0cad96 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1448,14 +1448,9 @@ config ARCH_HAS_USER_SHADOW_STACK
>   	  The architecture has hardware support for userspace shadow call
>             stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>   
> -config ARCH_SUPPORTS_PT_RECLAIM
> -	def_bool n
> -
>   config PT_RECLAIM
> -	bool "reclaim empty user page table pages"
> -	default y
> -	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
> -	select MMU_GATHER_RCU_TABLE_FREE
> +	def_bool y
> +	depends on 64BIT

As discussed in the other thread, likely

config PT_RECLAIM
	def_bool y
	depends on MMU_GATHER_RCU_TABLE_FREE && 64BIT

could be nice, ideally also dropping the 64BIT limitation if there is
no need for it.


-- 
Cheers

David


From david at kernel.org  Wed Nov 19 03:41:22 2025
From: david at kernel.org (David Hildenbrand (Red Hat))
Date: Wed, 19 Nov 2025 12:41:22 +0100
Subject: [PATCH v2 1/7] mm: change mm/pt_reclaim.c to use asm/tlb.h
 instead of asm-generic/tlb.h
In-Reply-To: <e9d510106b5bf72a9b577b8c5ad161fd3c29c2b6.1763537007.git.zhengqi.arch@bytedance.com>
References: <cover.1763537007.git.zhengqi.arch@bytedance.com>
 <e9d510106b5bf72a9b577b8c5ad161fd3c29c2b6.1763537007.git.zhengqi.arch@bytedance.com>
Message-ID: <e539179f-668e-452d-a08e-6143392dae6a@kernel.org>

On 19.11.25 08:31, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch at bytedance.com>
> 
> Generally, the asm/tlb.h will include asm-generic/tlb.h, so change
> mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h. This can
> also fix compilation errors on some architecture when CONFIG_PT_RECLAIM
> is enabled (such as alpha).

"This is a preparation for enabling CONFIG_PT_RECLAIM on other 
architectures, such as alpha."

> 
> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
> ---
>   mm/pt_reclaim.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
> index 0d9cfbf4fe5d8..46771cfff8239 100644
> --- a/mm/pt_reclaim.c
> +++ b/mm/pt_reclaim.c
> @@ -2,7 +2,7 @@
>   #include <linux/hugetlb.h>
>   #include <linux/pgalloc.h>
>   
> -#include <asm-generic/tlb.h>
> +#include <asm/tlb.h>
>   
>   #include "internal.h"
>   

Right, we're using pte_free_tlb(), and the default lives in 
include/asm-generic/tlb.h.

Acked-by: David Hildenbrand (Red Hat) <david at kernel.org>

-- 
Cheers

David


From qi.zheng at linux.dev  Wed Nov 19 04:13:10 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 20:13:10 +0800
Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on
 MMU_GATHER_RCU_TABLE_FREE && 64BIT
In-Reply-To: <6a22ff95-28c1-4c1d-a1a8-6a391bcc8c86@kernel.org>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com>
 <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org>
 <195baf7c-1f4e-46a4-a4aa-e68e7d00c0f9@linux.dev>
 <9386032c-9840-49da-83f9-74b112f3e752@kernel.org>
 <956c7ca1-bce8-4eed-8a86-bc8adfc708b8@linux.dev>
 <6a22ff95-28c1-4c1d-a1a8-6a391bcc8c86@kernel.org>
Message-ID: <479b0409-335f-4450-8eb2-5270a5847f5e@linux.dev>


On 11/19/25 7:35 PM, David Hildenbrand (Red Hat) wrote:
> On 19.11.25 12:02, Qi Zheng wrote:
>> Hi David,
>>
>> On 11/19/25 6:19 PM, David Hildenbrand (Red Hat) wrote:
>>> On 18.11.25 13:02, Qi Zheng wrote:
>>>>
>>>>
>>>> On 11/18/25 12:57 AM, David Hildenbrand (Red Hat) wrote:
>>>>> On 14.11.25 12:11, Qi Zheng wrote:
>>>>>> From: Qi Zheng <zhengqi.arch at bytedance.com>
>>>>>
>>>>> Subject: s/&&/&/
>>>>
>>>> will do.
>>>>
>>>>>
>>>>>>
>>>>>> Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that 
>>>>>> PT_RECLAIM
>>>>>> can
>>>>>> be enabled by default on all architectures that support
>>>>>> MMU_GATHER_RCU_TABLE_FREE.
>>>>>>
>>>>>> Considering that a large number of PTE page table pages (such as
>>>>>> 100GB+)
>>>>>> can only be caused on a 64-bit system, let PT_RECLAIM also depend on
>>>>>> 64BIT.
>>>>>>
>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
>>>>>> ---
>>>>>>  arch/x86/Kconfig | 1 -
>>>>>>  mm/Kconfig       | 6 +-----
>>>>>>  2 files changed, 1 insertion(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>>>>> index eac2e86056902..96bff81fd4787 100644
>>>>>> --- a/arch/x86/Kconfig
>>>>>> +++ b/arch/x86/Kconfig
>>>>>> @@ -330,7 +330,6 @@ config X86
>>>>>>  	select FUNCTION_ALIGNMENT_4B
>>>>>>  	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
>>>>>>  	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
>>>>>> -	select ARCH_SUPPORTS_PT_RECLAIM		if X86_64
>>>>>>  	select ARCH_SUPPORTS_SCHED_SMT		if SMP
>>>>>>  	select SCHED_SMT			if SMP
>>>>>>  	select ARCH_SUPPORTS_SCHED_CLUSTER	if SMP
>>>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>>>> index a5a90b169435d..e795fbd69e50c 100644
>>>>>> --- a/mm/Kconfig
>>>>>> +++ b/mm/Kconfig
>>>>>> @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK
>>>>>>  	  The architecture has hardware support for userspace shadow call
>>>>>>            stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>>>>>> -config ARCH_SUPPORTS_PT_RECLAIM
>>>>>> -	def_bool n
>>>>>> -
>>>>>>  config PT_RECLAIM
>>>>>>  	bool "reclaim empty user page table pages"
>>>>>>  	default y
>>>>>> -	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
>>>>>> -	select MMU_GATHER_RCU_TABLE_FREE
>>>>>> +	depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT
>>>>>
>>>>> How would we have MMU_GATHER_RCU_TABLE_FREE without MMU? (can we drop
>>>>> the MMU part?)
>>>>
>>>> OK.
>>>>
>>>>>
>>>>> Why do we care about SMP in the first place? (can we drop SMP?)
>>>>
>>>> OK.
>>>>
>>>>>
>>>>> But I also wonder why we need "MMU_GATHER_RCU_TABLE_FREE && 64BIT":
>>>>>
>>>>> Would it be harmful on 32bit (sure, we might not reclaim as much, but
>>>>> still there is memory to be reclaimed?)?
>>>>
>>>> This is also fine on 32bit, but the benefits are not significant, so I
>>>> chose to enable it only on 64-bit.
>>>
>>> Right. Address space is smaller, but also memory is smaller. Not that I
>>> think we strictly *must* support 32bit, I merely wonder why we
>>> wouldn't just enable it here.
>>>
>>> OTOH, if there is a good reason we cannot enable it, we can definitely
>>> just keep it 64bit only.
>>
>> The only difficulty is this:
>>
>>>
>>>>
>>>> I actually tried enabling MMU_GATHER_RCU_TABLE_FREE on all
>>>> architectures, and apart from sparc32 being a bit troublesome (because
>>>> it uses mm->page_table_lock for synchronization within
>>>> __pte_free_tlb()), the modifications were relatively simple.
>>
>> in sparc32:
>>
>> void pte_free(struct mm_struct *mm, pgtable_t ptep)
>> {
>>          struct page *page;
>>
>>          page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
>>          spin_lock(&mm->page_table_lock);
>>          if (page_ref_dec_return(page) == 1)
>>                  pagetable_dtor(page_ptdesc(page));
>>          spin_unlock(&mm->page_table_lock);
>>
>>          srmmu_free_nocache(ptep, SRMMU_PTE_TABLE_SIZE);
>> }
>>
>> #define __pte_free_tlb(tlb, pte, addr)  pte_free((tlb)->mm, pte)
>>
>> To enable MMU_GATHER_RCU_TABLE_FREE on sparc32, we need to implement
>> __tlb_remove_table(), and call the pte_free() above in 
>> __tlb_remove_table().
>>
>> However, the __tlb_remove_table() does not have an mm parameter:
>>
>> void __tlb_remove_table(void *_table)
>>
>> so we need to use another lock instead of mm->page_table_lock.
>>
>> I have already sent the v2 [1], and perhaps after that I can enable
>> PT_RECLAIM on all 32-bit architectures as well.
>>
> 
> I guess if we just make it depend on MMU_GATHER_RCU_TABLE_FREE that will 
> be fine.
> 
>> [1].
>> https://lore.kernel.org/all/cover.1763537007.git.zhengqi.arch at bytedance.com/
>>
>>>>
>>>>>
>>>>> If all 64BIT support MMU_GATHER_RCU_TABLE_FREE (as you previously
>>>>> stated), why can't we only check for 64BIT?
>>>>
>>>> OK, will do.
>>>
>>> This was also more of a question for discussion:
>>>
>>> Would it make sense to have
>>>
>>> config PT_RECLAIM
>>> 	def_bool y
>>> 	depends on MMU_GATHER_RCU_TABLE_FREE
>>
>> make sense.
>>
>>>
>>> (a) Would we want to make it configurable (why?)
>>
>> No, it was just out of caution before.
>>
>>> (b) Do we really care about SMP (why?)
>>
>> No. Simply because the following situation is impossible to occur:
>>
>> pte_offset_map
>> traversing the PTE page table
>>
>> <preemption or hardirq>
>>
>> call madvise(MADV_DONTNEED)
>>
>> so there's no need to free PTE page via RCU.
>>
>>> (c) Do we want to limit to 64bit (why?)
>>
>> No, just because the profit is greater at 64-BIT.
> 
> I was briefly wondering if on 32bit (but maybe also on 64bit with 
> configurable user page table levels?) we could have the scenario that we 
> only have two page table levels.
> 
> So reclaiming the PMD level (corresponding to the highest level) would 

Reclaiming the PMD level? PT_RECLAIM only reclaims PTE pages, not PMD
pages. Am I misunderstanding something?

> be impossible. But for that to happen one would have to discard the 
> whole address range through MADV_DONTNEED (impossible I guess) :)
> 


From qi.zheng at linux.dev  Wed Nov 19 04:15:58 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 20:15:58 +0800
Subject: [PATCH v2 7/7] mm: enable PT_RECLAIM on all 64-bit architectures
In-Reply-To: <9b55623a-4606-4610-a0fe-55b8cd6b95e7@kernel.org>
References: <cover.1763537007.git.zhengqi.arch@bytedance.com>
 <caacf08b765ef00770b7c75afdb5e5754485b2aa.1763537007.git.zhengqi.arch@bytedance.com>
 <9b55623a-4606-4610-a0fe-55b8cd6b95e7@kernel.org>
Message-ID: <6e6d8390-1f9e-40cf-949d-168160fa9a15@linux.dev>


On 11/19/25 7:38 PM, David Hildenbrand (Red Hat) wrote:
> On 19.11.25 08:31, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch at bytedance.com>
>>
>> Now, the MMU_GATHER_RCU_TABLE_FREE is enabled on all 64-bit 
>> architectures,
>> so make PT_RECLAIM depend on 64BIT, thereby enabling PT_RECLAIM on all
>> 64-bit architectures.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
>> ---
>>   arch/x86/Kconfig | 1 -
>>   mm/Kconfig       | 9 ++-------
>>   2 files changed, 2 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index eac2e86056902..96bff81fd4787 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -330,7 +330,6 @@ config X86
>>      select FUNCTION_ALIGNMENT_4B
>>      imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
>>      select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
>> -    select ARCH_SUPPORTS_PT_RECLAIM        if X86_64
>>      select ARCH_SUPPORTS_SCHED_SMT        if SMP
>>      select SCHED_SMT            if SMP
>>      select ARCH_SUPPORTS_SCHED_CLUSTER    if SMP
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index d548976d0e0ad..94eec5c0cad96 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1448,14 +1448,9 @@ config ARCH_HAS_USER_SHADOW_STACK
>>         The architecture has hardware support for userspace shadow call
>>         stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>> -config ARCH_SUPPORTS_PT_RECLAIM
>> -    def_bool n
>> -
>>   config PT_RECLAIM
>> -    bool "reclaim empty user page table pages"
>> -    default y
>> -    depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
>> -    select MMU_GATHER_RCU_TABLE_FREE
>> +    def_bool y
>> +    depends on 64BIT
> 
> As discussed in the other thread, likely
> 
> config PT_RECLAIM
>     def_bool y
>     depends on MMU_GATHER_RCU_TABLE_FREE && 64BIT
> 
> Could be nice, and if possible even dropping the 64BIT limitation as 
> well if there is no need to.

I think it's ok to drop the 64BIT limitation. There should be some
32-bit architectures that already enable MMU_GATHER_RCU_TABLE_FREE.

> 
> 


From qi.zheng at linux.dev  Wed Nov 19 04:17:31 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 20:17:31 +0800
Subject: [PATCH v2 1/7] mm: change mm/pt_reclaim.c to use asm/tlb.h
 instead of asm-generic/tlb.h
In-Reply-To: <e539179f-668e-452d-a08e-6143392dae6a@kernel.org>
References: <cover.1763537007.git.zhengqi.arch@bytedance.com>
 <e9d510106b5bf72a9b577b8c5ad161fd3c29c2b6.1763537007.git.zhengqi.arch@bytedance.com>
 <e539179f-668e-452d-a08e-6143392dae6a@kernel.org>
Message-ID: <939e3496-5012-4e7d-8a33-e9de4354d4fd@linux.dev>


On 11/19/25 7:41 PM, David Hildenbrand (Red Hat) wrote:
> On 19.11.25 08:31, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch at bytedance.com>
>>
>> Generally, the asm/tlb.h will include asm-generic/tlb.h, so change
>> mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h. This can
>> also fix compilation errors on some architectures when CONFIG_PT_RECLAIM
>> is enabled (such as alpha).
> 
> "This is a preparation for enabling CONFIG_PT_RECLAIM on other 
> architectures, such as alpha."

OK, will modify it in the next version.

> 
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
>> ---
>> ? mm/pt_reclaim.c | 2 +-
>> ? 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
>> index 0d9cfbf4fe5d8..46771cfff8239 100644
>> --- a/mm/pt_reclaim.c
>> +++ b/mm/pt_reclaim.c
>> @@ -2,7 +2,7 @@
>>   #include <linux/hugetlb.h>
>>   #include <linux/pgalloc.h>
>> -#include <asm-generic/tlb.h>
>> +#include <asm/tlb.h>
>>   #include "internal.h"
> 
> Right, we're using pte_free_tlb(), and the default lives in
> include/asm-generic/tlb.h.
> 
> Acked-by: David Hildenbrand (Red Hat) <david at kernel.org>

Thanks!

> 


From david at kernel.org  Wed Nov 19 04:24:35 2025
From: david at kernel.org (David Hildenbrand (Red Hat))
Date: Wed, 19 Nov 2025 13:24:35 +0100
Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on
 MMU_GATHER_RCU_TABLE_FREE && 64BIT
In-Reply-To: <479b0409-335f-4450-8eb2-5270a5847f5e@linux.dev>
References: <cover.1763117269.git.zhengqi.arch@bytedance.com>
 <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com>
 <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org>
 <195baf7c-1f4e-46a4-a4aa-e68e7d00c0f9@linux.dev>
 <9386032c-9840-49da-83f9-74b112f3e752@kernel.org>
 <956c7ca1-bce8-4eed-8a86-bc8adfc708b8@linux.dev>
 <6a22ff95-28c1-4c1d-a1a8-6a391bcc8c86@kernel.org>
 <479b0409-335f-4450-8eb2-5270a5847f5e@linux.dev>
Message-ID: <7160b6ec-4da5-4273-be91-1339bd00d009@kernel.org>

On 19.11.25 13:13, Qi Zheng wrote:
> 
> 
> On 11/19/25 7:35 PM, David Hildenbrand (Red Hat) wrote:
>> On 19.11.25 12:02, Qi Zheng wrote:
>>> Hi David,
>>>
>>> On 11/19/25 6:19 PM, David Hildenbrand (Red Hat) wrote:
>>>> On 18.11.25 13:02, Qi Zheng wrote:
>>>>>
>>>>>
>>>>> On 11/18/25 12:57 AM, David Hildenbrand (Red Hat) wrote:
>>>>>> On 14.11.25 12:11, Qi Zheng wrote:
>>>>>>> From: Qi Zheng <zhengqi.arch at bytedance.com>
>>>>>>
>>>>>> Subject: s/&&/&/
>>>>>
>>>>> will do.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that
>>>>>>> PT_RECLAIM
>>>>>>> can
>>>>>>> be enabled by default on all architectures that support
>>>>>>> MMU_GATHER_RCU_TABLE_FREE.
>>>>>>>
>>>>>>> Considering that a large number of PTE page table pages (such as
>>>>>>> 100GB+)
>>>>>>> can only be caused on a 64-bit system, let PT_RECLAIM also depend on
>>>>>>> 64BIT.
>>>>>>>
>>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch at bytedance.com>
>>>>>>> ---
>>>>>>>   arch/x86/Kconfig | 1 -
>>>>>>>   mm/Kconfig       | 6 +-----
>>>>>>>   2 files changed, 1 insertion(+), 6 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>>>>>> index eac2e86056902..96bff81fd4787 100644
>>>>>>> --- a/arch/x86/Kconfig
>>>>>>> +++ b/arch/x86/Kconfig
>>>>>>> @@ -330,7 +330,6 @@ config X86
>>>>>>>      select FUNCTION_ALIGNMENT_4B
>>>>>>>      imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
>>>>>>>      select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
>>>>>>> -    select ARCH_SUPPORTS_PT_RECLAIM        if X86_64
>>>>>>>      select ARCH_SUPPORTS_SCHED_SMT        if SMP
>>>>>>>      select SCHED_SMT            if SMP
>>>>>>>      select ARCH_SUPPORTS_SCHED_CLUSTER    if SMP
>>>>>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>>>>>> index a5a90b169435d..e795fbd69e50c 100644
>>>>>>> --- a/mm/Kconfig
>>>>>>> +++ b/mm/Kconfig
>>>>>>> @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK
>>>>>>>         The architecture has hardware support for userspace shadow call
>>>>>>>         stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>>>>>>> -config ARCH_SUPPORTS_PT_RECLAIM
>>>>>>> -    def_bool n
>>>>>>> -
>>>>>>>   config PT_RECLAIM
>>>>>>>      bool "reclaim empty user page table pages"
>>>>>>>      default y
>>>>>>> -    depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
>>>>>>> -    select MMU_GATHER_RCU_TABLE_FREE
>>>>>>> +    depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT
>>>>>>
>>>>>> How would we have MMU_GATHER_RCU_TABLE_FREE without MMU? (can we drop
>>>>>> the MMU part)
>>>>>
>>>>> OK.
>>>>>
>>>>>>
>>>>>> Why do we care about SMP in the first place? (can we drop SMP)
>>>>>
>>>>> OK.
>>>>>
>>>>>>
>>>>>> But I also wonder why we need "MMU_GATHER_RCU_TABLE_FREE && 64BIT":
>>>>>>
>>>>>> Would it be harmful on 32bit (sure, we might not reclaim as much, but
>>>>>> still there is memory to be reclaimed?)?
>>>>>
>>>>> This is also fine on 32bit, but the benefits are not significant, so I
>>>>> chose to enable it only on 64-bit.
>>>>
>>>> Right. Address space is smaller, but also memory is smaller. Not that I
>>>> think we strictly *must* support 32bit, I merely wonder why we
>>>> wouldn't just enable it here.
>>>>
>>>> OTOH, if there is a good reason we cannot enable it, we can definitely
>>>> just keep it 64bit only.
>>>
>>> The only difficulty is this:
>>>
>>>>
>>>>>
>>>>> I actually tried enabling MMU_GATHER_RCU_TABLE_FREE on all
>>>>> architectures, and apart from sparc32 being a bit troublesome (because
>>>>> it uses mm->page_table_lock for synchronization within
>>>>> __pte_free_tlb()), the modifications were relatively simple.
>>>
>>> in sparc32:
>>>
>>> void pte_free(struct mm_struct *mm, pgtable_t ptep)
>>> {
>>>         struct page *page;
>>>
>>>         page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
>>>         spin_lock(&mm->page_table_lock);
>>>         if (page_ref_dec_return(page) == 1)
>>>                 pagetable_dtor(page_ptdesc(page));
>>>         spin_unlock(&mm->page_table_lock);
>>>
>>>         srmmu_free_nocache(ptep, SRMMU_PTE_TABLE_SIZE);
>>> }
>>>
>>> #define __pte_free_tlb(tlb, pte, addr)  pte_free((tlb)->mm, pte)
>>>
>>> To enable MMU_GATHER_RCU_TABLE_FREE on sparc32, we need to implement
>>> __tlb_remove_table(), and call the pte_free() above in
>>> __tlb_remove_table().
>>>
>>> However, the __tlb_remove_table() does not have an mm parameter:
>>>
>>> void __tlb_remove_table(void *_table)
>>>
>>> so we need to use another lock instead of mm->page_table_lock.
>>>
>>> I have already sent the v2 [1], and perhaps after that I can enable
>>> PT_RECLAIM on all 32-bit architectures as well.
>>>
>>
>> I guess if we just make it depend on MMU_GATHER_RCU_TABLE_FREE that will
>> be fine.
>>
>>> [1].
>>> https://lore.kernel.org/all/cover.1763537007.git.zhengqi.arch at bytedance.com/
>>>
>>>>>
>>>>>>
>>>>>> If all 64BIT support MMU_GATHER_RCU_TABLE_FREE (as you previously
>>>>>> state), why can't we only check for 64BIT?
>>>>>
>>>>> OK, will do.
>>>>
>>>> This was also more of a question for discussion:
>>>>
>>>> Would it make sense to have
>>>>
>>>> config PT_RECLAIM
>>>>       def_bool y
>>>>       depends on MMU_GATHER_RCU_TABLE_FREE
>>>
>>> Makes sense.
>>>
>>>>
>>>> (a) Would we want to make it configurable (why?)
>>>
>>> No, it was just out of caution before.
>>>
>>>> (b) Do we really care about SMP (why?)
>>>
>>> No. Simply because the following situation is impossible to occur:
>>>
>>> pte_offset_map
>>> traversing the PTE page table
>>>
>>> <preemption or hardirq>
>>>
>>> call madvise(MADV_DONTNEED)
>>>
>>> so there's no need to free PTE page via RCU.
>>>
>>>> (c) Do we want to limit to 64bit (why?)
>>>
>>> No, it's just that the benefit is greater on 64-bit.
>>
>> I was briefly wondering if on 32bit (but maybe also on 64bit with
>> configurable user page table levels?) we could have the scenario that we
>> only have two page table levels.
>>
>> So reclaiming the PMD level (corresponding to the highest level) would
> 
> Reclaiming the PMD level? PT_RECLAIM only reclaims PTE pages, not PMD
> pages. Am I misunderstanding something?

Sorry, I looked too much into PMD table sharing the last days :D

You're right, it would work in any case even with only 2 levels of page
tables.

-- 
Cheers

David
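
The sparc32 difficulty raised in this thread can be modelled in a few
lines of user-space C. This is only a sketch under stated assumptions:
struct pt_page, pt_free_lock and model_tlb_remove_table() are
hypothetical names, and a C11 atomic_flag stands in for whatever lock a
real patch would choose. It shows that a callback with the
__tlb_remove_table(void *_table) shape has no mm to take a lock from, so
the lock has to live somewhere else.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a PTE table page with a reference count. */
struct pt_page {
	int refcount;
	int dtor_called;
};

/*
 * Hypothetical global lock. It replaces mm->page_table_lock only to show
 * that the lock can no longer be reached through struct mm_struct.
 */
static atomic_flag pt_free_lock = ATOMIC_FLAG_INIT;

static void pt_lock(void)
{
	while (atomic_flag_test_and_set_explicit(&pt_free_lock,
						 memory_order_acquire))
		; /* spin until the flag is released */
}

static void pt_unlock(void)
{
	atomic_flag_clear_explicit(&pt_free_lock, memory_order_release);
}

/* Same shape as __tlb_remove_table(void *_table): no mm argument. */
static void model_tlb_remove_table(void *_table)
{
	struct pt_page *page = _table;

	assert(page != NULL);

	pt_lock();
	/* Mirrors "page_ref_dec_return(page) == 1" from the quoted code. */
	if (--page->refcount == 1)
		page->dtor_called = 1;
	pt_unlock();
}
```

Calling model_tlb_remove_table() on a page whose refcount is 3 leaves
dtor_called at 0; a second call drops the count to 1 and sets it.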


From mpdesouza at suse.com  Fri Nov 21 10:50:32 2025
From: mpdesouza at suse.com (Marcos Paulo de Souza)
Date: Fri, 21 Nov 2025 15:50:32 -0300
Subject: [PATCH v2 0/4] printk cleanup - part 2
Message-ID: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>

The first part can be found here[1]. The proposed changes do not
change the functionality of printk, but were suggestions made by
Petr Mladek. I already have more patches for a part 3, but I would like
to see these ones merged first.

I did the testing with VMs, checking suspend and resume cycles, and it worked
as expected.

Thanks for reviewing!

[1]: https://lore.kernel.org/lkml/20250226-printk-renaming-v1-0-0b878577f2e6 at suse.com/

Signed-off-by: Marcos Paulo de Souza <mpdesouza at suse.com>
---
Changes in v2:
- Squashed patches 1 and 3 (CON_SUSPEND usage) and now is the last patch
  of the series, suggested by Petr Mladek
- Moved commit 4 as the first one in the series, and it was changed to
  use console_is_usable helper, suggested by Petr Mladek
- Moved commit 5 as the second commit in the series, and adjusted to use
  console_is_usable helper, suggested by Petr Mladek
- The patch 6 was dropped, since it was implemented in a different patchset
  (https://lore.kernel.org/lkml/20250902-nbcon-kgdboc-v3-0-cd30a8106f1c at suse.com/)
- Patch 7 was moved as third patch, and is using the console_is_usable,
  suggested by Petr Mladek
- Patch 2 was dropped from this patchset, and will be included in the
  next cleanup patchset.
- Link to v1: https://lore.kernel.org/r/20250606-printk-cleanup-part2-v1-0-f427c743dda0 at suse.com

---
Marcos Paulo de Souza (4):
      drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT
      arch: um: kmsg_dump: Use console_is_usable
      printk: Use console_is_usable on console_unblank
      printk: Make console_{suspend,resume} handle CON_SUSPENDED

 arch/um/kernel/kmsg_dump.c  |  2 +-
 drivers/tty/serial/kgdboc.c |  1 -
 drivers/tty/tty_io.c        |  2 +-
 kernel/printk/printk.c      | 17 +++++++----------
 4 files changed, 9 insertions(+), 13 deletions(-)
---
base-commit: 887c7f05d40eb51ba3f38fd71d5e6b4aff4bb8a2
change-id: 20250601-printk-cleanup-part2-38f8d5108099

Best regards,
--  
Marcos Paulo de Souza <mpdesouza at suse.com>


From mpdesouza at suse.com  Fri Nov 21 10:50:33 2025
From: mpdesouza at suse.com (Marcos Paulo de Souza)
Date: Fri, 21 Nov 2025 15:50:33 -0300
Subject: [PATCH v2 1/4] drivers: serial: kgdboc: Drop checks for
 CON_ENABLED and CON_BOOT
In-Reply-To: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
Message-ID: <20251121-printk-cleanup-part2-v2-1-57b8b78647f4@suse.com>

The original code tried to find a console that has CON_BOOT _or_
CON_ENABLED flag set. The flag CON_ENABLED is set on all registered
consoles, so in this case this check is always true, even for the
CON_BOOT consoles.

The initial intent of the kgdboc_earlycon_init was to get a console
early (CON_BOOT) or later on in the process (CON_ENABLED). The
code was using for_each_console macro, meaning that all console structs
were previously registered on the printk() machinery. At this point,
any console found on for_each_console is safe for kgdboc_earlycon_init
to use.

Dropping the check makes the code cleaner, and avoids further confusion
by future readers of the code.

Signed-off-by: Marcos Paulo de Souza <mpdesouza at suse.com>
---
 drivers/tty/serial/kgdboc.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/tty/serial/kgdboc.c b/drivers/tty/serial/kgdboc.c
index 85f6c5a76e0f..5a955c80a853 100644
--- a/drivers/tty/serial/kgdboc.c
+++ b/drivers/tty/serial/kgdboc.c
@@ -577,7 +577,6 @@ static int __init kgdboc_earlycon_init(char *opt)
 	console_list_lock();
 	for_each_console(con) {
 		if (con->write && con->read &&
-		    (con->flags & (CON_BOOT | CON_ENABLED)) &&
 		    (!opt || !opt[0] || strcmp(con->name, opt) == 0))
 			break;
 	}

-- 
2.51.1
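
To illustrate why the dropped condition is redundant, here is a small
user-space sketch in C. The DEMO_* flag values and helper names are
assumptions made for this example and are not taken from
include/linux/console.h. The only property it relies on is the one
stated in the commit message: every registered console has the enabled
flag set.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values; assumed for this sketch only. */
#define DEMO_CON_ENABLED (1u << 0)
#define DEMO_CON_BOOT    (1u << 1)

/* The condition that the patch removes. */
static bool old_flag_check(unsigned int flags)
{
	return (flags & (DEMO_CON_BOOT | DEMO_CON_ENABLED)) != 0;
}

/* Flags of a registered console: enabled is always set, boot is optional. */
static unsigned int registered_flags(bool is_boot_console)
{
	unsigned int flags = DEMO_CON_ENABLED;

	if (is_boot_console)
		flags |= DEMO_CON_BOOT;
	return flags;
}

/* The old check passes for both kinds of registered console. */
static void demo_check(void)
{
	assert(old_flag_check(registered_flags(false)));
	assert(old_flag_check(registered_flags(true)));
}
```

Since the mask test cannot fail for any console seen by the loop,
removing it does not change which console is selected in this model.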


From mpdesouza at suse.com  Fri Nov 21 10:50:34 2025
From: mpdesouza at suse.com (Marcos Paulo de Souza)
Date: Fri, 21 Nov 2025 15:50:34 -0300
Subject: [PATCH v2 2/4] arch: um: kmsg_dump: Use console_is_usable
In-Reply-To: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
Message-ID: <20251121-printk-cleanup-part2-v2-2-57b8b78647f4@suse.com>

All consoles found on for_each_console are registered, meaning that all
of them have the CON_ENABLED flag set. Since NBCON was introduced it's
important to check if a given console also implements the NBCON callbacks.
The function console_is_usable does exactly that.

Signed-off-by: Marcos Paulo de Souza <mpdesouza at suse.com>
---
 arch/um/kernel/kmsg_dump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/um/kernel/kmsg_dump.c b/arch/um/kernel/kmsg_dump.c
index 419021175272..fc0f543d1d8e 100644
--- a/arch/um/kernel/kmsg_dump.c
+++ b/arch/um/kernel/kmsg_dump.c
@@ -31,7 +31,7 @@ static void kmsg_dumper_stdout(struct kmsg_dumper *dumper,
 		 * expected to output the crash information.
 		 */
 		if (strcmp(con->name, "ttynull") != 0 &&
-		    (console_srcu_read_flags(con) & CON_ENABLED)) {
+		    console_is_usable(con, console_srcu_read_flags(con), true)) {
 			break;
 		}
 	}

-- 
2.51.1


From mpdesouza at suse.com  Fri Nov 21 10:50:35 2025
From: mpdesouza at suse.com (Marcos Paulo de Souza)
Date: Fri, 21 Nov 2025 15:50:35 -0300
Subject: [PATCH v2 3/4] printk: Use console_is_usable on console_unblank
In-Reply-To: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
Message-ID: <20251121-printk-cleanup-part2-v2-3-57b8b78647f4@suse.com>

The macro for_each_console_srcu iterates over all registered consoles. It's
implied that all registered consoles have CON_ENABLED flag set, making
the check for the flag unnecessary. Call console_is_usable function to
fully verify if the given console is usable before calling the ->unblank
callback.

Signed-off-by: Marcos Paulo de Souza <mpdesouza at suse.com>
---
 kernel/printk/printk.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index cb79d1d2e6e5..fed98a18e830 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3374,12 +3374,10 @@ void console_unblank(void)
 	 */
 	cookie = console_srcu_read_lock();
 	for_each_console_srcu(c) {
-		short flags = console_srcu_read_flags(c);
-
-		if (flags & CON_SUSPENDED)
+		if (!console_is_usable(c, console_srcu_read_flags(c), true))
 			continue;
 
-		if ((flags & CON_ENABLED) && c->unblank) {
+		if (c->unblank) {
 			found_unblank = true;
 			break;
 		}
@@ -3416,12 +3414,10 @@ void console_unblank(void)
 
 	cookie = console_srcu_read_lock();
 	for_each_console_srcu(c) {
-		short flags = console_srcu_read_flags(c);
-
-		if (flags & CON_SUSPENDED)
+		if (!console_is_usable(c, console_srcu_read_flags(c), true))
 			continue;
 
-		if ((flags & CON_ENABLED) && c->unblank)
+		if (c->unblank)
 			c->unblank();
 	}
 	console_srcu_read_unlock(cookie);

-- 
2.51.1


From mpdesouza at suse.com  Fri Nov 21 10:50:36 2025
From: mpdesouza at suse.com (Marcos Paulo de Souza)
Date: Fri, 21 Nov 2025 15:50:36 -0300
Subject: [PATCH v2 4/4] printk: Make console_{suspend,resume} handle
 CON_SUSPENDED
In-Reply-To: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
Message-ID: <20251121-printk-cleanup-part2-v2-4-57b8b78647f4@suse.com>

Commit 9e70a5e109a4 ("printk: Add per-console suspended state")
introduced the CON_SUSPENDED flag, which is checked by the
console_is_usable function: it returns false if the console is
suspended.

To make the behavior consistent, change show_cons_active to look for
consoles that are not suspended, instead of checking CON_ENABLED.

Signed-off-by: Marcos Paulo de Souza <mpdesouza at suse.com>
---
 drivers/tty/tty_io.c   | 2 +-
 kernel/printk/printk.c | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index e2d92cf70eb7..1b2ce0f36010 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -3554,7 +3554,7 @@ static ssize_t show_cons_active(struct device *dev,
 			continue;
 		if (!(c->flags & CON_NBCON) && !c->write)
 			continue;
-		if ((c->flags & CON_ENABLED) == 0)
+		if (c->flags & CON_SUSPENDED)
 			continue;
 		cs[i++] = c;
 		if (i >= ARRAY_SIZE(cs))
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index fed98a18e830..fe7c956f73bd 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3542,7 +3542,7 @@ void console_suspend(struct console *console)
 {
 	__pr_flush(console, 1000, true);
 	console_list_lock();
-	console_srcu_write_flags(console, console->flags & ~CON_ENABLED);
+	console_srcu_write_flags(console, console->flags | CON_SUSPENDED);
 	console_list_unlock();
 
 	/*
@@ -3555,13 +3555,14 @@ void console_suspend(struct console *console)
 }
 EXPORT_SYMBOL(console_suspend);
 
+/* Unset CON_SUSPENDED flag so the console can start printing again. */
 void console_resume(struct console *console)
 {
 	struct console_flush_type ft;
 	bool is_nbcon;
 
 	console_list_lock();
-	console_srcu_write_flags(console, console->flags | CON_ENABLED);
+	console_srcu_write_flags(console, console->flags & ~CON_SUSPENDED);
 	is_nbcon = console->flags & CON_NBCON;
 	console_list_unlock();
 

-- 
2.51.1
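
The behaviour change in this patch can be sketched as a user-space model
in C. All demo_* names and DEMO_* bit values are hypothetical, and
demo_console_is_usable() is a simplified stand-in that only looks at the
two flags discussed here; the real console_is_usable() checks more than
that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values; assumed for this sketch only. */
#define DEMO_CON_ENABLED   (1u << 0)
#define DEMO_CON_SUSPENDED (1u << 1)

struct demo_console {
	unsigned int flags;
};

/* After the patch: suspend sets SUSPENDED and leaves ENABLED alone. */
static void demo_console_suspend(struct demo_console *con)
{
	assert(con != NULL);
	con->flags |= DEMO_CON_SUSPENDED;
}

/* After the patch: resume clears SUSPENDED again. */
static void demo_console_resume(struct demo_console *con)
{
	assert(con != NULL);
	con->flags &= ~DEMO_CON_SUSPENDED;
}

/* Simplified usability test: enabled and not suspended. */
static bool demo_console_is_usable(const struct demo_console *con)
{
	assert(con != NULL);
	return (con->flags & DEMO_CON_ENABLED) &&
	       !(con->flags & DEMO_CON_SUSPENDED);
}
```

In this model a suspended console keeps its enabled bit, so code that
wants to skip suspended consoles has to test the suspended bit, which is
what the show_cons_active hunk does.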


From davidgow at google.com  Sat Nov 22 00:32:12 2025
From: davidgow at google.com (David Gow)
Date: Sat, 22 Nov 2025 16:32:12 +0800
Subject: [PATCH] arch: um: Don't rename vmap to kernel_vmap
Message-ID: <20251122083213.3996586-1-davidgow@google.com>

In order to work around the existence of a vmap symbol in libpcap, the
UML makefile unconditionally redefines vmap to kernel_vmap. However,
this not only affects the actual vmap symbol, but also anything else
named vmap, including a number of struct members in DRM.

This would not be too much of a problem, since all uses are also
updated, except we now have Rust DRM bindings, which expect the
corresponding Rust structs to have 'vmap' names. Since the redefinition
applies in bindgen, but not to Rust code, we end up with errors such as:

error[E0560]: struct `drm_gem_object_funcs` has no fields named `vmap`
  --> rust/kernel/drm/gem/mod.rs:210:9

Since, as far as I can tell, we no longer actually link to libpcap, it
should be safe to just remove this define unconditionally.

(If it's not, we can possibly either disable DRM Rust bindings under
UML, or move the redefinition of vmap behind some config option.)

We also take this opportunity to update the comment.

Signed-off-by: David Gow <davidgow at google.com>
---
 arch/um/Makefile | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/um/Makefile b/arch/um/Makefile
index 7be0143b5ba3..721b652ffb65 100644
--- a/arch/um/Makefile
+++ b/arch/um/Makefile
@@ -46,19 +46,17 @@ ARCH_INCLUDE	:= -I$(srctree)/$(SHARED_HEADERS)
 ARCH_INCLUDE	+= -I$(srctree)/$(HOST_DIR)/um/shared
 KBUILD_CPPFLAGS += -I$(srctree)/$(HOST_DIR)/um
 
-# -Dvmap=kernel_vmap prevents anything from referencing the libpcap.o symbol so
-# named - it's a common symbol in libpcap, so we get a binary which crashes.
-#
-# Same things for in6addr_loopback and mktime - found in libc. For these two we
-# only get link-time error, luckily.
+# -Dstrrchr=kernel_strrchr (as well as the various in6addr symbols) prevents
+#  anything from referencing
+# libc symbols with the same name, which can cause a linker error.
 #
 # -Dlongjmp=kernel_longjmp prevents anything from referencing the libpthread.a
 # embedded copy of longjmp, same thing for setjmp.
 #
-# These apply to USER_CFLAGS to.
+# These apply to USER_CFLAGS too.
 
 KBUILD_CFLAGS += $(CFLAGS) $(CFLAGS-y) -D__arch_um__ \
-	$(ARCH_INCLUDE) $(MODE_INCLUDE) -Dvmap=kernel_vmap	\
+	$(ARCH_INCLUDE) $(MODE_INCLUDE)	\
 	-Dlongjmp=kernel_longjmp -Dsetjmp=kernel_setjmp \
 	-Din6addr_loopback=kernel_in6addr_loopback \
 	-Din6addr_any=kernel_in6addr_any -Dstrrchr=kernel_strrchr \
-- 
2.52.0.rc2.455.g230fcf2819-goog
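
The side effect described in the commit message can be reproduced in a
standalone C file. This is a minimal sketch: struct demo_funcs and the
DEMO_* helpers are hypothetical, and the #define stands in for passing
-Dvmap=kernel_vmap on the compiler command line. It shows that the
preprocessor renames every token spelled vmap, struct members included.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for -Dvmap=kernel_vmap on the command line. */
#define vmap kernel_vmap

/* Hypothetical struct; the member is spelled "vmap" in the source. */
struct demo_funcs {
	int (*vmap)(void);
};

/* Two-level stringification so the argument is macro-expanded first. */
#define DEMO_STR2(x) #x
#define DEMO_STR(x) DEMO_STR2(x)

/* Compile-time check that the identifier really was rewritten. */
static_assert(sizeof(DEMO_STR(vmap)) == sizeof("kernel_vmap"),
	      "vmap should expand to kernel_vmap");

/* Returns the member name as the compiler sees it after preprocessing. */
static const char *demo_member_name(void)
{
	return DEMO_STR(vmap);
}

/* The member is reachable under its renamed spelling. */
static int demo_has_renamed_member(void)
{
	struct demo_funcs f = { .kernel_vmap = NULL };

	return f.kernel_vmap == NULL;
}
```

C code never notices the rename because every use goes through the same
macro, whereas a binding generator that runs with the define and a
consumer that does not will disagree on the field name.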


From miguel.ojeda.sandonis at gmail.com  Sun Nov 23 09:07:54 2025
From: miguel.ojeda.sandonis at gmail.com (Miguel Ojeda)
Date: Sun, 23 Nov 2025 18:07:54 +0100
Subject: [PATCH] arch: um: Don't rename vmap to kernel_vmap
In-Reply-To: <20251122083213.3996586-1-davidgow@google.com>
References: <20251122083213.3996586-1-davidgow@google.com>
Message-ID: <CANiq72=MXHZ2XswxkfzLqUEPwriNy=_sxAQpOYoYW8iNgyuJ8Q@mail.gmail.com>

On Sat, Nov 22, 2025 at 9:32 AM David Gow <davidgow at google.com> wrote:
>
> In order to work around the existence of a vmap symbol in libpcap, the
> UML makefile unconditionally redefines vmap to kernel_vmap. However,
> this not only affects the actual vmap symbol, but also anything else
> named vmap, including a number of struct members in DRM.
>
> This would not be too much of a problem, since all uses are also
> updated, except we now have Rust DRM bindings, which expect the
> corresponding Rust structs to have 'vmap' names. Since the redefinition
> applies in bindgen, but not to Rust code, we end up with errors such as:
>
> error[E0560]: struct `drm_gem_object_funcs` has no fields named `vmap`
>   --> rust/kernel/drm/gem/mod.rs:210:9
>
> Since, as far as I can tell, we no longer actually link to libpcap, it
> should be safe to just remove this define unconditionally.
>
> (If it's not, we can possibly either disable DRM Rust bindings under
> UML, or move the redefinition of vmap behind some config option.)
>
> We also take this opportunity to update the comment.
>
> Signed-off-by: David Gow <davidgow at google.com>

Nice, thanks for this!

Yeah, I guess we would otherwise need to do the same kind of "wild"
macro replacement in Rust code to support this or conditional
compilation, and neither sounds good.

If it is not actually needed, then this sounds like a win-win.

It seems it was indeed gone in commit:

    12b8e7e69aa7 ("um: Remove obsolete pcap driver")

So it sounds reasonable to me assuming I am not missing anything,
which I may be... FWIW:

Acked-by: Miguel Ojeda <ojeda at kernel.org>

Cheers,
Miguel


From johannes at sipsolutions.net  Sun Nov 23 23:49:35 2025
From: johannes at sipsolutions.net (Johannes Berg)
Date: Mon, 24 Nov 2025 08:49:35 +0100
Subject: [PATCH] arch: um: Don't rename vmap to kernel_vmap
In-Reply-To: <CANiq72=MXHZ2XswxkfzLqUEPwriNy=_sxAQpOYoYW8iNgyuJ8Q@mail.gmail.com> (sfid-20251123_180818_160099_E5AEF588)
References: <20251122083213.3996586-1-davidgow@google.com>
	 <CANiq72=MXHZ2XswxkfzLqUEPwriNy=_sxAQpOYoYW8iNgyuJ8Q@mail.gmail.com>
	 (sfid-20251123_180818_160099_E5AEF588)
Message-ID: <bfb512e24f365695534313b375fc0c38d0a843b2.camel@sipsolutions.net>

On Sun, 2025-11-23 at 18:07 +0100, Miguel Ojeda wrote:
> On Sat, Nov 22, 2025 at 9:32 AM David Gow <davidgow at google.com> wrote:
> > 
> > In order to work around the existence of a vmap symbol in libpcap, the
> > UML makefile unconditionally redefines vmap to kernel_vmap. However,
> > this not only affects the actual vmap symbol, but also anything else
> > named vmap, including a number of struct members in DRM.
> > 
> > This would not be too much of a problem, since all uses are also
> > updated, except we now have Rust DRM bindings, which expect the
> > corresponding Rust structs to have 'vmap' names. Since the redefinition
> > applies in bindgen, but not to Rust code, we end up with errors such as:
> > 
> > error[E0560]: struct `drm_gem_object_funcs` has no fields named `vmap`
> >   --> rust/kernel/drm/gem/mod.rs:210:9
> > 
> > Since, as far as I can tell, we no longer actually link to libpcap, it
> > should be safe to just remove this define unconditionally.
> > 
> > (If it's not, we can possibly either disable DRM Rust bindings under
> > UML, or move the redefinition of vmap behind some config option.)
> > 
> > We also take this opportunity to update the comment.
> > 
> > Signed-off-by: David Gow <davidgow at google.com>
> 
> Nice, thanks for this!
> 
> Yeah, I guess we would otherwise need to do the same kind of "wild"
> macro replacement in Rust code to support this or conditional
> compilation, and neither sounds good.
> 
> If it is not actually needed, then this sounds like a win-win.
> 
> It seems it was indeed gone in commit:
> 
>     12b8e7e69aa7 ("um: Remove obsolete pcap driver")

Indeed, that was just missed during the removal, we can't link to
libpcap any more.

How do we want to take this patch in, and where is it needed? I hadn't
planned to send a UML PR to -rc still, but I guess I _can_ if needed?
But if anyone else wants to line it up through a tree (rust related?)
that has pending work anyway, that seems fair too. In which case:

Acked-by: Johannes Berg <johannes at sipsolutions.net>

Or it's not that urgent because all this came up in -next now? I didn't
really see (or fully understand) all the build bug reports.

johannes


From davidgow at google.com  Mon Nov 24 23:36:31 2025
From: davidgow at google.com (David Gow)
Date: Tue, 25 Nov 2025 15:36:31 +0800
Subject: [PATCH] arch: um: Don't rename vmap to kernel_vmap
In-Reply-To: <bfb512e24f365695534313b375fc0c38d0a843b2.camel@sipsolutions.net>
References: <20251122083213.3996586-1-davidgow@google.com> <CANiq72=MXHZ2XswxkfzLqUEPwriNy=_sxAQpOYoYW8iNgyuJ8Q@mail.gmail.com>
 <bfb512e24f365695534313b375fc0c38d0a843b2.camel@sipsolutions.net>
Message-ID: <CABVgOSmAGahEaa71_fOT4Oo9Wg91X6rN+1_KgVVvVzmd4vuMCg@mail.gmail.com>

On Mon, 24 Nov 2025 at 15:49, Johannes Berg <johannes at sipsolutions.net> wrote:
>
> On Sun, 2025-11-23 at 18:07 +0100, Miguel Ojeda wrote:
> > On Sat, Nov 22, 2025 at 9:32 AM David Gow <davidgow at google.com> wrote:
> > >
> > > In order to work around the existence of a vmap symbol in libpcap, the
> > > UML makefile unconditionally redefines vmap to kernel_vmap. However,
> > > this not only affects the actual vmap symbol, but also anything else
> > > named vmap, including a number of struct members in DRM.
> > >
> > > This would not be too much of a problem, since all uses are also
> > > updated, except we now have Rust DRM bindings, which expect the
> > > corresponding Rust structs to have 'vmap' names. Since the redefinition
> > > applies in bindgen, but not to Rust code, we end up with errors such as:
> > >
> > > error[E0560]: struct `drm_gem_object_funcs` has no fields named `vmap`
> > >   --> rust/kernel/drm/gem/mod.rs:210:9
> > >
> > > Since, as far as I can tell, we no longer actually link to libpcap, it
> > > should be safe to just remove this define unconditionally.
> > >
> > > (If it's not, we can possibly either disable DRM Rust bindings under
> > > UML, or move the redefinition of vmap behind some config option.)
> > >
> > > We also take this opportunity to update the comment.
> > >
> > > Signed-off-by: David Gow <davidgow at google.com>
> >
> > Nice, thanks for this!
> >
> > Yeah, I guess we would otherwise need to do the same kind of "wild"
> > macro replacement in Rust code to support this or conditional
> > compilation, and neither sounds good.
> >
> > If it is not actually needed, then this sounds like a win-win.
> >
> > It seems it was indeed gone in commit:
> >
> >     12b8e7e69aa7 ("um: Remove obsolete pcap driver")
>
> Indeed, that was just missed during the removal, we can't link to
> libpcap any more.
>
> How do we want to take this patch in, and where is it needed? I hadn't
> planned to send a UML PR to -rc still, but I guess I _can_ if needed?
> But if anyone else wants to line it up through a tree (rust related?)
> that has pending work anyway, that seems fair too. In which case:
>
> Acked-by: Johannes Berg <johannes at sipsolutions.net>
>
> Or it's not that urgent because all this came up in -next now? I didn't
> really see (or fully understand) all the build bug reports.
>

I'm happy for this to go in via any tree. (Worst-case, we could
possibly take it via KUnit, though I'd rather not, as it's not really
KUnit-related at all.)

The issue has actually been around since probably 6.16 (c284d3e42338
("rust: drm: gem: Add GEM object abstraction")), but since it only
applies to people building Rust graphics drivers against UML, which is
not super common, it seems like it's only come up in randconfig builds
so far.
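
For anyone skimming the thread, the failure mode can be reproduced in
plain userspace C. This is only a sketch under assumptions: the #define
stands in for the UML-wide redefinition, and the struct and the demo_*
names are hypothetical, not kernel code. It shows that the preprocessor
renames the struct member too, so any code that does not go through the
same preprocessor (such as hand-written Rust) has to use the renamed
field.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the UML-wide redefinition discussed in this thread. */
#define vmap kernel_vmap

/* Hypothetical struct; after preprocessing its member is kernel_vmap. */
struct demo_gem_object_funcs {
	int (*vmap)(void *obj);
};

#define DEMO_STR_(x) #x
#define DEMO_STR(x) DEMO_STR_(x)

/* Returns the identifier the compiler actually sees for "vmap". */
const char *demo_member_name(void)
{
	return DEMO_STR(vmap);
}

/* Non-zero when the preprocessor replaced the name "vmap". */
int demo_member_was_renamed(void)
{
	return strcmp(demo_member_name(), "vmap") != 0;
}

int demo_map(void *obj)
{
	(void)obj;
	return 42;
}

int demo_call_member(void)
{
	/* Code the macro did not touch must spell the renamed member. */
	struct demo_gem_object_funcs funcs = { .kernel_vmap = demo_map };

	assert(funcs.vmap != NULL);	/* expands to funcs.kernel_vmap */
	return funcs.vmap(NULL);
}
```

Here demo_member_name() returns "kernel_vmap", which mirrors why the
Rust side no longer finds a field called vmap in the generated bindings.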

-- David

From johannes at sipsolutions.net  Mon Nov 24 23:40:15 2025
From: johannes at sipsolutions.net (Johannes Berg)
Date: Tue, 25 Nov 2025 08:40:15 +0100
Subject: [PATCH] arch: um: Don't rename vmap to kernel_vmap
In-Reply-To: <CABVgOSmAGahEaa71_fOT4Oo9Wg91X6rN+1_KgVVvVzmd4vuMCg@mail.gmail.com> (sfid-20251125_083645_823712_C74034DC)
References: <20251122083213.3996586-1-davidgow@google.com>
	 <CANiq72=MXHZ2XswxkfzLqUEPwriNy=_sxAQpOYoYW8iNgyuJ8Q@mail.gmail.com>
	 <bfb512e24f365695534313b375fc0c38d0a843b2.camel@sipsolutions.net>
	 <CABVgOSmAGahEaa71_fOT4Oo9Wg91X6rN+1_KgVVvVzmd4vuMCg@mail.gmail.com>
	 (sfid-20251125_083645_823712_C74034DC)
Message-ID: <dedea008d546a2a29d3b604e9f4b0c0dec3ebac1.camel@sipsolutions.net>

On Tue, 2025-11-25 at 15:36 +0800, David Gow wrote:
> > 
> > Or it's not that urgent because all this came up in -next now? I didn't
> > really see (or fully understand) all the build bug reports.
> > 
> 
> I'm happy for this to go in via any tree. (Worst-case, we could
> possibly take it via KUnit, though I'd rather not, as it's not really
> KUnit-related at all.)
> 
> The issue has actually been around since probably 6.16 (c284d3e42338
> ("rust: drm: gem: Add GEM object abstraction")), but since it only
> applies to people building Rust graphics drivers against UML, which is
> not super common, it seems like it's only come up in randconfig builds
> so far.

Oh, interesting, OK. I guess then given that it's not super important
and how late we're in the game, I'll just throw it into the (relatively
small) pile we have for UML for -next. Given that we removed the pcap
driver in 6.11 (12b8e7e69aa7a) I guess we could even ask stable to take
it, but it's not even that important until someone wants to test the
rust DRM stuff in kunit or something :)

johannes


From davidgow at google.com  Tue Nov 25 01:17:48 2025
From: davidgow at google.com (David Gow)
Date: Tue, 25 Nov 2025 17:17:48 +0800
Subject: [PATCH] arch: um: Don't rename vmap to kernel_vmap
In-Reply-To: <dedea008d546a2a29d3b604e9f4b0c0dec3ebac1.camel@sipsolutions.net>
References: <20251122083213.3996586-1-davidgow@google.com> <CANiq72=MXHZ2XswxkfzLqUEPwriNy=_sxAQpOYoYW8iNgyuJ8Q@mail.gmail.com>
 <bfb512e24f365695534313b375fc0c38d0a843b2.camel@sipsolutions.net>
 <CABVgOSmAGahEaa71_fOT4Oo9Wg91X6rN+1_KgVVvVzmd4vuMCg@mail.gmail.com> <dedea008d546a2a29d3b604e9f4b0c0dec3ebac1.camel@sipsolutions.net>
Message-ID: <CABVgOSkk=C7f3X-Si8e+bA3XTDy_D+yE2fvxMYLfGp3VpUbyJg@mail.gmail.com>

On Tue, 25 Nov 2025 at 15:40, Johannes Berg <johannes at sipsolutions.net> wrote:
>
> On Tue, 2025-11-25 at 15:36 +0800, David Gow wrote:
> > >
> > > Or it's not that urgent because all this came up in -next now? I didn't
> > > really see (or fully understand) all the build bug reports.
> > >
> >
> > I'm happy for this to go in via any tree. (Worst-case, we could
> > possibly take it via KUnit, though I'd rather not, as it's not really
> > KUnit-related at all.)
> >
> > The issue has actually been around since probably 6.16 (c284d3e42338
> > ("rust: drm: gem: Add GEM object abstraction")), but since it only
> > applies to people building Rust graphics drivers against UML, which is
> > not super common, it seems like it's only come up in randconfig builds
> > so far.
>
> Oh, interesting, OK. I guess then given that it's not super important
> and how late we're in the game, I'll just throw it into the (relatively
> small) pile we have for UML for -next. Given that we removed the pcap
> driver in 6.11 (12b8e7e69aa7a) I guess we could even ask stable to take
> it, but it's not even that important until someone wants to test the
> rust DRM stuff in kunit or something :)
>

Sounds good to me. Throwing the Fixes: tag in wouldn't hurt at least
(and I do think some of the Rust DRM stuff is growing some KUnit
tests, so it'll be nice to have going forward, even if they can be
tested under other architectures).

Cheers,
-- David

From johannes at sipsolutions.net  Tue Nov 25 01:58:53 2025
From: johannes at sipsolutions.net (Johannes Berg)
Date: Tue, 25 Nov 2025 10:58:53 +0100
Subject: [PATCH v13 00/13] nommu UML
In-Reply-To: <m2bjl7y6mv.wl-thehajime@gmail.com> (sfid-20251112_095303_672501_9A7DDF36)
References: <cover.1762588860.git.thehajime@gmail.com>
		<aRGs8lPjH22NqMZc@infradead.org>	<m2framxerm.wl-thehajime@gmail.com>
		<0a84c16f862026c82271c0adbc91d98b812a78b4.camel@sipsolutions.net>
	 <m2bjl7y6mv.wl-thehajime@gmail.com> (sfid-20251112_095303_672501_9A7DDF36)
Message-ID: <defcec3945fbc37e90070b030bf1596b11b6d926.camel@sipsolutions.net>

On Wed, 2025-11-12 at 17:52 +0900, Hajime Tazaki wrote:
> > >   What is it for ?
> > >   ================
> > >   
> > >   - Alleviate syscall hook overhead implemented with ptrace(2)
> > >   - To exercise nommu code over UML (and over KUnit)
> > >   - Less dependency to host facilities
> > 
> > FWIW, in some way, this order of priorities is exactly why this hasn't
> > been going anywhere, and every time I looked at it I got somewhat
> > annoyed by what seems to me like choices made to support especially the
> > first bullet.
> 
> over the past versions, I've been emphasizing that the 2nd bullet (testing)
> is the primary use case, as I saw several actual cases from mm folks,
> 
> https://lists.infradead.org/pipermail/maple-tree/2024-November/003775.html
> https://lore.kernel.org/all/cb1cf0be-871d-4982-9a1b-5fdd54deec8d at lucifer.local/
> 
> and I think this is not limited to mm code.

Not sure there's much value in testing much else in no-MMU, but sure,
I'll give you that it's useful for testing.

> other 2 bullets are additional benefits which we observed in a
> comment, and our experience.

But are they really _worthwhile_ benefits? A lot of this design adds
additional complexity, and it doesn't really seem necessary for the
testing use case. Making it faster is nice, but it's not like the
speedup really is 20x for arbitrary tests, that's just for corner cases
like "sit in a loop of gettimeofday()". And for kunit there's no syscall
boundary at all, so there's no speedup.

> > I suspect that the first and third bullet are not even really true any
> > more, since you moved to seccomp (per our request), yet I think design
> > choices influenced by them persist.
> 
> this observation is not true; the first bullet is still true even
> using seccomp.  please look at the benchmark result in the patch
> [12/13], quoted below.

> [snip]

So thanks for the correction. If that's the case, however, it means the
speedup can't be due to the syscall boundary itself (seccomp) but must
rather be due to some pagefault/mapping handling issue? Which would be
inherent in no-MMU, even taking an approach of using two host processes
rather than embedding everything into one.

> > However, I'm not yet convinced that all of the complexities presented in
> > this patchset (such as completely separate seccomp implementation) are
> > actually necessary in support of _just_ the second bullet. These seem to
> > me like design choices necessary to support the _first_ bullet [1].
> 
> separate seccomp implementation is indeed needed due to the design
> choice we made, to use a single process to host a (um) userspace.

That sounds misleading or even wrong to me, I'd say it's due to putting
the (um) userspace in the same host process as the kernel space?

> I don't see why you see this as a _complexity_, as functionally the two
> seccomp handlers don't interfere with each other.

The complexity isn't so much in the separate code, which is a small
factor, but in the "put everything into the same process" aspect of it.
That has consequences around the host context state handling, things we
didn't really need to consider before suddenly become crucially
important. In the current (with-MMU) design, we only need to worry about
being able to correctly switch between userspace tasks/threads within a
userspace mm (host) process. With the no-MMU design you propose, we also
need to be able to correctly switch between kernel and userspace tasks
within the same single (host) process.

I think this is a pretty significant difference, and saying "there's no
complexity here" is simply pretending it isn't a relevant difference. I
believe you're not even handling this correctly right now in this patch
set, specifically wrt. the GS register which has been pointed out
before, but I wouldn't say that I even have a complete picture in my
head over what state handling would be necessary and sufficient.

So yeah, I think this warrants taking another look as to whether or not
the approach of putting everything into the same host process is even
worth it. I tend to believe that it isn't, given the use cases. And if
you say the speedup still is with seccomp, that kills the speed argument
too.

> > I've thought about what would happen if we stuck to creating a (single)
> > separate process on the host to execute userspace, and just used
> > CLONE_VM for it. That way, it's still no-MMU with full memory access,
> > but there's some implicit isolation between the kernel and userspace
> > processes which will likely remove complexities around FP/SSE/AVX
> > handling, may completely remove the need for a separate seccomp
> > implementation, etc.
> 
> this would be doable I think, but we went a different way, as
> using separate host processes (with ptrace/seccomp) is slow and adds
> complexity through the synchronization between processes, which we think
> is not easy to maintain in the future.

Which one is it then, slow or not? Not sure I follow. You just said you
do have seccomp when comparing speeds, so that in itself doesn't make it
slow. What synchronization? It'd (have to) be CLONE_VM, but that
actually _simplifies_ state transfer/synchronization, and we already
have (to have) state transfer between different userspace threads in the
same host process for the with-MMU case.
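
To illustrate what a separate host process with CLONE_VM means in
practice, here is a minimal userspace sketch. It is not UML code; the
demo_* names and the stack size are made up for the example. The child
gets its own PID (and with it its own register and FP context on the
host), but it shares the address space with the parent, so a store done
by the child is visible to the parent without any copying.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define DEMO_STACK_SIZE (64 * 1024)

/* Written by the child; visible to the parent only because of CLONE_VM. */
static volatile int demo_shared;

static int demo_child(void *arg)
{
	demo_shared = *(int *)arg;
	return 0;
}

/* Returns the value the child stored, or -1 on failure. */
int demo_run_in_clone_vm_child(int value)
{
	char *stack = malloc(DEMO_STACK_SIZE);
	int status;
	pid_t pid;

	assert(value >= 0);
	if (!stack)
		return -1;

	demo_shared = -1;
	/* The stack grows down on the common architectures: pass its top. */
	pid = clone(demo_child, stack + DEMO_STACK_SIZE,
		    CLONE_VM | SIGCHLD, &value);
	if (pid < 0) {
		free(stack);
		return -1;
	}

	/* On failure the child may still run on the stack, so do not free. */
	if (waitpid(pid, &status, 0) != pid)
		return -1;

	free(stack);
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	return demo_shared;
}
```

Calling demo_run_in_clone_vm_child(7) returns 7 when clone() is
permitted: the parent reads back what the child wrote. Without CLONE_VM
the child would write to its own copy and the parent would still see -1.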

johannes


From pmladek at suse.com  Wed Nov 26 05:12:19 2025
From: pmladek at suse.com (Petr Mladek)
Date: Wed, 26 Nov 2025 14:12:19 +0100
Subject: [PATCH v2 1/4] drivers: serial: kgdboc: Drop checks for
 CON_ENABLED and CON_BOOT
In-Reply-To: <20251121-printk-cleanup-part2-v2-1-57b8b78647f4@suse.com>
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
 <20251121-printk-cleanup-part2-v2-1-57b8b78647f4@suse.com>
Message-ID: <aSb8s_N4Pc0yTk9f@pathway.suse.cz>

On Fri 2025-11-21 15:50:33, Marcos Paulo de Souza wrote:
> The original code tried to find a console that has CON_BOOT _or_
> CON_ENABLED flag set. The flag CON_ENABLED is set to all registered
> consoles, so in this case this check is always true, even for the
> CON_BOOT consoles.
> 
> The initial intent of the kgdboc_earlycon_init was to get a console
> early (CON_BOOT) or later on in the process (CON_ENABLED). The
> code was using for_each_console macro, meaning that all console structs
> were previously registered on the printk() machinery. At this point,
> any console found on for_each_console is safe for kgdboc_earlycon_init
> to use.
> 
> Dropping the check makes the code cleaner, and avoids further confusion
> by future readers of the code.
> 
> Signed-off-by: Marcos Paulo de Souza <mpdesouza at suse.com>

I agree that the check is superfluous and can be removed:

Reviewed-by: Petr Mladek <pmladek at suse.com>
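
As an illustration of why the dropped check can never fail for a
registered console, here is a small userspace sketch. The bit positions
and the DEMO_* and demo_* names are assumptions made for the example,
not the kernel definitions; the only property used is that CON_ENABLED
is set on every registered console, as stated in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit positions; treat them as assumptions. */
#define DEMO_CON_ENABLED (1u << 2)
#define DEMO_CON_BOOT    (1u << 3)
#define DEMO_FLAG_BITS   8

/* Shape of the check the patch drops. */
bool demo_old_check(unsigned int flags)
{
	assert(flags < (1u << DEMO_FLAG_BITS));
	return (flags & (DEMO_CON_BOOT | DEMO_CON_ENABLED)) != 0;
}

/* Count flag combinations of a registered console that fail the check. */
unsigned int demo_count_rejected_registered(void)
{
	unsigned int flags, rejected = 0;

	for (flags = 0; flags < (1u << DEMO_FLAG_BITS); flags++) {
		if (!(flags & DEMO_CON_ENABLED))
			continue;	/* not registered in this model */
		if (!demo_old_check(flags))
			rejected++;
	}
	return rejected;
}
```

demo_count_rejected_registered() returns 0: once CON_ENABLED is set,
the OR with CON_BOOT adds no information.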

Best Regards,
Petr


From pmladek at suse.com  Wed Nov 26 05:22:44 2025
From: pmladek at suse.com (Petr Mladek)
Date: Wed, 26 Nov 2025 14:22:44 +0100
Subject: [PATCH v2 2/4] arch: um: kmsg_dump: Use console_is_usable
In-Reply-To: <20251121-printk-cleanup-part2-v2-2-57b8b78647f4@suse.com>
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
 <20251121-printk-cleanup-part2-v2-2-57b8b78647f4@suse.com>
Message-ID: <aSb_JDBBX9Yh0jCM@pathway.suse.cz>

On Fri 2025-11-21 15:50:34, Marcos Paulo de Souza wrote:
> All consoles found on for_each_console are registered, meaning that all
> of them have the CON_ENABLED flag set. Since NBCON was introduced it's
> important to check if a given console also implements the NBCON callbacks.
> The function console_is_usable does exactly that.
> 
> Signed-off-by: Marcos Paulo de Souza <mpdesouza at suse.com>

Makes sense:

Reviewed-by: Petr Mladek <pmladek at suse.com>

Best Regards,
Petr


From pmladek at suse.com  Wed Nov 26 05:24:58 2025
From: pmladek at suse.com (Petr Mladek)
Date: Wed, 26 Nov 2025 14:24:58 +0100
Subject: [PATCH v2 3/4] printk: Use console_is_usable on console_unblank
In-Reply-To: <20251121-printk-cleanup-part2-v2-3-57b8b78647f4@suse.com>
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
 <20251121-printk-cleanup-part2-v2-3-57b8b78647f4@suse.com>
Message-ID: <aSb_qmjfNc1seDzb@pathway.suse.cz>

On Fri 2025-11-21 15:50:35, Marcos Paulo de Souza wrote:
> The macro for_each_console_srcu iterates over all registered consoles. It's
> implied that all registered consoles have CON_ENABLED flag set, making
> the check for the flag unnecessary. Call console_is_usable function to
> fully verify if the given console is usable before calling the ->unblank
> callback.
> 
> Signed-off-by: Marcos Paulo de Souza <mpdesouza at suse.com>

Makes sense:

Reviewed-by: Petr Mladek <pmladek at suse.com>

Best Regards,
Petr


From miguel.ojeda.sandonis at gmail.com  Wed Nov 26 05:50:44 2025
From: miguel.ojeda.sandonis at gmail.com (Miguel Ojeda)
Date: Wed, 26 Nov 2025 14:50:44 +0100
Subject: [linux-next:master 4806/10599] error[E0560]: struct
 `bindings::kernel_param_ops` has no field named `get`
In-Reply-To: <84b74435-5aad-4c15-aea5-db87b4a6bf11@kernel.org>
References: <202511210858.uwVivgvn-lkp@intel.com> <84b74435-5aad-4c15-aea5-db87b4a6bf11@kernel.org>
Message-ID: <CANiq72mCd7FVO0Btsvct5Dy67TkBJd=QJgnPvLMn9d43Vy0YnA@mail.gmail.com>

On Wed, Nov 26, 2025 at 2:41 PM Daniel Gomez <da.gomez at kernel.org> wrote:
>
> On 21/11/2025 01.24, kernel test robot wrote:
> > tree:   https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
> > head:   88cbd8ac379cf5ce68b7efcfd4d1484a6871ee0b
> > commit: 0b08fc292842a13aa496413b48c1efb83573b8c6 [4806/10599] rust: introduce module_param module
> > config: um-randconfig-001-20251121 (https://download.01.org/0day-ci/archive/20251121/202511210858.uwVivgvn-lkp at intel.com/config)
> > compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 9e9fe08b16ea2c4d9867fb4974edf2a3776d6ece)
> > rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
> > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251121/202511210858.uwVivgvn-lkp at intel.com/reproduce)
>
> We can't reproduce this.
>
> If anyone cares, please let us know how to reproduce it.
>
> Tested on Debian testing x86_64 host.
>
> rustc --version
> rustc 1.91.1 (ed61e7d7e 2025-11-07)
>
> /home/dagomez/0day/llvm-22.0.0-e19fa930ca838715028c00c234874d1db4f93154-20250918-184558-x86_64/bin/clang-22 --version
> ClangBuiltLinux clang version 22.0.0git (https://github.com/llvm/llvm-project.git e19fa930ca838715028c00c234874d1db4f93154)
> Target: x86_64-unknown-linux-gnu
> Thread model: posix
> InstalledDir: /home/dagomez/0day/llvm-22.0.0-e19fa930ca838715028c00c234874d1db4f93154-20250918-184558-x86_64/bin
>
>   561  wget https://download.01.org/0day-ci/archive/20251121/202511210858.uwVivgvn-lkp at intel.com/config
>   563  git clone https://github.com/intel/lkp-tests.git ~/lkp-tests
>   565  mkdir -p build_dir && cp config build_dir/.config
>
>   571  COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang-22 ~/lkp-tests/kbuild/make.cross W=1 O=build_dir ARCH=um olddefconfig
>   572  COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang-22 ~/lkp-tests/kbuild/make.cross W=1 O=build_dir ARCH=um prepare
>   573  COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang-22 ~/lkp-tests/kbuild/make.cross W=1 O=build_dir ARCH=um -j$(nproc)
>
> I'm just getting these warnings:
>
> ...

Cc'ing UML so that they are in the loop.

Cheers,
Miguel


From davidgow at google.com  Thu Nov 27 01:26:31 2025
From: davidgow at google.com (David Gow)
Date: Thu, 27 Nov 2025 17:26:31 +0800
Subject: [linux-next:master 4806/10599] error[E0560]: struct
 `bindings::kernel_param_ops` has no field named `get`
In-Reply-To: <CANiq72mCd7FVO0Btsvct5Dy67TkBJd=QJgnPvLMn9d43Vy0YnA@mail.gmail.com>
References: <202511210858.uwVivgvn-lkp@intel.com> <84b74435-5aad-4c15-aea5-db87b4a6bf11@kernel.org>
 <CANiq72mCd7FVO0Btsvct5Dy67TkBJd=QJgnPvLMn9d43Vy0YnA@mail.gmail.com>
Message-ID: <CABVgOS=SFM7Po-Xyf0nyT=np8fKNJzuHSBeyw+m4dL1vivq1WA@mail.gmail.com>

On Wed, 26 Nov 2025 at 21:50, Miguel Ojeda
<miguel.ojeda.sandonis at gmail.com> wrote:
>
> On Wed, Nov 26, 2025 at 2:41 PM Daniel Gomez <da.gomez at kernel.org> wrote:
> >
> > On 21/11/2025 01.24, kernel test robot wrote:
> > > tree:   https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
> > > head:   88cbd8ac379cf5ce68b7efcfd4d1484a6871ee0b
> > > commit: 0b08fc292842a13aa496413b48c1efb83573b8c6 [4806/10599] rust: introduce module_param module
> > > config: um-randconfig-001-20251121 (https://download.01.org/0day-ci/archive/20251121/202511210858.uwVivgvn-lkp at intel.com/config)
> > > compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 9e9fe08b16ea2c4d9867fb4974edf2a3776d6ece)
> > > rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
> > > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251121/202511210858.uwVivgvn-lkp at intel.com/reproduce)
> >
> > We can't reproduce this.
> >
> > If anyone cares, please let us know how to reproduce it.
> >

Thanks -- this does sit in the category of things I care about (at
least in theory), but also can't reproduce.

It looks like this affects random struct fields in bindings:: (I've
seen other 0day reports with other structs and fields). If anyone has
any idea what's going on, suggestions are welcome.

Cheers,
-- David

From pmladek at suse.com  Thu Nov 27 01:49:28 2025
From: pmladek at suse.com (Petr Mladek)
Date: Thu, 27 Nov 2025 10:49:28 +0100
Subject: [PATCH v2 4/4] printk: Make console_{suspend,resume} handle
 CON_SUSPENDED
In-Reply-To: <20251121-printk-cleanup-part2-v2-4-57b8b78647f4@suse.com>
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
 <20251121-printk-cleanup-part2-v2-4-57b8b78647f4@suse.com>
Message-ID: <aSgeqM3DWvR8-cMY@pathway.suse.cz>

On Fri 2025-11-21 15:50:36, Marcos Paulo de Souza wrote:
> Since commit 9e70a5e109a4 ("printk: Add per-console suspended state")
> introduced the CON_SUSPENDED flag, the flag has been checked
> in the console_is_usable function, which returns false if the console is
> suspended.
> 
> To make the behavior consistent, change show_cons_active to look for
> consoles that are not suspended, instead of checking CON_ENABLED.
> 
> --- a/drivers/tty/tty_io.c
> +++ b/drivers/tty/tty_io.c
> @@ -3554,7 +3554,7 @@ static ssize_t show_cons_active(struct device *dev,
>  			continue;
>  		if (!(c->flags & CON_NBCON) && !c->write)
>  			continue;
> -		if ((c->flags & CON_ENABLED) == 0)
> +		if (c->flags & CON_SUSPENDED)

I believe that we could and should replace

		if (!(c->flags & CON_NBCON) && !c->write)
			continue;
		if (c->flags & CON_SUSPENDED)
			continue;

with

		if (!console_is_usable(c, c->flags, true) &&
		    !console_is_usable(c, c->flags, false))
			continue;

It would make the value compatible with all other callers/users of
the console drivers.

The variant using two console_is_usable() calls with "true/false"
parameters is inspired by __pr_flush().

>  			continue;
>  		cs[i++] = c;
>  		if (i >= ARRAY_SIZE(cs))
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index fed98a18e830..fe7c956f73bd 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -3542,7 +3542,7 @@ void console_suspend(struct console *console)
>  {
>  	__pr_flush(console, 1000, true);
>  	console_list_lock();
> -	console_srcu_write_flags(console, console->flags & ~CON_ENABLED);
> +	console_srcu_write_flags(console, console->flags | CON_SUSPENDED);

This is the same flag which is set also by the console_suspend_all()
API. Now, as discussed at
https://lore.kernel.org/lkml/844j4lepak.fsf at jogness.linutronix.de/

   + console_suspend()/console_resume() API is used by a few console
     drivers to suspend the console when the related HW device
     gets suspended.

   + console_suspend_all()/console_resume_all() is used by
     the power management subsystem to call down/up all consoles
     when the system is going down/up. It is a big hammer approach.

We need to distinguish the two APIs so that console drivers which were
suspended by both APIs stay suspended until they get resumed by both
APIs. I mean:

	// This should suspend all consoles unless that is disabled
	// by the "no_console_suspend" parameter.
	console_suspend_all();
	// This suspends @con even when "no_console_suspend" parameter
	// is used. It is needed because the HW is going to be suspended.
	// It has no effect when the consoles were already suspended
	// by the big hammer API.
	console_suspend(con);

	// This might resume the console when "no_console_suspend" option
	// is used. The driver should work because the HW was resumed.
	// But it should stay suspended when all consoles are supposed
	// to stay suspended because of the big hammer API.
	console_resume(con);
	// This should resume all consoles.
	console_resume_all();

Other behavior would be unexpected and untested. It might cause regressions.

I see two solutions:

   + add another CON_SUSPENDED_ALL flag
   + add back "consoles_suspended" global variable

I prefer adding back the "consoles_suspended" global variable because
it is a global state...

The global state should be synchronized the same way as the current
per-console flag (write under console_list_lock, read under
console_srcu_read_lock()).

Also it should be checked by console_is_usable() API. Otherwise, we
would need to update all callers.

This raises the question of how to make it safe while keeping the API sane.
I propose to create:

  + __console_is_usable() where the "consoles_suspended" value will
    be passed as parameter. It might be used directly under
    console_list_lock().

  + console_is_usable() with the existing parameters. It will check
    that it was called under console_srcu_read_lock, read
    the global "consoles_suspended" and pass it to
    __console_is_usable().

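The semantics described above can be modelled in a few lines of
userspace C. This is only a sketch of the idea, not the proposed kernel
code: locking and SRCU are left out, the flag values are made up, and
all demo_* names are hypothetical (demo_console_is_usable_locked()
plays the role of __console_is_usable()). The point is that the global
state and the per-console flag are independent, so a console suspended
through both APIs becomes usable again only after both resumes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Made-up flag values for this model only. */
#define DEMO_CON_ENABLED   (1u << 0)
#define DEMO_CON_SUSPENDED (1u << 1)

struct demo_console {
	unsigned int flags;
};

/* Global "big hammer" state, set by demo_console_suspend_all(). */
static bool demo_consoles_suspended;

/* Variant taking the global state as a parameter. */
bool demo_console_is_usable_locked(const struct demo_console *con,
				   bool all_suspended)
{
	assert(con != NULL);

	if (all_suspended)
		return false;
	if (!(con->flags & DEMO_CON_ENABLED))
		return false;
	if (con->flags & DEMO_CON_SUSPENDED)
		return false;
	return true;
}

/* Wrapper with the existing signature; reads the global state itself. */
bool demo_console_is_usable(const struct demo_console *con)
{
	return demo_console_is_usable_locked(con, demo_consoles_suspended);
}

void demo_console_suspend(struct demo_console *con)
{
	con->flags |= DEMO_CON_SUSPENDED;
}

void demo_console_resume(struct demo_console *con)
{
	con->flags &= ~DEMO_CON_SUSPENDED;
}

void demo_console_suspend_all(void)
{
	demo_consoles_suspended = true;
}

void demo_console_resume_all(void)
{
	demo_consoles_suspended = false;
}
```

With a single shared CON_SUSPENDED bit, the demo_console_resume() call
in the middle of the sequence would already make the console usable
again, which is the behavior the separate global state avoids.
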

>  	console_list_unlock();
>  
>  	/*

I played with the code to make sure that it looked sane
and I ended with the following changes on top of this patch.

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 1b2ce0f36010..fda4683d12f1 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -3552,9 +3552,8 @@ static ssize_t show_cons_active(struct device *dev,
 	for_each_console(c) {
 		if (!c->device)
 			continue;
-		if (!(c->flags & CON_NBCON) && !c->write)
-			continue;
-		if (c->flags & CON_SUSPENDED)
+		if (!__console_is_usable(c, c->flags, consoles_suspended, true) &&
+		    !__console_is_usable(c, c->flags, consoles_suspended, false))
 			continue;
 		cs[i++] = c;
 		if (i >= ARRAY_SIZE(cs))
diff --git a/include/linux/console.h b/include/linux/console.h
index 5f17321ed962..090490ef570f 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -496,6 +496,7 @@ extern void console_list_lock(void) __acquires(console_mutex);
 extern void console_list_unlock(void) __releases(console_mutex);
 
 extern struct hlist_head console_list;
+extern bool consoles_suspended;
 
 /**
  * console_srcu_read_flags - Locklessly read flags of a possibly registered
@@ -548,6 +549,47 @@ static inline void console_srcu_write_flags(struct console *con, short flags)
 	WRITE_ONCE(con->flags, flags);
 }
 
+/**
+ * consoles_suspended_srcu_read - Locklessly read the global flag for
+ *				suspending all consoles.
+ *
+ * The global "consoles_suspended" flag is synchronized using console_list_lock
+ * and console_srcu_read_lock. It is the same approach as CON_SUSPENDED flag.
+ * See console_srcu_read_flags() for more details.
+ *
+ * Context: Any context.
+ * Return: The current value of the global "consoles_suspended" flag.
+ */
+static inline bool consoles_suspended_srcu_read(void)
+{
+	WARN_ON_ONCE(!console_srcu_read_lock_is_held());
+
+	/*
+	 * The READ_ONCE() matches the WRITE_ONCE() when "consoles_suspended"
+	 * is modified with consoles_suspended_srcu_write().
+	 */
+	return data_race(READ_ONCE(consoles_suspended));
+}
+
+/**
+ * consoles_suspended_srcu_write - Write the global flag for suspending
+ *			all consoles.
+ * @suspend:	new value to write
+ *
+ * The write must be done under the console_list_lock. The caller is responsible
+ * for calling synchronize_srcu() to make sure that all callers checking the
+ * usability of registered consoles see the new state.
+ *
+ * Context: Any context.
+ */
+static inline void consoles_suspended_srcu_write(bool suspend)
+{
+	lockdep_assert_console_list_lock_held();
+
+	/* This matches the READ_ONCE() in consoles_suspended_srcu_read(). */
+	WRITE_ONCE(consoles_suspended, suspend);
+}
+
 /* Variant of console_is_registered() when the console_list_lock is held. */
 static inline bool console_is_registered_locked(const struct console *con)
 {
@@ -617,13 +659,15 @@ extern bool nbcon_kdb_try_acquire(struct console *con,
 extern void nbcon_kdb_release(struct nbcon_write_context *wctxt);
 
 /*
- * Check if the given console is currently capable and allowed to print
- * records. Note that this function does not consider the current context,
- * which can also play a role in deciding if @con can be used to print
- * records.
+ * This variant might be called under console_list_lock where both
+ * @flags and @all_suspended flags can be read directly.
  */
-static inline bool console_is_usable(struct console *con, short flags, bool use_atomic)
+static inline bool __console_is_usable(struct console *con, short flags,
+				       bool all_suspended, bool use_atomic)
 {
+	if (all_suspended)
+		return false;
+
 	if (!(flags & CON_ENABLED))
 		return false;
 
@@ -666,6 +710,20 @@ static inline bool console_is_usable(struct console *con, short flags, bool use_
 	return true;
 }
 
+/*
+ * Check if the given console is currently capable and allowed to print
+ * records. Note that this function does not consider the current context,
+ * which can also play a role in deciding if @con can be used to print
+ * records.
+ */
+static inline bool console_is_usable(struct console *con, short flags,
+				     bool use_atomic)
+{
+	bool all_suspended = consoles_suspended_srcu_read();
+
+	return __console_is_usable(con, flags, all_suspended, use_atomic);
+}
+
 #else
 static inline void nbcon_cpu_emergency_enter(void) { }
 static inline void nbcon_cpu_emergency_exit(void) { }
@@ -678,6 +736,8 @@ static inline void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt) { }
 static inline bool nbcon_kdb_try_acquire(struct console *con,
 					 struct nbcon_write_context *wctxt) { return false; }
 static inline void nbcon_kdb_release(struct nbcon_write_context *wctxt) { }
+static inline bool __console_is_usable(struct console *con, short flags,
+				       bool all_suspended, bool use_atomic) { return false; }
 static inline bool console_is_usable(struct console *con, short flags,
 				     bool use_atomic) { return false; }
 #endif
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 23a14e8c7a49..12247df07420 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -104,6 +104,13 @@ DEFINE_STATIC_SRCU(console_srcu);
  */
 int __read_mostly suppress_printk;
 
+/*
+ * Global flag for calling down all consoles during suspend.
+ * There is also a per-console flag which is used when the related
+ * device HW gets disabled, see CON_SUSPENDED.
+ */
+bool consoles_suspended;
+
 #ifdef CONFIG_LOCKDEP
 static struct lockdep_map console_lock_dep_map = {
 	.name = "console_lock"
@@ -2731,8 +2738,6 @@ MODULE_PARM_DESC(console_no_auto_verbose, "Disable console loglevel raise to hig
  */
 void console_suspend_all(void)
 {
-	struct console *con;
-
 	if (console_suspend_enabled)
 		pr_info("Suspending console(s) (use no_console_suspend to debug)\n");
 
@@ -2749,8 +2754,7 @@ void console_suspend_all(void)
 		return;
 
 	console_list_lock();
-	for_each_console(con)
-		console_srcu_write_flags(con, con->flags | CON_SUSPENDED);
+	consoles_suspended_srcu_write(true);
 	console_list_unlock();
 
 	/*
@@ -2765,7 +2769,6 @@ void console_suspend_all(void)
 void console_resume_all(void)
 {
 	struct console_flush_type ft;
-	struct console *con;
 
 	/*
 	 * Allow queueing irq_work. After restoring console state, deferred
@@ -2776,8 +2779,7 @@ void console_resume_all(void)
 
 	if (console_suspend_enabled) {
 		console_list_lock();
-		for_each_console(con)
-			console_srcu_write_flags(con, con->flags & ~CON_SUSPENDED);
+		consoles_suspended_srcu_write(false);
 		console_list_unlock();
 
 		/*

Best Regards,
Petr


From pmladek at suse.com  Thu Nov 27 07:18:40 2025
From: pmladek at suse.com (Petr Mladek)
Date: Thu, 27 Nov 2025 16:18:40 +0100
Subject: [PATCH v2 0/4] printk cleanup - part 2
In-Reply-To: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
Message-ID: <aShr0DZRmpDnL0nz@pathway.suse.cz>

On Fri 2025-11-21 15:50:32, Marcos Paulo de Souza wrote:
> The first part can be found here[1]. The proposed changes do not
> change the functionality of printk, but were suggestions made by
> Petr Mladek. I already have more patches for a part 3, but I would like
> to see these ones merged first.
> 
> I did the testing with VMs, checking suspend and resume cycles, and it worked
> as expected.
> 
> Thanks for reviewing!

> Marcos Paulo de Souza (4):
>       drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT
>       arch: um: kmsg_dump: Use console_is_usable
>       printk: Use console_is_usable on console_unblank

These three patches were simple, straightforward, and ready for linux
next.

I have committed them into printk/linux.git, branch rework/nbcon-in-kdb.
I am going to push them for 6.19.

>       printk: Make console_{suspend,resume} handle CON_SUSPENDED

This patch still needs some love and a v3.

Best Regards,
Petr


From bhe at redhat.com  Thu Nov 27 19:33:19 2025
From: bhe at redhat.com (Baoquan He)
Date: Fri, 28 Nov 2025 11:33:19 +0800
Subject: [PATCH v4 11/12] arch/um: don't initialize kasan if it's disabled
In-Reply-To: <20251128033320.1349620-1-bhe@redhat.com>
References: <20251128033320.1349620-1-bhe@redhat.com>
Message-ID: <20251128033320.1349620-12-bhe@redhat.com>

And also do the kasan_arg_disabled check before kasan_flag_enabled
is enabled, to make sure the kernel parameter kasan=on|off has been parsed.

Signed-off-by: Baoquan He <bhe at redhat.com>
Cc: linux-um at lists.infradead.org
---
 arch/um/kernel/mem.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 39c4a7e21c6f..08cd012a6bb8 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -62,8 +62,11 @@ static unsigned long brk_end;
 
 void __init arch_mm_preinit(void)
 {
+#ifdef CONFIG_KASAN
 	/* Safe to call after jump_label_init(). Enables KASAN. */
-	kasan_init_generic();
+	if (!kasan_arg_disabled)
+		kasan_init_generic();
+#endif
 
 	/* clear the zero-page */
 	memset(empty_zero_page, 0, PAGE_SIZE);
-- 
2.41.0


From daniel at riscstar.com  Fri Nov 28 01:52:24 2025
From: daniel at riscstar.com (Daniel Thompson)
Date: Fri, 28 Nov 2025 09:52:24 +0000
Subject: [PATCH v2 0/4] printk cleanup - part 2
In-Reply-To: <aShr0DZRmpDnL0nz@pathway.suse.cz>
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
 <aShr0DZRmpDnL0nz@pathway.suse.cz>
Message-ID: <aSlw2AHo_AWzjH-s@aspen.lan>

On Thu, Nov 27, 2025 at 04:18:40PM +0100, Petr Mladek wrote:
> On Fri 2025-11-21 15:50:32, Marcos Paulo de Souza wrote:
> > The first part can be found here[1]. The proposed changes do not
> > change the functionality of printk, but were suggestions made by
> > Petr Mladek. I already have more patches for a part 3 ,but I would like
> > to see these ones merged first.
> >
> > I did the testing with VMs, checking suspend and resume cycles, and it worked
> > as expected.
> >
> > Thanks for reviewing!
>
> > Marcos Paulo de Souza (4):
> >       drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT
> >       arch: um: kmsg_dump: Use console_is_usable
> >       printk: Use console_is_usable on console_unblank
>
> These three patches were simple, straightforward, and ready for
> linux-next.
>
> I have committed them into printk/linux.git, branch rework/nbcon-in-kdb.
> I am going to push them for 6.19.

I pointed the kgdb test suite at this branch (as I did for the earlier
part of the patchset, although I think I forgot to post about it).

The console coverage is fairly modest (I think just 8250 and PL011
drivers, with and without earlycon) and the suite exercises features
rather than crash resilience. Nevertheless and FWIW, the tests didn't
pick up any regressions. Yay!


Daniel.


From pmladek at suse.com  Fri Nov 28 04:31:01 2025
From: pmladek at suse.com (Petr Mladek)
Date: Fri, 28 Nov 2025 13:31:01 +0100
Subject: [PATCH v2 0/4] printk cleanup - part 2
In-Reply-To: <aSlw2AHo_AWzjH-s@aspen.lan>
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
 <aShr0DZRmpDnL0nz@pathway.suse.cz>
 <aSlw2AHo_AWzjH-s@aspen.lan>
Message-ID: <aSmWBYRmH_BNM4kg@pathway.suse.cz>

On Fri 2025-11-28 09:52:24, Daniel Thompson wrote:
> On Thu, Nov 27, 2025 at 04:18:40PM +0100, Petr Mladek wrote:
> > On Fri 2025-11-21 15:50:32, Marcos Paulo de Souza wrote:
> > > The first part can be found here[1]. The proposed changes do not
> > > change the functionality of printk, but were suggestions made by
> > > Petr Mladek. I already have more patches for a part 3 ,but I would like
> > > to see these ones merged first.
> > >
> > > I did the testing with VMs, checking suspend and resume cycles, and it worked
> > > as expected.
> > >
> > > Thanks for reviewing!
> >
> > > Marcos Paulo de Souza (4):
> > >       drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT
> > >       arch: um: kmsg_dump: Use console_is_usable
> > >       printk: Use console_is_usable on console_unblank
> >
> > These three patches were simple, straightforward, and ready for
> > linux-next.
> >
> > I have committed them into printk/linux.git, branch rework/nbcon-in-kdb.
> > I am going to push them for 6.19.
> 
> I pointed the kgdb test suite at this branch (as I did for the earlier
> part of the patchset, although I think I forgot to post about it).
> 
> The console coverage is fairly modest (I think just 8250 and PL011
> drivers, with and without earlycon) and the suite exercises features
> rather than crash resilience. Nevertheless and FWIW, the tests didn't
> pick up any regressions. Yay!

Great news! Thanks a lot for doing the test and sharing results.

Best Regards,
Petr


From thehajime at gmail.com  Fri Nov 28 04:57:55 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Fri, 28 Nov 2025 21:57:55 +0900
Subject: [PATCH v13 00/13] nommu UML
In-Reply-To: <defcec3945fbc37e90070b030bf1596b11b6d926.camel@sipsolutions.net>
References: <cover.1762588860.git.thehajime@gmail.com>
	<aRGs8lPjH22NqMZc@infradead.org>
	<m2framxerm.wl-thehajime@gmail.com>
	<0a84c16f862026c82271c0adbc91d98b812a78b4.camel@sipsolutions.net>
	<m2bjl7y6mv.wl-thehajime@gmail.com>
	<defcec3945fbc37e90070b030bf1596b11b6d926.camel@sipsolutions.net>
Message-ID: <m2y0nqwbzg.wl-thehajime@gmail.com>


On Tue, 25 Nov 2025 18:58:53 +0900,
Johannes Berg wrote:
> 
> On Wed, 2025-11-12 at 17:52 +0900, Hajime Tazaki wrote:
> > > >   What is it for ?
> > > >   ================
> > > >   
> > > >   - Alleviate syscall hook overhead implemented with ptrace(2)
> > > >   - To exercises nommu code over UML (and over KUnit)
> > > >   - Less dependency to host facilities
> > > 
> > > FWIW, in some way, this order of priorities is exactly why this hasn't
> > > been going anywhere, and every time I looked at it I got somewhat
> > > annoyed by what seems to me like choices made to support especially the
> > > first bullet.
> > 
> > over the past versions, I've been emphasized that the 2nd bullet (testing)
> > is the primary usecase as I saw several actually cases from mm folks,
> > 
> > https://lists.infradead.org/pipermail/maple-tree/2024-November/003775.html
> > https://lore.kernel.org/all/cb1cf0be-871d-4982-9a1b-5fdd54deec8d at lucifer.local/
> > 
> > and I think this is not limited to mm code.
> 
> Not sure there's much value in testing much else in no-MMU, but sure,
> I'll give you that it's useful for testing.

looking at the tree:

% global -xr CONFIG_MMU | grep ifndef  | grep -v -E "arch/|mm/" | wc -l
45

this is only a rough count, but it shows that there are places to be
tested outside the mm codebase.

> > other 2 bullets are additional benefits which we observed in a
> > comment, and our experience.
> 
> But are they really _worthwhile_ benefits? A lot of this design adds
> additional complexity, and it doesn't really seem necessary for the
> testing use case. Making it faster is nice, but it's not like the
> speedup really is 20x for arbitrary tests, that's just for corner cases
> like "sit in a loop of gettimeofday()". And for kunit there's no syscall
> boundary at all, so there's no speedup.

I agree, and as I said, the reason for taking the single-host-process
approach is the speed and simplicity gained by removing the interaction
between host processes.

I have never claimed that tests should execute fast, and I agree that
kunit doesn't benefit from the speedup as there is no syscall (unless
the kunit-uapi patch gets merged).

> > > I suspect that the first and third bullet are not even really true any
> > > more, since you moved to seccomp (per our request), yet I think design
> > > choices influenced by them persist.
> > 
> > this observation is not true; the first bullet is still true even
> > using seccomp.  please look at the benchmark result in the patch
> > [12/13], quoted below.
> 
> > [snip]
> 
> So thanks for the correction. If that's the case, however, it means the
> speedup can't be due to the syscall boundary itself (seccomp) but must
> rather be due to some pagefault/mapping handling issue? Which would be
> inherent in no-MMU, even taking an approach of using two host processes
> rather than embedding everything into one.

I'll explain this later in this email.

# nommu has no page faults, as there are only physical addresses.

> > > However, I'm not yet convinced that all of the complexities presented in
> > > this patchset (such as completely separate seccomp implementation) are
> > > actually necessary in support of _just_ the second bullet. These seem to
> > > me like design choices necessary to support the _first_ bullet [1].
> > 
> > separate seccomp implementation is indeed needed due to the design
> > choice we made, to use a single process to host a (um) userspace.
> 
> That sounds misleading or even wrong to me, I'd say it's due to putting
> the (um) userspace in the same host process as the kernel space?

not sure if this is different from my explanation...

> > I don't see why you see this as a _complexity_, as functionally both
> > seccomp handling don't interfere each other.
> 
> The complexity isn't so much in the separate code, which is a small
> factor, but in the "put everything into the same process" aspect of it.
> That has consequences around the host context state handling, things we
> didn't really need to consider before suddenly become crucially
> important. In the current (with-MMU) design, we only need to worry about
> being able to correctly switch between userspace tasks/threads within a
> userspace mm (host) process. With the no-MMU design you propose, we also
> need to be able to correctly switch between kernel and userspace tasks
> within the same single (host) process.
> 
> I think this is a pretty significant difference, and saying "there's no
> complexity here" is simply pretending it isn't a relevant difference. I
> believe you're not even handling this correctly right now in this patch
> set, specifically wrt. the GS register which has been pointed out
> before, but I wouldn't say that I even have a complete picture in my
> head over what state handling would be necessary and sufficient.
> 
> So yeah, I think this warrants taking another look as to whether or not
> the approach of putting everything into the same host process is even
> worth it. I tend to believe that it isn't, given the use cases. And if
> you say the speedup still is with seccomp, that kills the speed argument
> too.

I understand your concern on complexity, thanks for the detail.

the host context state handling is indeed a new thing. we've only
verified a limited set of code paths, with basic operation of um +
drivers and some userspace programs.  this is probably not perfect at
the moment, but it can be improved.

> > > I've thought about what would happen if we stuck to creating a (single)
> > > separate process on the host to execute userspace, and just used
> > > CLONE_VM for it. That way, it's still no-MMU with full memory access,
> > > but there's some implicit isolation between the kernel and userspace
> > > processes which will likely remove complexities around FP/SSE/AVX
> > > handling, may completely remove the need for a separate seccomp
> > > implementation, etc.
> > 
> > this would be doable I think, but we went the different way, as
> > using separate host processes (with ptrace/seccomp) is slow and add
> > complexity by the synchronization between processes, which we think
> > it's not easy to maintain in the future.
> 
> Which one is it then, slow or not? Not sure I follow. You just said you
> do have seccomp when comparing speeds, so that in itself doesn't make it
> slow. What synchronization? It'd (have to) be CLONE_VM, but that
> actually _simplifies_ state transfer/synchronization, and we already
> have (to have) state transfer between different userspace threads in the
> same host process for the with-MMU case.

Since I included speed characteristics in the document, I should
explain their impact in more detail, compared to the existing
design/implementation of uml.

many documents and articles say uml is slow (the in-tree uml document
also mentions it briefly), but I could not find a detailed analysis, so
I looked closely at how nommu (w/ seccomp) and mmu (w/ seccomp) behave.

suppose we have a userspace program running under uml (on seccomp-mmu
and on seccomp-nommu).


	struct timespec ts1, ts2;
	clock_gettime(CLOCK_REALTIME, &ts1);  // 1)
	getpid();                             // 2)
	clock_gettime(CLOCK_REALTIME, &ts2);  // 3)

# this is a chunk from the benchmark program used in the document.

I then collected several events (sched_switch, signal_generate, and
sys_enter_futex) via ftrace.

below is the output of `trace-cmd report`, focusing on the 3 SIGSYS
(sig=31) signals generated by the above code.

- ftrace seccomp-mmu, 2)-3) = 11 usec
 uml-userspace-3092637 [002] 1749286.670199: signal_generate:      sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0    => 1)
 uml-userspace-3092637 [002] 1749286.670200: sys_enter_futex:      op=FUTEX_WAKE uaddr=0x7fffffffdf8c val=1
 uml-userspace-3092637 [002] 1749286.670201: sys_enter_futex:      op=FUTEX_WAIT uaddr=0x7fffffffdf8c val=0x00000001 utime=0x00000000
 uml-userspace-3092637 [002] 1749286.670202: sched_switch:         uml-userspace:3092637 [120] S ==> swapper/2:0 [120]
          <idle>-0     [028] 1749286.670203: sched_switch:         swapper/28:0 [120] R ==> vmlinux:3092631 [120]
       vmlinux-3092631 [028] 1749286.670205: sys_enter_futex:      op=FUTEX_WAKE uaddr=0x60b64f8c val=1
       vmlinux-3092631 [028] 1749286.670206: sys_enter_futex:      op=FUTEX_WAIT uaddr=0x60b64f8c val=0x00000000 utime=0x00000000
       vmlinux-3092631 [028] 1749286.670207: sched_switch:         vmlinux:3092631 [120] S ==> swapper/28:0 [120]
          <idle>-0     [002] 1749286.670209: sched_switch:         swapper/2:0 [120] R ==> uml-userspace:3092637 [120]
 uml-userspace-3092637 [002] 1749286.670211: signal_generate:      sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0    => 2)
 uml-userspace-3092637 [002] 1749286.670212: sys_enter_futex:      op=FUTEX_WAKE uaddr=0x7fffffffdf8c val=1
 uml-userspace-3092637 [002] 1749286.670213: sys_enter_futex:      op=FUTEX_WAIT uaddr=0x7fffffffdf8c val=0x00000001 utime=0x00000000
 uml-userspace-3092637 [002] 1749286.670214: sched_switch:         uml-userspace:3092637 [120] S ==> swapper/2:0 [120]
          <idle>-0     [028] 1749286.670215: sched_switch:         swapper/28:0 [120] R ==> vmlinux:3092631 [120]
       vmlinux-3092631 [028] 1749286.670216: sys_enter_futex:      op=FUTEX_WAKE uaddr=0x60b64f8c val=1
       vmlinux-3092631 [028] 1749286.670217: sys_enter_futex:      op=FUTEX_WAIT uaddr=0x60b64f8c val=0x00000000 utime=0x00000000
       vmlinux-3092631 [028] 1749286.670218: sched_switch:         vmlinux:3092631 [120] S ==> swapper/28:0 [120]
          <idle>-0     [002] 1749286.670220: sched_switch:         swapper/2:0 [120] R ==> uml-userspace:3092637 [120]
 uml-userspace-3092637 [002] 1749286.670222: signal_generate:      sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0    => 3)


- ftrace seccomp-nommu, 2)-3) =  3 usec
       vmlinux-3092542 [006] 1749158.829292: signal_generate:      sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0    => 1)
       vmlinux-3092542 [006] 1749158.829294: signal_generate:      sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0    => 2)
       vmlinux-3092542 [006] 1749158.829297: signal_generate:      sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0    => 3)

with seccomp-mmu, the host process for userspace (uml-userspace) is
notified with SIGSYS (sig=31) upon a syscall from userspace, the host
switches tasks to vmlinux (the um kernel) through the futex wake/wait
synchronization (this is what I meant by synchronization in my previous
email), and then switches back to uml-userspace to continue the
userspace process.

so, there are at least 4 host sched_switch-es per single um syscall.

with the current nommu using a single host process, the notification
via SIGSYS is the same as with seccomp-mmu, but after that there is no
context switch upon a syscall issued by userspace; execution stays in
the same context until the next syscall.

a nommu implementation with CLONE_VM (btw, the host process
uml-userspace is already created with the CLONE_VM flag IIUC) might face
a situation similar to seccomp-mmu, seeing the same switches between
processes.

this accounts for the difference in the getpid benchmark results, where
um-mmu (seccomp) takes roughly 10x longer than um-nommu (seccomp)
(26.242 vs. 2.599 usec) (this was described as a benchmark example in
the patchset).

I didn't look at the ptrace mode of MMU, but I expect to see a similar
(or longer) duration for a single syscall.


in addition to the ftrace measurement above, I ran more practical
benchmarks with iperf3 (forward/reverse path) and netperf
(TCP_STREAM/MAERTS), which I believe aren't corner cases; the results
are below.

all runs use the vector driver with gro on, via host tap devices.
the iperf3/netperf server runs on the host and the client runs inside uml.

# I can give a complete script to reproduce this if needed.


- iperf3 (Mbps)
                   um-mmu(seccomp)   um-nommu(seccomp)
------------------------------------------------------
iperf3(f)                 7984              13152
iperf3(r)                 8009              14363

- netperf (Mbps, bufsize=65507bytes)
                   um-mmu(seccomp)   um-nommu(seccomp)
------------------------------------------------------
netperf(STREAM)        5912.93           10792.02
netperf(MAERTS)       29263.53           33970.06


the difference is not as significant as what we saw with the simple
getpid(2) syscall benchmark, but there is still a visible impact.

I would say these results only cover part of what UML can do, and
different workloads may show different results, but they are still
worth presenting as one of the benefits, to show the nature of the
feature (what the single-process design can do).

Of course, nommu comes with various limitations, as I described in
the document; for example, applications should be aware that the kernel
is nommu (i.e., they need to use vfork, PIE binaries, etc).  So
traditional uml is more generic and has broader usage, but given this
speed characteristic of nommu, I think it is worthwhile and users who
need speed will benefit from it.

I hope this clarifies a bit.

-- Hajime


From mpdesouza at suse.com  Fri Nov 28 04:59:25 2025
From: mpdesouza at suse.com (Marcos Paulo de Souza)
Date: Fri, 28 Nov 2025 09:59:25 -0300
Subject: [PATCH v2 0/4] printk cleanup - part 2
In-Reply-To: <aSlw2AHo_AWzjH-s@aspen.lan>
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
	 <aShr0DZRmpDnL0nz@pathway.suse.cz> <aSlw2AHo_AWzjH-s@aspen.lan>
Message-ID: <aa78f0418ca2f408cdc31b478963ddeab797faa7.camel@suse.com>

On Fri, 2025-11-28 at 09:52 +0000, Daniel Thompson wrote:
> On Thu, Nov 27, 2025 at 04:18:40PM +0100, Petr Mladek wrote:
> > On Fri 2025-11-21 15:50:32, Marcos Paulo de Souza wrote:
> > > The first part can be found here[1]. The proposed changes do not
> > > change the functionality of printk, but were suggestions made by
> > > Petr Mladek. I already have more patches for a part 3 ,but I
> > > would like
> > > to see these ones merged first.
> > > 
> > > I did the testing with VMs, checking suspend and resume cycles,
> > > and it worked
> > > as expected.
> > > 
> > > Thanks for reviewing!
> > 
> > > Marcos Paulo de Souza (4):
> > > >       drivers: serial: kgdboc: Drop checks for CON_ENABLED and
> > > > CON_BOOT
> > > >       arch: um: kmsg_dump: Use console_is_usable
> > > >       printk: Use console_is_usable on console_unblank
> > 
> > > These three patches were simple, straightforward, and ready for
> > > linux-next.
> > > 
> > > I have committed them into printk/linux.git, branch
> > > rework/nbcon-in-kdb.
> > > I am going to push them for 6.19.
> 
> I pointed the kgdb test suite at this branch (as I did for the
> earlier
> part of the patchset, although I think I forgot to post about it).
> 
> The console coverage is fairly modest (I think just 8250 and PL011
> drivers, with and without earlycon) and the suite exercises features
> rather than crash resilience. Nevertheless and FWIW, the tests didn't
> pick up any regressions. Yay!

Thanks Daniel! I remember you said that you would run the test suite on
the previous patchset, but I didn't want to bother you by asking (I
believe that if you had found anything, you would have pointed it out
either way :) ).

> 
> 
> Daniel.


From chleroy at kernel.org  Sat Nov 29 01:56:02 2025
From: chleroy at kernel.org (Christophe Leroy (CS GROUP))
Date: Sat, 29 Nov 2025 10:56:02 +0100
Subject: [PATCH] um: Disable KASAN_INLINE when STATIC_LINK is selected
Message-ID: <2620ab0bbba640b6237c50b9c0dca1c7d1142f5d.1764410067.git.chleroy@kernel.org>

um doesn't support KASAN_INLINE together with STATIC_LINK.

Instead of failing the build, disable KASAN_INLINE when
STATIC_LINK is selected.

Reported-by: kernel test robot <lkp at intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202511290451.x9GZVJ1l-lkp at intel.com/
Fixes: 1e338f4d99e6 ("kasan: introduce ARCH_DEFER_KASAN and unify static key across modes")
Signed-off-by: Christophe Leroy (CS GROUP) <chleroy at kernel.org>
---
 arch/um/Kconfig             | 1 +
 arch/um/include/asm/kasan.h | 4 ----
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 49781bee7905..93ed850d508e 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -5,6 +5,7 @@ menu "UML-specific options"
 config UML
 	bool
 	default y
+	select ARCH_DISABLE_KASAN_INLINE if STATIC_LINK
 	select ARCH_NEEDS_DEFER_KASAN if STATIC_LINK
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select ARCH_HAS_CACHE_LINE_SIZE
diff --git a/arch/um/include/asm/kasan.h b/arch/um/include/asm/kasan.h
index b54a4e937fd1..81bcdc0f962e 100644
--- a/arch/um/include/asm/kasan.h
+++ b/arch/um/include/asm/kasan.h
@@ -24,10 +24,6 @@
 
 #ifdef CONFIG_KASAN
 void kasan_init(void);
-
-#if defined(CONFIG_STATIC_LINK) && defined(CONFIG_KASAN_INLINE)
-#error UML does not work in KASAN_INLINE mode with STATIC_LINK enabled!
-#endif
 #else
 static inline void kasan_init(void) { }
 #endif /* CONFIG_KASAN */
-- 
2.49.0