From thehajime at gmail.com Sun Nov 2 01:49:25 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sun, 2 Nov 2025 18:49:25 +0900 Subject: [PATCH v12 00/13] nommu UML Message-ID: This patchset is another spin of the nommu mode addition to UML. It would be nice to hear your opinions on it. There are still several limitations/issues that we have already found; here is the list. - memory mapped by loadable modules is not distinguished from userspace memory. - CONFIG_SMP is disabled as host_fs handling doesn't work with thread local storage. -- Hajime v12: - rebase with the latest uml/next branch - disable SMP and tls as those don't work with host_fs handling ([11/13]) v11: - clean up userspace return routine and integrate to userspace() ([04/13]) - fix direction flag issue on using nolibc memcpy ([04/13]) - fix a crash issue when using usermode helper ([06/13]) - test with out-of-tree kunit-uapi patches (which use umh) - https://lore.kernel.org/all/20250626-kunit-kselftests-v4-0-48760534fef5 at linutronix.de/ - https://lore.kernel.org/all/20250626195714.2123694-3-benjamin at sipsolutions.net/ - https://lore.kernel.org/all/cover.1758181109.git.thehajime at gmail.com/ v10: - fix wrong comment on gs register handling ([09/13]) - remove unnecessary code of early syscall implementation ([04/13]) - https://lore.kernel.org/all/cover.1750594487.git.thehajime at gmail.com/ v9: - rebase with the latest uml/next branch - add performance numbers of new SECCOMP mode, and update results ([12/13]) - add a workaround for upstream change on MMU dependency to PCI drivers ([10/13]) - https://lore.kernel.org/all/cover.1750294482.git.thehajime at gmail.com/ v8: - rebase with the latest uml/next branch - clean up segv_handler to align with the latest uml ([9/12]) - https://lore.kernel.org/all/cover.1745980082.git.thehajime at gmail.com/ v7: - properly handle FP register upon signal delivery [10/13] - update benchmark result with new FP register handling [12/13] - fix 
arch_has_single_step() for !MMU case [07/13] - revert stack alignment as it is in uml/fixes tree [10/13] - https://lore.kernel.org/all/cover.1737348399.git.thehajime at gmail.com/ v6: - rebase to the latest uml/next tree - more clean up on mmu/nommu for signal handling [10/13] - rename functions of mcontext routines [06,10/13] - added Acked-by tag for binfmt_elf_fdpic [02/13] - https://lore.kernel.org/linux-um/cover.1736853925.git.thehajime at gmail.com/ v5: - clean up stack manipulation code [05,06,07,10/13] - https://lore.kernel.org/linux-um/cover.1733998168.git.thehajime at gmail.com/ v4: - add arch/um/nommu, arch/x86/um/nommu to contain !MMU specific codes - remove zpoline patch - drop binfmt_elf_fdpic patch - reduce ifndef CONFIG_MMU if possible - split to elf header cleanup patch [01/13] - fix kernel test robot warnings [06/13] - fix coding styles [07/13] - move task_top_of_stack definition [05/13] - https://lore.kernel.org/linux-um/cover.1733652929.git.thehajime at gmail.com/ v3: - https://lore.kernel.org/linux-um/cover.1733199769.git.thehajime at gmail.com/ - add seccomp-based syscall hook in addition to zpoline [06/13] - remove RFC, add a line to MAINTAINERS file - fix kernel test robot warnings [02/13,08/13,10/13] - add base-commit tag to cover letter - pull the latest uml/next - clean up SIGSEGV handling [10/13] - detect fsgsbase availability with elf aux vector [08/13] - simplify vdso code with macros [09/13] RFC v2: - https://lore.kernel.org/linux-um/cover.1731290567.git.thehajime at gmail.com/ - base branch is now uml/linux.git instead of torvalds/linux.git. 
- reorganize the patch series to clean up - fixed various coding styles issues - clean up exec code path [07/13] - fixed the crash/SIGSEGV case on userspace programs [10/13] - add seccomp filter to limit syscall caller address [06/13] - detect fsgsbase availability with sigsetjmp/siglongjmp [08/13] - removes unrelated changes - removes unneeded ifndef CONFIG_MMU - convert UML_CONFIG_MMU to CONFIG_MMU as using uml/linux.git - proposed a patch of maple-tree issue (resolving a limitation in RFC v1) https://lore.kernel.org/linux-mm/20241108222834.3625217-1-thehajime at gmail.com/ RFC: - https://lore.kernel.org/linux-um/cover.1729770373.git.thehajime at gmail.com/ Hajime Tazaki (13): x86/um: nommu: elf loader for fdpic um: decouple MMU specific code from the common part um: nommu: memory handling x86/um: nommu: syscall handling um: nommu: seccomp syscalls hook x86/um: nommu: process/thread handling um: nommu: configure fs register on host syscall invocation x86/um/vdso: nommu: vdso memory update x86/um: nommu: signal handling um: change machine name for uname output um: nommu: disable SMP on nommu UML um: nommu: add documentation of nommu UML um: nommu: plug nommu code into build system Documentation/virt/uml/nommu-uml.rst | 180 ++++++++++++++++++++++ MAINTAINERS | 1 + arch/um/Kconfig | 14 +- arch/um/Makefile | 10 ++ arch/um/configs/x86_64_nommu_defconfig | 54 +++++++ arch/um/include/asm/futex.h | 4 + arch/um/include/asm/mmu.h | 8 + arch/um/include/asm/mmu_context.h | 2 + arch/um/include/asm/ptrace-generic.h | 8 +- arch/um/include/asm/uaccess.h | 7 +- arch/um/include/shared/kern_util.h | 6 + arch/um/include/shared/os.h | 16 ++ arch/um/kernel/Makefile | 5 +- arch/um/kernel/mem-pgtable.c | 55 +++++++ arch/um/kernel/mem.c | 38 +---- arch/um/kernel/process.c | 38 +++++ arch/um/kernel/skas/process.c | 37 ----- arch/um/kernel/um_arch.c | 3 + arch/um/nommu/Makefile | 3 + arch/um/nommu/os-Linux/Makefile | 7 + arch/um/nommu/os-Linux/seccomp.c | 87 +++++++++++ 
arch/um/nommu/os-Linux/signal.c | 24 +++ arch/um/nommu/trap.c | 201 +++++++++++++++++++++++++ arch/um/os-Linux/Makefile | 3 +- arch/um/os-Linux/internal.h | 8 + arch/um/os-Linux/mem.c | 4 + arch/um/os-Linux/process.c | 139 ++++++++++++++++- arch/um/os-Linux/signal.c | 11 +- arch/um/os-Linux/skas/process.c | 127 ---------------- arch/um/os-Linux/start_up.c | 25 ++- arch/um/os-Linux/util.c | 3 +- arch/x86/um/Kconfig | 2 +- arch/x86/um/Makefile | 7 +- arch/x86/um/asm/elf.h | 8 +- arch/x86/um/asm/syscall.h | 6 + arch/x86/um/nommu/Makefile | 8 + arch/x86/um/nommu/do_syscall_64.c | 75 +++++++++ arch/x86/um/nommu/entry_64.S | 114 ++++++++++++++ arch/x86/um/nommu/os-Linux/Makefile | 6 + arch/x86/um/nommu/os-Linux/mcontext.c | 26 ++++ arch/x86/um/nommu/syscalls.h | 18 +++ arch/x86/um/nommu/syscalls_64.c | 121 +++++++++++++++ arch/x86/um/shared/sysdep/mcontext.h | 5 + arch/x86/um/shared/sysdep/ptrace.h | 2 +- arch/x86/um/vdso/vma.c | 17 ++- fs/Kconfig.binfmt | 2 +- 46 files changed, 1322 insertions(+), 223 deletions(-) create mode 100644 Documentation/virt/uml/nommu-uml.rst create mode 100644 arch/um/configs/x86_64_nommu_defconfig create mode 100644 arch/um/kernel/mem-pgtable.c create mode 100644 arch/um/nommu/Makefile create mode 100644 arch/um/nommu/os-Linux/Makefile create mode 100644 arch/um/nommu/os-Linux/seccomp.c create mode 100644 arch/um/nommu/os-Linux/signal.c create mode 100644 arch/um/nommu/trap.c create mode 100644 arch/x86/um/nommu/Makefile create mode 100644 arch/x86/um/nommu/do_syscall_64.c create mode 100644 arch/x86/um/nommu/entry_64.S create mode 100644 arch/x86/um/nommu/os-Linux/Makefile create mode 100644 arch/x86/um/nommu/os-Linux/mcontext.c create mode 100644 arch/x86/um/nommu/syscalls.h create mode 100644 arch/x86/um/nommu/syscalls_64.c base-commit: 8e03c195cc4d82100291500f772f85c686653748 -- 2.43.0 From thehajime at gmail.com Sun Nov 2 01:49:26 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sun, 2 Nov 2025 18:49:26 +0900 Subject: [PATCH v12 
01/13] x86/um: nommu: elf loader for fdpic In-Reply-To: References: Message-ID: As UML now supports the CONFIG_MMU=n case, it has to use an alternate ELF loader, the FDPIC ELF loader. This commit adds the necessary definitions to the arch, as UML has not used this loader so far. It also updates the Kconfig file to use BINFMT_ELF_FDPIC under the !MMU environment. Cc: Eric Biederman Cc: Kees Cook Cc: Alexander Viro Cc: Christian Brauner Cc: Jan Kara Cc: linux-mm at kvack.org Cc: linux-fsdevel at vger.kernel.org Acked-by: Kees Cook Signed-off-by: Hajime Tazaki Signed-off-by: Ricardo Koller --- arch/um/include/asm/mmu.h | 5 +++++ arch/um/include/asm/ptrace-generic.h | 6 ++++++ arch/x86/um/asm/elf.h | 8 ++++++-- fs/Kconfig.binfmt | 2 +- 4 files changed, 18 insertions(+), 3 deletions(-) diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h index 07d48738b402..82a919132aff 100644 --- a/arch/um/include/asm/mmu.h +++ b/arch/um/include/asm/mmu.h @@ -21,6 +21,11 @@ typedef struct mm_context { spinlock_t sync_tlb_lock; unsigned long sync_tlb_range_from; unsigned long sync_tlb_range_to; + +#ifdef CONFIG_BINFMT_ELF_FDPIC + unsigned long exec_fdpic_loadmap; + unsigned long interp_fdpic_loadmap; +#endif } mm_context_t; #define INIT_MM_CONTEXT(mm) \ diff --git a/arch/um/include/asm/ptrace-generic.h b/arch/um/include/asm/ptrace-generic.h index 86d74f9d33cf..62e9916078ec 100644 --- a/arch/um/include/asm/ptrace-generic.h +++ b/arch/um/include/asm/ptrace-generic.h @@ -29,6 +29,12 @@ struct pt_regs { #define PTRACE_OLDSETOPTIONS 21 +#ifdef CONFIG_BINFMT_ELF_FDPIC +#define PTRACE_GETFDPIC 31 +#define PTRACE_GETFDPIC_EXEC 0 +#define PTRACE_GETFDPIC_INTERP 1 +#endif + struct task_struct; extern long subarch_ptrace(struct task_struct *child, long request, diff --git a/arch/x86/um/asm/elf.h b/arch/x86/um/asm/elf.h index 62ed5d68a978..33f69f1eac10 100644 --- a/arch/x86/um/asm/elf.h +++ b/arch/x86/um/asm/elf.h @@ -9,6 +9,7 @@ #include #define CORE_DUMP_USE_REGSET +#define ELF_FDPIC_CORE_EFLAGS 0 #ifdef 
CONFIG_X86_32 @@ -190,8 +191,11 @@ extern int arch_setup_additional_pages(struct linux_binprm *bprm, extern unsigned long um_vdso_addr; #define AT_SYSINFO_EHDR 33 -#define ARCH_DLINFO NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr) - +#define ARCH_DLINFO \ +do { \ + NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr); \ + NEW_AUX_ENT(AT_MINSIGSTKSZ, 0); \ +} while (0) #endif typedef unsigned long elf_greg_t; diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt index 1949e25c7741..0a92bebd5f75 100644 --- a/fs/Kconfig.binfmt +++ b/fs/Kconfig.binfmt @@ -58,7 +58,7 @@ config ARCH_USE_GNU_PROPERTY config BINFMT_ELF_FDPIC bool "Kernel support for FDPIC ELF binaries" default y if !BINFMT_ELF - depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU) + depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU) select ELFCORE help ELF FDPIC binaries are based on ELF, but allow the individual load -- 2.43.0 From thehajime at gmail.com Sun Nov 2 01:49:27 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sun, 2 Nov 2025 18:49:27 +0900 Subject: [PATCH v12 02/13] um: decouple MMU specific code from the common part In-Reply-To: References: Message-ID: <08489faaad68a17037e1f24b2a39d8fc3b021c61.1762075876.git.thehajime@gmail.com> This splits the memory- and process-related code into common and MMU-specific parts in order to avoid ifdefs in .c files and duplication between MMU and !MMU. 
Signed-off-by: Hajime Tazaki --- arch/um/kernel/Makefile | 5 +- arch/um/kernel/mem-pgtable.c | 55 ++++++++++++++ arch/um/kernel/mem.c | 35 --------- arch/um/kernel/process.c | 38 ++++++++++ arch/um/kernel/skas/process.c | 37 --------- arch/um/os-Linux/Makefile | 3 +- arch/um/os-Linux/process.c | 129 ++++++++++++++++++++++++++++++++ arch/um/os-Linux/skas/process.c | 127 ------------------------------- 8 files changed, 227 insertions(+), 202 deletions(-) create mode 100644 arch/um/kernel/mem-pgtable.c diff --git a/arch/um/kernel/Makefile b/arch/um/kernel/Makefile index be60bc451b3f..76d36751973e 100644 --- a/arch/um/kernel/Makefile +++ b/arch/um/kernel/Makefile @@ -16,9 +16,10 @@ always-$(KBUILD_BUILTIN) := vmlinux.lds obj-y = config.o exec.o exitcode.o irq.o ksyms.o mem.o \ physmem.o process.o ptrace.o reboot.o sigio.o \ - signal.o sysrq.o time.o tlb.o trap.o \ - um_arch.o umid.o kmsg_dump.o capflags.o skas/ + signal.o sysrq.o time.o \ + um_arch.o umid.o kmsg_dump.o capflags.o obj-y += load_file.o +obj-$(CONFIG_MMU) += mem-pgtable.o tlb.o trap.o skas/ obj-$(CONFIG_BLK_DEV_INITRD) += initrd.o obj-$(CONFIG_GPROF) += gprof_syms.o diff --git a/arch/um/kernel/mem-pgtable.c b/arch/um/kernel/mem-pgtable.c new file mode 100644 index 000000000000..549da1d3bff0 --- /dev/null +++ b/arch/um/kernel/mem-pgtable.c @@ -0,0 +1,55 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + + +/* Allocate and free page tables. 
*/ + +pgd_t *pgd_alloc(struct mm_struct *mm) +{ + pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL); + + if (pgd) { + memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t)); + memcpy(pgd + USER_PTRS_PER_PGD, + swapper_pg_dir + USER_PTRS_PER_PGD, + (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t)); + } + return pgd; +} + +static const pgprot_t protection_map[16] = { + [VM_NONE] = PAGE_NONE, + [VM_READ] = PAGE_READONLY, + [VM_WRITE] = PAGE_COPY, + [VM_WRITE | VM_READ] = PAGE_COPY, + [VM_EXEC] = PAGE_READONLY, + [VM_EXEC | VM_READ] = PAGE_READONLY, + [VM_EXEC | VM_WRITE] = PAGE_COPY, + [VM_EXEC | VM_WRITE | VM_READ] = PAGE_COPY, + [VM_SHARED] = PAGE_NONE, + [VM_SHARED | VM_READ] = PAGE_READONLY, + [VM_SHARED | VM_WRITE] = PAGE_SHARED, + [VM_SHARED | VM_WRITE | VM_READ] = PAGE_SHARED, + [VM_SHARED | VM_EXEC] = PAGE_READONLY, + [VM_SHARED | VM_EXEC | VM_READ] = PAGE_READONLY, + [VM_SHARED | VM_EXEC | VM_WRITE] = PAGE_SHARED, + [VM_SHARED | VM_EXEC | VM_WRITE | VM_READ] = PAGE_SHARED +}; +DECLARE_VM_GET_PAGE_PROT diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c index dc938715ec9d..52cd906e3896 100644 --- a/arch/um/kernel/mem.c +++ b/arch/um/kernel/mem.c @@ -6,7 +6,6 @@ #include #include #include -#include #include #include #include @@ -214,45 +213,11 @@ void free_initmem(void) { } -/* Allocate and free page tables. 
*/ - -pgd_t *pgd_alloc(struct mm_struct *mm) -{ - pgd_t *pgd = __pgd_alloc(mm, 0); - - if (pgd) - memcpy(pgd + USER_PTRS_PER_PGD, - swapper_pg_dir + USER_PTRS_PER_PGD, - (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t)); - - return pgd; -} - void *uml_kmalloc(int size, int flags) { return kmalloc(size, flags); } -static const pgprot_t protection_map[16] = { - [VM_NONE] = PAGE_NONE, - [VM_READ] = PAGE_READONLY, - [VM_WRITE] = PAGE_COPY, - [VM_WRITE | VM_READ] = PAGE_COPY, - [VM_EXEC] = PAGE_READONLY, - [VM_EXEC | VM_READ] = PAGE_READONLY, - [VM_EXEC | VM_WRITE] = PAGE_COPY, - [VM_EXEC | VM_WRITE | VM_READ] = PAGE_COPY, - [VM_SHARED] = PAGE_NONE, - [VM_SHARED | VM_READ] = PAGE_READONLY, - [VM_SHARED | VM_WRITE] = PAGE_SHARED, - [VM_SHARED | VM_WRITE | VM_READ] = PAGE_SHARED, - [VM_SHARED | VM_EXEC] = PAGE_READONLY, - [VM_SHARED | VM_EXEC | VM_READ] = PAGE_READONLY, - [VM_SHARED | VM_EXEC | VM_WRITE] = PAGE_SHARED, - [VM_SHARED | VM_EXEC | VM_WRITE | VM_READ] = PAGE_SHARED -}; -DECLARE_VM_GET_PAGE_PROT - void mark_rodata_ro(void) { unsigned long rodata_start = PFN_ALIGN(__start_rodata); diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c index 63b38a3f73f7..b07c1f120910 100644 --- a/arch/um/kernel/process.c +++ b/arch/um/kernel/process.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include #include @@ -307,3 +308,40 @@ unsigned long __get_wchan(struct task_struct *p) return 0; } + +extern void start_kernel(void); + +static int __init start_kernel_proc(void *unused) +{ + block_signals_trace(); + + start_kernel(); + return 0; +} + +char cpu_irqstacks[NR_CPUS][THREAD_SIZE] __aligned(THREAD_SIZE); + +int __init start_uml(void) +{ + stack_protections((unsigned long) &cpu_irqstacks[0]); + set_sigstack(cpu_irqstacks[0], THREAD_SIZE); + + init_new_thread_signals(); + + init_task.thread.request.thread.proc = start_kernel_proc; + init_task.thread.request.thread.arg = NULL; + return start_idle_thread(task_stack_page(&init_task), + 
&init_task.thread.switch_buf); +} + +static DEFINE_SPINLOCK(initial_jmpbuf_spinlock); + +void initial_jmpbuf_lock(void) +{ + spin_lock_irq(&initial_jmpbuf_spinlock); +} + +void initial_jmpbuf_unlock(void) +{ + spin_unlock_irq(&initial_jmpbuf_spinlock); +} diff --git a/arch/um/kernel/skas/process.c b/arch/um/kernel/skas/process.c index 4a7673b0261a..d643854942bc 100644 --- a/arch/um/kernel/skas/process.c +++ b/arch/um/kernel/skas/process.c @@ -17,31 +17,6 @@ #include #include -extern void start_kernel(void); - -static int __init start_kernel_proc(void *unused) -{ - block_signals_trace(); - - start_kernel(); - return 0; -} - -char cpu_irqstacks[NR_CPUS][THREAD_SIZE] __aligned(THREAD_SIZE); - -int __init start_uml(void) -{ - stack_protections((unsigned long) &cpu_irqstacks[0]); - set_sigstack(cpu_irqstacks[0], THREAD_SIZE); - - init_new_thread_signals(); - - init_task.thread.request.thread.proc = start_kernel_proc; - init_task.thread.request.thread.arg = NULL; - return start_idle_thread(task_stack_page(&init_task), - &init_task.thread.switch_buf); -} - unsigned long current_stub_stack(void) { if (current->mm == NULL) @@ -65,15 +40,3 @@ void current_mm_sync(void) um_tlb_sync(current->mm); } - -static DEFINE_SPINLOCK(initial_jmpbuf_spinlock); - -void initial_jmpbuf_lock(void) -{ - spin_lock_irq(&initial_jmpbuf_spinlock); -} - -void initial_jmpbuf_unlock(void) -{ - spin_unlock_irq(&initial_jmpbuf_spinlock); -} diff --git a/arch/um/os-Linux/Makefile b/arch/um/os-Linux/Makefile index 70c73c22f715..051679d78aae 100644 --- a/arch/um/os-Linux/Makefile +++ b/arch/um/os-Linux/Makefile @@ -8,7 +8,8 @@ KCOV_INSTRUMENT := n obj-y = execvp.o file.o helper.o irq.o main.o mem.o process.o \ registers.o sigio.o signal.o start_up.o time.o tty.o \ - umid.o user_syms.o util.o skas/ + umid.o user_syms.o util.o +obj-$(CONFIG_MMU) += skas/ CFLAGS_signal.o += -Wframe-larger-than=4096 diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c index 3a2a84ab9325..c50fa865d8c7 100644 
--- a/arch/um/os-Linux/process.c +++ b/arch/um/os-Linux/process.c @@ -6,6 +6,7 @@ #include #include +#include #include #include #include @@ -17,10 +18,16 @@ #include #include #include +#include #include #include #include #include +#include +#include + +int using_seccomp; +static int unscheduled_userspace_iterations; void os_alarm_process(int pid) { @@ -209,3 +216,125 @@ int os_futex_wake(void *uaddr) NULL, NULL, 0)); return r < 0 ? -errno : r; } + +int is_skas_winch(int pid, int fd, void *data) +{ + return pid == getpgrp(); +} + +void new_thread(void *stack, jmp_buf *buf, void (*handler)(void)) +{ + (*buf)[0].JB_IP = (unsigned long) handler; + (*buf)[0].JB_SP = (unsigned long) stack + UM_THREAD_SIZE - + sizeof(void *); +} + +#define INIT_JMP_NEW_THREAD 0 +#define INIT_JMP_CALLBACK 1 +#define INIT_JMP_HALT 2 +#define INIT_JMP_REBOOT 3 + +void switch_threads(jmp_buf *me, jmp_buf *you) +{ + unscheduled_userspace_iterations = 0; + + if (UML_SETJMP(me) == 0) + UML_LONGJMP(you, 1); +} + +static jmp_buf initial_jmpbuf; + +static __thread void (*cb_proc)(void *arg); +static __thread void *cb_arg; +static __thread jmp_buf *cb_back; + +int start_idle_thread(void *stack, jmp_buf *switch_buf) +{ + int n; + + set_handler(SIGWINCH); + + /* + * Can't use UML_SETJMP or UML_LONGJMP here because they save + * and restore signals, with the possible side-effect of + * trying to handle any signals which came when they were + * blocked, which can't be done on this stack. + * Signals must be blocked when jumping back here and restored + * after returning to the jumper. 
+ */ + n = setjmp(initial_jmpbuf); + switch (n) { + case INIT_JMP_NEW_THREAD: + (*switch_buf)[0].JB_IP = (unsigned long) uml_finishsetup; + (*switch_buf)[0].JB_SP = (unsigned long) stack + + UM_THREAD_SIZE - sizeof(void *); + break; + case INIT_JMP_CALLBACK: + (*cb_proc)(cb_arg); + longjmp(*cb_back, 1); + break; + case INIT_JMP_HALT: + kmalloc_ok = 0; + return 0; + case INIT_JMP_REBOOT: + kmalloc_ok = 0; + return 1; + default: + printk(UM_KERN_ERR "Bad sigsetjmp return in %s - %d\n", + __func__, n); + fatal_sigsegv(); + } + longjmp(*switch_buf, 1); + + /* unreachable */ + printk(UM_KERN_ERR "impossible long jump!"); + fatal_sigsegv(); + return 0; +} + +void initial_thread_cb_skas(void (*proc)(void *), void *arg) +{ + jmp_buf here; + + cb_proc = proc; + cb_arg = arg; + cb_back = &here; + + initial_jmpbuf_lock(); + if (UML_SETJMP(&here) == 0) + UML_LONGJMP(&initial_jmpbuf, INIT_JMP_CALLBACK); + initial_jmpbuf_unlock(); + + cb_proc = NULL; + cb_arg = NULL; + cb_back = NULL; +} + +void halt_skas(void) +{ + initial_jmpbuf_lock(); + UML_LONGJMP(&initial_jmpbuf, INIT_JMP_HALT); + /* unreachable */ +} + +static bool noreboot; + +static int __init noreboot_cmd_param(char *str, int *add) +{ + *add = 0; + noreboot = true; + return 0; +} + +__uml_setup("noreboot", noreboot_cmd_param, +"noreboot\n" +" Rather than rebooting, exit always, akin to QEMU's -no-reboot option.\n" +" This is useful if you're using CONFIG_PANIC_TIMEOUT in order to catch\n" +" crashes in CI\n\n"); + +void reboot_skas(void) +{ + initial_jmpbuf_lock(); + UML_LONGJMP(&initial_jmpbuf, noreboot ? 
INIT_JMP_HALT : INIT_JMP_REBOOT); + /* unreachable */ +} diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c index d6c22f8aa06d..01814ad82f5d 100644 --- a/arch/um/os-Linux/skas/process.c +++ b/arch/um/os-Linux/skas/process.c @@ -18,7 +18,6 @@ #include #include #include -#include #include #include #include @@ -29,16 +28,10 @@ #include #include #include -#include #include #include #include "../internal.h" -int is_skas_winch(int pid, int fd, void *data) -{ - return pid == getpgrp(); -} - static const char *ptrace_reg_name(int idx) { #define R(n) case HOST_##n: return #n @@ -426,8 +419,6 @@ static int __init init_stub_exe_fd(void) } __initcall(init_stub_exe_fd); -int using_seccomp; - /** * start_userspace() - prepare a new userspace process * @mm_id: The corresponding struct mm_id @@ -540,7 +531,6 @@ int start_userspace(struct mm_id *mm_id) return err; } -static int unscheduled_userspace_iterations; extern unsigned long tt_extra_sched_jiffies; void userspace(struct uml_pt_regs *regs) @@ -789,120 +779,3 @@ void userspace(struct uml_pt_regs *regs) } } } - -void new_thread(void *stack, jmp_buf *buf, void (*handler)(void)) -{ - (*buf)[0].JB_IP = (unsigned long) handler; - (*buf)[0].JB_SP = (unsigned long) stack + UM_THREAD_SIZE - - sizeof(void *); -} - -#define INIT_JMP_NEW_THREAD 0 -#define INIT_JMP_CALLBACK 1 -#define INIT_JMP_HALT 2 -#define INIT_JMP_REBOOT 3 - -void switch_threads(jmp_buf *me, jmp_buf *you) -{ - unscheduled_userspace_iterations = 0; - - if (UML_SETJMP(me) == 0) - UML_LONGJMP(you, 1); -} - -static jmp_buf initial_jmpbuf; - -static __thread void (*cb_proc)(void *arg); -static __thread void *cb_arg; -static __thread jmp_buf *cb_back; - -int start_idle_thread(void *stack, jmp_buf *switch_buf) -{ - int n; - - set_handler(SIGWINCH); - - /* - * Can't use UML_SETJMP or UML_LONGJMP here because they save - * and restore signals, with the possible side-effect of - * trying to handle any signals which came when they were - * blocked, 
which can't be done on this stack. - * Signals must be blocked when jumping back here and restored - * after returning to the jumper. - */ - n = setjmp(initial_jmpbuf); - switch (n) { - case INIT_JMP_NEW_THREAD: - (*switch_buf)[0].JB_IP = (unsigned long) uml_finishsetup; - (*switch_buf)[0].JB_SP = (unsigned long) stack + - UM_THREAD_SIZE - sizeof(void *); - break; - case INIT_JMP_CALLBACK: - (*cb_proc)(cb_arg); - longjmp(*cb_back, 1); - break; - case INIT_JMP_HALT: - kmalloc_ok = 0; - return 0; - case INIT_JMP_REBOOT: - kmalloc_ok = 0; - return 1; - default: - printk(UM_KERN_ERR "Bad sigsetjmp return in %s - %d\n", - __func__, n); - fatal_sigsegv(); - } - longjmp(*switch_buf, 1); - - /* unreachable */ - printk(UM_KERN_ERR "impossible long jump!"); - fatal_sigsegv(); - return 0; -} - -void initial_thread_cb_skas(void (*proc)(void *), void *arg) -{ - jmp_buf here; - - cb_proc = proc; - cb_arg = arg; - cb_back = &here; - - initial_jmpbuf_lock(); - if (UML_SETJMP(&here) == 0) - UML_LONGJMP(&initial_jmpbuf, INIT_JMP_CALLBACK); - initial_jmpbuf_unlock(); - - cb_proc = NULL; - cb_arg = NULL; - cb_back = NULL; -} - -void halt_skas(void) -{ - initial_jmpbuf_lock(); - UML_LONGJMP(&initial_jmpbuf, INIT_JMP_HALT); - /* unreachable */ -} - -static bool noreboot; - -static int __init noreboot_cmd_param(char *str, int *add) -{ - *add = 0; - noreboot = true; - return 0; -} - -__uml_setup("noreboot", noreboot_cmd_param, -"noreboot\n" -" Rather than rebooting, exit always, akin to QEMU's -no-reboot option.\n" -" This is useful if you're using CONFIG_PANIC_TIMEOUT in order to catch\n" -" crashes in CI\n\n"); - -void reboot_skas(void) -{ - initial_jmpbuf_lock(); - UML_LONGJMP(&initial_jmpbuf, noreboot ? 
INIT_JMP_HALT : INIT_JMP_REBOOT); - /* unreachable */ -} -- 2.43.0 From thehajime at gmail.com Sun Nov 2 01:49:29 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sun, 2 Nov 2025 18:49:29 +0900 Subject: [PATCH v12 04/13] x86/um: nommu: syscall handling In-Reply-To: References: Message-ID: This commit introduces an entry point of the syscall interface for !MMU mode. It uses an entry function, __kernel_vsyscall, a kernel-wide global symbol accessible from any location. Although it isn't in the scope of this commit, it can also be exposed via the vdso image, which is directly accessible from userspace. A standard library (e.g., libc) can utilize this entry point to implement its syscall wrapper; we can also use it by hooking syscalls for unmodified userspace applications/libraries, which will be implemented in the subsequent commit. This only supports the 64-bit mode of the x86 architecture. Signed-off-by: Hajime Tazaki Signed-off-by: Ricardo Koller --- arch/x86/um/Makefile | 4 ++ arch/x86/um/asm/syscall.h | 6 ++ arch/x86/um/nommu/Makefile | 8 +++ arch/x86/um/nommu/do_syscall_64.c | 32 +++++++++ arch/x86/um/nommu/entry_64.S | 112 ++++++++++++++++++++++++++++++ arch/x86/um/nommu/syscalls.h | 16 +++++ 6 files changed, 178 insertions(+) create mode 100644 arch/x86/um/nommu/Makefile create mode 100644 arch/x86/um/nommu/do_syscall_64.c create mode 100644 arch/x86/um/nommu/entry_64.S create mode 100644 arch/x86/um/nommu/syscalls.h diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile index b42c31cd2390..227af2a987e2 100644 --- a/arch/x86/um/Makefile +++ b/arch/x86/um/Makefile @@ -32,6 +32,10 @@ obj-y += syscalls_64.o vdso/ subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o \ ../lib/memmove_64.o ../lib/memset_64.o +ifneq ($(CONFIG_MMU),y) +obj-y += nommu/ +endif + endif subarch-$(CONFIG_MODULES) += ../kernel/module.o diff --git a/arch/x86/um/asm/syscall.h b/arch/x86/um/asm/syscall.h index d6208d0fad51..bb4f6f011667 100644 --- a/arch/x86/um/asm/syscall.h +++ 
b/arch/x86/um/asm/syscall.h @@ -20,4 +20,10 @@ static inline int syscall_get_arch(struct task_struct *task) #endif } +#ifndef CONFIG_MMU +extern void do_syscall_64(struct pt_regs *regs); +extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3, + int64_t a4, int64_t a5, int64_t a6); +#endif + #endif /* __UM_ASM_SYSCALL_H */ diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile new file mode 100644 index 000000000000..d72c63afffa5 --- /dev/null +++ b/arch/x86/um/nommu/Makefile @@ -0,0 +1,8 @@ +# SPDX-License-Identifier: GPL-2.0 +ifeq ($(CONFIG_X86_32),y) + BITS := 32 +else + BITS := 64 +endif + +obj-y = do_syscall_$(BITS).o entry_$(BITS).o diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c new file mode 100644 index 000000000000..292d7c578622 --- /dev/null +++ b/arch/x86/um/nommu/do_syscall_64.c @@ -0,0 +1,32 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include + +__visible void do_syscall_64(struct pt_regs *regs) +{ + int syscall; + + syscall = PT_SYSCALL_NR(regs->regs.gp); + UPT_SYSCALL_NR(&regs->regs) = syscall; + + if (likely(syscall < NR_syscalls)) { + unsigned long ret; + + ret = (*sys_call_table[syscall])(UPT_SYSCALL_ARG1(&regs->regs), + UPT_SYSCALL_ARG2(&regs->regs), + UPT_SYSCALL_ARG3(&regs->regs), + UPT_SYSCALL_ARG4(&regs->regs), + UPT_SYSCALL_ARG5(&regs->regs), + UPT_SYSCALL_ARG6(&regs->regs)); + PT_REGS_SET_SYSCALL_RETURN(regs, ret); + } + + PT_REGS_SYSCALL_RET(regs) = regs->regs.gp[HOST_AX]; + + /* handle tasks and signals at the end */ + interrupt_end(); +} diff --git a/arch/x86/um/nommu/entry_64.S b/arch/x86/um/nommu/entry_64.S new file mode 100644 index 000000000000..485c578aae64 --- /dev/null +++ b/arch/x86/um/nommu/entry_64.S @@ -0,0 +1,112 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#include + +#include +#include +#include + +#include "../entry/calling.h" + +#ifdef CONFIG_SMP +#error need to stash these variables somewhere else +#endif + +#define UM_GLOBAL_VAR(x) 
.data; .align 8; .globl x; x:; .long 0 + +UM_GLOBAL_VAR(current_top_of_stack) +UM_GLOBAL_VAR(current_ptregs) + +.code64 +.section .entry.text, "ax" + +.align 8 +#undef ENTRY +#define ENTRY(x) .text; .globl x; .type x,%function; x: +#undef END +#define END(x) .size x, . - x + +/* + * %rcx has the return address (we set it before entering __kernel_vsyscall). + * + * Registers on entry: + * rax system call number + * rcx return address + * rdi arg0 + * rsi arg1 + * rdx arg2 + * r10 arg3 + * r8 arg4 + * r9 arg5 + * + * (note: we are allowed to mess with r11: r11 is callee-clobbered + * register in C ABI) + */ +ENTRY(__kernel_vsyscall) + + movq %rsp, %r11 + + /* Point rsp to the top of the ptregs array, so we can + just fill it with a bunch of push'es. */ + movq current_ptregs, %rsp + + /* 8 bytes * 20 registers (plus 8 for the push) */ + addq $168, %rsp + + /* Construct struct pt_regs on stack */ + pushq $0 /* pt_regs->ss (index 20) */ + pushq %r11 /* pt_regs->sp */ + pushfq /* pt_regs->flags */ + pushq $0 /* pt_regs->cs */ + pushq %rcx /* pt_regs->ip */ + pushq %rax /* pt_regs->orig_ax */ + + PUSH_AND_CLEAR_REGS rax=$-ENOSYS + + mov %rsp, %rdi + + /* + * Switch to current top of stack, so "current->" points + * to the right task. + */ + movq current_top_of_stack, %rsp + + call do_syscall_64 + + jmp userspace + +END(__kernel_vsyscall) + +/* + * common userspace returning routine + * + * all procedures like syscalls, signal handlers, umh processes, will gate + * this routine to properly configure registers/stacks. 
+ * + * void userspace(struct uml_pt_regs *regs) + */ +ENTRY(userspace) + + /* clear direction flag to meet ABI */ + cld + /* align the stack for x86_64 ABI */ + and $-0x10, %rsp + /* Handle any immediate reschedules or signals */ + call interrupt_end + + movq current_ptregs, %rsp + + POP_REGS + + addq $8, %rsp /* skip orig_ax */ + popq %rcx /* pt_regs->ip */ + addq $8, %rsp /* skip cs */ + addq $8, %rsp /* skip flags */ + popq %rsp + + /* + * not return w/ ret but w/ jmp as the stack is already popped before + * entering __kernel_vsyscall + */ + jmp *%rcx + +END(userspace) diff --git a/arch/x86/um/nommu/syscalls.h b/arch/x86/um/nommu/syscalls.h new file mode 100644 index 000000000000..a2433756b1fc --- /dev/null +++ b/arch/x86/um/nommu/syscalls.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __UM_NOMMU_SYSCALLS_H +#define __UM_NOMMU_SYSCALLS_H + + +#define task_top_of_stack(task) \ +({ \ + unsigned long __ptr = (unsigned long)task->stack; \ + __ptr += THREAD_SIZE; \ + __ptr; \ +}) + +extern long current_top_of_stack; +extern long current_ptregs; + +#endif -- 2.43.0 From thehajime at gmail.com Sun Nov 2 01:49:28 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sun, 2 Nov 2025 18:49:28 +0900 Subject: [PATCH v12 03/13] um: nommu: memory handling In-Reply-To: References: Message-ID: This commit adds memory operations for UML in a !MMU environment. Some parts of the original UML code that rely on CONFIG_MMU are excluded from compilation when !CONFIG_MMU. Additionally, generic functions such as uaccess, futex, and memcpy/strnlen/strncpy can be used, as user- and kernel-space share the address space in !CONFIG_MMU mode. 
Signed-off-by: Hajime Tazaki Signed-off-by: Ricardo Koller --- arch/um/Makefile | 4 ++++ arch/um/include/asm/futex.h | 4 ++++ arch/um/include/asm/mmu.h | 3 +++ arch/um/include/asm/mmu_context.h | 2 ++ arch/um/include/asm/uaccess.h | 7 ++++--- arch/um/kernel/mem.c | 3 ++- arch/um/os-Linux/mem.c | 4 ++++ arch/um/os-Linux/process.c | 4 ++-- 8 files changed, 25 insertions(+), 6 deletions(-) diff --git a/arch/um/Makefile b/arch/um/Makefile index 7be0143b5ba3..5371c9a1b11e 100644 --- a/arch/um/Makefile +++ b/arch/um/Makefile @@ -46,6 +46,10 @@ ARCH_INCLUDE := -I$(srctree)/$(SHARED_HEADERS) ARCH_INCLUDE += -I$(srctree)/$(HOST_DIR)/um/shared KBUILD_CPPFLAGS += -I$(srctree)/$(HOST_DIR)/um +ifneq ($(CONFIG_MMU),y) +core-y += $(ARCH_DIR)/nommu/ +endif + # -Dvmap=kernel_vmap prevents anything from referencing the libpcap.o symbol so # named - it's a common symbol in libpcap, so we get a binary which crashes. # diff --git a/arch/um/include/asm/futex.h b/arch/um/include/asm/futex.h index 780aa6bfc050..785fd6649aa2 100644 --- a/arch/um/include/asm/futex.h +++ b/arch/um/include/asm/futex.h @@ -7,8 +7,12 @@ #include +#ifdef CONFIG_MMU int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr); int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, u32 oldval, u32 newval); +#else +#include +#endif #endif diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h index 82a919132aff..c0b9ce3215c4 100644 --- a/arch/um/include/asm/mmu.h +++ b/arch/um/include/asm/mmu.h @@ -22,10 +22,13 @@ typedef struct mm_context { unsigned long sync_tlb_range_from; unsigned long sync_tlb_range_to; +#ifndef CONFIG_MMU + unsigned long end_brk; #ifdef CONFIG_BINFMT_ELF_FDPIC unsigned long exec_fdpic_loadmap; unsigned long interp_fdpic_loadmap; #endif +#endif /* !CONFIG_MMU */ } mm_context_t; #define INIT_MM_CONTEXT(mm) \ diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h index c727e56ba116..528b217da285 100644 --- 
a/arch/um/include/asm/mmu_context.h +++ b/arch/um/include/asm/mmu_context.h @@ -18,11 +18,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, { } +#ifdef CONFIG_MMU #define init_new_context init_new_context extern int init_new_context(struct task_struct *task, struct mm_struct *mm); #define destroy_context destroy_context extern void destroy_context(struct mm_struct *mm); +#endif #include diff --git a/arch/um/include/asm/uaccess.h b/arch/um/include/asm/uaccess.h index 1c6e0ae41b0c..b9677758e759 100644 --- a/arch/um/include/asm/uaccess.h +++ b/arch/um/include/asm/uaccess.h @@ -23,6 +23,7 @@ #define __addr_range_nowrap(addr, size) \ ((unsigned long) (addr) <= ((unsigned long) (addr) + (size))) +#ifdef CONFIG_MMU extern unsigned long raw_copy_from_user(void *to, const void __user *from, unsigned long n); extern unsigned long raw_copy_to_user(void __user *to, const void *from, unsigned long n); extern unsigned long __clear_user(void __user *mem, unsigned long len); @@ -34,9 +35,6 @@ static inline int __access_ok(const void __user *ptr, unsigned long size); #define INLINE_COPY_FROM_USER #define INLINE_COPY_TO_USER - -#include - static inline int __access_ok(const void __user *ptr, unsigned long size) { unsigned long addr = (unsigned long)ptr; @@ -70,5 +68,8 @@ do { \ barrier(); \ current->thread.segv_continue = NULL; \ } while (0) +#endif + +#include #endif diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c index 52cd906e3896..1b9e7c62412d 100644 --- a/arch/um/kernel/mem.c +++ b/arch/um/kernel/mem.c @@ -71,7 +71,8 @@ void __init arch_mm_preinit(void) * to be turned on. 
*/ brk_end = PAGE_ALIGN((unsigned long) sbrk(0)); - map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 0); + map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, + !IS_ENABLED(CONFIG_MMU)); memblock_free((void *)brk_end, uml_reserved - brk_end); uml_reserved = brk_end; min_low_pfn = PFN_UP(__pa(uml_reserved)); diff --git a/arch/um/os-Linux/mem.c b/arch/um/os-Linux/mem.c index 72f302f4d197..4f5d9a94f8e2 100644 --- a/arch/um/os-Linux/mem.c +++ b/arch/um/os-Linux/mem.c @@ -213,6 +213,10 @@ int __init create_mem_file(unsigned long long len) { int err, fd; + /* NOMMU kernel uses -1 as a fd for further use (e.g., mmap) */ + if (!IS_ENABLED(CONFIG_MMU)) + return -1; + fd = create_tmp_file(len); err = os_set_exec_close(fd); diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c index c50fa865d8c7..ddb5258d7720 100644 --- a/arch/um/os-Linux/process.c +++ b/arch/um/os-Linux/process.c @@ -100,8 +100,8 @@ int os_map_memory(void *virt, int fd, unsigned long long off, unsigned long len, prot = (r ? PROT_READ : 0) | (w ? PROT_WRITE : 0) | (x ? PROT_EXEC : 0); - loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED, - fd, off); + loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED | + (!IS_ENABLED(CONFIG_MMU) ? MAP_ANONYMOUS : 0), fd, off); if (loc == MAP_FAILED) return -errno; return 0; -- 2.43.0 From thehajime at gmail.com Sun Nov 2 01:49:31 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sun, 2 Nov 2025 18:49:31 +0900 Subject: [PATCH v12 06/13] x86/um: nommu: process/thread handling In-Reply-To: References: Message-ID: <94b1c9a65af9d22e3f21d28bc0fad2f94e1e86cb.1762075876.git.thehajime@gmail.com> Since the ptrace facility isn't used in !MMU mode of UML, there is a different code path to invoke processes/threads: no external process is used, and some registers (e.g., the fs segment register for TLS) need to be configured properly on every context switch.
Signals aren't delivered on non-ptrace syscall entry/leave, so we also need to handle pending signals ourselves. ptrace-related syscalls are not tested yet, so arch_has_single_step() is marked unsupported in the !MMU environment. Signed-off-by: Hajime Tazaki Signed-off-by: Ricardo Koller --- arch/um/include/asm/ptrace-generic.h | 2 +- arch/x86/um/Makefile | 3 +- arch/x86/um/nommu/Makefile | 2 +- arch/x86/um/nommu/entry_64.S | 2 ++ arch/x86/um/nommu/syscalls.h | 2 ++ arch/x86/um/nommu/syscalls_64.c | 50 ++++++++++++++++++++++++++++ 6 files changed, 58 insertions(+), 3 deletions(-) create mode 100644 arch/x86/um/nommu/syscalls_64.c diff --git a/arch/um/include/asm/ptrace-generic.h b/arch/um/include/asm/ptrace-generic.h index 62e9916078ec..5aa38fe6b2fb 100644 --- a/arch/um/include/asm/ptrace-generic.h +++ b/arch/um/include/asm/ptrace-generic.h @@ -14,7 +14,7 @@ struct pt_regs { struct uml_pt_regs regs; }; -#define arch_has_single_step() (1) +#define arch_has_single_step() (IS_ENABLED(CONFIG_MMU)) #define EMPTY_REGS { .regs = EMPTY_UML_PT_REGS } diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile index 227af2a987e2..53c9ebb3c41c 100644 --- a/arch/x86/um/Makefile +++ b/arch/x86/um/Makefile @@ -27,7 +27,8 @@ subarch-y += ../kernel/sys_ia32.o else -obj-y += syscalls_64.o vdso/ +obj-y += vdso/ +obj-$(CONFIG_MMU) += syscalls_64.o subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o \ ../lib/memmove_64.o ../lib/memset_64.o diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile index ebe47d4836f4..4018d9e0aba0 100644 --- a/arch/x86/um/nommu/Makefile +++ b/arch/x86/um/nommu/Makefile @@ -5,4 +5,4 @@ else BITS := 64 endif -obj-y = do_syscall_$(BITS).o entry_$(BITS).o os-Linux/ +obj-y = do_syscall_$(BITS).o entry_$(BITS).o syscalls_$(BITS).o os-Linux/ diff --git a/arch/x86/um/nommu/entry_64.S b/arch/x86/um/nommu/entry_64.S index 485c578aae64..a58922fc81e5 100644 --- a/arch/x86/um/nommu/entry_64.S +++ b/arch/x86/um/nommu/entry_64.S @@ -86,6 +86,8 @@ 
END(__kernel_vsyscall) */ ENTRY(userspace) + /* set stack and pt_regs to the current task */ + call arch_set_stack_to_current /* clear direction flag to meet ABI */ cld /* align the stack for x86_64 ABI */ diff --git a/arch/x86/um/nommu/syscalls.h b/arch/x86/um/nommu/syscalls.h index a2433756b1fc..ce16bf8abd59 100644 --- a/arch/x86/um/nommu/syscalls.h +++ b/arch/x86/um/nommu/syscalls.h @@ -13,4 +13,6 @@ extern long current_top_of_stack; extern long current_ptregs; +void arch_set_stack_to_current(void); + #endif diff --git a/arch/x86/um/nommu/syscalls_64.c b/arch/x86/um/nommu/syscalls_64.c new file mode 100644 index 000000000000..d56027ebc651 --- /dev/null +++ b/arch/x86/um/nommu/syscalls_64.c @@ -0,0 +1,50 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2003 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Copyright 2003 PathScale, Inc. + * + * Licensed under the GPL + */ + +#include +#include +#include +#include +#include /* XXX This should get the constants from libc */ +#include +#include +#include "syscalls.h" + +void arch_set_stack_to_current(void) +{ + current_top_of_stack = task_top_of_stack(current); + current_ptregs = (long)task_pt_regs(current); +} + +void arch_switch_to(struct task_struct *to) +{ + /* + * In !CONFIG_MMU, it doesn't ptrace thus, + * The FS_BASE registers are saved here. 
+ */ + current_top_of_stack = task_top_of_stack(to); + current_ptregs = (long)task_pt_regs(to); + + if ((to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] == 0) || + (to->mm == NULL)) + return; + + /* this changes the FS on every context switch */ + arch_prctl(to, ARCH_SET_FS, + (void __user *) to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)]); +} + +SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len, + unsigned long, prot, unsigned long, flags, + unsigned long, fd, unsigned long, off) +{ + if (off & ~PAGE_MASK) + return -EINVAL; + + return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT); +} -- 2.43.0 From thehajime at gmail.com Sun Nov 2 01:49:30 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sun, 2 Nov 2025 18:49:30 +0900 Subject: [PATCH v12 05/13] um: nommu: seccomp syscalls hook In-Reply-To: References: Message-ID: This commit adds a syscall hook using seccomp. The seccomp filter raises SIGSYS to the UML process; the signal is caught in the (UML) kernel, which then jumps to the syscall entry point, __kernel_vsyscall, to hook the original syscall instruction. SIGSYS is raised when a syscall is executed from the region between uml_reserved and high_physmem, where userspace memory is located. It also renames the existing static function sigsys_handler() in start_up.c to avoid a name conflict.
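The address-range decision described above can be sketched in plain C. Classic BPF operates on 32-bit words, so a 64-bit instruction pointer has to be compared as a (high, low) pair; the helper names sketch_ge64() and sketch_should_trap() below are hypothetical and only model the decision, under the assumption that the trapped range is start <= ip < end. They are not the filter program itself.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Compare a 64-bit value, given as two 32-bit halves, against a 64-bit
 * bound: returns true when (hi:lo) >= bound. This mirrors how a classic
 * BPF program must split the comparison into a high-word test followed
 * by a low-word test.
 */
static bool sketch_ge64(uint32_t hi, uint32_t lo, uint64_t bound)
{
	uint32_t bound_hi = (uint32_t)(bound >> 32);
	uint32_t bound_lo = (uint32_t)bound;

	if (hi != bound_hi)
		return hi > bound_hi;
	return lo >= bound_lo;
}

/*
 * true  -> the syscall was issued from the userspace range and should be
 *          trapped (SIGSYS) so that it can be redirected;
 * false -> the syscall came from outside the range and is allowed to
 *          reach the host directly.
 */
static bool sketch_should_trap(uint64_t ip, uint64_t start, uint64_t end)
{
	uint32_t hi = (uint32_t)(ip >> 32);
	uint32_t lo = (uint32_t)ip;

	if (sketch_ge64(hi, lo, end))
		return false;	/* at or above the end of the range */
	if (!sketch_ge64(hi, lo, start))
		return false;	/* below the start of the range */
	return true;
}
```

For instance, with start = 0x60000000 and end = 0x70000000, an instruction pointer of 0x60001000 would be trapped, while 0x5fffffff and 0x70000000 would be allowed.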
Signed-off-by: Hajime Tazaki Signed-off-by: Kenichi Yasukata --- arch/um/include/shared/kern_util.h | 2 + arch/um/include/shared/os.h | 10 +++ arch/um/kernel/um_arch.c | 3 + arch/um/nommu/Makefile | 3 + arch/um/nommu/os-Linux/Makefile | 7 +++ arch/um/nommu/os-Linux/seccomp.c | 87 +++++++++++++++++++++++++++ arch/um/nommu/os-Linux/signal.c | 16 +++++ arch/um/os-Linux/signal.c | 8 +++ arch/um/os-Linux/start_up.c | 4 +- arch/x86/um/nommu/Makefile | 2 +- arch/x86/um/nommu/os-Linux/Makefile | 6 ++ arch/x86/um/nommu/os-Linux/mcontext.c | 15 +++++ arch/x86/um/shared/sysdep/mcontext.h | 4 ++ 13 files changed, 164 insertions(+), 3 deletions(-) create mode 100644 arch/um/nommu/Makefile create mode 100644 arch/um/nommu/os-Linux/Makefile create mode 100644 arch/um/nommu/os-Linux/seccomp.c create mode 100644 arch/um/nommu/os-Linux/signal.c create mode 100644 arch/x86/um/nommu/os-Linux/Makefile create mode 100644 arch/x86/um/nommu/os-Linux/mcontext.c diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h index 38321188c04c..7798f16a4677 100644 --- a/arch/um/include/shared/kern_util.h +++ b/arch/um/include/shared/kern_util.h @@ -63,6 +63,8 @@ extern void segv_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs extern void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs, void *mc); extern void fatal_sigsegv(void) __attribute__ ((noreturn)); +extern void sigsys_handler(int sig, struct siginfo *si, struct uml_pt_regs *regs, + void *mc); void um_idle_sleep(void); diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h index b26e94292fc1..5451f9b1f41e 100644 --- a/arch/um/include/shared/os.h +++ b/arch/um/include/shared/os.h @@ -356,4 +356,14 @@ static inline void os_local_ipi_enable(void) { } static inline void os_local_ipi_disable(void) { } #endif /* CONFIG_SMP */ +/* seccomp.c */ +#ifdef CONFIG_MMU +static inline int os_setup_seccomp(void) +{ + return 0; +} +#else +extern int os_setup_seccomp(void); +#endif 
+ #endif diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c index e2b24e1ecfa6..27c13423d9aa 100644 --- a/arch/um/kernel/um_arch.c +++ b/arch/um/kernel/um_arch.c @@ -423,6 +423,9 @@ void __init setup_arch(char **cmdline_p) add_bootloader_randomness(rng_seed, sizeof(rng_seed)); memzero_explicit(rng_seed, sizeof(rng_seed)); } + + /* install seccomp filter */ + os_setup_seccomp(); } void __init arch_cpu_finalize_init(void) diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile new file mode 100644 index 000000000000..baab7c2f57c2 --- /dev/null +++ b/arch/um/nommu/Makefile @@ -0,0 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0 + +obj-y := os-Linux/ diff --git a/arch/um/nommu/os-Linux/Makefile b/arch/um/nommu/os-Linux/Makefile new file mode 100644 index 000000000000..805e26ccf63b --- /dev/null +++ b/arch/um/nommu/os-Linux/Makefile @@ -0,0 +1,7 @@ +# SPDX-License-Identifier: GPL-2.0 + +obj-y := seccomp.o signal.o +USER_OBJS := $(obj-y) + +include $(srctree)/arch/um/scripts/Makefile.rules +USER_CFLAGS+=-I$(srctree)/arch/um/os-Linux diff --git a/arch/um/nommu/os-Linux/seccomp.c b/arch/um/nommu/os-Linux/seccomp.c new file mode 100644 index 000000000000..d1cfa6e3d632 --- /dev/null +++ b/arch/um/nommu/os-Linux/seccomp.c @@ -0,0 +1,87 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include +#include +#include /* For SYS_xxx definitions */ +#include +#include +#include +#include +#include + +int __init os_setup_seccomp(void) +{ + int err; + unsigned long __userspace_start = uml_reserved, + __userspace_end = high_physmem; + + struct sock_filter filter[] = { + /* if (IP_high > __userspace_end) allow; */ + BPF_STMT(BPF_LD + BPF_W + BPF_ABS, + offsetof(struct seccomp_data, instruction_pointer) + 4), + BPF_JUMP(BPF_JMP + BPF_JGT + BPF_K, __userspace_end >> 32, + /*true-skip=*/0, /*false-skip=*/1), + BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW), + + /* if (IP_high == __userspace_end && IP_low >= __userspace_end) allow; */ + 
BPF_STMT(BPF_LD + BPF_W + BPF_ABS, + offsetof(struct seccomp_data, instruction_pointer) + 4), + BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __userspace_end >> 32, + /*true-skip=*/0, /*false-skip=*/3), + BPF_STMT(BPF_LD + BPF_W + BPF_ABS, + offsetof(struct seccomp_data, instruction_pointer)), + BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_end, + /*true-skip=*/0, /*false-skip=*/1), + BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW), + + /* if (IP_high < __userspace_start) allow; */ + BPF_STMT(BPF_LD + BPF_W + BPF_ABS, + offsetof(struct seccomp_data, instruction_pointer) + 4), + BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_start >> 32, + /*true-skip=*/1, /*false-skip=*/0), + BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW), + + /* if (IP_high == __userspace_start && IP_low < __userspace_start) allow; */ + BPF_STMT(BPF_LD + BPF_W + BPF_ABS, + offsetof(struct seccomp_data, instruction_pointer) + 4), + BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __userspace_start >> 32, + /*true-skip=*/0, /*false-skip=*/3), + BPF_STMT(BPF_LD + BPF_W + BPF_ABS, + offsetof(struct seccomp_data, instruction_pointer)), + BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_start, + /*true-skip=*/1, /*false-skip=*/0), + BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW), + + /* other address; trap */ + BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRAP), + }; + struct sock_fprog prog = { + .len = ARRAY_SIZE(filter), + .filter = filter, + }; + + err = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); + if (err) + os_warn("PR_SET_NO_NEW_PRIVS (err=%d, errno=%d)\n", + err, errno); + + err = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, + SECCOMP_FILTER_FLAG_TSYNC, &prog); + if (err) { + os_warn("SECCOMP_SET_MODE_FILTER (err=%d, errno=%d)\n", + err, errno); + exit(1); + } + + set_handler(SIGSYS); + + os_info("seccomp: setup filter syscalls in the range: 0x%lx-0x%lx\n", + __userspace_start, __userspace_end); + + return 0; +} + diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-Linux/signal.c new file mode 100644 index 
000000000000..19043b9652e2 --- /dev/null +++ b/arch/um/nommu/os-Linux/signal.c @@ -0,0 +1,16 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include + +void sigsys_handler(int sig, struct siginfo *si, + struct uml_pt_regs *regs, void *ptr) +{ + mcontext_t *mc = (mcontext_t *) ptr; + + /* hook syscall via SIGSYS */ + set_mc_sigsys_hook(mc); +} diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c index 327fb3c52fc7..2f6795cd884c 100644 --- a/arch/um/os-Linux/signal.c +++ b/arch/um/os-Linux/signal.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "internal.h" void (*sig_info[NSIG])(int, struct siginfo *, struct uml_pt_regs *, void *mc) = { @@ -31,6 +32,7 @@ void (*sig_info[NSIG])(int, struct siginfo *, struct uml_pt_regs *, void *mc) = [SIGSEGV] = segv_handler, [SIGIO] = sigio_handler, [SIGCHLD] = sigchld_handler, + [SIGSYS] = sigsys_handler, }; static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc) @@ -182,6 +184,11 @@ static void sigusr1_handler(int sig, struct siginfo *unused_si, mcontext_t *mc) uml_pm_wake(); } +__weak void sigsys_handler(int sig, struct siginfo *unused_si, + struct uml_pt_regs *regs, void *mc) +{ +} + void register_pm_wake_signal(void) { set_handler(SIGUSR1); @@ -193,6 +200,7 @@ static void (*handlers[_NSIG])(int sig, struct siginfo *si, mcontext_t *mc) = { [SIGILL] = sig_handler, [SIGFPE] = sig_handler, [SIGTRAP] = sig_handler, + [SIGSYS] = sig_handler, [SIGIO] = sig_handler, [SIGWINCH] = sig_handler, diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c index 054ac03bbf5e..33e039d2c1bf 100644 --- a/arch/um/os-Linux/start_up.c +++ b/arch/um/os-Linux/start_up.c @@ -239,7 +239,7 @@ extern unsigned long *exec_fp_regs; __initdata static struct stub_data *seccomp_test_stub_data; -static void __init sigsys_handler(int sig, siginfo_t *info, void *p) +static void __init _sigsys_handler(int sig, siginfo_t *info, void *p) { ucontext_t *uc = p; @@ -274,7 
+274,7 @@ static int __init seccomp_helper(void *data) sizeof(seccomp_test_stub_data->sigstack)); sa.sa_flags = SA_ONSTACK | SA_NODEFER | SA_SIGINFO; - sa.sa_sigaction = (void *) sigsys_handler; + sa.sa_sigaction = (void *) _sigsys_handler; sa.sa_restorer = NULL; if (sigaction(SIGSYS, &sa, NULL) < 0) exit(2); diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile index d72c63afffa5..ebe47d4836f4 100644 --- a/arch/x86/um/nommu/Makefile +++ b/arch/x86/um/nommu/Makefile @@ -5,4 +5,4 @@ else BITS := 64 endif -obj-y = do_syscall_$(BITS).o entry_$(BITS).o +obj-y = do_syscall_$(BITS).o entry_$(BITS).o os-Linux/ diff --git a/arch/x86/um/nommu/os-Linux/Makefile b/arch/x86/um/nommu/os-Linux/Makefile new file mode 100644 index 000000000000..4571e403a6ff --- /dev/null +++ b/arch/x86/um/nommu/os-Linux/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 + +obj-y = mcontext.o +USER_OBJS := mcontext.o + +include $(srctree)/arch/um/scripts/Makefile.rules diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c new file mode 100644 index 000000000000..b62a6195096f --- /dev/null +++ b/arch/x86/um/nommu/os-Linux/mcontext.c @@ -0,0 +1,15 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#define __FRAME_OFFSETS +#include +#include +#include + +extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3, + int64_t a4, int64_t a5, int64_t a6); + +void set_mc_sigsys_hook(mcontext_t *mc) +{ + mc->gregs[REG_RCX] = mc->gregs[REG_RIP]; + mc->gregs[REG_RIP] = (unsigned long) __kernel_vsyscall; +} diff --git a/arch/x86/um/shared/sysdep/mcontext.h b/arch/x86/um/shared/sysdep/mcontext.h index 6fe490cc5b98..9a0d6087f357 100644 --- a/arch/x86/um/shared/sysdep/mcontext.h +++ b/arch/x86/um/shared/sysdep/mcontext.h @@ -17,6 +17,10 @@ extern int get_stub_state(struct uml_pt_regs *regs, struct stub_data *data, extern int set_stub_state(struct uml_pt_regs *regs, struct stub_data *data, int single_stepping); +#ifndef CONFIG_MMU 
+extern void set_mc_sigsys_hook(mcontext_t *mc); +#endif + #ifdef __i386__ #define GET_FAULTINFO_FROM_MC(fi, mc) \ -- 2.43.0 From thehajime at gmail.com Sun Nov 2 01:49:32 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sun, 2 Nov 2025 18:49:32 +0900 Subject: [PATCH v12 07/13] um: nommu: configure fs register on host syscall invocation In-Reply-To: References: Message-ID: <86fc0b173ac530454a0f0e33f5100e0b60e37730.1762075876.git.thehajime@gmail.com> As userspace on UML/!MMU also needs to configure the %fs register while it is running in order to access its thread structure correctly, host syscalls implemented in os-Linux drivers may be confused when they are called. Thus the %fs register has to be configured via arch_prctl(ARCH_SET_FS) on every host syscall. Signed-off-by: Hajime Tazaki Signed-off-by: Ricardo Koller --- arch/um/include/shared/os.h | 6 +++ arch/um/os-Linux/process.c | 6 +++ arch/um/os-Linux/start_up.c | 21 +++++++++ arch/x86/um/nommu/do_syscall_64.c | 37 ++++++++++++++++ arch/x86/um/nommu/syscalls_64.c | 71 +++++++++++++++++++++++++++++++ 5 files changed, 141 insertions(+) diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h index 5451f9b1f41e..0ac87507e05e 100644 --- a/arch/um/include/shared/os.h +++ b/arch/um/include/shared/os.h @@ -189,6 +189,7 @@ extern void check_host_supports_tls(int *supports_tls, int *tls_min); extern void get_host_cpu_features( void (*flags_helper_func)(char *line), void (*cache_helper_func)(char *line)); +extern int host_has_fsgsbase; /* mem.c */ extern int create_mem_file(unsigned long long len); @@ -213,6 +214,11 @@ extern int os_protect_memory(void *addr, unsigned long len, extern int os_unmap_memory(void *addr, int len); extern int os_drop_memory(void *addr, int length); extern int can_drop_memory(void); +extern int os_arch_prctl(int pid, int option, unsigned long *arg); +#ifndef CONFIG_MMU +extern long long host_fs; +#endif + void os_set_pdeathsig(void); diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c 
index ddb5258d7720..dacf63ac33c8 100644 --- a/arch/um/os-Linux/process.c +++ b/arch/um/os-Linux/process.c @@ -18,6 +18,7 @@ #include #include #include +#include /* For SYS_xxx definitions */ #include #include #include @@ -179,6 +180,11 @@ int __init can_drop_memory(void) return ok; } +int os_arch_prctl(int pid, int option, unsigned long *arg2) +{ + return syscall(SYS_arch_prctl, option, arg2); +} + void init_new_thread_signals(void) { set_handler(SIGSEGV); diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c index 33e039d2c1bf..c0afe5d8b559 100644 --- a/arch/um/os-Linux/start_up.c +++ b/arch/um/os-Linux/start_up.c @@ -20,6 +20,8 @@ #include #include #include +#include +#include #include #include #include @@ -37,6 +39,8 @@ #include #include "internal.h" +int host_has_fsgsbase; + static void ptrace_child(void) { int ret; @@ -460,6 +464,20 @@ __uml_setup("seccomp=", uml_seccomp_config, " This is insecure and should only be used with a trusted userspace\n\n" ); +static void __init check_fsgsbase(void) +{ + unsigned long auxv = getauxval(AT_HWCAP2); + + os_info("Checking FSGSBASE instructions..."); + if (auxv & HWCAP2_FSGSBASE) { + host_has_fsgsbase = 1; + os_info("OK\n"); + } else { + host_has_fsgsbase = 0; + os_info("disabled\n"); + } +} + void __init os_early_checks(void) { int pid; @@ -488,6 +506,9 @@ void __init os_early_checks(void) using_seccomp = 0; check_ptrace(); + /* probe fsgsbase instruction */ + check_fsgsbase(); + pid = start_ptraced_child(); if (init_pid_registers(pid)) fatal("Failed to initialize default registers"); diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c index 292d7c578622..9bc630995df9 100644 --- a/arch/x86/um/nommu/do_syscall_64.c +++ b/arch/x86/um/nommu/do_syscall_64.c @@ -2,10 +2,38 @@ #include #include +#include +#include #include #include #include +static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2) +{ + if (!host_has_fsgsbase) + return os_arch_prctl(pid, option, 
arg2); + + switch (option) { + case ARCH_SET_FS: + wrfsbase(*arg2); + break; + case ARCH_SET_GS: + wrgsbase(*arg2); + break; + case ARCH_GET_FS: + *arg2 = rdfsbase(); + break; + case ARCH_GET_GS: + *arg2 = rdgsbase(); + break; + default: + pr_warn("%s: unsupported option: 0x%x", __func__, option); + break; + } + + return 0; +} + __visible void do_syscall_64(struct pt_regs *regs) { int syscall; @@ -13,6 +41,9 @@ __visible void do_syscall_64(struct pt_regs *regs) syscall = PT_SYSCALL_NR(regs->regs.gp); UPT_SYSCALL_NR(&regs->regs) = syscall; + /* set fs register to the original host one */ + os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs); + if (likely(syscall < NR_syscalls)) { unsigned long ret; @@ -29,4 +60,10 @@ __visible void do_syscall_64(struct pt_regs *regs) /* handle tasks and signals at the end */ interrupt_end(); + + /* restore back fs register to userspace configured one */ + os_x86_arch_prctl(0, ARCH_SET_FS, + (void *)(current->thread.regs.regs.gp[FS_BASE + / sizeof(unsigned long)])); + } diff --git a/arch/x86/um/nommu/syscalls_64.c b/arch/x86/um/nommu/syscalls_64.c index d56027ebc651..19d23686fc5b 100644 --- a/arch/x86/um/nommu/syscalls_64.c +++ b/arch/x86/um/nommu/syscalls_64.c @@ -13,8 +13,70 @@ #include /* XXX This should get the constants from libc */ #include #include +#include +#include #include "syscalls.h" +/* + * The guest libc can change FS, which confuses the host libc. + * In fact, changing FS directly is not supported (check + * man arch_prctl). So, whenever we make a host syscall, + * we should be changing FS to the original FS (not the + * one set by the guest libc). This original FS is stored + * in host_fs. 
+ */ +long long host_fs = -1; + +long arch_prctl(struct task_struct *task, int option, + unsigned long __user *arg2) +{ + long ret = -EINVAL; + unsigned long *ptr = arg2, tmp; + + switch (option) { + case ARCH_SET_FS: + if (host_fs == -1) + os_arch_prctl(0, ARCH_GET_FS, (void *)&host_fs); + ret = 0; + break; + case ARCH_SET_GS: + ret = 0; + break; + case ARCH_GET_FS: + case ARCH_GET_GS: + ptr = &tmp; + break; + } + + ret = os_arch_prctl(0, option, ptr); + if (ret) + return ret; + + switch (option) { + case ARCH_SET_FS: + current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] = + (unsigned long) arg2; + break; + case ARCH_SET_GS: + current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)] = + (unsigned long) arg2; + break; + case ARCH_GET_FS: + ret = put_user(current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)], arg2); + break; + case ARCH_GET_GS: + ret = put_user(current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)], arg2); + break; + } + + return ret; +} + +SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2) +{ + return arch_prctl(current, option, (unsigned long __user *) arg2); +} + void arch_set_stack_to_current(void) { current_top_of_stack = task_top_of_stack(current); @@ -48,3 +110,12 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len, return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT); } + +static int __init um_nommu_setup_hostfs(void) +{ + /* initialize the host_fs value at boottime */ + os_arch_prctl(0, ARCH_GET_FS, (void *)&host_fs); + + return 0; +} +arch_initcall(um_nommu_setup_hostfs); -- 2.43.0 From thehajime at gmail.com Sun Nov 2 01:49:33 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sun, 2 Nov 2025 18:49:33 +0900 Subject: [PATCH v12 08/13] x86/um/vdso: nommu: vdso memory update In-Reply-To: References: Message-ID: <8036933c8c46dbf1ec32b8b57ecebc94c2cdb2ca.1762075876.git.thehajime@gmail.com> On !MMU mode, the address of vdso is accessible from userspace. 
This commit implements the entry point by pointing it at the address of the allocated page. It also configures the memory permissions of the vdso page so that it is executable. Signed-off-by: Hajime Tazaki Signed-off-by: Ricardo Koller --- arch/x86/um/vdso/vma.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c index 51a2b9f2eca9..0799b3fe7521 100644 --- a/arch/x86/um/vdso/vma.c +++ b/arch/x86/um/vdso/vma.c @@ -9,6 +9,7 @@ #include #include #include +#include unsigned long um_vdso_addr; static struct page *um_vdso; @@ -20,18 +21,29 @@ static int __init init_vdso(void) { BUG_ON(vdso_end - vdso_start > PAGE_SIZE); - um_vdso_addr = task_size - PAGE_SIZE; - um_vdso = alloc_page(GFP_KERNEL); if (!um_vdso) panic("Cannot allocate vdso\n"); copy_page(page_address(um_vdso), vdso_start); +#ifdef CONFIG_MMU + um_vdso_addr = task_size - PAGE_SIZE; +#else + /* this is fine with NOMMU as everything is accessible */ + um_vdso_addr = (unsigned long)page_address(um_vdso); + os_protect_memory((void *)um_vdso_addr, vdso_end - vdso_start, 1, 0, 1); +#endif + + pr_info("vdso_start=%lx um_vdso_addr=%lx pg_um_vdso=%lx", + (unsigned long)vdso_start, um_vdso_addr, + (unsigned long)page_address(um_vdso)); + return 0; } subsys_initcall(init_vdso); +#ifdef CONFIG_MMU int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) { struct vm_area_struct *vma; @@ -53,3 +65,4 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) return IS_ERR(vma) ? PTR_ERR(vma) : 0; } +#endif -- 2.43.0 From thehajime at gmail.com Sun Nov 2 01:49:34 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sun, 2 Nov 2025 18:49:34 +0900 Subject: [PATCH v12 09/13] x86/um: nommu: signal handling In-Reply-To: References: Message-ID: <32debc0728ce22cd4db50cdf1cd4e8db430ad402.1762075876.git.thehajime@gmail.com> This commit updates the behavior of signal handling in the !MMU environment.
It adds alignment code for the signal frame, as the frame is used in userspace as-is. Floating-point registers are carefully handled on entry to and exit from the syscall routine so that signal handlers can read and write their contents. It also adds a follow-up routine for SIGSEGV, as signal delivery runs on the same stack frame and we have to avoid an endless SIGSEGV loop. Signed-off-by: Hajime Tazaki --- arch/um/include/shared/kern_util.h | 4 + arch/um/nommu/Makefile | 2 +- arch/um/nommu/os-Linux/signal.c | 8 + arch/um/nommu/trap.c | 201 ++++++++++++++++++++++++++ arch/um/os-Linux/signal.c | 3 +- arch/x86/um/nommu/do_syscall_64.c | 6 + arch/x86/um/nommu/os-Linux/mcontext.c | 11 ++ arch/x86/um/shared/sysdep/mcontext.h | 1 + arch/x86/um/shared/sysdep/ptrace.h | 2 +- 9 files changed, 235 insertions(+), 3 deletions(-) create mode 100644 arch/um/nommu/trap.c diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h index 7798f16a4677..46c8d6336ca1 100644 --- a/arch/um/include/shared/kern_util.h +++ b/arch/um/include/shared/kern_util.h @@ -70,4 +70,8 @@ void um_idle_sleep(void); void kasan_map_memory(void *start, size_t len); +#ifndef CONFIG_MMU +extern void nommu_relay_signal(void *ptr); +#endif + #endif diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile index baab7c2f57c2..096221590cfd 100644 --- a/arch/um/nommu/Makefile +++ b/arch/um/nommu/Makefile @@ -1,3 +1,3 @@ # SPDX-License-Identifier: GPL-2.0 -obj-y := os-Linux/ +obj-y := trap.o os-Linux/ diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-Linux/signal.c index 19043b9652e2..6febb178dcda 100644 --- a/arch/um/nommu/os-Linux/signal.c +++ b/arch/um/nommu/os-Linux/signal.c @@ -5,6 +5,7 @@ #include #include #include +#include void sigsys_handler(int sig, struct siginfo *si, struct uml_pt_regs *regs, void *ptr) @@ -14,3 +15,10 @@ void sigsys_handler(int sig, struct siginfo *si, /* hook syscall via SIGSYS */ set_mc_sigsys_hook(mc); } + +void nommu_relay_signal(void *ptr) +{ 
+ mcontext_t *mc = (mcontext_t *) ptr; + + set_mc_relay_signal(mc); +} diff --git a/arch/um/nommu/trap.c b/arch/um/nommu/trap.c new file mode 100644 index 000000000000..430297517455 --- /dev/null +++ b/arch/um/nommu/trap.c @@ -0,0 +1,201 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * Note this is constrained to return 0, -EFAULT, -EACCES, -ENOMEM by + * segv(). + */ +int handle_page_fault(unsigned long address, unsigned long ip, + int is_write, int is_user, int *code_out) +{ + /* !MMU has no pagefault */ + return -EFAULT; +} + +static void show_segv_info(struct uml_pt_regs *regs) +{ + struct task_struct *tsk = current; + struct faultinfo *fi = UPT_FAULTINFO(regs); + + if (!unhandled_signal(tsk, SIGSEGV)) + return; + + pr_warn_ratelimited("%s%s[%d]: segfault at %lx ip %p sp %p error %x", + task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG, + tsk->comm, task_pid_nr(tsk), FAULT_ADDRESS(*fi), + (void *)UPT_IP(regs), (void *)UPT_SP(regs), + fi->error_code); +} + +static void bad_segv(struct faultinfo fi, unsigned long ip) +{ + current->thread.arch.faultinfo = fi; + force_sig_fault(SIGSEGV, SEGV_ACCERR, (void __user *) FAULT_ADDRESS(fi)); +} + +void fatal_sigsegv(void) +{ + force_fatal_sig(SIGSEGV); + do_signal(&current->thread.regs); + /* + * This is to tell gcc that we're not returning - do_signal + * can, in general, return, but in this case, it's not, since + * we just got a fatal SIGSEGV queued. + */ + os_dump_core(); +} + +/** + * segv_handler() - the SIGSEGV handler + * @sig: the signal number + * @unused_si: the signal info struct; unused in this handler + * @regs: the ptrace register information + * + * The handler first extracts the faultinfo from the UML ptrace regs struct. + * If the userfault did not happen in an UML userspace process, bad_segv is called. + * Otherwise the signal did happen in a cloned userspace process, handle it. 
+ */ +void segv_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs, + void *mc) +{ + struct faultinfo *fi = UPT_FAULTINFO(regs); + + /* !MMU specific part; detection of userspace */ + /* mark is_user=1 when the IP is from userspace code. */ + if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) < high_physmem) + regs->is_user = 1; + + if (UPT_IS_USER(regs) && !SEGV_IS_FIXABLE(fi)) { + show_segv_info(regs); + bad_segv(*fi, UPT_IP(regs)); + return; + } + segv(*fi, UPT_IP(regs), UPT_IS_USER(regs), regs, mc); + + /* !MMU specific part; detection of userspace */ + relay_signal(sig, unused_si, regs, mc); +} + +/* + * We give a *copy* of the faultinfo in the regs to segv. + * This must be done, since nesting SEGVs could overwrite + * the info in the regs. A pointer to the info then would + * give us bad data! + */ +unsigned long segv(struct faultinfo fi, unsigned long ip, int is_user, + struct uml_pt_regs *regs, void *mc) +{ + int si_code; + int err; + int is_write = FAULT_WRITE(fi); + unsigned long address = FAULT_ADDRESS(fi); + + if (!is_user && regs) + current->thread.segv_regs = container_of(regs, struct pt_regs, regs); + + if (current->mm == NULL) { + show_regs(container_of(regs, struct pt_regs, regs)); + panic("Segfault with no mm"); + } else if (!is_user && address > PAGE_SIZE && address < TASK_SIZE) { + show_regs(container_of(regs, struct pt_regs, regs)); + panic("Kernel tried to access user memory at addr 0x%lx, ip 0x%lx", + address, ip); + } + + if (SEGV_IS_FIXABLE(&fi)) + err = handle_page_fault(address, ip, is_write, is_user, + &si_code); + else { + err = -EFAULT; + /* + * A thread accessed NULL, we get a fault, but CR2 is invalid. + * This code is used in __do_copy_from_user() of TT mode. 
+ * XXX tt mode is gone, so maybe this isn't needed any more + */ + address = 0; + } + + if (!err) + goto out; + else if (!is_user && arch_fixup(ip, regs)) + goto out; + + if (!is_user) { + show_regs(container_of(regs, struct pt_regs, regs)); + panic("Kernel mode fault at addr 0x%lx, ip 0x%lx", + address, ip); + } + + show_segv_info(regs); + + if (err == -EACCES) { + current->thread.arch.faultinfo = fi; + force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address); + } else { + WARN_ON_ONCE(err != -EFAULT); + current->thread.arch.faultinfo = fi; + force_sig_fault(SIGSEGV, si_code, (void __user *) address); + } + +out: + if (regs) + current->thread.segv_regs = NULL; + + return 0; +} + +void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs *regs, + void *mc) +{ + int code, err; + + /* !MMU specific part; detection of userspace */ + /* mark is_user=1 when the IP is from userspace code. */ + if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) < high_physmem) + regs->is_user = 1; + + if (!UPT_IS_USER(regs)) { + if (sig == SIGBUS) + pr_err("Bus error - the host /dev/shm or /tmp mount likely just ran out of space\n"); + panic("Kernel mode signal %d", sig); + } + /* if is_user==1, set return to userspace sig handler to relay signal */ + nommu_relay_signal(mc); + + arch_examine_signal(sig, regs); + + /* Is the signal layout for the signal known? + * Signal data must be scrubbed to prevent information leaks. 
+ */ + code = si->si_code; + err = si->si_errno; + if ((err == 0) && (siginfo_layout(sig, code) == SIL_FAULT)) { + struct faultinfo *fi = UPT_FAULTINFO(regs); + + current->thread.arch.faultinfo = *fi; + force_sig_fault(sig, code, (void __user *)FAULT_ADDRESS(*fi)); + } else { + pr_err("Attempted to relay unknown signal %d (si_code = %d) with errno %d\n", + sig, code, err); + force_sig(sig); + } +} + +void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs, + void *mc) +{ + do_IRQ(WINCH_IRQ, regs); +} diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c index 2f6795cd884c..28754f56c42b 100644 --- a/arch/um/os-Linux/signal.c +++ b/arch/um/os-Linux/signal.c @@ -41,9 +41,10 @@ static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc) int save_errno = errno; r.is_user = 0; + if (mc) + get_regs_from_mc(&r, mc); if (sig == SIGSEGV) { /* For segfaults, we want the data from the sigcontext. */ - get_regs_from_mc(&r, mc); GET_FAULTINFO_FROM_MC(r.faultinfo, mc); } diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c index 9bc630995df9..cf5a347ee9b1 100644 --- a/arch/x86/um/nommu/do_syscall_64.c +++ b/arch/x86/um/nommu/do_syscall_64.c @@ -44,6 +44,9 @@ __visible void do_syscall_64(struct pt_regs *regs) /* set fs register to the original host one */ os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs); + /* save fp registers */ + asm volatile("fxsaveq %0" : "=m"(*(struct _xstate *)regs->regs.fp)); + if (likely(syscall < NR_syscalls)) { unsigned long ret; @@ -61,6 +64,9 @@ __visible void do_syscall_64(struct pt_regs *regs) /* handle tasks and signals at the end */ interrupt_end(); + /* restore fp registers */ + asm volatile("fxrstorq %0" : : "m"((current->thread.regs.regs.fp))); + /* restore back fs register to userspace configured one */ os_x86_arch_prctl(0, ARCH_SET_FS, (void *)(current->thread.regs.regs.gp[FS_BASE diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c 
b/arch/x86/um/nommu/os-Linux/mcontext.c index b62a6195096f..afa20f1e235a 100644 --- a/arch/x86/um/nommu/os-Linux/mcontext.c +++ b/arch/x86/um/nommu/os-Linux/mcontext.c @@ -4,10 +4,21 @@ #include #include #include +#include +#include "../syscalls.h" extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3, int64_t a4, int64_t a5, int64_t a6); +void set_mc_relay_signal(mcontext_t *mc) +{ + /* configure stack and userspace returning routine as + * instruction pointer + */ + mc->gregs[REG_RSP] = (unsigned long) current_top_of_stack; + mc->gregs[REG_RIP] = (unsigned long) userspace; +} + void set_mc_sigsys_hook(mcontext_t *mc) { mc->gregs[REG_RCX] = mc->gregs[REG_RIP]; diff --git a/arch/x86/um/shared/sysdep/mcontext.h b/arch/x86/um/shared/sysdep/mcontext.h index 9a0d6087f357..82a5f38b350f 100644 --- a/arch/x86/um/shared/sysdep/mcontext.h +++ b/arch/x86/um/shared/sysdep/mcontext.h @@ -19,6 +19,7 @@ extern int set_stub_state(struct uml_pt_regs *regs, struct stub_data *data, #ifndef CONFIG_MMU extern void set_mc_sigsys_hook(mcontext_t *mc); +extern void set_mc_relay_signal(mcontext_t *mc); #endif #ifdef __i386__ diff --git a/arch/x86/um/shared/sysdep/ptrace.h b/arch/x86/um/shared/sysdep/ptrace.h index 572ea2d79131..6ed6bb1ca50e 100644 --- a/arch/x86/um/shared/sysdep/ptrace.h +++ b/arch/x86/um/shared/sysdep/ptrace.h @@ -53,7 +53,7 @@ struct uml_pt_regs { int is_user; /* Dynamically sized FP registers (holds an XSTATE) */ - unsigned long fp[]; + unsigned long fp[] __attribute__((aligned(16))); }; #define EMPTY_UML_PT_REGS { } -- 2.43.0 From thehajime at gmail.com Sun Nov 2 01:49:35 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sun, 2 Nov 2025 18:49:35 +0900 Subject: [PATCH v12 10/13] um: change machine name for uname output In-Reply-To: References: Message-ID: This commit tries to display MMU/!MMU mode from the output of uname(2) so that users can distinguish which mode of UML is running right now. 
Signed-off-by: Hajime Tazaki --- arch/um/Makefile | 6 ++++++ arch/um/os-Linux/util.c | 3 ++- 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/um/Makefile b/arch/um/Makefile index 5371c9a1b11e..9bc8fc149514 100644 --- a/arch/um/Makefile +++ b/arch/um/Makefile @@ -153,6 +153,12 @@ export CFLAGS_vmlinux := $(LINK-y) $(LINK_WRAPS) $(LD_FLAGS_CMDLINE) $(CC_FLAGS_ CLEAN_FILES += linux x.i gmon.out MRPROPER_FILES += $(HOST_DIR)/include/generated +ifeq ($(CONFIG_MMU),y) +UTS_MACHINE := "um" +else +UTS_MACHINE := "um\(nommu\)" +endif + archclean: @find . \( -name '*.bb' -o -name '*.bbg' -o -name '*.da' \ -o -name '*.gcov' \) -type f -print | xargs rm -f diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c index e3ad71a0d13c..5fb26f5dfcb6 100644 --- a/arch/um/os-Linux/util.c +++ b/arch/um/os-Linux/util.c @@ -64,7 +64,8 @@ void setup_machinename(char *machine_out) } # endif #endif - strcpy(machine_out, host.machine); + strcat(machine_out, "/"); + strcat(machine_out, host.machine); } void setup_hostinfo(char *buf, int len) -- 2.43.0 From thehajime at gmail.com Sun Nov 2 01:49:36 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sun, 2 Nov 2025 18:49:36 +0900 Subject: [PATCH v12 11/13] um: nommu: disable SMP on nommu UML In-Reply-To: References: Message-ID: <54839396f81bc2755728a53912bd8fcb19b889a1.1762075876.git.thehajime@gmail.com> CONFIG_SMP doesn't work with nommu UML since fs register handling of host does conflict with thread local storage (more specifically, the variable signals_enabled). Thus this commit disables the CONFIG option and the TLS variables. 
Signed-off-by: Hajime Tazaki --- arch/um/os-Linux/internal.h | 8 ++++++++ arch/x86/um/Kconfig | 2 +- 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/um/os-Linux/internal.h b/arch/um/os-Linux/internal.h index bac9fcc8c14c..25cb5cc931c1 100644 --- a/arch/um/os-Linux/internal.h +++ b/arch/um/os-Linux/internal.h @@ -6,6 +6,14 @@ #include #include +/* NOMMU doesn't work with thread-local storage used in CONFIG_SMP, + * due to the dependency on host_fs variable switch upon user/kernel + * context so, disable TLS until NOMMU supports SMP. + */ +#ifndef CONFIG_MMU +#define __thread +#endif + /* * elf_aux.c */ diff --git a/arch/x86/um/Kconfig b/arch/x86/um/Kconfig index c52fb5cb8d21..2bc18ecad783 100644 --- a/arch/x86/um/Kconfig +++ b/arch/x86/um/Kconfig @@ -13,7 +13,7 @@ config UML_X86 select ARCH_USE_QUEUED_SPINLOCKS select DCACHE_WORD_ACCESS select HAVE_EFFICIENT_UNALIGNED_ACCESS - select UML_SUBARCH_SUPPORTS_SMP if X86_CX8 + select UML_SUBARCH_SUPPORTS_SMP if X86_CX8 && MMU config 64BIT bool "64-bit kernel" if "$(SUBARCH)" = "x86" -- 2.43.0 From thehajime at gmail.com Sun Nov 2 01:49:37 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sun, 2 Nov 2025 18:49:37 +0900 Subject: [PATCH v12 12/13] um: nommu: add documentation of nommu UML In-Reply-To: References: Message-ID: <5a831d893431c15a1bc2833cedc5a45cdfa44cb9.1762075876.git.thehajime@gmail.com> This commit adds an initial documentation for !MMU mode of UML. Signed-off-by: Hajime Tazaki --- Documentation/virt/uml/nommu-uml.rst | 180 +++++++++++++++++++++++++++ MAINTAINERS | 1 + 2 files changed, 181 insertions(+) create mode 100644 Documentation/virt/uml/nommu-uml.rst diff --git a/Documentation/virt/uml/nommu-uml.rst b/Documentation/virt/uml/nommu-uml.rst new file mode 100644 index 000000000000..f049bbc697d1 --- /dev/null +++ b/Documentation/virt/uml/nommu-uml.rst @@ -0,0 +1,180 @@ +.. SPDX-License-Identifier: GPL-2.0 + +UML has been built with CONFIG_MMU since day 0. 
The patchset +introduces the nommu mode on UML from a different angle than what Linux +Kernel Library tried. + +.. contents:: :local: + +What is it for ? +================ + +- Alleviate syscall hook overhead implemented with ptrace(2) +- Exercise nommu code over UML (and over KUnit) +- Less dependency on host facilities + + +How it works ? +============== + +To illustrate how this feature works, the list below shows how syscalls are +called in a nommu UML environment. + +- boot kernel, install a seccomp filter that triggers when ``syscall`` + instructions are issued from userspace memory, based on the address of + the instruction pointer +- (userspace starts) +- calls ``vfork``/``execve`` syscalls +- ``SIGSYS`` signal raised, handler calls syscall entry point ``__kernel_vsyscall`` +- call handler function in ``sys_call_table[]`` and follow how UML syscall + works. +- return to userspace + + +What are the differences from MMU-full UML ? +============================================ + +The current nommu implementation adds 3 different functions which +MMU-full UML doesn't have: + +- kernel address space is directly accessible from userspace + - so, ``uaccess()`` always returns 1 + - generic implementation of memcpy/strcpy/futex is also used +- alternate syscall entrypoint without ptrace +- alternate syscall hook + - hook syscall by seccomp filter + +Those modifications allow us to use unmodified userspace +binaries with nommu UML. + + +History +======= + +This feature was originally introduced by Ricardo Koller at Open +Source Summit NA 2020, and was then integrated with the syscall translation +functionality, along with a cleanup of the original code. + +Building and run +================ + +:: + + make ARCH=um x86_64_nommu_defconfig + make ARCH=um + +will build UML with ``CONFIG_MMU=n`` applied. + +KUnit tests can be run with the following command:: + + ./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_MMU=n + +To run a typical Linux distribution, we need a nommu-aware userspace.
+We can use a stock version of Alpine Linux with nommu-built versions of +busybox and musl-libc. + + +Preparing root filesystem +========================= + +nommu UML requires a standard library that is aware +of nommu kernels. We have tested custom-built musl-libc and busybox, +both of which have built-in support for nommu kernels. + +No Linux distributions are available for nommu on the x86_64 +architecture, so we need to prepare our own image for the root +filesystem. We use Alpine Linux as a base distribution and replace +busybox and musl-libc on top of that. The following steps +prepare the filesystem for the quick start:: + + container_id=$(docker create ghcr.io/thehajime/alpine:3.20.3-um-nommu) + docker start $container_id + docker wait $container_id + docker export $container_id > alpine.tar + docker rm $container_id + + mnt=$(mktemp -d) + dd if=/dev/zero of=alpine.ext4 bs=1 count=0 seek=1G + sudo chmod og+wr "alpine.ext4" + yes 2>/dev/null | mkfs.ext4 "alpine.ext4" || true + sudo mount "alpine.ext4" $mnt + sudo tar -xf alpine.tar -C $mnt + sudo umount $mnt + +This will create a file image, ``alpine.ext4``, which contains nommu builds +of busybox and musl on top of the Alpine Linux root filesystem. The +file can be passed via the ``ubd0=`` argument on the UML command line:: + + ./vmlinux ubd0=./alpine.ext4 rw mem=1024m loglevel=8 init=/sbin/init + +We plan to upstream apk packages for busybox and musl so that we can +follow the proper procedure to set up the root filesystem. + + +Quick start with docker +======================= + +A docker image is available to get started quickly with a single step:: + + docker run -it -v /dev/shm:/dev/shm --rm ghcr.io/thehajime/alpine:3.20.3-um-nommu + +This will launch a UML instance with a pre-configured root filesystem. + +Benchmark +========= + +The following shows an example of a performance measurement conducted with +lmbench and a (self-crafted) getpid benchmark (with the v6.17-rc5 uml/next +tree).
+ +.. csv-table:: lmbench (usec) + :header: ,native,um,um-mmu(s),um-nommu(s) + + select-10 ,0.5319,36.1214,24.2795,2.9174 + select-100 ,1.6019,34.6049,28.8865,3.8080 + select-1000 ,12.2588,43.6838,48.7438,12.7872 + syscall ,0.1644,35.0321,53.2119,2.5981 + read ,0.3055,31.5509,45.8538,2.7068 + write ,0.2512,31.3609,29.2636,2.6948 + stat ,1.8894,43.8477,49.6121,3.1908 + open/close ,3.2973,77.5123,68.9431,6.2575 + fork+sh ,1110.3000,7359.5000,4618.6667,439.4615 + fork+execve ,510.8182,2834.0000,2461.1667,139.7848 + +.. csv-table:: do_getpid bench (nsec) + :header: ,native,um,um-mmu(s),um-nommu(s) + + getpid , 161 , 34477 , 26242 , 2599 + +(um-nommu(s) is nommu UML with the seccomp syscall hook; um-mmu(s) is +MMU UML in SECCOMP mode) + +Limitations +=========== + +generic nommu limitations +------------------------- +Since this port is a nommu kernel, the +implementation inherits the characteristics of other nommu kernels +(riscv, arm, etc.), described below. + +- vfork(2) should be used instead of fork(2) +- ELF loader only loads PIE (position independent executable) binaries +- processes share the address space with one another +- mmap(2) offers a subset of functionalities (e.g., MAP_FIXED is + unsupported) + +Thus, the options for userspace programs are limited. We have tested +Alpine Linux with musl-libc, which supports nommu kernels. + +supported architecture +---------------------- +The current implementation of nommu UML only works on x86_64 SUBARCH. +We have not tested it in a 32-bit environment.
+ + +Further readings about NOMMU UML +================================ + +- NOMMU UML (original code by Ricardo Koller) + - https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf diff --git a/MAINTAINERS b/MAINTAINERS index 3da2c26a796b..2f227f56d04e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -26764,6 +26764,7 @@ USER-MODE LINUX (UML) M: Richard Weinberger M: Anton Ivanov M: Johannes Berg +M: Hajime Tazaki L: linux-um at lists.infradead.org S: Maintained W: http://user-mode-linux.sourceforge.net -- 2.43.0 From thehajime at gmail.com Sun Nov 2 01:49:38 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sun, 2 Nov 2025 18:49:38 +0900 Subject: [PATCH v12 13/13] um: nommu: plug nommu code into build system In-Reply-To: References: Message-ID: Add nommu kernel for um build. defconfig is also provided. Signed-off-by: Hajime Tazaki Signed-off-by: Ricardo Koller --- arch/um/Kconfig | 14 ++++++- arch/um/configs/x86_64_nommu_defconfig | 54 ++++++++++++++++++++++++++ 2 files changed, 66 insertions(+), 2 deletions(-) create mode 100644 arch/um/configs/x86_64_nommu_defconfig diff --git a/arch/um/Kconfig b/arch/um/Kconfig index 097c6a6265ef..4907fd2db512 100644 --- a/arch/um/Kconfig +++ b/arch/um/Kconfig @@ -34,16 +34,19 @@ config UML select ARCH_SUPPORTS_LTO_CLANG_THIN select TRACE_IRQFLAGS_SUPPORT select TTY # Needed for line.c - select HAVE_ARCH_VMAP_STACK + select HAVE_ARCH_VMAP_STACK if MMU select HAVE_RUST select ARCH_HAS_UBSAN select HAVE_ARCH_TRACEHOOK select HAVE_SYSCALL_TRACEPOINTS select THREAD_INFO_IN_TASK select SPARSE_IRQ + select UACCESS_MEMCPY if !MMU + select GENERIC_STRNLEN_USER if !MMU + select GENERIC_STRNCPY_FROM_USER if !MMU config MMU - bool + bool "MMU-based Paged Memory Management Support" if 64BIT default y config UML_DMA_EMULATION @@ -225,8 +228,15 @@ config MAGIC_SYSRQ The keys are documented in . Don't say Y unless you really know what this hack does. 
+config ARCH_FORCE_MAX_ORDER + int "Order of maximal physically contiguous allocations" if EXPERT + default "10" if MMU + default "16" if !MMU + config KERNEL_STACK_ORDER int "Kernel stack size order" + default 3 if !MMU + range 3 10 if !MMU default 2 if 64BIT range 2 10 if 64BIT default 1 if !64BIT diff --git a/arch/um/configs/x86_64_nommu_defconfig b/arch/um/configs/x86_64_nommu_defconfig new file mode 100644 index 000000000000..02cb87091c9f --- /dev/null +++ b/arch/um/configs/x86_64_nommu_defconfig @@ -0,0 +1,54 @@ +CONFIG_SYSVIPC=y +CONFIG_POSIX_MQUEUE=y +CONFIG_NO_HZ=y +CONFIG_HIGH_RES_TIMERS=y +CONFIG_BSD_PROCESS_ACCT=y +CONFIG_IKCONFIG=y +CONFIG_IKCONFIG_PROC=y +CONFIG_LOG_BUF_SHIFT=14 +CONFIG_CGROUPS=y +CONFIG_BLK_CGROUP=y +CONFIG_CGROUP_SCHED=y +CONFIG_CGROUP_DEVICE=y +CONFIG_CGROUP_CPUACCT=y +# CONFIG_PID_NS is not set +CONFIG_CC_OPTIMIZE_FOR_SIZE=y +# CONFIG_MMU is not set +CONFIG_HOSTFS=y +CONFIG_MAGIC_SYSRQ=y +CONFIG_SSL=y +CONFIG_NULL_CHAN=y +CONFIG_PORT_CHAN=y +CONFIG_PTY_CHAN=y +CONFIG_TTY_CHAN=y +CONFIG_CON_CHAN="pts" +CONFIG_SSL_CHAN="pts" +CONFIG_MODULES=y +CONFIG_MODULE_UNLOAD=y +CONFIG_IOSCHED_BFQ=m +CONFIG_BINFMT_MISC=m +CONFIG_NET=y +CONFIG_PACKET=y +CONFIG_UNIX=y +CONFIG_INET=y +CONFIG_DEVTMPFS=y +CONFIG_DEVTMPFS_MOUNT=y +CONFIG_BLK_DEV_UBD=y +CONFIG_BLK_DEV_LOOP=m +CONFIG_BLK_DEV_NBD=m +CONFIG_DUMMY=m +CONFIG_TUN=m +CONFIG_PPP=m +CONFIG_SLIP=m +CONFIG_LEGACY_PTY_COUNT=32 +CONFIG_UML_RANDOM=y +CONFIG_EXT4_FS=y +CONFIG_QUOTA=y +CONFIG_AUTOFS_FS=m +CONFIG_ISO9660_FS=m +CONFIG_JOLIET=y +CONFIG_NLS=y +CONFIG_DEBUG_KERNEL=y +CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y +CONFIG_FRAME_WARN=1024 +CONFIG_IPV6=y -- 2.43.0 From jlayton at kernel.org Wed Nov 5 12:24:50 2025 From: jlayton at kernel.org (Jeff Layton) Date: Wed, 05 Nov 2025 15:24:50 -0500 Subject: [PATCH] vfs: remove the excl argument from the ->create() inode_operation Message-ID: <20251105-create-excl-v1-1-a4cce035cc55@kernel.org> Since ce8644fcadc5 ("lookup_open(): expand the call of 
vfs_create()"), the "excl" argument to the ->create() inode_operation is always set to true. Remove it, and fix up all of the create implementations. Signed-off-by: Jeff Layton --- The latest directory delegation patchset has a patch in it to clean up the arguments for vfs_create() [1]. If that looks sane, then I think this would be the next logical step. Full disclosure: I did use claude code to generate the first approximation, but I had to fix a number of things that it missed. I probably could have given it better prompts. In any case, I'm not sure how to properly attribute this (or if I even need to). [1]: https://lore.kernel.org/linux-nfs/20251105-dir-deleg-ro-v5-9-7ebc168a88ac at kernel.org/ --- fs/9p/vfs_inode.c | 2 +- fs/9p/vfs_inode_dotl.c | 2 +- fs/affs/affs.h | 2 +- fs/affs/namei.c | 2 +- fs/afs/dir.c | 4 ++-- fs/bad_inode.c | 2 +- fs/bfs/dir.c | 2 +- fs/btrfs/inode.c | 2 +- fs/ceph/dir.c | 2 +- fs/coda/dir.c | 2 +- fs/ecryptfs/inode.c | 2 +- fs/efivarfs/inode.c | 2 +- fs/exfat/namei.c | 2 +- fs/ext2/namei.c | 2 +- fs/ext4/namei.c | 2 +- fs/f2fs/namei.c | 2 +- fs/fat/namei_msdos.c | 2 +- fs/fat/namei_vfat.c | 2 +- fs/fuse/dir.c | 2 +- fs/gfs2/inode.c | 5 ++--- fs/hfs/dir.c | 2 +- fs/hfsplus/dir.c | 2 +- fs/hostfs/hostfs_kern.c | 2 +- fs/hpfs/namei.c | 2 +- fs/hugetlbfs/inode.c | 2 +- fs/jffs2/dir.c | 4 ++-- fs/jfs/namei.c | 2 +- fs/minix/namei.c | 2 +- fs/namei.c | 4 ++-- fs/nfs/dir.c | 4 ++-- fs/nfs/internal.h | 2 +- fs/nilfs2/namei.c | 2 +- fs/ntfs3/namei.c | 2 +- fs/ocfs2/dlmfs/dlmfs.c | 3 +-- fs/ocfs2/namei.c | 3 +-- fs/omfs/dir.c | 2 +- fs/orangefs/namei.c | 3 +-- fs/overlayfs/dir.c | 2 +- fs/ramfs/inode.c | 2 +- fs/smb/client/cifsfs.h | 2 +- fs/smb/client/dir.c | 2 +- fs/ubifs/dir.c | 2 +- fs/udf/namei.c | 2 +- fs/ufs/namei.c | 3 +-- fs/vboxsf/dir.c | 2 +- fs/xfs/xfs_iops.c | 3 +-- include/linux/fs.h | 4 ++-- ipc/mqueue.c | 2 +- mm/shmem.c | 2 +- 49 files changed, 55 insertions(+), 61 deletions(-) diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c 
index 69f378a837753e934c20b599660f8a756127e40a..595244d57cba62869b9af8b909af67d3c61e7f6c 100644 --- a/fs/9p/vfs_inode.c +++ b/fs/9p/vfs_inode.c @@ -643,7 +643,7 @@ v9fs_create(struct v9fs_session_info *v9ses, struct inode *dir, static int v9fs_vfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct v9fs_session_info *v9ses = v9fs_inode2v9ses(dir); u32 perm = unixmode2p9mode(v9ses, mode); diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c index 0b404e8484d22e2cbe60d846e0fa653001cdc4b1..de8fe9954d433c9b14ff5dd72ba13c3d5a67ebe7 100644 --- a/fs/9p/vfs_inode_dotl.c +++ b/fs/9p/vfs_inode_dotl.c @@ -218,7 +218,7 @@ int v9fs_open_to_dotl_flags(int flags) */ static int v9fs_vfs_create_dotl(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t omode, bool excl) + struct dentry *dentry, umode_t omode) { return v9fs_vfs_mknod_dotl(idmap, dir, dentry, omode, 0); } diff --git a/fs/affs/affs.h b/fs/affs/affs.h index ac4e9a02910b72d63c8ec5291347b54518e67f4b..665be23c42cfa206dc0a2c9ffa119b7c3c747389 100644 --- a/fs/affs/affs.h +++ b/fs/affs/affs.h @@ -167,7 +167,7 @@ extern int affs_hash_name(struct super_block *sb, const u8 *name, unsigned int l extern struct dentry *affs_lookup(struct inode *dir, struct dentry *dentry, unsigned int); extern int affs_unlink(struct inode *dir, struct dentry *dentry); extern int affs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool); + struct dentry *dentry, umode_t mode); extern struct dentry *affs_mkdir(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, umode_t mode); extern int affs_rmdir(struct inode *dir, struct dentry *dentry); diff --git a/fs/affs/namei.c b/fs/affs/namei.c index f883be50db122d3b09f0ae4d24618bd49b55186b..5591e1b5a2f68fc7600115e241f01f81d3aac010 100644 --- a/fs/affs/namei.c +++ b/fs/affs/namei.c @@ -243,7 +243,7 @@ affs_unlink(struct inode 
*dir, struct dentry *dentry) int affs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct super_block *sb = dir->i_sb; struct inode *inode; diff --git a/fs/afs/dir.c b/fs/afs/dir.c index 89d36e3e5c7999c2e448b78e86896d8893a8a7a9..09224aca8cad37ad273fd0c1ac292f0c15e078b5 100644 --- a/fs/afs/dir.c +++ b/fs/afs/dir.c @@ -32,7 +32,7 @@ static bool afs_lookup_one_filldir(struct dir_context *ctx, const char *name, in static bool afs_lookup_filldir(struct dir_context *ctx, const char *name, int nlen, loff_t fpos, u64 ino, unsigned dtype); static int afs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl); + struct dentry *dentry, umode_t mode); static struct dentry *afs_mkdir(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, umode_t mode); static int afs_rmdir(struct inode *dir, struct dentry *dentry); @@ -1637,7 +1637,7 @@ static const struct afs_operation_ops afs_create_operation = { * create a regular file on an AFS filesystem */ static int afs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct afs_operation *op; struct afs_vnode *dvnode = AFS_FS_I(dir); diff --git a/fs/bad_inode.c b/fs/bad_inode.c index 0ef9bcb744dd620bf47caa024d97a1316ff7bc89..5701361cf98155a61cb75a4ec602e8fc615eb3ae 100644 --- a/fs/bad_inode.c +++ b/fs/bad_inode.c @@ -29,7 +29,7 @@ static const struct file_operations bad_file_ops = static int bad_inode_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, bool excl) + umode_t mode) { return -EIO; } diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c index c375e22c4c0c15ba27307d266adfe3f093b90ab8..6beb8605c523cc2c7250d7b1a61508e103f0f3fd 100644 --- a/fs/bfs/dir.c +++ b/fs/bfs/dir.c @@ -76,7 +76,7 @@ const struct file_operations bfs_dir_operations = { }; static int 
bfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { int err; struct inode *inode; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3b1b3a0553eea06229255ad0284d76074bdb958a..8e06baeabae594850607366ea4f4f0fa41e3b464 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6816,7 +6816,7 @@ static int btrfs_mknod(struct mnt_idmap *idmap, struct inode *dir, } static int btrfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c index d18c0eaef9b7e7be7eb517c701d6c4af08fd78ac..308903dc0780dbed2382228005d0221f185c61ee 100644 --- a/fs/ceph/dir.c +++ b/fs/ceph/dir.c @@ -976,7 +976,7 @@ static int ceph_mknod(struct mnt_idmap *idmap, struct inode *dir, } static int ceph_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return ceph_mknod(idmap, dir, dentry, mode, 0); } diff --git a/fs/coda/dir.c b/fs/coda/dir.c index ca99900172657d80a479b2eb27f50effdf834995..554e7fd44e5df1aae6da2c41a492a02ae9e0d616 100644 --- a/fs/coda/dir.c +++ b/fs/coda/dir.c @@ -134,7 +134,7 @@ static inline void coda_dir_drop_nlink(struct inode *dir) /* creation routines: create, mknod, mkdir, link, symlink */ static int coda_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *de, umode_t mode, bool excl) + struct dentry *de, umode_t mode) { int error; const char *name=de->d_name.name; diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c index ba15e7359dfa6e150b577205991010873a633511..9a1ba68b16f3d6c4551e2d75e1e27309159c062e 100644 --- a/fs/ecryptfs/inode.c +++ b/fs/ecryptfs/inode.c @@ -262,7 +262,7 @@ int ecryptfs_initialize_file(struct dentry *ecryptfs_dentry, static int ecryptfs_create(struct mnt_idmap *idmap, struct inode *directory_inode, struct 
dentry *ecryptfs_dentry, - umode_t mode, bool excl) + umode_t mode) { struct inode *ecryptfs_inode; int rc; diff --git a/fs/efivarfs/inode.c b/fs/efivarfs/inode.c index 2891614abf8d554f563319187b6d54c2bc006a91..043b3e3a4f0adefe27855f8156b946c1dc4bd184 100644 --- a/fs/efivarfs/inode.c +++ b/fs/efivarfs/inode.c @@ -75,7 +75,7 @@ static bool efivarfs_valid_name(const char *str, int len) } static int efivarfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode = NULL; struct efivar_entry *var; diff --git a/fs/exfat/namei.c b/fs/exfat/namei.c index 7eb9c67fd35f4c54e18061a948806f20455675cf..c272a522c571044fd0cdc7630be30bdcec2ab8e5 100644 --- a/fs/exfat/namei.c +++ b/fs/exfat/namei.c @@ -543,7 +543,7 @@ static int exfat_add_entry(struct inode *inode, const char *path, } static int exfat_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct super_block *sb = dir->i_sb; struct inode *inode; diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index bde617a66cecd4a2bf12a713a2297bb4fee45916..edea7784ad39acd4afffc7f5ae6e50a20c04999d 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -101,7 +101,7 @@ struct dentry *ext2_get_parent(struct dentry *child) */ static int ext2_create (struct mnt_idmap * idmap, struct inode * dir, struct dentry * dentry, - umode_t mode, bool excl) + umode_t mode) { struct inode *inode; int err; diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 2cd36f59c9e363124ee949f742adccd88447295a..a1e77390a7ce300db02db9af90e45d69efabfea5 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2806,7 +2806,7 @@ static int ext4_add_nondir(handle_t *handle, * with d_instantiate(). 
*/ static int ext4_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { handle_t *handle; struct inode *inode; diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c index b882771e469971dcf4e7a42416f9fbb8a5d9bf39..9bcbb8b521501b22d0fe2238b7729c342e95baa4 100644 --- a/fs/f2fs/namei.c +++ b/fs/f2fs/namei.c @@ -351,7 +351,7 @@ static struct inode *f2fs_new_inode(struct mnt_idmap *idmap, } static int f2fs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct f2fs_sb_info *sbi = F2FS_I_SB(dir); struct inode *inode; diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c index 0b920ee40a7f9fe3c57af5d939d3efedf001a3d9..905ffa9e5b99f1507734d99b7c16dcad21d7b5b5 100644 --- a/fs/fat/namei_msdos.c +++ b/fs/fat/namei_msdos.c @@ -262,7 +262,7 @@ static int msdos_add_entry(struct inode *dir, const unsigned char *name, /***** Create a file */ static int msdos_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct super_block *sb = dir->i_sb; struct inode *inode = NULL; diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c index 5dbc4cbb8fce3d9b891cbc597f876c2c7b8d6aa0..8396b1ec4ec582fcdfadbcb12b04694ef0b8c5fc 100644 --- a/fs/fat/namei_vfat.c +++ b/fs/fat/namei_vfat.c @@ -754,7 +754,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct dentry *dentry, } static int vfat_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct super_block *sb = dir->i_sb; struct inode *inode; diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 667774cc72a1d49796f531fcb342d2e4878beb85..b7a2cee9b18313f88e745c5bb406bcc72866e390 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -889,7 +889,7 @@ static int fuse_mknod(struct mnt_idmap 
*idmap, struct inode *dir, } static int fuse_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *entry, umode_t mode, bool excl) + struct dentry *entry, umode_t mode) { return fuse_mknod(idmap, dir, entry, mode, 0); } diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c index 8a7ed80d9f2d6e829b240629bdd18b5e0d30b5fc..b8e399dd1182b6ede0bcf1aa78bd7f9f2dca8b2b 100644 --- a/fs/gfs2/inode.c +++ b/fs/gfs2/inode.c @@ -942,15 +942,14 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry, * @dir: The directory in which to create the file * @dentry: The dentry of the new file * @mode: The mode of the new file - * @excl: Force fail if inode exists * * Returns: errno */ static int gfs2_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { - return gfs2_create_inode(dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, excl); + return gfs2_create_inode(dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, 1); } /** diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c index 86a6b317b474a95f283f6a0908582efadde80892..c585942aa985686ca428d2d17f4401aa845a0eb8 100644 --- a/fs/hfs/dir.c +++ b/fs/hfs/dir.c @@ -190,7 +190,7 @@ static int hfs_dir_release(struct inode *inode, struct file *file) * the directory and the name (and its length) of the new file. 
*/ static int hfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; int res; diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c index 1b3e27a0d5e038b559bd19b37d769078b2996d1b..c5ea04e078340a91b992095e189e978a3345f03c 100644 --- a/fs/hfsplus/dir.c +++ b/fs/hfsplus/dir.c @@ -518,7 +518,7 @@ static int hfsplus_mknod(struct mnt_idmap *idmap, struct inode *dir, } static int hfsplus_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return hfsplus_mknod(&nop_mnt_idmap, dir, dentry, mode, 0); } diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c index 1e1acf5775ab5f6daf13bb917966d05f410d5ff5..18ca8cb9aa15e4015582ee5bd3db968c6b32de4b 100644 --- a/fs/hostfs/hostfs_kern.c +++ b/fs/hostfs/hostfs_kern.c @@ -593,7 +593,7 @@ static struct inode *hostfs_iget(struct super_block *sb, char *name) } static int hostfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; char *name; diff --git a/fs/hpfs/namei.c b/fs/hpfs/namei.c index 353e13a615f56664638f08a3408f90a727f5458b..809113d8248d50c0eaa57047b6c4bd87b9a5c6be 100644 --- a/fs/hpfs/namei.c +++ b/fs/hpfs/namei.c @@ -129,7 +129,7 @@ static struct dentry *hpfs_mkdir(struct mnt_idmap *idmap, struct inode *dir, } static int hpfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { const unsigned char *name = dentry->d_name.name; unsigned len = dentry->d_name.len; diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 9c94ed8c3ab0028772b7afb5d03a91d280c38106..0fd0d73e450bdedd92b953b9dd00f6babe1246e7 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -1001,7 +1001,7 @@ static struct dentry *hugetlbfs_mkdir(struct 
mnt_idmap *idmap, struct inode *dir static int hugetlbfs_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, bool excl) + umode_t mode) { return hugetlbfs_mknod(idmap, dir, dentry, mode | S_IFREG, 0); } diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c index dd91f725ded69ccb3a240aafd72a4b552f21bcd9..e77c84e43621a8c53e9852843f18cc3514315650 100644 --- a/fs/jffs2/dir.c +++ b/fs/jffs2/dir.c @@ -25,7 +25,7 @@ static int jffs2_readdir (struct file *, struct dir_context *); static int jffs2_create (struct mnt_idmap *, struct inode *, - struct dentry *, umode_t, bool); + struct dentry *, umode_t); static struct dentry *jffs2_lookup (struct inode *,struct dentry *, unsigned int); static int jffs2_link (struct dentry *,struct inode *,struct dentry *); @@ -161,7 +161,7 @@ static int jffs2_readdir(struct file *file, struct dir_context *ctx) static int jffs2_create(struct mnt_idmap *idmap, struct inode *dir_i, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct jffs2_raw_inode *ri; struct jffs2_inode_info *f, *dir_f; diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c index 65a218eba8faf9508f5727515b812f6de2661618..48111f8d3efe40becadd857c56c84ed09de867ef 100644 --- a/fs/jfs/namei.c +++ b/fs/jfs/namei.c @@ -60,7 +60,7 @@ static inline void free_ea_wmap(struct inode *inode) * */ static int jfs_create(struct mnt_idmap *idmap, struct inode *dip, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { int rc = 0; tid_t tid; /* transaction id */ diff --git a/fs/minix/namei.c b/fs/minix/namei.c index 8938536d8d3cf65c7e57f88f1819689365951fea..6540574f54781eab487074de7fe10ed38b1a8d1e 100644 --- a/fs/minix/namei.c +++ b/fs/minix/namei.c @@ -64,7 +64,7 @@ static int minix_tmpfile(struct mnt_idmap *idmap, struct inode *dir, } static int minix_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { 
return minix_mknod(&nop_mnt_idmap, dir, dentry, mode, 0); } diff --git a/fs/namei.c b/fs/namei.c index d5ab28947b2b6c6e19c7bb4a9140ccec407dc07c..83da60fc298e523096e881b25c727d14f9553476 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3493,7 +3493,7 @@ int vfs_create(struct mnt_idmap *idmap, struct dentry *dentry, umode_t mode, error = try_break_deleg(dir, di); if (error) return error; - error = dir->i_op->create(idmap, dir, dentry, mode, true); + error = dir->i_op->create(idmap, dir, dentry, mode); if (!error) fsnotify_create(dir, dentry); return error; @@ -3802,7 +3802,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, } error = dir_inode->i_op->create(idmap, dir_inode, dentry, - mode, open_flag & O_EXCL); + mode); if (error) goto out_dput; } diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 46d9c65d50f83fc1dc73f3d7f5868b84132bb0fd..7fe18efcd37b08030c7a4e17832801abfc19a3bd 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -2377,9 +2377,9 @@ static int nfs_do_create(struct inode *dir, struct dentry *dentry, } int nfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { - return nfs_do_create(dir, dentry, mode, excl ? 
O_EXCL : 0); + return nfs_do_create(dir, dentry, mode, O_EXCL); } EXPORT_SYMBOL_GPL(nfs_create); diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index 2ecd38e1d17a8053a9134702588d57efc35f49e9..b122c4f34f7b53c5102a8b5138efe269af433c81 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -398,7 +398,7 @@ extern unsigned long nfs_access_cache_scan(struct shrinker *shrink, struct dentry *nfs_lookup(struct inode *, struct dentry *, unsigned int); void nfs_d_prune_case_insensitive_aliases(struct inode *inode); int nfs_create(struct mnt_idmap *, struct inode *, struct dentry *, - umode_t, bool); + umode_t); struct dentry *nfs_mkdir(struct mnt_idmap *, struct inode *, struct dentry *, umode_t); int nfs_rmdir(struct inode *, struct dentry *); diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c index 40f4b1a28705b6e0eb8f0978cf3ac18b43aa1331..31d1d466c03048aaaab23f64c3f413c095939770 100644 --- a/fs/nilfs2/namei.c +++ b/fs/nilfs2/namei.c @@ -86,7 +86,7 @@ nilfs_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags) * with d_instantiate(). 
*/ static int nilfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; struct nilfs_transaction_info ti; diff --git a/fs/ntfs3/namei.c b/fs/ntfs3/namei.c index 82c8ae56beee6d79046dd6c8f02ff0f35e9a1ad3..49fe635b550d3f51f81138649b47c9c831a73e3b 100644 --- a/fs/ntfs3/namei.c +++ b/fs/ntfs3/namei.c @@ -105,7 +105,7 @@ static struct dentry *ntfs_lookup(struct inode *dir, struct dentry *dentry, * ntfs_create - inode_operations::create */ static int ntfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return ntfs_create_inode(idmap, dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, NULL); diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c index cccaa1d6fbbac13ebcaf14a9183277890708e643..bd4b2269598b49c6f88dd8d201e246ee5ed855a6 100644 --- a/fs/ocfs2/dlmfs/dlmfs.c +++ b/fs/ocfs2/dlmfs/dlmfs.c @@ -454,8 +454,7 @@ static struct dentry *dlmfs_mkdir(struct mnt_idmap * idmap, static int dlmfs_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, - bool excl) + umode_t mode) { int status = 0; struct inode *inode; diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c index c90b254da75eb5b90d2af5e37d41e781efe8b836..7443f468f45657cf68779a02e4edf4e38fb70f59 100644 --- a/fs/ocfs2/namei.c +++ b/fs/ocfs2/namei.c @@ -666,8 +666,7 @@ static struct dentry *ocfs2_mkdir(struct mnt_idmap *idmap, static int ocfs2_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, - bool excl) + umode_t mode) { int ret; diff --git a/fs/omfs/dir.c b/fs/omfs/dir.c index 2ed541fccf331d796805dd1594fbf05c1f7f3b9a..a09a98f7e30bc66deca60725f9462d081b5e4784 100644 --- a/fs/omfs/dir.c +++ b/fs/omfs/dir.c @@ -286,7 +286,7 @@ static struct dentry *omfs_mkdir(struct mnt_idmap *idmap, struct inode *dir, } static int omfs_create(struct mnt_idmap 
*idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return omfs_add_node(dir, dentry, mode | S_IFREG); } diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c index bec5475de094dada6bb29eaf8520a875880f3bab..0ebaa7f000f26f1c1ecffd22cfe4272f20a783ed 100644 --- a/fs/orangefs/namei.c +++ b/fs/orangefs/namei.c @@ -18,8 +18,7 @@ static int orangefs_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, - bool exclusive) + umode_t mode) { struct orangefs_inode_s *parent = ORANGEFS_I(dir); struct orangefs_kernel_op_s *new_op; diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c index a5e9ddf3023b3942fafb9adb2770f26780a1b86b..0f70b3835f4a08c29d6bba8ae9143df55895e56b 100644 --- a/fs/overlayfs/dir.c +++ b/fs/overlayfs/dir.c @@ -704,7 +704,7 @@ static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev, } static int ovl_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return ovl_create_object(dentry, (mode & 07777) | S_IFREG, 0, NULL); } diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c index 41f9995da7cab0d11395cb40a98fb4936d52597f..b6502aaa4fb44d27c939da9fae4449af7edd28d4 100644 --- a/fs/ramfs/inode.c +++ b/fs/ramfs/inode.c @@ -129,7 +129,7 @@ static struct dentry *ramfs_mkdir(struct mnt_idmap *idmap, struct inode *dir, } static int ramfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return ramfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFREG, 0); } diff --git a/fs/smb/client/cifsfs.h b/fs/smb/client/cifsfs.h index e9534258d1efd0bb34f36bf2c725c64d0a8ca8f4..294c66cea2eca3344e09cd77619761e9cb79a807 100644 --- a/fs/smb/client/cifsfs.h +++ b/fs/smb/client/cifsfs.h @@ -50,7 +50,7 @@ extern void cifs_sb_deactive(struct super_block *sb); extern const struct inode_operations 
cifs_dir_inode_ops; extern struct inode *cifs_root_iget(struct super_block *); extern int cifs_create(struct mnt_idmap *, struct inode *, - struct dentry *, umode_t, bool excl); + struct dentry *, umode_t); extern int cifs_atomic_open(struct inode *, struct dentry *, struct file *, unsigned, umode_t); extern struct dentry *cifs_lookup(struct inode *, struct dentry *, diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c index da5597dbf5b9f140c6801158ac2357fa911c52ab..b00bc214db9f0e9533f481f41ac99ac8937610ac 100644 --- a/fs/smb/client/dir.c +++ b/fs/smb/client/dir.c @@ -566,7 +566,7 @@ cifs_atomic_open(struct inode *inode, struct dentry *direntry, } int cifs_create(struct mnt_idmap *idmap, struct inode *inode, - struct dentry *direntry, umode_t mode, bool excl) + struct dentry *direntry, umode_t mode) { int rc; unsigned int xid = get_xid(); diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c index 3c3d3ad4fa6cb719e9ec08fa2164c55371c017c1..4840a6f7974e254eba4ca249357e968764e326e0 100644 --- a/fs/ubifs/dir.c +++ b/fs/ubifs/dir.c @@ -303,7 +303,7 @@ static int ubifs_prepare_create(struct inode *dir, struct dentry *dentry, } static int ubifs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; struct ubifs_info *c = dir->i_sb->s_fs_info; diff --git a/fs/udf/namei.c b/fs/udf/namei.c index 5f2e9a892bffa9579143cedf71d80efa7ad6e9fb..f83b5564cbc4c68c02c07bb3ab2109bfabdc799d 100644 --- a/fs/udf/namei.c +++ b/fs/udf/namei.c @@ -371,7 +371,7 @@ static int udf_add_nondir(struct dentry *dentry, struct inode *inode) } static int udf_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode = udf_new_inode(dir, mode); diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c index 5b3c85c9324298f4ff6aa3d4feeb962ce5ede539..5012e056200aca671364d34a7faf647e6747e1d2 100644 --- a/fs/ufs/namei.c 
+++ b/fs/ufs/namei.c @@ -70,8 +70,7 @@ static struct dentry *ufs_lookup(struct inode * dir, struct dentry *dentry, unsi * with d_instantiate(). */ static int ufs_create (struct mnt_idmap * idmap, - struct inode * dir, struct dentry * dentry, umode_t mode, - bool excl) + struct inode * dir, struct dentry * dentry, umode_t mode) { struct inode *inode; diff --git a/fs/vboxsf/dir.c b/fs/vboxsf/dir.c index 42bedc4ec7af7709c564a7174805d185ce86f854..9ce4310c891044db17b6af98c06e3130002a7dda 100644 --- a/fs/vboxsf/dir.c +++ b/fs/vboxsf/dir.c @@ -298,7 +298,7 @@ static int vboxsf_dir_create(struct inode *parent, struct dentry *dentry, static int vboxsf_dir_mkfile(struct mnt_idmap *idmap, struct inode *parent, struct dentry *dentry, - umode_t mode, bool excl) + umode_t mode) { return vboxsf_dir_create(parent, dentry, mode, false, excl, NULL); } diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index caff0125faeac093c1c05a722d3588e3f2e99926..2bc7faac35678b5b78acd6a50695a0d7b1c9a263 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -293,8 +293,7 @@ xfs_vn_create( struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, - bool flags) + umode_t mode) { return xfs_generic_create(idmap, dir, dentry, mode, 0, NULL); } diff --git a/include/linux/fs.h b/include/linux/fs.h index 64323e618724bc20dc101db13035b042f5f88e4d..b9a32e10078f5a1a0bbeb0d8913ac3e4b5b3a85d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2345,8 +2345,8 @@ struct inode_operations { int (*readlink) (struct dentry *, char __user *,int); - int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, - umode_t, bool); + int (*create) (struct mnt_idmap *, struct inode *, struct dentry *, + umode_t); int (*link) (struct dentry *,struct inode *,struct dentry *); int (*unlink) (struct inode *,struct dentry *); int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *, diff --git a/ipc/mqueue.c b/ipc/mqueue.c index 
093551fe66a7eb884fc34ef853a0ca92b95770af..9ae28c79fe0578bf96b2d22daed45b48aba0b946 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -610,7 +610,7 @@ static int mqueue_create_attr(struct dentry *dentry, umode_t mode, void *arg) } static int mqueue_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return mqueue_create_attr(dentry, mode, NULL); } diff --git a/mm/shmem.c b/mm/shmem.c index b9081b817d28f3db1fbdd90ed3f04b6904d6ff18..8fdc9cbecb908e127f8173ca8888b5e038354fed 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3912,7 +3912,7 @@ static struct dentry *shmem_mkdir(struct mnt_idmap *idmap, struct inode *dir, } static int shmem_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return shmem_mknod(idmap, dir, dentry, mode | S_IFREG, 0); } --- base-commit: 76ddfe7d66d631e5e31ef4e5dd59797fa03acbf7 change-id: 20251105-create-excl-2b366d9bf3bb Best regards, -- Jeff Layton From neilb at ownmail.net Wed Nov 5 13:23:24 2025 From: neilb at ownmail.net (NeilBrown) Date: Thu, 06 Nov 2025 08:23:24 +1100 Subject: [PATCH] vfs: remove the excl argument from the ->create() inode_operation In-Reply-To: <20251105-create-excl-v1-1-a4cce035cc55@kernel.org> References: <20251105-create-excl-v1-1-a4cce035cc55@kernel.org> Message-ID: <176237780417.634289.15818324160940255011@noble.neil.brown.name> On Thu, 06 Nov 2025, Jeff Layton wrote: > Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"), > the "excl" argument to the ->create() inode_operation is always set to > true. Remove it, and fix up all of the create implementations. nonono > @@ -3802,7 +3802,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, > } > > error = dir_inode->i_op->create(idmap, dir_inode, dentry, > - mode, open_flag & O_EXCL); > + mode); "open_flag & O_EXCL" is not the same as "true". 
It is true that "all calls to vfs_create() pass true for 'excl'"
The same is NOT true for inode_operations.create.

NeilBrown

From jlayton at kernel.org Thu Nov 6 04:07:48 2025
From: jlayton at kernel.org (Jeff Layton)
Date: Thu, 06 Nov 2025 07:07:48 -0500
Subject: [PATCH] vfs: remove the excl argument from the ->create() inode_operation
In-Reply-To: <176237780417.634289.15818324160940255011@noble.neil.brown.name>
References: <20251105-create-excl-v1-1-a4cce035cc55@kernel.org> <176237780417.634289.15818324160940255011@noble.neil.brown.name>
Message-ID: <6758176514cdd6e2ceacb3bd0e4d63fb8784b7c6.camel@kernel.org>

On Thu, 2025-11-06 at 08:23 +1100, NeilBrown wrote:
> On Thu, 06 Nov 2025, Jeff Layton wrote:
> > Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
> > the "excl" argument to the ->create() inode_operation is always set to
> > true. Remove it, and fix up all of the create implementations.
>
> nonono
>
>
> > @@ -3802,7 +3802,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> > }
> >
> > error = dir_inode->i_op->create(idmap, dir_inode, dentry,
> > - mode, open_flag & O_EXCL);
> > + mode);
>
> "open_flag & O_EXCL" is not the same as "true".
>
> It is true that "all calls to vfs_create() pass true for 'excl'"
> The same is NOT true for inode_operations.create.
>

I don't think this is a problem, actually:

Almost all of the existing ->create() operations ignore the "excl" bool. There are only two that I found that do not: NFS and GFS2. Both of those have an ->atomic_open() operation though, so lookup_open() will never call ->create() for those filesystems. This means that ->create() _is_ always called with excl == true.
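
The reasoning above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not kernel code: `struct toy_iops`, `enum toy_caller` and `toy_create_can_see_nonexcl()` are hypothetical names invented here, and the model reduces lookup_open() to the single decision that matters for this argument (use ->atomic_open() if the filesystem has one, otherwise call ->create() with excl taken from O_EXCL).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant properties of a filesystem's
 * inode_operations. This is NOT the kernel structure. */
struct toy_iops {
	bool has_atomic_open;  /* filesystem provides ->atomic_open() */
	bool create_uses_excl; /* its ->create() actually inspects "excl" */
};

/* The two call sites of ->create() discussed in the thread. */
enum toy_caller { TOY_VFS_CREATE, TOY_LOOKUP_OPEN };

/*
 * Return true if a ->create() implementation that cares about "excl"
 * could be called with excl == false in this model.
 */
bool toy_create_can_see_nonexcl(const struct toy_iops *ops,
				enum toy_caller caller, bool o_excl)
{
	bool excl;

	if (caller == TOY_VFS_CREATE) {
		/* vfs_create() always passes true. */
		excl = true;
	} else {
		/* lookup_open(): with ->atomic_open(), ->create() is never called. */
		if (ops->has_atomic_open)
			return false;
		excl = o_excl;
	}

	return ops->create_uses_excl && !excl;
}
```

In this model, an NFS-like or GFS2-like entry (`has_atomic_open` and `create_uses_excl` both true) never observes excl == false, and neither does a filesystem whose ->create() ignores the flag. Only a filesystem that inspects "excl" but lacks ->atomic_open() could, and per the survey above no such in-tree filesystem exists.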
-- 
Jeff Layton

From jlayton at kernel.org Thu Nov 6 10:01:20 2025
From: jlayton at kernel.org (Jeff Layton)
Date: Thu, 06 Nov 2025 13:01:20 -0500
Subject: [PATCH] vfs: remove the excl argument from the ->create() inode_operation
In-Reply-To: <6758176514cdd6e2ceacb3bd0e4d63fb8784b7c6.camel@kernel.org>
References: <20251105-create-excl-v1-1-a4cce035cc55@kernel.org> <176237780417.634289.15818324160940255011@noble.neil.brown.name> <6758176514cdd6e2ceacb3bd0e4d63fb8784b7c6.camel@kernel.org>
Message-ID:

On Thu, 2025-11-06 at 07:07 -0500, Jeff Layton wrote:
> On Thu, 2025-11-06 at 08:23 +1100, NeilBrown wrote:
> > On Thu, 06 Nov 2025, Jeff Layton wrote:
> > > Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
> > > the "excl" argument to the ->create() inode_operation is always set to
> > > true. Remove it, and fix up all of the create implementations.
> >
> > nonono
> >
> >
> > > @@ -3802,7 +3802,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> > > }
> > >
> > > error = dir_inode->i_op->create(idmap, dir_inode, dentry,
> > > - mode, open_flag & O_EXCL);
> > > + mode);
> >
> > "open_flag & O_EXCL" is not the same as "true".
> >
> > It is true that "all calls to vfs_create() pass true for 'excl'"
> > The same is NOT true for inode_operations.create.
> >
>
> I don't think this is a problem, actually:
>
> Almost all of the existing ->create() operations ignore the "excl"
> bool. There are only two that I found that do not: NFS and GFS2. Both
> of those have an ->atomic_open() operation though, so lookup_open()
> will never call ->create() for those filesystems. This means that
> ->create() _is_ always called with excl == true.

How about this for a revised changelog, which makes the above clear:

vfs: remove the excl argument from the ->create() inode_operation

Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"), the "excl" argument to the ->create() inode_operation is always set to true in vfs_create().

There is another call to ->create() in lookup_open() that can set it to either true or false. All of the ->create() operations in the kernel ignore the excl argument, except for NFS and GFS2. Both NFS and GFS2 have an ->atomic_open() operation, however, so lookup_open() will never call ->create() on those filesystems.

Remove the "excl" argument from the ->create() operation, and fix up the filesystems accordingly.

Maybe we also need some comments or updates to Documentation/ to make it clear that ->create() always implies O_EXCL semantics?

-- 
Jeff Layton

From neilb at ownmail.net Thu Nov 6 16:00:34 2025
From: neilb at ownmail.net (NeilBrown)
Date: Fri, 07 Nov 2025 11:00:34 +1100
Subject: [PATCH] vfs: remove the excl argument from the ->create() inode_operation
In-Reply-To:
References: <20251105-create-excl-v1-1-a4cce035cc55@kernel.org>, <176237780417.634289.15818324160940255011@noble.neil.brown.name>, <6758176514cdd6e2ceacb3bd0e4d63fb8784b7c6.camel@kernel.org>,
Message-ID: <176247363419.634289.473957828516111884@noble.neil.brown.name>

On Fri, 07 Nov 2025, Jeff Layton wrote:
> On Thu, 2025-11-06 at 07:07 -0500, Jeff Layton wrote:
> > On Thu, 2025-11-06 at 08:23 +1100, NeilBrown wrote:
> > > On Thu, 06 Nov 2025, Jeff Layton wrote:
> > > > Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
> > > > the "excl" argument to the ->create() inode_operation is always set to
> > > > true. Remove it, and fix up all of the create implementations.
> > >
> > > nonono
> > >
> > >
> > > > @@ -3802,7 +3802,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> > > > }
> > > >
> > > > error = dir_inode->i_op->create(idmap, dir_inode, dentry,
> > > > - mode, open_flag & O_EXCL);
> > > > + mode);
> > >
> > > "open_flag & O_EXCL" is not the same as "true".
> > >
> > > It is true that "all calls to vfs_create() pass true for 'excl'"
> > > The same is NOT true for inode_operations.create.
> >
> >
> > I don't think this is a problem, actually:
> >
> > Almost all of the existing ->create() operations ignore the "excl"
> > bool. There are only two that I found that do not: NFS and GFS2. Both
> > of those have an ->atomic_open() operation though, so lookup_open()
> > will never call ->create() for those filesystems. This means that
> > ->create() _is_ always called with excl == true.
>
> How about this for a revised changelog, which makes the above clear:
>
> vfs: remove the excl argument from the ->create() inode_operation
>
> Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
> the "excl" argument to the ->create() inode_operation is always set to
> true in vfs_create().
>
> There is another call to ->create() in lookup_open() that can set it to
> either true or false. All of the ->create() operations in the kernel
> ignore the excl argument, except for NFS and GFS2. Both NFS and GFS2
> have an ->atomic_open() operation, however, so lookup_open() will never
> call ->create() on those filesystems.
>
> Remove the "excl" argument from the ->create() operation, and fix up the
> filesystems accordingly.

Thanks, that is a substantial improvement. I see your point now and I think this is a really nice cleanup to make - thanks.

I think the commit message could be improved further by leading with the detail that is central - that most ->create functions ignore 'excl'.

With two exceptions, ->create() methods provided by filesystems ignore the "excl" flag. Those exceptions are NFS and GFS2, which both also provide ->atomic_open.

excl is always true when ->create is called from vfs_create() (since commit......) so the only time it can be false is when it is called by lookup_open() for filesystems that do not provide ->atomic_open.

So the excl flag to ->create is either ignored or true. So we can remove it and change NFS and GFS2 to act as though it were true.

>
> Maybe we also need some comments or updates to Documentation/ to make
> it clear that ->create() always implies O_EXCL semantics?

Definitely, something in porting.rst and something in vfs.rst. It would be worth saying somewhere that if the fs needs to mediate non-exclusive creation, it must provide atomic_open().

Thanks,
NeilBrown

> --
> Jeff Layton
>

From jlayton at kernel.org Fri Nov 7 07:05:03 2025
From: jlayton at kernel.org (Jeff Layton)
Date: Fri, 07 Nov 2025 10:05:03 -0500
Subject: [PATCH v2] vfs: remove the excl argument from the ->create() inode_operation
Message-ID: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org>

With two exceptions, ->create() methods provided by filesystems ignore the "excl" flag. Those exceptions are NFS and GFS2, which both also provide ->atomic_open.

Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"), the "excl" argument to the ->create() inode_operation is always set to true in vfs_create(). The ->create() call in lookup_open() sets it according to the O_EXCL open flag, but is never called if the filesystem provides ->atomic_open().

The excl flag is therefore always either ignored or true. Remove it, and change NFS and GFS2 to act as if it were always true.

Signed-off-by: Jeff Layton
---
Note that this is based on top of the dir delegation series [1]. LMK if the Documentation/ updates are too wordy.

Full disclosure: I did use Claude code to generate the first approximation of this patch, but I had to fix a number of things that it missed. I probably could have given it better prompts. In any case, I'm not sure how to properly attribute this (or if I even need to).

[1]: https://lore.kernel.org/linux-nfs/20251105-dir-deleg-ro-v5-0-7ebc168a88ac at kernel.org/
---
Changes in v2:
- better describe why the argument isn't needed in the changelog
- updates to Documentation/
- Link to v1: https://lore.kernel.org/r/20251105-create-excl-v1-1-a4cce035cc55 at kernel.org
---
 Documentation/filesystems/porting.rst | 12 ++++++++++++
 Documentation/filesystems/vfs.rst     | 13 ++++++++++---
 fs/9p/vfs_inode.c                     |  2 +-
 fs/9p/vfs_inode_dotl.c                |  2 +-
 fs/affs/affs.h                        |  2 +-
 fs/affs/namei.c                       |  2 +-
 fs/afs/dir.c                          |  4 ++--
 fs/bad_inode.c                        |  2 +-
 fs/bfs/dir.c                          |  2 +-
 fs/btrfs/inode.c                      |  2 +-
 fs/ceph/dir.c                         |  2 +-
 fs/coda/dir.c                         |  2 +-
 fs/ecryptfs/inode.c                   |  2 +-
 fs/efivarfs/inode.c                   |  2 +-
 fs/exfat/namei.c                      |  2 +-
 fs/ext2/namei.c                       |  2 +-
 fs/ext4/namei.c                       |  2 +-
 fs/f2fs/namei.c                       |  2 +-
 fs/fat/namei_msdos.c                  |  2 +-
 fs/fat/namei_vfat.c                   |  2 +-
 fs/fuse/dir.c                         |  2 +-
 fs/gfs2/inode.c                       |  5 ++---
 fs/hfs/dir.c                          |  2 +-
 fs/hfsplus/dir.c                      |  2 +-
 fs/hostfs/hostfs_kern.c               |  2 +-
 fs/hpfs/namei.c                       |  2 +-
 fs/hugetlbfs/inode.c                  |  2 +-
 fs/jffs2/dir.c                        |  4 ++--
 fs/jfs/namei.c                        |  2 +-
 fs/minix/namei.c                      |  2 +-
 fs/namei.c                            |  4 ++--
 fs/nfs/dir.c                          |  4 ++--
 fs/nfs/internal.h                     |  2 +-
 fs/nilfs2/namei.c                     |  2 +-
 fs/ntfs3/namei.c                      |  2 +-
 fs/ocfs2/dlmfs/dlmfs.c                |  3 +--
 fs/ocfs2/namei.c                      |  3 +--
 fs/omfs/dir.c                         |  2 +-
 fs/orangefs/namei.c                   |  3 +--
 fs/overlayfs/dir.c                    |  2 +-
 fs/ramfs/inode.c                      |  2 +-
 fs/smb/client/cifsfs.h                |  2 +-
 fs/smb/client/dir.c                   |  2 +-
 fs/ubifs/dir.c                        |  2 +-
 fs/udf/namei.c                        |  2 +-
 fs/ufs/namei.c                        |  3 +--
 fs/vboxsf/dir.c                       |  2 +-
 fs/xfs/xfs_iops.c                     |  3 +--
 include/linux/fs.h                    |  4 ++--
 ipc/mqueue.c                          |  2 +-
 mm/shmem.c                            |  2 +-
 51 files changed, 77 insertions(+), 64 deletions(-)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 7233b04668fcce75f1ed170329a2cd18110a7d89..d71a3f5c626e578f0370986975ca50292c8e15c3 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1309,3 +1309,15 @@ a different length, use
 vfs_parse_fs_qstr(fc, key,
&QSTR_LEN(value, len)) instead. + +--- + +**mandatory** + +The ->create() operation has dropped the bool "excl" argument. This operation +should now always provide O_EXCL semantics (i.e. fail with -EEXIST if the file +exists). If the filesystem needs to handle the case where another entity could +create the file on the backing store after a negative lookup or revalidate +(e.g. it's a network filesystem and another client could create the file after +a negative lookup), then it will require ->atomic_open() in addition to +->create(). diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index 4f13b01e42eb5e2ad9d60cbbce7e47d09ad831e6..7a55e491e0c87a0d18909bd181754d6d68318059 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -467,7 +467,7 @@ As of kernel 2.6.22, the following members are defined: .. code-block:: c struct inode_operations { - int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool); + int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t); struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); int (*link) (struct dentry *,struct inode *,struct dentry *); int (*unlink) (struct inode *,struct dentry *); @@ -505,7 +505,10 @@ otherwise noted. if you want to support regular files. The dentry you get should not have an inode (i.e. it should be a negative dentry). Here you will probably call d_instantiate() with the dentry and the - newly created inode + newly created inode. This operation should always provide O_EXCL + semantics (i.e. it should fail with -EEXIST if the file exists). + If the filesystem needs to mediate non-exclusive creation, + then the filesystem must also provide an ->atomic_open() operation. ``lookup`` called when the VFS needs to look up an inode in a parent @@ -654,7 +657,11 @@ otherwise noted. handled by f_op->open(). If the file was created, FMODE_CREATED flag should be set in file->f_mode. 
In case of O_EXCL the method must only succeed if the file didn't exist and hence - FMODE_CREATED shall always be set on success. + FMODE_CREATED shall always be set on success. This method is + usually needed on filesystems where the dentry to be created could + unexpectedly become positive after the kernel has looked it up or + revalidated it. (e.g. another host racing in and creating the file + on an NFS server). ``tmpfile`` called in the end of O_TMPFILE open(). Optional, equivalent to diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c index 69f378a837753e934c20b599660f8a756127e40a..595244d57cba62869b9af8b909af67d3c61e7f6c 100644 --- a/fs/9p/vfs_inode.c +++ b/fs/9p/vfs_inode.c @@ -643,7 +643,7 @@ v9fs_create(struct v9fs_session_info *v9ses, struct inode *dir, static int v9fs_vfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct v9fs_session_info *v9ses = v9fs_inode2v9ses(dir); u32 perm = unixmode2p9mode(v9ses, mode); diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c index 0b404e8484d22e2cbe60d846e0fa653001cdc4b1..de8fe9954d433c9b14ff5dd72ba13c3d5a67ebe7 100644 --- a/fs/9p/vfs_inode_dotl.c +++ b/fs/9p/vfs_inode_dotl.c @@ -218,7 +218,7 @@ int v9fs_open_to_dotl_flags(int flags) */ static int v9fs_vfs_create_dotl(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t omode, bool excl) + struct dentry *dentry, umode_t omode) { return v9fs_vfs_mknod_dotl(idmap, dir, dentry, omode, 0); } diff --git a/fs/affs/affs.h b/fs/affs/affs.h index ac4e9a02910b72d63c8ec5291347b54518e67f4b..665be23c42cfa206dc0a2c9ffa119b7c3c747389 100644 --- a/fs/affs/affs.h +++ b/fs/affs/affs.h @@ -167,7 +167,7 @@ extern int affs_hash_name(struct super_block *sb, const u8 *name, unsigned int l extern struct dentry *affs_lookup(struct inode *dir, struct dentry *dentry, unsigned int); extern int affs_unlink(struct inode *dir, struct dentry *dentry); extern int 
affs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool); + struct dentry *dentry, umode_t mode); extern struct dentry *affs_mkdir(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, umode_t mode); extern int affs_rmdir(struct inode *dir, struct dentry *dentry); diff --git a/fs/affs/namei.c b/fs/affs/namei.c index f883be50db122d3b09f0ae4d24618bd49b55186b..5591e1b5a2f68fc7600115e241f01f81d3aac010 100644 --- a/fs/affs/namei.c +++ b/fs/affs/namei.c @@ -243,7 +243,7 @@ affs_unlink(struct inode *dir, struct dentry *dentry) int affs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct super_block *sb = dir->i_sb; struct inode *inode; diff --git a/fs/afs/dir.c b/fs/afs/dir.c index 89d36e3e5c7999c2e448b78e86896d8893a8a7a9..09224aca8cad37ad273fd0c1ac292f0c15e078b5 100644 --- a/fs/afs/dir.c +++ b/fs/afs/dir.c @@ -32,7 +32,7 @@ static bool afs_lookup_one_filldir(struct dir_context *ctx, const char *name, in static bool afs_lookup_filldir(struct dir_context *ctx, const char *name, int nlen, loff_t fpos, u64 ino, unsigned dtype); static int afs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl); + struct dentry *dentry, umode_t mode); static struct dentry *afs_mkdir(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, umode_t mode); static int afs_rmdir(struct inode *dir, struct dentry *dentry); @@ -1637,7 +1637,7 @@ static const struct afs_operation_ops afs_create_operation = { * create a regular file on an AFS filesystem */ static int afs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct afs_operation *op; struct afs_vnode *dvnode = AFS_FS_I(dir); diff --git a/fs/bad_inode.c b/fs/bad_inode.c index 
0ef9bcb744dd620bf47caa024d97a1316ff7bc89..5701361cf98155a61cb75a4ec602e8fc615eb3ae 100644 --- a/fs/bad_inode.c +++ b/fs/bad_inode.c @@ -29,7 +29,7 @@ static const struct file_operations bad_file_ops = static int bad_inode_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, bool excl) + umode_t mode) { return -EIO; } diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c index c375e22c4c0c15ba27307d266adfe3f093b90ab8..6beb8605c523cc2c7250d7b1a61508e103f0f3fd 100644 --- a/fs/bfs/dir.c +++ b/fs/bfs/dir.c @@ -76,7 +76,7 @@ const struct file_operations bfs_dir_operations = { }; static int bfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { int err; struct inode *inode; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3b1b3a0553eea06229255ad0284d76074bdb958a..8e06baeabae594850607366ea4f4f0fa41e3b464 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6816,7 +6816,7 @@ static int btrfs_mknod(struct mnt_idmap *idmap, struct inode *dir, } static int btrfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c index d18c0eaef9b7e7be7eb517c701d6c4af08fd78ac..308903dc0780dbed2382228005d0221f185c61ee 100644 --- a/fs/ceph/dir.c +++ b/fs/ceph/dir.c @@ -976,7 +976,7 @@ static int ceph_mknod(struct mnt_idmap *idmap, struct inode *dir, } static int ceph_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return ceph_mknod(idmap, dir, dentry, mode, 0); } diff --git a/fs/coda/dir.c b/fs/coda/dir.c index ca99900172657d80a479b2eb27f50effdf834995..554e7fd44e5df1aae6da2c41a492a02ae9e0d616 100644 --- a/fs/coda/dir.c +++ b/fs/coda/dir.c @@ -134,7 +134,7 @@ static inline void coda_dir_drop_nlink(struct inode *dir) /* creation 
routines: create, mknod, mkdir, link, symlink */ static int coda_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *de, umode_t mode, bool excl) + struct dentry *de, umode_t mode) { int error; const char *name=de->d_name.name; diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c index ba15e7359dfa6e150b577205991010873a633511..9a1ba68b16f3d6c4551e2d75e1e27309159c062e 100644 --- a/fs/ecryptfs/inode.c +++ b/fs/ecryptfs/inode.c @@ -262,7 +262,7 @@ int ecryptfs_initialize_file(struct dentry *ecryptfs_dentry, static int ecryptfs_create(struct mnt_idmap *idmap, struct inode *directory_inode, struct dentry *ecryptfs_dentry, - umode_t mode, bool excl) + umode_t mode) { struct inode *ecryptfs_inode; int rc; diff --git a/fs/efivarfs/inode.c b/fs/efivarfs/inode.c index 2891614abf8d554f563319187b6d54c2bc006a91..043b3e3a4f0adefe27855f8156b946c1dc4bd184 100644 --- a/fs/efivarfs/inode.c +++ b/fs/efivarfs/inode.c @@ -75,7 +75,7 @@ static bool efivarfs_valid_name(const char *str, int len) } static int efivarfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode = NULL; struct efivar_entry *var; diff --git a/fs/exfat/namei.c b/fs/exfat/namei.c index 7eb9c67fd35f4c54e18061a948806f20455675cf..c272a522c571044fd0cdc7630be30bdcec2ab8e5 100644 --- a/fs/exfat/namei.c +++ b/fs/exfat/namei.c @@ -543,7 +543,7 @@ static int exfat_add_entry(struct inode *inode, const char *path, } static int exfat_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct super_block *sb = dir->i_sb; struct inode *inode; diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index bde617a66cecd4a2bf12a713a2297bb4fee45916..edea7784ad39acd4afffc7f5ae6e50a20c04999d 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -101,7 +101,7 @@ struct dentry *ext2_get_parent(struct dentry *child) */ static int 
ext2_create (struct mnt_idmap * idmap, struct inode * dir, struct dentry * dentry, - umode_t mode, bool excl) + umode_t mode) { struct inode *inode; int err; diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 2cd36f59c9e363124ee949f742adccd88447295a..a1e77390a7ce300db02db9af90e45d69efabfea5 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2806,7 +2806,7 @@ static int ext4_add_nondir(handle_t *handle, * with d_instantiate(). */ static int ext4_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { handle_t *handle; struct inode *inode; diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c index b882771e469971dcf4e7a42416f9fbb8a5d9bf39..9bcbb8b521501b22d0fe2238b7729c342e95baa4 100644 --- a/fs/f2fs/namei.c +++ b/fs/f2fs/namei.c @@ -351,7 +351,7 @@ static struct inode *f2fs_new_inode(struct mnt_idmap *idmap, } static int f2fs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct f2fs_sb_info *sbi = F2FS_I_SB(dir); struct inode *inode; diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c index 0b920ee40a7f9fe3c57af5d939d3efedf001a3d9..905ffa9e5b99f1507734d99b7c16dcad21d7b5b5 100644 --- a/fs/fat/namei_msdos.c +++ b/fs/fat/namei_msdos.c @@ -262,7 +262,7 @@ static int msdos_add_entry(struct inode *dir, const unsigned char *name, /***** Create a file */ static int msdos_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct super_block *sb = dir->i_sb; struct inode *inode = NULL; diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c index 5dbc4cbb8fce3d9b891cbc597f876c2c7b8d6aa0..8396b1ec4ec582fcdfadbcb12b04694ef0b8c5fc 100644 --- a/fs/fat/namei_vfat.c +++ b/fs/fat/namei_vfat.c @@ -754,7 +754,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct dentry *dentry, } static int 
vfat_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct super_block *sb = dir->i_sb; struct inode *inode; diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 667774cc72a1d49796f531fcb342d2e4878beb85..b7a2cee9b18313f88e745c5bb406bcc72866e390 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -889,7 +889,7 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir, } static int fuse_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *entry, umode_t mode, bool excl) + struct dentry *entry, umode_t mode) { return fuse_mknod(idmap, dir, entry, mode, 0); } diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c index 8a7ed80d9f2d6e829b240629bdd18b5e0d30b5fc..b8e399dd1182b6ede0bcf1aa78bd7f9f2dca8b2b 100644 --- a/fs/gfs2/inode.c +++ b/fs/gfs2/inode.c @@ -942,15 +942,14 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry, * @dir: The directory in which to create the file * @dentry: The dentry of the new file * @mode: The mode of the new file - * @excl: Force fail if inode exists * * Returns: errno */ static int gfs2_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { - return gfs2_create_inode(dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, excl); + return gfs2_create_inode(dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, 1); } /** diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c index 86a6b317b474a95f283f6a0908582efadde80892..c585942aa985686ca428d2d17f4401aa845a0eb8 100644 --- a/fs/hfs/dir.c +++ b/fs/hfs/dir.c @@ -190,7 +190,7 @@ static int hfs_dir_release(struct inode *inode, struct file *file) * the directory and the name (and its length) of the new file. 
*/ static int hfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; int res; diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c index 1b3e27a0d5e038b559bd19b37d769078b2996d1b..c5ea04e078340a91b992095e189e978a3345f03c 100644 --- a/fs/hfsplus/dir.c +++ b/fs/hfsplus/dir.c @@ -518,7 +518,7 @@ static int hfsplus_mknod(struct mnt_idmap *idmap, struct inode *dir, } static int hfsplus_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return hfsplus_mknod(&nop_mnt_idmap, dir, dentry, mode, 0); } diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c index 1e1acf5775ab5f6daf13bb917966d05f410d5ff5..18ca8cb9aa15e4015582ee5bd3db968c6b32de4b 100644 --- a/fs/hostfs/hostfs_kern.c +++ b/fs/hostfs/hostfs_kern.c @@ -593,7 +593,7 @@ static struct inode *hostfs_iget(struct super_block *sb, char *name) } static int hostfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; char *name; diff --git a/fs/hpfs/namei.c b/fs/hpfs/namei.c index 353e13a615f56664638f08a3408f90a727f5458b..809113d8248d50c0eaa57047b6c4bd87b9a5c6be 100644 --- a/fs/hpfs/namei.c +++ b/fs/hpfs/namei.c @@ -129,7 +129,7 @@ static struct dentry *hpfs_mkdir(struct mnt_idmap *idmap, struct inode *dir, } static int hpfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { const unsigned char *name = dentry->d_name.name; unsigned len = dentry->d_name.len; diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 9c94ed8c3ab0028772b7afb5d03a91d280c38106..0fd0d73e450bdedd92b953b9dd00f6babe1246e7 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -1001,7 +1001,7 @@ static struct dentry *hugetlbfs_mkdir(struct 
mnt_idmap *idmap, struct inode *dir static int hugetlbfs_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, bool excl) + umode_t mode) { return hugetlbfs_mknod(idmap, dir, dentry, mode | S_IFREG, 0); } diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c index dd91f725ded69ccb3a240aafd72a4b552f21bcd9..e77c84e43621a8c53e9852843f18cc3514315650 100644 --- a/fs/jffs2/dir.c +++ b/fs/jffs2/dir.c @@ -25,7 +25,7 @@ static int jffs2_readdir (struct file *, struct dir_context *); static int jffs2_create (struct mnt_idmap *, struct inode *, - struct dentry *, umode_t, bool); + struct dentry *, umode_t); static struct dentry *jffs2_lookup (struct inode *,struct dentry *, unsigned int); static int jffs2_link (struct dentry *,struct inode *,struct dentry *); @@ -161,7 +161,7 @@ static int jffs2_readdir(struct file *file, struct dir_context *ctx) static int jffs2_create(struct mnt_idmap *idmap, struct inode *dir_i, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct jffs2_raw_inode *ri; struct jffs2_inode_info *f, *dir_f; diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c index 65a218eba8faf9508f5727515b812f6de2661618..48111f8d3efe40becadd857c56c84ed09de867ef 100644 --- a/fs/jfs/namei.c +++ b/fs/jfs/namei.c @@ -60,7 +60,7 @@ static inline void free_ea_wmap(struct inode *inode) * */ static int jfs_create(struct mnt_idmap *idmap, struct inode *dip, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { int rc = 0; tid_t tid; /* transaction id */ diff --git a/fs/minix/namei.c b/fs/minix/namei.c index 8938536d8d3cf65c7e57f88f1819689365951fea..6540574f54781eab487074de7fe10ed38b1a8d1e 100644 --- a/fs/minix/namei.c +++ b/fs/minix/namei.c @@ -64,7 +64,7 @@ static int minix_tmpfile(struct mnt_idmap *idmap, struct inode *dir, } static int minix_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { 
return minix_mknod(&nop_mnt_idmap, dir, dentry, mode, 0); } diff --git a/fs/namei.c b/fs/namei.c index d5ab28947b2b6c6e19c7bb4a9140ccec407dc07c..83da60fc298e523096e881b25c727d14f9553476 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3493,7 +3493,7 @@ int vfs_create(struct mnt_idmap *idmap, struct dentry *dentry, umode_t mode, error = try_break_deleg(dir, di); if (error) return error; - error = dir->i_op->create(idmap, dir, dentry, mode, true); + error = dir->i_op->create(idmap, dir, dentry, mode); if (!error) fsnotify_create(dir, dentry); return error; @@ -3802,7 +3802,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, } error = dir_inode->i_op->create(idmap, dir_inode, dentry, - mode, open_flag & O_EXCL); + mode); if (error) goto out_dput; } diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 46d9c65d50f83fc1dc73f3d7f5868b84132bb0fd..7fe18efcd37b08030c7a4e17832801abfc19a3bd 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -2377,9 +2377,9 @@ static int nfs_do_create(struct inode *dir, struct dentry *dentry, } int nfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { - return nfs_do_create(dir, dentry, mode, excl ? 
O_EXCL : 0); + return nfs_do_create(dir, dentry, mode, O_EXCL); } EXPORT_SYMBOL_GPL(nfs_create); diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index 2ecd38e1d17a8053a9134702588d57efc35f49e9..b122c4f34f7b53c5102a8b5138efe269af433c81 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -398,7 +398,7 @@ extern unsigned long nfs_access_cache_scan(struct shrinker *shrink, struct dentry *nfs_lookup(struct inode *, struct dentry *, unsigned int); void nfs_d_prune_case_insensitive_aliases(struct inode *inode); int nfs_create(struct mnt_idmap *, struct inode *, struct dentry *, - umode_t, bool); + umode_t); struct dentry *nfs_mkdir(struct mnt_idmap *, struct inode *, struct dentry *, umode_t); int nfs_rmdir(struct inode *, struct dentry *); diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c index 40f4b1a28705b6e0eb8f0978cf3ac18b43aa1331..31d1d466c03048aaaab23f64c3f413c095939770 100644 --- a/fs/nilfs2/namei.c +++ b/fs/nilfs2/namei.c @@ -86,7 +86,7 @@ nilfs_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags) * with d_instantiate(). 
*/ static int nilfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; struct nilfs_transaction_info ti; diff --git a/fs/ntfs3/namei.c b/fs/ntfs3/namei.c index 82c8ae56beee6d79046dd6c8f02ff0f35e9a1ad3..49fe635b550d3f51f81138649b47c9c831a73e3b 100644 --- a/fs/ntfs3/namei.c +++ b/fs/ntfs3/namei.c @@ -105,7 +105,7 @@ static struct dentry *ntfs_lookup(struct inode *dir, struct dentry *dentry, * ntfs_create - inode_operations::create */ static int ntfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return ntfs_create_inode(idmap, dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, NULL); diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c index cccaa1d6fbbac13ebcaf14a9183277890708e643..bd4b2269598b49c6f88dd8d201e246ee5ed855a6 100644 --- a/fs/ocfs2/dlmfs/dlmfs.c +++ b/fs/ocfs2/dlmfs/dlmfs.c @@ -454,8 +454,7 @@ static struct dentry *dlmfs_mkdir(struct mnt_idmap * idmap, static int dlmfs_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, - bool excl) + umode_t mode) { int status = 0; struct inode *inode; diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c index c90b254da75eb5b90d2af5e37d41e781efe8b836..7443f468f45657cf68779a02e4edf4e38fb70f59 100644 --- a/fs/ocfs2/namei.c +++ b/fs/ocfs2/namei.c @@ -666,8 +666,7 @@ static struct dentry *ocfs2_mkdir(struct mnt_idmap *idmap, static int ocfs2_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, - bool excl) + umode_t mode) { int ret; diff --git a/fs/omfs/dir.c b/fs/omfs/dir.c index 2ed541fccf331d796805dd1594fbf05c1f7f3b9a..a09a98f7e30bc66deca60725f9462d081b5e4784 100644 --- a/fs/omfs/dir.c +++ b/fs/omfs/dir.c @@ -286,7 +286,7 @@ static struct dentry *omfs_mkdir(struct mnt_idmap *idmap, struct inode *dir, } static int omfs_create(struct mnt_idmap 
*idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return omfs_add_node(dir, dentry, mode | S_IFREG); } diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c index bec5475de094dada6bb29eaf8520a875880f3bab..0ebaa7f000f26f1c1ecffd22cfe4272f20a783ed 100644 --- a/fs/orangefs/namei.c +++ b/fs/orangefs/namei.c @@ -18,8 +18,7 @@ static int orangefs_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, - bool exclusive) + umode_t mode) { struct orangefs_inode_s *parent = ORANGEFS_I(dir); struct orangefs_kernel_op_s *new_op; diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c index a5e9ddf3023b3942fafb9adb2770f26780a1b86b..0f70b3835f4a08c29d6bba8ae9143df55895e56b 100644 --- a/fs/overlayfs/dir.c +++ b/fs/overlayfs/dir.c @@ -704,7 +704,7 @@ static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev, } static int ovl_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return ovl_create_object(dentry, (mode & 07777) | S_IFREG, 0, NULL); } diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c index 41f9995da7cab0d11395cb40a98fb4936d52597f..b6502aaa4fb44d27c939da9fae4449af7edd28d4 100644 --- a/fs/ramfs/inode.c +++ b/fs/ramfs/inode.c @@ -129,7 +129,7 @@ static struct dentry *ramfs_mkdir(struct mnt_idmap *idmap, struct inode *dir, } static int ramfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return ramfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFREG, 0); } diff --git a/fs/smb/client/cifsfs.h b/fs/smb/client/cifsfs.h index e9534258d1efd0bb34f36bf2c725c64d0a8ca8f4..294c66cea2eca3344e09cd77619761e9cb79a807 100644 --- a/fs/smb/client/cifsfs.h +++ b/fs/smb/client/cifsfs.h @@ -50,7 +50,7 @@ extern void cifs_sb_deactive(struct super_block *sb); extern const struct inode_operations 
cifs_dir_inode_ops; extern struct inode *cifs_root_iget(struct super_block *); extern int cifs_create(struct mnt_idmap *, struct inode *, - struct dentry *, umode_t, bool excl); + struct dentry *, umode_t); extern int cifs_atomic_open(struct inode *, struct dentry *, struct file *, unsigned, umode_t); extern struct dentry *cifs_lookup(struct inode *, struct dentry *, diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c index da5597dbf5b9f140c6801158ac2357fa911c52ab..b00bc214db9f0e9533f481f41ac99ac8937610ac 100644 --- a/fs/smb/client/dir.c +++ b/fs/smb/client/dir.c @@ -566,7 +566,7 @@ cifs_atomic_open(struct inode *inode, struct dentry *direntry, } int cifs_create(struct mnt_idmap *idmap, struct inode *inode, - struct dentry *direntry, umode_t mode, bool excl) + struct dentry *direntry, umode_t mode) { int rc; unsigned int xid = get_xid(); diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c index 3c3d3ad4fa6cb719e9ec08fa2164c55371c017c1..4840a6f7974e254eba4ca249357e968764e326e0 100644 --- a/fs/ubifs/dir.c +++ b/fs/ubifs/dir.c @@ -303,7 +303,7 @@ static int ubifs_prepare_create(struct inode *dir, struct dentry *dentry, } static int ubifs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; struct ubifs_info *c = dir->i_sb->s_fs_info; diff --git a/fs/udf/namei.c b/fs/udf/namei.c index 5f2e9a892bffa9579143cedf71d80efa7ad6e9fb..f83b5564cbc4c68c02c07bb3ab2109bfabdc799d 100644 --- a/fs/udf/namei.c +++ b/fs/udf/namei.c @@ -371,7 +371,7 @@ static int udf_add_nondir(struct dentry *dentry, struct inode *inode) } static int udf_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode = udf_new_inode(dir, mode); diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c index 5b3c85c9324298f4ff6aa3d4feeb962ce5ede539..5012e056200aca671364d34a7faf647e6747e1d2 100644 --- a/fs/ufs/namei.c 
+++ b/fs/ufs/namei.c @@ -70,8 +70,7 @@ static struct dentry *ufs_lookup(struct inode * dir, struct dentry *dentry, unsi * with d_instantiate(). */ static int ufs_create (struct mnt_idmap * idmap, - struct inode * dir, struct dentry * dentry, umode_t mode, - bool excl) + struct inode * dir, struct dentry * dentry, umode_t mode) { struct inode *inode; diff --git a/fs/vboxsf/dir.c b/fs/vboxsf/dir.c index 42bedc4ec7af7709c564a7174805d185ce86f854..9ce4310c891044db17b6af98c06e3130002a7dda 100644 --- a/fs/vboxsf/dir.c +++ b/fs/vboxsf/dir.c @@ -298,7 +298,7 @@ static int vboxsf_dir_create(struct inode *parent, struct dentry *dentry, static int vboxsf_dir_mkfile(struct mnt_idmap *idmap, struct inode *parent, struct dentry *dentry, - umode_t mode, bool excl) + umode_t mode) { - return vboxsf_dir_create(parent, dentry, mode, false, excl, NULL); + return vboxsf_dir_create(parent, dentry, mode, false, true, NULL); } diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index caff0125faeac093c1c05a722d3588e3f2e99926..2bc7faac35678b5b78acd6a50695a0d7b1c9a263 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -293,8 +293,7 @@ xfs_vn_create( struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, - bool flags) + umode_t mode) { return xfs_generic_create(idmap, dir, dentry, mode, 0, NULL); } diff --git a/include/linux/fs.h b/include/linux/fs.h index 64323e618724bc20dc101db13035b042f5f88e4d..b9a32e10078f5a1a0bbeb0d8913ac3e4b5b3a85d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2345,8 +2345,8 @@ struct inode_operations { int (*readlink) (struct dentry *, char __user *,int); - int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, - umode_t, bool); + int (*create) (struct mnt_idmap *, struct inode *, struct dentry *, + umode_t); int (*link) (struct dentry *,struct inode *,struct dentry *); int (*unlink) (struct inode *,struct dentry *); int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *, diff --git a/ipc/mqueue.c b/ipc/mqueue.c index 
093551fe66a7eb884fc34ef853a0ca92b95770af..9ae28c79fe0578bf96b2d22daed45b48aba0b946 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -610,7 +610,7 @@ static int mqueue_create_attr(struct dentry *dentry, umode_t mode, void *arg) } static int mqueue_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return mqueue_create_attr(dentry, mode, NULL); } diff --git a/mm/shmem.c b/mm/shmem.c index b9081b817d28f3db1fbdd90ed3f04b6904d6ff18..8fdc9cbecb908e127f8173ca8888b5e038354fed 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3912,7 +3912,7 @@ static struct dentry *shmem_mkdir(struct mnt_idmap *idmap, struct inode *dir, } static int shmem_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return shmem_mknod(idmap, dir, dentry, mode | S_IFREG, 0); } --- base-commit: 76ddfe7d66d631e5e31ef4e5dd59797fa03acbf7 change-id: 20251105-create-excl-2b366d9bf3bb Best regards, -- Jeff Layton From neilb at ownmail.net Fri Nov 7 14:29:43 2025 From: neilb at ownmail.net (NeilBrown) Date: Sat, 08 Nov 2025 09:29:43 +1100 Subject: [PATCH v2] vfs: remove the excl argument from the ->create() inode_operation In-Reply-To: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org> References: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org> Message-ID: <176255458305.634289.5577159882824096330@noble.neil.brown.name> On Sat, 08 Nov 2025, Jeff Layton wrote: > With two exceptions, ->create() methods provided by filesystems ignore > the "excl" flag. Those exception are NFS and GFS2 which both also > provide ->atomic_open. > > Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"), > the "excl" argument to the ->create() inode_operation is always set to > true in vfs_create(). 
The ->create() call in lookup_open() sets it > according to the O_EXCL open flag, but is never called if the filesystem > provides ->atomic_open(). > > The excl flag is therefore always either ignored or true. Remove it, > and change NFS and GFS2 to act as if it were always true. > > Signed-off-by: Jeff Layton > --- > Note that this is based on top of the dir delegation series [1]. LMK > if the Documentation/ updates are too wordy. Patch is very nice. I don't think the documentation is too wordy. I think it is good that the two changes to the different files say essentially the same thing but use different words. That helps. Reviewed-by: NeilBrown > > Full disclosure: I did use Claude code to generate the first > approximation of this patch, but I had to fix a number of things that it > missed. I probably could have given it better prompts. In any case, I'm > not sure how to properly attribute this (or if I even need to). My understanding is that if you fully understand (and can defend) the code change with all its motivations and implications as well as if you had written it yourself, then you don't need to attribute whatever fancy text editor or IDE (e.g. Claude) that you used to help produce the patch. Thanks, NeilBrown From corbet at lwn.net Fri Nov 7 14:35:17 2025 From: corbet at lwn.net (Jonathan Corbet) Date: Fri, 07 Nov 2025 15:35:17 -0700 Subject: LLM disclosure (was: [PATCH v2] vfs: remove the excl argument from the ->create() inode_operation) In-Reply-To: <176255458305.634289.5577159882824096330@noble.neil.brown.name> References: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org> <176255458305.634289.5577159882824096330@noble.neil.brown.name> Message-ID: <87ikfl1nfe.fsf@trenco.lwn.net> NeilBrown writes: > On Sat, 08 Nov 2025, Jeff Layton wrote: >> Full disclosure: I did use Claude code to generate the first >> approximation of this patch, but I had to fix a number of things that it >> missed. I probably could have given it better prompts. 
In any case, I'm >> not sure how to properly attribute this (or if I even need to). > > My understanding is that if you fully understand (and can defend) the > code change with all its motivations and implications as well as if you > had written it yourself, then you don't need to attribute whatever fancy > text editor or IDE (e.g. Claude) that you used to help produce the > patch. The proposed policy for such things is here, under review right now: https://lore.kernel.org/all/20251105231514.3167738-1-dave.hansen at linux.intel.com/ jon From jlayton at kernel.org Fri Nov 7 15:19:24 2025 From: jlayton at kernel.org (Jeff Layton) Date: Fri, 07 Nov 2025 18:19:24 -0500 Subject: LLM disclosure (was: [PATCH v2] vfs: remove the excl argument from the ->create() inode_operation) In-Reply-To: <87ikfl1nfe.fsf@trenco.lwn.net> References: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org> <176255458305.634289.5577159882824096330@noble.neil.brown.name> <87ikfl1nfe.fsf@trenco.lwn.net> Message-ID: On Fri, 2025-11-07 at 15:35 -0700, Jonathan Corbet wrote: > NeilBrown writes: > > > On Sat, 08 Nov 2025, Jeff Layton wrote: > > > > Full disclosure: I did use Claude code to generate the first > > > approximation of this patch, but I had to fix a number of things that it > > > missed. I probably could have given it better prompts. In any case, I'm > > > not sure how to properly attribute this (or if I even need to). > > > > My understanding is that if you fully understand (and can defend) the > > code change with all its motivations and implications as well as if you > > had written it yourself, then you don't need to attribute whatever fancy > > text editor or IDE (e.g. Claude) that you used to help produce the > > patch. > > The proposed policy for such things is here, under review right now: > > https://lore.kernel.org/all/20251105231514.3167738-1-dave.hansen at linux.intel.com/ > > jon Thanks Jon. 
I'm guessing that this would fall under the "menial task" classification, and therefore doesn't need attribution. This seems applicable: + - Purely mechanical transformations like variable renaming This is a little different, but it's a similar rote task. -- Jeff Layton From neilb at ownmail.net Fri Nov 7 15:37:30 2025 From: neilb at ownmail.net (NeilBrown) Date: Sat, 08 Nov 2025 10:37:30 +1100 Subject: LLM disclosure (was: [PATCH v2] vfs: remove the excl argument from the ->create() inode_operation) In-Reply-To: References: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org>, <176255458305.634289.5577159882824096330@noble.neil.brown.name>, <87ikfl1nfe.fsf@trenco.lwn.net>, Message-ID: <176255865045.634289.1814933499430115577@noble.neil.brown.name> On Sat, 08 Nov 2025, Jeff Layton wrote: > On Fri, 2025-11-07 at 15:35 -0700, Jonathan Corbet wrote: > > NeilBrown writes: > > > > > On Sat, 08 Nov 2025, Jeff Layton wrote: > > > > > > Full disclosure: I did use Claude code to generate the first > > > > approximation of this patch, but I had to fix a number of things that it > > > > missed. I probably could have given it better prompts. In any case, I'm > > > > not sure how to properly attribute this (or if I even need to). > > > > > > My understanding is that if you fully understand (and can defend) the > > > code change with all its motivations and implications as well as if you > > > had written it yourself, then you don't need to attribute whatever fancy > > > text editor or IDE (e.g. Claude) that you used to help produce the > > > patch. > > > > The proposed policy for such things is here, under review right now: > > > > https://lore.kernel.org/all/20251105231514.3167738-1-dave.hansen at linux.intel.com/ > > > > jon > > Thanks Jon. > > I'm guessing that this would fall under the "menial task" > classification, and therefore doesn't need attribution. 
This seems > applicable: > > + - Purely mechanical transformations like variable renaming > > This is a little different, but it's a similar rote task. > -- > Jeff Layton > The bit I particularly liked was: + +Even if your tool use is out of scope you should still always consider +if it would help reviewing your contribution if the reviewer knows +about the tool that you used. + "would it help the reviewer"? I agree that is a key question. In your case I cannot see how it would help. Thanks, NeilBrown From asmadeus at codewreck.org Fri Nov 7 22:12:10 2025 From: asmadeus at codewreck.org (Dominique Martinet) Date: Sat, 8 Nov 2025 15:12:10 +0900 Subject: [PATCH v2] vfs: remove the excl argument from the ->create() inode_operation In-Reply-To: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org> References: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org> Message-ID: Jeff Layton wrote on Fri, Nov 07, 2025 at 10:05:03AM -0500: > With two exceptions, ->create() methods provided by filesystems ignore > the "excl" flag. Those exception are NFS and GFS2 which both also > provide ->atomic_open. > > Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"), > the "excl" argument to the ->create() inode_operation is always set to > true in vfs_create(). The ->create() call in lookup_open() sets it > according to the O_EXCL open flag, but is never called if the filesystem > provides ->atomic_open(). > > The excl flag is therefore always either ignored or true. Remove it, > and change NFS and GFS2 to act as if it were always true. > > Signed-off-by: Jeff Layton Good cleanup, just one whitespace nitpick below but: Reviewed-by: Dominique Martinet > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst > index 4f13b01e42eb5e2ad9d60cbbce7e47d09ad831e6..7a55e491e0c87a0d18909bd181754d6d68318059 100644 > --- a/Documentation/filesystems/vfs.rst > +++ b/Documentation/filesystems/vfs.rst > @@ -505,7 +505,10 @@ otherwise noted. 
> if you want to support regular files. The dentry you get should > not have an inode (i.e. it should be a negative dentry). Here > you will probably call d_instantiate() with the dentry and the > - newly created inode > + newly created inode. This operation should always provide O_EXCL This and the block below change halfway from tabs (old text) to spaces (your patch). Looks like the file has a few space-indented sections though, so it won't be the first if that goes in as is; the HTML rendering doesn't seem to care :) Cheers, -- Dominique Martinet | Asmadeus From thehajime at gmail.com Sat Nov 8 00:05:35 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sat, 8 Nov 2025 17:05:35 +0900 Subject: [PATCH v13 00/13] nommu UML Message-ID: This patchset is another spin of the nommu mode addition to UML. It would be nice to hear your opinions on it. There are still several limitations/issues that we have already found; here is the list of those issues. - memory mapped by loadable modules is not distinguished from userspace memory. - CONFIG_SMP is disabled, as host_fs handling doesn't work with thread local storage. 
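As background for the second limitation, here is a minimal sketch, assuming that "host_fs" refers to the host thread's FS segment base (an assumption suggested only by the patch title "um: nommu: configure fs register on host syscall invocation", not confirmed by this cover letter). On x86_64 Linux, thread-local variables are addressed relative to the FS base, so code that swaps the FS base cannot safely touch TLS until it is restored. The helper name read_fs_base() is hypothetical and is not part of the patchset; it only reads the current FS base through the arch_prctl(2) system call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>
#if defined(__x86_64__)
#include <asm/prctl.h>	/* ARCH_GET_FS */
#endif

/*
 * Hypothetical helper for illustration only (not from the patchset):
 * store the calling thread's FS segment base in *base.
 * Returns 0 on success, -1 on failure or on non-x86_64 builds.
 */
int read_fs_base(unsigned long *base)
{
	assert(base != NULL);
#if defined(__x86_64__)
	/* arch_prctl(ARCH_GET_FS, addr) writes the FS base to *addr. */
	return syscall(SYS_arch_prctl, ARCH_GET_FS, base) == 0 ? 0 : -1;
#else
	/* FS-based TLS addressing is specific to x86; nothing to read here. */
	return -1;
#endif
}
```

A caller would declare an unsigned long, pass its address to read_fs_base(), and check the return value before using the result. Each thread has its own FS base, which is one reason a single saved host value would not be enough once several host threads are involved.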
-- Hajime v13: - rebase with the latest uml/next branch, fixing a conflict ([06/13]) v12: - rebase with the latest uml/next branch - disable SMP and tls as those don't work with host_fs handling ([11/13]) - https://lore.kernel.org/all/cover.1762075876.git.thehajime at gmail.com/ v11: - clean up userspace return routine and integrate to userspace() ([04/13]) - fix direction flag issue on using nolibc memcpy ([04/13]) - fix a crash issue when using usermode helper ([06/13]) - test with out-of-tree kunit-uapi patches (which use umh) - https://lore.kernel.org/all/20250626-kunit-kselftests-v4-0-48760534fef5 at linutronix.de/ - https://lore.kernel.org/all/20250626195714.2123694-3-benjamin at sipsolutions.net/ - https://lore.kernel.org/all/cover.1758181109.git.thehajime at gmail.com/ v10: - fix wrong comment on gs register handling ([09/13]) - remove unnecessary code of early syscall implementation ([04/13]) - https://lore.kernel.org/all/cover.1750594487.git.thehajime at gmail.com/ v9: - rebase with the latest uml/next branch - add performance numbers of new SECCOMP mode, and update results ([12/13]) - add a workaround for upstream change on MMU dependency to PCI drivers ([10/13]) - https://lore.kernel.org/all/cover.1750294482.git.thehajime at gmail.com/ v8: - rebase with the latest uml/next branch - clean up segv_handler to align with the latest uml ([9/12]) - https://lore.kernel.org/all/cover.1745980082.git.thehajime at gmail.com/ v7: - properly handle FP register upon signal delivery [10/13] - update benchmark result with new FP register handling [12/13] - fix arch_has_single_step() for !MMU case [07/13] - revert stack alignment as it is in uml/fixes tree [10/13] - https://lore.kernel.org/all/cover.1737348399.git.thehajime at gmail.com/ v6: - rebase to the latest uml/next tree - more clean up on mmu/nommu for signal handling [10/13] - rename functions of mcontext routines [06,10/13] - added Acked-by tag for binfmt_elf_fdpic [02/13] - 
https://lore.kernel.org/linux-um/cover.1736853925.git.thehajime at gmail.com/ v5: - clean up stack manipulation code [05,06,07,10/13] - https://lore.kernel.org/linux-um/cover.1733998168.git.thehajime at gmail.com/ v4: - add arch/um/nommu, arch/x86/um/nommu to contain !MMU specific codes - remove zpoline patch - drop binfmt_elf_fdpic patch - reduce ifndef CONFIG_MMU if possible - split to elf header cleanup patch [01/13] - fix kernel test robot warnings [06/13] - fix coding styles [07/13] - move task_top_of_stack definition [05/13] - https://lore.kernel.org/linux-um/cover.1733652929.git.thehajime at gmail.com/ v3: - https://lore.kernel.org/linux-um/cover.1733199769.git.thehajime at gmail.com/ - add seccomp-based syscall hook in addition to zpoline [06/13] - remove RFC, add a line to MAINTAINERS file - fix kernel test robot warnings [02/13,08/13,10/13] - add base-commit tag to cover letter - pull the latest uml/next - clean up SIGSEGV handling [10/13] - detect fsgsbase availability with elf aux vector [08/13] - simplify vdso code with macros [09/13] RFC v2: - https://lore.kernel.org/linux-um/cover.1731290567.git.thehajime at gmail.com/ - base branch is now uml/linux.git instead of torvalds/linux.git. 
- reorganize the patch series to clean up - fixed various coding styles issues - clean up exec code path [07/13] - fixed the crash/SIGSEGV case on userspace programs [10/13] - add seccomp filter to limit syscall caller address [06/13] - detect fsgsbase availability with sigsetjmp/siglongjmp [08/13] - removes unrelated changes - removes unneeded ifndef CONFIG_MMU - convert UML_CONFIG_MMU to CONFIG_MMU as using uml/linux.git - proposed a patch of maple-tree issue (resolving a limitation in RFC v1) https://lore.kernel.org/linux-mm/20241108222834.3625217-1-thehajime at gmail.com/ RFC: - https://lore.kernel.org/linux-um/cover.1729770373.git.thehajime at gmail.com/ Hajime Tazaki (13): x86/um: nommu: elf loader for fdpic um: decouple MMU specific code from the common part um: nommu: memory handling x86/um: nommu: syscall handling um: nommu: seccomp syscalls hook x86/um: nommu: process/thread handling um: nommu: configure fs register on host syscall invocation x86/um/vdso: nommu: vdso memory update x86/um: nommu: signal handling um: change machine name for uname output um: nommu: disable SMP on nommu UML um: nommu: add documentation of nommu UML um: nommu: plug nommu code into build system Documentation/virt/uml/nommu-uml.rst | 180 ++++++++++++++++++++++ MAINTAINERS | 1 + arch/um/Kconfig | 14 +- arch/um/Makefile | 10 ++ arch/um/configs/x86_64_nommu_defconfig | 54 +++++++ arch/um/include/asm/futex.h | 4 + arch/um/include/asm/mmu.h | 8 + arch/um/include/asm/mmu_context.h | 2 + arch/um/include/asm/ptrace-generic.h | 8 +- arch/um/include/asm/uaccess.h | 7 +- arch/um/include/shared/kern_util.h | 6 + arch/um/include/shared/os.h | 16 ++ arch/um/kernel/Makefile | 5 +- arch/um/kernel/mem-pgtable.c | 55 +++++++ arch/um/kernel/mem.c | 38 +---- arch/um/kernel/process.c | 38 +++++ arch/um/kernel/skas/process.c | 37 ----- arch/um/kernel/um_arch.c | 3 + arch/um/nommu/Makefile | 3 + arch/um/nommu/os-Linux/Makefile | 7 + arch/um/nommu/os-Linux/seccomp.c | 87 +++++++++++ 
arch/um/nommu/os-Linux/signal.c | 24 +++ arch/um/nommu/trap.c | 201 +++++++++++++++++++++++++ arch/um/os-Linux/Makefile | 3 +- arch/um/os-Linux/internal.h | 8 + arch/um/os-Linux/mem.c | 4 + arch/um/os-Linux/process.c | 139 ++++++++++++++++- arch/um/os-Linux/signal.c | 11 +- arch/um/os-Linux/skas/process.c | 127 ---------------- arch/um/os-Linux/start_up.c | 25 ++- arch/um/os-Linux/util.c | 3 +- arch/x86/um/Kconfig | 2 +- arch/x86/um/Makefile | 7 +- arch/x86/um/asm/elf.h | 8 +- arch/x86/um/asm/syscall.h | 6 + arch/x86/um/nommu/Makefile | 8 + arch/x86/um/nommu/do_syscall_64.c | 75 +++++++++ arch/x86/um/nommu/entry_64.S | 114 ++++++++++++++ arch/x86/um/nommu/os-Linux/Makefile | 6 + arch/x86/um/nommu/os-Linux/mcontext.c | 26 ++++ arch/x86/um/nommu/syscalls.h | 18 +++ arch/x86/um/nommu/syscalls_64.c | 121 +++++++++++++++ arch/x86/um/shared/sysdep/mcontext.h | 5 + arch/x86/um/shared/sysdep/ptrace.h | 2 +- arch/x86/um/vdso/vma.c | 17 ++- fs/Kconfig.binfmt | 2 +- 46 files changed, 1322 insertions(+), 223 deletions(-) create mode 100644 Documentation/virt/uml/nommu-uml.rst create mode 100644 arch/um/configs/x86_64_nommu_defconfig create mode 100644 arch/um/kernel/mem-pgtable.c create mode 100644 arch/um/nommu/Makefile create mode 100644 arch/um/nommu/os-Linux/Makefile create mode 100644 arch/um/nommu/os-Linux/seccomp.c create mode 100644 arch/um/nommu/os-Linux/signal.c create mode 100644 arch/um/nommu/trap.c create mode 100644 arch/x86/um/nommu/Makefile create mode 100644 arch/x86/um/nommu/do_syscall_64.c create mode 100644 arch/x86/um/nommu/entry_64.S create mode 100644 arch/x86/um/nommu/os-Linux/Makefile create mode 100644 arch/x86/um/nommu/os-Linux/mcontext.c create mode 100644 arch/x86/um/nommu/syscalls.h create mode 100644 arch/x86/um/nommu/syscalls_64.c base-commit: 293f71435d14f5b5c46fc3398695fa265c69363d -- 2.43.0 From thehajime at gmail.com Sat Nov 8 00:05:36 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sat, 8 Nov 2025 17:05:36 +0900 Subject: [PATCH v13 
01/13] x86/um: nommu: elf loader for fdpic In-Reply-To: References: Message-ID: <59210140957e95ab0df73125bfdb035913a468b1.1762588860.git.thehajime@gmail.com> As UML now supports the CONFIG_MMU=n case, it has to use an alternate ELF loader, the FDPIC ELF loader. This commit adds the necessary definitions to the arch, as the FDPIC loader has not been used with UML so far. It also updates the Kconfig file to allow BINFMT_ELF_FDPIC in a !MMU environment. Cc: Eric Biederman Cc: Kees Cook Cc: Alexander Viro Cc: Christian Brauner Cc: Jan Kara Cc: linux-mm at kvack.org Cc: linux-fsdevel at vger.kernel.org Acked-by: Kees Cook Signed-off-by: Hajime Tazaki Signed-off-by: Ricardo Koller --- arch/um/include/asm/mmu.h | 5 +++++ arch/um/include/asm/ptrace-generic.h | 6 ++++++ arch/x86/um/asm/elf.h | 8 ++++++-- fs/Kconfig.binfmt | 2 +- 4 files changed, 18 insertions(+), 3 deletions(-) diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h index 07d48738b402..82a919132aff 100644 --- a/arch/um/include/asm/mmu.h +++ b/arch/um/include/asm/mmu.h @@ -21,6 +21,11 @@ typedef struct mm_context { spinlock_t sync_tlb_lock; unsigned long sync_tlb_range_from; unsigned long sync_tlb_range_to; + +#ifdef CONFIG_BINFMT_ELF_FDPIC + unsigned long exec_fdpic_loadmap; + unsigned long interp_fdpic_loadmap; +#endif } mm_context_t; #define INIT_MM_CONTEXT(mm) \ diff --git a/arch/um/include/asm/ptrace-generic.h b/arch/um/include/asm/ptrace-generic.h index 86d74f9d33cf..62e9916078ec 100644 --- a/arch/um/include/asm/ptrace-generic.h +++ b/arch/um/include/asm/ptrace-generic.h @@ -29,6 +29,12 @@ struct pt_regs { #define PTRACE_OLDSETOPTIONS 21 +#ifdef CONFIG_BINFMT_ELF_FDPIC +#define PTRACE_GETFDPIC 31 +#define PTRACE_GETFDPIC_EXEC 0 +#define PTRACE_GETFDPIC_INTERP 1 +#endif + struct task_struct; extern long subarch_ptrace(struct task_struct *child, long request, diff --git a/arch/x86/um/asm/elf.h b/arch/x86/um/asm/elf.h index 22d0111b543b..388fe669886c 100644 --- a/arch/x86/um/asm/elf.h +++ b/arch/x86/um/asm/elf.h @@ -9,6 +9,7 @@ 
#include #define CORE_DUMP_USE_REGSET +#define ELF_FDPIC_CORE_EFLAGS 0 #ifdef CONFIG_X86_32 @@ -158,8 +159,11 @@ extern int arch_setup_additional_pages(struct linux_binprm *bprm, extern unsigned long um_vdso_addr; #define AT_SYSINFO_EHDR 33 -#define ARCH_DLINFO NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr) - +#define ARCH_DLINFO \ +do { \ + NEW_AUX_ENT(AT_SYSINFO_EHDR, um_vdso_addr); \ + NEW_AUX_ENT(AT_MINSIGSTKSZ, 0); \ +} while (0) #endif typedef unsigned long elf_greg_t; diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt index 1949e25c7741..0a92bebd5f75 100644 --- a/fs/Kconfig.binfmt +++ b/fs/Kconfig.binfmt @@ -58,7 +58,7 @@ config ARCH_USE_GNU_PROPERTY config BINFMT_ELF_FDPIC bool "Kernel support for FDPIC ELF binaries" default y if !BINFMT_ELF - depends on ARM || ((M68K || RISCV || SUPERH || XTENSA) && !MMU) + depends on ARM || ((M68K || RISCV || SUPERH || UML || XTENSA) && !MMU) select ELFCORE help ELF FDPIC binaries are based on ELF, but allow the individual load -- 2.43.0 From thehajime at gmail.com Sat Nov 8 00:05:37 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sat, 8 Nov 2025 17:05:37 +0900 Subject: [PATCH v13 02/13] um: decouple MMU specific code from the common part In-Reply-To: References: Message-ID: This splits the memory- and process-related code into common and MMU-specific parts, in order to avoid ifdefs in .c files and duplication between MMU and !MMU. 
Signed-off-by: Hajime Tazaki --- arch/um/kernel/Makefile | 5 +- arch/um/kernel/mem-pgtable.c | 55 ++++++++++++++ arch/um/kernel/mem.c | 35 --------- arch/um/kernel/process.c | 38 ++++++++++ arch/um/kernel/skas/process.c | 37 --------- arch/um/os-Linux/Makefile | 3 +- arch/um/os-Linux/process.c | 129 ++++++++++++++++++++++++++++++++ arch/um/os-Linux/skas/process.c | 127 ------------------------------- 8 files changed, 227 insertions(+), 202 deletions(-) create mode 100644 arch/um/kernel/mem-pgtable.c diff --git a/arch/um/kernel/Makefile b/arch/um/kernel/Makefile index be60bc451b3f..76d36751973e 100644 --- a/arch/um/kernel/Makefile +++ b/arch/um/kernel/Makefile @@ -16,9 +16,10 @@ always-$(KBUILD_BUILTIN) := vmlinux.lds obj-y = config.o exec.o exitcode.o irq.o ksyms.o mem.o \ physmem.o process.o ptrace.o reboot.o sigio.o \ - signal.o sysrq.o time.o tlb.o trap.o \ - um_arch.o umid.o kmsg_dump.o capflags.o skas/ + signal.o sysrq.o time.o \ + um_arch.o umid.o kmsg_dump.o capflags.o obj-y += load_file.o +obj-$(CONFIG_MMU) += mem-pgtable.o tlb.o trap.o skas/ obj-$(CONFIG_BLK_DEV_INITRD) += initrd.o obj-$(CONFIG_GPROF) += gprof_syms.o diff --git a/arch/um/kernel/mem-pgtable.c b/arch/um/kernel/mem-pgtable.c new file mode 100644 index 000000000000..549da1d3bff0 --- /dev/null +++ b/arch/um/kernel/mem-pgtable.c @@ -0,0 +1,55 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + + +/* Allocate and free page tables. 
*/ + +pgd_t *pgd_alloc(struct mm_struct *mm) +{ + pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL); + + if (pgd) { + memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t)); + memcpy(pgd + USER_PTRS_PER_PGD, + swapper_pg_dir + USER_PTRS_PER_PGD, + (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t)); + } + return pgd; +} + +static const pgprot_t protection_map[16] = { + [VM_NONE] = PAGE_NONE, + [VM_READ] = PAGE_READONLY, + [VM_WRITE] = PAGE_COPY, + [VM_WRITE | VM_READ] = PAGE_COPY, + [VM_EXEC] = PAGE_READONLY, + [VM_EXEC | VM_READ] = PAGE_READONLY, + [VM_EXEC | VM_WRITE] = PAGE_COPY, + [VM_EXEC | VM_WRITE | VM_READ] = PAGE_COPY, + [VM_SHARED] = PAGE_NONE, + [VM_SHARED | VM_READ] = PAGE_READONLY, + [VM_SHARED | VM_WRITE] = PAGE_SHARED, + [VM_SHARED | VM_WRITE | VM_READ] = PAGE_SHARED, + [VM_SHARED | VM_EXEC] = PAGE_READONLY, + [VM_SHARED | VM_EXEC | VM_READ] = PAGE_READONLY, + [VM_SHARED | VM_EXEC | VM_WRITE] = PAGE_SHARED, + [VM_SHARED | VM_EXEC | VM_WRITE | VM_READ] = PAGE_SHARED +}; +DECLARE_VM_GET_PAGE_PROT diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c index 39c4a7e21c6f..f3258680bfbe 100644 --- a/arch/um/kernel/mem.c +++ b/arch/um/kernel/mem.c @@ -6,7 +6,6 @@ #include #include #include -#include #include #include #include @@ -107,45 +106,11 @@ void free_initmem(void) { } -/* Allocate and free page tables. 
*/ - -pgd_t *pgd_alloc(struct mm_struct *mm) -{ - pgd_t *pgd = __pgd_alloc(mm, 0); - - if (pgd) - memcpy(pgd + USER_PTRS_PER_PGD, - swapper_pg_dir + USER_PTRS_PER_PGD, - (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t)); - - return pgd; -} - void *uml_kmalloc(int size, int flags) { return kmalloc(size, flags); } -static const pgprot_t protection_map[16] = { - [VM_NONE] = PAGE_NONE, - [VM_READ] = PAGE_READONLY, - [VM_WRITE] = PAGE_COPY, - [VM_WRITE | VM_READ] = PAGE_COPY, - [VM_EXEC] = PAGE_READONLY, - [VM_EXEC | VM_READ] = PAGE_READONLY, - [VM_EXEC | VM_WRITE] = PAGE_COPY, - [VM_EXEC | VM_WRITE | VM_READ] = PAGE_COPY, - [VM_SHARED] = PAGE_NONE, - [VM_SHARED | VM_READ] = PAGE_READONLY, - [VM_SHARED | VM_WRITE] = PAGE_SHARED, - [VM_SHARED | VM_WRITE | VM_READ] = PAGE_SHARED, - [VM_SHARED | VM_EXEC] = PAGE_READONLY, - [VM_SHARED | VM_EXEC | VM_READ] = PAGE_READONLY, - [VM_SHARED | VM_EXEC | VM_WRITE] = PAGE_SHARED, - [VM_SHARED | VM_EXEC | VM_WRITE | VM_READ] = PAGE_SHARED -}; -DECLARE_VM_GET_PAGE_PROT - void mark_rodata_ro(void) { unsigned long rodata_start = PFN_ALIGN(__start_rodata); diff --git a/arch/um/kernel/process.c b/arch/um/kernel/process.c index 63b38a3f73f7..b07c1f120910 100644 --- a/arch/um/kernel/process.c +++ b/arch/um/kernel/process.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include #include @@ -307,3 +308,40 @@ unsigned long __get_wchan(struct task_struct *p) return 0; } + +extern void start_kernel(void); + +static int __init start_kernel_proc(void *unused) +{ + block_signals_trace(); + + start_kernel(); + return 0; +} + +char cpu_irqstacks[NR_CPUS][THREAD_SIZE] __aligned(THREAD_SIZE); + +int __init start_uml(void) +{ + stack_protections((unsigned long) &cpu_irqstacks[0]); + set_sigstack(cpu_irqstacks[0], THREAD_SIZE); + + init_new_thread_signals(); + + init_task.thread.request.thread.proc = start_kernel_proc; + init_task.thread.request.thread.arg = NULL; + return start_idle_thread(task_stack_page(&init_task), + 
&init_task.thread.switch_buf); +} + +static DEFINE_SPINLOCK(initial_jmpbuf_spinlock); + +void initial_jmpbuf_lock(void) +{ + spin_lock_irq(&initial_jmpbuf_spinlock); +} + +void initial_jmpbuf_unlock(void) +{ + spin_unlock_irq(&initial_jmpbuf_spinlock); +} diff --git a/arch/um/kernel/skas/process.c b/arch/um/kernel/skas/process.c index 4a7673b0261a..d643854942bc 100644 --- a/arch/um/kernel/skas/process.c +++ b/arch/um/kernel/skas/process.c @@ -17,31 +17,6 @@ #include #include -extern void start_kernel(void); - -static int __init start_kernel_proc(void *unused) -{ - block_signals_trace(); - - start_kernel(); - return 0; -} - -char cpu_irqstacks[NR_CPUS][THREAD_SIZE] __aligned(THREAD_SIZE); - -int __init start_uml(void) -{ - stack_protections((unsigned long) &cpu_irqstacks[0]); - set_sigstack(cpu_irqstacks[0], THREAD_SIZE); - - init_new_thread_signals(); - - init_task.thread.request.thread.proc = start_kernel_proc; - init_task.thread.request.thread.arg = NULL; - return start_idle_thread(task_stack_page(&init_task), - &init_task.thread.switch_buf); -} - unsigned long current_stub_stack(void) { if (current->mm == NULL) @@ -65,15 +40,3 @@ void current_mm_sync(void) um_tlb_sync(current->mm); } - -static DEFINE_SPINLOCK(initial_jmpbuf_spinlock); - -void initial_jmpbuf_lock(void) -{ - spin_lock_irq(&initial_jmpbuf_spinlock); -} - -void initial_jmpbuf_unlock(void) -{ - spin_unlock_irq(&initial_jmpbuf_spinlock); -} diff --git a/arch/um/os-Linux/Makefile b/arch/um/os-Linux/Makefile index f8d672d570d9..40e3e0eab6a0 100644 --- a/arch/um/os-Linux/Makefile +++ b/arch/um/os-Linux/Makefile @@ -8,7 +8,8 @@ KCOV_INSTRUMENT := n obj-y = elf_aux.o execvp.o file.o helper.o irq.o main.o mem.o process.o \ registers.o sigio.o signal.o start_up.o time.o tty.o \ - umid.o user_syms.o util.o skas/ + umid.o user_syms.o util.o +obj-$(CONFIG_MMU) += skas/ CFLAGS_signal.o += -Wframe-larger-than=4096 diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c index 
3a2a84ab9325..c50fa865d8c7 100644 --- a/arch/um/os-Linux/process.c +++ b/arch/um/os-Linux/process.c @@ -6,6 +6,7 @@ #include #include +#include #include #include #include @@ -17,10 +18,16 @@ #include #include #include +#include #include #include #include #include +#include +#include + +int using_seccomp; +static int unscheduled_userspace_iterations; void os_alarm_process(int pid) { @@ -209,3 +216,125 @@ int os_futex_wake(void *uaddr) NULL, NULL, 0)); return r < 0 ? -errno : r; } + +int is_skas_winch(int pid, int fd, void *data) +{ + return pid == getpgrp(); +} + +void new_thread(void *stack, jmp_buf *buf, void (*handler)(void)) +{ + (*buf)[0].JB_IP = (unsigned long) handler; + (*buf)[0].JB_SP = (unsigned long) stack + UM_THREAD_SIZE - + sizeof(void *); +} + +#define INIT_JMP_NEW_THREAD 0 +#define INIT_JMP_CALLBACK 1 +#define INIT_JMP_HALT 2 +#define INIT_JMP_REBOOT 3 + +void switch_threads(jmp_buf *me, jmp_buf *you) +{ + unscheduled_userspace_iterations = 0; + + if (UML_SETJMP(me) == 0) + UML_LONGJMP(you, 1); +} + +static jmp_buf initial_jmpbuf; + +static __thread void (*cb_proc)(void *arg); +static __thread void *cb_arg; +static __thread jmp_buf *cb_back; + +int start_idle_thread(void *stack, jmp_buf *switch_buf) +{ + int n; + + set_handler(SIGWINCH); + + /* + * Can't use UML_SETJMP or UML_LONGJMP here because they save + * and restore signals, with the possible side-effect of + * trying to handle any signals which came when they were + * blocked, which can't be done on this stack. + * Signals must be blocked when jumping back here and restored + * after returning to the jumper. 
+ */ + n = setjmp(initial_jmpbuf); + switch (n) { + case INIT_JMP_NEW_THREAD: + (*switch_buf)[0].JB_IP = (unsigned long) uml_finishsetup; + (*switch_buf)[0].JB_SP = (unsigned long) stack + + UM_THREAD_SIZE - sizeof(void *); + break; + case INIT_JMP_CALLBACK: + (*cb_proc)(cb_arg); + longjmp(*cb_back, 1); + break; + case INIT_JMP_HALT: + kmalloc_ok = 0; + return 0; + case INIT_JMP_REBOOT: + kmalloc_ok = 0; + return 1; + default: + printk(UM_KERN_ERR "Bad sigsetjmp return in %s - %d\n", + __func__, n); + fatal_sigsegv(); + } + longjmp(*switch_buf, 1); + + /* unreachable */ + printk(UM_KERN_ERR "impossible long jump!"); + fatal_sigsegv(); + return 0; +} + +void initial_thread_cb_skas(void (*proc)(void *), void *arg) +{ + jmp_buf here; + + cb_proc = proc; + cb_arg = arg; + cb_back = &here; + + initial_jmpbuf_lock(); + if (UML_SETJMP(&here) == 0) + UML_LONGJMP(&initial_jmpbuf, INIT_JMP_CALLBACK); + initial_jmpbuf_unlock(); + + cb_proc = NULL; + cb_arg = NULL; + cb_back = NULL; +} + +void halt_skas(void) +{ + initial_jmpbuf_lock(); + UML_LONGJMP(&initial_jmpbuf, INIT_JMP_HALT); + /* unreachable */ +} + +static bool noreboot; + +static int __init noreboot_cmd_param(char *str, int *add) +{ + *add = 0; + noreboot = true; + return 0; +} + +__uml_setup("noreboot", noreboot_cmd_param, +"noreboot\n" +" Rather than rebooting, exit always, akin to QEMU's -no-reboot option.\n" +" This is useful if you're using CONFIG_PANIC_TIMEOUT in order to catch\n" +" crashes in CI\n\n"); + +void reboot_skas(void) +{ + initial_jmpbuf_lock(); + UML_LONGJMP(&initial_jmpbuf, noreboot ? 
INIT_JMP_HALT : INIT_JMP_REBOOT); + /* unreachable */ +} diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c index d6c22f8aa06d..01814ad82f5d 100644 --- a/arch/um/os-Linux/skas/process.c +++ b/arch/um/os-Linux/skas/process.c @@ -18,7 +18,6 @@ #include #include #include -#include #include #include #include @@ -29,16 +28,10 @@ #include #include #include -#include #include #include #include "../internal.h" -int is_skas_winch(int pid, int fd, void *data) -{ - return pid == getpgrp(); -} - static const char *ptrace_reg_name(int idx) { #define R(n) case HOST_##n: return #n @@ -426,8 +419,6 @@ static int __init init_stub_exe_fd(void) } __initcall(init_stub_exe_fd); -int using_seccomp; - /** * start_userspace() - prepare a new userspace process * @mm_id: The corresponding struct mm_id @@ -540,7 +531,6 @@ int start_userspace(struct mm_id *mm_id) return err; } -static int unscheduled_userspace_iterations; extern unsigned long tt_extra_sched_jiffies; void userspace(struct uml_pt_regs *regs) @@ -789,120 +779,3 @@ void userspace(struct uml_pt_regs *regs) } } } - -void new_thread(void *stack, jmp_buf *buf, void (*handler)(void)) -{ - (*buf)[0].JB_IP = (unsigned long) handler; - (*buf)[0].JB_SP = (unsigned long) stack + UM_THREAD_SIZE - - sizeof(void *); -} - -#define INIT_JMP_NEW_THREAD 0 -#define INIT_JMP_CALLBACK 1 -#define INIT_JMP_HALT 2 -#define INIT_JMP_REBOOT 3 - -void switch_threads(jmp_buf *me, jmp_buf *you) -{ - unscheduled_userspace_iterations = 0; - - if (UML_SETJMP(me) == 0) - UML_LONGJMP(you, 1); -} - -static jmp_buf initial_jmpbuf; - -static __thread void (*cb_proc)(void *arg); -static __thread void *cb_arg; -static __thread jmp_buf *cb_back; - -int start_idle_thread(void *stack, jmp_buf *switch_buf) -{ - int n; - - set_handler(SIGWINCH); - - /* - * Can't use UML_SETJMP or UML_LONGJMP here because they save - * and restore signals, with the possible side-effect of - * trying to handle any signals which came when they were - * blocked, 
which can't be done on this stack. - * Signals must be blocked when jumping back here and restored - * after returning to the jumper. - */ - n = setjmp(initial_jmpbuf); - switch (n) { - case INIT_JMP_NEW_THREAD: - (*switch_buf)[0].JB_IP = (unsigned long) uml_finishsetup; - (*switch_buf)[0].JB_SP = (unsigned long) stack + - UM_THREAD_SIZE - sizeof(void *); - break; - case INIT_JMP_CALLBACK: - (*cb_proc)(cb_arg); - longjmp(*cb_back, 1); - break; - case INIT_JMP_HALT: - kmalloc_ok = 0; - return 0; - case INIT_JMP_REBOOT: - kmalloc_ok = 0; - return 1; - default: - printk(UM_KERN_ERR "Bad sigsetjmp return in %s - %d\n", - __func__, n); - fatal_sigsegv(); - } - longjmp(*switch_buf, 1); - - /* unreachable */ - printk(UM_KERN_ERR "impossible long jump!"); - fatal_sigsegv(); - return 0; -} - -void initial_thread_cb_skas(void (*proc)(void *), void *arg) -{ - jmp_buf here; - - cb_proc = proc; - cb_arg = arg; - cb_back = &here; - - initial_jmpbuf_lock(); - if (UML_SETJMP(&here) == 0) - UML_LONGJMP(&initial_jmpbuf, INIT_JMP_CALLBACK); - initial_jmpbuf_unlock(); - - cb_proc = NULL; - cb_arg = NULL; - cb_back = NULL; -} - -void halt_skas(void) -{ - initial_jmpbuf_lock(); - UML_LONGJMP(&initial_jmpbuf, INIT_JMP_HALT); - /* unreachable */ -} - -static bool noreboot; - -static int __init noreboot_cmd_param(char *str, int *add) -{ - *add = 0; - noreboot = true; - return 0; -} - -__uml_setup("noreboot", noreboot_cmd_param, -"noreboot\n" -" Rather than rebooting, exit always, akin to QEMU's -no-reboot option.\n" -" This is useful if you're using CONFIG_PANIC_TIMEOUT in order to catch\n" -" crashes in CI\n\n"); - -void reboot_skas(void) -{ - initial_jmpbuf_lock(); - UML_LONGJMP(&initial_jmpbuf, noreboot ? 
INIT_JMP_HALT : INIT_JMP_REBOOT); - /* unreachable */ -} -- 2.43.0 From thehajime at gmail.com Sat Nov 8 00:05:38 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sat, 8 Nov 2025 17:05:38 +0900 Subject: [PATCH v13 03/13] um: nommu: memory handling In-Reply-To: References: Message-ID: <28512370a78b53783655667300bc4464fd338029.1762588860.git.thehajime@gmail.com> This commit adds memory operations for UML in a !MMU environment. Parts of the original UML code that rely on CONFIG_MMU are excluded from compilation when !CONFIG_MMU. Additionally, generic functions such as uaccess, futex, and memcpy/strnlen/strncpy can be used, as user and kernel space share the address space in !CONFIG_MMU mode. Signed-off-by: Hajime Tazaki Signed-off-by: Ricardo Koller --- arch/um/Makefile | 4 ++++ arch/um/include/asm/futex.h | 4 ++++ arch/um/include/asm/mmu.h | 3 +++ arch/um/include/asm/mmu_context.h | 2 ++ arch/um/include/asm/uaccess.h | 7 ++++--- arch/um/kernel/mem.c | 3 ++- arch/um/os-Linux/mem.c | 4 ++++ arch/um/os-Linux/process.c | 4 ++-- 8 files changed, 25 insertions(+), 6 deletions(-) diff --git a/arch/um/Makefile b/arch/um/Makefile index 7be0143b5ba3..5371c9a1b11e 100644 --- a/arch/um/Makefile +++ b/arch/um/Makefile @@ -46,6 +46,10 @@ ARCH_INCLUDE := -I$(srctree)/$(SHARED_HEADERS) ARCH_INCLUDE += -I$(srctree)/$(HOST_DIR)/um/shared KBUILD_CPPFLAGS += -I$(srctree)/$(HOST_DIR)/um +ifneq ($(CONFIG_MMU),y) +core-y += $(ARCH_DIR)/nommu/ +endif + # -Dvmap=kernel_vmap prevents anything from referencing the libpcap.o symbol so # named - it's a common symbol in libpcap, so we get a binary which crashes. 
# diff --git a/arch/um/include/asm/futex.h b/arch/um/include/asm/futex.h index 780aa6bfc050..785fd6649aa2 100644 --- a/arch/um/include/asm/futex.h +++ b/arch/um/include/asm/futex.h @@ -7,8 +7,12 @@ #include +#ifdef CONFIG_MMU int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval, u32 __user *uaddr); int futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *uaddr, u32 oldval, u32 newval); +#else +#include +#endif #endif diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h index 82a919132aff..c0b9ce3215c4 100644 --- a/arch/um/include/asm/mmu.h +++ b/arch/um/include/asm/mmu.h @@ -22,10 +22,13 @@ typedef struct mm_context { unsigned long sync_tlb_range_from; unsigned long sync_tlb_range_to; +#ifndef CONFIG_MMU + unsigned long end_brk; #ifdef CONFIG_BINFMT_ELF_FDPIC unsigned long exec_fdpic_loadmap; unsigned long interp_fdpic_loadmap; #endif +#endif /* !CONFIG_MMU */ } mm_context_t; #define INIT_MM_CONTEXT(mm) \ diff --git a/arch/um/include/asm/mmu_context.h b/arch/um/include/asm/mmu_context.h index c727e56ba116..528b217da285 100644 --- a/arch/um/include/asm/mmu_context.h +++ b/arch/um/include/asm/mmu_context.h @@ -18,11 +18,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, { } +#ifdef CONFIG_MMU #define init_new_context init_new_context extern int init_new_context(struct task_struct *task, struct mm_struct *mm); #define destroy_context destroy_context extern void destroy_context(struct mm_struct *mm); +#endif #include diff --git a/arch/um/include/asm/uaccess.h b/arch/um/include/asm/uaccess.h index 0df9ea4abda8..031b357800b7 100644 --- a/arch/um/include/asm/uaccess.h +++ b/arch/um/include/asm/uaccess.h @@ -18,6 +18,7 @@ #define __addr_range_nowrap(addr, size) \ ((unsigned long) (addr) <= ((unsigned long) (addr) + (size))) +#ifdef CONFIG_MMU extern unsigned long raw_copy_from_user(void *to, const void __user *from, unsigned long n); extern unsigned long raw_copy_to_user(void __user *to, const void *from, unsigned long 
n); extern unsigned long __clear_user(void __user *mem, unsigned long len); @@ -29,9 +30,6 @@ static inline int __access_ok(const void __user *ptr, unsigned long size); #define INLINE_COPY_FROM_USER #define INLINE_COPY_TO_USER - -#include - static inline int __access_ok(const void __user *ptr, unsigned long size) { unsigned long addr = (unsigned long)ptr; @@ -63,5 +61,8 @@ do { \ barrier(); \ current->thread.segv_continue = NULL; \ } while (0) +#endif + +#include #endif diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c index f3258680bfbe..e599b637c5fb 100644 --- a/arch/um/kernel/mem.c +++ b/arch/um/kernel/mem.c @@ -71,7 +71,8 @@ void __init arch_mm_preinit(void) * to be turned on. */ brk_end = PAGE_ALIGN((unsigned long) sbrk(0)); - map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 0); + map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, + !IS_ENABLED(CONFIG_MMU)); memblock_free((void *)brk_end, uml_reserved - brk_end); uml_reserved = brk_end; min_low_pfn = PFN_UP(__pa(uml_reserved)); diff --git a/arch/um/os-Linux/mem.c b/arch/um/os-Linux/mem.c index 72f302f4d197..4f5d9a94f8e2 100644 --- a/arch/um/os-Linux/mem.c +++ b/arch/um/os-Linux/mem.c @@ -213,6 +213,10 @@ int __init create_mem_file(unsigned long long len) { int err, fd; + /* NOMMU kernel uses -1 as a fd for further use (e.g., mmap) */ + if (!IS_ENABLED(CONFIG_MMU)) + return -1; + fd = create_tmp_file(len); err = os_set_exec_close(fd); diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c index c50fa865d8c7..ddb5258d7720 100644 --- a/arch/um/os-Linux/process.c +++ b/arch/um/os-Linux/process.c @@ -100,8 +100,8 @@ int os_map_memory(void *virt, int fd, unsigned long long off, unsigned long len, prot = (r ? PROT_READ : 0) | (w ? PROT_WRITE : 0) | (x ? PROT_EXEC : 0); - loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED, - fd, off); + loc = mmap64((void *) virt, len, prot, MAP_SHARED | MAP_FIXED | + (!IS_ENABLED(CONFIG_MMU) ? 
MAP_ANONYMOUS : 0), fd, off); if (loc == MAP_FAILED) return -errno; return 0; -- 2.43.0 From thehajime at gmail.com Sat Nov 8 00:05:39 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sat, 8 Nov 2025 17:05:39 +0900 Subject: [PATCH v13 04/13] x86/um: nommu: syscall handling In-Reply-To: References: Message-ID: This commit introduces an entry point for the syscall interface in !MMU mode. It uses an entry function, __kernel_vsyscall, a kernel-wide global symbol accessible from any location. Although it is outside the scope of this commit, it can also be exposed via the vdso image, which is directly accessible from userspace. A standard library (e.g., libc) can use this entry point to implement syscall wrappers; unmodified userspace applications/libraries can also reach it through syscall hooking, which is implemented in the subsequent commit. Only the 64-bit mode of the x86 architecture is supported. Signed-off-by: Hajime Tazaki Signed-off-by: Ricardo Koller --- arch/x86/um/Makefile | 4 ++ arch/x86/um/asm/syscall.h | 6 ++ arch/x86/um/nommu/Makefile | 8 +++ arch/x86/um/nommu/do_syscall_64.c | 32 +++++++++ arch/x86/um/nommu/entry_64.S | 112 ++++++++++++++++++++++++++++++ arch/x86/um/nommu/syscalls.h | 16 +++++ 6 files changed, 178 insertions(+) create mode 100644 arch/x86/um/nommu/Makefile create mode 100644 arch/x86/um/nommu/do_syscall_64.c create mode 100644 arch/x86/um/nommu/entry_64.S create mode 100644 arch/x86/um/nommu/syscalls.h diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile index f9ea75bf43ac..39693807755a 100644 --- a/arch/x86/um/Makefile +++ b/arch/x86/um/Makefile @@ -31,6 +31,10 @@ obj-y += mem_64.o syscalls_64.o vdso/ subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o \ ../lib/memmove_64.o ../lib/memset_64.o +ifneq ($(CONFIG_MMU),y) +obj-y += nommu/ +endif + endif subarch-$(CONFIG_MODULES) += ../kernel/module.o diff --git a/arch/x86/um/asm/syscall.h b/arch/x86/um/asm/syscall.h index d6208d0fad51..bb4f6f011667 100644 --- 
a/arch/x86/um/asm/syscall.h +++ b/arch/x86/um/asm/syscall.h @@ -20,4 +20,10 @@ static inline int syscall_get_arch(struct task_struct *task) #endif } +#ifndef CONFIG_MMU +extern void do_syscall_64(struct pt_regs *regs); +extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3, + int64_t a4, int64_t a5, int64_t a6); +#endif + #endif /* __UM_ASM_SYSCALL_H */ diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile new file mode 100644 index 000000000000..d72c63afffa5 --- /dev/null +++ b/arch/x86/um/nommu/Makefile @@ -0,0 +1,8 @@ +# SPDX-License-Identifier: GPL-2.0 +ifeq ($(CONFIG_X86_32),y) + BITS := 32 +else + BITS := 64 +endif + +obj-y = do_syscall_$(BITS).o entry_$(BITS).o diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c new file mode 100644 index 000000000000..292d7c578622 --- /dev/null +++ b/arch/x86/um/nommu/do_syscall_64.c @@ -0,0 +1,32 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include + +__visible void do_syscall_64(struct pt_regs *regs) +{ + int syscall; + + syscall = PT_SYSCALL_NR(regs->regs.gp); + UPT_SYSCALL_NR(&regs->regs) = syscall; + + if (likely(syscall < NR_syscalls)) { + unsigned long ret; + + ret = (*sys_call_table[syscall])(UPT_SYSCALL_ARG1(&regs->regs), + UPT_SYSCALL_ARG2(&regs->regs), + UPT_SYSCALL_ARG3(&regs->regs), + UPT_SYSCALL_ARG4(&regs->regs), + UPT_SYSCALL_ARG5(&regs->regs), + UPT_SYSCALL_ARG6(&regs->regs)); + PT_REGS_SET_SYSCALL_RETURN(regs, ret); + } + + PT_REGS_SYSCALL_RET(regs) = regs->regs.gp[HOST_AX]; + + /* handle tasks and signals at the end */ + interrupt_end(); +} diff --git a/arch/x86/um/nommu/entry_64.S b/arch/x86/um/nommu/entry_64.S new file mode 100644 index 000000000000..485c578aae64 --- /dev/null +++ b/arch/x86/um/nommu/entry_64.S @@ -0,0 +1,112 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#include + +#include +#include +#include + +#include "../entry/calling.h" + +#ifdef CONFIG_SMP +#error need to stash these variables somewhere else 
+#endif + +#define UM_GLOBAL_VAR(x) .data; .align 8; .globl x; x:; .long 0 + +UM_GLOBAL_VAR(current_top_of_stack) +UM_GLOBAL_VAR(current_ptregs) + +.code64 +.section .entry.text, "ax" + +.align 8 +#undef ENTRY +#define ENTRY(x) .text; .globl x; .type x,%function; x: +#undef END +#define END(x) .size x, . - x + +/* + * %rcx has the return address (we set it before entering __kernel_vsyscall). + * + * Registers on entry: + * rax system call number + * rcx return address + * rdi arg0 + * rsi arg1 + * rdx arg2 + * r10 arg3 + * r8 arg4 + * r9 arg5 + * + * (note: we are allowed to mess with r11: r11 is callee-clobbered + * register in C ABI) + */ +ENTRY(__kernel_vsyscall) + + movq %rsp, %r11 + + /* Point rsp to the top of the ptregs array, so we can + just fill it with a bunch of push'es. */ + movq current_ptregs, %rsp + + /* 8 bytes * 20 registers (plus 8 for the push) */ + addq $168, %rsp + + /* Construct struct pt_regs on stack */ + pushq $0 /* pt_regs->ss (index 20) */ + pushq %r11 /* pt_regs->sp */ + pushfq /* pt_regs->flags */ + pushq $0 /* pt_regs->cs */ + pushq %rcx /* pt_regs->ip */ + pushq %rax /* pt_regs->orig_ax */ + + PUSH_AND_CLEAR_REGS rax=$-ENOSYS + + mov %rsp, %rdi + + /* + * Switch to current top of stack, so "current->" points + * to the right task. + */ + movq current_top_of_stack, %rsp + + call do_syscall_64 + + jmp userspace + +END(__kernel_vsyscall) + +/* + * common userspace returning routine + * + * all procedures like syscalls, signal handlers, umh processes, will gate + * this routine to properly configure registers/stacks. 
+ * + * void userspace(struct uml_pt_regs *regs) + */ +ENTRY(userspace) + + /* clear direction flag to meet ABI */ + cld + /* align the stack for x86_64 ABI */ + and $-0x10, %rsp + /* Handle any immediate reschedules or signals */ + call interrupt_end + + movq current_ptregs, %rsp + + POP_REGS + + addq $8, %rsp /* skip orig_ax */ + popq %rcx /* pt_regs->ip */ + addq $8, %rsp /* skip cs */ + addq $8, %rsp /* skip flags */ + popq %rsp + + /* + * not return w/ ret but w/ jmp as the stack is already popped before + * entering __kernel_vsyscall + */ + jmp *%rcx + +END(userspace) diff --git a/arch/x86/um/nommu/syscalls.h b/arch/x86/um/nommu/syscalls.h new file mode 100644 index 000000000000..a2433756b1fc --- /dev/null +++ b/arch/x86/um/nommu/syscalls.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __UM_NOMMU_SYSCALLS_H +#define __UM_NOMMU_SYSCALLS_H + + +#define task_top_of_stack(task) \ +({ \ + unsigned long __ptr = (unsigned long)task->stack; \ + __ptr += THREAD_SIZE; \ + __ptr; \ +}) + +extern long current_top_of_stack; +extern long current_ptregs; + +#endif -- 2.43.0 From thehajime at gmail.com Sat Nov 8 00:05:40 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sat, 8 Nov 2025 17:05:40 +0900 Subject: [PATCH v13 05/13] um: nommu: seccomp syscalls hook In-Reply-To: References: Message-ID: <9e3438cf6d6c26a708c428267375b102b434d5d6.1762588860.git.thehajime@gmail.com> This commit adds a syscall hook based on seccomp. The seccomp filter raises SIGSYS to the UML process; the signal is caught in the (UML) kernel, which then jumps to the syscall entry point, __kernel_vsyscall, in place of the original syscall instruction. SIGSYS is raised only for syscalls executed from addresses between uml_reserved and high_physmem, the range that holds userspace memory. It also renames the existing static function sigsys_handler() in start_up.c to avoid a name conflict with the new handler. 
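For reference, the address check that the seccomp filter performs can be sketched in plain C. This is only an illustration under assumptions: classify_ip() and the verdict enum are hypothetical names that do not appear in the patch. The sketch compares the high and low 32-bit halves separately, since classic BPF works on 32-bit words.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not part of the patch. */
enum filter_verdict { VERDICT_ALLOW = 0, VERDICT_TRAP = 1 };

/*
 * Decide whether a syscall issued at instruction pointer `ip` would be
 * trapped (ip inside [start, end)) or allowed, using only 32-bit
 * comparisons in the same order as the BPF program.
 */
static enum filter_verdict classify_ip(uint64_t ip, uint64_t start, uint64_t end)
{
	uint32_t ip_hi = (uint32_t)(ip >> 32), ip_lo = (uint32_t)ip;
	uint32_t start_hi = (uint32_t)(start >> 32), start_lo = (uint32_t)start;
	uint32_t end_hi = (uint32_t)(end >> 32), end_lo = (uint32_t)end;

	/* above the userspace range: host-side code, allow */
	if (ip_hi > end_hi)
		return VERDICT_ALLOW;
	if (ip_hi == end_hi && ip_lo >= end_lo)
		return VERDICT_ALLOW;

	/* below the userspace range: host-side code, allow */
	if (ip_hi < start_hi)
		return VERDICT_ALLOW;
	if (ip_hi == start_hi && ip_lo < start_lo)
		return VERDICT_ALLOW;

	/* inside the range: raise SIGSYS so the UML kernel can hook it */
	return VERDICT_TRAP;
}
```

With start = uml_reserved and end = high_physmem, the result is the same as testing start <= ip < end on the full 64-bit values.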
Signed-off-by: Hajime Tazaki Signed-off-by: Kenichi Yasukata --- arch/um/include/shared/kern_util.h | 2 + arch/um/include/shared/os.h | 10 +++ arch/um/kernel/um_arch.c | 3 + arch/um/nommu/Makefile | 3 + arch/um/nommu/os-Linux/Makefile | 7 +++ arch/um/nommu/os-Linux/seccomp.c | 87 +++++++++++++++++++++++++++ arch/um/nommu/os-Linux/signal.c | 16 +++++ arch/um/os-Linux/signal.c | 8 +++ arch/um/os-Linux/start_up.c | 4 +- arch/x86/um/nommu/Makefile | 2 +- arch/x86/um/nommu/os-Linux/Makefile | 6 ++ arch/x86/um/nommu/os-Linux/mcontext.c | 15 +++++ arch/x86/um/shared/sysdep/mcontext.h | 4 ++ 13 files changed, 164 insertions(+), 3 deletions(-) create mode 100644 arch/um/nommu/Makefile create mode 100644 arch/um/nommu/os-Linux/Makefile create mode 100644 arch/um/nommu/os-Linux/seccomp.c create mode 100644 arch/um/nommu/os-Linux/signal.c create mode 100644 arch/x86/um/nommu/os-Linux/Makefile create mode 100644 arch/x86/um/nommu/os-Linux/mcontext.c diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h index 38321188c04c..7798f16a4677 100644 --- a/arch/um/include/shared/kern_util.h +++ b/arch/um/include/shared/kern_util.h @@ -63,6 +63,8 @@ extern void segv_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs extern void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs, void *mc); extern void fatal_sigsegv(void) __attribute__ ((noreturn)); +extern void sigsys_handler(int sig, struct siginfo *si, struct uml_pt_regs *regs, + void *mc); void um_idle_sleep(void); diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h index b26e94292fc1..5451f9b1f41e 100644 --- a/arch/um/include/shared/os.h +++ b/arch/um/include/shared/os.h @@ -356,4 +356,14 @@ static inline void os_local_ipi_enable(void) { } static inline void os_local_ipi_disable(void) { } #endif /* CONFIG_SMP */ +/* seccomp.c */ +#ifdef CONFIG_MMU +static inline int os_setup_seccomp(void) +{ + return 0; +} +#else +extern int os_setup_seccomp(void); +#endif 
+ #endif diff --git a/arch/um/kernel/um_arch.c b/arch/um/kernel/um_arch.c index e2b24e1ecfa6..27c13423d9aa 100644 --- a/arch/um/kernel/um_arch.c +++ b/arch/um/kernel/um_arch.c @@ -423,6 +423,9 @@ void __init setup_arch(char **cmdline_p) add_bootloader_randomness(rng_seed, sizeof(rng_seed)); memzero_explicit(rng_seed, sizeof(rng_seed)); } + + /* install seccomp filter */ + os_setup_seccomp(); } void __init arch_cpu_finalize_init(void) diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile new file mode 100644 index 000000000000..baab7c2f57c2 --- /dev/null +++ b/arch/um/nommu/Makefile @@ -0,0 +1,3 @@ +# SPDX-License-Identifier: GPL-2.0 + +obj-y := os-Linux/ diff --git a/arch/um/nommu/os-Linux/Makefile b/arch/um/nommu/os-Linux/Makefile new file mode 100644 index 000000000000..805e26ccf63b --- /dev/null +++ b/arch/um/nommu/os-Linux/Makefile @@ -0,0 +1,7 @@ +# SPDX-License-Identifier: GPL-2.0 + +obj-y := seccomp.o signal.o +USER_OBJS := $(obj-y) + +include $(srctree)/arch/um/scripts/Makefile.rules +USER_CFLAGS+=-I$(srctree)/arch/um/os-Linux diff --git a/arch/um/nommu/os-Linux/seccomp.c b/arch/um/nommu/os-Linux/seccomp.c new file mode 100644 index 000000000000..d1cfa6e3d632 --- /dev/null +++ b/arch/um/nommu/os-Linux/seccomp.c @@ -0,0 +1,87 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include +#include +#include /* For SYS_xxx definitions */ +#include +#include +#include +#include +#include + +int __init os_setup_seccomp(void) +{ + int err; + unsigned long __userspace_start = uml_reserved, + __userspace_end = high_physmem; + + struct sock_filter filter[] = { + /* if (IP_high > __userspace_end) allow; */ + BPF_STMT(BPF_LD + BPF_W + BPF_ABS, + offsetof(struct seccomp_data, instruction_pointer) + 4), + BPF_JUMP(BPF_JMP + BPF_JGT + BPF_K, __userspace_end >> 32, + /*true-skip=*/0, /*false-skip=*/1), + BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW), + + /* if (IP_high == __userspace_end && IP_low >= __userspace_end) allow; */ + 
BPF_STMT(BPF_LD + BPF_W + BPF_ABS, + offsetof(struct seccomp_data, instruction_pointer) + 4), + BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __userspace_end >> 32, + /*true-skip=*/0, /*false-skip=*/3), + BPF_STMT(BPF_LD + BPF_W + BPF_ABS, + offsetof(struct seccomp_data, instruction_pointer)), + BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_end, + /*true-skip=*/0, /*false-skip=*/1), + BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW), + + /* if (IP_high < __userspace_start) allow; */ + BPF_STMT(BPF_LD + BPF_W + BPF_ABS, + offsetof(struct seccomp_data, instruction_pointer) + 4), + BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_start >> 32, + /*true-skip=*/1, /*false-skip=*/0), + BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW), + + /* if (IP_high == __userspace_start && IP_low < __userspace_start) allow; */ + BPF_STMT(BPF_LD + BPF_W + BPF_ABS, + offsetof(struct seccomp_data, instruction_pointer) + 4), + BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __userspace_start >> 32, + /*true-skip=*/0, /*false-skip=*/3), + BPF_STMT(BPF_LD + BPF_W + BPF_ABS, + offsetof(struct seccomp_data, instruction_pointer)), + BPF_JUMP(BPF_JMP + BPF_JGE + BPF_K, __userspace_start, + /*true-skip=*/1, /*false-skip=*/0), + BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW), + + /* other address; trap */ + BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRAP), + }; + struct sock_fprog prog = { + .len = ARRAY_SIZE(filter), + .filter = filter, + }; + + err = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); + if (err) + os_warn("PR_SET_NO_NEW_PRIVS (err=%d, errno=%d)\n", + err, errno); + + err = syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, + SECCOMP_FILTER_FLAG_TSYNC, &prog); + if (err) { + os_warn("SECCOMP_SET_MODE_FILTER (err=%d, errno=%d)\n", + err, errno); + exit(1); + } + + set_handler(SIGSYS); + + os_info("seccomp: setup filter syscalls in the range: 0x%lx-0x%lx\n", + __userspace_start, __userspace_end); + + return 0; +} + diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-Linux/signal.c new file mode 100644 index 
000000000000..19043b9652e2 --- /dev/null +++ b/arch/um/nommu/os-Linux/signal.c @@ -0,0 +1,16 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include + +void sigsys_handler(int sig, struct siginfo *si, + struct uml_pt_regs *regs, void *ptr) +{ + mcontext_t *mc = (mcontext_t *) ptr; + + /* hook syscall via SIGSYS */ + set_mc_sigsys_hook(mc); +} diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c index 327fb3c52fc7..2f6795cd884c 100644 --- a/arch/um/os-Linux/signal.c +++ b/arch/um/os-Linux/signal.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "internal.h" void (*sig_info[NSIG])(int, struct siginfo *, struct uml_pt_regs *, void *mc) = { @@ -31,6 +32,7 @@ void (*sig_info[NSIG])(int, struct siginfo *, struct uml_pt_regs *, void *mc) = [SIGSEGV] = segv_handler, [SIGIO] = sigio_handler, [SIGCHLD] = sigchld_handler, + [SIGSYS] = sigsys_handler, }; static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc) @@ -182,6 +184,11 @@ static void sigusr1_handler(int sig, struct siginfo *unused_si, mcontext_t *mc) uml_pm_wake(); } +__weak void sigsys_handler(int sig, struct siginfo *unused_si, + struct uml_pt_regs *regs, void *mc) +{ +} + void register_pm_wake_signal(void) { set_handler(SIGUSR1); @@ -193,6 +200,7 @@ static void (*handlers[_NSIG])(int sig, struct siginfo *si, mcontext_t *mc) = { [SIGILL] = sig_handler, [SIGFPE] = sig_handler, [SIGTRAP] = sig_handler, + [SIGSYS] = sig_handler, [SIGIO] = sig_handler, [SIGWINCH] = sig_handler, diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c index 054ac03bbf5e..33e039d2c1bf 100644 --- a/arch/um/os-Linux/start_up.c +++ b/arch/um/os-Linux/start_up.c @@ -239,7 +239,7 @@ extern unsigned long *exec_fp_regs; __initdata static struct stub_data *seccomp_test_stub_data; -static void __init sigsys_handler(int sig, siginfo_t *info, void *p) +static void __init _sigsys_handler(int sig, siginfo_t *info, void *p) { ucontext_t *uc = p; @@ -274,7 
+274,7 @@ static int __init seccomp_helper(void *data) sizeof(seccomp_test_stub_data->sigstack)); sa.sa_flags = SA_ONSTACK | SA_NODEFER | SA_SIGINFO; - sa.sa_sigaction = (void *) sigsys_handler; + sa.sa_sigaction = (void *) _sigsys_handler; sa.sa_restorer = NULL; if (sigaction(SIGSYS, &sa, NULL) < 0) exit(2); diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile index d72c63afffa5..ebe47d4836f4 100644 --- a/arch/x86/um/nommu/Makefile +++ b/arch/x86/um/nommu/Makefile @@ -5,4 +5,4 @@ else BITS := 64 endif -obj-y = do_syscall_$(BITS).o entry_$(BITS).o +obj-y = do_syscall_$(BITS).o entry_$(BITS).o os-Linux/ diff --git a/arch/x86/um/nommu/os-Linux/Makefile b/arch/x86/um/nommu/os-Linux/Makefile new file mode 100644 index 000000000000..4571e403a6ff --- /dev/null +++ b/arch/x86/um/nommu/os-Linux/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 + +obj-y = mcontext.o +USER_OBJS := mcontext.o + +include $(srctree)/arch/um/scripts/Makefile.rules diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c b/arch/x86/um/nommu/os-Linux/mcontext.c new file mode 100644 index 000000000000..b62a6195096f --- /dev/null +++ b/arch/x86/um/nommu/os-Linux/mcontext.c @@ -0,0 +1,15 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#define __FRAME_OFFSETS +#include +#include +#include + +extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3, + int64_t a4, int64_t a5, int64_t a6); + +void set_mc_sigsys_hook(mcontext_t *mc) +{ + mc->gregs[REG_RCX] = mc->gregs[REG_RIP]; + mc->gregs[REG_RIP] = (unsigned long) __kernel_vsyscall; +} diff --git a/arch/x86/um/shared/sysdep/mcontext.h b/arch/x86/um/shared/sysdep/mcontext.h index 6fe490cc5b98..9a0d6087f357 100644 --- a/arch/x86/um/shared/sysdep/mcontext.h +++ b/arch/x86/um/shared/sysdep/mcontext.h @@ -17,6 +17,10 @@ extern int get_stub_state(struct uml_pt_regs *regs, struct stub_data *data, extern int set_stub_state(struct uml_pt_regs *regs, struct stub_data *data, int single_stepping); +#ifndef CONFIG_MMU 
+extern void set_mc_sigsys_hook(mcontext_t *mc); +#endif + #ifdef __i386__ #define GET_FAULTINFO_FROM_MC(fi, mc) \ -- 2.43.0 From thehajime at gmail.com Sat Nov 8 00:05:41 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sat, 8 Nov 2025 17:05:41 +0900 Subject: [PATCH v13 06/13] x86/um: nommu: process/thread handling In-Reply-To: References: Message-ID: Since the ptrace facility isn't used in !MMU mode of UML, a different code path is needed to run processes/threads: no external process is used, and some registers (the fs segment register for TLS, etc.) must be configured properly on every context switch. Signals aren't delivered on the non-ptrace syscall entry/exit path, so we also need to handle pending signals ourselves. ptrace-related syscalls are not tested yet, so arch_has_single_step() is marked unsupported in the !MMU environment. Signed-off-by: Hajime Tazaki Signed-off-by: Ricardo Koller --- arch/um/include/asm/ptrace-generic.h | 2 +- arch/x86/um/Makefile | 3 +- arch/x86/um/nommu/Makefile | 2 +- arch/x86/um/nommu/entry_64.S | 2 ++ arch/x86/um/nommu/syscalls.h | 2 ++ arch/x86/um/nommu/syscalls_64.c | 50 ++++++++++++++++++++++++++++ 6 files changed, 58 insertions(+), 3 deletions(-) create mode 100644 arch/x86/um/nommu/syscalls_64.c diff --git a/arch/um/include/asm/ptrace-generic.h b/arch/um/include/asm/ptrace-generic.h index 62e9916078ec..5aa38fe6b2fb 100644 --- a/arch/um/include/asm/ptrace-generic.h +++ b/arch/um/include/asm/ptrace-generic.h @@ -14,7 +14,7 @@ struct pt_regs { struct uml_pt_regs regs; }; -#define arch_has_single_step() (1) +#define arch_has_single_step() (IS_ENABLED(CONFIG_MMU)) #define EMPTY_REGS { .regs = EMPTY_UML_PT_REGS } diff --git a/arch/x86/um/Makefile b/arch/x86/um/Makefile index 39693807755a..98dc57afff83 100644 --- a/arch/x86/um/Makefile +++ b/arch/x86/um/Makefile @@ -26,7 +26,8 @@ subarch-y += ../kernel/sys_ia32.o else -obj-y += mem_64.o syscalls_64.o vdso/ +obj-y += mem_64.o vdso/ +obj-$(CONFIG_MMU) += syscalls_64.o subarch-y = 
../lib/csum-partial_64.o ../lib/memcpy_64.o \ ../lib/memmove_64.o ../lib/memset_64.o diff --git a/arch/x86/um/nommu/Makefile b/arch/x86/um/nommu/Makefile index ebe47d4836f4..4018d9e0aba0 100644 --- a/arch/x86/um/nommu/Makefile +++ b/arch/x86/um/nommu/Makefile @@ -5,4 +5,4 @@ else BITS := 64 endif -obj-y = do_syscall_$(BITS).o entry_$(BITS).o os-Linux/ +obj-y = do_syscall_$(BITS).o entry_$(BITS).o syscalls_$(BITS).o os-Linux/ diff --git a/arch/x86/um/nommu/entry_64.S b/arch/x86/um/nommu/entry_64.S index 485c578aae64..a58922fc81e5 100644 --- a/arch/x86/um/nommu/entry_64.S +++ b/arch/x86/um/nommu/entry_64.S @@ -86,6 +86,8 @@ END(__kernel_vsyscall) */ ENTRY(userspace) + /* set stack and pt_regs to the current task */ + call arch_set_stack_to_current /* clear direction flag to meet ABI */ cld /* align the stack for x86_64 ABI */ diff --git a/arch/x86/um/nommu/syscalls.h b/arch/x86/um/nommu/syscalls.h index a2433756b1fc..ce16bf8abd59 100644 --- a/arch/x86/um/nommu/syscalls.h +++ b/arch/x86/um/nommu/syscalls.h @@ -13,4 +13,6 @@ extern long current_top_of_stack; extern long current_ptregs; +void arch_set_stack_to_current(void); + #endif diff --git a/arch/x86/um/nommu/syscalls_64.c b/arch/x86/um/nommu/syscalls_64.c new file mode 100644 index 000000000000..d56027ebc651 --- /dev/null +++ b/arch/x86/um/nommu/syscalls_64.c @@ -0,0 +1,50 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2003 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) + * Copyright 2003 PathScale, Inc. + * + * Licensed under the GPL + */ + +#include +#include +#include +#include +#include /* XXX This should get the constants from libc */ +#include +#include +#include "syscalls.h" + +void arch_set_stack_to_current(void) +{ + current_top_of_stack = task_top_of_stack(current); + current_ptregs = (long)task_pt_regs(current); +} + +void arch_switch_to(struct task_struct *to) +{ + /* + * In !CONFIG_MMU, it doesn't ptrace thus, + * The FS_BASE registers are saved here. 
+ */ + current_top_of_stack = task_top_of_stack(to); + current_ptregs = (long)task_pt_regs(to); + + if ((to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] == 0) || + (to->mm == NULL)) + return; + + /* this changes the FS on every context switch */ + arch_prctl(to, ARCH_SET_FS, + (void __user *) to->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)]); +} + +SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len, + unsigned long, prot, unsigned long, flags, + unsigned long, fd, unsigned long, off) +{ + if (off & ~PAGE_MASK) + return -EINVAL; + + return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT); +} -- 2.43.0 From thehajime at gmail.com Sat Nov 8 00:05:42 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sat, 8 Nov 2025 17:05:42 +0900 Subject: [PATCH v13 07/13] um: nommu: configure fs register on host syscall invocation In-Reply-To: References: Message-ID: <5b4fab636ab8cbd1db025a0561fe9993990fc869.1762588860.git.thehajime@gmail.com> Because userspace on UML/!MMU also needs to set the %fs register while it runs in order to access its thread structure correctly, host syscalls implemented in the os-Linux drivers may be confused when they are called with the guest's %fs. Thus the %fs register has to be reconfigured via arch_prctl(ARCH_SET_FS) on every host syscall. 
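The switch-and-restore pattern described above can be sketched as a standalone userspace program for x86_64 Linux. This is an illustration under assumptions, not the patch's code: read_fs_base(), write_fs_base(), call_with_host_fs() and double_it() are hypothetical helpers, and the sketch always goes through the arch_prctl syscall rather than the FSGSBASE instructions.

```c
#define _GNU_SOURCE
#include <asm/prctl.h>   /* ARCH_GET_FS, ARCH_SET_FS */
#include <sys/syscall.h> /* SYS_arch_prctl */
#include <unistd.h>      /* syscall() */

/* Hypothetical helpers for illustration; not part of the patch. */
static unsigned long read_fs_base(void)
{
	unsigned long base = 0;

	/* ARCH_GET_FS stores the current FS base through the pointer */
	syscall(SYS_arch_prctl, ARCH_GET_FS, &base);
	return base;
}

static int write_fs_base(unsigned long base)
{
	/* ARCH_SET_FS takes the new base by value */
	return (int)syscall(SYS_arch_prctl, ARCH_SET_FS, base);
}

/*
 * Run fn(arg) with the FS base switched to host_fs, then put the
 * caller's (guest) FS base back before returning.
 */
static long call_with_host_fs(unsigned long host_fs,
			      long (*fn)(void *), void *arg)
{
	unsigned long guest_fs = read_fs_base();
	long ret;

	write_fs_base(host_fs);
	ret = fn(arg);
	write_fs_base(guest_fs);
	return ret;
}

/* Example callback: doubles the long it is handed. */
static long double_it(void *arg)
{
	return 2 * *(long *)arg;
}
```

In an ordinary process the only safe value to pass as host_fs is the one read_fs_base() already returns, since libc keeps its thread-local data behind %fs.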
Signed-off-by: Hajime Tazaki Signed-off-by: Ricardo Koller --- arch/um/include/shared/os.h | 6 +++ arch/um/os-Linux/process.c | 6 +++ arch/um/os-Linux/start_up.c | 21 +++++++++ arch/x86/um/nommu/do_syscall_64.c | 37 ++++++++++++++++ arch/x86/um/nommu/syscalls_64.c | 71 +++++++++++++++++++++++++++++++ 5 files changed, 141 insertions(+) diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h index 5451f9b1f41e..0ac87507e05e 100644 --- a/arch/um/include/shared/os.h +++ b/arch/um/include/shared/os.h @@ -189,6 +189,7 @@ extern void check_host_supports_tls(int *supports_tls, int *tls_min); extern void get_host_cpu_features( void (*flags_helper_func)(char *line), void (*cache_helper_func)(char *line)); +extern int host_has_fsgsbase; /* mem.c */ extern int create_mem_file(unsigned long long len); @@ -213,6 +214,11 @@ extern int os_protect_memory(void *addr, unsigned long len, extern int os_unmap_memory(void *addr, int len); extern int os_drop_memory(void *addr, int length); extern int can_drop_memory(void); +extern int os_arch_prctl(int pid, int option, unsigned long *arg); +#ifndef CONFIG_MMU +extern long long host_fs; +#endif + void os_set_pdeathsig(void); diff --git a/arch/um/os-Linux/process.c b/arch/um/os-Linux/process.c index ddb5258d7720..dacf63ac33c8 100644 --- a/arch/um/os-Linux/process.c +++ b/arch/um/os-Linux/process.c @@ -18,6 +18,7 @@ #include #include #include +#include /* For SYS_xxx definitions */ #include #include #include @@ -179,6 +180,11 @@ int __init can_drop_memory(void) return ok; } +int os_arch_prctl(int pid, int option, unsigned long *arg2) +{ + return syscall(SYS_arch_prctl, option, arg2); +} + void init_new_thread_signals(void) { set_handler(SIGSEGV); diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c index 33e039d2c1bf..c0afe5d8b559 100644 --- a/arch/um/os-Linux/start_up.c +++ b/arch/um/os-Linux/start_up.c @@ -20,6 +20,8 @@ #include #include #include +#include +#include #include #include #include @@ -37,6 +39,8 
@@ #include #include "internal.h" +int host_has_fsgsbase; + static void ptrace_child(void) { int ret; @@ -460,6 +464,20 @@ __uml_setup("seccomp=", uml_seccomp_config, " This is insecure and should only be used with a trusted userspace\n\n" ); +static void __init check_fsgsbase(void) +{ + unsigned long auxv = getauxval(AT_HWCAP2); + + os_info("Checking FSGSBASE instructions..."); + if (auxv & HWCAP2_FSGSBASE) { + host_has_fsgsbase = 1; + os_info("OK\n"); + } else { + host_has_fsgsbase = 0; + os_info("disabled\n"); + } +} + void __init os_early_checks(void) { int pid; @@ -488,6 +506,9 @@ void __init os_early_checks(void) using_seccomp = 0; check_ptrace(); + /* probe fsgsbase instruction */ + check_fsgsbase(); + pid = start_ptraced_child(); if (init_pid_registers(pid)) fatal("Failed to initialize default registers"); diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c index 292d7c578622..9bc630995df9 100644 --- a/arch/x86/um/nommu/do_syscall_64.c +++ b/arch/x86/um/nommu/do_syscall_64.c @@ -2,10 +2,38 @@ #include #include +#include +#include #include #include #include +static int os_x86_arch_prctl(int pid, int option, unsigned long *arg2) +{ + if (!host_has_fsgsbase) + return os_arch_prctl(pid, option, arg2); + + switch (option) { + case ARCH_SET_FS: + wrfsbase(*arg2); + break; + case ARCH_SET_GS: + wrgsbase(*arg2); + break; + case ARCH_GET_FS: + *arg2 = rdfsbase(); + break; + case ARCH_GET_GS: + *arg2 = rdgsbase(); + break; + default: + pr_warn("%s: unsupported option: 0x%x", __func__, option); + break; + } + + return 0; +} + __visible void do_syscall_64(struct pt_regs *regs) { int syscall; @@ -13,6 +41,9 @@ __visible void do_syscall_64(struct pt_regs *regs) syscall = PT_SYSCALL_NR(regs->regs.gp); UPT_SYSCALL_NR(&regs->regs) = syscall; + /* set fs register to the original host one */ + os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs); + if (likely(syscall < NR_syscalls)) { unsigned long ret; @@ -29,4 +60,10 @@ __visible void 
do_syscall_64(struct pt_regs *regs) /* handle tasks and signals at the end */ interrupt_end(); + + /* restore back fs register to userspace configured one */ + os_x86_arch_prctl(0, ARCH_SET_FS, + (void *)(current->thread.regs.regs.gp[FS_BASE + / sizeof(unsigned long)])); + } diff --git a/arch/x86/um/nommu/syscalls_64.c b/arch/x86/um/nommu/syscalls_64.c index d56027ebc651..19d23686fc5b 100644 --- a/arch/x86/um/nommu/syscalls_64.c +++ b/arch/x86/um/nommu/syscalls_64.c @@ -13,8 +13,70 @@ #include /* XXX This should get the constants from libc */ #include #include +#include +#include #include "syscalls.h" +/* + * The guest libc can change FS, which confuses the host libc. + * In fact, changing FS directly is not supported (check + * man arch_prctl). So, whenever we make a host syscall, + * we should be changing FS to the original FS (not the + * one set by the guest libc). This original FS is stored + * in host_fs. + */ +long long host_fs = -1; + +long arch_prctl(struct task_struct *task, int option, + unsigned long __user *arg2) +{ + long ret = -EINVAL; + unsigned long *ptr = arg2, tmp; + + switch (option) { + case ARCH_SET_FS: + if (host_fs == -1) + os_arch_prctl(0, ARCH_GET_FS, (void *)&host_fs); + ret = 0; + break; + case ARCH_SET_GS: + ret = 0; + break; + case ARCH_GET_FS: + case ARCH_GET_GS: + ptr = &tmp; + break; + } + + ret = os_arch_prctl(0, option, ptr); + if (ret) + return ret; + + switch (option) { + case ARCH_SET_FS: + current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)] = + (unsigned long) arg2; + break; + case ARCH_SET_GS: + current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)] = + (unsigned long) arg2; + break; + case ARCH_GET_FS: + ret = put_user(current->thread.regs.regs.gp[FS_BASE / sizeof(unsigned long)], arg2); + break; + case ARCH_GET_GS: + ret = put_user(current->thread.regs.regs.gp[GS_BASE / sizeof(unsigned long)], arg2); + break; + } + + return ret; +} + +SYSCALL_DEFINE2(arch_prctl, int, option, unsigned long, arg2) +{ + return 
arch_prctl(current, option, (unsigned long __user *) arg2); +} + void arch_set_stack_to_current(void) { current_top_of_stack = task_top_of_stack(current); @@ -48,3 +110,12 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len, return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT); } + +static int __init um_nommu_setup_hostfs(void) +{ + /* initialize the host_fs value at boottime */ + os_arch_prctl(0, ARCH_GET_FS, (void *)&host_fs); + + return 0; +} +arch_initcall(um_nommu_setup_hostfs); -- 2.43.0 From thehajime at gmail.com Sat Nov 8 00:05:43 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sat, 8 Nov 2025 17:05:43 +0900 Subject: [PATCH v13 08/13] x86/um/vdso: nommu: vdso memory update In-Reply-To: References: Message-ID: In !MMU mode, the address of the vdso is accessible from userspace. This commit implements the entry point by pointing um_vdso_addr at the address of the allocated vdso page. It also configures the memory permissions of the vdso page so that it is executable. 
Signed-off-by: Hajime Tazaki Signed-off-by: Ricardo Koller --- arch/x86/um/vdso/vma.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/arch/x86/um/vdso/vma.c b/arch/x86/um/vdso/vma.c index 51a2b9f2eca9..0799b3fe7521 100644 --- a/arch/x86/um/vdso/vma.c +++ b/arch/x86/um/vdso/vma.c @@ -9,6 +9,7 @@ #include #include #include +#include unsigned long um_vdso_addr; static struct page *um_vdso; @@ -20,18 +21,29 @@ static int __init init_vdso(void) { BUG_ON(vdso_end - vdso_start > PAGE_SIZE); - um_vdso_addr = task_size - PAGE_SIZE; - um_vdso = alloc_page(GFP_KERNEL); if (!um_vdso) panic("Cannot allocate vdso\n"); copy_page(page_address(um_vdso), vdso_start); +#ifdef CONFIG_MMU + um_vdso_addr = task_size - PAGE_SIZE; +#else + /* this is fine with NOMMU as everything is accessible */ + um_vdso_addr = (unsigned long)page_address(um_vdso); + os_protect_memory((void *)um_vdso_addr, vdso_end - vdso_start, 1, 0, 1); +#endif + + pr_info("vdso_start=%lx um_vdso_addr=%lx 
pg_um_vdso=%lx", + (unsigned long)vdso_start, um_vdso_addr, + (unsigned long)page_address(um_vdso)); + return 0; } subsys_initcall(init_vdso); +#ifdef CONFIG_MMU int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) { struct vm_area_struct *vma; @@ -53,3 +65,4 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) return IS_ERR(vma) ? PTR_ERR(vma) : 0; } +#endif -- 2.43.0 From thehajime at gmail.com Sat Nov 8 00:05:44 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sat, 8 Nov 2025 17:05:44 +0900 Subject: [PATCH v13 09/13] x86/um: nommu: signal handling In-Reply-To: References: Message-ID: This commit updates signal handling in the !MMU environment. It adds alignment for the signal frame, as the frame is used in userspace as-is. Floating-point registers are saved and restored on entry to and exit from the syscall routine so that signal handlers can read and write their contents. It also adds a follow-up routine for SIGSEGV: signal delivery runs on the same stack frame, so an endless loop of SIGSEGV has to be avoided. 
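The alignment requirement mentioned above can be illustrated with a small sketch. FXSAVE and FXRSTOR need a 16-byte-aligned memory operand, and the x86_64 ABI expects a 16-byte-aligned stack at call sites. The names below are assumptions: struct regs_sketch is a hypothetical stand-in for struct uml_pt_regs with an arbitrary field layout, and align_down_16() is a hypothetical helper that mirrors `and $-0x10, %rsp`.

```c
#include <stddef.h>
#include <stdint.h>

/* Round an address down to a 16-byte boundary, like `and $-0x10, %rsp`. */
static uintptr_t align_down_16(uintptr_t addr)
{
	return addr & ~(uintptr_t)0xf;
}

/*
 * Hypothetical stand-in for struct uml_pt_regs: some general-purpose
 * words and a flag, followed by the FP save area. The aligned(16)
 * attribute (GCC/clang extension) forces the save area onto a 16-byte
 * offset, so it is usable by FXSAVE whenever the struct itself is
 * 16-byte aligned.
 */
struct regs_sketch {
	unsigned long gp[27];	/* arbitrary count, for illustration only */
	int is_user;
	unsigned long fp[] __attribute__((aligned(16)));
};
```

Without the attribute, the offset of fp in this sketch would follow is_user with only natural alignment and need not be a multiple of 16.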
Signed-off-by: Hajime Tazaki --- arch/um/include/shared/kern_util.h | 4 + arch/um/nommu/Makefile | 2 +- arch/um/nommu/os-Linux/signal.c | 8 + arch/um/nommu/trap.c | 201 ++++++++++++++++++++++++++ arch/um/os-Linux/signal.c | 3 +- arch/x86/um/nommu/do_syscall_64.c | 6 + arch/x86/um/nommu/os-Linux/mcontext.c | 11 ++ arch/x86/um/shared/sysdep/mcontext.h | 1 + arch/x86/um/shared/sysdep/ptrace.h | 2 +- 9 files changed, 235 insertions(+), 3 deletions(-) create mode 100644 arch/um/nommu/trap.c diff --git a/arch/um/include/shared/kern_util.h b/arch/um/include/shared/kern_util.h index 7798f16a4677..46c8d6336ca1 100644 --- a/arch/um/include/shared/kern_util.h +++ b/arch/um/include/shared/kern_util.h @@ -70,4 +70,8 @@ void um_idle_sleep(void); void kasan_map_memory(void *start, size_t len); +#ifndef CONFIG_MMU +extern void nommu_relay_signal(void *ptr); +#endif + #endif diff --git a/arch/um/nommu/Makefile b/arch/um/nommu/Makefile index baab7c2f57c2..096221590cfd 100644 --- a/arch/um/nommu/Makefile +++ b/arch/um/nommu/Makefile @@ -1,3 +1,3 @@ # SPDX-License-Identifier: GPL-2.0 -obj-y := os-Linux/ +obj-y := trap.o os-Linux/ diff --git a/arch/um/nommu/os-Linux/signal.c b/arch/um/nommu/os-Linux/signal.c index 19043b9652e2..6febb178dcda 100644 --- a/arch/um/nommu/os-Linux/signal.c +++ b/arch/um/nommu/os-Linux/signal.c @@ -5,6 +5,7 @@ #include #include #include +#include void sigsys_handler(int sig, struct siginfo *si, struct uml_pt_regs *regs, void *ptr) @@ -14,3 +15,10 @@ void sigsys_handler(int sig, struct siginfo *si, /* hook syscall via SIGSYS */ set_mc_sigsys_hook(mc); } + +void nommu_relay_signal(void *ptr) +{ + mcontext_t *mc = (mcontext_t *) ptr; + + set_mc_relay_signal(mc); +} diff --git a/arch/um/nommu/trap.c b/arch/um/nommu/trap.c new file mode 100644 index 000000000000..430297517455 --- /dev/null +++ b/arch/um/nommu/trap.c @@ -0,0 +1,201 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include +#include +#include +#include +#include +#include +#include 
+#include +#include +#include +#include + +/* + * Note this is constrained to return 0, -EFAULT, -EACCES, -ENOMEM by + * segv(). + */ +int handle_page_fault(unsigned long address, unsigned long ip, + int is_write, int is_user, int *code_out) +{ + /* !MMU has no pagefault */ + return -EFAULT; +} + +static void show_segv_info(struct uml_pt_regs *regs) +{ + struct task_struct *tsk = current; + struct faultinfo *fi = UPT_FAULTINFO(regs); + + if (!unhandled_signal(tsk, SIGSEGV)) + return; + + pr_warn_ratelimited("%s%s[%d]: segfault at %lx ip %p sp %p error %x", + task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG, + tsk->comm, task_pid_nr(tsk), FAULT_ADDRESS(*fi), + (void *)UPT_IP(regs), (void *)UPT_SP(regs), + fi->error_code); +} + +static void bad_segv(struct faultinfo fi, unsigned long ip) +{ + current->thread.arch.faultinfo = fi; + force_sig_fault(SIGSEGV, SEGV_ACCERR, (void __user *) FAULT_ADDRESS(fi)); +} + +void fatal_sigsegv(void) +{ + force_fatal_sig(SIGSEGV); + do_signal(&current->thread.regs); + /* + * This is to tell gcc that we're not returning - do_signal + * can, in general, return, but in this case, it's not, since + * we just got a fatal SIGSEGV queued. + */ + os_dump_core(); +} + +/** + * segv_handler() - the SIGSEGV handler + * @sig: the signal number + * @unused_si: the signal info struct; unused in this handler + * @regs: the ptrace register information + * + * The handler first extracts the faultinfo from the UML ptrace regs struct. + * If the userfault did not happen in an UML userspace process, bad_segv is called. + * Otherwise the signal did happen in a cloned userspace process, handle it. + */ +void segv_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs, + void *mc) +{ + struct faultinfo *fi = UPT_FAULTINFO(regs); + + /* !MMU specific part; detection of userspace */ + /* mark is_user=1 when the IP is from userspace code. 
*/ + if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) < high_physmem) + regs->is_user = 1; + + if (UPT_IS_USER(regs) && !SEGV_IS_FIXABLE(fi)) { + show_segv_info(regs); + bad_segv(*fi, UPT_IP(regs)); + return; + } + segv(*fi, UPT_IP(regs), UPT_IS_USER(regs), regs, mc); + + /* !MMU specific part; detection of userspace */ + relay_signal(sig, unused_si, regs, mc); +} + +/* + * We give a *copy* of the faultinfo in the regs to segv. + * This must be done, since nesting SEGVs could overwrite + * the info in the regs. A pointer to the info then would + * give us bad data! + */ +unsigned long segv(struct faultinfo fi, unsigned long ip, int is_user, + struct uml_pt_regs *regs, void *mc) +{ + int si_code; + int err; + int is_write = FAULT_WRITE(fi); + unsigned long address = FAULT_ADDRESS(fi); + + if (!is_user && regs) + current->thread.segv_regs = container_of(regs, struct pt_regs, regs); + + if (current->mm == NULL) { + show_regs(container_of(regs, struct pt_regs, regs)); + panic("Segfault with no mm"); + } else if (!is_user && address > PAGE_SIZE && address < TASK_SIZE) { + show_regs(container_of(regs, struct pt_regs, regs)); + panic("Kernel tried to access user memory at addr 0x%lx, ip 0x%lx", + address, ip); + } + + if (SEGV_IS_FIXABLE(&fi)) + err = handle_page_fault(address, ip, is_write, is_user, + &si_code); + else { + err = -EFAULT; + /* + * A thread accessed NULL, we get a fault, but CR2 is invalid. + * This code is used in __do_copy_from_user() of TT mode. 
+ * XXX tt mode is gone, so maybe this isn't needed any more + */ + address = 0; + } + + if (!err) + goto out; + else if (!is_user && arch_fixup(ip, regs)) + goto out; + + if (!is_user) { + show_regs(container_of(regs, struct pt_regs, regs)); + panic("Kernel mode fault at addr 0x%lx, ip 0x%lx", + address, ip); + } + + show_segv_info(regs); + + if (err == -EACCES) { + current->thread.arch.faultinfo = fi; + force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address); + } else { + WARN_ON_ONCE(err != -EFAULT); + current->thread.arch.faultinfo = fi; + force_sig_fault(SIGSEGV, si_code, (void __user *) address); + } + +out: + if (regs) + current->thread.segv_regs = NULL; + + return 0; +} + +void relay_signal(int sig, struct siginfo *si, struct uml_pt_regs *regs, + void *mc) +{ + int code, err; + + /* !MMU specific part; detection of userspace */ + /* mark is_user=1 when the IP is from userspace code. */ + if (UPT_IP(regs) > uml_reserved && UPT_IP(regs) < high_physmem) + regs->is_user = 1; + + if (!UPT_IS_USER(regs)) { + if (sig == SIGBUS) + pr_err("Bus error - the host /dev/shm or /tmp mount likely just ran out of space\n"); + panic("Kernel mode signal %d", sig); + } + /* if is_user==1, set return to userspace sig handler to relay signal */ + nommu_relay_signal(mc); + + arch_examine_signal(sig, regs); + + /* Is the signal layout for the signal known? + * Signal data must be scrubbed to prevent information leaks. 
+ */ + code = si->si_code; + err = si->si_errno; + if ((err == 0) && (siginfo_layout(sig, code) == SIL_FAULT)) { + struct faultinfo *fi = UPT_FAULTINFO(regs); + + current->thread.arch.faultinfo = *fi; + force_sig_fault(sig, code, (void __user *)FAULT_ADDRESS(*fi)); + } else { + pr_err("Attempted to relay unknown signal %d (si_code = %d) with errno %d\n", + sig, code, err); + force_sig(sig); + } +} + +void winch(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs, + void *mc) +{ + do_IRQ(WINCH_IRQ, regs); +} diff --git a/arch/um/os-Linux/signal.c b/arch/um/os-Linux/signal.c index 2f6795cd884c..28754f56c42b 100644 --- a/arch/um/os-Linux/signal.c +++ b/arch/um/os-Linux/signal.c @@ -41,9 +41,10 @@ static void sig_handler_common(int sig, struct siginfo *si, mcontext_t *mc) int save_errno = errno; r.is_user = 0; + if (mc) + get_regs_from_mc(&r, mc); if (sig == SIGSEGV) { /* For segfaults, we want the data from the sigcontext. */ - get_regs_from_mc(&r, mc); GET_FAULTINFO_FROM_MC(r.faultinfo, mc); } diff --git a/arch/x86/um/nommu/do_syscall_64.c b/arch/x86/um/nommu/do_syscall_64.c index 9bc630995df9..cf5a347ee9b1 100644 --- a/arch/x86/um/nommu/do_syscall_64.c +++ b/arch/x86/um/nommu/do_syscall_64.c @@ -44,6 +44,9 @@ __visible void do_syscall_64(struct pt_regs *regs) /* set fs register to the original host one */ os_x86_arch_prctl(0, ARCH_SET_FS, (void *)host_fs); + /* save fp registers */ + asm volatile("fxsaveq %0" : "=m"(*(struct _xstate *)regs->regs.fp)); + if (likely(syscall < NR_syscalls)) { unsigned long ret; @@ -61,6 +64,9 @@ __visible void do_syscall_64(struct pt_regs *regs) /* handle tasks and signals at the end */ interrupt_end(); + /* restore fp registers */ + asm volatile("fxrstorq %0" : : "m"((current->thread.regs.regs.fp))); + /* restore back fs register to userspace configured one */ os_x86_arch_prctl(0, ARCH_SET_FS, (void *)(current->thread.regs.regs.gp[FS_BASE diff --git a/arch/x86/um/nommu/os-Linux/mcontext.c 
b/arch/x86/um/nommu/os-Linux/mcontext.c index b62a6195096f..afa20f1e235a 100644 --- a/arch/x86/um/nommu/os-Linux/mcontext.c +++ b/arch/x86/um/nommu/os-Linux/mcontext.c @@ -4,10 +4,21 @@ #include #include #include +#include +#include "../syscalls.h" extern long __kernel_vsyscall(int64_t a0, int64_t a1, int64_t a2, int64_t a3, int64_t a4, int64_t a5, int64_t a6); +void set_mc_relay_signal(mcontext_t *mc) +{ + /* configure stack and userspace returning routine as + * instruction pointer + */ + mc->gregs[REG_RSP] = (unsigned long) current_top_of_stack; + mc->gregs[REG_RIP] = (unsigned long) userspace; +} + void set_mc_sigsys_hook(mcontext_t *mc) { mc->gregs[REG_RCX] = mc->gregs[REG_RIP]; diff --git a/arch/x86/um/shared/sysdep/mcontext.h b/arch/x86/um/shared/sysdep/mcontext.h index 9a0d6087f357..82a5f38b350f 100644 --- a/arch/x86/um/shared/sysdep/mcontext.h +++ b/arch/x86/um/shared/sysdep/mcontext.h @@ -19,6 +19,7 @@ extern int set_stub_state(struct uml_pt_regs *regs, struct stub_data *data, #ifndef CONFIG_MMU extern void set_mc_sigsys_hook(mcontext_t *mc); +extern void set_mc_relay_signal(mcontext_t *mc); #endif #ifdef __i386__ diff --git a/arch/x86/um/shared/sysdep/ptrace.h b/arch/x86/um/shared/sysdep/ptrace.h index 572ea2d79131..6ed6bb1ca50e 100644 --- a/arch/x86/um/shared/sysdep/ptrace.h +++ b/arch/x86/um/shared/sysdep/ptrace.h @@ -53,7 +53,7 @@ struct uml_pt_regs { int is_user; /* Dynamically sized FP registers (holds an XSTATE) */ - unsigned long fp[]; + unsigned long fp[] __attribute__((aligned(16))); }; #define EMPTY_UML_PT_REGS { } -- 2.43.0 From thehajime at gmail.com Sat Nov 8 00:05:45 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sat, 8 Nov 2025 17:05:45 +0900 Subject: [PATCH v13 10/13] um: change machine name for uname output In-Reply-To: References: Message-ID: <7cfc1ecdcb8fe15edd92d3b1539994e28f3b6d5a.1762588860.git.thehajime@gmail.com> This commit tries to display MMU/!MMU mode from the output of uname(2) so that users can distinguish which 
mode of UML is running right now. Signed-off-by: Hajime Tazaki --- arch/um/Makefile | 6 ++++++ arch/um/os-Linux/util.c | 3 ++- 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/arch/um/Makefile b/arch/um/Makefile index 5371c9a1b11e..9bc8fc149514 100644 --- a/arch/um/Makefile +++ b/arch/um/Makefile @@ -153,6 +153,12 @@ export CFLAGS_vmlinux := $(LINK-y) $(LINK_WRAPS) $(LD_FLAGS_CMDLINE) $(CC_FLAGS_ CLEAN_FILES += linux x.i gmon.out MRPROPER_FILES += $(HOST_DIR)/include/generated +ifeq ($(CONFIG_MMU),y) +UTS_MACHINE := "um" +else +UTS_MACHINE := "um\(nommu\)" +endif + archclean: @find . \( -name '*.bb' -o -name '*.bbg' -o -name '*.da' \ -o -name '*.gcov' \) -type f -print | xargs rm -f diff --git a/arch/um/os-Linux/util.c b/arch/um/os-Linux/util.c index e3ad71a0d13c..5fb26f5dfcb6 100644 --- a/arch/um/os-Linux/util.c +++ b/arch/um/os-Linux/util.c @@ -64,7 +64,8 @@ void setup_machinename(char *machine_out) } # endif #endif - strcpy(machine_out, host.machine); + strcat(machine_out, "/"); + strcat(machine_out, host.machine); } void setup_hostinfo(char *buf, int len) -- 2.43.0 From thehajime at gmail.com Sat Nov 8 00:05:46 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sat, 8 Nov 2025 17:05:46 +0900 Subject: [PATCH v13 11/13] um: nommu: disable SMP on nommu UML In-Reply-To: References: Message-ID: CONFIG_SMP doesn't work with nommu UML since fs register handling of host does conflict with thread local storage (more specifically, the variable signals_enabled). Thus this commit disables the CONFIG option and the TLS variables. 
Signed-off-by: Hajime Tazaki --- arch/um/os-Linux/internal.h | 8 ++++++++ arch/x86/um/Kconfig | 2 +- 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/um/os-Linux/internal.h b/arch/um/os-Linux/internal.h index bac9fcc8c14c..25cb5cc931c1 100644 --- a/arch/um/os-Linux/internal.h +++ b/arch/um/os-Linux/internal.h @@ -6,6 +6,14 @@ #include #include +/* NOMMU doesn't work with thread-local storage used in CONFIG_SMP, + * due to the dependency on host_fs variable switch upon user/kernel + * context so, disable TLS until NOMMU supports SMP. + */ +#ifndef CONFIG_MMU +#define __thread +#endif + /* * elf_aux.c */ diff --git a/arch/x86/um/Kconfig b/arch/x86/um/Kconfig index bdd7c8e39b01..f12e2e4e0a12 100644 --- a/arch/x86/um/Kconfig +++ b/arch/x86/um/Kconfig @@ -12,7 +12,7 @@ config UML_X86 select ARCH_USE_QUEUED_SPINLOCKS select DCACHE_WORD_ACCESS select HAVE_EFFICIENT_UNALIGNED_ACCESS - select UML_SUBARCH_SUPPORTS_SMP if X86_CX8 + select UML_SUBARCH_SUPPORTS_SMP if X86_CX8 && MMU config 64BIT bool "64-bit kernel" if "$(SUBARCH)" = "x86" -- 2.43.0 From thehajime at gmail.com Sat Nov 8 00:05:47 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sat, 8 Nov 2025 17:05:47 +0900 Subject: [PATCH v13 12/13] um: nommu: add documentation of nommu UML In-Reply-To: References: Message-ID: <16940d31af89a3127acf29d23e10dcb9b7b9f4e3.1762588860.git.thehajime@gmail.com> This commit adds an initial documentation for !MMU mode of UML. Signed-off-by: Hajime Tazaki --- Documentation/virt/uml/nommu-uml.rst | 180 +++++++++++++++++++++++++++ MAINTAINERS | 1 + 2 files changed, 181 insertions(+) create mode 100644 Documentation/virt/uml/nommu-uml.rst diff --git a/Documentation/virt/uml/nommu-uml.rst b/Documentation/virt/uml/nommu-uml.rst new file mode 100644 index 000000000000..f049bbc697d1 --- /dev/null +++ b/Documentation/virt/uml/nommu-uml.rst @@ -0,0 +1,180 @@ +.. SPDX-License-Identifier: GPL-2.0 + +UML has been built with CONFIG_MMU since day 0. 
The patchset
+introduces the nommu mode on UML from a different angle than what Linux
+Kernel Library tried.
+
+.. contents:: :local:
+
+What is it for?
+===============
+
+- Alleviate the overhead of the syscall hook implemented with ptrace(2)
+- To exercise nommu code over UML (and over KUnit)
+- Less dependency on host facilities
+
+
+How does it work?
+=================
+
+To illustrate how this feature works, the following shows how syscalls
+are called in the nommu UML environment.
+
+- boot kernel, install seccomp filter to trap ``syscall`` instructions
+  called from userspace memory, based on the address of the instruction
+  pointer
+- (userspace starts)
+- calls ``vfork``/``execve`` syscalls
+- ``SIGSYS`` signal raised, handler calls syscall entry point ``__kernel_vsyscall``
+- call handler function in ``sys_call_table[]`` and follow how UML syscall
+  works.
+- return to userspace
+
+
+What are the differences from MMU-full UML?
+===========================================
+
+The current nommu implementation adds three features that
+MMU-full UML doesn't have:
+
+- kernel address space is directly accessible from userspace
+  - so, ``uaccess()`` always returns 1
+  - generic implementation of memcpy/strcpy/futex is also used
+- alternate syscall entrypoint without ptrace
+- alternate syscall hook
+  - hook syscall by seccomp filter
+
+Those modifications allow us to use unmodified userspace
+binaries with nommu UML.
+
+
+History
+=======
+
+This feature was originally introduced by Ricardo Koller at Open
+Source Summit NA 2020, then integrated with the syscall translation
+functionality, with cleanups to the original code.
+
+Building and running
+====================
+
+::
+
+   make ARCH=um x86_64_nommu_defconfig
+   make ARCH=um
+
+will build UML with ``CONFIG_MMU=n`` applied.
+
+KUnit tests can be run with the following command::
+
+   ./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_MMU=n
+
+To run a typical Linux distribution, we need nommu-aware userspace.
+We can use a stock version of Alpine Linux with nommu-built versions of
+busybox and musl-libc.
+
+
+Preparing root filesystem
+=========================
+
+nommu UML requires a standard library that is aware of nommu
+kernels. We have tested custom-built musl-libc and busybox,
+both of which have built-in support for nommu kernels.
+
+There are no available Linux distributions for nommu under x86_64
+architecture, so we need to prepare our own image for the root
+filesystem. We use Alpine Linux as a base distribution and replace
+busybox and musl-libc on top of that. The following are the steps to
+prepare the filesystem for the quick start::
+
+   container_id=$(docker create ghcr.io/thehajime/alpine:3.20.3-um-nommu)
+   docker start $container_id
+   docker wait $container_id
+   docker export $container_id > alpine.tar
+   docker rm $container_id
+
+   mnt=$(mktemp -d)
+   dd if=/dev/zero of=alpine.ext4 bs=1 count=0 seek=1G
+   sudo chmod og+wr "alpine.ext4"
+   yes 2>/dev/null | mkfs.ext4 "alpine.ext4" || true
+   sudo mount "alpine.ext4" $mnt
+   sudo tar -xf alpine.tar -C $mnt
+   sudo umount $mnt
+
+This will create a file image, ``alpine.ext4``, which contains the nommu
+builds of busybox and musl on the Alpine Linux root filesystem. The
+file can be passed via the ``ubd0=`` argument on the UML command line::
+
+   ./vmlinux ubd0=./alpine.ext4 rw mem=1024m loglevel=8 init=/sbin/init
+
+We plan to upstream apk packages for busybox and musl so that we can
+follow the proper procedure to set up the root filesystem.
+
+
+Quick start with docker
+=======================
+
+There is a docker image that you can quickly start with a simple step::
+
+   docker run -it -v /dev/shm:/dev/shm --rm ghcr.io/thehajime/alpine:3.20.3-um-nommu
+
+This will launch a UML instance with a pre-configured root filesystem.
+
+Benchmark
+=========
+
+The below shows an example of performance measurement conducted with
+lmbench and (self-crafted) getpid benchmark (with v6.17-rc5 uml/next
+ +.. csv-table:: lmbench (usec) + :header: ,native,um,um-mmu(s),um-nommu(s) + + select-10 ,0.5319,36.1214,24.2795,2.9174 + select-100 ,1.6019,34.6049,28.8865,3.8080 + select-1000 ,12.2588,43.6838,48.7438,12.7872 + syscall ,0.1644,35.0321,53.2119,2.5981 + read ,0.3055,31.5509,45.8538,2.7068 + write ,0.2512,31.3609,29.2636,2.6948 + stat ,1.8894,43.8477,49.6121,3.1908 + open/close ,3.2973,77.5123,68.9431,6.2575 + fork+sh ,1110.3000,7359.5000,4618.6667,439.4615 + fork+execve ,510.8182,2834.0000,2461.1667,139.7848 + +.. csv-table:: do_getpid bench (nsec) + :header: ,native,um,um-mmu(s),um-nommu(s) + + getpid , 161 , 34477 , 26242 , 2599 + +(um-nommu(s) is with seccomp syscall hook, um-mmu(s) is SECCOMP mode, +respectively) + +Limitations +=========== + +generic nommu limitations +------------------------- +Since this port is a kernel of nommu architecture so, the +implementation inherits the characteristics of other nommu kernels +(riscv, arm, etc), described below. + +- vfork(2) should be used instead of fork(2) +- ELF loader only loads PIE (position independent executable) binaries +- processes share the address space among others +- mmap(2) offers a subset of functionalities (e.g., unsupported + MMAP_FIXED) + +Thus, we have limited options to userspace programs. We have tested +Alpine Linux with musl-libc, which has a support nommu kernel. + +supported architecture +---------------------- +The current implementation of nommu UML only works on x86_64 SUBARCH. +We have not tested with 32-bit environment. 
+ + +Further readings about NOMMU UML +================================ + +- NOMMU UML (original code by Ricardo Koller) + - https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf diff --git a/MAINTAINERS b/MAINTAINERS index 3da2c26a796b..2f227f56d04e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -26764,6 +26764,7 @@ USER-MODE LINUX (UML) M: Richard Weinberger M: Anton Ivanov M: Johannes Berg +M: Hajime Tazaki L: linux-um at lists.infradead.org S: Maintained W: http://user-mode-linux.sourceforge.net -- 2.43.0 From thehajime at gmail.com Sat Nov 8 00:05:48 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Sat, 8 Nov 2025 17:05:48 +0900 Subject: [PATCH v13 13/13] um: nommu: plug nommu code into build system In-Reply-To: References: Message-ID: Add nommu kernel for um build. defconfig is also provided. Signed-off-by: Hajime Tazaki Signed-off-by: Ricardo Koller --- arch/um/Kconfig | 14 ++++++- arch/um/configs/x86_64_nommu_defconfig | 54 ++++++++++++++++++++++++++ 2 files changed, 66 insertions(+), 2 deletions(-) create mode 100644 arch/um/configs/x86_64_nommu_defconfig diff --git a/arch/um/Kconfig b/arch/um/Kconfig index 097c6a6265ef..4907fd2db512 100644 --- a/arch/um/Kconfig +++ b/arch/um/Kconfig @@ -34,16 +34,19 @@ config UML select ARCH_SUPPORTS_LTO_CLANG_THIN select TRACE_IRQFLAGS_SUPPORT select TTY # Needed for line.c - select HAVE_ARCH_VMAP_STACK + select HAVE_ARCH_VMAP_STACK if MMU select HAVE_RUST select ARCH_HAS_UBSAN select HAVE_ARCH_TRACEHOOK select HAVE_SYSCALL_TRACEPOINTS select THREAD_INFO_IN_TASK select SPARSE_IRQ + select UACCESS_MEMCPY if !MMU + select GENERIC_STRNLEN_USER if !MMU + select GENERIC_STRNCPY_FROM_USER if !MMU config MMU - bool + bool "MMU-based Paged Memory Management Support" if 64BIT default y config UML_DMA_EMULATION @@ -225,8 +228,15 @@ config MAGIC_SYSRQ The keys are documented in . Don't say Y unless you really know what this hack does. 
+config ARCH_FORCE_MAX_ORDER + int "Order of maximal physically contiguous allocations" if EXPERT + default "10" if MMU + default "16" if !MMU + config KERNEL_STACK_ORDER int "Kernel stack size order" + default 3 if !MMU + range 3 10 if !MMU default 2 if 64BIT range 2 10 if 64BIT default 1 if !64BIT diff --git a/arch/um/configs/x86_64_nommu_defconfig b/arch/um/configs/x86_64_nommu_defconfig new file mode 100644 index 000000000000..02cb87091c9f --- /dev/null +++ b/arch/um/configs/x86_64_nommu_defconfig @@ -0,0 +1,54 @@ +CONFIG_SYSVIPC=y +CONFIG_POSIX_MQUEUE=y +CONFIG_NO_HZ=y +CONFIG_HIGH_RES_TIMERS=y +CONFIG_BSD_PROCESS_ACCT=y +CONFIG_IKCONFIG=y +CONFIG_IKCONFIG_PROC=y +CONFIG_LOG_BUF_SHIFT=14 +CONFIG_CGROUPS=y +CONFIG_BLK_CGROUP=y +CONFIG_CGROUP_SCHED=y +CONFIG_CGROUP_DEVICE=y +CONFIG_CGROUP_CPUACCT=y +# CONFIG_PID_NS is not set +CONFIG_CC_OPTIMIZE_FOR_SIZE=y +# CONFIG_MMU is not set +CONFIG_HOSTFS=y +CONFIG_MAGIC_SYSRQ=y +CONFIG_SSL=y +CONFIG_NULL_CHAN=y +CONFIG_PORT_CHAN=y +CONFIG_PTY_CHAN=y +CONFIG_TTY_CHAN=y +CONFIG_CON_CHAN="pts" +CONFIG_SSL_CHAN="pts" +CONFIG_MODULES=y +CONFIG_MODULE_UNLOAD=y +CONFIG_IOSCHED_BFQ=m +CONFIG_BINFMT_MISC=m +CONFIG_NET=y +CONFIG_PACKET=y +CONFIG_UNIX=y +CONFIG_INET=y +CONFIG_DEVTMPFS=y +CONFIG_DEVTMPFS_MOUNT=y +CONFIG_BLK_DEV_UBD=y +CONFIG_BLK_DEV_LOOP=m +CONFIG_BLK_DEV_NBD=m +CONFIG_DUMMY=m +CONFIG_TUN=m +CONFIG_PPP=m +CONFIG_SLIP=m +CONFIG_LEGACY_PTY_COUNT=32 +CONFIG_UML_RANDOM=y +CONFIG_EXT4_FS=y +CONFIG_QUOTA=y +CONFIG_AUTOFS_FS=m +CONFIG_ISO9660_FS=m +CONFIG_JOLIET=y +CONFIG_NLS=y +CONFIG_DEBUG_KERNEL=y +CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y +CONFIG_FRAME_WARN=1024 +CONFIG_IPV6=y -- 2.43.0 From bagasdotme at gmail.com Sat Nov 8 01:19:33 2025 From: bagasdotme at gmail.com (Bagas Sanjaya) Date: Sat, 8 Nov 2025 16:19:33 +0700 Subject: [PATCH v2] vfs: remove the excl argument from the ->create() inode_operation In-Reply-To: References: <20251107-create-excl-v2-1-f678165d7f3f@kernel.org> Message-ID: On Sat, Nov 08, 2025 at 
03:12:10PM +0900, Dominique Martinet wrote:
> Jeff Layton wrote on Fri, Nov 07, 2025 at 10:05:03AM -0500:
> > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> > index 4f13b01e42eb5e2ad9d60cbbce7e47d09ad831e6..7a55e491e0c87a0d18909bd181754d6d68318059 100644
> > --- a/Documentation/filesystems/vfs.rst
> > +++ b/Documentation/filesystems/vfs.rst
> > @@ -505,7 +505,10 @@ otherwise noted.
> > if you want to support regular files. The dentry you get should
> > not have an inode (i.e. it should be a negative dentry). Here
> > you will probably call d_instantiate() with the dentry and the
> > - newly created inode
> > + newly created inode. This operation should always provide O_EXCL
>
> This and the block below change halfway from tab (old text) to spaces
> (your patch)
>
> Looks like the file has a few space-indented sections though so it won't
> be the first if that goes in as is, the html-rendering doesn't seem to
> care :)

FYI: I'm using Vim. My important settings (in ~/.vimrc) are:

```
set nojoinspaces
set textwidth=0
set backspace=2
```

However, ftplugin override these for each file type, so you have to
essentially "fork" the relevant ftplugin file for each type if you want
for your settings to take precedence. For example, in case of reST, copy
/usr/share/vim/vim91/ftplugin/rst.vim to ~/.vim/ftplugin/rst and override
the already defined options there:

```
...
" keep tabs as-is
setlocal comments=fb:.. commentstring=..\ %s noexpandtab
...
if exists("g:rst_style") && g:rst_style != 0
    setlocal noexpandtab shiftwidth=8 softtabstop=0 tabstop=8
endif
...
```

Thanks.

--
An old man doll... just what I always wanted! - Clara

From hch at infradead.org Mon Nov 10 01:14:26 2025
From: hch at infradead.org (Christoph Hellwig)
Date: Mon, 10 Nov 2025 01:14:26 -0800
Subject: [PATCH v13 00/13] nommu UML
In-Reply-To:
References:
Message-ID:

On Sat, Nov 08, 2025 at 05:05:35PM +0900, Hajime Tazaki wrote:
> This patchset is another spin of nommu mode addition to UML. It would
> be nice to hear about your opinions on that.

I've not seen any explanation of the use case and/or benefits anywhere
in this cover letter or the patches. Without that it's usually pretty
hard to get maintainers and reviewers excited.

From thehajime at gmail.com Mon Nov 10 04:18:05 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Mon, 10 Nov 2025 21:18:05 +0900
Subject: [PATCH v13 00/13] nommu UML
In-Reply-To:
References:
Message-ID:

Hello,

On Mon, 10 Nov 2025 18:14:26 +0900,
Christoph Hellwig wrote:
>
> On Sat, Nov 08, 2025 at 05:05:35PM +0900, Hajime Tazaki wrote:
> > This patchset is another spin of nommu mode addition to UML. It would
> > be nice to hear about your opinions on that.
>
> I've not seen any explanation of the use case and/or benefits anywhere
> in this cover letter or the patches. Without that it's usually pretty
> hard to get maintainers and reviewers excited.

thank you for the comment.

I tried to include this explanation in the document patch [12/13],
which I copied from the text below.

What is it for ?
================

- Alleviate syscall hook overhead implemented with ptrace(2)
- To exercises nommu code over UML (and over KUnit)
- Less dependency to host facilities

the first item is for speed up, the second item is for more testing,
the last item is for more extensibility in the future.

Early version of this patchset included this information as well as
the whole documentation, but I removed it as the versions grow. But I
can revert it to the cover letter if it helps.

-- Hajime

From jlayton at kernel.org Mon Nov 10 05:13:00 2025
From: jlayton at kernel.org (Jeff Layton)
Date: Mon, 10 Nov 2025 08:13:00 -0500
Subject: [PATCH v3] vfs: remove the excl argument from the ->create() inode_operation
Message-ID: <20251110-create-excl-v3-1-836a61d14fb0@kernel.org>

With three exceptions, ->create() methods provided by filesystems ignore
the "excl" flag. Those exceptions are NFS, GFS2 and vboxsf which all also
provide ->atomic_open.

Since ce8644fcadc5 ("lookup_open(): expand the call of vfs_create()"),
the "excl" argument to the ->create() inode_operation is always set to
true in vfs_create(). The ->create() call in lookup_open() sets it
according to the O_EXCL open flag, but is never called if the filesystem
provides ->atomic_open().

The excl flag is therefore always either ignored or true. Remove it,
and change NFS, GFS2 and vboxsf to act as if it were always true.

Reviewed-by: Dominique Martinet
Reviewed-by: NeilBrown
Signed-off-by: Jeff Layton
---
Minor update to fix vboxsf case that I somehow missed in the first
version, and some minor whitespace cleanup in the docs.

Reminder that this should be applied on top of the directory delegation
series [1].

[1]: https://lore.kernel.org/linux-nfs/20251105-dir-deleg-ro-v5-0-7ebc168a88ac at kernel.org/
---
Changes in v3:
- fix use of excl in vboxsf_dir_mkfile()
- fix tab prefixes in Documentation/filesystems/vfs.rst
- Link to v2: https://lore.kernel.org/r/20251107-create-excl-v2-1-f678165d7f3f at kernel.org

Changes in v2:
- better describe why the argument isn't needed in the changelog
- updates to Documentation/
- Link to v1: https://lore.kernel.org/r/20251105-create-excl-v1-1-a4cce035cc55 at kernel.org
---
 Documentation/filesystems/porting.rst | 12 ++++++++++++
 Documentation/filesystems/vfs.rst     | 13 ++++++++++---
 fs/9p/vfs_inode.c                     |  2 +-
 fs/9p/vfs_inode_dotl.c                |  2 +-
 fs/affs/affs.h                        |  2 +-
 fs/affs/namei.c                       |  2 +-
 fs/afs/dir.c                          |  4 ++--
 fs/bad_inode.c                        |  2 +-
 fs/bfs/dir.c                          |  2 +-
 fs/btrfs/inode.c                      |  2 +-
 fs/ceph/dir.c                         |  2 +-
 fs/coda/dir.c                         |  2 +-
 fs/ecryptfs/inode.c                   |  2 +-
 fs/efivarfs/inode.c                   |  2 +-
 fs/exfat/namei.c                      |  2 +-
 fs/ext2/namei.c                       |  2 +-
 fs/ext4/namei.c                       |  2 +-
 fs/f2fs/namei.c                       |  2 +-
 fs/fat/namei_msdos.c                  |  2 +-
 fs/fat/namei_vfat.c                   |  2 +-
 fs/fuse/dir.c                         |  2 +-
 fs/gfs2/inode.c                       |  5 ++---
 fs/hfs/dir.c                          |  2 +-
 fs/hfsplus/dir.c                      |  2 +-
 fs/hostfs/hostfs_kern.c               |  2 +-
 fs/hpfs/namei.c                       |  2 +-
 fs/hugetlbfs/inode.c                  |  2 +-
 fs/jffs2/dir.c                        |  4 ++--
 fs/jfs/namei.c                        |  2 +-
 fs/minix/namei.c                      |  2 +-
 fs/namei.c                            |  4 ++--
 fs/nfs/dir.c                          |  4 ++--
 fs/nfs/internal.h                     |  2 +-
 fs/nilfs2/namei.c                     |  2 +-
 fs/ntfs3/namei.c                      |  2 +-
 fs/ocfs2/dlmfs/dlmfs.c                |  3 +--
 fs/ocfs2/namei.c                      |  3 +--
 fs/omfs/dir.c                         |  2 +-
 fs/orangefs/namei.c                   |  3 +--
 fs/overlayfs/dir.c                    |  2 +-
 fs/ramfs/inode.c                      |  2 +-
 fs/smb/client/cifsfs.h                |  2 +-
 fs/smb/client/dir.c                   |  2 +-
 fs/ubifs/dir.c                        |  2 +-
 fs/udf/namei.c                        |  2 +-
 fs/ufs/namei.c                        |  3 +--
 fs/vboxsf/dir.c                       |  4 ++--
 fs/xfs/xfs_iops.c                     |  3 +--
 include/linux/fs.h                    |  4 ++--
 ipc/mqueue.c                          |  2 +-
 mm/shmem.c                            |  2 +-
 51 files changed, 78 insertions(+), 65 deletions(-)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index
7233b04668fcce75f1ed170329a2cd18110a7d89..d71a3f5c626e578f0370986975ca50292c8e15c3 100644 --- a/Documentation/filesystems/porting.rst +++ b/Documentation/filesystems/porting.rst @@ -1309,3 +1309,15 @@ a different length, use vfs_parse_fs_qstr(fc, key, &QSTR_LEN(value, len)) instead. + +--- + +**mandatory** + +The ->create() operation has dropped the bool "excl" argument. This operation +should now always provide O_EXCL semantics (i.e. fail with -EEXIST if the file +exists). If the filesystem needs to handle the case where another entity could +create the file on the backing store after a negative lookup or revalidate +(e.g. it's a network filesystem and another client could create the file after +a negative lookup), then it will require ->atomic_open() in addition to +->create(). diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index 4f13b01e42eb5e2ad9d60cbbce7e47d09ad831e6..0752ed2b6475ab2b42482fde6dff870110a33eac 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -467,7 +467,7 @@ As of kernel 2.6.22, the following members are defined: .. code-block:: c struct inode_operations { - int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool); + int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t); struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); int (*link) (struct dentry *,struct inode *,struct dentry *); int (*unlink) (struct inode *,struct dentry *); @@ -505,7 +505,10 @@ otherwise noted. if you want to support regular files. The dentry you get should not have an inode (i.e. it should be a negative dentry). Here you will probably call d_instantiate() with the dentry and the - newly created inode + newly created inode. This operation should always provide O_EXCL + semantics (i.e. it should fail with -EEXIST if the file exists). 
+ If the filesystem needs to mediate non-exclusive creation, + then the filesystem must also provide an ->atomic_open() operation. ``lookup`` called when the VFS needs to look up an inode in a parent @@ -654,7 +657,11 @@ otherwise noted. handled by f_op->open(). If the file was created, FMODE_CREATED flag should be set in file->f_mode. In case of O_EXCL the method must only succeed if the file didn't exist and hence - FMODE_CREATED shall always be set on success. + FMODE_CREATED shall always be set on success. This method is + usually needed on filesystems where the dentry to be created could + unexpectedly become positive after the kernel has looked it up or + revalidated it. (e.g. another host racing in and creating the file + on an NFS server). ``tmpfile`` called in the end of O_TMPFILE open(). Optional, equivalent to diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c index 69f378a837753e934c20b599660f8a756127e40a..595244d57cba62869b9af8b909af67d3c61e7f6c 100644 --- a/fs/9p/vfs_inode.c +++ b/fs/9p/vfs_inode.c @@ -643,7 +643,7 @@ v9fs_create(struct v9fs_session_info *v9ses, struct inode *dir, static int v9fs_vfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct v9fs_session_info *v9ses = v9fs_inode2v9ses(dir); u32 perm = unixmode2p9mode(v9ses, mode); diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c index 0b404e8484d22e2cbe60d846e0fa653001cdc4b1..de8fe9954d433c9b14ff5dd72ba13c3d5a67ebe7 100644 --- a/fs/9p/vfs_inode_dotl.c +++ b/fs/9p/vfs_inode_dotl.c @@ -218,7 +218,7 @@ int v9fs_open_to_dotl_flags(int flags) */ static int v9fs_vfs_create_dotl(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t omode, bool excl) + struct dentry *dentry, umode_t omode) { return v9fs_vfs_mknod_dotl(idmap, dir, dentry, omode, 0); } diff --git a/fs/affs/affs.h b/fs/affs/affs.h index 
ac4e9a02910b72d63c8ec5291347b54518e67f4b..665be23c42cfa206dc0a2c9ffa119b7c3c747389 100644 --- a/fs/affs/affs.h +++ b/fs/affs/affs.h @@ -167,7 +167,7 @@ extern int affs_hash_name(struct super_block *sb, const u8 *name, unsigned int l extern struct dentry *affs_lookup(struct inode *dir, struct dentry *dentry, unsigned int); extern int affs_unlink(struct inode *dir, struct dentry *dentry); extern int affs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool); + struct dentry *dentry, umode_t mode); extern struct dentry *affs_mkdir(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, umode_t mode); extern int affs_rmdir(struct inode *dir, struct dentry *dentry); diff --git a/fs/affs/namei.c b/fs/affs/namei.c index f883be50db122d3b09f0ae4d24618bd49b55186b..5591e1b5a2f68fc7600115e241f01f81d3aac010 100644 --- a/fs/affs/namei.c +++ b/fs/affs/namei.c @@ -243,7 +243,7 @@ affs_unlink(struct inode *dir, struct dentry *dentry) int affs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct super_block *sb = dir->i_sb; struct inode *inode; diff --git a/fs/afs/dir.c b/fs/afs/dir.c index 89d36e3e5c7999c2e448b78e86896d8893a8a7a9..09224aca8cad37ad273fd0c1ac292f0c15e078b5 100644 --- a/fs/afs/dir.c +++ b/fs/afs/dir.c @@ -32,7 +32,7 @@ static bool afs_lookup_one_filldir(struct dir_context *ctx, const char *name, in static bool afs_lookup_filldir(struct dir_context *ctx, const char *name, int nlen, loff_t fpos, u64 ino, unsigned dtype); static int afs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl); + struct dentry *dentry, umode_t mode); static struct dentry *afs_mkdir(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, umode_t mode); static int afs_rmdir(struct inode *dir, struct dentry *dentry); @@ -1637,7 +1637,7 @@ static const struct afs_operation_ops 
afs_create_operation = { * create a regular file on an AFS filesystem */ static int afs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct afs_operation *op; struct afs_vnode *dvnode = AFS_FS_I(dir); diff --git a/fs/bad_inode.c b/fs/bad_inode.c index 0ef9bcb744dd620bf47caa024d97a1316ff7bc89..5701361cf98155a61cb75a4ec602e8fc615eb3ae 100644 --- a/fs/bad_inode.c +++ b/fs/bad_inode.c @@ -29,7 +29,7 @@ static const struct file_operations bad_file_ops = static int bad_inode_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, bool excl) + umode_t mode) { return -EIO; } diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c index c375e22c4c0c15ba27307d266adfe3f093b90ab8..6beb8605c523cc2c7250d7b1a61508e103f0f3fd 100644 --- a/fs/bfs/dir.c +++ b/fs/bfs/dir.c @@ -76,7 +76,7 @@ const struct file_operations bfs_dir_operations = { }; static int bfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { int err; struct inode *inode; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3b1b3a0553eea06229255ad0284d76074bdb958a..8e06baeabae594850607366ea4f4f0fa41e3b464 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6816,7 +6816,7 @@ static int btrfs_mknod(struct mnt_idmap *idmap, struct inode *dir, } static int btrfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c index d18c0eaef9b7e7be7eb517c701d6c4af08fd78ac..308903dc0780dbed2382228005d0221f185c61ee 100644 --- a/fs/ceph/dir.c +++ b/fs/ceph/dir.c @@ -976,7 +976,7 @@ static int ceph_mknod(struct mnt_idmap *idmap, struct inode *dir, } static int ceph_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct 
dentry *dentry, umode_t mode) { return ceph_mknod(idmap, dir, dentry, mode, 0); } diff --git a/fs/coda/dir.c b/fs/coda/dir.c index ca99900172657d80a479b2eb27f50effdf834995..554e7fd44e5df1aae6da2c41a492a02ae9e0d616 100644 --- a/fs/coda/dir.c +++ b/fs/coda/dir.c @@ -134,7 +134,7 @@ static inline void coda_dir_drop_nlink(struct inode *dir) /* creation routines: create, mknod, mkdir, link, symlink */ static int coda_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *de, umode_t mode, bool excl) + struct dentry *de, umode_t mode) { int error; const char *name=de->d_name.name; diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c index ba15e7359dfa6e150b577205991010873a633511..9a1ba68b16f3d6c4551e2d75e1e27309159c062e 100644 --- a/fs/ecryptfs/inode.c +++ b/fs/ecryptfs/inode.c @@ -262,7 +262,7 @@ int ecryptfs_initialize_file(struct dentry *ecryptfs_dentry, static int ecryptfs_create(struct mnt_idmap *idmap, struct inode *directory_inode, struct dentry *ecryptfs_dentry, - umode_t mode, bool excl) + umode_t mode) { struct inode *ecryptfs_inode; int rc; diff --git a/fs/efivarfs/inode.c b/fs/efivarfs/inode.c index 2891614abf8d554f563319187b6d54c2bc006a91..043b3e3a4f0adefe27855f8156b946c1dc4bd184 100644 --- a/fs/efivarfs/inode.c +++ b/fs/efivarfs/inode.c @@ -75,7 +75,7 @@ static bool efivarfs_valid_name(const char *str, int len) } static int efivarfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode = NULL; struct efivar_entry *var; diff --git a/fs/exfat/namei.c b/fs/exfat/namei.c index 7eb9c67fd35f4c54e18061a948806f20455675cf..c272a522c571044fd0cdc7630be30bdcec2ab8e5 100644 --- a/fs/exfat/namei.c +++ b/fs/exfat/namei.c @@ -543,7 +543,7 @@ static int exfat_add_entry(struct inode *inode, const char *path, } static int exfat_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, 
umode_t mode) { struct super_block *sb = dir->i_sb; struct inode *inode; diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index bde617a66cecd4a2bf12a713a2297bb4fee45916..edea7784ad39acd4afffc7f5ae6e50a20c04999d 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -101,7 +101,7 @@ struct dentry *ext2_get_parent(struct dentry *child) */ static int ext2_create (struct mnt_idmap * idmap, struct inode * dir, struct dentry * dentry, - umode_t mode, bool excl) + umode_t mode) { struct inode *inode; int err; diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 2cd36f59c9e363124ee949f742adccd88447295a..a1e77390a7ce300db02db9af90e45d69efabfea5 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -2806,7 +2806,7 @@ static int ext4_add_nondir(handle_t *handle, * with d_instantiate(). */ static int ext4_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { handle_t *handle; struct inode *inode; diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c index b882771e469971dcf4e7a42416f9fbb8a5d9bf39..9bcbb8b521501b22d0fe2238b7729c342e95baa4 100644 --- a/fs/f2fs/namei.c +++ b/fs/f2fs/namei.c @@ -351,7 +351,7 @@ static struct inode *f2fs_new_inode(struct mnt_idmap *idmap, } static int f2fs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct f2fs_sb_info *sbi = F2FS_I_SB(dir); struct inode *inode; diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c index 0b920ee40a7f9fe3c57af5d939d3efedf001a3d9..905ffa9e5b99f1507734d99b7c16dcad21d7b5b5 100644 --- a/fs/fat/namei_msdos.c +++ b/fs/fat/namei_msdos.c @@ -262,7 +262,7 @@ static int msdos_add_entry(struct inode *dir, const unsigned char *name, /***** Create a file */ static int msdos_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct super_block *sb = dir->i_sb; 
struct inode *inode = NULL; diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c index 5dbc4cbb8fce3d9b891cbc597f876c2c7b8d6aa0..8396b1ec4ec582fcdfadbcb12b04694ef0b8c5fc 100644 --- a/fs/fat/namei_vfat.c +++ b/fs/fat/namei_vfat.c @@ -754,7 +754,7 @@ static struct dentry *vfat_lookup(struct inode *dir, struct dentry *dentry, } static int vfat_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct super_block *sb = dir->i_sb; struct inode *inode; diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c index 667774cc72a1d49796f531fcb342d2e4878beb85..b7a2cee9b18313f88e745c5bb406bcc72866e390 100644 --- a/fs/fuse/dir.c +++ b/fs/fuse/dir.c @@ -889,7 +889,7 @@ static int fuse_mknod(struct mnt_idmap *idmap, struct inode *dir, } static int fuse_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *entry, umode_t mode, bool excl) + struct dentry *entry, umode_t mode) { return fuse_mknod(idmap, dir, entry, mode, 0); } diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c index 8a7ed80d9f2d6e829b240629bdd18b5e0d30b5fc..b8e399dd1182b6ede0bcf1aa78bd7f9f2dca8b2b 100644 --- a/fs/gfs2/inode.c +++ b/fs/gfs2/inode.c @@ -942,15 +942,14 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry, * @dir: The directory in which to create the file * @dentry: The dentry of the new file * @mode: The mode of the new file - * @excl: Force fail if inode exists * * Returns: errno */ static int gfs2_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { - return gfs2_create_inode(dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, excl); + return gfs2_create_inode(dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, 1); } /** diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c index 86a6b317b474a95f283f6a0908582efadde80892..c585942aa985686ca428d2d17f4401aa845a0eb8 100644 --- a/fs/hfs/dir.c +++ b/fs/hfs/dir.c @@ -190,7 +190,7 @@ 
static int hfs_dir_release(struct inode *inode, struct file *file) * the directory and the name (and its length) of the new file. */ static int hfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; int res; diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c index 1b3e27a0d5e038b559bd19b37d769078b2996d1b..c5ea04e078340a91b992095e189e978a3345f03c 100644 --- a/fs/hfsplus/dir.c +++ b/fs/hfsplus/dir.c @@ -518,7 +518,7 @@ static int hfsplus_mknod(struct mnt_idmap *idmap, struct inode *dir, } static int hfsplus_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return hfsplus_mknod(&nop_mnt_idmap, dir, dentry, mode, 0); } diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c index 1e1acf5775ab5f6daf13bb917966d05f410d5ff5..18ca8cb9aa15e4015582ee5bd3db968c6b32de4b 100644 --- a/fs/hostfs/hostfs_kern.c +++ b/fs/hostfs/hostfs_kern.c @@ -593,7 +593,7 @@ static struct inode *hostfs_iget(struct super_block *sb, char *name) } static int hostfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; char *name; diff --git a/fs/hpfs/namei.c b/fs/hpfs/namei.c index 353e13a615f56664638f08a3408f90a727f5458b..809113d8248d50c0eaa57047b6c4bd87b9a5c6be 100644 --- a/fs/hpfs/namei.c +++ b/fs/hpfs/namei.c @@ -129,7 +129,7 @@ static struct dentry *hpfs_mkdir(struct mnt_idmap *idmap, struct inode *dir, } static int hpfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { const unsigned char *name = dentry->d_name.name; unsigned len = dentry->d_name.len; diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 
9c94ed8c3ab0028772b7afb5d03a91d280c38106..0fd0d73e450bdedd92b953b9dd00f6babe1246e7 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -1001,7 +1001,7 @@ static struct dentry *hugetlbfs_mkdir(struct mnt_idmap *idmap, struct inode *dir static int hugetlbfs_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, bool excl) + umode_t mode) { return hugetlbfs_mknod(idmap, dir, dentry, mode | S_IFREG, 0); } diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c index dd91f725ded69ccb3a240aafd72a4b552f21bcd9..e77c84e43621a8c53e9852843f18cc3514315650 100644 --- a/fs/jffs2/dir.c +++ b/fs/jffs2/dir.c @@ -25,7 +25,7 @@ static int jffs2_readdir (struct file *, struct dir_context *); static int jffs2_create (struct mnt_idmap *, struct inode *, - struct dentry *, umode_t, bool); + struct dentry *, umode_t); static struct dentry *jffs2_lookup (struct inode *,struct dentry *, unsigned int); static int jffs2_link (struct dentry *,struct inode *,struct dentry *); @@ -161,7 +161,7 @@ static int jffs2_readdir(struct file *file, struct dir_context *ctx) static int jffs2_create(struct mnt_idmap *idmap, struct inode *dir_i, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct jffs2_raw_inode *ri; struct jffs2_inode_info *f, *dir_f; diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c index 65a218eba8faf9508f5727515b812f6de2661618..48111f8d3efe40becadd857c56c84ed09de867ef 100644 --- a/fs/jfs/namei.c +++ b/fs/jfs/namei.c @@ -60,7 +60,7 @@ static inline void free_ea_wmap(struct inode *inode) * */ static int jfs_create(struct mnt_idmap *idmap, struct inode *dip, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { int rc = 0; tid_t tid; /* transaction id */ diff --git a/fs/minix/namei.c b/fs/minix/namei.c index 8938536d8d3cf65c7e57f88f1819689365951fea..6540574f54781eab487074de7fe10ed38b1a8d1e 100644 --- a/fs/minix/namei.c +++ b/fs/minix/namei.c @@ -64,7 +64,7 @@ static int 
minix_tmpfile(struct mnt_idmap *idmap, struct inode *dir, } static int minix_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return minix_mknod(&nop_mnt_idmap, dir, dentry, mode, 0); } diff --git a/fs/namei.c b/fs/namei.c index d5ab28947b2b6c6e19c7bb4a9140ccec407dc07c..83da60fc298e523096e881b25c727d14f9553476 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3493,7 +3493,7 @@ int vfs_create(struct mnt_idmap *idmap, struct dentry *dentry, umode_t mode, error = try_break_deleg(dir, di); if (error) return error; - error = dir->i_op->create(idmap, dir, dentry, mode, true); + error = dir->i_op->create(idmap, dir, dentry, mode); if (!error) fsnotify_create(dir, dentry); return error; @@ -3802,7 +3802,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file, } error = dir_inode->i_op->create(idmap, dir_inode, dentry, - mode, open_flag & O_EXCL); + mode); if (error) goto out_dput; } diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 46d9c65d50f83fc1dc73f3d7f5868b84132bb0fd..7fe18efcd37b08030c7a4e17832801abfc19a3bd 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -2377,9 +2377,9 @@ static int nfs_do_create(struct inode *dir, struct dentry *dentry, } int nfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { - return nfs_do_create(dir, dentry, mode, excl ? 
O_EXCL : 0); + return nfs_do_create(dir, dentry, mode, O_EXCL); } EXPORT_SYMBOL_GPL(nfs_create); diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index 2ecd38e1d17a8053a9134702588d57efc35f49e9..b122c4f34f7b53c5102a8b5138efe269af433c81 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -398,7 +398,7 @@ extern unsigned long nfs_access_cache_scan(struct shrinker *shrink, struct dentry *nfs_lookup(struct inode *, struct dentry *, unsigned int); void nfs_d_prune_case_insensitive_aliases(struct inode *inode); int nfs_create(struct mnt_idmap *, struct inode *, struct dentry *, - umode_t, bool); + umode_t); struct dentry *nfs_mkdir(struct mnt_idmap *, struct inode *, struct dentry *, umode_t); int nfs_rmdir(struct inode *, struct dentry *); diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c index 40f4b1a28705b6e0eb8f0978cf3ac18b43aa1331..31d1d466c03048aaaab23f64c3f413c095939770 100644 --- a/fs/nilfs2/namei.c +++ b/fs/nilfs2/namei.c @@ -86,7 +86,7 @@ nilfs_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags) * with d_instantiate(). 
*/ static int nilfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; struct nilfs_transaction_info ti; diff --git a/fs/ntfs3/namei.c b/fs/ntfs3/namei.c index 82c8ae56beee6d79046dd6c8f02ff0f35e9a1ad3..49fe635b550d3f51f81138649b47c9c831a73e3b 100644 --- a/fs/ntfs3/namei.c +++ b/fs/ntfs3/namei.c @@ -105,7 +105,7 @@ static struct dentry *ntfs_lookup(struct inode *dir, struct dentry *dentry, * ntfs_create - inode_operations::create */ static int ntfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return ntfs_create_inode(idmap, dir, dentry, NULL, S_IFREG | mode, 0, NULL, 0, NULL); diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c index cccaa1d6fbbac13ebcaf14a9183277890708e643..bd4b2269598b49c6f88dd8d201e246ee5ed855a6 100644 --- a/fs/ocfs2/dlmfs/dlmfs.c +++ b/fs/ocfs2/dlmfs/dlmfs.c @@ -454,8 +454,7 @@ static struct dentry *dlmfs_mkdir(struct mnt_idmap * idmap, static int dlmfs_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, - bool excl) + umode_t mode) { int status = 0; struct inode *inode; diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c index c90b254da75eb5b90d2af5e37d41e781efe8b836..7443f468f45657cf68779a02e4edf4e38fb70f59 100644 --- a/fs/ocfs2/namei.c +++ b/fs/ocfs2/namei.c @@ -666,8 +666,7 @@ static struct dentry *ocfs2_mkdir(struct mnt_idmap *idmap, static int ocfs2_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, - bool excl) + umode_t mode) { int ret; diff --git a/fs/omfs/dir.c b/fs/omfs/dir.c index 2ed541fccf331d796805dd1594fbf05c1f7f3b9a..a09a98f7e30bc66deca60725f9462d081b5e4784 100644 --- a/fs/omfs/dir.c +++ b/fs/omfs/dir.c @@ -286,7 +286,7 @@ static struct dentry *omfs_mkdir(struct mnt_idmap *idmap, struct inode *dir, } static int omfs_create(struct mnt_idmap 
*idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return omfs_add_node(dir, dentry, mode | S_IFREG); } diff --git a/fs/orangefs/namei.c b/fs/orangefs/namei.c index bec5475de094dada6bb29eaf8520a875880f3bab..0ebaa7f000f26f1c1ecffd22cfe4272f20a783ed 100644 --- a/fs/orangefs/namei.c +++ b/fs/orangefs/namei.c @@ -18,8 +18,7 @@ static int orangefs_create(struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, - bool exclusive) + umode_t mode) { struct orangefs_inode_s *parent = ORANGEFS_I(dir); struct orangefs_kernel_op_s *new_op; diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c index a5e9ddf3023b3942fafb9adb2770f26780a1b86b..0f70b3835f4a08c29d6bba8ae9143df55895e56b 100644 --- a/fs/overlayfs/dir.c +++ b/fs/overlayfs/dir.c @@ -704,7 +704,7 @@ static int ovl_create_object(struct dentry *dentry, int mode, dev_t rdev, } static int ovl_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return ovl_create_object(dentry, (mode & 07777) | S_IFREG, 0, NULL); } diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c index 41f9995da7cab0d11395cb40a98fb4936d52597f..b6502aaa4fb44d27c939da9fae4449af7edd28d4 100644 --- a/fs/ramfs/inode.c +++ b/fs/ramfs/inode.c @@ -129,7 +129,7 @@ static struct dentry *ramfs_mkdir(struct mnt_idmap *idmap, struct inode *dir, } static int ramfs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return ramfs_mknod(&nop_mnt_idmap, dir, dentry, mode | S_IFREG, 0); } diff --git a/fs/smb/client/cifsfs.h b/fs/smb/client/cifsfs.h index e9534258d1efd0bb34f36bf2c725c64d0a8ca8f4..294c66cea2eca3344e09cd77619761e9cb79a807 100644 --- a/fs/smb/client/cifsfs.h +++ b/fs/smb/client/cifsfs.h @@ -50,7 +50,7 @@ extern void cifs_sb_deactive(struct super_block *sb); extern const struct inode_operations 
cifs_dir_inode_ops; extern struct inode *cifs_root_iget(struct super_block *); extern int cifs_create(struct mnt_idmap *, struct inode *, - struct dentry *, umode_t, bool excl); + struct dentry *, umode_t); extern int cifs_atomic_open(struct inode *, struct dentry *, struct file *, unsigned, umode_t); extern struct dentry *cifs_lookup(struct inode *, struct dentry *, diff --git a/fs/smb/client/dir.c b/fs/smb/client/dir.c index da5597dbf5b9f140c6801158ac2357fa911c52ab..b00bc214db9f0e9533f481f41ac99ac8937610ac 100644 --- a/fs/smb/client/dir.c +++ b/fs/smb/client/dir.c @@ -566,7 +566,7 @@ cifs_atomic_open(struct inode *inode, struct dentry *direntry, } int cifs_create(struct mnt_idmap *idmap, struct inode *inode, - struct dentry *direntry, umode_t mode, bool excl) + struct dentry *direntry, umode_t mode) { int rc; unsigned int xid = get_xid(); diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c index 3c3d3ad4fa6cb719e9ec08fa2164c55371c017c1..4840a6f7974e254eba4ca249357e968764e326e0 100644 --- a/fs/ubifs/dir.c +++ b/fs/ubifs/dir.c @@ -303,7 +303,7 @@ static int ubifs_prepare_create(struct inode *dir, struct dentry *dentry, } static int ubifs_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode; struct ubifs_info *c = dir->i_sb->s_fs_info; diff --git a/fs/udf/namei.c b/fs/udf/namei.c index 5f2e9a892bffa9579143cedf71d80efa7ad6e9fb..f83b5564cbc4c68c02c07bb3ab2109bfabdc799d 100644 --- a/fs/udf/namei.c +++ b/fs/udf/namei.c @@ -371,7 +371,7 @@ static int udf_add_nondir(struct dentry *dentry, struct inode *inode) } static int udf_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { struct inode *inode = udf_new_inode(dir, mode); diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c index 5b3c85c9324298f4ff6aa3d4feeb962ce5ede539..5012e056200aca671364d34a7faf647e6747e1d2 100644 --- a/fs/ufs/namei.c 
+++ b/fs/ufs/namei.c @@ -70,8 +70,7 @@ static struct dentry *ufs_lookup(struct inode * dir, struct dentry *dentry, unsi * with d_instantiate(). */ static int ufs_create (struct mnt_idmap * idmap, - struct inode * dir, struct dentry * dentry, umode_t mode, - bool excl) + struct inode * dir, struct dentry * dentry, umode_t mode) { struct inode *inode; diff --git a/fs/vboxsf/dir.c b/fs/vboxsf/dir.c index 42bedc4ec7af7709c564a7174805d185ce86f854..330dade582d081e965c0e365bd2f96ae31d92ccc 100644 --- a/fs/vboxsf/dir.c +++ b/fs/vboxsf/dir.c @@ -298,9 +298,9 @@ static int vboxsf_dir_create(struct inode *parent, struct dentry *dentry, static int vboxsf_dir_mkfile(struct mnt_idmap *idmap, struct inode *parent, struct dentry *dentry, - umode_t mode, bool excl) + umode_t mode) { - return vboxsf_dir_create(parent, dentry, mode, false, excl, NULL); + return vboxsf_dir_create(parent, dentry, mode, false, true, NULL); } static struct dentry *vboxsf_dir_mkdir(struct mnt_idmap *idmap, diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index caff0125faeac093c1c05a722d3588e3f2e99926..2bc7faac35678b5b78acd6a50695a0d7b1c9a263 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -293,8 +293,7 @@ xfs_vn_create( struct mnt_idmap *idmap, struct inode *dir, struct dentry *dentry, - umode_t mode, - bool flags) + umode_t mode) { return xfs_generic_create(idmap, dir, dentry, mode, 0, NULL); } diff --git a/include/linux/fs.h b/include/linux/fs.h index 64323e618724bc20dc101db13035b042f5f88e4d..b9a32e10078f5a1a0bbeb0d8913ac3e4b5b3a85d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2345,8 +2345,8 @@ struct inode_operations { int (*readlink) (struct dentry *, char __user *,int); - int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, - umode_t, bool); + int (*create) (struct mnt_idmap *, struct inode *, struct dentry *, + umode_t); int (*link) (struct dentry *,struct inode *,struct dentry *); int (*unlink) (struct inode *,struct dentry *); int (*symlink) (struct 
mnt_idmap *, struct inode *,struct dentry *, diff --git a/ipc/mqueue.c b/ipc/mqueue.c index 093551fe66a7eb884fc34ef853a0ca92b95770af..9ae28c79fe0578bf96b2d22daed45b48aba0b946 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -610,7 +610,7 @@ static int mqueue_create_attr(struct dentry *dentry, umode_t mode, void *arg) } static int mqueue_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return mqueue_create_attr(dentry, mode, NULL); } diff --git a/mm/shmem.c b/mm/shmem.c index b9081b817d28f3db1fbdd90ed3f04b6904d6ff18..8fdc9cbecb908e127f8173ca8888b5e038354fed 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3912,7 +3912,7 @@ static struct dentry *shmem_mkdir(struct mnt_idmap *idmap, struct inode *dir, } static int shmem_create(struct mnt_idmap *idmap, struct inode *dir, - struct dentry *dentry, umode_t mode, bool excl) + struct dentry *dentry, umode_t mode) { return shmem_mknod(idmap, dir, dentry, mode | S_IFREG, 0); } --- base-commit: 76ddfe7d66d631e5e31ef4e5dd59797fa03acbf7 change-id: 20251105-create-excl-2b366d9bf3bb Best regards, -- Jeff Layton From johannes at sipsolutions.net Tue Nov 11 00:01:25 2025 From: johannes at sipsolutions.net (Johannes Berg) Date: Tue, 11 Nov 2025 09:01:25 +0100 Subject: [PATCH v13 00/13] nommu UML In-Reply-To: References: Message-ID: <0a84c16f862026c82271c0adbc91d98b812a78b4.camel@sipsolutions.net> On Mon, 2025-11-10 at 21:18 +0900, Hajime Tazaki wrote: > > What is it for ? > ================ > > - Alleviate syscall hook overhead implemented with ptrace(2) > - To exercise nommu code over UML (and over KUnit) > - Less dependency on host facilities FWIW, in some way, this order of priorities is exactly why this hasn't been going anywhere, and every time I looked at it I got somewhat annoyed by what seems to me like choices made to support especially the first bullet.
I suspect that the first and third bullets are not even really true any more, since you moved to seccomp (per our request), yet I think design choices influenced by them persist.

People are definitely interested in the second bullet, mostly for kunit, and I'd be willing to support them in that to some extent.

However, I'm not yet convinced that all of the complexities presented in this patchset (such as a completely separate seccomp implementation) are actually necessary in support of _just_ the second bullet. These seem to me like design choices necessary to support the _first_ bullet [1].

[1] and then I suppose the third, which I'm reading as "doesn't need seccomp or ptrace", but I'm not really quite sure what you meant

I've thought about what would happen if we stuck to creating a (single) separate process on the host to execute userspace, and just used CLONE_VM for it. That way, it's still no-MMU with full memory access, but there's some implicit isolation between the kernel and userspace processes which will likely remove complexities around FP/SSE/AVX handling, may completely remove the need for a separate seccomp implementation, etc.

It would, on the other hand, make it completely non-viable to achieve the first and third bullets, so given your pursuit of those, on some level I understand the design right now.

I'm yet to be convinced, however, that those are even worthy goals for (upstream) UML: what use case would that enable that we really need? Especially considering that over a longer perspective, NOMMU architectures _are_ on their way out, and UML will certainly follow once that happens; it won't be the last remaining NOMMU architecture.

So the only value I see in this is for testing over the next couple of years, which really doesn't need any sort of significant optimisation or less reliance on host facilities.

Where do you see this differently?
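For illustration only, the CLONE_VM idea can be sketched in a few lines of host-side C. This is a minimal sketch assuming a Linux host and glibc's clone() wrapper; it is not UML code, and the names run_shared_vm_child, child_fn and shared_value are hypothetical. It only shows that a separate host process created with CLONE_VM still sees the same memory as its parent.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)

/* Written by the child, read by the parent; both see it because of CLONE_VM. */
static volatile int shared_value;

/* Runs in the child process, on the stack handed to clone(). */
static int child_fn(void *arg)
{
	shared_value = *(int *)arg;
	return 0;
}

/* Hypothetical helper: returns the value the child stored, or -1 on failure. */
static int run_shared_vm_child(int value)
{
	char *stack = malloc(CHILD_STACK_SIZE);
	pid_t pid;

	if (!stack)
		return -1;

	shared_value = 0;

	/* The stack grows down on common hosts, so pass the top of the buffer. */
	pid = clone(child_fn, stack + CHILD_STACK_SIZE,
		    CLONE_VM | SIGCHLD, &value);
	if (pid < 0) {
		free(stack);
		return -1;
	}

	/* Wait for the child before reading the result and freeing its stack. */
	if (waitpid(pid, NULL, 0) < 0) {
		free(stack);
		return -1;
	}

	free(stack);
	return shared_value;
}
```

Calling run_shared_vm_child(42) should return 42 on such a host: the child is its own process with its own register state, yet its store lands in the parent's address space.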
johannes From xuanzhuo at linux.alibaba.com Tue Nov 11 03:12:12 2025 From: xuanzhuo at linux.alibaba.com (Xuan Zhuo) Date: Tue, 11 Nov 2025 19:12:12 +0800 Subject: [PATCH net v5 2/2] virtio-net: correct hdr_len handling for tunnel gso In-Reply-To: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com> References: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com> Message-ID: <20251111111212.102083-3-xuanzhuo@linux.alibaba.com> The commit a2fb4bc4e2a6a03 ("net: implement virtio helpers to handle UDP GSO tunneling.") introduces support for the UDP GSO tunnel feature in virtio-net. The virtio spec says: If the \field{gso_type} has the VIRTIO_NET_HDR_GSO_UDP_TUNNEL_IPV4 bit or VIRTIO_NET_HDR_GSO_UDP_TUNNEL_IPV6 bit set, \field{hdr_len} accounts for all the headers up to and including the inner transport. The commit did not update the hdr_len to include the inner transport. I observed that the "hdr_len" is 116 for this packet: 17:36:18.241105 52:55:00:d1:27:0a > 2e:2c:df:46:a9:e1, ethertype IPv4 (0x0800), length 2912: (tos 0x0, ttl 64, id 45197, offset 0, flags [none], proto UDP (17), length 2898) 192.168.122.100.50613 > 192.168.122.1.4789: [bad udp cksum 0x8106 -> 0x26a0!] 
VXLAN, flags [I] (0x08), vni 1 fa:c3:ba:82:05:ee > ce:85:0c:31:77:e5, ethertype IPv4 (0x0800), length 2862: (tos 0x0, ttl 64, id 14678, offset 0, flags [DF], proto TCP (6), length 2848) 192.168.3.1.49880 > 192.168.3.2.9898: Flags [P.], cksum 0x9266 (incorrect -> 0xaa20), seq 515667:518463, ack 1, win 64, options [nop,nop,TS val 2990048824 ecr 2798801412], length 2796 116 = 14(mac) + 20(ip) + 8(udp) + 8(vxlan) + 14(inner mac) + 20(inner ip) + 32(inner tcp) Fixes: a2fb4bc4e2a6a03 ("net: implement virtio helpers to handle UDP GSO tunneling.") Signed-off-by: Xuan Zhuo --- include/linux/virtio_net.h | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index 3cd8b2ebc197..432b17979d17 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -232,12 +232,23 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb, return -EINVAL; if (hdrlen_negotiated) { - hdr_len = skb_transport_offset(skb); + if (sinfo->gso_type & (SKB_GSO_UDP_TUNNEL | + SKB_GSO_UDP_TUNNEL_CSUM)) { + hdr_len = skb_inner_transport_offset(skb); + + if (hdr->gso_type == VIRTIO_NET_HDR_GSO_UDP_L4) + hdr_len += sizeof(struct udphdr); + else + hdr_len += inner_tcp_hdrlen(skb); + } else { + hdr_len = skb_transport_offset(skb); + + if (hdr->gso_type == VIRTIO_NET_HDR_GSO_UDP_L4) + hdr_len += sizeof(struct udphdr); + else + hdr_len += tcp_hdrlen(skb); + } - if (hdr->gso_type == VIRTIO_NET_HDR_GSO_UDP_L4) - hdr_len += sizeof(struct udphdr); - else - hdr_len += tcp_hdrlen(skb); } else { /* This is a hint as to how much should be linear. */ hdr_len = skb_headlen(skb); @@ -421,11 +432,8 @@ virtio_net_hdr_tnl_from_skb(const struct sk_buff *skb, vhdr->hash_hdr.hash_report = 0; vhdr->hash_hdr.padding = 0; - /* Let the basic parsing deal with plain GSO features.
*/ - skb_shinfo(skb)->gso_type &= ~tnl_gso_type; ret = virtio_net_hdr_from_skb(skb, hdr, true, false, hdrlen_negotiated, vlan_hlen); - skb_shinfo(skb)->gso_type |= tnl_gso_type; if (ret) return ret; -- 2.32.0.3.g01195cf9f From xuanzhuo at linux.alibaba.com Tue Nov 11 03:12:10 2025 From: xuanzhuo at linux.alibaba.com (Xuan Zhuo) Date: Tue, 11 Nov 2025 19:12:10 +0800 Subject: [PATCH net v5 0/2] virtio-net: fix for VIRTIO_NET_F_GUEST_HDRLEN Message-ID: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com> The commit be50da3e9d4a ("net: virtio_net: implement exact header length guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN feature in virtio-net. This feature requires virtio-net to set hdr_len to the actual header length of the packet when transmitting, the number of bytes from the start of the packet to the beginning of the transport-layer payload. However, in practice, hdr_len was being set using skb_headlen(skb), which is clearly incorrect. This patch set fixes that issue. As discussed in [0], this version checks whether VIRTIO_NET_F_GUEST_HDRLEN is negotiated.
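To make the definition of hdr_len concrete, here is a small standalone sketch of the arithmetic, checked against the VXLAN packet quoted in patch 2/2. It is not kernel code: struct layer_sizes, exact_hdr_len() and exact_tunnel_hdr_len() are hypothetical names used only for this illustration, and the sizes are taken as given rather than parsed from a packet.

```c
#include <stddef.h>

/* Header sizes in bytes for one packet (outer or inner). Illustrative only. */
struct layer_sizes {
	size_t mac;	/* Ethernet header */
	size_t ip;	/* IP header */
	size_t l4;	/* TCP or UDP header */
};

/*
 * Exact header length of a plain packet: every byte from the start of
 * the packet up to the beginning of the transport-layer payload.
 */
static size_t exact_hdr_len(const struct layer_sizes *p)
{
	return p->mac + p->ip + p->l4;
}

/*
 * For a UDP tunnel the outer headers and the tunnel header (8 bytes for
 * VXLAN) come first, followed by the inner headers up to and including
 * the inner transport header.
 */
static size_t exact_tunnel_hdr_len(const struct layer_sizes *outer,
				   size_t tunnel_hdr,
				   const struct layer_sizes *inner)
{
	return exact_hdr_len(outer) + tunnel_hdr + exact_hdr_len(inner);
}
```

With outer = {14, 20, 8}, an 8-byte VXLAN header and inner = {14, 20, 32}, exact_tunnel_hdr_len() gives 42 + 8 + 66 = 116, matching the breakdown in patch 2/2; exact_hdr_len() on the inner sizes alone gives 66.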
[0]: http://lore.kernel.org/all/20251029030913.20423-1-xuanzhuo at linux.alibaba.com Xuan Zhuo (2): virtio-net: correct hdr_len handling for VIRTIO_NET_F_GUEST_HDRLEN virtio-net: correct hdr_len handling for tunnel gso arch/um/drivers/vector_transports.c | 1 + drivers/net/tun_vnet.h | 4 +-- drivers/net/virtio_net.c | 9 +++++-- include/linux/virtio_net.h | 40 +++++++++++++++++++++++------ net/packet/af_packet.c | 5 ++-- 5 files changed, 45 insertions(+), 14 deletions(-) -- 2.32.0.3.g01195cf9f From xuanzhuo at linux.alibaba.com Tue Nov 11 03:12:11 2025 From: xuanzhuo at linux.alibaba.com (Xuan Zhuo) Date: Tue, 11 Nov 2025 19:12:11 +0800 Subject: [PATCH net v5 1/2] virtio-net: correct hdr_len handling for VIRTIO_NET_F_GUEST_HDRLEN In-Reply-To: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com> References: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com> Message-ID: <20251111111212.102083-2-xuanzhuo@linux.alibaba.com> The commit be50da3e9d4a ("net: virtio_net: implement exact header length guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN feature in virtio-net. This feature requires virtio-net to set hdr_len to the actual header length of the packet when transmitting, the number of bytes from the start of the packet to the beginning of the transport-layer payload. However, in practice, hdr_len was being set using skb_headlen(skb), which is clearly incorrect. This commit fixes that issue. 
Fixes: be50da3e9d4a ("net: virtio_net: implement exact header length guest feature") Signed-off-by: Xuan Zhuo --- arch/um/drivers/vector_transports.c | 1 + drivers/net/tun_vnet.h | 4 ++-- drivers/net/virtio_net.c | 9 +++++++-- include/linux/virtio_net.h | 26 +++++++++++++++++++++----- net/packet/af_packet.c | 5 +++-- 5 files changed, 34 insertions(+), 11 deletions(-) diff --git a/arch/um/drivers/vector_transports.c b/arch/um/drivers/vector_transports.c index 0794d23f07cb..03c5baa1d0c1 100644 --- a/arch/um/drivers/vector_transports.c +++ b/arch/um/drivers/vector_transports.c @@ -121,6 +121,7 @@ static int raw_form_header(uint8_t *header, vheader, virtio_legacy_is_little_endian(), false, + false, 0 ); diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h index 81662328b2c7..0d376bc70dd7 100644 --- a/drivers/net/tun_vnet.h +++ b/drivers/net/tun_vnet.h @@ -214,7 +214,7 @@ static inline int tun_vnet_hdr_from_skb(unsigned int flags, if (virtio_net_hdr_from_skb(skb, hdr, tun_vnet_is_little_endian(flags), true, - vlan_hlen)) { + false, vlan_hlen)) { struct skb_shared_info *sinfo = skb_shinfo(skb); if (net_ratelimit()) { @@ -244,7 +244,7 @@ tun_vnet_hdr_tnl_from_skb(unsigned int flags, if (virtio_net_hdr_tnl_from_skb(skb, tnl_hdr, has_tnl_offload, tun_vnet_is_little_endian(flags), - vlan_hlen)) { + false, vlan_hlen)) { struct virtio_net_hdr_v1 *hdr = &tnl_hdr->hash_hdr.hdr; struct skb_shared_info *sinfo = skb_shinfo(skb); diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 0369dda5ed60..b335c88a8cd6 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -3317,9 +3317,13 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool orphan) const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest; struct virtnet_info *vi = sq->vq->vdev->priv; struct virtio_net_hdr_v1_hash_tunnel *hdr; - int num_sg; unsigned hdr_len = vi->hdr_len; + bool hdrlen_negotiated; bool can_push; + int num_sg; + + hdrlen_negotiated = 
virtio_has_feature(vi->vdev, + VIRTIO_NET_F_GUEST_HDRLEN); pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest); @@ -3339,7 +3343,8 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool orphan) hdr = &skb_vnet_common_hdr(skb)->tnl_hdr; if (virtio_net_hdr_tnl_from_skb(skb, hdr, vi->tx_tnl, - virtio_is_little_endian(vi->vdev), 0)) + virtio_is_little_endian(vi->vdev), + hdrlen_negotiated, 0)) return -EPROTO; if (vi->mergeable_rx_bufs) diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index b673c31569f3..3cd8b2ebc197 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -211,16 +211,15 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb, struct virtio_net_hdr *hdr, bool little_endian, bool has_data_valid, + bool hdrlen_negotiated, int vlan_hlen) { memset(hdr, 0, sizeof(*hdr)); /* no info leak */ if (skb_is_gso(skb)) { struct skb_shared_info *sinfo = skb_shinfo(skb); + u16 hdr_len; - /* This is a hint as to how much should be linear. */ - hdr->hdr_len = __cpu_to_virtio16(little_endian, - skb_headlen(skb)); hdr->gso_size = __cpu_to_virtio16(little_endian, sinfo->gso_size); if (sinfo->gso_type & SKB_GSO_TCPV4) @@ -231,6 +230,21 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb, hdr->gso_type = VIRTIO_NET_HDR_GSO_UDP_L4; else return -EINVAL; + + if (hdrlen_negotiated) { + hdr_len = skb_transport_offset(skb); + + if (hdr->gso_type == VIRTIO_NET_HDR_GSO_UDP_L4) + hdr_len += sizeof(struct udphdr); + else + hdr_len += tcp_hdrlen(skb); + } else { + /* This is a hint as to how much should be linear. 
*/ + hdr_len = skb_headlen(skb); + } + + hdr->hdr_len = __cpu_to_virtio16(little_endian, hdr_len); + if (sinfo->gso_type & SKB_GSO_TCP_ECN) hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN; } else @@ -384,6 +398,7 @@ virtio_net_hdr_tnl_from_skb(const struct sk_buff *skb, struct virtio_net_hdr_v1_hash_tunnel *vhdr, bool tnl_hdr_negotiated, bool little_endian, + bool hdrlen_negotiated, int vlan_hlen) { struct virtio_net_hdr *hdr = (struct virtio_net_hdr *)vhdr; @@ -395,7 +410,7 @@ virtio_net_hdr_tnl_from_skb(const struct sk_buff *skb, SKB_GSO_UDP_TUNNEL_CSUM); if (!tnl_gso_type) return virtio_net_hdr_from_skb(skb, hdr, little_endian, false, - vlan_hlen); + hdrlen_negotiated, vlan_hlen); /* Tunnel support not negotiated but skb ask for it. */ if (!tnl_hdr_negotiated) @@ -408,7 +423,8 @@ virtio_net_hdr_tnl_from_skb(const struct sk_buff *skb, /* Let the basic parsing deal with plain GSO features. */ skb_shinfo(skb)->gso_type &= ~tnl_gso_type; - ret = virtio_net_hdr_from_skb(skb, hdr, true, false, vlan_hlen); + ret = virtio_net_hdr_from_skb(skb, hdr, true, false, hdrlen_negotiated, + vlan_hlen); skb_shinfo(skb)->gso_type |= tnl_gso_type; if (ret) return ret; diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 173e6edda08f..6982f4ab1c73 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2093,7 +2093,8 @@ static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb, return -EINVAL; *len -= vnet_hdr_sz; - if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)&vnet_hdr, vio_le(), true, 0)) + if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)&vnet_hdr, + vio_le(), true, false, 0)) return -EINVAL; return memcpy_to_msg(msg, (void *)&vnet_hdr, vnet_hdr_sz); @@ -2361,7 +2362,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, if (vnet_hdr_sz && virtio_net_hdr_from_skb(skb, h.raw + macoff - sizeof(struct virtio_net_hdr), - vio_le(), true, 0)) { + vio_le(), true, false, 0)) { if (po->tp_version == TPACKET_V3) 
prb_clear_blk_fill_status(&po->rx_ring); goto drop_n_account; -- 2.32.0.3.g01195cf9f From mst at redhat.com Tue Nov 11 03:33:04 2025 From: mst at redhat.com (Michael S. Tsirkin) Date: Tue, 11 Nov 2025 06:33:04 -0500 Subject: [PATCH net v5 1/2] virtio-net: correct hdr_len handling for VIRTIO_NET_F_GUEST_HDRLEN In-Reply-To: <20251111111212.102083-2-xuanzhuo@linux.alibaba.com> References: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com> <20251111111212.102083-2-xuanzhuo@linux.alibaba.com> Message-ID: <20251111062859-mutt-send-email-mst@kernel.org> On Tue, Nov 11, 2025 at 07:12:11PM +0800, Xuan Zhuo wrote: > The commit be50da3e9d4a ("net: virtio_net: implement exact header length > guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN > feature in virtio-net. > > This feature requires virtio-net to set hdr_len to the actual header > length of the packet when transmitting, the number of > bytes from the start of the packet to the beginning of the > transport-layer payload. > > However, in practice, hdr_len was being set using skb_headlen(skb), > which is clearly incorrect. This commit fixes that issue. 
> > Fixes: be50da3e9d4a ("net: virtio_net: implement exact header length guest feature") > Signed-off-by: Xuan Zhuo > --- > arch/um/drivers/vector_transports.c | 1 + > drivers/net/tun_vnet.h | 4 ++-- > drivers/net/virtio_net.c | 9 +++++++-- > include/linux/virtio_net.h | 26 +++++++++++++++++++++----- > net/packet/af_packet.c | 5 +++-- > 5 files changed, 34 insertions(+), 11 deletions(-) > > diff --git a/arch/um/drivers/vector_transports.c b/arch/um/drivers/vector_transports.c > index 0794d23f07cb..03c5baa1d0c1 100644 > --- a/arch/um/drivers/vector_transports.c > +++ b/arch/um/drivers/vector_transports.c > @@ -121,6 +121,7 @@ static int raw_form_header(uint8_t *header, > vheader, > virtio_legacy_is_little_endian(), > false, > + false, > 0 > ); > > diff --git a/drivers/net/tun_vnet.h b/drivers/net/tun_vnet.h > index 81662328b2c7..0d376bc70dd7 100644 > --- a/drivers/net/tun_vnet.h > +++ b/drivers/net/tun_vnet.h > @@ -214,7 +214,7 @@ static inline int tun_vnet_hdr_from_skb(unsigned int flags, > > if (virtio_net_hdr_from_skb(skb, hdr, > tun_vnet_is_little_endian(flags), true, > - vlan_hlen)) { > + false, vlan_hlen)) { > struct skb_shared_info *sinfo = skb_shinfo(skb); > > if (net_ratelimit()) { > @@ -244,7 +244,7 @@ tun_vnet_hdr_tnl_from_skb(unsigned int flags, > > if (virtio_net_hdr_tnl_from_skb(skb, tnl_hdr, has_tnl_offload, > tun_vnet_is_little_endian(flags), > - vlan_hlen)) { > + false, vlan_hlen)) { > struct virtio_net_hdr_v1 *hdr = &tnl_hdr->hash_hdr.hdr; > struct skb_shared_info *sinfo = skb_shinfo(skb); > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index 0369dda5ed60..b335c88a8cd6 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -3317,9 +3317,13 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool orphan) > const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest; > struct virtnet_info *vi = sq->vq->vdev->priv; > struct virtio_net_hdr_v1_hash_tunnel *hdr; > - int num_sg; > unsigned 
hdr_len = vi->hdr_len;
> + bool hdrlen_negotiated;
> bool can_push;
> + int num_sg;
> +
> + hdrlen_negotiated = virtio_has_feature(vi->vdev,
> + VIRTIO_NET_F_GUEST_HDRLEN);
>
> pr_debug("%s: xmit %p %pM\n", vi->dev->name, skb, dest);
>
> @@ -3339,7 +3343,8 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb, bool orphan)
> hdr = &skb_vnet_common_hdr(skb)->tnl_hdr;
>
> if (virtio_net_hdr_tnl_from_skb(skb, hdr, vi->tx_tnl,
> - virtio_is_little_endian(vi->vdev), 0))
> + virtio_is_little_endian(vi->vdev),
> + hdrlen_negotiated, 0))
> return -EPROTO;
>
> if (vi->mergeable_rx_bufs)
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index b673c31569f3..3cd8b2ebc197 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -211,16 +211,15 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
> struct virtio_net_hdr *hdr,
> bool little_endian,
> bool has_data_valid,
> + bool hdrlen_negotiated,
> int vlan_hlen)

Took me a while to figure out why tun passes false here. The reason is
that this flag really only deals with guest hdrlen. So how about
guest_hdrlen to mirror the spec, or, if you like, xmit_hdrlen?

> {
> memset(hdr, 0, sizeof(*hdr)); /* no info leak */
>
> if (skb_is_gso(skb)) {
> struct skb_shared_info *sinfo = skb_shinfo(skb);
> + u16 hdr_len;
>
> - /* This is a hint as to how much should be linear.
*/ > - hdr->hdr_len = __cpu_to_virtio16(little_endian, > - skb_headlen(skb)); > hdr->gso_size = __cpu_to_virtio16(little_endian, > sinfo->gso_size); > if (sinfo->gso_type & SKB_GSO_TCPV4) > @@ -231,6 +230,21 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb, > hdr->gso_type = VIRTIO_NET_HDR_GSO_UDP_L4; > else > return -EINVAL; > + > + if (hdrlen_negotiated) { > + hdr_len = skb_transport_offset(skb); > + > + if (hdr->gso_type == VIRTIO_NET_HDR_GSO_UDP_L4) > + hdr_len += sizeof(struct udphdr); > + else > + hdr_len += tcp_hdrlen(skb); > + } else { > + /* This is a hint as to how much should be linear. */ > + hdr_len = skb_headlen(skb); > + } > + > + hdr->hdr_len = __cpu_to_virtio16(little_endian, hdr_len); > + > if (sinfo->gso_type & SKB_GSO_TCP_ECN) > hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN; > } else > @@ -384,6 +398,7 @@ virtio_net_hdr_tnl_from_skb(const struct sk_buff *skb, > struct virtio_net_hdr_v1_hash_tunnel *vhdr, > bool tnl_hdr_negotiated, > bool little_endian, > + bool hdrlen_negotiated, > int vlan_hlen) > { > struct virtio_net_hdr *hdr = (struct virtio_net_hdr *)vhdr; > @@ -395,7 +410,7 @@ virtio_net_hdr_tnl_from_skb(const struct sk_buff *skb, > SKB_GSO_UDP_TUNNEL_CSUM); > if (!tnl_gso_type) > return virtio_net_hdr_from_skb(skb, hdr, little_endian, false, > - vlan_hlen); > + hdrlen_negotiated, vlan_hlen); > > /* Tunnel support not negotiated but skb ask for it. */ > if (!tnl_hdr_negotiated) > @@ -408,7 +423,8 @@ virtio_net_hdr_tnl_from_skb(const struct sk_buff *skb, > > /* Let the basic parsing deal with plain GSO features. 
*/ > skb_shinfo(skb)->gso_type &= ~tnl_gso_type; > - ret = virtio_net_hdr_from_skb(skb, hdr, true, false, vlan_hlen); > + ret = virtio_net_hdr_from_skb(skb, hdr, true, false, hdrlen_negotiated, > + vlan_hlen); > skb_shinfo(skb)->gso_type |= tnl_gso_type; > if (ret) > return ret; > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c > index 173e6edda08f..6982f4ab1c73 100644 > --- a/net/packet/af_packet.c > +++ b/net/packet/af_packet.c > @@ -2093,7 +2093,8 @@ static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb, > return -EINVAL; > *len -= vnet_hdr_sz; > > - if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)&vnet_hdr, vio_le(), true, 0)) > + if (virtio_net_hdr_from_skb(skb, (struct virtio_net_hdr *)&vnet_hdr, > + vio_le(), true, false, 0)) > return -EINVAL; > > return memcpy_to_msg(msg, (void *)&vnet_hdr, vnet_hdr_sz); > @@ -2361,7 +2362,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, > if (vnet_hdr_sz && > virtio_net_hdr_from_skb(skb, h.raw + macoff - > sizeof(struct virtio_net_hdr), > - vio_le(), true, 0)) { > + vio_le(), true, false, 0)) { > if (po->tp_version == TPACKET_V3) > prb_clear_blk_fill_status(&po->rx_ring); > goto drop_n_account; > -- > 2.32.0.3.g01195cf9f From thehajime at gmail.com Wed Nov 12 00:52:56 2025 From: thehajime at gmail.com (Hajime Tazaki) Date: Wed, 12 Nov 2025 17:52:56 +0900 Subject: [PATCH v13 00/13] nommu UML In-Reply-To: <0a84c16f862026c82271c0adbc91d98b812a78b4.camel@sipsolutions.net> References: <0a84c16f862026c82271c0adbc91d98b812a78b4.camel@sipsolutions.net> Message-ID: On Tue, 11 Nov 2025 17:01:25 +0900, Johannes Berg wrote: > > On Mon, 2025-11-10 at 21:18 +0900, Hajime Tazaki wrote: > > > > What is it for ? 
> > ================
> >
> > - Alleviate syscall hook overhead implemented with ptrace(2)
> > - To exercise nommu code over UML (and over KUnit)
> > - Less dependency to host facilities
>
> FWIW, in some way, this order of priorities is exactly why this hasn't
> been going anywhere, and every time I looked at it I got somewhat
> annoyed by what seems to me like choices made to support especially the
> first bullet.

over the past versions, I've been emphasizing that the 2nd bullet
(testing) is the primary use case, as I saw several actual cases from
mm folks,

https://lists.infradead.org/pipermail/maple-tree/2024-November/003775.html
https://lore.kernel.org/all/cb1cf0be-871d-4982-9a1b-5fdd54deec8d at lucifer.local/

and I think this is not limited to mm code.

the other 2 bullets are additional benefits, which we observed in a
comment and in our experience.

https://lore.kernel.org/all/20241122121826.GA26024 at lst.de/
[2] https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf

but those are not the primary goal, so I'm not pushing this aspect
with use cases.

> I suspect that the first and third bullet are not even really true any
> more, since you moved to seccomp (per our request), yet I think design
> choices influenced by them persist.

this observation is not true; the first bullet is still true even
when using seccomp. please look at the benchmark result in the patch
[12/13], quoted below.

summary: most of the tests show that um-nommu+seccomp is x4 to x20
faster than um-mmu+seccomp (and ptrace).

.. csv-table:: lmbench (usec)
   :header: ,native,um,um-mmu(s),um-nommu(s)

   select-10 ,0.5319,36.1214,24.2795,2.9174
   select-100 ,1.6019,34.6049,28.8865,3.8080
   select-1000 ,12.2588,43.6838,48.7438,12.7872
   syscall ,0.1644,35.0321,53.2119,2.5981
   read ,0.3055,31.5509,45.8538,2.7068
   write ,0.2512,31.3609,29.2636,2.6948
   stat ,1.8894,43.8477,49.6121,3.1908
   open/close ,3.2973,77.5123,68.9431,6.2575
   fork+sh ,1110.3000,7359.5000,4618.6667,439.4615
   fork+execve ,510.8182,2834.0000,2461.1667,139.7848

.. csv-table:: do_getpid bench (nsec)
   :header: ,native,um,um-mmu(s),um-nommu(s)

   getpid , 161 , 34477 , 26242 , 2599

the 1st bullet saying ptrace(2) is somewhat misleading now. it might be
rephrased as "a separate process handling userspace" instead of
"ptrace".

# when I started this patchset, the seccomp patch wasn't upstream, so
  saying ptrace(2) wasn't that wrong.

> People are definitely interested in the second bullet, mostly for kunit,
> and I'd be willing to support them in that to some extent.

so (again) the 2nd bullet is the primary use case at this stage.

> However, I'm not yet convinced that all of the complexities presented in
> this patchset (such as completely separate seccomp implementation) are
> actually necessary in support of _just_ the second bullet. These seem to
> me like design choices necessary to support the _first_ bullet [1].

a separate seccomp implementation is indeed needed due to the design
choice we made, to use a single process to host a (um) userspace. I
think there is no reason to unify the seccomp part because the
signal handlers and filter installation do different jobs.

I don't see why you see this as a _complexity_, as functionally the two
seccomp handling paths don't interfere with each other. we have prepared
separate sub-directories for nommu to avoid unnecessary if/else
clauses in .c/.h files.

we haven't seen any functional regressions
since this RFC version (which was 6.12 kernel).
> [1] and then I suppose the third, which I'm reading as "doesn't need
> seccomp or ptrace", but I'm not really quite sure what you meant
>
> I've thought about what would happen if we stuck to creating a (single)
> separate process on the host to execute userspace, and just used
> CLONE_VM for it. That way, it's still no-MMU with full memory access,
> but there's some implicit isolation between the kernel and userspace
> processes which will likely remove complexities around FP/SSE/AVX
> handling, may completely remove the need for a separate seccomp
> implementation, etc.

this would be doable I think, but we went a different way, as using
separate host processes (with ptrace/seccomp) is slow and adds
complexity through the synchronization between processes, which we
think is not easy to maintain in the future.

this was natural for us (not sure about maintainers): when we add a new
functionality, we consider several options to implement it, and take
the one which is faster, simpler, and cheaper to maintain.

the avoidance of separate processes is probably the core of the design
choice we made for nommu UML.

I'm not strongly pushing the benefits of the 1st/3rd bullets, but I
thought describing the characteristics of what _this_ patchset can do
should be useful, thus they are in the document.

additionally, if the design choice we made introduced any breakage in
existing code, or maintenance burdens, I would understand your concern
about the complexity, but I don't think this is the case.

> It would, on the other hand, make it completely non-viable to achieve
> the first and third bullets, so given your pursuit of those, on some
> level I understand the design right now. I'm yet to be convinced,
> however, that those are even worthy goals for (upstream) UML, what use
> case would that enable that we really need?

the use cases for those are inherited from the original implementation,
[2] above, which is running UML on containers with less host dependency
and speedups.
but again, this is not the primary goal at this stage. if you think
that the document should not describe the potential benefits/use cases
which are not related to the primary goal of the functionality, I'd
agree to remove those descriptions.

> Especially considering that
> over a longer perspective, NOMMU architectures _are_ on their way out,
> and UML will certainly follow once that happens, it won't be the last
> remaining NOMMU architecture.

I'm aware of this nommu removal discussion, but I also saw opinions
that do not support this direction. This patchset is still useful even
now.

> So the only value I see in this is for testing over the next couple of
> years, which really doesn't need any sort of significant optimisation or
> less reliance on host facilities.

I agree with the former, but not the latter.

- there is value with a real use case,
- there are different ways to implement it, but this went with the one
  with potential (additional) benefits,
- without breakage to the existing (MMU) uml code.

with that, we're proposing this patchset.

> Where do you see this differently?

thanks for the careful prompt. I hope my answer clarifies your
concerns.

I also wish to understand the concerns of maintainers, given the
single-process design of nommu for um userspace and that the codebase
is still young and so may have unexpected influence on others. but this
is exactly the reason why I also put myself in MAINTAINERS, in order to
take care of this patchset even though it is small (1.3k loc).

-- Hajime

From tiwei.bie at linux.dev Wed Nov 12 08:36:51 2025
From: tiwei.bie at linux.dev (Tiwei Bie)
Date: Thu, 13 Nov 2025 00:36:51 +0800
Subject: [PATCH v13 00/13] nommu UML
In-Reply-To:
References:
Message-ID: <20251112163651.3689244-1-tiwei.bie@linux.dev>

On Wed, 12 Nov 2025 17:52:56 +0900, Hajime Tazaki wrote:

[...]
> > However, I'm not yet convinced that all of the complexities presented in
> > this patchset (such as completely separate seccomp implementation) are
> > actually necessary in support of _just_ the second bullet. These seem to
> > me like design choices necessary to support the _first_ bullet [1].
>
> a separate seccomp implementation is indeed needed due to the design
> choice we made, to use a single process to host a (um) userspace. I
> think there is no reason to unify the seccomp part because the
> signal handlers and filter installation do different jobs.
>
> I don't see why you see this as a _complexity_, as functionally the two
> seccomp handling paths don't interfere with each other. we have prepared
> separate sub-directories for nommu to avoid unnecessary if/else
> clauses in .c/.h files.

I have the same concern about the complexity introduced by this patch
set. The new processing paths it introduces (such as the separate
handling for FP/SSE/AVX, FS, signal, syscall, ...) add a lot of
unnecessary complexity. I think Johannes's suggestion is a great idea.

> we haven't seen any functional regressions
> since this RFC version (which was 6.12 kernel).

I took a quick look at the code. It appears that patch 02/13 will
break the mmu build when UML_TIME_TRAVEL_SUPPORT is enabled.
Regards, Tiwei From kuninori.morimoto.gx at renesas.com Wed Nov 12 18:25:26 2025 From: kuninori.morimoto.gx at renesas.com (Kuninori Morimoto) Date: Thu, 13 Nov 2025 02:25:26 +0000 Subject: [PATCH] um: drivers: virtio: use string choices helper Message-ID: <87h5uywtwp.wl-kuninori.morimoto.gx@renesas.com> Remove hard-coded strings by using the string helper functions Signed-off-by: Kuninori Morimoto --- arch/um/drivers/virtio_uml.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/um/drivers/virtio_uml.c b/arch/um/drivers/virtio_uml.c index de7867ae220d0..6cf1152a1a4e6 100644 --- a/arch/um/drivers/virtio_uml.c +++ b/arch/um/drivers/virtio_uml.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -1151,8 +1152,7 @@ void virtio_uml_set_no_vq_suspend(struct virtio_device *vdev, return; vu_dev->no_vq_suspend = no_vq_suspend; - dev_info(&vdev->dev, "%sabled VQ suspend\n", - no_vq_suspend ? "dis" : "en"); + dev_info(&vdev->dev, "%s VQ suspend\n", str_disabled_enabled(no_vq_suspend)); } static void vu_of_conn_broken(struct work_struct *wk) -- 2.43.0 From pabeni at redhat.com Thu Nov 13 06:39:35 2025 From: pabeni at redhat.com (Paolo Abeni) Date: Thu, 13 Nov 2025 15:39:35 +0100 Subject: [PATCH net v5 1/2] virtio-net: correct hdr_len handling for VIRTIO_NET_F_GUEST_HDRLEN In-Reply-To: <20251111111212.102083-2-xuanzhuo@linux.alibaba.com> References: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com> <20251111111212.102083-2-xuanzhuo@linux.alibaba.com> Message-ID: <25b05194-63cd-4265-8d2c-e174d801fc3a@redhat.com> On 11/11/25 12:12 PM, Xuan Zhuo wrote: > The commit be50da3e9d4a ("net: virtio_net: implement exact header length > guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN > feature in virtio-net. 
>
> This feature requires virtio-net to set hdr_len to the actual header
> length of the packet when transmitting, the number of
> bytes from the start of the packet to the beginning of the
> transport-layer payload.
>
> However, in practice, hdr_len was being set using skb_headlen(skb),
> which is clearly incorrect. This commit fixes that issue.
>
> Fixes: be50da3e9d4a ("net: virtio_net: implement exact header length guest feature")
> Signed-off-by: Xuan Zhuo

IMHO this looks more like a new feature - namely,
VIRTIO_NET_F_GUEST_HDRLEN support - than a fix.

[...]
> @@ -2361,7 +2362,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
> if (vnet_hdr_sz &&
> virtio_net_hdr_from_skb(skb, h.raw + macoff -
> sizeof(struct virtio_net_hdr),
> - vio_le(), true, 0)) {
> + vio_le(), true, false, 0)) {
> if (po->tp_version == TPACKET_V3)
> prb_clear_blk_fill_status(&po->rx_ring);
> goto drop_n_account;

To reduce the diffstat, what about creating a __virtio_net_hdr_from_skb()
variant (please find a better name) allowing the extra `hdrlen_negotiated`
argument, defining virtio_net_hdr_from_skb() as a wrapper of such helper
with the extra arg == false, and using the helper in the few places that
really could use hdrlen?

From pabeni at redhat.com Thu Nov 13 06:50:13 2025
From: pabeni at redhat.com (Paolo Abeni)
Date: Thu, 13 Nov 2025 15:50:13 +0100
Subject: [PATCH net v5 2/2] virtio-net: correct hdr_len handling for tunnel gso
In-Reply-To: <20251111111212.102083-3-xuanzhuo@linux.alibaba.com>
References: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com> <20251111111212.102083-3-xuanzhuo@linux.alibaba.com>
Message-ID:

On 11/11/25 12:12 PM, Xuan Zhuo wrote:
> The commit a2fb4bc4e2a6a03 ("net: implement virtio helpers to handle UDP
> GSO tunneling.") introduces support for the UDP GSO tunnel feature in
> virtio-net.
>
> The virtio spec says:
>
> If the \field{gso_type} has the VIRTIO_NET_HDR_GSO_UDP_TUNNEL_IPV4 bit or
> VIRTIO_NET_HDR_GSO_UDP_TUNNEL_IPV6 bit set, \field{hdr_len} accounts for
> all the headers up to and including the inner transport.
>
> The commit did not update the hdr_len to include the inner transport.
>
> I observed that the "hdr_len" is 116 for this packet:
>
> 17:36:18.241105 52:55:00:d1:27:0a > 2e:2c:df:46:a9:e1, ethertype IPv4 (0x0800), length 2912: (tos 0x0, ttl 64, id 45197, offset 0, flags [none], proto UDP (17), length 2898)
> 192.168.122.100.50613 > 192.168.122.1.4789: [bad udp cksum 0x8106 -> 0x26a0!] VXLAN, flags [I] (0x08), vni 1
> fa:c3:ba:82:05:ee > ce:85:0c:31:77:e5, ethertype IPv4 (0x0800), length 2862: (tos 0x0, ttl 64, id 14678, offset 0, flags [DF], proto TCP (6), length 2848)
> 192.168.3.1.49880 > 192.168.3.2.9898: Flags [P.], cksum 0x9266 (incorrect -> 0xaa20), seq 515667:518463, ack 1, win 64, options [nop,nop,TS val 2990048824 ecr 2798801412], length 2796
>
> 116 = 14(mac) + 20(ip) + 8(udp) + 8(vxlan) + 14(inner mac) + 20(inner ip) + 32(inner tcp)
>
> Fixes: a2fb4bc4e2a6a03 ("net: implement virtio helpers to handle UDP GSO tunneling.")
> Signed-off-by: Xuan Zhuo
> ---
> include/linux/virtio_net.h | 24 ++++++++++++++++--------
> 1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index 3cd8b2ebc197..432b17979d17 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -232,12 +232,23 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
> return -EINVAL;
>
> if (hdrlen_negotiated) {
> - hdr_len = skb_transport_offset(skb);
> + if (sinfo->gso_type & (SKB_GSO_UDP_TUNNEL |
> + SKB_GSO_UDP_TUNNEL_CSUM)) {

I'm personally not a huge fan of adding UDP-tunnel-specific checks to
the generic code. Did you try something along the lines suggested
here:
https://lore.kernel.org/netdev/CAF6piCLkv6kFqoq7OQfJ=Su9AVHSQ9J7DzaumOSf5xuf9w-kyA at mail.gmail.com/ ? Thanks, Paolo From mst at redhat.com Thu Nov 13 07:59:17 2025 From: mst at redhat.com (Michael S. Tsirkin) Date: Thu, 13 Nov 2025 10:59:17 -0500 Subject: [PATCH net v5 1/2] virtio-net: correct hdr_len handling for VIRTIO_NET_F_GUEST_HDRLEN In-Reply-To: <25b05194-63cd-4265-8d2c-e174d801fc3a@redhat.com> References: <20251111111212.102083-1-xuanzhuo@linux.alibaba.com> <20251111111212.102083-2-xuanzhuo@linux.alibaba.com> <25b05194-63cd-4265-8d2c-e174d801fc3a@redhat.com> Message-ID: <20251113105844-mutt-send-email-mst@kernel.org> On Thu, Nov 13, 2025 at 03:39:35PM +0100, Paolo Abeni wrote: > On 11/11/25 12:12 PM, Xuan Zhuo wrote: > > The commit be50da3e9d4a ("net: virtio_net: implement exact header length > > guest feature") introduces support for the VIRTIO_NET_F_GUEST_HDRLEN > > feature in virtio-net. > > > > This feature requires virtio-net to set hdr_len to the actual header > > length of the packet when transmitting, the number of > > bytes from the start of the packet to the beginning of the > > transport-layer payload. > > > > However, in practice, hdr_len was being set using skb_headlen(skb), > > which is clearly incorrect. This commit fixes that issue. > > > > Fixes: be50da3e9d4a ("net: virtio_net: implement exact header length guest feature") > > Signed-off-by: Xuan Zhuo > > IMHO this looks like more a new feature - namely, > VIRTIO_NET_F_GUEST_HDRLEN support - than a fix. I mean if guest negotiates VIRTIO_NET_F_GUEST_HDRLEN but the header length is wrong then yes it is broken and this is a fix. > [...] 
> > @@ -2361,7 +2362,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
> > if (vnet_hdr_sz &&
> > virtio_net_hdr_from_skb(skb, h.raw + macoff -
> > sizeof(struct virtio_net_hdr),
> > - vio_le(), true, 0)) {
> > + vio_le(), true, false, 0)) {
> > if (po->tp_version == TPACKET_V3)
> > prb_clear_blk_fill_status(&po->rx_ring);
> > goto drop_n_account;
> To reduce the diffstat, what about creating a __virtio_net_hdr_from_skb()
> variant (please find a better name) allowing the extra `hdrlen_negotiated`
> argument, defining virtio_net_hdr_from_skb() as a wrapper of such helper
> with the extra arg == false, and using the helper in the few places that
> really could use hdrlen?

From thehajime at gmail.com Thu Nov 13 22:47:34 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Fri, 14 Nov 2025 15:47:34 +0900
Subject: [PATCH v13 00/13] nommu UML
In-Reply-To: <20251112163651.3689244-1-tiwei.bie@linux.dev>
References: <20251112163651.3689244-1-tiwei.bie@linux.dev>
Message-ID:

On Thu, 13 Nov 2025 01:36:51 +0900, Tiwei Bie wrote:

> > we haven't seen any functional regressions
> > since this RFC version (which was 6.12 kernel).
>
> I took a quick look at the code. It appears that patch 02/13 will
> break the mmu build when UML_TIME_TRAVEL_SUPPORT is enabled.

thanks, this is my mistake from moving the chunk. I will fix it and add
it to my local tests.

-- Hajime

From qi.zheng at linux.dev Fri Nov 14 03:11:16 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Fri, 14 Nov 2025 19:11:16 +0800
Subject: [PATCH 2/7] arc: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To:
References:
Message-ID: <6a4192f5cef3049f123f08cb04ef5cd0179c3281.1763117269.git.zhengqi.arch@bytedance.com>

From: Qi Zheng

On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
empty PTE page table pages (such as 100GB+). To resolve this problem,
first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
PT_RECLAIM feature, which resolves this problem.
Signed-off-by: Qi Zheng Cc: Vineet Gupta --- arch/arc/Kconfig | 1 + arch/arc/include/asm/pgalloc.h | 9 ++++++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig index f27e6b90428e4..47db93952386d 100644 --- a/arch/arc/Kconfig +++ b/arch/arc/Kconfig @@ -54,6 +54,7 @@ config ARC select HAVE_ARCH_JUMP_LABEL if ISA_ARCV2 && !CPU_ENDIAN_BE32 select TRACE_IRQFLAGS_SUPPORT select HAVE_EBPF_JIT if ISA_ARCV2 + select MMU_GATHER_RCU_TABLE_FREE config LOCKDEP_SUPPORT def_bool y diff --git a/arch/arc/include/asm/pgalloc.h b/arch/arc/include/asm/pgalloc.h index dfae070fe8d55..b1c6619435613 100644 --- a/arch/arc/include/asm/pgalloc.h +++ b/arch/arc/include/asm/pgalloc.h @@ -72,7 +72,8 @@ static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4dp, pud_t *pudp) set_p4d(p4dp, __p4d((unsigned long)pudp)); } -#define __pud_free_tlb(tlb, pmd, addr) pud_free((tlb)->mm, pmd) +#define __pud_free_tlb(tlb, pud, addr) \ + tlb_remove_ptdesc((tlb), virt_to_ptdesc(pud)) #endif @@ -83,10 +84,12 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmdp) set_pud(pudp, __pud((unsigned long)pmdp)); } -#define __pmd_free_tlb(tlb, pmd, addr) pmd_free((tlb)->mm, pmd) +#define __pmd_free_tlb(tlb, pmd, addr) \ + tlb_remove_ptdesc((tlb), virt_to_ptdesc(pmd)) #endif -#define __pte_free_tlb(tlb, pte, addr) pte_free((tlb)->mm, pte) +#define __pte_free_tlb(tlb, pte, addr) \ + tlb_remove_ptdesc((tlb), page_ptdesc(pte)) #endif /* _ASM_ARC_PGALLOC_H */ -- 2.20.1 From qi.zheng at linux.dev Fri Nov 14 03:11:14 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Fri, 14 Nov 2025 19:11:14 +0800 Subject: [PATCH 0/7] enable PT_RECLAIM on all 64-bit architectures Message-ID: From: Qi Zheng Hi all, This series aims to enable PT_RECLAIM on all 64-bit architectures. On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE page table pages (such as 100GB+). 
To resolve this problem, we need to enable PT_RECLAIM, which depends on MMU_GATHER_RCU_TABLE_FREE. Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all 64-bit architectures, and finally makes PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT. This way, PT_RECLAIM can be enabled by default on all 64-bit architectures. Comments and suggestions are welcome! Thanks, Qi Qi Zheng (7): alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE arc: mm: enable MMU_GATHER_RCU_TABLE_FREE loongarch: mm: enable MMU_GATHER_RCU_TABLE_FREE mips: mm: enable MMU_GATHER_RCU_TABLE_FREE parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE um: mm: enable MMU_GATHER_RCU_TABLE_FREE mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT arch/alpha/Kconfig | 1 + arch/alpha/include/asm/tlb.h | 8 +++++--- arch/arc/Kconfig | 1 + arch/arc/include/asm/pgalloc.h | 9 ++++++--- arch/loongarch/Kconfig | 1 + arch/loongarch/include/asm/pgalloc.h | 6 ++++-- arch/mips/Kconfig | 1 + arch/mips/include/asm/pgalloc.h | 6 ++++-- arch/parisc/Kconfig | 1 + arch/parisc/include/asm/tlb.h | 6 ++++-- arch/um/Kconfig | 1 + arch/x86/Kconfig | 1 - mm/Kconfig | 6 +----- 13 files changed, 30 insertions(+), 18 deletions(-) -- 2.20.1 From qi.zheng at linux.dev Fri Nov 14 03:11:17 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Fri, 14 Nov 2025 19:11:17 +0800 Subject: [PATCH 3/7] loongarch: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: References: Message-ID: <146b5a0207052b38d04caac6b20756a61c2189b3.1763117269.git.zhengqi.arch@bytedance.com> From: Qi Zheng On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the PT_RECLAIM feature, which resolves this problem. 
Signed-off-by: Qi Zheng Cc: Huacai Chen Cc: WANG Xuerui --- arch/loongarch/Kconfig | 1 + arch/loongarch/include/asm/pgalloc.h | 6 ++++-- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 5b1116733d881..3bf2f2a9cd647 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -210,6 +210,7 @@ config LOONGARCH select USER_STACKTRACE_SUPPORT select VDSO_GETRANDOM select ZONE_DMA32 + select MMU_GATHER_RCU_TABLE_FREE config 32BIT bool diff --git a/arch/loongarch/include/asm/pgalloc.h b/arch/loongarch/include/asm/pgalloc.h index 1c63a9d9a6d35..0539d04bf1525 100644 --- a/arch/loongarch/include/asm/pgalloc.h +++ b/arch/loongarch/include/asm/pgalloc.h @@ -79,7 +79,8 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address) return pmd; } -#define __pmd_free_tlb(tlb, x, addr) pmd_free((tlb)->mm, x) +#define __pmd_free_tlb(tlb, x, addr) \ + tlb_remove_ptdesc((tlb), virt_to_ptdesc(x)) #endif @@ -99,7 +100,8 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address) return pud; } -#define __pud_free_tlb(tlb, x, addr) pud_free((tlb)->mm, x) +#define __pud_free_tlb(tlb, x, addr) \ + tlb_remove_ptdesc((tlb), virt_to_ptdesc(x)) #endif /* __PAGETABLE_PUD_FOLDED */ -- 2.20.1 From qi.zheng at linux.dev Fri Nov 14 03:11:18 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Fri, 14 Nov 2025 19:11:18 +0800 Subject: [PATCH 4/7] mips: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: References: Message-ID: From: Qi Zheng On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the PT_RECLAIM feature, which resolves this problem. 
Signed-off-by: Qi Zheng Cc: Thomas Bogendoerfer --- arch/mips/Kconfig | 1 + arch/mips/include/asm/pgalloc.h | 6 ++++-- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index e8683f58fd3e2..0ee8820a354c4 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -108,6 +108,7 @@ config MIPS select TRACE_IRQFLAGS_SUPPORT select ARCH_HAS_ELFCORE_COMPAT select HAVE_ARCH_KCSAN if 64BIT + select MMU_GATHER_RCU_TABLE_FREE config MIPS_FIXUP_BIGPHYS_ADDR bool diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h index 942af87f1cddb..c00f445045f43 100644 --- a/arch/mips/include/asm/pgalloc.h +++ b/arch/mips/include/asm/pgalloc.h @@ -72,7 +72,8 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address) return pmd; } -#define __pmd_free_tlb(tlb, x, addr) pmd_free((tlb)->mm, x) +#define __pmd_free_tlb(tlb, x, addr) \ + tlb_remove_ptdesc((tlb), virt_to_ptdesc(x)) #endif @@ -98,7 +99,8 @@ static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud) set_p4d(p4d, __p4d((unsigned long)pud)); } -#define __pud_free_tlb(tlb, x, addr) pud_free((tlb)->mm, x) +#define __pud_free_tlb(tlb, x, addr) \ + tlb_remove_ptdesc((tlb), virt_to_ptdesc(x)) #endif /* __PAGETABLE_PUD_FOLDED */ -- 2.20.1 From qi.zheng at linux.dev Fri Nov 14 03:11:19 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Fri, 14 Nov 2025 19:11:19 +0800 Subject: [PATCH 5/7] parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: References: Message-ID: <3a88790a662c2b84066c77772d20bd1f5f687f8b.1763117269.git.zhengqi.arch@bytedance.com> From: Qi Zheng On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the PT_RECLAIM feature, which resolves this problem. Signed-off-by: Qi Zheng Cc: "James E.J. 
Bottomley" Cc: Helge Deller --- arch/parisc/Kconfig | 1 + arch/parisc/include/asm/tlb.h | 6 ++++-- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig index 47fd9662d8005..946cbe21a4118 100644 --- a/arch/parisc/Kconfig +++ b/arch/parisc/Kconfig @@ -92,6 +92,7 @@ config PARISC select TRACE_IRQFLAGS_SUPPORT select HAVE_FUNCTION_DESCRIPTORS if 64BIT select PCI_MSI_ARCH_FALLBACKS if PCI_MSI + select MMU_GATHER_RCU_TABLE_FREE help The PA-RISC microprocessor is designed by Hewlett-Packard and used diff --git a/arch/parisc/include/asm/tlb.h b/arch/parisc/include/asm/tlb.h index 44235f367674d..ab7d4113df61a 100644 --- a/arch/parisc/include/asm/tlb.h +++ b/arch/parisc/include/asm/tlb.h @@ -5,8 +5,10 @@ #include #if CONFIG_PGTABLE_LEVELS == 3 -#define __pmd_free_tlb(tlb, pmd, addr) pmd_free((tlb)->mm, pmd) +#define __pmd_free_tlb(tlb, pmd, addr) \ + tlb_remove_ptdesc((tlb), virt_to_ptdesc(pmd)) #endif -#define __pte_free_tlb(tlb, pte, addr) pte_free((tlb)->mm, pte) +#define __pte_free_tlb(tlb, pte, addr) \ + tlb_remove_ptdesc((tlb), page_ptdesc(pte)) #endif -- 2.20.1 From qi.zheng at linux.dev Fri Nov 14 03:11:20 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Fri, 14 Nov 2025 19:11:20 +0800 Subject: [PATCH 6/7] um: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: References: Message-ID: <27f173b0fc6fdf92104721fc3daba8d7d9d31e2f.1763117269.git.zhengqi.arch@bytedance.com> From: Qi Zheng On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the PT_RECLAIM feature, which resolves this problem. 
Signed-off-by: Qi Zheng Cc: Richard Weinberger Cc: Anton Ivanov Cc: Johannes Berg --- arch/um/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/um/Kconfig b/arch/um/Kconfig index 097c6a6265ef3..47a41bc77bb24 100644 --- a/arch/um/Kconfig +++ b/arch/um/Kconfig @@ -41,6 +41,7 @@ config UML select HAVE_SYSCALL_TRACEPOINTS select THREAD_INFO_IN_TASK select SPARSE_IRQ + select MMU_GATHER_RCU_TABLE_FREE config MMU bool -- 2.20.1 From qi.zheng at linux.dev Fri Nov 14 03:11:21 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Fri, 14 Nov 2025 19:11:21 +0800 Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT In-Reply-To: References: Message-ID: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com> From: Qi Zheng Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that PT_RECLAIM can be enabled by default on all architectures that support MMU_GATHER_RCU_TABLE_FREE. Considering that a large number of PTE page table pages (such as 100GB+) can only be caused on a 64-bit system, let PT_RECLAIM also depend on 64BIT. Signed-off-by: Qi Zheng --- arch/x86/Kconfig | 1 - mm/Kconfig | 6 +----- 2 files changed, 1 insertion(+), 6 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index eac2e86056902..96bff81fd4787 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -330,7 +330,6 @@ config X86 select FUNCTION_ALIGNMENT_4B imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE - select ARCH_SUPPORTS_PT_RECLAIM if X86_64 select ARCH_SUPPORTS_SCHED_SMT if SMP select SCHED_SMT if SMP select ARCH_SUPPORTS_SCHED_CLUSTER if SMP diff --git a/mm/Kconfig b/mm/Kconfig index a5a90b169435d..e795fbd69e50c 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK The architecture has hardware support for userspace shadow call stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss). 
-config ARCH_SUPPORTS_PT_RECLAIM - def_bool n - config PT_RECLAIM bool "reclaim empty user page table pages" default y - depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP - select MMU_GATHER_RCU_TABLE_FREE + depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT help Try to reclaim empty user page table pages in paths other than munmap and exit_mmap path. -- 2.20.1 From qi.zheng at linux.dev Fri Nov 14 03:11:15 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Fri, 14 Nov 2025 19:11:15 +0800 Subject: [PATCH 1/7] alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: References: Message-ID: <66cd5b21aecc3281318b66a3a4aae078c4b9d37b.1763117269.git.zhengqi.arch@bytedance.com> From: Qi Zheng On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the PT_RECLAIM feature, which resolves this problem. Signed-off-by: Qi Zheng Cc: Richard Henderson Cc: Matt Turner --- arch/alpha/Kconfig | 1 + arch/alpha/include/asm/tlb.h | 8 +++++--- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index 80367f2cf821c..681ed894d9e72 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -40,6 +40,7 @@ config ALPHA select MMU_GATHER_NO_RANGE select SPARSEMEM_EXTREME if SPARSEMEM select ZONE_DMA + select MMU_GATHER_RCU_TABLE_FREE help The Alpha is a 64-bit general-purpose processor designed and marketed by the Digital Equipment Corporation of blessed memory, diff --git a/arch/alpha/include/asm/tlb.h b/arch/alpha/include/asm/tlb.h index 4f79e331af5ea..4fe5a901720f0 100644 --- a/arch/alpha/include/asm/tlb.h +++ b/arch/alpha/include/asm/tlb.h @@ -4,7 +4,9 @@ #include -#define __pte_free_tlb(tlb, pte, address) pte_free((tlb)->mm, pte) -#define __pmd_free_tlb(tlb, pmd, address) pmd_free((tlb)->mm, pmd) - +#define __pte_free_tlb(tlb, pte, address) \ + tlb_remove_ptdesc((tlb), 
page_ptdesc(pte)) +#define __pmd_free_tlb(tlb, pmd, address) \ + tlb_remove_ptdesc((tlb), virt_to_ptdesc(pmd)) + #endif -- 2.20.1 From qi.zheng at linux.dev Fri Nov 14 03:20:02 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Fri, 14 Nov 2025 19:20:02 +0800 Subject: [PATCH 2/7] arc: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: <6a4192f5cef3049f123f08cb04ef5cd0179c3281.1763117269.git.zhengqi.arch@bytedance.com> References: <6a4192f5cef3049f123f08cb04ef5cd0179c3281.1763117269.git.zhengqi.arch@bytedance.com> Message-ID: <5199c367-aabb-43e7-951e-452657dcdddc@linux.dev> On 11/14/25 7:11 PM, Qi Zheng wrote: > From: Qi Zheng > > On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of > empty PTE page table pages (such as 100GB+). To resolve this problem, > first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the > PT_RECLAIM feature, which resolves this problem. > > Signed-off-by: Qi Zheng > Cc: Vineet Gupta > --- > arch/arc/Kconfig | 1 + > arch/arc/include/asm/pgalloc.h | 9 ++++++--- > 2 files changed, 7 insertions(+), 3 deletions(-) Strangely, it seems that only ARC does not define CONFIG_64BIT? Does the ARC architecture support 64-bit? Did I miss something? 
> > diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig > index f27e6b90428e4..47db93952386d 100644 > --- a/arch/arc/Kconfig > +++ b/arch/arc/Kconfig > @@ -54,6 +54,7 @@ config ARC > select HAVE_ARCH_JUMP_LABEL if ISA_ARCV2 && !CPU_ENDIAN_BE32 > select TRACE_IRQFLAGS_SUPPORT > select HAVE_EBPF_JIT if ISA_ARCV2 > + select MMU_GATHER_RCU_TABLE_FREE > > config LOCKDEP_SUPPORT > def_bool y > diff --git a/arch/arc/include/asm/pgalloc.h b/arch/arc/include/asm/pgalloc.h > index dfae070fe8d55..b1c6619435613 100644 > --- a/arch/arc/include/asm/pgalloc.h > +++ b/arch/arc/include/asm/pgalloc.h > @@ -72,7 +72,8 @@ static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4dp, pud_t *pudp) > set_p4d(p4dp, __p4d((unsigned long)pudp)); > } > > -#define __pud_free_tlb(tlb, pmd, addr) pud_free((tlb)->mm, pmd) > +#define __pud_free_tlb(tlb, pud, addr) \ > + tlb_remove_ptdesc((tlb), virt_to_ptdesc(pud)) > > #endif > > @@ -83,10 +84,12 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmdp) > set_pud(pudp, __pud((unsigned long)pmdp)); > } > > -#define __pmd_free_tlb(tlb, pmd, addr) pmd_free((tlb)->mm, pmd) > +#define __pmd_free_tlb(tlb, pmd, addr) \ > + tlb_remove_ptdesc((tlb), virt_to_ptdesc(pmd)) > > #endif > > -#define __pte_free_tlb(tlb, pte, addr) pte_free((tlb)->mm, pte) > +#define __pte_free_tlb(tlb, pte, addr) \ > + tlb_remove_ptdesc((tlb), page_ptdesc(pte)) > > #endif /* _ASM_ARC_PGALLOC_H */ From chenhuacai at kernel.org Fri Nov 14 06:17:55 2025 From: chenhuacai at kernel.org (Huacai Chen) Date: Fri, 14 Nov 2025 22:17:55 +0800 Subject: [PATCH 3/7] loongarch: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: <146b5a0207052b38d04caac6b20756a61c2189b3.1763117269.git.zhengqi.arch@bytedance.com> References: <146b5a0207052b38d04caac6b20756a61c2189b3.1763117269.git.zhengqi.arch@bytedance.com> Message-ID: Hi, Qi Zheng, We usually use LoongArch rather than loongarch, but if you want to keep consistency for all patches, just do it. 
On Fri, Nov 14, 2025 at 7:13 PM Qi Zheng wrote:
>
> From: Qi Zheng
>
> On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
> empty PTE page table pages (such as 100GB+). To resolve this problem,
> first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
> PT_RECLAIM feature, which resolves this problem.
>
> Signed-off-by: Qi Zheng
> Cc: Huacai Chen
> Cc: WANG Xuerui
> ---
>  arch/loongarch/Kconfig               | 1 +
>  arch/loongarch/include/asm/pgalloc.h | 6 ++++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
> index 5b1116733d881..3bf2f2a9cd647 100644
> --- a/arch/loongarch/Kconfig
> +++ b/arch/loongarch/Kconfig
> @@ -210,6 +210,7 @@ config LOONGARCH
>         select USER_STACKTRACE_SUPPORT
>         select VDSO_GETRANDOM
>         select ZONE_DMA32
> +       select MMU_GATHER_RCU_TABLE_FREE
Please use alphabetical order.

>
>  config 32BIT
>         bool
> diff --git a/arch/loongarch/include/asm/pgalloc.h b/arch/loongarch/include/asm/pgalloc.h
> index 1c63a9d9a6d35..0539d04bf1525 100644
> --- a/arch/loongarch/include/asm/pgalloc.h
> +++ b/arch/loongarch/include/asm/pgalloc.h
> @@ -79,7 +79,8 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
>         return pmd;
>  }
>
> -#define __pmd_free_tlb(tlb, x, addr) pmd_free((tlb)->mm, x)
> +#define __pmd_free_tlb(tlb, x, addr) \
> +       tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
I think we can define it in one line.

>
>  #endif
>
> @@ -99,7 +100,8 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
>         return pud;
>  }
>
> -#define __pud_free_tlb(tlb, x, addr) pud_free((tlb)->mm, x)
> +#define __pud_free_tlb(tlb, x, addr) \
> +       tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
The same.

Other patches have the same problem.
Huacai

>
>  #endif /* __PAGETABLE_PUD_FOLDED */
>
> --
> 2.20.1
>

From qi.zheng at linux.dev Fri Nov 14 07:55:11 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Fri, 14 Nov 2025 23:55:11 +0800
Subject: [PATCH 3/7] loongarch: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To:
References: <146b5a0207052b38d04caac6b20756a61c2189b3.1763117269.git.zhengqi.arch@bytedance.com>
Message-ID:

Hi Huacai,

On 11/14/25 10:17 PM, Huacai Chen wrote:
> Hi, Qi Zheng,
>
> We usually use LoongArch rather than loongarch, but if you want to
> keep consistency for all patches, just do it.

OK, will change to use LoongArch.

>
> On Fri, Nov 14, 2025 at 7:13 PM Qi Zheng wrote:
>>
>> From: Qi Zheng
>>
>> On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
>> empty PTE page table pages (such as 100GB+). To resolve this problem,
>> first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the
>> PT_RECLAIM feature, which resolves this problem.
>>
>> Signed-off-by: Qi Zheng
>> Cc: Huacai Chen
>> Cc: WANG Xuerui
>> ---
>>  arch/loongarch/Kconfig               | 1 +
>>  arch/loongarch/include/asm/pgalloc.h | 6 ++++--
>>  2 files changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
>> index 5b1116733d881..3bf2f2a9cd647 100644
>> --- a/arch/loongarch/Kconfig
>> +++ b/arch/loongarch/Kconfig
>> @@ -210,6 +210,7 @@ config LOONGARCH
>>         select USER_STACKTRACE_SUPPORT
>>         select VDSO_GETRANDOM
>>         select ZONE_DMA32
>> +       select MMU_GATHER_RCU_TABLE_FREE
> Please use alphabetical order.

OK, will do.
> >> >> config 32BIT >> bool >> diff --git a/arch/loongarch/include/asm/pgalloc.h b/arch/loongarch/include/asm/pgalloc.h >> index 1c63a9d9a6d35..0539d04bf1525 100644 >> --- a/arch/loongarch/include/asm/pgalloc.h >> +++ b/arch/loongarch/include/asm/pgalloc.h >> @@ -79,7 +79,8 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address) >> return pmd; >> } >> >> -#define __pmd_free_tlb(tlb, x, addr) pmd_free((tlb)->mm, x) >> +#define __pmd_free_tlb(tlb, x, addr) \ >> + tlb_remove_ptdesc((tlb), virt_to_ptdesc(x)) > I think we can define it in one line. will do. > >> >> #endif >> >> @@ -99,7 +100,8 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address) >> return pud; >> } >> >> -#define __pud_free_tlb(tlb, x, addr) pud_free((tlb)->mm, x) >> +#define __pud_free_tlb(tlb, x, addr) \ >> + tlb_remove_ptdesc((tlb), virt_to_ptdesc(x)) > The same. > > Other patches have the same problem. Got it, will convert them all to the one-line type. Thanks, Qi > > Huacai > >> >> #endif /* __PAGETABLE_PUD_FOLDED */ >> >> -- >> 2.20.1 >> From linmag7 at gmail.com Fri Nov 14 11:13:55 2025 From: linmag7 at gmail.com (Magnus Lindholm) Date: Fri, 14 Nov 2025 20:13:55 +0100 Subject: [PATCH 1/7] alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: <66cd5b21aecc3281318b66a3a4aae078c4b9d37b.1763117269.git.zhengqi.arch@bytedance.com> References: <66cd5b21aecc3281318b66a3a4aae078c4b9d37b.1763117269.git.zhengqi.arch@bytedance.com> Message-ID: Hi, I applied your patches to a fresh pull of torvalds/linux.git repo but was unable to build the kernel (on Alpha) with this patch applied. 
I made the following changes in order to get it to build on Alpha: diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c index 7e9455a18aae..6761b0c282bf 100644 --- a/mm/pt_reclaim.c +++ b/mm/pt_reclaim.c @@ -1,7 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 #include -#include #include +#include #include "internal.h" /Magnus From vgupta at kernel.org Fri Nov 14 15:10:02 2025 From: vgupta at kernel.org (Vineet Gupta) Date: Fri, 14 Nov 2025 15:10:02 -0800 Subject: [PATCH 2/7] arc: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: <5199c367-aabb-43e7-951e-452657dcdddc@linux.dev> References: <6a4192f5cef3049f123f08cb04ef5cd0179c3281.1763117269.git.zhengqi.arch@bytedance.com> <5199c367-aabb-43e7-951e-452657dcdddc@linux.dev> Message-ID: <4e120357-6fa3-436a-8474-b07b473381b6@kernel.org> On 11/14/25 03:20, Qi Zheng wrote: > Strangely, it seems that only ARC does not define CONFIG_64BIT? > > Does the ARC architecture support 64-bit? Did I miss something? ARC is 32-bit only ! -Vineet From lkp at intel.com Fri Nov 14 16:51:44 2025 From: lkp at intel.com (kernel test robot) Date: Sat, 15 Nov 2025 08:51:44 +0800 Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT In-Reply-To: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com> References: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com> Message-ID: <202511150845.XqOxPJxe-lkp@intel.com> Hi Qi, kernel test robot noticed the following build errors: [auto build test ERROR on deller-parisc/for-next] [also build test ERROR on uml/next tip/x86/core akpm-mm/mm-everything linus/master v6.18-rc5 next-20251114] [cannot apply to uml/fixes vgupta-arc/for-next vgupta-arc/for-curr] [If your patch is applied to the wrong git tree, kindly drop us a note. 
And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/alpha-mm-enable-MMU_GATHER_RCU_TABLE_FREE/20251114-191543 base: https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git for-next patch link: https://lore.kernel.org/r/0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch%40bytedance.com patch subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT config: arm64-randconfig-004-20251115 (https://download.01.org/0day-ci/archive/20251115/202511150845.XqOxPJxe-lkp at intel.com/config) compiler: aarch64-linux-gcc (GCC) 8.5.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251115/202511150845.XqOxPJxe-lkp at intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202511150845.XqOxPJxe-lkp at intel.com/ All errors (new ones prefixed by >>): In file included from mm/pt_reclaim.c:3: mm/pt_reclaim.c: In function 'free_pte': >> include/asm-generic/tlb.h:731:3: error: implicit declaration of function '__pte_free_tlb'; did you mean 'pte_free_tlb'? 
[-Werror=implicit-function-declaration] __pte_free_tlb(tlb, ptep, address); \ ^~~~~~~~~~~~~~ mm/pt_reclaim.c:31:2: note: in expansion of macro 'pte_free_tlb' pte_free_tlb(tlb, pmd_pgtable(pmdval), addr); ^~~~~~~~~~~~ cc1: some warnings being treated as errors vim +731 include/asm-generic/tlb.h a00cc7d9dd93d6 Matthew Wilcox 2017-02-24 701 a00cc7d9dd93d6 Matthew Wilcox 2017-02-24 702 #define tlb_remove_pud_tlb_entry(tlb, pudp, address) \ a00cc7d9dd93d6 Matthew Wilcox 2017-02-24 703 do { \ 2631ed00b04988 Peter Zijlstra (Intel 2020-06-25 704) tlb_flush_pud_range(tlb, address, HPAGE_PUD_SIZE); \ a00cc7d9dd93d6 Matthew Wilcox 2017-02-24 705 __tlb_remove_pud_tlb_entry(tlb, pudp, address); \ a00cc7d9dd93d6 Matthew Wilcox 2017-02-24 706 } while (0) a00cc7d9dd93d6 Matthew Wilcox 2017-02-24 707 b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 708 /* b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 709 * For things like page tables caches (ie caching addresses "inside" the b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 710 * page tables, like x86 does), for legacy reasons, flushing an b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 711 * individual page had better flush the page table caches behind it. This b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 712 * is definitely how x86 works, for example. And if you have an b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 713 * architected non-legacy page table cache (which I'm not aware of b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 714 * anybody actually doing), you're going to have some architecturally b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 715 * explicit flushing for that, likely *separate* from a regular TLB entry b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 716 * flush, and thus you'd need more than just some range expansion.. 
b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 717 * b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 718 * So if we ever find an architecture b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 719 * that would want something that odd, I think it is up to that b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 720 * architecture to do its own odd thing, not cause pain for others b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 721 * http://lkml.kernel.org/r/CA+55aFzBggoXtNXQeng5d_mRoDnaMBE5Y+URs+PHR67nUpMtaw at mail.gmail.com b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 722 * b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 723 * For now w.r.t page table cache, mark the range_size as PAGE_SIZE b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 724 */ b5bc66b7131087 Aneesh Kumar K.V 2016-12-12 725 a90744bac57c3c Nicholas Piggin 2018-07-13 726 #ifndef pte_free_tlb 9e1b32caa525cb Benjamin Herrenschmidt 2009-07-22 727 #define pte_free_tlb(tlb, ptep, address) \ ^1da177e4c3f41 Linus Torvalds 2005-04-16 728 do { \ 2631ed00b04988 Peter Zijlstra (Intel 2020-06-25 729) tlb_flush_pmd_range(tlb, address, PAGE_SIZE); \ 22a61c3c4f1379 Peter Zijlstra 2018-08-23 730 tlb->freed_tables = 1; \ 9e1b32caa525cb Benjamin Herrenschmidt 2009-07-22 @731 __pte_free_tlb(tlb, ptep, address); \ ^1da177e4c3f41 Linus Torvalds 2005-04-16 732 } while (0) a90744bac57c3c Nicholas Piggin 2018-07-13 733 #endif ^1da177e4c3f41 Linus Torvalds 2005-04-16 734 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki From lkp at intel.com Fri Nov 14 17:12:35 2025 From: lkp at intel.com (kernel test robot) Date: Sat, 15 Nov 2025 09:12:35 +0800 Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT In-Reply-To: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com> References: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com> Message-ID: <202511150832.iAyO0SAW-lkp@intel.com> Hi Qi, kernel test robot noticed the following build errors: [auto build 
test ERROR on deller-parisc/for-next] [also build test ERROR on uml/next tip/x86/core akpm-mm/mm-everything linus/master v6.18-rc5 next-20251114] [cannot apply to uml/fixes vgupta-arc/for-next vgupta-arc/for-curr] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/alpha-mm-enable-MMU_GATHER_RCU_TABLE_FREE/20251114-191543 base: https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git for-next patch link: https://lore.kernel.org/r/0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch%40bytedance.com patch subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT config: arm64-randconfig-002-20251115 (https://download.01.org/0day-ci/archive/20251115/202511150832.iAyO0SAW-lkp at intel.com/config) compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251115/202511150832.iAyO0SAW-lkp at intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202511150832.iAyO0SAW-lkp at intel.com/ All errors (new ones prefixed by >>): >> mm/pt_reclaim.c:31:2: error: call to undeclared function '__pte_free_tlb'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 31 | pte_free_tlb(tlb, pmd_pgtable(pmdval), addr); | ^ include/asm-generic/tlb.h:731:3: note: expanded from macro 'pte_free_tlb' 731 | __pte_free_tlb(tlb, ptep, address); \ | ^ 1 error generated. 
vim +/__pte_free_tlb +31 mm/pt_reclaim.c 6375e95f381e3d Qi Zheng 2024-12-04 27 6375e95f381e3d Qi Zheng 2024-12-04 28 void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb, 6375e95f381e3d Qi Zheng 2024-12-04 29 pmd_t pmdval) 6375e95f381e3d Qi Zheng 2024-12-04 30 { 6375e95f381e3d Qi Zheng 2024-12-04 @31 pte_free_tlb(tlb, pmd_pgtable(pmdval), addr); 6375e95f381e3d Qi Zheng 2024-12-04 32 mm_dec_nr_ptes(mm); 6375e95f381e3d Qi Zheng 2024-12-04 33 } 6375e95f381e3d Qi Zheng 2024-12-04 34 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki From qi.zheng at linux.dev Sat Nov 15 01:06:51 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Sat, 15 Nov 2025 17:06:51 +0800 Subject: [PATCH 1/7] alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: References: <66cd5b21aecc3281318b66a3a4aae078c4b9d37b.1763117269.git.zhengqi.arch@bytedance.com> Message-ID: Hi Magnus, On 11/15/25 3:13 AM, Magnus Lindholm wrote: > Hi, > > I applied your patches to a fresh pull of torvalds/linux.git repo but was unable > to build the kernel (on Alpha) with this patch applied. > > I made the following changes in order to get it to build on Alpha: Thanks! Will fix it in the next version. 
> > diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c > index 7e9455a18aae..6761b0c282bf 100644 > --- a/mm/pt_reclaim.c > +++ b/mm/pt_reclaim.c > @@ -1,7 +1,7 @@ > // SPDX-License-Identifier: GPL-2.0 > #include > -#include > #include > +#include > > #include "internal.h" > > > /Magnus From qi.zheng at linux.dev Sat Nov 15 01:08:35 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Sat, 15 Nov 2025 17:08:35 +0800 Subject: [PATCH 2/7] arc: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: <4e120357-6fa3-436a-8474-b07b473381b6@kernel.org> References: <6a4192f5cef3049f123f08cb04ef5cd0179c3281.1763117269.git.zhengqi.arch@bytedance.com> <5199c367-aabb-43e7-951e-452657dcdddc@linux.dev> <4e120357-6fa3-436a-8474-b07b473381b6@kernel.org> Message-ID: On 11/15/25 7:10 AM, Vineet Gupta wrote: > On 11/14/25 03:20, Qi Zheng wrote: >> Strangely, it seems that only ARC does not define CONFIG_64BIT? >> >> Does the ARC architecture support 64-bit? Did I miss something? > > ARC is 32-bit only ! Got it! Will drop this patch in the next version. Thanks! > > -Vineet From qi.zheng at linux.dev Sun Nov 16 22:41:10 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Mon, 17 Nov 2025 14:41:10 +0800 Subject: [PATCH 3/7] loongarch: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: References: <146b5a0207052b38d04caac6b20756a61c2189b3.1763117269.git.zhengqi.arch@bytedance.com> Message-ID: <8fbeb3e8-7c30-46f6-a0a4-289efbf45ac0@linux.dev> Hi Huacai, On 11/14/25 10:17 PM, Huacai Chen wrote: > Hi, Qi Zheng, [...] >> >> -#define __pmd_free_tlb(tlb, x, addr) pmd_free((tlb)->mm, x) >> +#define __pmd_free_tlb(tlb, x, addr) \ >> + tlb_remove_ptdesc((tlb), virt_to_ptdesc(x)) > I think we can define it in one line. Do we need to change __pte_free_tlb() to a single-line format as well? 
Thanks,
Qi

>>

From chenhuacai at kernel.org Sun Nov 16 22:57:47 2025
From: chenhuacai at kernel.org (Huacai Chen)
Date: Mon, 17 Nov 2025 14:57:47 +0800
Subject: [PATCH 3/7] loongarch: mm: enable MMU_GATHER_RCU_TABLE_FREE
In-Reply-To: <8fbeb3e8-7c30-46f6-a0a4-289efbf45ac0@linux.dev>
References: <146b5a0207052b38d04caac6b20756a61c2189b3.1763117269.git.zhengqi.arch@bytedance.com> <8fbeb3e8-7c30-46f6-a0a4-289efbf45ac0@linux.dev>
Message-ID:

On Mon, Nov 17, 2025 at 2:42 PM Qi Zheng wrote:
>
> Hi Huacai,
>
> On 11/14/25 10:17 PM, Huacai Chen wrote:
> > Hi, Qi Zheng,
>
> [...]
>
> >>
> >> -#define __pmd_free_tlb(tlb, x, addr) pmd_free((tlb)->mm, x)
> >> +#define __pmd_free_tlb(tlb, x, addr) \
> >> +       tlb_remove_ptdesc((tlb), virt_to_ptdesc(x))
> > I think we can define it in one line.
>
> Do we need to change __pte_free_tlb() to a single-line format
> as well?
Yes, there is no 80-column limit now.

Huacai

>
> Thanks,
> Qi
>
>
> >>
> >

From david at kernel.org Mon Nov 17 08:53:42 2025
From: david at kernel.org (David Hildenbrand (Red Hat))
Date: Mon, 17 Nov 2025 17:53:42 +0100
Subject: [PATCH 0/7] enable PT_RECLAIM on all 64-bit architectures
In-Reply-To:
References:
Message-ID: <83e88171-54cb-4112-a344-f6a7d7f13784@kernel.org>

On 14.11.25 12:11, Qi Zheng wrote:
> From: Qi Zheng
>
> Hi all,
>
> This series aims to enable PT_RECLAIM on all 64-bit architectures.
>
> On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE
> page table pages (such as 100GB+). To resolve this problem, we need to enable
> PT_RECLAIM, which depends on MMU_GATHER_RCU_TABLE_FREE.
>

Makes sense!

> Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all 64-bit
> architectures, and finally makes PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE
> && 64BIT. This way, PT_RECLAIM can be enabled by default on all 64-bit
> architectures.

Could we then even go ahead and stop making PT_RECLAIM user-selectable?
-- Cheers David From david at kernel.org Mon Nov 17 08:57:58 2025 From: david at kernel.org (David Hildenbrand (Red Hat)) Date: Mon, 17 Nov 2025 17:57:58 +0100 Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT In-Reply-To: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com> References: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com> Message-ID: <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org> On 14.11.25 12:11, Qi Zheng wrote: > From: Qi Zheng Subject: s/&&/&/ > > Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that PT_RECLAIM can > be enabled by default on all architectures that support > MMU_GATHER_RCU_TABLE_FREE. > > Considering that a large number of PTE page table pages (such as 100GB+) > can only be caused on a 64-bit system, let PT_RECLAIM also depend on > 64BIT. > > Signed-off-by: Qi Zheng > --- > arch/x86/Kconfig | 1 - > mm/Kconfig | 6 +----- > 2 files changed, 1 insertion(+), 6 deletions(-) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index eac2e86056902..96bff81fd4787 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -330,7 +330,6 @@ config X86 > select FUNCTION_ALIGNMENT_4B > imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI > select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE > - select ARCH_SUPPORTS_PT_RECLAIM if X86_64 > select ARCH_SUPPORTS_SCHED_SMT if SMP > select SCHED_SMT if SMP > select ARCH_SUPPORTS_SCHED_CLUSTER if SMP > diff --git a/mm/Kconfig b/mm/Kconfig > index a5a90b169435d..e795fbd69e50c 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK > The architecture has hardware support for userspace shadow call > stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss). 
>
> -config ARCH_SUPPORTS_PT_RECLAIM
> -       def_bool n
> -
>  config PT_RECLAIM
>         bool "reclaim empty user page table pages"
>         default y
> -       depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
> -       select MMU_GATHER_RCU_TABLE_FREE
> +       depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT

Why would we have MMU_GATHER_RCU_TABLE_FREE without MMU? (can we drop
the MMU part)

Why do we care about SMP in the first place? (can we drop SMP)

But I also wonder why we need "MMU_GATHER_RCU_TABLE_FREE && 64BIT":

Would it be harmful on 32bit (sure, we might not reclaim as much, but
still there is memory to be reclaimed?)?

If all 64BIT support MMU_GATHER_RCU_TABLE_FREE (as you previously
state), why can't we only check for 64BIT?

--
Cheers

David

From development at efficientek.com Mon Nov 17 18:31:31 2025
From: development at efficientek.com (Glenn Washburn)
Date: Mon, 17 Nov 2025 20:31:31 -0600
Subject: No more non-root networking modes?
Message-ID: <20251117203131.479a1cfd@crass-HP-ZBook-15-G2>

Hi all,

I'm just now noticing that earlier this year obsolete networking
transports were removed (commit 65eaac591b75). This included SLiRP,
which was to my knowledge the only transport that supported network
access as an unprivileged user (no privileged access needed to set up
the transport either). Am I wrong about that? If not, I'm curious as to
why functionality that could not be achieved by other means was
dropped? And if there are any workarounds? I can understand why other
transports that need privileged access to set up, but were inferior to
other existing transports would be removed. I write this as someone who
is currently using the SLiRP transport.

Glenn

From tiwei.bie at linux.dev Mon Nov 17 22:08:55 2025
From: tiwei.bie at linux.dev (Tiwei Bie)
Date: Tue, 18 Nov 2025 14:08:55 +0800
Subject: No more non-root networking modes?
In-Reply-To: <20251117203131.479a1cfd@crass-HP-ZBook-15-G2> References: <20251117203131.479a1cfd@crass-HP-ZBook-15-G2> Message-ID: <20251118060855.3714863-1-tiwei.bie@linux.dev> On Mon, 17 Nov 2025 20:31:31 -0600, Glenn Washburn wrote: > Hi all, > > I'm just now noticing that earlier this year obsolete networking > transports were removed (commit 65eaac591b75). This included SLiRP, > which was to my knowledge the only transport that supported network > access as an unprivileged user (no privileged access needed to setup > the transport either). Am I wrong about that? If not, I'm curious as to > why functionality that could not be achieved by other means was > dropped? And if there are any work arounds? I can understand why other > transports that need privileged access to setup, but were inferior to > other existing transports would be removed. I write this as someone who > is currently using the SLiRP transport. vec also supports networking without privileged access: https://www.kernel.org/doc/html/v6.17/virt/uml/user_mode_linux_howto_v2.html#vde-vector-transport https://lore.kernel.org/all/bfa07f4d-16a3-476b-9314-b8052ec198b1 at antgroup.com/ Regards, Tiwei From arch0.zheng at gmail.com Tue Nov 18 03:53:50 2025 From: arch0.zheng at gmail.com (Qi Zheng) Date: Tue, 18 Nov 2025 19:53:50 +0800 Subject: [PATCH 0/7] enable PT_RECLAIM on all 64-bit architectures In-Reply-To: <83e88171-54cb-4112-a344-f6a7d7f13784@kernel.org> References: <83e88171-54cb-4112-a344-f6a7d7f13784@kernel.org> Message-ID: On 11/18/25 12:53 AM, David Hildenbrand (Red Hat) wrote: > On 14.11.25 12:11, Qi Zheng wrote: >> From: Qi Zheng >> >> Hi all, >> >> This series aims to enable PT_RECLAIM on all 64-bit architectures. >> >> On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of >> empty PTE >> page table pages (such as 100GB+). To resolve this problem, we need to >> enable >> PT_RECLAIM, which depends on MMU_GATHER_RCU_TABLE_FREE. >> > > Makes sense! 
>
>> Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all
>> 64-bit
>> architectures, and finally makes PT_RECLAIM depend on
>> MMU_GATHER_RCU_TABLE_FREE
>> && 64BIT. This way, PT_RECLAIM can be enabled by default on all 64-bit
>> architectures.
>
> Could we then even go ahead and stop making PT_RECLAIM user-selectable?

OK, will change to:

config PT_RECLAIM
    def_bool y
    depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT

>

From qi.zheng at linux.dev Tue Nov 18 04:02:30 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Tue, 18 Nov 2025 20:02:30 +0800
Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT
In-Reply-To: <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org>
References: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com> <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org>
Message-ID: <195baf7c-1f4e-46a4-a4aa-e68e7d00c0f9@linux.dev>

On 11/18/25 12:57 AM, David Hildenbrand (Red Hat) wrote:
> On 14.11.25 12:11, Qi Zheng wrote:
>> From: Qi Zheng
>
> Subject: s/&&/&/

will do.

>
>>
>> Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that PT_RECLAIM
>> can
>> be enabled by default on all architectures that support
>> MMU_GATHER_RCU_TABLE_FREE.
>>
>> Considering that a large number of PTE page table pages (such as 100GB+)
>> can only be caused on a 64-bit system, let PT_RECLAIM also depend on
>> 64BIT.
>>
>> Signed-off-by: Qi Zheng
>> ---
>>   arch/x86/Kconfig | 1 -
>>   mm/Kconfig       | 6 +-----
>>   2 files changed, 1 insertion(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index eac2e86056902..96bff81fd4787 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -330,7 +330,6 @@ config X86
>>      select FUNCTION_ALIGNMENT_4B
>>      imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
>>      select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
>> -    select ARCH_SUPPORTS_PT_RECLAIM        if X86_64
>>      select ARCH_SUPPORTS_SCHED_SMT        
if SMP
>>      select SCHED_SMT            if SMP
>>      select ARCH_SUPPORTS_SCHED_CLUSTER    if SMP
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index a5a90b169435d..e795fbd69e50c 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK
>>        The architecture has hardware support for userspace shadow call
>>        stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>> -config ARCH_SUPPORTS_PT_RECLAIM
>> -    def_bool n
>> -
>>  config PT_RECLAIM
>>      bool "reclaim empty user page table pages"
>>      default y
>> -    depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
>> -    select MMU_GATHER_RCU_TABLE_FREE
>> +    depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT
>
> How would we have MMU_GATHER_RCU_TABLE_FREE without MMU? (can we drop
> the MMU part)

OK.

>
> Why do we care about SMP in the first place? (can we drop SMP)

OK.

>
> But I also wonder why we need "MMU_GATHER_RCU_TABLE_FREE && 64BIT":
>
> Would it be harmful on 32bit (sure, we might not reclaim as much, but
> still there is memory to be reclaimed?)?

This is also fine on 32bit, but the benefits are not significant, so I
chose to enable it only on 64-bit.

I actually tried enabling MMU_GATHER_RCU_TABLE_FREE on all
architectures, and apart from sparc32 being a bit troublesome (because
it uses mm->page_table_lock for synchronization within
__pte_free_tlb()), the modifications were relatively simple.

>
> If all 64BIT support MMU_GATHER_RCU_TABLE_FREE (as you previously
> state), why can't we only check for 64BIT?

OK, will do.

Thanks, Qi > From qi.zheng at linux.dev Tue Nov 18 23:31:17 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Wed, 19 Nov 2025 15:31:17 +0800 Subject: [PATCH v2 0/7] enable PT_RECLAIM on all 64-bit architectures Message-ID: From: Qi Zheng Changelog in v2: - fix compilation errors (reported by Magnus Lindholm and kernel test robot) - adjust some code style (suggested by Huacai Chen) - make PT_RECLAIM user-unselectable (suggested by David Hildenbrand) - rebase onto the next-20251119 Hi all, This series aims to enable PT_RECLAIM on all 64-bit architectures. On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, we need to enable PT_RECLAIM, which depends on MMU_GATHER_RCU_TABLE_FREE. Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all 64-bit architectures, and finally makes PT_RECLAIM depend on 64BIT. This way, PT_RECLAIM can be enabled by default on all 64-bit architectures. BTW, PT_RECLAIM works well on all 32-bit architectures as well. Although the benefit isn't significant, there's still memory that can be reclaimed. Perhaps PT_RECLAIM can be enabled on all 32-bit architectures in the future. Comments and suggestions are welcome! 
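For a sense of scale, here is a rough back-of-the-envelope sketch in C. It is only an illustration and not part of the series: the function names are made up, each PTE table is assumed to fill exactly one page, and the page size and PTE size are passed in as assumptions (4 KiB pages with 8-byte PTEs is the common x86-64 layout; 4 KiB pages with 4-byte PTEs is a typical two-level 32-bit layout).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Virtual address range mapped by one PTE table, assuming the table
 * fills exactly one page (an assumption of this sketch).
 */
uint64_t va_per_pte_table(uint64_t page_size, uint64_t pte_size)
{
	assert(page_size > 0 && pte_size > 0 && pte_size <= page_size);
	return (page_size / pte_size) * page_size;
}

/*
 * Memory consumed by PTE tables if every table covering va_bytes exists,
 * e.g. after the range was touched and then MADV_DONTNEED'ed without the
 * now-empty tables being reclaimed.
 */
uint64_t pte_table_bytes(uint64_t va_bytes, uint64_t page_size,
			 uint64_t pte_size)
{
	uint64_t span = va_per_pte_table(page_size, pte_size);
	/* Round up: a partially covered span still needs a whole table. */
	uint64_t tables = va_bytes / span + (va_bytes % span != 0);

	return tables * page_size;
}
```

Under those assumptions, pte_table_bytes(1ULL << 47, 4096, 8) comes to 256 GiB for a fully populated 128 TiB user address space, and 100 GiB of PTE tables corresponds to about 50 TiB of address space that was touched at some point. A 3 GiB 32-bit user address space with 4-byte PTEs tops out at 3 MiB of PTE tables, which is consistent with the benefit being much smaller on 32-bit.
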
Thanks,
Qi

Qi Zheng (7):
  mm: change mm/pt_reclaim.c to use asm/tlb.h instead of
    asm-generic/tlb.h
  alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE
  LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE
  mips: mm: enable MMU_GATHER_RCU_TABLE_FREE
  parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE
  um: mm: enable MMU_GATHER_RCU_TABLE_FREE
  mm: enable PT_RECLAIM on all 64-bit architectures

 arch/alpha/Kconfig                   | 1 +
 arch/alpha/include/asm/tlb.h         | 6 +++---
 arch/loongarch/Kconfig               | 1 +
 arch/loongarch/include/asm/pgalloc.h | 7 +++----
 arch/mips/Kconfig                    | 1 +
 arch/mips/include/asm/pgalloc.h      | 7 +++----
 arch/parisc/Kconfig                  | 1 +
 arch/parisc/include/asm/tlb.h        | 4 ++--
 arch/um/Kconfig                      | 1 +
 arch/x86/Kconfig                     | 1 -
 mm/Kconfig                           | 9 ++-------
 mm/pt_reclaim.c                      | 2 +-
 12 files changed, 19 insertions(+), 22 deletions(-)

-- 
2.20.1

From qi.zheng at linux.dev Tue Nov 18 23:31:18 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 15:31:18 +0800
Subject: [PATCH v2 1/7] mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h
In-Reply-To: 
References: 
Message-ID: 

From: Qi Zheng

Generally, the asm/tlb.h will include asm-generic/tlb.h, so change
mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h. This can
also fix compilation errors on some architecture when CONFIG_PT_RECLAIM
is enabled (such as alpha).

Signed-off-by: Qi Zheng --- mm/pt_reclaim.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c index 0d9cfbf4fe5d8..46771cfff8239 100644 --- a/mm/pt_reclaim.c +++ b/mm/pt_reclaim.c @@ -2,7 +2,7 @@ #include #include -#include +#include #include "internal.h" -- 2.20.1 From qi.zheng at linux.dev Tue Nov 18 23:31:19 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Wed, 19 Nov 2025 15:31:19 +0800 Subject: [PATCH v2 2/7] alpha: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: References: Message-ID: <54381c49729449b9c3a09e78a69bf14b4b107774.1763537007.git.zhengqi.arch@bytedance.com> From: Qi Zheng On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the PT_RECLAIM feature, which resolves this problem. Signed-off-by: Qi Zheng Cc: Richard Henderson Cc: Matt Turner --- arch/alpha/Kconfig | 1 + arch/alpha/include/asm/tlb.h | 6 +++--- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig index 80367f2cf821c..6c7dbf0adad62 100644 --- a/arch/alpha/Kconfig +++ b/arch/alpha/Kconfig @@ -38,6 +38,7 @@ config ALPHA select OLD_SIGSUSPEND select CPU_NO_EFFICIENT_FFS if !ALPHA_EV67 select MMU_GATHER_NO_RANGE + select MMU_GATHER_RCU_TABLE_FREE select SPARSEMEM_EXTREME if SPARSEMEM select ZONE_DMA help diff --git a/arch/alpha/include/asm/tlb.h b/arch/alpha/include/asm/tlb.h index 4f79e331af5ea..ad586b898fd6b 100644 --- a/arch/alpha/include/asm/tlb.h +++ b/arch/alpha/include/asm/tlb.h @@ -4,7 +4,7 @@ #include -#define __pte_free_tlb(tlb, pte, address) pte_free((tlb)->mm, pte) -#define __pmd_free_tlb(tlb, pmd, address) pmd_free((tlb)->mm, pmd) - +#define __pte_free_tlb(tlb, pte, address) tlb_remove_ptdesc((tlb), page_ptdesc(pte)) +#define __pmd_free_tlb(tlb, pmd, address) tlb_remove_ptdesc((tlb), virt_to_ptdesc(pmd)) + #endif -- 2.20.1 From 
qi.zheng at linux.dev Tue Nov 18 23:31:20 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Wed, 19 Nov 2025 15:31:20 +0800 Subject: [PATCH v2 3/7] LoongArch: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: References: Message-ID: <0e12d201cc18a970c28c84030a0d79f5bda492ca.1763537007.git.zhengqi.arch@bytedance.com> From: Qi Zheng On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the PT_RECLAIM feature, which resolves this problem. Signed-off-by: Qi Zheng Cc: Huacai Chen Cc: WANG Xuerui --- arch/loongarch/Kconfig | 1 + arch/loongarch/include/asm/pgalloc.h | 7 +++---- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 5b1116733d881..57d3e199605dc 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -186,6 +186,7 @@ config LOONGARCH select IRQ_LOONGARCH_CPU select LOCK_MM_AND_FIND_VMA select MMU_GATHER_MERGE_VMAS if MMU + select MMU_GATHER_RCU_TABLE_FREE select MODULES_USE_ELF_RELA if MODULES select NEED_PER_CPU_EMBED_FIRST_CHUNK select NEED_PER_CPU_PAGE_FIRST_CHUNK diff --git a/arch/loongarch/include/asm/pgalloc.h b/arch/loongarch/include/asm/pgalloc.h index 08dcc698ec184..248f62d0b590e 100644 --- a/arch/loongarch/include/asm/pgalloc.h +++ b/arch/loongarch/include/asm/pgalloc.h @@ -55,8 +55,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm) return pte; } -#define __pte_free_tlb(tlb, pte, address) \ - tlb_remove_ptdesc((tlb), page_ptdesc(pte)) +#define __pte_free_tlb(tlb, pte, address) tlb_remove_ptdesc((tlb), page_ptdesc(pte)) #ifndef __PAGETABLE_PMD_FOLDED @@ -79,7 +78,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address) return pmd; } -#define __pmd_free_tlb(tlb, x, addr) pmd_free((tlb)->mm, x) +#define __pmd_free_tlb(tlb, x, addr) tlb_remove_ptdesc((tlb), virt_to_ptdesc(x)) #endif @@ 
-99,7 +98,7 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address) return pud; } -#define __pud_free_tlb(tlb, x, addr) pud_free((tlb)->mm, x) +#define __pud_free_tlb(tlb, x, addr) tlb_remove_ptdesc((tlb), virt_to_ptdesc(x)) #endif /* __PAGETABLE_PUD_FOLDED */ -- 2.20.1 From qi.zheng at linux.dev Tue Nov 18 23:31:21 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Wed, 19 Nov 2025 15:31:21 +0800 Subject: [PATCH v2 4/7] mips: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: References: Message-ID: <1ef6075dca55a0ace4a6de6350531e4bc513080e.1763537007.git.zhengqi.arch@bytedance.com> From: Qi Zheng On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the PT_RECLAIM feature, which resolves this problem. Signed-off-by: Qi Zheng Cc: Thomas Bogendoerfer --- arch/mips/Kconfig | 1 + arch/mips/include/asm/pgalloc.h | 7 +++---- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig index e8683f58fd3e2..8b16dd4db7c08 100644 --- a/arch/mips/Kconfig +++ b/arch/mips/Kconfig @@ -99,6 +99,7 @@ config MIPS select IRQ_FORCED_THREADING select ISA if EISA select LOCK_MM_AND_FIND_VMA + select MMU_GATHER_RCU_TABLE_FREE select MODULES_USE_ELF_REL if MODULES select MODULES_USE_ELF_RELA if MODULES && 64BIT select PERF_USE_VMALLOC diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h index 942af87f1cddb..9a7e5af16c00b 100644 --- a/arch/mips/include/asm/pgalloc.h +++ b/arch/mips/include/asm/pgalloc.h @@ -48,8 +48,7 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd) extern void pgd_init(void *addr); extern pgd_t *pgd_alloc(struct mm_struct *mm); -#define __pte_free_tlb(tlb, pte, address) \ - tlb_remove_ptdesc((tlb), page_ptdesc(pte)) +#define __pte_free_tlb(tlb, pte, address) tlb_remove_ptdesc((tlb), page_ptdesc(pte)) 
#ifndef __PAGETABLE_PMD_FOLDED @@ -72,7 +71,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address) return pmd; } -#define __pmd_free_tlb(tlb, x, addr) pmd_free((tlb)->mm, x) +#define __pmd_free_tlb(tlb, x, addr) tlb_remove_ptdesc((tlb), virt_to_ptdesc(x)) #endif @@ -98,7 +97,7 @@ static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud) set_p4d(p4d, __p4d((unsigned long)pud)); } -#define __pud_free_tlb(tlb, x, addr) pud_free((tlb)->mm, x) +#define __pud_free_tlb(tlb, x, addr) tlb_remove_ptdesc((tlb), virt_to_ptdesc(x)) #endif /* __PAGETABLE_PUD_FOLDED */ -- 2.20.1 From qi.zheng at linux.dev Tue Nov 18 23:31:22 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Wed, 19 Nov 2025 15:31:22 +0800 Subject: [PATCH v2 5/7] parisc: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: References: Message-ID: <74f0e72f11347656a9de0d4b9e2bccc17e4338a7.1763537007.git.zhengqi.arch@bytedance.com> From: Qi Zheng On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the PT_RECLAIM feature, which resolves this problem. Signed-off-by: Qi Zheng Cc: "James E.J. 
Bottomley" Cc: Helge Deller --- arch/parisc/Kconfig | 1 + arch/parisc/include/asm/tlb.h | 4 ++-- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig index 47fd9662d8005..62d5a89d5c7bc 100644 --- a/arch/parisc/Kconfig +++ b/arch/parisc/Kconfig @@ -79,6 +79,7 @@ config PARISC select GENERIC_CLOCKEVENTS select CPU_NO_EFFICIENT_FFS select THREAD_INFO_IN_TASK + select MMU_GATHER_RCU_TABLE_FREE select NEED_DMA_MAP_STATE select NEED_SG_DMA_LENGTH select HAVE_ARCH_KGDB diff --git a/arch/parisc/include/asm/tlb.h b/arch/parisc/include/asm/tlb.h index 44235f367674d..4501fee0a8fa4 100644 --- a/arch/parisc/include/asm/tlb.h +++ b/arch/parisc/include/asm/tlb.h @@ -5,8 +5,8 @@ #include #if CONFIG_PGTABLE_LEVELS == 3 -#define __pmd_free_tlb(tlb, pmd, addr) pmd_free((tlb)->mm, pmd) +#define __pmd_free_tlb(tlb, pmd, addr) tlb_remove_ptdesc((tlb), virt_to_ptdesc(pmd)) #endif -#define __pte_free_tlb(tlb, pte, addr) pte_free((tlb)->mm, pte) +#define __pte_free_tlb(tlb, pte, addr) tlb_remove_ptdesc((tlb), page_ptdesc(pte)) #endif -- 2.20.1 From qi.zheng at linux.dev Tue Nov 18 23:31:23 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Wed, 19 Nov 2025 15:31:23 +0800 Subject: [PATCH v2 6/7] um: mm: enable MMU_GATHER_RCU_TABLE_FREE In-Reply-To: References: Message-ID: <16ab9e6ce0febaf2fc383b7e09e3f1fb2ad63a40.1763537007.git.zhengqi.arch@bytedance.com> From: Qi Zheng On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of empty PTE page table pages (such as 100GB+). To resolve this problem, first enable MMU_GATHER_RCU_TABLE_FREE to prepare for enabling the PT_RECLAIM feature, which resolves this problem. 
Signed-off-by: Qi Zheng Cc: Richard Weinberger Cc: Anton Ivanov Cc: Johannes Berg --- arch/um/Kconfig | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/um/Kconfig b/arch/um/Kconfig index 097c6a6265ef3..47a41bc77bb24 100644 --- a/arch/um/Kconfig +++ b/arch/um/Kconfig @@ -41,6 +41,7 @@ config UML select HAVE_SYSCALL_TRACEPOINTS select THREAD_INFO_IN_TASK select SPARSE_IRQ + select MMU_GATHER_RCU_TABLE_FREE config MMU bool -- 2.20.1 From qi.zheng at linux.dev Tue Nov 18 23:31:24 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Wed, 19 Nov 2025 15:31:24 +0800 Subject: [PATCH v2 7/7] mm: enable PT_RECLAIM on all 64-bit architectures In-Reply-To: References: Message-ID: From: Qi Zheng Now, the MMU_GATHER_RCU_TABLE_FREE is enabled on all 64-bit architectures, so make PT_RECLAIM depend on 64BIT, thereby enabling PT_RECLAIM on all 64-bit architectures. Signed-off-by: Qi Zheng --- arch/x86/Kconfig | 1 - mm/Kconfig | 9 ++------- 2 files changed, 2 insertions(+), 8 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index eac2e86056902..96bff81fd4787 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -330,7 +330,6 @@ config X86 select FUNCTION_ALIGNMENT_4B imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE - select ARCH_SUPPORTS_PT_RECLAIM if X86_64 select ARCH_SUPPORTS_SCHED_SMT if SMP select SCHED_SMT if SMP select ARCH_SUPPORTS_SCHED_CLUSTER if SMP diff --git a/mm/Kconfig b/mm/Kconfig index d548976d0e0ad..94eec5c0cad96 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1448,14 +1448,9 @@ config ARCH_HAS_USER_SHADOW_STACK The architecture has hardware support for userspace shadow call stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss). 

-config ARCH_SUPPORTS_PT_RECLAIM
-    def_bool n
-
 config PT_RECLAIM
-    bool "reclaim empty user page table pages"
-    default y
-    depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
-    select MMU_GATHER_RCU_TABLE_FREE
+    def_bool y
+    depends on 64BIT
     help
       Try to reclaim empty user page table pages in paths other than munmap
       and exit_mmap path.
-- 
2.20.1

From david at kernel.org Wed Nov 19 02:13:37 2025
From: david at kernel.org (David Hildenbrand (Red Hat))
Date: Wed, 19 Nov 2025 11:13:37 +0100
Subject: [PATCH 0/7] enable PT_RECLAIM on all 64-bit architectures
In-Reply-To: 
References: <83e88171-54cb-4112-a344-f6a7d7f13784@kernel.org>
Message-ID: 

On 18.11.25 12:53, Qi Zheng wrote:
>
>
> On 11/18/25 12:53 AM, David Hildenbrand (Red Hat) wrote:
>> On 14.11.25 12:11, Qi Zheng wrote:
>>> From: Qi Zheng
>>>
>>> Hi all,
>>>
>>> This series aims to enable PT_RECLAIM on all 64-bit architectures.
>>>
>>> On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
>>> empty PTE
>>> page table pages (such as 100GB+). To resolve this problem, we need to
>>> enable
>>> PT_RECLAIM, which depends on MMU_GATHER_RCU_TABLE_FREE.
>>>
>>
>> Makes sense!
>>
>>> Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all
>>> 64-bit
>>> architectures, and finally makes PT_RECLAIM depend on
>>> MMU_GATHER_RCU_TABLE_FREE
>>> && 64BIT. This way, PT_RECLAIM can be enabled by default on all 64-bit
>>> architectures.
>>
>> Could we then even go ahead and stop making PT_RECLAIM user-selectable?
>
> OK, will change to:

That was more of a question: is there any scenario where we have run
into issues with it so far?

-- Cheers David From david at kernel.org Wed Nov 19 02:19:51 2025 From: david at kernel.org (David Hildenbrand (Red Hat)) Date: Wed, 19 Nov 2025 11:19:51 +0100 Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT In-Reply-To: <195baf7c-1f4e-46a4-a4aa-e68e7d00c0f9@linux.dev> References: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com> <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org> <195baf7c-1f4e-46a4-a4aa-e68e7d00c0f9@linux.dev> Message-ID: <9386032c-9840-49da-83f9-74b112f3e752@kernel.org> On 18.11.25 13:02, Qi Zheng wrote: > > > On 11/18/25 12:57 AM, David Hildenbrand (Red Hat) wrote: >> On 14.11.25 12:11, Qi Zheng wrote: >>> From: Qi Zheng >> >> Subject: s/&&/&/ > > will do. > >> >>> >>> Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that PT_RECLAIM >>> can >>> be enabled by default on all architectures that support >>> MMU_GATHER_RCU_TABLE_FREE. >>> >>> Considering that a large number of PTE page table pages (such as 100GB+) >>> can only be caused on a 64-bit system, let PT_RECLAIM also depend on >>> 64BIT. >>> >>> Signed-off-by: Qi Zheng >>> --- >>> ? arch/x86/Kconfig | 1 - >>> ? mm/Kconfig?????? | 6 +----- >>> ? 2 files changed, 1 insertion(+), 6 deletions(-) >>> >>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >>> index eac2e86056902..96bff81fd4787 100644 >>> --- a/arch/x86/Kconfig >>> +++ b/arch/x86/Kconfig >>> @@ -330,7 +330,6 @@ config X86 >>> ????? select FUNCTION_ALIGNMENT_4B >>> ????? imply IMA_SECURE_AND_OR_TRUSTED_BOOT??? if EFI >>> ????? select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE >>> -??? select ARCH_SUPPORTS_PT_RECLAIM??????? if X86_64 >>> ????? select ARCH_SUPPORTS_SCHED_SMT??????? if SMP >>> ????? select SCHED_SMT??????????? if SMP >>> ????? select ARCH_SUPPORTS_SCHED_CLUSTER??? 
if SMP >>> diff --git a/mm/Kconfig b/mm/Kconfig >>> index a5a90b169435d..e795fbd69e50c 100644 >>> --- a/mm/Kconfig >>> +++ b/mm/Kconfig >>> @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK >>> ??????? The architecture has hardware support for userspace shadow call >>> ??????????? stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss). >>> -config ARCH_SUPPORTS_PT_RECLAIM >>> -??? def_bool n >>> - >>> ? config PT_RECLAIM >>> ????? bool "reclaim empty user page table pages" >>> ????? default y >>> -??? depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP >>> -??? select MMU_GATHER_RCU_TABLE_FREE >>> +??? depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT >> >> Who would we have MMU_GATHER_RCU_TABLE_FREE without MMU? (can we drop >> the MMU part) > > OK. > >> >> Why do we care about SMP in the first place? (can we frop SMP) > > OK. > >> >> But I also wonder why we need "MMU_GATHER_RCU_TABLE_FREE && 64BIT": >> >> Would it be harmful on 32bit (sure, we might not reclaim as much, but >> still there is memory to be reclaimed?)? > > This is also fine on 32bit, but the benefits are not significant, So I > chose to enable it only on 64-bit. Right. Address space is smaller, but also memory is smaller. Not that I think we strictly *must* to support 32bit, I merely wonder why we wouldn't just enable it here. OTOH, if there is a good reason we cannot enable it, we can definitely just keep it 64bit only. > > I actually tried enabling MMU_GATHER_RCU_TABLE_FREE on all > architectures, and apart from sparc32 being a bit troublesome (because > it uses mm->page_table_lock for synchronization within > __pte_free_tlb()), the modifications were relatively simple. > >> >> If all 64BIT support MMU_GATHER_RCU_TABLE_FREE (as you previously >> state), why can't we only check for 64BIT? > > OK, will do. 
This was also more of a question for discussion:

Would it make sense to have

config PT_RECLAIM
    def_bool y
    depends on MMU_GATHER_RCU_TABLE_FREE

(a) Would we want to make it configurable (why?)
(b) Do we really care about SMP (why?)
(c) Do we want to limit to 64bit (why?)
(d) Do we really need the MMU check in addition to
    MMU_GATHER_RCU_TABLE_FREE

-- 
Cheers

David

From qi.zheng at linux.dev Wed Nov 19 02:37:47 2025
From: qi.zheng at linux.dev (Qi Zheng)
Date: Wed, 19 Nov 2025 18:37:47 +0800
Subject: [PATCH 0/7] enable PT_RECLAIM on all 64-bit architectures
In-Reply-To: 
References: <83e88171-54cb-4112-a344-f6a7d7f13784@kernel.org>
Message-ID: <9c884aeb-c1ec-4fe0-8495-639344633569@linux.dev>

On 11/19/25 6:13 PM, David Hildenbrand (Red Hat) wrote:
> On 18.11.25 12:53, Qi Zheng wrote:
>>
>>
>> On 11/18/25 12:53 AM, David Hildenbrand (Red Hat) wrote:
>>> On 14.11.25 12:11, Qi Zheng wrote:
>>>> From: Qi Zheng
>>>>
>>>> Hi all,
>>>>
>>>> This series aims to enable PT_RECLAIM on all 64-bit architectures.
>>>>
>>>> On a 64-bit system, madvise(MADV_DONTNEED) may cause a large number of
>>>> empty PTE
>>>> page table pages (such as 100GB+). To resolve this problem, we need to
>>>> enable
>>>> PT_RECLAIM, which depends on MMU_GATHER_RCU_TABLE_FREE.
>>>>
>>>
>>> Makes sense!
>>>
>>>> Therefore, this series first enables MMU_GATHER_RCU_TABLE_FREE on all
>>>> 64-bit
>>>> architectures, and finally makes PT_RECLAIM depend on
>>>> MMU_GATHER_RCU_TABLE_FREE
>>>> && 64BIT. This way, PT_RECLAIM can be enabled by default on all 64-bit
>>>> architectures.
>>>
>>> Could we then even go ahead and stop making PT_RECLAIM user-selectable?
>>
>> OK, will change to:
>
> Was more of a question: is there any scenario where we ran so far into
> issues with it?

No, I haven't received any reports of related issues, either within the
company or in the community.

> From qi.zheng at linux.dev Wed Nov 19 03:02:01 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Wed, 19 Nov 2025 19:02:01 +0800 Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT In-Reply-To: <9386032c-9840-49da-83f9-74b112f3e752@kernel.org> References: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com> <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org> <195baf7c-1f4e-46a4-a4aa-e68e7d00c0f9@linux.dev> <9386032c-9840-49da-83f9-74b112f3e752@kernel.org> Message-ID: <956c7ca1-bce8-4eed-8a86-bc8adfc708b8@linux.dev> Hi David, On 11/19/25 6:19 PM, David Hildenbrand (Red Hat) wrote: > On 18.11.25 13:02, Qi Zheng wrote: >> >> >> On 11/18/25 12:57 AM, David Hildenbrand (Red Hat) wrote: >>> On 14.11.25 12:11, Qi Zheng wrote: >>>> From: Qi Zheng >>> >>> Subject: s/&&/&/ >> >> will do. >> >>> >>>> >>>> Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that PT_RECLAIM >>>> can >>>> be enabled by default on all architectures that support >>>> MMU_GATHER_RCU_TABLE_FREE. >>>> >>>> Considering that a large number of PTE page table pages (such as >>>> 100GB+) >>>> can only be caused on a 64-bit system, let PT_RECLAIM also depend on >>>> 64BIT. >>>> >>>> Signed-off-by: Qi Zheng >>>> --- >>>> ?? arch/x86/Kconfig | 1 - >>>> ?? mm/Kconfig?????? | 6 +----- >>>> ?? 2 files changed, 1 insertion(+), 6 deletions(-) >>>> >>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >>>> index eac2e86056902..96bff81fd4787 100644 >>>> --- a/arch/x86/Kconfig >>>> +++ b/arch/x86/Kconfig >>>> @@ -330,7 +330,6 @@ config X86 >>>> ?????? select FUNCTION_ALIGNMENT_4B >>>> ?????? imply IMA_SECURE_AND_OR_TRUSTED_BOOT??? if EFI >>>> ?????? select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE >>>> -??? select ARCH_SUPPORTS_PT_RECLAIM??????? if X86_64 >>>> ?????? select ARCH_SUPPORTS_SCHED_SMT??????? if SMP >>>> ?????? select SCHED_SMT??????????? if SMP >>>> ?????? select ARCH_SUPPORTS_SCHED_CLUSTER??? 
if SMP >>>> diff --git a/mm/Kconfig b/mm/Kconfig >>>> index a5a90b169435d..e795fbd69e50c 100644 >>>> --- a/mm/Kconfig >>>> +++ b/mm/Kconfig >>>> @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK >>>> ???????? The architecture has hardware support for userspace shadow >>>> call >>>> ???????????? stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss). >>>> -config ARCH_SUPPORTS_PT_RECLAIM >>>> -??? def_bool n >>>> - >>>> ?? config PT_RECLAIM >>>> ?????? bool "reclaim empty user page table pages" >>>> ?????? default y >>>> -??? depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP >>>> -??? select MMU_GATHER_RCU_TABLE_FREE >>>> +??? depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT >>> >>> Who would we have MMU_GATHER_RCU_TABLE_FREE without MMU? (can we drop >>> the MMU part) >> >> OK. >> >>> >>> Why do we care about SMP in the first place? (can we frop SMP) >> >> OK. >> >>> >>> But I also wonder why we need "MMU_GATHER_RCU_TABLE_FREE && 64BIT": >>> >>> Would it be harmful on 32bit (sure, we might not reclaim as much, but >>> still there is memory to be reclaimed?)? >> >> This is also fine on 32bit, but the benefits are not significant, So I >> chose to enable it only on 64-bit. > > Right. Address space is smaller, but also memory is smaller. Not that I > think we strictly *must* to support 32bit, I merely wonder why we > wouldn't just enable it here. > > OTOH, if there is a good reason we cannot enable it, we can definitely > just keep it 64bit only. The only difficulty is this: > >> >> I actually tried enabling MMU_GATHER_RCU_TABLE_FREE on all >> architectures, and apart from sparc32 being a bit troublesome (because >> it uses mm->page_table_lock for synchronization within >> __pte_free_tlb()), the modifications were relatively simple. 
in sparc32:

void pte_free(struct mm_struct *mm, pgtable_t ptep)
{
        struct page *page;

        page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
        spin_lock(&mm->page_table_lock);
        if (page_ref_dec_return(page) == 1)
                pagetable_dtor(page_ptdesc(page));
        spin_unlock(&mm->page_table_lock);

        srmmu_free_nocache(ptep, SRMMU_PTE_TABLE_SIZE);
}

#define __pte_free_tlb(tlb, pte, addr) pte_free((tlb)->mm, pte)

To enable MMU_GATHER_RCU_TABLE_FREE on sparc32, we need to implement
__tlb_remove_table() and call the pte_free() above from it.

However, __tlb_remove_table() does not have an mm parameter:

void __tlb_remove_table(void *_table)

so we need to use another lock instead of mm->page_table_lock.

I have already sent the v2 [1], and perhaps after that I can enable
PT_RECLAIM on all 32-bit architectures as well.

[1]. https://lore.kernel.org/all/cover.1763537007.git.zhengqi.arch at bytedance.com/

>>
>>>
>>> If all 64BIT support MMU_GATHER_RCU_TABLE_FREE (as you previously
>>> state), why can't we only check for 64BIT?
>>
>> OK, will do.
>
> This was also more of a question for discussion:
>
> Would it make sense to have
>
> config PT_RECLAIM
>     def_bool y
>     depends on MMU_GATHER_RCU_TABLE_FREE

Makes sense.

>
> (a) Would we want to make it configurable (why?)

No, it was just out of caution before.

> (b) Do we really care about SMP (why?)

No. Simply because the following situation cannot occur:

pte_offset_map
traversing the PTE page table

call madvise(MADV_DONTNEED)

so there's no need to free the PTE page via RCU.

> (c) Do we want to limit to 64bit (why?)

No, it is just that the benefit is greater on 64-bit.

> (d) Do we really need the MMU check in addition to
>     MMU_GATHER_RCU_TABLE_FREE

No, I was worried about compilation issues before, but now it seems
that my worries were unnecessary.

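The sparc32 constraint described above can be modelled in plain userspace C. This is only a sketch under stated assumptions: every toy_* name is hypothetical, the spinlock is a C11 atomic_flag rather than a kernel spinlock, and the single global lock is just one possible replacement for mm->page_table_lock, not something the series implements.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy spinlock built on C11 atomic_flag; stands in for a kernel spinlock. */
struct toy_spinlock {
	atomic_flag locked;
};

static void toy_spin_lock(struct toy_spinlock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->locked,
						 memory_order_acquire))
		; /* spin until the holder clears the flag */
}

static void toy_spin_unlock(struct toy_spinlock *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* Hypothetical stand-ins for mm_struct and a page holding PTE tables. */
struct toy_mm {
	struct toy_spinlock page_table_lock;
};

struct toy_table {
	int refcount;	/* models the page reference count */
	int dtor_ran;	/* set once the "destructor" has run */
};

/* Shared body: drop one reference; mirrors the "== 1" check quoted above. */
static void toy_put_table(struct toy_table *t)
{
	assert(t != NULL && t->refcount > 1);
	if (--t->refcount == 1)
		t->dtor_ran = 1;
}

/* Direct path: the caller passes an mm, so the per-mm lock is reachable. */
void toy_pte_free(struct toy_mm *mm, struct toy_table *t)
{
	toy_spin_lock(&mm->page_table_lock);
	toy_put_table(t);
	toy_spin_unlock(&mm->page_table_lock);
}

/* Assumed replacement lock: global, because the callback below gets no mm. */
static struct toy_spinlock toy_table_lock = { ATOMIC_FLAG_INIT };

/* Deferred path: same shape as void __tlb_remove_table(void *_table). */
void toy_remove_table(void *_table)
{
	struct toy_table *t = _table;

	toy_spin_lock(&toy_table_lock);
	toy_put_table(t);
	toy_spin_unlock(&toy_table_lock);
}
```

The point is only the signature mismatch: toy_pte_free() can reach a per-mm lock through its mm argument, while toy_remove_table() receives nothing but the table pointer, so whatever lock it takes has to be reachable without an mm.
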
Thanks, Qi > > From david at kernel.org Wed Nov 19 03:35:10 2025 From: david at kernel.org (David Hildenbrand (Red Hat)) Date: Wed, 19 Nov 2025 12:35:10 +0100 Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT In-Reply-To: <956c7ca1-bce8-4eed-8a86-bc8adfc708b8@linux.dev> References: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com> <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org> <195baf7c-1f4e-46a4-a4aa-e68e7d00c0f9@linux.dev> <9386032c-9840-49da-83f9-74b112f3e752@kernel.org> <956c7ca1-bce8-4eed-8a86-bc8adfc708b8@linux.dev> Message-ID: <6a22ff95-28c1-4c1d-a1a8-6a391bcc8c86@kernel.org> On 19.11.25 12:02, Qi Zheng wrote: > Hi David, > > On 11/19/25 6:19 PM, David Hildenbrand (Red Hat) wrote: >> On 18.11.25 13:02, Qi Zheng wrote: >>> >>> >>> On 11/18/25 12:57 AM, David Hildenbrand (Red Hat) wrote: >>>> On 14.11.25 12:11, Qi Zheng wrote: >>>>> From: Qi Zheng >>>> >>>> Subject: s/&&/&/ >>> >>> will do. >>> >>>> >>>>> >>>>> Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that PT_RECLAIM >>>>> can >>>>> be enabled by default on all architectures that support >>>>> MMU_GATHER_RCU_TABLE_FREE. >>>>> >>>>> Considering that a large number of PTE page table pages (such as >>>>> 100GB+) >>>>> can only be caused on a 64-bit system, let PT_RECLAIM also depend on >>>>> 64BIT. >>>>> >>>>> Signed-off-by: Qi Zheng >>>>> --- >>>>> ?? arch/x86/Kconfig | 1 - >>>>> ?? mm/Kconfig?????? | 6 +----- >>>>> ?? 2 files changed, 1 insertion(+), 6 deletions(-) >>>>> >>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >>>>> index eac2e86056902..96bff81fd4787 100644 >>>>> --- a/arch/x86/Kconfig >>>>> +++ b/arch/x86/Kconfig >>>>> @@ -330,7 +330,6 @@ config X86 >>>>> ?????? select FUNCTION_ALIGNMENT_4B >>>>> ?????? imply IMA_SECURE_AND_OR_TRUSTED_BOOT??? if EFI >>>>> ?????? select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE >>>>> -??? select ARCH_SUPPORTS_PT_RECLAIM??????? if X86_64 >>>>> ?????? 
select ARCH_SUPPORTS_SCHED_SMT??????? if SMP >>>>> ?????? select SCHED_SMT??????????? if SMP >>>>> ?????? select ARCH_SUPPORTS_SCHED_CLUSTER??? if SMP >>>>> diff --git a/mm/Kconfig b/mm/Kconfig >>>>> index a5a90b169435d..e795fbd69e50c 100644 >>>>> --- a/mm/Kconfig >>>>> +++ b/mm/Kconfig >>>>> @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK >>>>> ???????? The architecture has hardware support for userspace shadow >>>>> call >>>>> ???????????? stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss). >>>>> -config ARCH_SUPPORTS_PT_RECLAIM >>>>> -??? def_bool n >>>>> - >>>>> ?? config PT_RECLAIM >>>>> ?????? bool "reclaim empty user page table pages" >>>>> ?????? default y >>>>> -??? depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP >>>>> -??? select MMU_GATHER_RCU_TABLE_FREE >>>>> +??? depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT >>>> >>>> Who would we have MMU_GATHER_RCU_TABLE_FREE without MMU? (can we drop >>>> the MMU part) >>> >>> OK. >>> >>>> >>>> Why do we care about SMP in the first place? (can we frop SMP) >>> >>> OK. >>> >>>> >>>> But I also wonder why we need "MMU_GATHER_RCU_TABLE_FREE && 64BIT": >>>> >>>> Would it be harmful on 32bit (sure, we might not reclaim as much, but >>>> still there is memory to be reclaimed?)? >>> >>> This is also fine on 32bit, but the benefits are not significant, So I >>> chose to enable it only on 64-bit. >> >> Right. Address space is smaller, but also memory is smaller. Not that I >> think we strictly *must* to support 32bit, I merely wonder why we >> wouldn't just enable it here. >> >> OTOH, if there is a good reason we cannot enable it, we can definitely >> just keep it 64bit only. > > The only difficulty is this: > >> >>> >>> I actually tried enabling MMU_GATHER_RCU_TABLE_FREE on all >>> architectures, and apart from sparc32 being a bit troublesome (because >>> it uses mm->page_table_lock for synchronization within >>> __pte_free_tlb()), the modifications were relatively simple. 
> > in sparc32: > > void pte_free(struct mm_struct *mm, pgtable_t ptep) > { > struct page *page; > > page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> > PAGE_SHIFT); > spin_lock(&mm->page_table_lock); > if (page_ref_dec_return(page) == 1) > pagetable_dtor(page_ptdesc(page)); > spin_unlock(&mm->page_table_lock); > > srmmu_free_nocache(ptep, SRMMU_PTE_TABLE_SIZE); > } > > #define __pte_free_tlb(tlb, pte, addr) pte_free((tlb)->mm, pte) > > To enable MMU_GATHER_RCU_TABLE_FREE on sparc32, we need to implement > __tlb_remove_table(), and call the pte_free() above in __tlb_remove_table(). > > However, the __tlb_remove_table() does not have an mm parameter: > > void __tlb_remove_table(void *_table) > > so we need to use another lock instead of mm->page_table_lock. > > I have already sent the v2 [1], and perhaps after that I can enable > PT_RECLAIM on all 32-bit architectures as well. > I guess if we just make it depend on MMU_GATHER_RCU_TABLE_FREE that will be fine. > [1]. > https://lore.kernel.org/all/cover.1763537007.git.zhengqi.arch at bytedance.com/ > >>> >>>> >>>> If all 64BIT support MMU_GATHER_RCU_TABLE_FREE (as you previously >>>> state), why can't we only check for 64BIT? >>> >>> OK, will do. >> >> This was also more of a question for discussion: >> >> Would it make sense to have >> >> config PT_RECLAIM >> ????def_bool y >> ????depends on MMU_GATHER_RCU_TABLE_FREE > > make sense. > >> >> (a) Would we want to make it configurable (why?) > > No, it was just out of caution before. > >> (b) Do we really care about SMP (why?) > > No. Simply because the following situation is impossible to occur: > > pte_offset_map > traversing the PTE page table > > > > call madvise(MADV_DONTNEED) > > so there's no need to free PTE page via RCU. > >> (c) Do we want to limit to 64bit (why?) > > No, just because the profit is greater at 64-BIT. I was briefly wondering if on 32bit (but maybe also on 64bit with configurable user page table levels?) 
we could have the scenario that we only have two page table levels. So reclaiming the PMD level (corresponding to the highest level) would be impossible. But for that to happen one would have to discard the whole address range through MADV_DONTNEED (impossible I guess) :) -- Cheers David From david at kernel.org Wed Nov 19 03:38:46 2025 From: david at kernel.org (David Hildenbrand (Red Hat)) Date: Wed, 19 Nov 2025 12:38:46 +0100 Subject: [PATCH v2 7/7] mm: enable PT_RECLAIM on all 64-bit architectures In-Reply-To: References: Message-ID: <9b55623a-4606-4610-a0fe-55b8cd6b95e7@kernel.org> On 19.11.25 08:31, Qi Zheng wrote: > From: Qi Zheng > > Now, the MMU_GATHER_RCU_TABLE_FREE is enabled on all 64-bit architectures, > so make PT_RECLAIM depend on 64BIT, thereby enabling PT_RECLAIM on all > 64-bit architectures. > > Signed-off-by: Qi Zheng > --- > arch/x86/Kconfig | 1 - > mm/Kconfig | 9 ++------- > 2 files changed, 2 insertions(+), 8 deletions(-) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index eac2e86056902..96bff81fd4787 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -330,7 +330,6 @@ config X86 > select FUNCTION_ALIGNMENT_4B > imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI > select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE > - select ARCH_SUPPORTS_PT_RECLAIM if X86_64 > select ARCH_SUPPORTS_SCHED_SMT if SMP > select SCHED_SMT if SMP > select ARCH_SUPPORTS_SCHED_CLUSTER if SMP > diff --git a/mm/Kconfig b/mm/Kconfig > index d548976d0e0ad..94eec5c0cad96 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -1448,14 +1448,9 @@ config ARCH_HAS_USER_SHADOW_STACK > The architecture has hardware support for userspace shadow call > stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss). 
> > -config ARCH_SUPPORTS_PT_RECLAIM > - def_bool n > - > config PT_RECLAIM > - bool "reclaim empty user page table pages" > - default y > - depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP > - select MMU_GATHER_RCU_TABLE_FREE > + def_bool y > + depends on 64BIT As discussed in the other thread, likely config PT_RECLAIM def_bool y depends on MMU_GATHER_RCU_TABLE_FREE && 64BIT Could be nice, and if possible even dropping the 64BIT limitation as well if there is no need to. -- Cheers David From david at kernel.org Wed Nov 19 03:41:22 2025 From: david at kernel.org (David Hildenbrand (Red Hat)) Date: Wed, 19 Nov 2025 12:41:22 +0100 Subject: [PATCH v2 1/7] mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h In-Reply-To: References: Message-ID: On 19.11.25 08:31, Qi Zheng wrote: > From: Qi Zheng > > Generally, the asm/tlb.h will include asm-generic/tlb.h, so change > mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h. This can > also fix compilation errors on some architecture when CONFIG_PT_RECLAIM > is enabled (such as alpha). "This is a preparation for enabling CONFIG_PT_RECLAIM on other architectures, such as alpha." > > Signed-off-by: Qi Zheng > --- > mm/pt_reclaim.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c > index 0d9cfbf4fe5d8..46771cfff8239 100644 > --- a/mm/pt_reclaim.c > +++ b/mm/pt_reclaim.c > @@ -2,7 +2,7 @@ > #include > #include > > -#include > +#include > > #include "internal.h" > Right, we're using pte_free_tlb(), and the default lives in include/asm-generic/tlb.h. 
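[Editor's illustration] The point about the default living in the generic header can be sketched in plain C as follows. This is only a single-file model of the "architecture may define it first, generic header fills in the default" pattern; the names demo_pte_free_tlb and DEMO_ARCH_OVERRIDE are assumptions made up for this sketch and are not kernel symbols.

```c
#include <assert.h>

/*
 * Stand-in for an architecture header: an architecture that needs special
 * handling would define the hook here, before the generic part is seen.
 * DEMO_ARCH_OVERRIDE is a hypothetical switch used only by this sketch.
 */
#ifdef DEMO_ARCH_OVERRIDE
#define demo_pte_free_tlb(freed) ((freed) + 100)
#endif

/*
 * Stand-in for the generic header: supply the default only when the
 * architecture part above did not provide its own definition.
 */
#ifndef demo_pte_free_tlb
#define demo_pte_free_tlb(freed) ((freed) + 1)
#endif

/* A caller that only uses the name and does not care who defined it. */
int demo_free_one(int freed)
{
	return demo_pte_free_tlb(freed);
}

/* Without DEMO_ARCH_OVERRIDE the generic default is the one in effect. */
void demo_tlb_self_check(void)
{
	assert(demo_free_one(0) == 1);
}
```

Building the same file with -DDEMO_ARCH_OVERRIDE would select the other definition, which is roughly why a caller is better off including the architecture header than the generic one directly.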
Acked-by: David Hildenbrand (Red Hat) -- Cheers David From qi.zheng at linux.dev Wed Nov 19 04:13:10 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Wed, 19 Nov 2025 20:13:10 +0800 Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT In-Reply-To: <6a22ff95-28c1-4c1d-a1a8-6a391bcc8c86@kernel.org> References: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com> <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org> <195baf7c-1f4e-46a4-a4aa-e68e7d00c0f9@linux.dev> <9386032c-9840-49da-83f9-74b112f3e752@kernel.org> <956c7ca1-bce8-4eed-8a86-bc8adfc708b8@linux.dev> <6a22ff95-28c1-4c1d-a1a8-6a391bcc8c86@kernel.org> Message-ID: <479b0409-335f-4450-8eb2-5270a5847f5e@linux.dev> On 11/19/25 7:35 PM, David Hildenbrand (Red Hat) wrote: > On 19.11.25 12:02, Qi Zheng wrote: >> Hi David, >> >> On 11/19/25 6:19 PM, David Hildenbrand (Red Hat) wrote: >>> On 18.11.25 13:02, Qi Zheng wrote: >>>> >>>> >>>> On 11/18/25 12:57 AM, David Hildenbrand (Red Hat) wrote: >>>>> On 14.11.25 12:11, Qi Zheng wrote: >>>>>> From: Qi Zheng >>>>> >>>>> Subject: s/&&/&/ >>>> >>>> will do. >>>> >>>>> >>>>>> >>>>>> Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that >>>>>> PT_RECLAIM >>>>>> can >>>>>> be enabled by default on all architectures that support >>>>>> MMU_GATHER_RCU_TABLE_FREE. >>>>>> >>>>>> Considering that a large number of PTE page table pages (such as >>>>>> 100GB+) >>>>>> can only be caused on a 64-bit system, let PT_RECLAIM also depend on >>>>>> 64BIT. >>>>>> >>>>>> Signed-off-by: Qi Zheng >>>>>> --- >>>>>> ??? arch/x86/Kconfig | 1 - >>>>>> ??? mm/Kconfig?????? | 6 +----- >>>>>> ??? 2 files changed, 1 insertion(+), 6 deletions(-) >>>>>> >>>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >>>>>> index eac2e86056902..96bff81fd4787 100644 >>>>>> --- a/arch/x86/Kconfig >>>>>> +++ b/arch/x86/Kconfig >>>>>> @@ -330,7 +330,6 @@ config X86 >>>>>> ??????? select FUNCTION_ALIGNMENT_4B >>>>>> ??????? 
imply IMA_SECURE_AND_OR_TRUSTED_BOOT??? if EFI >>>>>> ??????? select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE >>>>>> -??? select ARCH_SUPPORTS_PT_RECLAIM??????? if X86_64 >>>>>> ??????? select ARCH_SUPPORTS_SCHED_SMT??????? if SMP >>>>>> ??????? select SCHED_SMT??????????? if SMP >>>>>> ??????? select ARCH_SUPPORTS_SCHED_CLUSTER??? if SMP >>>>>> diff --git a/mm/Kconfig b/mm/Kconfig >>>>>> index a5a90b169435d..e795fbd69e50c 100644 >>>>>> --- a/mm/Kconfig >>>>>> +++ b/mm/Kconfig >>>>>> @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK >>>>>> ????????? The architecture has hardware support for userspace shadow >>>>>> call >>>>>> ????????????? stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss). >>>>>> -config ARCH_SUPPORTS_PT_RECLAIM >>>>>> -??? def_bool n >>>>>> - >>>>>> ??? config PT_RECLAIM >>>>>> ??????? bool "reclaim empty user page table pages" >>>>>> ??????? default y >>>>>> -??? depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP >>>>>> -??? select MMU_GATHER_RCU_TABLE_FREE >>>>>> +??? depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT >>>>> >>>>> Who would we have MMU_GATHER_RCU_TABLE_FREE without MMU? (can we drop >>>>> the MMU part) >>>> >>>> OK. >>>> >>>>> >>>>> Why do we care about SMP in the first place? (can we frop SMP) >>>> >>>> OK. >>>> >>>>> >>>>> But I also wonder why we need "MMU_GATHER_RCU_TABLE_FREE && 64BIT": >>>>> >>>>> Would it be harmful on 32bit (sure, we might not reclaim as much, but >>>>> still there is memory to be reclaimed?)? >>>> >>>> This is also fine on 32bit, but the benefits are not significant, So I >>>> chose to enable it only on 64-bit. >>> >>> Right. Address space is smaller, but also memory is smaller. Not that I >>> think we strictly *must* to support 32bit, I merely wonder why we >>> wouldn't just enable it here. >>> >>> OTOH, if there is a good reason we cannot enable it, we can definitely >>> just keep it 64bit only. 
>> >> The only difficulty is this: >> >>> >>>> >>>> I actually tried enabling MMU_GATHER_RCU_TABLE_FREE on all >>>> architectures, and apart from sparc32 being a bit troublesome (because >>>> it uses mm->page_table_lock for synchronization within >>>> __pte_free_tlb()), the modifications were relatively simple. >> >> in sparc32: >> >> void pte_free(struct mm_struct *mm, pgtable_t ptep) >> { >> ????????? struct page *page; >> >> ????????? page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> >> PAGE_SHIFT); >> ????????? spin_lock(&mm->page_table_lock); >> ????????? if (page_ref_dec_return(page) == 1) >> ????????????????? pagetable_dtor(page_ptdesc(page)); >> ????????? spin_unlock(&mm->page_table_lock); >> >> ????????? srmmu_free_nocache(ptep, SRMMU_PTE_TABLE_SIZE); >> } >> >> #define __pte_free_tlb(tlb, pte, addr)? pte_free((tlb)->mm, pte) >> >> To enable MMU_GATHER_RCU_TABLE_FREE on sparc32, we need to implement >> __tlb_remove_table(), and call the pte_free() above in >> __tlb_remove_table(). >> >> However, the __tlb_remove_table() does not have an mm parameter: >> >> void __tlb_remove_table(void *_table) >> >> so we need to use another lock instead of mm->page_table_lock. >> >> I have already sent the v2 [1], and perhaps after that I can enable >> PT_RECLAIM on all 32-bit architectures as well. >> > > I guess if we just make it depend on MMU_GATHER_RCU_TABLE_FREE that will > be fine. > >> [1]. >> https://lore.kernel.org/all/ >> cover.1763537007.git.zhengqi.arch at bytedance.com/ >> >>>> >>>>> >>>>> If all 64BIT support MMU_GATHER_RCU_TABLE_FREE (as you previously >>>>> state), why can't we only check for 64BIT? >>>> >>>> OK, will do. >>> >>> This was also more of a question for discussion: >>> >>> Would it make sense to have >>> >>> config PT_RECLAIM >>> ? ????def_bool y >>> ? ????depends on MMU_GATHER_RCU_TABLE_FREE >> >> make sense. >> >>> >>> (a) Would we want to make it configurable (why?) >> >> No, it was just out of caution before. 
>> >>> (b) Do we really care about SMP (why?) >> >> No. Simply because the following situation is impossible to occur: >> >> pte_offset_map >> traversing the PTE page table >> >> >> >> call madvise(MADV_DONTNEED) >> >> so there's no need to free PTE page via RCU. >> >>> (c) Do we want to limit to 64bit (why?) >> >> No, just because the profit is greater at 64-BIT. > > I was briefly wondering if on 32bit (but maybe also on 64bit with > configurable user page table levels?) we could have the scenario that we > only have two page table levels. > > So reclaiming the PMD level (corresponding to the highest level) would reclaiming the PMD level? The PT_RECLAIM only reclaim PTE pages, not PMD pages, am I misunderstanding something? > be impossible. But for that to happen one would have to discard the > whole address range through MADV_DONTNEED (impossible I guess) :) > From qi.zheng at linux.dev Wed Nov 19 04:15:58 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Wed, 19 Nov 2025 20:15:58 +0800 Subject: [PATCH v2 7/7] mm: enable PT_RECLAIM on all 64-bit architectures In-Reply-To: <9b55623a-4606-4610-a0fe-55b8cd6b95e7@kernel.org> References: <9b55623a-4606-4610-a0fe-55b8cd6b95e7@kernel.org> Message-ID: <6e6d8390-1f9e-40cf-949d-168160fa9a15@linux.dev> On 11/19/25 7:38 PM, David Hildenbrand (Red Hat) wrote: > On 19.11.25 08:31, Qi Zheng wrote: >> From: Qi Zheng >> >> Now, the MMU_GATHER_RCU_TABLE_FREE is enabled on all 64-bit >> architectures, >> so make PT_RECLAIM depend on 64BIT, thereby enabling PT_RECLAIM on all >> 64-bit architectures. >> >> Signed-off-by: Qi Zheng >> --- >> ? arch/x86/Kconfig | 1 - >> ? mm/Kconfig?????? | 9 ++------- >> ? 2 files changed, 2 insertions(+), 8 deletions(-) >> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> index eac2e86056902..96bff81fd4787 100644 >> --- a/arch/x86/Kconfig >> +++ b/arch/x86/Kconfig >> @@ -330,7 +330,6 @@ config X86 >> ????? select FUNCTION_ALIGNMENT_4B >> ????? imply IMA_SECURE_AND_OR_TRUSTED_BOOT??? 
if EFI >> select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE >> - select ARCH_SUPPORTS_PT_RECLAIM if X86_64 >> select ARCH_SUPPORTS_SCHED_SMT if SMP >> select SCHED_SMT if SMP >> select ARCH_SUPPORTS_SCHED_CLUSTER if SMP >> diff --git a/mm/Kconfig b/mm/Kconfig >> index d548976d0e0ad..94eec5c0cad96 100644 >> --- a/mm/Kconfig >> +++ b/mm/Kconfig >> @@ -1448,14 +1448,9 @@ config ARCH_HAS_USER_SHADOW_STACK >> The architecture has hardware support for userspace shadow call >> stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss). >> -config ARCH_SUPPORTS_PT_RECLAIM >> - def_bool n >> - >> config PT_RECLAIM >> - bool "reclaim empty user page table pages" >> - default y >> - depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP >> - select MMU_GATHER_RCU_TABLE_FREE >> + def_bool y >> + depends on 64BIT > > As discussed in the other thread, likely > > config PT_RECLAIM > def_bool y > depends on MMU_GATHER_RCU_TABLE_FREE && 64BIT > > Could be nice, and if possible even dropping the 64BIT limitation as > well if there is no need to. I think it's ok to drop the 64BIT limitation. There should be some 32-bit architectures that already enable MMU_GATHER_RCU_TABLE_FREE. > > From qi.zheng at linux.dev Wed Nov 19 04:17:31 2025 From: qi.zheng at linux.dev (Qi Zheng) Date: Wed, 19 Nov 2025 20:17:31 +0800 Subject: [PATCH v2 1/7] mm: change mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h In-Reply-To: References: Message-ID: <939e3496-5012-4e7d-8a33-e9de4354d4fd@linux.dev> On 11/19/25 7:41 PM, David Hildenbrand (Red Hat) wrote: > On 19.11.25 08:31, Qi Zheng wrote: >> From: Qi Zheng >> >> Generally, the asm/tlb.h will include asm-generic/tlb.h, so change >> mm/pt_reclaim.c to use asm/tlb.h instead of asm-generic/tlb.h. This can >> also fix compilation errors on some architecture when CONFIG_PT_RECLAIM >> is enabled (such as alpha).
> > "This is a preparation for enabling CONFIG_PT_RECLAIM on other > architectures, such as alpha." OK, will modify it in the next version. > >> >> Signed-off-by: Qi Zheng >> --- >> ? mm/pt_reclaim.c | 2 +- >> ? 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c >> index 0d9cfbf4fe5d8..46771cfff8239 100644 >> --- a/mm/pt_reclaim.c >> +++ b/mm/pt_reclaim.c >> @@ -2,7 +2,7 @@ >> ? #include >> ? #include >> -#include >> +#include >> ? #include "internal.h" > > Right, we're using pte_free_tlb(), and the default lives in include/asm- > generic/tlb.h. > > Acked-by: David Hildenbrand (Red Hat) Thanks! > From david at kernel.org Wed Nov 19 04:24:35 2025 From: david at kernel.org (David Hildenbrand (Red Hat)) Date: Wed, 19 Nov 2025 13:24:35 +0100 Subject: [PATCH 7/7] mm: make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE && 64BIT In-Reply-To: <479b0409-335f-4450-8eb2-5270a5847f5e@linux.dev> References: <0a4d1e6f0bf299cafd1fc624f965bd1ca542cea8.1763117269.git.zhengqi.arch@bytedance.com> <355d3bf3-c6bc-403e-9f19-89259d868611@kernel.org> <195baf7c-1f4e-46a4-a4aa-e68e7d00c0f9@linux.dev> <9386032c-9840-49da-83f9-74b112f3e752@kernel.org> <956c7ca1-bce8-4eed-8a86-bc8adfc708b8@linux.dev> <6a22ff95-28c1-4c1d-a1a8-6a391bcc8c86@kernel.org> <479b0409-335f-4450-8eb2-5270a5847f5e@linux.dev> Message-ID: <7160b6ec-4da5-4273-be91-1339bd00d009@kernel.org> On 19.11.25 13:13, Qi Zheng wrote: > > > On 11/19/25 7:35 PM, David Hildenbrand (Red Hat) wrote: >> On 19.11.25 12:02, Qi Zheng wrote: >>> Hi David, >>> >>> On 11/19/25 6:19 PM, David Hildenbrand (Red Hat) wrote: >>>> On 18.11.25 13:02, Qi Zheng wrote: >>>>> >>>>> >>>>> On 11/18/25 12:57 AM, David Hildenbrand (Red Hat) wrote: >>>>>> On 14.11.25 12:11, Qi Zheng wrote: >>>>>>> From: Qi Zheng >>>>>> >>>>>> Subject: s/&&/&/ >>>>> >>>>> will do. 
>>>>> >>>>>> >>>>>>> >>>>>>> Make PT_RECLAIM depend on MMU_GATHER_RCU_TABLE_FREE so that >>>>>>> PT_RECLAIM >>>>>>> can >>>>>>> be enabled by default on all architectures that support >>>>>>> MMU_GATHER_RCU_TABLE_FREE. >>>>>>> >>>>>>> Considering that a large number of PTE page table pages (such as >>>>>>> 100GB+) >>>>>>> can only be caused on a 64-bit system, let PT_RECLAIM also depend on >>>>>>> 64BIT. >>>>>>> >>>>>>> Signed-off-by: Qi Zheng >>>>>>> --- >>>>>>> ??? arch/x86/Kconfig | 1 - >>>>>>> ??? mm/Kconfig?????? | 6 +----- >>>>>>> ??? 2 files changed, 1 insertion(+), 6 deletions(-) >>>>>>> >>>>>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >>>>>>> index eac2e86056902..96bff81fd4787 100644 >>>>>>> --- a/arch/x86/Kconfig >>>>>>> +++ b/arch/x86/Kconfig >>>>>>> @@ -330,7 +330,6 @@ config X86 >>>>>>> ??????? select FUNCTION_ALIGNMENT_4B >>>>>>> ??????? imply IMA_SECURE_AND_OR_TRUSTED_BOOT??? if EFI >>>>>>> ??????? select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE >>>>>>> -??? select ARCH_SUPPORTS_PT_RECLAIM??????? if X86_64 >>>>>>> ??????? select ARCH_SUPPORTS_SCHED_SMT??????? if SMP >>>>>>> ??????? select SCHED_SMT??????????? if SMP >>>>>>> ??????? select ARCH_SUPPORTS_SCHED_CLUSTER??? if SMP >>>>>>> diff --git a/mm/Kconfig b/mm/Kconfig >>>>>>> index a5a90b169435d..e795fbd69e50c 100644 >>>>>>> --- a/mm/Kconfig >>>>>>> +++ b/mm/Kconfig >>>>>>> @@ -1440,14 +1440,10 @@ config ARCH_HAS_USER_SHADOW_STACK >>>>>>> ????????? The architecture has hardware support for userspace shadow >>>>>>> call >>>>>>> ????????????? stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss). >>>>>>> -config ARCH_SUPPORTS_PT_RECLAIM >>>>>>> -??? def_bool n >>>>>>> - >>>>>>> ??? config PT_RECLAIM >>>>>>> ??????? bool "reclaim empty user page table pages" >>>>>>> ??????? default y >>>>>>> -??? depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP >>>>>>> -??? select MMU_GATHER_RCU_TABLE_FREE >>>>>>> +??? 
depends on MMU_GATHER_RCU_TABLE_FREE && MMU && SMP && 64BIT >>>>>> >>>>>> Who would we have MMU_GATHER_RCU_TABLE_FREE without MMU? (can we drop >>>>>> the MMU part) >>>>> >>>>> OK. >>>>> >>>>>> >>>>>> Why do we care about SMP in the first place? (can we frop SMP) >>>>> >>>>> OK. >>>>> >>>>>> >>>>>> But I also wonder why we need "MMU_GATHER_RCU_TABLE_FREE && 64BIT": >>>>>> >>>>>> Would it be harmful on 32bit (sure, we might not reclaim as much, but >>>>>> still there is memory to be reclaimed?)? >>>>> >>>>> This is also fine on 32bit, but the benefits are not significant, So I >>>>> chose to enable it only on 64-bit. >>>> >>>> Right. Address space is smaller, but also memory is smaller. Not that I >>>> think we strictly *must* to support 32bit, I merely wonder why we >>>> wouldn't just enable it here. >>>> >>>> OTOH, if there is a good reason we cannot enable it, we can definitely >>>> just keep it 64bit only. >>> >>> The only difficulty is this: >>> >>>> >>>>> >>>>> I actually tried enabling MMU_GATHER_RCU_TABLE_FREE on all >>>>> architectures, and apart from sparc32 being a bit troublesome (because >>>>> it uses mm->page_table_lock for synchronization within >>>>> __pte_free_tlb()), the modifications were relatively simple. >>> >>> in sparc32: >>> >>> void pte_free(struct mm_struct *mm, pgtable_t ptep) >>> { >>> ????????? struct page *page; >>> >>> ????????? page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> >>> PAGE_SHIFT); >>> ????????? spin_lock(&mm->page_table_lock); >>> ????????? if (page_ref_dec_return(page) == 1) >>> ????????????????? pagetable_dtor(page_ptdesc(page)); >>> ????????? spin_unlock(&mm->page_table_lock); >>> >>> ????????? srmmu_free_nocache(ptep, SRMMU_PTE_TABLE_SIZE); >>> } >>> >>> #define __pte_free_tlb(tlb, pte, addr)? pte_free((tlb)->mm, pte) >>> >>> To enable MMU_GATHER_RCU_TABLE_FREE on sparc32, we need to implement >>> __tlb_remove_table(), and call the pte_free() above in >>> __tlb_remove_table(). 
>>> >>> However, the __tlb_remove_table() does not have an mm parameter: >>> >>> void __tlb_remove_table(void *_table) >>> >>> so we need to use another lock instead of mm->page_table_lock. >>> >>> I have already sent the v2 [1], and perhaps after that I can enable >>> PT_RECLAIM on all 32-bit architectures as well. >>> >> >> I guess if we just make it depend on MMU_GATHER_RCU_TABLE_FREE that will >> be fine. >> >>> [1]. >>> https://lore.kernel.org/all/ >>> cover.1763537007.git.zhengqi.arch at bytedance.com/ >>> >>>>> >>>>>> >>>>>> If all 64BIT support MMU_GATHER_RCU_TABLE_FREE (as you previously >>>>>> state), why can't we only check for 64BIT? >>>>> >>>>> OK, will do. >>>> >>>> This was also more of a question for discussion: >>>> >>>> Would it make sense to have >>>> >>>> config PT_RECLAIM >>>> def_bool y >>>> depends on MMU_GATHER_RCU_TABLE_FREE >>> >>> make sense. >>> >>>> >>>> (a) Would we want to make it configurable (why?) >>> >>> No, it was just out of caution before. >>> >>>> (b) Do we really care about SMP (why?) >>> >>> No. Simply because the following situation is impossible to occur: >>> >>> pte_offset_map >>> traversing the PTE page table >>> >>> >>> >>> call madvise(MADV_DONTNEED) >>> >>> so there's no need to free PTE page via RCU. >>> >>>> (c) Do we want to limit to 64bit (why?) >>> >>> No, just because the profit is greater at 64-BIT. >> >> I was briefly wondering if on 32bit (but maybe also on 64bit with >> configurable user page table levels?) we could have the scenario that we >> only have two page table levels. >> >> So reclaiming the PMD level (corresponding to the highest level) would > > reclaiming the PMD level? The PT_RECLAIM only reclaim PTE pages, not PMD > pages, am I misunderstanding something? Sorry, I looked too much into PMD table sharing the last days :D You're right, it would work in any case even with only 2 levels of page tables.
-- Cheers David From mpdesouza at suse.com Fri Nov 21 10:50:32 2025 From: mpdesouza at suse.com (Marcos Paulo de Souza) Date: Fri, 21 Nov 2025 15:50:32 -0300 Subject: [PATCH v2 0/4] printk cleanup - part 2 Message-ID: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com> The first part can be found here[1]. The proposed changes do not change the functionality of printk, but were suggestions made by Petr Mladek. I already have more patches for a part 3, but I would like to see these ones merged first. I did the testing with VMs, checking suspend and resume cycles, and it worked as expected. Thanks for reviewing! [1]: https://lore.kernel.org/lkml/20250226-printk-renaming-v1-0-0b878577f2e6 at suse.com/ Signed-off-by: Marcos Paulo de Souza --- Changes in v2: - Squashed patches 1 and 3 (CON_SUSPEND usage); the result is now the last patch of the series, suggested by Petr Mladek - Moved commit 4 to be the first one in the series, and changed it to use the console_is_usable helper, suggested by Petr Mladek - Moved commit 5 to be the second commit in the series, and adjusted it to use the console_is_usable helper, suggested by Petr Mladek - Patch 6 was dropped, since it was implemented in a different patchset (https://lore.kernel.org/lkml/20250902-nbcon-kgdboc-v3-0-cd30a8106f1c at suse.com/) - Patch 7 was moved to be the third patch, and now uses console_is_usable, suggested by Petr Mladek - Patch 2 was dropped from this patchset, and will be included in the next cleanup patchset.
- Link to v1: https://lore.kernel.org/r/20250606-printk-cleanup-part2-v1-0-f427c743dda0 at suse.com --- Marcos Paulo de Souza (4): drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT arch: um: kmsg_dump: Use console_is_usable printk: Use console_is_usable on console_unblank printk: Make console_{suspend,resume} handle CON_SUSPENDED arch/um/kernel/kmsg_dump.c | 2 +- drivers/tty/serial/kgdboc.c | 1 - drivers/tty/tty_io.c | 2 +- kernel/printk/printk.c | 17 +++++++---------- 4 files changed, 9 insertions(+), 13 deletions(-) --- base-commit: 887c7f05d40eb51ba3f38fd71d5e6b4aff4bb8a2 change-id: 20250601-printk-cleanup-part2-38f8d5108099 Best regards, -- Marcos Paulo de Souza From mpdesouza at suse.com Fri Nov 21 10:50:33 2025 From: mpdesouza at suse.com (Marcos Paulo de Souza) Date: Fri, 21 Nov 2025 15:50:33 -0300 Subject: [PATCH v2 1/4] drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT In-Reply-To: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com> References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com> Message-ID: <20251121-printk-cleanup-part2-v2-1-57b8b78647f4@suse.com> The original code tried to find a console that has CON_BOOT _or_ CON_ENABLED flag set. The flag CON_ENABLED is set to all registered consoles, so in this case this check is always true, even for the CON_BOOT consoles. The initial intent of the kgdboc_earlycon_init was to get a console early (CON_BOOT) or later on in the process (CON_ENABLED). The code was using for_each_console macro, meaning that all console structs were previously registered on the printk() machinery. At this point, any console found on for_each_console is safe for kgdboc_earlycon_init to use. Dropping the check makes the code cleaner, and avoids further confusion by future readers of the code. 
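[Editor's illustration] The argument that the dropped test can never fail can be modelled in a few lines of C. This is a sketch under stated assumptions, not kernel code: the DEMO_ flag bits and the demo_register() helper are made up here, and the only property modelled is the one the commit message states, namely that every registered console has the enabled flag set.

```c
#include <assert.h>
#include <stdbool.h>

/* Demo-only flag bits; the real definitions live in the kernel headers. */
#define DEMO_CON_ENABLED (1u << 2)
#define DEMO_CON_BOOT    (1u << 3)

/* Old-style test: accept a console that has the boot or enabled flag set. */
static bool demo_old_check(unsigned int flags)
{
	return (flags & (DEMO_CON_BOOT | DEMO_CON_ENABLED)) != 0;
}

/* Registration is modelled as always setting the enabled bit. */
static unsigned int demo_register(unsigned int flags)
{
	return flags | DEMO_CON_ENABLED;
}

/* True when the old test passes for every registered flag combination. */
bool demo_old_check_is_redundant(void)
{
	for (unsigned int flags = 0; flags < 16; flags++) {
		if (!demo_old_check(demo_register(flags)))
			return false;
	}
	return true;
}

/* The old test filters nothing out once registration is assumed. */
void demo_flags_self_check(void)
{
	assert(demo_old_check_is_redundant());
}
```

Under that assumption the test is a tautology for anything reachable through the registered-console list, which is the reasoning the patch relies on.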
Signed-off-by: Marcos Paulo de Souza --- drivers/tty/serial/kgdboc.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/tty/serial/kgdboc.c b/drivers/tty/serial/kgdboc.c index 85f6c5a76e0f..5a955c80a853 100644 --- a/drivers/tty/serial/kgdboc.c +++ b/drivers/tty/serial/kgdboc.c @@ -577,7 +577,6 @@ static int __init kgdboc_earlycon_init(char *opt) console_list_lock(); for_each_console(con) { if (con->write && con->read && - (con->flags & (CON_BOOT | CON_ENABLED)) && (!opt || !opt[0] || strcmp(con->name, opt) == 0)) break; } -- 2.51.1 From mpdesouza at suse.com Fri Nov 21 10:50:34 2025 From: mpdesouza at suse.com (Marcos Paulo de Souza) Date: Fri, 21 Nov 2025 15:50:34 -0300 Subject: [PATCH v2 2/4] arch: um: kmsg_dump: Use console_is_usable In-Reply-To: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com> References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com> Message-ID: <20251121-printk-cleanup-part2-v2-2-57b8b78647f4@suse.com> All consoles found on for_each_console are registered, meaning that all of them have the CON_ENABLED flag set. Since NBCON was introduced it's important to check if a given console also implements the NBCON callbacks. The function console_is_usable does exactly that. Signed-off-by: Marcos Paulo de Souza --- arch/um/kernel/kmsg_dump.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/um/kernel/kmsg_dump.c b/arch/um/kernel/kmsg_dump.c index 419021175272..fc0f543d1d8e 100644 --- a/arch/um/kernel/kmsg_dump.c +++ b/arch/um/kernel/kmsg_dump.c @@ -31,7 +31,7 @@ static void kmsg_dumper_stdout(struct kmsg_dumper *dumper, * expected to output the crash information. 
*/ if (strcmp(con->name, "ttynull") != 0 && - (console_srcu_read_flags(con) & CON_ENABLED)) { + console_is_usable(con, console_srcu_read_flags(con), true)) { break; } } -- 2.51.1 From mpdesouza at suse.com Fri Nov 21 10:50:35 2025 From: mpdesouza at suse.com (Marcos Paulo de Souza) Date: Fri, 21 Nov 2025 15:50:35 -0300 Subject: [PATCH v2 3/4] printk: Use console_is_usable on console_unblank In-Reply-To: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com> References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com> Message-ID: <20251121-printk-cleanup-part2-v2-3-57b8b78647f4@suse.com> The macro for_each_console_srcu iterates over all registered consoles. It's implied that all registered consoles have CON_ENABLED flag set, making the check for the flag unnecessary. Call console_is_usable function to fully verify if the given console is usable before calling the ->unblank callback. Signed-off-by: Marcos Paulo de Souza --- kernel/printk/printk.c | 12 ++++-------- 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index cb79d1d2e6e5..fed98a18e830 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -3374,12 +3374,10 @@ void console_unblank(void) */ cookie = console_srcu_read_lock(); for_each_console_srcu(c) { - short flags = console_srcu_read_flags(c); - - if (flags & CON_SUSPENDED) + if (!console_is_usable(c, console_srcu_read_flags(c), true)) continue; - if ((flags & CON_ENABLED) && c->unblank) { + if (c->unblank) { found_unblank = true; break; } @@ -3416,12 +3414,10 @@ void console_unblank(void) cookie = console_srcu_read_lock(); for_each_console_srcu(c) { - short flags = console_srcu_read_flags(c); - - if (flags & CON_SUSPENDED) + if (!console_is_usable(c, console_srcu_read_flags(c), true)) continue; - if ((flags & CON_ENABLED) && c->unblank) + if (c->unblank) c->unblank(); } console_srcu_read_unlock(cookie); -- 2.51.1 From mpdesouza at suse.com Fri Nov 21 10:50:36 2025 From: 
mpdesouza at suse.com (Marcos Paulo de Souza) Date: Fri, 21 Nov 2025 15:50:36 -0300 Subject: [PATCH v2 4/4] printk: Make console_{suspend,resume} handle CON_SUSPENDED In-Reply-To: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com> References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com> Message-ID: <20251121-printk-cleanup-part2-v2-4-57b8b78647f4@suse.com> Commit 9e70a5e109a4 ("printk: Add per-console suspended state") introduced the CON_SUSPENDED flag, and this flag is checked in the console_is_usable function, which returns false if the console is suspended. To make the behavior consistent, change show_cons_active to look for consoles that are not suspended, instead of checking CON_ENABLED. Signed-off-by: Marcos Paulo de Souza --- drivers/tty/tty_io.c | 2 +- kernel/printk/printk.c | 5 +++-- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c index e2d92cf70eb7..1b2ce0f36010 100644 --- a/drivers/tty/tty_io.c +++ b/drivers/tty/tty_io.c @@ -3554,7 +3554,7 @@ static ssize_t show_cons_active(struct device *dev, continue; if (!(c->flags & CON_NBCON) && !c->write) continue; - if ((c->flags & CON_ENABLED) == 0) + if (c->flags & CON_SUSPENDED) continue; cs[i++] = c; if (i >= ARRAY_SIZE(cs)) diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c index fed98a18e830..fe7c956f73bd 100644 --- a/kernel/printk/printk.c +++ b/kernel/printk/printk.c @@ -3542,7 +3542,7 @@ void console_suspend(struct console *console) { __pr_flush(console, 1000, true); console_list_lock(); - console_srcu_write_flags(console, console->flags & ~CON_ENABLED); + console_srcu_write_flags(console, console->flags | CON_SUSPENDED); console_list_unlock(); /* @@ -3555,13 +3555,14 @@ void console_suspend(struct console *console) } EXPORT_SYMBOL(console_suspend); +/* Unset CON_SUSPENDED flag so the console can start printing again.
*/ void console_resume(struct console *console) { struct console_flush_type ft; bool is_nbcon; console_list_lock(); - console_srcu_write_flags(console, console->flags | CON_ENABLED); + console_srcu_write_flags(console, console->flags & ~CON_SUSPENDED); is_nbcon = console->flags & CON_NBCON; console_list_unlock(); -- 2.51.1 From davidgow at google.com Sat Nov 22 00:32:12 2025 From: davidgow at google.com (David Gow) Date: Sat, 22 Nov 2025 16:32:12 +0800 Subject: [PATCH] arch: um: Don't rename vmap to kernel_vmap Message-ID: <20251122083213.3996586-1-davidgow@google.com> In order to work around the existence of a vmap symbol in libpcap, the UML makefile unconditionally redefines vmap to kernel_vmap. However, this not only affects the actual vmap symbol, but also anything else named vmap, including a number of struct members in DRM. This would not be too much of a problem, since all uses are also updated, except we now have Rust DRM bindings, which expect the corresponding Rust structs to have 'vmap' names. Since the redefinition applies in bindgen, but not to Rust code, we end up with errors such as: error[E0560]: struct `drm_gem_object_funcs` has no fields named `vmap` --> rust/kernel/drm/gem/mod.rs:210:9 Since, as far as I can tell, we no longer actually link to libpcap, it should be safe to just remove this define unconditionally. (If it's not, we can possibly either disable DRM Rust bindings under UML, or move the redefinition of vmap behind some config option.) We also take this opportunity to update the comment. 
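[Editor's illustration] The effect described above, where a command-line define renames struct members and not just the one symbol it was aimed at, can be reproduced in a small standalone C sketch. Everything prefixed demo_ or DEMO_ is hypothetical and exists only for this example; the #define stands in for passing -Dvmap=kernel_vmap to the compiler.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for passing -Dvmap=kernel_vmap on the compiler command line. */
#define vmap kernel_vmap

/* Two-step stringification so the argument is macro-expanded first. */
#define DEMO_STR(x) #x
#define DEMO_XSTR(x) DEMO_STR(x)

/* Hypothetical struct; the member is spelled "vmap" in the source text. */
struct demo_gem_funcs {
	int (*vmap)(void);
};

static int demo_map(void)
{
	return 42;
}

/* The member name the compiler actually sees after preprocessing. */
const char *demo_member_name(void)
{
	return DEMO_XSTR(vmap);
}

/* C code keeps working because every use is renamed consistently. */
int demo_call_member(void)
{
	struct demo_gem_funcs funcs = { .vmap = demo_map };

	return funcs.vmap();
}

/* The member really is called kernel_vmap once the macro is in effect. */
void demo_vmap_self_check(void)
{
	assert(strcmp(demo_member_name(), "kernel_vmap") == 0);
	assert(demo_call_member() == 42);
}
```

C callers never notice the rename since they go through the same preprocessor, whereas code that refers to the member by its source spelling without that preprocessing step would look for a field that no longer exists under that name.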
Signed-off-by: David Gow --- arch/um/Makefile | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/arch/um/Makefile b/arch/um/Makefile index 7be0143b5ba3..721b652ffb65 100644 --- a/arch/um/Makefile +++ b/arch/um/Makefile @@ -46,19 +46,17 @@ ARCH_INCLUDE := -I$(srctree)/$(SHARED_HEADERS) ARCH_INCLUDE += -I$(srctree)/$(HOST_DIR)/um/shared KBUILD_CPPFLAGS += -I$(srctree)/$(HOST_DIR)/um -# -Dvmap=kernel_vmap prevents anything from referencing the libpcap.o symbol so -# named - it's a common symbol in libpcap, so we get a binary which crashes. -# -# Same things for in6addr_loopback and mktime - found in libc. For these two we -# only get link-time error, luckily. +# -Dstrrchr=kernel_strrchr (as well as the various in6addr symbols) prevents +# anything from referencing +# libc symbols with the same name, which can cause a linker error. # # -Dlongjmp=kernel_longjmp prevents anything from referencing the libpthread.a # embedded copy of longjmp, same thing for setjmp. # -# These apply to USER_CFLAGS to. +# These apply to USER_CFLAGS too. KBUILD_CFLAGS += $(CFLAGS) $(CFLAGS-y) -D__arch_um__ \ - $(ARCH_INCLUDE) $(MODE_INCLUDE) -Dvmap=kernel_vmap \ + $(ARCH_INCLUDE) $(MODE_INCLUDE) \ -Dlongjmp=kernel_longjmp -Dsetjmp=kernel_setjmp \ -Din6addr_loopback=kernel_in6addr_loopback \ -Din6addr_any=kernel_in6addr_any -Dstrrchr=kernel_strrchr \ -- 2.52.0.rc2.455.g230fcf2819-goog From miguel.ojeda.sandonis at gmail.com Sun Nov 23 09:07:54 2025 From: miguel.ojeda.sandonis at gmail.com (Miguel Ojeda) Date: Sun, 23 Nov 2025 18:07:54 +0100 Subject: [PATCH] arch: um: Don't rename vmap to kernel_vmap In-Reply-To: <20251122083213.3996586-1-davidgow@google.com> References: <20251122083213.3996586-1-davidgow@google.com> Message-ID: On Sat, Nov 22, 2025 at 9:32?AM David Gow wrote: > > In order to work around the existence of a vmap symbol in libpcap, the > UML makefile unconditionally redefines vmap to kernel_vmap. 
However, > this not only affects the actual vmap symbol, but also anything else > named vmap, including a number of struct members in DRM. > > This would not be too much of a problem, since all uses are also > updated, except we now have Rust DRM bindings, which expect the > corresponding Rust structs to have 'vmap' names. Since the redefinition > applies in bindgen, but not to Rust code, we end up with errors such as: > > error[E0560]: struct `drm_gem_object_funcs` has no fields named `vmap` > --> rust/kernel/drm/gem/mod.rs:210:9 > > Since, as far as I can tell, we no longer actually link to libpcap, it > should be safe to just remove this define unconditionally. > > (If it's not, we can possibly either disable DRM Rust bindings under > UML, or move the redefinition of vmap behind some config option.) > > We also take this opportunity to update the comment. > > Signed-off-by: David Gow Nice, thanks for this! Yeah, I guess we would otherwise need to do the same kind of "wild" macro replacement in Rust code to support this or conditional compilation, and neither sounds good. If it is not actually needed, then this sounds like a win-win. It seems it was indeed gone in commit: 12b8e7e69aa7 ("um: Remove obsolete pcap driver") So it sounds reasonable to me assuming I am not missing anything, which I may be... FWIW: Acked-by: Miguel Ojeda Cheers, Miguel From johannes at sipsolutions.net Sun Nov 23 23:49:35 2025 From: johannes at sipsolutions.net (Johannes Berg) Date: Mon, 24 Nov 2025 08:49:35 +0100 Subject: [PATCH] arch: um: Don't rename vmap to kernel_vmap In-Reply-To: (sfid-20251123_180818_160099_E5AEF588) References: <20251122083213.3996586-1-davidgow@google.com> (sfid-20251123_180818_160099_E5AEF588) Message-ID: On Sun, 2025-11-23 at 18:07 +0100, Miguel Ojeda wrote: > On Sat, Nov 22, 2025 at 9:32?AM David Gow wrote: > > > > In order to work around the existence of a vmap symbol in libpcap, the > > UML makefile unconditionally redefines vmap to kernel_vmap. 
However, > > this not only affects the actual vmap symbol, but also anything else > > named vmap, including a number of struct members in DRM. > > > > This would not be too much of a problem, since all uses are also > > updated, except we now have Rust DRM bindings, which expect the > > corresponding Rust structs to have 'vmap' names. Since the redefinition > > applies in bindgen, but not to Rust code, we end up with errors such as: > > > > error[E0560]: struct `drm_gem_object_funcs` has no fields named `vmap` > > --> rust/kernel/drm/gem/mod.rs:210:9 > > > > Since, as far as I can tell, we no longer actually link to libpcap, it > > should be safe to just remove this define unconditionally. > > > > (If it's not, we can possibly either disable DRM Rust bindings under > > UML, or move the redefinition of vmap behind some config option.) > > > > We also take this opportunity to update the comment. > > > > Signed-off-by: David Gow > > Nice, thanks for this! > > Yeah, I guess we would otherwise need to do the same kind of "wild" > macro replacement in Rust code to support this or conditional > compilation, and neither sounds good. > > If it is not actually needed, then this sounds like a win-win. > > It seems it was indeed gone in commit: > > 12b8e7e69aa7 ("um: Remove obsolete pcap driver") Indeed, that was just missed during the removal, we can't link to libpcap any more. How do we want to take this patch in, and where is it needed? I hadn't planned to send a UML PR to -rc still, but I guess I _can_ if needed? But if anyone else wants to line it up through a tree (rust related?) that has pending work anyway, that seems fair too. In which case: Acked-by: Johannes Berg Or it's not that urgent because all this came up in -next now? I didn't really see (or fully understand) all the build bug reports. 
johannes From davidgow at google.com Mon Nov 24 23:36:31 2025 From: davidgow at google.com (David Gow) Date: Tue, 25 Nov 2025 15:36:31 +0800 Subject: [PATCH] arch: um: Don't rename vmap to kernel_vmap In-Reply-To: References: <20251122083213.3996586-1-davidgow@google.com> Message-ID: On Mon, 24 Nov 2025 at 15:49, Johannes Berg wrote: > > On Sun, 2025-11-23 at 18:07 +0100, Miguel Ojeda wrote: > > On Sat, Nov 22, 2025 at 9:32?AM David Gow wrote: > > > > > > In order to work around the existence of a vmap symbol in libpcap, the > > > UML makefile unconditionally redefines vmap to kernel_vmap. However, > > > this not only affects the actual vmap symbol, but also anything else > > > named vmap, including a number of struct members in DRM. > > > > > > This would not be too much of a problem, since all uses are also > > > updated, except we now have Rust DRM bindings, which expect the > > > corresponding Rust structs to have 'vmap' names. Since the redefinition > > > applies in bindgen, but not to Rust code, we end up with errors such as: > > > > > > error[E0560]: struct `drm_gem_object_funcs` has no fields named `vmap` > > > --> rust/kernel/drm/gem/mod.rs:210:9 > > > > > > Since, as far as I can tell, we no longer actually link to libpcap, it > > > should be safe to just remove this define unconditionally. > > > > > > (If it's not, we can possibly either disable DRM Rust bindings under > > > UML, or move the redefinition of vmap behind some config option.) > > > > > > We also take this opportunity to update the comment. > > > > > > Signed-off-by: David Gow > > > > Nice, thanks for this! > > > > Yeah, I guess we would otherwise need to do the same kind of "wild" > > macro replacement in Rust code to support this or conditional > > compilation, and neither sounds good. > > > > If it is not actually needed, then this sounds like a win-win. 
> > > > It seems it was indeed gone in commit: > > > > 12b8e7e69aa7 ("um: Remove obsolete pcap driver") > > Indeed, that was just missed during the removal, we can't link to > libpcap any more. > > How do we want to take this patch in, and where is it needed? I hadn't > planned to send a UML PR to -rc still, but I guess I _can_ if needed? > But if anyone else wants to line it up through a tree (rust related?) > that has pending work anyway, that seems fair too. In which case: > > Acked-by: Johannes Berg > > Or it's not that urgent because all this came up in -next now? I didn't > really see (or fully understand) all the build bug reports. > I'm happy for this to go in via any tree. (Worst-case, we could possibly take it via KUnit, though I'd rather not, as it's not really KUnit-related at all.) The issue has actually been around since probably 6.16 (c284d3e42338 ("rust: drm: gem: Add GEM object abstraction")), but since it only applies to people building Rust graphics drivers against UML, which is not super common, it seems like it's only come up in randconfig builds so far. -- David From johannes at sipsolutions.net Mon Nov 24 23:40:15 2025 From: johannes at sipsolutions.net (Johannes Berg) Date: Tue, 25 Nov 2025 08:40:15 +0100 Subject: [PATCH] arch: um: Don't rename vmap to kernel_vmap In-Reply-To: (sfid-20251125_083645_823712_C74034DC) References: <20251122083213.3996586-1-davidgow@google.com> (sfid-20251125_083645_823712_C74034DC) Message-ID: On Tue, 2025-11-25 at 15:36 +0800, David Gow wrote: > > > > Or it's not that urgent because all this came up in -next now? I didn't > > really see (or fully understand) all the build bug reports. > > > > I'm happy for this to go in via any tree.
(Worst-case, we could > possibly take it via KUnit, though I'd rather not, as it's not really > KUnit-related at all.) > > The issue has actually been around since probably 6.16 (c284d3e42338 > ("rust: drm: gem: Add GEM object abstraction")), but since it only > applies to people building Rust graphics drivers against UML, which is > not super common, it seems like it's only come up in randconfig builds > so far. Oh, interesting, OK. I guess then given that it's not super important and how late we're in the game, I'll just throw it into the (relatively small) pile we have for UML for -next. Given that we removed the pcap driver in 6.11 (12b8e7e69aa7a) I guess we could even ask stable to take it, but it's not even that important until someone wants to test the rust DRM stuff in kunit or something :) johannes From davidgow at google.com Tue Nov 25 01:17:48 2025 From: davidgow at google.com (David Gow) Date: Tue, 25 Nov 2025 17:17:48 +0800 Subject: [PATCH] arch: um: Don't rename vmap to kernel_vmap In-Reply-To: References: <20251122083213.3996586-1-davidgow@google.com> Message-ID: On Tue, 25 Nov 2025 at 15:40, Johannes Berg wrote: > > On Tue, 2025-11-25 at 15:36 +0800, David Gow wrote: > > > > > > Or it's not that urgent because all this came up in -next now? I didn't > > > really see (or fully understand) all the build bug reports. > > > > > > > I'm happy for this to go in via any tree. (Worst-case, we could > > possibly take it via KUnit, though I'd rather not, as it's not really > > KUnit-related at all.) > > > > The issue has actually been around since probably 6.16 (c284d3e42338 > > ("rust: drm: gem: Add GEM object abstraction")), but since it only > > applies to people building Rust graphics drivers against UML, which is > > not super common, it seems like it's only come up in randconfig builds > > so far. > > Oh, interesting, OK. 
I guess then given that it's not super important > and how late we're in the game, I'll just throw it into the (relatively > small) pile we have for UML for -next. Given that we removed the pcap > driver in 6.11 (12b8e7e69aa7a) I guess we could even ask stable to take > it, but it's not even that important until someone wants to test the > rust DRM stuff in kunit or something :) > Sounds good to me. Throwing the Fixes: tag in wouldn't hurt at least (and I do think some of the Rust DRM stuff is growing some KUnit tests, so it'll be nice to have going forward, even if they can be tested under other architectures). Cheers, -- David From johannes at sipsolutions.net Tue Nov 25 01:58:53 2025 From: johannes at sipsolutions.net (Johannes Berg) Date: Tue, 25 Nov 2025 10:58:53 +0100 Subject: [PATCH v13 00/13] nommu UML In-Reply-To: (sfid-20251112_095303_672501_9A7DDF36) References: <0a84c16f862026c82271c0adbc91d98b812a78b4.camel@sipsolutions.net> (sfid-20251112_095303_672501_9A7DDF36) Message-ID: On Wed, 2025-11-12 at 17:52 +0900, Hajime Tazaki wrote: > > > What is it for ? > > > ================ > > > > > > - Alleviate syscall hook overhead implemented with ptrace(2) > > > - To exercises nommu code over UML (and over KUnit) > > > - Less dependency to host facilities > > > > FWIW, in some way, this order of priorities is exactly why this hasn't > > been going anywhere, and every time I looked at it I got somewhat > > annoyed by what seems to me like choices made to support especially the > > first bullet.
> > over the past versions, I've been emphasized that the 2nd bullet (testing) > is the primary usecase as I saw several actually cases from mm folks, > > https://lists.infradead.org/pipermail/maple-tree/2024-November/003775.html > https://lore.kernel.org/all/cb1cf0be-871d-4982-9a1b-5fdd54deec8d at lucifer.local/ > > and I think this is not limited to mm code. Not sure there's much value in testing much else in no-MMU, but sure, I'll give you that it's useful for testing. > other 2 bullets are additional benefits which we observed in a > comment, and our experience. But are they really _worthwhile_ benefits? A lot of this design adds additional complexity, and it doesn't really seem necessary for the testing use case. Making it faster is nice, but it's not like the speedup really is 20x for arbitrary tests, that's just for corner cases like "sit in a loop of gettimeofday()". And for kunit there's no syscall boundary at all, so there's no speedup. > > I suspect that the first and third bullet are not even really true any > > more, since you moved to seccomp (per our request), yet I think design > > choices influenced by them persist. > > this observation is not true; the first bullet is still true even > using seccomp. please look at the benchmark result in the patch > [12/13], quoted below. > [snip] So thanks for the correction. If that's the case, however, it means the speedup can't be due to the syscall boundary itself (seccomp) but must rather be due to some pagefault/mapping handling issue? Which would be inherent in no-MMU, even taking an approach of using two host processes rather than embedding everything into one. > > However, I'm not yet convinced that all of the complexities presented in > > this patchset (such as completely separate seccomp implementation) are > > actually necessary in support of _just_ the second bullet. These seem to > > me like design choices necessary to support the _first_ bullet [1]. 
> > separate seccomp implementation is indeed needed due to the design > choice we made, to use a single process to host a (um) userspace. That sounds misleading or even wrong to me, I'd say it's due to putting the (um) userspace in the same host process as the kernel space? > I don't see why you see this as a _complexity_, as functionally both > seccomp handling don't interfere each other. The complexity isn't so much in the separate code, which is a small factor, but in the "put everything into the same process" aspect of it. That has consequences around the host context state handling, things we didn't really need to consider before suddenly become crucially important. In the current (with-MMU) design, we only need to worry about being able to correctly switch between userspace tasks/threads within a userspace mm (host) process. With the no-MMU design you propose, we also need to be able to correctly switch between kernel and userspace tasks within the same single (host) process. I think this is a pretty significant difference, and saying "there's no complexity here" is simply pretending it isn't a relevant difference. I believe you're not even handling this correctly right now in this patch set, specifically wrt. the GS register which has been pointed out before, but I wouldn't say that I even have a complete picture in my head over what state handling would be necessary and sufficient. So yeah, I think this warrants taking another look as to whether or not the approach of putting everything into the same host process is even worth it. I tend to believe that it isn't, given the use cases. And if you say the speedup still is with seccomp, that kills the speed argument too. > > I've thought about what would happen if we stuck to creating a (single) > > separate process on the host to execute userspace, and just used > > CLONE_VM for it. 
That way, it's still no-MMU with full memory access, > > but there's some implicit isolation between the kernel and userspace > > processes which will likely remove complexities around FP/SSE/AVX > > handling, may completely remove the need for a separate seccomp > > implementation, etc. > > this would be doable I think, but we went the different way, as > using separate host processes (with ptrace/seccomp) is slow and add > complexity by the synchronization between processes, which we think > it's not easy to maintain in the future. Which one is it then, slow or not? Not sure I follow. You just said you do have seccomp when comparing speeds, so that in itself doesn't make it slow. What synchronization? It'd (have to) be CLONE_VM, but that actually _simplifies_ state transfer/synchronization, and we already have (to have) state transfer between different userspace threads in the same host process for the with-MMU case. johannes From pmladek at suse.com Wed Nov 26 05:12:19 2025 From: pmladek at suse.com (Petr Mladek) Date: Wed, 26 Nov 2025 14:12:19 +0100 Subject: [PATCH v2 1/4] drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT In-Reply-To: <20251121-printk-cleanup-part2-v2-1-57b8b78647f4@suse.com> References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com> <20251121-printk-cleanup-part2-v2-1-57b8b78647f4@suse.com> Message-ID: On Fri 2025-11-21 15:50:33, Marcos Paulo de Souza wrote: > The original code tried to find a console that has CON_BOOT _or_ > CON_ENABLED flag set. The flag CON_ENABLED is set to all registered > consoles, so in this case this check is always true, even for the > CON_BOOT consoles. > > The initial intent of the kgdboc_earlycon_init was to get a console > early (CON_BOOT) or later on in the process (CON_ENABLED). The > code was using for_each_console macro, meaning that all console structs > were previously registered on the printk() machinery. 
At this point, > any console found on for_each_console is safe for kgdboc_earlycon_init > to use. > > Dropping the check makes the code cleaner, and avoids further confusion > by future readers of the code. > > Signed-off-by: Marcos Paulo de Souza I agree that the check is superfluous and can be removed: Reviewed-by: Petr Mladek Best Regards, Petr From pmladek at suse.com Wed Nov 26 05:22:44 2025 From: pmladek at suse.com (Petr Mladek) Date: Wed, 26 Nov 2025 14:22:44 +0100 Subject: [PATCH v2 2/4] arch: um: kmsg_dump: Use console_is_usable In-Reply-To: <20251121-printk-cleanup-part2-v2-2-57b8b78647f4@suse.com> References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com> <20251121-printk-cleanup-part2-v2-2-57b8b78647f4@suse.com> Message-ID: On Fri 2025-11-21 15:50:34, Marcos Paulo de Souza wrote: > All consoles found on for_each_console are registered, meaning that all > of them have the CON_ENABLED flag set. Since NBCON was introduced it's > important to check if a given console also implements the NBCON callbacks. > The function console_is_usable does exactly that. > > Signed-off-by: Marcos Paulo de Souza Makes sense: Reviewed-by: Petr Mladek Best Regards, Petr From pmladek at suse.com Wed Nov 26 05:24:58 2025 From: pmladek at suse.com (Petr Mladek) Date: Wed, 26 Nov 2025 14:24:58 +0100 Subject: [PATCH v2 3/4] printk: Use console_is_usable on console_unblank In-Reply-To: <20251121-printk-cleanup-part2-v2-3-57b8b78647f4@suse.com> References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com> <20251121-printk-cleanup-part2-v2-3-57b8b78647f4@suse.com> Message-ID: On Fri 2025-11-21 15:50:35, Marcos Paulo de Souza wrote: > The macro for_each_console_srcu iterates over all registered consoles. It's > implied that all registered consoles have CON_ENABLED flag set, making > the check for the flag unnecessary. Call console_is_usable function to > fully verify if the given console is usable before calling the ->unblank > callback. 
> > Signed-off-by: Marcos Paulo de Souza Makes sense: Reviewed-by: Petr Mladek Best Regards, Petr From miguel.ojeda.sandonis at gmail.com Wed Nov 26 05:50:44 2025 From: miguel.ojeda.sandonis at gmail.com (Miguel Ojeda) Date: Wed, 26 Nov 2025 14:50:44 +0100 Subject: [linux-next:master 4806/10599] error[E0560]: struct `bindings::kernel_param_ops` has no field named `get` In-Reply-To: <84b74435-5aad-4c15-aea5-db87b4a6bf11@kernel.org> References: <202511210858.uwVivgvn-lkp@intel.com> <84b74435-5aad-4c15-aea5-db87b4a6bf11@kernel.org> Message-ID: On Wed, Nov 26, 2025 at 2:41?PM Daniel Gomez wrote: > > On 21/11/2025 01.24, kernel test robot wrote: > > tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master > > head: 88cbd8ac379cf5ce68b7efcfd4d1484a6871ee0b > > commit: 0b08fc292842a13aa496413b48c1efb83573b8c6 [4806/10599] rust: introduce module_param module > > config: um-randconfig-001-20251121 (https://download.01.org/0day-ci/archive/20251121/202511210858.uwVivgvn-lkp at intel.com/config) > > compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 9e9fe08b16ea2c4d9867fb4974edf2a3776d6ece) > > rustc: rustc 1.88.0 (6b00bc388 2025-06-23) > > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251121/202511210858.uwVivgvn-lkp at intel.com/reproduce) > > We can't reproduce this. > > If anyone cares, please let us know how to reproduce it. > > Tested on Debian testing x86_64 host. 
> > rustc --version > rustc 1.91.1 (ed61e7d7e 2025-11-07 > > /home/dagomez/0day/llvm-22.0.0-e19fa930ca838715028c00c234874d1db4f93154-20250918-184558-x86_64/bin/clang-22 --version > ClangBuiltLinux clang version 22.0.0git (https://github.com/llvm/llvm-project.git e19fa930ca838715028c00c234874d1db4f93154) > Target: x86_64-unknown-linux-gnu > Thread model: posix > InstalledDir: /home/dagomez/0day/llvm-22.0.0-e19fa930ca838715028c00c234874d1db4f93154-20250918-184558-x86_64/bin > > 561 wget https://download.01.org/0day-ci/archive/20251121/202511210858.uwVivgvn-lkp at intel.com/config > 563 git clone https://github.com/intel/lkp-tests.git ~/lkp-tests > 565 mkdir -p build_dir && cp config build_dir/.config > > 571 COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang-22 ~/lkp-tests/kbuild/make.cross W=1 O=build_dir ARCH=um olddefconfig > 572 COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang-22 ~/lkp-tests/kbuild/make.cross W=1 O=build_dir ARCH=um prepare > 573 COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang-22 ~/lkp-tests/kbuild/make.cross W=1 O=build_dir ARCH=um -j$(nproc) > > I'm just getting these warnings: > > ... Cc'ing UML so that they are in the loop. 
Cheers, Miguel From davidgow at google.com Thu Nov 27 01:26:31 2025 From: davidgow at google.com (David Gow) Date: Thu, 27 Nov 2025 17:26:31 +0800 Subject: [linux-next:master 4806/10599] error[E0560]: struct `bindings::kernel_param_ops` has no field named `get` In-Reply-To: References: <202511210858.uwVivgvn-lkp@intel.com> <84b74435-5aad-4c15-aea5-db87b4a6bf11@kernel.org> Message-ID: On Wed, 26 Nov 2025 at 21:50, Miguel Ojeda wrote: > > On Wed, Nov 26, 2025 at 2:41 PM Daniel Gomez wrote: > > > > On 21/11/2025 01.24, kernel test robot wrote: > > > tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master > > > head: 88cbd8ac379cf5ce68b7efcfd4d1484a6871ee0b > > > commit: 0b08fc292842a13aa496413b48c1efb83573b8c6 [4806/10599] rust: introduce module_param module > > > config: um-randconfig-001-20251121 (https://download.01.org/0day-ci/archive/20251121/202511210858.uwVivgvn-lkp at intel.com/config) > > > compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 9e9fe08b16ea2c4d9867fb4974edf2a3776d6ece) > > > rustc: rustc 1.88.0 (6b00bc388 2025-06-23) > > > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251121/202511210858.uwVivgvn-lkp at intel.com/reproduce) > > > > We can't reproduce this. > > > > If anyone cares, please let us know how to reproduce it. > > Thanks -- this does sit in the category of things I care about (at least in theory), but also can't reproduce. It looks like this affects random struct fields in bindings:: (I've seen other 0day reports with other structs and fields). If anyone has any idea what's going on, suggestions are welcome. Cheers, -- David
From pmladek at suse.com Thu Nov 27 01:49:28 2025
From: pmladek at suse.com (Petr Mladek)
Date: Thu, 27 Nov 2025 10:49:28 +0100
Subject: [PATCH v2 4/4] printk: Make console_{suspend,resume} handle CON_SUSPENDED
In-Reply-To: <20251121-printk-cleanup-part2-v2-4-57b8b78647f4@suse.com>
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com> <20251121-printk-cleanup-part2-v2-4-57b8b78647f4@suse.com>
Message-ID:

On Fri 2025-11-21 15:50:36, Marcos Paulo de Souza wrote:
> Since commit 9e70a5e109a4 ("printk: Add per-console suspended state")
> the CON_SUSPENDED flag was introduced, and this flag was being checked
> on console_is_usable function, which returns false if the console is
> suspended.
>
> To make the behavior consistent, change show_cons_active to look for
> consoles that are not suspended, instead of checking CON_ENABLED.
>
> --- a/drivers/tty/tty_io.c
> +++ b/drivers/tty/tty_io.c
> @@ -3554,7 +3554,7 @@ static ssize_t show_cons_active(struct device *dev,
>  			continue;
>  		if (!(c->flags & CON_NBCON) && !c->write)
>  			continue;
> -		if ((c->flags & CON_ENABLED) == 0)
> +		if (c->flags & CON_SUSPENDED)

I believe that we could and should replace

	if (!(c->flags & CON_NBCON) && !c->write)
		continue;
	if (c->flags & CON_SUSPENDED)
		continue;

with

	if (!console_is_usable(c, c->flags, true) &&
	    !console_is_usable(c, c->flags, false))
		continue;

It would make the value compatible with all other callers/users of
the console drivers.

The variant using two console_is_usable() calls with "true/false"
parameters is inspired by __pr_flush().

>  			continue;
>  		cs[i++] = c;
>  		if (i >= ARRAY_SIZE(cs))
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index fed98a18e830..fe7c956f73bd 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -3542,7 +3542,7 @@ void console_suspend(struct console *console)
>  {
>  	__pr_flush(console, 1000, true);
>  	console_list_lock();
> -	console_srcu_write_flags(console, console->flags & ~CON_ENABLED);
> +	console_srcu_write_flags(console, console->flags | CON_SUSPENDED);

This is the same flag which is also set by the console_suspend_all() API.

Now, as discussed at
https://lore.kernel.org/lkml/844j4lepak.fsf at jogness.linutronix.de/

  + console_suspend()/console_resume() API is used by a few console
    drivers to suspend the console when the related HW device
    gets suspended.

  + console_suspend_all()/console_resume_all() is used by
    the power management subsystem to call down/up all consoles
    when the system is going down/up. It is a big hammer approach.

We need to distinguish the two APIs so that console drivers which were
suspended by both APIs stay suspended until they get resumed by both APIs.

I mean:

	// This should suspend all consoles unless it is disabled
	// by the "no_console_suspend" parameter.
	console_suspend_all();

	// This suspends @con even when the "no_console_suspend" parameter
	// is used. It is needed because the HW is going to be suspended.
	// It has no effect when the consoles were already suspended
	// by the big hammer API.
	console_suspend(con);

	// This might resume the console when the "no_console_suspend" option
	// is used. The driver should work because the HW was resumed.
	// But it should stay suspended when all consoles are supposed
	// to stay suspended because of the big hammer API.
	console_resume(con);

	// This should resume all consoles.
	console_resume_all();

Other behavior would be unexpected and untested. It might cause regressions.
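
[Editorial illustration, not part of the original mail.] The nesting
described above can be modelled in plain user-space C. This is only a
rough sketch under stated assumptions: every identifier prefixed with
model_ or MODEL_ is hypothetical, nothing here is kernel code, and the
model ignores "no_console_suspend", locking and SRCU entirely. It tracks
just the two pieces of state under discussion, a per-console suspended
bit and a global "all consoles suspended" flag, and treats a console as
usable only when both are clear.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-console CON_SUSPENDED bit. */
#define MODEL_CON_SUSPENDED 0x1u

/* Hypothetical, heavily reduced stand-in for struct console. */
struct model_console {
	unsigned int flags;
};

/* Hypothetical stand-in for a global "consoles_suspended" state. */
bool model_consoles_suspended;

/* Big hammer: affects every console through the global state. */
void model_console_suspend_all(void) { model_consoles_suspended = true; }
void model_console_resume_all(void)  { model_consoles_suspended = false; }

/* Per-device: affects only the given console. */
void model_console_suspend(struct model_console *con)
{
	con->flags |= MODEL_CON_SUSPENDED;
}

void model_console_resume(struct model_console *con)
{
	con->flags &= ~MODEL_CON_SUSPENDED;
}

/* A console may print only if neither level has it suspended. */
bool model_console_is_usable(const struct model_console *con)
{
	if (model_consoles_suspended)
		return false;
	return !(con->flags & MODEL_CON_SUSPENDED);
}

/*
 * Replay the call sequence from the mail and check each step.
 * Returns whether the console is usable at the very end.
 */
bool model_replay_sequence(void)
{
	struct model_console con = { .flags = 0 };

	model_console_suspend_all();
	assert(!model_console_is_usable(&con));

	model_console_suspend(&con);
	assert(!model_console_is_usable(&con));

	/* Resumed per-device, but the global suspend still holds. */
	model_console_resume(&con);
	assert(!model_console_is_usable(&con));

	model_console_resume_all();
	return model_console_is_usable(&con);
}
```

In this model, model_replay_sequence() reports the console as usable
only after both resume calls have happened, which is the property the
two APIs are expected to preserve. A single shared flag cannot express
this, because the first resume call would clear the state set by the
other API.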

I see two solutions:

  + add another CON_SUSPENDED_ALL flag
  + add back the "consoles_suspended" global variable

I prefer adding back the "consoles_suspended" global variable because
it is a global state...

The global state should be synchronized the same way as the current
per-console flag (write under console_list_lock, read under
console_srcu_read_lock()).

Also it should be checked by the console_is_usable() API. Otherwise, we
would need to update all callers.

This brings a challenge how to make it safe and keep the API sane.
I propose to create:

  + __console_is_usable() where the "consoles_suspended" value will
    be passed as a parameter. It might be used directly under
    console_list_lock().

  + console_is_usable() with the existing parameters. It will check
    that it was called under console_srcu_read_lock(), read
    the global "consoles_suspended" and pass it to
    __console_is_usable().

>  	console_list_unlock();
>
>  	/*

I played with the code to make sure that it looked sane and I ended
with the following changes on top of this patch.

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 1b2ce0f36010..fda4683d12f1 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -3552,9 +3552,8 @@ static ssize_t show_cons_active(struct device *dev,
 	for_each_console(c) {
 		if (!c->device)
 			continue;
-		if (!(c->flags & CON_NBCON) && !c->write)
-			continue;
-		if (c->flags & CON_SUSPENDED)
+		if (!__console_is_usable(c, c->flags, consoles_suspended, true) &&
+		    !__console_is_usable(c, c->flags, consoles_suspended, false))
 			continue;
 		cs[i++] = c;
 		if (i >= ARRAY_SIZE(cs))
diff --git a/include/linux/console.h b/include/linux/console.h
index 5f17321ed962..090490ef570f 100644
--- a/include/linux/console.h
+++ b/include/linux/console.h
@@ -496,6 +496,7 @@ extern void console_list_lock(void) __acquires(console_mutex);
 extern void console_list_unlock(void) __releases(console_mutex);
 
 extern struct hlist_head console_list;
+extern bool consoles_suspended;
 
 /**
  * console_srcu_read_flags - Locklessly read flags of a possibly registered
@@ -548,6 +549,47 @@ static inline void console_srcu_write_flags(struct console *con, short flags)
 	WRITE_ONCE(con->flags, flags);
 }
 
+/**
+ * consoles_suspended_srcu_read - Locklessly read the global flag for
+ *				suspending all consoles.
+ *
+ * The global "consoles_suspended" flag is synchronized using console_list_lock
+ * and console_srcu_read_lock. It is the same approach as CON_SUSPENDED flag.
+ * See console_srcu_read_flags() for more details.
+ *
+ * Context: Any context.
+ * Return: The current value of the global "consoles_suspended" flag.
+ */
+static inline short consoles_suspended_srcu_read(void)
+{
+	WARN_ON_ONCE(!console_srcu_read_lock_is_held());
+
+	/*
+	 * The READ_ONCE() matches the WRITE_ONCE() when "consoles_suspended"
+	 * is modified with consoles_suspended_srcu_write().
+	 */
+	return data_race(READ_ONCE(consoles_suspended));
+}
+
+/**
+ * consoles_suspended_srcu_write - Write the global flag for suspending
+ *			all consoles.
+ * @suspend: new value to write
+ *
+ * The write must be done under the console_list_lock. The caller is responsible
+ * for calling synchronize_srcu() to make sure that all callers checking the
+ * usability of registered consoles see the new state.
+ *
+ * Context: Any context.
+ */
+static inline void consoles_suspended_srcu_write(bool suspend)
+{
+	lockdep_assert_console_list_lock_held();
+
+	/* This matches the READ_ONCE() in consoles_suspended_srcu_read(). */
+	WRITE_ONCE(consoles_suspended, suspend);
+}
+
 /* Variant of console_is_registered() when the console_list_lock is held. */
 static inline bool console_is_registered_locked(const struct console *con)
 {
@@ -617,13 +659,15 @@ extern bool nbcon_kdb_try_acquire(struct console *con,
 extern void nbcon_kdb_release(struct nbcon_write_context *wctxt);
 
 /*
- * Check if the given console is currently capable and allowed to print
- * records. Note that this function does not consider the current context,
- * which can also play a role in deciding if @con can be used to print
- * records.
+ * This variant might be called under console_list_lock where both
+ * @flags and @all_suspended flags can be read directly.
  */
-static inline bool console_is_usable(struct console *con, short flags, bool use_atomic)
+static inline bool __console_is_usable(struct console *con, short flags,
+				       bool all_suspended, bool use_atomic)
 {
+	if (all_suspended)
+		return false;
+
 	if (!(flags & CON_ENABLED))
 		return false;
 
@@ -666,6 +710,20 @@ static inline bool console_is_usable(struct console *con, short flags, bool use_
 	return true;
 }
 
+/*
+ * Check if the given console is currently capable and allowed to print
+ * records. Note that this function does not consider the current context,
+ * which can also play a role in deciding if @con can be used to print
+ * records.
+ */
+static inline bool console_is_usable(struct console *con, short flags,
+				     bool use_atomic)
+{
+	bool all_suspended = consoles_suspended_srcu_read();
+
+	return __console_is_usable(con, flags, all_suspended, use_atomic);
+}
+
 #else
 static inline void nbcon_cpu_emergency_enter(void) { }
 static inline void nbcon_cpu_emergency_exit(void) { }
@@ -678,6 +736,8 @@ static inline void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt) { }
 static inline bool nbcon_kdb_try_acquire(struct console *con,
 					 struct nbcon_write_context *wctxt) { return false; }
 static inline void nbcon_kdb_release(struct nbcon_write_context *wctxt) { }
+static inline bool __console_is_usable(struct console *con, short flags,
+				       bool all_suspended, bool use_atomic) { return false; }
 static inline bool console_is_usable(struct console *con, short flags,
 				     bool use_atomic) { return false; }
 #endif
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 23a14e8c7a49..12247df07420 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -104,6 +104,13 @@ DEFINE_STATIC_SRCU(console_srcu);
  */
 int __read_mostly suppress_printk;
 
+/*
+ * Global flag for calming down all consoles during suspend.
+ * There is also a per-console flag which is used when the related
+ * device HW gets disabled, see CON_SUSPENDED.
+ */
+bool consoles_suspended;
+
 #ifdef CONFIG_LOCKDEP
 static struct lockdep_map console_lock_dep_map = {
 	.name = "console_lock"
@@ -2731,8 +2738,6 @@ MODULE_PARM_DESC(console_no_auto_verbose, "Disable console loglevel raise to hig
  */
 void console_suspend_all(void)
 {
-	struct console *con;
-
 	if (console_suspend_enabled)
 		pr_info("Suspending console(s) (use no_console_suspend to debug)\n");
 
@@ -2749,8 +2754,7 @@ void console_suspend_all(void)
 		return;
 
 	console_list_lock();
-	for_each_console(con)
-		console_srcu_write_flags(con, con->flags | CON_SUSPENDED);
+	consoles_suspended_srcu_write(true);
 	console_list_unlock();
 
 	/*
@@ -2765,7 +2769,6 @@ void console_suspend_all(void)
 void console_resume_all(void)
 {
 	struct console_flush_type ft;
-	struct console *con;
 
 	/*
 	 * Allow queueing irq_work. After restoring console state, deferred
@@ -2776,8 +2779,7 @@ void console_resume_all(void)
 
 	if (console_suspend_enabled) {
 		console_list_lock();
-		for_each_console(con)
-			console_srcu_write_flags(con, con->flags & ~CON_SUSPENDED);
+		consoles_suspended_srcu_write(false);
 		console_list_unlock();
 
 		/*

Best Regards,
Petr

From pmladek at suse.com Thu Nov 27 07:18:40 2025
From: pmladek at suse.com (Petr Mladek)
Date: Thu, 27 Nov 2025 16:18:40 +0100
Subject: [PATCH v2 0/4] printk cleanup - part 2
In-Reply-To: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
Message-ID:

On Fri 2025-11-21 15:50:32, Marcos Paulo de Souza wrote:
> The first part can be found here[1]. The proposed changes do not
> change the functionality of printk, but were suggestions made by
> Petr Mladek. I already have more patches for a part 3 ,but I would like
> to see these ones merged first.
>
> I did the testing with VMs, checking suspend and resume cycles, and it worked
> as expected.
>
> Thanks for reviewing!
> Marcos Paulo de Souza (4):
>   drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT
>   arch: um: kmsg_dump: Use console_is_usable
>   printk: Use console_is_usable on console_unblank

These three patches were simple, straightforward, and ready for linux
next.

I have committed them into printk/linux.git, branch rework/nbcon-in-kdb.
I am going to push them for 6.19.

>   printk: Make console_{suspend,resume} handle CON_SUSPENDED

This patch still needs some love and a v3.

Best Regards,
Petr

From bhe at redhat.com Thu Nov 27 19:33:19 2025
From: bhe at redhat.com (Baoquan He)
Date: Fri, 28 Nov 2025 11:33:19 +0800
Subject: [PATCH v4 11/12] arch/um: don't initialize kasan if it's disabled
In-Reply-To: <20251128033320.1349620-1-bhe@redhat.com>
References: <20251128033320.1349620-1-bhe@redhat.com>
Message-ID: <20251128033320.1349620-12-bhe@redhat.com>

Also do the kasan_arg_disabled check before enabling kasan_flag_enabled,
to make sure the kernel parameter kasan=on|off has been parsed.

Signed-off-by: Baoquan He
Cc: linux-um at lists.infradead.org
---
 arch/um/kernel/mem.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 39c4a7e21c6f..08cd012a6bb8 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -62,8 +62,11 @@ static unsigned long brk_end;
 
 void __init arch_mm_preinit(void)
 {
+#ifdef CONFIG_KASAN
 	/* Safe to call after jump_label_init(). Enables KASAN. */
-	kasan_init_generic();
+	if (!kasan_arg_disabled)
+		kasan_init_generic();
+#endif
 
 	/* clear the zero-page */
 	memset(empty_zero_page, 0, PAGE_SIZE);
-- 
2.41.0

From daniel at riscstar.com Fri Nov 28 01:52:24 2025
From: daniel at riscstar.com (Daniel Thompson)
Date: Fri, 28 Nov 2025 09:52:24 +0000
Subject: [PATCH v2 0/4] printk cleanup - part 2
In-Reply-To:
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
Message-ID:

On Thu, Nov 27, 2025 at 04:18:40PM +0100, Petr Mladek wrote:
> On Fri 2025-11-21 15:50:32, Marcos Paulo de Souza wrote:
> > The first part can be found here[1]. The proposed changes do not
> > change the functionality of printk, but were suggestions made by
> > Petr Mladek. I already have more patches for a part 3 ,but I would like
> > to see these ones merged first.
> >
> > I did the testing with VMs, checking suspend and resume cycles, and it worked
> > as expected.
> >
> > Thanks for reviewing!
>
> > Marcos Paulo de Souza (4):
> >   drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT
> >   arch: um: kmsg_dump: Use console_is_usable
> >   printk: Use console_is_usable on console_unblank
>
> These three patches were simple, straightforward, and ready for linux
> next.
>
> I have committed them into printk/linux.git, branch rework/nbcon-in-kdb.
> I am going to push them for 6.19.

I pointed the kgdb test suite at this branch (as I did for the earlier
part of the patchset, although I think I forgot to post about it).

The console coverage is fairly modest (I think just 8250 and PL011
drivers, with and without earlycon) and the suite exercises features
rather than crash resilience. Nevertheless and FWIW, the tests didn't
pick up any regressions. Yay!


Daniel.
From pmladek at suse.com Fri Nov 28 04:31:01 2025
From: pmladek at suse.com (Petr Mladek)
Date: Fri, 28 Nov 2025 13:31:01 +0100
Subject: [PATCH v2 0/4] printk cleanup - part 2
In-Reply-To:
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
Message-ID:

On Fri 2025-11-28 09:52:24, Daniel Thompson wrote:
> On Thu, Nov 27, 2025 at 04:18:40PM +0100, Petr Mladek wrote:
> > On Fri 2025-11-21 15:50:32, Marcos Paulo de Souza wrote:
> > > The first part can be found here[1]. The proposed changes do not
> > > change the functionality of printk, but were suggestions made by
> > > Petr Mladek. I already have more patches for a part 3 ,but I would like
> > > to see these ones merged first.
> > >
> > > I did the testing with VMs, checking suspend and resume cycles, and it worked
> > > as expected.
> > >
> > > Thanks for reviewing!
> >
> > > Marcos Paulo de Souza (4):
> > >   drivers: serial: kgdboc: Drop checks for CON_ENABLED and CON_BOOT
> > >   arch: um: kmsg_dump: Use console_is_usable
> > >   printk: Use console_is_usable on console_unblank
> >
> > These three patches were simple, straightforward, and ready for linux
> > next.
> >
> > I have committed them into printk/linux.git, branch rework/nbcon-in-kdb.
> > I am going to push them for 6.19.
>
> I pointed the kgdb test suite at this branch (as I did for the earlier
> part of the patchset, although I think I forgot to post about it).
>
> The console coverage is fairly modest (I think just 8250 and PL011
> drivers, with and without earlycon) and the suite exercises features
> rather than crash resilience. Nevertheless and FWIW, the tests didn't
> pick up any regressions. Yay!

Great news! Thanks a lot for doing the test and sharing results.

Best Regards,
Petr

From thehajime at gmail.com Fri Nov 28 04:57:55 2025
From: thehajime at gmail.com (Hajime Tazaki)
Date: Fri, 28 Nov 2025 21:57:55 +0900
Subject: [PATCH v13 00/13] nommu UML
In-Reply-To:
References: <0a84c16f862026c82271c0adbc91d98b812a78b4.camel@sipsolutions.net>
Message-ID:

On Tue, 25 Nov 2025 18:58:53 +0900, Johannes Berg wrote:
>
> On Wed, 2025-11-12 at 17:52 +0900, Hajime Tazaki wrote:
> > > > What is it for ?
> > > > ================
> > > >
> > > > - Alleviate syscall hook overhead implemented with ptrace(2)
> > > > - To exercises nommu code over UML (and over KUnit)
> > > > - Less dependency to host facilities
> > >
> > > FWIW, in some way, this order of priorities is exactly why this hasn't
> > > been going anywhere, and every time I looked at it I got somewhat
> > > annoyed by what seems to me like choices made to support especially the
> > > first bullet.
> >
> > over the past versions, I've been emphasized that the 2nd bullet (testing)
> > is the primary usecase as I saw several actually cases from mm folks,
> >
> > https://lists.infradead.org/pipermail/maple-tree/2024-November/003775.html
> > https://lore.kernel.org/all/cb1cf0be-871d-4982-9a1b-5fdd54deec8d at lucifer.local/
> >
> > and I think this is not limited to mm code.
>
> Not sure there's much value in testing much else in no-MMU, but sure,
> I'll give you that it's useful for testing.

in the tree,

 % global -xr CONFIG_MMU | grep ifndef | grep -v -E "arch/|mm/" | wc -l
 45

this is a rough picture, but there are places to be tested other than
the mm codebase.

> > other 2 bullets are additional benefits which we observed in a
> > comment, and our experience.
>
> But are they really _worthwhile_ benefits? A lot of this design adds
> additional complexity, and it doesn't really seem necessary for the
> testing use case. Making it faster is nice, but it's not like the
> speedup really is 20x for arbitrary tests, that's just for corner cases
> like "sit in a loop of gettimeofday()". And for kunit there's no syscall
> boundary at all, so there's no speedup.

I agree, and as I said, the reason to take a single-host-process
approach is the speed and the simplicity gained by removing the
interaction between host processes.

I have never claimed that tests should execute fast, and I agree that
kunit doesn't benefit from the speed as there is no syscall (unless the
kunit-uapi patch gets in).

> > > I suspect that the first and third bullet are not even really true any
> > > more, since you moved to seccomp (per our request), yet I think design
> > > choices influenced by them persist.
> >
> > this observation is not true; the first bullet is still true even
> > using seccomp. please look at the benchmark result in the patch
> > [12/13], quoted below.
> >
> [snip]
>
> So thanks for the correction. If that's the case, however, it means the
> speedup can't be due to the syscall boundary itself (seccomp) but must
> rather be due to some pagefault/mapping handling issue? Which would be
> inherent in no-MMU, even taking an approach of using two host processes
> rather than embedding everything into one.

I'll explain this later in this email.

# nommu doesn't have page faults as there are only physical addresses.

> > > However, I'm not yet convinced that all of the complexities presented in
> > > this patchset (such as completely separate seccomp implementation) are
> > > actually necessary in support of _just_ the second bullet. These seem to
> > > me like design choices necessary to support the _first_ bullet [1].
> >
> > separate seccomp implementation is indeed needed due to the design
> > choice we made, to use a single process to host a (um) userspace.
>
> That sounds misleading or even wrong to me, I'd say it's due to putting
> the (um) userspace in the same host process as the kernel space?

not sure if this is different from my explanation...

> > I don't see why you see this as a _complexity_, as functionally both
> > seccomp handling don't interfere each other.
>
> The complexity isn't so much in the separate code, which is a small
> factor, but in the "put everything into the same process" aspect of it.
> That has consequences around the host context state handling, things we
> didn't really need to consider before suddenly become crucially
> important. In the current (with-MMU) design, we only need to worry about
> being able to correctly switch between userspace tasks/threads within a
> userspace mm (host) process. With the no-MMU design you propose, we also
> need to be able to correctly switch between kernel and userspace tasks
> within the same single (host) process.
>
> I think this is a pretty significant difference, and saying "there's no
> complexity here" is simply pretending it isn't a relevant difference. I
> believe you're not even handling this correctly right now in this patch
> set, specifically wrt. the GS register which has been pointed out
> before, but I wouldn't say that I even have a complete picture in my
> head over what state handling would be necessary and sufficient.
>
> So yeah, I think this warrants taking another look as to whether or not
> the approach of putting everything into the same host process is even
> worth it. I tend to believe that it isn't, given the use cases. And if
> you say the speedup still is with seccomp, that kills the speed argument
> too.

I understand your concern about complexity, thanks for the detail.

the host context state handling is indeed a new thing. we've only
verified a limited set of code paths, with basic operation of um +
drivers and some userspace programs. it is probably not perfect at this
moment, but it can be improved.

> > > I've thought about what would happen if we stuck to creating a (single)
> > > separate process on the host to execute userspace, and just used
> > > CLONE_VM for it.
> > > That way, it's still no-MMU with full memory access,
> > > but there's some implicit isolation between the kernel and userspace
> > > processes which will likely remove complexities around FP/SSE/AVX
> > > handling, may completely remove the need for a separate seccomp
> > > implementation, etc.
> >
> > this would be doable I think, but we went the different way, as
> > using separate host processes (with ptrace/seccomp) is slow and add
> > complexity by the synchronization between processes, which we think
> > it's not easy to maintain in the future.
>
> Which one is it then, slow or not? Not sure I follow. You just said you
> do have seccomp when comparing speeds, so that in itself doesn't make it
> slow. What synchronization? It'd (have to) be CLONE_VM, but that
> actually _simplifies_ state transfer/synchronization, and we already
> have (to have) state transfer between different userspace threads in the
> same host process for the with-MMU case.

Since I included speed characteristics in the document, I should
explain more about their impact, compared to the existing
design/implementation of uml.

many documents and articles say uml is slow (the in-tree uml document
also mentions it briefly), but I could not find a detailed analysis, so
I looked closely at how nommu (w/ seccomp) and mmu (w/ seccomp) behave.

suppose we have a userspace program running under uml (on seccomp-mmu
and seccomp-nommu).

  struct timespec ts1, ts2;

  clock_gettime(CLOCK_REALTIME, &ts1); // 1)
  getpid();                            // 2)
  clock_gettime(CLOCK_REALTIME, &ts2); // 3)

# this is a chunk from the benchmark program used in the document.

I then collected several events (sched_switch, signal_generate, and
sys_enter_futex) via ftrace. looking at the 3 SIGSYS (sig=31) signals
raised by the code above, below is the output of `trace-cmd report`.

- ftrace seccomp-mmu, 2)-3) = 11 usec

   uml-userspace-3092637 [002] 1749286.670199: signal_generate: sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0 => 1)
   uml-userspace-3092637 [002] 1749286.670200: sys_enter_futex: op=FUTEX_WAKE uaddr=0x7fffffffdf8c val=1
   uml-userspace-3092637 [002] 1749286.670201: sys_enter_futex: op=FUTEX_WAIT uaddr=0x7fffffffdf8c val=0x00000001 utime=0x00000000
   uml-userspace-3092637 [002] 1749286.670202: sched_switch: uml-userspace:3092637 [120] S ==> swapper/2:0 [120]
          <idle>-0       [028] 1749286.670203: sched_switch: swapper/28:0 [120] R ==> vmlinux:3092631 [120]
         vmlinux-3092631 [028] 1749286.670205: sys_enter_futex: op=FUTEX_WAKE uaddr=0x60b64f8c val=1
         vmlinux-3092631 [028] 1749286.670206: sys_enter_futex: op=FUTEX_WAIT uaddr=0x60b64f8c val=0x00000000 utime=0x00000000
         vmlinux-3092631 [028] 1749286.670207: sched_switch: vmlinux:3092631 [120] S ==> swapper/28:0 [120]
          <idle>-0       [002] 1749286.670209: sched_switch: swapper/2:0 [120] R ==> uml-userspace:3092637 [120]
   uml-userspace-3092637 [002] 1749286.670211: signal_generate: sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0 => 2)
   uml-userspace-3092637 [002] 1749286.670212: sys_enter_futex: op=FUTEX_WAKE uaddr=0x7fffffffdf8c val=1
   uml-userspace-3092637 [002] 1749286.670213: sys_enter_futex: op=FUTEX_WAIT uaddr=0x7fffffffdf8c val=0x00000001 utime=0x00000000
   uml-userspace-3092637 [002] 1749286.670214: sched_switch: uml-userspace:3092637 [120] S ==> swapper/2:0 [120]
          <idle>-0       [028] 1749286.670215: sched_switch: swapper/28:0 [120] R ==> vmlinux:3092631 [120]
         vmlinux-3092631 [028] 1749286.670216: sys_enter_futex: op=FUTEX_WAKE uaddr=0x60b64f8c val=1
         vmlinux-3092631 [028] 1749286.670217: sys_enter_futex: op=FUTEX_WAIT uaddr=0x60b64f8c val=0x00000000 utime=0x00000000
         vmlinux-3092631 [028] 1749286.670218: sched_switch: vmlinux:3092631 [120] S ==> swapper/28:0 [120]
          <idle>-0       [002] 1749286.670220: sched_switch: swapper/2:0 [120] R ==> uml-userspace:3092637 [120]
   uml-userspace-3092637 [002] 1749286.670222: signal_generate: sig=31 errno=0 code=1 comm=uml-userspace pid=3092637 grp=0 res=0 => 3)

- ftrace seccomp-nommu, 2)-3) = 3 usec

         vmlinux-3092542 [006] 1749158.829292: signal_generate: sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0 => 1)
         vmlinux-3092542 [006] 1749158.829294: signal_generate: sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0 => 2)
         vmlinux-3092542 [006] 1749158.829297: signal_generate: sig=31 errno=0 code=1 comm=vmlinux pid=3092542 grp=0 res=0 => 3)

with seccomp-mmu, the host process for userspace (uml-userspace) is
notified with SIGSYS (sig=31) upon a syscall from userspace, the host
switches tasks to vmlinux (the um kernel) using the futex wake/wait
synchronization (this is the synchronization I meant in my previous
email), and then switches back to uml-userspace to continue the
userspace process. so there are at least 4 host sched_switch-es per
single um syscall.

with the current nommu using a single host process, the notification via
SIGSYS is the same as with seccomp-mmu, but after that there is no
context switch upon a syscall issued by userspace; execution stays in
the same context until the next syscall.

a nommu implementation with CLONE_VM (btw, the host process
uml-userspace is already created with the CLONE_VM flag IIUC) might face
a similar situation to seccomp-mmu, seeing the same switches between
processes.

this explains the difference between the getpid benchmark results, where
um-mmu (seccomp) is roughly 10x slower than um-nommu (seccomp) (26.242
and 2.599 usec) (this was described as an example benchmark in the
patchset). I didn't look at the ptrace mode of MMU, but I expect to see
a similar (or longer) duration for a single syscall.

in addition to the ftrace measurement above, I conducted more practical
benchmarks with iperf3 (forward/reverse path) and netperf
(TCP_STREAM/MAERTS), which I believe aren't corner cases, and below is
the result. all use the vector driver with gro on, via host tap devices.
the iperf3/netperf server runs on the host and the client runs inside
uml.
# I can give a complete script to reproduce this if needed.

- iperf3 (Mbps)

                   um-mmu(seccomp)  um-nommu(seccomp)
  --------------------------------------------------
  iperf3(f)                   7984              13152
  iperf3(r)                   8009              14363

- netperf (Mbps, bufsize=65507bytes)

                   um-mmu(seccomp)  um-nommu(seccomp)
  --------------------------------------------------
  netperf(STREAM)          5912.93           10792.02
  netperf(MAERTS)         29263.53           33970.06

the difference is not as significant as the one we saw in the simple
syscall benchmark with getpid(2), but there is still a visible impact.

I would say these results only show some of the cases UML can handle,
and different workloads may show different results, but it is still
valuable to present one of the benefits in order to show the nature of
the feature (what the single-process design can do).

Of course, nommu comes with various limitations as I described in the
document; for example, applications should be aware that the kernel is
nommu (i.e., they need to use vfork, PIE binaries, etc). So traditional
uml is more generic and has broader usage, but given this speed
characteristic of nommu, I think it is worthwhile, and users benefit
from it if they need speed.

I hope this clarifies a bit.

-- Hajime

From mpdesouza at suse.com Fri Nov 28 04:59:25 2025
From: mpdesouza at suse.com (Marcos Paulo de Souza)
Date: Fri, 28 Nov 2025 09:59:25 -0300
Subject: [PATCH v2 0/4] printk cleanup - part 2
In-Reply-To:
References: <20251121-printk-cleanup-part2-v2-0-57b8b78647f4@suse.com>
Message-ID:

On Fri, 2025-11-28 at 09:52 +0000, Daniel Thompson wrote:
> On Thu, Nov 27, 2025 at 04:18:40PM +0100, Petr Mladek wrote:
> > On Fri 2025-11-21 15:50:32, Marcos Paulo de Souza wrote:
> > > The first part can be found here[1]. The proposed changes do not
> > > change the functionality of printk, but were suggestions made by
> > > Petr Mladek. I already have more patches for a part 3 ,but I
> > > would like
> > > to see these ones merged first.
> > >
> > > I did the testing with VMs, checking suspend and resume cycles,
> > > and it worked
> > > as expected.
> > >
> > > Thanks for reviewing!
> >
> > > Marcos Paulo de Souza (4):
> > >   drivers: serial: kgdboc: Drop checks for CON_ENABLED and
> > > CON_BOOT
> > >   arch: um: kmsg_dump: Use console_is_usable
> > >   printk: Use console_is_usable on console_unblank
> >
> > These three patches were simple, straightforward, and ready for
> > linux
> > next.
> >
> > I have committed them into printk/linux.git, branch rework/nbcon-in-
> > kdb.
> > I am going to push them for 6.19.
>
> I pointed the kgdb test suite at this branch (as I did for the
> earlier
> part of the patchset, although I think I forgot to post about it).
>
> The console coverage is fairly modest (I think just 8250 and PL011
> drivers, with and without earlycon) and the suite exercises features
> rather than crash resilience. Nevertheless and FWIW, the tests didn't
> pick up any regressions. Yay!

Thanks Daniel! I remember that you said that you would run the
testsuite in the previous patchset, but didn't want to bother asking
you (I believe that if you found anything you would point it out either
way :) ).

>
>
> Daniel.

From chleroy at kernel.org Sat Nov 29 01:56:02 2025
From: chleroy at kernel.org (Christophe Leroy (CS GROUP))
Date: Sat, 29 Nov 2025 10:56:02 +0100
Subject: [PATCH] um: Disable KASAN_INLINE when STATIC_LINK is selected
Message-ID: <2620ab0bbba640b6237c50b9c0dca1c7d1142f5d.1764410067.git.chleroy@kernel.org>

um doesn't support KASAN_INLINE together with STATIC_LINK.

Instead of failing the build, disable KASAN_INLINE when
STATIC_LINK is selected.

Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-kbuild-all/202511290451.x9GZVJ1l-lkp at intel.com/
Fixes: 1e338f4d99e6 ("kasan: introduce ARCH_DEFER_KASAN and unify static key across modes")
Signed-off-by: Christophe Leroy (CS GROUP)
---
 arch/um/Kconfig             | 1 +
 arch/um/include/asm/kasan.h | 4 ----
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 49781bee7905..93ed850d508e 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -5,6 +5,7 @@ menu "UML-specific options"
 config UML
 	bool
 	default y
+	select ARCH_DISABLE_KASAN_INLINE if STATIC_LINK
 	select ARCH_NEEDS_DEFER_KASAN if STATIC_LINK
 	select ARCH_WANTS_DYNAMIC_TASK_STRUCT
 	select ARCH_HAS_CACHE_LINE_SIZE
diff --git a/arch/um/include/asm/kasan.h b/arch/um/include/asm/kasan.h
index b54a4e937fd1..81bcdc0f962e 100644
--- a/arch/um/include/asm/kasan.h
+++ b/arch/um/include/asm/kasan.h
@@ -24,10 +24,6 @@
 
 #ifdef CONFIG_KASAN
 void kasan_init(void);
-
-#if defined(CONFIG_STATIC_LINK) && defined(CONFIG_KASAN_INLINE)
-#error UML does not work in KASAN_INLINE mode with STATIC_LINK enabled!
-#endif
 #else
 static inline void kasan_init(void) { }
 #endif /* CONFIG_KASAN */
-- 
2.49.0