From rppt at kernel.org Mon Jun 1 00:00:05 2026 From: rppt at kernel.org (Mike Rapoport) Date: Mon, 1 Jun 2026 10:00:05 +0300 Subject: [PATCH v2 0/3] kho: Add support for kunit mocking KHO restore API In-Reply-To: <20260521193202.746810-1-skhawaja@google.com> References: <20260521193202.746810-1-skhawaja@google.com> Message-ID: Hi Samiullah, On Thu, May 21, 2026 at 07:31:59PM +0000, Samiullah Khawaja wrote: > To write kunit tests for preservation and restoration of liveupdate > state in various subsystems without triggering the actual kexec, the KHO > restore API needs to be mocked by the test writer. The mocking is done > to allow testing of the individual components or functions in isolation. > > The patch series adds the following to support kunit testing when using the KHO > API: > > - Add static stub hooks to mock the KHO restore API so the restore path > can be tested without triggering kexec. > - Add helper function that can be used by the test writer to check if > memory is preserved in KHO tree. > > Finally, it adds a KUnit test for the KHO API that verifies the allocation of > preserved memory, and the preservation/restoration of pages and folios. I looked at the tests for preservation and apparently they don't add coverage beyond the existing KHO selftest. How hard and/or intrusive would be adding tests for example for error paths? Do you have an example of a kunit test for another subsystem that would benefit from mocking of KHO APIs? > KHO Kunit test run: > > KTAP version 1 > 1..1 > KTAP version 1 > # Subtest: kho_test > # module: kexec_handover_test > 1..3 > ok 1 kho_test_alloc_preserve > ok 2 kho_test_preserve_pages > ok 3 kho_test_preserve_folio > # kho_test: pass:3 fail:0 skip:0 total:3 > # Totals: pass:3 fail:0 skip:0 total:3 > ok 1 kho_test > > v2: > - Move kunit header includes above linux header includes. > - Use the __kho_preserve_pages_order() to get the order of preserved > pages instead of open order calculation math. 
> > Samiullah Khawaja (3): > kho: Add kunit static stubs > kho: Add helper function to check if pages are preserved > kho: Add kunit test to verify preserve/restore pages and folio > > include/linux/kexec_handover.h | 5 + > kernel/liveupdate/Kconfig | 10 ++ > kernel/liveupdate/Makefile | 1 + > kernel/liveupdate/kexec_handover.c | 63 +++++++++++- > kernel/liveupdate/kexec_handover_test.c | 131 ++++++++++++++++++++++++ > 5 files changed, 209 insertions(+), 1 deletion(-) > create mode 100644 kernel/liveupdate/kexec_handover_test.c > > > base-commit: ec4084bc445027a52f600e30a976928be1ba1950 > -- > 2.54.0.746.g67dd491aae-goog > -- Sincerely yours, Mike. From dongtai.guo at linux.dev Mon Jun 1 02:28:20 2026 From: dongtai.guo at linux.dev (George Guo) Date: Mon, 1 Jun 2026 17:28:20 +0800 Subject: [PATCH v3 0/3] LoongArch: add KHO support and selftests Message-ID: <20260601092823.110362-1-dongtai.guo@linux.dev> From: George Guo This series adds Kexec Handover (KHO) support for LoongArch and extends the KHO selftest infrastructure to run on LoongArch under QEMU. KHO passes metadata (the KHO state FDT and scratch area addresses) to the second kernel via the FDT /chosen node, using the linux,kho-fdt and linux,kho-scratch properties that drivers/of/kexec.c:kho_add_chosen() writes and drivers/of/fdt.c:early_init_dt_check_kho() reads. KHO support (patches 1-2): Patch 1 adds KHO support for FDT-based systems (initial_boot_params != NULL, e.g. QEMU virt without OVMF). kho_load_fdt() copies the running kernel's FDT, appends linux,kho-fdt and linux,kho-scratch to /chosen, and loads the result as a kexec segment. machine_kexec() updates the DEVICE_TREE_GUID entry in the EFI config table to point to this segment so the second kernel's fdt_setup() can find and parse it. Patch 2 adds KHO support for ACPI-only systems (initial_boot_params == NULL, e.g. LoongArch servers with UEFI or QEMU with OVMF). 
Because no system FDT is available, kho_load_fdt() builds a minimal FDT from scratch containing only /chosen with the two KHO properties. Since DEVICE_TREE_GUID is absent from the EFI config table on ACPI-only systems, a new extended config table is built with the entry appended and loaded as a kexec segment; machine_kexec() switches st->tables to point to it before jumping. The second kernel's fdt_setup() calls efi_fdt_pointer() to detect the KHO FDT and passes it to early_init_dt_check_kho(). Selftest support (patch 3): Patch 3 adds loongarch.conf and extends vmtest.sh to recognise loongarch64 as a build target. The LoongArch virt machine is FDT-only (no ACPI), so 'earlycon' must appear on the kernel cmdline or the console UART is never discovered. PS/2 input devices are disabled since QEMU's LoongArch virt machine has no i8042 controller; the fallback port probe hits a page fault and panics before reaching userspace. QEMU provides no EFI runtime services on LoongArch, so machine_restart() falls through to an infinite idle loop after kexec; QEMU_TIMEOUT=120 in loongarch.conf lets timeout(1) terminate QEMU once the time limit is reached. Changes in v3: - Merge selftest patches 3 and 4 from v2 into a single patch - Replace QEMU_NEEDS_KILL/background kill loop with QEMU_TIMEOUT/timeout(1); the timeout value is set per-arch in the conf file. 
George Guo (3): LoongArch: kexec: add KHO support for FDT-based systems LoongArch: kexec: add KHO support for ACPI-only systems selftests/kho: add LoongArch vmtest support arch/loongarch/Kconfig | 3 + arch/loongarch/include/asm/kexec.h | 7 + arch/loongarch/kernel/machine_kexec.c | 38 +++ arch/loongarch/kernel/machine_kexec_file.c | 256 +++++++++++++++++++++ arch/loongarch/kernel/setup.c | 21 +- tools/testing/selftests/kho/loongarch.conf | 13 ++ tools/testing/selftests/kho/vmtest.sh | 23 +- 7 files changed, 353 insertions(+), 8 deletions(-) create mode 100644 tools/testing/selftests/kho/loongarch.conf -- 2.25.1 From dongtai.guo at linux.dev Mon Jun 1 02:39:30 2026 From: dongtai.guo at linux.dev (George Guo) Date: Mon, 1 Jun 2026 17:39:30 +0800 Subject: [PATCH v3 3/3] selftests/kho: add LoongArch vmtest support In-Reply-To: <20260601093930.112758-1-dongtai.guo@linux.dev> References: <20260601092823.110362-1-dongtai.guo@linux.dev> <20260601093930.112758-1-dongtai.guo@linux.dev> Message-ID: <20260601093930.112758-3-dongtai.guo@linux.dev> From: George Guo Add loongarch.conf to configure QEMU's LoongArch virt machine with a la464 CPU, enable the 8250 serial console, and set the kernel image to vmlinux.efi. Extend vmtest.sh to recognise loongarch64 as a supported target and map it to the 'loongarch' kernel arch name. QEMU's LoongArch virt machine provides no ACPI tables and relies on FDT to describe hardware. Without 'earlycon' on the kernel command line, the FDT is not scanned for a console UART, no output reaches the console, and vmtest.sh's console log stays empty causing the test to always fail. Add 'earlycon' to KERNEL_CMDLINE in loongarch.conf. QEMU's LoongArch virt machine has no i8042 PS/2 controller. When PNP detection finds nothing, i8042_init() falls back to probing the ports directly. 
On LoongArch the I/O ports are memory-mapped, and the i8042 port addresses are not backed by any device on the virt machine, so i8042_flush() takes a page fault and the kernel panics: i8042: PNP: No PS/2 controller found. i8042: Probing ports directly. CPU 0 Unable to handle kernel paging request at virtual address ffff800000008064 ERA: i8042_flush+0x50/0x198 RA: i8042_init+0x2a8/0x35c Kernel panic - not syncing: Attempted to kill init! Disable SERIO_I8042 and its dependents (KEYBOARD_ATKBD, MOUSE_PS2) in the QEMU_KCONFIG fragment to prevent the driver from being built. All three options are scoped to loongarch.conf; no other architecture is affected. QEMU provides no EFI runtime services on LoongArch, so machine_restart() falls through to an infinite idle loop after kexec. Set QEMU_TIMEOUT=120 in loongarch.conf so vmtest.sh wraps the QEMU invocation with timeout(1), which terminates QEMU after 120 seconds if it does not exit on its own. Architectures that do not set QEMU_TIMEOUT are unaffected. 
Co-developed-by: Kexin Liu Signed-off-by: Kexin Liu Signed-off-by: George Guo --- tools/testing/selftests/kho/loongarch.conf | 13 ++++++++++++ tools/testing/selftests/kho/vmtest.sh | 23 +++++++++++++++------- 2 files changed, 29 insertions(+), 7 deletions(-) create mode 100644 tools/testing/selftests/kho/loongarch.conf diff --git a/tools/testing/selftests/kho/loongarch.conf b/tools/testing/selftests/kho/loongarch.conf new file mode 100644 index 000000000000..68727654578d --- /dev/null +++ b/tools/testing/selftests/kho/loongarch.conf @@ -0,0 +1,13 @@ +QEMU_CMD="qemu-system-loongarch64 -M virt -cpu la464" +QEMU_KCONFIG=" +CONFIG_SERIAL_8250=y +CONFIG_SERIAL_8250_CONSOLE=y +# CONFIG_KEYBOARD_ATKBD is not set +# CONFIG_MOUSE_PS2 is not set +# CONFIG_SERIO_I8042 is not set +" +KERNEL_IMAGE="vmlinux.efi" +KERNEL_CMDLINE="console=ttyS0 earlycon" +# QEMU never exits after kexec on LoongArch (no EFI runtime services); +# give the test a fixed time limit and let timeout(1) terminate QEMU. +QEMU_TIMEOUT=120 diff --git a/tools/testing/selftests/kho/vmtest.sh b/tools/testing/selftests/kho/vmtest.sh index 49fdac8e8b15..918698b6dd2a 100755 --- a/tools/testing/selftests/kho/vmtest.sh +++ b/tools/testing/selftests/kho/vmtest.sh @@ -21,7 +21,7 @@ Options: -d) path to the kernel build directory -j) number of jobs for compilation, similar to -j in make -t) run test for target_arch, requires CROSS_COMPILE set - supported targets: aarch64, x86_64 + supported targets: aarch64, x86_64, loongarch64 -h) display this help EOF } @@ -107,12 +107,20 @@ function run_qemu() { cmdline="$cmdline kho=on panic=-1" - $qemu_cmd -m 1G -smp 2 -no-reboot -nographic -nodefaults \ - -accel kvm -accel hvf -accel tcg \ - -serial file:"$serial" \ - -append "$cmdline" \ - -kernel "$kernel" \ - -initrd "$initrd" + local qemu_args=( + -m 1G -smp 2 -no-reboot -nographic -nodefaults + -accel kvm -accel hvf -accel tcg + -serial file:"$serial" + -append "$cmdline" + -kernel "$kernel" + -initrd "$initrd" + ) + + if [[ 
-n "${QEMU_TIMEOUT:-}" ]]; then + timeout "$QEMU_TIMEOUT" $qemu_cmd "${qemu_args[@]}" || true + else + $qemu_cmd "${qemu_args[@]}" + fi grep "KHO restore succeeded" "$serial" &> /dev/null || fail "KHO failed" } @@ -123,6 +131,7 @@ function target_to_arch() { case $target in aarch64) echo "arm64" ;; x86_64) echo "x86" ;; + loongarch64) echo "loongarch" ;; *) skip "architecture $target is not supported" esac } -- 2.25.1

From pratyush at kernel.org Mon Jun 1 04:52:14 2026
From: pratyush at kernel.org (Pratyush Yadav)
Date: Mon, 01 Jun 2026 13:52:14 +0200
Subject: [liveupdate:next 11/21] kernel/liveupdate/luo_session.c:344:48: error: too few arguments provided to function-like macro invocation
In-Reply-To: <202606011344.RHiYuqso-lkp@intel.com> (kernel test robot's message of "Mon, 01 Jun 2026 13:25:29 +0800")
References: <202606011344.RHiYuqso-lkp@intel.com>
Message-ID: <2vxzik82h40h.fsf@kernel.org>

On Mon, Jun 01 2026, kernel test robot wrote:

> tree: https://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git next
> head: f4b4f5fe58a55a4212a263d9d44416778ca6e7a7
> commit: 74cab0be9a5d9d91471c4dee7311dcdfc1c0a6f4 [11/21] liveupdate: validate session type before performing operation
> config: x86_64-buildonly-randconfig-003-20260601 (https://download.01.org/0day-ci/archive/20260601/202606011344.RHiYuqso-lkp@intel.com/config)
> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260601/202606011344.RHiYuqso-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e.
not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot
> | Closes: https://lore.kernel.org/oe-kbuild-all/202606011344.RHiYuqso-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
>>> kernel/liveupdate/luo_session.c:344:48: error: too few arguments provided to function-like macro invocation
> 344 | struct liveupdate_session_retrieve_fd, token),
> | ^
> kernel/liveupdate/luo_session.c:327:9: note: macro 'IOCTL_OP' defined here
> 327 | #define IOCTL_OP(_ioctl, _fn, _struct, _last, _type) \
> | ^
>>> kernel/liveupdate/luo_session.c:343:2: error: use of undeclared identifier 'IOCTL_OP'
> 343 | IOCTL_OP(LIVEUPDATE_SESSION_RETRIEVE_FD, luo_session_retrieve_fd,
> | ^
>>> kernel/liveupdate/luo_session.c:378:6: error: invalid application of 'sizeof' to an incomplete type 'const struct luo_ioctl_op[]'
> 378 | ARRAY_SIZE(luo_session_ioctl_ops)) {
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> include/linux/array_size.h:11:32: note: expanded from macro 'ARRAY_SIZE'
> 11 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))
> | ^~~~~
> 3 errors generated.

This happens because the patch got moved from the fixes branch to next,
and next has a new ioctl. Will send a fixup soon.

[...]
-- 
Regards,
Pratyush Yadav

From pratyush at kernel.org Mon Jun 1 04:57:19 2026
From: pratyush at kernel.org (Pratyush Yadav)
Date: Mon, 01 Jun 2026 13:57:19 +0200
Subject: [liveupdate:next 11/21] kernel/liveupdate/luo_session.c:344:48: error: too few arguments provided to function-like macro invocation
In-Reply-To: <2vxzik82h40h.fsf@kernel.org> (Pratyush Yadav's message of "Mon, 01 Jun 2026 13:52:14 +0200")
References: <202606011344.RHiYuqso-lkp@intel.com> <2vxzik82h40h.fsf@kernel.org>
Message-ID: <2vxzcxyah3s0.fsf@kernel.org>

On Mon, Jun 01 2026, Pratyush Yadav wrote:

> On Mon, Jun 01 2026, kernel test robot wrote:
>
>> tree: https://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git next
>> head: f4b4f5fe58a55a4212a263d9d44416778ca6e7a7
>> commit: 74cab0be9a5d9d91471c4dee7311dcdfc1c0a6f4 [11/21] liveupdate: validate session type before performing operation
>> config: x86_64-buildonly-randconfig-003-20260601 (https://download.01.org/0day-ci/archive/20260601/202606011344.RHiYuqso-lkp@intel.com/config)
>> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
>> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260601/202606011344.RHiYuqso-lkp@intel.com/reproduce)
>>
>> If you fix the issue in a separate patch/commit (i.e.
not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot
>> | Closes: https://lore.kernel.org/oe-kbuild-all/202606011344.RHiYuqso-lkp@intel.com/
>>
>> All errors (new ones prefixed by >>):
>>
>>>> kernel/liveupdate/luo_session.c:344:48: error: too few arguments provided to function-like macro invocation
>> 344 | struct liveupdate_session_retrieve_fd, token),
>> | ^
>> kernel/liveupdate/luo_session.c:327:9: note: macro 'IOCTL_OP' defined here
>> 327 | #define IOCTL_OP(_ioctl, _fn, _struct, _last, _type) \
>> | ^
>>>> kernel/liveupdate/luo_session.c:343:2: error: use of undeclared identifier 'IOCTL_OP'
>> 343 | IOCTL_OP(LIVEUPDATE_SESSION_RETRIEVE_FD, luo_session_retrieve_fd,
>> | ^
>>>> kernel/liveupdate/luo_session.c:378:6: error: invalid application of 'sizeof' to an incomplete type 'const struct luo_ioctl_op[]'
>> 378 | ARRAY_SIZE(luo_session_ioctl_ops)) {
>> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> include/linux/array_size.h:11:32: note: expanded from macro 'ARRAY_SIZE'
>> 11 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))
>> | ^~~~~
>> 3 errors generated.
>
> This happens because the patch got moved from the fixes branch to next,
> and next has a new ioctl. Will send a fixup soon.

Never mind. Seems like this is already fixed (thanks to Mike I think?).
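
The first diagnostic in the quoted report is an arity mismatch: IOCTL_OP is defined with five parameters (_ioctl, _fn, _struct, _last, _type), while the invocation at luo_session.c:343-344 passes four, so the preprocessor refuses to expand it. The other two errors most likely follow from that. Below is a minimal userspace sketch of the same table-building pattern. It assumes nothing about the real macro body: DEMO_OP, struct demo_ioctl_op, demo_retrieve_fd(), demo_dispatch() and the meaning given to the fifth argument are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical ioctl argument: a size field followed by one payload field. */
struct demo_retrieve_fd_arg {
	uint64_t size;
	uint32_t token;
};

/* Hypothetical dispatch-table entry. */
struct demo_ioctl_op {
	unsigned int cmd;
	int (*fn)(void *arg);
	size_t min_size;	/* bytes up to and including the last known field */
	int type;		/* assumed meaning of the fifth macro argument */
};

/* Five parameters: every invocation has to pass all five. */
#define DEMO_OP(_cmd, _fn, _struct, _last, _type)			\
	{								\
		.cmd = (_cmd),						\
		.fn = (_fn),						\
		.min_size = offsetof(_struct, _last) +			\
			    sizeof(((_struct *)0)->_last),		\
		.type = (_type),					\
	}

static int demo_retrieve_fd(void *arg)
{
	(void)arg;
	return 0;
}

static const struct demo_ioctl_op demo_ops[] = {
	/*
	 * Dropping the trailing argument here makes clang report
	 * "too few arguments provided to function-like macro invocation".
	 */
	DEMO_OP(1, demo_retrieve_fd, struct demo_retrieve_fd_arg, token, 0),
};

/* Look up @cmd in the table and call its handler; -1 if not found. */
static int demo_dispatch(unsigned int cmd, void *arg)
{
	for (size_t i = 0; i < DEMO_ARRAY_SIZE(demo_ops); i++) {
		if (demo_ops[i].cmd == cmd) {
			assert(demo_ops[i].fn != NULL);
			return demo_ops[i].fn(arg);
		}
	}
	return -1;
}
```

When a patch that adds a macro parameter is moved onto a branch that has gained another table entry, that new entry still uses the old argument count, which is the situation described in the reply above.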
-- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 05:08:46 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 14:08:46 +0200 Subject: [PATCH v4 01/13] liveupdate: change file_set->count type to u64 for type safety In-Reply-To: <20260530221938.115978-2-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:26 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-2-pasha.tatashin@soleen.com> Message-ID: <2vxz8q8yh38x.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > This improves type safety and aligns the in-memory file_set->count with > the serialized count type. It avoids potential truncation or sign > conversion mismatch issues. > > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) [...] -- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 05:15:14 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 14:15:14 +0200 Subject: [PATCH v4 02/13] liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd In-Reply-To: <20260530221938.115978-3-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:27 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-3-pasha.tatashin@soleen.com> Message-ID: <2vxz4ijmh2y5.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > Refactoring luo_session_retrieve_fd() to avoid mixing automated > cleanup-style guards with goto-based resource release, which is not > recommended under the Linux kernel coding style. > > Signed-off-by: Pasha Tatashin Perhaps we would be better off moving to FD_ADD() at some point, which should make this a little bit simpler? Anyway, this patch is still an improvement, so Reviewed-by: Pratyush Yadav (Google) [...] 
-- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 05:19:21 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 14:19:21 +0200 Subject: [PATCH v4 03/13] liveupdate: centralize state management into struct luo_ser In-Reply-To: <20260530221938.115978-4-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:28 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-4-pasha.tatashin@soleen.com> Message-ID: <2vxzzf1efo6u.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > Transition the LUO to ABI v2, which centralizes state management into a > single struct luo_ser header. > > Previously, LUO state was spread across multiple FDT properties and > subnodes. ABI v2 simplifies this by placing all core state, including > the liveupdate number and physical addresses for sessions and FLB > headers into a centralized struct luo_ser. > > Note that this change introduces a semantic difference: the sessions > and FLB serialization formats are no longer completely independent of > the core LUO. Their metadata (such as physical addresses for sessions > and FLB headers) is now coupled to and managed via the centralized > struct luo_ser. > > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) [...] 
-- Regards, Pratyush Yadav From will at kernel.org Mon Jun 1 05:35:16 2026 From: will at kernel.org (Will Deacon) Date: Mon, 1 Jun 2026 13:35:16 +0100 Subject: [RFC PATCH 3/4] dma-direct: Add API to preserve/restore allocations In-Reply-To: <20260505002737.2213734-4-skhawaja@google.com> References: <20260505002737.2213734-1-skhawaja@google.com> <20260505002737.2213734-4-skhawaja@google.com> Message-ID: On Tue, May 05, 2026 at 12:27:36AM +0000, Samiullah Khawaja wrote: > diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c > index ec887f443741..c2b98f91900a 100644 > --- a/kernel/dma/direct.c > +++ b/kernel/dma/direct.c > @@ -6,6 +6,8 @@ > */ > #include /* for max_pfn */ > #include > +#include > +#include > #include > #include > #include > @@ -307,6 +309,167 @@ void *dma_direct_alloc(struct device *dev, size_t size, > return NULL; > } > > +#ifdef CONFIG_DMA_LIVEUPDATE > +int dma_direct_preserve_allocation(struct device *dev, void *cpu_addr, > + size_t size, dma_addr_t dma_handle, > + unsigned long attrs, u64 *state) > +{ > + struct dma_alloc_ser *ser; > + int ret; > + > + if (!kho_is_enabled()) > + return -EOPNOTSUPP; > + > + if (IS_ENABLED(CONFIG_DMA_CMA)) > + return -EOPNOTSUPP; Hmm, it seems a bit overkill to do this just because CMA is compiled in, especially as it's user-selectable in kconfig. Maybe you need to iterate over the CMA areas using cma_for_each_area(), similarly to how you do with the pools? 
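
A rough userspace sketch of the per-area check suggested above, as opposed to bailing out whenever CMA is compiled in. It only mirrors the callback-plus-opaque-pointer shape of an area iterator; struct demo_cma, demo_for_each_area(), demo_range_in_cma() and the area table are hypothetical stand-ins, not the kernel API or the patch's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one contiguous area. */
struct demo_cma {
	uint64_t base;
	uint64_t size;
};

/* Hypothetical global area table, filled in by the caller. */
static struct demo_cma demo_areas[4];
static size_t demo_nr_areas;

/* Call @it on every area; stop and return its value once it is non-zero. */
static int demo_for_each_area(int (*it)(const struct demo_cma *cma, void *data),
			      void *data)
{
	assert(it != NULL);

	for (size_t i = 0; i < demo_nr_areas; i++) {
		int ret = it(&demo_areas[i], data);

		if (ret)
			return ret;
	}
	return 0;
}

struct demo_range {
	uint64_t start;
	uint64_t size;
};

/* Iterator callback: 1 if the range in @data overlaps @cma, else 0. */
static int demo_overlaps_area(const struct demo_cma *cma, void *data)
{
	const struct demo_range *r = data;

	return r->start < cma->base + cma->size &&
	       cma->base < r->start + r->size;
}

/* 1 if [start, start + size) touches any registered area, else 0. */
static int demo_range_in_cma(uint64_t start, uint64_t size)
{
	struct demo_range r = { .start = start, .size = size };

	return demo_for_each_area(demo_overlaps_area, &r);
}
```

With a check of this shape, preservation would be refused only for allocations that actually sit in a CMA area, rather than for every allocation on a kernel built with CONFIG_DMA_CMA.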
Will From pratyush at kernel.org Mon Jun 1 05:39:52 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 14:39:52 +0200 Subject: [PATCH v4 04/13] liveupdate: register luo_ser as KHO subtree In-Reply-To: <20260530221938.115978-5-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:29 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-5-pasha.tatashin@soleen.com> Message-ID: <2vxzv7c2fn8n.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > Entirely remove the LUO FDT wrapper since the FDT only carries the > compatible string and the pointer to the centralized struct luo_ser. > Instead, register the struct luo_ser via the KHO raw subtree > API, placing the compatibility string inside the structure itself. > > Signed-off-by: Pasha Tatashin > --- > include/linux/kho/abi/luo.h | 57 +++++++++--------------- > kernel/liveupdate/luo_core.c | 85 +++++++++++------------------------- > 2 files changed, 46 insertions(+), 96 deletions(-) > > diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h > index 1b2f865a771a..9a4fe491812b 100644 > --- a/include/linux/kho/abi/luo.h > +++ b/include/linux/kho/abi/luo.h > @@ -10,11 +10,11 @@ > * > * Live Update Orchestrator uses the stable Application Binary Interface > * defined below to pass state from a pre-update kernel to a post-update > - * kernel. The ABI is built upon the Kexec HandOver framework and uses a > - * Flattened Device Tree to describe the preserved data. > + * kernel. The ABI is built upon the Kexec HandOver framework and registers > + * the central `struct luo_ser` via the KHO raw subtree API. > * > - * This interface is a contract. Any modification to the FDT structure, node > - * properties, compatible strings, or the layout of the `__packed` serialization > + * This interface is a contract. 
Any modification to the structure fields, > + * compatible strings, or the layout of the `__packed` serialization > * structures defined here constitutes a breaking change. Such changes require > * incrementing the version number in the relevant `_COMPATIBLE` string to > * prevent a new kernel from misinterpreting data from an old kernel. > @@ -23,31 +23,15 @@ > * however, backward/forward compatibility is only guaranteed for kernels > * supporting the same ABI version. > * > - * FDT Structure Overview: > + * KHO Structure Overview: > * The entire LUO state is encapsulated within a single KHO entry named "LUO". > - * This entry contains an FDT with the following layout: > - * > - * .. code-block:: none > - * > - * / { > - * compatible = "luo-v2"; > - * luo-abi-header = ; > - * }; > - * > - * Main LUO Node (/): > - * > - * - compatible: "luo-v2" > - * Identifies the overall LUO ABI version. > - * - luo-abi-header: u64 > - * The physical address of `struct luo_ser`. > + * This entry contains the `struct luo_ser` structure. > * > * Serialization Structures: > - * The FDT properties point to memory regions containing arrays of simple, > - * `__packed` structures. These structures contain the actual preserved state. > - * > * - struct luo_ser: > * The central ABI structure that contains the overall state of the LUO. > - * It includes the liveupdate-number and pointers to sessions and FLBs. > + * It includes the compatibility string, the liveupdate-number, and pointers > + * to sessions and FLBs. > * > * - struct luo_session_header_ser: > * Header for the session array. Contains the total page count of the > @@ -78,26 +62,27 @@ > #ifndef _LINUX_KHO_ABI_LUO_H > #define _LINUX_KHO_ABI_LUO_H > > +#include > #include > > /* > - * The LUO FDT hooks all LUO state for sessions, fds, etc. > + * The LUO state is registered under this KHO entry name. 
> */ > -#define LUO_FDT_SIZE PAGE_SIZE > -#define LUO_FDT_KHO_ENTRY_NAME "LUO" > -#define LUO_FDT_COMPATIBLE "luo-v2" > -#define LUO_FDT_ABI_HEADER "luo-abi-header" > +#define LUO_KHO_ENTRY_NAME "LUO" > +#define LUO_ABI_COMPATIBLE "luo-v3" > +#define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) The length of the compatible field will change depending on the length of the string. While that is technically fine since a new ABI version is allowed to change the layout, it feels odd. I think it would be better if we define a static size here, say 64 bytes. This way you can avoid all the weirdness that can happen when you move from one version to another. > > /** > * struct luo_ser - Centralized LUO ABI header. > + * @compatible: Compatibility string identifying the LUO ABI version. > * @liveupdate_num: A counter tracking the number of successful live updates. > * @sessions_pa: Physical address of the first session block header. > * @flbs_pa: Physical address of the FLB header. > * > - * This structure is the root of all preserved LUO state. It is pointed to by > - * the "luo-abi-header" property in the LUO FDT. > + * This structure is the root of all preserved LUO state. > */ > struct luo_ser { > + char compatible[LUO_ABI_COMPAT_LEN]; > u64 liveupdate_num; > u64 sessions_pa; > u64 flbs_pa; [...] > @@ -94,40 +91,29 @@ static int __init luo_early_startup(void) > return 0; > } > > - /* Retrieve LUO subtree, and verify its format. */ > - err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys, NULL); > + /* Retrieve LUO state from KHO. 
*/ > + err = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &luo_ser_phys, &len); > if (err) { > if (err != -ENOENT) { > - pr_err("failed to retrieve FDT '%s' from KHO: %pe\n", > - LUO_FDT_KHO_ENTRY_NAME, ERR_PTR(err)); > + pr_err("failed to retrieve LUO state '%s' from KHO: %pe\n", > + LUO_KHO_ENTRY_NAME, ERR_PTR(err)); > return err; > } > > return 0; > } > > - luo_global.fdt_in = phys_to_virt(fdt_phys); > - err = fdt_node_check_compatible(luo_global.fdt_in, 0, > - LUO_FDT_COMPATIBLE); > - if (err) { > - pr_err("FDT '%s' is incompatible with '%s' [%d]\n", > - LUO_FDT_KHO_ENTRY_NAME, LUO_FDT_COMPATIBLE, err); > - > + if (len < sizeof(*luo_ser)) { len != sizeof(*luo_ser) here? > + pr_err("LUO state is too small (%zu < %zu)\n", len, sizeof(*luo_ser)); > return -EINVAL; > } > > - header_size = 0; > - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); > - if (!ptr || header_size != sizeof(u64)) { > - pr_err("Unable to get ABI header '%s' [%d]\n", > - LUO_FDT_ABI_HEADER, header_size); > - > + luo_ser = phys_to_virt(luo_ser_phys); > + if (strncmp(luo_ser->compatible, LUO_ABI_COMPATIBLE, LUO_ABI_COMPAT_LEN)) { > + pr_err("LUO state is incompatible with '%s'\n", LUO_ABI_COMPATIBLE); > return -EINVAL; > } > > - luo_ser_pa = get_unaligned((u64 *)ptr); > - luo_ser = phys_to_virt(luo_ser_pa); > - > luo_global.liveupdate_num = luo_ser->liveupdate_num; > pr_info("Retrieved live update data, liveupdate number: %lld\n", > luo_global.liveupdate_num); [...] 
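
The two checks in this hunk (a length check on the retrieved blob, then a compatible-string comparison) can be tried out in userspace. This is a minimal sketch, assuming a fixed 64-byte compatible field as suggested in the review and the strict `len != sizeof()` form that the review asks about. struct demo_luo_ser, DEMO_ABI_COMPAT_LEN and demo_luo_ser_validate() are hypothetical stand-ins, and the packed layout is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ABI_COMPATIBLE	"luo-v3"
/* Fixed field size so the layout does not depend on the string length. */
#define DEMO_ABI_COMPAT_LEN	64

/* Hypothetical mirror of the serialized header; packing is assumed. */
struct demo_luo_ser {
	char compatible[DEMO_ABI_COMPAT_LEN];
	uint64_t liveupdate_num;
	uint64_t sessions_pa;
	uint64_t flbs_pa;
} __attribute__((packed));

/* Return 0 if @buf looks like a header this code understands, else -1. */
static int demo_luo_ser_validate(const void *buf, size_t len)
{
	const struct demo_luo_ser *ser = buf;

	assert(buf != NULL);

	/* Strict size check: a larger blob is as suspicious as a smaller one. */
	if (len != sizeof(*ser))
		return -1;

	/* Bounded compare, so an unterminated field cannot be overrun. */
	if (strncmp(ser->compatible, DEMO_ABI_COMPATIBLE, DEMO_ABI_COMPAT_LEN))
		return -1;

	return 0;
}
```

With the fixed-size field, sizeof(struct demo_luo_ser) stays at 88 bytes no matter how long later compatible strings get, so moving from one version to the next only changes the string contents.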
-- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 06:38:34 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 15:38:34 +0200 Subject: [PATCH v4 07/13] kho: add support for linked-block serialization In-Reply-To: <20260530221938.115978-8-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:32 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-8-pasha.tatashin@soleen.com> Message-ID: <2vxzqzmqfkit.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > Introduce a linked-block serialization mechanism for state handover. > > Previously, LUO used contiguous memory blocks for serializing sessions > and files, which imposed limits on the total number of items that could > be preserved across a live update. > > This commit adds the infrastructure for a more flexible, block-based > approach where serialized data is stored in a chain of linked blocks. > This is a generic KHO serialization block infrastructure that can be > used by multiple subsystems. > > Signed-off-by: Pasha Tatashin > --- > Documentation/core-api/kho/abi.rst | 5 + > Documentation/core-api/kho/index.rst | 11 + > MAINTAINERS | 1 + > include/linux/kho/abi/block.h | 56 ++++ > include/linux/kho_block.h | 79 ++++++ > kernel/liveupdate/Makefile | 1 + > kernel/liveupdate/kho_block.c | 384 +++++++++++++++++++++++++++ > 7 files changed, 537 insertions(+) > create mode 100644 include/linux/kho/abi/block.h > create mode 100644 include/linux/kho_block.h > create mode 100644 kernel/liveupdate/kho_block.c > > diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst > index 799d743105a6..edeb5b311963 100644 > --- a/Documentation/core-api/kho/abi.rst > +++ b/Documentation/core-api/kho/abi.rst > @@ -28,6 +28,11 @@ KHO persistent memory tracker ABI > .. 
kernel-doc:: include/linux/kho/abi/kexec_handover.h > :doc: KHO persistent memory tracker > > +KHO serialization block ABI > +=========================== > + > +.. kernel-doc:: include/linux/kho/abi/block.h > + > See Also > ======== > > diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst > index 0a2dee4f8e7d..320914a42178 100644 > --- a/Documentation/core-api/kho/index.rst > +++ b/Documentation/core-api/kho/index.rst > @@ -83,6 +83,17 @@ Public API > .. kernel-doc:: kernel/liveupdate/kexec_handover.c > :export: > > +KHO Serialization Blocks API > +============================ > + > +.. kernel-doc:: kernel/liveupdate/kho_block.c > + :doc: KHO Serialization Blocks > + > +.. kernel-doc:: include/linux/kho_block.h > + > +.. kernel-doc:: kernel/liveupdate/kho_block.c > + :internal: > + > See Also > ======== > > diff --git a/MAINTAINERS b/MAINTAINERS > index 2fb1c75afd16..fd119b343e99 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -14194,6 +14194,7 @@ F: Documentation/admin-guide/mm/kho.rst > F: Documentation/core-api/kho/* > F: include/linux/kexec_handover.h > F: include/linux/kho/ > +F: include/linux/kho_block.h > F: kernel/liveupdate/kexec_handover* > F: lib/test_kho.c > F: tools/testing/selftests/kho/ > diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h > new file mode 100644 > index 000000000000..8641c20b379b > --- /dev/null > +++ b/include/linux/kho/abi/block.h > @@ -0,0 +1,56 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * Copyright (c) 2026, Google LLC. > + * Pasha Tatashin > + */ > + > +/** > + * DOC: KHO Serialization Blocks ABI > + * > + * Subsystems using the KHO Serialization Blocks framework rely on the stable > + * Application Binary Interface defined below to pass serialized state from a > + * pre-update kernel to a post-update kernel. > + * > + * This interface is a contract. 
Any modification to the structure fields,
> + * compatible strings, or the layout of the `__packed` serialization
> + * structures defined here constitutes a breaking change. Such changes require
> + * incrementing the version number in the `KHO_BLOCK_ABI_COMPATIBLE` string to
> + * prevent a new kernel from misinterpreting data from an old kernel.
> + *
> + * Changes are allowed provided the compatibility version is incremented;
> + * however, backward/forward compatibility is only guaranteed for kernels
> + * supporting the same ABI version.
> + */
> +
> +#ifndef _LINUX_KHO_ABI_BLOCK_H
> +#define _LINUX_KHO_ABI_BLOCK_H
> +
> +#include
> +#include
> +
> +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1"

During KHO radix development, I argued for a separate compatible for the
radix tree, but at that time we tied the radix tree to the core KHO ABI,
the argument being that all core KHO data structures belong to the KHO
ABI set.

I imagine this will be used by kho_vmalloc, so it will also end up being
used by a core KHO API. So, do we want a separate ABI? I don't have much
of a preference myself, but I do think the compatible management will be
a bit easier if this relied on the KHO compatible, especially once
kho_vmalloc starts using it.

> +
> +/**
> + * KHO_BLOCK_SIZE - The size of each serialization block.
> + *
> + * This is defined as PAGE_SIZE. PAGE_SIZE is ABI compliant because live
> + * update between kernels with different page sizes is not supported by KHO.
> + */
> +#define KHO_BLOCK_SIZE PAGE_SIZE
> +
> +/**
> + * struct kho_block_header_ser - Header for the serialized data block.
> + * @next: Physical address of the next struct kho_block_header_ser.
> + * @count: The number of entries that immediately follow this header in the
> + * memory block.
> + *
> + * This structure is located at the beginning of a block of physical memory
> + * preserved across a kexec. It provides the necessary metadata to interpret
> + * the array of entries that follow.
> + */ > +struct kho_block_header_ser { > + u64 next; > + u64 count; > +} __packed; > + > +#endif /* _LINUX_KHO_ABI_BLOCK_H */ > diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h > new file mode 100644 > index 000000000000..5e6b87b1befa > --- /dev/null > +++ b/include/linux/kho_block.h > @@ -0,0 +1,79 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * Copyright (c) 2026, Google LLC. > + * Pasha Tatashin > + */ > + > +#ifndef _LINUX_KHO_BLOCK_H > +#define _LINUX_KHO_BLOCK_H > + > +#include > +#include > +#include > + > +/** > + * struct kho_block - Internal representation of a serialization block. > + * @list: List head for linking blocks in memory. > + * @ser: Pointer to the serialized header in preserved memory. > + */ > +struct kho_block { > + struct list_head list; > + struct kho_block_header_ser *ser; > +}; > + > +/** > + * struct kho_block_set - A set of blocks that belong to the same object. > + * @blocks: The list of serialization blocks (struct kho_block). > + * @nblocks: The number of allocated serialization blocks. > + * @head_pa: Physical address of the first block header. > + * @entry_size: The size of each entry in the blocks. > + * @count_per_block: The maximum number of entries each block can hold. > + * @incoming: True if this block set was restored from the previous kernel. > + */ > +struct kho_block_set { > + struct list_head blocks; > + long nblocks; > + u64 head_pa; > + size_t entry_size; I think we should add the entry_size to kho_block_header_ser? I think it is a part of the ABI of the block set. If this changes, we cannot parse a block set with a different size. If a subsystem wants to change entry size, they create a new block set with different entry size, and then they bump their compatible version. > + u64 count_per_block; > + bool incoming; > +}; > + > +/** > + * struct kho_block_it - Iterator for serializing entries into blocks. > + * @bs: The block set being iterated. > + * @block: The current block. 
> + * @i: The current entry index within @block. > + */ > +struct kho_block_it { > + struct kho_block_set *bs; > + struct kho_block *block; > + u64 i; > +}; > + > +/** > + * KHO_BLOCK_SET_INIT - Initialize a static kho_block_set. > + * @_name: Name of the kho_block_set variable. > + * @_entry_size: The size of each entry in the block set. > + */ > +#define KHO_BLOCK_SET_INIT(_name, _entry_size) { \ > + .blocks = LIST_HEAD_INIT((_name).blocks), \ > + .entry_size = _entry_size, \ > +} > + > +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size); > + > +int kho_block_grow(struct kho_block_set *bs, u64 count); > +void kho_block_shrink(struct kho_block_set *bs, u64 count); These block management functions seem like internal details of the block set API. Do we need to export them? I think users should not have to worry about block management. They should read, set, or clear entries using the iterators, and internally the block management should take care of allocation and freeing. So here for example, I th > + > +int kho_block_restore(struct kho_block_set *bs, u64 head_pa); > +void kho_block_destroy(struct kho_block_set *bs); Nit: kho_block_set_{restore,destroy}()? At first glance I thought they manipulated a single block.
> +void kho_block_set_clear(struct kho_block_set *bs); > + > +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs); > +void *kho_block_it_next(struct kho_block_it *it); > +void *kho_block_it_read(struct kho_block_it *it); > +void *kho_block_it_prev(struct kho_block_it *it); > +void kho_block_it_finalize(struct kho_block_it *it); > + > +#endif /* _LINUX_KHO_BLOCK_H */ > diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile > index d2f779cbe279..eec9d3ae07eb 100644 > --- a/kernel/liveupdate/Makefile > +++ b/kernel/liveupdate/Makefile > @@ -1,6 +1,7 @@ > # SPDX-License-Identifier: GPL-2.0 > > luo-y := \ > + kho_block.o \ > luo_core.o \ > luo_file.o \ > luo_flb.o \ > diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c > new file mode 100644 > index 000000000000..a4e650af946f > --- /dev/null > +++ b/kernel/liveupdate/kho_block.c > @@ -0,0 +1,384 @@ > +// SPDX-License-Identifier: GPL-2.0 > + > +/* > + * Copyright (c) 2026, Google LLC. > + * Pasha Tatashin > + */ > + > +/** > + * DOC: KHO Serialization Blocks > + * > + * KHO provides a mechanism to preserve stateful data across a kexec handover > + * by serializing it into memory blocks. This file provides the common > + * infrastructure for managing these blocks. > + * > + * Each block consists of a header (struct kho_block_header_ser) followed by an > + * array of serialized entries. Multiple blocks are linked together via a > + * physical pointer in the header, forming a linked list that can be easily > + * traversed in both the current and the next kernel. > + */ > + > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > + > +#include > +#include > +#include > +#include > +#include > + > +/* > + * Safeguard limit for the number of serialization blocks. This is used to > + * prevent infinite loops and excessive memory allocation in case of memory > + * corruption in the preserved state. 
> + */ > +#define KHO_MAX_BLOCKS 10000 > + > +/** > + * kho_block_set_init - Initialize a block set. > + * @bs: The block set to initialize. > + * @entry_size: The size of each entry in the blocks. > + */ > +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size) > +{ > + *bs = (struct kho_block_set)KHO_BLOCK_SET_INIT(*bs, entry_size); > +} > + > +static inline u64 kho_block_count_per_block(struct kho_block_set *bs) > +{ > + if (unlikely(!bs->count_per_block)) { > + bs->count_per_block = (KHO_BLOCK_SIZE - > + sizeof(struct kho_block_header_ser)) / > + bs->entry_size; > + WARN_ON(!bs->count_per_block); > + } > + return bs->count_per_block; > +} This looks odd. I don't see a reason to calculate this lazily. Why not just do it when initializing the block set, in kho_block_set_init() or kho_block_restore()? And then use bs->count_per_block directly. > + > +/* Free serialized data */ > +static void kho_block_free_ser(struct kho_block_set *bs, > + struct kho_block_header_ser *ser) > +{ > + if (bs->incoming) > + kho_restore_free(ser); > + else > + kho_unpreserve_free(ser); > +} > + > +static struct kho_block_header_ser *kho_block_alloc_ser(struct kho_block_set *bs) > +{ > + WARN_ON(bs->incoming); WARN_ON_ONCE? > + return kho_alloc_preserve(KHO_BLOCK_SIZE); > +} > + > +static int kho_block_add(struct kho_block_set *bs, > + struct kho_block_header_ser *ser) > +{ > + struct kho_block *block, *last; > + > + if (bs->nblocks >= KHO_MAX_BLOCKS) > + return -ENOSPC; > + > + block = kzalloc_obj(*block); > + if (!block) > + return -ENOMEM; > + > + block->ser = ser; > + last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); > + list_add_tail(&block->list, &bs->blocks); > + bs->nblocks++; > + > + if (last) > + last->ser->next = virt_to_phys(ser); > + else > + bs->head_pa = virt_to_phys(ser); > + > + return 0; > +} > + > +/** > + * kho_block_grow - Create a new block if the current capacity is reached. > + * @bs: The block set. 
> + * @count: The current number of entries. > + * > + * This function handles the dynamic expansion of a block set. It allocates > + * and links a new serialization block if the provided entry count matches > + * the current total capacity of the set. > + * > + * Return: 0 on success, or a negative errno on failure. > + */ > +int kho_block_grow(struct kho_block_set *bs, u64 count) > +{ > + struct kho_block_header_ser *ser; > + int err; > + > + if (WARN_ON(bs->incoming)) WARN_ON_ONCE here too? > + return -EINVAL; > + > + if (count != bs->nblocks * kho_block_count_per_block(bs)) > + return 0; > + > + ser = kho_block_alloc_ser(bs); > + if (IS_ERR(ser)) > + return PTR_ERR(ser); > + > + err = kho_block_add(bs, ser); > + if (err) { > + kho_block_free_ser(bs, ser); > + return err; > + } > + > + return 0; > +} > + > +/** > + * kho_block_shrink - Conditionally destroy the last block in a block set. > + * @bs: The block set. > + * @count: The current number of entries across all blocks. > + * > + * This function checks if the last block in the set is redundant based on the > + * total entry count and the capacity of the preceding blocks. If the entry > + * count can be accommodated by the blocks that come before the last one, the > + * last block is destroyed and removed from the set. > + */ > +void kho_block_shrink(struct kho_block_set *bs, u64 count) > +{ > + struct kho_block *last, *new_last; > + > + if (count > (bs->nblocks - 1) * kho_block_count_per_block(bs)) > + return; > + > + if (list_empty(&bs->blocks)) > + return; > + > + last = list_last_entry(&bs->blocks, struct kho_block, list); > + list_del(&last->list); > + bs->nblocks--; > + kho_block_free_ser(bs, last->ser); > + kfree(last); > + > + new_last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); > + if (new_last) > + new_last->ser->next = 0; > + else > + bs->head_pa = 0; > +} > + > +/* > + * kho_cyclic_blocks_check - Check for cycles in a linked list of blocks. 
> + * Uses Floyd's cycle-finding algorithm to ensure sanity of the incoming list. > + */ > +static bool kho_cyclic_blocks_check(struct kho_block_set *bs) > +{ > + struct kho_block_header_ser *fast; > + struct kho_block_header_ser *slow; > + int count = 0; > + > + fast = phys_to_virt(bs->head_pa); > + slow = fast; > + > + while (fast) { > + if (count++ >= KHO_MAX_BLOCKS) { > + pr_err("Linked list too long\n"); > + return false; > + } > + > + if (!fast->next) > + break; > + > + fast = phys_to_virt(fast->next); > + if (!fast->next) > + break; > + > + fast = phys_to_virt(fast->next); > + slow = phys_to_virt(slow->next); > + > + if (slow == fast) { > + pr_err("Cyclic list detected\n"); Heh, reminds me of the time I was practicing leetcode for interviews ;-) > + return false; > + } > + } > + > + return true; > +} > + > +/** > + * kho_block_restore - Restore a block set from a physical address. > + * @bs: The block set to restore. > + * @head_pa: Physical address of the first block header. > + * > + * Return: 0 on success, or a negative errno on failure. > + */ > +int kho_block_restore(struct kho_block_set *bs, u64 head_pa) > +{ > + struct kho_block_header_ser *ser; > + u64 next_pa = head_pa; > + int err; > + > + /* Restored block sets use size from the previous kernel */ > + bs->incoming = true; > + if (!head_pa) > + return 0; > + > + bs->head_pa = head_pa; > + if (!kho_cyclic_blocks_check(bs)) { > + bs->head_pa = 0; > + return -EINVAL; > + } > + > + while (next_pa) { > + ser = phys_to_virt(next_pa); > + if (ser->count > kho_block_count_per_block(bs)) { > + pr_warn("Block contains too many entries: %llu\n", > + ser->count); > + err = -EINVAL; > + goto err_destroy; > + } > + err = kho_block_add(bs, ser); > + if (err) > + goto err_destroy; > + next_pa = ser->next; > + } > + > + return 0; > + > +err_destroy: > + kho_block_destroy(bs); > + return err; > +} > + > +/** > + * kho_block_destroy - Destroy all blocks in a block set. > + * @bs: The block set. 
> + */ > +void kho_block_destroy(struct kho_block_set *bs) > +{ > + u64 head_pa = bs->head_pa; > + struct kho_block *block; > + > + while (!list_empty(&bs->blocks)) { > + block = list_first_entry(&bs->blocks, struct kho_block, list); > + list_del(&block->list); > + kfree(block); > + } Nit: list_for_each_entry_safe(block, tmp, &bs->blocks, list) { list_del(&block->list); kfree(block); } is a bit more idiomatic (and IMO easier to read). > + bs->nblocks = 0; > + bs->head_pa = 0; > + > + while (head_pa) { > + struct kho_block_header_ser *ser = phys_to_virt(head_pa); > + > + head_pa = ser->next; > + kho_block_free_ser(bs, ser); Nit: also, can't you fold this into the previous loop? Something like: list_for_each_entry_safe(block, tmp, &bs->blocks, list) { list_del(&block->list); kho_block_free_ser(bs, block->ser); kfree(block); } > + } > +} > + > +/** > + * kho_block_set_clear - Clear all serialized data in a block set. > + * @bs: The block set to clear. > + */ > +void kho_block_set_clear(struct kho_block_set *bs) > +{ > + struct kho_block *block; > + > + list_for_each_entry(block, &bs->blocks, list) { > + block->ser->count = 0; > + memset(block->ser + 1, 0, KHO_BLOCK_SIZE - sizeof(*block->ser)); > + } > +} > + > +/** > + * kho_block_it_init - Initialize a block set iterator. > + * @it: The iterator to initialize. > + * @bs: The block set to iterate over. > + */ > +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs) > +{ > + it->bs = bs; > + it->block = list_first_entry_or_null(&bs->blocks, struct kho_block, list); > + it->i = 0; > +} > + > +/** > + * kho_block_it_next - Return the next entry slot in the block set. > + * @it: The block iterator. > + * > + * If the current block is full, it automatically advances to the next block > + * in the set. > + * > + * Return: A pointer to the next entry slot, or NULL if no more slots are > + * available.
> + */ > +void *kho_block_it_next(struct kho_block_it *it) The naming and documentation here are very confusing. This and kho_block_it_read() look pretty much identical, and their documentation also looks pretty much identical. There seems to be only one tiny difference: this function returns the slot while incrementing the block count. Can we do better, with something like kho_block_it_write_next(struct kho_block_it *it, void *entry) (the size was specified when creating the block set)? Yes, this results in a copy, but does that matter that much? And if you really want to avoid copying, perhaps kho_block_it_add_entry()? Or something along those lines, to make it clear this is adding an entry to the block set. Also, make the intended usage clear in the documentation. > +{ > + if (!it->block) > + return NULL; > + > + if (it->i == kho_block_count_per_block(it->bs)) { > + it->block->ser->count = it->i; > + if (list_is_last(&it->block->list, &it->bs->blocks)) > + return NULL; > + it->block = list_next_entry(it->block, list); > + it->i = 0; > + } > + > + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); > +} > + > +/** > + * kho_block_it_read - Return the next entry slot for reading. > + * @it: The block iterator. > + * > + * This function iterates through entries that were previously serialized, > + * respecting the count stored in each block's header. > + * > + * Return: A pointer to the next entry slot, or NULL if no more entries are > + * available. > + */ > +void *kho_block_it_read(struct kho_block_it *it) > +{ > + if (!it->block) > + return NULL; > + > + while (it->i == it->block->ser->count) { Hmm, the while loop suggests we can have blocks with zero count. Do you think we should detect those and error out instead? Since it doesn't really make sense to have a block with no entries.
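A rough sketch of the kind of check I have in mind, written as a userspace model rather than against the real KHO types. struct blk_hdr, blk_chain_validate() and max_per_block are made-up names for illustration only; next is an ordinary pointer here instead of a physical address, and the chain is assumed to be acyclic already (the cycle check in the patch would have to run first).

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace stand-in for struct kho_block_header_ser. In the real ABI
 * "next" is a physical address; a plain pointer keeps the model simple.
 */
struct blk_hdr {
	struct blk_hdr *next;
	uint64_t count;
};

/*
 * Hypothetical validation pass over an incoming chain: reject any block
 * that is empty or claims more entries than one block can hold.
 * max_per_block plays the role of count_per_block in the patch.
 */
static int blk_chain_validate(const struct blk_hdr *head, uint64_t max_per_block)
{
	const struct blk_hdr *b;

	for (b = head; b; b = b->next) {
		if (b->count == 0 || b->count > max_per_block)
			return -EINVAL;
	}

	return 0;
}
```

If something like this ran at restore time, the read iterator could presumably use a plain "if" instead of the while loop, since no block it visits would be empty. Whether an empty trailing block can legitimately exist on the outgoing side is something I have not checked.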
> + if (list_is_last(&it->block->list, &it->bs->blocks)) > + return NULL; > + it->block = list_next_entry(it->block, list); > + it->i = 0; > + } > + > + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); > +} > + > +/** > + * kho_block_it_prev - Return the previous entry slot in the block set. > + * @it: The block iterator. > + * > + * If the current index is at the start of a block, it automatically moves to > + * the end of the previous block. > + * > + * Return: A pointer to the previous entry slot, or NULL if at the very > + * beginning of the block set. > + */ > +void *kho_block_it_prev(struct kho_block_it *it) > +{ > + if (!it->block) > + return NULL; > + > + if (it->i == 0) { > + if (list_is_first(&it->block->list, &it->bs->blocks)) > + return NULL; > + it->block = list_prev_entry(it->block, list); > + it->i = kho_block_count_per_block(it->bs); > + } > + > + return (void *)(it->block->ser + 1) + (--it->i * it->bs->entry_size); > +} > + > +/** > + * kho_block_it_finalize - Finalize the current block by setting its entry count. > + * @it: The block iterator. > + */ > +void kho_block_it_finalize(struct kho_block_it *it) > +{ > + if (it->block) > + it->block->ser->count = it->i; > +} Doesn't kho_block_it_next() already do this when you add an entry? So this seems redundant. 
-- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 06:47:00 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 15:47:00 +0200 Subject: [PATCH v4 08/13] liveupdate: defer session block allocation and PA setting In-Reply-To: <20260530221938.115978-9-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:33 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-9-pasha.tatashin@soleen.com> Message-ID: <2vxzmrxefk4r.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > Currently, luo_session_setup_outgoing() allocates the session block and > sets its physical address in the header immediately. With upcoming > dynamic block-based session management, this makes the first block > different from the rest. Move the allocation to where it is first needed. > > Acked-by: Mike Rapoport (Microsoft) > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) [...] -- Regards, Pratyush Yadav From pasha.tatashin at soleen.com Mon Jun 1 06:50:49 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Mon, 1 Jun 2026 09:50:49 -0400 Subject: [PATCH v4 04/13] liveupdate: register luo_ser as KHO subtree In-Reply-To: <2vxzv7c2fn8n.fsf@kernel.org> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-5-pasha.tatashin@soleen.com> <2vxzv7c2fn8n.fsf@kernel.org> Message-ID: On 06-01 14:39, Pratyush Yadav wrote: > On Sat, May 30 2026, Pasha Tatashin wrote: > > > Entirely remove the LUO FDT wrapper since the FDT only carries the > > compatible string and the pointer to the centralized struct luo_ser. > > Instead, register the struct luo_ser via the KHO raw subtree > > API, placing the compatibility string inside the structure itself. 
> > > > Signed-off-by: Pasha Tatashin > > --- > > include/linux/kho/abi/luo.h | 57 +++++++++--------------- > > kernel/liveupdate/luo_core.c | 85 +++++++++++------------------------- > > 2 files changed, 46 insertions(+), 96 deletions(-) > > > > diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h > > index 1b2f865a771a..9a4fe491812b 100644 > > --- a/include/linux/kho/abi/luo.h > > +++ b/include/linux/kho/abi/luo.h > > @@ -10,11 +10,11 @@ > > * > > * Live Update Orchestrator uses the stable Application Binary Interface > > * defined below to pass state from a pre-update kernel to a post-update > > - * kernel. The ABI is built upon the Kexec HandOver framework and uses a > > - * Flattened Device Tree to describe the preserved data. > > + * kernel. The ABI is built upon the Kexec HandOver framework and registers > > + * the central `struct luo_ser` via the KHO raw subtree API. > > * > > - * This interface is a contract. Any modification to the FDT structure, node > > - * properties, compatible strings, or the layout of the `__packed` serialization > > + * This interface is a contract. Any modification to the structure fields, > > + * compatible strings, or the layout of the `__packed` serialization > > * structures defined here constitutes a breaking change. Such changes require > > * incrementing the version number in the relevant `_COMPATIBLE` string to > > * prevent a new kernel from misinterpreting data from an old kernel. > > @@ -23,31 +23,15 @@ > > * however, backward/forward compatibility is only guaranteed for kernels > > * supporting the same ABI version. > > * > > - * FDT Structure Overview: > > + * KHO Structure Overview: > > * The entire LUO state is encapsulated within a single KHO entry named "LUO". > > - * This entry contains an FDT with the following layout: > > - * > > - * .. 
code-block:: none > > - * > > - * / { > > - * compatible = "luo-v2"; > > - * luo-abi-header = ; > > - * }; > > - * > > - * Main LUO Node (/): > > - * > > - * - compatible: "luo-v2" > > - * Identifies the overall LUO ABI version. > > - * - luo-abi-header: u64 > > - * The physical address of `struct luo_ser`. > > + * This entry contains the `struct luo_ser` structure. > > * > > * Serialization Structures: > > - * The FDT properties point to memory regions containing arrays of simple, > > - * `__packed` structures. These structures contain the actual preserved state. > > - * > > * - struct luo_ser: > > * The central ABI structure that contains the overall state of the LUO. > > - * It includes the liveupdate-number and pointers to sessions and FLBs. > > + * It includes the compatibility string, the liveupdate-number, and pointers > > + * to sessions and FLBs. > > * > > * - struct luo_session_header_ser: > > * Header for the session array. Contains the total page count of the > > @@ -78,26 +62,27 @@ > > #ifndef _LINUX_KHO_ABI_LUO_H > > #define _LINUX_KHO_ABI_LUO_H > > > > +#include > > #include > > > > /* > > - * The LUO FDT hooks all LUO state for sessions, fds, etc. > > + * The LUO state is registered under this KHO entry name. > > */ > > -#define LUO_FDT_SIZE PAGE_SIZE > > -#define LUO_FDT_KHO_ENTRY_NAME "LUO" > > -#define LUO_FDT_COMPATIBLE "luo-v2" > > -#define LUO_FDT_ABI_HEADER "luo-abi-header" > > +#define LUO_KHO_ENTRY_NAME "LUO" > > +#define LUO_ABI_COMPATIBLE "luo-v3" > > +#define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) > > The length of the compatible field will change depending on the length > of the string. While that is technically fine since a new ABI version is > allowed to change the layout, it feels odd. I think it would be better > if we define a static size here, say 64 bytes. This way you can avoid > all the weirdness that can happen when you move from one version to > another. 
This is what I used initially, but we have cases where one LUO/KHO subsystem depends on another. For example, the LUO version must change when the block version changes, making the static length too restrictive. I would prefer to use proper strncmp() everywhere and allow the version string to change dynamically between kernels, while still allowing something like this (from [PATCH v4 09/13] liveupdate: Remove limit on the number of sessions): #define LUO_COMPAT_BASE "luo-v3" #define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE In the future, we may extend this further as we add more dependencies, such as your preservable xarray, vmalloc, etc. Everything that depends on an external version should include that in its compatibility string. > > > > > /** > > * struct luo_ser - Centralized LUO ABI header. > > + * @compatible: Compatibility string identifying the LUO ABI version. > > * @liveupdate_num: A counter tracking the number of successful live updates. > > * @sessions_pa: Physical address of the first session block header. > > * @flbs_pa: Physical address of the FLB header. > > * > > - * This structure is the root of all preserved LUO state. It is pointed to by > > - * the "luo-abi-header" property in the LUO FDT. > > + * This structure is the root of all preserved LUO state. > > */ > > struct luo_ser { > > + char compatible[LUO_ABI_COMPAT_LEN]; > > u64 liveupdate_num; > > u64 sessions_pa; > > u64 flbs_pa; > [...] > > @@ -94,40 +91,29 @@ static int __init luo_early_startup(void) > > return 0; > > } > > > > - /* Retrieve LUO subtree, and verify its format. */ > > - err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys, NULL); > > + /* Retrieve LUO state from KHO. 
*/ > > + err = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &luo_ser_phys, &len); > > if (err) { > > if (err != -ENOENT) { > > - pr_err("failed to retrieve FDT '%s' from KHO: %pe\n", > > - LUO_FDT_KHO_ENTRY_NAME, ERR_PTR(err)); > > + pr_err("failed to retrieve LUO state '%s' from KHO: %pe\n", > > + LUO_KHO_ENTRY_NAME, ERR_PTR(err)); > > return err; > > } > > > > return 0; > > } > > > > - luo_global.fdt_in = phys_to_virt(fdt_phys); > > - err = fdt_node_check_compatible(luo_global.fdt_in, 0, > > - LUO_FDT_COMPATIBLE); > > - if (err) { > > - pr_err("FDT '%s' is incompatible with '%s' [%d]\n", > > - LUO_FDT_KHO_ENTRY_NAME, LUO_FDT_COMPATIBLE, err); > > - > > + if (len < sizeof(*luo_ser)) { > > len != sizeof(*luo_ser) here? I can change this, but it is not necessary. It is common practice to verify that a "struct" is not smaller when compatibility is checked, allowing for future expansion without breaking compatibility with older kernels. I know we do not support forward/backward compatibility in any way right now, but I do not think it hurts to put the proper safeguards in place. Pasha > > > + pr_err("LUO state is too small (%zu < %zu)\n", len, sizeof(*luo_ser)); > > return -EINVAL; > > } > > > > - header_size = 0; > > - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); > > - if (!ptr || header_size != sizeof(u64)) { > > - pr_err("Unable to get ABI header '%s' [%d]\n", > > - LUO_FDT_ABI_HEADER, header_size); > > - > > + luo_ser = phys_to_virt(luo_ser_phys); > > + if (strncmp(luo_ser->compatible, LUO_ABI_COMPATIBLE, LUO_ABI_COMPAT_LEN)) { > > + pr_err("LUO state is incompatible with '%s'\n", LUO_ABI_COMPATIBLE); > > return -EINVAL; > > } > > > > - luo_ser_pa = get_unaligned((u64 *)ptr); > > - luo_ser = phys_to_virt(luo_ser_pa); > > - > > luo_global.liveupdate_num = luo_ser->liveupdate_num; > > pr_info("Retrieved live update data, liveupdate number: %lld\n", > > luo_global.liveupdate_num); > [...] 
> > -- > Regards, > Pratyush Yadav From pratyush at kernel.org Mon Jun 1 07:03:56 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 16:03:56 +0200 Subject: [PATCH v4 09/13] liveupdate: Remove limit on the number of sessions In-Reply-To: <20260530221938.115978-10-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:34 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-10-pasha.tatashin@soleen.com> Message-ID: <2vxzfr36fjcj.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > Currently, the number of LUO sessions is limited by a fixed number of > pre-allocated pages for serialization (16 pages, allowing for ~819 > sessions). > > This limitation is problematic if LUO is used to support things such as > systemd file descriptor store, and would be used not just as VM memory > but to save other states on the machine. > > Remove this limit by transitioning to a linked-block approach for > session metadata serialization. Instead of a single contiguous block, > session metadata is now stored in a chain of 16-page blocks. Each block > starts with a header containing the physical address of the next block > and the number of session entries in the current block. > > Acked-by: Mike Rapoport (Microsoft) > Signed-off-by: Pasha Tatashin > --- [...] > @@ -63,13 +58,15 @@ > #define _LINUX_KHO_ABI_LUO_H > > #include > +#include > #include > > /* > * The LUO state is registered under this KHO entry name. > */ > #define LUO_KHO_ENTRY_NAME "LUO" > -#define LUO_ABI_COMPATIBLE "luo-v3" > +#define LUO_COMPAT_BASE "luo-v3" > +#define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE That's clever :-) [...] 
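For anyone skimming the thread: this works because adjacent string literals are concatenated at translation time, so the combined compatible is still a single literal and sizeof() on it stays a compile-time constant. A small userspace sketch of the arithmetic follows; ALIGN_UP() is my stand-in for the kernel's ALIGN() and compat_matches() is a hypothetical helper, neither is taken from the series.

```c
#include <stddef.h>
#include <string.h>

#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1"
#define LUO_COMPAT_BASE "luo-v3"
/* Adjacent literals merge into the single string "luo-v3-kho-block-v1". */
#define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE

/* Stand-in for the kernel's ALIGN(); assumes a power-of-two alignment. */
#define ALIGN_UP(x, a) (((x) + ((size_t)(a) - 1)) & ~((size_t)(a) - 1))
#define LUO_ABI_COMPAT_LEN ALIGN_UP(sizeof(LUO_ABI_COMPATIBLE), 8)

/*
 * Hypothetical helper mirroring the strncmp() check in luo_early_startup().
 * strncmp() stops at the first NUL, so the padded length is only an upper
 * bound on how many bytes get compared.
 */
static int compat_matches(const char *incoming)
{
	return strncmp(incoming, LUO_ABI_COMPATIBLE, LUO_ABI_COMPAT_LEN) == 0;
}
```

With these values the literal is 19 characters plus the terminating NUL, so sizeof() gives 20 and the aligned length comes out as 24. Bumping either component changes the string, and possibly the field length, without anyone having to touch LUO_ABI_COMPAT_LEN by hand.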
> int luo_session_serialize(void) > { > struct luo_session_header *sh = &luo_session_global.outgoing; > struct luo_session *session; > - int i = 0; > + struct kho_block_it it; > int err; > > down_write(&luo_session_serialize_rwsem); > down_write(&sh->rwsem); > *sh->sessions_pa = 0; > > + kho_block_it_init(&it, &sh->block_set); > + > list_for_each_entry(session, &sh->list, list) { > - err = luo_session_freeze_one(session, &sh->ser[i]); > - if (err) > + struct luo_session_ser *ser = kho_block_it_next(&it); > + > + if (!ser) { > + err = -ENOSPC; > goto err_undo; > + } > > - strscpy(sh->ser[i].name, session->name, > - sizeof(sh->ser[i].name)); > - i++; > - } > + err = luo_session_freeze_one(session, ser); > + if (err) { > + kho_block_it_prev(&it); > + goto err_undo; > + } > > - if (sh->header_ser && sh->count > 0) { > - sh->header_ser->count = sh->count; > - *sh->sessions_pa = virt_to_phys(sh->header_ser); > + strscpy(ser->name, session->name, sizeof(ser->name)); > } > + > + kho_block_it_finalize(&it); > + > + if (sh->sessions_pa && sh->count > 0) Nit: Why check for sh->sessions_pa? It can never be NULL. 
Other than this, Reviewed-by: Pratyush Yadav (Google) > + *sh->sessions_pa = sh->block_set.head_pa; > up_write(&sh->rwsem); > > return 0; > > err_undo: > list_for_each_entry_continue_reverse(session, &sh->list, list) { > - i--; > - luo_session_unfreeze_one(session, &sh->ser[i]); > - memset(sh->ser[i].name, 0, sizeof(sh->ser[i].name)); > + struct luo_session_ser *ser = kho_block_it_prev(&it); > + > + luo_session_unfreeze_one(session, ser); > + memset(ser->name, 0, sizeof(ser->name)); > } > up_write(&sh->rwsem); > up_write(&luo_session_serialize_rwsem); -- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 07:16:25 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 16:16:25 +0200 Subject: [PATCH v4 10/13] liveupdate: Remove limit on the number of files per session In-Reply-To: <20260530221938.115978-11-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:35 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-11-pasha.tatashin@soleen.com> Message-ID: <2vxzbjdufirq.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > To remove the fixed limit on the number of preserved files per session, > transition the file metadata serialization from a single contiguous > memory block to a chain of linked blocks. 
> > Acked-by: Mike Rapoport (Microsoft) > Signed-off-by: Pasha Tatashin > --- > include/linux/kho/abi/luo.h | 13 +-- > kernel/liveupdate/luo_file.c | 144 +++++++++++++++---------------- > kernel/liveupdate/luo_internal.h | 6 +- > 3 files changed, 80 insertions(+), 83 deletions(-) > > diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h > index 79758d92ed5f..16df550ef143 100644 > --- a/include/linux/kho/abi/luo.h > +++ b/include/linux/kho/abi/luo.h > @@ -35,8 +35,8 @@ > * > * - struct luo_session_ser: > * Metadata for a single session, including its name and a physical pointer > - * to another preserved memory block containing an array of > - * `struct luo_file_ser` for all files in that session. > + * to the first `struct kho_block_header_ser` for all files in that session. > + * Multiple blocks are linked via the `next` field in the header. > * > * - struct luo_file_ser: > * Metadata for a single preserved file. Contains the `compatible` string to > @@ -65,7 +65,7 @@ > * The LUO state is registered under this KHO entry name. > */ > #define LUO_KHO_ENTRY_NAME "LUO" > -#define LUO_COMPAT_BASE "luo-v3" > +#define LUO_COMPAT_BASE "luo-v4" > #define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE > #define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) > > @@ -103,9 +103,10 @@ struct luo_file_ser { > > /** > * struct luo_file_set_ser - Represents the serialized metadata for file set > - * @files: The physical address of a contiguous memory block that holds > - * the serialized state of files (array of luo_file_ser) in this file > - * set. > + * @files: The physical address of the first `struct kho_block_header_ser`. > + * This structure is the header for a block of memory containing > + * an array of `struct luo_file_ser` entries. Multiple blocks are > + * linked via the `next` field in the header. > * @count: The total number of files that were part of this session during > * serialization. 
Used for iteration and validation during > * restoration. > diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c > index 9eec07a9e9fc..a445b1950ca7 100644 > --- a/kernel/liveupdate/luo_file.c > +++ b/kernel/liveupdate/luo_file.c > @@ -118,11 +118,6 @@ static LIST_HEAD(luo_file_handler_list); > /* Keep track of files being preserved by LUO */ > static DEFINE_XARRAY(luo_preserved_files); > > -/* 2 4K pages, give space for 128 files per file_set */ > -#define LUO_FILE_PGCNT 2ul > -#define LUO_FILE_MAX \ > - ((LUO_FILE_PGCNT << PAGE_SHIFT) / sizeof(struct luo_file_ser)) > - > /** > * struct luo_file - Represents a single preserved file instance. > * @fh: Pointer to the &struct liveupdate_file_handler that manages > @@ -174,39 +169,6 @@ struct luo_file { > u64 token; > }; > > -static int luo_alloc_files_mem(struct luo_file_set *file_set) > -{ > - size_t size; > - void *mem; > - > - if (file_set->files) > - return 0; > - > - WARN_ON_ONCE(file_set->count); > - > - size = LUO_FILE_PGCNT << PAGE_SHIFT; > - mem = kho_alloc_preserve(size); > - if (IS_ERR(mem)) > - return PTR_ERR(mem); > - > - file_set->files = mem; > - > - return 0; > -} > - > -static void luo_free_files_mem(struct luo_file_set *file_set) > -{ > - /* If file_set has files, no need to free preservation memory */ > - if (file_set->count) > - return; > - > - if (!file_set->files) > - return; > - > - kho_unpreserve_free(file_set->files); > - file_set->files = NULL; > -} > - > static unsigned long luo_get_id(struct liveupdate_file_handler *fh, > struct file *file) > { > @@ -276,16 +238,15 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) > if (luo_token_is_used(file_set, token)) > return -EEXIST; > > - if (file_set->count == LUO_FILE_MAX) > - return -ENOSPC; > + err = kho_block_grow(&file_set->block_set, file_set->count); > + if (err) > + return err; > > file = fget(fd); > - if (!file) > - return -EBADF; > - > - err = luo_alloc_files_mem(file_set); > - if (err) > - goto 
err_fput; > + if (!file) { > + err = -EBADF; > + goto err_shrink; > + } > > err = -ENOENT; > down_read(&luo_register_rwlock); > @@ -300,7 +261,7 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) > > /* err is still -ENOENT if no handler was found */ > if (err) > - goto err_free_files_mem; > + goto err_fput; > > err = xa_insert(&luo_preserved_files, luo_get_id(fh, file), > file, GFP_KERNEL); > @@ -343,10 +304,10 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) > xa_erase(&luo_preserved_files, luo_get_id(fh, file)); > err_module_put: > module_put(fh->ops->owner); > -err_free_files_mem: > - luo_free_files_mem(file_set); > err_fput: > fput(file); > +err_shrink: > + kho_block_shrink(&file_set->block_set, file_set->count); > > return err; > } > @@ -392,13 +353,14 @@ void luo_file_unpreserve_files(struct luo_file_set *file_set) > > list_del(&luo_file->list); > file_set->count--; > + kho_block_shrink(&file_set->block_set, file_set->count); > > fput(luo_file->file); > mutex_destroy(&luo_file->mutex); > kfree(luo_file); > } > > - luo_free_files_mem(file_set); > + kho_block_destroy(&file_set->block_set); > } > > static int luo_file_freeze_one(struct luo_file_set *file_set, > @@ -454,7 +416,7 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, > luo_file_unfreeze_one(file_set, luo_file); > } > > - memset(file_set->files, 0, LUO_FILE_PGCNT << PAGE_SHIFT); > + kho_block_set_clear(&file_set->block_set); > } > > /** > @@ -493,19 +455,23 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, > int luo_file_freeze(struct luo_file_set *file_set, > struct luo_file_set_ser *file_set_ser) > { > - struct luo_file_ser *file_ser = file_set->files; > struct luo_file *luo_file; > + struct kho_block_it it; > int err; > - int i; > > if (!file_set->count) > return 0; > > - if (WARN_ON(!file_ser)) > - return -EINVAL; > + kho_block_it_init(&it, &file_set->block_set); > > - i = 0; > list_for_each_entry(luo_file, 
&file_set->files_list, list) { > + struct luo_file_ser *file_ser = kho_block_it_next(&it); > + > + if (!file_ser) { > + err = -ENOSPC; > + goto err_unfreeze; > + } This should not fail normally, right? Since we pre-allocate the memory. Perhaps add a comment saying that? > + > err = luo_file_freeze_one(file_set, luo_file); > if (err < 0) { > pr_warn("Freeze failed for token[%#0llx] handler[%s] err[%pe]\n", > @@ -514,16 +480,21 @@ int luo_file_freeze(struct luo_file_set *file_set, > goto err_unfreeze; > } > > - strscpy(file_ser[i].compatible, luo_file->fh->compatible, > - sizeof(file_ser[i].compatible)); > - file_ser[i].data = luo_file->serialized_data; > - file_ser[i].token = luo_file->token; > - i++; > + strscpy(file_ser->compatible, luo_file->fh->compatible, > + sizeof(file_ser->compatible)); > + file_ser->data = luo_file->serialized_data; > + file_ser->token = luo_file->token; > } > + kho_block_it_finalize(&it); > > file_set_ser->count = file_set->count; > - if (file_set->files) > - file_set_ser->files = virt_to_phys(file_set->files); > + if (!list_empty(&file_set->block_set.blocks)) { > + struct kho_block *block; > + > + block = list_first_entry(&file_set->block_set.blocks, > + struct kho_block, list); > + file_set_ser->files = virt_to_phys(block->ser); > + } Please, add an API in KHO block to return the header physical address. Poking into the internals of the data structure like this is not a good idea. I missed that patch 9 also does this. So please use that there too. 
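For illustration only, the accessor requested above could look roughly like the sketch below. This is a userspace model, not code from the series: kho_block_set_head_pa() is a hypothetical name, the kernel list is reduced to a plain pointer to the first block, and mock_virt_to_phys() stands in for virt_to_phys(). The posted struct kho_block_set also carries a head_pa field, so a real helper might simply return that instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-ins for the structures in the patch (assumption). */
struct kho_block_header_ser {
	uint64_t next;
	uint64_t count;
};

struct kho_block {
	struct kho_block *next;			/* models the list linkage */
	struct kho_block_header_ser *ser;	/* serialized header */
};

struct kho_block_set {
	struct kho_block *first;		/* models list_first_entry() */
};

/* Stand-in for virt_to_phys(); identity mapping in this model. */
uint64_t mock_virt_to_phys(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

/*
 * Hypothetical accessor: physical address of the first block header,
 * or 0 when the set holds no blocks. Callers no longer need to look
 * at bs->blocks or block->ser themselves.
 */
uint64_t kho_block_set_head_pa(const struct kho_block_set *bs)
{
	if (!bs->first)
		return 0;

	return mock_virt_to_phys(bs->first->ser);
}
```

With something along these lines, luo_file_freeze() could assign the return value to file_set_ser->files directly, and the same helper (or a separate "is empty" helper) would cover the WARN_ON() in luo_file_set_destroy().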
> > return 0; > > @@ -741,14 +712,12 @@ int luo_file_finish(struct luo_file_set *file_set) > module_put(luo_file->fh->ops->owner); > list_del(&luo_file->list); > file_set->count--; > + kho_block_shrink(&file_set->block_set, file_set->count); > mutex_destroy(&luo_file->mutex); > kfree(luo_file); > } > > - if (file_set->files) { > - kho_restore_free(file_set->files); > - file_set->files = NULL; > - } > + kho_block_destroy(&file_set->block_set); > > return 0; > } > @@ -822,16 +791,18 @@ int luo_file_deserialize(struct luo_file_set *file_set, > struct luo_file_set_ser *file_set_ser) > { > struct luo_file_ser *file_ser; > + struct kho_block_it it; > int err; > - u64 i; > > if (!file_set_ser->files) { > WARN_ON(file_set_ser->count); > return 0; > } > > - file_set->count = file_set_ser->count; > - file_set->files = phys_to_virt(file_set_ser->files); > + file_set->count = 0; > + err = kho_block_restore(&file_set->block_set, file_set_ser->files); > + if (err) > + return err; > > /* > * Note on error handling: > @@ -848,25 +819,50 @@ int luo_file_deserialize(struct luo_file_set *file_set, > * userspace to detect the failure and trigger a reboot, which will > * reliably reset devices and reclaim memory. 
> */ > - file_ser = file_set->files; > - for (i = 0; i < file_set->count; i++) { > - err = luo_file_deserialize_one(file_set, &file_ser[i]); > + kho_block_it_init(&it, &file_set->block_set); > + while ((file_ser = kho_block_it_read(&it))) { > + err = luo_file_deserialize_one(file_set, file_ser); > if (err) > - return err; > + goto err_destroy_blocks; > + file_set->count++; > + } > + > + if (file_set->count != file_set_ser->count) { > + pr_warn("File count mismatch: expected %llu, found %llu\n", > + file_set_ser->count, file_set->count); > + err = -EINVAL; > + goto err_destroy_blocks; > } > > return 0; > + > +err_destroy_blocks: > + while (!list_empty(&file_set->files_list)) { > + struct luo_file *luo_file; > + > + luo_file = list_first_entry(&file_set->files_list, > + struct luo_file, list); > + list_del(&luo_file->list); > + module_put(luo_file->fh->ops->owner); > + mutex_destroy(&luo_file->mutex); > + kfree(luo_file); > + } > + file_set->count = 0; > + kho_block_destroy(&file_set->block_set); > + return err; > } > > void luo_file_set_init(struct luo_file_set *file_set) > { > INIT_LIST_HEAD(&file_set->files_list); > + kho_block_set_init(&file_set->block_set, sizeof(struct luo_file_ser)); > } > > void luo_file_set_destroy(struct luo_file_set *file_set) > { > WARN_ON(file_set->count); > WARN_ON(!list_empty(&file_set->files_list)); > + WARN_ON(!list_empty(&file_set->block_set.blocks)); Here too. > } > > /** > diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h > index ee18f9a11b91..64879ffe7378 100644 > --- a/kernel/liveupdate/luo_internal.h > +++ b/kernel/liveupdate/luo_internal.h > @@ -10,6 +10,7 @@ > > #include > #include > +#include > > struct luo_ucmd { > void __user *ubuffer; > @@ -44,14 +45,13 @@ static inline int luo_ucmd_respond(struct luo_ucmd *ucmd, > * struct luo_file_set - A set of files that belong to the same sessions. 
> * @files_list: An ordered list of files associated with this session, it is > * ordered by preservation time. > - * @files: The physically contiguous memory block that holds the serialized > - * state of files. > + * @block_set: The set of serialization blocks. > * @count: A counter tracking the number of files currently stored in the > * @files_list for this session. > */ > struct luo_file_set { > struct list_head files_list; > - struct luo_file_ser *files; > + struct kho_block_set block_set; > u64 count; > }; -- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 07:17:19 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 16:17:19 +0200 Subject: [PATCH v4 11/13] selftests/liveupdate: Test session and file limit removal In-Reply-To: <20260530221938.115978-12-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:36 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-12-pasha.tatashin@soleen.com> Message-ID: <2vxz7boifiq8.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > With the removal of static limits on the number of sessions and files per > session, the orchestrator now uses dynamic allocation. > > Add new test cases to verify that the system can handle a large number of > sessions and files. These tests ensure that the dynamic block allocation > and reuse logic for session metadata and outgoing files work correctly > beyond the previous static limits. 
> > Acked-by: Mike Rapoport (Microsoft) > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) > --- > .../testing/selftests/liveupdate/liveupdate.c | 75 +++++++++++++++++++ > .../selftests/liveupdate/luo_test_utils.c | 24 ++++++ > .../selftests/liveupdate/luo_test_utils.h | 2 + > 3 files changed, 101 insertions(+) > > diff --git a/tools/testing/selftests/liveupdate/liveupdate.c b/tools/testing/selftests/liveupdate/liveupdate.c > index c7d94b9181e1..502fb3567e38 100644 > --- a/tools/testing/selftests/liveupdate/liveupdate.c > +++ b/tools/testing/selftests/liveupdate/liveupdate.c > @@ -26,6 +26,7 @@ > > #include > > +#include "luo_test_utils.h" > #include "../kselftest.h" > #include "../kselftest_harness.h" > > @@ -499,4 +500,78 @@ TEST_F(liveupdate_device, get_session_name_max_length) > ASSERT_EQ(close(session_fd), 0); > } > > +/* > + * Test Case: Manage Many Sessions > + * > + * Verifies that a large number of sessions can be created and then > + * destroyed during normal system operation. This specifically tests the > + * dynamic block allocation and reuse logic for session metadata management > + * without preserving any files. 
> + */ > +TEST_F(liveupdate_device, preserve_many_sessions) > +{ > +#define MANY_SESSIONS 2000 > + int session_fds[MANY_SESSIONS]; > + int ret, i; > + > + self->fd1 = open(LIVEUPDATE_DEV, O_RDWR); > + if (self->fd1 < 0 && errno == ENOENT) > + SKIP(return, "%s does not exist", LIVEUPDATE_DEV); > + ASSERT_GE(self->fd1, 0); > + > + ret = luo_ensure_nofile_limit(MANY_SESSIONS); > + if (ret == -EPERM) > + SKIP(return, "Insufficient privileges to set RLIMIT_NOFILE"); > + ASSERT_EQ(ret, 0); > + > + for (i = 0; i < MANY_SESSIONS; i++) { > + char name[64]; > + > + snprintf(name, sizeof(name), "many-session-%d", i); > + session_fds[i] = create_session(self->fd1, name); > + ASSERT_GE(session_fds[i], 0); > + } > + > + for (i = 0; i < MANY_SESSIONS; i++) > + ASSERT_EQ(close(session_fds[i]), 0); > +} > + > +/* > + * Test Case: Preserve Many Files > + * > + * Verifies that a large number of files can be preserved in a single session > + * and then destroyed during normal system operation. This tests the dynamic > + * block allocation and management for outgoing files. 
> + */ > +TEST_F(liveupdate_device, preserve_many_files) > +{ > +#define MANY_FILES 500 > + int mem_fds[MANY_FILES]; > + int session_fd, ret, i; > + > + self->fd1 = open(LIVEUPDATE_DEV, O_RDWR); > + if (self->fd1 < 0 && errno == ENOENT) > + SKIP(return, "%s does not exist", LIVEUPDATE_DEV); > + ASSERT_GE(self->fd1, 0); > + > + session_fd = create_session(self->fd1, "many-files-test"); > + ASSERT_GE(session_fd, 0); > + > + ret = luo_ensure_nofile_limit(MANY_FILES + 10); > + if (ret == -EPERM) > + SKIP(return, "Insufficient privileges to set RLIMIT_NOFILE"); > + ASSERT_EQ(ret, 0); > + > + for (i = 0; i < MANY_FILES; i++) { > + mem_fds[i] = memfd_create("test-memfd", 0); > + ASSERT_GE(mem_fds[i], 0); > + ASSERT_EQ(preserve_fd(session_fd, mem_fds[i], i), 0); > + } > + > + for (i = 0; i < MANY_FILES; i++) > + ASSERT_EQ(close(mem_fds[i]), 0); > + > + ASSERT_EQ(close(session_fd), 0); > +} > + > TEST_HARNESS_MAIN > diff --git a/tools/testing/selftests/liveupdate/luo_test_utils.c b/tools/testing/selftests/liveupdate/luo_test_utils.c > index 3c8721c505df..333a3530051b 100644 > --- a/tools/testing/selftests/liveupdate/luo_test_utils.c > +++ b/tools/testing/selftests/liveupdate/luo_test_utils.c > @@ -17,6 +17,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -28,6 +29,29 @@ int luo_open_device(void) > return open(LUO_DEVICE, O_RDWR); > } > > +int luo_ensure_nofile_limit(long min_limit) > +{ > + struct rlimit hl; > + > + /* Allow to extra files to be used by test itself */ > + min_limit += 32; > + > + if (getrlimit(RLIMIT_NOFILE, &hl) < 0) > + return -errno; > + > + if (hl.rlim_cur >= min_limit) > + return 0; > + > + hl.rlim_cur = min_limit; > + if (hl.rlim_cur > hl.rlim_max) > + hl.rlim_max = hl.rlim_cur; > + > + if (setrlimit(RLIMIT_NOFILE, &hl) < 0) > + return -errno; > + > + return 0; > +} > + > int luo_create_session(int luo_fd, const char *name) > { > struct liveupdate_ioctl_create_session arg = { .size = sizeof(arg) }; > diff --git 
a/tools/testing/selftests/liveupdate/luo_test_utils.h b/tools/testing/selftests/liveupdate/luo_test_utils.h > index 90099bf49577..6a0d85386613 100644 > --- a/tools/testing/selftests/liveupdate/luo_test_utils.h > +++ b/tools/testing/selftests/liveupdate/luo_test_utils.h > @@ -26,6 +26,8 @@ int luo_create_session(int luo_fd, const char *name); > int luo_retrieve_session(int luo_fd, const char *name); > int luo_session_finish(int session_fd); > > +int luo_ensure_nofile_limit(long min_limit); > + > int create_and_preserve_memfd(int session_fd, int token, const char *data); > int restore_and_verify_memfd(int session_fd, int token, const char *expected_data); -- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 07:19:57 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 16:19:57 +0200 Subject: [PATCH v4 12/13] selftests/liveupdate: Add stress-sessions kexec test In-Reply-To: <20260530221938.115978-13-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:37 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-13-pasha.tatashin@soleen.com> Message-ID: <2vxz33z6filu.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > Add a new test that creates 2000 LUO sessions before a kexec > reboot and verifies their presence after the reboot. This ensures > that the linked-block serialization mechanism works correctly for > a large number of sessions. > > Acked-by: Mike Rapoport (Microsoft) > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) [...] 
-- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 07:27:39 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 16:27:39 +0200 Subject: [PATCH v4 04/13] liveupdate: register luo_ser as KHO subtree In-Reply-To: (Pasha Tatashin's message of "Mon, 1 Jun 2026 09:50:49 -0400") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-5-pasha.tatashin@soleen.com> <2vxzv7c2fn8n.fsf@kernel.org> Message-ID: <2vxzy0gye3ok.fsf@kernel.org> On Mon, Jun 01 2026, Pasha Tatashin wrote: > On 06-01 14:39, Pratyush Yadav wrote: >> On Sat, May 30 2026, Pasha Tatashin wrote: >> >> > Entirely remove the LUO FDT wrapper since the FDT only carries the >> > compatible string and the pointer to the centralized struct luo_ser. >> > Instead, register the struct luo_ser via the KHO raw subtree >> > API, placing the compatibility string inside the structure itself. >> > >> > Signed-off-by: Pasha Tatashin >> > --- >> > include/linux/kho/abi/luo.h | 57 +++++++++--------------- >> > kernel/liveupdate/luo_core.c | 85 +++++++++++------------------------- >> > 2 files changed, 46 insertions(+), 96 deletions(-) >> > >> > diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h >> > index 1b2f865a771a..9a4fe491812b 100644 >> > --- a/include/linux/kho/abi/luo.h >> > +++ b/include/linux/kho/abi/luo.h >> > @@ -10,11 +10,11 @@ >> > * >> > * Live Update Orchestrator uses the stable Application Binary Interface >> > * defined below to pass state from a pre-update kernel to a post-update >> > - * kernel. The ABI is built upon the Kexec HandOver framework and uses a >> > - * Flattened Device Tree to describe the preserved data. >> > + * kernel. The ABI is built upon the Kexec HandOver framework and registers >> > + * the central `struct luo_ser` via the KHO raw subtree API. >> > * >> > - * This interface is a contract. 
Any modification to the FDT structure, node >> > - * properties, compatible strings, or the layout of the `__packed` serialization >> > + * This interface is a contract. Any modification to the structure fields, >> > + * compatible strings, or the layout of the `__packed` serialization >> > * structures defined here constitutes a breaking change. Such changes require >> > * incrementing the version number in the relevant `_COMPATIBLE` string to >> > * prevent a new kernel from misinterpreting data from an old kernel. >> > @@ -23,31 +23,15 @@ >> > * however, backward/forward compatibility is only guaranteed for kernels >> > * supporting the same ABI version. >> > * >> > - * FDT Structure Overview: >> > + * KHO Structure Overview: >> > * The entire LUO state is encapsulated within a single KHO entry named "LUO". >> > - * This entry contains an FDT with the following layout: >> > - * >> > - * .. code-block:: none >> > - * >> > - * / { >> > - * compatible = "luo-v2"; >> > - * luo-abi-header = ; >> > - * }; >> > - * >> > - * Main LUO Node (/): >> > - * >> > - * - compatible: "luo-v2" >> > - * Identifies the overall LUO ABI version. >> > - * - luo-abi-header: u64 >> > - * The physical address of `struct luo_ser`. >> > + * This entry contains the `struct luo_ser` structure. >> > * >> > * Serialization Structures: >> > - * The FDT properties point to memory regions containing arrays of simple, >> > - * `__packed` structures. These structures contain the actual preserved state. >> > - * >> > * - struct luo_ser: >> > * The central ABI structure that contains the overall state of the LUO. >> > - * It includes the liveupdate-number and pointers to sessions and FLBs. >> > + * It includes the compatibility string, the liveupdate-number, and pointers >> > + * to sessions and FLBs. >> > * >> > * - struct luo_session_header_ser: >> > * Header for the session array. 
Contains the total page count of the >> > @@ -78,26 +62,27 @@ >> > #ifndef _LINUX_KHO_ABI_LUO_H >> > #define _LINUX_KHO_ABI_LUO_H >> > >> > +#include >> > #include >> > >> > /* >> > - * The LUO FDT hooks all LUO state for sessions, fds, etc. >> > + * The LUO state is registered under this KHO entry name. >> > */ >> > -#define LUO_FDT_SIZE PAGE_SIZE >> > -#define LUO_FDT_KHO_ENTRY_NAME "LUO" >> > -#define LUO_FDT_COMPATIBLE "luo-v2" >> > -#define LUO_FDT_ABI_HEADER "luo-abi-header" >> > +#define LUO_KHO_ENTRY_NAME "LUO" >> > +#define LUO_ABI_COMPATIBLE "luo-v3" >> > +#define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) >> >> The length of the compatible field will change depending on the length >> of the string. While that is technically fine since a new ABI version is >> allowed to change the layout, it feels odd. I think it would be better >> if we define a static size here, say 64 bytes. This way you can avoid >> all the weirdness that can happen when you move from one version to >> another. > > This is what I used initially, but we have cases where one LUO/KHO > subsystem depends on another. For example, the LUO version must change > when the block version changes, making the static length too > restrictive. I would prefer to use proper strncmp() everywhere and allow > the version string to change dynamically between kernels, while still > allowing something like this (from [PATCH v4 09/13] liveupdate: Remove > limit on the number of sessions): > > #define LUO_COMPAT_BASE "luo-v3" > #define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" > KHO_BLOCK_ABI_COMPATIBLE > > In the future, we may extend this further as we add more dependencies, > such as your preservable xarray, vmalloc, etc. Everything that depends > on an external version should include that in its compatibility string. Hmm, it feels odd, but I don't have any real counter arguments. So let's keep this as-is. > >> >> > >> > /** >> > * struct luo_ser - Centralized LUO ABI header. 
>> > + * @compatible: Compatibility string identifying the LUO ABI version. >> > * @liveupdate_num: A counter tracking the number of successful live updates. >> > * @sessions_pa: Physical address of the first session block header. >> > * @flbs_pa: Physical address of the FLB header. >> > * >> > - * This structure is the root of all preserved LUO state. It is pointed to by >> > - * the "luo-abi-header" property in the LUO FDT. >> > + * This structure is the root of all preserved LUO state. >> > */ >> > struct luo_ser { >> > + char compatible[LUO_ABI_COMPAT_LEN]; >> > u64 liveupdate_num; >> > u64 sessions_pa; >> > u64 flbs_pa; >> [...] >> > @@ -94,40 +91,29 @@ static int __init luo_early_startup(void) >> > return 0; >> > } >> > >> > - /* Retrieve LUO subtree, and verify its format. */ >> > - err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys, NULL); >> > + /* Retrieve LUO state from KHO. */ >> > + err = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &luo_ser_phys, &len); >> > if (err) { >> > if (err != -ENOENT) { >> > - pr_err("failed to retrieve FDT '%s' from KHO: %pe\n", >> > - LUO_FDT_KHO_ENTRY_NAME, ERR_PTR(err)); >> > + pr_err("failed to retrieve LUO state '%s' from KHO: %pe\n", >> > + LUO_KHO_ENTRY_NAME, ERR_PTR(err)); >> > return err; >> > } >> > >> > return 0; >> > } >> > >> > - luo_global.fdt_in = phys_to_virt(fdt_phys); >> > - err = fdt_node_check_compatible(luo_global.fdt_in, 0, >> > - LUO_FDT_COMPATIBLE); >> > - if (err) { >> > - pr_err("FDT '%s' is incompatible with '%s' [%d]\n", >> > - LUO_FDT_KHO_ENTRY_NAME, LUO_FDT_COMPATIBLE, err); >> > - >> > + if (len < sizeof(*luo_ser)) { >> >> len != sizeof(*luo_ser) here? > > I can change this, but it is not necessary. It is common practice to > verify that a "struct" is not smaller when compatibility is checked, > allowing for future expansion without breaking compatibility with older > kernels. 
I know we do not support forward/backward compatibility in any > way right now, but I do not think it hurts to put the proper safeguards > in place. Yeah, that was my point. We don't support anything other than exact agreement on formats. But let's keep it this way for now, so we can grow the struct in a backwards compatible way if needed. > > Pasha > >> >> > + pr_err("LUO state is too small (%zu < %zu)\n", len, sizeof(*luo_ser)); >> > return -EINVAL; >> > } >> > >> > - header_size = 0; >> > - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); >> > - if (!ptr || header_size != sizeof(u64)) { >> > - pr_err("Unable to get ABI header '%s' [%d]\n", >> > - LUO_FDT_ABI_HEADER, header_size); >> > - >> > + luo_ser = phys_to_virt(luo_ser_phys); >> > + if (strncmp(luo_ser->compatible, LUO_ABI_COMPATIBLE, LUO_ABI_COMPAT_LEN)) { >> > + pr_err("LUO state is incompatible with '%s'\n", LUO_ABI_COMPATIBLE); >> > return -EINVAL; >> > } >> > >> > - luo_ser_pa = get_unaligned((u64 *)ptr); >> > - luo_ser = phys_to_virt(luo_ser_pa); >> > - >> > luo_global.liveupdate_num = luo_ser->liveupdate_num; >> > pr_info("Retrieved live update data, liveupdate number: %lld\n", >> > luo_global.liveupdate_num); >> [...] >> >> -- >> Regards, >> Pratyush Yadav -- Regards, Pratyush Yadav From pasha.tatashin at soleen.com Mon Jun 1 07:37:35 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Mon, 1 Jun 2026 10:37:35 -0400 Subject: [PATCH v4 07/13] kho: add support for linked-block serialization In-Reply-To: <2vxzqzmqfkit.fsf@kernel.org> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-8-pasha.tatashin@soleen.com> <2vxzqzmqfkit.fsf@kernel.org> Message-ID: On 06-01 15:38, Pratyush Yadav wrote: > On Sat, May 30 2026, Pasha Tatashin wrote: > > > Introduce a linked-block serialization mechanism for state handover. 
> > > > Previously, LUO used contiguous memory blocks for serializing sessions > > and files, which imposed limits on the total number of items that could > > be preserved across a live update. > > > > This commit adds the infrastructure for a more flexible, block-based > > approach where serialized data is stored in a chain of linked blocks. > > This is a generic KHO serialization block infrastructure that can be > > used by multiple subsystems. > > > > Signed-off-by: Pasha Tatashin > > --- > > Documentation/core-api/kho/abi.rst | 5 + > > Documentation/core-api/kho/index.rst | 11 + > > MAINTAINERS | 1 + > > include/linux/kho/abi/block.h | 56 ++++ > > include/linux/kho_block.h | 79 ++++++ > > kernel/liveupdate/Makefile | 1 + > > kernel/liveupdate/kho_block.c | 384 +++++++++++++++++++++++++++ > > 7 files changed, 537 insertions(+) > > create mode 100644 include/linux/kho/abi/block.h > > create mode 100644 include/linux/kho_block.h > > create mode 100644 kernel/liveupdate/kho_block.c > > > > diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst > > index 799d743105a6..edeb5b311963 100644 > > --- a/Documentation/core-api/kho/abi.rst > > +++ b/Documentation/core-api/kho/abi.rst > > @@ -28,6 +28,11 @@ KHO persistent memory tracker ABI > > .. kernel-doc:: include/linux/kho/abi/kexec_handover.h > > :doc: KHO persistent memory tracker > > > > +KHO serialization block ABI > > +=========================== > > + > > +.. kernel-doc:: include/linux/kho/abi/block.h > > + > > See Also > > ======== > > > > diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst > > index 0a2dee4f8e7d..320914a42178 100644 > > --- a/Documentation/core-api/kho/index.rst > > +++ b/Documentation/core-api/kho/index.rst > > @@ -83,6 +83,17 @@ Public API > > .. kernel-doc:: kernel/liveupdate/kexec_handover.c > > :export: > > > > +KHO Serialization Blocks API > > +============================ > > + > > +.. 
kernel-doc:: kernel/liveupdate/kho_block.c > > + :doc: KHO Serialization Blocks > > + > > +.. kernel-doc:: include/linux/kho_block.h > > + > > +.. kernel-doc:: kernel/liveupdate/kho_block.c > > + :internal: > > + > > See Also > > ======== > > > > diff --git a/MAINTAINERS b/MAINTAINERS > > index 2fb1c75afd16..fd119b343e99 100644 > > --- a/MAINTAINERS > > +++ b/MAINTAINERS > > @@ -14194,6 +14194,7 @@ F: Documentation/admin-guide/mm/kho.rst > > F: Documentation/core-api/kho/* > > F: include/linux/kexec_handover.h > > F: include/linux/kho/ > > +F: include/linux/kho_block.h > > F: kernel/liveupdate/kexec_handover* > > F: lib/test_kho.c > > F: tools/testing/selftests/kho/ > > diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h > > new file mode 100644 > > index 000000000000..8641c20b379b > > --- /dev/null > > +++ b/include/linux/kho/abi/block.h > > @@ -0,0 +1,56 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +/* > > + * Copyright (c) 2026, Google LLC. > > + * Pasha Tatashin > > + */ > > + > > +/** > > + * DOC: KHO Serialization Blocks ABI > > + * > > + * Subsystems using the KHO Serialization Blocks framework rely on the stable > > + * Application Binary Interface defined below to pass serialized state from a > > + * pre-update kernel to a post-update kernel. > > + * > > + * This interface is a contract. Any modification to the structure fields, > > + * compatible strings, or the layout of the `__packed` serialization > > + * structures defined here constitutes a breaking change. Such changes require > > + * incrementing the version number in the `KHO_BLOCK_ABI_COMPATIBLE` string to > > + * prevent a new kernel from misinterpreting data from an old kernel. > > + * > > + * Changes are allowed provided the compatibility version is incremented; > > + * however, backward/forward compatibility is only guaranteed for kernels > > + * supporting the same ABI version. 
> > + */ > > + > > +#ifndef _LINUX_KHO_ABI_BLOCK_H > > +#define _LINUX_KHO_ABI_BLOCK_H > > + > > +#include > > +#include > > + > > +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" > > During KHO radix development, I argued for a separate compatible for the > radix tree, but at that time, we tied the radix tree to core KHO ABI. > The argument being that all core KHO data structures belong to the KHO > ABI set. I imagine this will be used by kho_vmalloc, so it will also be > end up being used by a core KHO API. > > So, do we want separate ABI? I don't much have a preference myself, but > I do think the compatible management will be a bit easier if this relied > on KHO compatible, especially once kho_vmalloc starts using it. I prefer to make them fine-grained, now that we are adding more and more features: kho vmalloc, kho radix, and kho block should all have their own compatibility strings. Furthermore, any components that depend on them should include these compatibility strings in their own compatibility strings, in the same manner I have done in this series. > > > + > > +/** > > + * KHO_BLOCK_SIZE - The size of each serialization block. > > + * > > + * This is defined as PAGE_SIZE. PAGE_SIZE is ABI compliant because live > > + * update between kernels with different page sizes is not supported by KHO. > > + */ > > +#define KHO_BLOCK_SIZE PAGE_SIZE > > + > > +/** > > + * struct kho_block_header_ser - Header for the serialized data block. > > + * @next: Physical address of the next struct kho_block_header_ser. > > + * @count: The number of entries that immediately follow this header in the > > + * memory block. > > + * > > + * This structure is located at the beginning of a block of physical memory > > + * preserved across a kexec. It provides the necessary metadata to interpret > > + * the array of entries that follow. 
> > + */ > > +struct kho_block_header_ser { > > + u64 next; > > + u64 count; > > +} __packed; > > + > > +#endif /* _LINUX_KHO_ABI_BLOCK_H */ > > diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h > > new file mode 100644 > > index 000000000000..5e6b87b1befa > > --- /dev/null > > +++ b/include/linux/kho_block.h > > @@ -0,0 +1,79 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +/* > > + * Copyright (c) 2026, Google LLC. > > + * Pasha Tatashin > > + */ > > + > > +#ifndef _LINUX_KHO_BLOCK_H > > +#define _LINUX_KHO_BLOCK_H > > + > > +#include > > +#include > > +#include > > + > > +/** > > + * struct kho_block - Internal representation of a serialization block. > > + * @list: List head for linking blocks in memory. > > + * @ser: Pointer to the serialized header in preserved memory. > > + */ > > +struct kho_block { > > + struct list_head list; > > + struct kho_block_header_ser *ser; > > +}; > > + > > +/** > > + * struct kho_block_set - A set of blocks that belong to the same object. > > + * @blocks: The list of serialization blocks (struct kho_block). > > + * @nblocks: The number of allocated serialization blocks. > > + * @head_pa: Physical address of the first block header. > > + * @entry_size: The size of each entry in the blocks. > > + * @count_per_block: The maximum number of entries each block can hold. > > + * @incoming: True if this block set was restored from the previous kernel. > > + */ > > +struct kho_block_set { > > + struct list_head blocks; > > + long nblocks; > > + u64 head_pa; > > + size_t entry_size; > > I think we should add the entry_size to kho_block_header_ser? I think it > is a part of the ABI of the block set. If this changes, we cannot parse > a block set with a different size. If a subsystem wants to change entry > size, they create a new block set with different entry size, and then > they bump their compatible version. 
I have considered that, and we can certainly do it; however, I do not
see how it would affect the current implementation. If luo_file or
luo_session change entry_size, they must change the LUO compatibility
version, which would prevent a live update from one kernel to the next.

However, for flexibility and future extensibility, I believe it would be
useful to add entry_size and block_size (which is PAGE_SIZE, but could
be larger for some users) to the header. This is more of a feature
request than an issue with the current series.

>
> > +	u64 count_per_block;
> > +	bool incoming;
> > +};
> > +
> > +/**
> > + * struct kho_block_it - Iterator for serializing entries into blocks.
> > + * @bs: The block set being iterated.
> > + * @block: The current block.
> > + * @i: The current entry index within @block.
> > + */
> > +struct kho_block_it {
> > +	struct kho_block_set *bs;
> > +	struct kho_block *block;
> > +	u64 i;
> > +};
> > +
> > +/**
> > + * KHO_BLOCK_SET_INIT - Initialize a static kho_block_set.
> > + * @_name: Name of the kho_block_set variable.
> > + * @_entry_size: The size of each entry in the block set.
> > + */
> > +#define KHO_BLOCK_SET_INIT(_name, _entry_size) { \
> > +	.blocks = LIST_HEAD_INIT((_name).blocks), \
> > +	.entry_size = _entry_size, \
> > +}
> > +
> > +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size);
> > +
> > +int kho_block_grow(struct kho_block_set *bs, u64 count);
> > +void kho_block_shrink(struct kho_block_set *bs, u64 count);
>
> These block management functions seem like internal details of the block

This is not so. The confusion here is that they must be allocated and
preserved at runtime as resources are registered/unregistered, while
these blocks are only used during the serialization phase. These calls
are more like notifiers that files/sessions have been created or
removed, so we can adjust the block count accordingly if necessary
(allocate preserved memory), and have the blocks available during
serialization/deserialization.

> set API.
Do we need to export them? I think users should not have to > worry about block management. They should read, set, or clear entries > using the iterators, and internally the block management should take of > allocation or freeing. So here, for example, I think something is missing :-) > > > + > > +int kho_block_restore(struct kho_block_set *bs, u64 head_pa); > > +void kho_block_destroy(struct kho_block_set *bs); > > Nit: kho_block_set_{restore,destroy}()? At first glance I thought they > manipulated a single block. Makes sense. > > > +void kho_block_set_clear(struct kho_block_set *bs); > > + > > +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs); > > +void *kho_block_it_next(struct kho_block_it *it); > > +void *kho_block_it_read(struct kho_block_it *it); > > +void *kho_block_it_prev(struct kho_block_it *it); > > +void kho_block_it_finalize(struct kho_block_it *it); > > + > > +#endif /* _LINUX_KHO_BLOCK_H */ > > diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile > > index d2f779cbe279..eec9d3ae07eb 100644 > > --- a/kernel/liveupdate/Makefile > > +++ b/kernel/liveupdate/Makefile > > @@ -1,6 +1,7 @@ > > # SPDX-License-Identifier: GPL-2.0 > > > > luo-y := \ > > + kho_block.o \ > > luo_core.o \ > > luo_file.o \ > > luo_flb.o \ > > diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c > > new file mode 100644 > > index 000000000000..a4e650af946f > > --- /dev/null > > +++ b/kernel/liveupdate/kho_block.c > > @@ -0,0 +1,384 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > + > > +/* > > + * Copyright (c) 2026, Google LLC. > > + * Pasha Tatashin > > + */ > > + > > +/** > > + * DOC: KHO Serialization Blocks > > + * > > + * KHO provides a mechanism to preserve stateful data across a kexec handover > > + * by serializing it into memory blocks. This file provides the common > > + * infrastructure for managing these blocks.
> > + * > > + * Each block consists of a header (struct kho_block_header_ser) followed by an > > + * array of serialized entries. Multiple blocks are linked together via a > > + * physical pointer in the header, forming a linked list that can be easily > > + * traversed in both the current and the next kernel. > > + */ > > + > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +/* > > + * Safeguard limit for the number of serialization blocks. This is used to > > + * prevent infinite loops and excessive memory allocation in case of memory > > + * corruption in the preserved state. > > + */ > > +#define KHO_MAX_BLOCKS 10000 > > + > > +/** > > + * kho_block_set_init - Initialize a block set. > > + * @bs: The block set to initialize. > > + * @entry_size: The size of each entry in the blocks. > > + */ > > +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size) > > +{ > > + *bs = (struct kho_block_set)KHO_BLOCK_SET_INIT(*bs, entry_size); > > +} > > + > > +static inline u64 kho_block_count_per_block(struct kho_block_set *bs) > > +{ > > + if (unlikely(!bs->count_per_block)) { > > + bs->count_per_block = (KHO_BLOCK_SIZE - > > + sizeof(struct kho_block_header_ser)) / > > + bs->entry_size; > > + WARN_ON(!bs->count_per_block); > > + } > > + return bs->count_per_block; > > +} > > This looks odd. I don't see a reason to calculate this lazily. Why not > just do it when initializing the block set, in kho_block_set_init() or > kho_block_restore()? And then use bs->count_per_block directly. 
This allows block sets to use static initialization; I like static inits :-) > > > + > > +/* Free serialized data */ > > +static void kho_block_free_ser(struct kho_block_set *bs, > > + struct kho_block_header_ser *ser) > > +{ > > + if (bs->incoming) > > + kho_restore_free(ser); > > + else > > + kho_unpreserve_free(ser); > > +} > > + > > +static struct kho_block_header_ser *kho_block_alloc_ser(struct kho_block_set *bs) > > +{ > > + WARN_ON(bs->incoming); > > WARN_ON_ONCE? Sure > > > + return kho_alloc_preserve(KHO_BLOCK_SIZE); > > +} > > + > > +static int kho_block_add(struct kho_block_set *bs, > > + struct kho_block_header_ser *ser) > > +{ > > + struct kho_block *block, *last; > > + > > + if (bs->nblocks >= KHO_MAX_BLOCKS) > > + return -ENOSPC; > > + > > + block = kzalloc_obj(*block); > > + if (!block) > > + return -ENOMEM; > > + > > + block->ser = ser; > > + last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); > > + list_add_tail(&block->list, &bs->blocks); > > + bs->nblocks++; > > + > > + if (last) > > + last->ser->next = virt_to_phys(ser); > > + else > > + bs->head_pa = virt_to_phys(ser); > > + > > + return 0; > > +} > > + > > +/** > > + * kho_block_grow - Create a new block if the current capacity is reached. > > + * @bs: The block set. > > + * @count: The current number of entries. > > + * > > + * This function handles the dynamic expansion of a block set. It allocates > > + * and links a new serialization block if the provided entry count matches > > + * the current total capacity of the set. > > + * > > + * Return: 0 on success, or a negative errno on failure. > > + */ > > +int kho_block_grow(struct kho_block_set *bs, u64 count) > > +{ > > + struct kho_block_header_ser *ser; > > + int err; > > + > > + if (WARN_ON(bs->incoming)) > > WARN_ON_ONCE here too?
Sure > > > + return -EINVAL; > > + > > + if (count != bs->nblocks * kho_block_count_per_block(bs)) > > + return 0; > > + > > + ser = kho_block_alloc_ser(bs); > > + if (IS_ERR(ser)) > > + return PTR_ERR(ser); > > + > > + err = kho_block_add(bs, ser); > > + if (err) { > > + kho_block_free_ser(bs, ser); > > + return err; > > + } > > + > > + return 0; > > +} > > + > > +/** > > + * kho_block_shrink - Conditionally destroy the last block in a block set. > > + * @bs: The block set. > > + * @count: The current number of entries across all blocks. > > + * > > + * This function checks if the last block in the set is redundant based on the > > + * total entry count and the capacity of the preceding blocks. If the entry > > + * count can be accommodated by the blocks that come before the last one, the > > + * last block is destroyed and removed from the set. > > + */ > > +void kho_block_shrink(struct kho_block_set *bs, u64 count) > > +{ > > + struct kho_block *last, *new_last; > > + > > + if (count > (bs->nblocks - 1) * kho_block_count_per_block(bs)) > > + return; > > + > > + if (list_empty(&bs->blocks)) > > + return; > > + > > + last = list_last_entry(&bs->blocks, struct kho_block, list); > > + list_del(&last->list); > > + bs->nblocks--; > > + kho_block_free_ser(bs, last->ser); > > + kfree(last); > > + > > + new_last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); > > + if (new_last) > > + new_last->ser->next = 0; > > + else > > + bs->head_pa = 0; > > +} > > + > > +/* > > + * kho_cyclic_blocks_check - Check for cycles in a linked list of blocks. > > + * Uses Floyd's cycle-finding algorithm to ensure sanity of the incoming list. 
> > + */ > > +static bool kho_cyclic_blocks_check(struct kho_block_set *bs) > > +{ > > + struct kho_block_header_ser *fast; > > + struct kho_block_header_ser *slow; > > + int count = 0; > > + > > + fast = phys_to_virt(bs->head_pa); > > + slow = fast; > > + > > + while (fast) { > > + if (count++ >= KHO_MAX_BLOCKS) { > > + pr_err("Linked list too long\n"); > > + return false; > > + } > > + > > + if (!fast->next) > > + break; > > + > > + fast = phys_to_virt(fast->next); > > + if (!fast->next) > > + break; > > + > > + fast = phys_to_virt(fast->next); > > + slow = phys_to_virt(slow->next); > > + > > + if (slow == fast) { > > + pr_err("Cyclic list detected\n"); > > Heh, reminds me of the time I was practicing leetcode for interviews ;-) :-) > > > + return false; > > + } > > + } > > + > > + return true; > > +} > > + > > +/** > > + * kho_block_restore - Restore a block set from a physical address. > > + * @bs: The block set to restore. > > + * @head_pa: Physical address of the first block header. > > + * > > + * Return: 0 on success, or a negative errno on failure. 
> > + */ > > +int kho_block_restore(struct kho_block_set *bs, u64 head_pa) > > +{ > > + struct kho_block_header_ser *ser; > > + u64 next_pa = head_pa; > > + int err; > > + > > + /* Restored block sets use size from the previous kernel */ > > + bs->incoming = true; > > + if (!head_pa) > > + return 0; > > + > > + bs->head_pa = head_pa; > > + if (!kho_cyclic_blocks_check(bs)) { > > + bs->head_pa = 0; > > + return -EINVAL; > > + } > > + > > + while (next_pa) { > > + ser = phys_to_virt(next_pa); > > + if (ser->count > kho_block_count_per_block(bs)) { > > + pr_warn("Block contains too many entries: %llu\n", > > + ser->count); > > + err = -EINVAL; > > + goto err_destroy; > > + } > > + err = kho_block_add(bs, ser); > > + if (err) > > + goto err_destroy; > > + next_pa = ser->next; > > + } > > + > > + return 0; > > + > > +err_destroy: > > + kho_block_destroy(bs); > > + return err; > > +} > > + > > +/** > > + * kho_block_destroy - Destroy all blocks in a block set. > > + * @bs: The block set. > > + */ > > +void kho_block_destroy(struct kho_block_set *bs) > > +{ > > + u64 head_pa = bs->head_pa; > > + struct kho_block *block; > > + > > + while (!list_empty(&bs->blocks)) { > > + block = list_first_entry(&bs->blocks, struct kho_block, list); > > + list_del(&block->list); > > + kfree(block); > > + } > > Nit: > > list_for_each_entry_safe(block, tmp, &bs->blocks, list) { > list_del(&block->list); > kfree(block); > } > > is a bit more idiomatic (and IMO easier to read). Sure > > > + bs->nblocks = 0; > > + bs->head_pa = 0; > > + > > + while (head_pa) { > > + struct kho_block_header_ser *ser = phys_to_virt(head_pa); > > + > > + head_pa = ser->next; > > + kho_block_free_ser(bs, ser); > > Nit: also, can't you put this also in the previous loop? 
Something like: > > list_for_each_entry_safe(block, tmp, &bs->blocks, list) { > list_del(&block->list); > kho_block_free_ser(block->ser); > kfree(block); > } We actually can't merge these into a single loop because of how kho_block_restore() handles partial restoration failures. If kho_block_restore() fails halfway through restoring a chain of blocks (for example, if kho_block_add() fails on block 3 of 5), we jump to the err_destroy cleanup path, which calls kho_block_destroy(). At this point: - bs->blocks only contains the tracked blocks we successfully added (blocks 1 and 2). - bs->head_pa still points to the physical head of the entire 5-block incoming chain. So the second loop has to walk the physical chain from head_pa to free the serialized data of all five blocks, including the untracked ones. But this is a good place to add a comment. > > + } > > +} > > + > > +/** > > + * kho_block_set_clear - Clear all serialized data in a block set. > > + * @bs: The block set to clear. > > + */ > > +void kho_block_set_clear(struct kho_block_set *bs) > > +{ > > + struct kho_block *block; > > + > > + list_for_each_entry(block, &bs->blocks, list) { > > + block->ser->count = 0; > > + memset(block->ser + 1, 0, KHO_BLOCK_SIZE - sizeof(*block->ser)); > > + } > > +} > > + > > +/** > > + * kho_block_it_init - Initialize a block set iterator. > > + * @it: The iterator to initialize. > > + * @bs: The block set to iterate over. > > + */ > > +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs) > > +{ > > + it->bs = bs; > > + it->block = list_first_entry_or_null(&bs->blocks, struct kho_block, list); > > + it->i = 0; > > +} > > + > > +/** > > + * kho_block_it_next - Return the next entry slot in the block set. > > + * @it: The block iterator. > > + * > > + * If the current block is full, it automatically advances to the next block > > + * in the set. > > + * > > + * Return: A pointer to the next entry slot, or NULL if no more slots are > > + * available. > > + */ > > +void *kho_block_it_next(struct kho_block_it *it) > > The naming and documentation here are very confusing.
This and > kho_block_it_read() look pretty much identical, and their documentation > also looks pretty much identical. There seems to be only one tiny > difference: this function returns the slot while incrementing the block > count. > > Can we do better something like kho_block_it_write_next(struct > kho_block_it *it, void *entry) (size was specified when creating block > set)? Yes, this results in a copy but does that matter that much? > > And if you really want to avoid copying, perhaps > kho_block_it_add_entry()? Or something along the lines? To make it clear > this is adding an entry to the block set. > > Also, make the intended usage clear in the documentation. Sure, I will work on this. I also did not like the names, but could not think of anything clearer. > > > +{ > > + if (!it->block) > > + return NULL; > > + > > + if (it->i == kho_block_count_per_block(it->bs)) { > > + it->block->ser->count = it->i; > > + if (list_is_last(&it->block->list, &it->bs->blocks)) > > + return NULL; > > + it->block = list_next_entry(it->block, list); > > + it->i = 0; > > + } > > + > > + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); > > +} > > + > > +/** > > + * kho_block_it_read - Return the next entry slot for reading. > > + * @it: The block iterator. > > + * > > + * This function iterates through entries that were previously serialized, > > + * respecting the count stored in each block's header. > > + * > > + * Return: A pointer to the next entry slot, or NULL if no more entries are > > + * available. > > + */ > > +void *kho_block_it_read(struct kho_block_it *it) > > +{ > > + if (!it->block) > > + return NULL; > > + > > + while (it->i == it->block->ser->count) { > > Hmm, the while loop suggests we can have blocks with zero count. Do you > think we should detect those and error out instead? Since it doesn't > really make sense to have a block with no entries. This sounds reasonable. 
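A check along these lines could run while the incoming chain is walked. The following is only a userspace sketch of the idea, not the kernel code: struct model_hdr, MODEL_CAPACITY, model_validate_chain() and model_check_pair() are names made up for the illustration, and an ordinary pointer stands in for the physical next address.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_CAPACITY 4 /* hypothetical number of entries per block */

/* Hypothetical stand-in for struct kho_block_header_ser. */
struct model_hdr {
	struct model_hdr *next; /* plain pointer instead of a physical address */
	uint64_t count;         /* number of valid entries in this block */
};

/*
 * Walk the chain and reject any block that is empty or claims more
 * entries than a block can hold. Returns 0 if the chain looks sane.
 */
static int model_validate_chain(const struct model_hdr *head)
{
	const struct model_hdr *h;

	for (h = head; h; h = h->next) {
		if (h->count == 0 || h->count > MODEL_CAPACITY)
			return -EINVAL;
	}

	return 0;
}

/* Helper for experiments: build a two-block chain and validate it. */
static int model_check_pair(uint64_t first, uint64_t second)
{
	struct model_hdr b = { .next = NULL, .count = second };
	struct model_hdr a = { .next = &b, .count = first };

	return model_validate_chain(&a);
}
```

With such a check in the restore path, the read iterator would no longer need to skip over empty blocks, although keeping the loop as a safety net costs nothing.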
> > > + if (list_is_last(&it->block->list, &it->bs->blocks)) > > + return NULL; > > + it->block = list_next_entry(it->block, list); > > + it->i = 0; > > + } > > + > > + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); > > +} > > + > > +/** > > + * kho_block_it_prev - Return the previous entry slot in the block set. > > + * @it: The block iterator. > > + * > > + * If the current index is at the start of a block, it automatically moves to > > + * the end of the previous block. > > + * > > + * Return: A pointer to the previous entry slot, or NULL if at the very > > + * beginning of the block set. > > + */ > > +void *kho_block_it_prev(struct kho_block_it *it) > > +{ > > + if (!it->block) > > + return NULL; > > + > > + if (it->i == 0) { > > + if (list_is_first(&it->block->list, &it->bs->blocks)) > > + return NULL; > > + it->block = list_prev_entry(it->block, list); > > + it->i = kho_block_count_per_block(it->bs); > > + } > > + > > + return (void *)(it->block->ser + 1) + (--it->i * it->bs->entry_size); > > +} > > + > > +/** > > + * kho_block_it_finalize - Finalize the current block by setting its entry count. > > + * @it: The block iterator. > > + */ > > +void kho_block_it_finalize(struct kho_block_it *it) > > +{ > > + if (it->block) > > + it->block->ser->count = it->i; > > +} > > Doesn't kho_block_it_next() already do this when you add an entry? So > this seems redundant. It is not redundant because of how the final, partially filled block is handled. kho_block_it_next() only writes the count into the block header when a block is completely full and it is advancing to the next one: if (it->i == kho_block_count_per_block(it->bs)) { it->block->ser->count = it->i; ... But the very last block in the set is usually only partially filled (e.g., we write 10 entries into a block with a capacity of 64). Since it->i never reaches the maximum capacity, kho_block_it_next() never commits its count, so kho_block_it_finalize() has to do it.
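To make the point about the last block concrete, here is a small userspace model of the write iterator. It is only a sketch under simplifying assumptions, not the kernel implementation: the model_* names, MODEL_CAPACITY and MODEL_NBLOCKS are invented for the illustration, blocks live in a plain array instead of a linked list, and entries are ints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_CAPACITY 4 /* hypothetical entries per block */
#define MODEL_NBLOCKS  3 /* hypothetical number of preallocated blocks */

struct model_block {
	uint64_t count;              /* stands in for ser->count */
	int entries[MODEL_CAPACITY];
};

struct model_it {
	struct model_block *blocks;
	size_t nblocks;
	size_t cur;  /* index of the current block */
	uint64_t i;  /* next free slot in the current block */
};

static void model_it_init(struct model_it *it, struct model_block *blocks,
			  size_t nblocks)
{
	it->blocks = blocks;
	it->nblocks = nblocks;
	it->cur = 0;
	it->i = 0;
}

/* Like kho_block_it_next(): the count is committed only when advancing. */
static int *model_it_next(struct model_it *it)
{
	if (!it->nblocks)
		return NULL;

	if (it->i == MODEL_CAPACITY) {
		it->blocks[it->cur].count = it->i;
		if (it->cur + 1 == it->nblocks)
			return NULL;
		it->cur++;
		it->i = 0;
	}

	return &it->blocks[it->cur].entries[it->i++];
}

/* Like kho_block_it_finalize(): commit the count of the current block. */
static void model_it_finalize(struct model_it *it)
{
	if (it->nblocks)
		it->blocks[it->cur].count = it->i;
}

/* Write n entries and return the sum of the counts stored in the headers. */
static uint64_t model_committed(size_t n, int finalize)
{
	struct model_block blocks[MODEL_NBLOCKS] = { 0 };
	struct model_it it;
	uint64_t total = 0;
	size_t k;

	assert(n <= (size_t)MODEL_CAPACITY * MODEL_NBLOCKS);

	model_it_init(&it, blocks, MODEL_NBLOCKS);
	for (k = 0; k < n; k++) {
		int *slot = model_it_next(&it);

		assert(slot);
		*slot = (int)k;
	}

	if (finalize)
		model_it_finalize(&it);

	for (k = 0; k < MODEL_NBLOCKS; k++)
		total += blocks[k].count;

	return total;
}
```

In this model, writing 6 entries without the finalize step leaves only 4 of them visible in the headers, while finalizing makes all 6 visible. Even an exactly full last block (4 entries) stays at a count of 0 until it is finalized, because no further next call ever advances past it.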
Pasha From pasha.tatashin at soleen.com Mon Jun 1 07:40:47 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Mon, 1 Jun 2026 10:40:47 -0400 Subject: [PATCH v4 10/13] liveupdate: Remove limit on the number of files per session In-Reply-To: <2vxzbjdufirq.fsf@kernel.org> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-11-pasha.tatashin@soleen.com> <2vxzbjdufirq.fsf@kernel.org> Message-ID: On 06-01 16:16, Pratyush Yadav wrote: > On Sat, May 30 2026, Pasha Tatashin wrote: > > > To remove the fixed limit on the number of preserved files per session, > > transition the file metadata serialization from a single contiguous > > memory block to a chain of linked blocks. > > > > Acked-by: Mike Rapoport (Microsoft) > > Signed-off-by: Pasha Tatashin > > --- > > include/linux/kho/abi/luo.h | 13 +-- > > kernel/liveupdate/luo_file.c | 144 +++++++++++++++---------------- > > kernel/liveupdate/luo_internal.h | 6 +- > > 3 files changed, 80 insertions(+), 83 deletions(-) > > > > diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h > > index 79758d92ed5f..16df550ef143 100644 > > --- a/include/linux/kho/abi/luo.h > > +++ b/include/linux/kho/abi/luo.h > > @@ -35,8 +35,8 @@ > > * > > * - struct luo_session_ser: > > * Metadata for a single session, including its name and a physical pointer > > - * to another preserved memory block containing an array of > > - * `struct luo_file_ser` for all files in that session. > > + * to the first `struct kho_block_header_ser` for all files in that session. > > + * Multiple blocks are linked via the `next` field in the header. > > * > > * - struct luo_file_ser: > > * Metadata for a single preserved file. Contains the `compatible` string to > > @@ -65,7 +65,7 @@ > > * The LUO state is registered under this KHO entry name. 
> > */ > > #define LUO_KHO_ENTRY_NAME "LUO" > > -#define LUO_COMPAT_BASE "luo-v3" > > +#define LUO_COMPAT_BASE "luo-v4" > > #define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE > > #define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) > > > > @@ -103,9 +103,10 @@ struct luo_file_ser { > > > > /** > > * struct luo_file_set_ser - Represents the serialized metadata for file set > > - * @files: The physical address of a contiguous memory block that holds > > - * the serialized state of files (array of luo_file_ser) in this file > > - * set. > > + * @files: The physical address of the first `struct kho_block_header_ser`. > > + * This structure is the header for a block of memory containing > > + * an array of `struct luo_file_ser` entries. Multiple blocks are > > + * linked via the `next` field in the header. > > * @count: The total number of files that were part of this session during > > * serialization. Used for iteration and validation during > > * restoration. > > diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c > > index 9eec07a9e9fc..a445b1950ca7 100644 > > --- a/kernel/liveupdate/luo_file.c > > +++ b/kernel/liveupdate/luo_file.c > > @@ -118,11 +118,6 @@ static LIST_HEAD(luo_file_handler_list); > > /* Keep track of files being preserved by LUO */ > > static DEFINE_XARRAY(luo_preserved_files); > > > > -/* 2 4K pages, give space for 128 files per file_set */ > > -#define LUO_FILE_PGCNT 2ul > > -#define LUO_FILE_MAX \ > > - ((LUO_FILE_PGCNT << PAGE_SHIFT) / sizeof(struct luo_file_ser)) > > - > > /** > > * struct luo_file - Represents a single preserved file instance. 
> > * @fh: Pointer to the &struct liveupdate_file_handler that manages > > @@ -174,39 +169,6 @@ struct luo_file { > > u64 token; > > }; > > > > -static int luo_alloc_files_mem(struct luo_file_set *file_set) > > -{ > > - size_t size; > > - void *mem; > > - > > - if (file_set->files) > > - return 0; > > - > > - WARN_ON_ONCE(file_set->count); > > - > > - size = LUO_FILE_PGCNT << PAGE_SHIFT; > > - mem = kho_alloc_preserve(size); > > - if (IS_ERR(mem)) > > - return PTR_ERR(mem); > > - > > - file_set->files = mem; > > - > > - return 0; > > -} > > - > > -static void luo_free_files_mem(struct luo_file_set *file_set) > > -{ > > - /* If file_set has files, no need to free preservation memory */ > > - if (file_set->count) > > - return; > > - > > - if (!file_set->files) > > - return; > > - > > - kho_unpreserve_free(file_set->files); > > - file_set->files = NULL; > > -} > > - > > static unsigned long luo_get_id(struct liveupdate_file_handler *fh, > > struct file *file) > > { > > @@ -276,16 +238,15 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) > > if (luo_token_is_used(file_set, token)) > > return -EEXIST; > > > > - if (file_set->count == LUO_FILE_MAX) > > - return -ENOSPC; > > + err = kho_block_grow(&file_set->block_set, file_set->count); > > + if (err) > > + return err; > > > > file = fget(fd); > > - if (!file) > > - return -EBADF; > > - > > - err = luo_alloc_files_mem(file_set); > > - if (err) > > - goto err_fput; > > + if (!file) { > > + err = -EBADF; > > + goto err_shrink; > > + } > > > > err = -ENOENT; > > down_read(&luo_register_rwlock); > > @@ -300,7 +261,7 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) > > > > /* err is still -ENOENT if no handler was found */ > > if (err) > > - goto err_free_files_mem; > > + goto err_fput; > > > > err = xa_insert(&luo_preserved_files, luo_get_id(fh, file), > > file, GFP_KERNEL); > > @@ -343,10 +304,10 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) > > 
xa_erase(&luo_preserved_files, luo_get_id(fh, file)); > > err_module_put: > > module_put(fh->ops->owner); > > -err_free_files_mem: > > - luo_free_files_mem(file_set); > > err_fput: > > fput(file); > > +err_shrink: > > + kho_block_shrink(&file_set->block_set, file_set->count); > > > > return err; > > } > > @@ -392,13 +353,14 @@ void luo_file_unpreserve_files(struct luo_file_set *file_set) > > > > list_del(&luo_file->list); > > file_set->count--; > > + kho_block_shrink(&file_set->block_set, file_set->count); > > > > fput(luo_file->file); > > mutex_destroy(&luo_file->mutex); > > kfree(luo_file); > > } > > > > - luo_free_files_mem(file_set); > > + kho_block_destroy(&file_set->block_set); > > } > > > > static int luo_file_freeze_one(struct luo_file_set *file_set, > > @@ -454,7 +416,7 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, > > luo_file_unfreeze_one(file_set, luo_file); > > } > > > > - memset(file_set->files, 0, LUO_FILE_PGCNT << PAGE_SHIFT); > > + kho_block_set_clear(&file_set->block_set); > > } > > > > /** > > @@ -493,19 +455,23 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, > > int luo_file_freeze(struct luo_file_set *file_set, > > struct luo_file_set_ser *file_set_ser) > > { > > - struct luo_file_ser *file_ser = file_set->files; > > struct luo_file *luo_file; > > + struct kho_block_it it; > > int err; > > - int i; > > > > if (!file_set->count) > > return 0; > > > > - if (WARN_ON(!file_ser)) > > - return -EINVAL; > > + kho_block_it_init(&it, &file_set->block_set); > > > > - i = 0; > > list_for_each_entry(luo_file, &file_set->files_list, list) { > > + struct luo_file_ser *file_ser = kho_block_it_next(&it); > > + > > + if (!file_ser) { > > + err = -ENOSPC; > > + goto err_unfreeze; > > + } > > This should not fail normally, right? Since we pre-allocate the memory. > Perhaps add a comment saying that? 
> > > + > > err = luo_file_freeze_one(file_set, luo_file); > > if (err < 0) { > > pr_warn("Freeze failed for token[%#0llx] handler[%s] err[%pe]\n", > > @@ -514,16 +480,21 @@ int luo_file_freeze(struct luo_file_set *file_set, > > goto err_unfreeze; > > } > > > > - strscpy(file_ser[i].compatible, luo_file->fh->compatible, > > - sizeof(file_ser[i].compatible)); > > - file_ser[i].data = luo_file->serialized_data; > > - file_ser[i].token = luo_file->token; > > - i++; > > + strscpy(file_ser->compatible, luo_file->fh->compatible, > > + sizeof(file_ser->compatible)); > > + file_ser->data = luo_file->serialized_data; > > + file_ser->token = luo_file->token; > > } > > + kho_block_it_finalize(&it); > > > > file_set_ser->count = file_set->count; > > - if (file_set->files) > > - file_set_ser->files = virt_to_phys(file_set->files); > > + if (!list_empty(&file_set->block_set.blocks)) { > > + struct kho_block *block; > > + > > + block = list_first_entry(&file_set->block_set.blocks, > > + struct kho_block, list); > > + file_set_ser->files = virt_to_phys(block->ser); > > + } > > Please, add an API in KHO block to return the header physical address. > Poking into the internals of the data structure like this is not a good > idea. SGTM > > I missed that patch 9 also does this. So please use that there too. 
> > > > > return 0; > > > > @@ -741,14 +712,12 @@ int luo_file_finish(struct luo_file_set *file_set) > > module_put(luo_file->fh->ops->owner); > > list_del(&luo_file->list); > > file_set->count--; > > + kho_block_shrink(&file_set->block_set, file_set->count); > > mutex_destroy(&luo_file->mutex); > > kfree(luo_file); > > } > > > > - if (file_set->files) { > > - kho_restore_free(file_set->files); > > - file_set->files = NULL; > > - } > > + kho_block_destroy(&file_set->block_set); > > > > return 0; > > } > > @@ -822,16 +791,18 @@ int luo_file_deserialize(struct luo_file_set *file_set, > > struct luo_file_set_ser *file_set_ser) > > { > > struct luo_file_ser *file_ser; > > + struct kho_block_it it; > > int err; > > - u64 i; > > > > if (!file_set_ser->files) { > > WARN_ON(file_set_ser->count); > > return 0; > > } > > > > - file_set->count = file_set_ser->count; > > - file_set->files = phys_to_virt(file_set_ser->files); > > + file_set->count = 0; > > + err = kho_block_restore(&file_set->block_set, file_set_ser->files); > > + if (err) > > + return err; > > > > /* > > * Note on error handling: > > @@ -848,25 +819,50 @@ int luo_file_deserialize(struct luo_file_set *file_set, > > * userspace to detect the failure and trigger a reboot, which will > > * reliably reset devices and reclaim memory. 
> > */ > > - file_ser = file_set->files; > > - for (i = 0; i < file_set->count; i++) { > > - err = luo_file_deserialize_one(file_set, &file_ser[i]); > > + kho_block_it_init(&it, &file_set->block_set); > > + while ((file_ser = kho_block_it_read(&it))) { > > + err = luo_file_deserialize_one(file_set, file_ser); > > if (err) > > - return err; > > + goto err_destroy_blocks; > > + file_set->count++; > > + } > > + > > + if (file_set->count != file_set_ser->count) { > > + pr_warn("File count mismatch: expected %llu, found %llu\n", > > + file_set_ser->count, file_set->count); > > + err = -EINVAL; > > + goto err_destroy_blocks; > > } > > > > return 0; > > + > > +err_destroy_blocks: > > + while (!list_empty(&file_set->files_list)) { > > + struct luo_file *luo_file; > > + > > + luo_file = list_first_entry(&file_set->files_list, > > + struct luo_file, list); > > + list_del(&luo_file->list); > > + module_put(luo_file->fh->ops->owner); > > + mutex_destroy(&luo_file->mutex); > > + kfree(luo_file); > > + } > > + file_set->count = 0; > > + kho_block_destroy(&file_set->block_set); > > + return err; > > } > > > > void luo_file_set_init(struct luo_file_set *file_set) > > { > > INIT_LIST_HEAD(&file_set->files_list); > > + kho_block_set_init(&file_set->block_set, sizeof(struct luo_file_ser)); > > } > > > > void luo_file_set_destroy(struct luo_file_set *file_set) > > { > > WARN_ON(file_set->count); > > WARN_ON(!list_empty(&file_set->files_list)); > > + WARN_ON(!list_empty(&file_set->block_set.blocks)); > > Here too. 
Sure > > > } > > > > /** > > diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h > > index ee18f9a11b91..64879ffe7378 100644 > > --- a/kernel/liveupdate/luo_internal.h > > +++ b/kernel/liveupdate/luo_internal.h > > @@ -10,6 +10,7 @@ > > > > #include > > #include > > +#include > > > > struct luo_ucmd { > > void __user *ubuffer; > > @@ -44,14 +45,13 @@ static inline int luo_ucmd_respond(struct luo_ucmd *ucmd, > > * struct luo_file_set - A set of files that belong to the same sessions. > > * @files_list: An ordered list of files associated with this session, it is > > * ordered by preservation time. > > - * @files: The physically contiguous memory block that holds the serialized > > - * state of files. > > + * @block_set: The set of serialization blocks. > > * @count: A counter tracking the number of files currently stored in the > > * @files_list for this session. > > */ > > struct luo_file_set { > > struct list_head files_list; > > - struct luo_file_ser *files; > > + struct kho_block_set block_set; > > u64 count; > > }; > > -- > Regards, > Pratyush Yadav From pasha.tatashin at soleen.com Mon Jun 1 07:44:14 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Mon, 1 Jun 2026 10:44:14 -0400 Subject: [PATCH v4 09/13] liveupdate: Remove limit on the number of sessions In-Reply-To: <2vxzfr36fjcj.fsf@kernel.org> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-10-pasha.tatashin@soleen.com> <2vxzfr36fjcj.fsf@kernel.org> Message-ID: On 06-01 16:03, Pratyush Yadav wrote: > On Sat, May 30 2026, Pasha Tatashin wrote: > > > Currently, the number of LUO sessions is limited by a fixed number of > > pre-allocated pages for serialization (16 pages, allowing for ~819 > > sessions). > > > > This limitation is problematic if LUO is used to support things such as > > systemd file descriptor store, and would be used not just as VM memory > > but to save other states on the machine. 
> > > > Remove this limit by transitioning to a linked-block approach for > > session metadata serialization. Instead of a single contiguous block, > > session metadata is now stored in a chain of 16-page blocks. Each block > > starts with a header containing the physical address of the next block > > and the number of session entries in the current block. > > > > Acked-by: Mike Rapoport (Microsoft) > > Signed-off-by: Pasha Tatashin > > --- > [...] > > @@ -63,13 +58,15 @@ > > #define _LINUX_KHO_ABI_LUO_H > > > > #include > > +#include > > #include > > > > /* > > * The LUO state is registered under this KHO entry name. > > */ > > #define LUO_KHO_ENTRY_NAME "LUO" > > -#define LUO_ABI_COMPATIBLE "luo-v3" > > +#define LUO_COMPAT_BASE "luo-v3" > > +#define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE > > That's clever :-) > > [...] > > int luo_session_serialize(void) > > { > > struct luo_session_header *sh = &luo_session_global.outgoing; > > struct luo_session *session; > > - int i = 0; > > + struct kho_block_it it; > > int err; > > > > down_write(&luo_session_serialize_rwsem); > > down_write(&sh->rwsem); > > *sh->sessions_pa = 0; > > > > + kho_block_it_init(&it, &sh->block_set); > > + > > list_for_each_entry(session, &sh->list, list) { > > - err = luo_session_freeze_one(session, &sh->ser[i]); > > - if (err) > > + struct luo_session_ser *ser = kho_block_it_next(&it); > > + > > + if (!ser) { > > + err = -ENOSPC; > > goto err_undo; > > + } > > > > - strscpy(sh->ser[i].name, session->name, > > - sizeof(sh->ser[i].name)); > > - i++; > > - } > > + err = luo_session_freeze_one(session, ser); > > + if (err) { > > + kho_block_it_prev(&it); > > + goto err_undo; > > + } > > > > - if (sh->header_ser && sh->count > 0) { > > - sh->header_ser->count = sh->count; > > - *sh->sessions_pa = virt_to_phys(sh->header_ser); > > + strscpy(ser->name, session->name, sizeof(ser->name)); > > } > > + > > + kho_block_it_finalize(&it); > > + > > + if (sh->sessions_pa && sh->count 
> 0) > > Nit: Why check for sh->sessions_pa? It can never be NULL. Good point, I will remove it. > > Other than this, > > Reviewed-by: Pratyush Yadav (Google) > > > + *sh->sessions_pa = sh->block_set.head_pa; > > up_write(&sh->rwsem); > > > > return 0; > > > > err_undo: > > list_for_each_entry_continue_reverse(session, &sh->list, list) { > > - i--; > > - luo_session_unfreeze_one(session, &sh->ser[i]); > > - memset(sh->ser[i].name, 0, sizeof(sh->ser[i].name)); > > + struct luo_session_ser *ser = kho_block_it_prev(&it); > > + > > + luo_session_unfreeze_one(session, ser); > > + memset(ser->name, 0, sizeof(ser->name)); > > } > > up_write(&sh->rwsem); > > up_write(&luo_session_serialize_rwsem); > > -- > Regards, > Pratyush Yadav From pasha.tatashin at soleen.com Mon Jun 1 08:00:59 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Mon, 1 Jun 2026 11:00:59 -0400 Subject: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions In-Reply-To: References: <20260528004204.1484584-1-jloeser@linux.microsoft.com> Message-ID: On 05-31 20:10, Mike Rapoport wrote: > Hi Jork, > > Only had time to skim through the patches. > I have a couple of high level questions for now. > > On Wed, May 27, 2026 at 05:41:42PM -0700, Jork Loeser wrote: > > When Linux runs as an L1 Virtual Host (L1VH) under Hyper-V, the MSHV > > root partition driver deposits pages to the hypervisor and creates > > partitions for guest VMs. Prior patches enabled kexec for L1VH, but > > only when no partitions had been created and no memory had been donated. > > > > This series lifts that limitation. It uses KHO (Kexec Handover) to: > > > > - Track all pages deposited to the hypervisor in a KHO radix tree > > and preserve them across kexec so the new kernel knows which pages > > are owned by the hypervisor. > > > > - Freeze running partitions before kexec, record their IDs in the > > KHO FDT, and vacuum (tear down + reclaim memory) stale partitions > > after kexec. 
> > > > - In case of a crash, exclude hypervisor-owned pages from crash > > dump collection by passing the radix tree root PA via Hyper-V > > crash MSR P2 to the crash kernel. > > > > Dependency on Pratyush's KHO series > > =================================== > > > > Patches 1-12 are cherry-picked from Pratyush Yadav's v1 series > > "kho: make boot time huge page allocation work nicely with KHO" [1], > > which is still under discussion. This series uses functionality from > > those patches -- specifically the meta-data page enumeration via table > > callbacks and the restructured radix tree API. It also extends the > > KHO radix tree with: > > > > - A freeze mechanism to lock the tree before serializing for kexec > > (patch 13). > > There were a lot of effort to make KHO stateless and drop the requirement > for finalization/freeze. Yes, using KHO directly here is incorrect. The state machine is provided by LUO, so we should use LUO here. MSHV should provide a file that userspace adds to LUO, and all state machine management would be the same as for all other clients participating in LU. > > Why is this necessary to add a freeze mechanism to kho_radix_tree? > If it's a hard requirement of mshv maybe the freeze part should be handled > there? > > - A crash-kernel-safe variant that memremaps radix nodes for use > > outside the direct map (patch 14). > > > > Patch overview > > ============== > > > > Patches 1-12: KHO radix tree and memblock changes (from [1]) > > Patch 13: Radix tree freeze and del_key() error reporting > > del_key() error reporting sounds like something we'd want to avoid. > del_key() is called on "freeing" path and during error handling, it would > be hard if at all possible to deal with errors from del_key().
> > > Patch 14: Crash-kernel-safe radix tree presence check > > Patch 15: Page tracker using KHO radix tree for deposited pages > > Patch 16: Debugfs interface for page tracker > > Patches 17-18: Crash MSR reshuffling + crash dump page exclusion > > Patch 19: Export kexec_in_progress for modules > > Isn't there another way to differentiate kexec reboot? > > > Patch 20: Freeze and vacuum partitions across kexec > > > > Feedback > > ======== > > > > This is an RFC. I am looking for feedback on the overall approach as > > well as the KHO changes (patches 13-14). > > > > [1] https://lore.kernel.org/linux-mm/20260429133928.850721-1-pratyush at kernel.org/ > > > > Based-on: linux-next/master (next-20260527) > > -- > Sincerely yours, > Mike. From dakr at kernel.org Mon Jun 1 08:09:59 2026 From: dakr at kernel.org (Danilo Krummrich) Date: Mon, 01 Jun 2026 17:09:59 +0200 Subject: [PATCH v16 0/5] shut down devices asynchronously In-Reply-To: <20260518193204.14273-1-djeffery@redhat.com> References: <20260518193204.14273-1-djeffery@redhat.com> Message-ID: On Mon May 18, 2026 at 9:31 PM CEST, David Jeffery wrote: > These patches are now rebased against the driver-core tree's driver-core-next > branch. [...] > Changes from V15: > > The async_shutdown bit field is converted to a device flags bit Convert all > patches to use the flag bit accessor macros to set or check if async shutdown > should be used Added documentation on the kernel parameter to control use of > async shutdown Did you have a look at the Sashiko report from v15 [1]? Some of the concerns raised seem valid at a quick glance. (It seems that this version has not been picked up by Sashiko (despite you mentioning they are based on driver-core-next). I'd assume it doesn't like that the series was not sent with '--base'.) Can you have a look at [1] please? 
Thanks, Danilo [1] https://sashiko.dev/#/patchset/20260429175016.7915-1-djeffery%40redhat.com > Stuart Hayes (2): > driver core: separate function to shutdown one device > driver core: do not always lock parent in shutdown > > David Jeffery (3): > driver core: async device shutdown infrastructure > PCI: Enable async shutdown support > scsi: Enable async shutdown support Not sure it will make it for 7.2, but I think it would be good to give this some more time in linux-next anyways. Bjorn, James, Martin: Should the PCI and scsi patch go through the driver-core tree too? Do you prefer a signed tag with the driver-core changes to merge into the PCI and scsi trees? From djeffery at redhat.com Mon Jun 1 09:54:42 2026 From: djeffery at redhat.com (David Jeffery) Date: Mon, 1 Jun 2026 12:54:42 -0400 Subject: [PATCH v16 0/5] shut down devices asynchronously In-Reply-To: References: <20260518193204.14273-1-djeffery@redhat.com> Message-ID: On Mon, Jun 1, 2026 at 11:10 AM Danilo Krummrich wrote: > > On Mon May 18, 2026 at 9:31 PM CEST, David Jeffery wrote: > > These patches are now rebased against the driver-core tree's driver-core-next > > branch. > > [...] > > > Changes from V15: > > > > The async_shutdown bit field is converted to a device flags bit Convert all > > patches to use the flag bit accessor macros to set or check if async shutdown > > should be used Added documentation on the kernel parameter to control use of > > async shutdown > > Did you have a look at the Sashiko report from v15 [1]? Some of the concerns > raised seem valid at a quick glance. > > (It seems that this version has not been picked up by Sashiko (despite you > mentioning they are based on driver-core-next). I'd assume it doesn't like that > the series was not sent with '--base'.) > > Can you have a look at [1] please? > > Thanks, > Danilo > > [1] https://sashiko.dev/#/patchset/20260429175016.7915-1-djeffery%40redhat.com This does look to have found some legitimate issues in need of correction.
I'll get them fixed. Thanks, David Jeffery From mclapinski at google.com Mon Jun 1 12:11:36 2026 From: mclapinski at google.com (Michal Clapinski) Date: Mon, 1 Jun 2026 21:11:36 +0200 Subject: [PATCH] kexec_file: skip checksum verification when relocations aren't needed Message-ID: <20260601191136.799134-1-mclapinski@google.com> Checksum verification is needed 1. for crash kernels. In a crash, we can't be sure the kernel is intact. 2. if we're worried about relocating the kernel into a region used by some DMA that wasn't properly cancelled. If we used CMA to allocate segments then 1. we're not working with a crash kernel. 2. relocations are not going to happen. Therefore, we can safely disable checksum verification. Instead of adding a new variable to purgatory, just skip adding regions and save the default value of SHA256 hash. Saves ~250ms on my 4.0 GHz CPU. Signed-off-by: Michal Clapinski --- kernel/kexec_file.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index 2bfbb2d144e6..2dc8b0435fe6 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -808,6 +808,7 @@ static int kexec_calculate_store_digests(struct kimage *image) void *zero_buf; struct kexec_sha_region *sha_regions; struct purgatory_info *pi = &image->purgatory_info; + bool can_skip_checksum = true; if (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY)) return 0; @@ -822,6 +823,23 @@ static int kexec_calculate_store_digests(struct kimage *image) sha256_init(&sctx); + /* + * If all segments were loaded into contiguous memory, there will be no + * relocations. In that case there is no risk of memory corruption by + * uncancelled DMA and we can skip checksum calculation. 
+ */ + for (i = 0; i < image->nr_segments; i++) { + if (!image->segment_cma[i]) { + can_skip_checksum = false; + break; + } + } + + if (can_skip_checksum) { + pr_info("disabling checksum verification in purgatory\n"); + goto skip_checksum; + } + for (j = i = 0; i < image->nr_segments; i++) { struct kexec_segment *ksegment; @@ -867,6 +885,7 @@ static int kexec_calculate_store_digests(struct kimage *image) j++; } +skip_checksum: sha256_final(&sctx, digest); ret = kexec_purgatory_get_set_symbol(image, "purgatory_sha_regions", -- 2.54.0.929.g9b7fa37559-goog From mclapinski at google.com Mon Jun 1 12:30:14 2026 From: mclapinski at google.com (Michal Clapinski) Date: Mon, 1 Jun 2026 21:30:14 +0200 Subject: [PATCH] kho: try to allocate contiguous memory for kexec segments Message-ID: <20260601193014.896405-1-mclapinski@google.com> This allows us to skip relocations (and maybe checksum calculation in the future). kho_scratch is marked as MIGRATE_CMA but isn't actually given to the CMA, so it should only contain movable allocations, therefore this should always succeed. 
Signed-off-by: Michal Clapinski --- kernel/kexec_core.c | 6 +++++- kernel/liveupdate/kexec_handover.c | 21 +++++++++++++++++---- 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index dc770b9a6d05..cba3ce985aa9 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -43,6 +43,7 @@ #include #include #include +#include #include #include @@ -566,7 +567,10 @@ static void kimage_free_cma(struct kimage *image) continue; arch_kexec_pre_free_pages(page_address(cma), nr_pages); - dma_release_from_contiguous(NULL, cma, nr_pages); + if (kho_is_enabled()) + free_contig_range(page_to_pfn(cma), nr_pages); + else + dma_release_from_contiguous(NULL, cma, nr_pages); image->segment_cma[i] = NULL; } diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 4834a809985a..289fd5948fd2 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1770,15 +1770,28 @@ static int kho_walk_scratch(struct kexec_buf *kbuf, return ret; } +static void kho_try_alloc_contig(struct kexec_buf *kbuf) +{ + unsigned long start_pfn = PFN_DOWN(kbuf->mem); + unsigned long nr_pages = kbuf->memsz >> PAGE_SHIFT; + + if (alloc_contig_range(start_pfn, start_pfn + nr_pages, + ACR_FLAGS_CMA, GFP_KERNEL)) + return; + + kbuf->cma = pfn_to_page(start_pfn); + arch_kexec_post_alloc_pages(page_address(kbuf->cma), nr_pages, 0); +} + int kho_locate_mem_hole(struct kexec_buf *kbuf, int (*func)(struct resource *, void *)) { - int ret; - if (!kho_enable || kbuf->image->type == KEXEC_TYPE_CRASH) return 1; - ret = kho_walk_scratch(kbuf, func); + if (!kho_walk_scratch(kbuf, func)) + return -EADDRNOTAVAIL; - return ret == 1 ? 
0 : -EADDRNOTAVAIL; + kho_try_alloc_contig(kbuf); + return 0; } -- 2.54.0.929.g9b7fa37559-goog From mclapinski at google.com Mon Jun 1 12:52:46 2026 From: mclapinski at google.com (Michał Cłapiński) Date: Mon, 1 Jun 2026 21:52:46 +0200 Subject: [PATCH] kho: try to allocate contiguous memory for kexec segments In-Reply-To: <20260601193014.896405-1-mclapinski@google.com> References: <20260601193014.896405-1-mclapinski@google.com> Message-ID: On Mon, Jun 1, 2026 at 9:30 PM Michal Clapinski wrote: > > This allows us to skip relocations (and maybe checksum calculation > in the future). > > kho_scratch is marked as MIGRATE_CMA but isn't actually given to the > CMA, so it should only contain movable allocations, therefore this > should always succeed. Now that I think about it, this is only true on the primary boot. On subsequent boots, kho scratch will contain memblock allocations forever. I should have tested it more than once. I have no idea how probable it is that I will find enough movable/free memory in kho scratch for this to ever succeed. I'll give it more thought. From jloeser at linux.microsoft.com Mon Jun 1 13:09:41 2026 From: jloeser at linux.microsoft.com (Jork Loeser) Date: Mon, 1 Jun 2026 13:09:41 -0700 (PDT) Subject: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions In-Reply-To: References: <20260528004204.1484584-1-jloeser@linux.microsoft.com> Message-ID: On Sun, 31 May 2026, Mike Rapoport wrote: > Hi Jork, >> - A freeze mechanism to lock the tree before serializing for kexec >> (patch 13). > > There were a lot of effort to make KHO stateless and drop the requirement > for finalization/freeze. > > Why is this necessary to add a freeze mechanism to kho_radix_tree? > If it's a hard requirement of mshv maybe the freeze part should be handled > there? Good feedback. It's a safety-net so we do not accidentally donate pages without being able to track them. Thought it might be a good generic feature.
Let me keep it in the MSHV driver. >> Patch 13: Radix tree freeze and del_key() error reporting > > del_key() error reporting sounds like something we'd want to avoid. > del_key() is called on "freeing" path and during error handling, it would > be hard if at all possible to deal with errors from del_key(). I hear you. Stating "yeah, it can only really fail if the key isn't there, or it's frozen, but not due to other things, so don't bother to check the return code if you are sure" is an odd contract. With the freeze-logic moving into MSHV, will revert to no-error. >> Patch 19: Export kexec_in_progress for modules > > Isn't there another way to differentiate kexec reboot? I could not find one, unfortunately. > Sincerely yours, > Mike. Best, Jork From jloeser at linux.microsoft.com Mon Jun 1 13:15:11 2026 From: jloeser at linux.microsoft.com (Jork Loeser) Date: Mon, 1 Jun 2026 13:15:11 -0700 (PDT) Subject: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions In-Reply-To: References: <20260528004204.1484584-1-jloeser@linux.microsoft.com> Message-ID: <4172d271-21b4-346-924e-406baef179a1@linux.microsoft.com> On Mon, 1 Jun 2026, Pasha Tatashin wrote: > On 05-31 20:10, Mike Rapoport wrote: >>> - A freeze mechanism to lock the tree before serializing for kexec >>> (patch 13). >> >> There were a lot of effort to make KHO stateless and drop the requirement >> for finalization/freeze. > > Yes, using KHO directly here is incorrect. The state machine is provided > by LUO, so we should use LUO here. MSHV should provide a file that > userspace adds to LUO, and all state machine management would be the > same as for all other clients participating in LU. The thing is, there is no file handle to rely on. Even once partitions are all removed, Hyper-V might hang onto pages (and won't return them even if asked). However, these pages very much must be excluded from Linux post-kexec, or the system will crash. 
We cannot rely on UM to ensure integrity of memory management. Contrast that to standard LUO use: If you drop individual file handles, or even skip the LUO phase entirely, the worst that will happen is that the objects will be gone post-kexec. The MM itself will still be consistent. For MSHV & page donation, this is different. (And yes, partition preservation will very much tie into LUO) Best, Jork From lkp at intel.com Mon Jun 1 14:10:02 2026 From: lkp at intel.com (kernel test robot) Date: Tue, 02 Jun 2026 05:10:02 +0800 Subject: [liveupdate:next] BUILD SUCCESS 2935777b418d2bfcbfe96705bb2c0fa6c0d94e18 Message-ID: <202606020552.pcMaifmj-lkp@intel.com> tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git next branch HEAD: 2935777b418d2bfcbfe96705bb2c0fa6c0d94e18 liveupdate: Remove unused ser field from struct luo_session elapsed time: 859m configs tested: 360 configs skipped: 5 The following configs have been built successfully. More configs may be tested in the coming days. 
tested configs: alpha allnoconfig gcc-15.2.0 alpha allyesconfig gcc-15.2.0 alpha defconfig gcc-15.2.0 arc allmodconfig clang-16 arc allmodconfig gcc-15.2.0 arc allnoconfig gcc-15.2.0 arc allyesconfig clang-23 arc allyesconfig gcc-15.2.0 arc defconfig gcc-15.2.0 arc randconfig-001 gcc-8.5.0 arc randconfig-001-20260601 clang-23 arc randconfig-001-20260601 gcc-8.5.0 arc randconfig-001-20260602 gcc-10.5.0 arc randconfig-002 gcc-8.5.0 arc randconfig-002-20260601 clang-23 arc randconfig-002-20260601 gcc-8.5.0 arc randconfig-002-20260602 gcc-10.5.0 arm allnoconfig clang-23 arm allnoconfig gcc-15.2.0 arm allyesconfig clang-16 arm allyesconfig gcc-15.2.0 arm defconfig gcc-15.2.0 arm h3600_defconfig gcc-15.2.0 arm mvebu_v7_defconfig clang-23 arm randconfig-001 gcc-8.5.0 arm randconfig-001-20260601 clang-23 arm randconfig-001-20260601 gcc-8.5.0 arm randconfig-001-20260602 gcc-10.5.0 arm randconfig-002 gcc-8.5.0 arm randconfig-002-20260601 clang-23 arm randconfig-002-20260601 gcc-8.5.0 arm randconfig-002-20260602 gcc-10.5.0 arm randconfig-003 gcc-8.5.0 arm randconfig-003-20260601 clang-23 arm randconfig-003-20260601 gcc-8.5.0 arm randconfig-003-20260602 gcc-10.5.0 arm randconfig-004 gcc-8.5.0 arm randconfig-004-20260601 clang-23 arm randconfig-004-20260601 gcc-8.5.0 arm randconfig-004-20260602 gcc-10.5.0 arm64 allmodconfig clang-19 arm64 allmodconfig clang-23 arm64 allnoconfig gcc-15.2.0 arm64 defconfig gcc-15.2.0 arm64 randconfig-001 gcc-8.5.0 arm64 randconfig-001-20260601 gcc-14.3.0 arm64 randconfig-001-20260601 gcc-8.5.0 arm64 randconfig-001-20260602 gcc-15.2.0 arm64 randconfig-002 gcc-14.3.0 arm64 randconfig-002-20260601 gcc-8.5.0 arm64 randconfig-002-20260602 gcc-15.2.0 arm64 randconfig-003 clang-23 arm64 randconfig-003-20260601 gcc-15.2.0 arm64 randconfig-003-20260601 gcc-8.5.0 arm64 randconfig-003-20260602 gcc-15.2.0 arm64 randconfig-004-20260601 gcc-14.3.0 arm64 randconfig-004-20260601 gcc-8.5.0 arm64 randconfig-004-20260602 gcc-15.2.0 csky allmodconfig gcc-15.2.0 csky 
allnoconfig gcc-15.2.0 csky defconfig gcc-15.2.0 csky randconfig-001 gcc-10.5.0 csky randconfig-001-20260601 gcc-14.3.0 csky randconfig-001-20260601 gcc-8.5.0 csky randconfig-001-20260602 gcc-15.2.0 csky randconfig-002 gcc-10.5.0 csky randconfig-002-20260601 gcc-16.1.0 csky randconfig-002-20260601 gcc-8.5.0 csky randconfig-002-20260602 gcc-15.2.0 hexagon allmodconfig clang-17 hexagon allmodconfig gcc-15.2.0 hexagon allnoconfig clang-23 hexagon allnoconfig gcc-15.2.0 hexagon defconfig gcc-15.2.0 hexagon randconfig-001 gcc-11.5.0 hexagon randconfig-001-20260601 gcc-11.5.0 hexagon randconfig-001-20260601 gcc-8.5.0 hexagon randconfig-001-20260602 gcc-8.5.0 hexagon randconfig-002 gcc-11.5.0 hexagon randconfig-002-20260601 gcc-11.5.0 hexagon randconfig-002-20260601 gcc-8.5.0 hexagon randconfig-002-20260602 gcc-8.5.0 i386 allmodconfig clang-20 i386 allmodconfig gcc-14 i386 allnoconfig gcc-14 i386 allnoconfig gcc-15.2.0 i386 allyesconfig clang-20 i386 allyesconfig gcc-14 i386 buildonly-randconfig-001 gcc-12 i386 buildonly-randconfig-001-20260601 gcc-12 i386 buildonly-randconfig-001-20260601 gcc-14 i386 buildonly-randconfig-001-20260602 clang-20 i386 buildonly-randconfig-002 gcc-12 i386 buildonly-randconfig-002-20260601 clang-20 i386 buildonly-randconfig-002-20260601 gcc-12 i386 buildonly-randconfig-002-20260602 clang-20 i386 buildonly-randconfig-003 gcc-12 i386 buildonly-randconfig-003-20260601 gcc-12 i386 buildonly-randconfig-003-20260601 gcc-14 i386 buildonly-randconfig-003-20260602 clang-20 i386 buildonly-randconfig-004 gcc-12 i386 buildonly-randconfig-004-20260601 clang-20 i386 buildonly-randconfig-004-20260601 gcc-12 i386 buildonly-randconfig-004-20260602 clang-20 i386 buildonly-randconfig-005 gcc-12 i386 buildonly-randconfig-005-20260601 gcc-12 i386 buildonly-randconfig-005-20260601 gcc-14 i386 buildonly-randconfig-005-20260602 clang-20 i386 buildonly-randconfig-006 gcc-12 i386 buildonly-randconfig-006-20260601 gcc-12 i386 buildonly-randconfig-006-20260602 clang-20 
i386 defconfig gcc-15.2.0 i386 randconfig-001 gcc-14 i386 randconfig-001-20260601 gcc-14 i386 randconfig-001-20260602 gcc-14 i386 randconfig-002 gcc-14 i386 randconfig-002-20260601 gcc-14 i386 randconfig-002-20260602 gcc-14 i386 randconfig-003 gcc-14 i386 randconfig-003-20260601 gcc-14 i386 randconfig-003-20260602 gcc-14 i386 randconfig-004 gcc-14 i386 randconfig-004-20260601 gcc-14 i386 randconfig-004-20260602 gcc-14 i386 randconfig-005 gcc-14 i386 randconfig-005-20260601 gcc-14 i386 randconfig-005-20260602 gcc-14 i386 randconfig-006 gcc-14 i386 randconfig-006-20260601 gcc-14 i386 randconfig-006-20260602 gcc-14 i386 randconfig-007 gcc-14 i386 randconfig-007-20260601 gcc-14 i386 randconfig-007-20260602 gcc-14 i386 randconfig-011-20260602 clang-20 i386 randconfig-012-20260602 clang-20 i386 randconfig-013-20260602 clang-20 i386 randconfig-014-20260602 clang-20 i386 randconfig-015-20260602 clang-20 i386 randconfig-016-20260602 clang-20 i386 randconfig-017-20260602 clang-20 loongarch allmodconfig clang-19 loongarch allmodconfig clang-23 loongarch allnoconfig clang-23 loongarch allnoconfig gcc-15.2.0 loongarch defconfig clang-19 loongarch randconfig-001 gcc-11.5.0 loongarch randconfig-001-20260601 gcc-11.5.0 loongarch randconfig-001-20260601 gcc-8.5.0 loongarch randconfig-001-20260602 gcc-8.5.0 loongarch randconfig-002 gcc-11.5.0 loongarch randconfig-002-20260601 gcc-11.5.0 loongarch randconfig-002-20260601 gcc-8.5.0 loongarch randconfig-002-20260602 gcc-8.5.0 m68k alldefconfig gcc-15.2.0 m68k allmodconfig gcc-15.2.0 m68k allnoconfig gcc-15.2.0 m68k allyesconfig clang-16 m68k allyesconfig gcc-15.2.0 m68k defconfig clang-19 microblaze allnoconfig gcc-15.2.0 microblaze allyesconfig gcc-15.2.0 microblaze defconfig clang-19 mips allmodconfig gcc-15.2.0 mips allnoconfig gcc-15.2.0 mips allyesconfig gcc-15.2.0 mips gcw0_defconfig clang-23 mips rs90_defconfig gcc-15.2.0 nios2 allmodconfig clang-23 nios2 allmodconfig gcc-11.5.0 nios2 allnoconfig clang-23 nios2 allnoconfig 
gcc-11.5.0 nios2 defconfig clang-19 nios2 randconfig-001 gcc-11.5.0 nios2 randconfig-001-20260601 gcc-11.5.0 nios2 randconfig-001-20260601 gcc-8.5.0 nios2 randconfig-001-20260602 gcc-8.5.0 nios2 randconfig-002 gcc-11.5.0 nios2 randconfig-002-20260601 gcc-11.5.0 nios2 randconfig-002-20260601 gcc-8.5.0 nios2 randconfig-002-20260602 gcc-8.5.0 openrisc allmodconfig clang-23 openrisc allmodconfig gcc-15.2.0 openrisc allnoconfig clang-23 openrisc allnoconfig gcc-15.2.0 openrisc defconfig gcc-15.2.0 parisc allmodconfig gcc-15.2.0 parisc allnoconfig clang-23 parisc allnoconfig gcc-15.2.0 parisc allyesconfig clang-19 parisc allyesconfig gcc-15.2.0 parisc defconfig gcc-15.2.0 parisc randconfig-001 gcc-10.5.0 parisc randconfig-001-20260601 gcc-10.5.0 parisc randconfig-001-20260602 gcc-12.5.0 parisc randconfig-002 gcc-10.5.0 parisc randconfig-002-20260601 gcc-10.5.0 parisc randconfig-002-20260602 gcc-12.5.0 parisc64 defconfig clang-19 powerpc allmodconfig gcc-15.2.0 powerpc allnoconfig clang-23 powerpc allnoconfig gcc-15.2.0 powerpc arches_defconfig gcc-15.2.0 powerpc randconfig-001 gcc-10.5.0 powerpc randconfig-001-20260601 gcc-10.5.0 powerpc randconfig-001-20260602 gcc-12.5.0 powerpc randconfig-002 gcc-10.5.0 powerpc randconfig-002-20260601 gcc-10.5.0 powerpc randconfig-002-20260602 gcc-12.5.0 powerpc tqm8560_defconfig gcc-15.2.0 powerpc64 randconfig-001 gcc-10.5.0 powerpc64 randconfig-001-20260601 gcc-10.5.0 powerpc64 randconfig-001-20260602 gcc-12.5.0 powerpc64 randconfig-002 gcc-10.5.0 powerpc64 randconfig-002-20260601 gcc-10.5.0 powerpc64 randconfig-002-20260602 gcc-12.5.0 riscv allmodconfig clang-23 riscv allnoconfig clang-23 riscv allnoconfig gcc-15.2.0 riscv allyesconfig clang-16 riscv defconfig gcc-15.2.0 riscv randconfig-001 clang-23 riscv randconfig-001-20260601 clang-23 riscv randconfig-001-20260602 gcc-8.5.0 riscv randconfig-002 clang-23 riscv randconfig-002-20260601 clang-23 riscv randconfig-002-20260602 gcc-8.5.0 s390 allmodconfig clang-18 s390 allmodconfig 
clang-19 s390 allnoconfig clang-23 s390 allyesconfig gcc-15.2.0 s390 defconfig gcc-15.2.0 s390 randconfig-001 clang-23 s390 randconfig-001-20260601 clang-23 s390 randconfig-001-20260602 gcc-8.5.0 s390 randconfig-002 clang-23 s390 randconfig-002-20260601 clang-23 s390 randconfig-002-20260602 gcc-8.5.0 sh allmodconfig gcc-15.2.0 sh allnoconfig clang-23 sh allnoconfig gcc-15.2.0 sh allyesconfig clang-19 sh allyesconfig gcc-15.2.0 sh defconfig gcc-14 sh defconfig gcc-15.2.0 sh randconfig-001 clang-23 sh randconfig-001-20260601 clang-23 sh randconfig-001-20260602 gcc-8.5.0 sh randconfig-002 clang-23 sh randconfig-002-20260601 clang-23 sh randconfig-002-20260602 gcc-8.5.0 sparc allnoconfig clang-23 sparc allnoconfig gcc-15.2.0 sparc defconfig gcc-15.2.0 sparc randconfig-001 gcc-8.5.0 sparc randconfig-001-20260601 gcc-15.2.0 sparc randconfig-001-20260601 gcc-8.5.0 sparc randconfig-002 gcc-8.5.0 sparc randconfig-002-20260601 gcc-15.2.0 sparc randconfig-002-20260601 gcc-8.5.0 sparc64 allmodconfig clang-23 sparc64 defconfig clang-20 sparc64 defconfig gcc-14 sparc64 randconfig-001 gcc-8.5.0 sparc64 randconfig-001-20260601 clang-20 sparc64 randconfig-001-20260601 gcc-15.2.0 sparc64 randconfig-001-20260601 gcc-8.5.0 sparc64 randconfig-002 gcc-8.5.0 sparc64 randconfig-002-20260601 clang-23 sparc64 randconfig-002-20260601 gcc-15.2.0 sparc64 randconfig-002-20260601 gcc-8.5.0 um allmodconfig clang-19 um allnoconfig clang-23 um allyesconfig gcc-14 um allyesconfig gcc-15.2.0 um defconfig clang-23 um defconfig gcc-14 um i386_defconfig gcc-14 um randconfig-001 gcc-8.5.0 um randconfig-001-20260601 gcc-14 um randconfig-001-20260601 gcc-15.2.0 um randconfig-001-20260601 gcc-8.5.0 um randconfig-002 gcc-8.5.0 um randconfig-002-20260601 gcc-14 um randconfig-002-20260601 gcc-15.2.0 um randconfig-002-20260601 gcc-8.5.0 um x86_64_defconfig clang-23 um x86_64_defconfig gcc-14 x86_64 allmodconfig clang-20 x86_64 allnoconfig clang-20 x86_64 allnoconfig clang-23 x86_64 allyesconfig clang-20 x86_64 
buildonly-randconfig-001-20260601 clang-20 x86_64 buildonly-randconfig-001-20260602 gcc-14 x86_64 buildonly-randconfig-002-20260601 clang-20 x86_64 buildonly-randconfig-002-20260602 gcc-14 x86_64 buildonly-randconfig-003-20260601 clang-20 x86_64 buildonly-randconfig-003-20260602 gcc-14 x86_64 buildonly-randconfig-004-20260601 clang-20 x86_64 buildonly-randconfig-004-20260602 gcc-14 x86_64 buildonly-randconfig-005-20260601 clang-20 x86_64 buildonly-randconfig-005-20260602 gcc-14 x86_64 buildonly-randconfig-006-20260601 clang-20 x86_64 buildonly-randconfig-006-20260602 gcc-14 x86_64 defconfig gcc-14 x86_64 kexec clang-20 x86_64 randconfig-001-20260601 clang-20 x86_64 randconfig-001-20260601 gcc-14 x86_64 randconfig-002-20260601 clang-20 x86_64 randconfig-002-20260601 gcc-14 x86_64 randconfig-003-20260601 clang-20 x86_64 randconfig-004-20260601 clang-20 x86_64 randconfig-005-20260601 clang-20 x86_64 randconfig-006-20260601 clang-20 x86_64 randconfig-011-20260601 clang-20 x86_64 randconfig-011-20260602 clang-20 x86_64 randconfig-012-20260601 clang-20 x86_64 randconfig-012-20260602 clang-20 x86_64 randconfig-013-20260601 clang-20 x86_64 randconfig-013-20260602 clang-20 x86_64 randconfig-014-20260601 clang-20 x86_64 randconfig-014-20260602 clang-20 x86_64 randconfig-015-20260601 clang-20 x86_64 randconfig-015-20260602 clang-20 x86_64 randconfig-016-20260601 clang-20 x86_64 randconfig-016-20260602 clang-20 x86_64 randconfig-071 gcc-14 x86_64 randconfig-071-20260601 gcc-14 x86_64 randconfig-071-20260602 clang-20 x86_64 randconfig-072 gcc-14 x86_64 randconfig-072-20260601 gcc-14 x86_64 randconfig-072-20260602 clang-20 x86_64 randconfig-073 gcc-14 x86_64 randconfig-073-20260601 gcc-14 x86_64 randconfig-073-20260602 clang-20 x86_64 randconfig-074 gcc-14 x86_64 randconfig-074-20260601 gcc-14 x86_64 randconfig-074-20260602 clang-20 x86_64 randconfig-075 gcc-14 x86_64 randconfig-075-20260601 gcc-14 x86_64 randconfig-075-20260602 clang-20 x86_64 randconfig-076 gcc-14 x86_64 
randconfig-076-20260601 gcc-14 x86_64 randconfig-076-20260602 clang-20 x86_64 rhel-9.4 clang-20 x86_64 rhel-9.4-bpf gcc-14 x86_64 rhel-9.4-func clang-20 x86_64 rhel-9.4-kselftests clang-20 x86_64 rhel-9.4-kunit gcc-14 x86_64 rhel-9.4-ltp gcc-14 x86_64 rhel-9.4-rust clang-20 xtensa allnoconfig clang-23 xtensa allnoconfig gcc-15.2.0 xtensa allyesconfig clang-23 xtensa allyesconfig gcc-15.2.0 xtensa randconfig-001 gcc-8.5.0 xtensa randconfig-001-20260601 gcc-15.2.0 xtensa randconfig-001-20260601 gcc-8.5.0 xtensa randconfig-002 gcc-8.5.0 xtensa randconfig-002-20260601 gcc-15.2.0 xtensa randconfig-002-20260601 gcc-8.5.0 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki From pasha.tatashin at soleen.com Mon Jun 1 15:55:47 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Mon, 1 Jun 2026 22:55:47 +0000 Subject: [PATCH] kexec_file: skip checksum verification when relocations aren't needed In-Reply-To: <20260601191136.799134-1-mclapinski@google.com> References: <20260601191136.799134-1-mclapinski@google.com> Message-ID: Nit: The crash kernel also does not perform relocations, yet a checksum is still required. The subject should be something like: kexec_file: skip purgatory checksum if all segments are CMA allocated On 06-01 21:11, Michal Clapinski wrote: > Checksum verification is needed > 1. for crash kernels. In a crash, we can't be sure the kernel is > intact. > 2. if we're worried about relocating the kernel into a region used by > some DMA that wasn't properly cancelled. Nit: Please add a little background information about CMA segments being recently added, as well as the necessity for a fast reboot due to the live update use case. > > If we used CMA to allocate segments then > 1. we're not working with a crash kernel. > 2. relocations are not going to happen. > > Therefore, we can safely disable checksum verification. 
> > Instead of adding a new variable to purgatory, just skip adding regions > and save the default value of SHA256 hash. > > Saves ~250ms on my 4.0 GHz CPU. > > Signed-off-by: Michal Clapinski > --- > kernel/kexec_file.c | 19 +++++++++++++++++++ > 1 file changed, 19 insertions(+) > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > index 2bfbb2d144e6..2dc8b0435fe6 100644 > --- a/kernel/kexec_file.c > +++ b/kernel/kexec_file.c > @@ -808,6 +808,7 @@ static int kexec_calculate_store_digests(struct kimage *image) > void *zero_buf; > struct kexec_sha_region *sha_regions; > struct purgatory_info *pi = &image->purgatory_info; > + bool can_skip_checksum = true; > > if (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY)) > return 0; > @@ -822,6 +823,23 @@ static int kexec_calculate_store_digests(struct kimage *image) > > sha256_init(&sctx); > > + /* > + * If all segments were loaded into contiguous memory, there will be no > + * relocations. In that case there is no risk of memory corruption by > + * uncancelled DMA and we can skip checksum calculation. 
> + */ > + for (i = 0; i < image->nr_segments; i++) { > + if (!image->segment_cma[i]) { > + can_skip_checksum = false; > + break; > + } > + } > + > + if (can_skip_checksum) { > + pr_info("disabling checksum verification in purgatory\n"); > + goto skip_checksum; > + } > + > for (j = i = 0; i < image->nr_segments; i++) { > struct kexec_segment *ksegment; > > @@ -867,6 +885,7 @@ static int kexec_calculate_store_digests(struct kimage *image) > j++; > } > > +skip_checksum: > sha256_final(&sctx, digest); With the few nits: Reviewed-by: Pasha Tatashin > > ret = kexec_purgatory_get_set_symbol(image, "purgatory_sha_regions", > -- > 2.54.0.929.g9b7fa37559-goog > From ltao at redhat.com Mon Jun 1 16:12:05 2026 From: ltao at redhat.com (Tao Liu) Date: Tue, 2 Jun 2026 11:12:05 +1200 Subject: [PATCH v5][makedumpfile 0/9] btf/kallsyms based makedumpfile extension for mm page filtering In-Reply-To: References: <20260414102656.55200-1-ltao@redhat.com> Message-ID: Hi Krister, Thanks a lot for your suggestions and comments! On Sat, May 30, 2026 at 9:11 AM Krister Johansen wrote: > > On Tue, Apr 14, 2026 at 10:26:47PM +1200, Tao Liu wrote: > > A) This patchset will introduce the following features to makedumpfile: > > > > 1) Add .so extension support to makedumpfile > > 2) Enable btf and kallsyms for symbol type and address resolving. > > > > B) The purpose of the features are: > > > > 1) Currently makedumpfile filters mm pages based on page flags, because flags > > can help to determine one page's usage. But this page-flag-checking method > > lacks of flexibility in certain cases, e.g. if we want to filter those mm > > pages occupied by GPU during vmcore dumping due to: > > > > a) GPU may be taking a large memory and contains sensitive data; > > b) GPU mm pages have no relations to kernel crash and useless for vmcore > > analysis. > > > > But there is no GPU mm page specific flags, and apparently we don't need > > to create one just for kdump use.
A programmable filtering tool is more > > suitable for such cases. In addition, different GPU vendors may use > > different ways for mm pages allocating, programmable filtering is better > > than hard coding these GPU specific logics into makedumpfile in this case. > > > > 2) Currently makedumpfile already contains a programmable filtering tool, aka > > eppic script, which allows user to write customized code for data erasing. > > However it has the following drawbacks: > > > > a) cannot do mm page filtering. > > b) need to access to debuginfo of both kernel and modules, which is not > > applicable in the 2nd kernel. > > c) eppic library has memory leaks which are not all resolved [1]. This > > is not acceptable in 2nd kernel. > > > > makedumpfile need to resolve the dwarf data from debuginfo, to get symbols > > types and addresses. In recent kernel there are dwarf alternatives such > > as btf/kallsyms which can be used for this purpose. And btf/kallsyms info > > are already packed within vmcore, so we can use it directly. > > > > With these, this patchset introduces makedumpfile extensions, which is based > > on btf/kallsyms symbol resolving, and is programmable for mm page filtering. > > The following section shows its usage and performance, please note the tests > > are performed in 1st kernel. > > > > 3) Compile and run makedumpfile extensions: > > > > $ make LINKTYPE=dynamic USELZO=on USESNAPPY=on USEZSTD=on EXTENSION=on > > $ make extensions > > I love this idea. Do you have time to take it further, and if not are > you open to making the extension framework more modular so that we could > add others in the future? The purpose of extension is to make the framework modular. My original thought is, we can implement several makedumpfile extensions, each restricted to one specific function. Like one extension deals with AMD gpu mm filtering only, one deals with Intel gpu only etc. 
For distros we can ship all extensions along with makedumpfile once, but the respective extensions will only take effect if the machine has AMD / Intel gpu. This is the same case if you'd like to add other customized functions while the makedumpfile core remains unchanged. > > Could the btf lookups be extended to cover the symbol lookups used by > eppic and the erase filters so that the -x option is unnecessary for > kernels that have BTF support? Yes, from my view it is doable and not difficult to implement. > > The current extension implementation is focused just on skipping pages, > but it would be great to be able to use this to erase data in structures > like the config filters and eppic, but without having to provide a > vmlinux at dump time. What do you think about adding the ability to > use the extensions to also erase parts of data structures, in addition > to filtering whole pages? That's the step 2 for the BTF/kallsyms work of makedumpfile, and I have planed to work on this once the patchset(step 1) is accepted. The reason for the task dividing is, the GPU mm page filtering is more urgent than data erasing from my view. For data erasing, at least we can do the erasing in 1st kernel with the help of dwarf, cumbersome but working; For GPU mm filtering, as far as I know, there are no handy tools in 2nd kernel. I think erasing the data is doable upon the current page filtering code. > > Would you be willing to modify the extension registration options to > allow an extension to specify what kind it is? That way, in the future I'm not sure what you mean by "what kind". Do you mean an extension needs to tell makedumpfile what purpose it is for when loading? > we could register multiple different kinds without breaking existing > ones. One for filtering pages, one for erasing / modifying dump > content, and others based upon whatever additional use cases develop. That's the goal of extensions, each extension deals with its own business. 
Could you point out the code that doesn't match the goal? I'm happy to correct it in v6.

Thanks,
Tao Liu

> > Thanks, > > -K >

From kjlx at templeofstupid.com Mon Jun 1 17:47:28 2026
From: kjlx at templeofstupid.com (Krister Johansen)
Date: Mon, 1 Jun 2026 17:47:28 -0700
Subject: [PATCH v5][makedumpfile 0/9] btf/kallsyms based makedumpfile extension for mm page filtering
In-Reply-To:
References: <20260414102656.55200-1-ltao@redhat.com>
Message-ID:

Hi Tao,
Thanks for the response! I've put the followups below.

On Tue, Jun 02, 2026 at 11:12:05AM +1200, Tao Liu wrote:
> On Sat, May 30, 2026 at 9:11 AM Krister Johansen > wrote: > > > > I love this idea. Do you have time to take it further, and if not are > > you open to making the extension framework more modular so that we could > > add others in the future? > > The purpose of extension is to make the framework modular. My original > thought is, we can implement several makedumpfile extensions, each > restricted to one specific function. Like one extension deals with AMD > gpu mm filtering only, one deals with Intel gpu only etc. For distros > we can ship all extensions along with makedumpfile once, but the > respective extensions will only take effect if the machine has AMD / > Intel gpu. This is the same case if you'd like to add other customized > functions while the makedumpfile core remains unchanged.

Makes sense.

> > Could the btf lookups be extended to cover the symbol lookups used by > > eppic and the erase filters so that the -x option is unnecessary for > > kernels that have BTF support? > > Yes, from my view it is doable and not difficult to implement.

In some environments, the size of the vmlinux + modules can be fairly substantial to leave on disk. It's attractive to have the option to omit it and still filter dumps.
> > The current extension implementation is focused just on skipping pages, > > but it would be great to be able to use this to erase data in structures > > like the config filters and eppic, but without having to provide a > > vmlinux at dump time. What do you think about adding the ability to > > use the extensions to also erase parts of data structures, in addition > > to filtering whole pages? > > That's the step 2 for the BTF/kallsyms work of makedumpfile, and I > have planed to work on this once the patchset(step 1) is accepted. The > reason for the task dividing is, the GPU mm page filtering is more > urgent than data erasing from my view. For data erasing, at least we > can do the erasing in 1st kernel with the help of dwarf, cumbersome > but working; For GPU mm filtering, as far as I know, there are no > handy tools in 2nd kernel. Excited to hear that you have something already planned for erasing. My apologies if I missed a more comprehensive write-up about the longer term goals for the work. > I think erasing the data is doable upon the current page filtering code. I wondered about this, but for data-structures that are smaller than a page, wouldn't that mean that we're erasing other content? The "erase" plugins memset the output data to a chosen value (or 0), whereas the filtering just drops the page. Couldn't this also lead to a situation where the debugger can't find the page at all, versus giving us one that's sanitized? (I do understand why you want to drop the pages for the GPU cases) > > Would you be willing to modify the extension registration options to > > allow an extension to specify what kind it is? That way, in the future > > I'm not sure what you mean by "what kind". Do you mean an extension > needs to tell makedumpfile what purpose it is for when loading? Yes, sorry I wasn't clear in writing the question. 
Stating this differently, if we want to allow the ability for different extensions to do different things, how do the extensions declare to makedumpfile what they can do, so that it knows where to invoke their callbacks, and what callbacks of theirs to invoke?

Looking at patch 6/9, right now run_extension_callback() is invoked from __exclude_unnecessary_pages() and always calls the "extension_callback" symbol in the module. This makes sense for a single extension type that's focused on filtering pages. However, if we wanted to have multiple different extensions, this might be more difficult.

If we could determine what type of functionality the module implements in load_extensions, then we could tell if this is a page filtering extension, an erase extension, or some other kind of extension.

For example, for an erase filter, perhaps we would want two callbacks: one to set up the ranges to filter, "extension_gather_callback", and another to actually check the address range to see if it is filtered, "extension_filter_data_callback".

I'm not sure about the names. "extension_callback" seems generic, but this has a specific purpose. It's an "extension_filter_page_callback".

I may be overengineering this a bit, but having makedumpfile pass an ops vector to the extension in a load function could help here. Then the module's load function fills out the vector with the functions it supports. Depending on what's implemented, these can be placed into different callback lists to get invoked at different points in the program (e.g. one at pfn filter time, another in filter_data_buffer, etc).

It sounds like you had a plan here, though. Were you thinking of adding new extension types a different way?

> > we could register multiple different kinds without breaking existing > > ones. One for filtering pages, one for erasing / modifying dump > > content, and others based upon whatever additional use cases develop.
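For illustration only, the ops-vector idea described above could be sketched roughly as follows. Nothing in this block comes from the patchset: struct ext_ops, register_ext(), run_page_hooks(), MAX_HOOKS and the example GPU pfn range are all hypothetical names and values, and a real design would also need dlopen()/dlsym() handling and error reporting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback types: one per hook point. */
typedef int (*filter_page_fn)(unsigned long pfn);
typedef int (*filter_data_fn)(unsigned long paddr, void *buf, size_t len);

/*
 * Hypothetical ops vector passed to an extension's load function.
 * The extension fills in only the hooks it supports; the rest stay NULL.
 */
struct ext_ops {
	filter_page_fn filter_page;	/* would run at pfn filter time */
	filter_data_fn filter_data;	/* would run when a data buffer is filtered */
};

#define MAX_HOOKS 8

static filter_page_fn page_hooks[MAX_HOOKS];
static size_t nr_page_hooks;
static filter_data_fn data_hooks[MAX_HOOKS];
static size_t nr_data_hooks;

/* Place each callback the extension provided on the matching hook list. */
static int register_ext(const struct ext_ops *ops)
{
	assert(ops != NULL);

	/* Check capacity first so a failure never leaves a half-registered extension. */
	if (ops->filter_page && nr_page_hooks == MAX_HOOKS)
		return -1;
	if (ops->filter_data && nr_data_hooks == MAX_HOOKS)
		return -1;

	if (ops->filter_page)
		page_hooks[nr_page_hooks++] = ops->filter_page;
	if (ops->filter_data)
		data_hooks[nr_data_hooks++] = ops->filter_data;
	return 0;
}

/* Returns 1 if any registered page hook asks to drop the page, else 0. */
static int run_page_hooks(unsigned long pfn)
{
	for (size_t i = 0; i < nr_page_hooks; i++)
		if (page_hooks[i](pfn))
			return 1;
	return 0;
}

/* Example extension that only filters pages (made-up pfn range). */
static int example_gpu_filter_page(unsigned long pfn)
{
	return pfn >= 0x1000UL && pfn < 0x2000UL;
}

static void example_ext_load(struct ext_ops *ops)
{
	ops->filter_page = example_gpu_filter_page;
}
```

With this shape, an erase extension would set only filter_data and would never be called from the page-filtering path, so new hook points could be added without changing existing extensions.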
> > That's the goal of extensions, each extension deals with its own > business. Could you point out the code that doesn't match the goal? > I'm happy to correct it in v6.

Yes, I attempted to elaborate on this in the preceding paragraphs. Basically wondering how we can add new extension functionality without breaking existing extensions, and then get the code to invoke the right one if there are multiple types that need to be used at different times.

Thanks,

-K

From ltao at redhat.com Mon Jun 1 20:04:12 2026
From: ltao at redhat.com (Tao Liu)
Date: Tue, 2 Jun 2026 15:04:12 +1200
Subject: [PATCH v5][makedumpfile 0/9] btf/kallsyms based makedumpfile extension for mm page filtering
In-Reply-To:
References: <20260414102656.55200-1-ltao@redhat.com>
Message-ID:

Hi Krister,

On Tue, Jun 2, 2026 at 12:47 PM Krister Johansen wrote:
> > Hi Tao, > Thanks for the response! I've put the followups below.

Thanks for your in-depth explanation, it's very helpful to me for designing the data erasing function.

> > On Tue, Jun 02, 2026 at 11:12:05AM +1200, Tao Liu wrote: > > On Sat, May 30, 2026 at 9:11 AM Krister Johansen > > wrote: > > > > > > I love this idea. Do you have time to take it further, and if not are > > > you open to making the extension framework more modular so that we could > > > add others in the future? > > > > The purpose of extension is to make the framework modular. My original > > thought is, we can implement several makedumpfile extensions, each > > restricted to one specific function. Like one extension deals with AMD > > gpu mm filtering only, one deals with Intel gpu only etc. For distros > > we can ship all extensions along with makedumpfile once, but the > > respective extensions will only take effect if the machine has AMD / > > Intel gpu. This is the same case if you'd like to add other customized > > functions while the makedumpfile core remains unchanged. > > Makes sense.
> > > > Could the btf lookups be extended to cover the symbol lookups used by > > > eppic and the erase filters so that the -x option is unnecessary for > > > kernels that have BTF support? > > > > Yes, from my view it is doable and not difficult to implement. > > In some environments, the size of the vmlinux + modules can be fairly > substantial to leave on disk. It's attractive to have the option to > omit it and still filter dumps. Yes, I totally agree. > > > > The current extension implementation is focused just on skipping pages, > > > but it would be great to be able to use this to erase data in structures > > > like the config filters and eppic, but without having to provide a > > > vmlinux at dump time. What do you think about adding the ability to > > > use the extensions to also erase parts of data structures, in addition > > > to filtering whole pages? > > > > That's the step 2 for the BTF/kallsyms work of makedumpfile, and I > > have planed to work on this once the patchset(step 1) is accepted. The > > reason for the task dividing is, the GPU mm page filtering is more > > urgent than data erasing from my view. For data erasing, at least we > > can do the erasing in 1st kernel with the help of dwarf, cumbersome > > but working; For GPU mm filtering, as far as I know, there are no > > handy tools in 2nd kernel. > > Excited to hear that you have something already planned for erasing. My > apologies if I missed a more comprehensive write-up about the longer > term goals for the work. No worries, I didn't post the goals upstream; I only had internal discussions within my team regarding the next steps for BTF/kallsyms in makedumpfile. > > > I think erasing the data is doable upon the current page filtering code. > > I wondered about this, but for data-structures that are smaller than a > page, wouldn't that mean that we're erasing other content? The "erase" > plugins memset the output data to a chosen value (or 0), whereas the > filtering just drops the page. 
Couldn't this also lead to a situation > where the debugger can't find the page at all, versus giving us one > that's sanitized? (I do understand why you want to drop the pages for > the GPU cases) Frankly I didn't consider the data erasing as in-depth as you did. I think you are right, makedumpfile needs to know which extensions handle data erasing and which handle mm page filtering. I guess the mm page filtering extensions will need to perform a "dry-run" filter first, in case the "data erasing" extensions break any useful data structure. In this step, "dry-run" will only record pfn numbers of the pages that will be filtered. Then "data erasing" extensions are called, so all the sensitive data is memset to 0. Finally, all desired pages are filtered out based on the previous recording. With this, "data erase" and "page filtering" will not interfere with each other. What do you think? > > > > Would you be willing to modify the extension registration options to > > > allow an extension to specify what kind it is? That way, in the future > > > > I'm not sure what you mean by "what kind". Do you mean an extension > > needs to tell makedumpfile what purpose it is for when loading? > > Yes, sorry I wasn't clear in writing the question. Stating this > differently, if we want to allow the ability for different extensions to > do different things, how do the extensions declare to makedumpfile what > they can do, so that it knows where to invoke their callbacks, and what > callbacks of theirs to invoke. > > Looking at patch 6/9, right now run_extension_callback() is involved > from __exclude_unncessary_pages and always calls the > "extension_callback" symbol in the module. This makes sense for a > single extension type that's focused on filtering pages. However, if we > wanted to have multiple different extensions, this might be more > difficult. 
> > If we could determine what type of functionality the module implements > in load_extensions, then we could tell if this is a page filtering > extension, an erase extension, or some other kind of extension. > > For example, for an erase filter, perhaps we would want two callbacks: > one to set up the ranges to filter "extension_gather_callback" and > another to actually check the address range to see if it is filtered, > "extension_filter_data_callback" > > I'm not sure about the names. "extension_callback" seems generic, but > this has a specific purpose. It's a "extension_filter_page_callback" > > I may be overengineering this a bit, but having makedumpfile pass an ops > vector to the extension in a load function could help here. Then the > module's load function fills out the vector with the functions it > supports. Depending on what's implemented, these can be placed into > different callback lists to get invoked at different points in the > program (e.g. one at pfn filter time, another in filter_data_buffer, > etc). > > It sounds like you had a plan here, though. Were you thinking of adding > new extension types a different way?

I see your idea: makedumpfile predefines a few hook points at different stages, and extensions can register their callbacks to these hook points. For now I think 2 hook points are enough: one for page filtering and the other for registering the data erasing, which definitely shouldn't be within __exclude_unnecessary_pages(). I'm willing to modify the code, such as implementing hook point registration/management. But since I haven't worked on the data erasing functions so far, the design might be superficial. Personally I'd prefer to do this along with the data erasing functions in the next independent patchset, considering the current patchset already includes plenty of code/function implementations. @maintainers, what's your opinion?

> > > > we could register multiple different kinds without breaking existing > > > ones.
One for filtering pages, one for erasing / modifying dump > > > content, and others based upon whatever additional use cases develop. > > > > That's the goal of extensions, each extension deals with its own > > business. Could you point out the code that doesn't match the goal? > > I'm happy to correct it in v6. > > Yes, I attempted to elaborate on this in the preceding paragraphs. > Basically wondering how we can add new extension functionality without > breaking existing extensions, and then get the code to invoke the right > if there are multiple types that need to be used at different times. Agreed. Thanks, Tao Liu > > Thanks, > > -K > From pasha.tatashin at soleen.com Mon Jun 1 20:17:04 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:04 +0000 Subject: [PATCH v5 00/13] liveupdate: Remove limits on sessions and files Message-ID: <20260602031717.197696-1-pasha.tatashin@soleen.com> Hi all, This series removes the fixed limits on the number of files that can be preserved within a single session, and the total number of sessions managed by the Live Update Orchestrator (LUO). The core of the change is a transition from single contiguous memory blocks for metadata serialization to a chain of linked blocks. This allows LUO to scale dynamically. 1. ABI Evolution: - Introduced linked-block headers for both file and session serialization. - Bumped session ABI version to v4. 2. Memory Management & Security: - Implemented a dynamic block allocation and reuse strategy. Blocks are allocated only when existing ones are exhausted and are reused during session/file removal cycles. - Introduced KHO_MAX_BLOCKS (10000) as a safeguard against stupid excessive allocations or corrupted cyclic lists during restore. 3. Expanded Selftests: - Added new kexec-based tests verifying preservation of 2000 sessions and 500 files per session. - Added self-tests for many sessions and many files management. 
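As a rough illustration of the linked-block idea and the KHO_MAX_BLOCKS safeguard described above, a restore-side walk might look something like the sketch below. This is not the code from the series: struct sketch_block, sketch_count_entries() and SKETCH_MAX_BLOCKS are hypothetical stand-ins, and the chain is linked with ordinary pointers here, whereas the real ABI links blocks by physical address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the KHO_MAX_BLOCKS value (10000) quoted in the cover letter. */
#define SKETCH_MAX_BLOCKS 10000

/* Hypothetical block header: a link to the next block plus an entry count. */
struct sketch_block {
	struct sketch_block *next;	/* real ABI would hold a physical address */
	uint64_t count;			/* entries stored in this block */
};

/*
 * Walk the chain and return the total number of entries, or -1 if the
 * chain looks malformed (too many blocks, e.g. a cyclic list, or a
 * block with a zero count).
 */
static int64_t sketch_count_entries(const struct sketch_block *head)
{
	int64_t total = 0;
	size_t nblocks = 0;

	for (const struct sketch_block *b = head; b != NULL; b = b->next) {
		if (++nblocks > SKETCH_MAX_BLOCKS)
			return -1;	/* corrupted or cyclic chain */
		if (b->count == 0)
			return -1;	/* zero-count blocks are treated as errors */
		total += (int64_t)b->count;
	}
	return total;
}
```

The block cap bounds the walk even when a corrupted link points back into the chain, which is why a simple counter is enough and no visited-set is needed.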
Tree: git.kernel.org/pub/scm/linux/kernel/git/tatashin/linux.git
Branch: luo-remove-max-files-sessions-limits/v5

Changes v5:
- Addressed comments from Pratyush:
- Renamed kho_block_restore -> kho_block_set_restore, kho_block_destroy -> kho_block_set_destroy.
- Renamed block iterator next/read functions to reserve_entry/read_entry.
- Added public helpers kho_block_set_head_pa() and kho_block_set_is_empty().
- Added validation to treat zero-count blocks as errors during restoration.
- Simplified block iterator reading loop from a while to an if statement.
- Changed standard WARN_ON macros to WARN_ON_ONCE on iterator allocation checks, and added warning details.
- Simplified session serialization by removing a redundant NULL check on sessions_pa.

Please review.

Thanks,
Pasha

Pasha Tatashin (13):
  liveupdate: change file_set->count type to u64 for type safety
  liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd
  liveupdate: centralize state management into struct luo_ser
  liveupdate: register luo_ser as KHO subtree
  liveupdate: Extract luo_file_deserialize_one helper
  liveupdate: Extract luo_session_deserialize_one helper
  kho: add support for linked-block serialization
  liveupdate: defer session block allocation and PA setting
  liveupdate: Remove limit on the number of sessions
  liveupdate: Remove limit on the number of files per session
  selftests/liveupdate: Test session and file limit removal
  selftests/liveupdate: Add stress-sessions kexec test
  selftests/liveupdate: Add stress-files kexec test

 Documentation/core-api/kho/abi.rst            |   5 +
 Documentation/core-api/kho/index.rst          |  11 +
 MAINTAINERS                                   |   1 +
 include/linux/kho/abi/block.h                 |  56 +++
 include/linux/kho/abi/luo.h                   | 149 ++-----
 include/linux/kho_block.h                     | 101 +++++
 kernel/liveupdate/Makefile                    |   1 +
 kernel/liveupdate/kho_block.c                 | 390 ++++++++++++++++++
 kernel/liveupdate/luo_core.c                  |  99 ++---
 kernel/liveupdate/luo_file.c                  | 206 ++++-----
 kernel/liveupdate/luo_flb.c                   |  65 +--
 kernel/liveupdate/luo_internal.h              |  16 +-
 kernel/liveupdate/luo_session.c               | 241 +++++------
 tools/testing/selftests/liveupdate/Makefile   |   2 +
 .../testing/selftests/liveupdate/liveupdate.c |  75 ++++
 .../selftests/liveupdate/luo_stress_files.c   |  97 +++++
 .../liveupdate/luo_stress_sessions.c          | 102 +++++
 .../selftests/liveupdate/luo_test_utils.c     |  24 ++
 .../selftests/liveupdate/luo_test_utils.h     |   2 +
 19 files changed, 1184 insertions(+), 459 deletions(-)
 create mode 100644 include/linux/kho/abi/block.h
 create mode 100644 include/linux/kho_block.h
 create mode 100644 kernel/liveupdate/kho_block.c
 create mode 100644 tools/testing/selftests/liveupdate/luo_stress_files.c
 create mode 100644 tools/testing/selftests/liveupdate/luo_stress_sessions.c

base-commit: 2935777b418d2bfcbfe96705bb2c0fa6c0d94e18
-- 
2.53.0

From pasha.tatashin at soleen.com Mon Jun 1 20:17:05 2026
From: pasha.tatashin at soleen.com (Pasha Tatashin)
Date: Tue, 2 Jun 2026 03:17:05 +0000
Subject: [PATCH v5 01/13] liveupdate: change file_set->count type to u64 for type safety
In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com>
References: <20260602031717.197696-1-pasha.tatashin@soleen.com>
Message-ID: <20260602031717.197696-2-pasha.tatashin@soleen.com>

This improves type safety and aligns the in-memory file_set->count with the serialized count type. It avoids potential truncation or sign conversion mismatch issues.
Reviewed-by: Pratyush Yadav (Google)
Signed-off-by: Pasha Tatashin
---
 kernel/liveupdate/luo_internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
index dd53d4a7277e..ae58206f14ac 100644
--- a/kernel/liveupdate/luo_internal.h
+++ b/kernel/liveupdate/luo_internal.h
@@ -52,7 +52,7 @@ static inline int luo_ucmd_respond(struct luo_ucmd *ucmd,
 struct luo_file_set {
 	struct list_head files_list;
 	struct luo_file_ser *files;
-	long count;
+	u64 count;
 };
 
 /**
-- 
2.53.0

From pasha.tatashin at soleen.com Mon Jun 1 20:17:06 2026
From: pasha.tatashin at soleen.com (Pasha Tatashin)
Date: Tue, 2 Jun 2026 03:17:06 +0000
Subject: [PATCH v5 02/13] liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd
In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com>
References: <20260602031717.197696-1-pasha.tatashin@soleen.com>
Message-ID: <20260602031717.197696-3-pasha.tatashin@soleen.com>

Refactor luo_session_retrieve_fd() to avoid mixing automated cleanup-style guards with goto-based resource release, which is not recommended under the Linux kernel coding style.
Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/luo_session.c | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 5c6cebc6e326..7b2f9cbabb05 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -291,25 +291,24 @@ static int luo_session_retrieve_fd(struct luo_session *session, if (argp->fd < 0) return argp->fd; - guard(mutex)(&session->mutex); - err = luo_retrieve_file(&session->file_set, argp->token, &file); - if (err < 0) - goto err_put_fd; + scoped_guard(mutex, &session->mutex) { + err = luo_retrieve_file(&session->file_set, argp->token, &file); + if (err < 0) { + put_unused_fd(argp->fd); + return err; + } + } err = luo_ucmd_respond(ucmd, sizeof(*argp)); - if (err) - goto err_put_file; + if (err) { + fput(file); + put_unused_fd(argp->fd); + return err; + } fd_install(argp->fd, file); return 0; - -err_put_file: - fput(file); -err_put_fd: - put_unused_fd(argp->fd); - - return err; } static int luo_session_finish(struct luo_session *session, -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:07 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:07 +0000 Subject: [PATCH v5 03/13] liveupdate: centralize state management into struct luo_ser In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-4-pasha.tatashin@soleen.com> Transition the LUO to ABI v2, which centralizes state management into a single struct luo_ser header. Previously, LUO state was spread across multiple FDT properties and subnodes. ABI v2 simplifies this by placing all core state, including the liveupdate number and physical addresses for sessions and FLB headers into a centralized struct luo_ser. 
Note that this change introduces a semantic difference: the sessions and FLB serialization formats are no longer completely independent of the core LUO. Their metadata (such as physical addresses for sessions and FLB headers) is now coupled to and managed via the centralized struct luo_ser. Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 91 +++++++++++--------------------- kernel/liveupdate/luo_core.c | 64 +++++++++++++++------- kernel/liveupdate/luo_flb.c | 65 +++-------------------- kernel/liveupdate/luo_internal.h | 8 +-- kernel/liveupdate/luo_session.c | 64 ++++------------------ 5 files changed, 98 insertions(+), 194 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 46750a0ddf88..1b2f865a771a 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -30,52 +30,25 @@ * .. code-block:: none * * / { - * compatible = "luo-v1"; - * liveupdate-number = <...>; - * - * luo-session { - * compatible = "luo-session-v1"; - * luo-session-header = ; - * }; - * - * luo-flb { - * compatible = "luo-flb-v1"; - * luo-flb-header = ; - * }; + * compatible = "luo-v2"; + * luo-abi-header = ; * }; * * Main LUO Node (/): * - * - compatible: "luo-v1" + * - compatible: "luo-v2" * Identifies the overall LUO ABI version. - * - liveupdate-number: u64 - * A counter tracking the number of successful live updates performed. - * - * Session Node (luo-session): - * This node describes all preserved user-space sessions. - * - * - compatible: "luo-session-v1" - * Identifies the session ABI version. - * - luo-session-header: u64 - * The physical address of a `struct luo_session_header_ser`. This structure - * is the header for a contiguous block of memory containing an array of - * `struct luo_session_ser`, one for each preserved session. 
- * - * File-Lifecycle-Bound Node (luo-flb): - * This node describes all preserved global objects whose lifecycle is bound - * to that of the preserved files (e.g., shared IOMMU state). - * - * - compatible: "luo-flb-v1" - * Identifies the FLB ABI version. - * - luo-flb-header: u64 - * The physical address of a `struct luo_flb_header_ser`. This structure is - * the header for a contiguous block of memory containing an array of - * `struct luo_flb_ser`, one for each preserved global object. + * - luo-abi-header: u64 + * The physical address of `struct luo_ser`. * * Serialization Structures: * The FDT properties point to memory regions containing arrays of simple, * `__packed` structures. These structures contain the actual preserved state. * + * - struct luo_ser: + * The central ABI structure that contains the overall state of the LUO. + * It includes the liveupdate-number and pointers to sessions and FLBs. + * * - struct luo_session_header_ser: * Header for the session array. Contains the total page count of the * preserved memory block and the number of `struct luo_session_ser` @@ -109,13 +82,26 @@ /* * The LUO FDT hooks all LUO state for sessions, fds, etc. - * In the root it also carries "liveupdate-number" 64-bit property that - * corresponds to the number of live-updates performed on this machine. */ #define LUO_FDT_SIZE PAGE_SIZE #define LUO_FDT_KHO_ENTRY_NAME "LUO" -#define LUO_FDT_COMPATIBLE "luo-v1" -#define LUO_FDT_LIVEUPDATE_NUM "liveupdate-number" +#define LUO_FDT_COMPATIBLE "luo-v2" +#define LUO_FDT_ABI_HEADER "luo-abi-header" + +/** + * struct luo_ser - Centralized LUO ABI header. + * @liveupdate_num: A counter tracking the number of successful live updates. + * @sessions_pa: Physical address of the first session block header. + * @flbs_pa: Physical address of the FLB header. + * + * This structure is the root of all preserved LUO state. It is pointed to by + * the "luo-abi-header" property in the LUO FDT. 
+ */ +struct luo_ser { + u64 liveupdate_num; + u64 sessions_pa; + u64 flbs_pa; +} __packed; #define LIVEUPDATE_HNDL_COMPAT_LENGTH 48 @@ -147,15 +133,6 @@ struct luo_file_set_ser { u64 count; } __packed; -/* - * LUO FDT session node - * LUO_FDT_SESSION_HEADER: is a u64 physical address of struct - * luo_session_header_ser - */ -#define LUO_FDT_SESSION_NODE_NAME "luo-session" -#define LUO_FDT_SESSION_COMPATIBLE "luo-session-v2" -#define LUO_FDT_SESSION_HEADER "luo-session-header" - /** * struct luo_session_header_ser - Header for the serialized session data block. * @count: The number of `struct luo_session_ser` entries that immediately @@ -165,7 +142,7 @@ struct luo_file_set_ser { * physical memory preserved across the kexec. It provides the necessary * metadata to interpret the array of session entries that follow. * - * If this structure is modified, `LUO_FDT_SESSION_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_session_header_ser { u64 count; @@ -182,7 +159,7 @@ struct luo_session_header_ser { * session) is created and passed to the new kernel, allowing it to reconstruct * the session context. * - * If this structure is modified, `LUO_FDT_SESSION_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_session_ser { char name[LIVEUPDATE_SESSION_NAME_LENGTH]; @@ -192,10 +169,6 @@ struct luo_session_ser { /* The max size is set so it can be reliably used during in serialization */ #define LIVEUPDATE_FLB_COMPAT_LENGTH 48 -#define LUO_FDT_FLB_NODE_NAME "luo-flb" -#define LUO_FDT_FLB_COMPATIBLE "luo-flb-v1" -#define LUO_FDT_FLB_HEADER "luo-flb-header" - /** * struct luo_flb_header_ser - Header for the serialized FLB data block. * @pgcnt: The total number of pages occupied by the entire preserved memory @@ -205,11 +178,9 @@ struct luo_session_ser { * in the memory block. 
* * This structure is located at the physical address specified by the - * `LUO_FDT_FLB_HEADER` FDT property. It provides the new kernel with the - * necessary information to find and iterate over the array of preserved - * File-Lifecycle-Bound objects and to manage the underlying memory. + * flbs_pa in luo_ser. * - * If this structure is modified, LUO_FDT_FLB_COMPATIBLE must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_flb_header_ser { u64 pgcnt; @@ -231,7 +202,7 @@ struct luo_flb_header_ser { * passed to the new kernel. Each entry allows the LUO core to restore one * global, shared object. * - * If this structure is modified, LUO_FDT_FLB_COMPATIBLE must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_flb_ser { char name[LIVEUPDATE_FLB_COMPAT_LENGTH]; diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 5d5827ced73c..085c0dfc1ef1 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -61,7 +61,6 @@ #include #include #include -#include #include "kexec_handover_internal.h" #include "luo_internal.h" @@ -86,9 +85,11 @@ early_param("liveupdate", early_liveupdate_param); static int __init luo_early_startup(void) { + struct luo_ser *luo_ser; + int err, header_size; phys_addr_t fdt_phys; - int err, ln_size; const void *ptr; + u64 luo_ser_pa; if (!kho_is_enabled()) { if (liveupdate_enabled()) @@ -119,26 +120,32 @@ static int __init luo_early_startup(void) return -EINVAL; } - ln_size = 0; - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_LIVEUPDATE_NUM, - &ln_size); - if (!ptr || ln_size != sizeof(luo_global.liveupdate_num)) { - pr_err("Unable to get live update number '%s' [%d]\n", - LUO_FDT_LIVEUPDATE_NUM, ln_size); + header_size = 0; + ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); + if (!ptr || header_size != sizeof(u64)) { + pr_err("Unable to get ABI header '%s' [%d]\n", + 
LUO_FDT_ABI_HEADER, header_size); return -EINVAL; } - luo_global.liveupdate_num = get_unaligned((u64 *)ptr); + luo_ser_pa = get_unaligned((u64 *)ptr); + luo_ser = phys_to_virt(luo_ser_pa); + + luo_global.liveupdate_num = luo_ser->liveupdate_num; pr_info("Retrieved live update data, liveupdate number: %lld\n", luo_global.liveupdate_num); - err = luo_session_setup_incoming(luo_global.fdt_in); + err = luo_session_setup_incoming(luo_ser->sessions_pa); if (err) - return err; + goto out_free_ser; + + luo_flb_setup_incoming(luo_ser->flbs_pa); - err = luo_flb_setup_incoming(luo_global.fdt_in); + err = 0; +out_free_ser: + kho_restore_free(luo_ser); return err; } @@ -160,7 +167,8 @@ early_initcall(liveupdate_early_init); /* Called during boot to create outgoing LUO fdt tree */ static int __init luo_fdt_setup(void) { - const u64 ln = luo_global.liveupdate_num + 1; + struct luo_ser *luo_ser; + u64 luo_ser_pa; void *fdt_out; int err; @@ -170,27 +178,45 @@ static int __init luo_fdt_setup(void) return PTR_ERR(fdt_out); } + luo_ser = kho_alloc_preserve(sizeof(*luo_ser)); + if (IS_ERR(luo_ser)) { + err = PTR_ERR(luo_ser); + goto exit_free_fdt; + } + luo_ser_pa = virt_to_phys(luo_ser); + err = fdt_create(fdt_out, LUO_FDT_SIZE); err |= fdt_finish_reservemap(fdt_out); err |= fdt_begin_node(fdt_out, ""); err |= fdt_property_string(fdt_out, "compatible", LUO_FDT_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_LIVEUPDATE_NUM, &ln, sizeof(ln)); - err |= luo_session_setup_outgoing(fdt_out); - err |= luo_flb_setup_outgoing(fdt_out); + err |= fdt_property(fdt_out, LUO_FDT_ABI_HEADER, &luo_ser_pa, + sizeof(luo_ser_pa)); err |= fdt_end_node(fdt_out); err |= fdt_finish(fdt_out); if (err) - goto exit_free; + goto exit_free_luo_ser; + + err = luo_session_setup_outgoing(&luo_ser->sessions_pa); + if (err) + goto exit_free_luo_ser; + + err = luo_flb_setup_outgoing(&luo_ser->flbs_pa); + if (err) + goto exit_free_luo_ser; + + luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; err = 
kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out, fdt_totalsize(fdt_out)); if (err) - goto exit_free; + goto exit_free_luo_ser; luo_global.fdt_out = fdt_out; return 0; -exit_free: +exit_free_luo_ser: + kho_unpreserve_free(luo_ser); +exit_free_fdt: kho_unpreserve_free(fdt_out); pr_err("failed to prepare LUO FDT: %d\n", err); diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c index 8f5c5dd01cd0..c8dd30b41238 100644 --- a/kernel/liveupdate/luo_flb.c +++ b/kernel/liveupdate/luo_flb.c @@ -44,13 +44,11 @@ #include #include #include -#include #include #include #include #include #include -#include #include "luo_internal.h" #define LUO_FLB_PGCNT 1ul @@ -551,27 +549,15 @@ int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp) return 0; } -int __init luo_flb_setup_outgoing(void *fdt_out) +int __init luo_flb_setup_outgoing(u64 *flbs_pa) { struct luo_flb_header_ser *header_ser; - u64 header_ser_pa; - int err; header_ser = kho_alloc_preserve(LUO_FLB_PGCNT << PAGE_SHIFT); if (IS_ERR(header_ser)) return PTR_ERR(header_ser); - header_ser_pa = virt_to_phys(header_ser); - - err = fdt_begin_node(fdt_out, LUO_FDT_FLB_NODE_NAME); - err |= fdt_property_string(fdt_out, "compatible", - LUO_FDT_FLB_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_FLB_HEADER, &header_ser_pa, - sizeof(header_ser_pa)); - err |= fdt_end_node(fdt_out); - - if (err) - goto err_unpreserve; + *flbs_pa = virt_to_phys(header_ser); header_ser->pgcnt = LUO_FLB_PGCNT; luo_flb_global.outgoing.header_ser = header_ser; @@ -579,53 +565,18 @@ int __init luo_flb_setup_outgoing(void *fdt_out) luo_flb_global.outgoing.active = true; return 0; - -err_unpreserve: - kho_unpreserve_free(header_ser); - - return err; } -int __init luo_flb_setup_incoming(void *fdt_in) +void __init luo_flb_setup_incoming(u64 flbs_pa) { struct luo_flb_header_ser *header_ser; - int err, header_size, offset; - const void *ptr; - u64 header_ser_pa; - offset = fdt_subnode_offset(fdt_in, 0, LUO_FDT_FLB_NODE_NAME); - if 
(offset < 0) { - pr_err("Unable to get FLB node [%s]\n", LUO_FDT_FLB_NODE_NAME); - - return -ENOENT; + if (flbs_pa) { + header_ser = phys_to_virt(flbs_pa); + luo_flb_global.incoming.header_ser = header_ser; + luo_flb_global.incoming.ser = (void *)(header_ser + 1); + luo_flb_global.incoming.active = true; } - - err = fdt_node_check_compatible(fdt_in, offset, - LUO_FDT_FLB_COMPATIBLE); - if (err) { - pr_err("FLB node is incompatible with '%s' [%d]\n", - LUO_FDT_FLB_COMPATIBLE, err); - - return -EINVAL; - } - - header_size = 0; - ptr = fdt_getprop(fdt_in, offset, LUO_FDT_FLB_HEADER, &header_size); - if (!ptr || header_size != sizeof(u64)) { - pr_err("Unable to get FLB header property '%s' [%d]\n", - LUO_FDT_FLB_HEADER, header_size); - - return -EINVAL; - } - - header_ser_pa = get_unaligned((u64 *)ptr); - header_ser = phys_to_virt(header_ser_pa); - - luo_flb_global.incoming.header_ser = header_ser; - luo_flb_global.incoming.ser = (void *)(header_ser + 1); - luo_flb_global.incoming.active = true; - - return 0; } /** diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index ae58206f14ac..fe22086bfbeb 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -79,8 +79,8 @@ extern struct rw_semaphore luo_register_rwlock; int luo_session_create(const char *name, struct file **filep); int luo_session_retrieve(const char *name, struct file **filep); -int __init luo_session_setup_outgoing(void *fdt); -int __init luo_session_setup_incoming(void *fdt); +int __init luo_session_setup_outgoing(u64 *sessions_pa); +int __init luo_session_setup_incoming(u64 sessions_pa); int luo_session_serialize(void); int luo_session_deserialize(void); @@ -102,8 +102,8 @@ int luo_flb_file_preserve(struct liveupdate_file_handler *fh); void luo_flb_file_unpreserve(struct liveupdate_file_handler *fh); void luo_flb_file_finish(struct liveupdate_file_handler *fh); void luo_flb_unregister_all(struct liveupdate_file_handler *fh); -int __init 
luo_flb_setup_outgoing(void *fdt); -int __init luo_flb_setup_incoming(void *fdt); +int __init luo_flb_setup_outgoing(u64 *flbs_pa); +void __init luo_flb_setup_incoming(u64 flbs_pa); void luo_flb_serialize(void); #ifdef CONFIG_LIVEUPDATE_TEST diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 7b2f9cbabb05..3b255ffd1bf1 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -25,9 +25,8 @@ * * - Serialization: Session metadata is preserved using the KHO framework. When * a live update is triggered via kexec, an array of `struct luo_session_ser` - * is populated and placed in a preserved memory region. An FDT node is also - * created, containing the count of sessions and the physical address of this - * array. + * is populated and placed in a preserved memory region. The physical address + * of this array is stored in the centralized `struct luo_ser` structure. * * Session Lifecycle: * @@ -91,13 +90,11 @@ #include #include #include -#include #include #include #include #include #include -#include #include #include "luo_internal.h" @@ -525,75 +522,34 @@ int luo_session_retrieve(const char *name, struct file **filep) return err; } -int __init luo_session_setup_outgoing(void *fdt_out) +int __init luo_session_setup_outgoing(u64 *sessions_pa) { struct luo_session_header_ser *header_ser; - u64 header_ser_pa; - int err; header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); if (IS_ERR(header_ser)) return PTR_ERR(header_ser); - header_ser_pa = virt_to_phys(header_ser); - - err = fdt_begin_node(fdt_out, LUO_FDT_SESSION_NODE_NAME); - err |= fdt_property_string(fdt_out, "compatible", - LUO_FDT_SESSION_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_SESSION_HEADER, &header_ser_pa, - sizeof(header_ser_pa)); - err |= fdt_end_node(fdt_out); - if (err) - goto err_unpreserve; + *sessions_pa = virt_to_phys(header_ser); luo_session_global.outgoing.header_ser = header_ser; luo_session_global.outgoing.ser = 
(void *)(header_ser + 1); luo_session_global.outgoing.active = true; return 0; - -err_unpreserve: - kho_unpreserve_free(header_ser); - return err; } -int __init luo_session_setup_incoming(void *fdt_in) +int __init luo_session_setup_incoming(u64 sessions_pa) { struct luo_session_header_ser *header_ser; - int err, header_size, offset; - u64 header_ser_pa; - const void *ptr; - - offset = fdt_subnode_offset(fdt_in, 0, LUO_FDT_SESSION_NODE_NAME); - if (offset < 0) { - pr_err("Unable to get session node: [%s]\n", - LUO_FDT_SESSION_NODE_NAME); - return -EINVAL; - } - err = fdt_node_check_compatible(fdt_in, offset, - LUO_FDT_SESSION_COMPATIBLE); - if (err) { - pr_err("Session node incompatible [%s]\n", - LUO_FDT_SESSION_COMPATIBLE); - return -EINVAL; + if (sessions_pa) { + header_ser = phys_to_virt(sessions_pa); + luo_session_global.incoming.header_ser = header_ser; + luo_session_global.incoming.ser = (void *)(header_ser + 1); + luo_session_global.incoming.active = true; } - header_size = 0; - ptr = fdt_getprop(fdt_in, offset, LUO_FDT_SESSION_HEADER, &header_size); - if (!ptr || header_size != sizeof(u64)) { - pr_err("Unable to get session header '%s' [%d]\n", - LUO_FDT_SESSION_HEADER, header_size); - return -EINVAL; - } - - header_ser_pa = get_unaligned((u64 *)ptr); - header_ser = phys_to_virt(header_ser_pa); - - luo_session_global.incoming.header_ser = header_ser; - luo_session_global.incoming.ser = (void *)(header_ser + 1); - luo_session_global.incoming.active = true; - return 0; } -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:08 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:08 +0000 Subject: [PATCH v5 04/13] liveupdate: register luo_ser as KHO subtree In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-5-pasha.tatashin@soleen.com> Entirely remove the LUO FDT wrapper since the FDT only carries the compatible 
string and the pointer to the centralized struct luo_ser. Instead, register the struct luo_ser via the KHO raw subtree API, placing the compatibility string inside the structure itself. Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 57 +++++++++--------------- kernel/liveupdate/luo_core.c | 85 +++++++++++------------------------- 2 files changed, 46 insertions(+), 96 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 1b2f865a771a..9a4fe491812b 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -10,11 +10,11 @@ * * Live Update Orchestrator uses the stable Application Binary Interface * defined below to pass state from a pre-update kernel to a post-update - * kernel. The ABI is built upon the Kexec HandOver framework and uses a - * Flattened Device Tree to describe the preserved data. + * kernel. The ABI is built upon the Kexec HandOver framework and registers + * the central `struct luo_ser` via the KHO raw subtree API. * - * This interface is a contract. Any modification to the FDT structure, node - * properties, compatible strings, or the layout of the `__packed` serialization + * This interface is a contract. Any modification to the structure fields, + * compatible strings, or the layout of the `__packed` serialization * structures defined here constitutes a breaking change. Such changes require * incrementing the version number in the relevant `_COMPATIBLE` string to * prevent a new kernel from misinterpreting data from an old kernel. @@ -23,31 +23,15 @@ * however, backward/forward compatibility is only guaranteed for kernels * supporting the same ABI version. * - * FDT Structure Overview: + * KHO Structure Overview: * The entire LUO state is encapsulated within a single KHO entry named "LUO". - * This entry contains an FDT with the following layout: - * - * .. 
code-block:: none - * - * / { - * compatible = "luo-v2"; - * luo-abi-header = ; - * }; - * - * Main LUO Node (/): - * - * - compatible: "luo-v2" - * Identifies the overall LUO ABI version. - * - luo-abi-header: u64 - * The physical address of `struct luo_ser`. + * This entry contains the `struct luo_ser` structure. * * Serialization Structures: - * The FDT properties point to memory regions containing arrays of simple, - * `__packed` structures. These structures contain the actual preserved state. - * * - struct luo_ser: * The central ABI structure that contains the overall state of the LUO. - * It includes the liveupdate-number and pointers to sessions and FLBs. + * It includes the compatibility string, the liveupdate-number, and pointers + * to sessions and FLBs. * * - struct luo_session_header_ser: * Header for the session array. Contains the total page count of the @@ -78,26 +62,27 @@ #ifndef _LINUX_KHO_ABI_LUO_H #define _LINUX_KHO_ABI_LUO_H +#include #include /* - * The LUO FDT hooks all LUO state for sessions, fds, etc. + * The LUO state is registered under this KHO entry name. */ -#define LUO_FDT_SIZE PAGE_SIZE -#define LUO_FDT_KHO_ENTRY_NAME "LUO" -#define LUO_FDT_COMPATIBLE "luo-v2" -#define LUO_FDT_ABI_HEADER "luo-abi-header" +#define LUO_KHO_ENTRY_NAME "LUO" +#define LUO_ABI_COMPATIBLE "luo-v3" +#define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) /** * struct luo_ser - Centralized LUO ABI header. + * @compatible: Compatibility string identifying the LUO ABI version. * @liveupdate_num: A counter tracking the number of successful live updates. * @sessions_pa: Physical address of the first session block header. * @flbs_pa: Physical address of the FLB header. * - * This structure is the root of all preserved LUO state. It is pointed to by - * the "luo-abi-header" property in the LUO FDT. + * This structure is the root of all preserved LUO state. 
*/ struct luo_ser { + char compatible[LUO_ABI_COMPAT_LEN]; u64 liveupdate_num; u64 sessions_pa; u64 flbs_pa; @@ -111,7 +96,7 @@ struct luo_ser { * @data: Private data * @token: User provided token for this file * - * If this structure is modified, LUO_SESSION_COMPATIBLE must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_file_ser { char compatible[LIVEUPDATE_HNDL_COMPAT_LENGTH]; @@ -142,7 +127,7 @@ struct luo_file_set_ser { * physical memory preserved across the kexec. It provides the necessary * metadata to interpret the array of session entries that follow. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_session_header_ser { u64 count; @@ -159,7 +144,7 @@ struct luo_session_header_ser { * session) is created and passed to the new kernel, allowing it to reconstruct * the session context. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_session_ser { char name[LIVEUPDATE_SESSION_NAME_LENGTH]; @@ -180,7 +165,7 @@ struct luo_session_ser { * This structure is located at the physical address specified by the * flbs_pa in luo_ser. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_flb_header_ser { u64 pgcnt; @@ -202,7 +187,7 @@ struct luo_flb_header_ser { * passed to the new kernel. Each entry allows the LUO core to restore one * global, shared object. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. 
*/ struct luo_flb_ser { char name[LIVEUPDATE_FLB_COMPAT_LENGTH]; diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 085c0dfc1ef1..69b00e7d0f8f 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -54,7 +54,6 @@ #include #include #include -#include #include #include #include @@ -67,8 +66,7 @@ static struct { bool enabled; - void *fdt_out; - void *fdt_in; + struct luo_ser *luo_ser_out; u64 liveupdate_num; } luo_global; @@ -85,11 +83,10 @@ early_param("liveupdate", early_liveupdate_param); static int __init luo_early_startup(void) { + phys_addr_t luo_ser_phys; struct luo_ser *luo_ser; - int err, header_size; - phys_addr_t fdt_phys; - const void *ptr; - u64 luo_ser_pa; + size_t len; + int err; if (!kho_is_enabled()) { if (liveupdate_enabled()) @@ -98,40 +95,29 @@ static int __init luo_early_startup(void) return 0; } - /* Retrieve LUO subtree, and verify its format. */ - err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys, NULL); + /* Retrieve LUO state from KHO. 
*/ + err = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &luo_ser_phys, &len); if (err) { if (err != -ENOENT) { - pr_err("failed to retrieve FDT '%s' from KHO: %pe\n", - LUO_FDT_KHO_ENTRY_NAME, ERR_PTR(err)); + pr_err("failed to retrieve LUO state '%s' from KHO: %pe\n", + LUO_KHO_ENTRY_NAME, ERR_PTR(err)); return err; } return 0; } - luo_global.fdt_in = phys_to_virt(fdt_phys); - err = fdt_node_check_compatible(luo_global.fdt_in, 0, - LUO_FDT_COMPATIBLE); - if (err) { - pr_err("FDT '%s' is incompatible with '%s' [%d]\n", - LUO_FDT_KHO_ENTRY_NAME, LUO_FDT_COMPATIBLE, err); - + if (len < sizeof(*luo_ser)) { + pr_err("LUO state is too small (%zu < %zu)\n", len, sizeof(*luo_ser)); return -EINVAL; } - header_size = 0; - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); - if (!ptr || header_size != sizeof(u64)) { - pr_err("Unable to get ABI header '%s' [%d]\n", - LUO_FDT_ABI_HEADER, header_size); - + luo_ser = phys_to_virt(luo_ser_phys); + if (strncmp(luo_ser->compatible, LUO_ABI_COMPATIBLE, LUO_ABI_COMPAT_LEN)) { + pr_err("LUO state is incompatible with '%s'\n", LUO_ABI_COMPATIBLE); return -EINVAL; } - luo_ser_pa = get_unaligned((u64 *)ptr); - luo_ser = phys_to_virt(luo_ser_pa); - luo_global.liveupdate_num = luo_ser->liveupdate_num; pr_info("Retrieved live update data, liveupdate number: %lld\n", luo_global.liveupdate_num); @@ -164,37 +150,20 @@ static int __init liveupdate_early_init(void) } early_initcall(liveupdate_early_init); -/* Called during boot to create outgoing LUO fdt tree */ -static int __init luo_fdt_setup(void) +/* Called during boot to create outgoing LUO state */ +static int __init luo_state_setup(void) { struct luo_ser *luo_ser; - u64 luo_ser_pa; - void *fdt_out; int err; - fdt_out = kho_alloc_preserve(LUO_FDT_SIZE); - if (IS_ERR(fdt_out)) { - pr_err("failed to allocate/preserve FDT memory\n"); - return PTR_ERR(fdt_out); - } - luo_ser = kho_alloc_preserve(sizeof(*luo_ser)); if (IS_ERR(luo_ser)) { - err = PTR_ERR(luo_ser); - goto 
exit_free_fdt; + pr_err("failed to allocate/preserve LUO state memory\n"); + return PTR_ERR(luo_ser); } - luo_ser_pa = virt_to_phys(luo_ser); - - err = fdt_create(fdt_out, LUO_FDT_SIZE); - err |= fdt_finish_reservemap(fdt_out); - err |= fdt_begin_node(fdt_out, ""); - err |= fdt_property_string(fdt_out, "compatible", LUO_FDT_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_ABI_HEADER, &luo_ser_pa, - sizeof(luo_ser_pa)); - err |= fdt_end_node(fdt_out); - err |= fdt_finish(fdt_out); - if (err) - goto exit_free_luo_ser; + + strscpy(luo_ser->compatible, LUO_ABI_COMPATIBLE, sizeof(luo_ser->compatible)); + luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; err = luo_session_setup_outgoing(&luo_ser->sessions_pa); if (err) @@ -204,21 +173,17 @@ static int __init luo_fdt_setup(void) if (err) goto exit_free_luo_ser; - luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; - - err = kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out, - fdt_totalsize(fdt_out)); + err = kho_add_subtree(LUO_KHO_ENTRY_NAME, luo_ser, sizeof(*luo_ser)); if (err) goto exit_free_luo_ser; - luo_global.fdt_out = fdt_out; + + luo_global.luo_ser_out = luo_ser; return 0; exit_free_luo_ser: kho_unpreserve_free(luo_ser); -exit_free_fdt: - kho_unpreserve_free(fdt_out); - pr_err("failed to prepare LUO FDT: %d\n", err); + pr_err("failed to prepare LUO state: %d\n", err); return err; } @@ -234,7 +199,7 @@ static int __init luo_late_startup(void) if (!liveupdate_enabled()) return 0; - err = luo_fdt_setup(); + err = luo_state_setup(); if (err) luo_global.enabled = false; -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:09 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:09 +0000 Subject: [PATCH v5 05/13] liveupdate: Extract luo_file_deserialize_one helper In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-6-pasha.tatashin@soleen.com> Extract the 
logic for deserializing a single file entry into a separate helper function, in preparation for linked-block serialization of files. This is pure code movement; no functional changes are intended. Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/luo_file.c | 77 ++++++++++++++++++++---------------- 1 file changed, 44 insertions(+), 33 deletions(-) diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 208987502f73..9eec07a9e9fc 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -753,6 +753,46 @@ int luo_file_finish(struct luo_file_set *file_set) return 0; } +static int luo_file_deserialize_one(struct luo_file_set *file_set, + struct luo_file_ser *ser) +{ + struct liveupdate_file_handler *fh; + bool handler_found = false; + struct luo_file *luo_file; + + down_read(&luo_register_rwlock); + list_private_for_each_entry(fh, &luo_file_handler_list, list) { + if (!strcmp(fh->compatible, ser->compatible)) { + if (try_module_get(fh->ops->owner)) + handler_found = true; + break; + } + } + up_read(&luo_register_rwlock); + + if (!handler_found) { + pr_warn("No registered handler for compatible '%.*s'\n", + (int)sizeof(ser->compatible), + ser->compatible); + return -ENOENT; + } + + luo_file = kzalloc_obj(*luo_file); + if (!luo_file) { + module_put(fh->ops->owner); + return -ENOMEM; + } + + luo_file->fh = fh; + luo_file->file = NULL; + luo_file->serialized_data = ser->data; + luo_file->token = ser->token; + mutex_init(&luo_file->mutex); + list_add_tail(&luo_file->list, &file_set->files_list); + + return 0; +} + /** * luo_file_deserialize - Reconstructs the list of preserved files in the new kernel. * @file_set: The incoming file_set to fill with deserialized data. 
@@ -782,6 +822,7 @@ int luo_file_deserialize(struct luo_file_set *file_set, struct luo_file_set_ser *file_set_ser) { struct luo_file_ser *file_ser; + int err; u64 i; if (!file_set_ser->files) { @@ -809,39 +850,9 @@ int luo_file_deserialize(struct luo_file_set *file_set, */ file_ser = file_set->files; for (i = 0; i < file_set->count; i++) { - struct liveupdate_file_handler *fh; - bool handler_found = false; - struct luo_file *luo_file; - - down_read(&luo_register_rwlock); - list_private_for_each_entry(fh, &luo_file_handler_list, list) { - if (!strcmp(fh->compatible, file_ser[i].compatible)) { - if (try_module_get(fh->ops->owner)) - handler_found = true; - break; - } - } - up_read(&luo_register_rwlock); - - if (!handler_found) { - pr_warn("No registered handler for compatible '%.*s'\n", - (int)sizeof(file_ser[i].compatible), - file_ser[i].compatible); - return -ENOENT; - } - - luo_file = kzalloc_obj(*luo_file); - if (!luo_file) { - module_put(fh->ops->owner); - return -ENOMEM; - } - - luo_file->fh = fh; - luo_file->file = NULL; - luo_file->serialized_data = file_ser[i].data; - luo_file->token = file_ser[i].token; - mutex_init(&luo_file->mutex); - list_add_tail(&luo_file->list, &file_set->files_list); + err = luo_file_deserialize_one(file_set, &file_ser[i]); + if (err) + return err; } return 0; -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:10 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:10 +0000 Subject: [PATCH v5 06/13] liveupdate: Extract luo_session_deserialize_one helper In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-7-pasha.tatashin@soleen.com> Extract the logic for deserializing a single session entry into a separate helper function, in preparation for linked-block serialization of sessions. This is pure code movement; no functional changes are intended. 
Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/luo_session.c | 63 +++++++++++++++++++-------------- 1 file changed, 36 insertions(+), 27 deletions(-) diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 3b255ffd1bf1..9f72a8b0a9a8 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -553,6 +553,40 @@ int __init luo_session_setup_incoming(u64 sessions_pa) return 0; } +static int luo_session_deserialize_one(struct luo_session_header *sh, + struct luo_session_ser *ser) +{ + struct luo_session *session; + int err; + + session = luo_session_alloc(ser->name); + if (IS_ERR(session)) { + pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n", + (int)sizeof(ser->name), ser->name, session); + return PTR_ERR(session); + } + + err = luo_session_insert(sh, session); + if (err) { + pr_warn("Failed to insert session [%s] %pe\n", + session->name, ERR_PTR(err)); + luo_session_free(session); + return err; + } + + scoped_guard(mutex, &session->mutex) { + err = luo_file_deserialize(&session->file_set, + &ser->file_set_ser); + } + if (err) { + pr_warn("Failed to deserialize files for session [%s] %pe\n", + session->name, ERR_PTR(err)); + return err; + } + + return 0; +} + int luo_session_deserialize(void) { struct luo_session_header *sh = &luo_session_global.incoming; @@ -584,34 +618,9 @@ int luo_session_deserialize(void) * reliably reset devices and reclaim memory. 
*/ for (int i = 0; i < sh->header_ser->count; i++) { - struct luo_session *session; - - session = luo_session_alloc(sh->ser[i].name); - if (IS_ERR(session)) { - pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n", - (int)sizeof(sh->ser[i].name), - sh->ser[i].name, session); - err = PTR_ERR(session); - goto save_err; - } - - err = luo_session_insert(sh, session); - if (err) { - pr_warn("Failed to insert session [%s] %pe\n", - session->name, ERR_PTR(err)); - luo_session_free(session); - goto save_err; - } - - scoped_guard(mutex, &session->mutex) { - err = luo_file_deserialize(&session->file_set, - &sh->ser[i].file_set_ser); - } - if (err) { - pr_warn("Failed to deserialize files for session [%s] %pe\n", - session->name, ERR_PTR(err)); + err = luo_session_deserialize_one(sh, &sh->ser[i]); + if (err) goto save_err; - } } kho_restore_free(sh->header_ser); -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:11 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:11 +0000 Subject: [PATCH v5 07/13] kho: add support for linked-block serialization In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-8-pasha.tatashin@soleen.com> Introduce a linked-block serialization mechanism for state handover. Previously, LUO used contiguous memory blocks for serializing sessions and files, which imposed limits on the total number of items that could be preserved across a live update. This commit adds the infrastructure for a more flexible, block-based approach where serialized data is stored in a chain of linked blocks. This is a generic KHO serialization block infrastructure that can be used by multiple subsystems. 
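For readers unfamiliar with the layout, the linked-block idea described above can be modelled in user space roughly as follows. This is only a sketch under stated assumptions, not the code from this patch: every demo_* name is hypothetical, a 4096-byte block stands in for KHO_BLOCK_SIZE, calloc() stands in for preserved memory, and an ordinary pointer cast to an integer stands in for the physical address kept in the header's next field. Only the two header fields, the per-block capacity arithmetic and the 10000-block safeguard are taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_BLOCK_SIZE 4096u   /* assumption: stands in for KHO_BLOCK_SIZE */
#define DEMO_MAX_BLOCKS 10000u  /* mirrors the KHO_MAX_BLOCKS safeguard */

/* Same two fields as struct kho_block_header_ser in the patch. */
struct demo_block_header {
	uint64_t next;   /* "address" of the next block, 0 ends the chain */
	uint64_t count;  /* number of entries stored right after the header */
};

/* Number of fixed-size entries that fit in one block after the header. */
static uint64_t demo_count_per_block(size_t entry_size)
{
	return (DEMO_BLOCK_SIZE - sizeof(struct demo_block_header)) / entry_size;
}

/* Allocate one zeroed block (hypothetical stand-in for preserved memory). */
static struct demo_block_header *demo_block_alloc(void)
{
	return calloc(1, DEMO_BLOCK_SIZE);
}

/* Chain @next behind @prev by recording its address in the header. */
static void demo_block_link(struct demo_block_header *prev,
			    struct demo_block_header *next)
{
	prev->next = (uint64_t)(uintptr_t)next;
}

/*
 * Walk the chain starting at @head and sum the per-block entry counts.
 * Returns -1 if the chain exceeds DEMO_MAX_BLOCKS, which guards against
 * a corrupted (for example, cyclic) chain.
 */
static int64_t demo_chain_total(uint64_t head)
{
	uint64_t total = 0;
	unsigned int nblocks = 0;

	while (head) {
		struct demo_block_header *hdr =
			(struct demo_block_header *)(uintptr_t)head;

		if (++nblocks > DEMO_MAX_BLOCKS)
			return -1;

		total += hdr->count;
		head = hdr->next;
	}

	return (int64_t)total;
}
```

Because each block carries its own count and a link to the next block, the total number of entries is bounded by the number of blocks rather than by the size of a single contiguous allocation, which is the limitation the commit message refers to.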
Signed-off-by: Pasha Tatashin --- Documentation/core-api/kho/abi.rst | 5 + Documentation/core-api/kho/index.rst | 11 + MAINTAINERS | 1 + include/linux/kho/abi/block.h | 56 ++++ include/linux/kho_block.h | 79 ++++++ kernel/liveupdate/Makefile | 1 + kernel/liveupdate/kho_block.c | 390 +++++++++++++++++++++++++++ 7 files changed, 543 insertions(+) create mode 100644 include/linux/kho/abi/block.h create mode 100644 include/linux/kho_block.h create mode 100644 kernel/liveupdate/kho_block.c diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst index 799d743105a6..edeb5b311963 100644 --- a/Documentation/core-api/kho/abi.rst +++ b/Documentation/core-api/kho/abi.rst @@ -28,6 +28,11 @@ KHO persistent memory tracker ABI .. kernel-doc:: include/linux/kho/abi/kexec_handover.h :doc: KHO persistent memory tracker +KHO serialization block ABI +=========================== + +.. kernel-doc:: include/linux/kho/abi/block.h + See Also ======== diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst index 0a2dee4f8e7d..320914a42178 100644 --- a/Documentation/core-api/kho/index.rst +++ b/Documentation/core-api/kho/index.rst @@ -83,6 +83,17 @@ Public API .. kernel-doc:: kernel/liveupdate/kexec_handover.c :export: +KHO Serialization Blocks API +============================ + +.. kernel-doc:: kernel/liveupdate/kho_block.c + :doc: KHO Serialization Blocks + +.. kernel-doc:: include/linux/kho_block.h + +.. 
kernel-doc:: kernel/liveupdate/kho_block.c + :internal: + See Also ======== diff --git a/MAINTAINERS b/MAINTAINERS index 9ec290e38b44..920ba7622afa 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14208,6 +14208,7 @@ F: Documentation/admin-guide/mm/kho.rst F: Documentation/core-api/kho/* F: include/linux/kexec_handover.h F: include/linux/kho/ +F: include/linux/kho_block.h F: kernel/liveupdate/kexec_handover* F: lib/test_kho.c F: tools/testing/selftests/kho/ diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h new file mode 100644 index 000000000000..8641c20b379b --- /dev/null +++ b/include/linux/kho/abi/block.h @@ -0,0 +1,56 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +/** + * DOC: KHO Serialization Blocks ABI + * + * Subsystems using the KHO Serialization Blocks framework rely on the stable + * Application Binary Interface defined below to pass serialized state from a + * pre-update kernel to a post-update kernel. + * + * This interface is a contract. Any modification to the structure fields, + * compatible strings, or the layout of the `__packed` serialization + * structures defined here constitutes a breaking change. Such changes require + * incrementing the version number in the `KHO_BLOCK_ABI_COMPATIBLE` string to + * prevent a new kernel from misinterpreting data from an old kernel. + * + * Changes are allowed provided the compatibility version is incremented; + * however, backward/forward compatibility is only guaranteed for kernels + * supporting the same ABI version. + */ + +#ifndef _LINUX_KHO_ABI_BLOCK_H +#define _LINUX_KHO_ABI_BLOCK_H + +#include +#include + +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" + +/** + * KHO_BLOCK_SIZE - The size of each serialization block. + * + * This is defined as PAGE_SIZE. PAGE_SIZE is ABI compliant because live + * update between kernels with different page sizes is not supported by KHO. 
+ */ +#define KHO_BLOCK_SIZE PAGE_SIZE + +/** + * struct kho_block_header_ser - Header for the serialized data block. + * @next: Physical address of the next struct kho_block_header_ser. + * @count: The number of entries that immediately follow this header in the + * memory block. + * + * This structure is located at the beginning of a block of physical memory + * preserved across a kexec. It provides the necessary metadata to interpret + * the array of entries that follow. + */ +struct kho_block_header_ser { + u64 next; + u64 count; +} __packed; + +#endif /* _LINUX_KHO_ABI_BLOCK_H */ diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h new file mode 100644 index 000000000000..505bf78409f2 --- /dev/null +++ b/include/linux/kho_block.h @@ -0,0 +1,79 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +#ifndef _LINUX_KHO_BLOCK_H +#define _LINUX_KHO_BLOCK_H + +#include +#include +#include + +/** + * struct kho_block - Internal representation of a serialization block. + * @list: List head for linking blocks in memory. + * @ser: Pointer to the serialized header in preserved memory. + */ +struct kho_block { + struct list_head list; + struct kho_block_header_ser *ser; +}; + +/** + * struct kho_block_set - A set of blocks that belong to the same object. + * @blocks: The list of serialization blocks (struct kho_block). + * @nblocks: The number of allocated serialization blocks. + * @head_pa: Physical address of the first block header. + * @entry_size: The size of each entry in the blocks. + * @count_per_block: The maximum number of entries each block can hold. + * @incoming: True if this block set was restored from the previous kernel. + */ +struct kho_block_set { + struct list_head blocks; + long nblocks; + u64 head_pa; + size_t entry_size; + u64 count_per_block; + bool incoming; +}; + +/** + * struct kho_block_it - Iterator for serializing entries into blocks. + * @bs: The block set being iterated. 
+ * @block: The current block. + * @i: The current entry index within @block. + */ +struct kho_block_it { + struct kho_block_set *bs; + struct kho_block *block; + u64 i; +}; + +/** + * KHO_BLOCK_SET_INIT - Initialize a static kho_block_set. + * @_name: Name of the kho_block_set variable. + * @_entry_size: The size of each entry in the block set. + */ +#define KHO_BLOCK_SET_INIT(_name, _entry_size) { \ + .blocks = LIST_HEAD_INIT((_name).blocks), \ + .entry_size = _entry_size, \ +} + +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size); + +int kho_block_grow(struct kho_block_set *bs, u64 count); +void kho_block_shrink(struct kho_block_set *bs, u64 count); + +int kho_block_set_restore(struct kho_block_set *bs, u64 head_pa); +void kho_block_set_destroy(struct kho_block_set *bs); +void kho_block_set_clear(struct kho_block_set *bs); + +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs); +void *kho_block_it_reserve_entry(struct kho_block_it *it); +void *kho_block_it_read_entry(struct kho_block_it *it); +void *kho_block_it_prev(struct kho_block_it *it); +void kho_block_it_finalize(struct kho_block_it *it); + +#endif /* _LINUX_KHO_BLOCK_H */ diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile index d2f779cbe279..eec9d3ae07eb 100644 --- a/kernel/liveupdate/Makefile +++ b/kernel/liveupdate/Makefile @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 luo-y := \ + kho_block.o \ luo_core.o \ luo_file.o \ luo_flb.o \ diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c new file mode 100644 index 000000000000..01978c6aea1a --- /dev/null +++ b/kernel/liveupdate/kho_block.c @@ -0,0 +1,390 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +/** + * DOC: KHO Serialization Blocks + * + * KHO provides a mechanism to preserve stateful data across a kexec handover + * by serializing it into memory blocks. 
This file provides the common + * infrastructure for managing these blocks. + * + * Each block consists of a header (struct kho_block_header_ser) followed by an + * array of serialized entries. Multiple blocks are linked together via a + * physical pointer in the header, forming a linked list that can be easily + * traversed in both the current and the next kernel. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include + +/* + * Safeguard limit for the number of serialization blocks. This is used to + * prevent infinite loops and excessive memory allocation in case of memory + * corruption in the preserved state. + */ +#define KHO_MAX_BLOCKS 10000 + +/** + * kho_block_set_init - Initialize a block set. + * @bs: The block set to initialize. + * @entry_size: The size of each entry in the blocks. + */ +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size) +{ + *bs = (struct kho_block_set)KHO_BLOCK_SET_INIT(*bs, entry_size); +} + +static inline u64 kho_block_count_per_block(struct kho_block_set *bs) +{ + if (unlikely(!bs->count_per_block)) { + bs->count_per_block = (KHO_BLOCK_SIZE - + sizeof(struct kho_block_header_ser)) / + bs->entry_size; + WARN_ON_ONCE(!bs->count_per_block); + } + return bs->count_per_block; +} + +/* Free serialized data */ +static void kho_block_free_ser(struct kho_block_set *bs, + struct kho_block_header_ser *ser) +{ + if (bs->incoming) + kho_restore_free(ser); + else + kho_unpreserve_free(ser); +} + +static struct kho_block_header_ser *kho_block_alloc_ser(struct kho_block_set *bs) +{ + WARN_ON_ONCE(bs->incoming); + return kho_alloc_preserve(KHO_BLOCK_SIZE); +} + +static int kho_block_add(struct kho_block_set *bs, + struct kho_block_header_ser *ser) +{ + struct kho_block *block, *last; + + if (bs->nblocks >= KHO_MAX_BLOCKS) + return -ENOSPC; + + block = kzalloc_obj(*block); + if (!block) + return -ENOMEM; + + block->ser = ser; + last = list_last_entry_or_null(&bs->blocks, struct 
kho_block, list); + list_add_tail(&block->list, &bs->blocks); + bs->nblocks++; + + if (last) + last->ser->next = virt_to_phys(ser); + else + bs->head_pa = virt_to_phys(ser); + + return 0; +} + +/** + * kho_block_grow - Create a new block if the current capacity is reached. + * @bs: The block set. + * @count: The current number of entries. + * + * This function handles the dynamic expansion of a block set. It allocates + * and links a new serialization block if the provided entry count matches + * the current total capacity of the set. + * + * Return: 0 on success, or a negative errno on failure. + */ +int kho_block_grow(struct kho_block_set *bs, u64 count) +{ + struct kho_block_header_ser *ser; + int err; + + if (WARN_ON_ONCE(bs->incoming)) + return -EINVAL; + + if (count != bs->nblocks * kho_block_count_per_block(bs)) + return 0; + + ser = kho_block_alloc_ser(bs); + if (IS_ERR(ser)) + return PTR_ERR(ser); + + err = kho_block_add(bs, ser); + if (err) { + kho_block_free_ser(bs, ser); + return err; + } + + return 0; +} + +/** + * kho_block_shrink - Conditionally destroy the last block in a block set. + * @bs: The block set. + * @count: The current number of entries across all blocks. + * + * This function checks if the last block in the set is redundant based on the + * total entry count and the capacity of the preceding blocks. If the entry + * count can be accommodated by the blocks that come before the last one, the + * last block is destroyed and removed from the set. 
+ */ +void kho_block_shrink(struct kho_block_set *bs, u64 count) +{ + struct kho_block *last, *new_last; + + if (count > (bs->nblocks - 1) * kho_block_count_per_block(bs)) + return; + + if (list_empty(&bs->blocks)) + return; + + last = list_last_entry(&bs->blocks, struct kho_block, list); + list_del(&last->list); + bs->nblocks--; + kho_block_free_ser(bs, last->ser); + kfree(last); + + new_last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); + if (new_last) + new_last->ser->next = 0; + else + bs->head_pa = 0; +} + +/* + * kho_cyclic_blocks_check - Check for cycles in a linked list of blocks. + * Uses Floyd's cycle-finding algorithm to ensure sanity of the incoming list. + */ +static bool kho_cyclic_blocks_check(struct kho_block_set *bs) +{ + struct kho_block_header_ser *fast; + struct kho_block_header_ser *slow; + int count = 0; + + fast = phys_to_virt(bs->head_pa); + slow = fast; + + while (fast) { + if (count++ >= KHO_MAX_BLOCKS) { + pr_err("Linked list too long\n"); + return false; + } + + if (!fast->next) + break; + + fast = phys_to_virt(fast->next); + if (!fast->next) + break; + + fast = phys_to_virt(fast->next); + slow = phys_to_virt(slow->next); + + if (slow == fast) { + pr_err("Cyclic list detected\n"); + return false; + } + } + + return true; +} + +/** + * kho_block_set_restore - Restore a block set from a physical address. + * @bs: The block set to restore. + * @head_pa: Physical address of the first block header. + * + * Return: 0 on success, or a negative errno on failure. 
+ */ +int kho_block_set_restore(struct kho_block_set *bs, u64 head_pa) +{ + struct kho_block_header_ser *ser; + u64 next_pa = head_pa; + int err; + + /* Restored block sets use size from the previous kernel */ + bs->incoming = true; + if (!head_pa) + return 0; + + bs->head_pa = head_pa; + if (!kho_cyclic_blocks_check(bs)) { + bs->head_pa = 0; + return -EINVAL; + } + + while (next_pa) { + ser = phys_to_virt(next_pa); + if (!ser->count || ser->count > kho_block_count_per_block(bs)) { + pr_warn("Block contains invalid entry count: %llu\n", + ser->count); + err = -EINVAL; + goto err_destroy; + } + err = kho_block_add(bs, ser); + if (err) + goto err_destroy; + next_pa = ser->next; + } + + return 0; + +err_destroy: + kho_block_set_destroy(bs); + return err; +} + +/** + * kho_block_set_destroy - Destroy all blocks in a block set. + * @bs: The block set. + */ +void kho_block_set_destroy(struct kho_block_set *bs) +{ + struct kho_block *block, *tmp; + u64 head_pa = bs->head_pa; + + list_for_each_entry_safe(block, tmp, &bs->blocks, list) { + list_del(&block->list); + kfree(block); + } + bs->nblocks = 0; + bs->head_pa = 0; + + /* + * bs->blocks may only contain partially restored blocks, but head_pa + * still points to the entire chain. + */ + while (head_pa) { + struct kho_block_header_ser *ser = phys_to_virt(head_pa); + + head_pa = ser->next; + kho_block_free_ser(bs, ser); + } +} + +/** + * kho_block_set_clear - Clear all serialized data in a block set. + * @bs: The block set to clear. + */ +void kho_block_set_clear(struct kho_block_set *bs) +{ + struct kho_block *block; + + list_for_each_entry(block, &bs->blocks, list) { + block->ser->count = 0; + memset(block->ser + 1, 0, KHO_BLOCK_SIZE - sizeof(*block->ser)); + } +} + +/** + * kho_block_it_init - Initialize a block set iterator. + * @it: The iterator to initialize. + * @bs: The block set to iterate over. 
+ */ +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs) +{ + it->bs = bs; + it->block = list_first_entry_or_null(&bs->blocks, struct kho_block, list); + it->i = 0; +} + +/** + * kho_block_it_reserve_entry - Reserve and return the next available slot for writing. + * @it: The block iterator. + * + * This function is used during state serialization to add a new entry. + * It reserves a slot in the current block, advancing the internal index. + * If the current block is full, it automatically moves to the next block + * in the set. + * + * Return: A pointer to the reserved entry slot, or NULL if the block set's + * capacity is fully exhausted. + */ +void *kho_block_it_reserve_entry(struct kho_block_it *it) +{ + if (!it->block) + return NULL; + + if (it->i == kho_block_count_per_block(it->bs)) { + it->block->ser->count = it->i; + if (list_is_last(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_next_entry(it->block, list); + it->i = 0; + } + + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); +} + +/** + * kho_block_it_read_entry - Read the next serialized entry from the block set. + * @it: The block iterator. + * + * This function is used during state deserialization. It iterates through + * entries that were previously written, respecting the actual count stored + * in each block's header. + * + * Return: A pointer to the next serialized entry, or NULL if all serialized + * entries have been read. + */ +void *kho_block_it_read_entry(struct kho_block_it *it) +{ + if (!it->block) + return NULL; + + if (it->i == it->block->ser->count) { + if (list_is_last(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_next_entry(it->block, list); + it->i = 0; + } + + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); +} + +/** + * kho_block_it_prev - Return the previous entry slot in the block set. + * @it: The block iterator. 
+ * + * If the current index is at the start of a block, it automatically moves to + * the end of the previous block. + * + * Return: A pointer to the previous entry slot, or NULL if at the very + * beginning of the block set. + */ +void *kho_block_it_prev(struct kho_block_it *it) +{ + if (!it->block) + return NULL; + + if (it->i == 0) { + if (list_is_first(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_prev_entry(it->block, list); + it->i = kho_block_count_per_block(it->bs); + } + + return (void *)(it->block->ser + 1) + (--it->i * it->bs->entry_size); +} + +/** + * kho_block_it_finalize - Finalize the current block by setting its entry count. + * @it: The block iterator. + */ +void kho_block_it_finalize(struct kho_block_it *it) +{ + if (it->block) + it->block->ser->count = it->i; +} -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:12 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:12 +0000 Subject: [PATCH v5 08/13] liveupdate: defer session block allocation and PA setting In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-9-pasha.tatashin@soleen.com> Currently, luo_session_setup_outgoing() allocates the session block and sets its physical address in the header immediately. With upcoming dynamic block-based session management, this makes the first block different from the rest. Move the allocation to where it is first needed. 
Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- include/linux/kho_block.h | 22 +++++++++++ kernel/liveupdate/luo_core.c | 4 +- kernel/liveupdate/luo_internal.h | 2 +- kernel/liveupdate/luo_session.c | 68 ++++++++++++++++++++------------ 4 files changed, 67 insertions(+), 29 deletions(-) diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h index 505bf78409f2..0a8cda2cbfb5 100644 --- a/include/linux/kho_block.h +++ b/include/linux/kho_block.h @@ -70,6 +70,28 @@ int kho_block_set_restore(struct kho_block_set *bs, u64 head_pa); void kho_block_set_destroy(struct kho_block_set *bs); void kho_block_set_clear(struct kho_block_set *bs); +/** + * kho_block_set_head_pa - Get the physical address of the first block header. + * @bs: The block set. + * + * Return: The physical address of the first block header, or 0 if empty. + */ +static inline u64 kho_block_set_head_pa(struct kho_block_set *bs) +{ + return bs->head_pa; +} + +/** + * kho_block_set_is_empty - Check if the block set has no allocated blocks. + * @bs: The block set. + * + * Return: True if there are no blocks in the set, false otherwise. 
+ */ +static inline bool kho_block_set_is_empty(struct kho_block_set *bs) +{ + return list_empty(&bs->blocks); +} + void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs); void *kho_block_it_reserve_entry(struct kho_block_it *it); void *kho_block_it_read_entry(struct kho_block_it *it); diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 69b00e7d0f8f..1b2bda22902d 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -165,9 +165,7 @@ static int __init luo_state_setup(void) strscpy(luo_ser->compatible, LUO_ABI_COMPATIBLE, sizeof(luo_ser->compatible)); luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; - err = luo_session_setup_outgoing(&luo_ser->sessions_pa); - if (err) - goto exit_free_luo_ser; + luo_session_setup_outgoing(&luo_ser->sessions_pa); err = luo_flb_setup_outgoing(&luo_ser->flbs_pa); if (err) diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index fe22086bfbeb..ee18f9a11b91 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -79,7 +79,7 @@ extern struct rw_semaphore luo_register_rwlock; int luo_session_create(const char *name, struct file **filep); int luo_session_retrieve(const char *name, struct file **filep); -int __init luo_session_setup_outgoing(u64 *sessions_pa); +void __init luo_session_setup_outgoing(u64 *sessions_pa); int __init luo_session_setup_incoming(u64 sessions_pa); int luo_session_serialize(void); int luo_session_deserialize(void); diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 9f72a8b0a9a8..43342916d314 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -108,15 +108,16 @@ static DECLARE_RWSEM(luo_session_serialize_rwsem); /** * struct luo_session_header - Header struct for managing LUO sessions. - * @count: The number of sessions currently tracked in the @list. 
- * @list: The head of the linked list of `struct luo_session` instances. - * @rwsem: A read-write semaphore providing synchronized access to the - * session list and other fields in this structure. - * @header_ser: The header data of serialization array. - * @ser: The serialized session data (an array of - * `struct luo_session_ser`). - * @active: Set to true when first initialized. If previous kernel did not - * send session data, active stays false for incoming. + * @count: The number of sessions currently tracked in the @list. + * @list: The head of the linked list of `struct luo_session` instances. + * @rwsem: A read-write semaphore providing synchronized access to the + * session list and other fields in this structure. + * @header_ser: The header data of serialization array. + * @ser: The serialized session data (an array of + * `struct luo_session_ser`). + * @sessions_pa: Points to the location of sessions_pa within struct luo_ser. + * @active: Set to true when first initialized. If previous kernel did not + * send session data, active stays false for incoming. 
*/ struct luo_session_header { long count; @@ -124,6 +125,7 @@ struct luo_session_header { struct rw_semaphore rwsem; struct luo_session_header_ser *header_ser; struct luo_session_ser *ser; + u64 *sessions_pa; bool active; }; @@ -171,10 +173,30 @@ static void luo_session_free(struct luo_session *session) kfree(session); } +static int luo_session_grow_ser(struct luo_session_header *sh) +{ + struct luo_session_header_ser *header_ser; + + if (sh->count == LUO_SESSION_MAX) + return -ENOMEM; + + if (sh->header_ser) + return 0; + + header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); + if (IS_ERR(header_ser)) + return PTR_ERR(header_ser); + + sh->header_ser = header_ser; + sh->ser = (void *)(header_ser + 1); + return 0; +} + static int luo_session_insert(struct luo_session_header *sh, struct luo_session *session) { struct luo_session *it; + int err; guard(rwsem_write)(&sh->rwsem); @@ -183,8 +205,9 @@ static int luo_session_insert(struct luo_session_header *sh, * for new session. */ if (sh == &luo_session_global.outgoing) { - if (sh->count == LUO_SESSION_MAX) - return -ENOMEM; + err = luo_session_grow_ser(sh); + if (err) + return err; } /* @@ -522,21 +545,10 @@ int luo_session_retrieve(const char *name, struct file **filep) return err; } -int __init luo_session_setup_outgoing(u64 *sessions_pa) +void __init luo_session_setup_outgoing(u64 *sessions_pa) { - struct luo_session_header_ser *header_ser; - - header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); - if (IS_ERR(header_ser)) - return PTR_ERR(header_ser); - - *sessions_pa = virt_to_phys(header_ser); - - luo_session_global.outgoing.header_ser = header_ser; - luo_session_global.outgoing.ser = (void *)(header_ser + 1); + luo_session_global.outgoing.sessions_pa = sessions_pa; luo_session_global.outgoing.active = true; - - return 0; } int __init luo_session_setup_incoming(u64 sessions_pa) @@ -642,6 +654,8 @@ int luo_session_serialize(void) down_write(&luo_session_serialize_rwsem); 
down_write(&sh->rwsem); + *sh->sessions_pa = 0; + list_for_each_entry(session, &sh->list, list) { err = luo_session_freeze_one(session, &sh->ser[i]); if (err) @@ -651,7 +665,11 @@ int luo_session_serialize(void) sizeof(sh->ser[i].name)); i++; } - sh->header_ser->count = sh->count; + + if (sh->header_ser && sh->count > 0) { + sh->header_ser->count = sh->count; + *sh->sessions_pa = virt_to_phys(sh->header_ser); + } up_write(&sh->rwsem); return 0; -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:13 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:13 +0000 Subject: [PATCH v5 09/13] liveupdate: Remove limit on the number of sessions In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-10-pasha.tatashin@soleen.com> Currently, the number of LUO sessions is limited by a fixed number of pre-allocated pages for serialization (16 pages, allowing for ~819 sessions). This limitation is problematic if LUO is used to support things such as systemd file descriptor store, and would be used not just as VM memory but to save other states on the machine. Remove this limit by transitioning to a linked-block approach for session metadata serialization. Instead of a single contiguous block, session metadata is now stored in a chain of 16-page blocks. Each block starts with a header containing the physical address of the next block and the number of session entries in the current block. 
Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 24 +------ kernel/liveupdate/luo_session.c | 115 +++++++++++++++----------------- 2 files changed, 58 insertions(+), 81 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 9a4fe491812b..79758d92ed5f 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -33,11 +33,6 @@ * It includes the compatibility string, the liveupdate-number, and pointers * to sessions and FLBs. * - * - struct luo_session_header_ser: - * Header for the session array. Contains the total page count of the - * preserved memory block and the number of `struct luo_session_ser` - * entries that follow. - * * - struct luo_session_ser: * Metadata for a single session, including its name and a physical pointer * to another preserved memory block containing an array of @@ -63,13 +58,15 @@ #define _LINUX_KHO_ABI_LUO_H #include +#include #include /* * The LUO state is registered under this KHO entry name. */ #define LUO_KHO_ENTRY_NAME "LUO" -#define LUO_ABI_COMPATIBLE "luo-v3" +#define LUO_COMPAT_BASE "luo-v3" +#define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE #define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) /** @@ -118,21 +115,6 @@ struct luo_file_set_ser { u64 count; } __packed; -/** - * struct luo_session_header_ser - Header for the serialized session data block. - * @count: The number of `struct luo_session_ser` entries that immediately - * follow this header in the memory block. - * - * This structure is located at the beginning of a contiguous block of - * physical memory preserved across the kexec. It provides the necessary - * metadata to interpret the array of session entries that follow. - * - * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. 
- */ -struct luo_session_header_ser { - u64 count; -} __packed; - /** * struct luo_session_ser - Represents the serialized metadata for a LUO session. * @name: The unique name of the session, provided by the userspace at diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 43342916d314..f6eeb965b3c1 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -24,9 +24,10 @@ * ioctls on /dev/liveupdate. * * - Serialization: Session metadata is preserved using the KHO framework. When - * a live update is triggered via kexec, an array of `struct luo_session_ser` - * is populated and placed in a preserved memory region. The physical address - * of this array is stored in the centralized `struct luo_ser` structure. + * a live update is triggered via kexec, session metadata is serialized into + * a chain of linked-blocks and placed in a preserved memory region. The + * physical address of the first block header is stored in the centralized + * `struct luo_ser` structure. * * Session Lifecycle: * @@ -89,6 +90,7 @@ #include #include #include +#include #include #include #include @@ -98,23 +100,14 @@ #include #include "luo_internal.h" -/* 16 4K pages, give space for 744 sessions */ -#define LUO_SESSION_PGCNT 16ul -#define LUO_SESSION_MAX (((LUO_SESSION_PGCNT << PAGE_SHIFT) - \ - sizeof(struct luo_session_header_ser)) / \ - sizeof(struct luo_session_ser)) - static DECLARE_RWSEM(luo_session_serialize_rwsem); - /** * struct luo_session_header - Header struct for managing LUO sessions. * @count: The number of sessions currently tracked in the @list. * @list: The head of the linked list of `struct luo_session` instances. * @rwsem: A read-write semaphore providing synchronized access to the * session list and other fields in this structure. - * @header_ser: The header data of serialization array. - * @ser: The serialized session data (an array of - * `struct luo_session_ser`). 
+ * @block_set: The set of serialization blocks. * @sessions_pa: Points to the location of sessions_pa within struct luo_ser. * @active: Set to true when first initialized. If previous kernel did not * send session data, active stays false for incoming. @@ -123,8 +116,7 @@ struct luo_session_header { long count; struct list_head list; struct rw_semaphore rwsem; - struct luo_session_header_ser *header_ser; - struct luo_session_ser *ser; + struct kho_block_set block_set; u64 *sessions_pa; bool active; }; @@ -143,10 +135,14 @@ static struct luo_session_global luo_session_global = { .incoming = { .list = LIST_HEAD_INIT(luo_session_global.incoming.list), .rwsem = __RWSEM_INITIALIZER(luo_session_global.incoming.rwsem), + .block_set = KHO_BLOCK_SET_INIT(luo_session_global.incoming.block_set, + sizeof(struct luo_session_ser)), }, .outgoing = { .list = LIST_HEAD_INIT(luo_session_global.outgoing.list), .rwsem = __RWSEM_INITIALIZER(luo_session_global.outgoing.rwsem), + .block_set = KHO_BLOCK_SET_INIT(luo_session_global.outgoing.block_set, + sizeof(struct luo_session_ser)), }, }; @@ -173,25 +169,6 @@ static void luo_session_free(struct luo_session *session) kfree(session); } -static int luo_session_grow_ser(struct luo_session_header *sh) -{ - struct luo_session_header_ser *header_ser; - - if (sh->count == LUO_SESSION_MAX) - return -ENOMEM; - - if (sh->header_ser) - return 0; - - header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); - if (IS_ERR(header_ser)) - return PTR_ERR(header_ser); - - sh->header_ser = header_ser; - sh->ser = (void *)(header_ser + 1); - return 0; -} - static int luo_session_insert(struct luo_session_header *sh, struct luo_session *session) { @@ -205,7 +182,7 @@ static int luo_session_insert(struct luo_session_header *sh, * for new session. 
*/ if (sh == &luo_session_global.outgoing) { - err = luo_session_grow_ser(sh); + err = kho_block_grow(&sh->block_set, sh->count); if (err) return err; } @@ -232,6 +209,8 @@ static void luo_session_remove(struct luo_session_header *sh, guard(rwsem_write)(&sh->rwsem); list_del(&session->list); sh->count--; + if (sh == &luo_session_global.outgoing) + kho_block_shrink(&sh->block_set, sh->count); } static int luo_session_finish_one(struct luo_session *session) @@ -553,15 +532,17 @@ void __init luo_session_setup_outgoing(u64 *sessions_pa) int __init luo_session_setup_incoming(u64 sessions_pa) { - struct luo_session_header_ser *header_ser; + struct luo_session_header *sh = &luo_session_global.incoming; + int err; - if (sessions_pa) { - header_ser = phys_to_virt(sessions_pa); - luo_session_global.incoming.header_ser = header_ser; - luo_session_global.incoming.ser = (void *)(header_ser + 1); - luo_session_global.incoming.active = true; - } + if (!sessions_pa) + return 0; + err = kho_block_set_restore(&sh->block_set, sessions_pa); + if (err) + return err; + + sh->active = true; return 0; } @@ -603,6 +584,8 @@ int luo_session_deserialize(void) { struct luo_session_header *sh = &luo_session_global.incoming; static bool is_deserialized; + struct luo_session_ser *ser; + struct kho_block_it it; static int saved_err; int err; @@ -629,18 +612,19 @@ int luo_session_deserialize(void) * userspace to detect the failure and trigger a reboot, which will * reliably reset devices and reclaim memory. 
*/ - for (int i = 0; i < sh->header_ser->count; i++) { - err = luo_session_deserialize_one(sh, &sh->ser[i]); + kho_block_it_init(&it, &sh->block_set); + while ((ser = kho_block_it_read_entry(&it))) { + err = luo_session_deserialize_one(sh, ser); if (err) goto save_err; } - kho_restore_free(sh->header_ser); - sh->header_ser = NULL; - sh->ser = NULL; + kho_block_set_destroy(&sh->block_set); return 0; + save_err: + kho_block_set_destroy(&sh->block_set); saved_err = err; return err; } @@ -649,36 +633,47 @@ int luo_session_serialize(void) { struct luo_session_header *sh = &luo_session_global.outgoing; struct luo_session *session; - int i = 0; + struct kho_block_it it; int err; down_write(&luo_session_serialize_rwsem); down_write(&sh->rwsem); *sh->sessions_pa = 0; + kho_block_it_init(&it, &sh->block_set); + list_for_each_entry(session, &sh->list, list) { - err = luo_session_freeze_one(session, &sh->ser[i]); - if (err) + struct luo_session_ser *ser = kho_block_it_reserve_entry(&it); + + /* This should not fail normally as blocks were pre-allocated */ + if (WARN_ON_ONCE(!ser)) { + err = -ENOSPC; goto err_undo; + } - strscpy(sh->ser[i].name, session->name, - sizeof(sh->ser[i].name)); - i++; - } + err = luo_session_freeze_one(session, ser); + if (err) { + kho_block_it_prev(&it); + goto err_undo; + } - if (sh->header_ser && sh->count > 0) { - sh->header_ser->count = sh->count; - *sh->sessions_pa = virt_to_phys(sh->header_ser); + strscpy(ser->name, session->name, sizeof(ser->name)); } + + kho_block_it_finalize(&it); + + if (sh->count > 0) + *sh->sessions_pa = kho_block_set_head_pa(&sh->block_set); up_write(&sh->rwsem); return 0; err_undo: list_for_each_entry_continue_reverse(session, &sh->list, list) { - i--; - luo_session_unfreeze_one(session, &sh->ser[i]); - memset(sh->ser[i].name, 0, sizeof(sh->ser[i].name)); + struct luo_session_ser *ser = kho_block_it_prev(&it); + + luo_session_unfreeze_one(session, ser); + memset(ser->name, 0, sizeof(ser->name)); } up_write(&sh->rwsem); 
up_write(&luo_session_serialize_rwsem); -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:15 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:15 +0000 Subject: [PATCH v5 11/13] selftests/liveupdate: Test session and file limit removal In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-12-pasha.tatashin@soleen.com> With the removal of static limits on the number of sessions and files per session, the orchestrator now uses dynamic allocation. Add new test cases to verify that the system can handle a large number of sessions and files. These tests ensure that the dynamic block allocation and reuse logic for session metadata and outgoing files work correctly beyond the previous static limits. Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- .../testing/selftests/liveupdate/liveupdate.c | 75 +++++++++++++++++++ .../selftests/liveupdate/luo_test_utils.c | 24 ++++++ .../selftests/liveupdate/luo_test_utils.h | 2 + 3 files changed, 101 insertions(+) diff --git a/tools/testing/selftests/liveupdate/liveupdate.c b/tools/testing/selftests/liveupdate/liveupdate.c index c7d94b9181e1..502fb3567e38 100644 --- a/tools/testing/selftests/liveupdate/liveupdate.c +++ b/tools/testing/selftests/liveupdate/liveupdate.c @@ -26,6 +26,7 @@ #include +#include "luo_test_utils.h" #include "../kselftest.h" #include "../kselftest_harness.h" @@ -499,4 +500,78 @@ TEST_F(liveupdate_device, get_session_name_max_length) ASSERT_EQ(close(session_fd), 0); } +/* + * Test Case: Manage Many Sessions + * + * Verifies that a large number of sessions can be created and then + * destroyed during normal system operation. This specifically tests the + * dynamic block allocation and reuse logic for session metadata management + * without preserving any files. 
+ */ +TEST_F(liveupdate_device, preserve_many_sessions) +{ +#define MANY_SESSIONS 2000 + int session_fds[MANY_SESSIONS]; + int ret, i; + + self->fd1 = open(LIVEUPDATE_DEV, O_RDWR); + if (self->fd1 < 0 && errno == ENOENT) + SKIP(return, "%s does not exist", LIVEUPDATE_DEV); + ASSERT_GE(self->fd1, 0); + + ret = luo_ensure_nofile_limit(MANY_SESSIONS); + if (ret == -EPERM) + SKIP(return, "Insufficient privileges to set RLIMIT_NOFILE"); + ASSERT_EQ(ret, 0); + + for (i = 0; i < MANY_SESSIONS; i++) { + char name[64]; + + snprintf(name, sizeof(name), "many-session-%d", i); + session_fds[i] = create_session(self->fd1, name); + ASSERT_GE(session_fds[i], 0); + } + + for (i = 0; i < MANY_SESSIONS; i++) + ASSERT_EQ(close(session_fds[i]), 0); +} + +/* + * Test Case: Preserve Many Files + * + * Verifies that a large number of files can be preserved in a single session + * and then destroyed during normal system operation. This tests the dynamic + * block allocation and management for outgoing files. + */ +TEST_F(liveupdate_device, preserve_many_files) +{ +#define MANY_FILES 500 + int mem_fds[MANY_FILES]; + int session_fd, ret, i; + + self->fd1 = open(LIVEUPDATE_DEV, O_RDWR); + if (self->fd1 < 0 && errno == ENOENT) + SKIP(return, "%s does not exist", LIVEUPDATE_DEV); + ASSERT_GE(self->fd1, 0); + + session_fd = create_session(self->fd1, "many-files-test"); + ASSERT_GE(session_fd, 0); + + ret = luo_ensure_nofile_limit(MANY_FILES + 10); + if (ret == -EPERM) + SKIP(return, "Insufficient privileges to set RLIMIT_NOFILE"); + ASSERT_EQ(ret, 0); + + for (i = 0; i < MANY_FILES; i++) { + mem_fds[i] = memfd_create("test-memfd", 0); + ASSERT_GE(mem_fds[i], 0); + ASSERT_EQ(preserve_fd(session_fd, mem_fds[i], i), 0); + } + + for (i = 0; i < MANY_FILES; i++) + ASSERT_EQ(close(mem_fds[i]), 0); + + ASSERT_EQ(close(session_fd), 0); +} + TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/liveupdate/luo_test_utils.c b/tools/testing/selftests/liveupdate/luo_test_utils.c index 
3c8721c505df..333a3530051b 100644 --- a/tools/testing/selftests/liveupdate/luo_test_utils.c +++ b/tools/testing/selftests/liveupdate/luo_test_utils.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -28,6 +29,29 @@ int luo_open_device(void) return open(LUO_DEVICE, O_RDWR); } +int luo_ensure_nofile_limit(long min_limit) +{ + struct rlimit hl; + + /* Allow extra files to be used by the test itself */ + min_limit += 32; + + if (getrlimit(RLIMIT_NOFILE, &hl) < 0) + return -errno; + + if (hl.rlim_cur >= min_limit) + return 0; + + hl.rlim_cur = min_limit; + if (hl.rlim_cur > hl.rlim_max) + hl.rlim_max = hl.rlim_cur; + + if (setrlimit(RLIMIT_NOFILE, &hl) < 0) + return -errno; + + return 0; +} + int luo_create_session(int luo_fd, const char *name) { struct liveupdate_ioctl_create_session arg = { .size = sizeof(arg) }; diff --git a/tools/testing/selftests/liveupdate/luo_test_utils.h b/tools/testing/selftests/liveupdate/luo_test_utils.h index 90099bf49577..6a0d85386613 100644 --- a/tools/testing/selftests/liveupdate/luo_test_utils.h +++ b/tools/testing/selftests/liveupdate/luo_test_utils.h @@ -26,6 +26,8 @@ int luo_create_session(int luo_fd, const char *name); int luo_retrieve_session(int luo_fd, const char *name); int luo_session_finish(int session_fd); +int luo_ensure_nofile_limit(long min_limit); + int create_and_preserve_memfd(int session_fd, int token, const char *data); int restore_and_verify_memfd(int session_fd, int token, const char *expected_data); -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:14 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:14 +0000 Subject: [PATCH v5 10/13] liveupdate: Remove limit on the number of files per session In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-11-pasha.tatashin@soleen.com> To remove the fixed limit on the number of preserved
files per session, transition the file metadata serialization from a single contiguous memory block to a chain of linked blocks. Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 13 +-- kernel/liveupdate/luo_file.c | 139 +++++++++++++++---------------- kernel/liveupdate/luo_internal.h | 6 +- 3 files changed, 75 insertions(+), 83 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 79758d92ed5f..16df550ef143 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -35,8 +35,8 @@ * * - struct luo_session_ser: * Metadata for a single session, including its name and a physical pointer - * to another preserved memory block containing an array of - * `struct luo_file_ser` for all files in that session. + * to the first `struct kho_block_header_ser` for all files in that session. + * Multiple blocks are linked via the `next` field in the header. * * - struct luo_file_ser: * Metadata for a single preserved file. Contains the `compatible` string to @@ -65,7 +65,7 @@ * The LUO state is registered under this KHO entry name. */ #define LUO_KHO_ENTRY_NAME "LUO" -#define LUO_COMPAT_BASE "luo-v3" +#define LUO_COMPAT_BASE "luo-v4" #define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE #define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) @@ -103,9 +103,10 @@ struct luo_file_ser { /** * struct luo_file_set_ser - Represents the serialized metadata for file set - * @files: The physical address of a contiguous memory block that holds - * the serialized state of files (array of luo_file_ser) in this file - * set. + * @files: The physical address of the first `struct kho_block_header_ser`. + * This structure is the header for a block of memory containing + * an array of `struct luo_file_ser` entries. Multiple blocks are + * linked via the `next` field in the header. 
* @count: The total number of files that were part of this session during * serialization. Used for iteration and validation during * restoration. diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 9eec07a9e9fc..695e99aaba20 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -118,11 +118,6 @@ static LIST_HEAD(luo_file_handler_list); /* Keep track of files being preserved by LUO */ static DEFINE_XARRAY(luo_preserved_files); -/* 2 4K pages, give space for 128 files per file_set */ -#define LUO_FILE_PGCNT 2ul -#define LUO_FILE_MAX \ - ((LUO_FILE_PGCNT << PAGE_SHIFT) / sizeof(struct luo_file_ser)) - /** * struct luo_file - Represents a single preserved file instance. * @fh: Pointer to the &struct liveupdate_file_handler that manages @@ -174,39 +169,6 @@ struct luo_file { u64 token; }; -static int luo_alloc_files_mem(struct luo_file_set *file_set) -{ - size_t size; - void *mem; - - if (file_set->files) - return 0; - - WARN_ON_ONCE(file_set->count); - - size = LUO_FILE_PGCNT << PAGE_SHIFT; - mem = kho_alloc_preserve(size); - if (IS_ERR(mem)) - return PTR_ERR(mem); - - file_set->files = mem; - - return 0; -} - -static void luo_free_files_mem(struct luo_file_set *file_set) -{ - /* If file_set has files, no need to free preservation memory */ - if (file_set->count) - return; - - if (!file_set->files) - return; - - kho_unpreserve_free(file_set->files); - file_set->files = NULL; -} - static unsigned long luo_get_id(struct liveupdate_file_handler *fh, struct file *file) { @@ -276,16 +238,15 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) if (luo_token_is_used(file_set, token)) return -EEXIST; - if (file_set->count == LUO_FILE_MAX) - return -ENOSPC; + err = kho_block_grow(&file_set->block_set, file_set->count); + if (err) + return err; file = fget(fd); - if (!file) - return -EBADF; - - err = luo_alloc_files_mem(file_set); - if (err) - goto err_fput; + if (!file) { + err = -EBADF; + goto 
err_shrink; + } err = -ENOENT; down_read(&luo_register_rwlock); @@ -300,7 +261,7 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) /* err is still -ENOENT if no handler was found */ if (err) - goto err_free_files_mem; + goto err_fput; err = xa_insert(&luo_preserved_files, luo_get_id(fh, file), file, GFP_KERNEL); @@ -343,10 +304,10 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) xa_erase(&luo_preserved_files, luo_get_id(fh, file)); err_module_put: module_put(fh->ops->owner); -err_free_files_mem: - luo_free_files_mem(file_set); err_fput: fput(file); +err_shrink: + kho_block_shrink(&file_set->block_set, file_set->count); return err; } @@ -392,13 +353,14 @@ void luo_file_unpreserve_files(struct luo_file_set *file_set) list_del(&luo_file->list); file_set->count--; + kho_block_shrink(&file_set->block_set, file_set->count); fput(luo_file->file); mutex_destroy(&luo_file->mutex); kfree(luo_file); } - luo_free_files_mem(file_set); + kho_block_set_destroy(&file_set->block_set); } static int luo_file_freeze_one(struct luo_file_set *file_set, @@ -454,7 +416,7 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, luo_file_unfreeze_one(file_set, luo_file); } - memset(file_set->files, 0, LUO_FILE_PGCNT << PAGE_SHIFT); + kho_block_set_clear(&file_set->block_set); } /** @@ -493,19 +455,24 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, int luo_file_freeze(struct luo_file_set *file_set, struct luo_file_set_ser *file_set_ser) { - struct luo_file_ser *file_ser = file_set->files; struct luo_file *luo_file; + struct kho_block_it it; int err; - int i; if (!file_set->count) return 0; - if (WARN_ON(!file_ser)) - return -EINVAL; + kho_block_it_init(&it, &file_set->block_set); - i = 0; list_for_each_entry(luo_file, &file_set->files_list, list) { + struct luo_file_ser *file_ser = kho_block_it_reserve_entry(&it); + + /* This should not fail normally as blocks were pre-allocated */ + if (WARN_ON_ONCE(!file_ser)) { 
+ err = -ENOSPC; + goto err_unfreeze; + } + err = luo_file_freeze_one(file_set, luo_file); if (err < 0) { pr_warn("Freeze failed for token[%#0llx] handler[%s] err[%pe]\n", @@ -514,16 +481,15 @@ int luo_file_freeze(struct luo_file_set *file_set, goto err_unfreeze; } - strscpy(file_ser[i].compatible, luo_file->fh->compatible, - sizeof(file_ser[i].compatible)); - file_ser[i].data = luo_file->serialized_data; - file_ser[i].token = luo_file->token; - i++; + strscpy(file_ser->compatible, luo_file->fh->compatible, + sizeof(file_ser->compatible)); + file_ser->data = luo_file->serialized_data; + file_ser->token = luo_file->token; } + kho_block_it_finalize(&it); file_set_ser->count = file_set->count; - if (file_set->files) - file_set_ser->files = virt_to_phys(file_set->files); + file_set_ser->files = kho_block_set_head_pa(&file_set->block_set); return 0; @@ -741,14 +707,12 @@ int luo_file_finish(struct luo_file_set *file_set) module_put(luo_file->fh->ops->owner); list_del(&luo_file->list); file_set->count--; + kho_block_shrink(&file_set->block_set, file_set->count); mutex_destroy(&luo_file->mutex); kfree(luo_file); } - if (file_set->files) { - kho_restore_free(file_set->files); - file_set->files = NULL; - } + kho_block_set_destroy(&file_set->block_set); return 0; } @@ -822,16 +786,18 @@ int luo_file_deserialize(struct luo_file_set *file_set, struct luo_file_set_ser *file_set_ser) { struct luo_file_ser *file_ser; + struct kho_block_it it; int err; - u64 i; if (!file_set_ser->files) { WARN_ON(file_set_ser->count); return 0; } - file_set->count = file_set_ser->count; - file_set->files = phys_to_virt(file_set_ser->files); + file_set->count = 0; + err = kho_block_set_restore(&file_set->block_set, file_set_ser->files); + if (err) + return err; /* * Note on error handling: @@ -848,25 +814,50 @@ int luo_file_deserialize(struct luo_file_set *file_set, * userspace to detect the failure and trigger a reboot, which will * reliably reset devices and reclaim memory. 
*/ - file_ser = file_set->files; - for (i = 0; i < file_set->count; i++) { - err = luo_file_deserialize_one(file_set, &file_ser[i]); + kho_block_it_init(&it, &file_set->block_set); + while ((file_ser = kho_block_it_read_entry(&it))) { + err = luo_file_deserialize_one(file_set, file_ser); if (err) - return err; + goto err_destroy_blocks; + file_set->count++; + } + + if (file_set->count != file_set_ser->count) { + pr_warn("File count mismatch: expected %llu, found %llu\n", + file_set_ser->count, file_set->count); + err = -EINVAL; + goto err_destroy_blocks; } return 0; + +err_destroy_blocks: + while (!list_empty(&file_set->files_list)) { + struct luo_file *luo_file; + + luo_file = list_first_entry(&file_set->files_list, + struct luo_file, list); + list_del(&luo_file->list); + module_put(luo_file->fh->ops->owner); + mutex_destroy(&luo_file->mutex); + kfree(luo_file); + } + file_set->count = 0; + kho_block_set_destroy(&file_set->block_set); + return err; } void luo_file_set_init(struct luo_file_set *file_set) { INIT_LIST_HEAD(&file_set->files_list); + kho_block_set_init(&file_set->block_set, sizeof(struct luo_file_ser)); } void luo_file_set_destroy(struct luo_file_set *file_set) { WARN_ON(file_set->count); WARN_ON(!list_empty(&file_set->files_list)); + WARN_ON(!kho_block_set_is_empty(&file_set->block_set)); } /** diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index ee18f9a11b91..64879ffe7378 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -10,6 +10,7 @@ #include #include +#include struct luo_ucmd { void __user *ubuffer; @@ -44,14 +45,13 @@ static inline int luo_ucmd_respond(struct luo_ucmd *ucmd, * struct luo_file_set - A set of files that belong to the same sessions. * @files_list: An ordered list of files associated with this session, it is * ordered by preservation time. - * @files: The physically contiguous memory block that holds the serialized - * state of files. 
+ * @block_set: The set of serialization blocks. * @count: A counter tracking the number of files currently stored in the * @files_list for this session. */ struct luo_file_set { struct list_head files_list; - struct luo_file_ser *files; + struct kho_block_set block_set; u64 count; }; -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:16 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:16 +0000 Subject: [PATCH v5 12/13] selftests/liveupdate: Add stress-sessions kexec test In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-13-pasha.tatashin@soleen.com> Add a new test that creates 2000 LUO sessions before a kexec reboot and verifies their presence after the reboot. This ensures that the linked-block serialization mechanism works correctly for a large number of sessions. Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- tools/testing/selftests/liveupdate/Makefile | 1 + .../liveupdate/luo_stress_sessions.c | 102 ++++++++++++++++++ 2 files changed, 103 insertions(+) create mode 100644 tools/testing/selftests/liveupdate/luo_stress_sessions.c diff --git a/tools/testing/selftests/liveupdate/Makefile b/tools/testing/selftests/liveupdate/Makefile index 080754787ede..ed7534468386 100644 --- a/tools/testing/selftests/liveupdate/Makefile +++ b/tools/testing/selftests/liveupdate/Makefile @@ -6,6 +6,7 @@ TEST_GEN_PROGS += liveupdate TEST_GEN_PROGS_EXTENDED += luo_kexec_simple TEST_GEN_PROGS_EXTENDED += luo_multi_session +TEST_GEN_PROGS_EXTENDED += luo_stress_sessions TEST_FILES += do_kexec.sh diff --git a/tools/testing/selftests/liveupdate/luo_stress_sessions.c b/tools/testing/selftests/liveupdate/luo_stress_sessions.c new file mode 100644 index 000000000000..f201b1839d1d --- /dev/null +++ b/tools/testing/selftests/liveupdate/luo_stress_sessions.c @@ -0,0 +1,102 @@ +// 
SPDX-License-Identifier: GPL-2.0-only + +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + * + * Validate that LUO can handle a large number of sessions across a kexec + * reboot. + */ + +#include +#include +#include "luo_test_utils.h" + +#define NUM_SESSIONS 2000 +#define STATE_SESSION_NAME "kexec_many_state" +#define STATE_MEMFD_TOKEN 999 + +/* Stage 1: Executed before the kexec reboot. */ +static void run_stage_1(int luo_fd) +{ + int ret, i; + + ksft_print_msg("[STAGE 1] Increasing ulimit for open files...\n"); + ret = luo_ensure_nofile_limit(NUM_SESSIONS); + if (ret == -EPERM) + ksft_exit_skip("Insufficient privileges to set RLIMIT_NOFILE\n"); + if (ret < 0) + ksft_exit_fail_msg("luo_ensure_nofile_limit failed: %s\n", strerror(-ret)); + + ksft_print_msg("[STAGE 1] Creating state file for next stage (2)...\n"); + create_state_file(luo_fd, STATE_SESSION_NAME, STATE_MEMFD_TOKEN, 2); + + ksft_print_msg("[STAGE 1] Creating %d sessions...\n", NUM_SESSIONS); + + for (i = 0; i < NUM_SESSIONS; i++) { + char name[LIVEUPDATE_SESSION_NAME_LENGTH]; + int s_fd; + + snprintf(name, sizeof(name), "many-test-%d", i); + s_fd = luo_create_session(luo_fd, name); + if (s_fd < 0) { + fail_exit("luo_create_session for '%s' at index %d", + name, i); + } + } + + ksft_print_msg("[STAGE 1] Successfully created %d sessions.\n", + NUM_SESSIONS); + + close(luo_fd); + daemonize_and_wait(); +} + +/* Stage 2: Executed after the kexec reboot. 
*/ +static void run_stage_2(int luo_fd, int state_session_fd) +{ + int i, stage; + + ksft_print_msg("[STAGE 2] Starting post-kexec verification...\n"); + + restore_and_read_stage(state_session_fd, STATE_MEMFD_TOKEN, &stage); + if (stage != 2) { + fail_exit("Expected stage 2, but state file contains %d", + stage); + } + + ksft_print_msg("[STAGE 2] Retrieving and finishing %d sessions...\n", + NUM_SESSIONS); + + for (i = 0; i < NUM_SESSIONS; i++) { + char name[LIVEUPDATE_SESSION_NAME_LENGTH]; + int s_fd; + + snprintf(name, sizeof(name), "many-test-%d", i); + s_fd = luo_retrieve_session(luo_fd, name); + if (s_fd < 0) { + fail_exit("luo_retrieve_session for '%s' at index %d", + name, i); + } + + if (luo_session_finish(s_fd) < 0) { + fail_exit("luo_session_finish for '%s' at index %d", + name, i); + } + close(s_fd); + } + + ksft_print_msg("[STAGE 2] Finalizing state session...\n"); + if (luo_session_finish(state_session_fd) < 0) + fail_exit("luo_session_finish for state session"); + close(state_session_fd); + + ksft_print_msg("\n--- MANY-SESSIONS KEXEC TEST PASSED (%d sessions) ---\n", + NUM_SESSIONS); +} + +int main(int argc, char *argv[]) +{ + return luo_test(argc, argv, STATE_SESSION_NAME, + run_stage_1, run_stage_2); +} -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:17 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:17 +0000 Subject: [PATCH v5 13/13] selftests/liveupdate: Add stress-files kexec test In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-14-pasha.tatashin@soleen.com> Add a new luo_stress_files kexec test that verifies preserving and retrieving 500 files across a kexec reboot. 
Reviewed-by: Pratyush Yadav (Google) Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Pasha Tatashin --- tools/testing/selftests/liveupdate/Makefile | 1 + .../selftests/liveupdate/luo_stress_files.c | 97 +++++++++++++++++++ 2 files changed, 98 insertions(+) create mode 100644 tools/testing/selftests/liveupdate/luo_stress_files.c diff --git a/tools/testing/selftests/liveupdate/Makefile b/tools/testing/selftests/liveupdate/Makefile index ed7534468386..30689d22cb02 100644 --- a/tools/testing/selftests/liveupdate/Makefile +++ b/tools/testing/selftests/liveupdate/Makefile @@ -7,6 +7,7 @@ TEST_GEN_PROGS += liveupdate TEST_GEN_PROGS_EXTENDED += luo_kexec_simple TEST_GEN_PROGS_EXTENDED += luo_multi_session TEST_GEN_PROGS_EXTENDED += luo_stress_sessions +TEST_GEN_PROGS_EXTENDED += luo_stress_files TEST_FILES += do_kexec.sh diff --git a/tools/testing/selftests/liveupdate/luo_stress_files.c b/tools/testing/selftests/liveupdate/luo_stress_files.c new file mode 100644 index 000000000000..0cdf9cd4bac7 --- /dev/null +++ b/tools/testing/selftests/liveupdate/luo_stress_files.c @@ -0,0 +1,97 @@ +// SPDX-License-Identifier: GPL-2.0-only + +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + * + * Validate that LUO can handle a large number of files per session across + * a kexec reboot. + */ + +#include +#include +#include "luo_test_utils.h" + +#define NUM_FILES 500 +#define STATE_SESSION_NAME "kexec_many_files_state" +#define STATE_MEMFD_TOKEN 9999 +#define TEST_SESSION_NAME "many_files_session" + +/* Stage 1: Executed before the kexec reboot. 
*/ +static void run_stage_1(int luo_fd) +{ + int session_fd, i; + + ksft_print_msg("[STAGE 1] Creating state file for next stage (2)...\n"); + create_state_file(luo_fd, STATE_SESSION_NAME, STATE_MEMFD_TOKEN, 2); + + ksft_print_msg("[STAGE 1] Creating test session '%s'...\n", TEST_SESSION_NAME); + session_fd = luo_create_session(luo_fd, TEST_SESSION_NAME); + if (session_fd < 0) + fail_exit("luo_create_session"); + + ksft_print_msg("[STAGE 1] Preserving %d files...\n", NUM_FILES); + for (i = 0; i < NUM_FILES; i++) { + char data[64]; + + snprintf(data, sizeof(data), "file-data-%d", i); + if (create_and_preserve_memfd(session_fd, i, data) < 0) + fail_exit("create_and_preserve_memfd for index %d", i); + } + + ksft_print_msg("[STAGE 1] Successfully preserved %d files.\n", NUM_FILES); + + close(luo_fd); + daemonize_and_wait(); +} + +/* Stage 2: Executed after the kexec reboot. */ +static void run_stage_2(int luo_fd, int state_session_fd) +{ + int session_fd; + int i, stage; + + ksft_print_msg("[STAGE 2] Starting post-kexec verification...\n"); + + restore_and_read_stage(state_session_fd, STATE_MEMFD_TOKEN, &stage); + if (stage != 2) { + fail_exit("Expected stage 2, but state file contains %d", + stage); + } + + ksft_print_msg("[STAGE 2] Retrieving test session '%s'...\n", TEST_SESSION_NAME); + session_fd = luo_retrieve_session(luo_fd, TEST_SESSION_NAME); + if (session_fd < 0) + fail_exit("luo_retrieve_session"); + + ksft_print_msg("[STAGE 2] Verifying %d files...\n", NUM_FILES); + for (i = 0; i < NUM_FILES; i++) { + char data[64]; + int fd; + + snprintf(data, sizeof(data), "file-data-%d", i); + fd = restore_and_verify_memfd(session_fd, i, data); + if (fd < 0) + fail_exit("restore_and_verify_memfd for index %d", i); + close(fd); + } + + ksft_print_msg("[STAGE 2] Finishing test session...\n"); + if (luo_session_finish(session_fd) < 0) + fail_exit("luo_session_finish for test session"); + close(session_fd); + + ksft_print_msg("[STAGE 2] Finalizing state session...\n"); + if 
(luo_session_finish(state_session_fd) < 0) + fail_exit("luo_session_finish for state session"); + close(state_session_fd); + + ksft_print_msg("\n--- MANY-FILES KEXEC TEST PASSED (%d files) ---\n", + NUM_FILES); +} + +int main(int argc, char *argv[]) +{ + return luo_test(argc, argv, STATE_SESSION_NAME, + run_stage_1, run_stage_2); +} -- 2.53.0 From kjlx at templeofstupid.com Mon Jun 1 21:49:14 2026 From: kjlx at templeofstupid.com (Krister Johansen) Date: Mon, 1 Jun 2026 21:49:14 -0700 Subject: [PATCH v5][makedumpfile 0/9] btf/kallsyms based makedumpfile extension for mm page filtering In-Reply-To: References: <20260414102656.55200-1-ltao@redhat.com> Message-ID: Hi Tao, On Tue, Jun 02, 2026 at 03:04:12PM +1200, Tao Liu wrote: > On Tue, Jun 2, 2026 at 12:47 PM Krister Johansen > wrote: > Thanks for your in-depth explanation, it's very helpful to me for > designing the data erasing function. Thanks for the great discussion. > > On Tue, Jun 02, 2026 at 11:12:05AM +1200, Tao Liu wrote: > > > On Sat, May 30, 2026 at 9:11 AM Krister Johansen > > I wondered about this, but for data-structures that are smaller than a > > page, wouldn't that mean that we're erasing other content? The "erase" > > plugins memset the output data to a chosen value (or 0), whereas the > > filtering just drops the page. Couldn't this also lead to a situation > > where the debugger can't find the page at all, versus giving us one > > that's sanitized? (I do understand why you want to drop the pages for > > the GPU cases) > > Frankly I didn't consider the data erasing as in-depth as you did. I > think you are right, makedumpfile needs to know which extensions > handle data erasing and which handle mm page filtering. I guess the mm > page filtering extensions will need to perform a "dry-run" filter > first, in case the "data erasing" extensions break any useful data > structure. In this step, "dry-run" will only record pfn numbers of the > pages that will be filtered.
Then "data erasing" extensions are > called, so all the sensitive data is memset to 0. Finally, all desired > pages are filtered out based on the previous recording. > > With this, "data erase" and "page filtering" will not interfere with > each other. What do you think? This is a great point. It's probably worth documenting the precedence order in which these callbacks are expected to be applied. Naively, I might expect filtering pages to take precedence over erasing data structures. For the GPU cases, these are orthogonal. However, for something where a user might be both trying to filter the page and erase matching content, we don't have any rules defined. It's probably less surprising to allow pages to be filtered first. (I think it is this way in the code.) It also prevents the page filtering from completely filtering a page. > > > > Would you be willing to modify the extension registration options to > > > > allow an extension to specify what kind it is? That way, in the future > > > > > > I'm not sure what you mean by "what kind". Do you mean an extension > > > needs to tell makedumpfile what purpose it is for when loading? > > > > Yes, sorry I wasn't clear in writing the question. Stating this > > differently, if we want to allow the ability for different extensions to > > do different things, how do the extensions declare to makedumpfile what > > they can do, so that it knows where to invoke their callbacks, and what > > callbacks of theirs to invoke? > > > > Looking at patch 6/9, right now run_extension_callback() is invoked > > from __exclude_unnecessary_pages and always calls the > > "extension_callback" symbol in the module. This makes sense for a > > single extension type that's focused on filtering pages. However, if we > > wanted to have multiple different extensions, this might be more > > difficult.
> > > > If we could determine what type of functionality the module implements > > in load_extensions, then we could tell if this is a page filtering > > extension, an erase extension, or some other kind of extension. > > > > For example, for an erase filter, perhaps we would want two callbacks: > > one to set up the ranges to filter "extension_gather_callback" and > > another to actually check the address range to see if it is filtered, > > "extension_filter_data_callback" > > > > I'm not sure about the names. "extension_callback" seems generic, but > > this has a specific purpose. It's an "extension_filter_page_callback" > > > > I may be overengineering this a bit, but having makedumpfile pass an ops > > vector to the extension in a load function could help here. Then the > > module's load function fills out the vector with the functions it > > supports. Depending on what's implemented, these can be placed into > > different callback lists to get invoked at different points in the > > program (e.g. one at pfn filter time, another in filter_data_buffer, > > etc). > > > > It sounds like you had a plan here, though. Were you thinking of adding > > new extension types a different way? > > I see your idea: makedumpfile predefines a few hook points at > different stages, and extensions can register their callbacks to these > hook points. For now I think 2 hook points are enough, one for page > filtering and the other one for registering the data erasing, which definitely > shouldn't be within __exclude_unnecessary_pages(). > > I'm willing to modify the code, such as implementing hook point > registration/management. But since I haven't worked on the data erasing > functions so far, the design might be superficial, personally I'd > prefer to do this along with the data erasing functions in the next > independent patchset, considering that the current patchset already includes > plenty of code/function implementations. @maintainers, what's your > opinion?
Just to clarify, I'm not asking that you implement any erase functionality in the current patchset. Rather, asking if there's a way to implement the current functionality such that the extension modules won't need recompilation when a new extension type is introduced. I think there are a number of different ways to do this, but I didn't want to be overly prescriptive in my feedback. Thanks again, -K From rppt at kernel.org Tue Jun 2 01:13:34 2026 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 02 Jun 2026 11:13:34 +0300 Subject: [PATCH v4 01/13] liveupdate: change file_set->count type to u64 for type safety In-Reply-To: <20260530221938.115978-2-pasha.tatashin@soleen.com> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-2-pasha.tatashin@soleen.com> Message-ID: <178038801483.119771.5551368813719436713.b4-review@b4> On Sat, 30 May 2026 22:19:26 +0000, Pasha Tatashin wrote: > This improves type safety and aligns the in-memory file_set->count with > the serialized count type. It avoids potential truncation or sign > conversion mismatch issues. Acked-by: Mike Rapoport (Microsoft) -- Sincerely yours, Mike. 
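[Annotation on the makedumpfile extension thread above.] The "ops vector" idea Krister describes could look roughly like the sketch below. It is only an illustration under stated assumptions: struct ext_ops, demo_extension_load(), host_should_filter() and host_can_erase() are hypothetical names and do not exist in makedumpfile or in Tao's patchset. The host hands the extension a zeroed table, the extension fills in only the callbacks it implements, and the host decides from the non-NULL entries which hook points to wire up.

```c
#include <stddef.h>

/* Hypothetical ops vector; none of these names exist in makedumpfile. */
struct ext_ops {
	/* Return nonzero if the page at pfn should be dropped. */
	int (*filter_page)(unsigned long pfn);
	/* Overwrite sensitive bytes in buf, which holds memory at paddr. */
	void (*erase_data)(unsigned long long paddr, void *buf, size_t len);
};

/* Toy policy used by the demo extension: drop even-numbered pfns. */
static int demo_filter_page(unsigned long pfn)
{
	return (pfn & 1) == 0;
}

/*
 * Load function of a page-filtering-only extension (hypothetical).
 * It fills in what it supports and leaves erase_data as NULL.
 */
int demo_extension_load(struct ext_ops *ops)
{
	ops->filter_page = demo_filter_page;
	return 0;
}

/* Host side: call the page hook only if the extension provided one. */
int host_should_filter(const struct ext_ops *ops, unsigned long pfn)
{
	return ops->filter_page ? ops->filter_page(pfn) : 0;
}

/* Host side: lets the host decide whether to register an erase hook. */
int host_can_erase(const struct ext_ops *ops)
{
	return ops->erase_data != NULL;
}
```

For already-built extensions to keep working when a new callback is appended, a real design would probably also need a size or version field in the table. That is a guess at one option, not something proposed in the thread.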
From rppt at kernel.org Tue Jun 2 01:13:34 2026 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 02 Jun 2026 11:13:34 +0300 Subject: [PATCH v4 02/13] liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd In-Reply-To: <20260530221938.115978-3-pasha.tatashin@soleen.com> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-3-pasha.tatashin@soleen.com> Message-ID: <178038801485.119771.9514973100282773342.b4-review@b4> On Sat, 30 May 2026 22:19:27 +0000, Pasha Tatashin wrote: > diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c > index 146414933977..8d9201c25412 100644 > --- a/kernel/liveupdate/luo_session.c > +++ b/kernel/liveupdate/luo_session.c > @@ -291,25 +291,24 @@ static int luo_session_retrieve_fd(struct luo_session *session, > if (argp->fd < 0) > return argp->fd; > > - guard(mutex)(&session->mutex); > - err = luo_retrieve_file(&session->file_set, argp->token, &file); > - if (err < 0) > - goto err_put_fd; > + scoped_guard(mutex, &session->mutex) { > + err = luo_retrieve_file(&session->file_set, argp->token, &file); > + if (err < 0) { > + put_unused_fd(argp->fd); > + return err; I don't like piling up error handling inside if (err) statements. As we need the lock only for luo_retrieve_file(), I think it's better to drop the guard and use goto: mutex_lock(&session->mutex); err = luo_retrieve_file(&session->file_set, argp->token, &file); mutex_unlock(&session->mutex); if (err) ... -- Sincerely yours, Mike.
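[Annotation.] The shape Mike suggests (hold the lock only around the lookup, then unwind through a goto label) can be sketched in userspace as below. This is an analogy, not the kernel code: pthread_mutex_lock() stands in for mutex_lock(), and struct session, reserve_fd(), release_fd() and retrieve_file() are hypothetical stand-ins for the LUO session, get_unused_fd_flags(), put_unused_fd() and luo_retrieve_file().

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for struct luo_session. */
struct session {
	pthread_mutex_t mutex;
	int token;		/* the only token this toy session holds */
};

static int fds_reserved;	/* reserved-but-unreleased fd slots */

/* Stand-in for reserving an fd number; always succeeds here. */
static int reserve_fd(void)
{
	fds_reserved++;
	return 3;
}

/* Stand-in for returning an unused fd number. */
static void release_fd(int fd)
{
	(void)fd;
	fds_reserved--;
}

/* Stand-in for the lookup that must run under the session lock. */
static int retrieve_file(const struct session *s, int token)
{
	return s->token == token ? 0 : -ENOENT;
}

int session_retrieve_fd(struct session *s, int token)
{
	int fd, err;

	fd = reserve_fd();
	if (fd < 0)
		return fd;

	/* The lock covers only the lookup, so no guard scope is needed. */
	pthread_mutex_lock(&s->mutex);
	err = retrieve_file(s, token);
	pthread_mutex_unlock(&s->mutex);
	if (err)
		goto err_put_fd;

	return fd;

err_put_fd:
	/* Single unwind path; the lock is already dropped here. */
	release_fd(fd);
	return err;
}
```

Because the unlock happens before the error check, the goto label never runs with the mutex held, which is the mixing of guards and goto that the patch subject wants to avoid.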
From rppt at kernel.org Tue Jun 2 01:13:34 2026 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 02 Jun 2026 11:13:34 +0300 Subject: [PATCH v4 03/13] liveupdate: centralize state management into struct luo_ser In-Reply-To: <20260530221938.115978-4-pasha.tatashin@soleen.com> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-4-pasha.tatashin@soleen.com> Message-ID: <178038801487.119771.6308607614059754603.b4-review@b4> On Sat, 30 May 2026 22:19:28 +0000, Pasha Tatashin wrote: > diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c > index 8f5c5dd01cd0..c8dd30b41238 100644 > --- a/kernel/liveupdate/luo_flb.c > +++ b/kernel/liveupdate/luo_flb.c > @@ -579,53 +565,18 @@ int __init luo_flb_setup_outgoing(void *fdt_out) > [ ... skip 18 lines ... ] > - offset = fdt_subnode_offset(fdt_in, 0, LUO_FDT_FLB_NODE_NAME); > - if (offset < 0) { > - pr_err("Unable to get FLB node [%s]\n", LUO_FDT_FLB_NODE_NAME); > - > - return -ENOENT; > + if (flbs_pa) { I like if (!flbs_pa) return; more > > diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c > index 8d9201c25412..3b760fefa7b9 100644 > --- a/kernel/liveupdate/luo_session.c > +++ b/kernel/liveupdate/luo_session.c > @@ -497,75 +494,34 @@ int luo_session_retrieve(const char *name, struct file **filep) > [ ... skip 58 lines ... ] > + if (sessions_pa) { > + header_ser = phys_to_virt(sessions_pa); > + luo_session_global.incoming.header_ser = header_ser; > + luo_session_global.incoming.ser = (void *)(header_ser + 1); > + luo_session_global.incoming.active = true; > } Ditto -- Sincerely yours, Mike. 
From rppt at kernel.org Tue Jun 2 01:13:34 2026 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 02 Jun 2026 11:13:34 +0300 Subject: [PATCH v4 07/13] kho: add support for linked-block serialization In-Reply-To: <20260530221938.115978-8-pasha.tatashin@soleen.com> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-8-pasha.tatashin@soleen.com> Message-ID: <178038801491.119771.18384706761138506132.b4-review@b4> On Sat, 30 May 2026 22:19:32 +0000, Pasha Tatashin wrote: > diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h > new file mode 100644 > index 000000000000..5e6b87b1befa > --- /dev/null > +++ b/include/linux/kho_block.h > @@ -0,0 +1,79 @@ > [ ... skip 19 lines ... ] > + struct list_head list; > + struct kho_block_header_ser *ser; > +}; > + > +/** > + * struct kho_block_set - A set of blocks that belong to the same object. "same object" sounds off to me. The blocks belong to the same module? user? Thoughts? > + * @blocks: The list of serialization blocks (struct kho_block). > + * @nblocks: The number of allocated serialization blocks. > + * @head_pa: Physical address of the first block header. > + * @entry_size: The size of each entry in the blocks. I think it's "... entry in a block" > [ ... skip 42 lines ... ] > + > +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs); > +void *kho_block_it_next(struct kho_block_it *it); > +void *kho_block_it_read(struct kho_block_it *it); > +void *kho_block_it_prev(struct kho_block_it *it); > +void kho_block_it_finalize(struct kho_block_it *it); These operate on block sets, should be reflected in the names. Can be kho_blocks_ to avoid too long names. > > diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c > new file mode 100644 > index 000000000000..a4e650af946f > --- /dev/null > +++ b/kernel/liveupdate/kho_block.c > @@ -0,0 +1,384 @@ > +// SPDX-License-Identifier: GPL-2.0 > + > +/* > + * Copyright (c) 2026, Google LLC. 
> + * Pasha Tatashin > + */ > + > +/** > + * DOC: KHO Serialization Blocks > + * > + * KHO provides a mechanism to preserve stateful data across a kexec handover > + * by serializing it into memory blocks. This file provides the common "This file" does not look good in HTML docs. > [ ... skip 15 lines ... ] > + > +/* > + * Safeguard limit for the number of serialization blocks. This is used to > + * prevent infinite loops and excessive memory allocation in case of memory > + * corruption in the preserved state. > + */ Can you add how much memory it is and how many entries with, say, 4 u64 it can accommodate? > [ ... skip 13 lines ... ] > +{ > + if (unlikely(!bs->count_per_block)) { > + bs->count_per_block = (KHO_BLOCK_SIZE - > + sizeof(struct kho_block_header_ser)) / > + bs->entry_size; > + WARN_ON(!bs->count_per_block); Don't you want to set count_per_block in _init()? > [ ... skip 29 lines ... ] > + if (!block) > + return -ENOMEM; > + > + block->ser = ser; > + last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); > + list_add_tail(&block->list, &bs->blocks); No locks? > [ ... skip 12 lines ... ] > + * @bs: The block set. > + * @count: The current number of entries. > + * > + * This function handles the dynamic expansion of a block set. It allocates > + * and links a new serialization block if the provided entry count matches > + * the current total capacity of the set. This is a weird semantics for a generic API. I'd expect _grow() would add count - current_count blocks. > [ ... skip 25 lines ... ] > +} > + > +/** > + * kho_block_shrink - Conditionally destroy the last block in a block set. > + * @bs: The block set. > + * @count: The current number of entries across all blocks. Maybe ... of valid entries? > + * > + * This function checks if the last block in the set is redundant based on the > + * total entry count and the capacity of the preceding blocks. 
If the entry > + * count can be accommodated by the blocks that come before the last one, the > + * last block is destroyed and removed from the set. This should mention that it's the caller's responsibility to ensure that entries are removed in the right order. > [ ... skip 49 lines ... ] > + > + fast = phys_to_virt(fast->next); > + slow = phys_to_virt(slow->next); > + > + if (slow == fast) { > + pr_err("Cyclic list detected\n"); Maybe "block set is corrupted"? > + return false; > + } > + } > + > + return true; > +} > + > +/** > + * kho_block_restore - Restore a block set from a physical address. > + * @bs: The block set to restore. > + * @head_pa: Physical address of the first block header. I'd mention that the block set should be allocated and initialized. > [ ... skip 10 lines ... ] > + bs->incoming = true; > + if (!head_pa) > + return 0; > + > + bs->head_pa = head_pa; > + if (!kho_cyclic_blocks_check(bs)) { if (kho_block_set_cyclic()) reads nicer IMO > [ ... skip 87 lines ... ] > +{ > + if (!it->block) > + return NULL; > + > + if (it->i == kho_block_count_per_block(it->bs)) { > + it->block->ser->count = it->i; Why does the iterator update ser->count? > + if (list_is_last(&it->block->list, &it->bs->blocks)) > + return NULL; > + it->block = list_next_entry(it->block, list); > + it->i = 0; > + } > + > + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); In a month we'll need an LLM's help to understand what it does. > +} > + > +/** > + * kho_block_it_read - Return the next entry slot for reading. > + * @it: The block iterator. And what is the conceptual difference between this and _it_next()? > [ ... skip 49 lines ... ] > + * @it: The block iterator. > + */ > +void kho_block_it_finalize(struct kho_block_it *it) > +{ > + if (it->block) > + it->block->ser->count = it->i; So, it looks like _it_next is intended for writing, and this ends a write iteration. I think the names should be adjusted to make it clearer. -- Sincerely yours, Mike.
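The slot computation questioned in the review above can be unpacked into named steps. What follows is a minimal sketch, not the patch's code: the 4096-byte block size, the two-field header layout, and every sketch_* name are assumptions made for illustration, since the quoted hunks show neither KHO_BLOCK_SIZE nor the definition of struct kho_block_header_ser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed for illustration only; the quoted patch does not show this value. */
#define SKETCH_BLOCK_SIZE 4096u

/* Hypothetical stand-in for struct kho_block_header_ser. */
struct sketch_block_header {
	uint64_t next;	/* physical address of the next block, 0 if none */
	uint64_t count;	/* number of valid entries stored in this block */
};

/* Number of fixed-size entries that fit in one block after the header. */
size_t sketch_count_per_block(size_t entry_size)
{
	assert(entry_size != 0);
	return (SKETCH_BLOCK_SIZE - sizeof(struct sketch_block_header)) /
	       entry_size;
}

/*
 * Address of entry slot i within a block. "hdr + 1" points just past the
 * header, which is where the entry payload begins; slot i then sits
 * i * entry_size bytes into that payload.
 */
void *sketch_entry_slot(struct sketch_block_header *hdr, size_t i,
			size_t entry_size)
{
	unsigned char *payload = (unsigned char *)(hdr + 1);

	return payload + i * entry_size;
}
```

Under these assumed values, a 32-byte entry (four u64 fields) gives (4096 - 16) / 32 = 127 entries per block; the real figures depend on the actual KHO_BLOCK_SIZE and header size.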
From rppt at kernel.org Tue Jun 2 01:13:34 2026 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 02 Jun 2026 11:13:34 +0300 Subject: [PATCH v4 08/13] liveupdate: defer session block allocation and PA setting In-Reply-To: <20260530221938.115978-9-pasha.tatashin@soleen.com> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-9-pasha.tatashin@soleen.com> Message-ID: <178038801492.119771.3419366349068848854.b4-review@b4> On Sat, 30 May 2026 22:19:33 +0000, Pasha Tatashin wrote: > Currently, luo_session_setup_outgoing() allocates the session block and "liveupdate: defer session block allocation and PA setting" PA as "Public Assistance"? ;-) Let's spell it out. -- Sincerely yours, Mike. From baoquan.he at linux.dev Tue Jun 2 02:00:40 2026 From: baoquan.he at linux.dev (Baoquan He) Date: Tue, 2 Jun 2026 17:00:40 +0800 Subject: [PATCH] kexec_file: skip checksum verification when relocations aren't needed In-Reply-To: <20260601191136.799134-1-mclapinski@google.com> References: <20260601191136.799134-1-mclapinski@google.com> Message-ID: On 06/01/26 at 09:11pm, Michal Clapinski wrote: ...snip... > + /* > + * If all segments were loaded into contiguous memory, there will be no > + * relocations. In that case there is no risk of memory corruption by > + * uncancelled DMA and we can skip checksum calculation. > + */ > + for (i = 0; i < image->nr_segments; i++) { > + if (!image->segment_cma[i]) { > + can_skip_checksum = false; > + break; > + } > + } > + > + if (can_skip_checksum) { > + pr_info("disabling checksum verification in purgatory\n"); Could you use pr_debug() or kexec_dprintk() instead? There is no need to notify users if this is a normal action. Other than this, it looks good to me overall.
Acked-by: Baoquan He > + goto skip_checksum; > + } > + > for (j = i = 0; i < image->nr_segments; i++) { > struct kexec_segment *ksegment; > > @@ -867,6 +885,7 @@ static int kexec_calculate_store_digests(struct kimage *image) > j++; > } > > +skip_checksum: > sha256_final(&sctx, digest); > > ret = kexec_purgatory_get_set_symbol(image, "purgatory_sha_regions", > -- > 2.54.0.929.g9b7fa37559-goog > From rppt at kernel.org Tue Jun 2 02:04:24 2026 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 2 Jun 2026 12:04:24 +0300 Subject: [PATCH v4 07/13] kho: add support for linked-block serialization In-Reply-To: <178038801491.119771.18384706761138506132.b4-review@b4> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-8-pasha.tatashin@soleen.com> <178038801491.119771.18384706761138506132.b4-review@b4> Message-ID: I sent it before seeing v5, so some of those are already addressed, but please take a look anyway. On Tue, Jun 02, 2026 at 11:13:34AM +0300, Mike Rapoport wrote: ...snip... -- Sincerely yours, Mike. From rppt at kernel.org Tue Jun 2 02:34:01 2026 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 2 Jun 2026 12:34:01 +0300 Subject: [PATCH v3 09/11] arm64: kdump: exclude non-dumpable reserved memory regions from vmcore In-Reply-To: References: <20260527032917.3385849-1-chenwandun1@gmail.com> <20260527032917.3385849-10-chenwandun1@gmail.com> Message-ID: Hi Baoquan, On Mon, Jun 01, 2026 at 01:00:34PM +0800, Baoquan He wrote: > On 05/30/26 at 07:25pm, Mike Rapoport wrote: > > On Fri, May 29, 2026 at 04:08:41PM +0100, Will Deacon wrote: > > > On Wed, May 27, 2026 at 11:29:15AM +0800, Wandun Chen wrote: > ...snip...
> > There are patches that move common code to kernel/crash_core.c: > > > > https://lore.kernel.org/all/20260525084932.934910-1-ruanjinjie at huawei.com > > > > Review from arch maintainers would be helpful there ;-) > > Before, Andrew would put patch candidates into his mm tree and trigger > testing. If any adjustment, he would take them off. Can we do the > similar thing for kexec/kdump patches, unless the patches are objected > explicitly? Yes, we can add patches that look reasonable and expose them in -next. Now is too late in the release cycle, let's start with it after -rc1. > Thanks > Baoquan -- Sincerely yours, Mike. From mclapinski at google.com Tue Jun 2 05:33:11 2026 From: mclapinski at google.com (Michal Clapinski) Date: Tue, 2 Jun 2026 14:33:11 +0200 Subject: [PATCH v2] kexec_file: skip checksum verification when safe Message-ID: <20260602123311.1841746-1-mclapinski@google.com> Checksum verification is needed 1. for crash kernels. In a crash, we can't be sure the kernel is intact. 2. if we're worried about relocating the kernel into a region used by some DMA that wasn't properly cancelled. If KHO is enabled then relocations will happen to KHO scratch, which is free from DMA regions. If we used CMA to allocate segments then relocations are not going to happen at all. Therefore, we can safely disable checksum verification in both of those cases. Instead of adding a new variable to purgatory, just skip adding regions and save the default value of SHA256 hash. Saves ~250ms on my 4.0 GHz CPU. This is an important saving for the live-update project. Signed-off-by: Michal Clapinski --- v2: - also skip checksum verification if KHO is enabled - small fixes from reviews My original idea was to do 2 changes: 1. Skip checksum if all segments are CMA. 2. If KHO is enabled, allocate the kernel inside kho_scratch using CMA. This way we could skip both relocations and checksum verification when KHO is enabled. 
But I realized that step 2 might not be possible on warm boots. I have no idea how to fix that (except weird ideas like 2 kho_scratches that we swap on every warm boot), so I decided to just skip checksum verification when KHO is enabled. This unfortunately means relocations will still happen. --- kernel/kexec_file.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index 2bfbb2d144e6..db25a14692ab 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "kexec_internal.h" #ifdef CONFIG_KEXEC_SIG @@ -798,6 +799,16 @@ int kexec_add_buffer(struct kexec_buf *kbuf) return 0; } +static bool kexec_only_cma_segments(struct kimage *image) +{ + for (int i = 0; i < image->nr_segments; i++) { + if (!image->segment_cma[i]) + return false; + } + + return true; +} + /* Calculate and store the digest of segments */ static int kexec_calculate_store_digests(struct kimage *image) { @@ -822,6 +833,21 @@ static int kexec_calculate_store_digests(struct kimage *image) sha256_init(&sctx); + /* + * If KHO is enabled, the destinations are located in KHO scratch. + * KHO scratch can only contain early boot allocations and movable + * allocations. That means there is no risk of memory corruption by + * uncancelled DMA. + * + * If all segments were loaded into contiguous memory, there will be no + * relocations at all, so also no risk no corruption. 
+ */ + if (image->type != KEXEC_TYPE_CRASH && + (kho_is_enabled() || kexec_only_cma_segments(image))) { + pr_debug("disabling checksum verification in purgatory\n"); + goto skip_checksum; + } + for (j = i = 0; i < image->nr_segments; i++) { struct kexec_segment *ksegment; @@ -867,6 +893,7 @@ static int kexec_calculate_store_digests(struct kimage *image) j++; } +skip_checksum: sha256_final(&sctx, digest); ret = kexec_purgatory_get_set_symbol(image, "purgatory_sha_regions", -- 2.54.0.929.g9b7fa37559-goog From pratyush at kernel.org Tue Jun 2 05:52:02 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 14:52:02 +0200 Subject: [PATCH 09/12] memblock: introduce MEMBLOCK_KHO_SCRATCH_EXT In-Reply-To: (Mike Rapoport's message of "Sun, 31 May 2026 21:51:09 +0300") References: <20260429133928.850721-1-pratyush@kernel.org> <20260429133928.850721-10-pratyush@kernel.org> <2vxzecjhc2s8.fsf@kernel.org> <2vxzse7j7ai9.fsf@kernel.org> Message-ID: <2vxztsrlds0d.fsf@kernel.org> On Sun, May 31 2026, Mike Rapoport wrote: > On Fri, May 22, 2026 at 05:02:38PM +0200, Pratyush Yadav wrote: >> On Fri, May 22 2026, Pasha Tatashin wrote: >> >> > On 05-11 18:46, Pratyush Yadav wrote: >> >> On Mon, May 11 2026, Mike Rapoport wrote: >> >> >> >> > On Wed, Apr 29, 2026 at 03:39:11PM +0200, Pratyush Yadav wrote: >> >> >> From: "Pratyush Yadav (Google)" >> >> >> >> >> >> In the upcoming commits, the KHO will learn how to discover free blocks >> >> >> of memory by walking the KHO radix tree. It will then mark those regions >> >> >> as scratch to allow memory allocation in case scratch runs low. >> >> >> >> >> >> To differentiate the extended scratch areas from the main scratch areas, >> >> >> introduce MEMBLOCK_KHO_SCRATCH_EXT. Use it when choosing memblock flags >> >> >> for allocations during scratch-only. Teach should_skip_region() to check >> >> >> for both flags before deciding if the region should be skipped. 
>> >> > >> >> > Why there's a need to differentiate SCRATCH and SCRATCH_EXT? >> >> > SCRATCH (I still hate the name) means "memory memblock can safely use for >> > >> > +1000 >> > >> > I also strongly dislike this name and mentioned it in another thread >> > earlier today. >> > >> > If we ever decide to s/scratch/something-else/ globally, that should be a >> > separate cleanup effort. However, since we are introducing a brand new flag >> > here, we can discuss a better name for the _ext portion to avoid overloading >> > the "scratch" concept. >> > >> >> > the allocations". Initially this memory comes from the reservations in the >> >> > first kernel, but if the second kernel can find more memory to extend it, >> >> > why that additional memory should be treated differently? >> >> >> >> Two reasons: >> >> >> >> 1. We mark SCRATCH as MIGRATE_CMA. We don't want to do that for >> >> SCRATCH_EXT since this memory can be used for non-movable >> >> allocations. >> >> >> >> 2. Gigantic (1G) huge pages can not be allocated from scratch. They can >> >> be preserved memory and thus should not be allocated from SCRATCH. >> >> See patch 12 that does allocations for gigantic huge pages only from >> >> SCRATCH_EXT. >> >> >> >> I will add this in the commit message for the next version. >> >> >> >> Naming is hard, so if you have any better names I'm all ears :-) >> > >> > IMO, this scratch_ext is not "scratch" in the traditional KHO sense at all. >> > The traditional KHO scratch is what is passed from kernel to kernel and is >> > guaranteed to contain zero preserved memory. This new memory is not passed >> > from kernel to kernel and can contain preserved memory at runtime. It's >> > essentially just memory that we identify as currently unpreserved and release >> > early to the system. 
>> > >> > If we want to keep the naming aligned with the existing codebase for now: >> > MEMBLOCK_KHO_SCRATCH -> original scratch >> > MEMBLOCK_KHO_UNPRESERVED -> for the new memory (instead of SCRATCH_EXT) >> >> UNPRESERVED sounds good to me. I will use that for the next revision >> unless Mike objects. > > Can we make it shorter? ;-) > > UNPRESERVED makes sense, although I'd love to completely remove KHO_ notion > and make the name reflect how it's used by memblock. I was toying with > PREFERRED instead of SCRATCH, but it didn't feel right enough. > With two of them that surely won't work :) I don't think you really can remove KHO_ notion. These memory regions only make sense on a KHO boot, and won't exist otherwise. And PREFERRED sounds like a suggestion/priority hint, not a hard limit. "With KHO boot, you can _only_ use PREFERRED memory", doesn't sound right... I think MEMBLOCK_KHO_BOOTMEM for scratch and MEMBLOCK_KHO_NOPRESERVE (which I think is a tiny bit better than UNPRESERVED) for scratch_ext are my top picks. To make it shorter, perhaps MEMBLOCK_KHO_NOPRSRV, in similar fashion to RSRV_KERN? > >> > Alternatively, if we do want to tackle the global rename of "scratch" later: >> > MEMBLOCK_KHO_BOOTSTRAP -> for the original scratch >> > MEMBLOCK_KHO_UNPRESERVED -> for this new dynamic memory >> >> Or perhaps BOOTMEM? I suppose either of the two are somewhat better than >> scratch. > > Well, if we have BOOTMEM_HVO, we can have BOOTMEM_KHO as well :) > >> Anyway, can we please do the SCRATCH rename as a separate series? I > > Sure. We can continue bikeshedding in parallel. > >> would like this series to not get muddled in the naming discussion. I >> will use UNPRESERVED for the new concept in v2 though. > > That might warrant v3 even if everything else is perfect :) I can live with that. 
As long as we can agree on the easy part (the code), I don't mind doing another version for the hard part (the naming) ;-) -- Regards, Pratyush Yadav From rppt at kernel.org Tue Jun 2 06:20:02 2026 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 2 Jun 2026 16:20:02 +0300 Subject: [PATCH 09/12] memblock: introduce MEMBLOCK_KHO_SCRATCH_EXT In-Reply-To: <2vxztsrlds0d.fsf@kernel.org> References: <20260429133928.850721-1-pratyush@kernel.org> <20260429133928.850721-10-pratyush@kernel.org> <2vxzecjhc2s8.fsf@kernel.org> <2vxzse7j7ai9.fsf@kernel.org> <2vxztsrlds0d.fsf@kernel.org> Message-ID: On Tue, Jun 02, 2026 at 02:52:02PM +0200, Pratyush Yadav wrote: > On Sun, May 31 2026, Mike Rapoport wrote: > >> > > >> > If we want to keep the naming aligned with the existing codebase for now: > >> > MEMBLOCK_KHO_SCRATCH -> original scratch > >> > MEMBLOCK_KHO_UNPRESERVED -> for the new memory (instead of SCRATCH_EXT) > >> > >> UNPRESERVED sounds good to me. I will use that for the next revision > >> unless Mike objects. > > > > Can we make it shorter? ;-) > > > > UNPRESERVED makes sense, although I'd love to completely remove KHO_ notion > > and make the name reflect how it's used by memblock. I was toying with > > PREFERRED instead of SCRATCH, but it didn't feel right enough. > > With two of them that surely won't work :) > > I don't think you really can remove KHO_ notion. These memory regions > only make sense on a KHO boot, and won't exist otherwise. And PREFERRED > sounds like a suggestion/priority hint, not a hard limit. "With KHO > boot, you can _only_ use PREFERRED memory", doesn't sound right... > > I think MEMBLOCK_KHO_BOOTMEM for scratch and MEMBLOCK_KHO_NOPRESERVE > (which I think is a tiny bit better than UNPRESERVED) for scratch_ext > are my top picks. To make it shorter, perhaps MEMBLOCK_KHO_NOPRSRV, in > similar fashion to RSRV_KERN? 
There are a couple of unrelated 'bootmem' things in the kernel, so adding another one shouldn't hurt :) I like MEMBLOCK_KHO_BOOTMEM and MEMBLOCK_KHO_NOPRSRV the most. > >> would like this series to not get muddled in the naming discussion. I > >> will use UNPRESERVED for the new concept in v2 though. > > > > That might warrant v3 even if everything else is perfect :) > > I can live with that. As long as we can agree on the easy part (the > code), I don't mind doing another version for the hard part (the naming) > ;-) It makes sense to keep KHO_SCRATCH for now and use MEMBLOCK_KHO_NOPRSRV for the new one to begin with. And then we can ask an LLM to do the renaming. > -- > Regards, > Pratyush Yadav -- Sincerely yours, Mike. From pratyush at kernel.org Tue Jun 2 06:35:44 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 15:35:44 +0200 Subject: [PATCH 12/12] mm/hugetlb: make bootmem allocation work with KHO In-Reply-To: (Mike Rapoport's message of "Sun, 31 May 2026 21:40:07 +0300") References: <20260429133928.850721-1-pratyush@kernel.org> <20260429133928.850721-13-pratyush@kernel.org> <2vxzo6i37bs6.fsf@kernel.org> Message-ID: <2vxzpl29dpzj.fsf@kernel.org> On Sun, May 31 2026, Mike Rapoport wrote: > On Mon, May 25, 2026 at 05:24:09PM +0200, Pratyush Yadav wrote: >> On Sun, May 17 2026, Mike Rapoport wrote: >> > On Wed, Apr 29, 2026 at 03:39:14PM +0200, Pratyush Yadav wrote: >> >> From: "Pratyush Yadav (Google)" >> >> So, in summary, I would like to pursue option 1 and try to make it more >> appetizing. But I would like to at least know if you hate the "extended >> scratch" (ignore the name) as a concept or only the code it results in. > > Let's retry this one :) > > I looked more closely, and it seems that mixing SCRATCH and SCRATCH_EXT > should be a lesser headache than going with option 4. I also had some time to ruminate on this. I still think option 1 has the most promise, but my opinion on option 4 has improved a bit.
While I still am not sure adding a 3rd phase to struct page/MM init (early -> deferred -> KHO reserved blocks) is a good idea, I think it might not be as bad as I first thought. Dunno... Anyway, for now I think I will try to make option 1 more appetizing. Here's an idea I want to try out: I get rid of SCRATCH_EXT and mark the free blocks as SCRATCH. For HugeTLB, I can teach the special memblock_alloc_hugetlb_something() function to exclude scratch areas when looking for free memory ranges. So core memblock does not get a new memory type, and the complexity of hugepage allocation does not leak into memblock. How does that sound? > > Tracking the changes in gigantic pages in hugetlb also does not seem > something we'd like to pursue especially considering that memory from freed > or demoted gigantic pages could be reserved. > > If we add a dedicated memblock_something to allocate gigantic pages, we > can reduce branching in alloc_bootmem() to > > if (cma) > do_cma() > else > do_memblock() > > For hugetlb_cma we might want to teach CMA to create pre-allocated areas > and then it could reuse the same memblock API. This seems useful even > regardless of KHO. Sorry, I don't get what you mean by this. What pre-allocated areas? When creating CMA areas it calls cma_alloc_mem() which calls into memblock. What would we change about this? -- Regards, Pratyush Yadav From pratyush at kernel.org Tue Jun 2 08:16:21 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 17:16:21 +0200 Subject: [PATCH v2] kexec_file: skip checksum verification when safe In-Reply-To: <20260602123311.1841746-1-mclapinski@google.com> (Michal Clapinski's message of "Tue, 2 Jun 2026 14:33:11 +0200") References: <20260602123311.1841746-1-mclapinski@google.com> Message-ID: <2vxzik81dlbu.fsf@kernel.org> On Tue, Jun 02 2026, Michal Clapinski wrote: > Checksum verification is needed > 1. for crash kernels. In a crash, we can't be sure the kernel is > intact. > 2. 
if we're worried about relocating the kernel into a region used by > some DMA that wasn't properly cancelled. > > If KHO is enabled then relocations will happen to KHO scratch, which > is free from DMA regions. > If we used CMA to allocate segments then relocations are not going to > happen at all. > Therefore, we can safely disable checksum verification in both of those > cases. > > Instead of adding a new variable to purgatory, just skip adding regions > and save the default value of SHA256 hash. > > Saves ~250ms on my 4.0 GHz CPU. This is an important saving for the > live-update project. > > Signed-off-by: Michal Clapinski > --- > v2: > - also skip checksum verification if KHO is enabled > - small fixes from reviews > > My original idea was to do 2 changes: > 1. Skip checksum if all segments are CMA. > 2. If KHO is enabled, allocate the kernel inside kho_scratch using CMA. > > This way we could skip both relocations and checksum verification when > KHO is enabled. > But I realized that step 2 might not be possible on warm boots. AFAIU we only relocate into scratch since relocating anywhere else might over-write preserved memory. If there is no relocation, there is no need for the kernel image to be in scratch, since the image won't be preserved memory anyway. So perhaps we can just use CMA directly, and only fall back to kho_locate_mem_hole() if that fails? This should be a simple enough change. Do you know how much time we can save by skipping relocations? I would guess it is in the hundreds of milliseconds. Can you try this (COMPLETELY UNTESTED) patch out and see if it works and if it further improves kexec time? --- 8< --- diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index 2bfbb2d144e6..0ccc7b6d67c1 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -720,14 +720,6 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf) if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN) return 0; - /* - * If KHO is active, only use KHO scratch memory. 
All other memory - * could potentially be handed over. - */ - ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback); - if (ret <= 0) - return ret; - /* * Try to find a free physically contiguous block of memory first. With that, we * can avoid any copying at kexec time. @@ -735,6 +727,14 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf) if (!kexec_alloc_contig(kbuf)) return 0; + /* + * If KHO is active and relocations are to be done,, only use KHO + * scratch memory. All other memory could potentially be handed over. + */ + ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback); + if (ret <= 0) + return ret; + if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) ret = kexec_walk_resources(kbuf, locate_mem_hole_callback); else --- >8 --- Of course this is not directly related to this patch so it shouldn't block it, but I reckon we might be able to squeeze a bit more performance out this way as a follow up. > I have no idea how to fix that (except weird ideas like 2 kho_scratches > that we swap on every warm boot), so I decided to just skip checksum > verification when KHO is enabled. This unfortunately means relocations > will still happen. 
> --- > kernel/kexec_file.c | 27 +++++++++++++++++++++++++++ > 1 file changed, 27 insertions(+) > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > index 2bfbb2d144e6..db25a14692ab 100644 > --- a/kernel/kexec_file.c > +++ b/kernel/kexec_file.c > @@ -27,6 +27,7 @@ > #include > #include > #include > +#include > #include "kexec_internal.h" > > #ifdef CONFIG_KEXEC_SIG > @@ -798,6 +799,16 @@ int kexec_add_buffer(struct kexec_buf *kbuf) > return 0; > } > > +static bool kexec_only_cma_segments(struct kimage *image) > +{ > + for (int i = 0; i < image->nr_segments; i++) { > + if (!image->segment_cma[i]) > + return false; > + } > + > + return true; > +} > + > /* Calculate and store the digest of segments */ > static int kexec_calculate_store_digests(struct kimage *image) > { > @@ -822,6 +833,21 @@ static int kexec_calculate_store_digests(struct kimage *image) > > sha256_init(&sctx); > > + /* > + * If KHO is enabled, the destinations are located in KHO scratch. > + * KHO scratch can only contain early boot allocations and movable > + * allocations. That means there is no risk of memory corruption by > + * uncancelled DMA. > + * > + * If all segments were loaded into contiguous memory, there will be no > + * relocations at all, so also no risk no corruption. Typo: "so also no risk *of* corruption". We can fix that up when applying I think, so no need for a v3 just for this. 
Other than this, Reviewed-by: Pratyush Yadav (Google) > + */ > + if (image->type != KEXEC_TYPE_CRASH && > + (kho_is_enabled() || kexec_only_cma_segments(image))) { > + pr_debug("disabling checksum verification in purgatory\n"); > + goto skip_checksum; > + } > + > for (j = i = 0; i < image->nr_segments; i++) { > struct kexec_segment *ksegment; > > @@ -867,6 +893,7 @@ static int kexec_calculate_store_digests(struct kimage *image) > j++; > } > > +skip_checksum: > sha256_final(&sctx, digest); > > ret = kexec_purgatory_get_set_symbol(image, "purgatory_sha_regions", -- Regards, Pratyush Yadav From baoquan.he at linux.dev Mon Jun 1 06:40:23 2026 From: baoquan.he at linux.dev (Baoquan He) Date: Mon, 1 Jun 2026 21:40:23 +0800 Subject: [PATCH v15 00/23] arm64/riscv: Add support for crashkernel CMA reservation In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: Hi Jinjie, On 06/01/26 at 05:47pm, Jinjie Ruan wrote: ...snip... > Changes in v15: > - Unify the subject prefix formats as Huacai suggested. > - Fix powerpc pre-existing NULL pointer dereference [Sashiko [1]] > - Fix powerpc pre-existing __merge_memory_ranges() memory range > truncation [Sashiko [1]]. > - Fix pre-existing arm64 CMA page leaks [Sashiko[2]]. > - Fix pre-existing crash_load_dm_crypt_keys() Use-After-Free and > Double Free issue [Sashiko[3]]. > - Fix vfree(headers) and uninitialized variables issue > and simplify the fix [Sashiko[2]]. > - As walk_system_ram_res() and for_each_mem_range() use different > lock, unify and simplify the fix of TOCTOU buffer overflow via memory > region padding [Sashiko[4]]. > - Fix the arm64 crash dump issues in Sashiko[5]. > - Link to v14: https://lore.kernel.org/all/20260525084932.934910-1-ruanjinjie at huawei.com/ Do these Fixes have anything with the main target of this patch series you mentioned in cover-letter:"arm64/riscv: Add support for crashkernel CMA"? 
The patches become more and more in each new version, I am wondering if it relies on these Fixes patches to implement your adding support for crashkernel CMA on arm64/risc-v. If not relying on them, could you split them into different patchset on different purpose? Thanks Baoquan > > [1]: https://lore.kernel.org/all/20260525092207.96B9D1F000E9 at smtp.kernel.org/ > [2]: https://lore.kernel.org/all/20260525091149.1A1E01F00A3D at smtp.kernel.org/ > [3]: https://lore.kernel.org/all/20260525105227.3C2421F000E9 at smtp.kernel.org/ > [4]: https://lore.kernel.org/all/20260525095447.944E11F000E9 at smtp.kernel.org/ > [5]: https://lore.kernel.org/all/20260525101746.9959D1F000E9 at smtp.kernel.org/ > > Changes in v14: > - Fix image->elf_headers memory leak during retry loop for arm64 as Sashiko > AI code review pointed out. > - Solve the hotplug notifier arch_crash_handle_hotplug_event() AA > self-deadlock problem as Sashiko AI code review pointed out. > - Fix the TOCTOU issue in prepare_elf_headers() by get_online_mems(). > - -ENOMEM -> -EAGAIN as Breno suggested. > - Add support for arm64 crash hotplug. > - Link to v13: https://lore.kernel.org/all/20260511030454.1730881-1-ruanjinjie at huawei.com/ > > Changes in v13: > - Rebased on v7.1-rc1. > - Update the commit message. > - Add Reviewed-by. > - Link to v12: https://lore.kernel.org/all/20260402072701.628293-1-ruanjinjie at huawei.com/ > > Changes in v12: > - Remove the unused "nr_mem_ranges" for x86. > - Add "Fix crashk_low_res not exclude bug" test log. > - Provide a separate patch for each architecture for using > crash_prepare_headers(), which will make the review more convenient. > - Add Reviewed-by and Tested-by. > - Link to v11: https://lore.kernel.org/all/20260328074013.3589544-1-ruanjinjie at huawei.com/ > > Changes in v11: > - Avoid silently drop crash memory if the crash kernel is built without > CONFIG_CMA. > - Remove unnecessary "cmem->nr_ranges = 0" for arch_crash_populate_cmem() > as we use kvzalloc(). 
> - Provide a separate patch for each architecture to fix the existing > buffer overflow issue. > - Add Acked-bys for arm64. > > Changes in v10: > - Fix crashk_low_res not excluded bug in the existing > RISC-V code. > - Fix an existing memory leak issue in the existing PowerPC code. > - Fix the ordering issue of adding CMA ranges to > "linux,usable-memory-range". > - Fix an existing concurrency issue. A Concurrent memory hotplug may occur > between reading memblock and attempting to fill cmem during kexec_load() > for almost all existing architectures. > - Link to v9: https://lore.kernel.org/all/20260323072745.2481719-1-ruanjinjie at huawei.com/ > > Changes in v9: > - Collect Reviewed-by and Acked-by, and prepare for Sashiko AI review. > - Link to v8: https://lore.kernel.org/all/20260302035315.3892241-1-ruanjinjie at huawei.com/ > > Changes in v8: > - Fix the build issues reported by kernel test robot and Sourabh. > - Link to v7: https://lore.kernel.org/all/20260226130437.1867658-1-ruanjinjie at huawei.com/ > > Changes in v7: > - Correct the inclusion of CMA-reserved ranges for kdump kernel in of/kexec > for arm64 and riscv. > - Add Acked-by. > - Link to v6: https://lore.kernel.org/all/20260224085342.387996-1-ruanjinjie at huawei.com/ > > Changes in v6: > - Update the crash core exclude code as Mike suggested. > - Rebased on v7.0-rc1. > - Add acked-by. 
> - Link to v5: https://lore.kernel.org/all/20260212101001.343158-1-ruanjinjie at huawei.com/ > > Jinjie Ruan (22): > riscv: kexec_file: Fix crashk_low_res not exclude bug > powerpc/crash: Fix possible memory leak in update_crash_elfcorehdr() > powerpc/kexec_file: Fix NULL pointer dereference in > kexec_extra_fdt_size_ppc64() > powerpc/kexec_file: Fix memory range truncation in > __merge_memory_ranges() > kexec: Extract kexec_free_segment_cma() from kimage_free_cma() > arm64: kexec_file: Fix CMA page leaks during segment placement retry > loops > arm64: kexec_file: Fix image->elf_headers memory leak during retry > loop > kexec: Fix UAF and Double Free in crash_load_dm_crypt_keys() > crash_core: Introduce CRASH_HOTPLUG_SAFETY_PADDING for memory hotplug > safety > x86: kexec_file: Fix TOCTOU buffer overflow via memory region padding > arm64: kexec_file: Fix TOCTOU buffer overflow via memory region > padding > riscv: kexec_file: Fix TOCTOU buffer overflow via memory region > padding > LoongArch: kexec_file: Fix TOCTOU buffer overflow via memory region > padding > crash: Add crash_prepare_headers() to exclude crash kernel memory > arm64: kexec_file: Use crash_prepare_headers() helper to simplify code > x86: kexec_file: Use crash_prepare_headers() helper to simplify code > riscv: kexec_file: Use crash_prepare_headers() helper to simplify code > LoongArch: kexec_file: Use crash_prepare_headers() helper to simplify > code > powerpc/kexec_file: Use crash_exclude_core_ranges() helper > arm64: kexec_file: Add support for crashkernel CMA reservation > riscv: kexec_file: Add support for crashkernel CMA reservation > arm64: crash: Add crash hotplug support > > Sourabh Jain (1): > powerpc/crash: sort crash memory ranges before preparing elfcorehdr > > .../admin-guide/kernel-parameters.txt | 16 +- > arch/arm64/Kconfig | 3 + > arch/arm64/include/asm/kexec.h | 13 ++ > arch/arm64/kernel/Makefile | 2 +- > arch/arm64/kernel/crash.c | 152 ++++++++++++++++++ > 
arch/arm64/kernel/kexec_image.c | 34 ++++ > arch/arm64/kernel/machine_kexec_file.c | 78 ++------- > arch/arm64/mm/init.c | 5 +- > arch/loongarch/kernel/machine_kexec_file.c | 44 ++--- > arch/powerpc/include/asm/kexec_ranges.h | 1 - > arch/powerpc/kexec/crash.c | 7 +- > arch/powerpc/kexec/file_load_64.c | 3 + > arch/powerpc/kexec/ranges.c | 113 ++----------- > arch/riscv/kernel/machine_kexec_file.c | 43 ++--- > arch/riscv/mm/init.c | 5 +- > arch/x86/kernel/crash.c | 92 ++--------- > drivers/of/fdt.c | 9 +- > drivers/of/kexec.c | 9 ++ > include/linux/crash_core.h | 15 ++ > include/linux/crash_reserve.h | 4 +- > include/linux/kexec.h | 2 + > kernel/crash_core.c | 89 +++++++++- > kernel/crash_dump_dm_crypt.c | 4 +- > kernel/kexec_core.c | 25 +-- > 24 files changed, 430 insertions(+), 338 deletions(-) > create mode 100644 arch/arm64/kernel/crash.c > > -- > 2.34.1 > From baoquan.he at linux.dev Mon Jun 1 20:06:11 2026 From: baoquan.he at linux.dev (Baoquan He) Date: Tue, 2 Jun 2026 11:06:11 +0800 Subject: [PATCH v15 00/23] arm64/riscv: Add support for crashkernel CMA reservation In-Reply-To: <1a459706-80db-43d8-b163-76fc09da338d@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> <1a459706-80db-43d8-b163-76fc09da338d@huawei.com> Message-ID: On 06/02/26 at 09:43am, Jinjie Ruan wrote: > > > On 6/1/2026 9:40 PM, Baoquan He wrote: > > Hi Jinjie, > > > > On 06/01/26 at 05:47pm, Jinjie Ruan wrote: > > ...snip... > >> Changes in v15: > >> - Unify the subject prefix formats as Huacai suggested. > >> - Fix powerpc pre-existing NULL pointer dereference [Sashiko [1]] > >> - Fix powerpc pre-existing __merge_memory_ranges() memory range > >> truncation [Sashiko [1]]. > >> - Fix pre-existing arm64 CMA page leaks [Sashiko[2]]. > >> - Fix pre-existing crash_load_dm_crypt_keys() Use-After-Free and > >> Double Free issue [Sashiko[3]]. > >> - Fix vfree(headers) and uninitialized variables issue > >> and simplify the fix [Sashiko[2]]. 
> >> - As walk_system_ram_res() and for_each_mem_range() use different > >> lock, unify and simplify the fix of TOCTOU buffer overflow via memory > >> region padding [Sashiko[4]]. > >> - Fix the arm64 crash dump issues in Sashiko[5]. > >> - Link to v14: https://lore.kernel.org/all/20260525084932.934910-1-ruanjinjie at huawei.com/ > > > > Do these Fixes have anything with the main target of this patch series > > you mentioned in cover-letter:"arm64/riscv: Add support for crashkernel CMA"? > > The patches become more and more in each new version, I am wondering if > > it relies on these Fixes patches to implement your adding support for > > crashkernel CMA on arm64/risc-v. > > > > If not relying on them, could you split them into different patchset > > on different purpose? > > Hi Baoquan, > > Thank you for your valuable guidance. > > You are absolutely right. Most of these fix patches are indeed not > strictly related to the core implementation of the crashkernel CMA > support. They are pre-existing bugs in the surrounding kexec/crash code > that were flagged during our review. > > Previously, Andrew suggested taking a look at the code review comments > from the Sashiko AI system, which is why these fixes kept expanding. I > completely agree with your advice that there is no need to keep them > together. I will split them into two completely different patchsets > based on their purpose: > > 1. A cleaner version of this series, strictly focused on adding the core > crashkernel CMA support for arm64/riscv. > > 2. One standalone bugfix patchset dedicated entirely to fixing these > pre-existing issues. > > By the way, I would also appreciate some advice on how to handle further > AI reviews. It seems that the more code we touch or refactor to fix > these pre-existing issues, the more tangential bugs the AI flags in the > newly exposed areas, making the series extremely difficult to converge. 
> > Should I continue to address all AI-reported bugs associated with the > surrounding code in this series, or should we draw a strict line > and only focus on the core CMA logic moving forward? Then please post patches to focus on the core implementation of the crashkernel CMA support. If any AI reported bugs are raised but not relatd to it, you can add note in cover-letter or explain somewhere to tell whehter it's caused by the core code and how you want to deal with it. Otherwise, you could go round and round of new posting and still can't see when it ends up. From ruanjinjie at huawei.com Mon Jun 1 02:47:42 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:42 +0800 Subject: [PATCH v15 00/23] arm64/riscv: Add support for crashkernel CMA reservation Message-ID: <20260601094805.2928614-1-ruanjinjie@huawei.com> The crash memory allocation, and the exclude of crashk_res, crashk_low_res and crashk_cma memory are almost identical across different architectures, This patch set handle them in crash core in a general way, which eliminate a lot of duplication code. And add support for crashkernel CMA reservation for arm64 and riscv. Also add support for arm64 crash hotplug. This patch set is rebased on v7.1-rc1. Basic second kernel boot test were performed on QEMU platforms for x86, ARM64 and RISC-V architectures with the following parameters: "cma=256M crashkernel=4G crashkernel=64M,cma" For first kernel, there will be such log: # dmesg | grep crash [ 0.000000] crashkernel low memory reserved: 0xe8000000 - 0xf0000000 (128 MB) [ 0.000000] crashkernel reserved: 0x000000023e600000 - 0x000000033e600000 (4096 MB) [ 0.000000] crashkernel CMA reserved: 64 MB in 1 ranges # dmesg | grep cma [ 0.000000] cma: Reserved 256 MiB at 0x00000000f0000000 [ 0.000000] cma: Reserved 64 MiB at 0x0000000100000000 For second kernel, there will be such log: [ 0.000000] OF: fdt: Looking for usable-memory-range property... 
[ 0.000000] OF: fdt: cap_mem_regions[0]: base=0x000000023e600000, size=0x0000000100000000 [ 0.000000] OF: fdt: cap_mem_regions[1]: base=0x00000000e8000000, size=0x0000000008000000 [ 0.000000] OF: fdt: cap_mem_regions[2]: base=0x0000000100000000, size=0x0000000004000000 Changes in v15: - Unify the subject prefix formats as Huacai suggested. - Fix powerpc pre-existing NULL pointer dereference [Sashiko [1]] - Fix powerpc pre-existing __merge_memory_ranges() memory range truncation [Sashiko [1]]. - Fix pre-existing arm64 CMA page leaks [Sashiko[2]]. - Fix pre-existing crash_load_dm_crypt_keys() Use-After-Free and Double Free issue [Sashiko[3]]. - Fix vfree(headers) and uninitialized variables issue and simplify the fix [Sashiko[2]]. - As walk_system_ram_res() and for_each_mem_range() use different lock, unify and simplify the fix of TOCTOU buffer overflow via memory region padding [Sashiko[4]]. - Fix the arm64 crash dump issues in Sashiko[5]. - Link to v14: https://lore.kernel.org/all/20260525084932.934910-1-ruanjinjie at huawei.com/ [1]: https://lore.kernel.org/all/20260525092207.96B9D1F000E9 at smtp.kernel.org/ [2]: https://lore.kernel.org/all/20260525091149.1A1E01F00A3D at smtp.kernel.org/ [3]: https://lore.kernel.org/all/20260525105227.3C2421F000E9 at smtp.kernel.org/ [4]: https://lore.kernel.org/all/20260525095447.944E11F000E9 at smtp.kernel.org/ [5]: https://lore.kernel.org/all/20260525101746.9959D1F000E9 at smtp.kernel.org/ Changes in v14: - Fix image->elf_headers memory leak during retry loop for arm64 as Sashiko AI code review pointed out. - Solve the hotplug notifier arch_crash_handle_hotplug_event() AA self-deadlock problem as Sashiko AI code review pointed out. - Fix the TOCTOU issue in prepare_elf_headers() by get_online_mems(). - -ENOMEM -> -EAGAIN as Breno suggested. - Add support for arm64 crash hotplug. - Link to v13: https://lore.kernel.org/all/20260511030454.1730881-1-ruanjinjie at huawei.com/ Changes in v13: - Rebased on v7.1-rc1. 
- Update the commit message. - Add Reviewed-by. - Link to v12: https://lore.kernel.org/all/20260402072701.628293-1-ruanjinjie at huawei.com/ Changes in v12: - Remove the unused "nr_mem_ranges" for x86. - Add "Fix crashk_low_res not exclude bug" test log. - Provide a separate patch for each architecture for using crash_prepare_headers(), which will make the review more convenient. - Add Reviewed-by and Tested-by. - Link to v11: https://lore.kernel.org/all/20260328074013.3589544-1-ruanjinjie at huawei.com/ Changes in v11: - Avoid silently drop crash memory if the crash kernel is built without CONFIG_CMA. - Remove unnecessary "cmem->nr_ranges = 0" for arch_crash_populate_cmem() as we use kvzalloc(). - Provide a separate patch for each architecture to fix the existing buffer overflow issue. - Add Acked-bys for arm64. Changes in v10: - Fix crashk_low_res not excluded bug in the existing RISC-V code. - Fix an existing memory leak issue in the existing PowerPC code. - Fix the ordering issue of adding CMA ranges to "linux,usable-memory-range". - Fix an existing concurrency issue. A Concurrent memory hotplug may occur between reading memblock and attempting to fill cmem during kexec_load() for almost all existing architectures. - Link to v9: https://lore.kernel.org/all/20260323072745.2481719-1-ruanjinjie at huawei.com/ Changes in v9: - Collect Reviewed-by and Acked-by, and prepare for Sashiko AI review. - Link to v8: https://lore.kernel.org/all/20260302035315.3892241-1-ruanjinjie at huawei.com/ Changes in v8: - Fix the build issues reported by kernel test robot and Sourabh. - Link to v7: https://lore.kernel.org/all/20260226130437.1867658-1-ruanjinjie at huawei.com/ Changes in v7: - Correct the inclusion of CMA-reserved ranges for kdump kernel in of/kexec for arm64 and riscv. - Add Acked-by. - Link to v6: https://lore.kernel.org/all/20260224085342.387996-1-ruanjinjie at huawei.com/ Changes in v6: - Update the crash core exclude code as Mike suggested. - Rebased on v7.0-rc1. 
- Add acked-by. - Link to v5: https://lore.kernel.org/all/20260212101001.343158-1-ruanjinjie at huawei.com/ Jinjie Ruan (22): riscv: kexec_file: Fix crashk_low_res not exclude bug powerpc/crash: Fix possible memory leak in update_crash_elfcorehdr() powerpc/kexec_file: Fix NULL pointer dereference in kexec_extra_fdt_size_ppc64() powerpc/kexec_file: Fix memory range truncation in __merge_memory_ranges() kexec: Extract kexec_free_segment_cma() from kimage_free_cma() arm64: kexec_file: Fix CMA page leaks during segment placement retry loops arm64: kexec_file: Fix image->elf_headers memory leak during retry loop kexec: Fix UAF and Double Free in crash_load_dm_crypt_keys() crash_core: Introduce CRASH_HOTPLUG_SAFETY_PADDING for memory hotplug safety x86: kexec_file: Fix TOCTOU buffer overflow via memory region padding arm64: kexec_file: Fix TOCTOU buffer overflow via memory region padding riscv: kexec_file: Fix TOCTOU buffer overflow via memory region padding LoongArch: kexec_file: Fix TOCTOU buffer overflow via memory region padding crash: Add crash_prepare_headers() to exclude crash kernel memory arm64: kexec_file: Use crash_prepare_headers() helper to simplify code x86: kexec_file: Use crash_prepare_headers() helper to simplify code riscv: kexec_file: Use crash_prepare_headers() helper to simplify code LoongArch: kexec_file: Use crash_prepare_headers() helper to simplify code powerpc/kexec_file: Use crash_exclude_core_ranges() helper arm64: kexec_file: Add support for crashkernel CMA reservation riscv: kexec_file: Add support for crashkernel CMA reservation arm64: crash: Add crash hotplug support Sourabh Jain (1): powerpc/crash: sort crash memory ranges before preparing elfcorehdr .../admin-guide/kernel-parameters.txt | 16 +- arch/arm64/Kconfig | 3 + arch/arm64/include/asm/kexec.h | 13 ++ arch/arm64/kernel/Makefile | 2 +- arch/arm64/kernel/crash.c | 152 ++++++++++++++++++ arch/arm64/kernel/kexec_image.c | 34 ++++ arch/arm64/kernel/machine_kexec_file.c | 78 ++------- 
arch/arm64/mm/init.c | 5 +- arch/loongarch/kernel/machine_kexec_file.c | 44 ++--- arch/powerpc/include/asm/kexec_ranges.h | 1 - arch/powerpc/kexec/crash.c | 7 +- arch/powerpc/kexec/file_load_64.c | 3 + arch/powerpc/kexec/ranges.c | 113 ++----------- arch/riscv/kernel/machine_kexec_file.c | 43 ++--- arch/riscv/mm/init.c | 5 +- arch/x86/kernel/crash.c | 92 ++--------- drivers/of/fdt.c | 9 +- drivers/of/kexec.c | 9 ++ include/linux/crash_core.h | 15 ++ include/linux/crash_reserve.h | 4 +- include/linux/kexec.h | 2 + kernel/crash_core.c | 89 +++++++++- kernel/crash_dump_dm_crypt.c | 4 +- kernel/kexec_core.c | 25 +-- 24 files changed, 430 insertions(+), 338 deletions(-) create mode 100644 arch/arm64/kernel/crash.c -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:44 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:44 +0800 Subject: [PATCH v15 02/23] powerpc/crash: Fix possible memory leak in update_crash_elfcorehdr() In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-3-ruanjinjie@huawei.com> In get_crash_memory_ranges(), if crash_exclude_mem_range() fails after realloc_mem_ranges() has successfully allocated the cmem memory, the function returns an error but leaves cmem pointing to the allocated memory. The caller, update_crash_elfcorehdr(), then returns without freeing it, which causes a memory leak. Use goto out on this error path so that cmem is freed.
Cc: Sourabh Jain Cc: Hari Bathini Cc: Michael Ellerman Fixes: 849599b702ef ("powerpc/crash: add crash memory hotplug support") Reviewed-by: Sourabh Jain Signed-off-by: Jinjie Ruan --- arch/powerpc/kexec/crash.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c index e6539f213b3d..a520f851c3a6 100644 --- a/arch/powerpc/kexec/crash.c +++ b/arch/powerpc/kexec/crash.c @@ -502,7 +502,7 @@ static void update_crash_elfcorehdr(struct kimage *image, struct memory_notify * ret = get_crash_memory_ranges(&cmem); if (ret) { pr_err("Failed to get crash mem range\n"); - return; + goto out; } /* -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:45 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:45 +0800 Subject: [PATCH v15 03/23] powerpc/kexec_file: Fix NULL pointer dereference in kexec_extra_fdt_size_ppc64() In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-4-ruanjinjie@huawei.com> A static Sashiko AI review identified a potential NULL pointer dereference in kexec_extra_fdt_size_ppc64(). When get_reserved_memory_ranges() successfully returns 0 on platforms without any reserved memory regions, the allocated 'rmem' pointer remains NULL. Passing this unallocated pointer directly to kexec_extra_fdt_size_ppc64() leads to a kernel panic when evaluating 'rmem->nr_ranges'. Fix this by adding a defensive NULL pointer check at the beginning of kexec_extra_fdt_size_ppc64(), returning 0 extra space immediately if no reserved memory structure exists. 
Cc: Sourabh Jain Cc: Hari Bathini Cc: Michael Ellerman Cc: stable at vger.kernel.org Fixes: 0d3ff067331e ("powerpc/kexec_file: fix extra size calculation for kexec FDT") Signed-off-by: Jinjie Ruan --- arch/powerpc/kexec/file_load_64.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c index 8c72e12ea44e..fdeedf102c38 100644 --- a/arch/powerpc/kexec/file_load_64.c +++ b/arch/powerpc/kexec/file_load_64.c @@ -649,6 +649,9 @@ unsigned int kexec_extra_fdt_size_ppc64(struct kimage *image, struct crash_mem * struct device_node *dn; unsigned int cpu_nodes = 0, extra_size = 0; + if (!rmem) + return 0; + // Budget some space for the password blob. There's already extra space // for the key name if (plpks_is_available()) -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:46 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:46 +0800 Subject: [PATCH v15 04/23] powerpc/kexec_file: Fix memory range truncation in __merge_memory_ranges() In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-5-ruanjinjie@huawei.com> Sashiko AI review pointed out the following issue. The __merge_memory_ranges() function incorrectly handles overlapping memory ranges when merging them. Although sort_memory_ranges() sorts all ranges by their start address in ascending order beforehand, the merge logic remains defective in two ways: 1. It compares the current range's start against the previous element (i-1) instead of the running target index (idx) 2. It unconditionally overwrites 'ranges[idx].end' with 'ranges[i].end'. This logic flaw leads to critical memory truncation when a larger memory range completely subsumes subsequent smaller ranges. 
For example, consider a sorted input array with three ranges: Range A (idx=0): [0x1000 - 0x9000] Range B (i=1): [0x2000 - 0x5000] (completely inside Range A) Range C (i=2): [0x6000 - 0x8000] (completely inside Range A) 1. When i=1 (Range B): ranges[1].start (0x2000) <= ranges[0].end + 1 (0x9001) is TRUE. The code executes: ranges[0].end = ranges[1].end, which erroneously shrinks Range A's end from 0x9000 down to 0x5000. 2. When i=2 (Range C): ranges[2].start (0x6000) <= ranges[1].end + 1 (0x5001) is FALSE. The code falls into the else block, creating a broken new range. As a result, valid memory fragments [0x5001 - 0x5fff] and [0x8001 - 0x9000] are completely lost from the kexec exclude lists, potentially allowing the crash kernel to overwrite active memory, causing data corruption or crashes. Fix this by ensuring the start of the current range is compared against the end of the active merged range (idx), and use max() to safely prevent the outer boundary from being truncated. Cc: Sourabh Jain Cc: Hari Bathini Cc: Michael Ellerman Cc: stable at vger.kernel.org Fixes: 180adfc532a8 ("powerpc/kexec_file: Add helper functions for getting memory ranges") Signed-off-by: Jinjie Ruan --- arch/powerpc/kexec/ranges.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c index 867135560e5c..eb45e89502ca 100644 --- a/arch/powerpc/kexec/ranges.c +++ b/arch/powerpc/kexec/ranges.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -105,19 +106,16 @@ static void __merge_memory_ranges(struct crash_mem *mem_rngs) struct range *ranges; int i, idx; - if (!mem_rngs) + if (!mem_rngs || mem_rngs->nr_ranges <= 1) return; idx = 0; - ranges = &(mem_rngs->ranges[0]); + ranges = mem_rngs->ranges; for (i = 1; i < mem_rngs->nr_ranges; i++) { - if (ranges[i].start <= (ranges[i-1].end + 1)) - ranges[idx].end = ranges[i].end; + if (ranges[i].start <= (ranges[idx].end + 1)) + 
ranges[idx].end = max(ranges[idx].end, ranges[i].end); else { idx++; - if (i == idx) - continue; - ranges[idx] = ranges[i]; } } -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:43 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:43 +0800 Subject: [PATCH v15 01/23] riscv: kexec_file: Fix crashk_low_res not exclude bug In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-2-ruanjinjie@huawei.com> As done in commit 944a45abfabc ("arm64: kdump: Reimplement crashkernel=X") and commit 4831be702b95 ("arm64/kexec: Fix missing extra range for crashkres_low.") for arm64, while implementing crashkernel=X,[high,low], riscv should have excluded the "crashk_low_res" reserved ranges from the crash kernel memory to prevent them from being exported through /proc/vmcore, and the exclusion would need an extra crash_mem range. Just simply tested on qemu with crashkernel=4G with kexec in [1] mentioned in [2]. And the second kernel can be started normally. 
# dmesg | grep crash [ 0.000000] crashkernel low memory reserved: 0xf8000000 - 0x100000000 (128 MB) [ 0.000000] crashkernel reserved: 0x000000017fe00000 - 0x000000027fe00000 (4096 MB) Cc: Guo Ren Cc: Baoquan He [1]: https://github.com/chenjh005/kexec-tools/tree/build-test-riscv-v2 [2]: https://lore.kernel.org/all/20230726175000.2536220-1-chenjiahao16 at huawei.com/ Fixes: 5882e5acf18d ("riscv: kdump: Implement crashkernel=X,[high,low]") Reviewed-by: Guo Ren Signed-off-by: Jinjie Ruan --- arch/riscv/kernel/machine_kexec_file.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/arch/riscv/kernel/machine_kexec_file.c b/arch/riscv/kernel/machine_kexec_file.c index 54e2d9552e93..3f7766057cac 100644 --- a/arch/riscv/kernel/machine_kexec_file.c +++ b/arch/riscv/kernel/machine_kexec_file.c @@ -61,7 +61,7 @@ static int prepare_elf_headers(void **addr, unsigned long *sz) unsigned int nr_ranges; int ret; - nr_ranges = 1; /* For exclusion of crashkernel region */ + nr_ranges = 2; /* For exclusion of crashkernel region */ walk_system_ram_res(0, -1, &nr_ranges, get_nr_ram_ranges_callback); cmem = kmalloc_flex(*cmem, ranges, nr_ranges); @@ -76,8 +76,16 @@ static int prepare_elf_headers(void **addr, unsigned long *sz) /* Exclude crashkernel region */ ret = crash_exclude_mem_range(cmem, crashk_res.start, crashk_res.end); - if (!ret) - ret = crash_prepare_elf64_headers(cmem, true, addr, sz); + if (ret) + goto out; + + if (crashk_low_res.end) { + ret = crash_exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end); + if (ret) + goto out; + } + + ret = crash_prepare_elf64_headers(cmem, true, addr, sz); out: kfree(cmem); -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:47 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:47 +0800 Subject: [PATCH v15 05/23] powerpc/crash: sort crash memory ranges before preparing elfcorehdr In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: 
<20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-6-ruanjinjie@huawei.com> From: Sourabh Jain During a memory hot-remove event, the elfcorehdr is rebuilt to exclude the removed memory. While updating the crash memory ranges for this operation, the crash memory ranges array can become unsorted. This happens because remove_mem_range() may split a memory range into two parts and append the higher-address part as a separate range at the end of the array. So far, no issues have been observed due to the unsorted crash memory ranges. However, this could lead to problems once crash memory range removal is handled by generic code, as introduced in the upcoming patches in this series. Currently, powerpc uses a platform-specific function, remove_mem_range(), to exclude hot-removed memory from the crash memory ranges. This function performs the same task as the generic crash_exclude_mem_range() in crash_core.c. The generic helper also ensures that the crash memory ranges remain sorted. So remove the redundant powerpc-specific implementation and instead call crash_exclude_mem_range_guarded() (which internally calls crash_exclude_mem_range()) to exclude the hot-removed memory ranges. 
Cc: Andrew Morton Cc: Baoquan he Cc: Jinjie Ruan Cc: Hari Bathini Cc: Madhavan Srinivasan Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: linux-kernel at vger.kernel.org Acked-by: Baoquan He Reviewed-by: Ritesh Harjani (IBM) Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Sourabh Jain Signed-off-by: Jinjie Ruan --- arch/powerpc/include/asm/kexec_ranges.h | 4 +- arch/powerpc/kexec/crash.c | 5 +- arch/powerpc/kexec/ranges.c | 87 +------------------------ 3 files changed, 7 insertions(+), 89 deletions(-) diff --git a/arch/powerpc/include/asm/kexec_ranges.h b/arch/powerpc/include/asm/kexec_ranges.h index 14055896cbcb..ad95e3792d10 100644 --- a/arch/powerpc/include/asm/kexec_ranges.h +++ b/arch/powerpc/include/asm/kexec_ranges.h @@ -7,7 +7,9 @@ void sort_memory_ranges(struct crash_mem *mrngs, bool merge); struct crash_mem *realloc_mem_ranges(struct crash_mem **mem_ranges); int add_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size); -int remove_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size); +int crash_exclude_mem_range_guarded(struct crash_mem **mem_ranges, + unsigned long long mstart, + unsigned long long mend); int get_exclude_memory_ranges(struct crash_mem **mem_ranges); int get_reserved_memory_ranges(struct crash_mem **mem_ranges); int get_crash_memory_ranges(struct crash_mem **mem_ranges); diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c index a520f851c3a6..d634db67becc 100644 --- a/arch/powerpc/kexec/crash.c +++ b/arch/powerpc/kexec/crash.c @@ -493,7 +493,7 @@ static void update_crash_elfcorehdr(struct kimage *image, struct memory_notify * struct crash_mem *cmem = NULL; struct kexec_segment *ksegment; void *ptr, *mem, *elfbuf = NULL; - unsigned long elfsz, memsz, base_addr, size; + unsigned long elfsz, memsz, base_addr, size, end; ksegment = &image->segment[image->elfcorehdr_index]; mem = (void *) ksegment->mem; @@ -512,7 +512,8 @@ static void update_crash_elfcorehdr(struct 
kimage *image, struct memory_notify * if (image->hp_action == KEXEC_CRASH_HP_REMOVE_MEMORY) { base_addr = PFN_PHYS(mn->start_pfn); size = mn->nr_pages * PAGE_SIZE; - ret = remove_mem_range(&cmem, base_addr, size); + end = base_addr + size - 1; + ret = crash_exclude_mem_range_guarded(&cmem, base_addr, end); if (ret) { pr_err("Failed to remove hot-unplugged memory from crash memory ranges\n"); goto out; diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c index eb45e89502ca..b2fb78562cdc 100644 --- a/arch/powerpc/kexec/ranges.c +++ b/arch/powerpc/kexec/ranges.c @@ -551,7 +551,7 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) #endif /* CONFIG_KEXEC_FILE */ #ifdef CONFIG_CRASH_DUMP -static int crash_exclude_mem_range_guarded(struct crash_mem **mem_ranges, +int crash_exclude_mem_range_guarded(struct crash_mem **mem_ranges, unsigned long long mstart, unsigned long long mend) { @@ -639,89 +639,4 @@ int get_crash_memory_ranges(struct crash_mem **mem_ranges) pr_err("Failed to setup crash memory ranges\n"); return ret; } - -/** - * remove_mem_range - Removes the given memory range from the range list. - * @mem_ranges: Range list to remove the memory range to. - * @base: Base address of the range to remove. - * @size: Size of the memory range to remove. - * - * (Re)allocates memory, if needed. - * - * Returns 0 on success, negative errno on error. - */ -int remove_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size) -{ - u64 end; - int ret = 0; - unsigned int i; - u64 mstart, mend; - struct crash_mem *mem_rngs = *mem_ranges; - - if (!size) - return 0; - - /* - * Memory range are stored as start and end address, use - * the same format to do remove operation. 
- */ - end = base + size - 1; - - for (i = 0; i < mem_rngs->nr_ranges; i++) { - mstart = mem_rngs->ranges[i].start; - mend = mem_rngs->ranges[i].end; - - /* - * Memory range to remove is not part of this range entry - * in the memory range list - */ - if (!(base >= mstart && end <= mend)) - continue; - - /* - * Memory range to remove is equivalent to this entry in the - * memory range list. Remove the range entry from the list. - */ - if (base == mstart && end == mend) { - for (; i < mem_rngs->nr_ranges - 1; i++) { - mem_rngs->ranges[i].start = mem_rngs->ranges[i+1].start; - mem_rngs->ranges[i].end = mem_rngs->ranges[i+1].end; - } - mem_rngs->nr_ranges--; - goto out; - } - /* - * Start address of the memory range to remove and the - * current memory range entry in the list is same. Just - * move the start address of the current memory range - * entry in the list to end + 1. - */ - else if (base == mstart) { - mem_rngs->ranges[i].start = end + 1; - goto out; - } - /* - * End address of the memory range to remove and the - * current memory range entry in the list is same. - * Just move the end address of the current memory - * range entry in the list to base - 1. - */ - else if (end == mend) { - mem_rngs->ranges[i].end = base - 1; - goto out; - } - /* - * Memory range to remove is not at the edge of current - * memory range entry. Split the current memory entry into - * two half. 
- */ - else { - size = mem_rngs->ranges[i].end - end + 1; - mem_rngs->ranges[i].end = base - 1; - ret = add_mem_range(mem_ranges, end + 1, size); - } - } -out: - return ret; -} #endif /* CONFIG_CRASH_DUMP */ -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:48 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:48 +0800 Subject: [PATCH v15 06/23] kexec: Extract kexec_free_segment_cma() from kimage_free_cma() In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-7-ruanjinjie@huawei.com> The generic kimage_free_cma() relies on `image->nr_segments` to iterate and free allocated CMA pages. However, during architecture-specific segment placement retry loops (e.g., arm64's image_load()), a mid-way failure will truncate `image->nr_segments` back to its initial value. This truncation permanently hides any CMA pages allocated outside the new boundary from global cleanup, causing silent background memory leaks. To allow architecture-specific loaders to execute fine-grained memory reclamation before truncation occurs, extract the single-pass CMA release logic into a dedicated and exported helper: void kexec_free_segment_cma(struct kimage *image, unsigned long idx); Refactor the main kimage_free_cma() to invoke this helper sequentially to maintain backward compatibility while expanding single-slot flexibility. 
Signed-off-by: Jinjie Ruan --- include/linux/kexec.h | 2 ++ kernel/kexec_core.c | 25 ++++++++++++++----------- 2 files changed, 16 insertions(+), 11 deletions(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 8a22bc9b8c6c..6f1eabda0300 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -532,6 +532,7 @@ extern bool kexec_file_dbg_print; extern void *kimage_map_segment(struct kimage *image, int idx); extern void kimage_unmap_segment(void *buffer); +extern void kexec_free_segment_cma(struct kimage *image, unsigned long idx); #else /* !CONFIG_KEXEC_CORE */ struct pt_regs; struct task_struct; @@ -543,6 +544,7 @@ static inline int kexec_crash_loaded(void) { return 0; } static inline void *kimage_map_segment(struct kimage *image, int idx) { return NULL; } static inline void kimage_unmap_segment(void *buffer) { } +static inline void kexec_free_segment_cma(struct kimage *image, unsigned long idx) { } #define kexec_in_progress false #endif /* CONFIG_KEXEC_CORE */ diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index a43d2da0fe3e..9195f81e53c4 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -554,22 +554,25 @@ static void kimage_free_entry(kimage_entry_t entry) kimage_free_pages(page); } -static void kimage_free_cma(struct kimage *image) +void kexec_free_segment_cma(struct kimage *image, unsigned long idx) { - unsigned long i; + u32 nr_pages = image->segment[idx].memsz >> PAGE_SHIFT; + struct page *cma = image->segment_cma[idx]; - for (i = 0; i < image->nr_segments; i++) { - struct page *cma = image->segment_cma[i]; - u32 nr_pages = image->segment[i].memsz >> PAGE_SHIFT; + if (!cma) + return; - if (!cma) - continue; + arch_kexec_pre_free_pages(page_address(cma), nr_pages); + dma_release_from_contiguous(NULL, cma, nr_pages); + image->segment_cma[idx] = NULL; +} - arch_kexec_pre_free_pages(page_address(cma), nr_pages); - dma_release_from_contiguous(NULL, cma, nr_pages); - image->segment_cma[i] = NULL; - } +static void 
kimage_free_cma(struct kimage *image) +{ + unsigned long i; + for (i = 0; i < image->nr_segments; i++) + kexec_free_segment_cma(image, i); } void kimage_free(struct kimage *image) -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:49 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:49 +0800 Subject: [PATCH v15 07/23] arm64: kexec_file: Fix CMA page leaks during segment placement retry loops In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-8-ruanjinjie@huawei.com> Sashiko AI code review pointed out a CMA page leak in the arm64 kexec image placement retry loop. In image_load(), the loader repeatedly attempts to find a suitable memory hole for the kernel and its associated segments (initrd, dtb, etc.). When a placement attempt fails midway, the core framework rolls back `image->nr_segments` to its initial state to purge the failed segments logically. However, this truncation causes a memory leak. Any CMA pages successfully allocated via kexec_add_buffer() during the failed attempt are recorded in the `image->segment_cma` array. Since the subsequent global kimage_free_cma() cleanup only iterates up to the truncated (smaller) `nr_segments` boundary, the CMA pages recorded beyond the new boundary are orphaned and permanently leaked. Fix this by using the newly introduced generic kexec_free_segment_cma() helper to reclaim memory per segment before any truncation occurs: 1. In image_load(), explicitly invoke kexec_free_segment_cma() to release the CMA buffer allocated for the current failed kernel segment before decrementing `image->nr_segments`. 2. In the error path of load_other_segments(), iterate backward from the failed segment index down to `orig_segments`, sequentially freeing each orphan CMA segment allocation before restoring the initial segment count.
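The two steps above can be sketched as a small user-space model. This is only an illustration under stated assumptions: `struct toy_image`, `toy_free_segment()`, `toy_rollback()` and `toy_live_slots()` are hypothetical names, and malloc()/free() stand in for the CMA allocator. The point it shows is that each slot is released before the segment count shrinks, so no allocation is left beyond the new boundary.

```c
#include <stdlib.h>

#define MAX_SEGMENTS 16

/* Hypothetical stand-in for struct kimage; only the fields the sketch needs. */
struct toy_image {
	unsigned long nr_segments;
	void *segment_cma[MAX_SEGMENTS];	/* per-slot allocation, NULL if none */
};

/* Models kexec_free_segment_cma(): release one slot and clear it. */
static void toy_free_segment(struct toy_image *image, unsigned long idx)
{
	if (!image->segment_cma[idx])
		return;
	free(image->segment_cma[idx]);
	image->segment_cma[idx] = NULL;
}

/* Models the fixed out_err path: free each slot, then shrink the count. */
static void toy_rollback(struct toy_image *image, unsigned long orig_segments)
{
	while (image->nr_segments > orig_segments) {
		toy_free_segment(image, image->nr_segments - 1);
		image->nr_segments--;
	}
}

/* Counts slots that still hold an allocation, including hidden ones. */
static unsigned long toy_live_slots(const struct toy_image *image)
{
	unsigned long i, live = 0;

	for (i = 0; i < MAX_SEGMENTS; i++)
		if (image->segment_cma[i])
			live++;
	return live;
}
```

Replacing the loop in toy_rollback() with a plain `image->nr_segments = orig_segments;` would leave toy_live_slots() larger than nr_segments, which is the leak described in this patch.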
This guarantees that all temporary CMA pages allocated during placement failures are cleanly returned to the contiguous memory allocator, eliminating silent background memory leaks across all retry paths. Cc: Catalin Marinas Cc: Will Deacon Cc: Breno Leitao Cc: Pratyush Yadav Cc: Andrew Morton Cc: Yeoreum Yun Cc: Kees Cook Cc: "Rob Herring (Arm)" Cc: Baoquan He Cc: Coiby Xu Cc: Alexander Graf Cc: Pasha Tatashin Cc: stable at vger.kernel.org Fixes: 07d24902977e4 ("kexec: enable CMA based contiguous allocation") Signed-off-by: Jinjie Ruan --- arch/arm64/kernel/kexec_image.c | 1 + arch/arm64/kernel/machine_kexec_file.c | 5 ++++- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c index b70f4df15a1a..ffcb7f9075e6 100644 --- a/arch/arm64/kernel/kexec_image.c +++ b/arch/arm64/kernel/kexec_image.c @@ -107,6 +107,7 @@ static void *image_load(struct kimage *image, * We couldn't find space for the other segments; erase the * kernel segment and try the next available hole. 
*/ + kexec_free_segment_cma(image, kernel_segment_number); image->nr_segments -= 1; kbuf.buf_min = kernel_segment->mem + kernel_segment->memsz; kbuf.mem = KEXEC_BUF_MEM_UNKNOWN; diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c index e31fabed378a..13c247c28866 100644 --- a/arch/arm64/kernel/machine_kexec_file.c +++ b/arch/arm64/kernel/machine_kexec_file.c @@ -195,7 +195,10 @@ int load_other_segments(struct kimage *image, return 0; out_err: - image->nr_segments = orig_segments; + while (image->nr_segments > orig_segments) { + kexec_free_segment_cma(image, image->nr_segments - 1); + image->nr_segments--; + } kvfree(dtb); return ret; } -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:50 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:50 +0800 Subject: [PATCH v15 08/23] arm64: kexec_file: Fix image->elf_headers memory leak during retry loop In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-9-ruanjinjie@huawei.com> Sashiko AI code review pointed out a potential memory leak of image->elf_headers when load_other_segments() fails on error paths. In the arm64 kexec_file file-load path, kexec_image.c runs a retry loop calling kexec_add_buffer() to find a suitable location for the kernel segment. On each iteration, load_other_segments() is invoked to allocate and populate alternative segments such as initrd, DTB, and ELF headers. However, if a placement or allocation failure occurs later in load_other_segments() (e.g., when adding initrd or dtb), the execution jumps to the out_err label. While this path restores image->nr_segments via orig_segments, it returns an error back to the caller without freeing the previously allocated image->elf_headers vmalloc buffer. 
As a result, the retry loop in image_load() unconditionally allocates new ELF headers on the next iteration and overwrites image->elf_headers, permanently leaking the memory blocks allocated in previous iterations. To fix this, decouple the ELF header allocation from the target-seeking retry loop. Since the contents and size of the ELF headers only depend on the host memory layout and do not change with the kernel's physical placement, move prepare_elf_headers() out of the while retry loop in image_load() so that it runs once, before the loop. If kexec_add_buffer() fails for the ELF headers, there is no need to vfree() them in load_other_segments(), because the error path frees `image->elf_headers` by calling arch_kimage_file_post_load_cleanup(). This eliminates redundant memory allocation/deallocation during kexec placement retries and removes the use-after-free and memory leak risk. Also remove the prepare_elf_headers() call from inside load_other_segments() and have it directly reuse the single, pre-allocated image->elf_headers. Cc: Catalin Marinas Cc: Will Deacon Cc: Thomas Huth Cc: Breno Leitao Cc: Andrew Morton Cc: Yeoreum Yun Cc: Coiby Xu Cc: Baoquan He Cc: Kees Cook Cc: Benjamin Gwin Cc: stable at vger.kernel.org Fixes: 108aa503657e ("arm64: kexec_file: try more regions if loading segments fails") Signed-off-by: Jinjie Ruan --- v15: - Use image->elf_headers and image->elf_headers_sz instead of adding function parameters for load_other_segments() to simplify the fix.
--- arch/arm64/include/asm/kexec.h | 1 + arch/arm64/kernel/kexec_image.c | 16 ++++++++++++++++ arch/arm64/kernel/machine_kexec_file.c | 23 +++++------------------ 3 files changed, 22 insertions(+), 18 deletions(-) diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h index 892e5bebda95..7ffa2ff5fcfd 100644 --- a/arch/arm64/include/asm/kexec.h +++ b/arch/arm64/include/asm/kexec.h @@ -128,6 +128,7 @@ extern int load_other_segments(struct kimage *image, unsigned long kernel_load_addr, unsigned long kernel_size, char *initrd, unsigned long initrd_len, char *cmdline); +extern int prepare_elf_headers(void **addr, unsigned long *sz); #endif #endif /* __ASSEMBLER__ */ diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c index ffcb7f9075e6..424b9527db09 100644 --- a/arch/arm64/kernel/kexec_image.c +++ b/arch/arm64/kernel/kexec_image.c @@ -89,6 +89,22 @@ static void *image_load(struct kimage *image, kernel_segment_number = image->nr_segments; +#ifdef CONFIG_CRASH_DUMP + if (image->type == KEXEC_TYPE_CRASH) { + /* load elf core header */ + unsigned long headers_sz; + void *headers; + + ret = prepare_elf_headers(&headers, &headers_sz); + if (ret) { + pr_err("Preparing elf core header failed\n"); + return ERR_PTR(ret); + } + image->elf_headers = headers; + image->elf_headers_sz = headers_sz; + } +#endif + /* * The location of the kernel segment may make it impossible to satisfy * the other segment requirements, so we try repeatedly to find a diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c index 13c247c28866..4cbb71e1f8ed 100644 --- a/arch/arm64/kernel/machine_kexec_file.c +++ b/arch/arm64/kernel/machine_kexec_file.c @@ -40,7 +40,7 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image) } #ifdef CONFIG_CRASH_DUMP -static int prepare_elf_headers(void **addr, unsigned long *sz) +int prepare_elf_headers(void **addr, unsigned long *sz) { struct crash_mem *cmem; unsigned int 
nr_ranges; @@ -105,32 +105,19 @@ int load_other_segments(struct kimage *image, kbuf.buf_min = kernel_load_addr + kernel_size; #ifdef CONFIG_CRASH_DUMP - /* load elf core header */ - void *headers; - unsigned long headers_sz; if (image->type == KEXEC_TYPE_CRASH) { - ret = prepare_elf_headers(&headers, &headers_sz); - if (ret) { - pr_err("Preparing elf core header failed\n"); - goto out_err; - } - - kbuf.buffer = headers; - kbuf.bufsz = headers_sz; + kbuf.buffer = image->elf_headers; + kbuf.bufsz = image->elf_headers_sz; kbuf.mem = KEXEC_BUF_MEM_UNKNOWN; - kbuf.memsz = headers_sz; + kbuf.memsz = image->elf_headers_sz; kbuf.buf_align = SZ_64K; /* largest supported page size */ kbuf.buf_max = ULONG_MAX; kbuf.top_down = true; ret = kexec_add_buffer(&kbuf); - if (ret) { - vfree(headers); + if (ret) goto out_err; - } - image->elf_headers = headers; image->elf_load_addr = kbuf.mem; - image->elf_headers_sz = headers_sz; kexec_dprintk("Loaded elf core header at 0x%lx bufsz=0x%lx memsz=0x%lx\n", image->elf_load_addr, kbuf.bufsz, kbuf.memsz); -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:51 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:51 +0800 Subject: [PATCH v15 09/23] kexec: Fix UAF and Double Free in crash_load_dm_crypt_keys() In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-10-ruanjinjie@huawei.com> A static memory safety review by Sashiko AI identified a high-severity Use-After-Free (UAF) and Double Free vulnerability in the dm-crypt keys handling path during arm64 kexec image placement retry loops. In crash_load_dm_crypt_keys(), when the segment allocation fails via kexec_add_buffer(), the error path invokes `kvfree((void *)kbuf.buffer)` to reclaim the keys buffer. However, the global pointer `keys_header` is left dangling with a stale address, creating an insecure memory trap. 
When the top-level loader image_load() retries the next available placement hole, crash_load_dm_crypt_keys() is re-entered. Since `is_dm_key_reused` is a read-only global configuration managed by user-space configfs, it cannot be mutated by the kernel. If it remains true, the loader skips build_keys_header() and blindly reuses the stale `keys_header` pointer for kbuf.buffer, triggering a severe Use-After-Free or a Null pointer dereference during kexec_add_buffer(). Alternatively, a new headers build can trigger a recursive Double Free inside build_keys_header(). Fix this by setting the global `keys_header` to NULL immediately after it is freed in the failure path. Concurrently, upgrade the header regeneration check to a composite condition: `if (!is_dm_key_reused || !keys_header)` This ensures that if a previous retry attempt wiped the buffer, the kernel will automatically and safely trigger a fresh header regeneration internally without modifying the user-configured `is_dm_key_reused` state flag, achieving absolute data consistency and memory safety across all retry paths. 
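A minimal user-space sketch of the fix described above, assuming a much simplified model: `toy_keys_header`, `toy_key_reused`, `toy_build_keys_header()` and `toy_load_keys()` are hypothetical stand-ins, free() replaces kvfree(), and the `add_buffer_fails` parameter simulates a failing kexec_add_buffer(). It is also an assumption of this sketch that the header builder releases any previous buffer before allocating a new one.

```c
#include <stdbool.h>
#include <stdlib.h>

static char *toy_keys_header;		/* models the global keys_header */
static bool toy_key_reused = true;	/* models is_dm_key_reused; never written here */

/* Models build_keys_header(): drop any old buffer and allocate a fresh one. */
static int toy_build_keys_header(void)
{
	free(toy_keys_header);
	toy_keys_header = calloc(1, 32);
	return toy_keys_header ? 0 : -1;
}

/*
 * Models crash_load_dm_crypt_keys(). add_buffer_fails simulates a failing
 * kexec_add_buffer() call.
 */
static int toy_load_keys(bool add_buffer_fails)
{
	/* Rebuild when reuse is off, or when a failed attempt wiped the buffer. */
	if (!toy_key_reused || !toy_keys_header) {
		if (toy_build_keys_header())
			return -1;
	}

	if (add_buffer_fails) {
		free(toy_keys_header);
		toy_keys_header = NULL;	/* without this, a retry reuses freed memory */
		return -1;
	}
	return 0;
}
```

Because the pointer is cleared on failure, a retry with the reuse flag still set falls into the rebuild branch instead of touching the freed buffer, and the flag itself is never modified.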
Cc: Andrew Morton Cc: Baoquan He Cc: Mike Rapoport Cc: Pasha Tatashin Cc: Pratyush Yadav Cc: Dave Young Cc: stable at vger.kernel.org Fixes: e3a84be1ec2f ("arm64,ppc64le/kdump: pass dm-crypt keys to kdump kernel") Signed-off-by: Jinjie Ruan --- kernel/crash_dump_dm_crypt.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/crash_dump_dm_crypt.c b/kernel/crash_dump_dm_crypt.c index cb875ddb6ba6..2c5462876337 100644 --- a/kernel/crash_dump_dm_crypt.c +++ b/kernel/crash_dump_dm_crypt.c @@ -412,13 +412,12 @@ int crash_load_dm_crypt_keys(struct kimage *image) }; int r; - if (key_count <= 0) { kexec_dprintk("No dm-crypt keys\n"); return 0; } - if (!is_dm_key_reused) { + if (!is_dm_key_reused || unlikely(!keys_header)) { image->dm_crypt_keys_addr = 0; r = build_keys_header(); if (r) { @@ -437,6 +436,7 @@ int crash_load_dm_crypt_keys(struct kimage *image) if (r) { pr_err("Failed to call kexec_add_buffer, ret=%d\n", r); kvfree((void *)kbuf.buffer); + keys_header = NULL; return r; } image->dm_crypt_keys_addr = kbuf.mem; -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:52 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:52 +0800 Subject: [PATCH v15 10/23] crash_core: Introduce CRASH_HOTPLUG_SAFETY_PADDING for memory hotplug safety In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-11-ruanjinjie@huawei.com> Introduce CRASH_HOTPLUG_SAFETY_PADDING to allocate extra slots for the crash memory ranges array, mitigating potential TOCTOU races caused by concurrent memory hotplug events. When CONFIG_MEMORY_HOTPLUG is disabled, the padding safely defaults to 0 as the memory layout remains static. 
Signed-off-by: Jinjie Ruan --- include/linux/crash_core.h | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index c1dee3f971a9..d4762e000098 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -14,6 +14,12 @@ struct crash_mem { struct range ranges[] __counted_by(max_nr_ranges); }; +#ifdef CONFIG_MEMORY_HOTPLUG +#define CRASH_HOTPLUG_SAFETY_PADDING 128 +#else +#define CRASH_HOTPLUG_SAFETY_PADDING 0 +#endif + #ifdef CONFIG_CRASH_DUMP int crash_shrink_memory(unsigned long new_size); -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:53 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:53 +0800 Subject: [PATCH v15 11/23] x86: kexec_file: Fix TOCTOU buffer overflow via memory region padding In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-12-ruanjinjie@huawei.com> Sashiko AI code review pointed out there is a TOCTOU (Time-of-Check to Time-of-Use) race condition in prepare_elf_headers() between the initial pass that counts System RAM ranges and the second pass that populates them. If a memory hotplug event occurs between these two steps, the number of memory regions may increase, causing an out-of-bounds write to the cmem->ranges[] array. Fix this fundamentally by using `CRASH_HOTPLUG_SAFETY_PADDING`(128 slots) to expand the flexible array allocation ceiling upfront. This safely absorbs any concurrent memory region expansion. Concurrently, add a defensive boundary check inside the callback to return -EAGAIN on unexpected overrun, fully eradicating the overflow window and ensuring system stability. Cc: AKASHI Takahiro Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. 
Peter Anvin" Cc: Andrew Morton Cc: Baoquan He Cc: Mike Rapoport Cc: stable at vger.kernel.org Fixes: 8d5f894a3108 ("x86: kexec_file: lift CRASH_MAX_RANGES limit on crash_mem buffer") Signed-off-by: Jinjie Ruan --- arch/x86/kernel/crash.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index cd796818d94d..a1089907728d 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -177,7 +177,7 @@ static struct crash_mem *fill_up_crash_elf_data(void) * But in order to lest the low 1M could be changed in the future, * (e.g. [start, 1M]), add a extra slot. */ - nr_ranges += 3 + crashk_cma_cnt; + nr_ranges += 3 + crashk_cma_cnt + CRASH_HOTPLUG_SAFETY_PADDING; cmem = vzalloc(struct_size(cmem, ranges, nr_ranges)); if (!cmem) return NULL; @@ -226,6 +226,9 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) { struct crash_mem *cmem = arg; + if (unlikely(cmem->nr_ranges >= cmem->max_nr_ranges)) + return -EAGAIN; + cmem->ranges[cmem->nr_ranges].start = res->start; cmem->ranges[cmem->nr_ranges].end = res->end; cmem->nr_ranges++; -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:54 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:54 +0800 Subject: [PATCH v15 12/23] arm64: kexec_file: Fix TOCTOU buffer overflow via memory region padding In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-13-ruanjinjie@huawei.com> Sashiko AI code review pointed out there is a TOCTOU (Time-of-Check to Time-of-Use) race condition in prepare_elf_headers() between the initial pass that counts System RAM ranges and the second pass that populates them. If a memory hotplug event occurs between these two steps, the number of memory regions may increase, causing an out-of-bounds write to the cmem->ranges[] array. 
Fix this fundamentally by using `CRASH_HOTPLUG_SAFETY_PADDING` (128 slots) to expand the flexible array allocation ceiling upfront. This safely absorbs any concurrent memory region expansion. Concurrently, add a defensive boundary check to return -EAGAIN on unexpected overrun, fully eradicating the overflow window and ensuring system stability. Cc: Catalin Marinas Cc: Will Deacon Cc: Andrew Morton Cc: Baoquan He Cc: Breno Leitao Cc: stable at vger.kernel.org Fixes: 3751e728cef2 ("arm64: kexec_file: add crash dump support") Signed-off-by: Jinjie Ruan --- arch/arm64/kernel/machine_kexec_file.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c index 4cbb71e1f8ed..8a96fb68b88d 100644 --- a/arch/arm64/kernel/machine_kexec_file.c +++ b/arch/arm64/kernel/machine_kexec_file.c @@ -48,7 +48,8 @@ int prepare_elf_headers(void **addr, unsigned long *sz) u64 i; phys_addr_t start, end; - nr_ranges = 2; /* for exclusion of crashkernel region */ + /* for exclusion of crashkernel region */ + nr_ranges = 2 + CRASH_HOTPLUG_SAFETY_PADDING; for_each_mem_range(i, &start, &end) nr_ranges++; @@ -59,6 +60,11 @@ int prepare_elf_headers(void **addr, unsigned long *sz) cmem->max_nr_ranges = nr_ranges; cmem->nr_ranges = 0; for_each_mem_range(i, &start, &end) { + if (unlikely(cmem->nr_ranges >= cmem->max_nr_ranges)) { + ret = -EAGAIN; + goto out; + } + cmem->ranges[cmem->nr_ranges].start = start; cmem->ranges[cmem->nr_ranges].end = end - 1; cmem->nr_ranges++; -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:55 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:55 +0800 Subject: [PATCH v15 13/23] riscv: kexec_file: Fix TOCTOU buffer overflow via memory region padding In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-14-ruanjinjie@huawei.com> Sashiko AI 
code review pointed out there is a TOCTOU (Time-of-Check to Time-of-Use) race condition in prepare_elf_headers() between the initial pass that counts System RAM ranges and the second pass that populates them. If a memory hotplug event occurs between these two steps, the number of memory regions may increase, causing an out-of-bounds write to the cmem->ranges[] array. Fix this fundamentally by using `CRASH_HOTPLUG_SAFETY_PADDING` (128 slots) to expand the flexible array allocation ceiling upfront. This safely absorbs any concurrent memory region expansion. Concurrently, add a defensive boundary check inside the callback to return -EAGAIN on unexpected overrun, fully eradicating the overflow window and ensuring system stability. Cc: Paul Walmsley Cc: Palmer Dabbelt Cc: Albert Ou Cc: Alexandre Ghiti Cc: songshuaishuai at tinylab.org Cc: bjorn at rivosinc.com Cc: leitao at debian.org Fixes: 8acea455fafa ("RISC-V: Support for kexec_file on panic") Reviewed-by: Guo Ren Signed-off-by: Jinjie Ruan --- arch/riscv/kernel/machine_kexec_file.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/riscv/kernel/machine_kexec_file.c b/arch/riscv/kernel/machine_kexec_file.c index 3f7766057cac..f3576dc0513f 100644 --- a/arch/riscv/kernel/machine_kexec_file.c +++ b/arch/riscv/kernel/machine_kexec_file.c @@ -48,6 +48,9 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) { struct crash_mem *cmem = arg; + if (unlikely(cmem->nr_ranges >= cmem->max_nr_ranges)) + return -EAGAIN; + cmem->ranges[cmem->nr_ranges].start = res->start; cmem->ranges[cmem->nr_ranges].end = res->end; cmem->nr_ranges++; @@ -61,7 +64,8 @@ static int prepare_elf_headers(void **addr, unsigned long *sz) unsigned int nr_ranges; int ret; - nr_ranges = 2; /* For exclusion of crashkernel region */ + /* For exclusion of crashkernel region */ + nr_ranges = 2 + CRASH_HOTPLUG_SAFETY_PADDING; walk_system_ram_res(0, -1, &nr_ranges, get_nr_ram_ranges_callback); cmem = 
kmalloc_flex(*cmem, ranges, nr_ranges); -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:56 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:56 +0800 Subject: [PATCH v15 14/23] LoongArch: kexec_file: Fix TOCTOU buffer overflow via memory region padding In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-15-ruanjinjie@huawei.com> Sashiko AI code review pointed out there is a TOCTOU (Time-of-Check to Time-of-Use) race condition in prepare_elf_headers() between the initial pass that counts System RAM ranges and the second pass that populates them. If a memory hotplug event occurs between these two steps, the number of memory regions may increase, causing an out-of-bounds write to the cmem->ranges[] array. Fix this fundamentally by using `CRASH_HOTPLUG_SAFETY_PADDING` (128 slots) to expand the flexible array allocation ceiling upfront. This safely absorbs any concurrent memory region expansion. Concurrently, add a defensive boundary check to return -EAGAIN on unexpected overrun, fully eradicating the overflow window and ensuring system stability. 
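The count-then-fill pattern with padding and a bounds check, as used in these per-architecture fixes, can be sketched in user space as follows. The names `toy_cmem`, `toy_alloc_cmem()`, `toy_add_range()` and `TOY_SAFETY_PADDING` are hypothetical, and calloc() stands in for the kernel allocators; the value 128 and the two extra slots for crashkernel exclusion are taken from the patches.

```c
#include <errno.h>
#include <stdlib.h>

#define TOY_SAFETY_PADDING 128	/* mirrors CRASH_HOTPLUG_SAFETY_PADDING */

struct toy_range {
	unsigned long long start, end;
};

struct toy_cmem {
	unsigned int max_nr_ranges;
	unsigned int nr_ranges;
	struct toy_range ranges[];
};

/* Pass 1: size the array from the count seen now, plus headroom. */
static struct toy_cmem *toy_alloc_cmem(unsigned int counted)
{
	unsigned int max = 2 + TOY_SAFETY_PADDING + counted;
	struct toy_cmem *cmem;

	cmem = calloc(1, sizeof(*cmem) + max * sizeof(cmem->ranges[0]));
	if (cmem)
		cmem->max_nr_ranges = max;
	return cmem;
}

/* Pass 2: append one range, refusing to write past the allocation. */
static int toy_add_range(struct toy_cmem *cmem, unsigned long long start,
			 unsigned long long end)
{
	if (cmem->nr_ranges >= cmem->max_nr_ranges)
		return -EAGAIN;

	cmem->ranges[cmem->nr_ranges].start = start;
	cmem->ranges[cmem->nr_ranges].end = end;
	cmem->nr_ranges++;
	return 0;
}
```

The padding only makes an overrun unlikely; it is the check in toy_add_range() that turns a remaining overrun into an error return rather than an out-of-bounds write.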
Cc: Youling Tang Cc: Huacai Chen Cc: WANG Xuerui Cc: stable at vger.kernel.org Fixes: 1bcca8620a91 ("LoongArch: Add crash dump support for kexec_file") Signed-off-by: Jinjie Ruan --- arch/loongarch/kernel/machine_kexec_file.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/loongarch/kernel/machine_kexec_file.c b/arch/loongarch/kernel/machine_kexec_file.c index 5584b798ba46..3c369124586e 100644 --- a/arch/loongarch/kernel/machine_kexec_file.c +++ b/arch/loongarch/kernel/machine_kexec_file.c @@ -64,7 +64,8 @@ static int prepare_elf_headers(void **addr, unsigned long *sz) phys_addr_t start, end; struct crash_mem *cmem; - nr_ranges = 2; /* for exclusion of crashkernel region */ + /* for exclusion of crashkernel region */ + nr_ranges = 2 + CRASH_HOTPLUG_SAFETY_PADDING; for_each_mem_range(i, &start, &end) nr_ranges++; @@ -75,6 +76,11 @@ static int prepare_elf_headers(void **addr, unsigned long *sz) cmem->max_nr_ranges = nr_ranges; cmem->nr_ranges = 0; for_each_mem_range(i, &start, &end) { + if (unlikely(cmem->nr_ranges >= cmem->max_nr_ranges)) { + ret = -EAGAIN; + goto out; + } + cmem->ranges[cmem->nr_ranges].start = start; cmem->ranges[cmem->nr_ranges].end = end - 1; cmem->nr_ranges++; -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:57 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:57 +0800 Subject: [PATCH v15 15/23] crash: Add crash_prepare_headers() to exclude crash kernel memory In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-16-ruanjinjie@huawei.com> The allocation of the crash memory ranges and the exclusion of the crashk_res, crashk_low_res and crashk_cma memory are almost identical across architectures. Handling them in the crash core eliminates a lot of duplication, so add a crash_prepare_headers() helper that does this in common code.
To achieve the above goal, three architecture-specific functions are introduced: - arch_get_system_nr_ranges(). Pre-counts the max number of memory ranges. - arch_crash_populate_cmem(). Collects the memory ranges and fills them into cmem. - arch_crash_exclude_ranges(). Architecture's additional crash memory ranges exclusion, defaulting to empty. Reviewed-by: Sourabh Jain Acked-by: Baoquan He Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Jinjie Ruan --- include/linux/crash_core.h | 5 +++ kernel/crash_core.c | 82 ++++++++++++++++++++++++++++++++++++-- 2 files changed, 84 insertions(+), 3 deletions(-) diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index d4762e000098..43baf9c87355 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -65,6 +65,8 @@ extern int crash_exclude_mem_range(struct crash_mem *mem, unsigned long long mend); extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, void **addr, unsigned long *sz); +extern int crash_prepare_headers(int need_kernel_map, void **addr, + unsigned long *sz, unsigned long *nr_mem_ranges); struct kimage; struct kexec_segment; @@ -82,6 +84,9 @@ int kexec_should_crash(struct task_struct *p); int kexec_crash_loaded(void); void crash_save_cpu(struct pt_regs *regs, int cpu); extern int kimage_crash_copy_vmcoreinfo(struct kimage *image); +extern unsigned int arch_get_system_nr_ranges(void); +extern int arch_crash_populate_cmem(struct crash_mem *cmem); +extern int arch_crash_exclude_ranges(struct crash_mem *cmem); #else /* !CONFIG_CRASH_DUMP*/ struct pt_regs; diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 4f21fc3b108b..481babc29131 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -168,9 +168,6 @@ static inline resource_size_t crash_resource_size(const struct resource *res) return !res->end ? 
0 : resource_size(res); } - - - int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, void **addr, unsigned long *sz) { @@ -272,6 +269,85 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map, return 0; } +static struct crash_mem *alloc_cmem(unsigned int nr_ranges) +{ + struct crash_mem *cmem; + + cmem = kvzalloc_flex(*cmem, ranges, nr_ranges); + if (!cmem) + return NULL; + + cmem->max_nr_ranges = nr_ranges; + return cmem; +} + +unsigned int __weak arch_get_system_nr_ranges(void) { return 0; } +int __weak arch_crash_populate_cmem(struct crash_mem *cmem) { return -1; } +int __weak arch_crash_exclude_ranges(struct crash_mem *cmem) { return 0; } + +static int crash_exclude_core_ranges(struct crash_mem *cmem) +{ + int ret, i; + + /* Exclude crashkernel region */ + ret = crash_exclude_mem_range(cmem, crashk_res.start, crashk_res.end); + if (ret) + return ret; + + if (crashk_low_res.end) { + ret = crash_exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end); + if (ret) + return ret; + } + + for (i = 0; i < crashk_cma_cnt; ++i) { + ret = crash_exclude_mem_range(cmem, crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end); + if (ret) + return ret; + } + + return 0; +} + +int crash_prepare_headers(int need_kernel_map, void **addr, unsigned long *sz, + unsigned long *nr_mem_ranges) +{ + unsigned int max_nr_ranges; + struct crash_mem *cmem; + int ret; + + max_nr_ranges = arch_get_system_nr_ranges(); + if (!max_nr_ranges) + return -ENOMEM; + + cmem = alloc_cmem(max_nr_ranges); + if (!cmem) + return -ENOMEM; + + ret = arch_crash_populate_cmem(cmem); + if (ret) + goto out; + + ret = crash_exclude_core_ranges(cmem); + if (ret) + goto out; + + ret = arch_crash_exclude_ranges(cmem); + if (ret) + goto out; + + /* Return the computed number of memory ranges, for hotplug usage */ + if (nr_mem_ranges) + *nr_mem_ranges = cmem->nr_ranges; + + ret = crash_prepare_elf64_headers(cmem, need_kernel_map, addr, sz); + +out: + 
kvfree(cmem); + return ret; +} + /** * crash_exclude_mem_range - exclude a mem range for existing ranges * @mem: mem->range contains an array of ranges sorted in ascending order -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:58 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:58 +0800 Subject: [PATCH v15 16/23] arm64: kexec_file: Use crash_prepare_headers() helper to simplify code In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-17-ruanjinjie@huawei.com> Replace the existing prepare_elf_headers() with the newly introduced crash_prepare_headers() helper, which allocates cmem and excludes crash kernel memory in the crash core, reducing code duplication. Only the following two architecture functions need to be implemented: - arch_get_system_nr_ranges(). Use for_each_mem_range() to traverse and pre-count the max number of memory ranges. - arch_crash_populate_cmem(). Use for_each_mem_range() to traverse and collect the memory ranges and fill them into cmem.
Acked-by: Catalin Marinas Reviewed-by: Sourabh Jain Acked-by: Baoquan He Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Jinjie Ruan --- arch/arm64/include/asm/kexec.h | 1 - arch/arm64/kernel/kexec_image.c | 2 +- arch/arm64/kernel/machine_kexec_file.c | 46 ++++++++------------------ 3 files changed, 15 insertions(+), 34 deletions(-) diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h index 7ffa2ff5fcfd..892e5bebda95 100644 --- a/arch/arm64/include/asm/kexec.h +++ b/arch/arm64/include/asm/kexec.h @@ -128,7 +128,6 @@ extern int load_other_segments(struct kimage *image, unsigned long kernel_load_addr, unsigned long kernel_size, char *initrd, unsigned long initrd_len, char *cmdline); -extern int prepare_elf_headers(void **addr, unsigned long *sz); #endif #endif /* __ASSEMBLER__ */ diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c index 424b9527db09..93c36a3aa618 100644 --- a/arch/arm64/kernel/kexec_image.c +++ b/arch/arm64/kernel/kexec_image.c @@ -95,7 +95,7 @@ static void *image_load(struct kimage *image, unsigned long headers_sz; void *headers; - ret = prepare_elf_headers(&headers, &headers_sz); + ret = crash_prepare_headers(true, &headers, &headers_sz, NULL); if (ret) { pr_err("Preparing elf core header failed\n"); return ERR_PTR(ret); diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c index 8a96fb68b88d..14e65351133e 100644 --- a/arch/arm64/kernel/machine_kexec_file.c +++ b/arch/arm64/kernel/machine_kexec_file.c @@ -40,52 +40,34 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image) } #ifdef CONFIG_CRASH_DUMP -int prepare_elf_headers(void **addr, unsigned long *sz) +unsigned int arch_get_system_nr_ranges(void) { - struct crash_mem *cmem; - unsigned int nr_ranges; - int ret; - u64 i; + /* for exclusion of crashkernel region */ + unsigned int nr_ranges = 2 + CRASH_HOTPLUG_SAFETY_PADDING; phys_addr_t start, end; + u64 i; - /* for exclusion of crashkernel region 
*/ - nr_ranges = 2 + CRASH_HOTPLUG_SAFETY_PADDING; for_each_mem_range(i, &start, &end) nr_ranges++; - cmem = kmalloc_flex(*cmem, ranges, nr_ranges); - if (!cmem) - return -ENOMEM; + return nr_ranges; +} + +int arch_crash_populate_cmem(struct crash_mem *cmem) +{ + phys_addr_t start, end; + u64 i; - cmem->max_nr_ranges = nr_ranges; - cmem->nr_ranges = 0; for_each_mem_range(i, &start, &end) { - if (unlikely(cmem->nr_ranges >= cmem->max_nr_ranges)) { - ret = -EAGAIN; - goto out; - } + if (unlikely(cmem->nr_ranges >= cmem->max_nr_ranges)) + return -EAGAIN; cmem->ranges[cmem->nr_ranges].start = start; cmem->ranges[cmem->nr_ranges].end = end - 1; cmem->nr_ranges++; } - /* Exclude crashkernel region */ - ret = crash_exclude_mem_range(cmem, crashk_res.start, crashk_res.end); - if (ret) - goto out; - - if (crashk_low_res.end) { - ret = crash_exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end); - if (ret) - goto out; - } - - ret = crash_prepare_elf64_headers(cmem, true, addr, sz); - -out: - kfree(cmem); - return ret; + return 0; } #endif -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:59 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:59 +0800 Subject: [PATCH v15 17/23] x86: kexec_file: Use crash_prepare_headers() helper to simplify code In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-18-ruanjinjie@huawei.com> Replace the existing prepare_elf_headers() with the newly introduced crash_prepare_headers() helper, which allocates cmem and excludes crash kernel memory in the crash core, reducing code duplication. Only the following three architecture functions need to be implemented: - arch_get_system_nr_ranges(). Use get_nr_ram_ranges_callback() to pre-count the max number of memory ranges. - arch_crash_populate_cmem(). Use prepare_elf64_ram_headers_callback() to collect the memory ranges and fill them into cmem.
- arch_crash_exclude_ranges(). Exclude the low 1M for x86. By the way, remove the unused "nr_mem_ranges" in arch_crash_handle_hotplug_event(). Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: Andrew Morton Cc: Vivek Goyal Reviewed-by: Sourabh Jain Acked-by: Baoquan He Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Jinjie Ruan --- arch/x86/kernel/crash.c | 89 +++++------------------------------------ 1 file changed, 11 insertions(+), 78 deletions(-) diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index a1089907728d..7145b00da4ee 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -153,16 +153,8 @@ static int get_nr_ram_ranges_callback(struct resource *res, void *arg) return 0; } -/* Gather all the required information to prepare elf headers for ram regions */ -static struct crash_mem *fill_up_crash_elf_data(void) +unsigned int arch_get_system_nr_ranges(void) { - unsigned int nr_ranges = 0; - struct crash_mem *cmem; - - walk_system_ram_res(0, -1, &nr_ranges, get_nr_ram_ranges_callback); - if (!nr_ranges) - return NULL; - /* * Exclusion of crash region, crashk_low_res and/or crashk_cma_ranges * may cause range splits. So add extra slots here. @@ -177,49 +169,16 @@ static struct crash_mem *fill_up_crash_elf_data(void) * But in order to lest the low 1M could be changed in the future, * (e.g. [start, 1M]), add a extra slot. */ - nr_ranges += 3 + crashk_cma_cnt + CRASH_HOTPLUG_SAFETY_PADDING; - cmem = vzalloc(struct_size(cmem, ranges, nr_ranges)); - if (!cmem) - return NULL; - - cmem->max_nr_ranges = nr_ranges; + unsigned int nr_ranges = 3 + crashk_cma_cnt + CRASH_HOTPLUG_SAFETY_PADDING; - return cmem; + walk_system_ram_res(0, -1, &nr_ranges, get_nr_ram_ranges_callback); + return nr_ranges; } -/* - * Look for any unwanted ranges between mstart, mend and remove them. 
This - * might lead to split and split ranges are put in cmem->ranges[] array - */ -static int elf_header_exclude_ranges(struct crash_mem *cmem) +int arch_crash_exclude_ranges(struct crash_mem *cmem) { - int ret = 0; - int i; - /* Exclude the low 1M because it is always reserved */ - ret = crash_exclude_mem_range(cmem, 0, SZ_1M - 1); - if (ret) - return ret; - - /* Exclude crashkernel region */ - ret = crash_exclude_mem_range(cmem, crashk_res.start, crashk_res.end); - if (ret) - return ret; - - if (crashk_low_res.end) - ret = crash_exclude_mem_range(cmem, crashk_low_res.start, - crashk_low_res.end); - if (ret) - return ret; - - for (i = 0; i < crashk_cma_cnt; ++i) { - ret = crash_exclude_mem_range(cmem, crashk_cma_ranges[i].start, - crashk_cma_ranges[i].end); - if (ret) - return ret; - } - - return 0; + return crash_exclude_mem_range(cmem, 0, SZ_1M - 1); } static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) @@ -236,35 +195,9 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) return 0; } -/* Prepare elf headers. 
Return addr and size */ -static int prepare_elf_headers(void **addr, unsigned long *sz, - unsigned long *nr_mem_ranges) +int arch_crash_populate_cmem(struct crash_mem *cmem) { - struct crash_mem *cmem; - int ret; - - cmem = fill_up_crash_elf_data(); - if (!cmem) - return -ENOMEM; - - ret = walk_system_ram_res(0, -1, cmem, prepare_elf64_ram_headers_callback); - if (ret) - goto out; - - /* Exclude unwanted mem ranges */ - ret = elf_header_exclude_ranges(cmem); - if (ret) - goto out; - - /* Return the computed number of memory ranges, for hotplug usage */ - *nr_mem_ranges = cmem->nr_ranges; - - /* By default prepare 64bit headers */ - ret = crash_prepare_elf64_headers(cmem, IS_ENABLED(CONFIG_X86_64), addr, sz); - -out: - vfree(cmem); - return ret; + return walk_system_ram_res(0, -1, cmem, prepare_elf64_ram_headers_callback); } #endif @@ -422,7 +355,8 @@ int crash_load_segments(struct kimage *image) .buf_max = ULONG_MAX, .top_down = false }; /* Prepare elf headers and add a segment */ - ret = prepare_elf_headers(&kbuf.buffer, &kbuf.bufsz, &pnum); + ret = crash_prepare_headers(IS_ENABLED(CONFIG_X86_64), &kbuf.buffer, + &kbuf.bufsz, &pnum); if (ret) return ret; @@ -515,7 +449,6 @@ unsigned int arch_crash_get_elfcorehdr_size(void) void arch_crash_handle_hotplug_event(struct kimage *image, void *arg) { void *elfbuf = NULL, *old_elfcorehdr; - unsigned long nr_mem_ranges; unsigned long mem, memsz; unsigned long elfsz = 0; @@ -533,7 +466,7 @@ void arch_crash_handle_hotplug_event(struct kimage *image, void *arg) * Create the new elfcorehdr reflecting the changes to CPU and/or * memory resources. 
*/ - if (prepare_elf_headers(&elfbuf, &elfsz, &nr_mem_ranges)) { + if (crash_prepare_headers(IS_ENABLED(CONFIG_X86_64), &elfbuf, &elfsz, NULL)) { pr_err("unable to create new elfcorehdr"); goto out; } -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:48:00 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:48:00 +0800 Subject: [PATCH v15 18/23] riscv: kexec_file: Use crash_prepare_headers() helper to simplify code In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-19-ruanjinjie@huawei.com> Replace the existing prepare_elf_headers() with the newly introduced crash_prepare_headers() helper, which allocates cmem and excludes crash kernel memory in the crash core, reducing code duplication. Only the following two architecture functions need to be implemented: - arch_get_system_nr_ranges(). Use get_nr_ram_ranges_callback() to pre-count the max number of memory ranges. - arch_crash_populate_cmem(). Use prepare_elf64_ram_headers_callback() to collect the memory ranges and fill them into cmem.
Cc: Paul Walmsley Cc: Palmer Dabbelt Cc: Albert Ou Cc: Alexandre Ghiti Cc: Guo Ren Reviewed-by: Sourabh Jain Acked-by: Baoquan He Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Jinjie Ruan --- arch/riscv/kernel/machine_kexec_file.c | 49 +++++++------------------- 1 file changed, 13 insertions(+), 36 deletions(-) diff --git a/arch/riscv/kernel/machine_kexec_file.c b/arch/riscv/kernel/machine_kexec_file.c index f3576dc0513f..6e2a6747d187 100644 --- a/arch/riscv/kernel/machine_kexec_file.c +++ b/arch/riscv/kernel/machine_kexec_file.c @@ -44,6 +44,16 @@ static int get_nr_ram_ranges_callback(struct resource *res, void *arg) return 0; } +unsigned int arch_get_system_nr_ranges(void) +{ + /* For exclusion of crashkernel region */ + unsigned int nr_ranges = 2 + CRASH_HOTPLUG_SAFETY_PADDING; + + walk_system_ram_res(0, -1, &nr_ranges, get_nr_ram_ranges_callback); + + return nr_ranges; +} + static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) { struct crash_mem *cmem = arg; @@ -58,42 +68,9 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) return 0; } -static int prepare_elf_headers(void **addr, unsigned long *sz) +int arch_crash_populate_cmem(struct crash_mem *cmem) { - struct crash_mem *cmem; - unsigned int nr_ranges; - int ret; - - /* For exclusion of crashkernel region */ - nr_ranges = 2 + CRASH_HOTPLUG_SAFETY_PADDING; - walk_system_ram_res(0, -1, &nr_ranges, get_nr_ram_ranges_callback); - - cmem = kmalloc_flex(*cmem, ranges, nr_ranges); - if (!cmem) - return -ENOMEM; - - cmem->max_nr_ranges = nr_ranges; - cmem->nr_ranges = 0; - ret = walk_system_ram_res(0, -1, cmem, prepare_elf64_ram_headers_callback); - if (ret) - goto out; - - /* Exclude crashkernel region */ - ret = crash_exclude_mem_range(cmem, crashk_res.start, crashk_res.end); - if (ret) - goto out; - - if (crashk_low_res.end) { - ret = crash_exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end); - if (ret) - goto out; - } - - ret = 
crash_prepare_elf64_headers(cmem, true, addr, sz); - -out: - kfree(cmem); - return ret; + return walk_system_ram_res(0, -1, cmem, prepare_elf64_ram_headers_callback); } static char *setup_kdump_cmdline(struct kimage *image, char *cmdline, @@ -285,7 +262,7 @@ int load_extra_segments(struct kimage *image, unsigned long kernel_start, if (image->type == KEXEC_TYPE_CRASH) { void *headers; unsigned long headers_sz; - ret = prepare_elf_headers(&headers, &headers_sz); + ret = crash_prepare_headers(true, &headers, &headers_sz, NULL); if (ret) { pr_err("Preparing elf core header failed\n"); goto out; -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:48:01 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:48:01 +0800 Subject: [PATCH v15 19/23] LoongArch: kexec_file: Use crash_prepare_headers() helper to simplify code In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-20-ruanjinjie@huawei.com> Replace the existing prepare_elf_headers() with the newly introduced crash_prepare_headers() helper, which allocates cmem and excludes crash kernel memory in the crash core, reducing code duplication. Only the following two architecture functions need to be implemented: - arch_get_system_nr_ranges(). Use for_each_mem_range() to traverse and pre-count the max number of memory ranges. - arch_crash_populate_cmem(). Use for_each_mem_range() to traverse and collect the memory ranges and fill them into cmem.
Cc: Huacai Chen Cc: WANG Xuerui Cc: Youling Tang Cc: Baoquan He Reviewed-by: Sourabh Jain Acked-by: Baoquan He Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Jinjie Ruan --- arch/loongarch/kernel/machine_kexec_file.c | 48 +++++++--------------- 1 file changed, 15 insertions(+), 33 deletions(-) diff --git a/arch/loongarch/kernel/machine_kexec_file.c b/arch/loongarch/kernel/machine_kexec_file.c index 3c369124586e..f3101bea9e45 100644 --- a/arch/loongarch/kernel/machine_kexec_file.c +++ b/arch/loongarch/kernel/machine_kexec_file.c @@ -56,52 +56,34 @@ static void cmdline_add_initrd(struct kimage *image, unsigned long *cmdline_tmpl } #ifdef CONFIG_CRASH_DUMP - -static int prepare_elf_headers(void **addr, unsigned long *sz) +unsigned int arch_get_system_nr_ranges(void) { - int ret, nr_ranges; - uint64_t i; + /* for exclusion of crashkernel region */ + int nr_ranges = 2 + CRASH_HOTPLUG_SAFETY_PADDING; phys_addr_t start, end; - struct crash_mem *cmem; + uint64_t i; - /* for exclusion of crashkernel region */ - nr_ranges = 2 + CRASH_HOTPLUG_SAFETY_PADDING; for_each_mem_range(i, &start, &end) nr_ranges++; - cmem = kmalloc_flex(*cmem, ranges, nr_ranges); - if (!cmem) - return -ENOMEM; + return nr_ranges; +} + +int arch_crash_populate_cmem(struct crash_mem *cmem) +{ + phys_addr_t start, end; + uint64_t i; - cmem->max_nr_ranges = nr_ranges; - cmem->nr_ranges = 0; for_each_mem_range(i, &start, &end) { - if (unlikely(cmem->nr_ranges >= cmem->max_nr_ranges)) { - ret = -EAGAIN; - goto out; - } + if (unlikely(cmem->nr_ranges >= cmem->max_nr_ranges)) + return -EAGAIN; cmem->ranges[cmem->nr_ranges].start = start; cmem->ranges[cmem->nr_ranges].end = end - 1; cmem->nr_ranges++; } - /* Exclude crashkernel region */ - ret = crash_exclude_mem_range(cmem, crashk_res.start, crashk_res.end); - if (ret < 0) - goto out; - - if (crashk_low_res.end) { - ret = crash_exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end); - if (ret < 0) - goto out; - } - - ret = 
crash_prepare_elf64_headers(cmem, true, addr, sz); - -out: - kfree(cmem); - return ret; + return 0; } /* @@ -169,7 +151,7 @@ int load_other_segments(struct kimage *image, void *headers; unsigned long headers_sz; - ret = prepare_elf_headers(&headers, &headers_sz); + ret = crash_prepare_headers(true, &headers, &headers_sz, NULL); if (ret < 0) { pr_err("Preparing elf core header failed\n"); goto out_err; -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:48:02 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:48:02 +0800 Subject: [PATCH v15 20/23] powerpc/kexec_file: Use crash_exclude_core_ranges() helper In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-21-ruanjinjie@huawei.com> The exclusion of the crashk_res and crashk_cma memory ranges on powerpc is almost identical to the generic crash_exclude_core_ranges(). Introduce the architecture-specific arch_crash_exclude_mem_range() function, whose default implementation calls crash_exclude_mem_range(), and use crash_exclude_mem_range_guarded() as powerpc's own implementation, so that the generic crash_exclude_core_ranges() helper can be reused.
Cc: Andrew Morton Cc: Hari Bathini Cc: Madhavan Srinivasan Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Acked-by: Baoquan He Reviewed-by: Sourabh Jain Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Jinjie Ruan --- arch/powerpc/include/asm/kexec_ranges.h | 3 --- arch/powerpc/kexec/crash.c | 2 +- arch/powerpc/kexec/ranges.c | 16 ++++------------ include/linux/crash_core.h | 4 ++++ kernel/crash_core.c | 19 +++++++++++++------ 5 files changed, 22 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/kexec_ranges.h b/arch/powerpc/include/asm/kexec_ranges.h index ad95e3792d10..8489e844b447 100644 --- a/arch/powerpc/include/asm/kexec_ranges.h +++ b/arch/powerpc/include/asm/kexec_ranges.h @@ -7,9 +7,6 @@ void sort_memory_ranges(struct crash_mem *mrngs, bool merge); struct crash_mem *realloc_mem_ranges(struct crash_mem **mem_ranges); int add_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size); -int crash_exclude_mem_range_guarded(struct crash_mem **mem_ranges, - unsigned long long mstart, - unsigned long long mend); int get_exclude_memory_ranges(struct crash_mem **mem_ranges); int get_reserved_memory_ranges(struct crash_mem **mem_ranges); int get_crash_memory_ranges(struct crash_mem **mem_ranges); diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c index d634db67becc..775895f31037 100644 --- a/arch/powerpc/kexec/crash.c +++ b/arch/powerpc/kexec/crash.c @@ -513,7 +513,7 @@ static void update_crash_elfcorehdr(struct kimage *image, struct memory_notify * base_addr = PFN_PHYS(mn->start_pfn); size = mn->nr_pages * PAGE_SIZE; end = base_addr + size - 1; - ret = crash_exclude_mem_range_guarded(&cmem, base_addr, end); + ret = arch_crash_exclude_mem_range(&cmem, base_addr, end); if (ret) { pr_err("Failed to remove hot-unplugged memory from crash memory ranges\n"); goto out; diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c index b2fb78562cdc..539061d14a77 100644 --- 
a/arch/powerpc/kexec/ranges.c +++ b/arch/powerpc/kexec/ranges.c @@ -551,9 +551,9 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) #endif /* CONFIG_KEXEC_FILE */ #ifdef CONFIG_CRASH_DUMP -int crash_exclude_mem_range_guarded(struct crash_mem **mem_ranges, - unsigned long long mstart, - unsigned long long mend) +int arch_crash_exclude_mem_range(struct crash_mem **mem_ranges, + unsigned long long mstart, + unsigned long long mend) { struct crash_mem *tmem = *mem_ranges; @@ -602,18 +602,10 @@ int get_crash_memory_ranges(struct crash_mem **mem_ranges) sort_memory_ranges(*mem_ranges, true); } - /* Exclude crashkernel region */ - ret = crash_exclude_mem_range_guarded(mem_ranges, crashk_res.start, crashk_res.end); + ret = crash_exclude_core_ranges(mem_ranges); if (ret) goto out; - for (i = 0; i < crashk_cma_cnt; ++i) { - ret = crash_exclude_mem_range_guarded(mem_ranges, crashk_cma_ranges[i].start, - crashk_cma_ranges[i].end); - if (ret) - goto out; - } - /* * FIXME: For now, stay in parity with kexec-tools but if RTAS/OPAL * regions are exported to save their context at the time of diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index 43baf9c87355..1ae2c0eb2eb3 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -67,6 +67,7 @@ extern int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_ma void **addr, unsigned long *sz); extern int crash_prepare_headers(int need_kernel_map, void **addr, unsigned long *sz, unsigned long *nr_mem_ranges); +extern int crash_exclude_core_ranges(struct crash_mem **cmem); struct kimage; struct kexec_segment; @@ -87,6 +88,9 @@ extern int kimage_crash_copy_vmcoreinfo(struct kimage *image); extern unsigned int arch_get_system_nr_ranges(void); extern int arch_crash_populate_cmem(struct crash_mem *cmem); extern int arch_crash_exclude_ranges(struct crash_mem *cmem); +extern int arch_crash_exclude_mem_range(struct crash_mem **mem, + unsigned long long mstart, + unsigned long 
long mend); #else /* !CONFIG_CRASH_DUMP*/ struct pt_regs; diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 481babc29131..2b36aa9fade0 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -285,24 +285,31 @@ unsigned int __weak arch_get_system_nr_ranges(void) { return 0; } int __weak arch_crash_populate_cmem(struct crash_mem *cmem) { return -1; } int __weak arch_crash_exclude_ranges(struct crash_mem *cmem) { return 0; } -static int crash_exclude_core_ranges(struct crash_mem *cmem) +int __weak arch_crash_exclude_mem_range(struct crash_mem **mem, + unsigned long long mstart, + unsigned long long mend) +{ + return crash_exclude_mem_range(*mem, mstart, mend); +} + +int crash_exclude_core_ranges(struct crash_mem **cmem) { int ret, i; /* Exclude crashkernel region */ - ret = crash_exclude_mem_range(cmem, crashk_res.start, crashk_res.end); + ret = arch_crash_exclude_mem_range(cmem, crashk_res.start, crashk_res.end); if (ret) return ret; if (crashk_low_res.end) { - ret = crash_exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end); + ret = arch_crash_exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end); if (ret) return ret; } for (i = 0; i < crashk_cma_cnt; ++i) { - ret = crash_exclude_mem_range(cmem, crashk_cma_ranges[i].start, - crashk_cma_ranges[i].end); + ret = arch_crash_exclude_mem_range(cmem, crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end); if (ret) return ret; } @@ -329,7 +336,7 @@ int crash_prepare_headers(int need_kernel_map, void **addr, unsigned long *sz, if (ret) goto out; - ret = crash_exclude_core_ranges(cmem); + ret = crash_exclude_core_ranges(&cmem); if (ret) goto out; -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:48:03 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:48:03 +0800 Subject: [PATCH v15 21/23] arm64: kexec_file: Add support for crashkernel CMA reservation In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: 
<20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-22-ruanjinjie@huawei.com> Commit 35c18f2933c5 ("Add a new optional ",cma" suffix to the crashkernel= command line option") and commit ab475510e042 ("kdump: implement reserve_crashkernel_cma") added CMA support for kdump crashkernel reservation. Crash kernel memory reservation wastes production resources if too large, risks kdump failure if too small, and faces allocation difficulties on fragmented systems due to contiguous block constraints. The new CMA-based crashkernel reservation scheme splits the "large fixed reservation" into a "small fixed region + large CMA dynamic region": the CMA memory is available to userspace during normal operation to avoid waste, and is reclaimed for kdump upon crash, saving memory while improving reliability. So extend crashkernel CMA reservation support to arm64. The following changes are made to enable CMA reservation: - Parse and obtain the CMA reservation size along with other crashkernel parameters. - Call reserve_crashkernel_cma() to allocate the CMA region for kdump. - Include the CMA-reserved ranges for kdump kernel to use. - Exclude the CMA-reserved ranges from the crash kernel memory to prevent them from being exported through /proc/vmcore, which is already done in the crash core. Update kernel-parameters.txt to document CMA support for crashkernel on arm64 architecture. Tested-by: Breno Leitao Acked-by: Catalin Marinas Acked-by: Rob Herring (Arm) Acked-by: Baoquan He Acked-by: Mike Rapoport (Microsoft) Acked-by: Ard Biesheuvel Signed-off-by: Jinjie Ruan --- v7: - Correct the inclusion of CMA-reserved ranges for kdump kernel in of/kexec. v3: - Add Acked-by. v2: - Free cmem in prepare_elf_headers() - Add the motivation.
--- Documentation/admin-guide/kernel-parameters.txt | 2 +- arch/arm64/kernel/machine_kexec_file.c | 2 +- arch/arm64/mm/init.c | 5 +++-- drivers/of/fdt.c | 9 +++++---- drivers/of/kexec.c | 9 +++++++++ include/linux/crash_reserve.h | 4 +++- 6 files changed, 22 insertions(+), 9 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 4d0f545fb3ec..52742fab49a9 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1119,7 +1119,7 @@ Kernel parameters It will be ignored when crashkernel=X,high is not used or memory reserved is below 4G. crashkernel=size[KMG],cma - [KNL, X86, ppc] Reserve additional crash kernel memory from + [KNL, X86, ARM64, PPC] Reserve additional crash kernel memory from CMA. This reservation is usable by the first system's userspace memory and kernel movable allocations (memory balloon, zswap). Pages allocated from this memory range diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c index 14e65351133e..d0f73eb3f856 100644 --- a/arch/arm64/kernel/machine_kexec_file.c +++ b/arch/arm64/kernel/machine_kexec_file.c @@ -43,7 +43,7 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image) unsigned int arch_get_system_nr_ranges(void) { /* for exclusion of crashkernel region */ - unsigned int nr_ranges = 2 + CRASH_HOTPLUG_SAFETY_PADDING; + unsigned int nr_ranges = 2 + crashk_cma_cnt + CRASH_HOTPLUG_SAFETY_PADDING; phys_addr_t start, end; u64 i; diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 97987f850a33..227f58522dad 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -96,8 +96,8 @@ phys_addr_t __ro_after_init arm64_dma_phys_limit; static void __init arch_reserve_crashkernel(void) { + unsigned long long crash_base, crash_size, cma_size = 0; unsigned long long low_size = 0; - unsigned long long crash_base, crash_size; bool high = false; int ret; @@ 
-106,11 +106,12 @@ static void __init arch_reserve_crashkernel(void) ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(), &crash_size, &crash_base, - &low_size, NULL, &high); + &low_size, &cma_size, &high); if (ret) return; reserve_crashkernel_generic(crash_size, crash_base, low_size, high); + reserve_crashkernel_cma(cma_size); } static phys_addr_t __init max_zone_phys(phys_addr_t zone_limit) diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c index 82f7327c59ea..0470acbd1fcf 100644 --- a/drivers/of/fdt.c +++ b/drivers/of/fdt.c @@ -880,11 +880,12 @@ static unsigned long chosen_node_offset = -FDT_ERR_NOTFOUND; /* * The main usage of linux,usable-memory-range is for crash dump kernel. * Originally, the number of usable-memory regions is one. Now there may - * be two regions, low region and high region. - * To make compatibility with existing user-space and older kdump, the low - * region is always the last range of linux,usable-memory-range if exist. + * be 2 + CRASHK_CMA_RANGES_MAX regions, low region, high region and cma + * regions. To make compatibility with existing user-space and older kdump, + * the high and low region are always the first two ranges of + * linux,usable-memory-range if exist. 
*/ -#define MAX_USABLE_RANGES 2 +#define MAX_USABLE_RANGES (2 + CRASHK_CMA_RANGES_MAX) /** * early_init_dt_check_for_usable_mem_range - Decode usable memory range diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c index b6837e299e7f..029903b986cb 100644 --- a/drivers/of/kexec.c +++ b/drivers/of/kexec.c @@ -458,6 +458,15 @@ void *of_kexec_alloc_and_setup_fdt(const struct kimage *image, if (ret) goto out; } + + for (int i = 0; i < crashk_cma_cnt; i++) { + ret = fdt_appendprop_addrrange(fdt, 0, chosen_node, + "linux,usable-memory-range", + crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end - crashk_cma_ranges[i].start + 1); + if (ret) + goto out; + } #endif } diff --git a/include/linux/crash_reserve.h b/include/linux/crash_reserve.h index f0dc03d94ca2..30864d90d7f5 100644 --- a/include/linux/crash_reserve.h +++ b/include/linux/crash_reserve.h @@ -14,9 +14,11 @@ extern struct resource crashk_res; extern struct resource crashk_low_res; extern struct range crashk_cma_ranges[]; + +#define CRASHK_CMA_RANGES_MAX 4 #if defined(CONFIG_CMA) && defined(CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION) #define CRASHKERNEL_CMA -#define CRASHKERNEL_CMA_RANGES_MAX 4 +#define CRASHKERNEL_CMA_RANGES_MAX (CRASHK_CMA_RANGES_MAX) extern int crashk_cma_cnt; #else #define crashk_cma_cnt 0 -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:48:04 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:48:04 +0800 Subject: [PATCH v15 22/23] riscv: kexec_file: Add support for crashkernel CMA reservation In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-23-ruanjinjie@huawei.com> Commit 35c18f2933c5 ("Add a new optional ",cma" suffix to the crashkernel= command line option") and commit ab475510e042 ("kdump: implement reserve_crashkernel_cma") added CMA support for kdump crashkernel reservation. 
This allows the kernel to dynamically allocate contiguous memory for crash dumping when needed, rather than permanently reserving a fixed region at boot time. So extend crashkernel CMA reservation support to riscv. The following changes are made to enable CMA reservation: - Parse and obtain the CMA reservation size along with other crashkernel parameters. - Call reserve_crashkernel_cma() to allocate the CMA region for kdump. - Include the CMA-reserved ranges for kdump kernel to use, which was already done in of_kexec_alloc_and_setup_fdt(). - Exclude the CMA-reserved ranges from the crash kernel memory to prevent them from being exported through /proc/vmcore, which was already done in the crash core. Update kernel-parameters.txt to document CMA support for crashkernel on riscv architecture. Cc: Paul Walmsley Cc: Palmer Dabbelt Cc: Albert Ou Cc: Alexandre Ghiti Acked-by: Baoquan He Acked-by: Mike Rapoport (Microsoft) Acked-by: Paul Walmsley # arch/riscv Signed-off-by: Jinjie Ruan --- Documentation/admin-guide/kernel-parameters.txt | 16 ++++++++-------- arch/riscv/kernel/machine_kexec_file.c | 2 +- arch/riscv/mm/init.c | 5 +++-- 3 files changed, 12 insertions(+), 11 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 52742fab49a9..3ff3ddd516cf 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1119,14 +1119,14 @@ Kernel parameters It will be ignored when crashkernel=X,high is not used or memory reserved is below 4G. crashkernel=size[KMG],cma - [KNL, X86, ARM64, PPC] Reserve additional crash kernel memory from - CMA. This reservation is usable by the first system's - userspace memory and kernel movable allocations (memory - balloon, zswap). 
Pages allocated from this memory range - will not be included in the vmcore so this should not - be used if dumping of userspace memory is intended and - it has to be expected that some movable kernel pages - may be missing from the dump. + [KNL, X86, ARM64, RISCV, PPC] Reserve additional crash + kernel memory from CMA. This reservation is usable by + the first system's userspace memory and kernel movable + allocations (memory balloon, zswap). Pages allocated + from this memory range will not be included in the vmcore + so this should not be used if dumping of userspace memory + is intended and it has to be expected that some movable + kernel pages may be missing from the dump. A standard crashkernel reservation, as described above, is still needed to hold the crash kernel and initrd. diff --git a/arch/riscv/kernel/machine_kexec_file.c b/arch/riscv/kernel/machine_kexec_file.c index 6e2a6747d187..42d847154e19 100644 --- a/arch/riscv/kernel/machine_kexec_file.c +++ b/arch/riscv/kernel/machine_kexec_file.c @@ -47,7 +47,7 @@ static int get_nr_ram_ranges_callback(struct resource *res, void *arg) unsigned int arch_get_system_nr_ranges(void) { /* For exclusion of crashkernel region */ - unsigned int nr_ranges = 2 + CRASH_HOTPLUG_SAFETY_PADDING; + unsigned int nr_ranges = 2 + crashk_cma_cnt + CRASH_HOTPLUG_SAFETY_PADDING; walk_system_ram_res(0, -1, &nr_ranges, get_nr_ram_ranges_callback); diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c index decd7df40fa4..c848454b8349 100644 --- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -1295,7 +1295,7 @@ static inline void setup_vm_final(void) */ static void __init arch_reserve_crashkernel(void) { - unsigned long long low_size = 0; + unsigned long long low_size = 0, cma_size = 0; unsigned long long crash_base, crash_size; bool high = false; int ret; @@ -1305,11 +1305,12 @@ static void __init arch_reserve_crashkernel(void) ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(), &crash_size, &crash_base, - 
&low_size, NULL, &high); + &low_size, &cma_size, &high); if (ret) return; reserve_crashkernel_generic(crash_size, crash_base, low_size, high); + reserve_crashkernel_cma(cma_size); } void __init paging_init(void) -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:48:05 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:48:05 +0800 Subject: [PATCH v15 23/23] arm64: crash: Add crash hotplug support In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-24-ruanjinjie@huawei.com> Due to CPU/Memory hotplug or online/offline events, the elfcorehdr (which describes the CPUs and memory of the crashed kernel) of kdump image becomes outdated. Consequently, attempting dump collection with an outdated elfcorehdr can lead to inaccurate dump collection. The current solution to address the above issue involves monitoring the CPU/Memory add/remove events in userspace using udev rules and whenever there are changes in CPU and memory resources, the entire kdump image is loaded again. The kdump image includes kernel, initrd, elfcorehdr, FDT, purgatory. Given that only elfcorehdr gets outdated due to CPU/Memory add/remove events, reloading the entire kdump image is inefficient. More importantly, kdump remains inactive for a substantial amount of time until the kdump reload completes. To address the aforementioned issue, commit 247262756121 ("crash: add generic infrastructure for crash hotplug support") added a generic infrastructure that allows architectures to selectively update the kdump image component during CPU or memory add/remove events within the kernel itself. In the event of a CPU or memory add/remove events, the generic crash hotplug event handler, crash_handle_hotplug_event(), is triggered. 
It then acquires the necessary locks to update the kdump image and invokes the architecture-specific crash hotplug handler, arch_crash_handle_hotplug_event(), to update the required kdump image components. [1] has supported virtual CPU hotplug in virtual machines for ARM64, allowing vCPUs to be added or removed at runtime to meet Kubernetes demands. On ARM64, only memory add/remove events are handled. Here's why: 1. Physical CPU hotplug: Not supported on ARM64 hardware. 2. ACPI vCPU hotplug (KVM virtual machine): - vCPU hotplug is implemented as a static firmware policy where all possible vCPUs are pre-described in the MADT table at boot. - The vCPU status will be automatically updated after vCPU hotplug. - No FDT or elfcorehdr update needed. 3. Device tree booted Virtual Machine vCPU hotplug: - The elfcorehdr is built using for_each_possible_cpu(), so it already includes all possible CPUs and doesn't need updates. For memory add/remove events, the elfcorehdr is updated to reflect the current memory layout. This patch adds the ARCH_SUPPORTS_CRASH_HOTPLUG config option and implements: - arch_crash_hotplug_support(): Check if hotplug update is supported - arch_crash_get_elfcorehdr_size(): Return elfcorehdr buffer size - arch_crash_handle_hotplug_event(): Handle memory hotplug events This follows the same approach as x86 commit ea53ad9cf73b ("x86/crash: add x86 crash hotplug support") and powerpc commit b741092d5976 ("powerpc/crash: add crash CPU hotplug support") and commit 849599b702ef ("powerpc/crash: add crash memory hotplug support"). 
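As a rough illustration of the sizing rule behind arch_crash_get_elfcorehdr_size(), the userspace sketch below evaluates the pnum_hdr_sz() arithmetic for a given number of possible CPUs and memory ranges. The function name elfcorehdr_size() and its parameters are assumptions made for this example only; in the kernel the inputs come from num_possible_cpus() and CONFIG_CRASH_MAX_MEMORY_RANGES. The two size constants are the fixed ELF64 header sizes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Fixed ELF64 structure sizes: sizeof(Elf64_Ehdr) and sizeof(Elf64_Phdr). */
#define ELF64_EHDR_SIZE ((size_t)64)
#define ELF64_PHDR_SIZE ((size_t)56)

/* Same arithmetic as the pnum_hdr_sz() macro in the patch. */
#define pnum_hdr_sz(pnum) ((pnum) * ELF64_PHDR_SIZE + ELF64_EHDR_SIZE)

/*
 * Hypothetical userspace model of arch_crash_get_elfcorehdr_size().
 * possible_cpus stands in for num_possible_cpus(), max_memory_ranges
 * for CONFIG_CRASH_MAX_MEMORY_RANGES.
 */
static size_t elfcorehdr_size(unsigned int possible_cpus,
			      bool memory_hotplug,
			      unsigned int max_memory_ranges)
{
	/* One program header per possible CPU, plus vmcoreinfo and kernel_map. */
	size_t phdr_cnt = 2 + (size_t)possible_cpus;

	/* Room for memory ranges is only reserved when memory hotplug is enabled. */
	if (memory_hotplug)
		phdr_cnt += max_memory_ranges;

	return pnum_hdr_sz(phdr_cnt);
}
```

With 3 possible CPUs and memory hotplug disabled this gives 5 * 56 + 64 = 344 bytes; enabling memory hotplug adds 56 bytes per reserved memory range.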
The test is based on the following QEMU version:
https://github.com/salil-mehta/qemu.git virt-cpuhp-armv8/rfc-v2

Replace your '-smp' argument with something like:
 | -smp cpus=1,maxcpus=3,cores=3,threads=1,sockets=1

Then feed the following to the QEMU monitor to hotplug a vCPU:
 | (qemu) device_add driver=host-arm-cpu,core-id=1,id=cpu1
 | (qemu) device_del cpu1

Feed the following to the QEMU monitor to hotplug memory:
 | (qemu) object_add memory-backend-ram,id=mem1,size=256M
 | (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
 | (qemu) device_del dimm1

The QEMU startup configuration is as follows:

qemu-system-aarch64 \
	-M virt,gic-version=3,acpi=on,highmem=on \
	-enable-kvm \
	-cpu host \
	-kernel Image \
	-smp cpus=1,maxcpus=3,cores=3,threads=1,sockets=1 \
	-bios /usr/share/edk2/aarch64/QEMU_EFI.fd \
	-m 2G,slots=64,maxmem=16G \
	-nographic \
	-no-reboot \
	-device virtio-rng-pci \
	-append "root=/dev/vda rw console=ttyAMA0 kgdboc=ttyAMA0,115200 \
		earlycon acpi=on crashkernel=512M" \
	-drive if=none,file=images/rootfs.ext4,format=raw,id=hd0 \
	-device virtio-blk-device,drive=hd0

There are two system calls, `kexec_file_load` and `kexec_load`, used to
load the kdump image. Only the kexec_file_load path has been tested so far.

Cc: Catalin Marinas Cc: Will Deacon Cc: Baoquan He Cc: "Mike Rapoport (Microsoft)" Cc: Andrew Morton Cc: Breno Leitao Cc: Kees Cook [1]: https://lore.kernel.org/all/20240529133446.28446-1-Jonathan.Cameron at huawei.com/ Signed-off-by: Jinjie Ruan --- arch/arm64/Kconfig | 3 + arch/arm64/include/asm/kexec.h | 13 +++ arch/arm64/kernel/Makefile | 2 +- arch/arm64/kernel/crash.c | 152 +++++++++++++++++++++++++ arch/arm64/kernel/kexec_image.c | 21 +++- arch/arm64/kernel/machine_kexec_file.c | 40 ++----- 6 files changed, 195 insertions(+), 36 deletions(-) create mode 100644 arch/arm64/kernel/crash.c diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index fe60738e5943..9091c67e1cc2 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1609,6 +1609,9 @@ config ARCH_DEFAULT_CRASH_DUMP config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION def_bool CRASH_RESERVE +config ARCH_SUPPORTS_CRASH_HOTPLUG + def_bool y + config TRANS_TABLE def_bool y depends on HIBERNATION || KEXEC_CORE diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h index 892e5bebda95..4f3d4fc2807e 100644 --- a/arch/arm64/include/asm/kexec.h +++ b/arch/arm64/include/asm/kexec.h @@ -130,6 +130,19 @@ extern int load_other_segments(struct kimage *image, char *cmdline); #endif +#ifdef CONFIG_CRASH_HOTPLUG +#define pnum_hdr_sz(pnum) ((pnum) * sizeof(Elf64_Phdr) + sizeof(Elf64_Ehdr)) + +void arch_crash_handle_hotplug_event(struct kimage *image, void *arg); +#define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event + +int arch_crash_hotplug_support(struct kimage *image, unsigned long kexec_flags); +#define arch_crash_hotplug_support arch_crash_hotplug_support + +unsigned int arch_crash_get_elfcorehdr_size(void); +#define crash_get_elfcorehdr_size arch_crash_get_elfcorehdr_size +#endif + #endif /* __ASSEMBLER__ */ #endif diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile index 74b76bb70452..0625422fc528 100644 --- a/arch/arm64/kernel/Makefile +++ 
b/arch/arm64/kernel/Makefile @@ -64,7 +64,7 @@ obj-$(CONFIG_KEXEC_CORE) += machine_kexec.o relocate_kernel.o \ obj-$(CONFIG_KEXEC_FILE) += machine_kexec_file.o kexec_image.o obj-$(CONFIG_ARM64_RELOC_TEST) += arm64-reloc-test.o arm64-reloc-test-y := reloc_test_core.o reloc_test_syms.o -obj-$(CONFIG_CRASH_DUMP) += crash_dump.o +obj-$(CONFIG_CRASH_DUMP) += crash_dump.o crash.o obj-$(CONFIG_VMCORE_INFO) += vmcore_info.o obj-$(CONFIG_ARM_SDE_INTERFACE) += sdei.o obj-$(CONFIG_ARM64_PTR_AUTH) += pointer_auth.o diff --git a/arch/arm64/kernel/crash.c b/arch/arm64/kernel/crash.c new file mode 100644 index 000000000000..5882b9b5a90e --- /dev/null +++ b/arch/arm64/kernel/crash.c @@ -0,0 +1,152 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Architecture specific functions for kexec based crash dumps. + */ + +#define pr_fmt(fmt) "crash hp: " fmt + +#include +#include +#include +#include +#include +#include + +#include + +#if defined(CONFIG_KEXEC_FILE) || defined(CONFIG_CRASH_HOTPLUG) +unsigned int arch_get_system_nr_ranges(void) +{ + /* for exclusion of crashkernel region */ + unsigned int nr_ranges = 2 + crashk_cma_cnt + CRASH_HOTPLUG_SAFETY_PADDING; + phys_addr_t start, end; + u64 i; + + for_each_mem_range(i, &start, &end) + nr_ranges++; + + return nr_ranges; +} + +int arch_crash_populate_cmem(struct crash_mem *cmem) +{ + phys_addr_t start, end; + u64 i; + + for_each_mem_range(i, &start, &end) { + if (unlikely(cmem->nr_ranges >= cmem->max_nr_ranges)) + return -EAGAIN; + + cmem->ranges[cmem->nr_ranges].start = start; + cmem->ranges[cmem->nr_ranges].end = end - 1; + cmem->nr_ranges++; + } + + return 0; +} +#endif + +#ifdef CONFIG_CRASH_HOTPLUG +int arch_crash_hotplug_support(struct kimage *image, unsigned long kexec_flags) +{ +#ifdef CONFIG_KEXEC_FILE + if (image->file_mode) + return 1; +#endif + /* + * For kexec_load syscall, crash hotplug support requires + * KEXEC_CRASH_HOTPLUG_SUPPORT flag to be passed by userspace. 
+ */ + return kexec_flags & KEXEC_CRASH_HOTPLUG_SUPPORT; +} + +unsigned int arch_crash_get_elfcorehdr_size(void) +{ + unsigned int phdr_cnt; + + /* A program header for possible CPUs, vmcoreinfo and kernel_map */ + phdr_cnt = 2 + num_possible_cpus(); + if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) + phdr_cnt += CONFIG_CRASH_MAX_MEMORY_RANGES; + + return pnum_hdr_sz(phdr_cnt); +} + +/** + * update_crash_elfcorehdr() - Recreate the elfcorehdr and replace it with old + * elfcorehdr in the kexec segment array. + * @image: the active struct kimage + */ +static void update_crash_elfcorehdr(struct kimage *image) +{ + void *elfbuf = NULL, *old_elfcorehdr; + unsigned long mem, memsz; + unsigned long elfsz = 0; + + /* + * Create the new elfcorehdr reflecting the changes to CPU and/or + * memory resources. + */ + if (crash_prepare_headers(true, &elfbuf, &elfsz, NULL)) { + pr_err("unable to create new elfcorehdr"); + goto out; + } + + /* + * Obtain address and size of the elfcorehdr segment, and + * check it against the new elfcorehdr buffer. + */ + mem = image->segment[image->elfcorehdr_index].mem; + memsz = image->segment[image->elfcorehdr_index].memsz; + if (elfsz > memsz) { + pr_err("update elfcorehdr elfsz %lu > memsz %lu", + elfsz, memsz); + goto out; + } + + /* + * Copy new elfcorehdr over the old elfcorehdr at destination. + */ + old_elfcorehdr = (void *)__va(mem); + if (!old_elfcorehdr) { + pr_err("mapping elfcorehdr segment failed\n"); + goto out; + } + + /* + * Temporarily invalidate the crash image while the + * elfcorehdr is updated. 
+ */ + xchg(&kexec_crash_image, NULL); + memcpy((void *)old_elfcorehdr, elfbuf, elfsz); + dcache_clean_inval_poc((unsigned long)old_elfcorehdr, + (unsigned long)old_elfcorehdr + elfsz); + xchg(&kexec_crash_image, image); + pr_debug("updated elfcorehdr\n"); + +out: + vfree(elfbuf); +} + +/** + * arch_crash_handle_hotplug_event() - Handle hotplug elfcorehdr changes + * @image: a pointer to kexec_crash_image + * @arg: struct memory_notify handler for memory hotplug case and + * NULL for CPU hotplug case. + * + * Update the kdump image based on the type of hotplug event: + * - CPU add and remove: No action is needed. + * - Memory add/remove: Update the elfcorehdr to reflect the current memory layout. + * + * Prepare the new elfcorehdr and replace the existing elfcorehdr. + */ +void arch_crash_handle_hotplug_event(struct kimage *image, void *arg) +{ + if ((image->file_mode || image->elfcorehdr_updated) && + ((image->hp_action == KEXEC_CRASH_HP_ADD_CPU) || + (image->hp_action == KEXEC_CRASH_HP_REMOVE_CPU))) + return; + + update_crash_elfcorehdr(image); +} +#endif /* CONFIG_CRASH_HOTPLUG */ diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c index 93c36a3aa618..21f38de7a8b6 100644 --- a/arch/arm64/kernel/kexec_image.c +++ b/arch/arm64/kernel/kexec_image.c @@ -8,6 +8,7 @@ #define pr_fmt(fmt) "kexec_file(Image): " fmt +#include #include #include #include @@ -92,16 +93,32 @@ static void *image_load(struct kimage *image, #ifdef CONFIG_CRASH_DUMP if (image->type == KEXEC_TYPE_CRASH) { /* load elf core header */ - unsigned long headers_sz; + unsigned long headers_sz, pnum = 0; void *headers; - ret = crash_prepare_headers(true, &headers, &headers_sz, NULL); + ret = crash_prepare_headers(true, &headers, &headers_sz, &pnum); if (ret) { pr_err("Preparing elf core header failed\n"); return ERR_PTR(ret); } image->elf_headers = headers; image->elf_headers_sz = headers_sz; + +#ifdef CONFIG_CRASH_HOTPLUG + /* + * The elfcorehdr segment size accounts for 
VMCOREINFO, kernel_map + * maximum CPUs and maximum memory ranges. + */ + if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) + pnum = 2 + num_possible_cpus() + CONFIG_CRASH_MAX_MEMORY_RANGES; + else + pnum += 2 + num_possible_cpus(); + + if (pnum < (unsigned long)PN_XNUM) + image->elf_headers_sz = max(pnum_hdr_sz(pnum), headers_sz); + else + pr_err("number of Phdrs %lu exceeds max\n", pnum); +#endif } #endif diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c index d0f73eb3f856..0016001f4d00 100644 --- a/arch/arm64/kernel/machine_kexec_file.c +++ b/arch/arm64/kernel/machine_kexec_file.c @@ -10,11 +10,11 @@ #define pr_fmt(fmt) "kexec_file: " fmt +#include #include #include #include #include -#include #include #include #include @@ -39,38 +39,6 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image) return kexec_image_post_load_cleanup_default(image); } -#ifdef CONFIG_CRASH_DUMP -unsigned int arch_get_system_nr_ranges(void) -{ - /* for exclusion of crashkernel region */ - unsigned int nr_ranges = 2 + crashk_cma_cnt + CRASH_HOTPLUG_SAFETY_PADDING; - phys_addr_t start, end; - u64 i; - - for_each_mem_range(i, &start, &end) - nr_ranges++; - - return nr_ranges; -} - -int arch_crash_populate_cmem(struct crash_mem *cmem) -{ - phys_addr_t start, end; - u64 i; - - for_each_mem_range(i, &start, &end) { - if (unlikely(cmem->nr_ranges >= cmem->max_nr_ranges)) - return -EAGAIN; - - cmem->ranges[cmem->nr_ranges].start = start; - cmem->ranges[cmem->nr_ranges].end = end - 1; - cmem->nr_ranges++; - } - - return 0; -} -#endif - /* * Tries to add the initrd and DTB to the image. 
If it is not possible to find * valid locations, this function will undo changes to the image and return non @@ -98,6 +66,12 @@ int load_other_segments(struct kimage *image, kbuf.bufsz = image->elf_headers_sz; kbuf.mem = KEXEC_BUF_MEM_UNKNOWN; kbuf.memsz = image->elf_headers_sz; + +#ifdef CONFIG_CRASH_HOTPLUG + if (image->elf_headers_sz < pnum_hdr_sz(PN_XNUM)) + image->elfcorehdr_index = image->nr_segments; +#endif + kbuf.buf_align = SZ_64K; /* largest supported page size */ kbuf.buf_max = ULONG_MAX; kbuf.top_down = true; -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 18:43:26 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Tue, 2 Jun 2026 09:43:26 +0800 Subject: [PATCH v15 00/23] arm64/riscv: Add support for crashkernel CMA reservation In-Reply-To: References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <1a459706-80db-43d8-b163-76fc09da338d@huawei.com> On 6/1/2026 9:40 PM, Baoquan He wrote: > Hi Jinjie, > > On 06/01/26 at 05:47pm, Jinjie Ruan wrote: > ...snip... >> Changes in v15: >> - Unify the subject prefix formats as Huacai suggested. >> - Fix powerpc pre-existing NULL pointer dereference [Sashiko [1]] >> - Fix powerpc pre-existing __merge_memory_ranges() memory range >> truncation [Sashiko [1]]. >> - Fix pre-existing arm64 CMA page leaks [Sashiko[2]]. >> - Fix pre-existing crash_load_dm_crypt_keys() Use-After-Free and >> Double Free issue [Sashiko[3]]. >> - Fix vfree(headers) and uninitialized variables issue >> and simplify the fix [Sashiko[2]]. >> - As walk_system_ram_res() and for_each_mem_range() use different >> lock, unify and simplify the fix of TOCTOU buffer overflow via memory >> region padding [Sashiko[4]]. >> - Fix the arm64 crash dump issues in Sashiko[5]. >> - Link to v14: https://lore.kernel.org/all/20260525084932.934910-1-ruanjinjie at huawei.com/ > > Do these Fixes have anything with the main target of this patch series > you mentioned in cover-letter:"arm64/riscv: Add support for crashkernel CMA"? 
> The patches become more and more in each new version, I am wondering if
> it relies on these Fixes patches to implement your adding support for
> crashkernel CMA on arm64/risc-v.
>
> If not relying on them, could you split them into different patchset
> on different purpose?

Hi Baoquan,

Thank you for your valuable guidance. You are absolutely right.

Most of these fix patches are indeed not strictly related to the core
implementation of the crashkernel CMA support. They are pre-existing
bugs in the surrounding kexec/crash code that were flagged during our
review.

Previously, Andrew suggested taking a look at the code review comments
from the Sashiko AI system, which is why these fixes kept expanding.

I completely agree with your advice that there is no need to keep them
together. I will split them into two completely different patchsets
based on their purpose:

1. A cleaner version of this series, strictly focused on adding the
   core crashkernel CMA support for arm64/riscv.
2. One standalone bugfix patchset dedicated entirely to fixing these
   pre-existing issues.

By the way, I would also appreciate some advice on how to handle further
AI reviews. It seems that the more code we touch or refactor to fix
these pre-existing issues, the more tangential bugs the AI flags in the
newly exposed areas, making the series extremely difficult to converge.
Should I continue to address all AI-reported bugs associated with the
surrounding code in this series, or should we draw a strict line and
only focus on the core CMA logic moving forward?

I will prepare the split patchsets shortly. Thanks again for
straightening this out!

Best regards,
Jinjie Ruan

>
> Thanks
> Baoquan
>
>>
>> [1]: https://lore.kernel.org/all/20260525092207.96B9D1F000E9 at smtp.kernel.org/
>> [2]: https://lore.kernel.org/all/20260525091149.1A1E01F00A3D at smtp.kernel.org/
>> [3]: https://lore.kernel.org/all/20260525105227.3C2421F000E9 at smtp.kernel.org/
>> [4]: https://lore.kernel.org/all/20260525095447.944E11F000E9 at smtp.kernel.org/
>> [5]: https://lore.kernel.org/all/20260525101746.9959D1F000E9 at smtp.kernel.org/
>>
>> Changes in v14:
>> - Fix image->elf_headers memory leak during retry loop for arm64 as Sashiko
>>   AI code review pointed out.
>> - Solve the hotplug notifier arch_crash_handle_hotplug_event() AA
>>   self-deadlock problem as Sashiko AI code review pointed out.
>> - Fix the TOCTOU issue in prepare_elf_headers() by get_online_mems().
>> - -ENOMEM -> -EAGAIN as Breno suggested.
>> - Add support for arm64 crash hotplug.
>> - Link to v13: https://lore.kernel.org/all/20260511030454.1730881-1-ruanjinjie at huawei.com/
>> [...]
>> 24 files changed, 430 insertions(+), 338 deletions(-)
>> create mode 100644 arch/arm64/kernel/crash.c
>>
>> --
>> 2.34.1
>>

From mclapinski at google.com Tue Jun 2 08:43:54 2026
From: mclapinski at google.com (Michał Cłapiński)
Date: Tue, 2 Jun 2026 17:43:54 +0200
Subject: [PATCH v2] kexec_file: skip checksum verification when safe
In-Reply-To: <2vxzik81dlbu.fsf@kernel.org>
References: <20260602123311.1841746-1-mclapinski@google.com> <2vxzik81dlbu.fsf@kernel.org>
Message-ID:

On Tue, Jun 2, 2026 at 5:16 PM Pratyush Yadav wrote:
>
> On Tue, Jun 02 2026, Michal Clapinski wrote:
>
> > Checksum verification is needed
> > 1. for crash kernels. In a crash, we can't be sure the kernel is
> > intact.
> > 2. if we're worried about relocating the kernel into a region used by
> > some DMA that wasn't properly cancelled.
> >
> > If KHO is enabled then relocations will happen to KHO scratch, which
> > is free from DMA regions.
> > If we used CMA to allocate segments then relocations are not going to > > happen at all. > > Therefore, we can safely disable checksum verification in both of those > > cases. > > > > Instead of adding a new variable to purgatory, just skip adding regions > > and save the default value of SHA256 hash. > > > > Saves ~250ms on my 4.0 GHz CPU. This is an important saving for the > > live-update project. > > > > Signed-off-by: Michal Clapinski > > --- > > v2: > > - also skip checksum verification if KHO is enabled > > - small fixes from reviews > > > > My original idea was to do 2 changes: > > 1. Skip checksum if all segments are CMA. > > 2. If KHO is enabled, allocate the kernel inside kho_scratch using CMA. > > > > This way we could skip both relocations and checksum verification when > > KHO is enabled. > > But I realized that step 2 might not be possible on warm boots. > > AFAIU we only relocate into scratch since relocating anywhere else might > over-write preserved memory. If there is no relocation, there is no need > for the kernel image to be in scratch, since the image won't be > preserved memory anyway. > > So perhaps we can just use CMA directly, and only fall back to > kho_locate_mem_hole() if that fails? This should be a simple enough > change. I agree that it will work. However, the user would need to have CMA memory and it would need to have enough contiguous memory available. Do you think running out of CMA memory is a real problem? > Do you know how much time we can save by skipping relocations? I would > guess it is in the hundreds of milliseconds. It's smaller than the variance between runs. Maybe 10ms. Everything between exiting the old kernel and TSC initialization in the new kernel takes ~70ms. Theoretically if we didn't have to do relocations, we could try unpacking the kernel before kexec, which would save a little bit more time. But again, definitely less than 0.1s. 
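For reference, the skip condition from the quoted patch description (never for crash kernels; otherwise when KHO is enabled or every segment lives in CMA) can be modelled in userspace roughly as follows. This is only a sketch: enum image_type, only_cma_segments() and can_skip_checksum() are hypothetical stand-ins, not the kernel's own types or helpers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kexec image type. */
enum image_type { IMAGE_DEFAULT, IMAGE_CRASH };

/* True only if every segment was allocated from CMA, i.e. nothing is relocated. */
static bool only_cma_segments(const bool *segment_is_cma, size_t nr_segments)
{
	for (size_t i = 0; i < nr_segments; i++) {
		if (!segment_is_cma[i])
			return false;
	}

	return true;
}

/*
 * Model of the decision described in the patch: crash kernels always keep
 * the checksum; otherwise it may be skipped when KHO is enabled or when
 * no segment needs relocation.
 */
static bool can_skip_checksum(enum image_type type, bool kho_enabled,
			      const bool *segment_is_cma, size_t nr_segments)
{
	if (type == IMAGE_CRASH)
		return false;

	return kho_enabled || only_cma_segments(segment_is_cma, nr_segments);
}
```

A single non-CMA segment is enough to keep verification on when KHO is disabled, which matches the "all segments" wording in the description.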
> Can you try this (COMPLETELY UNTESTED) patch out and see if it works and
> if it further improves kexec time?
>
> --- 8< ---
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 2bfbb2d144e6..0ccc7b6d67c1 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -720,14 +720,6 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
>  	if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN)
>  		return 0;
>  
> -	/*
> -	 * If KHO is active, only use KHO scratch memory. All other memory
> -	 * could potentially be handed over.
> -	 */
> -	ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback);
> -	if (ret <= 0)
> -		return ret;
> -
>  	/*
>  	 * Try to find a free physically contiguous block of memory first. With that, we
>  	 * can avoid any copying at kexec time.
> @@ -735,6 +727,14 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf)
>  	if (!kexec_alloc_contig(kbuf))
>  		return 0;
>  
> +	/*
> +	 * If KHO is active and relocations are to be done, only use KHO
> +	 * scratch memory. All other memory could potentially be handed over.
> +	 */
> +	ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback);
> +	if (ret <= 0)
> +		return ret;
> +
>  	if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
>  		ret = kexec_walk_resources(kbuf, locate_mem_hole_callback);
>  	else
> --- >8 ---
>
> Of course this is not directly related to this patch so it shouldn't
> block it, but I reckon we might be able to squeeze a bit more
> performance out this way as a follow up.
>
> > I have no idea how to fix that (except weird ideas like 2 kho_scratches
> > that we swap on every warm boot), so I decided to just skip checksum
> > verification when KHO is enabled. This unfortunately means relocations
> > will still happen.
> > --- > > kernel/kexec_file.c | 27 +++++++++++++++++++++++++++ > > 1 file changed, 27 insertions(+) > > > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > > index 2bfbb2d144e6..db25a14692ab 100644 > > --- a/kernel/kexec_file.c > > +++ b/kernel/kexec_file.c > > @@ -27,6 +27,7 @@ > > #include > > #include > > #include > > +#include > > #include "kexec_internal.h" > > > > #ifdef CONFIG_KEXEC_SIG > > @@ -798,6 +799,16 @@ int kexec_add_buffer(struct kexec_buf *kbuf) > > return 0; > > } > > > > +static bool kexec_only_cma_segments(struct kimage *image) > > +{ > > + for (int i = 0; i < image->nr_segments; i++) { > > + if (!image->segment_cma[i]) > > + return false; > > + } > > + > > + return true; > > +} > > + > > /* Calculate and store the digest of segments */ > > static int kexec_calculate_store_digests(struct kimage *image) > > { > > @@ -822,6 +833,21 @@ static int kexec_calculate_store_digests(struct kimage *image) > > > > sha256_init(&sctx); > > > > + /* > > + * If KHO is enabled, the destinations are located in KHO scratch. > > + * KHO scratch can only contain early boot allocations and movable > > + * allocations. That means there is no risk of memory corruption by > > + * uncancelled DMA. > > + * > > + * If all segments were loaded into contiguous memory, there will be no > > + * relocations at all, so also no risk no corruption. > > Typo: "so also no risk *of* corruption". > > We can fix that up when applying I think, so no need for a v3 just for > this. 
> > Other than this, > > Reviewed-by: Pratyush Yadav (Google) > > > + */ > > + if (image->type != KEXEC_TYPE_CRASH && > > + (kho_is_enabled() || kexec_only_cma_segments(image))) { > > + pr_debug("disabling checksum verification in purgatory\n"); > > + goto skip_checksum; > > + } > > + > > for (j = i = 0; i < image->nr_segments; i++) { > > struct kexec_segment *ksegment; > > > > @@ -867,6 +893,7 @@ static int kexec_calculate_store_digests(struct kimage *image) > > j++; > > } > > > > +skip_checksum: > > sha256_final(&sctx, digest); > > > > ret = kexec_purgatory_get_set_symbol(image, "purgatory_sha_regions", > > -- > Regards, > Pratyush Yadav From robh at kernel.org Tue Jun 2 09:24:50 2026 From: robh at kernel.org (Rob Herring) Date: Tue, 2 Jun 2026 11:24:50 -0500 Subject: [PATCH v3 03/11] of: reserved_mem: avoid post-init UAF when alloc_reserved_mem_array() fails In-Reply-To: <20260527032917.3385849-4-chenwandun1@gmail.com> References: <20260527032917.3385849-1-chenwandun1@gmail.com> <20260527032917.3385849-4-chenwandun1@gmail.com> Message-ID: <20260602162450.GA442759-robh@kernel.org> On Wed, May 27, 2026 at 11:29:09AM +0800, Wandun Chen wrote: > From: Wandun Chen > > The global pointer 'reserved_mem' continues to reference the > reserved_mem_array which lives in __initdata if > alloc_reserved_mem_array() fails. of_reserved_mem_lookup() is > exported for post-init use, that would dereference freed memory > and trigger a use-after-free. > > So reset reserved_mem_count to 0 when alloc_reserved_mem_array() > fails. > > Fixes: 00c9a452a235 ("of: reserved_mem: Add code to dynamically allocate reserved_mem array") Fixes should come first in a series. 
> Signed-off-by: Wandun Chen > --- > drivers/of/of_reserved_mem.c | 20 ++++++++++++++------ > 1 file changed, 14 insertions(+), 6 deletions(-) > > diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c > index 313cbc57aa45..6d479381ff1f 100644 > --- a/drivers/of/of_reserved_mem.c > +++ b/drivers/of/of_reserved_mem.c > @@ -69,29 +69,31 @@ static int __init early_init_dt_alloc_reserved_memory_arch(phys_addr_t size, > * the initial static array is copied over to this new array and > * the new array is used from this point on. > */ > -static void __init alloc_reserved_mem_array(void) > +static bool __init alloc_reserved_mem_array(void) > { > struct reserved_mem *new_array; > size_t alloc_size, copy_size, memset_size; > > + if (!total_reserved_mem_cnt) > + return true; > + > alloc_size = array_size(total_reserved_mem_cnt, sizeof(*new_array)); > if (alloc_size == SIZE_MAX) { > pr_err("Failed to allocate memory for reserved_mem array with err: %d", -EOVERFLOW); > - return; > + goto fail; > } > > new_array = memblock_alloc(alloc_size, SMP_CACHE_BYTES); > if (!new_array) { > pr_err("Failed to allocate memory for reserved_mem array with err: %d", -ENOMEM); > - return; > + goto fail; > } > > copy_size = array_size(reserved_mem_count, sizeof(*new_array)); > if (copy_size == SIZE_MAX) { > memblock_free(new_array, alloc_size); > - total_reserved_mem_cnt = MAX_RESERVED_REGIONS; > pr_err("Failed to allocate memory for reserved_mem array with err: %d", -EOVERFLOW); These prints could be moved to 'fail'. Perhaps instead of just printing an error value, you can return the error value instead of boolean. If you respin just this patch, I can pick it up for 7.2. 
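One possible shape for that suggestion, shown as a userspace sketch rather than the actual of_reserved_mem code: the function returns 0 or a negative errno, and the single error print lives under the 'fail' label. The names checked_array_size() and alloc_array() are hypothetical, and calloc() stands in for memblock_alloc().

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical overflow-checked multiply that saturates to SIZE_MAX. */
static size_t checked_array_size(size_t n, size_t elem)
{
	if (elem != 0 && n > SIZE_MAX / elem)
		return SIZE_MAX;

	return n * elem;
}

/*
 * Hypothetical allocator model: returns 0 on success (including the
 * "nothing to allocate" case) or a negative errno, printing once on failure.
 */
static int alloc_array(size_t count, size_t elem_size, void **out)
{
	size_t alloc_size;
	int err;

	*out = NULL;
	if (!count)
		return 0;

	alloc_size = checked_array_size(count, elem_size);
	if (alloc_size == SIZE_MAX) {
		err = -EOVERFLOW;
		goto fail;
	}

	*out = calloc(1, alloc_size);	/* stands in for memblock_alloc() */
	if (!*out) {
		err = -ENOMEM;
		goto fail;
	}

	return 0;

fail:
	fprintf(stderr, "Failed to allocate array with err: %d\n", err);
	return err;
}
```

A caller can then reset its own bookkeeping on any non-zero return without caring which step failed.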
Rob From pratyush at kernel.org Tue Jun 2 09:43:43 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 18:43:43 +0200 Subject: [PATCH v4 07/13] kho: add support for linked-block serialization In-Reply-To: (Pasha Tatashin's message of "Mon, 1 Jun 2026 10:37:35 -0400") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-8-pasha.tatashin@soleen.com> <2vxzqzmqfkit.fsf@kernel.org> Message-ID: <2vxzcxy8evuo.fsf@kernel.org> On Mon, Jun 01 2026, Pasha Tatashin wrote: > On 06-01 15:38, Pratyush Yadav wrote: >> On Sat, May 30 2026, Pasha Tatashin wrote: >> >> > Introduce a linked-block serialization mechanism for state handover. >> > >> > Previously, LUO used contiguous memory blocks for serializing sessions >> > and files, which imposed limits on the total number of items that could >> > be preserved across a live update. >> > >> > This commit adds the infrastructure for a more flexible, block-based >> > approach where serialized data is stored in a chain of linked blocks. >> > This is a generic KHO serialization block infrastructure that can be >> > used by multiple subsystems. >> > >> > Signed-off-by: Pasha Tatashin [...] >> > +/** >> > + * DOC: KHO Serialization Blocks ABI >> > + * >> > + * Subsystems using the KHO Serialization Blocks framework rely on the stable >> > + * Application Binary Interface defined below to pass serialized state from a >> > + * pre-update kernel to a post-update kernel. >> > + * >> > + * This interface is a contract. Any modification to the structure fields, >> > + * compatible strings, or the layout of the `__packed` serialization >> > + * structures defined here constitutes a breaking change. Such changes require >> > + * incrementing the version number in the `KHO_BLOCK_ABI_COMPATIBLE` string to >> > + * prevent a new kernel from misinterpreting data from an old kernel. 
>> > + * >> > + * Changes are allowed provided the compatibility version is incremented; >> > + * however, backward/forward compatibility is only guaranteed for kernels >> > + * supporting the same ABI version. >> > + */ >> > + >> > +#ifndef _LINUX_KHO_ABI_BLOCK_H >> > +#define _LINUX_KHO_ABI_BLOCK_H >> > + >> > +#include >> > +#include >> > + >> > +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" >> >> During KHO radix development, I argued for a separate compatible for the >> radix tree, but at that time, we tied the radix tree to core KHO ABI. >> The argument being that all core KHO data structures belong to the KHO >> ABI set. I imagine this will be used by kho_vmalloc, so it will also be >> end up being used by a core KHO API. >> >> So, do we want separate ABI? I don't much have a preference myself, but >> I do think the compatible management will be a bit easier if this relied >> on KHO compatible, especially once kho_vmalloc starts using it. > > I prefer to make them fine-grained, now that we are adding more and more > features: kho vmalloc, kho radix, and kho block should all have their > own compatibility strings. Furthermore, any components that depend on > them should include these compatibility strings in their own > compatibility strings, in the same manner I have done in this series. Sure, sounds good. > >> >> > + >> > +/** >> > + * KHO_BLOCK_SIZE - The size of each serialization block. >> > + * >> > + * This is defined as PAGE_SIZE. PAGE_SIZE is ABI compliant because live >> > + * update between kernels with different page sizes is not supported by KHO. >> > + */ >> > +#define KHO_BLOCK_SIZE PAGE_SIZE >> > + >> > +/** >> > + * struct kho_block_header_ser - Header for the serialized data block. >> > + * @next: Physical address of the next struct kho_block_header_ser. >> > + * @count: The number of entries that immediately follow this header in the >> > + * memory block. 
>> > + * >> > + * This structure is located at the beginning of a block of physical memory >> > + * preserved across a kexec. It provides the necessary metadata to interpret >> > + * the array of entries that follow. >> > + */ >> > +struct kho_block_header_ser { >> > + u64 next; >> > + u64 count; >> > +} __packed; >> > + >> > +#endif /* _LINUX_KHO_ABI_BLOCK_H */ >> > diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h >> > new file mode 100644 >> > index 000000000000..5e6b87b1befa >> > --- /dev/null >> > +++ b/include/linux/kho_block.h >> > @@ -0,0 +1,79 @@ >> > +/* SPDX-License-Identifier: GPL-2.0 */ >> > +/* >> > + * Copyright (c) 2026, Google LLC. >> > + * Pasha Tatashin >> > + */ >> > + >> > +#ifndef _LINUX_KHO_BLOCK_H >> > +#define _LINUX_KHO_BLOCK_H >> > + >> > +#include >> > +#include >> > +#include >> > + >> > +/** >> > + * struct kho_block - Internal representation of a serialization block. >> > + * @list: List head for linking blocks in memory. >> > + * @ser: Pointer to the serialized header in preserved memory. >> > + */ >> > +struct kho_block { >> > + struct list_head list; >> > + struct kho_block_header_ser *ser; >> > +}; >> > + >> > +/** >> > + * struct kho_block_set - A set of blocks that belong to the same object. >> > + * @blocks: The list of serialization blocks (struct kho_block). >> > + * @nblocks: The number of allocated serialization blocks. >> > + * @head_pa: Physical address of the first block header. >> > + * @entry_size: The size of each entry in the blocks. >> > + * @count_per_block: The maximum number of entries each block can hold. >> > + * @incoming: True if this block set was restored from the previous kernel. >> > + */ >> > +struct kho_block_set { >> > + struct list_head blocks; >> > + long nblocks; >> > + u64 head_pa; >> > + size_t entry_size; >> >> I think we should add the entry_size to kho_block_header_ser? I think it >> is a part of the ABI of the block set. 
If this changes, we cannot parse >> a block set with a different size. If a subsystem wants to change entry >> size, they create a new block set with different entry size, and then >> they bump their compatible version. > > I have considered that, and we can certainly do it; however, I do not > see how it would affect the current implementation. If luo_file or > luo_session change entry_size, they must change the LUO compatibility > version, which would prevent LU from one kernel to the next. However, > for flexibility and future extensibility, I believe it would be useful > to add entry_size and block_size (which is PAGE_SIZE, but could be > larger for some users) to the header. This is more of a feature request > than an issue with the current series. My suggestion was mainly for sanity checking. So if LUO or another user inadvertently changes entry size, it gets caught. But thinking about it more, there are a million other ways to break compatibility while keeping the entry size same so perhaps it doesn't matter as much... > >> >> > + u64 count_per_block; >> > + bool incoming; >> > +}; >> > + >> > +/** >> > + * struct kho_block_it - Iterator for serializing entries into blocks. >> > + * @bs: The block set being iterated. >> > + * @block: The current block. >> > + * @i: The current entry index within @block. >> > + */ >> > +struct kho_block_it { >> > + struct kho_block_set *bs; >> > + struct kho_block *block; >> > + u64 i; >> > +}; >> > + >> > +/** >> > + * KHO_BLOCK_SET_INIT - Initialize a static kho_block_set. >> > + * @_name: Name of the kho_block_set variable. >> > + * @_entry_size: The size of each entry in the block set. 
>> > + */ >> > +#define KHO_BLOCK_SET_INIT(_name, _entry_size) { \ >> > + .blocks = LIST_HEAD_INIT((_name).blocks), \ >> > + .entry_size = _entry_size, \ >> > +} >> > + >> > +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size); >> > + >> > +int kho_block_grow(struct kho_block_set *bs, u64 count); >> > +void kho_block_shrink(struct kho_block_set *bs, u64 count); >> >> These block management functions seem like internal details of the block > > This is not so. The confusion here is that they must be allocated and > preserved at runtime as resources are registered/unregistered, while > these blocks are only used during the serialization phase. > > These calls are more like notifiers that more files/sessions are created or > removed, so we can adjust block count accordingly if necessary (allocate > preserved memory), and have them available during > serialization/deserialization Yeah, I got that when reading the later patches that use these. Perhaps kho_block_prealloc() and kho_block_unalloc() would be clearer names, although they do not sound as nice. If not, then I suppose at least add a comment explaining the intended usage. > >> set API. Do we need to export them? I think users should not have to >> worry about block management. They should read, set, or clear entries >> using the iterators, and internally the block management should take care of >> allocation or freeing. So here for example, I th > > something is missing :-) I don't remember what I meant to say anymore :-/ [...] >> > +/** >> > + * kho_block_set_init - Initialize a block set. >> > + * @bs: The block set to initialize. >> > + * @entry_size: The size of each entry in the blocks.
>> > + */ >> > +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size) >> > +{ >> > + *bs = (struct kho_block_set)KHO_BLOCK_SET_INIT(*bs, entry_size); >> > +} >> > + >> > +static inline u64 kho_block_count_per_block(struct kho_block_set *bs) >> > +{ >> > + if (unlikely(!bs->count_per_block)) { >> > + bs->count_per_block = (KHO_BLOCK_SIZE - >> > + sizeof(struct kho_block_header_ser)) / >> > + bs->entry_size; >> > + WARN_ON(!bs->count_per_block); >> > + } >> > + return bs->count_per_block; >> > +} >> >> This looks odd. I don't see a reason to calculate this lazily. Why not >> just do it when initializing the block set, in kho_block_set_init() or >> kho_block_restore()? And then use bs->count_per_block directly. > > This allows for blocks to use static initialization. I like static inits > :-) You can do this: #define KHO_BLOCK_SET_INIT(_name, _entry_size) { \ .blocks = LIST_HEAD_INIT((_name).blocks), \ .entry_size = _entry_size, \ .count_per_block = (KHO_BLOCK_SIZE - sizeof(struct kho_block_header_ser)) / (_entry_size), \ } Compiles for me. [...] >> > +void kho_block_destroy(struct kho_block_set *bs) >> > +{ >> > + u64 head_pa = bs->head_pa; >> > + struct kho_block *block; >> > + >> > + while (!list_empty(&bs->blocks)) { >> > + block = list_first_entry(&bs->blocks, struct kho_block, list); >> > + list_del(&block->list); >> > + kfree(block); >> > + } >> >> Nit: >> >> list_for_each_entry_safe(block, tmp, &bs->blocks, list) { >> list_del(&block->list); >> kfree(block); >> } >> >> is a bit more idiomatic (and IMO easier to read). > > Sure > >> >> > + bs->nblocks = 0; >> > + bs->head_pa = 0; >> > + >> > + while (head_pa) { >> > + struct kho_block_header_ser *ser = phys_to_virt(head_pa); >> > + >> > + head_pa = ser->next; >> > + kho_block_free_ser(bs, ser); >> >> Nit: also, can't you put this in the previous loop as well?
Something like: >> >> list_for_each_entry_safe(block, tmp, &bs->blocks, list) { >> list_del(&block->list); >> kho_block_free_ser(block->ser); >> kfree(block); >> } > > We actually can't merge these into a single loop because of partial > restoration failures handling in kho_block_restore(). > > If kho_block_restore fails halfway through restoring a chain of blocks > (for example, if kho_block_add fails on block 3 of 5), we jump to the > err_destroy cleanup path which calls kho_block_destroy(). > > At this point: > - bs->blocks only contains the tracked blocks we successfully added > (blocks 1 and 2). > - bs->head_pa still points to the physical head of the entire 5-block > incoming chain. > > But, this is a good place to add a comment. IMO it would be cleaner for kho_block_destroy() to destroy the currently initialized block set, and then the error handling path in restore path can clean up the rest. > >> > + } >> > +} [...] >> > +/** >> > + * kho_block_it_prev - Return the previous entry slot in the block set. >> > + * @it: The block iterator. >> > + * >> > + * If the current index is at the start of a block, it automatically moves to >> > + * the end of the previous block. >> > + * >> > + * Return: A pointer to the previous entry slot, or NULL if at the very >> > + * beginning of the block set. >> > + */ >> > +void *kho_block_it_prev(struct kho_block_it *it) >> > +{ >> > + if (!it->block) >> > + return NULL; >> > + >> > + if (it->i == 0) { >> > + if (list_is_first(&it->block->list, &it->bs->blocks)) >> > + return NULL; >> > + it->block = list_prev_entry(it->block, list); >> > + it->i = kho_block_count_per_block(it->bs); >> > + } >> > + >> > + return (void *)(it->block->ser + 1) + (--it->i * it->bs->entry_size); >> > +} >> > + >> > +/** >> > + * kho_block_it_finalize - Finalize the current block by setting its entry count. >> > + * @it: The block iterator. 
>> > + */ >> > +void kho_block_it_finalize(struct kho_block_it *it) >> > +{ >> > + if (it->block) >> > + it->block->ser->count = it->i; >> > +} >> >> Doesn't kho_block_it_next() already do this when you add an entry? So >> this seems redundant. > > It is not redundant because of how the final partially-filled block is handled. > > kho_block_it_next() only writes the count into the block header when a block is completely full and it is advancing to the next one: > > if (it->i == kho_block_count_per_block(it->bs)) { > it->block->ser->count = it->i; > ... > > But for the very last block in the set, it is usually only partially > filled (e.g., we write 10 entries into a block with a capacity of 64). > Since it->i never reaches the maximum capacity, kho_block_it_next() > never commits its count. > > Pasha I think we can make kho_block_it_next() always write it. I think it makes sense from an API point of view, since I see this API as "adding an entry to the block set", so updating its internal counters makes sense. Requiring the finalize call is error-prone, since it is easy to forget. Then you silently lose some entries on the next boot. -- Regards, Pratyush Yadav From pratyush at kernel.org Tue Jun 2 09:49:40 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 18:49:40 +0200 Subject: [PATCH v2] kexec_file: skip checksum verification when safe In-Reply-To: ("Michał Cłapiński"'s message of "Tue, 2 Jun 2026 17:43:54 +0200") References: <20260602123311.1841746-1-mclapinski@google.com> <2vxzik81dlbu.fsf@kernel.org> Message-ID: <2vxz8q8wevkr.fsf@kernel.org> On Tue, Jun 02 2026, Michał Cłapiński wrote: > On Tue, Jun 2, 2026 at 5:16 PM Pratyush Yadav wrote: >> >> On Tue, Jun 02 2026, Michal Clapinski wrote: >> >> > Checksum verification is needed >> > 1. for crash kernels. In a crash, we can't be sure the kernel is >> > intact. >> > 2.
if we're worried about relocating the kernel into a region used by >> > some DMA that wasn't properly cancelled. >> > >> > If KHO is enabled then relocations will happen to KHO scratch, which >> > is free from DMA regions. >> > If we used CMA to allocate segments then relocations are not going to >> > happen at all. >> > Therefore, we can safely disable checksum verification in both of those >> > cases. >> > >> > Instead of adding a new variable to purgatory, just skip adding regions >> > and save the default value of SHA256 hash. >> > >> > Saves ~250ms on my 4.0 GHz CPU. This is an important saving for the >> > live-update project. >> > >> > Signed-off-by: Michal Clapinski >> > --- >> > v2: >> > - also skip checksum verification if KHO is enabled >> > - small fixes from reviews >> > >> > My original idea was to do 2 changes: >> > 1. Skip checksum if all segments are CMA. >> > 2. If KHO is enabled, allocate the kernel inside kho_scratch using CMA. >> > >> > This way we could skip both relocations and checksum verification when >> > KHO is enabled. >> > But I realized that step 2 might not be possible on warm boots. >> >> AFAIU we only relocate into scratch since relocating anywhere else might >> over-write preserved memory. If there is no relocation, there is no need >> for the kernel image to be in scratch, since the image won't be >> preserved memory anyway. >> >> So perhaps we can just use CMA directly, and only fall back to >> kho_locate_mem_hole() if that fails? This should be a simple enough >> change. > > I agree that it will work. However, the user would need to have CMA > memory and it would need to have enough contiguous memory available. > Do you think running out of CMA memory is a real problem? No idea. I think that depends heavily on how much memory drivers are using, and I have no numbers for that. Anyway, if the user doesn't have memory available in CMA, we will still fall back to the normal path so kexec load will still at least keep working. 
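The fallback order discussed above (try a contiguous allocation first, and only use KHO scratch when that fails) can be modeled in plain userspace C. This is only a sketch under stated assumptions: `struct mem_pool`, `pool_take()` and `place_segment()` are hypothetical names invented for illustration, and the byte counters stand in for the real CMA and scratch allocators, which they do not attempt to reproduce.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result of placing one kexec segment (illustration only). */
enum placement { PLACED_CONTIG, PLACED_SCRATCH, PLACED_NONE };

/* Hypothetical pool: a byte counter standing in for a real allocator. */
struct mem_pool {
	size_t free_bytes;	/* bytes still available in this pool */
};

/* Take @size bytes from @pool if it has room; report success. */
static bool pool_take(struct mem_pool *pool, size_t size)
{
	if (pool->free_bytes < size)
		return false;
	pool->free_bytes -= size;
	return true;
}

/*
 * Prefer the contiguous pool (no relocation needed at kexec time) and
 * fall back to the scratch pool only when the first attempt fails.
 */
enum placement place_segment(struct mem_pool *contig,
			     struct mem_pool *scratch, size_t size)
{
	if (pool_take(contig, size))
		return PLACED_CONTIG;
	if (pool_take(scratch, size))
		return PLACED_SCRATCH;
	return PLACED_NONE;
}
```

In this model a load only fails when both pools are exhausted, which matches the point that running out of CMA memory should degrade to the existing path rather than break kexec load.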
> >> Do you know how much time we can save by skipping relocations? I would >> guess it is in the hundreds of milliseconds. > > It's smaller than the variance between runs. Maybe 10ms. Everything > between exiting the old kernel and TSC initialization in the new > kernel takes ~70ms. > > Theoretically if we didn't have to do relocations, we could try > unpacking the kernel before kexec, which would save a little bit more > time. But again, definitely less than 0.1s. Hmm, I thought it would take longer. I don't think we are at a point yet where we should try to save tens of milliseconds. Thanks for trying it out though. > >> Can you try this (COMPLETELY UNTESTED) patch out and see if it works and >> if it further improves kexec time? >> >> --- 8< --- >> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c >> index 2bfbb2d144e6..0ccc7b6d67c1 100644 >> --- a/kernel/kexec_file.c >> +++ b/kernel/kexec_file.c >> @@ -720,14 +720,6 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf) >> if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN) >> return 0; >> >> - /* >> - * If KHO is active, only use KHO scratch memory. All other memory >> - * could potentially be handed over. >> - */ >> - ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback); >> - if (ret <= 0) >> - return ret; >> - >> /* >> * Try to find a free physically contiguous block of memory first. With that, we >> * can avoid any copying at kexec time. >> @@ -735,6 +727,14 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf) >> if (!kexec_alloc_contig(kbuf)) >> return 0; >> >> + /* >> + * If KHO is active and relocations are to be done, only use KHO >> + * scratch memory. All other memory could potentially be handed over.
>> + */ >> + ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback); >> + if (ret <= 0) >> + return ret; >> + >> if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) >> ret = kexec_walk_resources(kbuf, locate_mem_hole_callback); >> else >> --- >8 --- >> >> Of course this is not directly related to this patch so it shouldn't >> block it, but I reckon we might be able to squeeze a bit more >> performance out this way as a follow up. >> [...] -- Regards, Pratyush Yadav From pratyush at kernel.org Tue Jun 2 09:58:52 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 18:58:52 +0200 Subject: [PATCH] kho: try to allocate contiguous memory for kexec segments In-Reply-To: <20260601193014.896405-1-mclapinski@google.com> (Michal Clapinski's message of "Mon, 1 Jun 2026 21:30:14 +0200") References: <20260601193014.896405-1-mclapinski@google.com> Message-ID: <2vxz4ijkev5f.fsf@kernel.org> On Mon, Jun 01 2026, Michal Clapinski wrote: > This allows us to skip relocations (and maybe checksum calculation > in the future). I'm confused. Doesn't your patch "kexec_file: skip checksum verification when safe" [0] skip the checksum for KHO already? So this only skips the relocations part then? And based on the discussion on that thread, relocations don't seem to take much time. So is there a real need for this patch? [0] https://lore.kernel.org/kexec/20260602123311.1841746-1-mclapinski at google.com/ > > kho_scratch is marked as MIGRATE_CMA but isn't actually given to the > CMA, so it should only contain movable allocations, therefore this > should always succeed. > > Signed-off-by: Michal Clapinski [...] 
-- Regards, Pratyush Yadav From pratyush at kernel.org Tue Jun 2 10:02:07 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 19:02:07 +0200 Subject: [PATCH v5 04/13] liveupdate: register luo_ser as KHO subtree In-Reply-To: <20260602031717.197696-5-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Tue, 2 Jun 2026 03:17:08 +0000") References: <20260602031717.197696-1-pasha.tatashin@soleen.com> <20260602031717.197696-5-pasha.tatashin@soleen.com> Message-ID: <2vxzzf1cdgfk.fsf@kernel.org> On Tue, Jun 02 2026, Pasha Tatashin wrote: > Entirely remove the LUO FDT wrapper since the FDT only carries the > compatible string and the pointer to the centralized struct luo_ser. > Instead, register the struct luo_ser via the KHO raw subtree > API, placing the compatibility string inside the structure itself. > > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) [...] -- Regards, Pratyush Yadav From pratyush at kernel.org Tue Jun 2 10:06:25 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 19:06:25 +0200 Subject: [PATCH v5 08/13] liveupdate: defer session block allocation and PA setting In-Reply-To: <20260602031717.197696-9-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Tue, 2 Jun 2026 03:17:12 +0000") References: <20260602031717.197696-1-pasha.tatashin@soleen.com> <20260602031717.197696-9-pasha.tatashin@soleen.com> Message-ID: <2vxzv7c0dg8e.fsf@kernel.org> On Tue, Jun 02 2026, Pasha Tatashin wrote: > Currently, luo_session_setup_outgoing() allocates the session block and > sets its physical address in the header immediately. With upcoming > dynamic block-based session management, this makes the first block > different from the rest. Move the allocation to where it is first needed. 
> > Acked-by: Mike Rapoport (Microsoft) > Reviewed-by: Pratyush Yadav (Google) > Signed-off-by: Pasha Tatashin > --- > include/linux/kho_block.h | 22 +++++++++++ > kernel/liveupdate/luo_core.c | 4 +- > kernel/liveupdate/luo_internal.h | 2 +- > kernel/liveupdate/luo_session.c | 68 ++++++++++++++++++++------------ > 4 files changed, 67 insertions(+), 29 deletions(-) > > diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h > index 505bf78409f2..0a8cda2cbfb5 100644 > --- a/include/linux/kho_block.h > +++ b/include/linux/kho_block.h > @@ -70,6 +70,28 @@ int kho_block_set_restore(struct kho_block_set *bs, u64 head_pa); > void kho_block_set_destroy(struct kho_block_set *bs); > void kho_block_set_clear(struct kho_block_set *bs); > > +/** > + * kho_block_set_head_pa - Get the physical address of the first block header. > + * @bs: The block set. > + * > + * Return: The physical address of the first block header, or 0 if empty. > + */ > +static inline u64 kho_block_set_head_pa(struct kho_block_set *bs) > +{ > + return bs->head_pa; > +} > + > +/** > + * kho_block_set_is_empty - Check if the block set has no allocated blocks. > + * @bs: The block set. > + * > + * Return: True if there are no blocks in the set, false otherwise. > + */ > +static inline bool kho_block_set_is_empty(struct kho_block_set *bs) > +{ > + return list_empty(&bs->blocks); > +} > + Are these intended to be here or should they go in patch 7? > void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs); > void *kho_block_it_reserve_entry(struct kho_block_it *it); > void *kho_block_it_read_entry(struct kho_block_it *it); [...] 
-- Regards, Pratyush Yadav From pratyush at kernel.org Tue Jun 2 10:07:59 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 19:07:59 +0200 Subject: [PATCH v5 10/13] liveupdate: Remove limit on the number of files per session In-Reply-To: <20260602031717.197696-11-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Tue, 2 Jun 2026 03:17:14 +0000") References: <20260602031717.197696-1-pasha.tatashin@soleen.com> <20260602031717.197696-11-pasha.tatashin@soleen.com> Message-ID: <2vxzo6hsdg5s.fsf@kernel.org> On Tue, Jun 02 2026, Pasha Tatashin wrote: > To remove the fixed limit on the number of preserved files per session, > transition the file metadata serialization from a single contiguous > memory block to a chain of linked blocks. > > Acked-by: Mike Rapoport (Microsoft) > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) [...] -- Regards, Pratyush Yadav From mclapinski at google.com Tue Jun 2 10:08:09 2026 From: mclapinski at google.com (Michał Cłapiński) Date: Tue, 2 Jun 2026 19:08:09 +0200 Subject: [PATCH] kho: try to allocate contiguous memory for kexec segments In-Reply-To: <2vxz4ijkev5f.fsf@kernel.org> References: <20260601193014.896405-1-mclapinski@google.com> <2vxz4ijkev5f.fsf@kernel.org> Message-ID: On Tue, Jun 2, 2026 at 6:58 PM Pratyush Yadav wrote: > > On Mon, Jun 01 2026, Michal Clapinski wrote: > > > This allows us to skip relocations (and maybe checksum calculation > > in the future). > > I'm confused. Doesn't your patch "kexec_file: skip checksum verification > when safe" [0] skip the checksum for KHO already? So this only skips the > relocations part then? > > And based on the discussion on that thread, relocations don't seem to > take much time. So is there a real need for this patch? I sent this patch out yesterday, then realized it wasn't going to work so I abandoned this approach. Today I sent patch [0] that skips the checksum for KHO, so this patch is no longer needed.
> [0] https://lore.kernel.org/kexec/20260602123311.1841746-1-mclapinski at google.com/ > > > > > kho_scratch is marked as MIGRATE_CMA but isn't actually given to the > > CMA, so it should only contain movable allocations, therefore this > > should always succeed. > > > > Signed-off-by: Michal Clapinski > [...] > > -- > Regards, > Pratyush Yadav From pratyush at kernel.org Tue Jun 2 10:09:03 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 19:09:03 +0200 Subject: [PATCH] kho: try to allocate contiguous memory for kexec segments In-Reply-To: ("Michał Cłapiński"'s message of "Tue, 2 Jun 2026 19:08:09 +0200") References: <20260601193014.896405-1-mclapinski@google.com> <2vxz4ijkev5f.fsf@kernel.org> Message-ID: <2vxzjysgdg40.fsf@kernel.org> On Tue, Jun 02 2026, Michał Cłapiński wrote: > On Tue, Jun 2, 2026 at 6:58 PM Pratyush Yadav wrote: >> >> On Mon, Jun 01 2026, Michal Clapinski wrote: >> >> > This allows us to skip relocations (and maybe checksum calculation >> > in the future). >> >> I'm confused. Doesn't your patch "kexec_file: skip checksum verification >> when safe" [0] skip the checksum for KHO already? So this only skips the >> relocations part then? >> >> And based on the discussion on that thread, relocations don't seem to >> take much time. So is there a real need for this patch? > > I sent this patch out yesterday, then realized it wasn't going to work > so I abandoned this approach. > Today I sent patch [0] that skips the checksum for KHO, so this patch > is no longer needed. Makes sense, thanks for clarifying. [...]
-- Regards, Pratyush Yadav From pratyush at kernel.org Tue Jun 2 10:15:05 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 19:15:05 +0200 Subject: [PATCH 1/2] liveupdate: Reference count outgoing FLB data In-Reply-To: <20260528174140.1921129-2-dmatlack@google.com> (David Matlack's message of "Thu, 28 May 2026 17:41:39 +0000") References: <20260528174140.1921129-1-dmatlack@google.com> <20260528174140.1921129-2-dmatlack@google.com> Message-ID: <2vxzfr34dfty.fsf@kernel.org> Hi David, On Thu, May 28 2026, David Matlack wrote: > Increment the outgoing FLB refcount in liveupdate_flb_get_outgoing() so > that the FLB structure cannot be freed while the caller is actively > using it. Add an additional liveupdate_flb_put_outgoing() function so > the caller can explicitly indicate when it is done using the outgoing > FLB. > > During a Live Update, the kernel may need to fetch the outgoing FLB > outside of the scope of a file handler's preserve() and unpreserve() > callbacks. In that situation there is no way for the caller to protect > itself against the outgoing FLB from being freed while it is using it. > Incrementing the reference count in liveupdate_flb_get_outgoing() > ensures it cannot be freed. We grab a reference to the FLB's module when the first file using the FLB is preserved. So the FLB should never go away while preserved files exist. Once all preserved files go away, you normally shouldn't be doing anything with the FLB anyway. Can you please elaborate on the use case and why this is a problem? Using the FLB outside of the standard LUO file callbacks sounds problematic. > > This change also aligns the outgoing FLB lifecycle management with the > incoming FLB, since the latter uses the same get/put semantics. > > Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state") > Assisted-by: Gemini:gemini-3-pro-preview > Signed-off-by: David Matlack [...] 
-- Regards, Pratyush Yadav From pratyush at kernel.org Tue Jun 2 10:18:18 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 19:18:18 +0200 Subject: [PATCH 2/2] liveupdate: Remember FLB retrieve() status In-Reply-To: <20260528174140.1921129-3-dmatlack@google.com> (David Matlack's message of "Thu, 28 May 2026 17:41:40 +0000") References: <20260528174140.1921129-1-dmatlack@google.com> <20260528174140.1921129-3-dmatlack@google.com> Message-ID: <2vxzbjdsdfol.fsf@kernel.org> On Thu, May 28 2026, David Matlack wrote: > LUO keeps track of successful retrieve attempts on an FLB. It does so > to avoid multiple retrievals of the same FLB. Multiple retrievals cause > problems because once the FLB is retrieved, the serialized data > structures are likely freed and the FLB is likely in a very different > state from what the code expects. > > All this works well when retrieve succeeds. When it fails, > luo_flb_retrieve_one() returns the error immediately, without ever > storing anywhere that a retrieve was attempted or what its error code > was. If the user attempts to retrieve another file registered with the > same FLB, LUO will attempt to call the FLB's retrieve() callback again. > > The retry is problematic for much of the same reasons listed above. The > FLB is likely in a very different state than what the retrieve logic > normally expects (e.g. some KHO pages may have already been restored and > freed). > > There is no sane way of attempting the retrieve again. Remember the > error retrieve returned and directly return it on a retry. > > This is done by changing the retrieved bool to a retrieve_status > integer. A value of 0 means retrieve was never attempted, a positive > value means it succeeded, and a negative value means it failed and the > error code is the value. > > This is similar to commit f85b1c6af5bc ("liveupdate: luo_file: remember > retrieve() status") which did the same for LUO files. 
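The three-state `retrieve_status` convention described in the commit message above (0 = never attempted, positive = succeeded, negative = failed with that error code) can be illustrated with a small userspace model. This is a sketch, not the LUO code: `struct flb_model`, `fake_retrieve_cb()` and `flb_retrieve_once()` are hypothetical names, and the fake callback simply returns a preset value so the caching behaviour is visible.

```c
#include <errno.h>

/* Hypothetical model of an FLB; all names here are illustrative only. */
struct flb_model {
	int retrieve_status;	/* 0: never tried, >0: succeeded, <0: -errno */
	int retrieve_calls;	/* how many times the callback really ran */
	int next_result;	/* what the fake callback returns (0 or -errno) */
};

/* Stand-in for a retrieve() callback that is not safe to run twice. */
static int fake_retrieve_cb(struct flb_model *flb)
{
	flb->retrieve_calls++;
	return flb->next_result;
}

/*
 * Run the callback at most once. Later calls replay the remembered
 * outcome instead of touching state the first attempt may have consumed.
 */
int flb_retrieve_once(struct flb_model *flb)
{
	int err;

	if (flb->retrieve_status > 0)
		return 0;			/* already retrieved */
	if (flb->retrieve_status < 0)
		return flb->retrieve_status;	/* replay the old error */

	err = fake_retrieve_cb(flb);
	flb->retrieve_status = err ? err : 1;
	return err;
}
```

A single signed integer is enough here because a successful retrieve never needs to carry more information than "done", while a failed one must carry the original error code.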
> > Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state") > Assisted-by: Gemini:gemini-3-pro-preview > Signed-off-by: David Matlack Reviewed-by: Pratyush Yadav (Google) Thanks for fixing this! [...] -- Regards, Pratyush Yadav From dmatlack at google.com Tue Jun 2 10:25:34 2026 From: dmatlack at google.com (David Matlack) Date: Tue, 2 Jun 2026 17:25:34 +0000 Subject: [PATCH 1/2] liveupdate: Reference count outgoing FLB data In-Reply-To: <2vxzfr34dfty.fsf@kernel.org> References: <20260528174140.1921129-1-dmatlack@google.com> <20260528174140.1921129-2-dmatlack@google.com> <2vxzfr34dfty.fsf@kernel.org> Message-ID: On 2026-06-02 07:15 PM, Pratyush Yadav wrote: > Hi David, > > On Thu, May 28 2026, David Matlack wrote: > > > Increment the outgoing FLB refcount in liveupdate_flb_get_outgoing() so > > that the FLB structure cannot be freed while the caller is actively > > using it. Add an additional liveupdate_flb_put_outgoing() function so > > the caller can explicitly indicate when it is done using the outgoing > > FLB. > > > > During a Live Update, the kernel may need to fetch the outgoing FLB > > outside of the scope of a file handler's preserve() and unpreserve() > > callbacks. In that situation there is no way for the caller to protect > > itself against the outgoing FLB from being freed while it is using it. > > Incrementing the reference count in liveupdate_flb_get_outgoing() > > ensures it cannot be freed. > > We grab a reference to the FLB's module when the first file using the > FLB is preserved. So the FLB should never go away while preserved files > exist. Once all preserved files go away, you normally shouldn't be doing > anything with the FLB anyway. > > Can you please elaborate on the use case and why this is a problem? > Using the FLB outside of the standard LUO file callbacks sounds > problematic. 
The scenario I had in mind was to remove a PCI device from the outgoing FLB if the device is forcibly removed while the file is still preserved, for example someone writes 1 to /sys/bus/pci/devices/.../remove or a device is physically hot-unplugged. Specifically this call here from the patch below: +void pci_liveupdate_cleanup_device(struct pci_dev *dev) +{ + /* + * It should be safe to READ_ONCE() outside of the rwsem during cleanup + * since there should no longer be any references to @dev on the system. + */ + if (READ_ONCE(dev->liveupdate.outgoing)) { + pci_WARN(dev, 1, "Destroying outgoing-preserved device!\n"); + pci_liveupdate_unpreserve(dev); + } +} https://lore.kernel.org/linux-pci/20260522202410.3104264-3-dmatlack at google.com/ I can do this without adding reference counting to liveupdate_flb_get_outgoing(), but the reference counting makes it obvious that the outgoing FLB will not be freed while I am using it here, and also aligns with liveupdate_flb_get_incoming(). > > > > This change also aligns the outgoing FLB lifecycle management with the > > incoming FLB, since the latter uses the same get/put semantics. > > > > Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state") > > Assisted-by: Gemini:gemini-3-pro-preview > > Signed-off-by: David Matlack > [...] 
> > -- > Regards, > Pratyush Yadav From rppt at kernel.org Tue Jun 2 10:50:07 2026 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 2 Jun 2026 20:50:07 +0300 Subject: [PATCH 12/12] mm/hugetlb: make bootmem allocation work with KHO In-Reply-To: <2vxzpl29dpzj.fsf@kernel.org> References: <20260429133928.850721-1-pratyush@kernel.org> <20260429133928.850721-13-pratyush@kernel.org> <2vxzo6i37bs6.fsf@kernel.org> <2vxzpl29dpzj.fsf@kernel.org> Message-ID: On Tue, Jun 02, 2026 at 03:35:44PM +0200, Pratyush Yadav wrote: > On Sun, May 31 2026, Mike Rapoport wrote: > > > On Mon, May 25, 2026 at 05:24:09PM +0200, Pratyush Yadav wrote: > >> On Sun, May 17 2026, Mike Rapoport wrote: > >> > On Wed, Apr 29, 2026 at 03:39:14PM +0200, Pratyush Yadav wrote: > >> >> From: "Pratyush Yadav (Google)" > >> > >> So, in summary, I would like to pursue option 1 and try to make it more > >> appetizing. But I would like to at least know if you hate the "extended > >> scratch" (ignore the name) as a concept or only the code it results in. > > > > Let's retry this one :) > > > > I looked more closely, and it seems that mixing SCRATCH and SCRATCH_EXT > > should be a lesser headache than going with option 4. > > I also had some time to ruminate on this. I still think option 1 has the > most promise, but my opinion on option 4 has improved a bit. While I > still am not sure adding a 3rd phase to struct page/MM init (early -> > deferred -> KHO reserved blocks) is a good idea, I think it might not be > as bad as I first thought. Dunno... Until (if ever) we enlighten memblock_free_all()/deferred_init_pages() about KHO, small scattered reservation make memblock slower. It's hard to tell if delaying more complex initialization of the large reserved chunks until SMP is worth speedup in a few memblock operations between kho_memory_init() and the end of deferred_init_pages(). > Anyway, for now I think I will try to make option 1 more appetizing. 
> > Here's an idea I want to try out: I get rid of SCRATCH_EXT and mark the > free blocks as SCRATCH. For HugeTLB, I can teach the special > memblock_alloc_hugetlb_something() function to exclude scratch areas > when looking for free memory ranges. So core memblock does not get a new > memory type, and the complexity of hugepage allocation does not leak > into memblock. > > How does that sound? It sounds fine, let's see how it looks :) > > Tracking the changes in gigantic pages in hugetlb also does not seem > > something we'd like to pursue especially considering that memory from freed > > or demoted gigantic pages could be reserved. > > > > If we add a dedicated memblock_something to allocate gigantic pages, we > > can reduce branching in alloc_bootmem() to > > > > if (cma) > > do_cma() > > else > > do_memblock() > > > > For hugetlb_cma we might want to teach CMA to create pre-allocated areas > > and then it could reuse the same memblock API. This seems useful even > > regardless of KHO. > > Sorry, I don't get what you mean by this. What pre-allocated areas? When > creating CMA areas it calls cma_alloc_mem() which calls into memblock. > What would we change about this? s/teach CMA to create/teach CMA to use/ I mean that we might want to be able first allocate gigantic pages and then feed them to empty CMA areas. > -- > Regards, > Pratyush Yadav -- Sincerely yours, Mike. 
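The idea discussed above, skipping scratch areas while searching for a free range for a gigantic page, can be pictured with a minimal user-space sketch. Everything here is hypothetical: the sketch_* names, the range representation, and the convention that 0 means "not found" are assumptions for illustration, and this is not memblock code or the eventual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical half-open physical range [start, end). */
struct sketch_range {
	uint64_t start;
	uint64_t end;
};

/* True if [start, end) intersects the range r. */
static bool sketch_overlaps(uint64_t start, uint64_t end,
			    const struct sketch_range *r)
{
	return start < r->end && r->start < end;
}

/* Round addr up to a multiple of align (align must be non-zero). */
static uint64_t sketch_align_up(uint64_t addr, uint64_t align)
{
	return (addr + align - 1) / align * align;
}

/*
 * Find the lowest align-aligned address inside free_range where size bytes
 * fit without touching any scratch range. Returns 0 if nothing fits; this
 * sketch treats address 0 as "not found" and ignores integer overflow.
 */
static uint64_t sketch_find_outside_scratch(const struct sketch_range *free_range,
					    const struct sketch_range *scratch,
					    size_t nr_scratch,
					    uint64_t size, uint64_t align)
{
	uint64_t addr = sketch_align_up(free_range->start, align);

	assert(size > 0 && align > 0);

	while (addr + size <= free_range->end) {
		bool hit = false;

		for (size_t i = 0; i < nr_scratch; i++) {
			if (sketch_overlaps(addr, addr + size, &scratch[i])) {
				/* Jump past the scratch range and realign. */
				addr = sketch_align_up(scratch[i].end, align);
				hit = true;
				break;
			}
		}
		if (!hit)
			return addr;
	}
	return 0;
}
```

The point of the sketch is only that the exclusion can live in the allocation helper itself, so the generic range bookkeeping does not need an extra memory type.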
From skhawaja at google.com Tue Jun 2 13:14:37 2026 From: skhawaja at google.com (Samiullah Khawaja) Date: Tue, 2 Jun 2026 20:14:37 +0000 Subject: [RFC PATCH 3/4] dma-direct: Add API to preserve/restore allocations In-Reply-To: References: <20260505002737.2213734-1-skhawaja@google.com> <20260505002737.2213734-4-skhawaja@google.com> Message-ID: On Mon, Jun 01, 2026 at 01:35:16PM +0100, Will Deacon wrote: >On Tue, May 05, 2026 at 12:27:36AM +0000, Samiullah Khawaja wrote: >> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c >> index ec887f443741..c2b98f91900a 100644 >> --- a/kernel/dma/direct.c >> +++ b/kernel/dma/direct.c >> @@ -6,6 +6,8 @@ >> */ >> #include /* for max_pfn */ >> #include >> +#include >> +#include >> #include >> #include >> #include >> @@ -307,6 +309,167 @@ void *dma_direct_alloc(struct device *dev, size_t size, >> return NULL; >> } >> >> +#ifdef CONFIG_DMA_LIVEUPDATE >> +int dma_direct_preserve_allocation(struct device *dev, void *cpu_addr, >> + size_t size, dma_addr_t dma_handle, >> + unsigned long attrs, u64 *state) >> +{ >> + struct dma_alloc_ser *ser; >> + int ret; >> + >> + if (!kho_is_enabled()) >> + return -EOPNOTSUPP; >> + >> + if (IS_ENABLED(CONFIG_DMA_CMA)) >> + return -EOPNOTSUPP; > >Hmm, it seems a bit overkill to do this just because CMA is compiled >in, especially as it's user-selectable in kconfig. > >Maybe you need to iterate over the CMA areas using cma_for_each_area(), >similarly to how you do with the pools? Agreed. So basically return error if the range belongs to one of the CMAs. I will update this. 
> >Will From pasha.tatashin at soleen.com Tue Jun 2 19:21:28 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 02:21:28 +0000 Subject: [PATCH v4 07/13] kho: add support for linked-block serialization In-Reply-To: <178038801491.119771.18384706761138506132.b4-review@b4> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-8-pasha.tatashin@soleen.com> <178038801491.119771.18384706761138506132.b4-review@b4> Message-ID: On 06-02 11:13, Mike Rapoport wrote: > On Sat, 30 May 2026 22:19:32 +0000, Pasha Tatashin wrote: > > diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h > > new file mode 100644 > > index 000000000000..5e6b87b1befa > > --- /dev/null > > +++ b/include/linux/kho_block.h > > @@ -0,0 +1,79 @@ > > [ ... skip 19 lines ... ] > > + struct list_head list; > > + struct kho_block_header_ser *ser; > > +}; > > + > > +/** > > + * struct kho_block_set - A set of blocks that belong to the same object. > > "same object" sounds off to me. The blocks belong to the same module? > user? > > Thoughts? user and module are not descriptive, as the same client/user/module can use multiple kho_block_set for different purposes. I suggest: "struct kho_block_set - A set of blocks containing serialized entries of the same type." > > > + * @blocks: The list of serialization blocks (struct kho_block). > > + * @nblocks: The number of allocated serialization blocks. > > + * @head_pa: Physical address of the first block header. > > + * @entry_size: The size of each entry in the blocks. > > I think it's "... entry in a block" It is 'in the blocks' (or 'across the blocks') because a single block_set can contain multiple blocks, and they all share this same uniform entry size. > > > [ ... skip 42 lines ... 
] > > + > > +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs); > > +void *kho_block_it_next(struct kho_block_it *it); > > +void *kho_block_it_read(struct kho_block_it *it); > > +void *kho_block_it_prev(struct kho_block_it *it); > > +void kho_block_it_finalize(struct kho_block_it *it); > > These operate on block sets, should be reflected in the names. > Can be kho_blocks_ to avoid too long names. We have already started using kho_block_set. Although it is longer, I prefer to avoid kho_blocks/kho_block because the subtle difference makes them difficult to read and prone to typos during coding. Let's use kho_block_set for operations on a block_set. > > > > diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c > > new file mode 100644 > > index 000000000000..a4e650af946f > > --- /dev/null > > +++ b/kernel/liveupdate/kho_block.c > > @@ -0,0 +1,384 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > + > > +/* > > + * Copyright (c) 2026, Google LLC. > > + * Pasha Tatashin > > + */ > > + > > +/** > > + * DOC: KHO Serialization Blocks > > + * > > + * KHO provides a mechanism to preserve stateful data across a kexec handover > > + * by serializing it into memory blocks. This file provides the common > > "This file" does not look good in HTML docs. Fixed. > > > [ ... skip 15 lines ... ] > > + > > +/* > > + * Safeguard limit for the number of serialization blocks. This is used to > > + * prevent infinite loops and excessive memory allocation in case of memory > > + * corruption in the preserved state. > > + */ > > Can you add how much memory it is and how many entries with, say, 4 u64 > it can accommodate? Done > > > [ ... skip 13 lines ... ] > > +{ > > + if (unlikely(!bs->count_per_block)) { > > + bs->count_per_block = (KHO_BLOCK_SIZE - > > + sizeof(struct kho_block_header_ser)) / > > + bs->entry_size; > > + WARN_ON(!bs->count_per_block); > > Don't you want to set count_per_block in _init()? Done. > > > [ ... skip 29 lines ... 
] > > + if (!block) > > + return -ENOMEM; > > + > > + block->ser = ser; > > + last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); > > + list_add_tail(&block->list, &bs->blocks); > > No locks? Linked blocks are not internally synchronized; that is a responsibility of the caller, similar to linked lists. > > > [ ... skip 12 lines ... ] > > + * @bs: The block set. > > + * @count: The current number of entries. > > + * > > + * This function handles the dynamic expansion of a block set. It allocates > > + * and links a new serialization block if the provided entry count matches > > + * the current total capacity of the set. > > This is a weird semantics for a generic API. I'd expect _grow() would > add count - current_count blocks. Changed the semantics to use target count, i.e. "The target number of valid entries to accommodate." > > > [ ... skip 25 lines ... ] > > +} > > + > > +/** > > + * kho_block_shrink - Conditionally destroy the last block in a block set. > > + * @bs: The block set. > > + * @count: The current number of entries across all blocks. > > Maybe > ... of valid entries? OK > > > + * > > + * This function checks if the last block in the set is redundant based on the > > + * total entry count and the capacity of the preceding blocks. If the entry > > + * count can be accommodated by the blocks that come before the last one, the > > + * last block is destroyed and removed from the set. > > This should mention that it's the caller responsibility to ensure that > entries are removed in the right order. OK > > > [ ... skip 49 lines ... ] > > + > > + fast = phys_to_virt(fast->next); > > + slow = phys_to_virt(slow->next); > > + > > + if (slow == fast) { > > + pr_err("Cyclic list detected\n"); > > Maybe "block set is corrupted"? OK > > > + return false; > > + } > > + } > > + > > + return true; > > +} > > + > > +/** > > + * kho_block_restore - Restore a block set from a physical address. > > + * @bs: The block set to restore. 
> > + * @head_pa: Physical address of the first block header. > > I'd mention that the block set should be allocated and initialized Done > > > [ ... skip 10 lines ... ] > > + bs->incoming = true; > > + if (!head_pa) > > + return 0; > > + > > + bs->head_pa = head_pa; > > + if (!kho_cyclic_blocks_check(bs)) { > > if (kho_block_set_cyclic()) > > reads nicer IMO Sure, done. > > > [ ... skip 87 lines ... ] > > +{ > > + if (!it->block) > > + return NULL; > > + > > + if (it->i == kho_block_count_per_block(it->bs)) { > > + it->block->ser->count = it->i; > > Why iterator updates ser->count? The new name kho_block_set_it_reserve_entry() clarifies that this is a write/reservation path function (unlike the original read-only next name). Reserving a slot to write entries naturally implies writing/finalizing the metadata count in the physical block header when a block becomes full > > + if (list_is_last(&it->block->list, &it->bs->blocks)) > > + return NULL; > > + it->block = list_next_entry(it->block, list); > > + it->i = 0; > > + } > > + > > + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); > > In a month we'll need an LLM's help to understand what it does. Good thing in a month we will have even stronger LLMs to help us :-) Anyways, clean-up ... > > > +} > > + > > +/** > > + * kho_block_it_read - Return the next entry slot for reading. > > + * @it: The block iterator. > > And what is the conceptual difference between this and _it_next()? This was updated :-) > > > [ ... skip 49 lines ... ] > > + * @it: The block iterator. > > + */ > > +void kho_block_it_finalize(struct kho_block_it *it) > > +{ > > + if (it->block) > > + it->block->ser->count = it->i; > > So, it looks like the intention of _it_next is for write, and this ends a > write iteration. > > I think the names should be adjusted to make it clearer. Done > > -- > Sincerely yours, > Mike. 
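As a rough illustration of the sizing question raised above (how many entries of four u64 fit, and how many blocks a target entry count needs), here is a minimal user-space sketch. It assumes 4 KiB pages for KHO_BLOCK_SIZE and a 16-byte header (two u64 fields, as in struct kho_block_header_ser); the sketch_* names are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define SKETCH_BLOCK_SIZE  4096u /* stands in for KHO_BLOCK_SIZE (PAGE_SIZE) */
#define SKETCH_HEADER_SIZE 16u   /* two u64 fields: next, count */

/* Number of fixed-size entries that fit in one block after the header. */
static uint64_t sketch_count_per_block(size_t entry_size)
{
	assert(entry_size > 0);
	assert(entry_size <= SKETCH_BLOCK_SIZE - SKETCH_HEADER_SIZE);
	return (SKETCH_BLOCK_SIZE - SKETCH_HEADER_SIZE) / entry_size;
}

/* Number of blocks needed to hold count entries (target-count semantics). */
static uint64_t sketch_blocks_for(uint64_t count, size_t entry_size)
{
	uint64_t per_block = sketch_count_per_block(entry_size);

	return (count + per_block - 1) / per_block;
}
```

Under these assumptions a 32-byte entry (four u64) gives 127 entries per block. With the 10000-block cap that the v6 cover letter names for KHO_MAX_BLOCKS, that is 1,270,000 entries in 40,960,000 bytes (about 39 MiB) of blocks.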
> From pasha.tatashin at soleen.com Tue Jun 2 19:44:25 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 02:44:25 +0000 Subject: [PATCH v4 07/13] kho: add support for linked-block serialization In-Reply-To: <2vxzcxy8evuo.fsf@kernel.org> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-8-pasha.tatashin@soleen.com> <2vxzqzmqfkit.fsf@kernel.org> <2vxzcxy8evuo.fsf@kernel.org> Message-ID: On 06-02 18:43, Pratyush Yadav wrote: > On Mon, Jun 01 2026, Pasha Tatashin wrote: > > > On 06-01 15:38, Pratyush Yadav wrote: > >> On Sat, May 30 2026, Pasha Tatashin wrote: > >> > >> > Introduce a linked-block serialization mechanism for state handover. > >> > > >> > Previously, LUO used contiguous memory blocks for serializing sessions > >> > and files, which imposed limits on the total number of items that could > >> > be preserved across a live update. > >> > > >> > This commit adds the infrastructure for a more flexible, block-based > >> > approach where serialized data is stored in a chain of linked blocks. > >> > This is a generic KHO serialization block infrastructure that can be > >> > used by multiple subsystems. > >> > > >> > Signed-off-by: Pasha Tatashin > [...] > >> > +/** > >> > + * DOC: KHO Serialization Blocks ABI > >> > + * > >> > + * Subsystems using the KHO Serialization Blocks framework rely on the stable > >> > + * Application Binary Interface defined below to pass serialized state from a > >> > + * pre-update kernel to a post-update kernel. > >> > + * > >> > + * This interface is a contract. Any modification to the structure fields, > >> > + * compatible strings, or the layout of the `__packed` serialization > >> > + * structures defined here constitutes a breaking change. Such changes require > >> > + * incrementing the version number in the `KHO_BLOCK_ABI_COMPATIBLE` string to > >> > + * prevent a new kernel from misinterpreting data from an old kernel. 
> >> > + * > >> > + * Changes are allowed provided the compatibility version is incremented; > >> > + * however, backward/forward compatibility is only guaranteed for kernels > >> > + * supporting the same ABI version. > >> > + */ > >> > + > >> > +#ifndef _LINUX_KHO_ABI_BLOCK_H > >> > +#define _LINUX_KHO_ABI_BLOCK_H > >> > + > >> > +#include > >> > +#include > >> > + > >> > +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" > >> > >> During KHO radix development, I argued for a separate compatible for the > >> radix tree, but at that time, we tied the radix tree to core KHO ABI. > >> The argument being that all core KHO data structures belong to the KHO > >> ABI set. I imagine this will be used by kho_vmalloc, so it will also be > >> end up being used by a core KHO API. > >> > >> So, do we want separate ABI? I don't much have a preference myself, but > >> I do think the compatible management will be a bit easier if this relied > >> on KHO compatible, especially once kho_vmalloc starts using it. > > > > I prefer to make them fine-grained, now that we are adding more and more > > features: kho vmalloc, kho radix, and kho block should all have their > > own compatibility strings. Furthermore, any components that depend on > > them should include these compatibility strings in their own > > compatibility strings, in the same manner I have done in this series. > > Sure, sounds good. > > > > >> > >> > + > >> > +/** > >> > + * KHO_BLOCK_SIZE - The size of each serialization block. > >> > + * > >> > + * This is defined as PAGE_SIZE. PAGE_SIZE is ABI compliant because live > >> > + * update between kernels with different page sizes is not supported by KHO. > >> > + */ > >> > +#define KHO_BLOCK_SIZE PAGE_SIZE > >> > + > >> > +/** > >> > + * struct kho_block_header_ser - Header for the serialized data block. > >> > + * @next: Physical address of the next struct kho_block_header_ser. 
> >> > + * @count: The number of entries that immediately follow this header in the > >> > + * memory block. > >> > + * > >> > + * This structure is located at the beginning of a block of physical memory > >> > + * preserved across a kexec. It provides the necessary metadata to interpret > >> > + * the array of entries that follow. > >> > + */ > >> > +struct kho_block_header_ser { > >> > + u64 next; > >> > + u64 count; > >> > +} __packed; > >> > + > >> > +#endif /* _LINUX_KHO_ABI_BLOCK_H */ > >> > diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h > >> > new file mode 100644 > >> > index 000000000000..5e6b87b1befa > >> > --- /dev/null > >> > +++ b/include/linux/kho_block.h > >> > @@ -0,0 +1,79 @@ > >> > +/* SPDX-License-Identifier: GPL-2.0 */ > >> > +/* > >> > + * Copyright (c) 2026, Google LLC. > >> > + * Pasha Tatashin > >> > + */ > >> > + > >> > +#ifndef _LINUX_KHO_BLOCK_H > >> > +#define _LINUX_KHO_BLOCK_H > >> > + > >> > +#include > >> > +#include > >> > +#include > >> > + > >> > +/** > >> > + * struct kho_block - Internal representation of a serialization block. > >> > + * @list: List head for linking blocks in memory. > >> > + * @ser: Pointer to the serialized header in preserved memory. > >> > + */ > >> > +struct kho_block { > >> > + struct list_head list; > >> > + struct kho_block_header_ser *ser; > >> > +}; > >> > + > >> > +/** > >> > + * struct kho_block_set - A set of blocks that belong to the same object. > >> > + * @blocks: The list of serialization blocks (struct kho_block). > >> > + * @nblocks: The number of allocated serialization blocks. > >> > + * @head_pa: Physical address of the first block header. > >> > + * @entry_size: The size of each entry in the blocks. > >> > + * @count_per_block: The maximum number of entries each block can hold. > >> > + * @incoming: True if this block set was restored from the previous kernel. 
> >> > + */ > >> > +struct kho_block_set { > >> > + struct list_head blocks; > >> > + long nblocks; > >> > + u64 head_pa; > >> > + size_t entry_size; > >> > >> I think we should add the entry_size to kho_block_header_ser? I think it > >> is a part of the ABI of the block set. If this changes, we cannot parse > >> a block set with a different size. If a subsystem wants to change entry > >> size, they create a new block set with different entry size, and then > >> they bump their compatible version. > > > > I have considered that, and we can certainly do it; however, I do not > > see how it would affect the current implementation. If luo_file or > > luo_session change entry_size, they must change the LUO compatibility > > version, which would prevent LU from one kernel to the next. However, > > for flexibility and future extensibility, I believe it would be useful > > to add entry_size and block_size (which is PAGE_SIZE, but could be > > larger for some users) to the header. This is more of a feature request > > than an issue with the current series. > > My suggestion was mainly for sanity checking. So if LUO or another user > inadvertently changes entry size, it gets caught. But thinking about it > more, there are a million other ways to break compatibility while > keeping the entry size same so perhaps it doesn't matter as much... > > > > >> > >> > + u64 count_per_block; > >> > + bool incoming; > >> > +}; > >> > + > >> > +/** > >> > + * struct kho_block_it - Iterator for serializing entries into blocks. > >> > + * @bs: The block set being iterated. > >> > + * @block: The current block. > >> > + * @i: The current entry index within @block. > >> > + */ > >> > +struct kho_block_it { > >> > + struct kho_block_set *bs; > >> > + struct kho_block *block; > >> > + u64 i; > >> > +}; > >> > + > >> > +/** > >> > + * KHO_BLOCK_SET_INIT - Initialize a static kho_block_set. > >> > + * @_name: Name of the kho_block_set variable. 
> >> > + * @_entry_size: The size of each entry in the block set. > >> > + */ > >> > +#define KHO_BLOCK_SET_INIT(_name, _entry_size) { \ > >> > + .blocks = LIST_HEAD_INIT((_name).blocks), \ > >> > + .entry_size = _entry_size, \ > >> > +} > >> > + > >> > +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size); > >> > + > >> > +int kho_block_grow(struct kho_block_set *bs, u64 count); > >> > +void kho_block_shrink(struct kho_block_set *bs, u64 count); > >> > >> These block management functions seem like internal details of the block > > > > This is not so. The confusion here is that they must be allocated and > > preserved at runtime as resources are registered/unregistered, while > > these blocks are only used during the serialization phase. > > > > These calls are more like notifiers that more files/sessions are created or > > removed, so we can adjust block count accordingly if necessary (allocate > > preserved memory), and have them available during > > serialization/deserialization > > Yeah, I got that when reading the later patches that use these. > > Perhaps kho_block_prealloc() and kho_block_unalloc() are clearer, > although it does not sound as nice. If not, then I suppose at least add > a comment explaining the intended usage. Done > > > > >> set API. Do we need to export them? I think users should not have to > >> worry about block management. They should read, set, or clear entries > >> using the iterators, and internally the block management should take care of > >> allocation or freeing. So here for example, I th > > > > something is missing :-) > > I don't remember what I meant to say anymore :-/ > > [...] > >> > +/** > >> > + * kho_block_set_init - Initialize a block set. > >> > + * @bs: The block set to initialize. > >> > + * @entry_size: The size of each entry in the blocks. 
> >> > + */ > >> > +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size) > >> > +{ > >> > + *bs = (struct kho_block_set)KHO_BLOCK_SET_INIT(*bs, entry_size); > >> > +} > >> > + > >> > +static inline u64 kho_block_count_per_block(struct kho_block_set *bs) > >> > +{ > >> > + if (unlikely(!bs->count_per_block)) { > >> > + bs->count_per_block = (KHO_BLOCK_SIZE - > >> > + sizeof(struct kho_block_header_ser)) / > >> > + bs->entry_size; > >> > + WARN_ON(!bs->count_per_block); > >> > + } > >> > + return bs->count_per_block; > >> > +} > >> > >> This looks odd. I don't see a reason to calculate this lazily. Why not > >> just do it when initializing the block set, in kho_block_set_init() or > >> kho_block_restore()? And then use bs->count_per_block directly. > > > > This allows for blocks to use static initialization, I like static inits > > :-) > > You can do this: > > #define KHO_BLOCK_SET_INIT(_name, _entry_size) { \ > .blocks = LIST_HEAD_INIT((_name).blocks), \ > .entry_size = _entry_size, \ > .count_per_block = (KHO_BLOCK_SIZE - sizeof(struct kho_block_header_ser)) / (_entry_size), \ > } > > Compiles for me. You are correct, done. > > [...] > >> > +void kho_block_destroy(struct kho_block_set *bs) > >> > +{ > >> > + u64 head_pa = bs->head_pa; > >> > + struct kho_block *block; > >> > + > >> > + while (!list_empty(&bs->blocks)) { > >> > + block = list_first_entry(&bs->blocks, struct kho_block, list); > >> > + list_del(&block->list); > >> > + kfree(block); > >> > + } > >> > >> Nit: > >> > >> list_for_each_entry_safe(block, tmp, &bs->blocks, list) { > >> list_del(&block->list); > >> kfree(block); > >> } > >> > >> is a bit more idiomatic (and IMO easier to read). 
> > > > Sure > > > >> > >> > + bs->nblocks = 0; > >> > + bs->head_pa = 0; > >> > + > >> > + while (head_pa) { > >> > + struct kho_block_header_ser *ser = phys_to_virt(head_pa); > >> > + > >> > + head_pa = ser->next; > >> > + kho_block_free_ser(bs, ser); > >> > >> Nit: also, can't you put this also in the previous loop? Something like: > >> > >> list_for_each_entry_safe(block, tmp, &bs->blocks, list) { > >> list_del(&block->list); > >> kho_block_free_ser(block->ser); > >> kfree(block); > >> } > > > > We actually can't merge these into a single loop because of partial > > restoration failures handling in kho_block_restore(). > > > > If kho_block_restore fails halfway through restoring a chain of blocks > > (for example, if kho_block_add fails on block 3 of 5), we jump to the > > err_destroy cleanup path which calls kho_block_destroy(). > > > > At this point: > > - bs->blocks only contains the tracked blocks we successfully added > > (blocks 1 and 2). > > - bs->head_pa still points to the physical head of the entire 5-block > > incoming chain. > > > > But, this is a good place to add a comment. > > IMO it would be cleaner for kho_block_destroy() to destroy the currently > initialized block set, and then the error handling path in restore path > can clean up the rest. Sounds good, done. > > > > >> > + } > >> > +} > [...] > >> > +/** > >> > + * kho_block_it_prev - Return the previous entry slot in the block set. > >> > + * @it: The block iterator. > >> > + * > >> > + * If the current index is at the start of a block, it automatically moves to > >> > + * the end of the previous block. > >> > + * > >> > + * Return: A pointer to the previous entry slot, or NULL if at the very > >> > + * beginning of the block set. 
> >> > + */ > >> > +void *kho_block_it_prev(struct kho_block_it *it) > >> > +{ > >> > + if (!it->block) > >> > + return NULL; > >> > + > >> > + if (it->i == 0) { > >> > + if (list_is_first(&it->block->list, &it->bs->blocks)) > >> > + return NULL; > >> > + it->block = list_prev_entry(it->block, list); > >> > + it->i = kho_block_count_per_block(it->bs); > >> > + } > >> > + > >> > + return (void *)(it->block->ser + 1) + (--it->i * it->bs->entry_size); > >> > +} > >> > + > >> > +/** > >> > + * kho_block_it_finalize - Finalize the current block by setting its entry count. > >> > + * @it: The block iterator. > >> > + */ > >> > +void kho_block_it_finalize(struct kho_block_it *it) > >> > +{ > >> > + if (it->block) > >> > + it->block->ser->count = it->i; > >> > +} > >> > >> Doesn't kho_block_it_next() already do this when you add an entry? So > >> this seems redundant. > > > > It is not redundant because of how the final partially-filled block is handled. > > > > kho_block_it_next() only writes the count into the block header when a block is completely full and it is advancing to the next one: > > > > if (it->i == kho_block_count_per_block(it->bs)) { > > it->block->ser->count = it->i; > > ... > > > > But for the very last block in the set, it is usually only partially > > filled (e.g., we write 10 entries into a block with a capacity of 64). > > Since it->i never reaches the maximum capacity, kho_block_it_next() > > never commits its count. > > > > Pasha > > I think we can make kho_block_it_next() always write it. I think it > makes sense from an API point of view, since I see this API as "adding > an entry to the block set", so updating its internal counters makes > sense. > > Requiring the finalize will be error prone, since it is easy to forget. > Then you silently lose some entries on the next boot. Good suggestion, cleaned up. Thank you! 
Pasha > > -- > Regards, > Pratyush Yadav From pasha.tatashin at soleen.com Tue Jun 2 19:50:52 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 02:50:52 +0000 Subject: [PATCH v4 08/13] liveupdate: defer session block allocation and PA setting In-Reply-To: <178038801492.119771.3419366349068848854.b4-review@b4> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-9-pasha.tatashin@soleen.com> <178038801492.119771.3419366349068848854.b4-review@b4> Message-ID: On 06-02 11:13, Mike Rapoport wrote: > On Sat, 30 May 2026 22:19:33 +0000, Pasha Tatashin wrote: > > Currently, luo_session_setup_outgoing() allocates the session block and > > "liveupdate: defer session block allocation and PA setting" > > PA as "Public Assistance"? ;-) > > Let's spell it out. Done > > -- > Sincerely yours, > Mike. > From pasha.tatashin at soleen.com Tue Jun 2 19:57:19 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 02:57:19 +0000 Subject: [PATCH v4 03/13] liveupdate: centralize state management into struct luo_ser In-Reply-To: <178038801487.119771.6308607614059754603.b4-review@b4> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-4-pasha.tatashin@soleen.com> <178038801487.119771.6308607614059754603.b4-review@b4> Message-ID: On 06-02 11:13, Mike Rapoport wrote: > On Sat, 30 May 2026 22:19:28 +0000, Pasha Tatashin wrote: > > diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c > > index 8f5c5dd01cd0..c8dd30b41238 100644 > > --- a/kernel/liveupdate/luo_flb.c > > +++ b/kernel/liveupdate/luo_flb.c > > @@ -579,53 +565,18 @@ int __init luo_flb_setup_outgoing(void *fdt_out) > > [ ... skip 18 lines ... 
] > > - offset = fdt_subnode_offset(fdt_in, 0, LUO_FDT_FLB_NODE_NAME); > > - if (offset < 0) { > > - pr_err("Unable to get FLB node [%s]\n", LUO_FDT_FLB_NODE_NAME); > > - > > - return -ENOENT; > > + if (flbs_pa) { > > I like > > if (!flbs_pa) > return; > > more Ok. > > > > > diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c > > index 8d9201c25412..3b760fefa7b9 100644 > > --- a/kernel/liveupdate/luo_session.c > > +++ b/kernel/liveupdate/luo_session.c > > @@ -497,75 +494,34 @@ int luo_session_retrieve(const char *name, struct file **filep) > > [ ... skip 58 lines ... ] > > + if (sessions_pa) { > > + header_ser = phys_to_virt(sessions_pa); > > + luo_session_global.incoming.header_ser = header_ser; > > + luo_session_global.incoming.ser = (void *)(header_ser + 1); > > + luo_session_global.incoming.active = true; > > } > > Ditto This function gets rewritten with an early return later in the series. > > -- > Sincerely yours, > Mike. 
> From pasha.tatashin at soleen.com Tue Jun 2 20:10:40 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:10:40 +0000 Subject: [PATCH v4 02/13] liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd In-Reply-To: <178038801485.119771.9514973100282773342.b4-review@b4> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-3-pasha.tatashin@soleen.com> <178038801485.119771.9514973100282773342.b4-review@b4> Message-ID: On 06-02 11:13, Mike Rapoport wrote: > On Sat, 30 May 2026 22:19:27 +0000, Pasha Tatashin wrote: > > diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c > > index 146414933977..8d9201c25412 100644 > > --- a/kernel/liveupdate/luo_session.c > > +++ b/kernel/liveupdate/luo_session.c > > @@ -291,25 +291,24 @@ static int luo_session_retrieve_fd(struct luo_session *session, > > if (argp->fd < 0) > > return argp->fd; > > > > - guard(mutex)(&session->mutex); > > - err = 
luo_retrieve_file(&session->file_set, argp->token, &file); > > - if (err < 0) > > - goto err_put_fd; > > + scoped_guard(mutex, &session->mutex) { > > + err = luo_retrieve_file(&session->file_set, argp->token, &file); > > + if (err < 0) { > > + put_unused_fd(argp->fd); > > + return err; > > I don't like piling up error handling inside if (err) statements. > > As we only need the lock only for luo_retrieve_file() I think it's better > drop the guard and use goto: > > > mutex_lock(&session->mutex); > err = luo_retrieve_file(&session->file_set, argp->token, &file); > mutex_unlock(&session->mutex); > if (err) > ... ok, done. > > -- > Sincerely yours, > Mike. > From pasha.tatashin at soleen.com Tue Jun 2 20:28:51 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:28:51 +0000 Subject: [PATCH v6 00/13] liveupdate: Remove limits on sessions and files Message-ID: <20260603032905.344462-1-pasha.tatashin@soleen.com> Hi all, This series removes the fixed limits on the number of files that can be preserved within a single session, and the total number of sessions managed by the Live Update Orchestrator (LUO). The core of the change is a transition from single contiguous memory blocks for metadata serialization to a chain of linked blocks. This allows LUO to scale dynamically. 1. ABI Evolution: - Introduced linked-block headers for both file and session serialization. - Bumped session ABI version to v4. 2. Memory Management & Security: - Implemented a dynamic block allocation and reuse strategy. Blocks are allocated only when existing ones are exhausted and are reused during session/file removal cycles. - Introduced KHO_MAX_BLOCKS (10000) as a safeguard against stupid excessive allocations or corrupted cyclic lists during restore. 3. Expanded Selftests: - Added new kexec-based tests verifying preservation of 2000 sessions and 500 files per session. - Added self-tests for many sessions and many files management. 
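The safeguard in item 2 can be pictured with a minimal user-space sketch: a bounded walk over a chain of linked blocks that gives up once the cap is exceeded, which also catches a corrupted cyclic chain. This is only a sketch under stated assumptions. The sketch_* names are hypothetical, plain pointers stand in for the physical next addresses, and the series itself may validate the chain differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the cap named in the cover letter; the name is hypothetical. */
#define SKETCH_MAX_BLOCKS 10000u

/* Stand-in for a serialized block header: link to the next block + count. */
struct sketch_block {
	struct sketch_block *next; /* replaces the physical "next" address */
	uint64_t count;            /* entries stored after the header */
};

/*
 * Walk the chain and store its length in *len.
 * Returns 0 on success, or -1 if more than SKETCH_MAX_BLOCKS blocks are
 * visited, which is what happens for an overlong or cyclic chain.
 */
static int sketch_chain_length(const struct sketch_block *head, uint64_t *len)
{
	uint64_t n = 0;

	for (const struct sketch_block *b = head; b; b = b->next) {
		if (++n > SKETCH_MAX_BLOCKS)
			return -1;
	}
	*len = n;
	return 0;
}
```

A restore path built this way can never loop forever or allocate tracking structures without bound, no matter what the preserved memory contains.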
Tree: git.kernel.org/pub/scm/linux/kernel/git/tatashin/linux.git Branch: luo-remove-max-files-sessions-limits/v6 Changes v6: - Addressed comments from Mike and Pratyush: - Simplified kho_block_set_destroy() to only free successfully tracked blocks. - Enabled dynamic entry count tracking in kho_block_set_it_reserve_entry() to automatically update the count field in block headers on every reserve. - Removed the error-prone kho_block_set_it_finalize(). - Renamed various kho_block_* APIs to kho_block_set_* (e.g. kho_block_set_grow, kho_block_set_shrink, struct kho_block_set_it, etc.). Changes v5: - Addressed comments from Pratyush: - Renamed kho_block_restore -> kho_block_set_restore, kho_block_destroy -> kho_block_set_destroy. - Renamed block iterator next/read functions to reserve_entry/read_entry. - Added public helpers kho_block_set_head_pa() and kho_block_set_is_empty(). - Added validation to treat zero-count blocks as errors during restoration. - Simplified block iterator reading loop from a while to an if statement. - Changed standard WARN_ON macros to WARN_ON_ONCE on iterator allocation checks, and added warning details. - Simplified session serialization by removing a redundant NULL check on sessions_pa. Please review. 

Thanks,
Pasha

Pasha Tatashin (13):
  liveupdate: change file_set->count type to u64 for type safety
  liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd
  liveupdate: centralize state management into struct luo_ser
  liveupdate: register luo_ser as KHO subtree
  liveupdate: Extract luo_file_deserialize_one helper
  liveupdate: Extract luo_session_deserialize_one helper
  kho: add support for linked-block serialization
  liveupdate: defer session block allocation and physical address setting
  liveupdate: Remove limit on the number of sessions
  liveupdate: Remove limit on the number of files per session
  selftests/liveupdate: Test session and file limit removal
  selftests/liveupdate: Add stress-sessions kexec test
  selftests/liveupdate: Add stress-files kexec test

 Documentation/core-api/kho/abi.rst            |   5 +
 Documentation/core-api/kho/index.rst          |  11 +
 MAINTAINERS                                   |   1 +
 include/linux/kho/abi/block.h                 |  56 +++
 include/linux/kho/abi/luo.h                   | 149 ++-----
 include/linux/kho_block.h                     | 106 +++++
 kernel/liveupdate/Makefile                    |   1 +
 kernel/liveupdate/kho_block.c                 | 411 ++++++++++++++++++
 kernel/liveupdate/luo_core.c                  |  99 ++---
 kernel/liveupdate/luo_file.c                  | 205 ++++-----
 kernel/liveupdate/luo_flb.c                   |  60 +--
 kernel/liveupdate/luo_internal.h              |  16 +-
 kernel/liveupdate/luo_session.c               | 219 +++++-----
 tools/testing/selftests/liveupdate/Makefile   |   2 +
 .../testing/selftests/liveupdate/liveupdate.c |  75 ++++
 .../selftests/liveupdate/luo_stress_files.c   |  97 +++++
 .../liveupdate/luo_stress_sessions.c          | 102 +++++
 .../selftests/liveupdate/luo_test_utils.c     |  24 +
 .../selftests/liveupdate/luo_test_utils.h     |   2 +
 19 files changed, 1196 insertions(+), 445 deletions(-)
 create mode 100644 include/linux/kho/abi/block.h
 create mode 100644 include/linux/kho_block.h
 create mode 100644 kernel/liveupdate/kho_block.c
 create mode 100644 tools/testing/selftests/liveupdate/luo_stress_files.c
 create mode 100644 tools/testing/selftests/liveupdate/luo_stress_sessions.c

base-commit:
2935777b418d2bfcbfe96705bb2c0fa6c0d94e18
--
2.53.0

From pasha.tatashin at soleen.com Tue Jun 2 20:28:52 2026
From: pasha.tatashin at soleen.com (Pasha Tatashin)
Date: Wed, 3 Jun 2026 03:28:52 +0000
Subject: [PATCH v6 01/13] liveupdate: change file_set->count type to u64 for type safety
In-Reply-To: <20260603032905.344462-1-pasha.tatashin@soleen.com>
References: <20260603032905.344462-1-pasha.tatashin@soleen.com>
Message-ID: <20260603032905.344462-2-pasha.tatashin@soleen.com>

This improves type safety and aligns the in-memory file_set->count with
the serialized count type. It avoids potential truncation or sign
conversion mismatch issues.

Reviewed-by: Pratyush Yadav (Google)
Acked-by: Mike Rapoport (Microsoft)
Signed-off-by: Pasha Tatashin
---
 kernel/liveupdate/luo_internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
index dd53d4a7277e..ae58206f14ac 100644
--- a/kernel/liveupdate/luo_internal.h
+++ b/kernel/liveupdate/luo_internal.h
@@ -52,7 +52,7 @@ static inline int luo_ucmd_respond(struct luo_ucmd *ucmd,
 struct luo_file_set {
 	struct list_head files_list;
 	struct luo_file_ser *files;
-	long count;
+	u64 count;
 };
 
 /**
--
2.53.0

From pasha.tatashin at soleen.com Tue Jun 2 20:28:53 2026
From: pasha.tatashin at soleen.com (Pasha Tatashin)
Date: Wed, 3 Jun 2026 03:28:53 +0000
Subject: [PATCH v6 02/13] liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd
In-Reply-To: <20260603032905.344462-1-pasha.tatashin@soleen.com>
References: <20260603032905.344462-1-pasha.tatashin@soleen.com>
Message-ID: <20260603032905.344462-3-pasha.tatashin@soleen.com>

Refactor luo_session_retrieve_fd() to avoid mixing automated
cleanup-style guards with goto-based resource release, which is not
recommended under the Linux kernel coding style.

Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/luo_session.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 5c6cebc6e326..47566db64598 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -291,10 +291,11 @@ static int luo_session_retrieve_fd(struct luo_session *session, if (argp->fd < 0) return argp->fd; - guard(mutex)(&session->mutex); + mutex_lock(&session->mutex); err = luo_retrieve_file(&session->file_set, argp->token, &file); + mutex_unlock(&session->mutex); if (err < 0) - goto err_put_fd; + goto err_put_fd; err = luo_ucmd_respond(ucmd, sizeof(*argp)); if (err) -- 2.53.0 From pasha.tatashin at soleen.com Tue Jun 2 20:28:54 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:28:54 +0000 Subject: [PATCH v6 03/13] liveupdate: centralize state management into struct luo_ser In-Reply-To: <20260603032905.344462-1-pasha.tatashin@soleen.com> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> Message-ID: <20260603032905.344462-4-pasha.tatashin@soleen.com> Transition the LUO to ABI v2, which centralizes state management into a single struct luo_ser header. Previously, LUO state was spread across multiple FDT properties and subnodes. ABI v2 simplifies this by placing all core state, including the liveupdate number and physical addresses for sessions and FLB headers into a centralized struct luo_ser. Note that this change introduces a semantic difference: the sessions and FLB serialization formats are no longer completely independent of the core LUO. Their metadata (such as physical addresses for sessions and FLB headers) is now coupled to and managed via the centralized struct luo_ser. 
Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 91 +++++++++++--------------------- kernel/liveupdate/luo_core.c | 64 +++++++++++++++------- kernel/liveupdate/luo_flb.c | 60 +++------------------ kernel/liveupdate/luo_internal.h | 8 +-- kernel/liveupdate/luo_session.c | 64 ++++------------------ 5 files changed, 96 insertions(+), 191 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 46750a0ddf88..1b2f865a771a 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -30,52 +30,25 @@ * .. code-block:: none * * / { - * compatible = "luo-v1"; - * liveupdate-number = <...>; - * - * luo-session { - * compatible = "luo-session-v1"; - * luo-session-header = ; - * }; - * - * luo-flb { - * compatible = "luo-flb-v1"; - * luo-flb-header = ; - * }; + * compatible = "luo-v2"; + * luo-abi-header = ; * }; * * Main LUO Node (/): * - * - compatible: "luo-v1" + * - compatible: "luo-v2" * Identifies the overall LUO ABI version. - * - liveupdate-number: u64 - * A counter tracking the number of successful live updates performed. - * - * Session Node (luo-session): - * This node describes all preserved user-space sessions. - * - * - compatible: "luo-session-v1" - * Identifies the session ABI version. - * - luo-session-header: u64 - * The physical address of a `struct luo_session_header_ser`. This structure - * is the header for a contiguous block of memory containing an array of - * `struct luo_session_ser`, one for each preserved session. - * - * File-Lifecycle-Bound Node (luo-flb): - * This node describes all preserved global objects whose lifecycle is bound - * to that of the preserved files (e.g., shared IOMMU state). - * - * - compatible: "luo-flb-v1" - * Identifies the FLB ABI version. - * - luo-flb-header: u64 - * The physical address of a `struct luo_flb_header_ser`. 
This structure is - * the header for a contiguous block of memory containing an array of - * `struct luo_flb_ser`, one for each preserved global object. + * - luo-abi-header: u64 + * The physical address of `struct luo_ser`. * * Serialization Structures: * The FDT properties point to memory regions containing arrays of simple, * `__packed` structures. These structures contain the actual preserved state. * + * - struct luo_ser: + * The central ABI structure that contains the overall state of the LUO. + * It includes the liveupdate-number and pointers to sessions and FLBs. + * * - struct luo_session_header_ser: * Header for the session array. Contains the total page count of the * preserved memory block and the number of `struct luo_session_ser` @@ -109,13 +82,26 @@ /* * The LUO FDT hooks all LUO state for sessions, fds, etc. - * In the root it also carries "liveupdate-number" 64-bit property that - * corresponds to the number of live-updates performed on this machine. */ #define LUO_FDT_SIZE PAGE_SIZE #define LUO_FDT_KHO_ENTRY_NAME "LUO" -#define LUO_FDT_COMPATIBLE "luo-v1" -#define LUO_FDT_LIVEUPDATE_NUM "liveupdate-number" +#define LUO_FDT_COMPATIBLE "luo-v2" +#define LUO_FDT_ABI_HEADER "luo-abi-header" + +/** + * struct luo_ser - Centralized LUO ABI header. + * @liveupdate_num: A counter tracking the number of successful live updates. + * @sessions_pa: Physical address of the first session block header. + * @flbs_pa: Physical address of the FLB header. + * + * This structure is the root of all preserved LUO state. It is pointed to by + * the "luo-abi-header" property in the LUO FDT. 
+ */ +struct luo_ser { + u64 liveupdate_num; + u64 sessions_pa; + u64 flbs_pa; +} __packed; #define LIVEUPDATE_HNDL_COMPAT_LENGTH 48 @@ -147,15 +133,6 @@ struct luo_file_set_ser { u64 count; } __packed; -/* - * LUO FDT session node - * LUO_FDT_SESSION_HEADER: is a u64 physical address of struct - * luo_session_header_ser - */ -#define LUO_FDT_SESSION_NODE_NAME "luo-session" -#define LUO_FDT_SESSION_COMPATIBLE "luo-session-v2" -#define LUO_FDT_SESSION_HEADER "luo-session-header" - /** * struct luo_session_header_ser - Header for the serialized session data block. * @count: The number of `struct luo_session_ser` entries that immediately @@ -165,7 +142,7 @@ struct luo_file_set_ser { * physical memory preserved across the kexec. It provides the necessary * metadata to interpret the array of session entries that follow. * - * If this structure is modified, `LUO_FDT_SESSION_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_session_header_ser { u64 count; @@ -182,7 +159,7 @@ struct luo_session_header_ser { * session) is created and passed to the new kernel, allowing it to reconstruct * the session context. * - * If this structure is modified, `LUO_FDT_SESSION_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_session_ser { char name[LIVEUPDATE_SESSION_NAME_LENGTH]; @@ -192,10 +169,6 @@ struct luo_session_ser { /* The max size is set so it can be reliably used during in serialization */ #define LIVEUPDATE_FLB_COMPAT_LENGTH 48 -#define LUO_FDT_FLB_NODE_NAME "luo-flb" -#define LUO_FDT_FLB_COMPATIBLE "luo-flb-v1" -#define LUO_FDT_FLB_HEADER "luo-flb-header" - /** * struct luo_flb_header_ser - Header for the serialized FLB data block. * @pgcnt: The total number of pages occupied by the entire preserved memory @@ -205,11 +178,9 @@ struct luo_session_ser { * in the memory block. 
* * This structure is located at the physical address specified by the - * `LUO_FDT_FLB_HEADER` FDT property. It provides the new kernel with the - * necessary information to find and iterate over the array of preserved - * File-Lifecycle-Bound objects and to manage the underlying memory. + * flbs_pa in luo_ser. * - * If this structure is modified, LUO_FDT_FLB_COMPATIBLE must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_flb_header_ser { u64 pgcnt; @@ -231,7 +202,7 @@ struct luo_flb_header_ser { * passed to the new kernel. Each entry allows the LUO core to restore one * global, shared object. * - * If this structure is modified, LUO_FDT_FLB_COMPATIBLE must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_flb_ser { char name[LIVEUPDATE_FLB_COMPAT_LENGTH]; diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 5d5827ced73c..085c0dfc1ef1 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -61,7 +61,6 @@ #include #include #include -#include #include "kexec_handover_internal.h" #include "luo_internal.h" @@ -86,9 +85,11 @@ early_param("liveupdate", early_liveupdate_param); static int __init luo_early_startup(void) { + struct luo_ser *luo_ser; + int err, header_size; phys_addr_t fdt_phys; - int err, ln_size; const void *ptr; + u64 luo_ser_pa; if (!kho_is_enabled()) { if (liveupdate_enabled()) @@ -119,26 +120,32 @@ static int __init luo_early_startup(void) return -EINVAL; } - ln_size = 0; - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_LIVEUPDATE_NUM, - &ln_size); - if (!ptr || ln_size != sizeof(luo_global.liveupdate_num)) { - pr_err("Unable to get live update number '%s' [%d]\n", - LUO_FDT_LIVEUPDATE_NUM, ln_size); + header_size = 0; + ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); + if (!ptr || header_size != sizeof(u64)) { + pr_err("Unable to get ABI header '%s' [%d]\n", + 
LUO_FDT_ABI_HEADER, header_size); return -EINVAL; } - luo_global.liveupdate_num = get_unaligned((u64 *)ptr); + luo_ser_pa = get_unaligned((u64 *)ptr); + luo_ser = phys_to_virt(luo_ser_pa); + + luo_global.liveupdate_num = luo_ser->liveupdate_num; pr_info("Retrieved live update data, liveupdate number: %lld\n", luo_global.liveupdate_num); - err = luo_session_setup_incoming(luo_global.fdt_in); + err = luo_session_setup_incoming(luo_ser->sessions_pa); if (err) - return err; + goto out_free_ser; + + luo_flb_setup_incoming(luo_ser->flbs_pa); - err = luo_flb_setup_incoming(luo_global.fdt_in); + err = 0; +out_free_ser: + kho_restore_free(luo_ser); return err; } @@ -160,7 +167,8 @@ early_initcall(liveupdate_early_init); /* Called during boot to create outgoing LUO fdt tree */ static int __init luo_fdt_setup(void) { - const u64 ln = luo_global.liveupdate_num + 1; + struct luo_ser *luo_ser; + u64 luo_ser_pa; void *fdt_out; int err; @@ -170,27 +178,45 @@ static int __init luo_fdt_setup(void) return PTR_ERR(fdt_out); } + luo_ser = kho_alloc_preserve(sizeof(*luo_ser)); + if (IS_ERR(luo_ser)) { + err = PTR_ERR(luo_ser); + goto exit_free_fdt; + } + luo_ser_pa = virt_to_phys(luo_ser); + err = fdt_create(fdt_out, LUO_FDT_SIZE); err |= fdt_finish_reservemap(fdt_out); err |= fdt_begin_node(fdt_out, ""); err |= fdt_property_string(fdt_out, "compatible", LUO_FDT_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_LIVEUPDATE_NUM, &ln, sizeof(ln)); - err |= luo_session_setup_outgoing(fdt_out); - err |= luo_flb_setup_outgoing(fdt_out); + err |= fdt_property(fdt_out, LUO_FDT_ABI_HEADER, &luo_ser_pa, + sizeof(luo_ser_pa)); err |= fdt_end_node(fdt_out); err |= fdt_finish(fdt_out); if (err) - goto exit_free; + goto exit_free_luo_ser; + + err = luo_session_setup_outgoing(&luo_ser->sessions_pa); + if (err) + goto exit_free_luo_ser; + + err = luo_flb_setup_outgoing(&luo_ser->flbs_pa); + if (err) + goto exit_free_luo_ser; + + luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; err = 
kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out, fdt_totalsize(fdt_out)); if (err) - goto exit_free; + goto exit_free_luo_ser; luo_global.fdt_out = fdt_out; return 0; -exit_free: +exit_free_luo_ser: + kho_unpreserve_free(luo_ser); +exit_free_fdt: kho_unpreserve_free(fdt_out); pr_err("failed to prepare LUO FDT: %d\n", err); diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c index 8f5c5dd01cd0..5c27134ce7ba 100644 --- a/kernel/liveupdate/luo_flb.c +++ b/kernel/liveupdate/luo_flb.c @@ -44,13 +44,11 @@ #include #include #include -#include #include #include #include #include #include -#include #include "luo_internal.h" #define LUO_FLB_PGCNT 1ul @@ -551,27 +549,15 @@ int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp) return 0; } -int __init luo_flb_setup_outgoing(void *fdt_out) +int __init luo_flb_setup_outgoing(u64 *flbs_pa) { struct luo_flb_header_ser *header_ser; - u64 header_ser_pa; - int err; header_ser = kho_alloc_preserve(LUO_FLB_PGCNT << PAGE_SHIFT); if (IS_ERR(header_ser)) return PTR_ERR(header_ser); - header_ser_pa = virt_to_phys(header_ser); - - err = fdt_begin_node(fdt_out, LUO_FDT_FLB_NODE_NAME); - err |= fdt_property_string(fdt_out, "compatible", - LUO_FDT_FLB_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_FLB_HEADER, &header_ser_pa, - sizeof(header_ser_pa)); - err |= fdt_end_node(fdt_out); - - if (err) - goto err_unpreserve; + *flbs_pa = virt_to_phys(header_ser); header_ser->pgcnt = LUO_FLB_PGCNT; luo_flb_global.outgoing.header_ser = header_ser; @@ -579,53 +565,19 @@ int __init luo_flb_setup_outgoing(void *fdt_out) luo_flb_global.outgoing.active = true; return 0; - -err_unpreserve: - kho_unpreserve_free(header_ser); - - return err; } -int __init luo_flb_setup_incoming(void *fdt_in) +void __init luo_flb_setup_incoming(u64 flbs_pa) { struct luo_flb_header_ser *header_ser; - int err, header_size, offset; - const void *ptr; - u64 header_ser_pa; - - offset = fdt_subnode_offset(fdt_in, 0, LUO_FDT_FLB_NODE_NAME); - 
if (offset < 0) { - pr_err("Unable to get FLB node [%s]\n", LUO_FDT_FLB_NODE_NAME); - - return -ENOENT; - } - - err = fdt_node_check_compatible(fdt_in, offset, - LUO_FDT_FLB_COMPATIBLE); - if (err) { - pr_err("FLB node is incompatible with '%s' [%d]\n", - LUO_FDT_FLB_COMPATIBLE, err); - - return -EINVAL; - } - - header_size = 0; - ptr = fdt_getprop(fdt_in, offset, LUO_FDT_FLB_HEADER, &header_size); - if (!ptr || header_size != sizeof(u64)) { - pr_err("Unable to get FLB header property '%s' [%d]\n", - LUO_FDT_FLB_HEADER, header_size); - return -EINVAL; - } - - header_ser_pa = get_unaligned((u64 *)ptr); - header_ser = phys_to_virt(header_ser_pa); + if (!flbs_pa) + return; + header_ser = phys_to_virt(flbs_pa); luo_flb_global.incoming.header_ser = header_ser; luo_flb_global.incoming.ser = (void *)(header_ser + 1); luo_flb_global.incoming.active = true; - - return 0; } /** diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index ae58206f14ac..fe22086bfbeb 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -79,8 +79,8 @@ extern struct rw_semaphore luo_register_rwlock; int luo_session_create(const char *name, struct file **filep); int luo_session_retrieve(const char *name, struct file **filep); -int __init luo_session_setup_outgoing(void *fdt); -int __init luo_session_setup_incoming(void *fdt); +int __init luo_session_setup_outgoing(u64 *sessions_pa); +int __init luo_session_setup_incoming(u64 sessions_pa); int luo_session_serialize(void); int luo_session_deserialize(void); @@ -102,8 +102,8 @@ int luo_flb_file_preserve(struct liveupdate_file_handler *fh); void luo_flb_file_unpreserve(struct liveupdate_file_handler *fh); void luo_flb_file_finish(struct liveupdate_file_handler *fh); void luo_flb_unregister_all(struct liveupdate_file_handler *fh); -int __init luo_flb_setup_outgoing(void *fdt); -int __init luo_flb_setup_incoming(void *fdt); +int __init luo_flb_setup_outgoing(u64 *flbs_pa); +void __init 
luo_flb_setup_incoming(u64 flbs_pa); void luo_flb_serialize(void); #ifdef CONFIG_LIVEUPDATE_TEST diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 47566db64598..85782c6f3d6c 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -25,9 +25,8 @@ * * - Serialization: Session metadata is preserved using the KHO framework. When * a live update is triggered via kexec, an array of `struct luo_session_ser` - * is populated and placed in a preserved memory region. An FDT node is also - * created, containing the count of sessions and the physical address of this - * array. + * is populated and placed in a preserved memory region. The physical address + * of this array is stored in the centralized `struct luo_ser` structure. * * Session Lifecycle: * @@ -91,13 +90,11 @@ #include #include #include -#include #include #include #include #include #include -#include #include #include "luo_internal.h" @@ -527,75 +524,34 @@ int luo_session_retrieve(const char *name, struct file **filep) return err; } -int __init luo_session_setup_outgoing(void *fdt_out) +int __init luo_session_setup_outgoing(u64 *sessions_pa) { struct luo_session_header_ser *header_ser; - u64 header_ser_pa; - int err; header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); if (IS_ERR(header_ser)) return PTR_ERR(header_ser); - header_ser_pa = virt_to_phys(header_ser); - - err = fdt_begin_node(fdt_out, LUO_FDT_SESSION_NODE_NAME); - err |= fdt_property_string(fdt_out, "compatible", - LUO_FDT_SESSION_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_SESSION_HEADER, &header_ser_pa, - sizeof(header_ser_pa)); - err |= fdt_end_node(fdt_out); - if (err) - goto err_unpreserve; + *sessions_pa = virt_to_phys(header_ser); luo_session_global.outgoing.header_ser = header_ser; luo_session_global.outgoing.ser = (void *)(header_ser + 1); luo_session_global.outgoing.active = true; return 0; - -err_unpreserve: - kho_unpreserve_free(header_ser); - return err; } 
-int __init luo_session_setup_incoming(void *fdt_in) +int __init luo_session_setup_incoming(u64 sessions_pa) { struct luo_session_header_ser *header_ser; - int err, header_size, offset; - u64 header_ser_pa; - const void *ptr; - - offset = fdt_subnode_offset(fdt_in, 0, LUO_FDT_SESSION_NODE_NAME); - if (offset < 0) { - pr_err("Unable to get session node: [%s]\n", - LUO_FDT_SESSION_NODE_NAME); - return -EINVAL; - } - err = fdt_node_check_compatible(fdt_in, offset, - LUO_FDT_SESSION_COMPATIBLE); - if (err) { - pr_err("Session node incompatible [%s]\n", - LUO_FDT_SESSION_COMPATIBLE); - return -EINVAL; - } - - header_size = 0; - ptr = fdt_getprop(fdt_in, offset, LUO_FDT_SESSION_HEADER, &header_size); - if (!ptr || header_size != sizeof(u64)) { - pr_err("Unable to get session header '%s' [%d]\n", - LUO_FDT_SESSION_HEADER, header_size); - return -EINVAL; + if (sessions_pa) { + header_ser = phys_to_virt(sessions_pa); + luo_session_global.incoming.header_ser = header_ser; + luo_session_global.incoming.ser = (void *)(header_ser + 1); + luo_session_global.incoming.active = true; } - header_ser_pa = get_unaligned((u64 *)ptr); - header_ser = phys_to_virt(header_ser_pa); - - luo_session_global.incoming.header_ser = header_ser; - luo_session_global.incoming.ser = (void *)(header_ser + 1); - luo_session_global.incoming.active = true; - return 0; } -- 2.53.0 From pasha.tatashin at soleen.com Tue Jun 2 20:28:55 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:28:55 +0000 Subject: [PATCH v6 04/13] liveupdate: register luo_ser as KHO subtree In-Reply-To: <20260603032905.344462-1-pasha.tatashin@soleen.com> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> Message-ID: <20260603032905.344462-5-pasha.tatashin@soleen.com> Entirely remove the LUO FDT wrapper since the FDT only carries the compatible string and the pointer to the centralized struct luo_ser. 
Instead, register the struct luo_ser via the KHO raw subtree API, placing the compatibility string inside the structure itself. Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 57 +++++++++--------------- kernel/liveupdate/luo_core.c | 85 +++++++++++------------------------- 2 files changed, 46 insertions(+), 96 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 1b2f865a771a..9a4fe491812b 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -10,11 +10,11 @@ * * Live Update Orchestrator uses the stable Application Binary Interface * defined below to pass state from a pre-update kernel to a post-update - * kernel. The ABI is built upon the Kexec HandOver framework and uses a - * Flattened Device Tree to describe the preserved data. + * kernel. The ABI is built upon the Kexec HandOver framework and registers + * the central `struct luo_ser` via the KHO raw subtree API. * - * This interface is a contract. Any modification to the FDT structure, node - * properties, compatible strings, or the layout of the `__packed` serialization + * This interface is a contract. Any modification to the structure fields, + * compatible strings, or the layout of the `__packed` serialization * structures defined here constitutes a breaking change. Such changes require * incrementing the version number in the relevant `_COMPATIBLE` string to * prevent a new kernel from misinterpreting data from an old kernel. @@ -23,31 +23,15 @@ * however, backward/forward compatibility is only guaranteed for kernels * supporting the same ABI version. * - * FDT Structure Overview: + * KHO Structure Overview: * The entire LUO state is encapsulated within a single KHO entry named "LUO". - * This entry contains an FDT with the following layout: - * - * .. 
code-block:: none - * - * / { - * compatible = "luo-v2"; - * luo-abi-header = ; - * }; - * - * Main LUO Node (/): - * - * - compatible: "luo-v2" - * Identifies the overall LUO ABI version. - * - luo-abi-header: u64 - * The physical address of `struct luo_ser`. + * This entry contains the `struct luo_ser` structure. * * Serialization Structures: - * The FDT properties point to memory regions containing arrays of simple, - * `__packed` structures. These structures contain the actual preserved state. - * * - struct luo_ser: * The central ABI structure that contains the overall state of the LUO. - * It includes the liveupdate-number and pointers to sessions and FLBs. + * It includes the compatibility string, the liveupdate-number, and pointers + * to sessions and FLBs. * * - struct luo_session_header_ser: * Header for the session array. Contains the total page count of the @@ -78,26 +62,27 @@ #ifndef _LINUX_KHO_ABI_LUO_H #define _LINUX_KHO_ABI_LUO_H +#include #include /* - * The LUO FDT hooks all LUO state for sessions, fds, etc. + * The LUO state is registered under this KHO entry name. */ -#define LUO_FDT_SIZE PAGE_SIZE -#define LUO_FDT_KHO_ENTRY_NAME "LUO" -#define LUO_FDT_COMPATIBLE "luo-v2" -#define LUO_FDT_ABI_HEADER "luo-abi-header" +#define LUO_KHO_ENTRY_NAME "LUO" +#define LUO_ABI_COMPATIBLE "luo-v3" +#define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) /** * struct luo_ser - Centralized LUO ABI header. + * @compatible: Compatibility string identifying the LUO ABI version. * @liveupdate_num: A counter tracking the number of successful live updates. * @sessions_pa: Physical address of the first session block header. * @flbs_pa: Physical address of the FLB header. * - * This structure is the root of all preserved LUO state. It is pointed to by - * the "luo-abi-header" property in the LUO FDT. + * This structure is the root of all preserved LUO state. 
*/ struct luo_ser { + char compatible[LUO_ABI_COMPAT_LEN]; u64 liveupdate_num; u64 sessions_pa; u64 flbs_pa; @@ -111,7 +96,7 @@ struct luo_ser { * @data: Private data * @token: User provided token for this file * - * If this structure is modified, LUO_SESSION_COMPATIBLE must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_file_ser { char compatible[LIVEUPDATE_HNDL_COMPAT_LENGTH]; @@ -142,7 +127,7 @@ struct luo_file_set_ser { * physical memory preserved across the kexec. It provides the necessary * metadata to interpret the array of session entries that follow. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_session_header_ser { u64 count; @@ -159,7 +144,7 @@ struct luo_session_header_ser { * session) is created and passed to the new kernel, allowing it to reconstruct * the session context. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_session_ser { char name[LIVEUPDATE_SESSION_NAME_LENGTH]; @@ -180,7 +165,7 @@ struct luo_session_ser { * This structure is located at the physical address specified by the * flbs_pa in luo_ser. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_flb_header_ser { u64 pgcnt; @@ -202,7 +187,7 @@ struct luo_flb_header_ser { * passed to the new kernel. Each entry allows the LUO core to restore one * global, shared object. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. 
*/ struct luo_flb_ser { char name[LIVEUPDATE_FLB_COMPAT_LENGTH]; diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 085c0dfc1ef1..69b00e7d0f8f 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -54,7 +54,6 @@ #include #include #include -#include #include #include #include @@ -67,8 +66,7 @@ static struct { bool enabled; - void *fdt_out; - void *fdt_in; + struct luo_ser *luo_ser_out; u64 liveupdate_num; } luo_global; @@ -85,11 +83,10 @@ early_param("liveupdate", early_liveupdate_param); static int __init luo_early_startup(void) { + phys_addr_t luo_ser_phys; struct luo_ser *luo_ser; - int err, header_size; - phys_addr_t fdt_phys; - const void *ptr; - u64 luo_ser_pa; + size_t len; + int err; if (!kho_is_enabled()) { if (liveupdate_enabled()) @@ -98,40 +95,29 @@ static int __init luo_early_startup(void) return 0; } - /* Retrieve LUO subtree, and verify its format. */ - err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys, NULL); + /* Retrieve LUO state from KHO. 
*/ + err = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &luo_ser_phys, &len); if (err) { if (err != -ENOENT) { - pr_err("failed to retrieve FDT '%s' from KHO: %pe\n", - LUO_FDT_KHO_ENTRY_NAME, ERR_PTR(err)); + pr_err("failed to retrieve LUO state '%s' from KHO: %pe\n", + LUO_KHO_ENTRY_NAME, ERR_PTR(err)); return err; } return 0; } - luo_global.fdt_in = phys_to_virt(fdt_phys); - err = fdt_node_check_compatible(luo_global.fdt_in, 0, - LUO_FDT_COMPATIBLE); - if (err) { - pr_err("FDT '%s' is incompatible with '%s' [%d]\n", - LUO_FDT_KHO_ENTRY_NAME, LUO_FDT_COMPATIBLE, err); - + if (len < sizeof(*luo_ser)) { + pr_err("LUO state is too small (%zu < %zu)\n", len, sizeof(*luo_ser)); return -EINVAL; } - header_size = 0; - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); - if (!ptr || header_size != sizeof(u64)) { - pr_err("Unable to get ABI header '%s' [%d]\n", - LUO_FDT_ABI_HEADER, header_size); - + luo_ser = phys_to_virt(luo_ser_phys); + if (strncmp(luo_ser->compatible, LUO_ABI_COMPATIBLE, LUO_ABI_COMPAT_LEN)) { + pr_err("LUO state is incompatible with '%s'\n", LUO_ABI_COMPATIBLE); return -EINVAL; } - luo_ser_pa = get_unaligned((u64 *)ptr); - luo_ser = phys_to_virt(luo_ser_pa); - luo_global.liveupdate_num = luo_ser->liveupdate_num; pr_info("Retrieved live update data, liveupdate number: %lld\n", luo_global.liveupdate_num); @@ -164,37 +150,20 @@ static int __init liveupdate_early_init(void) } early_initcall(liveupdate_early_init); -/* Called during boot to create outgoing LUO fdt tree */ -static int __init luo_fdt_setup(void) +/* Called during boot to create outgoing LUO state */ +static int __init luo_state_setup(void) { struct luo_ser *luo_ser; - u64 luo_ser_pa; - void *fdt_out; int err; - fdt_out = kho_alloc_preserve(LUO_FDT_SIZE); - if (IS_ERR(fdt_out)) { - pr_err("failed to allocate/preserve FDT memory\n"); - return PTR_ERR(fdt_out); - } - luo_ser = kho_alloc_preserve(sizeof(*luo_ser)); if (IS_ERR(luo_ser)) { - err = PTR_ERR(luo_ser); - goto 
exit_free_fdt; + pr_err("failed to allocate/preserve LUO state memory\n"); + return PTR_ERR(luo_ser); } - luo_ser_pa = virt_to_phys(luo_ser); - - err = fdt_create(fdt_out, LUO_FDT_SIZE); - err |= fdt_finish_reservemap(fdt_out); - err |= fdt_begin_node(fdt_out, ""); - err |= fdt_property_string(fdt_out, "compatible", LUO_FDT_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_ABI_HEADER, &luo_ser_pa, - sizeof(luo_ser_pa)); - err |= fdt_end_node(fdt_out); - err |= fdt_finish(fdt_out); - if (err) - goto exit_free_luo_ser; + + strscpy(luo_ser->compatible, LUO_ABI_COMPATIBLE, sizeof(luo_ser->compatible)); + luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; err = luo_session_setup_outgoing(&luo_ser->sessions_pa); if (err) @@ -204,21 +173,17 @@ static int __init luo_fdt_setup(void) if (err) goto exit_free_luo_ser; - luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; - - err = kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out, - fdt_totalsize(fdt_out)); + err = kho_add_subtree(LUO_KHO_ENTRY_NAME, luo_ser, sizeof(*luo_ser)); if (err) goto exit_free_luo_ser; - luo_global.fdt_out = fdt_out; + + luo_global.luo_ser_out = luo_ser; return 0; exit_free_luo_ser: kho_unpreserve_free(luo_ser); -exit_free_fdt: - kho_unpreserve_free(fdt_out); - pr_err("failed to prepare LUO FDT: %d\n", err); + pr_err("failed to prepare LUO state: %d\n", err); return err; } @@ -234,7 +199,7 @@ static int __init luo_late_startup(void) if (!liveupdate_enabled()) return 0; - err = luo_fdt_setup(); + err = luo_state_setup(); if (err) luo_global.enabled = false; -- 2.53.0 From pasha.tatashin at soleen.com Tue Jun 2 20:28:56 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:28:56 +0000 Subject: [PATCH v6 05/13] liveupdate: Extract luo_file_deserialize_one helper In-Reply-To: <20260603032905.344462-1-pasha.tatashin@soleen.com> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> Message-ID: <20260603032905.344462-6-pasha.tatashin@soleen.com> Extract the 
logic for deserializing a single file entry into a separate helper function, in preparation for linked-block serialization of files. This is pure code movement; no other changes are intended. Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/luo_file.c | 77 ++++++++++++++++++++---------------- 1 file changed, 44 insertions(+), 33 deletions(-) diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 208987502f73..9eec07a9e9fc 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -753,6 +753,46 @@ int luo_file_finish(struct luo_file_set *file_set) return 0; } +static int luo_file_deserialize_one(struct luo_file_set *file_set, + struct luo_file_ser *ser) +{ + struct liveupdate_file_handler *fh; + bool handler_found = false; + struct luo_file *luo_file; + + down_read(&luo_register_rwlock); + list_private_for_each_entry(fh, &luo_file_handler_list, list) { + if (!strcmp(fh->compatible, ser->compatible)) { + if (try_module_get(fh->ops->owner)) + handler_found = true; + break; + } + } + up_read(&luo_register_rwlock); + + if (!handler_found) { + pr_warn("No registered handler for compatible '%.*s'\n", + (int)sizeof(ser->compatible), + ser->compatible); + return -ENOENT; + } + + luo_file = kzalloc_obj(*luo_file); + if (!luo_file) { + module_put(fh->ops->owner); + return -ENOMEM; + } + + luo_file->fh = fh; + luo_file->file = NULL; + luo_file->serialized_data = ser->data; + luo_file->token = ser->token; + mutex_init(&luo_file->mutex); + list_add_tail(&luo_file->list, &file_set->files_list); + + return 0; +} + /** * luo_file_deserialize - Reconstructs the list of preserved files in the new kernel. * @file_set: The incoming file_set to fill with deserialized data.
@@ -782,6 +822,7 @@ int luo_file_deserialize(struct luo_file_set *file_set, struct luo_file_set_ser *file_set_ser) { struct luo_file_ser *file_ser; + int err; u64 i; if (!file_set_ser->files) { @@ -809,39 +850,9 @@ int luo_file_deserialize(struct luo_file_set *file_set, */ file_ser = file_set->files; for (i = 0; i < file_set->count; i++) { - struct liveupdate_file_handler *fh; - bool handler_found = false; - struct luo_file *luo_file; - - down_read(&luo_register_rwlock); - list_private_for_each_entry(fh, &luo_file_handler_list, list) { - if (!strcmp(fh->compatible, file_ser[i].compatible)) { - if (try_module_get(fh->ops->owner)) - handler_found = true; - break; - } - } - up_read(&luo_register_rwlock); - - if (!handler_found) { - pr_warn("No registered handler for compatible '%.*s'\n", - (int)sizeof(file_ser[i].compatible), - file_ser[i].compatible); - return -ENOENT; - } - - luo_file = kzalloc_obj(*luo_file); - if (!luo_file) { - module_put(fh->ops->owner); - return -ENOMEM; - } - - luo_file->fh = fh; - luo_file->file = NULL; - luo_file->serialized_data = file_ser[i].data; - luo_file->token = file_ser[i].token; - mutex_init(&luo_file->mutex); - list_add_tail(&luo_file->list, &file_set->files_list); + err = luo_file_deserialize_one(file_set, &file_ser[i]); + if (err) + return err; } return 0; -- 2.53.0 From pasha.tatashin at soleen.com Tue Jun 2 20:28:57 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:28:57 +0000 Subject: [PATCH v6 06/13] liveupdate: Extract luo_session_deserialize_one helper In-Reply-To: <20260603032905.344462-1-pasha.tatashin@soleen.com> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> Message-ID: <20260603032905.344462-7-pasha.tatashin@soleen.com> Extract the logic for deserializing a single session entry into a separate helper function, in preparation for linked-block serialization of sessions. This is pure code movement; no other changes are intended.
Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/luo_session.c | 63 +++++++++++++++++++-------------- 1 file changed, 36 insertions(+), 27 deletions(-) diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 85782c6f3d6c..1cd315e0f6de 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -555,6 +555,40 @@ int __init luo_session_setup_incoming(u64 sessions_pa) return 0; } +static int luo_session_deserialize_one(struct luo_session_header *sh, + struct luo_session_ser *ser) +{ + struct luo_session *session; + int err; + + session = luo_session_alloc(ser->name); + if (IS_ERR(session)) { + pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n", + (int)sizeof(ser->name), ser->name, session); + return PTR_ERR(session); + } + + err = luo_session_insert(sh, session); + if (err) { + pr_warn("Failed to insert session [%s] %pe\n", + session->name, ERR_PTR(err)); + luo_session_free(session); + return err; + } + + scoped_guard(mutex, &session->mutex) { + err = luo_file_deserialize(&session->file_set, + &ser->file_set_ser); + } + if (err) { + pr_warn("Failed to deserialize files for session [%s] %pe\n", + session->name, ERR_PTR(err)); + return err; + } + + return 0; +} + int luo_session_deserialize(void) { struct luo_session_header *sh = &luo_session_global.incoming; @@ -586,34 +620,9 @@ int luo_session_deserialize(void) * reliably reset devices and reclaim memory. 
*/ for (int i = 0; i < sh->header_ser->count; i++) { - struct luo_session *session; - - session = luo_session_alloc(sh->ser[i].name); - if (IS_ERR(session)) { - pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n", - (int)sizeof(sh->ser[i].name), - sh->ser[i].name, session); - err = PTR_ERR(session); - goto save_err; - } - - err = luo_session_insert(sh, session); - if (err) { - pr_warn("Failed to insert session [%s] %pe\n", - session->name, ERR_PTR(err)); - luo_session_free(session); - goto save_err; - } - - scoped_guard(mutex, &session->mutex) { - err = luo_file_deserialize(&session->file_set, - &sh->ser[i].file_set_ser); - } - if (err) { - pr_warn("Failed to deserialize files for session [%s] %pe\n", - session->name, ERR_PTR(err)); + err = luo_session_deserialize_one(sh, &sh->ser[i]); + if (err) goto save_err; - } } kho_restore_free(sh->header_ser); -- 2.53.0 From pasha.tatashin at soleen.com Tue Jun 2 20:28:58 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:28:58 +0000 Subject: [PATCH v6 07/13] kho: add support for linked-block serialization In-Reply-To: <20260603032905.344462-1-pasha.tatashin@soleen.com> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> Message-ID: <20260603032905.344462-8-pasha.tatashin@soleen.com> Introduce a linked-block serialization mechanism for state handover. Previously, LUO used contiguous memory blocks for serializing sessions and files, which imposed limits on the total number of items that could be preserved across a live update. This commit adds the infrastructure for a more flexible, block-based approach where serialized data is stored in a chain of linked blocks. This is a generic KHO serialization block infrastructure that can be used by multiple subsystems. 
Signed-off-by: Pasha Tatashin --- Documentation/core-api/kho/abi.rst | 5 + Documentation/core-api/kho/index.rst | 11 + MAINTAINERS | 1 + include/linux/kho/abi/block.h | 56 ++++ include/linux/kho_block.h | 106 +++++++ kernel/liveupdate/Makefile | 1 + kernel/liveupdate/kho_block.c | 411 +++++++++++++++++++++++++++ 7 files changed, 591 insertions(+) create mode 100644 include/linux/kho/abi/block.h create mode 100644 include/linux/kho_block.h create mode 100644 kernel/liveupdate/kho_block.c diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst index 799d743105a6..edeb5b311963 100644 --- a/Documentation/core-api/kho/abi.rst +++ b/Documentation/core-api/kho/abi.rst @@ -28,6 +28,11 @@ KHO persistent memory tracker ABI .. kernel-doc:: include/linux/kho/abi/kexec_handover.h :doc: KHO persistent memory tracker +KHO serialization block ABI +=========================== + +.. kernel-doc:: include/linux/kho/abi/block.h + See Also ======== diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst index 0a2dee4f8e7d..320914a42178 100644 --- a/Documentation/core-api/kho/index.rst +++ b/Documentation/core-api/kho/index.rst @@ -83,6 +83,17 @@ Public API .. kernel-doc:: kernel/liveupdate/kexec_handover.c :export: +KHO Serialization Blocks API +============================ + +.. kernel-doc:: kernel/liveupdate/kho_block.c + :doc: KHO Serialization Blocks + +.. kernel-doc:: include/linux/kho_block.h + +.. 
kernel-doc:: kernel/liveupdate/kho_block.c + :internal: + See Also ======== diff --git a/MAINTAINERS b/MAINTAINERS index 9ec290e38b44..920ba7622afa 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14208,6 +14208,7 @@ F: Documentation/admin-guide/mm/kho.rst F: Documentation/core-api/kho/* F: include/linux/kexec_handover.h F: include/linux/kho/ +F: include/linux/kho_block.h F: kernel/liveupdate/kexec_handover* F: lib/test_kho.c F: tools/testing/selftests/kho/ diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h new file mode 100644 index 000000000000..8641c20b379b --- /dev/null +++ b/include/linux/kho/abi/block.h @@ -0,0 +1,56 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +/** + * DOC: KHO Serialization Blocks ABI + * + * Subsystems using the KHO Serialization Blocks framework rely on the stable + * Application Binary Interface defined below to pass serialized state from a + * pre-update kernel to a post-update kernel. + * + * This interface is a contract. Any modification to the structure fields, + * compatible strings, or the layout of the `__packed` serialization + * structures defined here constitutes a breaking change. Such changes require + * incrementing the version number in the `KHO_BLOCK_ABI_COMPATIBLE` string to + * prevent a new kernel from misinterpreting data from an old kernel. + * + * Changes are allowed provided the compatibility version is incremented; + * however, backward/forward compatibility is only guaranteed for kernels + * supporting the same ABI version. + */ + +#ifndef _LINUX_KHO_ABI_BLOCK_H +#define _LINUX_KHO_ABI_BLOCK_H + +#include +#include + +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" + +/** + * KHO_BLOCK_SIZE - The size of each serialization block. + * + * This is defined as PAGE_SIZE. PAGE_SIZE is ABI compliant because live + * update between kernels with different page sizes is not supported by KHO. 
+ */ +#define KHO_BLOCK_SIZE PAGE_SIZE + +/** + * struct kho_block_header_ser - Header for the serialized data block. + * @next: Physical address of the next struct kho_block_header_ser. + * @count: The number of entries that immediately follow this header in the + * memory block. + * + * This structure is located at the beginning of a block of physical memory + * preserved across a kexec. It provides the necessary metadata to interpret + * the array of entries that follow. + */ +struct kho_block_header_ser { + u64 next; + u64 count; +} __packed; + +#endif /* _LINUX_KHO_ABI_BLOCK_H */ diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h new file mode 100644 index 000000000000..93a7cc2be5f5 --- /dev/null +++ b/include/linux/kho_block.h @@ -0,0 +1,106 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +#ifndef _LINUX_KHO_BLOCK_H +#define _LINUX_KHO_BLOCK_H + +#include +#include +#include + +/** + * struct kho_block - Internal representation of a serialization block. + * @list: List head for linking blocks in memory. + * @ser: Pointer to the serialized header in preserved memory. + */ +struct kho_block { + struct list_head list; + struct kho_block_header_ser *ser; +}; + +/** + * struct kho_block_set - A set of blocks containing serialized entries of the same type. + * @blocks: The list of serialization blocks (struct kho_block). + * @nblocks: The number of allocated serialization blocks. + * @head_pa: Physical address of the first block header. + * @entry_size: The size of each entry in the blocks. + * @count_per_block: The maximum number of entries each block can hold. + * @incoming: True if this block set was restored from the previous kernel. + * + * Note: Synchronization and locking are the responsibility of the caller. + * The block set structure itself is not internally synchronized. 
+ */ +struct kho_block_set { + struct list_head blocks; + long nblocks; + u64 head_pa; + size_t entry_size; + u64 count_per_block; + bool incoming; +}; + +/** + * struct kho_block_set_it - Iterator for serializing entries into blocks. + * @bs: The block set being iterated. + * @block: The current block. + * @i: The current entry index within @block. + */ +struct kho_block_set_it { + struct kho_block_set *bs; + struct kho_block *block; + u64 i; +}; + +/** + * KHO_BLOCK_SET_INIT - Initialize a static kho_block_set. + * @_name: Name of the kho_block_set variable. + * @_entry_size: The size of each entry in the block set. + */ +#define KHO_BLOCK_SET_INIT(_name, _entry_size) { \ + .blocks = LIST_HEAD_INIT((_name).blocks), \ + .entry_size = _entry_size, \ + .count_per_block = (KHO_BLOCK_SIZE - \ + sizeof(struct kho_block_header_ser)) / \ + (_entry_size), \ +} + +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size); + +int kho_block_set_grow(struct kho_block_set *bs, u64 count); +void kho_block_set_shrink(struct kho_block_set *bs, u64 count); + +int kho_block_set_restore(struct kho_block_set *bs, u64 head_pa); +void kho_block_set_destroy(struct kho_block_set *bs); +void kho_block_set_clear(struct kho_block_set *bs); + +/** + * kho_block_set_head_pa - Get the physical address of the first block header. + * @bs: The block set. + * + * Return: The physical address of the first block header, or 0 if empty. + */ +static inline u64 kho_block_set_head_pa(struct kho_block_set *bs) +{ + return bs->head_pa; +} + +/** + * kho_block_set_is_empty - Check if the block set has no allocated blocks. + * @bs: The block set. + * + * Return: True if there are no blocks in the set, false otherwise. 
+ */ +static inline bool kho_block_set_is_empty(struct kho_block_set *bs) +{ + return list_empty(&bs->blocks); +} + +void kho_block_set_it_init(struct kho_block_set_it *it, struct kho_block_set *bs); +void *kho_block_set_it_reserve_entry(struct kho_block_set_it *it); +void *kho_block_set_it_read_entry(struct kho_block_set_it *it); +void *kho_block_set_it_prev(struct kho_block_set_it *it); + +#endif /* _LINUX_KHO_BLOCK_H */ diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile index d2f779cbe279..eec9d3ae07eb 100644 --- a/kernel/liveupdate/Makefile +++ b/kernel/liveupdate/Makefile @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 luo-y := \ + kho_block.o \ luo_core.o \ luo_file.o \ luo_flb.o \ diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c new file mode 100644 index 000000000000..4f147c308e6b --- /dev/null +++ b/kernel/liveupdate/kho_block.c @@ -0,0 +1,411 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +/** + * DOC: KHO Serialization Blocks + * + * KHO provides a mechanism to preserve stateful data across a kexec handover + * by serializing it into memory blocks, and provides the common + * infrastructure for managing these blocks. + * + * Each block consists of a header (struct kho_block_header_ser) followed by an + * array of serialized entries. Multiple blocks are linked together via a + * physical pointer in the header, forming a linked list that can be easily + * traversed in both the current and the next kernel. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include + +/* + * Safeguard limit for the number of serialization blocks. This is used to + * prevent infinite loops and excessive memory allocation in case of memory + * corruption in the preserved state. + * + * With a 4KB page size, 10k blocks is about 40MB. For 32-byte entries + * (e.g. 
4 u64s), each block holds up to 127 entries (accounting for the + * 16-byte header), allowing the block set to hold up to 1.27M entries. + */ +#define KHO_MAX_BLOCKS 10000 + +/** + * kho_block_set_init - Initialize a block set. + * @bs: The block set to initialize. + * @entry_size: The size of each entry in the blocks. + */ +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size) +{ + *bs = (struct kho_block_set)KHO_BLOCK_SET_INIT(*bs, entry_size); + WARN_ON_ONCE(!bs->count_per_block); +} + +/* Serialized entries start immediately after the block header */ +static void *kho_block_entries(struct kho_block *block) +{ + return (void *)(block->ser + 1); +} + +/* Get the address of the serialized entry at the specified index */ +static void *kho_block_entry(struct kho_block_set_it *it, u64 index) +{ + return kho_block_entries(it->block) + (index * it->bs->entry_size); +} + +/* Free serialized data */ +static void kho_block_free_ser(struct kho_block_set *bs, + struct kho_block_header_ser *ser) +{ + if (bs->incoming) + kho_restore_free(ser); + else + kho_unpreserve_free(ser); +} + +static struct kho_block_header_ser *kho_block_alloc_ser(struct kho_block_set *bs) +{ + WARN_ON_ONCE(bs->incoming); + return kho_alloc_preserve(KHO_BLOCK_SIZE); +} + +static int kho_block_add(struct kho_block_set *bs, + struct kho_block_header_ser *ser) +{ + struct kho_block *block, *last; + + if (bs->nblocks >= KHO_MAX_BLOCKS) + return -ENOSPC; + + block = kzalloc_obj(*block); + if (!block) + return -ENOMEM; + + block->ser = ser; + last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); + list_add_tail(&block->list, &bs->blocks); + bs->nblocks++; + + if (last) + last->ser->next = virt_to_phys(ser); + else + bs->head_pa = virt_to_phys(ser); + + return 0; +} + +static int kho_block_set_grow_one(struct kho_block_set *bs) +{ + struct kho_block_header_ser *ser; + int err; + + ser = kho_block_alloc_ser(bs); + if (IS_ERR(ser)) + return PTR_ERR(ser); + + err = 
kho_block_add(bs, ser); + if (err) { + kho_block_free_ser(bs, ser); + return err; + } + + return 0; +} + +/** + * kho_block_set_grow - Expand the block set to accommodate the target count. + * @bs: The block set. + * @count: The target number of valid entries to accommodate. + * + * Acts as a runtime notifier when new resources (such as files or sessions) + * are registered. Dynamically preallocates and links preserved memory blocks + * if the target entry count exceeds the current total capacity of the set, + * ensuring they are available during serialization/deserialization. + * + * Context: Caller must hold a lock protecting the block set. + * Return: 0 on success, or a negative errno on failure. + */ +int kho_block_set_grow(struct kho_block_set *bs, u64 count) +{ + if (WARN_ON_ONCE(bs->incoming)) + return -EINVAL; + + while (count > bs->nblocks * bs->count_per_block) { + int err = kho_block_set_grow_one(bs); + + if (err) + return err; + } + + return 0; +} + +static void kho_block_set_shrink_one(struct kho_block_set *bs) +{ + struct kho_block *last, *new_last; + + if (list_empty(&bs->blocks)) + return; + + last = list_last_entry(&bs->blocks, struct kho_block, list); + list_del(&last->list); + bs->nblocks--; + kho_block_free_ser(bs, last->ser); + kfree(last); + + new_last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); + if (new_last) + new_last->ser->next = 0; + else + bs->head_pa = 0; +} + +/** + * kho_block_set_shrink - Shrink the block set to accommodate the target count. + * @bs: The block set. + * @count: The target number of valid entries to accommodate. + * + * Acts as a runtime notifier when resources (such as files or sessions) are + * unregistered, allowing the block set to release and unallocate redundant + * preserved memory blocks. Checks if the last block in the set can be removed + * because the remaining entry count is fully accommodated by the preceding blocks. 
+ * + * Note: It is the caller's responsibility to ensure that entries are removed + * in LIFO (last-in, first-out) order (the reverse order of their insertion). + * Because shrinking destroys the last block in the set, removing entries in + * any other order would corrupt active data. + * + * Context: Caller must hold a lock protecting the block set. + */ +void kho_block_set_shrink(struct kho_block_set *bs, u64 count) +{ + while (bs->nblocks > 0 && count <= (bs->nblocks - 1) * bs->count_per_block) + kho_block_set_shrink_one(bs); +} + +/* + * kho_block_set_is_cyclic - Check for cycles in a linked list of blocks. + * Uses Floyd's cycle-finding algorithm to ensure sanity of the incoming list. + * + * Return: true if a cycle or corruption is detected, false otherwise. + */ +static bool kho_block_set_is_cyclic(struct kho_block_set *bs) +{ + struct kho_block_header_ser *fast; + struct kho_block_header_ser *slow; + int count = 0; + + fast = phys_to_virt(bs->head_pa); + slow = fast; + + while (fast) { + if (count++ >= KHO_MAX_BLOCKS) { + pr_err("Block set is corrupted\n"); + return true; + } + + if (!fast->next) + break; + + fast = phys_to_virt(fast->next); + if (!fast->next) + break; + + fast = phys_to_virt(fast->next); + slow = phys_to_virt(slow->next); + + if (slow == fast) { + pr_err("Block set is corrupted\n"); + return true; + } + } + + return false; +} + +/** + * kho_block_set_restore - Restore a block set from a physical address. + * @bs: The block set to restore. + * @head_pa: Physical address of the first block header. + * + * Restores a serialized block set from a given physical address. The caller is + * responsible for ensuring that the block set @bs has been allocated and + * initialized prior to calling this function. + * + * Return: 0 on success, or a negative errno on failure. 
+ */ +int kho_block_set_restore(struct kho_block_set *bs, u64 head_pa) +{ + struct kho_block_header_ser *ser; + u64 next_pa = head_pa; + int err; + + /* Restored block sets use size from the previous kernel */ + bs->incoming = true; + if (!head_pa) + return 0; + + bs->head_pa = head_pa; + if (kho_block_set_is_cyclic(bs)) { + bs->head_pa = 0; + return -EINVAL; + } + + while (next_pa) { + ser = phys_to_virt(next_pa); + if (!ser->count || ser->count > bs->count_per_block) { + pr_warn("Block contains invalid entry count: %llu\n", + ser->count); + err = -EINVAL; + goto err_destroy; + } + err = kho_block_add(bs, ser); + if (err) + goto err_destroy; + next_pa = ser->next; + } + + return 0; + +err_destroy: + kho_block_set_destroy(bs); + + /* Free the remaining un-restored blocks in the physical chain */ + while (next_pa) { + struct kho_block_header_ser *next_ser = phys_to_virt(next_pa); + + next_pa = next_ser->next; + kho_block_free_ser(bs, next_ser); + } + return err; +} + +/** + * kho_block_set_destroy - Destroy all blocks in a block set. + * @bs: The block set. + */ +void kho_block_set_destroy(struct kho_block_set *bs) +{ + struct kho_block *block, *tmp; + + list_for_each_entry_safe(block, tmp, &bs->blocks, list) { + list_del(&block->list); + kho_block_free_ser(bs, block->ser); + kfree(block); + } + bs->nblocks = 0; + bs->head_pa = 0; +} + +/** + * kho_block_set_clear - Clear all serialized data in a block set. + * @bs: The block set to clear. + */ +void kho_block_set_clear(struct kho_block_set *bs) +{ + struct kho_block *block; + + list_for_each_entry(block, &bs->blocks, list) { + block->ser->count = 0; + memset(block->ser + 1, 0, KHO_BLOCK_SIZE - sizeof(*block->ser)); + } +} + +/** + * kho_block_set_it_init - Initialize a block set iterator. + * @it: The iterator to initialize. + * @bs: The block set to iterate over. 
+ */ +void kho_block_set_it_init(struct kho_block_set_it *it, struct kho_block_set *bs) +{ + it->bs = bs; + it->block = list_first_entry_or_null(&bs->blocks, struct kho_block, list); + it->i = 0; +} + +/** + * kho_block_set_it_reserve_entry - Reserve and return the next available slot for writing. + * @it: The block iterator. + * + * Reserves a slot in the current block during state serialization to add a new + * entry, advancing the internal index. If the current block is full, it + * automatically moves to the next block in the set. + * + * Return: A pointer to the reserved entry slot, or NULL if the block set's + * capacity is fully exhausted. + */ +void *kho_block_set_it_reserve_entry(struct kho_block_set_it *it) +{ + void *entry; + + if (!it->block) + return NULL; + + if (it->i == it->bs->count_per_block) { + if (list_is_last(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_next_entry(it->block, list); + it->i = 0; + } + + entry = kho_block_entry(it, it->i++); + it->block->ser->count = it->i; + return entry; +} + +/** + * kho_block_set_it_read_entry - Read the next serialized entry from the block set. + * @it: The block iterator. + * + * Iterates through previously written entries during state deserialization, + * respecting the actual count stored in each block's header. + * + * Return: A pointer to the next serialized entry, or NULL if all serialized + * entries have been read. + */ +void *kho_block_set_it_read_entry(struct kho_block_set_it *it) +{ + if (!it->block) + return NULL; + + if (it->i == it->block->ser->count) { + if (list_is_last(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_next_entry(it->block, list); + it->i = 0; + } + + return kho_block_entry(it, it->i++); +} + +/** + * kho_block_set_it_prev - Return the previous entry slot in the block set. + * @it: The block iterator. + * + * If the current index is at the start of a block, it automatically moves to + * the end of the previous block. 
+ * + * Return: A pointer to the previous entry slot, or NULL if at the very + * beginning of the block set. + */ +void *kho_block_set_it_prev(struct kho_block_set_it *it) +{ + if (!it->block) + return NULL; + + if (it->i == 0) { + if (list_is_first(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_prev_entry(it->block, list); + it->i = it->bs->count_per_block; + } + + return kho_block_entry(it, --it->i); +} -- 2.53.0 From pasha.tatashin at soleen.com Tue Jun 2 20:28:59 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:28:59 +0000 Subject: [PATCH v6 08/13] liveupdate: defer session block allocation and physical address setting In-Reply-To: <20260603032905.344462-1-pasha.tatashin@soleen.com> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> Message-ID: <20260603032905.344462-9-pasha.tatashin@soleen.com> Currently, luo_session_setup_outgoing() allocates the session block and sets its physical address in the header immediately. With upcoming dynamic block-based session management, this makes the first block different from the rest. Move the allocation to where it is first needed. 
Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/luo_core.c | 4 +- kernel/liveupdate/luo_internal.h | 2 +- kernel/liveupdate/luo_session.c | 68 ++++++++++++++++++++------------ 3 files changed, 45 insertions(+), 29 deletions(-) diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 69b00e7d0f8f..1b2bda22902d 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -165,9 +165,7 @@ static int __init luo_state_setup(void) strscpy(luo_ser->compatible, LUO_ABI_COMPATIBLE, sizeof(luo_ser->compatible)); luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; - err = luo_session_setup_outgoing(&luo_ser->sessions_pa); - if (err) - goto exit_free_luo_ser; + luo_session_setup_outgoing(&luo_ser->sessions_pa); err = luo_flb_setup_outgoing(&luo_ser->flbs_pa); if (err) diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index fe22086bfbeb..ee18f9a11b91 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -79,7 +79,7 @@ extern struct rw_semaphore luo_register_rwlock; int luo_session_create(const char *name, struct file **filep); int luo_session_retrieve(const char *name, struct file **filep); -int __init luo_session_setup_outgoing(u64 *sessions_pa); +void __init luo_session_setup_outgoing(u64 *sessions_pa); int __init luo_session_setup_incoming(u64 sessions_pa); int luo_session_serialize(void); int luo_session_deserialize(void); diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 1cd315e0f6de..2411849a34e3 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -108,15 +108,16 @@ static DECLARE_RWSEM(luo_session_serialize_rwsem); /** * struct luo_session_header - Header struct for managing LUO sessions. - * @count: The number of sessions currently tracked in the @list. 
- * @list: The head of the linked list of `struct luo_session` instances. - * @rwsem: A read-write semaphore providing synchronized access to the - * session list and other fields in this structure. - * @header_ser: The header data of serialization array. - * @ser: The serialized session data (an array of - * `struct luo_session_ser`). - * @active: Set to true when first initialized. If previous kernel did not - * send session data, active stays false for incoming. + * @count: The number of sessions currently tracked in the @list. + * @list: The head of the linked list of `struct luo_session` instances. + * @rwsem: A read-write semaphore providing synchronized access to the + * session list and other fields in this structure. + * @header_ser: The header data of serialization array. + * @ser: The serialized session data (an array of + * `struct luo_session_ser`). + * @sessions_pa: Points to the location of sessions_pa within struct luo_ser. + * @active: Set to true when first initialized. If previous kernel did not + * send session data, active stays false for incoming. 
*/ struct luo_session_header { long count; @@ -124,6 +125,7 @@ struct luo_session_header { struct rw_semaphore rwsem; struct luo_session_header_ser *header_ser; struct luo_session_ser *ser; + u64 *sessions_pa; bool active; }; @@ -171,10 +173,30 @@ static void luo_session_free(struct luo_session *session) kfree(session); } +static int luo_session_grow_ser(struct luo_session_header *sh) +{ + struct luo_session_header_ser *header_ser; + + if (sh->count == LUO_SESSION_MAX) + return -ENOMEM; + + if (sh->header_ser) + return 0; + + header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); + if (IS_ERR(header_ser)) + return PTR_ERR(header_ser); + + sh->header_ser = header_ser; + sh->ser = (void *)(header_ser + 1); + return 0; +} + static int luo_session_insert(struct luo_session_header *sh, struct luo_session *session) { struct luo_session *it; + int err; guard(rwsem_write)(&sh->rwsem); @@ -183,8 +205,9 @@ static int luo_session_insert(struct luo_session_header *sh, * for new session. */ if (sh == &luo_session_global.outgoing) { - if (sh->count == LUO_SESSION_MAX) - return -ENOMEM; + err = luo_session_grow_ser(sh); + if (err) + return err; } /* @@ -524,21 +547,10 @@ int luo_session_retrieve(const char *name, struct file **filep) return err; } -int __init luo_session_setup_outgoing(u64 *sessions_pa) +void __init luo_session_setup_outgoing(u64 *sessions_pa) { - struct luo_session_header_ser *header_ser; - - header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); - if (IS_ERR(header_ser)) - return PTR_ERR(header_ser); - - *sessions_pa = virt_to_phys(header_ser); - - luo_session_global.outgoing.header_ser = header_ser; - luo_session_global.outgoing.ser = (void *)(header_ser + 1); + luo_session_global.outgoing.sessions_pa = sessions_pa; luo_session_global.outgoing.active = true; - - return 0; } int __init luo_session_setup_incoming(u64 sessions_pa) @@ -644,6 +656,8 @@ int luo_session_serialize(void) down_write(&luo_session_serialize_rwsem); 
down_write(&sh->rwsem); + *sh->sessions_pa = 0; + list_for_each_entry(session, &sh->list, list) { err = luo_session_freeze_one(session, &sh->ser[i]); if (err) @@ -653,7 +667,11 @@ int luo_session_serialize(void) sizeof(sh->ser[i].name)); i++; } - sh->header_ser->count = sh->count; + + if (sh->header_ser && sh->count > 0) { + sh->header_ser->count = sh->count; + *sh->sessions_pa = virt_to_phys(sh->header_ser); + } up_write(&sh->rwsem); return 0; -- 2.53.0 From pasha.tatashin at soleen.com Tue Jun 2 20:29:00 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:29:00 +0000 Subject: [PATCH v6 09/13] liveupdate: Remove limit on the number of sessions In-Reply-To: <20260603032905.344462-1-pasha.tatashin@soleen.com> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> Message-ID: <20260603032905.344462-10-pasha.tatashin@soleen.com> Currently, the number of LUO sessions is limited by a fixed number of pre-allocated pages for serialization (16 pages, allowing for ~819 sessions). This limitation is problematic if LUO is used to support things such as the systemd file descriptor store, where it would preserve not just VM memory but also other state on the machine. Remove this limit by transitioning to a linked-block approach for session metadata serialization. Instead of a single contiguous block, session metadata is now stored in a chain of 16-page blocks. Each block starts with a header containing the physical address of the next block and the number of session entries in the current block.
Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 24 +------ kernel/liveupdate/luo_session.c | 113 +++++++++++++++----------------- 2 files changed, 56 insertions(+), 81 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 9a4fe491812b..79758d92ed5f 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -33,11 +33,6 @@ * It includes the compatibility string, the liveupdate-number, and pointers * to sessions and FLBs. * - * - struct luo_session_header_ser: - * Header for the session array. Contains the total page count of the - * preserved memory block and the number of `struct luo_session_ser` - * entries that follow. - * * - struct luo_session_ser: * Metadata for a single session, including its name and a physical pointer * to another preserved memory block containing an array of @@ -63,13 +58,15 @@ #define _LINUX_KHO_ABI_LUO_H #include +#include #include /* * The LUO state is registered under this KHO entry name. */ #define LUO_KHO_ENTRY_NAME "LUO" -#define LUO_ABI_COMPATIBLE "luo-v3" +#define LUO_COMPAT_BASE "luo-v3" +#define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE #define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) /** @@ -118,21 +115,6 @@ struct luo_file_set_ser { u64 count; } __packed; -/** - * struct luo_session_header_ser - Header for the serialized session data block. - * @count: The number of `struct luo_session_ser` entries that immediately - * follow this header in the memory block. - * - * This structure is located at the beginning of a contiguous block of - * physical memory preserved across the kexec. It provides the necessary - * metadata to interpret the array of session entries that follow. - * - * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. 
- */ -struct luo_session_header_ser { - u64 count; -} __packed; - /** * struct luo_session_ser - Represents the serialized metadata for a LUO session. * @name: The unique name of the session, provided by the userspace at diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 2411849a34e3..b79b2a488974 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -24,9 +24,10 @@ * ioctls on /dev/liveupdate. * * - Serialization: Session metadata is preserved using the KHO framework. When - * a live update is triggered via kexec, an array of `struct luo_session_ser` - * is populated and placed in a preserved memory region. The physical address - * of this array is stored in the centralized `struct luo_ser` structure. + * a live update is triggered via kexec, session metadata is serialized into + * a chain of linked-blocks and placed in a preserved memory region. The + * physical address of the first block header is stored in the centralized + * `struct luo_ser` structure. * * Session Lifecycle: * @@ -89,6 +90,7 @@ #include #include #include +#include #include #include #include @@ -98,23 +100,14 @@ #include #include "luo_internal.h" -/* 16 4K pages, give space for 744 sessions */ -#define LUO_SESSION_PGCNT 16ul -#define LUO_SESSION_MAX (((LUO_SESSION_PGCNT << PAGE_SHIFT) - \ - sizeof(struct luo_session_header_ser)) / \ - sizeof(struct luo_session_ser)) - static DECLARE_RWSEM(luo_session_serialize_rwsem); - /** * struct luo_session_header - Header struct for managing LUO sessions. * @count: The number of sessions currently tracked in the @list. * @list: The head of the linked list of `struct luo_session` instances. * @rwsem: A read-write semaphore providing synchronized access to the * session list and other fields in this structure. - * @header_ser: The header data of serialization array. - * @ser: The serialized session data (an array of - * `struct luo_session_ser`). 
+ * @block_set: The set of serialization blocks. * @sessions_pa: Points to the location of sessions_pa within struct luo_ser. * @active: Set to true when first initialized. If previous kernel did not * send session data, active stays false for incoming. @@ -123,8 +116,7 @@ struct luo_session_header { long count; struct list_head list; struct rw_semaphore rwsem; - struct luo_session_header_ser *header_ser; - struct luo_session_ser *ser; + struct kho_block_set block_set; u64 *sessions_pa; bool active; }; @@ -143,10 +135,14 @@ static struct luo_session_global luo_session_global = { .incoming = { .list = LIST_HEAD_INIT(luo_session_global.incoming.list), .rwsem = __RWSEM_INITIALIZER(luo_session_global.incoming.rwsem), + .block_set = KHO_BLOCK_SET_INIT(luo_session_global.incoming.block_set, + sizeof(struct luo_session_ser)), }, .outgoing = { .list = LIST_HEAD_INIT(luo_session_global.outgoing.list), .rwsem = __RWSEM_INITIALIZER(luo_session_global.outgoing.rwsem), + .block_set = KHO_BLOCK_SET_INIT(luo_session_global.outgoing.block_set, + sizeof(struct luo_session_ser)), }, }; @@ -173,25 +169,6 @@ static void luo_session_free(struct luo_session *session) kfree(session); } -static int luo_session_grow_ser(struct luo_session_header *sh) -{ - struct luo_session_header_ser *header_ser; - - if (sh->count == LUO_SESSION_MAX) - return -ENOMEM; - - if (sh->header_ser) - return 0; - - header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); - if (IS_ERR(header_ser)) - return PTR_ERR(header_ser); - - sh->header_ser = header_ser; - sh->ser = (void *)(header_ser + 1); - return 0; -} - static int luo_session_insert(struct luo_session_header *sh, struct luo_session *session) { @@ -205,7 +182,7 @@ static int luo_session_insert(struct luo_session_header *sh, * for new session. 
*/ if (sh == &luo_session_global.outgoing) { - err = luo_session_grow_ser(sh); + err = kho_block_set_grow(&sh->block_set, sh->count + 1); if (err) return err; } @@ -232,6 +209,8 @@ static void luo_session_remove(struct luo_session_header *sh, guard(rwsem_write)(&sh->rwsem); list_del(&session->list); sh->count--; + if (sh == &luo_session_global.outgoing) + kho_block_set_shrink(&sh->block_set, sh->count); } static int luo_session_finish_one(struct luo_session *session) @@ -555,15 +534,17 @@ void __init luo_session_setup_outgoing(u64 *sessions_pa) int __init luo_session_setup_incoming(u64 sessions_pa) { - struct luo_session_header_ser *header_ser; + struct luo_session_header *sh = &luo_session_global.incoming; + int err; - if (sessions_pa) { - header_ser = phys_to_virt(sessions_pa); - luo_session_global.incoming.header_ser = header_ser; - luo_session_global.incoming.ser = (void *)(header_ser + 1); - luo_session_global.incoming.active = true; - } + if (!sessions_pa) + return 0; + err = kho_block_set_restore(&sh->block_set, sessions_pa); + if (err) + return err; + + sh->active = true; return 0; } @@ -605,6 +586,8 @@ int luo_session_deserialize(void) { struct luo_session_header *sh = &luo_session_global.incoming; static bool is_deserialized; + struct luo_session_ser *ser; + struct kho_block_set_it it; static int saved_err; int err; @@ -631,18 +614,19 @@ int luo_session_deserialize(void) * userspace to detect the failure and trigger a reboot, which will * reliably reset devices and reclaim memory. 
*/ - for (int i = 0; i < sh->header_ser->count; i++) { - err = luo_session_deserialize_one(sh, &sh->ser[i]); + kho_block_set_it_init(&it, &sh->block_set); + while ((ser = kho_block_set_it_read_entry(&it))) { + err = luo_session_deserialize_one(sh, ser); if (err) goto save_err; } - kho_restore_free(sh->header_ser); - sh->header_ser = NULL; - sh->ser = NULL; + kho_block_set_destroy(&sh->block_set); return 0; + save_err: + kho_block_set_destroy(&sh->block_set); saved_err = err; return err; } @@ -651,36 +635,45 @@ int luo_session_serialize(void) { struct luo_session_header *sh = &luo_session_global.outgoing; struct luo_session *session; - int i = 0; + struct kho_block_set_it it; int err; down_write(&luo_session_serialize_rwsem); down_write(&sh->rwsem); *sh->sessions_pa = 0; + kho_block_set_it_init(&it, &sh->block_set); + list_for_each_entry(session, &sh->list, list) { - err = luo_session_freeze_one(session, &sh->ser[i]); - if (err) + struct luo_session_ser *ser = kho_block_set_it_reserve_entry(&it); + + /* This should not fail normally as blocks were pre-allocated */ + if (WARN_ON_ONCE(!ser)) { + err = -ENOSPC; goto err_undo; + } - strscpy(sh->ser[i].name, session->name, - sizeof(sh->ser[i].name)); - i++; - } + err = luo_session_freeze_one(session, ser); + if (err) { + kho_block_set_it_prev(&it); + goto err_undo; + } - if (sh->header_ser && sh->count > 0) { - sh->header_ser->count = sh->count; - *sh->sessions_pa = virt_to_phys(sh->header_ser); + strscpy(ser->name, session->name, sizeof(ser->name)); } + + if (sh->count > 0) + *sh->sessions_pa = kho_block_set_head_pa(&sh->block_set); up_write(&sh->rwsem); return 0; err_undo: list_for_each_entry_continue_reverse(session, &sh->list, list) { - i--; - luo_session_unfreeze_one(session, &sh->ser[i]); - memset(sh->ser[i].name, 0, sizeof(sh->ser[i].name)); + struct luo_session_ser *ser = kho_block_set_it_prev(&it); + + luo_session_unfreeze_one(session, ser); + memset(ser->name, 0, sizeof(ser->name)); } up_write(&sh->rwsem); 
up_write(&luo_session_serialize_rwsem); -- 2.53.0 From pasha.tatashin at soleen.com Tue Jun 2 20:29:01 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:29:01 +0000 Subject: [PATCH v6 10/13] liveupdate: Remove limit on the number of files per session In-Reply-To: <20260603032905.344462-1-pasha.tatashin@soleen.com> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> Message-ID: <20260603032905.344462-11-pasha.tatashin@soleen.com> To remove the fixed limit on the number of preserved files per session, transition the file metadata serialization from a single contiguous memory block to a chain of linked blocks. Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 13 +-- kernel/liveupdate/luo_file.c | 138 ++++++++++++++----------------- kernel/liveupdate/luo_internal.h | 6 +- 3 files changed, 74 insertions(+), 83 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 79758d92ed5f..16df550ef143 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -35,8 +35,8 @@ * * - struct luo_session_ser: * Metadata for a single session, including its name and a physical pointer - * to another preserved memory block containing an array of - * `struct luo_file_ser` for all files in that session. + * to the first `struct kho_block_header_ser` for all files in that session. + * Multiple blocks are linked via the `next` field in the header. * * - struct luo_file_ser: * Metadata for a single preserved file. Contains the `compatible` string to @@ -65,7 +65,7 @@ * The LUO state is registered under this KHO entry name. 
*/ #define LUO_KHO_ENTRY_NAME "LUO" -#define LUO_COMPAT_BASE "luo-v3" +#define LUO_COMPAT_BASE "luo-v4" #define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE #define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) @@ -103,9 +103,10 @@ struct luo_file_ser { /** * struct luo_file_set_ser - Represents the serialized metadata for file set - * @files: The physical address of a contiguous memory block that holds - * the serialized state of files (array of luo_file_ser) in this file - * set. + * @files: The physical address of the first `struct kho_block_header_ser`. + * This structure is the header for a block of memory containing + * an array of `struct luo_file_ser` entries. Multiple blocks are + * linked via the `next` field in the header. * @count: The total number of files that were part of this session during * serialization. Used for iteration and validation during * restoration. diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 9eec07a9e9fc..c39f96961a85 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -118,11 +118,6 @@ static LIST_HEAD(luo_file_handler_list); /* Keep track of files being preserved by LUO */ static DEFINE_XARRAY(luo_preserved_files); -/* 2 4K pages, give space for 128 files per file_set */ -#define LUO_FILE_PGCNT 2ul -#define LUO_FILE_MAX \ - ((LUO_FILE_PGCNT << PAGE_SHIFT) / sizeof(struct luo_file_ser)) - /** * struct luo_file - Represents a single preserved file instance. 
* @fh: Pointer to the &struct liveupdate_file_handler that manages @@ -174,39 +169,6 @@ struct luo_file { u64 token; }; -static int luo_alloc_files_mem(struct luo_file_set *file_set) -{ - size_t size; - void *mem; - - if (file_set->files) - return 0; - - WARN_ON_ONCE(file_set->count); - - size = LUO_FILE_PGCNT << PAGE_SHIFT; - mem = kho_alloc_preserve(size); - if (IS_ERR(mem)) - return PTR_ERR(mem); - - file_set->files = mem; - - return 0; -} - -static void luo_free_files_mem(struct luo_file_set *file_set) -{ - /* If file_set has files, no need to free preservation memory */ - if (file_set->count) - return; - - if (!file_set->files) - return; - - kho_unpreserve_free(file_set->files); - file_set->files = NULL; -} - static unsigned long luo_get_id(struct liveupdate_file_handler *fh, struct file *file) { @@ -276,16 +238,15 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) if (luo_token_is_used(file_set, token)) return -EEXIST; - if (file_set->count == LUO_FILE_MAX) - return -ENOSPC; + err = kho_block_set_grow(&file_set->block_set, file_set->count + 1); + if (err) + return err; file = fget(fd); - if (!file) - return -EBADF; - - err = luo_alloc_files_mem(file_set); - if (err) - goto err_fput; + if (!file) { + err = -EBADF; + goto err_shrink; + } err = -ENOENT; down_read(&luo_register_rwlock); @@ -300,7 +261,7 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) /* err is still -ENOENT if no handler was found */ if (err) - goto err_free_files_mem; + goto err_fput; err = xa_insert(&luo_preserved_files, luo_get_id(fh, file), file, GFP_KERNEL); @@ -343,10 +304,10 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) xa_erase(&luo_preserved_files, luo_get_id(fh, file)); err_module_put: module_put(fh->ops->owner); -err_free_files_mem: - luo_free_files_mem(file_set); err_fput: fput(file); +err_shrink: + kho_block_set_shrink(&file_set->block_set, file_set->count); return err; } @@ -392,13 +353,14 @@ void 
luo_file_unpreserve_files(struct luo_file_set *file_set) list_del(&luo_file->list); file_set->count--; + kho_block_set_shrink(&file_set->block_set, file_set->count); fput(luo_file->file); mutex_destroy(&luo_file->mutex); kfree(luo_file); } - luo_free_files_mem(file_set); + kho_block_set_destroy(&file_set->block_set); } static int luo_file_freeze_one(struct luo_file_set *file_set, @@ -454,7 +416,7 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, luo_file_unfreeze_one(file_set, luo_file); } - memset(file_set->files, 0, LUO_FILE_PGCNT << PAGE_SHIFT); + kho_block_set_clear(&file_set->block_set); } /** @@ -493,19 +455,24 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, int luo_file_freeze(struct luo_file_set *file_set, struct luo_file_set_ser *file_set_ser) { - struct luo_file_ser *file_ser = file_set->files; struct luo_file *luo_file; + struct kho_block_set_it it; int err; - int i; if (!file_set->count) return 0; - if (WARN_ON(!file_ser)) - return -EINVAL; + kho_block_set_it_init(&it, &file_set->block_set); - i = 0; list_for_each_entry(luo_file, &file_set->files_list, list) { + struct luo_file_ser *file_ser = kho_block_set_it_reserve_entry(&it); + + /* This should not fail normally as blocks were pre-allocated */ + if (WARN_ON_ONCE(!file_ser)) { + err = -ENOSPC; + goto err_unfreeze; + } + err = luo_file_freeze_one(file_set, luo_file); if (err < 0) { pr_warn("Freeze failed for token[%#0llx] handler[%s] err[%pe]\n", @@ -514,16 +481,14 @@ int luo_file_freeze(struct luo_file_set *file_set, goto err_unfreeze; } - strscpy(file_ser[i].compatible, luo_file->fh->compatible, - sizeof(file_ser[i].compatible)); - file_ser[i].data = luo_file->serialized_data; - file_ser[i].token = luo_file->token; - i++; + strscpy(file_ser->compatible, luo_file->fh->compatible, + sizeof(file_ser->compatible)); + file_ser->data = luo_file->serialized_data; + file_ser->token = luo_file->token; } file_set_ser->count = file_set->count; - if (file_set->files) - 
file_set_ser->files = virt_to_phys(file_set->files); + file_set_ser->files = kho_block_set_head_pa(&file_set->block_set); return 0; @@ -741,14 +706,12 @@ int luo_file_finish(struct luo_file_set *file_set) module_put(luo_file->fh->ops->owner); list_del(&luo_file->list); file_set->count--; + kho_block_set_shrink(&file_set->block_set, file_set->count); mutex_destroy(&luo_file->mutex); kfree(luo_file); } - if (file_set->files) { - kho_restore_free(file_set->files); - file_set->files = NULL; - } + kho_block_set_destroy(&file_set->block_set); return 0; } @@ -822,16 +785,18 @@ int luo_file_deserialize(struct luo_file_set *file_set, struct luo_file_set_ser *file_set_ser) { struct luo_file_ser *file_ser; + struct kho_block_set_it it; int err; - u64 i; if (!file_set_ser->files) { WARN_ON(file_set_ser->count); return 0; } - file_set->count = file_set_ser->count; - file_set->files = phys_to_virt(file_set_ser->files); + file_set->count = 0; + err = kho_block_set_restore(&file_set->block_set, file_set_ser->files); + if (err) + return err; /* * Note on error handling: @@ -848,25 +813,50 @@ int luo_file_deserialize(struct luo_file_set *file_set, * userspace to detect the failure and trigger a reboot, which will * reliably reset devices and reclaim memory. 
*/ - file_ser = file_set->files; - for (i = 0; i < file_set->count; i++) { - err = luo_file_deserialize_one(file_set, &file_ser[i]); + kho_block_set_it_init(&it, &file_set->block_set); + while ((file_ser = kho_block_set_it_read_entry(&it))) { + err = luo_file_deserialize_one(file_set, file_ser); if (err) - return err; + goto err_destroy_blocks; + file_set->count++; + } + + if (file_set->count != file_set_ser->count) { + pr_warn("File count mismatch: expected %llu, found %llu\n", + file_set_ser->count, file_set->count); + err = -EINVAL; + goto err_destroy_blocks; } return 0; + +err_destroy_blocks: + while (!list_empty(&file_set->files_list)) { + struct luo_file *luo_file; + + luo_file = list_first_entry(&file_set->files_list, + struct luo_file, list); + list_del(&luo_file->list); + module_put(luo_file->fh->ops->owner); + mutex_destroy(&luo_file->mutex); + kfree(luo_file); + } + file_set->count = 0; + kho_block_set_destroy(&file_set->block_set); + return err; } void luo_file_set_init(struct luo_file_set *file_set) { INIT_LIST_HEAD(&file_set->files_list); + kho_block_set_init(&file_set->block_set, sizeof(struct luo_file_ser)); } void luo_file_set_destroy(struct luo_file_set *file_set) { WARN_ON(file_set->count); WARN_ON(!list_empty(&file_set->files_list)); + WARN_ON(!kho_block_set_is_empty(&file_set->block_set)); } /** diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index ee18f9a11b91..64879ffe7378 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -10,6 +10,7 @@ #include #include +#include struct luo_ucmd { void __user *ubuffer; @@ -44,14 +45,13 @@ static inline int luo_ucmd_respond(struct luo_ucmd *ucmd, * struct luo_file_set - A set of files that belong to the same sessions. * @files_list: An ordered list of files associated with this session, it is * ordered by preservation time. - * @files: The physically contiguous memory block that holds the serialized - * state of files. 
+ * @block_set: The set of serialization blocks. * @count: A counter tracking the number of files currently stored in the * @files_list for this session. */ struct luo_file_set { struct list_head files_list; - struct luo_file_ser *files; + struct kho_block_set block_set; u64 count; }; -- 2.53.0 From pasha.tatashin at soleen.com Tue Jun 2 20:29:02 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:29:02 +0000 Subject: [PATCH v6 11/13] selftests/liveupdate: Test session and file limit removal In-Reply-To: <20260603032905.344462-1-pasha.tatashin@soleen.com> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> Message-ID: <20260603032905.344462-12-pasha.tatashin@soleen.com> With the removal of static limits on the number of sessions and files per session, the orchestrator now uses dynamic allocation. Add new test cases to verify that the system can handle a large number of sessions and files. These tests ensure that the dynamic block allocation and reuse logic for session metadata and outgoing files work correctly beyond the previous static limits. 
Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- .../testing/selftests/liveupdate/liveupdate.c | 75 +++++++++++++++++++ .../selftests/liveupdate/luo_test_utils.c | 24 ++++++ .../selftests/liveupdate/luo_test_utils.h | 2 + 3 files changed, 101 insertions(+) diff --git a/tools/testing/selftests/liveupdate/liveupdate.c b/tools/testing/selftests/liveupdate/liveupdate.c index c7d94b9181e1..502fb3567e38 100644 --- a/tools/testing/selftests/liveupdate/liveupdate.c +++ b/tools/testing/selftests/liveupdate/liveupdate.c @@ -26,6 +26,7 @@ #include +#include "luo_test_utils.h" #include "../kselftest.h" #include "../kselftest_harness.h" @@ -499,4 +500,78 @@ TEST_F(liveupdate_device, get_session_name_max_length) ASSERT_EQ(close(session_fd), 0); } +/* + * Test Case: Manage Many Sessions + * + * Verifies that a large number of sessions can be created and then + * destroyed during normal system operation. This specifically tests the + * dynamic block allocation and reuse logic for session metadata management + * without preserving any files. 
+ */ +TEST_F(liveupdate_device, preserve_many_sessions) +{ +#define MANY_SESSIONS 2000 + int session_fds[MANY_SESSIONS]; + int ret, i; + + self->fd1 = open(LIVEUPDATE_DEV, O_RDWR); + if (self->fd1 < 0 && errno == ENOENT) + SKIP(return, "%s does not exist", LIVEUPDATE_DEV); + ASSERT_GE(self->fd1, 0); + + ret = luo_ensure_nofile_limit(MANY_SESSIONS); + if (ret == -EPERM) + SKIP(return, "Insufficient privileges to set RLIMIT_NOFILE"); + ASSERT_EQ(ret, 0); + + for (i = 0; i < MANY_SESSIONS; i++) { + char name[64]; + + snprintf(name, sizeof(name), "many-session-%d", i); + session_fds[i] = create_session(self->fd1, name); + ASSERT_GE(session_fds[i], 0); + } + + for (i = 0; i < MANY_SESSIONS; i++) + ASSERT_EQ(close(session_fds[i]), 0); +} + +/* + * Test Case: Preserve Many Files + * + * Verifies that a large number of files can be preserved in a single session + * and then destroyed during normal system operation. This tests the dynamic + * block allocation and management for outgoing files. + */ +TEST_F(liveupdate_device, preserve_many_files) +{ +#define MANY_FILES 500 + int mem_fds[MANY_FILES]; + int session_fd, ret, i; + + self->fd1 = open(LIVEUPDATE_DEV, O_RDWR); + if (self->fd1 < 0 && errno == ENOENT) + SKIP(return, "%s does not exist", LIVEUPDATE_DEV); + ASSERT_GE(self->fd1, 0); + + session_fd = create_session(self->fd1, "many-files-test"); + ASSERT_GE(session_fd, 0); + + ret = luo_ensure_nofile_limit(MANY_FILES + 10); + if (ret == -EPERM) + SKIP(return, "Insufficient privileges to set RLIMIT_NOFILE"); + ASSERT_EQ(ret, 0); + + for (i = 0; i < MANY_FILES; i++) { + mem_fds[i] = memfd_create("test-memfd", 0); + ASSERT_GE(mem_fds[i], 0); + ASSERT_EQ(preserve_fd(session_fd, mem_fds[i], i), 0); + } + + for (i = 0; i < MANY_FILES; i++) + ASSERT_EQ(close(mem_fds[i]), 0); + + ASSERT_EQ(close(session_fd), 0); +} + TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/liveupdate/luo_test_utils.c b/tools/testing/selftests/liveupdate/luo_test_utils.c index 
3c8721c505df..333a3530051b 100644 --- a/tools/testing/selftests/liveupdate/luo_test_utils.c +++ b/tools/testing/selftests/liveupdate/luo_test_utils.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -28,6 +29,29 @@ int luo_open_device(void) return open(LUO_DEVICE, O_RDWR); } +int luo_ensure_nofile_limit(long min_limit) +{ + struct rlimit hl; + + /* Allow extra files to be used by the test itself */ + min_limit += 32; + + if (getrlimit(RLIMIT_NOFILE, &hl) < 0) + return -errno; + + if (hl.rlim_cur >= min_limit) + return 0; + + hl.rlim_cur = min_limit; + if (hl.rlim_cur > hl.rlim_max) + hl.rlim_max = hl.rlim_cur; + + if (setrlimit(RLIMIT_NOFILE, &hl) < 0) + return -errno; + + return 0; +} + int luo_create_session(int luo_fd, const char *name) { struct liveupdate_ioctl_create_session arg = { .size = sizeof(arg) }; diff --git a/tools/testing/selftests/liveupdate/luo_test_utils.h b/tools/testing/selftests/liveupdate/luo_test_utils.h index 90099bf49577..6a0d85386613 100644 --- a/tools/testing/selftests/liveupdate/luo_test_utils.h +++ b/tools/testing/selftests/liveupdate/luo_test_utils.h @@ -26,6 +26,8 @@ int luo_create_session(int luo_fd, const char *name); int luo_retrieve_session(int luo_fd, const char *name); int luo_session_finish(int session_fd); +int luo_ensure_nofile_limit(long min_limit); + int create_and_preserve_memfd(int session_fd, int token, const char *data); int restore_and_verify_memfd(int session_fd, int token, const char *expected_data); -- 2.53.0 From pasha.tatashin at soleen.com Tue Jun 2 20:29:03 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:29:03 +0000 Subject: [PATCH v6 12/13] selftests/liveupdate: Add stress-sessions kexec test In-Reply-To: <20260603032905.344462-1-pasha.tatashin@soleen.com> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> Message-ID: <20260603032905.344462-13-pasha.tatashin@soleen.com> Add a new test that creates 2000 LUO sessions before a kexec
reboot and verifies their presence after the reboot. This ensures that the linked-block serialization mechanism works correctly for a large number of sessions. Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- tools/testing/selftests/liveupdate/Makefile | 1 + .../liveupdate/luo_stress_sessions.c | 102 ++++++++++++++++++ 2 files changed, 103 insertions(+) create mode 100644 tools/testing/selftests/liveupdate/luo_stress_sessions.c diff --git a/tools/testing/selftests/liveupdate/Makefile b/tools/testing/selftests/liveupdate/Makefile index 080754787ede..ed7534468386 100644 --- a/tools/testing/selftests/liveupdate/Makefile +++ b/tools/testing/selftests/liveupdate/Makefile @@ -6,6 +6,7 @@ TEST_GEN_PROGS += liveupdate TEST_GEN_PROGS_EXTENDED += luo_kexec_simple TEST_GEN_PROGS_EXTENDED += luo_multi_session +TEST_GEN_PROGS_EXTENDED += luo_stress_sessions TEST_FILES += do_kexec.sh diff --git a/tools/testing/selftests/liveupdate/luo_stress_sessions.c b/tools/testing/selftests/liveupdate/luo_stress_sessions.c new file mode 100644 index 000000000000..f201b1839d1d --- /dev/null +++ b/tools/testing/selftests/liveupdate/luo_stress_sessions.c @@ -0,0 +1,102 @@ +// SPDX-License-Identifier: GPL-2.0-only + +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + * + * Validate that LUO can handle a large number of sessions across a kexec + * reboot. + */ + +#include +#include +#include "luo_test_utils.h" + +#define NUM_SESSIONS 2000 +#define STATE_SESSION_NAME "kexec_many_state" +#define STATE_MEMFD_TOKEN 999 + +/* Stage 1: Executed before the kexec reboot. 
*/ +static void run_stage_1(int luo_fd) +{ + int ret, i; + + ksft_print_msg("[STAGE 1] Increasing ulimit for open files...\n"); + ret = luo_ensure_nofile_limit(NUM_SESSIONS); + if (ret == -EPERM) + ksft_exit_skip("Insufficient privileges to set RLIMIT_NOFILE\n"); + if (ret < 0) + ksft_exit_fail_msg("luo_ensure_nofile_limit failed: %s\n", strerror(-ret)); + + ksft_print_msg("[STAGE 1] Creating state file for next stage (2)...\n"); + create_state_file(luo_fd, STATE_SESSION_NAME, STATE_MEMFD_TOKEN, 2); + + ksft_print_msg("[STAGE 1] Creating %d sessions...\n", NUM_SESSIONS); + + for (i = 0; i < NUM_SESSIONS; i++) { + char name[LIVEUPDATE_SESSION_NAME_LENGTH]; + int s_fd; + + snprintf(name, sizeof(name), "many-test-%d", i); + s_fd = luo_create_session(luo_fd, name); + if (s_fd < 0) { + fail_exit("luo_create_session for '%s' at index %d", + name, i); + } + } + + ksft_print_msg("[STAGE 1] Successfully created %d sessions.\n", + NUM_SESSIONS); + + close(luo_fd); + daemonize_and_wait(); +} + +/* Stage 2: Executed after the kexec reboot. 
*/ +static void run_stage_2(int luo_fd, int state_session_fd) +{ + int i, stage; + + ksft_print_msg("[STAGE 2] Starting post-kexec verification...\n"); + + restore_and_read_stage(state_session_fd, STATE_MEMFD_TOKEN, &stage); + if (stage != 2) { + fail_exit("Expected stage 2, but state file contains %d", + stage); + } + + ksft_print_msg("[STAGE 2] Retrieving and finishing %d sessions...\n", + NUM_SESSIONS); + + for (i = 0; i < NUM_SESSIONS; i++) { + char name[LIVEUPDATE_SESSION_NAME_LENGTH]; + int s_fd; + + snprintf(name, sizeof(name), "many-test-%d", i); + s_fd = luo_retrieve_session(luo_fd, name); + if (s_fd < 0) { + fail_exit("luo_retrieve_session for '%s' at index %d", + name, i); + } + + if (luo_session_finish(s_fd) < 0) { + fail_exit("luo_session_finish for '%s' at index %d", + name, i); + } + close(s_fd); + } + + ksft_print_msg("[STAGE 2] Finalizing state session...\n"); + if (luo_session_finish(state_session_fd) < 0) + fail_exit("luo_session_finish for state session"); + close(state_session_fd); + + ksft_print_msg("\n--- MANY-SESSIONS KEXEC TEST PASSED (%d sessions) ---\n", + NUM_SESSIONS); +} + +int main(int argc, char *argv[]) +{ + return luo_test(argc, argv, STATE_SESSION_NAME, + run_stage_1, run_stage_2); +} -- 2.53.0 From pasha.tatashin at soleen.com Tue Jun 2 20:29:04 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:29:04 +0000 Subject: [PATCH v6 13/13] selftests/liveupdate: Add stress-files kexec test In-Reply-To: <20260603032905.344462-1-pasha.tatashin@soleen.com> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> Message-ID: <20260603032905.344462-14-pasha.tatashin@soleen.com> Add a new luo_stress_files kexec test that verifies preserving and retrieving 500 files across a kexec reboot. 
Reviewed-by: Pratyush Yadav (Google) Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Pasha Tatashin --- tools/testing/selftests/liveupdate/Makefile | 1 + .../selftests/liveupdate/luo_stress_files.c | 97 +++++++++++++++++++ 2 files changed, 98 insertions(+) create mode 100644 tools/testing/selftests/liveupdate/luo_stress_files.c diff --git a/tools/testing/selftests/liveupdate/Makefile b/tools/testing/selftests/liveupdate/Makefile index ed7534468386..30689d22cb02 100644 --- a/tools/testing/selftests/liveupdate/Makefile +++ b/tools/testing/selftests/liveupdate/Makefile @@ -7,6 +7,7 @@ TEST_GEN_PROGS += liveupdate TEST_GEN_PROGS_EXTENDED += luo_kexec_simple TEST_GEN_PROGS_EXTENDED += luo_multi_session TEST_GEN_PROGS_EXTENDED += luo_stress_sessions +TEST_GEN_PROGS_EXTENDED += luo_stress_files TEST_FILES += do_kexec.sh diff --git a/tools/testing/selftests/liveupdate/luo_stress_files.c b/tools/testing/selftests/liveupdate/luo_stress_files.c new file mode 100644 index 000000000000..0cdf9cd4bac7 --- /dev/null +++ b/tools/testing/selftests/liveupdate/luo_stress_files.c @@ -0,0 +1,97 @@ +// SPDX-License-Identifier: GPL-2.0-only + +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + * + * Validate that LUO can handle a large number of files per session across + * a kexec reboot. + */ + +#include +#include +#include "luo_test_utils.h" + +#define NUM_FILES 500 +#define STATE_SESSION_NAME "kexec_many_files_state" +#define STATE_MEMFD_TOKEN 9999 +#define TEST_SESSION_NAME "many_files_session" + +/* Stage 1: Executed before the kexec reboot. 
*/ +static void run_stage_1(int luo_fd) +{ + int session_fd, i; + + ksft_print_msg("[STAGE 1] Creating state file for next stage (2)...\n"); + create_state_file(luo_fd, STATE_SESSION_NAME, STATE_MEMFD_TOKEN, 2); + + ksft_print_msg("[STAGE 1] Creating test session '%s'...\n", TEST_SESSION_NAME); + session_fd = luo_create_session(luo_fd, TEST_SESSION_NAME); + if (session_fd < 0) + fail_exit("luo_create_session"); + + ksft_print_msg("[STAGE 1] Preserving %d files...\n", NUM_FILES); + for (i = 0; i < NUM_FILES; i++) { + char data[64]; + + snprintf(data, sizeof(data), "file-data-%d", i); + if (create_and_preserve_memfd(session_fd, i, data) < 0) + fail_exit("create_and_preserve_memfd for index %d", i); + } + + ksft_print_msg("[STAGE 1] Successfully preserved %d files.\n", NUM_FILES); + + close(luo_fd); + daemonize_and_wait(); +} + +/* Stage 2: Executed after the kexec reboot. */ +static void run_stage_2(int luo_fd, int state_session_fd) +{ + int session_fd; + int i, stage; + + ksft_print_msg("[STAGE 2] Starting post-kexec verification...\n"); + + restore_and_read_stage(state_session_fd, STATE_MEMFD_TOKEN, &stage); + if (stage != 2) { + fail_exit("Expected stage 2, but state file contains %d", + stage); + } + + ksft_print_msg("[STAGE 2] Retrieving test session '%s'...\n", TEST_SESSION_NAME); + session_fd = luo_retrieve_session(luo_fd, TEST_SESSION_NAME); + if (session_fd < 0) + fail_exit("luo_retrieve_session"); + + ksft_print_msg("[STAGE 2] Verifying %d files...\n", NUM_FILES); + for (i = 0; i < NUM_FILES; i++) { + char data[64]; + int fd; + + snprintf(data, sizeof(data), "file-data-%d", i); + fd = restore_and_verify_memfd(session_fd, i, data); + if (fd < 0) + fail_exit("restore_and_verify_memfd for index %d", i); + close(fd); + } + + ksft_print_msg("[STAGE 2] Finishing test session...\n"); + if (luo_session_finish(session_fd) < 0) + fail_exit("luo_session_finish for test session"); + close(session_fd); + + ksft_print_msg("[STAGE 2] Finalizing state session...\n"); + if 
(luo_session_finish(state_session_fd) < 0) + fail_exit("luo_session_finish for state session"); + close(state_session_fd); + + ksft_print_msg("\n--- MANY-FILES KEXEC TEST PASSED (%d files) ---\n", + NUM_FILES); +} + +int main(int argc, char *argv[]) +{ + return luo_test(argc, argv, STATE_SESSION_NAME, + run_stage_1, run_stage_2); +} -- 2.53.0 From pasha.tatashin at soleen.com Tue Jun 2 20:36:17 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:36:17 +0000 Subject: [PATCH 2/2] liveupdate: Remember FLB retrieve() status In-Reply-To: <20260528174140.1921129-3-dmatlack@google.com> References: <20260528174140.1921129-1-dmatlack@google.com> <20260528174140.1921129-3-dmatlack@google.com> Message-ID: On 05-28 17:41, David Matlack wrote: > LUO keeps track of successful retrieve attempts on an FLB. It does so > to avoid multiple retrievals of the same FLB. Multiple retrievals cause > problems because once the FLB is retrieved, the serialized data > structures are likely freed and the FLB is likely in a very different > state from what the code expects. > > All this works well when retrieve succeeds. When it fails, > luo_flb_retrieve_one() returns the error immediately, without ever > storing anywhere that a retrieve was attempted or what its error code > was. If the user attempts to retrieve another file registered with the > same FLB, LUO will attempt to call the FLB's retrieve() callback again. > > The retry is problematic for much of the same reasons listed above. The > FLB is likely in a very different state than what the retrieve logic > normally expects (e.g. some KHO pages may have already been restored and > freed). > > There is no sane way of attempting the retrieve again. Remember the > error retrieve returned and directly return it on a retry. > > This is done by changing the retrieved bool to a retrieve_status > integer. 
A value of 0 means retrieve was never attempted, a positive > value means it succeeded, and a negative value means it failed and the > error code is the value. > > This is similar to commit f85b1c6af5bc ("liveupdate: luo_file: remember > retrieve() status") which did the same for LUO files. > > Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state") > Assisted-by: Gemini:gemini-3-pro-preview > Signed-off-by: David Matlack Reviewed-by: Pasha Tatashin > --- > include/linux/liveupdate.h | 6 ++++-- > kernel/liveupdate/luo_flb.c | 10 +++++++--- > 2 files changed, 11 insertions(+), 5 deletions(-) > > diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h > index c344bf987b63..63ea5417de84 100644 > --- a/include/linux/liveupdate.h > +++ b/include/linux/liveupdate.h > @@ -173,7 +173,9 @@ struct liveupdate_flb_ops { > * @lock: A mutex that protects all fields within this structure, providing > * the synchronization service for the FLB's ops. > * @finished: True once the FLB's finish() callback has run. > - * @retrieved: True once the FLB's retrieve() callback has run. > + * @retrieve_status: Status code indicating whether retrieve() has been > + * attempted. 0 means not attempted, 1 means successful, > + * and negative value means it failed with that error code. 
> */ > struct luo_flb_private_state { > refcount_t count; > @@ -181,7 +183,7 @@ struct luo_flb_private_state { > void *obj; > struct mutex lock; > bool finished; > - bool retrieved; > + int retrieve_status; > }; > > /* > diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c > index 7ddef552ff6b..f8852f7e62e5 100644 > --- a/kernel/liveupdate/luo_flb.c > +++ b/kernel/liveupdate/luo_flb.c > @@ -170,7 +170,10 @@ static int luo_flb_retrieve_one(struct liveupdate_flb *flb) > if (private->incoming.finished) > return -ENODATA; > > - if (private->incoming.retrieved) > + if (private->incoming.retrieve_status < 0) > + return private->incoming.retrieve_status; > + > + if (private->incoming.retrieve_status > 0) > return 0; > > if (!fh->active) > @@ -196,12 +199,13 @@ static int luo_flb_retrieve_one(struct liveupdate_flb *flb) > > err = flb->ops->retrieve(&args); > if (err) { > + private->incoming.retrieve_status = err; > module_put(flb->ops->owner); > return err; > } > > private->incoming.obj = args.obj; > - private->incoming.retrieved = true; > + private->incoming.retrieve_status = 1; > > return 0; > } > @@ -215,7 +219,7 @@ void liveupdate_flb_put_incoming(struct liveupdate_flb *flb) > if (!refcount_dec_and_test(&private->incoming.count)) > return; > > - if (!private->incoming.retrieved) { > + if (private->incoming.retrieve_status <= 0) { > int err = luo_flb_retrieve_one(flb); > > if (WARN_ON(err)) > -- > 2.54.0.823.g6e5bcc1fc9-goog > From pasha.tatashin at soleen.com Tue Jun 2 20:36:37 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 03:36:37 +0000 Subject: [PATCH 1/2] liveupdate: Reference count outgoing FLB data In-Reply-To: <20260528174140.1921129-2-dmatlack@google.com> References: <20260528174140.1921129-1-dmatlack@google.com> <20260528174140.1921129-2-dmatlack@google.com> Message-ID: On 05-28 17:41, David Matlack wrote: > Increment the outgoing FLB refcount in liveupdate_flb_get_outgoing() so > that the FLB structure cannot 
be freed while the caller is actively > using it. Add an additional liveupdate_flb_put_outgoing() function so > the caller can explicitly indicate when it is done using the outgoing > FLB. > > During a Live Update, the kernel may need to fetch the outgoing FLB > outside of the scope of a file handler's preserve() and unpreserve() > callbacks. In that situation there is no way for the caller to protect > itself against the outgoing FLB from being freed while it is using it. > Incrementing the reference count in liveupdate_flb_get_outgoing() > ensures it cannot be freed. > > This change also aligns the outgoing FLB lifecycle management with the > incoming FLB, since the latter uses the same get/put semantics. > > Fixes: cab056f2aae7 ("liveupdate: luo_flb: introduce File-Lifecycle-Bound global state") > Assisted-by: Gemini:gemini-3-pro-preview > Signed-off-by: David Matlack Reviewed-by: Pasha Tatashin > --- > include/linux/liveupdate.h | 5 +++++ > kernel/liveupdate/luo_flb.c | 10 +++++++--- > 2 files changed, 12 insertions(+), 3 deletions(-) > > diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h > index 88722e5caf02..c344bf987b63 100644 > --- a/include/linux/liveupdate.h > +++ b/include/linux/liveupdate.h > @@ -243,6 +243,7 @@ int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp); > void liveupdate_flb_put_incoming(struct liveupdate_flb *flb); > > int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp); > +void liveupdate_flb_put_outgoing(struct liveupdate_flb *flb); > > #else /* CONFIG_LIVEUPDATE */ > > @@ -292,5 +293,9 @@ static inline int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, > return -EOPNOTSUPP; > } > > +static inline void liveupdate_flb_put_outgoing(struct liveupdate_flb *flb) > +{ > +} > + > #endif /* CONFIG_LIVEUPDATE */ > #endif /* _LINUX_LIVEUPDATE_H */ > diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c > index 8f5c5dd01cd0..7ddef552ff6b 100644 > --- 
a/kernel/liveupdate/luo_flb.c > +++ b/kernel/liveupdate/luo_flb.c > @@ -135,7 +135,7 @@ static int luo_flb_file_preserve_one(struct liveupdate_flb *flb) > return 0; > } > > -static void luo_flb_file_unpreserve_one(struct liveupdate_flb *flb) > +void liveupdate_flb_put_outgoing(struct liveupdate_flb *flb) > { > struct luo_flb_private *private = luo_flb_get_private(flb); > > @@ -266,7 +266,7 @@ int luo_flb_file_preserve(struct liveupdate_file_handler *fh) > > exit_err: > list_for_each_entry_continue_reverse(iter, flb_list, list) > - luo_flb_file_unpreserve_one(iter->flb); > + liveupdate_flb_put_outgoing(iter->flb); > up_read(&luo_register_rwlock); > > return err; > @@ -291,7 +291,7 @@ void luo_flb_file_unpreserve(struct liveupdate_file_handler *fh) > > guard(rwsem_read)(&luo_register_rwlock); > list_for_each_entry_reverse(iter, flb_list, list) > - luo_flb_file_unpreserve_one(iter->flb); > + liveupdate_flb_put_outgoing(iter->flb); > } > > /** > @@ -546,6 +546,10 @@ int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp) > return -EOPNOTSUPP; > > guard(mutex)(&private->outgoing.lock); > + if (!private->outgoing.obj) > + return -ENOENT; > + > + refcount_inc(&private->outgoing.count); > *objp = private->outgoing.obj; > > return 0; > -- > 2.54.0.823.g6e5bcc1fc9-goog > From pasha.tatashin at soleen.com Tue Jun 2 21:02:38 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 04:02:38 +0000 Subject: [PATCH v2] kexec_file: skip checksum verification when safe In-Reply-To: <20260602123311.1841746-1-mclapinski@google.com> References: <20260602123311.1841746-1-mclapinski@google.com> Message-ID: <178045911488.348018.9577276078419028102.b4-ty@soleen.com> On Tue, 02 Jun 2026 14:33:11 +0200, Michal Clapinski wrote: > Checksum verification is needed > 1. for crash kernels. In a crash, we can't be sure the kernel is > intact. > 2. if we're worried about relocating the kernel into a region used by > some DMA that wasn't properly cancelled. 
> > If KHO is enabled then relocations will happen to KHO scratch, which > is free from DMA regions. > If we used CMA to allocate segments then relocations are not going to > happen at all. > Therefore, we can safely disable checksum verification in both of those > cases. > > [...] Applied, thanks! [1/1] kexec_file: skip checksum verification when safe commit: 459a08d029bf6f026b25063708a63bdaa8ccc0b1 Best regards, -- Pasha Tatashin From chenwandun1 at gmail.com Tue Jun 2 23:44:13 2026 From: chenwandun1 at gmail.com (Wandun) Date: Wed, 3 Jun 2026 14:44:13 +0800 Subject: [PATCH v3 03/11] of: reserved_mem: avoid post-init UAF when alloc_reserved_mem_array() fails In-Reply-To: <20260602162450.GA442759-robh@kernel.org> References: <20260527032917.3385849-1-chenwandun1@gmail.com> <20260527032917.3385849-4-chenwandun1@gmail.com> <20260602162450.GA442759-robh@kernel.org> Message-ID: <79932afc-2e91-4a54-aff9-f550be784c36@gmail.com> On 6/3/26 00:24, Rob Herring wrote: > On Wed, May 27, 2026 at 11:29:09AM +0800, Wandun Chen wrote: >> From: Wandun Chen >> >> The global pointer 'reserved_mem' continues to reference the >> reserved_mem_array which lives in __initdata if >> alloc_reserved_mem_array() fails. of_reserved_mem_lookup() is >> exported for post-init use, that would dereference freed memory >> and trigger a use-after-free. >> >> So reset reserved_mem_count to 0 when alloc_reserved_mem_array() >> fails. >> >> Fixes: 00c9a452a235 ("of: reserved_mem: Add code to dynamically allocate reserved_mem array") > Fixes should come first in a series. Understood, will do in future submissions. 
> >> Signed-off-by: Wandun Chen >> --- >> drivers/of/of_reserved_mem.c | 20 ++++++++++++++------ >> 1 file changed, 14 insertions(+), 6 deletions(-) >> >> diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c >> index 313cbc57aa45..6d479381ff1f 100644 >> --- a/drivers/of/of_reserved_mem.c >> +++ b/drivers/of/of_reserved_mem.c >> @@ -69,29 +69,31 @@ static int __init early_init_dt_alloc_reserved_memory_arch(phys_addr_t size, >> * the initial static array is copied over to this new array and >> * the new array is used from this point on. >> */ >> -static void __init alloc_reserved_mem_array(void) >> +static bool __init alloc_reserved_mem_array(void) >> { >> struct reserved_mem *new_array; >> size_t alloc_size, copy_size, memset_size; >> >> + if (!total_reserved_mem_cnt) >> + return true; >> + >> alloc_size = array_size(total_reserved_mem_cnt, sizeof(*new_array)); >> if (alloc_size == SIZE_MAX) { >> pr_err("Failed to allocate memory for reserved_mem array with err: %d", -EOVERFLOW); >> - return; >> + goto fail; >> } >> >> new_array = memblock_alloc(alloc_size, SMP_CACHE_BYTES); >> if (!new_array) { >> pr_err("Failed to allocate memory for reserved_mem array with err: %d", -ENOMEM); >> - return; >> + goto fail; >> } >> >> copy_size = array_size(reserved_mem_count, sizeof(*new_array)); >> if (copy_size == SIZE_MAX) { >> memblock_free(new_array, alloc_size); >> - total_reserved_mem_cnt = MAX_RESERVED_REGIONS; >> pr_err("Failed to allocate memory for reserved_mem array with err: %d", -EOVERFLOW); > These prints could be moved to 'fail'. Perhaps instead of just printing > an error value, you can return the error value instead of boolean. Will do, consolidating pr_err() under 'fail' and changing the return type to int. > > If you respin just this patch, I can pick it up for 7.2. 
Before I respin, I'd like to flag a dependency: patches 05 and 07 in this
series build on the signature change introduced by this patch (the
void -> bool return type change of alloc_reserved_mem_array()).

Could you let me know which of the following you'd prefer:

a) Take patch 03 alone via your tree as you suggested; after it lands, I'll
   respin the remaining patches of this series.
b) Keep patch 03 in the v4 respin of the full series, reordered to the front
   per your earlier comment.

Best regards,
Wandun

>
> Rob

From rppt at kernel.org Tue Jun 2 23:49:31 2026
From: rppt at kernel.org (Mike Rapoport)
Date: Wed, 03 Jun 2026 09:49:31 +0300
Subject: [PATCH v6 07/13] kho: add support for linked-block serialization
In-Reply-To: <20260603032905.344462-8-pasha.tatashin@soleen.com>
References: <20260603032905.344462-1-pasha.tatashin@soleen.com> <20260603032905.344462-8-pasha.tatashin@soleen.com>
Message-ID: <178046937151.468621.13398573538792303093.b4-review@b4>

On Wed, 03 Jun 2026 03:28:58 +0000, Pasha Tatashin wrote:
> diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h
> new file mode 100644
> index 000000000000..8641c20b379b
> --- /dev/null
> +++ b/include/linux/kho/abi/block.h
> @@ -0,0 +1,56 @@
> [ ... skip 25 lines ... ]
> +#define _LINUX_KHO_ABI_BLOCK_H
> +
> +#include
> +#include
> +
> +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1"

It's never used by the block set, and after looking at the following patches I
found that it's appended to the LUO compatible string.

While this works for LUO, I think it should be kho_block_set_restore()'s
responsibility to verify the compatibility.

>
> diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c
> new file mode 100644
> index 000000000000..4f147c308e6b
> --- /dev/null
> +++ b/kernel/liveupdate/kho_block.c
> @@ -0,0 +1,411 @@
> [ ... skip 121 lines ... ]
> +/**
> + * kho_block_set_grow - Expand the block set to accommodate the target count.
> + * @bs: The block set.
> + * @count: The target number of valid entries to accommodate. > + * > + * Acts as a runtime notifier when new resources (such as files or sessions) Not sure I understand what "runtime notifier" means in this context. > [ ... skip 11 lines ... ] > + > + while (count > bs->nblocks * bs->count_per_block) { > + int err = kho_block_set_grow_one(bs); > + > + if (err) > + return err; This leaks memory if more than one block is added. > [ ... skip 31 lines ... ] > + * unregistered, allowing the block set to release and unallocate redundant > + * preserved memory blocks. Checks if the last block in the set can be removed > + * because the remaining entry count is fully accommodated by the preceding blocks. > + * > + * Note: It is the caller's responsibility to ensure that entries are removed > + * in LIFO (last-in, first-out) order (the reverse order of their insertion). I think "in LIFO order" is sufficient :) > [ ... skip 173 lines ... ] > + it->i = 0; > + } > + > + entry = kho_block_entry(it, it->i++); > + it->block->ser->count = it->i; > + return entry; This looks way better than the previous version :) Thanks! -- Sincerely yours, Mike. From rppt at kernel.org Tue Jun 2 23:50:24 2026 From: rppt at kernel.org (Mike Rapoport) Date: Wed, 03 Jun 2026 09:50:24 +0300 Subject: [PATCH v6 04/13] liveupdate: register luo_ser as KHO subtree In-Reply-To: <20260603032905.344462-5-pasha.tatashin@soleen.com> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> <20260603032905.344462-5-pasha.tatashin@soleen.com> Message-ID: <178046942429.468621.9591914636403075487.b4-review@b4> # Add your code comments below. There is no need to trim or delete # any existing content -- just insert your comments under the relevant # lines of code. Lines starting with "> " are quoted diff context and # lines starting with "| " are comments from other reviewers. # The final email will be reformatted automatically to include only # the sections that have your comments. 
# > Entirely remove the LUO FDT wrapper since the FDT only carries the > compatible string and the pointer to the centralized struct luo_ser. > Instead, register the struct luo_ser via the KHO raw subtree > API, placing the compatibility string inside the structure itself. > > Signed-off-by: Pasha Tatashin > > diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h > index 1b2f865a771a..9a4fe491812b 100644 > --- a/include/linux/kho/abi/luo.h > +++ b/include/linux/kho/abi/luo.h > @@ -10,11 +10,11 @@ > * > * Live Update Orchestrator uses the stable Application Binary Interface > * defined below to pass state from a pre-update kernel to a post-update > - * kernel. The ABI is built upon the Kexec HandOver framework and uses a > - * Flattened Device Tree to describe the preserved data. > + * kernel. The ABI is built upon the Kexec HandOver framework and registers > + * the central `struct luo_ser` via the KHO raw subtree API. > * > - * This interface is a contract. Any modification to the FDT structure, node > - * properties, compatible strings, or the layout of the `__packed` serialization > + * This interface is a contract. Any modification to the structure fields, > + * compatible strings, or the layout of the `__packed` serialization > * structures defined here constitutes a breaking change. Such changes require > * incrementing the version number in the relevant `_COMPATIBLE` string to > * prevent a new kernel from misinterpreting data from an old kernel. > @@ -23,31 +23,15 @@ > * however, backward/forward compatibility is only guaranteed for kernels > * supporting the same ABI version. > * > - * FDT Structure Overview: > + * KHO Structure Overview: > * The entire LUO state is encapsulated within a single KHO entry named "LUO". > - * This entry contains an FDT with the following layout: > - * > - * .. 
code-block:: none > - * > - * / { > - * compatible = "luo-v2"; > - * luo-abi-header = ; > - * }; > - * > - * Main LUO Node (/): > - * > - * - compatible: "luo-v2" > - * Identifies the overall LUO ABI version. > - * - luo-abi-header: u64 > - * The physical address of `struct luo_ser`. > + * This entry contains the `struct luo_ser` structure. > * > * Serialization Structures: > - * The FDT properties point to memory regions containing arrays of simple, > - * `__packed` structures. These structures contain the actual preserved state. > - * > * - struct luo_ser: > * The central ABI structure that contains the overall state of the LUO. > - * It includes the liveupdate-number and pointers to sessions and FLBs. > + * It includes the compatibility string, the liveupdate-number, and pointers > + * to sessions and FLBs. > * > * - struct luo_session_header_ser: > * Header for the session array. Contains the total page count of the > @@ -78,26 +62,27 @@ > #ifndef _LINUX_KHO_ABI_LUO_H > #define _LINUX_KHO_ABI_LUO_H > > +#include > #include > > /* > - * The LUO FDT hooks all LUO state for sessions, fds, etc. > + * The LUO state is registered under this KHO entry name. > */ > -#define LUO_FDT_SIZE PAGE_SIZE > -#define LUO_FDT_KHO_ENTRY_NAME "LUO" > -#define LUO_FDT_COMPATIBLE "luo-v2" > -#define LUO_FDT_ABI_HEADER "luo-abi-header" > +#define LUO_KHO_ENTRY_NAME "LUO" > +#define LUO_ABI_COMPATIBLE "luo-v3" > +#define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) > > /** > * struct luo_ser - Centralized LUO ABI header. > + * @compatible: Compatibility string identifying the LUO ABI version. > * @liveupdate_num: A counter tracking the number of successful live updates. > * @sessions_pa: Physical address of the first session block header. > * @flbs_pa: Physical address of the FLB header. > * > - * This structure is the root of all preserved LUO state. It is pointed to by > - * the "luo-abi-header" property in the LUO FDT. 
> + * This structure is the root of all preserved LUO state. > */ > struct luo_ser { > + char compatible[LUO_ABI_COMPAT_LEN]; > u64 liveupdate_num; > u64 sessions_pa; > u64 flbs_pa; > @@ -111,7 +96,7 @@ struct luo_ser { > * @data: Private data > * @token: User provided token for this file > * > - * If this structure is modified, LUO_SESSION_COMPATIBLE must be updated. > + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. > */ > struct luo_file_ser { > char compatible[LIVEUPDATE_HNDL_COMPAT_LENGTH]; > @@ -142,7 +127,7 @@ struct luo_file_set_ser { > * physical memory preserved across the kexec. It provides the necessary > * metadata to interpret the array of session entries that follow. > * > - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. > + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. > */ > struct luo_session_header_ser { > u64 count; > @@ -159,7 +144,7 @@ struct luo_session_header_ser { > * session) is created and passed to the new kernel, allowing it to reconstruct > * the session context. > * > - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. > + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. > */ > struct luo_session_ser { > char name[LIVEUPDATE_SESSION_NAME_LENGTH]; > @@ -180,7 +165,7 @@ struct luo_session_ser { > * This structure is located at the physical address specified by the > * flbs_pa in luo_ser. > * > - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. > + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. > */ > struct luo_flb_header_ser { > u64 pgcnt; > @@ -202,7 +187,7 @@ struct luo_flb_header_ser { > * passed to the new kernel. Each entry allows the LUO core to restore one > * global, shared object. > * > - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. > + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. 
> */ > struct luo_flb_ser { > char name[LIVEUPDATE_FLB_COMPAT_LENGTH]; > diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c > index fbc18c5f4230..e261a03a1b47 100644 > --- a/kernel/liveupdate/luo_core.c > +++ b/kernel/liveupdate/luo_core.c > @@ -50,7 +50,6 @@ > #include > #include > #include > -#include > #include > #include > #include > @@ -63,8 +62,7 @@ > > static struct { > bool enabled; > - void *fdt_out; > - void *fdt_in; > + struct luo_ser *luo_ser_out; > u64 liveupdate_num; > } luo_global; > > @@ -81,11 +79,10 @@ early_param("liveupdate", early_liveupdate_param); > > static int __init luo_early_startup(void) > { > + phys_addr_t luo_ser_phys; > struct luo_ser *luo_ser; > - int err, header_size; > - phys_addr_t fdt_phys; > - const void *ptr; > - u64 luo_ser_pa; > + size_t len; > + int err; > > if (!kho_is_enabled()) { > if (liveupdate_enabled()) > @@ -94,40 +91,29 @@ static int __init luo_early_startup(void) > return 0; > } > > - /* Retrieve LUO subtree, and verify its format. */ > - err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys, NULL); > + /* Retrieve LUO state from KHO. 
*/ > + err = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &luo_ser_phys, &len); > if (err) { > if (err != -ENOENT) { > - pr_err("failed to retrieve FDT '%s' from KHO: %pe\n", > - LUO_FDT_KHO_ENTRY_NAME, ERR_PTR(err)); > + pr_err("failed to retrieve LUO state '%s' from KHO: %pe\n", > + LUO_KHO_ENTRY_NAME, ERR_PTR(err)); > return err; > } > > return 0; > } > > - luo_global.fdt_in = phys_to_virt(fdt_phys); > - err = fdt_node_check_compatible(luo_global.fdt_in, 0, > - LUO_FDT_COMPATIBLE); > - if (err) { > - pr_err("FDT '%s' is incompatible with '%s' [%d]\n", > - LUO_FDT_KHO_ENTRY_NAME, LUO_FDT_COMPATIBLE, err); > - > + if (len < sizeof(*luo_ser)) { > + pr_err("LUO state is too small (%zu < %zu)\n", len, sizeof(*luo_ser)); > return -EINVAL; > } > > - header_size = 0; > - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); > - if (!ptr || header_size != sizeof(u64)) { > - pr_err("Unable to get ABI header '%s' [%d]\n", > - LUO_FDT_ABI_HEADER, header_size); > - > + luo_ser = phys_to_virt(luo_ser_phys); > + if (strncmp(luo_ser->compatible, LUO_ABI_COMPATIBLE, LUO_ABI_COMPAT_LEN)) { > + pr_err("LUO state is incompatible with '%s'\n", LUO_ABI_COMPATIBLE); > return -EINVAL; > } > > - luo_ser_pa = get_unaligned((u64 *)ptr); > - luo_ser = phys_to_virt(luo_ser_pa); > - > luo_global.liveupdate_num = luo_ser->liveupdate_num; > pr_info("Retrieved live update data, liveupdate number: %lld\n", > luo_global.liveupdate_num); > @@ -160,37 +146,20 @@ static int __init liveupdate_early_init(void) > } > early_initcall(liveupdate_early_init); > > -/* Called during boot to create outgoing LUO fdt tree */ > -static int __init luo_fdt_setup(void) > +/* Called during boot to create outgoing LUO state */ > +static int __init luo_state_setup(void) > { > struct luo_ser *luo_ser; > - u64 luo_ser_pa; > - void *fdt_out; > int err; > > - fdt_out = kho_alloc_preserve(LUO_FDT_SIZE); > - if (IS_ERR(fdt_out)) { > - pr_err("failed to allocate/preserve FDT memory\n"); > - return 
PTR_ERR(fdt_out); > - } > - > luo_ser = kho_alloc_preserve(sizeof(*luo_ser)); > if (IS_ERR(luo_ser)) { > - err = PTR_ERR(luo_ser); > - goto exit_free_fdt; > + pr_err("failed to allocate/preserve LUO state memory\n"); > + return PTR_ERR(luo_ser); > } > - luo_ser_pa = virt_to_phys(luo_ser); > - > - err = fdt_create(fdt_out, LUO_FDT_SIZE); > - err |= fdt_finish_reservemap(fdt_out); > - err |= fdt_begin_node(fdt_out, ""); > - err |= fdt_property_string(fdt_out, "compatible", LUO_FDT_COMPATIBLE); > - err |= fdt_property(fdt_out, LUO_FDT_ABI_HEADER, &luo_ser_pa, > - sizeof(luo_ser_pa)); > - err |= fdt_end_node(fdt_out); > - err |= fdt_finish(fdt_out); > - if (err) > - goto exit_free_luo_ser; > + > + strscpy(luo_ser->compatible, LUO_ABI_COMPATIBLE, sizeof(luo_ser->compatible)); > + luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; > > err = luo_session_setup_outgoing(&luo_ser->sessions_pa); > if (err) > @@ -200,21 +169,17 @@ static int __init luo_fdt_setup(void) > if (err) > goto exit_free_luo_ser; > > - luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; > - > - err = kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out, > - fdt_totalsize(fdt_out)); > + err = kho_add_subtree(LUO_KHO_ENTRY_NAME, luo_ser, sizeof(*luo_ser)); > if (err) > goto exit_free_luo_ser; > - luo_global.fdt_out = fdt_out; > + > + luo_global.luo_ser_out = luo_ser; > > return 0; > > exit_free_luo_ser: > kho_unpreserve_free(luo_ser); > -exit_free_fdt: > - kho_unpreserve_free(fdt_out); > - pr_err("failed to prepare LUO FDT: %d\n", err); > + pr_err("failed to prepare LUO state: %d\n", err); > > return err; > } > @@ -230,7 +195,7 @@ static int __init luo_late_startup(void) > if (!liveupdate_enabled()) > return 0; > > - err = luo_fdt_setup(); > + err = luo_state_setup(); > if (err) > luo_global.enabled = false; > -- Sincerely yours, Mike. 
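
The incoming-side checks in the patch above (a length check on the retrieved
blob, then a strncmp() against the compatible string embedded in the
structure) can be modeled in user space. What follows is only a sketch under
stated assumptions, not code from the series: luo_ser_init() and
luo_ser_validate() are hypothetical names, ALIGN_UP stands in for the kernel's
ALIGN(), memcpy() replaces strscpy(), and a plain buffer stands in for
whatever kho_retrieve_subtree() would hand back.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same values as the patch; ALIGN_UP is a user-space stand-in for ALIGN(). */
#define LUO_ABI_COMPATIBLE "luo-v3"
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))
#define LUO_ABI_COMPAT_LEN ALIGN_UP(sizeof(LUO_ABI_COMPATIBLE), 8)

/* User-space mirror of the layout; the kernel uses u64 fields. */
struct luo_ser {
	char compatible[LUO_ABI_COMPAT_LEN];
	uint64_t liveupdate_num;
	uint64_t sessions_pa;
	uint64_t flbs_pa;
};

/* The 8-byte-aligned string means the u64 fields need no padding. */
static_assert(sizeof(struct luo_ser) ==
	      LUO_ABI_COMPAT_LEN + 3 * sizeof(uint64_t),
	      "unexpected padding in struct luo_ser");

/* Hypothetical outgoing side: stamp the ABI string, bump the counter. */
static void luo_ser_init(struct luo_ser *ser, uint64_t prev_num)
{
	memset(ser, 0, sizeof(*ser));
	/* The literal (with its NUL) fits in the zeroed buffer. */
	memcpy(ser->compatible, LUO_ABI_COMPATIBLE, sizeof(LUO_ABI_COMPATIBLE));
	ser->liveupdate_num = prev_num + 1;
}

/* Hypothetical incoming side: size check first, then the ABI string. */
static int luo_ser_validate(const void *blob, size_t len, uint64_t *num)
{
	const struct luo_ser *ser = blob;

	if (len < sizeof(*ser))
		return -EINVAL;	/* too small to hold the header */
	if (strncmp(ser->compatible, LUO_ABI_COMPATIBLE, LUO_ABI_COMPAT_LEN))
		return -EINVAL;	/* written by a different ABI version */

	*num = ser->liveupdate_num;
	return 0;
}
```

In this model a blob written by luo_ser_init() validates and yields the
incremented counter, while a truncated blob or one stamped "luo-v2" is
rejected with -EINVAL before any other field is read.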
From rppt at kernel.org Tue Jun 2 23:52:02 2026
From: rppt at kernel.org (Mike Rapoport)
Date: Wed, 03 Jun 2026 09:52:02 +0300
Subject: [PATCH v6 02/13] liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd
In-Reply-To: <20260603032905.344462-3-pasha.tatashin@soleen.com>
References: <20260603032905.344462-1-pasha.tatashin@soleen.com> <20260603032905.344462-3-pasha.tatashin@soleen.com>
Message-ID: <178046952227.468621.16091000479096712016.b4-review@b4>

On Wed, 03 Jun 2026 03:28:53 +0000, Pasha Tatashin wrote:
> Refactor luo_session_retrieve_fd() to avoid mixing automated
> cleanup-style guards with goto-based resource release, which is not
> recommended under the Linux kernel coding style.

Acked-by: Mike Rapoport (Microsoft)

--
Sincerely yours,
Mike.

From rppt at kernel.org Tue Jun 2 23:52:02 2026
From: rppt at kernel.org (Mike Rapoport)
Date: Wed, 03 Jun 2026 09:52:02 +0300
Subject: [PATCH v6 03/13] liveupdate: centralize state management into struct luo_ser
In-Reply-To: <20260603032905.344462-4-pasha.tatashin@soleen.com>
References: <20260603032905.344462-1-pasha.tatashin@soleen.com> <20260603032905.344462-4-pasha.tatashin@soleen.com>
Message-ID: <178046952228.468621.17532660334285388424.b4-review@b4>

On Wed, 03 Jun 2026 03:28:54 +0000, Pasha Tatashin wrote:
> Transition the LUO to ABI v2, which centralizes state management into a
> single struct luo_ser header.
>
> Previously, LUO state was spread across multiple FDT properties and
> subnodes. ABI v2 simplifies this by placing all core state, including
> the liveupdate number and physical addresses for sessions and FLB
> headers, into a centralized struct luo_ser.
>
> [...]

Acked-by: Mike Rapoport (Microsoft)

--
Sincerely yours,
Mike.
From rppt at kernel.org Wed Jun 3 02:29:00 2026
From: rppt at kernel.org (Mike Rapoport)
Date: Wed, 3 Jun 2026 12:29:00 +0300
Subject: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
In-Reply-To:
References: <20260528004204.1484584-1-jloeser@linux.microsoft.com>
Message-ID:

On Mon, Jun 01, 2026 at 01:09:41PM -0700, Jork Loeser wrote:
> On Sun, 31 May 2026, Mike Rapoport wrote:
>
> > > Patch 19: Export kexec_in_progress for modules
> >
> > Isn't there another way to differentiate kexec reboot?

There's that "kexec reboot" string passed as the cmd to the reboot
notifier. Maybe we can somehow make it a more well-defined API and use it?

> I could not find one, unfortunately.
>
> > Sincerely yours,
> > Mike.
>
> Best,
> Jork

--
Sincerely yours,
Mike.

From rppt at kernel.org Wed Jun 3 02:32:54 2026
From: rppt at kernel.org (Mike Rapoport)
Date: Wed, 3 Jun 2026 12:32:54 +0300
Subject: [PATCH v6 04/13] liveupdate: register luo_ser as KHO subtree
In-Reply-To: <178046942429.468621.9591914636403075487.b4-review@b4>
References: <20260603032905.344462-1-pasha.tatashin@soleen.com> <20260603032905.344462-5-pasha.tatashin@soleen.com> <178046942429.468621.9591914636403075487.b4-review@b4>
Message-ID:

On Wed, Jun 03, 2026 at 09:50:24AM +0300, Mike Rapoport wrote:
> # Add your code comments below. There is no need to trim or delete
> # any existing content -- just insert your comments under the relevant
> # lines of code. Lines starting with "> " are quoted diff context and
> # lines starting with "| " are comments from other reviewers.
> # The final email will be reformatted automatically to include only
> # the sections that have your comments.
> #

Looks like a b4 review bug or misuse on my side :)

--
Sincerely yours,
Mike.
From pasha.tatashin at soleen.com Wed Jun 3 05:05:04 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 12:05:04 +0000 Subject: [PATCH v6 07/13] kho: add support for linked-block serialization In-Reply-To: <178046937151.468621.13398573538792303093.b4-review@b4> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> <20260603032905.344462-8-pasha.tatashin@soleen.com> <178046937151.468621.13398573538792303093.b4-review@b4> Message-ID: On 06-03 09:49, Mike Rapoport wrote: > On Wed, 03 Jun 2026 03:28:58 +0000, Pasha Tatashin wrote: > > diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h > > new file mode 100644 > > index 000000000000..8641c20b379b > > --- /dev/null > > +++ b/include/linux/kho/abi/block.h > > @@ -0,0 +1,56 @@ > > [ ... skip 25 lines ... ] > > +#define _LINUX_KHO_ABI_BLOCK_H > > + > > +#include > > +#include > > + > > +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" > > It's never used by block set and after looking at the following patches I > found that it's appended to LUO compatible string. > > While this works for LUO, I think it should be kho_block_set_restore() > responsibility to verify the compatibility. It should work for any component that relies on kho_block. My proposal is to use this method for other common KHO data structures (e.g., kho vmalloc, kho radix, future kho xarray). There is no need for them to carry the compatibility string in their metadata, as whoever uses them will include their compatibility string. For now, reviewers will have to make sure that if the ABI header content is changed, the compatibility string is updated. > > diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c > > new file mode 100644 > > index 000000000000..4f147c308e6b > > --- /dev/null > > +++ b/kernel/liveupdate/kho_block.c > > @@ -0,0 +1,411 @@ > > [ ... skip 121 lines ... ] > > +/** > > + * kho_block_set_grow - Expand the block set to accommodate the target count. 
> > + * @bs: The block set. > > + * @count: The target number of valid entries to accommodate. > > + * > > + * Acts as a runtime notifier when new resources (such as files or sessions) > > Not sure I understand what "runtime notifier" means in this context. It came from discussion with Pratyush, but I think we are on the same page what they are, and I will just remove this. > > > [ ... skip 11 lines ... ] > > + > > + while (count > bs->nblocks * bs->count_per_block) { > > + int err = kho_block_set_grow_one(bs); > > + > > + if (err) > > + return err; > > This leaks memory if more than one block is added. > > > [ ... skip 31 lines ... ] > > + * unregistered, allowing the block set to release and unallocate redundant > > + * preserved memory blocks. Checks if the last block in the set can be removed > > + * because the remaining entry count is fully accommodated by the preceding blocks. > > + * > > + * Note: It is the caller's responsibility to ensure that entries are removed > > + * in LIFO (last-in, first-out) order (the reverse order of their insertion). > > I think "in LIFO order" is sufficient :) Oh, I keep removing those :-) > > [ ... skip 173 lines ... ] > > + it->i = 0; > > + } > > + > > + entry = kho_block_entry(it, it->i++); > > + it->block->ser->count = it->i; > > + return entry; > > This looks way better than the previous version :) > Thanks! Thank you. I will send a new version of this patch as a reply to this email to avoid cluttering the mailing list. Pasha > > -- > Sincerely yours, > Mike. > From pasha.tatashin at soleen.com Wed Jun 3 06:06:12 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 13:06:12 +0000 Subject: [PATCH v6.1 07/13] kho: add support for linked-block serialization In-Reply-To: References: Message-ID: <20260603130612.397948-1-pasha.tatashin@soleen.com> Introduce a linked-block serialization mechanism for state handover. 
Previously, LUO used contiguous memory blocks for serializing sessions and files, which imposed limits on the total number of items that could be preserved across a live update. This commit adds the infrastructure for a more flexible, block-based approach where serialized data is stored in a chain of linked blocks. This is a generic KHO serialization block infrastructure that can be used by multiple subsystems. Signed-off-by: Pasha Tatashin --- Documentation/core-api/kho/abi.rst | 5 + Documentation/core-api/kho/index.rst | 11 + MAINTAINERS | 1 + include/linux/kho/abi/block.h | 56 ++++ include/linux/kho_block.h | 106 +++++++ kernel/liveupdate/Makefile | 1 + kernel/liveupdate/kho_block.c | 416 +++++++++++++++++++++++++++ 7 files changed, 596 insertions(+) create mode 100644 include/linux/kho/abi/block.h create mode 100644 include/linux/kho_block.h create mode 100644 kernel/liveupdate/kho_block.c diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst index 799d743105a6..edeb5b311963 100644 --- a/Documentation/core-api/kho/abi.rst +++ b/Documentation/core-api/kho/abi.rst @@ -28,6 +28,11 @@ KHO persistent memory tracker ABI .. kernel-doc:: include/linux/kho/abi/kexec_handover.h :doc: KHO persistent memory tracker +KHO serialization block ABI +=========================== + +.. kernel-doc:: include/linux/kho/abi/block.h + See Also ======== diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst index 0a2dee4f8e7d..320914a42178 100644 --- a/Documentation/core-api/kho/index.rst +++ b/Documentation/core-api/kho/index.rst @@ -83,6 +83,17 @@ Public API .. kernel-doc:: kernel/liveupdate/kexec_handover.c :export: +KHO Serialization Blocks API +============================ + +.. kernel-doc:: kernel/liveupdate/kho_block.c + :doc: KHO Serialization Blocks + +.. kernel-doc:: include/linux/kho_block.h + +.. 
kernel-doc:: kernel/liveupdate/kho_block.c + :internal: + See Also ======== diff --git a/MAINTAINERS b/MAINTAINERS index 9ec290e38b44..920ba7622afa 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14208,6 +14208,7 @@ F: Documentation/admin-guide/mm/kho.rst F: Documentation/core-api/kho/* F: include/linux/kexec_handover.h F: include/linux/kho/ +F: include/linux/kho_block.h F: kernel/liveupdate/kexec_handover* F: lib/test_kho.c F: tools/testing/selftests/kho/ diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h new file mode 100644 index 000000000000..8641c20b379b --- /dev/null +++ b/include/linux/kho/abi/block.h @@ -0,0 +1,56 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +/** + * DOC: KHO Serialization Blocks ABI + * + * Subsystems using the KHO Serialization Blocks framework rely on the stable + * Application Binary Interface defined below to pass serialized state from a + * pre-update kernel to a post-update kernel. + * + * This interface is a contract. Any modification to the structure fields, + * compatible strings, or the layout of the `__packed` serialization + * structures defined here constitutes a breaking change. Such changes require + * incrementing the version number in the `KHO_BLOCK_ABI_COMPATIBLE` string to + * prevent a new kernel from misinterpreting data from an old kernel. + * + * Changes are allowed provided the compatibility version is incremented; + * however, backward/forward compatibility is only guaranteed for kernels + * supporting the same ABI version. + */ + +#ifndef _LINUX_KHO_ABI_BLOCK_H +#define _LINUX_KHO_ABI_BLOCK_H + +#include +#include + +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" + +/** + * KHO_BLOCK_SIZE - The size of each serialization block. + * + * This is defined as PAGE_SIZE. PAGE_SIZE is ABI compliant because live + * update between kernels with different page sizes is not supported by KHO. 
+ */ +#define KHO_BLOCK_SIZE PAGE_SIZE + +/** + * struct kho_block_header_ser - Header for the serialized data block. + * @next: Physical address of the next struct kho_block_header_ser. + * @count: The number of entries that immediately follow this header in the + * memory block. + * + * This structure is located at the beginning of a block of physical memory + * preserved across a kexec. It provides the necessary metadata to interpret + * the array of entries that follow. + */ +struct kho_block_header_ser { + u64 next; + u64 count; +} __packed; + +#endif /* _LINUX_KHO_ABI_BLOCK_H */ diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h new file mode 100644 index 000000000000..93a7cc2be5f5 --- /dev/null +++ b/include/linux/kho_block.h @@ -0,0 +1,106 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +#ifndef _LINUX_KHO_BLOCK_H +#define _LINUX_KHO_BLOCK_H + +#include +#include +#include + +/** + * struct kho_block - Internal representation of a serialization block. + * @list: List head for linking blocks in memory. + * @ser: Pointer to the serialized header in preserved memory. + */ +struct kho_block { + struct list_head list; + struct kho_block_header_ser *ser; +}; + +/** + * struct kho_block_set - A set of blocks containing serialized entries of the same type. + * @blocks: The list of serialization blocks (struct kho_block). + * @nblocks: The number of allocated serialization blocks. + * @head_pa: Physical address of the first block header. + * @entry_size: The size of each entry in the blocks. + * @count_per_block: The maximum number of entries each block can hold. + * @incoming: True if this block set was restored from the previous kernel. + * + * Note: Synchronization and locking are the responsibility of the caller. + * The block set structure itself is not internally synchronized. 
+ */ +struct kho_block_set { + struct list_head blocks; + long nblocks; + u64 head_pa; + size_t entry_size; + u64 count_per_block; + bool incoming; +}; + +/** + * struct kho_block_set_it - Iterator for serializing entries into blocks. + * @bs: The block set being iterated. + * @block: The current block. + * @i: The current entry index within @block. + */ +struct kho_block_set_it { + struct kho_block_set *bs; + struct kho_block *block; + u64 i; +}; + +/** + * KHO_BLOCK_SET_INIT - Initialize a static kho_block_set. + * @_name: Name of the kho_block_set variable. + * @_entry_size: The size of each entry in the block set. + */ +#define KHO_BLOCK_SET_INIT(_name, _entry_size) { \ + .blocks = LIST_HEAD_INIT((_name).blocks), \ + .entry_size = _entry_size, \ + .count_per_block = (KHO_BLOCK_SIZE - \ + sizeof(struct kho_block_header_ser)) / \ + (_entry_size), \ +} + +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size); + +int kho_block_set_grow(struct kho_block_set *bs, u64 count); +void kho_block_set_shrink(struct kho_block_set *bs, u64 count); + +int kho_block_set_restore(struct kho_block_set *bs, u64 head_pa); +void kho_block_set_destroy(struct kho_block_set *bs); +void kho_block_set_clear(struct kho_block_set *bs); + +/** + * kho_block_set_head_pa - Get the physical address of the first block header. + * @bs: The block set. + * + * Return: The physical address of the first block header, or 0 if empty. + */ +static inline u64 kho_block_set_head_pa(struct kho_block_set *bs) +{ + return bs->head_pa; +} + +/** + * kho_block_set_is_empty - Check if the block set has no allocated blocks. + * @bs: The block set. + * + * Return: True if there are no blocks in the set, false otherwise. 
+ */ +static inline bool kho_block_set_is_empty(struct kho_block_set *bs) +{ + return list_empty(&bs->blocks); +} + +void kho_block_set_it_init(struct kho_block_set_it *it, struct kho_block_set *bs); +void *kho_block_set_it_reserve_entry(struct kho_block_set_it *it); +void *kho_block_set_it_read_entry(struct kho_block_set_it *it); +void *kho_block_set_it_prev(struct kho_block_set_it *it); + +#endif /* _LINUX_KHO_BLOCK_H */ diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile index d2f779cbe279..eec9d3ae07eb 100644 --- a/kernel/liveupdate/Makefile +++ b/kernel/liveupdate/Makefile @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 luo-y := \ + kho_block.o \ luo_core.o \ luo_file.o \ luo_flb.o \ diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c new file mode 100644 index 000000000000..0d2a342ef422 --- /dev/null +++ b/kernel/liveupdate/kho_block.c @@ -0,0 +1,416 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +/** + * DOC: KHO Serialization Blocks + * + * KHO provides a mechanism to preserve stateful data across a kexec handover + * by serializing it into memory blocks, and provides the common + * infrastructure for managing these blocks. + * + * Each block consists of a header (struct kho_block_header_ser) followed by an + * array of serialized entries. Multiple blocks are linked together via a + * physical pointer in the header, forming a linked list that can be easily + * traversed in both the current and the next kernel. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include + +/* + * Safeguard limit for the number of serialization blocks. This is used to + * prevent infinite loops and excessive memory allocation in case of memory + * corruption in the preserved state. + * + * With a 4KB page size, 10k blocks is about 40MB. For 32-byte entries + * (e.g. 
4 u64s), each block holds up to 127 entries (accounting for the + * 16-byte header), allowing the block set to hold up to 1.27M entries. + */ +#define KHO_MAX_BLOCKS 10000 + +/** + * kho_block_set_init - Initialize a block set. + * @bs: The block set to initialize. + * @entry_size: The size of each entry in the blocks. + */ +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size) +{ + *bs = (struct kho_block_set)KHO_BLOCK_SET_INIT(*bs, entry_size); + WARN_ON_ONCE(!bs->count_per_block); +} + +/* Serialized entries start immediately after the block header */ +static void *kho_block_entries(struct kho_block *block) +{ + return (void *)(block->ser + 1); +} + +/* Get the address of the serialized entry at the specified index */ +static void *kho_block_entry(struct kho_block_set_it *it, u64 index) +{ + return kho_block_entries(it->block) + (index * it->bs->entry_size); +} + +/* Free serialized data */ +static void kho_block_free_ser(struct kho_block_set *bs, + struct kho_block_header_ser *ser) +{ + if (bs->incoming) + kho_restore_free(ser); + else + kho_unpreserve_free(ser); +} + +static struct kho_block_header_ser *kho_block_alloc_ser(struct kho_block_set *bs) +{ + WARN_ON_ONCE(bs->incoming); + return kho_alloc_preserve(KHO_BLOCK_SIZE); +} + +static int kho_block_add(struct kho_block_set *bs, + struct kho_block_header_ser *ser) +{ + struct kho_block *block, *last; + + if (bs->nblocks >= KHO_MAX_BLOCKS) + return -ENOSPC; + + block = kzalloc_obj(*block); + if (!block) + return -ENOMEM; + + block->ser = ser; + last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); + list_add_tail(&block->list, &bs->blocks); + bs->nblocks++; + + if (last) + last->ser->next = virt_to_phys(ser); + else + bs->head_pa = virt_to_phys(ser); + + return 0; +} + +static int kho_block_set_grow_one(struct kho_block_set *bs) +{ + struct kho_block_header_ser *ser; + int err; + + ser = kho_block_alloc_ser(bs); + if (IS_ERR(ser)) + return PTR_ERR(ser); + + err = 
kho_block_add(bs, ser); + if (err) { + kho_block_free_ser(bs, ser); + return err; + } + + return 0; +} + +static void kho_block_set_shrink_one(struct kho_block_set *bs) +{ + struct kho_block *last, *new_last; + + if (list_empty(&bs->blocks)) + return; + + last = list_last_entry(&bs->blocks, struct kho_block, list); + list_del(&last->list); + bs->nblocks--; + kho_block_free_ser(bs, last->ser); + kfree(last); + + new_last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); + if (new_last) + new_last->ser->next = 0; + else + bs->head_pa = 0; +} + +/** + * kho_block_set_grow - Expand the block set to accommodate the target count. + * @bs: The block set. + * @count: The target number of valid entries to accommodate. + * + * Dynamically preallocates and links preserved memory blocks if the target + * entry count exceeds the current total capacity of the set, ensuring they + * are available during serialization/deserialization. + * + * Context: Caller must hold a lock protecting the block set. + * Return: 0 on success, or a negative errno on failure. + */ +int kho_block_set_grow(struct kho_block_set *bs, u64 count) +{ + long orig_nblocks = bs->nblocks; + int err; + + if (WARN_ON_ONCE(bs->incoming)) + return -EINVAL; + + while (count > bs->nblocks * bs->count_per_block) { + err = kho_block_set_grow_one(bs); + if (err) + goto err_shrink; + } + + return 0; + +err_shrink: + while (bs->nblocks > orig_nblocks) + kho_block_set_shrink_one(bs); + return err; +} + +/** + * kho_block_set_shrink - Shrink the block set to accommodate the target count. + * @bs: The block set. + * @count: The target number of valid entries to accommodate. + * + * Releases and unallocates redundant preserved memory blocks. Checks if the + * last block in the set can be removed because the remaining entry count is + * fully accommodated by the preceding blocks. + * + * Note: It is the caller's responsibility to ensure that entries are removed + * in the reverse order of their insertion. 
Because shrinking destroys the last + * block in the set, removing entries in any other order would corrupt active + * data. + * + * Context: Caller must hold a lock protecting the block set. + */ +void kho_block_set_shrink(struct kho_block_set *bs, u64 count) +{ + while (bs->nblocks > 0 && count <= (bs->nblocks - 1) * bs->count_per_block) + kho_block_set_shrink_one(bs); +} + +/* + * kho_block_set_is_cyclic - Check for cycles in a linked list of blocks. + * Uses Floyd's cycle-finding algorithm to ensure sanity of the incoming list. + * + * Return: true if a cycle or corruption is detected, false otherwise. + */ +static bool kho_block_set_is_cyclic(struct kho_block_set *bs) +{ + struct kho_block_header_ser *fast; + struct kho_block_header_ser *slow; + int count = 0; + + fast = phys_to_virt(bs->head_pa); + slow = fast; + + while (fast) { + if (count++ >= KHO_MAX_BLOCKS) { + pr_err("Block set is corrupted\n"); + return true; + } + + if (!fast->next) + break; + + fast = phys_to_virt(fast->next); + if (!fast->next) + break; + + fast = phys_to_virt(fast->next); + slow = phys_to_virt(slow->next); + + if (slow == fast) { + pr_err("Block set is corrupted\n"); + return true; + } + } + + return false; +} + +/** + * kho_block_set_restore - Restore a block set from a physical address. + * @bs: The block set to restore. + * @head_pa: Physical address of the first block header. + * + * Restores a serialized block set from a given physical address. The caller is + * responsible for ensuring that the block set @bs has been allocated and + * initialized prior to calling this function. + * + * Return: 0 on success, or a negative errno on failure. 
+ */ +int kho_block_set_restore(struct kho_block_set *bs, u64 head_pa) +{ + struct kho_block_header_ser *ser; + u64 next_pa = head_pa; + int err; + + /* Restored block sets use size from the previous kernel */ + bs->incoming = true; + if (!head_pa) + return 0; + + bs->head_pa = head_pa; + if (kho_block_set_is_cyclic(bs)) { + bs->head_pa = 0; + return -EINVAL; + } + + while (next_pa) { + ser = phys_to_virt(next_pa); + if (!ser->count || ser->count > bs->count_per_block) { + pr_warn("Block contains invalid entry count: %llu\n", + ser->count); + err = -EINVAL; + goto err_destroy; + } + err = kho_block_add(bs, ser); + if (err) + goto err_destroy; + next_pa = ser->next; + } + + return 0; + +err_destroy: + kho_block_set_destroy(bs); + + /* Free the remaining un-restored blocks in the physical chain */ + while (next_pa) { + struct kho_block_header_ser *next_ser = phys_to_virt(next_pa); + + next_pa = next_ser->next; + kho_block_free_ser(bs, next_ser); + } + return err; +} + +/** + * kho_block_set_destroy - Destroy all blocks in a block set. + * @bs: The block set. + */ +void kho_block_set_destroy(struct kho_block_set *bs) +{ + struct kho_block *block, *tmp; + + list_for_each_entry_safe(block, tmp, &bs->blocks, list) { + list_del(&block->list); + kho_block_free_ser(bs, block->ser); + kfree(block); + } + bs->nblocks = 0; + bs->head_pa = 0; +} + +/** + * kho_block_set_clear - Clear all serialized data in a block set. + * @bs: The block set to clear. + */ +void kho_block_set_clear(struct kho_block_set *bs) +{ + struct kho_block *block; + + list_for_each_entry(block, &bs->blocks, list) { + block->ser->count = 0; + memset(block->ser + 1, 0, KHO_BLOCK_SIZE - sizeof(*block->ser)); + } +} + +/** + * kho_block_set_it_init - Initialize a block set iterator. + * @it: The iterator to initialize. + * @bs: The block set to iterate over. 
+ */ +void kho_block_set_it_init(struct kho_block_set_it *it, struct kho_block_set *bs) +{ + it->bs = bs; + it->block = list_first_entry_or_null(&bs->blocks, struct kho_block, list); + it->i = 0; +} + +/** + * kho_block_set_it_reserve_entry - Reserve and return the next available slot for writing. + * @it: The block iterator. + * + * Reserves a slot in the current block during state serialization to add a new + * entry, advancing the internal index. If the current block is full, it + * automatically moves to the next block in the set. + * + * Return: A pointer to the reserved entry slot, or NULL if the block set's + * capacity is fully exhausted. + */ +void *kho_block_set_it_reserve_entry(struct kho_block_set_it *it) +{ + void *entry; + + if (!it->block) + return NULL; + + if (it->i == it->bs->count_per_block) { + if (list_is_last(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_next_entry(it->block, list); + it->i = 0; + } + + entry = kho_block_entry(it, it->i++); + it->block->ser->count = it->i; + return entry; +} + +/** + * kho_block_set_it_read_entry - Read the next serialized entry from the block set. + * @it: The block iterator. + * + * Iterates through previously written entries during state deserialization, + * respecting the actual count stored in each block's header. + * + * Return: A pointer to the next serialized entry, or NULL if all serialized + * entries have been read. + */ +void *kho_block_set_it_read_entry(struct kho_block_set_it *it) +{ + if (!it->block) + return NULL; + + if (it->i == it->block->ser->count) { + if (list_is_last(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_next_entry(it->block, list); + it->i = 0; + } + + return kho_block_entry(it, it->i++); +} + +/** + * kho_block_set_it_prev - Return the previous entry slot in the block set. + * @it: The block iterator. + * + * If the current index is at the start of a block, it automatically moves to + * the end of the previous block. 
+ * + * Return: A pointer to the previous entry slot, or NULL if at the very + * beginning of the block set. + */ +void *kho_block_set_it_prev(struct kho_block_set_it *it) +{ + if (!it->block) + return NULL; + + if (it->i == 0) { + if (list_is_first(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_prev_entry(it->block, list); + it->i = it->bs->count_per_block; + } + + return kho_block_entry(it, --it->i); +} -- 2.53.0 From pasha.tatashin at soleen.com Wed Jun 3 06:14:11 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 13:14:11 +0000 Subject: [PATCH v2] kexec_file: skip checksum verification when safe In-Reply-To: <2vxzik81dlbu.fsf@kernel.org> References: <20260602123311.1841746-1-mclapinski@google.com> <2vxzik81dlbu.fsf@kernel.org> Message-ID: On 06-02 17:16, Pratyush Yadav wrote: > On Tue, Jun 02 2026, Michal Clapinski wrote: > > > Checksum verification is needed > > 1. for crash kernels. In a crash, we can't be sure the kernel is > > intact. > > 2. if we're worried about relocating the kernel into a region used by > > some DMA that wasn't properly cancelled. > > > > If KHO is enabled then relocations will happen to KHO scratch, which > > is free from DMA regions. > > If we used CMA to allocate segments then relocations are not going to > > happen at all. > > Therefore, we can safely disable checksum verification in both of those > > cases. > > > > Instead of adding a new variable to purgatory, just skip adding regions > > and save the default value of SHA256 hash. > > > > Saves ~250ms on my 4.0 GHz CPU. This is an important saving for the > > live-update project. > > > > Signed-off-by: Michal Clapinski > > --- > > v2: > > - also skip checksum verification if KHO is enabled > > - small fixes from reviews > > > > My original idea was to do 2 changes: > > 1. Skip checksum if all segments are CMA. > > 2. If KHO is enabled, allocate the kernel inside kho_scratch using CMA. 
> > > > This way we could skip both relocations and checksum verification when > > KHO is enabled. > > But I realized that step 2 might not be possible on warm boots. > > AFAIU we only relocate into scratch since relocating anywhere else might > overwrite preserved memory. If there is no relocation, there is no need > for the kernel image to be in scratch, since the image won't be in > preserved memory anyway. > > So perhaps we can just use CMA directly, and only fall back to > kho_locate_mem_hole() if that fails? This should be a simple enough > change. > > Do you know how much time we can save by skipping relocations? I would > guess it is in the hundreds of milliseconds. > > Can you try this (COMPLETELY UNTESTED) patch out and see if it works and > if it further improves kexec time? > > --- 8< --- > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > index 2bfbb2d144e6..0ccc7b6d67c1 100644 > --- a/kernel/kexec_file.c > +++ b/kernel/kexec_file.c > @@ -720,14 +720,6 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf) > if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN) > return 0; > > - /* > - * If KHO is active, only use KHO scratch memory. All other memory > - * could potentially be handed over. > - */ > - ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback); > - if (ret <= 0) > - return ret; > - > /* > * Try to find a free physically contiguous block of memory first. With that, we > * can avoid any copying at kexec time. > @@ -735,6 +727,14 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf) > if (!kexec_alloc_contig(kbuf)) > return 0; > > + /* > + * If KHO is active and relocations are to be done, only use KHO > + * scratch memory. All other memory could potentially be handed over. 
> + */ > + ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback); > + if (ret <= 0) > + return ret; > + > if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) > ret = kexec_walk_resources(kbuf, locate_mem_hole_callback); > else > --- >8 --- > > Of course this is not directly related to this patch so it shouldn't > block it, but I reckon we might be able to squeeze a bit more > performance out this way as a follow up. > > > I have no idea how to fix that (except weird ideas like 2 kho_scratches > > that we swap on every warm boot), so I decided to just skip checksum > > verification when KHO is enabled. This unfortunately means relocations > > will still happen. > > --- > > kernel/kexec_file.c | 27 +++++++++++++++++++++++++++ > > 1 file changed, 27 insertions(+) > > > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > > index 2bfbb2d144e6..db25a14692ab 100644 > > --- a/kernel/kexec_file.c > > +++ b/kernel/kexec_file.c > > @@ -27,6 +27,7 @@ > > #include > > #include > > #include > > +#include > > #include "kexec_internal.h" > > > > #ifdef CONFIG_KEXEC_SIG > > @@ -798,6 +799,16 @@ int kexec_add_buffer(struct kexec_buf *kbuf) > > return 0; > > } > > > > +static bool kexec_only_cma_segments(struct kimage *image) > > +{ > > + for (int i = 0; i < image->nr_segments; i++) { > > + if (!image->segment_cma[i]) > > + return false; > > + } > > + > > + return true; > > +} > > + > > /* Calculate and store the digest of segments */ > > static int kexec_calculate_store_digests(struct kimage *image) > > { > > @@ -822,6 +833,21 @@ static int kexec_calculate_store_digests(struct kimage *image) > > > > sha256_init(&sctx); > > > > + /* > > + * If KHO is enabled, the destinations are located in KHO scratch. > > + * KHO scratch can only contain early boot allocations and movable > > + * allocations. That means there is no risk of memory corruption by > > + * uncancelled DMA. 
> > + * > > + * If all segments were loaded into contiguous memory, there will be no > > + * relocations at all, so also no risk no corruption. > > Typo: "so also no risk *of* corruption". I missed this fix when applying and have force-updated to address it. > > We can fix that up when applying I think, so no need for a v3 just for > this. > > Other than this, > > Reviewed-by: Pratyush Yadav (Google) > > > + */ > > + if (image->type != KEXEC_TYPE_CRASH && > > + (kho_is_enabled() || kexec_only_cma_segments(image))) { > > + pr_debug("disabling checksum verification in purgatory\n"); > > + goto skip_checksum; > > + } > > + > > for (j = i = 0; i < image->nr_segments; i++) { > > struct kexec_segment *ksegment; > > > > @@ -867,6 +893,7 @@ static int kexec_calculate_store_digests(struct kimage *image) > > j++; > > } > > > > +skip_checksum: > > sha256_final(&sctx, digest); > > > > ret = kexec_purgatory_get_set_symbol(image, "purgatory_sha_regions", > > -- > Regards, > Pratyush Yadav From pasha.tatashin at soleen.com Wed Jun 3 06:21:24 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 13:21:24 +0000 Subject: [PATCH v6 07/13] kho: add support for linked-block serialization In-Reply-To: References: <20260603032905.344462-1-pasha.tatashin@soleen.com> <20260603032905.344462-8-pasha.tatashin@soleen.com> <178046937151.468621.13398573538792303093.b4-review@b4> Message-ID: On 06-03 12:05, Pasha Tatashin wrote: > On 06-03 09:49, Mike Rapoport wrote: > > On Wed, 03 Jun 2026 03:28:58 +0000, Pasha Tatashin wrote: > > > diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h > > > new file mode 100644 > > > index 000000000000..8641c20b379b > > > --- /dev/null > > > +++ b/include/linux/kho/abi/block.h > > > @@ -0,0 +1,56 @@ > > > [ ... skip 25 lines ... 
] > > > +#define _LINUX_KHO_ABI_BLOCK_H > > > + > > > +#include > > > +#include > > > + > > > +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" > > > > It's never used by block set and after looking at the following patches I > > found that it's appended to LUO compatible string. > > > > While this works for LUO, I think it should be kho_block_set_restore() > > responsibility to verify the compatibility. > > It should work for any component that relies on kho_block. My proposal > is to use this method for other common KHO data structures (e.g., kho > vmalloc, kho radix, future kho xarray). There is no need for them to > carry the compatibility string in their metadata, as whoever uses them > will include their compatibility string. > > For now, reviewers will have to make sure that if the ABI header content > is changed, the compatibility string is updated. > > > > diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c > > > new file mode 100644 > > > index 000000000000..4f147c308e6b > > > --- /dev/null > > > +++ b/kernel/liveupdate/kho_block.c > > > @@ -0,0 +1,411 @@ > > > [ ... skip 121 lines ... ] > > > +/** > > > + * kho_block_set_grow - Expand the block set to accommodate the target count. > > > + * @bs: The block set. > > > + * @count: The target number of valid entries to accommodate. > > > + * > > > + * Acts as a runtime notifier when new resources (such as files or sessions) > > > > Not sure I understand what "runtime notifier" means in this context. > > It came from discussion with Pratyush, but I think we are on the same > page what they are, and I will just remove this. > > > > > > [ ... skip 11 lines ... ] > > > + > > > + while (count > bs->nblocks * bs->count_per_block) { > > > + int err = kho_block_set_grow_one(bs); > > > + > > > + if (err) > > > + return err; > > > > This leaks memory if more than one block is added. > > > > > [ ... skip 31 lines ... 
] > > > + * unregistered, allowing the block set to release and unallocate redundant > > > + * preserved memory blocks. Checks if the last block in the set can be removed > > > + * because the remaining entry count is fully accommodated by the preceding blocks. > > > + * > > > + * Note: It is the caller's responsibility to ensure that entries are removed > > > + * in LIFO (last-in, first-out) order (the reverse order of their insertion). > > > > I think "in LIFO order" is sufficient :) > > Oh, I keep removing those :-) > > > > [ ... skip 173 lines ... ] > > > + it->i = 0; > > > + } > > > + > > > + entry = kho_block_entry(it, it->i++); > > > + it->block->ser->count = it->i; > > > + return entry; > > > > This looks way better than the previous version :) > > Thanks! > > Thank you. I will send a new version of this patch as a reply to this > email to avoid cluttering the mailing list. The patch is here: https://lore.kernel.org/all/20260603130612.397948-1-pasha.tatashin@soleen.com/ I messed up the In-Reply-To field by using the wrong Message-ID. Pasha > > Pasha > > > > > -- > > Sincerely yours, > > Mike. > > From rppt at kernel.org Wed Jun 3 06:31:59 2026 From: rppt at kernel.org (Mike Rapoport) Date: Wed, 03 Jun 2026 16:31:59 +0300 Subject: [PATCH v6 04/13] liveupdate: register luo_ser as KHO subtree In-Reply-To: <20260603032905.344462-5-pasha.tatashin@soleen.com> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> <20260603032905.344462-5-pasha.tatashin@soleen.com> Message-ID: <178049351951.475072.17505666427466268204.b4-review@b4> On Wed, 03 Jun 2026 03:28:55 +0000, Pasha Tatashin wrote: > Entirely remove the LUO FDT wrapper since the FDT only carries the > compatible string and the pointer to the centralized struct luo_ser. > Instead, register the struct luo_ser via the KHO raw subtree > API, placing the compatibility string inside the structure itself. Acked-by: Mike Rapoport (Microsoft) -- Sincerely yours, Mike. 
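Editor's note on the kho_block_set_grow() leak discussed in the v6 07/13 review above: the v6.1 patch remembers the original block count and, on failure, shrinks back to it. The sketch below is a userspace model of that pattern only, not the kernel code. Every toy_* name and the fail_after fault-injection field are hypothetical and exist just to make the rollback observable; the 127 entries per block figure is taken from the KHO_MAX_BLOCKS comment in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a serialization block; not the kernel's struct kho_block. */
struct toy_block {
	struct toy_block *prev;
};

/* Toy stand-in for struct kho_block_set. */
struct toy_block_set {
	struct toy_block *last;		/* most recently added block */
	long nblocks;			/* blocks currently linked */
	unsigned long count_per_block;	/* entries that fit in one block */
	long fail_after;		/* hypothetical: allocations allowed before
					 * a simulated failure, negative = never fail */
};

/* Add one block, or fail without side effects. */
static int toy_grow_one(struct toy_block_set *bs)
{
	struct toy_block *b;

	if (bs->fail_after == 0)
		return -1;		/* simulated allocation failure */
	if (bs->fail_after > 0)
		bs->fail_after--;

	b = malloc(sizeof(*b));
	if (!b)
		return -1;

	b->prev = bs->last;
	bs->last = b;
	bs->nblocks++;
	return 0;
}

/* Remove and free the last block, if there is one. */
static void toy_shrink_one(struct toy_block_set *bs)
{
	struct toy_block *b = bs->last;

	if (!b)
		return;

	bs->last = b->prev;
	bs->nblocks--;
	assert(bs->nblocks >= 0);
	free(b);
}

/*
 * Grow until count entries fit. On failure, undo only the blocks added by
 * this call, so a partially completed grow does not leave extra blocks behind.
 */
static int toy_grow(struct toy_block_set *bs, unsigned long count)
{
	long orig_nblocks = bs->nblocks;

	while (count > (unsigned long)bs->nblocks * bs->count_per_block) {
		if (toy_grow_one(bs)) {
			while (bs->nblocks > orig_nblocks)
				toy_shrink_one(bs);
			return -1;
		}
	}

	return 0;
}
```

With one block already present and a failure injected on the third new allocation, toy_grow() returns an error and the set is back to one block, which is the behaviour the review asked for.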
From lgs201920130244 at gmail.com Wed Jun 3 06:50:56 2026 From: lgs201920130244 at gmail.com (Guangshuo Li) Date: Wed, 3 Jun 2026 21:50:56 +0800 Subject: [PATCH] crash_dump: release keyring reference at the correct time Message-ID: <20260603135056.1397084-1-lgs201920130244@gmail.com> restore_dm_crypt_keys_to_thread_keyring() gets a reference to the user keyring before restoring the saved dm-crypt keys. The same keyring reference is then passed to add_key_to_keyring() for each saved key, but add_key_to_keyring() drops that reference on every call. This is only balanced when exactly one key is restored. With multiple keys, the keyring reference is dropped too many times and may trigger a refcount underflow or use-after-free. The early error paths after lookup_user_key() also return without dropping the keyring reference. Keep ownership of the keyring reference in restore_dm_crypt_keys_to_thread_keyring(), drop it once on all exit paths, and make add_key_to_keyring() only use the reference without consuming it. 
Fixes: 62f17d9df692 ("crash_dump: retrieve dm crypt keys in kdump kernel") Signed-off-by: Guangshuo Li --- kernel/crash_dump_dm_crypt.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/kernel/crash_dump_dm_crypt.c b/kernel/crash_dump_dm_crypt.c index a20d4097744a..641c290f1270 100644 --- a/kernel/crash_dump_dm_crypt.c +++ b/kernel/crash_dump_dm_crypt.c @@ -80,7 +80,6 @@ static int add_key_to_keyring(struct dm_crypt_key *dm_key, kexec_dprintk("Error when adding key"); } - key_ref_put(keyring_ref); return r; } @@ -104,6 +103,7 @@ static int restore_dm_crypt_keys_to_thread_keyring(void) size_t keys_header_size; key_ref_t keyring_ref; u64 addr; + int ret = 0; /* find the target keyring (which must be writable) */ keyring_ref = @@ -117,7 +117,8 @@ static int restore_dm_crypt_keys_to_thread_keyring(void) dm_crypt_keys_read((char *)&key_count, sizeof(key_count), &addr); if (key_count < 0 || key_count > KEY_NUM_MAX) { kexec_dprintk("Failed to read the number of dm-crypt keys\n"); - return -1; + ret = -1; + goto out; } kexec_dprintk("There are %u keys\n", key_count); @@ -125,8 +126,10 @@ static int restore_dm_crypt_keys_to_thread_keyring(void) keys_header_size = get_keys_header_size(key_count); keys_header = kzalloc(keys_header_size, GFP_KERNEL); - if (!keys_header) - return -ENOMEM; + if (!keys_header) { + ret = -ENOMEM; + goto out; + } dm_crypt_keys_read((char *)keys_header, keys_header_size, &addr); @@ -136,7 +139,9 @@ static int restore_dm_crypt_keys_to_thread_keyring(void) add_key_to_keyring(key, keyring_ref); } - return 0; +out: + key_ref_put(keyring_ref); + return ret; } static int read_key_from_user_keying(struct dm_crypt_key *dm_key) -- 2.43.0 From rppt at kernel.org Wed Jun 3 06:59:43 2026 From: rppt at kernel.org (Mike Rapoport) Date: Wed, 3 Jun 2026 16:59:43 +0300 Subject: [PATCH v6 07/13] kho: add support for linked-block serialization In-Reply-To: References: <20260603032905.344462-1-pasha.tatashin@soleen.com> 
<20260603032905.344462-8-pasha.tatashin@soleen.com> <178046937151.468621.13398573538792303093.b4-review@b4> Message-ID: On Wed, Jun 03, 2026 at 12:05:04PM +0000, Pasha Tatashin wrote: > On 06-03 09:49, Mike Rapoport wrote: > > On Wed, 03 Jun 2026 03:28:58 +0000, Pasha Tatashin wrote: > > > diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h > > > new file mode 100644 > > > index 000000000000..8641c20b379b > > > --- /dev/null > > > +++ b/include/linux/kho/abi/block.h > > > @@ -0,0 +1,56 @@ > > > [ ... skip 25 lines ... ] > > > +#define _LINUX_KHO_ABI_BLOCK_H > > > + > > > +#include > > > +#include > > > + > > > +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" > > > > It's never used by block set and after looking at the following patches I > > found that it's appended to LUO compatible string. > > > > While this works for LUO, I think it should be kho_block_set_restore() > > responsibility to verify the compatibility. > > It should work for any component that relies on kho_block. My proposal > is to use this method for other common KHO data structures (e.g., kho > vmalloc, kho radix, future kho xarray). There is no need for them to > carry the compatibility string in their metadata, as whoever uses them > will include their compatibility string. So if, say, memfd_luo uses kho vmalloc, xarray and blocks it'll have five compatibility strings glued together? > For now, reviewers will have to make sure that if the ABI header content > is changed, the compatibility string is updated. -- Sincerely yours, Mike. 
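[Archive note] The proposal quoted above has each KHO client build a single compatibility string out of its own base string plus the strings of the KHO data structures it uses. A minimal sketch of how that could look, assuming the strings are joined at compile time by C string-literal concatenation and checked with a plain strcmp(). Only KHO_BLOCK_ABI_COMPATIBLE and its value come from the patch; every EXAMPLE_* name and value is hypothetical, and this is not the series' actual restore code.

```c
#include <stdbool.h>
#include <string.h>

/* Taken from the patch under review. */
#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1"

/* Hypothetical names and values, for illustration only. */
#define EXAMPLE_VMALLOC_COMPATIBLE "kho-vmalloc-v1"
#define EXAMPLE_CLIENT_COMPAT_BASE "example-client-v1"

/*
 * Adjacent string literals are concatenated by the compiler, so the
 * client's full compatibility string is a single constant that changes
 * whenever any of its components changes.
 */
#define EXAMPLE_CLIENT_COMPATIBLE \
    EXAMPLE_CLIENT_COMPAT_BASE "-" \
    KHO_BLOCK_ABI_COMPATIBLE "-" \
    EXAMPLE_VMALLOC_COMPATIBLE

/*
 * Hypothetical restore-side check: accept the preserved state only if
 * the string stored by the previous kernel matches exactly.
 */
static bool example_compat_matches(const char *preserved)
{
    return preserved && strcmp(preserved, EXAMPLE_CLIENT_COMPATIBLE) == 0;
}
```

With this layout a version bump in any one component (say "kho-block-v2") makes the whole client string mismatch, which is the behaviour the proposal relies on and also the coupling the question above is about.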
From pasha.tatashin at soleen.com Wed Jun 3 07:11:31 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 14:11:31 +0000 Subject: [PATCH v6 07/13] kho: add support for linked-block serialization In-Reply-To: References: <20260603032905.344462-1-pasha.tatashin@soleen.com> <20260603032905.344462-8-pasha.tatashin@soleen.com> <178046937151.468621.13398573538792303093.b4-review@b4> Message-ID: On 06-03 16:59, Mike Rapoport wrote: > On Wed, Jun 03, 2026 at 12:05:04PM +0000, Pasha Tatashin wrote: > > On 06-03 09:49, Mike Rapoport wrote: > > > On Wed, 03 Jun 2026 03:28:58 +0000, Pasha Tatashin wrote: > > > > diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h > > > > new file mode 100644 > > > > index 000000000000..8641c20b379b > > > > --- /dev/null > > > > +++ b/include/linux/kho/abi/block.h > > > > @@ -0,0 +1,56 @@ > > > > [ ... skip 25 lines ... ] > > > > +#define _LINUX_KHO_ABI_BLOCK_H > > > > + > > > > +#include > > > > +#include > > > > + > > > > +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" > > > > > > It's never used by block set and after looking at the following patches I > > > found that it's appended to LUO compatible string. > > > > > > While this works for LUO, I think it should be kho_block_set_restore() > > > responsibility to verify the compatibility. > > > > It should work for any component that relies on kho_block. My proposal > > is to use this method for other common KHO data structures (e.g., kho > > vmalloc, kho radix, future kho xarray). There is no need for them to > > carry the compatibility string in their metadata, as whoever uses them > > will include their compatibility string. > > So if, say, memfd_luo uses kho vmalloc, xarray and blocks it'll have five > compatibility strings glued together? That is correct, but it will be in only one place: the header of the client's KHO subtree. Since it is dynamically sized and 8-byte aligned, it should be safe to include in any struct. 
Pasha > > > For now, reviewers will have to make sure that if the ABI header content > > is changed, the compatibility string is updated. > > -- > Sincerely yours, > Mike. From rppt at kernel.org Wed Jun 3 07:34:14 2026 From: rppt at kernel.org (Mike Rapoport) Date: Wed, 03 Jun 2026 17:34:14 +0300 Subject: [PATCH v6 07/13] kho: add support for linked-block serialization In-Reply-To: References: <20260603032905.344462-1-pasha.tatashin@soleen.com> <20260603032905.344462-8-pasha.tatashin@soleen.com> <178046937151.468621.13398573538792303093.b4-review@b4> Message-ID: <178049725439.475072.11560134126837430744.b4-reply@b4> On 2026-06-03 14:11 +0000, Pasha Tatashin wrote: > On 06-03 16:59, Mike Rapoport wrote: > > On Wed, Jun 03, 2026 at 12:05:04PM +0000, Pasha Tatashin wrote: > > > On 06-03 09:49, Mike Rapoport wrote: > > > > On Wed, 03 Jun 2026 03:28:58 +0000, Pasha Tatashin wrote: > > > > > diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h > > > > > new file mode 100644 > > > > > index 000000000000..8641c20b379b > > > > > --- /dev/null > > > > > +++ b/include/linux/kho/abi/block.h > > > > > @@ -0,0 +1,56 @@ > > > > > [ ... skip 25 lines ... ] > > > > > +#define _LINUX_KHO_ABI_BLOCK_H > > > > > + > > > > > +#include > > > > > +#include > > > > > + > > > > > +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" > > > > > > > > It's never used by block set and after looking at the following patches I > > > > found that it's appended to LUO compatible string. > > > > > > > > While this works for LUO, I think it should be kho_block_set_restore() > > > > responsibility to verify the compatibility. > > > > > > It should work for any component that relies on kho_block. My proposal > > > is to use this method for other common KHO data structures (e.g., kho > > > vmalloc, kho radix, future kho xarray). There is no need for them to > > > carry the compatibility string in their metadata, as whoever uses them > > > will include their compatibility string. 
> > > > So if, say, memfd_luo uses kho vmalloc, xarray and blocks it'll have five > > compatibility strings glued together? > > That is correct, but it will be in only one place: the header of the > client's KHO subtree. Since it is dynamically sized and 8-byte aligned, > it should be safe to include in any struct. This is safe, you are right. But I have more usability concerns from one side and the duplication it causes from the other. I can see the downside of putting the version information in the data structure itself as it either requires a different header for the first element or needlessly increases all the headers. But #define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE "-" KHO_VMALLOC_ABI_COMPATIBLE "-" KHO_RADIX_COMPATIBLE is not really digestible too. And it forces KHO users to potentially track KHO internal changes. We still don't promise any compatibility between different kernel versions so to avoid blocking this series on the decision what is the best way to convey KHO data structures compatibility I suggest to bump kho ABI version in v6.2* of the patch that adds KHO blocks and postpone this discussion to after rc1 when we'll have plenty of time. * sending a new version of a single file does same email traffic, but it confuses b4 and quite possibly other tools, so I think v7 is better. > Pasha > > > > > > For now, reviewers will have to make sure that if the ABI header content > > > is changed, the compatibility string is updated. > > > -- > > Sincerely yours, > > Mike. 
> From kirill at shutemov.name Wed Jun 3 07:36:31 2026 From: kirill at shutemov.name (Kiryl Shutsemau) Date: Wed, 3 Jun 2026 15:36:31 +0100 Subject: [PATCH 0/4] arm64: cross-CPU NMI via SDEI Message-ID: From: "Kiryl Shutsemau (Meta)" A class of debug/observability features needs to interrupt a CPU that has its interrupts locally masked: hard-lockup detection, the all-CPU backtrace behind sysrq-l / RCU-stall / hung-task dumps, and crash_smp_send_stop() capturing a stuck CPU's state into the vmcore. On arm64 these need a mechanism that reaches a CPU spinning with DAIF masked, which a normal IPI cannot. arm64 has two such mechanisms today: - GICv3 pseudo-NMI (interrupt priority masking). This is the preferred path and what the perf-based hard-lockup detector (HAVE_HARDLOCKUP_DETECTOR_PERF) is built on. Its cost, however, is on the interrupt mask/unmask hot path: local_irq_enable() becomes an ICC_PMR_EL1 write plus a synchronising barrier, and exception entry/exit save and restore the PMR, paid on every CPU whether or not an NMI is ever delivered. In our measurements, enabling pseudo-NMI costs up to ~5% on real workloads, and ~66% on a syscall-in-a-loop microbenchmark that maximises exception entry/exit (where pseudo-NMI adds the PMR save/restore). A fleet-wide ~5% regression is not acceptable, so these systems run with pseudo-NMI disabled -- and therefore have no hard-lockup detector and degraded backtrace/crash-stop today. - FEAT_NMI (Armv8.8) -- the architectural fix, but absent from deployed silicon and from most of the fleet for years to come. For deployments that do not run pseudo-NMI (to avoid that standing hot-path cost), the hard-lockup detector and the backtrace/crash paths are degraded: a plain IPI can't reach the masked CPU, so the lockup goes undetected, the backtrace of the CPU you care about comes back empty, and the kdump is missing the culprit's registers. This series adds a third delivery backend that costs nothing on the hot path: SDEI.
Firmware delivers an SDEI event into a CPU regardless of its DAIF state, so interrupt masking stays the cheap PSTATE.DAIF operation and the firmware round-trip is paid only at the rare moment a CPU must be interrupted. Mechanism ========= It uses the standard SDEI software-signalled event (event 0) and the SDEI_EVENT_SIGNAL call (DEN0054) -- a spec-defined cross-PE signal, not a vendor extension. The driver registers a handler for event 0 and pokes a target CPU with sdei_event_signal(0, target_mpidr); firmware makes event 0 pending on that PE and dispatches the handler NMI-like. No firmware change is required beyond SDEI being enabled, which firmware-first RAS (APEI/GHES) deployments already have; the only SDEI-core addition is a thin sdei_event_signal() wrapper over the standard call. Clean kdump when a CPU panics from inside the SDEI handler (the hard-lockup case) is handled by the already-merged sdei_handler_abort(), which crash_smp_send_stop() calls: it issues SDEI_EVENT_COMPLETE_AND_RESUME so the firmware-side priority is dropped before the capture kernel boots. Prior SDEI watchdog work ======================== Out-of-tree SDEI hard-lockup watchdogs exist (e.g. in the openEuler and Anolis kernels). They take a different mechanism: they bind the secure physical timer as an SDEI event, so firmware delivers a periodic self-CPU tick that drives the detector. That requires a new SDEI interrupt-binding API, pushes the watchdog period (watchdog_thresh) into firmware, and adds secure-timer EOI handling on the kexec path. This series instead uses only the standard software-signalled event 0: the kernel keeps the timing (a per-CPU hrtimer with a buddy heartbeat check) and firmware does nothing but deliver the cross-CPU poke when a buddy looks stalled. The result is a smaller, far less firmware-coupled change -- no secure-timer dependency, no new SDEI API, no period in firmware --
and the same delivery primitive serves the backtrace and crash-stop users, not just the watchdog. Testing ======= Developed on QEMU (Trusted Firmware-A with SDEI enabled) and validated on NVIDIA Grace (Neoverse V2) hardware, under irqchip.gicv3_pseudo_nmi=0: - hard lockup (LKDTM) caught by the SDEI watchdog and panicked, with the stack pointing at the wedged code; - sysrq-l backtrace of an interrupt-masked CPU returning its real stack; - kdump via crash_smp_send_stop() with a wedged CPU, and via a watchdog panic from inside the event-0 handler -- sdei_handler_abort() fires and the capture kernel boots to userspace on the formerly-wedged CPU, with its registers present in the vmcore. Series ====== [1/4] firmware: arm_sdei: add SDEI_EVENT_SIGNAL support Thin sdei_event_signal() wrapper over the standard call; NMI/crash safe (no locks). [2/4] drivers/firmware: add SDEI cross-CPU NMI service for arm64 Register event 0; first user, arch_trigger_cpumask_backtrace(). [3/4] arm64: wire SDEI NMI into the hardlockup watchdog HAVE_HARDLOCKUP_DETECTOR_ARCH backend; boot-time source selection with perf-NMI fallback. [4/4] arm64: route crash_smp_send_stop() last resort through SDEI SDEI as the final escalation rung for CPUs that ignored the normal and pseudo-NMI stop IPIs. 
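[Archive note] The "buddy heartbeat check" mentioned in the cover letter can be modelled outside the kernel roughly as follows. This is a simplified userspace sketch, not the driver code from patch 3/4: all example_* names are hypothetical, timestamps are plain nanosecond counters, and the firmware poke is left out. It only shows how a CPU could pick the next participating CPU as its buddy and decide that the buddy's heartbeat is older than the threshold.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_NSEC_PER_SEC 1000000000ULL

/* Hypothetical model of the watchdog state shared between CPUs. */
struct example_wd {
    const uint64_t *heartbeat_ns; /* last time each CPU's timer ran */
    const bool *enabled;          /* which CPUs take part in the check */
    unsigned int nr_cpus;
};

/*
 * Next participating CPU after @cpu, wrapping around. Returns nr_cpus
 * when no other CPU takes part, so the caller has nobody to check.
 */
static unsigned int example_buddy(const struct example_wd *wd,
                                  unsigned int cpu)
{
    for (unsigned int i = 1; i < wd->nr_cpus; i++) {
        unsigned int next = (cpu + i) % wd->nr_cpus;

        if (wd->enabled[next])
            return next;
    }
    return wd->nr_cpus;
}

/*
 * True when @cpu's buddy has not updated its heartbeat for more than
 * @thresh_sec seconds. In the series this is the point where the
 * checking CPU would signal event 0 at the buddy.
 */
static bool example_buddy_stalled(const struct example_wd *wd,
                                  unsigned int cpu, uint64_t now_ns,
                                  unsigned int thresh_sec)
{
    unsigned int buddy = example_buddy(wd, cpu);

    if (buddy >= wd->nr_cpus)
        return false;

    return now_ns > wd->heartbeat_ns[buddy] +
                    (uint64_t)thresh_sec * EXAMPLE_NSEC_PER_SEC;
}
```

The point of the design is that the check itself stays in the kernel and is cheap; the firmware round-trip happens only on the branch where the predicate returns true.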
Also available at: git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git sdei-nmi arch/arm64/Kconfig | 1 + arch/arm64/include/asm/nmi.h | 30 ++ arch/arm64/kernel/smp.c | 33 +++ drivers/firmware/Kconfig | 23 ++ drivers/firmware/Makefile | 1 + drivers/firmware/arm_sdei.c | 12 + drivers/firmware/sdei_nmi.c | 523 ++++++++++++++++++++++++++++++++++ include/linux/arm_sdei.h | 6 + include/uapi/linux/arm_sdei.h | 1 + 9 files changed, 630 insertions(+) create mode 100644 arch/arm64/include/asm/nmi.h create mode 100644 drivers/firmware/sdei_nmi.c base-commit: e7ae89a0c97ce2b68b0983cd01eda67cf373517d -- 2.54.0 From kirill at shutemov.name Wed Jun 3 07:36:32 2026 From: kirill at shutemov.name (Kiryl Shutsemau) Date: Wed, 3 Jun 2026 15:36:32 +0100 Subject: [PATCH 1/4] firmware: arm_sdei: add SDEI_EVENT_SIGNAL support In-Reply-To: References: Message-ID: From: "Kiryl Shutsemau (Meta)" Add sdei_event_signal(), a thin wrapper over the SDEI_EVENT_SIGNAL call (DEN0054) that makes the software-signalled event (event 0) pending on a target PE -- delivered NMI-like even when that PE has interrupts masked. It takes no locks, so it is safe to call from NMI / crash context. Signed-off-by: Kiryl Shutsemau (Meta) --- drivers/firmware/arm_sdei.c | 12 ++++++++++++ include/linux/arm_sdei.h | 6 ++++++ include/uapi/linux/arm_sdei.h | 1 + 3 files changed, 19 insertions(+) diff --git a/drivers/firmware/arm_sdei.c b/drivers/firmware/arm_sdei.c index f39ed7ba3a38..e3fd604d9894 100644 --- a/drivers/firmware/arm_sdei.c +++ b/drivers/firmware/arm_sdei.c @@ -339,6 +339,18 @@ static void _ipi_unmask_cpu(void *ignored) sdei_unmask_local_cpu(); } +/* + * Signal the software-signalled event (event 0) to @mpidr. Does nothing + * but the SMC -- no locks, no event lookup -- so it is safe from NMI / + * crash context (e.g. the cross-CPU NMI service). 
+ */ +int sdei_event_signal(u32 event_num, u64 mpidr) +{ + return invoke_sdei_fn(SDEI_1_0_FN_SDEI_EVENT_SIGNAL, event_num, + mpidr, 0, 0, 0, NULL); +} +NOKPROBE_SYMBOL(sdei_event_signal); + static void _ipi_private_reset(void *ignored) { int err; diff --git a/include/linux/arm_sdei.h b/include/linux/arm_sdei.h index f652a5028b59..3f3ec01155e8 100644 --- a/include/linux/arm_sdei.h +++ b/include/linux/arm_sdei.h @@ -37,6 +37,12 @@ int sdei_event_unregister(u32 event_num); int sdei_event_enable(u32 event_num); int sdei_event_disable(u32 event_num); +/* + * Signal the software-signalled event (event 0) to another PE, NMI-like. + * @mpidr is the target's MPIDR affinity. + */ +int sdei_event_signal(u32 event_num, u64 mpidr); + /* GHES register/unregister helpers */ int sdei_register_ghes(struct ghes *ghes, sdei_event_callback *normal_cb, sdei_event_callback *critical_cb); diff --git a/include/uapi/linux/arm_sdei.h b/include/uapi/linux/arm_sdei.h index af0630ba5437..22eb61612673 100644 --- a/include/uapi/linux/arm_sdei.h +++ b/include/uapi/linux/arm_sdei.h @@ -22,6 +22,7 @@ #define SDEI_1_0_FN_SDEI_PE_UNMASK SDEI_1_0_FN(0x0C) #define SDEI_1_0_FN_SDEI_INTERRUPT_BIND SDEI_1_0_FN(0x0D) #define SDEI_1_0_FN_SDEI_INTERRUPT_RELEASE SDEI_1_0_FN(0x0E) +#define SDEI_1_0_FN_SDEI_EVENT_SIGNAL SDEI_1_0_FN(0x0F) #define SDEI_1_0_FN_SDEI_PRIVATE_RESET SDEI_1_0_FN(0x11) #define SDEI_1_0_FN_SDEI_SHARED_RESET SDEI_1_0_FN(0x12) -- 2.54.0 From kirill at shutemov.name Wed Jun 3 07:36:33 2026 From: kirill at shutemov.name (Kiryl Shutsemau) Date: Wed, 3 Jun 2026 15:36:33 +0100 Subject: [PATCH 2/4] drivers/firmware: add SDEI cross-CPU NMI service for arm64 In-Reply-To: References: Message-ID: <145b9e98b12a7d314fc4a203075f65c3a0c3a913.1780496779.git.kas@kernel.org> From: "Kiryl Shutsemau (Meta)" Deliver an NMI-like event to an interrupt-masked arm64 CPU via the standard SDEI software-signalled event (event 0), without the pseudo-NMI hot-path cost: register a handler for event 0 and poke a target 
with sdei_event_signal(0, mpidr). First user is arch_trigger_cpumask_backtrace() (sysrq-l, RCU stalls, hung-task/soft-lockup dumps), which otherwise rides an IPI that can't reach a masked CPU. Falls back to the IPI path when SDEI is absent; no watchdog backend yet, so the stock detector is untouched. Signed-off-by: Kiryl Shutsemau (Meta) --- arch/arm64/include/asm/nmi.h | 24 ++++++ arch/arm64/kernel/smp.c | 9 +++ drivers/firmware/Kconfig | 19 +++++ drivers/firmware/Makefile | 1 + drivers/firmware/sdei_nmi.c | 147 +++++++++++++++++++++++++++++++++++ 5 files changed, 200 insertions(+) create mode 100644 arch/arm64/include/asm/nmi.h create mode 100644 drivers/firmware/sdei_nmi.c diff --git a/arch/arm64/include/asm/nmi.h b/arch/arm64/include/asm/nmi.h new file mode 100644 index 000000000000..ccdb75692e9d --- /dev/null +++ b/arch/arm64/include/asm/nmi.h @@ -0,0 +1,24 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef __ASM_NMI_H +#define __ASM_NMI_H + +#include + +/* + * Cross-CPU NMI provider hooks, consulted by the arm64 arch code before + * its regular-IRQ / pseudo-NMI IPI paths. The SDEI provider in + * drivers/firmware/sdei_nmi.c implements them when active; a future + * FEAT_NMI provider could slot in here too. The stubs let callers stay + * unconditional when ARM_SDEI_NMI is off. 
+ */ +#ifdef CONFIG_ARM_SDEI_NMI +bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu); +#else +static inline bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask, + int exclude_cpu) +{ + return false; +} +#endif + +#endif /* __ASM_NMI_H */ diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 1aa324104afb..656b8417af72 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -45,6 +45,7 @@ #include #include #include +#include #include #include #include @@ -928,11 +929,19 @@ static void arm64_backtrace_ipi(cpumask_t *mask) void arch_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu) { /* + * Prefer the SDEI cross-CPU NMI provider when active: firmware + * dispatches the event out of EL3 and reaches CPUs that have + * interrupts locally masked, without the per-IRQ-mask cost that + * pseudo-NMI pays for the same reach. The plain IPI path below + * can't reach such a CPU unless pseudo-NMI is enabled. + * * NOTE: though nmi_trigger_cpumask_backtrace() has "nmi_" in the name, * nothing about it truly needs to be implemented using an NMI, it's * just that it's _allowed_ to work with NMIs. If ipi_should_be_nmi() * returned false our backtrace attempt will just use a regular IPI. */ + if (sdei_nmi_trigger_cpumask_backtrace(mask, exclude_cpu)) + return; nmi_trigger_cpumask_backtrace(mask, exclude_cpu, arm64_backtrace_ipi); } diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig index bbd2155d8483..6501087ff90d 100644 --- a/drivers/firmware/Kconfig +++ b/drivers/firmware/Kconfig @@ -36,6 +36,25 @@ config ARM_SDE_INTERFACE standard for registering callbacks from the platform firmware into the OS. This is typically used to implement RAS notifications. 
+config ARM_SDEI_NMI + bool "SDEI-based cross-CPU NMI service (arm64)" + depends on ARM64 && ARM_SDE_INTERFACE + help + Provides SDEI-based cross-CPU NMI delivery for hooks that need + to reach interrupt-masked CPUs on silicon that lacks FEAT_NMI: + + - arch_trigger_cpumask_backtrace() (sysrq-l, RCU stalls, + hardlockup_all_cpu_backtrace, soft-lockup secondary dumps, + hung-task auxiliary dumps) + + The driver registers a handler for the SDEI software-signalled + event (event 0) and reaches a target CPU by signalling it with + SDEI_EVENT_SIGNAL. Firmware delivers the event out of EL3 + regardless of the target's PSTATE.DAIF -- forced delivery into a + CPU wedged with interrupts locally masked. + + If unsure, say N. + config EDD tristate "BIOS Enhanced Disk Drive calls determine boot disk" depends on X86 diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile index 4ddec2820c96..48221fb8b385 100644 --- a/drivers/firmware/Makefile +++ b/drivers/firmware/Makefile @@ -4,6 +4,7 @@ # obj-$(CONFIG_ARM_SCPI_PROTOCOL) += arm_scpi.o obj-$(CONFIG_ARM_SDE_INTERFACE) += arm_sdei.o +obj-$(CONFIG_ARM_SDEI_NMI) += sdei_nmi.o obj-$(CONFIG_DMI) += dmi_scan.o obj-$(CONFIG_DMI_SYSFS) += dmi-sysfs.o obj-$(CONFIG_EDD) += edd.o diff --git a/drivers/firmware/sdei_nmi.c b/drivers/firmware/sdei_nmi.c new file mode 100644 index 000000000000..e5c3f28b3991 --- /dev/null +++ b/drivers/firmware/sdei_nmi.c @@ -0,0 +1,147 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * arm64 SDEI-based cross-CPU NMI service. + * + * Delivering an "NMI-shaped" event to an EL1 context that has locally + * masked interrupts, on silicon without FEAT_NMI, can be done two ways: + * + * - pseudo-NMI: mask "interrupts" via the GIC priority register + * (ICC_PMR_EL1) instead of PSTATE.DAIF, leaving a high-priority band + * deliverable. 
Functionally this works -- but it reimplements every + * local_irq_disable()/enable() and exception entry/exit as a PMR + * write plus synchronisation, a cost paid on that hot path forever, + * whether or not an NMI is ever delivered. + * + * - SDEI: leave interrupt masking as the cheap PSTATE.DAIF operation + * and have the firmware bounce an EL3-routed Group-0 SGI back to + * NS-EL1 as an event callback. The cost is a firmware round-trip, + * but only at the rare moment delivery is actually needed. + * + * This driver takes the second path: it keeps the IRQ-mask hot path + * free and pays only when it fires, which is what makes cross-CPU NMI + * affordable on hardware where the pseudo-NMI tax isn't, until FEAT_NMI + * makes NMI masking cheap in the architecture itself. + * + * Capabilities provided: + * + * - sdei_nmi_trigger_cpumask_backtrace() -- override for arm64's + * arch_trigger_cpumask_backtrace(), so sysrq-l, RCU stall dumps, + * hardlockup_all_cpu_backtrace, soft-lockup/hung-task secondary + * dumps all reach interrupt-masked CPUs. + * + * Delivery uses the standard SDEI software-signalled event (event 0) and + * SDEI_EVENT_SIGNAL. We register a handler for event 0, enable it, and + * poke a target CPU with sdei_event_signal(0, mpidr): firmware makes + * event 0 pending on that PE and dispatches the handler NMI-like, + * regardless of the target's DAIF. + * Availability is simply whether event 0 registers and enables -- if SDEI + * and its software-signalled event are present we use it, otherwise the + * driver stays inert. 
+ */ + +#define pr_fmt(fmt) "sdei_nmi: " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +static bool sdei_nmi_available; + +#define SDEI_NMI_EVENT 0 + +static int sdei_nmi_handler(u32 event, struct pt_regs *regs, void *arg) +{ + /* + * nmi_cpu_backtrace() no-ops unless this CPU's bit is set in the + * global backtrace mask (driven by nmi_trigger_cpumask_backtrace()), + * so a fire that reaches a CPU not being backtraced is harmless. + */ + nmi_cpu_backtrace(regs); + return SDEI_EV_HANDLED; +} + +static void sdei_nmi_fire(unsigned int target_cpu) +{ + int err = sdei_event_signal(SDEI_NMI_EVENT, cpu_logical_map(target_cpu)); + + if (err) + pr_warn("SDEI_EVENT_SIGNAL to CPU %u failed: %d\n", + target_cpu, err); +} + +/* + * Raise callback for nmi_trigger_cpumask_backtrace(): signal event 0 + * at every CPU still pending in @mask. The framework excludes the local + * CPU from @mask before calling us. + */ +static void sdei_nmi_raise_backtrace(cpumask_t *mask) +{ + unsigned int cpu; + + for_each_cpu(cpu, mask) + sdei_nmi_fire(cpu); +} + +/* + * Override hook for arch_trigger_cpumask_backtrace() (see + * arch/arm64/kernel/smp.c). Returns true when SDEI handled the request, + * which is the case whenever SDEI is active; on a false return the arch + * falls back to its regular-IRQ (or pseudo-NMI, if enabled) IPI. + * + * On a kernel built without paying the pseudo-NMI hot-path cost (the + * usual case for this driver's target), the IPI can't reach a CPU that + * has interrupts masked -- so the backtrace of the one CPU you care + * about comes back empty. SDEI is dispatched out of EL3 and lands + * regardless of the target's DAIF, without taxing the IRQ-mask path. 
+ */ +bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu) +{ + if (!sdei_nmi_available) + return false; + + nmi_trigger_cpumask_backtrace(mask, exclude_cpu, + sdei_nmi_raise_backtrace); + return true; +} + +/* + * device_initcall (after arch_initcall(sdei_init), so the SDEI subsystem + * is up): probe the firmware, register the event, and turn on the + * cross-CPU service. If the probe fails the driver stays inert and the + * override hooks decline, leaving the arch's own paths in place. + */ +static int __init sdei_nmi_init(void) +{ + int err; + + err = sdei_event_register(SDEI_NMI_EVENT, sdei_nmi_handler, NULL); + if (err) { + pr_err("sdei_event_register(%u) failed: %d\n", + SDEI_NMI_EVENT, err); + return 0; + } + + err = sdei_event_enable(SDEI_NMI_EVENT); + if (err) { + pr_err("sdei_event_enable(%u) failed: %d\n", + SDEI_NMI_EVENT, err); + sdei_event_unregister(SDEI_NMI_EVENT); + return 0; + } + + sdei_nmi_available = true; + pr_info("using SDEI cross-CPU NMI (SDEI_EVENT_SIGNAL, event %u)\n", + SDEI_NMI_EVENT); + + return 0; +} +device_initcall(sdei_nmi_init); -- 2.54.0 From kirill at shutemov.name Wed Jun 3 07:36:34 2026 From: kirill at shutemov.name (Kiryl Shutsemau) Date: Wed, 3 Jun 2026 15:36:34 +0100 Subject: [PATCH 3/4] arm64: wire SDEI NMI into the hardlockup watchdog In-Reply-To: References: Message-ID: <6172eafcb9de6e626c0f1c36426d67e1e562ed32.1780496779.git.kas@kernel.org> From: "Kiryl Shutsemau (Meta)" Select HAVE_HARDLOCKUP_DETECTOR_ARCH so the framework takes its backend from this driver. A per-CPU hrtimer checks its buddy's heartbeat and signals event 0 at a stalled CPU, which runs watchdog_hardlockup_check() NMI-like. The source is chosen at boot: SDEI if firmware provides it, otherwise a perf-NMI counter (pseudo-NMI) fallback -- one image covers both. 
Signed-off-by: Kiryl Shutsemau (Meta) --- arch/arm64/Kconfig | 1 + drivers/firmware/Kconfig | 3 + drivers/firmware/sdei_nmi.c | 247 +++++++++++++++++++++++++++++++++++- 3 files changed, 248 insertions(+), 3 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index fe60738e5943..ebefe1e20806 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -205,6 +205,7 @@ config ARM64 select HAVE_FUNCTION_GRAPH_FREGS select HAVE_FUNCTION_GRAPH_TRACER select HAVE_GCC_PLUGINS + select HAVE_HARDLOCKUP_DETECTOR_ARCH if ARM_SDEI_NMI select HAVE_HARDLOCKUP_DETECTOR_PERF if PERF_EVENTS && \ HW_PERF_EVENTS && HAVE_PERF_EVENTS_NMI select HAVE_HW_BREAKPOINT if PERF_EVENTS diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig index 6501087ff90d..552eff7b9bc3 100644 --- a/drivers/firmware/Kconfig +++ b/drivers/firmware/Kconfig @@ -39,6 +39,7 @@ config ARM_SDE_INTERFACE config ARM_SDEI_NMI bool "SDEI-based cross-CPU NMI service (arm64)" depends on ARM64 && ARM_SDE_INTERFACE + select HARDLOCKUP_DETECTOR_COUNTS_HRTIMER if HARDLOCKUP_DETECTOR help Provides SDEI-based cross-CPU NMI delivery for hooks that need to reach interrupt-masked CPUs on silicon that lacks FEAT_NMI: @@ -46,6 +47,8 @@ config ARM_SDEI_NMI - arch_trigger_cpumask_backtrace() (sysrq-l, RCU stalls, hardlockup_all_cpu_backtrace, soft-lockup secondary dumps, hung-task auxiliary dumps) + - the hardlockup watchdog backend, when HARDLOCKUP_DETECTOR is + also enabled The driver registers a handler for the SDEI software-signalled event (event 0) and reaches a target CPU by signalling it with diff --git a/drivers/firmware/sdei_nmi.c b/drivers/firmware/sdei_nmi.c index e5c3f28b3991..51e220d4083d 100644 --- a/drivers/firmware/sdei_nmi.c +++ b/drivers/firmware/sdei_nmi.c @@ -29,6 +29,14 @@ * hardlockup_all_cpu_backtrace, soft-lockup/hung-task secondary * dumps all reach interrupt-masked CPUs. 
* + * - the hardlockup-detector backend (watchdog_hardlockup_enable/ + * disable/probe()), when CONFIG_HARDLOCKUP_DETECTOR is also on. + * ARM_SDEI_NMI selects HAVE_HARDLOCKUP_DETECTOR_ARCH, so the + * framework picks this backend. The detection source is chosen at + * boot: SDEI when the firmware has it, otherwise a perf-PMU NMI + * counter if one is available (pseudo-NMI enabled). One kernel image + * thus serves SDEI and non-SDEI hosts. + * * Delivery uses the standard SDEI software-signalled event (event 0) and * SDEI_EVENT_SIGNAL. We register a handler for event 0, enable it, and * poke a target CPU with sdei_event_signal(0, mpidr): firmware makes @@ -42,12 +50,18 @@ #define pr_fmt(fmt) "sdei_nmi: " fmt #include +#include #include +#include #include #include #include +#include +#include +#include #include #include +#include #include #include @@ -61,11 +75,17 @@ static bool sdei_nmi_available; static int sdei_nmi_handler(u32 event, struct pt_regs *regs, void *arg) { /* - * nmi_cpu_backtrace() no-ops unless this CPU's bit is set in the - * global backtrace mask (driven by nmi_trigger_cpumask_backtrace()), - * so a fire that reaches a CPU not being backtraced is harmless. + * Both consumers no-op on a CPU that wasn't actually requested: + * nmi_cpu_backtrace() unless this CPU's bit is set in the global + * backtrace mask, and watchdog_hardlockup_check() unless this CPU's + * hrtimer_interrupts counter has stalled. The latter is only + * declared when the watchdog backend is built in (COUNTS_HRTIMER, + * pulled by ARM_SDEI_NMI when HARDLOCKUP_DETECTOR is enabled). 
*/ nmi_cpu_backtrace(regs); +#ifdef CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER + watchdog_hardlockup_check(smp_processor_id(), regs); +#endif return SDEI_EV_HANDLED; } @@ -113,6 +133,220 @@ bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu) return true; } +#ifdef CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER + +/* + * SDEI watchdog source: a per-CPU hrtimer pets its own heartbeat and + * checks its buddy's; on a stall it signals event 0 at the buddy, + * whose SDEI handler then runs watchdog_hardlockup_check(). + */ +#define SDEI_NMI_WATCHDOG_TICK_MS 1000 + +static cpumask_t __read_mostly sdei_nmi_watchdog_cpus; +static DEFINE_PER_CPU(struct hrtimer, sdei_nmi_watchdog_hrtimer); +static DEFINE_PER_CPU(u64, sdei_nmi_watchdog_heartbeat_ns); + +static unsigned int sdei_nmi_watchdog_next_cpu(unsigned int cpu) +{ + unsigned int next = cpumask_next_wrap(cpu, &sdei_nmi_watchdog_cpus); + + if (next == cpu) + return nr_cpu_ids; + return next; +} + +static enum hrtimer_restart sdei_nmi_watchdog_hrtimer_fn(struct hrtimer *t) +{ + unsigned int this_cpu = smp_processor_id(); + unsigned int buddy; + u64 now = local_clock(); + u64 buddy_hb, thresh_ns; + + this_cpu_write(sdei_nmi_watchdog_heartbeat_ns, now); + + buddy = sdei_nmi_watchdog_next_cpu(this_cpu); + if (buddy >= nr_cpu_ids) + goto restart; + + /* pair with smp_wmb() in start_watchdog/stop_watchdog */ + smp_rmb(); + + buddy_hb = per_cpu(sdei_nmi_watchdog_heartbeat_ns, buddy); + thresh_ns = (u64)watchdog_thresh * NSEC_PER_SEC; + + if (now > buddy_hb + thresh_ns) { + /* + * Fire every tick while the buddy looks stale: the framework's + * watchdog_hardlockup_check() needs two consecutive calls + * before it'll declare a lockup (first call updates + * hrtimer_interrupts_saved; second confirms the counter + * hasn't moved). One-shot firing wedges the detection at + * step 1. The cost of an extra SMC per second on a truly + * wedged CPU is negligible; the alternative is silent + * non-detection. 
+ */ + pr_warn_ratelimited("watchdog: CPU %u no heartbeat for %llu ms (thresh %us), firing NMI from CPU %u\n", + buddy, + (now - buddy_hb) / NSEC_PER_MSEC, + watchdog_thresh, this_cpu); + sdei_nmi_fire(buddy); + } + +restart: + hrtimer_forward_now(t, ms_to_ktime(SDEI_NMI_WATCHDOG_TICK_MS)); + return HRTIMER_RESTART; +} + +static void sdei_nmi_watchdog_enable(unsigned int cpu) +{ + struct hrtimer *t = this_cpu_ptr(&sdei_nmi_watchdog_hrtimer); + + if (cpumask_test_cpu(cpu, &sdei_nmi_watchdog_cpus)) + return; + + this_cpu_write(sdei_nmi_watchdog_heartbeat_ns, local_clock()); + + hrtimer_setup(t, sdei_nmi_watchdog_hrtimer_fn, CLOCK_MONOTONIC, + HRTIMER_MODE_REL_PINNED); + + /* pair with smp_rmb() in the hrtimer callback */ + smp_wmb(); + cpumask_set_cpu(cpu, &sdei_nmi_watchdog_cpus); + + hrtimer_start(t, ms_to_ktime(SDEI_NMI_WATCHDOG_TICK_MS), + HRTIMER_MODE_REL_PINNED); +} + +static void sdei_nmi_watchdog_disable(unsigned int cpu) +{ + if (!cpumask_test_cpu(cpu, &sdei_nmi_watchdog_cpus)) + return; + + cpumask_clear_cpu(cpu, &sdei_nmi_watchdog_cpus); + /* pair with smp_rmb() in the hrtimer callback */ + smp_wmb(); + + hrtimer_cancel(this_cpu_ptr(&sdei_nmi_watchdog_hrtimer)); +} + +/* + * Perf-NMI fallback source, used when SDEI is absent but the PMU IRQ is + * a (pseudo-)NMI. A per-CPU cycle counter overflows into the same + * watchdog_hardlockup_check(). This is the stock arm64 perf hardlockup + * detector, minimal-copied here because the framework's + * HARDLOCKUP_DETECTOR_PERF is compile-excluded once we select + * HAVE_HARDLOCKUP_DETECTOR_ARCH (it would otherwise provide a second + * definition of these same hooks). 
+ */ +static struct perf_event_attr perf_wd_attr = { + .type = PERF_TYPE_HARDWARE, + .config = PERF_COUNT_HW_CPU_CYCLES, + .size = sizeof(struct perf_event_attr), + .pinned = 1, + .disabled = 1, +}; + +static DEFINE_PER_CPU(struct perf_event *, perf_wd_event); + +static u64 perf_wd_period(int cpu) +{ + /* 5 GHz safe max when cpufreq is unavailable, as in watchdog_hld.c. */ + u64 hz = cpufreq_get_hw_max_freq(cpu) * 1000UL; + + return (hz ? hz : 5000000000UL) * watchdog_thresh; +} + +static void perf_wd_overflow(struct perf_event *event, + struct perf_sample_data *data, + struct pt_regs *regs) +{ + watchdog_hardlockup_check(smp_processor_id(), regs); +} + +static void perf_wd_enable(unsigned int cpu) +{ + struct perf_event *evt; + + if (this_cpu_read(perf_wd_event)) + return; + + perf_wd_attr.sample_period = perf_wd_period(cpu); + evt = perf_event_create_kernel_counter(&perf_wd_attr, cpu, NULL, + perf_wd_overflow, NULL); + if (IS_ERR(evt)) { + pr_warn_once("perf event create on CPU %u failed: %ld\n", + cpu, PTR_ERR(evt)); + return; + } + + this_cpu_write(perf_wd_event, evt); + perf_event_enable(evt); +} + +static void perf_wd_disable(unsigned int cpu) +{ + struct perf_event *evt = this_cpu_read(perf_wd_event); + + if (!evt) + return; + + perf_event_disable(evt); + perf_event_release_kernel(evt); + this_cpu_write(perf_wd_event, NULL); +} + +/* Set by the late_initcall below once the perf fallback is chosen. */ +static bool perf_wd_active; + +void watchdog_hardlockup_enable(unsigned int cpu) +{ + WARN_ON_ONCE(cpu != smp_processor_id()); + + if (sdei_nmi_available) + sdei_nmi_watchdog_enable(cpu); + else if (perf_wd_active) + perf_wd_enable(cpu); +} + +void watchdog_hardlockup_disable(unsigned int cpu) +{ + WARN_ON_ONCE(cpu != smp_processor_id()); + + if (sdei_nmi_available) + sdei_nmi_watchdog_disable(cpu); + else if (perf_wd_active) + perf_wd_disable(cpu); +} + +int __init watchdog_hardlockup_probe(void) +{ + return (sdei_nmi_available || perf_wd_active) ? 
0 : -ENODEV; +} + +/* + * Phase 2 of init, at late_initcall so it runs after both our own + * device_initcall (SDEI decision) and armv8_pmuv3's (which is what makes + * arm_pmu_irq_is_nmi() read true). If SDEI didn't claim the watchdog and + * the PMU IRQ is a (pseudo-)NMI, take the perf fallback. Deciding here, + * after both device_initcalls, keeps the choice deterministic -- no race + * over which initcall ran first, and no flip from perf to SDEI. + */ +static int __init perf_wd_init(void) +{ + if (sdei_nmi_available) + return 0; /* SDEI already owns the watchdog */ + + if (IS_ENABLED(CONFIG_ARM64_PSEUDO_NMI) && arm_pmu_irq_is_nmi()) { + perf_wd_active = true; + pr_info("no SDEI firmware; using perf-NMI watchdog fallback\n"); + lockup_detector_retry_init(); + } + return 0; +} +late_initcall(perf_wd_init); + +#endif /* CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER */ + /* * device_initcall (after arch_initcall(sdei_init), so the SDEI subsystem * is up): probe the firmware, register the event, and turn on the @@ -142,6 +376,13 @@ static int __init sdei_nmi_init(void) pr_info("using SDEI cross-CPU NMI (SDEI_EVENT_SIGNAL, event %u)\n", SDEI_NMI_EVENT); + /* + * lockup_detector_init() ran in early init and found no hardlockup + * backend yet; re-probe now that SDEI owns the watchdog. 
+ */ + if (IS_ENABLED(CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER)) + lockup_detector_retry_init(); + return 0; } device_initcall(sdei_nmi_init); -- 2.54.0 From kirill at shutemov.name Wed Jun 3 07:36:35 2026 From: kirill at shutemov.name (Kiryl Shutsemau) Date: Wed, 3 Jun 2026 15:36:35 +0100 Subject: [PATCH 4/4] arm64: route crash_smp_send_stop() last resort through SDEI In-Reply-To: References: Message-ID: <54cb99db3c981dc39eb3031aff5caeaadb09e8b9.1780496779.git.kas@kernel.org> From: "Kiryl Shutsemau (Meta)" Add SDEI as the final rung after the normal stop IPI (and the pseudo-NMI IPI, if enabled): signal event 0 at the CPUs still online, whose handler runs crash_save_cpu() on the wedged context and parks them. It only ever touches CPUs the normal path couldn't reach. SDEI is last because a CPU parked in the handler never completes the event, so it is less recoverable -- a cost paid only when nothing else worked. Signed-off-by: Kiryl Shutsemau (Meta) --- arch/arm64/include/asm/nmi.h | 6 ++ arch/arm64/kernel/smp.c | 24 ++++++ drivers/firmware/Kconfig | 1 + drivers/firmware/sdei_nmi.c | 137 ++++++++++++++++++++++++++++++++++- 4 files changed, 167 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/nmi.h b/arch/arm64/include/asm/nmi.h index ccdb75692e9d..e3edfb24fc08 100644 --- a/arch/arm64/include/asm/nmi.h +++ b/arch/arm64/include/asm/nmi.h @@ -13,12 +13,18 @@ */ #ifdef CONFIG_ARM_SDEI_NMI bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu); +bool sdei_nmi_crash_smp_send_stop(void); #else static inline bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu) { return false; } + +static inline bool sdei_nmi_crash_smp_send_stop(void) +{ + return false; +} #endif #endif /* __ASM_NMI_H */ diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 656b8417af72..386ddd526b48 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -1288,8 +1288,32 @@ void crash_smp_send_stop(void) 
return; crash_stop = 1; + /* + * Stop the normal way first: IPI_CPU_STOP escalating to a pseudo-NMI + * IPI. Every CPU that responds saves its state via crash_save_cpu() + * and parks in cpu_park_loop() with its online bit cleared -- the + * standard kdump stop, identical to a kernel without SDEI. Crucially + * those CPUs stay in a clean, potentially-reusable state. + */ smp_send_stop(); + /* + * Whatever is still online didn't respond -- typically a CPU wedged + * with interrupts masked. The plain IPI can't reach it, and a fleet + * that declines the pseudo-NMI hot-path cost has no NMI IPI to + * escalate to. Hit only the survivors with the SDEI cross-CPU NMI + * (no-op if SDEI isn't active, or if everything already stopped): + * firmware delivers out of EL3 regardless of PSTATE.DAIF, and the + * handler captures crash_save_cpu() state from the wedged context + * before parking the CPU. + * + * SDEI is deliberately last: an SDEI-stopped CPU never completes its + * event (it parks inside the handler, so EL3 retains its dispatch + * slot until reset), which is strictly less recoverable than a normal + * stop. We pay that only for CPUs that left no other way to reach them. 
+ */ + sdei_nmi_crash_smp_send_stop(); + sdei_handler_abort(); } diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig index 552eff7b9bc3..84aead609406 100644 --- a/drivers/firmware/Kconfig +++ b/drivers/firmware/Kconfig @@ -49,6 +49,7 @@ config ARM_SDEI_NMI hung-task auxiliary dumps) - the hardlockup watchdog backend, when HARDLOCKUP_DETECTOR is also enabled + - crash_smp_send_stop() (panic / kdump path) The driver registers a handler for the SDEI software-signalled event (event 0) and reaches a target CPU by signalling it with diff --git a/drivers/firmware/sdei_nmi.c b/drivers/firmware/sdei_nmi.c index 51e220d4083d..ad8fbb1c90a6 100644 --- a/drivers/firmware/sdei_nmi.c +++ b/drivers/firmware/sdei_nmi.c @@ -29,6 +29,11 @@ * hardlockup_all_cpu_backtrace, soft-lockup/hung-task secondary * dumps all reach interrupt-masked CPUs. * + * - sdei_nmi_crash_smp_send_stop() - override for arm64's + * crash_smp_send_stop(); the panic/kdump last resort for CPUs that + * didn't answer the normal stop IPI, capturing the wedged context + * into the vmcore before parking the CPU. + * * - the hardlockup-detector backend (watchdog_hardlockup_enable/ * disable/probe()), when CONFIG_HARDLOCKUP_DETECTOR is also on. * ARM_SDEI_NMI selects HAVE_HARDLOCKUP_DETECTOR_ARCH, so the @@ -50,11 +55,15 @@ #define pr_fmt(fmt) "sdei_nmi: " fmt #include +#include #include #include +#include +#include #include #include #include +#include #include #include #include @@ -72,8 +81,66 @@ static bool sdei_nmi_available; #define SDEI_NMI_EVENT 0 +/* + * Crash-stop dispatch lives on the same SDEI event 0 as everything else. + * The requesting CPU sets sdei_nmi_crash_stop_requested for each target + * before signalling event 0; the target's handler clears it, saves crash + * state, parks, and sets sdei_nmi_crash_stop_acked so the requester knows + * the target is down. + * + * Using a per-CPU flag rather than a separate SDEI event avoids needing + * extra registrations from firmware.
The SDEI_EVENT_SIGNAL SMC is itself + * a write barrier, so a WRITE_ONCE() before the signal is sufficient + * ordering against the handler's READ_ONCE() on the target. + */ +static DEFINE_PER_CPU(unsigned long, sdei_nmi_crash_stop_requested); +static DEFINE_PER_CPU(unsigned long, sdei_nmi_crash_stop_acked); + static int sdei_nmi_handler(u32 event, struct pt_regs *regs, void *arg) { + int cpu = smp_processor_id(); + + if (READ_ONCE(*this_cpu_ptr(&sdei_nmi_crash_stop_requested))) { + WRITE_ONCE(*this_cpu_ptr(&sdei_nmi_crash_stop_requested), 0); + + /* + * Capture the wedged context for kdump while pt_regs still + * points at the interrupted PC. This is the main motivation + * for using SDEI here: the plain IPI stop path can't reach an + * interrupt-masked CPU (and the fleet declines pseudo-NMI to + * keep the IRQ-mask hot path cheap), so crash_save_cpu() for + * that CPU would otherwise record nothing useful. + */ + crash_save_cpu(regs, cpu); + set_cpu_online(cpu, false); + + /* publish the crash state/offline before the requester sees the ack */ + smp_wmb(); + WRITE_ONCE(*this_cpu_ptr(&sdei_nmi_crash_stop_acked), 1); + + /* + * Park forever from within the SDEI handler. We deliberately + * do NOT issue SDEI_EVENT_COMPLETE: the framework's return + * path restores firmware's saved interrupted context, which + * would land the CPU back wherever it was running (often + * do_idle, which then notices cpu_is_offline=true and BUGs + * at cpuhp_report_idle_dead). Returning the modified pt_regs + * doesn't help -- arch/arm64/kernel/sdei.c::do_sdei_event + * only honours a PC override via its IRQ-state heuristic + * and otherwise hands EL3 its own saved-context slot back. + * + * Trade-off: EL3 firmware retains ~one saved-context slot + * per parked CPU until the next hardware reset (~hundreds of + * bytes per CPU). 
The CPU itself is parked in cpu_park_loop + * exactly as if IPI_CPU_STOP had stopped it; recoverability + * is unchanged versus the existing path (neither is + * recoverable without hardware reset, since PSCI sees the + * CPU as ALREADY_ON in both cases). + */ + cpu_park_loop(); + /* unreachable */ + } + /* * Both consumers no-op on a CPU that wasn't actually requested: * nmi_cpu_backtrace() unless this CPU's bit is set in the global @@ -84,7 +151,7 @@ static int sdei_nmi_handler(u32 event, struct pt_regs *regs, void *arg) */ nmi_cpu_backtrace(regs); #ifdef CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER - watchdog_hardlockup_check(smp_processor_id(), regs); + watchdog_hardlockup_check(cpu, regs); #endif return SDEI_EV_HANDLED; } @@ -133,6 +200,74 @@ bool sdei_nmi_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu) return true; } +/* + * Last-resort half of arm64's crash_smp_send_stop() (see + * arch/arm64/kernel/smp.c). The caller runs the normal IPI / pseudo-NMI + * stop first; whatever is left in cpu_online_mask by the time we're + * called are the CPUs that didn't respond -- wedged with interrupts + * masked, unreachable by those paths. We snapshot that residual mask, + * set each survivor's per-CPU crash-stop request flag, signal event 0 + * at it, and poll for acks. The handler captures crash_save_cpu() state + * and parks the CPU (without completing the SDEI event, see + * sdei_nmi_handler()). + * + * Because SDEI-stopped CPUs are less recoverable than normally-stopped + * ones, this is intentionally the fallback, not the first choice -- it + * only ever runs against CPUs the normal path already gave up on. + * + * Returns true when SDEI was active and this path ran (even if some CPU + * failed to ack within the timeout, or there were no survivors to stop); + * false when SDEI isn't active, leaving the caller's normal-path result + * as the final word. 
+ */ +bool sdei_nmi_crash_smp_send_stop(void) +{ + unsigned int this_cpu, cpu, remaining; + unsigned long timeout; + cpumask_t mask; + + if (!sdei_nmi_available) + return false; + + this_cpu = smp_processor_id(); + cpumask_copy(&mask, cpu_online_mask); + cpumask_clear_cpu(this_cpu, &mask); + if (cpumask_empty(&mask)) + return true; + + for_each_cpu(cpu, &mask) { + WRITE_ONCE(per_cpu(sdei_nmi_crash_stop_acked, cpu), 0); + WRITE_ONCE(per_cpu(sdei_nmi_crash_stop_requested, cpu), 1); + } + /* Publish flags before the SMCs read them on the target side. */ + smp_wmb(); + + for_each_cpu(cpu, &mask) + sdei_nmi_fire(cpu); + + /* + * Poll up to 100ms -- same order as the kernel's existing pseudo-NMI + * stop wait (10ms) plus headroom for the SDEI round-trip on slow + * firmware. + */ + timeout = USEC_PER_MSEC * 100; + while (timeout--) { + remaining = 0; + for_each_cpu(cpu, &mask) + if (!READ_ONCE(per_cpu(sdei_nmi_crash_stop_acked, cpu))) + remaining++; + if (!remaining) + break; + udelay(1); + } + + if (remaining) + pr_warn("crash_stop: %u CPU(s) did not ack within 100ms\n", + remaining); + + return true; +} + #ifdef CONFIG_HARDLOCKUP_DETECTOR_COUNTS_HRTIMER /* -- 2.54.0 From pasha.tatashin at soleen.com Wed Jun 3 08:03:36 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 15:03:36 +0000 Subject: [PATCH v6 07/13] kho: add support for linked-block serialization In-Reply-To: <178049725439.475072.11560134126837430744.b4-reply@b4> References: <20260603032905.344462-1-pasha.tatashin@soleen.com> <20260603032905.344462-8-pasha.tatashin@soleen.com> <178046937151.468621.13398573538792303093.b4-review@b4> <178049725439.475072.11560134126837430744.b4-reply@b4> Message-ID: On 06-03 17:34, Mike Rapoport wrote: > On 2026-06-03 14:11 +0000, Pasha Tatashin wrote: > > On 06-03 16:59, Mike Rapoport wrote: > > > On Wed, Jun 03, 2026 at 12:05:04PM +0000, Pasha Tatashin wrote: > > > > On 06-03 09:49, Mike Rapoport wrote: > > > > > On Wed, 03 Jun 2026 03:28:58 +0000, 
Pasha Tatashin wrote: > > > > > > diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h > > > > > > new file mode 100644 > > > > > > index 000000000000..8641c20b379b > > > > > > --- /dev/null > > > > > > +++ b/include/linux/kho/abi/block.h > > > > > > @@ -0,0 +1,56 @@ > > > > > > [ ... skip 25 lines ... ] > > > > > > +#define _LINUX_KHO_ABI_BLOCK_H > > > > > > + > > > > > > +#include > > > > > > +#include > > > > > > + > > > > > > +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" > > > > > > > > > > It's never used by block set and after looking at the following patches I > > > > > found that it's appended to LUO compatible string. > > > > > > > > > > While this works for LUO, I think it should be kho_block_set_restore() > > > > > responsibility to verify the compatibility. > > > > > > > > It should work for any component that relies on kho_block. My proposal > > > > is to use this method for other common KHO data structures (e.g., kho > > > > vmalloc, kho radix, future kho xarray). There is no need for them to > > > > carry the compatibility string in their metadata, as whoever uses them > > > > will include their compatibility string. > > > > > > So if, say, memfd_luo uses kho vmalloc, xarray and blocks it'll have five > > > compatibility strings glued together? > > > > That is correct, but it will be in only one place: the header of the > > client's KHO subtree. Since it is dynamically sized and 8-byte aligned, > > it should be safe to include in any struct. > > This is safe, you are right. > But I have more usability concerns from one side and the duplication it > causes from the other. > > I can see the downside of putting the version information in the data > structure itself as it either requires a different header for the first > element or needlessly increases all the headers. 
> > But > > #define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE "-" KHO_VMALLOC_ABI_COMPATIBLE "-" KHO_RADIX_COMPATIBLE > > is not really digestible too. And it forces KHO users to potentially > track KHO internal changes. These are compatibilities; I think they are quite digestible, both to write and also when the LUO_ABI_COMPATIBLE string is printed out for debugging/info purposes. > We still don't promise any compatibility between different kernel > versions so to avoid blocking this series on the decision what is the > best way to convey KHO data structures compatibility I suggest to bump > kho ABI version in v6.2* of the patch that adds KHO blocks and postpone > this discussion to after rc1 when we'll have plenty of time. Let's keep this patch as is for now. We will have a broader discussion when we convert other participants to this new scheme. If we decide not to pursue this approach, we will change this code to use an independent compatibility string. However, having this in place as a template will help us convert other components correctly, ensuring proper alignment and that correct string helpers like strncmp / strscpy are used, which I have already ensured is the case in LUO. > * sending a new version of a single file does same email traffic, but it > confuses b4 and quite possibly other tools, so I think v7 is better. Agreed, I also prefer re-sending the whole series... Pasha > > Pasha > > > > > > > > > For now, reviewers will have to make sure that if the ABI header content > > > > is changed, the compatibility string is updated. > > > > > -- > > > Sincerely yours, > > > Mike.
> > > > From rppt at kernel.org Wed Jun 3 08:18:38 2026 From: rppt at kernel.org (Mike Rapoport) Date: Wed, 03 Jun 2026 18:18:38 +0300 Subject: [PATCH v6 07/13] kho: add support for linked-block serialization In-Reply-To: References: <20260603032905.344462-1-pasha.tatashin@soleen.com> <20260603032905.344462-8-pasha.tatashin@soleen.com> <178046937151.468621.13398573538792303093.b4-review@b4> <178049725439.475072.11560134126837430744.b4-reply@b4> Message-ID: <178049991886.481270.8804649242857550471.b4-reply@b4> On 2026-06-03 15:03 +0000, Pasha Tatashin wrote: > On 06-03 17:34, Mike Rapoport wrote: > > On 2026-06-03 14:11 +0000, Pasha Tatashin wrote: > > > On 06-03 16:59, Mike Rapoport wrote: > > > > On Wed, Jun 03, 2026 at 12:05:04PM +0000, Pasha Tatashin wrote: > > > > > On 06-03 09:49, Mike Rapoport wrote: > > > > > > On Wed, 03 Jun 2026 03:28:58 +0000, Pasha Tatashin wrote: > > > > > > > diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h > > > > > > > new file mode 100644 > > > > > > > index 000000000000..8641c20b379b > > > > > > > --- /dev/null > > > > > > > +++ b/include/linux/kho/abi/block.h > > > > > > > @@ -0,0 +1,56 @@ > > > > > > > [ ... skip 25 lines ... ] > > > > > > > +#define _LINUX_KHO_ABI_BLOCK_H > > > > > > > + > > > > > > > +#include > > > > > > > +#include > > > > > > > + > > > > > > > +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" > > > > > > > > > > > > It's never used by block set and after looking at the following patches I > > > > > > found that it's appended to LUO compatible string. > > > > > > > > > > > > While this works for LUO, I think it should be kho_block_set_restore() > > > > > > responsibility to verify the compatibility. > > > > > > > > > > It should work for any component that relies on kho_block. My proposal > > > > > is to use this method for other common KHO data structures (e.g., kho > > > > > vmalloc, kho radix, future kho xarray). 
There is no need for them to > > > > > carry the compatibility string in their metadata, as whoever uses them > > > > > will include their compatibility string. > > > > > > > > So if, say, memfd_luo uses kho vmalloc, xarray and blocks it'll have five > > > > compatibility strings glued together? > > > > > > That is correct, but it will be in only one place: the header of the > > > client's KHO subtree. Since it is dynamically sized and 8-byte aligned, > > > it should be safe to include in any struct. > > > > This is safe, you are right. > > But I have more usability concerns from one side and the duplication it > > causes from the other. > > > > I can see the downside of putting the version information in the data > > structure itself as it either requires a different header for the first > > element or needlessly increases all the headers. > > > > But > > > > #define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE "-" KHO_VMALLOC_ABI_COMPATIBLE "-" KHO_RADIX_COMPATIBLE > > > > is not really digestible too. And it forces KHO users to potentially > > track KHO internal changes. > > These are compatibilities; I think they are quite digestible, both to > write and also when the LUO_ABI_COMPATIBLE string is printed out for > debugging/info purposes. I agree to disagree :) It's a KHO property, not its users'. > > We still don't promise any compatibility between different kernel > > versions so to avoid blocking this series on the decision what is the > > best way to convey KHO data structures compatibility I suggest to bump > > kho ABI version in v6.2* of the patch that adds KHO blocks and postpone > > this discussion to after rc1 when we'll have plenty of time. > > Let's keep this patch as is for now. We will have a broader discussion > when we convert other participants to this new scheme. If we decide not > to pursue this approach, we will change this code to use an independent > compatibility string.
However, having this in place as a template will > help us convert other components correctly, ensuring proper alignment > and that correct string helpers like strncmp / strscpy are used, which I > have already ensured is the case in LUO. Pasha, this sounds like a salami approach :) We haven't yet agreed to convert other components, or even to use this scheme globally. Changing this during -rc does not seem like good practice. So whatever new versioning scheme we come up with, it'll have to wait until v7.3. Let's bump kho and LUO ABI versions and drop the concatenation for now. It's a small change to the patches, so I don't see it as a blocker for merging them in v7.2. > > * sending a new version of a single file does same email traffic, but it > > confuses b4 and quite possibly other tools, so I think v7 is better. > > Agreed, I also prefer re-sending the whole series... > > Pasha > > > > Pasha > > > > > > > > > > > > For now, reviewers will have to make sure that if the ABI header content > > > > > is changed, the compatibility string is updated. > > > > > > > -- > > > > Sincerely yours, > > > > Mike. > > > > > > > > From pasha.tatashin at soleen.com Wed Jun 3 08:43:49 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 15:43:49 +0000 Subject: [PATCH v7 00/13] liveupdate: Remove limits on sessions and files Message-ID: <20260603154402.468928-1-pasha.tatashin@soleen.com> Hi all, This series removes the fixed limits on the number of files that can be preserved within a single session, and the total number of sessions managed by the Live Update Orchestrator (LUO). The core of the change is a transition from single contiguous memory blocks for metadata serialization to a chain of linked blocks. This allows LUO to scale dynamically. 1. ABI Evolution: - Introduced linked-block headers for both file and session serialization. - Bumped session ABI version to v4. 2.
Memory Management & Security:
   - Implemented a dynamic block allocation and reuse strategy. Blocks are
     allocated only when existing ones are exhausted and are reused during
     session/file removal cycles.
   - Introduced KHO_MAX_BLOCKS (10000) as a safeguard against excessive
     allocations or corrupted cyclic lists during restore.

3. Expanded Selftests:
   - Added new kexec-based tests verifying preservation of 2000 sessions
     and 500 files per session.
   - Added self-tests for many sessions and many files management.

Tree: git.kernel.org/pub/scm/linux/kernel/git/tatashin/linux.git
Branch: luo-remove-max-files-sessions-limits/v6

Changes v7:
- Addressed comments from Mike.
- Made changes in kho_block.c, and updated to use the global KHO
  compatibility.
- Collected Reviewed-by and Acked-by tags.

Please review.

Thanks,
Pasha

Pasha Tatashin (13):
  liveupdate: change file_set->count type to u64 for type safety
  liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd
  liveupdate: centralize state management into struct luo_ser
  liveupdate: register luo_ser as KHO subtree
  liveupdate: Extract luo_file_deserialize_one helper
  liveupdate: Extract luo_session_deserialize_one helper
  kho: add support for linked-block serialization
  liveupdate: defer session block allocation and physical address setting
  liveupdate: Remove limit on the number of sessions
  liveupdate: Remove limit on the number of files per session
  selftests/liveupdate: Test session and file limit removal
  selftests/liveupdate: Add stress-sessions kexec test
  selftests/liveupdate: Add stress-files kexec test

 Documentation/core-api/kho/abi.rst | 5 +
 Documentation/core-api/kho/index.rst | 11 +
 MAINTAINERS | 1 +
 include/linux/kho/abi/block.h | 54 +++
 include/linux/kho/abi/kexec_handover.h | 2 +-
 include/linux/kho/abi/luo.h | 148 ++-----
 include/linux/kho_block.h | 106 +++++
 kernel/liveupdate/Makefile | 1 +
 kernel/liveupdate/kho_block.c | 416 ++++++++++++++++++
 kernel/liveupdate/luo_core.c | 99 ++---
 kernel/liveupdate/luo_file.c
| 205 ++++----- kernel/liveupdate/luo_flb.c | 60 +-- kernel/liveupdate/luo_internal.h | 16 +- kernel/liveupdate/luo_session.c | 219 +++++---- tools/testing/selftests/liveupdate/Makefile | 2 + .../testing/selftests/liveupdate/liveupdate.c | 75 ++++ .../selftests/liveupdate/luo_stress_files.c | 97 ++++ .../liveupdate/luo_stress_sessions.c | 102 +++++ .../selftests/liveupdate/luo_test_utils.c | 24 + .../selftests/liveupdate/luo_test_utils.h | 2 + 20 files changed, 1199 insertions(+), 446 deletions(-) create mode 100644 include/linux/kho/abi/block.h create mode 100644 include/linux/kho_block.h create mode 100644 kernel/liveupdate/kho_block.c create mode 100644 tools/testing/selftests/liveupdate/luo_stress_files.c create mode 100644 tools/testing/selftests/liveupdate/luo_stress_sessions.c base-commit: 2935777b418d2bfcbfe96705bb2c0fa6c0d94e18 -- 2.53.0 From pasha.tatashin at soleen.com Wed Jun 3 08:43:50 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 15:43:50 +0000 Subject: [PATCH v7 01/13] liveupdate: change file_set->count type to u64 for type safety In-Reply-To: <20260603154402.468928-1-pasha.tatashin@soleen.com> References: <20260603154402.468928-1-pasha.tatashin@soleen.com> Message-ID: <20260603154402.468928-2-pasha.tatashin@soleen.com> This improves type safety and aligns the in-memory file_set->count with the serialized count type. It avoids potential truncation or sign conversion mismatch issues. 
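
A minimal userspace sketch of the sign-conversion mismatch mentioned above. It is not code from this patch; the helper names count_to_ser_signed() and count_to_ser_u64() are hypothetical and exist only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: an in-memory 'long' count copied into a u64
 * serialized field. A negative value (e.g. from a bookkeeping bug) is
 * converted modulo 2^64 and turns into a huge count.
 */
static uint64_t count_to_ser_signed(long count)
{
	return (uint64_t)count;
}

/*
 * Hypothetical helper: with a u64 in-memory count the in-memory and
 * serialized types match, so no conversion takes place.
 */
static uint64_t count_to_ser_u64(uint64_t count)
{
	return count;
}
```

With the signed variant, count_to_ser_signed(-1) yields UINT64_MAX, which is the kind of value a restore path would then have to reject; with matching types the value is passed through unchanged.
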
Reviewed-by: Pratyush Yadav (Google)
Acked-by: Mike Rapoport (Microsoft)
Signed-off-by: Pasha Tatashin
---
 kernel/liveupdate/luo_internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
index dd53d4a7277e..ae58206f14ac 100644
--- a/kernel/liveupdate/luo_internal.h
+++ b/kernel/liveupdate/luo_internal.h
@@ -52,7 +52,7 @@ static inline int luo_ucmd_respond(struct luo_ucmd *ucmd,
 struct luo_file_set {
 	struct list_head files_list;
 	struct luo_file_ser *files;
-	long count;
+	u64 count;
 };
 
 /**
-- 
2.53.0

From pasha.tatashin at soleen.com Wed Jun 3 08:43:51 2026
From: pasha.tatashin at soleen.com (Pasha Tatashin)
Date: Wed, 3 Jun 2026 15:43:51 +0000
Subject: [PATCH v7 02/13] liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd
In-Reply-To: <20260603154402.468928-1-pasha.tatashin@soleen.com>
References: <20260603154402.468928-1-pasha.tatashin@soleen.com>
Message-ID: <20260603154402.468928-3-pasha.tatashin@soleen.com>

Refactor luo_session_retrieve_fd() to avoid mixing automated cleanup-style guards with goto-based resource release, which is not recommended under the Linux kernel coding style.
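
The resulting shape can be sketched in userspace as follows. This is an illustration only, assuming a pthread mutex in place of the kernel mutex; struct session, retrieve_file() and the fds_released counter are hypothetical stand-ins, and the cleanup performed under the err_put_fd label is assumed rather than taken from the patch:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical session: 'files' is the state protected by the mutex. */
struct session {
	pthread_mutex_t mutex;
	int files;
};

/* Counts how often the error path released the reserved descriptor. */
static int fds_released;

/* Stand-in for luo_retrieve_file(): fails once the session is empty. */
static int retrieve_file(struct session *s)
{
	if (s->files == 0)
		return -ENOENT;
	s->files--;
	return 0;
}

static int session_retrieve_fd(struct session *s)
{
	int err;

	/* Explicit lock/unlock: the critical section ends before any goto. */
	pthread_mutex_lock(&s->mutex);
	err = retrieve_file(s);
	pthread_mutex_unlock(&s->mutex);
	if (err < 0)
		goto err_put_fd;

	return 0;

err_put_fd:
	fds_released++;	/* assumed cleanup: give the reserved fd back */
	return err;
}
```

Because the mutex is dropped explicitly before the error check, the goto label only has to undo the descriptor reservation and never runs with the lock held.
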
Reviewed-by: Pratyush Yadav (Google) Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/luo_session.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 5c6cebc6e326..47566db64598 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -291,10 +291,11 @@ static int luo_session_retrieve_fd(struct luo_session *session, if (argp->fd < 0) return argp->fd; - guard(mutex)(&session->mutex); + mutex_lock(&session->mutex); err = luo_retrieve_file(&session->file_set, argp->token, &file); + mutex_unlock(&session->mutex); if (err < 0) - goto err_put_fd; + goto err_put_fd; err = luo_ucmd_respond(ucmd, sizeof(*argp)); if (err) -- 2.53.0 From pasha.tatashin at soleen.com Wed Jun 3 08:43:52 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 15:43:52 +0000 Subject: [PATCH v7 03/13] liveupdate: centralize state management into struct luo_ser In-Reply-To: <20260603154402.468928-1-pasha.tatashin@soleen.com> References: <20260603154402.468928-1-pasha.tatashin@soleen.com> Message-ID: <20260603154402.468928-4-pasha.tatashin@soleen.com> Transition the LUO to ABI v2, which centralizes state management into a single struct luo_ser header. Previously, LUO state was spread across multiple FDT properties and subnodes. ABI v2 simplifies this by placing all core state, including the liveupdate number and physical addresses for sessions and FLB headers into a centralized struct luo_ser. Note that this change introduces a semantic difference: the sessions and FLB serialization formats are no longer completely independent of the core LUO. Their metadata (such as physical addresses for sessions and FLB headers) is now coupled to and managed via the centralized struct luo_ser. 
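
As a rough userspace sketch of what consuming the centralized header could look like: the field layout mirrors the struct luo_ser added by this patch, while luo_ser_read() is a hypothetical helper that is not part of the series, and the real code obtains the structure via its physical address rather than from a byte buffer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace mirror of the struct luo_ser layout from this patch. */
struct luo_ser {
	uint64_t liveupdate_num;
	uint64_t sessions_pa;
	uint64_t flbs_pa;
} __attribute__((packed));

/* Three u64 fields and no padding: the header occupies 24 bytes. */
_Static_assert(sizeof(struct luo_ser) == 24, "unexpected luo_ser size");

/*
 * Hypothetical reader: copy the header out of a preserved byte buffer.
 * Returns 0 on success, -1 if the buffer is too small to hold it.
 */
static int luo_ser_read(const void *buf, size_t len, struct luo_ser *out)
{
	if (len < sizeof(*out))
		return -1;
	memcpy(out, buf, sizeof(*out));
	return 0;
}
```

A single read of this header gives the liveupdate number and both physical addresses, which is the coupling between core LUO, sessions and FLB that the note above describes.
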
Reviewed-by: Pratyush Yadav (Google) Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 91 +++++++++++--------------------- kernel/liveupdate/luo_core.c | 64 +++++++++++++++------- kernel/liveupdate/luo_flb.c | 60 +++------------------ kernel/liveupdate/luo_internal.h | 8 +-- kernel/liveupdate/luo_session.c | 64 ++++------------------ 5 files changed, 96 insertions(+), 191 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 46750a0ddf88..1b2f865a771a 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -30,52 +30,25 @@ * .. code-block:: none * * / { - * compatible = "luo-v1"; - * liveupdate-number = <...>; - * - * luo-session { - * compatible = "luo-session-v1"; - * luo-session-header = ; - * }; - * - * luo-flb { - * compatible = "luo-flb-v1"; - * luo-flb-header = ; - * }; + * compatible = "luo-v2"; + * luo-abi-header = ; * }; * * Main LUO Node (/): * - * - compatible: "luo-v1" + * - compatible: "luo-v2" * Identifies the overall LUO ABI version. - * - liveupdate-number: u64 - * A counter tracking the number of successful live updates performed. - * - * Session Node (luo-session): - * This node describes all preserved user-space sessions. - * - * - compatible: "luo-session-v1" - * Identifies the session ABI version. - * - luo-session-header: u64 - * The physical address of a `struct luo_session_header_ser`. This structure - * is the header for a contiguous block of memory containing an array of - * `struct luo_session_ser`, one for each preserved session. - * - * File-Lifecycle-Bound Node (luo-flb): - * This node describes all preserved global objects whose lifecycle is bound - * to that of the preserved files (e.g., shared IOMMU state). - * - * - compatible: "luo-flb-v1" - * Identifies the FLB ABI version. - * - luo-flb-header: u64 - * The physical address of a `struct luo_flb_header_ser`. 
This structure is - * the header for a contiguous block of memory containing an array of - * `struct luo_flb_ser`, one for each preserved global object. + * - luo-abi-header: u64 + * The physical address of `struct luo_ser`. * * Serialization Structures: * The FDT properties point to memory regions containing arrays of simple, * `__packed` structures. These structures contain the actual preserved state. * + * - struct luo_ser: + * The central ABI structure that contains the overall state of the LUO. + * It includes the liveupdate-number and pointers to sessions and FLBs. + * * - struct luo_session_header_ser: * Header for the session array. Contains the total page count of the * preserved memory block and the number of `struct luo_session_ser` @@ -109,13 +82,26 @@ /* * The LUO FDT hooks all LUO state for sessions, fds, etc. - * In the root it also carries "liveupdate-number" 64-bit property that - * corresponds to the number of live-updates performed on this machine. */ #define LUO_FDT_SIZE PAGE_SIZE #define LUO_FDT_KHO_ENTRY_NAME "LUO" -#define LUO_FDT_COMPATIBLE "luo-v1" -#define LUO_FDT_LIVEUPDATE_NUM "liveupdate-number" +#define LUO_FDT_COMPATIBLE "luo-v2" +#define LUO_FDT_ABI_HEADER "luo-abi-header" + +/** + * struct luo_ser - Centralized LUO ABI header. + * @liveupdate_num: A counter tracking the number of successful live updates. + * @sessions_pa: Physical address of the first session block header. + * @flbs_pa: Physical address of the FLB header. + * + * This structure is the root of all preserved LUO state. It is pointed to by + * the "luo-abi-header" property in the LUO FDT. 
+ */ +struct luo_ser { + u64 liveupdate_num; + u64 sessions_pa; + u64 flbs_pa; +} __packed; #define LIVEUPDATE_HNDL_COMPAT_LENGTH 48 @@ -147,15 +133,6 @@ struct luo_file_set_ser { u64 count; } __packed; -/* - * LUO FDT session node - * LUO_FDT_SESSION_HEADER: is a u64 physical address of struct - * luo_session_header_ser - */ -#define LUO_FDT_SESSION_NODE_NAME "luo-session" -#define LUO_FDT_SESSION_COMPATIBLE "luo-session-v2" -#define LUO_FDT_SESSION_HEADER "luo-session-header" - /** * struct luo_session_header_ser - Header for the serialized session data block. * @count: The number of `struct luo_session_ser` entries that immediately @@ -165,7 +142,7 @@ struct luo_file_set_ser { * physical memory preserved across the kexec. It provides the necessary * metadata to interpret the array of session entries that follow. * - * If this structure is modified, `LUO_FDT_SESSION_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_session_header_ser { u64 count; @@ -182,7 +159,7 @@ struct luo_session_header_ser { * session) is created and passed to the new kernel, allowing it to reconstruct * the session context. * - * If this structure is modified, `LUO_FDT_SESSION_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_session_ser { char name[LIVEUPDATE_SESSION_NAME_LENGTH]; @@ -192,10 +169,6 @@ struct luo_session_ser { /* The max size is set so it can be reliably used during in serialization */ #define LIVEUPDATE_FLB_COMPAT_LENGTH 48 -#define LUO_FDT_FLB_NODE_NAME "luo-flb" -#define LUO_FDT_FLB_COMPATIBLE "luo-flb-v1" -#define LUO_FDT_FLB_HEADER "luo-flb-header" - /** * struct luo_flb_header_ser - Header for the serialized FLB data block. * @pgcnt: The total number of pages occupied by the entire preserved memory @@ -205,11 +178,9 @@ struct luo_session_ser { * in the memory block. 
* * This structure is located at the physical address specified by the - * `LUO_FDT_FLB_HEADER` FDT property. It provides the new kernel with the - * necessary information to find and iterate over the array of preserved - * File-Lifecycle-Bound objects and to manage the underlying memory. + * flbs_pa in luo_ser. * - * If this structure is modified, LUO_FDT_FLB_COMPATIBLE must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_flb_header_ser { u64 pgcnt; @@ -231,7 +202,7 @@ struct luo_flb_header_ser { * passed to the new kernel. Each entry allows the LUO core to restore one * global, shared object. * - * If this structure is modified, LUO_FDT_FLB_COMPATIBLE must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_flb_ser { char name[LIVEUPDATE_FLB_COMPAT_LENGTH]; diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 5d5827ced73c..085c0dfc1ef1 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -61,7 +61,6 @@ #include #include #include -#include #include "kexec_handover_internal.h" #include "luo_internal.h" @@ -86,9 +85,11 @@ early_param("liveupdate", early_liveupdate_param); static int __init luo_early_startup(void) { + struct luo_ser *luo_ser; + int err, header_size; phys_addr_t fdt_phys; - int err, ln_size; const void *ptr; + u64 luo_ser_pa; if (!kho_is_enabled()) { if (liveupdate_enabled()) @@ -119,26 +120,32 @@ static int __init luo_early_startup(void) return -EINVAL; } - ln_size = 0; - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_LIVEUPDATE_NUM, - &ln_size); - if (!ptr || ln_size != sizeof(luo_global.liveupdate_num)) { - pr_err("Unable to get live update number '%s' [%d]\n", - LUO_FDT_LIVEUPDATE_NUM, ln_size); + header_size = 0; + ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); + if (!ptr || header_size != sizeof(u64)) { + pr_err("Unable to get ABI header '%s' [%d]\n", + 
LUO_FDT_ABI_HEADER, header_size); return -EINVAL; } - luo_global.liveupdate_num = get_unaligned((u64 *)ptr); + luo_ser_pa = get_unaligned((u64 *)ptr); + luo_ser = phys_to_virt(luo_ser_pa); + + luo_global.liveupdate_num = luo_ser->liveupdate_num; pr_info("Retrieved live update data, liveupdate number: %lld\n", luo_global.liveupdate_num); - err = luo_session_setup_incoming(luo_global.fdt_in); + err = luo_session_setup_incoming(luo_ser->sessions_pa); if (err) - return err; + goto out_free_ser; + + luo_flb_setup_incoming(luo_ser->flbs_pa); - err = luo_flb_setup_incoming(luo_global.fdt_in); + err = 0; +out_free_ser: + kho_restore_free(luo_ser); return err; } @@ -160,7 +167,8 @@ early_initcall(liveupdate_early_init); /* Called during boot to create outgoing LUO fdt tree */ static int __init luo_fdt_setup(void) { - const u64 ln = luo_global.liveupdate_num + 1; + struct luo_ser *luo_ser; + u64 luo_ser_pa; void *fdt_out; int err; @@ -170,27 +178,45 @@ static int __init luo_fdt_setup(void) return PTR_ERR(fdt_out); } + luo_ser = kho_alloc_preserve(sizeof(*luo_ser)); + if (IS_ERR(luo_ser)) { + err = PTR_ERR(luo_ser); + goto exit_free_fdt; + } + luo_ser_pa = virt_to_phys(luo_ser); + err = fdt_create(fdt_out, LUO_FDT_SIZE); err |= fdt_finish_reservemap(fdt_out); err |= fdt_begin_node(fdt_out, ""); err |= fdt_property_string(fdt_out, "compatible", LUO_FDT_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_LIVEUPDATE_NUM, &ln, sizeof(ln)); - err |= luo_session_setup_outgoing(fdt_out); - err |= luo_flb_setup_outgoing(fdt_out); + err |= fdt_property(fdt_out, LUO_FDT_ABI_HEADER, &luo_ser_pa, + sizeof(luo_ser_pa)); err |= fdt_end_node(fdt_out); err |= fdt_finish(fdt_out); if (err) - goto exit_free; + goto exit_free_luo_ser; + + err = luo_session_setup_outgoing(&luo_ser->sessions_pa); + if (err) + goto exit_free_luo_ser; + + err = luo_flb_setup_outgoing(&luo_ser->flbs_pa); + if (err) + goto exit_free_luo_ser; + + luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; err = 
kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out, fdt_totalsize(fdt_out)); if (err) - goto exit_free; + goto exit_free_luo_ser; luo_global.fdt_out = fdt_out; return 0; -exit_free: +exit_free_luo_ser: + kho_unpreserve_free(luo_ser); +exit_free_fdt: kho_unpreserve_free(fdt_out); pr_err("failed to prepare LUO FDT: %d\n", err); diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c index 8f5c5dd01cd0..5c27134ce7ba 100644 --- a/kernel/liveupdate/luo_flb.c +++ b/kernel/liveupdate/luo_flb.c @@ -44,13 +44,11 @@ #include #include #include -#include #include #include #include #include #include -#include #include "luo_internal.h" #define LUO_FLB_PGCNT 1ul @@ -551,27 +549,15 @@ int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp) return 0; } -int __init luo_flb_setup_outgoing(void *fdt_out) +int __init luo_flb_setup_outgoing(u64 *flbs_pa) { struct luo_flb_header_ser *header_ser; - u64 header_ser_pa; - int err; header_ser = kho_alloc_preserve(LUO_FLB_PGCNT << PAGE_SHIFT); if (IS_ERR(header_ser)) return PTR_ERR(header_ser); - header_ser_pa = virt_to_phys(header_ser); - - err = fdt_begin_node(fdt_out, LUO_FDT_FLB_NODE_NAME); - err |= fdt_property_string(fdt_out, "compatible", - LUO_FDT_FLB_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_FLB_HEADER, &header_ser_pa, - sizeof(header_ser_pa)); - err |= fdt_end_node(fdt_out); - - if (err) - goto err_unpreserve; + *flbs_pa = virt_to_phys(header_ser); header_ser->pgcnt = LUO_FLB_PGCNT; luo_flb_global.outgoing.header_ser = header_ser; @@ -579,53 +565,19 @@ int __init luo_flb_setup_outgoing(void *fdt_out) luo_flb_global.outgoing.active = true; return 0; - -err_unpreserve: - kho_unpreserve_free(header_ser); - - return err; } -int __init luo_flb_setup_incoming(void *fdt_in) +void __init luo_flb_setup_incoming(u64 flbs_pa) { struct luo_flb_header_ser *header_ser; - int err, header_size, offset; - const void *ptr; - u64 header_ser_pa; - - offset = fdt_subnode_offset(fdt_in, 0, LUO_FDT_FLB_NODE_NAME); - 
if (offset < 0) { - pr_err("Unable to get FLB node [%s]\n", LUO_FDT_FLB_NODE_NAME); - - return -ENOENT; - } - - err = fdt_node_check_compatible(fdt_in, offset, - LUO_FDT_FLB_COMPATIBLE); - if (err) { - pr_err("FLB node is incompatible with '%s' [%d]\n", - LUO_FDT_FLB_COMPATIBLE, err); - - return -EINVAL; - } - - header_size = 0; - ptr = fdt_getprop(fdt_in, offset, LUO_FDT_FLB_HEADER, &header_size); - if (!ptr || header_size != sizeof(u64)) { - pr_err("Unable to get FLB header property '%s' [%d]\n", - LUO_FDT_FLB_HEADER, header_size); - return -EINVAL; - } - - header_ser_pa = get_unaligned((u64 *)ptr); - header_ser = phys_to_virt(header_ser_pa); + if (!flbs_pa) + return; + header_ser = phys_to_virt(flbs_pa); luo_flb_global.incoming.header_ser = header_ser; luo_flb_global.incoming.ser = (void *)(header_ser + 1); luo_flb_global.incoming.active = true; - - return 0; } /** diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index ae58206f14ac..fe22086bfbeb 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -79,8 +79,8 @@ extern struct rw_semaphore luo_register_rwlock; int luo_session_create(const char *name, struct file **filep); int luo_session_retrieve(const char *name, struct file **filep); -int __init luo_session_setup_outgoing(void *fdt); -int __init luo_session_setup_incoming(void *fdt); +int __init luo_session_setup_outgoing(u64 *sessions_pa); +int __init luo_session_setup_incoming(u64 sessions_pa); int luo_session_serialize(void); int luo_session_deserialize(void); @@ -102,8 +102,8 @@ int luo_flb_file_preserve(struct liveupdate_file_handler *fh); void luo_flb_file_unpreserve(struct liveupdate_file_handler *fh); void luo_flb_file_finish(struct liveupdate_file_handler *fh); void luo_flb_unregister_all(struct liveupdate_file_handler *fh); -int __init luo_flb_setup_outgoing(void *fdt); -int __init luo_flb_setup_incoming(void *fdt); +int __init luo_flb_setup_outgoing(u64 *flbs_pa); +void __init 
luo_flb_setup_incoming(u64 flbs_pa); void luo_flb_serialize(void); #ifdef CONFIG_LIVEUPDATE_TEST diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 47566db64598..85782c6f3d6c 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -25,9 +25,8 @@ * * - Serialization: Session metadata is preserved using the KHO framework. When * a live update is triggered via kexec, an array of `struct luo_session_ser` - * is populated and placed in a preserved memory region. An FDT node is also - * created, containing the count of sessions and the physical address of this - * array. + * is populated and placed in a preserved memory region. The physical address + * of this array is stored in the centralized `struct luo_ser` structure. * * Session Lifecycle: * @@ -91,13 +90,11 @@ #include #include #include -#include #include #include #include #include #include -#include #include #include "luo_internal.h" @@ -527,75 +524,34 @@ int luo_session_retrieve(const char *name, struct file **filep) return err; } -int __init luo_session_setup_outgoing(void *fdt_out) +int __init luo_session_setup_outgoing(u64 *sessions_pa) { struct luo_session_header_ser *header_ser; - u64 header_ser_pa; - int err; header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); if (IS_ERR(header_ser)) return PTR_ERR(header_ser); - header_ser_pa = virt_to_phys(header_ser); - - err = fdt_begin_node(fdt_out, LUO_FDT_SESSION_NODE_NAME); - err |= fdt_property_string(fdt_out, "compatible", - LUO_FDT_SESSION_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_SESSION_HEADER, &header_ser_pa, - sizeof(header_ser_pa)); - err |= fdt_end_node(fdt_out); - if (err) - goto err_unpreserve; + *sessions_pa = virt_to_phys(header_ser); luo_session_global.outgoing.header_ser = header_ser; luo_session_global.outgoing.ser = (void *)(header_ser + 1); luo_session_global.outgoing.active = true; return 0; - -err_unpreserve: - kho_unpreserve_free(header_ser); - return err; } 
-int __init luo_session_setup_incoming(void *fdt_in) +int __init luo_session_setup_incoming(u64 sessions_pa) { struct luo_session_header_ser *header_ser; - int err, header_size, offset; - u64 header_ser_pa; - const void *ptr; - - offset = fdt_subnode_offset(fdt_in, 0, LUO_FDT_SESSION_NODE_NAME); - if (offset < 0) { - pr_err("Unable to get session node: [%s]\n", - LUO_FDT_SESSION_NODE_NAME); - return -EINVAL; - } - err = fdt_node_check_compatible(fdt_in, offset, - LUO_FDT_SESSION_COMPATIBLE); - if (err) { - pr_err("Session node incompatible [%s]\n", - LUO_FDT_SESSION_COMPATIBLE); - return -EINVAL; - } - - header_size = 0; - ptr = fdt_getprop(fdt_in, offset, LUO_FDT_SESSION_HEADER, &header_size); - if (!ptr || header_size != sizeof(u64)) { - pr_err("Unable to get session header '%s' [%d]\n", - LUO_FDT_SESSION_HEADER, header_size); - return -EINVAL; + if (sessions_pa) { + header_ser = phys_to_virt(sessions_pa); + luo_session_global.incoming.header_ser = header_ser; + luo_session_global.incoming.ser = (void *)(header_ser + 1); + luo_session_global.incoming.active = true; } - header_ser_pa = get_unaligned((u64 *)ptr); - header_ser = phys_to_virt(header_ser_pa); - - luo_session_global.incoming.header_ser = header_ser; - luo_session_global.incoming.ser = (void *)(header_ser + 1); - luo_session_global.incoming.active = true; - return 0; } -- 2.53.0 From pasha.tatashin at soleen.com Wed Jun 3 08:43:53 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 15:43:53 +0000 Subject: [PATCH v7 04/13] liveupdate: register luo_ser as KHO subtree In-Reply-To: <20260603154402.468928-1-pasha.tatashin@soleen.com> References: <20260603154402.468928-1-pasha.tatashin@soleen.com> Message-ID: <20260603154402.468928-5-pasha.tatashin@soleen.com> Entirely remove the LUO FDT wrapper since the FDT only carries the compatible string and the pointer to the centralized struct luo_ser. 
Instead, register the struct luo_ser via the KHO raw subtree API, placing the compatibility string inside the structure itself. Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 57 +++++++++--------------- kernel/liveupdate/luo_core.c | 85 +++++++++++------------------------- 2 files changed, 46 insertions(+), 96 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 1b2f865a771a..9a4fe491812b 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -10,11 +10,11 @@ * * Live Update Orchestrator uses the stable Application Binary Interface * defined below to pass state from a pre-update kernel to a post-update - * kernel. The ABI is built upon the Kexec HandOver framework and uses a - * Flattened Device Tree to describe the preserved data. + * kernel. The ABI is built upon the Kexec HandOver framework and registers + * the central `struct luo_ser` via the KHO raw subtree API. * - * This interface is a contract. Any modification to the FDT structure, node - * properties, compatible strings, or the layout of the `__packed` serialization + * This interface is a contract. Any modification to the structure fields, + * compatible strings, or the layout of the `__packed` serialization * structures defined here constitutes a breaking change. Such changes require * incrementing the version number in the relevant `_COMPATIBLE` string to * prevent a new kernel from misinterpreting data from an old kernel. @@ -23,31 +23,15 @@ * however, backward/forward compatibility is only guaranteed for kernels * supporting the same ABI version. * - * FDT Structure Overview: + * KHO Structure Overview: * The entire LUO state is encapsulated within a single KHO entry named "LUO". - * This entry contains an FDT with the following layout: - * - * .. 
code-block:: none - * - * / { - * compatible = "luo-v2"; - * luo-abi-header = ; - * }; - * - * Main LUO Node (/): - * - * - compatible: "luo-v2" - * Identifies the overall LUO ABI version. - * - luo-abi-header: u64 - * The physical address of `struct luo_ser`. + * This entry contains the `struct luo_ser` structure. * * Serialization Structures: - * The FDT properties point to memory regions containing arrays of simple, - * `__packed` structures. These structures contain the actual preserved state. - * * - struct luo_ser: * The central ABI structure that contains the overall state of the LUO. - * It includes the liveupdate-number and pointers to sessions and FLBs. + * It includes the compatibility string, the liveupdate-number, and pointers + * to sessions and FLBs. * * - struct luo_session_header_ser: * Header for the session array. Contains the total page count of the @@ -78,26 +62,27 @@ #ifndef _LINUX_KHO_ABI_LUO_H #define _LINUX_KHO_ABI_LUO_H +#include #include /* - * The LUO FDT hooks all LUO state for sessions, fds, etc. + * The LUO state is registered under this KHO entry name. */ -#define LUO_FDT_SIZE PAGE_SIZE -#define LUO_FDT_KHO_ENTRY_NAME "LUO" -#define LUO_FDT_COMPATIBLE "luo-v2" -#define LUO_FDT_ABI_HEADER "luo-abi-header" +#define LUO_KHO_ENTRY_NAME "LUO" +#define LUO_ABI_COMPATIBLE "luo-v3" +#define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) /** * struct luo_ser - Centralized LUO ABI header. + * @compatible: Compatibility string identifying the LUO ABI version. * @liveupdate_num: A counter tracking the number of successful live updates. * @sessions_pa: Physical address of the first session block header. * @flbs_pa: Physical address of the FLB header. * - * This structure is the root of all preserved LUO state. It is pointed to by - * the "luo-abi-header" property in the LUO FDT. + * This structure is the root of all preserved LUO state. 
*/ struct luo_ser { + char compatible[LUO_ABI_COMPAT_LEN]; u64 liveupdate_num; u64 sessions_pa; u64 flbs_pa; @@ -111,7 +96,7 @@ struct luo_ser { * @data: Private data * @token: User provided token for this file * - * If this structure is modified, LUO_SESSION_COMPATIBLE must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_file_ser { char compatible[LIVEUPDATE_HNDL_COMPAT_LENGTH]; @@ -142,7 +127,7 @@ struct luo_file_set_ser { * physical memory preserved across the kexec. It provides the necessary * metadata to interpret the array of session entries that follow. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_session_header_ser { u64 count; @@ -159,7 +144,7 @@ struct luo_session_header_ser { * session) is created and passed to the new kernel, allowing it to reconstruct * the session context. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_session_ser { char name[LIVEUPDATE_SESSION_NAME_LENGTH]; @@ -180,7 +165,7 @@ struct luo_session_ser { * This structure is located at the physical address specified by the * flbs_pa in luo_ser. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_flb_header_ser { u64 pgcnt; @@ -202,7 +187,7 @@ struct luo_flb_header_ser { * passed to the new kernel. Each entry allows the LUO core to restore one * global, shared object. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. 
*/ struct luo_flb_ser { char name[LIVEUPDATE_FLB_COMPAT_LENGTH]; diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 085c0dfc1ef1..69b00e7d0f8f 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -54,7 +54,6 @@ #include #include #include -#include #include #include #include @@ -67,8 +66,7 @@ static struct { bool enabled; - void *fdt_out; - void *fdt_in; + struct luo_ser *luo_ser_out; u64 liveupdate_num; } luo_global; @@ -85,11 +83,10 @@ early_param("liveupdate", early_liveupdate_param); static int __init luo_early_startup(void) { + phys_addr_t luo_ser_phys; struct luo_ser *luo_ser; - int err, header_size; - phys_addr_t fdt_phys; - const void *ptr; - u64 luo_ser_pa; + size_t len; + int err; if (!kho_is_enabled()) { if (liveupdate_enabled()) @@ -98,40 +95,29 @@ static int __init luo_early_startup(void) return 0; } - /* Retrieve LUO subtree, and verify its format. */ - err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys, NULL); + /* Retrieve LUO state from KHO. 
*/ + err = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &luo_ser_phys, &len); if (err) { if (err != -ENOENT) { - pr_err("failed to retrieve FDT '%s' from KHO: %pe\n", - LUO_FDT_KHO_ENTRY_NAME, ERR_PTR(err)); + pr_err("failed to retrieve LUO state '%s' from KHO: %pe\n", + LUO_KHO_ENTRY_NAME, ERR_PTR(err)); return err; } return 0; } - luo_global.fdt_in = phys_to_virt(fdt_phys); - err = fdt_node_check_compatible(luo_global.fdt_in, 0, - LUO_FDT_COMPATIBLE); - if (err) { - pr_err("FDT '%s' is incompatible with '%s' [%d]\n", - LUO_FDT_KHO_ENTRY_NAME, LUO_FDT_COMPATIBLE, err); - + if (len < sizeof(*luo_ser)) { + pr_err("LUO state is too small (%zu < %zu)\n", len, sizeof(*luo_ser)); return -EINVAL; } - header_size = 0; - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); - if (!ptr || header_size != sizeof(u64)) { - pr_err("Unable to get ABI header '%s' [%d]\n", - LUO_FDT_ABI_HEADER, header_size); - + luo_ser = phys_to_virt(luo_ser_phys); + if (strncmp(luo_ser->compatible, LUO_ABI_COMPATIBLE, LUO_ABI_COMPAT_LEN)) { + pr_err("LUO state is incompatible with '%s'\n", LUO_ABI_COMPATIBLE); return -EINVAL; } - luo_ser_pa = get_unaligned((u64 *)ptr); - luo_ser = phys_to_virt(luo_ser_pa); - luo_global.liveupdate_num = luo_ser->liveupdate_num; pr_info("Retrieved live update data, liveupdate number: %lld\n", luo_global.liveupdate_num); @@ -164,37 +150,20 @@ static int __init liveupdate_early_init(void) } early_initcall(liveupdate_early_init); -/* Called during boot to create outgoing LUO fdt tree */ -static int __init luo_fdt_setup(void) +/* Called during boot to create outgoing LUO state */ +static int __init luo_state_setup(void) { struct luo_ser *luo_ser; - u64 luo_ser_pa; - void *fdt_out; int err; - fdt_out = kho_alloc_preserve(LUO_FDT_SIZE); - if (IS_ERR(fdt_out)) { - pr_err("failed to allocate/preserve FDT memory\n"); - return PTR_ERR(fdt_out); - } - luo_ser = kho_alloc_preserve(sizeof(*luo_ser)); if (IS_ERR(luo_ser)) { - err = PTR_ERR(luo_ser); - goto 
exit_free_fdt; + pr_err("failed to allocate/preserve LUO state memory\n"); + return PTR_ERR(luo_ser); } - luo_ser_pa = virt_to_phys(luo_ser); - - err = fdt_create(fdt_out, LUO_FDT_SIZE); - err |= fdt_finish_reservemap(fdt_out); - err |= fdt_begin_node(fdt_out, ""); - err |= fdt_property_string(fdt_out, "compatible", LUO_FDT_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_ABI_HEADER, &luo_ser_pa, - sizeof(luo_ser_pa)); - err |= fdt_end_node(fdt_out); - err |= fdt_finish(fdt_out); - if (err) - goto exit_free_luo_ser; + + strscpy(luo_ser->compatible, LUO_ABI_COMPATIBLE, sizeof(luo_ser->compatible)); + luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; err = luo_session_setup_outgoing(&luo_ser->sessions_pa); if (err) @@ -204,21 +173,17 @@ static int __init luo_fdt_setup(void) if (err) goto exit_free_luo_ser; - luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; - - err = kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out, - fdt_totalsize(fdt_out)); + err = kho_add_subtree(LUO_KHO_ENTRY_NAME, luo_ser, sizeof(*luo_ser)); if (err) goto exit_free_luo_ser; - luo_global.fdt_out = fdt_out; + + luo_global.luo_ser_out = luo_ser; return 0; exit_free_luo_ser: kho_unpreserve_free(luo_ser); -exit_free_fdt: - kho_unpreserve_free(fdt_out); - pr_err("failed to prepare LUO FDT: %d\n", err); + pr_err("failed to prepare LUO state: %d\n", err); return err; } @@ -234,7 +199,7 @@ static int __init luo_late_startup(void) if (!liveupdate_enabled()) return 0; - err = luo_fdt_setup(); + err = luo_state_setup(); if (err) luo_global.enabled = false; -- 2.53.0 From pasha.tatashin at soleen.com Wed Jun 3 08:43:54 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 15:43:54 +0000 Subject: [PATCH v7 05/13] liveupdate: Extract luo_file_deserialize_one helper In-Reply-To: <20260603154402.468928-1-pasha.tatashin@soleen.com> References: <20260603154402.468928-1-pasha.tatashin@soleen.com> Message-ID: <20260603154402.468928-6-pasha.tatashin@soleen.com> Extract the 
logic for deserializing a single file entry into a separate helper
function, in preparation for linked-block serialization of files.

This is pure code movement; no other changes are intended.

Acked-by: Mike Rapoport (Microsoft)
Reviewed-by: Pratyush Yadav (Google)
Signed-off-by: Pasha Tatashin
---
 kernel/liveupdate/luo_file.c | 77 ++++++++++++++++++++----------------
 1 file changed, 44 insertions(+), 33 deletions(-)

diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c
index 208987502f73..9eec07a9e9fc 100644
--- a/kernel/liveupdate/luo_file.c
+++ b/kernel/liveupdate/luo_file.c
@@ -753,6 +753,46 @@ int luo_file_finish(struct luo_file_set *file_set)
 	return 0;
 }

+static int luo_file_deserialize_one(struct luo_file_set *file_set,
+				    struct luo_file_ser *ser)
+{
+	struct liveupdate_file_handler *fh;
+	bool handler_found = false;
+	struct luo_file *luo_file;
+
+	down_read(&luo_register_rwlock);
+	list_private_for_each_entry(fh, &luo_file_handler_list, list) {
+		if (!strcmp(fh->compatible, ser->compatible)) {
+			if (try_module_get(fh->ops->owner))
+				handler_found = true;
+			break;
+		}
+	}
+	up_read(&luo_register_rwlock);
+
+	if (!handler_found) {
+		pr_warn("No registered handler for compatible '%.*s'\n",
+			(int)sizeof(ser->compatible),
+			ser->compatible);
+		return -ENOENT;
+	}
+
+	luo_file = kzalloc_obj(*luo_file);
+	if (!luo_file) {
+		module_put(fh->ops->owner);
+		return -ENOMEM;
+	}
+
+	luo_file->fh = fh;
+	luo_file->file = NULL;
+	luo_file->serialized_data = ser->data;
+	luo_file->token = ser->token;
+	mutex_init(&luo_file->mutex);
+	list_add_tail(&luo_file->list, &file_set->files_list);
+
+	return 0;
+}
+
 /**
  * luo_file_deserialize - Reconstructs the list of preserved files in the new kernel.
  * @file_set: The incoming file_set to fill with deserialized data.
@@ -782,6 +822,7 @@ int luo_file_deserialize(struct luo_file_set *file_set,
 			 struct luo_file_set_ser *file_set_ser)
 {
 	struct luo_file_ser *file_ser;
+	int err;
 	u64 i;

 	if (!file_set_ser->files) {
@@ -809,39 +850,9 @@ int luo_file_deserialize(struct luo_file_set *file_set,
 	 */
 	file_ser = file_set->files;
 	for (i = 0; i < file_set->count; i++) {
-		struct liveupdate_file_handler *fh;
-		bool handler_found = false;
-		struct luo_file *luo_file;
-
-		down_read(&luo_register_rwlock);
-		list_private_for_each_entry(fh, &luo_file_handler_list, list) {
-			if (!strcmp(fh->compatible, file_ser[i].compatible)) {
-				if (try_module_get(fh->ops->owner))
-					handler_found = true;
-				break;
-			}
-		}
-		up_read(&luo_register_rwlock);
-
-		if (!handler_found) {
-			pr_warn("No registered handler for compatible '%.*s'\n",
-				(int)sizeof(file_ser[i].compatible),
-				file_ser[i].compatible);
-			return -ENOENT;
-		}
-
-		luo_file = kzalloc_obj(*luo_file);
-		if (!luo_file) {
-			module_put(fh->ops->owner);
-			return -ENOMEM;
-		}
-
-		luo_file->fh = fh;
-		luo_file->file = NULL;
-		luo_file->serialized_data = file_ser[i].data;
-		luo_file->token = file_ser[i].token;
-		mutex_init(&luo_file->mutex);
-		list_add_tail(&luo_file->list, &file_set->files_list);
+		err = luo_file_deserialize_one(file_set, &file_ser[i]);
+		if (err)
+			return err;
 	}

 	return 0;
-- 
2.53.0

From pasha.tatashin at soleen.com Wed Jun 3 08:43:55 2026
From: pasha.tatashin at soleen.com (Pasha Tatashin)
Date: Wed, 3 Jun 2026 15:43:55 +0000
Subject: [PATCH v7 06/13] liveupdate: Extract luo_session_deserialize_one helper
In-Reply-To: <20260603154402.468928-1-pasha.tatashin@soleen.com>
References: <20260603154402.468928-1-pasha.tatashin@soleen.com>
Message-ID: <20260603154402.468928-7-pasha.tatashin@soleen.com>

Extract the logic for deserializing a single session entry into a
separate helper function, in preparation for linked-block serialization
of sessions.

This is pure code movement; no other changes are intended.
Acked-by: Mike Rapoport (Microsoft)
Reviewed-by: Pratyush Yadav (Google)
Signed-off-by: Pasha Tatashin
---
 kernel/liveupdate/luo_session.c | 63 +++++++++++++++++++--------------
 1 file changed, 36 insertions(+), 27 deletions(-)

diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c
index 85782c6f3d6c..1cd315e0f6de 100644
--- a/kernel/liveupdate/luo_session.c
+++ b/kernel/liveupdate/luo_session.c
@@ -555,6 +555,40 @@ int __init luo_session_setup_incoming(u64 sessions_pa)
 	return 0;
 }

+static int luo_session_deserialize_one(struct luo_session_header *sh,
+				       struct luo_session_ser *ser)
+{
+	struct luo_session *session;
+	int err;
+
+	session = luo_session_alloc(ser->name);
+	if (IS_ERR(session)) {
+		pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n",
+			(int)sizeof(ser->name), ser->name, session);
+		return PTR_ERR(session);
+	}
+
+	err = luo_session_insert(sh, session);
+	if (err) {
+		pr_warn("Failed to insert session [%s] %pe\n",
+			session->name, ERR_PTR(err));
+		luo_session_free(session);
+		return err;
+	}
+
+	scoped_guard(mutex, &session->mutex) {
+		err = luo_file_deserialize(&session->file_set,
+					   &ser->file_set_ser);
+	}
+	if (err) {
+		pr_warn("Failed to deserialize files for session [%s] %pe\n",
+			session->name, ERR_PTR(err));
+		return err;
+	}
+
+	return 0;
+}
+
 int luo_session_deserialize(void)
 {
 	struct luo_session_header *sh = &luo_session_global.incoming;
@@ -586,34 +620,9 @@ int luo_session_deserialize(void)
 	 * reliably reset devices and reclaim memory.
 	 */
 	for (int i = 0; i < sh->header_ser->count; i++) {
-		struct luo_session *session;
-
-		session = luo_session_alloc(sh->ser[i].name);
-		if (IS_ERR(session)) {
-			pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n",
-				(int)sizeof(sh->ser[i].name),
-				sh->ser[i].name, session);
-			err = PTR_ERR(session);
-			goto save_err;
-		}
-
-		err = luo_session_insert(sh, session);
-		if (err) {
-			pr_warn("Failed to insert session [%s] %pe\n",
-				session->name, ERR_PTR(err));
-			luo_session_free(session);
-			goto save_err;
-		}
-
-		scoped_guard(mutex, &session->mutex) {
-			err = luo_file_deserialize(&session->file_set,
-						   &sh->ser[i].file_set_ser);
-		}
-		if (err) {
-			pr_warn("Failed to deserialize files for session [%s] %pe\n",
-				session->name, ERR_PTR(err));
+		err = luo_session_deserialize_one(sh, &sh->ser[i]);
+		if (err)
 			goto save_err;
-		}
 	}

 	kho_restore_free(sh->header_ser);
-- 
2.53.0

From pasha.tatashin at soleen.com Wed Jun 3 08:43:56 2026
From: pasha.tatashin at soleen.com (Pasha Tatashin)
Date: Wed, 3 Jun 2026 15:43:56 +0000
Subject: [PATCH v7 07/13] kho: add support for linked-block serialization
In-Reply-To: <20260603154402.468928-1-pasha.tatashin@soleen.com>
References: <20260603154402.468928-1-pasha.tatashin@soleen.com>
Message-ID: <20260603154402.468928-8-pasha.tatashin@soleen.com>

Introduce a linked-block serialization mechanism for state handover.

Previously, LUO used contiguous memory blocks for serializing sessions
and files, which imposed limits on the total number of items that could
be preserved across a live update. This commit adds the infrastructure
for a more flexible, block-based approach where serialized data is
stored in a chain of linked blocks.

This is a generic KHO serialization block infrastructure that can be
used by multiple subsystems.
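
To make the "chain of linked blocks" idea concrete, here is a minimal
user-space sketch. It is not the layout or API added by this patch:
every name below (sketch_block, sketch_chain, SKETCH_BLOCK_SIZE, and the
helper functions) is hypothetical, the 4096-byte block size is an
assumption, and a plain pointer stands in for the physical address a
kernel implementation would have to store. It only shows why a chain
removes the fixed upper bound that one contiguous array imposes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_BLOCK_SIZE 4096u	/* assumption: one block per 4 KiB page */

/* Hypothetical block: a small header followed by fixed-size entries. */
struct sketch_block {
	struct sketch_block *next;	/* a kernel would keep a physical address */
	uint64_t count;			/* entries used in this block */
	uint64_t entries[];		/* payload fills the rest of the block */
};

#define SKETCH_ENTRIES_PER_BLOCK \
	((SKETCH_BLOCK_SIZE - sizeof(struct sketch_block)) / sizeof(uint64_t))

struct sketch_chain {
	struct sketch_block *head;
	struct sketch_block *tail;
};

/* Append one entry; link a fresh block when the tail block is full. */
static int sketch_chain_append(struct sketch_chain *c, uint64_t value)
{
	struct sketch_block *b;

	assert(c != NULL);
	b = c->tail;
	if (!b || b->count == SKETCH_ENTRIES_PER_BLOCK) {
		b = calloc(1, SKETCH_BLOCK_SIZE);
		if (!b)
			return -1;
		if (c->tail)
			c->tail->next = b;
		else
			c->head = b;
		c->tail = b;
	}
	b->entries[b->count++] = value;
	return 0;
}

/* Walk the chain the way a reader would, summing the entry counts. */
static uint64_t sketch_chain_total(const struct sketch_chain *c)
{
	uint64_t total = 0;

	for (const struct sketch_block *b = c->head; b; b = b->next)
		total += b->count;
	return total;
}

/* Release every block and reset the chain to empty. */
static void sketch_chain_free(struct sketch_chain *c)
{
	struct sketch_block *b = c->head;

	while (b) {
		struct sketch_block *next = b->next;

		free(b);
		b = next;
	}
	c->head = NULL;
	c->tail = NULL;
}
```

In this sketch the number of entries is bounded only by available
memory, since a new block is linked in whenever the tail fills up. A
contiguous array would have needed its size chosen up front.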
Signed-off-by: Pasha Tatashin --- Documentation/core-api/kho/abi.rst | 5 + Documentation/core-api/kho/index.rst | 11 + MAINTAINERS | 1 + include/linux/kho/abi/block.h | 54 ++++ include/linux/kho/abi/kexec_handover.h | 2 +- include/linux/kho_block.h | 106 +++++++ kernel/liveupdate/Makefile | 1 + kernel/liveupdate/kho_block.c | 416 +++++++++++++++++++++++++ 8 files changed, 595 insertions(+), 1 deletion(-) create mode 100644 include/linux/kho/abi/block.h create mode 100644 include/linux/kho_block.h create mode 100644 kernel/liveupdate/kho_block.c diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst index 799d743105a6..edeb5b311963 100644 --- a/Documentation/core-api/kho/abi.rst +++ b/Documentation/core-api/kho/abi.rst @@ -28,6 +28,11 @@ KHO persistent memory tracker ABI .. kernel-doc:: include/linux/kho/abi/kexec_handover.h :doc: KHO persistent memory tracker +KHO serialization block ABI +=========================== + +.. kernel-doc:: include/linux/kho/abi/block.h + See Also ======== diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst index 0a2dee4f8e7d..320914a42178 100644 --- a/Documentation/core-api/kho/index.rst +++ b/Documentation/core-api/kho/index.rst @@ -83,6 +83,17 @@ Public API .. kernel-doc:: kernel/liveupdate/kexec_handover.c :export: +KHO Serialization Blocks API +============================ + +.. kernel-doc:: kernel/liveupdate/kho_block.c + :doc: KHO Serialization Blocks + +.. kernel-doc:: include/linux/kho_block.h + +.. 
kernel-doc:: kernel/liveupdate/kho_block.c + :internal: + See Also ======== diff --git a/MAINTAINERS b/MAINTAINERS index 9ec290e38b44..920ba7622afa 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14208,6 +14208,7 @@ F: Documentation/admin-guide/mm/kho.rst F: Documentation/core-api/kho/* F: include/linux/kexec_handover.h F: include/linux/kho/ +F: include/linux/kho_block.h F: kernel/liveupdate/kexec_handover* F: lib/test_kho.c F: tools/testing/selftests/kho/ diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h new file mode 100644 index 000000000000..d06d64b963be --- /dev/null +++ b/include/linux/kho/abi/block.h @@ -0,0 +1,54 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +/** + * DOC: KHO Serialization Blocks ABI + * + * Subsystems using the KHO Serialization Blocks framework rely on the stable + * Application Binary Interface defined below to pass serialized state from a + * pre-update kernel to a post-update kernel. + * + * This interface is a contract. Any modification to the structure fields, + * compatible strings, or the layout of the `__packed` serialization + * structures defined here constitutes a breaking change. Such changes require + * incrementing the version number in the `KHO_FDT_COMPATIBLE` string to + * prevent a new kernel from misinterpreting data from an old kernel. + * + * Changes are allowed provided the compatibility version is incremented; + * however, backward/forward compatibility is only guaranteed for kernels + * supporting the same ABI version. + */ + +#ifndef _LINUX_KHO_ABI_BLOCK_H +#define _LINUX_KHO_ABI_BLOCK_H + +#include +#include + +/** + * KHO_BLOCK_SIZE - The size of each serialization block. + * + * This is defined as PAGE_SIZE. PAGE_SIZE is ABI compliant because live + * update between kernels with different page sizes is not supported by KHO. 
+ */ +#define KHO_BLOCK_SIZE PAGE_SIZE + +/** + * struct kho_block_header_ser - Header for the serialized data block. + * @next: Physical address of the next struct kho_block_header_ser. + * @count: The number of entries that immediately follow this header in the + * memory block. + * + * This structure is located at the beginning of a block of physical memory + * preserved across a kexec. It provides the necessary metadata to interpret + * the array of entries that follow. + */ +struct kho_block_header_ser { + u64 next; + u64 count; +} __packed; + +#endif /* _LINUX_KHO_ABI_BLOCK_H */ diff --git a/include/linux/kho/abi/kexec_handover.h b/include/linux/kho/abi/kexec_handover.h index fb2d37417ad9..5e2eb8519bda 100644 --- a/include/linux/kho/abi/kexec_handover.h +++ b/include/linux/kho/abi/kexec_handover.h @@ -90,7 +90,7 @@ */ /* The compatible string for the KHO FDT root node. */ -#define KHO_FDT_COMPATIBLE "kho-v3" +#define KHO_FDT_COMPATIBLE "kho-v4" /* The FDT property for the preserved memory map. */ #define KHO_FDT_MEMORY_MAP_PROP_NAME "preserved-memory-map" diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h new file mode 100644 index 000000000000..93a7cc2be5f5 --- /dev/null +++ b/include/linux/kho_block.h @@ -0,0 +1,106 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +#ifndef _LINUX_KHO_BLOCK_H +#define _LINUX_KHO_BLOCK_H + +#include +#include +#include + +/** + * struct kho_block - Internal representation of a serialization block. + * @list: List head for linking blocks in memory. + * @ser: Pointer to the serialized header in preserved memory. + */ +struct kho_block { + struct list_head list; + struct kho_block_header_ser *ser; +}; + +/** + * struct kho_block_set - A set of blocks containing serialized entries of the same type. + * @blocks: The list of serialization blocks (struct kho_block). + * @nblocks: The number of allocated serialization blocks. 
+ * @head_pa: Physical address of the first block header. + * @entry_size: The size of each entry in the blocks. + * @count_per_block: The maximum number of entries each block can hold. + * @incoming: True if this block set was restored from the previous kernel. + * + * Note: Synchronization and locking are the responsibility of the caller. + * The block set structure itself is not internally synchronized. + */ +struct kho_block_set { + struct list_head blocks; + long nblocks; + u64 head_pa; + size_t entry_size; + u64 count_per_block; + bool incoming; +}; + +/** + * struct kho_block_set_it - Iterator for serializing entries into blocks. + * @bs: The block set being iterated. + * @block: The current block. + * @i: The current entry index within @block. + */ +struct kho_block_set_it { + struct kho_block_set *bs; + struct kho_block *block; + u64 i; +}; + +/** + * KHO_BLOCK_SET_INIT - Initialize a static kho_block_set. + * @_name: Name of the kho_block_set variable. + * @_entry_size: The size of each entry in the block set. + */ +#define KHO_BLOCK_SET_INIT(_name, _entry_size) { \ + .blocks = LIST_HEAD_INIT((_name).blocks), \ + .entry_size = _entry_size, \ + .count_per_block = (KHO_BLOCK_SIZE - \ + sizeof(struct kho_block_header_ser)) / \ + (_entry_size), \ +} + +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size); + +int kho_block_set_grow(struct kho_block_set *bs, u64 count); +void kho_block_set_shrink(struct kho_block_set *bs, u64 count); + +int kho_block_set_restore(struct kho_block_set *bs, u64 head_pa); +void kho_block_set_destroy(struct kho_block_set *bs); +void kho_block_set_clear(struct kho_block_set *bs); + +/** + * kho_block_set_head_pa - Get the physical address of the first block header. + * @bs: The block set. + * + * Return: The physical address of the first block header, or 0 if empty. 
+ */ +static inline u64 kho_block_set_head_pa(struct kho_block_set *bs) +{ + return bs->head_pa; +} + +/** + * kho_block_set_is_empty - Check if the block set has no allocated blocks. + * @bs: The block set. + * + * Return: True if there are no blocks in the set, false otherwise. + */ +static inline bool kho_block_set_is_empty(struct kho_block_set *bs) +{ + return list_empty(&bs->blocks); +} + +void kho_block_set_it_init(struct kho_block_set_it *it, struct kho_block_set *bs); +void *kho_block_set_it_reserve_entry(struct kho_block_set_it *it); +void *kho_block_set_it_read_entry(struct kho_block_set_it *it); +void *kho_block_set_it_prev(struct kho_block_set_it *it); + +#endif /* _LINUX_KHO_BLOCK_H */ diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile index d2f779cbe279..eec9d3ae07eb 100644 --- a/kernel/liveupdate/Makefile +++ b/kernel/liveupdate/Makefile @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 luo-y := \ + kho_block.o \ luo_core.o \ luo_file.o \ luo_flb.o \ diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c new file mode 100644 index 000000000000..0d2a342ef422 --- /dev/null +++ b/kernel/liveupdate/kho_block.c @@ -0,0 +1,416 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +/** + * DOC: KHO Serialization Blocks + * + * KHO provides a mechanism to preserve stateful data across a kexec handover + * by serializing it into memory blocks, and provides the common + * infrastructure for managing these blocks. + * + * Each block consists of a header (struct kho_block_header_ser) followed by an + * array of serialized entries. Multiple blocks are linked together via a + * physical pointer in the header, forming a linked list that can be easily + * traversed in both the current and the next kernel. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include + +/* + * Safeguard limit for the number of serialization blocks. 
This is used to + * prevent infinite loops and excessive memory allocation in case of memory + * corruption in the preserved state. + * + * With a 4KB page size, 10k blocks is about 40MB. For 32-byte entries + * (e.g. 4 u64s), each block holds up to 127 entries (accounting for the + * 16-byte header), allowing the block set to hold up to 1.27M entries. + */ +#define KHO_MAX_BLOCKS 10000 + +/** + * kho_block_set_init - Initialize a block set. + * @bs: The block set to initialize. + * @entry_size: The size of each entry in the blocks. + */ +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size) +{ + *bs = (struct kho_block_set)KHO_BLOCK_SET_INIT(*bs, entry_size); + WARN_ON_ONCE(!bs->count_per_block); +} + +/* Serialized entries start immediately after the block header */ +static void *kho_block_entries(struct kho_block *block) +{ + return (void *)(block->ser + 1); +} + +/* Get the address of the serialized entry at the specified index */ +static void *kho_block_entry(struct kho_block_set_it *it, u64 index) +{ + return kho_block_entries(it->block) + (index * it->bs->entry_size); +} + +/* Free serialized data */ +static void kho_block_free_ser(struct kho_block_set *bs, + struct kho_block_header_ser *ser) +{ + if (bs->incoming) + kho_restore_free(ser); + else + kho_unpreserve_free(ser); +} + +static struct kho_block_header_ser *kho_block_alloc_ser(struct kho_block_set *bs) +{ + WARN_ON_ONCE(bs->incoming); + return kho_alloc_preserve(KHO_BLOCK_SIZE); +} + +static int kho_block_add(struct kho_block_set *bs, + struct kho_block_header_ser *ser) +{ + struct kho_block *block, *last; + + if (bs->nblocks >= KHO_MAX_BLOCKS) + return -ENOSPC; + + block = kzalloc_obj(*block); + if (!block) + return -ENOMEM; + + block->ser = ser; + last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); + list_add_tail(&block->list, &bs->blocks); + bs->nblocks++; + + if (last) + last->ser->next = virt_to_phys(ser); + else + bs->head_pa = virt_to_phys(ser); + + return 0; 
+} + +static int kho_block_set_grow_one(struct kho_block_set *bs) +{ + struct kho_block_header_ser *ser; + int err; + + ser = kho_block_alloc_ser(bs); + if (IS_ERR(ser)) + return PTR_ERR(ser); + + err = kho_block_add(bs, ser); + if (err) { + kho_block_free_ser(bs, ser); + return err; + } + + return 0; +} + +static void kho_block_set_shrink_one(struct kho_block_set *bs) +{ + struct kho_block *last, *new_last; + + if (list_empty(&bs->blocks)) + return; + + last = list_last_entry(&bs->blocks, struct kho_block, list); + list_del(&last->list); + bs->nblocks--; + kho_block_free_ser(bs, last->ser); + kfree(last); + + new_last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); + if (new_last) + new_last->ser->next = 0; + else + bs->head_pa = 0; +} + +/** + * kho_block_set_grow - Expand the block set to accommodate the target count. + * @bs: The block set. + * @count: The target number of valid entries to accommodate. + * + * Dynamically preallocates and links preserved memory blocks if the target + * entry count exceeds the current total capacity of the set, ensuring they + * are available during serialization/deserialization. + * + * Context: Caller must hold a lock protecting the block set. + * Return: 0 on success, or a negative errno on failure. + */ +int kho_block_set_grow(struct kho_block_set *bs, u64 count) +{ + long orig_nblocks = bs->nblocks; + int err; + + if (WARN_ON_ONCE(bs->incoming)) + return -EINVAL; + + while (count > bs->nblocks * bs->count_per_block) { + err = kho_block_set_grow_one(bs); + if (err) + goto err_shrink; + } + + return 0; + +err_shrink: + while (bs->nblocks > orig_nblocks) + kho_block_set_shrink_one(bs); + return err; +} + +/** + * kho_block_set_shrink - Shrink the block set to accommodate the target count. + * @bs: The block set. + * @count: The target number of valid entries to accommodate. + * + * Releases and unallocates redundant preserved memory blocks. 
Checks if the + * last block in the set can be removed because the remaining entry count is + * fully accommodated by the preceding blocks. + * + * Note: It is the caller's responsibility to ensure that entries are removed + * in the reverse order of their insertion. Because shrinking destroys the last + * block in the set, removing entries in any other order would corrupt active + * data. + * + * Context: Caller must hold a lock protecting the block set. + */ +void kho_block_set_shrink(struct kho_block_set *bs, u64 count) +{ + while (bs->nblocks > 0 && count <= (bs->nblocks - 1) * bs->count_per_block) + kho_block_set_shrink_one(bs); +} + +/* + * kho_block_set_is_cyclic - Check for cycles in a linked list of blocks. + * Uses Floyd's cycle-finding algorithm to ensure sanity of the incoming list. + * + * Return: true if a cycle or corruption is detected, false otherwise. + */ +static bool kho_block_set_is_cyclic(struct kho_block_set *bs) +{ + struct kho_block_header_ser *fast; + struct kho_block_header_ser *slow; + int count = 0; + + fast = phys_to_virt(bs->head_pa); + slow = fast; + + while (fast) { + if (count++ >= KHO_MAX_BLOCKS) { + pr_err("Block set is corrupted\n"); + return true; + } + + if (!fast->next) + break; + + fast = phys_to_virt(fast->next); + if (!fast->next) + break; + + fast = phys_to_virt(fast->next); + slow = phys_to_virt(slow->next); + + if (slow == fast) { + pr_err("Block set is corrupted\n"); + return true; + } + } + + return false; +} + +/** + * kho_block_set_restore - Restore a block set from a physical address. + * @bs: The block set to restore. + * @head_pa: Physical address of the first block header. + * + * Restores a serialized block set from a given physical address. The caller is + * responsible for ensuring that the block set @bs has been allocated and + * initialized prior to calling this function. + * + * Return: 0 on success, or a negative errno on failure. 
+ */ +int kho_block_set_restore(struct kho_block_set *bs, u64 head_pa) +{ + struct kho_block_header_ser *ser; + u64 next_pa = head_pa; + int err; + + /* Restored block sets use size from the previous kernel */ + bs->incoming = true; + if (!head_pa) + return 0; + + bs->head_pa = head_pa; + if (kho_block_set_is_cyclic(bs)) { + bs->head_pa = 0; + return -EINVAL; + } + + while (next_pa) { + ser = phys_to_virt(next_pa); + if (!ser->count || ser->count > bs->count_per_block) { + pr_warn("Block contains invalid entry count: %llu\n", + ser->count); + err = -EINVAL; + goto err_destroy; + } + err = kho_block_add(bs, ser); + if (err) + goto err_destroy; + next_pa = ser->next; + } + + return 0; + +err_destroy: + kho_block_set_destroy(bs); + + /* Free the remaining un-restored blocks in the physical chain */ + while (next_pa) { + struct kho_block_header_ser *next_ser = phys_to_virt(next_pa); + + next_pa = next_ser->next; + kho_block_free_ser(bs, next_ser); + } + return err; +} + +/** + * kho_block_set_destroy - Destroy all blocks in a block set. + * @bs: The block set. + */ +void kho_block_set_destroy(struct kho_block_set *bs) +{ + struct kho_block *block, *tmp; + + list_for_each_entry_safe(block, tmp, &bs->blocks, list) { + list_del(&block->list); + kho_block_free_ser(bs, block->ser); + kfree(block); + } + bs->nblocks = 0; + bs->head_pa = 0; +} + +/** + * kho_block_set_clear - Clear all serialized data in a block set. + * @bs: The block set to clear. + */ +void kho_block_set_clear(struct kho_block_set *bs) +{ + struct kho_block *block; + + list_for_each_entry(block, &bs->blocks, list) { + block->ser->count = 0; + memset(block->ser + 1, 0, KHO_BLOCK_SIZE - sizeof(*block->ser)); + } +} + +/** + * kho_block_set_it_init - Initialize a block set iterator. + * @it: The iterator to initialize. + * @bs: The block set to iterate over. 
+ */ +void kho_block_set_it_init(struct kho_block_set_it *it, struct kho_block_set *bs) +{ + it->bs = bs; + it->block = list_first_entry_or_null(&bs->blocks, struct kho_block, list); + it->i = 0; +} + +/** + * kho_block_set_it_reserve_entry - Reserve and return the next available slot for writing. + * @it: The block iterator. + * + * Reserves a slot in the current block during state serialization to add a new + * entry, advancing the internal index. If the current block is full, it + * automatically moves to the next block in the set. + * + * Return: A pointer to the reserved entry slot, or NULL if the block set's + * capacity is fully exhausted. + */ +void *kho_block_set_it_reserve_entry(struct kho_block_set_it *it) +{ + void *entry; + + if (!it->block) + return NULL; + + if (it->i == it->bs->count_per_block) { + if (list_is_last(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_next_entry(it->block, list); + it->i = 0; + } + + entry = kho_block_entry(it, it->i++); + it->block->ser->count = it->i; + return entry; +} + +/** + * kho_block_set_it_read_entry - Read the next serialized entry from the block set. + * @it: The block iterator. + * + * Iterates through previously written entries during state deserialization, + * respecting the actual count stored in each block's header. + * + * Return: A pointer to the next serialized entry, or NULL if all serialized + * entries have been read. + */ +void *kho_block_set_it_read_entry(struct kho_block_set_it *it) +{ + if (!it->block) + return NULL; + + if (it->i == it->block->ser->count) { + if (list_is_last(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_next_entry(it->block, list); + it->i = 0; + } + + return kho_block_entry(it, it->i++); +} + +/** + * kho_block_set_it_prev - Return the previous entry slot in the block set. + * @it: The block iterator. + * + * If the current index is at the start of a block, it automatically moves to + * the end of the previous block. 
+ * + * Return: A pointer to the previous entry slot, or NULL if at the very + * beginning of the block set. + */ +void *kho_block_set_it_prev(struct kho_block_set_it *it) +{ + if (!it->block) + return NULL; + + if (it->i == 0) { + if (list_is_first(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_prev_entry(it->block, list); + it->i = it->bs->count_per_block; + } + + return kho_block_entry(it, --it->i); +} -- 2.53.0 From pasha.tatashin at soleen.com Wed Jun 3 08:43:57 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 15:43:57 +0000 Subject: [PATCH v7 08/13] liveupdate: defer session block allocation and physical address setting In-Reply-To: <20260603154402.468928-1-pasha.tatashin@soleen.com> References: <20260603154402.468928-1-pasha.tatashin@soleen.com> Message-ID: <20260603154402.468928-9-pasha.tatashin@soleen.com> Currently, luo_session_setup_outgoing() allocates the session block and sets its physical address in the header immediately. With upcoming dynamic block-based session management, this makes the first block different from the rest. Move the allocation to where it is first needed. 
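The deferral described above can be sketched in userspace C as follows. This is only an illustration under stated assumptions: the `demo_` names are hypothetical, `calloc()` stands in for kho_alloc_preserve(), and a pointer value stands in for the physical address. Setup only remembers where the address will be published, the buffer is allocated on the first insert, and serialization publishes the address only if there is something to hand over.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_session_header {
	uint64_t *pa_slot; /* where the address is published at serialize time */
	void *ser;         /* serialization buffer, NULL until first needed */
	long count;        /* number of tracked sessions */
};

/* Setup no longer allocates; it only records the publication slot. */
static void demo_setup_outgoing(struct demo_session_header *sh, uint64_t *pa_slot)
{
	sh->pa_slot = pa_slot;
	sh->ser = NULL;
	sh->count = 0;
}

/* The first insert allocates the buffer; later inserts reuse it. */
static int demo_insert(struct demo_session_header *sh)
{
	if (!sh->ser) {
		sh->ser = calloc(1, 4096);
		if (!sh->ser)
			return -ENOMEM;
	}
	sh->count++;
	return 0;
}

/* Publish the buffer address only when sessions exist, else leave 0. */
static void demo_serialize(struct demo_session_header *sh)
{
	*sh->pa_slot = 0;
	if (sh->ser && sh->count > 0)
		*sh->pa_slot = (uint64_t)(uintptr_t)sh->ser;
}
```

A side effect of this shape is that setup can no longer fail, which is why the patch changes luo_session_setup_outgoing() to return void.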
Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/luo_core.c | 4 +- kernel/liveupdate/luo_internal.h | 2 +- kernel/liveupdate/luo_session.c | 68 ++++++++++++++++++++------------ 3 files changed, 45 insertions(+), 29 deletions(-) diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 69b00e7d0f8f..1b2bda22902d 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -165,9 +165,7 @@ static int __init luo_state_setup(void) strscpy(luo_ser->compatible, LUO_ABI_COMPATIBLE, sizeof(luo_ser->compatible)); luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; - err = luo_session_setup_outgoing(&luo_ser->sessions_pa); - if (err) - goto exit_free_luo_ser; + luo_session_setup_outgoing(&luo_ser->sessions_pa); err = luo_flb_setup_outgoing(&luo_ser->flbs_pa); if (err) diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index fe22086bfbeb..ee18f9a11b91 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -79,7 +79,7 @@ extern struct rw_semaphore luo_register_rwlock; int luo_session_create(const char *name, struct file **filep); int luo_session_retrieve(const char *name, struct file **filep); -int __init luo_session_setup_outgoing(u64 *sessions_pa); +void __init luo_session_setup_outgoing(u64 *sessions_pa); int __init luo_session_setup_incoming(u64 sessions_pa); int luo_session_serialize(void); int luo_session_deserialize(void); diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 1cd315e0f6de..2411849a34e3 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -108,15 +108,16 @@ static DECLARE_RWSEM(luo_session_serialize_rwsem); /** * struct luo_session_header - Header struct for managing LUO sessions. - * @count: The number of sessions currently tracked in the @list. 
- * @list: The head of the linked list of `struct luo_session` instances. - * @rwsem: A read-write semaphore providing synchronized access to the - * session list and other fields in this structure. - * @header_ser: The header data of serialization array. - * @ser: The serialized session data (an array of - * `struct luo_session_ser`). - * @active: Set to true when first initialized. If previous kernel did not - * send session data, active stays false for incoming. + * @count: The number of sessions currently tracked in the @list. + * @list: The head of the linked list of `struct luo_session` instances. + * @rwsem: A read-write semaphore providing synchronized access to the + * session list and other fields in this structure. + * @header_ser: The header data of serialization array. + * @ser: The serialized session data (an array of + * `struct luo_session_ser`). + * @sessions_pa: Points to the location of sessions_pa within struct luo_ser. + * @active: Set to true when first initialized. If previous kernel did not + * send session data, active stays false for incoming. 
*/ struct luo_session_header { long count; @@ -124,6 +125,7 @@ struct luo_session_header { struct rw_semaphore rwsem; struct luo_session_header_ser *header_ser; struct luo_session_ser *ser; + u64 *sessions_pa; bool active; }; @@ -171,10 +173,30 @@ static void luo_session_free(struct luo_session *session) kfree(session); } +static int luo_session_grow_ser(struct luo_session_header *sh) +{ + struct luo_session_header_ser *header_ser; + + if (sh->count == LUO_SESSION_MAX) + return -ENOMEM; + + if (sh->header_ser) + return 0; + + header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); + if (IS_ERR(header_ser)) + return PTR_ERR(header_ser); + + sh->header_ser = header_ser; + sh->ser = (void *)(header_ser + 1); + return 0; +} + static int luo_session_insert(struct luo_session_header *sh, struct luo_session *session) { struct luo_session *it; + int err; guard(rwsem_write)(&sh->rwsem); @@ -183,8 +205,9 @@ static int luo_session_insert(struct luo_session_header *sh, * for new session. */ if (sh == &luo_session_global.outgoing) { - if (sh->count == LUO_SESSION_MAX) - return -ENOMEM; + err = luo_session_grow_ser(sh); + if (err) + return err; } /* @@ -524,21 +547,10 @@ int luo_session_retrieve(const char *name, struct file **filep) return err; } -int __init luo_session_setup_outgoing(u64 *sessions_pa) +void __init luo_session_setup_outgoing(u64 *sessions_pa) { - struct luo_session_header_ser *header_ser; - - header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); - if (IS_ERR(header_ser)) - return PTR_ERR(header_ser); - - *sessions_pa = virt_to_phys(header_ser); - - luo_session_global.outgoing.header_ser = header_ser; - luo_session_global.outgoing.ser = (void *)(header_ser + 1); + luo_session_global.outgoing.sessions_pa = sessions_pa; luo_session_global.outgoing.active = true; - - return 0; } int __init luo_session_setup_incoming(u64 sessions_pa) @@ -644,6 +656,8 @@ int luo_session_serialize(void) down_write(&luo_session_serialize_rwsem); 
down_write(&sh->rwsem); + *sh->sessions_pa = 0; + list_for_each_entry(session, &sh->list, list) { err = luo_session_freeze_one(session, &sh->ser[i]); if (err) @@ -653,7 +667,11 @@ int luo_session_serialize(void) sizeof(sh->ser[i].name)); i++; } - sh->header_ser->count = sh->count; + + if (sh->header_ser && sh->count > 0) { + sh->header_ser->count = sh->count; + *sh->sessions_pa = virt_to_phys(sh->header_ser); + } up_write(&sh->rwsem); return 0; -- 2.53.0 From pasha.tatashin at soleen.com Wed Jun 3 08:43:58 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 15:43:58 +0000 Subject: [PATCH v7 09/13] liveupdate: Remove limit on the number of sessions In-Reply-To: <20260603154402.468928-1-pasha.tatashin@soleen.com> References: <20260603154402.468928-1-pasha.tatashin@soleen.com> Message-ID: <20260603154402.468928-10-pasha.tatashin@soleen.com> Currently, the number of LUO sessions is limited by a fixed number of pre-allocated pages for serialization (16 pages, allowing for ~819 sessions). This limit is a problem if LUO is used to back things such as the systemd file descriptor store, where it would preserve not just VM memory but also other state on the machine. Remove this limit by transitioning to a linked-block approach for session metadata serialization. Instead of a single contiguous block, session metadata is now stored in a chain of KHO serialization blocks, each KHO_BLOCK_SIZE (one page) in size. Each block starts with a header containing the physical address of the next block and the number of session entries in the current block.
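As a rough sketch of how the block chain tracks the session count, the two helpers below reproduce the loop conditions used by kho_block_set_grow() and kho_block_set_shrink() in this series, with the allocation and freeing stripped out so that only the arithmetic remains. The `demo_` names are hypothetical, and the per-block capacity is passed in as a parameter rather than derived from the real entry size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of blocks the set ends up with after growing from nblocks
 * until it can hold count entries (same condition as the grow loop).
 */
static uint64_t demo_blocks_after_grow(uint64_t nblocks, uint64_t count,
				       uint64_t per_block)
{
	while (count > nblocks * per_block)
		nblocks++;
	return nblocks;
}

/*
 * Number of blocks left after dropping trailing blocks that are no
 * longer needed for count entries (same condition as the shrink loop).
 */
static uint64_t demo_blocks_after_shrink(uint64_t nblocks, uint64_t count,
					 uint64_t per_block)
{
	while (nblocks > 0 && count <= (nblocks - 1) * per_block)
		nblocks--;
	return nblocks;
}
```

Inserting a session calls the grow step with the new count and removing one calls the shrink step, so the chain always has just enough blocks for the current sessions, and an empty session list ends up with no blocks at all.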
Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 23 +------ kernel/liveupdate/luo_session.c | 113 +++++++++++++++----------------- 2 files changed, 55 insertions(+), 81 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 9a4fe491812b..03d940d0f9bb 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -33,11 +33,6 @@ * It includes the compatibility string, the liveupdate-number, and pointers * to sessions and FLBs. * - * - struct luo_session_header_ser: - * Header for the session array. Contains the total page count of the - * preserved memory block and the number of `struct luo_session_ser` - * entries that follow. - * * - struct luo_session_ser: * Metadata for a single session, including its name and a physical pointer * to another preserved memory block containing an array of @@ -63,13 +58,14 @@ #define _LINUX_KHO_ABI_LUO_H #include +#include #include /* * The LUO state is registered under this KHO entry name. */ #define LUO_KHO_ENTRY_NAME "LUO" -#define LUO_ABI_COMPATIBLE "luo-v3" +#define LUO_ABI_COMPATIBLE "luo-v4" #define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) /** @@ -118,21 +114,6 @@ struct luo_file_set_ser { u64 count; } __packed; -/** - * struct luo_session_header_ser - Header for the serialized session data block. - * @count: The number of `struct luo_session_ser` entries that immediately - * follow this header in the memory block. - * - * This structure is located at the beginning of a contiguous block of - * physical memory preserved across the kexec. It provides the necessary - * metadata to interpret the array of session entries that follow. - * - * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. - */ -struct luo_session_header_ser { - u64 count; -} __packed; - /** * struct luo_session_ser - Represents the serialized metadata for a LUO session. 
* @name: The unique name of the session, provided by the userspace at diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 2411849a34e3..b79b2a488974 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -24,9 +24,10 @@ * ioctls on /dev/liveupdate. * * - Serialization: Session metadata is preserved using the KHO framework. When - * a live update is triggered via kexec, an array of `struct luo_session_ser` - * is populated and placed in a preserved memory region. The physical address - * of this array is stored in the centralized `struct luo_ser` structure. + * a live update is triggered via kexec, session metadata is serialized into + * a chain of linked-blocks and placed in a preserved memory region. The + * physical address of the first block header is stored in the centralized + * `struct luo_ser` structure. * * Session Lifecycle: * @@ -89,6 +90,7 @@ #include #include #include +#include #include #include #include @@ -98,23 +100,14 @@ #include #include "luo_internal.h" -/* 16 4K pages, give space for 744 sessions */ -#define LUO_SESSION_PGCNT 16ul -#define LUO_SESSION_MAX (((LUO_SESSION_PGCNT << PAGE_SHIFT) - \ - sizeof(struct luo_session_header_ser)) / \ - sizeof(struct luo_session_ser)) - static DECLARE_RWSEM(luo_session_serialize_rwsem); - /** * struct luo_session_header - Header struct for managing LUO sessions. * @count: The number of sessions currently tracked in the @list. * @list: The head of the linked list of `struct luo_session` instances. * @rwsem: A read-write semaphore providing synchronized access to the * session list and other fields in this structure. - * @header_ser: The header data of serialization array. - * @ser: The serialized session data (an array of - * `struct luo_session_ser`). + * @block_set: The set of serialization blocks. * @sessions_pa: Points to the location of sessions_pa within struct luo_ser. * @active: Set to true when first initialized. 
If previous kernel did not * send session data, active stays false for incoming. @@ -123,8 +116,7 @@ struct luo_session_header { long count; struct list_head list; struct rw_semaphore rwsem; - struct luo_session_header_ser *header_ser; - struct luo_session_ser *ser; + struct kho_block_set block_set; u64 *sessions_pa; bool active; }; @@ -143,10 +135,14 @@ static struct luo_session_global luo_session_global = { .incoming = { .list = LIST_HEAD_INIT(luo_session_global.incoming.list), .rwsem = __RWSEM_INITIALIZER(luo_session_global.incoming.rwsem), + .block_set = KHO_BLOCK_SET_INIT(luo_session_global.incoming.block_set, + sizeof(struct luo_session_ser)), }, .outgoing = { .list = LIST_HEAD_INIT(luo_session_global.outgoing.list), .rwsem = __RWSEM_INITIALIZER(luo_session_global.outgoing.rwsem), + .block_set = KHO_BLOCK_SET_INIT(luo_session_global.outgoing.block_set, + sizeof(struct luo_session_ser)), }, }; @@ -173,25 +169,6 @@ static void luo_session_free(struct luo_session *session) kfree(session); } -static int luo_session_grow_ser(struct luo_session_header *sh) -{ - struct luo_session_header_ser *header_ser; - - if (sh->count == LUO_SESSION_MAX) - return -ENOMEM; - - if (sh->header_ser) - return 0; - - header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); - if (IS_ERR(header_ser)) - return PTR_ERR(header_ser); - - sh->header_ser = header_ser; - sh->ser = (void *)(header_ser + 1); - return 0; -} - static int luo_session_insert(struct luo_session_header *sh, struct luo_session *session) { @@ -205,7 +182,7 @@ static int luo_session_insert(struct luo_session_header *sh, * for new session. 
*/ if (sh == &luo_session_global.outgoing) { - err = luo_session_grow_ser(sh); + err = kho_block_set_grow(&sh->block_set, sh->count + 1); if (err) return err; } @@ -232,6 +209,8 @@ static void luo_session_remove(struct luo_session_header *sh, guard(rwsem_write)(&sh->rwsem); list_del(&session->list); sh->count--; + if (sh == &luo_session_global.outgoing) + kho_block_set_shrink(&sh->block_set, sh->count); } static int luo_session_finish_one(struct luo_session *session) @@ -555,15 +534,17 @@ void __init luo_session_setup_outgoing(u64 *sessions_pa) int __init luo_session_setup_incoming(u64 sessions_pa) { - struct luo_session_header_ser *header_ser; + struct luo_session_header *sh = &luo_session_global.incoming; + int err; - if (sessions_pa) { - header_ser = phys_to_virt(sessions_pa); - luo_session_global.incoming.header_ser = header_ser; - luo_session_global.incoming.ser = (void *)(header_ser + 1); - luo_session_global.incoming.active = true; - } + if (!sessions_pa) + return 0; + err = kho_block_set_restore(&sh->block_set, sessions_pa); + if (err) + return err; + + sh->active = true; return 0; } @@ -605,6 +586,8 @@ int luo_session_deserialize(void) { struct luo_session_header *sh = &luo_session_global.incoming; static bool is_deserialized; + struct luo_session_ser *ser; + struct kho_block_set_it it; static int saved_err; int err; @@ -631,18 +614,19 @@ int luo_session_deserialize(void) * userspace to detect the failure and trigger a reboot, which will * reliably reset devices and reclaim memory. 
*/ - for (int i = 0; i < sh->header_ser->count; i++) { - err = luo_session_deserialize_one(sh, &sh->ser[i]); + kho_block_set_it_init(&it, &sh->block_set); + while ((ser = kho_block_set_it_read_entry(&it))) { + err = luo_session_deserialize_one(sh, ser); if (err) goto save_err; } - kho_restore_free(sh->header_ser); - sh->header_ser = NULL; - sh->ser = NULL; + kho_block_set_destroy(&sh->block_set); return 0; + save_err: + kho_block_set_destroy(&sh->block_set); saved_err = err; return err; } @@ -651,36 +635,45 @@ int luo_session_serialize(void) { struct luo_session_header *sh = &luo_session_global.outgoing; struct luo_session *session; - int i = 0; + struct kho_block_set_it it; int err; down_write(&luo_session_serialize_rwsem); down_write(&sh->rwsem); *sh->sessions_pa = 0; + kho_block_set_it_init(&it, &sh->block_set); + list_for_each_entry(session, &sh->list, list) { - err = luo_session_freeze_one(session, &sh->ser[i]); - if (err) + struct luo_session_ser *ser = kho_block_set_it_reserve_entry(&it); + + /* This should not fail normally as blocks were pre-allocated */ + if (WARN_ON_ONCE(!ser)) { + err = -ENOSPC; goto err_undo; + } - strscpy(sh->ser[i].name, session->name, - sizeof(sh->ser[i].name)); - i++; - } + err = luo_session_freeze_one(session, ser); + if (err) { + kho_block_set_it_prev(&it); + goto err_undo; + } - if (sh->header_ser && sh->count > 0) { - sh->header_ser->count = sh->count; - *sh->sessions_pa = virt_to_phys(sh->header_ser); + strscpy(ser->name, session->name, sizeof(ser->name)); } + + if (sh->count > 0) + *sh->sessions_pa = kho_block_set_head_pa(&sh->block_set); up_write(&sh->rwsem); return 0; err_undo: list_for_each_entry_continue_reverse(session, &sh->list, list) { - i--; - luo_session_unfreeze_one(session, &sh->ser[i]); - memset(sh->ser[i].name, 0, sizeof(sh->ser[i].name)); + struct luo_session_ser *ser = kho_block_set_it_prev(&it); + + luo_session_unfreeze_one(session, ser); + memset(ser->name, 0, sizeof(ser->name)); } up_write(&sh->rwsem); 
up_write(&luo_session_serialize_rwsem); -- 2.53.0 From pasha.tatashin at soleen.com Wed Jun 3 08:43:59 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 15:43:59 +0000 Subject: [PATCH v7 10/13] liveupdate: Remove limit on the number of files per session In-Reply-To: <20260603154402.468928-1-pasha.tatashin@soleen.com> References: <20260603154402.468928-1-pasha.tatashin@soleen.com> Message-ID: <20260603154402.468928-11-pasha.tatashin@soleen.com> To remove the fixed limit on the number of preserved files per session, transition the file metadata serialization from a single contiguous memory block to a chain of linked blocks. Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 13 +-- kernel/liveupdate/luo_file.c | 138 ++++++++++++++----------------- kernel/liveupdate/luo_internal.h | 6 +- 3 files changed, 74 insertions(+), 83 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 03d940d0f9bb..288076de6d4a 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -35,8 +35,8 @@ * * - struct luo_session_ser: * Metadata for a single session, including its name and a physical pointer - * to another preserved memory block containing an array of - * `struct luo_file_ser` for all files in that session. + * to the first `struct kho_block_header_ser` for all files in that session. + * Multiple blocks are linked via the `next` field in the header. * * - struct luo_file_ser: * Metadata for a single preserved file. Contains the `compatible` string to @@ -65,7 +65,7 @@ * The LUO state is registered under this KHO entry name. 
*/ #define LUO_KHO_ENTRY_NAME "LUO" -#define LUO_ABI_COMPATIBLE "luo-v4" +#define LUO_ABI_COMPATIBLE "luo-v5" #define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) /** @@ -102,9 +102,10 @@ struct luo_file_ser { /** * struct luo_file_set_ser - Represents the serialized metadata for file set - * @files: The physical address of a contiguous memory block that holds - * the serialized state of files (array of luo_file_ser) in this file - * set. + * @files: The physical address of the first `struct kho_block_header_ser`. + * This structure is the header for a block of memory containing + * an array of `struct luo_file_ser` entries. Multiple blocks are + * linked via the `next` field in the header. * @count: The total number of files that were part of this session during * serialization. Used for iteration and validation during * restoration. diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 9eec07a9e9fc..c39f96961a85 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -118,11 +118,6 @@ static LIST_HEAD(luo_file_handler_list); /* Keep track of files being preserved by LUO */ static DEFINE_XARRAY(luo_preserved_files); -/* 2 4K pages, give space for 128 files per file_set */ -#define LUO_FILE_PGCNT 2ul -#define LUO_FILE_MAX \ - ((LUO_FILE_PGCNT << PAGE_SHIFT) / sizeof(struct luo_file_ser)) - /** * struct luo_file - Represents a single preserved file instance. 
* @fh: Pointer to the &struct liveupdate_file_handler that manages @@ -174,39 +169,6 @@ struct luo_file { u64 token; }; -static int luo_alloc_files_mem(struct luo_file_set *file_set) -{ - size_t size; - void *mem; - - if (file_set->files) - return 0; - - WARN_ON_ONCE(file_set->count); - - size = LUO_FILE_PGCNT << PAGE_SHIFT; - mem = kho_alloc_preserve(size); - if (IS_ERR(mem)) - return PTR_ERR(mem); - - file_set->files = mem; - - return 0; -} - -static void luo_free_files_mem(struct luo_file_set *file_set) -{ - /* If file_set has files, no need to free preservation memory */ - if (file_set->count) - return; - - if (!file_set->files) - return; - - kho_unpreserve_free(file_set->files); - file_set->files = NULL; -} - static unsigned long luo_get_id(struct liveupdate_file_handler *fh, struct file *file) { @@ -276,16 +238,15 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) if (luo_token_is_used(file_set, token)) return -EEXIST; - if (file_set->count == LUO_FILE_MAX) - return -ENOSPC; + err = kho_block_set_grow(&file_set->block_set, file_set->count + 1); + if (err) + return err; file = fget(fd); - if (!file) - return -EBADF; - - err = luo_alloc_files_mem(file_set); - if (err) - goto err_fput; + if (!file) { + err = -EBADF; + goto err_shrink; + } err = -ENOENT; down_read(&luo_register_rwlock); @@ -300,7 +261,7 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) /* err is still -ENOENT if no handler was found */ if (err) - goto err_free_files_mem; + goto err_fput; err = xa_insert(&luo_preserved_files, luo_get_id(fh, file), file, GFP_KERNEL); @@ -343,10 +304,10 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) xa_erase(&luo_preserved_files, luo_get_id(fh, file)); err_module_put: module_put(fh->ops->owner); -err_free_files_mem: - luo_free_files_mem(file_set); err_fput: fput(file); +err_shrink: + kho_block_set_shrink(&file_set->block_set, file_set->count); return err; } @@ -392,13 +353,14 @@ void 
luo_file_unpreserve_files(struct luo_file_set *file_set) list_del(&luo_file->list); file_set->count--; + kho_block_set_shrink(&file_set->block_set, file_set->count); fput(luo_file->file); mutex_destroy(&luo_file->mutex); kfree(luo_file); } - luo_free_files_mem(file_set); + kho_block_set_destroy(&file_set->block_set); } static int luo_file_freeze_one(struct luo_file_set *file_set, @@ -454,7 +416,7 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, luo_file_unfreeze_one(file_set, luo_file); } - memset(file_set->files, 0, LUO_FILE_PGCNT << PAGE_SHIFT); + kho_block_set_clear(&file_set->block_set); } /** @@ -493,19 +455,24 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, int luo_file_freeze(struct luo_file_set *file_set, struct luo_file_set_ser *file_set_ser) { - struct luo_file_ser *file_ser = file_set->files; struct luo_file *luo_file; + struct kho_block_set_it it; int err; - int i; if (!file_set->count) return 0; - if (WARN_ON(!file_ser)) - return -EINVAL; + kho_block_set_it_init(&it, &file_set->block_set); - i = 0; list_for_each_entry(luo_file, &file_set->files_list, list) { + struct luo_file_ser *file_ser = kho_block_set_it_reserve_entry(&it); + + /* This should not fail normally as blocks were pre-allocated */ + if (WARN_ON_ONCE(!file_ser)) { + err = -ENOSPC; + goto err_unfreeze; + } + err = luo_file_freeze_one(file_set, luo_file); if (err < 0) { pr_warn("Freeze failed for token[%#0llx] handler[%s] err[%pe]\n", @@ -514,16 +481,14 @@ int luo_file_freeze(struct luo_file_set *file_set, goto err_unfreeze; } - strscpy(file_ser[i].compatible, luo_file->fh->compatible, - sizeof(file_ser[i].compatible)); - file_ser[i].data = luo_file->serialized_data; - file_ser[i].token = luo_file->token; - i++; + strscpy(file_ser->compatible, luo_file->fh->compatible, + sizeof(file_ser->compatible)); + file_ser->data = luo_file->serialized_data; + file_ser->token = luo_file->token; } file_set_ser->count = file_set->count; - if (file_set->files) - 
file_set_ser->files = virt_to_phys(file_set->files); + file_set_ser->files = kho_block_set_head_pa(&file_set->block_set); return 0; @@ -741,14 +706,12 @@ int luo_file_finish(struct luo_file_set *file_set) module_put(luo_file->fh->ops->owner); list_del(&luo_file->list); file_set->count--; + kho_block_set_shrink(&file_set->block_set, file_set->count); mutex_destroy(&luo_file->mutex); kfree(luo_file); } - if (file_set->files) { - kho_restore_free(file_set->files); - file_set->files = NULL; - } + kho_block_set_destroy(&file_set->block_set); return 0; } @@ -822,16 +785,18 @@ int luo_file_deserialize(struct luo_file_set *file_set, struct luo_file_set_ser *file_set_ser) { struct luo_file_ser *file_ser; + struct kho_block_set_it it; int err; - u64 i; if (!file_set_ser->files) { WARN_ON(file_set_ser->count); return 0; } - file_set->count = file_set_ser->count; - file_set->files = phys_to_virt(file_set_ser->files); + file_set->count = 0; + err = kho_block_set_restore(&file_set->block_set, file_set_ser->files); + if (err) + return err; /* * Note on error handling: @@ -848,25 +813,50 @@ int luo_file_deserialize(struct luo_file_set *file_set, * userspace to detect the failure and trigger a reboot, which will * reliably reset devices and reclaim memory. 
*/ - file_ser = file_set->files; - for (i = 0; i < file_set->count; i++) { - err = luo_file_deserialize_one(file_set, &file_ser[i]); + kho_block_set_it_init(&it, &file_set->block_set); + while ((file_ser = kho_block_set_it_read_entry(&it))) { + err = luo_file_deserialize_one(file_set, file_ser); if (err) - return err; + goto err_destroy_blocks; + file_set->count++; + } + + if (file_set->count != file_set_ser->count) { + pr_warn("File count mismatch: expected %llu, found %llu\n", + file_set_ser->count, file_set->count); + err = -EINVAL; + goto err_destroy_blocks; } return 0; + +err_destroy_blocks: + while (!list_empty(&file_set->files_list)) { + struct luo_file *luo_file; + + luo_file = list_first_entry(&file_set->files_list, + struct luo_file, list); + list_del(&luo_file->list); + module_put(luo_file->fh->ops->owner); + mutex_destroy(&luo_file->mutex); + kfree(luo_file); + } + file_set->count = 0; + kho_block_set_destroy(&file_set->block_set); + return err; } void luo_file_set_init(struct luo_file_set *file_set) { INIT_LIST_HEAD(&file_set->files_list); + kho_block_set_init(&file_set->block_set, sizeof(struct luo_file_ser)); } void luo_file_set_destroy(struct luo_file_set *file_set) { WARN_ON(file_set->count); WARN_ON(!list_empty(&file_set->files_list)); + WARN_ON(!kho_block_set_is_empty(&file_set->block_set)); } /** diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index ee18f9a11b91..64879ffe7378 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -10,6 +10,7 @@ #include #include +#include struct luo_ucmd { void __user *ubuffer; @@ -44,14 +45,13 @@ static inline int luo_ucmd_respond(struct luo_ucmd *ucmd, * struct luo_file_set - A set of files that belong to the same sessions. * @files_list: An ordered list of files associated with this session, it is * ordered by preservation time. - * @files: The physically contiguous memory block that holds the serialized - * state of files. 
+ * @block_set: The set of serialization blocks. * @count: A counter tracking the number of files currently stored in the * @files_list for this session. */ struct luo_file_set { struct list_head files_list; - struct luo_file_ser *files; + struct kho_block_set block_set; u64 count; }; -- 2.53.0 From pasha.tatashin at soleen.com Wed Jun 3 08:44:00 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 15:44:00 +0000 Subject: [PATCH v7 11/13] selftests/liveupdate: Test session and file limit removal In-Reply-To: <20260603154402.468928-1-pasha.tatashin@soleen.com> References: <20260603154402.468928-1-pasha.tatashin@soleen.com> Message-ID: <20260603154402.468928-12-pasha.tatashin@soleen.com> With the removal of static limits on the number of sessions and files per session, the orchestrator now uses dynamic allocation. Add new test cases to verify that the system can handle a large number of sessions and files. These tests ensure that the dynamic block allocation and reuse logic for session metadata and outgoing files work correctly beyond the previous static limits. 
Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- .../testing/selftests/liveupdate/liveupdate.c | 75 +++++++++++++++++++ .../selftests/liveupdate/luo_test_utils.c | 24 ++++++ .../selftests/liveupdate/luo_test_utils.h | 2 + 3 files changed, 101 insertions(+) diff --git a/tools/testing/selftests/liveupdate/liveupdate.c b/tools/testing/selftests/liveupdate/liveupdate.c index c7d94b9181e1..502fb3567e38 100644 --- a/tools/testing/selftests/liveupdate/liveupdate.c +++ b/tools/testing/selftests/liveupdate/liveupdate.c @@ -26,6 +26,7 @@ #include +#include "luo_test_utils.h" #include "../kselftest.h" #include "../kselftest_harness.h" @@ -499,4 +500,78 @@ TEST_F(liveupdate_device, get_session_name_max_length) ASSERT_EQ(close(session_fd), 0); } +/* + * Test Case: Manage Many Sessions + * + * Verifies that a large number of sessions can be created and then + * destroyed during normal system operation. This specifically tests the + * dynamic block allocation and reuse logic for session metadata management + * without preserving any files. 
+ */ +TEST_F(liveupdate_device, preserve_many_sessions) +{ +#define MANY_SESSIONS 2000 + int session_fds[MANY_SESSIONS]; + int ret, i; + + self->fd1 = open(LIVEUPDATE_DEV, O_RDWR); + if (self->fd1 < 0 && errno == ENOENT) + SKIP(return, "%s does not exist", LIVEUPDATE_DEV); + ASSERT_GE(self->fd1, 0); + + ret = luo_ensure_nofile_limit(MANY_SESSIONS); + if (ret == -EPERM) + SKIP(return, "Insufficient privileges to set RLIMIT_NOFILE"); + ASSERT_EQ(ret, 0); + + for (i = 0; i < MANY_SESSIONS; i++) { + char name[64]; + + snprintf(name, sizeof(name), "many-session-%d", i); + session_fds[i] = create_session(self->fd1, name); + ASSERT_GE(session_fds[i], 0); + } + + for (i = 0; i < MANY_SESSIONS; i++) + ASSERT_EQ(close(session_fds[i]), 0); +} + +/* + * Test Case: Preserve Many Files + * + * Verifies that a large number of files can be preserved in a single session + * and then destroyed during normal system operation. This tests the dynamic + * block allocation and management for outgoing files. + */ +TEST_F(liveupdate_device, preserve_many_files) +{ +#define MANY_FILES 500 + int mem_fds[MANY_FILES]; + int session_fd, ret, i; + + self->fd1 = open(LIVEUPDATE_DEV, O_RDWR); + if (self->fd1 < 0 && errno == ENOENT) + SKIP(return, "%s does not exist", LIVEUPDATE_DEV); + ASSERT_GE(self->fd1, 0); + + session_fd = create_session(self->fd1, "many-files-test"); + ASSERT_GE(session_fd, 0); + + ret = luo_ensure_nofile_limit(MANY_FILES + 10); + if (ret == -EPERM) + SKIP(return, "Insufficient privileges to set RLIMIT_NOFILE"); + ASSERT_EQ(ret, 0); + + for (i = 0; i < MANY_FILES; i++) { + mem_fds[i] = memfd_create("test-memfd", 0); + ASSERT_GE(mem_fds[i], 0); + ASSERT_EQ(preserve_fd(session_fd, mem_fds[i], i), 0); + } + + for (i = 0; i < MANY_FILES; i++) + ASSERT_EQ(close(mem_fds[i]), 0); + + ASSERT_EQ(close(session_fd), 0); +} + TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/liveupdate/luo_test_utils.c b/tools/testing/selftests/liveupdate/luo_test_utils.c index 
3c8721c505df..333a3530051b 100644 --- a/tools/testing/selftests/liveupdate/luo_test_utils.c +++ b/tools/testing/selftests/liveupdate/luo_test_utils.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -28,6 +29,29 @@ int luo_open_device(void) return open(LUO_DEVICE, O_RDWR); } +int luo_ensure_nofile_limit(long min_limit) +{ + struct rlimit hl; + + /* Allow to extra files to be used by test itself */ + min_limit += 32; + + if (getrlimit(RLIMIT_NOFILE, &hl) < 0) + return -errno; + + if (hl.rlim_cur >= min_limit) + return 0; + + hl.rlim_cur = min_limit; + if (hl.rlim_cur > hl.rlim_max) + hl.rlim_max = hl.rlim_cur; + + if (setrlimit(RLIMIT_NOFILE, &hl) < 0) + return -errno; + + return 0; +} + int luo_create_session(int luo_fd, const char *name) { struct liveupdate_ioctl_create_session arg = { .size = sizeof(arg) }; diff --git a/tools/testing/selftests/liveupdate/luo_test_utils.h b/tools/testing/selftests/liveupdate/luo_test_utils.h index 90099bf49577..6a0d85386613 100644 --- a/tools/testing/selftests/liveupdate/luo_test_utils.h +++ b/tools/testing/selftests/liveupdate/luo_test_utils.h @@ -26,6 +26,8 @@ int luo_create_session(int luo_fd, const char *name); int luo_retrieve_session(int luo_fd, const char *name); int luo_session_finish(int session_fd); +int luo_ensure_nofile_limit(long min_limit); + int create_and_preserve_memfd(int session_fd, int token, const char *data); int restore_and_verify_memfd(int session_fd, int token, const char *expected_data); -- 2.53.0 From pasha.tatashin at soleen.com Wed Jun 3 08:44:01 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 15:44:01 +0000 Subject: [PATCH v7 12/13] selftests/liveupdate: Add stress-sessions kexec test In-Reply-To: <20260603154402.468928-1-pasha.tatashin@soleen.com> References: <20260603154402.468928-1-pasha.tatashin@soleen.com> Message-ID: <20260603154402.468928-13-pasha.tatashin@soleen.com> Add a new test that creates 2000 LUO sessions before a kexec 
reboot and verifies their presence after the reboot. This ensures that the linked-block serialization mechanism works correctly for a large number of sessions. Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- tools/testing/selftests/liveupdate/Makefile | 1 + .../liveupdate/luo_stress_sessions.c | 102 ++++++++++++++++++ 2 files changed, 103 insertions(+) create mode 100644 tools/testing/selftests/liveupdate/luo_stress_sessions.c diff --git a/tools/testing/selftests/liveupdate/Makefile b/tools/testing/selftests/liveupdate/Makefile index 080754787ede..ed7534468386 100644 --- a/tools/testing/selftests/liveupdate/Makefile +++ b/tools/testing/selftests/liveupdate/Makefile @@ -6,6 +6,7 @@ TEST_GEN_PROGS += liveupdate TEST_GEN_PROGS_EXTENDED += luo_kexec_simple TEST_GEN_PROGS_EXTENDED += luo_multi_session +TEST_GEN_PROGS_EXTENDED += luo_stress_sessions TEST_FILES += do_kexec.sh diff --git a/tools/testing/selftests/liveupdate/luo_stress_sessions.c b/tools/testing/selftests/liveupdate/luo_stress_sessions.c new file mode 100644 index 000000000000..f201b1839d1d --- /dev/null +++ b/tools/testing/selftests/liveupdate/luo_stress_sessions.c @@ -0,0 +1,102 @@ +// SPDX-License-Identifier: GPL-2.0-only + +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + * + * Validate that LUO can handle a large number of sessions across a kexec + * reboot. + */ + +#include +#include +#include "luo_test_utils.h" + +#define NUM_SESSIONS 2000 +#define STATE_SESSION_NAME "kexec_many_state" +#define STATE_MEMFD_TOKEN 999 + +/* Stage 1: Executed before the kexec reboot. 
*/ +static void run_stage_1(int luo_fd) +{ + int ret, i; + + ksft_print_msg("[STAGE 1] Increasing ulimit for open files...\n"); + ret = luo_ensure_nofile_limit(NUM_SESSIONS); + if (ret == -EPERM) + ksft_exit_skip("Insufficient privileges to set RLIMIT_NOFILE\n"); + if (ret < 0) + ksft_exit_fail_msg("luo_ensure_nofile_limit failed: %s\n", strerror(-ret)); + + ksft_print_msg("[STAGE 1] Creating state file for next stage (2)...\n"); + create_state_file(luo_fd, STATE_SESSION_NAME, STATE_MEMFD_TOKEN, 2); + + ksft_print_msg("[STAGE 1] Creating %d sessions...\n", NUM_SESSIONS); + + for (i = 0; i < NUM_SESSIONS; i++) { + char name[LIVEUPDATE_SESSION_NAME_LENGTH]; + int s_fd; + + snprintf(name, sizeof(name), "many-test-%d", i); + s_fd = luo_create_session(luo_fd, name); + if (s_fd < 0) { + fail_exit("luo_create_session for '%s' at index %d", + name, i); + } + } + + ksft_print_msg("[STAGE 1] Successfully created %d sessions.\n", + NUM_SESSIONS); + + close(luo_fd); + daemonize_and_wait(); +} + +/* Stage 2: Executed after the kexec reboot. 
*/ +static void run_stage_2(int luo_fd, int state_session_fd) +{ + int i, stage; + + ksft_print_msg("[STAGE 2] Starting post-kexec verification...\n"); + + restore_and_read_stage(state_session_fd, STATE_MEMFD_TOKEN, &stage); + if (stage != 2) { + fail_exit("Expected stage 2, but state file contains %d", + stage); + } + + ksft_print_msg("[STAGE 2] Retrieving and finishing %d sessions...\n", + NUM_SESSIONS); + + for (i = 0; i < NUM_SESSIONS; i++) { + char name[LIVEUPDATE_SESSION_NAME_LENGTH]; + int s_fd; + + snprintf(name, sizeof(name), "many-test-%d", i); + s_fd = luo_retrieve_session(luo_fd, name); + if (s_fd < 0) { + fail_exit("luo_retrieve_session for '%s' at index %d", + name, i); + } + + if (luo_session_finish(s_fd) < 0) { + fail_exit("luo_session_finish for '%s' at index %d", + name, i); + } + close(s_fd); + } + + ksft_print_msg("[STAGE 2] Finalizing state session...\n"); + if (luo_session_finish(state_session_fd) < 0) + fail_exit("luo_session_finish for state session"); + close(state_session_fd); + + ksft_print_msg("\n--- MANY-SESSIONS KEXEC TEST PASSED (%d sessions) ---\n", + NUM_SESSIONS); +} + +int main(int argc, char *argv[]) +{ + return luo_test(argc, argv, STATE_SESSION_NAME, + run_stage_1, run_stage_2); +} -- 2.53.0 From pasha.tatashin at soleen.com Wed Jun 3 08:44:02 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 3 Jun 2026 15:44:02 +0000 Subject: [PATCH v7 13/13] selftests/liveupdate: Add stress-files kexec test In-Reply-To: <20260603154402.468928-1-pasha.tatashin@soleen.com> References: <20260603154402.468928-1-pasha.tatashin@soleen.com> Message-ID: <20260603154402.468928-14-pasha.tatashin@soleen.com> Add a new luo_stress_files kexec test that verifies preserving and retrieving 500 files across a kexec reboot. 
Reviewed-by: Pratyush Yadav (Google) Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Pasha Tatashin --- tools/testing/selftests/liveupdate/Makefile | 1 + .../selftests/liveupdate/luo_stress_files.c | 97 +++++++++++++++++++ 2 files changed, 98 insertions(+) create mode 100644 tools/testing/selftests/liveupdate/luo_stress_files.c diff --git a/tools/testing/selftests/liveupdate/Makefile b/tools/testing/selftests/liveupdate/Makefile index ed7534468386..30689d22cb02 100644 --- a/tools/testing/selftests/liveupdate/Makefile +++ b/tools/testing/selftests/liveupdate/Makefile @@ -7,6 +7,7 @@ TEST_GEN_PROGS += liveupdate TEST_GEN_PROGS_EXTENDED += luo_kexec_simple TEST_GEN_PROGS_EXTENDED += luo_multi_session TEST_GEN_PROGS_EXTENDED += luo_stress_sessions +TEST_GEN_PROGS_EXTENDED += luo_stress_files TEST_FILES += do_kexec.sh diff --git a/tools/testing/selftests/liveupdate/luo_stress_files.c b/tools/testing/selftests/liveupdate/luo_stress_files.c new file mode 100644 index 000000000000..0cdf9cd4bac7 --- /dev/null +++ b/tools/testing/selftests/liveupdate/luo_stress_files.c @@ -0,0 +1,97 @@ +// SPDX-License-Identifier: GPL-2.0-only + +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + * + * Validate that LUO can handle a large number of files per session across + * a kexec reboot. + */ + +#include +#include +#include "luo_test_utils.h" + +#define NUM_FILES 500 +#define STATE_SESSION_NAME "kexec_many_files_state" +#define STATE_MEMFD_TOKEN 9999 +#define TEST_SESSION_NAME "many_files_session" + +/* Stage 1: Executed before the kexec reboot. 
*/ +static void run_stage_1(int luo_fd) +{ + int session_fd, i; + + ksft_print_msg("[STAGE 1] Creating state file for next stage (2)...\n"); + create_state_file(luo_fd, STATE_SESSION_NAME, STATE_MEMFD_TOKEN, 2); + + ksft_print_msg("[STAGE 1] Creating test session '%s'...\n", TEST_SESSION_NAME); + session_fd = luo_create_session(luo_fd, TEST_SESSION_NAME); + if (session_fd < 0) + fail_exit("luo_create_session"); + + ksft_print_msg("[STAGE 1] Preserving %d files...\n", NUM_FILES); + for (i = 0; i < NUM_FILES; i++) { + char data[64]; + + snprintf(data, sizeof(data), "file-data-%d", i); + if (create_and_preserve_memfd(session_fd, i, data) < 0) + fail_exit("create_and_preserve_memfd for index %d", i); + } + + ksft_print_msg("[STAGE 1] Successfully preserved %d files.\n", NUM_FILES); + + close(luo_fd); + daemonize_and_wait(); +} + +/* Stage 2: Executed after the kexec reboot. */ +static void run_stage_2(int luo_fd, int state_session_fd) +{ + int session_fd; + int i, stage; + + ksft_print_msg("[STAGE 2] Starting post-kexec verification...\n"); + + restore_and_read_stage(state_session_fd, STATE_MEMFD_TOKEN, &stage); + if (stage != 2) { + fail_exit("Expected stage 2, but state file contains %d", + stage); + } + + ksft_print_msg("[STAGE 2] Retrieving test session '%s'...\n", TEST_SESSION_NAME); + session_fd = luo_retrieve_session(luo_fd, TEST_SESSION_NAME); + if (session_fd < 0) + fail_exit("luo_retrieve_session"); + + ksft_print_msg("[STAGE 2] Verifying %d files...\n", NUM_FILES); + for (i = 0; i < NUM_FILES; i++) { + char data[64]; + int fd; + + snprintf(data, sizeof(data), "file-data-%d", i); + fd = restore_and_verify_memfd(session_fd, i, data); + if (fd < 0) + fail_exit("restore_and_verify_memfd for index %d", i); + close(fd); + } + + ksft_print_msg("[STAGE 2] Finishing test session...\n"); + if (luo_session_finish(session_fd) < 0) + fail_exit("luo_session_finish for test session"); + close(session_fd); + + ksft_print_msg("[STAGE 2] Finalizing state session...\n"); + if 
(luo_session_finish(state_session_fd) < 0)
+		fail_exit("luo_session_finish for state session");
+	close(state_session_fd);
+
+	ksft_print_msg("\n--- MANY-FILES KEXEC TEST PASSED (%d files) ---\n",
+		       NUM_FILES);
+}
+
+int main(int argc, char *argv[])
+{
+	return luo_test(argc, argv, STATE_SESSION_NAME,
+			run_stage_1, run_stage_2);
+}
-- 
2.53.0

From jloeser at linux.microsoft.com Wed Jun 3 10:25:58 2026
From: jloeser at linux.microsoft.com (Jork Loeser)
Date: Wed, 3 Jun 2026 10:25:58 -0700 (PDT)
Subject: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions
In-Reply-To:
References: <20260528004204.1484584-1-jloeser@linux.microsoft.com>
Message-ID: <3197c9c9-9e4f-c592-bb7-ac422f89115@linux.microsoft.com>

On Wed, 3 Jun 2026, Mike Rapoport wrote:

> On Mon, Jun 01, 2026 at 01:09:41PM -0700, Jork Loeser wrote:
>> On Sun, 31 May 2026, Mike Rapoport wrote:
>>
>>>> Patch 19: Export kexec_in_progress for modules
>>>
>>> Isn't there another way to differentiate kexec reboot?
>
> There's that "kexec reboot" string passed as the cmd to the reboot
> notifier.
> Maybe we can make it somehow more well defined API and use it?

A string? Dear my - the compiler won't flag it on an API change then, not
ideal clearly. What's wrong with exporting kexec_in_progress()?

Best,
Jork

From robh at kernel.org Wed Jun 3 10:44:34 2026
From: robh at kernel.org (Rob Herring)
Date: Wed, 3 Jun 2026 12:44:34 -0500
Subject: [PATCH v3 03/11] of: reserved_mem: avoid post-init UAF when alloc_reserved_mem_array() fails
In-Reply-To: <79932afc-2e91-4a54-aff9-f550be784c36@gmail.com>
References: <20260527032917.3385849-1-chenwandun1@gmail.com> <20260527032917.3385849-4-chenwandun1@gmail.com> <20260602162450.GA442759-robh@kernel.org> <79932afc-2e91-4a54-aff9-f550be784c36@gmail.com>
Message-ID:

On Wed, Jun 3, 2026 at 1:44 AM Wandun wrote:
>
>
>
> On 6/3/26 00:24, Rob Herring wrote:
> > On Wed, May 27, 2026 at 11:29:09AM +0800, Wandun Chen wrote:
> >> From: Wandun Chen
> >>
> >> The global pointer 'reserved_mem' continues to reference the
> >> reserved_mem_array which lives in __initdata if
> >> alloc_reserved_mem_array() fails. of_reserved_mem_lookup() is
> >> exported for post-init use, that would dereference freed memory
> >> and trigger a use-after-free.
> >>
> >> So reset reserved_mem_count to 0 when alloc_reserved_mem_array()
> >> fails.
> >>
> >> Fixes: 00c9a452a235 ("of: reserved_mem: Add code to dynamically allocate reserved_mem array")
> > Fixes should come first in a series.
> Understood, will do in future submissions.
> >
> >> Signed-off-by: Wandun Chen
> >> ---
> >>  drivers/of/of_reserved_mem.c | 20 ++++++++++++++------
> >>  1 file changed, 14 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
> >> index 313cbc57aa45..6d479381ff1f 100644
> >> --- a/drivers/of/of_reserved_mem.c
> >> +++ b/drivers/of/of_reserved_mem.c
> >> @@ -69,29 +69,31 @@ static int __init early_init_dt_alloc_reserved_memory_arch(phys_addr_t size,
> >>   * the initial static array is copied over to this new array and
> >>   * the new array is used from this point on.
> >>   */
> >> -static void __init alloc_reserved_mem_array(void)
> >> +static bool __init alloc_reserved_mem_array(void)
> >>  {
> >>  	struct reserved_mem *new_array;
> >>  	size_t alloc_size, copy_size, memset_size;
> >>
> >> +	if (!total_reserved_mem_cnt)
> >> +		return true;
> >> +
> >>  	alloc_size = array_size(total_reserved_mem_cnt, sizeof(*new_array));
> >>  	if (alloc_size == SIZE_MAX) {
> >>  		pr_err("Failed to allocate memory for reserved_mem array with err: %d", -EOVERFLOW);
> >> -		return;
> >> +		goto fail;
> >>  	}
> >>
> >>  	new_array = memblock_alloc(alloc_size, SMP_CACHE_BYTES);
> >>  	if (!new_array) {
> >>  		pr_err("Failed to allocate memory for reserved_mem array with err: %d", -ENOMEM);
> >> -		return;
> >> +		goto fail;
> >>  	}
> >>
> >>  	copy_size = array_size(reserved_mem_count, sizeof(*new_array));
> >>  	if (copy_size == SIZE_MAX) {
> >>  		memblock_free(new_array, alloc_size);
> >> -		total_reserved_mem_cnt = MAX_RESERVED_REGIONS;
> >>  		pr_err("Failed to allocate memory for reserved_mem array with err: %d", -EOVERFLOW);
> > These prints could be moved to 'fail'. Perhaps instead of just printing
> > an error value, you can return the error value instead of boolean.
> Will do, consolidating pr_err() under 'fail' and changing the return type
> to int.
> >
> > If you respin just this patch, I can pick it up for 7.2.
> Before I respin, I'd like to flag a dependency:
> patch 05/07 in this series build on the signature change introduced by this
> patch ("the void -> bool return type change of alloc_reserved_mem_array()")
>
> Could you let me know which of the following you'd prefer:
> a) Take patch 03 alone via your tree as you suggested, after it lands, I'll
> respin the remaining patches of this series.

I would go with this option. AIUI, this series isn't going to land in
7.2, so ultimately you will rebase on v7.2-rc1 which will have the fix.

>
> b) Keep patch 03 in the v4 respin of the full series, reordered to the front
> per your earlier comment.
Rob From rppt at kernel.org Wed Jun 3 11:16:52 2026 From: rppt at kernel.org (Mike Rapoport) Date: Wed, 03 Jun 2026 21:16:52 +0300 Subject: [PATCH v7 00/13] liveupdate: Remove limits on sessions and files In-Reply-To: <20260603154402.468928-1-pasha.tatashin@soleen.com> References: <20260603154402.468928-1-pasha.tatashin@soleen.com> Message-ID: <178051061274.867224.3632796902576075261.b4-ty@b4> On Wed, 03 Jun 2026 15:43:49 +0000, Pasha Tatashin wrote: > liveupdate: Remove limits on sessions and files > > Hi all, > > This series removes the fixed limits on the number of files that can > be preserved within a single session, and the total number of sessions > managed by the Live Update Orchestrator (LUO). > > [...] Applied to next branch of liveupdate/linux.git tree, thanks! [01/13] liveupdate: change file_set->count type to u64 for type safety commit: 81fbb909ec07868415f6b2269922c8d1cc6a215a [02/13] liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd commit: 6af06e11bd48bdefaf9381f6ff0bd65b1e5d98ab [03/13] liveupdate: centralize state management into struct luo_ser commit: d376e4b55c9a0adb3e701c7eaff21d9ba655a1c6 [04/13] liveupdate: register luo_ser as KHO subtree commit: cf071b3536df76a2a75b83ca1fe8c043824352c3 [05/13] liveupdate: Extract luo_file_deserialize_one helper commit: 51b71af922a7145e63fdc0cab075d681ecd89e4a [06/13] liveupdate: Extract luo_session_deserialize_one helper commit: be9d10d167652e11283cd07c7daf187222808db1 [07/13] kho: add support for linked-block serialization commit: 0349ff2887059112ce06831ab29aec47a2a7285a [08/13] liveupdate: defer session block allocation and physical address setting commit: b5a58a922e6f2f9f40faddd8e0e1fe3ce0ea9c56 [09/13] liveupdate: Remove limit on the number of sessions commit: 2a441a14c2c03b39d1c89438dd28cef9d8fa57d5 [10/13] liveupdate: Remove limit on the number of files per session commit: 1d1153097f4dd417e2ea00404edec9fbd1d88f28 [11/13] selftests/liveupdate: Test session and file limit 
removal commit: 5ba3f30643cbdd79fb82e525aa1ca55b62fcc7ac [12/13] selftests/liveupdate: Add stress-sessions kexec test commit: 3432292fb9130191dca57953941f7ae3888d52d8 [13/13] selftests/liveupdate: Add stress-files kexec test commit: 46429a15a6dfe522880d5085f1f6999357758872 tree: https://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux branch: next -- Sincerely yours, Mike. From ackerleytng at google.com Wed Jun 3 15:17:06 2026 From: ackerleytng at google.com (Ackerley Tng) Date: Wed, 3 Jun 2026 15:17:06 -0700 Subject: [RFC PATCH v1 0/8] liveupdate: kvm: Guest_memfd preservation In-Reply-To: References: <9huzwlwnbgdd.fsf@tarunix.c.googlers.com> Message-ID: Sean Christopherson writes: >> >> [...snip...] >> >> we have one open Question left: >> 1. How to check guest_memfd is fully shared. >> >> [...snip...] >> > > Given that lack of support isn't going to be limited to _just_ guest_memfd, > simply disallow preservation if the VM supports private memory: > > if (kvm_arch_has_private_mem(kvm)) > return -EOPNOTSUPP; Makes sense. Tarun this was the other option that I was suggesting when we discussed offline. I think (?) it is possible to create a fully-private guest_memfd for a non-Confidential VM, and even after conversion lands, for both vm_memory_attributes=true and vm_memory_attributes=false. In that case, your preservation series can still preserve memory tracked as private by guest_memfd but not used as private, right? I don't think anyone will use this combination before guest_memfd write() support lands, we just need to make sure there's no kernel crash or corruption in this case. 
From chenwandun1 at gmail.com Wed Jun 3 18:48:44 2026 From: chenwandun1 at gmail.com (Wandun) Date: Thu, 4 Jun 2026 09:48:44 +0800 Subject: [PATCH v3 03/11] of: reserved_mem: avoid post-init UAF when alloc_reserved_mem_array() fails In-Reply-To: References: <20260527032917.3385849-1-chenwandun1@gmail.com> <20260527032917.3385849-4-chenwandun1@gmail.com> <20260602162450.GA442759-robh@kernel.org> <79932afc-2e91-4a54-aff9-f550be784c36@gmail.com> Message-ID: <8e088395-c22d-4bc8-9e58-84235af1e56b@gmail.com> On 6/4/26 01:44, Rob Herring wrote: > On Wed, Jun 3, 2026 at 1:44 AM Wandun wrote: >> >> >> On 6/3/26 00:24, Rob Herring wrote: >>> On Wed, May 27, 2026 at 11:29:09AM +0800, Wandun Chen wrote: >>>> From: Wandun Chen >>>> >>>> The global pointer 'reserved_mem' continues to reference the >>>> reserved_mem_array which lives in __initdata if >>>> alloc_reserved_mem_array() fails. of_reserved_mem_lookup() is >>>> exported for post-init use, that would dereference freed memory >>>> and trigger a use-after-free. >>>> >>>> So reset reserved_mem_count to 0 when alloc_reserved_mem_array() >>>> fails. >>>> >>>> Fixes: 00c9a452a235 ("of: reserved_mem: Add code to dynamically allocate reserved_mem array") >>> Fixes should come first in a series. >> Understood, will do in future submissions. >>>> Signed-off-by: Wandun Chen >>>> --- >>>> drivers/of/of_reserved_mem.c | 20 ++++++++++++++------ >>>> 1 file changed, 14 insertions(+), 6 deletions(-) >>>> >>>> diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c >>>> index 313cbc57aa45..6d479381ff1f 100644 >>>> --- a/drivers/of/of_reserved_mem.c >>>> +++ b/drivers/of/of_reserved_mem.c >>>> @@ -69,29 +69,31 @@ static int __init early_init_dt_alloc_reserved_memory_arch(phys_addr_t size, >>>> * the initial static array is copied over to this new array and >>>> * the new array is used from this point on.
>>>> */ >>>> -static void __init alloc_reserved_mem_array(void) >>>> +static bool __init alloc_reserved_mem_array(void) >>>> { >>>> struct reserved_mem *new_array; >>>> size_t alloc_size, copy_size, memset_size; >>>> >>>> + if (!total_reserved_mem_cnt) >>>> + return true; >>>> + >>>> alloc_size = array_size(total_reserved_mem_cnt, sizeof(*new_array)); >>>> if (alloc_size == SIZE_MAX) { >>>> pr_err("Failed to allocate memory for reserved_mem array with err: %d", -EOVERFLOW); >>>> - return; >>>> + goto fail; >>>> } >>>> >>>> new_array = memblock_alloc(alloc_size, SMP_CACHE_BYTES); >>>> if (!new_array) { >>>> pr_err("Failed to allocate memory for reserved_mem array with err: %d", -ENOMEM); >>>> - return; >>>> + goto fail; >>>> } >>>> >>>> copy_size = array_size(reserved_mem_count, sizeof(*new_array)); >>>> if (copy_size == SIZE_MAX) { >>>> memblock_free(new_array, alloc_size); >>>> - total_reserved_mem_cnt = MAX_RESERVED_REGIONS; >>>> pr_err("Failed to allocate memory for reserved_mem array with err: %d", -EOVERFLOW); >>> These prints could be moved to 'fail'. Perhaps instead of just printing >>> an error value, you can return the error value instead of boolean. >> Will do, consolidating pr_err() under 'fail' and changing the return type >> to int. >>> If you respin just this patch, I can pick it up for 7.2. >> Before I respin, I'd like to flag a dependency: >> patch 05/07 in this series build on the signature change introduced by this >> patch ("the void -> bool return type change of alloc_reserved_mem_array()") >> >> Could you let me know which of the following you'd prefer: >> a) Take patch 03 alone via your tree as you suggested, after it lands, I'll >> respin the remaining patches of this series. > I would go with this option. AIUI, this series isn't going to land in > 7.2, so ultimately you will rebase on v7.2-rc1 which will have the > fix. OK, will send v4. 
Best regards, Wandun > >> b) Keep patch 03 in the v4 respin of the full series, reordered to the front >> per your earlier comment. > > Rob From rppt at kernel.org Wed Jun 3 22:28:57 2026 From: rppt at kernel.org (Mike Rapoport) Date: Thu, 4 Jun 2026 08:28:57 +0300 Subject: [PATCH 0/2] liveupdate: Small FLB fixes In-Reply-To: <20260528174140.1921129-1-dmatlack@google.com> References: <20260528174140.1921129-1-dmatlack@google.com> Message-ID: On Thu, May 28, 2026 at 05:41:38PM +0000, David Matlack wrote: > This series has 2 small fixes to how FLBs are managed. First is to > increase the outgoing FLB refcount during liveupdate_flb_get_outgoing() > so it cannot be freed while the caller is using it, and to align with > the semantics of liveupdate_flb_get_incoming(). The second is to prevent > FLB retrieve() from being called multiple times if the first attempt > fails. > > Both of these changes are needed for the correctness of the PCI core > support for Live Update: > > https://lore.kernel.org/linux-pci/20260522202410.3104264-1-dmatlack at google.com/ We are late in the release cycle, and since there are no in-tree FLB users, let's postpone this until after rc1. > David Matlack (2): > liveupdate: Reference count outgoing FLB data > liveupdate: Remember FLB retrieve() status > > include/linux/liveupdate.h | 11 +++++++++-- > kernel/liveupdate/luo_flb.c | 20 ++++++++++++++------ > 2 files changed, 23 insertions(+), 8 deletions(-) > > > base-commit: 5428435567cbe06c19914592fc22ca23c9ca1de5 > -- > 2.54.0.823.g6e5bcc1fc9-goog > -- Sincerely yours, Mike. From dongtai.guo at linux.dev Thu Jun 4 02:19:13 2026 From: dongtai.guo at linux.dev (George Guo) Date: Thu, 4 Jun 2026 17:19:13 +0800 Subject: [PATCH v2] liveupdate: luo_session: include linux/mm.h for virt/phys translation Message-ID: <20260604091913.306603-1-dongtai.guo@linux.dev> From: George Guo luo_session.c calls virt_to_phys() and phys_to_virt().
On LoongArch with CONFIG_KFENCE=y, these macros (in arch/loongarch/include/asm/io.h) expand to offset_in_page() and page_address(), both declared in <linux/mm.h>. Since luo_session.c only includes , the translation unit fails to build with CONFIG_KFENCE=y: asm/io.h: error: implicit declaration of function 'offset_in_page' asm/io.h: error: implicit declaration of function 'page_address' Add the missing <linux/mm.h> include to fix these build errors. Co-developed-by: Kexin Liu Signed-off-by: Kexin Liu Signed-off-by: George Guo --- v2: Move the include from arch/loongarch/include/asm/io.h to the consumer luo_session.c, instead of pulling the heavy <linux/mm.h> into the low-level asm/io.h header (per review feedback). The 0-day report confirmed the v1 approach introduces a circular include (slab.h -> kasan.h -> asm/kasan.h -> asm/io.h -> mm.h, where mm.h needs kfree() before slab.h declares it). Link: https://lore.kernel.org/r/20260521063310.52926-1-dongtai.guo at linux.dev/ # v1 kernel/liveupdate/luo_session.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 7a42385dabe2..4ce7128a4ae9 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -62,6 +62,7 @@ #include #include #include +#include <linux/mm.h> #include #include #include -- 2.25.1 From dongtai.guo at linux.dev Thu Jun 4 04:41:06 2026 From: dongtai.guo at linux.dev (George Guo) Date: Thu, 4 Jun 2026 19:41:06 +0800 Subject: [PATCH v3] LoongArch: kexec: use core control page for relocation trampoline to avoid QEMU FDT conflict Message-ID: <20260604114106.391502-1-dongtai.guo@linux.dev> From: George Guo KEXEC_CONTROL_CODE was hardcoded to TO_CACHE(0x100000). QEMU places its machine FDT at physical 0x100000 when booting with '-kernel', so machine_kexec_prepare() overwrote the FDT with the relocation trampoline.
The kexec'd kernel's fdt_setup() then read trampoline code instead of a valid FDT, earlycon auto-detection failed, and the second kernel booted silently with no console output. The trampoline does not need a fixed address. It is executed by the current kernel to relocate and enter the new kernel, and is dead once the new kernel starts. Drop KEXEC_CONTROL_CODE and reuse the control_code_page that the kexec core already allocates (as arm64 and riscv do). That page is excluded from the relocation copy and lives nowhere near 0x100000, so the QEMU FDT is left intact. Signed-off-by: George Guo --- Changes in v3: - Only the relocation trampoline is moved off the hardcoded address onto the core-allocated control_code_page. v2 also moved the kernel command line to a kexec control page, but as Huacai pointed out a control page only avoids being clobbered by the relocation copy in the current kernel; it is not reserved in the new kernel, which reads the command line early in boot. The command line therefore stays at the existing fixed KEXEC_CMDLINE_ADDR in the reserved first 2MB (unchanged). Changes in v2: - Instead of moving KEXEC_CONTROL_CODE to a different fixed address, reuse the kexec core's control_code_page for the trampoline. 
v2: https://lore.kernel.org/all/20260601033820.38805-1-dongtai.guo at linux.dev/ v1: https://lore.kernel.org/all/20260528135828.196953-1-dongtai.guo at linux.dev/ arch/loongarch/kernel/machine_kexec.c | 21 ++++++++++++++++----- 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/arch/loongarch/kernel/machine_kexec.c b/arch/loongarch/kernel/machine_kexec.c index d7fafda1d541..3527da57234d 100644 --- a/arch/loongarch/kernel/machine_kexec.c +++ b/arch/loongarch/kernel/machine_kexec.c @@ -21,8 +21,14 @@ #include #include -/* 0x100000 ~ 0x200000 is safe */ -#define KEXEC_CONTROL_CODE TO_CACHE(0x100000UL) +/* + * The kexec'd kernel reads its command line from this pointer early in + * boot, so the command line must live in memory the new kernel will not + * reuse. Keep it at a fixed address in the first 2MB, which both the + * current and the kexec'd kernel always keep reserved (see + * memblock_reserve(PHYS_OFFSET, 0x200000) in arch/loongarch/kernel/mem.c). + * 0x108000 does not overlap QEMU's machine FDT at 0x100000. + */ #define KEXEC_CMDLINE_ADDR TO_CACHE(0x108000UL) static unsigned long reboot_code_buffer; @@ -72,9 +78,14 @@ int machine_kexec_prepare(struct kimage *kimage) } } - /* kexec/kdump need a safe page to save reboot_code_buffer */ - kimage->control_code_page = virt_to_page((void *)KEXEC_CONTROL_CODE); - + /* + * kexec/kdump need a safe page to save reboot_code_buffer. Reuse the + * control_code_page allocated by the kexec core (as arm64 and riscv + * do) instead of a fixed address: the trampoline is only executed by + * the current kernel before entering the new kernel, so it needs no + * fixed or reserved location. This also stops machine_kexec_prepare() + * from overwriting QEMU's machine FDT at 0x100000. 
+ */ reboot_code_buffer = (unsigned long)page_address(kimage->control_code_page); memcpy((void *)reboot_code_buffer, relocate_new_kernel, relocate_new_kernel_size); -- 2.25.1 From rppt at kernel.org Thu Jun 4 05:17:20 2026 From: rppt at kernel.org (Mike Rapoport) Date: Thu, 4 Jun 2026 15:17:20 +0300 Subject: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions In-Reply-To: <3197c9c9-9e4f-c592-bb7-ac422f89115@linux.microsoft.com> References: <20260528004204.1484584-1-jloeser@linux.microsoft.com> <3197c9c9-9e4f-c592-bb7-ac422f89115@linux.microsoft.com> Message-ID: On Wed, Jun 03, 2026 at 10:25:58AM -0700, Jork Loeser wrote: > > > On Wed, 3 Jun 2026, Mike Rapoport wrote: > > > On Mon, Jun 01, 2026 at 01:09:41PM -0700, Jork Loeser wrote: > > > On Sun, 31 May 2026, Mike Rapoport wrote: > > > > > > > > Patch 19: Export kexec_in_progress for modules > > > > > > > > Isn't there another way to differentiate kexec reboot? > > > > There's that "kexec reboot" string passed as the cmd to the reboot > > notifier. > > Maybe we can make it somehow more well defined API and use it? > > A string? Dear my - the compiler won't flag it on an API change then, not > ideal clearly. What's wrong with exporting kexec_in_progress()? The policy in general is avoid exports unless strictly necessary. A string can be declared as const char *KEXEC_REBOOT = "kexec reboot" and used in both kexec and mshv. Not ideal, but still better. No strong feelings from my side, just EXPORT_SYMBOL there felt a bit off. > Best, > Jork -- Sincerely yours, Mike. 
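[Archive note] The shared-constant idea in the reply above can be modelled in a few lines of userspace C. This is only a sketch under assumptions: KEXEC_REBOOT_CMD and reboot_is_kexec() are hypothetical names that do not appear in the thread or in the kernel, and the real reboot notifier callback has a different signature. The only detail taken from the thread is the "kexec reboot" string passed as cmd to the reboot notifier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical shared definition. The thread only suggests declaring the
 * string once so that kexec and mshv reference the same symbol; the name
 * used here is an assumption made for this sketch.
 */
static const char KEXEC_REBOOT_CMD[] = "kexec reboot";

/*
 * Models the check a reboot-notifier callback could perform on its cmd
 * argument. A NULL cmd (plain reboot) is treated as "not kexec".
 */
static bool reboot_is_kexec(const char *cmd)
{
	return cmd != NULL && strcmp(cmd, KEXEC_REBOOT_CMD) == 0;
}
```

With a single definition, renaming or removing the constant breaks the build for every user, which addresses the concern that a bare string literal would silently drift.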
From fangyu.yu at linux.alibaba.com Thu Jun 4 06:24:10 2026 From: fangyu.yu at linux.alibaba.com (fangyu.yu at linux.alibaba.com) Date: Thu, 4 Jun 2026 21:24:10 +0800 Subject: [PATCH v3 1/9] riscv: kexec: Reset executable bit on the control code page in cleanup In-Reply-To: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> References: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> Message-ID: <20260604132418.15725-2-fangyu.yu@linux.alibaba.com> From: Fangyu Yu machine_kexec_prepare() calls set_memory_x() on the per-image control_code_page so the relocate stub copied into it can be executed during a normal kexec. machine_kexec_cleanup() is empty, so when the image is freed (via kexec -u, or because a later step in load failed) the page is returned to the buddy allocator with its executable bit still set. Once the page is reallocated for arbitrary kernel data, the W^X invariant is broken: a writable page also marked executable. Implement the architecture cleanup hook to call set_memory_nx() on the control code page for non-crash images, mirroring the set_memory_x() in prepare(). The crash path does not call set_memory_x() (the crash kernel is loaded into the reserved crashkernel region whose pages are not in the buddy allocator) and so does not need the cleanup. Fixes: fba8a8674f68 ("RISC-V: Add kexec support") Signed-off-by: Fangyu Yu --- arch/riscv/kernel/machine_kexec.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/arch/riscv/kernel/machine_kexec.c b/arch/riscv/kernel/machine_kexec.c index 2306ce3e5f22..ea6794c9f4c2 100644 --- a/arch/riscv/kernel/machine_kexec.c +++ b/arch/riscv/kernel/machine_kexec.c @@ -91,6 +91,19 @@ machine_kexec_prepare(struct kimage *image) void machine_kexec_cleanup(struct kimage *image) { + void *control_code_buffer; + + if (image->type == KEXEC_TYPE_CRASH || !image->control_code_page) + return; + + /* + * machine_kexec_prepare() called set_memory_x() on the control + * code page for non-crash images. 
Revert it before kimage_free() + * returns the page to the buddy allocator, so we do not leak an + * executable page back into general allocation. + */ + control_code_buffer = page_address(image->control_code_page); + set_memory_nx((unsigned long)control_code_buffer, 1); } -- 2.50.1 From fangyu.yu at linux.alibaba.com Thu Jun 4 06:24:11 2026 From: fangyu.yu at linux.alibaba.com (fangyu.yu at linux.alibaba.com) Date: Thu, 4 Jun 2026 21:24:11 +0800 Subject: [PATCH v3 2/9] riscv: kexec: Bound FDT search by source buffer size, not destination In-Reply-To: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> References: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> Message-ID: <20260604132418.15725-3-fangyu.yu@linux.alibaba.com> From: Fangyu Yu The FDT search loop in machine_kexec_prepare() reads sizeof(fdt) bytes from segment[i].buf to identify the device tree, but it gates the read on segment[i].memsz, which is the destination size in the next kernel. kexec allows bufsz < memsz (the loaded image is zero-padded at the destination), so a caller can craft a segment with bufsz=10 and memsz=1MB: if (image->segment[i].memsz <= sizeof(fdt)) /* 1MB > 40, OK */ continue; memcpy(&fdt, image->segment[i].buf, sizeof(fdt)); /* reads 40 from a 10-byte kbuf */ For kexec_file_load (image->file_mode), the read walks 30 bytes past the kernel-allocated kbuf. In the worst case the trailing bytes fall in an unmapped guard page and the read faults the kernel; in the common case the read returns garbage which fdt_check_header() rejects and the loop continues. The plain kexec_load path is shielded by copy_from_user(), which validates the read against the user mapping. Replace the memsz check with the bufsz check, which is the right bound for the read site. 
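[Archive note] The bound argued for above can be shown with a small userspace sketch. It is not the kernel code: fake_fdt_header, fake_segment and read_header() are hypothetical stand-ins, and the 40-byte header size is taken from the "reads 40 from a 10-byte kbuf" comment in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the 40-byte FDT header read by the loop. */
struct fake_fdt_header {
	unsigned char bytes[40];
};

/* Hypothetical stand-in for a kexec segment descriptor. */
struct fake_segment {
	const void *buf;  /* source buffer supplied at load time */
	size_t bufsz;     /* bytes actually present in buf */
	size_t memsz;     /* bytes reserved at the destination, may exceed bufsz */
};

/*
 * Copy a header out of the segment only when the source buffer really
 * holds sizeof(*out) bytes. Gating on memsz instead would allow a read
 * past the end of buf whenever bufsz < sizeof(*out) <= memsz.
 */
static bool read_header(const struct fake_segment *seg,
			struct fake_fdt_header *out)
{
	if (seg->bufsz < sizeof(*out))
		return false;

	memcpy(out, seg->buf, sizeof(*out));
	return true;
}
```

A segment with bufsz=10 and memsz=1MB is skipped by this check, while a buffer of exactly 40 bytes is still accepted, matching the `<` comparison used in the patch.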
Fixes: fba8a8674f68 ("RISC-V: Add kexec support") Signed-off-by: Fangyu Yu --- arch/riscv/kernel/machine_kexec.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/riscv/kernel/machine_kexec.c b/arch/riscv/kernel/machine_kexec.c index ea6794c9f4c2..e6e179cffc44 100644 --- a/arch/riscv/kernel/machine_kexec.c +++ b/arch/riscv/kernel/machine_kexec.c @@ -38,7 +38,7 @@ machine_kexec_prepare(struct kimage *image) /* Find the Flattened Device Tree and save its physical address */ for (i = 0; i < image->nr_segments; i++) { - if (image->segment[i].memsz <= sizeof(fdt)) + if (image->segment[i].bufsz < sizeof(fdt)) continue; if (image->file_mode) -- 2.50.1 From fangyu.yu at linux.alibaba.com Thu Jun 4 06:24:12 2026 From: fangyu.yu at linux.alibaba.com (fangyu.yu at linux.alibaba.com) Date: Thu, 4 Jun 2026 21:24:12 +0800 Subject: [PATCH v3 3/9] riscv: Add kexec trampoline text section to vmlinux.lds.S In-Reply-To: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> References: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> Message-ID: <20260604132418.15725-4-fangyu.yu@linux.alibaba.com> From: Fangyu Yu When CONFIG_KEXEC_CORE is enabled, add a dedicated .kexec.tramp.text area to the RISC-V kernel linker script. Extend vmlinux.lds.S to: - align both the start and the end to PAGE_SIZE - define __kexec_tramp_text_start/__kexec_tramp_text_end - KEEP all .kexec.tramp.text* input sections - ASSERT the trampoline text fits within one page The end-of-section page alignment guarantees that the trampoline page, which is later identity-mapped as PAGE_KERNEL_EXEC, contains nothing but the trampoline code and padding (no shared neighbour data). When kexec is disabled, the whole block is excluded via #ifdef CONFIG_KEXEC_CORE. 
Signed-off-by: Fangyu Yu --- arch/riscv/kernel/vmlinux.lds.S | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S index 1f4f8496941a..bc615f7b702f 100644 --- a/arch/riscv/kernel/vmlinux.lds.S +++ b/arch/riscv/kernel/vmlinux.lds.S @@ -41,6 +41,16 @@ SECTIONS ENTRY_TEXT IRQENTRY_TEXT SOFTIRQENTRY_TEXT +#ifdef CONFIG_KEXEC_CORE + . = ALIGN(PAGE_SIZE); + __kexec_tramp_text_start = .; + KEEP(*(.kexec.tramp.text)) + KEEP(*(.kexec.tramp.text.*)) + __kexec_tramp_text_end = .; + ASSERT((__kexec_tramp_text_end - __kexec_tramp_text_start) <= PAGE_SIZE, + ".kexec.tramp.text exceeds PAGE_SIZE"); + . = ALIGN(PAGE_SIZE); +#endif _etext = .; } -- 2.50.1 From fangyu.yu at linux.alibaba.com Thu Jun 4 06:24:09 2026 From: fangyu.yu at linux.alibaba.com (fangyu.yu at linux.alibaba.com) Date: Thu, 4 Jun 2026 21:24:09 +0800 Subject: [PATCH v3 0/9] riscv: kexec: Make kexec/kdump robust under VS-mode Message-ID: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> From: Fangyu Yu In a RISC-V kernel, both kexec and crashdump need to hand off execution to the next kernel after tearing down the current kernel address space. However, under virtualization the guest uses two-stage address translation, and pc does not jump to stvec after setting satp to zero, so the legacy single-step "csrw satp,0 + stvec redirect" sequence traps with "kvm run failed Operation not supported" and the VCPU dies. This patch set introduces a dedicated kexec trampoline text section and builds a minimal trampoline page table for it. Both handoffs are then reworked into a two-pass trampoline: 1. First enter via the kernel VA, install the trampoline page table, and jump to the trampoline VA(=PA) of the entry stub; 2. 
Continue execution with PC already on a PA, drop SATP with csrw satp,0 (now safe because PC re-anchoring is moot), and jump directly to the target -- either the crash kernel entry (crash path) or the per-image control_code_buffer that runs the relocate body with SATP=0 throughout (normal path). With this, both kexec and crashdump in RISC-V guests become robust against the two-stage translation. Tested on QEMU virt under two configurations: * HS-mode bare (QEMU TCG) -- regression check - normal kexec: kexec -l/-e succeeds, second kernel boots and prints the userspace SECOND BOOT marker. - crash kdump: panic triggers crash kernel boot, /proc/vmcore opens cleanly in crash(1) and shows the panic backtrace. * VS-mode (L0 x86 + QEMU TCG -> L1 riscv64 + KVM -> L2) Before this series, both paths die with kvm run failed Operation not supported and an all-zero M-mode register dump on the SATP transition. After this series, both paths succeed end-to-end and the vmcore opens cleanly in crash. --- Changes in v3 (Sashiko AI review): - Add two new Fixes: patches at the start of the series: #1: machine_kexec_cleanup() was empty, so the set_memory_x() call in prepare() leaked an executable direct-map page back to the buddy allocator on kexec -u (W^X bypass). Fix: add set_memory_nx() in cleanup. #2: machine_kexec_prepare() FDT search checked memsz <= sizeof(fdt) but read sizeof(fdt) bytes from segment[i].buf, which can be smaller than memsz. Fix: check bufsz instead. - Inline the .kexec.tramp.text section definition directly into vmlinux.lds.S instead of using a macro in image-vars.h. - Rewrite map_tramp_page() to share a single set of lower-level page tables between the VA and PA mappings (5 BSS pages instead of 9), with a collision-safe walker that only populates entries still zero. Add Sv32 support. 
- Link to v2: https://lore.kernel.org/linux-riscv/20260526125009.2404-1-fangyu.yu at linux.alibaba.com/ - Link to v1: https://lore.kernel.org/linux-riscv/20260324114527.91494-1-fangyu.yu at linux.alibaba.com/ Fangyu Yu (9): riscv: kexec: Reset executable bit on the control code page in cleanup riscv: kexec: Bound FDT search by source buffer size, not destination riscv: Add kexec trampoline text section to vmlinux.lds.S riscv: kexec: Place norelocate trampoline into .kexec.tramp.text riscv: kexec: Build trampoline page tables for crash kernel entry riscv: kexec: Switch to trampoline page table before norelocate riscv: kexec: Always build the trampoline page table riscv: kexec: Add the relocate-trampoline wrapper riscv: kexec: Route normal kexec through the trampoline page table arch/riscv/include/asm/kexec.h | 5 + arch/riscv/kernel/kexec_relocate.S | 92 +++++++++++----- arch/riscv/kernel/machine_kexec.c | 171 +++++++++++++++++++++++++++-- arch/riscv/kernel/vmlinux.lds.S | 10 ++ 4 files changed, 241 insertions(+), 37 deletions(-) -- 2.50.1 From fangyu.yu at linux.alibaba.com Thu Jun 4 06:24:13 2026 From: fangyu.yu at linux.alibaba.com (fangyu.yu at linux.alibaba.com) Date: Thu, 4 Jun 2026 21:24:13 +0800 Subject: [PATCH v3 4/9] riscv: kexec: Place norelocate trampoline into .kexec.tramp.text In-Reply-To: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> References: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> Message-ID: <20260604132418.15725-5-fangyu.yu@linux.alibaba.com> From: Fangyu Yu Move riscv_kexec_norelocate out of the generic .text section and into a dedicated executable trampoline section, .kexec.tramp.text. 
Signed-off-by: Fangyu Yu --- arch/riscv/include/asm/kexec.h | 4 ++++ arch/riscv/kernel/kexec_relocate.S | 2 +- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/riscv/include/asm/kexec.h b/arch/riscv/include/asm/kexec.h index b9ee8346cc8c..6466c1f00d41 100644 --- a/arch/riscv/include/asm/kexec.h +++ b/arch/riscv/include/asm/kexec.h @@ -75,4 +75,8 @@ int load_extra_segments(struct kimage *image, unsigned long kernel_start, unsigned long cmdline_len); #endif +#ifndef __ASSEMBLY__ +extern char __kexec_tramp_text_start[]; +#endif + #endif diff --git a/arch/riscv/kernel/kexec_relocate.S b/arch/riscv/kernel/kexec_relocate.S index de0a4b35d01e..af6b99f5b0fd 100644 --- a/arch/riscv/kernel/kexec_relocate.S +++ b/arch/riscv/kernel/kexec_relocate.S @@ -147,7 +147,7 @@ riscv_kexec_relocate_end: /* Used for jumping to crashkernel */ -.section ".text" +.section ".kexec.tramp.text", "ax" SYM_CODE_START(riscv_kexec_norelocate) /* * s0: (const) Phys address to jump to -- 2.50.1 From fangyu.yu at linux.alibaba.com Thu Jun 4 06:24:16 2026 From: fangyu.yu at linux.alibaba.com (fangyu.yu at linux.alibaba.com) Date: Thu, 4 Jun 2026 21:24:16 +0800 Subject: [PATCH v3 7/9] riscv: kexec: Always build the trampoline page table In-Reply-To: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> References: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> Message-ID: <20260604132418.15725-8-fangyu.yu@linux.alibaba.com> From: Fangyu Yu The trampoline page table and the kexec_tramp_satp value are currently built only on the crash path. A follow-up patch needs the same infrastructure for the normal kexec path. Pull the trampoline build and the WRITE_ONCE() that publishes the SATP value out of the crash-only else branch in machine_kexec_prepare(). The crash path keeps recording its own riscv_kexec_norelocate_pa; the normal path keeps its existing control_code_buffer copy. No functional change. 
Signed-off-by: Fangyu Yu --- arch/riscv/kernel/machine_kexec.c | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/arch/riscv/kernel/machine_kexec.c b/arch/riscv/kernel/machine_kexec.c index 72817bba5d3b..d82f45fb44b6 100644 --- a/arch/riscv/kernel/machine_kexec.c +++ b/arch/riscv/kernel/machine_kexec.c @@ -139,6 +139,16 @@ machine_kexec_prepare(struct kimage *image) return -EINVAL; } + /* + * Build the trampoline page table and capture its SATP value. + * The crash path consumes it today; the non-crash kexec path + * will use the same setup as well. + */ + riscv_kexec_build_tramp((unsigned long)__kexec_tramp_text_start, + __pa_symbol(__kexec_tramp_text_start)); + WRITE_ONCE(kexec_tramp_satp, + PFN_DOWN(__pa_symbol(kexec_tramp_pgd)) | satp_mode); + /* Copy the assembler code for relocation to the control page */ if (image->type != KEXEC_TYPE_CRASH) { control_code_buffer = page_address(image->control_code_page); @@ -155,19 +165,8 @@ machine_kexec_prepare(struct kimage *image) /* Mark the control page executable */ set_memory_x((unsigned long) control_code_buffer, 1); } else { - /* - * Crash kexec uses riscv_kexec_norelocate as a trampoline. - * Pre-build the trampoline page tables and capture the - * trampoline SATP value plus the physical address of - * riscv_kexec_norelocate so that the panic path only has - * to switch satp and jump. 
- */ - riscv_kexec_build_tramp((unsigned long)__kexec_tramp_text_start, - __pa_symbol(__kexec_tramp_text_start)); WRITE_ONCE(riscv_kexec_norelocate_pa, __pa_symbol(&riscv_kexec_norelocate)); - WRITE_ONCE(kexec_tramp_satp, - PFN_DOWN(__pa_symbol(kexec_tramp_pgd)) | satp_mode); } return 0; -- 2.50.1 From fangyu.yu at linux.alibaba.com Thu Jun 4 06:24:14 2026 From: fangyu.yu at linux.alibaba.com (fangyu.yu at linux.alibaba.com) Date: Thu, 4 Jun 2026 21:24:14 +0800 Subject: [PATCH v3 5/9] riscv: kexec: Build trampoline page tables for crash kernel entry In-Reply-To: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> References: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> Message-ID: <20260604132418.15725-6-fangyu.yu@linux.alibaba.com> From: Fangyu Yu Crash kexec uses riscv_kexec_norelocate as a trampoline to jump into the crashkernel. Pre-build dedicated 4 KB page tables in machine_kexec_prepare() that map the trampoline page as executable, so the panic path only has to switch satp and jump. Two mappings are installed into a shared pgd: - VA(__kexec_tramp_text_start) -> PA(__kexec_tramp_text_start) - PA(__kexec_tramp_text_start) -> PA(__kexec_tramp_text_start) The lower-level tables (p4d/pud/pmd/pte) are shared between both mappings; map_tramp_page() walks the existing tree and only populates entries that are still zero, so the two installs coexist even when their indices happen to collide at any level. Signed-off-by: Fangyu Yu --- arch/riscv/kernel/machine_kexec.c | 87 +++++++++++++++++++++++++++++++ 1 file changed, 87 insertions(+) diff --git a/arch/riscv/kernel/machine_kexec.c b/arch/riscv/kernel/machine_kexec.c index e6e179cffc44..1947b7bdf5c4 100644 --- a/arch/riscv/kernel/machine_kexec.c +++ b/arch/riscv/kernel/machine_kexec.c @@ -18,6 +18,85 @@ #include #include +/* + * Trampoline page tables. 
Both the VA(trampoline)->PA and the + * PA(trampoline)->PA identity mapping are installed in this single + * pgd; the lower-level tables are shared so the two mappings can + * coexist even if they happen to collide at any level (the walker + * only populates entries that are still zero). + * + * Pre-allocate for the largest paging mode (Sv57). Levels that the + * runtime mode does not use simply waste a page or two of BSS, in + * exchange for a builder that is infallible and safe to run from + * the panic path. + */ +static pgd_t kexec_tramp_pgd[PTRS_PER_PGD] __aligned(PAGE_SIZE); +#ifdef CONFIG_64BIT +static p4d_t kexec_tramp_p4d[PTRS_PER_P4D] __aligned(PAGE_SIZE); +static pud_t kexec_tramp_pud[PTRS_PER_PUD] __aligned(PAGE_SIZE); +static pmd_t kexec_tramp_pmd[PTRS_PER_PMD] __aligned(PAGE_SIZE); +#endif +static pte_t kexec_tramp_pte[PTRS_PER_PTE] __aligned(PAGE_SIZE); + +static void map_tramp_page(unsigned long va, unsigned long pa) +{ + pgd_t *pgd = kexec_tramp_pgd + pgd_index(va); + +#ifdef CONFIG_64BIT + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd; + + if (pgtable_l5_enabled) { + if (pgd_val(*pgd) == 0) + set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa_symbol(kexec_tramp_p4d)), + PAGE_TABLE)); + p4d = kexec_tramp_p4d + p4d_index(va); + } else { + p4d = (p4d_t *)pgd; + } + + if (pgtable_l4_enabled) { + if (p4d_val(*p4d) == 0) + set_p4d(p4d, pfn_p4d(PFN_DOWN(__pa_symbol(kexec_tramp_pud)), + PAGE_TABLE)); + pud = kexec_tramp_pud + pud_index(va); + } else { + pud = (pud_t *)p4d; + } + + if (pud_val(*pud) == 0) + set_pud(pud, pfn_pud(PFN_DOWN(__pa_symbol(kexec_tramp_pmd)), + PAGE_TABLE)); + pmd = kexec_tramp_pmd + pmd_index(va); + + if (pmd_val(*pmd) == 0) + set_pmd(pmd, pfn_pmd(PFN_DOWN(__pa_symbol(kexec_tramp_pte)), + PAGE_TABLE)); +#else + /* Sv32: PGD points directly to the PTE table. 
*/ + if (pgd_val(*pgd) == 0) + set_pgd(pgd, pfn_pgd(PFN_DOWN(__pa_symbol(kexec_tramp_pte)), + PAGE_TABLE)); +#endif + + set_pte(kexec_tramp_pte + pte_index(va), + pfn_pte(PFN_DOWN(pa), PAGE_KERNEL_EXEC)); +} + +static void riscv_kexec_build_tramp(unsigned long va, unsigned long pa) +{ + /* VA -> PA: map the trampoline page via its kernel VA. */ + map_tramp_page(va, pa); + + /* + * PA -> PA: identity-map the same page so the second-pass code + * can keep executing after the kernel VA mapping is dropped. + */ + map_tramp_page(pa, pa); +} + + /* * machine_kexec_prepare - Initialize kexec * @@ -73,6 +152,14 @@ machine_kexec_prepare(struct kimage *image) /* Mark the control page executable */ set_memory_x((unsigned long) control_code_buffer, 1); + } else { + /* + * Crash kexec uses riscv_kexec_norelocate as a trampoline. + * Pre-build the trampoline page tables here so the panic + * path only has to switch satp and jump. + */ + riscv_kexec_build_tramp((unsigned long)__kexec_tramp_text_start, + __pa_symbol(__kexec_tramp_text_start)); } return 0; -- 2.50.1 From fangyu.yu at linux.alibaba.com Thu Jun 4 06:24:15 2026 From: fangyu.yu at linux.alibaba.com (fangyu.yu at linux.alibaba.com) Date: Thu, 4 Jun 2026 21:24:15 +0800 Subject: [PATCH v3 6/9] riscv: kexec: Switch to trampoline page table before norelocate In-Reply-To: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> References: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> Message-ID: <20260604132418.15725-7-fangyu.yu@linux.alibaba.com> From: Fangyu Yu Make riscv_kexec_norelocate a two-pass trampoline so it can drop the kernel page tables while still executing from a mapped address. On the first entry, t3 is initialized to 0 by machine_kexec(). The trampoline then loads the physical address of riscv_kexec_norelocate and the trampoline SATP value, switches to the trampoline page table, and jumps to the trampoline VA(=PA).
On the second entry, t3 contains the physical address of riscv_kexec_norelocate, so the PC comparison matches and execution continues under trampoline VA(=PA). Since the trampoline page table is already active, replace the previous stvec-based handoff with a direct jump to the target entry (jr a2). Signed-off-by: Fangyu Yu --- arch/riscv/kernel/kexec_relocate.S | 30 +++++++++++++++----- arch/riscv/kernel/machine_kexec.c | 44 +++++++++++++++++++++++++++--- 2 files changed, 63 insertions(+), 11 deletions(-) diff --git a/arch/riscv/kernel/kexec_relocate.S b/arch/riscv/kernel/kexec_relocate.S index af6b99f5b0fd..8cfdf6f4032a 100644 --- a/arch/riscv/kernel/kexec_relocate.S +++ b/arch/riscv/kernel/kexec_relocate.S @@ -147,13 +147,35 @@ riscv_kexec_relocate_end: /* Used for jumping to crashkernel */ +.extern kexec_tramp_satp +.extern riscv_kexec_norelocate_pa .section ".kexec.tramp.text", "ax" SYM_CODE_START(riscv_kexec_norelocate) + /* + * Two-pass entry: + * - 1st entry: t3 == 0 (initialized by machine_kexec()). + * + * - 2nd entry: t3 holds the physical address of + * riscv_kexec_norelocate, so auipc matches t3 and we fall through + * to label 1 to continue execution under trampoline VA(=PA). + */ + auipc t0, 0 + beq t0, t3, 1f + + la t0, riscv_kexec_norelocate_pa + REG_L t3, 0(t0) + la t0, kexec_tramp_satp + REG_L t1, 0(t0) + csrw CSR_SATP, t1 + sfence.vma x0, x0 + + jr t3 /* * s0: (const) Phys address to jump to * s1: (const) Phys address of the FDT image * s2: (const) The hartid of the current hart */ +1: mv s0, a1 mv s1, a2 mv s2, a3 @@ -198,14 +220,8 @@ SYM_CODE_START(riscv_kexec_norelocate) csrw CSR_SCAUSE, zero csrw CSR_SSCRATCH, zero - /* - * Switch to physical addressing - * This will also trigger a jump to CSR_STVEC - * which in this case is the address of the new - * kernel. 
- */ - csrw CSR_STVEC, a2 csrw CSR_SATP, zero + jr a2 SYM_CODE_END(riscv_kexec_norelocate) diff --git a/arch/riscv/kernel/machine_kexec.c b/arch/riscv/kernel/machine_kexec.c index 1947b7bdf5c4..72817bba5d3b 100644 --- a/arch/riscv/kernel/machine_kexec.c +++ b/arch/riscv/kernel/machine_kexec.c @@ -18,6 +18,8 @@ #include #include +unsigned long kexec_tramp_satp; +unsigned long riscv_kexec_norelocate_pa; /* * Trampoline page tables. Both the VA(trampoline)->PA and the * PA(trampoline)->PA identity mapping are installed in this single @@ -155,11 +157,17 @@ machine_kexec_prepare(struct kimage *image) } else { /* * Crash kexec uses riscv_kexec_norelocate as a trampoline. - * Pre-build the trampoline page tables here so the panic - * path only has to switch satp and jump. + * Pre-build the trampoline page tables and capture the + * trampoline SATP value plus the physical address of + * riscv_kexec_norelocate so that the panic path only has + * to switch satp and jump. */ riscv_kexec_build_tramp((unsigned long)__kexec_tramp_text_start, __pa_symbol(__kexec_tramp_text_start)); + WRITE_ONCE(riscv_kexec_norelocate_pa, + __pa_symbol(&riscv_kexec_norelocate)); + WRITE_ONCE(kexec_tramp_satp, + PFN_DOWN(__pa_symbol(kexec_tramp_pgd)) | satp_mode); } return 0; @@ -276,7 +284,35 @@ machine_kexec(struct kimage *image) /* Jump to the relocation code */ pr_notice("Bye...\n"); - kexec_method(first_ind_entry, jump_addr, fdt_addr, - this_hart_id, kernel_map.va_pa_offset); + /* + * Hand off to the trampoline. For KEXEC_TYPE_CRASH we go into + * riscv_kexec_norelocate, which uses t3 as the 1st/2nd-pass + * discriminator (must be 0 on first entry). A bare + * asm volatile ("li t3, 0" ::: "t3") + * before the C call only declares t3 *modified*; the compiler is + * free to use t3 as scratch when materialising args. 
Pin t3 = 0 + * (and the args) via local register variables and perform the + * indirect jump inside the same inline asm so t3 == 0 is + * guaranteed at the moment control leaves machine_kexec(). + */ + { + register unsigned long a0_val asm("a0") = first_ind_entry; + register unsigned long a1_val asm("a1") = jump_addr; + register unsigned long a2_val asm("a2") = fdt_addr; + register unsigned long a3_val asm("a3") = this_hart_id; + register unsigned long a4_val asm("a4") = kernel_map.va_pa_offset; + register unsigned long t3_zero asm("t3") = 0; + register riscv_kexec_method m asm("t6") = kexec_method; + + asm volatile ( + "jr %[m]" + : + : "r" (a0_val), "r" (a1_val), "r" (a2_val), + "r" (a3_val), "r" (a4_val), + "r" (t3_zero), + [m] "r" (m) + : "memory" + ); + } unreachable(); } -- 2.50.1 From fangyu.yu at linux.alibaba.com Thu Jun 4 06:24:17 2026 From: fangyu.yu at linux.alibaba.com (fangyu.yu at linux.alibaba.com) Date: Thu, 4 Jun 2026 21:24:17 +0800 Subject: [PATCH v3 8/9] riscv: kexec: Add the relocate-trampoline wrapper In-Reply-To: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> References: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> Message-ID: <20260604132418.15725-9-fangyu.yu@linux.alibaba.com> From: Fangyu Yu Add riscv_kexec_relocate_entry to .kexec.tramp.text and the two asm-visible globals (riscv_kexec_relocate_entry_pa and riscv_kexec_cc_buffer_pa) that the wrapper consumes. The wrapper performs the same two-step transition used by the crash path: switch to the trampoline pgd, jump to the PA of self, then drop the MMU with PC already on a PA. It finally jumps to the PA of control_code_buffer. machine_kexec_prepare() publishes the wrapper PA via WRITE_ONCE for non-crash images. The per-image control_code_buffer PA is published later, at dispatch time, so a load failure between prepare() and the kexec_image swap cannot leave the global pointing at a freed page. Nothing routes to the wrapper yet; the switchover happens in the follow-up patch. 
Signed-off-by: Fangyu Yu --- arch/riscv/include/asm/kexec.h | 1 + arch/riscv/kernel/kexec_relocate.S | 36 ++++++++++++++++++++++++++++++ arch/riscv/kernel/machine_kexec.c | 5 +++++ 3 files changed, 42 insertions(+) diff --git a/arch/riscv/include/asm/kexec.h b/arch/riscv/include/asm/kexec.h index 6466c1f00d41..b75cab959e53 100644 --- a/arch/riscv/include/asm/kexec.h +++ b/arch/riscv/include/asm/kexec.h @@ -53,6 +53,7 @@ typedef void (*riscv_kexec_method)(unsigned long first_ind_entry, unsigned long va_pa_off); extern riscv_kexec_method riscv_kexec_norelocate; +extern riscv_kexec_method riscv_kexec_relocate_entry; #ifdef CONFIG_KEXEC_FILE extern const struct kexec_file_ops elf_kexec_ops; diff --git a/arch/riscv/kernel/kexec_relocate.S b/arch/riscv/kernel/kexec_relocate.S index 8cfdf6f4032a..6c624560c9ac 100644 --- a/arch/riscv/kernel/kexec_relocate.S +++ b/arch/riscv/kernel/kexec_relocate.S @@ -225,6 +225,42 @@ SYM_CODE_START(riscv_kexec_norelocate) SYM_CODE_END(riscv_kexec_norelocate) +.extern riscv_kexec_relocate_entry_pa +.extern riscv_kexec_cc_buffer_pa +.section ".kexec.tramp.text", "ax" +SYM_CODE_START(riscv_kexec_relocate_entry) + /* + * Two-pass entry, identical in shape to riscv_kexec_norelocate: + * - 1st entry: t3 == 0 (initialized by machine_kexec()). + * - 2nd entry: t3 == PA of riscv_kexec_relocate_entry, so auipc + * matches t3 and we fall through to label 1. + * Args a0..a4 are passed through unchanged to riscv_kexec_relocate. + */ + auipc t0, 0 + beq t0, t3, 1f + + la t0, riscv_kexec_relocate_entry_pa + REG_L t3, 0(t0) + la t0, kexec_tramp_satp + REG_L t1, 0(t0) + csrw CSR_SATP, t1 + sfence.vma x0, x0 + + jr t3 +1: + /* + * Now executing at the PA of this wrapper with the trampoline pgd + * installed (identity-mapped). Drop the MMU; PC stays valid because + * it is already a PA. + */ + csrw CSR_SATP, zero + + /* Jump to the PA of control_code_buffer to run the relocate body. 
*/ + la t0, riscv_kexec_cc_buffer_pa + REG_L t0, 0(t0) + jr t0 +SYM_CODE_END(riscv_kexec_relocate_entry) + .section ".rodata" SYM_DATA(riscv_kexec_relocate_size, .long riscv_kexec_relocate_end - riscv_kexec_relocate) diff --git a/arch/riscv/kernel/machine_kexec.c b/arch/riscv/kernel/machine_kexec.c index d82f45fb44b6..71688c63af65 100644 --- a/arch/riscv/kernel/machine_kexec.c +++ b/arch/riscv/kernel/machine_kexec.c @@ -20,6 +20,8 @@ unsigned long kexec_tramp_satp; unsigned long riscv_kexec_norelocate_pa; +unsigned long riscv_kexec_relocate_entry_pa; +unsigned long riscv_kexec_cc_buffer_pa; /* * Trampoline page tables. Both the VA(trampoline)->PA and the * PA(trampoline)->PA identity mapping are installed in this single @@ -164,6 +166,9 @@ machine_kexec_prepare(struct kimage *image) /* Mark the control page executable */ set_memory_x((unsigned long) control_code_buffer, 1); + + WRITE_ONCE(riscv_kexec_relocate_entry_pa, + __pa_symbol(&riscv_kexec_relocate_entry)); } else { WRITE_ONCE(riscv_kexec_norelocate_pa, __pa_symbol(&riscv_kexec_norelocate)); -- 2.50.1 From fangyu.yu at linux.alibaba.com Thu Jun 4 06:24:18 2026 From: fangyu.yu at linux.alibaba.com (fangyu.yu at linux.alibaba.com) Date: Thu, 4 Jun 2026 21:24:18 +0800 Subject: [PATCH v3 9/9] riscv: kexec: Route normal kexec through the trampoline page table In-Reply-To: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> References: <20260604132418.15725-1-fangyu.yu@linux.alibaba.com> Message-ID: <20260604132418.15725-10-fangyu.yu@linux.alibaba.com> From: Fangyu Yu riscv_kexec_relocate (copied into control_code_buffer) uses an stvec trick to drop the MMU and land on the PA of the next loop label. Under VS-mode, KVM cannot emulate this single-step transition, and the VCPU dies with "kvm run failed Operation not supported". Route normal kexec through riscv_kexec_relocate_entry, the trampoline wrapper added in the previous patch.
It drops SATP with PC already on a PA, then hands off to control_code_buffer where the relocate body runs with SATP=0. Drop the stvec trick from the relocate body and pass first_ind_entry as a physical address since the body now starts with SATP=0. The ".align 2" plus filler "nop" that ensured the PA of the loop top was 4-byte aligned -- required because the legacy stvec trick wrote that PA into stvec.BASE, whose low two bits are MODE and are discarded by the hardware -- is no longer load-bearing and is removed as well. Signed-off-by: Fangyu Yu --- arch/riscv/kernel/kexec_relocate.S | 26 ++++++-------------------- arch/riscv/kernel/machine_kexec.c | 27 +++++++++++++++++++-------- 2 files changed, 25 insertions(+), 28 deletions(-) diff --git a/arch/riscv/kernel/kexec_relocate.S b/arch/riscv/kernel/kexec_relocate.S index 6c624560c9ac..7ffb83ea45fc 100644 --- a/arch/riscv/kernel/kexec_relocate.S +++ b/arch/riscv/kernel/kexec_relocate.S @@ -34,27 +34,13 @@ SYM_CODE_START(riscv_kexec_relocate) csrw CSR_SIP, zero /* - * When we switch SATP.MODE to "Bare" we'll only - * play with physical addresses. However the first time - * we try to jump somewhere, the offset on the jump - * will be relative to pc which will still be on VA. To - * deal with this we set stvec to the physical address at - * the start of the loop below so that we jump there in - * any case. + * The trampoline wrapper (riscv_kexec_relocate_entry) has already + * dropped the MMU and handed control to us at this PA copy of the + * relocate code. From here on the entire loop runs with SATP=0 and + * every address (s0, s5, source/dest pointers) is a physical one. */ - la s6, 1f - sub s6, s6, s4 - csrw CSR_STVEC, s6 - - /* - * With C-extension, here we get 42 Bytes and the next - * .align directive would pad zeros here up to 44 Bytes. - * So manually put a nop here to avoid zeros padding. 
- */ - nop /* Process entries in a loop */ -.align 2 1: REG_L t0, 0(s0) /* t0 = *image->entry */ addi s0, s0, RISCV_SZPTR /* image->entry++ */ @@ -70,8 +56,8 @@ SYM_CODE_START(riscv_kexec_relocate) andi t1, t0, 0x2 beqz t1, 2f andi s0, t0, ~0x2 - csrw CSR_SATP, zero - jr s6 + /* MMU is already off; the entry wrapper handled the transition. */ + j 1b 2: /* IND_DONE entry ? -> jump to done label */ diff --git a/arch/riscv/kernel/machine_kexec.c b/arch/riscv/kernel/machine_kexec.c index 71688c63af65..82fcb84a03ec 100644 --- a/arch/riscv/kernel/machine_kexec.c +++ b/arch/riscv/kernel/machine_kexec.c @@ -164,9 +164,6 @@ machine_kexec_prepare(struct kimage *image) memcpy(control_code_buffer, riscv_kexec_relocate, riscv_kexec_relocate_size); - /* Mark the control page executable */ - set_memory_x((unsigned long) control_code_buffer, 1); - WRITE_ONCE(riscv_kexec_relocate_entry_pa, __pa_symbol(&riscv_kexec_relocate_entry)); } else { @@ -262,11 +259,15 @@ machine_kexec(struct kimage *image) { struct kimage_arch *internal = &image->arch; unsigned long jump_addr = (unsigned long) image->start; - unsigned long first_ind_entry = (unsigned long) &image->head; + /* + * The relocate body runs entirely with the MMU off (the wrapper + * drops SATP before jumping into control_code_buffer), so the very + * first entry must be a physical address. 
+ */ + unsigned long first_ind_entry = __pa(&image->head); unsigned long this_cpu_id = __smp_processor_id(); unsigned long this_hart_id = cpuid_to_hartid_map(this_cpu_id); unsigned long fdt_addr = internal->fdt_addr; - void *control_code_buffer = page_address(image->control_code_page); riscv_kexec_method kexec_method = NULL; #ifdef CONFIG_SMP @@ -274,10 +275,20 @@ machine_kexec(struct kimage *image) "Some CPUs may be stale, kdump will be unreliable.\n"); #endif - if (image->type != KEXEC_TYPE_CRASH) - kexec_method = control_code_buffer; - else + if (image->type != KEXEC_TYPE_CRASH) { + kexec_method = (riscv_kexec_method) &riscv_kexec_relocate_entry; + /* + * Publish the per-image control_code_buffer PA at dispatch + * time rather than in machine_kexec_prepare(). machine_kexec() + * only runs once the image has been fully loaded and committed + * as kexec_image, so the global cannot be left pointing at a + * page freed by a failed load. + */ + WRITE_ONCE(riscv_kexec_cc_buffer_pa, + __pa(page_address(image->control_code_page))); + } else { kexec_method = (riscv_kexec_method) &riscv_kexec_norelocate; + } pr_notice("Will call new kernel at %08lx from hart id %lx\n", jump_addr, this_hart_id); -- 2.50.1 From pasha.tatashin at soleen.com Thu Jun 4 20:32:26 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 5 Jun 2026 03:32:26 +0000 Subject: [RFC v1 0/9] kho: granular compatibility and header decoupling Message-ID: <20260605033235.717351-1-pasha.tatashin@soleen.com> This series decouples the compatibility tracking and code organization of individual KHO subsystems (radix tree, vmalloc, and block device). The diff here is a bit larger than I'd like, but most of it is just refactoring and moving code around to modularize the subsystems rather than functional changes. 
Specifically, this series separates KHO data structures from KHO core
functionality:

- KHO data structures are those that have ABI checks between kernel
  versions (e.g., vmalloc, radix trees, and block devices).
- KHO core is the functionality that involves passing the memory from
  one kernel to another. KHO core depends on some of these data
  structures, but other users of KHO also depend on them.

The number of data structures keeps growing: first we introduced
vmalloc, then radix trees, then linked blocks, and next we are planning
to add an xarray-like data structure. Keeping all of this within the
same `kexec_handover.c` file, and also under the same global version, is
no longer sustainable.

To address this, this series:

1. Refactors and reorganizes the code by splitting out radix tree and
   vmalloc into separate files.
2. Moves and organizes internal and ABI headers into structured
   directories under include/linux/kho/ and include/linux/kho/abi/.
   Instead of cluttering include/linux/ with prefix-styled headers like
   kho_block.h or kho_radix_tree.h, we use the already existing
   include/linux/kho/ directory (e.g., kho/block.h and
   kho/radix_tree.h).
3. Introduces a standard set of compatibility helpers in
   kho/abi/compat.h.
4. Decouples the compatibility strings of individual KHO subsystems
   (radix tree, vmalloc, and block) from the global KHO version. This
   enables independent, granular compatibility versioning.
5. Adds a KUnit test suite to verify that the composite compatibility
   strings of different subsystems remain unique and sorted in
   alphabetical order, guaranteeing a consistent and predictable
   representation across configurations.

This series is to gather feedback on the overall design and layout of
the granular compatibility mechanism.
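The property that item 5 says the KUnit suite verifies (compatibility strings that are unique and alphabetically sorted) can be sketched in plain userspace C. This is a minimal illustration, not the test from the series: compat_strings_sorted_unique() and the example strings in compat_selfcheck() are hypothetical, and "alphabetical" is assumed here to mean byte-wise strcmp() order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if every string compares strictly
 * greater than its predecessor under strcmp(), which implies the list
 * is both sorted and free of duplicates.
 */
static bool compat_strings_sorted_unique(const char *const *strs, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		if (strcmp(strs[i - 1], strs[i]) >= 0)
			return false;
	}
	return true;
}

/* Example strings are made up for illustration only. */
void compat_selfcheck(void)
{
	const char *const good[] = { "block-v1", "radix-v1", "vmalloc-v1" };
	const char *const unsorted[] = { "radix-v1", "block-v1" };

	assert(compat_strings_sorted_unique(good, 3));
	assert(!compat_strings_sorted_unique(unsorted, 2));
}
```

Using a strict comparison folds the uniqueness check into the ordering check, so a single pass over adjacent pairs covers both properties.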
Pasha Tatashin (9):
  kho: split out radix tree tracker into kho_radix.c
  kho: split radix tree headers out of kexec_handover.h
  kho: split out vmalloc preservation into kho_vmalloc.c
  kho: split vmalloc headers out of kexec_handover.h
  kho: move kho_block.h to kho/block.h
  kho: introduce compatibility helpers and decouple block version
  kho: decouple radix tree compatibility from global KHO version
  kho: decouple vmalloc compatibility from global KHO version and update memfd
  liveupdate: add KUnit test to verify alphabetical order of compatibility strings

 Documentation/core-api/kho/abi.rst         |  11 +-
 Documentation/core-api/kho/index.rst       |  10 +-
 MAINTAINERS                                |   1 -
 include/linux/kexec_handover.h             |  18 -
 include/linux/kho/abi/block.h              |   4 +-
 include/linux/kho/abi/compat.h             |  33 ++
 include/linux/kho/abi/kexec_handover.h     | 203 +------
 include/linux/kho/abi/luo.h                |   8 +-
 include/linux/kho/abi/memfd.h              |  12 +-
 include/linux/kho/abi/radix_tree.h         | 133 +++++
 include/linux/kho/abi/vmalloc.h            | 101 ++++
 include/linux/{kho_block.h => kho/block.h} |   2 +-
 .../{kho_radix_tree.h => kho/radix_tree.h} |   5 +-
 include/linux/kho/vmalloc.h                |  34 ++
 kernel/liveupdate/Kconfig                  |  15 +
 kernel/liveupdate/Makefile                 |   9 +-
 kernel/liveupdate/kexec_handover.c         | 531 +-----------------
 kernel/liveupdate/kho_block.c              |   2 +-
 kernel/liveupdate/kho_radix.c              | 290 ++++++++++
 kernel/liveupdate/kho_vmalloc.c            | 274 +++++++++
 kernel/liveupdate/liveupdate_test.c        |  56 ++
 kernel/liveupdate/luo_internal.h           |   2 +-
 kernel/liveupdate/luo_session.c            |   2 +-
 lib/test_kho.c                             |   1 +
 mm/memfd_luo.c                             |   1 +
 tools/testing/selftests/liveupdate/config  |   1 +
 26 files changed, 997 insertions(+), 762 deletions(-)
 create mode 100644 include/linux/kho/abi/compat.h
 create mode 100644 include/linux/kho/abi/radix_tree.h
 create mode 100644 include/linux/kho/abi/vmalloc.h
 rename include/linux/{kho_block.h => kho/block.h} (100%)
 rename include/linux/{kho_radix_tree.h => kho/radix_tree.h} (96%)
 create mode 100644 include/linux/kho/vmalloc.h
 create mode 100644 kernel/liveupdate/kho_radix.c
 create mode 100644 kernel/liveupdate/kho_vmalloc.c
 create mode 100644 kernel/liveupdate/liveupdate_test.c

base-commit: 5fb813ae0009d97fc414f08ad73286f562e9a123
--
2.53.0

From pasha.tatashin at soleen.com Thu Jun 4 20:32:27 2026
From: pasha.tatashin at soleen.com (Pasha Tatashin)
Date: Fri, 5 Jun 2026 03:32:27 +0000
Subject: [RFC v1 1/9] kho: split out radix tree tracker into kho_radix.c
In-Reply-To: <20260605033235.717351-1-pasha.tatashin@soleen.com>
References: <20260605033235.717351-1-pasha.tatashin@soleen.com>
Message-ID: <20260605033235.717351-2-pasha.tatashin@soleen.com>

Move the radix tree tracker implementation from the core KHO code into
its own dedicated file (kho_radix.c).

This is a pure code movement patch; no logic or functional changes are
introduced.

Signed-off-by: Pasha Tatashin
---
 Documentation/core-api/kho/index.rst |   3 +
 kernel/liveupdate/Makefile           |   6 +-
 kernel/liveupdate/kexec_handover.c   | 273 -------------------------
 kernel/liveupdate/kho_radix.c        | 290 +++++++++++++++++++++++++++
 4 files changed, 298 insertions(+), 274 deletions(-)
 create mode 100644 kernel/liveupdate/kho_radix.c

diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst index 320914a42178..a9892c671ec3 100644 --- a/Documentation/core-api/kho/index.rst +++ b/Documentation/core-api/kho/index.rst @@ -83,6 +83,9 @@ Public API .. kernel-doc:: kernel/liveupdate/kexec_handover.c :export: +..
kernel-doc:: kernel/liveupdate/kho_radix.c + :export: + KHO Serialization Blocks API ============================ diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile index eec9d3ae07eb..a3ee8a5c27a2 100644 --- a/kernel/liveupdate/Makefile +++ b/kernel/liveupdate/Makefile @@ -7,7 +7,11 @@ luo-y := \ luo_flb.o \ luo_session.o -obj-$(CONFIG_KEXEC_HANDOVER) += kexec_handover.o +kho-y := \ + kexec_handover.o \ + kho_radix.o + +obj-$(CONFIG_KEXEC_HANDOVER) += kho.o obj-$(CONFIG_KEXEC_HANDOVER_DEBUG) += kexec_handover_debug.o obj-$(CONFIG_KEXEC_HANDOVER_DEBUGFS) += kexec_handover_debugfs.o diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 4834a809985a..041efff7ca11 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -5,7 +5,6 @@ * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport * Copyright (C) 2025 Google LLC, Changyuan Lyu * Copyright (C) 2025 Pasha Tatashin - * Copyright (C) 2026 Google LLC, Jason Miu */ #define pr_fmt(fmt) "KHO: " fmt @@ -84,278 +83,6 @@ static struct kho_out kho_out = { }, }; -/** - * kho_radix_encode_key - Encodes a physical address and order into a radix key. - * @phys: The physical address of the page. - * @order: The order of the page. - * - * This function combines a page's physical address and its order into a - * single unsigned long, which is used as a key for all radix tree - * operations. - * - * Return: The encoded unsigned long radix key. - */ -static unsigned long kho_radix_encode_key(phys_addr_t phys, unsigned int order) -{ - /* Order bits part */ - unsigned long h = 1UL << (KHO_ORDER_0_LOG2 - order); - /* Shifted physical address part */ - unsigned long l = phys >> (PAGE_SHIFT + order); - - return h | l; -} - -/** - * kho_radix_decode_key - Decodes a radix key back into a physical address and order. - * @key: The unsigned long key to decode. 
- * @order: An output parameter, a pointer to an unsigned int where the decoded - * page order will be stored. - * - * This function reverses the encoding performed by kho_radix_encode_key(), - * extracting the original physical address and page order from a given key. - * - * Return: The decoded physical address. - */ -static phys_addr_t kho_radix_decode_key(unsigned long key, unsigned int *order) -{ - unsigned int order_bit = fls64(key); - phys_addr_t phys; - - /* order_bit is numbered starting at 1 from fls64 */ - *order = KHO_ORDER_0_LOG2 - order_bit + 1; - /* The order is discarded by the shift */ - phys = key << (PAGE_SHIFT + *order); - - return phys; -} - -static unsigned long kho_radix_get_bitmap_index(unsigned long key) -{ - return key % (1 << KHO_BITMAP_SIZE_LOG2); -} - -static unsigned long kho_radix_get_table_index(unsigned long key, - unsigned int level) -{ - int s; - - s = ((level - 1) * KHO_TABLE_SIZE_LOG2) + KHO_BITMAP_SIZE_LOG2; - return (key >> s) % (1 << KHO_TABLE_SIZE_LOG2); -} - -/** - * kho_radix_add_page - Marks a page as preserved in the radix tree. - * @tree: The KHO radix tree. - * @pfn: The page frame number of the page to preserve. - * @order: The order of the page. - * - * This function traverses the radix tree based on the key derived from @pfn - * and @order. It sets the corresponding bit in the leaf bitmap to mark the - * page for preservation. If intermediate nodes do not exist along the path, - * they are allocated and added to the tree. - * - * Return: 0 on success, or a negative error code on failure. 
- */ -int kho_radix_add_page(struct kho_radix_tree *tree, - unsigned long pfn, unsigned int order) -{ - /* Newly allocated nodes for error cleanup */ - struct kho_radix_node *intermediate_nodes[KHO_TREE_MAX_DEPTH] = { 0 }; - unsigned long key = kho_radix_encode_key(PFN_PHYS(pfn), order); - struct kho_radix_node *anchor_node = NULL; - struct kho_radix_node *node = tree->root; - struct kho_radix_node *new_node; - unsigned int i, idx, anchor_idx; - struct kho_radix_leaf *leaf; - int err = 0; - - if (WARN_ON_ONCE(!tree->root)) - return -EINVAL; - - might_sleep(); - - guard(mutex)(&tree->lock); - - /* Go from high levels to low levels */ - for (i = KHO_TREE_MAX_DEPTH - 1; i > 0; i--) { - idx = kho_radix_get_table_index(key, i); - - if (node->table[idx]) { - node = phys_to_virt(node->table[idx]); - continue; - } - - /* Next node is empty, create a new node for it */ - new_node = (struct kho_radix_node *)get_zeroed_page(GFP_KERNEL); - if (!new_node) { - err = -ENOMEM; - goto err_free_nodes; - } - - node->table[idx] = virt_to_phys(new_node); - - /* - * Capture the node where the new branch starts for cleanup - * if allocation fails. - */ - if (!anchor_node) { - anchor_node = node; - anchor_idx = idx; - } - intermediate_nodes[i] = new_node; - - node = new_node; - } - - /* Handle the leaf level bitmap (level 0) */ - idx = kho_radix_get_bitmap_index(key); - leaf = (struct kho_radix_leaf *)node; - __set_bit(idx, leaf->bitmap); - - return 0; - -err_free_nodes: - for (i = KHO_TREE_MAX_DEPTH - 1; i > 0; i--) { - if (intermediate_nodes[i]) - free_page((unsigned long)intermediate_nodes[i]); - } - if (anchor_node) - anchor_node->table[anchor_idx] = 0; - - return err; -} -EXPORT_SYMBOL_GPL(kho_radix_add_page); - -/** - * kho_radix_del_page - Removes a page's preservation status from the radix tree. - * @tree: The KHO radix tree. - * @pfn: The page frame number of the page to unpreserve. - * @order: The order of the page. 
- * - * This function traverses the radix tree and clears the bit corresponding to - * the page, effectively removing its "preserved" status. It does not free - * the tree's intermediate nodes, even if they become empty. - */ -void kho_radix_del_page(struct kho_radix_tree *tree, unsigned long pfn, - unsigned int order) -{ - unsigned long key = kho_radix_encode_key(PFN_PHYS(pfn), order); - struct kho_radix_node *node = tree->root; - struct kho_radix_leaf *leaf; - unsigned int i, idx; - - if (WARN_ON_ONCE(!tree->root)) - return; - - might_sleep(); - - guard(mutex)(&tree->lock); - - /* Go from high levels to low levels */ - for (i = KHO_TREE_MAX_DEPTH - 1; i > 0; i--) { - idx = kho_radix_get_table_index(key, i); - - /* - * Attempting to delete a page that has not been preserved, - * return with a warning. - */ - if (WARN_ON(!node->table[idx])) - return; - - node = phys_to_virt(node->table[idx]); - } - - /* Handle the leaf level bitmap (level 0) */ - leaf = (struct kho_radix_leaf *)node; - idx = kho_radix_get_bitmap_index(key); - __clear_bit(idx, leaf->bitmap); -} -EXPORT_SYMBOL_GPL(kho_radix_del_page); - -static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, - unsigned long key, - kho_radix_tree_walk_callback_t cb) -{ - unsigned long *bitmap = (unsigned long *)leaf; - unsigned int order; - phys_addr_t phys; - unsigned int i; - int err; - - for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) { - phys = kho_radix_decode_key(key | i, &order); - err = cb(phys, order); - if (err) - return err; - } - - return 0; -} - -static int __kho_radix_walk_tree(struct kho_radix_node *root, - unsigned int level, unsigned long start, - kho_radix_tree_walk_callback_t cb) -{ - struct kho_radix_node *node; - struct kho_radix_leaf *leaf; - unsigned long key, i; - unsigned int shift; - int err; - - for (i = 0; i < PAGE_SIZE / sizeof(phys_addr_t); i++) { - if (!root->table[i]) - continue; - - shift = ((level - 1) * KHO_TABLE_SIZE_LOG2) + - KHO_BITMAP_SIZE_LOG2; - key = start | (i << 
shift); - - node = phys_to_virt(root->table[i]); - - if (level == 1) { - /* - * we are at level 1, - * node is pointing to the level 0 bitmap. - */ - leaf = (struct kho_radix_leaf *)node; - err = kho_radix_walk_leaf(leaf, key, cb); - } else { - err = __kho_radix_walk_tree(node, level - 1, - key, cb); - } - - if (err) - return err; - } - - return 0; -} - -/** - * kho_radix_walk_tree - Traverses the radix tree and calls a callback for each preserved page. - * @tree: A pointer to the KHO radix tree to walk. - * @cb: A callback function of type kho_radix_tree_walk_callback_t that will be - * invoked for each preserved page found in the tree. The callback receives - * the physical address and order of the preserved page. - * - * This function walks the radix tree, searching from the specified top level - * down to the lowest level (level 0). For each preserved page found, it invokes - * the provided callback, passing the page's physical address and order. - * - * Return: 0 if the walk completed the specified tree, or the non-zero return - * value from the callback that stopped the walk. - */ -int kho_radix_walk_tree(struct kho_radix_tree *tree, - kho_radix_tree_walk_callback_t cb) -{ - if (WARN_ON_ONCE(!tree->root)) - return -EINVAL; - - guard(mutex)(&tree->lock); - - return __kho_radix_walk_tree(tree->root, KHO_TREE_MAX_DEPTH - 1, 0, cb); -} -EXPORT_SYMBOL_GPL(kho_radix_walk_tree); /* For physically contiguous 0-order pages. 
*/ static void kho_init_pages(struct page *page, unsigned long nr_pages) diff --git a/kernel/liveupdate/kho_radix.c b/kernel/liveupdate/kho_radix.c new file mode 100644 index 000000000000..c836783a1376 --- /dev/null +++ b/kernel/liveupdate/kho_radix.c @@ -0,0 +1,290 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * kho_radix.c - KHO radix tree tracker for preserved memory pages + * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport + * Copyright (C) 2025 Pasha Tatashin + * Copyright (C) 2026 Google LLC, Jason Miu + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/** + * kho_radix_encode_key - Encodes a physical address and order into a radix key. + * @phys: The physical address of the page. + * @order: The order of the page. + * + * This function combines a page's physical address and its order into a + * single unsigned long, which is used as a key for all radix tree + * operations. + * + * Return: The encoded unsigned long radix key. + */ +static unsigned long kho_radix_encode_key(phys_addr_t phys, unsigned int order) +{ + /* Order bits part */ + unsigned long h = 1UL << (KHO_ORDER_0_LOG2 - order); + /* Shifted physical address part */ + unsigned long l = phys >> (PAGE_SHIFT + order); + + return h | l; +} + +/** + * kho_radix_decode_key - Decodes a radix key back into a physical address and order. + * @key: The unsigned long key to decode. + * @order: An output parameter, a pointer to an unsigned int where the decoded + * page order will be stored. + * + * This function reverses the encoding performed by kho_radix_encode_key(), + * extracting the original physical address and page order from a given key. + * + * Return: The decoded physical address. 
+ */ +static phys_addr_t kho_radix_decode_key(unsigned long key, unsigned int *order) +{ + unsigned int order_bit = fls64(key); + phys_addr_t phys; + + /* order_bit is numbered starting at 1 from fls64 */ + *order = KHO_ORDER_0_LOG2 - order_bit + 1; + /* The order is discarded by the shift */ + phys = key << (PAGE_SHIFT + *order); + + return phys; +} + +static unsigned long kho_radix_get_bitmap_index(unsigned long key) +{ + return key % (1 << KHO_BITMAP_SIZE_LOG2); +} + +static unsigned long kho_radix_get_table_index(unsigned long key, + unsigned int level) +{ + int s; + + s = ((level - 1) * KHO_TABLE_SIZE_LOG2) + KHO_BITMAP_SIZE_LOG2; + return (key >> s) % (1 << KHO_TABLE_SIZE_LOG2); +} + +/** + * kho_radix_add_page - Marks a page as preserved in the radix tree. + * @tree: The KHO radix tree. + * @pfn: The page frame number of the page to preserve. + * @order: The order of the page. + * + * This function traverses the radix tree based on the key derived from @pfn + * and @order. It sets the corresponding bit in the leaf bitmap to mark the + * page for preservation. If intermediate nodes do not exist along the path, + * they are allocated and added to the tree. + * + * Return: 0 on success, or a negative error code on failure. 
+ */ +int kho_radix_add_page(struct kho_radix_tree *tree, + unsigned long pfn, unsigned int order) +{ + /* Newly allocated nodes for error cleanup */ + struct kho_radix_node *intermediate_nodes[KHO_TREE_MAX_DEPTH] = { 0 }; + unsigned long key = kho_radix_encode_key(PFN_PHYS(pfn), order); + struct kho_radix_node *anchor_node = NULL; + struct kho_radix_node *node = tree->root; + struct kho_radix_node *new_node; + unsigned int i, idx, anchor_idx; + struct kho_radix_leaf *leaf; + int err = 0; + + if (WARN_ON_ONCE(!tree->root)) + return -EINVAL; + + might_sleep(); + + guard(mutex)(&tree->lock); + + /* Go from high levels to low levels */ + for (i = KHO_TREE_MAX_DEPTH - 1; i > 0; i--) { + idx = kho_radix_get_table_index(key, i); + + if (node->table[idx]) { + node = phys_to_virt(node->table[idx]); + continue; + } + + /* Next node is empty, create a new node for it */ + new_node = (struct kho_radix_node *)get_zeroed_page(GFP_KERNEL); + if (!new_node) { + err = -ENOMEM; + goto err_free_nodes; + } + + node->table[idx] = virt_to_phys(new_node); + + /* + * Capture the node where the new branch starts for cleanup + * if allocation fails. + */ + if (!anchor_node) { + anchor_node = node; + anchor_idx = idx; + } + intermediate_nodes[i] = new_node; + + node = new_node; + } + + /* Handle the leaf level bitmap (level 0) */ + idx = kho_radix_get_bitmap_index(key); + leaf = (struct kho_radix_leaf *)node; + __set_bit(idx, leaf->bitmap); + + return 0; + +err_free_nodes: + for (i = KHO_TREE_MAX_DEPTH - 1; i > 0; i--) { + if (intermediate_nodes[i]) + free_page((unsigned long)intermediate_nodes[i]); + } + if (anchor_node) + anchor_node->table[anchor_idx] = 0; + + return err; +} +EXPORT_SYMBOL_GPL(kho_radix_add_page); + +/** + * kho_radix_del_page - Removes a page's preservation status from the radix tree. + * @tree: The KHO radix tree. + * @pfn: The page frame number of the page to unpreserve. + * @order: The order of the page. 
+ * + * This function traverses the radix tree and clears the bit corresponding to + * the page, effectively removing its "preserved" status. It does not free + * the tree's intermediate nodes, even if they become empty. + */ +void kho_radix_del_page(struct kho_radix_tree *tree, unsigned long pfn, + unsigned int order) +{ + unsigned long key = kho_radix_encode_key(PFN_PHYS(pfn), order); + struct kho_radix_node *node = tree->root; + struct kho_radix_leaf *leaf; + unsigned int i, idx; + + if (WARN_ON_ONCE(!tree->root)) + return; + + might_sleep(); + + guard(mutex)(&tree->lock); + + /* Go from high levels to low levels */ + for (i = KHO_TREE_MAX_DEPTH - 1; i > 0; i--) { + idx = kho_radix_get_table_index(key, i); + + /* + * Attempting to delete a page that has not been preserved, + * return with a warning. + */ + if (WARN_ON(!node->table[idx])) + return; + + node = phys_to_virt(node->table[idx]); + } + + /* Handle the leaf level bitmap (level 0) */ + leaf = (struct kho_radix_leaf *)node; + idx = kho_radix_get_bitmap_index(key); + __clear_bit(idx, leaf->bitmap); +} +EXPORT_SYMBOL_GPL(kho_radix_del_page); + +static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, + unsigned long key, + kho_radix_tree_walk_callback_t cb) +{ + unsigned long *bitmap = (unsigned long *)leaf; + unsigned int order; + phys_addr_t phys; + unsigned int i; + int err; + + for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) { + phys = kho_radix_decode_key(key | i, &order); + err = cb(phys, order); + if (err) + return err; + } + + return 0; +} + +static int __kho_radix_walk_tree(struct kho_radix_node *root, + unsigned int level, unsigned long start, + kho_radix_tree_walk_callback_t cb) +{ + struct kho_radix_node *node; + struct kho_radix_leaf *leaf; + unsigned long key, i; + unsigned int shift; + int err; + + for (i = 0; i < PAGE_SIZE / sizeof(phys_addr_t); i++) { + if (!root->table[i]) + continue; + + shift = ((level - 1) * KHO_TABLE_SIZE_LOG2) + + KHO_BITMAP_SIZE_LOG2; + key = start | (i << 
shift); + + node = phys_to_virt(root->table[i]); + + if (level == 1) { + /* + * we are at level 1, + * node is pointing to the level 0 bitmap. + */ + leaf = (struct kho_radix_leaf *)node; + err = kho_radix_walk_leaf(leaf, key, cb); + } else { + err = __kho_radix_walk_tree(node, level - 1, + key, cb); + } + + if (err) + return err; + } + + return 0; +} + +/** + * kho_radix_walk_tree - Traverses the radix tree and calls a callback for each preserved page. + * @tree: A pointer to the KHO radix tree to walk. + * @cb: A callback function of type kho_radix_tree_walk_callback_t that will be + * invoked for each preserved page found in the tree. The callback receives + * the physical address and order of the preserved page. + * + * This function walks the radix tree, searching from the specified top level + * down to the lowest level (level 0). For each preserved page found, it invokes + * the provided callback, passing the page's physical address and order. + * + * Return: 0 if the walk completed the specified tree, or the non-zero return + * value from the callback that stopped the walk. + */ +int kho_radix_walk_tree(struct kho_radix_tree *tree, + kho_radix_tree_walk_callback_t cb) +{ + if (WARN_ON_ONCE(!tree->root)) + return -EINVAL; + + guard(mutex)(&tree->lock); + + return __kho_radix_walk_tree(tree->root, KHO_TREE_MAX_DEPTH - 1, 0, cb); +} +EXPORT_SYMBOL_GPL(kho_radix_walk_tree); -- 2.53.0 From pasha.tatashin at soleen.com Thu Jun 4 20:32:28 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 5 Jun 2026 03:32:28 +0000 Subject: [RFC v1 2/9] kho: split radix tree headers out of kexec_handover.h In-Reply-To: <20260605033235.717351-1-pasha.tatashin@soleen.com> References: <20260605033235.717351-1-pasha.tatashin@soleen.com> Message-ID: <20260605033235.717351-3-pasha.tatashin@soleen.com> Split the radix tree tracker-related ABI definitions and header declarations out of the monolithic kexec_handover.h header into a dedicated header file (radix_tree.h). 
Additionally, rename kho_radix_tree.h to kho/radix_tree.h, organizing it within the existing kho directory structure as more KHO data structures are introduced. This is a pure code movement patch; no logic or functional changes are introduced. Signed-off-by: Pasha Tatashin --- Documentation/core-api/kho/abi.rst | 3 +- Documentation/core-api/kho/index.rst | 2 +- include/linux/kho/abi/kexec_handover.h | 114 --------------- include/linux/kho/abi/radix_tree.h | 131 ++++++++++++++++++ .../{kho_radix_tree.h => kho/radix_tree.h} | 5 +- kernel/liveupdate/kexec_handover.c | 2 +- kernel/liveupdate/kho_radix.c | 2 +- 7 files changed, 137 insertions(+), 122 deletions(-) create mode 100644 include/linux/kho/abi/radix_tree.h rename include/linux/{kho_radix_tree.h => kho/radix_tree.h} (96%) diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst index edeb5b311963..da5c6636bb17 100644 --- a/Documentation/core-api/kho/abi.rst +++ b/Documentation/core-api/kho/abi.rst @@ -25,8 +25,7 @@ memblock preservation ABI KHO persistent memory tracker ABI ================================= -.. kernel-doc:: include/linux/kho/abi/kexec_handover.h - :doc: KHO persistent memory tracker +.. kernel-doc:: include/linux/kho/abi/radix_tree.h KHO serialization block ABI =========================== diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst index a9892c671ec3..f69367d217cf 100644 --- a/Documentation/core-api/kho/index.rst +++ b/Documentation/core-api/kho/index.rst @@ -74,7 +74,7 @@ the next KHO, because kexec can overwrite even the original kernel. Kexec Handover Radix Tree ========================= -.. kernel-doc:: include/linux/kho_radix_tree.h +.. 
kernel-doc:: include/linux/kho/radix_tree.h :doc: Kexec Handover Radix Tree Public API diff --git a/include/linux/kho/abi/kexec_handover.h b/include/linux/kho/abi/kexec_handover.h index 5e2eb8519bda..99e4a53d4e35 100644 --- a/include/linux/kho/abi/kexec_handover.h +++ b/include/linux/kho/abi/kexec_handover.h @@ -4,15 +4,10 @@ * Copyright (C) 2023 Alexander Graf * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport * Copyright (C) 2025 Google LLC, Changyuan Lyu - * Copyright (C) 2025 Google LLC, Jason Miu */ #ifndef _LINUX_KHO_ABI_KEXEC_HANDOVER_H #define _LINUX_KHO_ABI_KEXEC_HANDOVER_H - -#include -#include -#include #include #include @@ -177,113 +172,4 @@ struct kho_vmalloc { unsigned short order; }; -/** - * DOC: KHO persistent memory tracker - * - * KHO tracks preserved memory using a radix tree data structure. Each node of - * the tree is exactly a single page. The leaf nodes are bitmaps where each set - * bit is a preserved page of any order. The intermediate nodes are tables of - * physical addresses that point to a lower level node. - * - * The tree hierarchy is shown below:: - * - * root - * +-------------------+ - * | Level 5 | (struct kho_radix_node) - * +-------------------+ - * | - * v - * +-------------------+ - * | Level 4 | (struct kho_radix_node) - * +-------------------+ - * | - * | ... (intermediate levels) - * | - * v - * +-------------------+ - * | Level 0 | (struct kho_radix_leaf) - * +-------------------+ - * - * The tree is traversed using a key that encodes the page's physical address - * (pa) and its order into a single unsigned long value. The encoded key value - * is composed of two parts: the 'order bit' in the upper part and the - * 'shifted physical address' in the lower part.:: - * - * +------------+-----------------------------+--------------------------+ - * | Page Order | Order Bit | Shifted Physical Address | - * +------------+-----------------------------+--------------------------+ - * | 0 | ...000100 ... 
(at bit 52) | pa >> (PAGE_SHIFT + 0) | - * | 1 | ...000010 ... (at bit 51) | pa >> (PAGE_SHIFT + 1) | - * | 2 | ...000001 ... (at bit 50) | pa >> (PAGE_SHIFT + 2) | - * | ... | ... | ... | - * +------------+-----------------------------+--------------------------+ - * - * Shifted Physical Address: - * The 'shifted physical address' is the physical address normalized for its - * order. It effectively represents the PFN shifted right by the order. - * - * Order Bit: - * The 'order bit' encodes the page order by setting a single bit at a - * specific position. The position of this bit itself represents the order. - * - * For instance, on a 64-bit system with 4KB pages (PAGE_SHIFT = 12), the - * maximum range for the shifted physical address (for order 0) is 52 bits - * (64 - 12). This address occupies bits [0-51]. For order 0, the order bit is - * set at position 52. - * - * The following diagram illustrates how the encoded key value is split into - * indices for the tree levels, with PAGE_SIZE of 4KB:: - * - * 63:60 59:51 50:42 41:33 32:24 23:15 14:0 - * +---------+--------+--------+--------+--------+--------+-----------------+ - * | 0 | Lv 5 | Lv 4 | Lv 3 | Lv 2 | Lv 1 | Lv 0 (bitmap) | - * +---------+--------+--------+--------+--------+--------+-----------------+ - * - * The radix tree stores pages of all orders in a single 6-level hierarchy. It - * efficiently shares higher tree levels, especially due to common zero top - * address bits, allowing a single, efficient algorithm to manage all - * pages. This bitmap approach also offers memory efficiency; for example, a - * 512KB bitmap can cover a 16GB memory range for 0-order pages with PAGE_SIZE = - * 4KB. - * - * The data structures defined here are part of the KHO ABI. Any modification - * to these structures that breaks backward compatibility must be accompanied by - * an update to the "compatible" string. This ensures that a newer kernel can - * correctly interpret the data passed by an older kernel. 
- */ - -/* - * Defines constants for the KHO radix tree structure, used to track preserved - * memory. These constants govern the indexing, sizing, and depth of the tree. - */ -enum kho_radix_consts { - /* - * The bit position of the order bit (and also the length of the - * shifted physical address) for an order-0 page. - */ - KHO_ORDER_0_LOG2 = 64 - PAGE_SHIFT, - - /* Size of the table in kho_radix_node, in log2 */ - KHO_TABLE_SIZE_LOG2 = const_ilog2(PAGE_SIZE / sizeof(phys_addr_t)), - - /* Number of bits in the kho_radix_leaf bitmap, in log2 */ - KHO_BITMAP_SIZE_LOG2 = PAGE_SHIFT + const_ilog2(BITS_PER_BYTE), - - /* - * The total tree depth is the number of intermediate levels - * and 1 bitmap level. - */ - KHO_TREE_MAX_DEPTH = - DIV_ROUND_UP(KHO_ORDER_0_LOG2 - KHO_BITMAP_SIZE_LOG2 + 1, - KHO_TABLE_SIZE_LOG2) + 1, -}; - -struct kho_radix_node { - u64 table[1 << KHO_TABLE_SIZE_LOG2]; -}; - -struct kho_radix_leaf { - DECLARE_BITMAP(bitmap, 1 << KHO_BITMAP_SIZE_LOG2); -}; - #endif /* _LINUX_KHO_ABI_KEXEC_HANDOVER_H */ diff --git a/include/linux/kho/abi/radix_tree.h b/include/linux/kho/abi/radix_tree.h new file mode 100644 index 000000000000..f4cc5c02f37a --- /dev/null +++ b/include/linux/kho/abi/radix_tree.h @@ -0,0 +1,131 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2025 Google LLC, Jason Miu + * Copyright (C) 2026 Pasha Tatashin + */ + +#ifndef _LINUX_KHO_ABI_RADIX_TREE_H +#define _LINUX_KHO_ABI_RADIX_TREE_H + +#include +#include + +/** + * DOC: KHO persistent memory tracker + * + * Subsystems using the KHO persistent memory tracker rely on the stable + * Application Binary Interface defined below to pass serialized state from a + * pre-update kernel to a post-update kernel. + * + * This interface is a contract. Any modification to the structure fields, + * compatible strings, or the layout of the serialization structures defined + * here constitutes a breaking change. 
Such changes require incrementing the + * version number in the `KHO_FDT_COMPATIBLE` string to prevent a new kernel + * from misinterpreting data from an old kernel. + * + * Changes are allowed provided the compatibility version is incremented; + * however, backward/forward compatibility is only guaranteed for kernels + * supporting the same ABI version. + * + * KHO tracks preserved memory using a radix tree data structure. Each node of + * the tree is exactly a single page. The leaf nodes are bitmaps where each set + * bit is a preserved page of any order. The intermediate nodes are tables of + * physical addresses that point to a lower level node. + * + * The tree hierarchy is shown below:: + * + * root + * +-------------------+ + * | Level 5 | (struct kho_radix_node) + * +-------------------+ + * | + * v + * +-------------------+ + * | Level 4 | (struct kho_radix_node) + * +-------------------+ + * | + * | ... (intermediate levels) + * | + * v + * +-------------------+ + * | Level 0 | (struct kho_radix_leaf) + * +-------------------+ + * + * The tree is traversed using a key that encodes the page's physical address + * (pa) and its order into a single unsigned long value. The encoded key value + * is composed of two parts: the 'order bit' in the upper part and the + * 'shifted physical address' in the lower part.:: + * + * +------------+-----------------------------+--------------------------+ + * | Page Order | Order Bit | Shifted Physical Address | + * +------------+-----------------------------+--------------------------+ + * | 0 | ...000100 ... (at bit 52) | pa >> (PAGE_SHIFT + 0) | + * | 1 | ...000010 ... (at bit 51) | pa >> (PAGE_SHIFT + 1) | + * | 2 | ...000001 ... (at bit 50) | pa >> (PAGE_SHIFT + 2) | + * | ... | ... | ... | + * +------------+-----------------------------+--------------------------+ + * + * Shifted Physical Address: + * The 'shifted physical address' is the physical address normalized for its + * order. 
It effectively represents the PFN shifted right by the order. + * + * Order Bit: + * The 'order bit' encodes the page order by setting a single bit at a + * specific position. The position of this bit itself represents the order. + * + * For instance, on a 64-bit system with 4KB pages (PAGE_SHIFT = 12), the + * maximum range for the shifted physical address (for order 0) is 52 bits + * (64 - 12). This address occupies bits [0-51]. For order 0, the order bit is + * set at position 52. + * + * The following diagram illustrates how the encoded key value is split into + * indices for the tree levels, with PAGE_SIZE of 4KB:: + * + * 63:60 59:51 50:42 41:33 32:24 23:15 14:0 + * +---------+--------+--------+--------+--------+--------+-----------------+ + * | 0 | Lv 5 | Lv 4 | Lv 3 | Lv 2 | Lv 1 | Lv 0 (bitmap) | + * +---------+--------+--------+--------+--------+--------+-----------------+ + * + * The radix tree stores pages of all orders in a single 6-level hierarchy. It + * efficiently shares higher tree levels, especially due to common zero top + * address bits, allowing a single, efficient algorithm to manage all + * pages. This bitmap approach also offers memory efficiency; for example, a + * 512KB bitmap can cover a 16GB memory range for 0-order pages with PAGE_SIZE = + * 4KB. + */ + +/* + * Defines constants for the KHO radix tree structure, used to track preserved + * memory. These constants govern the indexing, sizing, and depth of the tree. + */ +enum kho_radix_consts { + /* + * The bit position of the order bit (and also the length of the + * shifted physical address) for an order-0 page. 
+ */ + KHO_ORDER_0_LOG2 = 64 - PAGE_SHIFT, + + /* Size of the table in kho_radix_node, in log2 */ + KHO_TABLE_SIZE_LOG2 = const_ilog2(PAGE_SIZE / sizeof(phys_addr_t)), + + /* Number of bits in the kho_radix_leaf bitmap, in log2 */ + KHO_BITMAP_SIZE_LOG2 = PAGE_SHIFT + const_ilog2(BITS_PER_BYTE), + + /* + * The total tree depth is the number of intermediate levels + * and 1 bitmap level. + */ + KHO_TREE_MAX_DEPTH = + DIV_ROUND_UP(KHO_ORDER_0_LOG2 - KHO_BITMAP_SIZE_LOG2 + 1, + KHO_TABLE_SIZE_LOG2) + 1, +}; + +struct kho_radix_node { + u64 table[1 << KHO_TABLE_SIZE_LOG2]; +}; + +struct kho_radix_leaf { + DECLARE_BITMAP(bitmap, 1 << KHO_BITMAP_SIZE_LOG2); +}; + +#endif /* _LINUX_KHO_ABI_RADIX_TREE_H */ diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho/radix_tree.h similarity index 96% rename from include/linux/kho_radix_tree.h rename to include/linux/kho/radix_tree.h index 84e918b96e53..1e337e73deba 100644 --- a/include/linux/kho_radix_tree.h +++ b/include/linux/kho/radix_tree.h @@ -5,6 +5,7 @@ #include #include +#include #include #include @@ -24,11 +25,9 @@ * Client code is responsible for allocating the root node of the tree, * initializing the mutex lock, and managing its lifecycle. It must use the * tree data structures defined in the KHO ABI, - * `include/linux/kho/abi/kexec_handover.h`. + * `include/linux/kho/abi/radix_tree.h`. 
*/ -struct kho_radix_node; - struct kho_radix_tree { struct kho_radix_node *root; struct mutex lock; /* protects the tree's structure and root pointer */ diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 041efff7ca11..4a3d6a54a17f 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -16,7 +16,7 @@ #include #include #include -#include +#include #include #include #include diff --git a/kernel/liveupdate/kho_radix.c b/kernel/liveupdate/kho_radix.c index c836783a1376..f48088847264 100644 --- a/kernel/liveupdate/kho_radix.c +++ b/kernel/liveupdate/kho_radix.c @@ -11,7 +11,7 @@ #include #include #include -#include +#include #include #include #include -- 2.53.0 From pasha.tatashin at soleen.com Thu Jun 4 20:32:29 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 5 Jun 2026 03:32:29 +0000 Subject: [RFC v1 3/9] kho: split out vmalloc preservation into kho_vmalloc.c In-Reply-To: <20260605033235.717351-1-pasha.tatashin@soleen.com> References: <20260605033235.717351-1-pasha.tatashin@soleen.com> Message-ID: <20260605033235.717351-4-pasha.tatashin@soleen.com> Move the vmalloc serialization and preservation implementation out of the core KHO code into its own dedicated file (kho_vmalloc.c). This is a pure code movement patch; no logic or functional changes are introduced. 
Signed-off-by: Pasha Tatashin --- Documentation/core-api/kho/index.rst | 3 + kernel/liveupdate/Makefile | 3 +- kernel/liveupdate/kexec_handover.c | 258 +------------------------ kernel/liveupdate/kho_vmalloc.c | 274 +++++++++++++++++++++++++++ lib/test_kho.c | 1 + mm/memfd_luo.c | 1 + 6 files changed, 282 insertions(+), 258 deletions(-) create mode 100644 kernel/liveupdate/kho_vmalloc.c diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst index f69367d217cf..a10b10700fb9 100644 --- a/Documentation/core-api/kho/index.rst +++ b/Documentation/core-api/kho/index.rst @@ -86,6 +86,9 @@ Public API .. kernel-doc:: kernel/liveupdate/kho_radix.c :export: +.. kernel-doc:: kernel/liveupdate/kho_vmalloc.c + :export: + KHO Serialization Blocks API ============================ diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile index a3ee8a5c27a2..b481e21a311a 100644 --- a/kernel/liveupdate/Makefile +++ b/kernel/liveupdate/Makefile @@ -9,7 +9,8 @@ luo-y := \ kho-y := \ kexec_handover.o \ - kho_radix.o + kho_radix.o \ + kho_vmalloc.o obj-$(CONFIG_KEXEC_HANDOVER) += kho.o obj-$(CONFIG_KEXEC_HANDOVER_DEBUG) += kexec_handover_debug.o diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 4a3d6a54a17f..6672bc168e57 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -13,7 +13,6 @@ #include #include #include -#include #include #include #include @@ -23,11 +22,7 @@ #include #include #include -#include #include -#include - -#include /* * KHO is tightly coupled with mm init and needs access to some of mm @@ -84,6 +79,7 @@ static struct kho_out kho_out = { }; + /* For physically contiguous 0-order pages. 
*/ static void kho_init_pages(struct page *page, unsigned long nr_pages) { @@ -702,259 +698,7 @@ void kho_unpreserve_pages(struct page *page, unsigned long nr_pages) } EXPORT_SYMBOL_GPL(kho_unpreserve_pages); -/* vmalloc flags KHO supports */ -#define KHO_VMALLOC_SUPPORTED_FLAGS (VM_ALLOC | VM_ALLOW_HUGE_VMAP) - -/* KHO internal flags for vmalloc preservations */ -#define KHO_VMALLOC_ALLOC 0x0001 -#define KHO_VMALLOC_HUGE_VMAP 0x0002 - -static unsigned short vmalloc_flags_to_kho(unsigned int vm_flags) -{ - unsigned short kho_flags = 0; - - if (vm_flags & VM_ALLOC) - kho_flags |= KHO_VMALLOC_ALLOC; - if (vm_flags & VM_ALLOW_HUGE_VMAP) - kho_flags |= KHO_VMALLOC_HUGE_VMAP; - - return kho_flags; -} - -static unsigned int kho_flags_to_vmalloc(unsigned short kho_flags) -{ - unsigned int vm_flags = 0; - - if (kho_flags & KHO_VMALLOC_ALLOC) - vm_flags |= VM_ALLOC; - if (kho_flags & KHO_VMALLOC_HUGE_VMAP) - vm_flags |= VM_ALLOW_HUGE_VMAP; - - return vm_flags; -} - -static struct kho_vmalloc_chunk *new_vmalloc_chunk(struct kho_vmalloc_chunk *cur) -{ - struct kho_vmalloc_chunk *chunk; - int err; - - chunk = (struct kho_vmalloc_chunk *)get_zeroed_page(GFP_KERNEL); - if (!chunk) - return NULL; - - err = kho_preserve_pages(virt_to_page(chunk), 1); - if (err) - goto err_free; - if (cur) - KHOSER_STORE_PTR(cur->hdr.next, chunk); - return chunk; - -err_free: - free_page((unsigned long)chunk); - return NULL; -} - -static void kho_vmalloc_unpreserve_chunk(struct kho_vmalloc_chunk *chunk, - unsigned short order) -{ - struct kho_radix_tree *tree = &kho_out.radix_tree; - unsigned long pfn = PHYS_PFN(virt_to_phys(chunk)); - - __kho_unpreserve(tree, pfn, pfn + 1); - - for (int i = 0; i < ARRAY_SIZE(chunk->phys) && chunk->phys[i]; i++) { - pfn = PHYS_PFN(chunk->phys[i]); - __kho_unpreserve(tree, pfn, pfn + (1 << order)); - } -} - -/** - * kho_preserve_vmalloc - preserve memory allocated with vmalloc() across kexec - * @ptr: pointer to the area in vmalloc address space - * @preservation: 
placeholder for preservation metadata - * - * Instructs KHO to preserve the area in vmalloc address space at @ptr. The - * physical pages mapped at @ptr will be preserved and on successful return - * @preservation will hold the physical address of a structure that describes - * the preservation. - * - * NOTE: The memory allocated with vmalloc_node() variants cannot be reliably - * restored on the same node - * - * Return: 0 on success, error code on failure - */ -int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation) -{ - struct kho_vmalloc_chunk *chunk; - struct vm_struct *vm = find_vm_area(ptr); - unsigned int order, flags, nr_contig_pages; - unsigned int idx = 0; - int err; - - if (!vm) - return -EINVAL; - - if (vm->flags & ~KHO_VMALLOC_SUPPORTED_FLAGS) - return -EOPNOTSUPP; - - flags = vmalloc_flags_to_kho(vm->flags); - order = get_vm_area_page_order(vm); - - chunk = new_vmalloc_chunk(NULL); - if (!chunk) - return -ENOMEM; - KHOSER_STORE_PTR(preservation->first, chunk); - - nr_contig_pages = (1 << order); - for (int i = 0; i < vm->nr_pages; i += nr_contig_pages) { - phys_addr_t phys = page_to_phys(vm->pages[i]); - - err = kho_preserve_pages(vm->pages[i], nr_contig_pages); - if (err) - goto err_free; - - chunk->phys[idx++] = phys; - if (idx == ARRAY_SIZE(chunk->phys)) { - chunk = new_vmalloc_chunk(chunk); - if (!chunk) { - err = -ENOMEM; - goto err_free; - } - idx = 0; - } - } - - preservation->total_pages = vm->nr_pages; - preservation->flags = flags; - preservation->order = order; - - return 0; - -err_free: - kho_unpreserve_vmalloc(preservation); - return err; -} -EXPORT_SYMBOL_GPL(kho_preserve_vmalloc); - -/** - * kho_unpreserve_vmalloc - unpreserve memory allocated with vmalloc() - * @preservation: preservation metadata returned by kho_preserve_vmalloc() - * - * Instructs KHO to unpreserve the area in vmalloc address space that was - * previously preserved with kho_preserve_vmalloc(). 
- */ -void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) -{ - struct kho_vmalloc_chunk *chunk = KHOSER_LOAD_PTR(preservation->first); - - while (chunk) { - struct kho_vmalloc_chunk *tmp = chunk; - - kho_vmalloc_unpreserve_chunk(chunk, preservation->order); - - chunk = KHOSER_LOAD_PTR(chunk->hdr.next); - free_page((unsigned long)tmp); - } -} -EXPORT_SYMBOL_GPL(kho_unpreserve_vmalloc); - -/** - * kho_restore_vmalloc - recreates and populates an area in vmalloc address - * space from the preserved memory. - * @preservation: preservation metadata. - * - * Recreates an area in vmalloc address space and populates it with memory that - * was preserved using kho_preserve_vmalloc(). - * - * Return: pointer to the area in the vmalloc address space, NULL on failure. - */ -void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) -{ - struct kho_vmalloc_chunk *chunk = KHOSER_LOAD_PTR(preservation->first); - kasan_vmalloc_flags_t kasan_flags = KASAN_VMALLOC_PROT_NORMAL; - unsigned int align, order, shift, vm_flags; - unsigned long total_pages, contig_pages; - unsigned long addr, size; - struct vm_struct *area; - struct page **pages; - unsigned int idx = 0; - int err; - - vm_flags = kho_flags_to_vmalloc(preservation->flags); - if (vm_flags & ~KHO_VMALLOC_SUPPORTED_FLAGS) - return NULL; - - total_pages = preservation->total_pages; - pages = kvmalloc_objs(*pages, total_pages); - if (!pages) - return NULL; - order = preservation->order; - contig_pages = (1 << order); - shift = PAGE_SHIFT + order; - align = 1 << shift; - - while (chunk) { - struct page *page; - - for (int i = 0; i < ARRAY_SIZE(chunk->phys) && chunk->phys[i]; i++) { - phys_addr_t phys = chunk->phys[i]; - - if (idx + contig_pages > total_pages) - goto err_free_pages_array; - - page = kho_restore_pages(phys, contig_pages); - if (!page) - goto err_free_pages_array; - - for (int j = 0; j < contig_pages; j++) - pages[idx++] = page + j; - - phys += contig_pages * PAGE_SIZE; - } - - page = 
kho_restore_pages(virt_to_phys(chunk), 1); - if (!page) - goto err_free_pages_array; - chunk = KHOSER_LOAD_PTR(chunk->hdr.next); - __free_page(page); - } - - if (idx != total_pages) - goto err_free_pages_array; - - area = __get_vm_area_node(total_pages * PAGE_SIZE, align, shift, - vm_flags | VM_UNINITIALIZED, - VMALLOC_START, VMALLOC_END, - NUMA_NO_NODE, GFP_KERNEL, - __builtin_return_address(0)); - if (!area) - goto err_free_pages_array; - - addr = (unsigned long)area->addr; - size = get_vm_area_size(area); - err = vmap_pages_range(addr, addr + size, PAGE_KERNEL, pages, shift); - if (err) - goto err_free_vm_area; - area->nr_pages = total_pages; - area->pages = pages; - - if (vm_flags & VM_ALLOC) - kasan_flags |= KASAN_VMALLOC_VM_ALLOC; - - area->addr = kasan_unpoison_vmalloc(area->addr, total_pages * PAGE_SIZE, - kasan_flags); - clear_vm_uninitialized_flag(area); - - return area->addr; - -err_free_vm_area: - free_vm_area(area); -err_free_pages_array: - kvfree(pages); - return NULL; -} -EXPORT_SYMBOL_GPL(kho_restore_vmalloc); /** * kho_alloc_preserve - Allocate, zero, and preserve memory. 
diff --git a/kernel/liveupdate/kho_vmalloc.c b/kernel/liveupdate/kho_vmalloc.c new file mode 100644 index 000000000000..84c17b7a81ae --- /dev/null +++ b/kernel/liveupdate/kho_vmalloc.c @@ -0,0 +1,274 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * kho_vmalloc.c - KHO vmalloc space serialization/preservation + * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport + * Copyright (C) 2025 Pasha Tatashin + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + + +#include "../../mm/internal.h" +#include "kexec_handover_internal.h" + +/* vmalloc flags KHO supports */ +#define KHO_VMALLOC_SUPPORTED_FLAGS (VM_ALLOC | VM_ALLOW_HUGE_VMAP) + +/* KHO internal flags for vmalloc preservations */ +#define KHO_VMALLOC_ALLOC 0x0001 +#define KHO_VMALLOC_HUGE_VMAP 0x0002 + +static unsigned short vmalloc_flags_to_kho(unsigned int vm_flags) +{ + unsigned short kho_flags = 0; + + if (vm_flags & VM_ALLOC) + kho_flags |= KHO_VMALLOC_ALLOC; + if (vm_flags & VM_ALLOW_HUGE_VMAP) + kho_flags |= KHO_VMALLOC_HUGE_VMAP; + + return kho_flags; +} + +static unsigned int kho_flags_to_vmalloc(unsigned short kho_flags) +{ + unsigned int vm_flags = 0; + + if (kho_flags & KHO_VMALLOC_ALLOC) + vm_flags |= VM_ALLOC; + if (kho_flags & KHO_VMALLOC_HUGE_VMAP) + vm_flags |= VM_ALLOW_HUGE_VMAP; + + return vm_flags; +} + +static struct kho_vmalloc_chunk *new_vmalloc_chunk(struct kho_vmalloc_chunk *cur) +{ + struct kho_vmalloc_chunk *chunk; + int err; + + chunk = (struct kho_vmalloc_chunk *)get_zeroed_page(GFP_KERNEL); + if (!chunk) + return NULL; + + err = kho_preserve_pages(virt_to_page(chunk), 1); + if (err) + goto err_free; + if (cur) + KHOSER_STORE_PTR(cur->hdr.next, chunk); + return chunk; + +err_free: + free_page((unsigned long)chunk); + return NULL; +} + +static void kho_vmalloc_unpreserve_chunk(struct kho_vmalloc_chunk *chunk, + unsigned short order) +{ + unsigned long pfn = PHYS_PFN(virt_to_phys(chunk)); + + 
kho_unpreserve_pages(pfn_to_page(pfn), 1); + + for (int i = 0; i < ARRAY_SIZE(chunk->phys) && chunk->phys[i]; i++) { + pfn = PHYS_PFN(chunk->phys[i]); + kho_unpreserve_pages(pfn_to_page(pfn), 1 << order); + } +} + +/** + * kho_preserve_vmalloc - preserve memory allocated with vmalloc() across kexec + * @ptr: pointer to the area in vmalloc address space + * @preservation: placeholder for preservation metadata + * + * Instructs KHO to preserve the area in vmalloc address space at @ptr. The + * physical pages mapped at @ptr will be preserved and on successful return + * @preservation will hold the physical address of a structure that describes + * the preservation. + * + * NOTE: The memory allocated with vmalloc_node() variants cannot be reliably + * restored on the same node + * + * Return: 0 on success, error code on failure + */ +int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation) +{ + struct kho_vmalloc_chunk *chunk; + struct vm_struct *vm = find_vm_area(ptr); + unsigned int order, flags, nr_contig_pages; + unsigned int idx = 0; + int err; + + if (!vm) + return -EINVAL; + + if (vm->flags & ~KHO_VMALLOC_SUPPORTED_FLAGS) + return -EOPNOTSUPP; + + flags = vmalloc_flags_to_kho(vm->flags); + order = get_vm_area_page_order(vm); + + chunk = new_vmalloc_chunk(NULL); + if (!chunk) + return -ENOMEM; + KHOSER_STORE_PTR(preservation->first, chunk); + + nr_contig_pages = (1 << order); + for (int i = 0; i < vm->nr_pages; i += nr_contig_pages) { + phys_addr_t phys = page_to_phys(vm->pages[i]); + + err = kho_preserve_pages(vm->pages[i], nr_contig_pages); + if (err) + goto err_free; + + chunk->phys[idx++] = phys; + if (idx == ARRAY_SIZE(chunk->phys)) { + chunk = new_vmalloc_chunk(chunk); + if (!chunk) { + err = -ENOMEM; + goto err_free; + } + idx = 0; + } + } + + preservation->total_pages = vm->nr_pages; + preservation->flags = flags; + preservation->order = order; + + return 0; + +err_free: + kho_unpreserve_vmalloc(preservation); + return err; +} 
+EXPORT_SYMBOL_GPL(kho_preserve_vmalloc); + +/** + * kho_unpreserve_vmalloc - unpreserve memory allocated with vmalloc() + * @preservation: preservation metadata returned by kho_preserve_vmalloc() + * + * Instructs KHO to unpreserve the area in vmalloc address space that was + * previously preserved with kho_preserve_vmalloc(). + */ +void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) +{ + struct kho_vmalloc_chunk *chunk = KHOSER_LOAD_PTR(preservation->first); + + while (chunk) { + struct kho_vmalloc_chunk *tmp = chunk; + + kho_vmalloc_unpreserve_chunk(chunk, preservation->order); + + chunk = KHOSER_LOAD_PTR(chunk->hdr.next); + free_page((unsigned long)tmp); + } +} +EXPORT_SYMBOL_GPL(kho_unpreserve_vmalloc); + +/** + * kho_restore_vmalloc - recreates and populates an area in vmalloc address + * space from the preserved memory. + * @preservation: preservation metadata. + * + * Recreates an area in vmalloc address space and populates it with memory that + * was preserved using kho_preserve_vmalloc(). + * + * Return: pointer to the area in the vmalloc address space, NULL on failure. 
+ */ +void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) +{ + struct kho_vmalloc_chunk *chunk = KHOSER_LOAD_PTR(preservation->first); + kasan_vmalloc_flags_t kasan_flags = KASAN_VMALLOC_PROT_NORMAL; + unsigned int align, order, shift, vm_flags; + unsigned long total_pages, contig_pages; + unsigned long addr, size; + struct vm_struct *area; + struct page **pages; + unsigned int idx = 0; + int err; + + vm_flags = kho_flags_to_vmalloc(preservation->flags); + if (vm_flags & ~KHO_VMALLOC_SUPPORTED_FLAGS) + return NULL; + + total_pages = preservation->total_pages; + pages = kvmalloc_objs(*pages, total_pages); + if (!pages) + return NULL; + order = preservation->order; + contig_pages = (1 << order); + shift = PAGE_SHIFT + order; + align = 1 << shift; + + while (chunk) { + struct page *page; + + for (int i = 0; i < ARRAY_SIZE(chunk->phys) && chunk->phys[i]; i++) { + phys_addr_t phys = chunk->phys[i]; + + if (idx + contig_pages > total_pages) + goto err_free_pages_array; + + page = kho_restore_pages(phys, contig_pages); + if (!page) + goto err_free_pages_array; + + for (int j = 0; j < contig_pages; j++) + pages[idx++] = page + j; + + phys += contig_pages * PAGE_SIZE; + } + + page = kho_restore_pages(virt_to_phys(chunk), 1); + if (!page) + goto err_free_pages_array; + chunk = KHOSER_LOAD_PTR(chunk->hdr.next); + __free_page(page); + } + + if (idx != total_pages) + goto err_free_pages_array; + + area = __get_vm_area_node(total_pages * PAGE_SIZE, align, shift, + vm_flags | VM_UNINITIALIZED, + VMALLOC_START, VMALLOC_END, + NUMA_NO_NODE, GFP_KERNEL, + __builtin_return_address(0)); + if (!area) + goto err_free_pages_array; + + addr = (unsigned long)area->addr; + size = get_vm_area_size(area); + err = vmap_pages_range(addr, addr + size, PAGE_KERNEL, pages, shift); + if (err) + goto err_free_vm_area; + + area->nr_pages = total_pages; + area->pages = pages; + + if (vm_flags & VM_ALLOC) + kasan_flags |= KASAN_VMALLOC_VM_ALLOC; + + area->addr = 
kasan_unpoison_vmalloc(area->addr, total_pages * PAGE_SIZE, + kasan_flags); + clear_vm_uninitialized_flag(area); + + return area->addr; + +err_free_vm_area: + free_vm_area(area); +err_free_pages_array: + kvfree(pages); + return NULL; +} +EXPORT_SYMBOL_GPL(kho_restore_vmalloc); diff --git a/lib/test_kho.c b/lib/test_kho.c index aa6a0956bb8b..6907e09688dd 100644 --- a/lib/test_kho.c +++ b/lib/test_kho.c @@ -20,6 +20,7 @@ #include #include #include +#include #include diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c index 59de210bee5f..ade2aa24c7b8 100644 --- a/mm/memfd_luo.c +++ b/mm/memfd_luo.c @@ -76,6 +76,7 @@ #include #include #include +#include #include #include #include -- 2.53.0 From pasha.tatashin at soleen.com Thu Jun 4 20:32:30 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 5 Jun 2026 03:32:30 +0000 Subject: [RFC v1 4/9] kho: split vmalloc headers out of kexec_handover.h In-Reply-To: <20260605033235.717351-1-pasha.tatashin@soleen.com> References: <20260605033235.717351-1-pasha.tatashin@soleen.com> Message-ID: <20260605033235.717351-5-pasha.tatashin@soleen.com> Split the vmalloc-related ABI definitions and header declarations out of the monolithic kexec_handover.h header into a dedicated header file (vmalloc.h). This is a pure code movement patch; no logic or functional changes are introduced. 
Signed-off-by: Pasha Tatashin --- Documentation/core-api/kho/abi.rst | 3 +- include/linux/kexec_handover.h | 18 ----- include/linux/kho/abi/kexec_handover.h | 77 +------------------- include/linux/kho/abi/memfd.h | 3 +- include/linux/kho/abi/vmalloc.h | 99 ++++++++++++++++++++++++++ include/linux/kho/vmalloc.h | 34 +++++++++ 6 files changed, 137 insertions(+), 97 deletions(-) create mode 100644 include/linux/kho/abi/vmalloc.h create mode 100644 include/linux/kho/vmalloc.h diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst index da5c6636bb17..b61363679829 100644 --- a/Documentation/core-api/kho/abi.rst +++ b/Documentation/core-api/kho/abi.rst @@ -13,8 +13,7 @@ Core Kexec Handover ABI vmalloc preservation ABI ======================== -.. kernel-doc:: include/linux/kho/abi/kexec_handover.h - :doc: Kexec Handover ABI for vmalloc Preservation +.. kernel-doc:: include/linux/kho/abi/vmalloc.h memblock preservation ABI ========================= diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index 8968c56d2d73..518fdab2a4d1 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -11,8 +11,6 @@ struct kho_scratch { phys_addr_t size; }; -struct kho_vmalloc; - struct folio; struct page; @@ -24,14 +22,11 @@ int kho_preserve_folio(struct folio *folio); void kho_unpreserve_folio(struct folio *folio); int kho_preserve_pages(struct page *page, unsigned long nr_pages); void kho_unpreserve_pages(struct page *page, unsigned long nr_pages); -int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation); -void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation); void *kho_alloc_preserve(size_t size); void kho_unpreserve_free(void *mem); void kho_restore_free(void *mem); struct folio *kho_restore_folio(phys_addr_t phys); struct page *kho_restore_pages(phys_addr_t phys, unsigned long nr_pages); -void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); int 
kho_add_subtree(const char *name, void *blob, size_t size); void kho_remove_subtree(void *blob); int kho_retrieve_subtree(const char *name, phys_addr_t *phys, size_t *size); @@ -65,14 +60,6 @@ static inline int kho_preserve_pages(struct page *page, unsigned int nr_pages) static inline void kho_unpreserve_pages(struct page *page, unsigned int nr_pages) { } -static inline int kho_preserve_vmalloc(void *ptr, - struct kho_vmalloc *preservation) -{ - return -EOPNOTSUPP; -} - -static inline void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) { } - static inline void *kho_alloc_preserve(size_t size) { return ERR_PTR(-EOPNOTSUPP); @@ -92,11 +79,6 @@ static inline struct page *kho_restore_pages(phys_addr_t phys, return NULL; } -static inline void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) -{ - return NULL; -} - static inline int kho_add_subtree(const char *name, void *blob, size_t size) { return -EOPNOTSUPP; diff --git a/include/linux/kho/abi/kexec_handover.h b/include/linux/kho/abi/kexec_handover.h index 99e4a53d4e35..c893b5045078 100644 --- a/include/linux/kho/abi/kexec_handover.h +++ b/include/linux/kho/abi/kexec_handover.h @@ -96,80 +96,5 @@ /* The FDT property for the size of preserved data blobs. */ #define KHO_SUB_TREE_SIZE_PROP_NAME "blob-size" -/** - * DOC: Kexec Handover ABI for vmalloc Preservation - * - * The Kexec Handover ABI for preserving vmalloc'ed memory is defined by - * a set of structures and helper macros. The layout of these structures is a - * stable contract between kernels and is versioned by the KHO_FDT_COMPATIBLE - * string. - * - * The preservation is managed through a main descriptor &struct kho_vmalloc, - * which points to a linked list of &struct kho_vmalloc_chunk structures. These - * chunks contain the physical addresses of the preserved pages, allowing the - * next kernel to reconstruct the vmalloc area with the same content and layout. 
- * Helper macros are also defined for storing and loading pointers within - * these structures. - */ - -/* Helper macro to define a union for a serializable pointer. */ -#define DECLARE_KHOSER_PTR(name, type) \ - union { \ - u64 phys; \ - type ptr; \ - } name - -/* Stores the physical address of a serializable pointer. */ -#define KHOSER_STORE_PTR(dest, val) \ - ({ \ - typeof(val) v = val; \ - typecheck(typeof((dest).ptr), v); \ - (dest).phys = virt_to_phys(v); \ - }) - -/* Loads the stored physical address back to a pointer. */ -#define KHOSER_LOAD_PTR(src) \ - ({ \ - typeof(src) s = src; \ - (typeof((s).ptr))((s).phys ? phys_to_virt((s).phys) : NULL); \ - }) - -/* - * This header is embedded at the beginning of each `kho_vmalloc_chunk` - * and contains a pointer to the next chunk in the linked list, - * stored as a physical address for handover. - */ -struct kho_vmalloc_hdr { - DECLARE_KHOSER_PTR(next, struct kho_vmalloc_chunk *); -}; - -#define KHO_VMALLOC_SIZE \ - ((PAGE_SIZE - sizeof(struct kho_vmalloc_hdr)) / \ - sizeof(u64)) - -/* - * Each chunk is a single page and is part of a linked list that describes - * a preserved vmalloc area. It contains the header with the link to the next - * chunk and a zero terminated array of physical addresses of the pages that - * make up the preserved vmalloc area. - */ -struct kho_vmalloc_chunk { - struct kho_vmalloc_hdr hdr; - u64 phys[KHO_VMALLOC_SIZE]; -}; - -static_assert(sizeof(struct kho_vmalloc_chunk) == PAGE_SIZE); - -/* - * Describes a preserved vmalloc memory area, including the - * total number of pages, allocation flags, page order, and a pointer to the - * first chunk of physical page addresses. 
- */ -struct kho_vmalloc { - DECLARE_KHOSER_PTR(first, struct kho_vmalloc_chunk *); - unsigned int total_pages; - unsigned short flags; - unsigned short order; -}; - #endif /* _LINUX_KHO_ABI_KEXEC_HANDOVER_H */ + diff --git a/include/linux/kho/abi/memfd.h b/include/linux/kho/abi/memfd.h index 08b10fea2afc..af310c0c9fdf 100644 --- a/include/linux/kho/abi/memfd.h +++ b/include/linux/kho/abi/memfd.h @@ -11,8 +11,9 @@ #ifndef _LINUX_KHO_ABI_MEMFD_H #define _LINUX_KHO_ABI_MEMFD_H -#include #include +#include +#include /** * DOC: memfd Live Update ABI diff --git a/include/linux/kho/abi/vmalloc.h b/include/linux/kho/abi/vmalloc.h new file mode 100644 index 000000000000..87650e1dd774 --- /dev/null +++ b/include/linux/kho/abi/vmalloc.h @@ -0,0 +1,99 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport + * Copyright (C) 2025 Pasha Tatashin + */ + +/** + * DOC: Kexec Handover ABI for vmalloc Preservation + * + * The Kexec Handover ABI for preserving vmalloc'ed memory is defined by + * a set of structures and helper macros. The layout of these structures is a + * stable contract between kernels and is versioned by the KHO_FDT_COMPATIBLE + * string. + * + * This interface is a contract. Any modification to the structure fields, + * compatible strings, or the layout of the serialization structures defined + * here constitutes a breaking change. Such changes require incrementing the + * version number in the `KHO_FDT_COMPATIBLE` string to prevent a new kernel + * from misinterpreting data from an old kernel. + * + * Changes are allowed provided the compatibility version is incremented; + * however, backward/forward compatibility is only guaranteed for kernels + * supporting the same ABI version. + * + * The preservation is managed through a main descriptor &struct kho_vmalloc, + * which points to a linked list of &struct kho_vmalloc_chunk structures. 
These + * chunks contain the physical addresses of the preserved pages, allowing the + * next kernel to reconstruct the vmalloc area with the same content and layout. + * Helper macros are also defined for storing and loading pointers within + * these structures. + */ + +#ifndef _LINUX_KHO_ABI_VMALLOC_H +#define _LINUX_KHO_ABI_VMALLOC_H + +#include +#include + +/* Helper macro to define a union for a serializable pointer. */ +#define DECLARE_KHOSER_PTR(name, type) \ + union { \ + u64 phys; \ + type ptr; \ + } name + +/* Stores the physical address of a serializable pointer. */ +#define KHOSER_STORE_PTR(dest, val) \ + ({ \ + typeof(val) v = val; \ + typecheck(typeof((dest).ptr), v); \ + (dest).phys = virt_to_phys(v); \ + }) + +/* Loads the stored physical address back to a pointer. */ +#define KHOSER_LOAD_PTR(src) \ + ({ \ + typeof(src) s = src; \ + (typeof((s).ptr))((s).phys ? phys_to_virt((s).phys) : NULL); \ + }) + +/* + * This header is embedded at the beginning of each `kho_vmalloc_chunk` + * and contains a pointer to the next chunk in the linked list, + * stored as a physical address for handover. + */ +struct kho_vmalloc_hdr { + DECLARE_KHOSER_PTR(next, struct kho_vmalloc_chunk *); +}; + +#define KHO_VMALLOC_SIZE \ + ((PAGE_SIZE - sizeof(struct kho_vmalloc_hdr)) / \ + sizeof(u64)) + +/* + * Each chunk is a single page and is part of a linked list that describes + * a preserved vmalloc area. It contains the header with the link to the next + * chunk and a zero terminated array of physical addresses of the pages that + * make up the preserved vmalloc area. + */ +struct kho_vmalloc_chunk { + struct kho_vmalloc_hdr hdr; + u64 phys[KHO_VMALLOC_SIZE]; +}; + +static_assert(sizeof(struct kho_vmalloc_chunk) == PAGE_SIZE); + +/* + * Describes a preserved vmalloc memory area, including the + * total number of pages, allocation flags, page order, and a pointer to the + * first chunk of physical page addresses. 
+ */ +struct kho_vmalloc { + DECLARE_KHOSER_PTR(first, struct kho_vmalloc_chunk *); + unsigned int total_pages; + unsigned short flags; + unsigned short order; +}; + +#endif /* _LINUX_KHO_ABI_VMALLOC_H */ diff --git a/include/linux/kho/vmalloc.h b/include/linux/kho/vmalloc.h new file mode 100644 index 000000000000..2d1b5d282a93 --- /dev/null +++ b/include/linux/kho/vmalloc.h @@ -0,0 +1,34 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_KHO_VMALLOC_H +#define _LINUX_KHO_VMALLOC_H + +#include +#include +#include + +struct page; + +#ifdef CONFIG_KEXEC_HANDOVER + +int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation); +void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation); +void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); + +#else /* CONFIG_KEXEC_HANDOVER */ + +static inline int kho_preserve_vmalloc(void *ptr, + struct kho_vmalloc *preservation) +{ + return -EOPNOTSUPP; +} + +static inline void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) { } + +static inline void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) +{ + return NULL; +} + +#endif /* CONFIG_KEXEC_HANDOVER */ + +#endif /* _LINUX_KHO_VMALLOC_H */ -- 2.53.0 From pasha.tatashin at soleen.com Thu Jun 4 20:32:31 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 5 Jun 2026 03:32:31 +0000 Subject: [RFC v1 5/9] kho: move kho_block.h to kho/block.h In-Reply-To: <20260605033235.717351-1-pasha.tatashin@soleen.com> References: <20260605033235.717351-1-pasha.tatashin@soleen.com> Message-ID: <20260605033235.717351-6-pasha.tatashin@soleen.com> Move kho_block.h to kho/block.h, organizing it within the existing kho directory structure as more KHO data structures are introduced. This is a pure code movement patch; no logic or functional changes are introduced. 
Signed-off-by: Pasha Tatashin --- Documentation/core-api/kho/index.rst | 2 +- MAINTAINERS | 1 - include/linux/{kho_block.h => kho/block.h} | 2 +- kernel/liveupdate/kho_block.c | 2 +- kernel/liveupdate/luo_internal.h | 2 +- kernel/liveupdate/luo_session.c | 2 +- 6 files changed, 5 insertions(+), 6 deletions(-) rename include/linux/{kho_block.h => kho/block.h} (100%) diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst index a10b10700fb9..4a5477221fe4 100644 --- a/Documentation/core-api/kho/index.rst +++ b/Documentation/core-api/kho/index.rst @@ -95,7 +95,7 @@ KHO Serialization Blocks API .. kernel-doc:: kernel/liveupdate/kho_block.c :doc: KHO Serialization Blocks -.. kernel-doc:: include/linux/kho_block.h +.. kernel-doc:: include/linux/kho/block.h .. kernel-doc:: kernel/liveupdate/kho_block.c :internal: diff --git a/MAINTAINERS b/MAINTAINERS index 920ba7622afa..9ec290e38b44 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14208,7 +14208,6 @@ F: Documentation/admin-guide/mm/kho.rst F: Documentation/core-api/kho/* F: include/linux/kexec_handover.h F: include/linux/kho/ -F: include/linux/kho_block.h F: kernel/liveupdate/kexec_handover* F: lib/test_kho.c F: tools/testing/selftests/kho/ diff --git a/include/linux/kho_block.h b/include/linux/kho/block.h similarity index 100% rename from include/linux/kho_block.h rename to include/linux/kho/block.h index 93a7cc2be5f5..2b9d5a080a6a 100644 --- a/include/linux/kho_block.h +++ b/include/linux/kho/block.h @@ -7,9 +7,9 @@ #ifndef _LINUX_KHO_BLOCK_H #define _LINUX_KHO_BLOCK_H +#include #include #include -#include /** * struct kho_block - Internal representation of a serialization block. 
diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c
index 0d2a342ef422..6cedcd36bfd2 100644
--- a/kernel/liveupdate/kho_block.c
+++ b/kernel/liveupdate/kho_block.c
@@ -23,7 +23,7 @@
 #include
 #include
 #include
-#include
+#include
 #include
 
 /*
diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
index 64879ffe7378..349f6d141873 100644
--- a/kernel/liveupdate/luo_internal.h
+++ b/kernel/liveupdate/luo_internal.h
@@ -8,9 +8,9 @@
 #ifndef _LINUX_LUO_INTERNAL_H
 #define _LINUX_LUO_INTERNAL_H
 
+#include
 #include
 #include
-#include
 
 struct luo_ucmd {
 	void __user *ubuffer;
diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c
index b79b2a488974..01c0ccf09919 100644
--- a/kernel/liveupdate/luo_session.c
+++ b/kernel/liveupdate/luo_session.c
@@ -90,8 +90,8 @@
 #include
 #include
 #include
-#include
 #include
+#include
 #include
 #include
 #include
-- 
2.53.0

From pasha.tatashin at soleen.com Thu Jun 4 20:32:32 2026
From: pasha.tatashin at soleen.com (Pasha Tatashin)
Date: Fri, 5 Jun 2026 03:32:32 +0000
Subject: [RFC v1 6/9] kho: introduce compatibility helpers and decouple block version
In-Reply-To: <20260605033235.717351-1-pasha.tatashin@soleen.com>
References: <20260605033235.717351-1-pasha.tatashin@soleen.com>
Message-ID: <20260605033235.717351-7-pasha.tatashin@soleen.com>

Decouple the block compatibility string from the global KHO version.

Introduce a compatibility helper header (compat.h) defining utility macros
for constructing subsystem compatibility strings, specifically:
- KHO_SUB_COMPAT() to append sub-component compatibility strings using a
  semicolon separator.
- KHO_COMPAT_ALIGN() to align compatibility string sizes to 8-byte
  boundaries.

Define the individual block compatibility string "block-v1" in block.h,
and integrate it into the composite LUO compatibility string
(LUO_ABI_COMPATIBLE) via the new compatibility helpers.
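As a rough illustration of the two helpers described above, the following user-space sketch shows how adjacent string literals compose into the LUO compatibility string and what the aligned length comes out to. The macro bodies are copied from the patch; the ALIGN() stand-in and the luo_compat_matches() helper are assumptions made for this example and are not part of the series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* User-space stand-in for the kernel's ALIGN(); assumes 'a' is a power of two. */
#define ALIGN(x, a)		(((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/* Bodies as introduced by the patch. */
#define KHO_SUB_COMPAT(str)	";" str
#define KHO_COMPAT_ALIGN(str)	ALIGN(sizeof(str), 8)

#define KHO_BLOCK_COMPATIBLE	"block-v1"
#define LUO_ABI_COMPAT_BASE	"luo-v5"
/* Adjacent string literals concatenate to "luo-v5;block-v1". */
#define LUO_ABI_COMPATIBLE	LUO_ABI_COMPAT_BASE KHO_SUB_COMPAT(KHO_BLOCK_COMPATIBLE)
/* sizeof() counts the trailing NUL: 15 characters + 1 = 16, already 8-byte aligned. */
#define LUO_ABI_COMPAT_LEN	KHO_COMPAT_ALIGN(LUO_ABI_COMPATIBLE)

/* Hypothetical helper: exact match of an incoming string against ours. */
static int luo_compat_matches(const char *incoming)
{
	return strcmp(incoming, LUO_ABI_COMPATIBLE) == 0;
}
```

With this scheme a bump of the block format alone (for example to a hypothetical "block-v2") changes the composite string, so an exact-match comparison would reject state written with the older block layout even though the "luo-v5" base is unchanged.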
Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/block.h | 4 +++- include/linux/kho/abi/compat.h | 33 +++++++++++++++++++++++++++++++++ include/linux/kho/abi/luo.h | 8 ++++++-- 3 files changed, 42 insertions(+), 3 deletions(-) create mode 100644 include/linux/kho/abi/compat.h diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h index d06d64b963be..95d13cf677cf 100644 --- a/include/linux/kho/abi/block.h +++ b/include/linux/kho/abi/block.h @@ -14,7 +14,7 @@ * This interface is a contract. Any modification to the structure fields, * compatible strings, or the layout of the `__packed` serialization * structures defined here constitutes a breaking change. Such changes require - * incrementing the version number in the `KHO_FDT_COMPATIBLE` string to + * incrementing the version number in the `KHO_BLOCK_COMPATIBLE` string to * prevent a new kernel from misinterpreting data from an old kernel. * * Changes are allowed provided the compatibility version is incremented; @@ -28,6 +28,8 @@ #include #include +#define KHO_BLOCK_COMPATIBLE "block-v1" + /** * KHO_BLOCK_SIZE - The size of each serialization block. * diff --git a/include/linux/kho/abi/compat.h b/include/linux/kho/abi/compat.h new file mode 100644 index 000000000000..25edd964c390 --- /dev/null +++ b/include/linux/kho/abi/compat.h @@ -0,0 +1,33 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2026 Google LLC. + * Pasha Tatashin + */ + +#ifndef _LINUX_KHO_ABI_COMPAT_H +#define _LINUX_KHO_ABI_COMPAT_H + +#include + +/** + * KHO_SUB_COMPAT - Helper to append a sub-component compatibility string. + * @str: The compatibility string of the sub-component. + * + * Appends a KHO safe data structure compatibility string to a sub-system + * compatibility string using a semicolon ';' as a separator. + * + * NOTE: Sub-components MUST be added in strict alphabetical order to maintain + * a consistent and predictable compatibility string value. 
+ */
+#define KHO_SUB_COMPAT(str) ";" str
+
+/**
+ * KHO_COMPAT_ALIGN - Align a compatibility string size to 8 bytes.
+ * @str: The compatibility string.
+ *
+ * Aligns the size of a compatibility string to an 8-byte boundary for use
+ * in ABI structures.
+ */
+#define KHO_COMPAT_ALIGN(str) ALIGN(sizeof(str), 8)
+
+#endif /* _LINUX_KHO_ABI_COMPAT_H */
diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h
index 288076de6d4a..b502670cd2a6 100644
--- a/include/linux/kho/abi/luo.h
+++ b/include/linux/kho/abi/luo.h
@@ -58,6 +58,7 @@
 #define _LINUX_KHO_ABI_LUO_H
 
 #include
+#include
 #include
 #include
 
@@ -65,8 +66,11 @@
  * The LUO state is registered under this KHO entry name.
  */
 #define LUO_KHO_ENTRY_NAME "LUO"
-#define LUO_ABI_COMPATIBLE "luo-v5"
-#define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8)
+#define LUO_ABI_COMPAT_BASE "luo-v5"
+#define LUO_ABI_COMPATIBLE \
+	LUO_ABI_COMPAT_BASE \
+	KHO_SUB_COMPAT(KHO_BLOCK_COMPATIBLE)
+#define LUO_ABI_COMPAT_LEN KHO_COMPAT_ALIGN(LUO_ABI_COMPATIBLE)
 
 /**
  * struct luo_ser - Centralized LUO ABI header.
-- 
2.53.0

From pasha.tatashin at soleen.com Thu Jun 4 20:32:33 2026
From: pasha.tatashin at soleen.com (Pasha Tatashin)
Date: Fri, 5 Jun 2026 03:32:33 +0000
Subject: [RFC v1 7/9] kho: decouple radix tree compatibility from global KHO version
In-Reply-To: <20260605033235.717351-1-pasha.tatashin@soleen.com>
References: <20260605033235.717351-1-pasha.tatashin@soleen.com>
Message-ID: <20260605033235.717351-8-pasha.tatashin@soleen.com>

Decouple the kho radix tree compatibility version from the global KHO
compatibility version KHO_FDT_COMPATIBLE.

Define the independent compatibility version "radix-v1" for the radix tree
KHO_RADIX_COMPATIBLE in radix_tree.h. Integrate KHO_RADIX_COMPATIBLE into
the composite root compatibility string KHO_FDT_COMPATIBLE.

Additionally, document the new KHO Compatibility ABI under the
Documentation/core-api/kho/abi.rst section.
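To make the composite root string concrete, here is a small user-space sketch of what KHO_FDT_COMPATIBLE expands to after this patch, together with a way to ask whether a given sub-component is present. The compat_has_sub() helper is hypothetical and exists only for illustration; the series itself does not add such a function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Macro bodies as they appear in the series. */
#define KHO_SUB_COMPAT(str)	";" str
#define KHO_RADIX_COMPATIBLE	"radix-v1"
#define KHO_FDT_COMPAT_BASE	"kho-v4"
/* Expands to the single literal "kho-v4;radix-v1". */
#define KHO_FDT_COMPATIBLE	KHO_FDT_COMPAT_BASE KHO_SUB_COMPAT(KHO_RADIX_COMPATIBLE)

/*
 * Hypothetical helper: return true if @sub appears as a ';'-separated
 * sub-component of @compat. The base (text before the first ';') is
 * deliberately not treated as a sub-component.
 */
static bool compat_has_sub(const char *compat, const char *sub)
{
	size_t sub_len = strlen(sub);
	const char *p = strchr(compat, ';');

	while (p) {
		const char *tok = p + 1;
		const char *end = strchr(tok, ';');
		size_t len = end ? (size_t)(end - tok) : strlen(tok);

		if (len == sub_len && memcmp(tok, sub, len) == 0)
			return true;
		p = end;
	}
	return false;
}
```

The point of the split is that the radix tree layout can now be versioned on its own: changing only KHO_RADIX_COMPATIBLE alters the composite string without touching the "kho-v4" base.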
Signed-off-by: Pasha Tatashin --- Documentation/core-api/kho/abi.rst | 5 +++++ include/linux/kho/abi/kexec_handover.h | 10 ++++++---- include/linux/kho/abi/radix_tree.h | 4 +++- 3 files changed, 14 insertions(+), 5 deletions(-) diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst index b61363679829..6acdb7c85239 100644 --- a/Documentation/core-api/kho/abi.rst +++ b/Documentation/core-api/kho/abi.rst @@ -10,6 +10,11 @@ Core Kexec Handover ABI .. kernel-doc:: include/linux/kho/abi/kexec_handover.h :doc: Kexec Handover ABI +KHO Compatibility ABI +===================== + +.. kernel-doc:: include/linux/kho/abi/compat.h + vmalloc preservation ABI ======================== diff --git a/include/linux/kho/abi/kexec_handover.h b/include/linux/kho/abi/kexec_handover.h index c893b5045078..49ac4b47cc3d 100644 --- a/include/linux/kho/abi/kexec_handover.h +++ b/include/linux/kho/abi/kexec_handover.h @@ -8,9 +8,8 @@ #ifndef _LINUX_KHO_ABI_KEXEC_HANDOVER_H #define _LINUX_KHO_ABI_KEXEC_HANDOVER_H -#include - -#include +#include +#include /** * DOC: Kexec Handover ABI @@ -85,7 +84,10 @@ */ /* The compatible string for the KHO FDT root node. */ -#define KHO_FDT_COMPATIBLE "kho-v4" +#define KHO_FDT_COMPAT_BASE "kho-v4" +#define KHO_FDT_COMPATIBLE \ + KHO_FDT_COMPAT_BASE \ + KHO_SUB_COMPAT(KHO_RADIX_COMPATIBLE) /* The FDT property for the preserved memory map. */ #define KHO_FDT_MEMORY_MAP_PROP_NAME "preserved-memory-map" diff --git a/include/linux/kho/abi/radix_tree.h b/include/linux/kho/abi/radix_tree.h index f4cc5c02f37a..89cd7eb4a91d 100644 --- a/include/linux/kho/abi/radix_tree.h +++ b/include/linux/kho/abi/radix_tree.h @@ -20,7 +20,7 @@ * This interface is a contract. Any modification to the structure fields, * compatible strings, or the layout of the serialization structures defined * here constitutes a breaking change. 
Such changes require incrementing the - * version number in the `KHO_FDT_COMPATIBLE` string to prevent a new kernel + * version number in the `KHO_RADIX_COMPATIBLE` string to prevent a new kernel * from misinterpreting data from an old kernel. * * Changes are allowed provided the compatibility version is incremented; @@ -94,6 +94,8 @@ * 4KB. */ +#define KHO_RADIX_COMPATIBLE "radix-v1" + /* * Defines constants for the KHO radix tree structure, used to track preserved * memory. These constants govern the indexing, sizing, and depth of the tree. -- 2.53.0 From pasha.tatashin at soleen.com Thu Jun 4 20:32:34 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 5 Jun 2026 03:32:34 +0000 Subject: [RFC v1 8/9] kho: decouple vmalloc compatibility from global KHO version and update memfd In-Reply-To: <20260605033235.717351-1-pasha.tatashin@soleen.com> References: <20260605033235.717351-1-pasha.tatashin@soleen.com> Message-ID: <20260605033235.717351-9-pasha.tatashin@soleen.com> Decouple the vmalloc preservation compatibility version from the global KHO compatibility version KHO_FDT_COMPATIBLE. Define the independent compatibility version "vmalloc-v1" for vmalloc preservation KHO_VMALLOC_COMPATIBLE in vmalloc.h. Integrate KHO_VMALLOC_COMPATIBLE into the composite root compatibility string KHO_FDT_COMPATIBLE. Additionally, update the memfd compatibility string MEMFD_LUO_FH_COMPATIBLE to include the vmalloc compatibility dependency, and add a static assertion to verify its length limit. 
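The length check mentioned above can be sketched in user space as follows. The macro bodies mirror the patch; the ALIGN() stand-in is an assumption, and the value 48 for LIVEUPDATE_HNDL_COMPAT_LENGTH is assumed here purely for illustration, since the patch does not show its definition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* User-space stand-in for the kernel's ALIGN(); assumes 'a' is a power of two. */
#define ALIGN(x, a)		(((x) + ((a) - 1)) & ~((size_t)(a) - 1))

#define KHO_SUB_COMPAT(str)	";" str
#define KHO_COMPAT_ALIGN(str)	ALIGN(sizeof(str), 8)

#define KHO_RADIX_COMPATIBLE	"radix-v1"
#define KHO_VMALLOC_COMPATIBLE	"vmalloc-v1"

/* Root string after this patch: "kho-v4;radix-v1;vmalloc-v1". */
#define KHO_FDT_COMPAT_BASE	"kho-v4"
#define KHO_FDT_COMPATIBLE			\
	KHO_FDT_COMPAT_BASE			\
	KHO_SUB_COMPAT(KHO_RADIX_COMPATIBLE)	\
	KHO_SUB_COMPAT(KHO_VMALLOC_COMPATIBLE)

/* memfd handler string after this patch: "memfd-v2;vmalloc-v1". */
#define MEMFD_LUO_FH_COMPAT_BASE	"memfd-v2"
#define MEMFD_LUO_FH_COMPATIBLE			\
	MEMFD_LUO_FH_COMPAT_BASE		\
	KHO_SUB_COMPAT(KHO_VMALLOC_COMPATIBLE)

/* ASSUMPTION: illustrative limit only; the real value is defined elsewhere. */
#define LIVEUPDATE_HNDL_COMPAT_LENGTH	48

/* 19 characters + NUL = 20 bytes, rounded up to 24, which fits the assumed limit. */
_Static_assert(KHO_COMPAT_ALIGN(MEMFD_LUO_FH_COMPATIBLE) <=
	       LIVEUPDATE_HNDL_COMPAT_LENGTH,
	       "memfd compatibility string too long");
```

A compile-time check of this kind means that appending further sub-components to the memfd string fails the build once the handler's compatibility field would overflow, rather than silently truncating at run time.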
Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/kexec_handover.h | 4 +++- include/linux/kho/abi/memfd.h | 9 ++++++++- include/linux/kho/abi/vmalloc.h | 6 ++++-- 3 files changed, 15 insertions(+), 4 deletions(-) diff --git a/include/linux/kho/abi/kexec_handover.h b/include/linux/kho/abi/kexec_handover.h index 49ac4b47cc3d..f048bc95fed3 100644 --- a/include/linux/kho/abi/kexec_handover.h +++ b/include/linux/kho/abi/kexec_handover.h @@ -10,6 +10,7 @@ #define _LINUX_KHO_ABI_KEXEC_HANDOVER_H #include #include +#include /** * DOC: Kexec Handover ABI @@ -87,7 +88,8 @@ #define KHO_FDT_COMPAT_BASE "kho-v4" #define KHO_FDT_COMPATIBLE \ KHO_FDT_COMPAT_BASE \ - KHO_SUB_COMPAT(KHO_RADIX_COMPATIBLE) + KHO_SUB_COMPAT(KHO_RADIX_COMPATIBLE) \ + KHO_SUB_COMPAT(KHO_VMALLOC_COMPATIBLE) /* The FDT property for the preserved memory map. */ #define KHO_FDT_MEMORY_MAP_PROP_NAME "preserved-memory-map" diff --git a/include/linux/kho/abi/memfd.h b/include/linux/kho/abi/memfd.h index af310c0c9fdf..24ecbf48cbe1 100644 --- a/include/linux/kho/abi/memfd.h +++ b/include/linux/kho/abi/memfd.h @@ -11,7 +11,9 @@ #ifndef _LINUX_KHO_ABI_MEMFD_H #define _LINUX_KHO_ABI_MEMFD_H +#include #include +#include #include #include @@ -89,6 +91,11 @@ struct memfd_luo_ser { } __packed; /* The compatibility string for memfd file handler */ -#define MEMFD_LUO_FH_COMPATIBLE "memfd-v2" +#define MEMFD_LUO_FH_COMPAT_BASE "memfd-v2" +#define MEMFD_LUO_FH_COMPATIBLE \ + MEMFD_LUO_FH_COMPAT_BASE \ + KHO_SUB_COMPAT(KHO_VMALLOC_COMPATIBLE) + +static_assert(KHO_COMPAT_ALIGN(MEMFD_LUO_FH_COMPATIBLE) <= LIVEUPDATE_HNDL_COMPAT_LENGTH); #endif /* _LINUX_KHO_ABI_MEMFD_H */ diff --git a/include/linux/kho/abi/vmalloc.h b/include/linux/kho/abi/vmalloc.h index 87650e1dd774..1847b82e147b 100644 --- a/include/linux/kho/abi/vmalloc.h +++ b/include/linux/kho/abi/vmalloc.h @@ -9,13 +9,13 @@ * * The Kexec Handover ABI for preserving vmalloc'ed memory is defined by * a set of structures and helper macros. 
The layout of these structures is a - * stable contract between kernels and is versioned by the KHO_FDT_COMPATIBLE + * stable contract between kernels and is versioned by the KHO_VMALLOC_COMPATIBLE * string. * * This interface is a contract. Any modification to the structure fields, * compatible strings, or the layout of the serialization structures defined * here constitutes a breaking change. Such changes require incrementing the - * version number in the `KHO_FDT_COMPATIBLE` string to prevent a new kernel + * version number in the `KHO_VMALLOC_COMPATIBLE` string to prevent a new kernel * from misinterpreting data from an old kernel. * * Changes are allowed provided the compatibility version is incremented; @@ -36,6 +36,8 @@ #include #include +#define KHO_VMALLOC_COMPATIBLE "vmalloc-v1" + /* Helper macro to define a union for a serializable pointer. */ #define DECLARE_KHOSER_PTR(name, type) \ union { \ -- 2.53.0 From pasha.tatashin at soleen.com Thu Jun 4 20:32:35 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 5 Jun 2026 03:32:35 +0000 Subject: [RFC v1 9/9] liveupdate: add KUnit test to verify alphabetical order of compatibility strings In-Reply-To: <20260605033235.717351-1-pasha.tatashin@soleen.com> References: <20260605033235.717351-1-pasha.tatashin@soleen.com> Message-ID: <20260605033235.717351-10-pasha.tatashin@soleen.com> Introduce a KUnit test suite to verify that composite compatibility strings are formatted correctly, specifically ensuring that all KHO sub-component compatibility strings are unique and strictly sorted in alphabetical order. Maintaining alphabetical order in composite compatibility strings is required to guarantee consistent, predictable, and reproducible compatibility string representation across different system configurations. 
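The ordering rule can be illustrated with a user-space sketch that walks the sub-components without copying the string. This is not the code added by the patch (which uses strscpy() and strsep() on a local buffer); subs_sorted_unique() and tok_cmp() are names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Compare two tokens given as (pointer, length) pairs, strcmp()-style. */
static int tok_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
	size_t n = alen < blen ? alen : blen;
	int c = memcmp(a, b, n);

	if (c)
		return c;
	/* Equal prefix: the shorter token sorts first. */
	return (alen > blen) - (alen < blen);
}

/*
 * Return true if every ';'-separated sub-component after the base is
 * strictly greater than the one before it (sorted, no duplicates).
 * A string with no sub-components passes trivially.
 */
static bool subs_sorted_unique(const char *compat)
{
	const char *prev = NULL;
	size_t prev_len = 0;
	const char *p = strchr(compat, ';');

	while (p) {
		const char *tok = p + 1;
		const char *end = strchr(tok, ';');
		size_t len = end ? (size_t)(end - tok) : strlen(tok);

		if (prev && tok_cmp(prev, prev_len, tok, len) >= 0)
			return false;
		prev = tok;
		prev_len = len;
		p = end;
	}
	return true;
}
```

For instance, "kho-v4;radix-v1;vmalloc-v1" passes, while swapping the two sub-components or listing one of them twice is rejected. The base is skipped, so it does not take part in the ordering.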
The test suite validates: - KHO_FDT_COMPATIBLE (the root composite compatibility string) - LUO_ABI_COMPATIBLE (the LUO composite compatibility string) - MEMFD_LUO_FH_COMPATIBLE (the memfd file handler composite compatibility string) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/Kconfig | 15 ++++++ kernel/liveupdate/Makefile | 2 + kernel/liveupdate/liveupdate_test.c | 56 +++++++++++++++++++++++ tools/testing/selftests/liveupdate/config | 1 + 4 files changed, 74 insertions(+) create mode 100644 kernel/liveupdate/liveupdate_test.c diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig index c13af38ba23a..617a31dcee73 100644 --- a/kernel/liveupdate/Kconfig +++ b/kernel/liveupdate/Kconfig @@ -86,4 +86,19 @@ config LIVEUPDATE_MEMFD If unsure, say N. +config LIVEUPDATE_KUNIT_TEST + tristate "KUnit tests for LUO and KHO" if !KUNIT_ALL_TESTS + depends on KUNIT + depends on LIVEUPDATE + default KUNIT_ALL_TESTS + help + Enable KUnit tests for LUO and KHO. These tests verify that the + composite KHO, LUO, and memfd compatibility strings remain unique + and sorted alphabetically. + + For more information on KUnit and unit tests in general, please refer + to the KUnit documentation in Documentation/dev-tools/kunit/. + + If unsure, say N. + endmenu diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile index b481e21a311a..5e0deb85e1b1 100644 --- a/kernel/liveupdate/Makefile +++ b/kernel/liveupdate/Makefile @@ -17,3 +17,5 @@ obj-$(CONFIG_KEXEC_HANDOVER_DEBUG) += kexec_handover_debug.o obj-$(CONFIG_KEXEC_HANDOVER_DEBUGFS) += kexec_handover_debugfs.o obj-$(CONFIG_LIVEUPDATE) += luo.o +obj-$(CONFIG_LIVEUPDATE_KUNIT_TEST) += liveupdate_test.o + diff --git a/kernel/liveupdate/liveupdate_test.c b/kernel/liveupdate/liveupdate_test.c new file mode 100644 index 000000000000..15688d69735e --- /dev/null +++ b/kernel/liveupdate/liveupdate_test.c @@ -0,0 +1,56 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * KUnit test for Live Update compatibility strings. 
+ */ +#include +#include +#include +#include +#include + +/* Verify that compatibility sub-components are unique and sorted alphabetically */ +static bool is_alphabetical_unique(const char *compat) +{ + char buf[1024]; + char *string = buf; + char *token; + char *prev = NULL; + char *sub; + + strscpy(buf, compat, sizeof(buf)); + + sub = strchr(string, ';'); + if (!sub) + return true; + + sub++; + + while ((token = strsep(&sub, ";")) != NULL) { + if (prev && strcmp(prev, token) >= 0) + return false; + prev = token; + } + + return true; +} + +static void test_compatibility_alphabetical(struct kunit *test) +{ + KUNIT_EXPECT_TRUE(test, is_alphabetical_unique(KHO_FDT_COMPATIBLE)); + KUNIT_EXPECT_TRUE(test, is_alphabetical_unique(LUO_ABI_COMPATIBLE)); + KUNIT_EXPECT_TRUE(test, is_alphabetical_unique(MEMFD_LUO_FH_COMPATIBLE)); +} + +static struct kunit_case liveupdate_test_cases[] = { + KUNIT_CASE(test_compatibility_alphabetical), + {} +}; + +static struct kunit_suite liveupdate_test_suite = { + .name = "liveupdate-compatibility", + .test_cases = liveupdate_test_cases, +}; + +kunit_test_suite(liveupdate_test_suite); + +MODULE_LICENSE("GPL"); diff --git a/tools/testing/selftests/liveupdate/config b/tools/testing/selftests/liveupdate/config index 91d03f9a6a39..28c54bb473b8 100644 --- a/tools/testing/selftests/liveupdate/config +++ b/tools/testing/selftests/liveupdate/config @@ -6,6 +6,7 @@ CONFIG_KEXEC_HANDOVER_DEBUGFS=y CONFIG_KEXEC_HANDOVER_DEBUG=y CONFIG_LIVEUPDATE=y CONFIG_LIVEUPDATE_TEST=y +CONFIG_LIVEUPDATE_KUNIT_TEST=y CONFIG_MEMFD_CREATE=y CONFIG_TMPFS=y CONFIG_SHMEM=y -- 2.53.0 From praan at google.com Thu Jun 4 22:41:54 2026 From: praan at google.com (Pranjal Shrivastava) Date: Fri, 5 Jun 2026 05:41:54 +0000 Subject: [PATCH v6 01/12] PCI: liveupdate: Set up FLB handler for the PCI core In-Reply-To: <20260522202410.3104264-2-dmatlack@google.com> References: <20260522202410.3104264-1-dmatlack@google.com> <20260522202410.3104264-2-dmatlack@google.com> Message-ID: On 
Fri, May 22, 2026 at 08:23:59PM +0000, David Matlack wrote: > Set up a File-Lifecycle-Bound (FLB) handler for the PCI core to enable > it to participate in the preservation of PCI devices across Live Update. > Essentially, this commit enables the PCI core to allocate a struct > (struct pci_ser) and preserve it across a Live Update whenever at least > one device is preserved. > > Preserving PCI devices across Live Update is built on top of the Live > Update Orchestrator's (LUO) support for file preservation. Drivers are > expected to expose a file to userspace to represent a single PCI device > and support preservation of that file. This is intended primarily to > support preservation of PCI devices bound to VFIO drivers. > > This commit enables drivers to register their liveupdate_file_handler > with the PCI core so that the PCI core can do its own tracking and > enforcement of which devices are preserved. > > pci_liveupdate_register_flb(driver_file_handler); > pci_liveupdate_unregister_flb(driver_file_handler); > > When the first file (with a handler registered with the PCI core) is > preserved, the PCI core will be notified to allocate its tracking struct > (pci_ser). When the last file is unpreserved (i.e. preservation > cancelled) the PCI core will be notified to free struct pci_ser. > > This struct is preserved across a Live Update using KHO and can be > fetched by the PCI core during early boot (e.g. during device > enumeration) so that it knows which devices were preserved. > > Note: This commit only allocates struct pci_ser and preserves it across > Live Update. A subsequent commit will add an API for drivers to tell the > PCI core exactly which devices are being preserved. > > Note: There is no reason to check for kho_is_enabled() since it can be > assumed to return true. If KHO was not enabled then Live Update would > not be enabled and these routines would never run. > [...] > +/** > + * struct pci_dev_ser - Serialized state about a single PCI device. 
> + * > + * @domain: The device's PCI domain number (segment). > + * @bdf: The device's PCI bus, device, and function number. > + * @padding: Padding to naturally align struct pci_dev_ser. > + */ > +struct pci_dev_ser { > + u32 domain; > + u16 bdf; > + u16 padding; > +} __packed; > + > +/** > + * struct pci_ser - PCI Subsystem Live Update State > + * > + * This struct tracks state about all devices that are being preserved across > + * a Live Update for the next kernel. > + * > + * @max_nr_devices: The length of the devices[] flexible array. > + * @nr_devices: The number of devices that were preserved. > + * @devices: Flexible array of pci_dev_ser structs for each device. > + */ > +struct pci_ser { > + u32 max_nr_devices; > + u32 nr_devices; > + struct pci_dev_ser devices[]; > +} __packed; > + > +/* Ensure all elements of devices[] are naturally aligned. */ > +static_assert(offsetof(struct pci_ser, devices) % sizeof(unsigned long) == 0); > +static_assert(sizeof(struct pci_dev_ser) % sizeof(unsigned long) == 0); Minor Nit: Shall we consider using specific bitwidth types here? I'm wondering if down the line another u32 field is added to struct pci_dev_ser.. in that case on a 32-bit machine 12 % 4 == 0 but on a 64-bit machine 12 % 8 != 0.. [...] With the nit: Reviewed-by: Pranjal Shrivastava Thanks, Praan From praan at google.com Thu Jun 4 23:11:18 2026 From: praan at google.com (Pranjal Shrivastava) Date: Fri, 5 Jun 2026 06:11:18 +0000 Subject: [PATCH v6 02/12] PCI: liveupdate: Track outgoing preserved PCI devices In-Reply-To: <20260522202410.3104264-3-dmatlack@google.com> References: <20260522202410.3104264-1-dmatlack@google.com> <20260522202410.3104264-3-dmatlack@google.com> Message-ID: On Fri, May 22, 2026 at 08:24:00PM +0000, David Matlack wrote: > Add APIs to allow drivers to notify the PCI core of which devices are > being preserved across a Live Update for the next kernel, i.e. > "outgoing" devices. 
> > Drivers must notify the PCI core when devices are preserved so that the > PCI core can update its FLB data (struct pci_ser) and track the list of > outgoing devices. pci_liveupdate_preserve() notifies the PCI core that a > device must be preserved across Live Update. pci_liveupdate_unpreserve() > reverses this (cancels the preservation of the device). > > This tracking ensures the PCI core is fully aware of which devices may > need special handling during shutdown and kexec, and so that it can be > handed off to the next kernel. > > Signed-off-by: David Matlack [...] > > /** > * struct pci_dev_ser - Serialized state about a single PCI device. > * > * @domain: The device's PCI domain number (segment). > * @bdf: The device's PCI bus, device, and function number. > - * @padding: Padding to naturally align struct pci_dev_ser. > + * @refcount: Reference count used by the PCI core to keep track of whether it > + * is done using a device's struct pci_dev_ser. The value of the > + * refcount is equal to 1 when the struct pci_dev_ser is in use, and > + * 0 otherwise. Note to fellow reviewers: This may seem like a bool instead of refcount, but this is changed in Patch 6. Reviewed-by: Pranjal Shrivastava Thanks, Praan From lkp at intel.com Fri Jun 5 00:16:36 2026 From: lkp at intel.com (kernel test robot) Date: Fri, 05 Jun 2026 15:16:36 +0800 Subject: [liveupdate:kexec-next] BUILD SUCCESS 7eb5f7c5b5d58b4abe78f6a1b0817391c291f199 Message-ID: <202606051527.ebVGd2F3-lkp@intel.com> tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git kexec-next branch HEAD: 7eb5f7c5b5d58b4abe78f6a1b0817391c291f199 kexec_file: skip checksum verification when safe elapsed time: 2068m configs tested: 193 configs skipped: 3 The following configs have been built successfully. More configs may be tested in the coming days. 
tested configs: alpha allnoconfig gcc-15.2.0 alpha allyesconfig gcc-15.2.0 alpha defconfig gcc-16.1.0 arc allmodconfig clang-17 arc allnoconfig gcc-15.2.0 arc allyesconfig clang-23 arc defconfig gcc-16.1.0 arc randconfig-001-20260604 clang-17 arc randconfig-002-20260604 clang-17 arm allnoconfig gcc-15.2.0 arm allyesconfig clang-17 arm defconfig gcc-16.1.0 arm randconfig-001-20260604 clang-17 arm randconfig-002-20260604 clang-17 arm randconfig-003-20260604 clang-17 arm randconfig-004-20260604 clang-17 arm64 allmodconfig clang-23 arm64 allnoconfig gcc-15.2.0 arm64 defconfig gcc-16.1.0 arm64 randconfig-001-20260604 clang-17 arm64 randconfig-002-20260604 clang-17 arm64 randconfig-003-20260604 clang-17 arm64 randconfig-004-20260604 clang-17 csky allmodconfig gcc-16.1.0 csky allnoconfig gcc-15.2.0 csky defconfig gcc-16.1.0 csky randconfig-001-20260604 clang-17 csky randconfig-002-20260604 clang-17 hexagon allmodconfig gcc-15.2.0 hexagon allnoconfig gcc-15.2.0 hexagon defconfig gcc-16.1.0 hexagon randconfig-001 gcc-11.5.0 hexagon randconfig-001-20260604 gcc-11.5.0 hexagon randconfig-002 gcc-11.5.0 hexagon randconfig-002-20260604 gcc-11.5.0 i386 allmodconfig clang-20 i386 allnoconfig gcc-15.2.0 i386 allyesconfig clang-20 i386 buildonly-randconfig-001-20260604 clang-20 i386 buildonly-randconfig-002-20260604 clang-20 i386 buildonly-randconfig-003-20260604 clang-20 i386 buildonly-randconfig-004-20260604 clang-20 i386 buildonly-randconfig-005-20260604 clang-20 i386 buildonly-randconfig-006-20260604 clang-20 i386 defconfig gcc-16.1.0 i386 randconfig-001-20260604 clang-20 i386 randconfig-002-20260604 clang-20 i386 randconfig-003-20260604 clang-20 i386 randconfig-004-20260604 clang-20 i386 randconfig-005-20260604 clang-20 i386 randconfig-006-20260604 clang-20 i386 randconfig-007-20260604 clang-20 i386 randconfig-011-20260604 gcc-14 i386 randconfig-012-20260604 gcc-14 i386 randconfig-013-20260604 gcc-14 i386 randconfig-014-20260604 gcc-14 i386 randconfig-015-20260604 gcc-14 i386 
randconfig-016-20260604 gcc-14 i386 randconfig-017-20260604 gcc-14 loongarch allmodconfig clang-23 loongarch allnoconfig gcc-15.2.0 loongarch defconfig clang-23 loongarch randconfig-001 gcc-11.5.0 loongarch randconfig-001-20260604 gcc-11.5.0 loongarch randconfig-002 gcc-11.5.0 loongarch randconfig-002-20260604 gcc-11.5.0 m68k allmodconfig gcc-16.1.0 m68k allnoconfig gcc-15.2.0 m68k allyesconfig clang-17 m68k defconfig clang-23 microblaze allnoconfig gcc-15.2.0 microblaze allyesconfig gcc-16.1.0 microblaze defconfig clang-23 mips allmodconfig gcc-15.2.0 mips allnoconfig gcc-15.2.0 mips allyesconfig gcc-16.1.0 nios2 allmodconfig clang-23 nios2 allnoconfig clang-17 nios2 defconfig clang-23 nios2 randconfig-001 gcc-11.5.0 nios2 randconfig-001-20260604 gcc-11.5.0 nios2 randconfig-002 gcc-11.5.0 nios2 randconfig-002-20260604 gcc-11.5.0 openrisc allnoconfig clang-17 openrisc defconfig gcc-16.1.0 parisc allmodconfig gcc-15.2.0 parisc allnoconfig clang-17 parisc allyesconfig clang-19 parisc defconfig gcc-16.1.0 parisc randconfig-001 gcc-8.5.0 parisc randconfig-001-20260604 gcc-8.5.0 parisc randconfig-002 gcc-8.5.0 parisc randconfig-002-20260604 gcc-8.5.0 parisc64 defconfig clang-23 powerpc allmodconfig gcc-15.2.0 powerpc allnoconfig clang-17 powerpc randconfig-001 gcc-8.5.0 powerpc randconfig-001-20260604 gcc-8.5.0 powerpc randconfig-002 gcc-8.5.0 powerpc randconfig-002-20260604 gcc-8.5.0 powerpc64 randconfig-001 gcc-8.5.0 powerpc64 randconfig-001-20260604 gcc-8.5.0 powerpc64 randconfig-002 gcc-8.5.0 powerpc64 randconfig-002-20260604 gcc-8.5.0 riscv allmodconfig clang-23 riscv allnoconfig clang-17 riscv allyesconfig clang-17 riscv defconfig gcc-16.1.0 riscv randconfig-001 clang-17 riscv randconfig-001-20260604 clang-17 riscv randconfig-002 clang-17 riscv randconfig-002-20260604 clang-17 s390 allmodconfig clang-19 s390 allnoconfig clang-17 s390 allyesconfig gcc-15.2.0 s390 defconfig gcc-16.1.0 s390 randconfig-001 clang-17 s390 randconfig-001-20260604 clang-17 s390 
randconfig-002 clang-17 s390 randconfig-002-20260604 clang-17 sh allmodconfig gcc-15.2.0 sh allnoconfig clang-17 sh allyesconfig clang-19 sh defconfig gcc-14 sh randconfig-001 clang-17 sh randconfig-001-20260604 clang-17 sh randconfig-002 clang-17 sh randconfig-002-20260604 clang-17 sh urquell_defconfig gcc-16.1.0 sparc allnoconfig clang-17 sparc defconfig gcc-16.1.0 sparc randconfig-001 gcc-11.5.0 sparc randconfig-001-20260604 gcc-11.5.0 sparc randconfig-002 gcc-11.5.0 sparc randconfig-002-20260604 gcc-11.5.0 sparc64 allmodconfig clang-23 sparc64 defconfig gcc-14 sparc64 randconfig-001 gcc-11.5.0 sparc64 randconfig-001-20260604 gcc-11.5.0 sparc64 randconfig-002 gcc-11.5.0 sparc64 randconfig-002-20260604 gcc-11.5.0 um allmodconfig clang-19 um allnoconfig clang-17 um allyesconfig gcc-15.2.0 um defconfig gcc-14 um i386_defconfig gcc-14 um randconfig-001 gcc-11.5.0 um randconfig-001-20260604 gcc-11.5.0 um randconfig-002 gcc-11.5.0 um randconfig-002-20260604 gcc-11.5.0 um x86_64_defconfig gcc-14 x86_64 allmodconfig clang-20 x86_64 allnoconfig clang-17 x86_64 allyesconfig clang-20 x86_64 buildonly-randconfig-001-20260604 gcc-14 x86_64 buildonly-randconfig-002-20260604 gcc-14 x86_64 buildonly-randconfig-003-20260604 gcc-14 x86_64 buildonly-randconfig-004-20260604 gcc-14 x86_64 buildonly-randconfig-005-20260604 gcc-14 x86_64 buildonly-randconfig-006-20260604 gcc-14 x86_64 defconfig gcc-14 x86_64 kexec clang-22 x86_64 randconfig-001-20260604 clang-20 x86_64 randconfig-002-20260604 clang-20 x86_64 randconfig-003-20260604 clang-20 x86_64 randconfig-004-20260604 clang-20 x86_64 randconfig-005-20260604 clang-20 x86_64 randconfig-006-20260604 clang-20 x86_64 randconfig-011-20260604 clang-22 x86_64 randconfig-012-20260604 clang-22 x86_64 randconfig-013-20260604 clang-22 x86_64 randconfig-014-20260604 clang-22 x86_64 randconfig-015-20260604 clang-22 x86_64 randconfig-016-20260604 clang-22 x86_64 randconfig-071-20260604 clang-22 x86_64 randconfig-072-20260604 clang-22 x86_64 
randconfig-073-20260604 clang-22 x86_64 randconfig-074-20260604 clang-22 x86_64 randconfig-075-20260604 clang-22 x86_64 randconfig-076-20260604 clang-22 x86_64 rhel-9.4 clang-22 x86_64 rhel-9.4-bpf gcc-14 x86_64 rhel-9.4-func clang-22 x86_64 rhel-9.4-kselftests clang-22 x86_64 rhel-9.4-kunit gcc-14 x86_64 rhel-9.4-ltp gcc-14 x86_64 rhel-9.4-rust clang-20 xtensa allnoconfig clang-17 xtensa allyesconfig clang-23 xtensa randconfig-001 gcc-11.5.0 xtensa randconfig-001-20260604 gcc-11.5.0 xtensa randconfig-002 gcc-11.5.0 xtensa randconfig-002-20260604 gcc-11.5.0 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki From mclapinski at google.com Fri Jun 5 05:10:40 2026 From: mclapinski at google.com (Michal Clapinski) Date: Fri, 5 Jun 2026 14:10:40 +0200 Subject: [PATCH v2] pstore: add a KHO backend Message-ID: <20260605121040.1177072-1-mclapinski@google.com> Up to this point to preserve late shutdown logs in memory, users had to predefine a memory region using ramoops. This commit changes this by preserving a buffer using kexec-handover. pstore_kho supports preserving only 1 dmesg buffer. It gets replaced with the new buffer on every kexec, so the user has to copy the file out of pstore after every kexec. There is no erase() support. Signed-off-by: Michal Clapinski --- v2: - Added a comment explaining the benefits of pstore_kho. - Created include/linux/kho/abi/pstore.h. - Got rid of the KHO subtree. - Made sure never to free incoming kho data. This way the module can be safely reloaded. - Sashiko complained that I trust the data coming from the old kernel. I ignored it. LMK if I shouldn't trust the old kernel. 
--- fs/pstore/Kconfig | 10 ++ fs/pstore/Makefile | 2 + fs/pstore/pstore_kho.c | 230 +++++++++++++++++++++++++++++++++ include/linux/kho/abi/pstore.h | 27 ++++ 4 files changed, 269 insertions(+) create mode 100644 fs/pstore/pstore_kho.c create mode 100644 include/linux/kho/abi/pstore.h diff --git a/fs/pstore/Kconfig b/fs/pstore/Kconfig index 3acc38600cd1..455790fec955 100644 --- a/fs/pstore/Kconfig +++ b/fs/pstore/Kconfig @@ -81,6 +81,16 @@ config PSTORE_RAM For more information, see Documentation/admin-guide/ramoops.rst. +config PSTORE_KHO + tristate "Preserve logs over kexec" + depends on PSTORE + depends on KEXEC_HANDOVER + help + A pstore backend for preserving dmesg over KHO (kexec handover). + It does not require any additional cmdline params to work. + + It supports preservation of only 1 dmesg file. + config PSTORE_ZONE tristate depends on PSTORE diff --git a/fs/pstore/Makefile b/fs/pstore/Makefile index c270467aeece..518cd408bf8e 100644 --- a/fs/pstore/Makefile +++ b/fs/pstore/Makefile @@ -13,6 +13,8 @@ pstore-$(CONFIG_PSTORE_PMSG) += pmsg.o ramoops-objs += ram.o ram_core.o obj-$(CONFIG_PSTORE_RAM) += ramoops.o +obj-$(CONFIG_PSTORE_KHO) += pstore_kho.o + pstore_zone-objs += zone.o obj-$(CONFIG_PSTORE_ZONE) += pstore_zone.o diff --git a/fs/pstore/pstore_kho.c b/fs/pstore/pstore_kho.c new file mode 100644 index 000000000000..6d4187d91642 --- /dev/null +++ b/fs/pstore/pstore_kho.c @@ -0,0 +1,230 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * KHO (Kexec Handover) backend for pstore. + * + * KHO-based pstore provides a mechanism to hand over pstore data (specifically + * dmesg logs) from one kernel to another across a kexec reboot using the + * Kexec Handover (KHO) framework. + * + * Key advantages of KHO-based pstore include: + * - No hardcoded memmap: Unlike ramoops, it does not require reserving a static + * memory region in the bootloader or device tree. Memory is allocated + * dynamically and handed over to the next kernel. 
+ * - Firmware independence: It does not rely on platform firmware support (like + * ACPI ERST or UEFI variable storage) to preserve logs across reboots. + * - High throughput: It avoids the performance bottlenecks of serial consoles, + * not being limited by console baud rates. + * - Complete log preservation: It preserves all dmesg logs, including those + * generated late in the reboot cycle after filesystems have been unmounted, + * up to the point of the kexec jump. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * The in and out buffers are separate and they need not be the same size. + * Therefore, this is not part of ABI. + */ +#define RECORD_MAX_SIZE (1 << CONFIG_LOG_BUF_SHIFT) + +struct pstore_kho_context { + struct pstore_info pstore; + bool read_done; +}; + +static struct pstore_ser *kho_ser_in; +static struct pstore_ser *kho_ser_out; + +static int pstore_kho_open(struct pstore_info *psi) +{ + struct pstore_kho_context *cxt = psi->data; + + cxt->read_done = false; + return 0; +} + +static ssize_t pstore_kho_read(struct pstore_record *record) +{ + struct pstore_kho_context *cxt = record->psi->data; + struct pstore_kho_record *kho_data_in; + + if (cxt->read_done || !kho_ser_in) + return 0; + + cxt->read_done = true; + kho_data_in = &kho_ser_in->record; + + record->buf = kmemdup(kho_data_in->buf, kho_data_in->size, GFP_KERNEL); + if (!record->buf) + return -ENOMEM; + + record->type = PSTORE_TYPE_DMESG; + record->id = 0; + record->size = kho_data_in->size; + record->time.tv_sec = kho_data_in->time_sec; + record->time.tv_nsec = kho_data_in->time_nsec; + record->count = kho_data_in->count; + record->reason = kho_data_in->reason; + record->part = kho_data_in->part; + record->compressed = kho_data_in->compressed; + + return record->size; +} + +static int pstore_kho_write(struct pstore_record *record) +{ + struct pstore_kho_record *kho_data_out = &kho_ser_out->record; + 
+ if (record->type != PSTORE_TYPE_DMESG) + return -EINVAL; + + if (kho_data_out->size != 0) { + pr_err("pstore kho already contains a record\n"); + return -ENOSPC; + } + + if (record->size > RECORD_MAX_SIZE) { + pr_err("dmesg record too big, record size: %lu, available space: %d\n", + record->size, RECORD_MAX_SIZE); + return -ENOSPC; + } + + memcpy(kho_data_out->buf, record->buf, record->size); + kho_data_out->size = record->size; + kho_data_out->time_sec = record->time.tv_sec; + kho_data_out->time_nsec = record->time.tv_nsec; + kho_data_out->count = record->count; + kho_data_out->reason = record->reason; + kho_data_out->part = record->part; + kho_data_out->compressed = record->compressed; + + return 0; +} + +static struct pstore_kho_context pstore_kho_cxt = { + .pstore = { + .owner = THIS_MODULE, + .name = "kho", + .bufsize = RECORD_MAX_SIZE, + .flags = PSTORE_FLAGS_DMESG, + .max_reason = KMSG_DUMP_SHUTDOWN, + .open = pstore_kho_open, + .read = pstore_kho_read, + .write = pstore_kho_write, + }, +}; + +static void __init kho_setup_incoming(void) +{ + phys_addr_t kho_ser_phys; + int err; + + err = kho_retrieve_subtree(KHO_PSTORE_FDT_NAME, &kho_ser_phys); + if (err) { + if (err != -ENOENT) + pr_err("failed to retrieve KHO data %s: %d\n", + KHO_PSTORE_FDT_NAME, err); + return; + } + + kho_ser_in = phys_to_virt(kho_ser_phys); + + if (kho_ser_in->version != KHO_PSTORE_VERSION) { + pr_err("unsupported KHO pstore version: %d\n", kho_ser_in->version); + kho_ser_in = NULL; + return; + } + + pr_info("successfully restored preserved data\n"); +} + +static int __init kho_setup_outgoing(void) +{ + int err; + size_t total_size = sizeof(struct pstore_ser) + RECORD_MAX_SIZE; + + kho_ser_out = kho_alloc_preserve(total_size); + if (IS_ERR(kho_ser_out)) { + pr_err("failed to allocate pstore kho ser anchor\n"); + return PTR_ERR(kho_ser_out); + } + memset(kho_ser_out, 0, total_size); + kho_ser_out->version = KHO_PSTORE_VERSION; + + err = kho_add_subtree(KHO_PSTORE_FDT_NAME, 
kho_ser_out); + if (err) { + pr_err("failed to add KHO data\n"); + goto err_free_ser; + } + + return 0; + +err_free_ser: + kho_unpreserve_free(kho_ser_out); + return err; +} + +static int __init pstore_kho_init(void) +{ + int err; + struct pstore_kho_context *cxt = &pstore_kho_cxt; + + if (!kho_is_enabled()) { + pr_info("KHO is disabled, pstore_kho cannot start\n"); + return -ENODEV; + } + + kho_setup_incoming(); + err = kho_setup_outgoing(); + if (err) { + pr_err("failed to setup outgoing KHO\n"); + return err; + } + + cxt->pstore.data = cxt; + cxt->pstore.buf = kmalloc(cxt->pstore.bufsize, GFP_KERNEL); + if (!cxt->pstore.buf) { + err = -ENOMEM; + goto err_free_outgoing; + } + + err = pstore_register(&cxt->pstore); + if (err) { + pr_err("failed to register with pstore\n"); + goto err_free_pstore_buf; + } + + return 0; + +err_free_pstore_buf: + kfree(cxt->pstore.buf); + +err_free_outgoing: + kho_remove_subtree(kho_ser_out); + kho_unpreserve_free(kho_ser_out); + + return err; +} +module_init(pstore_kho_init); + +static void __exit pstore_kho_exit(void) +{ + pstore_unregister(&pstore_kho_cxt.pstore); + kfree(pstore_kho_cxt.pstore.buf); + + kho_remove_subtree(kho_ser_out); + kho_unpreserve_free(kho_ser_out); +} +module_exit(pstore_kho_exit); + +MODULE_LICENSE("GPL"); +MODULE_DESCRIPTION("Pstore backend for dmesg preservation over kexec"); diff --git a/include/linux/kho/abi/pstore.h b/include/linux/kho/abi/pstore.h new file mode 100644 index 000000000000..743ec64d67fc --- /dev/null +++ b/include/linux/kho/abi/pstore.h @@ -0,0 +1,27 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef _LINUX_KHO_ABI_PSTORE_H +#define _LINUX_KHO_ABI_PSTORE_H + +#include + +#define KHO_PSTORE_FDT_NAME "pstore-kho" +#define KHO_PSTORE_VERSION 1 + +struct pstore_kho_record { + s64 size; + s64 time_sec; + u32 time_nsec; + s32 count; + u32 reason; + u32 part; + u32 compressed; + char buf[]; +}; + +struct pstore_ser { + u32 version; + struct pstore_kho_record record; +}; + +#endif /* 
_LINUX_KHO_ABI_PSTORE_H */
--
2.54.0.1032.g2f8565e1d1-goog

From pratyush at kernel.org Fri Jun 5 06:09:27 2026
From: pratyush at kernel.org (Pratyush Yadav)
Date: Fri, 05 Jun 2026 15:09:27 +0200
Subject: [PATCH 0/2] liveupdate: Small FLB fixes
In-Reply-To: (Mike Rapoport's message of "Thu, 4 Jun 2026 08:28:57 +0300")
References: <20260528174140.1921129-1-dmatlack@google.com>
Message-ID: <2vxzcxy5i16g.fsf@kernel.org>

On Thu, Jun 04 2026, Mike Rapoport wrote:

> On Thu, May 28, 2026 at 05:41:38PM +0000, David Matlack wrote:
>> This series has 2 small fixes to how FLBs are managed. First is to
>> increase the outgoing FLB refcount during liveupdate_flb_get_outgoing()
>> so it cannot be freed while the caller is using it, and to align with
>> the semantics of liveupdate_flb_get_incoming(). The second is to prevent
>> FLB retrieve() from being called multiple times if the first attempt
>> fails.
>>
>> Both of these changes are needed for the correctness of the PCI core
>> support for Live Update:
>>
>> https://lore.kernel.org/linux-pci/20260522202410.3104264-1-dmatlack at google.com/
>
> We are late in the release cycle and since there are no in-tree flb users,
> let's postpone this until after rc1.

Yes, I agree.

[...]

--
Regards,
Pratyush Yadav

From pratyush at kernel.org Fri Jun 5 09:06:44 2026
From: pratyush at kernel.org (Pratyush Yadav)
Date: Fri, 5 Jun 2026 18:06:44 +0200
Subject: [PATCH] docs: memfd_preservation: fix rendering of ABI documentation
Message-ID: <20260605160645.3650271-1-pratyush@kernel.org>

From: "Pratyush Yadav (Google)"

The "memfd Live Update ABI" section in include/linux/kho/abi/memfd.h
currently does not render in the exported documentation. This is because
the kernel-doc reference should not include the "DOC:" prefix. Drop it to
ensure correct rendering.

Tested by running make htmldocs.
Fixes: 15fc11bb2cb6 ("docs: add documentation for memfd preservation via LUO") Signed-off-by: Pratyush Yadav (Google) --- Notes: Mike/Pasha, I reckon this can still go in liveupdate/next. But if you think it is too late, we can probably take it via -rc1 fixes as well. Documentation/mm/memfd_preservation.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/mm/memfd_preservation.rst b/Documentation/mm/memfd_preservation.rst index a8a5b476afd3..c908a12dffa7 100644 --- a/Documentation/mm/memfd_preservation.rst +++ b/Documentation/mm/memfd_preservation.rst @@ -11,7 +11,7 @@ Memfd Preservation ABI ====================== .. kernel-doc:: include/linux/kho/abi/memfd.h - :doc: DOC: memfd Live Update ABI + :doc: memfd Live Update ABI .. kernel-doc:: include/linux/kho/abi/memfd.h :internal: base-commit: 2935777b418d2bfcbfe96705bb2c0fa6c0d94e18 -- 2.54.0.1032.g2f8565e1d1-goog From tarunsahu at google.com Fri Jun 5 10:08:26 2026 From: tarunsahu at google.com (Tarun Sahu) Date: Fri, 5 Jun 2026 17:08:26 +0000 Subject: [RFC PATCH v2 01/10] liveupdate: luo_file: Add internal APIs for file preservation In-Reply-To: References: Message-ID: From: Pasha Tatashin The core liveupdate mechanism allows userspace to preserve file descriptors. However, kernel subsystems often manage struct file objects directly and need to participate in the preservation process programmatically without relying solely on userspace interaction. 
Signed-off-by: Pasha Tatashin Signed-off-by: Samiullah Khawaja Signed-off-by: Tarun Sahu --- include/linux/liveupdate.h | 21 ++++++++++ kernel/liveupdate/luo_file.c | 69 ++++++++++++++++++++++++++++++++ kernel/liveupdate/luo_internal.h | 17 ++++++++ 3 files changed, 107 insertions(+) diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h index 30c5a39ff9e9..de052438eaac 100644 --- a/include/linux/liveupdate.h +++ b/include/linux/liveupdate.h @@ -24,6 +24,7 @@ struct file; /** * struct liveupdate_file_op_args - Arguments for file operation callbacks. * @handler: The file handler being called. + * @session: The session this file belongs to. * @retrieve_status: The retrieve status for the 'can_finish / finish' * operation. A value of 0 means the retrieve has not been * attempted, a positive value means the retrieve was @@ -44,6 +45,7 @@ struct file; */ struct liveupdate_file_op_args { struct liveupdate_file_handler *handler; + struct liveupdate_session *session; int retrieve_status; struct file *file; u64 serialized_data; @@ -240,6 +242,13 @@ void liveupdate_unregister_flb(struct liveupdate_file_handler *fh, int liveupdate_flb_get_incoming(struct liveupdate_flb *flb, void **objp); int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp); +/* kernel can internally retrieve files */ +int liveupdate_get_file_incoming(struct liveupdate_session *s, u64 token, + struct file **filep); + +/* Get a token for an outgoing file, or -ENOENT if file is not preserved */ +int liveupdate_get_token_outgoing(struct liveupdate_session *s, + struct file *file, u64 *tokenp); #else /* CONFIG_LIVEUPDATE */ @@ -285,5 +294,17 @@ static inline int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, return -EOPNOTSUPP; } +static inline int liveupdate_get_file_incoming(struct liveupdate_session *s, + u64 token, struct file **filep) +{ + return -EOPNOTSUPP; +} + +static inline int liveupdate_get_token_outgoing(struct liveupdate_session *s, + struct file *file, u64 
*tokenp) +{ + return -EOPNOTSUPP; +} + #endif /* CONFIG_LIVEUPDATE */ #endif /* _LINUX_LIVEUPDATE_H */ diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index a0a419085e28..0aa0b4e5339f 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -323,6 +323,7 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) mutex_init(&luo_file->mutex); args.handler = fh; + args.session = luo_session_from_file_set(file_set); args.file = file; err = fh->ops->preserve(&args); if (err) @@ -380,6 +381,7 @@ void luo_file_unpreserve_files(struct luo_file_set *file_set) struct luo_file, list); args.handler = luo_file->fh; + args.session = luo_session_from_file_set(file_set); args.file = luo_file->file; args.serialized_data = luo_file->serialized_data; args.private_data = luo_file->private_data; @@ -411,6 +413,7 @@ static int luo_file_freeze_one(struct luo_file_set *file_set, struct liveupdate_file_op_args args = {0}; args.handler = luo_file->fh; + args.session = luo_session_from_file_set(file_set); args.file = luo_file->file; args.serialized_data = luo_file->serialized_data; args.private_data = luo_file->private_data; @@ -432,6 +435,7 @@ static void luo_file_unfreeze_one(struct luo_file_set *file_set, struct liveupdate_file_op_args args = {0}; args.handler = luo_file->fh; + args.session = luo_session_from_file_set(file_set); args.file = luo_file->file; args.serialized_data = luo_file->serialized_data; args.private_data = luo_file->private_data; @@ -621,6 +625,7 @@ int luo_retrieve_file(struct luo_file_set *file_set, u64 token, } args.handler = luo_file->fh; + args.session = luo_session_from_file_set(file_set); args.serialized_data = luo_file->serialized_data; err = luo_file->fh->ops->retrieve(&args); if (err) { @@ -654,6 +659,7 @@ static int luo_file_can_finish_one(struct luo_file_set *file_set, struct liveupdate_file_op_args args = {0}; args.handler = luo_file->fh; + args.session = luo_session_from_file_set(file_set); 
args.file = luo_file->file; args.serialized_data = luo_file->serialized_data; args.retrieve_status = luo_file->retrieve_status; @@ -671,6 +677,7 @@ static void luo_file_finish_one(struct luo_file_set *file_set, guard(mutex)(&luo_file->mutex); args.handler = luo_file->fh; + args.session = luo_session_from_file_set(file_set); args.file = luo_file->file; args.serialized_data = luo_file->serialized_data; args.retrieve_status = luo_file->retrieve_status; @@ -924,3 +931,65 @@ void liveupdate_unregister_file_handler(struct liveupdate_file_handler *fh) luo_flb_unregister_all(fh); list_del(&ACCESS_PRIVATE(fh, list)); } +EXPORT_SYMBOL_GPL(liveupdate_unregister_file_handler); + +/** + * liveupdate_get_token_outgoing - Get the token for a preserved file. + * @s: The outgoing liveupdate session. + * @file: The file object to search for. + * @tokenp: Output parameter for the found token. + * + * Searches the list of preserved files in an outgoing session for a matching + * file object. If found, the corresponding user-provided token is returned. + * + * This function is intended for in-kernel callers that need to correlate a + * file with its liveupdate token. + * + * Context: It must be called with session mutex acquired. + * Return: 0 on success, -ENOENT if the file is not preserved in this session. + */ +int liveupdate_get_token_outgoing(struct liveupdate_session *s, + struct file *file, u64 *tokenp) +{ + struct luo_file_set *file_set = luo_file_set_from_session_locked(s); + struct luo_file *luo_file; + int err = -ENOENT; + + list_for_each_entry(luo_file, &file_set->files_list, list) { + if (luo_file->file == file) { + if (tokenp) + *tokenp = luo_file->token; + err = 0; + break; + } + } + + return err; +} + +/** + * liveupdate_get_file_incoming - Retrieves a preserved file for in-kernel use. + * @s: The incoming liveupdate session (restored from the previous kernel). + * @token: The unique token identifying the file to retrieve. 
+ * @filep: On success, this will be populated with a pointer to the retrieved + * 'struct file'. + * + * Provides a kernel-internal API for other subsystems to retrieve their + * preserved files after a live update. This function is a simple wrapper + * around luo_retrieve_file(), allowing callers to find a file by its token. + * + * The caller receives a new reference to the file and must call fput() when it + * is no longer needed. The file's lifetime is managed by LUO and any userspace + * file descriptors. If the caller needs to hold a reference to the file beyond + * the immediate scope, it must call get_file() itself. + * + * Context: It must be called with session mutex acquired of a restored session. + * Return: 0 on success. Returns -ENOENT if no file with the matching token is + * found, or any other negative errno on failure. + */ +int liveupdate_get_file_incoming(struct liveupdate_session *s, u64 token, + struct file **filep) +{ + return luo_retrieve_file(luo_file_set_from_session_locked(s), + token, filep); +} diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index 875844d7a41d..08b198802e7f 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -79,6 +79,23 @@ struct luo_session { extern struct rw_semaphore luo_register_rwlock; +static inline struct liveupdate_session *luo_session_from_file_set(struct luo_file_set *file_set) +{ + struct luo_session *session; + + session = container_of(file_set, struct luo_session, file_set); + + return (struct liveupdate_session *)session; +} + +static inline struct luo_file_set *luo_file_set_from_session_locked(struct liveupdate_session *s) +{ + struct luo_session *session = (struct luo_session *)s; + + lockdep_assert_held(&session->mutex); + return &session->file_set; +} + int luo_session_create(const char *name, struct file **filep); int luo_session_retrieve(const char *name, struct file **filep); int __init luo_session_setup_outgoing(void *fdt); -- 
2.54.0.1032.g2f8565e1d1-goog

From tarunsahu at google.com Fri Jun 5 10:08:25 2026
From: tarunsahu at google.com (Tarun Sahu)
Date: Fri, 5 Jun 2026 17:08:25 +0000
Subject: [RFC PATCH v1 0/10] liveupdate: kvm: Guest_memfd preservation
Message-ID:

Changes from V1:
1. Removed mem_attr_array preservation
2. Removed the prefaulted guest_memfd condition
3. Updated the check for shared guest_memfd from INIT_SHARED to
   kvm_arch_has_private_mem
4. Added the document liveupdate/vmm.rst

Hello,

I am proposing this series as an RFC to start the discussion on supporting
guest_memfd preservation. It sets up the basic architecture for VM
preservation during liveupdate.

This cover letter has three sections (please feel free to skip any section
you already know):

A. Guest_memfd introduction: to make the audience familiar with guest_memfd
B. Liveupdate introduction: to make the audience familiar with liveupdate
C. Actual implementation design and questions.

**A: GUEST MEMFD INTRODUCTION**

Initially, guest_memfd was created to support guest private memory in
confidential computing VMs (CoCo VMs). It was designed so that whenever a
guest wants to grant the host access to private memory, a series of calls
occurs: from the guest to KVM, from KVM to host userspace, from host
userspace back to KVM, and finally a new page fault maps the memory into a
separate shared address space. Conversely, if the guest transitions the
memory back to private, the subsequent fault is handled by guest_memfd
(the dual mapping architecture).

In such a VM, all guest memory is initially shared. On the fly, the guest
may request to change pages to private; the metadata indicating which parts
of memory are private is stored in an xarray inside struct kvm
(mem_attr_array). This array serves as the source of truth for the fault
mechanism, determining whether a mapping should be created from
host-userspace-mapped pages or directly from the guest_memfd file.
For private memory, the fault path also calls an architecture-specific
function to set up private hardware access (e.g., on SEV-SNP or TDX). This
type of guest_memfd is fully private: the shared mapping comes from the
userspace-mapped address space.

Subsequently, support was added to allow the entire guest memory to be
backed by guest_memfd. This led to the implementation of the MMAP and
INIT_SHARED flags for the guest_memfd inode. When KVM_CREATE_GUEST_MEMFD is
called with these flags, the guest_memfd becomes mmap-able by host
userspace. The INIT_SHARED flag makes the guest_memfd completely shared
between the host and the guest. Consequently, page faults from both host
userspace and the guest resolve to the same guest_memfd page cache. Under
this configuration, however, marking a portion of this memory as private is
not possible. This type of guest_memfd is fully shared.

If guest_memfd is created with INIT_SHARED but without MMAP, the host can
never access the guest_memfd, but the memory is still considered shared.

Hence, at this point, guest_memfd can only be used as either fully shared
or fully private.

There is ongoing work to make shared and private mappings in-place backed
by guest_memfd. [1]
There is also ongoing work to back guest_memfd by hugetlb pages. [2]

**B: LIVEUPDATE INTRODUCTION (LIVEUPDATE ORCHESTRATOR - LUO)**

Liveupdate support was added to the kernel to update the host kernel with
minimal downtime. This is generally achieved by preserving the current
state of the system and retrieving it after boot, to resume from where we
left off. Any subsystem that wants to preserve its state registers a
handler with the liveupdate system. This handler includes the following
callbacks:

*can_preserve(file)*: Tells LUO whether the file is eligible. When the
preserve ioctl is called, LUO first loops through all the file handlers and
calls can_preserve; it then uses the fh->ops->preserve call of the handler
that returns true to preserve the file.
*preserve(file)*: Actually preserves the file.
*unpreserve(file)*: Unpreserves the file in case userspace wants to back
out.
*retrieve(file)*: On new kernel boot, retrieves the file.
*finish(file)*: When userspace decides that all the files in the liveupdate
session have been retrieved, it can trigger this to do the final cleanup.

LUO preserves its memory using KHO (kexec handover). All these callbacks
are implemented using KHO calls.

**C: GUEST MEMFD PRESERVATION**

SCOPE:
1. Fully shared guest_memfd
2. Guest_memfd backed by PAGE_SIZE pages

Any VM whose memory is backed by such a guest_memfd can be preserved across
liveupdate. The preservation call is straightforward: it walks through the
page cache, serializes the folios and preserves them.

On the retrieval path:
Currently, creating a guest_memfd requires an associated struct kvm
(derived from vm_file / vm_fd). Since there is no direct way to pass a VM
file descriptor via the LUO API, I leverage a companion patch [3] (also
included in this series as PATCH[1]) that allows one file to retrieve
another file from the same LUO session. This enables the guest_memfd
retrieval path to obtain the preserved KVM file, use it during guest_memfd
file creation, and subsequently populate its preserved memory.

Preserving the KVM file allows us to preserve additional VM-specific
metadata, which will be crucial in the future for cleanly resuming the VM.
Currently, it preserves only the VM type.

On the retrieval path:
KVM normally requires a unique identifier (fdname) upon creation, which KVM
typically assigns based on the newly created file descriptor number.
However, in the LUO retrieval path, the retrieve call restores the
underlying file structure and delegates the actual file descriptor
allocation to LUO (see luo_session_retrieve_fd). Currently, I use an
atomically incremented sequence number as the fdname. I would like to
discuss whether userspace services rely on specific naming conventions
here. Alternatively, could we change the underlying retrieve call
(luo_retrieve_file) to pass the fd?

This series also introduces an inode freeze call for the guest_memfd inode,
which fails any subsequent fallocate calls or new page fault allocations.
The VMM is expected to take the necessary measures when it triggers the
liveupdate. The VMM must:
1. Either pause the VM before preserving the VM/guest_memfd, OR
2. Take action (vm_pause or unpreserve/destroy the liveupdate sequence)
   when a fault fails and the VM exits to the VMM with -EPERM.

Preservation order between the VM and the guest_memfd file:
There is no strict order; they are independent. The guest_memfd file needs
the token of the preserved kvm_file, which it updates in the freeze call,
since freeze is called just before the kexec jump. kexec fails if freeze is
unsuccessful; in this case, freeze fails if the vm_file token is not found.

Retrieval order for the VM and the guest_memfd file:
There is no strict order needed for retrieval.
1. If the VM file is retrieved before guest_memfd: both guest_memfd and
   vm_file are retrieved, and userspace holds references to both files.
2. If the guest_memfd file is retrieved before vm_file: guest_memfd is
   retrieved and retrieves vm_file internally; userspace can retrieve
   vm_file later. Until then userspace does not hold a reference to
   vm_file, and luo_finish() drops the final vm_file reference if userspace
   does not retrieve vm_file before calling luo_finish(). This is a valid
   case, since guest_memfd can live without vm_file, just as when vm_file
   is closed before the guest_memfd file.

I have implemented a basic test that spawns a VM with a guest_memfd of 16MB
and writes data to a 5MB portion of it. After the LUO preserve call and
kexec, on retrieve, a new VM is spawned with the restored vm_file and the
restored guest_memfd, and the data is verified. It uses the liveupdate test
library [5].

Future Work:
1. Support private guest_memfd preservation.
2. Extend the support to guest_memfd with in-place conversion of
   shared/private.
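
For readers new to LUO, the handler selection described in section B (loop
over the registered handlers, ask each one can_preserve, then call preserve
on the first match) can be pictured with the small userspace model below.
This is only an illustrative sketch, not the LUO code: every demo_* name,
the "kind" string used to tell files apart, the "demo_gmem_v1" compatible
string and the -ENOENT return for "no handler" are assumptions made for the
example. Only "kvm_vm_luo_v1" is taken from this series
(KVM_LUO_FH_COMPATIBLE).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct file; "kind" replaces real f_op checks. */
struct demo_file {
	const char *kind;
};

/* Hypothetical, reduced model of a file handler with two callbacks. */
struct demo_handler {
	const char *compatible;
	bool (*can_preserve)(const struct demo_file *file);
	int (*preserve)(const struct demo_file *file);
};

static bool demo_is_kvm(const struct demo_file *file)
{
	return strcmp(file->kind, "kvm-vm") == 0;
}

static bool demo_is_gmem(const struct demo_file *file)
{
	return strcmp(file->kind, "kvm-gmem") == 0;
}

/* Placeholder preserve callback; a real one would serialize state. */
static int demo_preserve_noop(const struct demo_file *file)
{
	(void)file;
	return 0;
}

/* Registered handlers, in registration order. */
static const struct demo_handler demo_handlers[] = {
	{ "kvm_vm_luo_v1", demo_is_kvm, demo_preserve_noop },
	{ "demo_gmem_v1", demo_is_gmem, demo_preserve_noop }, /* made-up name */
};

/* Return the first handler whose can_preserve() accepts the file. */
static const struct demo_handler *demo_find_handler(const struct demo_file *file)
{
	size_t i;

	assert(file != NULL);
	for (i = 0; i < sizeof(demo_handlers) / sizeof(demo_handlers[0]); i++) {
		if (demo_handlers[i].can_preserve(file))
			return &demo_handlers[i];
	}
	return NULL;
}

/* Model of the preserve request: pick a handler, then delegate to it. */
static int demo_preserve_file(const struct demo_file *file)
{
	const struct demo_handler *h = demo_find_handler(file);

	if (!h)
		return -ENOENT; /* assumed error code for "no handler" */
	return h->preserve(file);
}
```

In this model a file that no handler claims is simply rejected, which
mirrors the idea that only files with a registered handler can be part of a
liveupdate session.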
[1] https://lore.kernel.org/all/20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4@google.com/
[2] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/
[3] https://lore.kernel.org/all/20260427175633.1978233-2-skhawaja@google.com/
[4] https://lore.kernel.org/all/cover.1691446946.git.ackerleytng@google.com/
[5] https://lore.kernel.org/all/20260511201155.1488670-1-vipinsh@google.com/

Pasha Tatashin (1):
  liveupdate: luo_file: Add internal APIs for file preservation

Tarun Sahu (8):
  liveupdate: Add LIVEUPDATE_GUEST_MEMFD config option
  kvm: Prepare core VM structs and helpers for LUO support
  kvm: kvm_luo: Allow kvm preservation with LUO
  kvm: guest_memfd: Move internal definitions and helper to new header
  kvm: guest_memfd: Add support for freezing and unfreezing mappings
  kvm: guest_memfd_luo: add support for guest_memfd preservation
  selftests: kvm: Split ____vm_create() to expose init helpers
  selftests: kvm: Add guest_memfd_preservation_test

 MAINTAINERS                                   |  13 +
 include/linux/kho/abi/kvm.h                   | 106 ++++
 include/linux/kvm_host.h                      |  14 +
 include/linux/liveupdate.h                    |  21 +
 kernel/liveupdate/Kconfig                     |  15 +
 kernel/liveupdate/luo_file.c                  |  69 +++
 kernel/liveupdate/luo_internal.h              |  17 +
 tools/testing/selftests/kvm/Makefile.kvm      |   6 +-
 .../kvm/guest_memfd_preservation_test.c       | 230 ++++++++
 .../testing/selftests/kvm/include/kvm_util.h  |   2 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  26 +-
 virt/kvm/Makefile.kvm                         |   1 +
 virt/kvm/guest_memfd.c                        | 185 +++++--
 virt/kvm/guest_memfd.h                        |  44 ++
 virt/kvm/guest_memfd_luo.c                    | 489 ++++++++++++++++++
 virt/kvm/kvm_luo.c                            | 190 +++++++
 virt/kvm/kvm_main.c                           |  94 +++-
 virt/kvm/kvm_mm.h                             |  15 +
 18 files changed, 1456 insertions(+), 81 deletions(-)
 create mode 100644 include/linux/kho/abi/kvm.h
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_preservation_test.c
 create mode 100644 virt/kvm/guest_memfd.h
 create mode 100644 virt/kvm/guest_memfd_luo.c
 create mode 100644 virt/kvm/kvm_luo.c

base-commit:
e43ffb69e0438cddd72aaa30898b4dc446f664f8
prerequisite-patch-id: 85705fb54d3065efe1d87ab4b69e828a9f3404e7
prerequisite-patch-id: 7bf85ca17e12b26a72d41ee35f2ec8fc5ce2e692
--
2.54.0.1032.g2f8565e1d1-goog

From tarunsahu at google.com Fri Jun 5 10:08:27 2026
From: tarunsahu at google.com (Tarun Sahu)
Date: Fri, 5 Jun 2026 17:08:27 +0000
Subject: [RFC PATCH v2 02/10] liveupdate: Add LIVEUPDATE_GUEST_MEMFD config option
In-Reply-To:
References:
Message-ID:

Introduce the LIVEUPDATE_GUEST_MEMFD Kconfig option. This option enables
live update support for KVM guest_memfd files, enabling
guest_memfd-backed memory preservation across kernel upgrades.

Currently this supports only guest_memfd files that are fully shared and
pre-faulted.

Signed-off-by: Tarun Sahu
---
 kernel/liveupdate/Kconfig | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
index 1a8513f16ef7..0bbc4037192e 100644
--- a/kernel/liveupdate/Kconfig
+++ b/kernel/liveupdate/Kconfig
@@ -88,4 +88,19 @@ config LIVEUPDATE_MEMFD
 	  If unsure, say N.

+config LIVEUPDATE_GUEST_MEMFD
+	bool "Live update support for guest_memfd"
+	depends on LIVEUPDATE
+	depends on KVM_GUEST_MEMFD
+	default LIVEUPDATE
+	help
+	  Enable live update support for KVM guest_memfd files. This allows
+	  preserving VM Memory backed by guest_memfd file across kernel live
+	  updates.
+
+	  This can only be used for the guest_memfd that are fully-shared
+	  and pre-faulted.
+
+	  If unsure, say N.
+
 endmenu
--
2.54.0.1032.g2f8565e1d1-goog

From tarunsahu at google.com Fri Jun 5 10:08:28 2026
From: tarunsahu at google.com (Tarun Sahu)
Date: Fri, 5 Jun 2026 17:08:28 +0000
Subject: [RFC PATCH v2 03/10] kvm: Prepare core VM structs and helpers for LUO support
In-Reply-To:
References:
Message-ID: <20ae20f9d1a198b289444ebb4c824314cbba1bcf.1780676742.git.tarunsahu@google.com>

Introduce core infrastructure to support VM preservation with LUO.
First two changes are just refactoring, no functional change, third change introduces a new member in struct kvm. - Move ITOA_MAX_LEN to kvm_mm.h for reuse by upcoming kvm_luo code. - Add a public kvm_create_vm_file() helper wrapping kvm_create_vm() and anon_inode_getfile() to provide a unified VM file creation API. - Track a weak reference to the backing file in struct kvm under CONFIG_LIVEUPDATE_GUEST_MEMFD to enable reverse file resolution without circular lifetime dependencies. Signed-off-by: Tarun Sahu --- include/linux/kvm_host.h | 14 +++++++ virt/kvm/kvm_main.c | 79 +++++++++++++++++++++++++++++----------- virt/kvm/kvm_mm.h | 3 ++ 3 files changed, 75 insertions(+), 21 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 4c14aee1fb06..9111a28637af 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -874,6 +874,18 @@ struct kvm { #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES /* Protected by slots_lock (for writes) and RCU (for reads) */ struct xarray mem_attr_array; +#endif +#ifdef CONFIG_LIVEUPDATE_GUEST_MEMFD + /* + * Weak reference to the VFS file backing this KVM instance. Stored + * without incrementing the file refcount to prevent a circular lifetime + * dependency (since file->private_data already pins this struct kvm). + * Used exclusively to resolve the file pointer back from struct kvm. + * + * Written/cleared via rcu_assign_pointer() and read locklessly under + * RCU (e.g. via get_file_active() to prevent ABA races). 
+ */ + struct file *vm_file; #endif char stats_id[KVM_STATS_NAME_SIZE]; }; @@ -1074,7 +1086,9 @@ void kvm_get_kvm(struct kvm *kvm); bool kvm_get_kvm_safe(struct kvm *kvm); void kvm_put_kvm(struct kvm *kvm); bool file_is_kvm(struct file *file); +struct file *kvm_create_vm_file(unsigned long type, const char *fdname); void kvm_put_kvm_no_destroy(struct kvm *kvm); +void kvm_uevent_notify_vm_create(struct kvm *kvm); static inline struct kvm_memslots *__kvm_memslots(struct kvm *kvm, int as_id) { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 89489996fbc1..65f0c5fb353e 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -67,9 +67,6 @@ #include -/* Worst case buffer size needed for holding an integer. */ -#define ITOA_MAX_LEN 12 - MODULE_AUTHOR("Qumranet"); MODULE_DESCRIPTION("Kernel-based Virtual Machine (KVM) Hypervisor"); MODULE_LICENSE("GPL"); @@ -1349,6 +1346,19 @@ static int kvm_vm_release(struct inode *inode, struct file *filp) { struct kvm *kvm = filp->private_data; +#ifdef CONFIG_LIVEUPDATE_GUEST_MEMFD + /* + * Clear the weak reference of the vm file. + * In case vm file is closed by userspace, but kvm still has + * other users like vCPUs, clearing this pointer ensures + * that we don't have a dangling pointer to a closed file. + * + * Cleared via rcu_assign_pointer() to ensure proper memory visibility + * for concurrent lockless readers under RCU. 
+ */ + rcu_assign_pointer(kvm->vm_file, NULL); +#endif + kvm_irqfd_release(kvm); kvm_put_kvm(kvm); @@ -5476,11 +5486,47 @@ bool file_is_kvm(struct file *file) } EXPORT_SYMBOL_FOR_KVM_INTERNAL(file_is_kvm); +struct file *kvm_create_vm_file(unsigned long type, const char *fdname) +{ + struct kvm *kvm = kvm_create_vm(type, fdname); + struct file *file; + + if (IS_ERR(kvm)) + return ERR_CAST(kvm); + + file = anon_inode_getfile("kvm-vm", &kvm_vm_fops, kvm, O_RDWR); + if (IS_ERR(file)) { + kvm_put_kvm(kvm); + return file; + } + +#ifdef CONFIG_LIVEUPDATE_GUEST_MEMFD + /* + * Weak reference to the file (without get_file()) to prevent a circular + * dependency. Safe because the file's release path clears this pointer + * and drops its reference to the VM. + * + * Written via rcu_assign_pointer() because the pointer can be read + * locklessly under RCU (e.g., in kvm_gmem_luo_preserve() via + * get_file_active() to prevent lockless ABA races). + */ + rcu_assign_pointer(kvm->vm_file, file); +#endif + + /* + * Don't call kvm_put_kvm anymore at this point; file->f_op is + * already set, with ->release() being kvm_vm_release(). In error + * cases it will be called by the final fput(file) and will take + * care of doing kvm_put_kvm(kvm). + */ + + return file; +} + static int kvm_dev_ioctl_create_vm(unsigned long type) { char fdname[ITOA_MAX_LEN + 1]; int r, fd; - struct kvm *kvm; struct file *file; fd = get_unused_fd_flags(O_CLOEXEC); @@ -5489,31 +5535,17 @@ static int kvm_dev_ioctl_create_vm(unsigned long type) snprintf(fdname, sizeof(fdname), "%d", fd); - kvm = kvm_create_vm(type, fdname); - if (IS_ERR(kvm)) { - r = PTR_ERR(kvm); - goto put_fd; - } - - file = anon_inode_getfile("kvm-vm", &kvm_vm_fops, kvm, O_RDWR); + file = kvm_create_vm_file(type, fdname); if (IS_ERR(file)) { r = PTR_ERR(file); - goto put_kvm; + goto put_fd; } - /* - * Don't call kvm_put_kvm anymore at this point; file->f_op is - * already set, with ->release() being kvm_vm_release(). 
In error - * cases it will be called by the final fput(file) and will take - * care of doing kvm_put_kvm(kvm). - */ - kvm_uevent_notify_change(KVM_EVENT_CREATE_VM, kvm); + kvm_uevent_notify_change(KVM_EVENT_CREATE_VM, file->private_data); fd_install(fd, file); return fd; -put_kvm: - kvm_put_kvm(kvm); put_fd: put_unused_fd(fd); return r; @@ -6341,6 +6373,11 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm) kfree(env); } +void kvm_uevent_notify_vm_create(struct kvm *kvm) +{ + kvm_uevent_notify_change(KVM_EVENT_CREATE_VM, kvm); +} + static void kvm_init_debug(void) { const struct file_operations *fops; diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h index 9fcc5d5b7f8d..7aa1d65c3d46 100644 --- a/virt/kvm/kvm_mm.h +++ b/virt/kvm/kvm_mm.h @@ -3,6 +3,9 @@ #ifndef __KVM_MM_H__ #define __KVM_MM_H__ 1 +/* Worst case buffer size needed for holding an integer as a string. */ +#define ITOA_MAX_LEN 12 + /* * Architectures can choose whether to use an rwlock or spinlock * for the mmu_lock. These macros, for use in common code -- 2.54.0.1032.g2f8565e1d1-goog From tarunsahu at google.com Fri Jun 5 10:08:29 2026 From: tarunsahu at google.com (Tarun Sahu) Date: Fri, 5 Jun 2026 17:08:29 +0000 Subject: [RFC PATCH v2 04/10] kvm: kvm_luo: Allow kvm preservation with LUO In-Reply-To: References: Message-ID: <8730c0e11acbd0d645a8b7187cd5cd7de373380e.1780676742.git.tarunsahu@google.com> Introduce KVM VM preservation support for Live Update Orchestrator. Register an LUO file handler for KVM files to serialize and deserialize necessary VM state across live updates. Currently, this preserves the VM type. This implementation provides the necessary infrastructure and dependencies for the upcoming guest_memfd preservation support. And it can be extended to preserve more vm state in future. Retrieve is simply creating the kvm and populate the retrieved data. 
The only catch is that there is no way to know which fd is going to be
assigned to this kvm file, hence I am using an atomically incremented id
for the fdname.

This change also updates the MAINTAINERS list for kvm_luo.c.

Signed-off-by: Tarun Sahu
---
My only worry is whether userspace strictly depends on the fdname, i.e.
whether it needs to be consistent with vm_fd. More details are discussed in
the cover letter. I would really appreciate alternatives/other approaches.
---
 MAINTAINERS                 |  11 +++
 include/linux/kho/abi/kvm.h |  39 ++++++++
 virt/kvm/Makefile.kvm       |   1 +
 virt/kvm/kvm_luo.c          | 190 ++++++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c         |   8 ++
 virt/kvm/kvm_mm.h           |   8 ++
 6 files changed, 257 insertions(+)
 create mode 100644 include/linux/kho/abi/kvm.h
 create mode 100644 virt/kvm/kvm_luo.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 9ec290e38b44..9bfc3c1f6676 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14409,6 +14409,17 @@ S: Maintained
 F: Documentation/devicetree/bindings/leds/backlight/kinetic,ktz8866.yaml
 F: drivers/video/backlight/ktz8866.c

+KVM LIVE UPDATE
+M: Pasha Tatashin
+M: Mike Rapoport
+M: Pratyush Yadav
+R: Tarun Sahu
+L: kexec at lists.infradead.org
+L: kvm at vger.kernel.org
+S: Maintained
+T: git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git
+F: virt/kvm/kvm_luo.c
+
 KVM PARAVIRT (KVM/paravirt)
 M: Paolo Bonzini
 R: Vitaly Kuznetsov
diff --git a/include/linux/kho/abi/kvm.h b/include/linux/kho/abi/kvm.h
new file mode 100644
index 000000000000..718db68a541a
--- /dev/null
+++ b/include/linux/kho/abi/kvm.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (c) 2026, Google LLC.
+ * Tarun Sahu
+ *
+ * KVM Preservation ABI for Live Update Orchestrator (LUO)
+ */
+#ifndef _LINUX_KHO_ABI_KVM_H
+#define _LINUX_KHO_ABI_KVM_H
+
+#include
+#include
+
+/**
+ * DOC: KVM Live Update ABI
+ *
+ * KVM uses the ABI defined below for preserving its state
+ * across a kexec reboot using the LUO.
+ * + * The state is serialized into a packed structure `struct kvm_luo_ser` + * which is handed over to the next kernel via the KHO mechanism. + * + * This interface is a contract. Any modification to the structure layout + * constitutes a breaking change. Such changes require incrementing the + * version number in the KVM_LUO_FH_COMPATIBLE compatibility string. + */ + +/** + * struct kvm_luo_ser - Main serialization structure for a KVM VM. + * @type: The type of VM. + */ +struct kvm_luo_ser { + u64 type; +} __packed; + +/* The compatibility string for KVM VM file handler */ +#define KVM_LUO_FH_COMPATIBLE "kvm_vm_luo_v1" + +#endif /* _LINUX_KHO_ABI_KVM_H */ diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm index d047d4cf58c9..c1a962159264 100644 --- a/virt/kvm/Makefile.kvm +++ b/virt/kvm/Makefile.kvm @@ -13,3 +13,4 @@ kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o kvm-$(CONFIG_KVM_GUEST_MEMFD) += $(KVM)/guest_memfd.o +kvm-$(CONFIG_LIVEUPDATE_GUEST_MEMFD) += $(KVM)/kvm_luo.o diff --git a/virt/kvm/kvm_luo.c b/virt/kvm/kvm_luo.c new file mode 100644 index 000000000000..25619f94ace5 --- /dev/null +++ b/virt/kvm/kvm_luo.c @@ -0,0 +1,190 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Copyright (c) 2026, Google LLC. + * Tarun Sahu + * + * KVM VM Preservation for Live Update Orchestrator (LUO) + */ + +/** + * DOC: KVM VM Preservation via LUO + * + * Overview + * ======== + * + * KVM virtual machines (VMs) can be preserved over a kexec reboot using the + * Live Update Orchestrator (LUO) file preservation. This allows userspace + * to preserve KVM VM state across kexec reboots. + * + * The preservation is not intended to be fully transparent. Only specific + * VM configuration and state are preserved, while other aspects of the VM + * must be re-established or re-configured by userspace after retrieval. 
+ * + * Preserved Properties + * ==================== + * + * The following properties of the KVM VM are preserved across kexec: + * + * VM Type + * The VM type (e.g., on x86 architecture, the vm_type parameter) is + * preserved. + * + * Non-Preserved Properties + * ======================== + * + * The preservation does not cover: + * + * - vCPUs and vCPU states + * - Memspots / Memory slot layout (memslots) + * - Interrupt controllers and IRQ routings + * - Coalesced MMIO zones + * - Device bindings (VFIO/Eventfds) + * - Active paging or guest registers state + * - etc + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "kvm_mm.h" + +static bool kvm_luo_can_preserve(struct liveupdate_file_handler *handler, + struct file *file) +{ + return file_is_kvm(file); +} + +static int kvm_luo_preserve(struct liveupdate_file_op_args *args) +{ + struct kvm *kvm = args->file->private_data; + struct kvm_luo_ser *ser; + + if (kvm->vm_dead || kvm->vm_bugged) + return -EINVAL; + + ser = kho_alloc_preserve(sizeof(*ser)); + if (IS_ERR(ser)) + return PTR_ERR(ser); + +#ifdef CONFIG_X86 + ser->type = kvm->arch.vm_type; +#else + ser->type = 0; +#endif + + args->serialized_data = virt_to_phys(ser); + + return 0; +} + +static atomic_t restored_vm_id = ATOMIC_INIT(0); + +static int kvm_luo_retrieve(struct liveupdate_file_op_args *args) +{ + char fdname[ITOA_MAX_LEN + 1]; + struct kvm_luo_ser *ser; + struct file *file; + struct kvm *kvm; + int err = 0; + + if (!args->serialized_data) + return -EINVAL; + + ser = phys_to_virt(args->serialized_data); + + snprintf(fdname, sizeof(fdname), "%d", + atomic_inc_return(&restored_vm_id)); + + file = kvm_create_vm_file(ser->type, fdname); + if (IS_ERR(file)) { + err = PTR_ERR(file); + goto err_free_ser; + } + + kvm = file->private_data; + + args->file = file; + kho_restore_free(ser); + + kvm_uevent_notify_vm_create(kvm); + return 0; + +err_free_ser: + kho_restore_free(ser); + return err; +} 
+ +static void kvm_luo_unpreserve(struct liveupdate_file_op_args *args) +{ + struct kvm_luo_ser *ser; + + /* + * in case preservation failed, args->serialized_data will + * be NULL and kvm_luo_preserve takes care of cleaning up. + * If preserve succeeds, this condition fails and unpreserve + * function takes care of cleaning up. + */ + if (WARN_ON_ONCE(!args->serialized_data)) + return; + + ser = phys_to_virt(args->serialized_data); + + kho_unpreserve_free(ser); +} + +static void kvm_luo_finish(struct liveupdate_file_op_args *args) +{ + struct kvm_luo_ser *ser; + + /* + * If retrieve_status is true or set to error, nothing to do here. + * Already cleaned up in kvm_luo_retrieve(). + */ + if (args->retrieve_status) + return; + + if (!args->serialized_data) + return; + + ser = phys_to_virt(args->serialized_data); + kho_restore_free(ser); +} + +static const struct liveupdate_file_ops kvm_luo_file_ops = { + .can_preserve = kvm_luo_can_preserve, + .preserve = kvm_luo_preserve, + .retrieve = kvm_luo_retrieve, + .unpreserve = kvm_luo_unpreserve, + .finish = kvm_luo_finish, + .owner = THIS_MODULE, +}; + +static struct liveupdate_file_handler kvm_luo_handler = { + .ops = &kvm_luo_file_ops, + .compatible = KVM_LUO_FH_COMPATIBLE, +}; + +int kvm_luo_init(void) +{ + int err = liveupdate_register_file_handler(&kvm_luo_handler); + + if (err && err != -EOPNOTSUPP) { + pr_err("Could not register kvm_vm_luo handler: %pe\n", ERR_PTR(err)); + return err; + } + + return 0; +} + +void kvm_luo_exit(void) +{ + liveupdate_unregister_file_handler(&kvm_luo_handler); +} + diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 65f0c5fb353e..c70346906a89 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -6576,6 +6576,10 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module) if (r) goto err_virt; + r = kvm_luo_init(); + if (r) + goto err_luo; + /* * Registration _must_ be the very last thing done, as this exposes * /dev/kvm to userspace, i.e. 
all infrastructure must be setup! @@ -6589,6 +6593,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module) return 0; err_register: + kvm_luo_exit(); +err_luo: kvm_uninit_virtualization(); err_virt: kvm_gmem_exit(); @@ -6618,6 +6624,8 @@ void kvm_exit(void) */ misc_deregister(&kvm_dev); + kvm_luo_exit(); + kvm_uninit_virtualization(); debugfs_remove_recursive(kvm_debugfs_dir); diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h index 7aa1d65c3d46..118edc47df83 100644 --- a/virt/kvm/kvm_mm.h +++ b/virt/kvm/kvm_mm.h @@ -97,4 +97,12 @@ static inline void kvm_gmem_unbind(struct kvm_memory_slot *slot) } #endif /* CONFIG_KVM_GUEST_MEMFD */ +#ifdef CONFIG_LIVEUPDATE_GUEST_MEMFD +int kvm_luo_init(void); +void kvm_luo_exit(void); +#else +static inline int kvm_luo_init(void) { return 0; } +static inline void kvm_luo_exit(void) {} +#endif /* CONFIG_LIVEUPDATE_GUEST_MEMFD */ + #endif /* __KVM_MM_H__ */ -- 2.54.0.1032.g2f8565e1d1-goog From tarunsahu at google.com Fri Jun 5 10:08:30 2026 From: tarunsahu at google.com (Tarun Sahu) Date: Fri, 5 Jun 2026 17:08:30 +0000 Subject: [RFC PATCH v2 05/10] kvm: guest_memfd: Move internal definitions and helper to new header In-Reply-To: References: Message-ID: <769d0adac847d638fceb3fce093755a74e686582.1780676742.git.tarunsahu@google.com> To support guest_memfd memory preservation with LUO, guest_memfd luo code needs to access guest_memfd internals and reconstruct guest_memfd file instances from a preserved state. Extract gmem_file, gmem_inode, and the GMEM_I() helper from guest_memfd.c into a new internal header virt/kvm/guest_memfd.h. Additionally, split __kvm_gmem_create() to expose a non-static __kvm_gmem_create_file() helper. This helper returns a struct file instead of a file descriptor, enabling file creation and initialization without installing it into a file descriptor table. 
Signed-off-by: Tarun Sahu --- virt/kvm/guest_memfd.c | 68 +++++++++++++++++------------------------- virt/kvm/guest_memfd.h | 39 ++++++++++++++++++++++++ 2 files changed, 67 insertions(+), 40 deletions(-) create mode 100644 virt/kvm/guest_memfd.h diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 69c9d6d546b2..6740ae2bf948 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -7,38 +7,12 @@ #include #include #include +#include "guest_memfd.h" #include "kvm_mm.h" static struct vfsmount *kvm_gmem_mnt; -/* - * A guest_memfd instance can be associated multiple VMs, each with its own - * "view" of the underlying physical memory. - * - * The gmem's inode is effectively the raw underlying physical storage, and is - * used to track properties of the physical memory, while each gmem file is - * effectively a single VM's view of that storage, and is used to track assets - * specific to its associated VM, e.g. memslots=>gmem bindings. - */ -struct gmem_file { - struct kvm *kvm; - struct xarray bindings; - struct list_head entry; -}; - -struct gmem_inode { - struct shared_policy policy; - struct inode vfs_inode; - struct list_head gmem_file_list; - - u64 flags; -}; - -static __always_inline struct gmem_inode *GMEM_I(struct inode *inode) -{ - return container_of(inode, struct gmem_inode, vfs_inode); -} #define kvm_gmem_for_each_file(f, inode) \ list_for_each_entry(f, &GMEM_I(inode)->gmem_file_list, entry) @@ -556,23 +530,17 @@ bool __weak kvm_arch_supports_gmem_init_shared(struct kvm *kvm) return true; } -static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) +struct file *__kvm_gmem_create_file(struct kvm *kvm, loff_t size, u64 flags) { static const char *name = "[kvm-gmem]"; struct gmem_file *f; struct inode *inode; struct file *file; - int fd, err; - - fd = get_unused_fd_flags(0); - if (fd < 0) - return fd; + int err; f = kzalloc_obj(*f); - if (!f) { - err = -ENOMEM; - goto err_fd; - } + if (!f) + return ERR_PTR(-ENOMEM); /* 
__fput() will take care of fops_put(). */ if (!fops_get(&kvm_gmem_fops)) { @@ -611,8 +579,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) xa_init(&f->bindings); list_add(&f->entry, &GMEM_I(inode)->gmem_file_list); - fd_install(fd, file); - return fd; + return file; err_inode: iput(inode); @@ -620,7 +587,28 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) fops_put(&kvm_gmem_fops); err_gmem: kfree(f); -err_fd: + return ERR_PTR(err); +} + +static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags) +{ + struct file *file; + int fd, err; + + fd = get_unused_fd_flags(0); + if (fd < 0) + return fd; + + file = __kvm_gmem_create_file(kvm, size, flags); + if (IS_ERR(file)) { + err = PTR_ERR(file); + goto err_put_fd; + } + + fd_install(fd, file); + return fd; + +err_put_fd: put_unused_fd(fd); return err; } diff --git a/virt/kvm/guest_memfd.h b/virt/kvm/guest_memfd.h new file mode 100644 index 000000000000..c528b046dd69 --- /dev/null +++ b/virt/kvm/guest_memfd.h @@ -0,0 +1,39 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +#ifndef __KVM_GUEST_MEMFD_H__ +#define __KVM_GUEST_MEMFD_H__ 1 + +#include +#include +#include + +/* + * A guest_memfd instance can be associated multiple VMs, each with its own + * "view" of the underlying physical memory. + * + * The gmem's inode is effectively the raw underlying physical storage, and is + * used to track properties of the physical memory, while each gmem file is + * effectively a single VM's view of that storage, and is used to track assets + * specific to its associated VM, e.g. memslots=>gmem bindings. 
+ */ +struct gmem_file { + struct kvm *kvm; + struct xarray bindings; + struct list_head entry; +}; + +struct gmem_inode { + struct shared_policy policy; + struct inode vfs_inode; + struct list_head gmem_file_list; + + u64 flags; +}; + +static inline struct gmem_inode *GMEM_I(struct inode *inode) +{ + return container_of(inode, struct gmem_inode, vfs_inode); +} + +struct file *__kvm_gmem_create_file(struct kvm *kvm, loff_t size, u64 flags); + +#endif /* __KVM_GUEST_MEMFD_H__ */ -- 2.54.0.1032.g2f8565e1d1-goog From tarunsahu at google.com Fri Jun 5 10:08:32 2026 From: tarunsahu at google.com (Tarun Sahu) Date: Fri, 5 Jun 2026 17:08:32 +0000 Subject: [RFC PATCH v2 07/10] kvm: guest_memfd_luo: add support for guest_memfd preservation In-Reply-To: References: Message-ID: <4b2216f5c459fe699a3f62464cbc765624e20ae6.1780676742.git.tarunsahu@google.com> This patch sets up the basic infrastructure to preserve guest_memfd. Currently, only fully shared guest_memfd backed by PAGE_SIZE pages is supported. It registers a new LUO file handler for guest_memfd files to serialize and deserialize guest memory. This allows preserving guest memory backed by guest_memfd across updates, ensuring that guest instances can be resumed seamlessly without losing their memory contents. Preservation is straightforward: it walks the folios and serializes them. The preserve callback calls kvm_gmem_freeze(), which freezes the guest_memfd inode. Once the inode is frozen, fallocate calls and new fault allocations fail, so the inode mapping cannot change on or after preservation. There is no need to check this during a page fault, as preservation is only supported for pre-faulted/pre-allocated guest_memfd. Retrieving the guest_memfd requires a struct kvm to create the new guest_memfd. Retrieval therefore first gets the vm_file from the same session, using the VM token stored during preservation, and uses it to obtain the struct kvm. This change also updates the MAINTAINERS list.
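The count-then-fill walk used by the preserve path can be illustrated with a small userspace sketch. It is only a model under stated assumptions: struct demo_slot is a hypothetical stand-in for a page-cache slot, demo_walk() and demo_serialize() are made-up names, and struct demo_folio_ser merely copies the field layout of struct guest_memfd_luo_folio_ser from this patch (GCC/Clang bit-field and packing extensions are assumed).

```c
/* Userspace model of a two-pass serialize walk; not kernel code. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Same field layout as struct guest_memfd_luo_folio_ser in the patch. */
struct demo_folio_ser {
	uint64_t pfn:52;
	uint64_t flags:12;
	uint64_t index;
} __attribute__((packed));

#define DEMO_FOLIO_UPTODATE (1u << 0)

/* Hypothetical page-cache slot; present == 0 models a hole in the file. */
struct demo_slot {
	int present;
	int uptodate;
	uint64_t pfn;
};

/*
 * Walk slots [0, nr_slots). With out == NULL only count populated slots;
 * otherwise also record one entry per populated slot.
 */
static uint64_t demo_walk(const struct demo_slot *slots, uint64_t nr_slots,
			  struct demo_folio_ser *out)
{
	uint64_t i, count = 0;

	for (i = 0; i < nr_slots; i++) {
		if (!slots[i].present)
			continue;
		if (out) {
			out[count].pfn = slots[i].pfn;
			out[count].index = i;
			out[count].flags = slots[i].uptodate ?
					   DEMO_FOLIO_UPTODATE : 0;
		}
		count++;
	}
	return count;
}

/* Pass 1 sizes the array, pass 2 fills it; the caller frees *out. */
static uint64_t demo_serialize(const struct demo_slot *slots, uint64_t nr_slots,
			       struct demo_folio_ser **out)
{
	uint64_t filled, count = demo_walk(slots, nr_slots, NULL);

	*out = NULL;
	if (!count)
		return 0;

	*out = calloc(count, sizeof(**out));
	if (!*out)
		return 0;

	/* The source must not change between the passes, hence the freeze. */
	filled = demo_walk(slots, nr_slots, *out);
	assert(filled == count);
	(void)filled;
	return count;
}
```

The two passes only agree if nothing is added or removed in between, which is the role the inode freeze plays in the real preserve callback.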
Signed-off-by: Tarun Sahu --- MAINTAINERS | 1 + include/linux/kho/abi/kvm.h | 79 +++++- virt/kvm/Makefile.kvm | 2 +- virt/kvm/guest_memfd_luo.c | 485 ++++++++++++++++++++++++++++++++++++ virt/kvm/kvm_main.c | 7 + virt/kvm/kvm_mm.h | 4 + 6 files changed, 571 insertions(+), 7 deletions(-) create mode 100644 virt/kvm/guest_memfd_luo.c diff --git a/MAINTAINERS b/MAINTAINERS index 9bfc3c1f6676..16cba790a84d 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14418,6 +14418,7 @@ L: kexec at lists.infradead.org L: kvm at vger.kernel.org S: Maintained T: git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git +F: virt/kvm/guest_memfd_luo.c F: virt/kvm/kvm_luo.c KVM PARAVIRT (KVM/paravirt) diff --git a/include/linux/kho/abi/kvm.h b/include/linux/kho/abi/kvm.h index 718db68a541a..42074d76e04a 100644 --- a/include/linux/kho/abi/kvm.h +++ b/include/linux/kho/abi/kvm.h @@ -9,20 +9,23 @@ #define _LINUX_KHO_ABI_KVM_H #include +#include #include /** - * DOC: KVM Live Update ABI + * DOC: KVM and guest_memfd Live Update ABI * - * KVM uses the ABI defined below for preserving its state + * KVM and guest_memfd use the ABI defined below for preserving their states * across a kexec reboot using the LUO. * - * The state is serialized into a packed structure `struct kvm_luo_ser` - * which is handed over to the next kernel via the KHO mechanism. + * The state is serialized into packed structures (struct kvm_luo_ser and + * struct guest_memfd_luo_ser) which are handed over to the next kernel via + * the KHO mechanism. * - * This interface is a contract. Any modification to the structure layout + * This interface is a contract. Any modification to the structure layouts * constitutes a breaking change. Such changes require incrementing the - * version number in the KVM_LUO_FH_COMPATIBLE compatibility string. + * version number in the KVM_LUO_FH_COMPATIBLE or + * GUEST_MEMFD_LUO_FH_COMPATIBLE compatibility strings. 
*/ /** @@ -36,4 +39,68 @@ struct kvm_luo_ser { /* The compatibility string for KVM VM file handler */ #define KVM_LUO_FH_COMPATIBLE "kvm_vm_luo_v1" +/** + * struct guest_memfd_luo_folio_ser - Serialization layout for a single folio in guest_memfd. + * @pfn: Page Frame Number of the folio. + * @index: Page offset of the folio within the file. + * @flags: State flags associated with the folio. + */ +struct guest_memfd_luo_folio_ser { + u64 pfn:52; + u64 flags:12; + u64 index; +} __packed; + +/** + * GUEST_MEMFD_LUO_FOLIO_UPTODATE - The folio is up-to-date. + * + * This flag is per folio to check if the folio is uptodate. + */ +#define GUEST_MEMFD_LUO_FOLIO_UPTODATE BIT(0) + + +/** + * GUEST_MEMFD_LUO_FLAG_MMAP - The guest_memfd supports mmap. + * + * This flag indicates that the guest_memfd supports host-side mmap. + */ +#define GUEST_MEMFD_LUO_FLAG_MMAP BIT(0) + +/** + * GUEST_MEMFD_LUO_FLAG_INIT_SHARED - Initialize memory as shared. + * + * This flag indicates that the guest_memfd has been initialized as shared + * memory. + */ +#define GUEST_MEMFD_LUO_FLAG_INIT_SHARED BIT(1) + +/** + * GUEST_MEMFD_LUO_SUPPORTED_FLAGS - Supported guest_memfd LUO flags mask. + * + * A mask of all guest_memfd preservation flags supported by this version + * of the KVM LUO ABI. + */ +#define GUEST_MEMFD_LUO_SUPPORTED_FLAGS (GUEST_MEMFD_LUO_FLAG_MMAP | \ + GUEST_MEMFD_LUO_FLAG_INIT_SHARED) + +/** + * struct guest_memfd_luo_ser - Main serialization structure for guest_memfd. + * @size: The size of the file in bytes. + * @flags: File-level flags. + * @nr_folios: Number of folios in the folios array. + * @vm_token: Token of the associated KVM VM instance. + * @folios: KHO vmalloc descriptor pointing to the array of + * struct guest_memfd_luo_folio_ser. 
+ */ +struct guest_memfd_luo_ser { + u64 size; + u64 flags; + u64 nr_folios; + u64 vm_token; + struct kho_vmalloc folios; +} __packed; + +/* The compatibility string for GUEST_MEMFD file handler */ +#define GUEST_MEMFD_LUO_FH_COMPATIBLE "guest_memfd_luo_v1" + #endif /* _LINUX_KHO_ABI_KVM_H */ diff --git a/virt/kvm/Makefile.kvm b/virt/kvm/Makefile.kvm index c1a962159264..d30fca094c42 100644 --- a/virt/kvm/Makefile.kvm +++ b/virt/kvm/Makefile.kvm @@ -13,4 +13,4 @@ kvm-$(CONFIG_HAVE_KVM_IRQ_ROUTING) += $(KVM)/irqchip.o kvm-$(CONFIG_HAVE_KVM_DIRTY_RING) += $(KVM)/dirty_ring.o kvm-$(CONFIG_HAVE_KVM_PFNCACHE) += $(KVM)/pfncache.o kvm-$(CONFIG_KVM_GUEST_MEMFD) += $(KVM)/guest_memfd.o -kvm-$(CONFIG_LIVEUPDATE_GUEST_MEMFD) += $(KVM)/kvm_luo.o +kvm-$(CONFIG_LIVEUPDATE_GUEST_MEMFD) += $(KVM)/guest_memfd_luo.o $(KVM)/kvm_luo.o diff --git a/virt/kvm/guest_memfd_luo.c b/virt/kvm/guest_memfd_luo.c new file mode 100644 index 000000000000..d466f889c9aa --- /dev/null +++ b/virt/kvm/guest_memfd_luo.c @@ -0,0 +1,485 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Copyright (c) 2026, Google LLC. + * Tarun Sahu + * + * Guestmemfd Preservation for Live Update Orchestrator (LUO) + */ + +/** + * DOC: Guestmemfd Preservation via LUO + * + * Overview + * ======== + * + * Guest memory file descriptors (guest_memfd) can be preserved over a kexec + * reboot using the Live Update Orchestrator (LUO) file preservation. This + * allows userspace to preserve VM memory across kexec reboots. + * + * The preservation is not intended to be transparent. Only select properties + * of the guest_memfd are preserved, while others are reset to default. + * + * Preserved Properties + * ==================== + * + * The following properties of guest_memfd are preserved across kexec: + * + * File Size + * The size of the file is preserved. + * + * File Contents + * All folios present in the page cache are preserved. 
+ * + * File-level Flags + * The file-level flags (such as MMAP support and INIT_SHARED default mapping) + * are preserved. + * + * Non-Preserved Properties + * ======================== + * + * NUMA Memory Policy + * NUMA memory policies associated with the guest_memfd are not preserved. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "guest_memfd.h" + +static int kvm_gmem_luo_walk_folios(struct address_space *mapping, + pgoff_t end_index, struct guest_memfd_luo_folio_ser *folios_ser, + u64 *out_count) +{ + struct folio_batch fbatch; + pgoff_t index = 0; + u64 count = 0; + int err = 0; + + folio_batch_init(&fbatch); + while (index < end_index) { + unsigned int nr, i; + + nr = filemap_get_folios(mapping, &index, end_index - 1, &fbatch); + if (nr == 0) + break; + + for (i = 0; i < nr; i++) { + struct folio *folio = fbatch.folios[i]; + + if (folios_ser) { + if (folio_test_hwpoison(folio)) { + err = -EHWPOISON; + folio_batch_release(&fbatch); + goto out; + } + err = kho_preserve_folio(folio); + if (err) { + folio_batch_release(&fbatch); + goto out; + } + + folios_ser[count].pfn = folio_pfn(folio); + folios_ser[count].index = folio->index; + folios_ser[count].flags = folio_test_uptodate(folio) ? 
+ GUEST_MEMFD_LUO_FOLIO_UPTODATE : 0; + } + count++; + } + folio_batch_release(&fbatch); + cond_resched(); + } + +out: + *out_count = count; + return err; +} + +static bool kvm_gmem_luo_can_preserve(struct liveupdate_file_handler *handler, struct file *file) +{ + struct inode *inode = file_inode(file); + struct gmem_file *gmem_file = file->private_data; + struct kvm *kvm = gmem_file->kvm; + + if (inode->i_sb->s_magic != GUEST_MEMFD_MAGIC) + return 0; + + if (kvm_arch_has_private_mem(kvm)) + return 0; + + if (mapping_large_folio_support(inode->i_mapping)) + return 0; + + return 1; +} + +static int kvm_gmem_luo_preserve(struct liveupdate_file_op_args *args) +{ + struct guest_memfd_luo_folio_ser *folios_ser = NULL; + u64 count = 0, gmem_flags, abi_flags = 0; + struct guest_memfd_luo_ser *ser; + struct address_space *mapping; + struct gmem_file *gmem_file; + struct inode *inode; + pgoff_t end_index; + struct kvm *kvm; + int err = 0; + long size; + + inode = file_inode(args->file); + kvm_gmem_freeze(inode, true); + + mapping = inode->i_mapping; + size = i_size_read(inode); + if (!size) { + err = -EINVAL; + goto err_unfreeze_inode; + } + + if (WARN_ON_ONCE(!PAGE_ALIGNED(size))) { + err = -EINVAL; + goto err_unfreeze_inode; + } + + gmem_file = args->file->private_data; + kvm = gmem_file->kvm; + + gmem_flags = READ_ONCE(GMEM_I(inode)->flags); + if (gmem_flags & ~(GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED + | GUEST_MEMFD_F_MAPPING_FROZEN)) { + err = -EOPNOTSUPP; + goto err_unfreeze_inode; + } + + if (gmem_flags & GUEST_MEMFD_FLAG_MMAP) + abi_flags |= GUEST_MEMFD_LUO_FLAG_MMAP; + if (gmem_flags & GUEST_MEMFD_FLAG_INIT_SHARED) + abi_flags |= GUEST_MEMFD_LUO_FLAG_INIT_SHARED; + + end_index = size >> PAGE_SHIFT; + + ser = kho_alloc_preserve(sizeof(*ser)); + if (IS_ERR(ser)) { + err = PTR_ERR(ser); + goto err_unfreeze_inode; + } + + /* First pass: Count the folios present in the page cache */ + err = kvm_gmem_luo_walk_folios(mapping, end_index, NULL, &count); + if 
(err) + goto err_free_ser; + + ser->size = size; + ser->flags = abi_flags; + ser->nr_folios = count; + ser->vm_token = 0; // It will be set during the kvm_gmem_luo_freeze() + + if (count > 0) { + folios_ser = vcalloc(count, sizeof(*folios_ser)); + if (!folios_ser) { + err = -ENOMEM; + goto err_free_ser; + } + + /* Second pass: Fill the metadata array and preserve folios */ + err = kvm_gmem_luo_walk_folios(mapping, end_index, folios_ser, &count); + if (err) + goto err_unpreserve_unlocked; + + if (WARN_ON_ONCE(count != ser->nr_folios)) { + err = -EINVAL; + goto err_unpreserve_unlocked; + } + } + + if (count > 0) { + err = kho_preserve_vmalloc(folios_ser, &ser->folios); + if (err) + goto err_unpreserve_unlocked; + } + + args->serialized_data = virt_to_phys(ser); + args->private_data = folios_ser; + + return 0; + +err_unpreserve_unlocked: + for (long i = (long)count - 1; i >= 0; i--) { + struct folio *folio = pfn_folio(folios_ser[i].pfn); + + kho_unpreserve_folio(folio); + } + vfree(folios_ser); +err_free_ser: + kho_unpreserve_free(ser); +err_unfreeze_inode: + kvm_gmem_freeze(inode, false); + return err; +} + +static int kvm_gmem_luo_freeze(struct liveupdate_file_op_args *args) +{ + struct guest_memfd_luo_ser *ser; + struct gmem_file *gmem_file; + struct kvm *kvm; + struct file *kvm_file; + u64 vm_token; + int err; + + if (WARN_ON_ONCE(!args->serialized_data)) + return -EINVAL; + + ser = phys_to_virt(args->serialized_data); + + gmem_file = args->file->private_data; + kvm = gmem_file->kvm; + + /* + * Obtain a strong reference to kvm->vm_file to prevent the SLAB_TYPESAFE_BY_RCU + * file memory from being reallocated while it is being processed. 
+ */ + kvm_file = get_file_active(&kvm->vm_file); + if (!kvm_file) + return -ENOENT; + + err = liveupdate_get_token_outgoing(args->session, kvm_file, &vm_token); + fput(kvm_file); + if (err) + return err; + + ser->vm_token = vm_token; + return 0; +} + +static void kvm_gmem_luo_discard_folios( + const struct guest_memfd_luo_folio_ser *folios_ser, + u64 nr_folios, u64 start_idx) +{ + long i; + + for (i = start_idx; i < nr_folios; i++) { + struct folio *folio; + phys_addr_t phys; + + if (!folios_ser[i].pfn) + continue; + + phys = PFN_PHYS(folios_ser[i].pfn); + folio = kho_restore_folio(phys); + if (folio) + folio_put(folio); + } +} + +static void kvm_gmem_luo_unpreserve(struct liveupdate_file_op_args *args) +{ + struct guest_memfd_luo_folio_ser *folios_ser = args->private_data; + struct guest_memfd_luo_ser *ser; + long i; + + if (WARN_ON_ONCE(!args->serialized_data)) + return; + + ser = phys_to_virt(args->serialized_data); + if (!ser) + return; + + if (ser->nr_folios > 0) + kho_unpreserve_vmalloc(&ser->folios); + for (i = ser->nr_folios - 1; i >= 0; i--) { + struct folio *folio; + + if (!folios_ser[i].pfn) + continue; + + folio = pfn_folio(folios_ser[i].pfn); + kho_unpreserve_folio(folio); + } + vfree(folios_ser); + + kho_unpreserve_free(ser); + kvm_gmem_freeze(file_inode(args->file), false); +} + +static int kvm_gmem_luo_retrieve(struct liveupdate_file_op_args *args) +{ + struct guest_memfd_luo_folio_ser *folios_ser = NULL; + struct guest_memfd_luo_ser *ser; + struct kvm *kvm = NULL; + struct file *vm_file; + struct inode *inode; + struct file *file; + u64 gmem_flags = 0; + int err = 0; + long i = 0; + + if (!args->serialized_data) + return -EINVAL; + + ser = phys_to_virt(args->serialized_data); + + if (ser->flags & ~GUEST_MEMFD_LUO_SUPPORTED_FLAGS) { + err = -EOPNOTSUPP; + goto err_free_ser; + } + + if (ser->flags & GUEST_MEMFD_LUO_FLAG_MMAP) + gmem_flags |= GUEST_MEMFD_FLAG_MMAP; + if (ser->flags & GUEST_MEMFD_LUO_FLAG_INIT_SHARED) + gmem_flags |= 
GUEST_MEMFD_FLAG_INIT_SHARED; + + err = liveupdate_get_file_incoming(args->session, ser->vm_token, &vm_file); + if (err) { + pr_warn("gmem: provided VM FD token (%llx) on preserve is incorrect\n", + ser->vm_token); + goto err_free_ser; + } + + if (file_is_kvm(vm_file)) + kvm = vm_file->private_data; + + /* + * Release the temporary reference taken by the liveupdate_get_file_incoming + * call. LUO still holds a reference. + */ + fput(vm_file); + + if (!kvm) { + err = -EINVAL; + goto err_free_ser; + } + + file = __kvm_gmem_create_file(kvm, ser->size, gmem_flags); + if (IS_ERR(file)) { + err = PTR_ERR(file); + goto err_free_ser; + } + + inode = file_inode(file); + + if (ser->nr_folios) { + folios_ser = kho_restore_vmalloc(&ser->folios); + if (!folios_ser) { + err = -EINVAL; + goto err_destroy_file; + } + + for (i = 0; i < ser->nr_folios; i++) { + struct folio *folio; + phys_addr_t phys; + + if (!folios_ser[i].pfn) + continue; + + phys = PFN_PHYS(folios_ser[i].pfn); + folio = kho_restore_folio(phys); + if (!folio) { + pr_err("gmem: failed to restore folio at %llx\n", phys); + err = -EIO; + goto err_put_remaining_folios; + } + + err = filemap_add_folio(inode->i_mapping, folio, folios_ser[i].index, + GFP_KERNEL); + if (err) { + pr_err("gmem: failed to add folio to page cache\n"); + folio_put(folio); + goto err_put_remaining_folios; + } + + if (folios_ser[i].flags & GUEST_MEMFD_LUO_FOLIO_UPTODATE) + folio_mark_uptodate(folio); + folio_unlock(folio); + folio_put(folio); + } + vfree(folios_ser); + } + + args->file = file; + kho_restore_free(ser); + return 0; + +err_put_remaining_folios: + i++; +err_destroy_file: + fput(file); +err_free_ser: + if (ser->nr_folios) { + if (!folios_ser) + folios_ser = kho_restore_vmalloc(&ser->folios); + if (folios_ser) { + kvm_gmem_luo_discard_folios(folios_ser, ser->nr_folios, i); + vfree(folios_ser); + } + } + kho_restore_free(ser); + return err; +} + +static void kvm_gmem_luo_finish(struct liveupdate_file_op_args *args) +{ + struct 
guest_memfd_luo_ser *ser; + struct guest_memfd_luo_folio_ser *folios_ser; + + /* Nothing to be done here, if retrieve_status was successful or errored, + * Cleanup is taken care of in retrieval call. + */ + if (args->retrieve_status) + return; + + if (!args->serialized_data) + return; + + ser = phys_to_virt(args->serialized_data); + if (!ser) + return; + + if (ser->nr_folios) { + folios_ser = kho_restore_vmalloc(&ser->folios); + if (folios_ser) { + kvm_gmem_luo_discard_folios(folios_ser, ser->nr_folios, 0); + vfree(folios_ser); + } + } + + kho_restore_free(ser); +} + +static const struct liveupdate_file_ops kvm_gmem_luo_file_ops = { + .can_preserve = kvm_gmem_luo_can_preserve, + .preserve = kvm_gmem_luo_preserve, + .freeze = kvm_gmem_luo_freeze, + .retrieve = kvm_gmem_luo_retrieve, + .unpreserve = kvm_gmem_luo_unpreserve, + .finish = kvm_gmem_luo_finish, + .owner = THIS_MODULE, +}; + +static struct liveupdate_file_handler kvm_gmem_luo_handler = { + .ops = &kvm_gmem_luo_file_ops, + .compatible = GUEST_MEMFD_LUO_FH_COMPATIBLE, +}; + +int kvm_gmem_luo_init(void) +{ + int err = liveupdate_register_file_handler(&kvm_gmem_luo_handler); + + if (err && err != -EOPNOTSUPP) { + pr_err("Could not register luo filesystem handler: %pe\n", ERR_PTR(err)); + return err; + } + + return 0; +} + +void kvm_gmem_luo_exit(void) +{ + liveupdate_unregister_file_handler(&kvm_gmem_luo_handler); +} + diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index c70346906a89..501a5d048418 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -6580,6 +6580,10 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module) if (r) goto err_luo; + r = kvm_gmem_luo_init(); + if (r) + goto err_gmem_luo; + /* * Registration _must_ be the very last thing done, as this exposes * /dev/kvm to userspace, i.e. all infrastructure must be setup! 
@@ -6593,6 +6597,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module) return 0; err_register: + kvm_gmem_luo_exit(); +err_gmem_luo: kvm_luo_exit(); err_luo: kvm_uninit_virtualization(); @@ -6624,6 +6630,7 @@ void kvm_exit(void) */ misc_deregister(&kvm_dev); + kvm_gmem_luo_exit(); kvm_luo_exit(); kvm_uninit_virtualization(); diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h index 118edc47df83..d8ccb68e7e9b 100644 --- a/virt/kvm/kvm_mm.h +++ b/virt/kvm/kvm_mm.h @@ -100,9 +100,13 @@ static inline void kvm_gmem_unbind(struct kvm_memory_slot *slot) #ifdef CONFIG_LIVEUPDATE_GUEST_MEMFD int kvm_luo_init(void); void kvm_luo_exit(void); +int kvm_gmem_luo_init(void); +void kvm_gmem_luo_exit(void); #else static inline int kvm_luo_init(void) { return 0; } static inline void kvm_luo_exit(void) {} +static inline int kvm_gmem_luo_init(void) { return 0; } +static inline void kvm_gmem_luo_exit(void) {} #endif /* CONFIG_LIVEUPDATE_GUEST_MEMFD */ #endif /* __KVM_MM_H__ */ -- 2.54.0.1032.g2f8565e1d1-goog From tarunsahu at google.com Fri Jun 5 10:08:31 2026 From: tarunsahu at google.com (Tarun Sahu) Date: Fri, 5 Jun 2026 17:08:31 +0000 Subject: [RFC PATCH v2 06/10] kvm: guest_memfd: Add support for freezing and unfreezing mappings In-Reply-To: References: Message-ID: <48777f4749fa43d5648085dbb2037aa99c144a88.1780676742.git.tarunsahu@google.com> This patch introduces a freeze on the gmem_inode that makes fallocate calls and any new page fault allocation fail. This prevents modification of the gmem file while it is being preserved. SRCU is used to synchronise the freeze call: the writer blocks until all in-flight readers have finished, and readers are re-entrant. If a fault fails because the inode is frozen, it returns -EPERM and the VM exits to userspace. Userspace must handle this properly, as every new fault will fail while the inode is frozen.
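The freeze protocol described above can be modelled in userspace with C11 atomics. This is only a sketch under stated assumptions: a reader counter stands in for the SRCU read-side critical section, the busy-wait stands in for synchronize_srcu(), and struct demo_gmem, demo_init(), demo_allocate() and demo_freeze() are hypothetical names that do not appear in the patch. Real SRCU only waits for readers that were already in flight and sleeps rather than spins.

```c
/* Userspace model of "set flag, then wait for readers"; not kernel code. */
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state; the patch keeps the flag in gmem_inode->flags. */
struct demo_gmem {
	atomic_bool frozen;
	atomic_int readers;		/* crude stand-in for SRCU readers */
	atomic_int nr_allocated;	/* counts successful "allocations" */
};

static void demo_init(struct demo_gmem *g)
{
	atomic_init(&g->frozen, false);
	atomic_init(&g->readers, 0);
	atomic_init(&g->nr_allocated, 0);
}

/* Read side: enter, check the flag, back out with -EPERM if frozen. */
static int demo_allocate(struct demo_gmem *g)
{
	int prev, ret = 0;

	atomic_fetch_add(&g->readers, 1);
	if (atomic_load(&g->frozen))
		ret = -EPERM;
	else
		atomic_fetch_add(&g->nr_allocated, 1);
	prev = atomic_fetch_sub(&g->readers, 1);
	assert(prev > 0);
	(void)prev;
	return ret;
}

/* Write side: publish the flag first, then wait out readers already inside. */
static void demo_freeze(struct demo_gmem *g, bool freeze)
{
	atomic_store(&g->frozen, freeze);
	if (freeze)
		while (atomic_load(&g->readers) > 0)
			;	/* busy-wait; synchronize_srcu() would sleep */
}
```

The ordering matters: because the flag is published before the wait, a reader that saw the flag clear is still counted when the writer starts waiting, so demo_freeze() cannot return while such a reader is still modifying state.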
Signed-off-by: Tarun Sahu --- virt/kvm/guest_memfd.c | 117 +++++++++++++++++++++++++++++++++++++---- virt/kvm/guest_memfd.h | 5 ++ 2 files changed, 111 insertions(+), 11 deletions(-) diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c index 6740ae2bf948..b94639cdf312 100644 --- a/virt/kvm/guest_memfd.c +++ b/virt/kvm/guest_memfd.c @@ -7,11 +7,13 @@ #include #include #include +#include #include "guest_memfd.h" #include "kvm_mm.h" static struct vfsmount *kvm_gmem_mnt; +static struct srcu_struct kvm_gmem_freeze_srcu; #define kvm_gmem_for_each_file(f, inode) \ @@ -96,6 +98,7 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index) /* TODO: Support huge pages. */ struct mempolicy *policy; struct folio *folio; + int idx; /* * Fast-path: See if folio is already present in mapping to avoid @@ -105,12 +108,20 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index) if (!IS_ERR(folio)) return folio; + idx = srcu_read_lock(&kvm_gmem_freeze_srcu); + if (kvm_gmem_is_frozen(inode)) { + srcu_read_unlock(&kvm_gmem_freeze_srcu, idx); + return ERR_PTR(-EPERM); + } + policy = mpol_shared_policy_lookup(&GMEM_I(inode)->policy, index); folio = __filemap_get_folio_mpol(inode->i_mapping, index, FGP_LOCK | FGP_CREAT, mapping_gfp_mask(inode->i_mapping), policy); mpol_cond_put(policy); + srcu_read_unlock(&kvm_gmem_freeze_srcu, idx); + /* * External interfaces like kvm_gmem_get_pfn() support dealing * with hugepages to a degree, but internally, guest_memfd currently @@ -273,16 +284,30 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len) static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { + struct inode *inode = file_inode(file); int ret; + int idx; - if (!(mode & FALLOC_FL_KEEP_SIZE)) - return -EOPNOTSUPP; + idx = srcu_read_lock(&kvm_gmem_freeze_srcu); + if (kvm_gmem_is_frozen(inode)) { + srcu_read_unlock(&kvm_gmem_freeze_srcu, idx); + return -EPERM; + } - if (mode & 
~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) - return -EOPNOTSUPP; + if (!(mode & FALLOC_FL_KEEP_SIZE)) { + ret = -EOPNOTSUPP; + goto out; + } - if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) - return -EINVAL; + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE)) { + ret = -EOPNOTSUPP; + goto out; + } + + if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len)) { + ret = -EINVAL; + goto out; + } if (mode & FALLOC_FL_PUNCH_HOLE) ret = kvm_gmem_punch_hole(file_inode(file), offset, len); @@ -291,6 +316,9 @@ static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset, if (!ret) file_modified(file); + +out: + srcu_read_unlock(&kvm_gmem_freeze_srcu, idx); return ret; } @@ -944,7 +972,9 @@ static void kvm_gmem_destroy_inode(struct inode *inode) static void kvm_gmem_free_inode(struct inode *inode) { - kmem_cache_free(kvm_gmem_inode_cachep, GMEM_I(inode)); + struct gmem_inode *gi = GMEM_I(inode); + + kmem_cache_free(kvm_gmem_inode_cachep, gi); } static const struct super_operations kvm_gmem_super_operations = { @@ -1001,12 +1031,21 @@ int kvm_gmem_init(struct module *module) if (!kvm_gmem_inode_cachep) return -ENOMEM; + ret = init_srcu_struct(&kvm_gmem_freeze_srcu); + if (ret) + goto err_cache; + ret = kvm_gmem_init_mount(); - if (ret) { - kmem_cache_destroy(kvm_gmem_inode_cachep); - return ret; - } + if (ret) + goto err_srcu; + return 0; + +err_srcu: + cleanup_srcu_struct(&kvm_gmem_freeze_srcu); +err_cache: + kmem_cache_destroy(kvm_gmem_inode_cachep); + return ret; } void kvm_gmem_exit(void) @@ -1014,5 +1053,61 @@ void kvm_gmem_exit(void) kern_unmount(kvm_gmem_mnt); kvm_gmem_mnt = NULL; rcu_barrier(); + cleanup_srcu_struct(&kvm_gmem_freeze_srcu); kmem_cache_destroy(kvm_gmem_inode_cachep); } + +/** + * kvm_gmem_freeze - Freeze or unfreeze a guest_memfd inode mapping. + * @inode: The guest_memfd inode. + * @freeze: True to freeze, false to unfreeze. 
+ * + * This API is used strictly during the live update / preservation transition + * window to prevent host userspace and guest-side faults from making any + * mapping modifications (such as fallocate or page fault allocation) + * to the guest_memfd page cache. + * + * Synchronization Strategy (Sleepable RCU): + * To avoid high-contention VFS locks (like inode_lock or + * filemap_invalidate_lock) on the vCPU page fault hot paths, this subsystem + * implements a lightweight, system-wide Sleepable RCU (SRCU) mechanism + * (`kvm_gmem_freeze_srcu`): + * + * Global vs. Per-Inode SRCU + * ====================== + * A single system-wide global static `srcu_struct` is used instead of a + * per-inode SRCU structure to completely prevent unprivileged users from + * exhausting the host's per-CPU memory allocator. Because + * `init_srcu_struct()` allocates per-CPU memory via `alloc_percpu()`, which + * is not accounted by memory cgroups (memcg), + * a per-inode SRCU structure would allow a tenant to bypass cgroup limits and + * trigger a system-wide Out-of-Memory (OOM) crash simply by spawning a large + * number of guest_memfd file descriptors (bounded only by RLIMIT_NOFILE). + * + * Flag Modification Note: + * Since `GUEST_MEMFD_F_MAPPING_FROZEN` is the ONLY flag in + * `GMEM_I(inode)->flags` that is mutated dynamically at runtime (all other + * flags are creation-time flags which remain strictly read-only), there is + * no possibility of concurrent bit-modification races. Therefore, a standard + * `WRITE_ONCE` is fully safe and does not require complex `cmpxchg` + * synchronization loops. 
+ */ +void kvm_gmem_freeze(struct inode *inode, bool freeze) +{ + u64 flags = READ_ONCE(GMEM_I(inode)->flags); + + if (freeze) + flags |= GUEST_MEMFD_F_MAPPING_FROZEN; + else + flags &= ~GUEST_MEMFD_F_MAPPING_FROZEN; + + WRITE_ONCE(GMEM_I(inode)->flags, flags); + + if (freeze) + synchronize_srcu(&kvm_gmem_freeze_srcu); +} + +bool kvm_gmem_is_frozen(struct inode *inode) +{ + return READ_ONCE(GMEM_I(inode)->flags) & GUEST_MEMFD_F_MAPPING_FROZEN; +} diff --git a/virt/kvm/guest_memfd.h b/virt/kvm/guest_memfd.h index c528b046dd69..028c348a1023 100644 --- a/virt/kvm/guest_memfd.h +++ b/virt/kvm/guest_memfd.h @@ -29,11 +29,16 @@ struct gmem_inode { u64 flags; }; +/* Internal kernel-only flags (must not overlap with UAPI flags) */ +#define GUEST_MEMFD_F_MAPPING_FROZEN (1ULL << 63) + static inline struct gmem_inode *GMEM_I(struct inode *inode) { return container_of(inode, struct gmem_inode, vfs_inode); } struct file *__kvm_gmem_create_file(struct kvm *kvm, loff_t size, u64 flags); +void kvm_gmem_freeze(struct inode *inode, bool freeze); +bool kvm_gmem_is_frozen(struct inode *inode); #endif /* __KVM_GUEST_MEMFD_H__ */ -- 2.54.0.1032.g2f8565e1d1-goog From tarunsahu at google.com Fri Jun 5 10:08:33 2026 From: tarunsahu at google.com (Tarun Sahu) Date: Fri, 5 Jun 2026 17:08:33 +0000 Subject: [RFC PATCH v2 08/10] docs: add documentation for guest_memfd preservation via LUO In-Reply-To: References: Message-ID: Add the documentation under the "Preserving file descriptors" section of LUO's documentation. 
Signed-off-by: Tarun Sahu --- Documentation/core-api/liveupdate.rst | 1 + Documentation/liveupdate/vmm.rst | 103 ++++++++++++++++++++++++++ MAINTAINERS | 1 + 3 files changed, 105 insertions(+) create mode 100644 Documentation/liveupdate/vmm.rst diff --git a/Documentation/core-api/liveupdate.rst b/Documentation/core-api/liveupdate.rst index 5a292d0f3706..bac58a363151 100644 --- a/Documentation/core-api/liveupdate.rst +++ b/Documentation/core-api/liveupdate.rst @@ -34,6 +34,7 @@ The following types of file descriptors can be preserved :maxdepth: 1 ../mm/memfd_preservation + ../liveupdate/vmm Public API ========== diff --git a/Documentation/liveupdate/vmm.rst b/Documentation/liveupdate/vmm.rst new file mode 100644 index 000000000000..0cd487a0e1a6 --- /dev/null +++ b/Documentation/liveupdate/vmm.rst @@ -0,0 +1,103 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +============================= +VM & Guest_Memfd Preservation +============================= + +.. kernel-doc:: virt/kvm/kvm_luo.c + :doc: KVM VM Preservation via LUO + +.. kernel-doc:: virt/kvm/guest_memfd_luo.c + :doc: Guestmemfd Preservation via LUO + +VMM Instructions +================ + +This section describes the requirements, scope, conditions, and +ordering constraints that a Virtual Machine Monitor (VMM) must adhere +to for successful preservation and retrieval of guest_memfd files +across a Live Update Orchestrator (LUO) sequence. + +Scope and Limitations +--------------------- + +At this stage, the scope of guest_memfd preservation is restricted to: + +1. **Fully Shared guest_memfd**: + Currently, only fully shared guest_memfd is supported. Systems that + support CoCo VMs (which use private guest_memfd) do not support + preservation. + +2. **Standard Page Size**: + Only guest_memfd backed by standard page size (``PAGE_SIZE``, + order-0) pages is supported. Large/huge page backing (e.g., + hugetlb guest_memfd) is not supported.
+ +Any Virtual Machine (VM) whose memory is fully backed by such +guest_memfd files can be preserved across a live update. + +VMM Actions and Conditions during Live Update +--------------------------------------------- + +During the live update sequence, the kernel introduces a *freezing* +phase for the guest_memfd inode. Freezing prevents any modifications to +the guest_memfd page cache. Specifically, once a guest_memfd mapping is +frozen: + +- Any subsequent ``fallocate`` calls on the guest_memfd file descriptor + will fail and return ``-EPERM``. +- Any new page faults (guest-side or host-userspace-side) that require + folio allocation will fail and return ``-EPERM``. + +To prevent vCPUs or VMM helper threads from failing due to these +``-EPERM`` errors, the VMM must implement one of the following +strategies: + +1. **Pause the VM (Recommended)**: + The VMM should pause/suspend all vCPUs before invoking the + preservation or freezing of the VM and guest_memfd files. This + ensures no new page faults or memory accesses can occur while the + guest_memfd is frozen. + +2. **Handle Fault Failures**: + If the VM is not paused, the VMM must be prepared to handle VM + exits or user page fault errors resulting from the ``-EPERM`` + failures. The VMM must take appropriate action, such as + immediately pausing the VM, or aborting the live update sequence + (by tearing down or unpreserving the live update session). + +Preservation and Retrieval Ordering +----------------------------------- + +Preservation Order +~~~~~~~~~~~~~~~~~~ + +There is no strict ordering requirement for initiating the +preservation of the KVM VM file and the guest_memfd files; they are +preserved independently. If kexec is triggered with guest_memfd +preservation without preserving the VM file, kexec will fail. + +Retrieval Order +~~~~~~~~~~~~~~~ + +Similarly, there is no strict ordering required for retrieving the VM +and guest_memfd files. The files can be retrieved in any order.
+ +If guest_memfd file is retrieved and VM file is not retrieved, and +luo_finish is called, then vm_file will be lost and guest_memfd file +will be hanging around. + +VM & Guest_Memfd Preservation ABI +================================= + +.. kernel-doc:: include/linux/kho/abi/kvm.h + :doc: DOC: guest_memfd Live Update ABI + +.. kernel-doc:: include/linux/kho/abi/kvm.h + :internal: + +See Also +======== + +- :doc:`/core-api/liveupdate` +- :doc:`/userspace-api/liveupdate` diff --git a/MAINTAINERS b/MAINTAINERS index 16cba790a84d..ca459d032712 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14418,6 +14418,7 @@ L: kexec at lists.infradead.org L: kvm at vger.kernel.org S: Maintained T: git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git +F: Documentation/liveupdate/vmm.rst F: virt/kvm/guest_memfd_luo.c F: virt/kvm/kvm_luo.c -- 2.54.0.1032.g2f8565e1d1-goog From tarunsahu at google.com Fri Jun 5 10:08:34 2026 From: tarunsahu at google.com (Tarun Sahu) Date: Fri, 5 Jun 2026 17:08:34 +0000 Subject: [RFC PATCH v2 09/10] selftests: kvm: Split ____vm_create() to expose init helpers In-Reply-To: References: Message-ID: <4af286e970b7a44b539f78d746e92b91571c18fa.1780676742.git.tarunsahu@google.com> Refactor `____vm_create()` in the KVM selftest library to extract its initialization steps into separate, reusable internal helpers. Introduce `vm_init_fields()` and `vm_init_memory_properties()`. This allows advanced test setups to perform targeted VM fields or memory property initializations independently, which is required by upcoming test cases that restore preserved VMs. No functional changes are introduced for the existing tests. 
Signed-off-by: Tarun Sahu --- .../testing/selftests/kvm/include/kvm_util.h | 2 ++ tools/testing/selftests/kvm/lib/kvm_util.c | 26 +++++++++++++------ 2 files changed, 20 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h index 2ecaaa0e9965..d10cd25d0658 100644 --- a/tools/testing/selftests/kvm/include/kvm_util.h +++ b/tools/testing/selftests/kvm/include/kvm_util.h @@ -471,6 +471,8 @@ const char *vm_guest_mode_string(u32 i); void kvm_vm_free(struct kvm_vm *vmp); void kvm_vm_restart(struct kvm_vm *vmp); +void vm_init_fields(struct kvm_vm *vm, struct vm_shape shape); +void vm_init_memory_properties(struct kvm_vm *vm); void kvm_vm_release(struct kvm_vm *vmp); void kvm_vm_elf_load(struct kvm_vm *vm, const char *filename); int kvm_memfd_alloc(size_t size, bool hugepages); diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c index e08967ef7b7b..d3e6508e9863 100644 --- a/tools/testing/selftests/kvm/lib/kvm_util.c +++ b/tools/testing/selftests/kvm/lib/kvm_util.c @@ -276,13 +276,8 @@ __weak void vm_populate_gva_bitmap(struct kvm_vm *vm) (1ULL << (vm->va_bits - 1)) >> vm->page_shift); } -struct kvm_vm *____vm_create(struct vm_shape shape) +void vm_init_fields(struct kvm_vm *vm, struct vm_shape shape) { - struct kvm_vm *vm; - - vm = calloc(1, sizeof(*vm)); - TEST_ASSERT(vm != NULL, "Insufficient Memory"); - INIT_LIST_HEAD(&vm->vcpus); vm->regions.gpa_tree = RB_ROOT; vm->regions.hva_tree = RB_ROOT; @@ -380,9 +375,10 @@ struct kvm_vm *____vm_create(struct vm_shape shape) if (vm->pa_bits != 40) vm->type = KVM_VM_TYPE_ARM_IPA_SIZE(vm->pa_bits); #endif +} - vm_open(vm); - +void vm_init_memory_properties(struct kvm_vm *vm) +{ /* Limit to VA-bit canonical virtual addresses. 
*/ vm->vpages_valid = sparsebit_alloc(); vm_populate_gva_bitmap(vm); @@ -392,6 +388,20 @@ struct kvm_vm *____vm_create(struct vm_shape shape) /* Allocate and setup memory for guest. */ vm->vpages_mapped = sparsebit_alloc(); +} + +struct kvm_vm *____vm_create(struct vm_shape shape) +{ + struct kvm_vm *vm; + + vm = calloc(1, sizeof(*vm)); + TEST_ASSERT(vm != NULL, "Insufficient Memory"); + + vm_init_fields(vm, shape); + + vm_open(vm); + + vm_init_memory_properties(vm); return vm; } -- 2.54.0.1032.g2f8565e1d1-goog From tarunsahu at google.com Fri Jun 5 10:08:35 2026 From: tarunsahu at google.com (Tarun Sahu) Date: Fri, 5 Jun 2026 17:08:35 +0000 Subject: [RFC PATCH v2 10/10] selftests: kvm: Add guest_memfd_preservation_test In-Reply-To: References: Message-ID: Add a new KVM selftest `guest_memfd_preservation_test` to verify that guest memory backed by guest_memfd is preserved properly. The test leverages the Live Update Orchestrator (LUO) infrastructure to validate that memory folios and configuration layouts are successfully saved and then restored during kernel live updates, preventing any memory loss for the guest. The test uses the KVM selftests framework to create a new VM and map two memory slots to it. One slot holds the code that is executed inside the VM and the other is the guest_memfd whose memory is written by the guest code. In Phase 1: once the data is written, the VM exits and the test waits for the user to trigger the kexec. In Phase 2: a new VM is created from the retrieved VM file descriptor and again two memory slots are assigned, one for the guest code and another for the retrieved guest_memfd, whose memory is verified by the executed guest code. If verification succeeds, the test passes.
Signed-off-by: Tarun Sahu --- MAINTAINERS | 1 + tools/testing/selftests/kvm/Makefile.kvm | 6 +- .../kvm/guest_memfd_preservation_test.c | 230 ++++++++++++++++++ 3 files changed, 236 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/kvm/guest_memfd_preservation_test.c diff --git a/MAINTAINERS b/MAINTAINERS index ca459d032712..76e59620d2f1 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14419,6 +14419,7 @@ L: kvm at vger.kernel.org S: Maintained T: git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git F: Documentation/liveupdate/vmm.rst +F: tools/testing/selftests/kvm/guest_memfd_preservation_test.c F: virt/kvm/guest_memfd_luo.c F: virt/kvm/kvm_luo.c diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm index 9118a5a51b89..68584d4ee1b0 100644 --- a/tools/testing/selftests/kvm/Makefile.kvm +++ b/tools/testing/selftests/kvm/Makefile.kvm @@ -161,6 +161,8 @@ TEST_GEN_PROGS_x86 += pre_fault_memory_test # Compiled outputs used by test targets TEST_GEN_PROGS_EXTENDED_x86 += x86/nx_huge_pages_test +# Manual test that forks a persistent background daemon; skip auto CI run +TEST_GEN_PROGS_EXTENDED_x86 += guest_memfd_preservation_test TEST_GEN_PROGS_arm64 = $(TEST_GEN_PROGS_COMMON) TEST_GEN_PROGS_arm64 += arm64/aarch32_id_regs @@ -254,6 +256,7 @@ OVERRIDE_TARGETS = 1 # which causes the environment variable to override the makefile). 
include ../lib.mk include ../cgroup/lib/libcgroup.mk +include ../liveupdate/lib/libliveupdate.mk INSTALL_HDR_PATH = $(top_srcdir)/usr LINUX_HDR_PATH = $(INSTALL_HDR_PATH)/include/ @@ -308,7 +311,8 @@ LIBKVM_S := $(filter %.S,$(LIBKVM)) LIBKVM_C_OBJ := $(patsubst %.c, $(OUTPUT)/%.o, $(LIBKVM_C)) LIBKVM_S_OBJ := $(patsubst %.S, $(OUTPUT)/%.o, $(LIBKVM_S)) LIBKVM_STRING_OBJ := $(patsubst %.c, $(OUTPUT)/%.o, $(LIBKVM_STRING)) -LIBKVM_OBJS = $(LIBKVM_C_OBJ) $(LIBKVM_S_OBJ) $(LIBKVM_STRING_OBJ) $(LIBCGROUP_O) +LIBKVM_OBJS = $(LIBKVM_C_OBJ) $(LIBKVM_S_OBJ) $(LIBKVM_STRING_OBJ) \ + $(LIBCGROUP_O) $(LIBLIVEUPDATE_O) SPLIT_TEST_GEN_PROGS := $(patsubst %, $(OUTPUT)/%, $(SPLIT_TESTS)) SPLIT_TEST_GEN_OBJ := $(patsubst %, $(OUTPUT)/$(ARCH)/%.o, $(SPLIT_TESTS)) diff --git a/tools/testing/selftests/kvm/guest_memfd_preservation_test.c b/tools/testing/selftests/kvm/guest_memfd_preservation_test.c new file mode 100644 index 000000000000..74f90c5c4bf5 --- /dev/null +++ b/tools/testing/selftests/kvm/guest_memfd_preservation_test.c @@ -0,0 +1,230 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2026, Google LLC. + * + * Author: Tarun Sahu + * + * Test for VM and guest_memfd preservation across kexec (Live Update) via LUO. + * + * NOTE: This is a MANUAL test and is excluded from automated CI/testing + * frameworks because Phase 1 daemonizes into the background to pin resources + * and requires a human operator to manually trigger kexec before Phase 2 + * is executed. Running Phase 1 automatically would leak the background daemon + * and cause CI runners to falsely interpret it as a passed test. 
+ * + * Usage: + * Phase 1: ./guest_memfd_preservation_test + * Phase 2: ./guest_memfd_preservation_test --phase2 + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "kvm_util.h" +#include "processor.h" +#include "test_util.h" +#include "ucall_common.h" +#include "../kselftest.h" +#include "../kselftest_harness.h" + +#include + +#define SESSION_NAME "gmem_vm_preservation_session" +#define VM_TOKEN 0x1001 +#define GMEM_TOKEN 0x1002 + +#define GMEM_SIZE (16ULL * 1024 * 1024) +#define DATA_SIZE (5ULL * 1024 * 1024) + +static size_t page_size; + +/* Deterministic byte pattern generation based on offset */ +static inline uint8_t get_pattern_byte(size_t offset) +{ + return (uint8_t)(offset ^ 0x5A); +} + +static void guest_code_phase1(uint64_t gpa, uint64_t size, uint64_t data_size) +{ + uint8_t *mem = (uint8_t *)gpa; + size_t i; + + for (i = 0; i < data_size; i++) + mem[i] = get_pattern_byte(i); + + GUEST_DONE(); +} + +static void guest_code_phase2(uint64_t gpa, uint64_t size, uint64_t data_size) +{ + uint8_t *mem = (uint8_t *)gpa; + size_t i; + + for (i = 0; i < data_size; i++) { + uint8_t val = get_pattern_byte(i); + + __GUEST_ASSERT(mem[i] == val, + "Data mismatch at offset %lu! 
Expected 0x%x, got 0x%x", + i, val, mem[i]); + } + + GUEST_DONE(); +} + +static void do_phase1(void) +{ + uint64_t flags = GUEST_MEMFD_FLAG_MMAP | GUEST_MEMFD_FLAG_INIT_SHARED; + int gmem_fd, dev_luo_fd, session_fd, ret; + const uint64_t gpa = SZ_4G; + struct kvm_vcpu *vcpu; + const int slot = 1; + struct kvm_vm *vm; + + vm = __vm_create_shape_with_one_vcpu(VM_SHAPE_DEFAULT, &vcpu, 1, + guest_code_phase1); + gmem_fd = vm_create_guest_memfd(vm, GMEM_SIZE, flags); + vm_set_user_memory_region2(vm, slot, KVM_MEM_GUEST_MEMFD, gpa, GMEM_SIZE, NULL, + gmem_fd, 0); + + for (size_t i = 0; i < GMEM_SIZE; i += page_size) + virt_pg_map(vm, gpa + i, gpa + i); + + vcpu_args_set(vcpu, 3, gpa, GMEM_SIZE, DATA_SIZE); + + vcpu_run(vcpu); + TEST_ASSERT_EQ(get_ucall(vcpu, NULL), UCALL_DONE); + + dev_luo_fd = luo_open_device(); + TEST_ASSERT(dev_luo_fd >= 0, "Failed to open /dev/liveupdate"); + + session_fd = luo_create_session(dev_luo_fd, SESSION_NAME); + TEST_ASSERT(session_fd >= 0, "Failed to create LUO session"); + + ret = luo_session_preserve_fd(session_fd, vm->fd, VM_TOKEN); + TEST_ASSERT(ret == 0, "Failed to preserve VM file descriptor"); + + ret = luo_session_preserve_fd(session_fd, gmem_fd, GMEM_TOKEN); + TEST_ASSERT(ret == 0, "Failed to preserve guest_memfd file descriptor"); + + printf("\n============================================================\n"); + printf("Phase 1 Complete Successfully!\n"); + printf("VM file and guest_memfd file have been preserved via LUO.\n"); + printf("Tokens: VM_TOKEN=0x%x, GMEM_TOKEN=0x%x\n", VM_TOKEN, GMEM_TOKEN); + printf("Machine Size: %llu MB, Data Size: %llu MB\n", GMEM_SIZE / SZ_1M, + DATA_SIZE / SZ_1M); + printf("------------------------------------------------------------\n"); + + daemonize_and_wait(); +} + +static struct kvm_vm *vm_create_from_fd(int resurrected_vm_fd, + struct vm_shape shape) +{ + struct kvm_vm *vm; + + vm = calloc(1, sizeof(*vm)); + TEST_ASSERT(vm != NULL, "Insufficient Memory"); + + vm_init_fields(vm, shape); + + 
vm->kvm_fd = open_path_or_exit(KVM_DEV_PATH, O_RDWR); + vm->fd = resurrected_vm_fd; + + if (kvm_has_cap(KVM_CAP_BINARY_STATS_FD)) + vm->stats.fd = vm_get_stats_fd(vm); + else + vm->stats.fd = -1; + + vm_init_memory_properties(vm); + + return vm; +} + +static void do_phase2(void) +{ + int retrieved_vm_fd, retrieved_gmem_fd, dev_luo_fd, session_fd; + struct vm_shape shape = VM_SHAPE_DEFAULT; + const uint64_t gpa = SZ_4G; + struct kvm_vcpu *vcpu; + const int slot = 1; + struct kvm_vm *vm; + + dev_luo_fd = luo_open_device(); + TEST_ASSERT(dev_luo_fd >= 0, "Failed to open /dev/liveupdate"); + + session_fd = luo_retrieve_session(dev_luo_fd, SESSION_NAME); + TEST_ASSERT(session_fd >= 0, "Failed to retrieve LUO session"); + + retrieved_vm_fd = luo_session_retrieve_fd(session_fd, VM_TOKEN); + TEST_ASSERT(retrieved_vm_fd >= 0, "Failed to retrieve VM file descriptor"); + + retrieved_gmem_fd = luo_session_retrieve_fd(session_fd, GMEM_TOKEN); + TEST_ASSERT(retrieved_gmem_fd >= 0, "Failed to retrieve guest_memfd file descriptor"); + + vm = vm_create_from_fd(retrieved_vm_fd, shape); + + u64 nr_pages = 2048; /* 8MB is plenty for slot0 pages */ + + vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, 0, 0, nr_pages, 0); + kvm_vm_elf_load(vm, program_invocation_name); + + for (int i = 0; i < NR_MEM_REGIONS; i++) + vm->memslots[i] = 0; + + struct userspace_mem_region *slot0 = memslot2region(vm, 0); + + ucall_init(vm, slot0->region.guest_phys_addr + slot0->region.memory_size); + + vm_set_user_memory_region2(vm, slot, KVM_MEM_GUEST_MEMFD, gpa, GMEM_SIZE, NULL, + retrieved_gmem_fd, 0); + + for (size_t i = 0; i < GMEM_SIZE; i += page_size) + virt_pg_map(vm, gpa + i, gpa + i); + + vcpu = vm_vcpu_add(vm, 0, guest_code_phase2); + kvm_arch_vm_finalize_vcpus(vm); + + vcpu_args_set(vcpu, 3, gpa, GMEM_SIZE, DATA_SIZE); + + printf("Resuming / Running VM in Phase 2...\n"); + vcpu_run(vcpu); + TEST_ASSERT_EQ(get_ucall(vcpu, NULL), UCALL_DONE); + + printf("\nSUCCESS: Phase 2 Complete! 
All 5MB complex data verified intact!\n"); + + luo_session_finish(session_fd); + close(session_fd); + close(dev_luo_fd); + /* This will also close the vm_fd */ + kvm_vm_free(vm); + close(retrieved_gmem_fd); +} + +int main(int argc, char *argv[]) +{ + bool phase2 = false; + + TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD)); + page_size = getpagesize(); + + for (int i = 1; i < argc; i++) { + if (strcmp(argv[i], "--phase2") == 0) + phase2 = true; + } + + if (phase2) + do_phase2(); + else + do_phase1(); + + return 0; +} -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:33 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:33 +0200 Subject: [PATCH v2 00/18] kho: make boot time huge page allocation work nicely with KHO Message-ID: <20260605183501.3884950-1-pratyush@kernel.org> From: "Pratyush Yadav (Google)" Hi, Gigantic huge page allocation is currently somewhat broken with KHO. First, gigantic pages break scratch size accounting. Since they are allocated using the memblock alloc APIs, they count towards RSRV_KERN, and thus towards the scratch size when using scratch_scale. This means that if huge pages take a large enough chunk of system memory, the scratch size will blow up and the scratch allocation will fail. Second, scratch cannot contain preserved memory, and if hugepages are allocated from scratch, they will fail to be preserved with the upcoming hugetlb preservation series [0]. Fix this by introducing the concept of extended scratch areas. They are areas that the kernel discovers on boot by walking the radix tree and finding free memory ranges. See patch 10 for more details. Discovering the scratch areas needs some preparatory changes to KHO, the radix tree APIs, and to memblock. Patches 1-14 do that. Patch 15 adds the scratch discovery logic. Patch 16 adds the dedicated memblock hugetlb allocator. Patches 17-18 fix the scratch size calculation when using scratch_scale.
[0] https://lore.kernel.org/linux-mm/20251206230222.853493-1-pratyush@kernel.org/T/#u

Changes in v2:

Detailed changelog below. At a high level, the major change in this version is to remove MEMBLOCK_KHO_SCRATCH_EXT. Keep MEMBLOCK_KHO_SCRATCH as the only memory type and mark the discovered areas with it. For HugeTLB, add a dedicated allocation routine and if allocated memory lands in scratch, do a retry. Also introduce MEMBLOCK_RSRV_HUGETLB to help with accounting of scratch area sizes.

- Fixup commit message in patch 1 to make namespacing change clearer.
- Use @key in kernel-doc for radix functions.
- Add a runtime check on key width.
- Move all mem retrieval logic to kho_mem_retrieve().
- Add a comment in kho_mem_retrieve() explaining why mem_map won't be NULL.
- Rename callbacks to ->leaf() and ->node().
- Fixup commit messages.
- Clear tree->root in kho_radix_destroy_tree(). This lets the tree be re-initialized by calling kho_radix_init_tree().
- Add kho_get_mem_map() earlier in the series.
- Export kho_scratch_overlap() and use it in memblock_is_kho_scratch_memory().
- Get rid of MEMBLOCK_KHO_SCRATCH_EXT.
- Introduce MEMBLOCK_RSRV_HUGETLB.
- Introduce memblock_alloc_hugetlb() for hugetlb bootmem allocations.
- Refactor memblock_reserved_kern_size() to allow calculating size by flags.
- Exclude hugetlb memory from scratch size calculation.
- Collect R-bys.
Regards, Pratyush Yadav Pratyush Yadav (Google) (18): kho: generalize radix tree APIs kho: disallow wide keys in radix tree kho: return virtual address of mem_map kho: store incoming radix tree in kho_in kho: move all memory retrieval logic to kho_mem_retrieve() kho: add a struct for radix callbacks kho: add callback for table pages kho: add data argument to radix walk callback kho: allow early-boot usage of the KHO radix tree kho: allow destroying KHO radix tree kho: add kho_radix_init_tree() kho: export kho_scratch_overlap() kho: initialize kho_scratch pointer earlier in boot memblock: use kho_scratch_overlap() to decide migratetype kho: extend scratch memblock: make HugeTLB bootmem allocation work with KHO memblock: allow calculating reserved size by flags kho: exclude hugetlb memory from scratch size calculation include/linux/kexec_handover.h | 10 + include/linux/kho/abi/kexec_handover.h | 8 + include/linux/kho_radix_tree.h | 44 +- include/linux/memblock.h | 9 +- kernel/liveupdate/Makefile | 1 - kernel/liveupdate/kexec_handover.c | 495 +++++++++++++++----- kernel/liveupdate/kexec_handover_debug.c | 25 - kernel/liveupdate/kexec_handover_internal.h | 9 - mm/hugetlb.c | 22 +- mm/memblock.c | 120 ++++- mm/mm_init.c | 1 + 11 files changed, 540 insertions(+), 204 deletions(-) delete mode 100644 kernel/liveupdate/kexec_handover_debug.c base-commit: 2935777b418d2bfcbfe96705bb2c0fa6c0d94e18 -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:34 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:34 +0200 Subject: [PATCH v2 01/18] kho: generalize radix tree APIs In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-2-pratyush@kernel.org> From: "Pratyush Yadav (Google)" The KHO radix tree is a data structure that can track the presence or absence of an arbitrary key, with nothing inherently tied to KHO memory preservation 
tracking. This was one of the design goals of the radix tree. This was done to enable it to be re-used by other users of KHO. Despite that, the radix tree APIs are very closely tied to KHO memory preservation tracking. Adding a key is done by kho_radix_add_page(), which encodes it as a page tracking operation and takes in PFN and order. kho_radix_del_page() does the same. These functions internally encode the key that goes into the radix tree. kho_radix_walk_tree() does the same by baking the PFN and order into the callback arguments. Generalize the APIs by taking the key directly and doing the encoding at the callers. Rename the functions to kho_radix_add_key() and kho_radix_del_key(). In practice, this removes a line each from the functions and moves the encoding function call to the callers. Similarly, update kho_radix_tree_walk_callback_t to take the key directly. Now that key encoding is no longer an inherent part of the radix tree and can be decided by the user, rename kho_radix_{encode,decode}_key() to kho_{encode,decode}_radix_key(). This moves them out of the "kho_radix_" namespace into the "kho_" namespace. This emphasizes that this is KHO's way of encoding the key for its radix tree.
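For illustration only, here is a userspace sketch of the key encoding that this change moves to the callers. It mirrors the arithmetic visible in the diff below, but several parts are assumptions: the page shift of 12, the order-0 bit position of 64 minus the page shift, the low part of the key (the physical address shifted right by page shift plus order), and the decode body, which is reconstructed from the kernel-doc text rather than quoted. All demo_* names are hypothetical stand-ins and not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration; the kernel takes these from its headers. */
#define DEMO_PAGE_SHIFT   12
#define DEMO_ORDER_0_LOG2 (64 - DEMO_PAGE_SHIFT)

/*
 * Hypothetical stand-in for the kernel's fls64(): returns the 1-based
 * index of the highest set bit, or 0 when x is 0.
 */
static unsigned int demo_fls64(uint64_t x)
{
	unsigned int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Encode an order-aligned physical address and its order into one key. */
static uint64_t demo_encode_key(uint64_t phys, unsigned int order)
{
	/* Order bit: its position identifies the order. */
	uint64_t h = 1ULL << (DEMO_ORDER_0_LOG2 - order);
	/* Shifted physical address (assumed form of the low part). */
	uint64_t l = phys >> (DEMO_PAGE_SHIFT + order);

	return h | l;
}

/* Reverse of demo_encode_key(); only valid for keys it produced. */
static uint64_t demo_decode_key(uint64_t key, unsigned int *order)
{
	unsigned int order_bit;

	assert(key != 0);	/* a valid key always carries an order bit */
	order_bit = demo_fls64(key);

	/* The highest set bit is the order bit; its index gives the order. */
	*order = DEMO_ORDER_0_LOG2 - order_bit + 1;
	/* The left shift pushes the order bit out of the 64-bit value. */
	return key << (DEMO_PAGE_SHIFT + *order);
}

/* Returns 1 when encoding and then decoding gives back the inputs. */
static int demo_roundtrip_ok(uint64_t phys, unsigned int order)
{
	unsigned int got_order;
	uint64_t key = demo_encode_key(phys, order);

	return demo_decode_key(key, &got_order) == phys && got_order == order;
}
```

With the generalized API, a caller in the style of kho_preserve_folio() would compute such a key itself and hand the result to kho_radix_add_key(); the tree then only sees an opaque unsigned long.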
Reviewed-by: Pasha Tatashin Signed-off-by: Pratyush Yadav (Google) --- include/linux/kho_radix_tree.h | 18 +++---- kernel/liveupdate/kexec_handover.c | 76 ++++++++++++++---------------- 2 files changed, 42 insertions(+), 52 deletions(-) diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h index 84e918b96e53..f368f3b9f923 100644 --- a/include/linux/kho_radix_tree.h +++ b/include/linux/kho_radix_tree.h @@ -34,30 +34,24 @@ struct kho_radix_tree { struct mutex lock; /* protects the tree's structure and root pointer */ }; -typedef int (*kho_radix_tree_walk_callback_t)(phys_addr_t phys, - unsigned int order); +typedef int (*kho_radix_tree_walk_callback_t)(unsigned long key); #ifdef CONFIG_KEXEC_HANDOVER -int kho_radix_add_page(struct kho_radix_tree *tree, unsigned long pfn, - unsigned int order); - -void kho_radix_del_page(struct kho_radix_tree *tree, unsigned long pfn, - unsigned int order); - +int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key); +void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key); int kho_radix_walk_tree(struct kho_radix_tree *tree, kho_radix_tree_walk_callback_t cb); #else /* #ifdef CONFIG_KEXEC_HANDOVER */ -static inline int kho_radix_add_page(struct kho_radix_tree *tree, long pfn, - unsigned int order) +static inline int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key) { return -EOPNOTSUPP; } -static inline void kho_radix_del_page(struct kho_radix_tree *tree, - unsigned long pfn, unsigned int order) { } +static inline void kho_radix_del_key(struct kho_radix_tree *tree, + unsigned long key) { } static inline int kho_radix_walk_tree(struct kho_radix_tree *tree, kho_radix_tree_walk_callback_t cb) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 4834a809985a..7349cc82f6dc 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -85,7 +85,7 @@ static struct kho_out kho_out = { }; /** - * 
kho_radix_encode_key - Encodes a physical address and order into a radix key. + * kho_encode_radix_key - Encodes a physical address and order into a radix key. * @phys: The physical address of the page. * @order: The order of the page. * @@ -95,7 +95,7 @@ static struct kho_out kho_out = { * * Return: The encoded unsigned long radix key. */ -static unsigned long kho_radix_encode_key(phys_addr_t phys, unsigned int order) +static unsigned long kho_encode_radix_key(phys_addr_t phys, unsigned int order) { /* Order bits part */ unsigned long h = 1UL << (KHO_ORDER_0_LOG2 - order); @@ -106,17 +106,17 @@ static unsigned long kho_radix_encode_key(phys_addr_t phys, unsigned int order) } /** - * kho_radix_decode_key - Decodes a radix key back into a physical address and order. + * kho_decode_radix_key - Decodes a radix key back into a physical address and order. * @key: The unsigned long key to decode. * @order: An output parameter, a pointer to an unsigned int where the decoded * page order will be stored. * - * This function reverses the encoding performed by kho_radix_encode_key(), + * This function reverses the encoding performed by kho_encode_radix_key(), * extracting the original physical address and page order from a given key. * * Return: The decoded physical address. */ -static phys_addr_t kho_radix_decode_key(unsigned long key, unsigned int *order) +static phys_addr_t kho_decode_radix_key(unsigned long key, unsigned int *order) { unsigned int order_bit = fls64(key); phys_addr_t phys; @@ -144,24 +144,21 @@ static unsigned long kho_radix_get_table_index(unsigned long key, } /** - * kho_radix_add_page - Marks a page as preserved in the radix tree. + * kho_radix_add_key - Add a key to the radix tree. * @tree: The KHO radix tree. - * @pfn: The page frame number of the page to preserve. - * @order: The order of the page. + * @key: The key to add. * - * This function traverses the radix tree based on the key derived from @pfn - * and @order. 
It sets the corresponding bit in the leaf bitmap to mark the - * page for preservation. If intermediate nodes do not exist along the path, - * they are allocated and added to the tree. + * This function traverses the radix tree based on the @key provided. It sets the + * corresponding bit in the leaf bitmap to mark the @key as present. If + * intermediate nodes do not exist along the path, they are allocated and added + * to the tree. * * Return: 0 on success, or a negative error code on failure. */ -int kho_radix_add_page(struct kho_radix_tree *tree, - unsigned long pfn, unsigned int order) +int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key) { /* Newly allocated nodes for error cleanup */ struct kho_radix_node *intermediate_nodes[KHO_TREE_MAX_DEPTH] = { 0 }; - unsigned long key = kho_radix_encode_key(PFN_PHYS(pfn), order); struct kho_radix_node *anchor_node = NULL; struct kho_radix_node *node = tree->root; struct kho_radix_node *new_node; @@ -224,22 +221,19 @@ int kho_radix_add_page(struct kho_radix_tree *tree, return err; } -EXPORT_SYMBOL_GPL(kho_radix_add_page); +EXPORT_SYMBOL_GPL(kho_radix_add_key); /** - * kho_radix_del_page - Removes a page's preservation status from the radix tree. + * kho_radix_del_key - Removes the key from the radix tree. * @tree: The KHO radix tree. - * @pfn: The page frame number of the page to unpreserve. - * @order: The order of the page. + * @key: The key to remove. * * This function traverses the radix tree and clears the bit corresponding to - * the page, effectively removing its "preserved" status. It does not free - * the tree's intermediate nodes, even if they become empty. + * the @key, effectively removing it from the tree. It does not free the tree's + * intermediate nodes, even if they become empty. 
*/ -void kho_radix_del_page(struct kho_radix_tree *tree, unsigned long pfn, - unsigned int order) +void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key) { - unsigned long key = kho_radix_encode_key(PFN_PHYS(pfn), order); struct kho_radix_node *node = tree->root; struct kho_radix_leaf *leaf; unsigned int i, idx; @@ -270,21 +264,18 @@ void kho_radix_del_page(struct kho_radix_tree *tree, unsigned long pfn, idx = kho_radix_get_bitmap_index(key); __clear_bit(idx, leaf->bitmap); } -EXPORT_SYMBOL_GPL(kho_radix_del_page); +EXPORT_SYMBOL_GPL(kho_radix_del_key); static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, unsigned long key, kho_radix_tree_walk_callback_t cb) { unsigned long *bitmap = (unsigned long *)leaf; - unsigned int order; - phys_addr_t phys; unsigned int i; int err; for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) { - phys = kho_radix_decode_key(key | i, &order); - err = cb(phys, order); + err = cb(key | i); if (err) return err; } @@ -332,15 +323,14 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root, } /** - * kho_radix_walk_tree - Traverses the radix tree and calls a callback for each preserved page. + * kho_radix_walk_tree - Traverses the radix tree and calls a callback for each key. * @tree: A pointer to the KHO radix tree to walk. * @cb: A callback function of type kho_radix_tree_walk_callback_t that will be - * invoked for each preserved page found in the tree. The callback receives - * the physical address and order of the preserved page. + * invoked for each key in the tree. * * This function walks the radix tree, searching from the specified top level - * down to the lowest level (level 0). For each preserved page found, it invokes - * the provided callback, passing the page's physical address and order. + * down to the lowest level (level 0). For each key found, it invokes the + * provided callback. 
* * Return: 0 if the walk completed the specified tree, or the non-zero return * value from the callback that stopped the walk. @@ -484,13 +474,16 @@ static struct page *__init kho_get_preserved_page(phys_addr_t phys, return pfn_to_page(pfn); } -static int __init kho_preserved_memory_reserve(phys_addr_t phys, - unsigned int order) +static int __init kho_preserved_memory_reserve(unsigned long key) { union kho_page_info info; struct page *page; + unsigned int order; + phys_addr_t phys; u64 sz; + phys = kho_decode_radix_key(key, &order); + sz = 1 << (order + PAGE_SHIFT); page = kho_get_preserved_page(phys, order); @@ -859,7 +852,8 @@ int kho_preserve_folio(struct folio *folio) if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order))) return -EINVAL; - return kho_radix_add_page(tree, pfn, order); + return kho_radix_add_key(tree, kho_encode_radix_key(PFN_PHYS(pfn), + order)); } EXPORT_SYMBOL_GPL(kho_preserve_folio); @@ -877,7 +871,7 @@ void kho_unpreserve_folio(struct folio *folio) const unsigned long pfn = folio_pfn(folio); const unsigned int order = folio_order(folio); - kho_radix_del_page(tree, pfn, order); + kho_radix_del_key(tree, kho_encode_radix_key(PFN_PHYS(pfn), order)); } EXPORT_SYMBOL_GPL(kho_unpreserve_folio); @@ -906,7 +900,8 @@ static void __kho_unpreserve(struct kho_radix_tree *tree, while (pfn < end_pfn) { order = __kho_preserve_pages_order(pfn, end_pfn); - kho_radix_del_page(tree, pfn, order); + kho_radix_del_key(tree, kho_encode_radix_key(PFN_PHYS(pfn), + order)); pfn += 1 << order; } @@ -939,7 +934,8 @@ int kho_preserve_pages(struct page *page, unsigned long nr_pages) while (pfn < end_pfn) { unsigned int order = __kho_preserve_pages_order(pfn, end_pfn); - err = kho_radix_add_page(tree, pfn, order); + err = kho_radix_add_key(tree, kho_encode_radix_key(PFN_PHYS(pfn), + order)); if (err) { failed_pfn = pfn; break; -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:35 2026 From: pratyush at kernel.org (Pratyush 
Yadav) Date: Fri, 5 Jun 2026 20:34:35 +0200 Subject: [PATCH v2 02/18] kho: disallow wide keys in radix tree In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-3-pratyush@kernel.org> From: "Pratyush Yadav (Google)" The KHO radix tree was designed to track preserved pages. So it does not provide the capability to track any 64-bit key. Instead, it limits the key width to how much it needs for tracking PFNs and their orders. Limiting the width reduces the number of levels in the tree. KHO is not expected to be the only user of the radix tree. With the API generalized to allow other users, now it is possible to add any key to the tree. Check the key width at kho_radix_add_key(), and error out if it exceeds what the tree can handle. Do this instead of increasing the tree depth since right now there are no users that need to use wider keys, so this avoids memory overhead and ABI breakage. Signed-off-by: Pratyush Yadav (Google) --- include/linux/kho/abi/kexec_handover.h | 8 ++++++++ kernel/liveupdate/kexec_handover.c | 12 ++++++++++++ 2 files changed, 20 insertions(+) diff --git a/include/linux/kho/abi/kexec_handover.h b/include/linux/kho/abi/kexec_handover.h index fb2d37417ad9..6dbb98bfb586 100644 --- a/include/linux/kho/abi/kexec_handover.h +++ b/include/linux/kho/abi/kexec_handover.h @@ -278,6 +278,14 @@ enum kho_radix_consts { KHO_TABLE_SIZE_LOG2) + 1, }; +/* + * The maximum key width this radix tree can track. + * + * This value isn't ABI itself, but it is derived from values that are ABI. 
+ */ +#define KHO_RADIX_KEY_WIDTH (((KHO_TREE_MAX_DEPTH - 1) * KHO_TABLE_SIZE_LOG2) + \ + KHO_BITMAP_SIZE_LOG2) + struct kho_radix_node { u64 table[1 << KHO_TABLE_SIZE_LOG2]; }; diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 7349cc82f6dc..e8454dc5b489 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -153,6 +153,11 @@ static unsigned long kho_radix_get_table_index(unsigned long key, * intermediate nodes do not exist along the path, they are allocated and added * to the tree. * + * NOTE: Currently only keys of width up to %KHO_RADIX_KEY_WIDTH are supported. + * This limit only exists because current users of the radix tree don't use more + * than that. Changing the maximum width requires changing the tree depth, which + * needs bumping the ABI version. + * * Return: 0 on success, or a negative error code on failure. */ int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key) @@ -169,6 +174,9 @@ int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key) if (WARN_ON_ONCE(!tree->root)) return -EINVAL; + if (unlikely(fls64(key) > KHO_RADIX_KEY_WIDTH)) + return -ERANGE; + might_sleep(); guard(mutex)(&tree->lock); @@ -241,6 +249,10 @@ void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key) if (WARN_ON_ONCE(!tree->root)) return; + /* Keys wider than KHO_RADIX_KEY_WIDTH are not allowed to be added. 
*/ + if (unlikely(fls64(key) > KHO_RADIX_KEY_WIDTH)) + return; + might_sleep(); guard(mutex)(&tree->lock); -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:36 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:36 +0200 Subject: [PATCH v2 03/18] kho: return virtual address of mem_map In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-4-pratyush@kernel.org> From: "Pratyush Yadav (Google)" Currently it is only used by kho_populate(), which doesn't care whether the address is virtual or physical and only cares that it exists and is valid. In coming patches, more callers will be added, all of which will need the virtual address. Make things simpler by directly returning the virtual address. Rename kho_get_mem_map_phys() to kho_get_mem_map() to accurately reflect what it returns. Signed-off-by: Pratyush Yadav (Google) --- kernel/liveupdate/kexec_handover.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index e8454dc5b489..d8dd0ede4f87 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -509,10 +509,11 @@ static int __init kho_preserved_memory_reserve(unsigned long key) return 0; } -/* Returns physical address of the preserved memory map from FDT */ -static phys_addr_t __init kho_get_mem_map_phys(const void *fdt) +/* Returns virtual address of the preserved memory map from FDT */ +static __init void *kho_get_mem_map(const void *fdt) { const void *mem_ptr; + phys_addr_t mem_map_phys; int len; mem_ptr = fdt_getprop(fdt, 0, KHO_FDT_MEMORY_MAP_PROP_NAME, &len); @@ -521,7 +522,11 @@ static phys_addr_t __init kho_get_mem_map_phys(const void *fdt) return 0; } - return get_unaligned((const u64 *)mem_ptr); + mem_map_phys = get_unaligned((const u64 *)mem_ptr); + if 
(!mem_map_phys) + return NULL; + + return phys_to_virt(mem_map_phys); } /* @@ -1644,8 +1649,7 @@ void __init kho_populate(phys_addr_t fdt_phys, u64 fdt_len, { unsigned int scratch_cnt = scratch_len / sizeof(*kho_scratch); struct kho_scratch *scratch = NULL; - phys_addr_t mem_map_phys; - void *fdt = NULL; + void *fdt = NULL, *mem_map; bool populated = false; int err; @@ -1668,8 +1672,8 @@ void __init kho_populate(phys_addr_t fdt_phys, u64 fdt_len, goto unmap_fdt; } - mem_map_phys = kho_get_mem_map_phys(fdt); - if (!mem_map_phys) + mem_map = kho_get_mem_map(fdt); + if (!mem_map) goto unmap_fdt; scratch = early_memremap(scratch_phys, scratch_len); -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:37 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:37 +0200 Subject: [PATCH v2 04/18] kho: store incoming radix tree in kho_in In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-5-pratyush@kernel.org> From: "Pratyush Yadav (Google)" This allows other functions to also use the radix tree. While at it, also use kho_get_mem_map() instead of duplicating the code to get the radix tree root from the FDT.
Signed-off-by: Pratyush Yadav (Google) --- kernel/liveupdate/kexec_handover.c | 32 ++++++++++++------------------ 1 file changed, 13 insertions(+), 19 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index d8dd0ede4f87..61e436f5077e 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1334,6 +1334,7 @@ struct kho_in { char previous_release[__NEW_UTS_LEN + 1]; u32 kexec_count; struct kho_debugfs dbg; + struct kho_radix_tree radix_tree; }; static struct kho_in kho_in = { @@ -1413,24 +1414,15 @@ EXPORT_SYMBOL_GPL(kho_retrieve_subtree); static int __init kho_mem_retrieve(const void *fdt) { - struct kho_radix_tree tree; - const phys_addr_t *mem; - int len; - - /* Retrieve the KHO radix tree from passed-in FDT. */ - mem = fdt_getprop(fdt, 0, KHO_FDT_MEMORY_MAP_PROP_NAME, &len); - - if (!mem || len != sizeof(*mem)) { - pr_err("failed to get preserved KHO memory tree\n"); - return -ENOENT; - } - - if (!*mem) - return -EINVAL; - - tree.root = phys_to_virt(*mem); - mutex_init(&tree.lock); - return kho_radix_walk_tree(&tree, kho_preserved_memory_reserve); + /* + * kho_get_mem_map() should always succeed. If it fails, kho_populate() + * catches that and never sets kho_in.scratch_phys, which stops memory + * retrieval. 
+ */ + kho_in.radix_tree.root = kho_get_mem_map(fdt); + mutex_init(&kho_in.radix_tree.lock); + return kho_radix_walk_tree(&kho_in.radix_tree, + kho_preserved_memory_reserve); } static __init int kho_out_fdt_setup(void) @@ -1637,8 +1629,10 @@ void __init kho_memory_init(void) if (kho_in.scratch_phys) { kho_scratch = phys_to_virt(kho_in.scratch_phys); - if (kho_mem_retrieve(kho_get_fdt())) + if (kho_mem_retrieve(kho_get_fdt())) { kho_in.fdt_phys = 0; + kho_in.radix_tree.root = NULL; + } } else { kho_reserve_scratch(); } -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:38 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:38 +0200 Subject: [PATCH v2 05/18] kho: move all memory retrieval logic to kho_mem_retrieve() In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-6-pratyush@kernel.org> From: "Pratyush Yadav (Google)" The memory retrieval logic is spread out across kho_mem_retrieve() and kho_memory_init(). The incoming scratch area is initialized at kho_memory_init(), and the error handling is done there too. Consolidate all this logic into kho_mem_retrieve() to make the code cleaner. 
Signed-off-by: Pratyush Yadav (Google) --- kernel/liveupdate/kexec_handover.c | 31 ++++++++++++++++++------------ 1 file changed, 19 insertions(+), 12 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 61e436f5077e..7e556afae283 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1412,8 +1412,13 @@ int kho_retrieve_subtree(const char *name, phys_addr_t *phys, size_t *size) } EXPORT_SYMBOL_GPL(kho_retrieve_subtree); -static int __init kho_mem_retrieve(const void *fdt) +static void __init kho_mem_retrieve(void) { + const void *fdt = kho_get_fdt(); + int err; + + kho_scratch = phys_to_virt(kho_in.scratch_phys); + /* * kho_get_mem_map() should always succeed. If it fails, kho_populate() * catches that and never sets kho_in.scratch_phys, which stops memory @@ -1421,8 +1426,16 @@ static int __init kho_mem_retrieve(const void *fdt) */ kho_in.radix_tree.root = kho_get_mem_map(fdt); mutex_init(&kho_in.radix_tree.lock); - return kho_radix_walk_tree(&kho_in.radix_tree, - kho_preserved_memory_reserve); + + err = kho_radix_walk_tree(&kho_in.radix_tree, kho_preserved_memory_reserve); + if (err) { + /* + * Failed to initialize preserved memory. Clear FDT and radix + * so KHO users don't treat it as a KHO boot. 
+ */ + kho_in.fdt_phys = 0; + kho_in.radix_tree.root = NULL; + } } static __init int kho_out_fdt_setup(void) @@ -1626,16 +1639,10 @@ fs_initcall(kho_init); void __init kho_memory_init(void) { - if (kho_in.scratch_phys) { - kho_scratch = phys_to_virt(kho_in.scratch_phys); - - if (kho_mem_retrieve(kho_get_fdt())) { - kho_in.fdt_phys = 0; - kho_in.radix_tree.root = NULL; - } - } else { + if (kho_in.scratch_phys) + kho_mem_retrieve(); + else kho_reserve_scratch(); - } } void __init kho_populate(phys_addr_t fdt_phys, u64 fdt_len, -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:39 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:39 +0200 Subject: [PATCH v2 06/18] kho: add a struct for radix callbacks In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-7-pratyush@kernel.org> From: "Pratyush Yadav (Google)" A future commit will add more callbacks for the KHO radix tree. Add a struct for collecting the callbacks. Signed-off-by: Pratyush Yadav (Google) --- include/linux/kho_radix_tree.h | 15 ++++++++++++--- kernel/liveupdate/kexec_handover.c | 27 +++++++++++++++------------ 2 files changed, 27 insertions(+), 15 deletions(-) diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h index f368f3b9f923..426a9cc9bcde 100644 --- a/include/linux/kho_radix_tree.h +++ b/include/linux/kho_radix_tree.h @@ -34,14 +34,23 @@ struct kho_radix_tree { struct mutex lock; /* protects the tree's structure and root pointer */ }; -typedef int (*kho_radix_tree_walk_callback_t)(unsigned long key); +/** + * struct kho_radix_walk_cb - Callbacks for KHO radix tree walk. + * @leaf: Called on each present key in the radix tree. + * + * For each callback, a return value of 0 continues the walk and a non-zero + * return value is directly returned to the caller. 
+ */ +struct kho_radix_walk_cb { + int (*leaf)(unsigned long key); +}; #ifdef CONFIG_KEXEC_HANDOVER int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key); void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key); int kho_radix_walk_tree(struct kho_radix_tree *tree, - kho_radix_tree_walk_callback_t cb); + const struct kho_radix_walk_cb *cb); #else /* #ifdef CONFIG_KEXEC_HANDOVER */ @@ -54,7 +63,7 @@ static inline void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key) { } static inline int kho_radix_walk_tree(struct kho_radix_tree *tree, - kho_radix_tree_walk_callback_t cb) + const struct kho_radix_walk_cb *cb) { return -EOPNOTSUPP; } diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 7e556afae283..dbe075348ce4 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -278,16 +278,18 @@ void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key) } EXPORT_SYMBOL_GPL(kho_radix_del_key); -static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, - unsigned long key, - kho_radix_tree_walk_callback_t cb) +static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, unsigned long key, + const struct kho_radix_walk_cb *cb) { unsigned long *bitmap = (unsigned long *)leaf; unsigned int i; int err; + if (!cb->leaf) + return 0; + for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) { - err = cb(key | i); + err = cb->leaf(key | i); if (err) return err; } @@ -297,7 +299,7 @@ static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, static int __kho_radix_walk_tree(struct kho_radix_node *root, unsigned int level, unsigned long start, - kho_radix_tree_walk_callback_t cb) + const struct kho_radix_walk_cb *cb) { struct kho_radix_node *node; struct kho_radix_leaf *leaf; @@ -337,18 +339,16 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root, /** * kho_radix_walk_tree - Traverses the radix tree and calls a callback for each key. 
* @tree: A pointer to the KHO radix tree to walk. - * @cb: A callback function of type kho_radix_tree_walk_callback_t that will be - * invoked for each key in the tree. + * @cb: Set of callbacks to be invoked during the tree walk. * - * This function walks the radix tree, searching from the specified top level - * down to the lowest level (level 0). For each key found, it invokes the - * provided callback. + * This function walks the radix tree, searching from the top level down to the + * lowest level (level 0), invoking the appropriate callbacks. * * Return: 0 if the walk completed the specified tree, or the non-zero return * value from the callback that stopped the walk. */ int kho_radix_walk_tree(struct kho_radix_tree *tree, - kho_radix_tree_walk_callback_t cb) + const struct kho_radix_walk_cb *cb) { if (WARN_ON_ONCE(!tree->root)) return -EINVAL; @@ -1414,6 +1414,9 @@ EXPORT_SYMBOL_GPL(kho_retrieve_subtree); static void __init kho_mem_retrieve(void) { + const struct kho_radix_walk_cb cb = { + .leaf = kho_preserved_memory_reserve, + }; const void *fdt = kho_get_fdt(); int err; @@ -1427,7 +1430,7 @@ static void __init kho_mem_retrieve(void) kho_in.radix_tree.root = kho_get_mem_map(fdt); mutex_init(&kho_in.radix_tree.lock); - err = kho_radix_walk_tree(&kho_in.radix_tree, kho_preserved_memory_reserve); + err = kho_radix_walk_tree(&kho_in.radix_tree, &cb); if (err) { /* * Failed to initialize preserved memory. Clear FDT and radix -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:40 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:40 +0200 Subject: [PATCH v2 07/18] kho: add callback for table pages In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-8-pratyush@kernel.org> From: "Pratyush Yadav (Google)" The KHO memory preservation radix tree does not mark the table pages themselves as preserved. 
This is done to avoid a circular dependency where preserving a page can lead to allocating other preserved pages. This means any walker looking for free ranges of memory outside of scratch areas will ignore the table pages. Add a table callback that is invoked for each table page. The callback is given the physical address of the table page. This is useful for the upcoming mechanism that discovers blocks of memory with no preserved pages and lets them be used for boot memory. Another use case is for users of the radix tree other than KHO itself. The radix tree does not preserve its own pages due to the circular dependency described above. But external users of the radix tree would need to preserve and restore their pages for the radix tree to survive past early boot. They can use this callback to do so. Signed-off-by: Pratyush Yadav (Google) --- include/linux/kho_radix_tree.h | 3 +++ kernel/liveupdate/kexec_handover.c | 12 ++++++++++++ 2 files changed, 15 insertions(+) diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h index 426a9cc9bcde..ac7ba7e567e1 100644 --- a/include/linux/kho_radix_tree.h +++ b/include/linux/kho_radix_tree.h @@ -37,12 +37,15 @@ struct kho_radix_tree { /** * struct kho_radix_walk_cb - Callbacks for KHO radix tree walk. * @leaf: Called on each present key in the radix tree. + * @node: Called on each node of the radix tree itself. Receives the + * physical address of the page containing the node. * * For each callback, a return value of 0 continues the walk and a non-zero * return value is directly returned to the caller.
*/ struct kho_radix_walk_cb { int (*leaf)(unsigned long key); + int (*node)(phys_addr_t phys); }; #ifdef CONFIG_KEXEC_HANDOVER diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index dbe075348ce4..94f18fe42c4b 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -285,6 +285,12 @@ static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, unsigned long key, unsigned int i; int err; + if (cb->node) { + err = cb->node(virt_to_phys(leaf)); + if (err) + return err; + } + if (!cb->leaf) return 0; @@ -307,6 +313,12 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root, unsigned int shift; int err; + if (cb->node) { + err = cb->node(virt_to_phys(root)); + if (err) + return err; + } + for (i = 0; i < PAGE_SIZE / sizeof(phys_addr_t); i++) { if (!root->table[i]) continue; -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:41 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:41 +0200 Subject: [PATCH v2 08/18] kho: add data argument to radix walk callback In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-9-pratyush@kernel.org> From: "Pratyush Yadav (Google)" Add an opaque data pointer argument to the callbacks in struct kho_radix_walk_cb and to kho_radix_walk_tree(). This can be used by callers to pass extra information to the callbacks. Reviewed-by: Pasha Tatashin Signed-off-by: Pratyush Yadav (Google) --- include/linux/kho_radix_tree.h | 8 ++++---- kernel/liveupdate/kexec_handover.c | 24 +++++++++++++----------- 2 files changed, 17 insertions(+), 15 deletions(-) diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h index ac7ba7e567e1..4138621e0e87 100644 --- a/include/linux/kho_radix_tree.h +++ b/include/linux/kho_radix_tree.h @@ -44,8 +44,8 @@ struct kho_radix_tree { * return value is directly returned to the caller.
*/ struct kho_radix_walk_cb { - int (*leaf)(unsigned long key); - int (*node)(phys_addr_t phys); + int (*leaf)(unsigned long key, void *data); + int (*node)(phys_addr_t phys, void *data); }; #ifdef CONFIG_KEXEC_HANDOVER @@ -53,7 +53,7 @@ struct kho_radix_walk_cb { int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key); void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key); int kho_radix_walk_tree(struct kho_radix_tree *tree, - const struct kho_radix_walk_cb *cb); + const struct kho_radix_walk_cb *cb, void *data); #else /* #ifdef CONFIG_KEXEC_HANDOVER */ @@ -66,7 +66,7 @@ static inline void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key) { } static inline int kho_radix_walk_tree(struct kho_radix_tree *tree, - const struct kho_radix_walk_cb *cb) + const struct kho_radix_walk_cb *cb, void *data) { return -EOPNOTSUPP; } diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 94f18fe42c4b..b890a69bddd5 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -279,14 +279,14 @@ void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key) EXPORT_SYMBOL_GPL(kho_radix_del_key); static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, unsigned long key, - const struct kho_radix_walk_cb *cb) + const struct kho_radix_walk_cb *cb, void *data) { unsigned long *bitmap = (unsigned long *)leaf; unsigned int i; int err; if (cb->node) { - err = cb->node(virt_to_phys(leaf)); + err = cb->node(virt_to_phys(leaf), data); if (err) return err; } @@ -295,7 +295,7 @@ static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, unsigned long key, return 0; for_each_set_bit(i, bitmap, PAGE_SIZE * BITS_PER_BYTE) { - err = cb->leaf(key | i); + err = cb->leaf(key | i, data); if (err) return err; } @@ -305,7 +305,7 @@ static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, unsigned long key, static int __kho_radix_walk_tree(struct kho_radix_node *root, unsigned 
int level, unsigned long start, - const struct kho_radix_walk_cb *cb) + const struct kho_radix_walk_cb *cb, void *data) { struct kho_radix_node *node; struct kho_radix_leaf *leaf; @@ -314,7 +314,7 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root, int err; if (cb->node) { - err = cb->node(virt_to_phys(root)); + err = cb->node(virt_to_phys(root), data); if (err) return err; } @@ -335,10 +335,10 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root, * node is pointing to the level 0 bitmap. */ leaf = (struct kho_radix_leaf *)node; - err = kho_radix_walk_leaf(leaf, key, cb); + err = kho_radix_walk_leaf(leaf, key, cb, data); } else { err = __kho_radix_walk_tree(node, level - 1, - key, cb); + key, cb, data); } if (err) @@ -352,6 +352,7 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root, * kho_radix_walk_tree - Traverses the radix tree and calls a callback for each key. * @tree: A pointer to the KHO radix tree to walk. * @cb: Set of callbacks to be invoked during the tree walk. + * @data: Opaque data pointer passed to each callback in @cb. * * This function walks the radix tree, searching from the top level down to the * lowest level (level 0), invoking the appropriate callbacks. @@ -360,14 +361,15 @@ static int __kho_radix_walk_tree(struct kho_radix_node *root, * value from the callback that stopped the walk. 
*/ int kho_radix_walk_tree(struct kho_radix_tree *tree, - const struct kho_radix_walk_cb *cb) + const struct kho_radix_walk_cb *cb, void *data) { if (WARN_ON_ONCE(!tree->root)) return -EINVAL; guard(mutex)(&tree->lock); - return __kho_radix_walk_tree(tree->root, KHO_TREE_MAX_DEPTH - 1, 0, cb); + return __kho_radix_walk_tree(tree->root, KHO_TREE_MAX_DEPTH - 1, 0, cb, + data); } EXPORT_SYMBOL_GPL(kho_radix_walk_tree); @@ -498,7 +500,7 @@ static struct page *__init kho_get_preserved_page(phys_addr_t phys, return pfn_to_page(pfn); } -static int __init kho_preserved_memory_reserve(unsigned long key) +static int __init kho_preserved_memory_reserve(unsigned long key, void *data) { union kho_page_info info; struct page *page; @@ -1442,7 +1444,7 @@ static void __init kho_mem_retrieve(void) kho_in.radix_tree.root = kho_get_mem_map(fdt); mutex_init(&kho_in.radix_tree.lock); - err = kho_radix_walk_tree(&kho_in.radix_tree, &cb); + err = kho_radix_walk_tree(&kho_in.radix_tree, &cb, NULL); if (err) { /* * Failed to initialize preserved memory. Clear FDT and radix -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:42 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:42 +0200 Subject: [PATCH v2 09/18] kho: allow early-boot usage of the KHO radix tree In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-10-pratyush@kernel.org> From: "Pratyush Yadav (Google)" The KHO radix tree allocates memory for table pages from the buddy allocator using get_zeroed_page(). This is not available in early boot when memblock is still active. Using the radix tree in early boot is useful for KHO to track metadata about its memory. One such example is for tracking free blocks for memory allocation when scratch runs out of space. This feature will be added in the following commits. 
Add kho_radix_{alloc,free}_node() which allocate and free the table pages. They use slab_is_available() to decide which allocator to use. While slab_is_available() indicates availability of the slab allocator, it gets initialized right after buddy so it serves the same practical purpose. Reviewed-by: Pasha Tatashin Signed-off-by: Pratyush Yadav (Google) --- kernel/liveupdate/kexec_handover.c | 24 ++++++++++++++++++++++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index b890a69bddd5..452b4dcdf2d2 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -143,6 +143,26 @@ static unsigned long kho_radix_get_table_index(unsigned long key, return (key >> s) % (1 << KHO_TABLE_SIZE_LOG2); } +static void __ref *kho_radix_alloc_node(void) +{ + struct kho_radix_node *node; + + if (slab_is_available()) + node = (struct kho_radix_node *)get_zeroed_page(GFP_KERNEL); + else + node = memblock_alloc(PAGE_SIZE, PAGE_SIZE); + + return node; +} + +static void __ref kho_radix_free_node(struct kho_radix_node *node) +{ + if (slab_is_available()) + free_page((unsigned long)node); + else + memblock_free(node, PAGE_SIZE); +} + /** * kho_radix_add_key - Add a key to the radix tree. * @tree: The KHO radix tree. 
@@ -191,7 +211,7 @@ int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key) } /* Next node is empty, create a new node for it */ - new_node = (struct kho_radix_node *)get_zeroed_page(GFP_KERNEL); + new_node = kho_radix_alloc_node(); if (!new_node) { err = -ENOMEM; goto err_free_nodes; @@ -222,7 +242,7 @@ int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key) err_free_nodes: for (i = KHO_TREE_MAX_DEPTH - 1; i > 0; i--) { if (intermediate_nodes[i]) - free_page((unsigned long)intermediate_nodes[i]); + kho_radix_free_node(intermediate_nodes[i]); } if (anchor_node) anchor_node->table[anchor_idx] = 0; -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:43 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:43 +0200 Subject: [PATCH v2 10/18] kho: allow destroying KHO radix tree In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-11-pratyush@kernel.org> From: "Pratyush Yadav (Google)" Add kho_radix_destroy_tree() which allows destroying the radix tree and freeing all its pages. This will be used by the upcoming scratch extension mechanism. It creates a radix tree to track free blocks and then frees them after telling memblock about them.
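For readers who want to see the teardown order in isolation, here is a rough userspace sketch of a post-order destroy on a small fixed-depth table tree. It is not the kernel code: FANOUT, MAX_DEPTH, struct node, node_alloc(), node_free(), tree_add_path() and tree_destroy() are all made-up names for illustration, calloc()/free() stand in for the page allocators, and child links are plain pointers rather than physical addresses.

```c
#include <assert.h>
#include <stdlib.h>

#define FANOUT    4	/* hypothetical; the real table has many more slots */
#define MAX_DEPTH 3	/* hypothetical stand-in for KHO_TREE_MAX_DEPTH */

struct node {
	struct node *slot[FANOUT];
};

static unsigned int nodes_live;	/* allocation counter, illustration only */

static struct node *node_alloc(void)
{
	struct node *n = calloc(1, sizeof(*n));

	if (n)
		nodes_live++;
	return n;
}

static void node_free(struct node *n)
{
	assert(nodes_live > 0);
	free(n);
	nodes_live--;
}

/* Create the chain of nodes for one key; idx[level] picks the slot used at that level. */
static int tree_add_path(struct node **root, const unsigned int *idx)
{
	struct node *n;

	if (!*root && !(*root = node_alloc()))
		return -1;

	n = *root;
	for (unsigned int level = MAX_DEPTH - 1; level > 0; level--) {
		unsigned int i = idx[level] % FANOUT;

		if (!n->slot[i] && !(n->slot[i] = node_alloc()))
			return -1;
		n = n->slot[i];
	}
	return 0;
}

/*
 * Post-order teardown: free every child before the node that points to it.
 * Level 0 nodes are leaves, so their contents are never followed as pointers.
 */
static void destroy_subtree(struct node *n, unsigned int level)
{
	if (level > 0) {
		for (unsigned int i = 0; i < FANOUT; i++) {
			if (n->slot[i])
				destroy_subtree(n->slot[i], level - 1);
		}
	}
	node_free(n);
}

/* Free the whole tree and clear the root so a repeated call is a no-op. */
static void tree_destroy(struct node **root)
{
	if (!*root)
		return;

	destroy_subtree(*root, MAX_DEPTH - 1);
	*root = NULL;
}
```

Clearing the root pointer at the end mirrors what the patch does with tree->root, which is what makes a second destroy call harmless in this sketch.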
Reviewed-by: Pasha Tatashin Signed-off-by: Pratyush Yadav (Google) --- include/linux/kho_radix_tree.h | 3 +++ kernel/liveupdate/kexec_handover.c | 35 ++++++++++++++++++++++++++++++ 2 files changed, 38 insertions(+) diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h index 4138621e0e87..66ca936b3f06 100644 --- a/include/linux/kho_radix_tree.h +++ b/include/linux/kho_radix_tree.h @@ -54,6 +54,7 @@ int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key); void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key); int kho_radix_walk_tree(struct kho_radix_tree *tree, const struct kho_radix_walk_cb *cb, void *data); +void kho_radix_destroy_tree(struct kho_radix_tree *tree); #else /* #ifdef CONFIG_KEXEC_HANDOVER */ @@ -71,6 +72,8 @@ static inline int kho_radix_walk_tree(struct kho_radix_tree *tree, return -EOPNOTSUPP; } +static inline void kho_radix_destroy_tree(struct kho_radix_tree *tree) { } + #endif /* #ifdef CONFIG_KEXEC_HANDOVER */ #endif /* _LINUX_KHO_RADIX_TREE_H */ diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 452b4dcdf2d2..df3f5eb01bf1 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -298,6 +298,41 @@ void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key) } EXPORT_SYMBOL_GPL(kho_radix_del_key); +static void __kho_radix_destroy_tree(struct kho_radix_node *root, + unsigned int level) +{ + unsigned long i; + + if (level == 0) { + kho_radix_free_node(root); + return; + } + + for (i = 0; i < PAGE_SIZE / sizeof(phys_addr_t); i++) { + if (root->table[i]) + __kho_radix_destroy_tree(phys_to_virt(root->table[i]), + level - 1); + } + + kho_radix_free_node(root); +} + +/** + * kho_radix_destroy_tree - Destroy the radix tree + * @tree: The radix tree to destroy + * + * Walk @tree and free all its nodes. 
+ */ +void kho_radix_destroy_tree(struct kho_radix_tree *tree) +{ + if (!tree->root) + return; + + __kho_radix_destroy_tree(tree->root, KHO_TREE_MAX_DEPTH - 1); + tree->root = NULL; +} +EXPORT_SYMBOL_GPL(kho_radix_destroy_tree); + static int kho_radix_walk_leaf(struct kho_radix_leaf *leaf, unsigned long key, const struct kho_radix_walk_cb *cb, void *data) { -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:44 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:44 +0200 Subject: [PATCH v2 11/18] kho: add kho_radix_init_tree() In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-12-pratyush@kernel.org> From: "Pratyush Yadav (Google)" Move the initialization logic of the radix tree into kho_radix_init_tree() instead of having users open-code it. Makes the boundaries cleaner and reduces code duplication when a new user of the radix tree will be added in a future commit. 
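The init semantics described above (adopt a caller-supplied root, otherwise allocate an empty one, and do nothing if the tree already has a root) can be sketched in plain userspace C roughly as follows. This is only an illustration under assumptions: struct table, struct tree and tree_init() are hypothetical names, calloc() replaces the kernel allocator, and the mutex initialization is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct kho_radix_node and struct kho_radix_tree. */
struct table {
	void *slot[8];
};

struct tree {
	struct table *root;
};

/*
 * Initialize @tree with @root. A NULL @root requests a freshly allocated,
 * zeroed table. An already initialized tree is left untouched.
 */
static int tree_init(struct tree *tree, struct table *root)
{
	assert(tree);

	/* Already initialized: keep the existing root. */
	if (tree->root)
		return 0;

	if (!root)
		root = calloc(1, sizeof(*root));
	if (!root)
		return -ENOMEM;

	tree->root = root;
	return 0;
}
```

One consequence worth noticing in this sketch is that a root passed to an already initialized tree is silently ignored, so the caller still owns it.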
Signed-off-by: Pratyush Yadav (Google) --- include/linux/kho_radix_tree.h | 7 ++++ kernel/liveupdate/kexec_handover.c | 66 ++++++++++++++++++++++-------- 2 files changed, 55 insertions(+), 18 deletions(-) diff --git a/include/linux/kho_radix_tree.h b/include/linux/kho_radix_tree.h index 66ca936b3f06..5d6ae2893684 100644 --- a/include/linux/kho_radix_tree.h +++ b/include/linux/kho_radix_tree.h @@ -54,6 +54,7 @@ int kho_radix_add_key(struct kho_radix_tree *tree, unsigned long key); void kho_radix_del_key(struct kho_radix_tree *tree, unsigned long key); int kho_radix_walk_tree(struct kho_radix_tree *tree, const struct kho_radix_walk_cb *cb, void *data); +int kho_radix_init_tree(struct kho_radix_tree *tree, struct kho_radix_node *root); void kho_radix_destroy_tree(struct kho_radix_tree *tree); #else /* #ifdef CONFIG_KEXEC_HANDOVER */ @@ -72,6 +73,12 @@ static inline int kho_radix_walk_tree(struct kho_radix_tree *tree, return -EOPNOTSUPP; } +static inline int kho_radix_init_tree(struct kho_radix_tree *tree, + struct kho_radix_node *root) +{ + return 0; +} + static inline void kho_radix_destroy_tree(struct kho_radix_tree *tree) { } #endif /* #ifdef CONFIG_KEXEC_HANDOVER */ diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index df3f5eb01bf1..8ab2c7e234e1 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -317,6 +317,34 @@ static void __kho_radix_destroy_tree(struct kho_radix_node *root, kho_radix_free_node(root); } +/** + * kho_radix_init_tree - initialize the radix tree. + * @tree: the tree to initialize. + * @root: root table of the radix tree. + * + * Initialize the radix tree with the given root node. If root is %NULL, an + * empty root table is allocated. If root is not %NULL, it is the caller's + * responsibility to make sure the root is valid and in the correct format. + * + * Return: 0 on success, -errno on failure. 
+ */ +int kho_radix_init_tree(struct kho_radix_tree *tree, struct kho_radix_node *root) +{ + /* Already initialized. */ + if (tree->root) + return 0; + + if (!root) + root = kho_radix_alloc_node(); + if (!root) + return -ENOMEM; + + tree->root = root; + mutex_init(&tree->lock); + return 0; +} +EXPORT_SYMBOL_GPL(kho_radix_init_tree); + /** * kho_radix_destroy_tree - Destroy the radix tree * @tree: The radix tree to destroy @@ -1496,18 +1524,23 @@ static void __init kho_mem_retrieve(void) * catches that and never sets kho_in.scratch_phys, which stops memory * retrieval. */ - kho_in.radix_tree.root = kho_get_mem_map(fdt); - mutex_init(&kho_in.radix_tree.lock); + err = kho_radix_init_tree(&kho_in.radix_tree, kho_get_mem_map(fdt)); + if (err) + goto err; err = kho_radix_walk_tree(&kho_in.radix_tree, &cb, NULL); - if (err) { - /* - * Failed to initialize preserved memory. Clear FDT and radix - * so KHO users don't treat it as a KHO boot. - */ - kho_in.fdt_phys = 0; - kho_in.radix_tree.root = NULL; - } + if (err) + goto err; + + return; + +err: + /* + * Failed to initialize preserved memory. Clear FDT and radix so KHO + * users don't treat it as a KHO boot. 
+ */ + kho_in.fdt_phys = 0; + kho_in.radix_tree.root = NULL; } static __init int kho_out_fdt_setup(void) @@ -1633,16 +1666,14 @@ static __init int kho_init(void) if (!kho_enable) return 0; - tree->root = kzalloc(PAGE_SIZE, GFP_KERNEL); - if (!tree->root) { - err = -ENOMEM; + err = kho_radix_init_tree(tree, NULL); + if (err) goto err_free_scratch; - } kho_out.fdt = kho_alloc_preserve(PAGE_SIZE); if (IS_ERR(kho_out.fdt)) { err = PTR_ERR(kho_out.fdt); - goto err_free_kho_radix_tree_root; + goto err_free_kho_radix_tree; } err = kho_debugfs_init(); @@ -1693,9 +1724,8 @@ static __init int kho_init(void) err_free_fdt: kho_unpreserve_free(kho_out.fdt); -err_free_kho_radix_tree_root: - kfree(tree->root); - tree->root = NULL; +err_free_kho_radix_tree: + kho_radix_destroy_tree(tree); err_free_scratch: kho_out.fdt = NULL; for (int i = 0; i < kho_scratch_cnt; i++) { -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:45 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:45 +0200 Subject: [PATCH v2 12/18] kho: export kho_scratch_overlap() In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-13-pratyush@kernel.org> From: "Pratyush Yadav (Google)" Support for discovering memory blocks with no preserved memory will be added in coming patches. These areas will also be marked as scratch to allow allocations from them. Memblock will switch to looking through the scratch array to decide the right migratetype. Export kho_scratch_overlap(). Since it is now used by non-debug code, move it out of kexec_handover_debug.c and into kexec_handover.c. Gate the overlap checks in kho_preserve_folio() and kho_preserve_pages() by IS_ENABLED(CONFIG_KEXEC_HANDOVER_DEBUG) instead. Since kexec_handover_debug.c is now empty, delete it. No functional changes. 
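The overlap test being moved is a standard half-open interval check. A minimal userspace sketch, assuming a hypothetical struct scratch_range in place of struct kho_scratch and passing the array explicitly instead of using the kernel globals, could look like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct kho_scratch. */
struct scratch_range {
	uint64_t addr;
	uint64_t size;
};

/*
 * Return true if [phys, phys + size) intersects any scratch range.
 * Both intervals are half-open, so a region that only touches the end
 * of a scratch area does not count as overlapping it.
 */
static bool scratch_overlap(const struct scratch_range *scratch, size_t cnt,
			    uint64_t phys, uint64_t size)
{
	assert(scratch || cnt == 0);

	for (size_t i = 0; i < cnt; i++) {
		uint64_t start = scratch[i].addr;
		uint64_t end = scratch[i].addr + scratch[i].size;

		if (phys < end && phys + size > start)
			return true;
	}

	return false;
}
```

With scratch ranges [0x1000, 0x2000) and [0x8000, 0xa000), a query for [0x2000, 0x3000) reports no overlap because it only abuts the first range, while [0x7000, 0x8001) does overlap the second by one byte.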
Signed-off-by: Pratyush Yadav (Google) --- include/linux/kexec_handover.h | 7 ++++++ kernel/liveupdate/Makefile | 1 - kernel/liveupdate/kexec_handover.c | 22 ++++++++++++++++-- kernel/liveupdate/kexec_handover_debug.c | 25 --------------------- kernel/liveupdate/kexec_handover_internal.h | 9 -------- 5 files changed, 27 insertions(+), 37 deletions(-) delete mode 100644 kernel/liveupdate/kexec_handover_debug.c diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index 8968c56d2d73..3740c14d970d 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -40,6 +40,8 @@ void kho_memory_init(void); void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys, u64 scratch_len); + +bool kho_scratch_overlap(phys_addr_t phys, size_t size); #else static inline bool kho_is_enabled(void) { @@ -116,6 +118,11 @@ static inline void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys, u64 scratch_len) { } + +static inline bool kho_scratch_overlap(phys_addr_t phys, size_t size) +{ + return false; +} #endif /* CONFIG_KEXEC_HANDOVER */ #endif /* LINUX_KEXEC_HANDOVER_H */ diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile index d2f779cbe279..dc352839ccf0 100644 --- a/kernel/liveupdate/Makefile +++ b/kernel/liveupdate/Makefile @@ -7,7 +7,6 @@ luo-y := \ luo_session.o obj-$(CONFIG_KEXEC_HANDOVER) += kexec_handover.o -obj-$(CONFIG_KEXEC_HANDOVER_DEBUG) += kexec_handover_debug.o obj-$(CONFIG_KEXEC_HANDOVER_DEBUGFS) += kexec_handover_debugfs.o obj-$(CONFIG_LIVEUPDATE) += luo.o diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 8ab2c7e234e1..a66f23a35389 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -766,6 +766,22 @@ static phys_addr_t __init scratch_size_node(int nid) return round_up(size, CMA_MIN_ALIGNMENT_BYTES); } +bool kho_scratch_overlap(phys_addr_t phys, size_t size) +{ + phys_addr_t 
scratch_start, scratch_end; + unsigned int i; + + for (i = 0; i < kho_scratch_cnt; i++) { + scratch_start = kho_scratch[i].addr; + scratch_end = kho_scratch[i].addr + kho_scratch[i].size; + + if (phys < scratch_end && (phys + size) > scratch_start) + return true; + } + + return false; +} + /** * kho_reserve_scratch - Reserve a contiguous chunk of memory for kexec * @@ -963,7 +979,8 @@ int kho_preserve_folio(struct folio *folio) const unsigned long pfn = folio_pfn(folio); const unsigned int order = folio_order(folio); - if (WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order))) + if (IS_ENABLED(CONFIG_KEXEC_HANDOVER_DEBUG) && + WARN_ON(kho_scratch_overlap(pfn << PAGE_SHIFT, PAGE_SIZE << order))) return -EINVAL; return kho_radix_add_key(tree, kho_encode_radix_key(PFN_PHYS(pfn), @@ -1040,7 +1057,8 @@ int kho_preserve_pages(struct page *page, unsigned long nr_pages) unsigned long failed_pfn = 0; int err = 0; - if (WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT, + if (IS_ENABLED(CONFIG_KEXEC_HANDOVER_DEBUG) && + WARN_ON(kho_scratch_overlap(start_pfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT))) { return -EINVAL; } diff --git a/kernel/liveupdate/kexec_handover_debug.c b/kernel/liveupdate/kexec_handover_debug.c deleted file mode 100644 index 6efb696f5426..000000000000 --- a/kernel/liveupdate/kexec_handover_debug.c +++ /dev/null @@ -1,25 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0-only -/* - * kexec_handover_debug.c - kexec handover optional debug functionality - * Copyright (C) 2025 Google LLC, Pasha Tatashin - */ - -#define pr_fmt(fmt) "KHO: " fmt - -#include "kexec_handover_internal.h" - -bool kho_scratch_overlap(phys_addr_t phys, size_t size) -{ - phys_addr_t scratch_start, scratch_end; - unsigned int i; - - for (i = 0; i < kho_scratch_cnt; i++) { - scratch_start = kho_scratch[i].addr; - scratch_end = kho_scratch[i].addr + kho_scratch[i].size; - - if (phys < scratch_end && (phys + size) > scratch_start) - return true; - } - - return false; -} diff --git 
a/kernel/liveupdate/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h index 0399ff107775..805d2a76c388 100644 --- a/kernel/liveupdate/kexec_handover_internal.h +++ b/kernel/liveupdate/kexec_handover_internal.h @@ -41,13 +41,4 @@ static inline void kho_debugfs_blob_remove(struct kho_debugfs *dbg, void *blob) { } #endif /* CONFIG_KEXEC_HANDOVER_DEBUGFS */ -#ifdef CONFIG_KEXEC_HANDOVER_DEBUG -bool kho_scratch_overlap(phys_addr_t phys, size_t size); -#else -static inline bool kho_scratch_overlap(phys_addr_t phys, size_t size) -{ - return false; -} -#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */ - #endif /* LINUX_KEXEC_HANDOVER_INTERNAL_H */ -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:46 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:46 +0200 Subject: [PATCH v2 13/18] kho: initialize kho_scratch pointer earlier in boot In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-14-pratyush@kernel.org> From: "Pratyush Yadav (Google)" In a future patch, mm init will use kho_scratch_overlap() for deciding the migrate type of pageblocks it initializes. The earliest user currently is free_area_init(). kho_scratch_overlap() relies on kho_scratch pointer being initialized. Introduce kho_memory_init_early() to do this. kho_populate() would normally be a good place to do this, but unfortunately, phys_to_virt() does not work at that point on ARM64. So we need yet another initialization function. 
Signed-off-by: Pratyush Yadav (Google) --- include/linux/kexec_handover.h | 3 +++ kernel/liveupdate/kexec_handover.c | 12 ++++++++++-- mm/mm_init.c | 1 + 3 files changed, 14 insertions(+), 2 deletions(-) diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index 3740c14d970d..9e961032e06b 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -37,6 +37,7 @@ void kho_remove_subtree(void *blob); int kho_retrieve_subtree(const char *name, phys_addr_t *phys, size_t *size); void kho_memory_init(void); +void kho_memory_init_early(void); void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys, u64 scratch_len); @@ -114,6 +115,8 @@ static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys, static inline void kho_memory_init(void) { } +static inline void kho_memory_init_early(void) { } + static inline void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys, u64 scratch_len) { diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index a66f23a35389..af22086ca2d6 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1535,8 +1535,6 @@ static void __init kho_mem_retrieve(void) const void *fdt = kho_get_fdt(); int err; - kho_scratch = phys_to_virt(kho_in.scratch_phys); - /* * kho_get_mem_map() should always succeed. If it fails, kho_populate() * catches that and never sets kho_in.scratch_phys, which stops memory @@ -1757,6 +1755,16 @@ static __init int kho_init(void) } fs_initcall(kho_init); +void __init kho_memory_init_early(void) +{ + /* + * kho_scratch_overlap() needs kho_scratch to be initialized. It is used + * by free_area_init() on KHO boots, so initialize it early. 
+ */ + if (kho_in.scratch_phys) + kho_scratch = phys_to_virt(kho_in.scratch_phys); +} + void __init kho_memory_init(void) { if (kho_in.scratch_phys) diff --git a/mm/mm_init.c b/mm/mm_init.c index eddc0f03a779..0675837bbfc9 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -2688,6 +2688,7 @@ void __init __weak mem_init(void) void __init mm_core_init_early(void) { + kho_memory_init_early(); hugetlb_cma_reserve(); hugetlb_bootmem_alloc(); -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:47 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:47 +0200 Subject: [PATCH v2 14/18] memblock: use kho_scratch_overlap() to decide migratetype In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-15-pratyush@kernel.org> From: "Pratyush Yadav (Google)" Support for discovering memory blocks with no preserved memory will be added in coming patches. These areas will also be marked as MEMBLOCK_KHO_SCRATCH to allow allocations from them. But only the scratch areas passed to KHO should be marked as MIGRATE_CMA, all others should be left as normal. So instead of checking the flags on the region, ask KHO to loop through its scratch array. 
Signed-off-by: Pratyush Yadav (Google) --- include/linux/memblock.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 5afcd99aa8c1..546d7ef798b8 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -11,6 +11,7 @@ #include #include #include +#include extern unsigned long max_low_pfn; extern unsigned long min_low_pfn; @@ -618,7 +619,7 @@ bool memblock_is_kho_scratch_memory(phys_addr_t addr); static inline enum migratetype kho_scratch_migratetype(unsigned long pfn, enum migratetype mt) { - if (memblock_is_kho_scratch_memory(PFN_PHYS(pfn))) + if (kho_scratch_overlap(PFN_PHYS(pfn), pageblock_nr_pages << PAGE_SHIFT)) return MIGRATE_CMA; return mt; } -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:48 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:48 +0200 Subject: [PATCH v2 15/18] kho: extend scratch In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-16-pratyush@kernel.org> From: "Pratyush Yadav (Google)" Motivation ========== The scratch space is allocated by the first kernel in the KHO chain, and is reused by all subsequent kernels. The size of the space is either set via the commandline by the system administrator or by calculating the amount of memory used by the kernel and adding a multiplier. In either case, the scratch space is a heuristic and is liable to fill up and fail allocation if a kernel uses more memory than expected. In addition, gigantic huge pages (usually 1 GiB) are allocated via memblock, and in a KHO boot that memory comes from the scratch space. In hypervisors it is common to dedicate a major part of the system's memory to gigantic hugepages for VM memory. 
If this memory needs to come from scratch space, then scratch needs to
be greater than the memory needed for huge pages, which is impractical.

In addition, hugepages can be preserved memory. Allocating them from
scratch violates the assumption that scratch contains no preserved
memory.

Methodology
===========

Discover areas that don't contain any preserved memory at boot by
walking the preserved memory radix tree. Mark them as scratch to allow
allocations from them. This makes KHO more resilient to memory pressure
and allows supporting huge page preservation.

Since the preserved memory radix tree mixes both physical address and
order into a single key, and does not track table pages, it is difficult
to identify free areas from it directly. Walk the tree and digest it
down into another radix tree. The latter tracks blocks of
(1 << KHO_EXT_SHIFT) bytes (1 GiB as of now). Then walk the digested
tree and mark the areas between the present keys as scratch.
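
The second walk can be sketched in userspace as follows. This is a minimal illustration, not the patch's code. It assumes the digested tree has already been flattened into an ascending array of unique busy block indexes. EXT_SHIFT, struct range and find_free_ranges() are hypothetical names, and the ranges are collected into an array where the patch would mark them as scratch in memblock.

```c
#include <stddef.h>
#include <stdint.h>

#define EXT_SHIFT 30 /* 1 GiB blocks, mirroring KHO_EXT_SHIFT */

/* Hypothetical description of one discovered free range. */
struct range {
	uint64_t start;
	uint64_t size;
};

/*
 * busy[] holds ascending, unique block indexes (physical address >>
 * EXT_SHIFT) that contain preserved memory. Write the gaps between them,
 * plus the tail up to end_of_dram, into out[] and return how many ranges
 * were written (at most nr_busy + 1).
 */
static size_t find_free_ranges(const uint64_t *busy, size_t nr_busy,
			       uint64_t end_of_dram, struct range *out)
{
	uint64_t prev_end = 0;
	size_t n = 0;

	for (size_t i = 0; i < nr_busy; i++) {
		uint64_t start = busy[i] << EXT_SHIFT;

		/* Everything between the previous busy block and this one is free. */
		if (start > prev_end)
			out[n++] = (struct range){ prev_end, start - prev_end };

		prev_end = start + (UINT64_C(1) << EXT_SHIFT);
	}

	/* Free space after the last busy block, if any. */
	if (prev_end < end_of_dram)
		out[n++] = (struct range){ prev_end, end_of_dram - prev_end };

	return n;
}
```

With busy blocks 1, 2 and 5 on an 8 GiB machine, this yields three free ranges: [0, 1 GiB), [3 GiB, 5 GiB) and [6 GiB, 8 GiB).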
Currently, KHO is mainly targeted for server grade systems with hundreds of gigabytes to terabytes of memory. So 1 GiB is a reasonable granularity for those systems. For smaller systems this doesn't work as well, but we can arrive at a better heuristic when we have concrete use cases. Practical evaluation ==================== The testing is done on a x86_64 qemu VM running under KVM with 64G memory and 12 CPUs. The machine pre-allocates 50 1G pages. Since the performance scales with how busy the radix tree is, tests are done with 2 preservation patterns: first with two 1M memfds, second with two 1G memfds, both using 4k pages. Test case 1 - 1M memfd ~~~~~~~~~~~~~~~~~~~~~~ This test case has two memfds with 1M memory each in 4k pages, plus other preservations from LUO core and other KHO users. This is how the radix tree stats look like: radix_nodes: 0x13 nr_preservations: 0x214 mem_preserved: 0x227000 per order preservations: order 0: 0x20f order 1: 0x4 order 4: 0x1 and this is how long it takes to extend the scratch after KHO boot: KHO: KHO extend time: 47 us KHO: KHO extend total mem: 0xe6c17b000 (~57G) Test case 2 - 1G memfd ~~~~~~~~~~~~~~~~~~~~~~ This test case has two memfds with 1G memory each in 4k pages, plus other preservations from LUO core and other KHO users. This is how the radix tree stats look like: radix_nodes: 0x28 nr_preservations: 0x80816 mem_preserved: 0x80829000 per order preservations: order 0: 0x80811 order 1: 0x4 order 4: 0x1 and this is how long it takes to extend the scratch after KHO boot: KHO: KHO extend time: 22514 us KHO: KHO extend total mem: 0xd3f200000 (~52G) Signed-off-by: Pratyush Yadav (Google) --- Notes: As one might notice, the "scratch" terminology starts to break down here. There is the original "scratch", which is passed down by the previous kernel. It is marked MEMBLOCK_KHO_SCRATCH. There is also the discovered "scratch", which also gets marked MEMBLOCK_KHO_SCRATCH, but has nothing to do with the former. 
For limiting the scope of this series, I haven't done the rename here. I can do it as a follow up series once this stabilizes and lands into -next. I suggest the following scheme: - Rename "KHO scratch" to "KHO bootmem". Update the documentation and all code to use this name. We have the kho_scratch kernel cmdline parameter, which is harder to change, but perhaps we can rename it to "kho_bootmem" and if someone complains we can add it back. - Rename MEMBLOCK_KHO_SCRATCH to MEMBLOCK_KHO_NOPRSRV. This describes the property of the memory not its origin. Then KHO can mark its "bootmem" as KHO_NOPRSRV because bootmem never has any preserved memory. Later, kho_extend_scratch() (which is also due for a better name) can also mark its discovered areas as KHO_NOPRSRV. kernel/liveupdate/kexec_handover.c | 149 +++++++++++++++++++++++++---- 1 file changed, 132 insertions(+), 17 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index af22086ca2d6..8540608b8602 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -84,6 +84,23 @@ static struct kho_out kho_out = { }, }; +struct kho_in { + phys_addr_t fdt_phys; + phys_addr_t scratch_phys; + char previous_release[__NEW_UTS_LEN + 1]; + u32 kexec_count; + struct kho_debugfs dbg; + struct kho_radix_tree radix_tree; +}; + +static struct kho_in kho_in = { +}; + +static const void *kho_get_fdt(void) +{ + return kho_in.fdt_phys ? phys_to_virt(kho_in.fdt_phys) : NULL; +} + /** * kho_encode_radix_key - Encodes a physical address and order into a radix key. * @phys: The physical address of the page. 
@@ -869,6 +886,119 @@ static void __init kho_reserve_scratch(void) kho_enable = false; } +#define KHO_EXT_SHIFT 30 /* 1 GiB */ + +static int __init kho_ext_walk_key(unsigned long key, void *data) +{ + struct kho_radix_tree *tree = data; + phys_addr_t start, end; + unsigned int order; + int err; + + start = kho_decode_radix_key(key, &order); + end = start + (1UL << (order + PAGE_SHIFT)); + + while (start < end) { + err = kho_radix_add_key(tree, start >> KHO_EXT_SHIFT); + if (err) + return err; + + start += (1UL << KHO_EXT_SHIFT); + } + + return 0; +} + +static int __init kho_ext_walk_node(phys_addr_t phys, void *data) +{ + struct kho_radix_tree *tree = data; + + return kho_radix_add_key(tree, phys >> KHO_EXT_SHIFT); +} + +static int __init kho_ext_mark_scratch(unsigned long key, void *data) +{ + phys_addr_t *prev_end = data; + phys_addr_t start = key << KHO_EXT_SHIFT; + int err; + + if (start > *prev_end) { + err = memblock_mark_kho_scratch(*prev_end, start - *prev_end); + if (err) + return err; + } + + *prev_end = start + (1UL << KHO_EXT_SHIFT); + return 0; +} + +/** + * kho_extend_scratch - Extend the scratch regions + * + * The KHO radix tree mixes both physical address and order into a single key. + * This makes it hard to look for free ranges directly. This function first + * walks the radix tree and digests it down into another radix tree, whose keys + * identify blocks of KHO_EXT_SHIFT which contain preserved memory. + * + * Then it walks the digested radix tree and marks everything that doesn't have + * preserved memory as scratch. + * + * NOTE: This function allocates memory so it should be called when scratch has + * available space. + * + * NOTE: The pages of the KHO radix tree tables are not marked as preserved in + * the KHO tree. But they are expected to remain untouched until the tree is + * fully parsed. So this function also considers them to be "preserved memory" + * and marks their blocks as busy. 
+ */ +static void __init kho_extend_scratch(void) +{ + const struct kho_radix_walk_cb kho_cb = { + .leaf = kho_ext_walk_key, + .node = kho_ext_walk_node, + }; + const struct kho_radix_walk_cb ext_cb = { + .leaf = kho_ext_mark_scratch, + }; + struct kho_radix_tree radix; + phys_addr_t prev_end = 0; + int err = 0; + + if (!is_kho_boot()) + return; + + /* Make sure the KHO radix tree is initialized. */ + err = kho_radix_init_tree(&kho_in.radix_tree, + kho_get_mem_map(kho_get_fdt())); + if (err) + goto print; + + err = kho_radix_init_tree(&radix, NULL); + if (err) + goto print; + + /* Walk the KHO radix tree to find busy blocks. */ + err = kho_radix_walk_tree(&kho_in.radix_tree, &kho_cb, &radix); + if (err) + goto out; + + /* Walk the blocks and mark everything between keys as scratch. */ + err = kho_radix_walk_tree(&radix, &ext_cb, &prev_end); + if (err) + goto out; + + /* Mark everything from last busy block to end of DRAM. */ + if (prev_end < memblock_end_of_DRAM()) + err = memblock_mark_kho_scratch(prev_end, memblock_end_of_DRAM() - prev_end); + + /* fallthrough */ +out: + kho_radix_destroy_tree(&radix); +print: + if (err) + pr_err("Failed to extend scratch: %pe\n", ERR_PTR(err)); +} + /** * kho_add_subtree - record the physical address of a sub blob in KHO root tree. * @name: name of the sub tree. @@ -1443,23 +1573,6 @@ void kho_restore_free(void *mem) } EXPORT_SYMBOL_GPL(kho_restore_free); -struct kho_in { - phys_addr_t fdt_phys; - phys_addr_t scratch_phys; - char previous_release[__NEW_UTS_LEN + 1]; - u32 kexec_count; - struct kho_debugfs dbg; - struct kho_radix_tree radix_tree; -}; - -static struct kho_in kho_in = { -}; - -static const void *kho_get_fdt(void) -{ - return kho_in.fdt_phys ? 
phys_to_virt(kho_in.fdt_phys) : NULL; -} - /** * is_kho_boot - check if current kernel was booted via KHO-enabled * kexec @@ -1763,6 +1876,8 @@ void __init kho_memory_init_early(void) */ if (kho_in.scratch_phys) kho_scratch = phys_to_virt(kho_in.scratch_phys); + + kho_extend_scratch(); } void __init kho_memory_init(void) -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:49 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:49 +0200 Subject: [PATCH v2 16/18] memblock: make HugeTLB bootmem allocation work with KHO In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-17-pratyush@kernel.org> From: "Pratyush Yadav (Google)" Gigantic huge page allocation is somewhat broken currently when KHO is used. Firstly, it breaks KHO scratch size accounting. RSRV_KERN is used to track how much memory is reserved for use by the kernel. Since alloc_bootmem() calls the memblock_alloc*() APIs, the hugepages allocated also get marked as RSRV_KERN. Allocations marked RSRV_KERN are used by KHO to calculate how much scratch space it should reserve to make sure the next kernel has enough memory to boot when it is in scratch-only phase. Counting hugepages in that blows up scratch size, and can lead to the scratch allocation failing, making KHO unusable. This will show up when huge pages make up more than 50% of the system's memory, which is a fairly common use case. Secondly, while not supported right now, huge pages are user memory and can be preserved via KHO. The scratch spaces should not have any preserved memory. Allocating hugepages from scratch (on a KHO boot) can lead to them being un-preservable. Introduce memblock_alloc_hugetlb(). This lets memblock cater to the needs of hugetlb without exposing those details to the general allocation routines. First, it does not use mirrored memory for hugetlb.
Mirrored memory is a limited resource that is best saved for kernel data structures, not user memory. Second, if the memory found overlaps with KHO scratch areas, it discards the memory and retries. Third, it simplifies the argument list by baking in some hugetlb assumptions like alignment and exact_nid. This also simplifies allocation logic in alloc_bootmem(). Also introduce MEMBLOCK_RSRV_HUGETLB to mark reservations made for HugeTLB. This will be used by KHO in future patches to correctly calculate scratch sizes. Refactor some of the preparation logic like kmemleak tracking and accepting memory into a separate helper memblock_prep_allocation(), and use it from both memblock_alloc_hugetlb() and the usual memblock_alloc_range_nid(). Signed-off-by: Pratyush Yadav (Google) --- include/linux/memblock.h | 3 ++ mm/hugetlb.c | 22 +++----- mm/memblock.c | 112 +++++++++++++++++++++++++++++++-------- 3 files changed, 100 insertions(+), 37 deletions(-) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index 546d7ef798b8..b3b4a6145fad 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -52,6 +52,7 @@ extern unsigned long long max_possible_pfn; * memory reservations yet, so we get scratch memory from the previous * kernel that we know is good to use. It is the only memory that * allocations may happen from in this phase. 
+ * @MEMBLOCK_RSRV_HUGETLB: memory is reserved for hugetlb pages */ enum memblock_flags { MEMBLOCK_NONE = 0x0, /* No special request */ @@ -62,6 +63,7 @@ enum memblock_flags { MEMBLOCK_RSRV_NOINIT = 0x10, /* don't initialize struct pages */ MEMBLOCK_RSRV_KERN = 0x20, /* memory reserved for kernel use */ MEMBLOCK_KHO_SCRATCH = 0x40, /* scratch memory for kexec handover */ + MEMBLOCK_RSRV_HUGETLB = 0x80, /* memory reserved for hugetlb pages */ }; /** @@ -421,6 +423,7 @@ void *memblock_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align, void *memblock_alloc_try_nid(phys_addr_t size, phys_addr_t align, phys_addr_t min_addr, phys_addr_t max_addr, int nid); +void *memblock_alloc_hugetlb(phys_addr_t size, int nid, bool exact_nid); static __always_inline void *memblock_alloc(phys_addr_t size, phys_addr_t align) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4b80b167cc9c..fadcfa267ceb 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3029,29 +3029,21 @@ static __init void *alloc_bootmem(struct hstate *h, int nid, bool node_exact) if (hugetlb_early_cma(h)) m = hugetlb_cma_alloc_bootmem(h, &listnode, node_exact); else { - if (node_exact) - m = memblock_alloc_exact_nid_raw(huge_page_size(h), - huge_page_size(h), 0, - MEMBLOCK_ALLOC_ACCESSIBLE, nid); - else { - m = memblock_alloc_try_nid_raw(huge_page_size(h), - huge_page_size(h), 0, - MEMBLOCK_ALLOC_ACCESSIBLE, nid); + m = memblock_alloc_hugetlb(huge_page_size(h), nid, node_exact); + if (m) { + m->flags = 0; + m->cma = NULL; + /* * For pre-HVO to work correctly, pages need to be on * the list for the node they were actually allocated * from. That node may be different in the case of - * fallback by memblock_alloc_try_nid_raw. So, + * fallback by memblock_alloc_hugetlb. So, * extract the actual node first.
*/ - if (m) + if (!node_exact) listnode = early_pfn_to_nid(PHYS_PFN(__pa(m))); } - - if (m) { - m->flags = 0; - m->cma = NULL; - } } if (m) { diff --git a/mm/memblock.c b/mm/memblock.c index 6349c48154f4..131e54dd5d8d 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1506,6 +1506,32 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size, return 0; } +static void memblock_prep_allocation(phys_addr_t start, phys_addr_t size, + bool leaktrace) +{ + /* + * Skip kmemleak for those places like kasan_init() and + * early_pgtable_alloc() due to high volume. + */ + if (leaktrace) + /* + * Memblock allocated blocks are never reported as + * leaks. This is because many of these blocks are + * only referred via the physical address which is + * not looked up by kmemleak. + */ + kmemleak_alloc_phys(start, size, 0); + + /* + * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, + * require memory to be accepted before it can be used by the + * guest. + * + * Accept the memory of the allocated buffer. + */ + accept_memory(start, size); +} + /** * memblock_alloc_range_nid - allocate boot memory block * @size: size of memory block to be allocated in bytes @@ -1580,28 +1606,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, return 0; done: - /* - * Skip kmemleak for those places like kasan_init() and - * early_pgtable_alloc() due to high volume. - */ - if (end != MEMBLOCK_ALLOC_NOLEAKTRACE) - /* - * Memblock allocated blocks are never reported as - * leaks. This is because many of these blocks are - * only referred via the physical address which is - * not looked up by kmemleak. - */ - kmemleak_alloc_phys(found, size, 0); - - /* - * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, - * require memory to be accepted before it can be used by the - * guest. - * - * Accept the memory of the allocated buffer. 
- */ - accept_memory(found, size); - + memblock_prep_allocation(found, size, end != MEMBLOCK_ALLOC_NOLEAKTRACE); return found; } @@ -1756,6 +1761,69 @@ void * __init memblock_alloc_try_nid_raw( false); } +/** + * memblock_alloc_hugetlb - allocate boot memory for HugeTLB pages + * @size: size of the memory to be allocated in bytes + * @nid: nid of the free memory to find, %NUMA_NO_NODE for any node + * @exact_nid: only allocate from the specified nid. If %false, the specified + * nid is tried first, and then all nodes are tried as fallback. + * + * HugeTLB pages are always aligned by their size, so the alignment matches + * @size. Since the memory is for userspace, mirrored memory is not used. The + * memory is not zeroed. Does not panic if request cannot be satisfied. + * + * Return: + * Virtual address of allocated memory block on success, %NULL on failure. + */ +void * __init memblock_alloc_hugetlb(phys_addr_t size, int nid, bool exact_nid) +{ + enum memblock_flags flags = choose_memblock_flags(); + phys_addr_t addr, start = 0, end = MEMBLOCK_ALLOC_ACCESSIBLE; + + memblock_dbg("%s: %llu bytes, nid=%d, exact_nid=%d %pS\n", __func__, + (u64)size, nid, exact_nid, (void *)_RET_IP_); + + /* Don't waste mirrored memory on HugeTLB pages. */ + flags &= ~MEMBLOCK_MIRROR; +retry: + /* HugeTLB pages are always aligned by their size. */ + addr = memblock_find_in_range_node(size, size, start, end, nid, flags); + if (addr) + goto found; + + /* Try all nodes if allowed. */ + if (numa_valid_node(nid) && !exact_nid) { + nid = NUMA_NO_NODE; + goto retry; + } + + /* Found nothing... :-( */ + return NULL; + +found: + /* + * HugeTLB pages can be preserved with KHO and no preserved memory can + * be in scratch. So retry if found address overlaps with scratch. + * + * Scratch areas are normally not very large, so this shouldn't take too + * many retries. 
+ */ + if (kho_scratch_overlap(addr, size)) { + if (memblock_bottom_up()) + start = addr + size; + else + start = addr - size; + + goto retry; + } + + if (__memblock_reserve(addr, size, nid, MEMBLOCK_RSRV_KERN | MEMBLOCK_RSRV_HUGETLB)) + return NULL; + + memblock_prep_allocation(addr, size, true); + return phys_to_virt(addr); +} + /** * memblock_alloc_try_nid - allocate boot memory block * @size: size of memory block to be allocated in bytes -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:50 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:50 +0200 Subject: [PATCH v2 17/18] memblock: allow calculating reserved size by flags In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-18-pratyush@kernel.org> From: "Pratyush Yadav (Google)" memblock_reserved_kern_size() returns the total size of all reserved areas flagged RSRV_KERN. KHO also needs total size of all reserved areas flagged HUGETLB to correctly size its scratch areas. Refactor memblock_reserved_kern_size() into memblock_reserved_size_flags(). The new function returns total size of all reserved areas which match _any_ of the flags. 
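The "match any of the flags" behavior can be illustrated with a small userspace sketch. This is an approximation under stated assumptions: struct region, the enum and reserved_size_flags() are hypothetical names, the flag values are copied from the enum shown in this series, and the per-node filtering done by the real function is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag values as shown for enum memblock_flags in this series. */
enum region_flags {
	RSRV_KERN    = 0x20,
	RSRV_HUGETLB = 0x80,
};

/* Hypothetical stand-in for a reserved memblock region. */
struct region {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

/*
 * Sum the sizes of all regions below @limit that carry at least one of
 * the bits in @flags. Regions straddling @limit are clipped to it.
 */
static uint64_t reserved_size_flags(const struct region *r, size_t cnt,
				    uint64_t limit, unsigned int flags)
{
	uint64_t total = 0;

	for (size_t i = 0; i < cnt; i++) {
		uint64_t size = r[i].size;

		if (r[i].base > limit)
			continue;

		if (r[i].base + r[i].size > limit)
			size = limit - r[i].base;

		/* Any single matching bit is enough. */
		if (r[i].flags & flags)
			total += size;
	}

	return total;
}
```

Because a hugetlb reservation carries both RSRV_KERN and RSRV_HUGETLB, querying each flag separately and subtracting gives the kernel-only figure, which is how the final patch in this series uses it.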
Signed-off-by: Pratyush Yadav (Google) --- include/linux/memblock.h | 3 ++- kernel/liveupdate/kexec_handover.c | 14 ++++++++------ mm/memblock.c | 8 +++++--- 3 files changed, 15 insertions(+), 10 deletions(-) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index b3b4a6145fad..a3b57066611d 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -487,7 +487,8 @@ static inline __init_memblock bool memblock_bottom_up(void) phys_addr_t memblock_phys_mem_size(void); phys_addr_t memblock_reserved_size(void); -phys_addr_t memblock_reserved_kern_size(phys_addr_t limit, int nid); +phys_addr_t memblock_reserved_size_flags(phys_addr_t limit, int nid, + enum memblock_flags flags); unsigned long memblock_estimated_nr_free_pages(void); phys_addr_t memblock_start_of_DRAM(void); phys_addr_t memblock_end_of_DRAM(void); diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 8540608b8602..b3c33f150e85 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -749,13 +749,15 @@ static void __init scratch_size_update(void) if (scratch_scale) { phys_addr_t size; - size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT, - NUMA_NO_NODE); + size = memblock_reserved_size_flags(ARCH_LOW_ADDRESS_LIMIT, + NUMA_NO_NODE, + MEMBLOCK_RSRV_KERN); size = size * scratch_scale / 100; scratch_size_lowmem = size; - size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE, - NUMA_NO_NODE); + size = memblock_reserved_size_flags(MEMBLOCK_ALLOC_ANYWHERE, + NUMA_NO_NODE, + MEMBLOCK_RSRV_KERN); size = size * scratch_scale / 100 - scratch_size_lowmem; scratch_size_global = size; } @@ -773,8 +775,8 @@ static phys_addr_t __init scratch_size_node(int nid) phys_addr_t size; if (scratch_scale) { - size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE, - nid); + size = memblock_reserved_size_flags(MEMBLOCK_ALLOC_ANYWHERE, + nid, MEMBLOCK_RSRV_KERN); size = size * scratch_scale / 100; } else { size = 
scratch_size_pernode; diff --git a/mm/memblock.c b/mm/memblock.c index 131e54dd5d8d..cc21f877cb67 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1893,7 +1893,8 @@ phys_addr_t __init_memblock memblock_reserved_size(void) return memblock.reserved.total_size; } -phys_addr_t __init_memblock memblock_reserved_kern_size(phys_addr_t limit, int nid) +phys_addr_t __init_memblock memblock_reserved_size_flags(phys_addr_t limit, int nid, + enum memblock_flags flags) { struct memblock_region *r; phys_addr_t total = 0; @@ -1908,7 +1909,7 @@ phys_addr_t __init_memblock memblock_reserved_kern_size(phys_addr_t limit, int n size = limit - r->base; if (nid == memblock_get_region_node(r) || !numa_valid_node(nid)) - if (r->flags & MEMBLOCK_RSRV_KERN) + if (r->flags & flags) total += size; } @@ -1930,7 +1931,8 @@ phys_addr_t __init_memblock memblock_reserved_kern_size(phys_addr_t limit, int n unsigned long __init memblock_estimated_nr_free_pages(void) { return PHYS_PFN(memblock_phys_mem_size() - - memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE, NUMA_NO_NODE)); + memblock_reserved_size_flags(MEMBLOCK_ALLOC_ANYWHERE, NUMA_NO_NODE, + MEMBLOCK_RSRV_KERN)); } /* lowest address */ -- 2.54.0.1032.g2f8565e1d1-goog From pratyush at kernel.org Fri Jun 5 11:34:51 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 5 Jun 2026 20:34:51 +0200 Subject: [PATCH v2 18/18] kho: exclude hugetlb memory from scratch size calculation In-Reply-To: <20260605183501.3884950-1-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> Message-ID: <20260605183501.3884950-19-pratyush@kernel.org> From: "Pratyush Yadav (Google)" HugeTLB pages can be preserved memory. So they are never allocated from scratch. Instead, they are allocated from the memory blocks with no preserved memory. These areas are detected at runtime on each boot. But since they are allocated via memblock, they show up as RSRV_KERN, and blow up the scratch size when scratch scale is in use. 
All hugetlb pages are marked RSRV_HUGETLB. Subtract their size from RSRV_KERN when calculating scratch sizes. Signed-off-by: Pratyush Yadav (Google) --- kernel/liveupdate/kexec_handover.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index b3c33f150e85..0d106c9197d9 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -744,7 +744,8 @@ static void __init scratch_size_update(void) { /* * If fixed sizes are not provided via command line, calculate them - * now. + * now. Use RSRV_KERN to count allocated memory, but remove HugeTLB + * allocations from it because they never get allocated from scratch. */ if (scratch_scale) { phys_addr_t size; @@ -752,12 +753,19 @@ static void __init scratch_size_update(void) size = memblock_reserved_size_flags(ARCH_LOW_ADDRESS_LIMIT, NUMA_NO_NODE, MEMBLOCK_RSRV_KERN); + size -= memblock_reserved_size_flags(ARCH_LOW_ADDRESS_LIMIT, + NUMA_NO_NODE, + MEMBLOCK_RSRV_HUGETLB); + size = size * scratch_scale / 100; scratch_size_lowmem = size; size = memblock_reserved_size_flags(MEMBLOCK_ALLOC_ANYWHERE, NUMA_NO_NODE, MEMBLOCK_RSRV_KERN); + size -= memblock_reserved_size_flags(MEMBLOCK_ALLOC_ANYWHERE, + NUMA_NO_NODE, + MEMBLOCK_RSRV_HUGETLB); size = size * scratch_scale / 100 - scratch_size_lowmem; scratch_size_global = size; } @@ -777,6 +785,9 @@ static phys_addr_t __init scratch_size_node(int nid) if (scratch_scale) { size = memblock_reserved_size_flags(MEMBLOCK_ALLOC_ANYWHERE, nid, MEMBLOCK_RSRV_KERN); + /* Do not count HugeTLB pages. 
*/ + size -= memblock_reserved_size_flags(MEMBLOCK_ALLOC_ANYWHERE, + nid, MEMBLOCK_RSRV_HUGETLB); size = size * scratch_scale / 100; } else { size = scratch_size_pernode; -- 2.54.0.1032.g2f8565e1d1-goog From tarunsahu at google.com Fri Jun 5 12:12:22 2026 From: tarunsahu at google.com (tarunsahu at google.com) Date: Fri, 05 Jun 2026 19:12:22 +0000 Subject: [RFC PATCH v1 0/8] liveupdate: kvm: Guest_memfd preservation In-Reply-To: References: <9huzwlwnbgdd.fsf@tarunix.c.googlers.com> Message-ID: <9huz33z06bu1.fsf@tarunix.c.googlers.com> Hi, Ackerley Tng writes: > Sean Christopherson writes: > >>> >>> [...snip...] >>> >>> we have one open Question left: >>> 1. How to check guest_memfd is fully shared. >>> >>> [...snip...] >>> >> >> Given that lack of support isn't going to be limited to _just_ guest_memfd, >> simply disallow preservation if the VM supports private memory: >> >> if (kvm_arch_has_private_mem(kvm)) >> return -EOPNOTSUPP; > This is a good idea; I have implemented it in v2 (sent recently). Below is a detailed analysis. Please let me know your thoughts. > Makes sense. Tarun this was the other option that I was suggesting when > we discussed offline. > > I think (?) it is possible to create a fully-private guest_memfd for a > non-Confidential VM, and even after conversion lands, for both > vm_memory_attributes=true and vm_memory_attributes=false. > > In that case, your preservation series can still preserve memory tracked > as private by guest_memfd but not used as private, right? > > I don't think anyone will use this combination before guest_memfd > write() support lands, we just need to make sure there's no kernel crash > or corruption in this case. IIUC, private memory is currently defined as CoCo VM support with HW-based private memory. This is checked directly by kvm_arch_has_private_mem, which returns false in case the arch does not support it in HW (SEV/SNP, TDX).
So, about the combination where a guest_memfd is tracked as private but is not actually private (created without the INIT_SHARED flag), even though kvm_arch_has_private_mem is false. In this case LUO will preserve the guest_memfd, but it will not preserve any attributes, because 1. during creation we don't populate any attributes, so by default guest_memfd memory is always considered shared (even though INIT_SHARED is not passed). The conversion-to-private ioctl will fail because the kvm_arch_has_private_mem check fails for a non-CoCo VM. For a CoCo VM, preservation will not be supported and conversion to private memory will succeed. (No corruption or kernel crash in this case.) ==== WHAT IF CONVERSION SERIES LANDS ========= In this case, we have two scenarios: 1. kvm->attributes In this case, all the logic remains the same as above. 2. gmem->attributes If INIT_SHARED is set: memory is initially shared and there is no entry in the maple tree, and if kvm_arch_has_private_mem returns false (non-CoCo VM), any conversion request simply fails. Hence preserving the guest_memfd won't have any problem (fully shared case). If INIT_SHARED is not set: memory is initially private, and there is an entry in the maple tree marking the memory as private. During conversion: A. To private: it fails because kvm_arch_has_private_mem returns false. Preservation is safe and there will be no problem: preservation doesn't preserve any attributes, but on retrieval, when a new gmem file is created, an identical attributes entry will be assigned because the INIT_SHARED flag is not set. So we won't have any issues in this case. B. To shared: it passes and updates the maple tree. With the current patches, preservation will not preserve the attributes, and it will still be allowed because kvm_arch_has_private_mem is false. So in this case, on retrieval, we will lose the private vs. shared data (non-CoCo VM).
So if the host wants to access the shared memory, it will try mmap(), which will fail in the new kernel (though it succeeded in the old kernel because the conversion had happened). But I don't see any kernel crash or corruption. This is an issue only for non-CoCo VM support, with the conversion series having the gmem attributes enabled and without the INIT_SHARED flag. I am not sure if there will be any user for this very soon (is it SW_PROTECTED_VM?). Let me know if my understanding is correct. Should we add an INIT_SHARED check along with the kvm_arch_has_private_mem check, to make case 2.B above impossible in the future for the current guest_memfd support? From dianders at chromium.org Fri Jun 5 13:03:05 2026 From: dianders at chromium.org (Doug Anderson) Date: Fri, 5 Jun 2026 13:03:05 -0700 Subject: [PATCH 3/4] arm64: wire SDEI NMI into the hardlockup watchdog In-Reply-To: <6172eafcb9de6e626c0f1c36426d67e1e562ed32.1780496779.git.kas@kernel.org> References: <6172eafcb9de6e626c0f1c36426d67e1e562ed32.1780496779.git.kas@kernel.org> Message-ID: Hi, On Wed, Jun 3, 2026 at 7:36 AM Kiryl Shutsemau wrote: > > From: "Kiryl Shutsemau (Meta)" > > Select HAVE_HARDLOCKUP_DETECTOR_ARCH so the framework takes its backend > from this driver. A per-CPU hrtimer checks its buddy's heartbeat and > signals event 0 at a stalled CPU, which runs watchdog_hardlockup_check() > NMI-like. > > The source is chosen at boot: SDEI if firmware provides it, otherwise a > perf-NMI counter (pseudo-NMI) fallback -- one image covers both. > > Signed-off-by: Kiryl Shutsemau (Meta) > --- > arch/arm64/Kconfig | 1 + > drivers/firmware/Kconfig | 3 + > drivers/firmware/sdei_nmi.c | 247 +++++++++++++++++++++++++++++++++++- > 3 files changed, 248 insertions(+), 3 deletions(-) I'm a little confused about this patch. We already have a buddy hardlockup detector using the hrtimer, and it's even been improved recently to trigger in a smaller time bound. It looks as if you're duplicating bits of the perf and buddy detector here?
I don't think you need this patch at all. The existing buddy detector + patches #1 and #2 in your series should be sufficient. Did I misunderstand? -Doug From ackerleytng at google.com Fri Jun 5 13:05:02 2026 From: ackerleytng at google.com (Ackerley Tng) Date: Fri, 5 Jun 2026 13:05:02 -0700 Subject: [RFC PATCH v1 0/8] liveupdate: kvm: Guest_memfd preservation In-Reply-To: <9huz33z06bu1.fsf@tarunix.c.googlers.com> References: <9huzwlwnbgdd.fsf@tarunix.c.googlers.com> <9huz33z06bu1.fsf@tarunix.c.googlers.com> Message-ID: tarunsahu at google.com writes: > Hi, > > Ackerley Tng writes: > >> Sean Christopherson writes: >> >>>> >>>> [...snip...] >>>> >>>> we have one open Question left: >>>> 1. How to check guest_memfd is fully shared. >>>> >>>> [...snip...] >>>> >>> >>> Given that lack of support isn't going to be limited to _just_ guest_memfd, >>> simply disallow preservation if the VM supports private memory: >>> >>> if (kvm_arch_has_private_mem(kvm)) >>> return -EOPNOTSUPP; >> > This is a good idea, I have implemented it in V2 (sent it recently). > Will look at RFC v2 in a bit :) Thanks! > Below, I have mentioned a detailed analysis. Please let me know your thoughts. > >> Makes sense. Tarun this was the other option that I was suggesting when >> we discussed offline. >> >> I think (?) it is possible to create a fully-private guest_memfd for a >> non-Confidential VM, and even after conversion lands, for both >> vm_memory_attributes=true and vm_memory_attributes=false. >> >> In that case, your preservation series can still preserve memory tracked >> as private by guest_memfd but not used as private, right? >> >> I don't think anyone will use this combination before guest_memfd >> write() support lands, we just need to make sure there's no kernel crash >> or corruption in this case. > > IIUC, currently, private memory definition is where cocovm with HW based > private memory is supported. 
Which is directly checked by > kvm_arch_has_private_mem and this return false incase ARCH does not > support it in HW (SEV/SNP, TDX). > > So about the combination where a guest_memfd is tracked as private but > not actually private. (Created without the INIT_SHARED flags). Even > though kvm_arch_has_private_mem is false. > > In this case the luo will preserve the guest_memfd. But it will not > preserve any attributes, because > 1. during creation, we dont populate any attributes, so by default > guest_memfd memory always considered to be shared. (Even though no > INIT_SHARED is passed). Conversion to private IOCTL will fail as > kvm_arch_has_private_mem check will fail for non-COCOVM. > for COCOVM, > presevation will not be supported and conversion to private memory will > be succeded. (No corruption and kernel crash in this case) > > > ==== WHAT IF CONVERSION SERIES LANDS ========= > I think there are quite many cases to consider if we consider both before and after conversion series lands. For preservation, shall we assume it will go after conversion? :) This way we can avoid discussing some cases. > In this case, we have two scenerio > 1. kvm->attributes > In this case, Every logic remains same as above. This is the vm_memory_attributes=true case. Since kvm_supported_mem_attributes(kvm) returns 0 when kvm_arch_has_private_mem(kvm) returns false, the vm's mem_attr_array will always say shared for any vm where kvm_arch_has_private_mem() is false. So yup, preservation doesn't preserve attributes but that's also effectively preserving always-shared attributes, so all is good. > 2. gmem->attributes > if INIT_SHARED: memory is initially shared and no entry in maple tree, and if > kvm_arch_has_private_mem returns false (non-COCOVM), this will just > fails any conversion request. hence preserving the guest_memfd wont > have any problem (fully shared case). 
> Yup, preservation doesn't preserve attributes but that's also effectively preserving always-shared attributes, so all is good. > if INIT_SHARED not set: memory is initially private, and there is an > entry in maple tree, marked the memory as private. During conversion: > A. To private: it fails as kvm_arch_has_private_mem return false. > Preservation is safe and there will be no problem. as in preservation > we dont preserve any attributes, but on retrieval, when a new gmem > file is created, identical entry to attributes will also be assigned > as INIT_SHARED flag is not set. So we wont have any issues in this case. > B. TO shared: it passes, and update the maple tree. preservation will > not be preserving the attributes with current patches, and it will > also allow the preservation as kvm_arch_has_private_mem is false. So > in this case, on retrieval, we will lose the data regarding the > private vs shared (non-cocovm). So if host want to access the shared > memory, it will try MMAP and which will fail in new kernel (which was > successful in old kernel as conversion happened). But I dont see any > kernel crash or corruption. Perhaps it should be this way: the process of preserving should first freeze further conversions, then check for private/shared status, and if private, return error to userspace. I haven't looked at RFC v2 yet, but IIUC this is kind of in line with the other issue where we freeze allocations: + When preserving, freeze further allocations, preserve all pages that are already allocated. If there are pages that weren't allocated, too bad - allocations are frozen until unpreserve. + When preserving, freeze further conversions, preserve all shared pages. If there are private pages, fail preservation. This way, there's no loss of any private/shared state, since only shared pages are preserved. > This is an issue with only non-cocoVM support with conversion series > having the gmem attribute enabled without INIT_SHARED flag. 
I am not > sure, if there will be any user for this very soon (is it SW_PROTECTED_VM ?? ). > So if a non-CoCo VM does use an all-private guest_memfd (didn't specify INIT_SHARED) (don't think anyone will use this before guest_memfd write() support), it'll just always fail preservation unless before preserving everything was converted to shared. > Let me know, If my understanding is correct. Should we add INIT_SHARED > along with kvm_arch_has_private_mem check to make Above case 2.B. > impossible in future for the current support of guest_memfd. From dianders at chromium.org Fri Jun 5 13:42:57 2026 From: dianders at chromium.org (Doug Anderson) Date: Fri, 5 Jun 2026 13:42:57 -0700 Subject: [PATCH 4/4] arm64: route crash_smp_send_stop() last resort through SDEI In-Reply-To: <54cb99db3c981dc39eb3031aff5caeaadb09e8b9.1780496779.git.kas@kernel.org> References: <54cb99db3c981dc39eb3031aff5caeaadb09e8b9.1780496779.git.kas@kernel.org> Message-ID: Hi, On Wed, Jun 3, 2026 at 7:36?AM Kiryl Shutsemau wrote: > > @@ -1288,8 +1288,32 @@ void crash_smp_send_stop(void) > return; > crash_stop = 1; > > + /* > + * Stop the normal way first: IPI_CPU_STOP escalating to a pseudo-NMI > + * IPI. Every CPU that responds saves its state via crash_save_cpu() > + * and parks in cpu_park_loop() with its online bit cleared -- the > + * standard kdump stop, identical to a kernel without SDEI. Crucially > + * those CPUs stay in a clean, potentially-reusable state. > + */ > smp_send_stop(); > > + /* > + * Whatever is still online didn't respond -- typically a CPU wedged > + * with interrupts masked. The plain IPI can't reach it, and a fleet > + * that declines the pseudo-NMI hot-path cost has no NMI IPI to > + * escalate to. 
Hit only the survivors with the SDEI cross-CPU NMI > + * (no-op if SDEI isn't active, or if everything already stopped): > + * firmware delivers out of EL3 regardless of PSTATE.DAIF, and the > + * handler captures crash_save_cpu() state from the wedged context > + * before parking the CPU. > + * > + * SDEI is deliberately last: an SDEI-stopped CPU never completes its > + * event (it parks inside the handler, so EL3 retains its dispatch > + * slot until reset), which is strictly less recoverable than a normal > + * stop. We pay that only for CPUs that left no other way to reach them. > + */ > + sdei_nmi_crash_smp_send_stop(); It feels weird to me that you're adding SDEI for "crash stop" but not for regular "stop". It feels like you should modify smp_send_stop() to fall back to SDEI if sending the NMI failed, instead of adding this separate path. > static int sdei_nmi_handler(u32 event, struct pt_regs *regs, void *arg) > { > + int cpu = smp_processor_id(); > + > + if (READ_ONCE(*this_cpu_ptr(&sdei_nmi_crash_stop_requested))) { > + WRITE_ONCE(*this_cpu_ptr(&sdei_nmi_crash_stop_requested), 0); > + > + /* > + * Capture the wedged context for kdump while pt_regs still > + * points at the interrupted PC. This is the main motivation > + * for using SDEI here: the plain IPI stop path can't reach an > + * interrupt-masked CPU (and the fleet declines pseudo-NMI to > + * keep the IRQ-mask hot path cheap), so crash_save_cpu() for > + * that CPU would otherwise record nothing useful. > + */ > + crash_save_cpu(regs, cpu); > + set_cpu_online(cpu, false); > + > + /* publish the crash state/offline before the requester sees the ack */ > + smp_wmb(); > + WRITE_ONCE(*this_cpu_ptr(&sdei_nmi_crash_stop_acked), 1); > + > + /* > + * Park forever from within the SDEI handler. 
We deliberately > + * do NOT issue SDEI_EVENT_COMPLETE: the framework's return > + * path restores firmware's saved interrupted context, which > + * would land the CPU back wherever it was running (often > + * do_idle, which then notices cpu_is_offline=true and BUGs > + * at cpuhp_report_idle_dead). Returning the modified pt_regs > + * doesn't help -- arch/arm64/kernel/sdei.c::do_sdei_event > + * only honours a PC override via its IRQ-state heuristic > + * and otherwise hands EL3 its own saved-context slot back. > + * > + * Trade-off: EL3 firmware retains ~one saved-context slot > + * per parked CPU until the next hardware reset (~hundreds of > + * bytes per CPU). The CPU itself is parked in cpu_park_loop > + * exactly as if IPI_CPU_STOP had stopped it; recoverability > + * is unchanged versus the existing path (neither is > + * recoverable without hardware reset, since PSCI sees the > + * CPU as ALREADY_ON in both cases). > + */ > + cpu_park_loop(); > + /* unreachable */ Any chance we could avoid duplicating stuff from ipi_cpu_crash_stop()? > +bool sdei_nmi_crash_smp_send_stop(void) > +{ > + unsigned int this_cpu, cpu, remaining; > + unsigned long timeout; > + cpumask_t mask; The above will probably get you a yell. Putting "cpumask_t" on the stack is a no-no since it can be quite large under certain CONFIG options. This is why it's nearly always defined as "static". -Doug From dianders at chromium.org Fri Jun 5 13:46:05 2026 From: dianders at chromium.org (Doug Anderson) Date: Fri, 5 Jun 2026 13:46:05 -0700 Subject: [PATCH 1/4] firmware: arm_sdei: add SDEI_EVENT_SIGNAL support In-Reply-To: References: Message-ID: Hi, On Wed, Jun 3, 2026 at 7:36 AM Kiryl Shutsemau wrote: > > From: "Kiryl Shutsemau (Meta)" > > Add sdei_event_signal(), a thin wrapper over the SDEI_EVENT_SIGNAL call > (DEN0054) that makes the software-signalled event (event 0) pending on a > target PE -- delivered NMI-like even when that PE has interrupts masked.
> It takes no locks, so it is safe to call from NMI / crash context. > > Signed-off-by: Kiryl Shutsemau (Meta) > --- > drivers/firmware/arm_sdei.c | 12 ++++++++++++ > include/linux/arm_sdei.h | 6 ++++++ > include/uapi/linux/arm_sdei.h | 1 + > 3 files changed, 19 insertions(+) I'd never looked at SDEI before this (so my review is probably not terribly strong), but this looks reasonable to me. Reviewed-by: Douglas Anderson From dianders at chromium.org Fri Jun 5 13:54:00 2026 From: dianders at chromium.org (Doug Anderson) Date: Fri, 5 Jun 2026 13:54:00 -0700 Subject: [PATCH 2/4] drivers/firmware: add SDEI cross-CPU NMI service for arm64 In-Reply-To: <145b9e98b12a7d314fc4a203075f65c3a0c3a913.1780496779.git.kas@kernel.org> References: <145b9e98b12a7d314fc4a203075f65c3a0c3a913.1780496779.git.kas@kernel.org> Message-ID: Hi, On Wed, Jun 3, 2026 at 7:36 AM Kiryl Shutsemau wrote: > > @@ -928,11 +929,19 @@ static void arm64_backtrace_ipi(cpumask_t *mask) > void arch_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu) > { > /* > + * Prefer the SDEI cross-CPU NMI provider when active: firmware > + * dispatches the event out of EL3 and reaches CPUs that have > + * interrupts locally masked, without the per-IRQ-mask cost that > + * pseudo-NMI pays for the same reach. The plain IPI path below > + * can't reach such a CPU unless pseudo-NMI is enabled. > + * > * NOTE: though nmi_trigger_cpumask_backtrace() has "nmi_" in the name, > * nothing about it truly needs to be implemented using an NMI, it's > * just that it's _allowed_ to work with NMIs. If ipi_should_be_nmi() > * returned false our backtrace attempt will just use a regular IPI. > */ > + if (sdei_nmi_trigger_cpumask_backtrace(mask, exclude_cpu)) > + return; > nmi_trigger_cpumask_backtrace(mask, exclude_cpu, arm64_backtrace_ipi); nit: instead of one comment block, I would have broken it up in two. Like: /* * Prefer the SDEI ...
*/ if (sdei_nmi_trigger_cpumask_backtrace(mask, exclude_cpu)) return; /* * NOTE: though ... */ nmi_trigger_cpumask_backtrace(...); > } > > diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig > index bbd2155d8483..6501087ff90d 100644 > --- a/drivers/firmware/Kconfig > +++ b/drivers/firmware/Kconfig > @@ -36,6 +36,25 @@ config ARM_SDE_INTERFACE > standard for registering callbacks from the platform firmware > into the OS. This is typically used to implement RAS notifications. > > +config ARM_SDEI_NMI > + bool "SDEI-based cross-CPU NMI service (arm64)" > + depends on ARM64 && ARM_SDE_INTERFACE > + help > + Provides SDEI-based cross-CPU NMI delivery for hooks that need > + to reach interrupt-masked CPUs on silicon that lacks FEAT_NMI: > + > + - arch_trigger_cpumask_backtrace() (sysrq-l, RCU stalls, > + hardlockup_all_cpu_backtrace, soft-lockup secondary dumps, > + hung-task auxiliary dumps) > + > + The driver registers a handler for the SDEI software-signalled > + event (event 0) and reaches a target CPU by signalling it with > + SDEI_EVENT_SIGNAL. Firmware delivers the event out of EL3 > + regardless of the target's PSTATE.DAIF -- forced delivery into a > + CPU wedged with interrupts locally masked. > + > + If unsure, say N. Is there some downside to this? It seems like anyone who has the SDE interface would want this. Not sure why you'd suggest people say "N". Other than the nit, this looks reasonable to me, though I'm a complete noob when it comes to SDEI...
Reviewed-by: Douglas Anderson From kirill at shutemov.name Fri Jun 5 14:11:57 2026 From: kirill at shutemov.name (Kiryl Shutsemau) Date: Fri, 5 Jun 2026 22:11:57 +0100 Subject: [PATCH 3/4] arm64: wire SDEI NMI into the hardlockup watchdog In-Reply-To: References: <6172eafcb9de6e626c0f1c36426d67e1e562ed32.1780496779.git.kas@kernel.org> Message-ID: On Fri, Jun 05, 2026 at 01:03:05PM -0700, Doug Anderson wrote: > Hi, > > On Wed, Jun 3, 2026 at 7:36 AM Kiryl Shutsemau wrote: > > > > From: "Kiryl Shutsemau (Meta)" > > > > Select HAVE_HARDLOCKUP_DETECTOR_ARCH so the framework takes its backend > > from this driver. A per-CPU hrtimer checks its buddy's heartbeat and > > signals event 0 at a stalled CPU, which runs watchdog_hardlockup_check() > > NMI-like. > > > > The source is chosen at boot: SDEI if firmware provides it, otherwise a > > perf-NMI counter (pseudo-NMI) fallback -- one image covers both. > > > > Signed-off-by: Kiryl Shutsemau (Meta) > > --- > > arch/arm64/Kconfig | 1 + > > drivers/firmware/Kconfig | 3 + > > drivers/firmware/sdei_nmi.c | 247 +++++++++++++++++++++++++++++++++++- > > 3 files changed, 248 insertions(+), 3 deletions(-) > > I'm a little confused about this patch. We already have a buddy > hardlockup detector using the hrtimer, and it's even been improved > recently to trigger in a smaller time bound. It looks as if you're > duplicating bits of the perf and buddy detector here? > > I don't think you need this patch at all. The existing buddy detector > + patches #1 and #2 in your series should be sufficient. You're mostly right. Buddy + #2 covers the console case (the remote branch triggers the culprit's backtrace, which #2 makes deliverable), and #4 gets the wedged CPU's registers into the vmcore. The one thing this patch adds that a config can't is boot-time source selection: PERF-compiled kernels have no detector on a pseudo_nmi=0 boot, and PREFER_BUDDY costs the pseudo-NMI machines perf self-detection.
But that's arguably out of scope for the patchset. I'll drop this patch in v2 and run PREFER_BUDDY here. If a runtime perf->buddy fallback ever materializes, the gap closes entirely. -- Kiryl Shutsemau / Kirill A. Shutemov From kirill at shutemov.name Fri Jun 5 14:29:59 2026 From: kirill at shutemov.name (Kiryl Shutsemau) Date: Fri, 5 Jun 2026 22:29:59 +0100 Subject: [PATCH 2/4] drivers/firmware: add SDEI cross-CPU NMI service for arm64 In-Reply-To: References: <145b9e98b12a7d314fc4a203075f65c3a0c3a913.1780496779.git.kas@kernel.org> Message-ID: On Fri, Jun 05, 2026 at 01:54:00PM -0700, Doug Anderson wrote: > Hi, > > On Wed, Jun 3, 2026 at 7:36 AM Kiryl Shutsemau wrote: > > > > @@ -928,11 +929,19 @@ static void arm64_backtrace_ipi(cpumask_t *mask) > > void arch_trigger_cpumask_backtrace(const cpumask_t *mask, int exclude_cpu) > > { > > /* > > + * Prefer the SDEI cross-CPU NMI provider when active: firmware > > + * dispatches the event out of EL3 and reaches CPUs that have > > + * interrupts locally masked, without the per-IRQ-mask cost that > > + * pseudo-NMI pays for the same reach. The plain IPI path below > > + * can't reach such a CPU unless pseudo-NMI is enabled. > > + * > > * NOTE: though nmi_trigger_cpumask_backtrace() has "nmi_" in the name, > > * nothing about it truly needs to be implemented using an NMI, it's > > * just that it's _allowed_ to work with NMIs. If ipi_should_be_nmi() > > * returned false our backtrace attempt will just use a regular IPI. > > */ > > + if (sdei_nmi_trigger_cpumask_backtrace(mask, exclude_cpu)) > > + return; > > nmi_trigger_cpumask_backtrace(mask, exclude_cpu, arm64_backtrace_ipi); > > nit: instead of one comment block, I would have broken it up in two. Like: > > /* > * Prefer the SDEI ... > */ > if (sdei_nmi_trigger_cpumask_backtrace(mask, exclude_cpu)) > return; > > /* > * NOTE: though ... > */ > nmi_trigger_cpumask_backtrace(...); Makes sense.
> > } > > > > diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig > > index bbd2155d8483..6501087ff90d 100644 > > --- a/drivers/firmware/Kconfig > > +++ b/drivers/firmware/Kconfig > > @@ -36,6 +36,25 @@ config ARM_SDE_INTERFACE > > standard for registering callbacks from the platform firmware > > into the OS. This is typically used to implement RAS notifications. > > > > +config ARM_SDEI_NMI > > + bool "SDEI-based cross-CPU NMI service (arm64)" > > + depends on ARM64 && ARM_SDE_INTERFACE > > + help > > + Provides SDEI-based cross-CPU NMI delivery for hooks that need > > + to reach interrupt-masked CPUs on silicon that lacks FEAT_NMI: > > + > > + - arch_trigger_cpumask_backtrace() (sysrq-l, RCU stalls, > > + hardlockup_all_cpu_backtrace, soft-lockup secondary dumps, > > + hung-task auxiliary dumps) > > + > > + The driver registers a handler for the SDEI software-signalled > > + event (event 0) and reaches a target CPU by signalling it with > > + SDEI_EVENT_SIGNAL. Firmware delivers the event out of EL3 > > + regardless of the target's PSTATE.DAIF -- forced delivery into a > > + CPU wedged with interrupts locally masked. > > + > > + If unsure, say N. > > Is there some downside to this? It seems like anyone who has the SDE > interface would want this. Not sure why you'd suggest people say "N". No real downside -- without the software-signalled event the driver stays inert, and there is no cost until an event actually fires. The "say N" is caution, not a technical limit: so far this has run on QEMU (TF-A) and one hardware platform, and the interesting paths depend on each vendor's SDEI implementation at EL3. I'm not sure vendors would care to run SDEI_EVENT_SIGNAL validation. Maybe we want to see more data points first? But maybe I am too cautious. Happy to flip the recommendation (or add default y) in v2 if that's the consensus. > Other than the nit, this looks reasonable to me, though I'm a complete > noob when it comes to SDEI...
> > Reviewed-by: Douglas Anderson Thanks! -- Kiryl Shutsemau / Kirill A. Shutemov From kirill at shutemov.name Fri Jun 5 14:46:09 2026 From: kirill at shutemov.name (Kiryl Shutsemau) Date: Fri, 5 Jun 2026 22:46:09 +0100 Subject: [PATCH 4/4] arm64: route crash_smp_send_stop() last resort through SDEI In-Reply-To: References: <54cb99db3c981dc39eb3031aff5caeaadb09e8b9.1780496779.git.kas@kernel.org> Message-ID: On Fri, Jun 05, 2026 at 01:42:57PM -0700, Doug Anderson wrote: > > + sdei_nmi_crash_smp_send_stop(); > > It feels weird to me that you're adding SDEI for "crash stop" but not > for regular "stop". It feels like you should modify smp_send_stop() to > fall back to SDEI if sending the NMI failed, instead of adding this > separate path. Fair. A wedged CPU ignores the reboot-path stop just the same, and the escalation logic already lives in smp.c, so I'll restructure in v2. One thing to sort out there: this patch parks the stopped CPU inside its SDEI handler without completing the event, which is fine for the crash case (nothing expects the CPU back before reset), but a generic stop path probably wants SDEI_EVENT_COMPLETE_AND_RESUME into a parking stub instead, so that e.g. a regular kexec can bring all CPUs back up in the new kernel. I'll look into that as part of the rework. > > + cpu_park_loop(); > > + /* unreachable */ > > Any chance we could avoid duplicating stuff from ipi_cpu_crash_stop()? Yes -- falls out of the above. I will look into this. Maybe pull the save/offline/park body into a shared helper that both the IPI handler and the SDEI handler call. > > +bool sdei_nmi_crash_smp_send_stop(void) > > +{ > > + unsigned int this_cpu, cpu, remaining; > > + unsigned long timeout; > > + cpumask_t mask; > > The above will probably get you a yell. Putting "cpumask_t" on the > stack is a no-no since it can be quite large under certain CONFIG > options. This is why it's nearly always defined as "static". Doh! 
Will make it static in v2 -- safe here since the path is serialized by the crash_stop guard. -- Kiryl Shutsemau / Kirill A. Shutemov From jloeser at linux.microsoft.com Fri Jun 5 15:06:05 2026 From: jloeser at linux.microsoft.com (Jork Loeser) Date: Fri, 5 Jun 2026 15:06:05 -0700 (PDT) Subject: [PATCH v2 02/18] kho: disallow wide keys in radix tree In-Reply-To: <20260605183501.3884950-3-pratyush@kernel.org> References: <20260605183501.3884950-1-pratyush@kernel.org> <20260605183501.3884950-3-pratyush@kernel.org> Message-ID: On Fri, 5 Jun 2026, Pratyush Yadav wrote: > From: "Pratyush Yadav (Google)" > > The KHO radix tree was designed to track preserved pages. So it does not > provide the capability to track any 64-bit key. Instead, it limits the > key width to how much it needs for tracking PFNs and their orders. > Limiting the width reduces the number of levels in the tree. > > KHO is not expected to be the only user of the radix tree. With the API > generalized to allow other users, now it is possible to add any key to > the tree. > > Check the key width at kho_radix_add_key(), and error out if it exceeds > what the tree can handle. Do this instead of increasing the tree depth > since right now there are no users that need to use wider keys, so this > avoids memory overhead and ABI breakage. > > Signed-off-by: Pratyush Yadav (Google) > --- > include/linux/kho/abi/kexec_handover.h | 8 ++++++++ > kernel/liveupdate/kexec_handover.c | 12 ++++++++++++ > 2 files changed, 20 insertions(+) > > diff --git a/include/linux/kho/abi/kexec_handover.h b/include/linux/kho/abi/kexec_handover.h > index fb2d37417ad9..6dbb98bfb586 100644 > --- a/include/linux/kho/abi/kexec_handover.h > +++ b/include/linux/kho/abi/kexec_handover.h > @@ -278,6 +278,14 @@ enum kho_radix_consts { > KHO_TABLE_SIZE_LOG2) + 1, > }; > > +/* > + * The maximum key width this radix tree can track. > + * > + * This value isn't ABI itself, but it is derived from values that are ABI. 
> + */ > +#define KHO_RADIX_KEY_WIDTH (((KHO_TREE_MAX_DEPTH - 1) * KHO_TABLE_SIZE_LOG2) + \ > + KHO_BITMAP_SIZE_LOG2) Love the auto-derivation of these values, this totally makes sense. That said, my lazy brain complained a bit when I asked it "so how many bits can a consumer actually use?". So I wonder: 1) Why is the value not "ABI itself"? It feels like it should be, as it determines client behavior. 2) Would you consider expanding the actual values for the most relevant architectures (x86-64 w/ 4K pages, arm64 w/ 4K/16K/64K page sizes) and putting them in a block comment? > + * NOTE: Currently only keys of width up to %KHO_RADIX_KEY_WIDTH are supported. > + * This limit only exists because current users of the radix tree don't use more > + * than that. Changing the maximum width requires changing the tree depth, which > + * needs bumping the ABI version. It takes longer to walk the tree. The current implementation is a good tradeoff. Best, Jork From dianders at chromium.org Fri Jun 5 15:08:18 2026 From: dianders at chromium.org (Doug Anderson) Date: Fri, 5 Jun 2026 15:08:18 -0700 Subject: [PATCH 3/4] arm64: wire SDEI NMI into the hardlockup watchdog In-Reply-To: References: <6172eafcb9de6e626c0f1c36426d67e1e562ed32.1780496779.git.kas@kernel.org> Message-ID: Hi, On Fri, Jun 5, 2026 at 2:12 PM Kiryl Shutsemau wrote: > > On Fri, Jun 05, 2026 at 01:03:05PM -0700, Doug Anderson wrote: > > Hi, > > > > On Wed, Jun 3, 2026 at 7:36 AM Kiryl Shutsemau wrote: > > > > > > From: "Kiryl Shutsemau (Meta)" > > > > > > Select HAVE_HARDLOCKUP_DETECTOR_ARCH so the framework takes its backend > > > from this driver. A per-CPU hrtimer checks its buddy's heartbeat and > > > signals event 0 at a stalled CPU, which runs watchdog_hardlockup_check() > > > NMI-like. > > > > > > The source is chosen at boot: SDEI if firmware provides it, otherwise a > > > perf-NMI counter (pseudo-NMI) fallback -- one image covers both.
> > > > > > Signed-off-by: Kiryl Shutsemau (Meta) > > > --- > > > arch/arm64/Kconfig | 1 + > > > drivers/firmware/Kconfig | 3 + > > > drivers/firmware/sdei_nmi.c | 247 +++++++++++++++++++++++++++++++++++- > > > 3 files changed, 248 insertions(+), 3 deletions(-) > > > > I'm a little confused about this patch. We already have a buddy > > hardlockup detector using the hrtimer, and it's even been improved > > recently to trigger in a smaller time bound. It looks as if you're > > duplicating bits of the perf and buddy detector here? > > > > I don't think you need this patch at all. The existing buddy detector > > + patches #1 and #2 in your series should be sufficient. > > You're mostly right. > > Buddy + #2 covers the console case (the remote branch triggers the > culprit's backtrace, which #2 makes deliverable), and #4 gets the wedged > CPU's registers into the vmcore. > > The one thing this patch adds that a config can't is boot-time source > selection: PERF-compiled kernels have no detector on a pseudo_nmi=0 > boot, and PREFER_BUDDY costs the pseudo-NMI machines perf > self-detection. But that's arguably out of scope for the patchset. > > I'll drop this patch in v2 and run PREFER_BUDDY here. If a runtime > perf->buddy fallback ever materializes, the gap closes entirely. Sure. If you're interested in trying to make perf vs. buddy coexist, that should be done in a platform-agnostic way. Feel free to post patches for that. I know we discussed this previously. Ah, here they are: https://lore.kernel.org/r/20250916145122.416128-1-wangjinchao600 at gmail.com I think those got bikeshedded to death and nobody cared enough to keep pushing. FWIW, my belief is that the buddy detector is superior in every way except that it can't detect when all CPUs lock up simultaneously. ...though I wonder if a nicer way to solve the "all CPUs locked up" case is to just NMI-enable the "bark" interrupt of a hardware watchdog timer. That ought to be quite easy...
-Doug From praan at google.com Sat Jun 6 03:08:28 2026 From: praan at google.com (Pranjal Shrivastava) Date: Sat, 6 Jun 2026 10:08:28 +0000 Subject: [PATCH v6 03/12] PCI: liveupdate: Track incoming preserved PCI devices In-Reply-To: <20260522202410.3104264-4-dmatlack@google.com> References: <20260522202410.3104264-1-dmatlack@google.com> <20260522202410.3104264-4-dmatlack@google.com> Message-ID: On Fri, May 22, 2026 at 08:24:01PM +0000, David Matlack wrote: > During PCI enumeration, the previous kernel might have passed state about > devices that were preserved across kexec. The PCI core needs to fetch > this state to identify which devices are "incoming" and require special > handling. > > Add pci_liveupdate_setup_device() which is called during device setup > to fetch the serialized state (struct pci_ser) from the Live Update > Orchestrator. The first time this happens, pci_flb_retrieve() will run > and convert the array of pci_dev_ser structs into an xarray so that it > can be looked up efficiently. > > If a device is found in the xarray, the PCI core stores a pointer to its > state in dev->liveupdate_incoming and holds a reference to the incoming > FLB until pci_liveupdate_finish() is called by the driver. > > This ensures proper lifecycle management for incoming preserved devices > and allows the PCI core and drivers to apply specific Live Update > logic to them in subsequent commits. > > Drivers can check if a device is an incoming preserved device (e.g. > during probe) by calling pci_liveupdate_is_incoming(). > > CONFIG_64BIT is now required to enable CONFIG_PCI_LIVEUPDATE so that the > domain and bdf can be guaranteed to fit in an unsigned long and be used > as the xarray key. 
> > Signed-off-by: David Matlack > --- > MAINTAINERS | 1 + > drivers/pci/Kconfig | 2 +- > drivers/pci/liveupdate.c | 230 ++++++++++++++++++++++++++++++++- > drivers/pci/liveupdate.h | 5 + > drivers/pci/probe.c | 3 + > include/linux/pci_liveupdate.h | 13 ++ > 6 files changed, 251 insertions(+), 3 deletions(-) > > diff --git a/MAINTAINERS b/MAINTAINERS > index 6c618830cf61..0e262c0ceb43 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -20537,6 +20537,7 @@ L: linux-pci at vger.kernel.org > S: Maintained > T: git git://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git > F: drivers/pci/liveupdate.c > +F: drivers/pci/liveupdate.h > F: include/linux/kho/abi/pci.h > F: include/linux/pci_liveupdate.h > > diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig > index 10c9b65aa242..e68ae5c172d4 100644 > --- a/drivers/pci/Kconfig > +++ b/drivers/pci/Kconfig > @@ -330,7 +330,7 @@ config VGA_ARB_MAX_GPUS > > config PCI_LIVEUPDATE > bool "PCI Live Update Support" > - depends on PCI && LIVEUPDATE > + depends on PCI && LIVEUPDATE && 64BIT I see that the static assertions in Patch 1 work because of the 64BIT enforcement here. In that case, should we have the assertions check u64? > help > Enable PCI core support for preserving PCI devices across Live > Update. This, in combination with support in a device's driver, > [...] 
> static int pci_flb_retrieve(struct liveupdate_flb_op_args *args) > { > - args->obj = phys_to_virt(args->data); > + struct pci_ser *ser = phys_to_virt(args->data); > + struct pci_flb_incoming *incoming; > + int ret = -ENOMEM; > + u32 i; > + > + incoming = kmalloc_obj(*incoming); > + if (!incoming) > + goto err_restore_free; > + > + incoming->ser = ser; > + xa_init(&incoming->xa); > + > + for (i = 0; i < incoming->ser->max_nr_devices; i++) { > + struct pci_dev_ser *dev_ser = &incoming->ser->devices[i]; > + unsigned long key; > + > + if (!dev_ser->refcount) > + continue; > + > + key = pci_ser_xa_key(dev_ser->domain, dev_ser->bdf); > + ret = xa_insert(&incoming->xa, key, dev_ser, GFP_KERNEL); > + if (ret) > + goto err_xa_destroy; > + } > + > + args->obj = incoming; > return 0; > + > +err_xa_destroy: > + xa_destroy(&incoming->xa); > + kfree(incoming); > +err_restore_free: > + kho_restore_free(ser); I tend to partly agree with Sashiko[1] here.. it raises a policy-hole. We may need a policy here, the options I have in mind are: 1. Retrieve shall ONLY be tried once, if it fails (like -ENOMEM in the xArray alloc), it's a liveupdate failure. We can't retry liveupdate. 2. Retrying retrieve is allowed. The only downside with option 1 is, the user may want flexibility due to certain subsystems OR may choose NOT to use the proposed LUOd and instead have its own user-space component which might try funny things or have a different use-case. In such a situation, the system may have transiently run out of memory during the kexec transition (for e.g. a subsystem uses GFP_ATOMIC to allocate memory and temporarily runs out of the atomic pool). [Note we removed it in IOMMU v1 [2] but subsystems may have a use-case for it] If the kernel frees the KHO page on the first failure, it removes any chance of recovery. :/ Thus, it might make sense to let the user decide if it wants to fail the liveupdate or retry again based on the failure type / source? [...] 
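To make the two policies above concrete, here is a toy userspace model (not the kernel code). Every name in it (toy_flb, toy_retrieve, enomem_budget) is hypothetical, and it only assumes what the quoted error path shows: that a failed retrieve may or may not free the preserved state. It shows that freeing on the first -ENOMEM makes a retry impossible, while keeping the state lets a second attempt succeed once the transient shortage is over.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the incoming FLB state; not a kernel struct. */
struct toy_flb {
	bool preserved_valid;	/* models the KHO-preserved pci_ser still existing */
	int enomem_budget;	/* number of attempts that hit a transient -ENOMEM */
};

/*
 * free_on_failure == true models option 1 (retrieve is tried once, the
 * preserved state is freed on any error). free_on_failure == false models
 * option 2 (state is kept so the caller may retry).
 */
static int toy_retrieve(struct toy_flb *flb, bool free_on_failure)
{
	if (!flb->preserved_valid)
		return -ENOENT;	/* state already freed, nothing left to retrieve */

	if (flb->enomem_budget > 0) {
		flb->enomem_budget--;
		if (free_on_failure)
			flb->preserved_valid = false;
		return -ENOMEM;
	}

	return 0;
}

/* One transient failure followed by a single retry. */
static int toy_retrieve_with_retry(bool free_on_failure)
{
	struct toy_flb flb = { .preserved_valid = true, .enomem_budget = 1 };
	int ret = toy_retrieve(&flb, free_on_failure);

	if (ret == -ENOMEM)
		ret = toy_retrieve(&flb, free_on_failure);
	return ret;
}
```

In this model toy_retrieve_with_retry(true) ends in -ENOENT and toy_retrieve_with_retry(false) ends in 0. Whether a retry is desirable at all is exactly the policy question, so this is only an illustration of the difference, not a proposal for the final behavior.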
The changes LGTM, except for policy-based, kho_restore_free discussion. Reviewed-by: Pranjal Shrivastava Thanks, Praan [1] https://lore.kernel.org/all/20260522211333.D56A21F000E9 at smtp.kernel.org/ [2] https://lore.kernel.org/all/20260203220948.2176157-2-skhawaja at google.com/ From praan at google.com Sat Jun 6 03:20:19 2026 From: praan at google.com (Pranjal Shrivastava) Date: Sat, 6 Jun 2026 10:20:19 +0000 Subject: [PATCH v6 04/12] PCI: liveupdate: Document driver binding responsibilities In-Reply-To: <20260522202410.3104264-5-dmatlack@google.com> References: <20260522202410.3104264-1-dmatlack@google.com> <20260522202410.3104264-5-dmatlack@google.com> Message-ID: On Fri, May 22, 2026 at 08:24:02PM +0000, David Matlack wrote: > Document how driver binding works during a Live Update and what the PCI > core expects of drivers and users. Note that this is only a description > of the current division of responsibilities. These can change in the > future if we decide. > > Signed-off-by: David Matlack Reviewed-by: Pranjal Shrivastava praan at google.com Thanks, Praan From praan at google.com Sat Jun 6 04:10:29 2026 From: praan at google.com (Pranjal Shrivastava) Date: Sat, 6 Jun 2026 11:10:29 +0000 Subject: [PATCH v6 05/12] PCI: liveupdate: Keep bus numbers constant during Live Update In-Reply-To: <20260522202410.3104264-6-dmatlack@google.com> References: <20260522202410.3104264-1-dmatlack@google.com> <20260522202410.3104264-6-dmatlack@google.com> Message-ID: On Fri, May 22, 2026 at 08:24:03PM +0000, David Matlack wrote: > During a Live Update, preserved devices must be allowed to continue > performing memory transactions so the kernel cannot change the fabric > topology, including bus numbers, since that would require disabling > and flushing any memory transactions first. 
> > To keep bus numbers constant, always inherit the secondary and > subordinate bus numbers assigned to bridges during scanning, instead of > assigning new ones, if any PCI devices are being preserved. Note that > the kernel inherits bus numbers even on bridges without any downstream > endpoints that were preserved. This avoids accidentally assigning a > bridge a new window that overlaps with a preserved device that is > downstream of a different bridge. > > If a bridge is scanned with a broken topology or has no bus numbers > set during a Live Update, refuse to assign it new bus numbers and refuse > to enumerate devices below it until the Live Update is finished. This is > a safety measure to prevent topology conflicts. > > Require that CONFIG_CARDBUS is not enabled to enable > CONFIG_PCI_LIVEUPDATE since inheriting bus numbers on PCI-to-CardBus > bridges requires additional work but is not a priority at the moment. > > Signed-off-by: David Matlack > --- > .../admin-guide/kernel-parameters.txt | 6 +- > drivers/pci/Kconfig | 2 +- > drivers/pci/liveupdate.c | 83 ++++++++++++++++++- > drivers/pci/liveupdate.h | 14 ++++ > drivers/pci/probe.c | 17 +++- > include/linux/pci_liveupdate.h | 4 + > 6 files changed, 119 insertions(+), 7 deletions(-) > [...] > + incoming = pci_liveupdate_flb_get_incoming(); > + if (!incoming) { > + dev->liveupdate.inherit_buses = false; > + goto out; > + } > + > + /* > + * It is safe to sample incoming->ser->nr_devices and then > + * drop the rwsem since nr_devices will only decrease. Thus the > + * only "race" is that the current scan will be overly > + * conservative and force bus inheritance. > + */ > + dev->liveupdate.inherit_buses = incoming->ser->nr_devices; Nit: inherit_buses is a bool; while the compiler will handle it correctly, maybe we could use: dev->liveupdate.inherit_buses = !!incoming->ser->nr_devices OR dev->liveupdate.inherit_buses = (incoming->ser->nr_devices > 0) for readability?
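For reference, a small standalone sketch of why the three spellings are equivalent when the destination is a real bool. The helper names are made up for illustration and uint64_t is only a placeholder for whatever type nr_devices actually has. The last helper shows the case the explicit forms guard against: if the flag were ever a narrow integer rather than a bool, a count such as 256 would silently truncate to 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Implicit conversion, as in the quoted hunk: any non-zero count becomes true. */
static bool inherit_implicit(uint64_t nr_devices)
{
	bool inherit = nr_devices;

	return inherit;
}

/* The two explicit spellings suggested in the nit. */
static bool inherit_double_negation(uint64_t nr_devices)
{
	return !!nr_devices;
}

static bool inherit_comparison(uint64_t nr_devices)
{
	return nr_devices > 0;
}

/* Contrast: a narrow integer flag truncates instead of normalizing to 0/1. */
static uint8_t inherit_narrow_int(uint64_t nr_devices)
{
	return (uint8_t)nr_devices;
}
```

So the change is purely about readability as long as inherit_buses stays a bool; the explicit forms just make the intent survive a future type change.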
> + pci_liveupdate_flb_put_incoming(); > + } > + > +out: > + return dev->liveupdate.inherit_buses; > +} > + [...] > /* > @@ -1497,8 +1501,7 @@ static int pci_scan_bridge_extend(struct pci_bus *bus, struct pci_dev *dev, > * do in the second pass. > */ > if (!pass) { > - if (pcibios_assign_all_busses() || broken) > - > + if (assign_new_buses || broken) > /* > * Temporarily disable forwarding of the > * configuration cycles on all bridges in > @@ -1512,6 +1515,11 @@ static int pci_scan_bridge_extend(struct pci_bus *bus, struct pci_dev *dev, > goto out; > } > > + if (liveupdate) { > + pci_err(dev, "Cannot reconfigure bridge during Live Update, skipping\n"); > + goto out; > + } Quite helpful! Thanks :) > + > /* Clear errors */ > pci_write_config_word(dev, PCI_STATUS, 0xffff); > > @@ -1572,6 +1580,7 @@ static int pci_scan_bridge_extend(struct pci_bus *bus, struct pci_dev *dev, > pci_write_config_word(dev, PCI_BRIDGE_CONTROL, bctl); > > pm_runtime_put(&dev->dev); > + pci_liveupdate_scan_bridge_end(dev, pass); > > return max; > } With the minor nit above, Reviewed-by: Pranjal Shrivastava Thanks, Praan From rdunlap at infradead.org Sat Jun 6 12:36:34 2026 From: rdunlap at infradead.org (Randy Dunlap) Date: Sat, 6 Jun 2026 12:36:34 -0700 Subject: [PATCH] docs: memfd_preservation: fix rendering of ABI documentation In-Reply-To: <20260605160645.3650271-1-pratyush@kernel.org> References: <20260605160645.3650271-1-pratyush@kernel.org> Message-ID: <37fbe86b-b476-4d28-b8e6-5b8cb3808b42@infradead.org> On 6/5/26 9:06 AM, Pratyush Yadav wrote: > From: "Pratyush Yadav (Google)" > > The "memfd Live Update ABI" section in include/linux/kho/abi/memfd.h > currently does not render in the exported documentation. This is because > it should not include the "DOC:" in its reference. Drop it to ensure > correct rendering. Tested by running make htmldocs. 
> > Fixes: 15fc11bb2cb6 ("docs: add documentation for memfd preservation via LUO") > Signed-off-by: Pratyush Yadav (Google) Tested-by: Randy Dunlap Acked-by: Randy Dunlap Thanks. > --- > > Notes: > Mike/Pasha, I reckon this can still go in liveupdate/next. But if you > think it is too late, we can probably take it via -rc1 fixes as well. > > Documentation/mm/memfd_preservation.rst | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/Documentation/mm/memfd_preservation.rst b/Documentation/mm/memfd_preservation.rst > index a8a5b476afd3..c908a12dffa7 100644 > --- a/Documentation/mm/memfd_preservation.rst > +++ b/Documentation/mm/memfd_preservation.rst > @@ -11,7 +11,7 @@ Memfd Preservation ABI > ====================== > > .. kernel-doc:: include/linux/kho/abi/memfd.h > - :doc: DOC: memfd Live Update ABI > + :doc: memfd Live Update ABI > > .. kernel-doc:: include/linux/kho/abi/memfd.h > :internal: > > base-commit: 2935777b418d2bfcbfe96705bb2c0fa6c0d94e18 -- ~Randy From praan at google.com Sat Jun 6 15:15:15 2026 From: praan at google.com (Pranjal Shrivastava) Date: Sat, 6 Jun 2026 22:15:15 +0000 Subject: [PATCH v6 06/12] PCI: liveupdate: Auto-preserve upstream bridges across Live Update In-Reply-To: <20260522202410.3104264-7-dmatlack@google.com> References: <20260522202410.3104264-1-dmatlack@google.com> <20260522202410.3104264-7-dmatlack@google.com> Message-ID: On Fri, May 22, 2026 at 08:24:04PM +0000, David Matlack wrote: > When a PCI device is preserved across a Live Update, all of its upstream > bridges up to the root port must also be preserved. This enables the PCI > core and any drivers bound to the bridges to manage bridges correctly > across a Live Update. > > Notably, this will be used in subsequent commits to ensure that > preserved devices can continue performing memory transactions without a > disruption or change in routing. 
> > To preserve bridges, the PCI core tracks the number of downstream > devices preserved under each bridge using a reference count in struct > pci_dev_ser. This allows a bridge to remain preserved until all its > downstream preserved devices are unpreserved or finish their > participation in the Live Update. > > Signed-off-by: David Matlack > --- > drivers/pci/liveupdate.c | 136 +++++++++++++++++++++++++++++++----- > include/linux/kho/abi/pci.h | 5 +- > 2 files changed, 122 insertions(+), 19 deletions(-) > [...] > + > +#define for_each_pci_dev_in_path(_d, _start, _end) \ > + for ((_d) = (_start); (_d) != (_end); (_d) = (_d)->bus->self) > + > +static void __pci_liveupdate_unpreserve_path(struct pci_ser *ser, > + struct pci_dev *start, > + struct pci_dev *end) > +{ > + struct pci_dev *dev; > + > + for_each_pci_dev_in_path(dev, start, end) { > + if (pci_liveupdate_unpreserve_device(ser, dev)) I might be reading this wrong but are we leaking some upstream devs if an intermediate node fails? Assume we have RC -> B1 -> B2, with EP0 and EP1 both directly below B2, and EP0 & EP1 were preserved successfully. Then we try unpreserving EP1 and follow: unpreserve EP1 -> unpreserve B2, which fails due to a corruption. This aborts the loop, skipping B1 and RC completely? Their refcounts remain elevated, effectively leaking them as preserved state permanently? (i.e. if we unpreserve EP0 after this, B1 & RC will still be preserved).
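A toy userspace model of the scenario, to check the arithmetic. This is not the kernel code: toy_dev, fail_once and the helpers are hypothetical, and the model assumes each node's refcount counts the preserved endpoints at or below it and that the B2 failure is a one-off. It compares aborting the walk at the first failure (as in the quoted loop) with continuing upstream.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device on the path; not struct pci_dev. */
struct toy_dev {
	struct toy_dev *parent;	/* upstream bridge, NULL above the root port */
	int refcount;		/* preserved endpoints at or below this node */
	int fail_once;		/* make the next unpreserve of this node fail */
};

enum { RC, B1, B2, EP0, EP1, NR_DEVS };

/* RC -> B1 -> B2 -> { EP0, EP1 }, both endpoints preserved. */
static void toy_init(struct toy_dev *t)
{
	t[RC]  = (struct toy_dev){ NULL,   2, 0 };
	t[B1]  = (struct toy_dev){ &t[RC], 2, 0 };
	t[B2]  = (struct toy_dev){ &t[B1], 2, 1 };	/* fails on first unpreserve */
	t[EP0] = (struct toy_dev){ &t[B2], 1, 0 };
	t[EP1] = (struct toy_dev){ &t[B2], 1, 0 };
}

static int toy_unpreserve(struct toy_dev *d)
{
	if (d->fail_once) {
		d->fail_once = 0;
		return -1;
	}
	d->refcount--;
	return 0;
}

/* keep_walking == 0 mirrors the quoted loop: return at the first failure. */
static void toy_unpreserve_path(struct toy_dev *start, int keep_walking)
{
	for (struct toy_dev *d = start; d; d = d->parent) {
		if (toy_unpreserve(d) && !keep_walking)
			return;
	}
}

/* Refcount left on B1 after unpreserving EP1, then EP0. */
static int toy_b1_refs_after(int keep_walking)
{
	struct toy_dev t[NR_DEVS];

	toy_init(t);
	toy_unpreserve_path(&t[EP1], keep_walking);
	toy_unpreserve_path(&t[EP0], keep_walking);
	return t[B1].refcount;
}
```

With the abort-on-failure walk, B1 (and RC) end with a refcount of 1 even though no endpoint is preserved any more; with the continue-on-failure walk they drop to 0. B2 itself stays elevated in both variants, since its failed decrement never happened, so continuing the walk only limits the leak to the node that actually failed.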
> + return; > + } > +} > + > +static void pci_liveupdate_unpreserve_path(struct pci_ser *ser, > + struct pci_dev *start) > +{ > + __pci_liveupdate_unpreserve_path(ser, start, /*end=*/NULL); > +} > + > +static int pci_liveupdate_preserve_path(struct pci_ser *ser, > + struct pci_dev *start) > +{ > + struct pci_dev *dev; > + int ret; > + > + for_each_pci_dev_in_path(dev, start, NULL) { > + ret = pci_liveupdate_preserve_device(ser, dev); > + if (ret) { > + __pci_liveupdate_unpreserve_path(ser, start, dev); > + return ret; > + } > + } > + > + return 0; > +} > + > /** > * pci_liveupdate_preserve() - Preserve a PCI device across Live Update > * @dev: The PCI device to preserve. > @@ -321,6 +403,9 @@ static int pci_liveupdate_preserve_device(struct pci_ser *ser, struct pci_dev *d > * pci_liveupdate_preserve() from their struct liveupdate_file_handler > * preserve() callback to ensure the outgoing struct pci_ser is already set up. > * > + * pci_liveupdate_preserve() automatically preserves all bridges upstream of > + * @dev. > + * > * Returns: 0 on success, <0 on failure. > */ > int pci_liveupdate_preserve(struct pci_dev *dev) > @@ -336,7 +421,7 @@ int pci_liveupdate_preserve(struct pci_dev *dev) > if (IS_ERR(ser)) > return PTR_ERR(ser); > > - return pci_liveupdate_preserve_device(ser, dev); > + return pci_liveupdate_preserve_path(ser, dev); Minor nit: I might be too nitpicky here (and it's NOT a strong opinion) but naming it pci_liveupdate_preserve_path_for_dev() reads better to me. > } > EXPORT_SYMBOL_GPL(pci_liveupdate_preserve); > [...] Thanks, Praan