From rppt at kernel.org Mon Jun 1 00:00:05 2026
From: rppt at kernel.org (Mike Rapoport)
Date: Mon, 1 Jun 2026 10:00:05 +0300
Subject: [PATCH v2 0/3] kho: Add support for kunit mocking KHO restore API
In-Reply-To: <20260521193202.746810-1-skhawaja@google.com>
References: <20260521193202.746810-1-skhawaja@google.com>
Message-ID:

Hi Samiullah,

On Thu, May 21, 2026 at 07:31:59PM +0000, Samiullah Khawaja wrote:
> To write kunit tests for preservation and restoration of liveupdate
> state in various subsystems without triggering the actual kexec, the KHO
> restore API needs to be mocked by the test writer. The mocking is done
> to allow testing of the individual components or functions in isolation.
>
> The patch series adds the following to support kunit testing when using the KHO
> API:
>
> - Add static stub hooks to mock the KHO restore API so the restore path
>   can be tested without triggering kexec.
> - Add helper function that can be used by the test writer to check if
>   memory is preserved in KHO tree.
>
> Finally, it adds a KUnit test for the KHO API that verifies the allocation of
> preserved memory, and the preservation/restoration of pages and folios.

I looked at the tests for preservation and apparently they don't add
coverage beyond the existing KHO selftest. How hard and/or intrusive would
it be to add tests, for example, for error paths? Do you have an example of
a kunit test for another subsystem that would benefit from mocking of KHO
APIs?

> KHO Kunit test run:
>
> KTAP version 1
> 1..1
>     KTAP version 1
>     # Subtest: kho_test
>     # module: kexec_handover_test
>     1..3
>     ok 1 kho_test_alloc_preserve
>     ok 2 kho_test_preserve_pages
>     ok 3 kho_test_preserve_folio
> # kho_test: pass:3 fail:0 skip:0 total:3
> # Totals: pass:3 fail:0 skip:0 total:3
> ok 1 kho_test
>
> v2:
> - Move kunit header includes above linux header includes.
> - Use the __kho_preserve_pages_order() to get the order of preserved
>   pages instead of open order calculation math.
> > Samiullah Khawaja (3): > kho: Add kunit static stubs > kho: Add helper function to check if pages are preserved > kho: Add kunit test to verify preserve/restore pages and folio > > include/linux/kexec_handover.h | 5 + > kernel/liveupdate/Kconfig | 10 ++ > kernel/liveupdate/Makefile | 1 + > kernel/liveupdate/kexec_handover.c | 63 +++++++++++- > kernel/liveupdate/kexec_handover_test.c | 131 ++++++++++++++++++++++++ > 5 files changed, 209 insertions(+), 1 deletion(-) > create mode 100644 kernel/liveupdate/kexec_handover_test.c > > > base-commit: ec4084bc445027a52f600e30a976928be1ba1950 > -- > 2.54.0.746.g67dd491aae-goog > -- Sincerely yours, Mike. From dongtai.guo at linux.dev Mon Jun 1 02:28:20 2026 From: dongtai.guo at linux.dev (George Guo) Date: Mon, 1 Jun 2026 17:28:20 +0800 Subject: [PATCH v3 0/3] LoongArch: add KHO support and selftests Message-ID: <20260601092823.110362-1-dongtai.guo@linux.dev> From: George Guo This series adds Kexec Handover (KHO) support for LoongArch and extends the KHO selftest infrastructure to run on LoongArch under QEMU. KHO passes metadata (the KHO state FDT and scratch area addresses) to the second kernel via the FDT /chosen node, using the linux,kho-fdt and linux,kho-scratch properties that drivers/of/kexec.c:kho_add_chosen() writes and drivers/of/fdt.c:early_init_dt_check_kho() reads. KHO support (patches 1-2): Patch 1 adds KHO support for FDT-based systems (initial_boot_params != NULL, e.g. QEMU virt without OVMF). kho_load_fdt() copies the running kernel's FDT, appends linux,kho-fdt and linux,kho-scratch to /chosen, and loads the result as a kexec segment. machine_kexec() updates the DEVICE_TREE_GUID entry in the EFI config table to point to this segment so the second kernel's fdt_setup() can find and parse it. Patch 2 adds KHO support for ACPI-only systems (initial_boot_params == NULL, e.g. LoongArch servers with UEFI or QEMU with OVMF). 
Because no system FDT is available, kho_load_fdt() builds a minimal FDT from scratch containing only /chosen with the two KHO properties. Since DEVICE_TREE_GUID is absent from the EFI config table on ACPI-only systems, a new extended config table is built with the entry appended and loaded as a kexec segment; machine_kexec() switches st->tables to point to it before jumping. The second kernel's fdt_setup() calls efi_fdt_pointer() to detect the KHO FDT and passes it to early_init_dt_check_kho(). Selftest support (patch 3): Patch 3 adds loongarch.conf and extends vmtest.sh to recognise loongarch64 as a build target. The LoongArch virt machine is FDT-only (no ACPI), so 'earlycon' must appear on the kernel cmdline or the console UART is never discovered. PS/2 input devices are disabled since QEMU's LoongArch virt machine has no i8042 controller; the fallback port probe hits a page fault and panics before reaching userspace. QEMU provides no EFI runtime services on LoongArch, so machine_restart() falls through to an infinite idle loop after kexec; QEMU_TIMEOUT=120 in loongarch.conf lets timeout(1) terminate QEMU once the time limit is reached. Changes in v3: - Merge selftest patches 3 and 4 from v2 into a single patch - Replace QEMU_NEEDS_KILL/background kill loop with QEMU_TIMEOUT/timeout(1); the timeout value is set per-arch in the conf file. 
George Guo (3): LoongArch: kexec: add KHO support for FDT-based systems LoongArch: kexec: add KHO support for ACPI-only systems selftests/kho: add LoongArch vmtest support arch/loongarch/Kconfig | 3 + arch/loongarch/include/asm/kexec.h | 7 + arch/loongarch/kernel/machine_kexec.c | 38 +++ arch/loongarch/kernel/machine_kexec_file.c | 256 +++++++++++++++++++++ arch/loongarch/kernel/setup.c | 21 +- tools/testing/selftests/kho/loongarch.conf | 13 ++ tools/testing/selftests/kho/vmtest.sh | 23 +- 7 files changed, 353 insertions(+), 8 deletions(-) create mode 100644 tools/testing/selftests/kho/loongarch.conf -- 2.25.1 From dongtai.guo at linux.dev Mon Jun 1 02:39:30 2026 From: dongtai.guo at linux.dev (George Guo) Date: Mon, 1 Jun 2026 17:39:30 +0800 Subject: [PATCH v3 3/3] selftests/kho: add LoongArch vmtest support In-Reply-To: <20260601093930.112758-1-dongtai.guo@linux.dev> References: <20260601092823.110362-1-dongtai.guo@linux.dev> <20260601093930.112758-1-dongtai.guo@linux.dev> Message-ID: <20260601093930.112758-3-dongtai.guo@linux.dev> From: George Guo Add loongarch.conf to configure QEMU's LoongArch virt machine with a la464 CPU, enable the 8250 serial console, and set the kernel image to vmlinux.efi. Extend vmtest.sh to recognise loongarch64 as a supported target and map it to the 'loongarch' kernel arch name. QEMU's LoongArch virt machine provides no ACPI tables and relies on FDT to describe hardware. Without 'earlycon' on the kernel command line, the FDT is not scanned for a console UART, no output reaches the console, and vmtest.sh's console log stays empty causing the test to always fail. Add 'earlycon' to KERNEL_CMDLINE in loongarch.conf. QEMU's LoongArch virt machine has no i8042 PS/2 controller. When PNP detection finds nothing, i8042_init() falls back to probing the ports directly. 
On LoongArch the I/O ports are memory-mapped, and the i8042 port addresses are not backed by any device on the virt machine, so i8042_flush() takes a page fault and the kernel panics: i8042: PNP: No PS/2 controller found. i8042: Probing ports directly. CPU 0 Unable to handle kernel paging request at virtual address ffff800000008064 ERA: i8042_flush+0x50/0x198 RA: i8042_init+0x2a8/0x35c Kernel panic - not syncing: Attempted to kill init! Disable SERIO_I8042 and its dependents (KEYBOARD_ATKBD, MOUSE_PS2) in the QEMU_KCONFIG fragment to prevent the driver from being built. All three options are scoped to loongarch.conf; no other architecture is affected. QEMU provides no EFI runtime services on LoongArch, so machine_restart() falls through to an infinite idle loop after kexec. Set QEMU_TIMEOUT=120 in loongarch.conf so vmtest.sh wraps the QEMU invocation with timeout(1), which terminates QEMU after 120 seconds if it does not exit on its own. Architectures that do not set QEMU_TIMEOUT are unaffected. 
Co-developed-by: Kexin Liu Signed-off-by: Kexin Liu Signed-off-by: George Guo --- tools/testing/selftests/kho/loongarch.conf | 13 ++++++++++++ tools/testing/selftests/kho/vmtest.sh | 23 +++++++++++++++------- 2 files changed, 29 insertions(+), 7 deletions(-) create mode 100644 tools/testing/selftests/kho/loongarch.conf diff --git a/tools/testing/selftests/kho/loongarch.conf b/tools/testing/selftests/kho/loongarch.conf new file mode 100644 index 000000000000..68727654578d --- /dev/null +++ b/tools/testing/selftests/kho/loongarch.conf @@ -0,0 +1,13 @@ +QEMU_CMD="qemu-system-loongarch64 -M virt -cpu la464" +QEMU_KCONFIG=" +CONFIG_SERIAL_8250=y +CONFIG_SERIAL_8250_CONSOLE=y +# CONFIG_KEYBOARD_ATKBD is not set +# CONFIG_MOUSE_PS2 is not set +# CONFIG_SERIO_I8042 is not set +" +KERNEL_IMAGE="vmlinux.efi" +KERNEL_CMDLINE="console=ttyS0 earlycon" +# QEMU never exits after kexec on LoongArch (no EFI runtime services); +# give the test a fixed time limit and let timeout(1) terminate QEMU. +QEMU_TIMEOUT=120 diff --git a/tools/testing/selftests/kho/vmtest.sh b/tools/testing/selftests/kho/vmtest.sh index 49fdac8e8b15..918698b6dd2a 100755 --- a/tools/testing/selftests/kho/vmtest.sh +++ b/tools/testing/selftests/kho/vmtest.sh @@ -21,7 +21,7 @@ Options: -d) path to the kernel build directory -j) number of jobs for compilation, similar to -j in make -t) run test for target_arch, requires CROSS_COMPILE set - supported targets: aarch64, x86_64 + supported targets: aarch64, x86_64, loongarch64 -h) display this help EOF } @@ -107,12 +107,20 @@ function run_qemu() { cmdline="$cmdline kho=on panic=-1" - $qemu_cmd -m 1G -smp 2 -no-reboot -nographic -nodefaults \ - -accel kvm -accel hvf -accel tcg \ - -serial file:"$serial" \ - -append "$cmdline" \ - -kernel "$kernel" \ - -initrd "$initrd" + local qemu_args=( + -m 1G -smp 2 -no-reboot -nographic -nodefaults + -accel kvm -accel hvf -accel tcg + -serial file:"$serial" + -append "$cmdline" + -kernel "$kernel" + -initrd "$initrd" + ) + + if [[ 
-n "${QEMU_TIMEOUT:-}" ]]; then + timeout "$QEMU_TIMEOUT" $qemu_cmd "${qemu_args[@]}" || true + else + $qemu_cmd "${qemu_args[@]}" + fi grep "KHO restore succeeded" "$serial" &> /dev/null || fail "KHO failed" } @@ -123,6 +131,7 @@ function target_to_arch() { case $target in aarch64) echo "arm64" ;; x86_64) echo "x86" ;; + loongarch64) echo "loongarch" ;; *) skip "architecture $target is not supported" esac } -- 2.25.1 From pratyush at kernel.org Mon Jun 1 04:52:14 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 13:52:14 +0200 Subject: [liveupdate:next 11/21] kernel/liveupdate/luo_session.c:344:48: error: too few arguments provided to function-like macro invocation In-Reply-To: <202606011344.RHiYuqso-lkp@intel.com> (kernel test robot's message of "Mon, 01 Jun 2026 13:25:29 +0800") References: <202606011344.RHiYuqso-lkp@intel.com> Message-ID: <2vxzik82h40h.fsf@kernel.org> On Mon, Jun 01 2026, kernel test robot wrote: > tree: https://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git next > head: f4b4f5fe58a55a4212a263d9d44416778ca6e7a7 > commit: 74cab0be9a5d9d91471c4dee7311dcdfc1c0a6f4 [11/21] liveupdate: validate session type before performing operation > config: x86_64-buildonly-randconfig-003-20260601 (https://download.01.org/0day-ci/archive/20260601/202606011344.RHiYuqso-lkp at intel.com/config) > compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261) > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260601/202606011344.RHiYuqso-lkp at intel.com/reproduce) > > If you fix the issue in a separate patch/commit (i.e. 
not just a new version of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot > | Closes: https://lore.kernel.org/oe-kbuild-all/202606011344.RHiYuqso-lkp at intel.com/ > > All errors (new ones prefixed by >>): > >>> kernel/liveupdate/luo_session.c:344:48: error: too few arguments provided to function-like macro invocation > 344 | struct liveupdate_session_retrieve_fd, token), > | ^ > kernel/liveupdate/luo_session.c:327:9: note: macro 'IOCTL_OP' defined here > 327 | #define IOCTL_OP(_ioctl, _fn, _struct, _last, _type) \ > | ^ >>> kernel/liveupdate/luo_session.c:343:2: error: use of undeclared identifier 'IOCTL_OP' > 343 | IOCTL_OP(LIVEUPDATE_SESSION_RETRIEVE_FD, luo_session_retrieve_fd, > | ^ >>> kernel/liveupdate/luo_session.c:378:6: error: invalid application of 'sizeof' to an incomplete type 'const struct luo_ioctl_op[]' > 378 | ARRAY_SIZE(luo_session_ioctl_ops)) { > | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > include/linux/array_size.h:11:32: note: expanded from macro 'ARRAY_SIZE' > 11 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr)) > | ^~~~~ > 3 errors generated. This happens because the patch got moved from the fixes branch to next, and next has a new ioctl. Will send a fixup soon. [...] 
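The diagnostics above are the usual result of a function-like macro gaining a parameter on one branch while a call site from another branch still uses the old argument count. A minimal user-space sketch of that failure mode follows; `DEMO_OP`, `struct demo_op`, `struct demo_args` and `demo_retrieve` are hypothetical names chosen for illustration and are not the code in luo_session.c.

```c
#include <stddef.h>

/* Hypothetical op-table entry; the field names are illustrative only. */
struct demo_op {
	unsigned int cmd;
	int (*fn)(void *arg);
	size_t min_size;
	int type;
};

/* Hypothetical ioctl argument structure. */
struct demo_args {
	int fd;
	unsigned long token;
};

/*
 * Five-parameter table macro. min_size covers the structure up to and
 * including the last required member.
 */
#define DEMO_OP(_cmd, _fn, _struct, _last, _type)			\
	{								\
		.cmd = (_cmd),						\
		.fn = (_fn),						\
		.min_size = offsetof(_struct, _last) +			\
			    sizeof(((_struct *)0)->_last),		\
		.type = (_type),					\
	}

static int demo_retrieve(void *arg)
{
	(void)arg;
	return 0;
}

static const struct demo_op demo_ops[] = {
	/*
	 * Passing only four arguments here (dropping the trailing "1")
	 * makes the preprocessor reject the invocation with a
	 * "too few arguments" style error.
	 */
	DEMO_OP(0x10, demo_retrieve, struct demo_args, token, 1),
};

#define DEMO_ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
```

Once the macro invocation is rejected, the table initializer no longer parses, which plausibly explains the follow-on "undeclared identifier" and incomplete-type `sizeof` errors in the same report.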
-- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 04:57:19 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 13:57:19 +0200 Subject: [liveupdate:next 11/21] kernel/liveupdate/luo_session.c:344:48: error: too few arguments provided to function-like macro invocation In-Reply-To: <2vxzik82h40h.fsf@kernel.org> (Pratyush Yadav's message of "Mon, 01 Jun 2026 13:52:14 +0200") References: <202606011344.RHiYuqso-lkp@intel.com> <2vxzik82h40h.fsf@kernel.org> Message-ID: <2vxzcxyah3s0.fsf@kernel.org> On Mon, Jun 01 2026, Pratyush Yadav wrote: > On Mon, Jun 01 2026, kernel test robot wrote: > >> tree: https://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git next >> head: f4b4f5fe58a55a4212a263d9d44416778ca6e7a7 >> commit: 74cab0be9a5d9d91471c4dee7311dcdfc1c0a6f4 [11/21] liveupdate: validate session type before performing operation >> config: x86_64-buildonly-randconfig-003-20260601 (https://download.01.org/0day-ci/archive/20260601/202606011344.RHiYuqso-lkp at intel.com/config) >> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261) >> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260601/202606011344.RHiYuqso-lkp at intel.com/reproduce) >> >> If you fix the issue in a separate patch/commit (i.e. 
not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot
>> | Closes: https://lore.kernel.org/oe-kbuild-all/202606011344.RHiYuqso-lkp at intel.com/
>>
>> All errors (new ones prefixed by >>):
>>
>>>> kernel/liveupdate/luo_session.c:344:48: error: too few arguments provided to function-like macro invocation
>>      344 |                  struct liveupdate_session_retrieve_fd, token),
>>          |                                                       ^
>>    kernel/liveupdate/luo_session.c:327:9: note: macro 'IOCTL_OP' defined here
>>      327 | #define IOCTL_OP(_ioctl, _fn, _struct, _last, _type) \
>>          |         ^
>>>> kernel/liveupdate/luo_session.c:343:2: error: use of undeclared identifier 'IOCTL_OP'
>>      343 |         IOCTL_OP(LIVEUPDATE_SESSION_RETRIEVE_FD, luo_session_retrieve_fd,
>>          |         ^
>>>> kernel/liveupdate/luo_session.c:378:6: error: invalid application of 'sizeof' to an incomplete type 'const struct luo_ioctl_op[]'
>>      378 |             ARRAY_SIZE(luo_session_ioctl_ops)) {
>>          |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>    include/linux/array_size.h:11:32: note: expanded from macro 'ARRAY_SIZE'
>>       11 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))
>>          |                                ^~~~~
>>    3 errors generated.
>
> This happens because the patch got moved from the fixes branch to next,
> and next has a new ioctl. Will send a fixup soon.

Never mind. Seems like this is already fixed (thanks to Mike I think?).
-- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 05:08:46 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 14:08:46 +0200 Subject: [PATCH v4 01/13] liveupdate: change file_set->count type to u64 for type safety In-Reply-To: <20260530221938.115978-2-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:26 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-2-pasha.tatashin@soleen.com> Message-ID: <2vxz8q8yh38x.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > This improves type safety and aligns the in-memory file_set->count with > the serialized count type. It avoids potential truncation or sign > conversion mismatch issues. > > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) [...] -- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 05:15:14 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 14:15:14 +0200 Subject: [PATCH v4 02/13] liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd In-Reply-To: <20260530221938.115978-3-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:27 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-3-pasha.tatashin@soleen.com> Message-ID: <2vxz4ijmh2y5.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > Refactoring luo_session_retrieve_fd() to avoid mixing automated > cleanup-style guards with goto-based resource release, which is not > recommended under the Linux kernel coding style. > > Signed-off-by: Pasha Tatashin Perhaps we would be better off moving to FD_ADD() at some point, which should make this a little bit simpler? Anyway, this patch is still an improvement, so Reviewed-by: Pratyush Yadav (Google) [...] 
-- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 05:19:21 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 14:19:21 +0200 Subject: [PATCH v4 03/13] liveupdate: centralize state management into struct luo_ser In-Reply-To: <20260530221938.115978-4-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:28 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-4-pasha.tatashin@soleen.com> Message-ID: <2vxzzf1efo6u.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > Transition the LUO to ABI v2, which centralizes state management into a > single struct luo_ser header. > > Previously, LUO state was spread across multiple FDT properties and > subnodes. ABI v2 simplifies this by placing all core state, including > the liveupdate number and physical addresses for sessions and FLB > headers into a centralized struct luo_ser. > > Note that this change introduces a semantic difference: the sessions > and FLB serialization formats are no longer completely independent of > the core LUO. Their metadata (such as physical addresses for sessions > and FLB headers) is now coupled to and managed via the centralized > struct luo_ser. > > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) [...] 
-- 
Regards,
Pratyush Yadav

From will at kernel.org Mon Jun 1 05:35:16 2026
From: will at kernel.org (Will Deacon)
Date: Mon, 1 Jun 2026 13:35:16 +0100
Subject: [RFC PATCH 3/4] dma-direct: Add API to preserve/restore allocations
In-Reply-To: <20260505002737.2213734-4-skhawaja@google.com>
References: <20260505002737.2213734-1-skhawaja@google.com> <20260505002737.2213734-4-skhawaja@google.com>
Message-ID:

On Tue, May 05, 2026 at 12:27:36AM +0000, Samiullah Khawaja wrote:
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index ec887f443741..c2b98f91900a 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -6,6 +6,8 @@
>   */
>  #include /* for max_pfn */
>  #include
> +#include
> +#include
>  #include
>  #include
>  #include
> @@ -307,6 +309,167 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  	return NULL;
>  }
>
> +#ifdef CONFIG_DMA_LIVEUPDATE
> +int dma_direct_preserve_allocation(struct device *dev, void *cpu_addr,
> +				   size_t size, dma_addr_t dma_handle,
> +				   unsigned long attrs, u64 *state)
> +{
> +	struct dma_alloc_ser *ser;
> +	int ret;
> +
> +	if (!kho_is_enabled())
> +		return -EOPNOTSUPP;
> +
> +	if (IS_ENABLED(CONFIG_DMA_CMA))
> +		return -EOPNOTSUPP;

Hmm, it seems a bit overkill to do this just because CMA is compiled in,
especially as it's user-selectable in kconfig. Maybe you need to iterate
over the CMA areas using cma_for_each_area(), similarly to how you do with
the pools?
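A rough user-space sketch of the runtime alternative suggested here: rather than refusing whenever CMA support is compiled in, walk the registered areas with a callback and refuse only when the allocation actually overlaps one. The walker below is loosely modelled on the callback style of cma_for_each_area(); `struct demo_area`, `demo_for_each_area()` and the example address ranges are hypothetical and are not the kernel's CMA types or data.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a reserved area; not the kernel's struct cma. */
struct demo_area {
	uint64_t base;
	uint64_t size;
};

/* Made-up example areas. */
static const struct demo_area demo_areas[] = {
	{ .base = 0x80000000ULL, .size = 0x01000000ULL },
	{ .base = 0xc0000000ULL, .size = 0x00400000ULL },
};

/* Callback-style walk: stops at the first non-zero return value. */
static int demo_for_each_area(int (*it)(const struct demo_area *area,
					void *data), void *data)
{
	size_t i;

	for (i = 0; i < sizeof(demo_areas) / sizeof(demo_areas[0]); i++) {
		int ret = it(&demo_areas[i], data);

		if (ret)
			return ret;
	}
	return 0;
}

struct demo_range {
	uint64_t start;
	uint64_t len;
};

/* Returns 1 if the queried range overlaps this area, 0 otherwise. */
static int demo_overlaps(const struct demo_area *area, void *data)
{
	const struct demo_range *r = data;

	return r->start < area->base + area->size &&
	       area->base < r->start + r->len;
}

/* True if [start, start + len) touches any registered area. */
static bool demo_range_in_any_area(uint64_t start, uint64_t len)
{
	struct demo_range r = { .start = start, .len = len };

	return demo_for_each_area(demo_overlaps, &r) != 0;
}
```

With a check of this shape, a preserve path could return -EOPNOTSUPP only for buffers that really come from such an area, instead of for every kernel built with the option enabled.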
Will From pratyush at kernel.org Mon Jun 1 05:39:52 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 14:39:52 +0200 Subject: [PATCH v4 04/13] liveupdate: register luo_ser as KHO subtree In-Reply-To: <20260530221938.115978-5-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:29 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-5-pasha.tatashin@soleen.com> Message-ID: <2vxzv7c2fn8n.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > Entirely remove the LUO FDT wrapper since the FDT only carries the > compatible string and the pointer to the centralized struct luo_ser. > Instead, register the struct luo_ser via the KHO raw subtree > API, placing the compatibility string inside the structure itself. > > Signed-off-by: Pasha Tatashin > --- > include/linux/kho/abi/luo.h | 57 +++++++++--------------- > kernel/liveupdate/luo_core.c | 85 +++++++++++------------------------- > 2 files changed, 46 insertions(+), 96 deletions(-) > > diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h > index 1b2f865a771a..9a4fe491812b 100644 > --- a/include/linux/kho/abi/luo.h > +++ b/include/linux/kho/abi/luo.h > @@ -10,11 +10,11 @@ > * > * Live Update Orchestrator uses the stable Application Binary Interface > * defined below to pass state from a pre-update kernel to a post-update > - * kernel. The ABI is built upon the Kexec HandOver framework and uses a > - * Flattened Device Tree to describe the preserved data. > + * kernel. The ABI is built upon the Kexec HandOver framework and registers > + * the central `struct luo_ser` via the KHO raw subtree API. > * > - * This interface is a contract. Any modification to the FDT structure, node > - * properties, compatible strings, or the layout of the `__packed` serialization > + * This interface is a contract. 
Any modification to the structure fields, > + * compatible strings, or the layout of the `__packed` serialization > * structures defined here constitutes a breaking change. Such changes require > * incrementing the version number in the relevant `_COMPATIBLE` string to > * prevent a new kernel from misinterpreting data from an old kernel. > @@ -23,31 +23,15 @@ > * however, backward/forward compatibility is only guaranteed for kernels > * supporting the same ABI version. > * > - * FDT Structure Overview: > + * KHO Structure Overview: > * The entire LUO state is encapsulated within a single KHO entry named "LUO". > - * This entry contains an FDT with the following layout: > - * > - * .. code-block:: none > - * > - * / { > - * compatible = "luo-v2"; > - * luo-abi-header = ; > - * }; > - * > - * Main LUO Node (/): > - * > - * - compatible: "luo-v2" > - * Identifies the overall LUO ABI version. > - * - luo-abi-header: u64 > - * The physical address of `struct luo_ser`. > + * This entry contains the `struct luo_ser` structure. > * > * Serialization Structures: > - * The FDT properties point to memory regions containing arrays of simple, > - * `__packed` structures. These structures contain the actual preserved state. > - * > * - struct luo_ser: > * The central ABI structure that contains the overall state of the LUO. > - * It includes the liveupdate-number and pointers to sessions and FLBs. > + * It includes the compatibility string, the liveupdate-number, and pointers > + * to sessions and FLBs. > * > * - struct luo_session_header_ser: > * Header for the session array. Contains the total page count of the > @@ -78,26 +62,27 @@ > #ifndef _LINUX_KHO_ABI_LUO_H > #define _LINUX_KHO_ABI_LUO_H > > +#include > #include > > /* > - * The LUO FDT hooks all LUO state for sessions, fds, etc. > + * The LUO state is registered under this KHO entry name. 
> */ > -#define LUO_FDT_SIZE PAGE_SIZE > -#define LUO_FDT_KHO_ENTRY_NAME "LUO" > -#define LUO_FDT_COMPATIBLE "luo-v2" > -#define LUO_FDT_ABI_HEADER "luo-abi-header" > +#define LUO_KHO_ENTRY_NAME "LUO" > +#define LUO_ABI_COMPATIBLE "luo-v3" > +#define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) The length of the compatible field will change depending on the length of the string. While that is technically fine since a new ABI version is allowed to change the layout, it feels odd. I think it would be better if we define a static size here, say 64 bytes. This way you can avoid all the weirdness that can happen when you move from one version to another. > > /** > * struct luo_ser - Centralized LUO ABI header. > + * @compatible: Compatibility string identifying the LUO ABI version. > * @liveupdate_num: A counter tracking the number of successful live updates. > * @sessions_pa: Physical address of the first session block header. > * @flbs_pa: Physical address of the FLB header. > * > - * This structure is the root of all preserved LUO state. It is pointed to by > - * the "luo-abi-header" property in the LUO FDT. > + * This structure is the root of all preserved LUO state. > */ > struct luo_ser { > + char compatible[LUO_ABI_COMPAT_LEN]; > u64 liveupdate_num; > u64 sessions_pa; > u64 flbs_pa; [...] > @@ -94,40 +91,29 @@ static int __init luo_early_startup(void) > return 0; > } > > - /* Retrieve LUO subtree, and verify its format. */ > - err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys, NULL); > + /* Retrieve LUO state from KHO. 
*/ > + err = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &luo_ser_phys, &len); > if (err) { > if (err != -ENOENT) { > - pr_err("failed to retrieve FDT '%s' from KHO: %pe\n", > - LUO_FDT_KHO_ENTRY_NAME, ERR_PTR(err)); > + pr_err("failed to retrieve LUO state '%s' from KHO: %pe\n", > + LUO_KHO_ENTRY_NAME, ERR_PTR(err)); > return err; > } > > return 0; > } > > - luo_global.fdt_in = phys_to_virt(fdt_phys); > - err = fdt_node_check_compatible(luo_global.fdt_in, 0, > - LUO_FDT_COMPATIBLE); > - if (err) { > - pr_err("FDT '%s' is incompatible with '%s' [%d]\n", > - LUO_FDT_KHO_ENTRY_NAME, LUO_FDT_COMPATIBLE, err); > - > + if (len < sizeof(*luo_ser)) { len != sizeof(*luo_ser) here? > + pr_err("LUO state is too small (%zu < %zu)\n", len, sizeof(*luo_ser)); > return -EINVAL; > } > > - header_size = 0; > - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); > - if (!ptr || header_size != sizeof(u64)) { > - pr_err("Unable to get ABI header '%s' [%d]\n", > - LUO_FDT_ABI_HEADER, header_size); > - > + luo_ser = phys_to_virt(luo_ser_phys); > + if (strncmp(luo_ser->compatible, LUO_ABI_COMPATIBLE, LUO_ABI_COMPAT_LEN)) { > + pr_err("LUO state is incompatible with '%s'\n", LUO_ABI_COMPATIBLE); > return -EINVAL; > } > > - luo_ser_pa = get_unaligned((u64 *)ptr); > - luo_ser = phys_to_virt(luo_ser_pa); > - > luo_global.liveupdate_num = luo_ser->liveupdate_num; > pr_info("Retrieved live update data, liveupdate number: %lld\n", > luo_global.liveupdate_num); [...] 
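The two review points in this message (a fixed-size compatible field and an exact length check) can be sketched in user space as follows. This is an illustration under stated assumptions, not the patch: the 64-byte size is the value floated in the review, and `struct demo_luo_ser`, `DEMO_ABI_COMPAT_LEN` and `demo_luo_validate()` are hypothetical names that only mirror the quoted `struct luo_ser` fields.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ABI_COMPATIBLE	"luo-v3"
/* Fixed size, so the layout does not depend on the string length. */
#define DEMO_ABI_COMPAT_LEN	64

/* Mirrors the fields of the quoted struct luo_ser; no padding is needed. */
struct demo_luo_ser {
	char compatible[DEMO_ABI_COMPAT_LEN];
	uint64_t liveupdate_num;
	uint64_t sessions_pa;
	uint64_t flbs_pa;
};

_Static_assert(sizeof(DEMO_ABI_COMPATIBLE) <= DEMO_ABI_COMPAT_LEN,
	       "compatible string must fit in the fixed-size field");

/* Returns 0 if the blob has the expected size and compatible string. */
static int demo_luo_validate(const void *buf, size_t len)
{
	const struct demo_luo_ser *ser = buf;

	/* Exact match: a larger blob is as suspicious as a smaller one. */
	if (len != sizeof(*ser))
		return -EINVAL;

	if (strncmp(ser->compatible, DEMO_ABI_COMPATIBLE, DEMO_ABI_COMPAT_LEN))
		return -EINVAL;

	return 0;
}
```

Because the field size is a constant, bumping the version string from one release to the next leaves the offsets of the remaining members unchanged.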
-- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 06:38:34 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 15:38:34 +0200 Subject: [PATCH v4 07/13] kho: add support for linked-block serialization In-Reply-To: <20260530221938.115978-8-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:32 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-8-pasha.tatashin@soleen.com> Message-ID: <2vxzqzmqfkit.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > Introduce a linked-block serialization mechanism for state handover. > > Previously, LUO used contiguous memory blocks for serializing sessions > and files, which imposed limits on the total number of items that could > be preserved across a live update. > > This commit adds the infrastructure for a more flexible, block-based > approach where serialized data is stored in a chain of linked blocks. > This is a generic KHO serialization block infrastructure that can be > used by multiple subsystems. > > Signed-off-by: Pasha Tatashin > --- > Documentation/core-api/kho/abi.rst | 5 + > Documentation/core-api/kho/index.rst | 11 + > MAINTAINERS | 1 + > include/linux/kho/abi/block.h | 56 ++++ > include/linux/kho_block.h | 79 ++++++ > kernel/liveupdate/Makefile | 1 + > kernel/liveupdate/kho_block.c | 384 +++++++++++++++++++++++++++ > 7 files changed, 537 insertions(+) > create mode 100644 include/linux/kho/abi/block.h > create mode 100644 include/linux/kho_block.h > create mode 100644 kernel/liveupdate/kho_block.c > > diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst > index 799d743105a6..edeb5b311963 100644 > --- a/Documentation/core-api/kho/abi.rst > +++ b/Documentation/core-api/kho/abi.rst > @@ -28,6 +28,11 @@ KHO persistent memory tracker ABI > .. 
kernel-doc:: include/linux/kho/abi/kexec_handover.h > :doc: KHO persistent memory tracker > > +KHO serialization block ABI > +=========================== > + > +.. kernel-doc:: include/linux/kho/abi/block.h > + > See Also > ======== > > diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst > index 0a2dee4f8e7d..320914a42178 100644 > --- a/Documentation/core-api/kho/index.rst > +++ b/Documentation/core-api/kho/index.rst > @@ -83,6 +83,17 @@ Public API > .. kernel-doc:: kernel/liveupdate/kexec_handover.c > :export: > > +KHO Serialization Blocks API > +============================ > + > +.. kernel-doc:: kernel/liveupdate/kho_block.c > + :doc: KHO Serialization Blocks > + > +.. kernel-doc:: include/linux/kho_block.h > + > +.. kernel-doc:: kernel/liveupdate/kho_block.c > + :internal: > + > See Also > ======== > > diff --git a/MAINTAINERS b/MAINTAINERS > index 2fb1c75afd16..fd119b343e99 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -14194,6 +14194,7 @@ F: Documentation/admin-guide/mm/kho.rst > F: Documentation/core-api/kho/* > F: include/linux/kexec_handover.h > F: include/linux/kho/ > +F: include/linux/kho_block.h > F: kernel/liveupdate/kexec_handover* > F: lib/test_kho.c > F: tools/testing/selftests/kho/ > diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h > new file mode 100644 > index 000000000000..8641c20b379b > --- /dev/null > +++ b/include/linux/kho/abi/block.h > @@ -0,0 +1,56 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * Copyright (c) 2026, Google LLC. > + * Pasha Tatashin > + */ > + > +/** > + * DOC: KHO Serialization Blocks ABI > + * > + * Subsystems using the KHO Serialization Blocks framework rely on the stable > + * Application Binary Interface defined below to pass serialized state from a > + * pre-update kernel to a post-update kernel. > + * > + * This interface is a contract. 
Any modification to the structure fields,
> + * compatible strings, or the layout of the `__packed` serialization
> + * structures defined here constitutes a breaking change. Such changes require
> + * incrementing the version number in the `KHO_BLOCK_ABI_COMPATIBLE` string to
> + * prevent a new kernel from misinterpreting data from an old kernel.
> + *
> + * Changes are allowed provided the compatibility version is incremented;
> + * however, backward/forward compatibility is only guaranteed for kernels
> + * supporting the same ABI version.
> + */
> +
> +#ifndef _LINUX_KHO_ABI_BLOCK_H
> +#define _LINUX_KHO_ABI_BLOCK_H
> +
> +#include
> +#include
> +
> +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1"

During KHO radix development, I argued for a separate compatible for the
radix tree, but at that time, we tied the radix tree to the core KHO ABI.
The argument was that all core KHO data structures belong to the KHO ABI
set.

I imagine this will be used by kho_vmalloc, so it will also end up being
used by a core KHO API. So, do we want a separate ABI? I don't have much of
a preference myself, but I do think the compatible management will be a bit
easier if this relied on the KHO compatible, especially once kho_vmalloc
starts using it.

> +
> +/**
> + * KHO_BLOCK_SIZE - The size of each serialization block.
> + *
> + * This is defined as PAGE_SIZE. PAGE_SIZE is ABI compliant because live
> + * update between kernels with different page sizes is not supported by KHO.
> + */
> +#define KHO_BLOCK_SIZE PAGE_SIZE
> +
> +/**
> + * struct kho_block_header_ser - Header for the serialized data block.
> + * @next: Physical address of the next struct kho_block_header_ser.
> + * @count: The number of entries that immediately follow this header in the
> + *         memory block.
> + *
> + * This structure is located at the beginning of a block of physical memory
> + * preserved across a kexec. It provides the necessary metadata to interpret
> + * the array of entries that follow.
> + */ > +struct kho_block_header_ser { > + u64 next; > + u64 count; > +} __packed; > + > +#endif /* _LINUX_KHO_ABI_BLOCK_H */ > diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h > new file mode 100644 > index 000000000000..5e6b87b1befa > --- /dev/null > +++ b/include/linux/kho_block.h > @@ -0,0 +1,79 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * Copyright (c) 2026, Google LLC. > + * Pasha Tatashin > + */ > + > +#ifndef _LINUX_KHO_BLOCK_H > +#define _LINUX_KHO_BLOCK_H > + > +#include > +#include > +#include > + > +/** > + * struct kho_block - Internal representation of a serialization block. > + * @list: List head for linking blocks in memory. > + * @ser: Pointer to the serialized header in preserved memory. > + */ > +struct kho_block { > + struct list_head list; > + struct kho_block_header_ser *ser; > +}; > + > +/** > + * struct kho_block_set - A set of blocks that belong to the same object. > + * @blocks: The list of serialization blocks (struct kho_block). > + * @nblocks: The number of allocated serialization blocks. > + * @head_pa: Physical address of the first block header. > + * @entry_size: The size of each entry in the blocks. > + * @count_per_block: The maximum number of entries each block can hold. > + * @incoming: True if this block set was restored from the previous kernel. > + */ > +struct kho_block_set { > + struct list_head blocks; > + long nblocks; > + u64 head_pa; > + size_t entry_size; I think we should add the entry_size to kho_block_header_ser? I think it is a part of the ABI of the block set. If this changes, we cannot parse a block set with a different size. If a subsystem wants to change entry size, they create a new block set with different entry size, and then they bump their compatible version. > + u64 count_per_block; > + bool incoming; > +}; > + > +/** > + * struct kho_block_it - Iterator for serializing entries into blocks. > + * @bs: The block set being iterated. > + * @block: The current block. 
> + * @i: The current entry index within @block. > + */ > +struct kho_block_it { > + struct kho_block_set *bs; > + struct kho_block *block; > + u64 i; > +}; > + > +/** > + * KHO_BLOCK_SET_INIT - Initialize a static kho_block_set. > + * @_name: Name of the kho_block_set variable. > + * @_entry_size: The size of each entry in the block set. > + */ > +#define KHO_BLOCK_SET_INIT(_name, _entry_size) { \ > + .blocks = LIST_HEAD_INIT((_name).blocks), \ > + .entry_size = _entry_size, \ > +} > + > +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size); > + > +int kho_block_grow(struct kho_block_set *bs, u64 count); > +void kho_block_shrink(struct kho_block_set *bs, u64 count); These block management functions seem like internal details of the block set API. Do we need to export them? I think users should not have to worry about block management. They should read, set, or clear entries using the iterators, and internally the block management should take care of allocation or freeing. So here for example, I th > + > +int kho_block_restore(struct kho_block_set *bs, u64 head_pa); > +void kho_block_destroy(struct kho_block_set *bs); Nit: kho_block_set_{restore,destroy}()? At first glance I thought they manipulated a single block.
> +void kho_block_set_clear(struct kho_block_set *bs); > + > +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs); > +void *kho_block_it_next(struct kho_block_it *it); > +void *kho_block_it_read(struct kho_block_it *it); > +void *kho_block_it_prev(struct kho_block_it *it); > +void kho_block_it_finalize(struct kho_block_it *it); > + > +#endif /* _LINUX_KHO_BLOCK_H */ > diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile > index d2f779cbe279..eec9d3ae07eb 100644 > --- a/kernel/liveupdate/Makefile > +++ b/kernel/liveupdate/Makefile > @@ -1,6 +1,7 @@ > # SPDX-License-Identifier: GPL-2.0 > > luo-y := \ > + kho_block.o \ > luo_core.o \ > luo_file.o \ > luo_flb.o \ > diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c > new file mode 100644 > index 000000000000..a4e650af946f > --- /dev/null > +++ b/kernel/liveupdate/kho_block.c > @@ -0,0 +1,384 @@ > +// SPDX-License-Identifier: GPL-2.0 > + > +/* > + * Copyright (c) 2026, Google LLC. > + * Pasha Tatashin > + */ > + > +/** > + * DOC: KHO Serialization Blocks > + * > + * KHO provides a mechanism to preserve stateful data across a kexec handover > + * by serializing it into memory blocks. This file provides the common > + * infrastructure for managing these blocks. > + * > + * Each block consists of a header (struct kho_block_header_ser) followed by an > + * array of serialized entries. Multiple blocks are linked together via a > + * physical pointer in the header, forming a linked list that can be easily > + * traversed in both the current and the next kernel. > + */ > + > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > + > +#include > +#include > +#include > +#include > +#include > + > +/* > + * Safeguard limit for the number of serialization blocks. This is used to > + * prevent infinite loops and excessive memory allocation in case of memory > + * corruption in the preserved state. 
> + */ > +#define KHO_MAX_BLOCKS 10000 > + > +/** > + * kho_block_set_init - Initialize a block set. > + * @bs: The block set to initialize. > + * @entry_size: The size of each entry in the blocks. > + */ > +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size) > +{ > + *bs = (struct kho_block_set)KHO_BLOCK_SET_INIT(*bs, entry_size); > +} > + > +static inline u64 kho_block_count_per_block(struct kho_block_set *bs) > +{ > + if (unlikely(!bs->count_per_block)) { > + bs->count_per_block = (KHO_BLOCK_SIZE - > + sizeof(struct kho_block_header_ser)) / > + bs->entry_size; > + WARN_ON(!bs->count_per_block); > + } > + return bs->count_per_block; > +} This looks odd. I don't see a reason to calculate this lazily. Why not just do it when initializing the block set, in kho_block_set_init() or kho_block_restore()? And then use bs->count_per_block directly. > + > +/* Free serialized data */ > +static void kho_block_free_ser(struct kho_block_set *bs, > + struct kho_block_header_ser *ser) > +{ > + if (bs->incoming) > + kho_restore_free(ser); > + else > + kho_unpreserve_free(ser); > +} > + > +static struct kho_block_header_ser *kho_block_alloc_ser(struct kho_block_set *bs) > +{ > + WARN_ON(bs->incoming); WARN_ON_ONCE? > + return kho_alloc_preserve(KHO_BLOCK_SIZE); > +} > + > +static int kho_block_add(struct kho_block_set *bs, > + struct kho_block_header_ser *ser) > +{ > + struct kho_block *block, *last; > + > + if (bs->nblocks >= KHO_MAX_BLOCKS) > + return -ENOSPC; > + > + block = kzalloc_obj(*block); > + if (!block) > + return -ENOMEM; > + > + block->ser = ser; > + last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); > + list_add_tail(&block->list, &bs->blocks); > + bs->nblocks++; > + > + if (last) > + last->ser->next = virt_to_phys(ser); > + else > + bs->head_pa = virt_to_phys(ser); > + > + return 0; > +} > + > +/** > + * kho_block_grow - Create a new block if the current capacity is reached. > + * @bs: The block set. 
> + * @count: The current number of entries. > + * > + * This function handles the dynamic expansion of a block set. It allocates > + * and links a new serialization block if the provided entry count matches > + * the current total capacity of the set. > + * > + * Return: 0 on success, or a negative errno on failure. > + */ > +int kho_block_grow(struct kho_block_set *bs, u64 count) > +{ > + struct kho_block_header_ser *ser; > + int err; > + > + if (WARN_ON(bs->incoming)) WARN_ON_ONCE here too? > + return -EINVAL; > + > + if (count != bs->nblocks * kho_block_count_per_block(bs)) > + return 0; > + > + ser = kho_block_alloc_ser(bs); > + if (IS_ERR(ser)) > + return PTR_ERR(ser); > + > + err = kho_block_add(bs, ser); > + if (err) { > + kho_block_free_ser(bs, ser); > + return err; > + } > + > + return 0; > +} > + > +/** > + * kho_block_shrink - Conditionally destroy the last block in a block set. > + * @bs: The block set. > + * @count: The current number of entries across all blocks. > + * > + * This function checks if the last block in the set is redundant based on the > + * total entry count and the capacity of the preceding blocks. If the entry > + * count can be accommodated by the blocks that come before the last one, the > + * last block is destroyed and removed from the set. > + */ > +void kho_block_shrink(struct kho_block_set *bs, u64 count) > +{ > + struct kho_block *last, *new_last; > + > + if (count > (bs->nblocks - 1) * kho_block_count_per_block(bs)) > + return; > + > + if (list_empty(&bs->blocks)) > + return; > + > + last = list_last_entry(&bs->blocks, struct kho_block, list); > + list_del(&last->list); > + bs->nblocks--; > + kho_block_free_ser(bs, last->ser); > + kfree(last); > + > + new_last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); > + if (new_last) > + new_last->ser->next = 0; > + else > + bs->head_pa = 0; > +} > + > +/* > + * kho_cyclic_blocks_check - Check for cycles in a linked list of blocks. 
> + * Uses Floyd's cycle-finding algorithm to ensure sanity of the incoming list. > + */ > +static bool kho_cyclic_blocks_check(struct kho_block_set *bs) > +{ > + struct kho_block_header_ser *fast; > + struct kho_block_header_ser *slow; > + int count = 0; > + > + fast = phys_to_virt(bs->head_pa); > + slow = fast; > + > + while (fast) { > + if (count++ >= KHO_MAX_BLOCKS) { > + pr_err("Linked list too long\n"); > + return false; > + } > + > + if (!fast->next) > + break; > + > + fast = phys_to_virt(fast->next); > + if (!fast->next) > + break; > + > + fast = phys_to_virt(fast->next); > + slow = phys_to_virt(slow->next); > + > + if (slow == fast) { > + pr_err("Cyclic list detected\n"); Heh, reminds me of the time I was practicing leetcode for interviews ;-) > + return false; > + } > + } > + > + return true; > +} > + > +/** > + * kho_block_restore - Restore a block set from a physical address. > + * @bs: The block set to restore. > + * @head_pa: Physical address of the first block header. > + * > + * Return: 0 on success, or a negative errno on failure. > + */ > +int kho_block_restore(struct kho_block_set *bs, u64 head_pa) > +{ > + struct kho_block_header_ser *ser; > + u64 next_pa = head_pa; > + int err; > + > + /* Restored block sets use size from the previous kernel */ > + bs->incoming = true; > + if (!head_pa) > + return 0; > + > + bs->head_pa = head_pa; > + if (!kho_cyclic_blocks_check(bs)) { > + bs->head_pa = 0; > + return -EINVAL; > + } > + > + while (next_pa) { > + ser = phys_to_virt(next_pa); > + if (ser->count > kho_block_count_per_block(bs)) { > + pr_warn("Block contains too many entries: %llu\n", > + ser->count); > + err = -EINVAL; > + goto err_destroy; > + } > + err = kho_block_add(bs, ser); > + if (err) > + goto err_destroy; > + next_pa = ser->next; > + } > + > + return 0; > + > +err_destroy: > + kho_block_destroy(bs); > + return err; > +} > + > +/** > + * kho_block_destroy - Destroy all blocks in a block set. > + * @bs: The block set. 
> + */ > +void kho_block_destroy(struct kho_block_set *bs) > +{ > + u64 head_pa = bs->head_pa; > + struct kho_block *block; > + > + while (!list_empty(&bs->blocks)) { > + block = list_first_entry(&bs->blocks, struct kho_block, list); > + list_del(&block->list); > + kfree(block); > + } Nit: list_for_each_entry_safe(block, tmp, &bs->blocks, list) { list_del(&block->list); kfree(block); } is a bit more idiomatic (and IMO easier to read). > + bs->nblocks = 0; > + bs->head_pa = 0; > + > + while (head_pa) { > + struct kho_block_header_ser *ser = phys_to_virt(head_pa); > + > + head_pa = ser->next; > + kho_block_free_ser(bs, ser); Nit: also, can't you put this in the previous loop? Something like: list_for_each_entry_safe(block, tmp, &bs->blocks, list) { list_del(&block->list); kho_block_free_ser(bs, block->ser); kfree(block); } > + } > +} > + > +/** > + * kho_block_set_clear - Clear all serialized data in a block set. > + * @bs: The block set to clear. > + */ > +void kho_block_set_clear(struct kho_block_set *bs) > +{ > + struct kho_block *block; > + > + list_for_each_entry(block, &bs->blocks, list) { > + block->ser->count = 0; > + memset(block->ser + 1, 0, KHO_BLOCK_SIZE - sizeof(*block->ser)); > + } > +} > + > +/** > + * kho_block_it_init - Initialize a block set iterator. > + * @it: The iterator to initialize. > + * @bs: The block set to iterate over. > + */ > +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs) > +{ > + it->bs = bs; > + it->block = list_first_entry_or_null(&bs->blocks, struct kho_block, list); > + it->i = 0; > +} > + > +/** > + * kho_block_it_next - Return the next entry slot in the block set. > + * @it: The block iterator. > + * > + * If the current block is full, it automatically advances to the next block > + * in the set. > + * > + * Return: A pointer to the next entry slot, or NULL if no more slots are > + * available.
> + */ > +void *kho_block_it_next(struct kho_block_it *it) The naming and documentation here are very confusing. This and kho_block_it_read() look pretty much identical, and their documentation also looks pretty much identical. There seems to be only one tiny difference: this function returns the slot while incrementing the block count. Can we do better, with something like kho_block_it_write_next(struct kho_block_it *it, void *entry) (size was specified when creating block set)? Yes, this results in a copy but does that matter that much? And if you really want to avoid copying, perhaps kho_block_it_add_entry()? Or something along those lines? To make it clear this is adding an entry to the block set. Also, make the intended usage clear in the documentation. > +{ > + if (!it->block) > + return NULL; > + > + if (it->i == kho_block_count_per_block(it->bs)) { > + it->block->ser->count = it->i; > + if (list_is_last(&it->block->list, &it->bs->blocks)) > + return NULL; > + it->block = list_next_entry(it->block, list); > + it->i = 0; > + } > + > + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); > +} > + > +/** > + * kho_block_it_read - Return the next entry slot for reading. > + * @it: The block iterator. > + * > + * This function iterates through entries that were previously serialized, > + * respecting the count stored in each block's header. > + * > + * Return: A pointer to the next entry slot, or NULL if no more entries are > + * available. > + */ > +void *kho_block_it_read(struct kho_block_it *it) > +{ > + if (!it->block) > + return NULL; > + > + while (it->i == it->block->ser->count) { Hmm, the while loop suggests we can have blocks with zero count. Do you think we should detect those and error out instead? Since it doesn't really make sense to have a block with no entries.
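
A rough sketch of what such a check could look like, written as a userspace model rather than kernel code. The names struct blk_hdr and blk_chain_validate() are hypothetical, and the chain uses plain pointers where the patch uses physical addresses; the upper bound mirrors the "too many entries" test already present in kho_block_restore().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace stand-in for struct kho_block_header_ser. "next" is a plain
 * pointer here; the real header stores a physical address.
 */
struct blk_hdr {
	struct blk_hdr *next;
	uint64_t count;
};

/*
 * Hypothetical validation pass over an incoming chain: returns 0 if every
 * block holds between 1 and max_per_block entries, -EINVAL otherwise.
 */
static int blk_chain_validate(const struct blk_hdr *head,
			      uint64_t max_per_block)
{
	const struct blk_hdr *b;

	for (b = head; b; b = b->next) {
		/* An empty block carries no data, so treat it as corruption. */
		if (b->count == 0 || b->count > max_per_block)
			return -EINVAL;
	}

	return 0;
}
```

With a check like this done at restore time, the reader side could presumably use a plain "if" instead of the "while" loop.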
> + if (list_is_last(&it->block->list, &it->bs->blocks)) > + return NULL; > + it->block = list_next_entry(it->block, list); > + it->i = 0; > + } > + > + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); > +} > + > +/** > + * kho_block_it_prev - Return the previous entry slot in the block set. > + * @it: The block iterator. > + * > + * If the current index is at the start of a block, it automatically moves to > + * the end of the previous block. > + * > + * Return: A pointer to the previous entry slot, or NULL if at the very > + * beginning of the block set. > + */ > +void *kho_block_it_prev(struct kho_block_it *it) > +{ > + if (!it->block) > + return NULL; > + > + if (it->i == 0) { > + if (list_is_first(&it->block->list, &it->bs->blocks)) > + return NULL; > + it->block = list_prev_entry(it->block, list); > + it->i = kho_block_count_per_block(it->bs); > + } > + > + return (void *)(it->block->ser + 1) + (--it->i * it->bs->entry_size); > +} > + > +/** > + * kho_block_it_finalize - Finalize the current block by setting its entry count. > + * @it: The block iterator. > + */ > +void kho_block_it_finalize(struct kho_block_it *it) > +{ > + if (it->block) > + it->block->ser->count = it->i; > +} Doesn't kho_block_it_next() already do this when you add an entry? So this seems redundant. 
-- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 06:47:00 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 15:47:00 +0200 Subject: [PATCH v4 08/13] liveupdate: defer session block allocation and PA setting In-Reply-To: <20260530221938.115978-9-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:33 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-9-pasha.tatashin@soleen.com> Message-ID: <2vxzmrxefk4r.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > Currently, luo_session_setup_outgoing() allocates the session block and > sets its physical address in the header immediately. With upcoming > dynamic block-based session management, this makes the first block > different from the rest. Move the allocation to where it is first needed. > > Acked-by: Mike Rapoport (Microsoft) > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) [...] -- Regards, Pratyush Yadav From pasha.tatashin at soleen.com Mon Jun 1 06:50:49 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Mon, 1 Jun 2026 09:50:49 -0400 Subject: [PATCH v4 04/13] liveupdate: register luo_ser as KHO subtree In-Reply-To: <2vxzv7c2fn8n.fsf@kernel.org> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-5-pasha.tatashin@soleen.com> <2vxzv7c2fn8n.fsf@kernel.org> Message-ID: On 06-01 14:39, Pratyush Yadav wrote: > On Sat, May 30 2026, Pasha Tatashin wrote: > > > Entirely remove the LUO FDT wrapper since the FDT only carries the > > compatible string and the pointer to the centralized struct luo_ser. > > Instead, register the struct luo_ser via the KHO raw subtree > > API, placing the compatibility string inside the structure itself. 
> > > > Signed-off-by: Pasha Tatashin > > --- > > include/linux/kho/abi/luo.h | 57 +++++++++--------------- > > kernel/liveupdate/luo_core.c | 85 +++++++++++------------------------- > > 2 files changed, 46 insertions(+), 96 deletions(-) > > > > diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h > > index 1b2f865a771a..9a4fe491812b 100644 > > --- a/include/linux/kho/abi/luo.h > > +++ b/include/linux/kho/abi/luo.h > > @@ -10,11 +10,11 @@ > > * > > * Live Update Orchestrator uses the stable Application Binary Interface > > * defined below to pass state from a pre-update kernel to a post-update > > - * kernel. The ABI is built upon the Kexec HandOver framework and uses a > > - * Flattened Device Tree to describe the preserved data. > > + * kernel. The ABI is built upon the Kexec HandOver framework and registers > > + * the central `struct luo_ser` via the KHO raw subtree API. > > * > > - * This interface is a contract. Any modification to the FDT structure, node > > - * properties, compatible strings, or the layout of the `__packed` serialization > > + * This interface is a contract. Any modification to the structure fields, > > + * compatible strings, or the layout of the `__packed` serialization > > * structures defined here constitutes a breaking change. Such changes require > > * incrementing the version number in the relevant `_COMPATIBLE` string to > > * prevent a new kernel from misinterpreting data from an old kernel. > > @@ -23,31 +23,15 @@ > > * however, backward/forward compatibility is only guaranteed for kernels > > * supporting the same ABI version. > > * > > - * FDT Structure Overview: > > + * KHO Structure Overview: > > * The entire LUO state is encapsulated within a single KHO entry named "LUO". > > - * This entry contains an FDT with the following layout: > > - * > > - * .. 
code-block:: none > > - * > > - * / { > > - * compatible = "luo-v2"; > > - * luo-abi-header = ; > > - * }; > > - * > > - * Main LUO Node (/): > > - * > > - * - compatible: "luo-v2" > > - * Identifies the overall LUO ABI version. > > - * - luo-abi-header: u64 > > - * The physical address of `struct luo_ser`. > > + * This entry contains the `struct luo_ser` structure. > > * > > * Serialization Structures: > > - * The FDT properties point to memory regions containing arrays of simple, > > - * `__packed` structures. These structures contain the actual preserved state. > > - * > > * - struct luo_ser: > > * The central ABI structure that contains the overall state of the LUO. > > - * It includes the liveupdate-number and pointers to sessions and FLBs. > > + * It includes the compatibility string, the liveupdate-number, and pointers > > + * to sessions and FLBs. > > * > > * - struct luo_session_header_ser: > > * Header for the session array. Contains the total page count of the > > @@ -78,26 +62,27 @@ > > #ifndef _LINUX_KHO_ABI_LUO_H > > #define _LINUX_KHO_ABI_LUO_H > > > > +#include > > #include > > > > /* > > - * The LUO FDT hooks all LUO state for sessions, fds, etc. > > + * The LUO state is registered under this KHO entry name. > > */ > > -#define LUO_FDT_SIZE PAGE_SIZE > > -#define LUO_FDT_KHO_ENTRY_NAME "LUO" > > -#define LUO_FDT_COMPATIBLE "luo-v2" > > -#define LUO_FDT_ABI_HEADER "luo-abi-header" > > +#define LUO_KHO_ENTRY_NAME "LUO" > > +#define LUO_ABI_COMPATIBLE "luo-v3" > > +#define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) > > The length of the compatible field will change depending on the length > of the string. While that is technically fine since a new ABI version is > allowed to change the layout, it feels odd. I think it would be better > if we define a static size here, say 64 bytes. This way you can avoid > all the weirdness that can happen when you move from one version to > another. 
This is what I used initially, but we have cases where one LUO/KHO subsystem depends on another. For example, the LUO version must change when the block version changes, making the static length too restrictive. I would prefer to use proper strncmp() everywhere and allow the version string to change dynamically between kernels, while still allowing something like this (from [PATCH v4 09/13] liveupdate: Remove limit on the number of sessions): #define LUO_COMPAT_BASE "luo-v3" #define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE In the future, we may extend this further as we add more dependencies, such as your preservable xarray, vmalloc, etc. Everything that depends on an external version should include that in its compatibility string. > > > > > /** > > * struct luo_ser - Centralized LUO ABI header. > > + * @compatible: Compatibility string identifying the LUO ABI version. > > * @liveupdate_num: A counter tracking the number of successful live updates. > > * @sessions_pa: Physical address of the first session block header. > > * @flbs_pa: Physical address of the FLB header. > > * > > - * This structure is the root of all preserved LUO state. It is pointed to by > > - * the "luo-abi-header" property in the LUO FDT. > > + * This structure is the root of all preserved LUO state. > > */ > > struct luo_ser { > > + char compatible[LUO_ABI_COMPAT_LEN]; > > u64 liveupdate_num; > > u64 sessions_pa; > > u64 flbs_pa; > [...] > > @@ -94,40 +91,29 @@ static int __init luo_early_startup(void) > > return 0; > > } > > > > - /* Retrieve LUO subtree, and verify its format. */ > > - err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys, NULL); > > + /* Retrieve LUO state from KHO. 
*/ > > + err = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &luo_ser_phys, &len); > > if (err) { > > if (err != -ENOENT) { > > - pr_err("failed to retrieve FDT '%s' from KHO: %pe\n", > > - LUO_FDT_KHO_ENTRY_NAME, ERR_PTR(err)); > > + pr_err("failed to retrieve LUO state '%s' from KHO: %pe\n", > > + LUO_KHO_ENTRY_NAME, ERR_PTR(err)); > > return err; > > } > > > > return 0; > > } > > > > - luo_global.fdt_in = phys_to_virt(fdt_phys); > > - err = fdt_node_check_compatible(luo_global.fdt_in, 0, > > - LUO_FDT_COMPATIBLE); > > - if (err) { > > - pr_err("FDT '%s' is incompatible with '%s' [%d]\n", > > - LUO_FDT_KHO_ENTRY_NAME, LUO_FDT_COMPATIBLE, err); > > - > > + if (len < sizeof(*luo_ser)) { > > len != sizeof(*luo_ser) here? I can change this, but it is not necessary. It is common practice to verify that a "struct" is not smaller when compatibility is checked, allowing for future expansion without breaking compatibility with older kernels. I know we do not support forward/backward compatibility in any way right now, but I do not think it hurts to put the proper safeguards in place. Pasha > > > + pr_err("LUO state is too small (%zu < %zu)\n", len, sizeof(*luo_ser)); > > return -EINVAL; > > } > > > > - header_size = 0; > > - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); > > - if (!ptr || header_size != sizeof(u64)) { > > - pr_err("Unable to get ABI header '%s' [%d]\n", > > - LUO_FDT_ABI_HEADER, header_size); > > - > > + luo_ser = phys_to_virt(luo_ser_phys); > > + if (strncmp(luo_ser->compatible, LUO_ABI_COMPATIBLE, LUO_ABI_COMPAT_LEN)) { > > + pr_err("LUO state is incompatible with '%s'\n", LUO_ABI_COMPATIBLE); > > return -EINVAL; > > } > > > > - luo_ser_pa = get_unaligned((u64 *)ptr); > > - luo_ser = phys_to_virt(luo_ser_pa); > > - > > luo_global.liveupdate_num = luo_ser->liveupdate_num; > > pr_info("Retrieved live update data, liveupdate number: %lld\n", > > luo_global.liveupdate_num); > [...] 
> > -- > Regards, > Pratyush Yadav From pratyush at kernel.org Mon Jun 1 07:03:56 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 16:03:56 +0200 Subject: [PATCH v4 09/13] liveupdate: Remove limit on the number of sessions In-Reply-To: <20260530221938.115978-10-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:34 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-10-pasha.tatashin@soleen.com> Message-ID: <2vxzfr36fjcj.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > Currently, the number of LUO sessions is limited by a fixed number of > pre-allocated pages for serialization (16 pages, allowing for ~819 > sessions). > > This limitation is problematic if LUO is used to support things such as > systemd file descriptor store, and would be used not just as VM memory > but to save other states on the machine. > > Remove this limit by transitioning to a linked-block approach for > session metadata serialization. Instead of a single contiguous block, > session metadata is now stored in a chain of 16-page blocks. Each block > starts with a header containing the physical address of the next block > and the number of session entries in the current block. > > Acked-by: Mike Rapoport (Microsoft) > Signed-off-by: Pasha Tatashin > --- [...] > @@ -63,13 +58,15 @@ > #define _LINUX_KHO_ABI_LUO_H > > #include > +#include > #include > > /* > * The LUO state is registered under this KHO entry name. > */ > #define LUO_KHO_ENTRY_NAME "LUO" > -#define LUO_ABI_COMPATIBLE "luo-v3" > +#define LUO_COMPAT_BASE "luo-v3" > +#define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE That's clever :-) [...] 
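
For reference, a minimal userspace sketch of how the concatenated compatible composes and how the strncmp() check from patch 4 behaves on it. ALIGN_UP() is a local stand-in for the kernel's ALIGN(), and struct luo_ser_model and luo_compat_matches() are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Values copied from the quoted patches. */
#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1"
#define LUO_COMPAT_BASE "luo-v3"
/* Adjacent string literals are joined by the compiler. */
#define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE

/* Local stand-in for the kernel's ALIGN(); "a" must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((size_t)(a) - 1)) & ~((size_t)(a) - 1))
#define LUO_ABI_COMPAT_LEN ALIGN_UP(sizeof(LUO_ABI_COMPATIBLE), 8)

/* Hypothetical model of the compatible field at the start of struct luo_ser. */
struct luo_ser_model {
	char compatible[LUO_ABI_COMPAT_LEN];
};

/* Returns 1 when the incoming state carries exactly the expected string. */
static int luo_compat_matches(const struct luo_ser_model *ser)
{
	return strncmp(ser->compatible, LUO_ABI_COMPATIBLE,
		       LUO_ABI_COMPAT_LEN) == 0;
}
```

In this model the resulting string is "luo-v3-kho-block-v1", so a bump of either the LUO base or the block ABI changes the string and makes the comparison fail.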
> int luo_session_serialize(void) > { > struct luo_session_header *sh = &luo_session_global.outgoing; > struct luo_session *session; > - int i = 0; > + struct kho_block_it it; > int err; > > down_write(&luo_session_serialize_rwsem); > down_write(&sh->rwsem); > *sh->sessions_pa = 0; > > + kho_block_it_init(&it, &sh->block_set); > + > list_for_each_entry(session, &sh->list, list) { > - err = luo_session_freeze_one(session, &sh->ser[i]); > - if (err) > + struct luo_session_ser *ser = kho_block_it_next(&it); > + > + if (!ser) { > + err = -ENOSPC; > goto err_undo; > + } > > - strscpy(sh->ser[i].name, session->name, > - sizeof(sh->ser[i].name)); > - i++; > - } > + err = luo_session_freeze_one(session, ser); > + if (err) { > + kho_block_it_prev(&it); > + goto err_undo; > + } > > - if (sh->header_ser && sh->count > 0) { > - sh->header_ser->count = sh->count; > - *sh->sessions_pa = virt_to_phys(sh->header_ser); > + strscpy(ser->name, session->name, sizeof(ser->name)); > } > + > + kho_block_it_finalize(&it); > + > + if (sh->sessions_pa && sh->count > 0) Nit: Why check for sh->sessions_pa? It can never be NULL. 
Other than this, Reviewed-by: Pratyush Yadav (Google) > + *sh->sessions_pa = sh->block_set.head_pa; > up_write(&sh->rwsem); > > return 0; > > err_undo: > list_for_each_entry_continue_reverse(session, &sh->list, list) { > - i--; > - luo_session_unfreeze_one(session, &sh->ser[i]); > - memset(sh->ser[i].name, 0, sizeof(sh->ser[i].name)); > + struct luo_session_ser *ser = kho_block_it_prev(&it); > + > + luo_session_unfreeze_one(session, ser); > + memset(ser->name, 0, sizeof(ser->name)); > } > up_write(&sh->rwsem); > up_write(&luo_session_serialize_rwsem); -- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 07:16:25 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 16:16:25 +0200 Subject: [PATCH v4 10/13] liveupdate: Remove limit on the number of files per session In-Reply-To: <20260530221938.115978-11-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:35 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-11-pasha.tatashin@soleen.com> Message-ID: <2vxzbjdufirq.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > To remove the fixed limit on the number of preserved files per session, > transition the file metadata serialization from a single contiguous > memory block to a chain of linked blocks. 
> > Acked-by: Mike Rapoport (Microsoft) > Signed-off-by: Pasha Tatashin > --- > include/linux/kho/abi/luo.h | 13 +-- > kernel/liveupdate/luo_file.c | 144 +++++++++++++++---------------- > kernel/liveupdate/luo_internal.h | 6 +- > 3 files changed, 80 insertions(+), 83 deletions(-) > > diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h > index 79758d92ed5f..16df550ef143 100644 > --- a/include/linux/kho/abi/luo.h > +++ b/include/linux/kho/abi/luo.h > @@ -35,8 +35,8 @@ > * > * - struct luo_session_ser: > * Metadata for a single session, including its name and a physical pointer > - * to another preserved memory block containing an array of > - * `struct luo_file_ser` for all files in that session. > + * to the first `struct kho_block_header_ser` for all files in that session. > + * Multiple blocks are linked via the `next` field in the header. > * > * - struct luo_file_ser: > * Metadata for a single preserved file. Contains the `compatible` string to > @@ -65,7 +65,7 @@ > * The LUO state is registered under this KHO entry name. > */ > #define LUO_KHO_ENTRY_NAME "LUO" > -#define LUO_COMPAT_BASE "luo-v3" > +#define LUO_COMPAT_BASE "luo-v4" > #define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE > #define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) > > @@ -103,9 +103,10 @@ struct luo_file_ser { > > /** > * struct luo_file_set_ser - Represents the serialized metadata for file set > - * @files: The physical address of a contiguous memory block that holds > - * the serialized state of files (array of luo_file_ser) in this file > - * set. > + * @files: The physical address of the first `struct kho_block_header_ser`. > + * This structure is the header for a block of memory containing > + * an array of `struct luo_file_ser` entries. Multiple blocks are > + * linked via the `next` field in the header. > * @count: The total number of files that were part of this session during > * serialization. 
Used for iteration and validation during > * restoration. > diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c > index 9eec07a9e9fc..a445b1950ca7 100644 > --- a/kernel/liveupdate/luo_file.c > +++ b/kernel/liveupdate/luo_file.c > @@ -118,11 +118,6 @@ static LIST_HEAD(luo_file_handler_list); > /* Keep track of files being preserved by LUO */ > static DEFINE_XARRAY(luo_preserved_files); > > -/* 2 4K pages, give space for 128 files per file_set */ > -#define LUO_FILE_PGCNT 2ul > -#define LUO_FILE_MAX \ > - ((LUO_FILE_PGCNT << PAGE_SHIFT) / sizeof(struct luo_file_ser)) > - > /** > * struct luo_file - Represents a single preserved file instance. > * @fh: Pointer to the &struct liveupdate_file_handler that manages > @@ -174,39 +169,6 @@ struct luo_file { > u64 token; > }; > > -static int luo_alloc_files_mem(struct luo_file_set *file_set) > -{ > - size_t size; > - void *mem; > - > - if (file_set->files) > - return 0; > - > - WARN_ON_ONCE(file_set->count); > - > - size = LUO_FILE_PGCNT << PAGE_SHIFT; > - mem = kho_alloc_preserve(size); > - if (IS_ERR(mem)) > - return PTR_ERR(mem); > - > - file_set->files = mem; > - > - return 0; > -} > - > -static void luo_free_files_mem(struct luo_file_set *file_set) > -{ > - /* If file_set has files, no need to free preservation memory */ > - if (file_set->count) > - return; > - > - if (!file_set->files) > - return; > - > - kho_unpreserve_free(file_set->files); > - file_set->files = NULL; > -} > - > static unsigned long luo_get_id(struct liveupdate_file_handler *fh, > struct file *file) > { > @@ -276,16 +238,15 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) > if (luo_token_is_used(file_set, token)) > return -EEXIST; > > - if (file_set->count == LUO_FILE_MAX) > - return -ENOSPC; > + err = kho_block_grow(&file_set->block_set, file_set->count); > + if (err) > + return err; > > file = fget(fd); > - if (!file) > - return -EBADF; > - > - err = luo_alloc_files_mem(file_set); > - if (err) > - goto 
err_fput; > + if (!file) { > + err = -EBADF; > + goto err_shrink; > + } > > err = -ENOENT; > down_read(&luo_register_rwlock); > @@ -300,7 +261,7 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) > > /* err is still -ENOENT if no handler was found */ > if (err) > - goto err_free_files_mem; > + goto err_fput; > > err = xa_insert(&luo_preserved_files, luo_get_id(fh, file), > file, GFP_KERNEL); > @@ -343,10 +304,10 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) > xa_erase(&luo_preserved_files, luo_get_id(fh, file)); > err_module_put: > module_put(fh->ops->owner); > -err_free_files_mem: > - luo_free_files_mem(file_set); > err_fput: > fput(file); > +err_shrink: > + kho_block_shrink(&file_set->block_set, file_set->count); > > return err; > } > @@ -392,13 +353,14 @@ void luo_file_unpreserve_files(struct luo_file_set *file_set) > > list_del(&luo_file->list); > file_set->count--; > + kho_block_shrink(&file_set->block_set, file_set->count); > > fput(luo_file->file); > mutex_destroy(&luo_file->mutex); > kfree(luo_file); > } > > - luo_free_files_mem(file_set); > + kho_block_destroy(&file_set->block_set); > } > > static int luo_file_freeze_one(struct luo_file_set *file_set, > @@ -454,7 +416,7 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, > luo_file_unfreeze_one(file_set, luo_file); > } > > - memset(file_set->files, 0, LUO_FILE_PGCNT << PAGE_SHIFT); > + kho_block_set_clear(&file_set->block_set); > } > > /** > @@ -493,19 +455,23 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, > int luo_file_freeze(struct luo_file_set *file_set, > struct luo_file_set_ser *file_set_ser) > { > - struct luo_file_ser *file_ser = file_set->files; > struct luo_file *luo_file; > + struct kho_block_it it; > int err; > - int i; > > if (!file_set->count) > return 0; > > - if (WARN_ON(!file_ser)) > - return -EINVAL; > + kho_block_it_init(&it, &file_set->block_set); > > - i = 0; > list_for_each_entry(luo_file, 
&file_set->files_list, list) { > + struct luo_file_ser *file_ser = kho_block_it_next(&it); > + > + if (!file_ser) { > + err = -ENOSPC; > + goto err_unfreeze; > + } This should not fail normally, right? Since we pre-allocate the memory. Perhaps add a comment saying that? > + > err = luo_file_freeze_one(file_set, luo_file); > if (err < 0) { > pr_warn("Freeze failed for token[%#0llx] handler[%s] err[%pe]\n", > @@ -514,16 +480,21 @@ int luo_file_freeze(struct luo_file_set *file_set, > goto err_unfreeze; > } > > - strscpy(file_ser[i].compatible, luo_file->fh->compatible, > - sizeof(file_ser[i].compatible)); > - file_ser[i].data = luo_file->serialized_data; > - file_ser[i].token = luo_file->token; > - i++; > + strscpy(file_ser->compatible, luo_file->fh->compatible, > + sizeof(file_ser->compatible)); > + file_ser->data = luo_file->serialized_data; > + file_ser->token = luo_file->token; > } > + kho_block_it_finalize(&it); > > file_set_ser->count = file_set->count; > - if (file_set->files) > - file_set_ser->files = virt_to_phys(file_set->files); > + if (!list_empty(&file_set->block_set.blocks)) { > + struct kho_block *block; > + > + block = list_first_entry(&file_set->block_set.blocks, > + struct kho_block, list); > + file_set_ser->files = virt_to_phys(block->ser); > + } Please, add an API in KHO block to return the header physical address. Poking into the internals of the data structure like this is not a good idea. I missed that patch 9 also does this. So please use that there too. 
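A minimal sketch of the kind of accessor requested above, written as a self-contained userspace model rather than kernel code. It assumes the block set tracks the physical address of its first header, as the head_pa field in patch 7 suggests. The names kho_block_set_model, kho_block_set_model_empty() and kho_block_set_model_head_pa() are hypothetical and are not part of the posted series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct kho_block_set; only the fields used here. */
struct kho_block_set_model {
	long nblocks;      /* number of allocated serialization blocks */
	uint64_t head_pa;  /* physical address of the first block header */
};

/* Hypothetical helper: true when the set owns no blocks. */
static inline bool kho_block_set_model_empty(const struct kho_block_set_model *bs)
{
	assert(bs);
	return bs->nblocks == 0;
}

/* Hypothetical helper: physical address of the first header, 0 when empty. */
static inline uint64_t kho_block_set_model_head_pa(const struct kho_block_set_model *bs)
{
	return kho_block_set_model_empty(bs) ? 0 : bs->head_pa;
}
```

With helpers along these lines, a caller such as luo_file_freeze() could presumably assign the returned address to file_set_ser->files without walking block_set.blocks itself.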
> > return 0; > > @@ -741,14 +712,12 @@ int luo_file_finish(struct luo_file_set *file_set) > module_put(luo_file->fh->ops->owner); > list_del(&luo_file->list); > file_set->count--; > + kho_block_shrink(&file_set->block_set, file_set->count); > mutex_destroy(&luo_file->mutex); > kfree(luo_file); > } > > - if (file_set->files) { > - kho_restore_free(file_set->files); > - file_set->files = NULL; > - } > + kho_block_destroy(&file_set->block_set); > > return 0; > } > @@ -822,16 +791,18 @@ int luo_file_deserialize(struct luo_file_set *file_set, > struct luo_file_set_ser *file_set_ser) > { > struct luo_file_ser *file_ser; > + struct kho_block_it it; > int err; > - u64 i; > > if (!file_set_ser->files) { > WARN_ON(file_set_ser->count); > return 0; > } > > - file_set->count = file_set_ser->count; > - file_set->files = phys_to_virt(file_set_ser->files); > + file_set->count = 0; > + err = kho_block_restore(&file_set->block_set, file_set_ser->files); > + if (err) > + return err; > > /* > * Note on error handling: > @@ -848,25 +819,50 @@ int luo_file_deserialize(struct luo_file_set *file_set, > * userspace to detect the failure and trigger a reboot, which will > * reliably reset devices and reclaim memory. 
> */ > - file_ser = file_set->files; > - for (i = 0; i < file_set->count; i++) { > - err = luo_file_deserialize_one(file_set, &file_ser[i]); > + kho_block_it_init(&it, &file_set->block_set); > + while ((file_ser = kho_block_it_read(&it))) { > + err = luo_file_deserialize_one(file_set, file_ser); > if (err) > - return err; > + goto err_destroy_blocks; > + file_set->count++; > + } > + > + if (file_set->count != file_set_ser->count) { > + pr_warn("File count mismatch: expected %llu, found %llu\n", > + file_set_ser->count, file_set->count); > + err = -EINVAL; > + goto err_destroy_blocks; > } > > return 0; > + > +err_destroy_blocks: > + while (!list_empty(&file_set->files_list)) { > + struct luo_file *luo_file; > + > + luo_file = list_first_entry(&file_set->files_list, > + struct luo_file, list); > + list_del(&luo_file->list); > + module_put(luo_file->fh->ops->owner); > + mutex_destroy(&luo_file->mutex); > + kfree(luo_file); > + } > + file_set->count = 0; > + kho_block_destroy(&file_set->block_set); > + return err; > } > > void luo_file_set_init(struct luo_file_set *file_set) > { > INIT_LIST_HEAD(&file_set->files_list); > + kho_block_set_init(&file_set->block_set, sizeof(struct luo_file_ser)); > } > > void luo_file_set_destroy(struct luo_file_set *file_set) > { > WARN_ON(file_set->count); > WARN_ON(!list_empty(&file_set->files_list)); > + WARN_ON(!list_empty(&file_set->block_set.blocks)); Here too. > } > > /** > diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h > index ee18f9a11b91..64879ffe7378 100644 > --- a/kernel/liveupdate/luo_internal.h > +++ b/kernel/liveupdate/luo_internal.h > @@ -10,6 +10,7 @@ > > #include > #include > +#include > > struct luo_ucmd { > void __user *ubuffer; > @@ -44,14 +45,13 @@ static inline int luo_ucmd_respond(struct luo_ucmd *ucmd, > * struct luo_file_set - A set of files that belong to the same sessions. 
> * @files_list: An ordered list of files associated with this session, it is > * ordered by preservation time. > - * @files: The physically contiguous memory block that holds the serialized > - * state of files. > + * @block_set: The set of serialization blocks. > * @count: A counter tracking the number of files currently stored in the > * @files_list for this session. > */ > struct luo_file_set { > struct list_head files_list; > - struct luo_file_ser *files; > + struct kho_block_set block_set; > u64 count; > }; -- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 07:17:19 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 16:17:19 +0200 Subject: [PATCH v4 11/13] selftests/liveupdate: Test session and file limit removal In-Reply-To: <20260530221938.115978-12-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:36 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-12-pasha.tatashin@soleen.com> Message-ID: <2vxz7boifiq8.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > With the removal of static limits on the number of sessions and files per > session, the orchestrator now uses dynamic allocation. > > Add new test cases to verify that the system can handle a large number of > sessions and files. These tests ensure that the dynamic block allocation > and reuse logic for session metadata and outgoing files work correctly > beyond the previous static limits. 
> > Acked-by: Mike Rapoport (Microsoft) > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) > --- > .../testing/selftests/liveupdate/liveupdate.c | 75 +++++++++++++++++++ > .../selftests/liveupdate/luo_test_utils.c | 24 ++++++ > .../selftests/liveupdate/luo_test_utils.h | 2 + > 3 files changed, 101 insertions(+) > > diff --git a/tools/testing/selftests/liveupdate/liveupdate.c b/tools/testing/selftests/liveupdate/liveupdate.c > index c7d94b9181e1..502fb3567e38 100644 > --- a/tools/testing/selftests/liveupdate/liveupdate.c > +++ b/tools/testing/selftests/liveupdate/liveupdate.c > @@ -26,6 +26,7 @@ > > #include > > +#include "luo_test_utils.h" > #include "../kselftest.h" > #include "../kselftest_harness.h" > > @@ -499,4 +500,78 @@ TEST_F(liveupdate_device, get_session_name_max_length) > ASSERT_EQ(close(session_fd), 0); > } > > +/* > + * Test Case: Manage Many Sessions > + * > + * Verifies that a large number of sessions can be created and then > + * destroyed during normal system operation. This specifically tests the > + * dynamic block allocation and reuse logic for session metadata management > + * without preserving any files. 
> + */ > +TEST_F(liveupdate_device, preserve_many_sessions) > +{ > +#define MANY_SESSIONS 2000 > + int session_fds[MANY_SESSIONS]; > + int ret, i; > + > + self->fd1 = open(LIVEUPDATE_DEV, O_RDWR); > + if (self->fd1 < 0 && errno == ENOENT) > + SKIP(return, "%s does not exist", LIVEUPDATE_DEV); > + ASSERT_GE(self->fd1, 0); > + > + ret = luo_ensure_nofile_limit(MANY_SESSIONS); > + if (ret == -EPERM) > + SKIP(return, "Insufficient privileges to set RLIMIT_NOFILE"); > + ASSERT_EQ(ret, 0); > + > + for (i = 0; i < MANY_SESSIONS; i++) { > + char name[64]; > + > + snprintf(name, sizeof(name), "many-session-%d", i); > + session_fds[i] = create_session(self->fd1, name); > + ASSERT_GE(session_fds[i], 0); > + } > + > + for (i = 0; i < MANY_SESSIONS; i++) > + ASSERT_EQ(close(session_fds[i]), 0); > +} > + > +/* > + * Test Case: Preserve Many Files > + * > + * Verifies that a large number of files can be preserved in a single session > + * and then destroyed during normal system operation. This tests the dynamic > + * block allocation and management for outgoing files. 
> + */ > +TEST_F(liveupdate_device, preserve_many_files) > +{ > +#define MANY_FILES 500 > + int mem_fds[MANY_FILES]; > + int session_fd, ret, i; > + > + self->fd1 = open(LIVEUPDATE_DEV, O_RDWR); > + if (self->fd1 < 0 && errno == ENOENT) > + SKIP(return, "%s does not exist", LIVEUPDATE_DEV); > + ASSERT_GE(self->fd1, 0); > + > + session_fd = create_session(self->fd1, "many-files-test"); > + ASSERT_GE(session_fd, 0); > + > + ret = luo_ensure_nofile_limit(MANY_FILES + 10); > + if (ret == -EPERM) > + SKIP(return, "Insufficient privileges to set RLIMIT_NOFILE"); > + ASSERT_EQ(ret, 0); > + > + for (i = 0; i < MANY_FILES; i++) { > + mem_fds[i] = memfd_create("test-memfd", 0); > + ASSERT_GE(mem_fds[i], 0); > + ASSERT_EQ(preserve_fd(session_fd, mem_fds[i], i), 0); > + } > + > + for (i = 0; i < MANY_FILES; i++) > + ASSERT_EQ(close(mem_fds[i]), 0); > + > + ASSERT_EQ(close(session_fd), 0); > +} > + > TEST_HARNESS_MAIN > diff --git a/tools/testing/selftests/liveupdate/luo_test_utils.c b/tools/testing/selftests/liveupdate/luo_test_utils.c > index 3c8721c505df..333a3530051b 100644 > --- a/tools/testing/selftests/liveupdate/luo_test_utils.c > +++ b/tools/testing/selftests/liveupdate/luo_test_utils.c > @@ -17,6 +17,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -28,6 +29,29 @@ int luo_open_device(void) > return open(LUO_DEVICE, O_RDWR); > } > > +int luo_ensure_nofile_limit(long min_limit) > +{ > + struct rlimit hl; > + > + /* Allow to extra files to be used by test itself */ > + min_limit += 32; > + > + if (getrlimit(RLIMIT_NOFILE, &hl) < 0) > + return -errno; > + > + if (hl.rlim_cur >= min_limit) > + return 0; > + > + hl.rlim_cur = min_limit; > + if (hl.rlim_cur > hl.rlim_max) > + hl.rlim_max = hl.rlim_cur; > + > + if (setrlimit(RLIMIT_NOFILE, &hl) < 0) > + return -errno; > + > + return 0; > +} > + > int luo_create_session(int luo_fd, const char *name) > { > struct liveupdate_ioctl_create_session arg = { .size = sizeof(arg) }; > diff --git 
a/tools/testing/selftests/liveupdate/luo_test_utils.h b/tools/testing/selftests/liveupdate/luo_test_utils.h > index 90099bf49577..6a0d85386613 100644 > --- a/tools/testing/selftests/liveupdate/luo_test_utils.h > +++ b/tools/testing/selftests/liveupdate/luo_test_utils.h > @@ -26,6 +26,8 @@ int luo_create_session(int luo_fd, const char *name); > int luo_retrieve_session(int luo_fd, const char *name); > int luo_session_finish(int session_fd); > > +int luo_ensure_nofile_limit(long min_limit); > + > int create_and_preserve_memfd(int session_fd, int token, const char *data); > int restore_and_verify_memfd(int session_fd, int token, const char *expected_data); -- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 07:19:57 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 16:19:57 +0200 Subject: [PATCH v4 12/13] selftests/liveupdate: Add stress-sessions kexec test In-Reply-To: <20260530221938.115978-13-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Sat, 30 May 2026 22:19:37 +0000") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-13-pasha.tatashin@soleen.com> Message-ID: <2vxz33z6filu.fsf@kernel.org> On Sat, May 30 2026, Pasha Tatashin wrote: > Add a new test that creates 2000 LUO sessions before a kexec > reboot and verifies their presence after the reboot. This ensures > that the linked-block serialization mechanism works correctly for > a large number of sessions. > > Acked-by: Mike Rapoport (Microsoft) > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav (Google) [...] 
-- Regards, Pratyush Yadav From pratyush at kernel.org Mon Jun 1 07:27:39 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 01 Jun 2026 16:27:39 +0200 Subject: [PATCH v4 04/13] liveupdate: register luo_ser as KHO subtree In-Reply-To: (Pasha Tatashin's message of "Mon, 1 Jun 2026 09:50:49 -0400") References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-5-pasha.tatashin@soleen.com> <2vxzv7c2fn8n.fsf@kernel.org> Message-ID: <2vxzy0gye3ok.fsf@kernel.org> On Mon, Jun 01 2026, Pasha Tatashin wrote: > On 06-01 14:39, Pratyush Yadav wrote: >> On Sat, May 30 2026, Pasha Tatashin wrote: >> >> > Entirely remove the LUO FDT wrapper since the FDT only carries the >> > compatible string and the pointer to the centralized struct luo_ser. >> > Instead, register the struct luo_ser via the KHO raw subtree >> > API, placing the compatibility string inside the structure itself. >> > >> > Signed-off-by: Pasha Tatashin >> > --- >> > include/linux/kho/abi/luo.h | 57 +++++++++--------------- >> > kernel/liveupdate/luo_core.c | 85 +++++++++++------------------------- >> > 2 files changed, 46 insertions(+), 96 deletions(-) >> > >> > diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h >> > index 1b2f865a771a..9a4fe491812b 100644 >> > --- a/include/linux/kho/abi/luo.h >> > +++ b/include/linux/kho/abi/luo.h >> > @@ -10,11 +10,11 @@ >> > * >> > * Live Update Orchestrator uses the stable Application Binary Interface >> > * defined below to pass state from a pre-update kernel to a post-update >> > - * kernel. The ABI is built upon the Kexec HandOver framework and uses a >> > - * Flattened Device Tree to describe the preserved data. >> > + * kernel. The ABI is built upon the Kexec HandOver framework and registers >> > + * the central `struct luo_ser` via the KHO raw subtree API. >> > * >> > - * This interface is a contract. 
Any modification to the FDT structure, node >> > - * properties, compatible strings, or the layout of the `__packed` serialization >> > + * This interface is a contract. Any modification to the structure fields, >> > + * compatible strings, or the layout of the `__packed` serialization >> > * structures defined here constitutes a breaking change. Such changes require >> > * incrementing the version number in the relevant `_COMPATIBLE` string to >> > * prevent a new kernel from misinterpreting data from an old kernel. >> > @@ -23,31 +23,15 @@ >> > * however, backward/forward compatibility is only guaranteed for kernels >> > * supporting the same ABI version. >> > * >> > - * FDT Structure Overview: >> > + * KHO Structure Overview: >> > * The entire LUO state is encapsulated within a single KHO entry named "LUO". >> > - * This entry contains an FDT with the following layout: >> > - * >> > - * .. code-block:: none >> > - * >> > - * / { >> > - * compatible = "luo-v2"; >> > - * luo-abi-header = ; >> > - * }; >> > - * >> > - * Main LUO Node (/): >> > - * >> > - * - compatible: "luo-v2" >> > - * Identifies the overall LUO ABI version. >> > - * - luo-abi-header: u64 >> > - * The physical address of `struct luo_ser`. >> > + * This entry contains the `struct luo_ser` structure. >> > * >> > * Serialization Structures: >> > - * The FDT properties point to memory regions containing arrays of simple, >> > - * `__packed` structures. These structures contain the actual preserved state. >> > - * >> > * - struct luo_ser: >> > * The central ABI structure that contains the overall state of the LUO. >> > - * It includes the liveupdate-number and pointers to sessions and FLBs. >> > + * It includes the compatibility string, the liveupdate-number, and pointers >> > + * to sessions and FLBs. >> > * >> > * - struct luo_session_header_ser: >> > * Header for the session array. 
Contains the total page count of the >> > @@ -78,26 +62,27 @@ >> > #ifndef _LINUX_KHO_ABI_LUO_H >> > #define _LINUX_KHO_ABI_LUO_H >> > >> > +#include >> > #include >> > >> > /* >> > - * The LUO FDT hooks all LUO state for sessions, fds, etc. >> > + * The LUO state is registered under this KHO entry name. >> > */ >> > -#define LUO_FDT_SIZE PAGE_SIZE >> > -#define LUO_FDT_KHO_ENTRY_NAME "LUO" >> > -#define LUO_FDT_COMPATIBLE "luo-v2" >> > -#define LUO_FDT_ABI_HEADER "luo-abi-header" >> > +#define LUO_KHO_ENTRY_NAME "LUO" >> > +#define LUO_ABI_COMPATIBLE "luo-v3" >> > +#define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) >> >> The length of the compatible field will change depending on the length >> of the string. While that is technically fine since a new ABI version is >> allowed to change the layout, it feels odd. I think it would be better >> if we define a static size here, say 64 bytes. This way you can avoid >> all the weirdness that can happen when you move from one version to >> another. > > This is what I used initially, but we have cases where one LUO/KHO > subsystem depends on another. For example, the LUO version must change > when the block version changes, making the static length too > restrictive. I would prefer to use proper strncmp() everywhere and allow > the version string to change dynamically between kernels, while still > allowing something like this (from [PATCH v4 09/13] liveupdate: Remove > limit on the number of sessions): > > #define LUO_COMPAT_BASE "luo-v3" > #define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" > KHO_BLOCK_ABI_COMPATIBLE > > In the future, we may extend this further as we add more dependencies, > such as your preservable xarray, vmalloc, etc. Everything that depends > on an external version should include that in its compatibility string. Hmm, it feels odd, but I don't have any real counter arguments. So let's keep this as-is. > >> >> > >> > /** >> > * struct luo_ser - Centralized LUO ABI header. 
>> > + * @compatible: Compatibility string identifying the LUO ABI version. >> > * @liveupdate_num: A counter tracking the number of successful live updates. >> > * @sessions_pa: Physical address of the first session block header. >> > * @flbs_pa: Physical address of the FLB header. >> > * >> > - * This structure is the root of all preserved LUO state. It is pointed to by >> > - * the "luo-abi-header" property in the LUO FDT. >> > + * This structure is the root of all preserved LUO state. >> > */ >> > struct luo_ser { >> > + char compatible[LUO_ABI_COMPAT_LEN]; >> > u64 liveupdate_num; >> > u64 sessions_pa; >> > u64 flbs_pa; >> [...] >> > @@ -94,40 +91,29 @@ static int __init luo_early_startup(void) >> > return 0; >> > } >> > >> > - /* Retrieve LUO subtree, and verify its format. */ >> > - err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys, NULL); >> > + /* Retrieve LUO state from KHO. */ >> > + err = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &luo_ser_phys, &len); >> > if (err) { >> > if (err != -ENOENT) { >> > - pr_err("failed to retrieve FDT '%s' from KHO: %pe\n", >> > - LUO_FDT_KHO_ENTRY_NAME, ERR_PTR(err)); >> > + pr_err("failed to retrieve LUO state '%s' from KHO: %pe\n", >> > + LUO_KHO_ENTRY_NAME, ERR_PTR(err)); >> > return err; >> > } >> > >> > return 0; >> > } >> > >> > - luo_global.fdt_in = phys_to_virt(fdt_phys); >> > - err = fdt_node_check_compatible(luo_global.fdt_in, 0, >> > - LUO_FDT_COMPATIBLE); >> > - if (err) { >> > - pr_err("FDT '%s' is incompatible with '%s' [%d]\n", >> > - LUO_FDT_KHO_ENTRY_NAME, LUO_FDT_COMPATIBLE, err); >> > - >> > + if (len < sizeof(*luo_ser)) { >> >> len != sizeof(*luo_ser) here? > > I can change this, but it is not necessary. It is common practice to > verify that a "struct" is not smaller when compatibility is checked, > allowing for future expansion without breaking compatibility with older > kernels. 
I know we do not support forward/backward compatibility in any > way right now, but I do not think it hurts to put the proper safeguards > in place. Yeah, that was my point. We don't support anything other than exact agreement on formats. But let's keep it this way for now, so we can grow the struct in a backwards compatible way if needed. > > Pasha > >> >> > + pr_err("LUO state is too small (%zu < %zu)\n", len, sizeof(*luo_ser)); >> > return -EINVAL; >> > } >> > >> > - header_size = 0; >> > - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); >> > - if (!ptr || header_size != sizeof(u64)) { >> > - pr_err("Unable to get ABI header '%s' [%d]\n", >> > - LUO_FDT_ABI_HEADER, header_size); >> > - >> > + luo_ser = phys_to_virt(luo_ser_phys); >> > + if (strncmp(luo_ser->compatible, LUO_ABI_COMPATIBLE, LUO_ABI_COMPAT_LEN)) { >> > + pr_err("LUO state is incompatible with '%s'\n", LUO_ABI_COMPATIBLE); >> > return -EINVAL; >> > } >> > >> > - luo_ser_pa = get_unaligned((u64 *)ptr); >> > - luo_ser = phys_to_virt(luo_ser_pa); >> > - >> > luo_global.liveupdate_num = luo_ser->liveupdate_num; >> > pr_info("Retrieved live update data, liveupdate number: %lld\n", >> > luo_global.liveupdate_num); >> [...] >> >> -- >> Regards, >> Pratyush Yadav -- Regards, Pratyush Yadav From pasha.tatashin at soleen.com Mon Jun 1 07:37:35 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Mon, 1 Jun 2026 10:37:35 -0400 Subject: [PATCH v4 07/13] kho: add support for linked-block serialization In-Reply-To: <2vxzqzmqfkit.fsf@kernel.org> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-8-pasha.tatashin@soleen.com> <2vxzqzmqfkit.fsf@kernel.org> Message-ID: On 06-01 15:38, Pratyush Yadav wrote: > On Sat, May 30 2026, Pasha Tatashin wrote: > > > Introduce a linked-block serialization mechanism for state handover. 
> > > > Previously, LUO used contiguous memory blocks for serializing sessions > > and files, which imposed limits on the total number of items that could > > be preserved across a live update. > > > > This commit adds the infrastructure for a more flexible, block-based > > approach where serialized data is stored in a chain of linked blocks. > > This is a generic KHO serialization block infrastructure that can be > > used by multiple subsystems. > > > > Signed-off-by: Pasha Tatashin > > --- > > Documentation/core-api/kho/abi.rst | 5 + > > Documentation/core-api/kho/index.rst | 11 + > > MAINTAINERS | 1 + > > include/linux/kho/abi/block.h | 56 ++++ > > include/linux/kho_block.h | 79 ++++++ > > kernel/liveupdate/Makefile | 1 + > > kernel/liveupdate/kho_block.c | 384 +++++++++++++++++++++++++++ > > 7 files changed, 537 insertions(+) > > create mode 100644 include/linux/kho/abi/block.h > > create mode 100644 include/linux/kho_block.h > > create mode 100644 kernel/liveupdate/kho_block.c > > > > diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst > > index 799d743105a6..edeb5b311963 100644 > > --- a/Documentation/core-api/kho/abi.rst > > +++ b/Documentation/core-api/kho/abi.rst > > @@ -28,6 +28,11 @@ KHO persistent memory tracker ABI > > .. kernel-doc:: include/linux/kho/abi/kexec_handover.h > > :doc: KHO persistent memory tracker > > > > +KHO serialization block ABI > > +=========================== > > + > > +.. kernel-doc:: include/linux/kho/abi/block.h > > + > > See Also > > ======== > > > > diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst > > index 0a2dee4f8e7d..320914a42178 100644 > > --- a/Documentation/core-api/kho/index.rst > > +++ b/Documentation/core-api/kho/index.rst > > @@ -83,6 +83,17 @@ Public API > > .. kernel-doc:: kernel/liveupdate/kexec_handover.c > > :export: > > > > +KHO Serialization Blocks API > > +============================ > > + > > +.. 
kernel-doc:: kernel/liveupdate/kho_block.c > > + :doc: KHO Serialization Blocks > > + > > +.. kernel-doc:: include/linux/kho_block.h > > + > > +.. kernel-doc:: kernel/liveupdate/kho_block.c > > + :internal: > > + > > See Also > > ======== > > > > diff --git a/MAINTAINERS b/MAINTAINERS > > index 2fb1c75afd16..fd119b343e99 100644 > > --- a/MAINTAINERS > > +++ b/MAINTAINERS > > @@ -14194,6 +14194,7 @@ F: Documentation/admin-guide/mm/kho.rst > > F: Documentation/core-api/kho/* > > F: include/linux/kexec_handover.h > > F: include/linux/kho/ > > +F: include/linux/kho_block.h > > F: kernel/liveupdate/kexec_handover* > > F: lib/test_kho.c > > F: tools/testing/selftests/kho/ > > diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h > > new file mode 100644 > > index 000000000000..8641c20b379b > > --- /dev/null > > +++ b/include/linux/kho/abi/block.h > > @@ -0,0 +1,56 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +/* > > + * Copyright (c) 2026, Google LLC. > > + * Pasha Tatashin > > + */ > > + > > +/** > > + * DOC: KHO Serialization Blocks ABI > > + * > > + * Subsystems using the KHO Serialization Blocks framework rely on the stable > > + * Application Binary Interface defined below to pass serialized state from a > > + * pre-update kernel to a post-update kernel. > > + * > > + * This interface is a contract. Any modification to the structure fields, > > + * compatible strings, or the layout of the `__packed` serialization > > + * structures defined here constitutes a breaking change. Such changes require > > + * incrementing the version number in the `KHO_BLOCK_ABI_COMPATIBLE` string to > > + * prevent a new kernel from misinterpreting data from an old kernel. > > + * > > + * Changes are allowed provided the compatibility version is incremented; > > + * however, backward/forward compatibility is only guaranteed for kernels > > + * supporting the same ABI version. 
> > + */ > > + > > +#ifndef _LINUX_KHO_ABI_BLOCK_H > > +#define _LINUX_KHO_ABI_BLOCK_H > > + > > +#include > > +#include > > + > > +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" > > During KHO radix development, I argued for a separate compatible for the > radix tree, but at that time, we tied the radix tree to core KHO ABI. > The argument being that all core KHO data structures belong to the KHO > ABI set. I imagine this will be used by kho_vmalloc, so it will also > end up being used by a core KHO API. > > So, do we want a separate ABI? I don't have much of a preference myself, but > I do think the compatible management will be a bit easier if this relied > on KHO compatible, especially once kho_vmalloc starts using it. I prefer to make them fine-grained, now that we are adding more and more features: kho vmalloc, kho radix, and kho block should all have their own compatibility strings. Furthermore, any components that depend on them should include these compatibility strings in their own compatibility strings, in the same manner I have done in this series. > > > + > > +/** > > + * KHO_BLOCK_SIZE - The size of each serialization block. > > + * > > + * This is defined as PAGE_SIZE. PAGE_SIZE is ABI compliant because live > > + * update between kernels with different page sizes is not supported by KHO. > > + */ > > +#define KHO_BLOCK_SIZE PAGE_SIZE > > + > > +/** > > + * struct kho_block_header_ser - Header for the serialized data block. > > + * @next: Physical address of the next struct kho_block_header_ser. > > + * @count: The number of entries that immediately follow this header in the > > + * memory block. > > + * > > + * This structure is located at the beginning of a block of physical memory > > + * preserved across a kexec. It provides the necessary metadata to interpret > > + * the array of entries that follow.
> > + */ > > +struct kho_block_header_ser { > > + u64 next; > > + u64 count; > > +} __packed; > > + > > +#endif /* _LINUX_KHO_ABI_BLOCK_H */ > > diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h > > new file mode 100644 > > index 000000000000..5e6b87b1befa > > --- /dev/null > > +++ b/include/linux/kho_block.h > > @@ -0,0 +1,79 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +/* > > + * Copyright (c) 2026, Google LLC. > > + * Pasha Tatashin > > + */ > > + > > +#ifndef _LINUX_KHO_BLOCK_H > > +#define _LINUX_KHO_BLOCK_H > > + > > +#include > > +#include > > +#include > > + > > +/** > > + * struct kho_block - Internal representation of a serialization block. > > + * @list: List head for linking blocks in memory. > > + * @ser: Pointer to the serialized header in preserved memory. > > + */ > > +struct kho_block { > > + struct list_head list; > > + struct kho_block_header_ser *ser; > > +}; > > + > > +/** > > + * struct kho_block_set - A set of blocks that belong to the same object. > > + * @blocks: The list of serialization blocks (struct kho_block). > > + * @nblocks: The number of allocated serialization blocks. > > + * @head_pa: Physical address of the first block header. > > + * @entry_size: The size of each entry in the blocks. > > + * @count_per_block: The maximum number of entries each block can hold. > > + * @incoming: True if this block set was restored from the previous kernel. > > + */ > > +struct kho_block_set { > > + struct list_head blocks; > > + long nblocks; > > + u64 head_pa; > > + size_t entry_size; > > I think we should add the entry_size to kho_block_header_ser? I think it > is a part of the ABI of the block set. If this changes, we cannot parse > a block set with a different size. If a subsystem wants to change entry > size, they create a new block set with different entry size, and then > they bump their compatible version. 
I have considered that, and we can certainly do it; however, I do not see how it would affect the current implementation. If luo_file or luo_session change entry_size, they must change the LUO compatibility version, which would prevent LU from one kernel to the next. However, for flexibility and future extensibility, I believe it would be useful to add entry_size and block_size (which is PAGE_SIZE, but could be larger for some users) to the header. This is more of a feature request than an issue with the current series. > > > + u64 count_per_block; > > + bool incoming; > > +}; > > + > > +/** > > + * struct kho_block_it - Iterator for serializing entries into blocks. > > + * @bs: The block set being iterated. > > + * @block: The current block. > > + * @i: The current entry index within @block. > > + */ > > +struct kho_block_it { > > + struct kho_block_set *bs; > > + struct kho_block *block; > > + u64 i; > > +}; > > + > > +/** > > + * KHO_BLOCK_SET_INIT - Initialize a static kho_block_set. > > + * @_name: Name of the kho_block_set variable. > > + * @_entry_size: The size of each entry in the block set. > > + */ > > +#define KHO_BLOCK_SET_INIT(_name, _entry_size) { \ > > + .blocks = LIST_HEAD_INIT((_name).blocks), \ > > + .entry_size = _entry_size, \ > > +} > > + > > +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size); > > + > > +int kho_block_grow(struct kho_block_set *bs, u64 count); > > +void kho_block_shrink(struct kho_block_set *bs, u64 count); > > These block management functions seem like internal details of the block This is not so. The confusion here is that the blocks must be allocated and preserved at runtime, as resources are registered/unregistered, while they are only used during the serialization phase. These calls are more like notifiers that files/sessions are being created or removed, so we can adjust the block count accordingly if necessary (allocate preserved memory), and have the blocks available during serialization/deserialization. > set API.
Do we need to export them? I think users should not have to > worry about block management. They should read, set, or clear entries > using the iterators, and internally the block management should take of > allocation or freeing. So here for example, I th something is missing :-) > > > + > > +int kho_block_restore(struct kho_block_set *bs, u64 head_pa); > > +void kho_block_destroy(struct kho_block_set *bs); > > Nit: kho_block_set_{restore,destroy}()? At first glance I thought they > manipulated a single block. Makes sense. > > > +void kho_block_set_clear(struct kho_block_set *bs); > > + > > +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs); > > +void *kho_block_it_next(struct kho_block_it *it); > > +void *kho_block_it_read(struct kho_block_it *it); > > +void *kho_block_it_prev(struct kho_block_it *it); > > +void kho_block_it_finalize(struct kho_block_it *it); > > + > > +#endif /* _LINUX_KHO_BLOCK_H */ > > diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile > > index d2f779cbe279..eec9d3ae07eb 100644 > > --- a/kernel/liveupdate/Makefile > > +++ b/kernel/liveupdate/Makefile > > @@ -1,6 +1,7 @@ > > # SPDX-License-Identifier: GPL-2.0 > > > > luo-y := \ > > + kho_block.o \ > > luo_core.o \ > > luo_file.o \ > > luo_flb.o \ > > diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c > > new file mode 100644 > > index 000000000000..a4e650af946f > > --- /dev/null > > +++ b/kernel/liveupdate/kho_block.c > > @@ -0,0 +1,384 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > + > > +/* > > + * Copyright (c) 2026, Google LLC. > > + * Pasha Tatashin > > + */ > > + > > +/** > > + * DOC: KHO Serialization Blocks > > + * > > + * KHO provides a mechanism to preserve stateful data across a kexec handover > > + * by serializing it into memory blocks. This file provides the common > > + * infrastructure for managing these blocks. 
> > + * > > + * Each block consists of a header (struct kho_block_header_ser) followed by an > > + * array of serialized entries. Multiple blocks are linked together via a > > + * physical pointer in the header, forming a linked list that can be easily > > + * traversed in both the current and the next kernel. > > + */ > > + > > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt > > + > > +#include > > +#include > > +#include > > +#include > > +#include > > + > > +/* > > + * Safeguard limit for the number of serialization blocks. This is used to > > + * prevent infinite loops and excessive memory allocation in case of memory > > + * corruption in the preserved state. > > + */ > > +#define KHO_MAX_BLOCKS 10000 > > + > > +/** > > + * kho_block_set_init - Initialize a block set. > > + * @bs: The block set to initialize. > > + * @entry_size: The size of each entry in the blocks. > > + */ > > +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size) > > +{ > > + *bs = (struct kho_block_set)KHO_BLOCK_SET_INIT(*bs, entry_size); > > +} > > + > > +static inline u64 kho_block_count_per_block(struct kho_block_set *bs) > > +{ > > + if (unlikely(!bs->count_per_block)) { > > + bs->count_per_block = (KHO_BLOCK_SIZE - > > + sizeof(struct kho_block_header_ser)) / > > + bs->entry_size; > > + WARN_ON(!bs->count_per_block); > > + } > > + return bs->count_per_block; > > +} > > This looks odd. I don't see a reason to calculate this lazily. Why not > just do it when initializing the block set, in kho_block_set_init() or > kho_block_restore()? And then use bs->count_per_block directly. 
This allows blocks to use static initialization; I like static inits :-) > > > + > > +/* Free serialized data */ > > +static void kho_block_free_ser(struct kho_block_set *bs, > > + struct kho_block_header_ser *ser) > > +{ > > + if (bs->incoming) > > + kho_restore_free(ser); > > + else > > + kho_unpreserve_free(ser); > > +} > > + > > +static struct kho_block_header_ser *kho_block_alloc_ser(struct kho_block_set *bs) > > +{ > > + WARN_ON(bs->incoming); > > WARN_ON_ONCE? Sure > > > + return kho_alloc_preserve(KHO_BLOCK_SIZE); > > +} > > + > > +static int kho_block_add(struct kho_block_set *bs, > > + struct kho_block_header_ser *ser) > > +{ > > + struct kho_block *block, *last; > > + > > + if (bs->nblocks >= KHO_MAX_BLOCKS) > > + return -ENOSPC; > > + > > + block = kzalloc_obj(*block); > > + if (!block) > > + return -ENOMEM; > > + > > + block->ser = ser; > > + last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); > > + list_add_tail(&block->list, &bs->blocks); > > + bs->nblocks++; > > + > > + if (last) > > + last->ser->next = virt_to_phys(ser); > > + else > > + bs->head_pa = virt_to_phys(ser); > > + > > + return 0; > > +} > > + > > +/** > > + * kho_block_grow - Create a new block if the current capacity is reached. > > + * @bs: The block set. > > + * @count: The current number of entries. > > + * > > + * This function handles the dynamic expansion of a block set. It allocates > > + * and links a new serialization block if the provided entry count matches > > + * the current total capacity of the set. > > + * > > + * Return: 0 on success, or a negative errno on failure. > > + */ > > +int kho_block_grow(struct kho_block_set *bs, u64 count) > > +{ > > + struct kho_block_header_ser *ser; > > + int err; > > + > > + if (WARN_ON(bs->incoming)) > > WARN_ON_ONCE here too?
Sure > > > + return -EINVAL; > > + > > + if (count != bs->nblocks * kho_block_count_per_block(bs)) > > + return 0; > > + > > + ser = kho_block_alloc_ser(bs); > > + if (IS_ERR(ser)) > > + return PTR_ERR(ser); > > + > > + err = kho_block_add(bs, ser); > > + if (err) { > > + kho_block_free_ser(bs, ser); > > + return err; > > + } > > + > > + return 0; > > +} > > + > > +/** > > + * kho_block_shrink - Conditionally destroy the last block in a block set. > > + * @bs: The block set. > > + * @count: The current number of entries across all blocks. > > + * > > + * This function checks if the last block in the set is redundant based on the > > + * total entry count and the capacity of the preceding blocks. If the entry > > + * count can be accommodated by the blocks that come before the last one, the > > + * last block is destroyed and removed from the set. > > + */ > > +void kho_block_shrink(struct kho_block_set *bs, u64 count) > > +{ > > + struct kho_block *last, *new_last; > > + > > + if (count > (bs->nblocks - 1) * kho_block_count_per_block(bs)) > > + return; > > + > > + if (list_empty(&bs->blocks)) > > + return; > > + > > + last = list_last_entry(&bs->blocks, struct kho_block, list); > > + list_del(&last->list); > > + bs->nblocks--; > > + kho_block_free_ser(bs, last->ser); > > + kfree(last); > > + > > + new_last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); > > + if (new_last) > > + new_last->ser->next = 0; > > + else > > + bs->head_pa = 0; > > +} > > + > > +/* > > + * kho_cyclic_blocks_check - Check for cycles in a linked list of blocks. > > + * Uses Floyd's cycle-finding algorithm to ensure sanity of the incoming list. 
> > + */ > > +static bool kho_cyclic_blocks_check(struct kho_block_set *bs) > > +{ > > + struct kho_block_header_ser *fast; > > + struct kho_block_header_ser *slow; > > + int count = 0; > > + > > + fast = phys_to_virt(bs->head_pa); > > + slow = fast; > > + > > + while (fast) { > > + if (count++ >= KHO_MAX_BLOCKS) { > > + pr_err("Linked list too long\n"); > > + return false; > > + } > > + > > + if (!fast->next) > > + break; > > + > > + fast = phys_to_virt(fast->next); > > + if (!fast->next) > > + break; > > + > > + fast = phys_to_virt(fast->next); > > + slow = phys_to_virt(slow->next); > > + > > + if (slow == fast) { > > + pr_err("Cyclic list detected\n"); > > Heh, reminds me of the time I was practicing leetcode for interviews ;-) :-) > > > + return false; > > + } > > + } > > + > > + return true; > > +} > > + > > +/** > > + * kho_block_restore - Restore a block set from a physical address. > > + * @bs: The block set to restore. > > + * @head_pa: Physical address of the first block header. > > + * > > + * Return: 0 on success, or a negative errno on failure. 
> > + */ > > +int kho_block_restore(struct kho_block_set *bs, u64 head_pa) > > +{ > > + struct kho_block_header_ser *ser; > > + u64 next_pa = head_pa; > > + int err; > > + > > + /* Restored block sets use size from the previous kernel */ > > + bs->incoming = true; > > + if (!head_pa) > > + return 0; > > + > > + bs->head_pa = head_pa; > > + if (!kho_cyclic_blocks_check(bs)) { > > + bs->head_pa = 0; > > + return -EINVAL; > > + } > > + > > + while (next_pa) { > > + ser = phys_to_virt(next_pa); > > + if (ser->count > kho_block_count_per_block(bs)) { > > + pr_warn("Block contains too many entries: %llu\n", > > + ser->count); > > + err = -EINVAL; > > + goto err_destroy; > > + } > > + err = kho_block_add(bs, ser); > > + if (err) > > + goto err_destroy; > > + next_pa = ser->next; > > + } > > + > > + return 0; > > + > > +err_destroy: > > + kho_block_destroy(bs); > > + return err; > > +} > > + > > +/** > > + * kho_block_destroy - Destroy all blocks in a block set. > > + * @bs: The block set. > > + */ > > +void kho_block_destroy(struct kho_block_set *bs) > > +{ > > + u64 head_pa = bs->head_pa; > > + struct kho_block *block; > > + > > + while (!list_empty(&bs->blocks)) { > > + block = list_first_entry(&bs->blocks, struct kho_block, list); > > + list_del(&block->list); > > + kfree(block); > > + } > > Nit: > > list_for_each_entry_safe(block, tmp, &bs->blocks, list) { > list_del(&block->list); > kfree(block); > } > > is a bit more idiomatic (and IMO easier to read). Sure > > > + bs->nblocks = 0; > > + bs->head_pa = 0; > > + > > + while (head_pa) { > > + struct kho_block_header_ser *ser = phys_to_virt(head_pa); > > + > > + head_pa = ser->next; > > + kho_block_free_ser(bs, ser); > > Nit: also, can't you put this also in the previous loop? 
Something like: > > list_for_each_entry_safe(block, tmp, &bs->blocks, list) { > list_del(&block->list); > kho_block_free_ser(block->ser); > kfree(block); > } We actually can't merge these into a single loop because of how partial restoration failures are handled in kho_block_restore(). If kho_block_restore() fails halfway through restoring a chain of blocks (for example, if kho_block_add() fails on block 3 of 5), we jump to the err_destroy cleanup path, which calls kho_block_destroy(). At this point: - bs->blocks only contains the tracked blocks we successfully added (blocks 1 and 2). - bs->head_pa still points to the physical head of the entire 5-block incoming chain. Walking only bs->blocks would therefore never free blocks 3 to 5, so the second loop has to follow the physical chain from head_pa. But this is a good place to add a comment. > > + } > > +} > > + > > +/** > > + * kho_block_set_clear - Clear all serialized data in a block set. > > + * @bs: The block set to clear. > > + */ > > +void kho_block_set_clear(struct kho_block_set *bs) > > +{ > > + struct kho_block *block; > > + > > + list_for_each_entry(block, &bs->blocks, list) { > > + block->ser->count = 0; > > + memset(block->ser + 1, 0, KHO_BLOCK_SIZE - sizeof(*block->ser)); > > + } > > +} > > + > > +/** > > + * kho_block_it_init - Initialize a block set iterator. > > + * @it: The iterator to initialize. > > + * @bs: The block set to iterate over. > > + */ > > +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs) > > +{ > > + it->bs = bs; > > + it->block = list_first_entry_or_null(&bs->blocks, struct kho_block, list); > > + it->i = 0; > > +} > > + > > +/** > > + * kho_block_it_next - Return the next entry slot in the block set. > > + * @it: The block iterator. > > + * > > + * If the current block is full, it automatically advances to the next block > > + * in the set. > > + * > > + * Return: A pointer to the next entry slot, or NULL if no more slots are > > + * available. > > + */ > > +void *kho_block_it_next(struct kho_block_it *it) > > The naming and documentation here are very confusing.
This and > kho_block_it_read() look pretty much identical, and their documentation > also looks pretty much identical. There seems to be only one tiny > difference: this function returns the slot while incrementing the block > count. > > Can we do better something like kho_block_it_write_next(struct > kho_block_it *it, void *entry) (size was specified when creating block > set)? Yes, this results in a copy but does that matter that much? > > And if you really want to avoid copying, perhaps > kho_block_it_add_entry()? Or something along the lines? To make it clear > this is adding an entry to the block set. > > Also, make the intended usage clear in the documentation. Sure, I will work on this. I also did not like the names, but could not think of anything clearer. > > > +{ > > + if (!it->block) > > + return NULL; > > + > > + if (it->i == kho_block_count_per_block(it->bs)) { > > + it->block->ser->count = it->i; > > + if (list_is_last(&it->block->list, &it->bs->blocks)) > > + return NULL; > > + it->block = list_next_entry(it->block, list); > > + it->i = 0; > > + } > > + > > + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); > > +} > > + > > +/** > > + * kho_block_it_read - Return the next entry slot for reading. > > + * @it: The block iterator. > > + * > > + * This function iterates through entries that were previously serialized, > > + * respecting the count stored in each block's header. > > + * > > + * Return: A pointer to the next entry slot, or NULL if no more entries are > > + * available. > > + */ > > +void *kho_block_it_read(struct kho_block_it *it) > > +{ > > + if (!it->block) > > + return NULL; > > + > > + while (it->i == it->block->ser->count) { > > Hmm, the while loop suggests we can have blocks with zero count. Do you > think we should detect those and error out instead? Since it doesn't > really make sense to have a block with no entries. This sounds reasonable. 
> > > + if (list_is_last(&it->block->list, &it->bs->blocks)) > > + return NULL; > > + it->block = list_next_entry(it->block, list); > > + it->i = 0; > > + } > > + > > + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); > > +} > > + > > +/** > > + * kho_block_it_prev - Return the previous entry slot in the block set. > > + * @it: The block iterator. > > + * > > + * If the current index is at the start of a block, it automatically moves to > > + * the end of the previous block. > > + * > > + * Return: A pointer to the previous entry slot, or NULL if at the very > > + * beginning of the block set. > > + */ > > +void *kho_block_it_prev(struct kho_block_it *it) > > +{ > > + if (!it->block) > > + return NULL; > > + > > + if (it->i == 0) { > > + if (list_is_first(&it->block->list, &it->bs->blocks)) > > + return NULL; > > + it->block = list_prev_entry(it->block, list); > > + it->i = kho_block_count_per_block(it->bs); > > + } > > + > > + return (void *)(it->block->ser + 1) + (--it->i * it->bs->entry_size); > > +} > > + > > +/** > > + * kho_block_it_finalize - Finalize the current block by setting its entry count. > > + * @it: The block iterator. > > + */ > > +void kho_block_it_finalize(struct kho_block_it *it) > > +{ > > + if (it->block) > > + it->block->ser->count = it->i; > > +} > > Doesn't kho_block_it_next() already do this when you add an entry? So > this seems redundant. It is not redundant because of how the final partially filled block is handled. kho_block_it_next() only writes the count into the block header when a block is completely full and it is advancing to the next one: if (it->i == kho_block_count_per_block(it->bs)) { it->block->ser->count = it->i; ... But the very last block in the set is usually only partially filled (e.g., we write 10 entries into a block with a capacity of 64). Since it->i never reaches the maximum capacity, kho_block_it_next() never commits its count.
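The behaviour described above can be reproduced with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: the model_* names, the 4-entry block capacity, and the fixed 3-block array are hypothetical and exist purely for illustration. Only the commit logic in model_it_next() and model_it_finalize() is meant to mirror the quoted kho_block_it_next() and kho_block_it_finalize().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model parameters; the real capacity depends on entry_size. */
#define MODEL_CAPACITY 4u	/* entries per block */
#define MODEL_NBLOCKS  3u	/* blocks in the set */

struct model_block {
	uint64_t count;				/* committed count, as in the header */
	uint64_t entries[MODEL_CAPACITY];	/* payload slots */
};

struct model_it {
	struct model_block *blocks;
	size_t nblocks;
	size_t cur;	/* index of the current block */
	uint64_t i;	/* next entry index within the current block */
};

static void model_it_init(struct model_it *it, struct model_block *blocks,
			  size_t nblocks)
{
	it->blocks = blocks;
	it->nblocks = nblocks;
	it->cur = 0;
	it->i = 0;
}

/*
 * Same shape as the quoted kho_block_it_next(): the count is written to the
 * header only when the current block is full and another slot is requested.
 */
static uint64_t *model_it_next(struct model_it *it)
{
	if (!it->nblocks)
		return NULL;

	if (it->i == MODEL_CAPACITY) {
		it->blocks[it->cur].count = it->i;
		if (it->cur + 1 == it->nblocks)
			return NULL;
		it->cur++;
		it->i = 0;
	}

	return &it->blocks[it->cur].entries[it->i++];
}

/* Same shape as the quoted kho_block_it_finalize(). */
static void model_it_finalize(struct model_it *it)
{
	if (it->nblocks)
		it->blocks[it->cur].count = it->i;
}

/*
 * Write n entries, optionally finalize, and return the committed count of
 * the block the iterator ended on.
 */
uint64_t model_last_count(unsigned int n, int finalize)
{
	struct model_block blocks[MODEL_NBLOCKS];
	struct model_it it;
	unsigned int k;

	assert(n <= MODEL_CAPACITY * MODEL_NBLOCKS);

	memset(blocks, 0, sizeof(blocks));
	model_it_init(&it, blocks, MODEL_NBLOCKS);

	for (k = 0; k < n; k++) {
		uint64_t *slot = model_it_next(&it);

		if (!slot)
			break;
		*slot = k;
	}

	if (finalize)
		model_it_finalize(&it);

	return blocks[it.cur].count;
}
```

In this model, writing 6 entries leaves the second block with a committed count of 0 unless the finalize step runs, in which case it becomes 2. A reader honouring the header count would otherwise skip those two entries.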
Pasha From pasha.tatashin at soleen.com Mon Jun 1 07:40:47 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Mon, 1 Jun 2026 10:40:47 -0400 Subject: [PATCH v4 10/13] liveupdate: Remove limit on the number of files per session In-Reply-To: <2vxzbjdufirq.fsf@kernel.org> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-11-pasha.tatashin@soleen.com> <2vxzbjdufirq.fsf@kernel.org> Message-ID: On 06-01 16:16, Pratyush Yadav wrote: > On Sat, May 30 2026, Pasha Tatashin wrote: > > > To remove the fixed limit on the number of preserved files per session, > > transition the file metadata serialization from a single contiguous > > memory block to a chain of linked blocks. > > > > Acked-by: Mike Rapoport (Microsoft) > > Signed-off-by: Pasha Tatashin > > --- > > include/linux/kho/abi/luo.h | 13 +-- > > kernel/liveupdate/luo_file.c | 144 +++++++++++++++---------------- > > kernel/liveupdate/luo_internal.h | 6 +- > > 3 files changed, 80 insertions(+), 83 deletions(-) > > > > diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h > > index 79758d92ed5f..16df550ef143 100644 > > --- a/include/linux/kho/abi/luo.h > > +++ b/include/linux/kho/abi/luo.h > > @@ -35,8 +35,8 @@ > > * > > * - struct luo_session_ser: > > * Metadata for a single session, including its name and a physical pointer > > - * to another preserved memory block containing an array of > > - * `struct luo_file_ser` for all files in that session. > > + * to the first `struct kho_block_header_ser` for all files in that session. > > + * Multiple blocks are linked via the `next` field in the header. > > * > > * - struct luo_file_ser: > > * Metadata for a single preserved file. Contains the `compatible` string to > > @@ -65,7 +65,7 @@ > > * The LUO state is registered under this KHO entry name. 
> > */ > > #define LUO_KHO_ENTRY_NAME "LUO" > > -#define LUO_COMPAT_BASE "luo-v3" > > +#define LUO_COMPAT_BASE "luo-v4" > > #define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE > > #define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) > > > > @@ -103,9 +103,10 @@ struct luo_file_ser { > > > > /** > > * struct luo_file_set_ser - Represents the serialized metadata for file set > > - * @files: The physical address of a contiguous memory block that holds > > - * the serialized state of files (array of luo_file_ser) in this file > > - * set. > > + * @files: The physical address of the first `struct kho_block_header_ser`. > > + * This structure is the header for a block of memory containing > > + * an array of `struct luo_file_ser` entries. Multiple blocks are > > + * linked via the `next` field in the header. > > * @count: The total number of files that were part of this session during > > * serialization. Used for iteration and validation during > > * restoration. > > diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c > > index 9eec07a9e9fc..a445b1950ca7 100644 > > --- a/kernel/liveupdate/luo_file.c > > +++ b/kernel/liveupdate/luo_file.c > > @@ -118,11 +118,6 @@ static LIST_HEAD(luo_file_handler_list); > > /* Keep track of files being preserved by LUO */ > > static DEFINE_XARRAY(luo_preserved_files); > > > > -/* 2 4K pages, give space for 128 files per file_set */ > > -#define LUO_FILE_PGCNT 2ul > > -#define LUO_FILE_MAX \ > > - ((LUO_FILE_PGCNT << PAGE_SHIFT) / sizeof(struct luo_file_ser)) > > - > > /** > > * struct luo_file - Represents a single preserved file instance. 
> > * @fh: Pointer to the &struct liveupdate_file_handler that manages > > @@ -174,39 +169,6 @@ struct luo_file { > > u64 token; > > }; > > > > -static int luo_alloc_files_mem(struct luo_file_set *file_set) > > -{ > > - size_t size; > > - void *mem; > > - > > - if (file_set->files) > > - return 0; > > - > > - WARN_ON_ONCE(file_set->count); > > - > > - size = LUO_FILE_PGCNT << PAGE_SHIFT; > > - mem = kho_alloc_preserve(size); > > - if (IS_ERR(mem)) > > - return PTR_ERR(mem); > > - > > - file_set->files = mem; > > - > > - return 0; > > -} > > - > > -static void luo_free_files_mem(struct luo_file_set *file_set) > > -{ > > - /* If file_set has files, no need to free preservation memory */ > > - if (file_set->count) > > - return; > > - > > - if (!file_set->files) > > - return; > > - > > - kho_unpreserve_free(file_set->files); > > - file_set->files = NULL; > > -} > > - > > static unsigned long luo_get_id(struct liveupdate_file_handler *fh, > > struct file *file) > > { > > @@ -276,16 +238,15 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) > > if (luo_token_is_used(file_set, token)) > > return -EEXIST; > > > > - if (file_set->count == LUO_FILE_MAX) > > - return -ENOSPC; > > + err = kho_block_grow(&file_set->block_set, file_set->count); > > + if (err) > > + return err; > > > > file = fget(fd); > > - if (!file) > > - return -EBADF; > > - > > - err = luo_alloc_files_mem(file_set); > > - if (err) > > - goto err_fput; > > + if (!file) { > > + err = -EBADF; > > + goto err_shrink; > > + } > > > > err = -ENOENT; > > down_read(&luo_register_rwlock); > > @@ -300,7 +261,7 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) > > > > /* err is still -ENOENT if no handler was found */ > > if (err) > > - goto err_free_files_mem; > > + goto err_fput; > > > > err = xa_insert(&luo_preserved_files, luo_get_id(fh, file), > > file, GFP_KERNEL); > > @@ -343,10 +304,10 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) > > 
xa_erase(&luo_preserved_files, luo_get_id(fh, file)); > > err_module_put: > > module_put(fh->ops->owner); > > -err_free_files_mem: > > - luo_free_files_mem(file_set); > > err_fput: > > fput(file); > > +err_shrink: > > + kho_block_shrink(&file_set->block_set, file_set->count); > > > > return err; > > } > > @@ -392,13 +353,14 @@ void luo_file_unpreserve_files(struct luo_file_set *file_set) > > > > list_del(&luo_file->list); > > file_set->count--; > > + kho_block_shrink(&file_set->block_set, file_set->count); > > > > fput(luo_file->file); > > mutex_destroy(&luo_file->mutex); > > kfree(luo_file); > > } > > > > - luo_free_files_mem(file_set); > > + kho_block_destroy(&file_set->block_set); > > } > > > > static int luo_file_freeze_one(struct luo_file_set *file_set, > > @@ -454,7 +416,7 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, > > luo_file_unfreeze_one(file_set, luo_file); > > } > > > > - memset(file_set->files, 0, LUO_FILE_PGCNT << PAGE_SHIFT); > > + kho_block_set_clear(&file_set->block_set); > > } > > > > /** > > @@ -493,19 +455,23 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, > > int luo_file_freeze(struct luo_file_set *file_set, > > struct luo_file_set_ser *file_set_ser) > > { > > - struct luo_file_ser *file_ser = file_set->files; > > struct luo_file *luo_file; > > + struct kho_block_it it; > > int err; > > - int i; > > > > if (!file_set->count) > > return 0; > > > > - if (WARN_ON(!file_ser)) > > - return -EINVAL; > > + kho_block_it_init(&it, &file_set->block_set); > > > > - i = 0; > > list_for_each_entry(luo_file, &file_set->files_list, list) { > > + struct luo_file_ser *file_ser = kho_block_it_next(&it); > > + > > + if (!file_ser) { > > + err = -ENOSPC; > > + goto err_unfreeze; > > + } > > This should not fail normally, right? Since we pre-allocate the memory. > Perhaps add a comment saying that? 
> > > + > > err = luo_file_freeze_one(file_set, luo_file); > > if (err < 0) { > > pr_warn("Freeze failed for token[%#0llx] handler[%s] err[%pe]\n", > > @@ -514,16 +480,21 @@ int luo_file_freeze(struct luo_file_set *file_set, > > goto err_unfreeze; > > } > > > > - strscpy(file_ser[i].compatible, luo_file->fh->compatible, > > - sizeof(file_ser[i].compatible)); > > - file_ser[i].data = luo_file->serialized_data; > > - file_ser[i].token = luo_file->token; > > - i++; > > + strscpy(file_ser->compatible, luo_file->fh->compatible, > > + sizeof(file_ser->compatible)); > > + file_ser->data = luo_file->serialized_data; > > + file_ser->token = luo_file->token; > > } > > + kho_block_it_finalize(&it); > > > > file_set_ser->count = file_set->count; > > - if (file_set->files) > > - file_set_ser->files = virt_to_phys(file_set->files); > > + if (!list_empty(&file_set->block_set.blocks)) { > > + struct kho_block *block; > > + > > + block = list_first_entry(&file_set->block_set.blocks, > > + struct kho_block, list); > > + file_set_ser->files = virt_to_phys(block->ser); > > + } > > Please, add an API in KHO block to return the header physical address. > Poking into the internals of the data structure like this is not a good > idea. SGTM > > I missed that patch 9 also does this. So please use that there too. 
> > > > > return 0; > > > > @@ -741,14 +712,12 @@ int luo_file_finish(struct luo_file_set *file_set) > > module_put(luo_file->fh->ops->owner); > > list_del(&luo_file->list); > > file_set->count--; > > + kho_block_shrink(&file_set->block_set, file_set->count); > > mutex_destroy(&luo_file->mutex); > > kfree(luo_file); > > } > > > > - if (file_set->files) { > > - kho_restore_free(file_set->files); > > - file_set->files = NULL; > > - } > > + kho_block_destroy(&file_set->block_set); > > > > return 0; > > } > > @@ -822,16 +791,18 @@ int luo_file_deserialize(struct luo_file_set *file_set, > > struct luo_file_set_ser *file_set_ser) > > { > > struct luo_file_ser *file_ser; > > + struct kho_block_it it; > > int err; > > - u64 i; > > > > if (!file_set_ser->files) { > > WARN_ON(file_set_ser->count); > > return 0; > > } > > > > - file_set->count = file_set_ser->count; > > - file_set->files = phys_to_virt(file_set_ser->files); > > + file_set->count = 0; > > + err = kho_block_restore(&file_set->block_set, file_set_ser->files); > > + if (err) > > + return err; > > > > /* > > * Note on error handling: > > @@ -848,25 +819,50 @@ int luo_file_deserialize(struct luo_file_set *file_set, > > * userspace to detect the failure and trigger a reboot, which will > > * reliably reset devices and reclaim memory. 
> > */ > > - file_ser = file_set->files; > > - for (i = 0; i < file_set->count; i++) { > > - err = luo_file_deserialize_one(file_set, &file_ser[i]); > > + kho_block_it_init(&it, &file_set->block_set); > > + while ((file_ser = kho_block_it_read(&it))) { > > + err = luo_file_deserialize_one(file_set, file_ser); > > if (err) > > - return err; > > + goto err_destroy_blocks; > > + file_set->count++; > > + } > > + > > + if (file_set->count != file_set_ser->count) { > > + pr_warn("File count mismatch: expected %llu, found %llu\n", > > + file_set_ser->count, file_set->count); > > + err = -EINVAL; > > + goto err_destroy_blocks; > > } > > > > return 0; > > + > > +err_destroy_blocks: > > + while (!list_empty(&file_set->files_list)) { > > + struct luo_file *luo_file; > > + > > + luo_file = list_first_entry(&file_set->files_list, > > + struct luo_file, list); > > + list_del(&luo_file->list); > > + module_put(luo_file->fh->ops->owner); > > + mutex_destroy(&luo_file->mutex); > > + kfree(luo_file); > > + } > > + file_set->count = 0; > > + kho_block_destroy(&file_set->block_set); > > + return err; > > } > > > > void luo_file_set_init(struct luo_file_set *file_set) > > { > > INIT_LIST_HEAD(&file_set->files_list); > > + kho_block_set_init(&file_set->block_set, sizeof(struct luo_file_ser)); > > } > > > > void luo_file_set_destroy(struct luo_file_set *file_set) > > { > > WARN_ON(file_set->count); > > WARN_ON(!list_empty(&file_set->files_list)); > > + WARN_ON(!list_empty(&file_set->block_set.blocks)); > > Here too. 
Sure > > > } > > > > /** > > diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h > > index ee18f9a11b91..64879ffe7378 100644 > > --- a/kernel/liveupdate/luo_internal.h > > +++ b/kernel/liveupdate/luo_internal.h > > @@ -10,6 +10,7 @@ > > > > #include > > #include > > +#include > > > > struct luo_ucmd { > > void __user *ubuffer; > > @@ -44,14 +45,13 @@ static inline int luo_ucmd_respond(struct luo_ucmd *ucmd, > > * struct luo_file_set - A set of files that belong to the same sessions. > > * @files_list: An ordered list of files associated with this session, it is > > * ordered by preservation time. > > - * @files: The physically contiguous memory block that holds the serialized > > - * state of files. > > + * @block_set: The set of serialization blocks. > > * @count: A counter tracking the number of files currently stored in the > > * @files_list for this session. > > */ > > struct luo_file_set { > > struct list_head files_list; > > - struct luo_file_ser *files; > > + struct kho_block_set block_set; > > u64 count; > > }; > > -- > Regards, > Pratyush Yadav From pasha.tatashin at soleen.com Mon Jun 1 07:44:14 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Mon, 1 Jun 2026 10:44:14 -0400 Subject: [PATCH v4 09/13] liveupdate: Remove limit on the number of sessions In-Reply-To: <2vxzfr36fjcj.fsf@kernel.org> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-10-pasha.tatashin@soleen.com> <2vxzfr36fjcj.fsf@kernel.org> Message-ID: On 06-01 16:03, Pratyush Yadav wrote: > On Sat, May 30 2026, Pasha Tatashin wrote: > > > Currently, the number of LUO sessions is limited by a fixed number of > > pre-allocated pages for serialization (16 pages, allowing for ~819 > > sessions). > > > > This limitation is problematic if LUO is used to support things such as > > systemd file descriptor store, and would be used not just as VM memory > > but to save other states on the machine. 
> > > > Remove this limit by transitioning to a linked-block approach for > > session metadata serialization. Instead of a single contiguous block, > > session metadata is now stored in a chain of 16-page blocks. Each block > > starts with a header containing the physical address of the next block > > and the number of session entries in the current block. > > > > Acked-by: Mike Rapoport (Microsoft) > > Signed-off-by: Pasha Tatashin > > --- > [...] > > @@ -63,13 +58,15 @@ > > #define _LINUX_KHO_ABI_LUO_H > > > > #include > > +#include > > #include > > > > /* > > * The LUO state is registered under this KHO entry name. > > */ > > #define LUO_KHO_ENTRY_NAME "LUO" > > -#define LUO_ABI_COMPATIBLE "luo-v3" > > +#define LUO_COMPAT_BASE "luo-v3" > > +#define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE > > That's clever :-) > > [...] > > int luo_session_serialize(void) > > { > > struct luo_session_header *sh = &luo_session_global.outgoing; > > struct luo_session *session; > > - int i = 0; > > + struct kho_block_it it; > > int err; > > > > down_write(&luo_session_serialize_rwsem); > > down_write(&sh->rwsem); > > *sh->sessions_pa = 0; > > > > + kho_block_it_init(&it, &sh->block_set); > > + > > list_for_each_entry(session, &sh->list, list) { > > - err = luo_session_freeze_one(session, &sh->ser[i]); > > - if (err) > > + struct luo_session_ser *ser = kho_block_it_next(&it); > > + > > + if (!ser) { > > + err = -ENOSPC; > > goto err_undo; > > + } > > > > - strscpy(sh->ser[i].name, session->name, > > - sizeof(sh->ser[i].name)); > > - i++; > > - } > > + err = luo_session_freeze_one(session, ser); > > + if (err) { > > + kho_block_it_prev(&it); > > + goto err_undo; > > + } > > > > - if (sh->header_ser && sh->count > 0) { > > - sh->header_ser->count = sh->count; > > - *sh->sessions_pa = virt_to_phys(sh->header_ser); > > + strscpy(ser->name, session->name, sizeof(ser->name)); > > } > > + > > + kho_block_it_finalize(&it); > > + > > + if (sh->sessions_pa && sh->count 
> 0) > > Nit: Why check for sh->sessions_pa? It can never be NULL. Good point, I will remove it. > > Other than this, > > Reviewed-by: Pratyush Yadav (Google) > > > + *sh->sessions_pa = sh->block_set.head_pa; > > up_write(&sh->rwsem); > > > > return 0; > > > > err_undo: > > list_for_each_entry_continue_reverse(session, &sh->list, list) { > > - i--; > > - luo_session_unfreeze_one(session, &sh->ser[i]); > > - memset(sh->ser[i].name, 0, sizeof(sh->ser[i].name)); > > + struct luo_session_ser *ser = kho_block_it_prev(&it); > > + > > + luo_session_unfreeze_one(session, ser); > > + memset(ser->name, 0, sizeof(ser->name)); > > } > > up_write(&sh->rwsem); > > up_write(&luo_session_serialize_rwsem); > > -- > Regards, > Pratyush Yadav From pasha.tatashin at soleen.com Mon Jun 1 08:00:59 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Mon, 1 Jun 2026 11:00:59 -0400 Subject: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions In-Reply-To: References: <20260528004204.1484584-1-jloeser@linux.microsoft.com> Message-ID: On 05-31 20:10, Mike Rapoport wrote: > Hi Jork, > > Only had time to skim through the patches. > I have a couple of high level questions for now. > > On Wed, May 27, 2026 at 05:41:42PM -0700, Jork Loeser wrote: > > When Linux runs as an L1 Virtual Host (L1VH) under Hyper-V, the MSHV > > root partition driver deposits pages to the hypervisor and creates > > partitions for guest VMs. Prior patches enabled kexec for L1VH, but > > only when no partitions had been created and no memory had been donated. > > > > This series lifts that limitation. It uses KHO (Kexec Handover) to: > > > > - Track all pages deposited to the hypervisor in a KHO radix tree > > and preserve them across kexec so the new kernel knows which pages > > are owned by the hypervisor. > > > > - Freeze running partitions before kexec, record their IDs in the > > KHO FDT, and vacuum (tear down + reclaim memory) stale partitions > > after kexec. 
> > > > - In case of a crash, exclude hypervisor-owned pages from crash > > dump collection by passing the radix tree root PA via Hyper-V > > crash MSR P2 to the crash kernel. > > > > Dependency on Pratyush's KHO series > > =================================== > > > > Patches 1-12 are cherry-picked from Pratyush Yadav's v1 series > > "kho: make boot time huge page allocation work nicely with KHO" [1], > > which is still under discussion. This series uses functionality from > > those patches -- specifically the meta-data page enumeration via table > > callbacks and the restructured radix tree API. It also extends the > > KHO radix tree with: > > > > - A freeze mechanism to lock the tree before serializing for kexec > > (patch 13). > > There were a lot of effort to make KHO stateless and drop the requirement > for finalization/freeze. Yes, using KHO directly here is incorrect. The state machine is provided by LUO, so we should use LUO here. MSHV should provide a file that userspace adds to LUO, and all state machine management would be the same as for all other clients participating in LU. > > Why is this necessary to add a freeze mechanism to kho_radix_tree? > If it's a hard requirement of mshv maybe the freeze part should be handled > there? > > - A crash-kernel-safe variant that memremaps radix nodes for use > > outside the direct map (patch 14). > > > > Patch overview > > ============== > > > > Patches 1-12: KHO radix tree and memblock changes (from [1]) > > Patch 13: Radix tree freeze and del_key() error reporting > > del_key() error reporting sounds like something we'd want to avoid. > del_key() is called on "freeing" path and during error handling, it would > be hard if at all possible to deal with errors from del_key().
> > > Patch 14: Crash-kernel-safe radix tree presence check > > Patch 15: Page tracker using KHO radix tree for deposited pages > > Patch 16: Debugfs interface for page tracker > > Patches 17-18: Crash MSR reshuffling + crash dump page exclusion > > Patch 19: Export kexec_in_progress for modules > > Isn't there another way to differentiate kexec reboot? > > > Patch 20: Freeze and vacuum partitions across kexec > > > > Feedback > > ======== > > > > This is an RFC. I am looking for feedback on the overall approach as > > well as the KHO changes (patches 13-14). > > > > [1] https://lore.kernel.org/linux-mm/20260429133928.850721-1-pratyush at kernel.org/ > > > > Based-on: linux-next/master (next-20260527) > > -- > Sincerely yours, > Mike. From dakr at kernel.org Mon Jun 1 08:09:59 2026 From: dakr at kernel.org (Danilo Krummrich) Date: Mon, 01 Jun 2026 17:09:59 +0200 Subject: [PATCH v16 0/5] shut down devices asynchronously In-Reply-To: <20260518193204.14273-1-djeffery@redhat.com> References: <20260518193204.14273-1-djeffery@redhat.com> Message-ID: On Mon May 18, 2026 at 9:31 PM CEST, David Jeffery wrote: > These patches are now rebased against the driver-core tree's driver-core-next > branch. [...] > Changes from V15: > > The async_shutdown bit field is converted to a device flags bit Convert all > patches to use the flag bit accessor macros to set or check if async shutdown > should be used Added documentation on the kernel parameter to control use of > async shutdown Did you have a look at the Sashiko report from v15 [1]? Some of the concerns raised seem valid at a quick glance. (It seems that this version has not been picked up by Sashiko (despite you mentioning they are based on driver-core-next). I'd assume it doesn't like that the series was not sent with '--base'.) Can you have a look at [1] please? 
Thanks, Danilo [1] https://sashiko.dev/#/patchset/20260429175016.7915-1-djeffery%40redhat.com > Stuart Hayes (2): > driver core: separate function to shutdown one device > driver core: do not always lock parent in shutdown > > David Jeffery (3): > driver core: async device shutdown infrastructure > PCI: Enable async shutdown support > scsi: Enable async shutdown support Not sure it will make it for 7.2, but I think it would be good to give this some more time in linux-next anyways. Bjorn, James, Martin: Should the PCI and scsi patch go through the driver-core tree too? Do you prefer a signed tag with the driver-core changes to merge into the PCI and scsi trees? From djeffery at redhat.com Mon Jun 1 09:54:42 2026 From: djeffery at redhat.com (David Jeffery) Date: Mon, 1 Jun 2026 12:54:42 -0400 Subject: [PATCH v16 0/5] shut down devices asynchronously In-Reply-To: References: <20260518193204.14273-1-djeffery@redhat.com> Message-ID: On Mon, Jun 1, 2026 at 11:10 AM Danilo Krummrich wrote: > > On Mon May 18, 2026 at 9:31 PM CEST, David Jeffery wrote: > > These patches are now rebased against the driver-core tree's driver-core-next > > branch. > > [...] > > > Changes from V15: > > > > The async_shutdown bit field is converted to a device flags bit Convert all > > patches to use the flag bit accessor macros to set or check if async shutdown > > should be used Added documentation on the kernel parameter to control use of > > async shutdown > > Did you have a look at the Sashiko report from v15 [1]? Some of the concerns > raised seem valid at a quick glance. > > (It seems that this version has not been picked up by Sashiko (despite you > mentioning they are based on driver-core-next). I'd assume it doesn't like that > the series was not sent with '--base'.) > > Can you have a look at [1] please? > > Thanks, > Danilo > > [1] https://sashiko.dev/#/patchset/20260429175016.7915-1-djeffery%40redhat.com This does look to have found some legitimate issues in need of correction.
I'll get them fixed. Thanks, David Jeffery From mclapinski at google.com Mon Jun 1 12:11:36 2026 From: mclapinski at google.com (Michal Clapinski) Date: Mon, 1 Jun 2026 21:11:36 +0200 Subject: [PATCH] kexec_file: skip checksum verification when relocations aren't needed Message-ID: <20260601191136.799134-1-mclapinski@google.com> Checksum verification is needed 1. for crash kernels. In a crash, we can't be sure the kernel is intact. 2. if we're worried about relocating the kernel into a region used by some DMA that wasn't properly cancelled. If we used CMA to allocate segments then 1. we're not working with a crash kernel. 2. relocations are not going to happen. Therefore, we can safely disable checksum verification. Instead of adding a new variable to purgatory, just skip adding regions and save the default value of SHA256 hash. Saves ~250ms on my 4.0 GHz CPU. Signed-off-by: Michal Clapinski --- kernel/kexec_file.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index 2bfbb2d144e6..2dc8b0435fe6 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -808,6 +808,7 @@ static int kexec_calculate_store_digests(struct kimage *image) void *zero_buf; struct kexec_sha_region *sha_regions; struct purgatory_info *pi = &image->purgatory_info; + bool can_skip_checksum = true; if (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY)) return 0; @@ -822,6 +823,23 @@ static int kexec_calculate_store_digests(struct kimage *image) sha256_init(&sctx); + /* + * If all segments were loaded into contiguous memory, there will be no + * relocations. In that case there is no risk of memory corruption by + * uncancelled DMA and we can skip checksum calculation. 
+ */ + for (i = 0; i < image->nr_segments; i++) { + if (!image->segment_cma[i]) { + can_skip_checksum = false; + break; + } + } + + if (can_skip_checksum) { + pr_info("disabling checksum verification in purgatory\n"); + goto skip_checksum; + } + for (j = i = 0; i < image->nr_segments; i++) { struct kexec_segment *ksegment; @@ -867,6 +885,7 @@ static int kexec_calculate_store_digests(struct kimage *image) j++; } +skip_checksum: sha256_final(&sctx, digest); ret = kexec_purgatory_get_set_symbol(image, "purgatory_sha_regions", -- 2.54.0.929.g9b7fa37559-goog From mclapinski at google.com Mon Jun 1 12:30:14 2026 From: mclapinski at google.com (Michal Clapinski) Date: Mon, 1 Jun 2026 21:30:14 +0200 Subject: [PATCH] kho: try to allocate contiguous memory for kexec segments Message-ID: <20260601193014.896405-1-mclapinski@google.com> This allows us to skip relocations (and maybe checksum calculation in the future). kho_scratch is marked as MIGRATE_CMA but isn't actually given to the CMA, so it should only contain movable allocations, therefore this should always succeed. 
Signed-off-by: Michal Clapinski --- kernel/kexec_core.c | 6 +++++- kernel/liveupdate/kexec_handover.c | 21 +++++++++++++++++---- 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index dc770b9a6d05..cba3ce985aa9 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -43,6 +43,7 @@ #include #include #include +#include #include #include @@ -566,7 +567,10 @@ static void kimage_free_cma(struct kimage *image) continue; arch_kexec_pre_free_pages(page_address(cma), nr_pages); - dma_release_from_contiguous(NULL, cma, nr_pages); + if (kho_is_enabled()) + free_contig_range(page_to_pfn(cma), nr_pages); + else + dma_release_from_contiguous(NULL, cma, nr_pages); image->segment_cma[i] = NULL; } diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 4834a809985a..289fd5948fd2 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1770,15 +1770,28 @@ static int kho_walk_scratch(struct kexec_buf *kbuf, return ret; } +static void kho_try_alloc_contig(struct kexec_buf *kbuf) +{ + unsigned long start_pfn = PFN_DOWN(kbuf->mem); + unsigned long nr_pages = kbuf->memsz >> PAGE_SHIFT; + + if (alloc_contig_range(start_pfn, start_pfn + nr_pages, + ACR_FLAGS_CMA, GFP_KERNEL)) + return; + + kbuf->cma = pfn_to_page(start_pfn); + arch_kexec_post_alloc_pages(page_address(kbuf->cma), nr_pages, 0); +} + int kho_locate_mem_hole(struct kexec_buf *kbuf, int (*func)(struct resource *, void *)) { - int ret; - if (!kho_enable || kbuf->image->type == KEXEC_TYPE_CRASH) return 1; - ret = kho_walk_scratch(kbuf, func); + if (!kho_walk_scratch(kbuf, func)) + return -EADDRNOTAVAIL; - return ret == 1 ? 
0 : -EADDRNOTAVAIL; + kho_try_alloc_contig(kbuf); + return 0; } -- 2.54.0.929.g9b7fa37559-goog From mclapinski at google.com Mon Jun 1 12:52:46 2026 From: mclapinski at google.com (Michał Cłapiński) Date: Mon, 1 Jun 2026 21:52:46 +0200 Subject: [PATCH] kho: try to allocate contiguous memory for kexec segments In-Reply-To: <20260601193014.896405-1-mclapinski@google.com> References: <20260601193014.896405-1-mclapinski@google.com> Message-ID: On Mon, Jun 1, 2026 at 9:30 PM Michal Clapinski wrote: > > This allows us to skip relocations (and maybe checksum calculation > in the future). > > kho_scratch is marked as MIGRATE_CMA but isn't actually given to the > CMA, so it should only contain movable allocations, therefore this > should always succeed. Now that I think about it, this is only true on the primary boot. On subsequent boots, kho scratch will contain memblock allocations forever. I should have tested it more than once. I have no idea how probable it is that I will find enough movable/free memory in kho scratch for this to ever succeed. I'll give it more thought. From jloeser at linux.microsoft.com Mon Jun 1 13:09:41 2026 From: jloeser at linux.microsoft.com (Jork Loeser) Date: Mon, 1 Jun 2026 13:09:41 -0700 (PDT) Subject: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions In-Reply-To: References: <20260528004204.1484584-1-jloeser@linux.microsoft.com> Message-ID: On Sun, 31 May 2026, Mike Rapoport wrote: > Hi Jork, >> - A freeze mechanism to lock the tree before serializing for kexec >> (patch 13). > > There were a lot of effort to make KHO stateless and drop the requirement > for finalization/freeze. > > Why is this necessary to add a freeze mechanism to kho_radix_tree? > If it's a hard requirement of mshv maybe the freeze part should be handled > there? Good feedback. It's a safety-net so we do not accidentally donate pages without being able to track them. Thought it might be a good generic feature.
Let me keep it in the MSHV driver. >> Patch 13: Radix tree freeze and del_key() error reporting > > del_key() error reporting sounds like something we'd want to avoid. > del_key() is called on "freeing" path and during error handling, it would > be hard if at all possible to deal with errors from del_key(). I hear you. Stating "yeah, it can only really fail if the key isn't there, or it's frozen, but not due to other things, so don't bother to check the return code if you are sure" is an odd contract. With the freeze-logic moving into MSHV, will revert to no-error. >> Patch 19: Export kexec_in_progress for modules > > Isn't there another way to differentiate kexec reboot? I could not find one, unfortunately. > Sincerely yours, > Mike. Best, Jork From jloeser at linux.microsoft.com Mon Jun 1 13:15:11 2026 From: jloeser at linux.microsoft.com (Jork Loeser) Date: Mon, 1 Jun 2026 13:15:11 -0700 (PDT) Subject: [RFC PATCH 00/20] mshv: enable kexec with Hyper-V donated pages and partitions In-Reply-To: References: <20260528004204.1484584-1-jloeser@linux.microsoft.com> Message-ID: <4172d271-21b4-346-924e-406baef179a1@linux.microsoft.com> On Mon, 1 Jun 2026, Pasha Tatashin wrote: > On 05-31 20:10, Mike Rapoport wrote: >>> - A freeze mechanism to lock the tree before serializing for kexec >>> (patch 13). >> >> There were a lot of effort to make KHO stateless and drop the requirement >> for finalization/freeze. > > Yes, using KHO directly here is incorrect. The state machine is provided > by LUO, so we should use LUO here. MSHV should provide a file that > userspace adds to LUO, and all state machine management would be the > same as for all other clients participating in LU. The thing is, there is no file handle to rely on. Even once partitions are all removed, Hyper-V might hang onto pages (and won't return them even if asked). However, these pages very much must be excluded from Linux post-kexec, or the system will crash. 
We cannot rely on UM to ensure integrity of memory management. Contrast that to standard LUO use: If you drop individual file handles, or even skip the LUO phase entirely, the worst that will happen is that the objects will be gone post-kexec. The MM itself will still be consistent. For MSHV & page donation, this is different. (And yes, partition preservation will very much tie into LUO) Best, Jork From lkp at intel.com Mon Jun 1 14:10:02 2026 From: lkp at intel.com (kernel test robot) Date: Tue, 02 Jun 2026 05:10:02 +0800 Subject: [liveupdate:next] BUILD SUCCESS 2935777b418d2bfcbfe96705bb2c0fa6c0d94e18 Message-ID: <202606020552.pcMaifmj-lkp@intel.com> tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/liveupdate/linux.git next branch HEAD: 2935777b418d2bfcbfe96705bb2c0fa6c0d94e18 liveupdate: Remove unused ser field from struct luo_session elapsed time: 859m configs tested: 360 configs skipped: 5 The following configs have been built successfully. More configs may be tested in the coming days. 
tested configs: alpha allnoconfig gcc-15.2.0 alpha allyesconfig gcc-15.2.0 alpha defconfig gcc-15.2.0 arc allmodconfig clang-16 arc allmodconfig gcc-15.2.0 arc allnoconfig gcc-15.2.0 arc allyesconfig clang-23 arc allyesconfig gcc-15.2.0 arc defconfig gcc-15.2.0 arc randconfig-001 gcc-8.5.0 arc randconfig-001-20260601 clang-23 arc randconfig-001-20260601 gcc-8.5.0 arc randconfig-001-20260602 gcc-10.5.0 arc randconfig-002 gcc-8.5.0 arc randconfig-002-20260601 clang-23 arc randconfig-002-20260601 gcc-8.5.0 arc randconfig-002-20260602 gcc-10.5.0 arm allnoconfig clang-23 arm allnoconfig gcc-15.2.0 arm allyesconfig clang-16 arm allyesconfig gcc-15.2.0 arm defconfig gcc-15.2.0 arm h3600_defconfig gcc-15.2.0 arm mvebu_v7_defconfig clang-23 arm randconfig-001 gcc-8.5.0 arm randconfig-001-20260601 clang-23 arm randconfig-001-20260601 gcc-8.5.0 arm randconfig-001-20260602 gcc-10.5.0 arm randconfig-002 gcc-8.5.0 arm randconfig-002-20260601 clang-23 arm randconfig-002-20260601 gcc-8.5.0 arm randconfig-002-20260602 gcc-10.5.0 arm randconfig-003 gcc-8.5.0 arm randconfig-003-20260601 clang-23 arm randconfig-003-20260601 gcc-8.5.0 arm randconfig-003-20260602 gcc-10.5.0 arm randconfig-004 gcc-8.5.0 arm randconfig-004-20260601 clang-23 arm randconfig-004-20260601 gcc-8.5.0 arm randconfig-004-20260602 gcc-10.5.0 arm64 allmodconfig clang-19 arm64 allmodconfig clang-23 arm64 allnoconfig gcc-15.2.0 arm64 defconfig gcc-15.2.0 arm64 randconfig-001 gcc-8.5.0 arm64 randconfig-001-20260601 gcc-14.3.0 arm64 randconfig-001-20260601 gcc-8.5.0 arm64 randconfig-001-20260602 gcc-15.2.0 arm64 randconfig-002 gcc-14.3.0 arm64 randconfig-002-20260601 gcc-8.5.0 arm64 randconfig-002-20260602 gcc-15.2.0 arm64 randconfig-003 clang-23 arm64 randconfig-003-20260601 gcc-15.2.0 arm64 randconfig-003-20260601 gcc-8.5.0 arm64 randconfig-003-20260602 gcc-15.2.0 arm64 randconfig-004-20260601 gcc-14.3.0 arm64 randconfig-004-20260601 gcc-8.5.0 arm64 randconfig-004-20260602 gcc-15.2.0 csky allmodconfig gcc-15.2.0 csky 
allnoconfig gcc-15.2.0 csky defconfig gcc-15.2.0 csky randconfig-001 gcc-10.5.0 csky randconfig-001-20260601 gcc-14.3.0 csky randconfig-001-20260601 gcc-8.5.0 csky randconfig-001-20260602 gcc-15.2.0 csky randconfig-002 gcc-10.5.0 csky randconfig-002-20260601 gcc-16.1.0 csky randconfig-002-20260601 gcc-8.5.0 csky randconfig-002-20260602 gcc-15.2.0 hexagon allmodconfig clang-17 hexagon allmodconfig gcc-15.2.0 hexagon allnoconfig clang-23 hexagon allnoconfig gcc-15.2.0 hexagon defconfig gcc-15.2.0 hexagon randconfig-001 gcc-11.5.0 hexagon randconfig-001-20260601 gcc-11.5.0 hexagon randconfig-001-20260601 gcc-8.5.0 hexagon randconfig-001-20260602 gcc-8.5.0 hexagon randconfig-002 gcc-11.5.0 hexagon randconfig-002-20260601 gcc-11.5.0 hexagon randconfig-002-20260601 gcc-8.5.0 hexagon randconfig-002-20260602 gcc-8.5.0 i386 allmodconfig clang-20 i386 allmodconfig gcc-14 i386 allnoconfig gcc-14 i386 allnoconfig gcc-15.2.0 i386 allyesconfig clang-20 i386 allyesconfig gcc-14 i386 buildonly-randconfig-001 gcc-12 i386 buildonly-randconfig-001-20260601 gcc-12 i386 buildonly-randconfig-001-20260601 gcc-14 i386 buildonly-randconfig-001-20260602 clang-20 i386 buildonly-randconfig-002 gcc-12 i386 buildonly-randconfig-002-20260601 clang-20 i386 buildonly-randconfig-002-20260601 gcc-12 i386 buildonly-randconfig-002-20260602 clang-20 i386 buildonly-randconfig-003 gcc-12 i386 buildonly-randconfig-003-20260601 gcc-12 i386 buildonly-randconfig-003-20260601 gcc-14 i386 buildonly-randconfig-003-20260602 clang-20 i386 buildonly-randconfig-004 gcc-12 i386 buildonly-randconfig-004-20260601 clang-20 i386 buildonly-randconfig-004-20260601 gcc-12 i386 buildonly-randconfig-004-20260602 clang-20 i386 buildonly-randconfig-005 gcc-12 i386 buildonly-randconfig-005-20260601 gcc-12 i386 buildonly-randconfig-005-20260601 gcc-14 i386 buildonly-randconfig-005-20260602 clang-20 i386 buildonly-randconfig-006 gcc-12 i386 buildonly-randconfig-006-20260601 gcc-12 i386 buildonly-randconfig-006-20260602 clang-20 
i386 defconfig gcc-15.2.0 i386 randconfig-001 gcc-14 i386 randconfig-001-20260601 gcc-14 i386 randconfig-001-20260602 gcc-14 i386 randconfig-002 gcc-14 i386 randconfig-002-20260601 gcc-14 i386 randconfig-002-20260602 gcc-14 i386 randconfig-003 gcc-14 i386 randconfig-003-20260601 gcc-14 i386 randconfig-003-20260602 gcc-14 i386 randconfig-004 gcc-14 i386 randconfig-004-20260601 gcc-14 i386 randconfig-004-20260602 gcc-14 i386 randconfig-005 gcc-14 i386 randconfig-005-20260601 gcc-14 i386 randconfig-005-20260602 gcc-14 i386 randconfig-006 gcc-14 i386 randconfig-006-20260601 gcc-14 i386 randconfig-006-20260602 gcc-14 i386 randconfig-007 gcc-14 i386 randconfig-007-20260601 gcc-14 i386 randconfig-007-20260602 gcc-14 i386 randconfig-011-20260602 clang-20 i386 randconfig-012-20260602 clang-20 i386 randconfig-013-20260602 clang-20 i386 randconfig-014-20260602 clang-20 i386 randconfig-015-20260602 clang-20 i386 randconfig-016-20260602 clang-20 i386 randconfig-017-20260602 clang-20 loongarch allmodconfig clang-19 loongarch allmodconfig clang-23 loongarch allnoconfig clang-23 loongarch allnoconfig gcc-15.2.0 loongarch defconfig clang-19 loongarch randconfig-001 gcc-11.5.0 loongarch randconfig-001-20260601 gcc-11.5.0 loongarch randconfig-001-20260601 gcc-8.5.0 loongarch randconfig-001-20260602 gcc-8.5.0 loongarch randconfig-002 gcc-11.5.0 loongarch randconfig-002-20260601 gcc-11.5.0 loongarch randconfig-002-20260601 gcc-8.5.0 loongarch randconfig-002-20260602 gcc-8.5.0 m68k alldefconfig gcc-15.2.0 m68k allmodconfig gcc-15.2.0 m68k allnoconfig gcc-15.2.0 m68k allyesconfig clang-16 m68k allyesconfig gcc-15.2.0 m68k defconfig clang-19 microblaze allnoconfig gcc-15.2.0 microblaze allyesconfig gcc-15.2.0 microblaze defconfig clang-19 mips allmodconfig gcc-15.2.0 mips allnoconfig gcc-15.2.0 mips allyesconfig gcc-15.2.0 mips gcw0_defconfig clang-23 mips rs90_defconfig gcc-15.2.0 nios2 allmodconfig clang-23 nios2 allmodconfig gcc-11.5.0 nios2 allnoconfig clang-23 nios2 allnoconfig 
gcc-11.5.0 nios2 defconfig clang-19 nios2 randconfig-001 gcc-11.5.0 nios2 randconfig-001-20260601 gcc-11.5.0 nios2 randconfig-001-20260601 gcc-8.5.0 nios2 randconfig-001-20260602 gcc-8.5.0 nios2 randconfig-002 gcc-11.5.0 nios2 randconfig-002-20260601 gcc-11.5.0 nios2 randconfig-002-20260601 gcc-8.5.0 nios2 randconfig-002-20260602 gcc-8.5.0 openrisc allmodconfig clang-23 openrisc allmodconfig gcc-15.2.0 openrisc allnoconfig clang-23 openrisc allnoconfig gcc-15.2.0 openrisc defconfig gcc-15.2.0 parisc allmodconfig gcc-15.2.0 parisc allnoconfig clang-23 parisc allnoconfig gcc-15.2.0 parisc allyesconfig clang-19 parisc allyesconfig gcc-15.2.0 parisc defconfig gcc-15.2.0 parisc randconfig-001 gcc-10.5.0 parisc randconfig-001-20260601 gcc-10.5.0 parisc randconfig-001-20260602 gcc-12.5.0 parisc randconfig-002 gcc-10.5.0 parisc randconfig-002-20260601 gcc-10.5.0 parisc randconfig-002-20260602 gcc-12.5.0 parisc64 defconfig clang-19 powerpc allmodconfig gcc-15.2.0 powerpc allnoconfig clang-23 powerpc allnoconfig gcc-15.2.0 powerpc arches_defconfig gcc-15.2.0 powerpc randconfig-001 gcc-10.5.0 powerpc randconfig-001-20260601 gcc-10.5.0 powerpc randconfig-001-20260602 gcc-12.5.0 powerpc randconfig-002 gcc-10.5.0 powerpc randconfig-002-20260601 gcc-10.5.0 powerpc randconfig-002-20260602 gcc-12.5.0 powerpc tqm8560_defconfig gcc-15.2.0 powerpc64 randconfig-001 gcc-10.5.0 powerpc64 randconfig-001-20260601 gcc-10.5.0 powerpc64 randconfig-001-20260602 gcc-12.5.0 powerpc64 randconfig-002 gcc-10.5.0 powerpc64 randconfig-002-20260601 gcc-10.5.0 powerpc64 randconfig-002-20260602 gcc-12.5.0 riscv allmodconfig clang-23 riscv allnoconfig clang-23 riscv allnoconfig gcc-15.2.0 riscv allyesconfig clang-16 riscv defconfig gcc-15.2.0 riscv randconfig-001 clang-23 riscv randconfig-001-20260601 clang-23 riscv randconfig-001-20260602 gcc-8.5.0 riscv randconfig-002 clang-23 riscv randconfig-002-20260601 clang-23 riscv randconfig-002-20260602 gcc-8.5.0 s390 allmodconfig clang-18 s390 allmodconfig 
clang-19 s390 allnoconfig clang-23 s390 allyesconfig gcc-15.2.0 s390 defconfig gcc-15.2.0 s390 randconfig-001 clang-23 s390 randconfig-001-20260601 clang-23 s390 randconfig-001-20260602 gcc-8.5.0 s390 randconfig-002 clang-23 s390 randconfig-002-20260601 clang-23 s390 randconfig-002-20260602 gcc-8.5.0 sh allmodconfig gcc-15.2.0 sh allnoconfig clang-23 sh allnoconfig gcc-15.2.0 sh allyesconfig clang-19 sh allyesconfig gcc-15.2.0 sh defconfig gcc-14 sh defconfig gcc-15.2.0 sh randconfig-001 clang-23 sh randconfig-001-20260601 clang-23 sh randconfig-001-20260602 gcc-8.5.0 sh randconfig-002 clang-23 sh randconfig-002-20260601 clang-23 sh randconfig-002-20260602 gcc-8.5.0 sparc allnoconfig clang-23 sparc allnoconfig gcc-15.2.0 sparc defconfig gcc-15.2.0 sparc randconfig-001 gcc-8.5.0 sparc randconfig-001-20260601 gcc-15.2.0 sparc randconfig-001-20260601 gcc-8.5.0 sparc randconfig-002 gcc-8.5.0 sparc randconfig-002-20260601 gcc-15.2.0 sparc randconfig-002-20260601 gcc-8.5.0 sparc64 allmodconfig clang-23 sparc64 defconfig clang-20 sparc64 defconfig gcc-14 sparc64 randconfig-001 gcc-8.5.0 sparc64 randconfig-001-20260601 clang-20 sparc64 randconfig-001-20260601 gcc-15.2.0 sparc64 randconfig-001-20260601 gcc-8.5.0 sparc64 randconfig-002 gcc-8.5.0 sparc64 randconfig-002-20260601 clang-23 sparc64 randconfig-002-20260601 gcc-15.2.0 sparc64 randconfig-002-20260601 gcc-8.5.0 um allmodconfig clang-19 um allnoconfig clang-23 um allyesconfig gcc-14 um allyesconfig gcc-15.2.0 um defconfig clang-23 um defconfig gcc-14 um i386_defconfig gcc-14 um randconfig-001 gcc-8.5.0 um randconfig-001-20260601 gcc-14 um randconfig-001-20260601 gcc-15.2.0 um randconfig-001-20260601 gcc-8.5.0 um randconfig-002 gcc-8.5.0 um randconfig-002-20260601 gcc-14 um randconfig-002-20260601 gcc-15.2.0 um randconfig-002-20260601 gcc-8.5.0 um x86_64_defconfig clang-23 um x86_64_defconfig gcc-14 x86_64 allmodconfig clang-20 x86_64 allnoconfig clang-20 x86_64 allnoconfig clang-23 x86_64 allyesconfig clang-20 x86_64 
buildonly-randconfig-001-20260601 clang-20 x86_64 buildonly-randconfig-001-20260602 gcc-14 x86_64 buildonly-randconfig-002-20260601 clang-20 x86_64 buildonly-randconfig-002-20260602 gcc-14 x86_64 buildonly-randconfig-003-20260601 clang-20 x86_64 buildonly-randconfig-003-20260602 gcc-14 x86_64 buildonly-randconfig-004-20260601 clang-20 x86_64 buildonly-randconfig-004-20260602 gcc-14 x86_64 buildonly-randconfig-005-20260601 clang-20 x86_64 buildonly-randconfig-005-20260602 gcc-14 x86_64 buildonly-randconfig-006-20260601 clang-20 x86_64 buildonly-randconfig-006-20260602 gcc-14 x86_64 defconfig gcc-14 x86_64 kexec clang-20 x86_64 randconfig-001-20260601 clang-20 x86_64 randconfig-001-20260601 gcc-14 x86_64 randconfig-002-20260601 clang-20 x86_64 randconfig-002-20260601 gcc-14 x86_64 randconfig-003-20260601 clang-20 x86_64 randconfig-004-20260601 clang-20 x86_64 randconfig-005-20260601 clang-20 x86_64 randconfig-006-20260601 clang-20 x86_64 randconfig-011-20260601 clang-20 x86_64 randconfig-011-20260602 clang-20 x86_64 randconfig-012-20260601 clang-20 x86_64 randconfig-012-20260602 clang-20 x86_64 randconfig-013-20260601 clang-20 x86_64 randconfig-013-20260602 clang-20 x86_64 randconfig-014-20260601 clang-20 x86_64 randconfig-014-20260602 clang-20 x86_64 randconfig-015-20260601 clang-20 x86_64 randconfig-015-20260602 clang-20 x86_64 randconfig-016-20260601 clang-20 x86_64 randconfig-016-20260602 clang-20 x86_64 randconfig-071 gcc-14 x86_64 randconfig-071-20260601 gcc-14 x86_64 randconfig-071-20260602 clang-20 x86_64 randconfig-072 gcc-14 x86_64 randconfig-072-20260601 gcc-14 x86_64 randconfig-072-20260602 clang-20 x86_64 randconfig-073 gcc-14 x86_64 randconfig-073-20260601 gcc-14 x86_64 randconfig-073-20260602 clang-20 x86_64 randconfig-074 gcc-14 x86_64 randconfig-074-20260601 gcc-14 x86_64 randconfig-074-20260602 clang-20 x86_64 randconfig-075 gcc-14 x86_64 randconfig-075-20260601 gcc-14 x86_64 randconfig-075-20260602 clang-20 x86_64 randconfig-076 gcc-14 x86_64 
randconfig-076-20260601 gcc-14 x86_64 randconfig-076-20260602 clang-20 x86_64 rhel-9.4 clang-20 x86_64 rhel-9.4-bpf gcc-14 x86_64 rhel-9.4-func clang-20 x86_64 rhel-9.4-kselftests clang-20 x86_64 rhel-9.4-kunit gcc-14 x86_64 rhel-9.4-ltp gcc-14 x86_64 rhel-9.4-rust clang-20 xtensa allnoconfig clang-23 xtensa allnoconfig gcc-15.2.0 xtensa allyesconfig clang-23 xtensa allyesconfig gcc-15.2.0 xtensa randconfig-001 gcc-8.5.0 xtensa randconfig-001-20260601 gcc-15.2.0 xtensa randconfig-001-20260601 gcc-8.5.0 xtensa randconfig-002 gcc-8.5.0 xtensa randconfig-002-20260601 gcc-15.2.0 xtensa randconfig-002-20260601 gcc-8.5.0 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki From pasha.tatashin at soleen.com Mon Jun 1 15:55:47 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Mon, 1 Jun 2026 22:55:47 +0000 Subject: [PATCH] kexec_file: skip checksum verification when relocations aren't needed In-Reply-To: <20260601191136.799134-1-mclapinski@google.com> References: <20260601191136.799134-1-mclapinski@google.com> Message-ID: Nit: The crash kernel also does not perform relocations, yet a checksum is still required. The subject should be something like: kexec_file: skip purgatory checksum if all segments are CMA allocated On 06-01 21:11, Michal Clapinski wrote: > Checksum verification is needed > 1. for crash kernels. In a crash, we can't be sure the kernel is > intact. > 2. if we're worried about relocating the kernel into a region used by > some DMA that wasn't properly cancelled. Nit: Please add a little background information about CMA segments being recently added, as well as the necessity for a fast reboot due to the live update use case. > > If we used CMA to allocate segments then > 1. we're not working with a crash kernel. > 2. relocations are not going to happen. > > Therefore, we can safely disable checksum verification. 
> > Instead of adding a new variable to purgatory, just skip adding regions > and save the default value of SHA256 hash. > > Saves ~250ms on my 4.0 GHz CPU. > > Signed-off-by: Michal Clapinski > --- > kernel/kexec_file.c | 19 +++++++++++++++++++ > 1 file changed, 19 insertions(+) > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > index 2bfbb2d144e6..2dc8b0435fe6 100644 > --- a/kernel/kexec_file.c > +++ b/kernel/kexec_file.c > @@ -808,6 +808,7 @@ static int kexec_calculate_store_digests(struct kimage *image) > void *zero_buf; > struct kexec_sha_region *sha_regions; > struct purgatory_info *pi = &image->purgatory_info; > + bool can_skip_checksum = true; > > if (!IS_ENABLED(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY)) > return 0; > @@ -822,6 +823,23 @@ static int kexec_calculate_store_digests(struct kimage *image) > > sha256_init(&sctx); > > + /* > + * If all segments were loaded into contiguous memory, there will be no > + * relocations. In that case there is no risk of memory corruption by > + * uncancelled DMA and we can skip checksum calculation. 
> + */ > + for (i = 0; i < image->nr_segments; i++) { > + if (!image->segment_cma[i]) { > + can_skip_checksum = false; > + break; > + } > + } > + > + if (can_skip_checksum) { > + pr_info("disabling checksum verification in purgatory\n"); > + goto skip_checksum; > + } > + > for (j = i = 0; i < image->nr_segments; i++) { > struct kexec_segment *ksegment; > > @@ -867,6 +885,7 @@ static int kexec_calculate_store_digests(struct kimage *image) > j++; > } > > +skip_checksum: > sha256_final(&sctx, digest); With the few nits: Reviewed-by: Pasha Tatashin > > ret = kexec_purgatory_get_set_symbol(image, "purgatory_sha_regions", > -- > 2.54.0.929.g9b7fa37559-goog > From ltao at redhat.com Mon Jun 1 16:12:05 2026 From: ltao at redhat.com (Tao Liu) Date: Tue, 2 Jun 2026 11:12:05 +1200 Subject: [PATCH v5][makedumpfile 0/9] btf/kallsyms based makedumpfile extension for mm page filtering In-Reply-To: References: <20260414102656.55200-1-ltao@redhat.com> Message-ID: Hi Krister, Thanks a lot for your suggestions and comments! On Sat, May 30, 2026 at 9:11 AM Krister Johansen wrote: > > On Tue, Apr 14, 2026 at 10:26:47PM +1200, Tao Liu wrote: > > A) This patchset will introduce the following features to makedumpfile: > > > > 1) Add .so extension support to makedumpfile > > 2) Enable btf and kallsyms for symbol type and address resolving. > > > > B) The purpose of the features are: > > > > 1) Currently makedumpfile filters mm pages based on page flags, because flags > > can help to determine one page's usage. But this page-flag-checking method > > lacks of flexibility in certain cases, e.g. if we want to filter those mm > > pages occupied by GPU during vmcore dumping due to: > > > > a) GPU may be taking a large memory and contains sensitive data; > > b) GPU mm pages have no relations to kernel crash and useless for vmcore > > analysis. > > > > But there is no GPU mm page specific flags, and apparently we don't need > > to create one just for kdump use.
A programmable filtering tool is more > > suitable for such cases. In addition, different GPU vendors may use > > different ways for mm pages allocating, programmable filtering is better > > than hard coding these GPU specific logics into makedumpfile in this case. > > > > 2) Currently makedumpfile already contains a programmable filtering tool, aka > > eppic script, which allows user to write customized code for data erasing. > > However it has the following drawbacks: > > > > a) cannot do mm page filtering. > > b) need to access to debuginfo of both kernel and modules, which is not > > applicable in the 2nd kernel. > > c) eppic library has memory leaks which are not all resolved [1]. This > > is not acceptable in 2nd kernel. > > > > makedumpfile need to resolve the dwarf data from debuginfo, to get symbols > > types and addresses. In recent kernel there are dwarf alternatives such > > as btf/kallsyms which can be used for this purpose. And btf/kallsyms info > > are already packed within vmcore, so we can use it directly. > > > > With these, this patchset introduces makedumpfile extensions, which is based > > on btf/kallsyms symbol resolving, and is programmable for mm page filtering. > > The following section shows its usage and performance, please note the tests > > are performed in 1st kernel. > > > > 3) Compile and run makedumpfile extensions: > > > > $ make LINKTYPE=dynamic USELZO=on USESNAPPY=on USEZSTD=on EXTENSION=on > > $ make extensions > > I love this idea. Do you have time to take it further, and if not are > you open to making the extension framework more modular so that we could > add others in the future? The purpose of extension is to make the framework modular. My original thought is, we can implement several makedumpfile extensions, each restricted to one specific function. Like one extension deals with AMD gpu mm filtering only, one deals with Intel gpu only etc. 
For distros we can ship all extensions along with makedumpfile once, but the respective extensions will only take effect if the machine has AMD / Intel gpu. This is the same case if you'd like to add other customized functions while the makedumpfile core remains unchanged. > > Could the btf lookups be extended to cover the symbol lookups used by > eppic and the erase filters so that the -x option is unnecessary for > kernels that have BTF support? Yes, from my view it is doable and not difficult to implement. > > The current extension implementation is focused just on skipping pages, > but it would be great to be able to use this to erase data in structures > like the config filters and eppic, but without having to provide a > vmlinux at dump time. What do you think about adding the ability to > use the extensions to also erase parts of data structures, in addition > to filtering whole pages? That's the step 2 for the BTF/kallsyms work of makedumpfile, and I have planned to work on this once the patchset (step 1) is accepted. The reason for the task dividing is, the GPU mm page filtering is more urgent than data erasing from my view. For data erasing, at least we can do the erasing in 1st kernel with the help of dwarf, cumbersome but working; For GPU mm filtering, as far as I know, there are no handy tools in 2nd kernel. I think erasing the data is doable upon the current page filtering code. > > Would you be willing to modify the extension registration options to > allow an extension to specify what kind it is? That way, in the future I'm not sure what you mean by "what kind". Do you mean an extension needs to tell makedumpfile what purpose it is for when loading? > we could register multiple different kinds without breaking existing > ones. One for filtering pages, one for erasing / modifying dump > content, and others based upon whatever additional use cases develop. That's the goal of extensions, each extension deals with its own business.
Could you point out the code that doesn't match the goal? I'm happy to correct it in v6. Thanks, Tao Liu > > Thanks, > > -K > From kjlx at templeofstupid.com Mon Jun 1 17:47:28 2026 From: kjlx at templeofstupid.com (Krister Johansen) Date: Mon, 1 Jun 2026 17:47:28 -0700 Subject: [PATCH v5][makedumpfile 0/9] btf/kallsyms based makedumpfile extension for mm page filtering In-Reply-To: References: <20260414102656.55200-1-ltao@redhat.com> Message-ID: Hi Tao, Thanks for the response! I've put the followups below. On Tue, Jun 02, 2026 at 11:12:05AM +1200, Tao Liu wrote: > On Sat, May 30, 2026 at 9:11 AM Krister Johansen > wrote: > > > > I love this idea. Do you have time to take it further, and if not are > > you open to making the extension framework more modular so that we could > > add others in the future? > > The purpose of extension is to make the framework modular. My original > thought is, we can implement several makedumpfile extensions, each > restricted to one specific function. Like one extension deals with AMD > gpu mm filtering only, one deals with Intel gpu only etc. For distros > we can ship all extensions along with makedumpfile once, but the > respective extensions will only take effect if the machine has AMD / > Intel gpu. This is the same case if you'd like to add other customized > functions while the makedumpfile core remains unchanged. Makes sense. > > Could the btf lookups be extended to cover the symbol lookups used by > > eppic and the erase filters so that the -x option is unnecessary for > > kernels that have BTF support? > > Yes, from my view it is doable and not difficult to implement. In some environments, the size of the vmlinux + modules can be fairly substantial to leave on disk. It's attractive to have the option to omit it and still filter dumps.
> > The current extension implementation is focused just on skipping pages, > > but it would be great to be able to use this to erase data in structures > > like the config filters and eppic, but without having to provide a > > vmlinux at dump time. What do you think about adding the ability to > > use the extensions to also erase parts of data structures, in addition > > to filtering whole pages? > > That's the step 2 for the BTF/kallsyms work of makedumpfile, and I > have planned to work on this once the patchset (step 1) is accepted. The > reason for the task dividing is, the GPU mm page filtering is more > urgent than data erasing from my view. For data erasing, at least we > can do the erasing in 1st kernel with the help of dwarf, cumbersome > but working; For GPU mm filtering, as far as I know, there are no > handy tools in 2nd kernel. Excited to hear that you have something already planned for erasing. My apologies if I missed a more comprehensive write-up about the longer term goals for the work. > I think erasing the data is doable upon the current page filtering code. I wondered about this, but for data-structures that are smaller than a page, wouldn't that mean that we're erasing other content? The "erase" plugins memset the output data to a chosen value (or 0), whereas the filtering just drops the page. Couldn't this also lead to a situation where the debugger can't find the page at all, versus giving us one that's sanitized? (I do understand why you want to drop the pages for the GPU cases) > > Would you be willing to modify the extension registration options to > > allow an extension to specify what kind it is? That way, in the future > > I'm not sure what you mean by "what kind". Do you mean an extension > needs to tell makedumpfile what purpose it is for when loading? Yes, sorry I wasn't clear in writing the question.
Stating this differently, if we want to allow the ability for different extensions to do different things, how do the extensions declare to makedumpfile what they can do, so that it knows where to invoke their callbacks, and what callbacks of theirs to invoke. Looking at patch 6/9, right now run_extension_callback() is invoked from __exclude_unnecessary_pages and always calls the "extension_callback" symbol in the module. This makes sense for a single extension type that's focused on filtering pages. However, if we wanted to have multiple different extensions, this might be more difficult. If we could determine what type of functionality the module implements in load_extensions, then we could tell if this is a page filtering extension, an erase extension, or some other kind of extension. For example, for an erase filter, perhaps we would want two callbacks: one to set up the ranges to filter "extension_gather_callback" and another to actually check the address range to see if it is filtered, "extension_filter_data_callback". I'm not sure about the names. "extension_callback" seems generic, but this has a specific purpose. It's an "extension_filter_page_callback". I may be overengineering this a bit, but having makedumpfile pass an ops vector to the extension in a load function could help here. Then the module's load function fills out the vector with the functions it supports. Depending on what's implemented, these can be placed into different callback lists to get invoked at different points in the program (e.g. one at pfn filter time, another in filter_data_buffer, etc). It sounds like you had a plan here, though. Were you thinking of adding new extension types a different way? > > we could register multiple different kinds without breaking existing > > ones. One for filtering pages, one for erasing / modifying dump > > content, and others based upon whatever additional use cases develop.
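The ops-vector idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: struct ext_ops, ext_load(), ext_should_filter_page(), ext_erase_data() and the two demo extensions are hypothetical names made up for this sketch. None of them come from makedumpfile or from patch 6/9, and a real version would obtain the load function via dlsym() rather than taking it as an argument.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical ops vector: an extension fills in only what it supports. */
struct ext_ops {
	/* Return nonzero if the page at pfn should be dropped from the dump. */
	int (*filter_page)(unsigned long pfn);
	/* Scrub sensitive bytes in buf, which holds dump data starting at paddr. */
	void (*erase_data)(unsigned long long paddr, unsigned char *buf, size_t len);
};

#define MAX_EXT 16

static struct ext_ops loaded[MAX_EXT];
static const struct ext_ops *page_filters[MAX_EXT];
static const struct ext_ops *data_erasers[MAX_EXT];
static size_t nr_loaded, nr_page_filters, nr_data_erasers;

/*
 * Hand a zeroed ops vector to the extension's load function, then sort the
 * extension into one callback list per hook it actually implemented.
 */
int ext_load(void (*load)(struct ext_ops *ops))
{
	struct ext_ops *ops;

	assert(load != NULL);
	if (nr_loaded == MAX_EXT)
		return -1;

	ops = &loaded[nr_loaded++];
	*ops = (struct ext_ops){0};
	load(ops);

	if (ops->filter_page)
		page_filters[nr_page_filters++] = ops;
	if (ops->erase_data)
		data_erasers[nr_data_erasers++] = ops;
	return 0;
}

/* Hook point 1: would be called at pfn filter time. */
int ext_should_filter_page(unsigned long pfn)
{
	for (size_t i = 0; i < nr_page_filters; i++)
		if (page_filters[i]->filter_page(pfn))
			return 1;
	return 0;
}

/* Hook point 2: would be called when a data buffer is about to be written. */
void ext_erase_data(unsigned long long paddr, unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < nr_data_erasers; i++)
		data_erasers[i]->erase_data(paddr, buf, len);
}

/* Demo extension A: drops a made-up pfn range. */
static int demo_filter_page(unsigned long pfn)
{
	return pfn >= 0x1000 && pfn < 0x2000;
}

void demo_filter_load(struct ext_ops *ops)
{
	ops->filter_page = demo_filter_page;
}

/* Demo extension B: zero-fills whatever buffer it is given. */
static void demo_erase_data(unsigned long long paddr, unsigned char *buf, size_t len)
{
	(void)paddr;
	memset(buf, 0, len);
}

void demo_erase_load(struct ext_ops *ops)
{
	ops->erase_data = demo_erase_data;
}
```

With this shape, an extension that only filters pages never shows up in the erase list, and adding a third hook later means adding a field to the vector and a list, without touching extensions that leave the new field NULL.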
> > That's the goal of extensions, each extension deals with its own > business. Could you point out the code that doesn't match the goal? > I'm happy to correct it in v6. Yes, I attempted to elaborate on this in the preceding paragraphs. Basically wondering how we can add new extension functionality without breaking existing extensions, and then get the code to invoke the right one if there are multiple types that need to be used at different times. Thanks, -K From ltao at redhat.com Mon Jun 1 20:04:12 2026 From: ltao at redhat.com (Tao Liu) Date: Tue, 2 Jun 2026 15:04:12 +1200 Subject: [PATCH v5][makedumpfile 0/9] btf/kallsyms based makedumpfile extension for mm page filtering In-Reply-To: References: <20260414102656.55200-1-ltao@redhat.com> Message-ID: Hi Krister, On Tue, Jun 2, 2026 at 12:47 PM Krister Johansen wrote: > > Hi Tao, > Thanks for the response! I've put the followups below. Thanks for your in-depth explanation, it's very helpful to me for designing the data erasing function. > > On Tue, Jun 02, 2026 at 11:12:05AM +1200, Tao Liu wrote: > > On Sat, May 30, 2026 at 9:11 AM Krister Johansen > > wrote: > > > > > > I love this idea. Do you have time to take it further, and if not are > > > you open to making the extension framework more modular so that we could > > > add others in the future? > > > > The purpose of extension is to make the framework modular. My original > > thought is, we can implement several makedumpfile extensions, each > > restricted to one specific function. Like one extension deals with AMD > > gpu mm filtering only, one deals with Intel gpu only etc. For distros > > we can ship all extensions along with makedumpfile once, but the > > respective extensions will only take effect if the machine has AMD / > > Intel gpu. This is the same case if you'd like to add other customized > > functions while the makedumpfile core remains unchanged. > > Makes sense.
> > > > Could the btf lookups be extended to cover the symbol lookups used by > > > eppic and the erase filters so that the -x option is unnecessary for > > > kernels that have BTF support? > > > > Yes, from my view it is doable and not difficult to implement. > > In some environments, the size of the vmlinux + modules can be fairly > substantial to leave on disk. It's attractive to have the option to > omit it and still filter dumps. Yes, I totally agree. > > > > The current extension implementation is focused just on skipping pages, > > > but it would be great to be able to use this to erase data in structures > > > like the config filters and eppic, but without having to provide a > > > vmlinux at dump time. What do you think about adding the ability to > > > use the extensions to also erase parts of data structures, in addition > > > to filtering whole pages? > > > > That's the step 2 for the BTF/kallsyms work of makedumpfile, and I > > have planned to work on this once the patchset (step 1) is accepted. The > > reason for the task dividing is, the GPU mm page filtering is more > > urgent than data erasing from my view. For data erasing, at least we > > can do the erasing in 1st kernel with the help of dwarf, cumbersome > > but working; For GPU mm filtering, as far as I know, there are no > > handy tools in 2nd kernel. > > Excited to hear that you have something already planned for erasing. My > apologies if I missed a more comprehensive write-up about the longer > term goals for the work. No worries, I didn't post the goals upstream; I only had internal discussions within my team regarding the next steps for BTF/kallsyms in makedumpfile. > > > I think erasing the data is doable upon the current page filtering code. > > I wondered about this, but for data-structures that are smaller than a > page, wouldn't that mean that we're erasing other content? The "erase" > plugins memset the output data to a chosen value (or 0), whereas the > filtering just drops the page.
Couldn't this also lead to a situation > where the debugger can't find the page at all, versus giving us one > that's sanitized? (I do understand why you want to drop the pages for > the GPU cases) Frankly, I didn't consider data erasing in as much depth as you did. I think you are right, makedumpfile needs to know which extensions handle data erasing and which handle mm page filtering. I guess the mm page filtering extensions will need to perform a "dry-run" filter first, in case the "data erasing" extensions break any useful data structure. In this step, "dry-run" will only record pfn numbers of the pages that will be filtered. Then "data erasing" extensions are called, so all the sensitive data is memset to 0. Finally, all desired pages are filtered out based on the previous recording. With this, "data erase" and "page filtering" will not interfere with each other. What do you think? > > > > Would you be willing to modify the extension registration options to > > > allow an extension to specify what kind it is? That way, in the future > > > > I'm not sure what you mean by "what kind". Do you mean an extension > > needs to tell makedumpfile what purpose it is for when loading? > > Yes, sorry I wasn't clear in writing the question. Stating this > differently, if we want to allow the ability for different extensions to > do different things, how do the extensions declare to makedumpfile what > they can do, so that it knows where to invoke their callbacks, and what > callbacks of theirs to invoke. > > Looking at patch 6/9, right now run_extension_callback() is invoked > from __exclude_unnecessary_pages and always calls the > "extension_callback" symbol in the module. This makes sense for a > single extension type that's focused on filtering pages. However, if we > wanted to have multiple different extensions, this might be more > difficult.
> > If we could determine what type of functionality the module implements > in load_extensions, then we could tell if this is a page filtering > extension, an erase extension, or some other kind of extension. > > For example, for an erase filter, perhaps we would want two callbacks: > one to set up the ranges to filter "extension_gather_callback" and > another to actually check the address range to see if it is filtered, > "extension_filter_data_callback". > > I'm not sure about the names. "extension_callback" seems generic, but > this has a specific purpose. It's an "extension_filter_page_callback". > > I may be overengineering this a bit, but having makedumpfile pass an ops > vector to the extension in a load function could help here. Then the > module's load function fills out the vector with the functions it > supports. Depending on what's implemented, these can be placed into > different callback lists to get invoked at different points in the > program (e.g. one at pfn filter time, another in filter_data_buffer, > etc). > > It sounds like you had a plan here, though. Were you thinking of adding > new extension types a different way? I see your idea: makedumpfile predefines a few hook points at different stages, and extensions can register their callbacks to these hook points. For now I think two hook points are enough: one for page filtering and the other for registering data erasing, which definitely shouldn't be within __exclude_unnecessary_pages(). I'm willing to modify the code, for example by implementing hook point registration and management. But since I haven't worked on the data erasing functions so far, the design might be superficial. Personally, I'd prefer to do this along with the data erasing functions in the next independent patchset, considering that the current patchset already includes plenty of code/function implementations. @maintainers, what's your opinion? > > > > we could register multiple different kinds without breaking existing > > > ones.
One for filtering pages, one for erasing / modifying dump > > > content, and others based upon whatever additional use cases develop. > > > > That's the goal of extensions, each extension deals with its own > > business. Could you point out the code that doesn't match the goal? > > I'm happy to correct it in v6. > > Yes, I attempted to elaborate on this in the preceding paragraphs. > Basically wondering how we can add new extension functionality without > breaking existing extensions, and then get the code to invoke the right > if there are multiple types that need to be used at different times. Agreed. Thanks, Tao Liu > > Thanks, > > -K > From pasha.tatashin at soleen.com Mon Jun 1 20:17:04 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:04 +0000 Subject: [PATCH v5 00/13] liveupdate: Remove limits on sessions and files Message-ID: <20260602031717.197696-1-pasha.tatashin@soleen.com> Hi all, This series removes the fixed limits on the number of files that can be preserved within a single session, and the total number of sessions managed by the Live Update Orchestrator (LUO). The core of the change is a transition from single contiguous memory blocks for metadata serialization to a chain of linked blocks. This allows LUO to scale dynamically. 1. ABI Evolution: - Introduced linked-block headers for both file and session serialization. - Bumped session ABI version to v4. 2. Memory Management & Security: - Implemented a dynamic block allocation and reuse strategy. Blocks are allocated only when existing ones are exhausted and are reused during session/file removal cycles. - Introduced KHO_MAX_BLOCKS (10000) as a safeguard against stupid excessive allocations or corrupted cyclic lists during restore. 3. Expanded Selftests: - Added new kexec-based tests verifying preservation of 2000 sessions and 500 files per session. - Added self-tests for many sessions and many files management. 
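The block-count safeguard mentioned in item 2 can be illustrated with a small user-space sketch. It is not the series' implementation: struct demo_block, DEMO_MAX_BLOCKS and demo_chain_count() are hypothetical names, the chain uses ordinary pointers where the real ABI would carry physical addresses, and only the 10000 limit is taken from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the KHO_MAX_BLOCKS value quoted in the cover letter. */
#define DEMO_MAX_BLOCKS 10000

/* Hypothetical block header, for illustration only. */
struct demo_block {
	struct demo_block *next;	/* a real ABI would store a physical address */
	uint64_t count;			/* entries used in this block */
};

/*
 * Walk the chain and sum the per-block entry counts.
 * Returns 0 and stores the sum in *total, or -1 (leaving *total untouched)
 * if the chain is longer than DEMO_MAX_BLOCKS, which also catches a
 * corrupted chain that loops back on itself.
 */
int demo_chain_count(const struct demo_block *head, uint64_t *total)
{
	uint64_t sum = 0;
	size_t nr_blocks = 0;

	assert(total != NULL);
	for (const struct demo_block *b = head; b; b = b->next) {
		if (++nr_blocks > DEMO_MAX_BLOCKS)
			return -1;
		sum += b->count;
	}

	*total = sum;
	return 0;
}
```

Bounding the walk by a block count, rather than trying to detect cycles directly, keeps the restore path simple: any chain that does not terminate within the limit is rejected, whatever the cause.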
Tree: git.kernel.org/pub/scm/linux/kernel/git/tatashin/linux.git Branch: luo-remove-max-files-sessions-limits/v5 Changes v5: - Addressed comments from Pratyush: - Renamed kho_block_restore -> kho_block_set_restore, kho_block_destroy -> kho_block_set_destroy. - Renamed block iterator next/read functions to reserve_entry/read_entry. - Added public helpers kho_block_set_head_pa() and kho_block_set_is_empty(). - Added validation to treat zero-count blocks as errors during restoration. - Simplified block iterator reading loop from a while to an if statement. - Changed standard WARN_ON macros to WARN_ON_ONCE on iterator allocation checks, and added warning details. - Simplified session serialization by removing a redundant NULL check on sessions_pa. Please review. Thanks, Pasha Pasha Tatashin (13): liveupdate: change file_set->count type to u64 for type safety liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd liveupdate: centralize state management into struct luo_ser liveupdate: register luo_ser as KHO subtree liveupdate: Extract luo_file_deserialize_one helper liveupdate: Extract luo_session_deserialize_one helper kho: add support for linked-block serialization liveupdate: defer session block allocation and PA setting liveupdate: Remove limit on the number of sessions liveupdate: Remove limit on the number of files per session selftests/liveupdate: Test session and file limit removal selftests/liveupdate: Add stress-sessions kexec test selftests/liveupdate: Add stress-files kexec test Documentation/core-api/kho/abi.rst | 5 + Documentation/core-api/kho/index.rst | 11 + MAINTAINERS | 1 + include/linux/kho/abi/block.h | 56 +++ include/linux/kho/abi/luo.h | 149 ++----- include/linux/kho_block.h | 101 +++++ kernel/liveupdate/Makefile | 1 + kernel/liveupdate/kho_block.c | 390 ++++++++++++++++++ kernel/liveupdate/luo_core.c | 99 ++--- kernel/liveupdate/luo_file.c | 206 ++++----- kernel/liveupdate/luo_flb.c | 65 +-- kernel/liveupdate/luo_internal.h | 
16 +- kernel/liveupdate/luo_session.c | 241 +++++------ tools/testing/selftests/liveupdate/Makefile | 2 + .../testing/selftests/liveupdate/liveupdate.c | 75 ++++ .../selftests/liveupdate/luo_stress_files.c | 97 +++++ .../liveupdate/luo_stress_sessions.c | 102 +++++ .../selftests/liveupdate/luo_test_utils.c | 24 ++ .../selftests/liveupdate/luo_test_utils.h | 2 + 19 files changed, 1184 insertions(+), 459 deletions(-) create mode 100644 include/linux/kho/abi/block.h create mode 100644 include/linux/kho_block.h create mode 100644 kernel/liveupdate/kho_block.c create mode 100644 tools/testing/selftests/liveupdate/luo_stress_files.c create mode 100644 tools/testing/selftests/liveupdate/luo_stress_sessions.c base-commit: 2935777b418d2bfcbfe96705bb2c0fa6c0d94e18 -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:05 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:05 +0000 Subject: [PATCH v5 01/13] liveupdate: change file_set->count type to u64 for type safety In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-2-pasha.tatashin@soleen.com> This improves type safety and aligns the in-memory file_set->count with the serialized count type. It avoids potential truncation or sign conversion mismatch issues. 
Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/luo_internal.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index dd53d4a7277e..ae58206f14ac 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -52,7 +52,7 @@ static inline int luo_ucmd_respond(struct luo_ucmd *ucmd, struct luo_file_set { struct list_head files_list; struct luo_file_ser *files; - long count; + u64 count; }; /** -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:06 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:06 +0000 Subject: [PATCH v5 02/13] liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-3-pasha.tatashin@soleen.com> Refactor luo_session_retrieve_fd() to avoid mixing automated cleanup-style guards with goto-based resource release, which is not recommended under the Linux kernel coding style.
Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/luo_session.c | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 5c6cebc6e326..7b2f9cbabb05 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -291,25 +291,24 @@ static int luo_session_retrieve_fd(struct luo_session *session, if (argp->fd < 0) return argp->fd; - guard(mutex)(&session->mutex); - err = luo_retrieve_file(&session->file_set, argp->token, &file); - if (err < 0) - goto err_put_fd; + scoped_guard(mutex, &session->mutex) { + err = luo_retrieve_file(&session->file_set, argp->token, &file); + if (err < 0) { + put_unused_fd(argp->fd); + return err; + } + } err = luo_ucmd_respond(ucmd, sizeof(*argp)); - if (err) - goto err_put_file; + if (err) { + fput(file); + put_unused_fd(argp->fd); + return err; + } fd_install(argp->fd, file); return 0; - -err_put_file: - fput(file); -err_put_fd: - put_unused_fd(argp->fd); - - return err; } static int luo_session_finish(struct luo_session *session, -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:07 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:07 +0000 Subject: [PATCH v5 03/13] liveupdate: centralize state management into struct luo_ser In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-4-pasha.tatashin@soleen.com> Transition the LUO to ABI v2, which centralizes state management into a single struct luo_ser header. Previously, LUO state was spread across multiple FDT properties and subnodes. ABI v2 simplifies this by placing all core state, including the liveupdate number and physical addresses for sessions and FLB headers into a centralized struct luo_ser. 
Note that this change introduces a semantic difference: the sessions and FLB serialization formats are no longer completely independent of the core LUO. Their metadata (such as physical addresses for sessions and FLB headers) is now coupled to and managed via the centralized struct luo_ser. Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 91 +++++++++++--------------------- kernel/liveupdate/luo_core.c | 64 +++++++++++++++------- kernel/liveupdate/luo_flb.c | 65 +++-------------------- kernel/liveupdate/luo_internal.h | 8 +-- kernel/liveupdate/luo_session.c | 64 ++++------------------ 5 files changed, 98 insertions(+), 194 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 46750a0ddf88..1b2f865a771a 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -30,52 +30,25 @@ * .. code-block:: none * * / { - * compatible = "luo-v1"; - * liveupdate-number = <...>; - * - * luo-session { - * compatible = "luo-session-v1"; - * luo-session-header = ; - * }; - * - * luo-flb { - * compatible = "luo-flb-v1"; - * luo-flb-header = ; - * }; + * compatible = "luo-v2"; + * luo-abi-header = ; * }; * * Main LUO Node (/): * - * - compatible: "luo-v1" + * - compatible: "luo-v2" * Identifies the overall LUO ABI version. - * - liveupdate-number: u64 - * A counter tracking the number of successful live updates performed. - * - * Session Node (luo-session): - * This node describes all preserved user-space sessions. - * - * - compatible: "luo-session-v1" - * Identifies the session ABI version. - * - luo-session-header: u64 - * The physical address of a `struct luo_session_header_ser`. This structure - * is the header for a contiguous block of memory containing an array of - * `struct luo_session_ser`, one for each preserved session. 
- * - * File-Lifecycle-Bound Node (luo-flb): - * This node describes all preserved global objects whose lifecycle is bound - * to that of the preserved files (e.g., shared IOMMU state). - * - * - compatible: "luo-flb-v1" - * Identifies the FLB ABI version. - * - luo-flb-header: u64 - * The physical address of a `struct luo_flb_header_ser`. This structure is - * the header for a contiguous block of memory containing an array of - * `struct luo_flb_ser`, one for each preserved global object. + * - luo-abi-header: u64 + * The physical address of `struct luo_ser`. * * Serialization Structures: * The FDT properties point to memory regions containing arrays of simple, * `__packed` structures. These structures contain the actual preserved state. * + * - struct luo_ser: + * The central ABI structure that contains the overall state of the LUO. + * It includes the liveupdate-number and pointers to sessions and FLBs. + * * - struct luo_session_header_ser: * Header for the session array. Contains the total page count of the * preserved memory block and the number of `struct luo_session_ser` @@ -109,13 +82,26 @@ /* * The LUO FDT hooks all LUO state for sessions, fds, etc. - * In the root it also carries "liveupdate-number" 64-bit property that - * corresponds to the number of live-updates performed on this machine. */ #define LUO_FDT_SIZE PAGE_SIZE #define LUO_FDT_KHO_ENTRY_NAME "LUO" -#define LUO_FDT_COMPATIBLE "luo-v1" -#define LUO_FDT_LIVEUPDATE_NUM "liveupdate-number" +#define LUO_FDT_COMPATIBLE "luo-v2" +#define LUO_FDT_ABI_HEADER "luo-abi-header" + +/** + * struct luo_ser - Centralized LUO ABI header. + * @liveupdate_num: A counter tracking the number of successful live updates. + * @sessions_pa: Physical address of the first session block header. + * @flbs_pa: Physical address of the FLB header. + * + * This structure is the root of all preserved LUO state. It is pointed to by + * the "luo-abi-header" property in the LUO FDT. 
+ */ +struct luo_ser { + u64 liveupdate_num; + u64 sessions_pa; + u64 flbs_pa; +} __packed; #define LIVEUPDATE_HNDL_COMPAT_LENGTH 48 @@ -147,15 +133,6 @@ struct luo_file_set_ser { u64 count; } __packed; -/* - * LUO FDT session node - * LUO_FDT_SESSION_HEADER: is a u64 physical address of struct - * luo_session_header_ser - */ -#define LUO_FDT_SESSION_NODE_NAME "luo-session" -#define LUO_FDT_SESSION_COMPATIBLE "luo-session-v2" -#define LUO_FDT_SESSION_HEADER "luo-session-header" - /** * struct luo_session_header_ser - Header for the serialized session data block. * @count: The number of `struct luo_session_ser` entries that immediately @@ -165,7 +142,7 @@ struct luo_file_set_ser { * physical memory preserved across the kexec. It provides the necessary * metadata to interpret the array of session entries that follow. * - * If this structure is modified, `LUO_FDT_SESSION_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_session_header_ser { u64 count; @@ -182,7 +159,7 @@ struct luo_session_header_ser { * session) is created and passed to the new kernel, allowing it to reconstruct * the session context. * - * If this structure is modified, `LUO_FDT_SESSION_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_session_ser { char name[LIVEUPDATE_SESSION_NAME_LENGTH]; @@ -192,10 +169,6 @@ struct luo_session_ser { /* The max size is set so it can be reliably used during in serialization */ #define LIVEUPDATE_FLB_COMPAT_LENGTH 48 -#define LUO_FDT_FLB_NODE_NAME "luo-flb" -#define LUO_FDT_FLB_COMPATIBLE "luo-flb-v1" -#define LUO_FDT_FLB_HEADER "luo-flb-header" - /** * struct luo_flb_header_ser - Header for the serialized FLB data block. * @pgcnt: The total number of pages occupied by the entire preserved memory @@ -205,11 +178,9 @@ struct luo_session_ser { * in the memory block. 
* * This structure is located at the physical address specified by the - * `LUO_FDT_FLB_HEADER` FDT property. It provides the new kernel with the - * necessary information to find and iterate over the array of preserved - * File-Lifecycle-Bound objects and to manage the underlying memory. + * flbs_pa in luo_ser. * - * If this structure is modified, LUO_FDT_FLB_COMPATIBLE must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_flb_header_ser { u64 pgcnt; @@ -231,7 +202,7 @@ struct luo_flb_header_ser { * passed to the new kernel. Each entry allows the LUO core to restore one * global, shared object. * - * If this structure is modified, LUO_FDT_FLB_COMPATIBLE must be updated. + * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. */ struct luo_flb_ser { char name[LIVEUPDATE_FLB_COMPAT_LENGTH]; diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 5d5827ced73c..085c0dfc1ef1 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -61,7 +61,6 @@ #include #include #include -#include #include "kexec_handover_internal.h" #include "luo_internal.h" @@ -86,9 +85,11 @@ early_param("liveupdate", early_liveupdate_param); static int __init luo_early_startup(void) { + struct luo_ser *luo_ser; + int err, header_size; phys_addr_t fdt_phys; - int err, ln_size; const void *ptr; + u64 luo_ser_pa; if (!kho_is_enabled()) { if (liveupdate_enabled()) @@ -119,26 +120,32 @@ static int __init luo_early_startup(void) return -EINVAL; } - ln_size = 0; - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_LIVEUPDATE_NUM, - &ln_size); - if (!ptr || ln_size != sizeof(luo_global.liveupdate_num)) { - pr_err("Unable to get live update number '%s' [%d]\n", - LUO_FDT_LIVEUPDATE_NUM, ln_size); + header_size = 0; + ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); + if (!ptr || header_size != sizeof(u64)) { + pr_err("Unable to get ABI header '%s' [%d]\n", + 
LUO_FDT_ABI_HEADER, header_size); return -EINVAL; } - luo_global.liveupdate_num = get_unaligned((u64 *)ptr); + luo_ser_pa = get_unaligned((u64 *)ptr); + luo_ser = phys_to_virt(luo_ser_pa); + + luo_global.liveupdate_num = luo_ser->liveupdate_num; pr_info("Retrieved live update data, liveupdate number: %lld\n", luo_global.liveupdate_num); - err = luo_session_setup_incoming(luo_global.fdt_in); + err = luo_session_setup_incoming(luo_ser->sessions_pa); if (err) - return err; + goto out_free_ser; + + luo_flb_setup_incoming(luo_ser->flbs_pa); - err = luo_flb_setup_incoming(luo_global.fdt_in); + err = 0; +out_free_ser: + kho_restore_free(luo_ser); return err; } @@ -160,7 +167,8 @@ early_initcall(liveupdate_early_init); /* Called during boot to create outgoing LUO fdt tree */ static int __init luo_fdt_setup(void) { - const u64 ln = luo_global.liveupdate_num + 1; + struct luo_ser *luo_ser; + u64 luo_ser_pa; void *fdt_out; int err; @@ -170,27 +178,45 @@ static int __init luo_fdt_setup(void) return PTR_ERR(fdt_out); } + luo_ser = kho_alloc_preserve(sizeof(*luo_ser)); + if (IS_ERR(luo_ser)) { + err = PTR_ERR(luo_ser); + goto exit_free_fdt; + } + luo_ser_pa = virt_to_phys(luo_ser); + err = fdt_create(fdt_out, LUO_FDT_SIZE); err |= fdt_finish_reservemap(fdt_out); err |= fdt_begin_node(fdt_out, ""); err |= fdt_property_string(fdt_out, "compatible", LUO_FDT_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_LIVEUPDATE_NUM, &ln, sizeof(ln)); - err |= luo_session_setup_outgoing(fdt_out); - err |= luo_flb_setup_outgoing(fdt_out); + err |= fdt_property(fdt_out, LUO_FDT_ABI_HEADER, &luo_ser_pa, + sizeof(luo_ser_pa)); err |= fdt_end_node(fdt_out); err |= fdt_finish(fdt_out); if (err) - goto exit_free; + goto exit_free_luo_ser; + + err = luo_session_setup_outgoing(&luo_ser->sessions_pa); + if (err) + goto exit_free_luo_ser; + + err = luo_flb_setup_outgoing(&luo_ser->flbs_pa); + if (err) + goto exit_free_luo_ser; + + luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; err = 
kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out, fdt_totalsize(fdt_out)); if (err) - goto exit_free; + goto exit_free_luo_ser; luo_global.fdt_out = fdt_out; return 0; -exit_free: +exit_free_luo_ser: + kho_unpreserve_free(luo_ser); +exit_free_fdt: kho_unpreserve_free(fdt_out); pr_err("failed to prepare LUO FDT: %d\n", err); diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c index 8f5c5dd01cd0..c8dd30b41238 100644 --- a/kernel/liveupdate/luo_flb.c +++ b/kernel/liveupdate/luo_flb.c @@ -44,13 +44,11 @@ #include #include #include -#include #include #include #include #include #include -#include #include "luo_internal.h" #define LUO_FLB_PGCNT 1ul @@ -551,27 +549,15 @@ int liveupdate_flb_get_outgoing(struct liveupdate_flb *flb, void **objp) return 0; } -int __init luo_flb_setup_outgoing(void *fdt_out) +int __init luo_flb_setup_outgoing(u64 *flbs_pa) { struct luo_flb_header_ser *header_ser; - u64 header_ser_pa; - int err; header_ser = kho_alloc_preserve(LUO_FLB_PGCNT << PAGE_SHIFT); if (IS_ERR(header_ser)) return PTR_ERR(header_ser); - header_ser_pa = virt_to_phys(header_ser); - - err = fdt_begin_node(fdt_out, LUO_FDT_FLB_NODE_NAME); - err |= fdt_property_string(fdt_out, "compatible", - LUO_FDT_FLB_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_FLB_HEADER, &header_ser_pa, - sizeof(header_ser_pa)); - err |= fdt_end_node(fdt_out); - - if (err) - goto err_unpreserve; + *flbs_pa = virt_to_phys(header_ser); header_ser->pgcnt = LUO_FLB_PGCNT; luo_flb_global.outgoing.header_ser = header_ser; @@ -579,53 +565,18 @@ int __init luo_flb_setup_outgoing(void *fdt_out) luo_flb_global.outgoing.active = true; return 0; - -err_unpreserve: - kho_unpreserve_free(header_ser); - - return err; } -int __init luo_flb_setup_incoming(void *fdt_in) +void __init luo_flb_setup_incoming(u64 flbs_pa) { struct luo_flb_header_ser *header_ser; - int err, header_size, offset; - const void *ptr; - u64 header_ser_pa; - offset = fdt_subnode_offset(fdt_in, 0, LUO_FDT_FLB_NODE_NAME); - if 
(offset < 0) { - pr_err("Unable to get FLB node [%s]\n", LUO_FDT_FLB_NODE_NAME); - - return -ENOENT; + if (flbs_pa) { + header_ser = phys_to_virt(flbs_pa); + luo_flb_global.incoming.header_ser = header_ser; + luo_flb_global.incoming.ser = (void *)(header_ser + 1); + luo_flb_global.incoming.active = true; } - - err = fdt_node_check_compatible(fdt_in, offset, - LUO_FDT_FLB_COMPATIBLE); - if (err) { - pr_err("FLB node is incompatible with '%s' [%d]\n", - LUO_FDT_FLB_COMPATIBLE, err); - - return -EINVAL; - } - - header_size = 0; - ptr = fdt_getprop(fdt_in, offset, LUO_FDT_FLB_HEADER, &header_size); - if (!ptr || header_size != sizeof(u64)) { - pr_err("Unable to get FLB header property '%s' [%d]\n", - LUO_FDT_FLB_HEADER, header_size); - - return -EINVAL; - } - - header_ser_pa = get_unaligned((u64 *)ptr); - header_ser = phys_to_virt(header_ser_pa); - - luo_flb_global.incoming.header_ser = header_ser; - luo_flb_global.incoming.ser = (void *)(header_ser + 1); - luo_flb_global.incoming.active = true; - - return 0; } /** diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index ae58206f14ac..fe22086bfbeb 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -79,8 +79,8 @@ extern struct rw_semaphore luo_register_rwlock; int luo_session_create(const char *name, struct file **filep); int luo_session_retrieve(const char *name, struct file **filep); -int __init luo_session_setup_outgoing(void *fdt); -int __init luo_session_setup_incoming(void *fdt); +int __init luo_session_setup_outgoing(u64 *sessions_pa); +int __init luo_session_setup_incoming(u64 sessions_pa); int luo_session_serialize(void); int luo_session_deserialize(void); @@ -102,8 +102,8 @@ int luo_flb_file_preserve(struct liveupdate_file_handler *fh); void luo_flb_file_unpreserve(struct liveupdate_file_handler *fh); void luo_flb_file_finish(struct liveupdate_file_handler *fh); void luo_flb_unregister_all(struct liveupdate_file_handler *fh); -int __init 
luo_flb_setup_outgoing(void *fdt); -int __init luo_flb_setup_incoming(void *fdt); +int __init luo_flb_setup_outgoing(u64 *flbs_pa); +void __init luo_flb_setup_incoming(u64 flbs_pa); void luo_flb_serialize(void); #ifdef CONFIG_LIVEUPDATE_TEST diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 7b2f9cbabb05..3b255ffd1bf1 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -25,9 +25,8 @@ * * - Serialization: Session metadata is preserved using the KHO framework. When * a live update is triggered via kexec, an array of `struct luo_session_ser` - * is populated and placed in a preserved memory region. An FDT node is also - * created, containing the count of sessions and the physical address of this - * array. + * is populated and placed in a preserved memory region. The physical address + * of this array is stored in the centralized `struct luo_ser` structure. * * Session Lifecycle: * @@ -91,13 +90,11 @@ #include #include #include -#include #include #include #include #include #include -#include #include #include "luo_internal.h" @@ -525,75 +522,34 @@ int luo_session_retrieve(const char *name, struct file **filep) return err; } -int __init luo_session_setup_outgoing(void *fdt_out) +int __init luo_session_setup_outgoing(u64 *sessions_pa) { struct luo_session_header_ser *header_ser; - u64 header_ser_pa; - int err; header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); if (IS_ERR(header_ser)) return PTR_ERR(header_ser); - header_ser_pa = virt_to_phys(header_ser); - - err = fdt_begin_node(fdt_out, LUO_FDT_SESSION_NODE_NAME); - err |= fdt_property_string(fdt_out, "compatible", - LUO_FDT_SESSION_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_SESSION_HEADER, &header_ser_pa, - sizeof(header_ser_pa)); - err |= fdt_end_node(fdt_out); - if (err) - goto err_unpreserve; + *sessions_pa = virt_to_phys(header_ser); luo_session_global.outgoing.header_ser = header_ser; luo_session_global.outgoing.ser = 
(void *)(header_ser + 1); luo_session_global.outgoing.active = true; return 0; - -err_unpreserve: - kho_unpreserve_free(header_ser); - return err; } -int __init luo_session_setup_incoming(void *fdt_in) +int __init luo_session_setup_incoming(u64 sessions_pa) { struct luo_session_header_ser *header_ser; - int err, header_size, offset; - u64 header_ser_pa; - const void *ptr; - - offset = fdt_subnode_offset(fdt_in, 0, LUO_FDT_SESSION_NODE_NAME); - if (offset < 0) { - pr_err("Unable to get session node: [%s]\n", - LUO_FDT_SESSION_NODE_NAME); - return -EINVAL; - } - err = fdt_node_check_compatible(fdt_in, offset, - LUO_FDT_SESSION_COMPATIBLE); - if (err) { - pr_err("Session node incompatible [%s]\n", - LUO_FDT_SESSION_COMPATIBLE); - return -EINVAL; + if (sessions_pa) { + header_ser = phys_to_virt(sessions_pa); + luo_session_global.incoming.header_ser = header_ser; + luo_session_global.incoming.ser = (void *)(header_ser + 1); + luo_session_global.incoming.active = true; } - header_size = 0; - ptr = fdt_getprop(fdt_in, offset, LUO_FDT_SESSION_HEADER, &header_size); - if (!ptr || header_size != sizeof(u64)) { - pr_err("Unable to get session header '%s' [%d]\n", - LUO_FDT_SESSION_HEADER, header_size); - return -EINVAL; - } - - header_ser_pa = get_unaligned((u64 *)ptr); - header_ser = phys_to_virt(header_ser_pa); - - luo_session_global.incoming.header_ser = header_ser; - luo_session_global.incoming.ser = (void *)(header_ser + 1); - luo_session_global.incoming.active = true; - return 0; } -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:08 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:08 +0000 Subject: [PATCH v5 04/13] liveupdate: register luo_ser as KHO subtree In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-5-pasha.tatashin@soleen.com> Entirely remove the LUO FDT wrapper since the FDT only carries the compatible 
string and the pointer to the centralized struct luo_ser. Instead, register the struct luo_ser via the KHO raw subtree API, placing the compatibility string inside the structure itself. Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 57 +++++++++--------------- kernel/liveupdate/luo_core.c | 85 +++++++++++------------------------- 2 files changed, 46 insertions(+), 96 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 1b2f865a771a..9a4fe491812b 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -10,11 +10,11 @@ * * Live Update Orchestrator uses the stable Application Binary Interface * defined below to pass state from a pre-update kernel to a post-update - * kernel. The ABI is built upon the Kexec HandOver framework and uses a - * Flattened Device Tree to describe the preserved data. + * kernel. The ABI is built upon the Kexec HandOver framework and registers + * the central `struct luo_ser` via the KHO raw subtree API. * - * This interface is a contract. Any modification to the FDT structure, node - * properties, compatible strings, or the layout of the `__packed` serialization + * This interface is a contract. Any modification to the structure fields, + * compatible strings, or the layout of the `__packed` serialization * structures defined here constitutes a breaking change. Such changes require * incrementing the version number in the relevant `_COMPATIBLE` string to * prevent a new kernel from misinterpreting data from an old kernel. @@ -23,31 +23,15 @@ * however, backward/forward compatibility is only guaranteed for kernels * supporting the same ABI version. * - * FDT Structure Overview: + * KHO Structure Overview: * The entire LUO state is encapsulated within a single KHO entry named "LUO". - * This entry contains an FDT with the following layout: - * - * .. 
code-block:: none - * - * / { - * compatible = "luo-v2"; - * luo-abi-header = ; - * }; - * - * Main LUO Node (/): - * - * - compatible: "luo-v2" - * Identifies the overall LUO ABI version. - * - luo-abi-header: u64 - * The physical address of `struct luo_ser`. + * This entry contains the `struct luo_ser` structure. * * Serialization Structures: - * The FDT properties point to memory regions containing arrays of simple, - * `__packed` structures. These structures contain the actual preserved state. - * * - struct luo_ser: * The central ABI structure that contains the overall state of the LUO. - * It includes the liveupdate-number and pointers to sessions and FLBs. + * It includes the compatibility string, the liveupdate-number, and pointers + * to sessions and FLBs. * * - struct luo_session_header_ser: * Header for the session array. Contains the total page count of the @@ -78,26 +62,27 @@ #ifndef _LINUX_KHO_ABI_LUO_H #define _LINUX_KHO_ABI_LUO_H +#include #include /* - * The LUO FDT hooks all LUO state for sessions, fds, etc. + * The LUO state is registered under this KHO entry name. */ -#define LUO_FDT_SIZE PAGE_SIZE -#define LUO_FDT_KHO_ENTRY_NAME "LUO" -#define LUO_FDT_COMPATIBLE "luo-v2" -#define LUO_FDT_ABI_HEADER "luo-abi-header" +#define LUO_KHO_ENTRY_NAME "LUO" +#define LUO_ABI_COMPATIBLE "luo-v3" +#define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) /** * struct luo_ser - Centralized LUO ABI header. + * @compatible: Compatibility string identifying the LUO ABI version. * @liveupdate_num: A counter tracking the number of successful live updates. * @sessions_pa: Physical address of the first session block header. * @flbs_pa: Physical address of the FLB header. * - * This structure is the root of all preserved LUO state. It is pointed to by - * the "luo-abi-header" property in the LUO FDT. + * This structure is the root of all preserved LUO state. 
*/ struct luo_ser { + char compatible[LUO_ABI_COMPAT_LEN]; u64 liveupdate_num; u64 sessions_pa; u64 flbs_pa; @@ -111,7 +96,7 @@ struct luo_ser { * @data: Private data * @token: User provided token for this file * - * If this structure is modified, LUO_SESSION_COMPATIBLE must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_file_ser { char compatible[LIVEUPDATE_HNDL_COMPAT_LENGTH]; @@ -142,7 +127,7 @@ struct luo_file_set_ser { * physical memory preserved across the kexec. It provides the necessary * metadata to interpret the array of session entries that follow. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_session_header_ser { u64 count; @@ -159,7 +144,7 @@ struct luo_session_header_ser { * session) is created and passed to the new kernel, allowing it to reconstruct * the session context. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_session_ser { char name[LIVEUPDATE_SESSION_NAME_LENGTH]; @@ -180,7 +165,7 @@ struct luo_session_ser { * This structure is located at the physical address specified by the * flbs_pa in luo_ser. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. */ struct luo_flb_header_ser { u64 pgcnt; @@ -202,7 +187,7 @@ struct luo_flb_header_ser { * passed to the new kernel. Each entry allows the LUO core to restore one * global, shared object. * - * If this structure is modified, `LUO_FDT_COMPATIBLE` must be updated. + * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. 
*/ struct luo_flb_ser { char name[LIVEUPDATE_FLB_COMPAT_LENGTH]; diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 085c0dfc1ef1..69b00e7d0f8f 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -54,7 +54,6 @@ #include #include #include -#include #include #include #include @@ -67,8 +66,7 @@ static struct { bool enabled; - void *fdt_out; - void *fdt_in; + struct luo_ser *luo_ser_out; u64 liveupdate_num; } luo_global; @@ -85,11 +83,10 @@ early_param("liveupdate", early_liveupdate_param); static int __init luo_early_startup(void) { + phys_addr_t luo_ser_phys; struct luo_ser *luo_ser; - int err, header_size; - phys_addr_t fdt_phys; - const void *ptr; - u64 luo_ser_pa; + size_t len; + int err; if (!kho_is_enabled()) { if (liveupdate_enabled()) @@ -98,40 +95,29 @@ static int __init luo_early_startup(void) return 0; } - /* Retrieve LUO subtree, and verify its format. */ - err = kho_retrieve_subtree(LUO_FDT_KHO_ENTRY_NAME, &fdt_phys, NULL); + /* Retrieve LUO state from KHO. 
*/ + err = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &luo_ser_phys, &len); if (err) { if (err != -ENOENT) { - pr_err("failed to retrieve FDT '%s' from KHO: %pe\n", - LUO_FDT_KHO_ENTRY_NAME, ERR_PTR(err)); + pr_err("failed to retrieve LUO state '%s' from KHO: %pe\n", + LUO_KHO_ENTRY_NAME, ERR_PTR(err)); return err; } return 0; } - luo_global.fdt_in = phys_to_virt(fdt_phys); - err = fdt_node_check_compatible(luo_global.fdt_in, 0, - LUO_FDT_COMPATIBLE); - if (err) { - pr_err("FDT '%s' is incompatible with '%s' [%d]\n", - LUO_FDT_KHO_ENTRY_NAME, LUO_FDT_COMPATIBLE, err); - + if (len < sizeof(*luo_ser)) { + pr_err("LUO state is too small (%zu < %zu)\n", len, sizeof(*luo_ser)); return -EINVAL; } - header_size = 0; - ptr = fdt_getprop(luo_global.fdt_in, 0, LUO_FDT_ABI_HEADER, &header_size); - if (!ptr || header_size != sizeof(u64)) { - pr_err("Unable to get ABI header '%s' [%d]\n", - LUO_FDT_ABI_HEADER, header_size); - + luo_ser = phys_to_virt(luo_ser_phys); + if (strncmp(luo_ser->compatible, LUO_ABI_COMPATIBLE, LUO_ABI_COMPAT_LEN)) { + pr_err("LUO state is incompatible with '%s'\n", LUO_ABI_COMPATIBLE); return -EINVAL; } - luo_ser_pa = get_unaligned((u64 *)ptr); - luo_ser = phys_to_virt(luo_ser_pa); - luo_global.liveupdate_num = luo_ser->liveupdate_num; pr_info("Retrieved live update data, liveupdate number: %lld\n", luo_global.liveupdate_num); @@ -164,37 +150,20 @@ static int __init liveupdate_early_init(void) } early_initcall(liveupdate_early_init); -/* Called during boot to create outgoing LUO fdt tree */ -static int __init luo_fdt_setup(void) +/* Called during boot to create outgoing LUO state */ +static int __init luo_state_setup(void) { struct luo_ser *luo_ser; - u64 luo_ser_pa; - void *fdt_out; int err; - fdt_out = kho_alloc_preserve(LUO_FDT_SIZE); - if (IS_ERR(fdt_out)) { - pr_err("failed to allocate/preserve FDT memory\n"); - return PTR_ERR(fdt_out); - } - luo_ser = kho_alloc_preserve(sizeof(*luo_ser)); if (IS_ERR(luo_ser)) { - err = PTR_ERR(luo_ser); - goto 
exit_free_fdt; + pr_err("failed to allocate/preserve LUO state memory\n"); + return PTR_ERR(luo_ser); } - luo_ser_pa = virt_to_phys(luo_ser); - - err = fdt_create(fdt_out, LUO_FDT_SIZE); - err |= fdt_finish_reservemap(fdt_out); - err |= fdt_begin_node(fdt_out, ""); - err |= fdt_property_string(fdt_out, "compatible", LUO_FDT_COMPATIBLE); - err |= fdt_property(fdt_out, LUO_FDT_ABI_HEADER, &luo_ser_pa, - sizeof(luo_ser_pa)); - err |= fdt_end_node(fdt_out); - err |= fdt_finish(fdt_out); - if (err) - goto exit_free_luo_ser; + + strscpy(luo_ser->compatible, LUO_ABI_COMPATIBLE, sizeof(luo_ser->compatible)); + luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; err = luo_session_setup_outgoing(&luo_ser->sessions_pa); if (err) @@ -204,21 +173,17 @@ static int __init luo_fdt_setup(void) if (err) goto exit_free_luo_ser; - luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; - - err = kho_add_subtree(LUO_FDT_KHO_ENTRY_NAME, fdt_out, - fdt_totalsize(fdt_out)); + err = kho_add_subtree(LUO_KHO_ENTRY_NAME, luo_ser, sizeof(*luo_ser)); if (err) goto exit_free_luo_ser; - luo_global.fdt_out = fdt_out; + + luo_global.luo_ser_out = luo_ser; return 0; exit_free_luo_ser: kho_unpreserve_free(luo_ser); -exit_free_fdt: - kho_unpreserve_free(fdt_out); - pr_err("failed to prepare LUO FDT: %d\n", err); + pr_err("failed to prepare LUO state: %d\n", err); return err; } @@ -234,7 +199,7 @@ static int __init luo_late_startup(void) if (!liveupdate_enabled()) return 0; - err = luo_fdt_setup(); + err = luo_state_setup(); if (err) luo_global.enabled = false; -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:09 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:09 +0000 Subject: [PATCH v5 05/13] liveupdate: Extract luo_file_deserialize_one helper In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-6-pasha.tatashin@soleen.com> Extract the 
logic for deserializing single entries for files into a separate helper function, in preparation for linked-block serialization of files. This is pure code movement; no other changes are intended. Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/luo_file.c | 77 ++++++++++++++++++++---------------- 1 file changed, 44 insertions(+), 33 deletions(-) diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 208987502f73..9eec07a9e9fc 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -753,6 +753,46 @@ int luo_file_finish(struct luo_file_set *file_set) return 0; } +static int luo_file_deserialize_one(struct luo_file_set *file_set, + struct luo_file_ser *ser) +{ + struct liveupdate_file_handler *fh; + bool handler_found = false; + struct luo_file *luo_file; + + down_read(&luo_register_rwlock); + list_private_for_each_entry(fh, &luo_file_handler_list, list) { + if (!strcmp(fh->compatible, ser->compatible)) { + if (try_module_get(fh->ops->owner)) + handler_found = true; + break; + } + } + up_read(&luo_register_rwlock); + + if (!handler_found) { + pr_warn("No registered handler for compatible '%.*s'\n", + (int)sizeof(ser->compatible), + ser->compatible); + return -ENOENT; + } + + luo_file = kzalloc_obj(*luo_file); + if (!luo_file) { + module_put(fh->ops->owner); + return -ENOMEM; + } + + luo_file->fh = fh; + luo_file->file = NULL; + luo_file->serialized_data = ser->data; + luo_file->token = ser->token; + mutex_init(&luo_file->mutex); + list_add_tail(&luo_file->list, &file_set->files_list); + + return 0; +} + /** * luo_file_deserialize - Reconstructs the list of preserved files in the new kernel. * @file_set: The incoming file_set to fill with deserialized data.
@@ -782,6 +822,7 @@ int luo_file_deserialize(struct luo_file_set *file_set, struct luo_file_set_ser *file_set_ser) { struct luo_file_ser *file_ser; + int err; u64 i; if (!file_set_ser->files) { @@ -809,39 +850,9 @@ int luo_file_deserialize(struct luo_file_set *file_set, */ file_ser = file_set->files; for (i = 0; i < file_set->count; i++) { - struct liveupdate_file_handler *fh; - bool handler_found = false; - struct luo_file *luo_file; - - down_read(&luo_register_rwlock); - list_private_for_each_entry(fh, &luo_file_handler_list, list) { - if (!strcmp(fh->compatible, file_ser[i].compatible)) { - if (try_module_get(fh->ops->owner)) - handler_found = true; - break; - } - } - up_read(&luo_register_rwlock); - - if (!handler_found) { - pr_warn("No registered handler for compatible '%.*s'\n", - (int)sizeof(file_ser[i].compatible), - file_ser[i].compatible); - return -ENOENT; - } - - luo_file = kzalloc_obj(*luo_file); - if (!luo_file) { - module_put(fh->ops->owner); - return -ENOMEM; - } - - luo_file->fh = fh; - luo_file->file = NULL; - luo_file->serialized_data = file_ser[i].data; - luo_file->token = file_ser[i].token; - mutex_init(&luo_file->mutex); - list_add_tail(&luo_file->list, &file_set->files_list); + err = luo_file_deserialize_one(file_set, &file_ser[i]); + if (err) + return err; } return 0; -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:10 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:10 +0000 Subject: [PATCH v5 06/13] liveupdate: Extract luo_session_deserialize_one helper In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-7-pasha.tatashin@soleen.com> Extract the logic for deserializing single entries for sessions into a separate helper function, in preparation for linked-block serialization of sessions. This is pure code movement; no other changes are intended.
Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/luo_session.c | 63 +++++++++++++++++++-------------- 1 file changed, 36 insertions(+), 27 deletions(-) diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 3b255ffd1bf1..9f72a8b0a9a8 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -553,6 +553,40 @@ int __init luo_session_setup_incoming(u64 sessions_pa) return 0; } +static int luo_session_deserialize_one(struct luo_session_header *sh, + struct luo_session_ser *ser) +{ + struct luo_session *session; + int err; + + session = luo_session_alloc(ser->name); + if (IS_ERR(session)) { + pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n", + (int)sizeof(ser->name), ser->name, session); + return PTR_ERR(session); + } + + err = luo_session_insert(sh, session); + if (err) { + pr_warn("Failed to insert session [%s] %pe\n", + session->name, ERR_PTR(err)); + luo_session_free(session); + return err; + } + + scoped_guard(mutex, &session->mutex) { + err = luo_file_deserialize(&session->file_set, + &ser->file_set_ser); + } + if (err) { + pr_warn("Failed to deserialize files for session [%s] %pe\n", + session->name, ERR_PTR(err)); + return err; + } + + return 0; +} + int luo_session_deserialize(void) { struct luo_session_header *sh = &luo_session_global.incoming; @@ -584,34 +618,9 @@ int luo_session_deserialize(void) * reliably reset devices and reclaim memory. 
*/ for (int i = 0; i < sh->header_ser->count; i++) { - struct luo_session *session; - - session = luo_session_alloc(sh->ser[i].name); - if (IS_ERR(session)) { - pr_warn("Failed to allocate session [%.*s] during deserialization %pe\n", - (int)sizeof(sh->ser[i].name), - sh->ser[i].name, session); - err = PTR_ERR(session); - goto save_err; - } - - err = luo_session_insert(sh, session); - if (err) { - pr_warn("Failed to insert session [%s] %pe\n", - session->name, ERR_PTR(err)); - luo_session_free(session); - goto save_err; - } - - scoped_guard(mutex, &session->mutex) { - err = luo_file_deserialize(&session->file_set, - &sh->ser[i].file_set_ser); - } - if (err) { - pr_warn("Failed to deserialize files for session [%s] %pe\n", - session->name, ERR_PTR(err)); + err = luo_session_deserialize_one(sh, &sh->ser[i]); + if (err) goto save_err; - } } kho_restore_free(sh->header_ser); -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:11 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:11 +0000 Subject: [PATCH v5 07/13] kho: add support for linked-block serialization In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-8-pasha.tatashin@soleen.com> Introduce a linked-block serialization mechanism for state handover. Previously, LUO used contiguous memory blocks for serializing sessions and files, which imposed limits on the total number of items that could be preserved across a live update. This commit adds the infrastructure for a more flexible, block-based approach where serialized data is stored in a chain of linked blocks. This is a generic KHO serialization block infrastructure that can be used by multiple subsystems. 
Signed-off-by: Pasha Tatashin --- Documentation/core-api/kho/abi.rst | 5 + Documentation/core-api/kho/index.rst | 11 + MAINTAINERS | 1 + include/linux/kho/abi/block.h | 56 ++++ include/linux/kho_block.h | 79 ++++++ kernel/liveupdate/Makefile | 1 + kernel/liveupdate/kho_block.c | 390 +++++++++++++++++++++++++++ 7 files changed, 543 insertions(+) create mode 100644 include/linux/kho/abi/block.h create mode 100644 include/linux/kho_block.h create mode 100644 kernel/liveupdate/kho_block.c diff --git a/Documentation/core-api/kho/abi.rst b/Documentation/core-api/kho/abi.rst index 799d743105a6..edeb5b311963 100644 --- a/Documentation/core-api/kho/abi.rst +++ b/Documentation/core-api/kho/abi.rst @@ -28,6 +28,11 @@ KHO persistent memory tracker ABI .. kernel-doc:: include/linux/kho/abi/kexec_handover.h :doc: KHO persistent memory tracker +KHO serialization block ABI +=========================== + +.. kernel-doc:: include/linux/kho/abi/block.h + See Also ======== diff --git a/Documentation/core-api/kho/index.rst b/Documentation/core-api/kho/index.rst index 0a2dee4f8e7d..320914a42178 100644 --- a/Documentation/core-api/kho/index.rst +++ b/Documentation/core-api/kho/index.rst @@ -83,6 +83,17 @@ Public API .. kernel-doc:: kernel/liveupdate/kexec_handover.c :export: +KHO Serialization Blocks API +============================ + +.. kernel-doc:: kernel/liveupdate/kho_block.c + :doc: KHO Serialization Blocks + +.. kernel-doc:: include/linux/kho_block.h + +.. 
kernel-doc:: kernel/liveupdate/kho_block.c + :internal: + See Also ======== diff --git a/MAINTAINERS b/MAINTAINERS index 9ec290e38b44..920ba7622afa 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -14208,6 +14208,7 @@ F: Documentation/admin-guide/mm/kho.rst F: Documentation/core-api/kho/* F: include/linux/kexec_handover.h F: include/linux/kho/ +F: include/linux/kho_block.h F: kernel/liveupdate/kexec_handover* F: lib/test_kho.c F: tools/testing/selftests/kho/ diff --git a/include/linux/kho/abi/block.h b/include/linux/kho/abi/block.h new file mode 100644 index 000000000000..8641c20b379b --- /dev/null +++ b/include/linux/kho/abi/block.h @@ -0,0 +1,56 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +/** + * DOC: KHO Serialization Blocks ABI + * + * Subsystems using the KHO Serialization Blocks framework rely on the stable + * Application Binary Interface defined below to pass serialized state from a + * pre-update kernel to a post-update kernel. + * + * This interface is a contract. Any modification to the structure fields, + * compatible strings, or the layout of the `__packed` serialization + * structures defined here constitutes a breaking change. Such changes require + * incrementing the version number in the `KHO_BLOCK_ABI_COMPATIBLE` string to + * prevent a new kernel from misinterpreting data from an old kernel. + * + * Changes are allowed provided the compatibility version is incremented; + * however, backward/forward compatibility is only guaranteed for kernels + * supporting the same ABI version. + */ + +#ifndef _LINUX_KHO_ABI_BLOCK_H +#define _LINUX_KHO_ABI_BLOCK_H + +#include +#include + +#define KHO_BLOCK_ABI_COMPATIBLE "kho-block-v1" + +/** + * KHO_BLOCK_SIZE - The size of each serialization block. + * + * This is defined as PAGE_SIZE. PAGE_SIZE is ABI compliant because live + * update between kernels with different page sizes is not supported by KHO. 
+ */ +#define KHO_BLOCK_SIZE PAGE_SIZE + +/** + * struct kho_block_header_ser - Header for the serialized data block. + * @next: Physical address of the next struct kho_block_header_ser. + * @count: The number of entries that immediately follow this header in the + * memory block. + * + * This structure is located at the beginning of a block of physical memory + * preserved across a kexec. It provides the necessary metadata to interpret + * the array of entries that follow. + */ +struct kho_block_header_ser { + u64 next; + u64 count; +} __packed; + +#endif /* _LINUX_KHO_ABI_BLOCK_H */ diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h new file mode 100644 index 000000000000..505bf78409f2 --- /dev/null +++ b/include/linux/kho_block.h @@ -0,0 +1,79 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +#ifndef _LINUX_KHO_BLOCK_H +#define _LINUX_KHO_BLOCK_H + +#include +#include +#include + +/** + * struct kho_block - Internal representation of a serialization block. + * @list: List head for linking blocks in memory. + * @ser: Pointer to the serialized header in preserved memory. + */ +struct kho_block { + struct list_head list; + struct kho_block_header_ser *ser; +}; + +/** + * struct kho_block_set - A set of blocks that belong to the same object. + * @blocks: The list of serialization blocks (struct kho_block). + * @nblocks: The number of allocated serialization blocks. + * @head_pa: Physical address of the first block header. + * @entry_size: The size of each entry in the blocks. + * @count_per_block: The maximum number of entries each block can hold. + * @incoming: True if this block set was restored from the previous kernel. + */ +struct kho_block_set { + struct list_head blocks; + long nblocks; + u64 head_pa; + size_t entry_size; + u64 count_per_block; + bool incoming; +}; + +/** + * struct kho_block_it - Iterator for serializing entries into blocks. + * @bs: The block set being iterated. 
+ * @block: The current block. + * @i: The current entry index within @block. + */ +struct kho_block_it { + struct kho_block_set *bs; + struct kho_block *block; + u64 i; +}; + +/** + * KHO_BLOCK_SET_INIT - Initialize a static kho_block_set. + * @_name: Name of the kho_block_set variable. + * @_entry_size: The size of each entry in the block set. + */ +#define KHO_BLOCK_SET_INIT(_name, _entry_size) { \ + .blocks = LIST_HEAD_INIT((_name).blocks), \ + .entry_size = _entry_size, \ +} + +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size); + +int kho_block_grow(struct kho_block_set *bs, u64 count); +void kho_block_shrink(struct kho_block_set *bs, u64 count); + +int kho_block_set_restore(struct kho_block_set *bs, u64 head_pa); +void kho_block_set_destroy(struct kho_block_set *bs); +void kho_block_set_clear(struct kho_block_set *bs); + +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs); +void *kho_block_it_reserve_entry(struct kho_block_it *it); +void *kho_block_it_read_entry(struct kho_block_it *it); +void *kho_block_it_prev(struct kho_block_it *it); +void kho_block_it_finalize(struct kho_block_it *it); + +#endif /* _LINUX_KHO_BLOCK_H */ diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile index d2f779cbe279..eec9d3ae07eb 100644 --- a/kernel/liveupdate/Makefile +++ b/kernel/liveupdate/Makefile @@ -1,6 +1,7 @@ # SPDX-License-Identifier: GPL-2.0 luo-y := \ + kho_block.o \ luo_core.o \ luo_file.o \ luo_flb.o \ diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c new file mode 100644 index 000000000000..01978c6aea1a --- /dev/null +++ b/kernel/liveupdate/kho_block.c @@ -0,0 +1,390 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + */ + +/** + * DOC: KHO Serialization Blocks + * + * KHO provides a mechanism to preserve stateful data across a kexec handover + * by serializing it into memory blocks. 
This file provides the common + * infrastructure for managing these blocks. + * + * Each block consists of a header (struct kho_block_header_ser) followed by an + * array of serialized entries. Multiple blocks are linked together via a + * physical pointer in the header, forming a linked list that can be easily + * traversed in both the current and the next kernel. + */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + +#include +#include +#include +#include +#include + +/* + * Safeguard limit for the number of serialization blocks. This is used to + * prevent infinite loops and excessive memory allocation in case of memory + * corruption in the preserved state. + */ +#define KHO_MAX_BLOCKS 10000 + +/** + * kho_block_set_init - Initialize a block set. + * @bs: The block set to initialize. + * @entry_size: The size of each entry in the blocks. + */ +void kho_block_set_init(struct kho_block_set *bs, size_t entry_size) +{ + *bs = (struct kho_block_set)KHO_BLOCK_SET_INIT(*bs, entry_size); +} + +static inline u64 kho_block_count_per_block(struct kho_block_set *bs) +{ + if (unlikely(!bs->count_per_block)) { + bs->count_per_block = (KHO_BLOCK_SIZE - + sizeof(struct kho_block_header_ser)) / + bs->entry_size; + WARN_ON_ONCE(!bs->count_per_block); + } + return bs->count_per_block; +} + +/* Free serialized data */ +static void kho_block_free_ser(struct kho_block_set *bs, + struct kho_block_header_ser *ser) +{ + if (bs->incoming) + kho_restore_free(ser); + else + kho_unpreserve_free(ser); +} + +static struct kho_block_header_ser *kho_block_alloc_ser(struct kho_block_set *bs) +{ + WARN_ON_ONCE(bs->incoming); + return kho_alloc_preserve(KHO_BLOCK_SIZE); +} + +static int kho_block_add(struct kho_block_set *bs, + struct kho_block_header_ser *ser) +{ + struct kho_block *block, *last; + + if (bs->nblocks >= KHO_MAX_BLOCKS) + return -ENOSPC; + + block = kzalloc_obj(*block); + if (!block) + return -ENOMEM; + + block->ser = ser; + last = list_last_entry_or_null(&bs->blocks, struct 
kho_block, list); + list_add_tail(&block->list, &bs->blocks); + bs->nblocks++; + + if (last) + last->ser->next = virt_to_phys(ser); + else + bs->head_pa = virt_to_phys(ser); + + return 0; +} + +/** + * kho_block_grow - Create a new block if the current capacity is reached. + * @bs: The block set. + * @count: The current number of entries. + * + * This function handles the dynamic expansion of a block set. It allocates + * and links a new serialization block if the provided entry count matches + * the current total capacity of the set. + * + * Return: 0 on success, or a negative errno on failure. + */ +int kho_block_grow(struct kho_block_set *bs, u64 count) +{ + struct kho_block_header_ser *ser; + int err; + + if (WARN_ON_ONCE(bs->incoming)) + return -EINVAL; + + if (count != bs->nblocks * kho_block_count_per_block(bs)) + return 0; + + ser = kho_block_alloc_ser(bs); + if (IS_ERR(ser)) + return PTR_ERR(ser); + + err = kho_block_add(bs, ser); + if (err) { + kho_block_free_ser(bs, ser); + return err; + } + + return 0; +} + +/** + * kho_block_shrink - Conditionally destroy the last block in a block set. + * @bs: The block set. + * @count: The current number of entries across all blocks. + * + * This function checks if the last block in the set is redundant based on the + * total entry count and the capacity of the preceding blocks. If the entry + * count can be accommodated by the blocks that come before the last one, the + * last block is destroyed and removed from the set. 
+ */ +void kho_block_shrink(struct kho_block_set *bs, u64 count) +{ + struct kho_block *last, *new_last; + + if (count > (bs->nblocks - 1) * kho_block_count_per_block(bs)) + return; + + if (list_empty(&bs->blocks)) + return; + + last = list_last_entry(&bs->blocks, struct kho_block, list); + list_del(&last->list); + bs->nblocks--; + kho_block_free_ser(bs, last->ser); + kfree(last); + + new_last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); + if (new_last) + new_last->ser->next = 0; + else + bs->head_pa = 0; +} + +/* + * kho_cyclic_blocks_check - Check for cycles in a linked list of blocks. + * Uses Floyd's cycle-finding algorithm to ensure sanity of the incoming list. + */ +static bool kho_cyclic_blocks_check(struct kho_block_set *bs) +{ + struct kho_block_header_ser *fast; + struct kho_block_header_ser *slow; + int count = 0; + + fast = phys_to_virt(bs->head_pa); + slow = fast; + + while (fast) { + if (count++ >= KHO_MAX_BLOCKS) { + pr_err("Linked list too long\n"); + return false; + } + + if (!fast->next) + break; + + fast = phys_to_virt(fast->next); + if (!fast->next) + break; + + fast = phys_to_virt(fast->next); + slow = phys_to_virt(slow->next); + + if (slow == fast) { + pr_err("Cyclic list detected\n"); + return false; + } + } + + return true; +} + +/** + * kho_block_set_restore - Restore a block set from a physical address. + * @bs: The block set to restore. + * @head_pa: Physical address of the first block header. + * + * Return: 0 on success, or a negative errno on failure. 
+ */ +int kho_block_set_restore(struct kho_block_set *bs, u64 head_pa) +{ + struct kho_block_header_ser *ser; + u64 next_pa = head_pa; + int err; + + /* Restored block sets use size from the previous kernel */ + bs->incoming = true; + if (!head_pa) + return 0; + + bs->head_pa = head_pa; + if (!kho_cyclic_blocks_check(bs)) { + bs->head_pa = 0; + return -EINVAL; + } + + while (next_pa) { + ser = phys_to_virt(next_pa); + if (!ser->count || ser->count > kho_block_count_per_block(bs)) { + pr_warn("Block contains invalid entry count: %llu\n", + ser->count); + err = -EINVAL; + goto err_destroy; + } + err = kho_block_add(bs, ser); + if (err) + goto err_destroy; + next_pa = ser->next; + } + + return 0; + +err_destroy: + kho_block_set_destroy(bs); + return err; +} + +/** + * kho_block_set_destroy - Destroy all blocks in a block set. + * @bs: The block set. + */ +void kho_block_set_destroy(struct kho_block_set *bs) +{ + struct kho_block *block, *tmp; + u64 head_pa = bs->head_pa; + + list_for_each_entry_safe(block, tmp, &bs->blocks, list) { + list_del(&block->list); + kfree(block); + } + bs->nblocks = 0; + bs->head_pa = 0; + + /* + * bs->blocks may only contain partially restored blocks, but head_pa + * still points to the entire chain. + */ + while (head_pa) { + struct kho_block_header_ser *ser = phys_to_virt(head_pa); + + head_pa = ser->next; + kho_block_free_ser(bs, ser); + } +} + +/** + * kho_block_set_clear - Clear all serialized data in a block set. + * @bs: The block set to clear. + */ +void kho_block_set_clear(struct kho_block_set *bs) +{ + struct kho_block *block; + + list_for_each_entry(block, &bs->blocks, list) { + block->ser->count = 0; + memset(block->ser + 1, 0, KHO_BLOCK_SIZE - sizeof(*block->ser)); + } +} + +/** + * kho_block_it_init - Initialize a block set iterator. + * @it: The iterator to initialize. + * @bs: The block set to iterate over. 
+ */ +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs) +{ + it->bs = bs; + it->block = list_first_entry_or_null(&bs->blocks, struct kho_block, list); + it->i = 0; +} + +/** + * kho_block_it_reserve_entry - Reserve and return the next available slot for writing. + * @it: The block iterator. + * + * This function is used during state serialization to add a new entry. + * It reserves a slot in the current block, advancing the internal index. + * If the current block is full, it automatically moves to the next block + * in the set. + * + * Return: A pointer to the reserved entry slot, or NULL if the block set's + * capacity is fully exhausted. + */ +void *kho_block_it_reserve_entry(struct kho_block_it *it) +{ + if (!it->block) + return NULL; + + if (it->i == kho_block_count_per_block(it->bs)) { + it->block->ser->count = it->i; + if (list_is_last(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_next_entry(it->block, list); + it->i = 0; + } + + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); +} + +/** + * kho_block_it_read_entry - Read the next serialized entry from the block set. + * @it: The block iterator. + * + * This function is used during state deserialization. It iterates through + * entries that were previously written, respecting the actual count stored + * in each block's header. + * + * Return: A pointer to the next serialized entry, or NULL if all serialized + * entries have been read. + */ +void *kho_block_it_read_entry(struct kho_block_it *it) +{ + if (!it->block) + return NULL; + + if (it->i == it->block->ser->count) { + if (list_is_last(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_next_entry(it->block, list); + it->i = 0; + } + + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); +} + +/** + * kho_block_it_prev - Return the previous entry slot in the block set. + * @it: The block iterator. 
+ * + * If the current index is at the start of a block, it automatically moves to + * the end of the previous block. + * + * Return: A pointer to the previous entry slot, or NULL if at the very + * beginning of the block set. + */ +void *kho_block_it_prev(struct kho_block_it *it) +{ + if (!it->block) + return NULL; + + if (it->i == 0) { + if (list_is_first(&it->block->list, &it->bs->blocks)) + return NULL; + it->block = list_prev_entry(it->block, list); + it->i = kho_block_count_per_block(it->bs); + } + + return (void *)(it->block->ser + 1) + (--it->i * it->bs->entry_size); +} + +/** + * kho_block_it_finalize - Finalize the current block by setting its entry count. + * @it: The block iterator. + */ +void kho_block_it_finalize(struct kho_block_it *it) +{ + if (it->block) + it->block->ser->count = it->i; +} -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:12 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:12 +0000 Subject: [PATCH v5 08/13] liveupdate: defer session block allocation and PA setting In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-9-pasha.tatashin@soleen.com> Currently, luo_session_setup_outgoing() allocates the session block and sets its physical address in the header immediately. With upcoming dynamic block-based session management, this makes the first block different from the rest. Move the allocation to where it is first needed. 
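The deferral described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: `session_store`, `store_setup()`, `store_insert()` and `store_serialize()` are hypothetical names, `calloc()` stands in for `kho_alloc_preserve()`, and a pointer cast stands in for `virt_to_phys()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the outgoing session header. */
struct session_store {
	void *buf;        /* backing block, NULL until the first insert */
	size_t count;     /* number of tracked sessions */
	uint64_t *pa_out; /* where the block address is published at serialize time */
};

/* Setup only records where to publish the address; it allocates nothing. */
static void store_setup(struct session_store *s, uint64_t *pa_out)
{
	assert(pa_out);
	s->buf = NULL;
	s->count = 0;
	s->pa_out = pa_out;
}

/* The first insert allocates the block; later inserts reuse it. */
static int store_insert(struct session_store *s)
{
	if (!s->buf) {
		s->buf = calloc(1, 4096); /* assumed block size */
		if (!s->buf)
			return -1;
	}
	s->count++;
	return 0;
}

/* Publish 0 when there is nothing to hand over, else the block address. */
static void store_serialize(struct session_store *s)
{
	*s->pa_out = 0;
	if (s->buf && s->count > 0)
		*s->pa_out = (uint64_t)(uintptr_t)s->buf;
}
```

With this shape, a store that never receives a session publishes 0 and never allocates, so the first block is treated no differently from any later one.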
Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- include/linux/kho_block.h | 22 +++++++++++ kernel/liveupdate/luo_core.c | 4 +- kernel/liveupdate/luo_internal.h | 2 +- kernel/liveupdate/luo_session.c | 68 ++++++++++++++++++++------------ 4 files changed, 67 insertions(+), 29 deletions(-) diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h index 505bf78409f2..0a8cda2cbfb5 100644 --- a/include/linux/kho_block.h +++ b/include/linux/kho_block.h @@ -70,6 +70,28 @@ int kho_block_set_restore(struct kho_block_set *bs, u64 head_pa); void kho_block_set_destroy(struct kho_block_set *bs); void kho_block_set_clear(struct kho_block_set *bs); +/** + * kho_block_set_head_pa - Get the physical address of the first block header. + * @bs: The block set. + * + * Return: The physical address of the first block header, or 0 if empty. + */ +static inline u64 kho_block_set_head_pa(struct kho_block_set *bs) +{ + return bs->head_pa; +} + +/** + * kho_block_set_is_empty - Check if the block set has no allocated blocks. + * @bs: The block set. + * + * Return: True if there are no blocks in the set, false otherwise. 
+ */ +static inline bool kho_block_set_is_empty(struct kho_block_set *bs) +{ + return list_empty(&bs->blocks); +} + void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs); void *kho_block_it_reserve_entry(struct kho_block_it *it); void *kho_block_it_read_entry(struct kho_block_it *it); diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 69b00e7d0f8f..1b2bda22902d 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -165,9 +165,7 @@ static int __init luo_state_setup(void) strscpy(luo_ser->compatible, LUO_ABI_COMPATIBLE, sizeof(luo_ser->compatible)); luo_ser->liveupdate_num = luo_global.liveupdate_num + 1; - err = luo_session_setup_outgoing(&luo_ser->sessions_pa); - if (err) - goto exit_free_luo_ser; + luo_session_setup_outgoing(&luo_ser->sessions_pa); err = luo_flb_setup_outgoing(&luo_ser->flbs_pa); if (err) diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index fe22086bfbeb..ee18f9a11b91 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -79,7 +79,7 @@ extern struct rw_semaphore luo_register_rwlock; int luo_session_create(const char *name, struct file **filep); int luo_session_retrieve(const char *name, struct file **filep); -int __init luo_session_setup_outgoing(u64 *sessions_pa); +void __init luo_session_setup_outgoing(u64 *sessions_pa); int __init luo_session_setup_incoming(u64 sessions_pa); int luo_session_serialize(void); int luo_session_deserialize(void); diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 9f72a8b0a9a8..43342916d314 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -108,15 +108,16 @@ static DECLARE_RWSEM(luo_session_serialize_rwsem); /** * struct luo_session_header - Header struct for managing LUO sessions. - * @count: The number of sessions currently tracked in the @list. 
- * @list: The head of the linked list of `struct luo_session` instances. - * @rwsem: A read-write semaphore providing synchronized access to the - * session list and other fields in this structure. - * @header_ser: The header data of serialization array. - * @ser: The serialized session data (an array of - * `struct luo_session_ser`). - * @active: Set to true when first initialized. If previous kernel did not - * send session data, active stays false for incoming. + * @count: The number of sessions currently tracked in the @list. + * @list: The head of the linked list of `struct luo_session` instances. + * @rwsem: A read-write semaphore providing synchronized access to the + * session list and other fields in this structure. + * @header_ser: The header data of serialization array. + * @ser: The serialized session data (an array of + * `struct luo_session_ser`). + * @sessions_pa: Points to the location of sessions_pa within struct luo_ser. + * @active: Set to true when first initialized. If previous kernel did not + * send session data, active stays false for incoming. 
*/ struct luo_session_header { long count; @@ -124,6 +125,7 @@ struct luo_session_header { struct rw_semaphore rwsem; struct luo_session_header_ser *header_ser; struct luo_session_ser *ser; + u64 *sessions_pa; bool active; }; @@ -171,10 +173,30 @@ static void luo_session_free(struct luo_session *session) kfree(session); } +static int luo_session_grow_ser(struct luo_session_header *sh) +{ + struct luo_session_header_ser *header_ser; + + if (sh->count == LUO_SESSION_MAX) + return -ENOMEM; + + if (sh->header_ser) + return 0; + + header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); + if (IS_ERR(header_ser)) + return PTR_ERR(header_ser); + + sh->header_ser = header_ser; + sh->ser = (void *)(header_ser + 1); + return 0; +} + static int luo_session_insert(struct luo_session_header *sh, struct luo_session *session) { struct luo_session *it; + int err; guard(rwsem_write)(&sh->rwsem); @@ -183,8 +205,9 @@ static int luo_session_insert(struct luo_session_header *sh, * for new session. */ if (sh == &luo_session_global.outgoing) { - if (sh->count == LUO_SESSION_MAX) - return -ENOMEM; + err = luo_session_grow_ser(sh); + if (err) + return err; } /* @@ -522,21 +545,10 @@ int luo_session_retrieve(const char *name, struct file **filep) return err; } -int __init luo_session_setup_outgoing(u64 *sessions_pa) +void __init luo_session_setup_outgoing(u64 *sessions_pa) { - struct luo_session_header_ser *header_ser; - - header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); - if (IS_ERR(header_ser)) - return PTR_ERR(header_ser); - - *sessions_pa = virt_to_phys(header_ser); - - luo_session_global.outgoing.header_ser = header_ser; - luo_session_global.outgoing.ser = (void *)(header_ser + 1); + luo_session_global.outgoing.sessions_pa = sessions_pa; luo_session_global.outgoing.active = true; - - return 0; } int __init luo_session_setup_incoming(u64 sessions_pa) @@ -642,6 +654,8 @@ int luo_session_serialize(void) down_write(&luo_session_serialize_rwsem); 
down_write(&sh->rwsem); + *sh->sessions_pa = 0; + list_for_each_entry(session, &sh->list, list) { err = luo_session_freeze_one(session, &sh->ser[i]); if (err) @@ -651,7 +665,11 @@ int luo_session_serialize(void) sizeof(sh->ser[i].name)); i++; } - sh->header_ser->count = sh->count; + + if (sh->header_ser && sh->count > 0) { + sh->header_ser->count = sh->count; + *sh->sessions_pa = virt_to_phys(sh->header_ser); + } up_write(&sh->rwsem); return 0; -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:13 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:13 +0000 Subject: [PATCH v5 09/13] liveupdate: Remove limit on the number of sessions In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-10-pasha.tatashin@soleen.com> Currently, the number of LUO sessions is limited by a fixed number of pre-allocated pages for serialization (16 pages, allowing for ~819 sessions). This limitation is problematic if LUO is used to support things such as systemd file descriptor store, and would be used not just as VM memory but to save other states on the machine. Remove this limit by transitioning to a linked-block approach for session metadata serialization. Instead of a single contiguous block, session metadata is now stored in a chain of 16-page blocks. Each block starts with a header containing the physical address of the next block and the number of session entries in the current block. 
Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 24 +------ kernel/liveupdate/luo_session.c | 115 +++++++++++++++----------------- 2 files changed, 58 insertions(+), 81 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 9a4fe491812b..79758d92ed5f 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -33,11 +33,6 @@ * It includes the compatibility string, the liveupdate-number, and pointers * to sessions and FLBs. * - * - struct luo_session_header_ser: - * Header for the session array. Contains the total page count of the - * preserved memory block and the number of `struct luo_session_ser` - * entries that follow. - * * - struct luo_session_ser: * Metadata for a single session, including its name and a physical pointer * to another preserved memory block containing an array of @@ -63,13 +58,15 @@ #define _LINUX_KHO_ABI_LUO_H #include +#include #include /* * The LUO state is registered under this KHO entry name. */ #define LUO_KHO_ENTRY_NAME "LUO" -#define LUO_ABI_COMPATIBLE "luo-v3" +#define LUO_COMPAT_BASE "luo-v3" +#define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE #define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) /** @@ -118,21 +115,6 @@ struct luo_file_set_ser { u64 count; } __packed; -/** - * struct luo_session_header_ser - Header for the serialized session data block. - * @count: The number of `struct luo_session_ser` entries that immediately - * follow this header in the memory block. - * - * This structure is located at the beginning of a contiguous block of - * physical memory preserved across the kexec. It provides the necessary - * metadata to interpret the array of session entries that follow. - * - * If this structure is modified, `LUO_ABI_COMPATIBLE` must be updated. 
- */ -struct luo_session_header_ser { - u64 count; -} __packed; - /** * struct luo_session_ser - Represents the serialized metadata for a LUO session. * @name: The unique name of the session, provided by the userspace at diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c index 43342916d314..f6eeb965b3c1 100644 --- a/kernel/liveupdate/luo_session.c +++ b/kernel/liveupdate/luo_session.c @@ -24,9 +24,10 @@ * ioctls on /dev/liveupdate. * * - Serialization: Session metadata is preserved using the KHO framework. When - * a live update is triggered via kexec, an array of `struct luo_session_ser` - * is populated and placed in a preserved memory region. The physical address - * of this array is stored in the centralized `struct luo_ser` structure. + * a live update is triggered via kexec, session metadata is serialized into + * a chain of linked-blocks and placed in a preserved memory region. The + * physical address of the first block header is stored in the centralized + * `struct luo_ser` structure. * * Session Lifecycle: * @@ -89,6 +90,7 @@ #include #include #include +#include #include #include #include @@ -98,23 +100,14 @@ #include #include "luo_internal.h" -/* 16 4K pages, give space for 744 sessions */ -#define LUO_SESSION_PGCNT 16ul -#define LUO_SESSION_MAX (((LUO_SESSION_PGCNT << PAGE_SHIFT) - \ - sizeof(struct luo_session_header_ser)) / \ - sizeof(struct luo_session_ser)) - static DECLARE_RWSEM(luo_session_serialize_rwsem); - /** * struct luo_session_header - Header struct for managing LUO sessions. * @count: The number of sessions currently tracked in the @list. * @list: The head of the linked list of `struct luo_session` instances. * @rwsem: A read-write semaphore providing synchronized access to the * session list and other fields in this structure. - * @header_ser: The header data of serialization array. - * @ser: The serialized session data (an array of - * `struct luo_session_ser`). 
+ * @block_set: The set of serialization blocks. * @sessions_pa: Points to the location of sessions_pa within struct luo_ser. * @active: Set to true when first initialized. If previous kernel did not * send session data, active stays false for incoming. @@ -123,8 +116,7 @@ struct luo_session_header { long count; struct list_head list; struct rw_semaphore rwsem; - struct luo_session_header_ser *header_ser; - struct luo_session_ser *ser; + struct kho_block_set block_set; u64 *sessions_pa; bool active; }; @@ -143,10 +135,14 @@ static struct luo_session_global luo_session_global = { .incoming = { .list = LIST_HEAD_INIT(luo_session_global.incoming.list), .rwsem = __RWSEM_INITIALIZER(luo_session_global.incoming.rwsem), + .block_set = KHO_BLOCK_SET_INIT(luo_session_global.incoming.block_set, + sizeof(struct luo_session_ser)), }, .outgoing = { .list = LIST_HEAD_INIT(luo_session_global.outgoing.list), .rwsem = __RWSEM_INITIALIZER(luo_session_global.outgoing.rwsem), + .block_set = KHO_BLOCK_SET_INIT(luo_session_global.outgoing.block_set, + sizeof(struct luo_session_ser)), }, }; @@ -173,25 +169,6 @@ static void luo_session_free(struct luo_session *session) kfree(session); } -static int luo_session_grow_ser(struct luo_session_header *sh) -{ - struct luo_session_header_ser *header_ser; - - if (sh->count == LUO_SESSION_MAX) - return -ENOMEM; - - if (sh->header_ser) - return 0; - - header_ser = kho_alloc_preserve(LUO_SESSION_PGCNT << PAGE_SHIFT); - if (IS_ERR(header_ser)) - return PTR_ERR(header_ser); - - sh->header_ser = header_ser; - sh->ser = (void *)(header_ser + 1); - return 0; -} - static int luo_session_insert(struct luo_session_header *sh, struct luo_session *session) { @@ -205,7 +182,7 @@ static int luo_session_insert(struct luo_session_header *sh, * for new session. 
*/ if (sh == &luo_session_global.outgoing) { - err = luo_session_grow_ser(sh); + err = kho_block_grow(&sh->block_set, sh->count); if (err) return err; } @@ -232,6 +209,8 @@ static void luo_session_remove(struct luo_session_header *sh, guard(rwsem_write)(&sh->rwsem); list_del(&session->list); sh->count--; + if (sh == &luo_session_global.outgoing) + kho_block_shrink(&sh->block_set, sh->count); } static int luo_session_finish_one(struct luo_session *session) @@ -553,15 +532,17 @@ void __init luo_session_setup_outgoing(u64 *sessions_pa) int __init luo_session_setup_incoming(u64 sessions_pa) { - struct luo_session_header_ser *header_ser; + struct luo_session_header *sh = &luo_session_global.incoming; + int err; - if (sessions_pa) { - header_ser = phys_to_virt(sessions_pa); - luo_session_global.incoming.header_ser = header_ser; - luo_session_global.incoming.ser = (void *)(header_ser + 1); - luo_session_global.incoming.active = true; - } + if (!sessions_pa) + return 0; + err = kho_block_set_restore(&sh->block_set, sessions_pa); + if (err) + return err; + + sh->active = true; return 0; } @@ -603,6 +584,8 @@ int luo_session_deserialize(void) { struct luo_session_header *sh = &luo_session_global.incoming; static bool is_deserialized; + struct luo_session_ser *ser; + struct kho_block_it it; static int saved_err; int err; @@ -629,18 +612,19 @@ int luo_session_deserialize(void) * userspace to detect the failure and trigger a reboot, which will * reliably reset devices and reclaim memory. 
*/ - for (int i = 0; i < sh->header_ser->count; i++) { - err = luo_session_deserialize_one(sh, &sh->ser[i]); + kho_block_it_init(&it, &sh->block_set); + while ((ser = kho_block_it_read_entry(&it))) { + err = luo_session_deserialize_one(sh, ser); if (err) goto save_err; } - kho_restore_free(sh->header_ser); - sh->header_ser = NULL; - sh->ser = NULL; + kho_block_set_destroy(&sh->block_set); return 0; + save_err: + kho_block_set_destroy(&sh->block_set); saved_err = err; return err; } @@ -649,36 +633,47 @@ int luo_session_serialize(void) { struct luo_session_header *sh = &luo_session_global.outgoing; struct luo_session *session; - int i = 0; + struct kho_block_it it; int err; down_write(&luo_session_serialize_rwsem); down_write(&sh->rwsem); *sh->sessions_pa = 0; + kho_block_it_init(&it, &sh->block_set); + list_for_each_entry(session, &sh->list, list) { - err = luo_session_freeze_one(session, &sh->ser[i]); - if (err) + struct luo_session_ser *ser = kho_block_it_reserve_entry(&it); + + /* This should not fail normally as blocks were pre-allocated */ + if (WARN_ON_ONCE(!ser)) { + err = -ENOSPC; goto err_undo; + } - strscpy(sh->ser[i].name, session->name, - sizeof(sh->ser[i].name)); - i++; - } + err = luo_session_freeze_one(session, ser); + if (err) { + kho_block_it_prev(&it); + goto err_undo; + } - if (sh->header_ser && sh->count > 0) { - sh->header_ser->count = sh->count; - *sh->sessions_pa = virt_to_phys(sh->header_ser); + strscpy(ser->name, session->name, sizeof(ser->name)); } + + kho_block_it_finalize(&it); + + if (sh->count > 0) + *sh->sessions_pa = kho_block_set_head_pa(&sh->block_set); up_write(&sh->rwsem); return 0; err_undo: list_for_each_entry_continue_reverse(session, &sh->list, list) { - i--; - luo_session_unfreeze_one(session, &sh->ser[i]); - memset(sh->ser[i].name, 0, sizeof(sh->ser[i].name)); + struct luo_session_ser *ser = kho_block_it_prev(&it); + + luo_session_unfreeze_one(session, ser); + memset(ser->name, 0, sizeof(ser->name)); } up_write(&sh->rwsem); 
up_write(&luo_session_serialize_rwsem); -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:15 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:15 +0000 Subject: [PATCH v5 11/13] selftests/liveupdate: Test session and file limit removal In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-12-pasha.tatashin@soleen.com> With the removal of static limits on the number of sessions and files per session, the orchestrator now uses dynamic allocation. Add new test cases to verify that the system can handle a large number of sessions and files. These tests ensure that the dynamic block allocation and reuse logic for session metadata and outgoing files work correctly beyond the previous static limits. Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- .../testing/selftests/liveupdate/liveupdate.c | 75 +++++++++++++++++++ .../selftests/liveupdate/luo_test_utils.c | 24 ++++++ .../selftests/liveupdate/luo_test_utils.h | 2 + 3 files changed, 101 insertions(+) diff --git a/tools/testing/selftests/liveupdate/liveupdate.c b/tools/testing/selftests/liveupdate/liveupdate.c index c7d94b9181e1..502fb3567e38 100644 --- a/tools/testing/selftests/liveupdate/liveupdate.c +++ b/tools/testing/selftests/liveupdate/liveupdate.c @@ -26,6 +26,7 @@ #include +#include "luo_test_utils.h" #include "../kselftest.h" #include "../kselftest_harness.h" @@ -499,4 +500,78 @@ TEST_F(liveupdate_device, get_session_name_max_length) ASSERT_EQ(close(session_fd), 0); } +/* + * Test Case: Manage Many Sessions + * + * Verifies that a large number of sessions can be created and then + * destroyed during normal system operation. This specifically tests the + * dynamic block allocation and reuse logic for session metadata management + * without preserving any files. 
+ */ +TEST_F(liveupdate_device, preserve_many_sessions) +{ +#define MANY_SESSIONS 2000 + int session_fds[MANY_SESSIONS]; + int ret, i; + + self->fd1 = open(LIVEUPDATE_DEV, O_RDWR); + if (self->fd1 < 0 && errno == ENOENT) + SKIP(return, "%s does not exist", LIVEUPDATE_DEV); + ASSERT_GE(self->fd1, 0); + + ret = luo_ensure_nofile_limit(MANY_SESSIONS); + if (ret == -EPERM) + SKIP(return, "Insufficient privileges to set RLIMIT_NOFILE"); + ASSERT_EQ(ret, 0); + + for (i = 0; i < MANY_SESSIONS; i++) { + char name[64]; + + snprintf(name, sizeof(name), "many-session-%d", i); + session_fds[i] = create_session(self->fd1, name); + ASSERT_GE(session_fds[i], 0); + } + + for (i = 0; i < MANY_SESSIONS; i++) + ASSERT_EQ(close(session_fds[i]), 0); +} + +/* + * Test Case: Preserve Many Files + * + * Verifies that a large number of files can be preserved in a single session + * and then destroyed during normal system operation. This tests the dynamic + * block allocation and management for outgoing files. + */ +TEST_F(liveupdate_device, preserve_many_files) +{ +#define MANY_FILES 500 + int mem_fds[MANY_FILES]; + int session_fd, ret, i; + + self->fd1 = open(LIVEUPDATE_DEV, O_RDWR); + if (self->fd1 < 0 && errno == ENOENT) + SKIP(return, "%s does not exist", LIVEUPDATE_DEV); + ASSERT_GE(self->fd1, 0); + + session_fd = create_session(self->fd1, "many-files-test"); + ASSERT_GE(session_fd, 0); + + ret = luo_ensure_nofile_limit(MANY_FILES + 10); + if (ret == -EPERM) + SKIP(return, "Insufficient privileges to set RLIMIT_NOFILE"); + ASSERT_EQ(ret, 0); + + for (i = 0; i < MANY_FILES; i++) { + mem_fds[i] = memfd_create("test-memfd", 0); + ASSERT_GE(mem_fds[i], 0); + ASSERT_EQ(preserve_fd(session_fd, mem_fds[i], i), 0); + } + + for (i = 0; i < MANY_FILES; i++) + ASSERT_EQ(close(mem_fds[i]), 0); + + ASSERT_EQ(close(session_fd), 0); +} + TEST_HARNESS_MAIN diff --git a/tools/testing/selftests/liveupdate/luo_test_utils.c b/tools/testing/selftests/liveupdate/luo_test_utils.c index 
3c8721c505df..333a3530051b 100644 --- a/tools/testing/selftests/liveupdate/luo_test_utils.c +++ b/tools/testing/selftests/liveupdate/luo_test_utils.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -28,6 +29,29 @@ int luo_open_device(void) return open(LUO_DEVICE, O_RDWR); } +int luo_ensure_nofile_limit(long min_limit) +{ + struct rlimit hl; + + /* Allow extra files to be used by the test itself */ + min_limit += 32; + + if (getrlimit(RLIMIT_NOFILE, &hl) < 0) + return -errno; + + if (hl.rlim_cur >= min_limit) + return 0; + + hl.rlim_cur = min_limit; + if (hl.rlim_cur > hl.rlim_max) + hl.rlim_max = hl.rlim_cur; + + if (setrlimit(RLIMIT_NOFILE, &hl) < 0) + return -errno; + + return 0; +} + int luo_create_session(int luo_fd, const char *name) { struct liveupdate_ioctl_create_session arg = { .size = sizeof(arg) }; diff --git a/tools/testing/selftests/liveupdate/luo_test_utils.h b/tools/testing/selftests/liveupdate/luo_test_utils.h index 90099bf49577..6a0d85386613 100644 --- a/tools/testing/selftests/liveupdate/luo_test_utils.h +++ b/tools/testing/selftests/liveupdate/luo_test_utils.h @@ -26,6 +26,8 @@ int luo_create_session(int luo_fd, const char *name); int luo_retrieve_session(int luo_fd, const char *name); int luo_session_finish(int session_fd); +int luo_ensure_nofile_limit(long min_limit); + int create_and_preserve_memfd(int session_fd, int token, const char *data); int restore_and_verify_memfd(int session_fd, int token, const char *expected_data); -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:14 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:14 +0000 Subject: [PATCH v5 10/13] liveupdate: Remove limit on the number of files per session In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-11-pasha.tatashin@soleen.com> To remove the fixed limit on the number of preserved
files per session, transition the file metadata serialization from a single contiguous memory block to a chain of linked blocks. Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Pasha Tatashin --- include/linux/kho/abi/luo.h | 13 +-- kernel/liveupdate/luo_file.c | 139 +++++++++++++++---------------- kernel/liveupdate/luo_internal.h | 6 +- 3 files changed, 75 insertions(+), 83 deletions(-) diff --git a/include/linux/kho/abi/luo.h b/include/linux/kho/abi/luo.h index 79758d92ed5f..16df550ef143 100644 --- a/include/linux/kho/abi/luo.h +++ b/include/linux/kho/abi/luo.h @@ -35,8 +35,8 @@ * * - struct luo_session_ser: * Metadata for a single session, including its name and a physical pointer - * to another preserved memory block containing an array of - * `struct luo_file_ser` for all files in that session. + * to the first `struct kho_block_header_ser` for all files in that session. + * Multiple blocks are linked via the `next` field in the header. * * - struct luo_file_ser: * Metadata for a single preserved file. Contains the `compatible` string to @@ -65,7 +65,7 @@ * The LUO state is registered under this KHO entry name. */ #define LUO_KHO_ENTRY_NAME "LUO" -#define LUO_COMPAT_BASE "luo-v3" +#define LUO_COMPAT_BASE "luo-v4" #define LUO_ABI_COMPATIBLE LUO_COMPAT_BASE "-" KHO_BLOCK_ABI_COMPATIBLE #define LUO_ABI_COMPAT_LEN ALIGN(sizeof(LUO_ABI_COMPATIBLE), 8) @@ -103,9 +103,10 @@ struct luo_file_ser { /** * struct luo_file_set_ser - Represents the serialized metadata for file set - * @files: The physical address of a contiguous memory block that holds - * the serialized state of files (array of luo_file_ser) in this file - * set. + * @files: The physical address of the first `struct kho_block_header_ser`. + * This structure is the header for a block of memory containing + * an array of `struct luo_file_ser` entries. Multiple blocks are + * linked via the `next` field in the header. 
* @count: The total number of files that were part of this session during * serialization. Used for iteration and validation during * restoration. diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c index 9eec07a9e9fc..695e99aaba20 100644 --- a/kernel/liveupdate/luo_file.c +++ b/kernel/liveupdate/luo_file.c @@ -118,11 +118,6 @@ static LIST_HEAD(luo_file_handler_list); /* Keep track of files being preserved by LUO */ static DEFINE_XARRAY(luo_preserved_files); -/* 2 4K pages, give space for 128 files per file_set */ -#define LUO_FILE_PGCNT 2ul -#define LUO_FILE_MAX \ - ((LUO_FILE_PGCNT << PAGE_SHIFT) / sizeof(struct luo_file_ser)) - /** * struct luo_file - Represents a single preserved file instance. * @fh: Pointer to the &struct liveupdate_file_handler that manages @@ -174,39 +169,6 @@ struct luo_file { u64 token; }; -static int luo_alloc_files_mem(struct luo_file_set *file_set) -{ - size_t size; - void *mem; - - if (file_set->files) - return 0; - - WARN_ON_ONCE(file_set->count); - - size = LUO_FILE_PGCNT << PAGE_SHIFT; - mem = kho_alloc_preserve(size); - if (IS_ERR(mem)) - return PTR_ERR(mem); - - file_set->files = mem; - - return 0; -} - -static void luo_free_files_mem(struct luo_file_set *file_set) -{ - /* If file_set has files, no need to free preservation memory */ - if (file_set->count) - return; - - if (!file_set->files) - return; - - kho_unpreserve_free(file_set->files); - file_set->files = NULL; -} - static unsigned long luo_get_id(struct liveupdate_file_handler *fh, struct file *file) { @@ -276,16 +238,15 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) if (luo_token_is_used(file_set, token)) return -EEXIST; - if (file_set->count == LUO_FILE_MAX) - return -ENOSPC; + err = kho_block_grow(&file_set->block_set, file_set->count); + if (err) + return err; file = fget(fd); - if (!file) - return -EBADF; - - err = luo_alloc_files_mem(file_set); - if (err) - goto err_fput; + if (!file) { + err = -EBADF; + goto 
err_shrink; + } err = -ENOENT; down_read(&luo_register_rwlock); @@ -300,7 +261,7 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) /* err is still -ENOENT if no handler was found */ if (err) - goto err_free_files_mem; + goto err_fput; err = xa_insert(&luo_preserved_files, luo_get_id(fh, file), file, GFP_KERNEL); @@ -343,10 +304,10 @@ int luo_preserve_file(struct luo_file_set *file_set, u64 token, int fd) xa_erase(&luo_preserved_files, luo_get_id(fh, file)); err_module_put: module_put(fh->ops->owner); -err_free_files_mem: - luo_free_files_mem(file_set); err_fput: fput(file); +err_shrink: + kho_block_shrink(&file_set->block_set, file_set->count); return err; } @@ -392,13 +353,14 @@ void luo_file_unpreserve_files(struct luo_file_set *file_set) list_del(&luo_file->list); file_set->count--; + kho_block_shrink(&file_set->block_set, file_set->count); fput(luo_file->file); mutex_destroy(&luo_file->mutex); kfree(luo_file); } - luo_free_files_mem(file_set); + kho_block_set_destroy(&file_set->block_set); } static int luo_file_freeze_one(struct luo_file_set *file_set, @@ -454,7 +416,7 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, luo_file_unfreeze_one(file_set, luo_file); } - memset(file_set->files, 0, LUO_FILE_PGCNT << PAGE_SHIFT); + kho_block_set_clear(&file_set->block_set); } /** @@ -493,19 +455,24 @@ static void __luo_file_unfreeze(struct luo_file_set *file_set, int luo_file_freeze(struct luo_file_set *file_set, struct luo_file_set_ser *file_set_ser) { - struct luo_file_ser *file_ser = file_set->files; struct luo_file *luo_file; + struct kho_block_it it; int err; - int i; if (!file_set->count) return 0; - if (WARN_ON(!file_ser)) - return -EINVAL; + kho_block_it_init(&it, &file_set->block_set); - i = 0; list_for_each_entry(luo_file, &file_set->files_list, list) { + struct luo_file_ser *file_ser = kho_block_it_reserve_entry(&it); + + /* This should not fail normally as blocks were pre-allocated */ + if (WARN_ON_ONCE(!file_ser)) { 
+ err = -ENOSPC; + goto err_unfreeze; + } + err = luo_file_freeze_one(file_set, luo_file); if (err < 0) { pr_warn("Freeze failed for token[%#0llx] handler[%s] err[%pe]\n", @@ -514,16 +481,15 @@ int luo_file_freeze(struct luo_file_set *file_set, goto err_unfreeze; } - strscpy(file_ser[i].compatible, luo_file->fh->compatible, - sizeof(file_ser[i].compatible)); - file_ser[i].data = luo_file->serialized_data; - file_ser[i].token = luo_file->token; - i++; + strscpy(file_ser->compatible, luo_file->fh->compatible, + sizeof(file_ser->compatible)); + file_ser->data = luo_file->serialized_data; + file_ser->token = luo_file->token; } + kho_block_it_finalize(&it); file_set_ser->count = file_set->count; - if (file_set->files) - file_set_ser->files = virt_to_phys(file_set->files); + file_set_ser->files = kho_block_set_head_pa(&file_set->block_set); return 0; @@ -741,14 +707,12 @@ int luo_file_finish(struct luo_file_set *file_set) module_put(luo_file->fh->ops->owner); list_del(&luo_file->list); file_set->count--; + kho_block_shrink(&file_set->block_set, file_set->count); mutex_destroy(&luo_file->mutex); kfree(luo_file); } - if (file_set->files) { - kho_restore_free(file_set->files); - file_set->files = NULL; - } + kho_block_set_destroy(&file_set->block_set); return 0; } @@ -822,16 +786,18 @@ int luo_file_deserialize(struct luo_file_set *file_set, struct luo_file_set_ser *file_set_ser) { struct luo_file_ser *file_ser; + struct kho_block_it it; int err; - u64 i; if (!file_set_ser->files) { WARN_ON(file_set_ser->count); return 0; } - file_set->count = file_set_ser->count; - file_set->files = phys_to_virt(file_set_ser->files); + file_set->count = 0; + err = kho_block_set_restore(&file_set->block_set, file_set_ser->files); + if (err) + return err; /* * Note on error handling: @@ -848,25 +814,50 @@ int luo_file_deserialize(struct luo_file_set *file_set, * userspace to detect the failure and trigger a reboot, which will * reliably reset devices and reclaim memory. 
*/ - file_ser = file_set->files; - for (i = 0; i < file_set->count; i++) { - err = luo_file_deserialize_one(file_set, &file_ser[i]); + kho_block_it_init(&it, &file_set->block_set); + while ((file_ser = kho_block_it_read_entry(&it))) { + err = luo_file_deserialize_one(file_set, file_ser); if (err) - return err; + goto err_destroy_blocks; + file_set->count++; + } + + if (file_set->count != file_set_ser->count) { + pr_warn("File count mismatch: expected %llu, found %llu\n", + file_set_ser->count, file_set->count); + err = -EINVAL; + goto err_destroy_blocks; } return 0; + +err_destroy_blocks: + while (!list_empty(&file_set->files_list)) { + struct luo_file *luo_file; + + luo_file = list_first_entry(&file_set->files_list, + struct luo_file, list); + list_del(&luo_file->list); + module_put(luo_file->fh->ops->owner); + mutex_destroy(&luo_file->mutex); + kfree(luo_file); + } + file_set->count = 0; + kho_block_set_destroy(&file_set->block_set); + return err; } void luo_file_set_init(struct luo_file_set *file_set) { INIT_LIST_HEAD(&file_set->files_list); + kho_block_set_init(&file_set->block_set, sizeof(struct luo_file_ser)); } void luo_file_set_destroy(struct luo_file_set *file_set) { WARN_ON(file_set->count); WARN_ON(!list_empty(&file_set->files_list)); + WARN_ON(!kho_block_set_is_empty(&file_set->block_set)); } /** diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h index ee18f9a11b91..64879ffe7378 100644 --- a/kernel/liveupdate/luo_internal.h +++ b/kernel/liveupdate/luo_internal.h @@ -10,6 +10,7 @@ #include #include +#include struct luo_ucmd { void __user *ubuffer; @@ -44,14 +45,13 @@ static inline int luo_ucmd_respond(struct luo_ucmd *ucmd, * struct luo_file_set - A set of files that belong to the same sessions. * @files_list: An ordered list of files associated with this session, it is * ordered by preservation time. - * @files: The physically contiguous memory block that holds the serialized - * state of files. 
+ * @block_set: The set of serialization blocks. * @count: A counter tracking the number of files currently stored in the * @files_list for this session. */ struct luo_file_set { struct list_head files_list; - struct luo_file_ser *files; + struct kho_block_set block_set; u64 count; }; -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:16 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:16 +0000 Subject: [PATCH v5 12/13] selftests/liveupdate: Add stress-sessions kexec test In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-13-pasha.tatashin@soleen.com> Add a new test that creates 2000 LUO sessions before a kexec reboot and verifies their presence after the reboot. This ensures that the linked-block serialization mechanism works correctly for a large number of sessions. Acked-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav (Google) Signed-off-by: Pasha Tatashin --- tools/testing/selftests/liveupdate/Makefile | 1 + .../liveupdate/luo_stress_sessions.c | 102 ++++++++++++++++++ 2 files changed, 103 insertions(+) create mode 100644 tools/testing/selftests/liveupdate/luo_stress_sessions.c diff --git a/tools/testing/selftests/liveupdate/Makefile b/tools/testing/selftests/liveupdate/Makefile index 080754787ede..ed7534468386 100644 --- a/tools/testing/selftests/liveupdate/Makefile +++ b/tools/testing/selftests/liveupdate/Makefile @@ -6,6 +6,7 @@ TEST_GEN_PROGS += liveupdate TEST_GEN_PROGS_EXTENDED += luo_kexec_simple TEST_GEN_PROGS_EXTENDED += luo_multi_session +TEST_GEN_PROGS_EXTENDED += luo_stress_sessions TEST_FILES += do_kexec.sh diff --git a/tools/testing/selftests/liveupdate/luo_stress_sessions.c b/tools/testing/selftests/liveupdate/luo_stress_sessions.c new file mode 100644 index 000000000000..f201b1839d1d --- /dev/null +++ b/tools/testing/selftests/liveupdate/luo_stress_sessions.c @@ -0,0 +1,102 @@ +// 
SPDX-License-Identifier: GPL-2.0-only + +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + * + * Validate that LUO can handle a large number of sessions across a kexec + * reboot. + */ + +#include +#include +#include "luo_test_utils.h" + +#define NUM_SESSIONS 2000 +#define STATE_SESSION_NAME "kexec_many_state" +#define STATE_MEMFD_TOKEN 999 + +/* Stage 1: Executed before the kexec reboot. */ +static void run_stage_1(int luo_fd) +{ + int ret, i; + + ksft_print_msg("[STAGE 1] Increasing ulimit for open files...\n"); + ret = luo_ensure_nofile_limit(NUM_SESSIONS); + if (ret == -EPERM) + ksft_exit_skip("Insufficient privileges to set RLIMIT_NOFILE\n"); + if (ret < 0) + ksft_exit_fail_msg("luo_ensure_nofile_limit failed: %s\n", strerror(-ret)); + + ksft_print_msg("[STAGE 1] Creating state file for next stage (2)...\n"); + create_state_file(luo_fd, STATE_SESSION_NAME, STATE_MEMFD_TOKEN, 2); + + ksft_print_msg("[STAGE 1] Creating %d sessions...\n", NUM_SESSIONS); + + for (i = 0; i < NUM_SESSIONS; i++) { + char name[LIVEUPDATE_SESSION_NAME_LENGTH]; + int s_fd; + + snprintf(name, sizeof(name), "many-test-%d", i); + s_fd = luo_create_session(luo_fd, name); + if (s_fd < 0) { + fail_exit("luo_create_session for '%s' at index %d", + name, i); + } + } + + ksft_print_msg("[STAGE 1] Successfully created %d sessions.\n", + NUM_SESSIONS); + + close(luo_fd); + daemonize_and_wait(); +} + +/* Stage 2: Executed after the kexec reboot. 
*/ +static void run_stage_2(int luo_fd, int state_session_fd) +{ + int i, stage; + + ksft_print_msg("[STAGE 2] Starting post-kexec verification...\n"); + + restore_and_read_stage(state_session_fd, STATE_MEMFD_TOKEN, &stage); + if (stage != 2) { + fail_exit("Expected stage 2, but state file contains %d", + stage); + } + + ksft_print_msg("[STAGE 2] Retrieving and finishing %d sessions...\n", + NUM_SESSIONS); + + for (i = 0; i < NUM_SESSIONS; i++) { + char name[LIVEUPDATE_SESSION_NAME_LENGTH]; + int s_fd; + + snprintf(name, sizeof(name), "many-test-%d", i); + s_fd = luo_retrieve_session(luo_fd, name); + if (s_fd < 0) { + fail_exit("luo_retrieve_session for '%s' at index %d", + name, i); + } + + if (luo_session_finish(s_fd) < 0) { + fail_exit("luo_session_finish for '%s' at index %d", + name, i); + } + close(s_fd); + } + + ksft_print_msg("[STAGE 2] Finalizing state session...\n"); + if (luo_session_finish(state_session_fd) < 0) + fail_exit("luo_session_finish for state session"); + close(state_session_fd); + + ksft_print_msg("\n--- MANY-SESSIONS KEXEC TEST PASSED (%d sessions) ---\n", + NUM_SESSIONS); +} + +int main(int argc, char *argv[]) +{ + return luo_test(argc, argv, STATE_SESSION_NAME, + run_stage_1, run_stage_2); +} -- 2.53.0 From pasha.tatashin at soleen.com Mon Jun 1 20:17:17 2026 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 2 Jun 2026 03:17:17 +0000 Subject: [PATCH v5 13/13] selftests/liveupdate: Add stress-files kexec test In-Reply-To: <20260602031717.197696-1-pasha.tatashin@soleen.com> References: <20260602031717.197696-1-pasha.tatashin@soleen.com> Message-ID: <20260602031717.197696-14-pasha.tatashin@soleen.com> Add a new luo_stress_files kexec test that verifies preserving and retrieving 500 files across a kexec reboot. 
Reviewed-by: Pratyush Yadav (Google) Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Pasha Tatashin --- tools/testing/selftests/liveupdate/Makefile | 1 + .../selftests/liveupdate/luo_stress_files.c | 97 +++++++++++++++++++ 2 files changed, 98 insertions(+) create mode 100644 tools/testing/selftests/liveupdate/luo_stress_files.c diff --git a/tools/testing/selftests/liveupdate/Makefile b/tools/testing/selftests/liveupdate/Makefile index ed7534468386..30689d22cb02 100644 --- a/tools/testing/selftests/liveupdate/Makefile +++ b/tools/testing/selftests/liveupdate/Makefile @@ -7,6 +7,7 @@ TEST_GEN_PROGS += liveupdate TEST_GEN_PROGS_EXTENDED += luo_kexec_simple TEST_GEN_PROGS_EXTENDED += luo_multi_session TEST_GEN_PROGS_EXTENDED += luo_stress_sessions +TEST_GEN_PROGS_EXTENDED += luo_stress_files TEST_FILES += do_kexec.sh diff --git a/tools/testing/selftests/liveupdate/luo_stress_files.c b/tools/testing/selftests/liveupdate/luo_stress_files.c new file mode 100644 index 000000000000..0cdf9cd4bac7 --- /dev/null +++ b/tools/testing/selftests/liveupdate/luo_stress_files.c @@ -0,0 +1,97 @@ +// SPDX-License-Identifier: GPL-2.0-only + +/* + * Copyright (c) 2026, Google LLC. + * Pasha Tatashin + * + * Validate that LUO can handle a large number of files per session across + * a kexec reboot. + */ + +#include +#include +#include "luo_test_utils.h" + +#define NUM_FILES 500 +#define STATE_SESSION_NAME "kexec_many_files_state" +#define STATE_MEMFD_TOKEN 9999 +#define TEST_SESSION_NAME "many_files_session" + +/* Stage 1: Executed before the kexec reboot. 
*/ +static void run_stage_1(int luo_fd) +{ + int session_fd, i; + + ksft_print_msg("[STAGE 1] Creating state file for next stage (2)...\n"); + create_state_file(luo_fd, STATE_SESSION_NAME, STATE_MEMFD_TOKEN, 2); + + ksft_print_msg("[STAGE 1] Creating test session '%s'...\n", TEST_SESSION_NAME); + session_fd = luo_create_session(luo_fd, TEST_SESSION_NAME); + if (session_fd < 0) + fail_exit("luo_create_session"); + + ksft_print_msg("[STAGE 1] Preserving %d files...\n", NUM_FILES); + for (i = 0; i < NUM_FILES; i++) { + char data[64]; + + snprintf(data, sizeof(data), "file-data-%d", i); + if (create_and_preserve_memfd(session_fd, i, data) < 0) + fail_exit("create_and_preserve_memfd for index %d", i); + } + + ksft_print_msg("[STAGE 1] Successfully preserved %d files.\n", NUM_FILES); + + close(luo_fd); + daemonize_and_wait(); +} + +/* Stage 2: Executed after the kexec reboot. */ +static void run_stage_2(int luo_fd, int state_session_fd) +{ + int session_fd; + int i, stage; + + ksft_print_msg("[STAGE 2] Starting post-kexec verification...\n"); + + restore_and_read_stage(state_session_fd, STATE_MEMFD_TOKEN, &stage); + if (stage != 2) { + fail_exit("Expected stage 2, but state file contains %d", + stage); + } + + ksft_print_msg("[STAGE 2] Retrieving test session '%s'...\n", TEST_SESSION_NAME); + session_fd = luo_retrieve_session(luo_fd, TEST_SESSION_NAME); + if (session_fd < 0) + fail_exit("luo_retrieve_session"); + + ksft_print_msg("[STAGE 2] Verifying %d files...\n", NUM_FILES); + for (i = 0; i < NUM_FILES; i++) { + char data[64]; + int fd; + + snprintf(data, sizeof(data), "file-data-%d", i); + fd = restore_and_verify_memfd(session_fd, i, data); + if (fd < 0) + fail_exit("restore_and_verify_memfd for index %d", i); + close(fd); + } + + ksft_print_msg("[STAGE 2] Finishing test session...\n"); + if (luo_session_finish(session_fd) < 0) + fail_exit("luo_session_finish for test session"); + close(session_fd); + + ksft_print_msg("[STAGE 2] Finalizing state session...\n"); + if 
(luo_session_finish(state_session_fd) < 0)
+		fail_exit("luo_session_finish for state session");
+	close(state_session_fd);
+
+	ksft_print_msg("\n--- MANY-FILES KEXEC TEST PASSED (%d files) ---\n",
+		       NUM_FILES);
+}
+
+int main(int argc, char *argv[])
+{
+	return luo_test(argc, argv, STATE_SESSION_NAME,
+			run_stage_1, run_stage_2);
+}
--
2.53.0

From kjlx at templeofstupid.com Mon Jun 1 21:49:14 2026
From: kjlx at templeofstupid.com (Krister Johansen)
Date: Mon, 1 Jun 2026 21:49:14 -0700
Subject: [PATCH v5][makedumpfile 0/9] btf/kallsyms based makedumpfile extension for mm page filtering
In-Reply-To:
References: <20260414102656.55200-1-ltao@redhat.com>
Message-ID:

Hi Tao,

On Tue, Jun 02, 2026 at 03:04:12PM +1200, Tao Liu wrote:
> On Tue, Jun 2, 2026 at 12:47 PM Krister Johansen
> wrote:
> Thanks for your in-depth explanation, it's very helpful to me for
> designing the data erasing function.

Thanks for the great discussion.

> > On Tue, Jun 02, 2026 at 11:12:05AM +1200, Tao Liu wrote:
> > > On Sat, May 30, 2026 at 9:11 AM Krister Johansen
> > I wondered about this, but for data-structures that are smaller than a
> > page, wouldn't that mean that we're erasing other content? The "erase"
> > plugins memset the output data to a chosen value (or 0), whereas the
> > filtering just drops the page. Couldn't this also lead to a situation
> > where the debugger can't find the page at all, versus giving us one
> > that's sanitized? (I do understand why you want to drop the pages for
> > the GPU cases)
>
> Frankly I didn't consider the data erasing as in-depth as you did. I
> think you are right, makedumpfile needs to know which extensions
> handle data erasing and which handle mm page filtering. I guess the mm
> page filtering extensions will need to perform a "dry-run" filter
> first, in case the "data erasing" extensions break any useful data
> structure. In this step, "dry-run" will only record pfn numbers of the
> pages that will be filtered. Then "data erasing" extensions are
> called, so all the sensitive data is memset to 0. Finally, all desired
> pages are filtered out based on the previous recording.
>
> With this, "data erase" and "page filtering" will not interfere with
> each other. What do you think?

This is a great point. It's probably worth documenting the precedence
order in which these callbacks are expected to be applied. Naively, I
might expect filtering pages to take precedence over erasing data
structures. For the GPU cases, these are orthogonal. However, for
something where a user might be both trying to filter the page and erase
matching content, we don't have any rules defined. It's probably less
surprising to allow pages to be filtered first. (I think it is this way
in the code.) It also prevents the page filtering from completely
filtering a page.

> > > > Would you be willing to modify the extension registration options to
> > > > allow an extension to specify what kind it is? That way, in the future
> > >
> > > I'm not sure what you mean by "what kind". Do you mean an extension
> > > needs to tell makedumpfile what purpose it is for when loading?
> >
> > Yes, sorry I wasn't clear in writing the question. Stating this
> > differently, if we want to allow the ability for different extensions to
> > do different things, how do the extensions declare to makedumpfile what
> > they can do, so that it knows where to invoke their callbacks, and what
> > callbacks of theirs to invoke.
> >
> > Looking at patch 6/9, right now run_extension_callback() is invoked
> > from __exclude_unnecessary_pages and always calls the
> > "extension_callback" symbol in the module. This makes sense for a
> > single extension type that's focused on filtering pages. However, if we
> > wanted to have multiple different extensions, this might be more
> > difficult.
> >
> > If we could determine what type of functionality the module implements
> > in load_extensions, then we could tell if this is a page filtering
> > extension, an erase extension, or some other kind of extension.
> >
> > For example, for an erase filter, perhaps we would want two callbacks:
> > one to set up the ranges to filter "extension_gather_callback" and
> > another to actually check the address range to see if it is filtered,
> > "extension_filter_data_callback"
> >
> > I'm not sure about the names. "extension_callback" seems generic, but
> > this has a specific purpose. It's an "extension_filter_page_callback"
> >
> > I may be overengineering this a bit, but having makedumpfile pass an ops
> > vector to the extension in a load function could help here. Then the
> > module's load function fills out the vector with the functions it
> > supports. Depending on what's implemented, these can be placed into
> > different callback lists to get invoked at different points in the
> > program (e.g. one at pfn filter time, another in filter_data_buffer,
> > etc).
> >
> > It sounds like you had a plan here, though. Were you thinking of adding
> > new extension types a different way?
>
> I see your idea: makedumpfile predefines a few hook points at
> different stages, and extensions can register their callbacks to these
> hook points. For now I think 2 hook points are enough, one for page
> filtering and the other one for registering the data erasing, which
> definitely shouldn't be within __exclude_unnecessary_pages().
>
> I'm willing to modify the code. Such as implementing a hooking point
> registration/management. But since I haven't worked on the data erasing
> functions so far, the design might be superficial. Personally, I'd
> prefer to do this along with the data erasing functions in the next
> independent patchset, considering the current patchset already includes
> plenty of code/function implementations. @maintainers, What's your
> opinion?
Just to clarify, I'm not asking that you implement any erase
functionality in the current patchset. Rather, I'm asking if there's a
way to implement the current functionality such that the extension
modules won't need recompilation when a new extension type is
introduced.

I think there are a number of different ways to do this, but I didn't
want to be overly prescriptive in my feedback.

Thanks again,

-K

From rppt at kernel.org Tue Jun 2 01:13:34 2026
From: rppt at kernel.org (Mike Rapoport)
Date: Tue, 02 Jun 2026 11:13:34 +0300
Subject: [PATCH v4 01/13] liveupdate: change file_set->count type to u64 for type safety
In-Reply-To: <20260530221938.115978-2-pasha.tatashin@soleen.com>
References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-2-pasha.tatashin@soleen.com>
Message-ID: <178038801483.119771.5551368813719436713.b4-review@b4>

On Sat, 30 May 2026 22:19:26 +0000, Pasha Tatashin wrote:
> This improves type safety and aligns the in-memory file_set->count with
> the serialized count type. It avoids potential truncation or sign
> conversion mismatch issues.

Acked-by: Mike Rapoport (Microsoft)

--
Sincerely yours,
Mike.
From rppt at kernel.org Tue Jun 2 01:13:34 2026
From: rppt at kernel.org (Mike Rapoport)
Date: Tue, 02 Jun 2026 11:13:34 +0300
Subject: [PATCH v4 02/13] liveupdate: avoid mixing cleanup guards with goto in luo_session_retrieve_fd
In-Reply-To: <20260530221938.115978-3-pasha.tatashin@soleen.com>
References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-3-pasha.tatashin@soleen.com>
Message-ID: <178038801485.119771.9514973100282773342.b4-review@b4>

On Sat, 30 May 2026 22:19:27 +0000, Pasha Tatashin wrote:
> diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c
> index 146414933977..8d9201c25412 100644
> --- a/kernel/liveupdate/luo_session.c
> +++ b/kernel/liveupdate/luo_session.c
> @@ -291,25 +291,24 @@ static int luo_session_retrieve_fd(struct luo_session *session,
>  	if (argp->fd < 0)
>  		return argp->fd;
>
> -	guard(mutex)(&session->mutex);
> -	err = luo_retrieve_file(&session->file_set, argp->token, &file);
> -	if (err < 0)
> -		goto err_put_fd;
> +	scoped_guard(mutex, &session->mutex) {
> +		err = luo_retrieve_file(&session->file_set, argp->token, &file);
> +		if (err < 0) {
> +			put_unused_fd(argp->fd);
> +			return err;

I don't like piling up error handling inside if (err) statements.
As we need the lock only for luo_retrieve_file(), I think it's better to
drop the guard and use goto:

	mutex_lock(&session->mutex);
	err = luo_retrieve_file(&session->file_set, argp->token, &file);
	mutex_unlock(&session->mutex);
	if (err)
		...

--
Sincerely yours,
Mike.
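For illustration only (this is not code from the thread): a minimal userspace sketch of the shape suggested in the review above, where the lock is held just around the one call that needs it and the error path unwinds through a single goto label. Everything prefixed with demo_ is hypothetical. The toy demo_mutex merely records whether it is held and stands in for the kernel mutex, and the counter stands in for put_unused_fd().

```c
#include <errno.h>

/* Hypothetical stand-in for a mutex so the sketch builds in userspace. */
struct demo_mutex {
	int held;
};

static void demo_mutex_lock(struct demo_mutex *m)   { m->held = 1; }
static void demo_mutex_unlock(struct demo_mutex *m) { m->held = 0; }

/* Hypothetical stand-in for a session: tokens 0..nr_files-1 exist. */
struct demo_session {
	struct demo_mutex mutex;
	int nr_files;
};

/* Counts how often the error path ran; stands in for put_unused_fd(). */
static int demo_fd_puts;

/* Stand-in for the call that must run under the lock. */
static int demo_retrieve_file(struct demo_session *s, int token, int *file)
{
	if (!s->mutex.held)
		return -EPERM;
	if (token < 0 || token >= s->nr_files)
		return -ENOENT;
	*file = token;
	return 0;
}

static int demo_session_retrieve_fd(struct demo_session *s, int token,
				    int *file)
{
	int err;

	/* Hold the lock only for the call that needs it. */
	demo_mutex_lock(&s->mutex);
	err = demo_retrieve_file(s, token, file);
	demo_mutex_unlock(&s->mutex);
	if (err)
		goto err_put_fd;

	return 0;

err_put_fd:
	/* Single unwinding point; no lock is held here. */
	demo_fd_puts++;
	return err;
}
```

Because the lock is already dropped before the error check, the goto label never has to reason about lock state, which seems to be the point of avoiding a cleanup guard mixed with goto.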
From rppt at kernel.org Tue Jun 2 01:13:34 2026
From: rppt at kernel.org (Mike Rapoport)
Date: Tue, 02 Jun 2026 11:13:34 +0300
Subject: [PATCH v4 03/13] liveupdate: centralize state management into struct luo_ser
In-Reply-To: <20260530221938.115978-4-pasha.tatashin@soleen.com>
References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-4-pasha.tatashin@soleen.com>
Message-ID: <178038801487.119771.6308607614059754603.b4-review@b4>

On Sat, 30 May 2026 22:19:28 +0000, Pasha Tatashin wrote:
> diff --git a/kernel/liveupdate/luo_flb.c b/kernel/liveupdate/luo_flb.c
> index 8f5c5dd01cd0..c8dd30b41238 100644
> --- a/kernel/liveupdate/luo_flb.c
> +++ b/kernel/liveupdate/luo_flb.c
> @@ -579,53 +565,18 @@ int __init luo_flb_setup_outgoing(void *fdt_out)
> [ ... skip 18 lines ... ]
> -	offset = fdt_subnode_offset(fdt_in, 0, LUO_FDT_FLB_NODE_NAME);
> -	if (offset < 0) {
> -		pr_err("Unable to get FLB node [%s]\n", LUO_FDT_FLB_NODE_NAME);
> -
> -		return -ENOENT;
> +	if (flbs_pa) {

I like

	if (!flbs_pa)
		return;

more

>
> diff --git a/kernel/liveupdate/luo_session.c b/kernel/liveupdate/luo_session.c
> index 8d9201c25412..3b760fefa7b9 100644
> --- a/kernel/liveupdate/luo_session.c
> +++ b/kernel/liveupdate/luo_session.c
> @@ -497,75 +494,34 @@ int luo_session_retrieve(const char *name, struct file **filep)
> [ ... skip 58 lines ... ]
> +	if (sessions_pa) {
> +		header_ser = phys_to_virt(sessions_pa);
> +		luo_session_global.incoming.header_ser = header_ser;
> +		luo_session_global.incoming.ser = (void *)(header_ser + 1);
> +		luo_session_global.incoming.active = true;
>  	}

Ditto

--
Sincerely yours,
Mike.
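As a small illustration of the early-return style preferred in the review above (a sketch, not code from the series): the guard clause handles the "nothing to restore" case first, so the main body stays unindented. The demo_ names are hypothetical, and a plain pointer stands in for the physical address plus phys_to_virt() used in the quoted patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the incoming session state. */
struct demo_incoming {
	const void *header;
	bool active;
};

static void demo_setup_incoming(struct demo_incoming *in, const void *header)
{
	/* Guard clause: nothing was handed over, so there is nothing to do. */
	if (!header)
		return;

	in->header = header;
	in->active = true;
}
```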
From rppt at kernel.org Tue Jun 2 01:13:34 2026
From: rppt at kernel.org (Mike Rapoport)
Date: Tue, 02 Jun 2026 11:13:34 +0300
Subject: [PATCH v4 07/13] kho: add support for linked-block serialization
In-Reply-To: <20260530221938.115978-8-pasha.tatashin@soleen.com>
References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-8-pasha.tatashin@soleen.com>
Message-ID: <178038801491.119771.18384706761138506132.b4-review@b4>

On Sat, 30 May 2026 22:19:32 +0000, Pasha Tatashin wrote:
> diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h
> new file mode 100644
> index 000000000000..5e6b87b1befa
> --- /dev/null
> +++ b/include/linux/kho_block.h
> @@ -0,0 +1,79 @@
> [ ... skip 19 lines ... ]
> +	struct list_head list;
> +	struct kho_block_header_ser *ser;
> +};
> +
> +/**
> + * struct kho_block_set - A set of blocks that belong to the same object.

"same object" sounds off to me. The blocks belong to the same module?
user? Thoughts?

> + * @blocks: The list of serialization blocks (struct kho_block).
> + * @nblocks: The number of allocated serialization blocks.
> + * @head_pa: Physical address of the first block header.
> + * @entry_size: The size of each entry in the blocks.

I think it's "... entry in a block"

> [ ... skip 42 lines ... ]
> +
> +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs);
> +void *kho_block_it_next(struct kho_block_it *it);
> +void *kho_block_it_read(struct kho_block_it *it);
> +void *kho_block_it_prev(struct kho_block_it *it);
> +void kho_block_it_finalize(struct kho_block_it *it);

These operate on block sets, which should be reflected in the names.
They can use a kho_blocks_ prefix to avoid overly long names.

>
> diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c
> new file mode 100644
> index 000000000000..a4e650af946f
> --- /dev/null
> +++ b/kernel/liveupdate/kho_block.c
> @@ -0,0 +1,384 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright (c) 2026, Google LLC.
> + * Pasha Tatashin > + */ > + > +/** > + * DOC: KHO Serialization Blocks > + * > + * KHO provides a mechanism to preserve stateful data across a kexec handover > + * by serializing it into memory blocks. This file provides the common "This file" does not look good in HTML docs. > [ ... skip 15 lines ... ] > + > +/* > + * Safeguard limit for the number of serialization blocks. This is used to > + * prevent infinite loops and excessive memory allocation in case of memory > + * corruption in the preserved state. > + */ Can you add how much memory it is and how many entries with, say, 4 u64 it can accommodate? > [ ... skip 13 lines ... ] > +{ > + if (unlikely(!bs->count_per_block)) { > + bs->count_per_block = (KHO_BLOCK_SIZE - > + sizeof(struct kho_block_header_ser)) / > + bs->entry_size; > + WARN_ON(!bs->count_per_block); Don't you want to set count_per_block in _init()? > [ ... skip 29 lines ... ] > + if (!block) > + return -ENOMEM; > + > + block->ser = ser; > + last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); > + list_add_tail(&block->list, &bs->blocks); No locks? > [ ... skip 12 lines ... ] > + * @bs: The block set. > + * @count: The current number of entries. > + * > + * This function handles the dynamic expansion of a block set. It allocates > + * and links a new serialization block if the provided entry count matches > + * the current total capacity of the set. This is a weird semantics for a generic API. I'd expect _grow() would add count - current_count blocks. > [ ... skip 25 lines ... ] > +} > + > +/** > + * kho_block_shrink - Conditionally destroy the last block in a block set. > + * @bs: The block set. > + * @count: The current number of entries across all blocks. Maybe ... of valid entries? > + * > + * This function checks if the last block in the set is redundant based on the > + * total entry count and the capacity of the preceding blocks. 
If the entry > + * count can be accommodated by the blocks that come before the last one, the > + * last block is destroyed and removed from the set. This should mention that it's the caller responsibility to ensure that entries are removed in the right order. > [ ... skip 49 lines ... ] > + > + fast = phys_to_virt(fast->next); > + slow = phys_to_virt(slow->next); > + > + if (slow == fast) { > + pr_err("Cyclic list detected\n"); Maybe "block set is corrupted"? > + return false; > + } > + } > + > + return true; > +} > + > +/** > + * kho_block_restore - Restore a block set from a physical address. > + * @bs: The block set to restore. > + * @head_pa: Physical address of the first block header. I'd mention that the block set should be allocated and initialized > [ ... skip 10 lines ... ] > + bs->incoming = true; > + if (!head_pa) > + return 0; > + > + bs->head_pa = head_pa; > + if (!kho_cyclic_blocks_check(bs)) { if (kho_block_set_cyclic()) reads nicer IMO > [ ... skip 87 lines ... ] > +{ > + if (!it->block) > + return NULL; > + > + if (it->i == kho_block_count_per_block(it->bs)) { > + it->block->ser->count = it->i; Why iterator updates ser->count? > + if (list_is_last(&it->block->list, &it->bs->blocks)) > + return NULL; > + it->block = list_next_entry(it->block, list); > + it->i = 0; > + } > + > + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); In a month we'll need an LLM's help to understand what it does. > +} > + > +/** > + * kho_block_it_read - Return the next entry slot for reading. > + * @it: The block iterator. And what is the conceptual difference between this and _it_next()? > [ ... skip 49 lines ... ] > + * @it: The block iterator. > + */ > +void kho_block_it_finalize(struct kho_block_it *it) > +{ > + if (it->block) > + it->block->ser->count = it->i; So, it looks like the intention of _it_next is for write, and this ends a write iteration. I think the names should be adjusted to make it clearer. -- Sincerely yours, Mike. 
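A minimal sketch of the block layout discussed in the review above, assuming a 4 KiB block and a 16-byte header that holds `next` and `count`. It shows how the per-block capacity follows from the entry size and what the slot arithmetic in the iterator computes. The names DEMO_BLOCK_SIZE, struct demo_block_header, demo_count_per_block() and demo_entry() are hypothetical and do not come from the patch; the real KHO_BLOCK_SIZE and struct kho_block_header_ser may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed block size; the patch's KHO_BLOCK_SIZE may be different. */
#define DEMO_BLOCK_SIZE 4096u

/* Hypothetical stand-in for the serialized block header. */
struct demo_block_header {
	uint64_t next;   /* physical address of the next block, 0 ends the chain */
	uint64_t count;  /* number of valid entries stored in this block */
};

/* Number of fixed-size entries that fit in one block after the header. */
static inline size_t demo_count_per_block(size_t entry_size)
{
	if (!entry_size)
		return 0;
	return (DEMO_BLOCK_SIZE - sizeof(struct demo_block_header)) / entry_size;
}

/*
 * Address of slot idx. Entries start immediately after the header, so
 * "hdr + 1" is the first entry and each slot is entry_size bytes wide.
 */
static inline void *demo_entry(struct demo_block_header *hdr,
			       size_t entry_size, size_t idx)
{
	unsigned char *entries = (unsigned char *)(hdr + 1);

	return entries + idx * entry_size;
}
```

Under these assumptions, an entry made of four u64 values (32 bytes) gives (4096 - 16) / 32 = 127 entries per block, so a limit of N blocks would bound a set at 127 * N such entries.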
From rppt at kernel.org Tue Jun 2 01:13:34 2026 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 02 Jun 2026 11:13:34 +0300 Subject: [PATCH v4 08/13] liveupdate: defer session block allocation and PA setting In-Reply-To: <20260530221938.115978-9-pasha.tatashin@soleen.com> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-9-pasha.tatashin@soleen.com> Message-ID: <178038801492.119771.3419366349068848854.b4-review@b4> On Sat, 30 May 2026 22:19:33 +0000, Pasha Tatashin wrote: > Currently, luo_session_setup_outgoing() allocates the session block and "liveupdate: defer session block allocation and PA setting" PA as "Public Assistance"? ;-) Let's spell it out. -- Sincerely yours, Mike. From baoquan.he at linux.dev Tue Jun 2 02:00:40 2026 From: baoquan.he at linux.dev (Baoquan He) Date: Tue, 2 Jun 2026 17:00:40 +0800 Subject: [PATCH] kexec_file: skip checksum verification when relocations aren't needed In-Reply-To: <20260601191136.799134-1-mclapinski@google.com> References: <20260601191136.799134-1-mclapinski@google.com> Message-ID: On 06/01/26 at 09:11pm, Michal Clapinski wrote: ...snip... > + /* > + * If all segments were loaded into contiguous memory, there will be no > + * relocations. In that case there is no risk of memory corruption by > + * uncancelled DMA and we can skip checksum calculation. > + */ > + for (i = 0; i < image->nr_segments; i++) { > + if (!image->segment_cma[i]) { > + can_skip_checksum = false; > + break; > + } > + } > + > + if (can_skip_checksum) { > + pr_info("disabling checksum verification in purgatory\n"); Use pr_debug() or kexec_dprintk() instead, since there is no need to notify users about a normal action? Other than that, the patch looks good to me overall.
Acked-by: Baoquan He > + goto skip_checksum; > + } > + > for (j = i = 0; i < image->nr_segments; i++) { > struct kexec_segment *ksegment; > > @@ -867,6 +885,7 @@ static int kexec_calculate_store_digests(struct kimage *image) > j++; > } > > +skip_checksum: > sha256_final(&sctx, digest); > > ret = kexec_purgatory_get_set_symbol(image, "purgatory_sha_regions", > -- > 2.54.0.929.g9b7fa37559-goog > From rppt at kernel.org Tue Jun 2 02:04:24 2026 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 2 Jun 2026 12:04:24 +0300 Subject: [PATCH v4 07/13] kho: add support for linked-block serialization In-Reply-To: <178038801491.119771.18384706761138506132.b4-review@b4> References: <20260530221938.115978-1-pasha.tatashin@soleen.com> <20260530221938.115978-8-pasha.tatashin@soleen.com> <178038801491.119771.18384706761138506132.b4-review@b4> Message-ID: I sent it before seeing v5, so some of those are already addressed, but please take a look anyway. On Tue, Jun 02, 2026 at 11:13:34AM +0300, Mike Rapoport wrote: > On Sat, 30 May 2026 22:19:32 +0000, Pasha Tatashin wrote: > > diff --git a/include/linux/kho_block.h b/include/linux/kho_block.h > > new file mode 100644 > > index 000000000000..5e6b87b1befa > > --- /dev/null > > +++ b/include/linux/kho_block.h > > @@ -0,0 +1,79 @@ > > [ ... skip 19 lines ... ] > > + struct list_head list; > > + struct kho_block_header_ser *ser; > > +}; > > + > > +/** > > + * struct kho_block_set - A set of blocks that belong to the same object. > > "same object" sounds off to me. The blocks belong to the same module? > user? > > Thoughts? > > > + * @blocks: The list of serialization blocks (struct kho_block). > > + * @nblocks: The number of allocated serialization blocks. > > + * @head_pa: Physical address of the first block header. > > + * @entry_size: The size of each entry in the blocks. > > I think it's "... entry in a block" > > > [ ... skip 42 lines ... 
] > > + > > +void kho_block_it_init(struct kho_block_it *it, struct kho_block_set *bs); > > +void *kho_block_it_next(struct kho_block_it *it); > > +void *kho_block_it_read(struct kho_block_it *it); > > +void *kho_block_it_prev(struct kho_block_it *it); > > +void kho_block_it_finalize(struct kho_block_it *it); > > These operate on block sets, should be reflected in the names. > Can be kho_blocks_ to avoid too long names. > > > > > diff --git a/kernel/liveupdate/kho_block.c b/kernel/liveupdate/kho_block.c > > new file mode 100644 > > index 000000000000..a4e650af946f > > --- /dev/null > > +++ b/kernel/liveupdate/kho_block.c > > @@ -0,0 +1,384 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > + > > +/* > > + * Copyright (c) 2026, Google LLC. > > + * Pasha Tatashin > > + */ > > + > > +/** > > + * DOC: KHO Serialization Blocks > > + * > > + * KHO provides a mechanism to preserve stateful data across a kexec handover > > + * by serializing it into memory blocks. This file provides the common > > "This file" does not look good in HTML docs. > > > [ ... skip 15 lines ... ] > > + > > +/* > > + * Safeguard limit for the number of serialization blocks. This is used to > > + * prevent infinite loops and excessive memory allocation in case of memory > > + * corruption in the preserved state. > > + */ > > Can you add how much memory it is and how many entries with, say, 4 u64 > it can accommodate? > > > [ ... skip 13 lines ... ] > > +{ > > + if (unlikely(!bs->count_per_block)) { > > + bs->count_per_block = (KHO_BLOCK_SIZE - > > + sizeof(struct kho_block_header_ser)) / > > + bs->entry_size; > > + WARN_ON(!bs->count_per_block); > > Don't you want to set count_per_block in _init()? > > > [ ... skip 29 lines ... ] > > + if (!block) > > + return -ENOMEM; > > + > > + block->ser = ser; > > + last = list_last_entry_or_null(&bs->blocks, struct kho_block, list); > > + list_add_tail(&block->list, &bs->blocks); > > No locks? > > > [ ... skip 12 lines ... ] > > + * @bs: The block set. 
> > + * @count: The current number of entries. > > + * > > + * This function handles the dynamic expansion of a block set. It allocates > > + * and links a new serialization block if the provided entry count matches > > + * the current total capacity of the set. > > This is a weird semantics for a generic API. I'd expect _grow() would > add count - current_count blocks. > > > [ ... skip 25 lines ... ] > > +} > > + > > +/** > > + * kho_block_shrink - Conditionally destroy the last block in a block set. > > + * @bs: The block set. > > + * @count: The current number of entries across all blocks. > > Maybe > ... of valid entries? > > > + * > > + * This function checks if the last block in the set is redundant based on the > > + * total entry count and the capacity of the preceding blocks. If the entry > > + * count can be accommodated by the blocks that come before the last one, the > > + * last block is destroyed and removed from the set. > > This should mention that it's the caller responsibility to ensure that > entries are removed in the right order. > > > [ ... skip 49 lines ... ] > > + > > + fast = phys_to_virt(fast->next); > > + slow = phys_to_virt(slow->next); > > + > > + if (slow == fast) { > > + pr_err("Cyclic list detected\n"); > > Maybe "block set is corrupted"? > > > + return false; > > + } > > + } > > + > > + return true; > > +} > > + > > +/** > > + * kho_block_restore - Restore a block set from a physical address. > > + * @bs: The block set to restore. > > + * @head_pa: Physical address of the first block header. > > I'd mention that the block set should be allocated and initialized > > > [ ... skip 10 lines ... ] > > + bs->incoming = true; > > + if (!head_pa) > > + return 0; > > + > > + bs->head_pa = head_pa; > > + if (!kho_cyclic_blocks_check(bs)) { > > if (kho_block_set_cyclic()) > > reads nicer IMO > > > [ ... skip 87 lines ... 
] > > +{ > > + if (!it->block) > > + return NULL; > > + > > + if (it->i == kho_block_count_per_block(it->bs)) { > > + it->block->ser->count = it->i; > > Why iterator updates ser->count? > > > + if (list_is_last(&it->block->list, &it->bs->blocks)) > > + return NULL; > > + it->block = list_next_entry(it->block, list); > > + it->i = 0; > > + } > > + > > + return (void *)(it->block->ser + 1) + (it->i++ * it->bs->entry_size); > > In a month we'll need an LLM's help to understand what it does. > > > +} > > + > > +/** > > + * kho_block_it_read - Return the next entry slot for reading. > > + * @it: The block iterator. > > And what is the conceptual difference between this and _it_next()? > > > [ ... skip 49 lines ... ] > > + * @it: The block iterator. > > + */ > > +void kho_block_it_finalize(struct kho_block_it *it) > > +{ > > + if (it->block) > > + it->block->ser->count = it->i; > > So, it looks like the intention of _it_next is for write, and this ends a > write iteration. > > I think the names should be adjusted to make it clearer. > > -- > Sincerely yours, > Mike. > -- Sincerely yours, Mike. From rppt at kernel.org Tue Jun 2 02:34:01 2026 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 2 Jun 2026 12:34:01 +0300 Subject: [PATCH v3 09/11] arm64: kdump: exclude non-dumpable reserved memory regions from vmcore In-Reply-To: References: <20260527032917.3385849-1-chenwandun1@gmail.com> <20260527032917.3385849-10-chenwandun1@gmail.com>

Message-ID: Hi Baoquan, On Mon, Jun 01, 2026 at 01:00:34PM +0800, Baoquan He wrote: > On 05/30/26 at 07:25pm, Mike Rapoport wrote: > > On Fri, May 29, 2026 at 04:08:41PM +0100, Will Deacon wrote: > > > On Wed, May 27, 2026 at 11:29:15AM +0800, Wandun Chen wrote: > ...snip... > > There are patches that move common code to kernel/crash_core.c: > > > > https://lore.kernel.org/all/20260525084932.934910-1-ruanjinjie at huawei.com > > > > Review from arch maintainers would be helpful there ;-) > > Before, Andrew would put patch candidates into his mm tree and trigger > testing. If any adjustment, he would take them off. Can we do the > similar thing for kexec/kdump patches, unless the patches are objected > explicitly? Yes, we can add patches that look reasonable and expose them in -next. Now is too late in the release cycle, let's start with it after -rc1. > Thanks > Baoquan -- Sincerely yours, Mike. From mclapinski at google.com Tue Jun 2 05:33:11 2026 From: mclapinski at google.com (Michal Clapinski) Date: Tue, 2 Jun 2026 14:33:11 +0200 Subject: [PATCH v2] kexec_file: skip checksum verification when safe Message-ID: <20260602123311.1841746-1-mclapinski@google.com> Checksum verification is needed 1. for crash kernels. In a crash, we can't be sure the kernel is intact. 2. if we're worried about relocating the kernel into a region used by some DMA that wasn't properly cancelled. If KHO is enabled then relocations will happen to KHO scratch, which is free from DMA regions. If we used CMA to allocate segments then relocations are not going to happen at all. Therefore, we can safely disable checksum verification in both of those cases. Instead of adding a new variable to purgatory, just skip adding regions and save the default value of SHA256 hash. Saves ~250ms on my 4.0 GHz CPU. This is an important saving for the live-update project. 
Signed-off-by: Michal Clapinski --- v2: - also skip checksum verification if KHO is enabled - small fixes from reviews My original idea was to do 2 changes: 1. Skip checksum if all segments are CMA. 2. If KHO is enabled, allocate the kernel inside kho_scratch using CMA. This way we could skip both relocations and checksum verification when KHO is enabled. But I realized that step 2 might not be possible on warm boots. I have no idea how to fix that (except weird ideas like 2 kho_scratches that we swap on every warm boot), so I decided to just skip checksum verification when KHO is enabled. This unfortunately means relocations will still happen. --- kernel/kexec_file.c | 27 +++++++++++++++++++++++++++ 1 file changed, 27 insertions(+) diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index 2bfbb2d144e6..db25a14692ab 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -27,6 +27,7 @@ #include #include #include +#include #include "kexec_internal.h" #ifdef CONFIG_KEXEC_SIG @@ -798,6 +799,16 @@ int kexec_add_buffer(struct kexec_buf *kbuf) return 0; } +static bool kexec_only_cma_segments(struct kimage *image) +{ + for (int i = 0; i < image->nr_segments; i++) { + if (!image->segment_cma[i]) + return false; + } + + return true; +} + /* Calculate and store the digest of segments */ static int kexec_calculate_store_digests(struct kimage *image) { @@ -822,6 +833,21 @@ static int kexec_calculate_store_digests(struct kimage *image) sha256_init(&sctx); + /* + * If KHO is enabled, the destinations are located in KHO scratch. + * KHO scratch can only contain early boot allocations and movable + * allocations. That means there is no risk of memory corruption by + * uncancelled DMA. + * + * If all segments were loaded into contiguous memory, there will be no + * relocations at all, so also no risk no corruption. 
+ */ + if (image->type != KEXEC_TYPE_CRASH && + (kho_is_enabled() || kexec_only_cma_segments(image))) { + pr_debug("disabling checksum verification in purgatory\n"); + goto skip_checksum; + } + for (j = i = 0; i < image->nr_segments; i++) { struct kexec_segment *ksegment; @@ -867,6 +893,7 @@ static int kexec_calculate_store_digests(struct kimage *image) j++; } +skip_checksum: sha256_final(&sctx, digest); ret = kexec_purgatory_get_set_symbol(image, "purgatory_sha_regions", -- 2.54.0.929.g9b7fa37559-goog From pratyush at kernel.org Tue Jun 2 05:52:02 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 14:52:02 +0200 Subject: [PATCH 09/12] memblock: introduce MEMBLOCK_KHO_SCRATCH_EXT In-Reply-To: (Mike Rapoport's message of "Sun, 31 May 2026 21:51:09 +0300") References: <20260429133928.850721-1-pratyush@kernel.org> <20260429133928.850721-10-pratyush@kernel.org> <2vxzecjhc2s8.fsf@kernel.org> <2vxzse7j7ai9.fsf@kernel.org> Message-ID: <2vxztsrlds0d.fsf@kernel.org> On Sun, May 31 2026, Mike Rapoport wrote: > On Fri, May 22, 2026 at 05:02:38PM +0200, Pratyush Yadav wrote: >> On Fri, May 22 2026, Pasha Tatashin wrote: >> >> > On 05-11 18:46, Pratyush Yadav wrote: >> >> On Mon, May 11 2026, Mike Rapoport wrote: >> >> >> >> > On Wed, Apr 29, 2026 at 03:39:11PM +0200, Pratyush Yadav wrote: >> >> >> From: "Pratyush Yadav (Google)" >> >> >> >> >> >> In the upcoming commits, the KHO will learn how to discover free blocks >> >> >> of memory by walking the KHO radix tree. It will then mark those regions >> >> >> as scratch to allow memory allocation in case scratch runs low. >> >> >> >> >> >> To differentiate the extended scratch areas from the main scratch areas, >> >> >> introduce MEMBLOCK_KHO_SCRATCH_EXT. Use it when choosing memblock flags >> >> >> for allocations during scratch-only. Teach should_skip_region() to check >> >> >> for both flags before deciding if the region should be skipped. 
>> >> > >> >> > Why there's a need to differentiate SCRATCH and SCRATCH_EXT? >> >> > SCRATCH (I still hate the name) means "memory memblock can safely use for >> > >> > +1000 >> > >> > I also strongly dislike this name and mentioned it in another thread >> > earlier today. >> > >> > If we ever decide to s/scratch/something-else/ globally, that should be a >> > separate cleanup effort. However, since we are introducing a brand new flag >> > here, we can discuss a better name for the _ext portion to avoid overloading >> > the "scratch" concept. >> > >> >> > the allocations". Initially this memory comes from the reservations in the >> >> > first kernel, but if the second kernel can find more memory to extend it, >> >> > why that additional memory should be treated differently? >> >> >> >> Two reasons: >> >> >> >> 1. We mark SCRATCH as MIGRATE_CMA. We don't want to do that for >> >> SCRATCH_EXT since this memory can be used for non-movable >> >> allocations. >> >> >> >> 2. Gigantic (1G) huge pages can not be allocated from scratch. They can >> >> be preserved memory and thus should not be allocated from SCRATCH. >> >> See patch 12 that does allocations for gigantic huge pages only from >> >> SCRATCH_EXT. >> >> >> >> I will add this in the commit message for the next version. >> >> >> >> Naming is hard, so if you have any better names I'm all ears :-) >> > >> > IMO, this scratch_ext is not "scratch" in the traditional KHO sense at all. >> > The traditional KHO scratch is what is passed from kernel to kernel and is >> > guaranteed to contain zero preserved memory. This new memory is not passed >> > from kernel to kernel and can contain preserved memory at runtime. It's >> > essentially just memory that we identify as currently unpreserved and release >> > early to the system. 
>> > >> > If we want to keep the naming aligned with the existing codebase for now: >> > MEMBLOCK_KHO_SCRATCH -> original scratch >> > MEMBLOCK_KHO_UNPRESERVED -> for the new memory (instead of SCRATCH_EXT) >> >> UNPRESERVED sounds good to me. I will use that for the next revision >> unless Mike objects. > > Can we make it shorter? ;-) > > UNPRESERVED makes sense, although I'd love to completely remove KHO_ notion > and make the name reflect how it's used by memblock. I was toying with > PREFERRED instead of SCRATCH, but it didn't feel right enough. > With two of them that surely won't work :) I don't think you really can remove KHO_ notion. These memory regions only make sense on a KHO boot, and won't exist otherwise. And PREFERRED sounds like a suggestion/priority hint, not a hard limit. "With KHO boot, you can _only_ use PREFERRED memory", doesn't sound right... I think MEMBLOCK_KHO_BOOTMEM for scratch and MEMBLOCK_KHO_NOPRESERVE (which I think is a tiny bit better than UNPRESERVED) for scratch_ext are my top picks. To make it shorter, perhaps MEMBLOCK_KHO_NOPRSRV, in similar fashion to RSRV_KERN? > >> > Alternatively, if we do want to tackle the global rename of "scratch" later: >> > MEMBLOCK_KHO_BOOTSTRAP -> for the original scratch >> > MEMBLOCK_KHO_UNPRESERVED -> for this new dynamic memory >> >> Or perhaps BOOTMEM? I suppose either of the two are somewhat better than >> scratch. > > Well, if we have BOOTMEM_HVO, we can have BOOTMEM_KHO as well :) > >> Anyway, can we please do the SCRATCH rename as a separate series? I > > Sure. We can continue bikeshedding in parallel. > >> would like this series to not get muddled in the naming discussion. I >> will use UNPRESERVED for the new concept in v2 though. > > That might warrant v3 even if everything else is perfect :) I can live with that. 
As long as we can agree on the easy part (the code), I don't mind doing another version for the hard part (the naming) ;-) -- Regards, Pratyush Yadav From rppt at kernel.org Tue Jun 2 06:20:02 2026 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 2 Jun 2026 16:20:02 +0300 Subject: [PATCH 09/12] memblock: introduce MEMBLOCK_KHO_SCRATCH_EXT In-Reply-To: <2vxztsrlds0d.fsf@kernel.org> References: <20260429133928.850721-1-pratyush@kernel.org> <20260429133928.850721-10-pratyush@kernel.org> <2vxzecjhc2s8.fsf@kernel.org> <2vxzse7j7ai9.fsf@kernel.org> <2vxztsrlds0d.fsf@kernel.org> Message-ID: On Tue, Jun 02, 2026 at 02:52:02PM +0200, Pratyush Yadav wrote: > On Sun, May 31 2026, Mike Rapoport wrote: > >> > > >> > If we want to keep the naming aligned with the existing codebase for now: > >> > MEMBLOCK_KHO_SCRATCH -> original scratch > >> > MEMBLOCK_KHO_UNPRESERVED -> for the new memory (instead of SCRATCH_EXT) > >> > >> UNPRESERVED sounds good to me. I will use that for the next revision > >> unless Mike objects. > > > > Can we make it shorter? ;-) > > > > UNPRESERVED makes sense, although I'd love to completely remove KHO_ notion > > and make the name reflect how it's used by memblock. I was toying with > > PREFERRED instead of SCRATCH, but it didn't feel right enough. > > With two of them that surely won't work :) > > I don't think you really can remove KHO_ notion. These memory regions > only make sense on a KHO boot, and won't exist otherwise. And PREFERRED > sounds like a suggestion/priority hint, not a hard limit. "With KHO > boot, you can _only_ use PREFERRED memory", doesn't sound right... > > I think MEMBLOCK_KHO_BOOTMEM for scratch and MEMBLOCK_KHO_NOPRESERVE > (which I think is a tiny bit better than UNPRESERVED) for scratch_ext > are my top picks. To make it shorter, perhaps MEMBLOCK_KHO_NOPRSRV, in > similar fashion to RSRV_KERN? 
There are a couple of unrelated 'bootmem' things in the kernel; adding another one shouldn't hurt :) I like MEMBLOCK_KHO_BOOTMEM and MEMBLOCK_KHO_NOPRSRV the most. > >> would like this series to not get muddled in the naming discussion. I > >> will use UNPRESERVED for the new concept in v2 though. > > > > That might warrant v3 even if everything else is perfect :) > > I can live with that. As long as we can agree on the easy part (the > code), I don't mind doing another version for the hard part (the naming) > ;-) It makes sense to keep KHO_SCRATCH for now and use MEMBLOCK_KHO_NOPRSRV for the new one to begin with. And then we can ask an LLM to do the renaming. > -- > Regards, > Pratyush Yadav -- Sincerely yours, Mike. From pratyush at kernel.org Tue Jun 2 06:35:44 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 15:35:44 +0200 Subject: [PATCH 12/12] mm/hugetlb: make bootmem allocation work with KHO In-Reply-To: (Mike Rapoport's message of "Sun, 31 May 2026 21:40:07 +0300") References: <20260429133928.850721-1-pratyush@kernel.org> <20260429133928.850721-13-pratyush@kernel.org> <2vxzo6i37bs6.fsf@kernel.org> Message-ID: <2vxzpl29dpzj.fsf@kernel.org> On Sun, May 31 2026, Mike Rapoport wrote: > On Mon, May 25, 2026 at 05:24:09PM +0200, Pratyush Yadav wrote: >> On Sun, May 17 2026, Mike Rapoport wrote: >> > On Wed, Apr 29, 2026 at 03:39:14PM +0200, Pratyush Yadav wrote: >> >> From: "Pratyush Yadav (Google)" >> >> So, in summary, I would like to pursue option 1 and try to make it more >> appetizing. But I would like to at least know if you hate the "extended >> scratch" (ignore the name) as a concept or only the code it results in. > > Let's retry this one :) > > I looked more closely, and it seems that mixing SCRATCH and SCRATCH_EXT > should be a lesser headache than going with option 4. I also had some time to ruminate on this. I still think option 1 has the most promise, but my opinion on option 4 has improved a bit.
While I still am not sure adding a 3rd phase to struct page/MM init (early -> deferred -> KHO reserved blocks) is a good idea, I think it might not be as bad as I first thought. Dunno... Anyway, for now I think I will try to make option 1 more appetizing. Here's an idea I want to try out: I get rid of SCRATCH_EXT and mark the free blocks as SCRATCH. For HugeTLB, I can teach the special memblock_alloc_hugetlb_something() function to exclude scratch areas when looking for free memory ranges. So core memblock does not get a new memory type, and the complexity of hugepage allocation does not leak into memblock. How does that sound? > > Tracking the changes in gigantic pages in hugetlb also does not seem > something we'd like to pursue especially considering that memory from freed > or demoted gigantic pages could be reserved. > > If we add a dedicated memblock_something to allocate gigantic pages, we > can reduce branching in alloc_bootmem() to > > if (cma) > do_cma() > else > do_memblock() > > For hugetlb_cma we might want to teach CMA to create pre-allocated areas > and then it could reuse the same memblock API. This seems useful even > regardless of KHO. Sorry, I don't get what you mean by this. What pre-allocated areas? When creating CMA areas it calls cma_alloc_mem() which calls into memblock. What would we change about this? -- Regards, Pratyush Yadav From pratyush at kernel.org Tue Jun 2 08:16:21 2026 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 02 Jun 2026 17:16:21 +0200 Subject: [PATCH v2] kexec_file: skip checksum verification when safe In-Reply-To: <20260602123311.1841746-1-mclapinski@google.com> (Michal Clapinski's message of "Tue, 2 Jun 2026 14:33:11 +0200") References: <20260602123311.1841746-1-mclapinski@google.com> Message-ID: <2vxzik81dlbu.fsf@kernel.org> On Tue, Jun 02 2026, Michal Clapinski wrote: > Checksum verification is needed > 1. for crash kernels. In a crash, we can't be sure the kernel is > intact. > 2. 
if we're worried about relocating the kernel into a region used by > some DMA that wasn't properly cancelled. > > If KHO is enabled then relocations will happen to KHO scratch, which > is free from DMA regions. > If we used CMA to allocate segments then relocations are not going to > happen at all. > Therefore, we can safely disable checksum verification in both of those > cases. > > Instead of adding a new variable to purgatory, just skip adding regions > and save the default value of SHA256 hash. > > Saves ~250ms on my 4.0 GHz CPU. This is an important saving for the > live-update project. > > Signed-off-by: Michal Clapinski > --- > v2: > - also skip checksum verification if KHO is enabled > - small fixes from reviews > > My original idea was to do 2 changes: > 1. Skip checksum if all segments are CMA. > 2. If KHO is enabled, allocate the kernel inside kho_scratch using CMA. > > This way we could skip both relocations and checksum verification when > KHO is enabled. > But I realized that step 2 might not be possible on warm boots. AFAIU we only relocate into scratch since relocating anywhere else might over-write preserved memory. If there is no relocation, there is no need for the kernel image to be in scratch, since the image won't be preserved memory anyway. So perhaps we can just use CMA directly, and only fall back to kho_locate_mem_hole() if that fails? This should be a simple enough change. Do you know how much time we can save by skipping relocations? I would guess it is in the hundreds of milliseconds. Can you try this (COMPLETELY UNTESTED) patch out and see if it works and if it further improves kexec time? --- 8< --- diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index 2bfbb2d144e6..0ccc7b6d67c1 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -720,14 +720,6 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf) if (kbuf->mem != KEXEC_BUF_MEM_UNKNOWN) return 0; - /* - * If KHO is active, only use KHO scratch memory. 
All other memory - * could potentially be handed over. - */ - ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback); - if (ret <= 0) - return ret; - /* * Try to find a free physically contiguous block of memory first. With that, we * can avoid any copying at kexec time. @@ -735,6 +727,14 @@ int kexec_locate_mem_hole(struct kexec_buf *kbuf) if (!kexec_alloc_contig(kbuf)) return 0; + /* + * If KHO is active and relocations are to be done, only use KHO + * scratch memory. All other memory could potentially be handed over. + */ + ret = kho_locate_mem_hole(kbuf, locate_mem_hole_callback); + if (ret <= 0) + return ret; + if (!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) ret = kexec_walk_resources(kbuf, locate_mem_hole_callback); else --- >8 --- Of course this is not directly related to this patch, so it shouldn't block it, but I reckon we might be able to squeeze a bit more performance out this way as a follow-up. > I have no idea how to fix that (except weird ideas like 2 kho_scratches > that we swap on every warm boot), so I decided to just skip checksum > verification when KHO is enabled. This unfortunately means relocations > will still happen.
> --- > kernel/kexec_file.c | 27 +++++++++++++++++++++++++++ > 1 file changed, 27 insertions(+) > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > index 2bfbb2d144e6..db25a14692ab 100644 > --- a/kernel/kexec_file.c > +++ b/kernel/kexec_file.c > @@ -27,6 +27,7 @@ > #include > #include > #include > +#include > #include "kexec_internal.h" > > #ifdef CONFIG_KEXEC_SIG > @@ -798,6 +799,16 @@ int kexec_add_buffer(struct kexec_buf *kbuf) > return 0; > } > > +static bool kexec_only_cma_segments(struct kimage *image) > +{ > + for (int i = 0; i < image->nr_segments; i++) { > + if (!image->segment_cma[i]) > + return false; > + } > + > + return true; > +} > + > /* Calculate and store the digest of segments */ > static int kexec_calculate_store_digests(struct kimage *image) > { > @@ -822,6 +833,21 @@ static int kexec_calculate_store_digests(struct kimage *image) > > sha256_init(&sctx); > > + /* > + * If KHO is enabled, the destinations are located in KHO scratch. > + * KHO scratch can only contain early boot allocations and movable > + * allocations. That means there is no risk of memory corruption by > + * uncancelled DMA. > + * > + * If all segments were loaded into contiguous memory, there will be no > + * relocations at all, so also no risk no corruption. Typo: "so also no risk *of* corruption". We can fix that up when applying I think, so no need for a v3 just for this. 
Other than this, Reviewed-by: Pratyush Yadav (Google) > + */ > + if (image->type != KEXEC_TYPE_CRASH && > + (kho_is_enabled() || kexec_only_cma_segments(image))) { > + pr_debug("disabling checksum verification in purgatory\n"); > + goto skip_checksum; > + } > + > for (j = i = 0; i < image->nr_segments; i++) { > struct kexec_segment *ksegment; > > @@ -867,6 +893,7 @@ static int kexec_calculate_store_digests(struct kimage *image) > j++; > } > > +skip_checksum: > sha256_final(&sctx, digest); > > ret = kexec_purgatory_get_set_symbol(image, "purgatory_sha_regions", -- Regards, Pratyush Yadav From baoquan.he at linux.dev Mon Jun 1 06:40:23 2026 From: baoquan.he at linux.dev (Baoquan He) Date: Mon, 1 Jun 2026 21:40:23 +0800 Subject: [PATCH v15 00/23] arm64/riscv: Add support for crashkernel CMA reservation In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: Hi Jinjie, On 06/01/26 at 05:47pm, Jinjie Ruan wrote: ...snip... > Changes in v15: > - Unify the subject prefix formats as Huacai suggested. > - Fix powerpc pre-existing NULL pointer dereference [Sashiko [1]] > - Fix powerpc pre-existing __merge_memory_ranges() memory range > truncation [Sashiko [1]]. > - Fix pre-existing arm64 CMA page leaks [Sashiko[2]]. > - Fix pre-existing crash_load_dm_crypt_keys() Use-After-Free and > Double Free issue [Sashiko[3]]. > - Fix vfree(headers) and uninitialized variables issue > and simplify the fix [Sashiko[2]]. > - As walk_system_ram_res() and for_each_mem_range() use different > lock, unify and simplify the fix of TOCTOU buffer overflow via memory > region padding [Sashiko[4]]. > - Fix the arm64 crash dump issues in Sashiko[5]. > - Link to v14: https://lore.kernel.org/all/20260525084932.934910-1-ruanjinjie at huawei.com/ Do these Fixes have anything to do with the main target of this patch series, as stated in the cover letter: "arm64/riscv: Add support for crashkernel CMA"?
The patches become more and more in each new version, I am wondering if it relies on these Fixes patches to implement your adding support for crashkernel CMA on arm64/risc-v. If not relying on them, could you split them into different patchset on different purpose? Thanks Baoquan > > [1]: https://lore.kernel.org/all/20260525092207.96B9D1F000E9 at smtp.kernel.org/ > [2]: https://lore.kernel.org/all/20260525091149.1A1E01F00A3D at smtp.kernel.org/ > [3]: https://lore.kernel.org/all/20260525105227.3C2421F000E9 at smtp.kernel.org/ > [4]: https://lore.kernel.org/all/20260525095447.944E11F000E9 at smtp.kernel.org/ > [5]: https://lore.kernel.org/all/20260525101746.9959D1F000E9 at smtp.kernel.org/ > > Changes in v14: > - Fix image->elf_headers memory leak during retry loop for arm64 as Sashiko > AI code review pointed out. > - Solve the hotplug notifier arch_crash_handle_hotplug_event() AA > self-deadlock problem as Sashiko AI code review pointed out. > - Fix the TOCTOU issue in prepare_elf_headers() by get_online_mems(). > - -ENOMEM -> -EAGAIN as Breno suggested. > - Add support for arm64 crash hotplug. > - Link to v13: https://lore.kernel.org/all/20260511030454.1730881-1-ruanjinjie at huawei.com/ > > Changes in v13: > - Rebased on v7.1-rc1. > - Update the commit message. > - Add Reviewed-by. > - Link to v12: https://lore.kernel.org/all/20260402072701.628293-1-ruanjinjie at huawei.com/ > > Changes in v12: > - Remove the unused "nr_mem_ranges" for x86. > - Add "Fix crashk_low_res not exclude bug" test log. > - Provide a separate patch for each architecture for using > crash_prepare_headers(), which will make the review more convenient. > - Add Reviewed-by and Tested-by. > - Link to v11: https://lore.kernel.org/all/20260328074013.3589544-1-ruanjinjie at huawei.com/ > > Changes in v11: > - Avoid silently drop crash memory if the crash kernel is built without > CONFIG_CMA. > - Remove unnecessary "cmem->nr_ranges = 0" for arch_crash_populate_cmem() > as we use kvzalloc(). 
> - Provide a separate patch for each architecture to fix the existing > buffer overflow issue. > - Add Acked-bys for arm64. > > Changes in v10: > - Fix crashk_low_res not excluded bug in the existing > RISC-V code. > - Fix an existing memory leak issue in the existing PowerPC code. > - Fix the ordering issue of adding CMA ranges to > "linux,usable-memory-range". > - Fix an existing concurrency issue. A Concurrent memory hotplug may occur > between reading memblock and attempting to fill cmem during kexec_load() > for almost all existing architectures. > - Link to v9: https://lore.kernel.org/all/20260323072745.2481719-1-ruanjinjie at huawei.com/ > > Changes in v9: > - Collect Reviewed-by and Acked-by, and prepare for Sashiko AI review. > - Link to v8: https://lore.kernel.org/all/20260302035315.3892241-1-ruanjinjie at huawei.com/ > > Changes in v8: > - Fix the build issues reported by kernel test robot and Sourabh. > - Link to v7: https://lore.kernel.org/all/20260226130437.1867658-1-ruanjinjie at huawei.com/ > > Changes in v7: > - Correct the inclusion of CMA-reserved ranges for kdump kernel in of/kexec > for arm64 and riscv. > - Add Acked-by. > - Link to v6: https://lore.kernel.org/all/20260224085342.387996-1-ruanjinjie at huawei.com/ > > Changes in v6: > - Update the crash core exclude code as Mike suggested. > - Rebased on v7.0-rc1. > - Add acked-by. 
> - Link to v5: https://lore.kernel.org/all/20260212101001.343158-1-ruanjinjie at huawei.com/ > > Jinjie Ruan (22): > riscv: kexec_file: Fix crashk_low_res not exclude bug > powerpc/crash: Fix possible memory leak in update_crash_elfcorehdr() > powerpc/kexec_file: Fix NULL pointer dereference in > kexec_extra_fdt_size_ppc64() > powerpc/kexec_file: Fix memory range truncation in > __merge_memory_ranges() > kexec: Extract kexec_free_segment_cma() from kimage_free_cma() > arm64: kexec_file: Fix CMA page leaks during segment placement retry > loops > arm64: kexec_file: Fix image->elf_headers memory leak during retry > loop > kexec: Fix UAF and Double Free in crash_load_dm_crypt_keys() > crash_core: Introduce CRASH_HOTPLUG_SAFETY_PADDING for memory hotplug > safety > x86: kexec_file: Fix TOCTOU buffer overflow via memory region padding > arm64: kexec_file: Fix TOCTOU buffer overflow via memory region > padding > riscv: kexec_file: Fix TOCTOU buffer overflow via memory region > padding > LoongArch: kexec_file: Fix TOCTOU buffer overflow via memory region > padding > crash: Add crash_prepare_headers() to exclude crash kernel memory > arm64: kexec_file: Use crash_prepare_headers() helper to simplify code > x86: kexec_file: Use crash_prepare_headers() helper to simplify code > riscv: kexec_file: Use crash_prepare_headers() helper to simplify code > LoongArch: kexec_file: Use crash_prepare_headers() helper to simplify > code > powerpc/kexec_file: Use crash_exclude_core_ranges() helper > arm64: kexec_file: Add support for crashkernel CMA reservation > riscv: kexec_file: Add support for crashkernel CMA reservation > arm64: crash: Add crash hotplug support > > Sourabh Jain (1): > powerpc/crash: sort crash memory ranges before preparing elfcorehdr > > .../admin-guide/kernel-parameters.txt | 16 +- > arch/arm64/Kconfig | 3 + > arch/arm64/include/asm/kexec.h | 13 ++ > arch/arm64/kernel/Makefile | 2 +- > arch/arm64/kernel/crash.c | 152 ++++++++++++++++++ > 
arch/arm64/kernel/kexec_image.c | 34 ++++ > arch/arm64/kernel/machine_kexec_file.c | 78 ++------- > arch/arm64/mm/init.c | 5 +- > arch/loongarch/kernel/machine_kexec_file.c | 44 ++--- > arch/powerpc/include/asm/kexec_ranges.h | 1 - > arch/powerpc/kexec/crash.c | 7 +- > arch/powerpc/kexec/file_load_64.c | 3 + > arch/powerpc/kexec/ranges.c | 113 ++----------- > arch/riscv/kernel/machine_kexec_file.c | 43 ++--- > arch/riscv/mm/init.c | 5 +- > arch/x86/kernel/crash.c | 92 ++--------- > drivers/of/fdt.c | 9 +- > drivers/of/kexec.c | 9 ++ > include/linux/crash_core.h | 15 ++ > include/linux/crash_reserve.h | 4 +- > include/linux/kexec.h | 2 + > kernel/crash_core.c | 89 +++++++++- > kernel/crash_dump_dm_crypt.c | 4 +- > kernel/kexec_core.c | 25 +-- > 24 files changed, 430 insertions(+), 338 deletions(-) > create mode 100644 arch/arm64/kernel/crash.c > > -- > 2.34.1 > From baoquan.he at linux.dev Mon Jun 1 20:06:11 2026 From: baoquan.he at linux.dev (Baoquan He) Date: Tue, 2 Jun 2026 11:06:11 +0800 Subject: [PATCH v15 00/23] arm64/riscv: Add support for crashkernel CMA reservation In-Reply-To: <1a459706-80db-43d8-b163-76fc09da338d@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> <1a459706-80db-43d8-b163-76fc09da338d@huawei.com> Message-ID: On 06/02/26 at 09:43am, Jinjie Ruan wrote: > > > On 6/1/2026 9:40 PM, Baoquan He wrote: > > Hi Jinjie, > > > > On 06/01/26 at 05:47pm, Jinjie Ruan wrote: > > ...snip... > >> Changes in v15: > >> - Unify the subject prefix formats as Huacai suggested. > >> - Fix powerpc pre-existing NULL pointer dereference [Sashiko [1]] > >> - Fix powerpc pre-existing __merge_memory_ranges() memory range > >> truncation [Sashiko [1]]. > >> - Fix pre-existing arm64 CMA page leaks [Sashiko[2]]. > >> - Fix pre-existing crash_load_dm_crypt_keys() Use-After-Free and > >> Double Free issue [Sashiko[3]]. > >> - Fix vfree(headers) and uninitialized variables issue > >> and simplify the fix [Sashiko[2]]. 
> >> - As walk_system_ram_res() and for_each_mem_range() use different > >> lock, unify and simplify the fix of TOCTOU buffer overflow via memory > >> region padding [Sashiko[4]]. > >> - Fix the arm64 crash dump issues in Sashiko[5]. > >> - Link to v14: https://lore.kernel.org/all/20260525084932.934910-1-ruanjinjie at huawei.com/ > > > > Do these Fixes have anything with the main target of this patch series > > you mentioned in cover-letter:"arm64/riscv: Add support for crashkernel CMA"? > > The patches become more and more in each new version, I am wondering if > > it relies on these Fixes patches to implement your adding support for > > crashkernel CMA on arm64/risc-v. > > > > If not relying on them, could you split them into different patchset > > on different purpose? > > Hi Baoquan, > > Thank you for your valuable guidance. > > You are absolutely right. Most of these fix patches are indeed not > strictly related to the core implementation of the crashkernel CMA > support. They are pre-existing bugs in the surrounding kexec/crash code > that were flagged during our review. > > Previously, Andrew suggested taking a look at the code review comments > from the Sashiko AI system, which is why these fixes kept expanding. I > completely agree with your advice that there is no need to keep them > together. I will split them into two completely different patchsets > based on their purpose: > > 1. A cleaner version of this series, strictly focused on adding the core > crashkernel CMA support for arm64/riscv. > > 2. One standalone bugfix patchset dedicated entirely to fixing these > pre-existing issues. > > By the way, I would also appreciate some advice on how to handle further > AI reviews. It seems that the more code we touch or refactor to fix > these pre-existing issues, the more tangential bugs the AI flags in the > newly exposed areas, making the series extremely difficult to converge. 
>
> Should I continue to address all AI-reported bugs associated with the
> surrounding code in this series, or should we draw a strict line
> and only focus on the core CMA logic moving forward?

Then please post patches that focus on the core implementation of the
crashkernel CMA support. If any AI-reported bugs are raised that are not
related to it, you can add a note in the cover letter, or explain elsewhere,
whether each one is caused by the core code and how you want to deal with
it. Otherwise you could go through round after round of new postings and
still not see where it ends.

From ruanjinjie at huawei.com Mon Jun 1 02:47:42 2026
From: ruanjinjie at huawei.com (Jinjie Ruan)
Date: Mon, 1 Jun 2026 17:47:42 +0800
Subject: [PATCH v15 00/23] arm64/riscv: Add support for crashkernel CMA reservation
Message-ID: <20260601094805.2928614-1-ruanjinjie@huawei.com>

The crash memory allocation and the exclusion of crashk_res, crashk_low_res
and crashk_cma memory are almost identical across architectures. This
patch set handles them in the crash core in a generic way, which eliminates
a lot of duplicated code.

It also adds support for crashkernel CMA reservation on arm64 and riscv,
and support for arm64 crash hotplug.

This patch set is rebased on v7.1-rc1.

A basic second-kernel boot test was performed on QEMU platforms for the
x86, ARM64 and RISC-V architectures with the following parameters:

"cma=256M crashkernel=4G crashkernel=64M,cma"

For the first kernel, the log looks like this:

# dmesg | grep crash
[ 0.000000] crashkernel low memory reserved: 0xe8000000 - 0xf0000000 (128 MB)
[ 0.000000] crashkernel reserved: 0x000000023e600000 - 0x000000033e600000 (4096 MB)
[ 0.000000] crashkernel CMA reserved: 64 MB in 1 ranges

# dmesg | grep cma
[ 0.000000] cma: Reserved 256 MiB at 0x00000000f0000000
[ 0.000000] cma: Reserved 64 MiB at 0x0000000100000000

For the second kernel, the log looks like this:

[ 0.000000] OF: fdt: Looking for usable-memory-range property...
[ 0.000000] OF: fdt: cap_mem_regions[0]: base=0x000000023e600000, size=0x0000000100000000 [ 0.000000] OF: fdt: cap_mem_regions[1]: base=0x00000000e8000000, size=0x0000000008000000 [ 0.000000] OF: fdt: cap_mem_regions[2]: base=0x0000000100000000, size=0x0000000004000000 Changes in v15: - Unify the subject prefix formats as Huacai suggested. - Fix powerpc pre-existing NULL pointer dereference [Sashiko [1]] - Fix powerpc pre-existing __merge_memory_ranges() memory range truncation [Sashiko [1]]. - Fix pre-existing arm64 CMA page leaks [Sashiko[2]]. - Fix pre-existing crash_load_dm_crypt_keys() Use-After-Free and Double Free issue [Sashiko[3]]. - Fix vfree(headers) and uninitialized variables issue and simplify the fix [Sashiko[2]]. - As walk_system_ram_res() and for_each_mem_range() use different lock, unify and simplify the fix of TOCTOU buffer overflow via memory region padding [Sashiko[4]]. - Fix the arm64 crash dump issues in Sashiko[5]. - Link to v14: https://lore.kernel.org/all/20260525084932.934910-1-ruanjinjie at huawei.com/ [1]: https://lore.kernel.org/all/20260525092207.96B9D1F000E9 at smtp.kernel.org/ [2]: https://lore.kernel.org/all/20260525091149.1A1E01F00A3D at smtp.kernel.org/ [3]: https://lore.kernel.org/all/20260525105227.3C2421F000E9 at smtp.kernel.org/ [4]: https://lore.kernel.org/all/20260525095447.944E11F000E9 at smtp.kernel.org/ [5]: https://lore.kernel.org/all/20260525101746.9959D1F000E9 at smtp.kernel.org/ Changes in v14: - Fix image->elf_headers memory leak during retry loop for arm64 as Sashiko AI code review pointed out. - Solve the hotplug notifier arch_crash_handle_hotplug_event() AA self-deadlock problem as Sashiko AI code review pointed out. - Fix the TOCTOU issue in prepare_elf_headers() by get_online_mems(). - -ENOMEM -> -EAGAIN as Breno suggested. - Add support for arm64 crash hotplug. - Link to v13: https://lore.kernel.org/all/20260511030454.1730881-1-ruanjinjie at huawei.com/ Changes in v13: - Rebased on v7.1-rc1. 
- Update the commit message. - Add Reviewed-by. - Link to v12: https://lore.kernel.org/all/20260402072701.628293-1-ruanjinjie at huawei.com/ Changes in v12: - Remove the unused "nr_mem_ranges" for x86. - Add "Fix crashk_low_res not exclude bug" test log. - Provide a separate patch for each architecture for using crash_prepare_headers(), which will make the review more convenient. - Add Reviewed-by and Tested-by. - Link to v11: https://lore.kernel.org/all/20260328074013.3589544-1-ruanjinjie at huawei.com/ Changes in v11: - Avoid silently drop crash memory if the crash kernel is built without CONFIG_CMA. - Remove unnecessary "cmem->nr_ranges = 0" for arch_crash_populate_cmem() as we use kvzalloc(). - Provide a separate patch for each architecture to fix the existing buffer overflow issue. - Add Acked-bys for arm64. Changes in v10: - Fix crashk_low_res not excluded bug in the existing RISC-V code. - Fix an existing memory leak issue in the existing PowerPC code. - Fix the ordering issue of adding CMA ranges to "linux,usable-memory-range". - Fix an existing concurrency issue. A Concurrent memory hotplug may occur between reading memblock and attempting to fill cmem during kexec_load() for almost all existing architectures. - Link to v9: https://lore.kernel.org/all/20260323072745.2481719-1-ruanjinjie at huawei.com/ Changes in v9: - Collect Reviewed-by and Acked-by, and prepare for Sashiko AI review. - Link to v8: https://lore.kernel.org/all/20260302035315.3892241-1-ruanjinjie at huawei.com/ Changes in v8: - Fix the build issues reported by kernel test robot and Sourabh. - Link to v7: https://lore.kernel.org/all/20260226130437.1867658-1-ruanjinjie at huawei.com/ Changes in v7: - Correct the inclusion of CMA-reserved ranges for kdump kernel in of/kexec for arm64 and riscv. - Add Acked-by. - Link to v6: https://lore.kernel.org/all/20260224085342.387996-1-ruanjinjie at huawei.com/ Changes in v6: - Update the crash core exclude code as Mike suggested. - Rebased on v7.0-rc1. 
- Add acked-by. - Link to v5: https://lore.kernel.org/all/20260212101001.343158-1-ruanjinjie at huawei.com/ Jinjie Ruan (22): riscv: kexec_file: Fix crashk_low_res not exclude bug powerpc/crash: Fix possible memory leak in update_crash_elfcorehdr() powerpc/kexec_file: Fix NULL pointer dereference in kexec_extra_fdt_size_ppc64() powerpc/kexec_file: Fix memory range truncation in __merge_memory_ranges() kexec: Extract kexec_free_segment_cma() from kimage_free_cma() arm64: kexec_file: Fix CMA page leaks during segment placement retry loops arm64: kexec_file: Fix image->elf_headers memory leak during retry loop kexec: Fix UAF and Double Free in crash_load_dm_crypt_keys() crash_core: Introduce CRASH_HOTPLUG_SAFETY_PADDING for memory hotplug safety x86: kexec_file: Fix TOCTOU buffer overflow via memory region padding arm64: kexec_file: Fix TOCTOU buffer overflow via memory region padding riscv: kexec_file: Fix TOCTOU buffer overflow via memory region padding LoongArch: kexec_file: Fix TOCTOU buffer overflow via memory region padding crash: Add crash_prepare_headers() to exclude crash kernel memory arm64: kexec_file: Use crash_prepare_headers() helper to simplify code x86: kexec_file: Use crash_prepare_headers() helper to simplify code riscv: kexec_file: Use crash_prepare_headers() helper to simplify code LoongArch: kexec_file: Use crash_prepare_headers() helper to simplify code powerpc/kexec_file: Use crash_exclude_core_ranges() helper arm64: kexec_file: Add support for crashkernel CMA reservation riscv: kexec_file: Add support for crashkernel CMA reservation arm64: crash: Add crash hotplug support Sourabh Jain (1): powerpc/crash: sort crash memory ranges before preparing elfcorehdr .../admin-guide/kernel-parameters.txt | 16 +- arch/arm64/Kconfig | 3 + arch/arm64/include/asm/kexec.h | 13 ++ arch/arm64/kernel/Makefile | 2 +- arch/arm64/kernel/crash.c | 152 ++++++++++++++++++ arch/arm64/kernel/kexec_image.c | 34 ++++ arch/arm64/kernel/machine_kexec_file.c | 78 ++------- 
 arch/arm64/mm/init.c                       |   5 +-
 arch/loongarch/kernel/machine_kexec_file.c |  44 ++---
 arch/powerpc/include/asm/kexec_ranges.h    |   1 -
 arch/powerpc/kexec/crash.c                 |   7 +-
 arch/powerpc/kexec/file_load_64.c          |   3 +
 arch/powerpc/kexec/ranges.c                | 113 ++-----------
 arch/riscv/kernel/machine_kexec_file.c     |  43 ++---
 arch/riscv/mm/init.c                       |   5 +-
 arch/x86/kernel/crash.c                    |  92 ++---------
 drivers/of/fdt.c                           |   9 +-
 drivers/of/kexec.c                         |   9 ++
 include/linux/crash_core.h                 |  15 ++
 include/linux/crash_reserve.h              |   4 +-
 include/linux/kexec.h                      |   2 +
 kernel/crash_core.c                        |  89 +++++++++-
 kernel/crash_dump_dm_crypt.c               |   4 +-
 kernel/kexec_core.c                        |  25 +--
 24 files changed, 430 insertions(+), 338 deletions(-)
 create mode 100644 arch/arm64/kernel/crash.c

--
2.34.1

From ruanjinjie at huawei.com Mon Jun 1 02:47:44 2026
From: ruanjinjie at huawei.com (Jinjie Ruan)
Date: Mon, 1 Jun 2026 17:47:44 +0800
Subject: [PATCH v15 02/23] powerpc/crash: Fix possible memory leak in update_crash_elfcorehdr()
In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com>
References: <20260601094805.2928614-1-ruanjinjie@huawei.com>
Message-ID: <20260601094805.2928614-3-ruanjinjie@huawei.com>

In get_crash_memory_ranges(), if crash_exclude_mem_range() fails after
realloc_mem_ranges() has successfully allocated the cmem memory, the
function returns an error but leaves cmem pointing to the allocated
memory. The caller, update_crash_elfcorehdr(), does not free it on this
path either, which causes a memory leak. Jump to the out label instead so
that cmem is freed.
Cc: Sourabh Jain Cc: Hari Bathini Cc: Michael Ellerman Fixes: 849599b702ef ("powerpc/crash: add crash memory hotplug support") Reviewed-by: Sourabh Jain Signed-off-by: Jinjie Ruan --- arch/powerpc/kexec/crash.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c index e6539f213b3d..a520f851c3a6 100644 --- a/arch/powerpc/kexec/crash.c +++ b/arch/powerpc/kexec/crash.c @@ -502,7 +502,7 @@ static void update_crash_elfcorehdr(struct kimage *image, struct memory_notify * ret = get_crash_memory_ranges(&cmem); if (ret) { pr_err("Failed to get crash mem range\n"); - return; + goto out; } /* -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:45 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:45 +0800 Subject: [PATCH v15 03/23] powerpc/kexec_file: Fix NULL pointer dereference in kexec_extra_fdt_size_ppc64() In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-4-ruanjinjie@huawei.com> A static Sashiko AI review identified a potential NULL pointer dereference in kexec_extra_fdt_size_ppc64(). When get_reserved_memory_ranges() successfully returns 0 on platforms without any reserved memory regions, the allocated 'rmem' pointer remains NULL. Passing this unallocated pointer directly to kexec_extra_fdt_size_ppc64() leads to a kernel panic when evaluating 'rmem->nr_ranges'. Fix this by adding a defensive NULL pointer check at the beginning of kexec_extra_fdt_size_ppc64(), returning 0 extra space immediately if no reserved memory structure exists. 
Cc: Sourabh Jain Cc: Hari Bathini Cc: Michael Ellerman Cc: stable at vger.kernel.org Fixes: 0d3ff067331e ("powerpc/kexec_file: fix extra size calculation for kexec FDT") Signed-off-by: Jinjie Ruan --- arch/powerpc/kexec/file_load_64.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c index 8c72e12ea44e..fdeedf102c38 100644 --- a/arch/powerpc/kexec/file_load_64.c +++ b/arch/powerpc/kexec/file_load_64.c @@ -649,6 +649,9 @@ unsigned int kexec_extra_fdt_size_ppc64(struct kimage *image, struct crash_mem * struct device_node *dn; unsigned int cpu_nodes = 0, extra_size = 0; + if (!rmem) + return 0; + // Budget some space for the password blob. There's already extra space // for the key name if (plpks_is_available()) -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:46 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:46 +0800 Subject: [PATCH v15 04/23] powerpc/kexec_file: Fix memory range truncation in __merge_memory_ranges() In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-5-ruanjinjie@huawei.com> Sashiko AI review pointed out the following issue. The __merge_memory_ranges() function incorrectly handles overlapping memory ranges when merging them. Although sort_memory_ranges() sorts all ranges by their start address in ascending order beforehand, the merge logic remains defective in two ways: 1. It compares the current range's start against the previous element (i-1) instead of the running target index (idx) 2. It unconditionally overwrites 'ranges[idx].end' with 'ranges[i].end'. This logic flaw leads to critical memory truncation when a larger memory range completely subsumes subsequent smaller ranges. 
For example, consider a sorted input array with three ranges: Range A (idx=0): [0x1000 - 0x9000] Range B (i=1): [0x2000 - 0x5000] (completely inside Range A) Range C (i=2): [0x6000 - 0x8000] (completely inside Range A) 1. When i=1 (Range B): ranges[1].start (0x2000) <= ranges[0].end + 1 (0x9001) is TRUE. The code executes: ranges[0].end = ranges[1].end, which erroneously shrinks Range A's end from 0x9000 down to 0x5000. 2. When i=2 (Range C): ranges[2].start (0x6000) <= ranges[1].end + 1 (0x5001) is FALSE. The code falls into the else block, creating a broken new range. As a result, valid memory fragments [0x5001 - 0x5fff] and [0x8001 - 0x9000] are completely lost from the kexec exclude lists, potentially allowing the crash kernel to overwrite active memory, causing data corruption or crashes. Fix this by ensuring the start of the current range is compared against the end of the active merged range (idx), and use max() to safely prevent the outer boundary from being truncated. Cc: Sourabh Jain Cc: Hari Bathini Cc: Michael Ellerman Cc: stable at vger.kernel.org Fixes: 180adfc532a8 ("powerpc/kexec_file: Add helper functions for getting memory ranges") Signed-off-by: Jinjie Ruan --- arch/powerpc/kexec/ranges.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c index 867135560e5c..eb45e89502ca 100644 --- a/arch/powerpc/kexec/ranges.c +++ b/arch/powerpc/kexec/ranges.c @@ -21,6 +21,7 @@ #include #include #include +#include #include #include #include @@ -105,19 +106,16 @@ static void __merge_memory_ranges(struct crash_mem *mem_rngs) struct range *ranges; int i, idx; - if (!mem_rngs) + if (!mem_rngs || mem_rngs->nr_ranges <= 1) return; idx = 0; - ranges = &(mem_rngs->ranges[0]); + ranges = mem_rngs->ranges; for (i = 1; i < mem_rngs->nr_ranges; i++) { - if (ranges[i].start <= (ranges[i-1].end + 1)) - ranges[idx].end = ranges[i].end; + if (ranges[i].start <= (ranges[idx].end + 1)) + 
ranges[idx].end = max(ranges[idx].end, ranges[i].end); else { idx++; - if (i == idx) - continue; - ranges[idx] = ranges[i]; } } -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:43 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:43 +0800 Subject: [PATCH v15 01/23] riscv: kexec_file: Fix crashk_low_res not exclude bug In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-2-ruanjinjie@huawei.com> As done in commit 944a45abfabc ("arm64: kdump: Reimplement crashkernel=X") and commit 4831be702b95 ("arm64/kexec: Fix missing extra range for crashkres_low.") for arm64, while implementing crashkernel=X,[high,low], riscv should have excluded the "crashk_low_res" reserved ranges from the crash kernel memory to prevent them from being exported through /proc/vmcore, and the exclusion would need an extra crash_mem range. Just simply tested on qemu with crashkernel=4G with kexec in [1] mentioned in [2]. And the second kernel can be started normally. 
# dmesg | grep crash [ 0.000000] crashkernel low memory reserved: 0xf8000000 - 0x100000000 (128 MB) [ 0.000000] crashkernel reserved: 0x000000017fe00000 - 0x000000027fe00000 (4096 MB) Cc: Guo Ren Cc: Baoquan He [1]: https://github.com/chenjh005/kexec-tools/tree/build-test-riscv-v2 [2]: https://lore.kernel.org/all/20230726175000.2536220-1-chenjiahao16 at huawei.com/ Fixes: 5882e5acf18d ("riscv: kdump: Implement crashkernel=X,[high,low]") Reviewed-by: Guo Ren Signed-off-by: Jinjie Ruan --- arch/riscv/kernel/machine_kexec_file.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/arch/riscv/kernel/machine_kexec_file.c b/arch/riscv/kernel/machine_kexec_file.c index 54e2d9552e93..3f7766057cac 100644 --- a/arch/riscv/kernel/machine_kexec_file.c +++ b/arch/riscv/kernel/machine_kexec_file.c @@ -61,7 +61,7 @@ static int prepare_elf_headers(void **addr, unsigned long *sz) unsigned int nr_ranges; int ret; - nr_ranges = 1; /* For exclusion of crashkernel region */ + nr_ranges = 2; /* For exclusion of crashkernel region */ walk_system_ram_res(0, -1, &nr_ranges, get_nr_ram_ranges_callback); cmem = kmalloc_flex(*cmem, ranges, nr_ranges); @@ -76,8 +76,16 @@ static int prepare_elf_headers(void **addr, unsigned long *sz) /* Exclude crashkernel region */ ret = crash_exclude_mem_range(cmem, crashk_res.start, crashk_res.end); - if (!ret) - ret = crash_prepare_elf64_headers(cmem, true, addr, sz); + if (ret) + goto out; + + if (crashk_low_res.end) { + ret = crash_exclude_mem_range(cmem, crashk_low_res.start, crashk_low_res.end); + if (ret) + goto out; + } + + ret = crash_prepare_elf64_headers(cmem, true, addr, sz); out: kfree(cmem); -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:47 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:47 +0800 Subject: [PATCH v15 05/23] powerpc/crash: sort crash memory ranges before preparing elfcorehdr In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: 
<20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-6-ruanjinjie@huawei.com> From: Sourabh Jain During a memory hot-remove event, the elfcorehdr is rebuilt to exclude the removed memory. While updating the crash memory ranges for this operation, the crash memory ranges array can become unsorted. This happens because remove_mem_range() may split a memory range into two parts and append the higher-address part as a separate range at the end of the array. So far, no issues have been observed due to the unsorted crash memory ranges. However, this could lead to problems once crash memory range removal is handled by generic code, as introduced in the upcoming patches in this series. Currently, powerpc uses a platform-specific function, remove_mem_range(), to exclude hot-removed memory from the crash memory ranges. This function performs the same task as the generic crash_exclude_mem_range() in crash_core.c. The generic helper also ensures that the crash memory ranges remain sorted. So remove the redundant powerpc-specific implementation and instead call crash_exclude_mem_range_guarded() (which internally calls crash_exclude_mem_range()) to exclude the hot-removed memory ranges. 
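
The ordering problem described above can be sketched in userspace C. This is
only a simplified model under stated assumptions: struct model_ranges,
model_exclude() and model_sorted_after_split() are hypothetical names, not
the kernel's struct crash_mem, remove_mem_range() or
crash_exclude_mem_range(), and only the "range strictly inside one entry"
case is handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified range list; not the kernel's struct crash_mem. */
struct model_range {
	unsigned long long start;
	unsigned long long end;	/* inclusive */
};

struct model_ranges {
	size_t nr;
	struct model_range r[8];
};

/* Returns 1 if the entries are in ascending start order, else 0. */
static int model_is_sorted(const struct model_ranges *l)
{
	for (size_t i = 1; i < l->nr; i++)
		if (l->r[i].start < l->r[i - 1].start)
			return 0;
	return 1;
}

/*
 * Cut [mstart, mend] out of the entry that strictly contains it.
 * in_place == 0: the upper remainder is appended at the end of the array.
 * in_place == 1: later entries are shifted up by one so that the upper
 *                remainder lands right after the entry that was split.
 * Returns 0 on success, -1 if the array is full.
 */
static int model_exclude(struct model_ranges *l, unsigned long long mstart,
			 unsigned long long mend, int in_place)
{
	size_t cap = sizeof(l->r) / sizeof(l->r[0]);

	for (size_t i = 0; i < l->nr; i++) {
		struct model_range *cur = &l->r[i];
		size_t slot;

		if (mstart <= cur->start || mend >= cur->end)
			continue;	/* the sketch handles only the split case */
		if (l->nr == cap)
			return -1;

		slot = in_place ? i + 1 : l->nr;
		for (size_t j = l->nr; j > slot; j--)
			l->r[j] = l->r[j - 1];

		l->r[slot].start = mend + 1;
		l->r[slot].end = cur->end;
		cur->end = mstart - 1;
		l->nr++;
		return 0;
	}
	return 0;
}

/* Split the first of two entries and report whether the list stays sorted. */
int model_sorted_after_split(int in_place)
{
	struct model_ranges l = {
		.nr = 2,
		.r = { { 0x1000, 0x4fff }, { 0x8000, 0xbfff } },
	};

	if (model_exclude(&l, 0x2000, 0x2fff, in_place))
		return -1;
	return model_is_sorted(&l);
}
```

In this model, appending the upper remainder leaves [0x3000 - 0x4fff] after
[0x8000 - 0xbfff], while inserting it in place keeps the list in ascending
order.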
Cc: Andrew Morton Cc: Baoquan he Cc: Jinjie Ruan Cc: Hari Bathini Cc: Madhavan Srinivasan Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: linux-kernel at vger.kernel.org Acked-by: Baoquan He Reviewed-by: Ritesh Harjani (IBM) Acked-by: Mike Rapoport (Microsoft) Signed-off-by: Sourabh Jain Signed-off-by: Jinjie Ruan --- arch/powerpc/include/asm/kexec_ranges.h | 4 +- arch/powerpc/kexec/crash.c | 5 +- arch/powerpc/kexec/ranges.c | 87 +------------------------ 3 files changed, 7 insertions(+), 89 deletions(-) diff --git a/arch/powerpc/include/asm/kexec_ranges.h b/arch/powerpc/include/asm/kexec_ranges.h index 14055896cbcb..ad95e3792d10 100644 --- a/arch/powerpc/include/asm/kexec_ranges.h +++ b/arch/powerpc/include/asm/kexec_ranges.h @@ -7,7 +7,9 @@ void sort_memory_ranges(struct crash_mem *mrngs, bool merge); struct crash_mem *realloc_mem_ranges(struct crash_mem **mem_ranges); int add_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size); -int remove_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size); +int crash_exclude_mem_range_guarded(struct crash_mem **mem_ranges, + unsigned long long mstart, + unsigned long long mend); int get_exclude_memory_ranges(struct crash_mem **mem_ranges); int get_reserved_memory_ranges(struct crash_mem **mem_ranges); int get_crash_memory_ranges(struct crash_mem **mem_ranges); diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c index a520f851c3a6..d634db67becc 100644 --- a/arch/powerpc/kexec/crash.c +++ b/arch/powerpc/kexec/crash.c @@ -493,7 +493,7 @@ static void update_crash_elfcorehdr(struct kimage *image, struct memory_notify * struct crash_mem *cmem = NULL; struct kexec_segment *ksegment; void *ptr, *mem, *elfbuf = NULL; - unsigned long elfsz, memsz, base_addr, size; + unsigned long elfsz, memsz, base_addr, size, end; ksegment = &image->segment[image->elfcorehdr_index]; mem = (void *) ksegment->mem; @@ -512,7 +512,8 @@ static void update_crash_elfcorehdr(struct 
kimage *image, struct memory_notify * if (image->hp_action == KEXEC_CRASH_HP_REMOVE_MEMORY) { base_addr = PFN_PHYS(mn->start_pfn); size = mn->nr_pages * PAGE_SIZE; - ret = remove_mem_range(&cmem, base_addr, size); + end = base_addr + size - 1; + ret = crash_exclude_mem_range_guarded(&cmem, base_addr, end); if (ret) { pr_err("Failed to remove hot-unplugged memory from crash memory ranges\n"); goto out; diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c index eb45e89502ca..b2fb78562cdc 100644 --- a/arch/powerpc/kexec/ranges.c +++ b/arch/powerpc/kexec/ranges.c @@ -551,7 +551,7 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) #endif /* CONFIG_KEXEC_FILE */ #ifdef CONFIG_CRASH_DUMP -static int crash_exclude_mem_range_guarded(struct crash_mem **mem_ranges, +int crash_exclude_mem_range_guarded(struct crash_mem **mem_ranges, unsigned long long mstart, unsigned long long mend) { @@ -639,89 +639,4 @@ int get_crash_memory_ranges(struct crash_mem **mem_ranges) pr_err("Failed to setup crash memory ranges\n"); return ret; } - -/** - * remove_mem_range - Removes the given memory range from the range list. - * @mem_ranges: Range list to remove the memory range to. - * @base: Base address of the range to remove. - * @size: Size of the memory range to remove. - * - * (Re)allocates memory, if needed. - * - * Returns 0 on success, negative errno on error. - */ -int remove_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size) -{ - u64 end; - int ret = 0; - unsigned int i; - u64 mstart, mend; - struct crash_mem *mem_rngs = *mem_ranges; - - if (!size) - return 0; - - /* - * Memory range are stored as start and end address, use - * the same format to do remove operation. 
- */ - end = base + size - 1; - - for (i = 0; i < mem_rngs->nr_ranges; i++) { - mstart = mem_rngs->ranges[i].start; - mend = mem_rngs->ranges[i].end; - - /* - * Memory range to remove is not part of this range entry - * in the memory range list - */ - if (!(base >= mstart && end <= mend)) - continue; - - /* - * Memory range to remove is equivalent to this entry in the - * memory range list. Remove the range entry from the list. - */ - if (base == mstart && end == mend) { - for (; i < mem_rngs->nr_ranges - 1; i++) { - mem_rngs->ranges[i].start = mem_rngs->ranges[i+1].start; - mem_rngs->ranges[i].end = mem_rngs->ranges[i+1].end; - } - mem_rngs->nr_ranges--; - goto out; - } - /* - * Start address of the memory range to remove and the - * current memory range entry in the list is same. Just - * move the start address of the current memory range - * entry in the list to end + 1. - */ - else if (base == mstart) { - mem_rngs->ranges[i].start = end + 1; - goto out; - } - /* - * End address of the memory range to remove and the - * current memory range entry in the list is same. - * Just move the end address of the current memory - * range entry in the list to base - 1. - */ - else if (end == mend) { - mem_rngs->ranges[i].end = base - 1; - goto out; - } - /* - * Memory range to remove is not at the edge of current - * memory range entry. Split the current memory entry into - * two half. 
- */ - else { - size = mem_rngs->ranges[i].end - end + 1; - mem_rngs->ranges[i].end = base - 1; - ret = add_mem_range(mem_ranges, end + 1, size); - } - } -out: - return ret; -} #endif /* CONFIG_CRASH_DUMP */ -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:48 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:48 +0800 Subject: [PATCH v15 06/23] kexec: Extract kexec_free_segment_cma() from kimage_free_cma() In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-7-ruanjinjie@huawei.com> The generic kimage_free_cma() relies on `image->nr_segments` to iterate and free allocated CMA pages. However, during architecture-specific segment placement retry loops (e.g., arm64's image_load()), a mid-way failure will truncate `image->nr_segments` back to its initial value. This truncation permanently hides any CMA pages allocated outside the new boundary from global cleanup, causing silent background memory leaks. To allow architecture-specific loaders to execute fine-grained memory reclamation before truncation occurs, extract the single-pass CMA release logic into a dedicated and exported helper: void kexec_free_segment_cma(struct kimage *image, unsigned long idx); Refactor the main kimage_free_cma() to invoke this helper sequentially to maintain backward compatibility while expanding single-slot flexibility. 
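The split described above can be sketched in user space. This is only a model under stated assumptions: `struct model_image`, `model_free_segment_cma()`, `model_free_cma()` and the other `model_*` names are hypothetical stand-ins, and `malloc()`/`free()` replace the real CMA allocation and release calls. It shows why a per-slot helper that clears its slot lets a caller release a segment before shrinking `nr_segments`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_MAX_SEGMENTS 16

/* Hypothetical stand-in for struct kimage; only the fields the model needs. */
struct model_image {
	unsigned long nr_segments;
	void *segment_cma[MODEL_MAX_SEGMENTS];
};

/* Per-slot release, mirroring the shape of kexec_free_segment_cma(). */
static void model_free_segment_cma(struct model_image *image, unsigned long idx)
{
	void *cma;

	assert(idx < MODEL_MAX_SEGMENTS);
	cma = image->segment_cma[idx];
	if (!cma)
		return;

	free(cma);                      /* stands in for the CMA release call */
	image->segment_cma[idx] = NULL; /* a second call becomes a no-op */
}

/* Whole-image cleanup: it only sees slots below nr_segments. */
static void model_free_cma(struct model_image *image)
{
	unsigned long i;

	for (i = 0; i < image->nr_segments; i++)
		model_free_segment_cma(image, i);
}

/* Count slots that still hold an allocation, regardless of nr_segments. */
static unsigned long model_live_slots(const struct model_image *image)
{
	unsigned long i, live = 0;

	for (i = 0; i < MODEL_MAX_SEGMENTS; i++)
		if (image->segment_cma[i])
			live++;
	return live;
}

/* Scenario: drop the last segment, then run the whole-image cleanup. */
int model_truncate_demo(void)
{
	struct model_image image = { 0 };

	image.segment_cma[0] = malloc(32);
	image.segment_cma[1] = malloc(32);
	image.nr_segments = 2;

	/* Release slot 1 before the count is truncated past it. */
	model_free_segment_cma(&image, 1);
	image.nr_segments = 1;

	model_free_cma(&image);
	return (int)model_live_slots(&image);
}
```

If the `model_free_segment_cma(&image, 1)` call is removed from the scenario, slot 1 stays allocated after `model_free_cma()`, which is the leak the extraction is meant to make avoidable.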
Signed-off-by: Jinjie Ruan --- include/linux/kexec.h | 2 ++ kernel/kexec_core.c | 25 ++++++++++++++----------- 2 files changed, 16 insertions(+), 11 deletions(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 8a22bc9b8c6c..6f1eabda0300 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -532,6 +532,7 @@ extern bool kexec_file_dbg_print; extern void *kimage_map_segment(struct kimage *image, int idx); extern void kimage_unmap_segment(void *buffer); +extern void kexec_free_segment_cma(struct kimage *image, unsigned long idx); #else /* !CONFIG_KEXEC_CORE */ struct pt_regs; struct task_struct; @@ -543,6 +544,7 @@ static inline int kexec_crash_loaded(void) { return 0; } static inline void *kimage_map_segment(struct kimage *image, int idx) { return NULL; } static inline void kimage_unmap_segment(void *buffer) { } +static inline void kexec_free_segment_cma(struct kimage *image, unsigned long idx) { } #define kexec_in_progress false #endif /* CONFIG_KEXEC_CORE */ diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index a43d2da0fe3e..9195f81e53c4 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -554,22 +554,25 @@ static void kimage_free_entry(kimage_entry_t entry) kimage_free_pages(page); } -static void kimage_free_cma(struct kimage *image) +void kexec_free_segment_cma(struct kimage *image, unsigned long idx) { - unsigned long i; + u32 nr_pages = image->segment[idx].memsz >> PAGE_SHIFT; + struct page *cma = image->segment_cma[idx]; - for (i = 0; i < image->nr_segments; i++) { - struct page *cma = image->segment_cma[i]; - u32 nr_pages = image->segment[i].memsz >> PAGE_SHIFT; + if (!cma) + return; - if (!cma) - continue; + arch_kexec_pre_free_pages(page_address(cma), nr_pages); + dma_release_from_contiguous(NULL, cma, nr_pages); + image->segment_cma[idx] = NULL; +} - arch_kexec_pre_free_pages(page_address(cma), nr_pages); - dma_release_from_contiguous(NULL, cma, nr_pages); - image->segment_cma[i] = NULL; - } +static void 
kimage_free_cma(struct kimage *image) +{ + unsigned long i; + for (i = 0; i < image->nr_segments; i++) + kexec_free_segment_cma(image, i); } void kimage_free(struct kimage *image) -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:49 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:49 +0800 Subject: [PATCH v15 07/23] arm64: kexec_file: Fix CMA page leaks during segment placement retry loops In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-8-ruanjinjie@huawei.com> Sashiko AI code review pointed out a CMA page leak in the arm64 kexec image placement retry loop. In image_load(), the loader repeatedly attempts to find a suitable memory hole for the kernel and its associated segments (initrd, dtb, etc.). When a placement attempt fails midway, the core framework rolls back `image->nr_segments` to its initial state to logically purge the failed segments. However, this truncation leaks memory. Any CMA pages successfully allocated via kexec_add_buffer() during the failed attempt are recorded in the `image->segment_cma` array. Since the subsequent global kimage_free_cma() cleanup only iterates up to the truncated (smaller) `nr_segments` boundary, the CMA pages recorded beyond the new boundary are orphaned and permanently leaked. Fix this by using the newly introduced generic kexec_free_segment_cma() helper to reclaim each segment's CMA pages before any truncation occurs: 1. In image_load(), explicitly invoke kexec_free_segment_cma() to release the CMA buffer allocated for the current failed kernel segment before decrementing `image->nr_segments`. 2. In the error path of load_other_segments(), iterate backward from the failed segment index down to `orig_segments`, freeing each orphaned CMA segment allocation before restoring the initial segment count.
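The backward unwind in step 2 can be modelled in user space roughly as follows. This is a sketch under assumptions, not the kernel code: all `model_*` names are hypothetical, `malloc()`/`free()` stand in for the CMA allocation and release, and the `fail_at` parameter exists only to inject a failure.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_MAX_SEGMENTS 16

/* Hypothetical stand-in for struct kimage. */
struct model_image {
	unsigned long nr_segments;
	void *segment_cma[MODEL_MAX_SEGMENTS];
};

/* Per-slot release; clearing the slot keeps repeated calls harmless. */
static void model_free_segment_cma(struct model_image *image, unsigned long idx)
{
	assert(idx < MODEL_MAX_SEGMENTS);
	free(image->segment_cma[idx]);
	image->segment_cma[idx] = NULL;
}

/* Append one segment; should_fail injects a placement failure. */
static int model_add_segment(struct model_image *image, int should_fail)
{
	void *buf;

	if (should_fail || image->nr_segments >= MODEL_MAX_SEGMENTS)
		return -1;
	buf = malloc(64);
	if (!buf)
		return -1;
	image->segment_cma[image->nr_segments++] = buf;
	return 0;
}

/* Add `want` segments; the add numbered fail_at (0-based) fails. */
static int model_load_other_segments(struct model_image *image,
				     int want, int fail_at)
{
	unsigned long orig_segments = image->nr_segments;
	int i;

	for (i = 0; i < want; i++)
		if (model_add_segment(image, i == fail_at))
			goto out_err;
	return 0;

out_err:
	/* Walk back from the last added slot, freeing before shrinking. */
	while (image->nr_segments > orig_segments) {
		model_free_segment_cma(image, image->nr_segments - 1);
		image->nr_segments--;
	}
	return -1;
}

/* Scenario: one "kernel" segment, then a failed attempt at three more. */
int model_unwind_demo(void)
{
	struct model_image image = { 0 };
	unsigned long i, live = 0;

	model_add_segment(&image, 0);
	(void)model_load_other_segments(&image, 3, 2); /* third add fails */

	for (i = 0; i < MODEL_MAX_SEGMENTS; i++)
		if (image.segment_cma[i])
			live++;

	assert(image.nr_segments == 1);
	model_free_segment_cma(&image, 0);
	return (int)live;
}
```

After the failed attempt only the pre-existing segment is still allocated, so a later cleanup bounded by `nr_segments` has nothing left to miss.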
This guarantees that all temporary CMA pages allocated during placement failures are cleanly returned to the contiguous memory allocator, eliminating silent background memory leaks across all retry paths. Cc: Catalin Marinas Cc: Will Deacon Cc: Breno Leitao Cc: Pratyush Yadav Cc: Andrew Morton Cc: Yeoreum Yun Cc: Kees Cook Cc: "Rob Herring (Arm)" Cc: Baoquan He Cc: Coiby Xu Cc: Alexander Graf Cc: Pasha Tatashin Cc: stable at vger.kernel.org Fixes: 07d24902977e4 ("kexec: enable CMA based contiguous allocation") Signed-off-by: Jinjie Ruan --- arch/arm64/kernel/kexec_image.c | 1 + arch/arm64/kernel/machine_kexec_file.c | 5 ++++- 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c index b70f4df15a1a..ffcb7f9075e6 100644 --- a/arch/arm64/kernel/kexec_image.c +++ b/arch/arm64/kernel/kexec_image.c @@ -107,6 +107,7 @@ static void *image_load(struct kimage *image, * We couldn't find space for the other segments; erase the * kernel segment and try the next available hole. 
*/ + kexec_free_segment_cma(image, kernel_segment_number); image->nr_segments -= 1; kbuf.buf_min = kernel_segment->mem + kernel_segment->memsz; kbuf.mem = KEXEC_BUF_MEM_UNKNOWN; diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c index e31fabed378a..13c247c28866 100644 --- a/arch/arm64/kernel/machine_kexec_file.c +++ b/arch/arm64/kernel/machine_kexec_file.c @@ -195,7 +195,10 @@ int load_other_segments(struct kimage *image, return 0; out_err: - image->nr_segments = orig_segments; + while (image->nr_segments > orig_segments) { + kexec_free_segment_cma(image, image->nr_segments - 1); + image->nr_segments--; + } kvfree(dtb); return ret; } -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:50 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:50 +0800 Subject: [PATCH v15 08/23] arm64: kexec_file: Fix image->elf_headers memory leak during retry loop In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-9-ruanjinjie@huawei.com> Sashiko AI code review pointed out a potential memory leak of image->elf_headers when load_other_segments() fails on error paths. In the arm64 kexec_file file-load path, kexec_image.c runs a retry loop calling kexec_add_buffer() to find a suitable location for the kernel segment. On each iteration, load_other_segments() is invoked to allocate and populate alternative segments such as initrd, DTB, and ELF headers. However, if a placement or allocation failure occurs later in load_other_segments() (e.g., when adding initrd or dtb), the execution jumps to the out_err label. While this path restores image->nr_segments via orig_segments, it returns an error back to the caller without freeing the previously allocated image->elf_headers vmalloc buffer. 
As a result, the retry loop in image_load() unconditionally allocates new ELF headers on the next iteration and overwrites image->elf_headers, permanently leaking the buffers allocated in previous iterations. To fix this, decouple the ELF header allocation from the placement retry loop. Since the contents and size of the ELF headers depend only on the host memory layout and do not change with the kernel's physical placement, move the prepare_elf_headers() call out of the loop, so that it runs once before the retry while loop in image_load(). If kexec_add_buffer() fails for the ELF headers, there is no need to vfree() them at that point, because the error path frees `image->elf_headers` by calling arch_kimage_file_post_load_cleanup(). This change eliminates redundant memory allocation/deallocation overhead during kexec placement retries and removes the Use-After-Free and memory leak risk. Also remove the prepare_elf_headers() call from inside load_other_segments() and have it directly reuse the single, pre-allocated image->elf_headers. Cc: Catalin Marinas Cc: Will Deacon Cc: Thomas Huth Cc: Breno Leitao Cc: Andrew Morton Cc: Yeoreum Yun Cc: Coiby Xu Cc: Baoquan He Cc: Kees Cook Cc: Benjamin Gwin Cc: stable at vger.kernel.org Fixes: 108aa503657e ("arm64: kexec_file: try more regions if loading segments fails") Signed-off-by: Jinjie Ruan --- v15: - Use image->elf_headers and image->elf_headers_sz instead of adding function parameters for load_other_segments() to simplify the fix.
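For readers following the series, the "allocate once, reuse on every retry, free once in cleanup" shape described in this patch can be sketched in user space. Everything below is an illustrative assumption: the `model_*` names are hypothetical, `malloc()`/`free()` replace the vmalloc-backed buffer, and the `succeed_on` parameter merely simulates placement attempts that fail a few times.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two struct kimage fields involved. */
struct model_image {
	void *elf_headers;
	unsigned long elf_headers_sz;
};

static int model_prepare_calls; /* how many header buffers were built */

/* Builds the header buffer; the content does not depend on placement. */
static int model_prepare_elf_headers(void **addr, unsigned long *sz)
{
	void *buf = malloc(128);

	if (!buf)
		return -1;
	model_prepare_calls++;
	*addr = buf;
	*sz = 128;
	return 0;
}

/* One placement attempt: reuses image->elf_headers, never allocates. */
static int model_load_other_segments(struct model_image *image,
				     int attempt, int succeed_on)
{
	assert(image->elf_headers);
	return attempt >= succeed_on ? 0 : -1;
}

/* Single owner of the free, like the post-load cleanup hook. */
static void model_post_load_cleanup(struct model_image *image)
{
	free(image->elf_headers);
	image->elf_headers = NULL;
	image->elf_headers_sz = 0;
}

/* Headers are prepared once, before the retry loop starts. */
static int model_image_load(struct model_image *image,
			    int max_attempts, int succeed_on)
{
	int attempt, ret;

	ret = model_prepare_elf_headers(&image->elf_headers,
					&image->elf_headers_sz);
	if (ret)
		return ret;

	for (attempt = 0; attempt < max_attempts; attempt++) {
		ret = model_load_other_segments(image, attempt, succeed_on);
		if (!ret)
			return 0;
	}
	return -1;
}

/* Scenario: three failed placements, then success on the fourth. */
int model_hoist_demo(void)
{
	struct model_image image = { 0 };

	model_prepare_calls = 0;
	(void)model_image_load(&image, 5, 3);
	model_post_load_cleanup(&image);
	return model_prepare_calls;
}
```

Because the loop body never allocates, the number of buffers built stays at one no matter how many placements are tried, and the cleanup hook remains the only place that frees it.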
--- arch/arm64/include/asm/kexec.h | 1 + arch/arm64/kernel/kexec_image.c | 16 ++++++++++++++++ arch/arm64/kernel/machine_kexec_file.c | 23 +++++------------------ 3 files changed, 22 insertions(+), 18 deletions(-) diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h index 892e5bebda95..7ffa2ff5fcfd 100644 --- a/arch/arm64/include/asm/kexec.h +++ b/arch/arm64/include/asm/kexec.h @@ -128,6 +128,7 @@ extern int load_other_segments(struct kimage *image, unsigned long kernel_load_addr, unsigned long kernel_size, char *initrd, unsigned long initrd_len, char *cmdline); +extern int prepare_elf_headers(void **addr, unsigned long *sz); #endif #endif /* __ASSEMBLER__ */ diff --git a/arch/arm64/kernel/kexec_image.c b/arch/arm64/kernel/kexec_image.c index ffcb7f9075e6..424b9527db09 100644 --- a/arch/arm64/kernel/kexec_image.c +++ b/arch/arm64/kernel/kexec_image.c @@ -89,6 +89,22 @@ static void *image_load(struct kimage *image, kernel_segment_number = image->nr_segments; +#ifdef CONFIG_CRASH_DUMP + if (image->type == KEXEC_TYPE_CRASH) { + /* load elf core header */ + unsigned long headers_sz; + void *headers; + + ret = prepare_elf_headers(&headers, &headers_sz); + if (ret) { + pr_err("Preparing elf core header failed\n"); + return ERR_PTR(ret); + } + image->elf_headers = headers; + image->elf_headers_sz = headers_sz; + } +#endif + /* * The location of the kernel segment may make it impossible to satisfy * the other segment requirements, so we try repeatedly to find a diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c index 13c247c28866..4cbb71e1f8ed 100644 --- a/arch/arm64/kernel/machine_kexec_file.c +++ b/arch/arm64/kernel/machine_kexec_file.c @@ -40,7 +40,7 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image) } #ifdef CONFIG_CRASH_DUMP -static int prepare_elf_headers(void **addr, unsigned long *sz) +int prepare_elf_headers(void **addr, unsigned long *sz) { struct crash_mem *cmem; unsigned int 
nr_ranges; @@ -105,32 +105,19 @@ int load_other_segments(struct kimage *image, kbuf.buf_min = kernel_load_addr + kernel_size; #ifdef CONFIG_CRASH_DUMP - /* load elf core header */ - void *headers; - unsigned long headers_sz; if (image->type == KEXEC_TYPE_CRASH) { - ret = prepare_elf_headers(&headers, &headers_sz); - if (ret) { - pr_err("Preparing elf core header failed\n"); - goto out_err; - } - - kbuf.buffer = headers; - kbuf.bufsz = headers_sz; + kbuf.buffer = image->elf_headers; + kbuf.bufsz = image->elf_headers_sz; kbuf.mem = KEXEC_BUF_MEM_UNKNOWN; - kbuf.memsz = headers_sz; + kbuf.memsz = image->elf_headers_sz; kbuf.buf_align = SZ_64K; /* largest supported page size */ kbuf.buf_max = ULONG_MAX; kbuf.top_down = true; ret = kexec_add_buffer(&kbuf); - if (ret) { - vfree(headers); + if (ret) goto out_err; - } - image->elf_headers = headers; image->elf_load_addr = kbuf.mem; - image->elf_headers_sz = headers_sz; kexec_dprintk("Loaded elf core header at 0x%lx bufsz=0x%lx memsz=0x%lx\n", image->elf_load_addr, kbuf.bufsz, kbuf.memsz); -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:51 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:51 +0800 Subject: [PATCH v15 09/23] kexec: Fix UAF and Double Free in crash_load_dm_crypt_keys() In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-10-ruanjinjie@huawei.com> A static memory safety review by Sashiko AI identified a high-severity Use-After-Free (UAF) and Double Free vulnerability in the dm-crypt keys handling path during arm64 kexec image placement retry loops. In crash_load_dm_crypt_keys(), when the segment allocation fails via kexec_add_buffer(), the error path invokes `kvfree((void *)kbuf.buffer)` to reclaim the keys buffer. However, the global pointer `keys_header` is left dangling with a stale address, creating an insecure memory trap. 
When the top-level loader image_load() retries with the next available placement hole, crash_load_dm_crypt_keys() is re-entered. Since `is_dm_key_reused` is a read-only global configuration managed by user-space configfs, it cannot be mutated by the kernel. If it remains true, the loader skips build_keys_header() and blindly reuses the stale `keys_header` pointer for kbuf.buffer, triggering a Use-After-Free or a NULL pointer dereference during kexec_add_buffer(). Alternatively, a fresh header build can trigger a Double Free inside build_keys_header(). Fix this by setting the global `keys_header` to NULL immediately after it is freed in the failure path. In addition, extend the header regeneration check to a composite condition: `if (!is_dm_key_reused || !keys_header)` This ensures that if a previous retry attempt freed the buffer, the kernel regenerates the header internally without modifying the user-configured `is_dm_key_reused` flag, keeping the keys buffer consistent and memory-safe across all retry paths.
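The interaction between the cleared pointer and the composite condition can be illustrated with a small user-space model. It is a sketch under assumptions: the `*_model` and `model_*` names are hypothetical, `malloc()`/`free()` replace the kernel allocator, the `add_buffer_fails` parameter simulates a failing kexec_add_buffer(), and the model builder is assumed to free any previous buffer before allocating a new one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

static void *keys_header_model;      /* stands in for the global keys_header */
static bool key_reused_model = true; /* user-controlled flag, never written here */
static int build_calls_model;        /* how many times the header was rebuilt */

/* Rebuilds the header; assumed to release any previous buffer first. */
static int model_build_keys_header(void)
{
	free(keys_header_model); /* free(NULL) is a no-op */
	keys_header_model = malloc(64);
	if (!keys_header_model)
		return -1;
	build_calls_model++;
	return 0;
}

/* One load attempt; add_buffer_fails simulates a placement failure. */
static int model_load_keys(bool add_buffer_fails)
{
	/* Rebuild when reuse is off OR an earlier failure freed the buffer. */
	if (!key_reused_model || !keys_header_model) {
		int r = model_build_keys_header();

		if (r)
			return r;
	}

	if (add_buffer_fails) {
		free(keys_header_model);
		keys_header_model = NULL; /* leave no dangling pointer behind */
		return -1;
	}
	return 0;
}

/* Scenario: the first attempt fails, the retry succeeds. */
int model_keys_retry_demo(void)
{
	int first, second;

	keys_header_model = NULL;
	build_calls_model = 0;

	first = model_load_keys(true);
	second = model_load_keys(false);
	assert(first == -1 && second == 0);

	free(keys_header_model);
	keys_header_model = NULL;
	return build_calls_model;
}
```

With the reuse flag left at true, the retry still rebuilds the header because the pointer was reset, so the buffer handed to the second attempt is always a live allocation.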
Cc: Andrew Morton Cc: Baoquan He Cc: Mike Rapoport Cc: Pasha Tatashin Cc: Pratyush Yadav Cc: Dave Young Cc: stable at vger.kernel.org Fixes: e3a84be1ec2f ("arm64,ppc64le/kdump: pass dm-crypt keys to kdump kernel") Signed-off-by: Jinjie Ruan --- kernel/crash_dump_dm_crypt.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/crash_dump_dm_crypt.c b/kernel/crash_dump_dm_crypt.c index cb875ddb6ba6..2c5462876337 100644 --- a/kernel/crash_dump_dm_crypt.c +++ b/kernel/crash_dump_dm_crypt.c @@ -412,13 +412,12 @@ int crash_load_dm_crypt_keys(struct kimage *image) }; int r; - if (key_count <= 0) { kexec_dprintk("No dm-crypt keys\n"); return 0; } - if (!is_dm_key_reused) { + if (!is_dm_key_reused || unlikely(!keys_header)) { image->dm_crypt_keys_addr = 0; r = build_keys_header(); if (r) { @@ -437,6 +436,7 @@ int crash_load_dm_crypt_keys(struct kimage *image) if (r) { pr_err("Failed to call kexec_add_buffer, ret=%d\n", r); kvfree((void *)kbuf.buffer); + keys_header = NULL; return r; } image->dm_crypt_keys_addr = kbuf.mem; -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:52 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:52 +0800 Subject: [PATCH v15 10/23] crash_core: Introduce CRASH_HOTPLUG_SAFETY_PADDING for memory hotplug safety In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-11-ruanjinjie@huawei.com> Introduce CRASH_HOTPLUG_SAFETY_PADDING to allocate extra slots for the crash memory ranges array, mitigating potential TOCTOU races caused by concurrent memory hotplug events. When CONFIG_MEMORY_HOTPLUG is disabled, the padding safely defaults to 0 as the memory layout remains static. 
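A user-space sketch of how such padding might be consumed is given below. It is only an illustration under assumptions: `MODEL_HOTPLUG_PADDING`, `struct model_crash_mem` and the `model_*` functions are hypothetical, `calloc()` replaces the kernel allocator, and the bounds check in `model_add_range()` is the kind of guard a caller could pair with the padding. The `counted` and `actual` parameters simulate the range count seen by the sizing pass and by the populate pass respectively.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_HOTPLUG_PADDING 128 /* mirrors the hotplug-enabled value */

struct model_range {
	unsigned long long start, end;
};

/* Hypothetical stand-in for struct crash_mem. */
struct model_crash_mem {
	unsigned int max_nr_ranges;
	unsigned int nr_ranges;
	struct model_range ranges[];
};

/* Size the array for the counted ranges plus the safety padding. */
static struct model_crash_mem *model_alloc_ranges(unsigned int counted)
{
	unsigned int max = counted + MODEL_HOTPLUG_PADDING;
	struct model_crash_mem *cmem;

	cmem = calloc(1, sizeof(*cmem) + (size_t)max * sizeof(cmem->ranges[0]));
	if (!cmem)
		return NULL;
	cmem->max_nr_ranges = max;
	return cmem;
}

/* Append one range, refusing to write past the allocated slots. */
static int model_add_range(struct model_crash_mem *cmem,
			   unsigned long long start, unsigned long long end)
{
	assert(start <= end);
	if (cmem->nr_ranges >= cmem->max_nr_ranges)
		return -EAGAIN;

	cmem->ranges[cmem->nr_ranges].start = start;
	cmem->ranges[cmem->nr_ranges].end = end;
	cmem->nr_ranges++;
	return 0;
}

/* `counted` ranges seen when sizing, `actual` ranges seen when filling. */
int model_fill_demo(unsigned int counted, unsigned int actual)
{
	struct model_crash_mem *cmem = model_alloc_ranges(counted);
	unsigned int i;
	int ret = 0;

	if (!cmem)
		return -ENOMEM;

	for (i = 0; i < actual && !ret; i++)
		ret = model_add_range(cmem, i * 0x1000ULL,
				      i * 0x1000ULL + 0xfff);

	free(cmem);
	return ret;
}
```

In this model the padding absorbs up to 128 ranges that appear between the two passes; one more than that is rejected with -EAGAIN instead of being written out of bounds.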
Signed-off-by: Jinjie Ruan --- include/linux/crash_core.h | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index c1dee3f971a9..d4762e000098 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -14,6 +14,12 @@ struct crash_mem { struct range ranges[] __counted_by(max_nr_ranges); }; +#ifdef CONFIG_MEMORY_HOTPLUG +#define CRASH_HOTPLUG_SAFETY_PADDING 128 +#else +#define CRASH_HOTPLUG_SAFETY_PADDING 0 +#endif + #ifdef CONFIG_CRASH_DUMP int crash_shrink_memory(unsigned long new_size); -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:53 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:53 +0800 Subject: [PATCH v15 11/23] x86: kexec_file: Fix TOCTOU buffer overflow via memory region padding In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-12-ruanjinjie@huawei.com> Sashiko AI code review pointed out that there is a TOCTOU (Time-of-Check to Time-of-Use) race condition in prepare_elf_headers() between the initial pass that counts System RAM ranges and the second pass that populates them. If a memory hotplug event occurs between these two steps, the number of memory regions may increase, causing an out-of-bounds write to the cmem->ranges[] array. Fix this by using `CRASH_HOTPLUG_SAFETY_PADDING` (128 slots) to enlarge the flexible array allocation upfront. This absorbs up to 128 memory regions added concurrently. In addition, add a defensive boundary check inside the callback to return -EAGAIN on an unexpected overrun, closing the overflow window. Cc: AKASHI Takahiro Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H.
Peter Anvin" Cc: Andrew Morton Cc: Baoquan He Cc: Mike Rapoport Cc: stable at vger.kernel.org Fixes: 8d5f894a3108 ("x86: kexec_file: lift CRASH_MAX_RANGES limit on crash_mem buffer") Signed-off-by: Jinjie Ruan --- arch/x86/kernel/crash.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c index cd796818d94d..a1089907728d 100644 --- a/arch/x86/kernel/crash.c +++ b/arch/x86/kernel/crash.c @@ -177,7 +177,7 @@ static struct crash_mem *fill_up_crash_elf_data(void) * But in order to lest the low 1M could be changed in the future, * (e.g. [start, 1M]), add a extra slot. */ - nr_ranges += 3 + crashk_cma_cnt; + nr_ranges += 3 + crashk_cma_cnt + CRASH_HOTPLUG_SAFETY_PADDING; cmem = vzalloc(struct_size(cmem, ranges, nr_ranges)); if (!cmem) return NULL; @@ -226,6 +226,9 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg) { struct crash_mem *cmem = arg; + if (unlikely(cmem->nr_ranges >= cmem->max_nr_ranges)) + return -EAGAIN; + cmem->ranges[cmem->nr_ranges].start = res->start; cmem->ranges[cmem->nr_ranges].end = res->end; cmem->nr_ranges++; -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:54 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:54 +0800 Subject: [PATCH v15 12/23] arm64: kexec_file: Fix TOCTOU buffer overflow via memory region padding In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-13-ruanjinjie@huawei.com> Sashiko AI code review pointed out there is a TOCTOU (Time-of-Check to Time-of-Use) race condition in prepare_elf_headers() between the initial pass that counts System RAM ranges and the second pass that populates them. If a memory hotplug event occurs between these two steps, the number of memory regions may increase, causing an out-of-bounds write to the cmem->ranges[] array. 
Fix this fundamentally by using `CRASH_HOTPLUG_SAFETY_PADDING` (128 slots) to expand the flexible array allocation ceiling upfront. This safely absorbs any concurrent memory region expansion. Concurrently, add a defensive boundary check to return -EAGAIN on unexpected overrun, fully eradicating the overflow window and ensuring system stability. Cc: Catalin Marinas Cc: Will Deacon Cc: Andrew Morton Cc: Baoquan He Cc: Breno Leitao Cc: stable at vger.kernel.org Fixes: 3751e728cef2 ("arm64: kexec_file: add crash dump support") Signed-off-by: Jinjie Ruan --- arch/arm64/kernel/machine_kexec_file.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c index 4cbb71e1f8ed..8a96fb68b88d 100644 --- a/arch/arm64/kernel/machine_kexec_file.c +++ b/arch/arm64/kernel/machine_kexec_file.c @@ -48,7 +48,8 @@ int prepare_elf_headers(void **addr, unsigned long *sz) u64 i; phys_addr_t start, end; - nr_ranges = 2; /* for exclusion of crashkernel region */ + /* for exclusion of crashkernel region */ + nr_ranges = 2 + CRASH_HOTPLUG_SAFETY_PADDING; for_each_mem_range(i, &start, &end) nr_ranges++; @@ -59,6 +60,11 @@ int prepare_elf_headers(void **addr, unsigned long *sz) cmem->max_nr_ranges = nr_ranges; cmem->nr_ranges = 0; for_each_mem_range(i, &start, &end) { + if (unlikely(cmem->nr_ranges >= cmem->max_nr_ranges)) { + ret = -EAGAIN; + goto out; + } + cmem->ranges[cmem->nr_ranges].start = start; cmem->ranges[cmem->nr_ranges].end = end - 1; cmem->nr_ranges++; -- 2.34.1 From ruanjinjie at huawei.com Mon Jun 1 02:47:55 2026 From: ruanjinjie at huawei.com (Jinjie Ruan) Date: Mon, 1 Jun 2026 17:47:55 +0800 Subject: [PATCH v15 13/23] riscv: kexec_file: Fix TOCTOU buffer overflow via memory region padding In-Reply-To: <20260601094805.2928614-1-ruanjinjie@huawei.com> References: <20260601094805.2928614-1-ruanjinjie@huawei.com> Message-ID: <20260601094805.2928614-14-ruanjinjie@huawei.com> Sashiko AI 
code review pointed out there is a TOCTOU (Time-of-Check to Time-of-Use) race condition in prepare_elf_headers() between the initial pass that counts System RAM ranges and the second pass that populates them. If a memory hotplug event occurs between these two steps, the number of memory regions may increase, causing an out-of-bounds write to the cmem->ranges[] array. Fix this fundamentally by using `CRASH_HOTPLUG_SAFETY_PADDING` (128 slots) to expand the flexible array allocation ceiling upfront. This safely absorbs any concurrent memory region expansion. Concurrently, add a defensive boundary check inside the callback to return -EAGAIN on unexpected overrun, fully eradicating the overflow window and ensuring system stability. Cc: Paul Walmsley Cc: Palmer Dabbelt Cc: Albert Ou