From pasha.tatashin at soleen.com Sat Nov 1 06:49:46 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Sat, 1 Nov 2025 09:49:46 -0400 Subject: [PATCH 1/1] kexec: Use %pe format specifier for error pointer printing In-Reply-To: <20251016200320.4179702-1-yanjun.zhu@linux.dev> References: <20251016200320.4179702-1-yanjun.zhu@linux.dev> Message-ID: On Thu, Oct 16, 2025 at 4:03 PM Zhu Yanjun wrote: > > Make pr_xxx() call to use the %pe format specifier instead of %d. > The %pe specifier prints a symbolic error string (e.g., -ENOMEM, > -EINVAL) when given an error pointer created with ERR_PTR(err). > > This change enhances the clarity and diagnostic value of the error > message by showing a descriptive error name rather than a numeric > error code. > > Signed-off-by: Zhu Yanjun > CC: graf at amazon.com > CC: rppt at kernel.org > CC: changyuanl at google.com > CC: akpm at linux-foundation.org > CC: bhe at redhat.com > > --- > kernel/kexec_handover.c | 18 +++++++++--------- > 1 file changed, 9 insertions(+), 9 deletions(-) > > diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c > index 76f0940fb485..77af377022b0 100644 > --- a/kernel/kexec_handover.c > +++ b/kernel/kexec_handover.c > @@ -1095,7 +1095,7 @@ static int kho_abort(void) > err = notifier_to_errno(err); > > if (err) > - pr_err("Failed to abort KHO finalization: %d\n", err); > + pr_err("Failed to abort KHO finalization: %pe\n", ERR_PTR(err)); > > return err; > } > @@ -1142,7 +1142,7 @@ static int kho_finalize(void) > > abort: > if (err) { > - pr_err("Failed to convert KHO state tree: %d\n", err); > + pr_err("Failed to convert KHO state tree: %pe\n", ERR_PTR(err)); The problem here (and in some other places below) is that err is not an -errno but an fdt error; see scripts/dtc/libfdt/libfdt.h. %pe with ERR_PTR(err) will output garbage and make debugging even harder.
From sourabhjain at linux.ibm.com Sat Nov 1 12:37:41 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Sun, 2 Nov 2025 01:07:41 +0530 Subject: [PATCH] crash: fix crashkernel resource shrink Message-ID: <20251101193741.289252-1-sourabhjain@linux.ibm.com> When crashkernel is configured with a high reservation, shrinking its value below the low crashkernel reservation causes two issues: 1. Invalid crashkernel resource objects 2. Kernel crash if crashkernel shrinking is done twice For example, with crashkernel=200M,high, the kernel reserves 200MB of high memory and some default low memory (say 256MB). The reservation appears as: cat /proc/iomem | grep -i crash af000000-beffffff : Crash kernel 433000000-43f7fffff : Crash kernel If crashkernel is then shrunk to 50MB (echo 52428800 > /sys/kernel/kexec_crash_size), /proc/iomem still shows 256MB reserved: af000000-beffffff : Crash kernel Instead, it should show 50MB: af000000-b21fffff : Crash kernel Further shrinking crashkernel to 40MB causes a kernel crash with the following trace (x86): BUG: kernel NULL pointer dereference, address: 0000000000000038 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI Call Trace: ? __die_body.cold+0x19/0x27 ? page_fault_oops+0x15a/0x2f0 ? search_module_extables+0x19/0x60 ? search_bpf_extables+0x5f/0x80 ? exc_page_fault+0x7e/0x180 ? asm_exc_page_fault+0x26/0x30 ? __release_resource+0xd/0xb0 release_resource+0x26/0x40 __crash_shrink_memory+0xe5/0x110 crash_shrink_memory+0x12a/0x190 kexec_crash_size_store+0x41/0x80 kernfs_fop_write_iter+0x141/0x1f0 vfs_write+0x294/0x460 ksys_write+0x6d/0xf0 This happens because __crash_shrink_memory() in kernel/crash_core.c incorrectly updates the crashk_res resource object even when crashk_low_res should be updated. Fix this by ensuring the correct crashkernel resource object is updated when shrinking crashkernel memory.
Fixes: 16c6006af4d4 ("kexec: enable kexec_crash_size to support two crash kernel regions") Cc: Andrew Morton Cc: Baoquan He Cc: Zhen Lei Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- kernel/crash_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 3b1c43382eec..99dac1aa972a 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -373,7 +373,7 @@ static int __crash_shrink_memory(struct resource *old_res, old_res->start = 0; old_res->end = 0; } else { - crashk_res.end = ram_res->start - 1; + old_res->end = ram_res->start - 1; } crash_free_reserved_phys_range(ram_res->start, ram_res->end); -- 2.51.0 From rientjes at google.com Sat Nov 1 16:35:07 2025 From: rientjes at google.com (David Rientjes) Date: Sat, 1 Nov 2025 16:35:07 -0700 (PDT) Subject: [Hypervisor Live Update] Notes from October 20, 2025 Message-ID: <734e26d2-ac5f-47be-331c-40e9b535ce55@google.com> Hi everybody, Here are the notes from the last Hypervisor Live Update call that happened on Monday, October 20. Thanks to everybody who was involved! These notes are intended to bring people up to speed who could not attend the call as well as keep the conversation going in between meetings. ----->o----- I thought this instance of the meeting would be short and I turned out to be very wrong :) We touched on the discussion from the previous instance regarding the fd dependency checking and this happening at the time of preserve rather than prepare; Pasha noted that the discussion continued upstream afterwards on the mailing list. The biggest change would be that the order is going to be enforced by the user. The preserve function itself does the heavy lifting now; the freeze and prepare are more for sanity checking. David Matlack asked how the global states would work since that's outside the fd.
Pasha said the subsystem will be there but there will be another mechanism that follows the lifecycle of fds of a specific type; example is if a session has an fd of a specific type then it will follow the lifecycle of the aggregate. This will be supported in v5. ----->o----- Pasha updated that he had sent the KHO patches that provide the groundwork for LUO. Last week he also sent a KHO memory corruption fix. Once those patches are merged, he will send LUO v5. He was targeting sending the next series of changes before the next biweekly sync. ----->o----- Vipin Sharma sent out RFC patches for VFIO and was looking for feedback from the group in the next instance of the meeting. Jason was providing feedback on the upstream mailing list already. ----->o----- We shifted to discussing the main topic of the day which was iommu persistence from Samiullah. His slides are available on the shared drive. There was general alignment with what should be included in the next series upstream. His demonstrator so far included iommufd, iommu core, and iommu driver patches but was just preserving root tables. He also proposed hot swap. There was lots of discussion upstream around selection of HWPT to be preserved, preserved HWPT and iommu domain lifecycle, fd dependencies, and LUO finish. Pasha noted that LUO finish can now fail which Jason asked about. Pasha said if the fd hasn't replaced the hardware page table then finish would have to fail. Sami noted that the HWPTs are also restored and associated with the preserved iommu domains and this would be done when the fd is retrieved. We can't restore the domain during the probe but there is no mechanism to have the HWPTs to be created during the boot time. Jason said during probe time you put the domains back with placeholders so the iommu core has some understanding what the translation is. 
----->o----- During the discussion for hotswap, Sami noted that once all the preserved devices have their iommu domains hot swapped, we can destroy the restored iommu domains that are not being used. Jason said that once the iommu domains are rehydrated back into an fd that they should have the normal lifecycle of a hardware page table in an fd. So they will be destroyed when the hardware page table is destroyed when the fd closes it or the VMM asks it to be destroyed. Jason noted that the VMM needs the id so that it can be destroyed. Jason suggested restoring the hardware page table pointers inside the devices that represent the currently attached hardware page table and this is done when you bring back the iommufd. We should likely retain a list for each hardware page table the list of which VFIO device objects are linked to it and this all needs to be brought back. Or an alternative may be to serialize the devices. IOMMU needs the VFIO devices and this needs careful orchestration. Pasha suggested that since we have the session and sessions have specific orders, the things without any dependencies that were preserved first and things with dependencies were preserved last. The kernel could call restore on everything from lowest to highest. Jason said there needs to be a two step process: the struct file needs to be brought back before you fill it. VFIO needs the iommufd to be filled before it can auto bind before it can complete its restoration. Sami suggested if we don't restore the HWPT until we have all the information, even if it closes it goes back to the state that it was in and we would consider the iommufd not fully restored until it is. Jason suggested that would require adding an iommufd ioctl to restore individual sub objects: restoring a HWPT that was with this tag and give back the id; the restore would only be possible if the VFIO devices are already present inside the iommufd. 
----->o----- When discussing LUO finish, Pasha suggested we need a way to discard a session if it hasn't been reclaimed or there are exceptions. If the VM never is restored then we will have lingering session that need to be somehow discarded. Jason suggested all objects are brought back to userspace before you can encounter an error. If there are problems up to that point, then the cleanest way to address this is with another kexec. Jason stressed the need for another kexec as a big hammer to be able to do recovery and cleanup. For example, if there are 10 VMs and one did not restore, do another live update to clean up the lingering VM. ----->o----- Next meeting will be on Monday, November 3 at 8am PST (UTC-8), everybody is welcome: https://meet.google.com/rjn-dmzu-hgq NOTE!!! Daylight Savings Time has ended in the United States, so please check your local time carefully: Time zones PST (UTC-8) 8:00am MST (UTC-7) 9:00am CST (UTC-6) 10:00am EST (UTC-5) 11:00am Rio de Janeiro (UTC-3) 1:00pm London (UTC) 4:00pm Berlin (UTC+1) 5:00pm Moscow (UTC+3) 7:00pm Dubai (UTC+4) 8:00pm Mumbai (UTC+5:30) 9:30pm Singapore (UTC+8) 12:00am Tuesday Beijing (UTC+8) 12:00am Tuesday Tokyo (UTC+9) 1:00am Tuesday Sydney (UTC+11) 3:00am Tuesday Auckland (UTC+13) 5:00am Tuesday Topics for the next meeting: - update on the status of stateless KHO RFC patches that should simplify LUO support - update on LUO v5 and patch series sent upstream after KHO changes and fixes are staged - VFIO RFC patch feedback based on the series sent to the mailing list a couple weeks ago - follow up on the status of iommu persistence and any addtional discussion from last time - update on memfd preservation, vmalloc support, and 1GB limitation - discuss deferred struct page initialization and deferring when KHO is enabled - discuss guest_memfd preservation use cases for Confidential Computing and any current work happening on it, including overlap with memfd preservation being worked on by Pratyush + discuss any 
use cases for Confidential Computing where folios may need to be split after being marked as preserved during brown out - later: testing methodology to allow downstream consumers to qualify that live update works from one version to another - later: reducing blackout window during live update Please let me know if you'd like to propose additional topics for discussion, thank you! From georges.aureau at hpe.com Sun Nov 2 02:48:42 2025 From: georges.aureau at hpe.com (Aureau, Georges (Kernel Tools ERT)) Date: Sun, 2 Nov 2025 10:48:42 +0000 Subject: [PATCH][makedumpfile -R] Promptly return error on truncated regular file. In-Reply-To: References: Message-ID: Hello, Forget about this PATCH, it is missing some option as to control the behavior. What I'm really after is to promptly detect truncated flatten files without the 10-minute timeout. This would be assuming stdin is not a file in the process of being created (where eof is a moving target). Maybe something like "makedumpfile -R --some-option", where some-option would cause returning an immediate error in premature EOF: - if (TIMEOUT_STDIN < (tm - last_time)) { + if (some_option || TIMEOUT_STDIN < (tm - last_time)) { Thanks, Georges I |-----Original Message----- |From: Aureau, Georges (Kernel Tools ERT) |Sent: Friday, October 31, 2025 9:41 PM |To: kexec at lists.infradead.org |Cc: yamazaki-msmt at nec.com; HAGIO KAZUHITO(?????) |Subject: [PATCH][makedumpfile -R] Promptly return error on truncated regular |file. | |[PATCH][makedumpfile -R] Promptly return error on truncated regular file. | |When reaching the end-of-file on a truncated input regular file, |makedumpfile -R is looping for 10 minutes before producing an error. |This is confusing for users. When stdin is a regular file, an improved |behavior is to promptly return an EOF error. 
| |Signed-off-by: Georges Aureau |-- |diff --git a/makedumpfile.c b/makedumpfile.c |index 12fb0d8..295b3cc 100644 |--- a/makedumpfile.c |+++ b/makedumpfile.c |@@ -5135,6 +5135,21 @@ write_cache_zero(struct cache_data *cd, size_t size) | return write_cache_bufsz(cd); | } | |+int |+is_stdin_regular_file(void) |+{ |+ struct stat st; |+ static int regular_file = -1; |+ if (regular_file == -1) { |+ if (fstat(STDIN_FILENO, &st) == -1) { |+ regular_file = FALSE; |+ } else { |+ regular_file = S_ISREG(st.st_mode) ? TRUE : FALSE; |+ } |+ } |+ return regular_file; |+} |+ | int | read_buf_from_stdin(void *buf, int buf_size) | { |@@ -5154,11 +5169,12 @@ read_buf_from_stdin(void *buf, int buf_size) | | } else if (0 == tmp_read_size) { | /* |- * If it cannot get any data from a standard input |+ * If we reach end-of-file on regular file, or |+ * if we cannot get any data from a standard input | * for a long time, break this loop. | */ | tm = time(NULL); |- if (TIMEOUT_STDIN < (tm - last_time)) { |+ if (is_stdin_regular_file() || TIMEOUT_STDIN < (tm - last_time)) { | ERRMSG("Can't get any data from STDIN.\n"); | return FALSE; | } | | From bhe at redhat.com Sun Nov 2 18:54:50 2025 From: bhe at redhat.com (Baoquan He) Date: Mon, 3 Nov 2025 10:54:50 +0800 Subject: [PATCH] crash: fix crashkernel resource shrink In-Reply-To: <20251101193741.289252-1-sourabhjain@linux.ibm.com> References: <20251101193741.289252-1-sourabhjain@linux.ibm.com> Message-ID: On 11/02/25 at 01:07am, Sourabh Jain wrote: > When crashkernel is configured with a high reservation, shrinking its > value below the low crashkernel reservation causes two issues: > > 1. Invalid crashkernel resource objects > 2. Kernel crash if crashkernel shrinking is done twice > > For example, with crashkernel=200M,high, the kernel reserves 200MB of > high memory and some default low memory (say 256MB). 
The reservation > appears as: > > cat /proc/iomem | grep -i crash > af000000-beffffff : Crash kernel > 433000000-43f7fffff : Crash kernel > > If crashkernel is then shrunk to 50MB (echo 52428800 > > /sys/kernel/kexec_crash_size), /proc/iomem still shows 256MB reserved: > af000000-beffffff : Crash kernel > > Instead, it should show 50MB: > af000000-b21fffff : Crash kernel > > Further shrinking crashkernel to 40MB causes a kernel crash with the > following trace (x86): > > BUG: kernel NULL pointer dereference, address: 0000000000000038 > PGD 0 P4D 0 > Oops: 0000 [#1] PREEMPT SMP NOPTI > > Call Trace: > ? __die_body.cold+0x19/0x27 > ? page_fault_oops+0x15a/0x2f0 > ? search_module_extables+0x19/0x60 > ? search_bpf_extables+0x5f/0x80 > ? exc_page_fault+0x7e/0x180 > ? asm_exc_page_fault+0x26/0x30 > ? __release_resource+0xd/0xb0 > release_resource+0x26/0x40 > __crash_shrink_memory+0xe5/0x110 > crash_shrink_memory+0x12a/0x190 > kexec_crash_size_store+0x41/0x80 > kernfs_fop_write_iter+0x141/0x1f0 > vfs_write+0x294/0x460 > ksys_write+0x6d/0xf0 > > > This happens because __crash_shrink_memory()/kernel/crash_core.c > incorrectly updates the crashk_res resource object even when > crashk_low_res should be updated. > > Fix this by ensuring the correct crashkernel resource object is updated > when shrinking crashkernel memory. 
> > Fixes: 16c6006af4d4 ("kexec: enable kexec_crash_size to support two crash kernel regions") > Cc: Andrew Morton > Cc: Baoquan He > Cc: Zhen Lei > Cc: kexec at lists.infradead.org > Signed-off-by: Sourabh Jain > --- > kernel/crash_core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/crash_core.c b/kernel/crash_core.c > index 3b1c43382eec..99dac1aa972a 100644 > --- a/kernel/crash_core.c > +++ b/kernel/crash_core.c > @@ -373,7 +373,7 @@ static int __crash_shrink_memory(struct resource *old_res, > old_res->start = 0; > old_res->end = 0; > } else { > - crashk_res.end = ram_res->start - 1; > + old_res->end = ram_res->start - 1; It's a code bug, nice catch, thanks. Acked-by: Baoquan He From sourabhjain at linux.ibm.com Sun Nov 2 19:58:57 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 3 Nov 2025 09:28:57 +0530 Subject: [PATCH 0/2] Export kdump crashkernel CMA ranges Message-ID: <20251103035859.1267318-1-sourabhjain@linux.ibm.com> Add /sys/kernel/kexec_crash_cma_ranges to export all CMA regions reserved for the crashkernel to user-space. This enables user-space tools configuring kdump to determine the amount of memory reserved for the crashkernel. When CMA is used for crashkernel allocation, tools can use this information to warn users that attempting to capture user pages while CMA reservation is active may lead to unreliable or incomplete dump capture. While adding documentation for the new sysfs interface, I realized that there was no ABI document for the existing kexec and kdump sysfs interfaces, so I added one. The first patch adds the ABI documentation for the existing kexec and kdump sysfs interfaces, and the second patch adds the /sys/kernel/kexec_crash_cma_ranges sysfs interface along with its corresponding ABI documentation. *Seeking opinions* There are already four kexec/kdump sysfs entries under /sys/kernel/, and this patch series adds one more.
Should we consider moving them to a separate directory, such as /sys/kernel/kexec, to avoid polluting /sys/kernel/? For backward compatibility, we can create symlinks at the old locations for some time and remove them in the future. Cc: Andrew Morton Cc: Baoquan He Cc: Jiri Bohac Cc: Shivang Upadhyay Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Sourabh Jain (2): Documentation/ABI: add kexec and kdump sysfs interface crash: export crashkernel CMA reservation to userspace .../ABI/testing/sysfs-kernel-kexec-kdump | 53 +++++++++++++++++++ kernel/ksysfs.c | 17 ++++++ 2 files changed, 70 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-kexec-kdump -- 2.51.0 From sourabhjain at linux.ibm.com Sun Nov 2 19:58:58 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 3 Nov 2025 09:28:58 +0530 Subject: [PATCH 1/2] Documentation/ABI: add kexec and kdump sysfs interface In-Reply-To: <20251103035859.1267318-1-sourabhjain@linux.ibm.com> References: <20251103035859.1267318-1-sourabhjain@linux.ibm.com> Message-ID: <20251103035859.1267318-2-sourabhjain@linux.ibm.com> Add an ABI document for the following kexec and kdump sysfs interfaces: - /sys/kernel/kexec_loaded - /sys/kernel/kexec_crash_loaded - /sys/kernel/kexec_crash_size - /sys/kernel/crash_elfcorehdr_size Cc: Andrew Morton Cc: Baoquan He Cc: Jiri Bohac Cc: Shivang Upadhyay Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../ABI/testing/sysfs-kernel-kexec-kdump | 43 +++++++++++++++++++ 1 file changed, 43 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-kexec-kdump diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump new file mode 100644 index 000000000000..96b24565b68e --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -0,0 +1,43 @@ +What: /sys/kernel/kexec_loaded +Date: Jun 2006 +Contact: kexec at
lists.infradead.org +Description: read only + Indicates whether a new kernel image has been loaded + into memory using the kexec system call. It shows 1 if + a kexec image is present and ready to boot, or 0 if none + is loaded. +User: kexec tools, kdump service + +What: /sys/kernel/kexec_crash_loaded +Date: Jun 2006 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a crash (kdump) kernel is currently + loaded into memory. It shows 1 if a crash kernel has been + successfully loaded for panic handling, or 0 if no crash + kernel is present. +User: Kexec tools, Kdump service + +What: /sys/kernel/kexec_crash_size +Date: Dec 2009 +Contact: kexec at lists.infradead.org +Description: read/write + Shows the amount of memory reserved for loading the crash + (kdump) kernel. It reports the size, in bytes, of the + crash kernel area defined by the crashkernel= parameter. + This interface also allows reducing the crashkernel + reservation by writing a smaller value, and the reclaimed + space is added back to the system RAM. +User: Kdump service + +What: /sys/kernel/crash_elfcorehdr_size +Date: Aug 2023 +Contact: kexec at lists.infradead.org +Description: read only + Indicates the preferred size of the memory buffer for the + ELF core header used by the crash (kdump) kernel. It defines + how much space is needed to hold metadata about the crashed + system, including CPU and memory information. This information + is used by the user space utility kexec to support updating the + in-kernel kdump image during hotplug operations. 
+User: Kexec tools -- 2.51.0 From sourabhjain at linux.ibm.com Sun Nov 2 19:58:59 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 3 Nov 2025 09:28:59 +0530 Subject: [PATCH 2/2] crash: export crashkernel CMA reservation to userspace In-Reply-To: <20251103035859.1267318-1-sourabhjain@linux.ibm.com> References: <20251103035859.1267318-1-sourabhjain@linux.ibm.com> Message-ID: <20251103035859.1267318-3-sourabhjain@linux.ibm.com> Add a sysfs entry /sys/kernel/kexec_crash_cma_ranges to expose all CMA crashkernel ranges. This allows userspace tools configuring kdump to determine how much memory is reserved for crashkernel. If CMA is used, tools can warn users when attempting to capture user pages with CMA reservation. The new sysfs file holds the CMA ranges in the format below: cat /sys/kernel/kexec_crash_cma_ranges 100000000-10c7fffff Cc: Andrew Morton Cc: Baoquan He Cc: Jiri Bohac Cc: Shivang Upadhyay Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../ABI/testing/sysfs-kernel-kexec-kdump | 10 ++++++++++ kernel/ksysfs.c | 17 +++++++++++++++++ 2 files changed, 27 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump index 96b24565b68e..f6089e38de5f 100644 --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -41,3 +41,13 @@ Description: read only is used by the user space utility kexec to support updating the in-kernel kdump image during hotplug operations. User: Kexec tools + +What: /sys/kernel/kexec_crash_cma_ranges +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Provides information about the memory ranges reserved from + the Contiguous Memory Allocator (CMA) area that are allocated + to the crash (kdump) kernel. It lists the start and end physical + addresses of CMA regions assigned for crashkernel use.
+User: kdump service diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c index eefb67d9883c..3855937aa923 100644 --- a/kernel/ksysfs.c +++ b/kernel/ksysfs.c @@ -135,6 +135,22 @@ static ssize_t kexec_crash_loaded_show(struct kobject *kobj, } KERNEL_ATTR_RO(kexec_crash_loaded); +static ssize_t kexec_crash_cma_ranges_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + + ssize_t len = 0; + int i; + + for (i = 0; i < crashk_cma_cnt; ++i) { + len += sysfs_emit_at(buf, len, "%08llx-%08llx\n", + crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end); + } + return len; +} +KERNEL_ATTR_RO(kexec_crash_cma_ranges); + static ssize_t kexec_crash_size_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -260,6 +276,7 @@ static struct attribute * kernel_attrs[] = { #ifdef CONFIG_CRASH_DUMP &kexec_crash_loaded_attr.attr, &kexec_crash_size_attr.attr, + &kexec_crash_cma_ranges_attr.attr, #endif #endif #ifdef CONFIG_VMCORE_INFO -- 2.51.0 From sourabhjain at linux.ibm.com Sun Nov 2 20:08:27 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 3 Nov 2025 09:38:27 +0530 Subject: [PATCH] crash: fix crashkernel resource shrink In-Reply-To: References: <20251101193741.289252-1-sourabhjain@linux.ibm.com> Message-ID: <01045c4c-a37a-4a31-8787-6483c7b78dad@linux.ibm.com> On 03/11/25 08:24, Baoquan He wrote: > On 11/02/25 at 01:07am, Sourabh Jain wrote: >> When crashkernel is configured with a high reservation, shrinking its >> value below the low crashkernel reservation causes two issues: >> >> 1. Invalid crashkernel resource objects >> 2. Kernel crash if crashkernel shrinking is done twice >> >> For example, with crashkernel=200M,high, the kernel reserves 200MB of >> high memory and some default low memory (say 256MB). 
The reservation >> appears as: >> >> cat /proc/iomem | grep -i crash >> af000000-beffffff : Crash kernel >> 433000000-43f7fffff : Crash kernel >> >> If crashkernel is then shrunk to 50MB (echo 52428800 > >> /sys/kernel/kexec_crash_size), /proc/iomem still shows 256MB reserved: >> af000000-beffffff : Crash kernel >> >> Instead, it should show 50MB: >> af000000-b21fffff : Crash kernel >> >> Further shrinking crashkernel to 40MB causes a kernel crash with the >> following trace (x86): >> >> BUG: kernel NULL pointer dereference, address: 0000000000000038 >> PGD 0 P4D 0 >> Oops: 0000 [#1] PREEMPT SMP NOPTI >> >> Call Trace: >> ? __die_body.cold+0x19/0x27 >> ? page_fault_oops+0x15a/0x2f0 >> ? search_module_extables+0x19/0x60 >> ? search_bpf_extables+0x5f/0x80 >> ? exc_page_fault+0x7e/0x180 >> ? asm_exc_page_fault+0x26/0x30 >> ? __release_resource+0xd/0xb0 >> release_resource+0x26/0x40 >> __crash_shrink_memory+0xe5/0x110 >> crash_shrink_memory+0x12a/0x190 >> kexec_crash_size_store+0x41/0x80 >> kernfs_fop_write_iter+0x141/0x1f0 >> vfs_write+0x294/0x460 >> ksys_write+0x6d/0xf0 >> >> >> This happens because __crash_shrink_memory()/kernel/crash_core.c >> incorrectly updates the crashk_res resource object even when >> crashk_low_res should be updated. >> >> Fix this by ensuring the correct crashkernel resource object is updated >> when shrinking crashkernel memory. 
>> >> Fixes: 16c6006af4d4 ("kexec: enable kexec_crash_size to support two crash kernel regions") >> Cc: Andrew Morton >> Cc: Baoquan He >> Cc: Zhen Lei >> Cc: kexec at lists.infradead.org >> Signed-off-by: Sourabh Jain >> --- >> kernel/crash_core.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/kernel/crash_core.c b/kernel/crash_core.c >> index 3b1c43382eec..99dac1aa972a 100644 >> --- a/kernel/crash_core.c >> +++ b/kernel/crash_core.c >> @@ -373,7 +373,7 @@ static int __crash_shrink_memory(struct resource *old_res, >> old_res->start = 0; >> old_res->end = 0; >> } else { >> - crashk_res.end = ram_res->start - 1; >> + old_res->end = ram_res->start - 1; > It's a code bug, nice catch, thanks. > > Acked-by: Baoquan He Thanks for the ack, Baoquan. - Sourabh Jain From sourabhjain at linux.ibm.com Sun Nov 2 20:37:47 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 3 Nov 2025 10:07:47 +0530 Subject: [PATCH v5] powerpc/kdump: Add support for crashkernel CMA reservation Message-ID: <20251103043747.1298065-1-sourabhjain@linux.ibm.com> Commit 35c18f2933c5 ("Add a new optional ",cma" suffix to the crashkernel= command line option") and commit ab475510e042 ("kdump: implement reserve_crashkernel_cma") added CMA support for kdump crashkernel reservation. Extend crashkernel CMA reservation support to powerpc. The following changes are made to enable CMA reservation on powerpc: - Parse and obtain the CMA reservation size along with other crashkernel parameters - Call reserve_crashkernel_cma() to allocate the CMA region for kdump - Include the CMA-reserved ranges in the usable memory ranges for the kdump kernel to use. - Exclude the CMA-reserved ranges from the crash kernel memory to prevent them from being exported through /proc/vmcore. With the introduction of the CMA crashkernel regions, crash_exclude_mem_range() needs to be called multiple times to exclude both crashk_res and crashk_cma_ranges from the crash memory ranges. 
To avoid repetitive logic for validating mem_ranges size and handling reallocation when required, this functionality is moved to a new wrapper function crash_exclude_mem_range_guarded(). To ensure proper CMA reservation, reserve_crashkernel_cma() is called after pageblock_order is initialized. Update kernel-parameters.txt to document CMA support for crashkernel on the powerpc architecture. Cc: Baoquan He Cc: Jiri Bohac Cc: Hari Bathini Cc: Madhavan Srinivasan Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- Changelog: v3 -> v4 - Removed repeated initialization to tmem in crash_exclude_mem_range_guarded() - Call crash_exclude_mem_range() with the right crashk ranges v4 -> v5: - Document CMA-based crashkernel support for ppc64 in kernel-parameters.txt --- .../admin-guide/kernel-parameters.txt | 2 +- arch/powerpc/include/asm/kexec.h | 2 + arch/powerpc/kernel/setup-common.c | 4 +- arch/powerpc/kexec/core.c | 10 ++++- arch/powerpc/kexec/ranges.c | 43 ++++++++++++++----- 5 files changed, 47 insertions(+), 14 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 6c42061ca20e..0f386b546cec 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1013,7 +1013,7 @@ It will be ignored when crashkernel=X,high is not used or memory reserved is below 4G. crashkernel=size[KMG],cma - [KNL, X86] Reserve additional crash kernel memory from + [KNL, X86, ppc64] Reserve additional crash kernel memory from CMA. This reservation is usable by the first system's userspace memory and kernel movable allocations (memory balloon, zswap).
Pages allocated from this memory range diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index 4bbf9f699aaa..bd4a6c42a5f3 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -115,9 +115,11 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt, struct crash_mem #ifdef CONFIG_CRASH_RESERVE int __init overlaps_crashkernel(unsigned long start, unsigned long size); extern void arch_reserve_crashkernel(void); +extern void kdump_cma_reserve(void); #else static inline void arch_reserve_crashkernel(void) {} static inline int overlaps_crashkernel(unsigned long start, unsigned long size) { return 0; } +static inline void kdump_cma_reserve(void) { } #endif #if defined(CONFIG_CRASH_DUMP) diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index 68d47c53876c..c8c42b419742 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -995,11 +996,12 @@ void __init setup_arch(char **cmdline_p) initmem_init(); /* - * Reserve large chunks of memory for use by CMA for fadump, KVM and + * Reserve large chunks of memory for use by CMA for kdump, fadump, KVM and * hugetlb. These must be called after initmem_init(), so that * pageblock_order is initialised. */ fadump_cma_init(); + kdump_cma_reserve(); kvm_cma_reserve(); gigantic_hugetlb_cma_reserve(); diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c index d1a2d755381c..25744737eff5 100644 --- a/arch/powerpc/kexec/core.c +++ b/arch/powerpc/kexec/core.c @@ -33,6 +33,8 @@ void machine_kexec_cleanup(struct kimage *image) { } +unsigned long long cma_size; + /* * Do not allocate memory (or fail in any way) in machine_kexec(). * We are past the point of no return, committed to rebooting now. 
@@ -110,7 +112,7 @@ void __init arch_reserve_crashkernel(void) /* use common parsing */ ret = parse_crashkernel(boot_command_line, total_mem_sz, &crash_size, - &crash_base, NULL, NULL, NULL); + &crash_base, NULL, &cma_size, NULL); if (ret) return; @@ -130,6 +132,12 @@ void __init arch_reserve_crashkernel(void) reserve_crashkernel_generic(crash_size, crash_base, 0, false); } +void __init kdump_cma_reserve(void) +{ + if (cma_size) + reserve_crashkernel_cma(cma_size); +} + int __init overlaps_crashkernel(unsigned long start, unsigned long size) { return (start + size) > crashk_res.start && start <= crashk_res.end; diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c index 3702b0bdab14..3bd27c38726b 100644 --- a/arch/powerpc/kexec/ranges.c +++ b/arch/powerpc/kexec/ranges.c @@ -515,7 +515,7 @@ int get_exclude_memory_ranges(struct crash_mem **mem_ranges) */ int get_usable_memory_ranges(struct crash_mem **mem_ranges) { - int ret; + int ret, i; /* * Early boot failure observed on guests when low memory (first memory @@ -528,6 +528,13 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) if (ret) goto out; + for (i = 0; i < crashk_cma_cnt; i++) { + ret = add_mem_range(mem_ranges, crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end - crashk_cma_ranges[i].start + 1); + if (ret) + goto out; + } + ret = add_rtas_mem_range(mem_ranges); if (ret) goto out; @@ -546,6 +553,22 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) #endif /* CONFIG_KEXEC_FILE */ #ifdef CONFIG_CRASH_DUMP +static int crash_exclude_mem_range_guarded(struct crash_mem **mem_ranges, + unsigned long long mstart, + unsigned long long mend) +{ + struct crash_mem *tmem = *mem_ranges; + + /* Reallocate memory ranges if there is no space to split ranges */ + if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) { + tmem = realloc_mem_ranges(mem_ranges); + if (!tmem) + return -ENOMEM; + } + + return crash_exclude_mem_range(tmem, mstart, mend); +} + /** * 
get_crash_memory_ranges - Get crash memory ranges. This list includes * first/crashing kernel's memory regions that @@ -557,7 +580,6 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) int get_crash_memory_ranges(struct crash_mem **mem_ranges) { phys_addr_t base, end; - struct crash_mem *tmem; u64 i; int ret; @@ -582,19 +604,18 @@ int get_crash_memory_ranges(struct crash_mem **mem_ranges) sort_memory_ranges(*mem_ranges, true); } - /* Reallocate memory ranges if there is no space to split ranges */ - tmem = *mem_ranges; - if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) { - tmem = realloc_mem_ranges(mem_ranges); - if (!tmem) - goto out; - } - /* Exclude crashkernel region */ - ret = crash_exclude_mem_range(tmem, crashk_res.start, crashk_res.end); + ret = crash_exclude_mem_range_guarded(mem_ranges, crashk_res.start, crashk_res.end); if (ret) goto out; + for (i = 0; i < crashk_cma_cnt; ++i) { + ret = crash_exclude_mem_range_guarded(mem_ranges, crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end); + if (ret) + goto out; + } + /* * FIXME: For now, stay in parity with kexec-tools but if RTAS/OPAL * regions are exported to save their context at the time of -- 2.51.0 From maqianga at uniontech.com Sun Nov 2 22:34:36 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Mon, 3 Nov 2025 14:34:36 +0800 Subject: [PATCH v2 0/4] kexec: print out debugging message if required for kexec_load Message-ID: <20251103063440.1681657-1-maqianga@uniontech.com> Overview: ========= The commit a85ee18c7900 ("kexec_file: print out debugging message if required") has added general code printing in kexec_file_load(), but not in kexec_load(). Since kexec_load and kexec_file_load are not triggered simultaneously, we can unify the debug flag of kexec and kexec_file as kexec_core_dbg_print. Next, we need to do some things in this patchset: 1. rename kexec_file_dbg_print to kexec_core_dbg_print 2. Add KEXEC_DEBUG 3. Initialize kexec_core_dbg_print for kexec 4. 
Fix uninitialized struct kimage *image pointer 5. Set the reset of kexec_file_dbg_print to kimage_free Testing: ========= I did testing on x86_64, arm64 and loongarch. On x86_64, the printed messages look like below: unset CONFIG_KEXEC_FILE: [ 81.476959] kexec: nr_segments = 7 [ 81.477565] kexec: segment[0]: buf=0x00000000c22469d2 bufsz=0x70 mem=0x100000 memsz=0x1000 [ 81.478797] kexec: segment[1]: buf=0x00000000dedbb3b1 bufsz=0x140 mem=0x101000 memsz=0x1000 [ 81.480075] kexec: segment[2]: buf=0x00000000d7657a33 bufsz=0x30 mem=0x102000 memsz=0x1000 [ 81.481288] kexec: segment[3]: buf=0x00000000c7eb60a6 bufsz=0x16f40a8 mem=0x23bd0b000 memsz=0x16f5000 [ 81.489018] kexec: segment[4]: buf=0x00000000d1ca53c8 bufsz=0xd73400 mem=0x23d400000 memsz=0x2ab7000 [ 81.499697] kexec: segment[5]: buf=0x00000000697bac5a bufsz=0x50dc mem=0x23fff1000 memsz=0x6000 [ 81.501084] kexec: segment[6]: buf=0x000000001f743a68 bufsz=0x70e0 mem=0x23fff7000 memsz=0x9000 [ 81.502374] kexec: kexec_load: type:0, start:0x23fff7700 head:0x10a4b9002 flags:0x3e0010 set CONFIG_KEXEC_FILE [ 36.774228] kexec_file: kernel: 0000000066c386c8 kernel_size: 0xd78400 [ 36.821814] kexec-bzImage64: Loaded purgatory at 0x23fffb000 [ 36.821826] kexec-bzImage64: Loaded boot_param, command line and misc at 0x23fff9000 bufsz=0x12d0 memsz=0x2000 [ 36.821829] kexec-bzImage64: Loaded 64bit kernel at 0x23d400000 bufsz=0xd73400 memsz=0x2ab7000 [ 36.821918] kexec-bzImage64: Loaded initrd at 0x23bd0b000 bufsz=0x16f40a8 memsz=0x16f40a8 [ 36.821920] kexec-bzImage64: Final command line is: root=/dev/mapper/test-root crashkernel=auto rd.lvm.lv=test/root [ 36.821925] kexec-bzImage64: E820 memmap: [ 36.821926] kexec-bzImage64: 0000000000000000-000000000009ffff (1) [ 36.821928] kexec-bzImage64: 0000000000100000-0000000000811fff (1) [ 36.821930] kexec-bzImage64: 0000000000812000-0000000000812fff (2) [ 36.821931] kexec-bzImage64: 0000000000813000-00000000bee38fff (1) [ 36.821933] kexec-bzImage64: 00000000bee39000-00000000beec2fff (2) 
[ 36.821934] kexec-bzImage64: 00000000beec3000-00000000bf8ecfff (1) [ 36.821935] kexec-bzImage64: 00000000bf8ed000-00000000bfb6cfff (2) [ 36.821936] kexec-bzImage64: 00000000bfb6d000-00000000bfb7efff (3) [ 36.821937] kexec-bzImage64: 00000000bfb7f000-00000000bfbfefff (4) [ 36.821938] kexec-bzImage64: 00000000bfbff000-00000000bff7bfff (1) [ 36.821939] kexec-bzImage64: 00000000bff7c000-00000000bfffffff (2) [ 36.821940] kexec-bzImage64: 00000000feffc000-00000000feffffff (2) [ 36.821941] kexec-bzImage64: 00000000ffc00000-00000000ffffffff (2) [ 36.821942] kexec-bzImage64: 0000000100000000-000000023fffffff (1) [ 36.872348] kexec_file: nr_segments = 4 [ 36.872356] kexec_file: segment[0]: buf=0x000000005314ece7 bufsz=0x4000 mem=0x23fffb000 memsz=0x5000 [ 36.872370] kexec_file: segment[1]: buf=0x000000006e59b143 bufsz=0x12d0 mem=0x23fff9000 memsz=0x2000 [ 36.872374] kexec_file: segment[2]: buf=0x00000000eb7b1fc3 bufsz=0xd73400 mem=0x23d400000 memsz=0x2ab7000 [ 36.882172] kexec_file: segment[3]: buf=0x000000006af76441 bufsz=0x16f40a8 mem=0x23bd0b000 memsz=0x16f5000 [ 36.889113] kexec_file: kexec_file_load: type:0, start:0x23fffb150 head:0x101a2e002 flags:0x8 Changes in v2: ========== - Unify the debug flag of kexec and kexec_file - Fix uninitialized struct kimage *image pointer - Fix the issue of mismatch between loop variable types Qiang Ma (4): kexec: Fix uninitialized struct kimage *image pointer kexec: add kexec_core flag to control debug printing kexec: print out debugging message if required for kexec_load kexec_file: Fix the issue of mismatch between loop variable types include/linux/kexec.h | 9 +++++---- include/uapi/linux/kexec.h | 1 + kernel/kexec.c | 16 +++++++++++++++- kernel/kexec_core.c | 4 +++- kernel/kexec_file.c | 9 ++++----- 5 files changed, 28 insertions(+), 11 deletions(-) -- 2.20.1 From maqianga at uniontech.com Sun Nov 2 22:34:37 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Mon, 3 Nov 2025 14:34:37 +0800 Subject: [PATCH v2 1/4] kexec: Fix 
uninitialized struct kimage *image pointer In-Reply-To: <20251103063440.1681657-1-maqianga@uniontech.com> References: <20251103063440.1681657-1-maqianga@uniontech.com> Message-ID: <20251103063440.1681657-2-maqianga@uniontech.com> Initialize image to NULL. Then, if kimage_alloc_init() fails, we can go directly to 'out', because kimage_free() checks whether image is a NULL pointer. This also prepares for the subsequent patch, which resets kexec_core_dbg_print in kimage_free(). Signed-off-by: Qiang Ma --- kernel/kexec.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/kexec.c b/kernel/kexec.c index 28008e3d462e..9bb1f2b6b268 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -95,6 +95,8 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, unsigned long i; int ret; + image = NULL; + /* * Because we write directly to the reserved memory region when loading * crash kernels we need a serialization here to prevent multiple crash @@ -129,7 +131,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, ret = kimage_alloc_init(&image, entry, nr_segments, segments, flags); if (ret) - goto out_unlock; + goto out; if (flags & KEXEC_PRESERVE_CONTEXT) image->preserve_context = 1; -- 2.20.1 From maqianga at uniontech.com Sun Nov 2 22:34:38 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Mon, 3 Nov 2025 14:34:38 +0800 Subject: [PATCH v2 2/4] kexec: add kexec_core flag to control debug printing In-Reply-To: <20251103063440.1681657-1-maqianga@uniontech.com> References: <20251103063440.1681657-1-maqianga@uniontech.com> Message-ID: <20251103063440.1681657-3-maqianga@uniontech.com> Commit a85ee18c7900 ("kexec_file: print out debugging message if required") added generic debug printing in kexec_file_load(), but not in kexec_load().
Since kexec_load and kexec_file_load are not triggered simultaneously, we can unify the debug flag of kexec and kexec_file as kexec_core_dbg_print. Next, we need to do four things: 1. rename kexec_file_dbg_print to kexec_core_dbg_print 2. Add KEXEC_DEBUG 3. Initialize kexec_core_dbg_print for kexec 4. Set the reset of kexec_file_dbg_print to kimage_free Signed-off-by: Qiang Ma --- include/linux/kexec.h | 9 +++++---- include/uapi/linux/kexec.h | 1 + kernel/kexec.c | 1 + kernel/kexec_core.c | 4 +++- kernel/kexec_file.c | 4 +--- 5 files changed, 11 insertions(+), 8 deletions(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index ff7e231b0485..cad8b5c362af 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -455,10 +455,11 @@ bool kexec_load_permitted(int kexec_image_type); /* List of defined/legal kexec flags */ #ifndef CONFIG_KEXEC_JUMP -#define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_UPDATE_ELFCOREHDR | KEXEC_CRASH_HOTPLUG_SUPPORT) +#define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_UPDATE_ELFCOREHDR | KEXEC_CRASH_HOTPLUG_SUPPORT | \ + KEXEC_DEBUG) #else #define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT | KEXEC_UPDATE_ELFCOREHDR | \ - KEXEC_CRASH_HOTPLUG_SUPPORT) + KEXEC_CRASH_HOTPLUG_SUPPORT | KEXEC_DEBUG) #endif /* List of defined/legal kexec file flags */ @@ -525,10 +526,10 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) { } #endif -extern bool kexec_file_dbg_print; +extern bool kexec_core_dbg_print; #define kexec_dprintk(fmt, arg...) 
\ - do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0) + do { if (kexec_core_dbg_print) pr_info(fmt, ##arg); } while (0) extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size); extern void kimage_unmap_segment(void *buffer); diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h index 55749cb0b81d..819c600af125 100644 --- a/include/uapi/linux/kexec.h +++ b/include/uapi/linux/kexec.h @@ -14,6 +14,7 @@ #define KEXEC_PRESERVE_CONTEXT 0x00000002 #define KEXEC_UPDATE_ELFCOREHDR 0x00000004 #define KEXEC_CRASH_HOTPLUG_SUPPORT 0x00000008 +#define KEXEC_DEBUG 0x00000010 #define KEXEC_ARCH_MASK 0xffff0000 /* diff --git a/kernel/kexec.c b/kernel/kexec.c index 9bb1f2b6b268..c7a869d32f87 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -42,6 +42,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry, if (!image) return -ENOMEM; + kexec_core_dbg_print = !!(flags & KEXEC_DEBUG); image->start = entry; image->nr_segments = nr_segments; memcpy(image->segment, segments, nr_segments * sizeof(*segments)); diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index fa00b239c5d9..865f2b14f23b 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -53,7 +53,7 @@ atomic_t __kexec_lock = ATOMIC_INIT(0); /* Flag to indicate we are going to kexec a new kernel */ bool kexec_in_progress = false; -bool kexec_file_dbg_print; +bool kexec_core_dbg_print; /* * When kexec transitions to the new kernel there is a one-to-one @@ -576,6 +576,8 @@ void kimage_free(struct kimage *image) kimage_entry_t *ptr, entry; kimage_entry_t ind = 0; + kexec_core_dbg_print = false; + if (!image) return; diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index eb62a9794242..4a24aadbad02 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -138,8 +138,6 @@ void kimage_file_post_load_cleanup(struct kimage *image) */ kfree(image->image_loader_data); image->image_loader_data = NULL; - - kexec_file_dbg_print = 
false; } #ifdef CONFIG_KEXEC_SIG @@ -314,7 +312,7 @@ kimage_file_alloc_init(struct kimage **rimage, int kernel_fd, if (!image) return -ENOMEM; - kexec_file_dbg_print = !!(flags & KEXEC_FILE_DEBUG); + kexec_core_dbg_print = !!(flags & KEXEC_FILE_DEBUG); image->file_mode = 1; #ifdef CONFIG_CRASH_DUMP -- 2.20.1 From maqianga at uniontech.com Sun Nov 2 22:34:40 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Mon, 3 Nov 2025 14:34:40 +0800 Subject: [PATCH v2 4/4] kexec_file: Fix the issue of mismatch between loop variable types In-Reply-To: <20251103063440.1681657-1-maqianga@uniontech.com> References: <20251103063440.1681657-1-maqianga@uniontech.com> Message-ID: <20251103063440.1681657-5-maqianga@uniontech.com> The type of the struct kimage member variable nr_segments is unsigned long. Correct the loop variable i and the print format specifier type. Signed-off-by: Qiang Ma --- kernel/kexec_file.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index 4a24aadbad02..7afdaa0efc50 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -366,7 +366,8 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, int image_type = (flags & KEXEC_FILE_ON_CRASH) ? KEXEC_TYPE_CRASH : KEXEC_TYPE_DEFAULT; struct kimage **dest_image, *image; - int ret = 0, i; + int ret = 0; + unsigned long i; /* We only trust the superuser with rebooting the system. 
*/ if (!kexec_load_permitted(image_type)) @@ -432,7 +433,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, struct kexec_segment *ksegment; ksegment = &image->segment[i]; - kexec_dprintk("segment[%d]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", i, ksegment->buf, ksegment->bufsz, ksegment->mem, ksegment->memsz); -- 2.20.1 From maqianga at uniontech.com Sun Nov 2 22:34:39 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Mon, 3 Nov 2025 14:34:39 +0800 Subject: [PATCH v2 3/4] kexec: print out debugging message if required for kexec_load In-Reply-To: <20251103063440.1681657-1-maqianga@uniontech.com> References: <20251103063440.1681657-1-maqianga@uniontech.com> Message-ID: <20251103063440.1681657-4-maqianga@uniontech.com> Commit a85ee18c7900 ("kexec_file: print out debugging message if required") added generic debug printing in kexec_file_load(), but not in kexec_load(). On RISC-V in particular, kexec_image_info() has been removed (commit eb7622d908a0 ("kexec_file, riscv: print out debugging message if required")). As a result, when using '-d' with the kexec_load interface, nothing is printed in kernel space. This information can be helpful for verifying the accuracy of the data passed to the kernel. Therefore, following commit a85ee18c7900 ("kexec_file: print out debugging message if required"), debug printing has been added.
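The flag-gated printing this series introduces reduces to a small pattern: a userspace-supplied flag bit latches a global boolean, a printing macro checks that boolean, and the boolean is cleared again when the image is freed. A standalone userspace sketch of that pattern (toy_kexec_load() and dbg_lines are illustrative stand-ins, not kernel code; only the flag value and the macro shape mirror the patches):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define KEXEC_DEBUG 0x00000010  /* mirrors the proposed uapi flag bit */

static bool kexec_core_dbg_print;
static int dbg_lines;  /* demo-only counter of emitted debug lines */

/* Same shape as the kernel's kexec_dprintk(): print only when the flag is set. */
#define kexec_dprintk(fmt, ...)                         \
	do {                                            \
		if (kexec_core_dbg_print) {             \
			printf(fmt, ##__VA_ARGS__);     \
			dbg_lines++;                    \
		}                                       \
	} while (0)

/* Toy stand-in for do_kexec_load(): latch the flag, print, reset on "free". */
static void toy_kexec_load(unsigned long flags, unsigned long nr_segments)
{
	kexec_core_dbg_print = !!(flags & KEXEC_DEBUG);
	kexec_dprintk("kexec: nr_segments = %lu\n", nr_segments);
	kexec_core_dbg_print = false;  /* what kimage_free() would do */
}
```

Because the boolean is reset at the end of every load, a later load without the flag prints nothing, which is the behaviour patch 1 prepares for by routing failures through kimage_free().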
Signed-off-by: Qiang Ma Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202510310332.6XrLe70K-lkp at intel.com/ --- kernel/kexec.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/kernel/kexec.c b/kernel/kexec.c index c7a869d32f87..9b433b972cc1 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -154,7 +154,15 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, if (ret) goto out; + kexec_dprintk("nr_segments = %lu\n", nr_segments); for (i = 0; i < nr_segments; i++) { + struct kexec_segment *ksegment; + + ksegment = &image->segment[i]; + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", + i, ksegment->buf, ksegment->bufsz, ksegment->mem, + ksegment->memsz); + ret = kimage_load_segment(image, i); if (ret) goto out; @@ -166,6 +174,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, if (ret) goto out; + kexec_dprintk("kexec_load: type:%u, start:0x%lx head:0x%lx flags:0x%lx\n", + image->type, image->start, image->head, flags); + /* Install the new kernel and uninstall the old */ image = xchg(dest_image, image); -- 2.20.1 From pratyush at kernel.org Mon Nov 3 03:01:57 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 3 Nov 2025 12:01:57 +0100 Subject: [PATCH] kho: fix out-of-bounds access of vmalloc chunk Message-ID: <20251103110159.8399-1-pratyush@kernel.org> The list of pages in a vmalloc chunk is NULL-terminated. So when looping through the pages in a vmalloc chunk, both kho_restore_vmalloc() and kho_vmalloc_unpreserve_chunk() rightly make sure to stop when encountering a NULL page. But when the chunk is full, the loops do not stop and go past the bounds of chunk->phys, resulting in out-of-bounds memory access, and possibly the restoration or unpreservation of an invalid page. Fix this by making sure the processing of chunk stops at the end of the array. 
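The bug class fixed above is easy to demonstrate outside the kernel: a fixed-size array used as NULL-terminated has no terminator when it is completely full, so a loop that only checks for a zero entry runs off the end. A minimal sketch with a hypothetical 4-entry array (the real chunk layout differs):

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

struct chunk {
	unsigned long phys[4];  /* 0-terminated *unless* the chunk is full */
};

/* Counts entries the way the fixed loops do: stop at a 0 entry
 * or at the end of the array, whichever comes first. */
static size_t count_entries(const struct chunk *c)
{
	size_t n = 0;

	for (size_t i = 0; i < ARRAY_SIZE(c->phys) && c->phys[i]; i++)
		n++;
	return n;
}
```

Without the `i < ARRAY_SIZE(c->phys)` bound, a full chunk would keep the loop reading whatever memory follows the array until it happens to hit a zero word.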
Fixes: a667300bd53f2 ("kho: add support for preserving vmalloc allocations") Signed-off-by: Pratyush Yadav --- Notes: Commit 89a3ecca49ee8 ("kho: make sure page being restored is actually from KHO") was quite helpful in catching this since kho_restore_page() errored out due to missing magic number, instead of "restoring" a random page and causing errors at other random places. kernel/kexec_handover.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c index 76f0940fb4856..cc5aaa738bc50 100644 --- a/kernel/kexec_handover.c +++ b/kernel/kexec_handover.c @@ -869,7 +869,7 @@ static void kho_vmalloc_unpreserve_chunk(struct kho_vmalloc_chunk *chunk) __kho_unpreserve(track, pfn, pfn + 1); - for (int i = 0; chunk->phys[i]; i++) { + for (int i = 0; i < ARRAY_SIZE(chunk->phys) && chunk->phys[i]; i++) { pfn = PHYS_PFN(chunk->phys[i]); __kho_unpreserve(track, pfn, pfn + 1); } @@ -992,7 +992,7 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) while (chunk) { struct page *page; - for (int i = 0; chunk->phys[i]; i++) { + for (int i = 0; i < ARRAY_SIZE(chunk->phys) && chunk->phys[i]; i++) { phys_addr_t phys = chunk->phys[i]; if (idx + contig_pages > total_pages) base-commit: dcb6fa37fd7bc9c3d2b066329b0d27dedf8becaa -- 2.47.3 From ritesh.list at gmail.com Mon Nov 3 02:10:10 2025 From: ritesh.list at gmail.com (Ritesh Harjani (IBM)) Date: Mon, 03 Nov 2025 15:40:10 +0530 Subject: [PATCH v5] powerpc/kdump: Add support for crashkernel CMA reservation In-Reply-To: <20251103043747.1298065-1-sourabhjain@linux.ibm.com> References: <20251103043747.1298065-1-sourabhjain@linux.ibm.com> Message-ID: <87y0on4ebh.ritesh.list@gmail.com> Sourabh Jain writes: > Commit 35c18f2933c5 ("Add a new optional ",cma" suffix to the > crashkernel= command line option") and commit ab475510e042 ("kdump: > implement reserve_crashkernel_cma") added CMA support for kdump > crashkernel reservation. 
> > Extend crashkernel CMA reservation support to powerpc. > > The following changes are made to enable CMA reservation on powerpc: > > - Parse and obtain the CMA reservation size along with other crashkernel > parameters > - Call reserve_crashkernel_cma() to allocate the CMA region for kdump > - Include the CMA-reserved ranges in the usable memory ranges for the > kdump kernel to use. > - Exclude the CMA-reserved ranges from the crash kernel memory to > prevent them from being exported through /proc/vmcore. > > With the introduction of the CMA crashkernel regions, > crash_exclude_mem_range() needs to be called multiple times to exclude > both crashk_res and crashk_cma_ranges from the crash memory ranges. To > avoid repetitive logic for validating mem_ranges size and handling > reallocation when required, this functionality is moved to a new wrapper > function crash_exclude_mem_range_guarded(). > > To ensure proper CMA reservation, reserve_crashkernel_cma() is called > after pageblock_order is initialized. > > Update kernel-parameters.txt to document CMA support for crashkernel on > powerpc architecture. 
> > Cc: Baoquan he > Cc: Jiri Bohac > Cc: Hari Bathini > Cc: Madhavan Srinivasan > Cc: Mahesh Salgaonkar > Cc: Michael Ellerman > Cc: Ritesh Harjani (IBM) > Cc: Shivang Upadhyay > Cc: kexec at lists.infradead.org > Signed-off-by: Sourabh Jain > --- > Changlog: > > v3 -> v4 > - Removed repeated initialization to tmem in > crash_exclude_mem_range_guarded() > - Call crash_exclude_mem_range() with right crashk ranges > > v4 -> v5: > - Document CMA-based crashkernel support for ppc64 in kernel-parameters.txt > --- > .../admin-guide/kernel-parameters.txt | 2 +- > arch/powerpc/include/asm/kexec.h | 2 + > arch/powerpc/kernel/setup-common.c | 4 +- > arch/powerpc/kexec/core.c | 10 ++++- > arch/powerpc/kexec/ranges.c | 43 ++++++++++++++----- > 5 files changed, 47 insertions(+), 14 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index 6c42061ca20e..0f386b546cec 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -1013,7 +1013,7 @@ > It will be ignored when crashkernel=X,high is not used > or memory reserved is below 4G. > crashkernel=size[KMG],cma > - [KNL, X86] Reserve additional crash kernel memory from > + [KNL, X86, ppc64] Reserve additional crash kernel memory from Shouldn't this be PPC and not ppc64? If I see the crash_dump support... config ARCH_SUPPORTS_CRASH_DUMP def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP) The changes below aren't specific to ppc64 correct? > CMA. This reservation is usable by the first system's > userspace memory and kernel movable allocations (memory > balloon, zswap). 
Pages allocated from this memory range > diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h > index 4bbf9f699aaa..bd4a6c42a5f3 100644 > --- a/arch/powerpc/include/asm/kexec.h > +++ b/arch/powerpc/include/asm/kexec.h > @@ -115,9 +115,11 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt, struct crash_mem > #ifdef CONFIG_CRASH_RESERVE > int __init overlaps_crashkernel(unsigned long start, unsigned long size); > extern void arch_reserve_crashkernel(void); > +extern void kdump_cma_reserve(void); > #else > static inline void arch_reserve_crashkernel(void) {} > static inline int overlaps_crashkernel(unsigned long start, unsigned long size) { return 0; } > +static inline void kdump_cma_reserve(void) { } > #endif > > #if defined(CONFIG_CRASH_DUMP) > diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c > index 68d47c53876c..c8c42b419742 100644 > --- a/arch/powerpc/kernel/setup-common.c > +++ b/arch/powerpc/kernel/setup-common.c > @@ -35,6 +35,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -995,11 +996,12 @@ void __init setup_arch(char **cmdline_p) > initmem_init(); > > /* > - * Reserve large chunks of memory for use by CMA for fadump, KVM and > + * Reserve large chunks of memory for use by CMA for kdump, fadump, KVM and > * hugetlb. These must be called after initmem_init(), so that > * pageblock_order is initialised. > */ > fadump_cma_init(); > + kdump_cma_reserve(); > kvm_cma_reserve(); > gigantic_hugetlb_cma_reserve(); > > diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c > index d1a2d755381c..25744737eff5 100644 > --- a/arch/powerpc/kexec/core.c > +++ b/arch/powerpc/kexec/core.c > @@ -33,6 +33,8 @@ void machine_kexec_cleanup(struct kimage *image) > { > } > > +unsigned long long cma_size; > + nit: Since this is a global powerpc variable you are defining, can we name it crashk_cma_size?
> /* > * Do not allocate memory (or fail in any way) in machine_kexec(). > * We are past the point of no return, committed to rebooting now. > @@ -110,7 +112,7 @@ void __init arch_reserve_crashkernel(void) > > /* use common parsing */ > ret = parse_crashkernel(boot_command_line, total_mem_sz, &crash_size, > - &crash_base, NULL, NULL, NULL); > + &crash_base, NULL, &cma_size, NULL); > > if (ret) > return; > @@ -130,6 +132,12 @@ void __init arch_reserve_crashkernel(void) > reserve_crashkernel_generic(crash_size, crash_base, 0, false); > } > > +void __init kdump_cma_reserve(void) > +{ > + if (cma_size) > + reserve_crashkernel_cma(cma_size); > +} > + nit: cma_size is already checked for null within reserve_crashkernel_cma(), so we don't really need kdump_cma_reserve() function call as such. Also kdump_cma_reserve() only make sense with #ifdef CRASHKERNEL_CMA.. so instead do you think we can directly call reserve_crashkernel_cma(cma_size)? -ritesh > int __init overlaps_crashkernel(unsigned long start, unsigned long size) > { > return (start + size) > crashk_res.start && start <= crashk_res.end; > diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c > index 3702b0bdab14..3bd27c38726b 100644 > --- a/arch/powerpc/kexec/ranges.c > +++ b/arch/powerpc/kexec/ranges.c > @@ -515,7 +515,7 @@ int get_exclude_memory_ranges(struct crash_mem **mem_ranges) > */ > int get_usable_memory_ranges(struct crash_mem **mem_ranges) > { > - int ret; > + int ret, i; > > /* > * Early boot failure observed on guests when low memory (first memory > @@ -528,6 +528,13 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) > if (ret) > goto out; > > + for (i = 0; i < crashk_cma_cnt; i++) { > + ret = add_mem_range(mem_ranges, crashk_cma_ranges[i].start, > + crashk_cma_ranges[i].end - crashk_cma_ranges[i].start + 1); > + if (ret) > + goto out; > + } > + > ret = add_rtas_mem_range(mem_ranges); > if (ret) > goto out; > @@ -546,6 +553,22 @@ int get_usable_memory_ranges(struct 
crash_mem **mem_ranges) > #endif /* CONFIG_KEXEC_FILE */ > > #ifdef CONFIG_CRASH_DUMP > +static int crash_exclude_mem_range_guarded(struct crash_mem **mem_ranges, > + unsigned long long mstart, > + unsigned long long mend) > +{ > + struct crash_mem *tmem = *mem_ranges; > + > + /* Reallocate memory ranges if there is no space to split ranges */ > + if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) { > + tmem = realloc_mem_ranges(mem_ranges); > + if (!tmem) > + return -ENOMEM; > + } > + > + return crash_exclude_mem_range(tmem, mstart, mend); > +} > + > /** > * get_crash_memory_ranges - Get crash memory ranges. This list includes > * first/crashing kernel's memory regions that > @@ -557,7 +580,6 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) > int get_crash_memory_ranges(struct crash_mem **mem_ranges) > { > phys_addr_t base, end; > - struct crash_mem *tmem; > u64 i; > int ret; > > @@ -582,19 +604,18 @@ int get_crash_memory_ranges(struct crash_mem **mem_ranges) > sort_memory_ranges(*mem_ranges, true); > } > > - /* Reallocate memory ranges if there is no space to split ranges */ > - tmem = *mem_ranges; > - if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) { > - tmem = realloc_mem_ranges(mem_ranges); > - if (!tmem) > - goto out; > - } > - > /* Exclude crashkernel region */ > - ret = crash_exclude_mem_range(tmem, crashk_res.start, crashk_res.end); > + ret = crash_exclude_mem_range_guarded(mem_ranges, crashk_res.start, crashk_res.end); > if (ret) > goto out; > > + for (i = 0; i < crashk_cma_cnt; ++i) { > + ret = crash_exclude_mem_range_guarded(mem_ranges, crashk_cma_ranges[i].start, > + crashk_cma_ranges[i].end); > + if (ret) > + goto out; > + } > + > /* > * FIXME: For now, stay in parity with kexec-tools but if RTAS/OPAL > * regions are exported to save their context at the time of > -- > 2.51.0 From rppt at kernel.org Mon Nov 3 08:57:24 2025 From: rppt at kernel.org (Mike Rapoport) Date: Mon, 3 Nov 2025 18:57:24 +0200 Subject: [PATCH] kho: fix 
out-of-bounds access of vmalloc chunk In-Reply-To: <20251103110159.8399-1-pratyush@kernel.org> References: <20251103110159.8399-1-pratyush@kernel.org> Message-ID: On Mon, Nov 03, 2025 at 12:01:57PM +0100, Pratyush Yadav wrote: > The list of pages in a vmalloc chunk is NULL-terminated. So when looping > through the pages in a vmalloc chunk, both kho_restore_vmalloc() and > kho_vmalloc_unpreserve_chunk() rightly make sure to stop when > encountering a NULL page. But when the chunk is full, the loops do not > stop and go past the bounds of chunk->phys, resulting in out-of-bounds > memory access, and possibly the restoration or unpreservation of an > invalid page. > > Fix this by making sure the processing of chunk stops at the end of the > array. > > Fixes: a667300bd53f2 ("kho: add support for preserving vmalloc allocations") > Signed-off-by: Pratyush Yadav Reviewed-by: Mike Rapoport (Microsoft) > --- > > Notes: > Commit 89a3ecca49ee8 ("kho: make sure page being restored is actually > from KHO") was quite helpful in catching this since kho_restore_page() > errored out due to missing magic number, instead of "restoring" a random > page and causing errors at other random places. 
> > kernel/kexec_handover.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c > index 76f0940fb4856..cc5aaa738bc50 100644 > --- a/kernel/kexec_handover.c > +++ b/kernel/kexec_handover.c > @@ -869,7 +869,7 @@ static void kho_vmalloc_unpreserve_chunk(struct kho_vmalloc_chunk *chunk) > > __kho_unpreserve(track, pfn, pfn + 1); > > - for (int i = 0; chunk->phys[i]; i++) { > + for (int i = 0; i < ARRAY_SIZE(chunk->phys) && chunk->phys[i]; i++) { > pfn = PHYS_PFN(chunk->phys[i]); > __kho_unpreserve(track, pfn, pfn + 1); > } > @@ -992,7 +992,7 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) > while (chunk) { > struct page *page; > > - for (int i = 0; chunk->phys[i]; i++) { > + for (int i = 0; i < ARRAY_SIZE(chunk->phys) && chunk->phys[i]; i++) { > phys_addr_t phys = chunk->phys[i]; > > if (idx + contig_pages > total_pages) > > base-commit: dcb6fa37fd7bc9c3d2b066329b0d27dedf8becaa > -- > 2.47.3 > -- Sincerely yours, Mike. From pratyush at kernel.org Mon Nov 3 10:02:30 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 3 Nov 2025 19:02:30 +0100 Subject: [PATCH 0/2] kho: misc fixes Message-ID: <20251103180235.71409-1-pratyush@kernel.org> This series has a couple of misc fixes for KHO I discovered during code review and testing. The series is based on top of [0] which has another fix for the function touched by patch 1. I spotted these two after sending the patch. If that one needs a reroll, I can combine the three into a series. 
[0] https://lore.kernel.org/linux-mm/20251103110159.8399-1-pratyush at kernel.org/ Pratyush Yadav (2): kho: fix unpreservation of higher-order vmalloc preservations kho: warn and exit when unpreserved page wasn't preserved kernel/kexec_handover.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) base-commit: dcb6fa37fd7bc9c3d2b066329b0d27dedf8becaa prerequisite-patch-id: fce7dcea45c85bac06a559d06f038e9c0cb38b17 -- 2.47.3 From pratyush at kernel.org Mon Nov 3 10:02:31 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 3 Nov 2025 19:02:31 +0100 Subject: [PATCH 1/2] kho: fix unpreservation of higher-order vmalloc preservations In-Reply-To: <20251103180235.71409-1-pratyush@kernel.org> References: <20251103180235.71409-1-pratyush@kernel.org> Message-ID: <20251103180235.71409-2-pratyush@kernel.org> kho_vmalloc_unpreserve_chunk() calls __kho_unpreserve() with end_pfn as pfn + 1. This happens to work for 0-order pages, but leaks higher order pages. For example, say order 2 pages back the allocation. During preservation, they get preserved in the order 2 bitmaps, but kho_vmalloc_unpreserve_chunk() would try to unpreserve them from the order 0 bitmaps, which should not have these bits set anyway, leaving the order 2 bitmaps untouched. This results in the pages being carried over to the next kernel. Nothing will free those pages in the next boot, leaking them. Fix this by taking the order into account when calculating the end PFN for __kho_unpreserve(). Fixes: a667300bd53f2 ("kho: add support for preserving vmalloc allocations") Signed-off-by: Pratyush Yadav --- Notes: When Pasha's patch [0] to add kho_unpreserve_pages() is merged, maybe it would be a better idea to use kho_unpreserve_pages() here? But that is something for later I suppose. 
[0] https://lore.kernel.org/linux-mm/20251101142325.1326536-4-pasha.tatashin at soleen.com/ kernel/kexec_handover.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c index cc5aaa738bc50..c2bcbb10918ce 100644 --- a/kernel/kexec_handover.c +++ b/kernel/kexec_handover.c @@ -862,7 +862,8 @@ static struct kho_vmalloc_chunk *new_vmalloc_chunk(struct kho_vmalloc_chunk *cur return NULL; } -static void kho_vmalloc_unpreserve_chunk(struct kho_vmalloc_chunk *chunk) +static void kho_vmalloc_unpreserve_chunk(struct kho_vmalloc_chunk *chunk, + unsigned short order) { struct kho_mem_track *track = &kho_out.ser.track; unsigned long pfn = PHYS_PFN(virt_to_phys(chunk)); @@ -871,7 +872,7 @@ static void kho_vmalloc_unpreserve_chunk(struct kho_vmalloc_chunk *chunk) for (int i = 0; i < ARRAY_SIZE(chunk->phys) && chunk->phys[i]; i++) { pfn = PHYS_PFN(chunk->phys[i]); - __kho_unpreserve(track, pfn, pfn + 1); + __kho_unpreserve(track, pfn, pfn + (1 << order)); } } @@ -882,7 +883,7 @@ static void kho_vmalloc_free_chunks(struct kho_vmalloc *kho_vmalloc) while (chunk) { struct kho_vmalloc_chunk *tmp = chunk; - kho_vmalloc_unpreserve_chunk(chunk); + kho_vmalloc_unpreserve_chunk(chunk, kho_vmalloc->order); chunk = KHOSER_LOAD_PTR(chunk->hdr.next); free_page((unsigned long)tmp); -- 2.47.3 From pratyush at kernel.org Mon Nov 3 10:02:32 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 3 Nov 2025 19:02:32 +0100 Subject: [PATCH 2/2] kho: warn and exit when unpreserved page wasn't preserved In-Reply-To: <20251103180235.71409-1-pratyush@kernel.org> References: <20251103180235.71409-1-pratyush@kernel.org> Message-ID: <20251103180235.71409-3-pratyush@kernel.org> Calling __kho_unpreserve() on a pair of (pfn, end_pfn) that wasn't preserved is a bug. Currently, if that is done, the physxa or bits can be NULL. 
This results in a soft lockup since a NULL physxa or bits results in redoing the loop without ever making any progress. Return when physxa or bits are not found, but WARN first to loudly indicate invalid behaviour. Fixes: fc33e4b44b271 ("kexec: enable KHO support for memory preservation") Signed-off-by: Pratyush Yadav --- kernel/kexec_handover.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c index c2bcbb10918ce..e5fd833726226 100644 --- a/kernel/kexec_handover.c +++ b/kernel/kexec_handover.c @@ -167,12 +167,12 @@ static void __kho_unpreserve(struct kho_mem_track *track, unsigned long pfn, const unsigned long pfn_high = pfn >> order; physxa = xa_load(&track->orders, order); - if (!physxa) - continue; + if (WARN_ON_ONCE(!physxa)) + return; bits = xa_load(&physxa->phys_bits, pfn_high / PRESERVE_BITS); - if (!bits) - continue; + if (WARN_ON_ONCE(!bits)) + return; clear_bit(pfn_high % PRESERVE_BITS, bits->preserve); -- 2.47.3 From akpm at linux-foundation.org Mon Nov 3 16:20:20 2025 From: akpm at linux-foundation.org (Andrew Morton) Date: Mon, 3 Nov 2025 16:20:20 -0800 Subject: [PATCH 0/2] kho: misc fixes In-Reply-To: <20251103180235.71409-1-pratyush@kernel.org> References: <20251103180235.71409-1-pratyush@kernel.org> Message-ID: <20251103162020.ac696dbc695f9341e7a267f7@linux-foundation.org> On Mon, 3 Nov 2025 19:02:30 +0100 Pratyush Yadav wrote: > This series has a couple of misc fixes for KHO I discovered during code > review and testing. > > The series is based on top of [0] which has another fix for the function > touched by patch 1. I spotted these two after sending the patch. If that > one needs a reroll, I can combine the three into a series. > Things appear to be misordered here. 
[1/2] "kho: fix unpreservation of higher-order vmalloc preservations" fixes a667300bd53f2, so it's wanted in 6.18-rcX [2/2] "kho: warn and exit when unpreserved page wasn't preserved" fixes fc33e4b44b271, so it's wanted in 6.16+ So can we please have [2/2] as a standalone fix against latest -linus, with a cc:stable? And then [1/2] as a standalone fix against latest -linus without a cc:stable. Once I have those merged up we can then take a look at what to do about the 6.19 material which is presently queued in mm-unstable. Thanks. From akpm at linux-foundation.org Mon Nov 3 17:23:21 2025 From: akpm at linux-foundation.org (Andrew Morton) Date: Mon, 3 Nov 2025 17:23:21 -0800 Subject: [PATCH 0/2] kho: misc fixes In-Reply-To: <20251103162020.ac696dbc695f9341e7a267f7@linux-foundation.org> References: <20251103180235.71409-1-pratyush@kernel.org> <20251103162020.ac696dbc695f9341e7a267f7@linux-foundation.org> Message-ID: <20251103172321.689294e48c2fae795e114ce6@linux-foundation.org> On Mon, 3 Nov 2025 16:20:20 -0800 Andrew Morton wrote: > On Mon, 3 Nov 2025 19:02:30 +0100 Pratyush Yadav wrote: > > > This series has a couple of misc fixes for KHO I discovered during code > > review and testing. > > > > The series is based on top of [0] which has another fix for the function > > touched by patch 1. I spotted these two after sending the patch. If that > > one needs a reroll, I can combine the three into a series. > > > > Things appear to be misordered here. > > [1/2] "kho: fix unpreservation of higher-order vmalloc preservations" > fixes a667300bd53f2, so it's wanted in 6.18-rcX > > [2/2] "kho: warn and exit when unpreserved page wasn't preserved" > fixes fc33e4b44b271, so it's wanted in 6.16+ > > So can we please have [2/2] as a standalone fix against latest -linus, > with a cc:stable? > > And then [1/2] as a standalone fix against latest -linus without a > cc:stable. > OK, I think I figured it out. 
In mm-hotfixes-unstable I have kho-fix-out-of-bounds-access-of-vmalloc-chunk.patch kho-fix-unpreservation-of-higher-order-vmalloc-preservations.patch kho-warn-and-exit-when-unpreserved-page-wasnt-preserved.patch The first two are applicable to 6.18-rcX and the third is applicable to 6.18-rcX, with a cc:stable for backporting. From maqianga at uniontech.com Mon Nov 3 18:59:59 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Tue, 4 Nov 2025 10:59:59 +0800 Subject: [PATCH] kexec: add kexec flag to support debug printing Message-ID: <20251104025959.1948450-1-maqianga@uniontech.com> This adds KEXEC_DEBUG to kexec_flags so that it can be passed to the kernel when '-d' is used with the kexec_load interface. With that flag enabled, the kernel can enable debug message printing. This patch requires the corresponding kexec_load debug message support in the Linux kernel[1]. [1]: https://lore.kernel.org/kexec/20251103063440.1681657-1-maqianga at uniontech.com/ Signed-off-by: Qiang Ma --- kexec/kexec-syscall.h | 1 + kexec/kexec.c | 1 + 2 files changed, 2 insertions(+) diff --git a/kexec/kexec-syscall.h b/kexec/kexec-syscall.h index e9bb7de..b60804f 100644 --- a/kexec/kexec-syscall.h +++ b/kexec/kexec-syscall.h @@ -120,6 +120,7 @@ static inline long kexec_file_load(int kernel_fd, int initrd_fd, #define KEXEC_PRESERVE_CONTEXT 0x00000002 #define KEXEC_UPDATE_ELFCOREHDR 0x00000004 #define KEXEC_CRASH_HOTPLUG_SUPPORT 0x00000008 +#define KEXEC_DEBUG 0x00000010 #define KEXEC_ARCH_MASK 0xffff0000 /* Flags for kexec file based system call */ diff --git a/kexec/kexec.c b/kexec/kexec.c index c9e4bcb..f425422 100644 --- a/kexec/kexec.c +++ b/kexec/kexec.c @@ -1518,6 +1518,7 @@ int main(int argc, char *argv[]) return 0; case OPT_DEBUG: kexec_debug = 1; + kexec_flags |= KEXEC_DEBUG; kexec_file_flags |= KEXEC_FILE_DEBUG; break; case OPT_NOIFDOWN: -- 2.20.1 From sourabhjain at linux.ibm.com Mon Nov 3 21:18:51 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Tue, 4 Nov 2025 10:48:51
+0530 Subject: [PATCH v5] powerpc/kdump: Add support for crashkernel CMA reservation In-Reply-To: <87y0on4ebh.ritesh.list@gmail.com> References: <20251103043747.1298065-1-sourabhjain@linux.ibm.com> <87y0on4ebh.ritesh.list@gmail.com> Message-ID: <7957bd55-5bda-406f-aab3-64e0620bd452@linux.ibm.com> On 03/11/25 15:40, Ritesh Harjani (IBM) wrote: > Sourabh Jain writes: > >> Commit 35c18f2933c5 ("Add a new optional ",cma" suffix to the >> crashkernel= command line option") and commit ab475510e042 ("kdump: >> implement reserve_crashkernel_cma") added CMA support for kdump >> crashkernel reservation. >> >> Extend crashkernel CMA reservation support to powerpc. >> >> The following changes are made to enable CMA reservation on powerpc: >> >> - Parse and obtain the CMA reservation size along with other crashkernel >> parameters >> - Call reserve_crashkernel_cma() to allocate the CMA region for kdump >> - Include the CMA-reserved ranges in the usable memory ranges for the >> kdump kernel to use. >> - Exclude the CMA-reserved ranges from the crash kernel memory to >> prevent them from being exported through /proc/vmcore. >> >> With the introduction of the CMA crashkernel regions, >> crash_exclude_mem_range() needs to be called multiple times to exclude >> both crashk_res and crashk_cma_ranges from the crash memory ranges. To >> avoid repetitive logic for validating mem_ranges size and handling >> reallocation when required, this functionality is moved to a new wrapper >> function crash_exclude_mem_range_guarded(). >> >> To ensure proper CMA reservation, reserve_crashkernel_cma() is called >> after pageblock_order is initialized. >> >> Update kernel-parameters.txt to document CMA support for crashkernel on >> powerpc architecture. 
>> >> Cc: Baoquan he >> Cc: Jiri Bohac >> Cc: Hari Bathini >> Cc: Madhavan Srinivasan >> Cc: Mahesh Salgaonkar >> Cc: Michael Ellerman >> Cc: Ritesh Harjani (IBM) >> Cc: Shivang Upadhyay >> Cc: kexec at lists.infradead.org >> Signed-off-by: Sourabh Jain >> --- >> Changelog: >> >> v3 -> v4 >> - Removed repeated initialization to tmem in >> crash_exclude_mem_range_guarded() >> - Call crash_exclude_mem_range() with right crashk ranges >> >> v4 -> v5: >> - Document CMA-based crashkernel support for ppc64 in kernel-parameters.txt >> --- >> .../admin-guide/kernel-parameters.txt | 2 +- >> arch/powerpc/include/asm/kexec.h | 2 + >> arch/powerpc/kernel/setup-common.c | 4 +- >> arch/powerpc/kexec/core.c | 10 ++++- >> arch/powerpc/kexec/ranges.c | 43 ++++++++++++++----- >> 5 files changed, 47 insertions(+), 14 deletions(-) >> >> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt >> index 6c42061ca20e..0f386b546cec 100644 >> --- a/Documentation/admin-guide/kernel-parameters.txt >> +++ b/Documentation/admin-guide/kernel-parameters.txt >> @@ -1013,7 +1013,7 @@ >> It will be ignored when crashkernel=X,high is not used >> or memory reserved is below 4G. >> crashkernel=size[KMG],cma >> - [KNL, X86] Reserve additional crash kernel memory from >> + [KNL, X86, ppc64] Reserve additional crash kernel memory from > Shouldn't this be PPC and not ppc64? > > If I see the crash_dump support... > > config ARCH_SUPPORTS_CRASH_DUMP > def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP) > > The changes below aren't specific to ppc64 correct? The thing is, this feature is only supported with KEXEC_FILE, which is only supported on PPC64: config ARCH_SUPPORTS_KEXEC_FILE def_bool PPC64 Hence I kept it as ppc64. I think I should update that in the commit message. Also, do you think it is good to restrict this feature to KEXEC_FILE?
This reservation is usable by the first system's >> userspace memory and kernel movable allocations (memory >> balloon, zswap). Pages allocated from this memory range >> diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h >> index 4bbf9f699aaa..bd4a6c42a5f3 100644 >> --- a/arch/powerpc/include/asm/kexec.h >> +++ b/arch/powerpc/include/asm/kexec.h >> @@ -115,9 +115,11 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt, struct crash_mem >> #ifdef CONFIG_CRASH_RESERVE >> int __init overlaps_crashkernel(unsigned long start, unsigned long size); >> extern void arch_reserve_crashkernel(void); >> +extern void kdump_cma_reserve(void); >> #else >> static inline void arch_reserve_crashkernel(void) {} >> static inline int overlaps_crashkernel(unsigned long start, unsigned long size) { return 0; } >> +static inline void kdump_cma_reserve(void) { } >> #endif >> >> #if defined(CONFIG_CRASH_DUMP) >> diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c >> index 68d47c53876c..c8c42b419742 100644 >> --- a/arch/powerpc/kernel/setup-common.c >> +++ b/arch/powerpc/kernel/setup-common.c >> @@ -35,6 +35,7 @@ >> #include >> #include >> #include >> +#include >> #include >> #include >> #include >> @@ -995,11 +996,12 @@ void __init setup_arch(char **cmdline_p) >> initmem_init(); >> >> /* >> - * Reserve large chunks of memory for use by CMA for fadump, KVM and >> + * Reserve large chunks of memory for use by CMA for kdump, fadump, KVM and >> * hugetlb. These must be called after initmem_init(), so that >> * pageblock_order is initialised. 
>> */ >> fadump_cma_init(); >> + kdump_cma_reserve(); >> kvm_cma_reserve(); >> gigantic_hugetlb_cma_reserve(); >> >> diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c >> index d1a2d755381c..25744737eff5 100644 >> --- a/arch/powerpc/kexec/core.c >> +++ b/arch/powerpc/kexec/core.c >> @@ -33,6 +33,8 @@ void machine_kexec_cleanup(struct kimage *image) >> { >> } >> >> +unsigned long long cma_size; >> + > nit: > Since this is a global powerpc variable you are defining, then can we > keep its name to crashk_cma_size? Yeah, makes sense. I will update the variable name. > >> /* >> * Do not allocate memory (or fail in any way) in machine_kexec(). >> * We are past the point of no return, committed to rebooting now. >> @@ -110,7 +112,7 @@ void __init arch_reserve_crashkernel(void) >> >> /* use common parsing */ >> ret = parse_crashkernel(boot_command_line, total_mem_sz, &crash_size, >> - &crash_base, NULL, NULL, NULL); >> + &crash_base, NULL, &cma_size, NULL); >> >> if (ret) >> return; >> @@ -130,6 +132,12 @@ void __init arch_reserve_crashkernel(void) >> reserve_crashkernel_generic(crash_size, crash_base, 0, false); >> } >> >> +void __init kdump_cma_reserve(void) >> +{ >> + if (cma_size) >> + reserve_crashkernel_cma(cma_size); >> +} >> + > nit: > cma_size is already checked for null within reserve_crashkernel_cma(), > so we don't really need kdump_cma_reserve() function call as such. > > Also kdump_cma_reserve() only makes sense with #ifdef CRASHKERNEL_CMA.. > so instead do you think we can directly call reserve_crashkernel_cma(cma_size)? I think the above kdump_cma_reserve() definition should come under CONFIG_CRASH_RESERVE because of the way it is declared in arch/powerpc/include/asm/kexec.h. I would like to keep kdump_cma_reserve() as it is because of two reasons: - It keeps setup_arch() free from kdump #ifdefs - In case we want to add some condition on this reservation, it would be straightforward.
So let's keep kdump_cma_reserve() as is, unless you have a strong opinion against it. >> int __init overlaps_crashkernel(unsigned long start, unsigned long size) >> { >> return (start + size) > crashk_res.start && start <= crashk_res.end; >> diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c >> index 3702b0bdab14..3bd27c38726b 100644 >> --- a/arch/powerpc/kexec/ranges.c >> +++ b/arch/powerpc/kexec/ranges.c >> @@ -515,7 +515,7 @@ int get_exclude_memory_ranges(struct crash_mem **mem_ranges) >> */ >> int get_usable_memory_ranges(struct crash_mem **mem_ranges) >> { >> - int ret; >> + int ret, i; >> >> /* >> * Early boot failure observed on guests when low memory (first memory >> @@ -528,6 +528,13 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) >> if (ret) >> goto out; >> >> + for (i = 0; i < crashk_cma_cnt; i++) { >> + ret = add_mem_range(mem_ranges, crashk_cma_ranges[i].start, >> + crashk_cma_ranges[i].end - crashk_cma_ranges[i].start + 1); >> + if (ret) >> + goto out; >> + } >> + >> ret = add_rtas_mem_range(mem_ranges); >> if (ret) >> goto out; >> @@ -546,6 +553,22 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) >> #endif /* CONFIG_KEXEC_FILE */ >> >> #ifdef CONFIG_CRASH_DUMP >> +static int crash_exclude_mem_range_guarded(struct crash_mem **mem_ranges, >> + unsigned long long mstart, >> + unsigned long long mend) >> +{ >> + struct crash_mem *tmem = *mem_ranges; >> + >> + /* Reallocate memory ranges if there is no space to split ranges */ >> + if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) { >> + tmem = realloc_mem_ranges(mem_ranges); >> + if (!tmem) >> + return -ENOMEM; >> + } >> + >> + return crash_exclude_mem_range(tmem, mstart, mend); >> +} >> + >> /** >> * get_crash_memory_ranges - Get crash memory ranges.
This list includes >> * first/crashing kernel's memory regions that >> @@ -557,7 +580,6 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) >> int get_crash_memory_ranges(struct crash_mem **mem_ranges) >> { >> phys_addr_t base, end; >> - struct crash_mem *tmem; >> u64 i; >> int ret; >> >> @@ -582,19 +604,18 @@ int get_crash_memory_ranges(struct crash_mem **mem_ranges) >> sort_memory_ranges(*mem_ranges, true); >> } >> >> - /* Reallocate memory ranges if there is no space to split ranges */ >> - tmem = *mem_ranges; >> - if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) { >> - tmem = realloc_mem_ranges(mem_ranges); >> - if (!tmem) >> - goto out; >> - } >> - >> /* Exclude crashkernel region */ >> - ret = crash_exclude_mem_range(tmem, crashk_res.start, crashk_res.end); >> + ret = crash_exclude_mem_range_guarded(mem_ranges, crashk_res.start, crashk_res.end); >> if (ret) >> goto out; >> >> + for (i = 0; i < crashk_cma_cnt; ++i) { >> + ret = crash_exclude_mem_range_guarded(mem_ranges, crashk_cma_ranges[i].start, >> + crashk_cma_ranges[i].end); >> + if (ret) >> + goto out; >> + } >> + >> /* >> * FIXME: For now, stay in parity with kexec-tools but if RTAS/OPAL >> * regions are exported to save their context at the time of >> -- >> 2.51.0 From sourabhjain at linux.ibm.com Mon Nov 3 22:26:42 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Tue, 4 Nov 2025 11:56:42 +0530 Subject: [PATCH 0/2] Export kdump crashkernel CMA ranges In-Reply-To: <20251103035859.1267318-1-sourabhjain@linux.ibm.com> References: <20251103035859.1267318-1-sourabhjain@linux.ibm.com> Message-ID: Cc others who can provide input. On 03/11/25 09:28, Sourabh Jain wrote: > /sys/kernel/kexec_crash_cma_ranges to export all CMA regions reserved > for the crashkernel to user-space. This enables user-space tools > configuring kdump to determine the amount of memory reserved for the > crashkernel. 
When CMA is used for crashkernel allocation, tools can use > this information to warn users that attempting to capture user pages > while CMA reservation is active may lead to unreliable or incomplete > dump capture. > > While adding documentation for the new sysfs interface, I realized that > there was no ABI document for the existing kexec and kdump sysfs > interfaces, so I added one. > > The first patch adds the ABI documentation for the existing kexec and > kdump sysfs interfaces, and the second patch adds the > /sys/kernel/kexec_crash_cma_ranges sysfs interface along with its > corresponding ABI documentation. > > *Seeking opinions* > There are already four kexec/kdump sysfs entries under /sys/kernel/, > and this patch series adds one more. Should we consider moving them to > a separate directory, such as /sys/kernel/kexec, to avoid polluting > /sys/kernel/? For backward compatibility, we can create symlinks at > the old locations for sometime and remove them in the future. > > Cc: Andrew Morton > Cc: Baoquan he > Cc: Jiri Bohac > Cc: Shivang Upadhyay > Cc: linuxppc-dev at lists.ozlabs.org > Cc: kexec at lists.infradead.org > > Sourabh Jain (2): > Documentation/ABI: add kexec and kdump sysfs interface > crash: export crashkernel CMA reservation to userspace > > .../ABI/testing/sysfs-kernel-kexec-kdump | 53 +++++++++++++++++++ > kernel/ksysfs.c | 17 ++++++ > 2 files changed, 70 insertions(+) > create mode 100644 Documentation/ABI/testing/sysfs-kernel-kexec-kdump > From sourabhjain at linux.ibm.com Tue Nov 4 01:34:37 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Tue, 4 Nov 2025 15:04:37 +0530 Subject: [PATCH v5] powerpc/kdump: Add support for crashkernel CMA reservation In-Reply-To: <7957bd55-5bda-406f-aab3-64e0620bd452@linux.ibm.com> References: <20251103043747.1298065-1-sourabhjain@linux.ibm.com> <87y0on4ebh.ritesh.list@gmail.com> <7957bd55-5bda-406f-aab3-64e0620bd452@linux.ibm.com> Message-ID: On 04/11/25 10:48, Sourabh Jain wrote: > > > On 
03/11/25 15:40, Ritesh Harjani (IBM) wrote: >> Sourabh Jain writes: >> >>> Commit 35c18f2933c5 ("Add a new optional ",cma" suffix to the >>> crashkernel= command line option") and commit ab475510e042 ("kdump: >>> implement reserve_crashkernel_cma") added CMA support for kdump >>> crashkernel reservation. >>> >>> Extend crashkernel CMA reservation support to powerpc. >>> >>> The following changes are made to enable CMA reservation on powerpc: >>> >>> - Parse and obtain the CMA reservation size along with other >>> crashkernel >>> parameters >>> - Call reserve_crashkernel_cma() to allocate the CMA region for kdump >>> - Include the CMA-reserved ranges in the usable memory ranges for the >>> kdump kernel to use. >>> - Exclude the CMA-reserved ranges from the crash kernel memory to >>> prevent them from being exported through /proc/vmcore. >>> >>> With the introduction of the CMA crashkernel regions, >>> crash_exclude_mem_range() needs to be called multiple times to exclude >>> both crashk_res and crashk_cma_ranges from the crash memory ranges. To >>> avoid repetitive logic for validating mem_ranges size and handling >>> reallocation when required, this functionality is moved to a new >>> wrapper >>> function crash_exclude_mem_range_guarded(). >>> >>> To ensure proper CMA reservation, reserve_crashkernel_cma() is called >>> after pageblock_order is initialized. >>> >>> Update kernel-parameters.txt to document CMA support for crashkernel on >>> powerpc architecture. >>> >>> Cc: Baoquan he >>> Cc: Jiri Bohac >>> Cc: Hari Bathini >>> Cc: Madhavan Srinivasan >>> Cc: Mahesh Salgaonkar >>> Cc: Michael Ellerman >>> Cc: Ritesh Harjani (IBM) >>> Cc: Shivang Upadhyay >>> Cc: kexec at lists.infradead.org >>> Signed-off-by: Sourabh Jain >>> --- >>> Changelog: >>> >>> v3 -> v4 >>> - Removed repeated initialization to tmem in >>> crash_exclude_mem_range_guarded() >>> - Call crash_exclude_mem_range() with right crashk ranges >>> >>> v4 -> v5: >>>
- Document CMA-based crashkernel support for ppc64 in >>> kernel-parameters.txt >>> --- >>> ? .../admin-guide/kernel-parameters.txt???????? |? 2 +- >>> ? arch/powerpc/include/asm/kexec.h????????????? |? 2 + >>> ? arch/powerpc/kernel/setup-common.c??????????? |? 4 +- >>> ? arch/powerpc/kexec/core.c???????????????????? | 10 ++++- >>> ? arch/powerpc/kexec/ranges.c?????????????????? | 43 >>> ++++++++++++++----- >>> ? 5 files changed, 47 insertions(+), 14 deletions(-) >>> >>> diff --git a/Documentation/admin-guide/kernel-parameters.txt >>> b/Documentation/admin-guide/kernel-parameters.txt >>> index 6c42061ca20e..0f386b546cec 100644 >>> --- a/Documentation/admin-guide/kernel-parameters.txt >>> +++ b/Documentation/admin-guide/kernel-parameters.txt >>> @@ -1013,7 +1013,7 @@ >>> ????????????? It will be ignored when crashkernel=X,high is not used >>> ????????????? or memory reserved is below 4G. >>> ????? crashkernel=size[KMG],cma >>> -??????????? [KNL, X86] Reserve additional crash kernel memory from >>> +??????????? [KNL, X86, ppc64] Reserve additional crash kernel >>> memory from >> Shouldn't this be PPC and not ppc64? >> >> If I see the crash_dump support... >> >> config ARCH_SUPPORTS_CRASH_DUMP >> ????def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP) >> >> The changes below aren't specific to ppc64 correct? > > The thing is this feature is only supported with KEXEC_FILE and which > only supported on PPC64: > > config ARCH_SUPPORTS_KEXEC_FILE > ??? def_bool PPC64 > > Hence I kept it as ppc64. > > I think I should update that in the commit message. > > Also do you think is it good to restrict this feature to KEXEC_FILE? Putting this under KEXEC_FILE may not help much because KEXEC_FILE is enabled by default in most configurations. Once it is enabled, the CMA reservation will happen regardless of which system call is used to load the kdump kernel (kexec_load or kexec_file_load). 
However, not restricting this feature to KEXEC_FILE will allow the kexec tool to independently add support for this feature in the future for the kexec_load system call. With that logic, I think if we do not restrict this feature to KEXEC_FILE, the support will be available for ppc and not limited to ppc64. > >> >>> ????????????? CMA. This reservation is usable by the first system's >>> ????????????? userspace memory and kernel movable allocations (memory >>> ????????????? balloon, zswap). Pages allocated from this memory range >>> diff --git a/arch/powerpc/include/asm/kexec.h >>> b/arch/powerpc/include/asm/kexec.h >>> index 4bbf9f699aaa..bd4a6c42a5f3 100644 >>> --- a/arch/powerpc/include/asm/kexec.h >>> +++ b/arch/powerpc/include/asm/kexec.h >>> @@ -115,9 +115,11 @@ int setup_new_fdt_ppc64(const struct kimage >>> *image, void *fdt, struct crash_mem >>> ? #ifdef CONFIG_CRASH_RESERVE >>> ? int __init overlaps_crashkernel(unsigned long start, unsigned long >>> size); >>> ? extern void arch_reserve_crashkernel(void); >>> +extern void kdump_cma_reserve(void); >>> ? #else >>> ? static inline void arch_reserve_crashkernel(void) {} >>> ? static inline int overlaps_crashkernel(unsigned long start, >>> unsigned long size) { return 0; } >>> +static inline void kdump_cma_reserve(void) { } >>> ? #endif >>> ? ? #if defined(CONFIG_CRASH_DUMP) >>> diff --git a/arch/powerpc/kernel/setup-common.c >>> b/arch/powerpc/kernel/setup-common.c >>> index 68d47c53876c..c8c42b419742 100644 >>> --- a/arch/powerpc/kernel/setup-common.c >>> +++ b/arch/powerpc/kernel/setup-common.c >>> @@ -35,6 +35,7 @@ >>> ? #include >>> ? #include >>> ? #include >>> +#include >>> ? #include >>> ? #include >>> ? #include >>> @@ -995,11 +996,12 @@ void __init setup_arch(char **cmdline_p) >>> ????? initmem_init(); >>> ? ????? /* >>> -???? * Reserve large chunks of memory for use by CMA for fadump, >>> KVM and >>> +???? * Reserve large chunks of memory for use by CMA for kdump, >>> fadump, KVM and >>> ?????? 
* hugetlb. These must be called after initmem_init(), so that >>> ?????? * pageblock_order is initialised. >>> ?????? */ >>> ????? fadump_cma_init(); >>> +??? kdump_cma_reserve(); >>> ????? kvm_cma_reserve(); >>> ????? gigantic_hugetlb_cma_reserve(); >>> ? diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c >>> index d1a2d755381c..25744737eff5 100644 >>> --- a/arch/powerpc/kexec/core.c >>> +++ b/arch/powerpc/kexec/core.c >>> @@ -33,6 +33,8 @@ void machine_kexec_cleanup(struct kimage *image) >>> ? { >>> ? } >>> ? +unsigned long long cma_size; >>> + >> nit: >> Since this is a gloabal powerpc variable you are defining, then can we >> keep it's name to crashk_cma_size? > > Yeah make sense. I will update the variable name. > > >> >>> ? /* >>> ?? * Do not allocate memory (or fail in any way) in machine_kexec(). >>> ?? * We are past the point of no return, committed to rebooting now. >>> @@ -110,7 +112,7 @@ void __init arch_reserve_crashkernel(void) >>> ? ????? /* use common parsing */ >>> ????? ret = parse_crashkernel(boot_command_line, total_mem_sz, >>> &crash_size, >>> -??????????????? &crash_base, NULL, NULL, NULL); >>> +??????????????? &crash_base, NULL, &cma_size, NULL); >>> ? ????? if (ret) >>> ????????? return; >>> @@ -130,6 +132,12 @@ void __init arch_reserve_crashkernel(void) >>> ????? reserve_crashkernel_generic(crash_size, crash_base, 0, false); >>> ? } >>> ? +void __init kdump_cma_reserve(void) >>> +{ >>> +??? if (cma_size) >>> +??????? reserve_crashkernel_cma(cma_size); >>> +} >>> + >> nit: >> cma_size is already checked for null within reserve_crashkernel_cma(), >> so we don't really need kdump_cma_reserve() function call as such. >> >> Also kdump_cma_reserve() only make sense with #ifdef CRASHKERNEL_CMA.. >> so instead do you think we can directly call >> reserve_crashkernel_cma(cma_size)? 
> > I think the above kdump_cma_reserve() definition should come under > CONFIG_CRASH_RESERVE > because the way it is declared in arch/powerpc/include/asm/kexec.h. > > I would like to keep kdump_cma_reserve() as is it because of two reasons: > > - It keeps setup_arch() free from kdump #ifdefs > - In case if we want to add some condition on this reservation it > would straight forward. > > So lets keep kdump_cma_reserve as is, unless you have strong opinion > on not to. > >>> ? int __init overlaps_crashkernel(unsigned long start, unsigned long >>> size) >>> ? { >>> ????? return (start + size) > crashk_res.start && start <= >>> crashk_res.end; >>> diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c >>> index 3702b0bdab14..3bd27c38726b 100644 >>> --- a/arch/powerpc/kexec/ranges.c >>> +++ b/arch/powerpc/kexec/ranges.c >>> @@ -515,7 +515,7 @@ int get_exclude_memory_ranges(struct crash_mem >>> **mem_ranges) >>> ?? */ >>> ? int get_usable_memory_ranges(struct crash_mem **mem_ranges) >>> ? { >>> -??? int ret; >>> +??? int ret, i; >>> ? ????? /* >>> ?????? * Early boot failure observed on guests when low memory >>> (first memory >>> @@ -528,6 +528,13 @@ int get_usable_memory_ranges(struct crash_mem >>> **mem_ranges) >>> ????? if (ret) >>> ????????? goto out; >>> ? +??? for (i = 0; i < crashk_cma_cnt; i++) { >>> +??????? ret = add_mem_range(mem_ranges, crashk_cma_ranges[i].start, >>> +??????????????????? crashk_cma_ranges[i].end - >>> crashk_cma_ranges[i].start + 1); >>> +??????? if (ret) >>> +??????????? goto out; >>> +??? } >>> + >>> ????? ret = add_rtas_mem_range(mem_ranges); >>> ????? if (ret) >>> ????????? goto out; >>> @@ -546,6 +553,22 @@ int get_usable_memory_ranges(struct crash_mem >>> **mem_ranges) >>> ? #endif /* CONFIG_KEXEC_FILE */ >>> ? ? #ifdef CONFIG_CRASH_DUMP >>> +static int crash_exclude_mem_range_guarded(struct crash_mem >>> **mem_ranges, >>> +?????????????????????? unsigned long long mstart, >>> +?????????????????????? 
unsigned long long mend) >>> +{ >>> +??? struct crash_mem *tmem = *mem_ranges; >>> + >>> +??? /* Reallocate memory ranges if there is no space to split >>> ranges */ >>> +??? if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) { >>> +??????? tmem = realloc_mem_ranges(mem_ranges); >>> +??????? if (!tmem) >>> +??????????? return -ENOMEM; >>> +??? } >>> + >>> +??? return crash_exclude_mem_range(tmem, mstart, mend); >>> +} >>> + >>> ? /** >>> ?? * get_crash_memory_ranges - Get crash memory ranges. This list >>> includes >>> ?? *?????????????????????????? first/crashing kernel's memory >>> regions that >>> @@ -557,7 +580,6 @@ int get_usable_memory_ranges(struct crash_mem >>> **mem_ranges) >>> ? int get_crash_memory_ranges(struct crash_mem **mem_ranges) >>> ? { >>> ????? phys_addr_t base, end; >>> -??? struct crash_mem *tmem; >>> ????? u64 i; >>> ????? int ret; >>> ? @@ -582,19 +604,18 @@ int get_crash_memory_ranges(struct crash_mem >>> **mem_ranges) >>> ????????????? sort_memory_ranges(*mem_ranges, true); >>> ????? } >>> ? -??? /* Reallocate memory ranges if there is no space to split >>> ranges */ >>> -??? tmem = *mem_ranges; >>> -??? if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) { >>> -??????? tmem = realloc_mem_ranges(mem_ranges); >>> -??????? if (!tmem) >>> -??????????? goto out; >>> -??? } >>> - >>> ????? /* Exclude crashkernel region */ >>> -??? ret = crash_exclude_mem_range(tmem, crashk_res.start, >>> crashk_res.end); >>> +??? ret = crash_exclude_mem_range_guarded(mem_ranges, >>> crashk_res.start, crashk_res.end); >>> ????? if (ret) >>> ????????? goto out; >>> ? +??? for (i = 0; i < crashk_cma_cnt; ++i) { >>> +??????? ret = crash_exclude_mem_range_guarded(mem_ranges, >>> crashk_cma_ranges[i].start, >>> +????????????????????????? crashk_cma_ranges[i].end); >>> +??????? if (ret) >>> +??????????? goto out; >>> +??? } >>> + >>> ????? /* >>> ?????? * FIXME: For now, stay in parity with kexec-tools but if >>> RTAS/OPAL >>> ?????? *??????? 
regions are exported to save their context at the >>> time of >>> -- >>> 2.51.0 > From ritesh.list at gmail.com Tue Nov 4 02:18:48 2025 From: ritesh.list at gmail.com (Ritesh Harjani (IBM)) Date: Tue, 04 Nov 2025 15:48:48 +0530 Subject: [PATCH v5] powerpc/kdump: Add support for crashkernel CMA reservation In-Reply-To: <7957bd55-5bda-406f-aab3-64e0620bd452@linux.ibm.com> References: <20251103043747.1298065-1-sourabhjain@linux.ibm.com> <87y0on4ebh.ritesh.list@gmail.com> <7957bd55-5bda-406f-aab3-64e0620bd452@linux.ibm.com> Message-ID: <87wm463xtj.ritesh.list@gmail.com> Sourabh Jain writes: > I would like to keep kdump_cma_reserve() as it is for two reasons: > > - It keeps setup_arch() free from kdump #ifdefs Not really. Instead of kdump_cma_reserve(crashk_cma_size), one could call reserve_crashkernel_cma(crashk_cma_size) directly in setup_arch(). > - In case we want to add some condition on this reservation, it would > be straightforward. > Makes sense. > So let's keep kdump_cma_reserve() as is, unless you have a strong opinion > against it. > No strong opinion; as I said, it was a minor nit. Feel free to keep the function kdump_cma_reserve() as is then. -ritesh From sourabhjain at linux.ibm.com Tue Nov 4 02:35:42 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Tue, 4 Nov 2025 16:05:42 +0530 Subject: [PATCH v5] powerpc/kdump: Add support for crashkernel CMA reservation In-Reply-To: <87wm463xtj.ritesh.list@gmail.com> References: <20251103043747.1298065-1-sourabhjain@linux.ibm.com> <87y0on4ebh.ritesh.list@gmail.com> <7957bd55-5bda-406f-aab3-64e0620bd452@linux.ibm.com> <87wm463xtj.ritesh.list@gmail.com> Message-ID: <722d72b5-cebf-48f2-8ad5-558ccd3c30f4@linux.ibm.com> On 04/11/25 15:48, Ritesh Harjani (IBM) wrote: > Sourabh Jain writes: > > >> I would like to keep kdump_cma_reserve() as it is for two reasons: >> >> - It keeps setup_arch() free from kdump #ifdefs > Not really. 
> > Instead of kdump_cma_reserve(crashk_cma_size), one could call > > reserve_crashkernel_cma(crashk_cma_size) directly in setup_arch(). reserve_crashkernel_cma() is not available unless the kernel is built with CONFIG_CRASH_RESERVE. So, wouldn't calling reserve_crashkernel_cma() directly from setup_arch() lead to a build failure? Or am I missing something? > >> - In case we want to add some condition on this reservation, it would >> be straightforward. >> > Makes sense. > >> So let's keep kdump_cma_reserve() as is, unless you have a strong opinion >> against it. >> > No strong opinion; as I said, it was a minor nit. Feel free to keep the > function kdump_cma_reserve() as is then. > > -ritesh > From ritesh.list at gmail.com Tue Nov 4 02:24:44 2025 From: ritesh.list at gmail.com (Ritesh Harjani (IBM)) Date: Tue, 04 Nov 2025 15:54:44 +0530 Subject: [PATCH v5] powerpc/kdump: Add support for crashkernel CMA reservation In-Reply-To: References: <20251103043747.1298065-1-sourabhjain@linux.ibm.com> <87y0on4ebh.ritesh.list@gmail.com> <7957bd55-5bda-406f-aab3-64e0620bd452@linux.ibm.com> Message-ID: <87v7jq3xjn.ritesh.list@gmail.com> Sourabh Jain writes: > On 04/11/25 10:48, Sourabh Jain wrote: >> >> >> On 03/11/25 15:40, Ritesh Harjani (IBM) wrote: >>> Sourabh Jain writes: >>> >>>> Commit 35c18f2933c5 ("Add a new optional ",cma" suffix to the >>>> crashkernel= command line option") and commit ab475510e042 ("kdump: >>>> implement reserve_crashkernel_cma") added CMA support for kdump >>>> crashkernel reservation. >>>> >>>> Extend crashkernel CMA reservation support to powerpc. >>>> >>>> The following changes are made to enable CMA reservation on powerpc: >>>> >>>> - Parse and obtain the CMA reservation size along with other >>>> crashkernel >>>>    parameters >>>> - Call reserve_crashkernel_cma() to allocate the CMA region for kdump >>>> - Include the CMA-reserved ranges in the usable memory ranges for the >>>>    kdump kernel to use. 
>>>> - Exclude the CMA-reserved ranges from the crash kernel memory to >>>> ?? prevent them from being exported through /proc/vmcore. >>>> >>>> With the introduction of the CMA crashkernel regions, >>>> crash_exclude_mem_range() needs to be called multiple times to exclude >>>> both crashk_res and crashk_cma_ranges from the crash memory ranges. To >>>> avoid repetitive logic for validating mem_ranges size and handling >>>> reallocation when required, this functionality is moved to a new >>>> wrapper >>>> function crash_exclude_mem_range_guarded(). >>>> >>>> To ensure proper CMA reservation, reserve_crashkernel_cma() is called >>>> after pageblock_order is initialized. >>>> >>>> Update kernel-parameters.txt to document CMA support for crashkernel on >>>> powerpc architecture. >>>> >>>> Cc: Baoquan he >>>> Cc: Jiri Bohac >>>> Cc: Hari Bathini >>>> Cc: Madhavan Srinivasan >>>> Cc: Mahesh Salgaonkar >>>> Cc: Michael Ellerman >>>> Cc: Ritesh Harjani (IBM) >>>> Cc: Shivang Upadhyay >>>> Cc: kexec at lists.infradead.org >>>> Signed-off-by: Sourabh Jain >>>> --- >>>> Changlog: >>>> >>>> v3 -> v4 >>>> ? - Removed repeated initialization to tmem in >>>> ??? crash_exclude_mem_range_guarded() >>>> ? - Call crash_exclude_mem_range() with right crashk ranges >>>> >>>> v4 -> v5: >>>> ? - Document CMA-based crashkernel support for ppc64 in >>>> kernel-parameters.txt >>>> --- >>>> ? .../admin-guide/kernel-parameters.txt???????? |? 2 +- >>>> ? arch/powerpc/include/asm/kexec.h????????????? |? 2 + >>>> ? arch/powerpc/kernel/setup-common.c??????????? |? 4 +- >>>> ? arch/powerpc/kexec/core.c???????????????????? | 10 ++++- >>>> ? arch/powerpc/kexec/ranges.c?????????????????? | 43 >>>> ++++++++++++++----- >>>> ? 
5 files changed, 47 insertions(+), 14 deletions(-) >>>> >>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt >>>> b/Documentation/admin-guide/kernel-parameters.txt >>>> index 6c42061ca20e..0f386b546cec 100644 >>>> --- a/Documentation/admin-guide/kernel-parameters.txt >>>> +++ b/Documentation/admin-guide/kernel-parameters.txt >>>> @@ -1013,7 +1013,7 @@ >>>> ????????????? It will be ignored when crashkernel=X,high is not used >>>> ????????????? or memory reserved is below 4G. >>>> ????? crashkernel=size[KMG],cma >>>> -??????????? [KNL, X86] Reserve additional crash kernel memory from >>>> +??????????? [KNL, X86, ppc64] Reserve additional crash kernel >>>> memory from >>> Shouldn't this be PPC and not ppc64? >>> >>> If I see the crash_dump support... >>> >>> config ARCH_SUPPORTS_CRASH_DUMP >>> ????def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP) >>> >>> The changes below aren't specific to ppc64 correct? >> >> The thing is this feature is only supported with KEXEC_FILE and which >> only supported on PPC64: >> >> config ARCH_SUPPORTS_KEXEC_FILE >> ??? def_bool PPC64 >> >> Hence I kept it as ppc64. >> I am not much familiar with the kexec_load v/s kexec_file_load internals. Maybe because of that I am unable to clearly understand your above points. But let me try and explain what I think you meant :) We first call "get_usable_memory_ranges(&umem)" which updates the usable memory ranges in "umem". We then call "update_usable_mem_fdt(fdt, umem)" which updates the FDT for the kdump kernel's fdt to inform about these usable memory ranges to the kdump kernel. Now since your patch only does that in get_usable_memory_range(), this extra CMA reservation is mainly only useful when the kdump load happens via kexec_file_load(), (because get_usable_memory_range() only gets called from kexec_file_load() path) Is this what you meant here? >> I think I should update that in the commit message. 
>> >> Also, do you think it is a good idea to restrict this feature to KEXEC_FILE? > > Putting this under KEXEC_FILE may not help much because KEXEC_FILE is > enabled > by default in most configurations. Once it is enabled, the CMA > reservation will > happen regardless of which system call is used to load the kdump kernel > (kexec_load or kexec_file_load). > What I understood from the feature was that, on the normal production kernel, this feature crashkernel=xM,cma allows reserving an extra xMB of memory as a CMA region for the kdump kernel's memory allocations. But this CMA reservation would happen in the normal kernel itself during setup_arch() -> kdump_cma_reserve(). And this CMA reservation happens irrespective of which system call the kdump kernel gets loaded with. > However, not restricting this feature to KEXEC_FILE will allow the kexec > tool to > independently add support for this feature in the future for the kexec_load > system call. Sure. > > With that logic, I think if we do not restrict this feature to > KEXEC_FILE, the support > will be available for ppc and not limited to ppc64. > Yes, that makes sense. > > If one doesn't want to make the CMA reservation, then we need not pass > the extra cmdline argument and no reservation will be made. So, no need > to restrict this to PPC64 by making it available only for KEXEC_FILE. 
-ritesh From ritesh.list at gmail.com Tue Nov 4 02:51:41 2025 From: ritesh.list at gmail.com (Ritesh Harjani (IBM)) Date: Tue, 04 Nov 2025 16:21:41 +0530 Subject: [PATCH v5] powerpc/kdump: Add support for crashkernel CMA reservation In-Reply-To: <722d72b5-cebf-48f2-8ad5-558ccd3c30f4@linux.ibm.com> References: <20251103043747.1298065-1-sourabhjain@linux.ibm.com> <87y0on4ebh.ritesh.list@gmail.com> <7957bd55-5bda-406f-aab3-64e0620bd452@linux.ibm.com> <87wm463xtj.ritesh.list@gmail.com> Message-ID: <87tsza3waq.ritesh.list@gmail.com> Sourabh Jain writes: > On 04/11/25 15:48, Ritesh Harjani (IBM) wrote: >> Sourabh Jain writes: >> >> >>> I would like to keep kdump_cma_reserve() as it is for two reasons: >>> >>> - It keeps setup_arch() free from kdump #ifdefs >> Not really. >> >> Instead of kdump_cma_reserve(crashk_cma_size), one could call >> reserve_crashkernel_cma(crashk_cma_size) directly in setup_arch(). > > > reserve_crashkernel_cma() is not available unless the kernel is built > with CONFIG_CRASH_RESERVE. > So, wouldn't calling reserve_crashkernel_cma() directly from > setup_arch() lead to a build failure? Or > am I missing something? > Oops... I was assuming the #else CRASHKERNEL_CMA definition would get called, but all of that logic is itself protected by CONFIG_CRASH_RESERVE :( Right, to avoid #ifdef or IS_ENABLED() in setup_arch(), it's better to have kdump_cma_reserve(). Thanks for pointing that out. obj-$(CONFIG_CRASH_RESERVE) += crash_reserve.o kernel/crash_reserve.c #ifdef CRASHKERNEL_CMA int crashk_cma_cnt; void __init reserve_crashkernel_cma(unsigned long long cma_size) { ... } #else /* CRASHKERNEL_CMA */ void __init reserve_crashkernel_cma(unsigned long long cma_size) { if (cma_size) pr_warn("crashkernel CMA reservation not supported\n"); } #endif -ritesh >> >>> - In case we want to add some condition on this reservation, it would >>> be straightforward. >>> >> Makes sense. 
>> >>> So lets keep kdump_cma_reserve as is, unless you have strong opinion on >>> not to. >>> >> No strong opinion, as I said it was a minor nit. Feel free to keep the >> function kdump_cma_reserve() as is then. >> >> -ritesh >> From sourabhjain at linux.ibm.com Tue Nov 4 04:38:19 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Tue, 4 Nov 2025 18:08:19 +0530 Subject: [PATCH v5] powerpc/kdump: Add support for crashkernel CMA reservation In-Reply-To: <87v7jq3xjn.ritesh.list@gmail.com> References: <20251103043747.1298065-1-sourabhjain@linux.ibm.com> <87y0on4ebh.ritesh.list@gmail.com> <7957bd55-5bda-406f-aab3-64e0620bd452@linux.ibm.com> <87v7jq3xjn.ritesh.list@gmail.com> Message-ID: On 04/11/25 15:54, Ritesh Harjani (IBM) wrote: > Sourabh Jain writes: > >> On 04/11/25 10:48, Sourabh Jain wrote: >>> >>> On 03/11/25 15:40, Ritesh Harjani (IBM) wrote: >>>> Sourabh Jain writes: >>>> >>>>> Commit 35c18f2933c5 ("Add a new optional ",cma" suffix to the >>>>> crashkernel= command line option") and commit ab475510e042 ("kdump: >>>>> implement reserve_crashkernel_cma") added CMA support for kdump >>>>> crashkernel reservation. >>>>> >>>>> Extend crashkernel CMA reservation support to powerpc. >>>>> >>>>> The following changes are made to enable CMA reservation on powerpc: >>>>> >>>>> - Parse and obtain the CMA reservation size along with other >>>>> crashkernel >>>>> ?? parameters >>>>> - Call reserve_crashkernel_cma() to allocate the CMA region for kdump >>>>> - Include the CMA-reserved ranges in the usable memory ranges for the >>>>> ?? kdump kernel to use. >>>>> - Exclude the CMA-reserved ranges from the crash kernel memory to >>>>> ?? prevent them from being exported through /proc/vmcore. >>>>> >>>>> With the introduction of the CMA crashkernel regions, >>>>> crash_exclude_mem_range() needs to be called multiple times to exclude >>>>> both crashk_res and crashk_cma_ranges from the crash memory ranges. 
To >>>>> avoid repetitive logic for validating mem_ranges size and handling >>>>> reallocation when required, this functionality is moved to a new >>>>> wrapper >>>>> function crash_exclude_mem_range_guarded(). >>>>> >>>>> To ensure proper CMA reservation, reserve_crashkernel_cma() is called >>>>> after pageblock_order is initialized. >>>>> >>>>> Update kernel-parameters.txt to document CMA support for crashkernel on >>>>> powerpc architecture. >>>>> >>>>> Cc: Baoquan he >>>>> Cc: Jiri Bohac >>>>> Cc: Hari Bathini >>>>> Cc: Madhavan Srinivasan >>>>> Cc: Mahesh Salgaonkar >>>>> Cc: Michael Ellerman >>>>> Cc: Ritesh Harjani (IBM) >>>>> Cc: Shivang Upadhyay >>>>> Cc: kexec at lists.infradead.org >>>>> Signed-off-by: Sourabh Jain >>>>> --- >>>>> Changlog: >>>>> >>>>> v3 -> v4 >>>>> ? - Removed repeated initialization to tmem in >>>>> ??? crash_exclude_mem_range_guarded() >>>>> ? - Call crash_exclude_mem_range() with right crashk ranges >>>>> >>>>> v4 -> v5: >>>>> ? - Document CMA-based crashkernel support for ppc64 in >>>>> kernel-parameters.txt >>>>> --- >>>>> ? .../admin-guide/kernel-parameters.txt???????? |? 2 +- >>>>> ? arch/powerpc/include/asm/kexec.h????????????? |? 2 + >>>>> ? arch/powerpc/kernel/setup-common.c??????????? |? 4 +- >>>>> ? arch/powerpc/kexec/core.c???????????????????? | 10 ++++- >>>>> ? arch/powerpc/kexec/ranges.c?????????????????? | 43 >>>>> ++++++++++++++----- >>>>> ? 5 files changed, 47 insertions(+), 14 deletions(-) >>>>> >>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt >>>>> b/Documentation/admin-guide/kernel-parameters.txt >>>>> index 6c42061ca20e..0f386b546cec 100644 >>>>> --- a/Documentation/admin-guide/kernel-parameters.txt >>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt >>>>> @@ -1013,7 +1013,7 @@ >>>>> ????????????? It will be ignored when crashkernel=X,high is not used >>>>> ????????????? or memory reserved is below 4G. >>>>> ????? crashkernel=size[KMG],cma >>>>> -??????????? 
[KNL, X86] Reserve additional crash kernel memory from >>>>> +??????????? [KNL, X86, ppc64] Reserve additional crash kernel >>>>> memory from >>>> Shouldn't this be PPC and not ppc64? >>>> >>>> If I see the crash_dump support... >>>> >>>> config ARCH_SUPPORTS_CRASH_DUMP >>>> ????def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP) >>>> >>>> The changes below aren't specific to ppc64 correct? >>> The thing is this feature is only supported with KEXEC_FILE and which >>> only supported on PPC64: >>> >>> config ARCH_SUPPORTS_KEXEC_FILE >>> ??? def_bool PPC64 >>> >>> Hence I kept it as ppc64. >>> > I am not much familiar with the kexec_load v/s kexec_file_load > internals. Maybe because of that I am unable to clearly understand your > above points. > > But let me try and explain what I think you meant :) > > We first call "get_usable_memory_ranges(&umem)" which updates the usable > memory ranges in "umem". We then call "update_usable_mem_fdt(fdt, umem)" > which updates the FDT for the kdump kernel's fdt to inform about these > usable memory ranges to the kdump kernel. > > Now since your patch only does that in get_usable_memory_range(), this > extra CMA reservation is mainly only useful when the kdump load happens > via kexec_file_load(), (because get_usable_memory_range() only gets > called from kexec_file_load() path) > > Is this what you meant here? Yeah, for kexec_file_load, the FDT for the kdump kernel is prepared in the Linux kernel (using the functions you mentioned), whereas for kexec_load, it is prepared in the kexec tool (userspace). Hence, these changes are not sufficient to support this feature with the kexec_load syscall. The kexec tool must be updated to ensure that the FDT is prepared in a way that marks the crashkernel CMA reservation as usable in the kdump FDT for the kexec_load system call. 
Anyway, it makes more sense to say that crashkernel=xM,cma support is available on ppc rather than ppc64, since restricting crashkernel CMA reservation to KEXEC_FILE does not help. The details are explained below. > > >>> I think I should update that in the commit message. >>> >>> Also do you think is it good to restrict this feature to KEXEC_FILE? >> Putting this under KEXEC_FILE may not help much because KEXEC_FILE is >> enabled >> by default in most configurations. Once it is enabled, the CMA >> reservation will >> happen regardless of which system call is used to load the kdump kernel >> (kexec_load or kexec_file_load). >> > What I understood from the feature was that, on the normal production > kernel this feature crashkernel=xM,cma allows to reserve an extra xMB of > memory as a CMA region for kdump kernel's memory allocations. But this > CMA reservation would happen in the normal kernel itself during > setup_arch() -> kdump_cma_reserve().. > > And this CMA reservation happens irrespective of whether the kdump > kernel will get loaded via whichever system call. Yeah that's right. > >> However, not restricting this feature to KEXEC_FILE will allow the kexec >> tool to >> independently add support for this feature in the future for the kexec_load >> system call. > Sure. > >> With that logic, I think if we do not restrict this feature to >> KEXEC_FILE, the support >> will be available for ppc and not limited to ppc64. >> > Yes, that make sense. > > If one doesn't want to make the CMA reservation, then we need not pass > the extra cmdline argument and no reservation will be made. So, no need > to restrict this to PPC64 by making it available only for KEXEC_FILE. Agree. 
From sourabhjain at linux.ibm.com Tue Nov 4 05:28:18 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Tue, 4 Nov 2025 18:58:18 +0530 Subject: [PATCH v6] powerpc/kdump: Add support for crashkernel CMA reservation Message-ID: <20251104132818.1724562-1-sourabhjain@linux.ibm.com> Commit 35c18f2933c5 ("Add a new optional ",cma" suffix to the crashkernel= command line option") and commit ab475510e042 ("kdump: implement reserve_crashkernel_cma") added CMA support for kdump crashkernel reservation. Extend crashkernel CMA reservation support to powerpc. The following changes are made to enable CMA reservation on powerpc: - Parse and obtain the CMA reservation size along with other crashkernel parameters - Call reserve_crashkernel_cma() to allocate the CMA region for kdump - Include the CMA-reserved ranges in the usable memory ranges for the kdump kernel to use. - Exclude the CMA-reserved ranges from the crash kernel memory to prevent them from being exported through /proc/vmcore. With the introduction of the CMA crashkernel regions, crash_exclude_mem_range() needs to be called multiple times to exclude both crashk_res and crashk_cma_ranges from the crash memory ranges. To avoid repetitive logic for validating mem_ranges size and handling reallocation when required, this functionality is moved to a new wrapper function crash_exclude_mem_range_guarded(). To ensure proper CMA reservation, reserve_crashkernel_cma() is called after pageblock_order is initialized. Update kernel-parameters.txt to document CMA support for crashkernel on powerpc architecture. 
Cc: Baoquan he Cc: Jiri Bohac Cc: Hari Bathini Cc: Madhavan Srinivasan Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- v3 -> v4 - Removed repeated initialization to tmem in crash_exclude_mem_range_guarded() - Call crash_exclude_mem_range() with right crashk ranges v4 -> v5: - Document CMA-based crashkernel support for ppc64 in kernel-parameters.txt v5 -> v6 - Change variable name, cma_size -> crashk_cma_size - Update support for this feature to ppc instead of ppc64 --- .../admin-guide/kernel-parameters.txt | 2 +- arch/powerpc/include/asm/kexec.h | 2 + arch/powerpc/kernel/setup-common.c | 4 +- arch/powerpc/kexec/core.c | 10 ++++- arch/powerpc/kexec/ranges.c | 43 ++++++++++++++----- 5 files changed, 47 insertions(+), 14 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 6c42061ca20e..1c10190d583d 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1013,7 +1013,7 @@ It will be ignored when crashkernel=X,high is not used or memory reserved is below 4G. crashkernel=size[KMG],cma - [KNL, X86] Reserve additional crash kernel memory from + [KNL, X86, ppc] Reserve additional crash kernel memory from CMA. This reservation is usable by the first system's userspace memory and kernel movable allocations (memory balloon, zswap). 
Pages allocated from this memory range diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index 4bbf9f699aaa..bd4a6c42a5f3 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -115,9 +115,11 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt, struct crash_mem #ifdef CONFIG_CRASH_RESERVE int __init overlaps_crashkernel(unsigned long start, unsigned long size); extern void arch_reserve_crashkernel(void); +extern void kdump_cma_reserve(void); #else static inline void arch_reserve_crashkernel(void) {} static inline int overlaps_crashkernel(unsigned long start, unsigned long size) { return 0; } +static inline void kdump_cma_reserve(void) { } #endif #if defined(CONFIG_CRASH_DUMP) diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index 68d47c53876c..c8c42b419742 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -995,11 +996,12 @@ void __init setup_arch(char **cmdline_p) initmem_init(); /* - * Reserve large chunks of memory for use by CMA for fadump, KVM and + * Reserve large chunks of memory for use by CMA for kdump, fadump, KVM and * hugetlb. These must be called after initmem_init(), so that * pageblock_order is initialised. 
*/ fadump_cma_init(); + kdump_cma_reserve(); kvm_cma_reserve(); gigantic_hugetlb_cma_reserve(); diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c index d1a2d755381c..d0b8d6300f84 100644 --- a/arch/powerpc/kexec/core.c +++ b/arch/powerpc/kexec/core.c @@ -59,6 +59,8 @@ void machine_kexec(struct kimage *image) #ifdef CONFIG_CRASH_RESERVE +unsigned long long crashk_cma_size; + static unsigned long long __init get_crash_base(unsigned long long crash_base) { @@ -110,7 +112,7 @@ void __init arch_reserve_crashkernel(void) /* use common parsing */ ret = parse_crashkernel(boot_command_line, total_mem_sz, &crash_size, - &crash_base, NULL, NULL, NULL); + &crash_base, NULL, &crashk_cma_size, NULL); if (ret) return; @@ -130,6 +132,12 @@ void __init arch_reserve_crashkernel(void) reserve_crashkernel_generic(crash_size, crash_base, 0, false); } +void __init kdump_cma_reserve(void) +{ + if (crashk_cma_size) + reserve_crashkernel_cma(crashk_cma_size); +} + int __init overlaps_crashkernel(unsigned long start, unsigned long size) { return (start + size) > crashk_res.start && start <= crashk_res.end; diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c index 3702b0bdab14..3bd27c38726b 100644 --- a/arch/powerpc/kexec/ranges.c +++ b/arch/powerpc/kexec/ranges.c @@ -515,7 +515,7 @@ int get_exclude_memory_ranges(struct crash_mem **mem_ranges) */ int get_usable_memory_ranges(struct crash_mem **mem_ranges) { - int ret; + int ret, i; /* * Early boot failure observed on guests when low memory (first memory @@ -528,6 +528,13 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) if (ret) goto out; + for (i = 0; i < crashk_cma_cnt; i++) { + ret = add_mem_range(mem_ranges, crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end - crashk_cma_ranges[i].start + 1); + if (ret) + goto out; + } + ret = add_rtas_mem_range(mem_ranges); if (ret) goto out; @@ -546,6 +553,22 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) #endif /* CONFIG_KEXEC_FILE */ 
#ifdef CONFIG_CRASH_DUMP +static int crash_exclude_mem_range_guarded(struct crash_mem **mem_ranges, + unsigned long long mstart, + unsigned long long mend) +{ + struct crash_mem *tmem = *mem_ranges; + + /* Reallocate memory ranges if there is no space to split ranges */ + if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) { + tmem = realloc_mem_ranges(mem_ranges); + if (!tmem) + return -ENOMEM; + } + + return crash_exclude_mem_range(tmem, mstart, mend); +} + /** * get_crash_memory_ranges - Get crash memory ranges. This list includes * first/crashing kernel's memory regions that @@ -557,7 +580,6 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) int get_crash_memory_ranges(struct crash_mem **mem_ranges) { phys_addr_t base, end; - struct crash_mem *tmem; u64 i; int ret; @@ -582,19 +604,18 @@ int get_crash_memory_ranges(struct crash_mem **mem_ranges) sort_memory_ranges(*mem_ranges, true); } - /* Reallocate memory ranges if there is no space to split ranges */ - tmem = *mem_ranges; - if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) { - tmem = realloc_mem_ranges(mem_ranges); - if (!tmem) - goto out; - } - /* Exclude crashkernel region */ - ret = crash_exclude_mem_range(tmem, crashk_res.start, crashk_res.end); + ret = crash_exclude_mem_range_guarded(mem_ranges, crashk_res.start, crashk_res.end); if (ret) goto out; + for (i = 0; i < crashk_cma_cnt; ++i) { + ret = crash_exclude_mem_range_guarded(mem_ranges, crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end); + if (ret) + goto out; + } + /* * FIXME: For now, stay in parity with kexec-tools but if RTAS/OPAL * regions are exported to save their context at the time of -- 2.51.0 From rppt at kernel.org Tue Nov 4 06:31:54 2025 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 4 Nov 2025 16:31:54 +0200 Subject: [PATCH 1/2] kho: fix unpreservation of higher-order vmalloc preservations In-Reply-To: <20251103180235.71409-2-pratyush@kernel.org> References: <20251103180235.71409-1-pratyush@kernel.org> 
<20251103180235.71409-2-pratyush@kernel.org> Message-ID: On Mon, Nov 03, 2025 at 07:02:31PM +0100, Pratyush Yadav wrote: > kho_vmalloc_unpreserve_chunk() calls __kho_unpreserve() with end_pfn as > pfn + 1. This happens to work for 0-order pages, but leaks higher order > pages. > > For example, say order 2 pages back the allocation. During preservation, > they get preserved in the order 2 bitmaps, but > kho_vmalloc_unpreserve_chunk() would try to unpreserve them from the > order 0 bitmaps, which should not have these bits set anyway, leaving > the order 2 bitmaps untouched. This results in the pages being carried > over to the next kernel. Nothing will free those pages in the next boot, > leaking them. > > Fix this by taking the order into account when calculating the end PFN > for __kho_unpreserve(). > > Fixes: a667300bd53f2 ("kho: add support for preserving vmalloc allocations") > Signed-off-by: Pratyush Yadav Reviewed-by: Mike Rapoport (Microsoft) > --- > > Notes: > When Pasha's patch [0] to add kho_unpreserve_pages() is merged, maybe it > would be a better idea to use kho_unpreserve_pages() here? But that is > something for later I suppose. 
> > [0] https://lore.kernel.org/linux-mm/20251101142325.1326536-4-pasha.tatashin at soleen.com/ > > kernel/kexec_handover.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c > index cc5aaa738bc50..c2bcbb10918ce 100644 > --- a/kernel/kexec_handover.c > +++ b/kernel/kexec_handover.c > @@ -862,7 +862,8 @@ static struct kho_vmalloc_chunk *new_vmalloc_chunk(struct kho_vmalloc_chunk *cur > return NULL; > } > > -static void kho_vmalloc_unpreserve_chunk(struct kho_vmalloc_chunk *chunk) > +static void kho_vmalloc_unpreserve_chunk(struct kho_vmalloc_chunk *chunk, > + unsigned short order) > { > struct kho_mem_track *track = &kho_out.ser.track; > unsigned long pfn = PHYS_PFN(virt_to_phys(chunk)); > @@ -871,7 +872,7 @@ static void kho_vmalloc_unpreserve_chunk(struct kho_vmalloc_chunk *chunk) > > for (int i = 0; i < ARRAY_SIZE(chunk->phys) && chunk->phys[i]; i++) { > pfn = PHYS_PFN(chunk->phys[i]); > - __kho_unpreserve(track, pfn, pfn + 1); > + __kho_unpreserve(track, pfn, pfn + (1 << order)); > } > } > > @@ -882,7 +883,7 @@ static void kho_vmalloc_free_chunks(struct kho_vmalloc *kho_vmalloc) > while (chunk) { > struct kho_vmalloc_chunk *tmp = chunk; > > - kho_vmalloc_unpreserve_chunk(chunk); > + kho_vmalloc_unpreserve_chunk(chunk, kho_vmalloc->order); > > chunk = KHOSER_LOAD_PTR(chunk->hdr.next); > free_page((unsigned long)tmp); > -- > 2.47.3 > -- Sincerely yours, Mike. From rppt at kernel.org Tue Nov 4 06:32:42 2025 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 4 Nov 2025 16:32:42 +0200 Subject: [PATCH 2/2] kho: warn and exit when unpreserved page wasn't preserved In-Reply-To: <20251103180235.71409-3-pratyush@kernel.org> References: <20251103180235.71409-1-pratyush@kernel.org> <20251103180235.71409-3-pratyush@kernel.org> Message-ID: On Mon, Nov 03, 2025 at 07:02:32PM +0100, Pratyush Yadav wrote: > Calling __kho_unpreserve() on a pair of (pfn, end_pfn) that wasn't > preserved is a bug. 
Currently, if that is done, the physxa or bits can > be NULL. This results in a soft lockup since a NULL physxa or bits > results in redoing the loop without ever making any progress. > > Return when physxa or bits are not found, but WARN first to loudly > indicate invalid behaviour. > > Fixes: fc33e4b44b271 ("kexec: enable KHO support for memory preservation") > Signed-off-by: Pratyush Yadav Reviewed-by: Mike Rapoport (Microsoft) > --- > kernel/kexec_handover.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c > index c2bcbb10918ce..e5fd833726226 100644 > --- a/kernel/kexec_handover.c > +++ b/kernel/kexec_handover.c > @@ -167,12 +167,12 @@ static void __kho_unpreserve(struct kho_mem_track *track, unsigned long pfn, > const unsigned long pfn_high = pfn >> order; > > physxa = xa_load(&track->orders, order); > - if (!physxa) > - continue; > + if (WARN_ON_ONCE(!physxa)) > + return; > > bits = xa_load(&physxa->phys_bits, pfn_high / PRESERVE_BITS); > - if (!bits) > - continue; > + if (WARN_ON_ONCE(!bits)) > + return; > > clear_bit(pfn_high % PRESERVE_BITS, bits->preserve); > > -- > 2.47.3 > -- Sincerely yours, Mike. From lbulwahn at redhat.com Tue Nov 4 06:32:38 2025 From: lbulwahn at redhat.com (Lukas Bulwahn) Date: Tue, 4 Nov 2025 15:32:38 +0100 Subject: [PATCH] MAINTAINERS: extend file entry in KHO to include subdirectories Message-ID: <20251104143238.119803-1-lukas.bulwahn@redhat.com> From: Lukas Bulwahn Commit 3498209ff64e ("Documentation: add documentation for KHO") adds the file entry for 'Documentation/core-api/kho/*'. The asterisk in the end means that all files in kho are included, but not files in its subdirectories below. Hence, the files under Documentation/core-api/kho/bindings/ are not considered part of KHO, and get_maintainers.pl does not necessarily add the KHO maintainers to the recipients of patches to those files. 
Probably, this is not intended, though, and it was simply an oversight of the detailed semantics of such file entries. Make the file entry to include the subdirectories of Documentation/core-api/kho/. Signed-off-by: Lukas Bulwahn --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 06ff926c5331..499b52d7793f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13836,7 +13836,7 @@ L: kexec at lists.infradead.org L: linux-mm at kvack.org S: Maintained F: Documentation/admin-guide/mm/kho.rst -F: Documentation/core-api/kho/* +F: Documentation/core-api/kho/ F: include/linux/kexec_handover.h F: kernel/kexec_handover.c F: tools/testing/selftests/kho/ -- 2.51.1 From helgaas at kernel.org Tue Nov 4 07:19:36 2025 From: helgaas at kernel.org (Bjorn Helgaas) Date: Tue, 4 Nov 2025 09:19:36 -0600 Subject: [PATCH] MAINTAINERS: extend file entry in KHO to include subdirectories In-Reply-To: <20251104143238.119803-1-lukas.bulwahn@redhat.com> Message-ID: <20251104151936.GA1857569@bhelgaas> On Tue, Nov 04, 2025 at 03:32:38PM +0100, Lukas Bulwahn wrote: > From: Lukas Bulwahn > > Commit 3498209ff64e ("Documentation: add documentation for KHO") adds the > file entry for 'Documentation/core-api/kho/*'. The asterisk in the end > means that all files in kho are included, but not files in its > subdirectories below. Add blank line between paragraphs as you did below. > Hence, the files under Documentation/core-api/kho/bindings/ are not > considered part of KHO, and get_maintainers.pl does not necessarily add the > KHO maintainers to the recipients of patches to those files. Probably, this > is not intended, though, and it was simply an oversight of the detailed > semantics of such file entries. > > Make the file entry to include the subdirectories of > Documentation/core-api/kho/. 
> > Signed-off-by: Lukas Bulwahn > --- > MAINTAINERS | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/MAINTAINERS b/MAINTAINERS > index 06ff926c5331..499b52d7793f 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -13836,7 +13836,7 @@ L: kexec at lists.infradead.org > L: linux-mm at kvack.org > S: Maintained > F: Documentation/admin-guide/mm/kho.rst > -F: Documentation/core-api/kho/* > +F: Documentation/core-api/kho/ > F: include/linux/kexec_handover.h > F: kernel/kexec_handover.c > F: tools/testing/selftests/kho/ > -- > 2.51.1 > From bhe at redhat.com Tue Nov 4 19:01:02 2025 From: bhe at redhat.com (Baoquan He) Date: Wed, 5 Nov 2025 11:01:02 +0800 Subject: [PATCH v2 3/4] kexec: print out debugging message if required for kexec_load In-Reply-To: <20251103063440.1681657-4-maqianga@uniontech.com> References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-4-maqianga@uniontech.com> Message-ID: On 11/03/25 at 02:34pm, Qiang Ma wrote: > The commit a85ee18c7900 ("kexec_file: print out debugging message > if required") has added general code printing in kexec_file_load(), > but not in kexec_load(). > > Especially in the RISC-V architecture, kexec_image_info() has been > removed(commit eb7622d908a0 ("kexec_file, riscv: print out debugging > message if required")). As a result, when using '-d' for the kexec_load > interface, print nothing in the kernel space. This might be helpful for > verifying the accuracy of the data passed to the kernel. Therefore, > refer to this commit a85ee18c7900 ("kexec_file: print out debugging > message if required"), debug print information has been added. 
> > Signed-off-by: Qiang Ma > Reported-by: kernel test robot > Closes: https://lore.kernel.org/oe-kbuild-all/202510310332.6XrLe70K-lkp at intel.com/ > --- > kernel/kexec.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > diff --git a/kernel/kexec.c b/kernel/kexec.c > index c7a869d32f87..9b433b972cc1 100644 > --- a/kernel/kexec.c > +++ b/kernel/kexec.c > @@ -154,7 +154,15 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, > if (ret) > goto out; > > + kexec_dprintk("nr_segments = %lu\n", nr_segments); > for (i = 0; i < nr_segments; i++) { > + struct kexec_segment *ksegment; > + > + ksegment = &image->segment[i]; > + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", > + i, ksegment->buf, ksegment->bufsz, ksegment->mem, > + ksegment->memsz); There has already been a print_segments() in kexec-tools/kexec/kexec.c, you will get duplicated printing. That sounds not good. Have you tested this? > + > ret = kimage_load_segment(image, i); > if (ret) > goto out; > @@ -166,6 +174,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, > if (ret) > goto out; > > + kexec_dprintk("kexec_load: type:%u, start:0x%lx head:0x%lx flags:0x%lx\n", > + image->type, image->start, image->head, flags); > + > /* Install the new kernel and uninstall the old */ > image = xchg(dest_image, image); > > -- > 2.20.1 > From bhe at redhat.com Tue Nov 4 19:05:44 2025 From: bhe at redhat.com (Baoquan He) Date: Wed, 5 Nov 2025 11:05:44 +0800 Subject: [PATCH v2 4/4] kexec_file: Fix the issue of mismatch between loop variable types In-Reply-To: <20251103063440.1681657-5-maqianga@uniontech.com> References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-5-maqianga@uniontech.com> Message-ID: On 11/03/25 at 02:34pm, Qiang Ma wrote: > The type of the struct kimage member variable nr_segments is unsigned long. > Correct the loop variable i and the print format specifier type. 
I can't see what's meaningful with this change. nr_segments is unsigned long, but it's the range 'i' will loop. If so, we need change all for loop of the int iterator. > > Signed-off-by: Qiang Ma > --- > kernel/kexec_file.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > index 4a24aadbad02..7afdaa0efc50 100644 > --- a/kernel/kexec_file.c > +++ b/kernel/kexec_file.c > @@ -366,7 +366,8 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, > int image_type = (flags & KEXEC_FILE_ON_CRASH) ? > KEXEC_TYPE_CRASH : KEXEC_TYPE_DEFAULT; > struct kimage **dest_image, *image; > - int ret = 0, i; > + int ret = 0; > + unsigned long i; > > /* We only trust the superuser with rebooting the system. */ > if (!kexec_load_permitted(image_type)) > @@ -432,7 +433,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, > struct kexec_segment *ksegment; > > ksegment = &image->segment[i]; > - kexec_dprintk("segment[%d]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", > + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", > i, ksegment->buf, ksegment->bufsz, ksegment->mem, > ksegment->memsz); > > -- > 2.20.1 > From bhe at redhat.com Tue Nov 4 19:09:13 2025 From: bhe at redhat.com (Baoquan He) Date: Wed, 5 Nov 2025 11:09:13 +0800 Subject: [PATCH v2 2/4] kexec: add kexec_core flag to control debug printing In-Reply-To: <20251103063440.1681657-3-maqianga@uniontech.com> References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-3-maqianga@uniontech.com> Message-ID: On 11/03/25 at 02:34pm, Qiang Ma wrote: > The commit a85ee18c7900 ("kexec_file: print out debugging message > if required") has added general code printing in kexec_file_load(), > but not in kexec_load(). > > Since kexec_load and kexec_file_load are not triggered > simultaneously, we can unify the debug flag of kexec and kexec_file > as kexec_core_dbg_print. 
After reconsidering this, I regret calling it kexec_core_dbg_print. That sounds a printing only happening in kexec_core. Maybe kexec_dbg_print is better. Because here kexec refers to a generic concept, but not limited to kexec_load interface only. Just my personal thinking. Other than the naming, the whole patch looks good to me. Thanks. > > Next, we need to do four things: > > 1. rename kexec_file_dbg_print to kexec_core_dbg_print > 2. Add KEXEC_DEBUG > 3. Initialize kexec_core_dbg_print for kexec > 4. Set the reset of kexec_file_dbg_print to kimage_free > > Signed-off-by: Qiang Ma > --- > include/linux/kexec.h | 9 +++++---- > include/uapi/linux/kexec.h | 1 + > kernel/kexec.c | 1 + > kernel/kexec_core.c | 4 +++- > kernel/kexec_file.c | 4 +--- > 5 files changed, 11 insertions(+), 8 deletions(-) > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > index ff7e231b0485..cad8b5c362af 100644 > --- a/include/linux/kexec.h > +++ b/include/linux/kexec.h > @@ -455,10 +455,11 @@ bool kexec_load_permitted(int kexec_image_type); > > /* List of defined/legal kexec flags */ > #ifndef CONFIG_KEXEC_JUMP > -#define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_UPDATE_ELFCOREHDR | KEXEC_CRASH_HOTPLUG_SUPPORT) > +#define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_UPDATE_ELFCOREHDR | KEXEC_CRASH_HOTPLUG_SUPPORT | \ > + KEXEC_DEBUG) > #else > #define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT | KEXEC_UPDATE_ELFCOREHDR | \ > - KEXEC_CRASH_HOTPLUG_SUPPORT) > + KEXEC_CRASH_HOTPLUG_SUPPORT | KEXEC_DEBUG) > #endif > > /* List of defined/legal kexec file flags */ > @@ -525,10 +526,10 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g > static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) { } > #endif > > -extern bool kexec_file_dbg_print; > +extern bool kexec_core_dbg_print; > > #define kexec_dprintk(fmt, arg...) 
\ > - do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0) > + do { if (kexec_core_dbg_print) pr_info(fmt, ##arg); } while (0) > > extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size); > extern void kimage_unmap_segment(void *buffer); > diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h > index 55749cb0b81d..819c600af125 100644 > --- a/include/uapi/linux/kexec.h > +++ b/include/uapi/linux/kexec.h > @@ -14,6 +14,7 @@ > #define KEXEC_PRESERVE_CONTEXT 0x00000002 > #define KEXEC_UPDATE_ELFCOREHDR 0x00000004 > #define KEXEC_CRASH_HOTPLUG_SUPPORT 0x00000008 > +#define KEXEC_DEBUG 0x00000010 > #define KEXEC_ARCH_MASK 0xffff0000 > > /* > diff --git a/kernel/kexec.c b/kernel/kexec.c > index 9bb1f2b6b268..c7a869d32f87 100644 > --- a/kernel/kexec.c > +++ b/kernel/kexec.c > @@ -42,6 +42,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry, > if (!image) > return -ENOMEM; > > + kexec_core_dbg_print = !!(flags & KEXEC_DEBUG); > image->start = entry; > image->nr_segments = nr_segments; > memcpy(image->segment, segments, nr_segments * sizeof(*segments)); > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > index fa00b239c5d9..865f2b14f23b 100644 > --- a/kernel/kexec_core.c > +++ b/kernel/kexec_core.c > @@ -53,7 +53,7 @@ atomic_t __kexec_lock = ATOMIC_INIT(0); > /* Flag to indicate we are going to kexec a new kernel */ > bool kexec_in_progress = false; > > -bool kexec_file_dbg_print; > +bool kexec_core_dbg_print; > > /* > * When kexec transitions to the new kernel there is a one-to-one > @@ -576,6 +576,8 @@ void kimage_free(struct kimage *image) > kimage_entry_t *ptr, entry; > kimage_entry_t ind = 0; > > + kexec_core_dbg_print = false; > + > if (!image) > return; > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > index eb62a9794242..4a24aadbad02 100644 > --- a/kernel/kexec_file.c > +++ b/kernel/kexec_file.c > @@ -138,8 +138,6 @@ void kimage_file_post_load_cleanup(struct 
kimage *image) > */ > kfree(image->image_loader_data); > image->image_loader_data = NULL; > - > - kexec_file_dbg_print = false; > } > > #ifdef CONFIG_KEXEC_SIG > @@ -314,7 +312,7 @@ kimage_file_alloc_init(struct kimage **rimage, int kernel_fd, > if (!image) > return -ENOMEM; > > - kexec_file_dbg_print = !!(flags & KEXEC_FILE_DEBUG); > + kexec_core_dbg_print = !!(flags & KEXEC_FILE_DEBUG); > image->file_mode = 1; > > #ifdef CONFIG_CRASH_DUMP > -- > 2.20.1 > From bhe at redhat.com Tue Nov 4 19:15:57 2025 From: bhe at redhat.com (Baoquan he) Date: Wed, 5 Nov 2025 11:15:57 +0800 Subject: [PATCH 0/2] Export kdump crashkernel CMA ranges In-Reply-To: <20251103035859.1267318-1-sourabhjain@linux.ibm.com> References: <20251103035859.1267318-1-sourabhjain@linux.ibm.com> Message-ID: On 11/03/25 at 09:28am, Sourabh Jain wrote: > /sys/kernel/kexec_crash_cma_ranges to export all CMA regions reserved > for the crashkernel to user-space. This enables user-space tools > configuring kdump to determine the amount of memory reserved for the > crashkernel. When CMA is used for crashkernel allocation, tools can use > this information to warn users that attempting to capture user pages > while CMA reservation is active may lead to unreliable or incomplete > dump capture. > > While adding documentation for the new sysfs interface, I realized that > there was no ABI document for the existing kexec and kdump sysfs > interfaces, so I added one. > > The first patch adds the ABI documentation for the existing kexec and > kdump sysfs interfaces, and the second patch adds the > /sys/kernel/kexec_crash_cma_ranges sysfs interface along with its > corresponding ABI documentation. > > *Seeking opinions* > There are already four kexec/kdump sysfs entries under /sys/kernel/, > and this patch series adds one more. Should we consider moving them to > a separate directory, such as /sys/kernel/kexec, to avoid polluting > /sys/kernel/? 
For backward compatibility, we can create symlinks at > the old locations for sometime and remove them in the future. That sounds a good idea, will you do it in v2? Because otherwise the kexec_crash_cma_ranges need be moved too. > > Cc: Andrew Morton > Cc: Baoquan he > Cc: Jiri Bohac > Cc: Shivang Upadhyay > Cc: linuxppc-dev at lists.ozlabs.org > Cc: kexec at lists.infradead.org > > Sourabh Jain (2): > Documentation/ABI: add kexec and kdump sysfs interface > crash: export crashkernel CMA reservation to userspace > > .../ABI/testing/sysfs-kernel-kexec-kdump | 53 +++++++++++++++++++ > kernel/ksysfs.c | 17 ++++++ > 2 files changed, 70 insertions(+) > create mode 100644 Documentation/ABI/testing/sysfs-kernel-kexec-kdump > > -- > 2.51.0 > From sourabhjain at linux.ibm.com Tue Nov 4 19:33:43 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Wed, 5 Nov 2025 09:03:43 +0530 Subject: [PATCH 0/2] Export kdump crashkernel CMA ranges In-Reply-To: References: <20251103035859.1267318-1-sourabhjain@linux.ibm.com> Message-ID: On 05/11/25 08:45, Baoquan he wrote: > On 11/03/25 at 09:28am, Sourabh Jain wrote: >> /sys/kernel/kexec_crash_cma_ranges to export all CMA regions reserved >> for the crashkernel to user-space. This enables user-space tools >> configuring kdump to determine the amount of memory reserved for the >> crashkernel. When CMA is used for crashkernel allocation, tools can use >> this information to warn users that attempting to capture user pages >> while CMA reservation is active may lead to unreliable or incomplete >> dump capture. >> >> While adding documentation for the new sysfs interface, I realized that >> there was no ABI document for the existing kexec and kdump sysfs >> interfaces, so I added one. >> >> The first patch adds the ABI documentation for the existing kexec and >> kdump sysfs interfaces, and the second patch adds the >> /sys/kernel/kexec_crash_cma_ranges sysfs interface along with its >> corresponding ABI documentation. 
>> >> *Seeking opinions* >> There are already four kexec/kdump sysfs entries under /sys/kernel/, >> and this patch series adds one more. Should we consider moving them to >> a separate directory, such as /sys/kernel/kexec, to avoid polluting >> /sys/kernel/? For backward compatibility, we can create symlinks at >> the old locations for sometime and remove them in the future. > That sounds a good idea, will you do it in v2? Because otherwise the > kexec_crash_cma_ranges need be moved too. Yes I will include it in v2. Thanks, Sourabh Jain From maqianga at uniontech.com Tue Nov 4 19:41:09 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Wed, 5 Nov 2025 11:41:09 +0800 Subject: [PATCH v2 3/4] kexec: print out debugging message if required for kexec_load In-Reply-To: References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-4-maqianga@uniontech.com> Message-ID: <5FC4A8D79744B238+97288be4-6c1a-4c0d-ae7d-be2029ec87f3@uniontech.com> On 2025/11/5 11:01, Baoquan He wrote: > On 11/03/25 at 02:34pm, Qiang Ma wrote: >> The commit a85ee18c7900 ("kexec_file: print out debugging message >> if required") has added general code printing in kexec_file_load(), >> but not in kexec_load(). >> >> Especially in the RISC-V architecture, kexec_image_info() has been >> removed(commit eb7622d908a0 ("kexec_file, riscv: print out debugging >> message if required")). As a result, when using '-d' for the kexec_load >> interface, print nothing in the kernel space. This might be helpful for >> verifying the accuracy of the data passed to the kernel. Therefore, >> refer to this commit a85ee18c7900 ("kexec_file: print out debugging >> message if required"), debug print information has been added.
>> >> Signed-off-by: Qiang Ma >> Reported-by: kernel test robot >> Closes: https://lore.kernel.org/oe-kbuild-all/202510310332.6XrLe70K-lkp at intel.com/ >> --- >> kernel/kexec.c | 11 +++++++++++ >> 1 file changed, 11 insertions(+) >> >> diff --git a/kernel/kexec.c b/kernel/kexec.c >> index c7a869d32f87..9b433b972cc1 100644 >> --- a/kernel/kexec.c >> +++ b/kernel/kexec.c >> @@ -154,7 +154,15 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, >> if (ret) >> goto out; >> >> + kexec_dprintk("nr_segments = %lu\n", nr_segments); >> for (i = 0; i < nr_segments; i++) { >> + struct kexec_segment *ksegment; >> + >> + ksegment = &image->segment[i]; >> + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", >> + i, ksegment->buf, ksegment->bufsz, ksegment->mem, >> + ksegment->memsz); > There has already been a print_segments() in kexec-tools/kexec/kexec.c, > you will get duplicated printing. That sounds not good. Have you tested > this? I have tested it, kexec-tools is the debug message printed in user space, while kexec_dprintk is printed in kernel space. This might be helpful for verifying the accuracy of the data passed to the kernel. 
>> + >> ret = kimage_load_segment(image, i); >> if (ret) >> goto out; >> @@ -166,6 +174,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, >> if (ret) >> goto out; >> >> + kexec_dprintk("kexec_load: type:%u, start:0x%lx head:0x%lx flags:0x%lx\n", >> + image->type, image->start, image->head, flags); >> + >> /* Install the new kernel and uninstall the old */ >> image = xchg(dest_image, image); >> >> -- >> 2.20.1 >> > From maqianga at uniontech.com Tue Nov 4 19:47:44 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Wed, 5 Nov 2025 11:47:44 +0800 Subject: [PATCH v2 4/4] kexec_file: Fix the issue of mismatch between loop variable types In-Reply-To: References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-5-maqianga@uniontech.com> Message-ID: <0C92443D3E2100AF+c669d240-1ee8-4897-a30d-3efefe161085@uniontech.com> On 2025/11/5 11:05, Baoquan He wrote: > On 11/03/25 at 02:34pm, Qiang Ma wrote: >> The type of the struct kimage member variable nr_segments is unsigned long. >> Correct the loop variable i and the print format specifier type. > I can't see what's meaningful with this change. nr_segments is unsigned > long, but it's the range 'i' will loop. If so, we need change all for > loop of the int iterator. If image->nr_segments is large enough, 'i' overflow causes an infinite loop. >> Signed-off-by: Qiang Ma >> --- >> kernel/kexec_file.c | 5 +++-- >> 1 file changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c >> index 4a24aadbad02..7afdaa0efc50 100644 >> --- a/kernel/kexec_file.c >> +++ b/kernel/kexec_file.c >> @@ -366,7 +366,8 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, >> int image_type = (flags & KEXEC_FILE_ON_CRASH) ? >> KEXEC_TYPE_CRASH : KEXEC_TYPE_DEFAULT; >> struct kimage **dest_image, *image; >> - int ret = 0, i; >> + int ret = 0; >> + unsigned long i; >> >> /* We only trust the superuser with rebooting the system. 
*/ >> if (!kexec_load_permitted(image_type)) >> @@ -432,7 +433,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, >> struct kexec_segment *ksegment; >> >> ksegment = &image->segment[i]; >> - kexec_dprintk("segment[%d]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", >> + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", >> i, ksegment->buf, ksegment->bufsz, ksegment->mem, >> ksegment->memsz); >> >> -- >> 2.20.1 >> > From maqianga at uniontech.com Tue Nov 4 20:31:00 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Wed, 5 Nov 2025 12:31:00 +0800 Subject: [PATCH v2 4/4] kexec_file: Fix the issue of mismatch between loop variable types In-Reply-To: References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-5-maqianga@uniontech.com> Message-ID: On 2025/11/5 11:47, Qiang Ma wrote: > > On 2025/11/5 11:05, Baoquan He wrote: >> On 11/03/25 at 02:34pm, Qiang Ma wrote: >>> The type of the struct kimage member variable nr_segments is >>> unsigned long. >>> Correct the loop variable i and the print format specifier type. >> I can't see what's meaningful with this change. nr_segments is unsigned >> long, but it's the range 'i' will loop. If so, we need change all for >> loop of the int iterator. > If image->nr_segments is large enough, 'i' overflow causes an infinite > loop. Meanwhile, the do_kexec_load() was checked and also defined as 'unsigned long i'. >>> Signed-off-by: Qiang Ma >>> --- >>> kernel/kexec_file.c | 5 +++-- >>> 1 file changed, 3 insertions(+), 2 deletions(-) >>> >>> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c >>> index 4a24aadbad02..7afdaa0efc50 100644 >>> --- a/kernel/kexec_file.c >>> +++ b/kernel/kexec_file.c >>> @@ -366,7 +366,8 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, >>> int, initrd_fd, >>> int image_type = (flags & KEXEC_FILE_ON_CRASH) ? >>> KEXEC_TYPE_CRASH : KEXEC_TYPE_DEFAULT; >>> struct kimage **dest_image, *image; >>> - 
int ret = 0, i; >>> + int ret = 0; >>> + unsigned long i; >>> /* We only trust the superuser with rebooting the system. */ >>> if (!kexec_load_permitted(image_type)) >>> @@ -432,7 +433,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, >>> int, initrd_fd, >>> struct kexec_segment *ksegment; >>> ksegment = &image->segment[i]; >>> - kexec_dprintk("segment[%d]: buf=0x%p bufsz=0x%zx mem=0x%lx >>> memsz=0x%zx\n", >>> + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx >>> memsz=0x%zx\n", >>> i, ksegment->buf, ksegment->bufsz, ksegment->mem, >>> ksegment->memsz); >>> -- >>> 2.20.1 >>> >> From maqianga at uniontech.com Tue Nov 4 20:32:42 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Wed, 5 Nov 2025 12:32:42 +0800 Subject: [PATCH v2 2/4] kexec: add kexec_core flag to control debug printing In-Reply-To: References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-3-maqianga@uniontech.com> Message-ID: On 2025/11/5 11:09, Baoquan He wrote: > On 11/03/25 at 02:34pm, Qiang Ma wrote: >> The commit a85ee18c7900 ("kexec_file: print out debugging message >> if required") has added general code printing in kexec_file_load(), >> but not in kexec_load(). >> >> Since kexec_load and kexec_file_load are not triggered >> simultaneously, we can unify the debug flag of kexec and kexec_file >> as kexec_core_dbg_print. > After reconsidering this, I regret calling it kexec_core_dbg_print. > That sounds a printing only happening in kexec_core. Maybe > kexec_dbg_print is better. Because here kexec refers to a generic > concept, but not limited to kexec_load interface only. Just my personal > thinking. This sounds reasonable. The next version will be renamed kexec_dbg_print. > > Other than the naming, the whole patch looks good to me. Thanks. > >> Next, we need to do four things: >> >> 1. rename kexec_file_dbg_print to kexec_core_dbg_print >> 2. 
Add KEXEC_DEBUG >> 3. Initialize kexec_core_dbg_print for kexec >> 4. Set the reset of kexec_file_dbg_print to kimage_free >> >> Signed-off-by: Qiang Ma >> --- >> include/linux/kexec.h | 9 +++++---- >> include/uapi/linux/kexec.h | 1 + >> kernel/kexec.c | 1 + >> kernel/kexec_core.c | 4 +++- >> kernel/kexec_file.c | 4 +--- >> 5 files changed, 11 insertions(+), 8 deletions(-) >> >> diff --git a/include/linux/kexec.h b/include/linux/kexec.h >> index ff7e231b0485..cad8b5c362af 100644 >> --- a/include/linux/kexec.h >> +++ b/include/linux/kexec.h >> @@ -455,10 +455,11 @@ bool kexec_load_permitted(int kexec_image_type); >> >> /* List of defined/legal kexec flags */ >> #ifndef CONFIG_KEXEC_JUMP >> -#define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_UPDATE_ELFCOREHDR | KEXEC_CRASH_HOTPLUG_SUPPORT) >> +#define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_UPDATE_ELFCOREHDR | KEXEC_CRASH_HOTPLUG_SUPPORT | \ >> + KEXEC_DEBUG) >> #else >> #define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT | KEXEC_UPDATE_ELFCOREHDR | \ >> - KEXEC_CRASH_HOTPLUG_SUPPORT) >> + KEXEC_CRASH_HOTPLUG_SUPPORT | KEXEC_DEBUG) >> #endif >> >> /* List of defined/legal kexec file flags */ >> @@ -525,10 +526,10 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g >> static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) { } >> #endif >> >> -extern bool kexec_file_dbg_print; >> +extern bool kexec_core_dbg_print; >> >> #define kexec_dprintk(fmt, arg...) 
\ >> - do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0) >> + do { if (kexec_core_dbg_print) pr_info(fmt, ##arg); } while (0) >> >> extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size); >> extern void kimage_unmap_segment(void *buffer); >> diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h >> index 55749cb0b81d..819c600af125 100644 >> --- a/include/uapi/linux/kexec.h >> +++ b/include/uapi/linux/kexec.h >> @@ -14,6 +14,7 @@ >> #define KEXEC_PRESERVE_CONTEXT 0x00000002 >> #define KEXEC_UPDATE_ELFCOREHDR 0x00000004 >> #define KEXEC_CRASH_HOTPLUG_SUPPORT 0x00000008 >> +#define KEXEC_DEBUG 0x00000010 >> #define KEXEC_ARCH_MASK 0xffff0000 >> >> /* >> diff --git a/kernel/kexec.c b/kernel/kexec.c >> index 9bb1f2b6b268..c7a869d32f87 100644 >> --- a/kernel/kexec.c >> +++ b/kernel/kexec.c >> @@ -42,6 +42,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry, >> if (!image) >> return -ENOMEM; >> >> + kexec_core_dbg_print = !!(flags & KEXEC_DEBUG); >> image->start = entry; >> image->nr_segments = nr_segments; >> memcpy(image->segment, segments, nr_segments * sizeof(*segments)); >> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c >> index fa00b239c5d9..865f2b14f23b 100644 >> --- a/kernel/kexec_core.c >> +++ b/kernel/kexec_core.c >> @@ -53,7 +53,7 @@ atomic_t __kexec_lock = ATOMIC_INIT(0); >> /* Flag to indicate we are going to kexec a new kernel */ >> bool kexec_in_progress = false; >> >> -bool kexec_file_dbg_print; >> +bool kexec_core_dbg_print; >> >> /* >> * When kexec transitions to the new kernel there is a one-to-one >> @@ -576,6 +576,8 @@ void kimage_free(struct kimage *image) >> kimage_entry_t *ptr, entry; >> kimage_entry_t ind = 0; >> >> + kexec_core_dbg_print = false; >> + >> if (!image) >> return; >> >> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c >> index eb62a9794242..4a24aadbad02 100644 >> --- a/kernel/kexec_file.c >> +++ b/kernel/kexec_file.c >> @@ 
-138,8 +138,6 @@ void kimage_file_post_load_cleanup(struct kimage *image) >> */ >> kfree(image->image_loader_data); >> image->image_loader_data = NULL; >> - >> - kexec_file_dbg_print = false; >> } >> >> #ifdef CONFIG_KEXEC_SIG >> @@ -314,7 +312,7 @@ kimage_file_alloc_init(struct kimage **rimage, int kernel_fd, >> if (!image) >> return -ENOMEM; >> >> - kexec_file_dbg_print = !!(flags & KEXEC_FILE_DEBUG); >> + kexec_core_dbg_print = !!(flags & KEXEC_FILE_DEBUG); >> image->file_mode = 1; >> >> #ifdef CONFIG_CRASH_DUMP >> -- >> 2.20.1 >> > From bhe at redhat.com Tue Nov 4 22:56:43 2025 From: bhe at redhat.com (Baoquan He) Date: Wed, 5 Nov 2025 14:56:43 +0800 Subject: [PATCH v2 4/4] kexec_file: Fix the issue of mismatch between loop variable types In-Reply-To: <0C92443D3E2100AF+c669d240-1ee8-4897-a30d-3efefe161085@uniontech.com> References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-5-maqianga@uniontech.com> <0C92443D3E2100AF+c669d240-1ee8-4897-a30d-3efefe161085@uniontech.com> Message-ID: On 11/05/25 at 11:47am, Qiang Ma wrote: > > ? 2025/11/5 11:05, Baoquan He ??: > > On 11/03/25 at 02:34pm, Qiang Ma wrote: > > > The type of the struct kimage member variable nr_segments is unsigned long. > > > Correct the loop variable i and the print format specifier type. > > I can't see what's meaningful with this change. nr_segments is unsigned > > long, but it's the range 'i' will loop. If so, we need change all for > > loop of the int iterator. > If image->nr_segments is large enough, 'i' overflow causes an infinite loop. Please check kexec_add_buffer(), there's checking for the value which upper limit is restricted to 16. 
if (kbuf->image->nr_segments >= KEXEC_SEGMENT_MAX) return -EINVAL; > > > Signed-off-by: Qiang Ma > > > --- > > > kernel/kexec_file.c | 5 +++-- > > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > > > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c > > > index 4a24aadbad02..7afdaa0efc50 100644 > > > --- a/kernel/kexec_file.c > > > +++ b/kernel/kexec_file.c > > > @@ -366,7 +366,8 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, > > > int image_type = (flags & KEXEC_FILE_ON_CRASH) ? > > > KEXEC_TYPE_CRASH : KEXEC_TYPE_DEFAULT; > > > struct kimage **dest_image, *image; > > > - int ret = 0, i; > > > + int ret = 0; > > > + unsigned long i; > > > /* We only trust the superuser with rebooting the system. */ > > > if (!kexec_load_permitted(image_type)) > > > @@ -432,7 +433,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, > > > struct kexec_segment *ksegment; > > > ksegment = &image->segment[i]; > > > - kexec_dprintk("segment[%d]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", > > > + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", > > > i, ksegment->buf, ksegment->bufsz, ksegment->mem, > > > ksegment->memsz); > > > -- > > > 2.20.1 > > > > > > From maqianga at uniontech.com Tue Nov 4 23:06:49 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Wed, 5 Nov 2025 15:06:49 +0800 Subject: [PATCH v2 4/4] kexec_file: Fix the issue of mismatch between loop variable types In-Reply-To: References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-5-maqianga@uniontech.com> <0C92443D3E2100AF+c669d240-1ee8-4897-a30d-3efefe161085@uniontech.com> Message-ID: On 2025/11/5 14:56, Baoquan He wrote: > On 11/05/25 at 11:47am, Qiang Ma wrote: >> On 2025/11/5 11:05, Baoquan He wrote: >>> On 11/03/25 at 02:34pm, Qiang Ma wrote: >>>> The type of the struct kimage member variable nr_segments is unsigned long. >>>> Correct the loop variable i and the print format specifier type. 
>>> I can't see what's meaningful with this change. nr_segments is unsigned >>> long, but it's the range 'i' will loop. If so, we need change all for >>> loop of the int iterator. >> If image->nr_segments is large enough, 'i' overflow causes an infinite loop. > Please check kexec_add_buffer(), there's checking for the value which > upper limit is restricted to 16. > > if (kbuf->image->nr_segments >= KEXEC_SEGMENT_MAX) > return -EINVAL; Oh, then this patch is really not necessary. >>>> Signed-off-by: Qiang Ma >>>> --- >>>> kernel/kexec_file.c | 5 +++-- >>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c >>>> index 4a24aadbad02..7afdaa0efc50 100644 >>>> --- a/kernel/kexec_file.c >>>> +++ b/kernel/kexec_file.c >>>> @@ -366,7 +366,8 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, >>>> int image_type = (flags & KEXEC_FILE_ON_CRASH) ? >>>> KEXEC_TYPE_CRASH : KEXEC_TYPE_DEFAULT; >>>> struct kimage **dest_image, *image; >>>> - int ret = 0, i; >>>> + int ret = 0; >>>> + unsigned long i; >>>> /* We only trust the superuser with rebooting the system. 
*/ >>>> if (!kexec_load_permitted(image_type)) >>>> @@ -432,7 +433,7 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, initrd_fd, >>>> struct kexec_segment *ksegment; >>>> ksegment = &image->segment[i]; >>>> - kexec_dprintk("segment[%d]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", >>>> + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", >>>> i, ksegment->buf, ksegment->bufsz, ksegment->mem, >>>> ksegment->memsz); >>>> -- >>>> 2.20.1 >>>> > From bhe at redhat.com Tue Nov 4 23:53:13 2025 From: bhe at redhat.com (Baoquan He) Date: Wed, 5 Nov 2025 15:53:13 +0800 Subject: [PATCH v2 3/4] kexec: print out debugging message if required for kexec_load In-Reply-To: <5FC4A8D79744B238+97288be4-6c1a-4c0d-ae7d-be2029ec87f3@uniontech.com> References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-4-maqianga@uniontech.com> <5FC4A8D79744B238+97288be4-6c1a-4c0d-ae7d-be2029ec87f3@uniontech.com> Message-ID: On 11/05/25 at 11:41am, Qiang Ma wrote: > > On 2025/11/5 11:01, Baoquan He wrote: > > On 11/03/25 at 02:34pm, Qiang Ma wrote: > > > The commit a85ee18c7900 ("kexec_file: print out debugging message > > > if required") has added general code printing in kexec_file_load(), > > > but not in kexec_load(). > > > > > > Especially in the RISC-V architecture, kexec_image_info() has been > > > removed(commit eb7622d908a0 ("kexec_file, riscv: print out debugging > > > message if required")). As a result, when using '-d' for the kexec_load > > > interface, print nothing in the kernel space. This might be helpful for > > > verifying the accuracy of the data passed to the kernel. Therefore, > > > refer to this commit a85ee18c7900 ("kexec_file: print out debugging > > > message if required"), debug print information has been added. 
> > > > > > Signed-off-by: Qiang Ma > > > Reported-by: kernel test robot > > > Closes: https://lore.kernel.org/oe-kbuild-all/202510310332.6XrLe70K-lkp at intel.com/ > > > --- > > > kernel/kexec.c | 11 +++++++++++ > > > 1 file changed, 11 insertions(+) > > > > > > diff --git a/kernel/kexec.c b/kernel/kexec.c > > > index c7a869d32f87..9b433b972cc1 100644 > > > --- a/kernel/kexec.c > > > +++ b/kernel/kexec.c > > > @@ -154,7 +154,15 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, > > > if (ret) > > > goto out; > > > + kexec_dprintk("nr_segments = %lu\n", nr_segments); > > > for (i = 0; i < nr_segments; i++) { > > > + struct kexec_segment *ksegment; > > > + > > > + ksegment = &image->segment[i]; > > > + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", > > > + i, ksegment->buf, ksegment->bufsz, ksegment->mem, > > > + ksegment->memsz); > > There has already been a print_segments() in kexec-tools/kexec/kexec.c, > > you will get duplicated printing. That sounds not good. Have you tested > > this? > I have tested it, kexec-tools is the debug message printed > in user space, while kexec_dprintk is printed > in kernel space. > > This might be helpful for verifying the accuracy of > the data passed to the kernel. Hmm, a debug print just to verify the values passed into the kernel is not necessary. We should only add debug printing where we need it but lack it. I didn't check it carefully; if you add the debug printing only for verifying accuracy, that doesn't justify the code change.
> > > + > > > ret = kimage_load_segment(image, i); > > > if (ret) > > > goto out; > > > @@ -166,6 +174,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, > > > if (ret) > > > goto out; > > > + kexec_dprintk("kexec_load: type:%u, start:0x%lx head:0x%lx flags:0x%lx\n", > > > + image->type, image->start, image->head, flags); > > > + > > > /* Install the new kernel and uninstall the old */ > > > image = xchg(dest_image, image); > > > -- > > > 2.20.1 > > > > > > From maqianga at uniontech.com Wed Nov 5 00:35:06 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Wed, 5 Nov 2025 16:35:06 +0800 Subject: [PATCH v2 3/4] kexec: print out debugging message if required for kexec_load In-Reply-To: References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-4-maqianga@uniontech.com> <5FC4A8D79744B238+97288be4-6c1a-4c0d-ae7d-be2029ec87f3@uniontech.com> Message-ID: <2331A9F3E09581FC+4ab7e9ba-8776-47d2-868f-cb01ca9cd909@uniontech.com> ? 2025/11/5 15:53, Baoquan He ??: > On 11/05/25 at 11:41am, Qiang Ma wrote: >> ? 2025/11/5 11:01, Baoquan He ??: >>> On 11/03/25 at 02:34pm, Qiang Ma wrote: >>>> The commit a85ee18c7900 ("kexec_file: print out debugging message >>>> if required") has added general code printing in kexec_file_load(), >>>> but not in kexec_load(). >>>> >>>> Especially in the RISC-V architecture, kexec_image_info() has been >>>> removed(commit eb7622d908a0 ("kexec_file, riscv: print out debugging >>>> message if required")). As a result, when using '-d' for the kexec_load >>>> interface, print nothing in the kernel space. This might be helpful for >>>> verifying the accuracy of the data passed to the kernel. Therefore, >>>> refer to this commit a85ee18c7900 ("kexec_file: print out debugging >>>> message if required"), debug print information has been added. 
>>>> >>>> Signed-off-by: Qiang Ma >>>> Reported-by: kernel test robot >>>> Closes: https://lore.kernel.org/oe-kbuild-all/202510310332.6XrLe70K-lkp at intel.com/ >>>> --- >>>> kernel/kexec.c | 11 +++++++++++ >>>> 1 file changed, 11 insertions(+) >>>> >>>> diff --git a/kernel/kexec.c b/kernel/kexec.c >>>> index c7a869d32f87..9b433b972cc1 100644 >>>> --- a/kernel/kexec.c >>>> +++ b/kernel/kexec.c >>>> @@ -154,7 +154,15 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, >>>> if (ret) >>>> goto out; >>>> + kexec_dprintk("nr_segments = %lu\n", nr_segments); >>>> for (i = 0; i < nr_segments; i++) { >>>> + struct kexec_segment *ksegment; >>>> + >>>> + ksegment = &image->segment[i]; >>>> + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", >>>> + i, ksegment->buf, ksegment->bufsz, ksegment->mem, >>>> + ksegment->memsz); >>> There has already been a print_segments() in kexec-tools/kexec/kexec.c, >>> you will get duplicated printing. That sounds not good. Have you tested >>> this? >> I have tested it, kexec-tools is the debug message printed >> in user space, while kexec_dprintk is printed >> in kernel space. >> >> This might be helpful for verifying the accuracy of >> the data passed to the kernel. > Hmm, that's not necessary with a debug printing to verify value passed > in kernel. We should only add debug pringing when we need but lack it. > I didn't check it carefully, if you add the debug printing only for > verifying accuracy, that doesn't justify the code change. It's not entirely because of it. Another reason is that for RISC-V, for kexec_file_load interface, kexec_image_info() was deleted at that time because the content has been printed out in generic code. However, these contents were not printed in kexec_load because kexec_image_info was deleted. So now it has been added. 
>>>> + >>>> ret = kimage_load_segment(image, i); >>>> if (ret) >>>> goto out; >>>> @@ -166,6 +174,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, >>>> if (ret) >>>> goto out; >>>> + kexec_dprintk("kexec_load: type:%u, start:0x%lx head:0x%lx flags:0x%lx\n", >>>> + image->type, image->start, image->head, flags); >>>> + >>>> /* Install the new kernel and uninstall the old */ >>>> image = xchg(dest_image, image); >>>> -- >>>> 2.20.1 >>>> > From maqianga at uniontech.com Wed Nov 5 00:48:59 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Wed, 5 Nov 2025 16:48:59 +0800 Subject: [PATCH v2 3/4] kexec: print out debugging message if required for kexec_load In-Reply-To: References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-4-maqianga@uniontech.com> <5FC4A8D79744B238+97288be4-6c1a-4c0d-ae7d-be2029ec87f3@uniontech.com> Message-ID: <02A386F1B9701FED+a0b3ab16-3f23-4d69-9fb8-ab4d9f918bad@uniontech.com> ? 2025/11/5 15:53, Baoquan He ??: > On 11/05/25 at 11:41am, Qiang Ma wrote: >> ? 2025/11/5 11:01, Baoquan He ??: >>> On 11/03/25 at 02:34pm, Qiang Ma wrote: >>>> The commit a85ee18c7900 ("kexec_file: print out debugging message >>>> if required") has added general code printing in kexec_file_load(), >>>> but not in kexec_load(). >>>> >>>> Especially in the RISC-V architecture, kexec_image_info() has been >>>> removed(commit eb7622d908a0 ("kexec_file, riscv: print out debugging >>>> message if required")). As a result, when using '-d' for the kexec_load >>>> interface, print nothing in the kernel space. This might be helpful for >>>> verifying the accuracy of the data passed to the kernel. Therefore, >>>> refer to this commit a85ee18c7900 ("kexec_file: print out debugging >>>> message if required"), debug print information has been added. 
>>>> >>>> Signed-off-by: Qiang Ma >>>> Reported-by: kernel test robot >>>> Closes: https://lore.kernel.org/oe-kbuild-all/202510310332.6XrLe70K-lkp at intel.com/ >>>> --- >>>> kernel/kexec.c | 11 +++++++++++ >>>> 1 file changed, 11 insertions(+) >>>> >>>> diff --git a/kernel/kexec.c b/kernel/kexec.c >>>> index c7a869d32f87..9b433b972cc1 100644 >>>> --- a/kernel/kexec.c >>>> +++ b/kernel/kexec.c >>>> @@ -154,7 +154,15 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, >>>> if (ret) >>>> goto out; >>>> + kexec_dprintk("nr_segments = %lu\n", nr_segments); >>>> for (i = 0; i < nr_segments; i++) { >>>> + struct kexec_segment *ksegment; >>>> + >>>> + ksegment = &image->segment[i]; >>>> + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", >>>> + i, ksegment->buf, ksegment->bufsz, ksegment->mem, >>>> + ksegment->memsz); >>> There has already been a print_segments() in kexec-tools/kexec/kexec.c, >>> you will get duplicated printing. That sounds not good. Have you tested >>> this? >> I have tested it, kexec-tools is the debug message printed >> in user space, while kexec_dprintk is printed >> in kernel space. >> >> This might be helpful for verifying the accuracy of >> the data passed to the kernel. > Hmm, that's not necessary with a debug printing to verify value passed > in kernel. We should only add debug pringing when we need but lack it. > I didn't check it carefully, if you add the debug printing only for > verifying accuracy, that doesn't justify the code change. > Also, adding these prints here is helpful for debugging the kimage_load_segment(). 
>>>> + >>>> ret = kimage_load_segment(image, i); >>>> if (ret) >>>> goto out; >>>> @@ -166,6 +174,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, >>>> if (ret) >>>> goto out; >>>> + kexec_dprintk("kexec_load: type:%u, start:0x%lx head:0x%lx flags:0x%lx\n", >>>> + image->type, image->start, image->head, flags); >>>> + >>>> /* Install the new kernel and uninstall the old */ >>>> image = xchg(dest_image, image); >>>> -- >>>> 2.20.1 >>>> > From bhe at redhat.com Wed Nov 5 00:55:28 2025 From: bhe at redhat.com (Baoquan He) Date: Wed, 5 Nov 2025 16:55:28 +0800 Subject: [PATCH v2 3/4] kexec: print out debugging message if required for kexec_load In-Reply-To: <2331A9F3E09581FC+4ab7e9ba-8776-47d2-868f-cb01ca9cd909@uniontech.com> References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-4-maqianga@uniontech.com> <5FC4A8D79744B238+97288be4-6c1a-4c0d-ae7d-be2029ec87f3@uniontech.com> <2331A9F3E09581FC+4ab7e9ba-8776-47d2-868f-cb01ca9cd909@uniontech.com> Message-ID: On 11/05/25 at 04:35pm, Qiang Ma wrote: > > ? 2025/11/5 15:53, Baoquan He ??: > > On 11/05/25 at 11:41am, Qiang Ma wrote: > > > ? 2025/11/5 11:01, Baoquan He ??: > > > > On 11/03/25 at 02:34pm, Qiang Ma wrote: > > > > > The commit a85ee18c7900 ("kexec_file: print out debugging message > > > > > if required") has added general code printing in kexec_file_load(), > > > > > but not in kexec_load(). > > > > > > > > > > Especially in the RISC-V architecture, kexec_image_info() has been > > > > > removed(commit eb7622d908a0 ("kexec_file, riscv: print out debugging > > > > > message if required")). As a result, when using '-d' for the kexec_load > > > > > interface, print nothing in the kernel space. This might be helpful for > > > > > verifying the accuracy of the data passed to the kernel. Therefore, > > > > > refer to this commit a85ee18c7900 ("kexec_file: print out debugging > > > > > message if required"), debug print information has been added. 
> > > > > > > > > > Signed-off-by: Qiang Ma > > > > > Reported-by: kernel test robot > > > > > Closes: https://lore.kernel.org/oe-kbuild-all/202510310332.6XrLe70K-lkp at intel.com/ > > > > > --- > > > > > kernel/kexec.c | 11 +++++++++++ > > > > > 1 file changed, 11 insertions(+) > > > > > > > > > > diff --git a/kernel/kexec.c b/kernel/kexec.c > > > > > index c7a869d32f87..9b433b972cc1 100644 > > > > > --- a/kernel/kexec.c > > > > > +++ b/kernel/kexec.c > > > > > @@ -154,7 +154,15 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, > > > > > if (ret) > > > > > goto out; > > > > > + kexec_dprintk("nr_segments = %lu\n", nr_segments); > > > > > for (i = 0; i < nr_segments; i++) { > > > > > + struct kexec_segment *ksegment; > > > > > + > > > > > + ksegment = &image->segment[i]; > > > > > + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", > > > > > + i, ksegment->buf, ksegment->bufsz, ksegment->mem, > > > > > + ksegment->memsz); > > > > There has already been a print_segments() in kexec-tools/kexec/kexec.c, > > > > you will get duplicated printing. That sounds not good. Have you tested > > > > this? > > > I have tested it, kexec-tools is the debug message printed > > > in user space, while kexec_dprintk is printed > > > in kernel space. > > > > > > This might be helpful for verifying the accuracy of > > > the data passed to the kernel. > > Hmm, that's not necessary with a debug printing to verify value passed > > in kernel. We should only add debug pringing when we need but lack it. > > I didn't check it carefully, if you add the debug printing only for > > verifying accuracy, that doesn't justify the code change. > It's not entirely because of it. > > Another reason is that for RISC-V, for kexec_file_load interface, > kexec_image_info() was deleted at that time because the content > has been printed out in generic code. > > However, these contents were not printed in kexec_load because > kexec_image_info was deleted. 
So now it has been added. print_segments() in kexec-tools/kexec/kexec.c is a generic function; shouldn't it be called in kexec-tools for risc-v? I am confused by the purpose of this patchset. > > > > > + > > > > > ret = kimage_load_segment(image, i); > > > > > if (ret) > > > > > goto out; > > > > > @@ -166,6 +174,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, > > > > > if (ret) > > > > > goto out; > > > > > + kexec_dprintk("kexec_load: type:%u, start:0x%lx head:0x%lx flags:0x%lx\n", > > > > > + image->type, image->start, image->head, flags); > > > > > + > > > > > /* Install the new kernel and uninstall the old */ > > > > > image = xchg(dest_image, image); > > > > > -- > > > > > 2.20.1 > > > > > > > > From pratyush at kernel.org Wed Nov 5 01:44:24 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Wed, 05 Nov 2025 10:44:24 +0100 Subject: [PATCH] MAINTAINERS: extend file entry in KHO to include subdirectories In-Reply-To: <20251104143238.119803-1-lukas.bulwahn@redhat.com> (Lukas Bulwahn's message of "Tue, 4 Nov 2025 15:32:38 +0100") References: <20251104143238.119803-1-lukas.bulwahn@redhat.com> Message-ID: On Tue, Nov 04 2025, Lukas Bulwahn wrote: > From: Lukas Bulwahn > > Commit 3498209ff64e ("Documentation: add documentation for KHO") adds the > file entry for 'Documentation/core-api/kho/*'. The asterisk at the end > means that all files in kho are included, but not files in its > subdirectories below. > Hence, the files under Documentation/core-api/kho/bindings/ are not > considered part of KHO, and get_maintainers.pl does not necessarily add the > KHO maintainers to the recipients of patches to those files. Probably this > is not intended, and it was simply an oversight of the detailed > semantics of such file entries. > > Make the file entry include the subdirectories of > Documentation/core-api/kho/. > > Signed-off-by: Lukas Bulwahn Reviewed-by: Pratyush Yadav [...]
-- Regards, Pratyush Yadav From pratyush at kernel.org Wed Nov 5 02:06:13 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Wed, 05 Nov 2025 11:06:13 +0100 Subject: [PATCH 0/2] kho: misc fixes In-Reply-To: <20251103172321.689294e48c2fae795e114ce6@linux-foundation.org> (Andrew Morton's message of "Mon, 3 Nov 2025 17:23:21 -0800") References: <20251103180235.71409-1-pratyush@kernel.org> <20251103162020.ac696dbc695f9341e7a267f7@linux-foundation.org> <20251103172321.689294e48c2fae795e114ce6@linux-foundation.org> Message-ID: On Mon, Nov 03 2025, Andrew Morton wrote: > On Mon, 3 Nov 2025 16:20:20 -0800 Andrew Morton wrote: > >> On Mon, 3 Nov 2025 19:02:30 +0100 Pratyush Yadav wrote: >> >> > This series has a couple of misc fixes for KHO I discovered during code >> > review and testing. >> > >> > The series is based on top of [0] which has another fix for the function >> > touched by patch 1. I spotted these two after sending the patch. If that >> > one needs a reroll, I can combine the three into a series. >> > >> >> Things appear to be misordered here. >> >> [1/2] "kho: fix unpreservation of higher-order vmalloc preservations" >> fixes a667300bd53f2, so it's wanted in 6.18-rcX >> >> [2/2] "kho: warn and exit when unpreserved page wasn't preserved" >> fixes fc33e4b44b271, so it's wanted in 6.16+ >> >> So can we please have [2/2] as a standalone fix against latest -linus, >> with a cc:stable? >> >> And then [1/2] as a standalone fix against latest -linus without a >> cc:stable. >> > > OK, I think I figured it out. > > In mm-hotfixes-unstable I have > > kho-fix-out-of-bounds-access-of-vmalloc-chunk.patch > kho-fix-unpreservation-of-higher-order-vmalloc-preservations.patch > kho-warn-and-exit-when-unpreserved-page-wasnt-preserved.patch > > The first two are applicable to 6.18-rcX and the third is applicable to > 6.18-rcX, with a cc:stable for backporting. Right. Sorry for the confusion. 
I see that on mm-hotfixes-unstable you already updated the third patch with Cc: stable. Thanks. -- Regards, Pratyush Yadav From leitao at debian.org Wed Nov 5 02:18:11 2025 From: leitao at debian.org (Breno Leitao) Date: Wed, 5 Nov 2025 02:18:11 -0800 Subject: [PATCH v8 01/17] memblock: add MEMBLOCK_RSRV_KERN flag In-Reply-To: References: <20250509074635.3187114-1-changyuanl@google.com> <20250509074635.3187114-2-changyuanl@google.com> <2ege2jfbevtunhxsnutbzde7cqwgu5qbj4bbuw2umw7ke7ogcn@5wtskk4exzsi> Message-ID: Hello Pratyush, On Tue, Oct 14, 2025 at 03:10:37PM +0200, Pratyush Yadav wrote: > On Tue, Oct 14 2025, Breno Leitao wrote: > > On Mon, Oct 13, 2025 at 06:40:09PM +0200, Pratyush Yadav wrote: > >> On Mon, Oct 13 2025, Pratyush Yadav wrote: > >> > > >> > I suppose this would be useful. I think enabling memblock debug prints > >> > would also be helpful (using the "memblock=debug" commandline parameter) > >> > if it doesn't impact your production environment too much. > >> > >> Actually, I think "memblock=debug" is going to be the more useful thing > >> since it would also show what function allocated the overlapping range > >> and the flags it was allocated with. > >> > >> On my qemu VM with KVM, this results in around 70 prints from memblock. > >> So it adds a bit of extra prints but nothing that should be too > >> disrupting I think. Plus, only at boot so the worst thing you get is > >> slightly slower boot times. > > > > Unfortunately this issue is happening on production systems, and I don't > > have an easy way to reproduce it _yet_. > > > > At the same time, "memblock=debug" has two problems: > > > > 1) It slows the boot time as you suggested. Boot time at large > > environments is SUPER critical and time sensitive. It is a bit > > weird, but it is common for machines in production to kexec > > _thousands_ of times, and kexecing is considered downtime. 
> > I don't know if it would make a real enough difference on boot times, > only that it should theoretically affect it, mainly if you are using > serial for dmesg logs. Anyway, that's your production environment so you > know best. > > > > > This would be useful if I find some hosts getting this issue, and > > then I can easily enable the extra information to collect what > > I need, but, this didn't pan out because the hosts I got > > `memblock=debug` didn't collaborate. > > > > 2) "memblock=debug" is verbose for all cases, which also not necessary > > the desired behaviour. I am more interested in only being verbose > > when there is a known problem. I am still interested in this problem, and I finally found a host that constantly reproduce the issue and I was able to get `memblock=debug` cmdline. I am running 6.18-rc4 with some debug options enabled. DMA-API: exceeded 7 overlapping mappings of cacheline 0x0000000006d6e400 WARNING: CPU: 58 PID: 828 at kernel/dma/debug.c:463 add_dma_entry+0x2e4/0x330 pc : add_dma_entry+0x2e4/0x330 lr : add_dma_entry+0x2e4/0x330 sp : ffff8000b036f7f0 x29: ffff8000b036f800 x28: 0000000000000001 x27: 0000000000000008 x26: ffff8000835f7fb8 x25: ffff8000835f7000 x24: ffff8000835f7ee0 x23: 0000000000000000 x22: 0000000006d6e400 x21: 0000000000000000 x20: 0000000006d6e400 x19: ffff0003f70c1100 x18: 00000000ffffffff x17: ffff80008019a2d8 x16: ffff80008019a08c x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000820 x12: ffff00011faeaf00 x11: 0000000000000000 x10: ffff8000834633d8 x9 : ffff8000801979d4 x8 : 00000000fffeffff x7 : ffff8000834633d8 x6 : 0000000000000000 x5 : 00000000000bfff4 x4 : 0000000000000000 x3 : ffff0001075eb7c0 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0001075eb7c0 Call trace: add_dma_entry+0x2e4/0x330 (P) debug_dma_map_phys+0xc4/0xf0 dma_map_phys (/home/leit/Devel/upstream/./include/linux/dma-direct.h:138 /home/leit/Devel/upstream/kernel/dma/direct.h:102 
/home/leit/Devel/upstream/kernel/dma/mapping.c:169) dma_map_page_attrs (/home/leit/Devel/upstream/kernel/dma/mapping.c:387) blk_dma_map_direct.isra.0 (/home/leit/Devel/upstream/block/blk-mq-dma.c:102) blk_dma_map_iter_start (/home/leit/Devel/upstream/block/blk-mq-dma.c:123 /home/leit/Devel/upstream/block/blk-mq-dma.c:196) blk_rq_dma_map_iter_start (/home/leit/Devel/upstream/block/blk-mq-dma.c:228) nvme_prep_rq+0xb8/0x9b8 nvme_queue_rq+0x44/0x1b0 blk_mq_dispatch_rq_list (/home/leit/Devel/upstream/block/blk-mq.c:2129) __blk_mq_sched_dispatch_requests (/home/leit/Devel/upstream/block/blk-mq-sched.c:314) blk_mq_sched_dispatch_requests (/home/leit/Devel/upstream/block/blk-mq-sched.c:329) blk_mq_run_work_fn (/home/leit/Devel/upstream/block/blk-mq.c:219 /home/leit/Devel/upstream/block/blk-mq.c:231) process_one_work (/home/leit/Devel/upstream/kernel/workqueue.c:991 /home/leit/Devel/upstream/kernel/workqueue.c:3213) worker_thread (/home/leit/Devel/upstream/./include/linux/list.h:163 /home/leit/Devel/upstream/./include/linux/list.h:191 /home/leit/Devel/upstream/./include/linux/list.h:319 /home/leit/Devel/upstream/kernel/workqueue.c:1153 /home/leit/Devel/upstream/kernel/workqueue.c:1205 /home/leit/Devel/upstream/kernel/workqueue.c:3426) kthread (/home/leit/Devel/upstream/kernel/kthread.c:386 /home/leit/Devel/upstream/kernel/kthread.c:457) ret_from_fork (/home/leit/Devel/upstream/entry.S:861) Looking at memblock debug logs, I haven't seen anything related to 0x0000000006d6e400. 
I put the output of `dmesg | grep memblock` here, in case you are curious: https://github.com/leitao/debug/blob/main/pastebin/memblock/dmesg_grep_memblock.txt Thanks --breno From pratyush at kernel.org Wed Nov 5 02:20:19 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Wed, 5 Nov 2025 11:20:19 +0100 Subject: [PATCH] MAINTAINERS: add myself as a reviewer for KHO Message-ID: <20251105102022.18798-1-pratyush@kernel.org> I have been reviewing most patches for KHO already, and it is easier to spot them if I am directly in Cc. Signed-off-by: Pratyush Yadav --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index 8ee7cb5fe838f..3c85bb0e381fc 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13789,6 +13789,7 @@ KEXEC HANDOVER (KHO) M: Alexander Graf M: Mike Rapoport M: Pasha Tatashin +R: Pratyush Yadav L: kexec at lists.infradead.org L: linux-mm at kvack.org S: Maintained base-commit: d25eefc46daf21bd1ebbc699f0ffd7fe11d92296 -- 2.47.3 From maqianga at uniontech.com Wed Nov 5 03:28:10 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Wed, 5 Nov 2025 19:28:10 +0800 Subject: [PATCH v2 3/4] kexec: print out debugging message if required for kexec_load In-Reply-To: References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-4-maqianga@uniontech.com> <5FC4A8D79744B238+97288be4-6c1a-4c0d-ae7d-be2029ec87f3@uniontech.com> <2331A9F3E09581FC+4ab7e9ba-8776-47d2-868f-cb01ca9cd909@uniontech.com> Message-ID: <44308A6B6D8BEB61+c143d52e-03dd-48bf-aadd-8a0d9196b280@uniontech.com> On 2025/11/5 16:55, Baoquan He wrote: > On 11/05/25 at 04:35pm, Qiang Ma wrote: >> On 2025/11/5 15:53, Baoquan He wrote: >>> On 11/05/25 at 11:41am, Qiang Ma wrote: >>>> On 2025/11/5 11:01, Baoquan He wrote: >>>>> On 11/03/25 at 02:34pm, Qiang Ma wrote: >>>>>> The commit a85ee18c7900 ("kexec_file: print out debugging message >>>>>> if required") has added general code printing in kexec_file_load(), >>>>>> but not in kexec_load().
>>>>>> >>>>>> Especially in the RISC-V architecture, kexec_image_info() has been >>>>>> removed(commit eb7622d908a0 ("kexec_file, riscv: print out debugging >>>>>> message if required")). As a result, when using '-d' for the kexec_load >>>>>> interface, print nothing in the kernel space. This might be helpful for >>>>>> verifying the accuracy of the data passed to the kernel. Therefore, >>>>>> refer to this commit a85ee18c7900 ("kexec_file: print out debugging >>>>>> message if required"), debug print information has been added. >>>>>> >>>>>> Signed-off-by: Qiang Ma >>>>>> Reported-by: kernel test robot >>>>>> Closes: https://lore.kernel.org/oe-kbuild-all/202510310332.6XrLe70K-lkp at intel.com/ >>>>>> --- >>>>>> kernel/kexec.c | 11 +++++++++++ >>>>>> 1 file changed, 11 insertions(+) >>>>>> >>>>>> diff --git a/kernel/kexec.c b/kernel/kexec.c >>>>>> index c7a869d32f87..9b433b972cc1 100644 >>>>>> --- a/kernel/kexec.c >>>>>> +++ b/kernel/kexec.c >>>>>> @@ -154,7 +154,15 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, >>>>>> if (ret) >>>>>> goto out; >>>>>> + kexec_dprintk("nr_segments = %lu\n", nr_segments); >>>>>> for (i = 0; i < nr_segments; i++) { >>>>>> + struct kexec_segment *ksegment; >>>>>> + >>>>>> + ksegment = &image->segment[i]; >>>>>> + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", >>>>>> + i, ksegment->buf, ksegment->bufsz, ksegment->mem, >>>>>> + ksegment->memsz); >>>>> There has already been a print_segments() in kexec-tools/kexec/kexec.c, >>>>> you will get duplicated printing. That sounds not good. Have you tested >>>>> this? >>>> I have tested it, kexec-tools is the debug message printed >>>> in user space, while kexec_dprintk is printed >>>> in kernel space. >>>> >>>> This might be helpful for verifying the accuracy of >>>> the data passed to the kernel. >>> Hmm, that's not necessary with a debug printing to verify value passed >>> in kernel. 
We should only add debug printing where we need it but lack it. >>> I didn't check it carefully; if you add the debug printing only for >>> verifying accuracy, that doesn't justify the code change. >> It's not entirely because of it. >> >> Another reason is that for RISC-V, for kexec_file_load interface, >> kexec_image_info() was deleted at that time because the content >> has been printed out in generic code. >> >> However, these contents were not printed in kexec_load because >> kexec_image_info was deleted. So now it has been added. > print_segments() in kexec-tools/kexec/kexec.c is a generic function; > shouldn't it be called in kexec-tools for risc-v? I am confused by > the purpose of this patchset. I expressed that poorly. I don't want to add print_segments() to riscv. I want to add some debugging messages (ksegment, kimage, flags) for kexec_load. Although the segment information is already printed by kexec-tools, it is still helpful for debugging the kernel-space code path.
> >>>>>> + >>>>>> ret = kimage_load_segment(image, i); >>>>>> if (ret) >>>>>> goto out; >>>>>> @@ -166,6 +174,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, >>>>>> if (ret) >>>>>> goto out; >>>>>> + kexec_dprintk("kexec_load: type:%u, start:0x%lx head:0x%lx flags:0x%lx\n", >>>>>> + image->type, image->start, image->head, flags); >>>>>> + >>>>>> /* Install the new kernel and uninstall the old */ >>>>>> image = xchg(dest_image, image); >>>>>> -- >>>>>> 2.20.1 >>>>>> > From bhe at redhat.com Wed Nov 5 05:01:12 2025 From: bhe at redhat.com (Baoquan He) Date: Wed, 5 Nov 2025 21:01:12 +0800 Subject: [PATCH v2 3/4] kexec: print out debugging message if required for kexec_load In-Reply-To: <44308A6B6D8BEB61+c143d52e-03dd-48bf-aadd-8a0d9196b280@uniontech.com> References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-4-maqianga@uniontech.com> <5FC4A8D79744B238+97288be4-6c1a-4c0d-ae7d-be2029ec87f3@uniontech.com> <2331A9F3E09581FC+4ab7e9ba-8776-47d2-868f-cb01ca9cd909@uniontech.com> <44308A6B6D8BEB61+c143d52e-03dd-48bf-aadd-8a0d9196b280@uniontech.com> Message-ID: On 11/05/25 at 07:28pm, Qiang Ma wrote: > > ? 2025/11/5 16:55, Baoquan He ??: > > On 11/05/25 at 04:35pm, Qiang Ma wrote: > > > ? 2025/11/5 15:53, Baoquan He ??: > > > > On 11/05/25 at 11:41am, Qiang Ma wrote: > > > > > ? 2025/11/5 11:01, Baoquan He ??: > > > > > > On 11/03/25 at 02:34pm, Qiang Ma wrote: > > > > > > > The commit a85ee18c7900 ("kexec_file: print out debugging message > > > > > > > if required") has added general code printing in kexec_file_load(), > > > > > > > but not in kexec_load(). > > > > > > > > > > > > > > Especially in the RISC-V architecture, kexec_image_info() has been > > > > > > > removed(commit eb7622d908a0 ("kexec_file, riscv: print out debugging > > > > > > > message if required")). As a result, when using '-d' for the kexec_load > > > > > > > interface, print nothing in the kernel space. 
This might be helpful for > > > > > > > verifying the accuracy of the data passed to the kernel. Therefore, > > > > > > > refer to this commit a85ee18c7900 ("kexec_file: print out debugging > > > > > > > message if required"), debug print information has been added. > > > > > > > > > > > > > > Signed-off-by: Qiang Ma > > > > > > > Reported-by: kernel test robot > > > > > > > Closes: https://lore.kernel.org/oe-kbuild-all/202510310332.6XrLe70K-lkp at intel.com/ > > > > > > > --- > > > > > > > kernel/kexec.c | 11 +++++++++++ > > > > > > > 1 file changed, 11 insertions(+) > > > > > > > > > > > > > > diff --git a/kernel/kexec.c b/kernel/kexec.c > > > > > > > index c7a869d32f87..9b433b972cc1 100644 > > > > > > > --- a/kernel/kexec.c > > > > > > > +++ b/kernel/kexec.c > > > > > > > @@ -154,7 +154,15 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, > > > > > > > if (ret) > > > > > > > goto out; > > > > > > > + kexec_dprintk("nr_segments = %lu\n", nr_segments); > > > > > > > for (i = 0; i < nr_segments; i++) { > > > > > > > + struct kexec_segment *ksegment; > > > > > > > + > > > > > > > + ksegment = &image->segment[i]; > > > > > > > + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", > > > > > > > + i, ksegment->buf, ksegment->bufsz, ksegment->mem, > > > > > > > + ksegment->memsz); > > > > > > There has already been a print_segments() in kexec-tools/kexec/kexec.c, > > > > > > you will get duplicated printing. That sounds not good. Have you tested > > > > > > this? > > > > > I have tested it, kexec-tools is the debug message printed > > > > > in user space, while kexec_dprintk is printed > > > > > in kernel space. > > > > > > > > > > This might be helpful for verifying the accuracy of > > > > > the data passed to the kernel. > > > > Hmm, that's not necessary with a debug printing to verify value passed > > > > in kernel. We should only add debug pringing when we need but lack it. 
> > > > I didn't check it carefully, if you add the debug printing only for > > > > verifying accuracy, that doesn't justify the code change. > > > It's not entirely because of it. > > > > > > Another reason is that for RISC-V, for kexec_file_load interface, > > > kexec_image_info() was deleted at that time because the content > > > has been printed out in generic code. > > > > > > However, these contents were not printed in kexec_load because > > > kexec_image_info was deleted. So now it has been added. > > print_segments() in kexec-tools/kexec/kexec.c is a generic function, > > shouldn't you make it called in kexec-tools for risc-v? I am confused by > > the purpose of this patchset. > There is a problem with what I expressed. > I don't want to add print_segments to riscv. > I want to add some debugging message(ksegment,kimage,flag) for kexec_load. > > Although ksegment debugging message has been printed in kexec-tools, > it is still helpful for debugging the kernel space function. Sorry, I can't support that. With the kexec_load interface, we already prepare the loading segments for the future jump in kexec-tools, and calling print_segments() to print that loading information there is natural. Why do we need to print it twice just to verify that the printing is accurate? Could you explain why risc-v is special?
> > > > > > > > > + > > > > > > > ret = kimage_load_segment(image, i); > > > > > > > if (ret) > > > > > > > goto out; > > > > > > > @@ -166,6 +174,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, > > > > > > > if (ret) > > > > > > > goto out; > > > > > > > + kexec_dprintk("kexec_load: type:%u, start:0x%lx head:0x%lx flags:0x%lx\n", > > > > > > > + image->type, image->start, image->head, flags); > > > > > > > + > > > > > > > /* Install the new kernel and uninstall the old */ > > > > > > > image = xchg(dest_image, image); > > > > > > > -- > > > > > > > 2.20.1 > > > > > > > > > > From piliu at redhat.com Wed Nov 5 05:09:21 2025 From: piliu at redhat.com (Pingfan Liu) Date: Wed, 5 Nov 2025 21:09:21 +0800 Subject: [PATCH 1/2] kernel/kexec: Change the prototype of kimage_map_segment() Message-ID: <20251105130922.13321-1-piliu@redhat.com> The kexec segment index will be required to extract the corresponding information for that segment in kimage_map_segment(). Additionally, kexec_segment already holds the kexec relocation destination address and size. Therefore, the prototype of kimage_map_segment() can be changed. Signed-off-by: Pingfan Liu Cc: Andrew Morton Cc: Baoquan He Cc: Mimi Zohar Cc: Roberto Sassu Cc: Alexander Graf Cc: Steven Chen To: kexec at lists.infradead.org To: linux-integrity at vger.kernel.org --- include/linux/kexec.h | 4 ++-- kernel/kexec_core.c | 9 ++++++--- security/integrity/ima/ima_kexec.c | 4 +--- 3 files changed, 9 insertions(+), 8 deletions(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index ff7e231b0485..8a22bc9b8c6c 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -530,7 +530,7 @@ extern bool kexec_file_dbg_print; #define kexec_dprintk(fmt, arg...) 
\ do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0) -extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size); +extern void *kimage_map_segment(struct kimage *image, int idx); extern void kimage_unmap_segment(void *buffer); #else /* !CONFIG_KEXEC_CORE */ struct pt_regs; @@ -540,7 +540,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { } static inline void crash_kexec(struct pt_regs *regs) { } static inline int kexec_should_crash(struct task_struct *p) { return 0; } static inline int kexec_crash_loaded(void) { return 0; } -static inline void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size) +static inline void *kimage_map_segment(struct kimage *image, int idx) { return NULL; } static inline void kimage_unmap_segment(void *buffer) { } #define kexec_in_progress false diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index fa00b239c5d9..9a1966207041 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -960,17 +960,20 @@ int kimage_load_segment(struct kimage *image, int idx) return result; } -void *kimage_map_segment(struct kimage *image, - unsigned long addr, unsigned long size) +void *kimage_map_segment(struct kimage *image, int idx) { + unsigned long addr, size, eaddr; unsigned long src_page_addr, dest_page_addr = 0; - unsigned long eaddr = addr + size; kimage_entry_t *ptr, entry; struct page **src_pages; unsigned int npages; void *vaddr = NULL; int i; + addr = image->segment[idx].mem; + size = image->segment[idx].memsz; + eaddr = addr + size; + /* * Collect the source pages and map them in a contiguous VA range. 
*/ diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c index 7362f68f2d8b..5beb69edd12f 100644 --- a/security/integrity/ima/ima_kexec.c +++ b/security/integrity/ima/ima_kexec.c @@ -250,9 +250,7 @@ void ima_kexec_post_load(struct kimage *image) if (!image->ima_buffer_addr) return; - ima_kexec_buffer = kimage_map_segment(image, - image->ima_buffer_addr, - image->ima_buffer_size); + ima_kexec_buffer = kimage_map_segment(image, image->ima_segment_index); if (!ima_kexec_buffer) { pr_err("Could not map measurements buffer.\n"); return; -- 2.49.0 From piliu at redhat.com Wed Nov 5 05:09:22 2025 From: piliu at redhat.com (Pingfan Liu) Date: Wed, 5 Nov 2025 21:09:22 +0800 Subject: [PATCH 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: <20251105130922.13321-1-piliu@redhat.com> References: <20251105130922.13321-1-piliu@redhat.com> Message-ID: <20251105130922.13321-2-piliu@redhat.com> When I tested kexec with the latest kernel, I ran into the following warning: [ 40.712410] ------------[ cut here ]------------ [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 [...] [ 40.816047] Call trace: [ 40.818498] kimage_map_segment+0x144/0x198 (P) [ 40.823221] ima_kexec_post_load+0x58/0xc0 [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 [...] [ 40.855423] ---[ end trace 0000000000000000 ]--- This is caused by the fact that kexec allocates the destination directly in the CMA area. In that case, the CMA kernel address should be exported directly to the IMA component, instead of using the vmalloc'd address. 
Signed-off-by: Pingfan Liu Cc: Andrew Morton Cc: Baoquan He Cc: Alexander Graf Cc: Steven Chen Cc: linux-integrity at vger.kernel.org To: kexec at lists.infradead.org --- kernel/kexec_core.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 9a1966207041..abe40286a02c 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -967,6 +967,7 @@ void *kimage_map_segment(struct kimage *image, int idx) kimage_entry_t *ptr, entry; struct page **src_pages; unsigned int npages; + struct page *cma; void *vaddr = NULL; int i; @@ -974,6 +975,9 @@ void *kimage_map_segment(struct kimage *image, int idx) size = image->segment[idx].memsz; eaddr = addr + size; + cma = image->segment_cma[idx]; + if (cma) + return cma; /* * Collect the source pages and map them in a contiguous VA range. */ @@ -1014,7 +1018,8 @@ void *kimage_map_segment(struct kimage *image, int idx) void kimage_unmap_segment(void *segment_buffer) { - vunmap(segment_buffer); + if (is_vmalloc_addr(segment_buffer)) + vunmap(segment_buffer); } struct kexec_load_limit { -- 2.49.0 From maqianga at uniontech.com Wed Nov 5 07:05:01 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Wed, 5 Nov 2025 23:05:01 +0800 Subject: [PATCH v2 3/4] kexec: print out debugging message if required for kexec_load In-Reply-To: References: <20251103063440.1681657-1-maqianga@uniontech.com> <20251103063440.1681657-4-maqianga@uniontech.com> <5FC4A8D79744B238+97288be4-6c1a-4c0d-ae7d-be2029ec87f3@uniontech.com> <2331A9F3E09581FC+4ab7e9ba-8776-47d2-868f-cb01ca9cd909@uniontech.com> <44308A6B6D8BEB61+c143d52e-03dd-48bf-aadd-8a0d9196b280@uniontech.com> Message-ID: <1E51DC0D8C72320F+8ad85e3e-1f03-4ca9-ba29-f2ff8a4cb831@uniontech.com> On 2025/11/5 9:01 PM, Baoquan He wrote: > On 11/05/25 at 07:28pm, Qiang Ma wrote: >> On 2025/11/5 16:55, Baoquan He wrote: >>> On 11/05/25 at 04:35pm, Qiang Ma wrote: >>>> On 2025/11/5 15:53, Baoquan He wrote: >>>>> On 11/05/25 at 11:41am, Qiang Ma wrote: >>>>>> On 2025/11/5 11:01, Baoquan He wrote: >>>>>>> On 11/03/25 at 02:34pm, Qiang Ma wrote: >>>>>>>> The commit a85ee18c7900 ("kexec_file: print out debugging message >>>>>>>> if required") has added general code printing in kexec_file_load(), >>>>>>>> but not in kexec_load(). >>>>>>>> >>>>>>>> Especially in the RISC-V architecture, kexec_image_info() has been >>>>>>>> removed(commit eb7622d908a0 ("kexec_file, riscv: print out debugging >>>>>>>> message if required")). As a result, when using '-d' for the kexec_load >>>>>>>> interface, print nothing in the kernel space. This might be helpful for >>>>>>>> verifying the accuracy of the data passed to the kernel. Therefore, >>>>>>>> refer to this commit a85ee18c7900 ("kexec_file: print out debugging >>>>>>>> message if required"), debug print information has been added. >>>>>>>> >>>>>>>> Signed-off-by: Qiang Ma >>>>>>>> Reported-by: kernel test robot >>>>>>>> Closes: https://lore.kernel.org/oe-kbuild-all/202510310332.6XrLe70K-lkp at intel.com/ >>>>>>>> --- >>>>>>>> kernel/kexec.c | 11 +++++++++++ >>>>>>>> 1 file changed, 11 insertions(+) >>>>>>>> >>>>>>>> diff --git a/kernel/kexec.c b/kernel/kexec.c >>>>>>>> index c7a869d32f87..9b433b972cc1 100644 >>>>>>>> --- a/kernel/kexec.c >>>>>>>> +++ b/kernel/kexec.c >>>>>>>> @@ -154,7 +154,15 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, >>>>>>>> if (ret) >>>>>>>> goto out; >>>>>>>> + kexec_dprintk("nr_segments = %lu\n", nr_segments); >>>>>>>> for (i = 0; i < nr_segments; i++) { >>>>>>>> + struct kexec_segment *ksegment; >>>>>>>> + >>>>>>>> + ksegment = &image->segment[i]; >>>>>>>> + kexec_dprintk("segment[%lu]: buf=0x%p bufsz=0x%zx mem=0x%lx memsz=0x%zx\n", >>>>>>>> + i, ksegment->buf, ksegment->bufsz, ksegment->mem, >>>>>>>> + ksegment->memsz); >>>>>>> There has already been a print_segments() in kexec-tools/kexec/kexec.c, >>>>>>> you will get duplicated printing.
That sounds not good. Have you tested >>>>>>> this? >>>>>> I have tested it; kexec-tools prints the debug messages >>>>>> in user space, while kexec_dprintk prints >>>>>> in kernel space. >>>>>> >>>>>> This might be helpful for verifying the accuracy of >>>>>> the data passed to the kernel. >>>>> Hmm, that's not necessary with a debug printing to verify values passed >>>>> into the kernel. We should only add debug printing when we need but lack it. >>>>> I didn't check it carefully, if you add the debug printing only for >>>>> verifying accuracy, that doesn't justify the code change. >>>> It's not entirely because of it. >>>> >>>> Another reason is that for RISC-V, for kexec_file_load interface, >>>> kexec_image_info() was deleted at that time because the content >>>> has been printed out in generic code. >>>> >>>> However, these contents were not printed in kexec_load because >>>> kexec_image_info was deleted. So now it has been added. >>> print_segments() in kexec-tools/kexec/kexec.c is a generic function, >>> shouldn't it be called in kexec-tools for risc-v? I am confused by >>> the purpose of this patchset. >> There is a problem with what I expressed. >> I don't want to add print_segments to riscv. >> I want to add some debugging messages (ksegment, kimage, flags) for kexec_load. >> >> Although the ksegment debugging messages have been printed in kexec-tools, >> they are still helpful for debugging the kernel-space function. > Sorry, I can't support that. With the kexec_load interface, the loading > segments are all prepared in kexec-tools for the future jump. And calling > print_segments() to print that loading information is natural. Why do we > need to print them twice to verify that the printing is accurate? Is it necessary to verify the user-space data after it is passed to the kernel space? > Could you explain why risc-v is special?
At first, when I saw that kexec_image_info() had been removed for the RISC-V architecture by commit eb7622d908a0 ("kexec_file, riscv: print out debugging message if required"), I thought only kexec_file_load had been taken into consideration: that commit did not account for the fact that kexec_load also used to call kexec_image_info() to print the segments and other debugging messages, which were lost once it was deleted. So I referred to kexec_file_load and added these debugging messages to the generic code of kexec_load. In this way, all architectures can print these generic debugging messages. >>>>>>>> + >>>>>>>> ret = kimage_load_segment(image, i); >>>>>>>> if (ret) >>>>>>>> goto out; >>>>>>>> @@ -166,6 +174,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, >>>>>>>> if (ret) >>>>>>>> goto out; >>>>>>>> + kexec_dprintk("kexec_load: type:%u, start:0x%lx head:0x%lx flags:0x%lx\n", >>>>>>>> + image->type, image->start, image->head, flags); >>>>>>>> + >>>>>>>> /* Install the new kernel and uninstall the old */ >>>>>>>> image = xchg(dest_image, image); >>>>>>>> -- >>>>>>>> 2.20.1 >>>>>>>> > From rppt at kernel.org Wed Nov 5 09:37:12 2025 From: rppt at kernel.org (Mike Rapoport) Date: Wed, 5 Nov 2025 19:37:12 +0200 Subject: [PATCH] MAINTAINERS: add myself as a reviewer for KHO In-Reply-To: <20251105102022.18798-1-pratyush@kernel.org> References: <20251105102022.18798-1-pratyush@kernel.org> Message-ID: On Wed, Nov 05, 2025 at 11:20:19AM +0100, Pratyush Yadav wrote: > I have been reviewing most patches for KHO already, and it is easier to > spot them if I am directly in Cc.
> > Signed-off-by: Pratyush Yadav Acked-by: Mike Rapoport (Microsoft) > --- > MAINTAINERS | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/MAINTAINERS b/MAINTAINERS > index 8ee7cb5fe838f..3c85bb0e381fc 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -13789,6 +13789,7 @@ KEXEC HANDOVER (KHO) > M: Alexander Graf > M: Mike Rapoport > M: Pasha Tatashin > +R: Pratyush Yadav > L: kexec at lists.infradead.org > L: linux-mm at kvack.org > S: Maintained > > base-commit: d25eefc46daf21bd1ebbc699f0ffd7fe11d92296 > -- > 2.47.3 > -- Sincerely yours, Mike. From rppt at kernel.org Wed Nov 5 09:39:09 2025 From: rppt at kernel.org (Mike Rapoport) Date: Wed, 5 Nov 2025 19:39:09 +0200 Subject: [PATCH] MAINTAINERS: extend file entry in KHO to include subdirectories In-Reply-To: <20251104143238.119803-1-lukas.bulwahn@redhat.com> References: <20251104143238.119803-1-lukas.bulwahn@redhat.com> Message-ID: On Tue, Nov 04, 2025 at 03:32:38PM +0100, Lukas Bulwahn wrote: > From: Lukas Bulwahn > > Commit 3498209ff64e ("Documentation: add documentation for KHO") adds the > file entry for 'Documentation/core-api/kho/*'. The asterisk at the end > means that all files in kho are included, but not files in its > subdirectories below. > Hence, the files under Documentation/core-api/kho/bindings/ are not > considered part of KHO, and get_maintainers.pl does not necessarily add the > KHO maintainers to the recipients of patches to those files. Probably, this > is not intended, though, and it was simply an oversight of the detailed > semantics of such file entries. > > Make the file entry include the subdirectories of > Documentation/core-api/kho/.
> > Signed-off-by: Lukas Bulwahn Acked-by: Mike Rapoport (Microsoft) > --- > MAINTAINERS | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/MAINTAINERS b/MAINTAINERS > index 06ff926c5331..499b52d7793f 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -13836,7 +13836,7 @@ L: kexec at lists.infradead.org > L: linux-mm at kvack.org > S: Maintained > F: Documentation/admin-guide/mm/kho.rst > -F: Documentation/core-api/kho/* > +F: Documentation/core-api/kho/ > F: include/linux/kexec_handover.h > F: kernel/kexec_handover.c > F: tools/testing/selftests/kho/ > -- > 2.51.1 > -- Sincerely yours, Mike. From pasha.tatashin at soleen.com Wed Nov 5 12:07:25 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Wed, 5 Nov 2025 15:07:25 -0500 Subject: [PATCH] MAINTAINERS: add myself as a reviewer for KHO In-Reply-To: <20251105102022.18798-1-pratyush@kernel.org> References: <20251105102022.18798-1-pratyush@kernel.org> Message-ID: Reviewed-by: Pasha Tatashin On Wed, Nov 5, 2025 at 5:20?AM Pratyush Yadav wrote: > > I have been reviewing most patches for KHO already, and it is easier to > spot them if I am directly in Cc. 
> > Signed-off-by: Pratyush Yadav > --- > MAINTAINERS | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/MAINTAINERS b/MAINTAINERS > index 8ee7cb5fe838f..3c85bb0e381fc 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -13789,6 +13789,7 @@ KEXEC HANDOVER (KHO) > M: Alexander Graf > M: Mike Rapoport > M: Pasha Tatashin > +R: Pratyush Yadav > L: kexec at lists.infradead.org > L: linux-mm at kvack.org > S: Maintained > > base-commit: d25eefc46daf21bd1ebbc699f0ffd7fe11d92296 > -- > 2.47.3 > From akpm at linux-foundation.org Wed Nov 5 16:14:32 2025 From: akpm at linux-foundation.org (Andrew Morton) Date: Wed, 5 Nov 2025 16:14:32 -0800 Subject: [PATCH 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: <20251105130922.13321-2-piliu@redhat.com> References: <20251105130922.13321-1-piliu@redhat.com> <20251105130922.13321-2-piliu@redhat.com> Message-ID: <20251105161432.98eb69f87f30627a9067e78e@linux-foundation.org> On Wed, 5 Nov 2025 21:09:22 +0800 Pingfan Liu wrote: > When I tested kexec with the latest kernel, I ran into the following warning: > > [ 40.712410] ------------[ cut here ]------------ > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > [...] > [ 40.816047] Call trace: > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > [...] > [ 40.855423] ---[ end trace 0000000000000000 ]--- > > This is caused by the fact that kexec allocates the destination directly > in the CMA area. In that case, the CMA kernel address should be exported > directly to the IMA component, instead of using the vmalloc'd address. This is something we should backport into earlier kernels.
> Signed-off-by: Pingfan Liu > Cc: Andrew Morton > Cc: Baoquan He > Cc: Alexander Graf > Cc: Steven Chen > Cc: linux-integrity at vger.kernel.org > To: kexec at lists.infradead.org So I'm thinking we should add Fixes: 0091d9241ea2 ("kexec: define functions to map and unmap segments") Cc: yes? From piliu at redhat.com Wed Nov 5 17:15:28 2025 From: piliu at redhat.com (Pingfan Liu) Date: Thu, 6 Nov 2025 09:15:28 +0800 Subject: [PATCH 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: <20251105161432.98eb69f87f30627a9067e78e@linux-foundation.org> References: <20251105130922.13321-1-piliu@redhat.com> <20251105130922.13321-2-piliu@redhat.com> <20251105161432.98eb69f87f30627a9067e78e@linux-foundation.org> Message-ID: On Thu, Nov 6, 2025 at 8:14?AM Andrew Morton wrote: > > On Wed, 5 Nov 2025 21:09:22 +0800 Pingfan Liu wrote: > > > When I tested kexec with the latest kernel, I ran into the following warning: > > > > [ 40.712410] ------------[ cut here ]------------ > > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > > [...] > > [ 40.816047] Call trace: > > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > > [...] > > [ 40.855423] ---[ end trace 0000000000000000 ]--- > > > > This is caused by the fact that kexec allocates the destination directly > > in the CMA area. In that case, the CMA kernel address should be exported > > directly to the IMA component, instead of using the vmalloc'd address. > > This is something we should backport into tearlier kernels. > > > Signed-off-by: Pingfan Liu > > Cc: Andrew Morton > > Cc: Baoquan He > > Cc: Alexander Graf > > Cc: Steven Chen > > Cc: linux-integrity at vger.kernel.org > > To: kexec at lists.infradead.org > > So I'm thinking we should add > > Fixes: 0091d9241ea2 ("kexec: define functions to map and unmap segments") > Cc: > > yes? > Yes, it should be. 
Thanks for your help! Best Regards, Pingfan From bhe at redhat.com Wed Nov 5 18:03:56 2025 From: bhe at redhat.com (Baoquan He) Date: Thu, 6 Nov 2025 10:03:56 +0800 Subject: [PATCH 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: <20251105130922.13321-2-piliu@redhat.com> References: <20251105130922.13321-1-piliu@redhat.com> <20251105130922.13321-2-piliu@redhat.com> Message-ID: Hi Pingfan, On 11/05/25 at 09:09pm, Pingfan Liu wrote: > When I tested kexec with the latest kernel, I ran into the following warning: > > [ 40.712410] ------------[ cut here ]------------ > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > [...] > [ 40.816047] Call trace: > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > [...] > [ 40.855423] ---[ end trace 0000000000000000 ]--- > > This is caused by the fact that kexec allocates the destination directly > in the CMA area. In that case, the CMA kernel address should be exported > directly to the IMA component, instead of using the vmalloc'd address. > > Signed-off-by: Pingfan Liu > Cc: Andrew Morton > Cc: Baoquan He > Cc: Alexander Graf > Cc: Steven Chen > Cc: linux-integrity at vger.kernel.org > To: kexec at lists.infradead.org > --- > kernel/kexec_core.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > index 9a1966207041..abe40286a02c 100644 > --- a/kernel/kexec_core.c > +++ b/kernel/kexec_core.c > @@ -967,6 +967,7 @@ void *kimage_map_segment(struct kimage *image, int idx) > kimage_entry_t *ptr, entry; > struct page **src_pages; > unsigned int npages; > + struct page *cma; > void *vaddr = NULL; > int i; > > @@ -974,6 +975,9 @@ void *kimage_map_segment(struct kimage *image, int idx) > size = image->segment[idx].memsz; > eaddr = addr + size; > > + cma = image->segment_cma[idx]; Thanks for your fix. 
But I totally can't get what you are doing. The idx passed into kimage_map_segment() could index image->segment[], and can index image->segment_cma[], could you reconsider and make the code more reasonable? > + if (cma) > + return cma; > /* > * Collect the source pages and map them in a contiguous VA range. > */ > @@ -1014,7 +1018,8 @@ void *kimage_map_segment(struct kimage *image, int idx) > > void kimage_unmap_segment(void *segment_buffer) > { > - vunmap(segment_buffer); > + if (is_vmalloc_addr(segment_buffer)) > + vunmap(segment_buffer); > } > > struct kexec_load_limit { > -- > 2.49.0 > From piliu at redhat.com Wed Nov 5 18:33:17 2025 From: piliu at redhat.com (Pingfan Liu) Date: Thu, 6 Nov 2025 10:33:17 +0800 Subject: [PATCH 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: References: <20251105130922.13321-1-piliu@redhat.com> <20251105130922.13321-2-piliu@redhat.com> Message-ID: Hi Baoquan, Thanks for your review. Please see the comment below. On Thu, Nov 6, 2025 at 10:04?AM Baoquan He wrote: > > Hi Pingfan, > > On 11/05/25 at 09:09pm, Pingfan Liu wrote: > > When I tested kexec with the latest kernel, I ran into the following warning: > > > > [ 40.712410] ------------[ cut here ]------------ > > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > > [...] > > [ 40.816047] Call trace: > > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > > [...] > > [ 40.855423] ---[ end trace 0000000000000000 ]--- > > > > This is caused by the fact that kexec allocates the destination directly > > in the CMA area. In that case, the CMA kernel address should be exported > > directly to the IMA component, instead of using the vmalloc'd address. 
> > > > Signed-off-by: Pingfan Liu > > Cc: Andrew Morton > > Cc: Baoquan He > > Cc: Alexander Graf > > Cc: Steven Chen > > Cc: linux-integrity at vger.kernel.org > > To: kexec at lists.infradead.org > > --- > > kernel/kexec_core.c | 7 ++++++- > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > > index 9a1966207041..abe40286a02c 100644 > > --- a/kernel/kexec_core.c > > +++ b/kernel/kexec_core.c > > @@ -967,6 +967,7 @@ void *kimage_map_segment(struct kimage *image, int idx) > > kimage_entry_t *ptr, entry; > > struct page **src_pages; > > unsigned int npages; > > + struct page *cma; > > void *vaddr = NULL; > > int i; > > > > @@ -974,6 +975,9 @@ void *kimage_map_segment(struct kimage *image, int idx) > > size = image->segment[idx].memsz; > > eaddr = addr + size; > > > > + cma = image->segment_cma[idx]; > > Thanks for your fix. But I totally can't get what you are doing. The idx > passed into kimage_map_segment() could index image->segment[], and can > index image->segment_cma[], could you reconsider and make the code more > reasonable? > Since idx can index both image->segment[] and segment_cma[], the behavior differs based on whether segment_cma[idx] is NULL: - If segment_cma[idx] is not NULL, it points directly to the final target location, eliminating the need for data copying that traditional kexec relocation requires. - If segment_cma[idx] is NULL, the segment relies on the traditional kexec relocation code to copy its data. Thanks, Pingfan > > + if (cma) > > + return cma; > > /* > > * Collect the source pages and map them in a contiguous VA range. 
> > */ > > @@ -1014,7 +1018,8 @@ void *kimage_map_segment(struct kimage *image, int idx) > > > > void kimage_unmap_segment(void *segment_buffer) > > { > > - vunmap(segment_buffer); > > + if (is_vmalloc_addr(segment_buffer)) > > + vunmap(segment_buffer); > > } > > > > struct kexec_load_limit { > > -- > > 2.49.0 > > > From piliu at redhat.com Wed Nov 5 18:57:33 2025 From: piliu at redhat.com (Pingfan Liu) Date: Thu, 6 Nov 2025 10:57:33 +0800 Subject: [PATCH 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: <20251105161432.98eb69f87f30627a9067e78e@linux-foundation.org> References: <20251105130922.13321-1-piliu@redhat.com> <20251105130922.13321-2-piliu@redhat.com> <20251105161432.98eb69f87f30627a9067e78e@linux-foundation.org> Message-ID: Hi Andrew, Thanks for your help, but on second thought, I think the Fixes commit is wrong. On Thu, Nov 6, 2025 at 8:14?AM Andrew Morton wrote: > > On Wed, 5 Nov 2025 21:09:22 +0800 Pingfan Liu wrote: > > > When I tested kexec with the latest kernel, I ran into the following warning: > > > > [ 40.712410] ------------[ cut here ]------------ > > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > > [...] > > [ 40.816047] Call trace: > > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > > [...] > > [ 40.855423] ---[ end trace 0000000000000000 ]--- > > > > This is caused by the fact that kexec allocates the destination directly > > in the CMA area. In that case, the CMA kernel address should be exported > > directly to the IMA component, instead of using the vmalloc'd address. > > This is something we should backport into tearlier kernels. 
> > > Signed-off-by: Pingfan Liu > > Cc: Andrew Morton > > Cc: Baoquan He > > Cc: Alexander Graf > > Cc: Steven Chen > > Cc: linux-integrity at vger.kernel.org > > To: kexec at lists.infradead.org > > So I'm thinking we should add > > Fixes: 0091d9241ea2 ("kexec: define functions to map and unmap segments") Should be: Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") Because 07d24902977e came after 0091d9241ea2 and introduced this issue. Thanks, Pingfan > Cc: > > yes? > From bhe at redhat.com Wed Nov 5 19:21:54 2025 From: bhe at redhat.com (Baoquan He) Date: Thu, 6 Nov 2025 11:21:54 +0800 Subject: [PATCH 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: References: <20251105130922.13321-1-piliu@redhat.com> <20251105130922.13321-2-piliu@redhat.com> Message-ID: On 11/06/25 at 10:33am, Pingfan Liu wrote: > Hi Baoquan, > > Thanks for your review. Please see the comment below. > > On Thu, Nov 6, 2025 at 10:04?AM Baoquan He wrote: > > > > Hi Pingfan, > > > > On 11/05/25 at 09:09pm, Pingfan Liu wrote: > > > When I tested kexec with the latest kernel, I ran into the following warning: > > > > > > [ 40.712410] ------------[ cut here ]------------ > > > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > > > [...] > > > [ 40.816047] Call trace: > > > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > > > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > > > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > > > [...] > > > [ 40.855423] ---[ end trace 0000000000000000 ]--- > > > > > > This is caused by the fact that kexec allocates the destination directly > > > in the CMA area. In that case, the CMA kernel address should be exported > > > directly to the IMA component, instead of using the vmalloc'd address. 
> > > > > > Signed-off-by: Pingfan Liu > > > Cc: Andrew Morton > > > Cc: Baoquan He > > > Cc: Alexander Graf > > > Cc: Steven Chen > > > Cc: linux-integrity at vger.kernel.org > > > To: kexec at lists.infradead.org > > > --- > > > kernel/kexec_core.c | 7 ++++++- > > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > > > index 9a1966207041..abe40286a02c 100644 > > > --- a/kernel/kexec_core.c > > > +++ b/kernel/kexec_core.c > > > @@ -967,6 +967,7 @@ void *kimage_map_segment(struct kimage *image, int idx) > > > kimage_entry_t *ptr, entry; > > > struct page **src_pages; > > > unsigned int npages; > > > + struct page *cma; > > > void *vaddr = NULL; > > > int i; > > > > > > @@ -974,6 +975,9 @@ void *kimage_map_segment(struct kimage *image, int idx) > > > size = image->segment[idx].memsz; > > > eaddr = addr + size; > > > > > > + cma = image->segment_cma[idx]; > > > > Thanks for your fix. But I totally can't get what you are doing. The idx > > passed into kimage_map_segment() could index image->segment[], and can > > index image->segment_cma[], could you reconsider and make the code more > > reasonable? > > > > Since idx can index both image->segment[] and segment_cma[], the > behavior differs based on whether segment_cma[idx] is NULL: > > - If segment_cma[idx] is not NULL, it points directly to the final > target location, eliminating the need for data copying that > traditional kexec relocation requires. > - If segment_cma[idx] is NULL, the segment relies on the traditional > kexec relocation code to copy its data. I see, thanks. While image->segment_cma[idx] records the struct page of the relevant cma area, but not virtual address. Is it OK for IMA later to update? ima_kexec_buffer is supposed to be a virtual address, wondering how IMA behaved in this case. 
From sourabhjain at linux.ibm.com Wed Nov 5 20:51:02 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Thu, 6 Nov 2025 10:21:02 +0530 Subject: [PATCH v2 0/5] kexec: reorganize sysfs interface and add new kexec sysfs Message-ID: <20251106045107.17813-1-sourabhjain@linux.ibm.com> All existing kexec and kdump sysfs entries are moved to a new location, /sys/kernel/kexec, to keep /sys/kernel/ clean and better organized. Symlinks are created at the old locations for backward compatibility and can be removed in the future [02/05]. While doing this cleanup, missing ABI documentation for the old sysfs interfaces is added, and those entries are marked as deprecated [01/05 and 03/05]. New ABI documentation is also added for the reorganized interfaces. [04/05] Along with this reorganization, a new sysfs file, /sys/kernel/kexec/crash_cma_ranges, is introduced to export crashkernel CMA reservation details to user space [05/05]. This helps tools determine the total crashkernel reserved memory and warn users that capturing user pages while CMA is reserved may cause incomplete or unreliable dumps. 
Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Sourabh Jain (5): Documentation/ABI: add kexec and kdump sysfs interface kexec: move sysfs entries to /sys/kernel/kexec Documentation/ABI: mark old kexec sysfs deprecated kexec: document new kexec and kdump sysfs ABIs crash: export crashkernel CMA reservation to userspace .../ABI/obsolete/sysfs-kernel-kexec-kdump | 59 +++++++++ .../ABI/testing/sysfs-kernel-kexec-kdump | 61 +++++++++ kernel/kexec_core.c | 118 ++++++++++++++++++ kernel/ksysfs.c | 68 +--------- 4 files changed, 239 insertions(+), 67 deletions(-) create mode 100644 Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump create mode 100644 Documentation/ABI/testing/sysfs-kernel-kexec-kdump -- 2.51.0 From sourabhjain at linux.ibm.com Wed Nov 5 20:51:03 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Thu, 6 Nov 2025 10:21:03 +0530 Subject: [PATCH v2 1/5] Documentation/ABI: add kexec and kdump sysfs interface In-Reply-To: <20251106045107.17813-1-sourabhjain@linux.ibm.com> References: <20251106045107.17813-1-sourabhjain@linux.ibm.com> Message-ID: <20251106045107.17813-2-sourabhjain@linux.ibm.com> Add an ABI document for following kexec and kdump sysfs interface: - /sys/kernel/kexec_loaded - /sys/kernel/kexec_crash_loaded - /sys/kernel/kexec_crash_size - /sys/kernel/crash_elfcorehdr_size Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../ABI/testing/sysfs-kernel-kexec-kdump | 43 +++++++++++++++++++ 1 file changed, 43 insertions(+) create 
mode 100644 Documentation/ABI/testing/sysfs-kernel-kexec-kdump diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump new file mode 100644 index 000000000000..96b24565b68e --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -0,0 +1,43 @@ +What: /sys/kernel/kexec_loaded +Date: Jun 2006 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a new kernel image has been loaded + into memory using the kexec system call. It shows 1 if + a kexec image is present and ready to boot, or 0 if none + is loaded. +User: kexec tools, kdump service + +What: /sys/kernel/kexec_crash_loaded +Date: Jun 2006 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a crash (kdump) kernel is currently + loaded into memory. It shows 1 if a crash kernel has been + successfully loaded for panic handling, or 0 if no crash + kernel is present. +User: Kexec tools, Kdump service + +What: /sys/kernel/kexec_crash_size +Date: Dec 2009 +Contact: kexec at lists.infradead.org +Description: read/write + Shows the amount of memory reserved for loading the crash + (kdump) kernel. It reports the size, in bytes, of the + crash kernel area defined by the crashkernel= parameter. + This interface also allows reducing the crashkernel + reservation by writing a smaller value, and the reclaimed + space is added back to the system RAM. +User: Kdump service + +What: /sys/kernel/crash_elfcorehdr_size +Date: Aug 2023 +Contact: kexec at lists.infradead.org +Description: read only + Indicates the preferred size of the memory buffer for the + ELF core header used by the crash (kdump) kernel. It defines + how much space is needed to hold metadata about the crashed + system, including CPU and memory information. This information + is used by the user space utility kexec to support updating the + in-kernel kdump image during hotplug operations. 
+User: Kexec tools -- 2.51.0 From sourabhjain at linux.ibm.com Wed Nov 5 20:51:04 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Thu, 6 Nov 2025 10:21:04 +0530 Subject: [PATCH v2 2/5] kexec: move sysfs entries to /sys/kernel/kexec In-Reply-To: <20251106045107.17813-1-sourabhjain@linux.ibm.com> References: <20251106045107.17813-1-sourabhjain@linux.ibm.com> Message-ID: <20251106045107.17813-3-sourabhjain@linux.ibm.com> Several kexec and kdump sysfs entries are currently placed directly under /sys/kernel/, which clutters the directory and makes the kexec-related entries harder to pick out among unrelated ones. To improve organization and readability, these entries are now moved under a dedicated directory, /sys/kernel/kexec. For backward compatibility, symlinks are created at the old locations in /sys/kernel/, pointing to the new locations under /sys/kernel/kexec/, so that existing tools and scripts continue to work. These symlinks can be removed in the future once users have switched to the new path. If an error occurs while adding a symlink, it is logged but does not stop creation of the remaining kexec sysfs symlinks. The crash_elfcorehdr_size entry is now controlled by CONFIG_CRASH_DUMP instead of CONFIG_VMCORE_INFO, as CONFIG_CRASH_DUMP also enables CONFIG_VMCORE_INFO. 
Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- kernel/kexec_core.c | 118 ++++++++++++++++++++++++++++++++++++++++++++ kernel/ksysfs.c | 68 +------------------------ 2 files changed, 119 insertions(+), 67 deletions(-) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index fa00b239c5d9..2e12a164e870 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -41,6 +41,7 @@ #include #include #include +#include #include #include @@ -1229,3 +1230,120 @@ int kernel_kexec(void) kexec_unlock(); return error; } + +static ssize_t loaded_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", !!kexec_image); +} +static struct kobj_attribute loaded_attr = __ATTR_RO(loaded); + +#ifdef CONFIG_CRASH_DUMP +static ssize_t crash_loaded_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", kexec_crash_loaded()); +} +static struct kobj_attribute crash_loaded_attr = __ATTR_RO(crash_loaded); + +static ssize_t crash_size_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + ssize_t size = crash_get_memory_size(); + + if (size < 0) + return size; + + return sysfs_emit(buf, "%zd\n", size); +} +static ssize_t crash_size_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned long cnt; + int ret; + + if (kstrtoul(buf, 0, &cnt)) + return -EINVAL; + + ret = crash_shrink_memory(cnt); + return ret < 0 ? 
ret : count; +} +static struct kobj_attribute crash_size_attr = __ATTR_RW(crash_size); + +#ifdef CONFIG_CRASH_HOTPLUG +static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + unsigned int sz = crash_get_elfcorehdr_size(); + + return sysfs_emit(buf, "%u\n", sz); +} +static struct kobj_attribute crash_elfcorehdr_size_attr = __ATTR_RO(crash_elfcorehdr_size); + +#endif /* CONFIG_CRASH_HOTPLUG */ +#endif /* CONFIG_CRASH_DUMP */ + +static struct attribute *kexec_attrs[] = { + &loaded_attr.attr, +#ifdef CONFIG_CRASH_DUMP + &crash_loaded_attr.attr, + &crash_size_attr.attr, +#ifdef CONFIG_CRASH_HOTPLUG + &crash_elfcorehdr_size_attr.attr, +#endif +#endif + NULL +}; + +struct kexec_link_entry { + const char *target; + const char *name; +}; + +static struct kexec_link_entry kexec_links[] = { + { "loaded", "kexec_loaded" }, +#ifdef CONFIG_CRASH_DUMP + { "crash_loaded", "kexec_crash_loaded" }, + { "crash_size", "kexec_crash_size" }, +#ifdef CONFIG_CRASH_HOTPLUG + { "crash_elfcorehdr_size", "crash_elfcorehdr_size" }, +#endif +#endif + +}; + +struct kobject *kexec_kobj; +ATTRIBUTE_GROUPS(kexec); + +static int __init init_kexec_sysctl(void) +{ + int error; + int i; + + kexec_kobj = kobject_create_and_add("kexec", kernel_kobj); + if (!kexec_kobj) { + pr_err("failed to create kexec kobject\n"); + return -ENOMEM; + } + + error = sysfs_create_groups(kexec_kobj, kexec_groups); + if (error) + goto kset_exit; + + for (i = 0; i < ARRAY_SIZE(kexec_links); i++) { + error = compat_only_sysfs_link_entry_to_kobj(kernel_kobj, kexec_kobj, + kexec_links[i].target, + kexec_links[i].name); + if (error) + pr_err("Unable to create %s symlink (%d)", kexec_links[i].name, error); + } + + return 0; + +kset_exit: + kobject_put(kexec_kobj); + return error; +} + +subsys_initcall(init_kexec_sysctl); diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c index eefb67d9883c..a9e6354d9e25 100644 --- a/kernel/ksysfs.c +++ b/kernel/ksysfs.c @@ -12,7 +12,7 @@ #include 
#include #include -#include +#include #include #include #include @@ -119,50 +119,6 @@ static ssize_t profiling_store(struct kobject *kobj, KERNEL_ATTR_RW(profiling); #endif -#ifdef CONFIG_KEXEC_CORE -static ssize_t kexec_loaded_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - return sysfs_emit(buf, "%d\n", !!kexec_image); -} -KERNEL_ATTR_RO(kexec_loaded); - -#ifdef CONFIG_CRASH_DUMP -static ssize_t kexec_crash_loaded_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - return sysfs_emit(buf, "%d\n", kexec_crash_loaded()); -} -KERNEL_ATTR_RO(kexec_crash_loaded); - -static ssize_t kexec_crash_size_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - ssize_t size = crash_get_memory_size(); - - if (size < 0) - return size; - - return sysfs_emit(buf, "%zd\n", size); -} -static ssize_t kexec_crash_size_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count) -{ - unsigned long cnt; - int ret; - - if (kstrtoul(buf, 0, &cnt)) - return -EINVAL; - - ret = crash_shrink_memory(cnt); - return ret < 0 ? 
ret : count; -} -KERNEL_ATTR_RW(kexec_crash_size); - -#endif /* CONFIG_CRASH_DUMP*/ -#endif /* CONFIG_KEXEC_CORE */ - #ifdef CONFIG_VMCORE_INFO static ssize_t vmcoreinfo_show(struct kobject *kobj, @@ -174,18 +130,6 @@ static ssize_t vmcoreinfo_show(struct kobject *kobj, } KERNEL_ATTR_RO(vmcoreinfo); -#ifdef CONFIG_CRASH_HOTPLUG -static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - unsigned int sz = crash_get_elfcorehdr_size(); - - return sysfs_emit(buf, "%u\n", sz); -} -KERNEL_ATTR_RO(crash_elfcorehdr_size); - -#endif - #endif /* CONFIG_VMCORE_INFO */ /* whether file capabilities are enabled */ @@ -255,18 +199,8 @@ static struct attribute * kernel_attrs[] = { #ifdef CONFIG_PROFILING &profiling_attr.attr, #endif -#ifdef CONFIG_KEXEC_CORE - &kexec_loaded_attr.attr, -#ifdef CONFIG_CRASH_DUMP - &kexec_crash_loaded_attr.attr, - &kexec_crash_size_attr.attr, -#endif -#endif #ifdef CONFIG_VMCORE_INFO &vmcoreinfo_attr.attr, -#ifdef CONFIG_CRASH_HOTPLUG - &crash_elfcorehdr_size_attr.attr, -#endif #endif #ifndef CONFIG_TINY_RCU &rcu_expedited_attr.attr, -- 2.51.0 From sourabhjain at linux.ibm.com Wed Nov 5 20:51:05 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Thu, 6 Nov 2025 10:21:05 +0530 Subject: [PATCH v2 3/5] Documentation/ABI: mark old kexec sysfs deprecated In-Reply-To: <20251106045107.17813-1-sourabhjain@linux.ibm.com> References: <20251106045107.17813-1-sourabhjain@linux.ibm.com> Message-ID: <20251106045107.17813-4-sourabhjain@linux.ibm.com> The previous commit ("kexec: move sysfs entries to /sys/kernel/kexec") moved all existing kexec sysfs entries to a new location. The ABI document is updated to include a note about the deprecation of the old kexec sysfs entries. 
The following kexec sysfs entries are deprecated: - /sys/kernel/kexec_loaded - /sys/kernel/kexec_crash_loaded - /sys/kernel/kexec_crash_size - /sys/kernel/crash_elfcorehdr_size Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../sysfs-kernel-kexec-kdump | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) rename Documentation/ABI/{testing => obsolete}/sysfs-kernel-kexec-kdump (61%) diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump similarity index 61% rename from Documentation/ABI/testing/sysfs-kernel-kexec-kdump rename to Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump index 96b24565b68e..96b4d41721cc 100644 --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump +++ b/Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump @@ -1,3 +1,19 @@ +NOTE: all the ABIs listed in this file are deprecated and will be removed after 2028. 
+ +Here are the alternative ABIs: ++------------------------------------+-----------------------------------------+ +| Deprecated | Alternative | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_loaded | /sys/kernel/kexec/loaded | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_crash_loaded | /sys/kernel/kexec/crash_loaded | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_crash_size | /sys/kernel/kexec/crash_size | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/crash_elfcorehdr_size | /sys/kernel/kexec/crash_elfcorehdr_size | ++------------------------------------+-----------------------------------------+ + + What: /sys/kernel/kexec_loaded Date: Jun 2006 Contact: kexec at lists.infradead.org -- 2.51.0 From sourabhjain at linux.ibm.com Wed Nov 5 20:51:06 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Thu, 6 Nov 2025 10:21:06 +0530 Subject: [PATCH v2 4/5] kexec: document new kexec and kdump sysfs ABIs In-Reply-To: <20251106045107.17813-1-sourabhjain@linux.ibm.com> References: <20251106045107.17813-1-sourabhjain@linux.ibm.com> Message-ID: <20251106045107.17813-5-sourabhjain@linux.ibm.com> Add an ABI document for following kexec and kdump sysfs interface: - /sys/kernel/kexec/loaded - /sys/kernel/kexec/crash_loaded - /sys/kernel/kexec/crash_size - /sys/kernel/kexec/crash_elfcorehdr_size Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../ABI/testing/sysfs-kernel-kexec-kdump | 51 +++++++++++++++++++ 1 file changed, 51 insertions(+) create mode 100644 
Documentation/ABI/testing/sysfs-kernel-kexec-kdump diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump new file mode 100644 index 000000000000..00c00f380fea --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -0,0 +1,51 @@ +What: /sys/kernel/kexec/* +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: + The /sys/kernel/kexec/* directory contains sysfs files + that provide information about the configuration status + of kexec and kdump. + +What: /sys/kernel/kexec/loaded +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a new kernel image has been loaded + into memory using the kexec system call. It shows 1 if + a kexec image is present and ready to boot, or 0 if none + is loaded. +User: kexec tools, kdump service + +What: /sys/kernel/kexec/crash_loaded +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a crash (kdump) kernel is currently + loaded into memory. It shows 1 if a crash kernel has been + successfully loaded for panic handling, or 0 if no crash + kernel is present. +User: Kexec tools, Kdump service + +What: /sys/kernel/kexec/crash_size +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read/write + Shows the amount of memory reserved for loading the crash + (kdump) kernel. It reports the size, in bytes, of the + crash kernel area defined by the crashkernel= parameter. + This interface also allows reducing the crashkernel + reservation by writing a smaller value, and the reclaimed + space is added back to the system RAM. +User: Kdump service + +What: /sys/kernel/kexec/crash_elfcorehdr_size +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates the preferred size of the memory buffer for the + ELF core header used by the crash (kdump) kernel. 
It defines + how much space is needed to hold metadata about the crashed + system, including CPU and memory information. This information + is used by the user space utility kexec to support updating the + in-kernel kdump image during hotplug operations. +User: Kexec tools -- 2.51.0 From sourabhjain at linux.ibm.com Wed Nov 5 20:51:07 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Thu, 6 Nov 2025 10:21:07 +0530 Subject: [PATCH v2 5/5] crash: export crashkernel CMA reservation to userspace In-Reply-To: <20251106045107.17813-1-sourabhjain@linux.ibm.com> References: <20251106045107.17813-1-sourabhjain@linux.ibm.com> Message-ID: <20251106045107.17813-6-sourabhjain@linux.ibm.com> Add a sysfs entry /sys/kernel/kexec/crash_cma_ranges to expose all CMA crashkernel ranges. This allows userspace tools configuring kdump to determine how much memory is reserved for the crashkernel. If CMA is used, tools can warn users when they attempt to capture user pages while a CMA reservation is in place. The new sysfs file holds the CMA ranges in the following format: cat /sys/kernel/kexec/crash_cma_ranges 100000000-10c7fffff Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- Documentation/ABI/testing/sysfs-kernel-kexec-kdump | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump index 00c00f380fea..f59051b5d96d 100644 --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -49,3 +49,13 @@ Description: read only is used by the user space utility kexec to support updating the in-kernel kdump image during hotplug operations. 
User: Kexec tools + +What: /sys/kernel/kexec/crash_cma_ranges +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Provides information about the memory ranges reserved from + the Contiguous Memory Allocator (CMA) area that are allocated + to the crash (kdump) kernel. It lists the start and end physical + addresses of CMA regions assigned for crashkernel use. +User: kdump service -- 2.51.0 From piliu at redhat.com Wed Nov 5 22:56:27 2025 From: piliu at redhat.com (Pingfan Liu) Date: Thu, 6 Nov 2025 14:56:27 +0800 Subject: [PATCH 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: References: <20251105130922.13321-1-piliu@redhat.com> <20251105130922.13321-2-piliu@redhat.com> Message-ID: On Thu, Nov 6, 2025 at 11:22?AM Baoquan He wrote: > > On 11/06/25 at 10:33am, Pingfan Liu wrote: > > Hi Baoquan, > > > > Thanks for your review. Please see the comment below. > > > > On Thu, Nov 6, 2025 at 10:04?AM Baoquan He wrote: > > > > > > Hi Pingfan, > > > > > > On 11/05/25 at 09:09pm, Pingfan Liu wrote: > > > > When I tested kexec with the latest kernel, I ran into the following warning: > > > > > > > > [ 40.712410] ------------[ cut here ]------------ > > > > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > > > > [...] > > > > [ 40.816047] Call trace: > > > > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > > > > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > > > > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > > > > [...] > > > > [ 40.855423] ---[ end trace 0000000000000000 ]--- > > > > > > > > This is caused by the fact that kexec allocates the destination directly > > > > in the CMA area. In that case, the CMA kernel address should be exported > > > > directly to the IMA component, instead of using the vmalloc'd address. 
> > > > > > > > Signed-off-by: Pingfan Liu > > > > Cc: Andrew Morton > > > > Cc: Baoquan He > > > > Cc: Alexander Graf > > > > Cc: Steven Chen > > > > Cc: linux-integrity at vger.kernel.org > > > > To: kexec at lists.infradead.org > > > > --- > > > > kernel/kexec_core.c | 7 ++++++- > > > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > > > > index 9a1966207041..abe40286a02c 100644 > > > > --- a/kernel/kexec_core.c > > > > +++ b/kernel/kexec_core.c > > > > @@ -967,6 +967,7 @@ void *kimage_map_segment(struct kimage *image, int idx) > > > > kimage_entry_t *ptr, entry; > > > > struct page **src_pages; > > > > unsigned int npages; > > > > + struct page *cma; > > > > void *vaddr = NULL; > > > > int i; > > > > > > > > @@ -974,6 +975,9 @@ void *kimage_map_segment(struct kimage *image, int idx) > > > > size = image->segment[idx].memsz; > > > > eaddr = addr + size; > > > > > > > > + cma = image->segment_cma[idx]; > > > > > > Thanks for your fix. But I totally can't get what you are doing. The idx > > > passed into kimage_map_segment() could index image->segment[], and can > > > index image->segment_cma[], could you reconsider and make the code more > > > reasonable? > > > > > > > Since idx can index both image->segment[] and segment_cma[], the > > behavior differs based on whether segment_cma[idx] is NULL: > > > > - If segment_cma[idx] is not NULL, it points directly to the final > > target location, eliminating the need for data copying that > > traditional kexec relocation requires. > > - If segment_cma[idx] is NULL, the segment relies on the traditional > > kexec relocation code to copy its data. > > I see, thanks. While image->segment_cma[idx] records the struct page of > the relevant cma area, but not virtual address. Is it OK for IMA later Oops. It requires page_address(page) to convert the address. I will send out V2 to fix it. 
Thanks, Pingfan From piliu at redhat.com Wed Nov 5 22:59:03 2025 From: piliu at redhat.com (Pingfan Liu) Date: Thu, 6 Nov 2025 14:59:03 +0800 Subject: [PATCHv2 1/2] kernel/kexec: Change the prototype of kimage_map_segment() Message-ID: <20251106065904.10772-1-piliu@redhat.com> The kexec segment index will be required to extract the corresponding information for that segment in kimage_map_segment(). Additionally, kexec_segment already holds the kexec relocation destination address and size. Therefore, the prototype of kimage_map_segment() can be changed. Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") Signed-off-by: Pingfan Liu Cc: Andrew Morton Cc: Baoquan He Cc: Mimi Zohar Cc: Roberto Sassu Cc: Alexander Graf Cc: Steven Chen Cc: To: kexec at lists.infradead.org To: linux-integrity at vger.kernel.org --- include/linux/kexec.h | 4 ++-- kernel/kexec_core.c | 9 ++++++--- security/integrity/ima/ima_kexec.c | 4 +--- 3 files changed, 9 insertions(+), 8 deletions(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index ff7e231b0485..8a22bc9b8c6c 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -530,7 +530,7 @@ extern bool kexec_file_dbg_print; #define kexec_dprintk(fmt, arg...) 
\ do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0) -extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size); +extern void *kimage_map_segment(struct kimage *image, int idx); extern void kimage_unmap_segment(void *buffer); #else /* !CONFIG_KEXEC_CORE */ struct pt_regs; @@ -540,7 +540,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { } static inline void crash_kexec(struct pt_regs *regs) { } static inline int kexec_should_crash(struct task_struct *p) { return 0; } static inline int kexec_crash_loaded(void) { return 0; } -static inline void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size) +static inline void *kimage_map_segment(struct kimage *image, int idx) { return NULL; } static inline void kimage_unmap_segment(void *buffer) { } #define kexec_in_progress false diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index fa00b239c5d9..9a1966207041 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -960,17 +960,20 @@ int kimage_load_segment(struct kimage *image, int idx) return result; } -void *kimage_map_segment(struct kimage *image, - unsigned long addr, unsigned long size) +void *kimage_map_segment(struct kimage *image, int idx) { + unsigned long addr, size, eaddr; unsigned long src_page_addr, dest_page_addr = 0; - unsigned long eaddr = addr + size; kimage_entry_t *ptr, entry; struct page **src_pages; unsigned int npages; void *vaddr = NULL; int i; + addr = image->segment[idx].mem; + size = image->segment[idx].memsz; + eaddr = addr + size; + /* * Collect the source pages and map them in a contiguous VA range. 
*/ diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c index 7362f68f2d8b..5beb69edd12f 100644 --- a/security/integrity/ima/ima_kexec.c +++ b/security/integrity/ima/ima_kexec.c @@ -250,9 +250,7 @@ void ima_kexec_post_load(struct kimage *image) if (!image->ima_buffer_addr) return; - ima_kexec_buffer = kimage_map_segment(image, - image->ima_buffer_addr, - image->ima_buffer_size); + ima_kexec_buffer = kimage_map_segment(image, image->ima_segment_index); if (!ima_kexec_buffer) { pr_err("Could not map measurements buffer.\n"); return; -- 2.49.0 From piliu at redhat.com Wed Nov 5 22:59:04 2025 From: piliu at redhat.com (Pingfan Liu) Date: Thu, 6 Nov 2025 14:59:04 +0800 Subject: [PATCHv2 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: <20251106065904.10772-1-piliu@redhat.com> References: <20251106065904.10772-1-piliu@redhat.com> Message-ID: <20251106065904.10772-2-piliu@redhat.com> When I tested kexec with the latest kernel, I ran into the following warning: [ 40.712410] ------------[ cut here ]------------ [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 [...] [ 40.816047] Call trace: [ 40.818498] kimage_map_segment+0x144/0x198 (P) [ 40.823221] ima_kexec_post_load+0x58/0xc0 [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 [...] [ 40.855423] ---[ end trace 0000000000000000 ]--- This is caused by the fact that kexec allocates the destination directly in the CMA area. In that case, the CMA kernel address should be exported directly to the IMA component, instead of using the vmalloc'd address. 
Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") Signed-off-by: Pingfan Liu Cc: Andrew Morton Cc: Baoquan He Cc: Alexander Graf Cc: Steven Chen Cc: linux-integrity at vger.kernel.org Cc: To: kexec at lists.infradead.org --- v1 -> v2: return page_address(page) instead of *page kernel/kexec_core.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 9a1966207041..332204204e53 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -967,6 +967,7 @@ void *kimage_map_segment(struct kimage *image, int idx) kimage_entry_t *ptr, entry; struct page **src_pages; unsigned int npages; + struct page *cma; void *vaddr = NULL; int i; @@ -974,6 +975,9 @@ void *kimage_map_segment(struct kimage *image, int idx) size = image->segment[idx].memsz; eaddr = addr + size; + cma = image->segment_cma[idx]; + if (cma) + return page_address(cma); /* * Collect the source pages and map them in a contiguous VA range. */ @@ -1014,7 +1018,8 @@ void *kimage_map_segment(struct kimage *image, int idx) void kimage_unmap_segment(void *segment_buffer) { - vunmap(segment_buffer); + if (is_vmalloc_addr(segment_buffer)) + vunmap(segment_buffer); } struct kexec_load_limit { -- 2.49.0 From oliver.sang at intel.com Wed Nov 5 23:21:35 2025 From: oliver.sang at intel.com (kernel test robot) Date: Thu, 6 Nov 2025 15:21:35 +0800 Subject: [PATCH v9 7/9] liveupdate: kho: move to kernel/liveupdate In-Reply-To: <20251101142325.1326536-8-pasha.tatashin@soleen.com> Message-ID: <202511061443.64dd159-lkp@intel.com> Hello, as we understand, this commit is not the root cause of the WARNING. but just changes the stats as below table [1] we just report FYI there is a WARNING caused by related code in our tests, in case anybody think it's worth to look further. 
thanks kernel test robot noticed "WARNING:at_kernel/liveupdate/kexec_handover.c:#kho_add_subtree" on: commit: 91cb1aaea4b8276323b3814d35f6e62133f64c1b ("[PATCH v9 7/9] liveupdate: kho: move to kernel/liveupdate") url: https://github.com/intel-lab-lkp/linux/commits/Pasha-Tatashin/kho-make-debugfs-interface-optional/20251101-222610 patch link: https://lore.kernel.org/all/20251101142325.1326536-8-pasha.tatashin at soleen.com/ patch subject: [PATCH v9 7/9] liveupdate: kho: move to kernel/liveupdate in testcase: boot config: x86_64-randconfig-001-20251015 compiler: gcc-14 test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G (please refer to attached dmesg/kmsg for entire log/backtrace) [1] +--------------------------------------------------------------------------------------+------------+------------+ | | dc74e80622 | 91cb1aaea4 | +--------------------------------------------------------------------------------------+------------+------------+ | WARNING:at_kernel/kexec_handover.c:#kho_add_subtree | 8 | | | WARNING:at_kernel/liveupdate/kexec_handover.c:#kho_add_subtree | 0 | 11 | +--------------------------------------------------------------------------------------+------------+------------+ If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-lkp/202511061443.64dd159-lkp at intel.com [ 12.679864][ T1] ------------[ cut here ]------------ [ 12.680514][ T1] WARNING: CPU: 0 PID: 1 at kernel/liveupdate/kexec_handover.c:711 kho_add_subtree (kernel/liveupdate/kexec_handover.c:711) [ 12.681526][ T1] Modules linked in: [ 12.681957][ T1] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.18.0-rc3-00216-g91cb1aaea4b8 #1 VOLUNTARY [ 12.682956][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 12.683951][ T1] RIP: 0010:kho_add_subtree (kernel/liveupdate/kexec_handover.c:711) [ 12.684514][ T1] Code: c7 58 2e a7 85 31 ed e8 31 1a 00 00 48 c7 c7 c0 12 c9 86 85 c0 89 c3 40 0f 95 c5 31 c9 31 d2 89 ee e8 57 a0 13 00 85 db 74 02 <0f> 0b b9 01 00 00 00 31 d2 89 ee 48 c7 c7 90 12 c9 86 e8 3c a0 13 All code ======== 0: c7 (bad) 1: 58 pop %rax 2: 2e a7 cmpsl %es:(%rdi),%ds:(%rsi) 4: 85 31 test %esi,(%rcx) 6: ed in (%dx),%eax 7: e8 31 1a 00 00 call 0x1a3d c: 48 c7 c7 c0 12 c9 86 mov $0xffffffff86c912c0,%rdi 13: 85 c0 test %eax,%eax 15: 89 c3 mov %eax,%ebx 17: 40 0f 95 c5 setne %bpl 1b: 31 c9 xor %ecx,%ecx 1d: 31 d2 xor %edx,%edx 1f: 89 ee mov %ebp,%esi 21: e8 57 a0 13 00 call 0x13a07d 26: 85 db test %ebx,%ebx 28: 74 02 je 0x2c 2a:* 0f 0b ud2 <-- trapping instruction 2c: b9 01 00 00 00 mov $0x1,%ecx 31: 31 d2 xor %edx,%edx 33: 89 ee mov %ebp,%esi 35: 48 c7 c7 90 12 c9 86 mov $0xffffffff86c91290,%rdi 3c: e8 .byte 0xe8 3d: 3c a0 cmp $0xa0,%al 3f: 13 .byte 0x13 Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 2: b9 01 00 00 00 mov $0x1,%ecx 7: 31 d2 xor %edx,%edx 9: 89 ee mov %ebp,%esi b: 48 c7 c7 90 12 c9 86 mov $0xffffffff86c91290,%rdi 12: e8 .byte 0xe8 13: 3c a0 cmp $0xa0,%al 15: 13 .byte 0x13 [ 12.686315][ T1] RSP: 0018:ffffc9000001fc58 EFLAGS: 00010286 [ 12.687184][ T1] RAX: 
dffffc0000000000 RBX: 00000000ffffffff RCX: 0000000000000000 [ 12.688370][ T1] RDX: 1ffffffff0d9225c RSI: 0000000000000001 RDI: ffffffff86c912e0 [ 12.689572][ T1] RBP: 0000000000000001 R08: 0000000000000008 R09: fffffbfff0dfac6c [ 12.690762][ T1] R10: 0000000000000000 R11: ffffffff86fd6367 R12: ffff888133ce6000 [ 12.691996][ T1] R13: ffffffff85a72d60 R14: ffff88810ce59888 R15: dffffc0000000000 [ 12.693231][ T1] FS: 0000000000000000(0000) GS:ffff888426da0000(0000) knlGS:0000000000000000 [ 12.694585][ T1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 12.695569][ T1] CR2: 00007f6d5dc8f0ac CR3: 00000000054ea000 CR4: 00000000000406f0 [ 12.696832][ T1] Call Trace: [ 12.697400][ T1] [ 12.697922][ T1] kho_test_preserve+0x2fa/0x360 [ 12.698835][ T1] ? folio_order (arch/x86/kvm/../../../virt/kvm/guest_memfd.c:181 (discriminator 3)) [ 12.699556][ T1] ? kho_test_generate_data+0x107/0x180 [ 12.700561][ T1] kho_test_init (lib/test_kho.c:222 lib/test_kho.c:327) [ 12.701312][ T1] ? vmalloc_test_init (lib/test_kho.c:314) [ 12.702100][ T1] ? add_device_randomness (drivers/char/random.c:944) [ 12.702924][ T1] ? mix_pool_bytes (drivers/char/random.c:944) [ 12.703646][ T1] ? trace_initcall_start (include/trace/events/initcall.h:27 (discriminator 3)) [ 12.704499][ T1] ? vmalloc_test_init (lib/test_kho.c:314) [ 12.705291][ T1] do_one_initcall (init/main.c:1284) [ 12.706047][ T1] ? trace_initcall_start (init/main.c:1274) [ 12.706897][ T1] ? parse_one (kernel/params.c:143) [ 12.707623][ T1] ? kasan_save_track (mm/kasan/common.c:69 (discriminator 1) mm/kasan/common.c:78 (discriminator 1)) [ 12.708394][ T1] ? __kmalloc_noprof (mm/slub.c:5659) [ 12.709218][ T1] do_initcalls (init/main.c:1344 (discriminator 3) init/main.c:1361 (discriminator 3)) [ 12.709976][ T1] kernel_init_freeable (init/main.c:1595) [ 12.710752][ T1] ? rest_init (init/main.c:1475) [ 12.711473][ T1] kernel_init (init/main.c:1485) [ 12.712165][ T1] ? 
rest_init (init/main.c:1475) [ 12.712871][ T1] ret_from_fork (arch/x86/kernel/process.c:164) [ 12.713609][ T1] ? rest_init (init/main.c:1475) [ 12.714326][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:255) [ 12.715029][ T1] [ 12.715548][ T1] irq event stamp: 131753 [ 12.716243][ T1] hardirqs last enabled at (131763): __up_console_sem (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:109 arch/x86/include/asm/irqflags.h:151 kernel/printk/printk.c:345) [ 12.717702][ T1] hardirqs last disabled at (131776): __up_console_sem (kernel/printk/printk.c:343 (discriminator 3)) [ 12.719185][ T1] softirqs last enabled at (131460): handle_softirqs (kernel/softirq.c:469 (discriminator 1) kernel/softirq.c:650 (discriminator 1)) [ 12.720632][ T1] softirqs last disabled at (131455): __irq_exit_rcu (kernel/softirq.c:496 kernel/softirq.c:723) [ 12.721755][ T1] ---[ end trace 0000000000000000 ]--- The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20251106/202511061443.64dd159-lkp at intel.com -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki From bhe at redhat.com Thu Nov 6 00:01:29 2025 From: bhe at redhat.com (Baoquan He) Date: Thu, 6 Nov 2025 16:01:29 +0800 Subject: [PATCHv2 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: <20251106065904.10772-2-piliu@redhat.com> References: <20251106065904.10772-1-piliu@redhat.com> <20251106065904.10772-2-piliu@redhat.com> Message-ID: On 11/06/25 at 02:59pm, Pingfan Liu wrote: > When I tested kexec with the latest kernel, I ran into the following warning: > > [ 40.712410] ------------[ cut here ]------------ > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > [...] > [ 40.816047] Call trace: > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > [...] 
> [ 40.855423] ---[ end trace 0000000000000000 ]--- > > This is caused by the fact that kexec allocates the destination directly > in the CMA area. In that case, the CMA kernel address should be exported > directly to the IMA component, instead of using the vmalloc'd address. Well, you didn't update the log accordingly. Do you know why cma area can't be mapped into vmalloc? > > Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") > Signed-off-by: Pingfan Liu > Cc: Andrew Morton > Cc: Baoquan He > Cc: Alexander Graf > Cc: Steven Chen > Cc: linux-integrity at vger.kernel.org > Cc: > To: kexec at lists.infradead.org > --- > v1 -> v2: > return page_address(page) instead of *page > > kernel/kexec_core.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > index 9a1966207041..332204204e53 100644 > --- a/kernel/kexec_core.c > +++ b/kernel/kexec_core.c > @@ -967,6 +967,7 @@ void *kimage_map_segment(struct kimage *image, int idx) > kimage_entry_t *ptr, entry; > struct page **src_pages; > unsigned int npages; > + struct page *cma; > void *vaddr = NULL; > int i; > > @@ -974,6 +975,9 @@ void *kimage_map_segment(struct kimage *image, int idx) > size = image->segment[idx].memsz; > eaddr = addr + size; > > + cma = image->segment_cma[idx]; > + if (cma) > + return page_address(cma); > /* > * Collect the source pages and map them in a contiguous VA range. 
> */ > @@ -1014,7 +1018,8 @@ void *kimage_map_segment(struct kimage *image, int idx) > > void kimage_unmap_segment(void *segment_buffer) > { > - vunmap(segment_buffer); > + if (is_vmalloc_addr(segment_buffer)) > + vunmap(segment_buffer); > } > > struct kexec_load_limit { > -- > 2.49.0 > From rppt at kernel.org Thu Nov 6 00:24:24 2025 From: rppt at kernel.org (Mike Rapoport) Date: Thu, 6 Nov 2025 10:24:24 +0200 Subject: [PATCH v8 01/17] memblock: add MEMBLOCK_RSRV_KERN flag In-Reply-To: References: <20250509074635.3187114-1-changyuanl@google.com> <20250509074635.3187114-2-changyuanl@google.com> <2ege2jfbevtunhxsnutbzde7cqwgu5qbj4bbuw2umw7ke7ogcn@5wtskk4exzsi> Message-ID: Hello Breno, On Wed, Nov 05, 2025 at 02:18:11AM -0800, Breno Leitao wrote: > Hello Pratyush, > > On Tue, Oct 14, 2025 at 03:10:37PM +0200, Pratyush Yadav wrote: > > On Tue, Oct 14 2025, Breno Leitao wrote: > > > On Mon, Oct 13, 2025 at 06:40:09PM +0200, Pratyush Yadav wrote: > > >> On Mon, Oct 13 2025, Pratyush Yadav wrote: > > >> > > > >> > I suppose this would be useful. I think enabling memblock debug prints > > >> > would also be helpful (using the "memblock=debug" commandline parameter) > > >> > if it doesn't impact your production environment too much. > > >> > > >> Actually, I think "memblock=debug" is going to be the more useful thing > > >> since it would also show what function allocated the overlapping range > > >> and the flags it was allocated with. > > >> > > >> On my qemu VM with KVM, this results in around 70 prints from memblock. > > >> So it adds a bit of extra prints but nothing that should be too > > >> disrupting I think. Plus, only at boot so the worst thing you get is > > >> slightly slower boot times. > > > > > > Unfortunately this issue is happening on production systems, and I don't > > > have an easy way to reproduce it _yet_. > > > > > > At the same time, "memblock=debug" has two problems: > > > > > > 1) It slows the boot time as you suggested. 
Boot time at large > > > environments is SUPER critical and time sensitive. It is a bit > > > weird, but it is common for machines in production to kexec > > > _thousands_ of times, and kexecing is considered downtime. > > > > I don't know if it would make a real enough difference on boot times, > > only that it should theoretically affect it, mainly if you are using > > serial for dmesg logs. Anyway, that's your production environment so you > > know best. > > > > > > > > This would be useful if I find some hosts getting this issue, and > > > then I can easily enable the extra information to collect what > > > I need, but, this didn't pan out because the hosts I got > > > `memblock=debug` didn't collaborate. > > > > > > 2) "memblock=debug" is verbose for all cases, which also not necessary > > > the desired behaviour. I am more interested in only being verbose > > > when there is a known problem. > > I am still interested in this problem, and I finally found a host that > constantly reproduce the issue and I was able to get `memblock=debug` > cmdline. I am running 6.18-rc4 with some debug options enabled. 
> > DMA-API: exceeded 7 overlapping mappings of cacheline 0x0000000006d6e400 > WARNING: CPU: 58 PID: 828 at kernel/dma/debug.c:463 add_dma_entry+0x2e4/0x330 > pc : add_dma_entry+0x2e4/0x330 > lr : add_dma_entry+0x2e4/0x330 > sp : ffff8000b036f7f0 > x29: ffff8000b036f800 x28: 0000000000000001 x27: 0000000000000008 > x26: ffff8000835f7fb8 x25: ffff8000835f7000 x24: ffff8000835f7ee0 > x23: 0000000000000000 x22: 0000000006d6e400 x21: 0000000000000000 > x20: 0000000006d6e400 x19: ffff0003f70c1100 x18: 00000000ffffffff > x17: ffff80008019a2d8 x16: ffff80008019a08c x15: 0000000000000000 > x14: 0000000000000000 x13: 0000000000000820 x12: ffff00011faeaf00 > x11: 0000000000000000 x10: ffff8000834633d8 x9 : ffff8000801979d4 > x8 : 00000000fffeffff x7 : ffff8000834633d8 x6 : 0000000000000000 > x5 : 00000000000bfff4 x4 : 0000000000000000 x3 : ffff0001075eb7c0 > x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0001075eb7c0 > Call trace: > add_dma_entry+0x2e4/0x330 (P) > debug_dma_map_phys+0xc4/0xf0 > dma_map_phys (/home/leit/Devel/upstream/./include/linux/dma-direct.h:138 /home/leit/Devel/upstream/kernel/dma/direct.h:102 /home/leit/Devel/upstream/kernel/dma/mapping.c:169) > dma_map_page_attrs (/home/leit/Devel/upstream/kernel/dma/mapping.c:387) > blk_dma_map_direct.isra.0 (/home/leit/Devel/upstream/block/blk-mq-dma.c:102) > blk_dma_map_iter_start (/home/leit/Devel/upstream/block/blk-mq-dma.c:123 /home/leit/Devel/upstream/block/blk-mq-dma.c:196) > blk_rq_dma_map_iter_start (/home/leit/Devel/upstream/block/blk-mq-dma.c:228) > nvme_prep_rq+0xb8/0x9b8 > nvme_queue_rq+0x44/0x1b0 > blk_mq_dispatch_rq_list (/home/leit/Devel/upstream/block/blk-mq.c:2129) > __blk_mq_sched_dispatch_requests (/home/leit/Devel/upstream/block/blk-mq-sched.c:314) > blk_mq_sched_dispatch_requests (/home/leit/Devel/upstream/block/blk-mq-sched.c:329) > blk_mq_run_work_fn (/home/leit/Devel/upstream/block/blk-mq.c:219 /home/leit/Devel/upstream/block/blk-mq.c:231) > process_one_work 
(/home/leit/Devel/upstream/kernel/workqueue.c:991 /home/leit/Devel/upstream/kernel/workqueue.c:3213) > worker_thread (/home/leit/Devel/upstream/./include/linux/list.h:163 /home/leit/Devel/upstream/./include/linux/list.h:191 /home/leit/Devel/upstream/./include/linux/list.h:319 /home/leit/Devel/upstream/kernel/workqueue.c:1153 /home/leit/Devel/upstream/kernel/workqueue.c:1205 /home/leit/Devel/upstream/kernel/workqueue.c:3426) > kthread (/home/leit/Devel/upstream/kernel/kthread.c:386 /home/leit/Devel/upstream/kernel/kthread.c:457) > ret_from_fork (/home/leit/Devel/upstream/entry.S:861) > > > Looking at memblock debug logs, I haven't seen anything related to > 0x0000000006d6e400. It looks like the crash happens way after memblock passed all the memory to buddy. Why do you think this is related to memblock? > I got the output of `dmesg | grep memblock` in, in case you are curious: > > https://github.com/leitao/debug/blob/main/pastebin/memblock/dmesg_grep_memblock.txt > > Thanks > --breno > -- Sincerely yours, Mike. 
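A physical address such as the cacheline 0x0000000006d6e400 from the DMA-debug splat can be cross-checked against `memblock=debug` output mechanically rather than by eye. A rough sketch follows; the log-line shape and the sample entries below are assumptions for illustration, not lines taken from the actual dmesg:

```python
import re

# Hypothetical sample lines in roughly the shape memblock=debug prints;
# the exact format can differ between kernel versions.
LOG = """\
memblock_reserve: [0x0000000006d00000-0x0000000006dfffff] memblock_alloc_range_nid+0xdc/0x14c
memblock_reserve: [0x00000000af000000-0x00000000beffffff] reserve_crashkernel+0x9c/0x120
"""

LINE = re.compile(r"memblock_reserve: \[0x([0-9a-f]+)-0x([0-9a-f]+)\]\s+(\S+)")

def parse_reserves(log):
    """Return (start, end_inclusive, caller) tuples for each reserve line."""
    return [(int(m.group(1), 16), int(m.group(2), 16), m.group(3))
            for m in LINE.finditer(log)]

def covering(regions, paddr):
    """Regions whose [start, end] range contains the physical address."""
    return [r for r in regions if r[0] <= paddr <= r[1]]

regions = parse_reserves(LOG)
print(covering(regions, 0x06d6e400)[0][2])  # prints memblock_alloc_range_nid+0xdc/0x14c
```

If no reserve region covers the address, that supports Mike's point that the mapping in question was never a memblock reservation at all.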
From oliver.sang at intel.com Thu Nov 6 00:41:16 2025 From: oliver.sang at intel.com (kernel test robot) Date: Thu, 6 Nov 2025 16:41:16 +0800 Subject: [PATCH v9 2/9] kho: drop notifiers In-Reply-To: <20251101142325.1326536-3-pasha.tatashin@soleen.com> Message-ID: <202511061629.e242724-lkp@intel.com> Hello, kernel test robot noticed "WARNING:at_kernel/kexec_handover.c:#kho_add_subtree" on: commit: e44a700c561d1e892a8d0829d557e221604a7b93 ("[PATCH v9 2/9] kho: drop notifiers") url: https://github.com/intel-lab-lkp/linux/commits/Pasha-Tatashin/kho-make-debugfs-interface-optional/20251101-222610 patch link: https://lore.kernel.org/all/20251101142325.1326536-3-pasha.tatashin at soleen.com/ patch subject: [PATCH v9 2/9] kho: drop notifiers in testcase: boot config: x86_64-randconfig-001-20251015 compiler: gcc-14 test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G (please refer to attached dmesg/kmsg for entire log/backtrace) +--------------------------------------------------------+------------+------------+ | | 93e4b3b2e9 | e44a700c56 | +--------------------------------------------------------+------------+------------+ | WARNING:at_kernel/kexec_handover.c:#kho_add_subtree | 0 | 8 | | RIP:kho_add_subtree | 0 | 8 | +--------------------------------------------------------+------------+------------+ If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-lkp/202511061629.e242724-lkp at intel.com [ 13.620111][ T1] ------------[ cut here ]------------ [ 13.620739][ T1] WARNING: CPU: 1 PID: 1 at kernel/kexec_handover.c:704 kho_add_subtree (kernel/kexec_handover.c:704) [ 13.621665][ T1] Modules linked in: [ 13.622090][ T1] CPU: 1 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.18.0-rc3-00211-ge44a700c561d #1 VOLUNTARY [ 13.623073][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 13.624054][ T1] RIP: 0010:kho_add_subtree (kernel/kexec_handover.c:704) [ 13.624596][ T1] Code: c7 38 b4 ac 85 31 ed e8 01 1c 00 00 48 c7 c7 70 5a ca 86 85 c0 89 c3 40 0f 95 c5 31 c9 31 d2 89 ee e8 37 b5 0a 00 85 db 74 02 <0f> 0b b9 01 00 00 00 31 d2 89 ee 48 c7 c7 40 5a ca 86 e8 1c b5 0a All code ======== 0: c7 38 b4 ac 85 xbegin 0xffffffff85acb43d,(bad) 5: 31 ed xor %ebp,%ebp 7: e8 01 1c 00 00 call 0x1c0d c: 48 c7 c7 70 5a ca 86 mov $0xffffffff86ca5a70,%rdi 13: 85 c0 test %eax,%eax 15: 89 c3 mov %eax,%ebx 17: 40 0f 95 c5 setne %bpl 1b: 31 c9 xor %ecx,%ecx 1d: 31 d2 xor %edx,%edx 1f: 89 ee mov %ebp,%esi 21: e8 37 b5 0a 00 call 0xab55d 26: 85 db test %ebx,%ebx 28: 74 02 je 0x2c 2a:* 0f 0b ud2 <-- trapping instruction 2c: b9 01 00 00 00 mov $0x1,%ecx 31: 31 d2 xor %edx,%edx 33: 89 ee mov %ebp,%esi 35: 48 c7 c7 40 5a ca 86 mov $0xffffffff86ca5a40,%rdi 3c: e8 .byte 0xe8 3d: 1c b5 sbb $0xb5,%al 3f: 0a .byte 0xa Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 2: b9 01 00 00 00 mov $0x1,%ecx 7: 31 d2 xor %edx,%edx 9: 89 ee mov %ebp,%esi b: 48 c7 c7 40 5a ca 86 mov $0xffffffff86ca5a40,%rdi 12: e8 .byte 0xe8 13: 1c b5 sbb $0xb5,%al 15: 0a .byte 0xa [ 13.626370][ T1] RSP: 0018:ffffc9000001fca0 EFLAGS: 00010286 [ 13.626951][ T1] RAX: dffffc0000000000 RBX: 00000000ffffffff RCX: 0000000000000000 [ 13.627737][ T1] 
RDX: 1ffffffff0d94b52 RSI: 0000000000000001 RDI: ffffffff86ca5a90 [ 13.628523][ T1] RBP: 0000000000000001 R08: 0000000000000008 R09: fffffbfff0dfac4c [ 13.629330][ T1] R10: 0000000000000000 R11: ffffffff86fd6267 R12: ffff888133ee2000 [ 13.630101][ T1] R13: ffffffff85acb340 R14: ffff888117a5f988 R15: dffffc0000000000 [ 13.630869][ T1] FS: 0000000000000000(0000) GS:ffff888426ea0000(0000) knlGS:0000000000000000 [ 13.631727][ T1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 13.632370][ T1] CR2: 00007f586df260ac CR3: 00000000054ea000 CR4: 00000000000406f0 [ 13.633154][ T1] Call Trace: [ 13.633506][ T1] [ 13.633833][ T1] kho_test_prepare_fdt+0x145/0x180 [ 13.634446][ T1] ? kho_test_save_data+0x210/0x210 [ 13.635097][ T1] ? csum_partial (lib/checksum.c:123) [ 13.635546][ T1] kho_test_init (lib/test_kho.c:177 lib/test_kho.c:284) [ 13.636018][ T1] ? vmalloc_test_init (lib/test_kho.c:271) [ 13.636508][ T1] ? add_device_randomness (drivers/char/random.c:944) [ 13.637485][ T1] ? mix_pool_bytes (drivers/char/random.c:944) [ 13.637955][ T1] ? trace_initcall_start (include/trace/events/initcall.h:27 (discriminator 3)) [ 13.638498][ T1] ? vmalloc_test_init (lib/test_kho.c:271) [ 13.638989][ T1] do_one_initcall (init/main.c:1284) [ 13.639477][ T1] ? trace_initcall_start (init/main.c:1274) [ 13.639998][ T1] ? parse_one (kernel/params.c:143) [ 13.640455][ T1] ? kasan_save_track (mm/kasan/common.c:69 (discriminator 1) mm/kasan/common.c:78 (discriminator 1)) [ 13.640948][ T1] ? __kmalloc_noprof (mm/slub.c:5659) [ 13.641465][ T1] do_initcalls (init/main.c:1344 (discriminator 3) init/main.c:1361 (discriminator 3)) [ 13.641924][ T1] kernel_init_freeable (init/main.c:1595) [ 13.642441][ T1] ? rest_init (init/main.c:1475) [ 13.642891][ T1] kernel_init (init/main.c:1485) [ 13.643345][ T1] ? rest_init (init/main.c:1475) [ 13.643788][ T1] ret_from_fork (arch/x86/kernel/process.c:164) [ 13.644256][ T1] ? 
rest_init (init/main.c:1475) [ 13.644703][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:255) [ 13.645213][ T1] [ 13.645540][ T1] irq event stamp: 132025 [ 13.645971][ T1] hardirqs last enabled at (132035): __up_console_sem (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:109 arch/x86/include/asm/irqflags.h:151 kernel/printk/printk.c:345) [ 13.646887][ T1] hardirqs last disabled at (132046): __up_console_sem (kernel/printk/printk.c:343 (discriminator 3)) [ 13.648253][ T1] softirqs last enabled at (131286): handle_softirqs (kernel/softirq.c:469 (discriminator 1) kernel/softirq.c:650 (discriminator 1)) [ 13.649690][ T1] softirqs last disabled at (131281): __irq_exit_rcu (kernel/softirq.c:496 kernel/softirq.c:723) [ 13.651128][ T1] ---[ end trace 0000000000000000 ]--- The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20251106/202511061629.e242724-lkp at intel.com -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki From piliu at redhat.com Thu Nov 6 02:01:41 2025 From: piliu at redhat.com (Pingfan Liu) Date: Thu, 6 Nov 2025 18:01:41 +0800 Subject: [PATCHv2 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: References: <20251106065904.10772-1-piliu@redhat.com> <20251106065904.10772-2-piliu@redhat.com> Message-ID: On Thu, Nov 6, 2025 at 4:01?PM Baoquan He wrote: > > On 11/06/25 at 02:59pm, Pingfan Liu wrote: > > When I tested kexec with the latest kernel, I ran into the following warning: > > > > [ 40.712410] ------------[ cut here ]------------ > > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > > [...] > > [ 40.816047] Call trace: > > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > > [...] 
> > [ 40.855423] ---[ end trace 0000000000000000 ]--- > > > > This is caused by the fact that kexec allocates the destination directly > > in the CMA area. In that case, the CMA kernel address should be exported > > directly to the IMA component, instead of using the vmalloc'd address. > > Well, you didn't update the log accordingly. > I am not sure what you mean. Do you mean the earlier content which I replied to you? > Do you know why cma area can't be mapped into vmalloc? > Should not the kernel direct mapping be used? Thanks, Pingfan > > > > Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") > > Signed-off-by: Pingfan Liu > > Cc: Andrew Morton > > Cc: Baoquan He > > Cc: Alexander Graf > > Cc: Steven Chen > > Cc: linux-integrity at vger.kernel.org > > Cc: > > To: kexec at lists.infradead.org > > --- > > v1 -> v2: > > return page_address(page) instead of *page > > > > kernel/kexec_core.c | 7 ++++++- > > 1 file changed, 6 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > > index 9a1966207041..332204204e53 100644 > > --- a/kernel/kexec_core.c > > +++ b/kernel/kexec_core.c > > @@ -967,6 +967,7 @@ void *kimage_map_segment(struct kimage *image, int idx) > > kimage_entry_t *ptr, entry; > > struct page **src_pages; > > unsigned int npages; > > + struct page *cma; > > void *vaddr = NULL; > > int i; > > > > @@ -974,6 +975,9 @@ void *kimage_map_segment(struct kimage *image, int idx) > > size = image->segment[idx].memsz; > > eaddr = addr + size; > > > > + cma = image->segment_cma[idx]; > > + if (cma) > > + return page_address(cma); > > /* > > * Collect the source pages and map them in a contiguous VA range. 
> > */ > > @@ -1014,7 +1018,8 @@ void *kimage_map_segment(struct kimage *image, int idx) > > > > void kimage_unmap_segment(void *segment_buffer) > > { > > - vunmap(segment_buffer); > > + if (is_vmalloc_addr(segment_buffer)) > > + vunmap(segment_buffer); > > } > > > > struct kexec_load_limit { > > -- > > 2.49.0 > > > From pnina.feder at mobileye.com Thu Nov 6 04:03:44 2025 From: pnina.feder at mobileye.com (Pnina Feder) Date: Thu, 6 Nov 2025 14:03:44 +0200 Subject: [PATCH] util_lib: Add direct map fallback in vaddr_to_offset() Message-ID: <20251106120344.2382695-1-pnina.feder@mobileye.com> The vmcore-dmesg tool could fail with the message: "No program header covering vaddr 0x%llx found kexec bug?" This occurred when a virtual address belonged to the kernel?s direct mapping region, which may not be covered by any PT_LOAD segment in the vmcore ELF headers. Add a direct-map fallback in vaddr_to_offset() that converts such virtual addresses using the known page and physical offsets. This allows resolving these addresses correctly. 
Tested on Linux 6.16 (RISC-V) Signed-off-by: Pnina Feder --- util_lib/elf_info.c | 58 +++++++++++++++++++++++++++++++++++++-------- 1 file changed, 48 insertions(+), 10 deletions(-) diff --git a/util_lib/elf_info.c b/util_lib/elf_info.c index b005245..589bc1a 100644 --- a/util_lib/elf_info.c +++ b/util_lib/elf_info.c @@ -72,6 +72,7 @@ static uint16_t log_offset_len = UINT16_MAX; static uint16_t log_offset_text_len = UINT16_MAX; static uint64_t phys_offset = UINT64_MAX; +static uint64_t page_offset = UINT64_MAX; #if __BYTE_ORDER == __LITTLE_ENDIAN #define ELFDATANATIVE ELFDATA2LSB @@ -115,7 +116,26 @@ static uint64_t vaddr_to_offset(uint64_t vaddr) continue; return (vaddr - phdr[i].p_vaddr) + phdr[i].p_offset; } - fprintf(stderr, "No program header covering vaddr 0x%llxfound kexec bug?\n", + + /* Direct map fallback */ + if (page_offset != UINT64_MAX && + phys_offset != UINT64_MAX && + vaddr >= page_offset) { + + uint64_t paddr = 0; + + paddr = vaddr - (page_offset - phys_offset); + + for (i = 0; i < ehdr.e_phnum; i++) { + if (phdr[i].p_paddr > paddr) + continue; + if ((phdr[i].p_paddr + phdr[i].p_memsz) <= paddr) + continue; + return phdr[i].p_offset + (paddr - phdr[i].p_paddr); + } + } + + fprintf(stderr, "No program header covering vaddr 0x%llx found kexec bug?\n", (unsigned long long)vaddr); exit(30); } @@ -309,6 +329,20 @@ int get_pt_load(int idx, return 1; } +static inline int parse_phys_offset(const char *str, char *pos) +{ + char *endp; + + phys_offset = strtoul(pos + strlen(str), &endp, 10); + if (strlen(endp) != 0) + phys_offset = strtoul(pos + strlen(str), &endp, 16); + if ((phys_offset == LONG_MAX) || strlen(endp) != 0) { + fprintf(stderr, "Invalid data %s\n", pos); + return -1; + } + return 0; +} + #define NOT_FOUND_LONG_VALUE (-1) void (*arch_scan_vmcoreinfo)(char *pos); @@ -319,7 +353,7 @@ void scan_vmcoreinfo(char *start, size_t size) char *pos, *eol; char temp_buf[1024]; bool last_line = false; - char *str, *endp; + char *str; #define SYMBOL(sym) { 
\ .str = "SYMBOL(" #sym ")=", \ @@ -543,17 +577,21 @@ void scan_vmcoreinfo(char *start, size_t size) /* Check for PHYS_OFFSET number */ str = "NUMBER(PHYS_OFFSET)="; if (memcmp(str, pos, strlen(str)) == 0) { - phys_offset = strtoul(pos + strlen(str), &endp, - 10); - if (strlen(endp) != 0) - phys_offset = strtoul(pos + strlen(str), &endp, 16); - if ((phys_offset == LONG_MAX) || strlen(endp) != 0) { - fprintf(stderr, "Invalid data %s\n", - pos); + if (parse_phys_offset(str, pos) != 0) break; - } } + /* Check for PHYS_OFFSET number on some arch it called phys_ram_base*/ + str = "NUMBER(phys_ram_base)="; + if (memcmp(str, pos, strlen(str)) == 0) { + if (parse_phys_offset(str, pos) != 0) + break; + } + + str = "NUMBER(PAGE_OFFSET)="; + if (memcmp(str, pos, strlen(str)) == 0) + page_offset = strtoull(pos + strlen(str), NULL, 16); + if (arch_scan_vmcoreinfo != NULL) (*arch_scan_vmcoreinfo)(pos); -- 2.43.0 From lkp at intel.com Thu Nov 6 06:38:11 2025 From: lkp at intel.com (kernel test robot) Date: Thu, 6 Nov 2025 22:38:11 +0800 Subject: [PATCH v6] powerpc/kdump: Add support for crashkernel CMA reservation In-Reply-To: <20251104132818.1724562-1-sourabhjain@linux.ibm.com> References: <20251104132818.1724562-1-sourabhjain@linux.ibm.com> Message-ID: <202511062213.dHidoorr-lkp@intel.com> Hi Sourabh, kernel test robot noticed the following build warnings: [auto build test WARNING on powerpc/next] [also build test WARNING on powerpc/fixes linus/master v6.18-rc4 next-20251106] [If your patch is applied to the wrong git tree, kindly drop us a note. 
And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Sourabh-Jain/powerpc-kdump-Add-support-for-crashkernel-CMA-reservation/20251104-213036 base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next patch link: https://lore.kernel.org/r/20251104132818.1724562-1-sourabhjain%40linux.ibm.com patch subject: [PATCH v6] powerpc/kdump: Add support for crashkernel CMA reservation config: powerpc64-randconfig-r113-20251106 (https://download.01.org/0day-ci/archive/20251106/202511062213.dHidoorr-lkp at intel.com/config) compiler: powerpc64-linux-gcc (GCC) 8.5.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251106/202511062213.dHidoorr-lkp at intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202511062213.dHidoorr-lkp at intel.com/ sparse warnings: (new ones prefixed by >>) >> arch/powerpc/kexec/core.c:62:20: sparse: sparse: symbol 'crashk_cma_size' was not declared. Should it be static? 
arch/powerpc/kexec/core.c:188:29: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned long long static [addressable] [toplevel] [usertype] crashk_base @@ got restricted __be64 [usertype] @@ arch/powerpc/kexec/core.c:188:29: sparse: expected unsigned long long static [addressable] [toplevel] [usertype] crashk_base arch/powerpc/kexec/core.c:188:29: sparse: got restricted __be64 [usertype] arch/powerpc/kexec/core.c:190:29: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned long long static [addressable] [toplevel] [usertype] crashk_size @@ got restricted __be64 [usertype] @@ arch/powerpc/kexec/core.c:190:29: sparse: expected unsigned long long static [addressable] [toplevel] [usertype] crashk_size arch/powerpc/kexec/core.c:190:29: sparse: got restricted __be64 [usertype] arch/powerpc/kexec/core.c:198:19: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned long long static [addressable] [toplevel] mem_limit @@ got restricted __be64 [usertype] @@ arch/powerpc/kexec/core.c:198:19: sparse: expected unsigned long long static [addressable] [toplevel] mem_limit arch/powerpc/kexec/core.c:198:19: sparse: got restricted __be64 [usertype] arch/powerpc/kexec/core.c:214:20: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned long long static [addressable] [toplevel] [usertype] kernel_end @@ got restricted __be64 [usertype] @@ arch/powerpc/kexec/core.c:214:20: sparse: expected unsigned long long static [addressable] [toplevel] [usertype] kernel_end arch/powerpc/kexec/core.c:214:20: sparse: got restricted __be64 [usertype] vim +/crashk_cma_size +62 arch/powerpc/kexec/core.c 61 > 62 unsigned long long crashk_cma_size; 63 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki From pasha.tatashin at soleen.com Thu Nov 6 13:46:45 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Thu, 6 Nov 2025 16:46:45 -0500 
Subject: [PATCH v9 2/9] kho: drop notifiers In-Reply-To: <202511061629.e242724-lkp@intel.com> References: <20251101142325.1326536-3-pasha.tatashin@soleen.com> <202511061629.e242724-lkp@intel.com> Message-ID: The bug is in lib/test_kho.c: when KHO is not enabled, it should not run KHO commands; there is a function to test that, kho_is_enabled(). So KHO is disabled, yet kho_add_subtree(), which adds a debugfs entry, is still called, and the list is not initialized because KHO is disabled. The fix is: diff --git a/lib/test_kho.c b/lib/test_kho.c index 025ea251a186..85b60d87a50a 100644 --- a/lib/test_kho.c +++ b/lib/test_kho.c @@ -315,6 +315,9 @@ static int __init kho_test_init(void) phys_addr_t fdt_phys; int err; + if (!kho_is_enabled()) + return 0; + err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys); if (!err) return kho_test_restore(fdt_phys); On Thu, Nov 6, 2025 at 3:41 AM kernel test robot wrote: > [...] From pasha.tatashin at soleen.com Thu Nov 6 14:06:35 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Thu, 6 Nov 2025 17:06:35 -0500 Subject: [PATCH] lib/test_kho: Check if KHO is enabled Message-ID: <20251106220635.2608494-1-pasha.tatashin@soleen.com> We must check whether KHO is enabled prior to issuing KHO commands, otherwise KHO internal data structures are not initialized.
Fixes: b753522bed0b ("kho: add test for kexec handover") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202511061629.e242724-lkp at intel.com Signed-off-by: Pasha Tatashin --- lib/test_kho.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/lib/test_kho.c b/lib/test_kho.c index 025ea251a186..85b60d87a50a 100644 --- a/lib/test_kho.c +++ b/lib/test_kho.c @@ -315,6 +315,9 @@ static int __init kho_test_init(void) phys_addr_t fdt_phys; int err; + if (!kho_is_enabled()) + return 0; + err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys); if (!err) return kho_test_restore(fdt_phys); -- 2.51.2.1041.gc1ab5b90ca-goog From pasha.tatashin at soleen.com Thu Nov 6 14:14:28 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Thu, 6 Nov 2025 17:14:28 -0500 Subject: [PATCH v9 2/9] kho: drop notifiers In-Reply-To: References: <20251101142325.1326536-3-pasha.tatashin@soleen.com> <202511061629.e242724-lkp@intel.com> Message-ID: On Thu, Nov 6, 2025 at 4:46 PM Pasha Tatashin wrote: > > The bug is in lib/test_kho.c: when KHO is not enabled, it should not > run KHO commands; there is a function to test for that: kho_is_enabled(). > So, KHO is disabled, yet kho_add_subtree(), which adds a debugfs > entry, is still called, and the list is not initialized because KHO is disabled.
The > fix is: Sent it as a patch: https://lore.kernel.org/all/20251106220635.2608494-1-pasha.tatashin at soleen.com > > diff --git a/lib/test_kho.c b/lib/test_kho.c > index 025ea251a186..85b60d87a50a 100644 > --- a/lib/test_kho.c > +++ b/lib/test_kho.c > @@ -315,6 +315,9 @@ static int __init kho_test_init(void) > phys_addr_t fdt_phys; > int err; > > + if (!kho_is_enabled()) > + return 0; > + > err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys); > if (!err) > return kho_test_restore(fdt_phys); > > On Thu, Nov 6, 2025 at 3:41?AM kernel test robot wrote: > > > > > > > > Hello, > > > > kernel test robot noticed "WARNING:at_kernel/kexec_handover.c:#kho_add_subtree" on: > > > > commit: e44a700c561d1e892a8d0829d557e221604a7b93 ("[PATCH v9 2/9] kho: drop notifiers") > > url: https://github.com/intel-lab-lkp/linux/commits/Pasha-Tatashin/kho-make-debugfs-interface-optional/20251101-222610 > > patch link: https://lore.kernel.org/all/20251101142325.1326536-3-pasha.tatashin at soleen.com/ > > patch subject: [PATCH v9 2/9] kho: drop notifiers > > > > in testcase: boot > > > > config: x86_64-randconfig-001-20251015 > > compiler: gcc-14 > > test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G > > > > (please refer to attached dmesg/kmsg for entire log/backtrace) > > > > > > +--------------------------------------------------------+------------+------------+ > > | | 93e4b3b2e9 | e44a700c56 | > > +--------------------------------------------------------+------------+------------+ > > | WARNING:at_kernel/kexec_handover.c:#kho_add_subtree | 0 | 8 | > > | RIP:kho_add_subtree | 0 | 8 | > > +--------------------------------------------------------+------------+------------+ > > > > > > If you fix the issue in a separate patch/commit (i.e. 
not just a new version of > > the same patch/commit), kindly add following tags > > | Reported-by: kernel test robot > > | Closes: https://lore.kernel.org/oe-lkp/202511061629.e242724-lkp at intel.com > > > > > > [ 13.620111][ T1] ------------[ cut here ]------------ > > [ 13.620739][ T1] WARNING: CPU: 1 PID: 1 at kernel/kexec_handover.c:704 kho_add_subtree (kernel/kexec_handover.c:704) > > [ 13.621665][ T1] Modules linked in: > > [ 13.622090][ T1] CPU: 1 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.18.0-rc3-00211-ge44a700c561d #1 VOLUNTARY > > [ 13.623073][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 > > [ 13.624054][ T1] RIP: 0010:kho_add_subtree (kernel/kexec_handover.c:704) > > [ 13.624596][ T1] Code: c7 38 b4 ac 85 31 ed e8 01 1c 00 00 48 c7 c7 70 5a ca 86 85 c0 89 c3 40 0f 95 c5 31 c9 31 d2 89 ee e8 37 b5 0a 00 85 db 74 02 <0f> 0b b9 01 00 00 00 31 d2 89 ee 48 c7 c7 40 5a ca 86 e8 1c b5 0a > > All code > > ======== > > 0: c7 38 b4 ac 85 xbegin 0xffffffff85acb43d,(bad) > > 5: 31 ed xor %ebp,%ebp > > 7: e8 01 1c 00 00 call 0x1c0d > > c: 48 c7 c7 70 5a ca 86 mov $0xffffffff86ca5a70,%rdi > > 13: 85 c0 test %eax,%eax > > 15: 89 c3 mov %eax,%ebx > > 17: 40 0f 95 c5 setne %bpl > > 1b: 31 c9 xor %ecx,%ecx > > 1d: 31 d2 xor %edx,%edx > > 1f: 89 ee mov %ebp,%esi > > 21: e8 37 b5 0a 00 call 0xab55d > > 26: 85 db test %ebx,%ebx > > 28: 74 02 je 0x2c > > 2a:* 0f 0b ud2 <-- trapping instruction > > 2c: b9 01 00 00 00 mov $0x1,%ecx > > 31: 31 d2 xor %edx,%edx > > 33: 89 ee mov %ebp,%esi > > 35: 48 c7 c7 40 5a ca 86 mov $0xffffffff86ca5a40,%rdi > > 3c: e8 .byte 0xe8 > > 3d: 1c b5 sbb $0xb5,%al > > 3f: 0a .byte 0xa > > > > Code starting with the faulting instruction > > =========================================== > > 0: 0f 0b ud2 > > 2: b9 01 00 00 00 mov $0x1,%ecx > > 7: 31 d2 xor %edx,%edx > > 9: 89 ee mov %ebp,%esi > > b: 48 c7 c7 40 5a ca 86 mov $0xffffffff86ca5a40,%rdi > > 12: e8 .byte 0xe8 > > 13: 1c b5 sbb $0xb5,%al > 
> 15: 0a .byte 0xa > > [ 13.626370][ T1] RSP: 0018:ffffc9000001fca0 EFLAGS: 00010286 > > [ 13.626951][ T1] RAX: dffffc0000000000 RBX: 00000000ffffffff RCX: 0000000000000000 > > [ 13.627737][ T1] RDX: 1ffffffff0d94b52 RSI: 0000000000000001 RDI: ffffffff86ca5a90 > > [ 13.628523][ T1] RBP: 0000000000000001 R08: 0000000000000008 R09: fffffbfff0dfac4c > > [ 13.629330][ T1] R10: 0000000000000000 R11: ffffffff86fd6267 R12: ffff888133ee2000 > > [ 13.630101][ T1] R13: ffffffff85acb340 R14: ffff888117a5f988 R15: dffffc0000000000 > > [ 13.630869][ T1] FS: 0000000000000000(0000) GS:ffff888426ea0000(0000) knlGS:0000000000000000 > > [ 13.631727][ T1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 13.632370][ T1] CR2: 00007f586df260ac CR3: 00000000054ea000 CR4: 00000000000406f0 > > [ 13.633154][ T1] Call Trace: > > [ 13.633506][ T1] > > [ 13.633833][ T1] kho_test_prepare_fdt+0x145/0x180 > > [ 13.634446][ T1] ? kho_test_save_data+0x210/0x210 > > [ 13.635097][ T1] ? csum_partial (lib/checksum.c:123) > > [ 13.635546][ T1] kho_test_init (lib/test_kho.c:177 lib/test_kho.c:284) > > [ 13.636018][ T1] ? vmalloc_test_init (lib/test_kho.c:271) > > [ 13.636508][ T1] ? add_device_randomness (drivers/char/random.c:944) > > [ 13.637485][ T1] ? mix_pool_bytes (drivers/char/random.c:944) > > [ 13.637955][ T1] ? trace_initcall_start (include/trace/events/initcall.h:27 (discriminator 3)) > > [ 13.638498][ T1] ? vmalloc_test_init (lib/test_kho.c:271) > > [ 13.638989][ T1] do_one_initcall (init/main.c:1284) > > [ 13.639477][ T1] ? trace_initcall_start (init/main.c:1274) > > [ 13.639998][ T1] ? parse_one (kernel/params.c:143) > > [ 13.640455][ T1] ? kasan_save_track (mm/kasan/common.c:69 (discriminator 1) mm/kasan/common.c:78 (discriminator 1)) > > [ 13.640948][ T1] ? __kmalloc_noprof (mm/slub.c:5659) > > [ 13.641465][ T1] do_initcalls (init/main.c:1344 (discriminator 3) init/main.c:1361 (discriminator 3)) > > [ 13.641924][ T1] kernel_init_freeable (init/main.c:1595) > > [ 13.642441][ T1] ? 
rest_init (init/main.c:1475) > > [ 13.642891][ T1] kernel_init (init/main.c:1485) > > [ 13.643345][ T1] ? rest_init (init/main.c:1475) > > [ 13.643788][ T1] ret_from_fork (arch/x86/kernel/process.c:164) > > [ 13.644256][ T1] ? rest_init (init/main.c:1475) > > [ 13.644703][ T1] ret_from_fork_asm (arch/x86/entry/entry_64.S:255) > > [ 13.645213][ T1] > > [ 13.645540][ T1] irq event stamp: 132025 > > [ 13.645971][ T1] hardirqs last enabled at (132035): __up_console_sem (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:109 arch/x86/include/asm/irqflags.h:151 kernel/printk/printk.c:345) > > [ 13.646887][ T1] hardirqs last disabled at (132046): __up_console_sem (kernel/printk/printk.c:343 (discriminator 3)) > > [ 13.648253][ T1] softirqs last enabled at (131286): handle_softirqs (kernel/softirq.c:469 (discriminator 1) kernel/softirq.c:650 (discriminator 1)) > > [ 13.649690][ T1] softirqs last disabled at (131281): __irq_exit_rcu (kernel/softirq.c:496 kernel/softirq.c:723) > > [ 13.651128][ T1] ---[ end trace 0000000000000000 ]--- > > > > > > The kernel config and materials to reproduce are available at: > > https://download.01.org/0day-ci/archive/20251106/202511061629.e242724-lkp at intel.com > > > > > > > > -- > > 0-DAY CI Kernel Test Service > > https://github.com/intel/lkp-tests/wiki > > From akpm at linux-foundation.org Thu Nov 6 16:44:42 2025 From: akpm at linux-foundation.org (Andrew Morton) Date: Thu, 6 Nov 2025 16:44:42 -0800 Subject: [PATCH 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: References: <20251105130922.13321-1-piliu@redhat.com> <20251105130922.13321-2-piliu@redhat.com> <20251105161432.98eb69f87f30627a9067e78e@linux-foundation.org> Message-ID: <20251106164442.f0158876667a18d0f31a127a@linux-foundation.org> On Thu, 6 Nov 2025 10:57:33 +0800 Pingfan Liu wrote: > > > This is caused by the fact that kexec allocates the destination directly > > > in the CMA area. 
In that case, the CMA kernel address should be exported > directly to the IMA component, instead of using the vmalloc'd address. > > This is something we should backport into earlier kernels. > > > Signed-off-by: Pingfan Liu > > Cc: Andrew Morton > > Cc: Baoquan He > > Cc: Alexander Graf > > Cc: Steven Chen > > Cc: linux-integrity at vger.kernel.org > > To: kexec at lists.infradead.org > > > > So I'm thinking we should add > > > > Fixes: 0091d9241ea2 ("kexec: define functions to map and unmap segments") > Should be: > Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") > > Because 07d24902977e came after 0091d9241ea2 and introduced this issue. Thanks, I updated the mm.git copy of this patch. From bhe at redhat.com Thu Nov 6 17:51:04 2025 From: bhe at redhat.com (Baoquan He) Date: Fri, 7 Nov 2025 09:51:04 +0800 Subject: [PATCHv2 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: References: <20251106065904.10772-1-piliu@redhat.com> <20251106065904.10772-2-piliu@redhat.com> Message-ID: On 11/06/25 at 06:01pm, Pingfan Liu wrote: > On Thu, Nov 6, 2025 at 4:01 PM Baoquan He wrote: > > > > On 11/06/25 at 02:59pm, Pingfan Liu wrote: > > > When I tested kexec with the latest kernel, I ran into the following warning: > > > > > > [ 40.712410] ------------[ cut here ]------------ > > > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > > > [...] > > > [ 40.816047] Call trace: > > > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > > > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > > > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > > > [...] > > > [ 40.855423] ---[ end trace 0000000000000000 ]--- > > > > > > This is caused by the fact that kexec allocates the destination directly > > > in the CMA area. In that case, the CMA kernel address should be exported > > > directly to the IMA component, instead of using the vmalloc'd address.
> > > > Well, you didn't update the log accordingly. > > > > I am not sure what you mean. Do you mean the earlier content which I > replied to you? No. In v1, you return cma directly. But in v2, you return its direct mapping address, isn't it? > > > Do you know why cma area can't be mapped into vmalloc? > > > Should not the kernel direct mapping be used? When image->segment_cma[i] has value, image->ima_buffer_addr also contains the physical address of the cma area, so why can't the cma physical address be mapped into vmalloc, and why does that cause the failure and call trace? From piliu at redhat.com Thu Nov 6 21:13:08 2025 From: piliu at redhat.com (Pingfan Liu) Date: Fri, 7 Nov 2025 13:13:08 +0800 Subject: [PATCHv2 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: References: <20251106065904.10772-1-piliu@redhat.com> <20251106065904.10772-2-piliu@redhat.com> Message-ID: On Fri, Nov 7, 2025 at 9:51 AM Baoquan He wrote: > > On 11/06/25 at 06:01pm, Pingfan Liu wrote: > > On Thu, Nov 6, 2025 at 4:01 PM Baoquan He wrote: > > > > > > On 11/06/25 at 02:59pm, Pingfan Liu wrote: > > > > When I tested kexec with the latest kernel, I ran into the following warning: > > > > > > > > [ 40.712410] ------------[ cut here ]------------ > > > > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > > > > [...] > > > > [ 40.816047] Call trace: > > > > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > > > > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > > > > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > > > > [...] > > > > [ 40.855423] ---[ end trace 0000000000000000 ]--- > > > > > > > > This is caused by the fact that kexec allocates the destination directly > > > > in the CMA area. In that case, the CMA kernel address should be exported > > > > directly to the IMA component, instead of using the vmalloc'd address. > > > > > > Well, you didn't update the log accordingly. > > > > > > > I am not sure what you mean.
Do you mean the earlier content which I > > replied to you? > > No. In v1, you return cma directly. But in v2, you return its direct > > mapping address, isn't it? > > Yes. But I think it is a fault in the code, which does not match what the commit log describes. Do you think I should rephrase the words "the CMA kernel address" as "the CMA kernel direct mapping address"? > > > > > > Do you know why cma area can't be mapped into vmalloc? > > > > > Should not the kernel direct mapping be used? > > When image->segment_cma[i] has value, image->ima_buffer_addr also > contains the physical address of the cma area, why cma physical address > can't be mapped into vmalloc and cause the failure and call trace? > > It could be done using the vmalloc approach, but it's unnecessary. IIUC, kimage_map_segment() was introduced to provide a contiguous virtual address for IMA access, since the IND_SRC pages are scattered throughout the kernel. However, in the CMA case, there is already a contiguous virtual address in the kernel direct mapping range. Normally, when we have a physical address, we simply use phys_to_virt() to get its corresponding kernel virtual address.
Thanks, Pingfan From bhe at redhat.com Thu Nov 6 21:25:41 2025 From: bhe at redhat.com (Baoquan He) Date: Fri, 7 Nov 2025 13:25:41 +0800 Subject: [PATCHv2 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: References: <20251106065904.10772-1-piliu@redhat.com> <20251106065904.10772-2-piliu@redhat.com> Message-ID: On 11/07/25 at 01:13pm, Pingfan Liu wrote: > On Fri, Nov 7, 2025 at 9:51?AM Baoquan He wrote: > > > > On 11/06/25 at 06:01pm, Pingfan Liu wrote: > > > On Thu, Nov 6, 2025 at 4:01?PM Baoquan He wrote: > > > > > > > > On 11/06/25 at 02:59pm, Pingfan Liu wrote: > > > > > When I tested kexec with the latest kernel, I ran into the following warning: > > > > > > > > > > [ 40.712410] ------------[ cut here ]------------ > > > > > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > > > > > [...] > > > > > [ 40.816047] Call trace: > > > > > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > > > > > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > > > > > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > > > > > [...] > > > > > [ 40.855423] ---[ end trace 0000000000000000 ]--- > > > > > > > > > > This is caused by the fact that kexec allocates the destination directly > > > > > in the CMA area. In that case, the CMA kernel address should be exported > > > > > directly to the IMA component, instead of using the vmalloc'd address. > > > > > > > > Well, you didn't update the log accordingly. > > > > > > > > > > I am not sure what you mean. Do you mean the earlier content which I > > > replied to you? > > > > No. In v1, you return cma directly. But in v2, you return its direct > > mapping address, isnt' it? > > > > Yes. But I think it is a fault in the code, which does not convey the > expression in the commit log. Do you think I should rephrase the words > "the CMA kernel address" as "the CMA kernel direct mapping address"? That's fine to me. 
> > > > > > > > Do you know why cma area can't be mapped into vmalloc? > > > > > > > Should not the kernel direct mapping be used? > > > > When image->segment_cma[i] has value, image->ima_buffer_addr also > > contains the physical address of the cma area, why cma physical address > > can't be mapped into vmalloc and cause the failure and call trace? > > > > It could be done using the vmalloc approach, but it's unnecessary. > IIUC, kimage_map_segment() was introduced to provide a contiguous > virtual address for IMA access, since the IND_SRC pages are scattered > throughout the kernel. However, in the CMA case, there is already a > contiguous virtual address in the kernel direct mapping range. > Normally, when we have a physical address, we simply use > phys_to_virt() to get its corresponding kernel virtual address. OK, I understand cma area is contiguous, and no need to map into vmalloc. I am wondering why in the old code mapping cma address into vmalloc causes the warning which you said is an IMA problem. From sourabhjain at linux.ibm.com Thu Nov 6 23:15:55 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Fri, 7 Nov 2025 12:45:55 +0530 Subject: [PATCH v2 5/5] crash: export crashkernel CMA reservation to userspace In-Reply-To: <20251106045107.17813-6-sourabhjain@linux.ibm.com> References: <20251106045107.17813-1-sourabhjain@linux.ibm.com> <20251106045107.17813-6-sourabhjain@linux.ibm.com> Message-ID: On 06/11/25 10:21, Sourabh Jain wrote: > Add a sysfs entry /sys/kernel/kexec/crash_cma_ranges to expose all > CMA crashkernel ranges. > > This allows userspace tools configuring kdump to determine how much > memory is reserved for crashkernel. If CMA is used, tools can warn > users when attempting to capture user pages with CMA reservation.
> > The new sysfs file holds the CMA ranges in the format below: > > cat /sys/kernel/kexec/crash_cma_ranges > 100000000-10c7fffff > > Cc: Aditya Gupta > Cc: Andrew Morton > Cc: Baoquan he > Cc: Dave Young > Cc: Hari Bathini > Cc: Jiri Bohac > Cc: Madhavan Srinivasan > Cc: Mahesh J Salgaonkar > Cc: Pingfan Liu > Cc: Ritesh Harjani (IBM) > Cc: Shivang Upadhyay > Cc: Vivek Goyal > Cc: linuxppc-dev at lists.ozlabs.org > Cc: kexec at lists.infradead.org > Signed-off-by: Sourabh Jain > --- > Documentation/ABI/testing/sysfs-kernel-kexec-kdump | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump > index 00c00f380fea..f59051b5d96d 100644 > --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump > +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump > @@ -49,3 +49,13 @@ Description: read only > is used by the user space utility kexec to support updating the > in-kernel kdump image during hotplug operations. > User: Kexec tools > + > +What: /sys/kernel/kexec/crash_cma_ranges > +Date: Nov 2025 > +Contact: kexec at lists.infradead.org > +Description: read only > + Provides information about the memory ranges reserved from > + the Contiguous Memory Allocator (CMA) area that are allocated > + to the crash (kdump) kernel. It lists the start and end physical > + addresses of CMA regions assigned for crashkernel use. > +User: kdump service While rebasing the v1 patches, the hunk that adds the show function didn't get picked up. I will send v3 with a show function to export the crashkernel CMA reservation.
- Sourabh Jain From sourabhjain at linux.ibm.com Fri Nov 7 00:03:34 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Fri, 7 Nov 2025 13:33:34 +0530 Subject: [PATCH v7] powerpc/kdump: Add support for crashkernel CMA reservation Message-ID: <20251107080334.708028-1-sourabhjain@linux.ibm.com> Commit 35c18f2933c5 ("Add a new optional ",cma" suffix to the crashkernel= command line option") and commit ab475510e042 ("kdump: implement reserve_crashkernel_cma") added CMA support for kdump crashkernel reservation. Extend crashkernel CMA reservation support to powerpc. The following changes are made to enable CMA reservation on powerpc: - Parse and obtain the CMA reservation size along with other crashkernel parameters - Call reserve_crashkernel_cma() to allocate the CMA region for kdump - Include the CMA-reserved ranges in the usable memory ranges for the kdump kernel to use. - Exclude the CMA-reserved ranges from the crash kernel memory to prevent them from being exported through /proc/vmcore. With the introduction of the CMA crashkernel regions, crash_exclude_mem_range() needs to be called multiple times to exclude both crashk_res and crashk_cma_ranges from the crash memory ranges. To avoid repetitive logic for validating mem_ranges size and handling reallocation when required, this functionality is moved to a new wrapper function crash_exclude_mem_range_guarded(). To ensure proper CMA reservation, reserve_crashkernel_cma() is called after pageblock_order is initialized. Update kernel-parameters.txt to document CMA support for crashkernel on powerpc architecture. 
Cc: Baoquan he Cc: Jiri Bohac Cc: Hari Bathini Cc: Madhavan Srinivasan Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- Changelog: v6 -> v7 https://lore.kernel.org/all/20251104132818.1724562-1-sourabhjain at linux.ibm.com/ - declare crashk_cma_size static --- .../admin-guide/kernel-parameters.txt | 2 +- arch/powerpc/include/asm/kexec.h | 2 + arch/powerpc/kernel/setup-common.c | 4 +- arch/powerpc/kexec/core.c | 10 ++++- arch/powerpc/kexec/ranges.c | 43 ++++++++++++++----- 5 files changed, 47 insertions(+), 14 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 6c42061ca20e..1c10190d583d 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1013,7 +1013,7 @@ It will be ignored when crashkernel=X,high is not used or memory reserved is below 4G. crashkernel=size[KMG],cma - [KNL, X86] Reserve additional crash kernel memory from + [KNL, X86, ppc] Reserve additional crash kernel memory from CMA. This reservation is usable by the first system's userspace memory and kernel movable allocations (memory balloon, zswap). 
Pages allocated from this memory range diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h index 4bbf9f699aaa..bd4a6c42a5f3 100644 --- a/arch/powerpc/include/asm/kexec.h +++ b/arch/powerpc/include/asm/kexec.h @@ -115,9 +115,11 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt, struct crash_mem #ifdef CONFIG_CRASH_RESERVE int __init overlaps_crashkernel(unsigned long start, unsigned long size); extern void arch_reserve_crashkernel(void); +extern void kdump_cma_reserve(void); #else static inline void arch_reserve_crashkernel(void) {} static inline int overlaps_crashkernel(unsigned long start, unsigned long size) { return 0; } +static inline void kdump_cma_reserve(void) { } #endif #if defined(CONFIG_CRASH_DUMP) diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c index 68d47c53876c..c8c42b419742 100644 --- a/arch/powerpc/kernel/setup-common.c +++ b/arch/powerpc/kernel/setup-common.c @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -995,11 +996,12 @@ void __init setup_arch(char **cmdline_p) initmem_init(); /* - * Reserve large chunks of memory for use by CMA for fadump, KVM and + * Reserve large chunks of memory for use by CMA for kdump, fadump, KVM and * hugetlb. These must be called after initmem_init(), so that * pageblock_order is initialised. 
*/ fadump_cma_init(); + kdump_cma_reserve(); kvm_cma_reserve(); gigantic_hugetlb_cma_reserve(); diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c index d1a2d755381c..e59bdfcc6463 100644 --- a/arch/powerpc/kexec/core.c +++ b/arch/powerpc/kexec/core.c @@ -59,6 +59,8 @@ void machine_kexec(struct kimage *image) #ifdef CONFIG_CRASH_RESERVE +static unsigned long long crashk_cma_size; + static unsigned long long __init get_crash_base(unsigned long long crash_base) { @@ -110,7 +112,7 @@ void __init arch_reserve_crashkernel(void) /* use common parsing */ ret = parse_crashkernel(boot_command_line, total_mem_sz, &crash_size, - &crash_base, NULL, NULL, NULL); + &crash_base, NULL, &crashk_cma_size, NULL); if (ret) return; @@ -130,6 +132,12 @@ void __init arch_reserve_crashkernel(void) reserve_crashkernel_generic(crash_size, crash_base, 0, false); } +void __init kdump_cma_reserve(void) +{ + if (crashk_cma_size) + reserve_crashkernel_cma(crashk_cma_size); +} + int __init overlaps_crashkernel(unsigned long start, unsigned long size) { return (start + size) > crashk_res.start && start <= crashk_res.end; diff --git a/arch/powerpc/kexec/ranges.c b/arch/powerpc/kexec/ranges.c index 3702b0bdab14..3bd27c38726b 100644 --- a/arch/powerpc/kexec/ranges.c +++ b/arch/powerpc/kexec/ranges.c @@ -515,7 +515,7 @@ int get_exclude_memory_ranges(struct crash_mem **mem_ranges) */ int get_usable_memory_ranges(struct crash_mem **mem_ranges) { - int ret; + int ret, i; /* * Early boot failure observed on guests when low memory (first memory @@ -528,6 +528,13 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) if (ret) goto out; + for (i = 0; i < crashk_cma_cnt; i++) { + ret = add_mem_range(mem_ranges, crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end - crashk_cma_ranges[i].start + 1); + if (ret) + goto out; + } + ret = add_rtas_mem_range(mem_ranges); if (ret) goto out; @@ -546,6 +553,22 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) #endif /* 
CONFIG_KEXEC_FILE */ #ifdef CONFIG_CRASH_DUMP +static int crash_exclude_mem_range_guarded(struct crash_mem **mem_ranges, + unsigned long long mstart, + unsigned long long mend) +{ + struct crash_mem *tmem = *mem_ranges; + + /* Reallocate memory ranges if there is no space to split ranges */ + if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) { + tmem = realloc_mem_ranges(mem_ranges); + if (!tmem) + return -ENOMEM; + } + + return crash_exclude_mem_range(tmem, mstart, mend); +} + /** * get_crash_memory_ranges - Get crash memory ranges. This list includes * first/crashing kernel's memory regions that @@ -557,7 +580,6 @@ int get_usable_memory_ranges(struct crash_mem **mem_ranges) int get_crash_memory_ranges(struct crash_mem **mem_ranges) { phys_addr_t base, end; - struct crash_mem *tmem; u64 i; int ret; @@ -582,19 +604,18 @@ int get_crash_memory_ranges(struct crash_mem **mem_ranges) sort_memory_ranges(*mem_ranges, true); } - /* Reallocate memory ranges if there is no space to split ranges */ - tmem = *mem_ranges; - if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) { - tmem = realloc_mem_ranges(mem_ranges); - if (!tmem) - goto out; - } - /* Exclude crashkernel region */ - ret = crash_exclude_mem_range(tmem, crashk_res.start, crashk_res.end); + ret = crash_exclude_mem_range_guarded(mem_ranges, crashk_res.start, crashk_res.end); if (ret) goto out; + for (i = 0; i < crashk_cma_cnt; ++i) { + ret = crash_exclude_mem_range_guarded(mem_ranges, crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end); + if (ret) + goto out; + } + /* * FIXME: For now, stay in parity with kexec-tools but if RTAS/OPAL * regions are exported to save their context at the time of -- 2.51.1 From piliu at redhat.com Fri Nov 7 01:00:09 2025 From: piliu at redhat.com (Pingfan Liu) Date: Fri, 7 Nov 2025 17:00:09 +0800 Subject: [PATCHv2 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: References: <20251106065904.10772-1-piliu@redhat.com> 
<20251106065904.10772-2-piliu@redhat.com> Message-ID: On Fri, Nov 07, 2025 at 01:25:41PM +0800, Baoquan He wrote: > On 11/07/25 at 01:13pm, Pingfan Liu wrote: > > On Fri, Nov 7, 2025 at 9:51?AM Baoquan He wrote: > > > > > > On 11/06/25 at 06:01pm, Pingfan Liu wrote: > > > > On Thu, Nov 6, 2025 at 4:01?PM Baoquan He wrote: > > > > > > > > > > On 11/06/25 at 02:59pm, Pingfan Liu wrote: > > > > > > When I tested kexec with the latest kernel, I ran into the following warning: > > > > > > > > > > > > [ 40.712410] ------------[ cut here ]------------ > > > > > > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > > > > > > [...] > > > > > > [ 40.816047] Call trace: > > > > > > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > > > > > > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > > > > > > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > > > > > > [...] > > > > > > [ 40.855423] ---[ end trace 0000000000000000 ]--- > > > > > > > > > > > > This is caused by the fact that kexec allocates the destination directly > > > > > > in the CMA area. In that case, the CMA kernel address should be exported > > > > > > directly to the IMA component, instead of using the vmalloc'd address. > > > > > > > > > > Well, you didn't update the log accordingly. > > > > > > > > > > > > > I am not sure what you mean. Do you mean the earlier content which I > > > > replied to you? > > > > > > No. In v1, you return cma directly. But in v2, you return its direct > > > mapping address, isnt' it? > > > > > > > Yes. But I think it is a fault in the code, which does not convey the > > expression in the commit log. Do you think I should rephrase the words > > "the CMA kernel address" as "the CMA kernel direct mapping address"? > > That's fine to me. > > > > > > > > > > > > Do you know why cma area can't be mapped into vmalloc? > > > > > > > > > Should not the kernel direct mapping be used? 
> > > > > > When image->segment_cma[i] has value, image->ima_buffer_addr also > > > contains the physical address of the cma area, why cma physical address > > > can't be mapped into vmalloc and cause the failure and call trace? > > > > > > > It could be done using the vmalloc approach, but it's unnecessary. > > IIUC, kimage_map_segment() was introduced to provide a contiguous > > virtual address for IMA access, since the IND_SRC pages are scattered > > throughout the kernel. However, in the CMA case, there is already a > > contiguous virtual address in the kernel direct mapping range. > > Normally, when we have a physical address, we simply use > > phys_to_virt() to get its corresponding kernel virtual address. > > OK, I understand cma area is contiguous, and no need to map into > vmalloc. I am wondering why in the old code mapping cma addrss into > vmalloc cause the warning which you said is a IMA problem. > It doesn't go that far. The old code doesn't map CMA into vmalloc'd area. void *kimage_map_segment(struct kimage *image, int idx) { ... for_each_kimage_entry(image, ptr, entry) { if (entry & IND_DESTINATION) { dest_page_addr = entry & PAGE_MASK; } else if (entry & IND_SOURCE) { if (dest_page_addr >= addr && dest_page_addr < eaddr) { src_page_addr = entry & PAGE_MASK; src_pages[i++] = virt_to_page(__va(src_page_addr)); if (i == npages) break; dest_page_addr += PAGE_SIZE; } } } /* Sanity check. */ WARN_ON(i < npages); //--> This is the warning thrown by kernel vaddr = vmap(src_pages, npages, VM_MAP, PAGE_KERNEL); kfree(src_pages); if (!vaddr) pr_err("Could not map ima buffer.\n"); return vaddr; } When CMA is used, there is no IND_SOURCE, so we have i=0 < npages. Now, I see how my words ("In that case, the CMA kernel address should be exported directly to the IMA component, instead of using the vmalloc'd address.") confused you. As for "instead of using the vmalloc'd address", I meant to mention "vmap()" approach. 
Best Regards, Pingfan From bhe at redhat.com Fri Nov 7 01:31:32 2025 From: bhe at redhat.com (Baoquan He) Date: Fri, 7 Nov 2025 17:31:32 +0800 Subject: [PATCHv2 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: References: <20251106065904.10772-1-piliu@redhat.com> <20251106065904.10772-2-piliu@redhat.com> Message-ID: On 11/07/25 at 05:00pm, Pingfan Liu wrote: > On Fri, Nov 07, 2025 at 01:25:41PM +0800, Baoquan He wrote: > > On 11/07/25 at 01:13pm, Pingfan Liu wrote: > > > On Fri, Nov 7, 2025 at 9:51?AM Baoquan He wrote: > > > > > > > > On 11/06/25 at 06:01pm, Pingfan Liu wrote: > > > > > On Thu, Nov 6, 2025 at 4:01?PM Baoquan He wrote: > > > > > > > > > > > > On 11/06/25 at 02:59pm, Pingfan Liu wrote: > > > > > > > When I tested kexec with the latest kernel, I ran into the following warning: > > > > > > > > > > > > > > [ 40.712410] ------------[ cut here ]------------ > > > > > > > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > > > > > > > [...] > > > > > > > [ 40.816047] Call trace: > > > > > > > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > > > > > > > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > > > > > > > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > > > > > > > [...] > > > > > > > [ 40.855423] ---[ end trace 0000000000000000 ]--- > > > > > > > > > > > > > > This is caused by the fact that kexec allocates the destination directly > > > > > > > in the CMA area. In that case, the CMA kernel address should be exported > > > > > > > directly to the IMA component, instead of using the vmalloc'd address. > > > > > > > > > > > > Well, you didn't update the log accordingly. > > > > > > > > > > > > > > > > I am not sure what you mean. Do you mean the earlier content which I > > > > > replied to you? > > > > > > > > No. In v1, you return cma directly. But in v2, you return its direct > > > > mapping address, isnt' it? > > > > > > > > > > Yes. 
But I think it is a fault in the code, which does not convey the > > > expression in the commit log. Do you think I should rephrase the words > > > "the CMA kernel address" as "the CMA kernel direct mapping address"? > > > > That's fine to me. > > > > > > > > > > > > > > > > Do you know why cma area can't be mapped into vmalloc? > > > > > > > > > > > Should not the kernel direct mapping be used? > > > > > > > > When image->segment_cma[i] has value, image->ima_buffer_addr also > > > > contains the physical address of the cma area, why cma physical address > > > > can't be mapped into vmalloc and cause the failure and call trace? > > > > > > > > > > It could be done using the vmalloc approach, but it's unnecessary. > > > IIUC, kimage_map_segment() was introduced to provide a contiguous > > > virtual address for IMA access, since the IND_SRC pages are scattered > > > throughout the kernel. However, in the CMA case, there is already a > > > contiguous virtual address in the kernel direct mapping range. > > > Normally, when we have a physical address, we simply use > > > phys_to_virt() to get its corresponding kernel virtual address. > > > > OK, I understand cma area is contiguous, and no need to map into > > vmalloc. I am wondering why in the old code mapping cma addrss into > > vmalloc cause the warning which you said is a IMA problem. > > > > It doesn't go that far. The old code doesn't map CMA into vmalloc'd > area. > > void *kimage_map_segment(struct kimage *image, int idx) > { > ... > for_each_kimage_entry(image, ptr, entry) { > if (entry & IND_DESTINATION) { > dest_page_addr = entry & PAGE_MASK; > } else if (entry & IND_SOURCE) { > if (dest_page_addr >= addr && dest_page_addr < eaddr) { > src_page_addr = entry & PAGE_MASK; > src_pages[i++] = > virt_to_page(__va(src_page_addr)); > if (i == npages) > break; > dest_page_addr += PAGE_SIZE; > } > } > } > > /* Sanity check. 
*/ > WARN_ON(i < npages); //--> This is the warning thrown by kernel > > vaddr = vmap(src_pages, npages, VM_MAP, PAGE_KERNEL); > kfree(src_pages); > > if (!vaddr) > pr_err("Could not map ima buffer.\n"); > > return vaddr; > } > > When CMA is used, there is no IND_SOURCE, so we have i=0 < npages. > Now, I see how my words ("In that case, the CMA kernel address should be > exported directly to the IMA component, instead of using the vmalloc'd > address.") confused you. As for "instead of using the vmalloc'd > address", I meant to mention "vmap()" approach. Ok, I got it. It's truly a bug because if image->segment_cma[idx] is valid, the current kimage_map_segment() can't collect the source pages at all since they are not marked with IND_DESTINATION|IND_SOURCE as normal segment does. In that situation, we can take the direct mapping address of image->segment_cma[idx] which is more efficient, instead of collecting source pages and vmap(). From bhe at redhat.com Fri Nov 7 01:34:15 2025 From: bhe at redhat.com (Baoquan He) Date: Fri, 7 Nov 2025 17:34:15 +0800 Subject: [PATCHv2 2/2] kernel/kexec: Fix IMA when allocation happens in CMA area In-Reply-To: <20251106065904.10772-2-piliu@redhat.com> References: <20251106065904.10772-1-piliu@redhat.com> <20251106065904.10772-2-piliu@redhat.com> Message-ID: On 11/06/25 at 02:59pm, Pingfan Liu wrote: > When I tested kexec with the latest kernel, I ran into the following warning: > > [ 40.712410] ------------[ cut here ]------------ > [ 40.712576] WARNING: CPU: 2 PID: 1562 at kernel/kexec_core.c:1001 kimage_map_segment+0x144/0x198 > [...] > [ 40.816047] Call trace: > [ 40.818498] kimage_map_segment+0x144/0x198 (P) > [ 40.823221] ima_kexec_post_load+0x58/0xc0 > [ 40.827246] __do_sys_kexec_file_load+0x29c/0x368 > [...] > [ 40.855423] ---[ end trace 0000000000000000 ]--- > > This is caused by the fact that kexec allocates the destination directly > in the CMA area. 
In that case, the CMA kernel address should be exported > directly to the IMA component, instead of using the vmalloc'd address. > > Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") > Signed-off-by: Pingfan Liu > Cc: Andrew Morton > Cc: Baoquan He > Cc: Alexander Graf > Cc: Steven Chen > Cc: linux-integrity at vger.kernel.org > Cc: > To: kexec at lists.infradead.org > --- > v1 -> v2: > return page_address(page) instead of *page > > kernel/kexec_core.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > index 9a1966207041..332204204e53 100644 > --- a/kernel/kexec_core.c > +++ b/kernel/kexec_core.c > @@ -967,6 +967,7 @@ void *kimage_map_segment(struct kimage *image, int idx) > kimage_entry_t *ptr, entry; > struct page **src_pages; > unsigned int npages; > + struct page *cma; > void *vaddr = NULL; > int i; > > @@ -974,6 +975,9 @@ void *kimage_map_segment(struct kimage *image, int idx) > size = image->segment[idx].memsz; > eaddr = addr + size; > > + cma = image->segment_cma[idx]; > + if (cma) > + return page_address(cma); This judgement can be put above the addr/size/eaddr assignment lines? If you agree, maybe you can update the patch log by adding more details to explain the root cause so that people can understand it easier. > /* > * Collect the source pages and map them in a contiguous VA range. 
> */ > @@ -1014,7 +1018,8 @@ void *kimage_map_segment(struct kimage *image, int idx) > > void kimage_unmap_segment(void *segment_buffer) > { > - vunmap(segment_buffer); > + if (is_vmalloc_addr(segment_buffer)) > + vunmap(segment_buffer); > } > > struct kexec_load_limit { > -- > 2.49.0 > From pratyush at kernel.org Fri Nov 7 02:24:55 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 07 Nov 2025 11:24:55 +0100 Subject: [PATCH] lib/test_kho: Check if KHO is enabled In-Reply-To: <20251106220635.2608494-1-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Thu, 6 Nov 2025 17:06:35 -0500") References: <20251106220635.2608494-1-pasha.tatashin@soleen.com> Message-ID: On Thu, Nov 06 2025, Pasha Tatashin wrote: > We must check whether KHO is enabled prior to issuing KHO commands, > otherwise KHO internal data structures are not initialized. Should we have this check in the KHO APIs instead? This check is easy enough to miss. > > Fixes: b753522bed0b ("kho: add test for kexec handover") > Nit: these blank lines would probably mess up trailer parsing for tooling. 
> Reported-by: kernel test robot > Closes: https://lore.kernel.org/oe-lkp/202511061629.e242724-lkp at intel.com > > Signed-off-by: Pasha Tatashin > --- > lib/test_kho.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/lib/test_kho.c b/lib/test_kho.c > index 025ea251a186..85b60d87a50a 100644 > --- a/lib/test_kho.c > +++ b/lib/test_kho.c > @@ -315,6 +315,9 @@ static int __init kho_test_init(void) > phys_addr_t fdt_phys; > int err; > > + if (!kho_is_enabled()) > + return 0; > + > err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys); > if (!err) > return kho_test_restore(fdt_phys); -- Regards, Pratyush Yadav From pasha.tatashin at soleen.com Fri Nov 7 03:15:37 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 7 Nov 2025 06:15:37 -0500 Subject: [PATCH] lib/test_kho: Check if KHO is enabled In-Reply-To: References: <20251106220635.2608494-1-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 7, 2025 at 5:24?AM Pratyush Yadav wrote: > > On Thu, Nov 06 2025, Pasha Tatashin wrote: > > > We must check whether KHO is enabled prior to issuing KHO commands, > > otherwise KHO internal data structures are not initialized. > > Should we have this check in the KHO APIs instead? This check is easy > enough to miss. I considered adding a kho_is_enabled() check to every KHO API, but it seems unnecessary. In-kernel users of KHO, like reserve_mem and the upcoming LUO, are already expected to check if KHO is enabled before doing extra preservation work. I anticipate any future in-kernel users will follow the same pattern. We could add a WARN_ON(!kho_is_enabled()) to the internal API calls, but I don't think it's needed. We already catch this condition with other WARN_ONs, as shown by this report. > > > > > Fixes: b753522bed0b ("kho: add test for kexec handover") > > > > Nit: these blank lines would probably mess up trailer parsing for > tooling. Hm, if so, the blank line should be removed. 
Thank you, Pasha > > > Reported-by: kernel test robot > > Closes: https://lore.kernel.org/oe-lkp/202511061629.e242724-lkp at intel.com > > > > Signed-off-by: Pasha Tatashin > > --- > > lib/test_kho.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/lib/test_kho.c b/lib/test_kho.c > > index 025ea251a186..85b60d87a50a 100644 > > --- a/lib/test_kho.c > > +++ b/lib/test_kho.c > > @@ -315,6 +315,9 @@ static int __init kho_test_init(void) > > phys_addr_t fdt_phys; > > int err; > > > > + if (!kho_is_enabled()) > > + return 0; > > + > > err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys); > > if (!err) > > return kho_test_restore(fdt_phys); > > -- > Regards, > Pratyush Yadav From pratyush at kernel.org Fri Nov 7 08:07:18 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 07 Nov 2025 17:07:18 +0100 Subject: [PATCH] lib/test_kho: Check if KHO is enabled In-Reply-To: (Pasha Tatashin's message of "Fri, 7 Nov 2025 06:15:37 -0500") References: <20251106220635.2608494-1-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 07 2025, Pasha Tatashin wrote: > On Fri, Nov 7, 2025 at 5:24?AM Pratyush Yadav wrote: >> >> On Thu, Nov 06 2025, Pasha Tatashin wrote: >> >> > We must check whether KHO is enabled prior to issuing KHO commands, >> > otherwise KHO internal data structures are not initialized. >> >> Should we have this check in the KHO APIs instead? This check is easy >> enough to miss. > > I considered adding a kho_is_enabled() check to every KHO API, but it > seems unnecessary. > > In-kernel users of KHO, like reserve_mem and the upcoming LUO, are > already expected to check if KHO is enabled before doing extra > preservation work. I anticipate any future in-kernel users will follow > the same pattern. Hmm, fair enough. I suppose we can always change this later if it causes more pain. Reviewed-by: Pratyush Yadav [...] 
-- Regards, Pratyush Yadav From ritesh.list at gmail.com Fri Nov 7 18:44:51 2025 From: ritesh.list at gmail.com (Ritesh Harjani (IBM)) Date: Sat, 08 Nov 2025 08:14:51 +0530 Subject: [PATCH v7] powerpc/kdump: Add support for crashkernel CMA reservation In-Reply-To: <20251107080334.708028-1-sourabhjain@linux.ibm.com> References: <20251107080334.708028-1-sourabhjain@linux.ibm.com> Message-ID: <87a50x450c.ritesh.list@gmail.com> Sourabh Jain writes: > Commit 35c18f2933c5 ("Add a new optional ",cma" suffix to the > crashkernel= command line option") and commit ab475510e042 ("kdump: > implement reserve_crashkernel_cma") added CMA support for kdump > crashkernel reservation. > > Extend crashkernel CMA reservation support to powerpc. > Yup, would be nice to see this support landing in powerpc! > The following changes are made to enable CMA reservation on powerpc: > > - Parse and obtain the CMA reservation size along with other crashkernel > parameters > - Call reserve_crashkernel_cma() to allocate the CMA region for kdump > - Include the CMA-reserved ranges in the usable memory ranges for the > kdump kernel to use. > - Exclude the CMA-reserved ranges from the crash kernel memory to > prevent them from being exported through /proc/vmcore. > > With the introduction of the CMA crashkernel regions, > crash_exclude_mem_range() needs to be called multiple times to exclude > both crashk_res and crashk_cma_ranges from the crash memory ranges. To > avoid repetitive logic for validating mem_ranges size and handling > reallocation when required, this functionality is moved to a new wrapper > function crash_exclude_mem_range_guarded(). > > To ensure proper CMA reservation, reserve_crashkernel_cma() is called > after pageblock_order is initialized. > > Update kernel-parameters.txt to document CMA support for crashkernel on > powerpc architecture. 
> > Cc: Baoquan he > Cc: Jiri Bohac > Cc: Hari Bathini > Cc: Madhavan Srinivasan > Cc: Mahesh Salgaonkar > Cc: Michael Ellerman > Cc: Ritesh Harjani (IBM) > Cc: Shivang Upadhyay > Cc: kexec at lists.infradead.org > Signed-off-by: Sourabh Jain > --- > Changelog: > > v6 -> v7 > https://lore.kernel.org/all/20251104132818.1724562-1-sourabhjain at linux.ibm.com/ > - declare crashk_cma_size static > > --- > .../admin-guide/kernel-parameters.txt | 2 +- > arch/powerpc/include/asm/kexec.h | 2 + > arch/powerpc/kernel/setup-common.c | 4 +- > arch/powerpc/kexec/core.c | 10 ++++- > arch/powerpc/kexec/ranges.c | 43 ++++++++++++++----- > 5 files changed, 47 insertions(+), 14 deletions(-) Although my reviewed by may not count much here since I am not deeply familiar with arch/powerpc/kexec/** part.. But FWIW, the patch overall looks logical to me. Keeping cma reservation in setup_arch() is the right thing to do to avoid issues like these in past [1]. The error handling logic and the loop logic for handling CMA ranges also looks correct to me. So feel free to add: Reviewed-by: Ritesh Harjani (IBM) [1]: https://lore.kernel.org/linuxppc-dev/3ae208e48c0d9cefe53d2dc4f593388067405b7d.1729146153.git.ritesh.list at gmail.com/ From sourabhjain at linux.ibm.com Fri Nov 7 20:26:05 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Sat, 8 Nov 2025 09:56:05 +0530 Subject: [PATCH v7] powerpc/kdump: Add support for crashkernel CMA reservation In-Reply-To: <87a50x450c.ritesh.list@gmail.com> References: <20251107080334.708028-1-sourabhjain@linux.ibm.com> <87a50x450c.ritesh.list@gmail.com> Message-ID: <7ad5c02f-63b1-404b-97a1-d7237220f6f7@linux.ibm.com> On 08/11/25 08:14, Ritesh Harjani (IBM) wrote: > Sourabh Jain writes: > >> Commit 35c18f2933c5 ("Add a new optional ",cma" suffix to the >> crashkernel= command line option") and commit ab475510e042 ("kdump: >> implement reserve_crashkernel_cma") added CMA support for kdump >> crashkernel reservation. 
>> >> Extend crashkernel CMA reservation support to powerpc. >> > Yup, would be nice to see this support landing in powerpc! > >> The following changes are made to enable CMA reservation on powerpc: >> >> - Parse and obtain the CMA reservation size along with other crashkernel >> parameters >> - Call reserve_crashkernel_cma() to allocate the CMA region for kdump >> - Include the CMA-reserved ranges in the usable memory ranges for the >> kdump kernel to use. >> - Exclude the CMA-reserved ranges from the crash kernel memory to >> prevent them from being exported through /proc/vmcore. >> >> With the introduction of the CMA crashkernel regions, >> crash_exclude_mem_range() needs to be called multiple times to exclude >> both crashk_res and crashk_cma_ranges from the crash memory ranges. To >> avoid repetitive logic for validating mem_ranges size and handling >> reallocation when required, this functionality is moved to a new wrapper >> function crash_exclude_mem_range_guarded(). >> >> To ensure proper CMA reservation, reserve_crashkernel_cma() is called >> after pageblock_order is initialized. >> >> Update kernel-parameters.txt to document CMA support for crashkernel on >> powerpc architecture. 
>> >> Cc: Baoquan he >> Cc: Jiri Bohac >> Cc: Hari Bathini >> Cc: Madhavan Srinivasan >> Cc: Mahesh Salgaonkar >> Cc: Michael Ellerman >> Cc: Ritesh Harjani (IBM) >> Cc: Shivang Upadhyay >> Cc: kexec at lists.infradead.org >> Signed-off-by: Sourabh Jain >> --- >> Changelog: >> >> v6 -> v7 >> https://lore.kernel.org/all/20251104132818.1724562-1-sourabhjain at linux.ibm.com/ >> - declare crashk_cma_size static >> >> --- >> .../admin-guide/kernel-parameters.txt | 2 +- >> arch/powerpc/include/asm/kexec.h | 2 + >> arch/powerpc/kernel/setup-common.c | 4 +- >> arch/powerpc/kexec/core.c | 10 ++++- >> arch/powerpc/kexec/ranges.c | 43 ++++++++++++++----- >> 5 files changed, 47 insertions(+), 14 deletions(-) > Although my reviewed by may not count much here since I am not deeply > familiar with arch/powerpc/kexec/** part.. > > But FWIW, the patch overall looks logical to me. > Keeping cma reservation in setup_arch() is the right thing to do to > avoid issues like these in past [1]. The error handling logic and the > loop logic for handling CMA ranges also looks correct to me. > > So feel free to add: > Reviewed-by: Ritesh Harjani (IBM) > > [1]: https://lore.kernel.org/linuxppc-dev/3ae208e48c0d9cefe53d2dc4f593388067405b7d.1729146153.git.ritesh.list at gmail.com/ Thanks for the Review Ritesh. - Sourabh Jain From rientjes at google.com Sat Nov 8 15:48:32 2025 From: rientjes at google.com (David Rientjes) Date: Sat, 8 Nov 2025 15:48:32 -0800 (PST) Subject: [Hypervisor Live Update] Notes from November 3, 2025 Message-ID: <7742456c-b248-04cc-0e1a-9da7d0546f1a@google.com> Hi everybody, Here are the notes from the last Hypervisor Live Update call that happened on Monday, November 3. Thanks to everybody who was involved! These notes are intended to bring people up to speed who could not attend the call as well as keep the conversation going in between meetings. ----->o----- We chatted briefly about the status of the stateless KHO RFC patch series intended to simply LUO support. 
Pasha started us off by updating that Jason Miu will be updating his patch series and is expected to send those patches to the mailing list after some more internal code review. That would be expected to be posted either this week or next week.

Pasha also updated that his plan is to remove subsystems to simplify the state machine and the UAPI; this would be replaced by global data bound to the file lifecycle, created and destroyed automatically based on reserved file state. He was also simplifying the state machine to keep the minimum needed support in the initial version, but with extensibility for the future. No exact timeline although it is currently ~90% ready.

----->o-----
Pratyush discussed the global file based state and how it may circumvent the LUO state machine; it's an implicit preserve of the subsystem completely independent of global state. Pasha said the state is now bound to the state of the preserved files only based on sessions. We're getting rid of global state for now.

Pratyush suggested tying subsystems to the file handler but this would not be possible if subsystems are going away. Pasha said the new global state is flexible and can share multiple file handlers; one global state can be bound to multiple file handlers.

----->o-----
Jork Loeser asked if there is a design/API link for the memfd and whether this is something a driver could use to persistently hold data. He was asking if a driver could associate arbitrary pages with a preserved memfd.

Pratyush said the memfd preservation was part of the LUO patch series at the end. A driver can pass a memfd to LUO after creating an fd. Pratyush suggested using KHO to preserve data; the data may be moved at runtime but would need to be unmovable during preservation across kexec.

Jason Gunthorpe suggested using APIs to get a page from a memfd at a specific offset.

----->o-----
Vipin Sharma had posted a recent patch series for VFIO[1], David Matlack will be working on v2 of this while Vipin is on leave.
Feedback was received about not moving the PCI save state and making them public, so that's work in progress. More feedback said there were missing bits and we need more PCI core changes that would be updated in v2 to be more complete (but also include more PCI changes). No specific timeline yet on v2, but it will be based on LUO v5.

David said the VFIO patches are using an anonymous inode to recreate the file after live update and asked if we care about associating recreated fds for userspace after live update with a particular inode. Jason said that VFIO would care because it uses the inode to get an address space which it uses with an unmapped mapping range and this must work correctly.

----->o-----
Sami summarized the discussion on IOMMU persistence. He was working on updating the patch series to v2 based on the feedback from v1. He talked about restoration of the HWPTs on the restore side. Jason thought that we wouldn't have an IOAS for the restored domains and suggested it could be null instead. Sami thought this may be slightly invasive including where we are taking locks; Jason suggested against a dummy IOAS.

----->o-----
We briefly discussed deferred struct page initialization support with KHO. Pasha said KHO isn't compatible with deferred struct pages although when using KHO we definitely want fast reboot performance. We decided to discuss this more later after LPC where there will be some discussion about reboot performance.

----->o-----
Pratyush noted that he is working on the 1GB preservation but will take some more time to clean up and have it working properly. He said guest_memfd would use hugetlb for 1GB pages so he's working on hugetlb preservation. Pratyush was focusing on generic hugetlb support that could be ported for use with guest_memfd when it supports hugetlb. He's aiming for an RFC to be ready by the time of LPC.

Ackerley updated that the hugetlb support for guest_memfd is currently in RFC patches posted upstream.
----->o----- Next meeting will be on Monday, November 17 at 8am PST (UTC-8), everybody is welcome: https://meet.google.com/rjn-dmzu-hgq Topics for the next meeting: - update on the status of stateless KHO RFC patches that should simplify LUO support - update on the status of LUO v5 overall - follow up on the status of iommu persistence and its v2 patches based on v1 feedback - update on the v2 of the VFIO patch series based on LUO v5 and expected timelines - discuss status of hugetlb preservation, specifically 1GB support, with regular memfd, aiming for an RFC by the time of LPC - update on status of guest_memfd support for 1GB hugetlb pages - discuss any use cases for Confidential Computing where folios may need to be split after being marked as preserved during brown out - later: testing methodology to allow downstream consumers to qualify that live update works from one version to another - later: reducing blackout window during live update, including deferred struct page initialization Please let me know if you'd like to propose additional topics for discussion, thank you! [1] https://marc.info/?l=linux-kernel&m=176074589311102&w=2 From pasha.tatashin at soleen.com Sat Nov 8 17:53:58 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Sat, 8 Nov 2025 20:53:58 -0500 Subject: [Hypervisor Live Update] Notes from November 3, 2025 In-Reply-To: <7742456c-b248-04cc-0e1a-9da7d0546f1a@google.com> References: <7742456c-b248-04cc-0e1a-9da7d0546f1a@google.com> Message-ID: On Sat, Nov 8, 2025 at 6:48?PM David Rientjes wrote: > > Hi everybody, > > Here are the notes from the last Hypervisor Live Update call that happened > on Monday, November 3. Thanks to everybody who was involved! > > These notes are intended to bring people up to speed who could not attend > the call as well as keep the conversation going in between meetings. > > ----->o----- > We chatted briefly about the status of the stateless KHO RFC patch series > intended to simply LUO support. 
> > Pasha started us off by updating that Jason Miu will be updating his patch > series and is expected to send those patches to the mailing list after > some more internal code review. That would be expected to be posted > either this week or next week. > > Pasha also updated that is plan is to remove subsystems to simply the > state machine and the UAPI; this would be replaced by the file lifecycle > bound global data created and destroyed automatically based on reserved > file state. He was also simplifying the state machine to keep the minimum > needed support in the initial version, but with extensibility for the > future. No exact timeline although it is currently ~90% ready. Thank you David for running this meeting. The LUOv5 + memfd preservation from Pratyush was posted yesterday: https://lore.kernel.org/all/20251107210526.257742-1-pasha.tatashin at soleen.com Pasha > ----->o----- > Pratyush discussed the global file based state and how it may circumvent > the LUO state machine; it's an implicit preserve of the subsystem > completely independent of global state. Pasha said the state is now bound > to the state of the preserved files only based on sessions. We're getting > rid of global state for now. > > Pratyush suggested to tie subsystems with the file handler but this would > not be possible if subsystems are going away. Pasha said the new global > state is flexible and can share multiple file handlers; one global state > can be bound to multiple file handlers. > > ----->o----- > Jork Loeser as if there is a design/API link for the memfd and whether > this is something a driver could use to persistently holding data. He was > asking if a driver could associate arbitrary pages with a preserved memfd. > > Pratyush said the memfd preservation was part of the LUO patch series at > the end. A driver can pass a memfd to LUO after creating an fd. 
Pratyush > suggested using KHO to preserve data; the data may moved at runtime but > would need to be unmovable during preservation across kexec. > > Jason Gunthorpe suggested using APIs to get a page from a memfd at a > specific offset. > > ----->o----- > Vipin Sharma had posted a recent patch series for VFIO[1], David Matlack > will be working on v2 of this will Vipin is on leave. Feedback was > received about not moving the PCI save state and making them public, so > that's work in progress. More feedback said there was missing bits and we > need more PCI core changes that would be updated in v2 to be more complete > (but also include more PCI changes). No specific timeline yet on v2, but > it will be based on LUO v5. > > David said the VFIO patches are using an anonymous inode to recreate the > file after live update and asked if we care about associating recreated > fds for userspace after live update with a particular inode. Jason said > that VFIO would care because it uses the inode to get an address space > which it uses with an unmapped mapping range and this must work correctly. > > ----->o----- > Sami summarized the discussion on IOMMU persistence. He was working on > updating the patch series to v2 based on the feedback from v1. He talked > about restoration of the HWPTs on the restore side. Jason thought that we > wouldn't have an IOAS for the restored domains and suggested it could be > null instead. Sami thought this may be slightly invasive including where > we are taking locks; Jason suggested against a dummy IOAS. > > ----->o----- > We discussed briefly about deferred struct page initialization support > with KHO. Pasha said KHO isn't compatible with deferred struct pages > although when using KHO we definitely want fast reboot performance. We > decided to discuss this more later after LPC where there will be some > discussion about reboot performance. 
> > ----->o----- > Pratyush noted that he is working on the 1GB preservation but will take > some more time to clean up and have it working properly. He said > guest_memfd would use hugetlb for 1GB pages so he's working on hugetlb > preservation. Pratyush was focusing on generic hugetlb support that could > be ported for use with guest_memfd when it supports hugetlb. He's aiming > for an RFC to be ready by the time of LPC. > > Ackerley updated that the hugetlb support for guest_memfd is currently in > RFC patches posted upstream. > > ----->o----- > Next meeting will be on Monday, November 17 at 8am PST (UTC-8), everybody > is welcome: https://meet.google.com/rjn-dmzu-hgq > > Topics for the next meeting: > > - update on the status of stateless KHO RFC patches that should simplify > LUO support > - update on the status of LUO v5 overall > - follow up on the status of iommu persistence and its v2 patches based > on v1 feedback > - update on the v2 of the VFIO patch series based on LUO v5 and expected > timelines > - discuss status of hugetlb preservation, specifically 1GB support, with > regular memfd, aiming for an RFC by the time of LPC > - update on status of guest_memfd support for 1GB hugetlb pages > - discuss any use cases for Confidential Computing where folios may need > to be split after being marked as preserved during brown out > - later: testing methodology to allow downstream consumers to qualify > that live update works from one version to another > - later: reducing blackout window during live update, including deferred > struct page initialization > > Please let me know if you'd like to propose additional topics for > discussion, thank you! 
> > [1] https://marc.info/?l=linux-kernel&m=176074589311102&w=2 > From rppt at kernel.org Sat Nov 8 23:31:34 2025 From: rppt at kernel.org (Mike Rapoport) Date: Sun, 9 Nov 2025 09:31:34 +0200 Subject: [PATCH] lib/test_kho: Check if KHO is enabled In-Reply-To: <20251106220635.2608494-1-pasha.tatashin@soleen.com> References: <20251106220635.2608494-1-pasha.tatashin@soleen.com> Message-ID: On Thu, Nov 06, 2025 at 05:06:35PM -0500, Pasha Tatashin wrote: > We must check whether KHO is enabled prior to issuing KHO commands, > otherwise KHO internal data structures are not initialized. > > Fixes: b753522bed0b ("kho: add test for kexec handover") > > Reported-by: kernel test robot > Closes: https://lore.kernel.org/oe-lkp/202511061629.e242724-lkp at intel.com > > Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) > --- > lib/test_kho.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/lib/test_kho.c b/lib/test_kho.c > index 025ea251a186..85b60d87a50a 100644 > --- a/lib/test_kho.c > +++ b/lib/test_kho.c > @@ -315,6 +315,9 @@ static int __init kho_test_init(void) > phys_addr_t fdt_phys; > int err; > > + if (!kho_is_enabled()) > + return 0; > + > err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys); > if (!err) > return kho_test_restore(fdt_phys); > -- > 2.51.2.1041.gc1ab5b90ca-goog > -- Sincerely yours, Mike. From lkp at intel.com Sun Nov 9 05:38:19 2025 From: lkp at intel.com (kernel test robot) Date: Sun, 9 Nov 2025 21:38:19 +0800 Subject: [PATCH v2 2/5] kexec: move sysfs entries to /sys/kernel/kexec In-Reply-To: <20251106045107.17813-3-sourabhjain@linux.ibm.com> References: <20251106045107.17813-3-sourabhjain@linux.ibm.com> Message-ID: <202511092102.Qi35GqrR-lkp@intel.com> Hi Sourabh, kernel test robot noticed the following build warnings: [auto build test WARNING on akpm-mm/mm-everything] [also build test WARNING on linus/master v6.18-rc4 next-20251107] [If your patch is applied to the wrong git tree, kindly drop us a note. 
And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Sourabh-Jain/Documentation-ABI-add-kexec-and-kdump-sysfs-interface/20251106-125243 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20251106045107.17813-3-sourabhjain%40linux.ibm.com patch subject: [PATCH v2 2/5] kexec: move sysfs entries to /sys/kernel/kexec config: s390-randconfig-r131-20251109 (https://download.01.org/0day-ci/archive/20251109/202511092102.Qi35GqrR-lkp at intel.com/config) compiler: clang version 16.0.6 (https://github.com/llvm/llvm-project 7cbf1a2591520c2491aa35339f227775f4d3adf6) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251109/202511092102.Qi35GqrR-lkp at intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202511092102.Qi35GqrR-lkp at intel.com/ sparse warnings: (new ones prefixed by >>) >> kernel/kexec_core.c:1315:16: sparse: sparse: symbol 'kexec_kobj' was not declared. Should it be static? vim +/kexec_kobj +1315 kernel/kexec_core.c 1314 > 1315 struct kobject *kexec_kobj; 1316 ATTRIBUTE_GROUPS(kexec); 1317 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki From sourabhjain at linux.ibm.com Sun Nov 9 20:31:38 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 10 Nov 2025 10:01:38 +0530 Subject: [PATCH v3 0/5] kexec: reorganize sysfs interface and add new kexec sysfs Message-ID: <20251110043143.484408-1-sourabhjain@linux.ibm.com> All existing kexec and kdump sysfs entries are moved to a new location, /sys/kernel/kexec, to keep /sys/kernel/ clean and better organized. 
Symlinks are created at the old locations for backward compatibility and can be removed in the future [02/05]. While doing this cleanup, missing ABI documentation for the old sysfs interfaces is added, and those entries are marked as deprecated [01/05 and 03/05]. New ABI documentation is also added for the reorganized interfaces [04/05]. Along with this reorganization, a new sysfs file, /sys/kernel/kexec/crash_cma_ranges, is introduced to export crashkernel CMA reservation details to user space [05/05]. This helps tools determine the total crashkernel reserved memory and warn users that capturing user pages while CMA is reserved may cause incomplete or unreliable dumps. Changelog: --------- v2 -> v3: - Add the missing hunk to export crash_cma_ranges sysfs [05/05] - Declare kexec_kobj static [02/05] Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Sourabh Jain (5): Documentation/ABI: add kexec and kdump sysfs interface kexec: move sysfs entries to /sys/kernel/kexec Documentation/ABI: mark old kexec sysfs deprecated kexec: document new kexec and kdump sysfs ABIs crash: export crashkernel CMA reservation to userspace .../ABI/obsolete/sysfs-kernel-kexec-kdump | 59 ++++++++ .../ABI/testing/sysfs-kernel-kexec-kdump | 61 ++++++++ kernel/kexec_core.c | 135 ++++++++++++++++++ kernel/ksysfs.c | 68 +-------- 4 files changed, 256 insertions(+), 67 deletions(-) create mode 100644 Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump create mode 100644 Documentation/ABI/testing/sysfs-kernel-kexec-kdump -- 2.51.1 From sourabhjain at linux.ibm.com Sun Nov 9 20:31:39 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 10 Nov 2025 10:01:39 +0530 Subject: [PATCH v3 1/5] Documentation/ABI: add kexec and kdump sysfs interface
In-Reply-To: <20251110043143.484408-1-sourabhjain@linux.ibm.com> References: <20251110043143.484408-1-sourabhjain@linux.ibm.com> Message-ID: <20251110043143.484408-2-sourabhjain@linux.ibm.com> Add an ABI document for the following kexec and kdump sysfs interfaces: - /sys/kernel/kexec_loaded - /sys/kernel/kexec_crash_loaded - /sys/kernel/kexec_crash_size - /sys/kernel/crash_elfcorehdr_size Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../ABI/testing/sysfs-kernel-kexec-kdump | 43 +++++++++++++++++++ 1 file changed, 43 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-kexec-kdump diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump new file mode 100644 index 000000000000..96b24565b68e --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -0,0 +1,43 @@ +What: /sys/kernel/kexec_loaded +Date: Jun 2006 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a new kernel image has been loaded + into memory using the kexec system call. It shows 1 if + a kexec image is present and ready to boot, or 0 if none + is loaded. +User: kexec tools, kdump service + +What: /sys/kernel/kexec_crash_loaded +Date: Jun 2006 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a crash (kdump) kernel is currently + loaded into memory. It shows 1 if a crash kernel has been + successfully loaded for panic handling, or 0 if no crash + kernel is present.
+User: Kexec tools, Kdump service + +What: /sys/kernel/kexec_crash_size +Date: Dec 2009 +Contact: kexec at lists.infradead.org +Description: read/write + Shows the amount of memory reserved for loading the crash + (kdump) kernel. It reports the size, in bytes, of the + crash kernel area defined by the crashkernel= parameter. + This interface also allows reducing the crashkernel + reservation by writing a smaller value, and the reclaimed + space is added back to the system RAM. +User: Kdump service + +What: /sys/kernel/crash_elfcorehdr_size +Date: Aug 2023 +Contact: kexec at lists.infradead.org +Description: read only + Indicates the preferred size of the memory buffer for the + ELF core header used by the crash (kdump) kernel. It defines + how much space is needed to hold metadata about the crashed + system, including CPU and memory information. This information + is used by the user space utility kexec to support updating the + in-kernel kdump image during hotplug operations. +User: Kexec tools -- 2.51.1 From sourabhjain at linux.ibm.com Sun Nov 9 20:31:40 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 10 Nov 2025 10:01:40 +0530 Subject: [PATCH v3 2/5] kexec: move sysfs entries to /sys/kernel/kexec In-Reply-To: <20251110043143.484408-1-sourabhjain@linux.ibm.com> References: <20251110043143.484408-1-sourabhjain@linux.ibm.com> Message-ID: <20251110043143.484408-3-sourabhjain@linux.ibm.com> Several kexec and kdump sysfs entries are currently placed directly under /sys/kernel/, which clutters the directory and makes it harder to identify unrelated entries. To improve organization and readability, these entries are now moved under a dedicated directory, /sys/kernel/kexec. For backward compatibility, symlinks are created at the old locations so that existing tools and scripts continue to work. These symlinks can be removed in the future once users have switched to the new path. 
While creating symlinks, entries are added in /sys/kernel/ that point to their new locations under /sys/kernel/kexec/. If an error occurs while adding a symlink, it is logged but does not stop initialization of the remaining kexec sysfs symlinks. The /sys/kernel/ entry is now controlled by CONFIG_CRASH_DUMP instead of CONFIG_VMCORE_INFO, as CONFIG_CRASH_DUMP also enables CONFIG_VMCORE_INFO. Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- Changelog: v2 -> v3: - Declare kexec_kobj static --- kernel/kexec_core.c | 118 ++++++++++++++++++++++++++++++++++++++++++++ kernel/ksysfs.c | 68 +------------------------ 2 files changed, 119 insertions(+), 67 deletions(-) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index fa00b239c5d9..7476a46de5d6 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -41,6 +41,7 @@ #include #include #include +#include #include #include @@ -1229,3 +1230,120 @@ int kernel_kexec(void) kexec_unlock(); return error; } + +static ssize_t loaded_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", !!kexec_image); +} +static struct kobj_attribute loaded_attr = __ATTR_RO(loaded); + +#ifdef CONFIG_CRASH_DUMP +static ssize_t crash_loaded_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", kexec_crash_loaded()); +} +static struct kobj_attribute crash_loaded_attr = __ATTR_RO(crash_loaded); + +static ssize_t crash_size_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + ssize_t size = crash_get_memory_size(); + + if (size < 0) + return size; + + return sysfs_emit(buf, "%zd\n", size); +} +static ssize_t crash_size_store(struct kobject *kobj, + 
struct kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned long cnt; + int ret; + + if (kstrtoul(buf, 0, &cnt)) + return -EINVAL; + + ret = crash_shrink_memory(cnt); + return ret < 0 ? ret : count; +} +static struct kobj_attribute crash_size_attr = __ATTR_RW(crash_size); + +#ifdef CONFIG_CRASH_HOTPLUG +static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + unsigned int sz = crash_get_elfcorehdr_size(); + + return sysfs_emit(buf, "%u\n", sz); +} +static struct kobj_attribute crash_elfcorehdr_size_attr = __ATTR_RO(crash_elfcorehdr_size); + +#endif /* CONFIG_CRASH_HOTPLUG */ +#endif /* CONFIG_CRASH_DUMP */ + +static struct attribute *kexec_attrs[] = { + &loaded_attr.attr, +#ifdef CONFIG_CRASH_DUMP + &crash_loaded_attr.attr, + &crash_size_attr.attr, +#ifdef CONFIG_CRASH_HOTPLUG + &crash_elfcorehdr_size_attr.attr, +#endif +#endif + NULL +}; + +struct kexec_link_entry { + const char *target; + const char *name; +}; + +static struct kexec_link_entry kexec_links[] = { + { "loaded", "kexec_loaded" }, +#ifdef CONFIG_CRASH_DUMP + { "crash_loaded", "kexec_crash_loaded" }, + { "crash_size", "kexec_crash_size" }, +#ifdef CONFIG_CRASH_HOTPLUG + { "crash_elfcorehdr_size", "crash_elfcorehdr_size" }, +#endif +#endif + +}; + +static struct kobject *kexec_kobj; +ATTRIBUTE_GROUPS(kexec); + +static int __init init_kexec_sysctl(void) +{ + int error; + int i; + + kexec_kobj = kobject_create_and_add("kexec", kernel_kobj); + if (!kexec_kobj) { + pr_err("failed to create kexec kobject\n"); + return -ENOMEM; + } + + error = sysfs_create_groups(kexec_kobj, kexec_groups); + if (error) + goto kset_exit; + + for (i = 0; i < ARRAY_SIZE(kexec_links); i++) { + error = compat_only_sysfs_link_entry_to_kobj(kernel_kobj, kexec_kobj, + kexec_links[i].target, + kexec_links[i].name); + if (error) + pr_err("Unable to create %s symlink (%d)", kexec_links[i].name, error); + } + + return 0; + +kset_exit: + kobject_put(kexec_kobj); + 
return error; +} + +subsys_initcall(init_kexec_sysctl); diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c index eefb67d9883c..a9e6354d9e25 100644 --- a/kernel/ksysfs.c +++ b/kernel/ksysfs.c @@ -12,7 +12,7 @@ #include #include #include -#include +#include #include #include #include @@ -119,50 +119,6 @@ static ssize_t profiling_store(struct kobject *kobj, KERNEL_ATTR_RW(profiling); #endif -#ifdef CONFIG_KEXEC_CORE -static ssize_t kexec_loaded_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - return sysfs_emit(buf, "%d\n", !!kexec_image); -} -KERNEL_ATTR_RO(kexec_loaded); - -#ifdef CONFIG_CRASH_DUMP -static ssize_t kexec_crash_loaded_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - return sysfs_emit(buf, "%d\n", kexec_crash_loaded()); -} -KERNEL_ATTR_RO(kexec_crash_loaded); - -static ssize_t kexec_crash_size_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - ssize_t size = crash_get_memory_size(); - - if (size < 0) - return size; - - return sysfs_emit(buf, "%zd\n", size); -} -static ssize_t kexec_crash_size_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count) -{ - unsigned long cnt; - int ret; - - if (kstrtoul(buf, 0, &cnt)) - return -EINVAL; - - ret = crash_shrink_memory(cnt); - return ret < 0 ? 
ret : count; -} -KERNEL_ATTR_RW(kexec_crash_size); - -#endif /* CONFIG_CRASH_DUMP*/ -#endif /* CONFIG_KEXEC_CORE */ - #ifdef CONFIG_VMCORE_INFO static ssize_t vmcoreinfo_show(struct kobject *kobj, @@ -174,18 +130,6 @@ static ssize_t vmcoreinfo_show(struct kobject *kobj, } KERNEL_ATTR_RO(vmcoreinfo); -#ifdef CONFIG_CRASH_HOTPLUG -static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - unsigned int sz = crash_get_elfcorehdr_size(); - - return sysfs_emit(buf, "%u\n", sz); -} -KERNEL_ATTR_RO(crash_elfcorehdr_size); - -#endif - #endif /* CONFIG_VMCORE_INFO */ /* whether file capabilities are enabled */ @@ -255,18 +199,8 @@ static struct attribute * kernel_attrs[] = { #ifdef CONFIG_PROFILING &profiling_attr.attr, #endif -#ifdef CONFIG_KEXEC_CORE - &kexec_loaded_attr.attr, -#ifdef CONFIG_CRASH_DUMP - &kexec_crash_loaded_attr.attr, - &kexec_crash_size_attr.attr, -#endif -#endif #ifdef CONFIG_VMCORE_INFO &vmcoreinfo_attr.attr, -#ifdef CONFIG_CRASH_HOTPLUG - &crash_elfcorehdr_size_attr.attr, -#endif #endif #ifndef CONFIG_TINY_RCU &rcu_expedited_attr.attr, -- 2.51.1 From sourabhjain at linux.ibm.com Sun Nov 9 20:31:41 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 10 Nov 2025 10:01:41 +0530 Subject: [PATCH v3 3/5] Documentation/ABI: mark old kexec sysfs deprecated In-Reply-To: <20251110043143.484408-1-sourabhjain@linux.ibm.com> References: <20251110043143.484408-1-sourabhjain@linux.ibm.com> Message-ID: <20251110043143.484408-4-sourabhjain@linux.ibm.com> The previous commit ("kexec: move sysfs entries to /sys/kernel/kexec") moved all existing kexec sysfs entries to a new location. The ABI document is updated to include a note about the deprecation of the old kexec sysfs entries. 
The following kexec sysfs entries are deprecated: - /sys/kernel/kexec_loaded - /sys/kernel/kexec_crash_loaded - /sys/kernel/kexec_crash_size - /sys/kernel/crash_elfcorehdr_size Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../sysfs-kernel-kexec-kdump | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) rename Documentation/ABI/{testing => obsolete}/sysfs-kernel-kexec-kdump (61%) diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump similarity index 61% rename from Documentation/ABI/testing/sysfs-kernel-kexec-kdump rename to Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump index 96b24565b68e..96b4d41721cc 100644 --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump +++ b/Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump @@ -1,3 +1,19 @@ +NOTE: all the ABIs listed in this file are deprecated and will be removed after 2028. 
+ +Here are the alternative ABIs: ++------------------------------------+-----------------------------------------+ +| Deprecated | Alternative | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_loaded | /sys/kernel/kexec/loaded | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_crash_loaded | /sys/kernel/kexec/crash_loaded | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_crash_size | /sys/kernel/kexec/crash_size | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/crash_elfcorehdr_size | /sys/kernel/kexec/crash_elfcorehdr_size | ++------------------------------------+-----------------------------------------+ + + What: /sys/kernel/kexec_loaded Date: Jun 2006 Contact: kexec at lists.infradead.org -- 2.51.1 From sourabhjain at linux.ibm.com Sun Nov 9 20:31:42 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 10 Nov 2025 10:01:42 +0530 Subject: [PATCH v3 4/5] kexec: document new kexec and kdump sysfs ABIs In-Reply-To: <20251110043143.484408-1-sourabhjain@linux.ibm.com> References: <20251110043143.484408-1-sourabhjain@linux.ibm.com> Message-ID: <20251110043143.484408-5-sourabhjain@linux.ibm.com> Add an ABI document for the following kexec and kdump sysfs interfaces: - /sys/kernel/kexec/loaded - /sys/kernel/kexec/crash_loaded - /sys/kernel/kexec/crash_size - /sys/kernel/kexec/crash_elfcorehdr_size Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../ABI/testing/sysfs-kernel-kexec-kdump | 51 +++++++++++++++++++ 1 file changed, 51 insertions(+) create mode 100644
Documentation/ABI/testing/sysfs-kernel-kexec-kdump diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump new file mode 100644 index 000000000000..00c00f380fea --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -0,0 +1,51 @@ +What: /sys/kernel/kexec/* +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: + The /sys/kernel/kexec/* directory contains sysfs files + that provide information about the configuration status + of kexec and kdump. + +What: /sys/kernel/kexec/loaded +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a new kernel image has been loaded + into memory using the kexec system call. It shows 1 if + a kexec image is present and ready to boot, or 0 if none + is loaded. +User: kexec tools, kdump service + +What: /sys/kernel/kexec/crash_loaded +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a crash (kdump) kernel is currently + loaded into memory. It shows 1 if a crash kernel has been + successfully loaded for panic handling, or 0 if no crash + kernel is present. +User: Kexec tools, Kdump service + +What: /sys/kernel/kexec/crash_size +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read/write + Shows the amount of memory reserved for loading the crash + (kdump) kernel. It reports the size, in bytes, of the + crash kernel area defined by the crashkernel= parameter. + This interface also allows reducing the crashkernel + reservation by writing a smaller value, and the reclaimed + space is added back to the system RAM. +User: Kdump service + +What: /sys/kernel/kexec/crash_elfcorehdr_size +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates the preferred size of the memory buffer for the + ELF core header used by the crash (kdump) kernel. 
It defines + how much space is needed to hold metadata about the crashed + system, including CPU and memory information. This information + is used by the user space utility kexec to support updating the + in-kernel kdump image during hotplug operations. +User: Kexec tools -- 2.51.1 From sourabhjain at linux.ibm.com Sun Nov 9 20:31:43 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 10 Nov 2025 10:01:43 +0530 Subject: [PATCH v3 5/5] crash: export crashkernel CMA reservation to userspace In-Reply-To: <20251110043143.484408-1-sourabhjain@linux.ibm.com> References: <20251110043143.484408-1-sourabhjain@linux.ibm.com> Message-ID: <20251110043143.484408-6-sourabhjain@linux.ibm.com> Add a sysfs entry /sys/kernel/kexec/crash_cma_ranges to expose all CMA crashkernel ranges. This allows userspace tools configuring kdump to determine how much memory is reserved for crashkernel. If CMA is used, tools can warn users when attempting to capture user pages with CMA reservation. The new sysfs file holds the CMA ranges in the below format: cat /sys/kernel/kexec/crash_cma_ranges 100000000-10c7fffff Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- Changelog: - Add the missing hunk to export crash_cma_ranges sysfs --- .../ABI/testing/sysfs-kernel-kexec-kdump | 10 ++++++++++ kernel/kexec_core.c | 17 +++++++++++++++++ 2 files changed, 27 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump index 00c00f380fea..f59051b5d96d 100644 --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -49,3 +49,13 @@ Description: read only is used by the user space utility kexec to support
updating the in-kernel kdump image during hotplug operations. User: Kexec tools + +What: /sys/kernel/kexec/crash_cma_ranges +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Provides information about the memory ranges reserved from + the Contiguous Memory Allocator (CMA) area that are allocated + to the crash (kdump) kernel. It lists the start and end physical + addresses of CMA regions assigned for crashkernel use. +User: kdump service diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 7476a46de5d6..da6ff72b4669 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -1271,6 +1271,22 @@ static ssize_t crash_size_store(struct kobject *kobj, } static struct kobj_attribute crash_size_attr = __ATTR_RW(crash_size); +static ssize_t crash_cma_ranges_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + + ssize_t len = 0; + int i; + + for (i = 0; i < crashk_cma_cnt; ++i) { + len += sysfs_emit_at(buf, len, "%08llx-%08llx\n", + crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end); + } + return len; +} +static struct kobj_attribute crash_cma_ranges_attr = __ATTR_RO(crash_cma_ranges); + #ifdef CONFIG_CRASH_HOTPLUG static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) @@ -1289,6 +1305,7 @@ static struct attribute *kexec_attrs[] = { #ifdef CONFIG_CRASH_DUMP &crash_loaded_attr.attr, &crash_size_attr.attr, + &crash_cma_ranges_attr.attr, #ifdef CONFIG_CRASH_HOTPLUG &crash_elfcorehdr_size_attr.attr, #endif -- 2.51.1 From bhe at redhat.com Sun Nov 9 23:08:41 2025 From: bhe at redhat.com (Baoquan he) Date: Mon, 10 Nov 2025 15:08:41 +0800 Subject: [PATCH v3 5/5] crash: export crashkernel CMA reservation to userspace In-Reply-To: <20251110043143.484408-6-sourabhjain@linux.ibm.com> References: <20251110043143.484408-1-sourabhjain@linux.ibm.com> <20251110043143.484408-6-sourabhjain@linux.ibm.com> Message-ID: On 11/10/25 at 10:01am, Sourabh Jain wrote: > Add a sysfs 
entry /sys/kernel/kexec/crash_cma_ranges to expose all > CMA crashkernel ranges. I am not against this way. While wondering if it's more appropriate to export them into iomem_resource just like crashk_res and crashk_low_res doing. > > This allows userspace tools configuring kdump to determine how much > memory is reserved for crashkernel. If CMA is used, tools can warn > users when attempting to capture user pages with CMA reservation. > > The new sysfs hold the CMA ranges in below format: > > cat /sys/kernel/kexec/crash_cma_ranges > 100000000-10c7fffff > > Cc: Aditya Gupta > Cc: Andrew Morton > Cc: Baoquan he > Cc: Dave Young > Cc: Hari Bathini > Cc: Jiri Bohac > Cc: Madhavan Srinivasan > Cc: Mahesh J Salgaonkar > Cc: Pingfan Liu > Cc: Ritesh Harjani (IBM) > Cc: Shivang Upadhyay > Cc: Vivek Goyal > Cc: linuxppc-dev at lists.ozlabs.org > Cc: kexec at lists.infradead.org > Signed-off-by: Sourabh Jain > --- > Changelog: > - Add the missing hunk to export crash_cma_ranges sysfs > > --- > .../ABI/testing/sysfs-kernel-kexec-kdump | 10 ++++++++++ > kernel/kexec_core.c | 17 +++++++++++++++++ > 2 files changed, 27 insertions(+) > > diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump > index 00c00f380fea..f59051b5d96d 100644 > --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump > +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump > @@ -49,3 +49,13 @@ Description: read only > is used by the user space utility kexec to support updating the > in-kernel kdump image during hotplug operations. > User: Kexec tools > + > +What: /sys/kernel/kexec/crash_cma_ranges > +Date: Nov 2025 > +Contact: kexec at lists.infradead.org > +Description: read only > + Provides information about the memory ranges reserved from > + the Contiguous Memory Allocator (CMA) area that are allocated > + to the crash (kdump) kernel. It lists the start and end physical > + addresses of CMA regions assigned for crashkernel use. 
> +User: kdump service > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > index 7476a46de5d6..da6ff72b4669 100644 > --- a/kernel/kexec_core.c > +++ b/kernel/kexec_core.c > @@ -1271,6 +1271,22 @@ static ssize_t crash_size_store(struct kobject *kobj, > } > static struct kobj_attribute crash_size_attr = __ATTR_RW(crash_size); > > +static ssize_t crash_cma_ranges_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + > + ssize_t len = 0; > + int i; > + > + for (i = 0; i < crashk_cma_cnt; ++i) { > + len += sysfs_emit_at(buf, len, "%08llx-%08llx\n", > + crashk_cma_ranges[i].start, > + crashk_cma_ranges[i].end); > + } > + return len; > +} > +static struct kobj_attribute crash_cma_ranges_attr = __ATTR_RO(crash_cma_ranges); > + > #ifdef CONFIG_CRASH_HOTPLUG > static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > @@ -1289,6 +1305,7 @@ static struct attribute *kexec_attrs[] = { > #ifdef CONFIG_CRASH_DUMP > &crash_loaded_attr.attr, > &crash_size_attr.attr, > + &crash_cma_ranges_attr.attr, > #ifdef CONFIG_CRASH_HOTPLUG > &crash_elfcorehdr_size_attr.attr, > #endif > -- > 2.51.1 > From sourabhjain at linux.ibm.com Mon Nov 10 00:39:49 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 10 Nov 2025 14:09:49 +0530 Subject: [PATCH v3 5/5] crash: export crashkernel CMA reservation to userspace In-Reply-To: References: <20251110043143.484408-1-sourabhjain@linux.ibm.com> <20251110043143.484408-6-sourabhjain@linux.ibm.com> Message-ID: <09c4c181-eb4b-43ea-a439-04b83f4c20ba@linux.ibm.com> On 10/11/25 12:38, Baoquan he wrote: > On 11/10/25 at 10:01am, Sourabh Jain wrote: >> Add a sysfs entry /sys/kernel/kexec/crash_cma_ranges to expose all >> CMA crashkernel ranges. > I am not against this way. While wondering if it's more appropriate to > export them into iomem_resource just like crashk_res and crashk_low_res > doing. Handling conflict is challenging. 
Hence we don't export crashk_res and crashk_low_res to iomem on powerpc. Checkout [1] And I think conflicts can occur regardless of the order in which System RAM and Crash CMA ranges are added to iomem. [1] https://lore.kernel.org/all/20251016142831.144515-1-sourabhjain at linux.ibm.com/ - Sourabh Jain > >> This allows userspace tools configuring kdump to determine how much >> memory is reserved for crashkernel. If CMA is used, tools can warn >> users when attempting to capture user pages with CMA reservation. >> >> The new sysfs hold the CMA ranges in below format: >> >> cat /sys/kernel/kexec/crash_cma_ranges >> 100000000-10c7fffff >> >> Cc: Aditya Gupta >> Cc: Andrew Morton >> Cc: Baoquan he >> Cc: Dave Young >> Cc: Hari Bathini >> Cc: Jiri Bohac >> Cc: Madhavan Srinivasan >> Cc: Mahesh J Salgaonkar >> Cc: Pingfan Liu >> Cc: Ritesh Harjani (IBM) >> Cc: Shivang Upadhyay >> Cc: Vivek Goyal >> Cc: linuxppc-dev at lists.ozlabs.org >> Cc: kexec at lists.infradead.org >> Signed-off-by: Sourabh Jain >> --- >> Changelog: >> - Add the missing hunk to export crash_cma_ranges sysfs >> >> --- >> .../ABI/testing/sysfs-kernel-kexec-kdump | 10 ++++++++++ >> kernel/kexec_core.c | 17 +++++++++++++++++ >> 2 files changed, 27 insertions(+) >> >> diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump >> index 00c00f380fea..f59051b5d96d 100644 >> --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump >> +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump >> @@ -49,3 +49,13 @@ Description: read only >> is used by the user space utility kexec to support updating the >> in-kernel kdump image during hotplug operations. 
>> User: Kexec tools >> + >> +What: /sys/kernel/kexec/crash_cma_ranges >> +Date: Nov 2025 >> +Contact: kexec at lists.infradead.org >> +Description: read only >> + Provides information about the memory ranges reserved from >> + the Contiguous Memory Allocator (CMA) area that are allocated >> + to the crash (kdump) kernel. It lists the start and end physical >> + addresses of CMA regions assigned for crashkernel use. >> +User: kdump service >> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c >> index 7476a46de5d6..da6ff72b4669 100644 >> --- a/kernel/kexec_core.c >> +++ b/kernel/kexec_core.c >> @@ -1271,6 +1271,22 @@ static ssize_t crash_size_store(struct kobject *kobj, >> } >> static struct kobj_attribute crash_size_attr = __ATTR_RW(crash_size); >> >> +static ssize_t crash_cma_ranges_show(struct kobject *kobj, >> + struct kobj_attribute *attr, char *buf) >> +{ >> + >> + ssize_t len = 0; >> + int i; >> + >> + for (i = 0; i < crashk_cma_cnt; ++i) { >> + len += sysfs_emit_at(buf, len, "%08llx-%08llx\n", >> + crashk_cma_ranges[i].start, >> + crashk_cma_ranges[i].end); >> + } >> + return len; >> +} >> +static struct kobj_attribute crash_cma_ranges_attr = __ATTR_RO(crash_cma_ranges); >> + >> #ifdef CONFIG_CRASH_HOTPLUG >> static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, >> struct kobj_attribute *attr, char *buf) >> @@ -1289,6 +1305,7 @@ static struct attribute *kexec_attrs[] = { >> #ifdef CONFIG_CRASH_DUMP >> &crash_loaded_attr.attr, >> &crash_size_attr.attr, >> + &crash_cma_ranges_attr.attr, >> #ifdef CONFIG_CRASH_HOTPLUG >> &crash_elfcorehdr_size_attr.attr, >> #endif >> -- >> 2.51.1 >> From chenhuacai at kernel.org Mon Nov 10 01:24:04 2025 From: chenhuacai at kernel.org (Huacai Chen) Date: Mon, 10 Nov 2025 17:24:04 +0800 Subject: [PATCH] LoongArch: kexec: Initialize kexec_buf struct In-Reply-To: <20251024063653.35492-1-youling.tang@linux.dev> References: <20251024063653.35492-1-youling.tang@linux.dev> Message-ID: Applied, thanks. 
Huacai On Fri, Oct 24, 2025 at 2:38?PM Youling Tang wrote: > > From: Youling Tang > > The kexec_buf structure was previously declared without initialization. > commit bf454ec31add ("kexec_file: allow to place kexec_buf randomly") > added a field that is always read but not consistently populated by all > architectures. This un-initialized field will contain garbage. > > This is also triggering a UBSAN warning when the uninitialized data was > accessed: > > ------------[ cut here ]------------ > UBSAN: invalid-load in ./include/linux/kexec.h:210:10 > load of value 252 is not a valid value for type '_Bool' > > Zero-initializing kexec_buf at declaration ensures all fields are > cleanly set, preventing future instances of uninitialized memory being > used. > > Fixes: bf454ec31add ("kexec_file: allow to place kexec_buf randomly") > Link: https://lore.kernel.org/r/20250827-kbuf_all-v1-2-1df9882bb01a at debian.org > Signed-off-by: Youling Tang > --- > arch/loongarch/kernel/kexec_efi.c | 2 +- > arch/loongarch/kernel/kexec_elf.c | 2 +- > arch/loongarch/kernel/machine_kexec_file.c | 2 +- > 3 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/arch/loongarch/kernel/kexec_efi.c b/arch/loongarch/kernel/kexec_efi.c > index 45121b914f8f..5ee78ebb1546 100644 > --- a/arch/loongarch/kernel/kexec_efi.c > +++ b/arch/loongarch/kernel/kexec_efi.c > @@ -42,7 +42,7 @@ static void *efi_kexec_load(struct kimage *image, > { > int ret; > unsigned long text_offset, kernel_segment_number; > - struct kexec_buf kbuf; > + struct kexec_buf kbuf = {}; > struct kexec_segment *kernel_segment; > struct loongarch_image_header *h; > > diff --git a/arch/loongarch/kernel/kexec_elf.c b/arch/loongarch/kernel/kexec_elf.c > index 97b2f049801a..1b6b64744c7f 100644 > --- a/arch/loongarch/kernel/kexec_elf.c > +++ b/arch/loongarch/kernel/kexec_elf.c > @@ -59,7 +59,7 @@ static void *elf_kexec_load(struct kimage *image, > int ret; > unsigned long text_offset, kernel_segment_number; > struct elfhdr ehdr; 
> - struct kexec_buf kbuf; > + struct kexec_buf kbuf = {}; > struct kexec_elf_info elf_info; > struct kexec_segment *kernel_segment; > > diff --git a/arch/loongarch/kernel/machine_kexec_file.c b/arch/loongarch/kernel/machine_kexec_file.c > index dda236b51a88..fb57026f5f25 100644 > --- a/arch/loongarch/kernel/machine_kexec_file.c > +++ b/arch/loongarch/kernel/machine_kexec_file.c > @@ -143,7 +143,7 @@ int load_other_segments(struct kimage *image, > unsigned long initrd_load_addr = 0; > unsigned long orig_segments = image->nr_segments; > char *modified_cmdline = NULL; > - struct kexec_buf kbuf; > + struct kexec_buf kbuf = {}; > > kbuf.image = image; > /* Don't allocate anything below the kernel */ > -- > 2.43.0 > From pasha.tatashin at soleen.com Mon Nov 10 10:07:15 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Mon, 10 Nov 2025 13:07:15 -0500 Subject: [PATCH] liveupdate: kho: Enable KHO by default Message-ID: <20251110180715.602807-1-pasha.tatashin@soleen.com> Upcoming LUO requires KHO for its operations, so the requirement to place both KHO=on and liveupdate=on becomes redundant. Set KHO to be enabled by default. 
Signed-off-by: Pasha Tatashin --- kernel/liveupdate/kexec_handover.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index b54ca665e005..568cd9fe9aca 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -51,7 +51,7 @@ union kho_page_info { static_assert(sizeof(union kho_page_info) == sizeof(((struct page *)0)->private)); -static bool kho_enable __ro_after_init; +static bool kho_enable __ro_after_init = true; bool kho_is_enabled(void) { base-commit: ab40c92c74c6b0c611c89516794502b3a3173966 -- 2.51.2.1041.gc1ab5b90ca-goog From rppt at kernel.org Mon Nov 10 10:34:48 2025 From: rppt at kernel.org (Mike Rapoport) Date: Mon, 10 Nov 2025 20:34:48 +0200 Subject: [PATCH] liveupdate: kho: Enable KHO by default In-Reply-To: <20251110180715.602807-1-pasha.tatashin@soleen.com> References: <20251110180715.602807-1-pasha.tatashin@soleen.com> Message-ID: On Mon, Nov 10, 2025 at 01:07:15PM -0500, Pasha Tatashin wrote: > > Subject: [PATCH] liveupdate: kho: Enable KHO by default No need to put a directory (liveupdate) prefix here. "kho: " is enough. With that fixed Reviewed-by: Mike Rapoport (Microsoft) > Upcoming LUO requires KHO for its operations, the requirement to place > both KHO=on and liveupdate=on becomes redundant. Set KHO to be enabled > by default. 
> > Signed-off-by: Pasha Tatashin > --- > kernel/liveupdate/kexec_handover.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > index b54ca665e005..568cd9fe9aca 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c > @@ -51,7 +51,7 @@ union kho_page_info { > > static_assert(sizeof(union kho_page_info) == sizeof(((struct page *)0)->private)); > > -static bool kho_enable __ro_after_init; > +static bool kho_enable __ro_after_init = true; > > bool kho_is_enabled(void) > { > > base-commit: ab40c92c74c6b0c611c89516794502b3a3173966 > -- > 2.51.2.1041.gc1ab5b90ca-goog > -- Sincerely yours, Mike. From bhe at redhat.com Mon Nov 10 17:15:23 2025 From: bhe at redhat.com (Baoquan he) Date: Tue, 11 Nov 2025 09:15:23 +0800 Subject: [PATCH v3 5/5] crash: export crashkernel CMA reservation to userspace In-Reply-To: <09c4c181-eb4b-43ea-a439-04b83f4c20ba@linux.ibm.com> References: <20251110043143.484408-1-sourabhjain@linux.ibm.com> <20251110043143.484408-6-sourabhjain@linux.ibm.com> <09c4c181-eb4b-43ea-a439-04b83f4c20ba@linux.ibm.com> Message-ID: On 11/10/25 at 02:09pm, Sourabh Jain wrote: > > > On 10/11/25 12:38, Baoquan he wrote: > > On 11/10/25 at 10:01am, Sourabh Jain wrote: > > > Add a sysfs entry /sys/kernel/kexec/crash_cma_ranges to expose all > > > CMA crashkernel ranges. > > I am not against this way. While wondering if it's more appropriate to > > export them into iomem_resource just like crashk_res and crashk_low_res > > doing. > > Handling conflict is challenging. Hence we don't export crashk_res and > crashk_low_res to iomem on powerpc. Checkout [1] > > And I think conflicts can occur regardless of the order in which System RAM > and > Crash CMA ranges are added to iomem. 
> > [1] https://lore.kernel.org/all/20251016142831.144515-1-sourabhjain at linux.ibm.com/ Then I would suggest you add this reason and the link into patch log to keep a record. One day people may post patch to 'optimize' this. > > > > > > This allows userspace tools configuring kdump to determine how much > > > memory is reserved for crashkernel. If CMA is used, tools can warn > > > users when attempting to capture user pages with CMA reservation. > > > > > > The new sysfs hold the CMA ranges in below format: > > > > > > cat /sys/kernel/kexec/crash_cma_ranges > > > 100000000-10c7fffff > > > > > > Cc: Aditya Gupta > > > Cc: Andrew Morton > > > Cc: Baoquan he > > > Cc: Dave Young > > > Cc: Hari Bathini > > > Cc: Jiri Bohac > > > Cc: Madhavan Srinivasan > > > Cc: Mahesh J Salgaonkar > > > Cc: Pingfan Liu > > > Cc: Ritesh Harjani (IBM) > > > Cc: Shivang Upadhyay > > > Cc: Vivek Goyal > > > Cc: linuxppc-dev at lists.ozlabs.org > > > Cc: kexec at lists.infradead.org > > > Signed-off-by: Sourabh Jain > > > --- > > > Changelog: > > > - Add the missing hunk to export crash_cma_ranges sysfs > > > > > > --- > > > .../ABI/testing/sysfs-kernel-kexec-kdump | 10 ++++++++++ > > > kernel/kexec_core.c | 17 +++++++++++++++++ > > > 2 files changed, 27 insertions(+) > > > > > > diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump > > > index 00c00f380fea..f59051b5d96d 100644 > > > --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump > > > +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump > > > @@ -49,3 +49,13 @@ Description: read only > > > is used by the user space utility kexec to support updating the > > > in-kernel kdump image during hotplug operations. 
> > > User: Kexec tools > > > + > > > +What: /sys/kernel/kexec/crash_cma_ranges > > > +Date: Nov 2025 > > > +Contact: kexec at lists.infradead.org > > > +Description: read only > > > + Provides information about the memory ranges reserved from > > > + the Contiguous Memory Allocator (CMA) area that are allocated > > > + to the crash (kdump) kernel. It lists the start and end physical > > > + addresses of CMA regions assigned for crashkernel use. > > > +User: kdump service > > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > > > index 7476a46de5d6..da6ff72b4669 100644 > > > --- a/kernel/kexec_core.c > > > +++ b/kernel/kexec_core.c > > > @@ -1271,6 +1271,22 @@ static ssize_t crash_size_store(struct kobject *kobj, > > > } > > > static struct kobj_attribute crash_size_attr = __ATTR_RW(crash_size); > > > +static ssize_t crash_cma_ranges_show(struct kobject *kobj, > > > + struct kobj_attribute *attr, char *buf) > > > +{ > > > + > > > + ssize_t len = 0; > > > + int i; > > > + > > > + for (i = 0; i < crashk_cma_cnt; ++i) { > > > + len += sysfs_emit_at(buf, len, "%08llx-%08llx\n", > > > + crashk_cma_ranges[i].start, > > > + crashk_cma_ranges[i].end); > > > + } > > > + return len; > > > +} > > > +static struct kobj_attribute crash_cma_ranges_attr = __ATTR_RO(crash_cma_ranges); > > > + > > > #ifdef CONFIG_CRASH_HOTPLUG > > > static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, > > > struct kobj_attribute *attr, char *buf) > > > @@ -1289,6 +1305,7 @@ static struct attribute *kexec_attrs[] = { > > > #ifdef CONFIG_CRASH_DUMP > > > &crash_loaded_attr.attr, > > > &crash_size_attr.attr, > > > + &crash_cma_ranges_attr.attr, > > > #ifdef CONFIG_CRASH_HOTPLUG > > > &crash_elfcorehdr_size_attr.attr, > > > #endif > > > -- > > > 2.51.1 > > > > From sourabhjain at linux.ibm.com Mon Nov 10 21:52:13 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Tue, 11 Nov 2025 11:22:13 +0530 Subject: [PATCH v3 5/5] crash: export crashkernel CMA reservation to userspace 
In-Reply-To: References: <20251110043143.484408-1-sourabhjain@linux.ibm.com> <20251110043143.484408-6-sourabhjain@linux.ibm.com> <09c4c181-eb4b-43ea-a439-04b83f4c20ba@linux.ibm.com> Message-ID: <56abcc3f-ddd4-49c3-a985-a16d616e4210@linux.ibm.com> On 11/11/25 06:45, Baoquan he wrote: > On 11/10/25 at 02:09pm, Sourabh Jain wrote: >> >> On 10/11/25 12:38, Baoquan he wrote: >>> On 11/10/25 at 10:01am, Sourabh Jain wrote: >>>> Add a sysfs entry /sys/kernel/kexec/crash_cma_ranges to expose all >>>> CMA crashkernel ranges. >>> I am not against this way. While wondering if it's more appropriate to >>> export them into iomem_resource just like crashk_res and crashk_low_res >>> doing. >> Handling conflict is challenging. Hence we don't export crashk_res and >> crashk_low_res to iomem on powerpc. Checkout [1] >> >> And I think conflicts can occur regardless of the order in which System RAM >> and >> Crash CMA ranges are added to iomem. >> >> [1] https://lore.kernel.org/all/20251016142831.144515-1-sourabhjain at linux.ibm.com/ > Then I would suggest you add this reason and the link into patch log > to keep a record. One day people may post patch to 'optimize' this. Yeah, I will include it in v3. Thanks for the review. - Sourabh Jain > >>>> This allows userspace tools configuring kdump to determine how much >>>> memory is reserved for crashkernel. If CMA is used, tools can warn >>>> users when attempting to capture user pages with CMA reservation. 
>>>> >>>> The new sysfs hold the CMA ranges in below format: >>>> >>>> cat /sys/kernel/kexec/crash_cma_ranges >>>> 100000000-10c7fffff >>>> >>>> Cc: Aditya Gupta >>>> Cc: Andrew Morton >>>> Cc: Baoquan he >>>> Cc: Dave Young >>>> Cc: Hari Bathini >>>> Cc: Jiri Bohac >>>> Cc: Madhavan Srinivasan >>>> Cc: Mahesh J Salgaonkar >>>> Cc: Pingfan Liu >>>> Cc: Ritesh Harjani (IBM) >>>> Cc: Shivang Upadhyay >>>> Cc: Vivek Goyal >>>> Cc: linuxppc-dev at lists.ozlabs.org >>>> Cc: kexec at lists.infradead.org >>>> Signed-off-by: Sourabh Jain >>>> --- >>>> Changelog: >>>> - Add the missing hunk to export crash_cma_ranges sysfs >>>> >>>> --- >>>> .../ABI/testing/sysfs-kernel-kexec-kdump | 10 ++++++++++ >>>> kernel/kexec_core.c | 17 +++++++++++++++++ >>>> 2 files changed, 27 insertions(+) >>>> >>>> diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump >>>> index 00c00f380fea..f59051b5d96d 100644 >>>> --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump >>>> +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump >>>> @@ -49,3 +49,13 @@ Description: read only >>>> is used by the user space utility kexec to support updating the >>>> in-kernel kdump image during hotplug operations. >>>> User: Kexec tools >>>> + >>>> +What: /sys/kernel/kexec/crash_cma_ranges >>>> +Date: Nov 2025 >>>> +Contact: kexec at lists.infradead.org >>>> +Description: read only >>>> + Provides information about the memory ranges reserved from >>>> + the Contiguous Memory Allocator (CMA) area that are allocated >>>> + to the crash (kdump) kernel. It lists the start and end physical >>>> + addresses of CMA regions assigned for crashkernel use. 
>>>> +User: kdump service >>>> diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c >>>> index 7476a46de5d6..da6ff72b4669 100644 >>>> --- a/kernel/kexec_core.c >>>> +++ b/kernel/kexec_core.c >>>> @@ -1271,6 +1271,22 @@ static ssize_t crash_size_store(struct kobject *kobj, >>>> } >>>> static struct kobj_attribute crash_size_attr = __ATTR_RW(crash_size); >>>> +static ssize_t crash_cma_ranges_show(struct kobject *kobj, >>>> + struct kobj_attribute *attr, char *buf) >>>> +{ >>>> + >>>> + ssize_t len = 0; >>>> + int i; >>>> + >>>> + for (i = 0; i < crashk_cma_cnt; ++i) { >>>> + len += sysfs_emit_at(buf, len, "%08llx-%08llx\n", >>>> + crashk_cma_ranges[i].start, >>>> + crashk_cma_ranges[i].end); >>>> + } >>>> + return len; >>>> +} >>>> +static struct kobj_attribute crash_cma_ranges_attr = __ATTR_RO(crash_cma_ranges); >>>> + >>>> #ifdef CONFIG_CRASH_HOTPLUG >>>> static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, >>>> struct kobj_attribute *attr, char *buf) >>>> @@ -1289,6 +1305,7 @@ static struct attribute *kexec_attrs[] = { >>>> #ifdef CONFIG_CRASH_DUMP >>>> &crash_loaded_attr.attr, >>>> &crash_size_attr.attr, >>>> + &crash_cma_ranges_attr.attr, >>>> #ifdef CONFIG_CRASH_HOTPLUG >>>> &crash_elfcorehdr_size_attr.attr, >>>> #endif >>>> -- >>>> 2.51.1 >>>> From pratyush at kernel.org Tue Nov 11 05:03:42 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 11 Nov 2025 14:03:42 +0100 Subject: [PATCH] liveupdate: kho: Enable KHO by default In-Reply-To: (Mike Rapoport's message of "Mon, 10 Nov 2025 20:34:48 +0200") References: <20251110180715.602807-1-pasha.tatashin@soleen.com> Message-ID: On Mon, Nov 10 2025, Mike Rapoport wrote: > On Mon, Nov 10, 2025 at 01:07:15PM -0500, Pasha Tatashin wrote: >> >> Subject: [PATCH] liveupdate: kho: Enable KHO by default > > No need to put a directory (liveupdate) prefix here. "kho: " is enough. 
+1 > > With that fixed > > Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav > >> Upcoming LUO requires KHO for its operations, the requirement to place >> both KHO=on and liveupdate=on becomes redundant. Set KHO to be enabled >> by default. >> >> Signed-off-by: Pasha Tatashin >> --- >> kernel/liveupdate/kexec_handover.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c >> index b54ca665e005..568cd9fe9aca 100644 >> --- a/kernel/liveupdate/kexec_handover.c >> +++ b/kernel/liveupdate/kexec_handover.c >> @@ -51,7 +51,7 @@ union kho_page_info { >> >> static_assert(sizeof(union kho_page_info) == sizeof(((struct page *)0)->private)); >> >> -static bool kho_enable __ro_after_init; >> +static bool kho_enable __ro_after_init = true; >> >> bool kho_is_enabled(void) >> { >> >> base-commit: ab40c92c74c6b0c611c89516794502b3a3173966 >> -- >> 2.51.2.1041.gc1ab5b90ca-goog >> -- Regards, Pratyush Yadav From horms at kernel.org Tue Nov 11 05:31:28 2025 From: horms at kernel.org (Simon Horman) Date: Tue, 11 Nov 2025 13:31:28 +0000 Subject: [PATCH 1/2] kexec-tools: powerpc: Fix function signature of comparefunc() In-Reply-To: <20251022114413.4440-1-glaubitz@physik.fu-berlin.de> References: <20251022114413.4440-1-glaubitz@physik.fu-berlin.de> Message-ID: On Wed, Oct 22, 2025 at 01:44:12PM +0200, John Paul Adrian Glaubitz wrote: > Fixes the following build error on 32-bit PowerPC: > > kexec/arch/ppc/fs2dt.c: In function 'putnode': > kexec/arch/ppc/fs2dt.c:338:51: error: passing argument 4 of 'scandir' from incompatible pointer type [-Wincompatible-pointer-types] > 338 | numlist = scandir(pathname, &namelist, 0, comparefunc); > | ^~~~~~~~~~~ > | | > | int (*)(const void *, const void *) > > Signed-off-by: John Paul Adrian Glaubitz Thanks, I was able to reproduce this using gcc-powerpc-linux-gnu 4:14.2.0-1 on Debian Trixie. Likewise for patch 2/2. 
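[Editor's note: for context, the build error above is a mismatch against the comparator type that POSIX declares for scandir(). A minimal standalone sketch of the corrected signature follows; the names are illustrative, not the fs2dt.c source.]

```c
#include <dirent.h>
#include <string.h>

/* The old declaration, int comparefunc(const void *a, const void *b),
 * no longer matches scandir()'s prototype, which expects
 *     int (*compar)(const struct dirent **, const struct dirent **);
 * GCC 14 turns that mismatch into a hard error. */
int comparefunc(const struct dirent **a, const struct dirent **b)
{
	return strcmp((*a)->d_name, (*b)->d_name);
}

/* Illustrative call site mirroring the scandir() use in putnode():
 * once the signature matches, no cast is needed. */
int list_sorted(const char *pathname, struct dirent ***namelist)
{
	return scandir(pathname, namelist, NULL, comparefunc);
}
```

For plain name ordering, the libc-provided alphasort() already has the required type and can be passed directly.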
There is a CI workflow that exercises 32-bit PowerPC builds [1]. However, it does not exhibit the problems reported. I guess that is because it is using an older GCC, gcc-powerpc-linux-gnu 4:13.2.0-7ubuntu1 on Ubuntu 24.04. [1] https://github.com/horms/kexec-tools/actions/runs/18554906205/job/52889935741 It would be nice to update the job, but perhaps that is something that comes with Ubuntu 26.04. In any case I have applied this series: - kexec-tools: powerpc: Fix pointer declarations in read_memory_region_limits() https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=6c878e9b8a50 - kexec-tools: powerpc: Fix function signature of comparefunc() https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=2786f8eb3e5e From horms at kernel.org Tue Nov 11 05:38:33 2025 From: horms at kernel.org (Simon Horman) Date: Tue, 11 Nov 2025 13:38:33 +0000 Subject: [PATCH kexec-tools 0/2] kexec/ifdown.c: minimise errors printed In-Reply-To: <20251022020703.14200-2-mrocha@turretllc.us> References: <20251022020703.14200-2-mrocha@turretllc.us> Message-ID: On Tue, Oct 21, 2025 at 09:07:02PM -0500, Mason Rocha wrote: > On some embedded configurations, kexec generates messages when rebooting > to the new kernel. This patch helps eliminate these messages in > the event certain kernel options are set. I'm not too worried about the > second patch, but it is along the same line as the first patch and I > thought that it should be included. Thanks! Thanks, applied. 
- kexec/ifdown.c: Hide error if sockets are disabled https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=86c3d1f7b646 - kexec/ifdown.c: Use AF_NETLINK instead of AF_INET https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=eb8609a29363 From horms at kernel.org Tue Nov 11 05:39:03 2025 From: horms at kernel.org (Simon Horman) Date: Tue, 11 Nov 2025 13:39:03 +0000 Subject: [PATCH kexec-tools 2/2] kexec/ifdown.c: Hide error if sockets are disabled In-Reply-To: <20251022020703.14200-4-mrocha@turretllc.us> References: <20251022020703.14200-2-mrocha@turretllc.us> <20251022020703.14200-4-mrocha@turretllc.us> Message-ID: On Tue, Oct 21, 2025 at 09:07:04PM -0500, Mason Rocha wrote: > Prevents the message "Function not implemented" from being logged when > a system with networking support disabled, as there couldn't possibly be > any interfaces to bring down to the point where we need to make sure the > user knows that the interfaces were not brought down. > > Signed-off-by: Mason Rocha > --- > kexec/ifdown.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/kexec/ifdown.c b/kexec/ifdown.c > index 6a60bcb..e6ea0ae 100644 > --- a/kexec/ifdown.c > +++ b/kexec/ifdown.c > @@ -32,8 +32,10 @@ int ifdown(void) > int fd, shaper; > > if ((fd = socket(AF_NETLINK, SOCK_DGRAM, 0)) < 0) { > - fprintf(stderr, "ifdown: "); > - perror("socket"); > + if(errno != ENOSYS) { nit: I'd prefer a space between 'if' and '(' I added one when applying this series. 
> + fprintf(stderr, "ifdown: "); > + perror("socket"); > + } > goto error; > } > > -- > 2.51.0 > From horms at kernel.org Tue Nov 11 05:47:01 2025 From: horms at kernel.org (Simon Horman) Date: Tue, 11 Nov 2025 13:47:01 +0000 Subject: [PATCH kexec-tools 0/4] ppc64: Support kexec with initrd and DTB together In-Reply-To: <20251022134611.8921-1-shivangu@linux.ibm.com> References: <20251022134611.8921-1-shivangu@linux.ibm.com> Message-ID: On Wed, Oct 22, 2025 at 07:16:05PM +0530, Shivang Upadhyay wrote: > Currently, on ppc64 systems, kexec cannot directly use a > user-provided device tree blob (dtb) when booting a new > kernel with an initrd. This limitation exists because the > dtb must be modified at runtime - for example, to include > the initrd's memory location and size, and to add > /memreserve/ entries based on the current system memory > layout. > > Previously, kexec handled this by generating a fresh dtb in > memory from the running system's /proc/device-tree directory. > However, this approach prevents users from making > intentional modifications to the dtb - such as changing boot > arguments, enabling or disabling devices, or testing kernel > changes that depend on specific device tree properties. > > Adding support for user-provided dtb (with appropriate > patching by kexec) allows more control for developers, > particularly when experimenting with custom kernels or > hardware configurations. > > This patch series lifts this restriction and ensures that > the necessary /memreserve/ sections are properly added to > the new DTB. On ppc64, it is mandatory for the rebooting > cpu to be present in the new kernel's dtb, so additional > logic has been added to identify and mark one of the available > cpus as the reboot cpu on the current system. > > A new architecture-specific function, arch_do_unload(), has > been introduced to perform the necessary cleanup during > kexec unload. In ppc64, the reboot CPU changes due to kexec, > and it gets reset back on kexec unload. 
> > Shivang Upadhyay (4): > ppc64: ensure /memreserve/ sections exist in user-provided FDT > ppc64: handle reboot CPU in case of user provided DTB > Add arch_do_unload hook for arch-specific cleanup > ppc64: life the dtb and initrd restriction Thanks, applied. - ppc64: life the dtb and initrd restriction https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=4dc039779675 - Add arch_do_unload hook for arch-specific cleanup https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=bf4aa2a1f365 - ppc64: handle reboot CPU in case of user provided DTB https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=39631c8fd64f - ppc64: ensure /memreserve/ sections exist in user-provided FDT https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=32f664bfa479 From horms at kernel.org Tue Nov 11 05:49:43 2025 From: horms at kernel.org (Simon Horman) Date: Tue, 11 Nov 2025 13:49:43 +0000 Subject: [PATCH] kexec: add kexec flag to support debug printing In-Reply-To: <20251104025959.1948450-1-maqianga@uniontech.com> References: <20251104025959.1948450-1-maqianga@uniontech.com> Message-ID: On Tue, Nov 04, 2025 at 10:59:59AM +0800, Qiang Ma wrote: > This add KEXEC_DEBUG to kexec_flags so that it can be passed > to kernel when '-d' is added with kexec_load interface. With that > flag enabled, kernel can enable the debugging message printing. > > This patch requires support from the kexec_load debugging message > of the Linux kernel[1]. > > [1]: https://lore.kernel.org/kexec/20251103063440.1681657-1-maqianga at uniontech.com/ > > Signed-off-by: Qiang Ma Thanks, applied. 
- kexec: add kexec flag to support debug printing https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=71d6fd99af7e From horms at kernel.org Tue Nov 11 05:52:58 2025 From: horms at kernel.org (Simon Horman) Date: Tue, 11 Nov 2025 13:52:58 +0000 Subject: [PATCH] util_lib: Add direct map fallback in vaddr_to_offset() In-Reply-To: <20251106120344.2382695-1-pnina.feder@mobileye.com> References: <20251106120344.2382695-1-pnina.feder@mobileye.com> Message-ID: On Thu, Nov 06, 2025 at 02:03:44PM +0200, Pnina Feder wrote: > The vmcore-dmesg tool could fail with the message: > "No program header covering vaddr 0x%llx found kexec bug?" > > This occurred when a virtual address belonged to the kernel's direct > mapping region, which may not be covered by any PT_LOAD segment in > the vmcore ELF headers. > > Add a direct-map fallback in vaddr_to_offset() that converts such > virtual addresses using the known page and physical offsets. This > allows resolving these addresses correctly. > > Tested on Linux 6.16 (RISC-V) > > Signed-off-by: Pnina Feder Thanks, applied. - util_lib: Add direct map fallback in vaddr_to_offset() https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=393c449aec3d From youling.tang at linux.dev Tue Nov 11 19:05:33 2025 From: youling.tang at linux.dev (Youling Tang) Date: Wed, 12 Nov 2025 11:05:33 +0800 Subject: [PATCH 2/2] LoongArch: Refactor command line processing In-Reply-To: References: <20250925063241.337897-1-youling.tang@linux.dev> <20250925063241.337897-2-youling.tang@linux.dev> <5ec31e96-7157-4300-af36-daec2cee5831@linux.dev> Message-ID: <50fa479e-5514-4b5a-a0b8-a264ba23a005@linux.dev> Hi, Simon Currently, it is passed through command-line parameters (fdt has not been used yet), but the readability of the existing command-line parameters is too poor. By rewriting it to be consistent with the kernel implementation and using hexadecimal, the readability will be better. 
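[Editor's note: the readability gain can be seen in a few lines. The helper below is an illustrative sketch of the sprintf-style hexadecimal formatting the refactor adopts; the name and the bounded snprintf are assumptions, not the kexec-tools code, which uses sprintf into a COMMAND_LINE_SIZE buffer.]

```c
#include <stdio.h>

#define CMDLINE_BUF_SIZE 512	/* illustrative; the real limit is COMMAND_LINE_SIZE */

/* Append "initrd=start,size" in hexadecimal: one formatted write
 * replaces the old ultoa()/strcat() sequence, and the 0x-prefixed
 * values are readable at a glance in /proc/cmdline. */
void cmdline_add_initrd_hex(char *cmdline, unsigned long *len,
			    unsigned long base, unsigned long size)
{
	*len += snprintf(cmdline + *len, CMDLINE_BUF_SIZE - *len,
			 "initrd=0x%lx,0x%lx ", base, size);
}
```

For example, base 0x90000000 and size 0x800000 emit `initrd=0x90000000,0x800000 `, where the old decimal form would have printed `initrd=2415919104,8388608`.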
Can Patch2 be applied alone (please ignore Patch1)? Thanks, Youling. On 9/25/25 18:23, Dave Young wrote: > On Thu, 25 Sept 2025 at 17:52, Youling Tang wrote: >> On 9/25/25 17:22, Dave Young wrote: >> >> On Thu, 25 Sept 2025 at 14:33, Youling Tang wrote: >> >> From: Youling Tang >> >> Refactor the cmdline_add_xxx code flow, and simultaneously display >> the content of parameters such as initrd in hexadecimal format to >> improve readability. >> >> Signed-off-by: Youling Tang >> --- >> kexec/arch/loongarch/kexec-loongarch.c | 138 ++++++++++--------------- >> 1 file changed, 55 insertions(+), 83 deletions(-) >> >> diff --git a/kexec/arch/loongarch/kexec-loongarch.c b/kexec/arch/loongarch/kexec-loongarch.c >> index 240202f..c2503de 100644 >> --- a/kexec/arch/loongarch/kexec-loongarch.c >> +++ b/kexec/arch/loongarch/kexec-loongarch.c >> @@ -35,83 +35,49 @@ >> #define _O_BINARY 0 >> #endif >> >> -#define CMDLINE_PREFIX "kexec " >> -static char cmdline[COMMAND_LINE_SIZE] = CMDLINE_PREFIX; >> +/* Add the "kexec" command line parameter to command line. */ >> +static void cmdline_add_loader(unsigned long *cmdline_tmplen, char *modified_cmdline) >> +{ >> + int loader_strlen; >> + >> + loader_strlen = sprintf(modified_cmdline + (*cmdline_tmplen), "kexec "); >> + *cmdline_tmplen += loader_strlen; >> +} >> >> Not sure why this is needed, I guess it is to distinguish the new >> kernel and original kernel? As replied in another reply I would >> suggest adding an extra cmdline in scripts instead of hard coded here, >> you need to remove the fake param each time otherwise it will make >> the cmdline longer and longer after many kexec reboot cycles. >> >> >> In arch_process_options(), the "kexec" parameter will be removed when >> reusing the current command line. >> >> -/* Adds "initrd=start,size" parameters to command line. 
*/ >> -static int cmdline_add_initrd(char *cmdline, unsigned long addr, >> - unsigned long size) >> +/* Add the "initrd=start,size" command line parameter to command line. */ >> +static void cmdline_add_initrd(unsigned long *cmdline_tmplen, char *modified_cmdline, >> + unsigned long initrd_base, unsigned long initrd_size) >> { >> - int cmdlen, len; >> - char str[50], *ptr; >> - >> - ptr = str; >> - strcpy(str, " initrd="); >> - ptr += strlen(str); >> - ultoa(addr, ptr); >> - strcat(str, ","); >> - ptr = str + strlen(str); >> - ultoa(size, ptr); >> - len = strlen(str); >> - cmdlen = strlen(cmdline) + len; >> - if (cmdlen > (COMMAND_LINE_SIZE - 1)) >> - die("Command line overflow\n"); >> - strcat(cmdline, str); >> + int initrd_strlen; >> >> - return 0; >> + initrd_strlen = sprintf(modified_cmdline + (*cmdline_tmplen), "initrd=0x%lx,0x%lx ", >> + initrd_base, initrd_size); >> + *cmdline_tmplen += initrd_strlen; >> } >> >> -/* Adds the appropriate "mem=size at start" options to command line, indicating the >> - * memory region the new kernel can use to boot into. */ >> -static int cmdline_add_mem(char *cmdline, unsigned long addr, >> - unsigned long size) >> +/* >> + * Add the "mem=size at start" command line parameter to command line, indicating the >> + * memory region the new kernel can use to boot into. 
>> + */ >> +static void cmdline_add_mem(unsigned long *cmdline_tmplen, char *modified_cmdline, >> + unsigned long mem_start, unsigned long mem_sz) >> { >> - int cmdlen, len; >> - char str[50], *ptr; >> - >> - addr = addr/1024; >> - size = size/1024; >> - ptr = str; >> - strcpy(str, " mem="); >> - ptr += strlen(str); >> - ultoa(size, ptr); >> - strcat(str, "K@"); >> - ptr = str + strlen(str); >> - ultoa(addr, ptr); >> - strcat(str, "K"); >> - len = strlen(str); >> - cmdlen = strlen(cmdline) + len; >> - if (cmdlen > (COMMAND_LINE_SIZE - 1)) >> - die("Command line overflow\n"); >> - strcat(cmdline, str); >> + int mem_strlen = 0; >> >> - return 0; >> + mem_strlen = sprintf(modified_cmdline + (*cmdline_tmplen), "mem=0x%lx at 0x%lx ", >> + mem_sz, mem_start); >> + *cmdline_tmplen += mem_strlen; >> } >> >> Ditto for the mem= param and other similar cases, can this be done out >> of the kexec-tools c code? it will be more flexible. >> >> >> Currently, we will maintain passing this parameter through the command line, not >> via FDT like ARM64. In the future, we may consider whether it can be passed through >> the FDT table in efisystab (but that approach may not be friendly to ELF kernels). >> > If the kexec boot depends on the customized mem layout, ideally it > should be passed with fdt or other method. > it is reasonable to keep it for the time being. But the "kexec" extra > cmdline should not be hard coded in my opinion. 
> > Thanks > Dave > From shivangu at linux.ibm.com Wed Nov 12 04:10:18 2025 From: shivangu at linux.ibm.com (Shivang Upadhyay) Date: Wed, 12 Nov 2025 17:40:18 +0530 Subject: [PATCH kexec-tools 0/4] ppc64: Support kexec with initrd and DTB together In-Reply-To: References: <20251022134611.8921-1-shivangu@linux.ibm.com> Message-ID: On Tue, Nov 11, 2025 at 01:47:01PM +0000, Simon Horman wrote: > On Wed, Oct 22, 2025 at 07:16:05PM +0530, Shivang Upadhyay wrote: > > Currently, on ppc64 systems, kexec cannot directly use a > > user-provided device tree blob (dtb) when booting a new > > kernel with an initrd. This limitation exists because the > > dtb must be modified at runtime - for example, to include > > the initrd's memory location and size, and to add > > /memreserve/ entries based on the current system memory > > layout. > > > > Previously, kexec handled this by generating a fresh dtb in > > memory from the running system's /proc/device-tree directory. > > However, this approach prevents users from making > > intentional modifications to the dtb - such as changing boot > > arguments, enabling or disabling devices, or testing kernel > > changes that depend on specific device tree properties. > > > > Adding support for user-provided dtb (with appropriate > > patching by kexec) allows more control for developers, > > particularly when experimenting with custom kernels or > > hardware configurations. > > > > This patch series lifts this restriction and ensures that > > the necessary /memreserve/ sections are properly added to > > the new DTB. On ppc64, it is mandatory for the rebooting > > cpu to be present in the new kernel's dtb, so additional > > logic has been added to identify and mark one of the available > > cpus as the reboot cpu on the current system. > > > > A new architecture-specific function, arch_do_unload(), has > > been introduced to perform the necessary cleanup during > > kexec unload. 
In ppc64, the reboot CPU changes due to kexec, > > and it gets reset back on kexec unload. > > > > Shivang Upadhyay (4): > > ppc64: ensure /memreserve/ sections exist in user-provided FDT > > ppc64: handle reboot CPU in case of user provided DTB > > Add arch_do_unload hook for arch-specific cleanup > > ppc64: life the dtb and initrd restriction > > Thanks, applied. Thanks Simon ~Shivang. From sourabhjain at linux.ibm.com Wed Nov 12 09:13:28 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Wed, 12 Nov 2025 22:43:28 +0530 Subject: [PATCH 2/2] Documentation/ABI: remove old fadump sysfs doc In-Reply-To: <20251112171328.298109-1-sourabhjain@linux.ibm.com> References: <20251112171328.298109-1-sourabhjain@linux.ibm.com> Message-ID: <20251112171328.298109-2-sourabhjain@linux.ibm.com> The patch titled "powerpc/fadump: remove old sysfs symlink" removed the deprecated fadump sysfs files, so remove the corresponding ABI documents. Additionally, remove the references to the old fadump sysfs files from the fadump document. 
The alternative sysfs is documented at: Documentation/ABI/testing/sysfs-kernel-fadump Cc: Hari Bathini Cc: Madhavan Srinivasan Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../ABI/obsolete/sysfs-kernel-fadump_enabled | 9 ----- .../obsolete/sysfs-kernel-fadump_registered | 10 ------ .../obsolete/sysfs-kernel-fadump_release_mem | 10 ------ .../arch/powerpc/firmware-assisted-dump.rst | 33 +++++++------------ 4 files changed, 11 insertions(+), 51 deletions(-) delete mode 100644 Documentation/ABI/obsolete/sysfs-kernel-fadump_enabled delete mode 100644 Documentation/ABI/obsolete/sysfs-kernel-fadump_registered delete mode 100644 Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem diff --git a/Documentation/ABI/obsolete/sysfs-kernel-fadump_enabled b/Documentation/ABI/obsolete/sysfs-kernel-fadump_enabled deleted file mode 100644 index e9c2de8b3688..000000000000 --- a/Documentation/ABI/obsolete/sysfs-kernel-fadump_enabled +++ /dev/null @@ -1,9 +0,0 @@ -This ABI is renamed and moved to a new location /sys/kernel/fadump/enabled. - -What: /sys/kernel/fadump_enabled -Date: Feb 2012 -Contact: linuxppc-dev at lists.ozlabs.org -Description: read only - Primarily used to identify whether the FADump is enabled in - the kernel or not. -User: Kdump service diff --git a/Documentation/ABI/obsolete/sysfs-kernel-fadump_registered b/Documentation/ABI/obsolete/sysfs-kernel-fadump_registered deleted file mode 100644 index dae880b1a5d5..000000000000 --- a/Documentation/ABI/obsolete/sysfs-kernel-fadump_registered +++ /dev/null @@ -1,10 +0,0 @@ -This ABI is renamed and moved to a new location /sys/kernel/fadump/registered. - -What: /sys/kernel/fadump_registered -Date: Feb 2012 -Contact: linuxppc-dev at lists.ozlabs.org -Description: read/write - Helps to control the dump collect feature from userspace. 
- Setting 1 to this file enables the system to collect the - dump and 0 to disable it. -User: Kdump service diff --git a/Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem b/Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem deleted file mode 100644 index ca2396edb5f1..000000000000 --- a/Documentation/ABI/obsolete/sysfs-kernel-fadump_release_mem +++ /dev/null @@ -1,10 +0,0 @@ -This ABI is renamed and moved to a new location /sys/kernel/fadump/release_mem. - -What: /sys/kernel/fadump_release_mem -Date: Feb 2012 -Contact: linuxppc-dev at lists.ozlabs.org -Description: write only - This is a special sysfs file and only available when - the system is booted to capture the vmcore using FADump. - It is used to release the memory reserved by FADump to - save the crash dump. diff --git a/Documentation/arch/powerpc/firmware-assisted-dump.rst b/Documentation/arch/powerpc/firmware-assisted-dump.rst index 7e266e749cd5..717e30e8b6cd 100644 --- a/Documentation/arch/powerpc/firmware-assisted-dump.rst +++ b/Documentation/arch/powerpc/firmware-assisted-dump.rst @@ -19,9 +19,9 @@ in production use. - Unlike phyp dump, userspace tool does not need to refer any sysfs interface while reading /proc/vmcore. - Unlike phyp dump, FADump allows user to release all the memory reserved - for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem. + for dump, with a single operation of echo 1 > /sys/kernel/fadump/release_mem. - Once enabled through kernel boot parameter, FADump can be - started/stopped through /sys/kernel/fadump_registered interface (see + started/stopped through /sys/kernel/fadump/registered interface (see sysfs files section below) and can be easily integrated with kdump service start/stop init scripts. @@ -86,13 +86,13 @@ as follows: network, nas, san, iscsi, etc. as desired. 
- Once the userspace tool is done saving dump, it will echo - '1' to /sys/kernel/fadump_release_mem to release the reserved + '1' to /sys/kernel/fadump/release_mem to release the reserved memory back to general use, except the memory required for next firmware-assisted dump registration. e.g.:: - # echo 1 > /sys/kernel/fadump_release_mem + # echo 1 > /sys/kernel/fadump/release_mem Please note that the firmware-assisted dump feature is only available on POWER6 and above systems on pSeries @@ -152,7 +152,7 @@ then everything but boot memory size of RAM is reserved during early boot (See Fig. 2). This area is released once we finish collecting the dump from user land scripts (e.g. kdump scripts) that are run. If there is dump data, then the -/sys/kernel/fadump_release_mem file is created, and the reserved +/sys/kernel/fadump/release_mem file is created, and the reserved memory is held. If there is no waiting dump data, then only the memory required to @@ -281,7 +281,7 @@ the control files and debugfs file to display memory reserved region. Here is the list of files under kernel sysfs: - /sys/kernel/fadump_enabled + /sys/kernel/fadump/enabled This is used to display the FADump status. - 0 = FADump is disabled @@ -290,15 +290,15 @@ Here is the list of files under kernel sysfs: This interface can be used by kdump init scripts to identify if FADump is enabled in the kernel and act accordingly. - /sys/kernel/fadump_registered + /sys/kernel/fadump/registered This is used to display the FADump registration status as well as to control (start/stop) the FADump registration. - 0 = FADump is not registered. - 1 = FADump is registered and ready to handle system crash. - To register FADump echo 1 > /sys/kernel/fadump_registered and - echo 0 > /sys/kernel/fadump_registered for un-register and stop the + To register FADump echo 1 > /sys/kernel/fadump/registered and + echo 0 > /sys/kernel/fadump/registered for un-register and stop the FADump. 
Once the FADump is un-registered, the system crash will not be handled and vmcore will not be captured. This interface can be easily integrated with kdump service start/stop. @@ -308,13 +308,13 @@ Here is the list of files under kernel sysfs: This is used to display the memory reserved by FADump for saving the crash dump. - /sys/kernel/fadump_release_mem + /sys/kernel/fadump/release_mem This file is available only when FADump is active during second kernel. This is used to release the reserved memory region that are held for saving crash dump. To release the reserved memory echo 1 to it:: - echo 1 > /sys/kernel/fadump_release_mem + echo 1 > /sys/kernel/fadump/release_mem After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region file will change to reflect the new memory reservations. @@ -335,17 +335,6 @@ Note: /sys/kernel/fadump_release_opalcore sysfs has moved to echo 1 > /sys/firmware/opal/mpipl/release_core -Note: The following FADump sysfs files are deprecated. - -+----------------------------------+--------------------------------+ -| Deprecated | Alternative | -+----------------------------------+--------------------------------+ -| /sys/kernel/fadump_enabled | /sys/kernel/fadump/enabled | -+----------------------------------+--------------------------------+ -| /sys/kernel/fadump_registered | /sys/kernel/fadump/registered | -+----------------------------------+--------------------------------+ -| /sys/kernel/fadump_release_mem | /sys/kernel/fadump/release_mem | -+----------------------------------+--------------------------------+ Here is the list of files under powerpc debugfs: (Assuming debugfs is mounted on /sys/kernel/debug directory.) 
-- 2.51.1 From sourabhjain at linux.ibm.com Wed Nov 12 09:13:27 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Wed, 12 Nov 2025 22:43:27 +0530 Subject: [PATCH 1/2] powerpc/fadump: remove old sysfs symlink Message-ID: <20251112171328.298109-1-sourabhjain@linux.ibm.com> Commit d418b19f34ed ("powerpc/fadump: Reorganize /sys/kernel/fadump_* sysfs files") and commit 3f5f1f22ef10 ("Documentation/ABI: Mark /sys/kernel/fadump_* sysfs files deprecated") moved the /sys/kernel/fadump_* sysfs files to /sys/kernel/fadump/ and deprecated the old files in 2019. To maintain backward compatibility, symlinks were added at the old locations so existing tools could still work. References [1][2] now use the new sysfs interface, so we can safely remove the old symlinks. Link: https://github.com/rhkdump/kdump-utils/commit/fc7c65312a5bef115ce40818bf43ddd3b01b8958 [1] Link: https://github.com/openSUSE/kdump/commit/c274a22ff5f326c8afaa7bba60bd1b86abfc4fab [2] Cc: Hari Bathini Cc: Madhavan Srinivasan Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- arch/powerpc/kernel/fadump.c | 36 ------------------------------------ 1 file changed, 36 deletions(-) diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c index 4ebc333dd786..4348466260cf 100644 --- a/arch/powerpc/kernel/fadump.c +++ b/arch/powerpc/kernel/fadump.c @@ -1604,43 +1604,7 @@ static void __init fadump_init_files(void) pr_err("sysfs group creation failed (%d), unregistering FADump", rc); unregister_fadump(); - return; - } - - /* - * The FADump sysfs are moved from kernel_kobj to fadump_kobj need to - * create symlink at old location to maintain backward compatibility. 
- * - fadump_enabled -> fadump/enabled - * - fadump_registered -> fadump/registered - * - fadump_release_mem -> fadump/release_mem - */ - rc = compat_only_sysfs_link_entry_to_kobj(kernel_kobj, fadump_kobj, - "enabled", "fadump_enabled"); - if (rc) { - pr_err("unable to create fadump_enabled symlink (%d)", rc); - return; - } - - rc = compat_only_sysfs_link_entry_to_kobj(kernel_kobj, fadump_kobj, - "registered", - "fadump_registered"); - if (rc) { - pr_err("unable to create fadump_registered symlink (%d)", rc); - sysfs_remove_link(kernel_kobj, "fadump_enabled"); - return; } - - if (fw_dump.dump_active) { - rc = compat_only_sysfs_link_entry_to_kobj(kernel_kobj, - fadump_kobj, - "release_mem", - "fadump_release_mem"); - if (rc) - pr_err("unable to create fadump_release_mem symlink (%d)", - rc); - } - return; } static int __init fadump_setup_elfcorehdr_buf(void) -- 2.51.1 From zaneta.kedzierska at fontri.pl Thu Nov 13 01:01:21 2025 From: zaneta.kedzierska at fontri.pl (Żaneta Kędzierska) Date: Thu, 13 Nov 2025 09:01:21 GMT Subject: Parcel locker pickup (Odbiór w paczkomacie) Message-ID: <20251113084500-0.1.sc.5zo8i.0.pqjh2cikjm@fontri.pl> [Translated from Polish] Good morning, As a leader in courier services in Poland, we have prepared a flexible solution for businesses. We have created a subscription combining 24/7 Paczkomat (parcel locker) deliveries with courier service - one provider, one invoice, and predictable, fixed costs. May I present what we can offer for your needs? 
Best regards, Żaneta Kędzierska From sourabhjain at linux.ibm.com Thu Nov 13 21:15:00 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Fri, 14 Nov 2025 10:45:00 +0530 Subject: [PATCH v4 1/5] Documentation/ABI: add kexec and kdump sysfs interface In-Reply-To: <20251114051504.614937-1-sourabhjain@linux.ibm.com> References: <20251114051504.614937-1-sourabhjain@linux.ibm.com> Message-ID: <20251114051504.614937-2-sourabhjain@linux.ibm.com> Add an ABI document for the following kexec and kdump sysfs interfaces: - /sys/kernel/kexec_loaded - /sys/kernel/kexec_crash_loaded - /sys/kernel/kexec_crash_size - /sys/kernel/crash_elfcorehdr_size Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../ABI/testing/sysfs-kernel-kexec-kdump | 43 +++++++++++++++++++ 1 file changed, 43 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-kexec-kdump diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump new file mode 100644 index 000000000000..96b24565b68e --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -0,0 +1,43 @@ +What: /sys/kernel/kexec_loaded +Date: Jun 2006 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a new kernel image has been loaded + into memory using the kexec system call. It shows 1 if + a kexec image is present and ready to boot, or 0 if none + is loaded. +User: kexec tools, kdump service + +What: /sys/kernel/kexec_crash_loaded +Date: Jun 2006 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a crash (kdump) kernel is currently + loaded into memory. 
It shows 1 if a crash kernel has been + successfully loaded for panic handling, or 0 if no crash + kernel is present. +User: Kexec tools, Kdump service + +What: /sys/kernel/kexec_crash_size +Date: Dec 2009 +Contact: kexec at lists.infradead.org +Description: read/write + Shows the amount of memory reserved for loading the crash + (kdump) kernel. It reports the size, in bytes, of the + crash kernel area defined by the crashkernel= parameter. + This interface also allows reducing the crashkernel + reservation by writing a smaller value, and the reclaimed + space is added back to the system RAM. +User: Kdump service + +What: /sys/kernel/crash_elfcorehdr_size +Date: Aug 2023 +Contact: kexec at lists.infradead.org +Description: read only + Indicates the preferred size of the memory buffer for the + ELF core header used by the crash (kdump) kernel. It defines + how much space is needed to hold metadata about the crashed + system, including CPU and memory information. This information + is used by the user space utility kexec to support updating the + in-kernel kdump image during hotplug operations. +User: Kexec tools -- 2.51.1 From sourabhjain at linux.ibm.com Thu Nov 13 21:15:01 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Fri, 14 Nov 2025 10:45:01 +0530 Subject: [PATCH v4 2/5] kexec: move sysfs entries to /sys/kernel/kexec In-Reply-To: <20251114051504.614937-1-sourabhjain@linux.ibm.com> References: <20251114051504.614937-1-sourabhjain@linux.ibm.com> Message-ID: <20251114051504.614937-3-sourabhjain@linux.ibm.com> Several kexec and kdump sysfs entries are currently placed directly under /sys/kernel/, which clutters the directory and makes it harder to identify unrelated entries. To improve organization and readability, these entries are now moved under a dedicated directory, /sys/kernel/kexec. For backward compatibility, symlinks are created at the old locations so that existing tools and scripts continue to work. 
These symlinks can be removed in the future once users have switched to the new path. While creating symlinks, entries are added in /sys/kernel/ that point to their new locations under /sys/kernel/kexec/. If an error occurs while adding a symlink, it is logged but does not stop initialization of the remaining kexec sysfs symlinks. The /sys/kernel/ entry is now controlled by CONFIG_CRASH_DUMP instead of CONFIG_VMCORE_INFO, as CONFIG_CRASH_DUMP also enables CONFIG_VMCORE_INFO. Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- kernel/kexec_core.c | 118 ++++++++++++++++++++++++++++++++++++++++++++ kernel/ksysfs.c | 68 +------------------------ 2 files changed, 119 insertions(+), 67 deletions(-) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index fa00b239c5d9..7476a46de5d6 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -41,6 +41,7 @@ #include #include #include +#include #include #include @@ -1229,3 +1230,120 @@ int kernel_kexec(void) kexec_unlock(); return error; } + +static ssize_t loaded_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", !!kexec_image); +} +static struct kobj_attribute loaded_attr = __ATTR_RO(loaded); + +#ifdef CONFIG_CRASH_DUMP +static ssize_t crash_loaded_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", kexec_crash_loaded()); +} +static struct kobj_attribute crash_loaded_attr = __ATTR_RO(crash_loaded); + +static ssize_t crash_size_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + ssize_t size = crash_get_memory_size(); + + if (size < 0) + return size; + + return sysfs_emit(buf, "%zd\n", size); +} +static ssize_t 
crash_size_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned long cnt; + int ret; + + if (kstrtoul(buf, 0, &cnt)) + return -EINVAL; + + ret = crash_shrink_memory(cnt); + return ret < 0 ? ret : count; +} +static struct kobj_attribute crash_size_attr = __ATTR_RW(crash_size); + +#ifdef CONFIG_CRASH_HOTPLUG +static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + unsigned int sz = crash_get_elfcorehdr_size(); + + return sysfs_emit(buf, "%u\n", sz); +} +static struct kobj_attribute crash_elfcorehdr_size_attr = __ATTR_RO(crash_elfcorehdr_size); + +#endif /* CONFIG_CRASH_HOTPLUG */ +#endif /* CONFIG_CRASH_DUMP */ + +static struct attribute *kexec_attrs[] = { + &loaded_attr.attr, +#ifdef CONFIG_CRASH_DUMP + &crash_loaded_attr.attr, + &crash_size_attr.attr, +#ifdef CONFIG_CRASH_HOTPLUG + &crash_elfcorehdr_size_attr.attr, +#endif +#endif + NULL +}; + +struct kexec_link_entry { + const char *target; + const char *name; +}; + +static struct kexec_link_entry kexec_links[] = { + { "loaded", "kexec_loaded" }, +#ifdef CONFIG_CRASH_DUMP + { "crash_loaded", "kexec_crash_loaded" }, + { "crash_size", "kexec_crash_size" }, +#ifdef CONFIG_CRASH_HOTPLUG + { "crash_elfcorehdr_size", "crash_elfcorehdr_size" }, +#endif +#endif + +}; + +static struct kobject *kexec_kobj; +ATTRIBUTE_GROUPS(kexec); + +static int __init init_kexec_sysctl(void) +{ + int error; + int i; + + kexec_kobj = kobject_create_and_add("kexec", kernel_kobj); + if (!kexec_kobj) { + pr_err("failed to create kexec kobject\n"); + return -ENOMEM; + } + + error = sysfs_create_groups(kexec_kobj, kexec_groups); + if (error) + goto kset_exit; + + for (i = 0; i < ARRAY_SIZE(kexec_links); i++) { + error = compat_only_sysfs_link_entry_to_kobj(kernel_kobj, kexec_kobj, + kexec_links[i].target, + kexec_links[i].name); + if (error) + pr_err("Unable to create %s symlink (%d)", kexec_links[i].name, error); + } + + return 0; + 
+kset_exit: + kobject_put(kexec_kobj); + return error; +} + +subsys_initcall(init_kexec_sysctl); diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c index eefb67d9883c..a9e6354d9e25 100644 --- a/kernel/ksysfs.c +++ b/kernel/ksysfs.c @@ -12,7 +12,7 @@ #include #include #include -#include +#include #include #include #include @@ -119,50 +119,6 @@ static ssize_t profiling_store(struct kobject *kobj, KERNEL_ATTR_RW(profiling); #endif -#ifdef CONFIG_KEXEC_CORE -static ssize_t kexec_loaded_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - return sysfs_emit(buf, "%d\n", !!kexec_image); -} -KERNEL_ATTR_RO(kexec_loaded); - -#ifdef CONFIG_CRASH_DUMP -static ssize_t kexec_crash_loaded_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - return sysfs_emit(buf, "%d\n", kexec_crash_loaded()); -} -KERNEL_ATTR_RO(kexec_crash_loaded); - -static ssize_t kexec_crash_size_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - ssize_t size = crash_get_memory_size(); - - if (size < 0) - return size; - - return sysfs_emit(buf, "%zd\n", size); -} -static ssize_t kexec_crash_size_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count) -{ - unsigned long cnt; - int ret; - - if (kstrtoul(buf, 0, &cnt)) - return -EINVAL; - - ret = crash_shrink_memory(cnt); - return ret < 0 ? 
ret : count; -} -KERNEL_ATTR_RW(kexec_crash_size); - -#endif /* CONFIG_CRASH_DUMP*/ -#endif /* CONFIG_KEXEC_CORE */ - #ifdef CONFIG_VMCORE_INFO static ssize_t vmcoreinfo_show(struct kobject *kobj, @@ -174,18 +130,6 @@ static ssize_t vmcoreinfo_show(struct kobject *kobj, } KERNEL_ATTR_RO(vmcoreinfo); -#ifdef CONFIG_CRASH_HOTPLUG -static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - unsigned int sz = crash_get_elfcorehdr_size(); - - return sysfs_emit(buf, "%u\n", sz); -} -KERNEL_ATTR_RO(crash_elfcorehdr_size); - -#endif - #endif /* CONFIG_VMCORE_INFO */ /* whether file capabilities are enabled */ @@ -255,18 +199,8 @@ static struct attribute * kernel_attrs[] = { #ifdef CONFIG_PROFILING &profiling_attr.attr, #endif -#ifdef CONFIG_KEXEC_CORE - &kexec_loaded_attr.attr, -#ifdef CONFIG_CRASH_DUMP - &kexec_crash_loaded_attr.attr, - &kexec_crash_size_attr.attr, -#endif -#endif #ifdef CONFIG_VMCORE_INFO &vmcoreinfo_attr.attr, -#ifdef CONFIG_CRASH_HOTPLUG - &crash_elfcorehdr_size_attr.attr, -#endif #endif #ifndef CONFIG_TINY_RCU &rcu_expedited_attr.attr, -- 2.51.1 From sourabhjain at linux.ibm.com Thu Nov 13 21:15:02 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Fri, 14 Nov 2025 10:45:02 +0530 Subject: [PATCH v4 3/5] Documentation/ABI: mark old kexec sysfs deprecated In-Reply-To: <20251114051504.614937-1-sourabhjain@linux.ibm.com> References: <20251114051504.614937-1-sourabhjain@linux.ibm.com> Message-ID: <20251114051504.614937-4-sourabhjain@linux.ibm.com> The previous commit ("kexec: move sysfs entries to /sys/kernel/kexec") moved all existing kexec sysfs entries to a new location. The ABI document is updated to include a note about the deprecation of the old kexec sysfs entries. 
The following kexec sysfs entries are deprecated: - /sys/kernel/kexec_loaded - /sys/kernel/kexec_crash_loaded - /sys/kernel/kexec_crash_size - /sys/kernel/crash_elfcorehdr_size Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../sysfs-kernel-kexec-kdump | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) rename Documentation/ABI/{testing => obsolete}/sysfs-kernel-kexec-kdump (61%) diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump similarity index 61% rename from Documentation/ABI/testing/sysfs-kernel-kexec-kdump rename to Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump index 96b24565b68e..96b4d41721cc 100644 --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump +++ b/Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump @@ -1,3 +1,19 @@ +NOTE: all the ABIs listed in this file are deprecated and will be removed after 2028. 
+ +Here are the alternative ABIs: ++------------------------------------+-----------------------------------------+ +| Deprecated | Alternative | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_loaded | /sys/kernel/kexec/loaded | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_crash_loaded | /sys/kernel/kexec/crash_loaded | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_crash_size | /sys/kernel/kexec/crash_size | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/crash_elfcorehdr_size | /sys/kernel/kexec/crash_elfcorehdr_size | ++------------------------------------+-----------------------------------------+ + + What: /sys/kernel/kexec_loaded Date: Jun 2006 Contact: kexec at lists.infradead.org -- 2.51.1 From sourabhjain at linux.ibm.com Thu Nov 13 21:15:03 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Fri, 14 Nov 2025 10:45:03 +0530 Subject: [PATCH v4 4/5] kexec: document new kexec and kdump sysfs ABIs In-Reply-To: <20251114051504.614937-1-sourabhjain@linux.ibm.com> References: <20251114051504.614937-1-sourabhjain@linux.ibm.com> Message-ID: <20251114051504.614937-5-sourabhjain@linux.ibm.com> Add an ABI document for following kexec and kdump sysfs interface: - /sys/kernel/kexec/loaded - /sys/kernel/kexec/crash_loaded - /sys/kernel/kexec/crash_size - /sys/kernel/kexec/crash_elfcorehdr_size Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../ABI/testing/sysfs-kernel-kexec-kdump | 51 +++++++++++++++++++ 1 file changed, 51 insertions(+) create mode 100644 
Documentation/ABI/testing/sysfs-kernel-kexec-kdump diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump new file mode 100644 index 000000000000..00c00f380fea --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -0,0 +1,51 @@ +What: /sys/kernel/kexec/* +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: + The /sys/kernel/kexec/* directory contains sysfs files + that provide information about the configuration status + of kexec and kdump. + +What: /sys/kernel/kexec/loaded +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a new kernel image has been loaded + into memory using the kexec system call. It shows 1 if + a kexec image is present and ready to boot, or 0 if none + is loaded. +User: kexec tools, kdump service + +What: /sys/kernel/kexec/crash_loaded +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a crash (kdump) kernel is currently + loaded into memory. It shows 1 if a crash kernel has been + successfully loaded for panic handling, or 0 if no crash + kernel is present. +User: Kexec tools, Kdump service + +What: /sys/kernel/kexec/crash_size +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read/write + Shows the amount of memory reserved for loading the crash + (kdump) kernel. It reports the size, in bytes, of the + crash kernel area defined by the crashkernel= parameter. + This interface also allows reducing the crashkernel + reservation by writing a smaller value, and the reclaimed + space is added back to the system RAM. +User: Kdump service + +What: /sys/kernel/kexec/crash_elfcorehdr_size +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates the preferred size of the memory buffer for the + ELF core header used by the crash (kdump) kernel. 
It defines + how much space is needed to hold metadata about the crashed + system, including CPU and memory information. This information + is used by the user space utility kexec to support updating the + in-kernel kdump image during hotplug operations. +User: Kexec tools -- 2.51.1 From sourabhjain at linux.ibm.com Thu Nov 13 21:15:04 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Fri, 14 Nov 2025 10:45:04 +0530 Subject: [PATCH v4 5/5] crash: export crashkernel CMA reservation to userspace In-Reply-To: <20251114051504.614937-1-sourabhjain@linux.ibm.com> References: <20251114051504.614937-1-sourabhjain@linux.ibm.com> Message-ID: <20251114051504.614937-6-sourabhjain@linux.ibm.com> Add a sysfs entry /sys/kernel/kexec/crash_cma_ranges to expose all CMA crashkernel ranges. This allows userspace tools configuring kdump to determine how much memory is reserved for the crashkernel. If CMA is used, tools can warn users when attempting to capture user pages with a CMA reservation. The new sysfs file holds the CMA ranges in the below format: cat /sys/kernel/kexec/crash_cma_ranges 100000000-10c7fffff The reason for not including the crash CMA ranges in /proc/iomem is to avoid conflicts. It has been observed that contiguous memory ranges are sometimes shown as two separate System RAM entries in /proc/iomem. If a CMA range overlaps two System RAM ranges, adding crashk_res to /proc/iomem can create a conflict. Reference [1] describes one such instance on the PowerPC architecture. 
Link: https://lore.kernel.org/all/20251016142831.144515-1-sourabhjain at linux.ibm.com/ [1] Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../ABI/testing/sysfs-kernel-kexec-kdump | 10 ++++++++++ kernel/kexec_core.c | 17 +++++++++++++++++ 2 files changed, 27 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump index 00c00f380fea..f59051b5d96d 100644 --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -49,3 +49,13 @@ Description: read only is used by the user space utility kexec to support updating the in-kernel kdump image during hotplug operations. User: Kexec tools + +What: /sys/kernel/kexec/crash_cma_ranges +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Provides information about the memory ranges reserved from + the Contiguous Memory Allocator (CMA) area that are allocated + to the crash (kdump) kernel. It lists the start and end physical + addresses of CMA regions assigned for crashkernel use. 
+User: kdump service diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 7476a46de5d6..da6ff72b4669 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -1271,6 +1271,22 @@ static ssize_t crash_size_store(struct kobject *kobj, } static struct kobj_attribute crash_size_attr = __ATTR_RW(crash_size); +static ssize_t crash_cma_ranges_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + + ssize_t len = 0; + int i; + + for (i = 0; i < crashk_cma_cnt; ++i) { + len += sysfs_emit_at(buf, len, "%08llx-%08llx\n", + crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end); + } + return len; +} +static struct kobj_attribute crash_cma_ranges_attr = __ATTR_RO(crash_cma_ranges); + #ifdef CONFIG_CRASH_HOTPLUG static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) @@ -1289,6 +1305,7 @@ static struct attribute *kexec_attrs[] = { #ifdef CONFIG_CRASH_DUMP &crash_loaded_attr.attr, &crash_size_attr.attr, + &crash_cma_ranges_attr.attr, #ifdef CONFIG_CRASH_HOTPLUG &crash_elfcorehdr_size_attr.attr, #endif -- 2.51.1 From rppt at kernel.org Thu Nov 13 23:30:11 2025 From: rppt at kernel.org (Mike Rapoport) Date: Fri, 14 Nov 2025 09:30:11 +0200 Subject: [PATCH] liveupdate: kho: Enable KHO by default In-Reply-To: <20251110180715.602807-1-pasha.tatashin@soleen.com> References: <20251110180715.602807-1-pasha.tatashin@soleen.com> Message-ID: On Mon, Nov 10, 2025 at 01:07:15PM -0500, Pasha Tatashin wrote: > Upcoming LUO requires KHO for its operations, the requirement to place > both KHO=on and liveupdate=on becomes redundant. Set KHO to be enabled > by default. I thought more about this and it seems too much of a change. kho=1 enables scratch areas and that significantly changes how free pages are distributed in the free lists. 
Let's go with a Kconfig option we discussed off-list: (this is on top of the current mmotm/mm-nonmm-unstable) >From 823299d80aa4f7c16ef6cfd798a19e1dfe1a91ab Mon Sep 17 00:00:00 2001 From: Pasha Tatashin Date: Fri, 14 Nov 2025 09:27:47 +0200 Subject: [PATCH] kho: Allow KHO to be enabled by default Upcoming LUO requires KHO for its operations, the requirement to place both KHO=on and liveupdate=on becomes redundant. Let's allow KHO to be enabled by default, and CONFIG_LIVEUPDATE can select this CONFIG. Signed-off-by: Pasha Tatashin Signed-off-by: Mike Rapoport (Microsoft) --- kernel/liveupdate/Kconfig | 8 ++++++++ kernel/liveupdate/kexec_handover.c | 2 +- 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig index d7344d347f69..25c9a4d7781f 100644 --- a/kernel/liveupdate/Kconfig +++ b/kernel/liveupdate/Kconfig @@ -63,4 +63,12 @@ config KEXEC_HANDOVER_DEBUGFS Also, enables inspecting the KHO fdt trees with the debugfs binary blobs. +config KEXEC_HANDOVER_ENABLE_DEFAULT + bool "Enable kexec handover by default" + depends on KEXEC_HANDOVER + help + Enable the kexec handover by default. It is equivalent of passing + kho=on via kernel parameter, and can be overwritten to off via + kho=off. + endmenu diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 568cd9fe9aca..23a3df297bb3 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -51,7 +51,7 @@ union kho_page_info { static_assert(sizeof(union kho_page_info) == sizeof(((struct page *)0)->private)); -static bool kho_enable __ro_after_init = true; +static bool kho_enable __ro_after_init = IS_ENABLED(CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT); bool kho_is_enabled(void) { -- 2.50.1 -- Sincerely yours, Mike. 
From graf at amazon.com Fri Nov 14 01:01:20 2025 From: graf at amazon.com (Alexander Graf) Date: Fri, 14 Nov 2025 10:01:20 +0100 Subject: [PATCH] liveupdate: kho: Enable KHO by default In-Reply-To: References: <20251110180715.602807-1-pasha.tatashin@soleen.com> Message-ID: On 14.11.25 08:30, Mike Rapoport wrote: > On Mon, Nov 10, 2025 at 01:07:15PM -0500, Pasha Tatashin wrote: >> Upcoming LUO requires KHO for its operations, the requirement to place >> both KHO=on and liveupdate=on becomes redundant. Set KHO to be enabled >> by default. > I though more about this and it seems too much of a change. kho=1 enables > scratch areas and that significantly changes how free pages are distributed > in the free lists. > > Let's go with a Kconfig option we discussed of-list: > (this is on top of the current mmotm/mm-nonmm-unstable) > > From 823299d80aa4f7c16ef6cfd798a19e1dfe1a91ab Mon Sep 17 00:00:00 2001 > From: Pasha Tatashin > Date: Fri, 14 Nov 2025 09:27:47 +0200 > Subject: [PATCH] kho: Allow KHO to be enabled by default > > Upcoming LUO requires KHO for its operations, the requirement to place > both KHO=on and liveupdate=on becomes reduntant. Let's allow KHO to be > enabled by default, and CONFIG_LIVEUPDATE can select this CONFIG. Looks much better, yes :). You can also imply this option automatically when LUO=y. Reviewed-by: Alexander Graf Alex > > Signed-off-by: Pasha Tatashin > Signed-off-by: Mike Rapoport (Microsoft) > --- > kernel/liveupdate/Kconfig | 8 ++++++++ > kernel/liveupdate/kexec_handover.c | 2 +- > 2 files changed, 9 insertions(+), 1 deletion(-) > > diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig > index d7344d347f69..25c9a4d7781f 100644 > --- a/kernel/liveupdate/Kconfig > +++ b/kernel/liveupdate/Kconfig > @@ -63,4 +63,12 @@ config KEXEC_HANDOVER_DEBUGFS > Also, enables inspecting the KHO fdt trees with the debugfs binary > blobs. 
>
> +config KEXEC_HANDOVER_ENABLE_DEFAULT
> +	bool "Enable kexec handover by default"
> +	depends on KEXEC_HANDOVER
> +	help
> +	  Enable kexec handover by default. It is equivalent to passing
> +	  kho=on on the kernel command line, and can be overridden with
> +	  kho=off.
> +
>  endmenu
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index 568cd9fe9aca..23a3df297bb3 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
> @@ -51,7 +51,7 @@ union kho_page_info {
>
>  static_assert(sizeof(union kho_page_info) == sizeof(((struct page *)0)->private));
>
> -static bool kho_enable __ro_after_init = true;
> +static bool kho_enable __ro_after_init = IS_ENABLED(CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT);
>
>  bool kho_is_enabled(void)
>  {
> --
> 2.50.1
>
>
>
> --
> Sincerely yours,
> Mike.


Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Christof Hellmis
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597

From pasha.tatashin at soleen.com  Fri Nov 14 06:13:01 2025
From: pasha.tatashin at soleen.com (Pasha Tatashin)
Date: Fri, 14 Nov 2025 09:13:01 -0500
Subject: [PATCH] liveupdate: kho: Enable KHO by default
In-Reply-To:
References: <20251110180715.602807-1-pasha.tatashin@soleen.com>
Message-ID:

On Fri, Nov 14, 2025 at 2:30 AM Mike Rapoport wrote:
>
> On Mon, Nov 10, 2025 at 01:07:15PM -0500, Pasha Tatashin wrote:
> > Upcoming LUO requires KHO for its operations, the requirement to place
> > both KHO=on and liveupdate=on becomes redundant. Set KHO to be enabled
> > by default.
>
> I thought more about this and it seems too much of a change. kho=1 enables
> scratch areas and that significantly changes how free pages are distributed
> in the free lists.
>
> Let's go with a Kconfig option we discussed off-list:
> (this is on top of the current mmotm/mm-nonmm-unstable)

I will include this in the KHO simplification series

>
> From 823299d80aa4f7c16ef6cfd798a19e1dfe1a91ab Mon Sep 17 00:00:00 2001
> From: Pasha Tatashin
> Date: Fri, 14 Nov 2025 09:27:47 +0200
> Subject: [PATCH] kho: Allow KHO to be enabled by default
>
> Upcoming LUO requires KHO for its operations, so the requirement to place
> both kho=on and liveupdate=on becomes redundant. Let's allow KHO to be
> enabled by default, and let CONFIG_LIVEUPDATE select this option.
>
> Signed-off-by: Pasha Tatashin
> Signed-off-by: Mike Rapoport (Microsoft)
> ---
>  kernel/liveupdate/Kconfig          | 8 ++++++++
>  kernel/liveupdate/kexec_handover.c | 2 +-
>  2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
> index d7344d347f69..25c9a4d7781f 100644
> --- a/kernel/liveupdate/Kconfig
> +++ b/kernel/liveupdate/Kconfig
> @@ -63,4 +63,12 @@ config KEXEC_HANDOVER_DEBUGFS
>  	  Also, enables inspecting the KHO fdt trees with the debugfs binary
>  	  blobs.
>
> +config KEXEC_HANDOVER_ENABLE_DEFAULT
> +	bool "Enable kexec handover by default"
> +	depends on KEXEC_HANDOVER
> +	help
> +	  Enable kexec handover by default. It is equivalent to passing
> +	  kho=on on the kernel command line, and can be overridden with
> +	  kho=off.
> +
>  endmenu
> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> index 568cd9fe9aca..23a3df297bb3 100644
> --- a/kernel/liveupdate/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
> @@ -51,7 +51,7 @@ union kho_page_info {
>
>  static_assert(sizeof(union kho_page_info) == sizeof(((struct page *)0)->private));
>
> -static bool kho_enable __ro_after_init = true;
> +static bool kho_enable __ro_after_init = IS_ENABLED(CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT);
>
>  bool kho_is_enabled(void)
>  {
> --
> 2.50.1
>
>
>
> --
> Sincerely yours,
> Mike.
From pasha.tatashin at soleen.com Fri Nov 14 07:53:45 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 10:53:45 -0500 Subject: [PATCH v1 00/13] kho: simplify state machine and enable dynamic updates Message-ID: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Andrew: This series applies against mm-nonmm-unstable, but should go right before LUOv5, i.e. on top of: "liveupdate: kho: use %pe format specifier for error pointer printing" It also replaces the following patches, that once applied should be dropped from mm-nonmm-unstable: "liveupdate: kho: when live update add KHO image during kexec load" "liveupdate: Kconfig: make debugfs optional" "kho: enable KHO by default" This patch series refactors the Kexec Handover subsystem to transition from a rigid, state-locked model to a dynamic, re-entrant architecture. It also introduces usability improvements. Motivation Currently, KHO relies on a strict state machine where memory preservation is locked upon finalization. If a change is required, the user must explicitly "abort" to reset the state. Additionally, the kexec image cannot be loaded until KHO is finalized, and the FDT is rebuilt from scratch on every finalization. This series simplifies this workflow to support "load early, finalize late" scenarios. Key Changes State Machine Simplification: - Removed kho_abort(). kho_finalize() is now re-entrant; calling it a second time automatically flushes the previous serialized state and generates a fresh one. - Removed kho_out.finalized checks from preservation APIs, allowing drivers to add/remove pages even after an initial finalization. - Decoupled kexec_file_load from KHO finalization. The KHO FDT physical address is now stable from boot, allowing the kexec image to be loaded before the handover metadata is finalized. FDT Management: - The FDT is now updated in-place dynamically when subtrees are added or removed, removing the need for complex reconstruction logic. 
- The output FDT is always exposed in debugfs (initialized and zeroed at boot), improving visibility and debugging capabilities throughout the system lifecycle. - Removed the redundant global preserved_mem_map pointer, establishing the FDT property as the single source of truth. New Features & API Enhancements: - High-Level Allocators: Introduced kho_alloc_preserve() and friends to reduce boilerplate for drivers that need to allocate, preserve, and eventually restore simple memory buffers. - Configuration: Added CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT to allow KHO to be active by default without requiring the kho=on command line parameter. Fixes: - Fixed potential alignment faults when accessing 64-bit FDT properties. - Fixed the lifecycle of the FDT folio preservation (now preserved once at init). Pasha Tatashin (13): kho: Fix misleading log message in kho_populate() kho: Convert __kho_abort() to return void kho: Preserve FDT folio only once during initialization kho: Verify deserialization status and fix FDT alignment access kho: Always expose output FDT in debugfs kho: Simplify serialization and remove __kho_abort kho: Remove global preserved_mem_map and store state in FDT kho: Remove abort functionality and support state refresh kho: Update FDT dynamically for subtree addition/removal kho: Allow kexec load before KHO finalization kho: Allow memory preservation state updates after finalization kho: Add Kconfig option to enable KHO by default kho: Introduce high-level memory allocation API include/linux/kexec_handover.h | 22 +- kernel/liveupdate/Kconfig | 14 + kernel/liveupdate/kexec_handover.c | 338 ++++++++++++-------- kernel/liveupdate/kexec_handover_debugfs.c | 2 +- kernel/liveupdate/kexec_handover_internal.h | 1 - 5 files changed, 232 insertions(+), 145 deletions(-) -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 07:53:46 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 10:53:46 -0500 Subject: 
[PATCH v1 01/13] kho: Fix misleading log message in kho_populate() In-Reply-To: <20251114155358.2884014-1-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: <20251114155358.2884014-2-pasha.tatashin@soleen.com> The log message in kho_populate() currently states "Will skip init for some devices". This implies that Kexec Handover always involves skipping device initialization. However, KHO is a generic mechanism used to preserve kernel memory across reboot for various purposes, such as memfd, telemetry, or reserve_mem. Skipping device initialization is a specific property of live update drivers using KHO, not a property of the mechanism itself. Remove the misleading suffix to accurately reflect the generic nature of KHO discovery. Signed-off-by: Pasha Tatashin --- kernel/liveupdate/kexec_handover.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 9f0913e101be..6ad45e12f53b 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1470,7 +1470,7 @@ void __init kho_populate(phys_addr_t fdt_phys, u64 fdt_len, kho_in.fdt_phys = fdt_phys; kho_in.scratch_phys = scratch_phys; kho_scratch_cnt = scratch_cnt; - pr_info("found kexec handover data. Will skip init for some devices\n"); + pr_info("found kexec handover data.\n"); out: if (fdt) -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 07:53:47 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 10:53:47 -0500 Subject: [PATCH v1 02/13] kho: Convert __kho_abort() to return void In-Reply-To: <20251114155358.2884014-1-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: <20251114155358.2884014-3-pasha.tatashin@soleen.com> The internal helper __kho_abort() always returns 0 and has no failure paths. 
Its return value is ignored by __kho_finalize and checked needlessly by kho_abort. Change the return type to void to reflect that this function cannot fail, and simplify kho_abort by removing dead error handling code. Signed-off-by: Pasha Tatashin --- kernel/liveupdate/kexec_handover.c | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 6ad45e12f53b..bc7f046a1313 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1117,20 +1117,16 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) } EXPORT_SYMBOL_GPL(kho_restore_vmalloc); -static int __kho_abort(void) +static void __kho_abort(void) { if (kho_out.preserved_mem_map) { kho_mem_ser_free(kho_out.preserved_mem_map); kho_out.preserved_mem_map = NULL; } - - return 0; } int kho_abort(void) { - int ret = 0; - if (!kho_enable) return -EOPNOTSUPP; @@ -1138,10 +1134,7 @@ int kho_abort(void) if (!kho_out.finalized) return -ENOENT; - ret = __kho_abort(); - if (ret) - return ret; - + __kho_abort(); kho_out.finalized = false; kho_debugfs_fdt_remove(&kho_out.dbg, kho_out.fdt); -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 07:53:48 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 10:53:48 -0500 Subject: [PATCH v1 03/13] kho: Preserve FDT folio only once during initialization In-Reply-To: <20251114155358.2884014-1-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: <20251114155358.2884014-4-pasha.tatashin@soleen.com> Currently, the FDT folio is preserved inside __kho_finalize(). If the user performs multiple finalize/abort cycles, kho_preserve_folio() is called repeatedly for the same FDT folio. Since the FDT folio is allocated once during kho_init(), it should be marked for preservation at the same time. 
Move the preservation call to kho_init() to align the preservation state with the object's lifecycle and simplify the finalize path. Signed-off-by: Pasha Tatashin --- kernel/liveupdate/kexec_handover.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index bc7f046a1313..a4b33ca79246 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1164,10 +1164,6 @@ static int __kho_finalize(void) if (err) goto abort; - err = kho_preserve_folio(virt_to_folio(kho_out.fdt)); - if (err) - goto abort; - err = kho_mem_serialize(&kho_out); if (err) goto abort; @@ -1319,6 +1315,10 @@ static __init int kho_init(void) if (err) goto err_free_fdt; + err = kho_preserve_folio(virt_to_folio(kho_out.fdt)); + if (err) + goto err_free_fdt; + if (fdt) { kho_in_debugfs_init(&kho_in.dbg, fdt); return 0; -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 07:53:51 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 10:53:51 -0500 Subject: [PATCH v1 06/13] kho: Simplify serialization and remove __kho_abort In-Reply-To: <20251114155358.2884014-1-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: <20251114155358.2884014-7-pasha.tatashin@soleen.com> Currently, __kho_finalize() performs memory serialization in the middle of FDT construction. If FDT construction fails later, the function must manually clean up the serialized memory via __kho_abort(). Refactor __kho_finalize() to perform kho_mem_serialize() only after the FDT has been successfully constructed and finished. This reordering has two benefits: 1. It avoids expensive serialization work if FDT generation fails. 2. It removes the need for cleanup in the FDT error path. As a result, the internal helper __kho_abort() is no longer needed for internal error handling. 
Inline its remaining logic (cleanup of the preserved memory map) directly into kho_abort() and remove the helper. Signed-off-by: Pasha Tatashin --- kernel/liveupdate/kexec_handover.c | 41 +++++++++++++----------------- 1 file changed, 17 insertions(+), 24 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index cd8641725343..aea58e5a6b49 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1127,14 +1127,6 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) } EXPORT_SYMBOL_GPL(kho_restore_vmalloc); -static void __kho_abort(void) -{ - if (kho_out.preserved_mem_map) { - kho_mem_ser_free(kho_out.preserved_mem_map); - kho_out.preserved_mem_map = NULL; - } -} - int kho_abort(void) { if (!kho_enable) @@ -1144,7 +1136,8 @@ int kho_abort(void) if (!kho_out.finalized) return -ENOENT; - __kho_abort(); + kho_mem_ser_free(kho_out.preserved_mem_map); + kho_out.preserved_mem_map = NULL; kho_out.finalized = false; return 0; @@ -1152,12 +1145,12 @@ int kho_abort(void) static int __kho_finalize(void) { - int err = 0; - u64 *preserved_mem_map; void *root = kho_out.fdt; struct kho_sub_fdt *fdt; + u64 *preserved_mem_map; + int err; - err |= fdt_create(root, PAGE_SIZE); + err = fdt_create(root, PAGE_SIZE); err |= fdt_finish_reservemap(root); err |= fdt_begin_node(root, ""); err |= fdt_property_string(root, "compatible", KHO_FDT_COMPATIBLE); @@ -1170,13 +1163,7 @@ static int __kho_finalize(void) sizeof(*preserved_mem_map), (void **)&preserved_mem_map); if (err) - goto abort; - - err = kho_mem_serialize(&kho_out); - if (err) - goto abort; - - *preserved_mem_map = (u64)virt_to_phys(kho_out.preserved_mem_map); + goto err_exit; mutex_lock(&kho_out.fdts_lock); list_for_each_entry(fdt, &kho_out.sub_fdts, l) { @@ -1190,13 +1177,19 @@ static int __kho_finalize(void) err |= fdt_end_node(root); err |= fdt_finish(root); + if (err) + goto err_exit; -abort: - if (err) { - pr_err("Failed to convert 
KHO state tree: %d\n", err); - __kho_abort(); - } + err = kho_mem_serialize(&kho_out); + if (err) + goto err_exit; + + *preserved_mem_map = (u64)virt_to_phys(kho_out.preserved_mem_map); + + return 0; +err_exit: + pr_err("Failed to convert KHO state tree: %d\n", err); return err; } -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 07:53:50 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 10:53:50 -0500 Subject: [PATCH v1 05/13] kho: Always expose output FDT in debugfs In-Reply-To: <20251114155358.2884014-1-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: <20251114155358.2884014-6-pasha.tatashin@soleen.com> Currently, the output FDT is added to debugfs only when KHO is finalized and removed when aborted. There is no need to hide the FDT based on the state. Always expose it starting from initialization. This aids the transition toward removing the explicit abort functionality and converting KHO to be fully stateless. Also, pre-zero the FDT tree so we do not expose random bits to the user and to the next kernel. 
Signed-off-by: Pasha Tatashin --- kernel/liveupdate/kexec_handover.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 83aca3b4af15..cd8641725343 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1147,8 +1147,6 @@ int kho_abort(void) __kho_abort(); kho_out.finalized = false; - kho_debugfs_fdt_remove(&kho_out.dbg, kho_out.fdt); - return 0; } @@ -1219,9 +1217,6 @@ int kho_finalize(void) kho_out.finalized = true; - WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, "fdt", - kho_out.fdt, true)); - return 0; } @@ -1310,7 +1305,7 @@ static __init int kho_init(void) if (!kho_enable) return 0; - fdt_page = alloc_page(GFP_KERNEL); + fdt_page = alloc_page(GFP_KERNEL | __GFP_ZERO); if (!fdt_page) { err = -ENOMEM; goto err_free_scratch; @@ -1344,6 +1339,9 @@ static __init int kho_init(void) init_cma_reserved_pageblock(pfn_to_page(pfn)); } + WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, "fdt", + kho_out.fdt, true)); + return 0; err_free_fdt: -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 07:53:49 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 10:53:49 -0500 Subject: [PATCH v1 04/13] kho: Verify deserialization status and fix FDT alignment access In-Reply-To: <20251114155358.2884014-1-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: <20251114155358.2884014-5-pasha.tatashin@soleen.com> During boot, kho_restore_folio() relies on the memory map having been successfully deserialized. If deserialization fails or no map is present, attempting to restore the FDT folio is unsafe. Update kho_mem_deserialize() to return a boolean indicating success. Use this return value in kho_memory_init() to disable KHO if deserialization fails. Also, the incoming FDT folio is never used, there is no reason to restore it. 
Additionally, use memcpy() to retrieve the memory map pointer from the
FDT. FDT properties are not guaranteed to be naturally aligned, and
accessing a 64-bit value via a pointer that is only 32-bit aligned can
cause faults.

Signed-off-by: Pasha Tatashin
---
 kernel/liveupdate/kexec_handover.c | 32 ++++++++++++++++++------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index a4b33ca79246..83aca3b4af15 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -450,20 +450,28 @@ static void __init deserialize_bitmap(unsigned int order,
 	}
 }

-static void __init kho_mem_deserialize(const void *fdt)
+/* Return true if memory was deserialized */
+static bool __init kho_mem_deserialize(const void *fdt)
 {
 	struct khoser_mem_chunk *chunk;
-	const phys_addr_t *mem;
+	const void *mem_ptr;
+	u64 mem;
 	int len;

-	mem = fdt_getprop(fdt, 0, PROP_PRESERVED_MEMORY_MAP, &len);
-
-	if (!mem || len != sizeof(*mem)) {
+	mem_ptr = fdt_getprop(fdt, 0, PROP_PRESERVED_MEMORY_MAP, &len);
+	if (!mem_ptr || len != sizeof(u64)) {
 		pr_err("failed to get preserved memory bitmaps\n");
-		return;
+		return false;
 	}
+	/* FDT guarantees 32-bit alignment, have to use memcpy */
+	memcpy(&mem, mem_ptr, len);
+
+	chunk = mem ? phys_to_virt(mem) : NULL;
+
+	/* No preserved physical pages were passed, no deserialization */
+	if (!chunk)
+		return false;
-	chunk = *mem ?
phys_to_virt(*mem) : NULL; while (chunk) { unsigned int i; @@ -472,6 +480,8 @@ static void __init kho_mem_deserialize(const void *fdt) &chunk->bitmaps[i]); chunk = KHOSER_LOAD_PTR(chunk->hdr.next); } + + return true; } /* @@ -1377,16 +1387,12 @@ static void __init kho_release_scratch(void) void __init kho_memory_init(void) { - struct folio *folio; - if (kho_in.scratch_phys) { kho_scratch = phys_to_virt(kho_in.scratch_phys); kho_release_scratch(); - kho_mem_deserialize(kho_get_fdt()); - folio = kho_restore_folio(kho_in.fdt_phys); - if (!folio) - pr_warn("failed to restore folio for KHO fdt\n"); + if (!kho_mem_deserialize(kho_get_fdt())) + kho_in.fdt_phys = 0; } else { kho_reserve_scratch(); } -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 07:53:54 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 10:53:54 -0500 Subject: [PATCH v1 09/13] kho: Update FDT dynamically for subtree addition/removal In-Reply-To: <20251114155358.2884014-1-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: <20251114155358.2884014-10-pasha.tatashin@soleen.com> Currently, sub-FDTs were tracked in a list (kho_out.sub_fdts) and the final FDT is constructed entirely from scratch during kho_finalize(). We can maintain the FDT dynamically: 1. Initialize a valid, empty FDT in kho_init(). 2. Use fdt_add_subnode and fdt_setprop in kho_add_subtree to update the FDT immediately when a subsystem registers. 3. Use fdt_del_node in kho_remove_subtree to remove entries. This removes the need for the intermediate sub_fdts list and the reconstruction logic in kho_finalize(). kho_finalize() now only needs to trigger memory map serialization. 
Signed-off-by: Pasha Tatashin --- kernel/liveupdate/kexec_handover.c | 144 ++++++++++++++--------------- 1 file changed, 68 insertions(+), 76 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 8ab77cb85ca9..822da961d4c9 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -102,20 +102,11 @@ struct kho_mem_track { struct khoser_mem_chunk; -struct kho_sub_fdt { - struct list_head l; - const char *name; - void *fdt; -}; - struct kho_out { void *fdt; bool finalized; struct mutex lock; /* protects KHO FDT finalization */ - struct list_head sub_fdts; - struct mutex fdts_lock; - struct kho_mem_track track; struct kho_debugfs dbg; }; @@ -125,8 +116,6 @@ static struct kho_out kho_out = { .track = { .orders = XARRAY_INIT(kho_out.track.orders, 0), }, - .sub_fdts = LIST_HEAD_INIT(kho_out.sub_fdts), - .fdts_lock = __MUTEX_INITIALIZER(kho_out.fdts_lock), .finalized = false, }; @@ -724,37 +713,67 @@ static void __init kho_reserve_scratch(void) */ int kho_add_subtree(const char *name, void *fdt) { - struct kho_sub_fdt *sub_fdt; + phys_addr_t phys = virt_to_phys(fdt); + void *root_fdt = kho_out.fdt; + int err = -ENOMEM; + int off, fdt_err; - sub_fdt = kmalloc(sizeof(*sub_fdt), GFP_KERNEL); - if (!sub_fdt) - return -ENOMEM; + guard(mutex)(&kho_out.lock); + + fdt_err = fdt_open_into(root_fdt, root_fdt, PAGE_SIZE); + if (fdt_err < 0) + return err; - INIT_LIST_HEAD(&sub_fdt->l); - sub_fdt->name = name; - sub_fdt->fdt = fdt; + off = fdt_add_subnode(root_fdt, 0, name); + if (off < 0) { + if (off == -FDT_ERR_EXISTS) + err = -EEXIST; + goto out_pack; + } + + err = fdt_setprop(root_fdt, off, PROP_SUB_FDT, &phys, sizeof(phys)); + if (err < 0) + goto out_pack; - guard(mutex)(&kho_out.fdts_lock); - list_add_tail(&sub_fdt->l, &kho_out.sub_fdts); WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, name, fdt, false)); - return 0; +out_pack: + fdt_pack(root_fdt); + + return err; } 
EXPORT_SYMBOL_GPL(kho_add_subtree); void kho_remove_subtree(void *fdt) { - struct kho_sub_fdt *sub_fdt; + phys_addr_t target_phys = virt_to_phys(fdt); + void *root_fdt = kho_out.fdt; + int off; + int err; + + guard(mutex)(&kho_out.lock); - guard(mutex)(&kho_out.fdts_lock); - list_for_each_entry(sub_fdt, &kho_out.sub_fdts, l) { - if (sub_fdt->fdt == fdt) { - list_del(&sub_fdt->l); - kfree(sub_fdt); + err = fdt_open_into(root_fdt, root_fdt, PAGE_SIZE); + if (err < 0) + return; + + for (off = fdt_first_subnode(root_fdt, 0); off >= 0; + off = fdt_next_subnode(root_fdt, off)) { + const u64 *val; + int len; + + val = fdt_getprop(root_fdt, off, PROP_SUB_FDT, &len); + if (!val || len != sizeof(phys_addr_t)) + continue; + + if ((phys_addr_t)*val == target_phys) { + fdt_del_node(root_fdt, off); kho_debugfs_fdt_remove(&kho_out.dbg, fdt); break; } } + + fdt_pack(root_fdt); } EXPORT_SYMBOL_GPL(kho_remove_subtree); @@ -1145,48 +1164,6 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) } EXPORT_SYMBOL_GPL(kho_restore_vmalloc); -static int __kho_finalize(void) -{ - void *root = kho_out.fdt; - struct kho_sub_fdt *fdt; - u64 empty_mem_map = 0; - int err; - - err = fdt_create(root, PAGE_SIZE); - err |= fdt_finish_reservemap(root); - err |= fdt_begin_node(root, ""); - err |= fdt_property_string(root, "compatible", KHO_FDT_COMPATIBLE); - err |= fdt_property(root, PROP_PRESERVED_MEMORY_MAP, &empty_mem_map, - sizeof(empty_mem_map)); - if (err) - goto err_exit; - - mutex_lock(&kho_out.fdts_lock); - list_for_each_entry(fdt, &kho_out.sub_fdts, l) { - phys_addr_t phys = virt_to_phys(fdt->fdt); - - err |= fdt_begin_node(root, fdt->name); - err |= fdt_property(root, PROP_SUB_FDT, &phys, sizeof(phys)); - err |= fdt_end_node(root); - } - mutex_unlock(&kho_out.fdts_lock); - - err |= fdt_end_node(root); - err |= fdt_finish(root); - if (err) - goto err_exit; - - err = kho_mem_serialize(&kho_out); - if (err) - goto err_exit; - - return 0; - -err_exit: - pr_err("Failed to convert 
KHO state tree: %d\n", err); - return err; -} - int kho_finalize(void) { int ret; @@ -1195,12 +1172,7 @@ int kho_finalize(void) return -EOPNOTSUPP; guard(mutex)(&kho_out.lock); - if (kho_out.finalized) { - kho_update_memory_map(NULL); - kho_out.finalized = false; - } - - ret = __kho_finalize(); + ret = kho_mem_serialize(&kho_out); if (ret) return ret; @@ -1285,6 +1257,26 @@ int kho_retrieve_subtree(const char *name, phys_addr_t *phys) } EXPORT_SYMBOL_GPL(kho_retrieve_subtree); +static __init int kho_out_fdt_setup(void) +{ + void *root = kho_out.fdt; + u64 empty_mem_map = 0; + int err; + + err = fdt_create(root, PAGE_SIZE); + err |= fdt_finish_reservemap(root); + err |= fdt_begin_node(root, ""); + err |= fdt_property_string(root, "compatible", KHO_FDT_COMPATIBLE); + err |= fdt_property(root, PROP_PRESERVED_MEMORY_MAP, &empty_mem_map, + sizeof(empty_mem_map)); + err |= fdt_end_node(root); + err |= fdt_finish(root); + if (err) + return err; + + return kho_preserve_folio(virt_to_folio(kho_out.fdt)); +} + static __init int kho_init(void) { int err = 0; @@ -1309,7 +1301,7 @@ static __init int kho_init(void) if (err) goto err_free_fdt; - err = kho_preserve_folio(virt_to_folio(kho_out.fdt)); + err = kho_out_fdt_setup(); if (err) goto err_free_fdt; -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 07:53:52 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 10:53:52 -0500 Subject: [PATCH v1 07/13] kho: Remove global preserved_mem_map and store state in FDT In-Reply-To: <20251114155358.2884014-1-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: <20251114155358.2884014-8-pasha.tatashin@soleen.com> Currently, the serialized memory map is tracked via kho_out.preserved_mem_map and copied to the FDT during finalization. This double tracking is redundant. Remove preserved_mem_map from kho_out. 
Instead, maintain the physical address of the head chunk directly in the preserved-memory-map FDT property. Introduce kho_update_memory_map() to manage this property. This function handles: 1. Retrieving and freeing any existing serialized map (handling the abort/retry case). 2. Updating the FDT property with the new chunk address. This establishes the FDT as the single source of truth for the handover state. Signed-off-by: Pasha Tatashin --- kernel/liveupdate/kexec_handover.c | 43 ++++++++++++++++++------------ 1 file changed, 26 insertions(+), 17 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index aea58e5a6b49..f1c3dd1ef680 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -117,9 +117,6 @@ struct kho_out { struct mutex fdts_lock; struct kho_mem_track track; - /* First chunk of serialized preserved memory map */ - struct khoser_mem_chunk *preserved_mem_map; - struct kho_debugfs dbg; }; @@ -380,6 +377,27 @@ static void kho_mem_ser_free(struct khoser_mem_chunk *first_chunk) } } +/* + * Update memory map property, if old one is found discard it via + * kho_mem_ser_free(). + */ +static void kho_update_memory_map(struct khoser_mem_chunk *first_chunk) +{ + void *ptr; + u64 phys; + + ptr = fdt_getprop_w(kho_out.fdt, 0, PROP_PRESERVED_MEMORY_MAP, NULL); + + /* Check and discard previous memory map */ + memcpy(&phys, ptr, sizeof(u64)); + if (phys) + kho_mem_ser_free((struct khoser_mem_chunk *)phys_to_virt(phys)); + + /* Update with the new value */ + phys = first_chunk ? 
(u64)virt_to_phys(first_chunk) : 0; + memcpy(ptr, &phys, sizeof(u64)); +} + static int kho_mem_serialize(struct kho_out *kho_out) { struct khoser_mem_chunk *first_chunk = NULL; @@ -420,7 +438,7 @@ static int kho_mem_serialize(struct kho_out *kho_out) } } - kho_out->preserved_mem_map = first_chunk; + kho_update_memory_map(first_chunk); return 0; @@ -1136,8 +1154,7 @@ int kho_abort(void) if (!kho_out.finalized) return -ENOENT; - kho_mem_ser_free(kho_out.preserved_mem_map); - kho_out.preserved_mem_map = NULL; + kho_update_memory_map(NULL); kho_out.finalized = false; return 0; @@ -1147,21 +1164,15 @@ static int __kho_finalize(void) { void *root = kho_out.fdt; struct kho_sub_fdt *fdt; - u64 *preserved_mem_map; + u64 empty_mem_map = 0; int err; err = fdt_create(root, PAGE_SIZE); err |= fdt_finish_reservemap(root); err |= fdt_begin_node(root, ""); err |= fdt_property_string(root, "compatible", KHO_FDT_COMPATIBLE); - /** - * Reserve the preserved-memory-map property in the root FDT, so - * that all property definitions will precede subnodes created by - * KHO callers. 
- */ - err |= fdt_property_placeholder(root, PROP_PRESERVED_MEMORY_MAP, - sizeof(*preserved_mem_map), - (void **)&preserved_mem_map); + err |= fdt_property(root, PROP_PRESERVED_MEMORY_MAP, &empty_mem_map, + sizeof(empty_mem_map)); if (err) goto err_exit; @@ -1184,8 +1195,6 @@ static int __kho_finalize(void) if (err) goto err_exit; - *preserved_mem_map = (u64)virt_to_phys(kho_out.preserved_mem_map); - return 0; err_exit: -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 07:53:55 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 10:53:55 -0500 Subject: [PATCH v1 10/13] kho: Allow kexec load before KHO finalization In-Reply-To: <20251114155358.2884014-1-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: <20251114155358.2884014-11-pasha.tatashin@soleen.com> Currently, kho_fill_kimage() checks kho_out.finalized and returns early if KHO is not yet finalized. This enforces a strict ordering where userspace must finalize KHO *before* loading the kexec image. This is restrictive, as standard workflows often involve loading the target kernel early in the lifecycle and finalizing the state (FDT) only immediately before the reboot. Since the KHO FDT resides at a physical address allocated during boot (kho_init), its location is stable. We can attach this stable address to the kimage regardless of whether the content has been finalized yet. Relax the check to only require kho_enable, allowing kexec_file_load to proceed at any time. 
Signed-off-by: Pasha Tatashin --- kernel/liveupdate/kexec_handover.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 822da961d4c9..27ef20565a5f 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1467,7 +1467,7 @@ int kho_fill_kimage(struct kimage *image) int err = 0; struct kexec_buf scratch; - if (!kho_out.finalized) + if (!kho_enable) return 0; image->kho.fdt = virt_to_phys(kho_out.fdt); -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 07:53:53 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 10:53:53 -0500 Subject: [PATCH v1 08/13] kho: Remove abort functionality and support state refresh In-Reply-To: <20251114155358.2884014-1-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: <20251114155358.2884014-9-pasha.tatashin@soleen.com> Previously, KHO required a dedicated kho_abort() function to clean up state before kho_finalize() could be called again. This was necessary to handle complex unwind paths when using notifiers. With the shift to direct memory preservation, the explicit abort step is no longer strictly necessary. Remove kho_abort() and refactor kho_finalize() to handle re-entry. If kho_finalize() is called while KHO is already finalized, it will now automatically clean up the previous memory map and state before generating a new one. This allows the KHO state to be updated/refreshed simply by triggering finalize again. Update debugfs to return -EINVAL if userspace attempts to write 0 to the finalize attribute, as explicit abort is no longer supported. 
Suggested-by: Mike Rapoport (Microsoft) Signed-off-by: Pasha Tatashin --- kernel/liveupdate/kexec_handover.c | 21 ++++----------------- kernel/liveupdate/kexec_handover_debugfs.c | 2 +- kernel/liveupdate/kexec_handover_internal.h | 1 - 3 files changed, 5 insertions(+), 19 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index f1c3dd1ef680..8ab77cb85ca9 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1145,21 +1145,6 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) } EXPORT_SYMBOL_GPL(kho_restore_vmalloc); -int kho_abort(void) -{ - if (!kho_enable) - return -EOPNOTSUPP; - - guard(mutex)(&kho_out.lock); - if (!kho_out.finalized) - return -ENOENT; - - kho_update_memory_map(NULL); - kho_out.finalized = false; - - return 0; -} - static int __kho_finalize(void) { void *root = kho_out.fdt; @@ -1210,8 +1195,10 @@ int kho_finalize(void) return -EOPNOTSUPP; guard(mutex)(&kho_out.lock); - if (kho_out.finalized) - return -EEXIST; + if (kho_out.finalized) { + kho_update_memory_map(NULL); + kho_out.finalized = false; + } ret = __kho_finalize(); if (ret) diff --git a/kernel/liveupdate/kexec_handover_debugfs.c b/kernel/liveupdate/kexec_handover_debugfs.c index ac739d25094d..2abbf62ba942 100644 --- a/kernel/liveupdate/kexec_handover_debugfs.c +++ b/kernel/liveupdate/kexec_handover_debugfs.c @@ -87,7 +87,7 @@ static int kho_out_finalize_set(void *data, u64 val) if (val) return kho_finalize(); else - return kho_abort(); + return -EINVAL; } DEFINE_DEBUGFS_ATTRIBUTE(kho_out_finalize_fops, kho_out_finalize_get, diff --git a/kernel/liveupdate/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h index 52ed73659fe6..0202c85ad14f 100644 --- a/kernel/liveupdate/kexec_handover_internal.h +++ b/kernel/liveupdate/kexec_handover_internal.h @@ -24,7 +24,6 @@ extern unsigned int kho_scratch_cnt; bool kho_finalized(void); int kho_finalize(void); -int kho_abort(void); 
#ifdef CONFIG_KEXEC_HANDOVER_DEBUGFS int kho_debugfs_init(void); -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 07:53:56 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 10:53:56 -0500 Subject: [PATCH v1 11/13] kho: Allow memory preservation state updates after finalization In-Reply-To: <20251114155358.2884014-1-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: <20251114155358.2884014-12-pasha.tatashin@soleen.com> Currently, kho_preserve_* and kho_unpreserve_* return -EBUSY if KHO is finalized. This enforces a rigid "freeze" on the KHO memory state. With the introduction of re-entrant finalization, this restriction is no longer necessary. Users should be allowed to modify the preservation set (e.g., adding new pages or freeing old ones) even after an initial finalization. The intended workflow for updates is now: 1. Modify state (preserve/unpreserve). 2. Call kho_finalize() again to refresh the serialized metadata. Remove the kho_out.finalized checks to enable this dynamic behavior. 
Signed-off-by: Pasha Tatashin --- kernel/liveupdate/kexec_handover.c | 13 ------------- 1 file changed, 13 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 27ef20565a5f..87e9b488237d 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -183,10 +183,6 @@ static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn, const unsigned long pfn_high = pfn >> order; might_sleep(); - - if (kho_out.finalized) - return -EBUSY; - physxa = xa_load(&track->orders, order); if (!physxa) { int err; @@ -815,9 +811,6 @@ int kho_unpreserve_folio(struct folio *folio) const unsigned int order = folio_order(folio); struct kho_mem_track *track = &kho_out.track; - if (kho_out.finalized) - return -EBUSY; - __kho_unpreserve_order(track, pfn, order); return 0; } @@ -885,9 +878,6 @@ int kho_unpreserve_pages(struct page *page, unsigned int nr_pages) const unsigned long start_pfn = page_to_pfn(page); const unsigned long end_pfn = start_pfn + nr_pages; - if (kho_out.finalized) - return -EBUSY; - __kho_unpreserve(track, start_pfn, end_pfn); return 0; @@ -1066,9 +1056,6 @@ EXPORT_SYMBOL_GPL(kho_preserve_vmalloc); */ int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) { - if (kho_out.finalized) - return -EBUSY; - kho_vmalloc_free_chunks(preservation); return 0; -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 07:53:57 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 10:53:57 -0500 Subject: [PATCH v1 12/13] kho: Add Kconfig option to enable KHO by default In-Reply-To: <20251114155358.2884014-1-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: <20251114155358.2884014-13-pasha.tatashin@soleen.com> Currently, Kexec Handover must be explicitly enabled via the kernel command line parameter `kho=on`. 
For workloads that rely on KHO as a foundational requirement (such as the upcoming Live Update Orchestrator), requiring an explicit boot parameter adds redundant configuration steps. Introduce CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT. When selected, KHO defaults to enabled. This is equivalent to passing kho=on at boot. The behavior can still be disabled at runtime by passing kho=off. Signed-off-by: Pasha Tatashin --- kernel/liveupdate/Kconfig | 14 ++++++++++++++ kernel/liveupdate/kexec_handover.c | 2 +- 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig index eae428309332..a973a54447de 100644 --- a/kernel/liveupdate/Kconfig +++ b/kernel/liveupdate/Kconfig @@ -37,4 +37,18 @@ config KEXEC_HANDOVER_DEBUGFS Also, enables inspecting the KHO fdt trees with the debugfs binary blobs. +config KEXEC_HANDOVER_ENABLE_DEFAULT + bool "Enable kexec handover by default" + depends on KEXEC_HANDOVER + help + Enable Kexec Handover by default. This avoids the need to + explicitly pass 'kho=on' on the kernel command line. + + This is useful for systems where KHO is a prerequisite for other + features, such as Live Update, ensuring the mechanism is always + active. + + The default behavior can still be overridden at boot time by + passing 'kho=off'. 
+ endmenu diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 87e9b488237d..a905bccf5f65 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -50,7 +50,7 @@ union kho_page_info { static_assert(sizeof(union kho_page_info) == sizeof(((struct page *)0)->private)); -static bool kho_enable __ro_after_init; +static bool kho_enable __ro_after_init = IS_ENABLED(CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT); bool kho_is_enabled(void) { -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 07:53:58 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 10:53:58 -0500 Subject: [PATCH v1 13/13] kho: Introduce high-level memory allocation API In-Reply-To: <20251114155358.2884014-1-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: <20251114155358.2884014-14-pasha.tatashin@soleen.com> Currently, clients of KHO must manually allocate memory (e.g., via alloc_pages), calculate the page order, and explicitly call kho_preserve_folio(). Similarly, cleanup requires separate calls to unpreserve and free the memory. Introduce a high-level API to streamline this common pattern: - kho_alloc_preserve(size): Allocates physically contiguous, zeroed memory and immediately marks it for preservation. - kho_free_unpreserve(ptr, size): Unpreserves and frees the memory in the current kernel. - kho_free_restore(ptr, size): Restores the struct page state of preserved memory in the new kernel and immediately frees it to the page allocator. 
Signed-off-by: Pasha Tatashin --- include/linux/kexec_handover.h | 22 +++++-- kernel/liveupdate/kexec_handover.c | 101 +++++++++++++++++++++++++++++ 2 files changed, 116 insertions(+), 7 deletions(-) diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index 80ece4232617..76c496e01877 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -2,8 +2,9 @@ #ifndef LINUX_KEXEC_HANDOVER_H #define LINUX_KEXEC_HANDOVER_H -#include +#include #include +#include struct kho_scratch { phys_addr_t addr; @@ -48,6 +49,9 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages); int kho_unpreserve_pages(struct page *page, unsigned int nr_pages); int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation); int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation); +void *kho_alloc_preserve(size_t size); +void kho_free_unpreserve(void *mem, size_t size); +void kho_free_restore(void *mem, size_t size); struct folio *kho_restore_folio(phys_addr_t phys); struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages); void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); @@ -101,6 +105,14 @@ static inline int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) { return -EOPNOTSUPP; } +static inline void *kho_alloc_preserve(size_t size) +{ + return ERR_PTR(-EOPNOTSUPP); +} + +static inline void kho_free_unpreserve(void *mem, size_t size) { } +static inline void kho_free_restore(void *mem, size_t size) { } + static inline struct folio *kho_restore_folio(phys_addr_t phys) { return NULL; @@ -122,18 +134,14 @@ static inline int kho_add_subtree(const char *name, void *fdt) { return -EOPNOTSUPP; } -static inline void kho_remove_subtree(void *fdt) -{ -} +static inline void kho_remove_subtree(void *fdt) { } static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys) { return -EOPNOTSUPP; } -static inline void kho_memory_init(void) -{ -} +static inline void kho_memory_init(void) { } static inline void kho_populate(phys_addr_t
fdt_phys, u64 fdt_len, phys_addr_t scratch_phys, u64 scratch_len) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index a905bccf5f65..9f05849fd68e 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -4,6 +4,7 @@ * Copyright (C) 2023 Alexander Graf * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport * Copyright (C) 2025 Google LLC, Changyuan Lyu + * Copyright (C) 2025 Pasha Tatashin */ #define pr_fmt(fmt) "KHO: " fmt @@ -1151,6 +1152,106 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) } EXPORT_SYMBOL_GPL(kho_restore_vmalloc); +/** + * kho_alloc_preserve - Allocate, zero, and preserve memory. + * @size: The number of bytes to allocate. + * + * Allocates a physically contiguous block of zeroed pages that is large + * enough to hold @size bytes. The allocated memory is then registered with + * KHO for preservation across a kexec. + * + * Note: The actual allocated size will be rounded up to the nearest + * power-of-two page boundary. + * + * @return A virtual pointer to the allocated and preserved memory on success, + * or an ERR_PTR() encoded error on failure. + */ +void *kho_alloc_preserve(size_t size) +{ + struct folio *folio; + int order, ret; + + if (!size) + return ERR_PTR(-EINVAL); + + order = get_order(size); + if (order > MAX_PAGE_ORDER) + return ERR_PTR(-E2BIG); + + folio = folio_alloc(GFP_KERNEL | __GFP_ZERO, order); + if (!folio) + return ERR_PTR(-ENOMEM); + + ret = kho_preserve_folio(folio); + if (ret) { + folio_put(folio); + return ERR_PTR(ret); + } + + return folio_address(folio); +} +EXPORT_SYMBOL_GPL(kho_alloc_preserve); + +/** + * kho_free_unpreserve - Unpreserve and free memory. + * @mem: Pointer to the memory allocated by kho_alloc_preserve(). + * @size: The original size requested during allocation. This is used to + * recalculate the correct order for freeing the pages. 
+ * + * Unregisters the memory from KHO preservation and frees the underlying + * pages back to the system. This function should be called to clean up + * memory allocated with kho_alloc_preserve(). + */ +void kho_free_unpreserve(void *mem, size_t size) +{ + struct folio *folio; + unsigned int order; + + if (!mem || !size) + return; + + order = get_order(size); + if (WARN_ON_ONCE(order > MAX_PAGE_ORDER)) + return; + + folio = virt_to_folio(mem); + WARN_ON_ONCE(kho_unpreserve_folio(folio)); + folio_put(folio); +} +EXPORT_SYMBOL_GPL(kho_free_unpreserve); + +/** + * kho_free_restore - Restore and free memory after kexec. + * @mem: Pointer to the memory (in the new kernel's address space) + * that was allocated by the old kernel. + * @size: The original size requested during allocation. This is used to + * recalculate the correct order for freeing the pages. + * + * This function is intended to be called in the new kernel (post-kexec) + * to take ownership of and free a memory region that was preserved by the + * old kernel using kho_alloc_preserve(). + * + * It first restores the pages from KHO (using their physical address) + * and then frees the pages back to the new kernel's page allocator. 
+ */ +void kho_free_restore(void *mem, size_t size) +{ + struct folio *folio; + unsigned int order; + + if (!mem || !size) + return; + + order = get_order(size); + if (WARN_ON_ONCE(order > MAX_PAGE_ORDER)) + return; + + folio = kho_restore_folio(__pa(mem)); + if (!WARN_ON(!folio)) + free_pages((unsigned long)mem, order); +} +EXPORT_SYMBOL_GPL(kho_free_restore); + int kho_finalize(void) { int ret; -- 2.52.0.rc1.455.g30608eb744-goog From rppt at kernel.org Fri Nov 14 08:15:27 2025 From: rppt at kernel.org (Mike Rapoport) Date: Fri, 14 Nov 2025 18:15:27 +0200 Subject: [PATCH v1 09/13] kho: Update FDT dynamically for subtree addition/removal In-Reply-To: <20251114155358.2884014-10-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-10-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14, 2025 at 10:53:54AM -0500, Pasha Tatashin wrote: > Currently, sub-FDTs were tracked in a list (kho_out.sub_fdts) and the > final FDT is constructed entirely from scratch during kho_finalize(). > > We can maintain the FDT dynamically: > 1. Initialize a valid, empty FDT in kho_init(). > 2. Use fdt_add_subnode and fdt_setprop in kho_add_subtree to > update the FDT immediately when a subsystem registers. > 3. Use fdt_del_node in kho_remove_subtree to remove entries. > > This removes the need for the intermediate sub_fdts list and the > reconstruction logic in kho_finalize(). kho_finalize() now > only needs to trigger memory map serialization. 
> > Signed-off-by: Pasha Tatashin > --- > kernel/liveupdate/kexec_handover.c | 144 ++++++++++++++--------------- > 1 file changed, 68 insertions(+), 76 deletions(-) > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > index 8ab77cb85ca9..822da961d4c9 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c > @@ -724,37 +713,67 @@ static void __init kho_reserve_scratch(void) > */ > int kho_add_subtree(const char *name, void *fdt) > { > - struct kho_sub_fdt *sub_fdt; > + phys_addr_t phys = virt_to_phys(fdt); > + void *root_fdt = kho_out.fdt; > + int err = -ENOMEM; > + int off, fdt_err; > > - sub_fdt = kmalloc(sizeof(*sub_fdt), GFP_KERNEL); > - if (!sub_fdt) > - return -ENOMEM; > + guard(mutex)(&kho_out.lock); > + > + fdt_err = fdt_open_into(root_fdt, root_fdt, PAGE_SIZE); > + if (fdt_err < 0) > + return err; > - INIT_LIST_HEAD(&sub_fdt->l); > - sub_fdt->name = name; > - sub_fdt->fdt = fdt; > + off = fdt_add_subnode(root_fdt, 0, name); fdt_err = fdt_add_subnode(); and then we don't need off > + if (off < 0) { > + if (off == -FDT_ERR_EXISTS) > + err = -EEXIST; Is it really -ENOMEM for other FDT_ERR values? > + goto out_pack; > + } > + > + err = fdt_setprop(root_fdt, off, PROP_SUB_FDT, &phys, sizeof(phys)); > + if (err < 0) > + goto out_pack; > > - guard(mutex)(&kho_out.fdts_lock); > - list_add_tail(&sub_fdt->l, &kho_out.sub_fdts); > WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, name, fdt, false)); > > - return 0; > +out_pack: > + fdt_pack(root_fdt); > + > + return err; > } > EXPORT_SYMBOL_GPL(kho_add_subtree); -- Sincerely yours, Mike. 
From rppt at kernel.org Fri Nov 14 08:15:59 2025 From: rppt at kernel.org (Mike Rapoport) Date: Fri, 14 Nov 2025 18:15:59 +0200 Subject: [PATCH v1 13/13] kho: Introduce high-level memory allocation API In-Reply-To: <20251114155358.2884014-14-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-14-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14, 2025 at 10:53:58AM -0500, Pasha Tatashin wrote: > Currently, clients of KHO must manually allocate memory (e.g., via > alloc_pages), calculate the page order, and explicitly call > kho_preserve_folio(). Similarly, cleanup requires separate calls to > unpreserve and free the memory. > > Introduce a high-level API to streamline this common pattern: > > - kho_alloc_preserve(size): Allocates physically contiguous, zeroed > memory and immediately marks it for preservation. > - kho_free_unpreserve(ptr, size): Unpreserves and frees the memory > in the current kernel. > - kho_free_restore(ptr, size): Restores the struct page state of > preserved memory in the new kernel and immediately frees it to the > page allocator. It would have been nice to have it before patch 3 (Preserve FDT folio only once during initialization) and use kho_alloc_preserve() for KHO's own FDT. > Signed-off-by: Pasha Tatashin > --- > include/linux/kexec_handover.h | 22 +++++-- > kernel/liveupdate/kexec_handover.c | 101 +++++++++++++++++++++++++++++ > 2 files changed, 116 insertions(+), 7 deletions(-) -- Sincerely yours, Mike. 
From rppt at kernel.org Fri Nov 14 08:17:27 2025 From: rppt at kernel.org (Mike Rapoport) Date: Fri, 14 Nov 2025 18:17:27 +0200 Subject: [PATCH v1 00/13] kho: simplify state machine and enable dynamic updates In-Reply-To: <20251114155358.2884014-1-pasha.tatashin@soleen.com> References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14, 2025 at 10:53:45AM -0500, Pasha Tatashin wrote: > > This patch series refactors the Kexec Handover subsystem to transition > from a rigid, state-locked model to a dynamic, re-entrant architecture. > It also introduces usability improvements. > > Pasha Tatashin (13): > kho: Fix misleading log message in kho_populate() > kho: Convert __kho_abort() to return void > kho: Preserve FDT folio only once during initialization > kho: Verify deserialization status and fix FDT alignment access > kho: Always expose output FDT in debugfs > kho: Simplify serialization and remove __kho_abort > kho: Remove global preserved_mem_map and store state in FDT > kho: Remove abort functionality and support state refresh > kho: Update FDT dynamically for subtree addition/removal > kho: Allow kexec load before KHO finalization > kho: Allow memory preservation state updates after finalization > kho: Add Kconfig option to enable KHO by default > kho: Introduce high-level memory allocation API For the series: Reviewed-by: Mike Rapoport (Microsoft) with small nits in patches 9 and 13 in replies to them. > > include/linux/kexec_handover.h | 22 +- > kernel/liveupdate/Kconfig | 14 + > kernel/liveupdate/kexec_handover.c | 338 ++++++++++++-------- > kernel/liveupdate/kexec_handover_debugfs.c | 2 +- > kernel/liveupdate/kexec_handover_internal.h | 1 - > 5 files changed, 232 insertions(+), 145 deletions(-) > > -- > 2.52.0.rc1.455.g30608eb744-goog > -- Sincerely yours, Mike. 
From pratyush at kernel.org Fri Nov 14 08:32:01 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 17:32:01 +0100 Subject: [PATCH v1 01/13] kho: Fix misleading log message in kho_populate() In-Reply-To: <20251114155358.2884014-2-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 10:53:46 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-2-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > The log message in kho_populate() currently states "Will skip init for > some devices". This implies that Kexec Handover always involves > skipping device initialization. > > However, KHO is a generic mechanism used to preserve kernel memory across > reboot for various purposes, such as memfd, telemetry, or reserve_mem. > Skipping device initialization is a specific property of live update > drivers using KHO, not a property of the mechanism itself. > > Remove the misleading suffix to accurately reflect the generic nature of > KHO discovery. > > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav [...] -- Regards, Pratyush Yadav From pratyush at kernel.org Fri Nov 14 08:32:31 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 17:32:31 +0100 Subject: [PATCH v1 02/13] kho: Convert __kho_abort() to return void In-Reply-To: <20251114155358.2884014-3-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 10:53:47 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-3-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > The internal helper __kho_abort() always returns 0 and has no failure > paths. Its return value is ignored by __kho_finalize and checked > needlessly by kho_abort. > > Change the return type to void to reflect that this function cannot > fail, and simplify kho_abort by removing dead error handling code. 
> > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav [...] -- Regards, Pratyush Yadav From pratyush at kernel.org Fri Nov 14 08:32:52 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 17:32:52 +0100 Subject: [PATCH v1 03/13] kho: Preserve FDT folio only once during initialization In-Reply-To: <20251114155358.2884014-4-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 10:53:48 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-4-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > Currently, the FDT folio is preserved inside __kho_finalize(). If the > user performs multiple finalize/abort cycles, kho_preserve_folio() is > called repeatedly for the same FDT folio. > > Since the FDT folio is allocated once during kho_init(), it should be > marked for preservation at the same time. Move the preservation call to > kho_init() to align the preservation state with the object's lifecycle > and simplify the finalize path. > > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav [...] -- Regards, Pratyush Yadav From pasha.tatashin at soleen.com Fri Nov 14 08:40:11 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 11:40:11 -0500 Subject: [PATCH v1 13/13] kho: Introduce high-level memory allocation API In-Reply-To: References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-14-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14, 2025 at 11:16?AM Mike Rapoport wrote: > > On Fri, Nov 14, 2025 at 10:53:58AM -0500, Pasha Tatashin wrote: > > Currently, clients of KHO must manually allocate memory (e.g., via > > alloc_pages), calculate the page order, and explicitly call > > kho_preserve_folio(). Similarly, cleanup requires separate calls to > > unpreserve and free the memory. 
> > > > Introduce a high-level API to streamline this common pattern: > > > > - kho_alloc_preserve(size): Allocates physically contiguous, zeroed > > memory and immediately marks it for preservation. > > - kho_free_unpreserve(ptr, size): Unpreserves and frees the memory > > in the current kernel. > > - kho_free_restore(ptr, size): Restores the struct page state of > > preserved memory in the new kernel and immediately frees it to the > > page allocator. > > It would have been nice to have it before patch 3 (Preserve FDT folio only > once during initialization) and use kho_alloc_preserve() for KHO's own FDT. Sure, I will move it before 3. > > > Signed-off-by: Pasha Tatashin > > --- > > include/linux/kexec_handover.h | 22 +++++-- > > kernel/liveupdate/kexec_handover.c | 101 +++++++++++++++++++++++++++++ > > 2 files changed, 116 insertions(+), 7 deletions(-) > > -- > Sincerely yours, > Mike. From pasha.tatashin at soleen.com Fri Nov 14 08:42:07 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 11:42:07 -0500 Subject: [PATCH v1 09/13] kho: Update FDT dynamically for subtree addition/removal In-Reply-To: References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-10-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14, 2025 at 11:15?AM Mike Rapoport wrote: > > On Fri, Nov 14, 2025 at 10:53:54AM -0500, Pasha Tatashin wrote: > > Currently, sub-FDTs were tracked in a list (kho_out.sub_fdts) and the > > final FDT is constructed entirely from scratch during kho_finalize(). > > > > We can maintain the FDT dynamically: > > 1. Initialize a valid, empty FDT in kho_init(). > > 2. Use fdt_add_subnode and fdt_setprop in kho_add_subtree to > > update the FDT immediately when a subsystem registers. > > 3. Use fdt_del_node in kho_remove_subtree to remove entries. > > > > This removes the need for the intermediate sub_fdts list and the > > reconstruction logic in kho_finalize(). 
kho_finalize() now > > only needs to trigger memory map serialization. > > > > Signed-off-by: Pasha Tatashin > > --- > > kernel/liveupdate/kexec_handover.c | 144 ++++++++++++++--------------- > > 1 file changed, 68 insertions(+), 76 deletions(-) > > > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > > index 8ab77cb85ca9..822da961d4c9 100644 > > --- a/kernel/liveupdate/kexec_handover.c > > +++ b/kernel/liveupdate/kexec_handover.c > > @@ -724,37 +713,67 @@ static void __init kho_reserve_scratch(void) > > */ > > int kho_add_subtree(const char *name, void *fdt) > > { > > - struct kho_sub_fdt *sub_fdt; > > + phys_addr_t phys = virt_to_phys(fdt); > > + void *root_fdt = kho_out.fdt; > > + int err = -ENOMEM; > > + int off, fdt_err; > > > > - sub_fdt = kmalloc(sizeof(*sub_fdt), GFP_KERNEL); > > - if (!sub_fdt) > > - return -ENOMEM; > > + guard(mutex)(&kho_out.lock); > > + > > + fdt_err = fdt_open_into(root_fdt, root_fdt, PAGE_SIZE); > > + if (fdt_err < 0) > > + return err; > > - INIT_LIST_HEAD(&sub_fdt->l); > > - sub_fdt->name = name; > > - sub_fdt->fdt = fdt; > > + off = fdt_add_subnode(root_fdt, 0, name); > > fdt_err = fdt_add_subnode(); > > and then we don't need off > > > + if (off < 0) { > > + if (off == -FDT_ERR_EXISTS) > > + err = -EEXIST; > > Is it really -ENOMEM for other FDT_ERR values? In practice, yes. There are some other errors like format mismatch, magic values etc, but all of them are internal FDT problems. The only error that really matters to users is the -ENOMEM one. 
Pasha > > > + goto out_pack; > > + } > > + > > + err = fdt_setprop(root_fdt, off, PROP_SUB_FDT, &phys, sizeof(phys)); > > + if (err < 0) > > + goto out_pack; > > > > - guard(mutex)(&kho_out.fdts_lock); > > - list_add_tail(&sub_fdt->l, &kho_out.sub_fdts); > > WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, name, fdt, false)); > > > > - return 0; > > +out_pack: > > + fdt_pack(root_fdt); > > + > > + return err; > > } > > EXPORT_SYMBOL_GPL(kho_add_subtree); > > -- > Sincerely yours, > Mike. From pasha.tatashin at soleen.com Fri Nov 14 08:46:12 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 11:46:12 -0500 Subject: [PATCH v1 00/13] kho: simplify state machine and enable dynamic updates In-Reply-To: References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14, 2025 at 11:17?AM Mike Rapoport wrote: > > On Fri, Nov 14, 2025 at 10:53:45AM -0500, Pasha Tatashin wrote: > > > > This patch series refactors the Kexec Handover subsystem to transition > > from a rigid, state-locked model to a dynamic, re-entrant architecture. > > It also introduces usability improvements. 
> > > > Pasha Tatashin (13): > > kho: Fix misleading log message in kho_populate() > > kho: Convert __kho_abort() to return void > > kho: Preserve FDT folio only once during initialization > > kho: Verify deserialization status and fix FDT alignment access > > kho: Always expose output FDT in debugfs > > kho: Simplify serialization and remove __kho_abort > > kho: Remove global preserved_mem_map and store state in FDT > > kho: Remove abort functionality and support state refresh > > kho: Update FDT dynamically for subtree addition/removal > > kho: Allow kexec load before KHO finalization > > kho: Allow memory preservation state updates after finalization > > kho: Add Kconfig option to enable KHO by default > > kho: Introduce high-level memory allocation API > > For the series: > > Reviewed-by: Mike Rapoport (Microsoft) > > with small nits in patches 9 and 13 in replies to them. Thank you Mike! I will update the series, and post v2 shortly. Pasha > > > > > include/linux/kexec_handover.h | 22 +- > > kernel/liveupdate/Kconfig | 14 + > > kernel/liveupdate/kexec_handover.c | 338 ++++++++++++-------- > > kernel/liveupdate/kexec_handover_debugfs.c | 2 +- > > kernel/liveupdate/kexec_handover_internal.h | 1 - > > 5 files changed, 232 insertions(+), 145 deletions(-) > > > > -- > > 2.52.0.rc1.455.g30608eb744-goog > > > > -- > Sincerely yours, > Mike. From pratyush at kernel.org Fri Nov 14 08:52:37 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 17:52:37 +0100 Subject: [PATCH v1 04/13] kho: Verify deserialization status and fix FDT alignment access In-Reply-To: <20251114155358.2884014-5-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 10:53:49 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-5-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > During boot, kho_restore_folio() relies on the memory map having been > successfully deserialized. 
If deserialization fails or no map is present, > attempting to restore the FDT folio is unsafe. > > Update kho_mem_deserialize() to return a boolean indicating success. Use > this return value in kho_memory_init() to disable KHO if deserialization > fails. Also, the incoming FDT folio is never used, there is no reason to > restore it. > > Additionally, use memcpy() to retrieve the memory map pointer from the FDT. > FDT properties are not guaranteed to be naturally aligned, and accessing > a 64-bit value via a pointer that is only 32-bit aligned can cause faults. > > Signed-off-by: Pasha Tatashin > --- > kernel/liveupdate/kexec_handover.c | 32 ++++++++++++++++++------------ > 1 file changed, 19 insertions(+), 13 deletions(-) > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > index a4b33ca79246..83aca3b4af15 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c > @@ -450,20 +450,28 @@ static void __init deserialize_bitmap(unsigned int order, > } > } > > -static void __init kho_mem_deserialize(const void *fdt) > +/* Return true if memory was deserizlied */ > +static bool __init kho_mem_deserialize(const void *fdt) > { > struct khoser_mem_chunk *chunk; > - const phys_addr_t *mem; > + const void *mem_ptr; > + u64 mem; > int len; > > - mem = fdt_getprop(fdt, 0, PROP_PRESERVED_MEMORY_MAP, &len); > - > - if (!mem || len != sizeof(*mem)) { > + mem_ptr = fdt_getprop(fdt, 0, PROP_PRESERVED_MEMORY_MAP, &len); > + if (!mem_ptr || len != sizeof(u64)) { > pr_err("failed to get preserved memory bitmaps\n"); > - return; > + return false; > } > + /* FDT guarantees 32-bit alignment, have to use memcpy */ > + memcpy(&mem, mem_ptr, len); Perhaps get_unaligned(mem) would have been simpler? > + > + chunk = mem ? 
phys_to_virt(mem) : NULL; > + > + /* No preserved physical pages were passed, no deserialization */ > + if (!chunk) > + return false; Should we disallow all kho_restore_{folio,pages}() calls too if this fails? Ideally those should never happen since kho_retrieve_subtree() will fail, so maybe as a debug aid? > > - chunk = *mem ? phys_to_virt(*mem) : NULL; > while (chunk) { > unsigned int i; > > @@ -472,6 +480,8 @@ static void __init kho_mem_deserialize(const void *fdt) > &chunk->bitmaps[i]); > chunk = KHOSER_LOAD_PTR(chunk->hdr.next); > } > + > + return true; > } > > /* > @@ -1377,16 +1387,12 @@ static void __init kho_release_scratch(void) > > void __init kho_memory_init(void) > { > - struct folio *folio; > - > if (kho_in.scratch_phys) { > kho_scratch = phys_to_virt(kho_in.scratch_phys); > kho_release_scratch(); > > - kho_mem_deserialize(kho_get_fdt()); > - folio = kho_restore_folio(kho_in.fdt_phys); > - if (!folio) > - pr_warn("failed to restore folio for KHO fdt\n"); > + if (!kho_mem_deserialize(kho_get_fdt())) > + kho_in.fdt_phys = 0; The folio restore does serve a purpose: it accounts for that folio in the system's total memory. See the call to adjust_managed_page_count() in kho_restore_page(). In practice, I don't think it makes much of a difference, but I don't see why not. > } else { > kho_reserve_scratch(); > } -- Regards, Pratyush Yadav From pratyush at kernel.org Fri Nov 14 08:59:32 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 17:59:32 +0100 Subject: [PATCH v1 05/13] kho: Always expose output FDT in debugfs In-Reply-To: <20251114155358.2884014-6-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 10:53:50 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-6-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > Currently, the output FDT is added to debugfs only when KHO is > finalized and removed when aborted. 
> > There is no need to hide the FDT based on the state. Always expose it > starting from initialization. This aids the transition toward removing > the explicit abort functionality and converting KHO to be fully > stateless. > > Also, pre-zero the FDT tree so we do not expose random bits to the > user and to the next kernel. > > Signed-off-by: Pasha Tatashin > --- > kernel/liveupdate/kexec_handover.c | 10 ++++------ > 1 file changed, 4 insertions(+), 6 deletions(-) > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > index 83aca3b4af15..cd8641725343 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c > @@ -1147,8 +1147,6 @@ int kho_abort(void) > __kho_abort(); > kho_out.finalized = false; > > - kho_debugfs_fdt_remove(&kho_out.dbg, kho_out.fdt); > - > return 0; > } > > @@ -1219,9 +1217,6 @@ int kho_finalize(void) > > kho_out.finalized = true; > > - WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, "fdt", > - kho_out.fdt, true)); > - > return 0; > } > > @@ -1310,7 +1305,7 @@ static __init int kho_init(void) > if (!kho_enable) > return 0; > > - fdt_page = alloc_page(GFP_KERNEL); > + fdt_page = alloc_page(GFP_KERNEL | __GFP_ZERO); If I read the series right, patch 9 will make this a full FDT with no subnodes. That makes a lot more sense than a zero page. Thinking out loud. 
For this patch, Reviewed-by: Pratyush Yadav > if (!fdt_page) { > err = -ENOMEM; > goto err_free_scratch; > @@ -1344,6 +1339,9 @@ static __init int kho_init(void) > init_cma_reserved_pageblock(pfn_to_page(pfn)); > } > > + WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, "fdt", > + kho_out.fdt, true)); > + > return 0; > > err_free_fdt: -- Regards, Pratyush Yadav From pratyush at kernel.org Fri Nov 14 09:04:33 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 18:04:33 +0100 Subject: [PATCH v1 06/13] kho: Simplify serialization and remove __kho_abort In-Reply-To: <20251114155358.2884014-7-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 10:53:51 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-7-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > Currently, __kho_finalize() performs memory serialization in the middle > of FDT construction. If FDT construction fails later, the function must > manually clean up the serialized memory via __kho_abort(). > > Refactor __kho_finalize() to perform kho_mem_serialize() only after the > FDT has been successfully constructed and finished. This reordering has > two benefits: > 1. It avoids expensive serialization work if FDT generation fails. > 2. It removes the need for cleanup in the FDT error path. > > As a result, the internal helper __kho_abort() is no longer needed for > internal error handling. Inline its remaining logic (cleanup of the > preserved memory map) directly into kho_abort() and remove the helper. > > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav [...] 
-- Regards, Pratyush Yadav From pratyush at kernel.org Fri Nov 14 09:11:39 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 18:11:39 +0100 Subject: [PATCH v1 07/13] kho: Remove global preserved_mem_map and store state in FDT In-Reply-To: <20251114155358.2884014-8-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 10:53:52 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-8-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > Currently, the serialized memory map is tracked via > kho_out.preserved_mem_map and copied to the FDT during finalization. > This double tracking is redundant. > > Remove preserved_mem_map from kho_out. Instead, maintain the physical > address of the head chunk directly in the preserved-memory-map FDT > property. > > Introduce kho_update_memory_map() to manage this property. This function > handles: > 1. Retrieving and freeing any existing serialized map (handling the > abort/retry case). > 2. Updating the FDT property with the new chunk address. > > This establishes the FDT as the single source of truth for the handover > state. > > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav [...] -- Regards, Pratyush Yadav From pratyush at kernel.org Fri Nov 14 09:18:32 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 18:18:32 +0100 Subject: [PATCH v1 08/13] kho: Remove abort functionality and support state refresh In-Reply-To: <20251114155358.2884014-9-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 10:53:53 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-9-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > Previously, KHO required a dedicated kho_abort() function to clean up > state before kho_finalize() could be called again. 
This was necessary > to handle complex unwind paths when using notifiers. > > With the shift to direct memory preservation, the explicit abort step > is no longer strictly necessary. > > Remove kho_abort() and refactor kho_finalize() to handle re-entry. > If kho_finalize() is called while KHO is already finalized, it will > now automatically clean up the previous memory map and state before > generating a new one. This allows the KHO state to be updated/refreshed > simply by triggering finalize again. > > Update debugfs to return -EINVAL if userspace attempts to write 0 to > the finalize attribute, as explicit abort is no longer supported. Documentation/core-api/kho/concepts.rst touches on the concept of finalization. I suppose that should be updated as well. Other than this, Reviewed-by: Pratyush Yadav [...] -- Regards, Pratyush Yadav From pasha.tatashin at soleen.com Fri Nov 14 09:21:22 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 12:21:22 -0500 Subject: [PATCH v1 04/13] kho: Verify deserialization status and fix FDT alignment access In-Reply-To: References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-5-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14, 2025 at 11:52 AM Pratyush Yadav wrote: > > On Fri, Nov 14 2025, Pasha Tatashin wrote: > > > During boot, kho_restore_folio() relies on the memory map having been > > successfully deserialized. If deserialization fails or no map is present, > > attempting to restore the FDT folio is unsafe. > > > > Update kho_mem_deserialize() to return a boolean indicating success. Use > > this return value in kho_memory_init() to disable KHO if deserialization > > fails. Also, the incoming FDT folio is never used, so there is no reason to > > restore it. > > > > Additionally, use memcpy() to retrieve the memory map pointer from the FDT.
> > FDT properties are not guaranteed to be naturally aligned, and accessing > > a 64-bit value via a pointer that is only 32-bit aligned can cause faults. > > > > Signed-off-by: Pasha Tatashin > > --- > > kernel/liveupdate/kexec_handover.c | 32 ++++++++++++++++++------------ > > 1 file changed, 19 insertions(+), 13 deletions(-) > > > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > > index a4b33ca79246..83aca3b4af15 100644 > > --- a/kernel/liveupdate/kexec_handover.c > > +++ b/kernel/liveupdate/kexec_handover.c > > @@ -450,20 +450,28 @@ static void __init deserialize_bitmap(unsigned int order, > > } > > } > > > > -static void __init kho_mem_deserialize(const void *fdt) > > +/* Return true if memory was deserialized */ > > +static bool __init kho_mem_deserialize(const void *fdt) > > { > > struct khoser_mem_chunk *chunk; > > - const phys_addr_t *mem; > > + const void *mem_ptr; > > + u64 mem; > > int len; > > > > - mem = fdt_getprop(fdt, 0, PROP_PRESERVED_MEMORY_MAP, &len); > > - > > - if (!mem || len != sizeof(*mem)) { > > + mem_ptr = fdt_getprop(fdt, 0, PROP_PRESERVED_MEMORY_MAP, &len); > > + if (!mem_ptr || len != sizeof(u64)) { > > pr_err("failed to get preserved memory bitmaps\n"); > > - return; > > + return false; > > } > > + /* FDT guarantees 32-bit alignment, have to use memcpy */ > > + memcpy(&mem, mem_ptr, len); > > Perhaps get_unaligned(mem) would have been simpler? Hm, it is certainly more descriptive. I will see if I can use it. > > > + > > + chunk = mem ? phys_to_virt(mem) : NULL; > > + > > + /* No preserved physical pages were passed, no deserialization */ > > + if (!chunk) > > + return false; > > Should we disallow all kho_restore_{folio,pages}() calls too if this > fails? Ideally those should never happen since kho_retrieve_subtree() > will fail, so maybe as a debug aid? Right, my thinking was that they should never happen, as they do not have a way to know the location of folios to be restored.
So the FDT access prevention we already do takes care of that. > > > > > > - chunk = *mem ? phys_to_virt(*mem) : NULL; > > while (chunk) { > > unsigned int i; > > > > @@ -472,6 +480,8 @@ static void __init kho_mem_deserialize(const void *fdt) > > &chunk->bitmaps[i]); > > chunk = KHOSER_LOAD_PTR(chunk->hdr.next); > > } > > + > > + return true; > > } > > > > /* > > @@ -1377,16 +1387,12 @@ static void __init kho_release_scratch(void) > > > > void __init kho_memory_init(void) > > { > > - struct folio *folio; > > - > > if (kho_in.scratch_phys) { > > kho_scratch = phys_to_virt(kho_in.scratch_phys); > > kho_release_scratch(); > > > > - kho_mem_deserialize(kho_get_fdt()); > > - folio = kho_restore_folio(kho_in.fdt_phys); > > - if (!folio) > > - pr_warn("failed to restore folio for KHO fdt\n"); > > + if (!kho_mem_deserialize(kho_get_fdt())) > > + kho_in.fdt_phys = 0; > > The folio restore does serve a purpose: it accounts for that folio in > the system's total memory. See the call to adjust_managed_page_count() > in kho_restore_page(). In practice, I don't think it makes much of a > difference, but I don't see why not. > > > } else { > > kho_reserve_scratch(); > > } > > -- > Regards, > Pratyush Yadav From pasha.tatashin at soleen.com Fri Nov 14 09:23:54 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 12:23:54 -0500 Subject: [PATCH v1 08/13] kho: Remove abort functionality and support state refresh In-Reply-To: References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-9-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14, 2025 at 12:18 PM Pratyush Yadav wrote: > > On Fri, Nov 14 2025, Pasha Tatashin wrote: > > > Previously, KHO required a dedicated kho_abort() function to clean up > > state before kho_finalize() could be called again. This was necessary > > to handle complex unwind paths when using notifiers. > > > > With the shift to direct memory preservation, the explicit abort step > > is no longer strictly necessary.
> > > > Remove kho_abort() and refactor kho_finalize() to handle re-entry. > > If kho_finalize() is called while KHO is already finalized, it will > > now automatically clean up the previous memory map and state before > > generating a new one. This allows the KHO state to be updated/refreshed > > simply by triggering finalize again. > > > > Update debugfs to return -EINVAL if userspace attempts to write 0 to > > the finalize attribute, as explicit abort is no longer supported. > > Documentation/core-api/kho/concepts.rst touches on the concept of > finalization. I suppose that should be updated as well. I looked at it, and it is vague; we are soon to remove finalize with the stateless KHO series from Jason Miu, so in that series that section can be removed or replaced. > > Other than this, > > Reviewed-by: Pratyush Yadav Thank you, Pasha > > [...] > > -- > Regards, > Pratyush Yadav From pratyush at kernel.org Fri Nov 14 09:27:57 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 18:27:57 +0100 Subject: [PATCH v1 09/13] kho: Update FDT dynamically for subtree addition/removal In-Reply-To: <20251114155358.2884014-10-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 10:53:54 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-10-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > Currently, sub-FDTs are tracked in a list (kho_out.sub_fdts) and the > final FDT is constructed entirely from scratch during kho_finalize(). > > We can maintain the FDT dynamically: > 1. Initialize a valid, empty FDT in kho_init(). > 2. Use fdt_add_subnode and fdt_setprop in kho_add_subtree to > update the FDT immediately when a subsystem registers. > 3. Use fdt_del_node in kho_remove_subtree to remove entries. > > This removes the need for the intermediate sub_fdts list and the > reconstruction logic in kho_finalize().
kho_finalize() now > only needs to trigger memory map serialization. > > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav [...] -- Regards, Pratyush Yadav From pratyush at kernel.org Fri Nov 14 09:30:34 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 18:30:34 +0100 Subject: [PATCH v1 10/13] kho: Allow kexec load before KHO finalization In-Reply-To: <20251114155358.2884014-11-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 10:53:55 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-11-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > Currently, kho_fill_kimage() checks kho_out.finalized and returns > early if KHO is not yet finalized. This enforces a strict ordering where > userspace must finalize KHO *before* loading the kexec image. > > This is restrictive, as standard workflows often involve loading the > target kernel early in the lifecycle and finalizing the state (FDT) > only immediately before the reboot. > > Since the KHO FDT resides at a physical address allocated during boot > (kho_init), its location is stable. We can attach this stable address > to the kimage regardless of whether the content has been finalized yet. > > Relax the check to only require kho_enable, allowing kexec_file_load > to proceed at any time. > > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav [...] 
-- Regards, Pratyush Yadav From pratyush at kernel.org Fri Nov 14 09:33:35 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 18:33:35 +0100 Subject: [PATCH v1 11/13] kho: Allow memory preservation state updates after finalization In-Reply-To: <20251114155358.2884014-12-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 10:53:56 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-12-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > Currently, kho_preserve_* and kho_unpreserve_* return -EBUSY if > KHO is finalized. This enforces a rigid "freeze" on the KHO memory > state. > > With the introduction of re-entrant finalization, this restriction is > no longer necessary. Users should be allowed to modify the preservation > set (e.g., adding new pages or freeing old ones) even after an initial > finalization. > > The intended workflow for updates is now: > 1. Modify state (preserve/unpreserve). > 2. Call kho_finalize() again to refresh the serialized metadata. > > Remove the kho_out.finalized checks to enable this dynamic behavior. > > Signed-off-by: Pasha Tatashin > --- > kernel/liveupdate/kexec_handover.c | 13 ------------- > 1 file changed, 13 deletions(-) > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > index 27ef20565a5f..87e9b488237d 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c > @@ -183,10 +183,6 @@ static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn, > const unsigned long pfn_high = pfn >> order; > > might_sleep(); > - > - if (kho_out.finalized) > - return -EBUSY; > - > physxa = xa_load(&track->orders, order); > if (!physxa) { > int err; > @@ -815,9 +811,6 @@ int kho_unpreserve_folio(struct folio *folio) This can be void now. 
This would make consumers a bit simpler, since right now, the memfd preservation logic does a WARN_ON() if this function fails. That can be dropped now that the function can never fail. Same for kho_unpreserve_pages() and kho_unpreserve_vmalloc(). > const unsigned int order = folio_order(folio); > struct kho_mem_track *track = &kho_out.track; > > - if (kho_out.finalized) > - return -EBUSY; > - > __kho_unpreserve_order(track, pfn, order); > return 0; > } > @@ -885,9 +878,6 @@ int kho_unpreserve_pages(struct page *page, unsigned int nr_pages) > const unsigned long start_pfn = page_to_pfn(page); > const unsigned long end_pfn = start_pfn + nr_pages; > > - if (kho_out.finalized) > - return -EBUSY; > - > __kho_unpreserve(track, start_pfn, end_pfn); > > return 0; > @@ -1066,9 +1056,6 @@ EXPORT_SYMBOL_GPL(kho_preserve_vmalloc); > */ > int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) > { > - if (kho_out.finalized) > - return -EBUSY; > - > kho_vmalloc_free_chunks(preservation); > > return 0; -- Regards, Pratyush Yadav From pratyush at kernel.org Fri Nov 14 09:34:09 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 18:34:09 +0100 Subject: [PATCH v1 12/13] kho: Add Kconfig option to enable KHO by default In-Reply-To: <20251114155358.2884014-13-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 10:53:57 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-13-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > Currently, Kexec Handover must be explicitly enabled via the kernel > command line parameter `kho=on`. > > For workloads that rely on KHO as a foundational requirement (such as > the upcoming Live Update Orchestrator), requiring an explicit boot > parameter adds redundant configuration steps. > > Introduce CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT. When selected, KHO > defaults to enabled. This is equivalent to passing kho=on at boot. 
> The behavior can still be disabled at runtime by passing kho=off. > > Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav [...] -- Regards, Pratyush Yadav From pratyush at kernel.org Fri Nov 14 09:45:54 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 18:45:54 +0100 Subject: [PATCH v1 13/13] kho: Introduce high-level memory allocation API In-Reply-To: <20251114155358.2884014-14-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 10:53:58 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-14-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > Currently, clients of KHO must manually allocate memory (e.g., via > alloc_pages), calculate the page order, and explicitly call > kho_preserve_folio(). Similarly, cleanup requires separate calls to > unpreserve and free the memory. > > Introduce a high-level API to streamline this common pattern: > > - kho_alloc_preserve(size): Allocates physically contiguous, zeroed > memory and immediately marks it for preservation. > - kho_free_unpreserve(ptr, size): Unpreserves and frees the memory > in the current kernel. > - kho_free_restore(ptr, size): Restores the struct page state of > preserved memory in the new kernel and immediately frees it to the > page allocator. Nit: kho_unpreserve_free() and kho_restore_free() make more sense to me since that is the order of operations. Having them the other way round is kind of confusing. Also, why do the free functions need size? They can get the order from folio_order(). This would save users of the API from having to store the size somewhere and make things simpler. 
> > Signed-off-by: Pasha Tatashin > --- > include/linux/kexec_handover.h | 22 +++++-- > kernel/liveupdate/kexec_handover.c | 101 +++++++++++++++++++++++++++++ > 2 files changed, 116 insertions(+), 7 deletions(-) > > diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h > index 80ece4232617..76c496e01877 100644 > --- a/include/linux/kexec_handover.h > +++ b/include/linux/kexec_handover.h > @@ -2,8 +2,9 @@ > #ifndef LINUX_KEXEC_HANDOVER_H > #define LINUX_KEXEC_HANDOVER_H > > -#include > +#include > #include > +#include > > struct kho_scratch { > phys_addr_t addr; > @@ -48,6 +49,9 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages); > int kho_unpreserve_pages(struct page *page, unsigned int nr_pages); > int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation); > int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation); > +void *kho_alloc_preserve(size_t size); > +void kho_free_unpreserve(void *mem, size_t size); > +void kho_free_restore(void *mem, size_t size); > struct folio *kho_restore_folio(phys_addr_t phys); > struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages); > void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); > @@ -101,6 +105,14 @@ static inline int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) > return -EOPNOTSUPP; > } > > +void *kho_alloc_preserve(size_t size) > +{ > + return ERR_PTR(-EOPNOTSUPP); > +} > + > +void kho_free_unpreserve(void *mem, size_t size) { } > +void kho_free_restore(void *mem, size_t size) { } > + > static inline struct folio *kho_restore_folio(phys_addr_t phys) > { > return NULL; > @@ -122,18 +134,14 @@ static inline int kho_add_subtree(const char *name, void *fdt) > return -EOPNOTSUPP; > } > > -static inline void kho_remove_subtree(void *fdt) > -{ > -} > +static inline void kho_remove_subtree(void *fdt) { } > > static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys) > { > return -EOPNOTSUPP; > } > > -static inline 
void kho_memory_init(void) > -{ > -} > +static inline void kho_memory_init(void) { } > > static inline void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, > phys_addr_t scratch_phys, u64 scratch_len) > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > index a905bccf5f65..9f05849fd68e 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c > @@ -4,6 +4,7 @@ > * Copyright (C) 2023 Alexander Graf > * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport > * Copyright (C) 2025 Google LLC, Changyuan Lyu > + * Copyright (C) 2025 Pasha Tatashin > */ > > #define pr_fmt(fmt) "KHO: " fmt > @@ -1151,6 +1152,106 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) > } > EXPORT_SYMBOL_GPL(kho_restore_vmalloc); > > +/** > + * kho_alloc_preserve - Allocate, zero, and preserve memory. > + * @size: The number of bytes to allocate. > + * > + * Allocates a physically contiguous block of zeroed pages that is large > + * enough to hold @size bytes. The allocated memory is then registered with > + * KHO for preservation across a kexec. > + * > + * Note: The actual allocated size will be rounded up to the nearest > + * power-of-two page boundary. > + * > + * @return A virtual pointer to the allocated and preserved memory on success, > + * or an ERR_PTR() encoded error on failure. > + */ > +void *kho_alloc_preserve(size_t size) > +{ > + struct folio *folio; > + int order, ret; > + > + if (!size) > + return ERR_PTR(-EINVAL); > + > + order = get_order(size); > + if (order > MAX_PAGE_ORDER) > + return ERR_PTR(-E2BIG); > + > + folio = folio_alloc(GFP_KERNEL | __GFP_ZERO, order); > + if (!folio) > + return ERR_PTR(-ENOMEM); > + > + ret = kho_preserve_folio(folio); > + if (ret) { > + folio_put(folio); > + return ERR_PTR(ret); > + } > + > + return folio_address(folio); > +} > +EXPORT_SYMBOL_GPL(kho_alloc_preserve); > + > +/** > + * kho_free_unpreserve - Unpreserve and free memory. 
> + * @mem: Pointer to the memory allocated by kho_alloc_preserve(). > + * @size: The original size requested during allocation. This is used to > + * recalculate the correct order for freeing the pages. > + * > + * Unregisters the memory from KHO preservation and frees the underlying > + * pages back to the system. This function should be called to clean up > + * memory allocated with kho_alloc_preserve(). > + */ > +void kho_free_unpreserve(void *mem, size_t size) > +{ > + struct folio *folio; > + unsigned int order; > + > + if (!mem || !size) > + return; > + > + order = get_order(size); > + if (WARN_ON_ONCE(order > MAX_PAGE_ORDER)) > + return; > + > + folio = virt_to_folio(mem); > + WARN_ON_ONCE(kho_unpreserve_folio(folio)); This is what I meant in my reply to the previous patch. kho_unpreserve_folio() can be void now, so the WARN_ON_ONCE() is not needed. > + folio_put(folio); > +} > +EXPORT_SYMBOL_GPL(kho_free_unpreserve); > + > +/** > + * kho_free_restore - Restore and free memory after kexec. > + * @mem: Pointer to the memory (in the new kernel's address space) > + * that was allocated by the old kernel. > + * @size: The original size requested during allocation. This is used to > + * recalculate the correct order for freeing the pages. > + * > + * This function is intended to be called in the new kernel (post-kexec) > + * to take ownership of and free a memory region that was preserved by the > + * old kernel using kho_alloc_preserve(). > + * > + * It first restores the pages from KHO (using their physical address) > + * and then frees the pages back to the new kernel's page allocator. > + */ > +void kho_free_restore(void *mem, size_t size) On restore side, callers are already using the phys addr directly. So do kho_restore_folio() and kho_restore_pages() for example. This should follow suit for uniformity. Would also save the callers a __va() call and this function the __pa() call. 
> +{ > + struct folio *folio; > + unsigned int order; > + > + if (!mem || !size) > + return; > + > + order = get_order(size); > + if (WARN_ON_ONCE(order > MAX_PAGE_ORDER)) > + return; > + > + folio = kho_restore_folio(__pa(mem)); > + if (!WARN_ON(!folio)) kho_restore_folio() already WARNs on failure. So the WARN_ON() here can be skipped I think. > + free_pages((unsigned long)mem, order); folio_put() here makes more sense since we just restored a folio. > +} > +EXPORT_SYMBOL_GPL(kho_free_restore); > + > int kho_finalize(void) > { > int ret; -- Regards, Pratyush Yadav From pratyush at kernel.org Fri Nov 14 09:47:45 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 18:47:45 +0100 Subject: [PATCH v1 08/13] kho: Remove abort functionality and support state refresh In-Reply-To: (Pasha Tatashin's message of "Fri, 14 Nov 2025 12:23:54 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-9-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > On Fri, Nov 14, 2025 at 12:18?PM Pratyush Yadav wrote: >> >> On Fri, Nov 14 2025, Pasha Tatashin wrote: >> >> > Previously, KHO required a dedicated kho_abort() function to clean up >> > state before kho_finalize() could be called again. This was necessary >> > to handle complex unwind paths when using notifiers. >> > >> > With the shift to direct memory preservation, the explicit abort step >> > is no longer strictly necessary. >> > >> > Remove kho_abort() and refactor kho_finalize() to handle re-entry. >> > If kho_finalize() is called while KHO is already finalized, it will >> > now automatically clean up the previous memory map and state before >> > generating a new one. This allows the KHO state to be updated/refreshed >> > simply by triggering finalize again. >> > >> > Update debugfs to return -EINVAL if userspace attempts to write 0 to >> > the finalize attribute, as explicit abort is no longer supported. 
>> >> Documentation/core-api/kho/concepts.rst touches on the concept of >> finalization. I suppose that should be updated as well. > > I looked at it, and it is vague; we are soon to remove finalize with > the stateless KHO series from Jason Miu, so in that series that section can be > removed or replaced. Okay, fair enough. [...] -- Regards, Pratyush Yadav From pasha.tatashin at soleen.com Fri Nov 14 09:47:24 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 12:47:24 -0500 Subject: [PATCH v1 11/13] kho: Allow memory preservation state updates after finalization In-Reply-To: References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-12-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14, 2025 at 12:33 PM Pratyush Yadav wrote: > > On Fri, Nov 14 2025, Pasha Tatashin wrote: > > > Currently, kho_preserve_* and kho_unpreserve_* return -EBUSY if > > KHO is finalized. This enforces a rigid "freeze" on the KHO memory > > state. > > > > With the introduction of re-entrant finalization, this restriction is > > no longer necessary. Users should be allowed to modify the preservation > > set (e.g., adding new pages or freeing old ones) even after an initial > > finalization. > > > > The intended workflow for updates is now: > > 1. Modify state (preserve/unpreserve). > > 2. Call kho_finalize() again to refresh the serialized metadata. > > > > Remove the kho_out.finalized checks to enable this dynamic behavior.
> > > > Signed-off-by: Pasha Tatashin > > --- > > kernel/liveupdate/kexec_handover.c | 13 ------------- > > 1 file changed, 13 deletions(-) > > > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > > index 27ef20565a5f..87e9b488237d 100644 > > --- a/kernel/liveupdate/kexec_handover.c > > +++ b/kernel/liveupdate/kexec_handover.c > > @@ -183,10 +183,6 @@ static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn, > > const unsigned long pfn_high = pfn >> order; > > > > might_sleep(); > > - > > - if (kho_out.finalized) > > - return -EBUSY; > > - > > physxa = xa_load(&track->orders, order); > > if (!physxa) { > > int err; > > @@ -815,9 +811,6 @@ int kho_unpreserve_folio(struct folio *folio) > > This can be void now. This would make consumers a bit simpler, since > right now, the memfd preservation logic does a WARN_ON() if this > function fails. That can be dropped now that the function can never > fail. > > Same for kho_unpreserve_pages() and kho_unpreserve_vmalloc(). Oh, this is a very good suggestion, really disliked those kho_unpreserve_* errors. 
> > > const unsigned int order = folio_order(folio); > > struct kho_mem_track *track = &kho_out.track; > > > > - if (kho_out.finalized) > > - return -EBUSY; > > - > > __kho_unpreserve_order(track, pfn, order); > > return 0; > > } > > @@ -885,9 +878,6 @@ int kho_unpreserve_pages(struct page *page, unsigned int nr_pages) > > const unsigned long start_pfn = page_to_pfn(page); > > const unsigned long end_pfn = start_pfn + nr_pages; > > > > - if (kho_out.finalized) > > - return -EBUSY; > > - > > __kho_unpreserve(track, start_pfn, end_pfn); > > > > return 0; > > @@ -1066,9 +1056,6 @@ EXPORT_SYMBOL_GPL(kho_preserve_vmalloc); > > */ > > int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) > > { > > - if (kho_out.finalized) > > - return -EBUSY; > > - > > kho_vmalloc_free_chunks(preservation); > > > > return 0; > > -- > Regards, > Pratyush Yadav From pasha.tatashin at soleen.com Fri Nov 14 09:54:59 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 12:54:59 -0500 Subject: [PATCH v1 13/13] kho: Introduce high-level memory allocation API In-Reply-To: References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-14-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14, 2025 at 12:45?PM Pratyush Yadav wrote: > > On Fri, Nov 14 2025, Pasha Tatashin wrote: > > > Currently, clients of KHO must manually allocate memory (e.g., via > > alloc_pages), calculate the page order, and explicitly call > > kho_preserve_folio(). Similarly, cleanup requires separate calls to > > unpreserve and free the memory. > > > > Introduce a high-level API to streamline this common pattern: > > > > - kho_alloc_preserve(size): Allocates physically contiguous, zeroed > > memory and immediately marks it for preservation. > > - kho_free_unpreserve(ptr, size): Unpreserves and frees the memory > > in the current kernel. 
> > - kho_free_restore(ptr, size): Restores the struct page state of > > preserved memory in the new kernel and immediately frees it to the > > page allocator. > > Nit: kho_unpreserve_free() and kho_restore_free() make more sense to me > since that is the order of operations. Having them the other way round > is kind of confusing. Sure will rename. > > Also, why do the free functions need size? They can get the order from > folio_order(). This would save users of the API from having to store the > size somewhere and make things simpler. Yes, size is not needed, I will remove it. Thanks, Pasha > > > > > Signed-off-by: Pasha Tatashin > > --- > > include/linux/kexec_handover.h | 22 +++++-- > > kernel/liveupdate/kexec_handover.c | 101 +++++++++++++++++++++++++++++ > > 2 files changed, 116 insertions(+), 7 deletions(-) > > > > diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h > > index 80ece4232617..76c496e01877 100644 > > --- a/include/linux/kexec_handover.h > > +++ b/include/linux/kexec_handover.h > > @@ -2,8 +2,9 @@ > > #ifndef LINUX_KEXEC_HANDOVER_H > > #define LINUX_KEXEC_HANDOVER_H > > > > -#include > > +#include > > #include > > +#include > > > > struct kho_scratch { > > phys_addr_t addr; > > @@ -48,6 +49,9 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages); > > int kho_unpreserve_pages(struct page *page, unsigned int nr_pages); > > int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation); > > int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation); > > +void *kho_alloc_preserve(size_t size); > > +void kho_free_unpreserve(void *mem, size_t size); > > +void kho_free_restore(void *mem, size_t size); > > struct folio *kho_restore_folio(phys_addr_t phys); > > struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages); > > void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); > > @@ -101,6 +105,14 @@ static inline int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) > > 
return -EOPNOTSUPP; > > } > > > > +void *kho_alloc_preserve(size_t size) > > +{ > > + return ERR_PTR(-EOPNOTSUPP); > > +} > > + > > +void kho_free_unpreserve(void *mem, size_t size) { } > > +void kho_free_restore(void *mem, size_t size) { } > > + > > static inline struct folio *kho_restore_folio(phys_addr_t phys) > > { > > return NULL; > > @@ -122,18 +134,14 @@ static inline int kho_add_subtree(const char *name, void *fdt) > > return -EOPNOTSUPP; > > } > > > > -static inline void kho_remove_subtree(void *fdt) > > -{ > > -} > > +static inline void kho_remove_subtree(void *fdt) { } > > > > static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys) > > { > > return -EOPNOTSUPP; > > } > > > > -static inline void kho_memory_init(void) > > -{ > > -} > > +static inline void kho_memory_init(void) { } > > > > static inline void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, > > phys_addr_t scratch_phys, u64 scratch_len) > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > > index a905bccf5f65..9f05849fd68e 100644 > > --- a/kernel/liveupdate/kexec_handover.c > > +++ b/kernel/liveupdate/kexec_handover.c > > @@ -4,6 +4,7 @@ > > * Copyright (C) 2023 Alexander Graf > > * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport > > * Copyright (C) 2025 Google LLC, Changyuan Lyu > > + * Copyright (C) 2025 Pasha Tatashin > > */ > > > > #define pr_fmt(fmt) "KHO: " fmt > > @@ -1151,6 +1152,106 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) > > } > > EXPORT_SYMBOL_GPL(kho_restore_vmalloc); > > > > +/** > > + * kho_alloc_preserve - Allocate, zero, and preserve memory. > > + * @size: The number of bytes to allocate. > > + * > > + * Allocates a physically contiguous block of zeroed pages that is large > > + * enough to hold @size bytes. The allocated memory is then registered with > > + * KHO for preservation across a kexec. 
> > + * > > + * Note: The actual allocated size will be rounded up to the nearest > > + * power-of-two page boundary. > > + * > > + * @return A virtual pointer to the allocated and preserved memory on success, > > + * or an ERR_PTR() encoded error on failure. > > + */ > > +void *kho_alloc_preserve(size_t size) > > +{ > > + struct folio *folio; > > + int order, ret; > > + > > + if (!size) > > + return ERR_PTR(-EINVAL); > > + > > + order = get_order(size); > > + if (order > MAX_PAGE_ORDER) > > + return ERR_PTR(-E2BIG); > > + > > + folio = folio_alloc(GFP_KERNEL | __GFP_ZERO, order); > > + if (!folio) > > + return ERR_PTR(-ENOMEM); > > + > > + ret = kho_preserve_folio(folio); > > + if (ret) { > > + folio_put(folio); > > + return ERR_PTR(ret); > > + } > > + > > + return folio_address(folio); > > +} > > +EXPORT_SYMBOL_GPL(kho_alloc_preserve); > > + > > +/** > > + * kho_free_unpreserve - Unpreserve and free memory. > > + * @mem: Pointer to the memory allocated by kho_alloc_preserve(). > > + * @size: The original size requested during allocation. This is used to > > + * recalculate the correct order for freeing the pages. > > + * > > + * Unregisters the memory from KHO preservation and frees the underlying > > + * pages back to the system. This function should be called to clean up > > + * memory allocated with kho_alloc_preserve(). > > + */ > > +void kho_free_unpreserve(void *mem, size_t size) > > +{ > > + struct folio *folio; > > + unsigned int order; > > + > > + if (!mem || !size) > > + return; > > + > > + order = get_order(size); > > + if (WARN_ON_ONCE(order > MAX_PAGE_ORDER)) > > + return; > > + > > + folio = virt_to_folio(mem); > > + WARN_ON_ONCE(kho_unpreserve_folio(folio)); > > This is what I meant in my reply to the previous patch. > kho_unpreserve_folio() can be void now, so the WARN_ON_ONCE() is not > needed. 
> > > + folio_put(folio); > > +} > > +EXPORT_SYMBOL_GPL(kho_free_unpreserve); > > + > > +/** > > + * kho_free_restore - Restore and free memory after kexec. > > + * @mem: Pointer to the memory (in the new kernel's address space) > > + * that was allocated by the old kernel. > > + * @size: The original size requested during allocation. This is used to > > + * recalculate the correct order for freeing the pages. > > + * > > + * This function is intended to be called in the new kernel (post-kexec) > > + * to take ownership of and free a memory region that was preserved by the > > + * old kernel using kho_alloc_preserve(). > > + * > > + * It first restores the pages from KHO (using their physical address) > > + * and then frees the pages back to the new kernel's page allocator. > > + */ > > +void kho_free_restore(void *mem, size_t size) > > On restore side, callers are already using the phys addr directly. So do > kho_restore_folio() and kho_restore_pages() for example. This should > follow suit for uniformity. Would also save the callers a __va() call > and this function the __pa() call. > > > +{ > > + struct folio *folio; > > + unsigned int order; > > + > > + if (!mem || !size) > > + return; > > + > > + order = get_order(size); > > + if (WARN_ON_ONCE(order > MAX_PAGE_ORDER)) > > + return; > > + > > + folio = kho_restore_folio(__pa(mem)); > > + if (!WARN_ON(!folio)) > > kho_restore_folio() already WARNs on failure. So the WARN_ON() here can > be skipped I think. > > > + free_pages((unsigned long)mem, order); > > folio_put() here makes more sense since we just restored a folio. 
> > > +} > > +EXPORT_SYMBOL_GPL(kho_free_restore); > > + > > int kho_finalize(void) > > { > > int ret; > > -- > Regards, > Pratyush Yadav From pasha.tatashin at soleen.com Fri Nov 14 10:59:49 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 13:59:49 -0500 Subject: [PATCH v2 00/13] kho: simplify state machine and enable dynamic updates Message-ID: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Andrew: This series applies against mm-nonmm-unstable, but should go right before LUOv5, i.e. on top of: "liveupdate: kho: use %pe format specifier for error pointer printing" Changelog v2: - Addressed comments from Mike and Pratyush - Added Review-bys. It also replaces the following patches, that once applied should be dropped from mm-nonmm-unstable: "liveupdate: kho: when live update add KHO image during kexec load" "liveupdate: Kconfig: make debugfs optional" "kho: enable KHO by default" This patch series refactors the Kexec Handover subsystem to transition from a rigid, state-locked model to a dynamic, re-entrant architecture. It also introduces usability improvements. Motivation Currently, KHO relies on a strict state machine where memory preservation is locked upon finalization. If a change is required, the user must explicitly "abort" to reset the state. Additionally, the kexec image cannot be loaded until KHO is finalized, and the FDT is rebuilt from scratch on every finalization. This series simplifies this workflow to support "load early, finalize late" scenarios. Key Changes State Machine Simplification: - Removed kho_abort(). kho_finalize() is now re-entrant; calling it a second time automatically flushes the previous serialized state and generates a fresh one. - Removed kho_out.finalized checks from preservation APIs, allowing drivers to add/remove pages even after an initial finalization. - Decoupled kexec_file_load from KHO finalization. 
The KHO FDT physical address is now stable from boot, allowing the kexec image to be loaded before the handover metadata is finalized. FDT Management: - The FDT is now updated in-place dynamically when subtrees are added or removed, removing the need for complex reconstruction logic. - The output FDT is always exposed in debugfs (initialized and zeroed at boot), improving visibility and debugging capabilities throughout the system lifecycle. - Removed the redundant global preserved_mem_map pointer, establishing the FDT property as the single source of truth. New Features & API Enhancements: - High-Level Allocators: Introduced kho_alloc_preserve() and friends to reduce boilerplate for drivers that need to allocate, preserve, and eventually restore simple memory buffers. - Configuration: Added CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT to allow KHO to be active by default without requiring the kho=on command line parameter. Fixes: - Fixed potential alignment faults when accessing 64-bit FDT properties. - Fixed the lifecycle of the FDT folio preservation (now preserved once at init). 
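[Editor's note] The high-level allocator summarized above hides two details that are easy to get wrong when done by hand: reporting failure through the returned pointer and rounding the request up to a power-of-two number of pages. The sketch below is a userspace model of those two helpers for illustration only; err_ptr(), is_err(), and get_order_model() are stand-ins written here, not the kernel implementations:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL
#define MAX_ERRNO 4095UL

/* Userspace stand-ins for the kernel's ERR_PTR()/IS_ERR() helpers that
 * kho_alloc_preserve() uses to report failure via its return value. */
static void *err_ptr(long err)
{
	return (void *)err;
}

static int is_err(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO values of the address space. */
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* kho_alloc_preserve() rounds a request up to a power-of-two number of
 * pages; this models the kernel's get_order() for 4 KiB pages. */
static int get_order_model(size_t size)
{
	size_t pages = (size + PAGE_SIZE - 1) / PAGE_SIZE;
	int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}
```

For example, a 5-page request is served from an order-3 (8-page) allocation, which is the rounding the cover letter's "Note" about power-of-two page boundaries refers to.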
Pasha Tatashin (13): kho: Fix misleading log message in kho_populate() kho: Convert __kho_abort() to return void kho: Introduce high-level memory allocation API kho: Preserve FDT folio only once during initialization kho: Verify deserialization status and fix FDT alignment access kho: Always expose output FDT in debugfs kho: Simplify serialization and remove __kho_abort kho: Remove global preserved_mem_map and store state in FDT kho: Remove abort functionality and support state refresh kho: Update FDT dynamically for subtree addition/removal kho: Allow kexec load before KHO finalization kho: Allow memory preservation state updates after finalization kho: Add Kconfig option to enable KHO by default include/linux/kexec_handover.h | 39 +- kernel/liveupdate/Kconfig | 14 + kernel/liveupdate/kexec_handover.c | 378 +++++++++++--------- kernel/liveupdate/kexec_handover_debugfs.c | 2 +- kernel/liveupdate/kexec_handover_internal.h | 1 - 5 files changed, 239 insertions(+), 195 deletions(-) -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 10:59:51 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 13:59:51 -0500 Subject: [PATCH v2 02/13] kho: Convert __kho_abort() to return void In-Reply-To: <20251114190002.3311679-1-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Message-ID: <20251114190002.3311679-3-pasha.tatashin@soleen.com> The internal helper __kho_abort() always returns 0 and has no failure paths. Its return value is ignored by __kho_finalize and checked needlessly by kho_abort. Change the return type to void to reflect that this function cannot fail, and simplify kho_abort by removing dead error handling code. 
Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav Reviewed-by: Mike Rapoport (Microsoft) --- kernel/liveupdate/kexec_handover.c | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 6ad45e12f53b..bc7f046a1313 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1117,20 +1117,16 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) } EXPORT_SYMBOL_GPL(kho_restore_vmalloc); -static int __kho_abort(void) +static void __kho_abort(void) { if (kho_out.preserved_mem_map) { kho_mem_ser_free(kho_out.preserved_mem_map); kho_out.preserved_mem_map = NULL; } - - return 0; } int kho_abort(void) { - int ret = 0; - if (!kho_enable) return -EOPNOTSUPP; @@ -1138,10 +1134,7 @@ int kho_abort(void) if (!kho_out.finalized) return -ENOENT; - ret = __kho_abort(); - if (ret) - return ret; - + __kho_abort(); kho_out.finalized = false; kho_debugfs_fdt_remove(&kho_out.dbg, kho_out.fdt); -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 10:59:50 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 13:59:50 -0500 Subject: [PATCH v2 01/13] kho: Fix misleading log message in kho_populate() In-Reply-To: <20251114190002.3311679-1-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Message-ID: <20251114190002.3311679-2-pasha.tatashin@soleen.com> The log message in kho_populate() currently states "Will skip init for some devices". This implies that Kexec Handover always involves skipping device initialization. However, KHO is a generic mechanism used to preserve kernel memory across reboot for various purposes, such as memfd, telemetry, or reserve_mem. Skipping device initialization is a specific property of live update drivers using KHO, not a property of the mechanism itself. 
Remove the misleading suffix to accurately reflect the generic nature of KHO discovery. Signed-off-by: Pasha Tatashin Reviewed-by: Pratyush Yadav Reviewed-by: Mike Rapoport (Microsoft) --- kernel/liveupdate/kexec_handover.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 9f0913e101be..6ad45e12f53b 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1470,7 +1470,7 @@ void __init kho_populate(phys_addr_t fdt_phys, u64 fdt_len, kho_in.fdt_phys = fdt_phys; kho_in.scratch_phys = scratch_phys; kho_scratch_cnt = scratch_cnt; - pr_info("found kexec handover data. Will skip init for some devices\n"); + pr_info("found kexec handover data.\n"); out: if (fdt) -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 10:59:53 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 13:59:53 -0500 Subject: [PATCH v2 04/13] kho: Preserve FDT folio only once during initialization In-Reply-To: <20251114190002.3311679-1-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Message-ID: <20251114190002.3311679-5-pasha.tatashin@soleen.com> Currently, the FDT folio is preserved inside __kho_finalize(). If the user performs multiple finalize/abort cycles, kho_preserve_folio() is called repeatedly for the same FDT folio. Since the FDT folio is allocated once during kho_init(), it should be marked for preservation at the same time. Move the preservation call to kho_init() to align the preservation state with the object's lifecycle and simplify the finalize path. Also, pre-zero the FDT tree so we do not expose random bits to the user and to the next kernel by using the new kho_alloc_preserve() api. 
Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav --- kernel/liveupdate/kexec_handover.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 5c5c9c46fe92..704e91418214 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1251,10 +1251,6 @@ static int __kho_finalize(void) if (err) goto abort; - err = kho_preserve_folio(virt_to_folio(kho_out.fdt)); - if (err) - goto abort; - err = kho_mem_serialize(&kho_out); if (err) goto abort; @@ -1384,19 +1380,17 @@ EXPORT_SYMBOL_GPL(kho_retrieve_subtree); static __init int kho_init(void) { - int err = 0; const void *fdt = kho_get_fdt(); - struct page *fdt_page; + int err = 0; if (!kho_enable) return 0; - fdt_page = alloc_page(GFP_KERNEL); - if (!fdt_page) { - err = -ENOMEM; + kho_out.fdt = kho_alloc_preserve(PAGE_SIZE); + if (IS_ERR(kho_out.fdt)) { + err = PTR_ERR(kho_out.fdt); goto err_free_scratch; } - kho_out.fdt = page_to_virt(fdt_page); err = kho_debugfs_init(); if (err) @@ -1424,9 +1418,9 @@ static __init int kho_init(void) return 0; err_free_fdt: - put_page(fdt_page); - kho_out.fdt = NULL; + kho_unpreserve_free(kho_out.fdt); err_free_scratch: + kho_out.fdt = NULL; for (int i = 0; i < kho_scratch_cnt; i++) { void *start = __va(kho_scratch[i].addr); void *end = start + kho_scratch[i].size; -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 10:59:52 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 13:59:52 -0500 Subject: [PATCH v2 03/13] kho: Introduce high-level memory allocation API In-Reply-To: <20251114190002.3311679-1-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Message-ID: <20251114190002.3311679-4-pasha.tatashin@soleen.com> Currently, clients of KHO must manually allocate memory (e.g., via alloc_pages), calculate the 
page order, and explicitly call kho_preserve_folio(). Similarly, cleanup requires separate calls to unpreserve and free the memory. Introduce a high-level API to streamline this common pattern: - kho_alloc_preserve(size): Allocates physically contiguous, zeroed memory and immediately marks it for preservation. - kho_unpreserve_free(ptr): Unpreserves and frees the memory in the current kernel. - kho_restore_free(ptr): Restores the struct page state of preserved memory in the new kernel and immediately frees it to the page allocator. Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) --- include/linux/kexec_handover.h | 22 +++++--- kernel/liveupdate/kexec_handover.c | 87 ++++++++++++++++++++++++++++++ 2 files changed, 102 insertions(+), 7 deletions(-) diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index 80ece4232617..38a9487a1a00 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -2,8 +2,9 @@ #ifndef LINUX_KEXEC_HANDOVER_H #define LINUX_KEXEC_HANDOVER_H -#include +#include #include +#include struct kho_scratch { phys_addr_t addr; @@ -48,6 +49,9 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages); int kho_unpreserve_pages(struct page *page, unsigned int nr_pages); int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation); int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation); +void *kho_alloc_preserve(size_t size); +void kho_unpreserve_free(void *mem); +void kho_restore_free(void *mem); struct folio *kho_restore_folio(phys_addr_t phys); struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages); void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); @@ -101,6 +105,14 @@ static inline int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) return -EOPNOTSUPP; } +void *kho_alloc_preserve(size_t size) +{ + return ERR_PTR(-EOPNOTSUPP); +} + +void kho_unpreserve_free(void *mem) { } +void kho_restore_free(void *mem) { } + static 
inline struct folio *kho_restore_folio(phys_addr_t phys) { return NULL; @@ -122,18 +134,14 @@ static inline int kho_add_subtree(const char *name, void *fdt) return -EOPNOTSUPP; } -static inline void kho_remove_subtree(void *fdt) -{ -} +static inline void kho_remove_subtree(void *fdt) { } static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys) { return -EOPNOTSUPP; } -static inline void kho_memory_init(void) -{ -} +static inline void kho_memory_init(void) { } static inline void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys, u64 scratch_len) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index bc7f046a1313..5c5c9c46fe92 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -4,6 +4,7 @@ * Copyright (C) 2023 Alexander Graf * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport * Copyright (C) 2025 Google LLC, Changyuan Lyu + * Copyright (C) 2025 Pasha Tatashin */ #define pr_fmt(fmt) "KHO: " fmt @@ -1117,6 +1118,92 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) } EXPORT_SYMBOL_GPL(kho_restore_vmalloc); +/** + * kho_alloc_preserve - Allocate, zero, and preserve memory. + * @size: The number of bytes to allocate. + * + * Allocates a physically contiguous block of zeroed pages that is large + * enough to hold @size bytes. The allocated memory is then registered with + * KHO for preservation across a kexec. + * + * Note: The actual allocated size will be rounded up to the nearest + * power-of-two page boundary. + * + * @return A virtual pointer to the allocated and preserved memory on success, + * or an ERR_PTR() encoded error on failure. 
+ */ +void *kho_alloc_preserve(size_t size) +{ + struct folio *folio; + int order, ret; + + if (!size) + return ERR_PTR(-EINVAL); + + order = get_order(size); + if (order > MAX_PAGE_ORDER) + return ERR_PTR(-E2BIG); + + folio = folio_alloc(GFP_KERNEL | __GFP_ZERO, order); + if (!folio) + return ERR_PTR(-ENOMEM); + + ret = kho_preserve_folio(folio); + if (ret) { + folio_put(folio); + return ERR_PTR(ret); + } + + return folio_address(folio); +} +EXPORT_SYMBOL_GPL(kho_alloc_preserve); + +/** + * kho_unpreserve_free - Unpreserve and free memory. + * @mem: Pointer to the memory allocated by kho_alloc_preserve(). + * + * Unregisters the memory from KHO preservation and frees the underlying + * pages back to the system. This function should be called to clean up + * memory allocated with kho_alloc_preserve(). + */ +void kho_unpreserve_free(void *mem) +{ + struct folio *folio; + + if (!mem) + return; + + folio = virt_to_folio(mem); + WARN_ON_ONCE(kho_unpreserve_folio(folio)); + folio_put(folio); +} +EXPORT_SYMBOL_GPL(kho_unpreserve_free); + +/** + * kho_restore_free - Restore and free memory after kexec. + * @mem: Pointer to the memory (in the new kernel's address space) + * that was allocated by the old kernel. + * + * This function is intended to be called in the new kernel (post-kexec) + * to take ownership of and free a memory region that was preserved by the + * old kernel using kho_alloc_preserve(). + * + * It first restores the pages from KHO (using their physical address) + * and then frees the pages back to the new kernel's page allocator. 
+ */ +void kho_restore_free(void *mem) +{ + struct folio *folio; + + if (!mem) + return; + + folio = kho_restore_folio(__pa(mem)); + if (!WARN_ON(!folio)) + folio_put(folio); +} +EXPORT_SYMBOL_GPL(kho_restore_free); + static void __kho_abort(void) { if (kho_out.preserved_mem_map) { -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 10:59:54 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 13:59:54 -0500 Subject: [PATCH v2 05/13] kho: Verify deserialization status and fix FDT alignment access In-Reply-To: <20251114190002.3311679-1-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Message-ID: <20251114190002.3311679-6-pasha.tatashin@soleen.com> During boot, kho_restore_folio() relies on the memory map having been successfully deserialized. If deserialization fails or no map is present, attempting to restore the FDT folio is unsafe. Update kho_mem_deserialize() to return a boolean indicating success. Use this return value in kho_memory_init() to disable KHO if deserialization fails. Also, the incoming FDT folio is never used, there is no reason to restore it. Additionally, use get_unaligned() to retrieve the memory map pointer from the FDT. FDT properties are not guaranteed to be naturally aligned, and accessing a 64-bit value via a pointer that is only 32-bit aligned can cause faults. 
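[Editor's note] The alignment hazard this patch fixes can be demonstrated outside the kernel: an FDT property is only guaranteed 4-byte alignment, so dereferencing it through a plain u64 pointer may fault on strict-alignment CPUs, while a byte-wise copy is always safe. A userspace sketch of the get_unaligned() idea, using memcpy as a portable stand-in for the kernel helper (names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Portable stand-in for the kernel's get_unaligned(): fetch a 64-bit
 * value byte-wise instead of dereferencing a possibly misaligned u64 *. */
static uint64_t get_unaligned_u64(const void *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}

/* FDT properties are only 4-byte aligned, so a 64-bit property can start
 * at offset 4 inside an 8-byte-aligned buffer, as modeled here. */
static uint64_t read_misaligned_prop(void)
{
	_Alignas(8) unsigned char buf[16] = { 0 };
	const uint64_t phys = 0x123456789abcdef0ULL;

	/* Place the value at a 4-byte (not 8-byte) aligned offset. */
	memcpy(buf + 4, &phys, sizeof(phys));
	return get_unaligned_u64(buf + 4);
}
```

On x86 the misaligned dereference happens to work, which is why such bugs survive testing; on architectures that trap on unaligned loads the memcpy-based access is required.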
Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) --- kernel/liveupdate/kexec_handover.c | 32 ++++++++++++++++++------------ 1 file changed, 19 insertions(+), 13 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 704e91418214..bed611bae1df 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -451,20 +452,27 @@ static void __init deserialize_bitmap(unsigned int order, } } -static void __init kho_mem_deserialize(const void *fdt) +/* Return true if memory was deserialized */ +static bool __init kho_mem_deserialize(const void *fdt) { struct khoser_mem_chunk *chunk; - const phys_addr_t *mem; + const void *mem_ptr; + u64 mem; int len; - mem = fdt_getprop(fdt, 0, PROP_PRESERVED_MEMORY_MAP, &len); - - if (!mem || len != sizeof(*mem)) { + mem_ptr = fdt_getprop(fdt, 0, PROP_PRESERVED_MEMORY_MAP, &len); + if (!mem_ptr || len != sizeof(u64)) { pr_err("failed to get preserved memory bitmaps\n"); - return; + return false; } - chunk = *mem ? phys_to_virt(*mem) : NULL; + mem = get_unaligned((const u64 *)mem_ptr); + chunk = mem ? 
phys_to_virt(mem) : NULL; + + /* No preserved physical pages were passed, no deserialization */ + if (!chunk) + return false; + while (chunk) { unsigned int i; @@ -473,6 +481,8 @@ static void __init kho_mem_deserialize(const void *fdt) &chunk->bitmaps[i]); chunk = KHOSER_LOAD_PTR(chunk->hdr.next); } + + return true; } /* @@ -1458,16 +1468,12 @@ static void __init kho_release_scratch(void) void __init kho_memory_init(void) { - struct folio *folio; - if (kho_in.scratch_phys) { kho_scratch = phys_to_virt(kho_in.scratch_phys); kho_release_scratch(); - kho_mem_deserialize(kho_get_fdt()); - folio = kho_restore_folio(kho_in.fdt_phys); - if (!folio) - pr_warn("failed to restore folio for KHO fdt\n"); + if (!kho_mem_deserialize(kho_get_fdt())) + kho_in.fdt_phys = 0; } else { kho_reserve_scratch(); } -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 10:59:56 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 13:59:56 -0500 Subject: [PATCH v2 07/13] kho: Simplify serialization and remove __kho_abort In-Reply-To: <20251114190002.3311679-1-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Message-ID: <20251114190002.3311679-8-pasha.tatashin@soleen.com> Currently, __kho_finalize() performs memory serialization in the middle of FDT construction. If FDT construction fails later, the function must manually clean up the serialized memory via __kho_abort(). Refactor __kho_finalize() to perform kho_mem_serialize() only after the FDT has been successfully constructed and finished. This reordering has two benefits: 1. It avoids expensive serialization work if FDT generation fails. 2. It removes the need for cleanup in the FDT error path. As a result, the internal helper __kho_abort() is no longer needed for internal error handling. Inline its remaining logic (cleanup of the preserved memory map) directly into kho_abort() and remove the helper. 
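[Editor's note] The ordering argument in the commit message above can be captured in a small model: when the fallible FDT build runs first and the expensive serialization runs only after it succeeds, a build failure leaves nothing to undo. This is a userspace sketch with hypothetical names, not the kernel code:

```c
#include <assert.h>

/* Records whether the expensive serialization step ran, so we can check
 * that a failed FDT build never triggers it (and so needs no cleanup). */
static int serialize_ran;

/* Model of the reordered __kho_finalize(): build the FDT first, and
 * serialize the memory map only once the FDT is complete. */
static int finalize_model(int fdt_build_err)
{
	serialize_ran = 0;

	if (fdt_build_err)
		return fdt_build_err;	/* nothing to clean up */

	serialize_ran = 1;		/* kho_mem_serialize() stand-in */
	return 0;
}
```

This is why the patch can delete __kho_abort() from the error path: the only state that previously needed unwinding was the serialized memory map, and under the new ordering it does not exist yet when FDT construction fails.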
Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav --- kernel/liveupdate/kexec_handover.c | 41 +++++++++++++----------------- 1 file changed, 17 insertions(+), 24 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 3e32c61a64b1..297136054f75 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1214,14 +1214,6 @@ void kho_restore_free(void *mem) } EXPORT_SYMBOL_GPL(kho_restore_free); -static void __kho_abort(void) -{ - if (kho_out.preserved_mem_map) { - kho_mem_ser_free(kho_out.preserved_mem_map); - kho_out.preserved_mem_map = NULL; - } -} - int kho_abort(void) { if (!kho_enable) @@ -1231,7 +1223,8 @@ int kho_abort(void) if (!kho_out.finalized) return -ENOENT; - __kho_abort(); + kho_mem_ser_free(kho_out.preserved_mem_map); + kho_out.preserved_mem_map = NULL; kho_out.finalized = false; return 0; @@ -1239,12 +1232,12 @@ int kho_abort(void) static int __kho_finalize(void) { - int err = 0; - u64 *preserved_mem_map; void *root = kho_out.fdt; struct kho_sub_fdt *fdt; + u64 *preserved_mem_map; + int err; - err |= fdt_create(root, PAGE_SIZE); + err = fdt_create(root, PAGE_SIZE); err |= fdt_finish_reservemap(root); err |= fdt_begin_node(root, ""); err |= fdt_property_string(root, "compatible", KHO_FDT_COMPATIBLE); @@ -1257,13 +1250,7 @@ static int __kho_finalize(void) sizeof(*preserved_mem_map), (void **)&preserved_mem_map); if (err) - goto abort; - - err = kho_mem_serialize(&kho_out); - if (err) - goto abort; - - *preserved_mem_map = (u64)virt_to_phys(kho_out.preserved_mem_map); + goto err_exit; mutex_lock(&kho_out.fdts_lock); list_for_each_entry(fdt, &kho_out.sub_fdts, l) { @@ -1277,13 +1264,19 @@ static int __kho_finalize(void) err |= fdt_end_node(root); err |= fdt_finish(root); + if (err) + goto err_exit; -abort: - if (err) { - pr_err("Failed to convert KHO state tree: %d\n", err); - __kho_abort(); - } + err = 
kho_mem_serialize(&kho_out); + if (err) + goto err_exit; + + *preserved_mem_map = (u64)virt_to_phys(kho_out.preserved_mem_map); + + return 0; +err_exit: + pr_err("Failed to convert KHO state tree: %d\n", err); return err; } -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 10:59:55 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 13:59:55 -0500 Subject: [PATCH v2 06/13] kho: Always expose output FDT in debugfs In-Reply-To: <20251114190002.3311679-1-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Message-ID: <20251114190002.3311679-7-pasha.tatashin@soleen.com> Currently, the output FDT is added to debugfs only when KHO is finalized and removed when aborted. There is no need to hide the FDT based on the state. Always expose it starting from initialization. This aids the transition toward removing the explicit abort functionality and converting KHO to be fully stateless. Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav --- kernel/liveupdate/kexec_handover.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index bed611bae1df..3e32c61a64b1 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1234,8 +1234,6 @@ int kho_abort(void) __kho_abort(); kho_out.finalized = false; - kho_debugfs_fdt_remove(&kho_out.dbg, kho_out.fdt); - return 0; } @@ -1306,9 +1304,6 @@ int kho_finalize(void) kho_out.finalized = true; - WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, "fdt", - kho_out.fdt, true)); - return 0; } @@ -1425,6 +1420,9 @@ static __init int kho_init(void) init_cma_reserved_pageblock(pfn_to_page(pfn)); } + WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, "fdt", + kho_out.fdt, true)); + return 0; err_free_fdt: -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 
10:59:59 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 13:59:59 -0500 Subject: [PATCH v2 10/13] kho: Update FDT dynamically for subtree addition/removal In-Reply-To: <20251114190002.3311679-1-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Message-ID: <20251114190002.3311679-11-pasha.tatashin@soleen.com> Currently, sub-FDTs were tracked in a list (kho_out.sub_fdts) and the final FDT is constructed entirely from scratch during kho_finalize(). We can maintain the FDT dynamically: 1. Initialize a valid, empty FDT in kho_init(). 2. Use fdt_add_subnode and fdt_setprop in kho_add_subtree to update the FDT immediately when a subsystem registers. 3. Use fdt_del_node in kho_remove_subtree to remove entries. This removes the need for the intermediate sub_fdts list and the reconstruction logic in kho_finalize(). kho_finalize() now only needs to trigger memory map serialization. Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav --- kernel/liveupdate/kexec_handover.c | 144 ++++++++++++++--------------- 1 file changed, 69 insertions(+), 75 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 624fd648d21f..461d96084c12 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -104,20 +104,11 @@ struct kho_mem_track { struct khoser_mem_chunk; -struct kho_sub_fdt { - struct list_head l; - const char *name; - void *fdt; -}; - struct kho_out { void *fdt; bool finalized; struct mutex lock; /* protects KHO FDT finalization */ - struct list_head sub_fdts; - struct mutex fdts_lock; - struct kho_mem_track track; struct kho_debugfs dbg; }; @@ -127,8 +118,6 @@ static struct kho_out kho_out = { .track = { .orders = XARRAY_INIT(kho_out.track.orders, 0), }, - .sub_fdts = LIST_HEAD_INIT(kho_out.sub_fdts), - .fdts_lock = __MUTEX_INITIALIZER(kho_out.fdts_lock), .finalized = false, }; @@ 
-725,37 +714,67 @@ static void __init kho_reserve_scratch(void) */ int kho_add_subtree(const char *name, void *fdt) { - struct kho_sub_fdt *sub_fdt; + phys_addr_t phys = virt_to_phys(fdt); + void *root_fdt = kho_out.fdt; + int err = -ENOMEM; + int off, fdt_err; - sub_fdt = kmalloc(sizeof(*sub_fdt), GFP_KERNEL); - if (!sub_fdt) - return -ENOMEM; + guard(mutex)(&kho_out.lock); + + fdt_err = fdt_open_into(root_fdt, root_fdt, PAGE_SIZE); + if (fdt_err < 0) + return err; - INIT_LIST_HEAD(&sub_fdt->l); - sub_fdt->name = name; - sub_fdt->fdt = fdt; + off = fdt_add_subnode(root_fdt, 0, name); + if (off < 0) { + if (off == -FDT_ERR_EXISTS) + err = -EEXIST; + goto out_pack; + } + + err = fdt_setprop(root_fdt, off, PROP_SUB_FDT, &phys, sizeof(phys)); + if (err < 0) + goto out_pack; - guard(mutex)(&kho_out.fdts_lock); - list_add_tail(&sub_fdt->l, &kho_out.sub_fdts); WARN_ON_ONCE(kho_debugfs_fdt_add(&kho_out.dbg, name, fdt, false)); - return 0; +out_pack: + fdt_pack(root_fdt); + + return err; } EXPORT_SYMBOL_GPL(kho_add_subtree); void kho_remove_subtree(void *fdt) { - struct kho_sub_fdt *sub_fdt; + phys_addr_t target_phys = virt_to_phys(fdt); + void *root_fdt = kho_out.fdt; + int off; + int err; + + guard(mutex)(&kho_out.lock); + + err = fdt_open_into(root_fdt, root_fdt, PAGE_SIZE); + if (err < 0) + return; + + for (off = fdt_first_subnode(root_fdt, 0); off >= 0; + off = fdt_next_subnode(root_fdt, off)) { + const u64 *val; + int len; + + val = fdt_getprop(root_fdt, off, PROP_SUB_FDT, &len); + if (!val || len != sizeof(phys_addr_t)) + continue; - guard(mutex)(&kho_out.fdts_lock); - list_for_each_entry(sub_fdt, &kho_out.sub_fdts, l) { - if (sub_fdt->fdt == fdt) { - list_del(&sub_fdt->l); - kfree(sub_fdt); + if ((phys_addr_t)*val == target_phys) { + fdt_del_node(root_fdt, off); kho_debugfs_fdt_remove(&kho_out.dbg, fdt); break; } } + + fdt_pack(root_fdt); } EXPORT_SYMBOL_GPL(kho_remove_subtree); @@ -1232,48 +1251,6 @@ void kho_restore_free(void *mem) } 
EXPORT_SYMBOL_GPL(kho_restore_free); -static int __kho_finalize(void) -{ - void *root = kho_out.fdt; - struct kho_sub_fdt *fdt; - u64 empty_mem_map = 0; - int err; - - err = fdt_create(root, PAGE_SIZE); - err |= fdt_finish_reservemap(root); - err |= fdt_begin_node(root, ""); - err |= fdt_property_string(root, "compatible", KHO_FDT_COMPATIBLE); - err |= fdt_property(root, PROP_PRESERVED_MEMORY_MAP, &empty_mem_map, - sizeof(empty_mem_map)); - if (err) - goto err_exit; - - mutex_lock(&kho_out.fdts_lock); - list_for_each_entry(fdt, &kho_out.sub_fdts, l) { - phys_addr_t phys = virt_to_phys(fdt->fdt); - - err |= fdt_begin_node(root, fdt->name); - err |= fdt_property(root, PROP_SUB_FDT, &phys, sizeof(phys)); - err |= fdt_end_node(root); - } - mutex_unlock(&kho_out.fdts_lock); - - err |= fdt_end_node(root); - err |= fdt_finish(root); - if (err) - goto err_exit; - - err = kho_mem_serialize(&kho_out); - if (err) - goto err_exit; - - return 0; - -err_exit: - pr_err("Failed to convert KHO state tree: %d\n", err); - return err; -} - int kho_finalize(void) { int ret; @@ -1282,12 +1259,7 @@ int kho_finalize(void) return -EOPNOTSUPP; guard(mutex)(&kho_out.lock); - if (kho_out.finalized) { - kho_update_memory_map(NULL); - kho_out.finalized = false; - } - - ret = __kho_finalize(); + ret = kho_mem_serialize(&kho_out); if (ret) return ret; @@ -1372,6 +1344,24 @@ int kho_retrieve_subtree(const char *name, phys_addr_t *phys) } EXPORT_SYMBOL_GPL(kho_retrieve_subtree); +static __init int kho_out_fdt_setup(void) +{ + void *root = kho_out.fdt; + u64 empty_mem_map = 0; + int err; + + err = fdt_create(root, PAGE_SIZE); + err |= fdt_finish_reservemap(root); + err |= fdt_begin_node(root, ""); + err |= fdt_property_string(root, "compatible", KHO_FDT_COMPATIBLE); + err |= fdt_property(root, PROP_PRESERVED_MEMORY_MAP, &empty_mem_map, + sizeof(empty_mem_map)); + err |= fdt_end_node(root); + err |= fdt_finish(root); + + return err; +} + static __init int kho_init(void) { const void *fdt = 
kho_get_fdt(); @@ -1394,6 +1384,10 @@ static __init int kho_init(void) if (err) goto err_free_fdt; + err = kho_out_fdt_setup(); + if (err) + goto err_free_fdt; + if (fdt) { kho_in_debugfs_init(&kho_in.dbg, fdt); return 0; -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 10:59:57 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 13:59:57 -0500 Subject: [PATCH v2 08/13] kho: Remove global preserved_mem_map and store state in FDT In-Reply-To: <20251114190002.3311679-1-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Message-ID: <20251114190002.3311679-9-pasha.tatashin@soleen.com> Currently, the serialized memory map is tracked via kho_out.preserved_mem_map and copied to the FDT during finalization. This double tracking is redundant. Remove preserved_mem_map from kho_out. Instead, maintain the physical address of the head chunk directly in the preserved-memory-map FDT property. Introduce kho_update_memory_map() to manage this property. This function handles: 1. Retrieving and freeing any existing serialized map (handling the abort/retry case). 2. Updating the FDT property with the new chunk address. This establishes the FDT as the single source of truth for the handover state. 
Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav --- kernel/liveupdate/kexec_handover.c | 43 ++++++++++++++++++------------ 1 file changed, 26 insertions(+), 17 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 297136054f75..63800f63551f 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -119,9 +119,6 @@ struct kho_out { struct mutex fdts_lock; struct kho_mem_track track; - /* First chunk of serialized preserved memory map */ - struct khoser_mem_chunk *preserved_mem_map; - struct kho_debugfs dbg; }; @@ -382,6 +379,27 @@ static void kho_mem_ser_free(struct khoser_mem_chunk *first_chunk) } } +/* + * Update memory map property, if old one is found discard it via + * kho_mem_ser_free(). + */ +static void kho_update_memory_map(struct khoser_mem_chunk *first_chunk) +{ + void *ptr; + u64 phys; + + ptr = fdt_getprop_w(kho_out.fdt, 0, PROP_PRESERVED_MEMORY_MAP, NULL); + + /* Check and discard previous memory map */ + phys = get_unaligned((u64 *)ptr); + if (phys) + kho_mem_ser_free((struct khoser_mem_chunk *)phys_to_virt(phys)); + + /* Update with the new value */ + phys = first_chunk ? 
(u64)virt_to_phys(first_chunk) : 0; + put_unaligned(phys, (u64 *)ptr); +} + static int kho_mem_serialize(struct kho_out *kho_out) { struct khoser_mem_chunk *first_chunk = NULL; @@ -422,7 +440,7 @@ static int kho_mem_serialize(struct kho_out *kho_out) } } - kho_out->preserved_mem_map = first_chunk; + kho_update_memory_map(first_chunk); return 0; @@ -1223,8 +1241,7 @@ int kho_abort(void) if (!kho_out.finalized) return -ENOENT; - kho_mem_ser_free(kho_out.preserved_mem_map); - kho_out.preserved_mem_map = NULL; + kho_update_memory_map(NULL); kho_out.finalized = false; return 0; @@ -1234,21 +1251,15 @@ static int __kho_finalize(void) { void *root = kho_out.fdt; struct kho_sub_fdt *fdt; - u64 *preserved_mem_map; + u64 empty_mem_map = 0; int err; err = fdt_create(root, PAGE_SIZE); err |= fdt_finish_reservemap(root); err |= fdt_begin_node(root, ""); err |= fdt_property_string(root, "compatible", KHO_FDT_COMPATIBLE); - /** - * Reserve the preserved-memory-map property in the root FDT, so - * that all property definitions will precede subnodes created by - * KHO callers. 
- */ - err |= fdt_property_placeholder(root, PROP_PRESERVED_MEMORY_MAP, - sizeof(*preserved_mem_map), - (void **)&preserved_mem_map); + err |= fdt_property(root, PROP_PRESERVED_MEMORY_MAP, &empty_mem_map, + sizeof(empty_mem_map)); if (err) goto err_exit; @@ -1271,8 +1282,6 @@ static int __kho_finalize(void) if (err) goto err_exit; - *preserved_mem_map = (u64)virt_to_phys(kho_out.preserved_mem_map); - return 0; err_exit: -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 11:00:00 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 14:00:00 -0500 Subject: [PATCH v2 11/13] kho: Allow kexec load before KHO finalization In-Reply-To: <20251114190002.3311679-1-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Message-ID: <20251114190002.3311679-12-pasha.tatashin@soleen.com> Currently, kho_fill_kimage() checks kho_out.finalized and returns early if KHO is not yet finalized. This enforces a strict ordering where userspace must finalize KHO *before* loading the kexec image. This is restrictive, as standard workflows often involve loading the target kernel early in the lifecycle and finalizing the state (FDT) only immediately before the reboot. Since the KHO FDT resides at a physical address allocated during boot (kho_init), its location is stable. We can attach this stable address to the kimage regardless of whether the content has been finalized yet. Relax the check to only require kho_enable, allowing kexec_file_load to proceed at any time. 
Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav --- kernel/liveupdate/kexec_handover.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 461d96084c12..4596e67de832 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1550,7 +1550,7 @@ int kho_fill_kimage(struct kimage *image) int err = 0; struct kexec_buf scratch; - if (!kho_out.finalized) + if (!kho_enable) return 0; image->kho.fdt = virt_to_phys(kho_out.fdt); -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 10:59:58 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 13:59:58 -0500 Subject: [PATCH v2 09/13] kho: Remove abort functionality and support state refresh In-Reply-To: <20251114190002.3311679-1-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Message-ID: <20251114190002.3311679-10-pasha.tatashin@soleen.com> Previously, KHO required a dedicated kho_abort() function to clean up state before kho_finalize() could be called again. This was necessary to handle complex unwind paths when using notifiers. With the shift to direct memory preservation, the explicit abort step is no longer strictly necessary. Remove kho_abort() and refactor kho_finalize() to handle re-entry. If kho_finalize() is called while KHO is already finalized, it will now automatically clean up the previous memory map and state before generating a new one. This allows the KHO state to be updated/refreshed simply by triggering finalize again. Update debugfs to return -EINVAL if userspace attempts to write 0 to the finalize attribute, as explicit abort is no longer supported. 
Suggested-by: Mike Rapoport (Microsoft) Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav --- kernel/liveupdate/kexec_handover.c | 21 ++++----------------- kernel/liveupdate/kexec_handover_debugfs.c | 2 +- kernel/liveupdate/kexec_handover_internal.h | 1 - 3 files changed, 5 insertions(+), 19 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 63800f63551f..624fd648d21f 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1232,21 +1232,6 @@ void kho_restore_free(void *mem) } EXPORT_SYMBOL_GPL(kho_restore_free); -int kho_abort(void) -{ - if (!kho_enable) - return -EOPNOTSUPP; - - guard(mutex)(&kho_out.lock); - if (!kho_out.finalized) - return -ENOENT; - - kho_update_memory_map(NULL); - kho_out.finalized = false; - - return 0; -} - static int __kho_finalize(void) { void *root = kho_out.fdt; @@ -1297,8 +1282,10 @@ int kho_finalize(void) return -EOPNOTSUPP; guard(mutex)(&kho_out.lock); - if (kho_out.finalized) - return -EEXIST; + if (kho_out.finalized) { + kho_update_memory_map(NULL); + kho_out.finalized = false; + } ret = __kho_finalize(); if (ret) diff --git a/kernel/liveupdate/kexec_handover_debugfs.c b/kernel/liveupdate/kexec_handover_debugfs.c index ac739d25094d..2abbf62ba942 100644 --- a/kernel/liveupdate/kexec_handover_debugfs.c +++ b/kernel/liveupdate/kexec_handover_debugfs.c @@ -87,7 +87,7 @@ static int kho_out_finalize_set(void *data, u64 val) if (val) return kho_finalize(); else - return kho_abort(); + return -EINVAL; } DEFINE_DEBUGFS_ATTRIBUTE(kho_out_finalize_fops, kho_out_finalize_get, diff --git a/kernel/liveupdate/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h index 52ed73659fe6..0202c85ad14f 100644 --- a/kernel/liveupdate/kexec_handover_internal.h +++ b/kernel/liveupdate/kexec_handover_internal.h @@ -24,7 +24,6 @@ extern unsigned int kho_scratch_cnt; bool kho_finalized(void); int 
kho_finalize(void); -int kho_abort(void); #ifdef CONFIG_KEXEC_HANDOVER_DEBUGFS int kho_debugfs_init(void); -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 11:00:01 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 14:00:01 -0500 Subject: [PATCH v2 12/13] kho: Allow memory preservation state updates after finalization In-Reply-To: <20251114190002.3311679-1-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Message-ID: <20251114190002.3311679-13-pasha.tatashin@soleen.com> Currently, kho_preserve_* and kho_unpreserve_* return -EBUSY if KHO is finalized. This enforces a rigid "freeze" on the KHO memory state. With the introduction of re-entrant finalization, this restriction is no longer necessary. Users should be allowed to modify the preservation set (e.g., adding new pages or freeing old ones) even after an initial finalization. The intended workflow for updates is now: 1. Modify state (preserve/unpreserve). 2. Call kho_finalize() again to refresh the serialized metadata. Remove the kho_out.finalized checks to enable this dynamic behavior. This also allows converting the kho_unpreserve_* functions to void, as they no longer return any errors. 
Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) --- include/linux/kexec_handover.h | 21 ++++-------- kernel/liveupdate/kexec_handover.c | 55 +++++++----------------------- 2 files changed, 19 insertions(+), 57 deletions(-) diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index 38a9487a1a00..6dd0dcdf0ec1 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -44,11 +44,11 @@ bool kho_is_enabled(void); bool is_kho_boot(void); int kho_preserve_folio(struct folio *folio); -int kho_unpreserve_folio(struct folio *folio); +void kho_unpreserve_folio(struct folio *folio); int kho_preserve_pages(struct page *page, unsigned int nr_pages); -int kho_unpreserve_pages(struct page *page, unsigned int nr_pages); +void kho_unpreserve_pages(struct page *page, unsigned int nr_pages); int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation); -int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation); +void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation); void *kho_alloc_preserve(size_t size); void kho_unpreserve_free(void *mem); void kho_restore_free(void *mem); @@ -79,20 +79,14 @@ static inline int kho_preserve_folio(struct folio *folio) return -EOPNOTSUPP; } -static inline int kho_unpreserve_folio(struct folio *folio) -{ - return -EOPNOTSUPP; -} +static inline void kho_unpreserve_folio(struct folio *folio) { } static inline int kho_preserve_pages(struct page *page, unsigned int nr_pages) { return -EOPNOTSUPP; } -static inline int kho_unpreserve_pages(struct page *page, unsigned int nr_pages) -{ - return -EOPNOTSUPP; -} +static inline void kho_unpreserve_pages(struct page *page, unsigned int nr_pages) { } static inline int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation) @@ -100,10 +94,7 @@ static inline int kho_preserve_vmalloc(void *ptr, return -EOPNOTSUPP; } -static inline int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) -{ - return -EOPNOTSUPP; -} 
+static inline void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) { } void *kho_alloc_preserve(size_t size) { diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 4596e67de832..a7f876ece445 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -185,10 +185,6 @@ static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn, const unsigned long pfn_high = pfn >> order; might_sleep(); - - if (kho_out.finalized) - return -EBUSY; - physxa = xa_load(&track->orders, order); if (!physxa) { int err; @@ -807,20 +803,14 @@ EXPORT_SYMBOL_GPL(kho_preserve_folio); * Instructs KHO to unpreserve a folio that was preserved by * kho_preserve_folio() before. The provided @folio (pfn and order) * must exactly match a previously preserved folio. - * - * Return: 0 on success, error code on failure */ -int kho_unpreserve_folio(struct folio *folio) +void kho_unpreserve_folio(struct folio *folio) { const unsigned long pfn = folio_pfn(folio); const unsigned int order = folio_order(folio); struct kho_mem_track *track = &kho_out.track; - if (kho_out.finalized) - return -EBUSY; - __kho_unpreserve_order(track, pfn, order); - return 0; } EXPORT_SYMBOL_GPL(kho_unpreserve_folio); @@ -877,21 +867,14 @@ EXPORT_SYMBOL_GPL(kho_preserve_pages); * This must be called with the same @page and @nr_pages as the corresponding * kho_preserve_pages() call. Unpreserving arbitrary sub-ranges of larger * preserved blocks is not supported. 
- * - * Return: 0 on success, error code on failure */ -int kho_unpreserve_pages(struct page *page, unsigned int nr_pages) +void kho_unpreserve_pages(struct page *page, unsigned int nr_pages) { struct kho_mem_track *track = &kho_out.track; const unsigned long start_pfn = page_to_pfn(page); const unsigned long end_pfn = start_pfn + nr_pages; - if (kho_out.finalized) - return -EBUSY; - __kho_unpreserve(track, start_pfn, end_pfn); - - return 0; } EXPORT_SYMBOL_GPL(kho_unpreserve_pages); @@ -976,20 +959,6 @@ static void kho_vmalloc_unpreserve_chunk(struct kho_vmalloc_chunk *chunk, } } -static void kho_vmalloc_free_chunks(struct kho_vmalloc *kho_vmalloc) -{ - struct kho_vmalloc_chunk *chunk = KHOSER_LOAD_PTR(kho_vmalloc->first); - - while (chunk) { - struct kho_vmalloc_chunk *tmp = chunk; - - kho_vmalloc_unpreserve_chunk(chunk, kho_vmalloc->order); - - chunk = KHOSER_LOAD_PTR(chunk->hdr.next); - free_page((unsigned long)tmp); - } -} - /** * kho_preserve_vmalloc - preserve memory allocated with vmalloc() across kexec * @ptr: pointer to the area in vmalloc address space @@ -1051,7 +1020,7 @@ int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation) return 0; err_free: - kho_vmalloc_free_chunks(preservation); + kho_unpreserve_vmalloc(preservation); return err; } EXPORT_SYMBOL_GPL(kho_preserve_vmalloc); @@ -1062,17 +1031,19 @@ EXPORT_SYMBOL_GPL(kho_preserve_vmalloc); * * Instructs KHO to unpreserve the area in vmalloc address space that was * previously preserved with kho_preserve_vmalloc(). 
- * - * Return: 0 on success, error code on failure */ -int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) +void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) { - if (kho_out.finalized) - return -EBUSY; + struct kho_vmalloc_chunk *chunk = KHOSER_LOAD_PTR(preservation->first); - kho_vmalloc_free_chunks(preservation); + while (chunk) { + struct kho_vmalloc_chunk *tmp = chunk; - return 0; + kho_vmalloc_unpreserve_chunk(chunk, preservation->order); + + chunk = KHOSER_LOAD_PTR(chunk->hdr.next); + free_page((unsigned long)tmp); + } } EXPORT_SYMBOL_GPL(kho_unpreserve_vmalloc); @@ -1221,7 +1192,7 @@ void kho_unpreserve_free(void *mem) return; folio = virt_to_folio(mem); - WARN_ON_ONCE(kho_unpreserve_folio(folio)); + kho_unpreserve_folio(folio); folio_put(folio); } EXPORT_SYMBOL_GPL(kho_unpreserve_free); -- 2.52.0.rc1.455.g30608eb744-goog From pasha.tatashin at soleen.com Fri Nov 14 11:00:02 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 14:00:02 -0500 Subject: [PATCH v2 13/13] kho: Add Kconfig option to enable KHO by default In-Reply-To: <20251114190002.3311679-1-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Message-ID: <20251114190002.3311679-14-pasha.tatashin@soleen.com> Currently, Kexec Handover must be explicitly enabled via the kernel command line parameter `kho=on`. For workloads that rely on KHO as a foundational requirement (such as the upcoming Live Update Orchestrator), requiring an explicit boot parameter adds redundant configuration steps. Introduce CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT. When selected, KHO defaults to enabled. This is equivalent to passing kho=on at boot. The behavior can still be disabled at runtime by passing kho=off. 
Signed-off-by: Pasha Tatashin Reviewed-by: Mike Rapoport (Microsoft) --- kernel/liveupdate/Kconfig | 14 ++++++++++++++ kernel/liveupdate/kexec_handover.c | 2 +- 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig index eae428309332..a973a54447de 100644 --- a/kernel/liveupdate/Kconfig +++ b/kernel/liveupdate/Kconfig @@ -37,4 +37,18 @@ config KEXEC_HANDOVER_DEBUGFS Also, enables inspecting the KHO fdt trees with the debugfs binary blobs. +config KEXEC_HANDOVER_ENABLE_DEFAULT + bool "Enable kexec handover by default" + depends on KEXEC_HANDOVER + help + Enable Kexec Handover by default. This avoids the need to + explicitly pass 'kho=on' on the kernel command line. + + This is useful for systems where KHO is a prerequisite for other + features, such as Live Update, ensuring the mechanism is always + active. + + The default behavior can still be overridden at boot time by + passing 'kho=off'. + endmenu diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index a7f876ece445..224bdf5becb6 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -52,7 +52,7 @@ union kho_page_info { static_assert(sizeof(union kho_page_info) == sizeof(((struct page *)0)->private)); -static bool kho_enable __ro_after_init; +static bool kho_enable __ro_after_init = IS_ENABLED(CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT); bool kho_is_enabled(void) { -- 2.52.0.rc1.455.g30608eb744-goog From pratyush at kernel.org Fri Nov 14 11:33:17 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 20:33:17 +0100 Subject: [PATCH v2 03/13] kho: Introduce high-level memory allocation API In-Reply-To: <20251114190002.3311679-4-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 13:59:52 -0500") References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> <20251114190002.3311679-4-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha 
Tatashin wrote: > Currently, clients of KHO must manually allocate memory (e.g., via > alloc_pages), calculate the page order, and explicitly call > kho_preserve_folio(). Similarly, cleanup requires separate calls to > unpreserve and free the memory. > > Introduce a high-level API to streamline this common pattern: > > - kho_alloc_preserve(size): Allocates physically contiguous, zeroed > memory and immediately marks it for preservation. > - kho_unpreserve_free(ptr): Unpreserves and frees the memory > in the current kernel. > - kho_restore_free(ptr): Restores the struct page state of > preserved memory in the new kernel and immediately frees it to the > page allocator. > > Signed-off-by: Pasha Tatashin > Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav [...] -- Regards, Pratyush Yadav From pratyush at kernel.org Fri Nov 14 11:33:54 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 20:33:54 +0100 Subject: [PATCH v2 05/13] kho: Verify deserialization status and fix FDT alignment access In-Reply-To: <20251114190002.3311679-6-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 13:59:54 -0500") References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> <20251114190002.3311679-6-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > During boot, kho_restore_folio() relies on the memory map having been > successfully deserialized. If deserialization fails or no map is > present, attempting to restore the FDT folio is unsafe. > > Update kho_mem_deserialize() to return a boolean indicating success. Use > this return value in kho_memory_init() to disable KHO if deserialization > fails. Also, the incoming FDT folio is never used, there is no reason to > restore it. > > Additionally, use get_unaligned() to retrieve the memory map pointer > from the FDT. 
FDT properties are not guaranteed to be naturally aligned, > and accessing a 64-bit value via a pointer that is only 32-bit aligned > can cause faults. > > Signed-off-by: Pasha Tatashin > Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav [...] -- Regards, Pratyush Yadav From pratyush at kernel.org Fri Nov 14 11:35:28 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 20:35:28 +0100 Subject: [PATCH v2 12/13] kho: Allow memory preservation state updates after finalization In-Reply-To: <20251114190002.3311679-13-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 14:00:01 -0500") References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> <20251114190002.3311679-13-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > Currently, kho_preserve_* and kho_unpreserve_* return -EBUSY if > KHO is finalized. This enforces a rigid "freeze" on the KHO memory > state. > > With the introduction of re-entrant finalization, this restriction is > no longer necessary. Users should be allowed to modify the preservation > set (e.g., adding new pages or freeing old ones) even after an initial > finalization. > > The intended workflow for updates is now: > 1. Modify state (preserve/unpreserve). > 2. Call kho_finalize() again to refresh the serialized metadata. > > Remove the kho_out.finalized checks to enable this dynamic behavior. > > This also allows to convert kho_unpreserve_* functions to void, as they > do not return any error anymore. > > Signed-off-by: Pasha Tatashin > Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav [...] 
-- Regards, Pratyush Yadav From pratyush at kernel.org Fri Nov 14 11:35:51 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Fri, 14 Nov 2025 20:35:51 +0100 Subject: [PATCH v2 13/13] kho: Add Kconfig option to enable KHO by default In-Reply-To: <20251114190002.3311679-14-pasha.tatashin@soleen.com> (Pasha Tatashin's message of "Fri, 14 Nov 2025 14:00:02 -0500") References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> <20251114190002.3311679-14-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14 2025, Pasha Tatashin wrote: > Currently, Kexec Handover must be explicitly enabled via the kernel > command line parameter `kho=on`. > > For workloads that rely on KHO as a foundational requirement (such as > the upcoming Live Update Orchestrator), requiring an explicit boot > parameter adds redundant configuration steps. > > Introduce CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT. When selected, KHO > defaults to enabled. This is equivalent to passing kho=on at boot. > The behavior can still be disabled at runtime by passing kho=off. > > Signed-off-by: Pasha Tatashin > Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav [...] -- Regards, Pratyush Yadav From akpm at linux-foundation.org Fri Nov 14 13:44:34 2025 From: akpm at linux-foundation.org (Andrew Morton) Date: Fri, 14 Nov 2025 13:44:34 -0800 Subject: [PATCH v2 00/13] kho: simplify state machine and enable dynamic updates In-Reply-To: <20251114190002.3311679-1-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> Message-ID: <20251114134434.5375d085a6bdc1671351f243@linux-foundation.org> On Fri, 14 Nov 2025 13:59:49 -0500 Pasha Tatashin wrote: > Andrew: This series applies against mm-nonmm-unstable, but should > go right before LUOv5, i.e. on top of: > "liveupdate: kho: use %pe format specifier for error pointer printing" > > Changelog v2: > - Addressed comments from Mike and Pratyush > - Added Review-bys. 
> > It also replaces the following patches, that once applied should be > dropped from mm-nonmm-unstable: > "liveupdate: kho: when live update add KHO image during kexec load" > "liveupdate: Kconfig: make debugfs optional" > "kho: enable KHO by default" > > This patch series refactors the Kexec Handover subsystem to transition > from a rigid, state-locked model to a dynamic, re-entrant architecture. > It also introduces usability improvements. OK. Where are we with the series "Live Update Orchestrator, v5"? I'm seeing a couple of review comments which I plan to circle back on: https://lkml.kernel.org/r/aROZi043lxtegqWE at kernel.org https://lkml.kernel.org/r/mafs0ms4tajcs.fsf at kernel.org and a comment from yourself against liveupdate-luo_core-integrate-with-kho.patch which indicates that you plan to update that patch? From pasha.tatashin at soleen.com Fri Nov 14 14:00:14 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 17:00:14 -0500 Subject: [PATCH v2 00/13] kho: simplify state machine and enable dynamic updates In-Reply-To: <20251114134434.5375d085a6bdc1671351f243@linux-foundation.org> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> <20251114134434.5375d085a6bdc1671351f243@linux-foundation.org> Message-ID: On Fri, Nov 14, 2025 at 4:44 PM Andrew Morton wrote: > > On Fri, 14 Nov 2025 13:59:49 -0500 Pasha Tatashin wrote: > > > Andrew: This series applies against mm-nonmm-unstable, but should > > go right before LUOv5, i.e. on top of: > > "liveupdate: kho: use %pe format specifier for error pointer printing" > > > > Changelog v2: > > - Addressed comments from Mike and Pratyush > > - Added Review-bys. 
> > > > It also replaces the following patches, that once applied should be > > dropped from mm-nonmm-unstable: > > "liveupdate: kho: when live update add KHO image during kexec load" > > "liveupdate: Kconfig: make debugfs optional" > > "kho: enable KHO by default" > > > > This patch series refactors the Kexec Handover subsystem to transition > > from a rigid, state-locked model to a dynamic, re-entrant architecture. > > It also introduces usability improvements. > > OK. > > Where are we with the series "Live Update Orchestrator, v5"? I am working on LUOv6, it is going to be an incremental update with much smaller delta compared to v4->v5, addressing all the comments collected so far. I plan to send it out this weekend. Thank you, Pasha > I'm seeing a couple of review comments which I plan to circle back on: > > https://lkml.kernel.org/r/aROZi043lxtegqWE at kernel.org > https://lkml.kernel.org/r/mafs0ms4tajcs.fsf at kernel.org > and a comment from yourself against > liveupdate-luo_core-integrate-with-kho.patch which indicates that you > plan to update that patch? From pasha.tatashin at soleen.com Fri Nov 14 14:06:42 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Fri, 14 Nov 2025 17:06:42 -0500 Subject: [PATCH v2 00/13] kho: simplify state machine and enable dynamic updates In-Reply-To: References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> <20251114134434.5375d085a6bdc1671351f243@linux-foundation.org> Message-ID: On Fri, Nov 14, 2025 at 5:00 PM Pasha Tatashin wrote: > > On Fri, Nov 14, 2025 at 4:44 PM Andrew Morton wrote: > > > > On Fri, 14 Nov 2025 13:59:49 -0500 Pasha Tatashin wrote: > > > > > Andrew: This series applies against mm-nonmm-unstable, but should > > > go right before LUOv5, i.e. on top of: > > > "liveupdate: kho: use %pe format specifier for error pointer printing" > > > > > > Changelog v2: > > > - Addressed comments from Mike and Pratyush > > > - Added Review-bys.
> > > > > > It also replaces the following patches, that once applied should be > > > dropped from mm-nonmm-unstable: > > > "liveupdate: kho: when live update add KHO image during kexec load" > > > "liveupdate: Kconfig: make debugfs optional" > > > "kho: enable KHO by default" > > > > > > This patch series refactors the Kexec Handover subsystem to transition > > > from a rigid, state-locked model to a dynamic, re-entrant architecture. > > > It also introduces usability improvements. > > > > OK. Also, with this series, kho_unpreserve_folio() returns void, and LUOv5 requires two small fixes where this function is used: 1. fixup for mm: "memfd_luo: allow preserving memfd" diff --git a/mm/memfd_luo.c b/mm/memfd_luo.c index e366de627264..ba435590d2cf 100644 --- a/mm/memfd_luo.c +++ b/mm/memfd_luo.c @@ -138,7 +138,7 @@ static struct memfd_luo_folio_ser *memfd_luo_preserve_folios(struct file *file, err_unpreserve: i--; for (; i >= 0; i--) - WARN_ON_ONCE(kho_unpreserve_folio(folios[i])); + kho_unpreserve_folio(folios[i]); vfree(pfolios); err_unpin: unpin_folios(folios, nr_folios); @@ -170,7 +170,7 @@ static void memfd_luo_unpreserve_folios(void *fdt, struct memfd_luo_folio_ser *p folio = pfn_folio(PRESERVED_FOLIO_PFN(pfolio->foliodesc)); - WARN_ON_ONCE(kho_unpreserve_folio(folio)); + kho_unpreserve_folio(folio); unpin_folio(folio); } 2. Fixup for liveupdate: luo_core: integrate with KHO: diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c index 29a094ee225c..f0bc3ee0a10b 100644 --- a/kernel/liveupdate/luo_core.c +++ b/kernel/liveupdate/luo_core.c @@ -305,7 +305,7 @@ void luo_free_unpreserve(void *mem, size_t size) return; folio = virt_to_folio(mem); - WARN_ON_ONCE(kho_unpreserve_folio(folio)); + kho_unpreserve_folio(folio); folio_put(folio); } > > > > Where are we with the series "Live Update Orchestrator, v5"? 
> > I am working on LUOv6, it is going to be an incremental update with > much smaller delta compared to v4->v5, addressing all the comments > collected so far. I plan to send it out this weekend. > > Thank you, > Pasha > > > I'm seeing a couple of review comments which I plan to circle back on: > > > > https://lkml.kernel.org/r/aROZi043lxtegqWE at kernel.org > > https://lkml.kernel.org/r/mafs0ms4tajcs.fsf at kernel.org > > and a comment from yourself against > > liveupdate-luo_core-integrate-with-kho.patch which indicates that you > > plan to update that patch? From rppt at kernel.org Sat Nov 15 01:36:19 2025 From: rppt at kernel.org (Mike Rapoport) Date: Sat, 15 Nov 2025 11:36:19 +0200 Subject: [PATCH v1 04/13] kho: Verify deserialization status and fix FDT alignment access In-Reply-To: References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-5-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14, 2025 at 05:52:37PM +0100, Pratyush Yadav wrote: > On Fri, Nov 14 2025, Pasha Tatashin wrote: > > > @@ -1377,16 +1387,12 @@ static void __init kho_release_scratch(void) > > > > void __init kho_memory_init(void) > > { > > - struct folio *folio; > > - > > if (kho_in.scratch_phys) { > > kho_scratch = phys_to_virt(kho_in.scratch_phys); > > kho_release_scratch(); > > > > - kho_mem_deserialize(kho_get_fdt()); > > - folio = kho_restore_folio(kho_in.fdt_phys); > > - if (!folio) > > - pr_warn("failed to restore folio for KHO fdt\n"); > > + if (!kho_mem_deserialize(kho_get_fdt())) > > + kho_in.fdt_phys = 0; > > The folio restore does serve a purpose: it accounts for that folio in > the system's total memory. See the call to adjust_managed_page_count() > in kho_restore_page(). In practice, I don't think it makes much of a > difference, but I don't see why not. This page is never freed, so adding it to zone managed pages or keeping it reserved does not change anything. 
> > } else { > > kho_reserve_scratch(); > > } > > -- > Regards, > Pratyush Yadav -- Sincerely yours, Mike. From rppt at kernel.org Sat Nov 15 01:40:09 2025 From: rppt at kernel.org (Mike Rapoport) Date: Sat, 15 Nov 2025 11:40:09 +0200 Subject: [PATCH v2 10/13] kho: Update FDT dynamically for subtree addition/removal In-Reply-To: <20251114190002.3311679-11-pasha.tatashin@soleen.com> References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> <20251114190002.3311679-11-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14, 2025 at 01:59:59PM -0500, Pasha Tatashin wrote: > - struct kho_sub_fdt *sub_fdt; > + phys_addr_t phys = virt_to_phys(fdt); > + void *root_fdt = kho_out.fdt; > + int err = -ENOMEM; > + int off, fdt_err; > > - sub_fdt = kmalloc(sizeof(*sub_fdt), GFP_KERNEL); > - if (!sub_fdt) > - return -ENOMEM; > + guard(mutex)(&kho_out.lock); > + > + fdt_err = fdt_open_into(root_fdt, root_fdt, PAGE_SIZE); > + if (fdt_err < 0) > + return err; > > - INIT_LIST_HEAD(&sub_fdt->l); > - sub_fdt->name = name; > - sub_fdt->fdt = fdt; > + off = fdt_add_subnode(root_fdt, 0, name); Why not fdt_err = fdt_add_subnode() as I asked in v1 review? > + if (off < 0) { > + if (off == -FDT_ERR_EXISTS) > + err = -EEXIST; > + goto out_pack; > + } -- Sincerely yours, Mike. From rppt at kernel.org Sat Nov 15 01:42:01 2025 From: rppt at kernel.org (Mike Rapoport) Date: Sat, 15 Nov 2025 11:42:01 +0200 Subject: [PATCH] liveupdate: kho: Enable KHO by default In-Reply-To: References: <20251110180715.602807-1-pasha.tatashin@soleen.com> Message-ID: On Fri, Nov 14, 2025 at 09:13:01AM -0500, Pasha Tatashin wrote: > On Fri, Nov 14, 2025 at 2:30 AM Mike Rapoport wrote: > > > > On Mon, Nov 10, 2025 at 01:07:15PM -0500, Pasha Tatashin wrote: > > > Upcoming LUO requires KHO for its operations, the requirement to place > > > both KHO=on and liveupdate=on becomes redundant. Set KHO to be enabled > > > by default. > > > > I thought more about this and it seems too much of a change.
kho=1 enables > > scratch areas and that significantly changes how free pages are distributed > > in the free lists. > > > > Let's go with a Kconfig option we discussed off-list: > > (this is on top of the current mmotm/mm-nonmm-unstable) > > I will include this into the KHO simplification series Please add Alex's Reviewed-by as well. -- Sincerely yours, Mike. From pasha.tatashin at soleen.com Sat Nov 15 06:51:07 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Sat, 15 Nov 2025 09:51:07 -0500 Subject: [PATCH v2 10/13] kho: Update FDT dynamically for subtree addition/removal In-Reply-To: References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> <20251114190002.3311679-11-pasha.tatashin@soleen.com> Message-ID: On Sat, Nov 15, 2025 at 4:40 AM Mike Rapoport wrote: > > On Fri, Nov 14, 2025 at 01:59:59PM -0500, Pasha Tatashin wrote: > > - struct kho_sub_fdt *sub_fdt; > > + phys_addr_t phys = virt_to_phys(fdt); > > + void *root_fdt = kho_out.fdt; > > + int err = -ENOMEM; > > + int off, fdt_err; > > > > - sub_fdt = kmalloc(sizeof(*sub_fdt), GFP_KERNEL); > > - if (!sub_fdt) > > - return -ENOMEM; > > + guard(mutex)(&kho_out.lock); > > + > > + fdt_err = fdt_open_into(root_fdt, root_fdt, PAGE_SIZE); > > + if (fdt_err < 0) > > + return err; > > > > - INIT_LIST_HEAD(&sub_fdt->l); > > - sub_fdt->name = name; > > - sub_fdt->fdt = fdt; > > + off = fdt_add_subnode(root_fdt, 0, name); > > Why not > fdt_err = fdt_add_subnode() > > as I asked in v1 review? > Oh, I missed that, there is a slight difference between the two: 'fdt_err' only contains FDT return value, i.e. error if negative. The 'off' on the other hand in the happy path contains subnode offset, and contains error only in the unhappy path.
This is why I think it is a little cleaner to keep different name, however, if you still prefer re-using a single local variable for both, this is fix-up patch: diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 224bdf5becb6..81f60ccb2dc7 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -713,7 +713,7 @@ int kho_add_subtree(const char *name, void *fdt) phys_addr_t phys = virt_to_phys(fdt); void *root_fdt = kho_out.fdt; int err = -ENOMEM; - int off, fdt_err; + int fdt_err; guard(mutex)(&kho_out.lock); @@ -721,14 +721,14 @@ int kho_add_subtree(const char *name, void *fdt) if (fdt_err < 0) return err; - off = fdt_add_subnode(root_fdt, 0, name); - if (off < 0) { - if (off == -FDT_ERR_EXISTS) + fdt_err = fdt_add_subnode(root_fdt, 0, name); + if (fdt_err < 0) { + if (fdt_err == -FDT_ERR_EXISTS) err = -EEXIST; goto out_pack; } - err = fdt_setprop(root_fdt, off, PROP_SUB_FDT, &phys, sizeof(phys)); + err = fdt_setprop(root_fdt, fdt_err, PROP_SUB_FDT, &phys, sizeof(phys)); if (err < 0) goto out_pack; From rppt at kernel.org Sat Nov 15 21:46:56 2025 From: rppt at kernel.org (Mike Rapoport) Date: Sun, 16 Nov 2025 07:46:56 +0200 Subject: [PATCH v2 10/13] kho: Update FDT dynamically for subtree addition/removal In-Reply-To: References: <20251114190002.3311679-1-pasha.tatashin@soleen.com> <20251114190002.3311679-11-pasha.tatashin@soleen.com> Message-ID: On Sat, Nov 15, 2025 at 09:51:07AM -0500, Pasha Tatashin wrote: > On Sat, Nov 15, 2025 at 4:40?AM Mike Rapoport wrote: > > > > On Fri, Nov 14, 2025 at 01:59:59PM -0500, Pasha Tatashin wrote: > > > - struct kho_sub_fdt *sub_fdt; > > > + phys_addr_t phys = virt_to_phys(fdt); > > > + void *root_fdt = kho_out.fdt; > > > + int err = -ENOMEM; > > > + int off, fdt_err; > > > > > > - sub_fdt = kmalloc(sizeof(*sub_fdt), GFP_KERNEL); > > > - if (!sub_fdt) > > > - return -ENOMEM; > > > + guard(mutex)(&kho_out.lock); > > > + > > > + fdt_err = 
fdt_open_into(root_fdt, root_fdt, PAGE_SIZE); > > > + if (fdt_err < 0) > > > + return err; > > > > > > - INIT_LIST_HEAD(&sub_fdt->l); > > > - sub_fdt->name = name; > > > - sub_fdt->fdt = fdt; > > > + off = fdt_add_subnode(root_fdt, 0, name); > > > > Why not > > fdt_err = fdt_add_subnode() > > > > as I asked in v1 review? > > > > Oh, I missed that, there is a slight difference between the two: > 'fdt_err' only contains FDT return value, i.e. error if negative. The > 'off' on the other hand in the happy path contains subnode offset, and > contains error only in the unhappy path. This is why I think it is a > little cleaner to keep different name, however, if you still prefer > re-using a single local variable for both, this is fix-up patch: > > diff --git a/kernel/liveupdate/kexec_handover.c > b/kernel/liveupdate/kexec_handover.c > index 224bdf5becb6..81f60ccb2dc7 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c > @@ -713,7 +713,7 @@ int kho_add_subtree(const char *name, void *fdt) > phys_addr_t phys = virt_to_phys(fdt); > void *root_fdt = kho_out.fdt; > int err = -ENOMEM; > - int off, fdt_err; > + int fdt_err; > > guard(mutex)(&kho_out.lock); > > @@ -721,14 +721,14 @@ int kho_add_subtree(const char *name, void *fdt) > if (fdt_err < 0) > return err; > > - off = fdt_add_subnode(root_fdt, 0, name); > - if (off < 0) { > - if (off == -FDT_ERR_EXISTS) > + fdt_err = fdt_add_subnode(root_fdt, 0, name); > + if (fdt_err < 0) { > + if (fdt_err == -FDT_ERR_EXISTS) > err = -EEXIST; > goto out_pack; > } > > - err = fdt_setprop(root_fdt, off, PROP_SUB_FDT, &phys, sizeof(phys)); > + err = fdt_setprop(root_fdt, fdt_err, PROP_SUB_FDT, &phys, sizeof(phys)); I missed 'off' here, never mind > if (err < 0) > goto out_pack; > -- Sincerely yours, Mike. 
From ioworker0 at gmail.com Sat Nov 15 22:49:18 2025 From: ioworker0 at gmail.com (Lance Yang) Date: Sun, 16 Nov 2025 14:49:18 +0800 Subject: [PATCH v2 03/13] kho: Introduce high-level memory allocation API In-Reply-To: <20251114190002.3311679-4-pasha.tatashin@soleen.com> References: <20251114190002.3311679-4-pasha.tatashin@soleen.com> Message-ID: <20251116064918.35549-1-ioworker0@gmail.com> From: Lance Yang On Fri, 14 Nov 2025 13:59:52 -0500, Pasha Tatashin wrote: > Currently, clients of KHO must manually allocate memory (e.g., via > alloc_pages), calculate the page order, and explicitly call > kho_preserve_folio(). Similarly, cleanup requires separate calls to > unpreserve and free the memory. > > Introduce a high-level API to streamline this common pattern: > > - kho_alloc_preserve(size): Allocates physically contiguous, zeroed > memory and immediately marks it for preservation. > - kho_unpreserve_free(ptr): Unpreserves and frees the memory > in the current kernel. > - kho_restore_free(ptr): Restores the struct page state of > preserved memory in the new kernel and immediately frees it to the > page allocator. 
> > Signed-off-by: Pasha Tatashin > > Reviewed-by: Mike Rapoport (Microsoft) > > --- > > include/linux/kexec_handover.h | 22 +++++--- > > kernel/liveupdate/kexec_handover.c | 87 ++++++++++++++++++++++++++++++ > > 2 files changed, 102 insertions(+), 7 deletions(-) > > > > diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h > > index 80ece4232617..38a9487a1a00 100644 > > --- a/include/linux/kexec_handover.h > > +++ b/include/linux/kexec_handover.h > > @@ -2,8 +2,9 @@ > > #ifndef LINUX_KEXEC_HANDOVER_H > > #define LINUX_KEXEC_HANDOVER_H > > > > -#include > > +#include > > #include > > +#include > > > > struct kho_scratch { > > phys_addr_t addr; > > @@ -48,6 +49,9 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages); > > int kho_unpreserve_pages(struct page *page, unsigned int nr_pages); > > int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation); > > int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation); > > +void *kho_alloc_preserve(size_t size); > > +void kho_unpreserve_free(void *mem); > > +void kho_restore_free(void *mem); > > struct folio *kho_restore_folio(phys_addr_t phys); > > struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages); > > void *kho_restore_vmalloc(const struct kho_vmalloc *preservation); > > @@ -101,6 +105,14 @@ static inline int kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) > > return -EOPNOTSUPP; > > } > > > > +void *kho_alloc_preserve(size_t size) > > +{ > > + return ERR_PTR(-EOPNOTSUPP); > > +} > > + > > +void kho_unpreserve_free(void *mem) { } > > +void kho_restore_free(void *mem) { } The compile is unhappy here when CONFIG_KEXEC_HANDOVER is not set ...
``` ld: arch/x86/realmode/rm/video-mode.o: in function `kho_alloc_preserve': /home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:102: multiple definition of `kho_alloc_preserve'; arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:102: first defined here ld: arch/x86/realmode/rm/video-mode.o: in function `kho_unpreserve_free': /home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:104: multiple definition of `kho_unpreserve_free'; arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:104: first defined here ld: arch/x86/realmode/rm/video-mode.o: in function `kho_restore_free': /home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:105: multiple definition of `kho_restore_free'; arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:105: first defined here ld: arch/x86/realmode/rm/regs.o: in function `kho_alloc_preserve': /home/runner/work/mm-test-robot/mm-test-robot/linux/arch/x86/realmode/rm/regs.c:102: multiple definition of `kho_alloc_preserve'; arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:102: first defined here ld: arch/x86/realmode/rm/regs.o: in function `kho_unpreserve_free': /home/runner/work/mm-test-robot/mm-test-robot/linux/arch/x86/realmode/rm/regs.c:104: multiple definition of `kho_unpreserve_free'; arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:104: first defined here ld: arch/x86/realmode/rm/regs.o: in function `kho_restore_free': /home/runner/work/mm-test-robot/mm-test-robot/linux/arch/x86/realmode/rm/regs.c:105: multiple definition of `kho_restore_free'; 
arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:105: first defined here ld: arch/x86/realmode/rm/video-vga.o: in function `kho_alloc_preserve': /home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:102: multiple definition of `kho_alloc_preserve'; arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:102: first defined here ld: arch/x86/realmode/rm/video-vga.o: in function `kho_unpreserve_free': /home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:104: multiple definition of `kho_unpreserve_free'; arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:104: first defined here ld: arch/x86/realmode/rm/video-vga.o: in function `kho_restore_free': /home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:105: multiple definition of `kho_restore_free'; arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:105: first defined here ld: arch/x86/realmode/rm/video-vesa.o: in function `kho_alloc_preserve': /home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:102: multiple definition of `kho_alloc_preserve'; arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:102: first defined here ld: arch/x86/realmode/rm/video-vesa.o: in function `kho_unpreserve_free': /home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:104: multiple definition of `kho_unpreserve_free'; arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:104: first defined here ld: arch/x86/realmode/rm/video-vesa.o: in function `kho_restore_free': 
/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:105: multiple definition of `kho_restore_free'; arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:105: first defined here ld: arch/x86/realmode/rm/video-bios.o: in function `kho_alloc_preserve': /home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:102: multiple definition of `kho_alloc_preserve'; arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:102: first defined here ld: arch/x86/realmode/rm/video-bios.o: in function `kho_unpreserve_free': /home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:104: multiple definition of `kho_unpreserve_free'; arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:104: first defined here ld: arch/x86/realmode/rm/video-bios.o: in function `kho_restore_free': /home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:105: multiple definition of `kho_restore_free'; arch/x86/realmode/rm/wakemain.o:/home/runner/work/mm-test-robot/mm-test-robot/linux/./include/linux/kexec_handover.h:105: first defined here make[5]: *** [arch/x86/realmode/rm/Makefile:49: arch/x86/realmode/rm/realmode.elf] Error 1 make[4]: *** [arch/x86/realmode/Makefile:22: arch/x86/realmode/rm/realmode.bin] Error 2 make[3]: *** [scripts/Makefile.build:556: arch/x86/realmode] Error 2 ``` Perhaps these stubs should be declared as static inline? 
That should make the compiler happy and resolve the linking errors :) ----8<---- diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h index 6dd0dcdf0ec1..5f7b9de97e8d 100644 --- a/include/linux/kexec_handover.h +++ b/include/linux/kexec_handover.h @@ -96,13 +96,13 @@ static inline int kho_preserve_vmalloc(void *ptr, static inline void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) { } -void *kho_alloc_preserve(size_t size) +static inline void *kho_alloc_preserve(size_t size) { return ERR_PTR(-EOPNOTSUPP); } -void kho_unpreserve_free(void *mem) { } -void kho_restore_free(void *mem) { } +static inline void kho_unpreserve_free(void *mem) { } +static inline void kho_restore_free(void *mem) { } static inline struct folio *kho_restore_folio(phys_addr_t phys) { --- [...] Cheers, Lance From pasha.tatashin at soleen.com Sun Nov 16 06:57:05 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Sun, 16 Nov 2025 09:57:05 -0500 Subject: [PATCH v2 03/13] kho: Introduce high-level memory allocation API In-Reply-To: <20251116064918.35549-1-ioworker0@gmail.com> References: <20251114190002.3311679-4-pasha.tatashin@soleen.com> <20251116064918.35549-1-ioworker0@gmail.com> Message-ID: On Sun, Nov 16, 2025 at 1:49?AM Lance Yang wrote: > > From: Lance Yang > > > On Fri, 14 Nov 2025 13:59:52 -0500, Pasha Tatashin wrote: > > Currently, clients of KHO must manually allocate memory (e.g., via > > alloc_pages), calculate the page order, and explicitly call > > kho_preserve_folio(). Similarly, cleanup requires separate calls to > > unpreserve and free the memory. > > > > Introduce a high-level API to streamline this common pattern: > > > > - kho_alloc_preserve(size): Allocates physically contiguous, zeroed > > memory and immediately marks it for preservation. > > - kho_unpreserve_free(ptr): Unpreserves and frees the memory > > in the current kernel. 
> > - kho_restore_free(ptr): Restores the struct page state of > > preserved memory in the new kernel and immediately frees it to the > > page allocator. > > > > [...] > > The compile is unhappy here when CONFIG_KEXEC_HANDOVER is not set ...
Thank you, Andrew already applied a fix: https://lore.kernel.org/all/CA+CK2bBgXDhrHwTVgxrw7YTQ-0=LgW0t66CwPCgG=C85ftz4zw at mail.gmail.com/T/#u > > ``` > [...] > ``` > > Perhaps these stubs should be declared as static inline?
That should make > the compiler happy and resolve the linking errors :) > > ----8<---- > diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h > index 6dd0dcdf0ec1..5f7b9de97e8d 100644 > --- a/include/linux/kexec_handover.h > +++ b/include/linux/kexec_handover.h > @@ -96,13 +96,13 @@ static inline int kho_preserve_vmalloc(void *ptr, > > static inline void kho_unpreserve_vmalloc(struct kho_vmalloc *preservation) { } > > -void *kho_alloc_preserve(size_t size) > +static inline void *kho_alloc_preserve(size_t size) > { > return ERR_PTR(-EOPNOTSUPP); > } > > -void kho_unpreserve_free(void *mem) { } > -void kho_restore_free(void *mem) { } > +static inline void kho_unpreserve_free(void *mem) { } > +static inline void kho_restore_free(void *mem) { } > > static inline struct folio *kho_restore_folio(phys_addr_t phys) > { > --- > > [...] > > Cheers, > Lance From sourabhjain at linux.ibm.com Sun Nov 16 19:51:53 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 17 Nov 2025 09:21:53 +0530 Subject: [PATCH v5] Documentation/ABI: add kexec and kdump sysfs interface Message-ID: <20251117035153.1199665-1-sourabhjain@linux.ibm.com> Add an ABI document for the following kexec and kdump sysfs interfaces: - /sys/kernel/kexec_loaded - /sys/kernel/kexec_crash_loaded - /sys/kernel/kexec_crash_size - /sys/kernel/crash_elfcorehdr_size Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- Changelog: v4 -> v5: https://lore.kernel.org/all/20251114152550.ac2dd5e23542f09c62defec7 at linux-foundation.org/ - Split patch from above patch series.
--- .../ABI/testing/sysfs-kernel-kexec-kdump | 43 +++++++++++++++++++ 1 file changed, 43 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-kexec-kdump diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump new file mode 100644 index 000000000000..96b24565b68e --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -0,0 +1,43 @@ +What: /sys/kernel/kexec_loaded +Date: Jun 2006 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a new kernel image has been loaded + into memory using the kexec system call. It shows 1 if + a kexec image is present and ready to boot, or 0 if none + is loaded. +User: kexec tools, kdump service + +What: /sys/kernel/kexec_crash_loaded +Date: Jun 2006 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a crash (kdump) kernel is currently + loaded into memory. It shows 1 if a crash kernel has been + successfully loaded for panic handling, or 0 if no crash + kernel is present. +User: Kexec tools, Kdump service + +What: /sys/kernel/kexec_crash_size +Date: Dec 2009 +Contact: kexec at lists.infradead.org +Description: read/write + Shows the amount of memory reserved for loading the crash + (kdump) kernel. It reports the size, in bytes, of the + crash kernel area defined by the crashkernel= parameter. + This interface also allows reducing the crashkernel + reservation by writing a smaller value, and the reclaimed + space is added back to the system RAM. +User: Kdump service + +What: /sys/kernel/crash_elfcorehdr_size +Date: Aug 2023 +Contact: kexec at lists.infradead.org +Description: read only + Indicates the preferred size of the memory buffer for the + ELF core header used by the crash (kdump) kernel. It defines + how much space is needed to hold metadata about the crashed + system, including CPU and memory information. 
This information
+ is used by the user space utility kexec to support updating the
+ in-kernel kdump image during hotplug operations.
+User: Kexec tools
--
2.51.1

From sourabhjain at linux.ibm.com Sun Nov 16 20:19:05 2025
From: sourabhjain at linux.ibm.com (Sourabh Jain)
Date: Mon, 17 Nov 2025 09:49:05 +0530
Subject: [PATCH v5] crash: export crashkernel CMA reservation to userspace
Message-ID: <20251117041905.1277801-1-sourabhjain@linux.ibm.com>

Add a sysfs entry /sys/kernel/kexec/crash_cma_ranges to expose all CMA crashkernel ranges.

This allows userspace tools configuring kdump to determine how much memory is reserved for crashkernel. If CMA is used, tools can warn users when attempting to capture user pages with CMA reservation.

The new sysfs entry holds the CMA ranges in the format below:

cat /sys/kernel/kexec/crash_cma_ranges
100000000-10c7fffff

There are already four kexec and kdump sysfs entries under /sys/kernel. Adding more entries there would clutter the directory. To avoid this, the new crash_cma_ranges sysfs entry is placed in a new kexec node under /sys/kernel/.

The reason for not including Crash CMA Ranges in /proc/iomem is to avoid conflicts. It has been observed that contiguous memory ranges are sometimes shown as two separate System RAM entries in /proc/iomem. If a CMA range overlaps two System RAM ranges, adding crashk_res to /proc/iomem can create a conflict. Reference [1] describes one such instance on the PowerPC architecture.
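As a side note for kdump tooling: the file prints each range as a hex start-end pair without a 0x prefix (see the example `100000000-10c7fffff` above), so consuming it from userspace is a two-field `sscanf()`. A minimal sketch; the helper name `parse_cma_range` is invented here for illustration and is not part of the patch:

```c
#include <stdio.h>

/*
 * Parse one "start-end" line as printed by the proposed
 * crash_cma_ranges file (hex addresses, no 0x prefix, e.g.
 * "100000000-10c7fffff").  Returns 0 on success, -1 on
 * malformed input.  Illustrative userspace helper only.
 */
int parse_cma_range(const char *line, unsigned long long *start,
		    unsigned long long *end)
{
	if (sscanf(line, "%llx-%llx", start, end) != 2)
		return -1;
	if (*end < *start)
		return -1;
	return 0;
}
```

Feeding each line of the file to this helper from an `fgets()` loop is enough for a tool to total up the CMA reservation (`end - start + 1` bytes per range).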
Link: https://lore.kernel.org/all/20251016142831.144515-1-sourabhjain at linux.ibm.com/ [1] Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- Changelog: v4 -> v5: https://lore.kernel.org/all/20251114152550.ac2dd5e23542f09c62defec7 at linux-foundation.org/ - Split patch from the above patch series. - Code to create the kexec node under /sys/kernel is added; earlier it was done in [02/05] of the above patch series. Note: This patch depends on the patch below: https://lore.kernel.org/all/20251117035153.1199665-1-sourabhjain at linux.ibm.com/ --- .../ABI/testing/sysfs-kernel-kexec-kdump | 10 ++++ kernel/kexec_core.c | 50 +++++++++++++++++++ 2 files changed, 60 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump index 96b24565b68e..320ec75a4903 100644 --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -41,3 +41,13 @@ Description: read only is used by the user space utility kexec to support updating the in-kernel kdump image during hotplug operations. User: Kexec tools + +What: /sys/kernel/kexec/crash_cma_ranges +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Provides information about the memory ranges reserved from + the Contiguous Memory Allocator (CMA) area that are allocated + to the crash (kdump) kernel. It lists the start and end physical + addresses of CMA regions assigned for crashkernel use.
+User: kdump service diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index fa00b239c5d9..51b1e0985eac 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -41,6 +41,7 @@ #include #include #include +#include #include #include @@ -1229,3 +1230,52 @@ int kernel_kexec(void) kexec_unlock(); return error; } + +#ifdef CONFIG_CRASH_RESERVE +static ssize_t crash_cma_ranges_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + + ssize_t len = 0; + int i; + + for (i = 0; i < crashk_cma_cnt; ++i) { + len += sysfs_emit_at(buf, len, "%08llx-%08llx\n", + crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end); + } + return len; +} +static struct kobj_attribute crash_cma_ranges_attr = __ATTR_RO(crash_cma_ranges); + +static struct attribute *kexec_attrs[] = { + &crash_cma_ranges_attr.attr, + NULL +}; + +static struct kobject *kexec_kobj; +ATTRIBUTE_GROUPS(kexec); + +static int __init init_kexec_sysctl(void) +{ + int error; + + kexec_kobj = kobject_create_and_add("kexec", kernel_kobj); + if (!kexec_kobj) { + pr_err("failed to create kexec kobject\n"); + return -ENOMEM; + } + + error = sysfs_create_groups(kexec_kobj, kexec_groups); + if (error) + goto kset_exit; + + return 0; + +kset_exit: + kobject_put(kexec_kobj); + return error; +} + +subsys_initcall(init_kexec_sysctl); +#endif /* CONFIG_CRASH_RESERVE */ -- 2.51.1 From sourabhjain at linux.ibm.com Sun Nov 16 20:47:06 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 17 Nov 2025 10:17:06 +0530 Subject: [PATCH v5 1/3] kexec: move sysfs entries to /sys/kernel/kexec In-Reply-To: <20251117044708.1337558-1-sourabhjain@linux.ibm.com> References: <20251117044708.1337558-1-sourabhjain@linux.ibm.com> Message-ID: <20251117044708.1337558-2-sourabhjain@linux.ibm.com> Several kexec and kdump sysfs entries are currently placed directly under /sys/kernel/, which clutters the directory and makes it harder to identify unrelated entries. 
To improve organization and readability, these entries are now moved under a dedicated directory, /sys/kernel/kexec. For backward compatibility, symlinks are created at the old locations so that existing tools and scripts continue to work. These symlinks can be removed in the future once users have switched to the new path. While creating symlinks, entries are added in /sys/kernel/ that point to their new locations under /sys/kernel/kexec/. If an error occurs while adding a symlink, it is logged but does not stop initialization of the remaining kexec sysfs symlinks. The /sys/kernel/ entry is now controlled by CONFIG_CRASH_DUMP instead of CONFIG_VMCORE_INFO, as CONFIG_CRASH_DUMP also enables CONFIG_VMCORE_INFO. Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- kernel/kexec_core.c | 91 ++++++++++++++++++++++++++++++++++++++++++++- kernel/ksysfs.c | 68 +-------------------------------- 2 files changed, 91 insertions(+), 68 deletions(-) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 51b1e0985eac..b90d48f77dfb 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -1231,6 +1231,47 @@ int kernel_kexec(void) return error; } +static ssize_t loaded_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", !!kexec_image); +} +static struct kobj_attribute loaded_attr = __ATTR_RO(loaded); + +#ifdef CONFIG_CRASH_DUMP +static ssize_t crash_loaded_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", kexec_crash_loaded()); +} +static struct kobj_attribute crash_loaded_attr = __ATTR_RO(crash_loaded); + +static ssize_t crash_size_show(struct kobject *kobj, + struct kobj_attribute *attr, char 
*buf) +{ + ssize_t size = crash_get_memory_size(); + + if (size < 0) + return size; + + return sysfs_emit(buf, "%zd\n", size); +} + +static ssize_t crash_size_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned long cnt; + int ret; + + if (kstrtoul(buf, 0, &cnt)) + return -EINVAL; + + ret = crash_shrink_memory(cnt); + return ret < 0 ? ret : count; +} +static struct kobj_attribute crash_size_attr = __ATTR_RW(crash_size); + #ifdef CONFIG_CRASH_RESERVE static ssize_t crash_cma_ranges_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) @@ -1247,18 +1288,59 @@ static ssize_t crash_cma_ranges_show(struct kobject *kobj, return len; } static struct kobj_attribute crash_cma_ranges_attr = __ATTR_RO(crash_cma_ranges); +#endif /* CONFIG_CRASH_RESERVE */ + +#ifdef CONFIG_CRASH_HOTPLUG +static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + unsigned int sz = crash_get_elfcorehdr_size(); + + return sysfs_emit(buf, "%u\n", sz); +} +static struct kobj_attribute crash_elfcorehdr_size_attr = __ATTR_RO(crash_elfcorehdr_size); +#endif /* CONFIG_CRASH_HOTPLUG */ +#endif /* CONFIG_CRASH_DUMP */ static struct attribute *kexec_attrs[] = { + &loaded_attr.attr, +#ifdef CONFIG_CRASH_DUMP + &crash_loaded_attr.attr, + &crash_size_attr.attr, +#ifdef CONFIG_CRASH_RESERVE &crash_cma_ranges_attr.attr, +#endif +#ifdef CONFIG_CRASH_HOTPLUG + &crash_elfcorehdr_size_attr.attr, +#endif +#endif NULL }; +struct kexec_link_entry { + const char *target; + const char *name; +}; + +static struct kexec_link_entry kexec_links[] = { + { "loaded", "kexec_loaded" }, +#ifdef CONFIG_CRASH_DUMP + { "crash_loaded", "kexec_crash_loaded" }, + { "crash_size", "kexec_crash_size" }, +#ifdef CONFIG_CRASH_HOTPLUG + { "crash_elfcorehdr_size", "crash_elfcorehdr_size" }, +#endif +#endif + +}; + static struct kobject *kexec_kobj; ATTRIBUTE_GROUPS(kexec); static int __init init_kexec_sysctl(void) { int 
error; + int i; kexec_kobj = kobject_create_and_add("kexec", kernel_kobj); if (!kexec_kobj) { @@ -1270,6 +1352,14 @@ static int __init init_kexec_sysctl(void) if (error) goto kset_exit; + for (i = 0; i < ARRAY_SIZE(kexec_links); i++) { + error = compat_only_sysfs_link_entry_to_kobj(kernel_kobj, kexec_kobj, + kexec_links[i].target, + kexec_links[i].name); + if (error) + pr_err("Unable to create %s symlink (%d)", kexec_links[i].name, error); + } + return 0; kset_exit: @@ -1278,4 +1368,3 @@ static int __init init_kexec_sysctl(void) } subsys_initcall(init_kexec_sysctl); -#endif /* CONFIG_CRASH_RESERVE */ diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c index eefb67d9883c..a9e6354d9e25 100644 --- a/kernel/ksysfs.c +++ b/kernel/ksysfs.c @@ -12,7 +12,7 @@ #include #include #include -#include +#include #include #include #include @@ -119,50 +119,6 @@ static ssize_t profiling_store(struct kobject *kobj, KERNEL_ATTR_RW(profiling); #endif -#ifdef CONFIG_KEXEC_CORE -static ssize_t kexec_loaded_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - return sysfs_emit(buf, "%d\n", !!kexec_image); -} -KERNEL_ATTR_RO(kexec_loaded); - -#ifdef CONFIG_CRASH_DUMP -static ssize_t kexec_crash_loaded_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - return sysfs_emit(buf, "%d\n", kexec_crash_loaded()); -} -KERNEL_ATTR_RO(kexec_crash_loaded); - -static ssize_t kexec_crash_size_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - ssize_t size = crash_get_memory_size(); - - if (size < 0) - return size; - - return sysfs_emit(buf, "%zd\n", size); -} -static ssize_t kexec_crash_size_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count) -{ - unsigned long cnt; - int ret; - - if (kstrtoul(buf, 0, &cnt)) - return -EINVAL; - - ret = crash_shrink_memory(cnt); - return ret < 0 ? 
ret : count; -} -KERNEL_ATTR_RW(kexec_crash_size); - -#endif /* CONFIG_CRASH_DUMP*/ -#endif /* CONFIG_KEXEC_CORE */ - #ifdef CONFIG_VMCORE_INFO static ssize_t vmcoreinfo_show(struct kobject *kobj, @@ -174,18 +130,6 @@ static ssize_t vmcoreinfo_show(struct kobject *kobj, } KERNEL_ATTR_RO(vmcoreinfo); -#ifdef CONFIG_CRASH_HOTPLUG -static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - unsigned int sz = crash_get_elfcorehdr_size(); - - return sysfs_emit(buf, "%u\n", sz); -} -KERNEL_ATTR_RO(crash_elfcorehdr_size); - -#endif - #endif /* CONFIG_VMCORE_INFO */ /* whether file capabilities are enabled */ @@ -255,18 +199,8 @@ static struct attribute * kernel_attrs[] = { #ifdef CONFIG_PROFILING &profiling_attr.attr, #endif -#ifdef CONFIG_KEXEC_CORE - &kexec_loaded_attr.attr, -#ifdef CONFIG_CRASH_DUMP - &kexec_crash_loaded_attr.attr, - &kexec_crash_size_attr.attr, -#endif -#endif #ifdef CONFIG_VMCORE_INFO &vmcoreinfo_attr.attr, -#ifdef CONFIG_CRASH_HOTPLUG - &crash_elfcorehdr_size_attr.attr, -#endif #endif #ifndef CONFIG_TINY_RCU &rcu_expedited_attr.attr, -- 2.51.1 From sourabhjain at linux.ibm.com Sun Nov 16 20:47:07 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 17 Nov 2025 10:17:07 +0530 Subject: [PATCH v5 2/3] Documentation/ABI: mark old kexec sysfs deprecated In-Reply-To: <20251117044708.1337558-1-sourabhjain@linux.ibm.com> References: <20251117044708.1337558-1-sourabhjain@linux.ibm.com> Message-ID: <20251117044708.1337558-3-sourabhjain@linux.ibm.com> The previous commit ("kexec: move sysfs entries to /sys/kernel/kexec") moved all existing kexec sysfs entries to a new location. The ABI document is updated to include a note about the deprecation of the old kexec sysfs entries. 
The following kexec sysfs entries are deprecated: - /sys/kernel/kexec_loaded - /sys/kernel/kexec_crash_loaded - /sys/kernel/kexec_crash_size - /sys/kernel/crash_elfcorehdr_size Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../ABI/obsolete/sysfs-kernel-kexec-kdump | 59 +++++++++++++++++++ .../ABI/testing/sysfs-kernel-kexec-kdump | 44 -------------- 2 files changed, 59 insertions(+), 44 deletions(-) create mode 100644 Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump diff --git a/Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump b/Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump new file mode 100644 index 000000000000..96b4d41721cc --- /dev/null +++ b/Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump @@ -0,0 +1,59 @@ +NOTE: all the ABIs listed in this file are deprecated and will be removed after 2028. 
+ +Here are the alternative ABIs: ++------------------------------------+-----------------------------------------+ +| Deprecated | Alternative | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_loaded | /sys/kernel/kexec/loaded | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_crash_loaded | /sys/kernel/kexec/crash_loaded | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_crash_size | /sys/kernel/kexec/crash_size | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/crash_elfcorehdr_size | /sys/kernel/kexec/crash_elfcorehdr_size | ++------------------------------------+-----------------------------------------+ + + +What: /sys/kernel/kexec_loaded +Date: Jun 2006 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a new kernel image has been loaded + into memory using the kexec system call. It shows 1 if + a kexec image is present and ready to boot, or 0 if none + is loaded. +User: kexec tools, kdump service + +What: /sys/kernel/kexec_crash_loaded +Date: Jun 2006 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a crash (kdump) kernel is currently + loaded into memory. It shows 1 if a crash kernel has been + successfully loaded for panic handling, or 0 if no crash + kernel is present. +User: Kexec tools, Kdump service + +What: /sys/kernel/kexec_crash_size +Date: Dec 2009 +Contact: kexec at lists.infradead.org +Description: read/write + Shows the amount of memory reserved for loading the crash + (kdump) kernel. It reports the size, in bytes, of the + crash kernel area defined by the crashkernel= parameter. + This interface also allows reducing the crashkernel + reservation by writing a smaller value, and the reclaimed + space is added back to the system RAM. 
+User: Kdump service + +What: /sys/kernel/crash_elfcorehdr_size +Date: Aug 2023 +Contact: kexec at lists.infradead.org +Description: read only + Indicates the preferred size of the memory buffer for the + ELF core header used by the crash (kdump) kernel. It defines + how much space is needed to hold metadata about the crashed + system, including CPU and memory information. This information + is used by the user space utility kexec to support updating the + in-kernel kdump image during hotplug operations. +User: Kexec tools diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump index 320ec75a4903..7e5e528665db 100644 --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -1,47 +1,3 @@ -What: /sys/kernel/kexec_loaded -Date: Jun 2006 -Contact: kexec at lists.infradead.org -Description: read only - Indicates whether a new kernel image has been loaded - into memory using the kexec system call. It shows 1 if - a kexec image is present and ready to boot, or 0 if none - is loaded. -User: kexec tools, kdump service - -What: /sys/kernel/kexec_crash_loaded -Date: Jun 2006 -Contact: kexec at lists.infradead.org -Description: read only - Indicates whether a crash (kdump) kernel is currently - loaded into memory. It shows 1 if a crash kernel has been - successfully loaded for panic handling, or 0 if no crash - kernel is present. -User: Kexec tools, Kdump service - -What: /sys/kernel/kexec_crash_size -Date: Dec 2009 -Contact: kexec at lists.infradead.org -Description: read/write - Shows the amount of memory reserved for loading the crash - (kdump) kernel. It reports the size, in bytes, of the - crash kernel area defined by the crashkernel= parameter. - This interface also allows reducing the crashkernel - reservation by writing a smaller value, and the reclaimed - space is added back to the system RAM. 
-User: Kdump service - -What: /sys/kernel/crash_elfcorehdr_size -Date: Aug 2023 -Contact: kexec at lists.infradead.org -Description: read only - Indicates the preferred size of the memory buffer for the - ELF core header used by the crash (kdump) kernel. It defines - how much space is needed to hold metadata about the crashed - system, including CPU and memory information. This information - is used by the user space utility kexec to support updating the - in-kernel kdump image during hotplug operations. -User: Kexec tools - What: /sys/kernel/kexec/crash_cma_ranges Date: Nov 2025 Contact: kexec at lists.infradead.org -- 2.51.1 From sourabhjain at linux.ibm.com Sun Nov 16 20:47:08 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Mon, 17 Nov 2025 10:17:08 +0530 Subject: [PATCH v5 3/3] kexec: document new kexec and kdump sysfs ABIs In-Reply-To: <20251117044708.1337558-1-sourabhjain@linux.ibm.com> References: <20251117044708.1337558-1-sourabhjain@linux.ibm.com> Message-ID: <20251117044708.1337558-4-sourabhjain@linux.ibm.com> Add an ABI document for following kexec and kdump sysfs interface: - /sys/kernel/kexec/loaded - /sys/kernel/kexec/crash_loaded - /sys/kernel/kexec/crash_size - /sys/kernel/kexec/crash_elfcorehdr_size Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../ABI/testing/sysfs-kernel-kexec-kdump | 52 +++++++++++++++++++ 1 file changed, 52 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump index 7e5e528665db..f59051b5d96d 100644 --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -1,3 +1,55 @@ +What: /sys/kernel/kexec/* +Date: Nov 
2025 +Contact: kexec at lists.infradead.org +Description: + The /sys/kernel/kexec/* directory contains sysfs files + that provide information about the configuration status + of kexec and kdump. + +What: /sys/kernel/kexec/loaded +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a new kernel image has been loaded + into memory using the kexec system call. It shows 1 if + a kexec image is present and ready to boot, or 0 if none + is loaded. +User: kexec tools, kdump service + +What: /sys/kernel/kexec/crash_loaded +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a crash (kdump) kernel is currently + loaded into memory. It shows 1 if a crash kernel has been + successfully loaded for panic handling, or 0 if no crash + kernel is present. +User: Kexec tools, Kdump service + +What: /sys/kernel/kexec/crash_size +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read/write + Shows the amount of memory reserved for loading the crash + (kdump) kernel. It reports the size, in bytes, of the + crash kernel area defined by the crashkernel= parameter. + This interface also allows reducing the crashkernel + reservation by writing a smaller value, and the reclaimed + space is added back to the system RAM. +User: Kdump service + +What: /sys/kernel/kexec/crash_elfcorehdr_size +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates the preferred size of the memory buffer for the + ELF core header used by the crash (kdump) kernel. It defines + how much space is needed to hold metadata about the crashed + system, including CPU and memory information. This information + is used by the user space utility kexec to support updating the + in-kernel kdump image during hotplug operations. 
+User: Kexec tools + What: /sys/kernel/kexec/crash_cma_ranges Date: Nov 2025 Contact: kexec at lists.infradead.org -- 2.51.1 From akpm at linux-foundation.org Mon Nov 17 09:42:11 2025 From: akpm at linux-foundation.org (Andrew Morton) Date: Mon, 17 Nov 2025 09:42:11 -0800 Subject: [PATCH v5] crash: export crashkernel CMA reservation to userspace In-Reply-To: <20251117041905.1277801-1-sourabhjain@linux.ibm.com> References: <20251117041905.1277801-1-sourabhjain@linux.ibm.com> Message-ID: <20251117094211.f8b4426ddda3bc0db5a62624@linux-foundation.org> On Mon, 17 Nov 2025 09:49:05 +0530 Sourabh Jain wrote: > Add a sysfs entry /sys/kernel/kexec/crash_cma_ranges to expose all > CMA crashkernel ranges. > > This allows userspace tools configuring kdump to determine how much > memory is reserved for crashkernel. If CMA is used, tools can warn > users when attempting to capture user pages with CMA reservation. > > The new sysfs hold the CMA ranges in below format: > > cat /sys/kernel/kexec/crash_cma_ranges > 100000000-10c7fffff > > There are already four kexec and kdump sysfs entries under /sys/kernel.
> Adding more entries there would clutter the directory. To avoid this, > the new crash_cma_ranges sysfs entry is placed in a new kexec node under > /sys/kernel/. I suggest not creating /sys/kernel/kexec in this patch. Moving everything into a new /sys/kernel/kexec is a separate patchset and a separate concept and it might never be merged - it changes ABI! So let's put crash_cma_ranges in /sys/kernel and move it to /sys/kernel/kexec within the other patchset. From sourabhjain at linux.ibm.com Mon Nov 17 10:33:54 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Tue, 18 Nov 2025 00:03:54 +0530 Subject: [PATCH v5] crash: export crashkernel CMA reservation to userspace In-Reply-To: <20251117094211.f8b4426ddda3bc0db5a62624@linux-foundation.org> References: <20251117041905.1277801-1-sourabhjain@linux.ibm.com> <20251117094211.f8b4426ddda3bc0db5a62624@linux-foundation.org> Message-ID: <469c97cb-5ea1-4c2b-a70f-b1a6febf70df@linux.ibm.com> On 17/11/25 23:12, Andrew Morton wrote: > On Mon, 17 Nov 2025 09:49:05 +0530 Sourabh Jain wrote: > >> Add a sysfs entry /sys/kernel/kexec/crash_cma_ranges to expose all >> CMA crashkernel ranges. >> >> This allows userspace tools configuring kdump to determine how much >> memory is reserved for crashkernel. If CMA is used, tools can warn >> users when attempting to capture user pages with CMA reservation. >> >> The new sysfs hold the CMA ranges in below format: >> >> cat /sys/kernel/kexec/crash_cma_ranges >> 100000000-10c7fffff >> >> There are already four kexec and kdump sysfs entries under /sys/kernel. >> Adding more entries there would clutter the directory. To avoid this, >> the new crash_cma_ranges sysfs entry is placed in a new kexec node under >> /sys/kernel/. > I suggest not creating /sys/kernel/kexec in this patch. > > Moving everything into a new /sys/kernel/kexec is a separate patchset > and a separate concept and it might never be merged - it changes ABI! 
> > So let's put crash_cma_ranges in /sys/kernel and move it to > /sys/kernel/kexec within the other patchset. Yeah sure. I will send the patches accordingly. Thanks, Sourabh Jain

From sourabhjain at linux.ibm.com Mon Nov 17 23:10:23 2025
From: sourabhjain at linux.ibm.com (Sourabh Jain)
Date: Tue, 18 Nov 2025 12:40:23 +0530
Subject: [PATCH v6] crash: export crashkernel CMA reservation to userspace
Message-ID: <20251118071023.1673329-1-sourabhjain@linux.ibm.com>

Add a sysfs entry /sys/kernel/kexec_crash_cma_ranges to expose all CMA crashkernel ranges.

This allows userspace tools configuring kdump to determine how much memory is reserved for crashkernel. If CMA is used, tools can warn users when attempting to capture user pages with CMA reservation.

The new sysfs entry holds the CMA ranges in the format below:

cat /sys/kernel/kexec_crash_cma_ranges
100000000-10c7fffff

The reason for not including Crash CMA Ranges in /proc/iomem is to avoid conflicts. It has been observed that contiguous memory ranges are sometimes shown as two separate System RAM entries in /proc/iomem. If a CMA range overlaps two System RAM ranges, adding crashk_res to /proc/iomem can create a conflict. Reference [1] describes one such instance on the PowerPC architecture.

Link: https://lore.kernel.org/all/20251016142831.144515-1-sourabhjain at linux.ibm.com/ [1] Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- Changelog: v4 -> v5: https://lore.kernel.org/all/20251114152550.ac2dd5e23542f09c62defec7 at linux-foundation.org/ - Split patch from the above patch series. - Code to create the kexec node under /sys/kernel is added; earlier it was done in [02/05] of the above patch series.
v5 -> v6: - Add Crash CMA Range sysfs interface under /sys/kernel Note: This patch is dependent on the below patch: https://lore.kernel.org/all/20251117035153.1199665-1-sourabhjain at linux.ibm.com/ --- .../ABI/testing/sysfs-kernel-kexec-kdump | 10 +++++++++ kernel/ksysfs.c | 21 +++++++++++++++++++ 2 files changed, 31 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump index 96b24565b68e..f6089e38de5f 100644 --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -41,3 +41,13 @@ Description: read only is used by the user space utility kexec to support updating the in-kernel kdump image during hotplug operations. User: Kexec tools + +What: /sys/kernel/kexec_crash_cma_ranges +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Provides information about the memory ranges reserved from + the Contiguous Memory Allocator (CMA) area that are allocated + to the crash (kdump) kernel. It lists the start and end physical + addresses of CMA regions assigned for crashkernel use. 
+User: kdump service diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c index eefb67d9883c..0ff2179bc603 100644 --- a/kernel/ksysfs.c +++ b/kernel/ksysfs.c @@ -135,6 +135,24 @@ static ssize_t kexec_crash_loaded_show(struct kobject *kobj, } KERNEL_ATTR_RO(kexec_crash_loaded); +#ifdef CONFIG_CRASH_RESERVE +static ssize_t kexec_crash_cma_ranges_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + + ssize_t len = 0; + int i; + + for (i = 0; i < crashk_cma_cnt; ++i) { + len += sysfs_emit_at(buf, len, "%08llx-%08llx\n", + crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end); + } + return len; +} +KERNEL_ATTR_RO(kexec_crash_cma_ranges); +#endif /* CONFIG_CRASH_RESERVE */ + static ssize_t kexec_crash_size_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { @@ -260,6 +278,9 @@ static struct attribute * kernel_attrs[] = { #ifdef CONFIG_CRASH_DUMP &kexec_crash_loaded_attr.attr, &kexec_crash_size_attr.attr, +#ifdef CONFIG_CRASH_RESERVE + &kexec_crash_cma_ranges_attr.attr, +#endif #endif #endif #ifdef CONFIG_VMCORE_INFO -- 2.51.1 From sourabhjain at linux.ibm.com Tue Nov 18 03:45:04 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Tue, 18 Nov 2025 17:15:04 +0530 Subject: [PATCH v6 0/3] kexec: reorganize kexec and kdump sysfs Message-ID: <20251118114507.1769455-1-sourabhjain@linux.ibm.com> All existing kexec and kdump sysfs entries are moved to a new location, /sys/kernel/kexec, to keep /sys/kernel/ clean and better organized. Symlinks are created at the old locations for backward compatibility and can be removed in the future [01/03]. While doing this cleanup, the old kexec and kdump sysfs entries are marked as deprecated in the existing ABI documentation [02/03]. This makes it clear that these older interfaces should no longer be used. New ABI documentation is added to describe the reorganized interfaces [03/03], so users and tools can rely on the updated sysfs interfaces going forward. 
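During the transition window a tool that wants to run on both old and new kernels can probe the relocated path first and fall back to the legacy one; while the compatibility symlinks exist this is redundant, but it keeps working after they are removed. A small userspace C sketch; the helper name `kexec_sysfs_path` is made up for illustration:

```c
#include <stddef.h>
#include <unistd.h>

/*
 * Return whichever of the two candidate sysfs paths exists, preferring
 * the new /sys/kernel/kexec/ location; NULL if neither is present
 * (e.g. a kernel built without kexec support).
 */
const char *kexec_sysfs_path(const char *newpath, const char *oldpath)
{
	if (access(newpath, F_OK) == 0)
		return newpath;
	if (access(oldpath, F_OK) == 0)
		return oldpath;
	return NULL;
}
```

A caller would use it as `kexec_sysfs_path("/sys/kernel/kexec/loaded", "/sys/kernel/kexec_loaded")` and then read whichever file is returned.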
Changelog: --------- v4 -> v5: https://lore.kernel.org/all/20251114152550.ac2dd5e23542f09c62defec7 at linux-foundation.org/ - Split patch series out of the above patch series v5 -> v6: - Move /sys/kernel/kexec_crash_cma_ranges also to new /sys/kernel/kexec node - Update commit messages Note: This patch series is dependent on the patches: https://lore.kernel.org/all/20251117035153.1199665-1-sourabhjain at linux.ibm.com/ https://lore.kernel.org/all/20251118071023.1673329-1-sourabhjain at linux.ibm.com/ Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Sourabh Jain Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Sourabh Jain (3): kexec: move sysfs entries to /sys/kernel/kexec Documentation/ABI: mark old kexec sysfs deprecated Documentation/ABI: new kexec and kdump sysfs interface .../ABI/obsolete/sysfs-kernel-kexec-kdump | 71 +++++++++ .../ABI/testing/sysfs-kernel-kexec-kdump | 26 ++-- kernel/kexec_core.c | 141 ++++++++++++++++++ kernel/ksysfs.c | 89 +---------- 4 files changed, 230 insertions(+), 97 deletions(-) create mode 100644 Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump -- 2.51.1 From sourabhjain at linux.ibm.com Tue Nov 18 03:45:05 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Tue, 18 Nov 2025 17:15:05 +0530 Subject: [PATCH v6 1/3] kexec: move sysfs entries to /sys/kernel/kexec In-Reply-To: <20251118114507.1769455-1-sourabhjain@linux.ibm.com> References: <20251118114507.1769455-1-sourabhjain@linux.ibm.com> Message-ID: <20251118114507.1769455-2-sourabhjain@linux.ibm.com> Several kexec and kdump sysfs entries are currently placed directly under /sys/kernel/, which clutters the directory and makes it harder to identify unrelated entries.
To improve organization and readability, these entries are now moved under a dedicated directory, /sys/kernel/kexec. The following sysfs entries are moved under the new kexec sysfs node: +---------------------------+---------------------------+ | Old sysfs name | New sysfs name | | (under /sys/kernel) | (under /sys/kernel/kexec) | +---------------------------+---------------------------+ | kexec_loaded | loaded | +---------------------------+---------------------------+ | kexec_crash_loaded | crash_loaded | +---------------------------+---------------------------+ | kexec_crash_size | crash_size | +---------------------------+---------------------------+ | crash_elfcorehdr_size | crash_elfcorehdr_size | +---------------------------+---------------------------+ | kexec_crash_cma_ranges | crash_cma_ranges | +---------------------------+---------------------------+ For backward compatibility, symlinks are created at the old locations so that existing tools and scripts continue to work. These symlinks can be removed in the future once users have switched to the new path. While creating symlinks, entries are added in /sys/kernel/ that point to their new locations under /sys/kernel/kexec/. If an error occurs while adding a symlink, it is logged but does not stop initialization of the remaining kexec sysfs symlinks. The crash_elfcorehdr_size entry is now controlled by CONFIG_CRASH_DUMP instead of CONFIG_VMCORE_INFO, as CONFIG_CRASH_DUMP also enables CONFIG_VMCORE_INFO.
Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- kernel/kexec_core.c | 141 ++++++++++++++++++++++++++++++++++++++++++++ kernel/ksysfs.c | 89 +--------------------------- 2 files changed, 142 insertions(+), 88 deletions(-) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index fa00b239c5d9..02429499fb64 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -41,6 +41,7 @@ #include #include #include +#include #include #include @@ -1229,3 +1230,143 @@ int kernel_kexec(void) kexec_unlock(); return error; } + +static ssize_t loaded_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", !!kexec_image); +} +static struct kobj_attribute loaded_attr = __ATTR_RO(loaded); + +#ifdef CONFIG_CRASH_DUMP +static ssize_t crash_loaded_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "%d\n", kexec_crash_loaded()); +} +static struct kobj_attribute crash_loaded_attr = __ATTR_RO(crash_loaded); + +#ifdef CONFIG_CRASH_RESERVE +static ssize_t crash_cma_ranges_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + + ssize_t len = 0; + int i; + + for (i = 0; i < crashk_cma_cnt; ++i) { + len += sysfs_emit_at(buf, len, "%08llx-%08llx\n", + crashk_cma_ranges[i].start, + crashk_cma_ranges[i].end); + } + return len; +} +static struct kobj_attribute crash_cma_ranges_attr = __ATTR_RO(crash_cma_ranges); +#endif + +static ssize_t crash_size_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + ssize_t size = crash_get_memory_size(); + + if (size < 0) + return size; + + return sysfs_emit(buf, "%zd\n", size); +} +static ssize_t crash_size_store(struct kobject *kobj, + struct 
kobj_attribute *attr, + const char *buf, size_t count) +{ + unsigned long cnt; + int ret; + + if (kstrtoul(buf, 0, &cnt)) + return -EINVAL; + + ret = crash_shrink_memory(cnt); + return ret < 0 ? ret : count; +} +static struct kobj_attribute crash_size_attr = __ATTR_RW(crash_size); + +#ifdef CONFIG_CRASH_HOTPLUG +static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + unsigned int sz = crash_get_elfcorehdr_size(); + + return sysfs_emit(buf, "%u\n", sz); +} +static struct kobj_attribute crash_elfcorehdr_size_attr = __ATTR_RO(crash_elfcorehdr_size); + +#endif /* CONFIG_CRASH_HOTPLUG */ +#endif /* CONFIG_CRASH_DUMP */ + +static struct attribute *kexec_attrs[] = { + &loaded_attr.attr, +#ifdef CONFIG_CRASH_DUMP + &crash_loaded_attr.attr, + &crash_size_attr.attr, +#ifdef CONFIG_CRASH_RESERVE + &crash_cma_ranges_attr.attr, +#endif +#ifdef CONFIG_CRASH_HOTPLUG + &crash_elfcorehdr_size_attr.attr, +#endif +#endif + NULL +}; + +struct kexec_link_entry { + const char *target; + const char *name; +}; + +static struct kexec_link_entry kexec_links[] = { + { "loaded", "kexec_loaded" }, +#ifdef CONFIG_CRASH_DUMP + { "crash_loaded", "kexec_crash_loaded" }, + { "crash_size", "kexec_crash_size" }, +#ifdef CONFIG_CRASH_RESERVE + {"crash_cma_ranges", "kexec_crash_cma_ranges"}, +#endif +#ifdef CONFIG_CRASH_HOTPLUG + { "crash_elfcorehdr_size", "crash_elfcorehdr_size" }, +#endif +#endif +}; + +static struct kobject *kexec_kobj; +ATTRIBUTE_GROUPS(kexec); + +static int __init init_kexec_sysctl(void) +{ + int error; + int i; + + kexec_kobj = kobject_create_and_add("kexec", kernel_kobj); + if (!kexec_kobj) { + pr_err("failed to create kexec kobject\n"); + return -ENOMEM; + } + + error = sysfs_create_groups(kexec_kobj, kexec_groups); + if (error) + goto kset_exit; + + for (i = 0; i < ARRAY_SIZE(kexec_links); i++) { + error = compat_only_sysfs_link_entry_to_kobj(kernel_kobj, kexec_kobj, + kexec_links[i].target, + kexec_links[i].name); + if 
(error) + pr_err("Unable to create %s symlink (%d)", kexec_links[i].name, error); + } + + return 0; + +kset_exit: + kobject_put(kexec_kobj); + return error; +} + +subsys_initcall(init_kexec_sysctl); diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c index 0ff2179bc603..a9e6354d9e25 100644 --- a/kernel/ksysfs.c +++ b/kernel/ksysfs.c @@ -12,7 +12,7 @@ #include #include #include -#include +#include #include #include #include @@ -119,68 +119,6 @@ static ssize_t profiling_store(struct kobject *kobj, KERNEL_ATTR_RW(profiling); #endif -#ifdef CONFIG_KEXEC_CORE -static ssize_t kexec_loaded_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - return sysfs_emit(buf, "%d\n", !!kexec_image); -} -KERNEL_ATTR_RO(kexec_loaded); - -#ifdef CONFIG_CRASH_DUMP -static ssize_t kexec_crash_loaded_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - return sysfs_emit(buf, "%d\n", kexec_crash_loaded()); -} -KERNEL_ATTR_RO(kexec_crash_loaded); - -#ifdef CONFIG_CRASH_RESERVE -static ssize_t kexec_crash_cma_ranges_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - - ssize_t len = 0; - int i; - - for (i = 0; i < crashk_cma_cnt; ++i) { - len += sysfs_emit_at(buf, len, "%08llx-%08llx\n", - crashk_cma_ranges[i].start, - crashk_cma_ranges[i].end); - } - return len; -} -KERNEL_ATTR_RO(kexec_crash_cma_ranges); -#endif /* CONFIG_CRASH_RESERVE */ - -static ssize_t kexec_crash_size_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - ssize_t size = crash_get_memory_size(); - - if (size < 0) - return size; - - return sysfs_emit(buf, "%zd\n", size); -} -static ssize_t kexec_crash_size_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count) -{ - unsigned long cnt; - int ret; - - if (kstrtoul(buf, 0, &cnt)) - return -EINVAL; - - ret = crash_shrink_memory(cnt); - return ret < 0 ? 
ret : count; -} -KERNEL_ATTR_RW(kexec_crash_size); - -#endif /* CONFIG_CRASH_DUMP*/ -#endif /* CONFIG_KEXEC_CORE */ - #ifdef CONFIG_VMCORE_INFO static ssize_t vmcoreinfo_show(struct kobject *kobj, @@ -192,18 +130,6 @@ static ssize_t vmcoreinfo_show(struct kobject *kobj, } KERNEL_ATTR_RO(vmcoreinfo); -#ifdef CONFIG_CRASH_HOTPLUG -static ssize_t crash_elfcorehdr_size_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - unsigned int sz = crash_get_elfcorehdr_size(); - - return sysfs_emit(buf, "%u\n", sz); -} -KERNEL_ATTR_RO(crash_elfcorehdr_size); - -#endif - #endif /* CONFIG_VMCORE_INFO */ /* whether file capabilities are enabled */ @@ -273,21 +199,8 @@ static struct attribute * kernel_attrs[] = { #ifdef CONFIG_PROFILING &profiling_attr.attr, #endif -#ifdef CONFIG_KEXEC_CORE - &kexec_loaded_attr.attr, -#ifdef CONFIG_CRASH_DUMP - &kexec_crash_loaded_attr.attr, - &kexec_crash_size_attr.attr, -#ifdef CONFIG_CRASH_RESERVE - &kexec_crash_cma_ranges_attr.attr, -#endif -#endif -#endif #ifdef CONFIG_VMCORE_INFO &vmcoreinfo_attr.attr, -#ifdef CONFIG_CRASH_HOTPLUG - &crash_elfcorehdr_size_attr.attr, -#endif #endif #ifndef CONFIG_TINY_RCU &rcu_expedited_attr.attr, -- 2.51.1 From sourabhjain at linux.ibm.com Tue Nov 18 03:45:06 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Tue, 18 Nov 2025 17:15:06 +0530 Subject: [PATCH v6 2/3] Documentation/ABI: mark old kexec sysfs deprecated In-Reply-To: <20251118114507.1769455-1-sourabhjain@linux.ibm.com> References: <20251118114507.1769455-1-sourabhjain@linux.ibm.com> Message-ID: <20251118114507.1769455-3-sourabhjain@linux.ibm.com> The previous commit ("kexec: move sysfs entries to /sys/kernel/kexec") moved all existing kexec sysfs entries to a new location. The ABI document is updated to include a note about the deprecation of the old kexec sysfs entries. 
The following kexec sysfs entries are deprecated: - /sys/kernel/kexec_loaded - /sys/kernel/kexec_crash_loaded - /sys/kernel/kexec_crash_size - /sys/kernel/crash_elfcorehdr_size - /sys/kernel/kexec_crash_cma_ranges Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../sysfs-kernel-kexec-kdump | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) rename Documentation/ABI/{testing => obsolete}/sysfs-kernel-kexec-kdump (63%) diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump similarity index 63% rename from Documentation/ABI/testing/sysfs-kernel-kexec-kdump rename to Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump index f6089e38de5f..ba26a6a1d2be 100644 --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump +++ b/Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump @@ -1,3 +1,21 @@ +NOTE: all the ABIs listed in this file are deprecated and will be removed after 2028. 
+ +Here are the alternative ABIs: ++------------------------------------+-----------------------------------------+ +| Deprecated | Alternative | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_loaded | /sys/kernel/kexec/loaded | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_crash_loaded | /sys/kernel/kexec/crash_loaded | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_crash_size | /sys/kernel/kexec/crash_size | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/crash_elfcorehdr_size | /sys/kernel/kexec/crash_elfcorehdr_size | ++------------------------------------+-----------------------------------------+ +| /sys/kernel/kexec_crash_cma_ranges | /sys/kernel/kexec/crash_cma_ranges | ++------------------------------------+-----------------------------------------+ + + What: /sys/kernel/kexec_loaded Date: Jun 2006 Contact: kexec at lists.infradead.org -- 2.51.1 From sourabhjain at linux.ibm.com Tue Nov 18 03:45:07 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Tue, 18 Nov 2025 17:15:07 +0530 Subject: [PATCH v6 3/3] Documentation/ABI: new kexec and kdump sysfs interface In-Reply-To: <20251118114507.1769455-1-sourabhjain@linux.ibm.com> References: <20251118114507.1769455-1-sourabhjain@linux.ibm.com> Message-ID: <20251118114507.1769455-4-sourabhjain@linux.ibm.com> Add an ABI document for following kexec and kdump sysfs interface: - /sys/kernel/kexec/loaded - /sys/kernel/kexec/crash_loaded - /sys/kernel/kexec/crash_size - /sys/kernel/kexec/crash_elfcorehdr_size - /sys/kernel/kexec/crash_cma_ranges Cc: Aditya Gupta Cc: Andrew Morton Cc: Baoquan he Cc: Dave Young Cc: Hari Bathini Cc: Jiri Bohac Cc: Madhavan Srinivasan Cc: Mahesh J Salgaonkar Cc: Pingfan Liu Cc: Ritesh Harjani (IBM) Cc: Shivang Upadhyay Cc: Vivek Goyal Cc: linuxppc-dev at 
lists.ozlabs.org Cc: kexec at lists.infradead.org Signed-off-by: Sourabh Jain --- .../ABI/testing/sysfs-kernel-kexec-kdump | 61 +++++++++++++++++++ 1 file changed, 61 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-kexec-kdump diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump new file mode 100644 index 000000000000..f59051b5d96d --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump @@ -0,0 +1,61 @@ +What: /sys/kernel/kexec/* +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: + The /sys/kernel/kexec/* directory contains sysfs files + that provide information about the configuration status + of kexec and kdump. + +What: /sys/kernel/kexec/loaded +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a new kernel image has been loaded + into memory using the kexec system call. It shows 1 if + a kexec image is present and ready to boot, or 0 if none + is loaded. +User: kexec tools, kdump service + +What: /sys/kernel/kexec/crash_loaded +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates whether a crash (kdump) kernel is currently + loaded into memory. It shows 1 if a crash kernel has been + successfully loaded for panic handling, or 0 if no crash + kernel is present. +User: Kexec tools, Kdump service + +What: /sys/kernel/kexec/crash_size +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read/write + Shows the amount of memory reserved for loading the crash + (kdump) kernel. It reports the size, in bytes, of the + crash kernel area defined by the crashkernel= parameter. + This interface also allows reducing the crashkernel + reservation by writing a smaller value, and the reclaimed + space is added back to the system RAM. 
+User: Kdump service + +What: /sys/kernel/kexec/crash_elfcorehdr_size +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Indicates the preferred size of the memory buffer for the + ELF core header used by the crash (kdump) kernel. It defines + how much space is needed to hold metadata about the crashed + system, including CPU and memory information. This information + is used by the user space utility kexec to support updating the + in-kernel kdump image during hotplug operations. +User: Kexec tools + +What: /sys/kernel/kexec/crash_cma_ranges +Date: Nov 2025 +Contact: kexec at lists.infradead.org +Description: read only + Provides information about the memory ranges reserved from + the Contiguous Memory Allocator (CMA) area that are allocated + to the crash (kdump) kernel. It lists the start and end physical + addresses of CMA regions assigned for crashkernel use. +User: kdump service -- 2.51.1 From pratyush at kernel.org Tue Nov 18 05:19:50 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 18 Nov 2025 14:19:50 +0100 Subject: [PATCH v1 04/13] kho: Verify deserialization status and fix FDT alignment access In-Reply-To: (Mike Rapoport's message of "Sat, 15 Nov 2025 11:36:19 +0200") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-5-pasha.tatashin@soleen.com> Message-ID: On Sat, Nov 15 2025, Mike Rapoport wrote: > On Fri, Nov 14, 2025 at 05:52:37PM +0100, Pratyush Yadav wrote: >> On Fri, Nov 14 2025, Pasha Tatashin wrote: >> >> > @@ -1377,16 +1387,12 @@ static void __init kho_release_scratch(void) >> > >> > void __init kho_memory_init(void) >> > { >> > - struct folio *folio; >> > - >> > if (kho_in.scratch_phys) { >> > kho_scratch = phys_to_virt(kho_in.scratch_phys); >> > kho_release_scratch(); >> > >> > - kho_mem_deserialize(kho_get_fdt()); >> > - folio = kho_restore_folio(kho_in.fdt_phys); >> > - if (!folio) >> > - pr_warn("failed to restore folio for KHO fdt\n"); >> > + if 
(!kho_mem_deserialize(kho_get_fdt())) >> > + kho_in.fdt_phys = 0; >> >> The folio restore does serve a purpose: it accounts for that folio in >> the system's total memory. See the call to adjust_managed_page_count() >> in kho_restore_page(). In practice, I don't think it makes much of a >> difference, but I don't see why not. > > This page is never freed, so adding it to zone managed pages or keeping it > reserved does not change anything. In practice, sure. I still don't see a good reason to _not_ initialize the page properly. It's not like it costs us much in terms of performance or code complexity. Since kho_restore_folio() makes sure the folio was _actually_ preserved from KHO, you have a safety check against previous kernel having a bug and not preserving the FDT properly. And I get that the FDT has already been used by this point, but at least you would have some known point to catch this. [...] -- Regards, Pratyush Yadav From pasha.tatashin at soleen.com Tue Nov 18 07:25:37 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 18 Nov 2025 10:25:37 -0500 Subject: [PATCH v1 04/13] kho: Verify deserialization status and fix FDT alignment access In-Reply-To: References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-5-pasha.tatashin@soleen.com> Message-ID: > > This page is never freed, so adding it to zone managed pages or keeping it > > reserved does not change anything. > > In practice, sure. I still don't see a good reason to _not_ initialize > the page properly. It's not like it costs us much in terms of > performance or code complexity. > > Since kho_restore_folio() makes sure the folio was _actually_ preserved > from KHO, you have a safety check against previous kernel having a bug > and not preserving the FDT properly. And I get that the FDT has already > been used by this point, but at least you would have some known point to > catch this. The kho_alloc_preserve() API is different from kho_preserve_folio(). 
With kho_preserve_folio(), memory is allocated and only some time later preserved, so there is a window in which that memory exists and may be used while it is not preserved; therefore it is a crucial step for such memory to also go through kho_restore_folio() before it is used. With kho_alloc_preserve(), whenever the memory exists it is always preserved; that is a guarantee of this API. There is no reason to do kho_restore_folio() on such memory at all. It can be released back to the system via kho_free_restore()/kho_free_unpreserve(). Pasha > > [...] > > -- > Regards, > Pratyush Yadav From pratyush at kernel.org Tue Nov 18 09:11:24 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 18 Nov 2025 18:11:24 +0100 Subject: [PATCH v1 04/13] kho: Verify deserialization status and fix FDT alignment access In-Reply-To: (Pasha Tatashin's message of "Tue, 18 Nov 2025 10:25:37 -0500") References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-5-pasha.tatashin@soleen.com> Message-ID: On Tue, Nov 18 2025, Pasha Tatashin wrote: >> > This page is never freed, so adding it to zone managed pages or keeping it >> > reserved does not change anything. >> >> In practice, sure. I still don't see a good reason to _not_ initialize >> the page properly. It's not like it costs us much in terms of >> performance or code complexity. >> >> Since kho_restore_folio() makes sure the folio was _actually_ preserved >> from KHO, you have a safety check against previous kernel having a bug >> and not preserving the FDT properly. And I get that the FDT has already >> been used by this point, but at least you would have some known point to >> catch this. > > The kho_alloc_preserve() API is different from kho_preserve_folio().
> With kho_preserve_folio(), memory is allocated and some time later is > preserved, so there is a possibility for that memory to exist and be > used where it is not preserved, therefore it is a crucial step for > such memory to also do kho_restore_folio() before used. With > kho_alloc_preserve(), when the memory exists it is always preserved; > it is gurantee of this API. There is no reason to do > kho_restore_folio() on such memory at all. It can be released back to > the system via kho_free_restore()/kho_free_unpreserve(). Even for those I think there should be a kho_restore_mem() or something similar (naming things is hard :/), so they go through the restore, their struct page is properly initialized and accounted for, and make sure the pages were actually preserved. Using the memory without restoring it first should be the exception IMO. -- Regards, Pratyush Yadav From pratyush at kernel.org Tue Nov 18 10:10:45 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 18 Nov 2025 19:10:45 +0100 Subject: [PATCH] test_kho: always print restore status Message-ID: <20251118181046.23321-1-pratyush@kernel.org> Currently the KHO test only prints a message on success, and remains silent on failure. This makes it difficult to notice a failing test. A failing test is usually more interesting than a successful one. Always print the test status after attempting restore. 
Signed-off-by: Pratyush Yadav --- lib/test_kho.c | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/lib/test_kho.c b/lib/test_kho.c index 85b60d87a50ad..47de562807955 100644 --- a/lib/test_kho.c +++ b/lib/test_kho.c @@ -306,7 +306,6 @@ static int kho_test_restore(phys_addr_t fdt_phys) if (err) return err; - pr_info("KHO restore succeeded\n"); return 0; } @@ -319,8 +318,15 @@ static int __init kho_test_init(void) return 0; err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys); - if (!err) - return kho_test_restore(fdt_phys); + if (!err) { + err = kho_test_restore(fdt_phys); + if (err) + pr_err("KHO restore failed\n"); + else + pr_info("KHO restore succeeded\n"); + + return err; + } if (err != -ENOENT) { pr_warn("failed to retrieve %s FDT: %d\n", KHO_TEST_FDT, err); base-commit: f0bfdc2b69f5c600b88ee484c01b213712c63d94 -- 2.47.3 From pratyush at kernel.org Tue Nov 18 10:18:10 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 18 Nov 2025 19:18:10 +0100 Subject: [PATCH] kho: free already restored pages when kho_restore_vmalloc() fails Message-ID: <20251118181811.47336-1-pratyush@kernel.org> When kho_restore_vmalloc() fails, it frees up the pages array, but not the pages it contains. These are the pages that were successfully restored using kho_restore_pages(). If the failure happens when restoring the pages, the ones successfully restored are leaked. If the failure happens when allocating the vm_area or when mapping the pages, all the pages of the preserved vmalloc buffer are leaked. Free all of the successfully restored pages before returning error. 
Fixes: a667300bd53f2 ("kho: add support for preserving vmalloc allocations") Signed-off-by: Pratyush Yadav --- kernel/liveupdate/kexec_handover.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 224bdf5becb68..515339fa526e0 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1088,11 +1088,11 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) phys_addr_t phys = chunk->phys[i]; if (idx + contig_pages > total_pages) - goto err_free_pages_array; + goto err_free_pages; page = kho_restore_pages(phys, contig_pages); if (!page) - goto err_free_pages_array; + goto err_free_pages; for (int j = 0; j < contig_pages; j++) pages[idx++] = page; @@ -1102,20 +1102,20 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) page = kho_restore_pages(virt_to_phys(chunk), 1); if (!page) - goto err_free_pages_array; + goto err_free_pages; chunk = KHOSER_LOAD_PTR(chunk->hdr.next); __free_page(page); } if (idx != total_pages) - goto err_free_pages_array; + goto err_free_pages; area = __get_vm_area_node(total_pages * PAGE_SIZE, align, shift, vm_flags, VMALLOC_START, VMALLOC_END, NUMA_NO_NODE, GFP_KERNEL, __builtin_return_address(0)); if (!area) - goto err_free_pages_array; + goto err_free_pages; addr = (unsigned long)area->addr; size = get_vm_area_size(area); @@ -1130,7 +1130,10 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) err_free_vm_area: free_vm_area(area); -err_free_pages_array: +err_free_pages: + for (int i = 0; i < idx; i++) + __free_page(pages[i]); + kvfree(pages); return NULL; } base-commit: f0bfdc2b69f5c600b88ee484c01b213712c63d94 prerequisite-patch-id: f54df1de9bdcb4fe396940cdcc578f5adcc9397c -- 2.47.3 From pratyush at kernel.org Tue Nov 18 10:22:16 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 18 Nov 2025 19:22:16 +0100 Subject: [PATCH] kho: free chunks using 
free_page() instead of kfree() Message-ID: <20251118182218.63044-1-pratyush@kernel.org> Before commit fa759cd75bce5 ("kho: allocate metadata directly from the buddy allocator"), the chunks were allocated from the slab allocator using kzalloc(). Those were rightly freed using kfree(). When the commit switched to using the buddy allocator directly, it missed updating kho_mem_ser_free() to use free_page() instead of kfree(). Fixes: fa759cd75bce5 ("kho: allocate metadata directly from the buddy allocator") Signed-off-by: Pratyush Yadav --- Notes: Commit 73976b0f7cefe ("kho: remove abort functionality and support state refresh") made this bug easier to trigger by providing a deterministic method to trigger freeing of the chunks. kernel/liveupdate/kexec_handover.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 515339fa526e0..6497fe68c2d24 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -360,7 +360,7 @@ static void kho_mem_ser_free(struct khoser_mem_chunk *first_chunk) struct khoser_mem_chunk *tmp = chunk; chunk = KHOSER_LOAD_PTR(chunk->hdr.next); - kfree(tmp); + free_page((unsigned long)tmp); } } base-commit: f0bfdc2b69f5c600b88ee484c01b213712c63d94 prerequisite-patch-id: f54df1de9bdcb4fe396940cdcc578f5adcc9397c prerequisite-patch-id: 800ec910c37120fd77aff1fad8ec10daaeaeddb1 -- 2.47.3 From pratyush at kernel.org Tue Nov 18 10:24:15 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 18 Nov 2025 19:24:15 +0100 Subject: [PATCH] MAINTAINERS: add test_kho to KHO's entry Message-ID: <20251118182416.70660-1-pratyush@kernel.org> Commit b753522bed0b7 ("kho: add test for kexec handover") introduced the KHO test but missed adding it to KHO's MAINTAINERS entry. Add it so the KHO maintainers can get patches for its test. 
Cc: stable at vger.kernel.org Fixes: b753522bed0b7 ("kho: add test for kexec handover") Signed-off-by: Pratyush Yadav --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index 05e336174ede5..b0873f8ebcda6 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -13799,6 +13799,7 @@ F: Documentation/admin-guide/mm/kho.rst F: Documentation/core-api/kho/* F: include/linux/kexec_handover.h F: kernel/liveupdate/kexec_handover* +F: lib/test_kho.c F: tools/testing/selftests/kho/ KEYS-ENCRYPTED -- 2.47.3 From pasha.tatashin at soleen.com Tue Nov 18 10:35:15 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 18 Nov 2025 13:35:15 -0500 Subject: [PATCH] MAINTAINERS: add test_kho to KHO's entry In-Reply-To: <20251118182416.70660-1-pratyush@kernel.org> References: <20251118182416.70660-1-pratyush@kernel.org> Message-ID: On Tue, Nov 18, 2025 at 1:24?PM Pratyush Yadav wrote: > > Commit b753522bed0b7 ("kho: add test for kexec handover") introduced the > KHO test but missed adding it to KHO's MAINTAINERS entry. Add it so the > KHO maintainers can get patches for its test. > > Cc: stable at vger.kernel.org > Fixes: b753522bed0b7 ("kho: add test for kexec handover") > Signed-off-by: Pratyush Yadav Reviewed-by: Pasha Tatashin From pasha.tatashin at soleen.com Tue Nov 18 10:39:07 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 18 Nov 2025 13:39:07 -0500 Subject: [PATCH] kho: free chunks using free_page() instead of kfree() In-Reply-To: <20251118182218.63044-1-pratyush@kernel.org> References: <20251118182218.63044-1-pratyush@kernel.org> Message-ID: On Tue, Nov 18, 2025 at 1:22?PM Pratyush Yadav wrote: > > Before commit fa759cd75bce5 ("kho: allocate metadata directly from the > buddy allocator"), the chunks were allocated from the slab allocator > using kzalloc(). Those were rightly freed using kfree(). 
> > When the commit switched to using the buddy allocator directly, it > missed updating kho_mem_ser_free() to use free_page() instead of > kfree(). > > Fixes: fa759cd75bce5 ("kho: allocate metadata directly from the buddy allocator") > Signed-off-by: Pratyush Yadav Thank you for finding and fixing this issue. Reviewed-by: Pasha Tatashin From pasha.tatashin at soleen.com Tue Nov 18 10:43:32 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 18 Nov 2025 13:43:32 -0500 Subject: [PATCH] kho: free already restored pages when kho_restore_vmalloc() fails In-Reply-To: <20251118181811.47336-1-pratyush@kernel.org> References: <20251118181811.47336-1-pratyush@kernel.org> Message-ID: > When kho_restore_vmalloc() fails, it frees up the pages array, but not > the pages it contains. These are the pages that were successfully > restored using kho_restore_pages(). If the failure happens when > restoring the pages, the ones successfully restored are leaked. If the > failure happens when allocating the vm_area or when mapping the pages, > all the pages of the preserved vmalloc buffer are leaked. Hm, I am not sure if KHO should be responsible for freeing the restored pages. We don't know the content of those pages, and what they are used for. They could be used by a hypervisor or a device. Therefore, it may be better to keep them leaked, and let the caller decide what to do next: i.e., boot into a maintenance mode, crash the kernel, or allow the leak until the next reboot. Pasha From pasha.tatashin at soleen.com Tue Nov 18 12:31:34 2025 From: pasha.tatashin at soleen.com (Pasha Tatashin) Date: Tue, 18 Nov 2025 15:31:34 -0500 Subject: [PATCH] test_kho: always print restore status In-Reply-To: <20251118181046.23321-1-pratyush@kernel.org> References: <20251118181046.23321-1-pratyush@kernel.org> Message-ID: On Tue, Nov 18, 2025 at 1:10?PM Pratyush Yadav wrote: > > Currently the KHO test only prints a message on success, and remains > silent on failure. 
This makes it difficult to notice a failing test. A > failing test is usually more interesting than a successful one. > > Always print the test status after attempting restore. > > Signed-off-by: Pratyush Yadav Reviewed-by: Pasha Tatashin From rppt at kernel.org Tue Nov 18 23:13:05 2025 From: rppt at kernel.org (Mike Rapoport) Date: Wed, 19 Nov 2025 09:13:05 +0200 Subject: [PATCH] test_kho: always print restore status In-Reply-To: <20251118181046.23321-1-pratyush@kernel.org> References: <20251118181046.23321-1-pratyush@kernel.org> Message-ID: On Tue, Nov 18, 2025 at 07:10:45PM +0100, Pratyush Yadav wrote: > Currently the KHO test only prints a message on success, and remains > silent on failure. This makes it difficult to notice a failing test. A > failing test is usually more interesting than a successful one. > > Always print the test status after attempting restore. > > Signed-off-by: Pratyush Yadav Reviewed-by: Mike Rapoport (Microsoft) > --- > lib/test_kho.c | 12 +++++++++--- > 1 file changed, 9 insertions(+), 3 deletions(-) > > diff --git a/lib/test_kho.c b/lib/test_kho.c > index 85b60d87a50ad..47de562807955 100644 > --- a/lib/test_kho.c > +++ b/lib/test_kho.c > @@ -306,7 +306,6 @@ static int kho_test_restore(phys_addr_t fdt_phys) > if (err) > return err; > > - pr_info("KHO restore succeeded\n"); > return 0; > } > > @@ -319,8 +318,15 @@ static int __init kho_test_init(void) > return 0; > > err = kho_retrieve_subtree(KHO_TEST_FDT, &fdt_phys); > - if (!err) > - return kho_test_restore(fdt_phys); > + if (!err) { > + err = kho_test_restore(fdt_phys); > + if (err) > + pr_err("KHO restore failed\n"); > + else > + pr_info("KHO restore succeeded\n"); > + > + return err; > + } > > if (err != -ENOENT) { > pr_warn("failed to retrieve %s FDT: %d\n", KHO_TEST_FDT, err); > > base-commit: f0bfdc2b69f5c600b88ee484c01b213712c63d94 > -- > 2.47.3 > -- Sincerely yours, Mike. 
From rppt at kernel.org Tue Nov 18 23:14:34 2025 From: rppt at kernel.org (Mike Rapoport) Date: Wed, 19 Nov 2025 09:14:34 +0200 Subject: [PATCH] kho: free chunks using free_page() instead of kfree() In-Reply-To: <20251118182218.63044-1-pratyush@kernel.org> References: <20251118182218.63044-1-pratyush@kernel.org> Message-ID: On Tue, Nov 18, 2025 at 07:22:16PM +0100, Pratyush Yadav wrote: > Before commit fa759cd75bce5 ("kho: allocate metadata directly from the > buddy allocator"), the chunks were allocated from the slab allocator > using kzalloc(). Those were rightly freed using kfree(). > > When the commit switched to using the buddy allocator directly, it > missed updating kho_mem_ser_free() to use free_page() instead of > kfree(). > > Fixes: fa759cd75bce5 ("kho: allocate metadata directly from the buddy allocator") > Signed-off-by: Pratyush Yadav Reviewed-by: Mike Rapoport (Microsoft) > --- > > Notes: > Commit 73976b0f7cefe ("kho: remove abort functionality and support state > refresh") made this bug easier to trigger by providing a deterministic > method to trigger freeing of the chunks. > > kernel/liveupdate/kexec_handover.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > index 515339fa526e0..6497fe68c2d24 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c > @@ -360,7 +360,7 @@ static void kho_mem_ser_free(struct khoser_mem_chunk *first_chunk) > struct khoser_mem_chunk *tmp = chunk; > > chunk = KHOSER_LOAD_PTR(chunk->hdr.next); > - kfree(tmp); > + free_page((unsigned long)tmp); > } > } > > > base-commit: f0bfdc2b69f5c600b88ee484c01b213712c63d94 > prerequisite-patch-id: f54df1de9bdcb4fe396940cdcc578f5adcc9397c > prerequisite-patch-id: 800ec910c37120fd77aff1fad8ec10daaeaeddb1 > -- > 2.47.3 > -- Sincerely yours, Mike. 
From rppt at kernel.org Tue Nov 18 23:14:56 2025 From: rppt at kernel.org (Mike Rapoport) Date: Wed, 19 Nov 2025 09:14:56 +0200 Subject: [PATCH] MAINTAINERS: add test_kho to KHO's entry In-Reply-To: <20251118182416.70660-1-pratyush@kernel.org> References: <20251118182416.70660-1-pratyush@kernel.org> Message-ID: On Tue, Nov 18, 2025 at 07:24:15PM +0100, Pratyush Yadav wrote: > Commit b753522bed0b7 ("kho: add test for kexec handover") introduced the > KHO test but missed adding it to KHO's MAINTAINERS entry. Add it so the > KHO maintainers can get patches for its test. > > Cc: stable at vger.kernel.org > Fixes: b753522bed0b7 ("kho: add test for kexec handover") > Signed-off-by: Pratyush Yadav Reviewed-by: Mike Rapoport (Microsoft) > --- > MAINTAINERS | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/MAINTAINERS b/MAINTAINERS > index 05e336174ede5..b0873f8ebcda6 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -13799,6 +13799,7 @@ F: Documentation/admin-guide/mm/kho.rst > F: Documentation/core-api/kho/* > F: include/linux/kexec_handover.h > F: kernel/liveupdate/kexec_handover* > +F: lib/test_kho.c > F: tools/testing/selftests/kho/ > > KEYS-ENCRYPTED > -- > 2.47.3 > -- Sincerely yours, Mike. From gregkh at linuxfoundation.org Tue Nov 18 23:36:09 2025 From: gregkh at linuxfoundation.org (Greg KH) Date: Wed, 19 Nov 2025 02:36:09 -0500 Subject: [PATCH] MAINTAINERS: add test_kho to KHO's entry In-Reply-To: <20251118182416.70660-1-pratyush@kernel.org> References: <20251118182416.70660-1-pratyush@kernel.org> Message-ID: <2025111944-tracing-unwieldy-1769@gregkh> On Tue, Nov 18, 2025 at 07:24:15PM +0100, Pratyush Yadav wrote: > Commit b753522bed0b7 ("kho: add test for kexec handover") introduced the > KHO test but missed adding it to KHO's MAINTAINERS entry. Add it so the > KHO maintainers can get patches for its test. > > Cc: stable at vger.kernel.org Why is this a patch for stable trees? 
From pratyush at kernel.org Wed Nov 19 07:55:06 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Wed, 19 Nov 2025 16:55:06 +0100 Subject: [PATCH] MAINTAINERS: add test_kho to KHO's entry In-Reply-To: <2025111944-tracing-unwieldy-1769@gregkh> (Greg KH's message of "Wed, 19 Nov 2025 02:36:09 -0500") References: <20251118182416.70660-1-pratyush@kernel.org> <2025111944-tracing-unwieldy-1769@gregkh> Message-ID: On Wed, Nov 19 2025, Greg KH wrote: > On Tue, Nov 18, 2025 at 07:24:15PM +0100, Pratyush Yadav wrote: >> Commit b753522bed0b7 ("kho: add test for kexec handover") introduced the >> KHO test but missed adding it to KHO's MAINTAINERS entry. Add it so the >> KHO maintainers can get patches for its test. >> >> Cc: stable at vger.kernel.org > > Why is this a patch for stable trees? If someone finds a problem with this test in a stable kernel, they will know who to contact. -- Regards, Pratyush Yadav From gregkh at linuxfoundation.org Wed Nov 19 08:02:49 2025 From: gregkh at linuxfoundation.org (Greg KH) Date: Wed, 19 Nov 2025 17:02:49 +0100 Subject: [PATCH] MAINTAINERS: add test_kho to KHO's entry In-Reply-To: References: <20251118182416.70660-1-pratyush@kernel.org> <2025111944-tracing-unwieldy-1769@gregkh> Message-ID: <2025111944-bullpen-slinging-dcdc@gregkh> On Wed, Nov 19, 2025 at 04:55:06PM +0100, Pratyush Yadav wrote: > On Wed, Nov 19 2025, Greg KH wrote: > > > On Tue, Nov 18, 2025 at 07:24:15PM +0100, Pratyush Yadav wrote: > >> Commit b753522bed0b7 ("kho: add test for kexec handover") introduced the > >> KHO test but missed adding it to KHO's MAINTAINERS entry. Add it so the > >> KHO maintainers can get patches for its test. > >> > >> Cc: stable at vger.kernel.org > > > > Why is this a patch for stable trees? > > If someone finds a problem with this test in a stable kernel, they will > know who to contact. 
Contacting developers/maintainers should always be done on the latest kernel release, not on older stable kernels as fixes need to ALWAYS be done on Linus's tree first. Please don't force us to attempt to keep MAINTAINERS changes in sync in stable kernel trees, that way lies madness and even more patches that you would be forcing me to handle :) thanks, greg k-h From Markus.Elfring at web.de Wed Nov 19 12:06:49 2025 From: Markus.Elfring at web.de (Markus Elfring) Date: Wed, 19 Nov 2025 21:06:49 +0100 Subject: [PATCH] kho: free chunks using free_page() instead of kfree() In-Reply-To: <20251118182218.63044-1-pratyush@kernel.org> References: <20251118182218.63044-1-pratyush@kernel.org> Message-ID: … > When the commit switched to using the buddy allocator directly, it > missed updating kho_mem_ser_free() to use free_page() instead of > kfree(). Would another imperative wording become helpful for an improved change description? https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.18-rc6#n94 Regards, Markus From sj at kernel.org Wed Nov 19 17:18:48 2025 From: sj at kernel.org (SeongJae Park) Date: Wed, 19 Nov 2025 17:18:48 -0800 Subject: [PATCH] test_kho: always print restore status In-Reply-To: <20251118181046.23321-1-pratyush@kernel.org> Message-ID: <20251120011849.74672-1-sj@kernel.org> On Tue, 18 Nov 2025 19:10:45 +0100 Pratyush Yadav wrote: > Currently the KHO test only prints a message on success, and remains > silent on failure. This makes it difficult to notice a failing test. A > failing test is usually more interesting than a successful one. > > Always print the test status after attempting restore. > > Signed-off-by: Pratyush Yadav Acked-by: SeongJae Park Thanks, SJ [...]
From pratyush at kernel.org Thu Nov 20 01:21:30 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Thu, 20 Nov 2025 10:21:30 +0100 Subject: [PATCH] kho: free chunks using free_page() instead of kfree() In-Reply-To: (Markus Elfring's message of "Wed, 19 Nov 2025 21:06:49 +0100") References: <20251118182218.63044-1-pratyush@kernel.org> Message-ID: On Wed, Nov 19 2025, Markus Elfring wrote: > … >> When the commit switched to using the buddy allocator directly, it >> missed updating kho_mem_ser_free() to use free_page() instead of >> kfree(). > > Would another imperative wording become helpful for an improved change description? > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.18-rc6#n94 "the commit" here refers to the commit fa759cd75bce5 ("kho: allocate metadata directly from the buddy allocator"), not "this commit"/"this patch". I figured that can be understood from the context and I won't need to spell the whole thing out again. I don't understand the technicalities of the English grammar so well, but IIUC imperative mood is used in sentences that give a command. This paragraph talks about a past event. Anyway, if you have something better, happy to take suggestions.
-- Regards, Pratyush Yadav From pratyush at kernel.org Thu Nov 20 01:25:34 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Thu, 20 Nov 2025 10:25:34 +0100 Subject: [PATCH] MAINTAINERS: add test_kho to KHO's entry In-Reply-To: <2025111944-bullpen-slinging-dcdc@gregkh> (Greg KH's message of "Wed, 19 Nov 2025 17:02:49 +0100") References: <20251118182416.70660-1-pratyush@kernel.org> <2025111944-tracing-unwieldy-1769@gregkh> <2025111944-bullpen-slinging-dcdc@gregkh> Message-ID: On Wed, Nov 19 2025, Greg KH wrote: > On Wed, Nov 19, 2025 at 04:55:06PM +0100, Pratyush Yadav wrote: >> On Wed, Nov 19 2025, Greg KH wrote: >> >> > On Tue, Nov 18, 2025 at 07:24:15PM +0100, Pratyush Yadav wrote: >> >> Commit b753522bed0b7 ("kho: add test for kexec handover") introduced the >> >> KHO test but missed adding it to KHO's MAINTAINERS entry. Add it so the >> >> KHO maintainers can get patches for its test. >> >> >> >> Cc: stable at vger.kernel.org >> > >> > Why is this a patch for stable trees? >> >> If someone finds a problem with this test in a stable kernel, they will >> know who to contact. > > Contacting developers/maintainers should always be done on the latest > kernel release, not on older stable kernels as fixes need to ALWAYS be > done on Linus's tree first. > > Please don't force us to attempt to keep MAINTAINERS changes in sync in > stable kernel trees, that way lies madness and even more patches that > you would be forcing me to handle :) Okay, my bad. Feel free to ignore this patch then. And I will keep that in mind the next time around. Andrew, can you please drop the "Cc: stable at vger.kernel.org" when you apply? 
-- Regards, Pratyush Yadav From pratyush at kernel.org Thu Nov 20 01:26:28 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Thu, 20 Nov 2025 10:26:28 +0100 Subject: [PATCH] kho: free already restored pages when kho_restore_vmalloc() fails In-Reply-To: (Pasha Tatashin's message of "Tue, 18 Nov 2025 13:43:32 -0500") References: <20251118181811.47336-1-pratyush@kernel.org> Message-ID: On Tue, Nov 18 2025, Pasha Tatashin wrote: >> When kho_restore_vmalloc() fails, it frees up the pages array, but not >> the pages it contains. These are the pages that were successfully >> restored using kho_restore_pages(). If the failure happens when >> restoring the pages, the ones successfully restored are leaked. If the >> failure happens when allocating the vm_area or when mapping the pages, >> all the pages of the preserved vmalloc buffer are leaked. > > Hm, I am not sure if KHO should be responsible for freeing the > restored pages. We don't know the content of those pages, and what > they are used for. They could be used by a hypervisor or a device. > Therefore, it may be better to keep them leaked, and let the caller > decide what to do next: i.e., boot into a maintenance mode, crash the > kernel, or allow the leak until the next reboot. Hmm, fair point. This patch can be ignored then. -- Regards, Pratyush Yadav From Markus.Elfring at web.de Thu Nov 20 01:57:01 2025 From: Markus.Elfring at web.de (Markus Elfring) Date: Thu, 20 Nov 2025 10:57:01 +0100 Subject: kho: free chunks using free_page() instead of kfree() In-Reply-To: References: <20251118182218.63044-1-pratyush@kernel.org> Message-ID: <11c0819f-f0e4-42a3-9a0c-fc71de1e59cc@web.de> > Anyway, if you have something better, happy to take suggestions. You provided a reasonable change introduction (and justification). How do you think about adding a wording like "Thus use an appropriate macro call."? Would it be helpful to mention the affected function implementation also in the summary phrase?
Regards, Markus From bhe at redhat.com Thu Nov 20 01:58:56 2025 From: bhe at redhat.com (Baoquan he) Date: Thu, 20 Nov 2025 17:58:56 +0800 Subject: [PATCH v6] crash: export crashkernel CMA reservation to userspace In-Reply-To: <20251118071023.1673329-1-sourabhjain@linux.ibm.com> References: <20251118071023.1673329-1-sourabhjain@linux.ibm.com> Message-ID: On 11/18/25 at 12:40pm, Sourabh Jain wrote: > Add a sysfs entry /sys/kernel/kexec_crash_cma_ranges to expose all > CMA crashkernel ranges. > > This allows userspace tools configuring kdump to determine how much > memory is reserved for crashkernel. If CMA is used, tools can warn > users when attempting to capture user pages with CMA reservation. > > The new sysfs entry holds the CMA ranges in the below format: > > cat /sys/kernel/kexec_crash_cma_ranges > 100000000-10c7fffff > > The reason for not including Crash CMA Ranges in /proc/iomem is to avoid > conflicts. It has been observed that contiguous memory ranges are sometimes > shown as two separate System RAM entries in /proc/iomem. If a CMA range > overlaps two System RAM ranges, adding crashk_res to /proc/iomem can create > a conflict. Reference [1] describes one such instance on the PowerPC > architecture. > > Link: https://lore.kernel.org/all/20251016142831.144515-1-sourabhjain at linux.ibm.com/ [1] > > Cc: Aditya Gupta > Cc: Andrew Morton > Cc: Baoquan he > Cc: Dave Young > Cc: Hari Bathini > Cc: Jiri Bohac > Cc: Madhavan Srinivasan > Cc: Mahesh J Salgaonkar > Cc: Pingfan Liu > Cc: Ritesh Harjani (IBM) > Cc: Shivang Upadhyay > Cc: Vivek Goyal > Cc: linuxppc-dev at lists.ozlabs.org > Cc: kexec at lists.infradead.org > Signed-off-by: Sourabh Jain > --- > > Changelog: > > v4 -> v5: > https://lore.kernel.org/all/20251114152550.ac2dd5e23542f09c62defec7 at linux-foundation.org/ > - Split patch from the above patch series. > - Code to create kexec node under /sys/kernel is added, earlier it was > done in [02/05] of the above patch series.
> > v5 -> v6: > - Add Crash CMA Range sysfs interface under /sys/kernel > > Note: > This patch is dependent on the below patch: > https://lore.kernel.org/all/20251117035153.1199665-1-sourabhjain at linux.ibm.com/ > > --- > .../ABI/testing/sysfs-kernel-kexec-kdump | 10 +++++++++ > kernel/ksysfs.c | 21 +++++++++++++++++++ > 2 files changed, 31 insertions(+) Acked-by: Baoquan He > > diff --git a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump > index 96b24565b68e..f6089e38de5f 100644 > --- a/Documentation/ABI/testing/sysfs-kernel-kexec-kdump > +++ b/Documentation/ABI/testing/sysfs-kernel-kexec-kdump > @@ -41,3 +41,13 @@ Description: read only > is used by the user space utility kexec to support updating the > in-kernel kdump image during hotplug operations. > User: Kexec tools > + > +What: /sys/kernel/kexec_crash_cma_ranges > +Date: Nov 2025 > +Contact: kexec at lists.infradead.org > +Description: read only > + Provides information about the memory ranges reserved from > + the Contiguous Memory Allocator (CMA) area that are allocated > + to the crash (kdump) kernel. It lists the start and end physical > + addresses of CMA regions assigned for crashkernel use. 
> +User: kdump service > diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c > index eefb67d9883c..0ff2179bc603 100644 > --- a/kernel/ksysfs.c > +++ b/kernel/ksysfs.c > @@ -135,6 +135,24 @@ static ssize_t kexec_crash_loaded_show(struct kobject *kobj, > } > KERNEL_ATTR_RO(kexec_crash_loaded); > > +#ifdef CONFIG_CRASH_RESERVE > +static ssize_t kexec_crash_cma_ranges_show(struct kobject *kobj, > + struct kobj_attribute *attr, char *buf) > +{ > + > + ssize_t len = 0; > + int i; > + > + for (i = 0; i < crashk_cma_cnt; ++i) { > + len += sysfs_emit_at(buf, len, "%08llx-%08llx\n", > + crashk_cma_ranges[i].start, > + crashk_cma_ranges[i].end); > + } > + return len; > +} > +KERNEL_ATTR_RO(kexec_crash_cma_ranges); > +#endif /* CONFIG_CRASH_RESERVE */ > + > static ssize_t kexec_crash_size_show(struct kobject *kobj, > struct kobj_attribute *attr, char *buf) > { > @@ -260,6 +278,9 @@ static struct attribute * kernel_attrs[] = { > #ifdef CONFIG_CRASH_DUMP > &kexec_crash_loaded_attr.attr, > &kexec_crash_size_attr.attr, > +#ifdef CONFIG_CRASH_RESERVE > + &kexec_crash_cma_ranges_attr.attr, > +#endif > #endif > #endif > #ifdef CONFIG_VMCORE_INFO > -- > 2.51.1 > From rppt at kernel.org Thu Nov 20 02:39:21 2025 From: rppt at kernel.org (Mike Rapoport) Date: Thu, 20 Nov 2025 12:39:21 +0200 Subject: [PATCH v1 04/13] kho: Verify deserialization status and fix FDT alignment access In-Reply-To: References: <20251114155358.2884014-1-pasha.tatashin@soleen.com> <20251114155358.2884014-5-pasha.tatashin@soleen.com> Message-ID: On Tue, Nov 18, 2025 at 06:11:24PM +0100, Pratyush Yadav wrote: > On Tue, Nov 18 2025, Pasha Tatashin wrote: > > >> > This page is never freed, so adding it to zone managed pages or keeping it > >> > reserved does not change anything. > >> > >> In practice, sure. I still don't see a good reason to _not_ initialize > >> the page properly. It's not like it costs us much in terms of > >> performance or code complexity. 
> >> > >> Since kho_restore_folio() makes sure the folio was _actually_ preserved > >> from KHO, you have a safety check against previous kernel having a bug > >> and not preserving the FDT properly. And I get that the FDT has already > >> been used by this point, but at least you would have some known point to > >> catch this. > > > > The kho_alloc_preserve() API is different from kho_preserve_folio(). > > With kho_preserve_folio(), memory is allocated and some time later is > > preserved, so there is a possibility for that memory to exist and be > > used where it is not preserved, therefore it is a crucial step for > > such memory to also do kho_restore_folio() before used. With > > kho_alloc_preserve(), when the memory exists it is always preserved; > > it is a guarantee of this API. There is no reason to do > > kho_restore_folio() on such memory at all. It can be released back to > > the system via kho_free_restore()/kho_free_unpreserve(). > > Even for those I think there should be a kho_restore_mem() or something > similar (naming things is hard :/), so they go through the restore, > their struct page is properly initialized and accounted for, and > make sure the pages were actually preserved. > > Using the memory without restoring it first should be the exception IMO. Base KHO and LUO FDTs are such exceptions for sure :) We have to use them way before we can even think about restoring. > -- > Regards, > Pratyush Yadav -- Sincerely yours, Mike.
From ranxiaokai627 at 163.com Thu Nov 20 06:41:47 2025 From: ranxiaokai627 at 163.com (ranxiaokai627 at 163.com) Date: Thu, 20 Nov 2025 14:41:47 +0000 Subject: [PATCH 2/2] liveupdate: Fix boot failure due to kmemleak access to unmapped pages In-Reply-To: <20251120144147.90508-1-ranxiaokai627@163.com> References: <20251120144147.90508-1-ranxiaokai627@163.com> Message-ID: <20251120144147.90508-3-ranxiaokai627@163.com> From: Ran Xiaokai When booting with debug_pagealloc=on while having: CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n the system fails to boot due to page faults during kmemleak scanning. This occurs because: With debug_pagealloc enabled, __free_pages() invokes debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for freed pages in the direct mapping. Commit 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") releases the KHO scratch region via init_cma_reserved_pageblock(), unmapping its physical pages. Subsequent kmemleak scanning accesses these unmapped pages, triggering fatal page faults. Call kmemleak_no_scan_phys() from kho_reserve_scratch() to exclude the reserved region from scanning before it is released to the buddy allocator. 
Fixes: 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") Signed-off-by: Ran Xiaokai --- kernel/liveupdate/kexec_handover.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 224bdf5becb6..dd4942d1d76c 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -11,6 +11,7 @@ #include #include +#include #include #include #include @@ -654,6 +655,7 @@ static void __init kho_reserve_scratch(void) if (!addr) goto err_free_scratch_desc; + kmemleak_no_scan_phys(addr); kho_scratch[i].addr = addr; kho_scratch[i].size = size; i++; @@ -664,6 +666,7 @@ static void __init kho_reserve_scratch(void) if (!addr) goto err_free_scratch_areas; + kmemleak_no_scan_phys(addr); kho_scratch[i].addr = addr; kho_scratch[i].size = size; i++; @@ -676,6 +679,7 @@ static void __init kho_reserve_scratch(void) if (!addr) goto err_free_scratch_areas; + kmemleak_no_scan_phys(addr); kho_scratch[i].addr = addr; kho_scratch[i].size = size; i++; -- 2.25.1 From ranxiaokai627 at 163.com Thu Nov 20 06:41:45 2025 From: ranxiaokai627 at 163.com (ranxiaokai627 at 163.com) Date: Thu, 20 Nov 2025 14:41:45 +0000 Subject: [PATCH 0/2] liveupdate: Fix boot failure due to kmemleak access to unmapped pages Message-ID: <20251120144147.90508-1-ranxiaokai627@163.com> From: Ran Xiaokai When booting with debug_pagealloc=on while having: CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n the system fails to boot due to page faults during kmemleak scanning. 
Crash logs: BUG: unable to handle page fault for address: ffff8880cd400000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 11de00067 P4D 11de00067 PUD 11af2b067 PMD 11aec1067 PTE 800fffff32bff020 Oops: Oops: 0000 [#1] SMP DEBUG_PAGEALLOC RIP: 0010:scan_block+0x43/0xb0 Call Trace: scan_gray_list+0x2b5/0x2f0 kmemleak_scan+0x3b1/0xcf0 kmemleak_scan_thread+0x7d/0xc0 kthread+0x11c/0x240 ret_from_fork+0x2d3/0x370 ret_from_fork_asm+0x11/0x20 This occurs because: With debug_pagealloc enabled, __free_pages() invokes debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for freed pages in the direct mapping. Commit 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") releases the KHO scratch region via init_cma_reserved_pageblock(), unmapping its physical pages. Subsequent kmemleak scanning accesses these unmapped pages, triggering fatal page faults. This patch introduces kmemleak_no_scan_phys(phys_addr_t), a physical-address variant of kmemleak_no_scan(), which marks memblock regions as OBJECT_NO_SCAN. We invoke this from kho_reserve_scratch() to exclude the reserved region from scanning before it is released to the buddy allocator. This is based on linux next-20251119.
Ran Xiaokai (2): mm: kmemleak: introduce kmemleak_no_scan_phys() helper liveupdate: Fix boot failure due to kmemleak access to unmapped pages include/linux/kmemleak.h | 4 ++++ kernel/liveupdate/kexec_handover.c | 4 ++++ mm/kmemleak.c | 15 ++++++++++++--- 3 files changed, 20 insertions(+), 3 deletions(-) -- 2.25.1 From ranxiaokai627 at 163.com Thu Nov 20 06:41:46 2025 From: ranxiaokai627 at 163.com (ranxiaokai627 at 163.com) Date: Thu, 20 Nov 2025 14:41:46 +0000 Subject: [PATCH 1/2] mm: kmemleak: introduce kmemleak_no_scan_phys() helper In-Reply-To: <20251120144147.90508-1-ranxiaokai627@163.com> References: <20251120144147.90508-1-ranxiaokai627@163.com> Message-ID: <20251120144147.90508-2-ranxiaokai627@163.com> From: Ran Xiaokai Introduce kmemleak_no_scan_phys(phys_addr_t), a physical-address variant of kmemleak_no_scan(). This helper marks memory regions as non-scannable using physical addresses directly. It is specifically designed to prevent kmemleak from accessing pages that have been unmapped by debug_pagealloc after being freed to the buddy allocator. The kexec handover (KHO) subsystem will call this helper to exclude the kho_scratch reservation region from scanning, thereby avoiding fatal page faults during boot when debug_pagealloc=on.
Signed-off-by: Ran Xiaokai --- include/linux/kmemleak.h | 4 ++++ mm/kmemleak.c | 15 ++++++++++++--- 2 files changed, 16 insertions(+), 3 deletions(-) diff --git a/include/linux/kmemleak.h b/include/linux/kmemleak.h index fbd424b2abb1..e955ad441b8a 100644 --- a/include/linux/kmemleak.h +++ b/include/linux/kmemleak.h @@ -31,6 +31,7 @@ extern void kmemleak_ignore(const void *ptr) __ref; extern void kmemleak_ignore_percpu(const void __percpu *ptr) __ref; extern void kmemleak_scan_area(const void *ptr, size_t size, gfp_t gfp) __ref; extern void kmemleak_no_scan(const void *ptr) __ref; +extern void kmemleak_no_scan_phys(phys_addr_t phys) __ref; extern void kmemleak_alloc_phys(phys_addr_t phys, size_t size, gfp_t gfp) __ref; extern void kmemleak_free_part_phys(phys_addr_t phys, size_t size) __ref; @@ -113,6 +114,9 @@ static inline void kmemleak_erase(void **ptr) static inline void kmemleak_no_scan(const void *ptr) { } +static inline void kmemleak_no_scan_phys(phys_addr_t phys) +{ +} static inline void kmemleak_alloc_phys(phys_addr_t phys, size_t size, gfp_t gfp) { diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 1ac56ceb29b6..b2b8374e19c3 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -1058,12 +1058,12 @@ static void object_set_excess_ref(unsigned long ptr, unsigned long excess_ref) * pointer. Such object will not be scanned by kmemleak but references to it * are searched. 
*/ -static void object_no_scan(unsigned long ptr) +static void object_no_scan_flags(unsigned long ptr, unsigned long objflags) { unsigned long flags; struct kmemleak_object *object; - object = find_and_get_object(ptr, 0); + object = __find_and_get_object(ptr, 0, objflags); if (!object) { kmemleak_warn("Not scanning unknown object at 0x%08lx\n", ptr); return; @@ -1328,10 +1328,19 @@ void __ref kmemleak_no_scan(const void *ptr) pr_debug("%s(0x%px)\n", __func__, ptr); if (kmemleak_enabled && ptr && !IS_ERR(ptr)) - object_no_scan((unsigned long)ptr); + object_no_scan_flags((unsigned long)ptr, 0); } EXPORT_SYMBOL(kmemleak_no_scan); +void __ref kmemleak_no_scan_phys(phys_addr_t phys) +{ + pr_debug("%s(%pap)\n", __func__, &phys); + + if (kmemleak_enabled) + object_no_scan_flags((unsigned long)phys, OBJECT_PHYS); +} +EXPORT_SYMBOL(kmemleak_no_scan_phys); + /** * kmemleak_alloc_phys - similar to kmemleak_alloc but taking a physical * address argument -- 2.25.1 From pratyush at kernel.org Thu Nov 20 08:17:28 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Thu, 20 Nov 2025 17:17:28 +0100 Subject: [PATCH 2/2] liveupdate: Fix boot failure due to kmemleak access to unmapped pages In-Reply-To: <20251120144147.90508-3-ranxiaokai627@163.com> (ranxiaokai's message of "Thu, 20 Nov 2025 14:41:47 +0000") References: <20251120144147.90508-1-ranxiaokai627@163.com> <20251120144147.90508-3-ranxiaokai627@163.com> Message-ID: On Thu, Nov 20 2025, ranxiaokai627 at 163.com wrote: > From: Ran Xiaokai > > When booting with debug_pagealloc=on while having: > CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y > CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n > the system fails to boot due to page faults during kmemleak scanning. > > This occurs because: > With debug_pagealloc enabled, __free_pages() invokes > debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for > freed pages in the direct mapping. 
> Commit 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") > releases the KHO scratch region via init_cma_reserved_pageblock(), > unmapping its physical pages. Subsequent kmemleak scanning accesses > these unmapped pages, triggering fatal page faults. I don't know how kmemleak works. Why does kmemleak access the unmapped pages? If pages are not mapped, it should learn to not access them, right? > > Call kmemleak_no_scan_phys() from kho_reserve_scratch() to > exclude the reserved region from scanning before > it is released to the buddy allocator. kho_reserve_scratch() is called on the first boot. It allocates the scratch areas for subsequent boots. On every KHO boot after this, kho_reserve_scratch() is not called and kho_release_scratch() is called instead since the scratch areas already exist from previous boot. Eventually both paths converge to kho_init() and call init_cma_reserved_pageblock(). So shouldn't you call kmemleak_no_scan_phys() from kho_init() instead? This would reduce code duplication and cover both paths. 
> > Fixes: 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") > Signed-off-by: Ran Xiaokai > --- > kernel/liveupdate/kexec_handover.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > index 224bdf5becb6..dd4942d1d76c 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c > @@ -11,6 +11,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -654,6 +655,7 @@ static void __init kho_reserve_scratch(void) > if (!addr) > goto err_free_scratch_desc; > > + kmemleak_no_scan_phys(addr); > kho_scratch[i].addr = addr; > kho_scratch[i].size = size; > i++; > @@ -664,6 +666,7 @@ static void __init kho_reserve_scratch(void) > if (!addr) > goto err_free_scratch_areas; > > + kmemleak_no_scan_phys(addr); > kho_scratch[i].addr = addr; > kho_scratch[i].size = size; > i++; > @@ -676,6 +679,7 @@ static void __init kho_reserve_scratch(void) > if (!addr) > goto err_free_scratch_areas; > > + kmemleak_no_scan_phys(addr); > kho_scratch[i].addr = addr; > kho_scratch[i].size = size; > i++; -- Regards, Pratyush Yadav From maddy at linux.ibm.com Thu Nov 20 18:54:02 2025 From: maddy at linux.ibm.com (Madhavan Srinivasan) Date: Fri, 21 Nov 2025 08:24:02 +0530 Subject: [PATCH v7] powerpc/kdump: Add support for crashkernel CMA reservation In-Reply-To: <20251107080334.708028-1-sourabhjain@linux.ibm.com> References: <20251107080334.708028-1-sourabhjain@linux.ibm.com> Message-ID: <176369324781.72695.15722637983958584587.b4-ty@linux.ibm.com> On Fri, 07 Nov 2025 13:33:34 +0530, Sourabh Jain wrote: > Commit 35c18f2933c5 ("Add a new optional ",cma" suffix to the > crashkernel= command line option") and commit ab475510e042 ("kdump: > implement reserve_crashkernel_cma") added CMA support for kdump > crashkernel reservation. > > Extend crashkernel CMA reservation support to powerpc. > > [...] Applied to powerpc/next. 
[1/1] powerpc/kdump: Add support for crashkernel CMA reservation https://git.kernel.org/powerpc/c/b4a96ab50f368afc2360ff539a20254ca2c9a889 Thanks From bhe at redhat.com Thu Nov 20 19:23:07 2025 From: bhe at redhat.com (Baoquan he) Date: Fri, 21 Nov 2025 11:23:07 +0800 Subject: [PATCH v6 0/3] kexec: reorganize kexec and kdump sysfs In-Reply-To: <20251118114507.1769455-1-sourabhjain@linux.ibm.com> References: <20251118114507.1769455-1-sourabhjain@linux.ibm.com> Message-ID: On 11/18/25 at 05:15pm, Sourabh Jain wrote: > All existing kexec and kdump sysfs entries are moved to a new location, > /sys/kernel/kexec, to keep /sys/kernel/ clean and better organized. > Symlinks are created at the old locations for backward compatibility and > can be removed in the future [01/03]. > > While doing this cleanup, the old kexec and kdump sysfs entries are > marked as deprecated in the existing ABI documentation [02/03]. This > makes it clear that these older interfaces should no longer be used. > New ABI documentation is added to describe the reorganized interfaces > [03/03], so users and tools can rely on the updated sysfs interfaces > going forward. 
> > Changelog: > --------- > > v4 -> v5: > https://lore.kernel.org/all/20251114152550.ac2dd5e23542f09c62defec7 at linux-foundation.org/ > - Split this patch series out of the above patch series > > v5 -> v6: > - Move /sys/kernel/kexec_crash_cma_ranges also to new /sys/kernel/kexec node > - Update commit messages > > Note: > This patch series is dependent on the patches: > https://lore.kernel.org/all/20251117035153.1199665-1-sourabhjain at linux.ibm.com/ > https://lore.kernel.org/all/20251118071023.1673329-1-sourabhjain at linux.ibm.com/ To the series, Acked-by: Baoquan He > > Cc: Aditya Gupta > Cc: Andrew Morton > Cc: Baoquan he > Cc: Dave Young > Cc: Hari Bathini > Cc: Jiri Bohac > Cc: Madhavan Srinivasan > Cc: Mahesh J Salgaonkar > Cc: Pingfan Liu > Cc: Ritesh Harjani (IBM) > Cc: Shivang Upadhyay > Cc: Sourabh Jain > Cc: Vivek Goyal > Cc: linuxppc-dev at lists.ozlabs.org > Cc: kexec at lists.infradead.org > > Sourabh Jain (3): > kexec: move sysfs entries to /sys/kernel/kexec > Documentation/ABI: mark old kexec sysfs deprecated > Documentation/ABI: new kexec and kdump sysfs interface > > .../ABI/obsolete/sysfs-kernel-kexec-kdump | 71 +++++++++ > .../ABI/testing/sysfs-kernel-kexec-kdump | 26 ++-- > kernel/kexec_core.c | 141 ++++++++++++++++++ > kernel/ksysfs.c | 89 +---------- > 4 files changed, 230 insertions(+), 97 deletions(-) > create mode 100644 Documentation/ABI/obsolete/sysfs-kernel-kexec-kdump > > -- > 2.51.1 > From rppt at kernel.org Fri Nov 21 05:36:49 2025 From: rppt at kernel.org (Mike Rapoport) Date: Fri, 21 Nov 2025 15:36:49 +0200 Subject: [PATCH 2/2] liveupdate: Fix boot failure due to kmemleak access to unmapped pages In-Reply-To: <20251120144147.90508-3-ranxiaokai627@163.com> References: <20251120144147.90508-1-ranxiaokai627@163.com> <20251120144147.90508-3-ranxiaokai627@163.com> Message-ID: On Thu, Nov 20, 2025 at 02:41:47PM +0000, ranxiaokai627 at 163.com wrote: > Subject: liveupdate: Fix boot failure due to kmemleak access to unmapped pages Please
prefix kexec handover patches with kho: rather than liveupdate. > From: Ran Xiaokai > > When booting with debug_pagealloc=on while having: > CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y > CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n > the system fails to boot due to page faults during kmemleak scanning. > > This occurs because: > With debug_pagealloc enabled, __free_pages() invokes > debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for > freed pages in the direct mapping. > Commit 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") > releases the KHO scratch region via init_cma_reserved_pageblock(), > unmapping its physical pages. Subsequent kmemleak scanning accesses > these unmapped pages, triggering fatal page faults. > > Call kmemleak_no_scan_phys() from kho_reserve_scratch() to > exclude the reserved region from scanning before > it is released to the buddy allocator. > > Fixes: 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") > Signed-off-by: Ran Xiaokai > --- > kernel/liveupdate/kexec_handover.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > index 224bdf5becb6..dd4942d1d76c 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c > @@ -11,6 +11,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -654,6 +655,7 @@ static void __init kho_reserve_scratch(void) > if (!addr) > goto err_free_scratch_desc; > > + kmemleak_no_scan_phys(addr); There's kmemleak_ignore_phys() that can be called after the scratch areas allocated from memblock and with that kmemleak should not access them. Take a look at __cma_declare_contiguous_nid(). > kho_scratch[i].addr = addr; > kho_scratch[i].size = size; > i++; -- Sincerely yours, Mike. 
From glaubitz at physik.fu-berlin.de Sat Nov 22 03:11:47 2025 From: glaubitz at physik.fu-berlin.de (John Paul Adrian Glaubitz) Date: Sat, 22 Nov 2025 12:11:47 +0100 Subject: [PATCH 1/2] kexec-tools: powerpc: Fix function signature of comparefunc() In-Reply-To: <20251022114413.4440-1-glaubitz@physik.fu-berlin.de> References: <20251022114413.4440-1-glaubitz@physik.fu-berlin.de> Message-ID: <501429ee083aa7fb07db910e167411fa7707a0f6.camel@physik.fu-berlin.de> On Wed, 2025-10-22 at 13:44 +0200, John Paul Adrian Glaubitz wrote: > Fixes the following build error on 32-bit PowerPC: > > kexec/arch/ppc/fs2dt.c: In function 'putnode': > kexec/arch/ppc/fs2dt.c:338:51: error: passing argument 4 of 'scandir' from incompatible pointer type [-Wincompatible-pointer-types] > 338 | numlist = scandir(pathname, &namelist, 0, comparefunc); > | ^~~~~~~~~~~ > | | > | int (*)(const void *, const void *) > > Signed-off-by: John Paul Adrian Glaubitz > --- > kexec/arch/ppc/fs2dt.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/kexec/arch/ppc/fs2dt.c b/kexec/arch/ppc/fs2dt.c > index fed499b..d03b995 100644 > --- a/kexec/arch/ppc/fs2dt.c > +++ b/kexec/arch/ppc/fs2dt.c > @@ -292,7 +292,8 @@ static void putprops(char *fn, struct dirent **nlist, int numlist) > * Compare function used to sort the device-tree directories > * This function will be passed to scandir. > */ > -static int comparefunc(const void *dentry1, const void *dentry2) > +static int comparefunc(const struct dirent **dentry1, > + const struct dirent **dentry2) > { > char *str1 = (*(struct dirent **)dentry1)->d_name; > char *str2 = (*(struct dirent **)dentry2)->d_name; Ping for both patches. Adrian -- .''`. John Paul Adrian Glaubitz : :' : Debian Developer `. 
`' Physicist `- GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913 From ranxiaokai627 at 163.com Sat Nov 22 09:57:35 2025 From: ranxiaokai627 at 163.com (ranxiaokai627 at 163.com) Date: Sat, 22 Nov 2025 17:57:35 +0000 Subject: [PATCH 2/2] liveupdate: Fix boot failure due to kmemleak access to unmapped pages In-Reply-To: References: Message-ID: <20251122175735.92578-1-ranxiaokai627@163.com> >On Thu, Nov 20 2025, ranxiaokai627 at 163.com wrote: > >> From: Ran Xiaokai >> >> When booting with debug_pagealloc=on while having: >> CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y >> CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n >> the system fails to boot due to page faults during kmemleak scanning. >> >> This occurs because: >> With debug_pagealloc enabled, __free_pages() invokes >> debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for >> freed pages in the direct mapping. >> Commit 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") >> releases the KHO scratch region via init_cma_reserved_pageblock(), >> unmapping its physical pages. Subsequent kmemleak scanning accesses >> these unmapped pages, triggering fatal page faults. > >I don't know how kmemleak works. Why does kmemleak access the unmapped >pages? If pages are not mapped, it should learn to not access them, >right? > >> >> Call kmemleak_no_scan_phys() from kho_reserve_scratch() to >> exclude the reserved region from scanning before >> it is released to the buddy allocator. > >kho_reserve_scratch() is called on the first boot. It allocates the >scratch areas for subsequent boots. On every KHO boot after this, >kho_reserve_scratch() is not called and kho_release_scratch() is called >instead since the scratch areas already exist from previous boot. > >Eventually both paths converge to kho_init() and call >init_cma_reserved_pageblock(). > >So shouldn't you call kmemleak_no_scan_phys() from kho_init() instead? >This would reduce code duplication and cover both paths. Thanks for your review! 
Yes, both paths converge to kho_init(). On the first boot, kho_get_fdt() returns NULL and init_cma_reserved_pageblock() is called; on a KHO boot, kho_get_fdt() returns non-NULL and kho_init() returns before calling init_cma_reserved_pageblock(). However, in a KHO boot, calling kmemleak_no_scan_phys() is unnecessary anyway, because kmemleak objects are only created when memblock_phys_alloc() is called, and a KHO boot does not invoke memblock_phys_alloc(). Moving the kmemleak_no_scan_phys() call into kho_init() therefore both resolves the issue and reduces code duplication. From ranxiaokai627 at 163.com Sat Nov 22 10:07:20 2025 From: ranxiaokai627 at 163.com (ranxiaokai627 at 163.com) Date: Sat, 22 Nov 2025 18:07:20 +0000 Subject: [PATCH 2/2] liveupdate: Fix boot failure due to kmemleak access to unmapped pages In-Reply-To: References: Message-ID: <20251122180720.92605-1-ranxiaokai627@163.com> >On Thu, Nov 20, 2025 at 02:41:47PM +0000, ranxiaokai627 at 163.com wrote: >> Subject: liveupdate: Fix boot failure due to kmemleak access to unmapped pages > >Please prefix kexec handover patches with kho: rather than liveupdate. Thanks for your review, I will update the patch subject. >> From: Ran Xiaokai >> >> When booting with debug_pagealloc=on while having: >> CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y >> CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n >> the system fails to boot due to page faults during kmemleak scanning.
>> >> Call kmemleak_no_scan_phys() from kho_reserve_scratch() to >> exclude the reserved region from scanning before >> it is released to the buddy allocator. >> >> Fixes: 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") >> Signed-off-by: Ran Xiaokai >> --- >> kernel/liveupdate/kexec_handover.c | 4 ++++ >> 1 file changed, 4 insertions(+) >> >> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c >> index 224bdf5becb6..dd4942d1d76c 100644 >> --- a/kernel/liveupdate/kexec_handover.c >> +++ b/kernel/liveupdate/kexec_handover.c >> @@ -11,6 +11,7 @@ >> >> #include >> #include >> +#include >> #include >> #include >> #include >> @@ -654,6 +655,7 @@ static void __init kho_reserve_scratch(void) >> if (!addr) >> goto err_free_scratch_desc; >> >> + kmemleak_no_scan_phys(addr); > >There's kmemleak_ignore_phys() that can be called after the scratch areas >allocated from memblock and with that kmemleak should not access them. > >Take a look at __cma_declare_contiguous_nid(). Thanks for catching this. Since kmemleak_ignore_phys() perfectly handles this issue, introducing another helper is unnecessary. I'll post v2 shortly. >> kho_scratch[i].addr = addr; >> kho_scratch[i].size = size; >> i++; > >-- >Sincerely yours, >Mike. From ranxiaokai627 at 163.com Sat Nov 22 10:29:29 2025 From: ranxiaokai627 at 163.com (ranxiaokai627 at 163.com) Date: Sat, 22 Nov 2025 18:29:29 +0000 Subject: [PATCH v2] KHO: Fix boot failure due to kmemleak access to non-PRESENT pages Message-ID: <20251122182929.92634-1-ranxiaokai627@163.com> From: Ran Xiaokai When booting with debug_pagealloc=on while having: CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n the system fails to boot due to page faults during kmemleak scanning. This occurs because: With debug_pagealloc is enabled, __free_pages() invokes debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for freed pages in the kernel page table. 
Commit 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") triggers this when it releases the KHO scratch region by calling init_cma_reserved_pageblock(). Subsequent kmemleak scanning accesses these non-PRESENT pages, leading to fatal page faults. Call kmemleak_ignore_phys() from kho_init() to exclude the reserved region from kmemleak scanning before it is released to the buddy allocator to fix this. Signed-off-by: Ran Xiaokai --- kernel/liveupdate/kexec_handover.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 224bdf5becb6..c729d455ee7b 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -11,6 +11,7 @@ #include #include +#include #include #include #include @@ -1369,6 +1370,7 @@ static __init int kho_init(void) unsigned long count = kho_scratch[i].size >> PAGE_SHIFT; unsigned long pfn; + kmemleak_ignore_phys(kho_scratch[i].addr); for (pfn = base_pfn; pfn < base_pfn + count; pfn += pageblock_nr_pages) init_cma_reserved_pageblock(pfn_to_page(pfn)); -- 2.25.1 From rppt at kernel.org Sun Nov 23 01:27:19 2025 From: rppt at kernel.org (Mike Rapoport) Date: Sun, 23 Nov 2025 11:27:19 +0200 Subject: [PATCH v2] KHO: Fix boot failure due to kmemleak access to non-PRESENT pages In-Reply-To: <20251122182929.92634-1-ranxiaokai627@163.com> References: <20251122182929.92634-1-ranxiaokai627@163.com> Message-ID: Hi, On Sat, Nov 22, 2025 at 06:29:29PM +0000, ranxiaokai627 at 163.com wrote: > From: Ran Xiaokai > > When booting with debug_pagealloc=on while having: > CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y > CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n > the system fails to boot due to page faults during kmemleak scanning. > > This occurs because: > With debug_pagealloc is enabled, __free_pages() invokes > debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for > freed pages in the kernel page table.
> Commit 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") > triggers this when releases the KHO scratch region calling > init_cma_reserved_pageblock(). Subsequent kmemleak scanning accesses > these non-PRESENT pages, leading to fatal page faults. I believe this is more clear: With debug_pagealloc enabled, __free_pages() invokes debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for freed pages in the kernel page table. KHO scratch areas are allocated from memblock and noted by kmemleak. But these areas don't remain reserved; they are released later to the page allocator using init_cma_reserved_pageblock(). This causes subsequent kmemleak scans to access non-PRESENT pages, leading to fatal page faults. > Call kmemleak_ignore_phys() from kho_init() to exclude > the reserved region from kmemleak scanning before > it is released to the buddy allocator to fix this. I'd suggest Mark scratch areas with kmemleak_ignore_phys() after they are allocated from memblock to exclude them from kmemleak scanning before they are released to the buddy allocator to fix this.
> Fixes: 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") > Signed-off-by: Ran Xiaokai With the changes above Reviewed-by: Mike Rapoport (Microsoft) > --- > kernel/liveupdate/kexec_handover.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > index 224bdf5becb6..c729d455ee7b 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c > @@ -11,6 +11,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -1369,6 +1370,7 @@ static __init int kho_init(void) > unsigned long count = kho_scratch[i].size >> PAGE_SHIFT; > unsigned long pfn; > > + kmemleak_ignore_phys(kho_scratch[i].addr); > for (pfn = base_pfn; pfn < base_pfn + count; > pfn += pageblock_nr_pages) > init_cma_reserved_pageblock(pfn_to_page(pfn)); > -- > 2.25.1 > > -- Sincerely yours, Mike. From ranxiaokai627 at 163.com Sun Nov 23 18:59:43 2025 From: ranxiaokai627 at 163.com (ranxiaokai627 at 163.com) Date: Mon, 24 Nov 2025 02:59:43 +0000 Subject: [PATCH v3] KHO: Fix boot failure due to kmemleak access to non-PRESENT pages Message-ID: <20251124025943.94469-1-ranxiaokai627@163.com> From: Ran Xiaokai When booting with debug_pagealloc=on while having: CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n the system fails to boot due to page faults during kmemleak scanning. This occurs because: With debug_pagealloc enabled, __free_pages() invokes debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for freed pages in the kernel page table. KHO scratch areas are allocated from memblock and noted by kmemleak. But these areas don't remain reserved; they are released later to the page allocator using init_cma_reserved_pageblock(). This causes subsequent kmemleak scans to access non-PRESENT pages, leading to fatal page faults.
Mark scratch areas with kmemleak_ignore_phys() after they are allocated from memblock to exclude them from kmemleak scanning before they are released to buddy allocator to fix this. Fixes: 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") Signed-off-by: Ran Xiaokai Reviewed-by: Mike Rapoport (Microsoft) --- kernel/liveupdate/kexec_handover.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 224bdf5becb6..c729d455ee7b 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -11,6 +11,7 @@ #include #include +#include #include #include #include @@ -1369,6 +1370,7 @@ static __init int kho_init(void) unsigned long count = kho_scratch[i].size >> PAGE_SHIFT; unsigned long pfn; + kmemleak_ignore_phys(kho_scratch[i].addr); for (pfn = base_pfn; pfn < base_pfn + count; pfn += pageblock_nr_pages) init_cma_reserved_pageblock(pfn_to_page(pfn)); -- 2.25.1 From ltao at redhat.com Sun Nov 23 20:46:37 2025 From: ltao at redhat.com (Tao Liu) Date: Mon, 24 Nov 2025 17:46:37 +1300 Subject: [PATCH v2][makedumpfile 00/14] btf/kallsyms based eppic extension for mm page filtering In-Reply-To: <20251020222410.8235-1-ltao@redhat.com> References: <20251020222410.8235-1-ltao@redhat.com> Message-ID: Kindly ping... Any comments on this? Thanks, Tao Liu On Tue, Oct 21, 2025 at 11:24?AM Tao Liu wrote: > > A) This patchset will introduce the following features to makedumpfile: > > 1) Enable eppic script for memory pages filtering. > 2) Enable btf and kallsyms for symbol type and address resolving. > > B) The purpose of the features are: > > 1) Currently makedumpfile filters mm pages based on page flags, because flags > can help to determine one page's usage. But this page-flag-checking method > lacks of flexibility in certain cases, e.g. 
if we want to filter those mm > pages occupied by a GPU during vmcore dumping because: > > a) the GPU may be using a large amount of memory and contain sensitive data; > b) GPU mm pages have no relation to the kernel crash and are useless for vmcore > analysis. > > But there are no GPU-specific mm page flags, and apparently we don't need > to create one just for kdump use. A programmable filtering tool is more > suitable for such cases. In addition, different GPU vendors may use > different ways of allocating mm pages, so programmable filtering is better > than hard coding these GPU-specific logics into makedumpfile in this case. > > 2) Currently makedumpfile already contains a programmable filtering tool, aka > eppic script, which allows users to write customized code for data erasing. > However it has the following drawbacks: > > a) it cannot do mm page filtering. > b) it needs access to the debuginfo of both the kernel and modules, which is not > available in the 2nd kernel. > c) poor performance, making vmcore dumping time unacceptable (see > the following performance testing). > > makedumpfile needs to resolve the dwarf data from debuginfo to get symbol > types and addresses. In recent kernels there are dwarf alternatives such > as btf/kallsyms which can be used for this purpose. And the btf/kallsyms info > is already packed within the vmcore, so we can use it directly. > > With these, this patchset introduces an upgraded eppic, which is based on > btf/kallsyms symbol resolving, and is programmable for mm page filtering.
> The following info shows its usage and performance, please note the tests > are performed in 1st kernel: > > $ time ./makedumpfile -d 31 -l /var/crash/127.0.0.1-2025-06-10-18\:03\:12/vmcore > /tmp/dwarf.out -x /lib/debug/lib/modules/6.11.8-300.fc41.x86_64/vmlinux > --eppic eppic_scripts/filter_amdgpu_mm_pages.c > real 14m6.894s > user 4m16.900s > sys 9m44.695s > > $ time ./makedumpfile -d 31 -l /var/crash/127.0.0.1-2025-06-10-18\:03\:12/vmcore > /tmp/btf.out --eppic eppic_scripts/filter_amdgpu_mm_pages.c > real 0m10.672s > user 0m9.270s > sys 0m1.130s > > -rw------- 1 root root 367475074 Jun 10 18:06 btf.out > -rw------- 1 root root 367475074 Jun 10 21:05 dwarf.out > -rw-rw-rw- 1 root root 387181418 Jun 10 18:03 /var/crash/127.0.0.1-2025-06-10-18:03:12/vmcore > > C) Discussion: > > 1) GPU types: Currently only tested with amdgpu's mm page filtering, others > are not tested. > 2) OS: The code can work on rhel-10+/rhel9.5+ on x86_64/arm64/s390/ppc64. > Others are not tested. > > D) Testing: > > 1) If you don't want to create your vmcore, you can find a vmcore which I > created with amdgpu mm pages unfiltered [1], the amdgpu mm pages are > allocated by program [2]. You can use the vmcore in 1st kernel to filter > the amdgpu mm pages by the previous performance testing cmdline. 
To > verify the pages are filtered in crash: > > Unfiltered: > crash> search -c "!QAZXSW@#EDC" > ffff96b7fa800000: !QAZXSW@#EDCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > ffff96b87c800000: !QAZXSW@#EDCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > crash> rd ffff96b7fa800000 > ffff96b7fa800000: 405753585a415121 !QAZXSW@ > crash> rd ffff96b87c800000 > ffff96b87c800000: 405753585a415121 !QAZXSW@ > > Filtered: > crash> search -c "!QAZXSW@#EDC" > crash> rd ffff96b7fa800000 > rd: page excluded: kernel virtual address: ffff96b7fa800000 type: "64-bit KVADDR" > crash> rd ffff96b87c800000 > rd: page excluded: kernel virtual address: ffff96b87c800000 type: "64-bit KVADDR" > > 2) You can use eppic_scripts/print_all_vma.c against an ordinary vmcore to > test only the btf/kallsyms functions by outputting all VMAs if no amdgpu > vmcore/machine is available. > > [1]: https://people.redhat.com/~ltao/core/ > [2]: https://gist.github.com/liutgnu/a8cbce1c666452f1530e1410d1f352df > > v2 -> v1: > > 1) Moved maple tree related code (for VMA iteration) into the eppic script, so we > don't need to port maple tree code to makedumpfile.
> > 2) Reorganized the patchset as follows: > > --- --- > 1.Add page filtering function > 2.Supporting main() as the entry of eppic script > > --- --- > 3.dwarf_info: Support kernel address randomization > 4.dwarf_info: Fix an infinite recursion bug for rust > 5.eppic dwarf: support anonymous structs member resolving > 6.Enable page filtering for dwarf eppic > > --- --- > 7.Implement kernel kallsyms resolving > 8.Implement kernel btf resolving > 9.Implement kernel module's kallsyms resolving > 10.Implement kernel module's btf resolving > 11.Export necessary btf/kallsyms functions to eppic extension > 12.Enable page filtering for btf/kallsyms eppic > 13.Docs: Update eppic related entries > > --- --- > 14.Introducing 2 eppic scripts to test the dwarf/btf eppic extension > > The modifications on the dwarf side are primarily for comparison purposes: > for the same eppic program, mm page filtering should get exactly the same > outputs for the dwarf and kallsyms/btf based approaches. If the outputs don't match, > this indicates bugs. In fact, we will never use dwarf mm page filtering > in practice, due to its poor performance as well as the inaccessibility > of debuginfo during kdump in the 2nd kernel. So patches 3/4/5 won't affect > the function of btf/kallsyms eppic mm page filtering, but there are > functions shared in patch 6, so it is a must-have one. Patch 14 is > only for test purposes, to demonstrate how to write an eppic script for > mm page filtering, so it isn't a must-have patch. > > Please note, in patch 14, I have deliberately converted all array > operations into pointer operations, e.g. modified "node->slot[i]" into > "*((unsigned long *)&(node->slot) + i)". This is because there are > bugs in the array operation support in extension_eppic.c. I didn't have > the time to test and fix them all because, as I mentioned previously, > mm page filtering on the dwarf side is only for comparison and will > never be used in practice. There is no such issue for the kallsyms/btf > eppic side.
> > 3) Since we ported maple tree code to eppic script, several bugs found > both for eppic library & eppic btf support. Please use master branch > of eppic library to co-compile with this patchset. > > Tao Liu (14): > Add page filtering function > Supporting main() as the entry of eppic script > dwarf_info: Support kernel address randomization > dwarf_info: Fix a infinite recursion bug for rust > eppic dwarf: support anonymous structs member resolving > Enable page filtering for dwarf eppic > Implement kernel kallsyms resolving > Implement kernel btf resolving > Implement kernel module's kallsyms resolving > Implement kernel module's btf resolving > Export necessary btf/kallsyms functions to eppic extension > Enable page filtering for btf/kallsyms eppic > Docs: Update eppic related entries > Introducing 2 eppic scripts to test the dwarf/btf eppic extension > > Makefile | 6 +- > btf.c | 919 +++++++++++++++++++++++++ > btf.h | 177 +++++ > dwarf_info.c | 7 + > eppic_scripts/filter_amdgpu_mm_pages.c | 255 +++++++ > eppic_scripts/print_all_vma.c | 239 +++++++ > erase_info.c | 120 +++- > erase_info.h | 19 + > extension_btf.c | 258 +++++++ > extension_eppic.c | 106 ++- > extension_eppic.h | 6 +- > kallsyms.c | 392 +++++++++++ > kallsyms.h | 41 ++ > makedumpfile.8.in | 24 +- > makedumpfile.c | 21 +- > makedumpfile.h | 11 + > print_info.c | 11 +- > 17 files changed, 2550 insertions(+), 62 deletions(-) > create mode 100644 btf.c > create mode 100644 btf.h > create mode 100644 eppic_scripts/filter_amdgpu_mm_pages.c > create mode 100644 eppic_scripts/print_all_vma.c > create mode 100644 extension_btf.c > create mode 100644 kallsyms.c > create mode 100644 kallsyms.h > > -- > 2.47.0 > From pratyush at kernel.org Mon Nov 24 04:16:55 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 24 Nov 2025 13:16:55 +0100 Subject: [PATCH v3] KHO: Fix boot failure due to kmemleak access to non-PRESENT pages In-Reply-To: <20251124025943.94469-1-ranxiaokai627@163.com> (ranxiaokai's 
message of "Mon, 24 Nov 2025 02:59:43 +0000") References: <20251124025943.94469-1-ranxiaokai627@163.com> Message-ID: On Mon, Nov 24 2025, ranxiaokai627 at 163.com wrote: > From: Ran Xiaokai > > When booting with debug_pagealloc=on while having: > CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y > CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n > the system fails to boot due to page faults during kmemleak scanning. > > This occurs because: > With debug_pagealloc is enabled, __free_pages() invokes > debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for > freed pages in the kernel page table. > KHO scratch areas are allocated from memblock and noted by kmemleak. But > these areas don't remain reserved but released later to the page allocator > using init_cma_reserved_pageblock(). This causes subsequent kmemleak scans > access non-PRESENT pages, leading to fatal page faults. > > Mark scratch areas with kmemleak_ignore_phys() after they are allocated > from memblock to exclude them from kmemleak scanning before they are > released to buddy allocator to fix this. > > Fixes: 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") > Signed-off-by: Ran Xiaokai > Reviewed-by: Mike Rapoport (Microsoft) > --- > kernel/liveupdate/kexec_handover.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > index 224bdf5becb6..c729d455ee7b 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c > @@ -11,6 +11,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -1369,6 +1370,7 @@ static __init int kho_init(void) > unsigned long count = kho_scratch[i].size >> PAGE_SHIFT; > unsigned long pfn; > > + kmemleak_ignore_phys(kho_scratch[i].addr); Can you please put the explanation you gave in [0] for why this is not necessary in KHO boot as a comment here? 
After that, Reviewed-by: Pratyush Yadav [0] https://lore.kernel.org/all/20251122175735.92578-1-ranxiaokai627 at 163.com/ > for (pfn = base_pfn; pfn < base_pfn + count; > pfn += pageblock_nr_pages) > init_cma_reserved_pageblock(pfn_to_page(pfn)); -- Regards, Pratyush Yadav From pratyush at kernel.org Mon Nov 24 05:25:55 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Mon, 24 Nov 2025 14:25:55 +0100 Subject: [PATCH v3] KHO: Fix boot failure due to kmemleak access to non-PRESENT pages In-Reply-To: (Pratyush Yadav's message of "Mon, 24 Nov 2025 13:16:55 +0100") References: <20251124025943.94469-1-ranxiaokai627@163.com> Message-ID: On Mon, Nov 24 2025, Pratyush Yadav wrote: > On Mon, Nov 24 2025, ranxiaokai627 at 163.com wrote: > >> From: Ran Xiaokai >> >> When booting with debug_pagealloc=on while having: >> CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y >> CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n >> the system fails to boot due to page faults during kmemleak scanning. >> >> This occurs because: >> With debug_pagealloc is enabled, __free_pages() invokes >> debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for >> freed pages in the kernel page table. >> KHO scratch areas are allocated from memblock and noted by kmemleak. But >> these areas don't remain reserved but released later to the page allocator >> using init_cma_reserved_pageblock(). This causes subsequent kmemleak scans >> access non-PRESENT pages, leading to fatal page faults. >> >> Mark scratch areas with kmemleak_ignore_phys() after they are allocated >> from memblock to exclude them from kmemleak scanning before they are >> released to buddy allocator to fix this. 
>> >> Fixes: 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers") >> Signed-off-by: Ran Xiaokai >> Reviewed-by: Mike Rapoport (Microsoft) >> --- >> kernel/liveupdate/kexec_handover.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c >> index 224bdf5becb6..c729d455ee7b 100644 >> --- a/kernel/liveupdate/kexec_handover.c >> +++ b/kernel/liveupdate/kexec_handover.c >> @@ -11,6 +11,7 @@ >> >> #include >> #include >> +#include >> #include >> #include >> #include >> @@ -1369,6 +1370,7 @@ static __init int kho_init(void) >> unsigned long count = kho_scratch[i].size >> PAGE_SHIFT; >> unsigned long pfn; >> >> + kmemleak_ignore_phys(kho_scratch[i].addr); > > Can you please put the explanation you gave in [0] for why this is not > necessary in KHO boot as a comment here? And also an explanation of why this is necessary in the first place. You do explain that in the commit message, but a shorter version as a comment will make this a lot easier to understand instead of having to dig through git history. [...] -- Regards, Pratyush Yadav From usamaarif642 at gmail.com Mon Nov 24 11:24:58 2025 From: usamaarif642 at gmail.com (Usama Arif) Date: Mon, 24 Nov 2025 19:24:58 +0000 Subject: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for memory below 1M In-Reply-To: <20250509074635.3187114-13-changyuanl@google.com> References: <20250509074635.3187114-1-changyuanl@google.com> <20250509074635.3187114-13-changyuanl@google.com> Message-ID: On 09/05/2025 08:46, Changyuan Lyu wrote: > From: Alexander Graf > > KHO kernels are special and use only scratch memory for memblock > allocations, but memory below 1M is ignored by kernel after early boot > and cannot be naturally marked as scratch. > > To allow allocation of the real-mode trampoline and a few (if any) other > very early allocations from below 1M forcibly mark the memory below 1M > as scratch. 
> > After real mode trampoline is allocated, clear that scratch marking. > > Signed-off-by: Alexander Graf > Co-developed-by: Mike Rapoport (Microsoft) > Signed-off-by: Mike Rapoport (Microsoft) > Co-developed-by: Changyuan Lyu > Signed-off-by: Changyuan Lyu > Acked-by: Dave Hansen > --- > arch/x86/kernel/e820.c | 18 ++++++++++++++++++ > arch/x86/realmode/init.c | 2 ++ > 2 files changed, 20 insertions(+) > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c > index 9920122018a0b..c3acbd26408ba 100644 > --- a/arch/x86/kernel/e820.c > +++ b/arch/x86/kernel/e820.c > @@ -1299,6 +1299,24 @@ void __init e820__memblock_setup(void) > memblock_add(entry->addr, entry->size); > } > > + /* > + * At this point memblock is only allowed to allocate from memory > + * below 1M (aka ISA_END_ADDRESS) up until direct map is completely set > + * up in init_mem_mapping(). > + * > + * KHO kernels are special and use only scratch memory for memblock > + * allocations, but memory below 1M is ignored by kernel after early > + * boot and cannot be naturally marked as scratch. > + * > + * To allow allocation of the real-mode trampoline and a few (if any) > + * other very early allocations from below 1M forcibly mark the memory > + * below 1M as scratch. > + * > + * After real mode trampoline is allocated, we clear that scratch > + * marking. > + */ > + memblock_mark_kho_scratch(0, SZ_1M); > + > /* > * 32-bit systems are limited to 4BG of memory even with HIGHMEM and > * to even less without it. > diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c > index f9bc444a3064d..9b9f4534086d2 100644 > --- a/arch/x86/realmode/init.c > +++ b/arch/x86/realmode/init.c > @@ -65,6 +65,8 @@ void __init reserve_real_mode(void) > * setup_arch(). > */ > memblock_reserve(0, SZ_1M); > + > + memblock_clear_kho_scratch(0, SZ_1M); > } > > static void __init sme_sev_setup_real_mode(struct trampoline_header *th) Hello! 
I am working with Breno who reported that we are seeing the below warning at boot when rolling out 6.16 in Meta fleet. It is difficult to reproduce on a single host manually but we are seeing this several times a day inside the fleet. 20:16:33 ------------[ cut here ]------------ 20:16:33 WARNING: CPU: 0 PID: 0 at mm/memblock.c:668 memblock_add_range+0x316/0x330 20:16:33 Modules linked in: 20:16:33 CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G S 6.16.1-0_fbk0_0_gc0739ee5037a #1 NONE 20:16:33 Tainted: [S]=CPU_OUT_OF_SPEC 20:16:33 RIP: 0010:memblock_add_range+0x316/0x330 20:16:33 Code: ff ff ff 89 5c 24 08 41 ff c5 44 89 6c 24 10 48 63 74 24 08 48 63 54 24 10 e8 26 0c 00 00 e9 41 ff ff ff 0f 0b e9 af fd ff ff <0f> 0b e9 b7 fd ff ff 0f 0b 0f 0b cc cc cc cc cc cc cc cc cc cc cc 20:16:33 RSP: 0000:ffffffff83403dd8 EFLAGS: 00010083 ORIG_RAX: 0000000000000000 20:16:33 RAX: ffffffff8476ff90 RBX: 0000000000001c00 RCX: 0000000000000002 20:16:33 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff83bad4d8 20:16:33 RBP: 000000000009f000 R08: 0000000000000020 R09: 8000000000097101 20:16:33 R10: ffffffffff2004b0 R11: 203a6d6f646e6172 R12: 000000000009ec00 20:16:33 R13: 0000000000000002 R14: 0000000000100000 R15: 000000000009d000 20:16:33 FS: 0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000 20:16:33 CR2: ffff888065413ff8 CR3: 00000000663b7000 CR4: 00000000000000b0 20:16:33 Call Trace: 20:16:33 20:16:33 ? __memblock_reserve+0x75/0x80 20:16:33 ? setup_arch+0x30f/0xb10 20:16:33 ? start_kernel+0x58/0x960 20:16:33 ? x86_64_start_reservations+0x20/0x20 20:16:33 ? x86_64_start_kernel+0x13d/0x140 20:16:33 ? common_startup_64+0x13e/0x140 20:16:33 20:16:33 ---[ end trace 0000000000000000 ]--- Rolling out with memblock=debug is not really an option in a large scale fleet due to the time added to boot. 
But I did try on one of the hosts (without reproducing the issue) and I see: [ 0.000616] memory.cnt = 0x6 [ 0.000617] memory[0x0] [0x0000000000001000-0x000000000009bfff], 0x000000000009b000 bytes flags: 0x40 [ 0.000620] memory[0x1] [0x000000000009f000-0x000000000009ffff], 0x0000000000001000 bytes flags: 0x40 [ 0.000621] memory[0x2] [0x0000000000100000-0x000000005ed09fff], 0x000000005ec0a000 bytes flags: 0x0 ... The 0x40 (MEMBLOCK_KHO_SCRATCH) is coming from memblock_mark_kho_scratch in e820__memblock_setup. I believe this should be under an ifdef, like the diff at the end? (Happy to send this as a patch for review if it makes sense). We have KEXEC_HANDOVER disabled in our defconfig, therefore MEMBLOCK_KHO_SCRATCH shouldn't be selected and we shouldn't have any MEMBLOCK_KHO_SCRATCH type regions in our memblock reservations. The other thing I did was insert a while(1) just before the warning and inspected the registers in qemu. R14 held the base address, and R15 held the size at that point. In the warning R14 is 0x100000, meaning that someone is reserving a region with a flag different from MEMBLOCK_NONE at the boundary of MEMBLOCK_KHO_SCRATCH. diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index c3acbd26408ba..26e4062a0bd09 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -1299,6 +1299,7 @@ void __init e820__memblock_setup(void) memblock_add(entry->addr, entry->size); } +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH /* * At this point memblock is only allowed to allocate from memory * below 1M (aka ISA_END_ADDRESS) up until direct map is completely set @@ -1316,7 +1317,7 @@ void __init e820__memblock_setup(void) * marking. */ memblock_mark_kho_scratch(0, SZ_1M); - +#endif /* * 32-bit systems are limited to 4BG of memory even with HIGHMEM and * to even less without it. 
diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c index 88be32026768c..1cd80293a3e23 100644 --- a/arch/x86/realmode/init.c +++ b/arch/x86/realmode/init.c @@ -66,8 +66,9 @@ void __init reserve_real_mode(void) * setup_arch(). */ memblock_reserve(0, SZ_1M); - +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH memblock_clear_kho_scratch(0, SZ_1M); +#endif } static void __init sme_sev_setup_real_mode(struct trampoline_header *th) From akpm at linux-foundation.org Mon Nov 24 14:16:20 2025 From: akpm at linux-foundation.org (Andrew Morton) Date: Mon, 24 Nov 2025 14:16:20 -0800 Subject: [PATCHv2 1/2] kernel/kexec: Change the prototype of kimage_map_segment() In-Reply-To: <20251106065904.10772-1-piliu@redhat.com> References: <20251106065904.10772-1-piliu@redhat.com> Message-ID: <20251124141620.eaef984836fe2edc7acf9179@linux-foundation.org> On Thu, 6 Nov 2025 14:59:03 +0800 Pingfan Liu wrote: > The kexec segment index will be required to extract the corresponding > information for that segment in kimage_map_segment(). Additionally, > kexec_segment already holds the kexec relocation destination address and > size. Therefore, the prototype of kimage_map_segment() can be changed. Could we please have some reviewer input on these two patches? Thanks. 
(Pingfan, please cc linux-kernel on patches - it's where people go to find emails on lists which they aren't suscribed to) (akpm goes off and subscribes to kexec@) > Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") > Signed-off-by: Pingfan Liu > Cc: Andrew Morton > Cc: Baoquan He > Cc: Mimi Zohar > Cc: Roberto Sassu > Cc: Alexander Graf > Cc: Steven Chen > Cc: > To: kexec at lists.infradead.org > To: linux-integrity at vger.kernel.org > --- > include/linux/kexec.h | 4 ++-- > kernel/kexec_core.c | 9 ++++++--- > security/integrity/ima/ima_kexec.c | 4 +--- > 3 files changed, 9 insertions(+), 8 deletions(-) > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > index ff7e231b0485..8a22bc9b8c6c 100644 > --- a/include/linux/kexec.h > +++ b/include/linux/kexec.h > @@ -530,7 +530,7 @@ extern bool kexec_file_dbg_print; > #define kexec_dprintk(fmt, arg...) \ > do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0) > > -extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size); > +extern void *kimage_map_segment(struct kimage *image, int idx); > extern void kimage_unmap_segment(void *buffer); > #else /* !CONFIG_KEXEC_CORE */ > struct pt_regs; > @@ -540,7 +540,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { } > static inline void crash_kexec(struct pt_regs *regs) { } > static inline int kexec_should_crash(struct task_struct *p) { return 0; } > static inline int kexec_crash_loaded(void) { return 0; } > -static inline void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size) > +static inline void *kimage_map_segment(struct kimage *image, int idx) > { return NULL; } > static inline void kimage_unmap_segment(void *buffer) { } > #define kexec_in_progress false > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > index fa00b239c5d9..9a1966207041 100644 > --- a/kernel/kexec_core.c > +++ b/kernel/kexec_core.c > @@ -960,17 +960,20 @@ int 
kimage_load_segment(struct kimage *image, int idx) > return result; > } > > -void *kimage_map_segment(struct kimage *image, > - unsigned long addr, unsigned long size) > +void *kimage_map_segment(struct kimage *image, int idx) > { > + unsigned long addr, size, eaddr; > unsigned long src_page_addr, dest_page_addr = 0; > - unsigned long eaddr = addr + size; > kimage_entry_t *ptr, entry; > struct page **src_pages; > unsigned int npages; > void *vaddr = NULL; > int i; > > + addr = image->segment[idx].mem; > + size = image->segment[idx].memsz; > + eaddr = addr + size; > + > /* > * Collect the source pages and map them in a contiguous VA range. > */ > diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c > index 7362f68f2d8b..5beb69edd12f 100644 > --- a/security/integrity/ima/ima_kexec.c > +++ b/security/integrity/ima/ima_kexec.c > @@ -250,9 +250,7 @@ void ima_kexec_post_load(struct kimage *image) > if (!image->ima_buffer_addr) > return; > > - ima_kexec_buffer = kimage_map_segment(image, > - image->ima_buffer_addr, > - image->ima_buffer_size); > + ima_kexec_buffer = kimage_map_segment(image, image->ima_segment_index); > if (!ima_kexec_buffer) { > pr_err("Could not map measurements buffer.\n"); > return; > -- > 2.49.0 From hpa at zytor.com Mon Nov 24 16:56:34 2025 From: hpa at zytor.com (H. 
Peter Anvin) Date: Mon, 24 Nov 2025 16:56:34 -0800 Subject: =?US-ASCII?Q?Re=3A_=5BPATCH_v8_12/17=5D_x86/e820=3A_temporari?= =?US-ASCII?Q?ly_enable_KHO_scratch_for_memory_below_1M?= In-Reply-To: References: <20250509074635.3187114-1-changyuanl@google.com> <20250509074635.3187114-13-changyuanl@google.com> Message-ID: <22BDBF5C-C831-4BBC-A854-20CA77234084@zytor.com> On November 24, 2025 11:24:58 AM PST, Usama Arif wrote: > > >On 09/05/2025 08:46, Changyuan Lyu wrote: >> From: Alexander Graf >> >> KHO kernels are special and use only scratch memory for memblock >> allocations, but memory below 1M is ignored by kernel after early boot >> and cannot be naturally marked as scratch. >> >> To allow allocation of the real-mode trampoline and a few (if any) other >> very early allocations from below 1M forcibly mark the memory below 1M >> as scratch. >> >> After real mode trampoline is allocated, clear that scratch marking. >> >> Signed-off-by: Alexander Graf >> Co-developed-by: Mike Rapoport (Microsoft) >> Signed-off-by: Mike Rapoport (Microsoft) >> Co-developed-by: Changyuan Lyu >> Signed-off-by: Changyuan Lyu >> Acked-by: Dave Hansen >> --- >> arch/x86/kernel/e820.c | 18 ++++++++++++++++++ >> arch/x86/realmode/init.c | 2 ++ >> 2 files changed, 20 insertions(+) >> >> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c >> index 9920122018a0b..c3acbd26408ba 100644 >> --- a/arch/x86/kernel/e820.c >> +++ b/arch/x86/kernel/e820.c >> @@ -1299,6 +1299,24 @@ void __init e820__memblock_setup(void) >> memblock_add(entry->addr, entry->size); >> } >> >> + /* >> + * At this point memblock is only allowed to allocate from memory >> + * below 1M (aka ISA_END_ADDRESS) up until direct map is completely set >> + * up in init_mem_mapping(). >> + * >> + * KHO kernels are special and use only scratch memory for memblock >> + * allocations, but memory below 1M is ignored by kernel after early >> + * boot and cannot be naturally marked as scratch. 
>> + * >> + * To allow allocation of the real-mode trampoline and a few (if any) >> + * other very early allocations from below 1M forcibly mark the memory >> + * below 1M as scratch. >> + * >> + * After real mode trampoline is allocated, we clear that scratch >> + * marking. >> + */ >> + memblock_mark_kho_scratch(0, SZ_1M); >> + >> /* >> * 32-bit systems are limited to 4BG of memory even with HIGHMEM and >> * to even less without it. >> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c >> index f9bc444a3064d..9b9f4534086d2 100644 >> --- a/arch/x86/realmode/init.c >> +++ b/arch/x86/realmode/init.c >> @@ -65,6 +65,8 @@ void __init reserve_real_mode(void) >> * setup_arch(). >> */ >> memblock_reserve(0, SZ_1M); >> + >> + memblock_clear_kho_scratch(0, SZ_1M); >> } >> >> static void __init sme_sev_setup_real_mode(struct trampoline_header *th) > >Hello! > >I am working with Breno who reported that we are seeing the below warning at boot >when rolling out 6.16 in Meta fleet. It is difficult to reproduce on a single host >manually but we are seeing this several times a day inside the fleet. 
> > 20:16:33 ------------[ cut here ]------------ > 20:16:33 WARNING: CPU: 0 PID: 0 at mm/memblock.c:668 memblock_add_range+0x316/0x330 > 20:16:33 Modules linked in: > 20:16:33 CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G S 6.16.1-0_fbk0_0_gc0739ee5037a #1 NONE > 20:16:33 Tainted: [S]=CPU_OUT_OF_SPEC > 20:16:33 RIP: 0010:memblock_add_range+0x316/0x330 > 20:16:33 Code: ff ff ff 89 5c 24 08 41 ff c5 44 89 6c 24 10 48 63 74 24 08 48 63 54 24 10 e8 26 0c 00 00 e9 41 ff ff ff 0f 0b e9 af fd ff ff <0f> 0b e9 b7 fd ff ff 0f 0b 0f 0b cc cc cc cc cc cc cc cc cc cc cc > 20:16:33 RSP: 0000:ffffffff83403dd8 EFLAGS: 00010083 ORIG_RAX: 0000000000000000 > 20:16:33 RAX: ffffffff8476ff90 RBX: 0000000000001c00 RCX: 0000000000000002 > 20:16:33 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff83bad4d8 > 20:16:33 RBP: 000000000009f000 R08: 0000000000000020 R09: 8000000000097101 > 20:16:33 R10: ffffffffff2004b0 R11: 203a6d6f646e6172 R12: 000000000009ec00 > 20:16:33 R13: 0000000000000002 R14: 0000000000100000 R15: 000000000009d000 > 20:16:33 FS: 0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000 > 20:16:33 CR2: ffff888065413ff8 CR3: 00000000663b7000 CR4: 00000000000000b0 > 20:16:33 Call Trace: > 20:16:33 > 20:16:33 ? __memblock_reserve+0x75/0x80 > 20:16:33 ? setup_arch+0x30f/0xb10 > 20:16:33 ? start_kernel+0x58/0x960 > 20:16:33 ? x86_64_start_reservations+0x20/0x20 > 20:16:33 ? x86_64_start_kernel+0x13d/0x140 > 20:16:33 ? common_startup_64+0x13e/0x140 > 20:16:33 > 20:16:33 ---[ end trace 0000000000000000 ]--- > > >Rolling out with memblock=debug is not really an option in a large scale fleet due to the >time added to boot. 
But I did try on one of the hosts (without reproducing the issue) and I see: > >[ 0.000616] memory.cnt = 0x6 >[ 0.000617] memory[0x0] [0x0000000000001000-0x000000000009bfff], 0x000000000009b000 bytes flags: 0x40 >[ 0.000620] memory[0x1] [0x000000000009f000-0x000000000009ffff], 0x0000000000001000 bytes flags: 0x40 >[ 0.000621] memory[0x2] [0x0000000000100000-0x000000005ed09fff], 0x000000005ec0a000 bytes flags: 0x0 >... > >The 0x40 (MEMBLOCK_KHO_SCRATCH) is coming from memblock_mark_kho_scratch in e820__memblock_setup. I believe this >should be under ifdef like the diff at the end? (Happy to send this as a patch for review if it makes sense). >We have KEXEC_HANDOVER disabled in our defconfig, therefore MEMBLOCK_KHO_SCRATCH shouldnt be selected and >we shouldnt have any MEMBLOCK_KHO_SCRATCH type regions in our memblock reservations. > >The other thing I did was insert a while(1) just before the warning and inspected the registers in qemu. >R14 held the base register, and R15 held the size at that point. >In the warning R14 is 0x100000 meaning that someone is reserving a region with a different flag to MEMBLOCK_NONE >at the boundary of MEMBLOCK_KHO_SCRATCH. > >diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c >index c3acbd26408ba..26e4062a0bd09 100644 >--- a/arch/x86/kernel/e820.c >+++ b/arch/x86/kernel/e820.c >@@ -1299,6 +1299,7 @@ void __init e820__memblock_setup(void) > memblock_add(entry->addr, entry->size); > } > >+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH > /* > * At this point memblock is only allowed to allocate from memory > * below 1M (aka ISA_END_ADDRESS) up until direct map is completely set >@@ -1316,7 +1317,7 @@ void __init e820__memblock_setup(void) > * marking. > */ > memblock_mark_kho_scratch(0, SZ_1M); >- >+#endif > /* > * 32-bit systems are limited to 4BG of memory even with HIGHMEM and > * to even less without it. 
>diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c >index 88be32026768c..1cd80293a3e23 100644 >--- a/arch/x86/realmode/init.c >+++ b/arch/x86/realmode/init.c >@@ -66,8 +66,9 @@ void __init reserve_real_mode(void) > * setup_arch(). > */ > memblock_reserve(0, SZ_1M); >- >+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH > memblock_clear_kho_scratch(0, SZ_1M); >+#endif > } > > static void __init sme_sev_setup_real_mode(struct trampoline_header *th) What does "scratch" mean in this exact context? (Sorry, don't have the code in front of me.) From piliu at redhat.com Mon Nov 24 20:10:18 2025 From: piliu at redhat.com (Pingfan Liu) Date: Tue, 25 Nov 2025 12:10:18 +0800 Subject: [PATCHv2 1/2] kernel/kexec: Change the prototype of kimage_map_segment() In-Reply-To: <20251124141620.eaef984836fe2edc7acf9179@linux-foundation.org> References: <20251106065904.10772-1-piliu@redhat.com> <20251124141620.eaef984836fe2edc7acf9179@linux-foundation.org> Message-ID: On Tue, Nov 25, 2025 at 6:16?AM Andrew Morton wrote: > > On Thu, 6 Nov 2025 14:59:03 +0800 Pingfan Liu wrote: > > > The kexec segment index will be required to extract the corresponding > > information for that segment in kimage_map_segment(). Additionally, > > kexec_segment already holds the kexec relocation destination address and > > size. Therefore, the prototype of kimage_map_segment() can be changed. > > Could we please have some reviewer input on thee two patches? > > Thanks. 
> > (Pingfan, please cc linux-kernel on patches - it's where people go to > find emails on lists which they aren't suscribed to) > OK, I will cc linux-kernel for the future kexec patches For this series, it can also be found on https://lore.kernel.org/linux-integrity/20251106065904.10772-1-piliu at redhat.com/ Thanks, Pingfan > (akpm goes off and subscribes to kexec@) > > > Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") > > Signed-off-by: Pingfan Liu > > Cc: Andrew Morton > > Cc: Baoquan He > > Cc: Mimi Zohar > > Cc: Roberto Sassu > > Cc: Alexander Graf > > Cc: Steven Chen > > Cc: > > To: kexec at lists.infradead.org > > To: linux-integrity at vger.kernel.org > > --- > > include/linux/kexec.h | 4 ++-- > > kernel/kexec_core.c | 9 ++++++--- > > security/integrity/ima/ima_kexec.c | 4 +--- > > 3 files changed, 9 insertions(+), 8 deletions(-) > > > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > > index ff7e231b0485..8a22bc9b8c6c 100644 > > --- a/include/linux/kexec.h > > +++ b/include/linux/kexec.h > > @@ -530,7 +530,7 @@ extern bool kexec_file_dbg_print; > > #define kexec_dprintk(fmt, arg...) 
\ > > do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0) > > > > -extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size); > > +extern void *kimage_map_segment(struct kimage *image, int idx); > > extern void kimage_unmap_segment(void *buffer); > > #else /* !CONFIG_KEXEC_CORE */ > > struct pt_regs; > > @@ -540,7 +540,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { } > > static inline void crash_kexec(struct pt_regs *regs) { } > > static inline int kexec_should_crash(struct task_struct *p) { return 0; } > > static inline int kexec_crash_loaded(void) { return 0; } > > -static inline void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size) > > +static inline void *kimage_map_segment(struct kimage *image, int idx) > > { return NULL; } > > static inline void kimage_unmap_segment(void *buffer) { } > > #define kexec_in_progress false > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > > index fa00b239c5d9..9a1966207041 100644 > > --- a/kernel/kexec_core.c > > +++ b/kernel/kexec_core.c > > @@ -960,17 +960,20 @@ int kimage_load_segment(struct kimage *image, int idx) > > return result; > > } > > > > -void *kimage_map_segment(struct kimage *image, > > - unsigned long addr, unsigned long size) > > +void *kimage_map_segment(struct kimage *image, int idx) > > { > > + unsigned long addr, size, eaddr; > > unsigned long src_page_addr, dest_page_addr = 0; > > - unsigned long eaddr = addr + size; > > kimage_entry_t *ptr, entry; > > struct page **src_pages; > > unsigned int npages; > > void *vaddr = NULL; > > int i; > > > > + addr = image->segment[idx].mem; > > + size = image->segment[idx].memsz; > > + eaddr = addr + size; > > + > > /* > > * Collect the source pages and map them in a contiguous VA range. 
> > */ > > diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c > > index 7362f68f2d8b..5beb69edd12f 100644 > > --- a/security/integrity/ima/ima_kexec.c > > +++ b/security/integrity/ima/ima_kexec.c > > @@ -250,9 +250,7 @@ void ima_kexec_post_load(struct kimage *image) > > if (!image->ima_buffer_addr) > > return; > > > > - ima_kexec_buffer = kimage_map_segment(image, > > - image->ima_buffer_addr, > > - image->ima_buffer_size); > > + ima_kexec_buffer = kimage_map_segment(image, image->ima_segment_index); > > if (!ima_kexec_buffer) { > > pr_err("Could not map measurements buffer.\n"); > > return; > > -- > > 2.49.0 > From bhe at redhat.com Mon Nov 24 20:54:39 2025 From: bhe at redhat.com (Baoquan He) Date: Tue, 25 Nov 2025 12:54:39 +0800 Subject: [PATCHv2 1/2] kernel/kexec: Change the prototype of kimage_map_segment() In-Reply-To: <20251124141620.eaef984836fe2edc7acf9179@linux-foundation.org> References: <20251106065904.10772-1-piliu@redhat.com> <20251124141620.eaef984836fe2edc7acf9179@linux-foundation.org> Message-ID: On 11/24/25 at 02:16pm, Andrew Morton wrote: > On Thu, 6 Nov 2025 14:59:03 +0800 Pingfan Liu wrote: > > > The kexec segment index will be required to extract the corresponding > > information for that segment in kimage_map_segment(). Additionally, > > kexec_segment already holds the kexec relocation destination address and > > size. Therefore, the prototype of kimage_map_segment() can be changed. > > Could we please have some reviewer input on these two patches? I have some concerns about one of the small code changes, and the root cause is missing from the log. And Mimi sent mail to me asking why this bug can't be seen on her laptop; I told her this bug can only be triggered on systems where a CMA area exists. I think these need to be addressed in v3. 
> > (Pingfan, please cc linux-kernel on patches - it's where people go to > find emails on lists which they aren't suscribed to) > > (akpm goes off and subscribes to kexec@) > > > Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") > > Signed-off-by: Pingfan Liu > > Cc: Andrew Morton > > Cc: Baoquan He > > Cc: Mimi Zohar > > Cc: Roberto Sassu > > Cc: Alexander Graf > > Cc: Steven Chen > > Cc: > > To: kexec at lists.infradead.org > > To: linux-integrity at vger.kernel.org > > --- > > include/linux/kexec.h | 4 ++-- > > kernel/kexec_core.c | 9 ++++++--- > > security/integrity/ima/ima_kexec.c | 4 +--- > > 3 files changed, 9 insertions(+), 8 deletions(-) > > > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > > index ff7e231b0485..8a22bc9b8c6c 100644 > > --- a/include/linux/kexec.h > > +++ b/include/linux/kexec.h > > @@ -530,7 +530,7 @@ extern bool kexec_file_dbg_print; > > #define kexec_dprintk(fmt, arg...) \ > > do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0) > > > > -extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size); > > +extern void *kimage_map_segment(struct kimage *image, int idx); > > extern void kimage_unmap_segment(void *buffer); > > #else /* !CONFIG_KEXEC_CORE */ > > struct pt_regs; > > @@ -540,7 +540,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { } > > static inline void crash_kexec(struct pt_regs *regs) { } > > static inline int kexec_should_crash(struct task_struct *p) { return 0; } > > static inline int kexec_crash_loaded(void) { return 0; } > > -static inline void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size) > > +static inline void *kimage_map_segment(struct kimage *image, int idx) > > { return NULL; } > > static inline void kimage_unmap_segment(void *buffer) { } > > #define kexec_in_progress false > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > > index fa00b239c5d9..9a1966207041 100644 > > --- 
a/kernel/kexec_core.c > > +++ b/kernel/kexec_core.c > > @@ -960,17 +960,20 @@ int kimage_load_segment(struct kimage *image, int idx) > > return result; > > } > > > > -void *kimage_map_segment(struct kimage *image, > > - unsigned long addr, unsigned long size) > > +void *kimage_map_segment(struct kimage *image, int idx) > > { > > + unsigned long addr, size, eaddr; > > unsigned long src_page_addr, dest_page_addr = 0; > > - unsigned long eaddr = addr + size; > > kimage_entry_t *ptr, entry; > > struct page **src_pages; > > unsigned int npages; > > void *vaddr = NULL; > > int i; > > > > + addr = image->segment[idx].mem; > > + size = image->segment[idx].memsz; > > + eaddr = addr + size; > > + > > /* > > * Collect the source pages and map them in a contiguous VA range. > > */ > > diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c > > index 7362f68f2d8b..5beb69edd12f 100644 > > --- a/security/integrity/ima/ima_kexec.c > > +++ b/security/integrity/ima/ima_kexec.c > > @@ -250,9 +250,7 @@ void ima_kexec_post_load(struct kimage *image) > > if (!image->ima_buffer_addr) > > return; > > > > - ima_kexec_buffer = kimage_map_segment(image, > > - image->ima_buffer_addr, > > - image->ima_buffer_size); > > + ima_kexec_buffer = kimage_map_segment(image, image->ima_segment_index); > > if (!ima_kexec_buffer) { > > pr_err("Could not map measurements buffer.\n"); > > return; > > -- > > 2.49.0 > From rppt at kernel.org Tue Nov 25 03:09:15 2025 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 25 Nov 2025 13:09:15 +0200 Subject: [PATCH 0/2] kho: fixes for vmalloc restoration Message-ID: <20251125110917.843744-1-rppt@kernel.org> From: "Mike Rapoport (Microsoft)" Hi, Pratyush reported off-list that when kho_restore_vmalloc() is used to restore a vmalloc_huge() allocation it hits VM_BUG_ON() when we reconstruct the struct pages in kho_restore_pages(). These patches fix the issue. 
Mike Rapoport (Microsoft) (2): kho: kho_restore_vmalloc: fix initialization of pages array kho: fix restoring of contiguous ranges of order-0 pages kernel/liveupdate/kexec_handover.c | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) -- 2.50.1 From rppt at kernel.org Tue Nov 25 03:09:16 2025 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 25 Nov 2025 13:09:16 +0200 Subject: [PATCH 1/2] kho: kho_restore_vmalloc: fix initialization of pages array In-Reply-To: <20251125110917.843744-1-rppt@kernel.org> References: <20251125110917.843744-1-rppt@kernel.org> Message-ID: <20251125110917.843744-2-rppt@kernel.org> From: "Mike Rapoport (Microsoft)" In case a preserved vmalloc allocation was using huge pages, all pages in the array of pages added to vm_struct during kho_restore_vmalloc() are wrongly set to the same page. Fix the indexing when assigning pages to that array. Fixes: a667300bd53f ("kho: add support for preserving vmalloc allocations") Signed-off-by: Mike Rapoport (Microsoft) --- kernel/liveupdate/kexec_handover.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index 5809c6fe331c..e64ee87fa62a 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -1096,7 +1096,7 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation) goto err_free_pages_array; for (int j = 0; j < contig_pages; j++) - pages[idx++] = page; + pages[idx++] = page + j; phys += contig_pages * PAGE_SIZE; } -- 2.50.1 From rppt at kernel.org Tue Nov 25 03:09:17 2025 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 25 Nov 2025 13:09:17 +0200 Subject: [PATCH 2/2] kho: fix restoring of contiguous ranges of order-0 pages In-Reply-To: <20251125110917.843744-1-rppt@kernel.org> References: <20251125110917.843744-1-rppt@kernel.org> Message-ID: <20251125110917.843744-3-rppt@kernel.org> From: "Mike Rapoport (Microsoft)" When contiguous ranges of 
order-0 pages are restored, kho_restore_page() calls prep_compound_page() with the first page in the range and the order as parameters, and then kho_restore_pages() calls split_page() to make sure all pages in the range are order-0. However, split_page() is not intended to split compound pages, and with VM_DEBUG enabled it will trigger a VM_BUG_ON_PAGE(). Update kho_restore_page() so that it uses prep_compound_page() only when it restores a folio, and make sure it properly sets the page count for both large folios and ranges of order-0 pages. Reported-by: Pratyush Yadav Fixes: a667300bd53f ("kho: add support for preserving vmalloc allocations") Signed-off-by: Mike Rapoport (Microsoft) --- kernel/liveupdate/kexec_handover.c | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c index e64ee87fa62a..61d17ed1f423 100644 --- a/kernel/liveupdate/kexec_handover.c +++ b/kernel/liveupdate/kexec_handover.c @@ -219,11 +219,11 @@ static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn, return 0; } -static struct page *kho_restore_page(phys_addr_t phys) +static struct page *kho_restore_page(phys_addr_t phys, bool is_folio) { struct page *page = pfn_to_online_page(PHYS_PFN(phys)); + unsigned int nr_pages, ref_cnt; union kho_page_info info; - unsigned int nr_pages; if (!page) return NULL; @@ -243,11 +243,16 @@ static struct page *kho_restore_page(phys_addr_t phys) /* Head page gets refcount of 1. */ set_page_count(page, 1); - /* For higher order folios, tail pages get a page count of zero. */ + /* + * For higher order folios, tail pages get a page count of zero. + * For physically contiguous order-0 pages every page gets a page + * count of 1 + */ + ref_cnt = is_folio ? 
0 : 1; for (unsigned int i = 1; i < nr_pages; i++) - set_page_count(page + i, 0); + set_page_count(page + i, ref_cnt); - if (info.order > 0) + if (is_folio && info.order) prep_compound_page(page, info.order); adjust_managed_page_count(page, nr_pages); @@ -262,7 +267,7 @@ static struct page *kho_restore_page(phys_addr_t phys) */ struct folio *kho_restore_folio(phys_addr_t phys) { - struct page *page = kho_restore_page(phys); + struct page *page = kho_restore_page(phys, true); return page ? page_folio(page) : NULL; } @@ -287,11 +292,10 @@ struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages) while (pfn < end_pfn) { const unsigned int order = min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn)); - struct page *page = kho_restore_page(PFN_PHYS(pfn)); + struct page *page = kho_restore_page(PFN_PHYS(pfn), false); if (!page) return NULL; - split_page(page, order); pfn += 1 << order; } -- 2.50.1 From pratyush at kernel.org Tue Nov 25 04:23:05 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 25 Nov 2025 13:23:05 +0100 Subject: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for memory below 1M In-Reply-To: <22BDBF5C-C831-4BBC-A854-20CA77234084@zytor.com> (H. Peter Anvin's message of "Mon, 24 Nov 2025 16:56:34 -0800") References: <20250509074635.3187114-1-changyuanl@google.com> <20250509074635.3187114-13-changyuanl@google.com> <22BDBF5C-C831-4BBC-A854-20CA77234084@zytor.com> Message-ID: On Mon, Nov 24 2025, H. Peter Anvin wrote: > On November 24, 2025 11:24:58 AM PST, Usama Arif wrote: >> >> >>On 09/05/2025 08:46, Changyuan Lyu wrote: >>> From: Alexander Graf >>> >>> KHO kernels are special and use only scratch memory for memblock >>> allocations, but memory below 1M is ignored by kernel after early boot >>> and cannot be naturally marked as scratch. >>> >>> To allow allocation of the real-mode trampoline and a few (if any) other >>> very early allocations from below 1M forcibly mark the memory below 1M >>> as scratch. 
>>> >>> After real mode trampoline is allocated, clear that scratch marking. >>> >>> Signed-off-by: Alexander Graf >>> Co-developed-by: Mike Rapoport (Microsoft) >>> Signed-off-by: Mike Rapoport (Microsoft) >>> Co-developed-by: Changyuan Lyu >>> Signed-off-by: Changyuan Lyu >>> Acked-by: Dave Hansen >>> --- >>> arch/x86/kernel/e820.c | 18 ++++++++++++++++++ >>> arch/x86/realmode/init.c | 2 ++ >>> 2 files changed, 20 insertions(+) >>> >>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c >>> index 9920122018a0b..c3acbd26408ba 100644 >>> --- a/arch/x86/kernel/e820.c >>> +++ b/arch/x86/kernel/e820.c >>> @@ -1299,6 +1299,24 @@ void __init e820__memblock_setup(void) >>> memblock_add(entry->addr, entry->size); >>> } >>> >>> + /* >>> + * At this point memblock is only allowed to allocate from memory >>> + * below 1M (aka ISA_END_ADDRESS) up until direct map is completely set >>> + * up in init_mem_mapping(). >>> + * >>> + * KHO kernels are special and use only scratch memory for memblock >>> + * allocations, but memory below 1M is ignored by kernel after early >>> + * boot and cannot be naturally marked as scratch. >>> + * >>> + * To allow allocation of the real-mode trampoline and a few (if any) >>> + * other very early allocations from below 1M forcibly mark the memory >>> + * below 1M as scratch. >>> + * >>> + * After real mode trampoline is allocated, we clear that scratch >>> + * marking. >>> + */ >>> + memblock_mark_kho_scratch(0, SZ_1M); >>> + >>> /* >>> * 32-bit systems are limited to 4BG of memory even with HIGHMEM and >>> * to even less without it. >>> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c >>> index f9bc444a3064d..9b9f4534086d2 100644 >>> --- a/arch/x86/realmode/init.c >>> +++ b/arch/x86/realmode/init.c >>> @@ -65,6 +65,8 @@ void __init reserve_real_mode(void) >>> * setup_arch(). 
>>> */ >>> memblock_reserve(0, SZ_1M); >>> + >>> + memblock_clear_kho_scratch(0, SZ_1M); >>> } >>> >>> static void __init sme_sev_setup_real_mode(struct trampoline_header *th) >> >>Hello! >> >>I am working with Breno who reported that we are seeing the below warning at boot >>when rolling out 6.16 in Meta fleet. It is difficult to reproduce on a single host >>manually but we are seeing this several times a day inside the fleet. >> >> 20:16:33 ------------[ cut here ]------------ >> 20:16:33 WARNING: CPU: 0 PID: 0 at mm/memblock.c:668 memblock_add_range+0x316/0x330 >> 20:16:33 Modules linked in: >> 20:16:33 CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G S 6.16.1-0_fbk0_0_gc0739ee5037a #1 NONE >> 20:16:33 Tainted: [S]=CPU_OUT_OF_SPEC >> 20:16:33 RIP: 0010:memblock_add_range+0x316/0x330 >> 20:16:33 Code: ff ff ff 89 5c 24 08 41 ff c5 44 89 6c 24 10 48 63 74 24 08 48 63 54 24 10 e8 26 0c 00 00 e9 41 ff ff ff 0f 0b e9 af fd ff ff <0f> 0b e9 b7 fd ff ff 0f 0b 0f 0b cc cc cc cc cc cc cc cc cc cc cc >> 20:16:33 RSP: 0000:ffffffff83403dd8 EFLAGS: 00010083 ORIG_RAX: 0000000000000000 >> 20:16:33 RAX: ffffffff8476ff90 RBX: 0000000000001c00 RCX: 0000000000000002 >> 20:16:33 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff83bad4d8 >> 20:16:33 RBP: 000000000009f000 R08: 0000000000000020 R09: 8000000000097101 >> 20:16:33 R10: ffffffffff2004b0 R11: 203a6d6f646e6172 R12: 000000000009ec00 >> 20:16:33 R13: 0000000000000002 R14: 0000000000100000 R15: 000000000009d000 >> 20:16:33 FS: 0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000 >> 20:16:33 CR2: ffff888065413ff8 CR3: 00000000663b7000 CR4: 00000000000000b0 >> 20:16:33 Call Trace: >> 20:16:33 >> 20:16:33 ? __memblock_reserve+0x75/0x80 >> 20:16:33 ? setup_arch+0x30f/0xb10 >> 20:16:33 ? start_kernel+0x58/0x960 >> 20:16:33 ? x86_64_start_reservations+0x20/0x20 >> 20:16:33 ? x86_64_start_kernel+0x13d/0x140 >> 20:16:33 ? 
common_startup_64+0x13e/0x140 >> 20:16:33 >> 20:16:33 ---[ end trace 0000000000000000 ]--- >> >> >>Rolling out with memblock=debug is not really an option in a large scale fleet due to the >>time added to boot. But I did try on one of the hosts (without reproducing the issue) and I see: >> >>[ 0.000616] memory.cnt = 0x6 >>[ 0.000617] memory[0x0] [0x0000000000001000-0x000000000009bfff], 0x000000000009b000 bytes flags: 0x40 >>[ 0.000620] memory[0x1] [0x000000000009f000-0x000000000009ffff], 0x0000000000001000 bytes flags: 0x40 >>[ 0.000621] memory[0x2] [0x0000000000100000-0x000000005ed09fff], 0x000000005ec0a000 bytes flags: 0x0 >>... >> >>The 0x40 (MEMBLOCK_KHO_SCRATCH) is coming from memblock_mark_kho_scratch in e820__memblock_setup. I believe this >>should be under ifdef like the diff at the end? (Happy to send this as a patch for review if it makes sense). >>We have KEXEC_HANDOVER disabled in our defconfig, therefore MEMBLOCK_KHO_SCRATCH shouldnt be selected and >>we shouldnt have any MEMBLOCK_KHO_SCRATCH type regions in our memblock reservations. >> >>The other thing I did was insert a while(1) just before the warning and inspected the registers in qemu. >>R14 held the base register, and R15 held the size at that point. >>In the warning R14 is 0x100000 meaning that someone is reserving a region with a different flag to MEMBLOCK_NONE >>at the boundary of MEMBLOCK_KHO_SCRATCH. >> >>diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c >>index c3acbd26408ba..26e4062a0bd09 100644 >>--- a/arch/x86/kernel/e820.c >>+++ b/arch/x86/kernel/e820.c >>@@ -1299,6 +1299,7 @@ void __init e820__memblock_setup(void) >> memblock_add(entry->addr, entry->size); >> } >> >>+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH >> /* >> * At this point memblock is only allowed to allocate from memory >> * below 1M (aka ISA_END_ADDRESS) up until direct map is completely set >>@@ -1316,7 +1317,7 @@ void __init e820__memblock_setup(void) >> * marking. 
>> */ >> memblock_mark_kho_scratch(0, SZ_1M); >>- >>+#endif >> /* >> * 32-bit systems are limited to 4BG of memory even with HIGHMEM and >> * to even less without it. >>diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c >>index 88be32026768c..1cd80293a3e23 100644 >>--- a/arch/x86/realmode/init.c >>+++ b/arch/x86/realmode/init.c >>@@ -66,8 +66,9 @@ void __init reserve_real_mode(void) >> * setup_arch(). >> */ >> memblock_reserve(0, SZ_1M); >>- >>+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH >> memblock_clear_kho_scratch(0, SZ_1M); >>+#endif >> } >> >> static void __init sme_sev_setup_real_mode(struct trampoline_header *th) > > What does "scratch" mean in this exact context? (Sorry, don't have the code in front of me.) See https://docs.kernel.org/core-api/kho/concepts.html#scratch-regions -- Regards, Pratyush Yadav From pratyush at kernel.org Tue Nov 25 05:15:34 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 25 Nov 2025 14:15:34 +0100 Subject: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for memory below 1M In-Reply-To: (Usama Arif's message of "Mon, 24 Nov 2025 19:24:58 +0000") References: <20250509074635.3187114-1-changyuanl@google.com> <20250509074635.3187114-13-changyuanl@google.com> Message-ID: On Mon, Nov 24 2025, Usama Arif wrote: > On 09/05/2025 08:46, Changyuan Lyu wrote: >> From: Alexander Graf >> >> KHO kernels are special and use only scratch memory for memblock >> allocations, but memory below 1M is ignored by kernel after early boot >> and cannot be naturally marked as scratch. >> >> To allow allocation of the real-mode trampoline and a few (if any) other >> very early allocations from below 1M forcibly mark the memory below 1M >> as scratch. >> >> After real mode trampoline is allocated, clear that scratch marking. 
>> >> Signed-off-by: Alexander Graf >> Co-developed-by: Mike Rapoport (Microsoft) >> Signed-off-by: Mike Rapoport (Microsoft) >> Co-developed-by: Changyuan Lyu >> Signed-off-by: Changyuan Lyu >> Acked-by: Dave Hansen >> --- >> arch/x86/kernel/e820.c | 18 ++++++++++++++++++ >> arch/x86/realmode/init.c | 2 ++ >> 2 files changed, 20 insertions(+) >> >> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c >> index 9920122018a0b..c3acbd26408ba 100644 >> --- a/arch/x86/kernel/e820.c >> +++ b/arch/x86/kernel/e820.c >> @@ -1299,6 +1299,24 @@ void __init e820__memblock_setup(void) >> memblock_add(entry->addr, entry->size); >> } >> >> + /* >> + * At this point memblock is only allowed to allocate from memory >> + * below 1M (aka ISA_END_ADDRESS) up until direct map is completely set >> + * up in init_mem_mapping(). >> + * >> + * KHO kernels are special and use only scratch memory for memblock >> + * allocations, but memory below 1M is ignored by kernel after early >> + * boot and cannot be naturally marked as scratch. >> + * >> + * To allow allocation of the real-mode trampoline and a few (if any) >> + * other very early allocations from below 1M forcibly mark the memory >> + * below 1M as scratch. >> + * >> + * After real mode trampoline is allocated, we clear that scratch >> + * marking. >> + */ >> + memblock_mark_kho_scratch(0, SZ_1M); >> + >> /* >> * 32-bit systems are limited to 4BG of memory even with HIGHMEM and >> * to even less without it. >> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c >> index f9bc444a3064d..9b9f4534086d2 100644 >> --- a/arch/x86/realmode/init.c >> +++ b/arch/x86/realmode/init.c >> @@ -65,6 +65,8 @@ void __init reserve_real_mode(void) >> * setup_arch(). >> */ >> memblock_reserve(0, SZ_1M); >> + >> + memblock_clear_kho_scratch(0, SZ_1M); >> } >> >> static void __init sme_sev_setup_real_mode(struct trampoline_header *th) > > Hello! 
> > I am working with Breno who reported that we are seeing the below warning at boot > when rolling out 6.16 in Meta fleet. It is difficult to reproduce on a single host > manually but we are seeing this several times a day inside the fleet. > > 20:16:33 ------------[ cut here ]------------ > 20:16:33 WARNING: CPU: 0 PID: 0 at mm/memblock.c:668 memblock_add_range+0x316/0x330 > 20:16:33 Modules linked in: > 20:16:33 CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G S 6.16.1-0_fbk0_0_gc0739ee5037a #1 NONE > 20:16:33 Tainted: [S]=CPU_OUT_OF_SPEC > 20:16:33 RIP: 0010:memblock_add_range+0x316/0x330 > 20:16:33 Code: ff ff ff 89 5c 24 08 41 ff c5 44 89 6c 24 10 48 63 74 24 08 48 63 54 24 10 e8 26 0c 00 00 e9 41 ff ff ff 0f 0b e9 af fd ff ff <0f> 0b e9 b7 fd ff ff 0f 0b 0f 0b cc cc cc cc cc cc cc cc cc cc cc > 20:16:33 RSP: 0000:ffffffff83403dd8 EFLAGS: 00010083 ORIG_RAX: 0000000000000000 > 20:16:33 RAX: ffffffff8476ff90 RBX: 0000000000001c00 RCX: 0000000000000002 > 20:16:33 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff83bad4d8 > 20:16:33 RBP: 000000000009f000 R08: 0000000000000020 R09: 8000000000097101 > 20:16:33 R10: ffffffffff2004b0 R11: 203a6d6f646e6172 R12: 000000000009ec00 > 20:16:33 R13: 0000000000000002 R14: 0000000000100000 R15: 000000000009d000 > 20:16:33 FS: 0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000 > 20:16:33 CR2: ffff888065413ff8 CR3: 00000000663b7000 CR4: 00000000000000b0 > 20:16:33 Call Trace: > 20:16:33 > 20:16:33 ? __memblock_reserve+0x75/0x80 > 20:16:33 ? setup_arch+0x30f/0xb10 > 20:16:33 ? start_kernel+0x58/0x960 > 20:16:33 ? x86_64_start_reservations+0x20/0x20 > 20:16:33 ? x86_64_start_kernel+0x13d/0x140 > 20:16:33 ? common_startup_64+0x13e/0x140 > 20:16:33 > 20:16:33 ---[ end trace 0000000000000000 ]--- > > > Rolling out with memblock=debug is not really an option in a large scale fleet due to the > time added to boot. 
But I did try on one of the hosts (without reproducing the issue) and I see: > > [ 0.000616] memory.cnt = 0x6 > [ 0.000617] memory[0x0] [0x0000000000001000-0x000000000009bfff], 0x000000000009b000 bytes flags: 0x40 > [ 0.000620] memory[0x1] [0x000000000009f000-0x000000000009ffff], 0x0000000000001000 bytes flags: 0x40 > [ 0.000621] memory[0x2] [0x0000000000100000-0x000000005ed09fff], 0x000000005ec0a000 bytes flags: 0x0 > ... > > The 0x40 (MEMBLOCK_KHO_SCRATCH) is coming from memblock_mark_kho_scratch in e820__memblock_setup. I believe this > should be under ifdef like the diff at the end? (Happy to send this as a patch for review if it makes sense). > We have KEXEC_HANDOVER disabled in our defconfig, therefore MEMBLOCK_KHO_SCRATCH shouldnt be selected and > we shouldnt have any MEMBLOCK_KHO_SCRATCH type regions in our memblock reservations. > > The other thing I did was insert a while(1) just before the warning and inspected the registers in qemu. > R14 held the base register, and R15 held the size at that point. > In the warning R14 is 0x100000 meaning that someone is reserving a region with a different flag to MEMBLOCK_NONE > at the boundary of MEMBLOCK_KHO_SCRATCH. I don't get this... The WARN_ON() is only triggered when the regions overlap. Here, there should be no overlap, since the scratch region should end at 0x100000 (SZ_1M) and the new region starts at 0x100000 (SZ_1M). Anyway, you do indeed point at a bug. memblock_mark_kho_scratch() should only be called on a KHO boot, not unconditionally. So even with CONFIG_MEMBLOCK_KHO_SCRATCH enabled, this should only be called on a KHO boot, not every time. I think the below diff should fix the warning for you by making sure the scratch areas are not present on non-KHO boot. I still don't know why you hit the warning in the first place though. If you'd be willing to dig deeper into that, it would be great. Can you give the below a try and if it fixes the problem for you I can send it on the list. 
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c index c3acbd26408ba..0a34dc011bf91 100644 --- a/arch/x86/kernel/e820.c +++ b/arch/x86/kernel/e820.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include @@ -1315,7 +1316,8 @@ void __init e820__memblock_setup(void) * After real mode trampoline is allocated, we clear that scratch * marking. */ - memblock_mark_kho_scratch(0, SZ_1M); + if (is_kho_boot()) + memblock_mark_kho_scratch(0, SZ_1M); /* * 32-bit systems are limited to 4BG of memory even with HIGHMEM and diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c index 88be32026768c..4e9b4dff17216 100644 --- a/arch/x86/realmode/init.c +++ b/arch/x86/realmode/init.c @@ -4,6 +4,7 @@ #include #include #include +#include #include #include @@ -67,7 +68,8 @@ void __init reserve_real_mode(void) */ memblock_reserve(0, SZ_1M); - memblock_clear_kho_scratch(0, SZ_1M); + if (is_kho_boot()) + memblock_clear_kho_scratch(0, SZ_1M); } static void __init sme_sev_setup_real_mode(struct trampoline_header *th) -- Regards, Pratyush Yadav From pratyush at kernel.org Tue Nov 25 05:18:20 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 25 Nov 2025 14:18:20 +0100 Subject: [PATCH 1/2] kho: kho_restore_vmalloc: fix initialization of pages array In-Reply-To: <20251125110917.843744-2-rppt@kernel.org> (Mike Rapoport's message of "Tue, 25 Nov 2025 13:09:16 +0200") References: <20251125110917.843744-1-rppt@kernel.org> <20251125110917.843744-2-rppt@kernel.org> Message-ID: On Tue, Nov 25 2025, Mike Rapoport wrote: > From: "Mike Rapoport (Microsoft)" > > In case a preserved vmalloc allocation was using huge pages, all pages in > the array of pages added to vm_struct during kho_restore_vmalloc() are > wrongly set to the same page. > > Fix the indexing when assigning pages to that array. > > Fixes: a667300bd53f ("kho: add support for preserving vmalloc allocations") > Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: Pratyush Yadav [...] 
-- Regards, Pratyush Yadav From pratyush at kernel.org Tue Nov 25 05:45:59 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 25 Nov 2025 14:45:59 +0100 Subject: [PATCH 2/2] kho: fix restoring of contiguous ranges of order-0 pages In-Reply-To: <20251125110917.843744-3-rppt@kernel.org> (Mike Rapoport's message of "Tue, 25 Nov 2025 13:09:17 +0200") References: <20251125110917.843744-1-rppt@kernel.org> <20251125110917.843744-3-rppt@kernel.org> Message-ID: On Tue, Nov 25 2025, Mike Rapoport wrote: > From: "Mike Rapoport (Microsoft)" > > When contiguous ranges of order-0 pages are restored, kho_restore_page() > calls prep_compound_page() with the first page in the range and order as > parameters and then kho_restore_pages() calls split_page() to make sure all > pages in the range are order-0. > > However, since split_page() is not intended to split compound pages and > with VM_DEBUG enabled it will trigger a VM_BUG_ON_PAGE(). > > Update kho_restore_page() so that it will use prep_compound_page() when it > restores a folio and make sure it properly sets page count for both large > folios and ranges of order-0 pages. 
> > Reported-by: Pratyush Yadav > Fixes: a667300bd53f ("kho: add support for preserving vmalloc allocations") > Signed-off-by: Mike Rapoport (Microsoft) > --- > kernel/liveupdate/kexec_handover.c | 20 ++++++++++++-------- > 1 file changed, 12 insertions(+), 8 deletions(-) > > diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c > index e64ee87fa62a..61d17ed1f423 100644 > --- a/kernel/liveupdate/kexec_handover.c > +++ b/kernel/liveupdate/kexec_handover.c > @@ -219,11 +219,11 @@ static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn, > return 0; > } > > -static struct page *kho_restore_page(phys_addr_t phys) > +static struct page *kho_restore_page(phys_addr_t phys, bool is_folio) > { > struct page *page = pfn_to_online_page(PHYS_PFN(phys)); > + unsigned int nr_pages, ref_cnt; > union kho_page_info info; > - unsigned int nr_pages; > > if (!page) > return NULL; > @@ -243,11 +243,16 @@ static struct page *kho_restore_page(phys_addr_t phys) > /* Head page gets refcount of 1. */ > set_page_count(page, 1); > > - /* For higher order folios, tail pages get a page count of zero. */ > + /* > + * For higher order folios, tail pages get a page count of zero. > + * For physically contiguous order-0 pages every pages gets a page > + * count of 1 > + */ > + ref_cnt = is_folio ? 0 : 1; > for (unsigned int i = 1; i < nr_pages; i++) > - set_page_count(page + i, 0); > + set_page_count(page + i, ref_cnt); > > - if (info.order > 0) > + if (is_folio && info.order) This is getting a bit difficult to parse. Let's separate out folio and page initialization into separate helpers: /* Initialize 0-order KHO pages */ static void kho_init_page(struct page *page, unsigned int nr_pages) { for (unsigned int i = 0; i < nr_pages; i++) set_page_count(page + i, 1); } static void kho_init_folio(struct page *page, unsigned int order) { unsigned int nr_pages = (1 << order); /* Head page gets refcount of 1.
*/ set_page_count(page, 1); /* For higher order folios, tail pages get a page count of zero. */ for (unsigned int i = 1; i < nr_pages; i++) set_page_count(page + i, 0); if (order > 0) prep_compound_page(page, order); } > prep_compound_page(page, info.order); > > adjust_managed_page_count(page, nr_pages); > @@ -262,7 +267,7 @@ static struct page *kho_restore_page(phys_addr_t phys) > */ > struct folio *kho_restore_folio(phys_addr_t phys) > { > - struct page *page = kho_restore_page(phys); > + struct page *page = kho_restore_page(phys, true); > > return page ? page_folio(page) : NULL; > } > @@ -287,11 +292,10 @@ struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages) > while (pfn < end_pfn) { > const unsigned int order = > min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn)); > - struct page *page = kho_restore_page(PFN_PHYS(pfn)); > + struct page *page = kho_restore_page(PFN_PHYS(pfn), false); > > if (!page) > return NULL; > - split_page(page, order); > pfn += 1 << order; > } -- Regards, Pratyush Yadav From rppt at kernel.org Tue Nov 25 05:50:40 2025 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 25 Nov 2025 15:50:40 +0200 Subject: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for memory below 1M In-Reply-To: References: <20250509074635.3187114-1-changyuanl@google.com> <20250509074635.3187114-13-changyuanl@google.com> Message-ID: Hi, On Tue, Nov 25, 2025 at 02:15:34PM +0100, Pratyush Yadav wrote: > On Mon, Nov 24 2025, Usama Arif wrote: > >> --- a/arch/x86/realmode/init.c > >> +++ b/arch/x86/realmode/init.c > >> @@ -65,6 +65,8 @@ void __init reserve_real_mode(void) > >> * setup_arch(). > >> */ > >> memblock_reserve(0, SZ_1M); > >> + > >> + memblock_clear_kho_scratch(0, SZ_1M); > >> } > >> > >> static void __init sme_sev_setup_real_mode(struct trampoline_header *th) > > > > Hello! > > > > I am working with Breno who reported that we are seeing the below warning at boot > > when rolling out 6.16 in Meta fleet. 
It is difficult to reproduce on a single host > > manually but we are seeing this several times a day inside the fleet. > > > > 20:16:33 ------------[ cut here ]------------ > > 20:16:33 WARNING: CPU: 0 PID: 0 at mm/memblock.c:668 memblock_add_range+0x316/0x330 > > 20:16:33 Modules linked in: > > 20:16:33 CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G S 6.16.1-0_fbk0_0_gc0739ee5037a #1 NONE > > 20:16:33 Tainted: [S]=CPU_OUT_OF_SPEC > > 20:16:33 RIP: 0010:memblock_add_range+0x316/0x330 > > 20:16:33 Code: ff ff ff 89 5c 24 08 41 ff c5 44 89 6c 24 10 48 63 74 24 08 48 63 54 24 10 e8 26 0c 00 00 e9 41 ff ff ff 0f 0b e9 af fd ff ff <0f> 0b e9 b7 fd ff ff 0f 0b 0f 0b cc cc cc cc cc cc cc cc cc cc cc > > 20:16:33 RSP: 0000:ffffffff83403dd8 EFLAGS: 00010083 ORIG_RAX: 0000000000000000 > > 20:16:33 RAX: ffffffff8476ff90 RBX: 0000000000001c00 RCX: 0000000000000002 > > 20:16:33 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff83bad4d8 > > 20:16:33 RBP: 000000000009f000 R08: 0000000000000020 R09: 8000000000097101 > > 20:16:33 R10: ffffffffff2004b0 R11: 203a6d6f646e6172 R12: 000000000009ec00 > > 20:16:33 R13: 0000000000000002 R14: 0000000000100000 R15: 000000000009d000 > > 20:16:33 FS: 0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000 > > 20:16:33 CR2: ffff888065413ff8 CR3: 00000000663b7000 CR4: 00000000000000b0 > > 20:16:33 Call Trace: > > 20:16:33 > > 20:16:33 ? __memblock_reserve+0x75/0x80 Do you have faddr2line for this? > > 20:16:33 ? setup_arch+0x30f/0xb10 And this? > > 20:16:33 ? start_kernel+0x58/0x960 > > 20:16:33 ? x86_64_start_reservations+0x20/0x20 > > 20:16:33 ? x86_64_start_kernel+0x13d/0x140 > > 20:16:33 ? common_startup_64+0x13e/0x140 > > 20:16:33 > > 20:16:33 ---[ end trace 0000000000000000 ]--- > > > > > > Rolling out with memblock=debug is not really an option in a large scale fleet due to the > > time added to boot. 
But I did try on one of the hosts (without reproducing the issue) and I see: Is it a problem to roll out a kernel that has additional debug printouts as Breno suggested earlier? I.e. if (flags != MEMBLOCK_NONE && flags != rgn->flags) { pr_warn("memblock: Flag mismatch at region [%pa-%pa]\n", &rgn->base, &rend); pr_warn(" Existing region flags: %#x\n", rgn->flags); pr_warn(" New range flags: %#x\n", flags); pr_warn(" New range: [%pa-%pa]\n", &base, &end); WARN_ON_ONCE(1); } > > [ 0.000616] memory.cnt = 0x6 > > [ 0.000617] memory[0x0] [0x0000000000001000-0x000000000009bfff], 0x000000000009b000 bytes flags: 0x40 > > [ 0.000620] memory[0x1] [0x000000000009f000-0x000000000009ffff], 0x0000000000001000 bytes flags: 0x40 > > [ 0.000621] memory[0x2] [0x0000000000100000-0x000000005ed09fff], 0x000000005ec0a000 bytes flags: 0x0 > > ... > > > > The 0x40 (MEMBLOCK_KHO_SCRATCH) is coming from memblock_mark_kho_scratch in e820__memblock_setup. I believe this > > should be under ifdef like the diff at the end? (Happy to send this as a patch for review if it makes sense). > > We have KEXEC_HANDOVER disabled in our defconfig, therefore MEMBLOCK_KHO_SCRATCH shouldnt be selected and > > we shouldnt have any MEMBLOCK_KHO_SCRATCH type regions in our memblock reservations. > > > > The other thing I did was insert a while(1) just before the warning and inspected the registers in qemu. > > R14 held the base register, and R15 held the size at that point. > > In the warning R14 is 0x100000 meaning that someone is reserving a region with a different flag to MEMBLOCK_NONE > > at the boundary of MEMBLOCK_KHO_SCRATCH. Judging by the register values, flags could be in %rcx or %r13 (0x2 - MEMBLOCK_MIRROR) or in %r8 (0x20 - MEMBLOCK_RSRV_KERN). Since WARN_ON() is triggered in __memblock_reserve() I'd bet on MEMBLOCK_RSRV_KERN. And apparently the warning triggers for some memory that was initially reserved with memblock_reserve() and then some of it was reserved with memblock_reserve_kern().
If you have the logs from failing boots up to the point where SLUB reports about its initialization, e.g. [ 0.134377] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1 something there may hint about what the issue is. > I don't get this... The WARN_ON() is only triggered when the regions > overlap. Here, there should be no overlap, since the scratch region > should end at 0x100000 (SZ_1M) and the new region starts at 0x100000 > (SZ_1M). Not only that, the warning is from __memblock_reserve() that works with memblock.reserved and the dump is for memblock.memory. > Anyway, you do indeed point at a bug. memblock_mark_kho_scratch() should > only be called on a KHO boot, not unconditionally. So even with > CONFIG_MEMBLOCK_KHO_SCRATCH enabled, this should only be called on a KHO > boot, not every time. > > I think the below diff should fix the warning for you by making sure the > scratch areas are not present on non-KHO boot. I still don't know why > you hit the warning in the first place though. If you'd be willing to > dig deeper into that, it would be great. > > Can you give the below a try and if it fixes the problem for you I can > send it on the list.
BTW, this makes sense even if it does not help with the issue Breno and Usama are working on. > > /* > * 32-bit systems are limited to 4BG of memory even with HIGHMEM and > diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c > index 88be32026768c..4e9b4dff17216 100644 > --- a/arch/x86/realmode/init.c > +++ b/arch/x86/realmode/init.c > @@ -4,6 +4,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -67,7 +68,8 @@ void __init reserve_real_mode(void) > */ > memblock_reserve(0, SZ_1M); > > - memblock_clear_kho_scratch(0, SZ_1M); > + if (is_kho_boot()) > + memblock_clear_kho_scratch(0, SZ_1M); > } > > static void __init sme_sev_setup_real_mode(struct trampoline_header *th) > > > -- > Regards, > Pratyush Yadav -- Sincerely yours, Mike. From rppt at kernel.org Tue Nov 25 05:53:07 2025 From: rppt at kernel.org (Mike Rapoport) Date: Tue, 25 Nov 2025 15:53:07 +0200 Subject: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for memory below 1M In-Reply-To: <22BDBF5C-C831-4BBC-A854-20CA77234084@zytor.com> References: <20250509074635.3187114-1-changyuanl@google.com> <20250509074635.3187114-13-changyuanl@google.com> <22BDBF5C-C831-4BBC-A854-20CA77234084@zytor.com> Message-ID: On Mon, Nov 24, 2025 at 04:56:34PM -0800, H. Peter Anvin wrote: > On November 24, 2025 11:24:58 AM PST, Usama Arif wrote: > > >diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c > >index 88be32026768c..1cd80293a3e23 100644 > >--- a/arch/x86/realmode/init.c > >+++ b/arch/x86/realmode/init.c > >@@ -66,8 +66,9 @@ void __init reserve_real_mode(void) > > * setup_arch(). > > */ > > memblock_reserve(0, SZ_1M); > >- > >+#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH > > memblock_clear_kho_scratch(0, SZ_1M); > >+#endif > > } > > > > static void __init sme_sev_setup_real_mode(struct trampoline_header *th) > > What does "scratch" mean in this exact context? (Sorry, don't have the code in front of me.) 
In this context it's the memory kexec handover used to bootstrap the kexec'ed kernel. Everything beyond these scratch areas could contain preserved data and kexec handover limits all early memory allocations to these scratch areas. -- Sincerely yours, Mike. From usamaarif642 at gmail.com Tue Nov 25 06:31:54 2025 From: usamaarif642 at gmail.com (Usama Arif) Date: Tue, 25 Nov 2025 14:31:54 +0000 Subject: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for memory below 1M In-Reply-To: References: <20250509074635.3187114-1-changyuanl@google.com> <20250509074635.3187114-13-changyuanl@google.com> Message-ID: <80622f99-0ef4-491b-87f6-c9790dfecef6@gmail.com> On 25/11/2025 13:15, Pratyush Yadav wrote: > On Mon, Nov 24 2025, Usama Arif wrote: > >> On 09/05/2025 08:46, Changyuan Lyu wrote: >>> From: Alexander Graf >>> >>> KHO kernels are special and use only scratch memory for memblock >>> allocations, but memory below 1M is ignored by kernel after early boot >>> and cannot be naturally marked as scratch. >>> >>> To allow allocation of the real-mode trampoline and a few (if any) other >>> very early allocations from below 1M forcibly mark the memory below 1M >>> as scratch. >>> >>> After real mode trampoline is allocated, clear that scratch marking. 
>>> >>> Signed-off-by: Alexander Graf >>> Co-developed-by: Mike Rapoport (Microsoft) >>> Signed-off-by: Mike Rapoport (Microsoft) >>> Co-developed-by: Changyuan Lyu >>> Signed-off-by: Changyuan Lyu >>> Acked-by: Dave Hansen >>> --- >>> arch/x86/kernel/e820.c | 18 ++++++++++++++++++ >>> arch/x86/realmode/init.c | 2 ++ >>> 2 files changed, 20 insertions(+) >>> >>> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c >>> index 9920122018a0b..c3acbd26408ba 100644 >>> --- a/arch/x86/kernel/e820.c >>> +++ b/arch/x86/kernel/e820.c >>> @@ -1299,6 +1299,24 @@ void __init e820__memblock_setup(void) >>> memblock_add(entry->addr, entry->size); >>> } >>> >>> + /* >>> + * At this point memblock is only allowed to allocate from memory >>> + * below 1M (aka ISA_END_ADDRESS) up until direct map is completely set >>> + * up in init_mem_mapping(). >>> + * >>> + * KHO kernels are special and use only scratch memory for memblock >>> + * allocations, but memory below 1M is ignored by kernel after early >>> + * boot and cannot be naturally marked as scratch. >>> + * >>> + * To allow allocation of the real-mode trampoline and a few (if any) >>> + * other very early allocations from below 1M forcibly mark the memory >>> + * below 1M as scratch. >>> + * >>> + * After real mode trampoline is allocated, we clear that scratch >>> + * marking. >>> + */ >>> + memblock_mark_kho_scratch(0, SZ_1M); >>> + >>> /* >>> * 32-bit systems are limited to 4BG of memory even with HIGHMEM and >>> * to even less without it. >>> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c >>> index f9bc444a3064d..9b9f4534086d2 100644 >>> --- a/arch/x86/realmode/init.c >>> +++ b/arch/x86/realmode/init.c >>> @@ -65,6 +65,8 @@ void __init reserve_real_mode(void) >>> * setup_arch(). >>> */ >>> memblock_reserve(0, SZ_1M); >>> + >>> + memblock_clear_kho_scratch(0, SZ_1M); >>> } >>> >>> static void __init sme_sev_setup_real_mode(struct trampoline_header *th) >> >> Hello! 
>> >> I am working with Breno who reported that we are seeing the below warning at boot >> when rolling out 6.16 in Meta fleet. It is difficult to reproduce on a single host >> manually but we are seeing this several times a day inside the fleet. >> >> 20:16:33 ------------[ cut here ]------------ >> 20:16:33 WARNING: CPU: 0 PID: 0 at mm/memblock.c:668 memblock_add_range+0x316/0x330 >> 20:16:33 Modules linked in: >> 20:16:33 CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G S 6.16.1-0_fbk0_0_gc0739ee5037a #1 NONE >> 20:16:33 Tainted: [S]=CPU_OUT_OF_SPEC >> 20:16:33 RIP: 0010:memblock_add_range+0x316/0x330 >> 20:16:33 Code: ff ff ff 89 5c 24 08 41 ff c5 44 89 6c 24 10 48 63 74 24 08 48 63 54 24 10 e8 26 0c 00 00 e9 41 ff ff ff 0f 0b e9 af fd ff ff <0f> 0b e9 b7 fd ff ff 0f 0b 0f 0b cc cc cc cc cc cc cc cc cc cc cc >> 20:16:33 RSP: 0000:ffffffff83403dd8 EFLAGS: 00010083 ORIG_RAX: 0000000000000000 >> 20:16:33 RAX: ffffffff8476ff90 RBX: 0000000000001c00 RCX: 0000000000000002 >> 20:16:33 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff83bad4d8 >> 20:16:33 RBP: 000000000009f000 R08: 0000000000000020 R09: 8000000000097101 >> 20:16:33 R10: ffffffffff2004b0 R11: 203a6d6f646e6172 R12: 000000000009ec00 >> 20:16:33 R13: 0000000000000002 R14: 0000000000100000 R15: 000000000009d000 >> 20:16:33 FS: 0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000 >> 20:16:33 CR2: ffff888065413ff8 CR3: 00000000663b7000 CR4: 00000000000000b0 >> 20:16:33 Call Trace: >> 20:16:33 >> 20:16:33 ? __memblock_reserve+0x75/0x80 >> 20:16:33 ? setup_arch+0x30f/0xb10 >> 20:16:33 ? start_kernel+0x58/0x960 >> 20:16:33 ? x86_64_start_reservations+0x20/0x20 >> 20:16:33 ? x86_64_start_kernel+0x13d/0x140 >> 20:16:33 ? common_startup_64+0x13e/0x140 >> 20:16:33 >> 20:16:33 ---[ end trace 0000000000000000 ]--- >> >> >> Rolling out with memblock=debug is not really an option in a large scale fleet due to the >> time added to boot. 
But I did try on one of the hosts (without reproducing the issue) and I see: >> >> [ 0.000616] memory.cnt = 0x6 >> [ 0.000617] memory[0x0] [0x0000000000001000-0x000000000009bfff], 0x000000000009b000 bytes flags: 0x40 >> [ 0.000620] memory[0x1] [0x000000000009f000-0x000000000009ffff], 0x0000000000001000 bytes flags: 0x40 >> [ 0.000621] memory[0x2] [0x0000000000100000-0x000000005ed09fff], 0x000000005ec0a000 bytes flags: 0x0 >> ... >> >> The 0x40 (MEMBLOCK_KHO_SCRATCH) is coming from memblock_mark_kho_scratch in e820__memblock_setup. I believe this >> should be under ifdef like the diff at the end? (Happy to send this as a patch for review if it makes sense). >> We have KEXEC_HANDOVER disabled in our defconfig, therefore MEMBLOCK_KHO_SCRATCH shouldn't be selected and >> we shouldn't have any MEMBLOCK_KHO_SCRATCH type regions in our memblock reservations. >> >> The other thing I did was insert a while(1) just before the warning and inspected the registers in qemu. >> R14 held the base register, and R15 held the size at that point. >> In the warning R14 is 0x100000 meaning that someone is reserving a region with a different flag to MEMBLOCK_NONE >> at the boundary of MEMBLOCK_KHO_SCRATCH. > > I don't get this... The WARN_ON() is only triggered when the regions > overlap. Here, there should be no overlap, since the scratch region > should end at 0x100000 (SZ_1M) and the new region starts at 0x100000 > (SZ_1M). > Yes, this is likely a separate problem. I just discovered flags = 0x40 while trying to debug it with KEXEC_HANDOVER disabled. > Anyway, you do indeed point at a bug. memblock_mark_kho_scratch() should > only be called on a KHO boot, not unconditionally. So even with > CONFIG_MEMBLOCK_KHO_SCRATCH enabled, this should only be called on a KHO > boot, not every time. > > I think the below diff should fix the warning for you by making sure the > scratch areas are not present on non-KHO boot. I still don't know why > you hit the warning in the first place though.
If you'd be willing to > dig deeper into that, it would be great. > > Can you give the below a try and if it fixes the problem for you I can > send it on the list. Is there a reason for compiling this code with is_kho_boot, when we have disabled KEXEC_HANDOVER and don't want this in? I.e., why not just ifdef it with MEMBLOCK_KHO_SCRATCH when that defconfig is designed for it? > > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c > index c3acbd26408ba..0a34dc011bf91 100644 > --- a/arch/x86/kernel/e820.c > +++ b/arch/x86/kernel/e820.c > @@ -16,6 +16,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -1315,7 +1316,8 @@ void __init e820__memblock_setup(void) > * After real mode trampoline is allocated, we clear that scratch > * marking. > */ > - memblock_mark_kho_scratch(0, SZ_1M); > + if (is_kho_boot()) > + memblock_mark_kho_scratch(0, SZ_1M); > > /* > * 32-bit systems are limited to 4BG of memory even with HIGHMEM and > diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c > index 88be32026768c..4e9b4dff17216 100644 > --- a/arch/x86/realmode/init.c > +++ b/arch/x86/realmode/init.c > @@ -4,6 +4,7 @@ > #include > #include > #include > +#include > > #include > #include > @@ -67,7 +68,8 @@ void __init reserve_real_mode(void) > */ > memblock_reserve(0, SZ_1M); > > - memblock_clear_kho_scratch(0, SZ_1M); > + if (is_kho_boot()) > + memblock_clear_kho_scratch(0, SZ_1M); > } > > static void __init sme_sev_setup_real_mode(struct trampoline_header *th) > > From pratyush at kernel.org Tue Nov 25 06:39:34 2025 From: pratyush at kernel.org (Pratyush Yadav) Date: Tue, 25 Nov 2025 15:39:34 +0100 Subject: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for memory below 1M In-Reply-To: <80622f99-0ef4-491b-87f6-c9790dfecef6@gmail.com> (Usama Arif's message of "Tue, 25 Nov 2025 14:31:54 +0000") References: <20250509074635.3187114-1-changyuanl@google.com> <20250509074635.3187114-13-changyuanl@google.com>
<80622f99-0ef4-491b-87f6-c9790dfecef6@gmail.com> Message-ID: On Tue, Nov 25 2025, Usama Arif wrote: > On 25/11/2025 13:15, Pratyush Yadav wrote: >> On Mon, Nov 24 2025, Usama Arif wrote: >> >>> On 09/05/2025 08:46, Changyuan Lyu wrote: >>>> From: Alexander Graf >>>> >>>> KHO kernels are special and use only scratch memory for memblock >>>> allocations, but memory below 1M is ignored by kernel after early boot >>>> and cannot be naturally marked as scratch. >>>> >>>> To allow allocation of the real-mode trampoline and a few (if any) other >>>> very early allocations from below 1M forcibly mark the memory below 1M >>>> as scratch. >>>> >>>> After real mode trampoline is allocated, clear that scratch marking. >>>> >>>> Signed-off-by: Alexander Graf [...] >> Anyway, you do indeed point at a bug. memblock_mark_kho_scratch() should >> only be called on a KHO boot, not unconditionally. So even with >> CONFIG_MEMBLOCK_KHO_SCRATCH enabled, this should only be called on a KHO >> boot, not every time. >> >> I think the below diff should fix the warning for you by making sure the >> scratch areas are not present on non-KHO boot. I still don't know why >> you hit the warning in the first place though. If you'd be willing to >> dig deeper into that, it would be great. >> >> Can you give the below a try and if it fixes the problem for you I can >> send it on the list. > > Is there a reason for compiling this code with is_kho_boot, when we have disabled > KEXEC_HANDOVER and don't want this in? I.e., why not just ifdef it with MEMBLOCK_KHO_SCRATCH > when that defconfig is designed for it? is_kho_boot() will always be false when CONFIG_KEXEC_HANDOVER is not enabled. So the compiler should optimize this out. Only using the ifdef is not enough. Just because the config is enabled doesn't mean every boot will be a KHO boot. You can do regular reboots or even regular kexec, without ever having KHO involved. We only want to call this for a KHO boot. So a runtime check is needed anyway.
> >> >> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c >> index c3acbd26408ba..0a34dc011bf91 100644 >> --- a/arch/x86/kernel/e820.c >> +++ b/arch/x86/kernel/e820.c >> @@ -16,6 +16,7 @@ >> #include >> #include >> #include >> +#include >> >> #include >> #include >> @@ -1315,7 +1316,8 @@ void __init e820__memblock_setup(void) >> * After real mode trampoline is allocated, we clear that scratch >> * marking. >> */ >> - memblock_mark_kho_scratch(0, SZ_1M); >> + if (is_kho_boot()) >> + memblock_mark_kho_scratch(0, SZ_1M); >> >> /* >> * 32-bit systems are limited to 4BG of memory even with HIGHMEM and >> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c >> index 88be32026768c..4e9b4dff17216 100644 >> --- a/arch/x86/realmode/init.c >> +++ b/arch/x86/realmode/init.c >> @@ -4,6 +4,7 @@ >> #include >> #include >> #include >> +#include >> >> #include >> #include >> @@ -67,7 +68,8 @@ void __init reserve_real_mode(void) >> */ >> memblock_reserve(0, SZ_1M); >> >> - memblock_clear_kho_scratch(0, SZ_1M); >> + if (is_kho_boot()) >> + memblock_clear_kho_scratch(0, SZ_1M); >> } >> >> static void __init sme_sev_setup_real_mode(struct trampoline_header *th) >> >> > -- Regards, Pratyush Yadav From akpm at linux-foundation.org Tue Nov 25 09:55:13 2025 From: akpm at linux-foundation.org (Andrew Morton) Date: Tue, 25 Nov 2025 09:55:13 -0800 Subject: [PATCHv2 1/2] kernel/kexec: Change the prototype of kimage_map_segment() In-Reply-To: References: <20251106065904.10772-1-piliu@redhat.com> <20251124141620.eaef984836fe2edc7acf9179@linux-foundation.org> Message-ID: <20251125095513.d71dcf5aca95db49008cbc25@linux-foundation.org> On Tue, 25 Nov 2025 12:54:39 +0800 Baoquan He wrote: > On 11/24/25 at 02:16pm, Andrew Morton wrote: > > On Thu, 6 Nov 2025 14:59:03 +0800 Pingfan Liu wrote: > > > > > The kexec segment index will be required to extract the corresponding > > > information for that segment in kimage_map_segment(). 
Additionally, > > > kexec_segment already holds the kexec relocation destination address and > > > size. Therefore, the prototype of kimage_map_segment() can be changed. > > > > Could we please have some reviewer input on these two patches? > > I have some concerns about the one place of tiny code change, and the > root cause missing in the log. And Mimi sent mail to me asking why this bug > can't be seen on her laptop, I told her this bug can only be triggered > on systems where a CMA area exists. I think these need to be addressed in v3. Great, thanks, I'll drop this version. From usamaarif642 at gmail.com Tue Nov 25 10:47:15 2025 From: usamaarif642 at gmail.com (Usama Arif) Date: Tue, 25 Nov 2025 18:47:15 +0000 Subject: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for memory below 1M In-Reply-To: References: <20250509074635.3187114-1-changyuanl@google.com> <20250509074635.3187114-13-changyuanl@google.com> Message-ID: On 25/11/2025 13:50, Mike Rapoport wrote: > Hi, > > On Tue, Nov 25, 2025 at 02:15:34PM +0100, Pratyush Yadav wrote: >> On Mon, Nov 24 2025, Usama Arif wrote:
>>> >>> 20:16:33 ------------[ cut here ]------------ >>> 20:16:33 WARNING: CPU: 0 PID: 0 at mm/memblock.c:668 memblock_add_range+0x316/0x330 >>> 20:16:33 Modules linked in: >>> 20:16:33 CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G S 6.16.1-0_fbk0_0_gc0739ee5037a #1 NONE >>> 20:16:33 Tainted: [S]=CPU_OUT_OF_SPEC >>> 20:16:33 RIP: 0010:memblock_add_range+0x316/0x330 >>> 20:16:33 Code: ff ff ff 89 5c 24 08 41 ff c5 44 89 6c 24 10 48 63 74 24 08 48 63 54 24 10 e8 26 0c 00 00 e9 41 ff ff ff 0f 0b e9 af fd ff ff <0f> 0b e9 b7 fd ff ff 0f 0b 0f 0b cc cc cc cc cc cc cc cc cc cc cc >>> 20:16:33 RSP: 0000:ffffffff83403dd8 EFLAGS: 00010083 ORIG_RAX: 0000000000000000 >>> 20:16:33 RAX: ffffffff8476ff90 RBX: 0000000000001c00 RCX: 0000000000000002 >>> 20:16:33 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff83bad4d8 >>> 20:16:33 RBP: 000000000009f000 R08: 0000000000000020 R09: 8000000000097101 >>> 20:16:33 R10: ffffffffff2004b0 R11: 203a6d6f646e6172 R12: 000000000009ec00 >>> 20:16:33 R13: 0000000000000002 R14: 0000000000100000 R15: 000000000009d000 >>> 20:16:33 FS: 0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000 >>> 20:16:33 CR2: ffff888065413ff8 CR3: 00000000663b7000 CR4: 00000000000000b0 >>> 20:16:33 Call Trace: >>> 20:16:33 >>> 20:16:33 ? __memblock_reserve+0x75/0x80 > > Do you have faddr2line for this? > >>> 20:16:33 ? setup_arch+0x30f/0xb10 > > And this? > Thanks for this! I think it helped narrow down the problem. The stack is: 20:16:33 ? __memblock_reserve (mm/memblock.c:936) 20:16:33 ? setup_arch (arch/x86/kernel/setup.c:413 arch/x86/kernel/setup.c:499 arch/x86/kernel/setup.c:956) 20:16:33 ? start_kernel (init/main.c:922) 20:16:33 ? x86_64_start_reservations (arch/x86/kernel/ebda.c:57) 20:16:33 ? x86_64_start_kernel (arch/x86/kernel/head64.c:231) 20:16:33 ? common_startup_64 (arch/x86/kernel/head_64.S:419) This is a 6.16 kernel. 20:16:33 ? __memblock_reserve (mm/memblock.c:936) That's the memblock_add_range call in memblock_reserve 20:16:33 ?
setup_arch (arch/x86/kernel/setup.c:413 arch/x86/kernel/setup.c:499 arch/x86/kernel/setup.c:956) That is parse_setup_data -> add_early_ima_buffer -> memblock_reserve_kern I put a simple print like below: diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 680d1b6dfea41..cc97ffc0083c7 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -409,6 +409,7 @@ static void __init add_early_ima_buffer(u64 phys_addr) } if (data->size) { + pr_err("PPP %s %s %d data->addr %llx, data->size %llx \n", __FILE__, __func__, __LINE__, data->addr, data->size); memblock_reserve_kern(data->addr, data->size); ima_kexec_buffer_phys = data->addr; ima_kexec_buffer_size = data->size; and I see (without replicating the warning): [ 0.000000] PPP arch/x86/kernel/setup.c add_early_ima_buffer 412 data->addr 9e000, data->size 1000 .... [ 0.000348] MEMBLOCK configuration: [ 0.000348] memory size = 0x0000003fea329ff0 reserved size = 0x00000000050c969b [ 0.000350] memory.cnt = 0x5 [ 0.000351] memory[0x0] [0x0000000000001000-0x000000000009ffff], 0x000000000009f000 bytes flags: 0x40 [ 0.000353] memory[0x1] [0x0000000000100000-0x0000000067c65fff], 0x0000000067b66000 bytes flags: 0x0 [ 0.000355] memory[0x2] [0x000000006d8db000-0x000000006fffffff], 0x0000000002725000 bytes flags: 0x0 [ 0.000356] memory[0x3] [0x0000000100000000-0x000000407fff8fff], 0x0000003f7fff9000 bytes flags: 0x0 [ 0.000358] memory[0x4] [0x000000407fffa000-0x000000407fffffff], 0x0000000000006000 bytes flags: 0x0 [ 0.000359] reserved.cnt = 0x7 So MEMBLOCK_RSRV_KERN and MEMBLOCK_KHO_SCRATCH seem to overlap. >>> 20:16:33 ? start_kernel+0x58/0x960 >>> 20:16:33 ? x86_64_start_reservations+0x20/0x20 >>> 20:16:33 ? x86_64_start_kernel+0x13d/0x140 >>> 20:16:33 ? common_startup_64+0x13e/0x140 >>> 20:16:33 >>> 20:16:33 ---[ end trace 0000000000000000 ]--- >>> >>> >>> Rolling out with memblock=debug is not really an option in a large scale fleet due to the >>> time added to boot.
But I did try on one of the hosts (without reproducing the issue) and I see: > > Is it a problem to roll out a kernel that has additional debug printouts as > Breno suggested earlier? I.e. > > if (flags != MEMBLOCK_NONE && flags != rgn->flags) { > pr_warn("memblock: Flag mismatch at region [%pa-%pa]\n", > &rgn->base, &rend); > pr_warn(" Existing region flags: %#x\n", rgn->flags); > pr_warn(" New range flags: %#x\n", flags); > pr_warn(" New range: [%pa-%pa]\n", &base, &end); > WARN_ON_ONCE(1); > } > I can add this, but the only thing is that it might be several weeks between me putting this in the kernel and that kernel being deployed to enough machines that it starts to show up. I think the IMA coinciding with memblock_mark_kho_scratch in e820__memblock_setup could be the reason for the warning. It might be better to fix that case and deploy it to see if the warnings still show up? I can add these prints as well in case it doesn't fix the problem. >>> [ 0.000616] memory.cnt = 0x6 >>> [ 0.000617] memory[0x0] [0x0000000000001000-0x000000000009bfff], 0x000000000009b000 bytes flags: 0x40 >>> [ 0.000620] memory[0x1] [0x000000000009f000-0x000000000009ffff], 0x0000000000001000 bytes flags: 0x40 >>> [ 0.000621] memory[0x2] [0x0000000000100000-0x000000005ed09fff], 0x000000005ec0a000 bytes flags: 0x0 >>> ... >>> >>> The 0x40 (MEMBLOCK_KHO_SCRATCH) is coming from memblock_mark_kho_scratch in e820__memblock_setup. I believe this >>> should be under ifdef like the diff at the end? (Happy to send this as a patch for review if it makes sense). >>> We have KEXEC_HANDOVER disabled in our defconfig, therefore MEMBLOCK_KHO_SCRATCH shouldn't be selected and >>> we shouldn't have any MEMBLOCK_KHO_SCRATCH type regions in our memblock reservations. >>> >>> The other thing I did was insert a while(1) just before the warning and inspected the registers in qemu. >>> R14 held the base register, and R15 held the size at that point.
>>> In the warning R14 is 0x100000 meaning that someone is reserving a region with a different flag to MEMBLOCK_NONE >>> at the boundary of MEMBLOCK_KHO_SCRATCH. > > Judging by the register values, flags could be in %rcx or %r13 (0x2 - MEMBLOCK_MIRROR) or in > %r8 (0x20 - MEMBLOCK_RSRV_KERN) I feel like it might be r8 (MEMBLOCK_RSRV_KERN) from IMA. > > Since WARN_ON() is triggered in __memblock_reserve() I'd bet on > MEMBLOCK_RSRV_KERN. > > And apparently the warning triggers for some memory that was initially > reserved with memblock_reserve() and then some of it was reserved with > memblock_reserve_kern(). > > If you have the logs from failing boots up to the point where SLUB reports > about its initialization, e.g. > > [ 0.134377] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1 > > something there may hint about what's the issue. So the boot doesn't fail, it's just giving warnings in the fleet. I have added the dmesg to the end of the mail. > >> I don't get this... The WARN_ON() is only triggered when the regions >> overlap. Here, there should be no overlap, since the scratch region >> should end at 0x100000 (SZ_1M) and the new region starts at 0x100000 >> (SZ_1M). > > Not only that, the warning is from __memblock_reserve() that works with > memblock.reserved and the dump is for memblock.memory. > >> Anyway, you do indeed point at a bug. memblock_mark_kho_scratch() should >> only be called on a KHO boot, not unconditionally. So even with >> CONFIG_MEMBLOCK_KHO_SCRATCH enabled, this should only be called on a KHO >> boot, not every time. >> >> I think the below diff should fix the warning for you by making sure the >> scratch areas are not present on non-KHO boot. I still don't know why >> you hit the warning in the first place though. If you'd be willing to >> dig deeper into that, it would be great. >> >> Can you give the below a try and if it fixes the problem for you I can >> send it on the list.
>> >> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c >> index c3acbd26408ba..0a34dc011bf91 100644 >> --- a/arch/x86/kernel/e820.c >> +++ b/arch/x86/kernel/e820.c >> @@ -16,6 +16,7 @@ >> #include >> #include >> #include >> +#include >> >> #include >> #include >> @@ -1315,7 +1316,8 @@ void __init e820__memblock_setup(void) >> * After real mode trampoline is allocated, we clear that scratch >> * marking. >> */ >> - memblock_mark_kho_scratch(0, SZ_1M); >> + if (is_kho_boot()) >> + memblock_mark_kho_scratch(0, SZ_1M); > > We'd better add an inline stub to memblock.h for > !CONFIG_MEMBLOCK_KHO_SCRATCH > > and move is_kho_boot() inside memblock_{mark,clear}_kho_scratch. This might > require moving them out of line, but it's not that they are on the hot > paths. > > BTW, this makes sense even if it does not help with the issue Breno and > Usama are working on. > Does something like this look good? I can try deploying this (although it will take some time to find out). We can get it upstream as well, as that makes backports easier.
diff --git a/mm/memblock.c b/mm/memblock.c index 154f1d73b61f2..257c6f0eee03d 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1119,8 +1119,12 @@ int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t */ __init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size) { - return memblock_setclr_flag(&memblock.memory, base, size, 1, - MEMBLOCK_KHO_SCRATCH); +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH + if (is_kho_boot()) + return memblock_setclr_flag(&memblock.memory, base, size, 1, + MEMBLOCK_KHO_SCRATCH); +#endif + return 0; } /** @@ -1133,8 +1137,12 @@ __init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size) */ __init int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size) { - return memblock_setclr_flag(&memblock.memory, base, size, 0, - MEMBLOCK_KHO_SCRATCH); +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH + if (is_kho_boot()) + return memblock_setclr_flag(&memblock.memory, base, size, 0, + MEMBLOCK_KHO_SCRATCH); +#endif + return 0; } static bool should_skip_region(struct memblock_type *type, >> >> /* >> * 32-bit systems are limited to 4BG of memory even with HIGHMEM and >> diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c >> index 88be32026768c..4e9b4dff17216 100644 >> --- a/arch/x86/realmode/init.c >> +++ b/arch/x86/realmode/init.c >> @@ -4,6 +4,7 @@ >> #include >> #include >> #include >> +#include >> >> #include >> #include >> @@ -67,7 +68,8 @@ void __init reserve_real_mode(void) >> */ >> memblock_reserve(0, SZ_1M); >> >> - memblock_clear_kho_scratch(0, SZ_1M); >> + if (is_kho_boot()) >> + memblock_clear_kho_scratch(0, SZ_1M); >> } >> >> static void __init sme_sev_setup_real_mode(struct trampoline_header *th) The dmesg from one of the hosts with the warning is: 20:16:33 BIOS-provided physical RAM map: 20:16:33 BIOS-e820: [mem 0x0000000000000000-0x000000000009ffff] usable 20:16:33 BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved 20:16:33 BIOS-e820: [mem
0x0000000000100000-0x0000000069ca3fff] usable 20:16:33 BIOS-e820: [mem 0x0000000069ca4000-0x000000006bda3fff] reserved 20:16:33 BIOS-e820: [mem 0x000000006bda4000-0x000000006be5efff] ACPI data 20:16:33 BIOS-e820: [mem 0x000000006be5f000-0x000000006c9b8fff] ACPI NVS 20:16:33 BIOS-e820: [mem 0x000000006c9b9000-0x000000006ebedfff] reserved 20:16:33 BIOS-e820: [mem 0x000000006ebee000-0x000000006fffffff] usable 20:16:33 BIOS-e820: [mem 0x0000000070000000-0x000000008fffffff] reserved 20:16:33 BIOS-e820: [mem 0x00000000fd000000-0x00000000fe7fffff] reserved 20:16:33 BIOS-e820: [mem 0x00000000fed20000-0x00000000fed44fff] reserved 20:16:33 BIOS-e820: [mem 0x00000000ff000000-0x00000000ffffffff] reserved 20:16:33 BIOS-e820: [mem 0x0000000100000000-0x000000107fff847f] usable 20:16:33 BIOS-e820: [mem 0x000000107fff8480-0x000000107fff848f] type 128 20:16:33 BIOS-e820: [mem 0x000000107fff8490-0x000000107fffffff] usable 20:16:33 random: crng init done 20:16:33 ------------[ cut here ]------------ 20:16:33 WARNING: CPU: 0 PID: 0 at mm/memblock.c:668 memblock_add_range+0x316/0x330 20:16:33 Modules linked in: 20:16:33 CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G S 6.16.1-0_fbk0_0_gc0739ee5037a #1 NONE 20:16:33 Tainted: [S]=CPU_OUT_OF_SPEC 20:16:33 RIP: 0010:memblock_add_range+0x316/0x330 20:16:33 Code: ff ff ff 89 5c 24 08 41 ff c5 44 89 6c 24 10 48 63 74 24 08 48 63 54 24 10 e8 26 0c 00 00 e9 41 ff ff ff 0f 0b e9 af fd ff ff <0f> 0b e9 b7 fd ff ff 0f 0b 0f 0b cc cc cc cc cc cc cc cc cc cc cc 20:16:33 RSP: 0000:ffffffff83403dd8 EFLAGS: 00010083 ORIG_RAX: 0000000000000000 20:16:33 RAX: ffffffff8476ff90 RBX: 0000000000001c00 RCX: 0000000000000002 20:16:33 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff83bad4d8 20:16:33 RBP: 000000000009f000 R08: 0000000000000020 R09: 8000000000097101 20:16:33 R10: ffffffffff2004b0 R11: 203a6d6f646e6172 R12: 000000000009ec00 20:16:33 R13: 0000000000000002 R14: 0000000000100000 R15: 000000000009d000 20:16:33 FS: 0000000000000000(0000) 
GS:0000000000000000(0000) knlGS:0000000000000000 20:16:33 CR2: ffff888065413ff8 CR3: 00000000663b7000 CR4: 00000000000000b0 20:16:33 Call Trace: 20:16:33 20:16:33 ? __memblock_reserve+0x75/0x80 20:16:33 ? setup_arch+0x30f/0xb10 20:16:33 ? start_kernel+0x58/0x960 20:16:33 ? x86_64_start_reservations+0x20/0x20 20:16:33 ? x86_64_start_kernel+0x13d/0x140 20:16:33 ? common_startup_64+0x13e/0x140 20:16:33 20:16:33 ---[ end trace 0000000000000000 ]--- 20:16:33 Memory allocation profiling is enabled with compression and is turned on! 20:16:33 NX (Execute Disable) protection: active 20:16:33 APIC: Static calls initialized 20:16:33 efi: EFI v2.6 by American Megatrends 20:16:33 efi: ACPI 2.0=0x6c5ec000 ACPI=0x6c5ec000 TPMFinalLog=0x6c987000 SMBIOS=0x6e69d000 SMBIOS 3.0=0x6e69c000 MEMATTR=0xffffffffffffffff ESRT=0x67d97918 INITRD=0x5f275d18 TPMEventLog=0x6be5d018 20:16:33 efi: Remove mem00: MMIO range=[0xff000000-0xffffffff] (16MB) from e820 map 20:16:33 efi: Not removing mem01: MMIO range=[0xfed20000-0xfed44fff] (148KB) from e820 map 20:16:33 efi: Remove mem02: MMIO range=[0xfd000000-0xfe7fffff] (24MB) from e820 map 20:16:33 efi: Remove mem03: MMIO range=[0x80000000-0x8fffffff] (256MB) from e820 map 20:16:33 SMBIOS 3.1.1 present. 20:16:33 DMI: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A23 12/08/2020 20:16:33 DMI: Memory slots populated: 4/8 20:16:33 tsc: Detected 1600.000 MHz processor 20:16:33 last_pfn = 0x1080000 max_arch_pfn = 0x400000000 20:16:33 MTRR map: 8 entries (3 fixed + 5 variable; max 23), built from 10 variable MTRRs 20:16:33 x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT 20:16:33 last_pfn = 0x70000 max_arch_pfn = 0x400000000 20:16:33 esrt: Reserving ESRT space from 0x0000000067d97918 to 0x0000000067d97978. 
20:16:33 Using GB pages for direct mapping 20:16:33 RAMDISK: [mem 0x5ed81000-0x62ffffff] 20:16:33 ACPI: Early table checksum verification disabled 20:16:33 ACPI: RSDP 0x000000006C5EC000 000024 (v02 ALASKA) 20:16:33 ACPI: XSDT 0x000000006C5EC0C0 000104 (v01 ALASKA A M I 01072009 AMI 00010013) 20:16:33 ACPI: FACP 0x000000006C62E590 000114 (v06 ALASKA A M I 01072009 INTL 20091013) 20:16:33 ACPI: DSDT 0x000000006C5EC260 04232D (v02 ALASKA A M I 01072009 INTL 20091013) 20:16:33 ACPI: FACS 0x000000006C9B7080 000040 20:16:33 ACPI: FPDT 0x000000006C62E6A8 000044 (v01 ALASKA A M I 01072009 AMI 00010013) 20:16:33 ACPI: FIDT 0x000000006C62E6F0 00009C (v01 ALASKA A M I 01072009 AMI 00010013) 20:16:33 ACPI: SPMI 0x000000006C62E790 000041 (v05 ALASKA A M I 00000000 AMI. 00000000) 20:16:33 ACPI: UEFI 0x000000006C62E7D8 000048 (v01 ALASKA A M I 01072009 01000013) 20:16:33 ACPI: MCFG 0x000000006C62E820 00003C (v01 ALASKA A M I 01072009 MSFT 00000097) 20:16:33 ACPI: HPET 0x000000006C62E860 000038 (v01 ALASKA A M I 00000001 INTL 20091013) 20:16:33 ACPI: APIC 0x000000006C62E898 00071E (v03 ALASKA A M I 00000000 INTL 20091013) 20:16:33 ACPI: MIGT 0x000000006C62EFB8 000040 (v01 ALASKA A M I 00000000 INTL 20091013) 20:16:33 ACPI: PCAT 0x000000006C62EFF8 000068 (v02 ALASKA A M I 00000002 INTL 20091013) 20:16:33 ACPI: PCCT 0x000000006C62F060 00006E (v01 ALASKA A M I 00000002 INTL 20091013) 20:16:33 ACPI: RASF 0x000000006C62F0D0 000030 (v01 ALASKA A M I 00000001 INTL 20091013) 20:16:33 ACPI: SVOS 0x000000006C62F100 000032 (v01 ALASKA A M I 00000000 INTL 20091013) 20:16:33 ACPI: WDDT 0x000000006C62F138 000040 (v01 ALASKA A M I 00000000 INTL 20091013) 20:16:33 ACPI: OEM4 0x000000006C62F178 028A0C (v02 INTEL CPU CST 00003000 INTL 20140828) 20:16:33 ACPI: OEM1 0x000000006C657B88 00A8CC (v02 INTEL CPU EIST 00003000 INTL 20140828) 20:16:33 ACPI: OEM2 0x000000006C662458 006534 (v02 INTEL CPU HWP 00003000 INTL 20140828) 20:16:33 ACPI: SSDT 0x000000006C668990 00CEB8 (v02 INTEL SSDT PM 00004000 INTL 
20140828) 20:16:33 ACPI: SSDT 0x000000006C675848 00065B (v02 ALASKA A M I 00000000 INTL 20091013) 20:16:33 ACPI: SPCR 0x000000006C675EA8 000050 (v02 A M I APTIO V 01072009 AMI. 0005000E) 20:16:33 ACPI: TPM2 0x000000006C675EF8 000034 (v04 ALASKA A M I 00000001 AMI 00000000) 20:16:33 ACPI: SSDT 0x000000006C675F30 001368 (v02 INTEL SpsNm 00000002 INTL 20140828) 20:16:33 ACPI: DMAR 0x000000006C677298 0000E8 (v01 ALASKA A M I 00000001 INTL 20091013) 20:16:33 ACPI: HEST 0x000000006C677380 0000A8 (v01 ALASKA A M I 00000001 INTL 00000001) 20:16:33 ACPI: BERT 0x000000006C677428 000030 (v01 ALASKA A M I 00000001 INTL 00000001) 20:16:33 ACPI: ERST 0x000000006C677458 000230 (v01 ALASKA A M I 00000001 INTL 00000001) 20:16:33 ACPI: EINJ 0x000000006C677688 000150 (v01 ALASKA A M I 00000001 INTL 00000001) 20:16:33 ACPI: WSMT 0x000000006C6777D8 000028 (v01 ALASKA A M I 01072009 AMI 00010013) 20:16:33 ACPI: Reserving FACP table memory at [mem 0x6c62e590-0x6c62e6a3] 20:16:33 ACPI: Reserving DSDT table memory at [mem 0x6c5ec260-0x6c62e58c] 20:16:33 ACPI: Reserving FACS table memory at [mem 0x6c9b7080-0x6c9b70bf] 20:16:33 ACPI: Reserving FPDT table memory at [mem 0x6c62e6a8-0x6c62e6eb] 20:16:33 ACPI: Reserving FIDT table memory at [mem 0x6c62e6f0-0x6c62e78b] 20:16:33 ACPI: Reserving SPMI table memory at [mem 0x6c62e790-0x6c62e7d0] 20:16:33 ACPI: Reserving UEFI table memory at [mem 0x6c62e7d8-0x6c62e81f] 20:16:33 ACPI: Reserving MCFG table memory at [mem 0x6c62e820-0x6c62e85b] 20:16:33 ACPI: Reserving HPET table memory at [mem 0x6c62e860-0x6c62e897] 20:16:33 ACPI: Reserving APIC table memory at [mem 0x6c62e898-0x6c62efb5] 20:16:33 ACPI: Reserving MIGT table memory at [mem 0x6c62efb8-0x6c62eff7] 20:16:33 ACPI: Reserving PCAT table memory at [mem 0x6c62eff8-0x6c62f05f] 20:16:33 ACPI: Reserving PCCT table memory at [mem 0x6c62f060-0x6c62f0cd] 20:16:33 ACPI: Reserving RASF table memory at [mem 0x6c62f0d0-0x6c62f0ff] 20:16:33 ACPI: Reserving SVOS table memory at [mem 0x6c62f100-0x6c62f131] 
20:16:33 ACPI: Reserving WDDT table memory at [mem 0x6c62f138-0x6c62f177] 20:16:33 ACPI: Reserving OEM4 table memory at [mem 0x6c62f178-0x6c657b83] 20:16:33 ACPI: Reserving OEM1 table memory at [mem 0x6c657b88-0x6c662453] 20:16:33 ACPI: Reserving OEM2 table memory at [mem 0x6c662458-0x6c66898b] 20:16:33 ACPI: Reserving SSDT table memory at [mem 0x6c668990-0x6c675847] 20:16:33 ACPI: Reserving SSDT table memory at [mem 0x6c675848-0x6c675ea2] 20:16:33 ACPI: Reserving SPCR table memory at [mem 0x6c675ea8-0x6c675ef7] 20:16:33 ACPI: Reserving TPM2 table memory at [mem 0x6c675ef8-0x6c675f2b] 20:16:33 ACPI: Reserving SSDT table memory at [mem 0x6c675f30-0x6c677297] 20:16:33 ACPI: Reserving DMAR table memory at [mem 0x6c677298-0x6c67737f] 20:16:33 ACPI: Reserving HEST table memory at [mem 0x6c677380-0x6c677427] 20:16:33 ACPI: Reserving BERT table memory at [mem 0x6c677428-0x6c677457] 20:16:33 ACPI: Reserving ERST table memory at [mem 0x6c677458-0x6c677687] 20:16:33 ACPI: Reserving EINJ table memory at [mem 0x6c677688-0x6c6777d7] 20:16:33 ACPI: Reserving WSMT table memory at [mem 0x6c6777d8-0x6c6777ff] 20:16:33 No NUMA configuration found 20:16:33 Faking a node at [mem 0x0000000000000000-0x000000107fffffff] 20:16:33 NODE_DATA(0) allocated [mem 0x107fffcfc0-0x107fffffff] 20:16:33 hugetlb_cma: reserve 6144 MiB, up to 6144 MiB per node 20:16:33 cma: Reserved 6144 MiB in 1 range 20:16:33 hugetlb_cma: reserved 6144 MiB on node 0 20:16:33 crashkernel reserved: 0x0000000052000000 - 0x000000005e000000 (192 MB) 20:16:33 Zone ranges: 20:16:33 DMA [mem 0x0000000000001000-0x0000000000ffffff] 20:16:33 DMA32 [mem 0x0000000001000000-0x00000000ffffffff] 20:16:33 Normal [mem 0x0000000100000000-0x000000107fffffff] 20:16:33 Device empty 20:16:33 Movable zone start for each node 20:16:33 Early memory node ranges 20:16:33 node 0: [mem 0x0000000000001000-0x000000000009ffff] 20:16:33 node 0: [mem 0x0000000000100000-0x0000000069ca3fff] 20:16:33 node 0: [mem 0x000000006ebee000-0x000000006fffffff] 
20:16:33 node 0: [mem 0x0000000100000000-0x000000107fff7fff] 20:16:33 node 0: [mem 0x000000107fff9000-0x000000107fffffff] 20:16:33 Initmem setup node 0 [mem 0x0000000000001000-0x000000107fffffff] 20:16:33 On node 0, zone DMA: 1 pages in unavailable ranges 20:16:33 On node 0, zone DMA: 96 pages in unavailable ranges 20:16:33 On node 0, zone DMA32: 20298 pages in unavailable ranges 20:16:33 On node 0, zone Normal: 1 pages in unavailable ranges 20:16:33 ACPI: PM-Timer IO Port: 0x508 20:16:33 ACPI: X2APIC_NMI (uid[0xffffffff] high level lint[0x1]) 20:16:33 ACPI: LAPIC_NMI (acpi_id[0xff] high level lint[0x1]) 20:16:33 IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23 20:16:33 IOAPIC[1]: apic_id 9, version 32, address 0xfec01000, GSI 24-31 20:16:33 IOAPIC[2]: apic_id 10, version 32, address 0xfec08000, GSI 32-39 20:16:33 IOAPIC[3]: apic_id 11, version 32, address 0xfec10000, GSI 40-47 20:16:33 IOAPIC[4]: apic_id 12, version 32, address 0xfec18000, GSI 48-55 20:16:33 ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl) 20:16:33 ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level) 20:16:33 ACPI: Using ACPI (MADT) for SMP configuration information 20:16:33 ACPI: HPET id: 0x8086a701 base: 0xfed00000 20:16:33 TSC deadline timer available 20:16:33 CPU topo: Max. logical packages: 1 20:16:33 CPU topo: Max. logical dies: 1 20:16:33 CPU topo: Max. dies per package: 1 20:16:33 CPU topo: Max. threads per core: 2 20:16:33 CPU topo: Num. cores per package: 18 20:16:33 CPU topo: Num. 
threads per package: 36 20:16:33 CPU topo: Allowing 36 present CPUs plus 0 hotplug CPUs 20:16:33 [mem 0x80000000-0xfed1ffff] available for PCI devices 20:16:33 Booting paravirtualized kernel on bare hardware 20:16:33 clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 1910969940391419 ns 20:16:33 Load bootconfig: 46 bytes 5 nodes 20:16:33 setup_percpu: NR_CPUS:512 nr_cpumask_bits:36 nr_cpu_ids:36 nr_node_ids:1 20:16:33 percpu: Embedded 78 pages/cpu s282624 r8192 d28672 u524288 20:16:33 Unknown kernel command line parameters "biosdevname=0", will be passed to user space. 20:16:33 printk: log buffer data + meta data: 2097152 + 7340032 = 9437184 bytes 20:16:33 Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes, linear) 20:16:33 Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes, linear) 20:16:33 software IO TLB: area num 64. 20:16:33 software IO TLB: SWIOTLB bounce buffer size roundup to 16MB 20:16:33 Fallback order for Node 0: 0 20:16:33 Built 1 zonelists, mobility grouping on. Total pages: 16691284 20:16:33 Policy zone: Normal 20:16:33 mem auto-init: stack:off, heap alloc:off, heap free:off 20:16:33 SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=36, Nodes=1 From bhe at redhat.com Tue Nov 25 17:10:07 2025 From: bhe at redhat.com (Baoquan He) Date: Wed, 26 Nov 2025 09:10:07 +0800 Subject: [PATCHv2 1/2] kernel/kexec: Change the prototype of kimage_map_segment() In-Reply-To: <20251106065904.10772-1-piliu@redhat.com> References: <20251106065904.10772-1-piliu@redhat.com> Message-ID: Hi Pingfan, On 11/06/25 at 02:59pm, Pingfan Liu wrote: > The kexec segment index will be required to extract the corresponding > information for that segment in kimage_map_segment(). Additionally, > kexec_segment already holds the kexec relocation destination address and > size. Therefore, the prototype of kimage_map_segment() can be changed. Because no cover letter, I just reply here. 
I am testing the code of (tag: next-20251125, next/master) on an arm64 system.
I saw that your two patches are already in there. When I used kexec reboot
as below, I still got the warning message during the ima_kexec_post_load()
invocation.

====================
kexec -d -l /boot/vmlinuz-6.18.0-rc7-next-20251125 --initrd /boot/initramfs-6.18.0-rc7-next-20251125.img --reuse-cmdline
====================

====================
[34283.657670] kexec_file: kernel: 000000006cf71829 kernel_size: 0x48b0000
[34283.657700] PEFILE: Unsigned PE binary
[34283.676597] ima: kexec measurement buffer for the loaded kernel at 0xff206000.
[34283.676621] kexec_file: Loaded initrd at 0x84cb0000 bufsz=0x25ec426 memsz=0x25ed000
[34283.684646] kexec_file: Loaded dtb at 0xff400000 bufsz=0x39e memsz=0x1000
[34283.684653] kexec_file(Image): Loaded kernel at 0x80400000 bufsz=0x48b0000 memsz=0x48b0000
[34283.684663] kexec_file: nr_segments = 4
[34283.684666] kexec_file: segment[0]: buf=0x0000000000000000 bufsz=0x0 mem=0xff206000 memsz=0x1000
[34283.684674] kexec_file: segment[1]: buf=0x000000006cf71829 bufsz=0x48b0000 mem=0x80400000 memsz=0x48b0000
[34283.725987] kexec_file: segment[2]: buf=0x00000000c7369de6 bufsz=0x25ec426 mem=0x84cb0000 memsz=0x25ed000
[34283.747670] kexec_file: segmen
** replaying previous printk message **
[34283.747670] kexec_file: segment[3]: buf=0x00000000d83b530b bufsz=0x39e mem=0xff400000 memsz=0x1000
[34283.747973] ------------[ cut here ]------------
[34283.747976] WARNING: CPU: 33 PID: 16112 at kernel/kexec_core.c:1002 kimage_map_segment+0x138/0x190
[34283.778574] Modules linked in: rfkill vfat fat ipmi_ssif igb acpi_ipmi ipmi_si ipmi_devintf mlx5_fwctl i2c_algo_bit ipmi_msghandler fwctl fuse loop nfnetlink zram lz4hc_compress lz4_compress xfs mlx5_ib macsec mlx5_core nvme nvme_core mlxfw psample tls nvme_keyring nvme_auth pci_hyperv_intf sbsa_gwdt rpcrdma sunrpc rdma_ucm ib_uverbs ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser i2c_dev ib_umad rdma_cm ib_ipoib iw_cm ib_cm
libiscsi ib_core scsi_transport_iscsi aes_neon_bs [34283.824233] CPU: 33 UID: 0 PID: 16112 Comm: kexec Tainted: G W 6.17.8-200.fc42.aarch64 #1 PREEMPT(voluntary) [34283.836355] Tainted: [W]=WARN [34283.839684] Hardware name: CRAY CS500/CMUD , BIOS 1.4.0 Jun 17 2020 [34283.846903] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [34283.854243] pc : kimage_map_segment+0x138/0x190 [34283.859120] lr : kimage_map_segment+0x4c/0x190 [34283.863920] sp : ffff8000a0643a90 [34283.867394] x29: ffff8000a0643a90 x28: ffff800083d0a000 x27: 0000000000000000 [34283.874901] x26: 0000aaaad722d4b0 x25: 000000000000008f x24: ffff800083d0a000 [34283.882608] x23: 0000000000000001 x22: 00000000ff206000 x21: 00000000ff207000 [34283.890305] x20: ffff008fbd306980 x19: ffff008f895d6400 x18: 00000000fffffff9 [34283.897815] x17: 303d6d656d206539 x16: 3378303d7a736675 x15: 646565732d676e72 [34283.905516] x14: 00646565732d726c x13: 616d692c78756e69 x12: 6c00636578656b2d [34283.912999] x11: 007265666675622d x10: 636578656b2d616d x9 : ffff80008050b73c [34283.920691] x8 : 0001000000000000 x7 : 0000000000000000 x6 : 0000000080000000 [34283.928197] x5 : 0000000084cb0000 x4 : ffff008fbd2306b0 x3 : ffff008fbd305000 [34283.935898] x2 : fffffff7ff000000 x1 : 0000000000000004 x0 : ffff800082046000 [34283.943603] Call trace: [34283.946039] kimage_map_segment+0x138/0x190 (P) [34283.950935] ima_kexec_post_load+0x58/0xc0 [34283.955225] __do_sys_kexec_file_load+0x2b8/0x398 [34283.960279] __arm64_sys_kexec_file_load+0x28/0x40 [34283.965965] invoke_syscall.constprop.0+0x64/0xe8 [34283.971025] el0_svc_common.constprop.0+0x40/0xe8 [34283.975883] do_el0_svc+0x24/0x38 [34283.979361] el0_svc+0x3c/0x168 [34283.982833] el0t_64_sync_handler+0xa0/0xf0 [34283.987176] el0t_64_sync+0x1b0/0x1b8 [34283.991000] ---[ end trace 0000000000000000 ]--- [34283.996060] ------------[ cut here ]------------ [34283.996064] WARNING: CPU: 33 PID: 16112 at mm/vmalloc.c:538 vmap_pages_pte_range+0x2bc/0x3c0 [34284.010006] 
Modules linked in: rfkill vfat fat ipmi_ssif igb acpi_ipmi ipmi_si ipmi_devintf mlx5_fwctl i2c_algo_bit ipmi_msghandler fwctl fuse loop nfnetlink zram lz4hc_compress lz4_compress xfs mlx5_ib macsec mlx5_core nvme nvme_core mlxfw psample tls nvme_keyring nvme_auth pci_hyperv_intf sbsa_gwdt rpcrdma sunrpc rdma_ucm ib_uverbs ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser i2c_dev ib_umad rdma_cm ib_ipoib iw_cm ib_cm libiscsi ib_core scsi_transport_iscsi aes_neon_bs [34284.055630] CPU: 33 UID: 0 PID: 16112 Comm: kexec Tainted: G W 6.17.8-200.fc42.aarch64 #1 PREEMPT(voluntary) [34284.067701] Tainted: [W]=WARN [34284.070833] Hardware name: CRAY CS500/CMUD , BIOS 1.4.0 Jun 17 2020 [34284.078238] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [34284.085546] pc : vmap_pages_pte_range+0x2bc/0x3c0 [34284.090607] lr : vmap_small_pages_range_noflush+0x16c/0x298 [34284.096528] sp : ffff8000a0643940 [34284.100001] x29: ffff8000a0643940 x28: 0000000000000000 x27: ffff800084f76000 [34284.107699] x26: fffffdffc0000000 x25: ffff8000a06439d0 x24: ffff800082046000 [34284.115174] x23: ffff800084f75000 x22: ffff007f80337ba8 x21: 03ffffffffffffc0 [34284.122821] x20: ffff008fbd306980 x19: ffff8000a06439d4 x18: 00000000fffffff9 [34284.130331] x17: 303d6d656d206539 x16: 3378303d7a736675 x15: 646565732d676e72 [34284.138032] x14: 0000000000004000 x13: ffff009781307130 x12: 0000000000002000 [34284.145733] x11: 0000000000000000 x10: 0000000000000001 x9 : ffff8000804e197c [34284.153248] x8 : 0000000000000027 x7 : ffff800085175000 x6 : ffff8000a06439d4 [34284.160944] x5 : ffff8000a06439d0 x4 : ffff008fbd306980 x3 : 0068000000000f03 [34284.168449] x2 : ffff007f80337ba8 x1 : 0000000000000000 x0 : 0000000000000000 [34284.176150] Call trace: [34284.178768] vmap_pages_pte_range+0x2bc/0x3c0 (P) [34284.183665] vmap_small_pages_range_noflush+0x16c/0x298 [34284.189264] vmap+0xb4/0x138 [34284.192312] kimage_map_segment+0xdc/0x190 [34284.196794] ima_kexec_post_load+0x58/0xc0 
[34284.201044] __do_sys_kexec_file_load+0x2b8/0x398 [34284.206107] __arm64_sys_kexec_file_load+0x28/0x40 [34284.211254] invoke_syscall.constprop.0+0x64/0xe8 [34284.216139] el0_svc_common.constprop.0+0x40/0xe8 [34284.221196] do_el0_svc+0x24/0x38 [34284.224678] el0_svc+0x3c/0x168 [34284.227983] el0t_64_sync_handler+0xa0/0xf0 [34284.232526] el0t_64_sync+0x1b0/0x1b8 [34284.236376] ---[ end trace 0000000000000000 ]--- [34284.241412] kexec_core: Could not map ima buffer. [34284.241421] ima: Could not map measurements buffer. [34284.551336] machine_kexec_post_load:155: [34284.551354] kexec kimage info: [34284.551366] type: 0 [34284.551373] head: 90363f9002 [34284.551377] kern_reloc: 0x00000090363f7000 [34284.551381] el2_vectors: 0x0000000000000000 [34284.551384] kexec_file: kexec_file_load: type:0, start:0x80400000 head:0x90363f9002 flags:0x8 ==================== > > Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") > Signed-off-by: Pingfan Liu > Cc: Andrew Morton > Cc: Baoquan He > Cc: Mimi Zohar > Cc: Roberto Sassu > Cc: Alexander Graf > Cc: Steven Chen > Cc: > To: kexec at lists.infradead.org > To: linux-integrity at vger.kernel.org > --- > include/linux/kexec.h | 4 ++-- > kernel/kexec_core.c | 9 ++++++--- > security/integrity/ima/ima_kexec.c | 4 +--- > 3 files changed, 9 insertions(+), 8 deletions(-) > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > index ff7e231b0485..8a22bc9b8c6c 100644 > --- a/include/linux/kexec.h > +++ b/include/linux/kexec.h > @@ -530,7 +530,7 @@ extern bool kexec_file_dbg_print; > #define kexec_dprintk(fmt, arg...) 
\ > do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0) > > -extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size); > +extern void *kimage_map_segment(struct kimage *image, int idx); > extern void kimage_unmap_segment(void *buffer); > #else /* !CONFIG_KEXEC_CORE */ > struct pt_regs; > @@ -540,7 +540,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { } > static inline void crash_kexec(struct pt_regs *regs) { } > static inline int kexec_should_crash(struct task_struct *p) { return 0; } > static inline int kexec_crash_loaded(void) { return 0; } > -static inline void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size) > +static inline void *kimage_map_segment(struct kimage *image, int idx) > { return NULL; } > static inline void kimage_unmap_segment(void *buffer) { } > #define kexec_in_progress false > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > index fa00b239c5d9..9a1966207041 100644 > --- a/kernel/kexec_core.c > +++ b/kernel/kexec_core.c > @@ -960,17 +960,20 @@ int kimage_load_segment(struct kimage *image, int idx) > return result; > } > > -void *kimage_map_segment(struct kimage *image, > - unsigned long addr, unsigned long size) > +void *kimage_map_segment(struct kimage *image, int idx) > { > + unsigned long addr, size, eaddr; > unsigned long src_page_addr, dest_page_addr = 0; > - unsigned long eaddr = addr + size; > kimage_entry_t *ptr, entry; > struct page **src_pages; > unsigned int npages; > void *vaddr = NULL; > int i; > > + addr = image->segment[idx].mem; > + size = image->segment[idx].memsz; > + eaddr = addr + size; > + > /* > * Collect the source pages and map them in a contiguous VA range. 
> > */
> diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
> index 7362f68f2d8b..5beb69edd12f 100644
> --- a/security/integrity/ima/ima_kexec.c
> +++ b/security/integrity/ima/ima_kexec.c
> @@ -250,9 +250,7 @@ void ima_kexec_post_load(struct kimage *image)
> if (!image->ima_buffer_addr)
> return;
>
> - ima_kexec_buffer = kimage_map_segment(image,
> - image->ima_buffer_addr,
> - image->ima_buffer_size);
> + ima_kexec_buffer = kimage_map_segment(image, image->ima_segment_index);
> if (!ima_kexec_buffer) {
> pr_err("Could not map measurements buffer.\n");
> return;
> --
> 2.49.0
>

From bhe at redhat.com Tue Nov 25 17:53:52 2025
From: bhe at redhat.com (Baoquan He)
Date: Wed, 26 Nov 2025 09:53:52 +0800
Subject: [PATCHv2 1/2] kernel/kexec: Change the prototype of kimage_map_segment()
In-Reply-To:
References: <20251106065904.10772-1-piliu@redhat.com>
Message-ID:

Hi,

On 11/26/25 at 09:10am, Baoquan He wrote:
> Hi Pingfan,
>
> On 11/06/25 at 02:59pm, Pingfan Liu wrote:
> > The kexec segment index will be required to extract the corresponding
> > information for that segment in kimage_map_segment(). Additionally,
> > kexec_segment already holds the kexec relocation destination address and
> > size. Therefore, the prototype of kimage_map_segment() can be changed.
>
> Because no cover letter, I just reply here.
>
> I am testing code of (tag: next-20251125, next/master) on arm64 system.
> I saw your two patches are already in there. When I used kexec reboot
> as below, I still got the warning message during ima_kexec_post_load()
> invocation.

And when I tried to turn off CMA allocation for the kexec buffer, I found
there is no such flag in the user-space kexec-tools utility. Alexander
introduced commit 07d24902977e ("kexec: enable CMA based contiguous
allocation") but has not added the KEXEC_FILE_NO_CMA flag to kexec-tools.
Pingfan, since you are working on fixing the bug, can either of you post a
patch to kexec-tools to add the flag?
And the KEXEC_FILE_FORCE_DTB flag too, which was introduced in commit
f367474b5884 ("x86/kexec: carry forward the boot DTB on kexec"). We only
have these flags in the kernel, but there is no way to specify them from
user space; what is the point of having them?

Thanks
Baoquan

>
> ====================
> kexec -d -l /boot/vmlinuz-6.18.0-rc7-next-20251125 --initrd /boot/initramfs-6.18.0-rc7-next-20251125.img --reuse-cmdline
> ====================
>
> ====================
> [34283.657670] kexec_file: kernel: 000000006cf71829 kernel_size: 0x48b0000
> [34283.657700] PEFILE: Unsigned PE binary
> [34283.676597] ima: kexec measurement buffer for the loaded kernel at 0xff206000.
> [34283.676621] kexec_file: Loaded initrd at 0x84cb0000 bufsz=0x25ec426 memsz=0x25ed000
> [34283.684646] kexec_file: Loaded dtb at 0xff400000 bufsz=0x39e memsz=0x1000
> [34283.684653] kexec_file(Image): Loaded kernel at 0x80400000 bufsz=0x48b0000 memsz=0x48b0000
> [34283.684663] kexec_file: nr_segments = 4
> [34283.684666] kexec_file: segment[0]: buf=0x0000000000000000 bufsz=0x0 mem=0xff206000 memsz=0x1000
> [34283.684674] kexec_file: segment[1]: buf=0x000000006cf71829 bufsz=0x48b0000 mem=0x80400000 memsz=0x48b0000
> [34283.725987] kexec_file: segment[2]: buf=0x00000000c7369de6 bufsz=0x25ec426 mem=0x84cb0000 memsz=0x25ed000
> [34283.747670] kexec_file: segmen
> ** replaying previous printk message **
> [34283.747670] kexec_file: segment[3]: buf=0x00000000d83b530b bufsz=0x39e mem=0xff400000 memsz=0x1000
> [34283.747973] ------------[ cut here ]------------
> [34283.747976] WARNING: CPU: 33 PID: 16112 at kernel/kexec_core.c:1002 kimage_map_segment+0x138/0x190
> [34283.778574] Modules linked in: rfkill vfat fat ipmi_ssif igb acpi_ipmi ipmi_si ipmi_devintf mlx5_fwctl i2c_algo_bit ipmi_msghandler fwctl fuse loop nfnetlink zram lz4hc_compress lz4_compress xfs mlx5_ib macsec mlx5_core nvme nvme_core mlxfw psample tls nvme_keyring nvme_auth pci_hyperv_intf sbsa_gwdt rpcrdma sunrpc rdma_ucm ib_uverbs ib_srpt ib_isert iscsi_target_mod
target_core_mod ib_iser i2c_dev ib_umad rdma_cm ib_ipoib iw_cm ib_cm libiscsi ib_core scsi_transport_iscsi aes_neon_bs > [34283.824233] CPU: 33 UID: 0 PID: 16112 Comm: kexec Tainted: G W 6.17.8-200.fc42.aarch64 #1 PREEMPT(voluntary) > [34283.836355] Tainted: [W]=WARN > [34283.839684] Hardware name: CRAY CS500/CMUD , BIOS 1.4.0 Jun 17 2020 > [34283.846903] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [34283.854243] pc : kimage_map_segment+0x138/0x190 > [34283.859120] lr : kimage_map_segment+0x4c/0x190 > [34283.863920] sp : ffff8000a0643a90 > [34283.867394] x29: ffff8000a0643a90 x28: ffff800083d0a000 x27: 0000000000000000 > [34283.874901] x26: 0000aaaad722d4b0 x25: 000000000000008f x24: ffff800083d0a000 > [34283.882608] x23: 0000000000000001 x22: 00000000ff206000 x21: 00000000ff207000 > [34283.890305] x20: ffff008fbd306980 x19: ffff008f895d6400 x18: 00000000fffffff9 > [34283.897815] x17: 303d6d656d206539 x16: 3378303d7a736675 x15: 646565732d676e72 > [34283.905516] x14: 00646565732d726c x13: 616d692c78756e69 x12: 6c00636578656b2d > [34283.912999] x11: 007265666675622d x10: 636578656b2d616d x9 : ffff80008050b73c > [34283.920691] x8 : 0001000000000000 x7 : 0000000000000000 x6 : 0000000080000000 > [34283.928197] x5 : 0000000084cb0000 x4 : ffff008fbd2306b0 x3 : ffff008fbd305000 > [34283.935898] x2 : fffffff7ff000000 x1 : 0000000000000004 x0 : ffff800082046000 > [34283.943603] Call trace: > [34283.946039] kimage_map_segment+0x138/0x190 (P) > [34283.950935] ima_kexec_post_load+0x58/0xc0 > [34283.955225] __do_sys_kexec_file_load+0x2b8/0x398 > [34283.960279] __arm64_sys_kexec_file_load+0x28/0x40 > [34283.965965] invoke_syscall.constprop.0+0x64/0xe8 > [34283.971025] el0_svc_common.constprop.0+0x40/0xe8 > [34283.975883] do_el0_svc+0x24/0x38 > [34283.979361] el0_svc+0x3c/0x168 > [34283.982833] el0t_64_sync_handler+0xa0/0xf0 > [34283.987176] el0t_64_sync+0x1b0/0x1b8 > [34283.991000] ---[ end trace 0000000000000000 ]--- > [34283.996060] ------------[ cut here 
]------------ > [34283.996064] WARNING: CPU: 33 PID: 16112 at mm/vmalloc.c:538 vmap_pages_pte_range+0x2bc/0x3c0 > [34284.010006] Modules linked in: rfkill vfat fat ipmi_ssif igb acpi_ipmi ipmi_si ipmi_devintf mlx5_fwctl i2c_algo_bit ipmi_msghandler fwctl fuse loop nfnetlink zram lz4hc_compress lz4_compress xfs mlx5_ib macsec mlx5_core nvme nvme_core mlxfw psample tls nvme_keyring nvme_auth pci_hyperv_intf sbsa_gwdt rpcrdma sunrpc rdma_ucm ib_uverbs ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser i2c_dev ib_umad rdma_cm ib_ipoib iw_cm ib_cm libiscsi ib_core scsi_transport_iscsi aes_neon_bs > [34284.055630] CPU: 33 UID: 0 PID: 16112 Comm: kexec Tainted: G W 6.17.8-200.fc42.aarch64 #1 PREEMPT(voluntary) > [34284.067701] Tainted: [W]=WARN > [34284.070833] Hardware name: CRAY CS500/CMUD , BIOS 1.4.0 Jun 17 2020 > [34284.078238] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [34284.085546] pc : vmap_pages_pte_range+0x2bc/0x3c0 > [34284.090607] lr : vmap_small_pages_range_noflush+0x16c/0x298 > [34284.096528] sp : ffff8000a0643940 > [34284.100001] x29: ffff8000a0643940 x28: 0000000000000000 x27: ffff800084f76000 > [34284.107699] x26: fffffdffc0000000 x25: ffff8000a06439d0 x24: ffff800082046000 > [34284.115174] x23: ffff800084f75000 x22: ffff007f80337ba8 x21: 03ffffffffffffc0 > [34284.122821] x20: ffff008fbd306980 x19: ffff8000a06439d4 x18: 00000000fffffff9 > [34284.130331] x17: 303d6d656d206539 x16: 3378303d7a736675 x15: 646565732d676e72 > [34284.138032] x14: 0000000000004000 x13: ffff009781307130 x12: 0000000000002000 > [34284.145733] x11: 0000000000000000 x10: 0000000000000001 x9 : ffff8000804e197c > [34284.153248] x8 : 0000000000000027 x7 : ffff800085175000 x6 : ffff8000a06439d4 > [34284.160944] x5 : ffff8000a06439d0 x4 : ffff008fbd306980 x3 : 0068000000000f03 > [34284.168449] x2 : ffff007f80337ba8 x1 : 0000000000000000 x0 : 0000000000000000 > [34284.176150] Call trace: > [34284.178768] vmap_pages_pte_range+0x2bc/0x3c0 (P) > [34284.183665] 
vmap_small_pages_range_noflush+0x16c/0x298 > [34284.189264] vmap+0xb4/0x138 > [34284.192312] kimage_map_segment+0xdc/0x190 > [34284.196794] ima_kexec_post_load+0x58/0xc0 > [34284.201044] __do_sys_kexec_file_load+0x2b8/0x398 > [34284.206107] __arm64_sys_kexec_file_load+0x28/0x40 > [34284.211254] invoke_syscall.constprop.0+0x64/0xe8 > [34284.216139] el0_svc_common.constprop.0+0x40/0xe8 > [34284.221196] do_el0_svc+0x24/0x38 > [34284.224678] el0_svc+0x3c/0x168 > [34284.227983] el0t_64_sync_handler+0xa0/0xf0 > [34284.232526] el0t_64_sync+0x1b0/0x1b8 > [34284.236376] ---[ end trace 0000000000000000 ]--- > [34284.241412] kexec_core: Could not map ima buffer. > [34284.241421] ima: Could not map measurements buffer. > [34284.551336] machine_kexec_post_load:155: > [34284.551354] kexec kimage info: > [34284.551366] type: 0 > [34284.551373] head: 90363f9002 > [34284.551377] kern_reloc: 0x00000090363f7000 > [34284.551381] el2_vectors: 0x0000000000000000 > [34284.551384] kexec_file: kexec_file_load: type:0, start:0x80400000 head:0x90363f9002 flags:0x8 > ==================== > > > > > Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") > > Signed-off-by: Pingfan Liu > > Cc: Andrew Morton > > Cc: Baoquan He > > Cc: Mimi Zohar > > Cc: Roberto Sassu > > Cc: Alexander Graf > > Cc: Steven Chen > > Cc: > > To: kexec at lists.infradead.org > > To: linux-integrity at vger.kernel.org > > --- > > include/linux/kexec.h | 4 ++-- > > kernel/kexec_core.c | 9 ++++++--- > > security/integrity/ima/ima_kexec.c | 4 +--- > > 3 files changed, 9 insertions(+), 8 deletions(-) > > > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > > index ff7e231b0485..8a22bc9b8c6c 100644 > > --- a/include/linux/kexec.h > > +++ b/include/linux/kexec.h > > @@ -530,7 +530,7 @@ extern bool kexec_file_dbg_print; > > #define kexec_dprintk(fmt, arg...) 
\ > > do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0) > > > > -extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size); > > +extern void *kimage_map_segment(struct kimage *image, int idx); > > extern void kimage_unmap_segment(void *buffer); > > #else /* !CONFIG_KEXEC_CORE */ > > struct pt_regs; > > @@ -540,7 +540,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { } > > static inline void crash_kexec(struct pt_regs *regs) { } > > static inline int kexec_should_crash(struct task_struct *p) { return 0; } > > static inline int kexec_crash_loaded(void) { return 0; } > > -static inline void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size) > > +static inline void *kimage_map_segment(struct kimage *image, int idx) > > { return NULL; } > > static inline void kimage_unmap_segment(void *buffer) { } > > #define kexec_in_progress false > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > > index fa00b239c5d9..9a1966207041 100644 > > --- a/kernel/kexec_core.c > > +++ b/kernel/kexec_core.c > > @@ -960,17 +960,20 @@ int kimage_load_segment(struct kimage *image, int idx) > > return result; > > } > > > > -void *kimage_map_segment(struct kimage *image, > > - unsigned long addr, unsigned long size) > > +void *kimage_map_segment(struct kimage *image, int idx) > > { > > + unsigned long addr, size, eaddr; > > unsigned long src_page_addr, dest_page_addr = 0; > > - unsigned long eaddr = addr + size; > > kimage_entry_t *ptr, entry; > > struct page **src_pages; > > unsigned int npages; > > void *vaddr = NULL; > > int i; > > > > + addr = image->segment[idx].mem; > > + size = image->segment[idx].memsz; > > + eaddr = addr + size; > > + > > /* > > * Collect the source pages and map them in a contiguous VA range. 
> > */
> > diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
> > index 7362f68f2d8b..5beb69edd12f 100644
> > --- a/security/integrity/ima/ima_kexec.c
> > +++ b/security/integrity/ima/ima_kexec.c
> > @@ -250,9 +250,7 @@ void ima_kexec_post_load(struct kimage *image)
> > if (!image->ima_buffer_addr)
> > return;
> >
> > - ima_kexec_buffer = kimage_map_segment(image,
> > - image->ima_buffer_addr,
> > - image->ima_buffer_size);
> > + ima_kexec_buffer = kimage_map_segment(image, image->ima_segment_index);
> > if (!ima_kexec_buffer) {
> > pr_err("Could not map measurements buffer.\n");
> > return;
> > --
> > 2.49.0
> >

From piliu at redhat.com Tue Nov 25 18:30:05 2025
From: piliu at redhat.com (Pingfan Liu)
Date: Wed, 26 Nov 2025 10:30:05 +0800
Subject: [PATCHv2 1/2] kernel/kexec: Change the prototype of kimage_map_segment()
In-Reply-To:
References: <20251106065904.10772-1-piliu@redhat.com>
Message-ID:

On Wed, Nov 26, 2025 at 9:54 AM Baoquan He wrote:
>
> Hi,
>
> On 11/26/25 at 09:10am, Baoquan He wrote:
> > Hi Pingfan,
> >
> > On 11/06/25 at 02:59pm, Pingfan Liu wrote:
> > > The kexec segment index will be required to extract the corresponding
> > > information for that segment in kimage_map_segment(). Additionally,
> > > kexec_segment already holds the kexec relocation destination address and
> > > size. Therefore, the prototype of kimage_map_segment() can be changed.
> >
> > Because no cover letter, I just reply here.
> >
> > I am testing code of (tag: next-20251125, next/master) on arm64 system.
> > I saw your two patches are already in there. When I used kexec reboot
> > as below, I still got the warning message during ima_kexec_post_load()
> > invocation.
>

I ran into this warning on the platform "NVIDIA Jetson Orin Nano". I just
got control of this machine and have an opportunity to decode its dtb.
I think the following section is critical to reproduce this issue:

reserved-memory {
	#address-cells = <0x02>;
	#size-cells = <0x02>;
	ranges;

	linux,cma {
		linux,cma-default;
		alignment = <0x00 0x10000>;
		compatible = "shared-dma-pool";
		size = <0x00 0x10000000>;
		status = "okay";
		reusable;
	};

That is weird. I ran a test with (tag: next-20251125, next/master) and
can't see the warning any longer. Once you finish with the machine, I'll
run some tests to check whether the warning comes from the same root cause
on your machine.

> And when I try to turn off cma allocating for kexec buffer, I found
> there's no such flag in user space utility kexec-tools. Since Alexander
> introduced commit 07d24902977e ("kexec: enable CMA based contiguous
> allocation"), but haven't add flag KEXEC_FILE_NO_CMA to kexec-tools, and
> Pingfan you are working to fix the bug, can any of you post patch to
> kexec-tools to add the flag?

OK.

> And flag KEXEC_FILE_FORCE_DTB too, which was introduced in commit f367474b5884
> ("x86/kexec: carry forward the boot DTB on kexec").

I have no idea about KEXEC_FILE_FORCE_DTB for the time being. But I will
see how to handle it properly.

Thanks,

Pingfan

> We only have them in kernel, but there's no chance to specify them,
> what's the meaning to have them?
>
> Thanks
> Baoquan
>
> > ====================
> > kexec -d -l /boot/vmlinuz-6.18.0-rc7-next-20251125 --initrd /boot/initramfs-6.18.0-rc7-next-20251125.img --reuse-cmdline
> > ====================
> >
> > ====================
> > [34283.657670] kexec_file: kernel: 000000006cf71829 kernel_size: 0x48b0000
> > [34283.657700] PEFILE: Unsigned PE binary
> > [34283.676597] ima: kexec measurement buffer for the loaded kernel at 0xff206000.
> > [34283.676621] kexec_file: Loaded initrd at 0x84cb0000 bufsz=0x25ec426 memsz=0x25ed000 > > [34283.684646] kexec_file: Loaded dtb at 0xff400000 bufsz=0x39e memsz=0x1000 > > [34283.684653] kexec_file(Image): Loaded kernel at 0x80400000 bufsz=0x48b0000 memsz=0x48b0000 > > [34283.684663] kexec_file: nr_segments = 4 > > [34283.684666] kexec_file: segment[0]: buf=0x0000000000000000 bufsz=0x0 mem=0xff206000 memsz=0x1000 > > [34283.684674] kexec_file: segment[1]: buf=0x000000006cf71829 bufsz=0x48b0000 mem=0x80400000 memsz=0x48b0000 > > [34283.725987] kexec_file: segment[2]: buf=0x00000000c7369de6 bufsz=0x25ec426 mem=0x84cb0000 memsz=0x25ed000 > > [34283.747670] kexec_file: segmen > > ** replaying previous printk message ** > > [34283.747670] kexec_file: segment[3]: buf=0x00000000d83b530b bufsz=0x39e mem=0xff400000 memsz=0x1000 > > [34283.747973] ------------[ cut here ]------------ > > [34283.747976] WARNING: CPU: 33 PID: 16112 at kernel/kexec_core.c:1002 kimage_map_segment+0x138/0x190 > > [34283.778574] Modules linked in: rfkill vfat fat ipmi_ssif igb acpi_ipmi ipmi_si ipmi_devintf mlx5_fwctl i2c_algo_bit ipmi_msghandler fwctl fuse loop nfnetlink zram lz4hc_compress lz4_compress xfs mlx5_ib macsec mlx5_core nvme nvme_core mlxfw psample tls nvme_keyring nvme_auth pci_hyperv_intf sbsa_gwdt rpcrdma sunrpc rdma_ucm ib_uverbs ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser i2c_dev ib_umad rdma_cm ib_ipoib iw_cm ib_cm libiscsi ib_core scsi_transport_iscsi aes_neon_bs > > [34283.824233] CPU: 33 UID: 0 PID: 16112 Comm: kexec Tainted: G W 6.17.8-200.fc42.aarch64 #1 PREEMPT(voluntary) > > [34283.836355] Tainted: [W]=WARN > > [34283.839684] Hardware name: CRAY CS500/CMUD , BIOS 1.4.0 Jun 17 2020 > > [34283.846903] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > > [34283.854243] pc : kimage_map_segment+0x138/0x190 > > [34283.859120] lr : kimage_map_segment+0x4c/0x190 > > [34283.863920] sp : ffff8000a0643a90 > > [34283.867394] x29: ffff8000a0643a90 x28: 
ffff800083d0a000 x27: 0000000000000000 > > [34283.874901] x26: 0000aaaad722d4b0 x25: 000000000000008f x24: ffff800083d0a000 > > [34283.882608] x23: 0000000000000001 x22: 00000000ff206000 x21: 00000000ff207000 > > [34283.890305] x20: ffff008fbd306980 x19: ffff008f895d6400 x18: 00000000fffffff9 > > [34283.897815] x17: 303d6d656d206539 x16: 3378303d7a736675 x15: 646565732d676e72 > > [34283.905516] x14: 00646565732d726c x13: 616d692c78756e69 x12: 6c00636578656b2d > > [34283.912999] x11: 007265666675622d x10: 636578656b2d616d x9 : ffff80008050b73c > > [34283.920691] x8 : 0001000000000000 x7 : 0000000000000000 x6 : 0000000080000000 > > [34283.928197] x5 : 0000000084cb0000 x4 : ffff008fbd2306b0 x3 : ffff008fbd305000 > > [34283.935898] x2 : fffffff7ff000000 x1 : 0000000000000004 x0 : ffff800082046000 > > [34283.943603] Call trace: > > [34283.946039] kimage_map_segment+0x138/0x190 (P) > > [34283.950935] ima_kexec_post_load+0x58/0xc0 > > [34283.955225] __do_sys_kexec_file_load+0x2b8/0x398 > > [34283.960279] __arm64_sys_kexec_file_load+0x28/0x40 > > [34283.965965] invoke_syscall.constprop.0+0x64/0xe8 > > [34283.971025] el0_svc_common.constprop.0+0x40/0xe8 > > [34283.975883] do_el0_svc+0x24/0x38 > > [34283.979361] el0_svc+0x3c/0x168 > > [34283.982833] el0t_64_sync_handler+0xa0/0xf0 > > [34283.987176] el0t_64_sync+0x1b0/0x1b8 > > [34283.991000] ---[ end trace 0000000000000000 ]--- > > [34283.996060] ------------[ cut here ]------------ > > [34283.996064] WARNING: CPU: 33 PID: 16112 at mm/vmalloc.c:538 vmap_pages_pte_range+0x2bc/0x3c0 > > [34284.010006] Modules linked in: rfkill vfat fat ipmi_ssif igb acpi_ipmi ipmi_si ipmi_devintf mlx5_fwctl i2c_algo_bit ipmi_msghandler fwctl fuse loop nfnetlink zram lz4hc_compress lz4_compress xfs mlx5_ib macsec mlx5_core nvme nvme_core mlxfw psample tls nvme_keyring nvme_auth pci_hyperv_intf sbsa_gwdt rpcrdma sunrpc rdma_ucm ib_uverbs ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser i2c_dev ib_umad rdma_cm ib_ipoib iw_cm ib_cm 
libiscsi ib_core scsi_transport_iscsi aes_neon_bs > > [34284.055630] CPU: 33 UID: 0 PID: 16112 Comm: kexec Tainted: G W 6.17.8-200.fc42.aarch64 #1 PREEMPT(voluntary) > > [34284.067701] Tainted: [W]=WARN > > [34284.070833] Hardware name: CRAY CS500/CMUD , BIOS 1.4.0 Jun 17 2020 > > [34284.078238] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > > [34284.085546] pc : vmap_pages_pte_range+0x2bc/0x3c0 > > [34284.090607] lr : vmap_small_pages_range_noflush+0x16c/0x298 > > [34284.096528] sp : ffff8000a0643940 > > [34284.100001] x29: ffff8000a0643940 x28: 0000000000000000 x27: ffff800084f76000 > > [34284.107699] x26: fffffdffc0000000 x25: ffff8000a06439d0 x24: ffff800082046000 > > [34284.115174] x23: ffff800084f75000 x22: ffff007f80337ba8 x21: 03ffffffffffffc0 > > [34284.122821] x20: ffff008fbd306980 x19: ffff8000a06439d4 x18: 00000000fffffff9 > > [34284.130331] x17: 303d6d656d206539 x16: 3378303d7a736675 x15: 646565732d676e72 > > [34284.138032] x14: 0000000000004000 x13: ffff009781307130 x12: 0000000000002000 > > [34284.145733] x11: 0000000000000000 x10: 0000000000000001 x9 : ffff8000804e197c > > [34284.153248] x8 : 0000000000000027 x7 : ffff800085175000 x6 : ffff8000a06439d4 > > [34284.160944] x5 : ffff8000a06439d0 x4 : ffff008fbd306980 x3 : 0068000000000f03 > > [34284.168449] x2 : ffff007f80337ba8 x1 : 0000000000000000 x0 : 0000000000000000 > > [34284.176150] Call trace: > > [34284.178768] vmap_pages_pte_range+0x2bc/0x3c0 (P) > > [34284.183665] vmap_small_pages_range_noflush+0x16c/0x298 > > [34284.189264] vmap+0xb4/0x138 > > [34284.192312] kimage_map_segment+0xdc/0x190 > > [34284.196794] ima_kexec_post_load+0x58/0xc0 > > [34284.201044] __do_sys_kexec_file_load+0x2b8/0x398 > > [34284.206107] __arm64_sys_kexec_file_load+0x28/0x40 > > [34284.211254] invoke_syscall.constprop.0+0x64/0xe8 > > [34284.216139] el0_svc_common.constprop.0+0x40/0xe8 > > [34284.221196] do_el0_svc+0x24/0x38 > > [34284.224678] el0_svc+0x3c/0x168 > > [34284.227983] 
el0t_64_sync_handler+0xa0/0xf0 > > [34284.232526] el0t_64_sync+0x1b0/0x1b8 > > [34284.236376] ---[ end trace 0000000000000000 ]--- > > [34284.241412] kexec_core: Could not map ima buffer. > > [34284.241421] ima: Could not map measurements buffer. > > [34284.551336] machine_kexec_post_load:155: > > [34284.551354] kexec kimage info: > > [34284.551366] type: 0 > > [34284.551373] head: 90363f9002 > > [34284.551377] kern_reloc: 0x00000090363f7000 > > [34284.551381] el2_vectors: 0x0000000000000000 > > [34284.551384] kexec_file: kexec_file_load: type:0, start:0x80400000 head:0x90363f9002 flags:0x8 > > ==================== > > > > > > > > Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") > > > Signed-off-by: Pingfan Liu > > > Cc: Andrew Morton > > > Cc: Baoquan He > > > Cc: Mimi Zohar > > > Cc: Roberto Sassu > > > Cc: Alexander Graf > > > Cc: Steven Chen > > > Cc: > > > To: kexec at lists.infradead.org > > > To: linux-integrity at vger.kernel.org > > > --- > > > include/linux/kexec.h | 4 ++-- > > > kernel/kexec_core.c | 9 ++++++--- > > > security/integrity/ima/ima_kexec.c | 4 +--- > > > 3 files changed, 9 insertions(+), 8 deletions(-) > > > > > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > > > index ff7e231b0485..8a22bc9b8c6c 100644 > > > --- a/include/linux/kexec.h > > > +++ b/include/linux/kexec.h > > > @@ -530,7 +530,7 @@ extern bool kexec_file_dbg_print; > > > #define kexec_dprintk(fmt, arg...) 
\ > > > do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0) > > > > > > -extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size); > > > +extern void *kimage_map_segment(struct kimage *image, int idx); > > > extern void kimage_unmap_segment(void *buffer); > > > #else /* !CONFIG_KEXEC_CORE */ > > > struct pt_regs; > > > @@ -540,7 +540,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { } > > > static inline void crash_kexec(struct pt_regs *regs) { } > > > static inline int kexec_should_crash(struct task_struct *p) { return 0; } > > > static inline int kexec_crash_loaded(void) { return 0; } > > > -static inline void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size) > > > +static inline void *kimage_map_segment(struct kimage *image, int idx) > > > { return NULL; } > > > static inline void kimage_unmap_segment(void *buffer) { } > > > #define kexec_in_progress false > > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > > > index fa00b239c5d9..9a1966207041 100644 > > > --- a/kernel/kexec_core.c > > > +++ b/kernel/kexec_core.c > > > @@ -960,17 +960,20 @@ int kimage_load_segment(struct kimage *image, int idx) > > > return result; > > > } > > > > > > -void *kimage_map_segment(struct kimage *image, > > > - unsigned long addr, unsigned long size) > > > +void *kimage_map_segment(struct kimage *image, int idx) > > > { > > > + unsigned long addr, size, eaddr; > > > unsigned long src_page_addr, dest_page_addr = 0; > > > - unsigned long eaddr = addr + size; > > > kimage_entry_t *ptr, entry; > > > struct page **src_pages; > > > unsigned int npages; > > > void *vaddr = NULL; > > > int i; > > > > > > + addr = image->segment[idx].mem; > > > + size = image->segment[idx].memsz; > > > + eaddr = addr + size; > > > + > > > /* > > > * Collect the source pages and map them in a contiguous VA range. 
> > > */ > > > diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c > > > index 7362f68f2d8b..5beb69edd12f 100644 > > > --- a/security/integrity/ima/ima_kexec.c > > > +++ b/security/integrity/ima/ima_kexec.c > > > @@ -250,9 +250,7 @@ void ima_kexec_post_load(struct kimage *image) > > > if (!image->ima_buffer_addr) > > > return; > > > > > > - ima_kexec_buffer = kimage_map_segment(image, > > > - image->ima_buffer_addr, > > > - image->ima_buffer_size); > > > + ima_kexec_buffer = kimage_map_segment(image, image->ima_segment_index); > > > if (!ima_kexec_buffer) { > > > pr_err("Could not map measurements buffer.\n"); > > > return; > > > -- > > > 2.49.0 > > > > > > From piliu at redhat.com Tue Nov 25 20:47:58 2025 From: piliu at redhat.com (Pingfan Liu) Date: Wed, 26 Nov 2025 12:47:58 +0800 Subject: [PATCHv2 1/2] kernel/kexec: Change the prototype of kimage_map_segment() In-Reply-To: References: <20251106065904.10772-1-piliu@redhat.com> Message-ID: On Wed, Nov 26, 2025 at 9:10 AM Baoquan He wrote: > > Hi Pingfan, > > On 11/06/25 at 02:59pm, Pingfan Liu wrote: > > The kexec segment index will be required to extract the corresponding > > information for that segment in kimage_map_segment(). Additionally, > > kexec_segment already holds the kexec relocation destination address and > > size. Therefore, the prototype of kimage_map_segment() can be changed. > > Because no cover letter, I just reply here. > > I am testing code of (tag: next-20251125, next/master) on arm64 system. > I saw your two patches are already in there. When I used kexec reboot > as below, I still got the warning message during ima_kexec_post_load() > invocation. > > ==================== > kexec -d -l /boot/vmlinuz-6.18.0-rc7-next-20251125 --initrd /boot/initramfs-6.18.0-rc7-next-20251125.img --reuse-cmdline > ==================== > Could you share more detail, as I cannot reproduce this issue with (tag: next-20251125, next/master) on a different aarch64 platform either.
I used the default config to compile the kernel and added cma=512M to the kernel command line, so kexec file load can allocate the destination memory directly from the CMA area. # lshw -class system hpe-apollo*** description: System product: CS500 (-) vendor: CRAY version: - serial: - width: 64 bits capabilities: smbios-3.1.1 dmi-3.1.1 smp sve_default_vector_length tagged_addr_disabled configuration: boot=normal chassis=server family=HPC sku=- uuid=8cdb9098-d03f-11e9-8001-2cd444ce8cad # cat /proc/meminfo | grep -i cma CmaTotal: 524288 kB CmaFree: 509856 kB # cd /boot/ # kexec -d -s -l vmlinuz-6.18.0-rc7-next-20251125 --initrd initramfs-6.18.0-rc7-next-20251125.img --reuse-cmdline arch_process_options:179: command_line: root=/dev/mapper/rhel_hpe--apollo80--02--n00-root ro earlycon=pl011,0x1c050000 ip=dhcp crashkernel=2G-4G:406M,4G-64G:470M,64G-:726M rd.lvm.lv=rhel_hpe-apollo80-02-n00/root rd.lvm.lv=rhel_hpe-apollo80-02-n00/swap console=ttyAMA0 cma=512M arch_process_options:181: initrd: initramfs-6.18.0-rc7-next-20251125.img arch_process_options:183: dtb: (null) arch_process_options:186: console: (null) Try gzip decompression. Try LZMA decompression. elf_arm64_probe: Not an ELF executable. image_arm64_probe: Bad arm64 image header. pez_arm64_probe: PROBE. Try gzip decompression. pez_prepare: decompressed size 50790400 pez_prepare: done # cat /proc/meminfo | grep -i cma CmaTotal: 524288 kB CmaFree: 411032 kB CmaFree shrinks, which means kexec_file_load used it, and dmesg shows no warning: [ 167.484064] kexec_file: kernel: 0000000096e14552 kernel_size: 0x3070000 [ 167.484094] PEFILE: Unsigned PE binary [ 167.576003] ima: kexec measurement buffer for the loaded kernel at 0xc1a18000.
[ 167.585054] kexec_file: Loaded initrd at 0xc4b70000 bufsz=0x300f306 memsz=0x3010000 [ 167.593376] kexec_file: Loaded dtb at 0xc7c00000 bufsz=0x5b1 memsz=0x1000 [ 167.593389] kexec_file(Image): Loaded kernel at 0xc1b00000 bufsz=0x3070000 memsz=0x3070000 [ 167.593405] kexec_file: nr_segments = 4 [ 167.593408] kexec_file: segment[0]: buf=0x0000000000000000 bufsz=0x0 mem=0xc1a18000 memsz=0x1000 [ 167.593417] kexec_file: segment[1]: buf=0x0000000096e14552 bufsz=0x3070000 mem=0xc1b00000 memsz=0x3070000 [ 167.610450] kexec_file: segment[2]: buf=0x000000001285672d bufsz=0x300f306 mem=0xc4b70000 memsz=0x3010000 [ 167.627563] kexec_file: segment[3]: buf=0x000000002ef3060d bufsz=0x5b1 mem=0xc7c00000 memsz=0x1000 [ 167.629228] machine_kexec_post_load:119: [ 167.629233] kexec kimage info: [ 167.629236] type: 0 [ 167.629238] head: 4 [ 167.629241] kern_reloc: 0x0000000000000000 [ 167.629245] el2_vectors: 0x0000000000000000 [ 167.629248] kexec_file: kexec_file_load: type:0, start:0xc1b00000 head:0x4 flags:0x8 Thanks, Pingfan > ==================== > [34283.657670] kexec_file: kernel: 000000006cf71829 kernel_size: 0x48b0000 > [34283.657700] PEFILE: Unsigned PE binary > [34283.676597] ima: kexec measurement buffer for the loaded kernel at 0xff206000. 
> [34283.676621] kexec_file: Loaded initrd at 0x84cb0000 bufsz=0x25ec426 memsz=0x25ed000 > [34283.684646] kexec_file: Loaded dtb at 0xff400000 bufsz=0x39e memsz=0x1000 > [34283.684653] kexec_file(Image): Loaded kernel at 0x80400000 bufsz=0x48b0000 memsz=0x48b0000 > [34283.684663] kexec_file: nr_segments = 4 > [34283.684666] kexec_file: segment[0]: buf=0x0000000000000000 bufsz=0x0 mem=0xff206000 memsz=0x1000 > [34283.684674] kexec_file: segment[1]: buf=0x000000006cf71829 bufsz=0x48b0000 mem=0x80400000 memsz=0x48b0000 > [34283.725987] kexec_file: segment[2]: buf=0x00000000c7369de6 bufsz=0x25ec426 mem=0x84cb0000 memsz=0x25ed000 > [34283.747670] kexec_file: segmen > ** replaying previous printk message ** > [34283.747670] kexec_file: segment[3]: buf=0x00000000d83b530b bufsz=0x39e mem=0xff400000 memsz=0x1000 > [34283.747973] ------------[ cut here ]------------ > [34283.747976] WARNING: CPU: 33 PID: 16112 at kernel/kexec_core.c:1002 kimage_map_segment+0x138/0x190 > [34283.778574] Modules linked in: rfkill vfat fat ipmi_ssif igb acpi_ipmi ipmi_si ipmi_devintf mlx5_fwctl i2c_algo_bit ipmi_msghandler fwctl fuse loop nfnetlink zram lz4hc_compress lz4_compress xfs mlx5_ib macsec mlx5_core nvme nvme_core mlxfw psample tls nvme_keyring nvme_auth pci_hyperv_intf sbsa_gwdt rpcrdma sunrpc rdma_ucm ib_uverbs ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser i2c_dev ib_umad rdma_cm ib_ipoib iw_cm ib_cm libiscsi ib_core scsi_transport_iscsi aes_neon_bs > [34283.824233] CPU: 33 UID: 0 PID: 16112 Comm: kexec Tainted: G W 6.17.8-200.fc42.aarch64 #1 PREEMPT(voluntary) > [34283.836355] Tainted: [W]=WARN > [34283.839684] Hardware name: CRAY CS500/CMUD , BIOS 1.4.0 Jun 17 2020 > [34283.846903] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [34283.854243] pc : kimage_map_segment+0x138/0x190 > [34283.859120] lr : kimage_map_segment+0x4c/0x190 > [34283.863920] sp : ffff8000a0643a90 > [34283.867394] x29: ffff8000a0643a90 x28: ffff800083d0a000 x27: 0000000000000000 > 
[34283.874901] x26: 0000aaaad722d4b0 x25: 000000000000008f x24: ffff800083d0a000 > [34283.882608] x23: 0000000000000001 x22: 00000000ff206000 x21: 00000000ff207000 > [34283.890305] x20: ffff008fbd306980 x19: ffff008f895d6400 x18: 00000000fffffff9 > [34283.897815] x17: 303d6d656d206539 x16: 3378303d7a736675 x15: 646565732d676e72 > [34283.905516] x14: 00646565732d726c x13: 616d692c78756e69 x12: 6c00636578656b2d > [34283.912999] x11: 007265666675622d x10: 636578656b2d616d x9 : ffff80008050b73c > [34283.920691] x8 : 0001000000000000 x7 : 0000000000000000 x6 : 0000000080000000 > [34283.928197] x5 : 0000000084cb0000 x4 : ffff008fbd2306b0 x3 : ffff008fbd305000 > [34283.935898] x2 : fffffff7ff000000 x1 : 0000000000000004 x0 : ffff800082046000 > [34283.943603] Call trace: > [34283.946039] kimage_map_segment+0x138/0x190 (P) > [34283.950935] ima_kexec_post_load+0x58/0xc0 > [34283.955225] __do_sys_kexec_file_load+0x2b8/0x398 > [34283.960279] __arm64_sys_kexec_file_load+0x28/0x40 > [34283.965965] invoke_syscall.constprop.0+0x64/0xe8 > [34283.971025] el0_svc_common.constprop.0+0x40/0xe8 > [34283.975883] do_el0_svc+0x24/0x38 > [34283.979361] el0_svc+0x3c/0x168 > [34283.982833] el0t_64_sync_handler+0xa0/0xf0 > [34283.987176] el0t_64_sync+0x1b0/0x1b8 > [34283.991000] ---[ end trace 0000000000000000 ]--- > [34283.996060] ------------[ cut here ]------------ > [34283.996064] WARNING: CPU: 33 PID: 16112 at mm/vmalloc.c:538 vmap_pages_pte_range+0x2bc/0x3c0 > [34284.010006] Modules linked in: rfkill vfat fat ipmi_ssif igb acpi_ipmi ipmi_si ipmi_devintf mlx5_fwctl i2c_algo_bit ipmi_msghandler fwctl fuse loop nfnetlink zram lz4hc_compress lz4_compress xfs mlx5_ib macsec mlx5_core nvme nvme_core mlxfw psample tls nvme_keyring nvme_auth pci_hyperv_intf sbsa_gwdt rpcrdma sunrpc rdma_ucm ib_uverbs ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser i2c_dev ib_umad rdma_cm ib_ipoib iw_cm ib_cm libiscsi ib_core scsi_transport_iscsi aes_neon_bs > [34284.055630] CPU: 33 UID: 0 PID: 16112 
Comm: kexec Tainted: G W 6.17.8-200.fc42.aarch64 #1 PREEMPT(voluntary) > [34284.067701] Tainted: [W]=WARN > [34284.070833] Hardware name: CRAY CS500/CMUD , BIOS 1.4.0 Jun 17 2020 > [34284.078238] pstate: 40400009 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [34284.085546] pc : vmap_pages_pte_range+0x2bc/0x3c0 > [34284.090607] lr : vmap_small_pages_range_noflush+0x16c/0x298 > [34284.096528] sp : ffff8000a0643940 > [34284.100001] x29: ffff8000a0643940 x28: 0000000000000000 x27: ffff800084f76000 > [34284.107699] x26: fffffdffc0000000 x25: ffff8000a06439d0 x24: ffff800082046000 > [34284.115174] x23: ffff800084f75000 x22: ffff007f80337ba8 x21: 03ffffffffffffc0 > [34284.122821] x20: ffff008fbd306980 x19: ffff8000a06439d4 x18: 00000000fffffff9 > [34284.130331] x17: 303d6d656d206539 x16: 3378303d7a736675 x15: 646565732d676e72 > [34284.138032] x14: 0000000000004000 x13: ffff009781307130 x12: 0000000000002000 > [34284.145733] x11: 0000000000000000 x10: 0000000000000001 x9 : ffff8000804e197c > [34284.153248] x8 : 0000000000000027 x7 : ffff800085175000 x6 : ffff8000a06439d4 > [34284.160944] x5 : ffff8000a06439d0 x4 : ffff008fbd306980 x3 : 0068000000000f03 > [34284.168449] x2 : ffff007f80337ba8 x1 : 0000000000000000 x0 : 0000000000000000 > [34284.176150] Call trace: > [34284.178768] vmap_pages_pte_range+0x2bc/0x3c0 (P) > [34284.183665] vmap_small_pages_range_noflush+0x16c/0x298 > [34284.189264] vmap+0xb4/0x138 > [34284.192312] kimage_map_segment+0xdc/0x190 > [34284.196794] ima_kexec_post_load+0x58/0xc0 > [34284.201044] __do_sys_kexec_file_load+0x2b8/0x398 > [34284.206107] __arm64_sys_kexec_file_load+0x28/0x40 > [34284.211254] invoke_syscall.constprop.0+0x64/0xe8 > [34284.216139] el0_svc_common.constprop.0+0x40/0xe8 > [34284.221196] do_el0_svc+0x24/0x38 > [34284.224678] el0_svc+0x3c/0x168 > [34284.227983] el0t_64_sync_handler+0xa0/0xf0 > [34284.232526] el0t_64_sync+0x1b0/0x1b8 > [34284.236376] ---[ end trace 0000000000000000 ]--- > [34284.241412] kexec_core: Could not map 
ima buffer. > [34284.241421] ima: Could not map measurements buffer. > [34284.551336] machine_kexec_post_load:155: > [34284.551354] kexec kimage info: > [34284.551366] type: 0 > [34284.551373] head: 90363f9002 > [34284.551377] kern_reloc: 0x00000090363f7000 > [34284.551381] el2_vectors: 0x0000000000000000 > [34284.551384] kexec_file: kexec_file_load: type:0, start:0x80400000 head:0x90363f9002 flags:0x8 > ==================== > > > > > Fixes: 07d24902977e ("kexec: enable CMA based contiguous allocation") > > Signed-off-by: Pingfan Liu > > Cc: Andrew Morton > > Cc: Baoquan He > > Cc: Mimi Zohar > > Cc: Roberto Sassu > > Cc: Alexander Graf > > Cc: Steven Chen > > Cc: > > To: kexec at lists.infradead.org > > To: linux-integrity at vger.kernel.org > > --- > > include/linux/kexec.h | 4 ++-- > > kernel/kexec_core.c | 9 ++++++--- > > security/integrity/ima/ima_kexec.c | 4 +--- > > 3 files changed, 9 insertions(+), 8 deletions(-) > > > > diff --git a/include/linux/kexec.h b/include/linux/kexec.h > > index ff7e231b0485..8a22bc9b8c6c 100644 > > --- a/include/linux/kexec.h > > +++ b/include/linux/kexec.h > > @@ -530,7 +530,7 @@ extern bool kexec_file_dbg_print; > > #define kexec_dprintk(fmt, arg...) 
\ > > do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0) > > > > -extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size); > > +extern void *kimage_map_segment(struct kimage *image, int idx); > > extern void kimage_unmap_segment(void *buffer); > > #else /* !CONFIG_KEXEC_CORE */ > > struct pt_regs; > > @@ -540,7 +540,7 @@ static inline void __crash_kexec(struct pt_regs *regs) { } > > static inline void crash_kexec(struct pt_regs *regs) { } > > static inline int kexec_should_crash(struct task_struct *p) { return 0; } > > static inline int kexec_crash_loaded(void) { return 0; } > > -static inline void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size) > > +static inline void *kimage_map_segment(struct kimage *image, int idx) > > { return NULL; } > > static inline void kimage_unmap_segment(void *buffer) { } > > #define kexec_in_progress false > > diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c > > index fa00b239c5d9..9a1966207041 100644 > > --- a/kernel/kexec_core.c > > +++ b/kernel/kexec_core.c > > @@ -960,17 +960,20 @@ int kimage_load_segment(struct kimage *image, int idx) > > return result; > > } > > > > -void *kimage_map_segment(struct kimage *image, > > - unsigned long addr, unsigned long size) > > +void *kimage_map_segment(struct kimage *image, int idx) > > { > > + unsigned long addr, size, eaddr; > > unsigned long src_page_addr, dest_page_addr = 0; > > - unsigned long eaddr = addr + size; > > kimage_entry_t *ptr, entry; > > struct page **src_pages; > > unsigned int npages; > > void *vaddr = NULL; > > int i; > > > > + addr = image->segment[idx].mem; > > + size = image->segment[idx].memsz; > > + eaddr = addr + size; > > + > > /* > > * Collect the source pages and map them in a contiguous VA range. 
> > */ > > diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c > > index 7362f68f2d8b..5beb69edd12f 100644 > > --- a/security/integrity/ima/ima_kexec.c > > +++ b/security/integrity/ima/ima_kexec.c > > @@ -250,9 +250,7 @@ void ima_kexec_post_load(struct kimage *image) > > if (!image->ima_buffer_addr) > > return; > > > > - ima_kexec_buffer = kimage_map_segment(image, > > - image->ima_buffer_addr, > > - image->ima_buffer_size); > > + ima_kexec_buffer = kimage_map_segment(image, image->ima_segment_index); > > if (!ima_kexec_buffer) { > > pr_err("Could not map measurements buffer.\n"); > > return; > > -- > > 2.49.0 > > > From rppt at kernel.org Tue Nov 25 22:14:38 2025 From: rppt at kernel.org (Mike Rapoport) Date: Wed, 26 Nov 2025 08:14:38 +0200 Subject: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for memory below 1M In-Reply-To: References: <20250509074635.3187114-1-changyuanl@google.com> <20250509074635.3187114-13-changyuanl@google.com> Message-ID: On Tue, Nov 25, 2025 at 06:47:15PM +0000, Usama Arif wrote: > > > On 25/11/2025 13:50, Mike Rapoport wrote: > > Hi, > > > > On Tue, Nov 25, 2025 at 02:15:34PM +0100, Pratyush Yadav wrote: > >> On Mon, Nov 24 2025, Usama Arif wrote: > > > >>>> --- a/arch/x86/realmode/init.c > >>>> +++ b/arch/x86/realmode/init.c > >>>> @@ -65,6 +65,8 @@ void __init reserve_real_mode(void) > >>>> * setup_arch(). > >>>> */ > >>>> memblock_reserve(0, SZ_1M); > >>>> + > >>>> + memblock_clear_kho_scratch(0, SZ_1M); > >>>> } > >>>> > >>>> static void __init sme_sev_setup_real_mode(struct trampoline_header *th) > >>> > >>> Hello! > >>> > >>> I am working with Breno who reported that we are seeing the below warning at boot > >>> when rolling out 6.16 in Meta fleet. It is difficult to reproduce on a single host > >>> manually but we are seeing this several times a day inside the fleet. 
> >>> > >>> 20:16:33 ------------[ cut here ]------------ > >>> 20:16:33 WARNING: CPU: 0 PID: 0 at mm/memblock.c:668 memblock_add_range+0x316/0x330 > >>> 20:16:33 Modules linked in: > >>> 20:16:33 CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G S 6.16.1-0_fbk0_0_gc0739ee5037a #1 NONE > >>> 20:16:33 Tainted: [S]=CPU_OUT_OF_SPEC > >>> 20:16:33 RIP: 0010:memblock_add_range+0x316/0x330 > >>> 20:16:33 Code: ff ff ff 89 5c 24 08 41 ff c5 44 89 6c 24 10 48 63 74 24 08 48 63 54 24 10 e8 26 0c 00 00 e9 41 ff ff ff 0f 0b e9 af fd ff ff <0f> 0b e9 b7 fd ff ff 0f 0b 0f 0b cc cc cc cc cc cc cc cc cc cc cc > >>> 20:16:33 RSP: 0000:ffffffff83403dd8 EFLAGS: 00010083 ORIG_RAX: 0000000000000000 > >>> 20:16:33 RAX: ffffffff8476ff90 RBX: 0000000000001c00 RCX: 0000000000000002 > >>> 20:16:33 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff83bad4d8 > >>> 20:16:33 RBP: 000000000009f000 R08: 0000000000000020 R09: 8000000000097101 > >>> 20:16:33 R10: ffffffffff2004b0 R11: 203a6d6f646e6172 R12: 000000000009ec00 > >>> 20:16:33 R13: 0000000000000002 R14: 0000000000100000 R15: 000000000009d000 > >>> 20:16:33 FS: 0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000 > >>> 20:16:33 CR2: ffff888065413ff8 CR3: 00000000663b7000 CR4: 00000000000000b0 > >>> 20:16:33 Call Trace: > >>> 20:16:33 > >>> 20:16:33 ? __memblock_reserve+0x75/0x80 > > > > Do you have faddr2line for this? > > >>> 20:16:33 ? setup_arch+0x30f/0xb10 > > > > And this? > > > > > Thanks for this! I think it helped narrow down the problem. > > The stack is: > > 20:16:33 ? __memblock_reserve (mm/memblock.c:936) > 20:16:33 ? setup_arch (arch/x86/kernel/setup.c:413 arch/x86/kernel/setup.c:499 arch/x86/kernel/setup.c:956) > 20:16:33 ? start_kernel (init/main.c:922) > 20:16:33 ? x86_64_start_reservations (arch/x86/kernel/ebda.c:57) > 20:16:33 ? x86_64_start_kernel (arch/x86/kernel/head64.c:231) > 20:16:33 ? common_startup_64 (arch/x86/kernel/head_64.S:419) > > This is 6.16 kernel. > > 20:16:33 ? 
__memblock_reserve (mm/memblock.c:936) > That's the memblock_add_range call in memblock_reserve > > 20:16:33 ? setup_arch (arch/x86/kernel/setup.c:413 arch/x86/kernel/setup.c:499 arch/x86/kernel/setup.c:956) > That is parse_setup_data -> add_early_ima_buffer -> memblock_reserve_kern > > > I put a simple print like below: > > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c > index 680d1b6dfea41..cc97ffc0083c7 100644 > --- a/arch/x86/kernel/setup.c > +++ b/arch/x86/kernel/setup.c > @@ -409,6 +409,7 @@ static void __init add_early_ima_buffer(u64 phys_addr) > } > > if (data->size) { > + pr_err("PPP %s %s %d data->addr %llx, data->size %llx \n", __FILE__, __func__, __LINE__, data->addr, data->size); > memblock_reserve_kern(data->addr, data->size); > ima_kexec_buffer_phys = data->addr; > ima_kexec_buffer_size = data->size; > > > and I see (without replicating the warning): > > [ 0.000000] PPP arch/x86/kernel/setup.c add_early_ima_buffer 412 data->addr 9e000, data->size 1000 > .... So it looks like in cases when the warning reproduces there's something that reserves memory overlapping with the IMA buffer before add_early_ima_buffer(). > > [ 0.000348] MEMBLOCK configuration: > [ 0.000348] memory size = 0x0000003fea329ff0 reserved size = 0x00000000050c969b > [ 0.000350] memory.cnt = 0x5 > [ 0.000351] memory[0x0] [0x0000000000001000-0x000000000009ffff], 0x000000000009f000 bytes flags: 0x40 > [ 0.000353] memory[0x1] [0x0000000000100000-0x0000000067c65fff], 0x0000000067b66000 bytes flags: 0x0 > [ 0.000355] memory[0x2] [0x000000006d8db000-0x000000006fffffff], 0x0000000002725000 bytes flags: 0x0 > [ 0.000356] memory[0x3] [0x0000000100000000-0x000000407fff8fff], 0x0000003f7fff9000 bytes flags: 0x0 > [ 0.000358] memory[0x4] [0x000000407fffa000-0x000000407fffffff], 0x0000000000006000 bytes flags: 0x0 > [ 0.000359] reserved.cnt = 0x7 > > > So MEMBLOCK_RSRV_KERN and MEMBLOCK_KHO_SCRATCH seem to overlap..
It does not matter: they are set on different arrays. RSRV_KERN is set on regions in memblock.reserved and KHO_SCRATCH is set on regions in memblock.memory. So dumping memblock.memory is completely irrelevant; you need to check memblock.reserved for potential conflicts. > >>> 20:16:33 ? start_kernel+0x58/0x960 > >>> 20:16:33 ? x86_64_start_reservations+0x20/0x20 > >>> 20:16:33 ? x86_64_start_kernel+0x13d/0x140 > >>> 20:16:33 ? common_startup_64+0x13e/0x140 > >>> > >>> 20:16:33 ---[ end trace 0000000000000000 ]--- > >>> > >>> > >>> Rolling out with memblock=debug is not really an option in a large scale fleet due to the > >>> time added to boot. But I did try on one of the hosts (without reproducing the issue) and I see: > > Is it a problem to roll out a kernel that has additional debug printouts as > > Breno suggested earlier? I.e. > > if (flags != MEMBLOCK_NONE && flags != rgn->flags) { > > pr_warn("memblock: Flag mismatch at region [%pa-%pa]\n", > > &rgn->base, &rend); > > pr_warn(" Existing region flags: %#x\n", rgn->flags); > > pr_warn(" New range flags: %#x\n", flags); > > pr_warn(" New range: [%pa-%pa]\n", &base, &end); > > WARN_ON_ONCE(1); > > } > > I can add this, but the only thing is that it might be several weeks between me putting this in the > kernel and that kernel being deployed to enough machines that it starts to show up. I think the IMA coinciding > with memblock_mark_kho_scratch in e820__memblock_setup could be the reason for the warning. It might be better to > fix that case and deploy it to see if the warnings still show up? > I can add these prints as well in case it doesn't fix the problem. I really don't think that effectively disabling memblock_mark_kho_scratch() when KHO is disabled will solve the problem because, as I said, the flags it sets are on a different structure than the flags set by memblock_reserve_kern().
> > If you have the logs from failing boots up to the point where SLUB reports > > about its initialization, e.g. > > > > [ 0.134377] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1 > > > > something there may hint about what's the issue. > > So the boot doesn't fail, it's just giving warnings in the fleet. > I have added the dmesg to the end of the mail. Thanks, unfortunately nothing jumped at me there. > Does something like this look good? I can try deploying this (although it will take some time to find out). > We can get it upstream as well as that makes backports easier. > > diff --git a/mm/memblock.c b/mm/memblock.c > index 154f1d73b61f2..257c6f0eee03d 100644 > --- a/mm/memblock.c > +++ b/mm/memblock.c > @@ -1119,8 +1119,13 @@ int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t > */ > __init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size) > { > - return memblock_setclr_flag(&memblock.memory, base, size, 1, > - MEMBLOCK_KHO_SCRATCH); > +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH > + if (is_kho_boot()) Please use if (IS_ENABLED(CONFIG_MEMBLOCK_KHO_SCRATCH)) instead of an #ifdef. If you send a formal patch with it, I'll take it. I'd suggest still deploying additional debug printouts internally. > + return memblock_setclr_flag(&memblock.memory, base, size, 1, > + MEMBLOCK_KHO_SCRATCH); > +#else > + return 0; > +#endif > } > > /** > @@ -1133,8 +1138,13 @@ __init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size) > */ > __init int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size) > { > - return memblock_setclr_flag(&memblock.memory, base, size, 0, > - MEMBLOCK_KHO_SCRATCH); > +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH > + if (is_kho_boot()) > + return memblock_setclr_flag(&memblock.memory, base, size, 0, > + MEMBLOCK_KHO_SCRATCH); > +#else If nothing sets the flag, _clear is a nop anyway, but let's update it as well for symmetry. -- Sincerely yours, Mike.
From usamaarif642 at gmail.com Tue Nov 25 23:25:48 2025 From: usamaarif642 at gmail.com (Usama Arif) Date: Wed, 26 Nov 2025 07:25:48 +0000 Subject: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for memory below 1M In-Reply-To: References: <20250509074635.3187114-1-changyuanl@google.com> <20250509074635.3187114-13-changyuanl@google.com> Message-ID: On 26/11/2025 06:14, Mike Rapoport wrote: > On Tue, Nov 25, 2025 at 06:47:15PM +0000, Usama Arif wrote: >> >> >> On 25/11/2025 13:50, Mike Rapoport wrote: >>> Hi, >>> >>> On Tue, Nov 25, 2025 at 02:15:34PM +0100, Pratyush Yadav wrote: >>>> On Mon, Nov 24 2025, Usama Arif wrote: >>> >>>>>> --- a/arch/x86/realmode/init.c >>>>>> +++ b/arch/x86/realmode/init.c >>>>>> @@ -65,6 +65,8 @@ void __init reserve_real_mode(void) >>>>>> * setup_arch(). >>>>>> */ >>>>>> memblock_reserve(0, SZ_1M); >>>>>> + >>>>>> + memblock_clear_kho_scratch(0, SZ_1M); >>>>>> } >>>>>> >>>>>> static void __init sme_sev_setup_real_mode(struct trampoline_header *th) >>>>> >>>>> Hello! >>>>> >>>>> I am working with Breno who reported that we are seeing the below warning at boot >>>>> when rolling out 6.16 in Meta fleet. It is difficult to reproduce on a single host >>>>> manually but we are seeing this several times a day inside the fleet. 
>>>>> >>>>> 20:16:33 ------------[ cut here ]------------ >>>>> 20:16:33 WARNING: CPU: 0 PID: 0 at mm/memblock.c:668 memblock_add_range+0x316/0x330 >>>>> 20:16:33 Modules linked in: >>>>> 20:16:33 CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G S 6.16.1-0_fbk0_0_gc0739ee5037a #1 NONE >>>>> 20:16:33 Tainted: [S]=CPU_OUT_OF_SPEC >>>>> 20:16:33 RIP: 0010:memblock_add_range+0x316/0x330 >>>>> 20:16:33 Code: ff ff ff 89 5c 24 08 41 ff c5 44 89 6c 24 10 48 63 74 24 08 48 63 54 24 10 e8 26 0c 00 00 e9 41 ff ff ff 0f 0b e9 af fd ff ff <0f> 0b e9 b7 fd ff ff 0f 0b 0f 0b cc cc cc cc cc cc cc cc cc cc cc >>>>> 20:16:33 RSP: 0000:ffffffff83403dd8 EFLAGS: 00010083 ORIG_RAX: 0000000000000000 >>>>> 20:16:33 RAX: ffffffff8476ff90 RBX: 0000000000001c00 RCX: 0000000000000002 >>>>> 20:16:33 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff83bad4d8 >>>>> 20:16:33 RBP: 000000000009f000 R08: 0000000000000020 R09: 8000000000097101 >>>>> 20:16:33 R10: ffffffffff2004b0 R11: 203a6d6f646e6172 R12: 000000000009ec00 >>>>> 20:16:33 R13: 0000000000000002 R14: 0000000000100000 R15: 000000000009d000 >>>>> 20:16:33 FS: 0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000 >>>>> 20:16:33 CR2: ffff888065413ff8 CR3: 00000000663b7000 CR4: 00000000000000b0 >>>>> 20:16:33 Call Trace: >>>>> 20:16:33 >>>>> 20:16:33 ? __memblock_reserve+0x75/0x80 >>> >>> Do you have faddr2line for this? >>>>>> 20:16:33 ? setup_arch+0x30f/0xb10 >>> >>> And this? >>> >> >> >> Thanks for this! I think it helped narrow down the problem. >> >> The stack is: >> >> 20:16:33 ? __memblock_reserve (mm/memblock.c:936) >> 20:16:33 ? setup_arch (arch/x86/kernel/setup.c:413 arch/x86/kernel/setup.c:499 arch/x86/kernel/setup.c:956) >> 20:16:33 ? start_kernel (init/main.c:922) >> 20:16:33 ? x86_64_start_reservations (arch/x86/kernel/ebda.c:57) >> 20:16:33 ? x86_64_start_kernel (arch/x86/kernel/head64.c:231) >> 20:16:33 ? common_startup_64 (arch/x86/kernel/head_64.S:419) >> >> This is 6.16 kernel. >> >> 20:16:33 ? 
__memblock_reserve (mm/memblock.c:936) >> Thats memblock_add_range call in memblock_reserve >> >> 20:16:33 ? setup_arch (arch/x86/kernel/setup.c:413 arch/x86/kernel/setup.c:499 arch/x86/kernel/setup.c:956) >> That is parse_setup_data -> add_early_ima_buffer -> add_early_ima_buffer -> memblock_reserve_kern >> >> >> I put a simple print like below: >> >> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c >> index 680d1b6dfea41..cc97ffc0083c7 100644 >> --- a/arch/x86/kernel/setup.c >> +++ b/arch/x86/kernel/setup.c >> @@ -409,6 +409,7 @@ static void __init add_early_ima_buffer(u64 phys_addr) >> } >> >> if (data->size) { >> + pr_err("PPP %s %s %d data->addr %llx, data->size %llx \n", __FILE__, __func__, __LINE__, data->addr, data->size); >> memblock_reserve_kern(data->addr, data->size); >> ima_kexec_buffer_phys = data->addr; >> ima_kexec_buffer_size = data->size; >> >> >> and I see (without replicating the warning): >> >> [ 0.000000] PPP arch/x86/kernel/setup.c add_early_ima_buffer 412 data->addr 9e000, data->size 1000 >> .... > > So it looks like in cases when the warning reproduces there's something > that reserves memory overlapping with IMA buffer before > add_early_ima_buffer(). > >> >> [ 0.000348] MEMBLOCK configuration: >> [ 0.000348] memory size = 0x0000003fea329ff0 reserved size = 0x00000000050c969b >> [ 0.000350] memory.cnt = 0x5 >> [ 0.000351] memory[0x0] [0x0000000000001000-0x000000000009ffff], 0x000000000009f000 bytes flags: 0x40 >> [ 0.000353] memory[0x1] [0x0000000000100000-0x0000000067c65fff], 0x0000000067b66000 bytes flags: 0x0 >> [ 0.000355] memory[0x2] [0x000000006d8db000-0x000000006fffffff], 0x0000000002725000 bytes flags: 0x0 >> [ 0.000356] memory[0x3] [0x0000000100000000-0x000000407fff8fff], 0x0000003f7fff9000 bytes flags: 0x0 >> [ 0.000358] memory[0x4] [0x000000407fffa000-0x000000407fffffff], 0x0000000000006000 bytes flags: 0x0 >> [ 0.000359] reserved.cnt = 0x7 >> >> >> So MEMBLOCK_RSRV_KERN and MEMBLOCK_KHO_SCRATCH seem to overlap.. 
> > It does not matter, they are set on different arrays. RSRV_KERN is set on > regions in memblock.reserved and KHO_SCRATCH is set on regions in > memblock.memory. > > So dumping memblock.memory is completely irrelevant, you need to check > memblock.reserved for potential conflicts. > >>>>> 20:16:33 ? start_kernel+0x58/0x960 >>>>> 20:16:33 ? x86_64_start_reservations+0x20/0x20 >>>>> 20:16:33 ? x86_64_start_kernel+0x13d/0x140 >>>>> 20:16:33 ? common_startup_64+0x13e/0x140 >>>>> 20:16:33 >>>>> 20:16:33 ---[ end trace 0000000000000000 ]--- >>>>> >>>>> >>>>> Rolling out with memblock=debug is not really an option in a large scale fleet due to the >>>>> time added to boot. But I did try on one of the hosts (without reproducing the issue) and I see: >>> >>> Is it a problem to roll out a kernel that has additional debug printouts as >>> Breno suggested earlier? I.e. >>> >>> if (flags != MEMBLOCK_NONE && flags != rgn->flags) { >>> pr_warn("memblock: Flag mismatch at region [%pa-%pa]\n", >>> &rgn->base, &rend); >>> pr_warn(" Existing region flags: %#x\n", rgn->flags); >>> pr_warn(" New range flags: %#x\n", flags); >>> pr_warn(" New range: [%pa-%pa]\n", &base, &end); >>> WARN_ON_ONCE(1); >>> } >>> >> >> I can add this, but the only thing is that it might be several weeks between me putting this in the >> kernel and that kernel being deployed to enough machines that it starts to show up. I think the IMA coinciding >> with memblock_mark_kho_scratch in e820__memblock_setup could be the reason for the warning. It might be better to >> fix that case and deploy it to see if the warnings still show up? >> I can add these prints as well incase it doesnt fix the problem. > > I really don't think that effectively disabling memblock_mark_kho_scratch() > when KHO is disabled will solve the problem because as I said the flags it > sets are on different structure than the flags set by > memblock_reserve_kern(). 
> >>> If you have the logs from failing boots up to the point where SLUB reports >>> about its initialization, e.g. >>> >>> [ 0.134377] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1 >>> >>> something there may hint about what's the issue. >> >> So the boot doesn't fail, it's just giving warnings in the fleet. >> I have added the dmesg to the end of the mail. > > Thanks, unfortunately nothing jumped at me there. > >> Does something like this look good? I can try deploying this (although it will take some time to find out). >> We can get it upstream as well as that makes backports easier. >> >> diff --git a/mm/memblock.c b/mm/memblock.c >> index 154f1d73b61f2..257c6f0eee03d 100644 >> --- a/mm/memblock.c >> +++ b/mm/memblock.c >> @@ -1119,8 +1119,13 @@ int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t >> */ >> __init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size) >> { >> - return memblock_setclr_flag(&memblock.memory, base, size, 1, >> - MEMBLOCK_KHO_SCRATCH); >> +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH >> + if (is_kho_boot()) > > Please use > > if (IS_ENABLED(CONFIG_MEMBLOCK_KHO_SCRATCH)) > > instead of an #ifdef. > > If you send a formal patch with it, I'll take it. > I'd suggest still deploying additional debug printouts internally. Thanks! I will add the additional debug prints and [1] in the next release. It will be some time before it makes it into production, so I will try to debug this more using the information you provided above.
[1] https://lore.kernel.org/all/20251126072051.546700-1-usamaarif642 at gmail.com/ > >> + return memblock_setclr_flag(&memblock.memory, base, size, 1, >> + MEMBLOCK_KHO_SCRATCH); >> +#else >> + return 0; >> +#endif >> } >> >> /** >> @@ -1133,8 +1138,13 @@ __init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size) >> */ >> __init int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size) >> { >> - return memblock_setclr_flag(&memblock.memory, base, size, 0, >> - MEMBLOCK_KHO_SCRATCH); >> +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH >> + if (is_kho_boot()) >> + return memblock_setclr_flag(&memblock.memory, base, size, 0, >> + MEMBLOCK_KHO_SCRATCH); >> +#else > > If nothing sets the flag _clear is anyway nop, but let's update it as well > for symmetry. > From maqianga at uniontech.com Wed Nov 26 00:44:24 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Wed, 26 Nov 2025 16:44:24 +0800 Subject: [PATCH v3 0/3] kexec: print out debugging message if required for kexec_load Message-ID: <20251126084427.3222212-1-maqianga@uniontech.com> Overview: ========= The commit a85ee18c7900 ("kexec_file: print out debugging message if required") has added general code printing in kexec_file_load(), but not in kexec_load(). Since kexec_load and kexec_file_load are not triggered simultaneously, we can unify the debug flag of kexec and kexec_file as kexec_dbg_print. Next, we need to do some things in this patchset: 1. rename kexec_file_dbg_print to kexec_dbg_print 2. Add KEXEC_DEBUG 3. Initialize kexec_dbg_print for kexec 4. Fix uninitialized struct kimage *image pointer 5. Set the reset of kexec_dbg_print to kimage_free Testing: ========= I did testing on x86_64, arm64 and loongarch. 
On x86_64, the printed messages look like below: unset CONFIG_KEXEC_FILE: [ 81.502374] kexec: kexec_load: type:0, start:0x23fff7700 head:0x10a4b9002 flags:0x3e0010 set CONFIG_KEXEC_FILE [ 36.774228] kexec_file: kernel: 0000000066c386c8 kernel_size: 0xd78400 [ 36.821814] kexec-bzImage64: Loaded purgatory at 0x23fffb000 [ 36.821826] kexec-bzImage64: Loaded boot_param, command line and misc at 0x23fff9000 bufsz=0x12d0 memsz=0x2000 [ 36.821829] kexec-bzImage64: Loaded 64bit kernel at 0x23d400000 bufsz=0xd73400 memsz=0x2ab7000 [ 36.821918] kexec-bzImage64: Loaded initrd at 0x23bd0b000 bufsz=0x16f40a8 memsz=0x16f40a8 [ 36.821920] kexec-bzImage64: Final command line is: root=/dev/mapper/test-root crashkernel=auto rd.lvm.lv=test/root [ 36.821925] kexec-bzImage64: E820 memmap: [ 36.821926] kexec-bzImage64: 0000000000000000-000000000009ffff (1) [ 36.821928] kexec-bzImage64: 0000000000100000-0000000000811fff (1) [ 36.821930] kexec-bzImage64: 0000000000812000-0000000000812fff (2) [ 36.821931] kexec-bzImage64: 0000000000813000-00000000bee38fff (1) [ 36.821933] kexec-bzImage64: 00000000bee39000-00000000beec2fff (2) [ 36.821934] kexec-bzImage64: 00000000beec3000-00000000bf8ecfff (1) [ 36.821935] kexec-bzImage64: 00000000bf8ed000-00000000bfb6cfff (2) [ 36.821936] kexec-bzImage64: 00000000bfb6d000-00000000bfb7efff (3) [ 36.821937] kexec-bzImage64: 00000000bfb7f000-00000000bfbfefff (4) [ 36.821938] kexec-bzImage64: 00000000bfbff000-00000000bff7bfff (1) [ 36.821939] kexec-bzImage64: 00000000bff7c000-00000000bfffffff (2) [ 36.821940] kexec-bzImage64: 00000000feffc000-00000000feffffff (2) [ 36.821941] kexec-bzImage64: 00000000ffc00000-00000000ffffffff (2) [ 36.821942] kexec-bzImage64: 0000000100000000-000000023fffffff (1) [ 36.872348] kexec_file: nr_segments = 4 [ 36.872356] kexec_file: segment[0]: buf=0x000000005314ece7 bufsz=0x4000 mem=0x23fffb000 memsz=0x5000 [ 36.872370] kexec_file: segment[1]: buf=0x000000006e59b143 bufsz=0x12d0 mem=0x23fff9000 memsz=0x2000 [ 36.872374] 
kexec_file: segment[2]: buf=0x00000000eb7b1fc3 bufsz=0xd73400 mem=0x23d400000 memsz=0x2ab7000 [ 36.882172] kexec_file: segment[3]: buf=0x000000006af76441 bufsz=0x16f40a8 mem=0x23bd0b000 memsz=0x16f5000 [ 36.889113] kexec_file: kexec_file_load: type:0, start:0x23fffb150 head:0x101a2e002 flags:0x8 Changes in v3: ========== - Rename kexec_core_dbg_print to kexec_dbg_print - Remove unnecessary segments prints - Remove patch "kexec_file: Fix the issue of mismatch between loop variable types" Qiang Ma (3): kexec: Fix uninitialized struct kimage *image pointer kexec: add kexec flag to control debug printing kexec: print out debugging message if required for kexec_load include/linux/kexec.h | 9 +++++---- include/uapi/linux/kexec.h | 1 + kernel/kexec.c | 8 +++++++- kernel/kexec_core.c | 4 +++- kernel/kexec_file.c | 4 +--- 5 files changed, 17 insertions(+), 9 deletions(-) -- 2.20.1 From maqianga at uniontech.com Wed Nov 26 00:44:25 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Wed, 26 Nov 2025 16:44:25 +0800 Subject: [PATCH v3 1/3] kexec: Fix uninitialized struct kimage *image pointer In-Reply-To: <20251126084427.3222212-1-maqianga@uniontech.com> References: <20251126084427.3222212-1-maqianga@uniontech.com> Message-ID: <20251126084427.3222212-2-maqianga@uniontech.com> The image is initialized to NULL. Then, after calling kimage_alloc_init, we can directly goto 'out' because at this time, the kimage_free will determine whether image is a NULL pointer. This can also prepare for the subsequent patch's kexec_core_dbg_print to be reset to zero in kimage_free. 
Signed-off-by: Qiang Ma --- kernel/kexec.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/kernel/kexec.c b/kernel/kexec.c index 28008e3d462e..9bb1f2b6b268 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -95,6 +95,8 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, unsigned long i; int ret; + image = NULL; + /* * Because we write directly to the reserved memory region when loading * crash kernels we need a serialization here to prevent multiple crash @@ -129,7 +131,7 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, ret = kimage_alloc_init(&image, entry, nr_segments, segments, flags); if (ret) - goto out_unlock; + goto out; if (flags & KEXEC_PRESERVE_CONTEXT) image->preserve_context = 1; -- 2.20.1 From maqianga at uniontech.com Wed Nov 26 00:44:26 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Wed, 26 Nov 2025 16:44:26 +0800 Subject: [PATCH v3 2/3] kexec: add kexec flag to control debug printing In-Reply-To: <20251126084427.3222212-1-maqianga@uniontech.com> References: <20251126084427.3222212-1-maqianga@uniontech.com> Message-ID: <20251126084427.3222212-3-maqianga@uniontech.com> The commit a85ee18c7900 ("kexec_file: print out debugging message if required") has added general code printing in kexec_file_load(), but not in kexec_load(). Since kexec_load and kexec_file_load are not triggered simultaneously, we can unify the debug flag of kexec and kexec_file as kexec_dbg_print. Next, we need to do four things: 1. rename kexec_file_dbg_print to kexec_dbg_print 2. Add KEXEC_DEBUG 3. Initialize kexec_dbg_print for kexec 4. 
Set the reset of kexec_dbg_print to kimage_free Signed-off-by: Qiang Ma --- include/linux/kexec.h | 9 +++++---- include/uapi/linux/kexec.h | 1 + kernel/kexec.c | 1 + kernel/kexec_core.c | 4 +++- kernel/kexec_file.c | 4 +--- 5 files changed, 11 insertions(+), 8 deletions(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index ff7e231b0485..23f10aec0b34 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -455,10 +455,11 @@ bool kexec_load_permitted(int kexec_image_type); /* List of defined/legal kexec flags */ #ifndef CONFIG_KEXEC_JUMP -#define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_UPDATE_ELFCOREHDR | KEXEC_CRASH_HOTPLUG_SUPPORT) +#define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_UPDATE_ELFCOREHDR | KEXEC_CRASH_HOTPLUG_SUPPORT | \ + KEXEC_DEBUG) #else #define KEXEC_FLAGS (KEXEC_ON_CRASH | KEXEC_PRESERVE_CONTEXT | KEXEC_UPDATE_ELFCOREHDR | \ - KEXEC_CRASH_HOTPLUG_SUPPORT) + KEXEC_CRASH_HOTPLUG_SUPPORT | KEXEC_DEBUG) #endif /* List of defined/legal kexec file flags */ @@ -525,10 +526,10 @@ static inline int arch_kexec_post_alloc_pages(void *vaddr, unsigned int pages, g static inline void arch_kexec_pre_free_pages(void *vaddr, unsigned int pages) { } #endif -extern bool kexec_file_dbg_print; +extern bool kexec_dbg_print; #define kexec_dprintk(fmt, arg...) 
\ - do { if (kexec_file_dbg_print) pr_info(fmt, ##arg); } while (0) + do { if (kexec_dbg_print) pr_info(fmt, ##arg); } while (0) extern void *kimage_map_segment(struct kimage *image, unsigned long addr, unsigned long size); extern void kimage_unmap_segment(void *buffer); diff --git a/include/uapi/linux/kexec.h b/include/uapi/linux/kexec.h index 55749cb0b81d..819c600af125 100644 --- a/include/uapi/linux/kexec.h +++ b/include/uapi/linux/kexec.h @@ -14,6 +14,7 @@ #define KEXEC_PRESERVE_CONTEXT 0x00000002 #define KEXEC_UPDATE_ELFCOREHDR 0x00000004 #define KEXEC_CRASH_HOTPLUG_SUPPORT 0x00000008 +#define KEXEC_DEBUG 0x00000010 #define KEXEC_ARCH_MASK 0xffff0000 /* diff --git a/kernel/kexec.c b/kernel/kexec.c index 9bb1f2b6b268..f6c58c767eb0 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -42,6 +42,7 @@ static int kimage_alloc_init(struct kimage **rimage, unsigned long entry, if (!image) return -ENOMEM; + kexec_dbg_print = !!(flags & KEXEC_DEBUG); image->start = entry; image->nr_segments = nr_segments; memcpy(image->segment, segments, nr_segments * sizeof(*segments)); diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index fa00b239c5d9..7bc1cd4105fc 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -53,7 +53,7 @@ atomic_t __kexec_lock = ATOMIC_INIT(0); /* Flag to indicate we are going to kexec a new kernel */ bool kexec_in_progress = false; -bool kexec_file_dbg_print; +bool kexec_dbg_print; /* * When kexec transitions to the new kernel there is a one-to-one @@ -576,6 +576,8 @@ void kimage_free(struct kimage *image) kimage_entry_t *ptr, entry; kimage_entry_t ind = 0; + kexec_dbg_print = false; + if (!image) return; diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index eb62a9794242..3f1d6c4e8ff2 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -138,8 +138,6 @@ void kimage_file_post_load_cleanup(struct kimage *image) */ kfree(image->image_loader_data); image->image_loader_data = NULL; - - kexec_file_dbg_print = false; } #ifdef 
CONFIG_KEXEC_SIG @@ -314,7 +312,7 @@ kimage_file_alloc_init(struct kimage **rimage, int kernel_fd, if (!image) return -ENOMEM; - kexec_file_dbg_print = !!(flags & KEXEC_FILE_DEBUG); + kexec_dbg_print = !!(flags & KEXEC_FILE_DEBUG); image->file_mode = 1; #ifdef CONFIG_CRASH_DUMP -- 2.20.1 From maqianga at uniontech.com Wed Nov 26 00:44:27 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Wed, 26 Nov 2025 16:44:27 +0800 Subject: [PATCH v3 3/3] kexec: print out debugging message if required for kexec_load In-Reply-To: <20251126084427.3222212-1-maqianga@uniontech.com> References: <20251126084427.3222212-1-maqianga@uniontech.com> Message-ID: <20251126084427.3222212-4-maqianga@uniontech.com> The commit a85ee18c7900 ("kexec_file: print out debugging message if required") has added general code printing in kexec_file_load(), but not in kexec_load(). As a result, when '-d' is used with the kexec_load interface, nothing is printed in kernel space. Print out the type/start/head of the kimage and the flags to help with debugging.
Signed-off-by: Qiang Ma Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202510310332.6XrLe70K-lkp at intel.com/ --- kernel/kexec.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/kexec.c b/kernel/kexec.c index f6c58c767eb0..37e4ac8af9f3 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -166,6 +166,9 @@ static int do_kexec_load(unsigned long entry, unsigned long nr_segments, if (ret) goto out; + kexec_dprintk("kexec_load: type:%u, start:0x%lx head:0x%lx flags:0x%lx\n", + image->type, image->start, image->head, flags); + /* Install the new kernel and uninstall the old */ image = xchg(dest_image, image); -- 2.20.1 From bhe at redhat.com Wed Nov 26 17:47:18 2025 From: bhe at redhat.com (Baoquan He) Date: Thu, 27 Nov 2025 09:47:18 +0800 Subject: [PATCH v3 0/3] kexec: print out debugging message if required for kexec_load In-Reply-To: <20251126084427.3222212-1-maqianga@uniontech.com> References: <20251126084427.3222212-1-maqianga@uniontech.com> Message-ID: Hi, On 11/26/25 at 04:44pm, Qiang Ma wrote: > Overview: > ========= > The commit a85ee18c7900 ("kexec_file: print out debugging message > if required") has added general code printing in kexec_file_load(), > but not in kexec_load(). > > Since kexec_load and kexec_file_load are not triggered simultaneously, > we can unify the debug flag of kexec and kexec_file as kexec_dbg_print. As I said in your last post, this is not needed at all, you just add a not needed thing to kernel. So NACK this patchset, unless you have reason to justify it. Sorry about it. Thanks Baoquan > > Next, we need to do some things in this patchset: > > 1. rename kexec_file_dbg_print to kexec_dbg_print > 2. Add KEXEC_DEBUG > 3. Initialize kexec_dbg_print for kexec > 4. Fix uninitialized struct kimage *image pointer > 5. Set the reset of kexec_dbg_print to kimage_free > > Testing: > ========= > I did testing on x86_64, arm64 and loongarch. 
On x86_64, the printed messages > look like below: > > unset CONFIG_KEXEC_FILE: > [ 81.502374] kexec: kexec_load: type:0, start:0x23fff7700 head:0x10a4b9002 flags:0x3e0010 > > set CONFIG_KEXEC_FILE > [ 36.774228] kexec_file: kernel: 0000000066c386c8 kernel_size: 0xd78400 > [ 36.821814] kexec-bzImage64: Loaded purgatory at 0x23fffb000 > [ 36.821826] kexec-bzImage64: Loaded boot_param, command line and misc at 0x23fff9000 bufsz=0x12d0 memsz=0x2000 > [ 36.821829] kexec-bzImage64: Loaded 64bit kernel at 0x23d400000 bufsz=0xd73400 memsz=0x2ab7000 > [ 36.821918] kexec-bzImage64: Loaded initrd at 0x23bd0b000 bufsz=0x16f40a8 memsz=0x16f40a8 > [ 36.821920] kexec-bzImage64: Final command line is: root=/dev/mapper/test-root crashkernel=auto rd.lvm.lv=test/root > [ 36.821925] kexec-bzImage64: E820 memmap: > [ 36.821926] kexec-bzImage64: 0000000000000000-000000000009ffff (1) > [ 36.821928] kexec-bzImage64: 0000000000100000-0000000000811fff (1) > [ 36.821930] kexec-bzImage64: 0000000000812000-0000000000812fff (2) > [ 36.821931] kexec-bzImage64: 0000000000813000-00000000bee38fff (1) > [ 36.821933] kexec-bzImage64: 00000000bee39000-00000000beec2fff (2) > [ 36.821934] kexec-bzImage64: 00000000beec3000-00000000bf8ecfff (1) > [ 36.821935] kexec-bzImage64: 00000000bf8ed000-00000000bfb6cfff (2) > [ 36.821936] kexec-bzImage64: 00000000bfb6d000-00000000bfb7efff (3) > [ 36.821937] kexec-bzImage64: 00000000bfb7f000-00000000bfbfefff (4) > [ 36.821938] kexec-bzImage64: 00000000bfbff000-00000000bff7bfff (1) > [ 36.821939] kexec-bzImage64: 00000000bff7c000-00000000bfffffff (2) > [ 36.821940] kexec-bzImage64: 00000000feffc000-00000000feffffff (2) > [ 36.821941] kexec-bzImage64: 00000000ffc00000-00000000ffffffff (2) > [ 36.821942] kexec-bzImage64: 0000000100000000-000000023fffffff (1) > [ 36.872348] kexec_file: nr_segments = 4 > [ 36.872356] kexec_file: segment[0]: buf=0x000000005314ece7 bufsz=0x4000 mem=0x23fffb000 memsz=0x5000 > [ 36.872370] kexec_file: segment[1]: buf=0x000000006e59b143 
bufsz=0x12d0 mem=0x23fff9000 memsz=0x2000 > [ 36.872374] kexec_file: segment[2]: buf=0x00000000eb7b1fc3 bufsz=0xd73400 mem=0x23d400000 memsz=0x2ab7000 > [ 36.882172] kexec_file: segment[3]: buf=0x000000006af76441 bufsz=0x16f40a8 mem=0x23bd0b000 memsz=0x16f5000 > [ 36.889113] kexec_file: kexec_file_load: type:0, start:0x23fffb150 head:0x101a2e002 flags:0x8 > > Changes in v3: > ========== > - Rename kexec_core_dbg_print to kexec_dbg_print > - Remove unnecessary segments prints > - Remove patch "kexec_file: Fix the issue of mismatch between loop variable types" > > Qiang Ma (3): > kexec: Fix uninitialized struct kimage *image pointer > kexec: add kexec flag to control debug printing > kexec: print out debugging message if required for kexec_load > > include/linux/kexec.h | 9 +++++---- > include/uapi/linux/kexec.h | 1 + > kernel/kexec.c | 8 +++++++- > kernel/kexec_core.c | 4 +++- > kernel/kexec_file.c | 4 +--- > 5 files changed, 17 insertions(+), 9 deletions(-) > > -- > 2.20.1 > From maqianga at uniontech.com Wed Nov 26 18:04:40 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Thu, 27 Nov 2025 10:04:40 +0800 Subject: [PATCH v3 0/3] kexec: print out debugging message if required for kexec_load In-Reply-To: References: <20251126084427.3222212-1-maqianga@uniontech.com> Message-ID: <63BA9935197ADF34+2a3faf95-36da-46f1-b9b5-4e438e75d1be@uniontech.com> On 2025/11/27 09:47, Baoquan He wrote: > Hi, > > On 11/26/25 at 04:44pm, Qiang Ma wrote: >> Overview: >> ========= >> The commit a85ee18c7900 ("kexec_file: print out debugging message >> if required") has added general code printing in kexec_file_load(), >> but not in kexec_load(). >> >> Since kexec_load and kexec_file_load are not triggered simultaneously, >> we can unify the debug flag of kexec and kexec_file as kexec_dbg_print. > As I said in your last post, this is not needed at all, you just add a > not needed thing to kernel. > > So NACK this patchset, unless you have reason to justify it. Sorry about > it.
The segment prints discussed in the last post, this patchset has been removed, leaving only type/start/head of kimage and flags. I think the current patchset is still necessary. For example, renaming kexec_file_dbg_print is still necessary, but not for kexec_file. > Thanks > Baoquan > >> >> Next, we need to do some things in this patchset: >> >> 1. rename kexec_file_dbg_print to kexec_dbg_print >> 2. Add KEXEC_DEBUG >> 3. Initialize kexec_dbg_print for kexec >> 4. Fix uninitialized struct kimage *image pointer >> 5. Set the reset of kexec_dbg_print to kimage_free >> >> Testing: >> ========= >> I did testing on x86_64, arm64 and loongarch. On x86_64, the printed messages >> look like below: >> >> unset CONFIG_KEXEC_FILE: >> [ 81.502374] kexec: kexec_load: type:0, start:0x23fff7700 head:0x10a4b9002 flags:0x3e0010 >> >> set CONFIG_KEXEC_FILE >> [ 36.774228] kexec_file: kernel: 0000000066c386c8 kernel_size: 0xd78400 >> [ 36.821814] kexec-bzImage64: Loaded purgatory at 0x23fffb000 >> [ 36.821826] kexec-bzImage64: Loaded boot_param, command line and misc at 0x23fff9000 bufsz=0x12d0 memsz=0x2000 >> [ 36.821829] kexec-bzImage64: Loaded 64bit kernel at 0x23d400000 bufsz=0xd73400 memsz=0x2ab7000 >> [ 36.821918] kexec-bzImage64: Loaded initrd at 0x23bd0b000 bufsz=0x16f40a8 memsz=0x16f40a8 >> [ 36.821920] kexec-bzImage64: Final command line is: root=/dev/mapper/test-root crashkernel=auto rd.lvm.lv=test/root >> [ 36.821925] kexec-bzImage64: E820 memmap: >> [ 36.821926] kexec-bzImage64: 0000000000000000-000000000009ffff (1) >> [ 36.821928] kexec-bzImage64: 0000000000100000-0000000000811fff (1) >> [ 36.821930] kexec-bzImage64: 0000000000812000-0000000000812fff (2) >> [ 36.821931] kexec-bzImage64: 0000000000813000-00000000bee38fff (1) >> [ 36.821933] kexec-bzImage64: 00000000bee39000-00000000beec2fff (2) >> [ 36.821934] kexec-bzImage64: 00000000beec3000-00000000bf8ecfff (1) >> [ 36.821935] kexec-bzImage64: 00000000bf8ed000-00000000bfb6cfff (2) >> [ 36.821936] kexec-bzImage64: 
00000000bfb6d000-00000000bfb7efff (3) >> [ 36.821937] kexec-bzImage64: 00000000bfb7f000-00000000bfbfefff (4) >> [ 36.821938] kexec-bzImage64: 00000000bfbff000-00000000bff7bfff (1) >> [ 36.821939] kexec-bzImage64: 00000000bff7c000-00000000bfffffff (2) >> [ 36.821940] kexec-bzImage64: 00000000feffc000-00000000feffffff (2) >> [ 36.821941] kexec-bzImage64: 00000000ffc00000-00000000ffffffff (2) >> [ 36.821942] kexec-bzImage64: 0000000100000000-000000023fffffff (1) >> [ 36.872348] kexec_file: nr_segments = 4 >> [ 36.872356] kexec_file: segment[0]: buf=0x000000005314ece7 bufsz=0x4000 mem=0x23fffb000 memsz=0x5000 >> [ 36.872370] kexec_file: segment[1]: buf=0x000000006e59b143 bufsz=0x12d0 mem=0x23fff9000 memsz=0x2000 >> [ 36.872374] kexec_file: segment[2]: buf=0x00000000eb7b1fc3 bufsz=0xd73400 mem=0x23d400000 memsz=0x2ab7000 >> [ 36.882172] kexec_file: segment[3]: buf=0x000000006af76441 bufsz=0x16f40a8 mem=0x23bd0b000 memsz=0x16f5000 >> [ 36.889113] kexec_file: kexec_file_load: type:0, start:0x23fffb150 head:0x101a2e002 flags:0x8 >> >> Changes in v3: >> ========== >> - Rename kexec_core_dbg_print to kexec_dbg_print >> - Remove unnecessary segments prints >> - Remove patch "kexec_file: Fix the issue of mismatch between loop variable types" >> >> Qiang Ma (3): >> kexec: Fix uninitialized struct kimage *image pointer >> kexec: add kexec flag to control debug printing >> kexec: print out debugging message if required for kexec_load >> >> include/linux/kexec.h | 9 +++++---- >> include/uapi/linux/kexec.h | 1 + >> kernel/kexec.c | 8 +++++++- >> kernel/kexec_core.c | 4 +++- >> kernel/kexec_file.c | 4 +--- >> 5 files changed, 17 insertions(+), 9 deletions(-) >> >> -- >> 2.20.1 >> > From bhe at redhat.com Wed Nov 26 18:36:13 2025 From: bhe at redhat.com (Baoquan He) Date: Thu, 27 Nov 2025 10:36:13 +0800 Subject: [PATCH v3 0/3] kexec: print out debugging message if required for kexec_load In-Reply-To: <63BA9935197ADF34+2a3faf95-36da-46f1-b9b5-4e438e75d1be@uniontech.com> References: 
<20251126084427.3222212-1-maqianga@uniontech.com> <63BA9935197ADF34+2a3faf95-36da-46f1-b9b5-4e438e75d1be@uniontech.com> Message-ID: On 11/27/25 at 10:04am, Qiang Ma wrote: > > ? 2025/11/27 09:47, Baoquan He ??: > > Hi, > > > > On 11/26/25 at 04:44pm, Qiang Ma wrote: > > > Overview: > > > ========= > > > The commit a85ee18c7900 ("kexec_file: print out debugging message > > > if required") has added general code printing in kexec_file_load(), > > > but not in kexec_load(). > > > Since kexec_load and kexec_file_load are not triggered simultaneously, > > > we can unify the debug flag of kexec and kexec_file as kexec_dbg_print. > > As I said in your last post, this is not needed at all, you just add a > > not needed thing to kernel. > > > > So NACK this patchset, unless you have reason to justify it. Sorry about > > it. > The segment prints discussed in the last post, > > this patchset has been removed, leaving only type/start/head of kimage and > flags. > > > I think the current patchset is still necessary. > For example, renaming kexec_file_dbg_print is still necessary, but not for > kexec_file. How come renaming kexec_file_dbg_print is a justification in this case. No, kexec_file_dbg_print is named because it's only for kexec_file debugging printing. Because we have had enough debugging printing for kexec_load interface. Do you have difficulty on debugging printing of kexec_load? From maqianga at uniontech.com Wed Nov 26 19:00:06 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Thu, 27 Nov 2025 11:00:06 +0800 Subject: [PATCH v3 0/3] kexec: print out debugging message if required for kexec_load In-Reply-To: References: <20251126084427.3222212-1-maqianga@uniontech.com> <63BA9935197ADF34+2a3faf95-36da-46f1-b9b5-4e438e75d1be@uniontech.com> Message-ID: ? 2025/11/27 10:36, Baoquan He ??: > On 11/27/25 at 10:04am, Qiang Ma wrote: >> ? 
2025/11/27 09:47, Baoquan He ??: >>> Hi, >>> >>> On 11/26/25 at 04:44pm, Qiang Ma wrote: >>>> Overview: >>>> ========= >>>> The commit a85ee18c7900 ("kexec_file: print out debugging message >>>> if required") has added general code printing in kexec_file_load(), >>>> but not in kexec_load(). >>>> Since kexec_load and kexec_file_load are not triggered simultaneously, >>>> we can unify the debug flag of kexec and kexec_file as kexec_dbg_print. >>> As I said in your last post, this is not needed at all, you just add a >>> not needed thing to kernel. >>> >>> So NACK this patchset, unless you have reason to justify it. Sorry about >>> it. >> The segment prints discussed in the last post, >> >> this patchset has been removed, leaving only type/start/head of kimage and >> flags. >> >> >> I think the current patchset is still necessary. >> For example, renaming kexec_file_dbg_print is still necessary, but not for >> kexec_file. > How come renaming kexec_file_dbg_print is a justification in this case. > > No, kexec_file_dbg_print is named because it's only for kexec_file > debugging printing. Because we have had enough debugging printing for > kexec_load interface. Do you have difficulty on debugging printing of > kexec_load? It's sufficient now, but there might be a need in the future. Also, there's kexec_dprintk. Judging from its name, it seems like a universal kexec print. Looking at the code, it feels like not only the kexec_file interface path uses it for printing. So, would it be better to rename kexec_file_dbg_print to kexec_dbg_print. > > From bhe at redhat.com Wed Nov 26 19:55:42 2025 From: bhe at redhat.com (Baoquan He) Date: Thu, 27 Nov 2025 11:55:42 +0800 Subject: [PATCH v3 0/3] kexec: print out debugging message if required for kexec_load In-Reply-To: References: <20251126084427.3222212-1-maqianga@uniontech.com> <63BA9935197ADF34+2a3faf95-36da-46f1-b9b5-4e438e75d1be@uniontech.com> Message-ID: On 11/27/25 at 11:00am, Qiang Ma wrote: > > ? 
2025/11/27 10:36, Baoquan He wrote: > > On 11/27/25 at 10:04am, Qiang Ma wrote: > > > On 2025/11/27 09:47, Baoquan He wrote: > > > > Hi, > > > > > > > > On 11/26/25 at 04:44pm, Qiang Ma wrote: > > > > > Overview: > > > > > ========= > > > > > The commit a85ee18c7900 ("kexec_file: print out debugging message > > > > > if required") has added general code printing in kexec_file_load(), > > > > > but not in kexec_load(). > > > > > Since kexec_load and kexec_file_load are not triggered simultaneously, > > > > > we can unify the debug flag of kexec and kexec_file as kexec_dbg_print. > > > > As I said in your last post, this is not needed at all, you just add a > > > > not needed thing to kernel. > > > > > > > > So NACK this patchset, unless you have reason to justify it. Sorry about > > > > it. > > > The segment prints discussed in the last post, > > > > > > this patchset has been removed, leaving only type/start/head of kimage and > > > flags. > > > > > > > > > I think the current patchset is still necessary. > > > For example, renaming kexec_file_dbg_print is still necessary, but not for > > > kexec_file. > > How come renaming kexec_file_dbg_print is a justification in this case. > > > > No, kexec_file_dbg_print is named because it's only for kexec_file > > debugging printing. Because we have had enough debugging printing for > > kexec_load interface. Do you have difficulty on debugging printing of > > kexec_load? > It's sufficient now, but there might be a need in the future. > Also, there's kexec_dprintk. Judging from its name, it seems like a Hmm, as I said in an earlier discussion, kexec sometimes means generic handling including both the kexec_load and kexec_file_load interfaces. A possible future need and kexec_dprintk seeming a little ambiguous to you are not justifications. We do not suggest adding this meaningless code to the kernel. Please don't continue spending effort on this, that is not good.
I welcome cleanup/refactoring/fix for kexec/kdump to improve code, but adding non-reasonable code is not included. > universal kexec print. > Looking at the code, it feels like not only the kexec_file interface path > uses it for printing. > > So, would it be better to rename kexec_file_dbg_print to kexec_dbg_print. > > > > > > > From maqianga at uniontech.com Wed Nov 26 22:59:15 2025 From: maqianga at uniontech.com (Qiang Ma) Date: Thu, 27 Nov 2025 14:59:15 +0800 Subject: [PATCH v3 0/3] kexec: print out debugging message if required for kexec_load In-Reply-To: References: <20251126084427.3222212-1-maqianga@uniontech.com> <63BA9935197ADF34+2a3faf95-36da-46f1-b9b5-4e438e75d1be@uniontech.com> Message-ID: <4D709A5E16BE5DEC+04521049-61de-411f-85c1-cfc049ff04c5@uniontech.com> ? 2025/11/27 11:55, Baoquan He ??: > On 11/27/25 at 11:00am, Qiang Ma wrote: >> ? 2025/11/27 10:36, Baoquan He ??: >>> On 11/27/25 at 10:04am, Qiang Ma wrote: >>>> ? 2025/11/27 09:47, Baoquan He ??: >>>>> Hi, >>>>> >>>>> On 11/26/25 at 04:44pm, Qiang Ma wrote: >>>>>> Overview: >>>>>> ========= >>>>>> The commit a85ee18c7900 ("kexec_file: print out debugging message >>>>>> if required") has added general code printing in kexec_file_load(), >>>>>> but not in kexec_load(). >>>>>> Since kexec_load and kexec_file_load are not triggered simultaneously, >>>>>> we can unify the debug flag of kexec and kexec_file as kexec_dbg_print. >>>>> As I said in your last post, this is not needed at all, you just add a >>>>> not needed thing to kernel. >>>>> >>>>> So NACK this patchset, unless you have reason to justify it. Sorry about >>>>> it. >>>> The segment prints discussed in the last post, >>>> >>>> this patchset has been removed, leaving only type/start/head of kimage and >>>> flags. >>>> >>>> >>>> I think the current patchset is still necessary. >>>> For example, renaming kexec_file_dbg_print is still necessary, but not for >>>> kexec_file. 
>>> How come renaming kexec_file_dbg_print is a justification in this case. >>> >>> No, kexec_file_dbg_print is named because it's only for kexec_file >>> debugging printing. Because we have had enough debugging printing for >>> kexec_load interface. Do you have difficulty on debugging printing of >>> kexec_load? >> It's sufficient now, but there might be a need in the future. >> Also, there's kexec_dprintk. Judging from its name, it seems like a > Hmm, as I said in an earlier discussion, kexec sometimes means generic > handling including both the kexec_load and kexec_file_load interfaces. A possible > future need and kexec_dprintk seeming a little ambiguous to you are > not justifications. We do not suggest adding this meaningless code to > the kernel. Please don't continue spending effort on this, that is not good. > > I welcome cleanup/refactoring/fix for kexec/kdump to improve code, but > adding non-reasonable code is not included. I agree that meaningless code should not be added to the kernel, but this patchset is meaningful and there are reasonable grounds for it. Let me summarize again the purpose of submitting this patchset: First, unify the print flag of kexec_file and kexec as kexec_dbg_print, so that it serves both kexec and kexec_file, now and in the future. Secondly, in the current code, for instance, I saw in the arm64 code that under the kexec_load interface, kexec_image_info() already uses kexec_dprintk. When CONFIG_KEXEC_FILE is unset, specifying '-d' with the kexec_load interface prints nothing in kernel space. static void _kexec_image_info(const char *func, int line, const struct kimage *kimage) { kexec_dprintk("%s:%d:\n", func, line); kexec_dprintk(" kexec kimage info:\n"); kexec_dprintk(" type: %d\n", kimage->type); kexec_dprintk(" head:
%lx\n", kimage->head); Thirdly, for instance, in the arm64 code, kexec_image_info() prints the type/head of the kimage, while in the RISC-V code the type/head prints for kexec_load and kexec_file_load have been removed. We can remove the type/head prints in arm64, and then add them to the generic code. For points 1 and 2, Patch 2 implements: "kexec: add kexec flag to control debug printing" For point 3, Patch 3 implements: "kexec: print out debugging message if required for kexec_load" Additionally, if it is necessary to remove the type/head prints from kexec_image_info() in the arm64 code, another patch can be provided. >> universal kexec print. >> Looking at the code, it feels like not only the kexec_file interface path >> uses it for printing. >> >> So, would it be better to rename kexec_file_dbg_print to kexec_dbg_print. >> >> >>> > From sourabhjain at linux.ibm.com Thu Nov 27 04:01:13 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Thu, 27 Nov 2025 17:31:13 +0530 Subject: [PATCH v3 0/3] kexec: print out debugging message if required for kexec_load In-Reply-To: <20251126084427.3222212-1-maqianga@uniontech.com> References: <20251126084427.3222212-1-maqianga@uniontech.com> Message-ID: <7aadda55-d2a4-40f9-95ef-d284ec358646@linux.ibm.com> Hello All, Do we have a plan to support the KEXEC_DEBUG flag? Upstream kexec-tools has already added support for the KEXEC_DEBUG flag, and that breaks kexec_load with the -d option. - kexec: add kexec flag to support debug printing https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=71d6fd99af7e Thanks, Sourabh Jain On 26/11/25 14:14, Qiang Ma wrote: > Overview: > ========= > The commit a85ee18c7900 ("kexec_file: print out debugging message > if required") has added general code printing in kexec_file_load(), > but not in kexec_load().
> > Since kexec_load and kexec_file_load are not triggered simultaneously, > we can unify the debug flag of kexec and kexec_file as kexec_dbg_print. > > Next, we need to do some things in this patchset: > > 1. rename kexec_file_dbg_print to kexec_dbg_print > 2. Add KEXEC_DEBUG > 3. Initialize kexec_dbg_print for kexec > 4. Fix uninitialized struct kimage *image pointer > 5. Set the reset of kexec_dbg_print to kimage_free > > Testing: > ========= > I did testing on x86_64, arm64 and loongarch. On x86_64, the printed messages > look like below: > > unset CONFIG_KEXEC_FILE: > [ 81.502374] kexec: kexec_load: type:0, start:0x23fff7700 head:0x10a4b9002 flags:0x3e0010 > > set CONFIG_KEXEC_FILE > [ 36.774228] kexec_file: kernel: 0000000066c386c8 kernel_size: 0xd78400 > [ 36.821814] kexec-bzImage64: Loaded purgatory at 0x23fffb000 > [ 36.821826] kexec-bzImage64: Loaded boot_param, command line and misc at 0x23fff9000 bufsz=0x12d0 memsz=0x2000 > [ 36.821829] kexec-bzImage64: Loaded 64bit kernel at 0x23d400000 bufsz=0xd73400 memsz=0x2ab7000 > [ 36.821918] kexec-bzImage64: Loaded initrd at 0x23bd0b000 bufsz=0x16f40a8 memsz=0x16f40a8 > [ 36.821920] kexec-bzImage64: Final command line is: root=/dev/mapper/test-root crashkernel=auto rd.lvm.lv=test/root > [ 36.821925] kexec-bzImage64: E820 memmap: > [ 36.821926] kexec-bzImage64: 0000000000000000-000000000009ffff (1) > [ 36.821928] kexec-bzImage64: 0000000000100000-0000000000811fff (1) > [ 36.821930] kexec-bzImage64: 0000000000812000-0000000000812fff (2) > [ 36.821931] kexec-bzImage64: 0000000000813000-00000000bee38fff (1) > [ 36.821933] kexec-bzImage64: 00000000bee39000-00000000beec2fff (2) > [ 36.821934] kexec-bzImage64: 00000000beec3000-00000000bf8ecfff (1) > [ 36.821935] kexec-bzImage64: 00000000bf8ed000-00000000bfb6cfff (2) > [ 36.821936] kexec-bzImage64: 00000000bfb6d000-00000000bfb7efff (3) > [ 36.821937] kexec-bzImage64: 00000000bfb7f000-00000000bfbfefff (4) > [ 36.821938] kexec-bzImage64: 00000000bfbff000-00000000bff7bfff 
(1) > [ 36.821939] kexec-bzImage64: 00000000bff7c000-00000000bfffffff (2) > [ 36.821940] kexec-bzImage64: 00000000feffc000-00000000feffffff (2) > [ 36.821941] kexec-bzImage64: 00000000ffc00000-00000000ffffffff (2) > [ 36.821942] kexec-bzImage64: 0000000100000000-000000023fffffff (1) > [ 36.872348] kexec_file: nr_segments = 4 > [ 36.872356] kexec_file: segment[0]: buf=0x000000005314ece7 bufsz=0x4000 mem=0x23fffb000 memsz=0x5000 > [ 36.872370] kexec_file: segment[1]: buf=0x000000006e59b143 bufsz=0x12d0 mem=0x23fff9000 memsz=0x2000 > [ 36.872374] kexec_file: segment[2]: buf=0x00000000eb7b1fc3 bufsz=0xd73400 mem=0x23d400000 memsz=0x2ab7000 > [ 36.882172] kexec_file: segment[3]: buf=0x000000006af76441 bufsz=0x16f40a8 mem=0x23bd0b000 memsz=0x16f5000 > [ 36.889113] kexec_file: kexec_file_load: type:0, start:0x23fffb150 head:0x101a2e002 flags:0x8 > > Changes in v3: > ========== > - Rename kexec_core_dbg_print to kexec_dbg_print > - Remove unnecessary segments prints > - Remove patch "kexec_file: Fix the issue of mismatch between loop variable types" > > Qiang Ma (3): > kexec: Fix uninitialized struct kimage *image pointer > kexec: add kexec flag to control debug printing > kexec: print out debugging message if required for kexec_load > > include/linux/kexec.h | 9 +++++---- > include/uapi/linux/kexec.h | 1 + > kernel/kexec.c | 8 +++++++- > kernel/kexec_core.c | 4 +++- > kernel/kexec_file.c | 4 +--- > 5 files changed, 17 insertions(+), 9 deletions(-) > From ranxiaokai627 at 163.com Thu Nov 27 04:27:00 2025 From: ranxiaokai627 at 163.com (ranxiaokai627 at 163.com) Date: Thu, 27 Nov 2025 12:27:00 +0000 Subject: [PATCH v4] KHO: Fix boot failure due to kmemleak access to non-PRESENT pages Message-ID: <20251127122700.103927-1-ranxiaokai627@163.com> From: Ran Xiaokai When booting with debug_pagealloc=on while having: CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n the system fails to boot due to page faults during kmemleak scanning. 
This occurs because:
With debug_pagealloc enabled, __free_pages() invokes
debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for
freed pages in the kernel page table.
KHO scratch areas are allocated from memblock and noted by kmemleak. But
these areas don't remain reserved; they are released later to the page
allocator using init_cma_reserved_pageblock(). This causes subsequent
kmemleak scans to access non-PRESENT pages, leading to fatal page faults.

To fix this, mark scratch areas with kmemleak_ignore_phys() after they are
allocated from memblock, excluding them from kmemleak scanning before they
are released to the buddy allocator.

Fixes: 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers")
Signed-off-by: Ran Xiaokai
Reviewed-by: Mike Rapoport (Microsoft)
---
 kernel/liveupdate/kexec_handover.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/kernel/liveupdate/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
index 224bdf5becb6..55d66e65274f 100644
--- a/kernel/liveupdate/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -11,6 +11,7 @@
 #include
 #include
+#include <linux/kmemleak.h>
 #include
 #include
 #include
@@ -1369,6 +1370,15 @@ static __init int kho_init(void)
 		unsigned long count = kho_scratch[i].size >> PAGE_SHIFT;
 		unsigned long pfn;

+		/*
+		 * When debug_pagealloc is enabled, __free_pages() clears the
+		 * corresponding PRESENT bit in the kernel page table.
+		 * Subsequent kmemleak scans of these pages cause
+		 * non-PRESENT page faults.
+		 * Mark scratch areas with kmemleak_ignore_phys() to exclude
+		 * them from kmemleak scanning.
+		 */
+		kmemleak_ignore_phys(kho_scratch[i].addr);
 		for (pfn = base_pfn; pfn < base_pfn + count;
 		     pfn += pageblock_nr_pages)
 			init_cma_reserved_pageblock(pfn_to_page(pfn));
-- 
2.25.1

From bhe at redhat.com Thu Nov 27 07:30:36 2025
From: bhe at redhat.com (Baoquan He)
Date: Thu, 27 Nov 2025 23:30:36 +0800
Subject: [PATCH v3 0/3] kexec: print out debugging message if required for kexec_load
In-Reply-To: <7aadda55-d2a4-40f9-95ef-d284ec358646@linux.ibm.com>
References: <20251126084427.3222212-1-maqianga@uniontech.com> <7aadda55-d2a4-40f9-95ef-d284ec358646@linux.ibm.com>
Message-ID:

On 11/27/25 at 05:31pm, Sourabh Jain wrote:
> Hello All,
>
> Do we have a plan to support the KEXEC_DEBUG flag?
>
> Because upstream kexec-tools has already added support for the
> KEXEC_DEBUG flag, and that breaks kexec_load with the -d option.
>
> - kexec: add kexec flag to support debug printing
> https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=71d6fd99af7e

I think we should revert that kexec-tools commit. This whole patchset is
nonsense. Because of my carelessness, that userspace patch was merged.

Hi Sourabh,

Could you go through this patchset and help check whether the patches are
really needed? I can't find anything to convince myself.

Thanks.

From pratyush at kernel.org Thu Nov 27 08:20:14 2025
From: pratyush at kernel.org (Pratyush Yadav)
Date: Thu, 27 Nov 2025 17:20:14 +0100
Subject: [PATCH v4] KHO: Fix boot failure due to kmemleak access to non-PRESENT pages
In-Reply-To: <20251127122700.103927-1-ranxiaokai627@163.com> (ranxiaokai's message of "Thu, 27 Nov 2025 12:27:00 +0000")
References: <20251127122700.103927-1-ranxiaokai627@163.com>
Message-ID:

On Thu, Nov 27 2025, ranxiaokai627 at 163.com wrote:

> From: Ran Xiaokai
>
> When booting with debug_pagealloc=on while having:
> CONFIG_KEXEC_HANDOVER_ENABLE_DEFAULT=y
> CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=n
> the system fails to boot due to page faults during kmemleak scanning.
>
> This occurs because:
> With debug_pagealloc enabled, __free_pages() invokes
> debug_pagealloc_unmap_pages(), clearing the _PAGE_PRESENT bit for
> freed pages in the kernel page table.
> KHO scratch areas are allocated from memblock and noted by kmemleak. But
> these areas don't remain reserved; they are released later to the page
> allocator using init_cma_reserved_pageblock(). This causes subsequent
> kmemleak scans to access non-PRESENT pages, leading to fatal page faults.
>
> Mark scratch areas with kmemleak_ignore_phys() after they are allocated
> from memblock to exclude them from kmemleak scanning before they are
> released to the buddy allocator to fix this.
>
> Fixes: 3dc92c311498 ("kexec: add Kexec HandOver (KHO) generation helpers")
> Signed-off-by: Ran Xiaokai
> Reviewed-by: Mike Rapoport (Microsoft)

Reviewed-by: Pratyush Yadav

Thanks!

[...]

-- 
Regards,
Pratyush Yadav

From bhe at redhat.com Thu Nov 27 19:33:08 2025
From: bhe at redhat.com (Baoquan He)
Date: Fri, 28 Nov 2025 11:33:08 +0800
Subject: [PATCH v4 00/12] mm/kasan: make kasan=on|off work for all three modes
Message-ID: <20251128033320.1349620-1-bhe@redhat.com>

Currently, only the hw_tags mode of KASAN can be enabled or disabled with
the kernel parameter kasan=on|off in a built kernel. For the KASAN generic
and sw_tags modes, there's no way to disable them once the kernel is
built. This is not always convenient, e.g. on systems where kdump is
configured. When the first kernel has KASAN enabled and a crash triggers
the switch to the kdump kernel, the generic or sw_tags mode will cost much
extra memory, while in fact it's meaningless to have KASAN in the kdump
kernel.

There are two sources of the large memory cost of a KASAN-enabled kernel.
One is the direct memory mapping shadow of KASAN, which is 1/8 of system
RAM in generic mode and 1/16 of system RAM in sw_tags mode; the other is
the shadow memory for vmalloc, which causes big memory usage in the kdump
kernel because of lazy vmap freeing.
By introducing "kasan=off|on": if we specify 'kasan=off', the former is
avoided by skipping kasan_init(), and the latter is avoided by not
building the shadow for vmalloc. So this patchset moves kasan=on|off out
of the hw_tags scope and into common code to make it visible in generic
and sw_tags modes too. Then we can add kasan=off to the kdump kernel to
reduce the unneeded memory cost of KASAN.

Testing:
========
- Testing on x86_64 and arm64 for generic mode passed with both kasan=on
  and kasan=off.
- Testing on arm64 with sw_tags mode passed when kasan=off is set.
  But when I tried to test sw_tags on arm64, the system bootup failed.
  It's not introduced by my patchset; the original code has the bug.
  I have reported it upstream:
  - System is broken in KASAN sw_tags mode during bootup
  - https://lore.kernel.org/all/aSXKqJTkZPNskFop at MiWiFi-R3L-srv/T/#u
- Haven't found hardware to test hw_tags. If anybody has such a system,
  please help take a test.

Changelog:
====
v3->v4:
- Rebase code onto the latest linux-next/master so the whole patchset
  sits on top of
  [PATCH 0/2] kasan: cleanups for kasan_enabled() checks
  [PATCH v6 0/2] kasan: unify kasan_enabled() and remove arch-specific implementations

v2->v3:
- Fix a build error on the UML arch when CONFIG_KASAN is not set. The
  fix is appended to patch 11. This was reported by LKP, thanks to them.

v1->v2:
- Add __ro_after_init for kasan_arg_disabled, and remove redundant blank
  lines in mm/kasan/common.c. Thanks to Marco.
- Fix a code bug when CONFIG_KASAN is unset; this was found by SeongJae
  and Lorenzo, and also reported by LKP, thanks to them.
- Add a missing kasan_enabled() check in kasan_report().
This will cause the below KASAN report even though kasan=off is set:

==================================================================
BUG: KASAN: stack-out-of-bounds in tick_program_event+0x130/0x150
Read of size 4 at addr ffff00005f747778 by task swapper/0/1

CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.16.0+ #8 PREEMPT(voluntary)
Hardware name: GIGABYTE R272-P30-JG/MP32-AR0-JG, BIOS F31n (SCP: 2.10.20220810) 09/30/2022
Call trace:
 show_stack+0x30/0x90 (C)
 dump_stack_lvl+0x7c/0xa0
 print_address_description.constprop.0+0x90/0x310
 print_report+0x104/0x1f0
 kasan_report+0xc8/0x110
 __asan_report_load4_noabort+0x20/0x30
 tick_program_event+0x130/0x150
 ......snip...
==================================================================

- Add a jump_label_init() call before kasan_init() in setup_arch() on
  these architectures: xtensa, arm. They currently rely on the
  jump_label_init() in main(), which is a little late, so the early
  static key kasan_flag_enabled in kasan_init() wouldn't work.
- On the UML architecture, change to enabling kasan_flag_enabled in
  arch_mm_preinit(), because kasan_init() runs before main() and there
  is no chance to operate on the static key in kasan_init().
Baoquan He (12): mm/kasan: add conditional checks in functions to return directly if kasan is disabled mm/kasan: move kasan= code to common place mm/kasan/sw_tags: don't initialize kasan if it's disabled arch/arm: don't initialize kasan if it's disabled arch/arm64: don't initialize kasan if it's disabled arch/loongarch: don't initialize kasan if it's disabled arch/powerpc: don't initialize kasan if it's disabled arch/riscv: don't initialize kasan if it's disabled arch/x86: don't initialize kasan if it's disabled arch/xtensa: don't initialize kasan if it's disabled arch/um: don't initialize kasan if it's disabled mm/kasan: make kasan=on|off take effect for all three modes arch/arm/kernel/setup.c | 6 ++++++ arch/arm/mm/kasan_init.c | 2 ++ arch/arm64/mm/kasan_init.c | 6 ++++++ arch/loongarch/mm/kasan_init.c | 2 ++ arch/powerpc/mm/kasan/init_32.c | 5 ++++- arch/powerpc/mm/kasan/init_book3e_64.c | 3 +++ arch/powerpc/mm/kasan/init_book3s_64.c | 3 +++ arch/riscv/mm/kasan_init.c | 3 +++ arch/um/kernel/mem.c | 5 ++++- arch/x86/mm/kasan_init_64.c | 3 +++ arch/xtensa/kernel/setup.c | 1 + arch/xtensa/mm/kasan_init.c | 3 +++ include/linux/kasan-enabled.h | 6 ++++-- mm/kasan/common.c | 20 ++++++++++++++++-- mm/kasan/generic.c | 17 ++++++++++++++-- mm/kasan/hw_tags.c | 28 ++------------------------ mm/kasan/init.c | 6 ++++++ mm/kasan/quarantine.c | 3 +++ mm/kasan/report.c | 4 +++- mm/kasan/shadow.c | 11 +++++++++- mm/kasan/sw_tags.c | 6 ++++++ 21 files changed, 107 insertions(+), 36 deletions(-) -- 2.41.0 From bhe at redhat.com Thu Nov 27 19:33:09 2025 From: bhe at redhat.com (Baoquan He) Date: Fri, 28 Nov 2025 11:33:09 +0800 Subject: [PATCH v4 01/12] mm/kasan: add conditional checks in functions to return directly if kasan is disabled In-Reply-To: <20251128033320.1349620-1-bhe@redhat.com> References: <20251128033320.1349620-1-bhe@redhat.com> Message-ID: <20251128033320.1349620-2-bhe@redhat.com> The current codes only check if kasan is disabled for hw_tags mode. 
Here add the conditional checks for functional functions of generic mode and sw_tags mode. This is prepared for later adding kernel parameter kasan=on|off for all three kasan modes. Signed-off-by: Baoquan He --- mm/kasan/generic.c | 17 +++++++++++++++-- mm/kasan/init.c | 6 ++++++ mm/kasan/quarantine.c | 3 +++ mm/kasan/report.c | 4 +++- mm/kasan/shadow.c | 11 ++++++++++- mm/kasan/sw_tags.c | 3 +++ 6 files changed, 40 insertions(+), 4 deletions(-) diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c index 2b8e73f5f6a7..aff822aa2bd6 100644 --- a/mm/kasan/generic.c +++ b/mm/kasan/generic.c @@ -214,12 +214,13 @@ bool kasan_byte_accessible(const void *addr) void kasan_cache_shrink(struct kmem_cache *cache) { - kasan_quarantine_remove_cache(cache); + if (kasan_enabled()) + kasan_quarantine_remove_cache(cache); } void kasan_cache_shutdown(struct kmem_cache *cache) { - if (!__kmem_cache_empty(cache)) + if (kasan_enabled() && !__kmem_cache_empty(cache)) kasan_quarantine_remove_cache(cache); } @@ -239,6 +240,9 @@ void __asan_register_globals(void *ptr, ssize_t size) int i; struct kasan_global *globals = ptr; + if (!kasan_enabled()) + return; + for (i = 0; i < size; i++) register_global(&globals[i]); } @@ -369,6 +373,9 @@ void kasan_cache_create(struct kmem_cache *cache, unsigned int *size, unsigned int rem_free_meta_size; unsigned int orig_alloc_meta_offset; + if (!kasan_enabled()) + return; + if (!kasan_requires_meta()) return; @@ -518,6 +525,9 @@ size_t kasan_metadata_size(struct kmem_cache *cache, bool in_object) { struct kasan_cache *info = &cache->kasan_info; + if (!kasan_enabled()) + return 0; + if (!kasan_requires_meta()) return 0; @@ -543,6 +553,9 @@ void kasan_record_aux_stack(void *addr) struct kasan_alloc_meta *alloc_meta; void *object; + if (!kasan_enabled()) + return; + if (is_kfence_address(addr) || !slab) return; diff --git a/mm/kasan/init.c b/mm/kasan/init.c index f084e7a5df1e..c78d77ed47bc 100644 --- a/mm/kasan/init.c +++ b/mm/kasan/init.c @@ -447,6 +447,9 @@ 
void kasan_remove_zero_shadow(void *start, unsigned long size) unsigned long addr, end, next; pgd_t *pgd; + if (!kasan_enabled()) + return; + addr = (unsigned long)kasan_mem_to_shadow(start); end = addr + (size >> KASAN_SHADOW_SCALE_SHIFT); @@ -482,6 +485,9 @@ int kasan_add_zero_shadow(void *start, unsigned long size) int ret; void *shadow_start, *shadow_end; + if (!kasan_enabled()) + return 0; + shadow_start = kasan_mem_to_shadow(start); shadow_end = shadow_start + (size >> KASAN_SHADOW_SCALE_SHIFT); diff --git a/mm/kasan/quarantine.c b/mm/kasan/quarantine.c index 6958aa713c67..a6dc2c3d8a15 100644 --- a/mm/kasan/quarantine.c +++ b/mm/kasan/quarantine.c @@ -405,6 +405,9 @@ static int __init kasan_cpu_quarantine_init(void) { int ret = 0; + if (!kasan_enabled()) + return 0; + ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mm/kasan:online", kasan_cpu_online, kasan_cpu_offline); if (ret < 0) diff --git a/mm/kasan/report.c b/mm/kasan/report.c index 62c01b4527eb..884357fa74ed 100644 --- a/mm/kasan/report.c +++ b/mm/kasan/report.c @@ -576,7 +576,9 @@ bool kasan_report(const void *addr, size_t size, bool is_write, unsigned long irq_flags; struct kasan_report_info info; - if (unlikely(report_suppressed_sw()) || unlikely(!report_enabled())) { + if (unlikely(report_suppressed_sw()) || + unlikely(!report_enabled()) || + !kasan_enabled()) { ret = false; goto out; } diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c index 29a751a8a08d..f73a691421de 100644 --- a/mm/kasan/shadow.c +++ b/mm/kasan/shadow.c @@ -164,6 +164,8 @@ void kasan_unpoison(const void *addr, size_t size, bool init) { u8 tag = get_tag(addr); + if (!kasan_enabled()) + return; /* * Perform shadow offset calculation based on untagged address, as * some of the callers (e.g. 
kasan_unpoison_new_object) pass tagged @@ -277,7 +279,8 @@ static int __meminit kasan_mem_notifier(struct notifier_block *nb, static int __init kasan_memhotplug_init(void) { - hotplug_memory_notifier(kasan_mem_notifier, DEFAULT_CALLBACK_PRI); + if (kasan_enabled()) + hotplug_memory_notifier(kasan_mem_notifier, DEFAULT_CALLBACK_PRI); return 0; } @@ -658,6 +661,9 @@ int kasan_alloc_module_shadow(void *addr, size_t size, gfp_t gfp_mask) size_t shadow_size; unsigned long shadow_start; + if (!kasan_enabled()) + return 0; + shadow_start = (unsigned long)kasan_mem_to_shadow(addr); scaled_size = (size + KASAN_GRANULE_SIZE - 1) >> KASAN_SHADOW_SCALE_SHIFT; @@ -694,6 +700,9 @@ int kasan_alloc_module_shadow(void *addr, size_t size, gfp_t gfp_mask) void kasan_free_module_shadow(const struct vm_struct *vm) { + if (!kasan_enabled()) + return; + if (IS_ENABLED(CONFIG_UML)) return; diff --git a/mm/kasan/sw_tags.c b/mm/kasan/sw_tags.c index c75741a74602..6c1caec4261a 100644 --- a/mm/kasan/sw_tags.c +++ b/mm/kasan/sw_tags.c @@ -79,6 +79,9 @@ bool kasan_check_range(const void *addr, size_t size, bool write, u8 *shadow_first, *shadow_last, *shadow; void *untagged_addr; + if (!kasan_enabled()) + return true; + if (unlikely(size == 0)) return true; -- 2.41.0 From bhe at redhat.com Thu Nov 27 19:33:10 2025 From: bhe at redhat.com (Baoquan He) Date: Fri, 28 Nov 2025 11:33:10 +0800 Subject: [PATCH v4 02/12] mm/kasan: move kasan= code to common place In-Reply-To: <20251128033320.1349620-1-bhe@redhat.com> References: <20251128033320.1349620-1-bhe@redhat.com> Message-ID: <20251128033320.1349620-3-bhe@redhat.com> This allows generic and sw_tags to be set in kernel cmdline too. When at it, rename 'kasan_arg' to 'kasan_arg_disabled' as a bool variable. And expose 'kasan_flag_enabled' to kasan common place too. This is prepared for later adding kernel parameter kasan=on|off for all three kasan modes. 
Signed-off-by: Baoquan He --- include/linux/kasan-enabled.h | 4 +++- mm/kasan/common.c | 20 ++++++++++++++++++-- mm/kasan/hw_tags.c | 28 ++-------------------------- 3 files changed, 23 insertions(+), 29 deletions(-) diff --git a/include/linux/kasan-enabled.h b/include/linux/kasan-enabled.h index 9eca967d8526..b05ec6329fbe 100644 --- a/include/linux/kasan-enabled.h +++ b/include/linux/kasan-enabled.h @@ -4,13 +4,15 @@ #include -#if defined(CONFIG_ARCH_DEFER_KASAN) || defined(CONFIG_KASAN_HW_TAGS) +extern bool kasan_arg_disabled; + /* * Global runtime flag for KASAN modes that need runtime control. * Used by ARCH_DEFER_KASAN architectures and HW_TAGS mode. */ DECLARE_STATIC_KEY_FALSE(kasan_flag_enabled); +#if defined(CONFIG_ARCH_DEFER_KASAN) || defined(CONFIG_KASAN_HW_TAGS) /* * Runtime control for shadow memory initialization or HW_TAGS mode. * Uses static key for architectures that need deferred KASAN or HW_TAGS. diff --git a/mm/kasan/common.c b/mm/kasan/common.c index 1d27f1bd260b..ac14956986ee 100644 --- a/mm/kasan/common.c +++ b/mm/kasan/common.c @@ -32,14 +32,30 @@ #include "kasan.h" #include "../slab.h" -#if defined(CONFIG_ARCH_DEFER_KASAN) || defined(CONFIG_KASAN_HW_TAGS) /* * Definition of the unified static key declared in kasan-enabled.h. * This provides consistent runtime enable/disable across KASAN modes. 
*/ DEFINE_STATIC_KEY_FALSE(kasan_flag_enabled); EXPORT_SYMBOL_GPL(kasan_flag_enabled); -#endif + +bool kasan_arg_disabled __ro_after_init; +/* kasan=off/on */ +static int __init early_kasan_flag(char *arg) +{ + if (!arg) + return -EINVAL; + + if (!strcmp(arg, "off")) + kasan_arg_disabled = true; + else if (!strcmp(arg, "on")) + kasan_arg_disabled = false; + else + return -EINVAL; + + return 0; +} +early_param("kasan", early_kasan_flag); struct slab *kasan_addr_to_slab(const void *addr) { diff --git a/mm/kasan/hw_tags.c b/mm/kasan/hw_tags.c index 1c373cc4b3fa..709c91abc1b1 100644 --- a/mm/kasan/hw_tags.c +++ b/mm/kasan/hw_tags.c @@ -22,12 +22,6 @@ #include "kasan.h" -enum kasan_arg { - KASAN_ARG_DEFAULT, - KASAN_ARG_OFF, - KASAN_ARG_ON, -}; - enum kasan_arg_mode { KASAN_ARG_MODE_DEFAULT, KASAN_ARG_MODE_SYNC, @@ -41,7 +35,6 @@ enum kasan_arg_vmalloc { KASAN_ARG_VMALLOC_ON, }; -static enum kasan_arg kasan_arg __ro_after_init; static enum kasan_arg_mode kasan_arg_mode __ro_after_init; static enum kasan_arg_vmalloc kasan_arg_vmalloc __initdata; @@ -81,23 +74,6 @@ unsigned int kasan_page_alloc_sample_order = PAGE_ALLOC_SAMPLE_ORDER_DEFAULT; DEFINE_PER_CPU(long, kasan_page_alloc_skip); -/* kasan=off/on */ -static int __init early_kasan_flag(char *arg) -{ - if (!arg) - return -EINVAL; - - if (!strcmp(arg, "off")) - kasan_arg = KASAN_ARG_OFF; - else if (!strcmp(arg, "on")) - kasan_arg = KASAN_ARG_ON; - else - return -EINVAL; - - return 0; -} -early_param("kasan", early_kasan_flag); - /* kasan.mode=sync/async/asymm */ static int __init early_kasan_mode(char *arg) { @@ -222,7 +198,7 @@ void kasan_init_hw_tags_cpu(void) * When this function is called, kasan_flag_enabled is not yet * set by kasan_init_hw_tags(). Thus, check kasan_arg instead. */ - if (kasan_arg == KASAN_ARG_OFF) + if (kasan_arg_disabled) return; /* @@ -240,7 +216,7 @@ void __init kasan_init_hw_tags(void) return; /* If KASAN is disabled via command line, don't initialize it. 
	 */
-	if (kasan_arg == KASAN_ARG_OFF)
+	if (kasan_arg_disabled)
 		return;

 	switch (kasan_arg_mode) {
-- 
2.41.0

From bhe at redhat.com Thu Nov 27 19:33:11 2025
From: bhe at redhat.com (Baoquan He)
Date: Fri, 28 Nov 2025 11:33:11 +0800
Subject: [PATCH v4 03/12] mm/kasan/sw_tags: don't initialize kasan if it's disabled
In-Reply-To: <20251128033320.1349620-1-bhe@redhat.com>
References: <20251128033320.1349620-1-bhe@redhat.com>
Message-ID: <20251128033320.1349620-4-bhe@redhat.com>

Signed-off-by: Baoquan He
---
 mm/kasan/sw_tags.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/kasan/sw_tags.c b/mm/kasan/sw_tags.c
index 6c1caec4261a..58edb68efc09 100644
--- a/mm/kasan/sw_tags.c
+++ b/mm/kasan/sw_tags.c
@@ -40,6 +40,9 @@ void __init kasan_init_sw_tags(void)
 {
 	int cpu;

+	if (kasan_arg_disabled)
+		return;
+
 	for_each_possible_cpu(cpu)
 		per_cpu(prng_state, cpu) = (u32)get_cycles();
-- 
2.41.0

From bhe at redhat.com Thu Nov 27 19:33:12 2025
From: bhe at redhat.com (Baoquan He)
Date: Fri, 28 Nov 2025 11:33:12 +0800
Subject: [PATCH v4 04/12] arch/arm: don't initialize kasan if it's disabled
In-Reply-To: <20251128033320.1349620-1-bhe@redhat.com>
References: <20251128033320.1349620-1-bhe@redhat.com>
Message-ID: <20251128033320.1349620-6-bhe@redhat.com>

Call jump_label_init() early in setup_arch() so that the later
kasan_init() can enable the static key kasan_flag_enabled. Put
jump_label_init() before parse_early_param(), as other architectures do.
Signed-off-by: Baoquan He Cc: linux-arm-kernel at lists.infradead.org --- arch/arm/kernel/setup.c | 6 ++++++ arch/arm/mm/kasan_init.c | 2 ++ 2 files changed, 8 insertions(+) diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c index 0bfd66c7ada0..453a47a4c715 100644 --- a/arch/arm/kernel/setup.c +++ b/arch/arm/kernel/setup.c @@ -1135,6 +1135,12 @@ void __init setup_arch(char **cmdline_p) early_fixmap_init(); early_ioremap_init(); + /* + * Initialise the static keys early as they may be enabled by the + * kasan_init() or early parameters. + */ + jump_label_init(); + parse_early_param(); #ifdef CONFIG_MMU diff --git a/arch/arm/mm/kasan_init.c b/arch/arm/mm/kasan_init.c index c6625e808bf8..488916c7d29e 100644 --- a/arch/arm/mm/kasan_init.c +++ b/arch/arm/mm/kasan_init.c @@ -212,6 +212,8 @@ void __init kasan_init(void) phys_addr_t pa_start, pa_end; u64 i; + if (kasan_arg_disabled) + return; /* * We are going to perform proper setup of shadow memory. * -- 2.41.0 From bhe at redhat.com Thu Nov 27 19:33:13 2025 From: bhe at redhat.com (Baoquan He) Date: Fri, 28 Nov 2025 11:33:13 +0800 Subject: [PATCH v4 05/12] arch/arm64: don't initialize kasan if it's disabled In-Reply-To: <20251128033320.1349620-1-bhe@redhat.com> References: <20251128033320.1349620-1-bhe@redhat.com> Message-ID: <20251128033320.1349620-6-bhe@redhat.com> And also need skip kasan_populate_early_vm_area_shadow() if kasan is disabled. 
Signed-off-by: Baoquan He Cc: linux-arm-kernel at lists.infradead.org --- arch/arm64/mm/kasan_init.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c index abeb81bf6ebd..eb49fdad4ef1 100644 --- a/arch/arm64/mm/kasan_init.c +++ b/arch/arm64/mm/kasan_init.c @@ -384,6 +384,9 @@ void __init kasan_populate_early_vm_area_shadow(void *start, unsigned long size) { unsigned long shadow_start, shadow_end; + if (!kasan_enabled()) + return; + if (!is_vmalloc_or_module_addr(start)) return; @@ -397,6 +400,9 @@ void __init kasan_populate_early_vm_area_shadow(void *start, unsigned long size) void __init kasan_init(void) { + if (kasan_arg_disabled) + return; + kasan_init_shadow(); kasan_init_depth(); kasan_init_generic(); -- 2.41.0 From bhe at redhat.com Thu Nov 27 19:33:14 2025 From: bhe at redhat.com (Baoquan He) Date: Fri, 28 Nov 2025 11:33:14 +0800 Subject: [PATCH v4 06/12] arch/loongarch: don't initialize kasan if it's disabled In-Reply-To: <20251128033320.1349620-1-bhe@redhat.com> References: <20251128033320.1349620-1-bhe@redhat.com> Message-ID: <20251128033320.1349620-7-bhe@redhat.com> Signed-off-by: Baoquan He Cc: loongarch at lists.linux.dev --- arch/loongarch/mm/kasan_init.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/loongarch/mm/kasan_init.c b/arch/loongarch/mm/kasan_init.c index 170da98ad4f5..61bce6a4b4bb 100644 --- a/arch/loongarch/mm/kasan_init.c +++ b/arch/loongarch/mm/kasan_init.c @@ -265,6 +265,8 @@ void __init kasan_init(void) u64 i; phys_addr_t pa_start, pa_end; + if (kasan_arg_disabled) + return; /* * If PGDIR_SIZE is too large for cpu_vabits, KASAN_SHADOW_END will * overflow UINTPTR_MAX and then looks like a user space address. 
-- 2.41.0 From bhe at redhat.com Thu Nov 27 19:33:15 2025 From: bhe at redhat.com (Baoquan He) Date: Fri, 28 Nov 2025 11:33:15 +0800 Subject: [PATCH v4 07/12] arch/powerpc: don't initialize kasan if it's disabled In-Reply-To: <20251128033320.1349620-1-bhe@redhat.com> References: <20251128033320.1349620-1-bhe@redhat.com> Message-ID: <20251128033320.1349620-8-bhe@redhat.com> This includes 32bit, book3s/64 and book3e/64. Signed-off-by: Baoquan He Cc: linuxppc-dev at lists.ozlabs.org --- arch/powerpc/mm/kasan/init_32.c | 5 ++++- arch/powerpc/mm/kasan/init_book3e_64.c | 3 +++ arch/powerpc/mm/kasan/init_book3s_64.c | 3 +++ 3 files changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/mm/kasan/init_32.c b/arch/powerpc/mm/kasan/init_32.c index 1d083597464f..b0651ff9d44d 100644 --- a/arch/powerpc/mm/kasan/init_32.c +++ b/arch/powerpc/mm/kasan/init_32.c @@ -141,6 +141,9 @@ void __init kasan_init(void) u64 i; int ret; + if (kasan_arg_disabled) + return; + for_each_mem_range(i, &base, &end) { phys_addr_t top = min(end, total_lowmem); @@ -170,7 +173,7 @@ void __init kasan_init(void) void __init kasan_late_init(void) { - if (IS_ENABLED(CONFIG_KASAN_VMALLOC)) + if (IS_ENABLED(CONFIG_KASAN_VMALLOC) && kasan_enabled()) kasan_unmap_early_shadow_vmalloc(); } diff --git a/arch/powerpc/mm/kasan/init_book3e_64.c b/arch/powerpc/mm/kasan/init_book3e_64.c index 0d3a73d6d4b0..f75c1e38a011 100644 --- a/arch/powerpc/mm/kasan/init_book3e_64.c +++ b/arch/powerpc/mm/kasan/init_book3e_64.c @@ -111,6 +111,9 @@ void __init kasan_init(void) u64 i; pte_t zero_pte = pfn_pte(virt_to_pfn(kasan_early_shadow_page), PAGE_KERNEL_RO); + if (kasan_arg_disabled) + return; + for_each_mem_range(i, &start, &end) kasan_init_phys_region(phys_to_virt(start), phys_to_virt(end)); diff --git a/arch/powerpc/mm/kasan/init_book3s_64.c b/arch/powerpc/mm/kasan/init_book3s_64.c index dcafa641804c..8c6940e835d4 100644 --- a/arch/powerpc/mm/kasan/init_book3s_64.c +++ b/arch/powerpc/mm/kasan/init_book3s_64.c @@ 
-54,6 +54,9 @@ void __init kasan_init(void) u64 i; pte_t zero_pte = pfn_pte(virt_to_pfn(kasan_early_shadow_page), PAGE_KERNEL); + if (kasan_arg_disabled) + return; + if (!early_radix_enabled()) { pr_warn("KASAN not enabled as it requires radix!"); return; -- 2.41.0 From bhe at redhat.com Thu Nov 27 19:33:16 2025 From: bhe at redhat.com (Baoquan He) Date: Fri, 28 Nov 2025 11:33:16 +0800 Subject: [PATCH v4 08/12] arch/riscv: don't initialize kasan if it's disabled In-Reply-To: <20251128033320.1349620-1-bhe@redhat.com> References: <20251128033320.1349620-1-bhe@redhat.com> Message-ID: <20251128033320.1349620-9-bhe@redhat.com> Signed-off-by: Baoquan He Cc: linux-riscv at lists.infradead.org --- arch/riscv/mm/kasan_init.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/riscv/mm/kasan_init.c b/arch/riscv/mm/kasan_init.c index c4a2a9e5586e..aa464466e482 100644 --- a/arch/riscv/mm/kasan_init.c +++ b/arch/riscv/mm/kasan_init.c @@ -485,6 +485,9 @@ void __init kasan_init(void) phys_addr_t p_start, p_end; u64 i; + if (kasan_arg_disabled) + return; + create_tmp_mapping(); csr_write(CSR_SATP, PFN_DOWN(__pa(tmp_pg_dir)) | satp_mode); -- 2.41.0 From bhe at redhat.com Thu Nov 27 19:33:17 2025 From: bhe at redhat.com (Baoquan He) Date: Fri, 28 Nov 2025 11:33:17 +0800 Subject: [PATCH v4 09/12] arch/x86: don't initialize kasan if it's disabled In-Reply-To: <20251128033320.1349620-1-bhe@redhat.com> References: <20251128033320.1349620-1-bhe@redhat.com> Message-ID: <20251128033320.1349620-10-bhe@redhat.com> Signed-off-by: Baoquan He Cc: x86 at kernel.org --- arch/x86/mm/kasan_init_64.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c index 998b6010d6d3..d642ad364904 100644 --- a/arch/x86/mm/kasan_init_64.c +++ b/arch/x86/mm/kasan_init_64.c @@ -343,6 +343,9 @@ void __init kasan_init(void) unsigned long shadow_cea_begin, shadow_cea_per_cpu_begin, shadow_cea_end; int i; + if (kasan_arg_disabled) + return; + 
 	memcpy(early_top_pgt, init_top_pgt, sizeof(early_top_pgt));

 	/*
-- 
2.41.0

From bhe at redhat.com Thu Nov 27 19:33:18 2025
From: bhe at redhat.com (Baoquan He)
Date: Fri, 28 Nov 2025 11:33:18 +0800
Subject: [PATCH v4 10/12] arch/xtensa: don't initialize kasan if it's disabled
In-Reply-To: <20251128033320.1349620-1-bhe@redhat.com>
References: <20251128033320.1349620-1-bhe@redhat.com>
Message-ID: <20251128033320.1349620-11-bhe@redhat.com>

Call jump_label_init() early in setup_arch() so that the later
kasan_init() can enable the static key kasan_flag_enabled. Put
jump_label_init() before parse_early_param(), as other architectures do.

Signed-off-by: Baoquan He
Cc: Chris Zankel
Cc: Max Filippov
---
 arch/xtensa/kernel/setup.c | 1 +
 arch/xtensa/mm/kasan_init.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/arch/xtensa/kernel/setup.c b/arch/xtensa/kernel/setup.c
index f72e280363be..aabeb23f41fa 100644
--- a/arch/xtensa/kernel/setup.c
+++ b/arch/xtensa/kernel/setup.c
@@ -352,6 +352,7 @@ void __init setup_arch(char **cmdline_p)
 	mem_reserve(__pa(_SecondaryResetVector_text_start),
 		    __pa(_SecondaryResetVector_text_end));
 #endif
+	jump_label_init();
 	parse_early_param();
 	bootmem_init();
 	kasan_init();

diff --git a/arch/xtensa/mm/kasan_init.c b/arch/xtensa/mm/kasan_init.c
index 0524b9ed5e63..a78a85da1f0d 100644
--- a/arch/xtensa/mm/kasan_init.c
+++ b/arch/xtensa/mm/kasan_init.c
@@ -70,6 +70,9 @@ void __init kasan_init(void)
 {
 	int i;

+	if (kasan_arg_disabled)
+		return;
+
 	BUILD_BUG_ON(KASAN_SHADOW_OFFSET !=
 		     KASAN_SHADOW_START - (KASAN_START_VADDR >> KASAN_SHADOW_SCALE_SHIFT));
 	BUILD_BUG_ON(VMALLOC_START < KASAN_START_VADDR);
-- 
2.41.0

From bhe at redhat.com Thu Nov 27 19:33:19 2025
From: bhe at redhat.com (Baoquan He)
Date: Fri, 28 Nov 2025 11:33:19 +0800
Subject: [PATCH v4 11/12] arch/um: don't initialize kasan if it's disabled
In-Reply-To: <20251128033320.1349620-1-bhe@redhat.com>
References: <20251128033320.1349620-1-bhe@redhat.com>
Message-ID:
<20251128033320.1349620-12-bhe@redhat.com>

Also do the kasan_arg_disabled checking before enabling kasan_flag_enabled,
to make sure the kernel parameter kasan=on|off has been parsed.

Signed-off-by: Baoquan He
Cc: linux-um at lists.infradead.org
---
 arch/um/kernel/mem.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 39c4a7e21c6f..08cd012a6bb8 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -62,8 +62,11 @@ static unsigned long brk_end;
 
 void __init arch_mm_preinit(void)
 {
+#ifdef CONFIG_KASAN
 	/* Safe to call after jump_label_init(). Enables KASAN. */
-	kasan_init_generic();
+	if (!kasan_arg_disabled)
+		kasan_init_generic();
+#endif
 
 	/* clear the zero-page */
 	memset(empty_zero_page, 0, PAGE_SIZE);
-- 
2.41.0

From bhe at redhat.com Thu Nov 27 19:33:20 2025
From: bhe at redhat.com (Baoquan He)
Date: Fri, 28 Nov 2025 11:33:20 +0800
Subject: [PATCH v4 12/12] mm/kasan: make kasan=on|off take effect for all three modes
In-Reply-To: <20251128033320.1349620-1-bhe@redhat.com>
References: <20251128033320.1349620-1-bhe@redhat.com>
Message-ID: <20251128033320.1349620-13-bhe@redhat.com>

Now everything is ready; setting kasan=off can disable kasan for all three
modes.

Signed-off-by: Baoquan He
---
 include/linux/kasan-enabled.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/kasan-enabled.h b/include/linux/kasan-enabled.h
index b05ec6329fbe..b33c92cc6bd8 100644
--- a/include/linux/kasan-enabled.h
+++ b/include/linux/kasan-enabled.h
@@ -4,6 +4,7 @@
 
 #include 
 
+#ifdef CONFIG_KASAN
 extern bool kasan_arg_disabled;
 
 /*
@@ -12,7 +13,6 @@ extern bool kasan_arg_disabled;
  */
 DECLARE_STATIC_KEY_FALSE(kasan_flag_enabled);
 
-#if defined(CONFIG_ARCH_DEFER_KASAN) || defined(CONFIG_KASAN_HW_TAGS)
 /*
  * Runtime control for shadow memory initialization or HW_TAGS mode.
  * Uses static key for architectures that need deferred KASAN or HW_TAGS.
@@ -30,7 +30,7 @@ static inline void kasan_enable(void)
 /* For architectures that can enable KASAN early, use compile-time check. */
 static __always_inline bool kasan_enabled(void)
 {
-	return IS_ENABLED(CONFIG_KASAN);
+	return false;
 }
 
 static inline void kasan_enable(void) {}
-- 
2.41.0

From k-hagio-ab at nec.com Fri Nov 28 00:04:24 2025
From: k-hagio-ab at nec.com (HAGIO KAZUHITO(萩尾　一仁))
Date: Fri, 28 Nov 2025 08:04:24 +0000
Subject: [PATCH v2][makedumpfile 00/14] btf/kallsyms based eppic extension for mm page filtering
In-Reply-To: 
References: <20251020222410.8235-1-ltao@redhat.com>
Message-ID: <8b5c5913-34bc-444f-8ffe-9457bde0649c@nec.com>

On 2025/11/24 13:46, Tao Liu wrote:
> Kindly ping... Any comments on this?

Hi Tao,

I'm sorry for the delay. I think I can look into this next month.

Thanks,
Kazu

>
> Thanks,
> Tao Liu
>
> On Tue, Oct 21, 2025 at 11:24 AM Tao Liu wrote:
>>
>> A) This patchset will introduce the following features to makedumpfile:
>>
>> 1) Enable eppic script for memory pages filtering.
>> 2) Enable btf and kallsyms for symbol type and address resolving.
>>
>> B) The purpose of the features are:
>>
>> 1) Currently makedumpfile filters mm pages based on page flags, because flags
>> can help to determine one page's usage. But this page-flag-checking method
>> lacks flexibility in certain cases, e.g. if we want to filter those mm
>> pages occupied by a GPU during vmcore dumping due to:
>>
>> a) the GPU may be taking a large amount of memory that contains sensitive data;
>> b) GPU mm pages have no relation to the kernel crash and are useless for vmcore
>> analysis.
>>
>> But there are no GPU mm page specific flags, and apparently we don't need
>> to create one just for kdump use. A programmable filtering tool is more
>> suitable for such cases. In addition, different GPU vendors may use
>> different ways of allocating mm pages; programmable filtering is better
>> than hard coding these GPU specific logics into makedumpfile in this case.
>>
>> 2) Currently makedumpfile already contains a programmable filtering tool, aka
>> eppic script, which allows users to write customized code for data erasing.
>> However it has the following drawbacks:
>>
>> a) cannot do mm page filtering.
>> b) needs access to debuginfo of both kernel and modules, which is not
>> applicable in the 2nd kernel.
>> c) Poor performance, making vmcore dumping time unacceptable (See
>> the following performance testing).
>>
>> makedumpfile needs to resolve the dwarf data from debuginfo, to get symbol
>> types and addresses. In recent kernels there are dwarf alternatives such
>> as btf/kallsyms which can be used for this purpose. And btf/kallsyms info
>> is already packed within the vmcore, so we can use it directly.
>>
>> With these, this patchset introduces an upgraded eppic, which is based on
>> btf/kallsyms symbol resolving, and is programmable for mm page filtering.
>> The following info shows its usage and performance; please note the tests
>> are performed in the 1st kernel:
>>
>> $ time ./makedumpfile -d 31 -l /var/crash/127.0.0.1-2025-06-10-18\:03\:12/vmcore
>> /tmp/dwarf.out -x /lib/debug/lib/modules/6.11.8-300.fc41.x86_64/vmlinux
>> --eppic eppic_scripts/filter_amdgpu_mm_pages.c
>> real 14m6.894s
>> user 4m16.900s
>> sys 9m44.695s
>>
>> $ time ./makedumpfile -d 31 -l /var/crash/127.0.0.1-2025-06-10-18\:03\:12/vmcore
>> /tmp/btf.out --eppic eppic_scripts/filter_amdgpu_mm_pages.c
>> real 0m10.672s
>> user 0m9.270s
>> sys 0m1.130s
>>
>> -rw------- 1 root root 367475074 Jun 10 18:06 btf.out
>> -rw------- 1 root root 367475074 Jun 10 21:05 dwarf.out
>> -rw-rw-rw- 1 root root 387181418 Jun 10 18:03 /var/crash/127.0.0.1-2025-06-10-18:03:12/vmcore
>>
>> C) Discussion:
>>
>> 1) GPU types: Currently only tested with amdgpu's mm page filtering, others
>> are not tested.
>> 2) OS: The code can work on rhel-10+/rhel9.5+ on x86_64/arm64/s390/ppc64.
>> Others are not tested.
>>
>> D) Testing:
>>
>> 1) If you don't want to create your own vmcore, you can find a vmcore which I
>> created with amdgpu mm pages unfiltered [1]; the amdgpu mm pages are
>> allocated by program [2]. You can use the vmcore in the 1st kernel to filter
>> the amdgpu mm pages by the previous performance testing cmdline. To
>> verify the pages are filtered in crash:
>>
>> Unfiltered:
>> crash> search -c "!QAZXSW@#EDC"
>> ffff96b7fa800000: !QAZXSW@#EDCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> ffff96b87c800000: !QAZXSW@#EDCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> crash> rd ffff96b7fa800000
>> ffff96b7fa800000: 405753585a415121 !QAZXSW@
>> crash> rd ffff96b87c800000
>> ffff96b87c800000: 405753585a415121 !QAZXSW@
>>
>> Filtered:
>> crash> search -c "!QAZXSW@#EDC"
>> crash> rd ffff96b7fa800000
>> rd: page excluded: kernel virtual address: ffff96b7fa800000 type: "64-bit KVADDR"
>> crash> rd ffff96b87c800000
>> rd: page excluded: kernel virtual address: ffff96b87c800000 type: "64-bit KVADDR"
>>
>> 2) You can use eppic_scripts/print_all_vma.c against an ordinary vmcore to
>> test only the btf/kallsyms functions by outputting all VMAs if no amdgpu
>> vmcores/machine are available.
>>
>> [1]: https://people.redhat.com/~ltao/core/
>> [2]: https://gist.github.com/liutgnu/a8cbce1c666452f1530e1410d1f352df
>>
>> v2 -> v1:
>>
>> 1) Moved maple tree related code (for VMA iteration) into eppic script, so we
>> don't need to port maple tree code to makedumpfile.
>>
>> 2) Reorganized the patchset as follows:
>>
>> --- ---
>> 1.Add page filtering function
>> 2.Supporting main() as the entry of eppic script
>>
>> --- ---
>> 3.dwarf_info: Support kernel address randomization
>> 4.dwarf_info: Fix a infinite recursion bug for rust
>> 5.eppic dwarf: support anonymous structs member resolving
>> 6.Enable page filtering for dwarf eppic
>>
>> --- ---
>> 7.Implement kernel kallsyms resolving
>> 8.Implement kernel btf resolving
>> 9.Implement kernel module's kallsyms resolving
>> 10.Implement kernel module's btf resolving
>> 11.Export necessary btf/kallsyms functions to eppic extension
>> 12.Enable page filtering for btf/kallsyms eppic
>> 13.Docs: Update eppic related entries
>>
>> --- ---
>> 14.Introducing 2 eppic scripts to test the dwarf/btf eppic extension
>>
>> The modification on dwarf is primarily for comparison purposes: for
>> the same eppic program, mm page filtering should get the exact same
>> outputs for the dwarf and kallsyms/btf based approaches. If the outputs
>> don't match, this indicates bugs. In fact, we will never take dwarf mm page
>> filtering into real use, due to its poor performance as well as the
>> inaccessibility of debuginfo during kdump in the 2nd kernel. So patches 3/4/5
>> won't affect the function of btf/kallsyms eppic mm page filtering, but there
>> are functions shared in patch 6, so it is a must-have one. Patch 14 is
>> only for test purposes, to demonstrate how to write an eppic script for
>> mm page filtering, so it isn't a must-have patch.
>>
>> Please note, in patch 14, I have deliberately converted all array
>> operations into pointer operations, e.g. modified "node->slot[i]" into
>> "*((unsigned long *)&(node->slot) + i)". This is because there are
>> bugs in the array operation support in extension_eppic.c. I didn't make
>> the effort to test and fix them all because, as I mentioned previously,
>> mm page filtering on the dwarf side is only for comparison and will
>> never be used in real use.
There is no such issue for kallsyms/btf >> eppic side. >> >> 3) Since we ported maple tree code to eppic script, several bugs found >> both for eppic library & eppic btf support. Please use master branch >> of eppic library to co-compile with this patchset. >> >> Tao Liu (14): >> Add page filtering function >> Supporting main() as the entry of eppic script >> dwarf_info: Support kernel address randomization >> dwarf_info: Fix a infinite recursion bug for rust >> eppic dwarf: support anonymous structs member resolving >> Enable page filtering for dwarf eppic >> Implement kernel kallsyms resolving >> Implement kernel btf resolving >> Implement kernel module's kallsyms resolving >> Implement kernel module's btf resolving >> Export necessary btf/kallsyms functions to eppic extension >> Enable page filtering for btf/kallsyms eppic >> Docs: Update eppic related entries >> Introducing 2 eppic scripts to test the dwarf/btf eppic extension >> >> Makefile | 6 +- >> btf.c | 919 +++++++++++++++++++++++++ >> btf.h | 177 +++++ >> dwarf_info.c | 7 + >> eppic_scripts/filter_amdgpu_mm_pages.c | 255 +++++++ >> eppic_scripts/print_all_vma.c | 239 +++++++ >> erase_info.c | 120 +++- >> erase_info.h | 19 + >> extension_btf.c | 258 +++++++ >> extension_eppic.c | 106 ++- >> extension_eppic.h | 6 +- >> kallsyms.c | 392 +++++++++++ >> kallsyms.h | 41 ++ >> makedumpfile.8.in | 24 +- >> makedumpfile.c | 21 +- >> makedumpfile.h | 11 + >> print_info.c | 11 +- >> 17 files changed, 2550 insertions(+), 62 deletions(-) >> create mode 100644 btf.c >> create mode 100644 btf.h >> create mode 100644 eppic_scripts/filter_amdgpu_mm_pages.c >> create mode 100644 eppic_scripts/print_all_vma.c >> create mode 100644 extension_btf.c >> create mode 100644 kallsyms.c >> create mode 100644 kallsyms.h >> >> -- >> 2.47.0 >> From ltao at redhat.com Fri Nov 28 00:10:01 2025 From: ltao at redhat.com (Tao Liu) Date: Fri, 28 Nov 2025 21:10:01 +1300 Subject: [PATCH v2][makedumpfile 00/14] btf/kallsyms based eppic 
extension for mm page filtering
In-Reply-To: <8b5c5913-34bc-444f-8ffe-9457bde0649c@nec.com>
References: <20251020222410.8235-1-ltao@redhat.com>
 <8b5c5913-34bc-444f-8ffe-9457bde0649c@nec.com>
Message-ID: 

Hi Kazu,

On Fri, Nov 28, 2025 at 9:04 PM HAGIO KAZUHITO(萩尾　一仁) wrote:
>
> On 2025/11/24 13:46, Tao Liu wrote:
> > Kindly ping... Any comments on this?
>
> Hi Tao,
>
> I'm sorry for the delay. I think I can look into this next month.

No worries, please take your time :)

Thanks,
Tao Liu

>
> Thanks,
> Kazu
>
> >
> > Thanks,
> > Tao Liu
> >
> > On Tue, Oct 21, 2025 at 11:24 AM Tao Liu wrote:
> >>
> >> A) This patchset will introduce the following features to makedumpfile:
> >>
> >> 1) Enable eppic script for memory pages filtering.
> >> 2) Enable btf and kallsyms for symbol type and address resolving.
> >>
> >> B) The purpose of the features are:
> >>
> >> 1) Currently makedumpfile filters mm pages based on page flags, because flags
> >> can help to determine one page's usage. But this page-flag-checking method
> >> lacks of flexibility in certain cases, e.g. if we want to filter those mm
> >> pages occupied by GPU during vmcore dumping due to:
> >>
> >> a) GPU may be taking a large memory and contains sensitive data;
> >> b) GPU mm pages have no relations to kernel crash and useless for vmcore
> >> analysis.
> >>
> >> But there is no GPU mm page specific flags, and apparently we don't need
> >> to create one just for kdump use. A programmable filtering tool is more
> >> suitable for such cases. In addition, different GPU vendors may use
> >> different ways for mm pages allocating, programmable filtering is better
> >> than hard coding these GPU specific logics into makedumpfile in this case.
> >> b) need to access to debuginfo of both kernel and modules, which is not > >> applicable in the 2nd kernel. > >> c) Poor performance, making vmcore dumping time unacceptable (See > >> the following performance testing). > >> > >> makedumpfile need to resolve the dwarf data from debuginfo, to get symbols > >> types and addresses. In recent kernel there are dwarf alternatives such > >> as btf/kallsyms which can be used for this purpose. And btf/kallsyms info > >> are already packed within vmcore, so we can use it directly. > >> > >> With these, this patchset introduces an upgraded eppic, which is based on > >> btf/kallsyms symbol resolving, and is programmable for mm page filtering. > >> The following info shows its usage and performance, please note the tests > >> are performed in 1st kernel: > >> > >> $ time ./makedumpfile -d 31 -l /var/crash/127.0.0.1-2025-06-10-18\:03\:12/vmcore > >> /tmp/dwarf.out -x /lib/debug/lib/modules/6.11.8-300.fc41.x86_64/vmlinux > >> --eppic eppic_scripts/filter_amdgpu_mm_pages.c > >> real 14m6.894s > >> user 4m16.900s > >> sys 9m44.695s > >> > >> $ time ./makedumpfile -d 31 -l /var/crash/127.0.0.1-2025-06-10-18\:03\:12/vmcore > >> /tmp/btf.out --eppic eppic_scripts/filter_amdgpu_mm_pages.c > >> real 0m10.672s > >> user 0m9.270s > >> sys 0m1.130s > >> > >> -rw------- 1 root root 367475074 Jun 10 18:06 btf.out > >> -rw------- 1 root root 367475074 Jun 10 21:05 dwarf.out > >> -rw-rw-rw- 1 root root 387181418 Jun 10 18:03 /var/crash/127.0.0.1-2025-06-10-18:03:12/vmcore > >> > >> C) Discussion: > >> > >> 1) GPU types: Currently only tested with amdgpu's mm page filtering, others > >> are not tested. > >> 2) OS: The code can work on rhel-10+/rhel9.5+ on x86_64/arm64/s390/ppc64. > >> Others are not tested. > >> > >> D) Testing: > >> > >> 1) If you don't want to create your vmcore, you can find a vmcore which I > >> created with amdgpu mm pages unfiltered [1], the amdgpu mm pages are > >> allocated by program [2]. 
You can use the vmcore in 1st kernel to filter > >> the amdgpu mm pages by the previous performance testing cmdline. To > >> verify the pages are filtered in crash: > >> > >> Unfiltered: > >> crash> search -c "!QAZXSW@#EDC" > >> ffff96b7fa800000: !QAZXSW@#EDCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > >> ffff96b87c800000: !QAZXSW@#EDCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX > >> crash> rd ffff96b7fa800000 > >> ffff96b7fa800000: 405753585a415121 !QAZXSW@ > >> crash> rd ffff96b87c800000 > >> ffff96b87c800000: 405753585a415121 !QAZXSW@ > >> > >> Filtered: > >> crash> search -c "!QAZXSW@#EDC" > >> crash> rd ffff96b7fa800000 > >> rd: page excluded: kernel virtual address: ffff96b7fa800000 type: "64-bit KVADDR" > >> crash> rd ffff96b87c800000 > >> rd: page excluded: kernel virtual address: ffff96b87c800000 type: "64-bit KVADDR" > >> > >> 2) You can use eppic_scripts/print_all_vma.c against an ordinary vmcore to > >> test only btf/kallsyms functions by output all VMAs if no amdgpu > >> vmcores/machine avaliable. > >> > >> [1]: https://people.redhat.com/~ltao/core/ > >> [2]: https://gist.github.com/liutgnu/a8cbce1c666452f1530e1410d1f352df > >> > >> v2 -> v1: > >> > >> 1) Moved maple tree related code(for VMA iteration) into eppic script, so we > >> don't need to port maple tree code to makedumpfile. 
> >> > >> 2) Reorganized the patchset as follows: > >> > >> --- --- > >> 1.Add page filtering function > >> 2.Supporting main() as the entry of eppic script > >> > >> --- --- > >> 3.dwarf_info: Support kernel address randomization > >> 4.dwarf_info: Fix a infinite recursion bug for rust > >> 5.eppic dwarf: support anonymous structs member resolving > >> 6.Enable page filtering for dwarf eppic > >> > >> --- --- > >> 7.Implement kernel kallsyms resolving > >> 8.Implement kernel btf resolving > >> 9.Implement kernel module's kallsyms resolving > >> 10.Implement kernel module's btf resolving > >> 11.Export necessary btf/kallsyms functions to eppic extension > >> 12.Enable page filtering for btf/kallsyms eppic > >> 13.Docs: Update eppic related entries > >> > >> --- --- > >> 14.Introducing 2 eppic scripts to test the dwarf/btf eppic extension > >> > >> The modification on dwarf is primary for comparision purpose, that > >> for the same eppic program, mm page filtering should get exact same > >> outputs for dwarf & kallsyms/btf based approaches. If outputs unmatch, > >> this indicates bugs. In fact, we will never take dwarf mm pages filtering > >> in real use, due to its poor performance as well as inaccessibility > >> of debuginfo during kdump in 2nd kernel. So patch 3/4/5 won't affect > >> the function of btf/kallsyms eppic mm page filtering, but there are > >> functions shared in patch 6, so it is a must-have one. Patch 14 is > >> only for test purpose, to demonstrate how to write eppic script for > >> mm page filtering, so it isn't a must-have patch. > >> > >> Please note, in patch 14, I have deliberately converted all array > >> operation into pointer operation, e.g. modified "node->slot[i]" into > >> "*((unsigned long *)&(node->slot) + i)". This is because there are > >> bugs for array operation support in extension_eppic.c. 
I didn't have > >> effort to test and fix them all because as I mentioned previously, > >> mm page filtering in dwarf side is only for comparision and will > >> never be used in real use. There is no such issue for kallsyms/btf > >> eppic side. > >> > >> 3) Since we ported maple tree code to eppic script, several bugs found > >> both for eppic library & eppic btf support. Please use master branch > >> of eppic library to co-compile with this patchset. > >> > >> Tao Liu (14): > >> Add page filtering function > >> Supporting main() as the entry of eppic script > >> dwarf_info: Support kernel address randomization > >> dwarf_info: Fix a infinite recursion bug for rust > >> eppic dwarf: support anonymous structs member resolving > >> Enable page filtering for dwarf eppic > >> Implement kernel kallsyms resolving > >> Implement kernel btf resolving > >> Implement kernel module's kallsyms resolving > >> Implement kernel module's btf resolving > >> Export necessary btf/kallsyms functions to eppic extension > >> Enable page filtering for btf/kallsyms eppic > >> Docs: Update eppic related entries > >> Introducing 2 eppic scripts to test the dwarf/btf eppic extension > >> > >> Makefile | 6 +- > >> btf.c | 919 +++++++++++++++++++++++++ > >> btf.h | 177 +++++ > >> dwarf_info.c | 7 + > >> eppic_scripts/filter_amdgpu_mm_pages.c | 255 +++++++ > >> eppic_scripts/print_all_vma.c | 239 +++++++ > >> erase_info.c | 120 +++- > >> erase_info.h | 19 + > >> extension_btf.c | 258 +++++++ > >> extension_eppic.c | 106 ++- > >> extension_eppic.h | 6 +- > >> kallsyms.c | 392 +++++++++++ > >> kallsyms.h | 41 ++ > >> makedumpfile.8.in | 24 +- > >> makedumpfile.c | 21 +- > >> makedumpfile.h | 11 + > >> print_info.c | 11 +- > >> 17 files changed, 2550 insertions(+), 62 deletions(-) > >> create mode 100644 btf.c > >> create mode 100644 btf.h > >> create mode 100644 eppic_scripts/filter_amdgpu_mm_pages.c > >> create mode 100644 eppic_scripts/print_all_vma.c > >> create mode 100644 extension_btf.c > 
>> create mode 100644 kallsyms.c > >> create mode 100644 kallsyms.h > >> > >> -- > >> 2.47.0 > >> From sourabhjain at linux.ibm.com Fri Nov 28 01:41:54 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Fri, 28 Nov 2025 15:11:54 +0530 Subject: [PATCH v3 0/3] kexec: print out debugging message if required for kexec_load In-Reply-To: References: <20251126084427.3222212-1-maqianga@uniontech.com> <7aadda55-d2a4-40f9-95ef-d284ec358646@linux.ibm.com> Message-ID: <77ce0329-1f82-49be-b18a-73c9e5c3e85e@linux.ibm.com> Hello Baoquan, On 27/11/25 21:00, Baoquan He wrote: > On 11/27/25 at 05:31pm, Sourabh Jain wrote: >> Hello All, >> >> Do we have plan to support KEXEC_DEBUG flag? >> >> Because upstream kexec-tools already added support for KEXEC_DEBUG flag >> and that breaks the kexec_load with -d option. >> >> - kexec: add kexec flag to support debug printing >> https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=71d6fd99af7e > I think we should revert that kexec-tools commit. Yeah, userspace changes shouldn't go in until the kernel patches are finalized. It seems that there are disagreements regarding the approach and usefulness of this patch series, so reverting the kexec-tools patch might be the right thing to avoid breaking anything for now. I have one question: should the kernel advertise KEXEC_DEBUG so that backward compatibility can be maintained between the kernel and kexec-tools? Or is that too much for a debugging flag? How was backward compatibility handled when we added the KEXEC_FILE_DEBUG flag? > This whole patchset is > non-sense. Because of my carelessness, that userspace patch was merged. > > Hi Sourabh, > > Could you go through this patchset and help check if they are really > needed? I can't find anything to convince myself. Thanks. Sure I will review this patch series. 
Thanks, Sourabh Jain From glider at google.com Fri Nov 28 07:50:51 2025 From: glider at google.com (Alexander Potapenko) Date: Fri, 28 Nov 2025 16:50:51 +0100 Subject: [PATCH v4 12/12] mm/kasan: make kasan=on|off take effect for all three modes In-Reply-To: <20251128033320.1349620-13-bhe@redhat.com> References: <20251128033320.1349620-1-bhe@redhat.com> <20251128033320.1349620-13-bhe@redhat.com> Message-ID: > @@ -30,7 +30,7 @@ static inline void kasan_enable(void) > /* For architectures that can enable KASAN early, use compile-time check. */ I think the behavior of kasan_enabled() is inconsistent with this comment now. > static __always_inline bool kasan_enabled(void) > { > - return IS_ENABLED(CONFIG_KASAN); > + return false; > } From bhe at redhat.com Sat Nov 29 18:49:18 2025 From: bhe at redhat.com (Baoquan He) Date: Sun, 30 Nov 2025 10:49:18 +0800 Subject: [PATCH v4 12/12] mm/kasan: make kasan=on|off take effect for all three modes In-Reply-To: References: <20251128033320.1349620-1-bhe@redhat.com> <20251128033320.1349620-13-bhe@redhat.com> Message-ID: On 11/28/25 at 04:50pm, Alexander Potapenko wrote: > > @@ -30,7 +30,7 @@ static inline void kasan_enable(void) > > /* For architectures that can enable KASAN early, use compile-time check. */ > I think the behavior of kasan_enabled() is inconsistent with this comment now. You are right, that line should be removed. Thanks for careful checking. 
> > static __always_inline bool kasan_enabled(void)
> > {
> > -	return IS_ENABLED(CONFIG_KASAN);
> > +	return false;
> > }
>

From bhe at redhat.com Sat Nov 29 18:56:33 2025
From: bhe at redhat.com (Baoquan He)
Date: Sun, 30 Nov 2025 10:56:33 +0800
Subject: [PATCH v3 0/3] kexec: print out debugging message if required for kexec_load
In-Reply-To: <77ce0329-1f82-49be-b18a-73c9e5c3e85e@linux.ibm.com>
References: <20251126084427.3222212-1-maqianga@uniontech.com>
 <7aadda55-d2a4-40f9-95ef-d284ec358646@linux.ibm.com>
 <77ce0329-1f82-49be-b18a-73c9e5c3e85e@linux.ibm.com>
Message-ID: 

On 11/28/25 at 03:11pm, Sourabh Jain wrote:
> Hello Baoquan,
>
> On 27/11/25 21:00, Baoquan He wrote:
> > On 11/27/25 at 05:31pm, Sourabh Jain wrote:
> > > Hello All,
> > >
> > > Do we have plan to support KEXEC_DEBUG flag?
> > >
> > > Because upstream kexec-tools already added support for KEXEC_DEBUG flag
> > > and that breaks the kexec_load with -d option.
> > >
> > > - kexec: add kexec flag to support debug printing
> > > https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=71d6fd99af7e
> > I think we should revert that kexec-tools commit.
>
> Yeah, userspace changes shouldn't go in until the kernel patches are
> finalized. It seems that there are disagreements regarding the approach
> and usefulness of this patch series, so reverting the kexec-tools patch
> might be the right thing to avoid breaking anything for now.

Patch 1 is an issue fix; that one is good. But patches 2 and 3 try to add
debug printing for the kexec_load interface, which I think is not needed.
I added debug printing for kexec_file_load because I had been using
'kexec -d' to debug kexec_load, while kexec_file_load didn't have it. So
I mimicked kexec_load's debug printing to add one for kexec_file_load.
Now the additions in patches 2 and 3 don't make sense, as he said he is
doing this for a future need.
>
> I have one question: should the kernel advertise KEXEC_DEBUG so that
> backward compatibility can be maintained between the kernel and
> kexec-tools? Or is that too much for a debugging flag? How was backward
> compatibility handled when we added the KEXEC_FILE_DEBUG flag?

When I added KEXEC_FILE_DEBUG, I didn't consider backward compatibility;
it simply made the then-latest kernel match the then-latest kexec-tools.

>
> > This whole patchset is
> > non-sense. Because of my carelessness, that userspace patch was merged.
> >
> > Hi Sourabh,
> >
> > Could you go through this patchset and help check if they are really
> > needed? I can't find anything to convince myself. Thanks.
>
> Sure I will review this patch series.

Thanks. Please check patches 2 and 3 to see whether we really need the
debug printing for kexec_load, whether adding it really brings enough
benefit compared with the mess it introduces, and whether my objection is
too subjective.

Thanks
Baoquan

From rientjes at google.com Sat Nov 29 19:13:11 2025
From: rientjes at google.com (David Rientjes)
Date: Sat, 29 Nov 2025 19:13:11 -0800 (PST)
Subject: [Hypervisor Live Update] Notes from November 17, 2025
Message-ID: 

Hi everybody,

Here are the notes from the last Hypervisor Live Update call that happened
on Monday, November 17. Thanks to everybody who was involved! These notes
are intended to bring people up to speed who could not attend the call as
well as keep the conversation going in between meetings.

----->o-----
Pasha updated on the status of the stateless KHO RFC: Jason Miu had sent
an update of the patches but they need to be rebased on top of the latest
KHO series. There were some simplification patches that had been sent for
KHO that changed how the FDT was used. Thus, the stateless KHO patches
need to be updated again, along with some splitting of the patches into
finer-grained patches.

LUO v6 was sent the previous weekend.
There were a number of comments received for LUO v5 in linux-next that
were addressed in v6. Mike Rapoport was going through v6 and provided the
most feedback. Pasha was planning on sending a v7 for the next merge
window.

----->o-----
David Matlack updated that he was going to be focused this week on the
VFIO v2 patch series. His goal was to have it on the mailing list by the
week of November 24. The goal was to be able to gather feedback prior to
LPC and then leverage that conference to discuss the open questions for
that series.

Sami and David had discussed a minimal patch series for VFIO preservation
as the next feature that could be merged on top of LUO, setting the stage
for IOMMU preservation to build on that.

----->o-----
Pratyush updated on his HugeTLB and 1GB page preservation series; he got
this working internally for v5. It is not ready to post as an RFC yet, so
the goal was to have this in a state ready to share over the next two
weeks. This will also enable LPC discussions.

----->o-----
Ackerley provided an update on guest memfd support for 1GB HugeTLB pages.
He has an internal version working. There is no preservation support for
it, just guest memfd with 1GB HugeTLB support. Pratyush had previously
discussed this with Ackerley and felt that the series were really
independent of each other. Ackerley was planning his next posting to the
mailing list after LPC.

----->o-----
Pratyush discussed an idea about versioning for LUO: there will be
different versions for different components like memfd, IOMMU, etc. He was
thinking of having a mechanism to define different versions. This would be
supported as an ELF header in the vmlinux. When you load the next kernel
in preparation for kexec, luod would read this next vmlinux, see what
version it supports and determine its compatibility with the currently
running kernel.

Jason suggested discussing the roadmap for FDT first; he wanted to ensure
that the dependencies were sorted out fully before doing optimization.
He wanted to see more infrastructure to support the versioning and wrote
some thoughts on this on the mailing list previously. The ELF versioning
could just be auto-generated out of the aligned design. Pratyush proposed
writing an RFC that could be used as the basis for further discussion.

----->o-----
Next meeting will be on Monday, December 1 at 8am PST (UTC-8), everybody
is welcome: https://meet.google.com/rjn-dmzu-hgq

Topics for the next meeting:

 - update on the status of the stateless KHO RFC patches that were being
   rebased on top of the KHO simplification
 - update on the status of LUO v7 and its potential for merge in the next
   merge window
 - update for the VFIO v2 patch series intended to solicit feedback prior
   to LPC
 - next steps for iommu persistence to build upon the VFIO patch series
   once that is merged
 - status update for HugeTLB + 1GB page preservation support that should
   be ready to send out by the next meeting
 - continued discussion on versioning support for various components for
   luod to negotiate
 - determine the plan for the December 15 instance of the meeting since
   it's immediately after LPC
 - later, after LPC: update on status of guest_memfd support for 1GB
   HugeTLB pages
 - later: testing methodology to allow downstream consumers to qualify
   that live update works from one version to another
 - later: reducing the blackout window during live update, including
   deferred struct page initialization

Please let me know if you'd like to propose additional topics for
discussion, thank you!

From rppt at kernel.org Sun Nov 30 22:54:37 2025
From: rppt at kernel.org (Mike Rapoport)
Date: Mon, 1 Dec 2025 08:54:37 +0200
Subject: [PATCH 2/2] kho: fix restoring of contiguous ranges of order-0 pages
In-Reply-To: 
References: <20251125110917.843744-1-rppt@kernel.org>
 <20251125110917.843744-3-rppt@kernel.org>
Message-ID: 

Hi Pratyush,

On Tue, Nov 25, 2025 at 02:45:59PM +0100, Pratyush Yadav wrote:
> On Tue, Nov 25 2025, Mike Rapoport wrote:
...
> > @@ -243,11 +243,16 @@ static struct page *kho_restore_page(phys_addr_t phys)
> >  	/* Head page gets refcount of 1. */
> >  	set_page_count(page, 1);
> >  
> > -	/* For higher order folios, tail pages get a page count of zero. */
> > +	/*
> > +	 * For higher order folios, tail pages get a page count of zero.
> > +	 * For physically contiguous order-0 pages every page gets a page
> > +	 * count of 1.
> > +	 */
> > +	ref_cnt = is_folio ? 0 : 1;
> >  	for (unsigned int i = 1; i < nr_pages; i++)
> > -		set_page_count(page + i, 0);
> > +		set_page_count(page + i, ref_cnt);
> >  
> > -	if (info.order > 0)
> > +	if (is_folio && info.order)
> 
> This is getting a bit difficult to parse. Let's separate out folio and
> page initialization to separate helpers:

Sorry, I missed this earlier and now the patches are in akpm's -stable
branch. Let's postpone these changes until the next cycle, maybe along
with support for deferred initialization of struct page.

> /* Initialize 0-order KHO pages */
> static void kho_init_page(struct page *page, unsigned int nr_pages)
> {
> 	for (unsigned int i = 0; i < nr_pages; i++)
> 		set_page_count(page + i, 1);
> }
> 
> static void kho_init_folio(struct page *page, unsigned int order)
> {
> 	unsigned int nr_pages = (1 << order);
> 
> 	/* Head page gets refcount of 1. */
> 	set_page_count(page, 1);
> 
> 	/* For higher order folios, tail pages get a page count of zero. */
> 	for (unsigned int i = 1; i < nr_pages; i++)
> 		set_page_count(page + i, 0);
> 
> 	if (order > 0)
> 		prep_compound_page(page, order);
> }

-- 
Sincerely yours,
Mike.
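[Editorial illustration] The helper split discussed in the thread above can be sketched outside the kernel. The following is a minimal userspace C sketch, not the kernel API: `struct fake_page`, `init_page_run`, and `init_folio` are hypothetical names standing in for `struct page`, `kho_init_page`, and `kho_init_folio`. It only demonstrates the refcount rule being debated: a contiguous run of order-0 pages gives every page a count of 1, while a folio gives the head a count of 1 and the tails a count of 0.

```c
#include <assert.h>

/* Stand-in for struct page: only the refcount matters for this sketch. */
struct fake_page {
	int refcount;
};

/*
 * Physically contiguous run of order-0 pages: every page is an
 * independent allocation, so each one gets a refcount of 1.
 */
static void init_page_run(struct fake_page *pages, unsigned int nr_pages)
{
	for (unsigned int i = 0; i < nr_pages; i++)
		pages[i].refcount = 1;
}

/*
 * Folio of the given order: the head page gets a refcount of 1 and the
 * tail pages get 0, since tails are not individually reference-counted.
 */
static void init_folio(struct fake_page *pages, unsigned int order)
{
	unsigned int nr_pages = 1u << order;

	pages[0].refcount = 1;
	for (unsigned int i = 1; i < nr_pages; i++)
		pages[i].refcount = 0;
}
```

With such a split, the caller picks one helper based on whether the preserved range was a folio, instead of threading a conditional refcount (the `ref_cnt = is_folio ? 0 : 1` in the patch) through a single shared loop.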