From k-hagio-ab at nec.com Tue Jul 1 00:38:00 2025 From: k-hagio-ab at nec.com (=?utf-8?B?SEFHSU8gS0FaVUhJVE8o6JCp5bC+44CA5LiA5LuBKQ==?=) Date: Tue, 1 Jul 2025 07:38:00 +0000 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: <20250625022343.57529-2-ltao@redhat.com> References: <20250625022343.57529-2-ltao@redhat.com> Message-ID: <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> Hi Tao, thank you for the patch. On 2025/06/25 11:23, Tao Liu wrote: > A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be > reproduced with upstream makedumpfile. > > When analyzing the corrupt vmcore using crash, the following error > message will output: > > crash: compressed kdump: uncompress failed: 0 > crash: read error: kernel virtual address: c0001e2d2fe48000 type: > "hardirq thread_union" > crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000 > crash: compressed kdump: uncompress failed: 0 > > If the vmcore is generated without num-threads option, then no such > errors are noticed. > > With --num-threads=N enabled, there will be N sub-threads created. All > sub-threads are producers which responsible for mm page processing, e.g. > compression. The main thread is the consumer which responsible for > writing the compressed data into file. page_flag_buf->ready is used to > sync main and sub-threads. When a sub-thread finishes page processing, > it will set ready flag to be FLAG_READY. In the meantime, main thread > looply check all threads of the ready flags, and break the loop when > find FLAG_READY. I've tried to reproduce the issue, but I couldn't on x86_64. Do you have any possible scenario that breaks a vmcore? I could not think of it only by looking at the code. and this is just out of curiosity, is the issue reproduced with makedumpfile compiled with -O0 too? Thanks, Kazu > > page_flag_buf->ready is read/write by main/sub-threads simultaneously, > but it is unprotected and unsafe. I have tested both mutex and atomic_rw > can fix this issue. This patch takes atomic_rw for its simplicity. 
> > [1]: https://github.com/makedumpfile/makedumpfile/issues/15 > > Tested-by: Sourabh Jain > Signed-off-by: Tao Liu > --- > > v2 -> v1: Add error message of crash into commit log > > --- > makedumpfile.c | 21 ++++++++++++++------- > 1 file changed, 14 insertions(+), 7 deletions(-) > > diff --git a/makedumpfile.c b/makedumpfile.c > index 2d3b08b..bac45c2 100644 > --- a/makedumpfile.c > +++ b/makedumpfile.c > @@ -8621,7 +8621,8 @@ kdump_thread_function_cyclic(void *arg) { > > while (buf_ready == FALSE) { > pthread_testcancel(); > - if (page_flag_buf->ready == FLAG_READY) > + if (__atomic_load_n(&page_flag_buf->ready, > + __ATOMIC_SEQ_CST) == FLAG_READY) > continue; > > /* get next dumpable pfn */ > @@ -8637,7 +8638,8 @@ kdump_thread_function_cyclic(void *arg) { > info->current_pfn = pfn + 1; > > page_flag_buf->pfn = pfn; > - page_flag_buf->ready = FLAG_FILLING; > + __atomic_store_n(&page_flag_buf->ready, FLAG_FILLING, > + __ATOMIC_SEQ_CST); > pthread_mutex_unlock(&info->current_pfn_mutex); > sem_post(&info->page_flag_buf_sem); > > @@ -8726,7 +8728,8 @@ kdump_thread_function_cyclic(void *arg) { > page_flag_buf->index = index; > buf_ready = TRUE; > next: > - page_flag_buf->ready = FLAG_READY; > + __atomic_store_n(&page_flag_buf->ready, FLAG_READY, > + __ATOMIC_SEQ_CST); > page_flag_buf = page_flag_buf->next; > > } > @@ -8855,7 +8858,8 @@ write_kdump_pages_parallel_cyclic(struct cache_data *cd_header, > * current_pfn is used for recording the value of pfn when checking the pfn. > */ > for (i = 0; i < info->num_threads; i++) { > - if (info->page_flag_buf[i]->ready == FLAG_UNUSED) > + if (__atomic_load_n(&info->page_flag_buf[i]->ready, > + __ATOMIC_SEQ_CST) == FLAG_UNUSED) > continue; > temp_pfn = info->page_flag_buf[i]->pfn; > > @@ -8863,7 +8867,8 @@ write_kdump_pages_parallel_cyclic(struct cache_data *cd_header, > * count how many threads have reached the end. > */ > if (temp_pfn >= end_pfn) { > - info->page_flag_buf[i]->ready = FLAG_UNUSED; > + __atomic_store_n(&info->page_flag_buf[i]->ready, > + FLAG_UNUSED, __ATOMIC_SEQ_CST); > end_count++; > continue; > } > @@ -8885,7 +8890,8 @@ write_kdump_pages_parallel_cyclic(struct cache_data *cd_header, > * If the page_flag_buf is not ready, the pfn recorded may be changed. > * So we should recheck. > */ > - if (info->page_flag_buf[consuming]->ready != FLAG_READY) { > + if (__atomic_load_n(&info->page_flag_buf[consuming]->ready, > + __ATOMIC_SEQ_CST) != FLAG_READY) { > clock_gettime(CLOCK_MONOTONIC, &new); > if (new.tv_sec - last.tv_sec > WAIT_TIME) { > ERRMSG("Can't get data of pfn.\n"); > @@ -8927,7 +8933,8 @@ write_kdump_pages_parallel_cyclic(struct cache_data *cd_header, > goto out; > page_data_buf[index].used = FALSE; > } > - info->page_flag_buf[consuming]->ready = FLAG_UNUSED; > + __atomic_store_n(&info->page_flag_buf[consuming]->ready, > + FLAG_UNUSED, __ATOMIC_SEQ_CST); > info->page_flag_buf[consuming] = info->page_flag_buf[consuming]->next; > } > finish: From ltao at redhat.com Tue Jul 1 00:59:53 2025 From: ltao at redhat.com (Tao Liu) Date: Tue, 1 Jul 2025 19:59:53 +1200 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> References: <20250625022343.57529-2-ltao@redhat.com> <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> Message-ID: Hi Kazu, Thanks for your comments! On Tue, Jul 1, 2025 at 7:38?PM HAGIO KAZUHITO(?????) wrote: > > Hi Tao, > > thank you for the patch. 
> > On 2025/06/25 11:23, Tao Liu wrote: > > A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be > > reproduced with upstream makedumpfile. > > > > When analyzing the corrupt vmcore using crash, the following error > > message will output: > > > > crash: compressed kdump: uncompress failed: 0 > > crash: read error: kernel virtual address: c0001e2d2fe48000 type: > > "hardirq thread_union" > > crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000 > > crash: compressed kdump: uncompress failed: 0 > > > > If the vmcore is generated without num-threads option, then no such > > errors are noticed. > > > > With --num-threads=N enabled, there will be N sub-threads created. All > > sub-threads are producers which responsible for mm page processing, e.g. > > compression. The main thread is the consumer which responsible for > > writing the compressed data into file. page_flag_buf->ready is used to > > sync main and sub-threads. When a sub-thread finishes page processing, > > it will set ready flag to be FLAG_READY. In the meantime, main thread > > looply check all threads of the ready flags, and break the loop when > > find FLAG_READY. > > I've tried to reproduce the issue, but I couldn't on x86_64. Yes, I cannot reproduce it on x86_64 either, but the issue is very easily reproduced on ppc64 arch, which is where our QE reported. Recently we have enabled --num-threads=N in rhel by default. N == nr_cpus in 2nd kernel, so QE noticed the issue. > > Do you have any possible scenario that breaks a vmcore? I could not > think of it only by looking at the code. I guess the issue only been observed on ppc might be due to ppc's memory model, multi-thread scheduling algorithm etc. I'm not an expert on those. So I cannot give a clear explanation, sorry... The page_flag_buf->ready is an integer that r/w by main and sub threads simultaneously. And the assignment operation, like page_flag_buf->ready = 1, might be composed of several assembly instructions. Without atomic r/w (memory) protection, there might be racing r/w just within the few instructions, which caused the data inconsistency. Frankly the ppc assembly consists of more instructions than x86_64 for the same c code, which enlarged the possibility of data racing. We can observe the issue without the help of crash, just compare the binary output of vmcore generated from the same core file, and compress it with or without --num-threads option. Then compare it with "cmp vmcore1 vmcore2" cmdline, and cmp will output bytes differ for the 2 vmcores, and this is unexpected. > > and this is just out of curiosity, is the issue reproduced with > makedumpfile compiled with -O0 too? Sorry, I haven't done the -O0 experiment, I can do it tomorrow and share my findings... Thanks, Tao Liu > > Thanks, > Kazu > > > > > page_flag_buf->ready is read/write by main/sub-threads simultaneously, > > but it is unprotected and unsafe. I have tested both mutex and atomic_rw > > can fix this issue. This patch takes atomic_rw for its simplicity. 
> > > > [1]: https://github.com/makedumpfile/makedumpfile/issues/15 > > > > Tested-by: Sourabh Jain > > Signed-off-by: Tao Liu > > --- > > > > v2 -> v1: Add error message of crash into commit log > > > > --- > > makedumpfile.c | 21 ++++++++++++++------- > > 1 file changed, 14 insertions(+), 7 deletions(-) > > > > diff --git a/makedumpfile.c b/makedumpfile.c > > index 2d3b08b..bac45c2 100644 > > --- a/makedumpfile.c > > +++ b/makedumpfile.c > > @@ -8621,7 +8621,8 @@ kdump_thread_function_cyclic(void *arg) { > > > > while (buf_ready == FALSE) { > > pthread_testcancel(); > > - if (page_flag_buf->ready == FLAG_READY) > > + if (__atomic_load_n(&page_flag_buf->ready, > > + __ATOMIC_SEQ_CST) == FLAG_READY) > > continue; > > > > /* get next dumpable pfn */ > > @@ -8637,7 +8638,8 @@ kdump_thread_function_cyclic(void *arg) { > > info->current_pfn = pfn + 1; > > > > page_flag_buf->pfn = pfn; > > - page_flag_buf->ready = FLAG_FILLING; > > + __atomic_store_n(&page_flag_buf->ready, FLAG_FILLING, > > + __ATOMIC_SEQ_CST); > > pthread_mutex_unlock(&info->current_pfn_mutex); > > sem_post(&info->page_flag_buf_sem); > > > > @@ -8726,7 +8728,8 @@ kdump_thread_function_cyclic(void *arg) { > > page_flag_buf->index = index; > > buf_ready = TRUE; > > next: > > - page_flag_buf->ready = FLAG_READY; > > + __atomic_store_n(&page_flag_buf->ready, FLAG_READY, > > + __ATOMIC_SEQ_CST); > > page_flag_buf = page_flag_buf->next; > > > > } > > @@ -8855,7 +8858,8 @@ write_kdump_pages_parallel_cyclic(struct cache_data *cd_header, > > * current_pfn is used for recording the value of pfn when checking the pfn. > > */ > > for (i = 0; i < info->num_threads; i++) { > > - if (info->page_flag_buf[i]->ready == FLAG_UNUSED) > > + if (__atomic_load_n(&info->page_flag_buf[i]->ready, > > + __ATOMIC_SEQ_CST) == FLAG_UNUSED) > > continue; > > temp_pfn = info->page_flag_buf[i]->pfn; > > > > @@ -8863,7 +8867,8 @@ write_kdump_pages_parallel_cyclic(struct cache_data *cd_header, > > * count how many threads have reached the end. > > */ > > if (temp_pfn >= end_pfn) { > > - info->page_flag_buf[i]->ready = FLAG_UNUSED; > > + __atomic_store_n(&info->page_flag_buf[i]->ready, > > + FLAG_UNUSED, __ATOMIC_SEQ_CST); > > end_count++; > > continue; > > } > > @@ -8885,7 +8890,8 @@ write_kdump_pages_parallel_cyclic(struct cache_data *cd_header, > > * If the page_flag_buf is not ready, the pfn recorded may be changed. > > * So we should recheck. 
> > */ > > - if (info->page_flag_buf[consuming]->ready != FLAG_READY) { > > + if (__atomic_load_n(&info->page_flag_buf[consuming]->ready, > > + __ATOMIC_SEQ_CST) != FLAG_READY) { > > clock_gettime(CLOCK_MONOTONIC, &new); > > if (new.tv_sec - last.tv_sec > WAIT_TIME) { > > ERRMSG("Can't get data of pfn.\n"); > > @@ -8927,7 +8933,8 @@ write_kdump_pages_parallel_cyclic(struct cache_data *cd_header, > > goto out; > > page_data_buf[index].used = FALSE; > > } > > - info->page_flag_buf[consuming]->ready = FLAG_UNUSED; > > + __atomic_store_n(&info->page_flag_buf[consuming]->ready, > > + FLAG_UNUSED, __ATOMIC_SEQ_CST); > > info->page_flag_buf[consuming] = info->page_flag_buf[consuming]->next; > > } > > finish: From skhan at linuxfoundation.org Tue Jul 1 12:53:35 2025 From: skhan at linuxfoundation.org (Shuah Khan) Date: Tue, 1 Jul 2025 13:53:35 -0600 Subject: [PATCH] selftests/kexec: fix test_kexec_jump build and ignore generated binary In-Reply-To: <20250624201438.89391-1-moonhee.lee.ca@gmail.com> References: <20250624201438.89391-1-moonhee.lee.ca@gmail.com> Message-ID: <744bd439-2613-45d7-8724-5959d25100aa@linuxfoundation.org> On 6/24/25 14:14, Moon Hee Lee wrote: > The test_kexec_jump program builds correctly when invoked from the top-level > selftests/Makefile, which explicitly sets the OUTPUT variable. However, > building directly in tools/testing/selftests/kexec fails with: > > make: *** No rule to make target '/test_kexec_jump', needed by 'test_kexec_jump.sh'. Stop. > > This failure occurs because the Makefile rule relies on $(OUTPUT), which is > undefined in direct builds. > > Fix this by listing test_kexec_jump in TEST_GEN_PROGS, the standard way to > declare generated test binaries in the kselftest framework. This ensures the > binary is built regardless of invocation context and properly removed by > make clean. The change looks good to me. Acked-by: Shuah Khan > > Also add the binary to .gitignore to avoid tracking it in version control. 
There is another patch that adds the executable to .gitignore https://lore.kernel.org/r/20250623232549.3263273-1-dyudaken at gmail.com I think you are missing kexec at lists.infradead.org - added it > > Signed-off-by: Moon Hee Lee > --- > tools/testing/selftests/kexec/.gitignore | 2 ++ > tools/testing/selftests/kexec/Makefile | 2 +- > 2 files changed, 3 insertions(+), 1 deletion(-) > create mode 100644 tools/testing/selftests/kexec/.gitignore > > diff --git a/tools/testing/selftests/kexec/.gitignore b/tools/testing/selftests/kexec/.gitignore > new file mode 100644 > index 000000000000..5f3d9e089ae8 > --- /dev/null > +++ b/tools/testing/selftests/kexec/.gitignore > @@ -0,0 +1,2 @@ > +# SPDX-License-Identifier: GPL-2.0-only > +test_kexec_jump > diff --git a/tools/testing/selftests/kexec/Makefile b/tools/testing/selftests/kexec/Makefile > index e3000ccb9a5d..874cfdd3b75b 100644 > --- a/tools/testing/selftests/kexec/Makefile > +++ b/tools/testing/selftests/kexec/Makefile > @@ -12,7 +12,7 @@ include ../../../scripts/Makefile.arch > > ifeq ($(IS_64_BIT)$(ARCH_PROCESSED),1x86) > TEST_PROGS += test_kexec_jump.sh > -test_kexec_jump.sh: $(OUTPUT)/test_kexec_jump > +TEST_GEN_PROGS := test_kexec_jump > endif > > include ../lib.mk thanks, -- Shuah From k-hagio-ab at nec.com Tue Jul 1 17:03:30 2025 From: k-hagio-ab at nec.com (=?utf-8?B?SEFHSU8gS0FaVUhJVE8o6JCp5bC+44CA5LiA5LuBKQ==?=) Date: Wed, 2 Jul 2025 00:03:30 +0000 Subject: [PATCH makedumpfile] Add confidential VM unaccepted free pages support on Linux 6.16 and later In-Reply-To: References: <20250609050244.837619-1-zhiquan1.li@intel.com> Message-ID: <25f0ebf9-d8a7-4fb6-a716-bcef962443d6@nec.com> On 2025/06/30 15:24, HAGIO KAZUHITO(?? ??) wrote: > On 2025/06/30 15:30, Zhiquan Li wrote: >> >> On 2025/6/30 12:05, HAGIO KAZUHITO(?? ??) wrote: >>> Hi, >>> >>> thank you for the patch, and sorry for the long delay.. >>> >> >> Thanks for your time to review this patch. There is a lot of background >> knowledge behind this feature, it really needs a long time to digest them. >> >>>> @@ -6630,6 +6633,17 @@ check_order: >>>> nr_pages = 1 << private; >>>> pfn_counter = &pfn_free; >>>> } >>>> + /* >>>> + * Exclude the unaccepted free pages not managed by buddy. >>>> + * By convention, pages can be added to the zone.unaccepted_pages list >>>> + * only when the order is MAX_ORDER_NR_PAGES. Otherwise, the page is >>>> + * accepted immediately without being on the list. >>>> + */ >>>> + else if ((info->dump_level & DL_EXCLUDE_FREE) >>>> + && isUnaccepted(_mapcount)) { >>> >>>> + nr_pages = 1 << (ARRAY_LENGTH(zone.free_area) - 1); >>> >>> just to clarify, does this mean that the order of unaccepted pages is >>> MAX_PAGE_ORDER but it's not set in struct page, so we need to set the >>> order here? >>> >> >> Yes, the unit of operations on the buddy system for unaccepted pages is >> MAX_PAGE_ORDER, and the type PGTY_unaccepted is only applied on the >> first of struct page, not each struct page of MAX_PAGE_ORDER unaccepted >> pages need the type. Therefore, if we see one struct page with the >> PGTY_unaccepted type, that means the next MAX_PAGE_ORDER-1 unaccepted >> pages are added to zone.unaccepted_pages list. > > Thank you Zhiquan for the detailed explanation, which made me understand > more clearly. And the patch looks good to me, so ack. > > (There is no need to add my acked-by tag to the patch, we will merge > this when Masa sends his ack.) The patch got 2 acks, applied. 
https://github.com/makedumpfile/makedumpfile/commit/c1a930185482a72b290099a4f814c5f32b6e4cc8 Thanks! Kazu From k-hagio-ab at nec.com Tue Jul 1 17:13:15 2025 From: k-hagio-ab at nec.com (=?utf-8?B?SEFHSU8gS0FaVUhJVE8o6JCp5bC+44CA5LiA5LuBKQ==?=) Date: Wed, 2 Jul 2025 00:13:15 +0000 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: References: <20250625022343.57529-2-ltao@redhat.com> <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> Message-ID: <5c425f4e-4e89-4500-993e-e4dfec50a4fb@nec.com> On 2025/07/01 16:59, Tao Liu wrote: > Hi Kazu, > > Thanks for your comments! > > On Tue, Jul 1, 2025 at 7:38?PM HAGIO KAZUHITO(?????) wrote: >> >> Hi Tao, >> >> thank you for the patch. >> >> On 2025/06/25 11:23, Tao Liu wrote: >>> A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be >>> reproduced with upstream makedumpfile. >>> >>> When analyzing the corrupt vmcore using crash, the following error >>> message will output: >>> >>> crash: compressed kdump: uncompress failed: 0 >>> crash: read error: kernel virtual address: c0001e2d2fe48000 type: >>> "hardirq thread_union" >>> crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000 >>> crash: compressed kdump: uncompress failed: 0 >>> >>> If the vmcore is generated without num-threads option, then no such >>> errors are noticed. >>> >>> With --num-threads=N enabled, there will be N sub-threads created. All >>> sub-threads are producers which responsible for mm page processing, e.g. >>> compression. The main thread is the consumer which responsible for >>> writing the compressed data into file. page_flag_buf->ready is used to >>> sync main and sub-threads. When a sub-thread finishes page processing, >>> it will set ready flag to be FLAG_READY. In the meantime, main thread >>> looply check all threads of the ready flags, and break the loop when >>> find FLAG_READY. >> >> I've tried to reproduce the issue, but I couldn't on x86_64. > > Yes, I cannot reproduce it on x86_64 either, but the issue is very > easily reproduced on ppc64 arch, which is where our QE reported. > Recently we have enabled --num-threads=N in rhel by default. N == > nr_cpus in 2nd kernel, so QE noticed the issue. I see, thank you for the information. > >> >> Do you have any possible scenario that breaks a vmcore? I could not >> think of it only by looking at the code. > > I guess the issue only been observed on ppc might be due to ppc's > memory model, multi-thread scheduling algorithm etc. I'm not an expert > on those. So I cannot give a clear explanation, sorry... ok, I also don't think of how to debug this well.. > > The page_flag_buf->ready is an integer that r/w by main and sub > threads simultaneously. And the assignment operation, like > page_flag_buf->ready = 1, might be composed of several assembly > instructions. Without atomic r/w (memory) protection, there might be > racing r/w just within the few instructions, which caused the data > inconsistency. Frankly the ppc assembly consists of more instructions > than x86_64 for the same c code, which enlarged the possibility of > data racing. > > We can observe the issue without the help of crash, just compare the > binary output of vmcore generated from the same core file, and > compress it with or without --num-threads option. Then compare it with > "cmp vmcore1 vmcore2" cmdline, and cmp will output bytes differ for > the 2 vmcores, and this is unexpected. > >> >> and this is just out of curiosity, is the issue reproduced with >> makedumpfile compiled with -O0 too? 
> > Sorry, I haven't done the -O0 experiment, I can do it tomorrow and > share my findings... Thanks, we have to fix this anyway, I want a clue to think about a possible scenario.. Thanks, Kazu From ltao at redhat.com Tue Jul 1 21:36:35 2025 From: ltao at redhat.com (Tao Liu) Date: Wed, 2 Jul 2025 16:36:35 +1200 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: <5c425f4e-4e89-4500-993e-e4dfec50a4fb@nec.com> References: <20250625022343.57529-2-ltao@redhat.com> <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> <5c425f4e-4e89-4500-993e-e4dfec50a4fb@nec.com> Message-ID: Hi Kazu, On Wed, Jul 2, 2025 at 12:13?PM HAGIO KAZUHITO(?????) wrote: > > On 2025/07/01 16:59, Tao Liu wrote: > > Hi Kazu, > > > > Thanks for your comments! > > > > On Tue, Jul 1, 2025 at 7:38?PM HAGIO KAZUHITO(?????) wrote: > >> > >> Hi Tao, > >> > >> thank you for the patch. > >> > >> On 2025/06/25 11:23, Tao Liu wrote: > >>> A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be > >>> reproduced with upstream makedumpfile. > >>> > >>> When analyzing the corrupt vmcore using crash, the following error > >>> message will output: > >>> > >>> crash: compressed kdump: uncompress failed: 0 > >>> crash: read error: kernel virtual address: c0001e2d2fe48000 type: > >>> "hardirq thread_union" > >>> crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000 > >>> crash: compressed kdump: uncompress failed: 0 > >>> > >>> If the vmcore is generated without num-threads option, then no such > >>> errors are noticed. > >>> > >>> With --num-threads=N enabled, there will be N sub-threads created. All > >>> sub-threads are producers which responsible for mm page processing, e.g. > >>> compression. The main thread is the consumer which responsible for > >>> writing the compressed data into file. page_flag_buf->ready is used to > >>> sync main and sub-threads. When a sub-thread finishes page processing, > >>> it will set ready flag to be FLAG_READY. In the meantime, main thread > >>> looply check all threads of the ready flags, and break the loop when > >>> find FLAG_READY. > >> > >> I've tried to reproduce the issue, but I couldn't on x86_64. > > > > Yes, I cannot reproduce it on x86_64 either, but the issue is very > > easily reproduced on ppc64 arch, which is where our QE reported. > > Recently we have enabled --num-threads=N in rhel by default. N == > > nr_cpus in 2nd kernel, so QE noticed the issue. > > I see, thank you for the information. > > > > >> > >> Do you have any possible scenario that breaks a vmcore? I could not > >> think of it only by looking at the code. > > > > I guess the issue only been observed on ppc might be due to ppc's > > memory model, multi-thread scheduling algorithm etc. I'm not an expert > > on those. So I cannot give a clear explanation, sorry... > > ok, I also don't think of how to debug this well.. > > > > > The page_flag_buf->ready is an integer that r/w by main and sub > > threads simultaneously. And the assignment operation, like > > page_flag_buf->ready = 1, might be composed of several assembly > > instructions. Without atomic r/w (memory) protection, there might be > > racing r/w just within the few instructions, which caused the data > > inconsistency. Frankly the ppc assembly consists of more instructions > > than x86_64 for the same c code, which enlarged the possibility of > > data racing. 
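(Illustrative sketch only, not taken from the makedumpfile sources: a minimal, self-contained example of the flag handshake described above, using the same GCC/Clang __atomic_load_n()/__atomic_store_n() builtins the patch switches to. The flag names and the busy-wait loop are simplified stand-ins; build with "gcc -O2 -pthread".)

/*
 * Simplified producer/consumer handshake.  A plain "ready = FLAG_READY;"
 * is the racy variant; the __atomic builtins make the flag update a
 * single atomic, ordered operation as in the patch.
 */
#include <pthread.h>
#include <stdio.h>

enum { FLAG_UNUSED, FLAG_FILLING, FLAG_READY };

static int ready = FLAG_UNUSED;

static void *producer(void *arg)
{
	(void)arg;
	/* ... compress one page here ... */
	/* publish the result to the consumer */
	__atomic_store_n(&ready, FLAG_READY, __ATOMIC_SEQ_CST);
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, producer, NULL);

	/* consumer: poll the flag until the producer has published its page */
	while (__atomic_load_n(&ready, __ATOMIC_SEQ_CST) != FLAG_READY)
		;

	pthread_join(tid, NULL);
	printf("page ready\n");
	return 0;
}

(One plausible scenario consistent with the ppc64-only symptom: besides the flag assignment being split into several instructions, the plain int accesses impose no ordering between filling the page data buffer and publishing FLAG_READY, so on a weakly ordered architecture the consumer may observe FLAG_READY before the data stores are visible. The __ATOMIC_SEQ_CST pair also acts as the needed barrier, while x86_64's stronger memory ordering makes the window much harder to hit.)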
> > > > We can observe the issue without the help of crash, just compare the > > binary output of vmcore generated from the same core file, and > > compress it with or without --num-threads option. Then compare it with > > "cmp vmcore1 vmcore2" cmdline, and cmp will output bytes differ for > > the 2 vmcores, and this is unexpected. > > > >> > >> and this is just out of curiosity, is the issue reproduced with > >> makedumpfile compiled with -O0 too? > > > > Sorry, I haven't done the -O0 experiment, I can do it tomorrow and > > share my findings... > > Thanks, we have to fix this anyway, I want a clue to think about a > possible scenario.. 1) Compiled with -O2 flag: [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile -d 31 -l ~/vmcore /tmp/out1 Copying data : [100.0 %] / eta: 0s The dumpfile is saved to /tmp/out1. makedumpfile Completed. [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile --num-threads=2 -d 31 -l ~/vmcore /tmp/out2 Copying data : [100.0 %] | eta: 0s Copying data : [100.0 %] \ eta: 0s The dumpfile is saved to /tmp/out2. makedumpfile Completed. [root at ibm-p10-01-lp45 makedumpfile]# cd /tmp [root at ibm-p10-01-lp45 tmp]# cmp out1 out2 out1 out2 differ: byte 20786414, line 108064 2) Compiled with -O0 flag: [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile -d 31 -l ~/vmcore /tmp/out3 Copying data : [100.0 %] / eta: 0s The dumpfile is saved to /tmp/out3. makedumpfile Completed. [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile --num-threads=2 -d 31 -l ~/vmcore /tmp/out4 Copying data : [100.0 %] | eta: 0s Copying data : [100.0 %] \ eta: 0s The dumpfile is saved to /tmp/out4. makedumpfile Completed. [root at ibm-p10-01-lp45 makedumpfile]# cd /tmp [root at ibm-p10-01-lp45 tmp]# cmp out3 out4 out3 out4 differ: byte 23948282, line 151739 Looks to me the O0/O2 have no difference for this case. If no problem, the /tmp/outX generated from both single/multi thread should be exactly the same, however the cmp reports there are differences. With the v2 patch applied, there is no such difference: [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile -d 31 -l ~/vmcore /tmp/out5 Copying data : [100.0 %] / eta: 0s The dumpfile is saved to /tmp/out5. makedumpfile Completed. [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile --num-threads=2 -d 31 -l ~/vmcore /tmp/out6 Copying data : [100.0 %] | eta: 0s Copying data : [100.0 %] \ eta: 0s The dumpfile is saved to /tmp/out6. makedumpfile Completed. [root at ibm-p10-01-lp45 makedumpfile]# cmp /tmp/out5 /tmp/out6 [root at ibm-p10-01-lp45 makedumpfile]# Thanks, Tao Liu > > Thanks, > Kazu From k-hagio-ab at nec.com Tue Jul 1 21:52:15 2025 From: k-hagio-ab at nec.com (=?utf-8?B?SEFHSU8gS0FaVUhJVE8o6JCp5bC+44CA5LiA5LuBKQ==?=) Date: Wed, 2 Jul 2025 04:52:15 +0000 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: References: <20250625022343.57529-2-ltao@redhat.com> <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> <5c425f4e-4e89-4500-993e-e4dfec50a4fb@nec.com> Message-ID: Hi Tao, On 2025/07/02 13:36, Tao Liu wrote: > Hi Kazu, > > On Wed, Jul 2, 2025 at 12:13?PM HAGIO KAZUHITO(?????) > wrote: >> >> On 2025/07/01 16:59, Tao Liu wrote: >>> Hi Kazu, >>> >>> Thanks for your comments! >>> >>> On Tue, Jul 1, 2025 at 7:38?PM HAGIO KAZUHITO(?????) wrote: >>>> >>>> Hi Tao, >>>> >>>> thank you for the patch. >>>> >>>> On 2025/06/25 11:23, Tao Liu wrote: >>>>> A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be >>>>> reproduced with upstream makedumpfile. 
>>>>> >>>>> When analyzing the corrupt vmcore using crash, the following error >>>>> message will output: >>>>> >>>>> crash: compressed kdump: uncompress failed: 0 >>>>> crash: read error: kernel virtual address: c0001e2d2fe48000 type: >>>>> "hardirq thread_union" >>>>> crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000 >>>>> crash: compressed kdump: uncompress failed: 0 >>>>> >>>>> If the vmcore is generated without num-threads option, then no such >>>>> errors are noticed. >>>>> >>>>> With --num-threads=N enabled, there will be N sub-threads created. All >>>>> sub-threads are producers which responsible for mm page processing, e.g. >>>>> compression. The main thread is the consumer which responsible for >>>>> writing the compressed data into file. page_flag_buf->ready is used to >>>>> sync main and sub-threads. When a sub-thread finishes page processing, >>>>> it will set ready flag to be FLAG_READY. In the meantime, main thread >>>>> looply check all threads of the ready flags, and break the loop when >>>>> find FLAG_READY. >>>> >>>> I've tried to reproduce the issue, but I couldn't on x86_64. >>> >>> Yes, I cannot reproduce it on x86_64 either, but the issue is very >>> easily reproduced on ppc64 arch, which is where our QE reported. >>> Recently we have enabled --num-threads=N in rhel by default. N == >>> nr_cpus in 2nd kernel, so QE noticed the issue. >> >> I see, thank you for the information. >> >>> >>>> >>>> Do you have any possible scenario that breaks a vmcore? I could not >>>> think of it only by looking at the code. >>> >>> I guess the issue only been observed on ppc might be due to ppc's >>> memory model, multi-thread scheduling algorithm etc. I'm not an expert >>> on those. So I cannot give a clear explanation, sorry... >> >> ok, I also don't think of how to debug this well.. >> >>> >>> The page_flag_buf->ready is an integer that r/w by main and sub >>> threads simultaneously. And the assignment operation, like >>> page_flag_buf->ready = 1, might be composed of several assembly >>> instructions. Without atomic r/w (memory) protection, there might be >>> racing r/w just within the few instructions, which caused the data >>> inconsistency. Frankly the ppc assembly consists of more instructions >>> than x86_64 for the same c code, which enlarged the possibility of >>> data racing. >>> >>> We can observe the issue without the help of crash, just compare the >>> binary output of vmcore generated from the same core file, and >>> compress it with or without --num-threads option. Then compare it with >>> "cmp vmcore1 vmcore2" cmdline, and cmp will output bytes differ for >>> the 2 vmcores, and this is unexpected. >>> >>>> >>>> and this is just out of curiosity, is the issue reproduced with >>>> makedumpfile compiled with -O0 too? >>> >>> Sorry, I haven't done the -O0 experiment, I can do it tomorrow and >>> share my findings... >> >> Thanks, we have to fix this anyway, I want a clue to think about a >> possible scenario.. > > 1) Compiled with -O2 flag: > > [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile -d 31 -l ~/vmcore /tmp/out1 > Copying data : [100.0 %] / > eta: 0s > > The dumpfile is saved to /tmp/out1. > > makedumpfile Completed. > [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile --num-threads=2 -d > 31 -l ~/vmcore /tmp/out2 > Copying data : [100.0 %] | > eta: 0s > Copying data : [100.0 %] \ > eta: 0s > > The dumpfile is saved to /tmp/out2. > > makedumpfile Completed. 
> [root at ibm-p10-01-lp45 makedumpfile]# cd /tmp > [root at ibm-p10-01-lp45 tmp]# cmp out1 out2 > out1 out2 differ: byte 20786414, line 108064 > > 2) Compiled with -O0 flag: > > [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile -d 31 -l ~/vmcore /tmp/out3 > Copying data : [100.0 %] / > eta: 0s > > The dumpfile is saved to /tmp/out3. > > makedumpfile Completed. > [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile --num-threads=2 -d > 31 -l ~/vmcore /tmp/out4 > Copying data : [100.0 %] | > eta: 0s > Copying data : [100.0 %] \ > eta: 0s > > The dumpfile is saved to /tmp/out4. > > makedumpfile Completed. > [root at ibm-p10-01-lp45 makedumpfile]# cd /tmp > [root at ibm-p10-01-lp45 tmp]# cmp out3 out4 > out3 out4 differ: byte 23948282, line 151739 > > Looks to me the O0/O2 have no difference for this case. If no problem, > the /tmp/outX generated from both single/multi thread should be > exactly the same, however the cmp reports there are differences. With > the v2 patch applied, there is no such difference: > > [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile -d 31 -l ~/vmcore /tmp/out5 > Copying data : [100.0 %] / > eta: 0s > > The dumpfile is saved to /tmp/out5. > > makedumpfile Completed. > [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile --num-threads=2 -d > 31 -l ~/vmcore /tmp/out6 > Copying data : [100.0 %] | > eta: 0s > Copying data : [100.0 %] \ > eta: 0s > > The dumpfile is saved to /tmp/out6. > > makedumpfile Completed. > [root at ibm-p10-01-lp45 makedumpfile]# cmp /tmp/out5 /tmp/out6 > [root at ibm-p10-01-lp45 makedumpfile]# thank you for testing! sorry one more thing, does --num-threads=1 break the vmcore? Thanks, Kazu From ltao at redhat.com Tue Jul 1 22:03:01 2025 From: ltao at redhat.com (Tao Liu) Date: Wed, 2 Jul 2025 17:03:01 +1200 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: References: <20250625022343.57529-2-ltao@redhat.com> <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> <5c425f4e-4e89-4500-993e-e4dfec50a4fb@nec.com> Message-ID: Hi Kazu, On Wed, Jul 2, 2025 at 4:52?PM HAGIO KAZUHITO(?????) wrote: > > Hi Tao, > > On 2025/07/02 13:36, Tao Liu wrote: > > Hi Kazu, > > > > On Wed, Jul 2, 2025 at 12:13?PM HAGIO KAZUHITO(?????) > > wrote: > >> > >> On 2025/07/01 16:59, Tao Liu wrote: > >>> Hi Kazu, > >>> > >>> Thanks for your comments! > >>> > >>> On Tue, Jul 1, 2025 at 7:38?PM HAGIO KAZUHITO(?????) wrote: > >>>> > >>>> Hi Tao, > >>>> > >>>> thank you for the patch. > >>>> > >>>> On 2025/06/25 11:23, Tao Liu wrote: > >>>>> A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be > >>>>> reproduced with upstream makedumpfile. > >>>>> > >>>>> When analyzing the corrupt vmcore using crash, the following error > >>>>> message will output: > >>>>> > >>>>> crash: compressed kdump: uncompress failed: 0 > >>>>> crash: read error: kernel virtual address: c0001e2d2fe48000 type: > >>>>> "hardirq thread_union" > >>>>> crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000 > >>>>> crash: compressed kdump: uncompress failed: 0 > >>>>> > >>>>> If the vmcore is generated without num-threads option, then no such > >>>>> errors are noticed. > >>>>> > >>>>> With --num-threads=N enabled, there will be N sub-threads created. All > >>>>> sub-threads are producers which responsible for mm page processing, e.g. > >>>>> compression. The main thread is the consumer which responsible for > >>>>> writing the compressed data into file. page_flag_buf->ready is used to > >>>>> sync main and sub-threads. 
When a sub-thread finishes page processing, > >>>>> it will set ready flag to be FLAG_READY. In the meantime, main thread > >>>>> looply check all threads of the ready flags, and break the loop when > >>>>> find FLAG_READY. > >>>> > >>>> I've tried to reproduce the issue, but I couldn't on x86_64. > >>> > >>> Yes, I cannot reproduce it on x86_64 either, but the issue is very > >>> easily reproduced on ppc64 arch, which is where our QE reported. > >>> Recently we have enabled --num-threads=N in rhel by default. N == > >>> nr_cpus in 2nd kernel, so QE noticed the issue. > >> > >> I see, thank you for the information. > >> > >>> > >>>> > >>>> Do you have any possible scenario that breaks a vmcore? I could not > >>>> think of it only by looking at the code. > >>> > >>> I guess the issue only been observed on ppc might be due to ppc's > >>> memory model, multi-thread scheduling algorithm etc. I'm not an expert > >>> on those. So I cannot give a clear explanation, sorry... > >> > >> ok, I also don't think of how to debug this well.. > >> > >>> > >>> The page_flag_buf->ready is an integer that r/w by main and sub > >>> threads simultaneously. And the assignment operation, like > >>> page_flag_buf->ready = 1, might be composed of several assembly > >>> instructions. Without atomic r/w (memory) protection, there might be > >>> racing r/w just within the few instructions, which caused the data > >>> inconsistency. Frankly the ppc assembly consists of more instructions > >>> than x86_64 for the same c code, which enlarged the possibility of > >>> data racing. > >>> > >>> We can observe the issue without the help of crash, just compare the > >>> binary output of vmcore generated from the same core file, and > >>> compress it with or without --num-threads option. Then compare it with > >>> "cmp vmcore1 vmcore2" cmdline, and cmp will output bytes differ for > >>> the 2 vmcores, and this is unexpected. > >>> > >>>> > >>>> and this is just out of curiosity, is the issue reproduced with > >>>> makedumpfile compiled with -O0 too? > >>> > >>> Sorry, I haven't done the -O0 experiment, I can do it tomorrow and > >>> share my findings... > >> > >> Thanks, we have to fix this anyway, I want a clue to think about a > >> possible scenario.. > > > > 1) Compiled with -O2 flag: > > > > [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile -d 31 -l ~/vmcore /tmp/out1 > > Copying data : [100.0 %] / > > eta: 0s > > > > The dumpfile is saved to /tmp/out1. > > > > makedumpfile Completed. > > [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile --num-threads=2 -d > > 31 -l ~/vmcore /tmp/out2 > > Copying data : [100.0 %] | > > eta: 0s > > Copying data : [100.0 %] \ > > eta: 0s > > > > The dumpfile is saved to /tmp/out2. > > > > makedumpfile Completed. > > [root at ibm-p10-01-lp45 makedumpfile]# cd /tmp > > [root at ibm-p10-01-lp45 tmp]# cmp out1 out2 > > out1 out2 differ: byte 20786414, line 108064 > > > > 2) Compiled with -O0 flag: > > > > [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile -d 31 -l ~/vmcore /tmp/out3 > > Copying data : [100.0 %] / > > eta: 0s > > > > The dumpfile is saved to /tmp/out3. > > > > makedumpfile Completed. > > [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile --num-threads=2 -d > > 31 -l ~/vmcore /tmp/out4 > > Copying data : [100.0 %] | > > eta: 0s > > Copying data : [100.0 %] \ > > eta: 0s > > > > The dumpfile is saved to /tmp/out4. > > > > makedumpfile Completed. 
> > [root at ibm-p10-01-lp45 makedumpfile]# cd /tmp > > [root at ibm-p10-01-lp45 tmp]# cmp out3 out4 > > out3 out4 differ: byte 23948282, line 151739 > > > > Looks to me the O0/O2 have no difference for this case. If no problem, > > the /tmp/outX generated from both single/multi thread should be > > exactly the same, however the cmp reports there are differences. With > > the v2 patch applied, there is no such difference: > > > > [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile -d 31 -l ~/vmcore /tmp/out5 > > Copying data : [100.0 %] / > > eta: 0s > > > > The dumpfile is saved to /tmp/out5. > > > > makedumpfile Completed. > > [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile --num-threads=2 -d > > 31 -l ~/vmcore /tmp/out6 > > Copying data : [100.0 %] | > > eta: 0s > > Copying data : [100.0 %] \ > > eta: 0s > > > > The dumpfile is saved to /tmp/out6. > > > > makedumpfile Completed. > > [root at ibm-p10-01-lp45 makedumpfile]# cmp /tmp/out5 /tmp/out6 > > [root at ibm-p10-01-lp45 makedumpfile]# > > thank you for testing! sorry one more thing, > does --num-threads=1 break the vmcore? Yes: [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile -d 31 -l ~/vmcore /tmp/out7 Copying data : [100.0 %] / eta: 0s The dumpfile is saved to /tmp/out7. makedumpfile Completed. [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile --num-threads=1 -d 31 -l ~/vmcore /tmp/out8 Copying data : [100.0 %] - eta: 0s Copying data : [100.0 %] / eta: 0s The dumpfile is saved to /tmp/out8. makedumpfile Completed. [root at ibm-p10-01-lp45 makedumpfile]# cmp /tmp/out7 /tmp/out8 /tmp/out7 /tmp/out8 differ: byte 11119019, line 49418 > > Thanks, > Kazu From sourabhjain at linux.ibm.com Tue Jul 1 22:03:59 2025 From: sourabhjain at linux.ibm.com (Sourabh Jain) Date: Wed, 2 Jul 2025 10:33:59 +0530 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: References: <20250625022343.57529-2-ltao@redhat.com> <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> <5c425f4e-4e89-4500-993e-e4dfec50a4fb@nec.com> Message-ID: <32246350-8a68-4d46-9103-9a633d3cfa97@linux.ibm.com> Hello Kazu, On 02/07/25 10:22, HAGIO KAZUHITO(?? ??) wrote: > Hi Tao, > > On 2025/07/02 13:36, Tao Liu wrote: >> Hi Kazu, >> >> On Wed, Jul 2, 2025 at 12:13?PM HAGIO KAZUHITO(?????) >> wrote: >>> On 2025/07/01 16:59, Tao Liu wrote: >>>> Hi Kazu, >>>> >>>> Thanks for your comments! >>>> >>>> On Tue, Jul 1, 2025 at 7:38?PM HAGIO KAZUHITO(?????) wrote: >>>>> Hi Tao, >>>>> >>>>> thank you for the patch. >>>>> >>>>> On 2025/06/25 11:23, Tao Liu wrote: >>>>>> A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be >>>>>> reproduced with upstream makedumpfile. >>>>>> >>>>>> When analyzing the corrupt vmcore using crash, the following error >>>>>> message will output: >>>>>> >>>>>> crash: compressed kdump: uncompress failed: 0 >>>>>> crash: read error: kernel virtual address: c0001e2d2fe48000 type: >>>>>> "hardirq thread_union" >>>>>> crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000 >>>>>> crash: compressed kdump: uncompress failed: 0 >>>>>> >>>>>> If the vmcore is generated without num-threads option, then no such >>>>>> errors are noticed. >>>>>> >>>>>> With --num-threads=N enabled, there will be N sub-threads created. All >>>>>> sub-threads are producers which responsible for mm page processing, e.g. >>>>>> compression. The main thread is the consumer which responsible for >>>>>> writing the compressed data into file. page_flag_buf->ready is used to >>>>>> sync main and sub-threads. 
When a sub-thread finishes page processing, >>>>>> it will set ready flag to be FLAG_READY. In the meantime, main thread >>>>>> looply check all threads of the ready flags, and break the loop when >>>>>> find FLAG_READY. >>>>> I've tried to reproduce the issue, but I couldn't on x86_64. >>>> Yes, I cannot reproduce it on x86_64 either, but the issue is very >>>> easily reproduced on ppc64 arch, which is where our QE reported. >>>> Recently we have enabled --num-threads=N in rhel by default. N == >>>> nr_cpus in 2nd kernel, so QE noticed the issue. >>> I see, thank you for the information. >>> >>>>> Do you have any possible scenario that breaks a vmcore? I could not >>>>> think of it only by looking at the code. >>>> I guess the issue only been observed on ppc might be due to ppc's >>>> memory model, multi-thread scheduling algorithm etc. I'm not an expert >>>> on those. So I cannot give a clear explanation, sorry... >>> ok, I also don't think of how to debug this well.. >>> >>>> The page_flag_buf->ready is an integer that r/w by main and sub >>>> threads simultaneously. And the assignment operation, like >>>> page_flag_buf->ready = 1, might be composed of several assembly >>>> instructions. Without atomic r/w (memory) protection, there might be >>>> racing r/w just within the few instructions, which caused the data >>>> inconsistency. Frankly the ppc assembly consists of more instructions >>>> than x86_64 for the same c code, which enlarged the possibility of >>>> data racing. >>>> >>>> We can observe the issue without the help of crash, just compare the >>>> binary output of vmcore generated from the same core file, and >>>> compress it with or without --num-threads option. Then compare it with >>>> "cmp vmcore1 vmcore2" cmdline, and cmp will output bytes differ for >>>> the 2 vmcores, and this is unexpected. >>>> >>>>> and this is just out of curiosity, is the issue reproduced with >>>>> makedumpfile compiled with -O0 too? >>>> Sorry, I haven't done the -O0 experiment, I can do it tomorrow and >>>> share my findings... >>> Thanks, we have to fix this anyway, I want a clue to think about a >>> possible scenario.. >> 1) Compiled with -O2 flag: >> >> [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile -d 31 -l ~/vmcore /tmp/out1 >> Copying data : [100.0 %] / >> eta: 0s >> >> The dumpfile is saved to /tmp/out1. >> >> makedumpfile Completed. >> [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile --num-threads=2 -d >> 31 -l ~/vmcore /tmp/out2 >> Copying data : [100.0 %] | >> eta: 0s >> Copying data : [100.0 %] \ >> eta: 0s >> >> The dumpfile is saved to /tmp/out2. >> >> makedumpfile Completed. >> [root at ibm-p10-01-lp45 makedumpfile]# cd /tmp >> [root at ibm-p10-01-lp45 tmp]# cmp out1 out2 >> out1 out2 differ: byte 20786414, line 108064 >> >> 2) Compiled with -O0 flag: >> >> [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile -d 31 -l ~/vmcore /tmp/out3 >> Copying data : [100.0 %] / >> eta: 0s >> >> The dumpfile is saved to /tmp/out3. >> >> makedumpfile Completed. >> [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile --num-threads=2 -d >> 31 -l ~/vmcore /tmp/out4 >> Copying data : [100.0 %] | >> eta: 0s >> Copying data : [100.0 %] \ >> eta: 0s >> >> The dumpfile is saved to /tmp/out4. >> >> makedumpfile Completed. >> [root at ibm-p10-01-lp45 makedumpfile]# cd /tmp >> [root at ibm-p10-01-lp45 tmp]# cmp out3 out4 >> out3 out4 differ: byte 23948282, line 151739 >> >> Looks to me the O0/O2 have no difference for this case. 
If no problem, >> the /tmp/outX generated from both single/multi thread should be >> exactly the same, however the cmp reports there are differences. With >> the v2 patch applied, there is no such difference: >> >> [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile -d 31 -l ~/vmcore /tmp/out5 >> Copying data : [100.0 %] / >> eta: 0s >> >> The dumpfile is saved to /tmp/out5. >> >> makedumpfile Completed. >> [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile --num-threads=2 -d >> 31 -l ~/vmcore /tmp/out6 >> Copying data : [100.0 %] | >> eta: 0s >> Copying data : [100.0 %] \ >> eta: 0s >> >> The dumpfile is saved to /tmp/out6. >> >> makedumpfile Completed. >> [root at ibm-p10-01-lp45 makedumpfile]# cmp /tmp/out5 /tmp/out6 >> [root at ibm-p10-01-lp45 makedumpfile]# > thank you for testing! sorry one more thing, > does --num-threads=1 break the vmcore? I was able to reproduce this issue with --num-threads=1. The reason is that when --num-threads is specified, makedumpfile uses one producer and one consumer thread. So even with --num-threads=1, multithreading is still in effect. Thanks, Sourabh Jain From k-hagio-ab at nec.com Tue Jul 1 23:02:56 2025 From: k-hagio-ab at nec.com (=?utf-8?B?SEFHSU8gS0FaVUhJVE8o6JCp5bC+44CA5LiA5LuBKQ==?=) Date: Wed, 2 Jul 2025 06:02:56 +0000 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: References: <20250625022343.57529-2-ltao@redhat.com> <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> <5c425f4e-4e89-4500-993e-e4dfec50a4fb@nec.com> Message-ID: <29488c31-41a4-4dee-b768-9f8c49deeea6@nec.com> Hi Tao, Sourabh, >> thank you for testing! sorry one more thing, >> does --num-threads=1 break the vmcore? > > Yes: Thank you for testing and information, certainly the race occurs between the main and sub-thread, I will check the code again. If you could determine how it breaks the vmcore, please let me know. It will be better to add the scenario to the commit log. Thanks, Kazu > > [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile -d 31 -l ~/vmcore /tmp/out7 > Copying data : [100.0 %] / > eta: 0s > > The dumpfile is saved to /tmp/out7. > > makedumpfile Completed. > [root at ibm-p10-01-lp45 makedumpfile]# ./makedumpfile --num-threads=1 -d > 31 -l ~/vmcore /tmp/out8 > Copying data : [100.0 %] - > eta: 0s > Copying data : [100.0 %] / > eta: 0s > > The dumpfile is saved to /tmp/out8. > > makedumpfile Completed. > [root at ibm-p10-01-lp45 makedumpfile]# cmp /tmp/out7 /tmp/out8 > /tmp/out7 /tmp/out8 differ: byte 11119019, line 49418 > >> >> Thanks, >> Kazu > From prudo at redhat.com Wed Jul 2 02:17:51 2025 From: prudo at redhat.com (Philipp Rudo) Date: Wed, 2 Jul 2025 11:17:51 +0200 Subject: [PATCHv3 5/9] kexec: Introduce kexec_pe_image to parse and load PE file In-Reply-To: References: <20250529041744.16458-1-piliu@redhat.com> <20250529041744.16458-6-piliu@redhat.com> <20250625200950.16d7a09c@rotkaeppchen> Message-ID: <20250702111751.2b43aea2@rotkaeppchen> Hi Pingfan, On Mon, 30 Jun 2025 21:45:05 +0800 Pingfan Liu wrote: > On Wed, Jun 25, 2025 at 08:09:50PM +0200, Philipp Rudo wrote: > > Hi Pingfan, > > > > On Thu, 29 May 2025 12:17:40 +0800 > > Pingfan Liu wrote: > > > > > As UEFI becomes popular, a few architectures support to boot a PE format > > > kernel image directly. But the internal of PE format varies, which means > > > each parser for each format. 
> > > > > > This patch (with the rest in this series) introduces a common skeleton > > > to all parsers, and leave the format parsing in > > > bpf-prog, so the kernel code can keep relative stable. > > > > > > A new kexec_file_ops is implementation, named pe_image_ops. > > > > > > There are some place holder function in this patch. (They will take > > > effect after the introduction of kexec bpf light skeleton and bpf > > > helpers). Overall the parsing progress is a pipeline, the current > > > bpf-prog parser is attached to bpf_handle_pefile(), and detatched at the > > > end of the current stage 'disarm_bpf_prog()' the current parsed result > > > by the current bpf-prog will be buffered in kernel 'prepare_nested_pe()' > > > , and deliver to the next stage. For each stage, the bpf bytecode is > > > extracted from the '.bpf' section in the PE file. > > > > > > Signed-off-by: Pingfan Liu > > > Cc: Baoquan He > > > Cc: Dave Young > > > Cc: Andrew Morton > > > Cc: Philipp Rudo > > > To: kexec at lists.infradead.org > > > --- > > > include/linux/kexec.h | 1 + > > > kernel/Kconfig.kexec | 8 + > > > kernel/Makefile | 1 + > > > kernel/kexec_pe_image.c | 356 ++++++++++++++++++++++++++++++++++++++++ > > > 4 files changed, 366 insertions(+) > > > create mode 100644 kernel/kexec_pe_image.c > > > > > [...] > > > > > diff --git a/kernel/kexec_pe_image.c b/kernel/kexec_pe_image.c > > > new file mode 100644 > > > index 0000000000000..3097efccb8502 > > > --- /dev/null > > > +++ b/kernel/kexec_pe_image.c > > > @@ -0,0 +1,356 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > +/* > > > + * Kexec PE image loader > > > + > > > + * Copyright (C) 2025 Red Hat, Inc > > > + */ > > > + > > > +#define pr_fmt(fmt) "kexec_file(Image): " fmt > > > + > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > +#include > > > + > > > + > > > +static LIST_HEAD(phase_head); > > > + > > > +struct parsed_phase { > > > + struct list_head head; > > > + struct list_head res_head; > > > +}; > > > + > > > +static struct parsed_phase *cur_phase; > > > + > > > +static char *kexec_res_names[3] = {"kernel", "initrd", "cmdline"}; > > > > Wouldn't it be better to use a enum rather than strings for the > > different resources? Especially as in prepare_nested_pe you are > > I plan to make bpf_copy_to_kernel() fit for more cases besides kexec. So > string may be better choice, and I think it is better to have a > subsystem prefix, like "kexec:kernel" True, although an enum could be utilized directly as, e.g. an index for an array directly. Anyway, I don't think there is a single 'best' solution here. So feel free to use strings. > > comparing two strings using == instead of strcmp(). So IIUC it should > > always return false. > > > > Oops, I will fix that. In fact, I meaned to assign the pointer > kexec_res_names[i] to kexec_res.name in bpf_kexec_carrier(). Later in > prepare_nested_pe() can compare two pointers. 
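(Standalone illustration of the distinction being discussed, not part of the patch: "==" on char pointers compares addresses, so it is only meaningful when res->name stores the very pointer taken from kexec_res_names[]; any other buffer with equal contents needs strcmp(). The array name is reused from the quoted code for clarity.)

#include <stdio.h>
#include <string.h>

static char *kexec_res_names[3] = {"kernel", "initrd", "cmdline"};

int main(void)
{
	char *name = kexec_res_names[0];	/* the pointer that would be stored in res->name */
	char other[] = "kernel";		/* equal contents, different address */

	printf("%d\n", name == kexec_res_names[0]);		/* 1: same pointer */
	printf("%d\n", other == kexec_res_names[0]);		/* 0: address comparison */
	printf("%d\n", strcmp(other, kexec_res_names[0]) == 0);	/* 1: content comparison */
	return 0;
}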
> > > > > +struct kexec_res { > > > + struct list_head node; > > > + char *name; > > > + /* The free of buffer is deferred to kimage_file_post_load_cleanup */ > > > + bool deferred_free; > > > + struct mem_range_result *r; > > > +}; > > > + > > > +static struct parsed_phase *alloc_new_phase(void) > > > +{ > > > + struct parsed_phase *phase = kzalloc(sizeof(struct parsed_phase), GFP_KERNEL); > > > + > > > + INIT_LIST_HEAD(&phase->head); > > > + INIT_LIST_HEAD(&phase->res_head); > > > + list_add_tail(&phase->head, &phase_head); > > > + > > > + return phase; > > > +} > > > > I must admit I don't fully understand how you are handling the > > different phases. In particular I don't understand why you are keeping > > all the resources a phase returned once it is finished. The way I see > > it those resources are only needed once as input for the next phase. So > > it should be sufficient to only keep a single kexec_context and update > > it when a phase returns a new resource. The way I see it this should > > simplify pe_image_load quite a bit. Or am I missing something? > > > > Let us say an aarch64 zboot image embeded in UKI's .linux section. > The UKI parser takes apart the image into kernel, initrd, cmdline. > And the kernel part contains the zboot PE, including zboot parser. > The zboot parser needn't to handle either initrd or cmdline. > So I use the phases, and the leaf node is the final parsed result. Right, that's how the code is working. My point was that when you have multiple phases working on the same component, e.g. the kernel image, then you still keep all the intermediate kernel images in memory until the end. Even though the intermediate images are only used as an input for the next phase(s). So my suggestion is to remove them immediately once a phase returns a new image. My expectation is that this not only reduces the memory usage but also simplifies the code. Thanks Philipp > > > +static bool is_valid_pe(const char *kernel_buf, unsigned long kernel_len) > > > +{ > > > + struct mz_hdr *mz; > > > + struct pe_hdr *pe; > > > + > > > + if (!kernel_buf) > > > + return false; > > > + mz = (struct mz_hdr *)kernel_buf; > > > + if (mz->magic != MZ_MAGIC) > > > + return false; > > > + pe = (struct pe_hdr *)(kernel_buf + mz->peaddr); > > > + if (pe->magic != PE_MAGIC) > > > + return false; > > > + if (pe->opt_hdr_size == 0) { > > > + pr_err("optional header is missing\n"); > > > + return false; > > > + } > > > + > > > + return true; > > > +} > > > + > > > +static bool is_valid_format(const char *kernel_buf, unsigned long kernel_len) > > > +{ > > > + return is_valid_pe(kernel_buf, kernel_len); > > > +} > > > + > > > +/* > > > + * The UEFI Terse Executable (TE) image has MZ header. > > > + */ > > > +static int pe_image_probe(const char *kernel_buf, unsigned long kernel_len) > > > +{ > > > + return is_valid_pe(kernel_buf, kernel_len) ? 0 : -1; > > > > Every image, at least on x86, is a valid pe file. So we should check > > for the .bpf section rather than the header. > > > > You are right that it should include the check on the existence of .bpf > section. On the other hand, the check on PE header in kernel can ensure > the kexec-tools passes the right candidate for this parser. > > > > +} > > > + > > > +static int get_pe_section(char *file_buf, const char *sect_name, > > > > s/get_pe_section/pe_get_section/ ? > > that would make it more consistent with the other functions. > > Sure. I will fix it. > > > Thanks for your careful review. 
> > > Best Regards, > > Pingfan > > > > > Thanks > > Philipp > > > > > + char **sect_start, unsigned long *sect_sz) > > > +{ > > > + struct pe_hdr *pe_hdr; > > > + struct pe32plus_opt_hdr *opt_hdr; > > > + struct section_header *sect_hdr; > > > + int section_nr, i; > > > + struct mz_hdr *mz = (struct mz_hdr *)file_buf; > > > + > > > + *sect_start = NULL; > > > + *sect_sz = 0; > > > + pe_hdr = (struct pe_hdr *)(file_buf + mz->peaddr); > > > + section_nr = pe_hdr->sections; > > > + opt_hdr = (struct pe32plus_opt_hdr *)(file_buf + mz->peaddr + sizeof(struct pe_hdr)); > > > + sect_hdr = (struct section_header *)((char *)opt_hdr + pe_hdr->opt_hdr_size); > > > + > > > + for (i = 0; i < section_nr; i++) { > > > + if (strcmp(sect_hdr->name, sect_name) == 0) { > > > + *sect_start = file_buf + sect_hdr->data_addr; > > > + *sect_sz = sect_hdr->raw_data_size; > > > + return 0; > > > + } > > > + sect_hdr++; > > > + } > > > + > > > + return -1; > > > +} > > > + > > > +static bool pe_has_bpf_section(char *file_buf, unsigned long pe_sz) > > > +{ > > > + char *sect_start = NULL; > > > + unsigned long sect_sz = 0; > > > + int ret; > > > + > > > + ret = get_pe_section(file_buf, ".bpf", §_start, §_sz); > > > + if (ret < 0) > > > + return false; > > > + return true; > > > +} > > > + > > > +/* Load a ELF */ > > > +static int arm_bpf_prog(char *bpf_elf, unsigned long sz) > > > +{ > > > + return 0; > > > +} > > > + > > > +static void disarm_bpf_prog(void) > > > +{ > > > +} > > > + > > > +struct kexec_context { > > > + bool kdump; > > > + char *image; > > > + int image_sz; > > > + char *initrd; > > > + int initrd_sz; > > > + char *cmdline; > > > + int cmdline_sz; > > > +}; > > > + > > > +void bpf_handle_pefile(struct kexec_context *context); > > > +void bpf_post_handle_pefile(struct kexec_context *context); > > > + > > > + > > > +/* > > > + * optimize("O0") prevents inline, compiler constant propagation > > > + */ > > > +__attribute__((used, optimize("O0"))) void bpf_handle_pefile(struct kexec_context *context) > > > +{ > > > +} > > > + > > > +__attribute__((used, optimize("O0"))) void bpf_post_handle_pefile(struct kexec_context *context) > > > +{ > > > +} > > > + > > > +/* > > > + * PE file may be nested and should be unfold one by one. > > > + * Query 'kernel', 'initrd', 'cmdline' in cur_phase, as they are inputs for the > > > + * next phase. 
> > > + */ > > > +static int prepare_nested_pe(char **kernel, unsigned long *kernel_len, char **initrd, > > > + unsigned long *initrd_len, char **cmdline) > > > +{ > > > + struct kexec_res *res; > > > + int ret = -1; > > > + > > > + *kernel = NULL; > > > + *kernel_len = 0; > > > + > > > + list_for_each_entry(res, &cur_phase->res_head, node) { > > > + if (res->name == kexec_res_names[0]) { > > > + *kernel = res->r->buf; > > > + *kernel_len = res->r->data_sz; > > > + ret = 0; > > > + } else if (res->name == kexec_res_names[1]) { > > > + *initrd = res->r->buf; > > > + *initrd_len = res->r->data_sz; > > > + } else if (res->name == kexec_res_names[2]) { > > > + *cmdline = res->r->buf; > > > + } > > > + } > > > + > > > + return ret; > > > +} > > > + > > > +static void *pe_image_load(struct kimage *image, > > > + char *kernel, unsigned long kernel_len, > > > + char *initrd, unsigned long initrd_len, > > > + char *cmdline, unsigned long cmdline_len) > > > +{ > > > + char *parsed_kernel = NULL; > > > + unsigned long parsed_len; > > > + char *linux_start, *initrd_start, *cmdline_start, *bpf_start; > > > + unsigned long linux_sz, initrd_sz, cmdline_sz, bpf_sz; > > > + struct parsed_phase *phase, *phase_tmp; > > > + struct kexec_res *res, *res_tmp; > > > + void *ldata; > > > + int ret; > > > + > > > + linux_start = kernel; > > > + linux_sz = kernel_len; > > > + initrd_start = initrd; > > > + initrd_sz = initrd_len; > > > + cmdline_start = cmdline; > > > + cmdline_sz = cmdline_len; > > > + > > > + while (is_valid_format(linux_start, linux_sz) && > > > + pe_has_bpf_section(linux_start, linux_sz)) { > > > + struct kexec_context context; > > > + > > > + get_pe_section(linux_start, ".bpf", &bpf_start, &bpf_sz); > > > + if (!!bpf_sz) { > > > + /* load and attach bpf-prog */ > > > + ret = arm_bpf_prog(bpf_start, bpf_sz); > > > + if (ret) { > > > + pr_err("Fail to load .bpf section\n"); > > > + ldata = ERR_PTR(ret); > > > + goto err; > > > + } > > > + } > > > + cur_phase = alloc_new_phase(); > > > + if (image->type != KEXEC_TYPE_CRASH) > > > + context.kdump = false; > > > + else > > > + context.kdump = true; > > > + context.image = linux_start; > > > + context.image_sz = linux_sz; > > > + context.initrd = initrd_start; > > > + context.initrd_sz = initrd_sz; > > > + context.cmdline = cmdline_start; > > > + context.cmdline_sz = strlen(cmdline_start); > > > + /* bpf-prog fentry, which handle above buffers. */ > > > + bpf_handle_pefile(&context); > > > + > > > + prepare_nested_pe(&linux_start, &linux_sz, &initrd_start, > > > + &initrd_sz, &cmdline_start); > > > + /* bpf-prog fentry */ > > > + bpf_post_handle_pefile(&context); > > > + /* > > > + * detach the current bpf-prog from their attachment points. > > > + * It also a point to free any registered interim resource. > > > + * Any resource except attached to phase is interim. 
> > > + */ > > > + disarm_bpf_prog(); > > > + } > > > + > > > + /* the rear of parsed phase contains the result */ > > > + list_for_each_entry_reverse(phase, &phase_head, head) { > > > + if (initrd != NULL && cmdline != NULL && parsed_kernel != NULL) > > > + break; > > > + list_for_each_entry(res, &phase->res_head, node) { > > > + if (!strcmp(res->name, "kernel") && !parsed_kernel) { > > > + parsed_kernel = res->r->buf; > > > + parsed_len = res->r->data_sz; > > > + res->deferred_free = true; > > > + } else if (!strcmp(res->name, "initrd") && !initrd) { > > > + initrd = res->r->buf; > > > + initrd_len = res->r->data_sz; > > > + res->deferred_free = true; > > > + } else if (!strcmp(res->name, "cmdline") && !cmdline) { > > > + cmdline = res->r->buf; > > > + cmdline_len = res->r->data_sz; > > > + res->deferred_free = true; > > > + } > > > + } > > > + > > > + } > > > + > > > + if (initrd == NULL || cmdline == NULL || parsed_kernel == NULL) { > > > + char *c, buf[64]; > > > + > > > + c = buf; > > > + if (parsed_kernel == NULL) { > > > + strcpy(c, "kernel "); > > > + c += strlen("kernel "); > > > + } > > > + if (initrd == NULL) { > > > + strcpy(c, "initrd "); > > > + c += strlen("initrd "); > > > + } > > > + if (cmdline == NULL) { > > > + strcpy(c, "cmdline "); > > > + c += strlen("cmdline "); > > > + } > > > + c = '\0'; > > > + pr_err("Can not extract data for %s", buf); > > > + ldata = ERR_PTR(-EINVAL); > > > + goto err; > > > + } > > > + /* > > > + * image's kernel_buf, initrd_buf, cmdline_buf are set. Now they should > > > + * be updated to the new content. > > > + */ > > > + if (image->kernel_buf != parsed_kernel) { > > > + vfree(image->kernel_buf); > > > + image->kernel_buf = parsed_kernel; > > > + image->kernel_buf_len = parsed_len; > > > + } > > > + if (image->initrd_buf != initrd) { > > > + vfree(image->initrd_buf); > > > + image->initrd_buf = initrd; > > > + image->initrd_buf_len = initrd_len; > > > + } > > > + if (image->cmdline_buf != cmdline) { > > > + kfree(image->cmdline_buf); > > > + image->cmdline_buf = cmdline; > > > + image->cmdline_buf_len = cmdline_len; > > > + } > > > + ret = arch_kexec_kernel_image_probe(image, image->kernel_buf, > > > + image->kernel_buf_len); > > > + if (ret) { > > > + pr_err("Fail to find suitable image loader\n"); > > > + ldata = ERR_PTR(ret); > > > + goto err; > > > + } > > > + ldata = kexec_image_load_default(image); > > > + if (IS_ERR(ldata)) { > > > + pr_err("architecture code fails to load image\n"); > > > + goto err; > > > + } > > > + image->image_loader_data = ldata; > > > + > > > +err: > > > + list_for_each_entry_safe(phase, phase_tmp, &phase_head, head) { > > > + list_for_each_entry_safe(res, res_tmp, &phase->res_head, node) { > > > + list_del(&res->node); > > > + /* defer to kimage_file_post_load_cleanup() */ > > > + if (res->deferred_free) { > > > + res->r->buf = NULL; > > > + res->r->buf_sz = 0; > > > + } > > > + mem_range_result_put(res->r); > > > + kfree(res); > > > + } > > > + list_del(&phase->head); > > > + kfree(phase); > > > + } > > > + > > > + return ldata; > > > +} > > > + > > > +const struct kexec_file_ops kexec_pe_image_ops = { > > > + .probe = pe_image_probe, > > > + .load = pe_image_load, > > > +#ifdef CONFIG_KEXEC_IMAGE_VERIFY_SIG > > > + .verify_sig = kexec_kernel_verify_pe_sig, > > > +#endif > > > +}; > > > From moonhee.lee.ca at gmail.com Wed Jul 2 09:07:35 2025 From: moonhee.lee.ca at gmail.com (Moonhee Lee) Date: Wed, 2 Jul 2025 09:07:35 -0700 Subject: [PATCH] selftests/kexec: fix test_kexec_jump build and ignore generated 
binary In-Reply-To: <744bd439-2613-45d7-8724-5959d25100aa@linuxfoundation.org> References: <20250624201438.89391-1-moonhee.lee.ca@gmail.com> <744bd439-2613-45d7-8724-5959d25100aa@linuxfoundation.org> Message-ID: Hi Shuah, On Tue, Jul 1, 2025 at 12:53?PM Shuah Khan wrote: > The change looks good to me. > > Acked-by: Shuah Khan Thank you, Shuah. I'll carry your Acked-by tag in v2. > There is another patch that adds the executable to .gitignore > https://lore.kernel.org/r/20250623232549.3263273-1-dyudaken at gmail.com > I missed that patch. Thank you for pointing it out. I'll drop this change in v2. > I think you are missing kexec at lists.infradead.org - added it Thanks, I?ll add kexec at lists.infradead.org manually in future patches since get_maintainer.pl didn?t include it. $ ./scripts/get_maintainer.pl --scm tools/testing/selftests/kexec Shuah Khan (maintainer:KERNEL SELFTEST FRAMEWORK) David Woodhouse (commit_signer:1/1=100%,authored:1/1=100%) Ingo Molnar (commit_signer:1/1=100%) linux-kselftest at vger.kernel.org (open list:KERNEL SELFTEST FRAMEWORK) linux-kernel at vger.kernel.org (open list) git git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git git git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Regards, Moonhee From moonhee.lee.ca at gmail.com Wed Jul 2 10:17:05 2025 From: moonhee.lee.ca at gmail.com (Moon Hee Lee) Date: Wed, 2 Jul 2025 10:17:05 -0700 Subject: [PATCH v2] selftests/kexec: fix test_kexec_jump build Message-ID: <20250702171704.22559-2-moonhee.lee.ca@gmail.com> The test_kexec_jump program builds correctly when invoked from the top-level selftests/Makefile, which explicitly sets the OUTPUT variable. However, building directly in tools/testing/selftests/kexec fails with: make: *** No rule to make target '/test_kexec_jump', needed by 'test_kexec_jump.sh'. Stop. This failure occurs because the Makefile rule relies on $(OUTPUT), which is undefined in direct builds. Fix this by listing test_kexec_jump in TEST_GEN_PROGS, the standard way to declare generated test binaries in the kselftest framework. This ensures the binary is built regardless of invocation context and properly removed by make clean. 
Acked-by: Shuah Khan Signed-off-by: Moon Hee Lee --- Changes in v2: - Dropped the .gitignore addition, as it is already handled in [1] [1] https://lore.kernel.org/r/20250623232549.3263273-1-dyudaken at gmail.com tools/testing/selftests/kexec/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/kexec/Makefile b/tools/testing/selftests/kexec/Makefile index e3000ccb9a5d..874cfdd3b75b 100644 --- a/tools/testing/selftests/kexec/Makefile +++ b/tools/testing/selftests/kexec/Makefile @@ -12,7 +12,7 @@ include ../../../scripts/Makefile.arch ifeq ($(IS_64_BIT)$(ARCH_PROCESSED),1x86) TEST_PROGS += test_kexec_jump.sh -test_kexec_jump.sh: $(OUTPUT)/test_kexec_jump +TEST_GEN_PROGS := test_kexec_jump endif include ../lib.mk -- 2.43.0 From moonhee.lee.ca at gmail.com Wed Jul 2 10:37:03 2025 From: moonhee.lee.ca at gmail.com (Moonhee Lee) Date: Wed, 2 Jul 2025 10:37:03 -0700 Subject: [PATCH v2] selftests/kexec: fix test_kexec_jump build In-Reply-To: <20250702171704.22559-2-moonhee.lee.ca@gmail.com> References: <20250702171704.22559-2-moonhee.lee.ca@gmail.com> Message-ID: On Wed, Jul 2, 2025 at 10:17?AM Moon Hee Lee wrote: > --- > Changes in v2: > - Dropped the .gitignore addition, as it is already handled in [1] > > [1] https://lore.kernel.org/r/20250623232549.3263273-1-dyudaken at gmail.com Just noticed I had the wrong address in the To field ? the correct one (skhan at linuxfoundation.org) was already in Cc, but sending this to fix it properly. No changes to the patch. Thanks, Moonhee From piliu at redhat.com Wed Jul 2 18:17:11 2025 From: piliu at redhat.com (Pingfan Liu) Date: Thu, 3 Jul 2025 09:17:11 +0800 Subject: [PATCHv3 5/9] kexec: Introduce kexec_pe_image to parse and load PE file In-Reply-To: <20250702111751.2b43aea2@rotkaeppchen> References: <20250529041744.16458-1-piliu@redhat.com> <20250529041744.16458-6-piliu@redhat.com> <20250625200950.16d7a09c@rotkaeppchen> <20250702111751.2b43aea2@rotkaeppchen> Message-ID: On Wed, Jul 2, 2025 at 5:18?PM Philipp Rudo wrote: > > Hi Pingfan, > > On Mon, 30 Jun 2025 21:45:05 +0800 > Pingfan Liu wrote: > > > On Wed, Jun 25, 2025 at 08:09:50PM +0200, Philipp Rudo wrote: > > > Hi Pingfan, > > > > > > On Thu, 29 May 2025 12:17:40 +0800 > > > Pingfan Liu wrote: > > > > > > > As UEFI becomes popular, a few architectures support to boot a PE format > > > > kernel image directly. But the internal of PE format varies, which means > > > > each parser for each format. > > > > > > > > This patch (with the rest in this series) introduces a common skeleton > > > > to all parsers, and leave the format parsing in > > > > bpf-prog, so the kernel code can keep relative stable. > > > > > > > > A new kexec_file_ops is implementation, named pe_image_ops. > > > > > > > > There are some place holder function in this patch. (They will take > > > > effect after the introduction of kexec bpf light skeleton and bpf > > > > helpers). Overall the parsing progress is a pipeline, the current > > > > bpf-prog parser is attached to bpf_handle_pefile(), and detatched at the > > > > end of the current stage 'disarm_bpf_prog()' the current parsed result > > > > by the current bpf-prog will be buffered in kernel 'prepare_nested_pe()' > > > > , and deliver to the next stage. For each stage, the bpf bytecode is > > > > extracted from the '.bpf' section in the PE file. 
> > > > > > > > Signed-off-by: Pingfan Liu > > > > Cc: Baoquan He > > > > Cc: Dave Young > > > > Cc: Andrew Morton > > > > Cc: Philipp Rudo > > > > To: kexec at lists.infradead.org > > > > --- > > > > include/linux/kexec.h | 1 + > > > > kernel/Kconfig.kexec | 8 + > > > > kernel/Makefile | 1 + > > > > kernel/kexec_pe_image.c | 356 ++++++++++++++++++++++++++++++++++++++++ > > > > 4 files changed, 366 insertions(+) > > > > create mode 100644 kernel/kexec_pe_image.c > > > > > > > [...] > > > > > > > diff --git a/kernel/kexec_pe_image.c b/kernel/kexec_pe_image.c > > > > new file mode 100644 > > > > index 0000000000000..3097efccb8502 > > > > --- /dev/null > > > > +++ b/kernel/kexec_pe_image.c > > > > @@ -0,0 +1,356 @@ > > > > +// SPDX-License-Identifier: GPL-2.0 > > > > +/* > > > > + * Kexec PE image loader > > > > + > > > > + * Copyright (C) 2025 Red Hat, Inc > > > > + */ > > > > + > > > > +#define pr_fmt(fmt) "kexec_file(Image): " fmt > > > > + > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > + > > > > + > > > > +static LIST_HEAD(phase_head); > > > > + > > > > +struct parsed_phase { > > > > + struct list_head head; > > > > + struct list_head res_head; > > > > +}; > > > > + > > > > +static struct parsed_phase *cur_phase; > > > > + > > > > +static char *kexec_res_names[3] = {"kernel", "initrd", "cmdline"}; > > > > > > Wouldn't it be better to use a enum rather than strings for the > > > different resources? Especially as in prepare_nested_pe you are > > > > I plan to make bpf_copy_to_kernel() fit for more cases besides kexec. So > > string may be better choice, and I think it is better to have a > > subsystem prefix, like "kexec:kernel" > > True, although an enum could be utilized directly as, e.g. an index for > an array directly. Anyway, I don't think there is a single 'best' > solution here. So feel free to use strings. > > > > comparing two strings using == instead of strcmp(). So IIUC it should > > > always return false. > > > > > > > Oops, I will fix that. In fact, I meaned to assign the pointer > > kexec_res_names[i] to kexec_res.name in bpf_kexec_carrier(). Later in > > prepare_nested_pe() can compare two pointers. > > > > > > > > +struct kexec_res { > > > > + struct list_head node; > > > > + char *name; > > > > + /* The free of buffer is deferred to kimage_file_post_load_cleanup */ > > > > + bool deferred_free; > > > > + struct mem_range_result *r; > > > > +}; > > > > + > > > > +static struct parsed_phase *alloc_new_phase(void) > > > > +{ > > > > + struct parsed_phase *phase = kzalloc(sizeof(struct parsed_phase), GFP_KERNEL); > > > > + > > > > + INIT_LIST_HEAD(&phase->head); > > > > + INIT_LIST_HEAD(&phase->res_head); > > > > + list_add_tail(&phase->head, &phase_head); > > > > + > > > > + return phase; > > > > +} > > > > > > I must admit I don't fully understand how you are handling the > > > different phases. In particular I don't understand why you are keeping > > > all the resources a phase returned once it is finished. The way I see > > > it those resources are only needed once as input for the next phase. So > > > it should be sufficient to only keep a single kexec_context and update > > > it when a phase returns a new resource. The way I see it this should > > > simplify pe_image_load quite a bit. Or am I missing something? 
> > > > > > > Let us say an aarch64 zboot image embeded in UKI's .linux section. > > The UKI parser takes apart the image into kernel, initrd, cmdline. > > And the kernel part contains the zboot PE, including zboot parser. > > The zboot parser needn't to handle either initrd or cmdline. > > So I use the phases, and the leaf node is the final parsed result. > > Right, that's how the code is working. My point was that when you have > multiple phases working on the same component, e.g. the kernel image, > then you still keep all the intermediate kernel images in memory until > the end. Even though the intermediate images are only used as an input > for the next phase(s). So my suggestion is to remove them immediately > once a phase returns a new image. My expectation is that this not only > reduces the memory usage but also simplifies the code. > Ah, got your point. It is a good suggestion especially that it can save lots of code. Thanks, Pingfan > Thanks > Philipp > > > > > +static bool is_valid_pe(const char *kernel_buf, unsigned long kernel_len) > > > > +{ > > > > + struct mz_hdr *mz; > > > > + struct pe_hdr *pe; > > > > + > > > > + if (!kernel_buf) > > > > + return false; > > > > + mz = (struct mz_hdr *)kernel_buf; > > > > + if (mz->magic != MZ_MAGIC) > > > > + return false; > > > > + pe = (struct pe_hdr *)(kernel_buf + mz->peaddr); > > > > + if (pe->magic != PE_MAGIC) > > > > + return false; > > > > + if (pe->opt_hdr_size == 0) { > > > > + pr_err("optional header is missing\n"); > > > > + return false; > > > > + } > > > > + > > > > + return true; > > > > +} > > > > + > > > > +static bool is_valid_format(const char *kernel_buf, unsigned long kernel_len) > > > > +{ > > > > + return is_valid_pe(kernel_buf, kernel_len); > > > > +} > > > > + > > > > +/* > > > > + * The UEFI Terse Executable (TE) image has MZ header. > > > > + */ > > > > +static int pe_image_probe(const char *kernel_buf, unsigned long kernel_len) > > > > +{ > > > > + return is_valid_pe(kernel_buf, kernel_len) ? 0 : -1; > > > > > > Every image, at least on x86, is a valid pe file. So we should check > > > for the .bpf section rather than the header. > > > > > > > You are right that it should include the check on the existence of .bpf > > section. On the other hand, the check on PE header in kernel can ensure > > the kexec-tools passes the right candidate for this parser. > > > > > > +} > > > > + > > > > +static int get_pe_section(char *file_buf, const char *sect_name, > > > > > > s/get_pe_section/pe_get_section/ ? > > > that would make it more consistent with the other functions. > > > > Sure. I will fix it. > > > > > > Thanks for your careful review. 
> > > > > > Best Regards, > > > > Pingfan > > > > > > > > Thanks > > > Philipp > > > > > > > + char **sect_start, unsigned long *sect_sz) > > > > +{ > > > > + struct pe_hdr *pe_hdr; > > > > + struct pe32plus_opt_hdr *opt_hdr; > > > > + struct section_header *sect_hdr; > > > > + int section_nr, i; > > > > + struct mz_hdr *mz = (struct mz_hdr *)file_buf; > > > > + > > > > + *sect_start = NULL; > > > > + *sect_sz = 0; > > > > + pe_hdr = (struct pe_hdr *)(file_buf + mz->peaddr); > > > > + section_nr = pe_hdr->sections; > > > > + opt_hdr = (struct pe32plus_opt_hdr *)(file_buf + mz->peaddr + sizeof(struct pe_hdr)); > > > > + sect_hdr = (struct section_header *)((char *)opt_hdr + pe_hdr->opt_hdr_size); > > > > + > > > > + for (i = 0; i < section_nr; i++) { > > > > + if (strcmp(sect_hdr->name, sect_name) == 0) { > > > > + *sect_start = file_buf + sect_hdr->data_addr; > > > > + *sect_sz = sect_hdr->raw_data_size; > > > > + return 0; > > > > + } > > > > + sect_hdr++; > > > > + } > > > > + > > > > + return -1; > > > > +} > > > > + > > > > +static bool pe_has_bpf_section(char *file_buf, unsigned long pe_sz) > > > > +{ > > > > + char *sect_start = NULL; > > > > + unsigned long sect_sz = 0; > > > > + int ret; > > > > + > > > > + ret = get_pe_section(file_buf, ".bpf", §_start, §_sz); > > > > + if (ret < 0) > > > > + return false; > > > > + return true; > > > > +} > > > > + > > > > +/* Load a ELF */ > > > > +static int arm_bpf_prog(char *bpf_elf, unsigned long sz) > > > > +{ > > > > + return 0; > > > > +} > > > > + > > > > +static void disarm_bpf_prog(void) > > > > +{ > > > > +} > > > > + > > > > +struct kexec_context { > > > > + bool kdump; > > > > + char *image; > > > > + int image_sz; > > > > + char *initrd; > > > > + int initrd_sz; > > > > + char *cmdline; > > > > + int cmdline_sz; > > > > +}; > > > > + > > > > +void bpf_handle_pefile(struct kexec_context *context); > > > > +void bpf_post_handle_pefile(struct kexec_context *context); > > > > + > > > > + > > > > +/* > > > > + * optimize("O0") prevents inline, compiler constant propagation > > > > + */ > > > > +__attribute__((used, optimize("O0"))) void bpf_handle_pefile(struct kexec_context *context) > > > > +{ > > > > +} > > > > + > > > > +__attribute__((used, optimize("O0"))) void bpf_post_handle_pefile(struct kexec_context *context) > > > > +{ > > > > +} > > > > + > > > > +/* > > > > + * PE file may be nested and should be unfold one by one. > > > > + * Query 'kernel', 'initrd', 'cmdline' in cur_phase, as they are inputs for the > > > > + * next phase. 
> > > > + */ > > > > +static int prepare_nested_pe(char **kernel, unsigned long *kernel_len, char **initrd, > > > > + unsigned long *initrd_len, char **cmdline) > > > > +{ > > > > + struct kexec_res *res; > > > > + int ret = -1; > > > > + > > > > + *kernel = NULL; > > > > + *kernel_len = 0; > > > > + > > > > + list_for_each_entry(res, &cur_phase->res_head, node) { > > > > + if (res->name == kexec_res_names[0]) { > > > > + *kernel = res->r->buf; > > > > + *kernel_len = res->r->data_sz; > > > > + ret = 0; > > > > + } else if (res->name == kexec_res_names[1]) { > > > > + *initrd = res->r->buf; > > > > + *initrd_len = res->r->data_sz; > > > > + } else if (res->name == kexec_res_names[2]) { > > > > + *cmdline = res->r->buf; > > > > + } > > > > + } > > > > + > > > > + return ret; > > > > +} > > > > + > > > > +static void *pe_image_load(struct kimage *image, > > > > + char *kernel, unsigned long kernel_len, > > > > + char *initrd, unsigned long initrd_len, > > > > + char *cmdline, unsigned long cmdline_len) > > > > +{ > > > > + char *parsed_kernel = NULL; > > > > + unsigned long parsed_len; > > > > + char *linux_start, *initrd_start, *cmdline_start, *bpf_start; > > > > + unsigned long linux_sz, initrd_sz, cmdline_sz, bpf_sz; > > > > + struct parsed_phase *phase, *phase_tmp; > > > > + struct kexec_res *res, *res_tmp; > > > > + void *ldata; > > > > + int ret; > > > > + > > > > + linux_start = kernel; > > > > + linux_sz = kernel_len; > > > > + initrd_start = initrd; > > > > + initrd_sz = initrd_len; > > > > + cmdline_start = cmdline; > > > > + cmdline_sz = cmdline_len; > > > > + > > > > + while (is_valid_format(linux_start, linux_sz) && > > > > + pe_has_bpf_section(linux_start, linux_sz)) { > > > > + struct kexec_context context; > > > > + > > > > + get_pe_section(linux_start, ".bpf", &bpf_start, &bpf_sz); > > > > + if (!!bpf_sz) { > > > > + /* load and attach bpf-prog */ > > > > + ret = arm_bpf_prog(bpf_start, bpf_sz); > > > > + if (ret) { > > > > + pr_err("Fail to load .bpf section\n"); > > > > + ldata = ERR_PTR(ret); > > > > + goto err; > > > > + } > > > > + } > > > > + cur_phase = alloc_new_phase(); > > > > + if (image->type != KEXEC_TYPE_CRASH) > > > > + context.kdump = false; > > > > + else > > > > + context.kdump = true; > > > > + context.image = linux_start; > > > > + context.image_sz = linux_sz; > > > > + context.initrd = initrd_start; > > > > + context.initrd_sz = initrd_sz; > > > > + context.cmdline = cmdline_start; > > > > + context.cmdline_sz = strlen(cmdline_start); > > > > + /* bpf-prog fentry, which handle above buffers. */ > > > > + bpf_handle_pefile(&context); > > > > + > > > > + prepare_nested_pe(&linux_start, &linux_sz, &initrd_start, > > > > + &initrd_sz, &cmdline_start); > > > > + /* bpf-prog fentry */ > > > > + bpf_post_handle_pefile(&context); > > > > + /* > > > > + * detach the current bpf-prog from their attachment points. > > > > + * It also a point to free any registered interim resource. > > > > + * Any resource except attached to phase is interim. 
> > > > + */ > > > > + disarm_bpf_prog(); > > > > + } > > > > + > > > > + /* the rear of parsed phase contains the result */ > > > > + list_for_each_entry_reverse(phase, &phase_head, head) { > > > > + if (initrd != NULL && cmdline != NULL && parsed_kernel != NULL) > > > > + break; > > > > + list_for_each_entry(res, &phase->res_head, node) { > > > > + if (!strcmp(res->name, "kernel") && !parsed_kernel) { > > > > + parsed_kernel = res->r->buf; > > > > + parsed_len = res->r->data_sz; > > > > + res->deferred_free = true; > > > > + } else if (!strcmp(res->name, "initrd") && !initrd) { > > > > + initrd = res->r->buf; > > > > + initrd_len = res->r->data_sz; > > > > + res->deferred_free = true; > > > > + } else if (!strcmp(res->name, "cmdline") && !cmdline) { > > > > + cmdline = res->r->buf; > > > > + cmdline_len = res->r->data_sz; > > > > + res->deferred_free = true; > > > > + } > > > > + } > > > > + > > > > + } > > > > + > > > > + if (initrd == NULL || cmdline == NULL || parsed_kernel == NULL) { > > > > + char *c, buf[64]; > > > > + > > > > + c = buf; > > > > + if (parsed_kernel == NULL) { > > > > + strcpy(c, "kernel "); > > > > + c += strlen("kernel "); > > > > + } > > > > + if (initrd == NULL) { > > > > + strcpy(c, "initrd "); > > > > + c += strlen("initrd "); > > > > + } > > > > + if (cmdline == NULL) { > > > > + strcpy(c, "cmdline "); > > > > + c += strlen("cmdline "); > > > > + } > > > > + c = '\0'; > > > > + pr_err("Can not extract data for %s", buf); > > > > + ldata = ERR_PTR(-EINVAL); > > > > + goto err; > > > > + } > > > > + /* > > > > + * image's kernel_buf, initrd_buf, cmdline_buf are set. Now they should > > > > + * be updated to the new content. > > > > + */ > > > > + if (image->kernel_buf != parsed_kernel) { > > > > + vfree(image->kernel_buf); > > > > + image->kernel_buf = parsed_kernel; > > > > + image->kernel_buf_len = parsed_len; > > > > + } > > > > + if (image->initrd_buf != initrd) { > > > > + vfree(image->initrd_buf); > > > > + image->initrd_buf = initrd; > > > > + image->initrd_buf_len = initrd_len; > > > > + } > > > > + if (image->cmdline_buf != cmdline) { > > > > + kfree(image->cmdline_buf); > > > > + image->cmdline_buf = cmdline; > > > > + image->cmdline_buf_len = cmdline_len; > > > > + } > > > > + ret = arch_kexec_kernel_image_probe(image, image->kernel_buf, > > > > + image->kernel_buf_len); > > > > + if (ret) { > > > > + pr_err("Fail to find suitable image loader\n"); > > > > + ldata = ERR_PTR(ret); > > > > + goto err; > > > > + } > > > > + ldata = kexec_image_load_default(image); > > > > + if (IS_ERR(ldata)) { > > > > + pr_err("architecture code fails to load image\n"); > > > > + goto err; > > > > + } > > > > + image->image_loader_data = ldata; > > > > + > > > > +err: > > > > + list_for_each_entry_safe(phase, phase_tmp, &phase_head, head) { > > > > + list_for_each_entry_safe(res, res_tmp, &phase->res_head, node) { > > > > + list_del(&res->node); > > > > + /* defer to kimage_file_post_load_cleanup() */ > > > > + if (res->deferred_free) { > > > > + res->r->buf = NULL; > > > > + res->r->buf_sz = 0; > > > > + } > > > > + mem_range_result_put(res->r); > > > > + kfree(res); > > > > + } > > > > + list_del(&phase->head); > > > > + kfree(phase); > > > > + } > > > > + > > > > + return ldata; > > > > +} > > > > + > > > > +const struct kexec_file_ops kexec_pe_image_ops = { > > > > + .probe = pe_image_probe, > > > > + .load = pe_image_load, > > > > +#ifdef CONFIG_KEXEC_IMAGE_VERIFY_SIG > > > > + .verify_sig = kexec_kernel_verify_pe_sig, > > > > +#endif > > > > +}; > > > > > > From 
bhe at redhat.com Wed Jul 2 23:44:36 2025 From: bhe at redhat.com (Baoquan He) Date: Thu, 3 Jul 2025 14:44:36 +0800 Subject: [PATCH v2] selftests/kexec: fix test_kexec_jump build In-Reply-To: <20250702171704.22559-2-moonhee.lee.ca@gmail.com> References: <20250702171704.22559-2-moonhee.lee.ca@gmail.com> Message-ID: On 07/02/25 at 10:17am, Moon Hee Lee wrote: > The test_kexec_jump program builds correctly when invoked from the top-level > selftests/Makefile, which explicitly sets the OUTPUT variable. However, > building directly in tools/testing/selftests/kexec fails with: > > make: *** No rule to make target '/test_kexec_jump', needed by 'test_kexec_jump.sh'. Stop. I can reproduce this, and this patch fixes it. Thanks. Acked-by: Baoquan He > > This failure occurs because the Makefile rule relies on $(OUTPUT), which is > undefined in direct builds. > > Fix this by listing test_kexec_jump in TEST_GEN_PROGS, the standard way to > declare generated test binaries in the kselftest framework. This ensures the > binary is built regardless of invocation context and properly removed by > make clean. > > Acked-by: Shuah Khan > Signed-off-by: Moon Hee Lee > --- > Changes in v2: > - Dropped the .gitignore addition, as it is already handled in [1] > > [1] https://lore.kernel.org/r/20250623232549.3263273-1-dyudaken at gmail.com > > > tools/testing/selftests/kexec/Makefile | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/tools/testing/selftests/kexec/Makefile b/tools/testing/selftests/kexec/Makefile > index e3000ccb9a5d..874cfdd3b75b 100644 > --- a/tools/testing/selftests/kexec/Makefile > +++ b/tools/testing/selftests/kexec/Makefile > @@ -12,7 +12,7 @@ include ../../../scripts/Makefile.arch > > ifeq ($(IS_64_BIT)$(ARCH_PROCESSED),1x86) > TEST_PROGS += test_kexec_jump.sh > -test_kexec_jump.sh: $(OUTPUT)/test_kexec_jump > +TEST_GEN_PROGS := test_kexec_jump > endif > > include ../lib.mk > -- > 2.43.0 > > From ptesarik at suse.com Thu Jul 3 07:31:00 2025 From: ptesarik at suse.com (Petr Tesarik) Date: Thu, 3 Jul 2025 16:31:00 +0200 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: References: <20250625022343.57529-2-ltao@redhat.com> <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> Message-ID: <20250703163100.603f59f4@mordecai.tesarici.cz> On Tue, 1 Jul 2025 19:59:53 +1200 Tao Liu wrote: > Hi Kazu, > > Thanks for your comments! > > On Tue, Jul 1, 2025 at 7:38?PM HAGIO KAZUHITO(?????) wrote: > > > > Hi Tao, > > > > thank you for the patch. > > > > On 2025/06/25 11:23, Tao Liu wrote: > > > A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be > > > reproduced with upstream makedumpfile. > > > > > > When analyzing the corrupt vmcore using crash, the following error > > > message will output: > > > > > > crash: compressed kdump: uncompress failed: 0 > > > crash: read error: kernel virtual address: c0001e2d2fe48000 type: > > > "hardirq thread_union" > > > crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000 > > > crash: compressed kdump: uncompress failed: 0 > > > > > > If the vmcore is generated without num-threads option, then no such > > > errors are noticed. > > > > > > With --num-threads=N enabled, there will be N sub-threads created. All > > > sub-threads are producers which responsible for mm page processing, e.g. > > > compression. The main thread is the consumer which responsible for > > > writing the compressed data into file. page_flag_buf->ready is used to > > > sync main and sub-threads. 
When a sub-thread finishes page processing, > > > it will set ready flag to be FLAG_READY. In the meantime, main thread > > > looply check all threads of the ready flags, and break the loop when > > > find FLAG_READY. > > > > I've tried to reproduce the issue, but I couldn't on x86_64. > > Yes, I cannot reproduce it on x86_64 either, but the issue is very > easily reproduced on ppc64 arch, which is where our QE reported. Yes, this is expected. X86 implements a strongly ordered memory model, so a "store-to-memory" instruction ensures that the new value is immediately observed by other CPUs. FWIW the current code is wrong even on X86, because it does nothing to prevent compiler optimizations. The compiler is then allowed to reorder instructions so that the write to page_flag_buf->ready happens after other writes; with a bit of bad scheduling luck, the consumer thread may see an inconsistent state (e.g. read a stale page_flag_buf->pfn). Note that thanks to how compilers are designed (today), this issue is more or less hypothetical. Nevertheless, the use of atomics fixes it, because they also serve as memory barriers. Petr T From safinaskar at zohomail.com Thu Jul 3 12:56:00 2025 From: safinaskar at zohomail.com (Askar Safin) Date: Thu, 03 Jul 2025 23:56:00 +0400 Subject: Second kexec_file_load (but not kexec_load) fails on i915 if CONFIG_INTEL_IOMMU_DEFAULT_ON=n Message-ID: <197d1dc3bff.c01ddb9024897.1898328361232711826@zohomail.com> TL;DR: I found a bug in strange interaction in kexec_file_load (but not kexec_load) and i915 TL;DR#2: Second (sometimes third or forth) kexec (using kexec_file_load) fails on my particular hardware TL;DR#3: I did 55 expirements, each of them required a lot of boots, in total I did 1908 boots Okay, so I found a bug. Steps to reproduce: - I have Dell Precision 7780 - I have recent Debian x86_64 sid installed (bug reproducible with both Debian kernels and mainline ones) - Bug is reproducible on many kernels, including very recent ones, for example 6.15.4 - Boot system, then do kexec into the same system using kexec_file_load. I. e. pass --kexec-file-syscall to "kexec" command - Then kexec from this kexec'ed system again (i. e. you should do two kexec's in a row) - Then do 3rd kexec, etc - Repeat kexec's until you do 100 kexec's or your system start to misbehave On my computer the system starts to misbehave after some number of kexec's. This always happens after 2nd kexec attempt. I. e. the first kexec is always successful. But second sometimes is not. I never was able to perform 100 kexec's in a row. After some kexec attempt the system starts to misbehave: oopses, panics, locked system, etc. Notes: - I tried to bisect "kexec-tools" package, but bisect merely gave me commit, which switched to kexec_file_load as a default. Bug is reproducible if we use kexec_file_load, but doesn't reproduce if we use kexec_load - Bug is reproducible even if we boot via init=/bin/bash (note: this means that initramfs is still part of the boot process). (If we boot to normal GUI, bug is reproducible, too) - When I reproduce I use this command line: "root=UUID=... rootflags=subvol=... ro init=..." - Debian package "plymouth" is required for reproducing. (It reproduces with plymouth, but doesn't reproduce without plymouth.) But note that I never see actual plymouth screen! I. e. presence of "plymouth" on the system somehow affects bug reproduciblity despite plymouth animation never actually shown. 
I don't know why this happens, but I suspect that I don't pass "splash" to kernel command line, and thus don't see plymouth screen. But I suspect that plymouth is still included to initramfs and from there somehow affects boot process - Bug reproduces in Debian, but doesn't reproduce in Ubuntu. After a lot of expirementing I finally understood why: Ubuntu kernel has CONFIG_INTEL_IOMMU_DEFAULT_ON=y, and Debian kernel has not. Additional expirements found that it is culpit. I. e. the bug is reproducible with CONFIG_INTEL_IOMMU_DEFAULT_ON=n and not reproducbile with CONFIG_INTEL_IOMMU_DEFAULT_ON=y . (So advice for distributions: do what Ubuntu does, i. e. set CONFIG_INTEL_IOMMU_DEFAULT_ON=y to hide this bug) - Bug is not reproducible in old enough kernels, so I did bisect on Linux. Bisect showed me these commits: d4a2393049..4a75f32fc7. I. e. bug is reproducible in 4a75f32fc7, but doesn't reproduce in d4a2393049. Between them there is a middle commit 52407c220c44c8dcc6a, which is not testable. Here are these commits: commit 4a75f32fc783128d0c42ef73fa62a20379a66828 Author: Anusha Srivatsa ? ?drm/i915/rpl-s: Add PCH Support for Raptor Lake S commit 52407c220c44c8dcc6aa8aa35ffc8a2db3c849a9 Author: Anusha Srivatsa ? ?drm/i915/rpl-s: Add PCI IDS for Raptor Lake S It seems these commits merely added support for my Intel GPU model. So this is fake regression. I'm not sure this should be treated as proper regression and whether regzbot should be notified. (What do you think?) Still formally this is regression: I did expirements and they show that bug present in 4a75f32fc783128d0c42 and not present before. (Side note: in latest kernels both wayland and x11 work, in d4a2393049 x11 works and wayland doesn't.) I tried to reproduce the bug in Qemu, but I was unable to do so. It seems Intel GPU is required, maybe even my particular model. Here is "lspci -vnn -d :*:0300" for my GPU: 00:02.0 VGA compatible controller [0300]: Intel Corporation Raptor Lake-S UHD Graphics [8086:a788] (rev 04) (prog-if 00 [VGA controller]) Subsystem: Dell Raptor Lake-S UHD Graphics [1028:0c42] Flags: bus master, fast devsel, latency 0, IRQ 202, IOMMU group 0 Memory at 604b000000 (64-bit, non-prefetchable) [size=16M] Memory at 4000000000 (64-bit, prefetchable) [size=256M] I/O ports at 3000 [size=64] Expansion ROM at 000c0000 [virtual] [disabled] [size=128K] Capabilities: [40] Vendor Specific Information: Len=0c Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00 Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit- Capabilities: [d0] Power Management version 2 Capabilities: [100] Process Address Space ID (PASID) Capabilities: [200] Address Translation Service (ATS) Capabilities: [300] Page Request Interface (PRI) Capabilities: [320] Single Root I/O Virtualization (SR-IOV) Kernel driver in use: i915 Kernel modules: i915 dmidecode: https://zerobin.net/?aebea072b93d8122#z4W9URnV+k9ZZErhP4etQkxlfpyRKf++uKMNoO5PGjs= - I use "root=UUID=... rootflags=subvol=... ro init=..." as a command line for reproducing. If I add "recovery nomodeset dis_ucode_ldr" (this is options used by Ubuntu in recovery mode), the bug stops to reproduce Again, in short, full list of things required for successful reproducing: - Intel GPU, possibly my particular model - Kernel with support for my model (4a75f32fc783128d0c42 and later up to 6.15.4) - Kexec at least two times. 
(One kexec never fails, 100 kexec's in a row never succeed) - kexec_file_load as opposed to kexec_load - Initramfs - Lack of parameters "recovery nomodeset dis_ucode_ldr" (i. e. one of them stops reproducing) - plymouth - CONFIG_INTEL_IOMMU_DEFAULT_ON=n Removing of ANY of them stops the bug, and I proved this by lots of expirements. In total I did 55+ expirements, each of them required up to 100 boots. In total I did 1908 (!!!!!!) boots on my physical laptop (I mean kexec boots here). No, I'm not faking this number, here is my actual directories with results: user at subvolume:~$ ls /rbt/kx-results/ @rec-2025-06-29T201723Z-bad-4 @rec-2025-06-29T214650Z-good-60 @rec-2025-07-03T050626Z-bad-41 @rec-2025-07-03T104125Z-bad-28 @rec-2025-07-03T133705Z-bad-3 @rec-2025-06-29T203429Z-good-60 @rec-2025-06-29T215558Z-bad-8 @rec-2025-07-03T060107Z-good-100 @rec-2025-07-03T111727Z-bad-13 @rec-2025-07-03T141647Z-good-100 @rec-2025-06-29T205626Z-good-60 @rec-2025-07-01T042949Z-bad-12 @rec-2025-07-03T074810Z-good-100 @rec-2025-07-03T122242Z-good-100 @rec-2025-07-03T145705Z-good-100 @rec-2025-06-29T211612Z-bad-6 @rec-2025-07-02T120101Z-good-60 @rec-2025-07-03T082914Z-good-100 @rec-2025-07-03T123958Z-bad-12 @rec-2025-07-03T152406Z-bad-50 @rec-2025-06-29T212932Z-good-60 @rec-2025-07-03T031038Z-good-60 @rec-2025-07-03T100615Z-good-100 @rec-2025-07-03T132116Z-good-100 @rec-2025-07-03T154204Z-bad-15 user at subvolume:~$ ls /rbt/kx-manual-testing/ 2025-07-01-03-19-good-6 2025-07-01-03-56-good-4 2025-07-01-05-28-bad-3 2025-07-01-06-35-bad-2 2025-07-01-09-46-good-8 2025-07-01-03-44-good-3 2025-07-01-04-47-good-3 2025-07-01-06-19-bad-2 2025-07-01-09-21-bad-2 2025-07-02-13-09-good user at subvolume:~$ ls /rbt/kx-vanilla-results/ 2025-06-30T005219Z_5.16.0-kx-df0cc57e057f18e4-3e17eec5ff024b63_1626_good_60 2025-06-30T023542Z_5.16.0-rc2-kx-87bb2a410dcfb617-9f30253daecd39e5_1663_bad_4 2025-06-30T012313Z_5.17.0-kx-f443e374ae131c16-91b07dce12a83fab_1674_bad_1 2025-06-30T032312Z_5.16.0-rc2-kx-c9ee950a2ca55ea0-854a1f40ce042801_1662_bad_6 2025-06-30T013555Z_5.16.0-kx-22ef12195e13c5ec-9aaf880b25942f2a_1668_bad_7 2025-06-30T033528Z_5.16.0-rc2-kx-ba884a411700dc56-854a1f40ce042801_1662_good_60 2025-06-30T014106Z_5.16.0-kx-9bcbf894b6872216-b828905f3cf12050_1664_bad_2 2025-06-30T034645Z_5.16.0-rc2-kx-d4a23930490df39f-854a1f40ce042801_1662_good_60 2025-06-30T014634Z_5.16.0-rc5-kx-cb6846fbb83b574c-83e7c6cf2ede57b4_1663_bad_6 2025-06-30T035232Z_5.16.0-rc2-kx-4a75f32fc783128d-854a1f40ce042801_1662_bad_5 2025-06-30T015713Z_5.16.0-rc2-kx-15bb79910fe734ad-9f30253daecd39e5_1663_good_60 2025-06-30T042058Z_5.16.0-rc2-kx-4a75f32fc783128d-854a1f40ce042801_1662_bad_1 2025-06-30T020235Z_5.16.0-rc5-kx-b06103b5325364e0-26176b9b704a5c24_1664_bad_6 2025-06-30T050000Z_6.15.4-kx-e60eb441596d1c70-2378f4efc5e956e5_2366_bad_2 2025-06-30T020717Z_5.16.0-rc5-kx-eacef9fd61dcf5ea-26176b9b704a5c24_1664_bad_1 2025-06-30T053011Z_6.15.4-kx-e60eb441596d1c70-2378f4efc5e956e5_2366_good_60 2025-06-30T021738Z_5.16.0-rc2-kx-67b858dd89932086-8d2f1d17f1e1933c_1662_good_60 2025-06-30T060619Z_5.16.0-rc2-kx-d4a23930490df39f-854a1f40ce042801_1662_good_60 2025-06-30T022759Z_5.16.0-rc2-kx-17815f624a90579a-854a1f40ce042801_1662_good_60 2025-06-30T061448Z_5.16.0-rc2-kx-4a75f32fc783128d-854a1f40ce042801_1662_bad_1 Each number in the end of file/directory name is number of boots. In total we have 1908 boots. Testing was mostly automatical, using my script. 
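For reference, here is a minimal, illustrative C sketch of the kexec_file_load(2) path that "kexec --kexec-file-syscall" exercises. This is not the reproducer script used for the runs above (those used the Debian kexec-tools package); the kernel/initrd paths and the command line below are placeholders.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/reboot.h>
#include <sys/syscall.h>

int main(void)
{
        /* Placeholder paths and command line -- adjust for the local setup. */
        const char *cmdline = "root=UUID=placeholder ro";
        int kernel_fd = open("/boot/vmlinuz", O_RDONLY);
        int initrd_fd = open("/boot/initrd.img", O_RDONLY);

        if (kernel_fd < 0 || initrd_fd < 0) {
                perror("open");
                return 1;
        }

        /*
         * kexec_file_load(2): the kernel itself parses and verifies the
         * image.  With the older kexec_load(2), the userspace kexec tool
         * prepares the segments instead.  cmdline_len must include the
         * terminating NUL byte.
         */
        if (syscall(SYS_kexec_file_load, kernel_fd, initrd_fd,
                    strlen(cmdline) + 1, cmdline, 0UL) < 0) {
                perror("kexec_file_load");
                return 1;
        }

        sync();
        /* Roughly what "kexec -e" does; real tooling stops services first. */
        reboot(RB_KEXEC);
        return 1;
}

Repeating this load-and-reboot cycle from each freshly kexec'ed system is what the automated runs above do. The distinction drawn in the report matters here: with kexec_file_load the image is parsed inside the kernel, whereas with kexec_load the kexec tool decompresses and parses the image in userspace and only passes prepared segments to the kernel, which is consistent with only the kexec_file_load path misbehaving.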
Here is one example dmesg from mainline commit e60eb441596d1c70 (somewhere around 6.15.4): https://zerobin.net/?119ff118fd47b363#BpziYs6dNz5PaT7H8w2hlveoEYa4DDtITGkyd9o57LE= This was the dmesg from the 2nd (and at the same time last) boot. The next boot (i. e. kexec) was unsuccessful. Corresponding config: https://zerobin.net/?009c807e1df41af8#gnmrswlbaFbdPTuzNq6NFkQd/Jhb3Ds0ZlLiwNanXnc= If you want results from all experiments, here is a link: https://filebin.net/45g2757b2iwaeen7 (1 Mb, expires after 7 days). Usually experiments come with a full reproducer script. But what I described above is already enough, so I think this link is not needed. I will be available for testing in the coming days, then I will switch to other things and will no longer be available for testing. If you want more time, then please ask for it, i. e. tell me something like "Please, be available for testing for 10 more days". -- Askar Safin https://types.pl/@safinaskar From ltao at redhat.com Thu Jul 3 15:35:20 2025 From: ltao at redhat.com (Tao Liu) Date: Fri, 4 Jul 2025 10:35:20 +1200 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: <20250703163100.603f59f4@mordecai.tesarici.cz> References: <20250625022343.57529-2-ltao@redhat.com> <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> <20250703163100.603f59f4@mordecai.tesarici.cz> Message-ID: Hi Petr, On Fri, Jul 4, 2025 at 2:31 AM Petr Tesarik wrote: > > On Tue, 1 Jul 2025 19:59:53 +1200 > Tao Liu wrote: > > > Hi Kazu, > > > > Thanks for your comments! > > > > On Tue, Jul 1, 2025 at 7:38 PM HAGIO KAZUHITO(?????) wrote: > > > > > > Hi Tao, > > > > > > thank you for the patch. > > > > > > On 2025/06/25 11:23, Tao Liu wrote: > > > > A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be > > > > reproduced with upstream makedumpfile. > > > > > > > > When analyzing the corrupt vmcore using crash, the following error > > > > message will output: > > > > > > > > crash: compressed kdump: uncompress failed: 0 > > > > crash: read error: kernel virtual address: c0001e2d2fe48000 type: > > > > "hardirq thread_union" > > > > crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000 > > > > crash: compressed kdump: uncompress failed: 0 > > > > > > > > If the vmcore is generated without num-threads option, then no such > > > > errors are noticed. > > > > > > > > With --num-threads=N enabled, there will be N sub-threads created. All > > > > sub-threads are producers which responsible for mm page processing, e.g. > > > > compression. The main thread is the consumer which responsible for > > > > writing the compressed data into file. page_flag_buf->ready is used to > > > > sync main and sub-threads.
The compiler is then allowed to reorder > instructions so that the write to page_flag_buf->ready happens after > other writes; with a bit of bad scheduling luck, the consumer thread > may see an inconsistent state (e.g. read a stale page_flag_buf->pfn). > Note that thanks to how compilers are designed (today), this issue is > more or less hypothetical. Nevertheless, the use of atomics fixes it, > because they also serve as memory barriers. Thanks a lot for your detailed explanation, it's very helpful! I haven't thought of the possibility of instruction reordering and atomic_rw prevents the reorder. Thanks, Tao Liu > > Petr T > From k-hagio-ab at nec.com Thu Jul 3 23:49:10 2025 From: k-hagio-ab at nec.com (=?utf-8?B?SEFHSU8gS0FaVUhJVE8o6JCp5bC+44CA5LiA5LuBKQ==?=) Date: Fri, 4 Jul 2025 06:49:10 +0000 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: References: <20250625022343.57529-2-ltao@redhat.com> <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> <20250703163100.603f59f4@mordecai.tesarici.cz> Message-ID: <8485b4f1-1277-45ab-b533-efc20120b26e@nec.com> On 2025/07/04 7:35, Tao Liu wrote: > Hi Petr, > > On Fri, Jul 4, 2025 at 2:31?AM Petr Tesarik wrote: >> >> On Tue, 1 Jul 2025 19:59:53 +1200 >> Tao Liu wrote: >> >>> Hi Kazu, >>> >>> Thanks for your comments! >>> >>> On Tue, Jul 1, 2025 at 7:38?PM HAGIO KAZUHITO(?????) wrote: >>>> >>>> Hi Tao, >>>> >>>> thank you for the patch. >>>> >>>> On 2025/06/25 11:23, Tao Liu wrote: >>>>> A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be >>>>> reproduced with upstream makedumpfile. >>>>> >>>>> When analyzing the corrupt vmcore using crash, the following error >>>>> message will output: >>>>> >>>>> crash: compressed kdump: uncompress failed: 0 >>>>> crash: read error: kernel virtual address: c0001e2d2fe48000 type: >>>>> "hardirq thread_union" >>>>> crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000 >>>>> crash: compressed kdump: uncompress failed: 0 >>>>> >>>>> If the vmcore is generated without num-threads option, then no such >>>>> errors are noticed. >>>>> >>>>> With --num-threads=N enabled, there will be N sub-threads created. All >>>>> sub-threads are producers which responsible for mm page processing, e.g. >>>>> compression. The main thread is the consumer which responsible for >>>>> writing the compressed data into file. page_flag_buf->ready is used to >>>>> sync main and sub-threads. When a sub-thread finishes page processing, >>>>> it will set ready flag to be FLAG_READY. In the meantime, main thread >>>>> looply check all threads of the ready flags, and break the loop when >>>>> find FLAG_READY. >>>> >>>> I've tried to reproduce the issue, but I couldn't on x86_64. >>> >>> Yes, I cannot reproduce it on x86_64 either, but the issue is very >>> easily reproduced on ppc64 arch, which is where our QE reported. >> >> Yes, this is expected. X86 implements a strongly ordered memory model, >> so a "store-to-memory" instruction ensures that the new value is >> immediately observed by other CPUs. >> >> FWIW the current code is wrong even on X86, because it does nothing to >> prevent compiler optimizations. The compiler is then allowed to reorder >> instructions so that the write to page_flag_buf->ready happens after >> other writes; with a bit of bad scheduling luck, the consumer thread >> may see an inconsistent state (e.g. read a stale page_flag_buf->pfn). >> Note that thanks to how compilers are designed (today), this issue is >> more or less hypothetical. 
Nevertheless, the use of atomics fixes it, >> because they also serve as memory barriers. Thank you Petr, for the information. I was wondering whether atomic operations might be necessary for the other members of page_flag_buf, but it looks like they won't be necessary in this case. Then I was convinced that the issue would be fixed by removing the inconsistency of page_flag_buf->ready. And the patch tested ok, so ack. Thanks, Kazu > > Thanks a lot for your detailed explanation, it's very helpful! I > haven't thought of the possibility of instruction reordering and > atomic_rw prevents the reorder. > > Thanks, > Tao Liu > >> >> Petr T >> From ltao at redhat.com Fri Jul 4 00:51:01 2025 From: ltao at redhat.com (Tao Liu) Date: Fri, 4 Jul 2025 19:51:01 +1200 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: <8485b4f1-1277-45ab-b533-efc20120b26e@nec.com> References: <20250625022343.57529-2-ltao@redhat.com> <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> <20250703163100.603f59f4@mordecai.tesarici.cz> <8485b4f1-1277-45ab-b533-efc20120b26e@nec.com> Message-ID: On Fri, Jul 4, 2025 at 6:49?PM HAGIO KAZUHITO(?????) wrote: > > On 2025/07/04 7:35, Tao Liu wrote: > > Hi Petr, > > > > On Fri, Jul 4, 2025 at 2:31?AM Petr Tesarik wrote: > >> > >> On Tue, 1 Jul 2025 19:59:53 +1200 > >> Tao Liu wrote: > >> > >>> Hi Kazu, > >>> > >>> Thanks for your comments! > >>> > >>> On Tue, Jul 1, 2025 at 7:38?PM HAGIO KAZUHITO(?????) wrote: > >>>> > >>>> Hi Tao, > >>>> > >>>> thank you for the patch. > >>>> > >>>> On 2025/06/25 11:23, Tao Liu wrote: > >>>>> A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be > >>>>> reproduced with upstream makedumpfile. > >>>>> > >>>>> When analyzing the corrupt vmcore using crash, the following error > >>>>> message will output: > >>>>> > >>>>> crash: compressed kdump: uncompress failed: 0 > >>>>> crash: read error: kernel virtual address: c0001e2d2fe48000 type: > >>>>> "hardirq thread_union" > >>>>> crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000 > >>>>> crash: compressed kdump: uncompress failed: 0 > >>>>> > >>>>> If the vmcore is generated without num-threads option, then no such > >>>>> errors are noticed. > >>>>> > >>>>> With --num-threads=N enabled, there will be N sub-threads created. All > >>>>> sub-threads are producers which responsible for mm page processing, e.g. > >>>>> compression. The main thread is the consumer which responsible for > >>>>> writing the compressed data into file. page_flag_buf->ready is used to > >>>>> sync main and sub-threads. When a sub-thread finishes page processing, > >>>>> it will set ready flag to be FLAG_READY. In the meantime, main thread > >>>>> looply check all threads of the ready flags, and break the loop when > >>>>> find FLAG_READY. > >>>> > >>>> I've tried to reproduce the issue, but I couldn't on x86_64. > >>> > >>> Yes, I cannot reproduce it on x86_64 either, but the issue is very > >>> easily reproduced on ppc64 arch, which is where our QE reported. > >> > >> Yes, this is expected. X86 implements a strongly ordered memory model, > >> so a "store-to-memory" instruction ensures that the new value is > >> immediately observed by other CPUs. > >> > >> FWIW the current code is wrong even on X86, because it does nothing to > >> prevent compiler optimizations. 
The compiler is then allowed to reorder > >> instructions so that the write to page_flag_buf->ready happens after > >> other writes; with a bit of bad scheduling luck, the consumer thread > >> may see an inconsistent state (e.g. read a stale page_flag_buf->pfn). > >> Note that thanks to how compilers are designed (today), this issue is > >> more or less hypothetical. Nevertheless, the use of atomics fixes it, > >> because they also serve as memory barriers. > > Thank you Petr, for the information. I was wondering whether atomic > operations might be necessary for the other members of page_flag_buf, > but it looks like they won't be necessary in this case. > > Then I was convinced that the issue would be fixed by removing the > inconsistency of page_flag_buf->ready. And the patch tested ok, so ack. > Thank you all for the patch review, patch testing and comments, these have been so helpful! Thanks, Tao Liu > Thanks, > Kazu > > > > > Thanks a lot for your detailed explanation, it's very helpful! I > > haven't thought of the possibility of instruction reordering and > > atomic_rw prevents the reorder. > > > > Thanks, > > Tao Liu > > > >> > >> Petr T > >> From jani.nikula at linux.intel.com Fri Jul 4 01:29:01 2025 From: jani.nikula at linux.intel.com (Jani Nikula) Date: Fri, 04 Jul 2025 11:29:01 +0300 Subject: Second kexec_file_load (but not kexec_load) fails on i915 if CONFIG_INTEL_IOMMU_DEFAULT_ON=n In-Reply-To: <197d1dc3bff.c01ddb9024897.1898328361232711826@zohomail.com> References: <197d1dc3bff.c01ddb9024897.1898328361232711826@zohomail.com> Message-ID: On Thu, 03 Jul 2025, Askar Safin wrote: > TL;DR: I found a bug in strange interaction in kexec_file_load (but not kexec_load) and i915 > TL;DR#2: Second (sometimes third or forth) kexec (using kexec_file_load) fails on my particular hardware > TL;DR#3: I did 55 expirements, each of them required a lot of boots, in total I did 1908 boots Thanks for the detailed debug info. I'm afraid all I can say at this point is, please file all of this in a bug report as described in [1]. Please add the drm.debug related options, and attach the dmesgs and configs in the bug instead of pointing at external sites. BR, Jani. [1] https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html > > Okay, so I found a bug. Steps to reproduce: > - I have Dell Precision 7780 > - I have recent Debian x86_64 sid installed (bug reproducible with both Debian kernels and mainline ones) > - Bug is reproducible on many kernels, including very recent ones, for example 6.15.4 > - Boot system, then do kexec into the same system using kexec_file_load. I. e. pass --kexec-file-syscall to "kexec" command > - Then kexec from this kexec'ed system again (i. e. you should do two kexec's in a row) > - Then do 3rd kexec, etc > - Repeat kexec's until you do 100 kexec's or your system start to misbehave > > On my computer the system starts to misbehave after some number of kexec's. This always happens after 2nd kexec attempt. > I. e. the first kexec is always successful. But second sometimes is not. > I never was able to perform 100 kexec's in a row. > After some kexec attempt the system starts to misbehave: oopses, panics, locked system, etc. > > Notes: > > - I tried to bisect "kexec-tools" package, but bisect merely gave me commit, which switched to kexec_file_load as a default. 
> Bug is reproducible if we use kexec_file_load, but doesn't reproduce if we use kexec_load > > - Bug is reproducible even if we boot via init=/bin/bash (note: this means that initramfs is still part of the boot process). (If we boot to normal GUI, bug is reproducible, too) > > - When I reproduce I use this command line: "root=UUID=... rootflags=subvol=... ro init=..." > > - Debian package "plymouth" is required for reproducing. (It reproduces with plymouth, but doesn't reproduce without plymouth.) But note that I never see actual plymouth screen! I. e. presence of > "plymouth" on the system somehow affects bug reproduciblity despite plymouth animation never actually shown. I don't know why this happens, but I suspect that I don't pass "splash" to kernel command line, and thus don't see plymouth screen. But I suspect that plymouth is still included to initramfs and from there somehow affects boot process > > - Bug reproduces in Debian, but doesn't reproduce in Ubuntu. After a lot of expirementing I finally understood why: Ubuntu kernel has CONFIG_INTEL_IOMMU_DEFAULT_ON=y, and Debian kernel has not. Additional expirements found that it is culpit. I. e. the bug is reproducible with CONFIG_INTEL_IOMMU_DEFAULT_ON=n and not reproducbile with CONFIG_INTEL_IOMMU_DEFAULT_ON=y . (So advice for distributions: do what Ubuntu does, i. e. set CONFIG_INTEL_IOMMU_DEFAULT_ON=y to hide this bug) > > - Bug is not reproducible in old enough kernels, so I did bisect on Linux. Bisect showed me these commits: d4a2393049..4a75f32fc7. I. e. bug is reproducible in 4a75f32fc7, but doesn't reproduce in d4a2393049. Between them there is a middle commit 52407c220c44c8dcc6a, which is not testable. Here are these commits: > > commit 4a75f32fc783128d0c42ef73fa62a20379a66828 > Author: Anusha Srivatsa > > ? ?drm/i915/rpl-s: Add PCH Support for Raptor Lake S > > commit 52407c220c44c8dcc6aa8aa35ffc8a2db3c849a9 > Author: Anusha Srivatsa > > ? ?drm/i915/rpl-s: Add PCI IDS for Raptor Lake S > > It seems these commits merely added support for my Intel GPU model. So this is fake regression. I'm not sure this should be treated as proper regression and whether regzbot should be notified. (What do you think?) > > Still formally this is regression: I did expirements and they show that bug present in 4a75f32fc783128d0c42 and not present before. (Side note: in latest kernels both wayland and x11 work, in d4a2393049 x11 works and wayland doesn't.) > > I tried to reproduce the bug in Qemu, but I was unable to do so. It seems Intel GPU is required, maybe even my particular model. 
> > Here is "lspci -vnn -d :*:0300" for my GPU: > > 00:02.0 VGA compatible controller [0300]: Intel Corporation Raptor Lake-S UHD Graphics [8086:a788] (rev 04) (prog-if 00 [VGA controller]) > Subsystem: Dell Raptor Lake-S UHD Graphics [1028:0c42] > Flags: bus master, fast devsel, latency 0, IRQ 202, IOMMU group 0 > Memory at 604b000000 (64-bit, non-prefetchable) [size=16M] > Memory at 4000000000 (64-bit, prefetchable) [size=256M] > I/O ports at 3000 [size=64] > Expansion ROM at 000c0000 [virtual] [disabled] [size=128K] > Capabilities: [40] Vendor Specific Information: Len=0c > Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00 > Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit- > Capabilities: [d0] Power Management version 2 > Capabilities: [100] Process Address Space ID (PASID) > Capabilities: [200] Address Translation Service (ATS) > Capabilities: [300] Page Request Interface (PRI) > Capabilities: [320] Single Root I/O Virtualization (SR-IOV) > Kernel driver in use: i915 > Kernel modules: i915 > > dmidecode: > https://zerobin.net/?aebea072b93d8122#z4W9URnV+k9ZZErhP4etQkxlfpyRKf++uKMNoO5PGjs= > > - I use "root=UUID=... rootflags=subvol=... ro init=..." as a command line for reproducing. If I add "recovery nomodeset dis_ucode_ldr" (this is options used by Ubuntu in recovery mode), the bug stops to reproduce > > Again, in short, full list of things required for successful reproducing: > - Intel GPU, possibly my particular model > - Kernel with support for my model (4a75f32fc783128d0c42 and later up to 6.15.4) > - Kexec at least two times. (One kexec never fails, 100 kexec's in a row never succeed) > - kexec_file_load as opposed to kexec_load > - Initramfs > - Lack of parameters "recovery nomodeset dis_ucode_ldr" (i. e. one of them stops reproducing) > - plymouth > - CONFIG_INTEL_IOMMU_DEFAULT_ON=n > > Removing of ANY of them stops the bug, and I proved this by lots of expirements. > > In total I did 55+ expirements, each of them required up to 100 boots. In total I did 1908 (!!!!!!) boots on my physical laptop (I mean kexec boots here). 
No, I'm not faking this number, here is my actual directories with results: > > user at subvolume:~$ ls /rbt/kx-results/ > @rec-2025-06-29T201723Z-bad-4 @rec-2025-06-29T214650Z-good-60 @rec-2025-07-03T050626Z-bad-41 @rec-2025-07-03T104125Z-bad-28 @rec-2025-07-03T133705Z-bad-3 > @rec-2025-06-29T203429Z-good-60 @rec-2025-06-29T215558Z-bad-8 @rec-2025-07-03T060107Z-good-100 @rec-2025-07-03T111727Z-bad-13 @rec-2025-07-03T141647Z-good-100 > @rec-2025-06-29T205626Z-good-60 @rec-2025-07-01T042949Z-bad-12 @rec-2025-07-03T074810Z-good-100 @rec-2025-07-03T122242Z-good-100 @rec-2025-07-03T145705Z-good-100 > @rec-2025-06-29T211612Z-bad-6 @rec-2025-07-02T120101Z-good-60 @rec-2025-07-03T082914Z-good-100 @rec-2025-07-03T123958Z-bad-12 @rec-2025-07-03T152406Z-bad-50 > @rec-2025-06-29T212932Z-good-60 @rec-2025-07-03T031038Z-good-60 @rec-2025-07-03T100615Z-good-100 @rec-2025-07-03T132116Z-good-100 @rec-2025-07-03T154204Z-bad-15 > user at subvolume:~$ ls /rbt/kx-manual-testing/ > 2025-07-01-03-19-good-6 2025-07-01-03-56-good-4 2025-07-01-05-28-bad-3 2025-07-01-06-35-bad-2 2025-07-01-09-46-good-8 > 2025-07-01-03-44-good-3 2025-07-01-04-47-good-3 2025-07-01-06-19-bad-2 2025-07-01-09-21-bad-2 2025-07-02-13-09-good > user at subvolume:~$ ls /rbt/kx-vanilla-results/ > 2025-06-30T005219Z_5.16.0-kx-df0cc57e057f18e4-3e17eec5ff024b63_1626_good_60 2025-06-30T023542Z_5.16.0-rc2-kx-87bb2a410dcfb617-9f30253daecd39e5_1663_bad_4 > 2025-06-30T012313Z_5.17.0-kx-f443e374ae131c16-91b07dce12a83fab_1674_bad_1 2025-06-30T032312Z_5.16.0-rc2-kx-c9ee950a2ca55ea0-854a1f40ce042801_1662_bad_6 > 2025-06-30T013555Z_5.16.0-kx-22ef12195e13c5ec-9aaf880b25942f2a_1668_bad_7 2025-06-30T033528Z_5.16.0-rc2-kx-ba884a411700dc56-854a1f40ce042801_1662_good_60 > 2025-06-30T014106Z_5.16.0-kx-9bcbf894b6872216-b828905f3cf12050_1664_bad_2 2025-06-30T034645Z_5.16.0-rc2-kx-d4a23930490df39f-854a1f40ce042801_1662_good_60 > 2025-06-30T014634Z_5.16.0-rc5-kx-cb6846fbb83b574c-83e7c6cf2ede57b4_1663_bad_6 2025-06-30T035232Z_5.16.0-rc2-kx-4a75f32fc783128d-854a1f40ce042801_1662_bad_5 > 2025-06-30T015713Z_5.16.0-rc2-kx-15bb79910fe734ad-9f30253daecd39e5_1663_good_60 2025-06-30T042058Z_5.16.0-rc2-kx-4a75f32fc783128d-854a1f40ce042801_1662_bad_1 > 2025-06-30T020235Z_5.16.0-rc5-kx-b06103b5325364e0-26176b9b704a5c24_1664_bad_6 2025-06-30T050000Z_6.15.4-kx-e60eb441596d1c70-2378f4efc5e956e5_2366_bad_2 > 2025-06-30T020717Z_5.16.0-rc5-kx-eacef9fd61dcf5ea-26176b9b704a5c24_1664_bad_1 2025-06-30T053011Z_6.15.4-kx-e60eb441596d1c70-2378f4efc5e956e5_2366_good_60 > 2025-06-30T021738Z_5.16.0-rc2-kx-67b858dd89932086-8d2f1d17f1e1933c_1662_good_60 2025-06-30T060619Z_5.16.0-rc2-kx-d4a23930490df39f-854a1f40ce042801_1662_good_60 > 2025-06-30T022759Z_5.16.0-rc2-kx-17815f624a90579a-854a1f40ce042801_1662_good_60 2025-06-30T061448Z_5.16.0-rc2-kx-4a75f32fc783128d-854a1f40ce042801_1662_bad_1 > > Each number in the end of file/directory name is number of boots. In total we have 1908 boots. Testing was mostly automatical, using my script. > > Here is one example dmesg from mainline commit e60eb441596d1c70 (somewhere around 6.15.4): > > https://zerobin.net/?119ff118fd47b363#BpziYs6dNz5PaT7H8w2hlveoEYa4DDtITGkyd9o57LE= > > This is was dmesg from 2nd (and in the same time last) boot. The next boot (i. e. kexec) was unsuccessful. Corresponding config: > > https://zerobin.net/?009c807e1df41af8#gnmrswlbaFbdPTuzNq6NFkQd/Jhb3Ds0ZlLiwNanXnc= > > If you want results from all expirements, here is a link: https://filebin.net/45g2757b2iwaeen7 (1 Mb, expires after 7 days). 
Usually experiments come with a full reproducer script. > > But what I described above is already enough, I think this link is not needed. > > I will be available for testing in the coming days, then I will switch to other things, and so will not be available for testing. > If you want more time, then please ask for it, i.e. tell me something like "Please be available for testing for 10 more days". > > -- > Askar Safin > https://types.pl/@safinaskar > -- Jani Nikula, Intel From safinaskar at zohomail.com Fri Jul 4 13:11:23 2025 From: safinaskar at zohomail.com (Askar Safin) Date: Sat, 05 Jul 2025 00:11:23 +0400 Subject: Second kexec_file_load (but not kexec_load) fails on i915 if CONFIG_INTEL_IOMMU_DEFAULT_ON=n In-Reply-To: References: <197d1dc3bff.c01ddb9024897.1898328361232711826@zohomail.com> Message-ID: <197d710ac39.10e2c241536088.2706332519040181850@zohomail.com> ---- On Fri, 04 Jul 2025 12:29:01 +0400 Jani Nikula wrote --- > Thanks for the detailed debug info. I'm afraid all I can say at this > point is, please file all of this in a bug report as described in > [1]. Please add the drm.debug related options, and attach the dmesgs and > configs in the bug instead of pointing at external sites. Okay, now let me speculate how to fix this bug. :) I think someone with moderate kexec understanding and with an Intel GPU should do this: reproduce the bug and then slowly modify the kexec_file_load code until it becomes the kexec_load code. (Or vice versa.) Somewhere in the middle of this modification the bug will stop reproducing, and so we will know what exactly causes it. kexec_file_load and kexec_load should behave the same. If they do not, then we should understand why. We should closely review their code. Also, in the case of kexec_load, kernel uncompressing and parsing is performed by the "kexec" userspace tool, and in the case of kexec_file_load by the kernel. So we should closely review these two uncompressing/parsing code fragments. I think that this bug is related to kexec, not to i915. And thus it should be fixed by kexec people, not by i915 people. (But I may be wrong.) But okay, I reported it to that bug tracker anyway: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14598 Maybe there is a separate kexec bug tracker? Also, your bug tracker is cool. One can attach files in the middle of a report. Why doesn't the whole kernel use it? :) -- Askar Safin https://types.pl/@safinaskar From bhe at redhat.com Sun Jul 6 21:54:34 2025 From: bhe at redhat.com (Baoquan He) Date: Mon, 7 Jul 2025 12:54:34 +0800 Subject: [PATCH v3 3/5] kdump, documentation: describe craskernel CMA reservation In-Reply-To: References: <053f8c6d-0acd-465b-8d9f-a46d50ccce71@redhat.com> Message-ID: On 06/27/25 at 02:18pm, David Hildenbrand wrote: > On 27.06.25 14:16, David Hildenbrand wrote: > > On 14.03.25 04:18, Baoquan He wrote: > > > Hi Jiri, > > > > > > On 03/12/25 at 10:09pm, Jiri Bohac wrote: > > > ...... > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > > > > index fb8752b42ec8..895b974dc3bb 100644 > > > > --- a/Documentation/admin-guide/kernel-parameters.txt > > > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > > > @@ -987,6 +987,28 @@ > > > > 0: to disable low allocation. > > > > It will be ignored when crashkernel=X,high is not used > > > > or memory reserved is below 4G. > > > > + crashkernel=size[KMG],cma > > > > + [KNL, X86] Reserve additional crash kernel memory from > > > > + CMA.
This reservation is usable by the first system's > > > > + userspace memory and kernel movable allocations (memory > > > > + balloon, zswap). Pages allocated from this memory range > > > > + will not be included in the vmcore so this should not > > > > + be used if dumping of userspace memory is intended and > > > > + it has to be expected that some movable kernel pages > > > > + may be missing from the dump. > > > > > > Since David and Don expressed concern about the missing kernel pages > > > allocated from CMA area in v2, and you argued this is still useful for > > > VM system, I would like to invite David to help evaluate the whole > > > series if it's worth from the VM and MM point of view. > > > > Balloon pages will not be dumped either way (PageOffline), so that is > > not a convern. > > > > Zsmalloc pages ... are probably fine right now. They should likely only > > be storing compressed user data. (not sure if they also store some other > > datastructures, I think no, but might be wrong) > > > > My comment was rather forward-looking: that CMA memory only contains > > user space memory is already not the case (but the existing cases might > > be okay). In the future, as we support other movable allocations (as > > raised, leaf page tables at some point, and there were discussions about > > movable slab pages, although that might be challenging) this can change > > (unless we find ways of not placing these allocations on CMA memory). > > > > So as is, this should be fine, but it's certainly something to be aware > > of in the future. > > > > BTW, I realize this was a late reply, and that the series already proceeded. > Just stumbled over that un-replied mail an thought I'd clarify my point > here. Thanks a lot for deliberating on this and providing these helpful details. As you said, this feature is fine for the time being, we can remember this and consider how to adapt in the future once those movable allocations could happen in CMA. And the risk has been told clearly in doc. From sashal at kernel.org Mon Jul 7 17:02:13 2025 From: sashal at kernel.org (Sasha Levin) Date: Mon, 7 Jul 2025 20:02:13 -0400 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: <20250708000215.793090-1-sashal@kernel.org> References: <20250708000215.793090-1-sashal@kernel.org> Message-ID: <20250708000215.793090-6-sashal@kernel.org> From: Mario Limonciello [ Upstream commit 12ffc3b1513ebc1f11ae77d053948504a94a68a6 ] Currently swap is restricted before drivers have had a chance to do their prepare() PM callbacks. Restricting swap this early means that if a driver needs to evict some content from memory into sawp in it's prepare callback, it won't be able to. On AMD dGPUs this can lead to failed suspends under memory pressure situations as all VRAM must be evicted to system memory or swap. Move the swap restriction to right after all devices have had a chance to do the prepare() callback. If there is any problem with the sequence, restore swap in the appropriate dpm resume callbacks or error handling paths. Closes: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174 Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2362 Signed-off-by: Mario Limonciello Tested-by: Nat Wittstock Tested-by: Lucian Langa Link: https://patch.msgid.link/20250613214413.4127087-1-superm1 at kernel.org Signed-off-by: Rafael J. 
Wysocki Signed-off-by: Sasha Levin --- **YES** This commit should be backported to stable kernel trees for the following reasons: ## Critical Bug Fix for Real User Issues 1. **Fixes Actual Suspend Failures**: The commit addresses real-world suspend failures under memory pressure on systems with AMD discrete GPUs. The linked issues (ROCm/ROCK-Kernel-Driver#174 and freedesktop.org/drm/amd#2362) indicate this affects actual users. 2. **Regression Fix**: This is effectively a regression fix. The PM subsystem's early swap restriction prevents AMD GPU drivers from properly evicting VRAM during their prepare() callbacks, which is a requirement that has become more critical as GPU VRAM sizes have increased. ## Small, Contained Change 3. **Minimal Code Changes**: The fix is remarkably simple - it just moves the `pm_restrict_gfp_mask()` call from early in the suspend sequence to after `dpm_prepare()` completes. The changes are: - Move `pm_restrict_gfp_mask()` from multiple early locations to inside `dpm_suspend_start()` after `dpm_prepare()` succeeds - Add corresponding `pm_restore_gfp_mask()` calls in error paths and resume paths - Remove the now-redundant calls from hibernate.c and suspend.c 4. **Low Risk of Regression**: The change maintains the original intent of preventing I/O during the critical suspend phase while allowing it during device preparation. The swap restriction still happens before `dpm_suspend()`, just after `dpm_prepare()`. ## Follows Stable Rules 5. **Meets Stable Criteria**: - Fixes a real bug that bothers people (suspend failures) - Small change (moves function calls, doesn't introduce new logic) - Obviously correct (allows drivers to use swap during their designated preparation phase) - Already tested by users (Tested-by tags from affected users) ## Similar to Other Backported Commits 6. **Pattern Matches**: Looking at the similar commits provided, this follows the same pattern as the AMD GPU eviction commits that were backported. Those commits also addressed the same fundamental issue - ensuring GPU VRAM can be properly evicted during suspend/hibernation. ## Critical Timing 7. **Error Path Handling**: The commit properly handles error paths by adding `pm_restore_gfp_mask()` calls in: - `dpm_resume_end()` for normal resume - `platform_recover()` error path in suspend.c - `pm_restore_gfp_mask()` in kexec_core.c for kexec flows The commit is well-tested, addresses a real problem affecting users, and makes a minimal, obviously correct change to fix suspend failures on systems with discrete GPUs under memory pressure. 
drivers/base/power/main.c | 5 ++++- include/linux/suspend.h | 5 +++++ kernel/kexec_core.c | 1 + kernel/power/hibernate.c | 3 --- kernel/power/power.h | 5 ----- kernel/power/suspend.c | 3 +-- 6 files changed, 11 insertions(+), 11 deletions(-) diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 1926454c7a7e8..dd1efa95bcf15 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -1182,6 +1182,7 @@ void dpm_complete(pm_message_t state) */ void dpm_resume_end(pm_message_t state) { + pm_restore_gfp_mask(); dpm_resume(state); dpm_complete(state); } @@ -2015,8 +2016,10 @@ int dpm_suspend_start(pm_message_t state) error = dpm_prepare(state); if (error) dpm_save_failed_step(SUSPEND_PREPARE); - else + else { + pm_restrict_gfp_mask(); error = dpm_suspend(state); + } dpm_show_time(starttime, state, error, "start"); return error; diff --git a/include/linux/suspend.h b/include/linux/suspend.h index da6ebca3ff774..d638f31dc32cd 100644 --- a/include/linux/suspend.h +++ b/include/linux/suspend.h @@ -441,6 +441,8 @@ extern int unregister_pm_notifier(struct notifier_block *nb); extern void ksys_sync_helper(void); extern void pm_report_hw_sleep_time(u64 t); extern void pm_report_max_hw_sleep(u64 t); +void pm_restrict_gfp_mask(void); +void pm_restore_gfp_mask(void); #define pm_notifier(fn, pri) { \ static struct notifier_block fn##_nb = \ @@ -485,6 +487,9 @@ static inline int unregister_pm_notifier(struct notifier_block *nb) static inline void pm_report_hw_sleep_time(u64 t) {}; static inline void pm_report_max_hw_sleep(u64 t) {}; +static inline void pm_restrict_gfp_mask(void) {} +static inline void pm_restore_gfp_mask(void) {} + static inline void ksys_sync_helper(void) {} #define pm_notifier(fn, pri) do { (void)(fn); } while (0) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 3e62b944c8833..2972278497b0b 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -1082,6 +1082,7 @@ int kernel_kexec(void) Resume_devices: dpm_resume_end(PMSG_RESTORE); Resume_console: + pm_restore_gfp_mask(); console_resume_all(); thaw_processes(); Restore_console: diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c index 5af9c7ee98cd4..0bb5a7befe944 100644 --- a/kernel/power/hibernate.c +++ b/kernel/power/hibernate.c @@ -418,7 +418,6 @@ int hibernation_snapshot(int platform_mode) } console_suspend_all(); - pm_restrict_gfp_mask(); error = dpm_suspend(PMSG_FREEZE); @@ -554,7 +553,6 @@ int hibernation_restore(int platform_mode) pm_prepare_console(); console_suspend_all(); - pm_restrict_gfp_mask(); error = dpm_suspend_start(PMSG_QUIESCE); if (!error) { error = resume_target_kernel(platform_mode); @@ -566,7 +564,6 @@ int hibernation_restore(int platform_mode) BUG_ON(!error); } dpm_resume_end(PMSG_RECOVER); - pm_restore_gfp_mask(); console_resume_all(); pm_restore_console(); return error; diff --git a/kernel/power/power.h b/kernel/power/power.h index f8496f40b54fa..6037090578b71 100644 --- a/kernel/power/power.h +++ b/kernel/power/power.h @@ -235,11 +235,6 @@ static inline void suspend_test_finish(const char *label) {} /* kernel/power/main.c */ extern int pm_notifier_call_chain_robust(unsigned long val_up, unsigned long val_down); extern int pm_notifier_call_chain(unsigned long val); -void pm_restrict_gfp_mask(void); -void pm_restore_gfp_mask(void); -#else -static inline void pm_restrict_gfp_mask(void) {} -static inline void pm_restore_gfp_mask(void) {} #endif #ifdef CONFIG_HIGHMEM diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c index 
8eaec4ab121d4..d22edf9678872 100644 --- a/kernel/power/suspend.c +++ b/kernel/power/suspend.c @@ -537,6 +537,7 @@ int suspend_devices_and_enter(suspend_state_t state) return error; Recover_platform: + pm_restore_gfp_mask(); platform_recover(state); goto Resume_devices; } @@ -600,9 +601,7 @@ static int enter_state(suspend_state_t state) trace_suspend_resume(TPS("suspend_enter"), state, false); pm_pr_dbg("Suspending system (%s)\n", mem_sleep_labels[state]); - pm_restrict_gfp_mask(); error = suspend_devices_and_enter(state); - pm_restore_gfp_mask(); Finish: events_check_enabled = false; -- 2.39.5 From pavel at ucw.cz Mon Jul 7 23:25:46 2025 From: pavel at ucw.cz (Pavel Machek) Date: Tue, 8 Jul 2025 08:25:46 +0200 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: <20250708000215.793090-6-sashal@kernel.org> References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> Message-ID: On Mon 2025-07-07 20:02:13, Sasha Levin wrote: > From: Mario Limonciello > > [ Upstream commit 12ffc3b1513ebc1f11ae77d053948504a94a68a6 ] > > Currently swap is restricted before drivers have had a chance to do > their prepare() PM callbacks. Restricting swap this early means that if > a driver needs to evict some content from memory into sawp in it's > prepare callback, it won't be able to. > > On AMD dGPUs this can lead to failed suspends under memory pressure > situations as all VRAM must be evicted to system memory or swap. > > Move the swap restriction to right after all devices have had a chance > to do the prepare() callback. If there is any problem with the sequence, > restore swap in the appropriate dpm resume callbacks or error handling > paths. > > Closes: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174 > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2362 > Signed-off-by: Mario Limonciello > Tested-by: Nat Wittstock > Tested-by: Lucian Langa > Link: https://patch.msgid.link/20250613214413.4127087-1-superm1 at kernel.org > Signed-off-by: Rafael J. Wysocki > Signed-off-by: Sasha Levin > --- > > **YES** > > This commit should be backported to stable kernel trees for the > following reasons: > > ## Critical Bug Fix for Real User Issues > > 1. **Fixes Actual Suspend Failures**: The commit addresses real-world > suspend failures under memory pressure on systems with AMD discrete > GPUs. The linked issues (ROCm/ROCK-Kernel-Driver#174 and > freedesktop.org/drm/amd#2362) indicate this affects actual users. > > 2. **Regression Fix**: This is effectively a regression fix. The PM > subsystem's early swap restriction prevents AMD GPU drivers from > properly evicting VRAM during their prepare() callbacks, which is a > requirement that has become more critical as GPU VRAM sizes have > increased. Stop copying AI generated nonsense to your emails while making it look you wrote that. When did this regress? Pavel -- I don't work for Nazis and criminals, and neither should you. Boycott Putin, Trump, and Musk! -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 195 bytes Desc: not available URL: From pavel at ucw.cz Mon Jul 7 23:39:47 2025 From: pavel at ucw.cz (Pavel Machek) Date: Tue, 8 Jul 2025 08:39:47 +0200 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: <20250708000215.793090-6-sashal@kernel.org> References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> Message-ID: Hi! > From: Mario Limonciello > > [ Upstream commit 12ffc3b1513ebc1f11ae77d053948504a94a68a6 ] > > Currently swap is restricted before drivers have had a chance to do > their prepare() PM callbacks. Restricting swap this early means that if > a driver needs to evict some content from memory into sawp in it's > prepare callback, it won't be able to. > > On AMD dGPUs this can lead to failed suspends under memory pressure > situations as all VRAM must be evicted to system memory or swap. > > Move the swap restriction to right after all devices have had a chance > to do the prepare() callback. If there is any problem with the sequence, > restore swap in the appropriate dpm resume callbacks or error handling > paths. > > Closes: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174 > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2362 > Signed-off-by: Mario Limonciello > Tested-by: Nat Wittstock > Tested-by: Lucian Langa > Link: https://patch.msgid.link/20250613214413.4127087-1-superm1 at kernel.org > Signed-off-by: Rafael J. Wysocki > Signed-off-by: Sasha Levin > ## Small, Contained Change > > 3. **Minimal Code Changes**: The fix is remarkably simple - it just > moves the `pm_restrict_gfp_mask()` call from early in the suspend > sequence to after `dpm_prepare()` completes. The changes are: This is not a contained change. It changes the environment in which drivers run. I have a strong suspicion that you did not do the actual analysis, but let some kind of LLM "analyze" it, then signed it with your name. Is my analysis correct? Pavel -- I don't work for Nazis and criminals, and neither should you. Boycott Putin, Trump, and Musk! -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 195 bytes Desc: not available URL: From dwmw at amazon.co.uk Tue Jul 8 00:18:36 2025 From: dwmw at amazon.co.uk (Woodhouse, David) Date: Tue, 8 Jul 2025 07:18:36 +0000 Subject: [PATCH v2] selftests/kexec: fix test_kexec_jump build In-Reply-To: References: <20250702171704.22559-2-moonhee.lee.ca@gmail.com> Message-ID: On Thu, 2025-07-03 at 14:44 +0800, Baoquan He wrote: > On 07/02/25 at 10:17am, Moon Hee Lee wrote: > > The test_kexec_jump program builds correctly when invoked from the > > top-level > > selftests/Makefile, which explicitly sets the OUTPUT variable. > > However, > > building directly in tools/testing/selftests/kexec fails with: > > > > make: *** No rule to make target '/test_kexec_jump', needed by > > 'test_kexec_jump.sh'. Stop. > > I can reproduce this, and this patch fixes it. Thanks. > > Acked-by: Baoquan He Acked-by: David Woodhouse Thanks. -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5964 bytes Desc: not available URL: -------------- next part -------------- Amazon Development Centre (London) Ltd. Registered in England and Wales with registration number 04543232 with its registered office at 1 Principal Place, Worship Street, London EC2A 2FA, United Kingdom.
-------------- next part -------------- An HTML attachment was scrubbed... URL: From ebiederm at xmission.com Tue Jul 8 12:13:42 2025 From: ebiederm at xmission.com (Eric W. Biederman) Date: Tue, 08 Jul 2025 14:13:42 -0500 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: <20250708000215.793090-6-sashal@kernel.org> (Sasha Levin's message of "Mon, 7 Jul 2025 20:02:13 -0400") References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> Message-ID: <87ikk2wl5l.fsf@email.froward.int.ebiederm.org> Wow! Sasha I think an impersonator has gotten into your account, and is just making nonsense up. This reads like an impassioned plea to backport this change, from someone who has actually dealt with it. However reading the justification in detail is an exercise in reading falsehoods. If this does not come from an impersonator then: if this comes from a human being, I recommend you have a talk with them. If this comes from a machine I recommend you take it out of commission and rework it. If I see this kind of baloney again I expect I will just auto-nack it instead of reading it, as reading it appears to be a waste of time. It is a complete waste reading fiction in what little time I have for kernel development. Eric Sasha Levin writes: > **YES** > > This commit should be backported to stable kernel trees for the > following reasons: > > ## Critical Bug Fix for Real User Issues > > 1. **Fixes Actual Suspend Failures**: The commit addresses real-world > suspend failures under memory pressure on systems with AMD discrete > GPUs. The linked issues (ROCm/ROCK-Kernel-Driver#174 and > freedesktop.org/drm/amd#2362) indicate this affects actual users. The links in the first paragraph are very distorted. The links from the actual change are: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174 https://gitlab.freedesktop.org/drm/amd/-/issues/2362 Those completely distorted links make understanding this justification much harder than necessary. > 2. **Regression Fix**: This is effectively a regression fix. The PM > subsystem's early swap restriction prevents AMD GPU drivers from > properly evicting VRAM during their prepare() callbacks, which is a > requirement that has become more critical as GPU VRAM sizes have > increased. That is a justification. There is no evidence that a kernel change made this worse. Thus there is no evidence this is a regression fix. > ## Small, Contained Change > > 3. **Minimal Code Changes**: The fix is remarkably simple - it just > moves the `pm_restrict_gfp_mask()` call from early in the suspend > sequence to after `dpm_prepare()` completes. The changes are: > - Move `pm_restrict_gfp_mask()` from multiple early locations to > inside `dpm_suspend_start()` after `dpm_prepare()` succeeds > - Add corresponding `pm_restore_gfp_mask()` calls in error paths and > resume paths > - Remove the now-redundant calls from hibernate.c and suspend.c Completely wrong. > 4. **Low Risk of Regression**: The change maintains the original intent > of preventing I/O during the critical suspend phase while allowing it > during device preparation. The swap restriction still happens before > `dpm_suspend()`, just after `dpm_prepare()`. This is a fundamental change to a subsystem that the subsystem maintainer does not say is low risk. > ## Follows Stable Rules > > 5. **Meets Stable Criteria**: > - Fixes a real bug that bothers people (suspend failures) Addresses a real bug.
> - Small change (moves function calls, doesn't introduce new logic) The change is a large change in the logic. > - Obviously correct (allows drivers to use swap during their > designated preparation phase) It obviously changes the behavior. It is not at all obvious the change in behavior is desirable for all callbacks, and in all other scenarios. > - Already tested by users (Tested-by tags from affected users) Yes it has Tested-by tags. > ## Similar to Other Backported Commits > > 6. **Pattern Matches**: Looking at the similar commits provided, this > follows the same pattern as the AMD GPU eviction commits that were > backported. Those commits also addressed the same fundamental issue - > ensuring GPU VRAM can be properly evicted during suspend/hibernation. Which commits are those that were backported? > ## Critical Timing Timing??? There is no race condition. > 7. **Error Path Handling**: The commit properly handles error paths by > adding `pm_restore_gfp_mask()` calls in: > - `dpm_resume_end()` for normal resume > - `platform_recover()` error path in suspend.c > - `pm_restore_gfp_mask()` in kexec_core.c for kexec flows > > The commit is well-tested, addresses a real problem affecting users, and > makes a minimal, obviously correct change to fix suspend failures on > systems with discrete GPUs under memory pressure. What evidence is there that this commit has been tested, let alone well-tested? The entire line of reasoning is completely suspect. Eric From ebiederm at xmission.com Tue Jul 8 12:32:02 2025 From: ebiederm at xmission.com (Eric W. Biederman) Date: Tue, 08 Jul 2025 14:32:02 -0500 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: <20250708000215.793090-6-sashal@kernel.org> (Sasha Levin's message of "Mon, 7 Jul 2025 20:02:13 -0400") References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> Message-ID: <87ms9esclp.fsf@email.froward.int.ebiederm.org> Wow! Sasha I think an impersonator has gotten into your account, and is just making nonsense up. At first glance this reads like an impassioned plea to backport this change, from someone who has actually dealt with it. Unfortunately reading the justification in detail is an exercise in reading falsehoods. If this does not come from an impersonator then: - If this comes from a human being, I recommend you have a talk with them. - If this comes from a machine I recommend you take it out of commission and rework it. At best all of this appears to be an effort to get someone else to do necessary thinking for you. As my time for kernel work is very limited I expect I will auto-nack any such future attempts to outsource someone else's thinking on me. Eric Sasha Levin writes: > From: Mario Limonciello > > [ Upstream commit 12ffc3b1513ebc1f11ae77d053948504a94a68a6 ] > > Currently swap is restricted before drivers have had a chance to do > their prepare() PM callbacks. Restricting swap this early means that if > a driver needs to evict some content from memory into sawp in it's > prepare callback, it won't be able to. > > On AMD dGPUs this can lead to failed suspends under memory pressure > situations as all VRAM must be evicted to system memory or swap. > > Move the swap restriction to right after all devices have had a chance > to do the prepare() callback. If there is any problem with the sequence, > restore swap in the appropriate dpm resume callbacks or error handling > paths.
> > Closes: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174 > Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2362 > Signed-off-by: Mario Limonciello > Tested-by: Nat Wittstock > Tested-by: Lucian Langa > Link: https://patch.msgid.link/20250613214413.4127087-1-superm1 at kernel.org > Signed-off-by: Rafael J. Wysocki > Signed-off-by: Sasha Levin > --- > > **YES** > > This commit should be backported to stable kernel trees for the > following reasons: Really? And when those reasons turn out to be baloney? > ## Critical Bug Fix for Real User Issues > > 1. **Fixes Actual Suspend Failures**: The commit addresses real-world > suspend failures under memory pressure on systems with AMD discrete > GPUs. The linked issues (ROCm/ROCK-Kernel-Driver#174 and > freedesktop.org/drm/amd#2362) indicate this affects actual users. Those linked issues are completely corrupted in the paragraph above. From the original commit the proper issues are: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174 https://gitlab.freedesktop.org/drm/amd/-/issues/2362 Which indicate that something is going on, but are old enough and long enough that coming to any kind of conclusion from them is not easy. > 2. **Regression Fix**: This is effectively a regression fix. The PM > subsystem's early swap restriction prevents AMD GPU drivers from > properly evicting VRAM during their prepare() callbacks, which is a > requirement that has become more critical as GPU VRAM sizes have > increased. There is no indication that this used to work, or that an earlier kernel change caused this to stop working. This is not a regression. > ## Small, Contained Change > > 3. **Minimal Code Changes**: The fix is remarkably simple - it just > moves the `pm_restrict_gfp_mask()` call from early in the suspend > sequence to after `dpm_prepare()` completes. The changes are: > - Move `pm_restrict_gfp_mask()` from multiple early locations to > inside `dpm_suspend_start()` after `dpm_prepare()` succeeds > - Add corresponding `pm_restore_gfp_mask()` calls in error paths and > resume paths > - Remove the now-redundant calls from hibernate.c and suspend.c Reworking how different layers of the kernel interact is not minimal, and it is not self contained. > 4. **Low Risk of Regression**: The change maintains the original intent > of preventing I/O during the critical suspend phase while allowing it > during device preparation. The swap restriction still happens before > `dpm_suspend()`, just after `dpm_prepare()`. There is no analysis anywhere of what happens to code that might expect the old behavior. So it is not possible to conclude a low risk of regression, in fact we can't conclude anything. > ## Follows Stable Rules > > 5. **Meets Stable Criteria**: > - Fixes a real bug that bothers people (suspend failures) Addresses a real bug, yes. Fixes? > - Small change (moves function calls, doesn't introduce new logic) No. > - Obviously correct (allows drivers to use swap during their > designated preparation phase) Not at all. It certainly isn't obvious to me what is going on. > - Already tested by users (Tested-by tags from affected users) Yes there are Tested-by tags. > ## Similar to Other Backported Commits > > 6. **Pattern Matches**: Looking at the similar commits provided, this > follows the same pattern as the AMD GPU eviction commits that were > backported. Those commits also addressed the same fundamental issue - > ensuring GPU VRAM can be properly evicted during suspend/hibernation. Which other commits are those?
> ## Critical Timing Timing? > 7. **Error Path Handling**: The commit properly handles error paths by > adding `pm_restore_gfp_mask()` calls in: > - `dpm_resume_end()` for normal resume > - `platform_recover()` error path in suspend.c > - `pm_restore_gfp_mask()` in kexec_core.c for kexec flows I don't see anything in this change that has to do with error paths. > The commit is well-tested, addresses a real problem affecting users, and > makes a minimal, obviously correct change to fix suspend failures on > systems with discrete GPUs under memory pressure. The evidence that a 3 week old change is well tested, simply because it has been merged into Linus's tree, seems lacking. Tested yes, but is it well tested? Are there any possible side effects? I certainly see no evidence of any testing or any exercise at all of the kexec path modified. I wasn't even aware of this change until this backport came in. Eric From sashal at kernel.org Tue Jul 8 13:32:49 2025 From: sashal at kernel.org (Sasha Levin) Date: Tue, 8 Jul 2025 16:32:49 -0400 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: <87ms9esclp.fsf@email.froward.int.ebiederm.org> References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> Message-ID: On Tue, Jul 08, 2025 at 02:32:02PM -0500, Eric W. Biederman wrote: > >Wow! > >Sasha I think an impersonator has gotten into your account, and >is just making nonsense up. https://lore.kernel.org/all/aDXQaq-bq5BMMlce at lappy/ >At best all of this appears to be an effort to get someone else to >do necessary thinking for you. As my time for kernel work is very >limited I expect I will auto-nack any such future attempts to outsource >someone else's thinking on me. I've gone ahead and added you to the list of people who AUTOSEL will skip, so no need to worry about wasting your time here. -- Thanks, Sasha From pavel at ucw.cz Tue Jul 8 13:37:33 2025 From: pavel at ucw.cz (Pavel Machek) Date: Tue, 8 Jul 2025 22:37:33 +0200 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> Message-ID: On Tue 2025-07-08 16:32:49, Sasha Levin wrote: > On Tue, Jul 08, 2025 at 02:32:02PM -0500, Eric W. Biederman wrote: > > > > Wow! > > > > Sasha I think an impersonator has gotten into your account, and > > is just making nonsense up. > > https://lore.kernel.org/all/aDXQaq-bq5BMMlce at lappy/ > > > At best all of this appears to be an effort to get someone else to > > do necessary thinking for you. As my time for kernel work is very > > limited I expect I will auto-nack any such future attempts to outsource > > someone else's thinking on me. > > I've gone ahead and added you to the list of people who AUTOSEL will > skip, so no need to worry about wasting your time here. Can you read? Your stupid robot is sending junk to the list. And you simply blacklist people who complain? Resulting in more junk in autosel? Pavel -- I don't work for Nazis and criminals, and neither should you. Boycott Putin, Trump, and Musk! -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 195 bytes Desc: not available URL: From pavel at ucw.cz Tue Jul 8 13:38:52 2025 From: pavel at ucw.cz (Pavel Machek) Date: Tue, 8 Jul 2025 22:38:52 +0200 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: <87ms9esclp.fsf@email.froward.int.ebiederm.org> References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> Message-ID: Hi! > > Sasha I think an impersonator has gotten into your account, and > is just making nonsense up. > > At first glance this reads like an impassioned plea to backport this > change, from someone who has actually dealt with it. > > Unfortunately reading the justification in detail is an exercise > in reading falsehoods. > > If this does not come from an impersonator then: > - If this comes from a human being, I recommend you have a talk with > them. > - If this comes from a machine I recommend you take it out of commission > and rework it. > > At best all of this appears to be an effort to get someone else to > do necessary thinking for you. As my time for kernel work is very > limited I expect I will auto-nack any such future attempts to outsource > someone else's thinking on me. I'm glad I'm not the only one who finds "lets use LLM to try to waste other people's time" insulting :-(. Pavel -- I don't work for Nazis and criminals, and neither should you. Boycott Putin, Trump, and Musk! -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 195 bytes Desc: not available URL: From pavel at ucw.cz Tue Jul 8 13:41:57 2025 From: pavel at ucw.cz (Pavel Machek) Date: Tue, 8 Jul 2025 22:41:57 +0200 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> Message-ID: On Tue 2025-07-08 16:32:49, Sasha Levin wrote: > On Tue, Jul 08, 2025 at 02:32:02PM -0500, Eric W. Biederman wrote: > > > > Wow! > > > > Sasha I think an impersonator has gotten into your account, and > > is just making nonsense up. > > https://lore.kernel.org/all/aDXQaq-bq5BMMlce at lappy/ > > > At best all of this appears to be an effort to get someone else to > > do necessary thinking for you. As my time for kernel work is very > > limited I expect I will auto-nack any such future attempts to outsource > > someone else's thinking on me. > > I've gone ahead and added you to the list of people who AUTOSEL will > skip, so no need to worry about wasting your time here. Do you have half a brain, or is it LLM talking again? You are sending autogenerated junk and signing it with your name. That's not okay. You are putting Signed-off on patches you have not checked. That's not okay, either. Stop it. Pavel -- I don't work for Nazis and criminals, and neither should you. Boycott Putin, Trump, and Musk! -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 195 bytes Desc: not available URL: From w at 1wt.eu Tue Jul 8 13:46:07 2025 From: w at 1wt.eu (Willy Tarreau) Date: Tue, 8 Jul 2025 22:46:07 +0200 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> Message-ID: <20250708204607.GA5648@1wt.eu> On Tue, Jul 08, 2025 at 10:37:33PM +0200, Pavel Machek wrote: > On Tue 2025-07-08 16:32:49, Sasha Levin wrote: > > I've gone ahead and added you to the list of people who AUTOSEL will > > skip, so no need to worry about wasting your time here. > > Can you read? > > Your stupid robot is sending junk to the list. And you simply > blacklist people who complain? Resulting in more junk in autosel? No, he said autosel will now skip patches from you, not ignore your complaint. So eventually only those who are fine with autosel's job will have their patches selected and the other ones not. This will result in less patches there. Willy From pavel at ucw.cz Tue Jul 8 13:49:49 2025 From: pavel at ucw.cz (Pavel Machek) Date: Tue, 8 Jul 2025 22:49:49 +0200 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: <20250708204607.GA5648@1wt.eu> References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> <20250708204607.GA5648@1wt.eu> Message-ID: On Tue 2025-07-08 22:46:07, Willy Tarreau wrote: > On Tue, Jul 08, 2025 at 10:37:33PM +0200, Pavel Machek wrote: > > On Tue 2025-07-08 16:32:49, Sasha Levin wrote: > > > I've gone ahead and added you to the list of people who AUTOSEL will > > > skip, so no need to worry about wasting your time here. > > > > Can you read? > > > > Your stupid robot is sending junk to the list. And you simply > > blacklist people who complain? Resulting in more junk in autosel? > > No, he said autosel will now skip patches from you, not ignore your > complaint. So eventually only those who are fine with autosel's job > will have their patches selected and the other ones not. This will > result in less patches there. That's not how I understand it. Patch was not from Eric, patch was being reviewed by Eric. Pavel -- I don't work for Nazis and criminals, and neither should you. Boycott Putin, Trump, and Musk! -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 195 bytes Desc: not available URL: From sashal at kernel.org Tue Jul 8 14:12:46 2025 From: sashal at kernel.org (Sasha Levin) Date: Tue, 8 Jul 2025 17:12:46 -0400 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: <20250708204607.GA5648@1wt.eu> References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> <20250708204607.GA5648@1wt.eu> Message-ID: On Tue, Jul 08, 2025 at 10:46:07PM +0200, Willy Tarreau wrote: >On Tue, Jul 08, 2025 at 10:37:33PM +0200, Pavel Machek wrote: >> On Tue 2025-07-08 16:32:49, Sasha Levin wrote: >> > I've gone ahead and added you to the list of people who AUTOSEL will >> > skip, so no need to worry about wasting your time here. >> >> Can you read? >> >> Your stupid robot is sending junk to the list. And you simply >> blacklist people who complain? 
Resulting in more junk in autosel? > >No, he said autosel will now skip patches from you, not ignore your >complaint. So eventually only those who are fine with autosel's job >will have their patches selected and the other ones not. This will >result in less patches there. The only one on my blacklist here is Pavel. We have a list of folks who have requested that either their own or the subsystem they maintain would not be reviewed by AUTOSEL. I've added Eric's name to that list as he has indicated he's not interested in receiving these patches. It's not a blacklist (nor did I use the word blacklist). https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/tree/ignore_list -- Thanks, Sasha From pavel at ucw.cz Tue Jul 8 14:26:02 2025 From: pavel at ucw.cz (Pavel Machek) Date: Tue, 8 Jul 2025 23:26:02 +0200 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> <20250708204607.GA5648@1wt.eu> Message-ID: On Tue 2025-07-08 17:12:46, Sasha Levin wrote: > On Tue, Jul 08, 2025 at 10:46:07PM +0200, Willy Tarreau wrote: > > On Tue, Jul 08, 2025 at 10:37:33PM +0200, Pavel Machek wrote: > > > On Tue 2025-07-08 16:32:49, Sasha Levin wrote: > > > > I've gone ahead and added you to the list of people who AUTOSEL will > > > > skip, so no need to worry about wasting your time here. > > > > > > Can you read? > > > > > > Your stupid robot is sending junk to the list. And you simply > > > blacklist people who complain? Resulting in more junk in autosel? > > > > No, he said autosel will now skip patches from you, not ignore your > > complaint. So eventually only those who are fine with autosel's job > > will have their patches selected and the other ones not. This will > > result in less patches there. > > The only one on my blacklist here is Pavel. > > We have a list of folks who have requested that either their own or the > subsystem they maintain would not be reviewed by AUTOSEL. I've added Eric's name > to that list as he has indicated he's not interested in receiving these > patches. It's not a blacklist (nor did I use the word blacklist). Can you please clearly separate emails you wrote, from emails some kind of LLM generate? Word "bot" in the From: would be enough. Also, can you please clearly mark patches you checked, by Signed-off-by: and distinguish them from patches only some kind of halucinating autocomplete checked, perhaps, again, by the word "bot" in the Signed-off-by: line? Thank you. Hopefully I'm taking to human this time. Pavel -- I don't work for Nazis and criminals, and neither should you. Boycott Putin, Trump, and Musk! -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 195 bytes Desc: not available URL: From ebiederm at xmission.com Tue Jul 8 14:46:19 2025 From: ebiederm at xmission.com (Eric W. Biederman) Date: Tue, 08 Jul 2025 16:46:19 -0500 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: (Sasha Levin's message of "Tue, 8 Jul 2025 16:32:49 -0400") References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> Message-ID: <87tt3mqrtg.fsf@email.froward.int.ebiederm.org> Sasha Levin writes: > On Tue, Jul 08, 2025 at 02:32:02PM -0500, Eric W. 
Biederman wrote: >> >>Wow! >> >>Sasha I think an impersonator has gotten into your account, and >>is just making nonsense up. > > https://lore.kernel.org/all/aDXQaq-bq5BMMlce at lappy/ It is nice it is giving explanations for it's backporting decisions. It would be nicer if those explanations were clearly marked as coming from a non-human agent, and did not read like a human being impatient for a patch to be backported. Further the machine given explanations were clearly wrong. Do you have plans to do anything about that? Using very incorrect justifications for backporting patches is scary. I still highly recommend that you get your tool to not randomly cut out bits from links it references, making them unfollowable. >>At best all of this appears to be an effort to get someone else to >>do necessary thinking for you. As my time for kernel work is very >>limited I expect I will auto-nack any such future attempts to outsource >>someone else's thinking on me. > > I've gone ahead and added you to the list of people who AUTOSEL will > skip, so no need to worry about wasting your time here. Thank you for that. I assume going forward that AUTOSEL will not consider any patches involving the core kernel and the user/kernel ABI going forward. The areas I have been involved with over the years, and for which my review might be interesting. Eric From sashal at kernel.org Tue Jul 8 15:26:08 2025 From: sashal at kernel.org (Sasha Levin) Date: Tue, 8 Jul 2025 18:26:08 -0400 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: <87tt3mqrtg.fsf@email.froward.int.ebiederm.org> References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> <87tt3mqrtg.fsf@email.froward.int.ebiederm.org> Message-ID: On Tue, Jul 08, 2025 at 04:46:19PM -0500, Eric W. Biederman wrote: >Sasha Levin writes: > >> On Tue, Jul 08, 2025 at 02:32:02PM -0500, Eric W. Biederman wrote: >>> >>>Wow! >>> >>>Sasha I think an impersonator has gotten into your account, and >>>is just making nonsense up. >> >> https://lore.kernel.org/all/aDXQaq-bq5BMMlce at lappy/ > >It is nice it is giving explanations for it's backporting decisions. > >It would be nicer if those explanations were clearly marked as >coming from a non-human agent, and did not read like a human being >impatient for a patch to be backported. Thats a fair point. I'll add "LLM Analysis:" before the explanation to future patches. >Further the machine given explanations were clearly wrong. Do you have >plans to do anything about that? Using very incorrect justifications >for backporting patches is scary. Just like in the past 8 years where AUTOSEL ran without any explanation whatsoever, the patches are manually reviewed and tested prior to being included in the stable tree. I don't make a point to go back and correct the justification, it's there more to give some idea as to why this patch was marked for review and may be completely bogus (in which case I'll drop the patch). For that matter, I'd often look at the explanation only if I don't fully understand why a certain patch was selected. Most often I just use it as a "Yes/No" signal. In this instance I honestly haven't read the LLM explanation. I agree with you that the explanation is flawed, but the patch clearly fixes a problem: "On AMD dGPUs this can lead to failed suspends under memory pressure situations as all VRAM must be evicted to system memory or swap." 
So it was included in the AUTOSEL patchset. Do you have an objection to this patch being included in -stable? So far your concerns were about the LLM explanation rather than actual patch. >I still highly recommend that you get your tool to not randomly >cut out bits from links it references, making them unfollowable. Good point. I'm not really sure what messes up the line wraps. I'll take a look. >>>At best all of this appears to be an effort to get someone else to >>>do necessary thinking for you. As my time for kernel work is very >>>limited I expect I will auto-nack any such future attempts to outsource >>>someone else's thinking on me. >> >> I've gone ahead and added you to the list of people who AUTOSEL will >> skip, so no need to worry about wasting your time here. > >Thank you for that. > >I assume going forward that AUTOSEL will not consider any patches >involving the core kernel and the user/kernel ABI going forward. The >areas I have been involved with over the years, and for which my review >might be interesting. The filter is based on authorship and SoBs. Individual maintainers of a subsystem can elect to have their entire subsystem added to the ignore list. -- Thanks, Sasha From pavel at ucw.cz Tue Jul 8 22:34:01 2025 From: pavel at ucw.cz (Pavel Machek) Date: Wed, 9 Jul 2025 07:34:01 +0200 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> <20250708204607.GA5648@1wt.eu> Message-ID: On Tue 2025-07-08 17:12:46, Sasha Levin wrote: > On Tue, Jul 08, 2025 at 10:46:07PM +0200, Willy Tarreau wrote: > > On Tue, Jul 08, 2025 at 10:37:33PM +0200, Pavel Machek wrote: > > > On Tue 2025-07-08 16:32:49, Sasha Levin wrote: > > > > I've gone ahead and added you to the list of people who AUTOSEL will > > > > skip, so no need to worry about wasting your time here. > > > > > > Can you read? > > > > > > Your stupid robot is sending junk to the list. And you simply > > > blacklist people who complain? Resulting in more junk in autosel? > > > > No, he said autosel will now skip patches from you, not ignore your > > complaint. So eventually only those who are fine with autosel's job > > will have their patches selected and the other ones not. This will > > result in less patches there. > > The only one on my blacklist here is Pavel. Please explain. Pavel -- I don't work for Nazis and criminals, and neither should you. Boycott Putin, Trump, and Musk! -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 195 bytes Desc: not available URL: From pavel at ucw.cz Tue Jul 8 22:39:22 2025 From: pavel at ucw.cz (Pavel Machek) Date: Wed, 9 Jul 2025 07:39:22 +0200 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> <87tt3mqrtg.fsf@email.froward.int.ebiederm.org> Message-ID: > In this instance I honestly haven't read the LLM explanation. I agree > with you that the explanation is flawed, but the patch clearly fixes a > problem: > > "On AMD dGPUs this can lead to failed suspends under memory > pressure situations as all VRAM must be evicted to system memory > or swap." > > So it was included in the AUTOSEL patchset. 
Is "may fix a problem" the only criteria for -stable inclusion? You have been acting as if so. Please update the rules, if so. > > I assume going forward that AUTOSEL will not consider any patches > > involving the core kernel and the user/kernel ABI going forward. The > > areas I have been involved with over the years, and for which my review > > might be interesting. > > The filter is based on authorship and SoBs. Individual maintainers of a > subsystem can elect to have their entire subsystem added to the ignore > list. Then the filter is misdesigned. BR, Pavel -- I don't work for Nazis and criminals, and neither should you. Boycott Putin, Trump, and Musk! -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 195 bytes Desc: not available URL: From dominik.lotka at fintara.pl Wed Jul 9 00:35:50 2025 From: dominik.lotka at fintara.pl (Dominik Lotka) Date: Wed, 9 Jul 2025 07:35:50 GMT Subject: =?UTF-8?Q?Prosz=C4=99_o_kontakt?= Message-ID: <20250709084500-0.1.ak.2rpel.0.pkdl79camp@fintara.pl> Dzie? dobry, Czy jest mo?liwo?? nawi?zania wsp??pracy z Pa?stwem? Z ch?ci? porozmawiam z osob? zajmuj?c? si? dzia?aniami zwi?zanymi ze sprzeda??. Pomagamy skutecznie pozyskiwa? nowych klient?w. Zapraszam do kontaktu. Pozdrawiam Dominik Lotka From mario.limonciello at amd.com Wed Jul 9 07:35:40 2025 From: mario.limonciello at amd.com (Mario Limonciello) Date: Wed, 9 Jul 2025 10:35:40 -0400 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> <87tt3mqrtg.fsf@email.froward.int.ebiederm.org> Message-ID: <24c245be-1ae6-4931-a0ac-375cae18e937@amd.com> On 7/9/2025 1:39 AM, Pavel Machek wrote: > >> In this instance I honestly haven't read the LLM explanation. I agree >> with you that the explanation is flawed, but the patch clearly fixes a >> problem: >> >> "On AMD dGPUs this can lead to failed suspends under memory >> pressure situations as all VRAM must be evicted to system memory >> or swap." >> >> So it was included in the AUTOSEL patchset. > > Is "may fix a problem" the only criteria for -stable inclusion? You > have been acting as if so. Please update the rules, if so. I would say that it most definitely does fix a problem. There are multiple testers who have confirmed it. But as it's rightfully pointed out the environment that drivers have during the initial pmops callbacks is different (swap is still available). I don't expect regressions from this; but wider testing is the only way that we will find out. Either we find out in 6.15.y or we find out in 6.16.y. Either way if there are regressions we either revert or fix them. > >>> I assume going forward that AUTOSEL will not consider any patches >>> involving the core kernel and the user/kernel ABI going forward. The >>> areas I have been involved with over the years, and for which my review >>> might be interesting. >> >> The filter is based on authorship and SoBs. Individual maintainers of a >> subsystem can elect to have their entire subsystem added to the ignore >> list. > > Then the filter is misdesigned. > > BR, > Pavel > > From ebiederm at xmission.com Wed Jul 9 09:23:36 2025 From: ebiederm at xmission.com (Eric W. 
Biederman) Date: Wed, 09 Jul 2025 11:23:36 -0500 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: (Sasha Levin's message of "Tue, 8 Jul 2025 18:26:08 -0400") References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> <87tt3mqrtg.fsf@email.froward.int.ebiederm.org> Message-ID: <87ms9dpc3b.fsf@email.froward.int.ebiederm.org> Sasha Levin writes: > On Tue, Jul 08, 2025 at 04:46:19PM -0500, Eric W. Biederman wrote: >>Sasha Levin writes: >> >>> On Tue, Jul 08, 2025 at 02:32:02PM -0500, Eric W. Biederman wrote: >>>> >>>>Wow! >>>> >>>>Sasha I think an impersonator has gotten into your account, and >>>>is just making nonsense up. >>> >>> https://lore.kernel.org/all/aDXQaq-bq5BMMlce at lappy/ >> >>It is nice it is giving explanations for it's backporting decisions. >> >>It would be nicer if those explanations were clearly marked as >>coming from a non-human agent, and did not read like a human being >>impatient for a patch to be backported. > > Thats a fair point. I'll add "LLM Analysis:" before the explanation to > future patches. > >>Further the machine given explanations were clearly wrong. Do you have >>plans to do anything about that? Using very incorrect justifications >>for backporting patches is scary. > > Just like in the past 8 years where AUTOSEL ran without any explanation > whatsoever, the patches are manually reviewed and tested prior to being > included in the stable tree. I believe there is some testing done. However for a lot of what I see go by I would be strongly surprised if there is actually much manual review. I expect there is a lot of the changes are simply ignored after a quick glance because people don't know what is going on, or they are of too little consequence to spend time on. > I don't make a point to go back and correct the justification, it's > there more to give some idea as to why this patch was marked for > review and may be completely bogus (in which case I'll drop the patch). > > For that matter, I'd often look at the explanation only if I don't fully > understand why a certain patch was selected. Most often I just use it as > a "Yes/No" signal. > > In this instance I honestly haven't read the LLM explanation. I agree > with you that the explanation is flawed, but the patch clearly fixes a > problem: > > "On AMD dGPUs this can lead to failed suspends under memory > pressure situations as all VRAM must be evicted to system memory > or swap." > > So it was included in the AUTOSEL patchset. > Do you have an objection to this patch being included in -stable? So far > your concerns were about the LLM explanation rather than actual patch. Several objections. - The explanation was clearly bogus. - The maintainer takes alarm. - The patch while small, is not simple and not obviously correct. - The patch has not been thoroughly tested. I object because the code does not appear to have been well tested outside of the realm of fixing the issue. There is no indication that the kexec code path has ever been exercised. So this appears to be one of those changes that was merged under the banner of "Let's see if this causes a regression". To the original authors. I would have appreciated it being a little more clearly called out in the change description that this came in under "Let's see if this causes a regression". Such changes should not be backported automatically. 
They should be backported with care after the have seen much more usage/testing of the kernel they were merged into. Probably after a kernel release or so. This is something that can take some actual judgment to decide, when a backport is reasonable. >>I still highly recommend that you get your tool to not randomly >>cut out bits from links it references, making them unfollowable. > > Good point. I'm not really sure what messes up the line wraps. I'll take > a look. It was a bit more than line wraps. At first glance I thought it was just removing a prefix from the links. On second glance it appears it is completely making a hash of links: The links in question: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174 https://gitlab.freedesktop.org/drm/amd/-/issues/2362 The unusable restatement of those links: ROCm/ROCK-Kernel-Driver#174 freedesktop.org/drm/amd#2362 Short of knowing to look up into the patch to find the links, those references are completely junk. >>>>At best all of this appears to be an effort to get someone else to >>>>do necessary thinking for you. As my time for kernel work is very >>>>limited I expect I will auto-nack any such future attempts to outsource >>>>someone else's thinking on me. >>> >>> I've gone ahead and added you to the list of people who AUTOSEL will >>> skip, so no need to worry about wasting your time here. >> >>Thank you for that. >> >>I assume going forward that AUTOSEL will not consider any patches >>involving the core kernel and the user/kernel ABI going forward. The >>areas I have been involved with over the years, and for which my review >>might be interesting. > > The filter is based on authorship and SoBs. Individual maintainers of a > subsystem can elect to have their entire subsystem added to the ignore > list. As I said. I expect that the process looking at the output of get_maintainers.pl and ignoring a change when my name is returned will result in effectively the entire core kernel and the user/kernel ABI not being eligible for backport. I bring this up because I was not an author and I did not have any signed-off-by's on the change in question, and yet I was still selected for the review. Eric From mario.limonciello at amd.com Wed Jul 9 09:35:47 2025 From: mario.limonciello at amd.com (Mario Limonciello) Date: Wed, 9 Jul 2025 12:35:47 -0400 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: <87ms9dpc3b.fsf@email.froward.int.ebiederm.org> References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> <87tt3mqrtg.fsf@email.froward.int.ebiederm.org> <87ms9dpc3b.fsf@email.froward.int.ebiederm.org> Message-ID: <29441021-5758-4565-b120-e9713c58f6d8@amd.com> On 7/9/2025 12:23 PM, Eric W. Biederman wrote: > Sasha Levin writes: > >> On Tue, Jul 08, 2025 at 04:46:19PM -0500, Eric W. Biederman wrote: >>> Sasha Levin writes: >>> >>>> On Tue, Jul 08, 2025 at 02:32:02PM -0500, Eric W. Biederman wrote: >>>>> >>>>> Wow! >>>>> >>>>> Sasha I think an impersonator has gotten into your account, and >>>>> is just making nonsense up. >>>> >>>> https://lore.kernel.org/all/aDXQaq-bq5BMMlce at lappy/ >>> >>> It is nice it is giving explanations for it's backporting decisions. >>> >>> It would be nicer if those explanations were clearly marked as >>> coming from a non-human agent, and did not read like a human being >>> impatient for a patch to be backported. >> >> Thats a fair point. 
I'll add "LLM Analysis:" before the explanation to >> future patches. >> >>> Further the machine given explanations were clearly wrong. Do you have >>> plans to do anything about that? Using very incorrect justifications >>> for backporting patches is scary. >> >> Just like in the past 8 years where AUTOSEL ran without any explanation >> whatsoever, the patches are manually reviewed and tested prior to being >> included in the stable tree. > > I believe there is some testing done. However for a lot of what I see > go by I would be strongly surprised if there is actually much manual > review. > > I expect there is a lot of the changes are simply ignored after a quick > glance because people don't know what is going on, or they are of too > little consequence to spend time on. > >> I don't make a point to go back and correct the justification, it's >> there more to give some idea as to why this patch was marked for >> review and may be completely bogus (in which case I'll drop the patch). >> >> For that matter, I'd often look at the explanation only if I don't fully >> understand why a certain patch was selected. Most often I just use it as >> a "Yes/No" signal. >> >> In this instance I honestly haven't read the LLM explanation. I agree >> with you that the explanation is flawed, but the patch clearly fixes a >> problem: >> >> "On AMD dGPUs this can lead to failed suspends under memory >> pressure situations as all VRAM must be evicted to system memory >> or swap." >> >> So it was included in the AUTOSEL patchset. > > >> Do you have an objection to this patch being included in -stable? So far >> your concerns were about the LLM explanation rather than actual patch. > > Several objections. > - The explanation was clearly bogus. > - The maintainer takes alarm. > - The patch while small, is not simple and not obviously correct. > - The patch has not been thoroughly tested. > > I object because the code does not appear to have been well tested > outside of the realm of fixing the issue. > > There is no indication that the kexec code path has ever been exercised. > > So this appears to be one of those changes that was merged under > the banner of "Let's see if this causes a regression".> > To the original authors. I would have appreciated it being a little > more clearly called out in the change description that this came in > under "Let's see if this causes a regression". > As the original author of this patch I don't feel this patch is any different than any other patch in that regard. I don't write in a commit message the expected risk of a patch. There are always people that find interesting ways to exercise it and they could find problems that I didn't envision. > Such changes should not be backported automatically. They should be > backported with care after the have seen much more usage/testing of > the kernel they were merged into. Probably after a kernel release or > so. This is something that can take some actual judgment to decide, > when a backport is reasonable. TBH - I didn't include stable in the commit message with the intent that after this baked a cycle or so that we could bring it back later if AUTOSEL hadn't picked it up by then. It's a real issue people have complained about for years that is non-obvious where the root cause is. Once we're all confident on this I'd love to discuss bringing it back even further to LTS kernels if it's viable. 
> >>> I still highly recommend that you get your tool to not randomly >>> cut out bits from links it references, making them unfollowable. >> >> Good point. I'm not really sure what messes up the line wraps. I'll take >> a look. > > It was a bit more than line wraps. At first glance I thought > it was just removing a prefix from the links. On second glance > it appears it is completely making a hash of links: > > The links in question: > https://github.com/ROCm/ROCK-Kernel-Driver/issues/174 > https://gitlab.freedesktop.org/drm/amd/-/issues/2362 > > The unusable restatement of those links: > ROCm/ROCK-Kernel-Driver#174 > freedesktop.org/drm/amd#2362 > > Short of knowing to look up into the patch to find the links, > those references are completely junk. > >>>>> At best all of this appears to be an effort to get someone else to >>>>> do necessary thinking for you. As my time for kernel work is very >>>>> limited I expect I will auto-nack any such future attempts to outsource >>>>> someone else's thinking on me. >>>> >>>> I've gone ahead and added you to the list of people who AUTOSEL will >>>> skip, so no need to worry about wasting your time here. >>> >>> Thank you for that. >>> >>> I assume going forward that AUTOSEL will not consider any patches >>> involving the core kernel and the user/kernel ABI going forward. The >>> areas I have been involved with over the years, and for which my review >>> might be interesting. >> >> The filter is based on authorship and SoBs. Individual maintainers of a >> subsystem can elect to have their entire subsystem added to the ignore >> list. > > As I said. I expect that the process looking at the output of > get_maintainers.pl and ignoring a change when my name is returned > will result in effectively the entire core kernel and the user/kernel > ABI not being eligible for backport. > > I bring this up because I was not an author and I did not have any > signed-off-by's on the change in question, and yet I was still selected > for the review. > > Eric > From rafael at kernel.org Wed Jul 9 09:55:47 2025 From: rafael at kernel.org (Rafael J. Wysocki) Date: Wed, 9 Jul 2025 18:55:47 +0200 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: <29441021-5758-4565-b120-e9713c58f6d8@amd.com> References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> <87tt3mqrtg.fsf@email.froward.int.ebiederm.org> <87ms9dpc3b.fsf@email.froward.int.ebiederm.org> <29441021-5758-4565-b120-e9713c58f6d8@amd.com> Message-ID: On Wed, Jul 9, 2025 at 6:35?PM Mario Limonciello wrote: > > On 7/9/2025 12:23 PM, Eric W. Biederman wrote: > > Sasha Levin writes: > > > >> On Tue, Jul 08, 2025 at 04:46:19PM -0500, Eric W. Biederman wrote: > >>> Sasha Levin writes: > >>> > >>>> On Tue, Jul 08, 2025 at 02:32:02PM -0500, Eric W. Biederman wrote: > >>>>> > >>>>> Wow! > >>>>> > >>>>> Sasha I think an impersonator has gotten into your account, and > >>>>> is just making nonsense up. > >>>> > >>>> https://lore.kernel.org/all/aDXQaq-bq5BMMlce at lappy/ > >>> > >>> It is nice it is giving explanations for it's backporting decisions. > >>> > >>> It would be nicer if those explanations were clearly marked as > >>> coming from a non-human agent, and did not read like a human being > >>> impatient for a patch to be backported. > >> > >> Thats a fair point. I'll add "LLM Analysis:" before the explanation to > >> future patches. 
> >> > >>> Further the machine given explanations were clearly wrong. Do you have > >>> plans to do anything about that? Using very incorrect justifications > >>> for backporting patches is scary. > >> > >> Just like in the past 8 years where AUTOSEL ran without any explanation > >> whatsoever, the patches are manually reviewed and tested prior to being > >> included in the stable tree. > > > > I believe there is some testing done. However for a lot of what I see > > go by I would be strongly surprised if there is actually much manual > > review. > > > > I expect there is a lot of the changes are simply ignored after a quick > > glance because people don't know what is going on, or they are of too > > little consequence to spend time on. > > > >> I don't make a point to go back and correct the justification, it's > >> there more to give some idea as to why this patch was marked for > >> review and may be completely bogus (in which case I'll drop the patch). > >> > >> For that matter, I'd often look at the explanation only if I don't fully > >> understand why a certain patch was selected. Most often I just use it as > >> a "Yes/No" signal. > >> > >> In this instance I honestly haven't read the LLM explanation. I agree > >> with you that the explanation is flawed, but the patch clearly fixes a > >> problem: > >> > >> "On AMD dGPUs this can lead to failed suspends under memory > >> pressure situations as all VRAM must be evicted to system memory > >> or swap." > >> > >> So it was included in the AUTOSEL patchset. > > > > > >> Do you have an objection to this patch being included in -stable? So far > >> your concerns were about the LLM explanation rather than actual patch. > > > > Several objections. > > - The explanation was clearly bogus. > > - The maintainer takes alarm. > > - The patch while small, is not simple and not obviously correct. > > - The patch has not been thoroughly tested. > > > > I object because the code does not appear to have been well tested > > outside of the realm of fixing the issue. > > > > There is no indication that the kexec code path has ever been exercised. > > > > So this appears to be one of those changes that was merged under > > the banner of "Let's see if this causes a regression".> > > To the original authors. I would have appreciated it being a little > > more clearly called out in the change description that this came in > > under "Let's see if this causes a regression". > > > > As the original author of this patch I don't feel this patch is any > different than any other patch in that regard. > I don't write in a commit message the expected risk of a patch. > > There are always people that find interesting ways to exercise it and > they could find problems that I didn't envision. > > > Such changes should not be backported automatically. They should be > > backported with care after the have seen much more usage/testing of > > the kernel they were merged into. Probably after a kernel release or > > so. This is something that can take some actual judgment to decide, > > when a backport is reasonable. > > TBH - I didn't include stable in the commit message with the intent that > after this baked a cycle or so that we could bring it back later if > AUTOSEL hadn't picked it up by then. I actually see an issue in this patch that I have overlooked previously, so Sasha and "stable" folks - please drop this one. Namely, the change in dpm_resume_end() is going too far. 
> It's a real issue people have complained about for years that is > non-obvious where the root cause is. > > Once we're all confident on this I'd love to discuss bringing it back > even further to LTS kernels if it's viable. Sure. From sashal at kernel.org Wed Jul 9 10:37:40 2025 From: sashal at kernel.org (Sasha Levin) Date: Wed, 9 Jul 2025 13:37:40 -0400 Subject: [PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence In-Reply-To: <87ms9dpc3b.fsf@email.froward.int.ebiederm.org> References: <20250708000215.793090-1-sashal@kernel.org> <20250708000215.793090-6-sashal@kernel.org> <87ms9esclp.fsf@email.froward.int.ebiederm.org> <87tt3mqrtg.fsf@email.froward.int.ebiederm.org> <87ms9dpc3b.fsf@email.froward.int.ebiederm.org> Message-ID: On Wed, Jul 09, 2025 at 11:23:36AM -0500, Eric W. Biederman wrote: >There is no indication that the kexec code path has ever been exercised. > >So this appears to be one of those changes that was merged under >the banner of "Let's see if this causes a regression". > >To the original authors. I would have appreciated it being a little >more clearly called out in the change description that this came in >under "Let's see if this causes a regression". > >Such changes should not be backported automatically. They should be >backported with care after the have seen much more usage/testing of >the kernel they were merged into. Probably after a kernel release or >so. This is something that can take some actual judgment to decide, >when a backport is reasonable. I'm assuming that you also refer to stable tagged patches that get "automatically" picked up, right? We already have a way to do what you suggest: maintainers can choose not to tag their patches for stable, and have both their subsystem and/or individual contributions ignored by AUTOSEL. This way they can send us commits at their convenience. There is one subsystem that is mostly doing that (XFS). The other ones are *choosing* not to do that. -- Thanks, Sasha From makb at juniper.net Wed Jul 9 12:20:35 2025 From: makb at juniper.net (Brian Mak) Date: Wed, 9 Jul 2025 12:20:35 -0700 Subject: [PATCH] x86/kexec: Carry forward the boot DTB on kexec Message-ID: <20250709192035.271687-1-makb@juniper.net> The kexec_file_load syscall on x86 currently does not support passing a device tree blob to the new kernel. To add support for this, we copy the behavior of ARM64 and PowerPC and copy the current boot's device tree blob for use in the new kernel. We do this on x86 by passing the device tree blob as a setup_data entry in accordance with the x86 boot protocol. 
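For reference, the setup_data descriptor that this patch chains onto boot_params is laid out as follows. This is the existing definition from the x86 boot protocol (arch/x86/include/uapi/asm/bootparam.h), reproduced here only so the sd->next/type/len/data accesses in the diff below are easier to follow; it is not part of the change:

    struct setup_data {
            __u64 next;     /* physical address of the next setup_data node; 0 terminates the list */
            __u32 type;     /* SETUP_DTB for this patch */
            __u32 len;      /* length of data[] in bytes, here fdt_totalsize(initial_boot_params) */
            __u8  data[];   /* payload: the flattened device tree blob being carried forward */
    };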
Signed-off-by: Brian Mak --- arch/x86/kernel/kexec-bzimage64.c | 46 +++++++++++++++++++++++++++++-- 1 file changed, 43 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c index 24a41f0e0cf1..c24536c25f98 100644 --- a/arch/x86/kernel/kexec-bzimage64.c +++ b/arch/x86/kernel/kexec-bzimage64.c @@ -16,6 +16,8 @@ #include #include #include +#include +#include #include #include @@ -212,6 +214,28 @@ setup_efi_state(struct boot_params *params, unsigned long params_load_addr, } #endif /* CONFIG_EFI */ +#ifdef CONFIG_OF_FLATTREE +static void setup_dtb(struct boot_params *params, + unsigned long params_load_addr, + unsigned int dtb_setup_data_offset) +{ + struct setup_data *sd = (void *)params + dtb_setup_data_offset; + unsigned long setup_data_phys, dtb_len; + + dtb_len = fdt_totalsize(initial_boot_params); + sd->type = SETUP_DTB; + sd->len = dtb_len; + + /* Carry over current boot DTB with setup_data */ + memcpy(sd->data, initial_boot_params, dtb_len); + + /* Add setup data */ + setup_data_phys = params_load_addr + dtb_setup_data_offset; + sd->next = params->hdr.setup_data; + params->hdr.setup_data = setup_data_phys; +} +#endif /* CONFIG_OF_FLATTREE */ + static void setup_ima_state(const struct kimage *image, struct boot_params *params, unsigned long params_load_addr, @@ -336,6 +360,16 @@ setup_boot_parameters(struct kimage *image, struct boot_params *params, sizeof(struct efi_setup_data); #endif +#ifdef CONFIG_OF_FLATTREE + if (initial_boot_params) { + setup_dtb(params, params_load_addr, setup_data_offset); + setup_data_offset += sizeof(struct setup_data) + + fdt_totalsize(initial_boot_params); + } else { + pr_info("No DTB\n"); + } +#endif + if (IS_ENABLED(CONFIG_IMA_KEXEC)) { /* Setup IMA log buffer state */ setup_ima_state(image, params, params_load_addr, @@ -529,6 +563,12 @@ static void *bzImage64_load(struct kimage *image, char *kernel, sizeof(struct setup_data) + RNG_SEED_LENGTH; +#ifdef CONFIG_OF_FLATTREE + if (initial_boot_params) + kbuf.bufsz += sizeof(struct setup_data) + + fdt_totalsize(initial_boot_params); +#endif + if (IS_ENABLED(CONFIG_IMA_KEXEC)) kbuf.bufsz += sizeof(struct setup_data) + sizeof(struct ima_setup_data); @@ -537,7 +577,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel, kbuf.bufsz += sizeof(struct setup_data) + sizeof(struct kho_data); - params = kzalloc(kbuf.bufsz, GFP_KERNEL); + params = kvzalloc(kbuf.bufsz, GFP_KERNEL); if (!params) return ERR_PTR(-ENOMEM); efi_map_offset = params_cmdline_sz; @@ -647,7 +687,7 @@ static void *bzImage64_load(struct kimage *image, char *kernel, return ldata; out_free_params: - kfree(params); + kvfree(params); return ERR_PTR(ret); } @@ -659,7 +699,7 @@ static int bzImage64_cleanup(void *loader_data) if (!ldata) return 0; - kfree(ldata->bootparams_buf); + kvfree(ldata->bootparams_buf); ldata->bootparams_buf = NULL; return 0; base-commit: d7b8f8e20813f0179d8ef519541a3527e7661d3a -- 2.25.1 From ltao at redhat.com Wed Jul 9 22:34:34 2025 From: ltao at redhat.com (Tao Liu) Date: Thu, 10 Jul 2025 17:34:34 +1200 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: References: <20250625022343.57529-2-ltao@redhat.com> <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> <20250703163100.603f59f4@mordecai.tesarici.cz> <8485b4f1-1277-45ab-b533-efc20120b26e@nec.com> Message-ID: Kindly ping... Sorry to interrupt, could you please merge the patch since there are few bugs which depend on the backporting of this patch? 
Thanks, Tao Liu On Fri, Jul 4, 2025 at 7:51?PM Tao Liu wrote: > > On Fri, Jul 4, 2025 at 6:49?PM HAGIO KAZUHITO(?????) wrote: > > > > On 2025/07/04 7:35, Tao Liu wrote: > > > Hi Petr, > > > > > > On Fri, Jul 4, 2025 at 2:31?AM Petr Tesarik wrote: > > >> > > >> On Tue, 1 Jul 2025 19:59:53 +1200 > > >> Tao Liu wrote: > > >> > > >>> Hi Kazu, > > >>> > > >>> Thanks for your comments! > > >>> > > >>> On Tue, Jul 1, 2025 at 7:38?PM HAGIO KAZUHITO(?????) wrote: > > >>>> > > >>>> Hi Tao, > > >>>> > > >>>> thank you for the patch. > > >>>> > > >>>> On 2025/06/25 11:23, Tao Liu wrote: > > >>>>> A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be > > >>>>> reproduced with upstream makedumpfile. > > >>>>> > > >>>>> When analyzing the corrupt vmcore using crash, the following error > > >>>>> message will output: > > >>>>> > > >>>>> crash: compressed kdump: uncompress failed: 0 > > >>>>> crash: read error: kernel virtual address: c0001e2d2fe48000 type: > > >>>>> "hardirq thread_union" > > >>>>> crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000 > > >>>>> crash: compressed kdump: uncompress failed: 0 > > >>>>> > > >>>>> If the vmcore is generated without num-threads option, then no such > > >>>>> errors are noticed. > > >>>>> > > >>>>> With --num-threads=N enabled, there will be N sub-threads created. All > > >>>>> sub-threads are producers which responsible for mm page processing, e.g. > > >>>>> compression. The main thread is the consumer which responsible for > > >>>>> writing the compressed data into file. page_flag_buf->ready is used to > > >>>>> sync main and sub-threads. When a sub-thread finishes page processing, > > >>>>> it will set ready flag to be FLAG_READY. In the meantime, main thread > > >>>>> looply check all threads of the ready flags, and break the loop when > > >>>>> find FLAG_READY. > > >>>> > > >>>> I've tried to reproduce the issue, but I couldn't on x86_64. > > >>> > > >>> Yes, I cannot reproduce it on x86_64 either, but the issue is very > > >>> easily reproduced on ppc64 arch, which is where our QE reported. > > >> > > >> Yes, this is expected. X86 implements a strongly ordered memory model, > > >> so a "store-to-memory" instruction ensures that the new value is > > >> immediately observed by other CPUs. > > >> > > >> FWIW the current code is wrong even on X86, because it does nothing to > > >> prevent compiler optimizations. The compiler is then allowed to reorder > > >> instructions so that the write to page_flag_buf->ready happens after > > >> other writes; with a bit of bad scheduling luck, the consumer thread > > >> may see an inconsistent state (e.g. read a stale page_flag_buf->pfn). > > >> Note that thanks to how compilers are designed (today), this issue is > > >> more or less hypothetical. Nevertheless, the use of atomics fixes it, > > >> because they also serve as memory barriers. > > > > Thank you Petr, for the information. I was wondering whether atomic > > operations might be necessary for the other members of page_flag_buf, > > but it looks like they won't be necessary in this case. > > > > Then I was convinced that the issue would be fixed by removing the > > inconsistency of page_flag_buf->ready. And the patch tested ok, so ack. > > > > Thank you all for the patch review, patch testing and comments, these > have been so helpful! > > Thanks, > Tao Liu > > > Thanks, > > Kazu > > > > > > > > Thanks a lot for your detailed explanation, it's very helpful! 
I > > > haven't thought of the possibility of instruction reordering and > > > atomic_rw prevents the reorder. > > > > > > Thanks, > > > Tao Liu > > > > > >> > > >> Petr T > > >> From jon.brennan at tasknomic.com Thu Jul 10 00:36:14 2025 From: jon.brennan at tasknomic.com (Jon Brennan) Date: Thu, 10 Jul 2025 07:36:14 GMT Subject: Equipment - chairs Message-ID: <20250710084500-0.1.jy.13c1b.0.4gk1n4u28f@tasknomic.com> Hi, Do you offer a seat for corridors or waiting rooms? The shortest possible length is about 12 cm, with a wide range of practical dimensions, with a little over a centimeter of space - with a small, large, large roof. This is a solid and convenient solution that can be selected by furniture distributors, for use in public matters throughout Europe. I will send you the specification or photos - can our product be used? Sincerely, Jon Brennan From rafael at kernel.org Thu Jul 10 06:12:20 2025 From: rafael at kernel.org (Rafael J. Wysocki) Date: Thu, 10 Jul 2025 15:12:20 +0200 Subject: [PATCH v1 2/2] kexec_core: Drop redundant pm_restore_gfp_mask() call In-Reply-To: <5046396.31r3eYUQgx@rjwysocki.net> References: <5046396.31r3eYUQgx@rjwysocki.net> Message-ID: <1949230.tdWV9SEqCh@rjwysocki.net> From: Rafael J. Wysocki Drop the direct pm_restore_gfp_mask() call from the KEXEC_JUMP flow in kernel_kexec() because it is redundant. Namely, dpm_resume_end() called beforehand in the same code path invokes that function and it is sufficient to invoke it once. Signed-off-by: Rafael J. Wysocki --- kernel/kexec_core.c | 1 - 1 file changed, 1 deletion(-) --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -1136,7 +1136,6 @@ Resume_devices: dpm_resume_end(PMSG_RESTORE); Resume_console: - pm_restore_gfp_mask(); console_resume_all(); thaw_processes(); Restore_console: From rafael at kernel.org Thu Jul 10 06:10:41 2025 From: rafael at kernel.org (Rafael J. Wysocki) Date: Thu, 10 Jul 2025 15:10:41 +0200 Subject: [PATCH v1 1/2] kexec_core: Fix error code path in the KEXEC_JUMP flow In-Reply-To: <5046396.31r3eYUQgx@rjwysocki.net> References: <5046396.31r3eYUQgx@rjwysocki.net> Message-ID: <2396879.ElGaqSPkdT@rjwysocki.net> From: Rafael J. Wysocki If dpm_suspend_start() fails, dpm_resume_end() must be called to recover devices whose suspend callbacks have been called, but this does not happen in the KEXEC_JUMP flow's error path due to a confused goto target label. Address this by using the correct target label in the goto statement in question. Fixes: 2965faa5e03d ("kexec: split kexec_load syscall from kexec core code") Signed-off-by: Rafael J. Wysocki --- kernel/kexec_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -1080,7 +1080,7 @@ console_suspend_all(); error = dpm_suspend_start(PMSG_FREEZE); if (error) - goto Resume_console; + goto Resume_devices; /* * dpm_suspend_end() must be called after dpm_suspend_start() * to complete the transition, like in the hibernation flows From rafael at kernel.org Thu Jul 10 06:08:58 2025 From: rafael at kernel.org (Rafael J. Wysocki) Date: Thu, 10 Jul 2025 15:08:58 +0200 Subject: [PATCH v1 0/2] kexec_core: Fix and cleanup for the KEXEC_JUMP flow Message-ID: <5046396.31r3eYUQgx@rjwysocki.net> Hi Everyone, These two patches fix an error code path issue in the KEXEC_JUMP flow (patch [1/2]) and clean it up a bit afterward (patch [2/2]). Please see patch changelogs for details. Thanks! 
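Read together, the two diffs leave the KEXEC_JUMP suspend/unwind path in kernel_kexec() looking roughly like the abridged sketch below (steps not touched by these patches are elided). It shows why an error from dpm_suspend_start() has to land on Resume_devices rather than Resume_console, and why the separate pm_restore_gfp_mask() call becomes redundant:

    console_suspend_all();
    error = dpm_suspend_start(PMSG_FREEZE);
    if (error)
            goto Resume_devices;    /* was Resume_console, which skipped dpm_resume_end() */
    /* ... dpm_suspend_end(), the kexec jump itself, and the resume counterparts ... */
 Resume_devices:
    dpm_resume_end(PMSG_RESTORE);   /* recovers devices already suspended; also restores the GFP mask */
 Resume_console:
    console_resume_all();           /* direct pm_restore_gfp_mask() call dropped here as redundant */
    thaw_processes();
 Restore_console:
    /* ... */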
From bhe at redhat.com Thu Jul 10 23:16:48 2025 From: bhe at redhat.com (Baoquan He) Date: Fri, 11 Jul 2025 14:16:48 +0800 Subject: [PATCH v1 1/2] kexec_core: Fix error code path in the KEXEC_JUMP flow In-Reply-To: <2396879.ElGaqSPkdT@rjwysocki.net> References: <5046396.31r3eYUQgx@rjwysocki.net> <2396879.ElGaqSPkdT@rjwysocki.net> Message-ID: On 07/10/25 at 03:10pm, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki > > If dpm_suspend_start() fails, dpm_resume_end() must be called to > recover devices whose suspend callbacks have been called, but this > does not happen in the KEXEC_JUMP flow's error path due to a confused > goto target label. > > Address this by using the correct target label in the goto statement in > question. Sounds very reasonable, thanks for the fix. Acked-by: Baoquan He > > Fixes: 2965faa5e03d ("kexec: split kexec_load syscall from kexec core code") > Signed-off-by: Rafael J. Wysocki > --- > kernel/kexec_core.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > --- a/kernel/kexec_core.c > +++ b/kernel/kexec_core.c > @@ -1080,7 +1080,7 @@ > console_suspend_all(); > error = dpm_suspend_start(PMSG_FREEZE); > if (error) > - goto Resume_console; > + goto Resume_devices; > /* > * dpm_suspend_end() must be called after dpm_suspend_start() > * to complete the transition, like in the hibernation flows > > > From bhe at redhat.com Thu Jul 10 23:17:55 2025 From: bhe at redhat.com (Baoquan He) Date: Fri, 11 Jul 2025 14:17:55 +0800 Subject: [PATCH v1 2/2] kexec_core: Drop redundant pm_restore_gfp_mask() call In-Reply-To: <1949230.tdWV9SEqCh@rjwysocki.net> References: <5046396.31r3eYUQgx@rjwysocki.net> <1949230.tdWV9SEqCh@rjwysocki.net> Message-ID: On 07/10/25 at 03:12pm, Rafael J. Wysocki wrote: > From: Rafael J. Wysocki > > Drop the direct pm_restore_gfp_mask() call from the KEXEC_JUMP flow in > kernel_kexec() because it is redundant. Namely, dpm_resume_end() > called beforehand in the same code path invokes that function and > it is sufficient to invoke it once. > > Signed-off-by: Rafael J. Wysocki > --- > kernel/kexec_core.c | 1 - > 1 file changed, 1 deletion(-) LGTM, Acked-by: Baoquan He > > --- a/kernel/kexec_core.c > +++ b/kernel/kexec_core.c > @@ -1136,7 +1136,6 @@ > Resume_devices: > dpm_resume_end(PMSG_RESTORE); > Resume_console: > - pm_restore_gfp_mask(); > console_resume_all(); > thaw_processes(); > Restore_console: > > > From adam.drzewiecki at successa.pl Fri Jul 11 01:06:03 2025 From: adam.drzewiecki at successa.pl (Adam Drzewiecki) Date: Fri, 11 Jul 2025 08:06:03 GMT Subject: =?UTF-8?Q?Pytanie_o_samoch=C3=B3d_?= Message-ID: <20250711084500-0.1.jb.25amg.0.1z9o1i471h@successa.pl> Dzie? dobry, Czy interesuje Pa?stwa rozwi?zanie umo?liwiaj?ce monitorowanie samochod?w firmowych oraz optymalizacj? koszt?w ich utrzymania? Pozdrawiam Adam Drzewiecki From rafael at kernel.org Fri Jul 11 02:29:03 2025 From: rafael at kernel.org (Rafael J. Wysocki) Date: Fri, 11 Jul 2025 11:29:03 +0200 Subject: [PATCH v1 1/2] kexec_core: Fix error code path in the KEXEC_JUMP flow In-Reply-To: References: <5046396.31r3eYUQgx@rjwysocki.net> <2396879.ElGaqSPkdT@rjwysocki.net> Message-ID: On Fri, Jul 11, 2025 at 8:16?AM Baoquan He wrote: > > On 07/10/25 at 03:10pm, Rafael J. Wysocki wrote: > > From: Rafael J. Wysocki > > > > If dpm_suspend_start() fails, dpm_resume_end() must be called to > > recover devices whose suspend callbacks have been called, but this > > does not happen in the KEXEC_JUMP flow's error path due to a confused > > goto target label. 
> > > > Address this by using the correct target label in the goto statement in > > question. > > Sounds very reasonable, thanks for the fix. > > Acked-by: Baoquan He Thanks! I've queued it up for 6.17 along with the [2/2]. > > > > Fixes: 2965faa5e03d ("kexec: split kexec_load syscall from kexec core code") > > Signed-off-by: Rafael J. Wysocki > > --- > > kernel/kexec_core.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > --- a/kernel/kexec_core.c > > +++ b/kernel/kexec_core.c > > @@ -1080,7 +1080,7 @@ > > console_suspend_all(); > > error = dpm_suspend_start(PMSG_FREEZE); > > if (error) > > - goto Resume_console; > > + goto Resume_devices; > > /* > > * dpm_suspend_end() must be called after dpm_suspend_start() > > * to complete the transition, like in the hibernation flows > > > > > > > From mario.limonciello at amd.com Fri Jul 11 04:15:22 2025 From: mario.limonciello at amd.com (Mario Limonciello) Date: Fri, 11 Jul 2025 07:15:22 -0400 Subject: [PATCH v1 0/2] kexec_core: Fix and cleanup for the KEXEC_JUMP flow In-Reply-To: <5046396.31r3eYUQgx@rjwysocki.net> References: <5046396.31r3eYUQgx@rjwysocki.net> Message-ID: On 7/10/2025 9:08 AM, Rafael J. Wysocki wrote: > Hi Everyone, > > These two patches fix an error code path issue in the KEXEC_JUMP flow (patch > [1/2]) and clean it up a bit afterward (patch [2/2]). > > Please see patch changelogs for details. > > Thanks! > > > Reviewed-by: Mario Limonciello From yamazaki-msmt at nec.com Fri Jul 11 05:08:36 2025 From: yamazaki-msmt at nec.com (=?utf-8?B?WUFNQVpBS0kgTUFTQU1JVFNVKOWxseW0juOAgOecn+WFiSk=?=) Date: Fri, 11 Jul 2025 12:08:36 +0000 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: References: <20250625022343.57529-2-ltao@redhat.com> <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> <20250703163100.603f59f4@mordecai.tesarici.cz> <8485b4f1-1277-45ab-b533-efc20120b26e@nec.com> Message-ID: <004de18c-263a-405d-9d5a-e83d4c391df7@nec.com> Sorry, I'm so rate. I looked into the fix and I think it will work safely on other architectures as well. I think it will also solve the problem with ppc64. I accept and merge this patch. Thank you for reporting this problem and providing the very difficult fix. Thanks, Masa On 2025/07/10 14:34, Tao Liu wrote: > Kindly ping... > > Sorry to interrupt, could you please merge the patch since there are > few bugs which depend on the backporting of this patch? > > Thanks, > Tao Liu > > > On Fri, Jul 4, 2025 at 7:51?PM Tao Liu wrote: >> On Fri, Jul 4, 2025 at 6:49?PM HAGIO KAZUHITO(?????) wrote: >>> On 2025/07/04 7:35, Tao Liu wrote: >>>> Hi Petr, >>>> >>>> On Fri, Jul 4, 2025 at 2:31?AM Petr Tesarik wrote: >>>>> On Tue, 1 Jul 2025 19:59:53 +1200 >>>>> Tao Liu wrote: >>>>> >>>>>> Hi Kazu, >>>>>> >>>>>> Thanks for your comments! >>>>>> >>>>>> On Tue, Jul 1, 2025 at 7:38?PM HAGIO KAZUHITO(?????) wrote: >>>>>>> Hi Tao, >>>>>>> >>>>>>> thank you for the patch. >>>>>>> >>>>>>> On 2025/06/25 11:23, Tao Liu wrote: >>>>>>>> A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be >>>>>>>> reproduced with upstream makedumpfile. 
>>>>>>>> >>>>>>>> When analyzing the corrupt vmcore using crash, the following error >>>>>>>> message will output: >>>>>>>> >>>>>>>> crash: compressed kdump: uncompress failed: 0 >>>>>>>> crash: read error: kernel virtual address: c0001e2d2fe48000 type: >>>>>>>> "hardirq thread_union" >>>>>>>> crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000 >>>>>>>> crash: compressed kdump: uncompress failed: 0 >>>>>>>> >>>>>>>> If the vmcore is generated without num-threads option, then no such >>>>>>>> errors are noticed. >>>>>>>> >>>>>>>> With --num-threads=N enabled, there will be N sub-threads created. All >>>>>>>> sub-threads are producers which responsible for mm page processing, e.g. >>>>>>>> compression. The main thread is the consumer which responsible for >>>>>>>> writing the compressed data into file. page_flag_buf->ready is used to >>>>>>>> sync main and sub-threads. When a sub-thread finishes page processing, >>>>>>>> it will set ready flag to be FLAG_READY. In the meantime, main thread >>>>>>>> looply check all threads of the ready flags, and break the loop when >>>>>>>> find FLAG_READY. >>>>>>> I've tried to reproduce the issue, but I couldn't on x86_64. >>>>>> Yes, I cannot reproduce it on x86_64 either, but the issue is very >>>>>> easily reproduced on ppc64 arch, which is where our QE reported. >>>>> Yes, this is expected. X86 implements a strongly ordered memory model, >>>>> so a "store-to-memory" instruction ensures that the new value is >>>>> immediately observed by other CPUs. >>>>> >>>>> FWIW the current code is wrong even on X86, because it does nothing to >>>>> prevent compiler optimizations. The compiler is then allowed to reorder >>>>> instructions so that the write to page_flag_buf->ready happens after >>>>> other writes; with a bit of bad scheduling luck, the consumer thread >>>>> may see an inconsistent state (e.g. read a stale page_flag_buf->pfn). >>>>> Note that thanks to how compilers are designed (today), this issue is >>>>> more or less hypothetical. Nevertheless, the use of atomics fixes it, >>>>> because they also serve as memory barriers. >>> Thank you Petr, for the information. I was wondering whether atomic >>> operations might be necessary for the other members of page_flag_buf, >>> but it looks like they won't be necessary in this case. >>> >>> Then I was convinced that the issue would be fixed by removing the >>> inconsistency of page_flag_buf->ready. And the patch tested ok, so ack. >>> >> Thank you all for the patch review, patch testing and comments, these >> have been so helpful! >> >> Thanks, >> Tao Liu >> >>> Thanks, >>> Kazu >>> >>>> Thanks a lot for your detailed explanation, it's very helpful! I >>>> haven't thought of the possibility of instruction reordering and >>>> atomic_rw prevents the reorder. >>>> >>>> Thanks, >>>> Tao Liu >>>> >>>>> Petr T >>>>> From ltao at redhat.com Sun Jul 13 16:37:25 2025 From: ltao at redhat.com (Tao Liu) Date: Mon, 14 Jul 2025 11:37:25 +1200 Subject: [PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N) In-Reply-To: <004de18c-263a-405d-9d5a-e83d4c391df7@nec.com> References: <20250625022343.57529-2-ltao@redhat.com> <7c13a968-4a3a-4d0d-8977-3ba0a4a845b1@nec.com> <20250703163100.603f59f4@mordecai.tesarici.cz> <8485b4f1-1277-45ab-b533-efc20120b26e@nec.com> <004de18c-263a-405d-9d5a-e83d4c391df7@nec.com> Message-ID: Hi YAMAZAKI, On Sat, Jul 12, 2025 at 12:08?AM YAMAZAKI MASAMITSU(?????) wrote: > > Sorry, I'm so rate. 
No worries :) > > I looked into the fix and I think it will work safely on other > architectures as well. I think it will also solve the problem > with ppc64. I accept and merge this patch. > > Thank you for reporting this problem and providing the very > difficult fix. Thanks for your response and merging! Thanks, Tao Liu > > Thanks, > > Masa > > On 2025/07/10 14:34, Tao Liu wrote: > > Kindly ping... > > > > Sorry to interrupt, could you please merge the patch since there are > > few bugs which depend on the backporting of this patch? > > > > Thanks, > > Tao Liu > > > > > > On Fri, Jul 4, 2025 at 7:51?PM Tao Liu wrote: > >> On Fri, Jul 4, 2025 at 6:49?PM HAGIO KAZUHITO(?????) wrote: > >>> On 2025/07/04 7:35, Tao Liu wrote: > >>>> Hi Petr, > >>>> > >>>> On Fri, Jul 4, 2025 at 2:31?AM Petr Tesarik wrote: > >>>>> On Tue, 1 Jul 2025 19:59:53 +1200 > >>>>> Tao Liu wrote: > >>>>> > >>>>>> Hi Kazu, > >>>>>> > >>>>>> Thanks for your comments! > >>>>>> > >>>>>> On Tue, Jul 1, 2025 at 7:38?PM HAGIO KAZUHITO(?????) wrote: > >>>>>>> Hi Tao, > >>>>>>> > >>>>>>> thank you for the patch. > >>>>>>> > >>>>>>> On 2025/06/25 11:23, Tao Liu wrote: > >>>>>>>> A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be > >>>>>>>> reproduced with upstream makedumpfile. > >>>>>>>> > >>>>>>>> When analyzing the corrupt vmcore using crash, the following error > >>>>>>>> message will output: > >>>>>>>> > >>>>>>>> crash: compressed kdump: uncompress failed: 0 > >>>>>>>> crash: read error: kernel virtual address: c0001e2d2fe48000 type: > >>>>>>>> "hardirq thread_union" > >>>>>>>> crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000 > >>>>>>>> crash: compressed kdump: uncompress failed: 0 > >>>>>>>> > >>>>>>>> If the vmcore is generated without num-threads option, then no such > >>>>>>>> errors are noticed. > >>>>>>>> > >>>>>>>> With --num-threads=N enabled, there will be N sub-threads created. All > >>>>>>>> sub-threads are producers which responsible for mm page processing, e.g. > >>>>>>>> compression. The main thread is the consumer which responsible for > >>>>>>>> writing the compressed data into file. page_flag_buf->ready is used to > >>>>>>>> sync main and sub-threads. When a sub-thread finishes page processing, > >>>>>>>> it will set ready flag to be FLAG_READY. In the meantime, main thread > >>>>>>>> looply check all threads of the ready flags, and break the loop when > >>>>>>>> find FLAG_READY. > >>>>>>> I've tried to reproduce the issue, but I couldn't on x86_64. > >>>>>> Yes, I cannot reproduce it on x86_64 either, but the issue is very > >>>>>> easily reproduced on ppc64 arch, which is where our QE reported. > >>>>> Yes, this is expected. X86 implements a strongly ordered memory model, > >>>>> so a "store-to-memory" instruction ensures that the new value is > >>>>> immediately observed by other CPUs. > >>>>> > >>>>> FWIW the current code is wrong even on X86, because it does nothing to > >>>>> prevent compiler optimizations. The compiler is then allowed to reorder > >>>>> instructions so that the write to page_flag_buf->ready happens after > >>>>> other writes; with a bit of bad scheduling luck, the consumer thread > >>>>> may see an inconsistent state (e.g. read a stale page_flag_buf->pfn). > >>>>> Note that thanks to how compilers are designed (today), this issue is > >>>>> more or less hypothetical. Nevertheless, the use of atomics fixes it, > >>>>> because they also serve as memory barriers. > >>> Thank you Petr, for the information. 
I was wondering whether atomic > >>> operations might be necessary for the other members of page_flag_buf, > >>> but it looks like they won't be necessary in this case. > >>> > >>> Then I was convinced that the issue would be fixed by removing the > >>> inconsistency of page_flag_buf->ready. And the patch tested ok, so ack. > >>> > >> Thank you all for the patch review, patch testing and comments, these > >> have been so helpful! > >> > >> Thanks, > >> Tao Liu > >> > >>> Thanks, > >>> Kazu > >>> > >>>> Thanks a lot for your detailed explanation, it's very helpful! I > >>>> haven't thought of the possibility of instruction reordering and > >>>> atomic_rw prevents the reorder. > >>>> > >>>> Thanks, > >>>> Tao Liu > >>>> > >>>>> Petr T > >>>>> From bhe at redhat.com Mon Jul 14 23:37:09 2025 From: bhe at redhat.com (Baoquan He) Date: Tue, 15 Jul 2025 14:37:09 +0800 Subject: Second kexec_file_load (but not kexec_load) fails on i915 if CONFIG_INTEL_IOMMU_DEFAULT_ON=n In-Reply-To: References: <197d1dc3bff.c01ddb9024897.1898328361232711826@zohomail.com> Message-ID: On 07/04/25 at 11:29am, Jani Nikula wrote: > On Thu, 03 Jul 2025, Askar Safin wrote: > > TL;DR: I found a bug in strange interaction in kexec_file_load (but not kexec_load) and i915 > > TL;DR#2: Second (sometimes third or forth) kexec (using kexec_file_load) fails on my particular hardware > > TL;DR#3: I did 55 expirements, each of them required a lot of boots, in total I did 1908 boots > > Thanks for the detailed debug info. I'm afraid all I can say at this > point is, please file all of this in a bug report as described in > [1]. Please add the drm.debug related options, and attach the dmesgs and > configs in the bug instead of pointing at external sites. Yeah, that's very great example people can refer to when reporting issues to upstream, thanks for the details. For the bug itself, I would hope Intel GPU people can have a look, see what's happened and how to fix. For kexec reboot, we have got problems on Nvidia GPU and amdgpu which makes kexec reboot hard to do continuous switching to 2nd kernel. In Redhat, we have met this several years ago, and we tried to contact GPU dev, while there's no way to fix it. Finaly we have to declare not supporting kexec reboot formally. This Intel GPU issue could be a different one, I still hope GPU dev can have a look. Currently, many people are investing much effort on KHO, K-state, etc in upstream to make kexec reboot versatile and flexible. I am very glad to see that. And I guess people possiblly have met the same GPU issues on Nvidia and AMD gpu as I mentioned, and trying to solve them. Otherwise, no matter how wonderful KHO, K-state or K-anything are, they are just sky scraper on sand. Personal opinion. Thanks Baoquan From dominik.lotka at fintara.pl Tue Jul 15 00:40:33 2025 From: dominik.lotka at fintara.pl (Dominik Lotka) Date: Tue, 15 Jul 2025 07:40:33 GMT Subject: =?UTF-8?Q?Prosz=C4=99_o_kontakt?= Message-ID: <20250715084500-0.1.ao.2rpel.0.7mcjf6pn4l@fintara.pl> Dzie? dobry, Czy jest mo?liwo?? nawi?zania wsp??pracy z Pa?stwem? Z ch?ci? porozmawiam z osob? zajmuj?c? si? dzia?aniami zwi?zanymi ze sprzeda??. Pomagamy skutecznie pozyskiwa? nowych klient?w. Zapraszam do kontaktu. Pozdrawiam Dominik Lotka From marcin.wojciechowski at ventrazo.pl Wed Jul 16 00:30:30 2025 From: marcin.wojciechowski at ventrazo.pl (Marcin Wojciechowski) Date: Wed, 16 Jul 2025 07:30:30 GMT Subject: Zapytanie ofertowe Message-ID: <20250716084501-0.1.js.4rbq0.0.m2jcbdbrgo@OriginatePro.pl> Dzie? 
dobry, Pozwoli?em sobie na kontakt, poniewa? jestem zainteresowany weryfikacj? mo?liwo?ci nawi?zania wsp??pracy. Wspieramy firmy w pozyskiwaniu nowych klient?w biznesowych. Czy mo?emy porozmawia? w celu przedstawienia szczeg??owych informacji? Pozdrawiam serdecznie Marcin Wojciechowski From piliu at redhat.com Mon Jul 21 07:18:48 2025 From: piliu at redhat.com (Pingfan Liu) Date: Mon, 21 Jul 2025 22:18:48 +0800 Subject: Second kexec_file_load (but not kexec_load) fails on i915 if CONFIG_INTEL_IOMMU_DEFAULT_ON=n In-Reply-To: <197d710ac39.10e2c241536088.2706332519040181850@zohomail.com> References: <197d1dc3bff.c01ddb9024897.1898328361232711826@zohomail.com> <197d710ac39.10e2c241536088.2706332519040181850@zohomail.com> Message-ID: On Sat, Jul 5, 2025 at 4:12?AM Askar Safin wrote: > > ---- On Fri, 04 Jul 2025 12:29:01 +0400 Jani Nikula wrote --- > > Thanks for the detailed debug info. I'm afraid all I can say at this > > point is, please file all of this in a bug report as described in > > [1]. Please add the drm.debug related options, and attach the dmesgs and > > configs in the bug instead of pointing at external sites. > > Okay, now let me speculate how to fix this bug. :) I think someone with moderate kexec understanding > and with Intel GPU should do this: reproduce the bug and then slowly modify kexec_file_load code until it > becomes kexec_load code. (Or vice versa.) In the middle of this modification the bug stops to reproduce, > and so we will know what exactly causes it. > > kexec_file_load and kexec_load should behave the same. If they do not, then we should > understand, why. We should closely review their code. > > Also, in case of kexec_load kernel uncompressing and parsing performed by "kexec" userspace > tool, and in case of kexec_file_load by kernel. So we should closely review this two uncompressing/parsing code fragments. > I think that this bug is related to kexec, not to i915. And thus it should be fixed by kexec people, not by i915 people. (But I may be wrong.) > I tend to agree with Baoquan on this scene when kexec rebooted with a graphic card. I heard that this was due to the missed initialization on the graphic card by the firmware in the kexec reboot process. But it is not an official explanation. If any experts could enlighten me on this, I'd really appreciate it. IMHO, you could try blacklisting the i915 module to see if kexec_file_load works without issues - this would help narrow down the culprit. Thanks, Pingfan > But okay, I reported it to that bug tracker anyway: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14598 > > Maybe there is separate kexec bug tracker? > > Also, your bug tracker is cool. One can attach files in the middle of report. Why not whole kernel uses it? :) > > -- > Askar Safin > https://types.pl/@safinaskar > > From graf at amazon.com Mon Jul 21 15:01:25 2025 From: graf at amazon.com (Alexander Graf) Date: Tue, 22 Jul 2025 00:01:25 +0200 Subject: [PATCH v5] kexec: Enable CMA based contiguous allocation In-Reply-To: References: <20250610085327.51817-1-graf@amazon.com> Message-ID: <07b21458-832f-4b15-9bc8-43f21f902e34@amazon.com> On 10.06.25 13:31, Pasha Tatashin wrote: > On Tue, Jun 10, 2025 at 4:53?AM Alexander Graf wrote: >> When booting a new kernel with kexec_file, the kernel picks a target >> location that the kernel should live at, then allocates random pages, >> checks whether any of those patches magically happens to coincide with >> a target address range and if so, uses them for that range. 
>> >> For every page allocated this way, it then creates a page list that the >> relocation code - code that executes while all CPUs are off and we are >> just about to jump into the new kernel - copies to their final memory >> location. We can not put them there before, because chances are pretty >> good that at least some page in the target range is already in use by >> the currently running Linux environment. Copying is happening from a >> single CPU at RAM rate, which takes around 4-50 ms per 100 MiB. >> >> All of this is inefficient and error prone. >> >> To successfully kexec, we need to quiesce all devices of the outgoing >> kernel so they don't scribble over the new kernel's memory. We have seen >> cases where that does not happen properly (*cough* GIC *cough*) and hence >> the new kernel was corrupted. This started a month long journey to root >> cause failing kexecs to eventually see memory corruption, because the new >> kernel was corrupted severely enough that it could not emit output to >> tell us about the fact that it was corrupted. By allocating memory for the >> next kernel from a memory range that is guaranteed scribbling free, we can >> boot the next kernel up to a point where it is at least able to detect >> corruption and maybe even stop it before it becomes severe. This increases >> the chance for successful kexecs. >> >> Since kexec got introduced, Linux has gained the CMA framework which >> can perform physically contiguous memory mappings, while keeping that >> memory available for movable memory when it is not needed for contiguous >> allocations. The default CMA allocator is for DMA allocations. >> >> This patch adds logic to the kexec file loader to attempt to place the >> target payload at a location allocated from CMA. If successful, it uses >> that memory range directly instead of creating copy instructions during >> the hot phase. To ensure that there is a safety net in case anything goes >> wrong with the CMA allocation, it also adds a flag for user space to force >> disable CMA allocations. >> >> Using CMA allocations has two advantages: >> >> 1) Faster by 4-50 ms per 100 MiB. There is no more need to copy in the >> hot phase. >> 2) More robust. Even if by accident some page is still in use for DMA, >> the new kernel image will be safe from that access because it resides >> in a memory region that is considered allocated in the old kernel and >> has a chance to reinitialize that component. >> >> Signed-off-by: Alexander Graf >> Acked-by: Baoquan He > Reviewed-by: Pasha Tatashin Andrew, I don't see this patch in linus/master. Is it still in your queue? :) Alex Amazon Web Services Development Center Germany GmbH Tamara-Danz-Str. 13 10243 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B Sitz: Berlin Ust-ID: DE 365 538 597 From akpm at linux-foundation.org Mon Jul 21 15:33:40 2025 From: akpm at linux-foundation.org (Andrew Morton) Date: Mon, 21 Jul 2025 15:33:40 -0700 Subject: [PATCH v5] kexec: Enable CMA based contiguous allocation In-Reply-To: <07b21458-832f-4b15-9bc8-43f21f902e34@amazon.com> References: <20250610085327.51817-1-graf@amazon.com> <07b21458-832f-4b15-9bc8-43f21f902e34@amazon.com> Message-ID: <20250721153340.5e033b1df1ac74c1a471c892@linux-foundation.org> On Tue, 22 Jul 2025 00:01:25 +0200 Alexander Graf wrote: > Andrew, I don't see this patch in linus/master. Is it still in your > queue? :) Seems I dropped v3(?) to make way for v5 but then didn't add v5, sorry. Added now. 
From safinaskar at zohomail.com Mon Jul 21 18:28:25 2025 From: safinaskar at zohomail.com (Askar Safin) Date: Tue, 22 Jul 2025 05:28:25 +0400 Subject: Second kexec_file_load (but not kexec_load) fails on i915 if CONFIG_INTEL_IOMMU_DEFAULT_ON=n In-Reply-To: References: <197d1dc3bff.c01ddb9024897.1898328361232711826@zohomail.com> <197d710ac39.10e2c241536088.2706332519040181850@zohomail.com> Message-ID: <1982fbf095a.e7a2ac3764675.6794980000287835465@zohomail.com> ---- On Mon, 21 Jul 2025 18:18:48 +0400 Pingfan Liu wrote --- > IMHO, you could try blacklisting the i915 module to see if I did this. Problem is in i915. Here you can see our discussion with i915 devs: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14598 -- Askar Safin https://types.pl/@safinaskar From piliu at redhat.com Mon Jul 21 19:03:07 2025 From: piliu at redhat.com (Pingfan Liu) Date: Tue, 22 Jul 2025 10:03:07 +0800 Subject: [PATCHv4 00/12] kexec: Use BPF lskel to enable kexec to load PE format boot image Message-ID: <20250722020319.5837-1-piliu@redhat.com> *** Review the history *** Nowadays UEFI PE bootable image is more and more popular on the distribution. But it is still an open issue to load that kind of image by kexec with IMA enabled There are several approaches to reslove this issue, but none of them are accepted in upstream till now. The summary of those approaches: -1. UEFI service emulator for UEFI stub -2. PE format parser in kernel For the first one, I have tried a purgatory-style emulator [1]. But it confronts the hardware scaling trouble. For the second one, there are two choices, one is to implement it inside the kernel, the other is inside the user space. Both zboot-format [2] and UKI-format [3] parsers are rejected due to the concern that the variant format parsers will inflate the kernel code. And finally, we have these kinds of parsers in the user space 'kexec-tools'. *** The approach in this series *** This approach allows the various PE boot image to be parsed in the bpf-prog, as a result, the kexec kernel code to remain relatively stable. Benefits And it abstracts architecture independent part and the API is limitted To protect against malicious attacks on the BPF loader in user space, it employs BPF lskel to load and execute BPF programs from within the kernel. Each type of PE image contains a dedicated section '.bpf', which stores the bpf-prog designed to parse the format. This ensures that the PE's signature also protects the integrity of the '.bpf' section. The parsing process operates as a pipeline. The current BPF program parser attaches to bpf_handle_pefile() and detaches at the end of the current stage via disarm_bpf_prog(). The results parsed by the current BPF program are buffered in the kernel through prepare_nested_pe() and then delivered to the next stage. For each stage of the pipeline, the BPF bytecode is stored in the '.bpf' section of the PE file. That means a vmlinuz.efi embeded in UKI format can be handled. Special thanks to Philipp Rudo, who spent significant time evaluating the practicality of my solution, and to Viktor Malik, who guided me toward using BPF light skeleton to prevent malicious attacks from user space. *** Test result *** Configured with RHEL kernel debug file, which turns on most of locking, memory debug option, I have not seen any warning or bug for 1000 times. Test approach: -1. compile kernel -2. get the zboot image with bpf-prog by 'make -C tools/kexec zboot' -3. 
compile kexec-tools from https://github.com/pfliu/kexec-tools/pull/new/pe_bpf The rest process is the common convention to use kexec. [1]: https://lore.kernel.org/lkml/20240819145417.23367-1-piliu at redhat.com/T/ [2]: https://lore.kernel.org/kexec/20230306030305.15595-1-kernelfans at gmail.com/ [3]: https://lore.kernel.org/lkml/20230911052535.335770-1-kernel at jfarr.cc/ [4]: https://lore.kernel.org/linux-arm-kernel/20230921133703.39042-2-kernelfans at gmail.com/T/ v3 -> v4 - Use dynamic allocator in decompression ([4/12]) - Fix issue caused by Identical Code Folding ([5/12]) - Integrate the image generator tool in the kernel tree ([11,12/12]) - Address the issue according to Philipp's comments in v3 reviewing. Thanks Philipp! RFCv2 -> v3 - move the introduced bpf kfuncs to kernel/bpf/* and mark them sleepable - use listener and publisher model to implement bpf_copy_to_kernel() - keep each introduced kfunc under the control of memcg RFCv1 -> RFCv2 - Use bpf kfunc instead of helper - Use C source code to generate the light skeleton file Pingfan Liu (12): kexec_file: Make kexec_image_load_default global visible lib/decompress: Keep decompressor when CONFIG_KEXEC_PE_IMAGE bpf: Introduce bpf_copy_to_kernel() to buffer the content from bpf-prog bpf: Introduce decompressor kfunc kexec: Introduce kexec_pe_image to parse and load PE file kexec: Integrate with the introduced bpf kfuncs kexec: Introduce a bpf-prog lskel to parse PE file kexec: Factor out routine to find a symbol in ELF kexec: Integrate bpf light skeleton to load zboot image arm64/kexec: Add PE image format support tools/kexec: Introduce a bpf-prog to parse zboot image format tools/kexec: Add a zboot image building tool arch/arm64/Kconfig | 1 + arch/arm64/include/asm/kexec.h | 1 + arch/arm64/kernel/machine_kexec_file.c | 3 + include/linux/bpf.h | 39 ++ include/linux/decompress/mm.h | 7 + include/linux/kexec.h | 6 + kernel/Kconfig.kexec | 8 + kernel/Makefile | 2 + kernel/bpf/Makefile | 2 +- kernel/bpf/helpers.c | 225 +++++++++ kernel/bpf/helpers_carrier.c | 211 +++++++++ kernel/kexec_bpf/Makefile | 71 +++ kernel/kexec_bpf/kexec_pe_parser_bpf.c | 67 +++ kernel/kexec_bpf/kexec_pe_parser_bpf.lskel.h | 147 ++++++ kernel/kexec_file.c | 88 ++-- kernel/kexec_pe_image.c | 463 +++++++++++++++++++ lib/decompress.c | 6 +- tools/kexec/Makefile | 90 ++++ tools/kexec/pe.h | 177 +++++++ tools/kexec/zboot_image_builder.c | 280 +++++++++++ tools/kexec/zboot_parser_bpf.c | 158 +++++++ 21 files changed, 2007 insertions(+), 45 deletions(-) create mode 100644 kernel/bpf/helpers_carrier.c create mode 100644 kernel/kexec_bpf/Makefile create mode 100644 kernel/kexec_bpf/kexec_pe_parser_bpf.c create mode 100644 kernel/kexec_bpf/kexec_pe_parser_bpf.lskel.h create mode 100644 kernel/kexec_pe_image.c create mode 100644 tools/kexec/Makefile create mode 100644 tools/kexec/pe.h create mode 100644 tools/kexec/zboot_image_builder.c create mode 100644 tools/kexec/zboot_parser_bpf.c -- 2.49.0 From piliu at redhat.com Mon Jul 21 19:03:08 2025 From: piliu at redhat.com (Pingfan Liu) Date: Tue, 22 Jul 2025 10:03:08 +0800 Subject: [PATCHv4 01/12] kexec_file: Make kexec_image_load_default global visible In-Reply-To: <20250722020319.5837-1-piliu@redhat.com> References: <20250722020319.5837-1-piliu@redhat.com> Message-ID: <20250722020319.5837-2-piliu@redhat.com> In latter patches, PE format parser will extract the linux kernel inside and try its real format parser. So making kexec_image_load_default global. 
Signed-off-by: Pingfan Liu Cc: Baoquan He Cc: Dave Young Cc: Andrew Morton To: kexec at lists.infradead.org --- include/linux/kexec.h | 1 + kernel/kexec_file.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 03f85ad03025b..3a2b9b4fffa18 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -152,6 +152,7 @@ extern const struct kexec_file_ops * const kexec_file_loaders[]; int kexec_image_probe_default(struct kimage *image, void *buf, unsigned long buf_len); +void *kexec_image_load_default(struct kimage *image); int kexec_image_post_load_cleanup_default(struct kimage *image); /* diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index 69fe76fd92334..c92afe1a3aa5e 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -79,7 +79,7 @@ int kexec_image_probe_default(struct kimage *image, void *buf, return ret; } -static void *kexec_image_load_default(struct kimage *image) +void *kexec_image_load_default(struct kimage *image) { if (!image->fops || !image->fops->load) return ERR_PTR(-ENOEXEC); -- 2.49.0 From piliu at redhat.com Mon Jul 21 19:03:09 2025 From: piliu at redhat.com (Pingfan Liu) Date: Tue, 22 Jul 2025 10:03:09 +0800 Subject: [PATCHv4 02/12] lib/decompress: Keep decompressor when CONFIG_KEXEC_PE_IMAGE In-Reply-To: <20250722020319.5837-1-piliu@redhat.com> References: <20250722020319.5837-1-piliu@redhat.com> Message-ID: <20250722020319.5837-3-piliu@redhat.com> The KEXE PE format parser needs the kernel built-in decompressor to decompress the kernel image. So moving the decompressor out of __init sections. Signed-off-by: Pingfan Liu Cc: Andrew Morton To: linux-kernel at vger.kernel.org --- include/linux/decompress/mm.h | 7 +++++++ lib/decompress.c | 6 +++--- 2 files changed, 10 insertions(+), 3 deletions(-) diff --git a/include/linux/decompress/mm.h b/include/linux/decompress/mm.h index ac862422df158..e8948260e2bbe 100644 --- a/include/linux/decompress/mm.h +++ b/include/linux/decompress/mm.h @@ -92,7 +92,14 @@ MALLOC_VISIBLE void free(void *where) #define large_malloc(a) vmalloc(a) #define large_free(a) vfree(a) +#ifdef CONFIG_KEXEC_PE_IMAGE +#define INIT +#define INITCONST +#else #define INIT __init +#define INITCONST __initconst +#endif + #define STATIC #include diff --git a/lib/decompress.c b/lib/decompress.c index ab3fc90ffc646..3d5b6304bb0f1 100644 --- a/lib/decompress.c +++ b/lib/decompress.c @@ -6,7 +6,7 @@ */ #include - +#include #include #include #include @@ -48,7 +48,7 @@ struct compress_format { decompress_fn decompressor; }; -static const struct compress_format compressed_formats[] __initconst = { +static const struct compress_format compressed_formats[] INITCONST = { { {0x1f, 0x8b}, "gzip", gunzip }, { {0x1f, 0x9e}, "gzip", gunzip }, { {0x42, 0x5a}, "bzip2", bunzip2 }, @@ -60,7 +60,7 @@ static const struct compress_format compressed_formats[] __initconst = { { {0, 0}, NULL, NULL } }; -decompress_fn __init decompress_method(const unsigned char *inbuf, long len, +decompress_fn INIT decompress_method(const unsigned char *inbuf, long len, const char **name) { const struct compress_format *cf; -- 2.49.0 From piliu at redhat.com Mon Jul 21 19:03:10 2025 From: piliu at redhat.com (Pingfan Liu) Date: Tue, 22 Jul 2025 10:03:10 +0800 Subject: [PATCHv4 03/12] bpf: Introduce bpf_copy_to_kernel() to buffer the content from bpf-prog In-Reply-To: <20250722020319.5837-1-piliu@redhat.com> References: <20250722020319.5837-1-piliu@redhat.com> Message-ID: <20250722020319.5837-4-piliu@redhat.com> In the 
security kexec_file_load case, the buffer which holds the kernel image should not be accessible from the userspace. Typically, BPF data flow occurs between user space and kernel space in either direction. However, kexec_file_load presents a unique case where user-originated data must be parsed and then forwarded to the kernel for subsequent parsing stages. This necessitates a mechanism to channel the intermedia data from the BPF program directly to the kernel. bpf_kexec_carrier() is introduced to serve that purpose. Signed-off-by: Pingfan Liu Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: John Fastabend Cc: Andrii Nakryiko Cc: Martin KaFai Lau Cc: Eduard Zingerman Cc: Song Liu Cc: Yonghong Song Cc: KP Singh Cc: Stanislav Fomichev Cc: Hao Luo Cc: Jiri Olsa To: bpf at vger.kernel.org --- include/linux/bpf.h | 39 +++++++ kernel/bpf/Makefile | 2 +- kernel/bpf/helpers.c | 2 + kernel/bpf/helpers_carrier.c | 211 +++++++++++++++++++++++++++++++++++ 4 files changed, 253 insertions(+), 1 deletion(-) create mode 100644 kernel/bpf/helpers_carrier.c diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 5b25d278409bb..0041697596e5d 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -3588,4 +3588,43 @@ static inline bool bpf_is_subprog(const struct bpf_prog *prog) return prog->aux->func_idx != 0; } +enum alloc_type { + TYPE_KALLOC, + TYPE_VMALLOC, + TYPE_VMAP, +}; + +struct mem_range_result { + struct kref ref; + char *buf; + uint32_t buf_sz; + uint32_t data_sz; + /* kmalloc-ed, vmalloc-ed, or vmap-ed */ + enum alloc_type alloc_type; + /* Valid if vmap-ed */ + struct page **pages; + unsigned int pg_cnt; + int status; + struct mem_cgroup *memcg; +}; + +struct mem_range_result *mem_range_result_alloc(void); +void mem_range_result_get(struct mem_range_result *r); +void mem_range_result_put(struct mem_range_result *r); + +typedef int (*resource_handler)(const char *name, struct mem_range_result *r); + +struct carrier_listener { + struct hlist_node node; + char *name; + resource_handler handler; + /* + * bpf_copy_to_kernel() knows the size in advance, so vmap-ed is not + * supported. 
+ */ + enum alloc_type alloc_type; +}; + +int register_carrier_listener(struct carrier_listener *listener); +int unregister_carrier_listener(char *str); #endif /* _LINUX_BPF_H */ diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index 3a335c50e6e3c..cf701aa222fc2 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -6,7 +6,7 @@ cflags-nogcse-$(CONFIG_X86)$(CONFIG_CC_IS_GCC) := -fno-gcse endif CFLAGS_core.o += -Wno-override-init $(cflags-nogcse-yy) -obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o log.o token.o +obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o helpers_carrier.o tnum.o log.o token.o obj-$(CONFIG_BPF_SYSCALL) += bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index b71e428ad9360..b30a2114f15b8 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -3284,6 +3284,8 @@ BTF_KFUNCS_START(generic_btf_ids) #ifdef CONFIG_CRASH_DUMP BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE) #endif +BTF_ID_FLAGS(func, bpf_mem_range_result_put, KF_RELEASE | KF_SLEEPABLE) +BTF_ID_FLAGS(func, bpf_copy_to_kernel, KF_TRUSTED_ARGS | KF_SLEEPABLE) BTF_ID_FLAGS(func, bpf_obj_new_impl, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_percpu_obj_new_impl, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_obj_drop_impl, KF_RELEASE) diff --git a/kernel/bpf/helpers_carrier.c b/kernel/bpf/helpers_carrier.c new file mode 100644 index 0000000000000..de10d6eac7dfb --- /dev/null +++ b/kernel/bpf/helpers_carrier.c @@ -0,0 +1,211 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +DEFINE_STATIC_SRCU(srcu); +static DEFINE_MUTEX(carrier_listeners_mutex); +static DEFINE_HASHTABLE(carrier_listeners, 8); + +static struct carrier_listener *find_listener(const char *str) +{ + struct carrier_listener *item; + unsigned int hash = jhash(str, strlen(str), 0); + + hash_for_each_possible_rcu(carrier_listeners, item, node, hash) { + if (strcmp(item->name, str) == 0) + return item; + } + return NULL; +} + +static void __mem_range_result_free(struct kref *kref) +{ + struct mem_range_result *result = container_of(kref, struct mem_range_result, ref); + struct mem_cgroup *memcg, *old_memcg; + + /* vunmap() is blocking */ + might_sleep(); + memcg = result->memcg; + old_memcg = set_active_memcg(memcg); + if (likely(!!result->buf)) { + switch (result->alloc_type) { + case TYPE_KALLOC: + kfree(result->buf); + break; + case TYPE_VMALLOC: + vfree(result->buf); + break; + case TYPE_VMAP: + vunmap(result->buf); + for (unsigned int i = 0; i < result->pg_cnt; i++) + __free_pages(result->pages[i], 0); + vfree(result->pages); + } + } + kfree(result); + set_active_memcg(old_memcg); + mem_cgroup_put(memcg); +} + +struct mem_range_result *mem_range_result_alloc(void) +{ + struct mem_range_result *range; + + range = kmalloc(sizeof(struct mem_range_result), GFP_KERNEL); + if (!range) + return NULL; + kref_init(&range->ref); + return range; +} + +void mem_range_result_get(struct mem_range_result *r) +{ + if (!r) + return; + kref_get(&r->ref); +} + +void mem_range_result_put(struct mem_range_result *r) +{ + might_sleep(); + if (!r) + return; + kref_put(&r->ref, __mem_range_result_free); +} + +__bpf_kfunc int 
bpf_mem_range_result_put(struct mem_range_result *result) +{ + mem_range_result_put(result); + return 0; +} + +/* + * Cache the content in @buf into kernel + */ +__bpf_kfunc int bpf_copy_to_kernel(const char *name, char *buf, int size) +{ + struct mem_range_result *range; + struct mem_cgroup *memcg, *old_memcg; + struct carrier_listener *item; + resource_handler handler; + enum alloc_type alloc_type; + char *kbuf; + int id, ret = 0; + + /* + * This lock ensures no use of item after free and there is no in-flight + * handler + */ + id = srcu_read_lock(&srcu); + item = find_listener(name); + if (!item) { + srcu_read_unlock(&srcu, id); + return -EINVAL; + } + alloc_type = item->alloc_type; + handler = item->handler; + memcg = get_mem_cgroup_from_current(); + old_memcg = set_active_memcg(memcg); + range = mem_range_result_alloc(); + if (!range) { + pr_err("fail to allocate mem_range_result\n"); + ret = -ENOMEM; + goto err; + } + + switch (alloc_type) { + case TYPE_KALLOC: + kbuf = kmalloc(size, GFP_KERNEL | __GFP_ACCOUNT); + break; + case TYPE_VMALLOC: + kbuf = __vmalloc(size, GFP_KERNEL | __GFP_ACCOUNT); + break; + } + if (!kbuf) { + kfree(range); + ret = -ENOMEM; + goto err; + } + ret = copy_from_kernel_nofault(kbuf, buf, size); + if (unlikely(ret < 0)) { + if (range->alloc_type == TYPE_KALLOC) + kfree(kbuf); + else + vfree(kbuf); + kfree(range); + ret = -EINVAL; + goto err; + } + range->buf = kbuf; + range->buf_sz = size; + range->data_sz = size; + range->memcg = memcg; + mem_cgroup_tryget(memcg); + range->status = 0; + range->alloc_type = alloc_type; + /* We exit the lock after the handler finishes */ + ret = handler(name, range); + srcu_read_unlock(&srcu, id); + mem_range_result_put(range); +err: + if (ret != 0) + srcu_read_unlock(&srcu, id); + set_active_memcg(old_memcg); + mem_cgroup_put(memcg); + return ret; +} + +int register_carrier_listener(struct carrier_listener *listener) +{ + unsigned int hash; + int ret = 0; + char *str = listener->name; + + /* Not support vmap-ed */ + if (listener->alloc_type > TYPE_VMALLOC) + return -EINVAL; + if (!str) + return -EINVAL; + hash = jhash(str, strlen(str), 0); + mutex_lock(&carrier_listeners_mutex); + if (!find_listener(str)) + hash_add_rcu(carrier_listeners, &listener->node, hash); + else + ret = -EBUSY; + mutex_unlock(&carrier_listeners_mutex); + + return ret; +} +EXPORT_SYMBOL(register_carrier_listener); + +int unregister_carrier_listener(char *str) +{ + struct carrier_listener *item; + int ret = 0; + + mutex_lock(&carrier_listeners_mutex); + item = find_listener(str); + if (!!item) { + hash_del_rcu(&item->node); + /* + * It also waits on in-flight handler. Refer to note on the read + * side + */ + synchronize_srcu(&srcu); + } else { + ret = -EINVAL; + } + mutex_unlock(&carrier_listeners_mutex); + + return ret; +} +EXPORT_SYMBOL(unregister_carrier_listener); + -- 2.49.0 From piliu at redhat.com Mon Jul 21 19:03:11 2025 From: piliu at redhat.com (Pingfan Liu) Date: Tue, 22 Jul 2025 10:03:11 +0800 Subject: [PATCHv4 04/12] bpf: Introduce decompressor kfunc In-Reply-To: <20250722020319.5837-1-piliu@redhat.com> References: <20250722020319.5837-1-piliu@redhat.com> Message-ID: <20250722020319.5837-5-piliu@redhat.com> This commit bridges the gap between bpf-prog and the kernel decompression routines. At present, only a global memory allocator is used for the decompression. Later, if needed, the decompress_fn's prototype can be changed to pass in a task related allocator. 
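On the BPF side, a parser program is expected to drive this kfunc roughly as
in the sketch below (the __ksym declarations are what the bpf-prog would
carry; 'payload' and 'payload_sz' are placeholders for the compressed kernel
located by the parser):

	extern struct mem_range_result *
	bpf_decompress(char *image_gz_payload, int image_gz_sz) __ksym;
	extern int bpf_mem_range_result_put(struct mem_range_result *r) __ksym;

	struct mem_range_result *r;

	r = bpf_decompress(payload, payload_sz);
	if (!r)
		return 0;
	/* e.g. hand r->buf / r->data_sz on via bpf_copy_to_kernel() */
	bpf_mem_range_result_put(r);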
This memory allocator can allocate 2MB each time with a transient virtual address, up to a 1GB limit. After decompression finishes, it presents all of the decompressed data in a new unified virtual address space. Signed-off-by: Pingfan Liu Cc: Alexei Starovoitov Cc: Daniel Borkmann Cc: John Fastabend Cc: Andrii Nakryiko Cc: Martin KaFai Lau Cc: Eduard Zingerman Cc: Song Liu Cc: Yonghong Song Cc: KP Singh Cc: Stanislav Fomichev Cc: Hao Luo Cc: Jiri Olsa To: bpf at vger.kernel.org --- kernel/bpf/helpers.c | 223 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 223 insertions(+) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index b30a2114f15b8..70fae899f173e 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -24,6 +24,7 @@ #include #include #include +#include #include "../../lib/kstrtox.h" @@ -3278,12 +3279,234 @@ __bpf_kfunc void __bpf_trap(void) { } +#define MAX_UNCOMPRESSED_BUF_SIZE (1 << 28) +/* a chunk should be large enough to contain a decompressing */ +#define CHUNK_SIZE (1 << 23) + +/* + * At present, one global allocator for decompression. Later if needed, changing the + * prototype of decompress_fn to introduce each task's allocator. + */ +static DEFINE_MUTEX(output_buf_mutex); + +struct decompress_mem_allocator { + struct page **pages; + unsigned int pg_idx; + void *chunk_start; + unsigned int chunk_size; + void *chunk_cur; +}; + +static struct decompress_mem_allocator dcmpr_allocator; + +/* + * Set up an active chunk to hold partial decompressed data. + */ +static void *vmap_decompressed_chunk(void) +{ + struct decompress_mem_allocator *a = &dcmpr_allocator; + unsigned int i, pg_cnt = a->chunk_size >> PAGE_SHIFT; + struct page **pg_start = &a->pages[a->pg_idx]; + + for (i = 0; i < pg_cnt; i++) + a->pages[a->pg_idx++] = alloc_page(GFP_KERNEL | __GFP_ACCOUNT); + + return vmap(pg_start, pg_cnt, VM_MAP, PAGE_KERNEL); +} + +/* + * Present the scattered pages containing decompressed data at a unified virtual + * address. + */ +static int decompress_mem_allocator_handover(struct decompress_mem_allocator *a, + struct mem_range_result *range) +{ + unsigned long pg_array_sz = a->pg_idx * sizeof(struct page *); + + range->pages = vmalloc(pg_array_sz); + if (!range->pages) + return -ENOMEM; + + range->pg_cnt = a->pg_idx; + memcpy(range->pages, a->pages, pg_array_sz); + range->buf = vmap(range->pages, range->pg_cnt, VM_MAP, PAGE_KERNEL); + if (!range->buf) { + vfree(range->pages); + return -1; + } + /* + * Free the tracing pointer; The pages are freed when mem_range_result + * is released. 
+ */ + vfree(a->pages); + a->pages = NULL; + + /* vmap-ed */ + range->alloc_type = TYPE_VMAP; + range->buf_sz = a->pg_idx << PAGE_SHIFT; + range->data_sz = range->buf_sz - a->chunk_size; + range->data_sz += a->chunk_cur - a->chunk_start; + + return 0; +} + +static int decompress_mem_allocator_init( + struct decompress_mem_allocator *allocator, + unsigned int chunk_size) +{ + unsigned long sz = (MAX_UNCOMPRESSED_BUF_SIZE >> PAGE_SHIFT) * sizeof(struct page *); + + allocator->pages = __vmalloc(sz, GFP_KERNEL | __GFP_ACCOUNT); + if (!allocator->pages) + return -ENOMEM; + + allocator->pg_idx = 0; + allocator->chunk_start = NULL; + allocator->chunk_size = chunk_size; + allocator->chunk_cur = NULL; + return 0; +} + +static void decompress_mem_allocator_fini(struct decompress_mem_allocator *allocator) +{ + unsigned int i; + + /* unmap the active chunk */ + if (!!allocator->chunk_start) + vunmap(allocator->chunk_start); + if (!!allocator->pages) { + for (i = 0; i < allocator->pg_idx; i++) + __free_pages(allocator->pages[i], 0); + vfree(allocator->pages); + } +} + +/* + * This is a callback for decompress_fn. + * + * It copies the partial decompressed content in [buf, buf + len) to dst. If the + * active chunk is not large enough, retire it and activate a new chunk to hold + * the remaining data. + */ +static long flush(void *buf, unsigned long len) +{ + struct decompress_mem_allocator *a = &dcmpr_allocator; + long free, copied = 0; + + /* The first time allocation */ + if (unlikely(!a->chunk_start)) { + a->chunk_start = a->chunk_cur = vmap_decompressed_chunk(); + if (unlikely(!a->chunk_start)) + return -1; + } + + free = a->chunk_start + a->chunk_size - a->chunk_cur; + BUG_ON(free < 0); + if (free < len) { + /* + * If the totoal size exceeds MAX_UNCOMPRESSED_BUF_SIZE, + * return -1 to indicate the decompress method that something + * is wrong + */ + if (unlikely((a->pg_idx >= MAX_UNCOMPRESSED_BUF_SIZE >> PAGE_SHIFT))) + return -1; + memcpy(a->chunk_cur, buf, free); + copied += free; + a->chunk_cur += free; + len -= free; + /* + * When retiring the active chunk, release its virtual address + * but do not release the contents in the pages. 
+ */ + vunmap(a->chunk_start); + a->chunk_start = a->chunk_cur = vmap_decompressed_chunk(); + if (unlikely(!a->chunk_start)) + return -1; + } + memcpy(a->chunk_cur, buf, len); + copied += len; + a->chunk_cur += len; + return copied; +} + +__bpf_kfunc struct mem_range_result *bpf_decompress(char *image_gz_payload, int image_gz_sz) +{ + struct decompress_mem_allocator *a = &dcmpr_allocator; + decompress_fn decompressor; + struct mem_cgroup *memcg, *old_memcg; + struct mem_range_result *range; + const char *name; + char *input_buf; + int ret; + + memcg = get_mem_cgroup_from_current(); + old_memcg = set_active_memcg(memcg); + range = mem_range_result_alloc(); + if (!range) { + pr_err("fail to allocate mem_range_result\n"); + goto error; + } + + input_buf = __vmalloc(image_gz_sz, GFP_KERNEL | __GFP_ACCOUNT); + if (!input_buf) { + kfree(range); + pr_err("fail to allocate input buffer\n"); + goto error; + } + + ret = copy_from_kernel_nofault(input_buf, image_gz_payload, image_gz_sz); + if (ret < 0) { + kfree(range); + vfree(input_buf); + pr_err("Error when copying from 0x%p, size:0x%x\n", + image_gz_payload, image_gz_sz); + goto error; + } + + mutex_lock(&output_buf_mutex); + decompress_mem_allocator_init(a, CHUNK_SIZE); + decompressor = decompress_method(input_buf, image_gz_sz, &name); + if (!decompressor) { + kfree(range); + vfree(input_buf); + pr_err("Can not find decompress method\n"); + goto error; + } + ret = decompressor(input_buf, image_gz_sz, NULL, flush, + NULL, NULL, NULL); + + vfree(input_buf); + if (ret == 0) { + ret = decompress_mem_allocator_handover(a, range); + if (!!ret) + goto fail; + range->status = 0; + mem_cgroup_tryget(memcg); + range->memcg = memcg; + set_active_memcg(old_memcg); + } +fail: + decompress_mem_allocator_fini(a); + mutex_unlock(&output_buf_mutex); + if (!!ret) { + kfree(range); + range = NULL; + pr_err("Decompress error\n"); + } + +error: + set_active_memcg(old_memcg); + mem_cgroup_put(memcg); + return range; +} + __bpf_kfunc_end_defs(); BTF_KFUNCS_START(generic_btf_ids) #ifdef CONFIG_CRASH_DUMP BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE) #endif +BTF_ID_FLAGS(func, bpf_decompress, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_SLEEPABLE) BTF_ID_FLAGS(func, bpf_mem_range_result_put, KF_RELEASE | KF_SLEEPABLE) BTF_ID_FLAGS(func, bpf_copy_to_kernel, KF_TRUSTED_ARGS | KF_SLEEPABLE) BTF_ID_FLAGS(func, bpf_obj_new_impl, KF_ACQUIRE | KF_RET_NULL) -- 2.49.0 From piliu at redhat.com Mon Jul 21 19:03:12 2025 From: piliu at redhat.com (Pingfan Liu) Date: Tue, 22 Jul 2025 10:03:12 +0800 Subject: [PATCHv4 05/12] kexec: Introduce kexec_pe_image to parse and load PE file In-Reply-To: <20250722020319.5837-1-piliu@redhat.com> References: <20250722020319.5837-1-piliu@redhat.com> Message-ID: <20250722020319.5837-6-piliu@redhat.com> As UEFI becomes popular, a few architectures support to boot a PE format kernel image directly. But the internal of PE format varies, which means each parser for each format. This patch (with the rest in this series) introduces a common skeleton to all parsers, and leave the format parsing in bpf-prog, so the kernel code can keep relative stable. A new kexec_file_ops is implementation, named pe_image_ops. There are some place holder function in this patch. (They will take effect after the introduction of kexec bpf light skeleton and bpf helpers). 
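For reference, an architecture opts in by listing the new ops in its
kexec_file_loaders[] table. The arm64 wiring comes later in this series; a
rough sketch (entry names are illustrative) would be:

	/* arch/arm64/kernel/machine_kexec_file.c (sketch) */
	const struct kexec_file_ops * const kexec_file_loaders[] = {
		&kexec_image_ops,
		&kexec_pe_image_ops,
		NULL
	};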
Overall the parsing progress is a pipeline, the current bpf-prog parser is attached to bpf_handle_pefile(), and detatched at the end of the current stage 'disarm_bpf_prog()' the current parsed result by the current bpf-prog will be buffered in kernel 'prepare_nested_pe()' , and deliver to the next stage. For each stage, the bpf bytecode is extracted from the '.bpf' section in the PE file. Signed-off-by: Pingfan Liu Cc: Baoquan He Cc: Dave Young Cc: Andrew Morton Cc: Philipp Rudo To: kexec at lists.infradead.org --- include/linux/kexec.h | 1 + kernel/Kconfig.kexec | 8 + kernel/Makefile | 1 + kernel/kexec_pe_image.c | 348 ++++++++++++++++++++++++++++++++++++++++ 4 files changed, 358 insertions(+) create mode 100644 kernel/kexec_pe_image.c diff --git a/include/linux/kexec.h b/include/linux/kexec.h index 3a2b9b4fffa18..da527f323a930 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -434,6 +434,7 @@ static inline int machine_kexec_post_load(struct kimage *image) { return 0; } extern struct kimage *kexec_image; extern struct kimage *kexec_crash_image; +extern const struct kexec_file_ops pe_image_ops; bool kexec_load_permitted(int kexec_image_type); diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec index e64ce21f9a805..304a82d03f750 100644 --- a/kernel/Kconfig.kexec +++ b/kernel/Kconfig.kexec @@ -46,6 +46,14 @@ config KEXEC_FILE for kernel and initramfs as opposed to list of segments as accepted by kexec system call. +config KEXEC_PE_IMAGE + bool "Enable parsing UEFI PE file through kexec file based system call" + depends on KEXEC_FILE + depends on DEBUG_INFO_BTF && BPF_SYSCALL + help + This option makes the kexec_file_load() syscall cooperates with bpf-prog + to parse PE format file + config KEXEC_SIG bool "Verify kernel signature during kexec_file_load() syscall" depends on ARCH_SUPPORTS_KEXEC_SIG diff --git a/kernel/Makefile b/kernel/Makefile index 32e80dd626af0..def3f50a0b2ef 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -80,6 +80,7 @@ obj-$(CONFIG_CRASH_DUMP) += crash_core.o obj-$(CONFIG_CRASH_DM_CRYPT) += crash_dump_dm_crypt.o obj-$(CONFIG_KEXEC) += kexec.o obj-$(CONFIG_KEXEC_FILE) += kexec_file.o +obj-$(CONFIG_KEXEC_PE_IMAGE) += kexec_pe_image.o obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o obj-$(CONFIG_KEXEC_HANDOVER) += kexec_handover.o obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o diff --git a/kernel/kexec_pe_image.c b/kernel/kexec_pe_image.c new file mode 100644 index 0000000000000..b0cf9942e68d2 --- /dev/null +++ b/kernel/kexec_pe_image.c @@ -0,0 +1,348 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Kexec PE image loader + + * Copyright (C) 2025 Red Hat, Inc + */ + +#define pr_fmt(fmt) "kexec_file(Image): " fmt + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + + +#define KEXEC_RES_KERNEL_NAME "kexec:kernel" +#define KEXEC_RES_INITRD_NAME "kexec:initrd" +#define KEXEC_RES_CMDLINE_NAME "kexec:cmdline" + +struct kexec_res { + char *name; + /* The free of buffer is deferred to kimage_file_post_load_cleanup */ + struct mem_range_result *r; +}; + +static struct kexec_res parsed_resource[3] = { + { KEXEC_RES_KERNEL_NAME, }, + { KEXEC_RES_INITRD_NAME, }, + { KEXEC_RES_CMDLINE_NAME, }, +}; + +static bool pe_has_bpf_section(const char *file_buf, unsigned long pe_sz); + +static bool is_valid_pe(const char *kernel_buf, unsigned long kernel_len) +{ + struct mz_hdr *mz; + struct pe_hdr *pe; + + if (!kernel_buf) + return false; + mz = (struct mz_hdr *)kernel_buf; + if (mz->magic != 
IMAGE_DOS_SIGNATURE) + return false; + pe = (struct pe_hdr *)(kernel_buf + mz->peaddr); + if (pe->magic != IMAGE_NT_SIGNATURE) + return false; + if (pe->opt_hdr_size == 0) { + pr_err("optional header is missing\n"); + return false; + } + + return pe_has_bpf_section(kernel_buf, kernel_len); +} + +static bool is_valid_format(const char *kernel_buf, unsigned long kernel_len) +{ + return is_valid_pe(kernel_buf, kernel_len); +} + +/* + * The UEFI Terse Executable (TE) image has MZ header. + */ +static int pe_image_probe(const char *kernel_buf, unsigned long kernel_len) +{ + return is_valid_pe(kernel_buf, kernel_len) ? 0 : -1; +} + +static int pe_get_section(const char *file_buf, const char *sect_name, + char **sect_start, unsigned long *sect_sz) +{ + struct pe_hdr *pe_hdr; + struct pe32plus_opt_hdr *opt_hdr; + struct section_header *sect_hdr; + int section_nr, i; + struct mz_hdr *mz = (struct mz_hdr *)file_buf; + + *sect_start = NULL; + *sect_sz = 0; + pe_hdr = (struct pe_hdr *)(file_buf + mz->peaddr); + section_nr = pe_hdr->sections; + opt_hdr = (struct pe32plus_opt_hdr *)(file_buf + mz->peaddr + sizeof(struct pe_hdr)); + sect_hdr = (struct section_header *)((char *)opt_hdr + pe_hdr->opt_hdr_size); + + for (i = 0; i < section_nr; i++) { + if (strcmp(sect_hdr->name, sect_name) == 0) { + *sect_start = (char *)file_buf + sect_hdr->data_addr; + *sect_sz = sect_hdr->raw_data_size; + return 0; + } + sect_hdr++; + } + + return -1; +} + +static bool pe_has_bpf_section(const char *file_buf, unsigned long pe_sz) +{ + char *sect_start = NULL; + unsigned long sect_sz = 0; + int ret; + + ret = pe_get_section(file_buf, ".bpf", §_start, §_sz); + if (ret < 0) + return false; + return true; +} + +/* Load a ELF */ +static int arm_bpf_prog(char *bpf_elf, unsigned long sz) +{ + return 0; +} + +static void disarm_bpf_prog(void) +{ +} + +struct kexec_context { + bool kdump; + char *image; + int image_sz; + char *initrd; + int initrd_sz; + char *cmdline; + int cmdline_sz; +}; + +void bpf_handle_pefile(struct kexec_context *context); +void bpf_post_handle_pefile(struct kexec_context *context); + + +/* + * optimize("O0") prevents inline, compiler constant propagation + */ +__attribute__((used, optimize("O0"))) void bpf_handle_pefile(struct kexec_context *context) +{ + /* + * To prevent linker from Identical Code Folding (ICF) with bpf_handle_pefile, + * making them have different code. + */ + volatile int dummy = 0; + + dummy += 1; +} + +__attribute__((used, optimize("O0"))) void bpf_post_handle_pefile(struct kexec_context *context) +{ + volatile int dummy = 0; + + dummy += 2; +} + +/* + * PE file may be nested and should be unfold one by one. + * Query 'kernel', 'initrd', 'cmdline' in cur_phase, as they are inputs for the + * next phase. 
+ */ +static int prepare_nested_pe(char **kernel, unsigned long *kernel_len, char **initrd, + unsigned long *initrd_len, char **cmdline) +{ + struct kexec_res *res; + int ret = -1; + + *kernel = NULL; + *kernel_len = 0; + + res = &parsed_resource[0]; + if (!!res->r) { + *kernel = res->r->buf; + *kernel_len = res->r->data_sz; + ret = 0; + } + + res = &parsed_resource[1]; + if (!!res->r) { + *initrd = res->r->buf; + *initrd_len = res->r->data_sz; + } + + res = &parsed_resource[2]; + if (!!res->r) { + *cmdline = res->r->buf; + } + + return ret; +} + +static void *pe_image_load(struct kimage *image, + char *kernel, unsigned long kernel_len, + char *initrd, unsigned long initrd_len, + char *cmdline, unsigned long cmdline_len) +{ + char *linux_start, *initrd_start, *cmdline_start, *bpf_start; + unsigned long linux_sz, initrd_sz, cmdline_sz, bpf_sz; + struct kexec_res *res; + struct mem_range_result *r; + void *ldata; + int ret; + + linux_start = kernel; + linux_sz = kernel_len; + initrd_start = initrd; + initrd_sz = initrd_len; + cmdline_start = cmdline; + cmdline_sz = cmdline_len; + + while (is_valid_format(linux_start, linux_sz) && + pe_has_bpf_section(linux_start, linux_sz)) { + struct kexec_context context; + + pe_get_section((const char *)linux_start, ".bpf", &bpf_start, &bpf_sz); + if (!!bpf_sz) { + /* load and attach bpf-prog */ + ret = arm_bpf_prog(bpf_start, bpf_sz); + if (ret) { + pr_err("Fail to load .bpf section\n"); + ldata = ERR_PTR(ret); + goto err; + } + } + if (image->type != KEXEC_TYPE_CRASH) + context.kdump = false; + else + context.kdump = true; + context.image = linux_start; + context.image_sz = linux_sz; + context.initrd = initrd_start; + context.initrd_sz = initrd_sz; + context.cmdline = cmdline_start; + context.cmdline_sz = strlen(cmdline_start); + /* bpf-prog fentry, which handle above buffers. */ + bpf_handle_pefile(&context); + + prepare_nested_pe(&linux_start, &linux_sz, &initrd_start, + &initrd_sz, &cmdline_start); + /* bpf-prog fentry */ + bpf_post_handle_pefile(&context); + /* + * detach the current bpf-prog from their attachment points. + */ + disarm_bpf_prog(); + } + + /* + * image's kernel_buf, initrd_buf, cmdline_buf are set. Now they should + * be updated to the new content. 
+ */ + + res = &parsed_resource[0]; + /* Kernel part should always be parsed */ + if (!res->r) { + pr_err("Can not parse kernel\n"); + ldata = ERR_PTR(-EINVAL); + goto err; + } + kernel = res->r->buf; + kernel_len = res->r->data_sz; + vfree(image->kernel_buf); + image->kernel_buf = kernel; + image->kernel_buf_len = kernel_len; + + res = &parsed_resource[1]; + if (!!res->r) { + initrd = res->r->buf; + initrd_len = res->r->data_sz; + vfree(image->initrd_buf); + image->initrd_buf = initrd; + image->initrd_buf_len = initrd_len; + } + res = &parsed_resource[2]; + if (!!res->r) { + cmdline = res->r->buf; + cmdline_len = res->r->data_sz; + kfree(image->cmdline_buf); + image->cmdline_buf = cmdline; + image->cmdline_buf_len = cmdline_len; + } + + if (kernel == NULL || initrd == NULL || cmdline == NULL) { + char *c, buf[64]; + + c = buf; + if (kernel == NULL) { + strcpy(c, "kernel "); + c += strlen("kernel "); + } + if (initrd == NULL) { + strcpy(c, "initrd "); + c += strlen("initrd "); + } + if (cmdline == NULL) { + strcpy(c, "cmdline "); + c += strlen("cmdline "); + } + c = '\0'; + pr_err("Can not extract data for %s", buf); + ldata = ERR_PTR(-EINVAL); + goto err; + } + + ret = arch_kexec_kernel_image_probe(image, image->kernel_buf, + image->kernel_buf_len); + if (ret) { + pr_err("Fail to find suitable image loader\n"); + ldata = ERR_PTR(ret); + goto err; + } + ldata = kexec_image_load_default(image); + if (IS_ERR(ldata)) { + pr_err("architecture code fails to load image\n"); + goto err; + } + image->image_loader_data = ldata; + +err: + for (int i = 0; i < 3; i++) { + r = parsed_resource[i].r; + if (!r) + continue; + parsed_resource[i].r = NULL; + /* + * The release of buffer defers to + * kimage_file_post_load_cleanup() + */ + r->buf = NULL; + r->buf_sz = 0; + mem_range_result_put(r); + } + + return ldata; +} + +const struct kexec_file_ops kexec_pe_image_ops = { + .probe = pe_image_probe, + .load = pe_image_load, +#ifdef CONFIG_KEXEC_IMAGE_VERIFY_SIG + .verify_sig = kexec_kernel_verify_pe_sig, +#endif +}; -- 2.49.0 From piliu at redhat.com Mon Jul 21 19:03:13 2025 From: piliu at redhat.com (Pingfan Liu) Date: Tue, 22 Jul 2025 10:03:13 +0800 Subject: [PATCHv4 06/12] kexec: Integrate with the introduced bpf kfuncs In-Reply-To: <20250722020319.5837-1-piliu@redhat.com> References: <20250722020319.5837-1-piliu@redhat.com> Message-ID: <20250722020319.5837-7-piliu@redhat.com> This patch does two things: First, register as a listener on bpf_copy_to_kernel() Second, in order that the hooked bpf-prog can call the sleepable kfuncs, bpf_handle_pefile and bpf_post_handle_pefile are marked as KF_SLEEPABLE. 
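Roughly, the bpf-prog side of this contract looks like the sketch below
(declarations assumed; "kexec:kernel" matches KEXEC_RES_KERNEL_NAME handled
by the listener below, and 'inner'/'inner_sz' stand for the image located by
the parser):

	extern int bpf_copy_to_kernel(const char *name, char *buf, int size) __ksym;

	SEC("fentry.s/bpf_handle_pefile")
	int BPF_PROG(parse_pe, struct kexec_context *context)
	{
		/* locate the embedded kernel inside context->image ... */
		return bpf_copy_to_kernel("kexec:kernel", inner, inner_sz);
	}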
Signed-off-by: Pingfan Liu Cc: Alexei Starovoitov Cc: Philipp Rudo Cc: Baoquan He Cc: Dave Young Cc: Andrew Morton Cc: bpf at vger.kernel.org To: kexec at lists.infradead.org --- kernel/kexec_pe_image.c | 67 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 67 insertions(+) diff --git a/kernel/kexec_pe_image.c b/kernel/kexec_pe_image.c index b0cf9942e68d2..f8debcde6b516 100644 --- a/kernel/kexec_pe_image.c +++ b/kernel/kexec_pe_image.c @@ -38,6 +38,51 @@ static struct kexec_res parsed_resource[3] = { { KEXEC_RES_CMDLINE_NAME, }, }; +/* + * @name should be one of : kernel, initrd, cmdline + */ +static int bpf_kexec_carrier(const char *name, struct mem_range_result *r) +{ + struct kexec_res *res; + int i; + + if (!r || !name) + return -EINVAL; + + for (i = 0; i < 3; i++) { + if (!strcmp(parsed_resource[i].name, name)) + break; + } + if (i >= 3) + return -EINVAL; + + res = &parsed_resource[i]; + /* + * Replace the intermediate resource generated by the previous step. + */ + if (!!res->r) + mem_range_result_put(res->r); + mem_range_result_get(r); + res->r = r; + return 0; +} + +static struct carrier_listener kexec_res_listener[3] = { + { .name = KEXEC_RES_KERNEL_NAME, + .alloc_type = 1, + .handler = bpf_kexec_carrier, + }, + { .name = KEXEC_RES_INITRD_NAME, + .alloc_type = 1, + .handler = bpf_kexec_carrier, + }, + { .name = KEXEC_RES_CMDLINE_NAME, + /* kmalloc-ed */ + .alloc_type = 0, + .handler = bpf_kexec_carrier, + }, +}; + static bool pe_has_bpf_section(const char *file_buf, unsigned long pe_sz); static bool is_valid_pe(const char *kernel_buf, unsigned long kernel_len) @@ -159,6 +204,22 @@ __attribute__((used, optimize("O0"))) void bpf_post_handle_pefile(struct kexec_c dummy += 2; } +BTF_KFUNCS_START(kexec_modify_return_ids) +BTF_ID_FLAGS(func, bpf_handle_pefile, KF_SLEEPABLE) +BTF_ID_FLAGS(func, bpf_post_handle_pefile, KF_SLEEPABLE) +BTF_KFUNCS_END(kexec_modify_return_ids) + +static const struct btf_kfunc_id_set kexec_modify_return_set = { + .owner = THIS_MODULE, + .set = &kexec_modify_return_ids, +}; + +static int __init kexec_bpf_prog_run_init(void) +{ + return register_btf_fmodret_id_set(&kexec_modify_return_set); +} +late_initcall(kexec_bpf_prog_run_init); + /* * PE file may be nested and should be unfold one by one. * Query 'kernel', 'initrd', 'cmdline' in cur_phase, as they are inputs for the @@ -213,6 +274,9 @@ static void *pe_image_load(struct kimage *image, cmdline_start = cmdline; cmdline_sz = cmdline_len; + for (int i = 0; i < ARRAY_SIZE(kexec_res_listener); i++) + register_carrier_listener(&kexec_res_listener[i]); + while (is_valid_format(linux_start, linux_sz) && pe_has_bpf_section(linux_start, linux_sz)) { struct kexec_context context; @@ -250,6 +314,9 @@ static void *pe_image_load(struct kimage *image, disarm_bpf_prog(); } + for (int i = 0; i < ARRAY_SIZE(kexec_res_listener); i++) + unregister_carrier_listener(kexec_res_listener[i].name); + /* * image's kernel_buf, initrd_buf, cmdline_buf are set. Now they should * be updated to the new content. -- 2.49.0 From piliu at redhat.com Mon Jul 21 19:03:14 2025 From: piliu at redhat.com (Pingfan Liu) Date: Tue, 22 Jul 2025 10:03:14 +0800 Subject: [PATCHv4 07/12] kexec: Introduce a bpf-prog lskel to parse PE file In-Reply-To: <20250722020319.5837-1-piliu@redhat.com> References: <20250722020319.5837-1-piliu@redhat.com> Message-ID: <20250722020319.5837-8-piliu@redhat.com> Analague to kernel/bpf/preload/iterators/Makefile, this Makefile is not invoked by the Kbuild system. 
It needs to be invoked manually when kexec_pe_parser_bpf.c is changed so that kexec_pe_parser_bpf.lskel.h can be re-generated by the command "bpftool gen skeleton -L kexec_pe_parser_bpf.o". kexec_pe_parser_bpf.lskel.h is used directly by the kernel kexec code in later patch. For this patch, there are bpf bytecode contained in opts_data[] and opts_insn[] in kexec_pe_parser_bpf.lskel.h, but in the following patch, they will be removed and only the function API in kexec_pe_parser_bpf.lskel.h left. As exposed in kexec_pe_parser_bpf.lskel.h, the interface between bpf-prog and the kernel are constituted by: four maps: struct bpf_map_desc ringbuf_1; struct bpf_map_desc ringbuf_2; struct bpf_map_desc ringbuf_3; struct bpf_map_desc ringbuf_4; four sections: struct bpf_map_desc rodata; struct bpf_map_desc data; struct bpf_map_desc bss; struct bpf_map_desc rodata_str1_1; two progs: SEC("fentry.s/bpf_handle_pefile") SEC("fentry.s/bpf_post_handle_pefile") They are fixed and provided for all kinds of bpf-prog which interacts with the kexec kernel component. Signed-off-by: Pingfan Liu Cc: Alexei Starovoitov Cc: Baoquan He Cc: Dave Young Cc: Andrew Morton Cc: Philipp Rudo Cc: bpf at vger.kernel.org To: kexec at lists.infradead.org --- kernel/kexec_bpf/Makefile | 63 +++ kernel/kexec_bpf/kexec_pe_parser_bpf.c | 67 +++ kernel/kexec_bpf/kexec_pe_parser_bpf.lskel.h | 431 +++++++++++++++++++ 3 files changed, 561 insertions(+) create mode 100644 kernel/kexec_bpf/Makefile create mode 100644 kernel/kexec_bpf/kexec_pe_parser_bpf.c create mode 100644 kernel/kexec_bpf/kexec_pe_parser_bpf.lskel.h diff --git a/kernel/kexec_bpf/Makefile b/kernel/kexec_bpf/Makefile new file mode 100644 index 0000000000000..0c9db6d94a604 --- /dev/null +++ b/kernel/kexec_bpf/Makefile @@ -0,0 +1,63 @@ +# SPDX-License-Identifier: GPL-2.0 +OUTPUT := .output +CLANG ?= clang +LLC ?= llc +LLVM_STRIP ?= llvm-strip +DEFAULT_BPFTOOL := $(OUTPUT)/sbin/bpftool +BPFTOOL ?= $(DEFAULT_BPFTOOL) +LIBBPF_SRC := $(abspath ../../tools/lib/bpf) +BPFOBJ := $(OUTPUT)/libbpf.a +BPF_INCLUDE := $(OUTPUT) +INCLUDES := -I$(OUTPUT) -I$(BPF_INCLUDE) -I$(abspath ../../tools/lib) \ + -I$(abspath ../../tools/include/uapi) +CFLAGS := -g -Wall + +srctree := $(patsubst %/kernel/kexec_bpf,%,$(CURDIR)) +VMLINUX = $(srctree)/vmlinux + +abs_out := $(abspath $(OUTPUT)) +ifeq ($(V),1) +Q = +msg = +else +Q = @ +msg = @printf ' %-8s %s%s\n' "$(1)" "$(notdir $(2))" "$(if $(3), $(3))"; +MAKEFLAGS += --no-print-directory +submake_extras := feature_display=0 +endif + +.DELETE_ON_ERROR: + +.PHONY: all clean + +all: kexec_pe_parser_bpf.lskel.h + +clean: + $(call msg,CLEAN) + $(Q)rm -rf $(OUTPUT) kexec_pe_parser_bpf.lskel.h + +kexec_pe_parser_bpf.lskel.h: $(OUTPUT)/kexec_pe_parser_bpf.o | $(BPFTOOL) + $(call msg,GEN-SKEL,$@) + $(Q)$(BPFTOOL) gen skeleton -L $< > $@ + +$(OUTPUT)/vmlinux.h: $(VMLINUX) $(BPFOBJ) | $(OUTPUT) + @$(BPFTOOL) btf dump file $(VMLINUX) format c > $(OUTPUT)/vmlinux.h + + +$(OUTPUT)/kexec_pe_parser_bpf.o: kexec_pe_parser_bpf.c $(BPFOBJ) | $(OUTPUT) + $(call msg,BPF,$@) + $(Q)$(CLANG) -g -O2 -target bpf $(INCLUDES) \ + -c $(filter %.c,$^) -o $@ && \ + $(LLVM_STRIP) -g $@ + +$(OUTPUT): + $(call msg,MKDIR,$@) + $(Q)mkdir -p $(OUTPUT) + +$(BPFOBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT) + $(Q)$(MAKE) $(submake_extras) -C $(LIBBPF_SRC) \ + OUTPUT=$(abspath $(dir $@))/ $(abspath $@) + +$(DEFAULT_BPFTOOL): + $(Q)$(MAKE) $(submake_extras) -C ../../tools/bpf/bpftool \ + prefix= OUTPUT=$(abs_out)/ DESTDIR=$(abs_out) install diff --git 
a/kernel/kexec_bpf/kexec_pe_parser_bpf.c b/kernel/kexec_bpf/kexec_pe_parser_bpf.c new file mode 100644 index 0000000000000..7d524459806e2 --- /dev/null +++ b/kernel/kexec_bpf/kexec_pe_parser_bpf.c @@ -0,0 +1,67 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "vmlinux.h" +#include +#include +#include +#include + +/* + * The ringbufs can have different capacity. But only four ringbuf are provided. + */ +#define RINGBUF1_SIZE 4 +#define RINGBUF2_SIZE 4 +#define RINGBUF3_SIZE 4 +#define RINGBUF4_SIZE 4 + +#define KEXEC_RES_KERNEL_NAME "kexec:kernel" +#define KEXEC_RES_INITRD_NAME "kexec:initrd" +#define KEXEC_RES_CMDLINE_NAME "kexec:cmdline" + +/* ringbuf is safe since the user space has no write access to them */ +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, RINGBUF1_SIZE); +} ringbuf_1 SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, RINGBUF2_SIZE); +} ringbuf_2 SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, RINGBUF3_SIZE); +} ringbuf_3 SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, RINGBUF4_SIZE); +} ringbuf_4 SEC(".maps"); + +char LICENSE[] SEC("license") = "GPL"; + +/* + * This function ensures that the sections .rodata, .data .bss and .rodata.str1.1 + * are created for a bpf prog. + */ +__attribute__((used)) static int dummy(void) +{ + static const char res_kernel[16] __attribute__((used, section(".rodata"))) = KEXEC_RES_KERNEL_NAME; + static char local_name[16] __attribute__((used, section(".data"))) = KEXEC_RES_CMDLINE_NAME; + static char res_cmdline[16] __attribute__((used, section(".bss"))); + + __builtin_memcpy(local_name, KEXEC_RES_INITRD_NAME, 16); + return __builtin_memcmp(local_name, res_kernel, 4); +} + +SEC("fentry.s/bpf_handle_pefile") +__attribute__((used)) int BPF_PROG(parse_pe, struct kexec_context *context) +{ + return 0; +} + +SEC("fentry.s/bpf_post_handle_pefile") +__attribute__((used)) int BPF_PROG(post_parse_pe, struct kexec_context *context) +{ + return 0; +} diff --git a/kernel/kexec_bpf/kexec_pe_parser_bpf.lskel.h b/kernel/kexec_bpf/kexec_pe_parser_bpf.lskel.h new file mode 100644 index 0000000000000..34c7aabde66f0 --- /dev/null +++ b/kernel/kexec_bpf/kexec_pe_parser_bpf.lskel.h @@ -0,0 +1,431 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ +/* THIS FILE IS AUTOGENERATED BY BPFTOOL! 
*/ +#ifndef __KEXEC_PE_PARSER_BPF_SKEL_H__ +#define __KEXEC_PE_PARSER_BPF_SKEL_H__ + +#include + +struct kexec_pe_parser_bpf { + struct bpf_loader_ctx ctx; + struct { + struct bpf_map_desc ringbuf_1; + struct bpf_map_desc ringbuf_2; + struct bpf_map_desc ringbuf_3; + struct bpf_map_desc ringbuf_4; + struct bpf_map_desc rodata; + struct bpf_map_desc data; + struct bpf_map_desc bss; + struct bpf_map_desc rodata_str1_1; + } maps; + struct { + struct bpf_prog_desc parse_pe; + struct bpf_prog_desc post_parse_pe; + } progs; + struct { + int parse_pe_fd; + int post_parse_pe_fd; + } links; +}; + +static inline int +kexec_pe_parser_bpf__parse_pe__attach(struct kexec_pe_parser_bpf *skel) +{ + int prog_fd = skel->progs.parse_pe.prog_fd; + int fd = skel_raw_tracepoint_open(NULL, prog_fd); + + if (fd > 0) + skel->links.parse_pe_fd = fd; + return fd; +} + +static inline int +kexec_pe_parser_bpf__post_parse_pe__attach(struct kexec_pe_parser_bpf *skel) +{ + int prog_fd = skel->progs.post_parse_pe.prog_fd; + int fd = skel_raw_tracepoint_open(NULL, prog_fd); + + if (fd > 0) + skel->links.post_parse_pe_fd = fd; + return fd; +} + +static inline int +kexec_pe_parser_bpf__attach(struct kexec_pe_parser_bpf *skel) +{ + int ret = 0; + + ret = ret < 0 ? ret : kexec_pe_parser_bpf__parse_pe__attach(skel); + ret = ret < 0 ? ret : kexec_pe_parser_bpf__post_parse_pe__attach(skel); + return ret < 0 ? ret : 0; +} + +static inline void +kexec_pe_parser_bpf__detach(struct kexec_pe_parser_bpf *skel) +{ + skel_closenz(skel->links.parse_pe_fd); + skel_closenz(skel->links.post_parse_pe_fd); +} +static void +kexec_pe_parser_bpf__destroy(struct kexec_pe_parser_bpf *skel) +{ + if (!skel) + return; + kexec_pe_parser_bpf__detach(skel); + skel_closenz(skel->progs.parse_pe.prog_fd); + skel_closenz(skel->progs.post_parse_pe.prog_fd); + skel_closenz(skel->maps.ringbuf_1.map_fd); + skel_closenz(skel->maps.ringbuf_2.map_fd); + skel_closenz(skel->maps.ringbuf_3.map_fd); + skel_closenz(skel->maps.ringbuf_4.map_fd); + skel_closenz(skel->maps.rodata.map_fd); + skel_closenz(skel->maps.data.map_fd); + skel_closenz(skel->maps.bss.map_fd); + skel_closenz(skel->maps.rodata_str1_1.map_fd); + skel_free(skel); +} +static inline struct kexec_pe_parser_bpf * +kexec_pe_parser_bpf__open(void) +{ + struct kexec_pe_parser_bpf *skel; + + skel = skel_alloc(sizeof(*skel)); + if (!skel) + goto cleanup; + skel->ctx.sz = (void *)&skel->links - (void *)skel; + return skel; +cleanup: + kexec_pe_parser_bpf__destroy(skel); + return NULL; +} + +static inline int +kexec_pe_parser_bpf__load(struct kexec_pe_parser_bpf *skel) +{ + struct bpf_load_and_run_opts opts = {}; + int err; + static const char opts_data[] __attribute__((__aligned__(8))) = "\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ 
+\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x9f\xeb\x01\0\ +\x18\0\0\0\0\0\0\0\x04\x03\0\0\x04\x03\0\0\x95\x02\0\0\0\0\0\0\0\0\0\x02\x03\0\ +\0\0\x01\0\0\0\0\0\0\x01\x04\0\0\0\x20\0\0\x01\0\0\0\0\0\0\0\x03\0\0\0\0\x02\0\ +\0\0\x04\0\0\0\x1b\0\0\0\x05\0\0\0\0\0\0\x01\x04\0\0\0\x20\0\0\0\0\0\0\0\0\0\0\ +\x02\x06\0\0\0\0\0\0\0\0\0\0\x03\0\0\0\0\x02\0\0\0\x04\0\0\0\x04\0\0\0\0\0\0\0\ +\x02\0\0\x04\x10\0\0\0\x19\0\0\0\x01\0\0\0\0\0\0\0\x1e\0\0\0\x05\0\0\0\x40\0\0\ +\0\x2a\0\0\0\0\0\0\x0e\x07\0\0\0\x01\0\0\0\0\0\0\0\x02\0\0\x04\x10\0\0\0\x19\0\ +\0\0\x01\0\0\0\0\0\0\0\x1e\0\0\0\x05\0\0\0\x40\0\0\0\x34\0\0\0\0\0\0\x0e\x09\0\ +\0\0\x01\0\0\0\0\0\0\0\x02\0\0\x04\x10\0\0\0\x19\0\0\0\x01\0\0\0\0\0\0\0\x1e\0\ +\0\0\x05\0\0\0\x40\0\0\0\x3e\0\0\0\0\0\0\x0e\x0b\0\0\0\x01\0\0\0\0\0\0\0\x02\0\ +\0\x04\x10\0\0\0\x19\0\0\0\x01\0\0\0\0\0\0\0\x1e\0\0\0\x05\0\0\0\x40\0\0\0\x48\ +\0\0\0\0\0\0\x0e\x0d\0\0\0\x01\0\0\0\0\0\0\0\0\0\0\x0d\x02\0\0\0\x52\0\0\0\0\0\ +\0\x0c\x0f\0\0\0\0\0\0\0\0\0\0\x02\x12\0\0\0\x2d\x01\0\0\0\0\0\x01\x08\0\0\0\ +\x40\0\0\0\0\0\0\0\x01\0\0\x0d\x02\0\0\0\x40\x01\0\0\x11\0\0\0\x44\x01\0\0\x01\ +\0\0\x0c\x13\0\0\0\0\0\0\0\x01\0\0\x0d\x02\0\0\0\x40\x01\0\0\x11\0\0\0\xb4\x01\ +\0\0\x01\0\0\x0c\x15\0\0\0\x33\x02\0\0\0\0\0\x01\x01\0\0\0\x08\0\0\x01\0\0\0\0\ +\0\0\0\x03\0\0\0\0\x17\0\0\0\x04\0\0\0\x04\0\0\0\x38\x02\0\0\0\0\0\x0e\x18\0\0\ +\0\x01\0\0\0\0\0\0\0\0\0\0\x0a\x17\0\0\0\0\0\0\0\0\0\0\x03\0\0\0\0\x1a\0\0\0\ +\x04\0\0\0\x10\0\0\0\x40\x02\0\0\0\0\0\x0e\x1b\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x03\ +\0\0\0\0\x17\0\0\0\x04\0\0\0\x10\0\0\0\x51\x02\0\0\0\0\0\x0e\x1d\0\0\0\0\0\0\0\ +\x62\x02\0\0\0\0\0\x0e\x1d\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x03\0\0\0\0\x17\0\0\0\ 
+\x04\0\0\0\x0d\0\0\0\x74\x02\0\0\x01\0\0\x0f\x10\0\0\0\x1f\0\0\0\0\0\0\0\x10\0\ +\0\0\x79\x02\0\0\x01\0\0\x0f\x10\0\0\0\x1e\0\0\0\0\0\0\0\x10\0\0\0\x7f\x02\0\0\ +\x04\0\0\x0f\x40\0\0\0\x08\0\0\0\0\0\0\0\x10\0\0\0\x0a\0\0\0\x10\0\0\0\x10\0\0\ +\0\x0c\0\0\0\x20\0\0\0\x10\0\0\0\x0e\0\0\0\x30\0\0\0\x10\0\0\0\x85\x02\0\0\x01\ +\0\0\x0f\x10\0\0\0\x1c\0\0\0\0\0\0\0\x10\0\0\0\x8d\x02\0\0\x01\0\0\x0f\x04\0\0\ +\0\x19\0\0\0\0\0\0\0\x04\0\0\0\0\x69\x6e\x74\0\x5f\x5f\x41\x52\x52\x41\x59\x5f\ +\x53\x49\x5a\x45\x5f\x54\x59\x50\x45\x5f\x5f\0\x74\x79\x70\x65\0\x6d\x61\x78\ +\x5f\x65\x6e\x74\x72\x69\x65\x73\0\x72\x69\x6e\x67\x62\x75\x66\x5f\x31\0\x72\ +\x69\x6e\x67\x62\x75\x66\x5f\x32\0\x72\x69\x6e\x67\x62\x75\x66\x5f\x33\0\x72\ +\x69\x6e\x67\x62\x75\x66\x5f\x34\0\x64\x75\x6d\x6d\x79\0\x2e\x74\x65\x78\x74\0\ +\x2f\x68\x6f\x6d\x65\x2f\x6c\x69\x6e\x75\x78\x2f\x6b\x65\x72\x6e\x65\x6c\x2f\ +\x6b\x65\x78\x65\x63\x5f\x62\x70\x66\x2f\x6b\x65\x78\x65\x63\x5f\x70\x65\x5f\ +\x70\x61\x72\x73\x65\x72\x5f\x62\x70\x66\x2e\x63\0\x5f\x5f\x61\x74\x74\x72\x69\ +\x62\x75\x74\x65\x5f\x5f\x28\x28\x75\x73\x65\x64\x29\x29\x20\x73\x74\x61\x74\ +\x69\x63\x20\x69\x6e\x74\x20\x64\x75\x6d\x6d\x79\x28\x76\x6f\x69\x64\x29\0\x09\ +\x5f\x5f\x62\x75\x69\x6c\x74\x69\x6e\x5f\x6d\x65\x6d\x63\x70\x79\x28\x6c\x6f\ +\x63\x61\x6c\x5f\x6e\x61\x6d\x65\x2c\x20\x4b\x45\x58\x45\x43\x5f\x52\x45\x53\ +\x5f\x49\x4e\x49\x54\x52\x44\x5f\x4e\x41\x4d\x45\x2c\x20\x31\x36\x29\x3b\0\x09\ +\x72\x65\x74\x75\x72\x6e\x20\x5f\x5f\x62\x75\x69\x6c\x74\x69\x6e\x5f\x6d\x65\ +\x6d\x63\x6d\x70\x28\x6c\x6f\x63\x61\x6c\x5f\x6e\x61\x6d\x65\x2c\x20\x72\x65\ +\x73\x5f\x6b\x65\x72\x6e\x65\x6c\x2c\x20\x34\x29\x3b\0\x75\x6e\x73\x69\x67\x6e\ +\x65\x64\x20\x6c\x6f\x6e\x67\x20\x6c\x6f\x6e\x67\0\x63\x74\x78\0\x70\x61\x72\ +\x73\x65\x5f\x70\x65\0\x66\x65\x6e\x74\x72\x79\x2e\x73\x2f\x62\x70\x66\x5f\x68\ +\x61\x6e\x64\x6c\x65\x5f\x70\x65\x66\x69\x6c\x65\0\x5f\x5f\x61\x74\x74\x72\x69\ +\x62\x75\x74\x65\x5f\x5f\x28\x28\x75\x73\x65\x64\x29\x29\x20\x69\x6e\x74\x20\ +\x42\x50\x46\x5f\x50\x52\x4f\x47\x28\x70\x61\x72\x73\x65\x5f\x70\x65\x2c\x20\ +\x73\x74\x72\x75\x63\x74\x20\x6b\x65\x78\x65\x63\x5f\x63\x6f\x6e\x74\x65\x78\ +\x74\x20\x2a\x63\x6f\x6e\x74\x65\x78\x74\x29\0\x70\x6f\x73\x74\x5f\x70\x61\x72\ +\x73\x65\x5f\x70\x65\0\x66\x65\x6e\x74\x72\x79\x2e\x73\x2f\x62\x70\x66\x5f\x70\ +\x6f\x73\x74\x5f\x68\x61\x6e\x64\x6c\x65\x5f\x70\x65\x66\x69\x6c\x65\0\x5f\x5f\ +\x61\x74\x74\x72\x69\x62\x75\x74\x65\x5f\x5f\x28\x28\x75\x73\x65\x64\x29\x29\ +\x20\x69\x6e\x74\x20\x42\x50\x46\x5f\x50\x52\x4f\x47\x28\x70\x6f\x73\x74\x5f\ +\x70\x61\x72\x73\x65\x5f\x70\x65\x2c\x20\x73\x74\x72\x75\x63\x74\x20\x6b\x65\ +\x78\x65\x63\x5f\x63\x6f\x6e\x74\x65\x78\x74\x20\x2a\x63\x6f\x6e\x74\x65\x78\ +\x74\x29\0\x63\x68\x61\x72\0\x4c\x49\x43\x45\x4e\x53\x45\0\x64\x75\x6d\x6d\x79\ +\x2e\x72\x65\x73\x5f\x6b\x65\x72\x6e\x65\x6c\0\x64\x75\x6d\x6d\x79\x2e\x6c\x6f\ +\x63\x61\x6c\x5f\x6e\x61\x6d\x65\0\x64\x75\x6d\x6d\x79\x2e\x72\x65\x73\x5f\x63\ +\x6d\x64\x6c\x69\x6e\x65\0\x2e\x62\x73\x73\0\x2e\x64\x61\x74\x61\0\x2e\x6d\x61\ +\x70\x73\0\x2e\x72\x6f\x64\x61\x74\x61\0\x6c\x69\x63\x65\x6e\x73\x65\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\xb1\x05\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x1b\ +\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x72\x69\x6e\x67\x62\ +\x75\x66\x5f\x31\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\x1b\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x72\x69\ +\x6e\x67\x62\x75\x66\x5f\x32\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ 
+\0\0\0\0\0\0\0\0\0\0\x1b\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\x72\x69\x6e\x67\x62\x75\x66\x5f\x33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x1b\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\x72\x69\x6e\x67\x62\x75\x66\x5f\x34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x02\0\0\0\x04\0\0\0\x10\0\0\0\x01\0\0\ +\0\x80\0\0\0\0\0\0\0\0\0\0\0\x6b\x65\x78\x65\x63\x5f\x70\x65\x2e\x72\x6f\x64\ +\x61\x74\x61\0\0\0\0\0\0\0\0\0\0\0\0\0\x24\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x6b\ +\x65\x78\x65\x63\x3a\x6b\x65\x72\x6e\x65\x6c\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x02\0\0\ +\0\x04\0\0\0\x10\0\0\0\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x6b\x65\x78\x65\x63\ +\x5f\x70\x65\x2e\x64\x61\x74\x61\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x22\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\x6b\x65\x78\x65\x63\x3a\x63\x6d\x64\x6c\x69\x6e\x65\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\x02\0\0\0\x04\0\0\0\x10\0\0\0\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x6b\x65\x78\ +\x65\x63\x5f\x70\x65\x2e\x62\x73\x73\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x21\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x02\0\0\0\x04\0\0\0\ +\x0d\0\0\0\x01\0\0\0\x80\0\0\0\0\0\0\0\0\0\0\0\x2e\x72\x6f\x64\x61\x74\x61\x2e\ +\x73\x74\x72\x31\x2e\x31\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\x6b\x65\x78\x65\x63\x3a\x69\x6e\x69\x74\x72\x64\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\x47\x50\x4c\0\0\0\0\0\xb4\0\0\0\0\0\0\0\x95\0\0\0\0\0\0\0\0\0\0\0\x14\0\0\0\ +\0\0\0\0\x5e\0\0\0\x68\x01\0\0\x1b\xe8\0\0\x1a\0\0\0\x02\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\x70\x61\x72\ +\x73\x65\x5f\x70\x65\0\0\0\0\0\0\0\0\0\0\0\0\x18\0\0\0\0\0\0\0\x08\0\0\0\0\0\0\ +\0\0\0\0\0\x01\0\0\0\x10\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\x01\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\x62\x70\x66\x5f\x68\x61\ +\x6e\x64\x6c\x65\x5f\x70\x65\x66\x69\x6c\x65\0\0\0\0\0\0\0\x47\x50\x4c\0\0\0\0\ +\0\xb4\0\0\0\0\0\0\0\x95\0\0\0\0\0\0\0\0\0\0\0\x16\0\0\0\0\0\0\0\x5e\0\0\0\xe2\ +\x01\0\0\x1b\0\x01\0\x1a\0\0\0\x02\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\x70\x6f\x73\x74\x5f\x70\x61\x72\ +\x73\x65\x5f\x70\x65\0\0\0\0\0\0\0\x18\0\0\0\0\0\0\0\x08\0\0\0\0\0\0\0\0\0\0\0\ +\x01\0\0\0\x10\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\x62\x70\x66\x5f\x70\x6f\x73\x74\ +\x5f\x68\x61\x6e\x64\x6c\x65\x5f\x70\x65\x66\x69\x6c\x65\0\0"; + static const char opts_insn[] __attribute__((__aligned__(8))) = "\ +\xbf\x16\0\0\0\0\0\0\xbf\xa1\0\0\0\0\0\0\x07\x01\0\0\x78\xff\xff\xff\xb7\x02\0\ +\0\x88\0\0\0\xb7\x03\0\0\0\0\0\0\x85\0\0\0\x71\0\0\0\x05\0\x41\0\0\0\0\0\x61\ +\xa1\x78\xff\0\0\0\0\xd5\x01\x01\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x61\xa1\x7c\xff\ +\0\0\0\0\xd5\x01\x01\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x61\xa1\x80\xff\0\0\0\0\xd5\ +\x01\x01\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x61\xa1\x84\xff\0\0\0\0\xd5\x01\x01\0\0\ +\0\0\0\x85\0\0\0\xa8\0\0\0\x61\xa1\x88\xff\0\0\0\0\xd5\x01\x01\0\0\0\0\0\x85\0\ +\0\0\xa8\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x61\x01\0\0\0\0\0\0\xd5\x01\ +\x02\0\0\0\0\0\xbf\x19\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x18\x60\0\0\0\0\0\0\0\0\ 
+\0\0\x04\0\0\0\x61\x01\0\0\0\0\0\0\xd5\x01\x02\0\0\0\0\0\xbf\x19\0\0\0\0\0\0\ +\x85\0\0\0\xa8\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x08\0\0\0\x61\x01\0\0\0\0\0\0\ +\xd5\x01\x02\0\0\0\0\0\xbf\x19\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x18\x60\0\0\0\0\ +\0\0\0\0\0\0\x0c\0\0\0\x61\x01\0\0\0\0\0\0\xd5\x01\x02\0\0\0\0\0\xbf\x19\0\0\0\ +\0\0\0\x85\0\0\0\xa8\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\x61\x01\0\0\0\ +\0\0\0\xd5\x01\x02\0\0\0\0\0\xbf\x19\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x18\x60\0\ +\0\0\0\0\0\0\0\0\0\x14\0\0\0\x61\x01\0\0\0\0\0\0\xd5\x01\x02\0\0\0\0\0\xbf\x19\ +\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x18\0\0\0\x61\x01\ +\0\0\0\0\0\0\xd5\x01\x02\0\0\0\0\0\xbf\x19\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x18\ +\x60\0\0\0\0\0\0\0\0\0\0\x1c\0\0\0\x61\x01\0\0\0\0\0\0\xd5\x01\x02\0\0\0\0\0\ +\xbf\x19\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\xbf\x70\0\0\0\0\0\0\x95\0\0\0\0\0\0\0\ +\x61\x60\x08\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xd0\x0a\0\0\x63\x01\0\0\0\0\ +\0\0\x61\x60\x0c\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xcc\x0a\0\0\x63\x01\0\0\ +\0\0\0\0\x79\x60\x10\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xc0\x0a\0\0\x7b\x01\ +\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\0\x05\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\ +\xb8\x0a\0\0\x7b\x01\0\0\0\0\0\0\xb7\x01\0\0\x12\0\0\0\x18\x62\0\0\0\0\0\0\0\0\ +\0\0\xb8\x0a\0\0\xb7\x03\0\0\x1c\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\ +\xc5\x07\xa7\xff\0\0\0\0\x63\x7a\x78\xff\0\0\0\0\x61\x60\x1c\0\0\0\0\0\x15\0\ +\x03\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xe4\x0a\0\0\x63\x01\0\0\0\0\0\0\xb7\ +\x01\0\0\0\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\xd8\x0a\0\0\xb7\x03\0\0\x48\0\0\0\ +\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x9a\xff\0\0\0\0\x18\x61\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\x63\x71\0\0\0\0\0\0\x61\x60\x2c\0\0\0\0\0\x15\0\x03\0\0\ +\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x2c\x0b\0\0\x63\x01\0\0\0\0\0\0\xb7\x01\0\0\ +\0\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x20\x0b\0\0\xb7\x03\0\0\x48\0\0\0\x85\0\0\ +\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x8b\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\ +\0\0\0\x04\0\0\0\x63\x71\0\0\0\0\0\0\x61\x60\x3c\0\0\0\0\0\x15\0\x03\0\0\0\0\0\ +\x18\x61\0\0\0\0\0\0\0\0\0\0\x74\x0b\0\0\x63\x01\0\0\0\0\0\0\xb7\x01\0\0\0\0\0\ +\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x68\x0b\0\0\xb7\x03\0\0\x48\0\0\0\x85\0\0\0\xa6\ +\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x7c\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\ +\x08\0\0\0\x63\x71\0\0\0\0\0\0\x61\x60\x4c\0\0\0\0\0\x15\0\x03\0\0\0\0\0\x18\ +\x61\0\0\0\0\0\0\0\0\0\0\xbc\x0b\0\0\x63\x01\0\0\0\0\0\0\xb7\x01\0\0\0\0\0\0\ +\x18\x62\0\0\0\0\0\0\0\0\0\0\xb0\x0b\0\0\xb7\x03\0\0\x48\0\0\0\x85\0\0\0\xa6\0\ +\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x6d\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\ +\x0c\0\0\0\x63\x71\0\0\0\0\0\0\x61\xa0\x78\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\ +\0\0\x28\x0c\0\0\x63\x01\0\0\0\0\0\0\x61\x60\x5c\0\0\0\0\0\x15\0\x03\0\0\0\0\0\ +\x18\x61\0\0\0\0\0\0\0\0\0\0\x04\x0c\0\0\x63\x01\0\0\0\0\0\0\xb7\x01\0\0\0\0\0\ +\0\x18\x62\0\0\0\0\0\0\0\0\0\0\xf8\x0b\0\0\xb7\x03\0\0\x48\0\0\0\x85\0\0\0\xa6\ +\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x5a\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\ +\x10\0\0\0\x63\x71\0\0\0\0\0\0\x79\x63\x60\0\0\0\0\0\x15\x03\x08\0\0\0\0\0\x18\ +\x61\0\0\0\0\0\0\0\0\0\0\x40\x0c\0\0\xb7\x02\0\0\x10\0\0\0\x61\x60\x04\0\0\0\0\ +\0\x45\0\x02\0\x01\0\0\0\x85\0\0\0\x94\0\0\0\x05\0\x01\0\0\0\0\0\x85\0\0\0\x71\ +\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\x61\x20\0\0\0\0\0\0\x18\x61\0\0\0\ +\0\0\0\0\0\0\0\x58\x0c\0\0\x63\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x50\ +\x0c\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x60\x0c\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\ 
+\0\0\0\0\0\0\0\0\0\x40\x0c\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x68\x0c\0\0\x7b\x01\ +\0\0\0\0\0\0\xb7\x01\0\0\x02\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x58\x0c\0\0\xb7\ +\x03\0\0\x20\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x36\xff\0\0\ +\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\x61\x20\0\0\0\0\0\0\x18\x61\0\0\0\0\ +\0\0\0\0\0\0\x78\x0c\0\0\x63\x01\0\0\0\0\0\0\xb7\x01\0\0\x16\0\0\0\x18\x62\0\0\ +\0\0\0\0\0\0\0\0\x78\x0c\0\0\xb7\x03\0\0\x04\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\ +\0\0\0\0\0\0\xc5\x07\x29\xff\0\0\0\0\x61\xa0\x78\xff\0\0\0\0\x18\x61\0\0\0\0\0\ +\0\0\0\0\0\xb0\x0c\0\0\x63\x01\0\0\0\0\0\0\x61\x60\x6c\0\0\0\0\0\x15\0\x03\0\0\ +\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x8c\x0c\0\0\x63\x01\0\0\0\0\0\0\xb7\x01\0\0\ +\0\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x80\x0c\0\0\xb7\x03\0\0\x48\0\0\0\x85\0\0\ +\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x19\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\ +\0\0\0\x14\0\0\0\x63\x71\0\0\0\0\0\0\x79\x63\x70\0\0\0\0\0\x15\x03\x08\0\0\0\0\ +\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xc8\x0c\0\0\xb7\x02\0\0\x10\0\0\0\x61\x60\x04\0\ +\0\0\0\0\x45\0\x02\0\x01\0\0\0\x85\0\0\0\x94\0\0\0\x05\0\x01\0\0\0\0\0\x85\0\0\ +\0\x71\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x14\0\0\0\x61\x20\0\0\0\0\0\0\x18\x61\ +\0\0\0\0\0\0\0\0\0\0\xe0\x0c\0\0\x63\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\ +\0\xd8\x0c\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xe8\x0c\0\0\x7b\x01\0\0\0\0\0\0\x18\ +\x60\0\0\0\0\0\0\0\0\0\0\xc8\x0c\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xf0\x0c\0\0\ +\x7b\x01\0\0\0\0\0\0\xb7\x01\0\0\x02\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\xe0\x0c\ +\0\0\xb7\x03\0\0\x20\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\xf5\ +\xfe\0\0\0\0\x61\xa0\x78\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x30\x0d\0\0\ +\x63\x01\0\0\0\0\0\0\x61\x60\x7c\0\0\0\0\0\x15\0\x03\0\0\0\0\0\x18\x61\0\0\0\0\ +\0\0\0\0\0\0\x0c\x0d\0\0\x63\x01\0\0\0\0\0\0\xb7\x01\0\0\0\0\0\0\x18\x62\0\0\0\ +\0\0\0\0\0\0\0\0\x0d\0\0\xb7\x03\0\0\x48\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\ +\0\0\0\0\xc5\x07\xe5\xfe\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x18\0\0\0\x63\x71\ +\0\0\0\0\0\0\x79\x63\x80\0\0\0\0\0\x15\x03\x08\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\ +\0\0\0\x48\x0d\0\0\xb7\x02\0\0\x10\0\0\0\x61\x60\x04\0\0\0\0\0\x45\0\x02\0\x01\ +\0\0\0\x85\0\0\0\x94\0\0\0\x05\0\x01\0\0\0\0\0\x85\0\0\0\x71\0\0\0\x18\x62\0\0\ +\0\0\0\0\0\0\0\0\x18\0\0\0\x61\x20\0\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x60\ +\x0d\0\0\x63\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x58\x0d\0\0\x18\x61\0\ +\0\0\0\0\0\0\0\0\0\x68\x0d\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\ +\x48\x0d\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x70\x0d\0\0\x7b\x01\0\0\0\0\0\0\xb7\ +\x01\0\0\x02\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x60\x0d\0\0\xb7\x03\0\0\x20\0\0\ +\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\xc1\xfe\0\0\0\0\x61\x60\x8c\ +\0\0\0\0\0\x15\0\x03\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x8c\x0d\0\0\x63\x01\ +\0\0\0\0\0\0\xb7\x01\0\0\0\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x80\x0d\0\0\xb7\ +\x03\0\0\x48\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\xb5\xfe\0\0\ +\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x1c\0\0\0\x63\x71\0\0\0\0\0\0\x79\x63\x90\0\0\ +\0\0\0\x15\x03\x08\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xc8\x0d\0\0\xb7\x02\0\ +\0\x0d\0\0\0\x61\x60\x04\0\0\0\0\0\x45\0\x02\0\x01\0\0\0\x85\0\0\0\x94\0\0\0\ +\x05\0\x01\0\0\0\0\0\x85\0\0\0\x71\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x1c\0\0\0\ +\x61\x20\0\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xe0\x0d\0\0\x63\x01\0\0\0\0\0\ +\0\x18\x60\0\0\0\0\0\0\0\0\0\0\xd8\x0d\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xe8\x0d\ +\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\xc8\x0d\0\0\x18\x61\0\0\0\ 
+\0\0\0\0\0\0\0\xf0\x0d\0\0\x7b\x01\0\0\0\0\0\0\xb7\x01\0\0\x02\0\0\0\x18\x62\0\ +\0\0\0\0\0\0\0\0\0\xe0\x0d\0\0\xb7\x03\0\0\x20\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\ +\x07\0\0\0\0\0\0\xc5\x07\x91\xfe\0\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x1c\0\0\0\ +\x61\x20\0\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\0\x0e\0\0\x63\x01\0\0\0\0\0\0\ +\xb7\x01\0\0\x16\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\0\x0e\0\0\xb7\x03\0\0\x04\0\ +\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x84\xfe\0\0\0\0\x18\x60\0\ +\0\0\0\0\0\0\0\0\0\x08\x0e\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x48\x0e\0\0\x7b\x01\ +\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x10\x0e\0\0\x18\x61\0\0\0\0\0\0\0\0\0\ +\0\x40\x0e\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x20\x0e\0\0\x18\ +\x61\0\0\0\0\0\0\0\0\0\0\x88\x0e\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\ +\0\0\0\x28\x0e\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x98\x0e\0\0\x7b\x01\0\0\0\0\0\0\ +\x18\x60\0\0\0\0\0\0\0\0\0\0\x38\x0e\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xb8\x0e\0\ +\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x18\x61\0\0\0\0\0\0\ +\0\0\0\0\xb0\x0e\0\0\x7b\x01\0\0\0\0\0\0\x61\x60\x08\0\0\0\0\0\x18\x61\0\0\0\0\ +\0\0\0\0\0\0\x50\x0e\0\0\x63\x01\0\0\0\0\0\0\x61\x60\x0c\0\0\0\0\0\x18\x61\0\0\ +\0\0\0\0\0\0\0\0\x54\x0e\0\0\x63\x01\0\0\0\0\0\0\x79\x60\x10\0\0\0\0\0\x18\x61\ +\0\0\0\0\0\0\0\0\0\0\x58\x0e\0\0\x7b\x01\0\0\0\0\0\0\x61\xa0\x78\xff\0\0\0\0\ +\x18\x61\0\0\0\0\0\0\0\0\0\0\x80\x0e\0\0\x63\x01\0\0\0\0\0\0\x18\x61\0\0\0\0\0\ +\0\0\0\0\0\xc8\x0e\0\0\xb7\x02\0\0\x12\0\0\0\xb7\x03\0\0\x0c\0\0\0\xb7\x04\0\0\ +\0\0\0\0\x85\0\0\0\xa7\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x4e\xfe\0\0\0\0\x18\ +\x60\0\0\0\0\0\0\0\0\0\0\x38\x0e\0\0\x63\x70\x6c\0\0\0\0\0\x77\x07\0\0\x20\0\0\ +\0\x63\x70\x70\0\0\0\0\0\xb7\x01\0\0\x05\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x38\ +\x0e\0\0\xb7\x03\0\0\x8c\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\x18\x60\ +\0\0\0\0\0\0\0\0\0\0\xa8\x0e\0\0\x61\x01\0\0\0\0\0\0\xd5\x01\x02\0\0\0\0\0\xbf\ +\x19\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\xc5\x07\x3c\xfe\0\0\0\0\x63\x7a\x80\xff\0\ +\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\xe0\x0e\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x20\ +\x0f\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\xe8\x0e\0\0\x18\x61\0\ +\0\0\0\0\0\0\0\0\0\x18\x0f\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\ +\xf8\x0e\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x60\x0f\0\0\x7b\x01\0\0\0\0\0\0\x18\ +\x60\0\0\0\0\0\0\0\0\0\0\0\x0f\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x70\x0f\0\0\x7b\ +\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x10\x0f\0\0\x18\x61\0\0\0\0\0\0\0\ +\0\0\0\x90\x0f\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x18\ +\x61\0\0\0\0\0\0\0\0\0\0\x88\x0f\0\0\x7b\x01\0\0\0\0\0\0\x61\x60\x08\0\0\0\0\0\ +\x18\x61\0\0\0\0\0\0\0\0\0\0\x28\x0f\0\0\x63\x01\0\0\0\0\0\0\x61\x60\x0c\0\0\0\ +\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x2c\x0f\0\0\x63\x01\0\0\0\0\0\0\x79\x60\x10\0\ +\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x30\x0f\0\0\x7b\x01\0\0\0\0\0\0\x61\xa0\ +\x78\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x58\x0f\0\0\x63\x01\0\0\0\0\0\0\ +\x18\x61\0\0\0\0\0\0\0\0\0\0\xa0\x0f\0\0\xb7\x02\0\0\x17\0\0\0\xb7\x03\0\0\x0c\ +\0\0\0\xb7\x04\0\0\0\0\0\0\x85\0\0\0\xa7\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x05\ +\xfe\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x10\x0f\0\0\x63\x70\x6c\0\0\0\0\0\x77\ +\x07\0\0\x20\0\0\0\x63\x70\x70\0\0\0\0\0\xb7\x01\0\0\x05\0\0\0\x18\x62\0\0\0\0\ +\0\0\0\0\0\0\x10\x0f\0\0\xb7\x03\0\0\x8c\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\ +\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x80\x0f\0\0\x61\x01\0\0\0\0\0\0\xd5\x01\ +\x02\0\0\0\0\0\xbf\x19\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\xc5\x07\xf3\xfd\0\0\0\0\ 
+\x63\x7a\x84\xff\0\0\0\0\x61\xa1\x78\xff\0\0\0\0\xd5\x01\x02\0\0\0\0\0\xbf\x19\ +\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x61\xa0\x80\xff\0\0\0\0\x63\x06\x98\0\0\0\0\0\ +\x61\xa0\x84\xff\0\0\0\0\x63\x06\x9c\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x18\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x04\ +\0\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x28\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\ +\x08\0\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x38\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\ +\0\x0c\0\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x48\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\ +\0\0\x10\0\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x58\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\ +\0\0\0\x14\0\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x68\0\0\0\0\0\x18\x61\0\0\0\0\0\0\ +\0\0\0\0\x18\0\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x78\0\0\0\0\0\x18\x61\0\0\0\0\0\ +\0\0\0\0\0\x1c\0\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x88\0\0\0\0\0\xb7\0\0\0\0\0\0\ +\0\x95\0\0\0\0\0\0\0"; + + opts.ctx = (struct bpf_loader_ctx *)skel; + opts.data_sz = sizeof(opts_data) - 1; + opts.data = (void *)opts_data; + opts.insns_sz = sizeof(opts_insn) - 1; + opts.insns = (void *)opts_insn; + + err = bpf_load_and_run(&opts); + if (err < 0) + return err; + return 0; +} + +static inline struct kexec_pe_parser_bpf * +kexec_pe_parser_bpf__open_and_load(void) +{ + struct kexec_pe_parser_bpf *skel; + + skel = kexec_pe_parser_bpf__open(); + if (!skel) + return NULL; + if (kexec_pe_parser_bpf__load(skel)) { + kexec_pe_parser_bpf__destroy(skel); + return NULL; + } + return skel; +} + +__attribute__((unused)) static void +kexec_pe_parser_bpf__assert(struct kexec_pe_parser_bpf *s __attribute__((unused))) +{ +#ifdef __cplusplus +#define _Static_assert static_assert +#endif +#ifdef __cplusplus +#undef _Static_assert +#endif +} + +#endif /* __KEXEC_PE_PARSER_BPF_SKEL_H__ */ -- 2.49.0 From piliu at redhat.com Mon Jul 21 19:03:15 2025 From: piliu at redhat.com (Pingfan Liu) Date: Tue, 22 Jul 2025 10:03:15 +0800 Subject: [PATCHv4 08/12] kexec: Factor out routine to find a symbol in ELF In-Reply-To: <20250722020319.5837-1-piliu@redhat.com> References: <20250722020319.5837-1-piliu@redhat.com> Message-ID: <20250722020319.5837-9-piliu@redhat.com> The routine to search a symbol in ELF can be shared, so split it out. 
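For illustration only (not part of this patch), a caller of the shared helper would
look roughly like the sketch below; the symbol name is made up and error handling
is trimmed:

	const Elf_Shdr *sechdrs;
	const Elf_Sym *sym;
	unsigned long offset;

	sym = elf_find_symbol(ehdr, "some_symbol");	/* hypothetical name */
	if (!sym)
		return -ENOENT;
	/*
	 * elf_find_symbol() already rejected SHN_UNDEF and out-of-range
	 * st_shndx, so the section lookup below is safe.
	 */
	sechdrs = (void *)ehdr + ehdr->e_shoff;
	offset = sechdrs[sym->st_shndx].sh_offset + sym->st_value;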
Signed-off-by: Pingfan Liu Cc: Baoquan He Cc: Dave Young Cc: Andrew Morton Cc: Philipp Rudo To: kexec at lists.infradead.org --- include/linux/kexec.h | 4 ++ kernel/kexec_file.c | 86 +++++++++++++++++++++++-------------------- 2 files changed, 50 insertions(+), 40 deletions(-) diff --git a/include/linux/kexec.h b/include/linux/kexec.h index da527f323a930..6ab4d66a6a3fe 100644 --- a/include/linux/kexec.h +++ b/include/linux/kexec.h @@ -540,6 +540,10 @@ void set_kexec_sig_enforced(void); static inline void set_kexec_sig_enforced(void) {} #endif +#if defined(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) || defined(CONFIG_KEXEC_PE_IMAGE) +const Elf_Sym *elf_find_symbol(const Elf_Ehdr *ehdr, const char *name); +#endif + #endif /* !defined(__ASSEBMLY__) */ #endif /* LINUX_KEXEC_H */ diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c index c92afe1a3aa5e..ce2ddb8d4bd80 100644 --- a/kernel/kexec_file.c +++ b/kernel/kexec_file.c @@ -831,6 +831,51 @@ static int kexec_calculate_store_digests(struct kimage *image) return ret; } +#if defined(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) || defined(CONFIG_KEXEC_PE_IMAGE) +const Elf_Sym *elf_find_symbol(const Elf_Ehdr *ehdr, const char *name) +{ + const Elf_Shdr *sechdrs; + const Elf_Sym *syms; + const char *strtab; + int i, k; + + sechdrs = (void *)ehdr + ehdr->e_shoff; + + for (i = 0; i < ehdr->e_shnum; i++) { + if (sechdrs[i].sh_type != SHT_SYMTAB) + continue; + + if (sechdrs[i].sh_link >= ehdr->e_shnum) + /* Invalid strtab section number */ + continue; + strtab = (void *)ehdr + sechdrs[sechdrs[i].sh_link].sh_offset; + syms = (void *)ehdr + sechdrs[i].sh_offset; + + /* Go through symbols for a match */ + for (k = 0; k < sechdrs[i].sh_size/sizeof(Elf_Sym); k++) { + if (ELF_ST_BIND(syms[k].st_info) != STB_GLOBAL) + continue; + + if (strcmp(strtab + syms[k].st_name, name) != 0) + continue; + + if (syms[k].st_shndx == SHN_UNDEF || + syms[k].st_shndx >= ehdr->e_shnum) { + pr_debug("Symbol: %s has bad section index %d.\n", + name, syms[k].st_shndx); + return NULL; + } + + /* Found the symbol we are looking for */ + return &syms[k]; + } + } + + return NULL; +} + +#endif + #ifdef CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY /* * kexec_purgatory_setup_kbuf - prepare buffer to load purgatory. 
@@ -1088,49 +1133,10 @@ int kexec_load_purgatory(struct kimage *image, struct kexec_buf *kbuf) static const Elf_Sym *kexec_purgatory_find_symbol(struct purgatory_info *pi, const char *name) { - const Elf_Shdr *sechdrs; - const Elf_Ehdr *ehdr; - const Elf_Sym *syms; - const char *strtab; - int i, k; - if (!pi->ehdr) return NULL; - ehdr = pi->ehdr; - sechdrs = (void *)ehdr + ehdr->e_shoff; - - for (i = 0; i < ehdr->e_shnum; i++) { - if (sechdrs[i].sh_type != SHT_SYMTAB) - continue; - - if (sechdrs[i].sh_link >= ehdr->e_shnum) - /* Invalid strtab section number */ - continue; - strtab = (void *)ehdr + sechdrs[sechdrs[i].sh_link].sh_offset; - syms = (void *)ehdr + sechdrs[i].sh_offset; - - /* Go through symbols for a match */ - for (k = 0; k < sechdrs[i].sh_size/sizeof(Elf_Sym); k++) { - if (ELF_ST_BIND(syms[k].st_info) != STB_GLOBAL) - continue; - - if (strcmp(strtab + syms[k].st_name, name) != 0) - continue; - - if (syms[k].st_shndx == SHN_UNDEF || - syms[k].st_shndx >= ehdr->e_shnum) { - pr_debug("Symbol: %s has bad section index %d.\n", - name, syms[k].st_shndx); - return NULL; - } - - /* Found the symbol we are looking for */ - return &syms[k]; - } - } - - return NULL; + return elf_find_symbol(pi->ehdr, name); } void *kexec_purgatory_get_symbol_addr(struct kimage *image, const char *name) -- 2.49.0 From piliu at redhat.com Mon Jul 21 19:03:16 2025 From: piliu at redhat.com (Pingfan Liu) Date: Tue, 22 Jul 2025 10:03:16 +0800 Subject: [PATCHv4 09/12] kexec: Integrate bpf light skeleton to load zboot image In-Reply-To: <20250722020319.5837-1-piliu@redhat.com> References: <20250722020319.5837-1-piliu@redhat.com> Message-ID: <20250722020319.5837-10-piliu@redhat.com> All kexec PE bpf prog should align with the interface exposed by the light skeleton four maps: struct bpf_map_desc ringbuf_1; struct bpf_map_desc ringbuf_2; struct bpf_map_desc ringbuf_3; struct bpf_map_desc ringbuf_4; four sections: struct bpf_map_desc rodata; struct bpf_map_desc data; struct bpf_map_desc bss; struct bpf_map_desc rodata_str1_1; two progs: SEC("fentry.s/bpf_handle_pefile") SEC("fentry.s/bpf_post_handle_pefile") With the above presumption, the integration consists of two parts: -1. Call API exposed by light skeleton from kexec -2. The opts_insn[] and opts_data[] are bpf-prog dependent and can be extracted and passed in from the user space. In the kexec_file_load design, a PE file has a .bpf section, which data content is a ELF, and the ELF contains opts_insn[] opts_data[]. As a bonus, BPF bytecode can be placed under the protection of the entire PE signature. 
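To make the long diff below easier to follow, here is the resulting control flow
in outline (condensed from the hunks in this patch; error handling and the
trailing-NUL size adjustment are omitted):

	/* kexec_pe_image.c: locate the blobs that bpftool used to embed */
	opts_data = get_symbol_from_elf(bpf_elf, sz, "opts_data", &opts_data_sz);
	opts_insn = get_symbol_from_elf(bpf_elf, sz, "opts_insn", &opts_insn_sz);

	/*
	 * sed-rewritten lskel.h: consume those globals instead of the
	 * removed static byte arrays
	 */
	opts.data = (void *)opts_data;
	opts.data_sz = opts_data_sz;
	opts.insns = (void *)opts_insn;
	opts.insns_sz = opts_insn_sz;
	err = bpf_load_and_run(&opts);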
(Note, since opts_insn[] contains the information of the ringbuf size, the bpf-prog writer can change its proper size according to the kernel image size without modifying the kernel code) Signed-off-by: Pingfan Liu Cc: Alexei Starovoitov Cc: Baoquan He Cc: Dave Young Cc: Andrew Morton Cc: Philipp Rudo Cc: bpf at vger.kernel.org To: kexec at lists.infradead.org --- kernel/Makefile | 1 + kernel/kexec_bpf/Makefile | 8 + kernel/kexec_bpf/kexec_pe_parser_bpf.lskel.h | 292 +------------------ kernel/kexec_pe_image.c | 48 +++ 4 files changed, 61 insertions(+), 288 deletions(-) diff --git a/kernel/Makefile b/kernel/Makefile index def3f50a0b2ef..507e98576ff84 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -141,6 +141,7 @@ obj-$(CONFIG_RESOURCE_KUNIT_TEST) += resource_kunit.o obj-$(CONFIG_SYSCTL_KUNIT_TEST) += sysctl-test.o CFLAGS_stackleak.o += $(DISABLE_STACKLEAK_PLUGIN) +CFLAGS_kexec_pe_image.o += -I$(srctree)/tools/lib obj-$(CONFIG_GCC_PLUGIN_STACKLEAK) += stackleak.o KASAN_SANITIZE_stackleak.o := n KCSAN_SANITIZE_stackleak.o := n diff --git a/kernel/kexec_bpf/Makefile b/kernel/kexec_bpf/Makefile index 0c9db6d94a604..20448bae233a0 100644 --- a/kernel/kexec_bpf/Makefile +++ b/kernel/kexec_bpf/Makefile @@ -39,6 +39,14 @@ clean: kexec_pe_parser_bpf.lskel.h: $(OUTPUT)/kexec_pe_parser_bpf.o | $(BPFTOOL) $(call msg,GEN-SKEL,$@) $(Q)$(BPFTOOL) gen skeleton -L $< > $@ + @# The following sed commands make opts_data[] and opts_insn[] visible in a file instead of only in a function. + @# And it removes the bytecode + $(Q) sed -i '/static const char opts_data\[\].*=/,/";$$/d' $@ + $(Q) sed -i '/static const char opts_insn\[\].*=/,/";$$/d' $@ + $(Q) sed -i \ + -e 's/opts\.data_sz = sizeof(opts_data) - 1;/opts.data_sz = opts_data_sz;/' \ + -e 's/opts\.insns_sz = sizeof(opts_insn) - 1;/opts.insns_sz = opts_insn_sz;/' $@ + $(Q) sed -i '7i static char *opts_data, *opts_insn;\nstatic unsigned int opts_data_sz, opts_insn_sz;' $@ $(OUTPUT)/vmlinux.h: $(VMLINUX) $(BPFOBJ) | $(OUTPUT) @$(BPFTOOL) btf dump file $(VMLINUX) format c > $(OUTPUT)/vmlinux.h diff --git a/kernel/kexec_bpf/kexec_pe_parser_bpf.lskel.h b/kernel/kexec_bpf/kexec_pe_parser_bpf.lskel.h index 34c7aabde66f0..88a3aa90d5e04 100644 --- a/kernel/kexec_bpf/kexec_pe_parser_bpf.lskel.h +++ b/kernel/kexec_bpf/kexec_pe_parser_bpf.lskel.h @@ -4,6 +4,8 @@ #define __KEXEC_PE_PARSER_BPF_SKEL_H__ #include +static char *opts_data, *opts_insn; +static unsigned int opts_data_sz, opts_insn_sz; struct kexec_pe_parser_bpf { struct bpf_loader_ctx ctx; @@ -103,297 +105,11 @@ kexec_pe_parser_bpf__load(struct kexec_pe_parser_bpf *skel) { struct bpf_load_and_run_opts opts = {}; int err; - static const char opts_data[] __attribute__((__aligned__(8))) = "\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ 
-\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x9f\xeb\x01\0\ -\x18\0\0\0\0\0\0\0\x04\x03\0\0\x04\x03\0\0\x95\x02\0\0\0\0\0\0\0\0\0\x02\x03\0\ -\0\0\x01\0\0\0\0\0\0\x01\x04\0\0\0\x20\0\0\x01\0\0\0\0\0\0\0\x03\0\0\0\0\x02\0\ -\0\0\x04\0\0\0\x1b\0\0\0\x05\0\0\0\0\0\0\x01\x04\0\0\0\x20\0\0\0\0\0\0\0\0\0\0\ -\x02\x06\0\0\0\0\0\0\0\0\0\0\x03\0\0\0\0\x02\0\0\0\x04\0\0\0\x04\0\0\0\0\0\0\0\ -\x02\0\0\x04\x10\0\0\0\x19\0\0\0\x01\0\0\0\0\0\0\0\x1e\0\0\0\x05\0\0\0\x40\0\0\ -\0\x2a\0\0\0\0\0\0\x0e\x07\0\0\0\x01\0\0\0\0\0\0\0\x02\0\0\x04\x10\0\0\0\x19\0\ -\0\0\x01\0\0\0\0\0\0\0\x1e\0\0\0\x05\0\0\0\x40\0\0\0\x34\0\0\0\0\0\0\x0e\x09\0\ -\0\0\x01\0\0\0\0\0\0\0\x02\0\0\x04\x10\0\0\0\x19\0\0\0\x01\0\0\0\0\0\0\0\x1e\0\ -\0\0\x05\0\0\0\x40\0\0\0\x3e\0\0\0\0\0\0\x0e\x0b\0\0\0\x01\0\0\0\0\0\0\0\x02\0\ -\0\x04\x10\0\0\0\x19\0\0\0\x01\0\0\0\0\0\0\0\x1e\0\0\0\x05\0\0\0\x40\0\0\0\x48\ -\0\0\0\0\0\0\x0e\x0d\0\0\0\x01\0\0\0\0\0\0\0\0\0\0\x0d\x02\0\0\0\x52\0\0\0\0\0\ -\0\x0c\x0f\0\0\0\0\0\0\0\0\0\0\x02\x12\0\0\0\x2d\x01\0\0\0\0\0\x01\x08\0\0\0\ -\x40\0\0\0\0\0\0\0\x01\0\0\x0d\x02\0\0\0\x40\x01\0\0\x11\0\0\0\x44\x01\0\0\x01\ -\0\0\x0c\x13\0\0\0\0\0\0\0\x01\0\0\x0d\x02\0\0\0\x40\x01\0\0\x11\0\0\0\xb4\x01\ -\0\0\x01\0\0\x0c\x15\0\0\0\x33\x02\0\0\0\0\0\x01\x01\0\0\0\x08\0\0\x01\0\0\0\0\ -\0\0\0\x03\0\0\0\0\x17\0\0\0\x04\0\0\0\x04\0\0\0\x38\x02\0\0\0\0\0\x0e\x18\0\0\ -\0\x01\0\0\0\0\0\0\0\0\0\0\x0a\x17\0\0\0\0\0\0\0\0\0\0\x03\0\0\0\0\x1a\0\0\0\ -\x04\0\0\0\x10\0\0\0\x40\x02\0\0\0\0\0\x0e\x1b\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x03\ -\0\0\0\0\x17\0\0\0\x04\0\0\0\x10\0\0\0\x51\x02\0\0\0\0\0\x0e\x1d\0\0\0\0\0\0\0\ -\x62\x02\0\0\0\0\0\x0e\x1d\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x03\0\0\0\0\x17\0\0\0\ 
-\x04\0\0\0\x0d\0\0\0\x74\x02\0\0\x01\0\0\x0f\x10\0\0\0\x1f\0\0\0\0\0\0\0\x10\0\ -\0\0\x79\x02\0\0\x01\0\0\x0f\x10\0\0\0\x1e\0\0\0\0\0\0\0\x10\0\0\0\x7f\x02\0\0\ -\x04\0\0\x0f\x40\0\0\0\x08\0\0\0\0\0\0\0\x10\0\0\0\x0a\0\0\0\x10\0\0\0\x10\0\0\ -\0\x0c\0\0\0\x20\0\0\0\x10\0\0\0\x0e\0\0\0\x30\0\0\0\x10\0\0\0\x85\x02\0\0\x01\ -\0\0\x0f\x10\0\0\0\x1c\0\0\0\0\0\0\0\x10\0\0\0\x8d\x02\0\0\x01\0\0\x0f\x04\0\0\ -\0\x19\0\0\0\0\0\0\0\x04\0\0\0\0\x69\x6e\x74\0\x5f\x5f\x41\x52\x52\x41\x59\x5f\ -\x53\x49\x5a\x45\x5f\x54\x59\x50\x45\x5f\x5f\0\x74\x79\x70\x65\0\x6d\x61\x78\ -\x5f\x65\x6e\x74\x72\x69\x65\x73\0\x72\x69\x6e\x67\x62\x75\x66\x5f\x31\0\x72\ -\x69\x6e\x67\x62\x75\x66\x5f\x32\0\x72\x69\x6e\x67\x62\x75\x66\x5f\x33\0\x72\ -\x69\x6e\x67\x62\x75\x66\x5f\x34\0\x64\x75\x6d\x6d\x79\0\x2e\x74\x65\x78\x74\0\ -\x2f\x68\x6f\x6d\x65\x2f\x6c\x69\x6e\x75\x78\x2f\x6b\x65\x72\x6e\x65\x6c\x2f\ -\x6b\x65\x78\x65\x63\x5f\x62\x70\x66\x2f\x6b\x65\x78\x65\x63\x5f\x70\x65\x5f\ -\x70\x61\x72\x73\x65\x72\x5f\x62\x70\x66\x2e\x63\0\x5f\x5f\x61\x74\x74\x72\x69\ -\x62\x75\x74\x65\x5f\x5f\x28\x28\x75\x73\x65\x64\x29\x29\x20\x73\x74\x61\x74\ -\x69\x63\x20\x69\x6e\x74\x20\x64\x75\x6d\x6d\x79\x28\x76\x6f\x69\x64\x29\0\x09\ -\x5f\x5f\x62\x75\x69\x6c\x74\x69\x6e\x5f\x6d\x65\x6d\x63\x70\x79\x28\x6c\x6f\ -\x63\x61\x6c\x5f\x6e\x61\x6d\x65\x2c\x20\x4b\x45\x58\x45\x43\x5f\x52\x45\x53\ -\x5f\x49\x4e\x49\x54\x52\x44\x5f\x4e\x41\x4d\x45\x2c\x20\x31\x36\x29\x3b\0\x09\ -\x72\x65\x74\x75\x72\x6e\x20\x5f\x5f\x62\x75\x69\x6c\x74\x69\x6e\x5f\x6d\x65\ -\x6d\x63\x6d\x70\x28\x6c\x6f\x63\x61\x6c\x5f\x6e\x61\x6d\x65\x2c\x20\x72\x65\ -\x73\x5f\x6b\x65\x72\x6e\x65\x6c\x2c\x20\x34\x29\x3b\0\x75\x6e\x73\x69\x67\x6e\ -\x65\x64\x20\x6c\x6f\x6e\x67\x20\x6c\x6f\x6e\x67\0\x63\x74\x78\0\x70\x61\x72\ -\x73\x65\x5f\x70\x65\0\x66\x65\x6e\x74\x72\x79\x2e\x73\x2f\x62\x70\x66\x5f\x68\ -\x61\x6e\x64\x6c\x65\x5f\x70\x65\x66\x69\x6c\x65\0\x5f\x5f\x61\x74\x74\x72\x69\ -\x62\x75\x74\x65\x5f\x5f\x28\x28\x75\x73\x65\x64\x29\x29\x20\x69\x6e\x74\x20\ -\x42\x50\x46\x5f\x50\x52\x4f\x47\x28\x70\x61\x72\x73\x65\x5f\x70\x65\x2c\x20\ -\x73\x74\x72\x75\x63\x74\x20\x6b\x65\x78\x65\x63\x5f\x63\x6f\x6e\x74\x65\x78\ -\x74\x20\x2a\x63\x6f\x6e\x74\x65\x78\x74\x29\0\x70\x6f\x73\x74\x5f\x70\x61\x72\ -\x73\x65\x5f\x70\x65\0\x66\x65\x6e\x74\x72\x79\x2e\x73\x2f\x62\x70\x66\x5f\x70\ -\x6f\x73\x74\x5f\x68\x61\x6e\x64\x6c\x65\x5f\x70\x65\x66\x69\x6c\x65\0\x5f\x5f\ -\x61\x74\x74\x72\x69\x62\x75\x74\x65\x5f\x5f\x28\x28\x75\x73\x65\x64\x29\x29\ -\x20\x69\x6e\x74\x20\x42\x50\x46\x5f\x50\x52\x4f\x47\x28\x70\x6f\x73\x74\x5f\ -\x70\x61\x72\x73\x65\x5f\x70\x65\x2c\x20\x73\x74\x72\x75\x63\x74\x20\x6b\x65\ -\x78\x65\x63\x5f\x63\x6f\x6e\x74\x65\x78\x74\x20\x2a\x63\x6f\x6e\x74\x65\x78\ -\x74\x29\0\x63\x68\x61\x72\0\x4c\x49\x43\x45\x4e\x53\x45\0\x64\x75\x6d\x6d\x79\ -\x2e\x72\x65\x73\x5f\x6b\x65\x72\x6e\x65\x6c\0\x64\x75\x6d\x6d\x79\x2e\x6c\x6f\ -\x63\x61\x6c\x5f\x6e\x61\x6d\x65\0\x64\x75\x6d\x6d\x79\x2e\x72\x65\x73\x5f\x63\ -\x6d\x64\x6c\x69\x6e\x65\0\x2e\x62\x73\x73\0\x2e\x64\x61\x74\x61\0\x2e\x6d\x61\ -\x70\x73\0\x2e\x72\x6f\x64\x61\x74\x61\0\x6c\x69\x63\x65\x6e\x73\x65\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\xb1\x05\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x1b\ -\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x72\x69\x6e\x67\x62\ -\x75\x66\x5f\x31\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\x1b\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x72\x69\ -\x6e\x67\x62\x75\x66\x5f\x32\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ 
-\0\0\0\0\0\0\0\0\0\0\x1b\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\x72\x69\x6e\x67\x62\x75\x66\x5f\x33\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x1b\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\x72\x69\x6e\x67\x62\x75\x66\x5f\x34\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x02\0\0\0\x04\0\0\0\x10\0\0\0\x01\0\0\ -\0\x80\0\0\0\0\0\0\0\0\0\0\0\x6b\x65\x78\x65\x63\x5f\x70\x65\x2e\x72\x6f\x64\ -\x61\x74\x61\0\0\0\0\0\0\0\0\0\0\0\0\0\x24\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x6b\ -\x65\x78\x65\x63\x3a\x6b\x65\x72\x6e\x65\x6c\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x02\0\0\ -\0\x04\0\0\0\x10\0\0\0\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x6b\x65\x78\x65\x63\ -\x5f\x70\x65\x2e\x64\x61\x74\x61\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x22\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\x6b\x65\x78\x65\x63\x3a\x63\x6d\x64\x6c\x69\x6e\x65\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\x02\0\0\0\x04\0\0\0\x10\0\0\0\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x6b\x65\x78\ -\x65\x63\x5f\x70\x65\x2e\x62\x73\x73\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x21\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x02\0\0\0\x04\0\0\0\ -\x0d\0\0\0\x01\0\0\0\x80\0\0\0\0\0\0\0\0\0\0\0\x2e\x72\x6f\x64\x61\x74\x61\x2e\ -\x73\x74\x72\x31\x2e\x31\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\x6b\x65\x78\x65\x63\x3a\x69\x6e\x69\x74\x72\x64\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\x47\x50\x4c\0\0\0\0\0\xb4\0\0\0\0\0\0\0\x95\0\0\0\0\0\0\0\0\0\0\0\x14\0\0\0\ -\0\0\0\0\x5e\0\0\0\x68\x01\0\0\x1b\xe8\0\0\x1a\0\0\0\x02\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\x70\x61\x72\ -\x73\x65\x5f\x70\x65\0\0\0\0\0\0\0\0\0\0\0\0\x18\0\0\0\0\0\0\0\x08\0\0\0\0\0\0\ -\0\0\0\0\0\x01\0\0\0\x10\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\x01\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\x62\x70\x66\x5f\x68\x61\ -\x6e\x64\x6c\x65\x5f\x70\x65\x66\x69\x6c\x65\0\0\0\0\0\0\0\x47\x50\x4c\0\0\0\0\ -\0\xb4\0\0\0\0\0\0\0\x95\0\0\0\0\0\0\0\0\0\0\0\x16\0\0\0\0\0\0\0\x5e\0\0\0\xe2\ -\x01\0\0\x1b\0\x01\0\x1a\0\0\0\x02\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\x70\x6f\x73\x74\x5f\x70\x61\x72\ -\x73\x65\x5f\x70\x65\0\0\0\0\0\0\0\x18\0\0\0\0\0\0\0\x08\0\0\0\0\0\0\0\0\0\0\0\ -\x01\0\0\0\x10\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\x62\x70\x66\x5f\x70\x6f\x73\x74\ -\x5f\x68\x61\x6e\x64\x6c\x65\x5f\x70\x65\x66\x69\x6c\x65\0\0"; - static const char opts_insn[] __attribute__((__aligned__(8))) = "\ -\xbf\x16\0\0\0\0\0\0\xbf\xa1\0\0\0\0\0\0\x07\x01\0\0\x78\xff\xff\xff\xb7\x02\0\ -\0\x88\0\0\0\xb7\x03\0\0\0\0\0\0\x85\0\0\0\x71\0\0\0\x05\0\x41\0\0\0\0\0\x61\ -\xa1\x78\xff\0\0\0\0\xd5\x01\x01\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x61\xa1\x7c\xff\ -\0\0\0\0\xd5\x01\x01\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x61\xa1\x80\xff\0\0\0\0\xd5\ -\x01\x01\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x61\xa1\x84\xff\0\0\0\0\xd5\x01\x01\0\0\ -\0\0\0\x85\0\0\0\xa8\0\0\0\x61\xa1\x88\xff\0\0\0\0\xd5\x01\x01\0\0\0\0\0\x85\0\ -\0\0\xa8\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x61\x01\0\0\0\0\0\0\xd5\x01\ -\x02\0\0\0\0\0\xbf\x19\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x18\x60\0\0\0\0\0\0\0\0\ 
-\0\0\x04\0\0\0\x61\x01\0\0\0\0\0\0\xd5\x01\x02\0\0\0\0\0\xbf\x19\0\0\0\0\0\0\ -\x85\0\0\0\xa8\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x08\0\0\0\x61\x01\0\0\0\0\0\0\ -\xd5\x01\x02\0\0\0\0\0\xbf\x19\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x18\x60\0\0\0\0\ -\0\0\0\0\0\0\x0c\0\0\0\x61\x01\0\0\0\0\0\0\xd5\x01\x02\0\0\0\0\0\xbf\x19\0\0\0\ -\0\0\0\x85\0\0\0\xa8\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\x61\x01\0\0\0\ -\0\0\0\xd5\x01\x02\0\0\0\0\0\xbf\x19\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x18\x60\0\ -\0\0\0\0\0\0\0\0\0\x14\0\0\0\x61\x01\0\0\0\0\0\0\xd5\x01\x02\0\0\0\0\0\xbf\x19\ -\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x18\0\0\0\x61\x01\ -\0\0\0\0\0\0\xd5\x01\x02\0\0\0\0\0\xbf\x19\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x18\ -\x60\0\0\0\0\0\0\0\0\0\0\x1c\0\0\0\x61\x01\0\0\0\0\0\0\xd5\x01\x02\0\0\0\0\0\ -\xbf\x19\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\xbf\x70\0\0\0\0\0\0\x95\0\0\0\0\0\0\0\ -\x61\x60\x08\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xd0\x0a\0\0\x63\x01\0\0\0\0\ -\0\0\x61\x60\x0c\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xcc\x0a\0\0\x63\x01\0\0\ -\0\0\0\0\x79\x60\x10\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xc0\x0a\0\0\x7b\x01\ -\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\0\x05\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\ -\xb8\x0a\0\0\x7b\x01\0\0\0\0\0\0\xb7\x01\0\0\x12\0\0\0\x18\x62\0\0\0\0\0\0\0\0\ -\0\0\xb8\x0a\0\0\xb7\x03\0\0\x1c\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\ -\xc5\x07\xa7\xff\0\0\0\0\x63\x7a\x78\xff\0\0\0\0\x61\x60\x1c\0\0\0\0\0\x15\0\ -\x03\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xe4\x0a\0\0\x63\x01\0\0\0\0\0\0\xb7\ -\x01\0\0\0\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\xd8\x0a\0\0\xb7\x03\0\0\x48\0\0\0\ -\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x9a\xff\0\0\0\0\x18\x61\0\0\0\ -\0\0\0\0\0\0\0\0\0\0\0\x63\x71\0\0\0\0\0\0\x61\x60\x2c\0\0\0\0\0\x15\0\x03\0\0\ -\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x2c\x0b\0\0\x63\x01\0\0\0\0\0\0\xb7\x01\0\0\ -\0\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x20\x0b\0\0\xb7\x03\0\0\x48\0\0\0\x85\0\0\ -\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x8b\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\ -\0\0\0\x04\0\0\0\x63\x71\0\0\0\0\0\0\x61\x60\x3c\0\0\0\0\0\x15\0\x03\0\0\0\0\0\ -\x18\x61\0\0\0\0\0\0\0\0\0\0\x74\x0b\0\0\x63\x01\0\0\0\0\0\0\xb7\x01\0\0\0\0\0\ -\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x68\x0b\0\0\xb7\x03\0\0\x48\0\0\0\x85\0\0\0\xa6\ -\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x7c\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\ -\x08\0\0\0\x63\x71\0\0\0\0\0\0\x61\x60\x4c\0\0\0\0\0\x15\0\x03\0\0\0\0\0\x18\ -\x61\0\0\0\0\0\0\0\0\0\0\xbc\x0b\0\0\x63\x01\0\0\0\0\0\0\xb7\x01\0\0\0\0\0\0\ -\x18\x62\0\0\0\0\0\0\0\0\0\0\xb0\x0b\0\0\xb7\x03\0\0\x48\0\0\0\x85\0\0\0\xa6\0\ -\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x6d\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\ -\x0c\0\0\0\x63\x71\0\0\0\0\0\0\x61\xa0\x78\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\ -\0\0\x28\x0c\0\0\x63\x01\0\0\0\0\0\0\x61\x60\x5c\0\0\0\0\0\x15\0\x03\0\0\0\0\0\ -\x18\x61\0\0\0\0\0\0\0\0\0\0\x04\x0c\0\0\x63\x01\0\0\0\0\0\0\xb7\x01\0\0\0\0\0\ -\0\x18\x62\0\0\0\0\0\0\0\0\0\0\xf8\x0b\0\0\xb7\x03\0\0\x48\0\0\0\x85\0\0\0\xa6\ -\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x5a\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\ -\x10\0\0\0\x63\x71\0\0\0\0\0\0\x79\x63\x60\0\0\0\0\0\x15\x03\x08\0\0\0\0\0\x18\ -\x61\0\0\0\0\0\0\0\0\0\0\x40\x0c\0\0\xb7\x02\0\0\x10\0\0\0\x61\x60\x04\0\0\0\0\ -\0\x45\0\x02\0\x01\0\0\0\x85\0\0\0\x94\0\0\0\x05\0\x01\0\0\0\0\0\x85\0\0\0\x71\ -\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\x61\x20\0\0\0\0\0\0\x18\x61\0\0\0\ -\0\0\0\0\0\0\0\x58\x0c\0\0\x63\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x50\ -\x0c\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x60\x0c\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\ 
-\0\0\0\0\0\0\0\0\0\x40\x0c\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x68\x0c\0\0\x7b\x01\ -\0\0\0\0\0\0\xb7\x01\0\0\x02\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x58\x0c\0\0\xb7\ -\x03\0\0\x20\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x36\xff\0\0\ -\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x10\0\0\0\x61\x20\0\0\0\0\0\0\x18\x61\0\0\0\0\ -\0\0\0\0\0\0\x78\x0c\0\0\x63\x01\0\0\0\0\0\0\xb7\x01\0\0\x16\0\0\0\x18\x62\0\0\ -\0\0\0\0\0\0\0\0\x78\x0c\0\0\xb7\x03\0\0\x04\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\ -\0\0\0\0\0\0\xc5\x07\x29\xff\0\0\0\0\x61\xa0\x78\xff\0\0\0\0\x18\x61\0\0\0\0\0\ -\0\0\0\0\0\xb0\x0c\0\0\x63\x01\0\0\0\0\0\0\x61\x60\x6c\0\0\0\0\0\x15\0\x03\0\0\ -\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x8c\x0c\0\0\x63\x01\0\0\0\0\0\0\xb7\x01\0\0\ -\0\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x80\x0c\0\0\xb7\x03\0\0\x48\0\0\0\x85\0\0\ -\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x19\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\ -\0\0\0\x14\0\0\0\x63\x71\0\0\0\0\0\0\x79\x63\x70\0\0\0\0\0\x15\x03\x08\0\0\0\0\ -\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xc8\x0c\0\0\xb7\x02\0\0\x10\0\0\0\x61\x60\x04\0\ -\0\0\0\0\x45\0\x02\0\x01\0\0\0\x85\0\0\0\x94\0\0\0\x05\0\x01\0\0\0\0\0\x85\0\0\ -\0\x71\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x14\0\0\0\x61\x20\0\0\0\0\0\0\x18\x61\ -\0\0\0\0\0\0\0\0\0\0\xe0\x0c\0\0\x63\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\ -\0\xd8\x0c\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xe8\x0c\0\0\x7b\x01\0\0\0\0\0\0\x18\ -\x60\0\0\0\0\0\0\0\0\0\0\xc8\x0c\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xf0\x0c\0\0\ -\x7b\x01\0\0\0\0\0\0\xb7\x01\0\0\x02\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\xe0\x0c\ -\0\0\xb7\x03\0\0\x20\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\xf5\ -\xfe\0\0\0\0\x61\xa0\x78\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x30\x0d\0\0\ -\x63\x01\0\0\0\0\0\0\x61\x60\x7c\0\0\0\0\0\x15\0\x03\0\0\0\0\0\x18\x61\0\0\0\0\ -\0\0\0\0\0\0\x0c\x0d\0\0\x63\x01\0\0\0\0\0\0\xb7\x01\0\0\0\0\0\0\x18\x62\0\0\0\ -\0\0\0\0\0\0\0\0\x0d\0\0\xb7\x03\0\0\x48\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\ -\0\0\0\0\xc5\x07\xe5\xfe\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x18\0\0\0\x63\x71\ -\0\0\0\0\0\0\x79\x63\x80\0\0\0\0\0\x15\x03\x08\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\ -\0\0\0\x48\x0d\0\0\xb7\x02\0\0\x10\0\0\0\x61\x60\x04\0\0\0\0\0\x45\0\x02\0\x01\ -\0\0\0\x85\0\0\0\x94\0\0\0\x05\0\x01\0\0\0\0\0\x85\0\0\0\x71\0\0\0\x18\x62\0\0\ -\0\0\0\0\0\0\0\0\x18\0\0\0\x61\x20\0\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x60\ -\x0d\0\0\x63\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x58\x0d\0\0\x18\x61\0\ -\0\0\0\0\0\0\0\0\0\x68\x0d\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\ -\x48\x0d\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x70\x0d\0\0\x7b\x01\0\0\0\0\0\0\xb7\ -\x01\0\0\x02\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x60\x0d\0\0\xb7\x03\0\0\x20\0\0\ -\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\xc1\xfe\0\0\0\0\x61\x60\x8c\ -\0\0\0\0\0\x15\0\x03\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x8c\x0d\0\0\x63\x01\ -\0\0\0\0\0\0\xb7\x01\0\0\0\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x80\x0d\0\0\xb7\ -\x03\0\0\x48\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\xb5\xfe\0\0\ -\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x1c\0\0\0\x63\x71\0\0\0\0\0\0\x79\x63\x90\0\0\ -\0\0\0\x15\x03\x08\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xc8\x0d\0\0\xb7\x02\0\ -\0\x0d\0\0\0\x61\x60\x04\0\0\0\0\0\x45\0\x02\0\x01\0\0\0\x85\0\0\0\x94\0\0\0\ -\x05\0\x01\0\0\0\0\0\x85\0\0\0\x71\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x1c\0\0\0\ -\x61\x20\0\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xe0\x0d\0\0\x63\x01\0\0\0\0\0\ -\0\x18\x60\0\0\0\0\0\0\0\0\0\0\xd8\x0d\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xe8\x0d\ -\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\xc8\x0d\0\0\x18\x61\0\0\0\ 
-\0\0\0\0\0\0\0\xf0\x0d\0\0\x7b\x01\0\0\0\0\0\0\xb7\x01\0\0\x02\0\0\0\x18\x62\0\ -\0\0\0\0\0\0\0\0\0\xe0\x0d\0\0\xb7\x03\0\0\x20\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\ -\x07\0\0\0\0\0\0\xc5\x07\x91\xfe\0\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x1c\0\0\0\ -\x61\x20\0\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\0\x0e\0\0\x63\x01\0\0\0\0\0\0\ -\xb7\x01\0\0\x16\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\0\x0e\0\0\xb7\x03\0\0\x04\0\ -\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x84\xfe\0\0\0\0\x18\x60\0\ -\0\0\0\0\0\0\0\0\0\x08\x0e\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x48\x0e\0\0\x7b\x01\ -\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x10\x0e\0\0\x18\x61\0\0\0\0\0\0\0\0\0\ -\0\x40\x0e\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x20\x0e\0\0\x18\ -\x61\0\0\0\0\0\0\0\0\0\0\x88\x0e\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\ -\0\0\0\x28\x0e\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x98\x0e\0\0\x7b\x01\0\0\0\0\0\0\ -\x18\x60\0\0\0\0\0\0\0\0\0\0\x38\x0e\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\xb8\x0e\0\ -\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x18\x61\0\0\0\0\0\0\ -\0\0\0\0\xb0\x0e\0\0\x7b\x01\0\0\0\0\0\0\x61\x60\x08\0\0\0\0\0\x18\x61\0\0\0\0\ -\0\0\0\0\0\0\x50\x0e\0\0\x63\x01\0\0\0\0\0\0\x61\x60\x0c\0\0\0\0\0\x18\x61\0\0\ -\0\0\0\0\0\0\0\0\x54\x0e\0\0\x63\x01\0\0\0\0\0\0\x79\x60\x10\0\0\0\0\0\x18\x61\ -\0\0\0\0\0\0\0\0\0\0\x58\x0e\0\0\x7b\x01\0\0\0\0\0\0\x61\xa0\x78\xff\0\0\0\0\ -\x18\x61\0\0\0\0\0\0\0\0\0\0\x80\x0e\0\0\x63\x01\0\0\0\0\0\0\x18\x61\0\0\0\0\0\ -\0\0\0\0\0\xc8\x0e\0\0\xb7\x02\0\0\x12\0\0\0\xb7\x03\0\0\x0c\0\0\0\xb7\x04\0\0\ -\0\0\0\0\x85\0\0\0\xa7\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x4e\xfe\0\0\0\0\x18\ -\x60\0\0\0\0\0\0\0\0\0\0\x38\x0e\0\0\x63\x70\x6c\0\0\0\0\0\x77\x07\0\0\x20\0\0\ -\0\x63\x70\x70\0\0\0\0\0\xb7\x01\0\0\x05\0\0\0\x18\x62\0\0\0\0\0\0\0\0\0\0\x38\ -\x0e\0\0\xb7\x03\0\0\x8c\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\0\0\0\0\x18\x60\ -\0\0\0\0\0\0\0\0\0\0\xa8\x0e\0\0\x61\x01\0\0\0\0\0\0\xd5\x01\x02\0\0\0\0\0\xbf\ -\x19\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\xc5\x07\x3c\xfe\0\0\0\0\x63\x7a\x80\xff\0\ -\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\xe0\x0e\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x20\ -\x0f\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\xe8\x0e\0\0\x18\x61\0\ -\0\0\0\0\0\0\0\0\0\x18\x0f\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\ -\xf8\x0e\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x60\x0f\0\0\x7b\x01\0\0\0\0\0\0\x18\ -\x60\0\0\0\0\0\0\0\0\0\0\0\x0f\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x70\x0f\0\0\x7b\ -\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x10\x0f\0\0\x18\x61\0\0\0\0\0\0\0\ -\0\0\0\x90\x0f\0\0\x7b\x01\0\0\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x18\ -\x61\0\0\0\0\0\0\0\0\0\0\x88\x0f\0\0\x7b\x01\0\0\0\0\0\0\x61\x60\x08\0\0\0\0\0\ -\x18\x61\0\0\0\0\0\0\0\0\0\0\x28\x0f\0\0\x63\x01\0\0\0\0\0\0\x61\x60\x0c\0\0\0\ -\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x2c\x0f\0\0\x63\x01\0\0\0\0\0\0\x79\x60\x10\0\ -\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x30\x0f\0\0\x7b\x01\0\0\0\0\0\0\x61\xa0\ -\x78\xff\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x58\x0f\0\0\x63\x01\0\0\0\0\0\0\ -\x18\x61\0\0\0\0\0\0\0\0\0\0\xa0\x0f\0\0\xb7\x02\0\0\x17\0\0\0\xb7\x03\0\0\x0c\ -\0\0\0\xb7\x04\0\0\0\0\0\0\x85\0\0\0\xa7\0\0\0\xbf\x07\0\0\0\0\0\0\xc5\x07\x05\ -\xfe\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x10\x0f\0\0\x63\x70\x6c\0\0\0\0\0\x77\ -\x07\0\0\x20\0\0\0\x63\x70\x70\0\0\0\0\0\xb7\x01\0\0\x05\0\0\0\x18\x62\0\0\0\0\ -\0\0\0\0\0\0\x10\x0f\0\0\xb7\x03\0\0\x8c\0\0\0\x85\0\0\0\xa6\0\0\0\xbf\x07\0\0\ -\0\0\0\0\x18\x60\0\0\0\0\0\0\0\0\0\0\x80\x0f\0\0\x61\x01\0\0\0\0\0\0\xd5\x01\ -\x02\0\0\0\0\0\xbf\x19\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\xc5\x07\xf3\xfd\0\0\0\0\ 
-\x63\x7a\x84\xff\0\0\0\0\x61\xa1\x78\xff\0\0\0\0\xd5\x01\x02\0\0\0\0\0\xbf\x19\ -\0\0\0\0\0\0\x85\0\0\0\xa8\0\0\0\x61\xa0\x80\xff\0\0\0\0\x63\x06\x98\0\0\0\0\0\ -\x61\xa0\x84\xff\0\0\0\0\x63\x06\x9c\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\0\0\ -\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x18\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\x04\ -\0\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x28\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\0\ -\x08\0\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x38\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\0\ -\0\x0c\0\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x48\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\0\ -\0\0\x10\0\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x58\0\0\0\0\0\x18\x61\0\0\0\0\0\0\0\ -\0\0\0\x14\0\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x68\0\0\0\0\0\x18\x61\0\0\0\0\0\0\ -\0\0\0\0\x18\0\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x78\0\0\0\0\0\x18\x61\0\0\0\0\0\ -\0\0\0\0\0\x1c\0\0\0\x61\x10\0\0\0\0\0\0\x63\x06\x88\0\0\0\0\0\xb7\0\0\0\0\0\0\ -\0\x95\0\0\0\0\0\0\0"; opts.ctx = (struct bpf_loader_ctx *)skel; - opts.data_sz = sizeof(opts_data) - 1; + opts.data_sz = opts_data_sz; opts.data = (void *)opts_data; - opts.insns_sz = sizeof(opts_insn) - 1; + opts.insns_sz = opts_insn_sz; opts.insns = (void *)opts_insn; err = bpf_load_and_run(&opts); diff --git a/kernel/kexec_pe_image.c b/kernel/kexec_pe_image.c index f8debcde6b516..0e9cd09782463 100644 --- a/kernel/kexec_pe_image.c +++ b/kernel/kexec_pe_image.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -21,6 +22,7 @@ #include #include +#include "kexec_bpf/kexec_pe_parser_bpf.lskel.h" #define KEXEC_RES_KERNEL_NAME "kexec:kernel" #define KEXEC_RES_INITRD_NAME "kexec:initrd" @@ -159,14 +161,60 @@ static bool pe_has_bpf_section(const char *file_buf, unsigned long pe_sz) return true; } +static struct kexec_pe_parser_bpf *pe_parser; + +static void *get_symbol_from_elf(const char *elf_data, size_t elf_size, + const char *symbol_name, unsigned int *symbol_size) +{ + Elf_Ehdr *ehdr = (Elf_Ehdr *)elf_data; + Elf_Shdr *shdr, *dst_shdr; + const Elf_Sym *sym; + void *symbol_data; + + if (memcmp(ehdr->e_ident, ELFMAG, SELFMAG) != 0) { + pr_err("Not a valid ELF file\n"); + return NULL; + } + + sym = elf_find_symbol(ehdr, symbol_name); + if (!sym) + return NULL; + shdr = (struct elf_shdr *)(elf_data + ehdr->e_shoff); + dst_shdr = &shdr[sym->st_shndx]; + symbol_data = (void *)(elf_data + dst_shdr->sh_offset + sym->st_value); + *symbol_size = sym->st_size; + + return symbol_data; +} + /* Load a ELF */ static int arm_bpf_prog(char *bpf_elf, unsigned long sz) { + opts_data = get_symbol_from_elf(bpf_elf, sz, "opts_data", &opts_data_sz); + opts_insn = get_symbol_from_elf(bpf_elf, sz, "opts_insn", &opts_insn_sz); + if (!opts_data || !opts_insn) + return -1; + /* + * When light skeleton generates opts_data[] and opts_insn[], it appends a + * NULL terminator at the end of string + */ + opts_data_sz = opts_data_sz - 1; + opts_insn_sz = opts_insn_sz - 1; + + pe_parser = kexec_pe_parser_bpf__open_and_load(); + if (!pe_parser) + return -1; + kexec_pe_parser_bpf__attach(pe_parser); + return 0; } static void disarm_bpf_prog(void) { + kexec_pe_parser_bpf__destroy(pe_parser); + pe_parser = NULL; + opts_data = NULL; + opts_insn = NULL; } struct kexec_context { -- 2.49.0 From piliu at redhat.com Mon Jul 21 19:03:17 2025 From: piliu at redhat.com (Pingfan Liu) Date: Tue, 22 Jul 2025 10:03:17 +0800 Subject: [PATCHv4 10/12] arm64/kexec: Add PE image format support In-Reply-To: <20250722020319.5837-1-piliu@redhat.com> References: <20250722020319.5837-1-piliu@redhat.com> Message-ID: 
<20250722020319.5837-11-piliu@redhat.com> Now everything is ready for kexec PE image parser. Select it on arm64 for zboot and UKI image support. Signed-off-by: Pingfan Liu Cc: Catalin Marinas Cc: Will Deacon To: linux-arm-kernel at lists.infradead.org --- arch/arm64/Kconfig | 1 + arch/arm64/include/asm/kexec.h | 1 + arch/arm64/kernel/machine_kexec_file.c | 3 +++ 3 files changed, 5 insertions(+) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 55fc331af3371..8973697ed1479 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1606,6 +1606,7 @@ config ARCH_SELECTS_KEXEC_FILE def_bool y depends on KEXEC_FILE select HAVE_IMA_KEXEC if IMA + select KEXEC_PE_IMAGE config ARCH_SUPPORTS_KEXEC_SIG def_bool y diff --git a/arch/arm64/include/asm/kexec.h b/arch/arm64/include/asm/kexec.h index 4d9cc7a76d9ca..d50796bd2f1e6 100644 --- a/arch/arm64/include/asm/kexec.h +++ b/arch/arm64/include/asm/kexec.h @@ -120,6 +120,7 @@ struct kimage_arch { #ifdef CONFIG_KEXEC_FILE extern const struct kexec_file_ops kexec_image_ops; +extern const struct kexec_file_ops kexec_pe_image_ops; int arch_kimage_file_post_load_cleanup(struct kimage *image); #define arch_kimage_file_post_load_cleanup arch_kimage_file_post_load_cleanup diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c index af1ca875c52ce..7c544c385a9ab 100644 --- a/arch/arm64/kernel/machine_kexec_file.c +++ b/arch/arm64/kernel/machine_kexec_file.c @@ -24,6 +24,9 @@ const struct kexec_file_ops * const kexec_file_loaders[] = { &kexec_image_ops, +#ifdef CONFIG_KEXEC_PE_IMAGE + &kexec_pe_image_ops, +#endif NULL }; -- 2.49.0 From piliu at redhat.com Mon Jul 21 19:03:18 2025 From: piliu at redhat.com (Pingfan Liu) Date: Tue, 22 Jul 2025 10:03:18 +0800 Subject: [PATCHv4 11/12] tools/kexec: Introduce a bpf-prog to parse zboot image format In-Reply-To: <20250722020319.5837-1-piliu@redhat.com> References: <20250722020319.5837-1-piliu@redhat.com> Message-ID: <20250722020319.5837-12-piliu@redhat.com> This BPF program aligns with the convention defined in the kernel file kexec_pe_parser_bpf.lskel.h, where the interface between the BPF program and the kernel is established, and is composed of: four maps: struct bpf_map_desc ringbuf_1; struct bpf_map_desc ringbuf_2; struct bpf_map_desc ringbuf_3; struct bpf_map_desc ringbuf_4; four sections: struct bpf_map_desc rodata; struct bpf_map_desc data; struct bpf_map_desc bss; struct bpf_map_desc rodata_str1_1; two progs: SEC("fentry.s/bpf_handle_pefile") SEC("fentry.s/bpf_post_handle_pefile") This BPF program only uses ringbuf_1, so it minimizes the size of the other three ringbufs to one byte. The size of ringbuf_1 is deduced from the size of the uncompressed file 'vmlinux.bin', which is usually less than 64MB. With the help of a group of bpf kfuncs: bpf_decompress(), bpf_copy_to_kernel(), bpf_mem_range_result_put(), this bpf-prog stores the uncompressed kernel image inside the kernel space. 
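For clarity, RINGBUF1_SIZE in the generated image_size.h is simply the first
power of two (starting from 4096) that exceeds the size of vmlinux.bin; the
shell loop in the Makefile hunk below is equivalent to this C sketch (not part
of the patch itself):

	static unsigned long ringbuf1_size(unsigned long image_size)
	{
		unsigned long power = 4096;

		while (power <= image_size)
			power *= 2;
		return power;
	}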
Signed-off-by: Pingfan Liu Cc: Alexei Starovoitov Cc: Baoquan He Cc: Dave Young Cc: Andrew Morton Cc: Philipp Rudo Cc: bpf at vger.kernel.org To: kexec at lists.infradead.org --- tools/kexec/Makefile | 82 +++++++++++++++++ tools/kexec/zboot_parser_bpf.c | 158 +++++++++++++++++++++++++++++++++ 2 files changed, 240 insertions(+) create mode 100644 tools/kexec/Makefile create mode 100644 tools/kexec/zboot_parser_bpf.c diff --git a/tools/kexec/Makefile b/tools/kexec/Makefile new file mode 100644 index 0000000000000..c9e7ce9ff4c19 --- /dev/null +++ b/tools/kexec/Makefile @@ -0,0 +1,82 @@ +# SPDX-License-Identifier: GPL-2.0 + +# Ensure Kbuild variables are available +include ../scripts/Makefile.include + +srctree := $(patsubst %/tools/kexec,%,$(CURDIR)) +VMLINUX = $(srctree)/vmlinux +TOOLSDIR := $(srctree)/tools +LIBDIR := $(TOOLSDIR)/lib +BPFDIR := $(LIBDIR)/bpf +ARCH ?= $(shell uname -m | sed -e s/i.86/x86/ -e s/x86_64/x86/ -e s/aarch64.*/arm64/ -e s/riscv64/riscv/ -e s/loongarch.*/loongarch/) +# At present, zboot image format is used by arm64, riscv, loongarch +# And arch/$(ARCH)/boot/vmlinux.bin is the uncompressed file instead of arch/$(ARCH)/boot/Image +ifeq ($(ARCH),$(filter $(ARCH),arm64 riscv loongarch)) + EFI_IMAGE := $(srctree)/arch/$(ARCH)/boot/vmlinuz.efi + KERNEL_IMAGE := $(srctree)/arch/$(ARCH)/boot/vmlinux.bin +else + @echo "Unsupported architecture: $(ARCH)" + @exit 1 +endif + + +CC = clang +CFLAGS = -O2 +BPF_PROG_CFLAGS = -g -O2 -target bpf -Wall -I $(BPFDIR) -I . +BPFTOOL = bpftool + +# List of generated target files +HEADERS = vmlinux.h bpf_helper_defs.h image_size.h +ZBOOT_TARGETS = bytecode.c zboot_parser_bpf.o bytecode.o + + +# Targets +zboot: $(HEADERS) $(ZBOOT_TARGETS) + +# Rule to generate vmlinux.h from vmlinux +vmlinux.h: $(VMLINUX) + @command -v $(BPFTOOL) >/dev/null 2>&1 || { echo >&2 "$(BPFTOOL) is required but not found. Please install it."; exit 1; } + @$(BPFTOOL) btf dump file $(VMLINUX) format c > vmlinux.h + +bpf_helper_defs.h: $(srctree)/tools/include/uapi/linux/bpf.h + @$(QUIET_GEN)$(srctree)/scripts/bpf_doc.py --header \ + --file $(srctree)/tools/include/uapi/linux/bpf.h > bpf_helper_defs.h + +image_size.h: $(KERNEL_IMAGE) + @{ \ + if [ ! -f "$(KERNEL_IMAGE)" ]; then \ + echo "Error: File '$(KERNEL_IMAGE)' does not exist"; \ + exit 1; \ + fi; \ + FILE_SIZE=$$(stat -c '%s' "$(KERNEL_IMAGE)" 2>/dev/null); \ + POWER=4096; \ + while [ $$POWER -le $$FILE_SIZE ]; do \ + POWER=$$((POWER * 2)); \ + done; \ + RINGBUF_SIZE=$$POWER; \ + echo "#define RINGBUF1_SIZE $$RINGBUF_SIZE" > $@; \ + echo "#define IMAGE_SIZE $$FILE_SIZE" >> $@; \ + } + + +# Rule to generate zboot_parser_bpf.o, depends on vmlinux.h +zboot_parser_bpf.o: zboot_parser_bpf.c vmlinux.h bpf_helper_defs.h + @$(CC) $(BPF_PROG_CFLAGS) -c zboot_parser_bpf.c -o zboot_parser_bpf.o + +# Generate zboot_parser_bpf.lskel.h using bpftool +# Then, extract the opts_data[] and opts_insn[] arrays and remove 'static' +# keywords to avoid being optimized away. 
+bytecode.c: zboot_parser_bpf.o + @$(BPFTOOL) gen skeleton -L zboot_parser_bpf.o > zboot_parser_bpf.lskel.h + @sed -n '/static const char opts_data\[\]/,/;/p' zboot_parser_bpf.lskel.h | sed 's/static const/const/' > $@ + @sed -n '/static const char opts_insn\[\]/,/;/p' zboot_parser_bpf.lskel.h | sed 's/static const/const/' >> $@ + @rm -f zboot_parser_bpf.lskel.h + +bytecode.o: bytecode.c + @$(CC) -c $< -o $@ + +# Clean up generated files +clean: + @rm -f $(HEADERS) $(ZBOOT_TARGETS) + +.PHONY: all clean diff --git a/tools/kexec/zboot_parser_bpf.c b/tools/kexec/zboot_parser_bpf.c new file mode 100644 index 0000000000000..e60621780a1a9 --- /dev/null +++ b/tools/kexec/zboot_parser_bpf.c @@ -0,0 +1,158 @@ +// SPDX-License-Identifier: GPL-2.0 +// +#include "vmlinux.h" +#include +#include +#include "image_size.h" + +/* uncompressed vmlinux.bin plus 4KB */ +#define MAX_RECORD_SIZE (IMAGE_SIZE + 4096) +/* ringbuf 2,3,4 are useless */ +#define MIN_BUF_SIZE 1 + +#define KEXEC_RES_KERNEL_NAME "kexec:kernel" +#define KEXEC_RES_INITRD_NAME "kexec:initrd" +#define KEXEC_RES_CMDLINE_NAME "kexec:cmdline" + +/* ringbuf is safe since the user space has no write access to them */ +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, RINGBUF1_SIZE); +} ringbuf_1 SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, MIN_BUF_SIZE); +} ringbuf_2 SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, MIN_BUF_SIZE); +} ringbuf_3 SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, MIN_BUF_SIZE); +} ringbuf_4 SEC(".maps"); + +char LICENSE[] SEC("license") = "GPL"; + +/* + * This function ensures that the sections .rodata, .data .bss and .rodata.str1.1 + * are created for a bpf prog. + */ +__attribute__((used)) static int dummy(void) +{ + static const char res_kernel[16] __attribute__((used, section(".rodata"))) = KEXEC_RES_KERNEL_NAME; + static char local_name[16] __attribute__((used, section(".data"))) = KEXEC_RES_CMDLINE_NAME; + static char res_cmdline[16] __attribute__((used, section(".bss"))); + + __builtin_memcpy(local_name, KEXEC_RES_INITRD_NAME, 16); + return __builtin_memcmp(local_name, res_kernel, 4); +} + +extern int bpf_copy_to_kernel(const char *name, char *buf, int size) __weak __ksym; +extern struct mem_range_result *bpf_decompress(char *image_gz_payload, int image_gz_sz) __weak __ksym; +extern int bpf_mem_range_result_put(struct mem_range_result *result) __weak __ksym; + + + + +/* see drivers/firmware/efi/libstub/zboot-header.S */ +struct linux_pe_zboot_header { + unsigned int mz_magic; + char image_type[4]; + unsigned int payload_offset; + unsigned int payload_size; + unsigned int reserved[2]; + char comp_type[4]; + unsigned int linux_pe_magic; + unsigned int pe_header_offset; +} __attribute__((packed)); + + +SEC("fentry.s/bpf_handle_pefile") +int BPF_PROG(parse_pe, struct kexec_context *context) +{ + struct linux_pe_zboot_header *zboot_header; + unsigned int image_sz; + char *buf; + char local_name[32]; + + bpf_printk("begin parse PE\n"); + /* BPF verifier should know each variable initial state */ + if (!context->image || (context->image_sz > MAX_RECORD_SIZE)) { + bpf_printk("Err: image size is greater than 0x%lx\n", MAX_RECORD_SIZE); + return 0; + } + + /* In order to access bytes not aligned on 2 order, copy into ringbuf. + * And allocate the memory all at once, later overwriting. 
+ * + * R2 is ARG_CONST_ALLOC_SIZE_OR_ZERO, should be decided at compling time + */ + buf = (char *)bpf_ringbuf_reserve(&ringbuf_1, MAX_RECORD_SIZE, 0); + if (!buf) { + bpf_printk("Err: fail to reserve ringbuf to parse zboot header\n"); + return 0; + } + image_sz = context->image_sz; + bpf_probe_read((void *)buf, sizeof(struct linux_pe_zboot_header), context->image); + zboot_header = (struct linux_pe_zboot_header *)buf; + if (!!__builtin_memcmp(&zboot_header->image_type, "zimg", + sizeof(zboot_header->image_type))) { + bpf_ringbuf_discard(buf, BPF_RB_NO_WAKEUP); + bpf_printk("Err: image is not zboot image\n"); + return 0; + } + + unsigned int payload_offset = zboot_header->payload_offset; + unsigned int payload_size = zboot_header->payload_size; + bpf_printk("zboot image payload offset=0x%x, size=0x%x\n", payload_offset, payload_size); + /* sane check */ + if (payload_size > image_sz) { + bpf_ringbuf_discard(buf, BPF_RB_NO_WAKEUP); + bpf_printk("Invalid zboot image payload offset and size\n"); + return 0; + } + if (payload_size >= MAX_RECORD_SIZE ) { + bpf_ringbuf_discard(buf, BPF_RB_NO_WAKEUP); + bpf_printk("Err: payload_size > MAX_RECORD_SIZE\n"); + return 0; + } + /* Overwrite buf */ + bpf_probe_read((void *)buf, payload_size, context->image + payload_offset); + bpf_printk("Calling bpf_kexec_decompress()\n"); + struct mem_range_result *r = bpf_decompress(buf, payload_size - 4); + if (!r) { + bpf_ringbuf_discard(buf, BPF_RB_NO_WAKEUP); + bpf_printk("Err: fail to decompress\n"); + return 0; + } + + image_sz = r->data_sz; + if (image_sz > MAX_RECORD_SIZE) { + bpf_ringbuf_discard(buf, BPF_RB_NO_WAKEUP); + bpf_mem_range_result_put(r); + bpf_printk("Err: decompressed size too big\n"); + return 0; + } + + /* Since the decompressed size is bigger than original, no need to clean */ + bpf_probe_read((void *)buf, image_sz, r->buf); + bpf_printk("Calling bpf_copy_to_kernel(), image_sz=0x%x\n", image_sz); + /* Verifier is unhappy to expose .rodata.str1.1 'map' to kernel */ + __builtin_memcpy(local_name, KEXEC_RES_KERNEL_NAME, 32); + const char *res_name = local_name; + bpf_copy_to_kernel(res_name, buf, image_sz); + bpf_ringbuf_discard(buf, BPF_RB_NO_WAKEUP); + bpf_mem_range_result_put(r); + + return 0; +} + +SEC("fentry.s/bpf_post_handle_pefile") +int BPF_PROG(post_parse_pe, struct kexec_context *context) +{ + return 0; +} -- 2.49.0 From piliu at redhat.com Mon Jul 21 19:03:19 2025 From: piliu at redhat.com (Pingfan Liu) Date: Tue, 22 Jul 2025 10:03:19 +0800 Subject: [PATCHv4 12/12] tools/kexec: Add a zboot image building tool In-Reply-To: <20250722020319.5837-1-piliu@redhat.com> References: <20250722020319.5837-1-piliu@redhat.com> Message-ID: <20250722020319.5837-13-piliu@redhat.com> The objcopy binary can append an section into PE file, but it disregards the DOS header. While the zboot format carries important information: payload offset and size in the DOS header. In order to keep track and update such information, here introducing a dedicated binary tool to build zboot image. The payload offset is determined by the fact that its offset inside the .data section is unchanged. Hence the offset of .data section in the new PE file plus the payload offset within section renders the offset within the new PE file. The objcopy binary can append a section to a PE file, but it disregards the DOS header. However, the zboot format carries important information in the DOS header: payload offset and size. 
To track this information and append a new PE section, here a dedicated binary tool is introduced to build zboot images. The payload's relative offset within the .data section remains unchanged. Therefore, the .data section offset in the new PE file, plus the payload offset within that section, yields the payload offset within the new PE file. Finally, the new PE file 'zboot.efi' can be got by the command: make -C tools/kexec zboot Signed-off-by: Pingfan Liu Cc: Alexei Starovoitov Cc: Baoquan He Cc: Dave Young Cc: Andrew Morton Cc: Philipp Rudo Cc: bpf at vger.kernel.org To: kexec at lists.infradead.org --- tools/kexec/Makefile | 10 +- tools/kexec/pe.h | 177 +++++++++++++++++++ tools/kexec/zboot_image_builder.c | 280 ++++++++++++++++++++++++++++++ 3 files changed, 466 insertions(+), 1 deletion(-) create mode 100644 tools/kexec/pe.h create mode 100644 tools/kexec/zboot_image_builder.c diff --git a/tools/kexec/Makefile b/tools/kexec/Makefile index c9e7ce9ff4c19..5cc4b6088b3f8 100644 --- a/tools/kexec/Makefile +++ b/tools/kexec/Makefile @@ -27,7 +27,7 @@ BPFTOOL = bpftool # List of generated target files HEADERS = vmlinux.h bpf_helper_defs.h image_size.h -ZBOOT_TARGETS = bytecode.c zboot_parser_bpf.o bytecode.o +ZBOOT_TARGETS = bytecode.c zboot_parser_bpf.o bytecode.o zboot_image_builder zboot.efi # Targets @@ -75,6 +75,14 @@ bytecode.c: zboot_parser_bpf.o bytecode.o: bytecode.c @$(CC) -c $< -o $@ +# Rule to build zboot_image_builder executable +zboot_image_builder: zboot_image_builder.c + @$(CC) $(CFLAGS) $< -o $@ + +zboot.efi: zboot_image_builder bytecode.o + @chmod +x zboot_image_builder + @./zboot_image_builder $(EFI_IMAGE) bytecode.o $@ + # Clean up generated files clean: @rm -f $(HEADERS) $(ZBOOT_TARGETS) diff --git a/tools/kexec/pe.h b/tools/kexec/pe.h new file mode 100644 index 0000000000000..c2273d3fc3bb3 --- /dev/null +++ b/tools/kexec/pe.h @@ -0,0 +1,177 @@ +/* + * Extract from linux kernel include/linux/pe.h + */ + +#ifndef __PE_H__ +#define __PE_H__ + +#define IMAGE_DOS_SIGNATURE 0x5a4d /* "MZ" */ +#define IMAGE_NT_SIGNATURE 0x00004550 /* "PE\0\0" */ + +struct mz_hdr { + uint16_t magic; /* MZ_MAGIC */ + uint16_t lbsize; /* size of last used block */ + uint16_t blocks; /* pages in file, 0x3 */ + uint16_t relocs; /* relocations */ + uint16_t hdrsize; /* header size in "paragraphs" */ + uint16_t min_extra_pps; /* .bss */ + uint16_t max_extra_pps; /* runtime limit for the arena size */ + uint16_t ss; /* relative stack segment */ + uint16_t sp; /* initial %sp register */ + uint16_t checksum; /* word checksum */ + uint16_t ip; /* initial %ip register */ + uint16_t cs; /* initial %cs relative to load segment */ + uint16_t reloc_table_offset; /* offset of the first relocation */ + uint16_t overlay_num; /* overlay number. set to 0. */ + uint16_t reserved0[4]; /* reserved */ + uint16_t oem_id; /* oem identifier */ + uint16_t oem_info; /* oem specific */ + uint16_t reserved1[10]; /* reserved */ + uint32_t peaddr; /* address of pe header */ + char message[]; /* message to print */ +}; + +struct pe_hdr { + uint32_t magic; /* PE magic */ + uint16_t machine; /* machine type */ + uint16_t sections; /* number of sections */ + uint32_t timestamp; /* time_t */ + uint32_t symbol_table; /* symbol table offset */ + uint32_t symbols; /* number of symbols */ + uint16_t opt_hdr_size; /* size of optional header */ + uint16_t flags; /* flags */ +}; + +/* the fact that pe32 isn't padded where pe32+ is 64-bit means union won't + * work right. vomit. 
*/ +struct pe32_opt_hdr { + /* "standard" header */ + uint16_t magic; /* file type */ + uint8_t ld_major; /* linker major version */ + uint8_t ld_minor; /* linker minor version */ + uint32_t text_size; /* size of text section(s) */ + uint32_t data_size; /* size of data section(s) */ + uint32_t bss_size; /* size of bss section(s) */ + uint32_t entry_point; /* file offset of entry point */ + uint32_t code_base; /* relative code addr in ram */ + uint32_t data_base; /* relative data addr in ram */ + /* "windows" header */ + uint32_t image_base; /* preferred load address */ + uint32_t section_align; /* alignment in bytes */ + uint32_t file_align; /* file alignment in bytes */ + uint16_t os_major; /* major OS version */ + uint16_t os_minor; /* minor OS version */ + uint16_t image_major; /* major image version */ + uint16_t image_minor; /* minor image version */ + uint16_t subsys_major; /* major subsystem version */ + uint16_t subsys_minor; /* minor subsystem version */ + uint32_t win32_version; /* reserved, must be 0 */ + uint32_t image_size; /* image size */ + uint32_t header_size; /* header size rounded up to + file_align */ + uint32_t csum; /* checksum */ + uint16_t subsys; /* subsystem */ + uint16_t dll_flags; /* more flags! */ + uint32_t stack_size_req;/* amt of stack requested */ + uint32_t stack_size; /* amt of stack required */ + uint32_t heap_size_req; /* amt of heap requested */ + uint32_t heap_size; /* amt of heap required */ + uint32_t loader_flags; /* reserved, must be 0 */ + uint32_t data_dirs; /* number of data dir entries */ +}; + +struct pe32plus_opt_hdr { + uint16_t magic; /* file type */ + uint8_t ld_major; /* linker major version */ + uint8_t ld_minor; /* linker minor version */ + uint32_t text_size; /* size of text section(s) */ + uint32_t data_size; /* size of data section(s) */ + uint32_t bss_size; /* size of bss section(s) */ + uint32_t entry_point; /* file offset of entry point */ + uint32_t code_base; /* relative code addr in ram */ + /* "windows" header */ + uint64_t image_base; /* preferred load address */ + uint32_t section_align; /* alignment in bytes */ + uint32_t file_align; /* file alignment in bytes */ + uint16_t os_major; /* major OS version */ + uint16_t os_minor; /* minor OS version */ + uint16_t image_major; /* major image version */ + uint16_t image_minor; /* minor image version */ + uint16_t subsys_major; /* major subsystem version */ + uint16_t subsys_minor; /* minor subsystem version */ + uint32_t win32_version; /* reserved, must be 0 */ + uint32_t image_size; /* image size */ + uint32_t header_size; /* header size rounded up to + file_align */ + uint32_t csum; /* checksum */ + uint16_t subsys; /* subsystem */ + uint16_t dll_flags; /* more flags! 
*/ + uint64_t stack_size_req;/* amt of stack requested */ + uint64_t stack_size; /* amt of stack required */ + uint64_t heap_size_req; /* amt of heap requested */ + uint64_t heap_size; /* amt of heap required */ + uint32_t loader_flags; /* reserved, must be 0 */ + uint32_t data_dirs; /* number of data dir entries */ +}; + +struct data_dirent { + uint32_t virtual_address; /* relative to load address */ + uint32_t size; +}; + +struct data_directory { + struct data_dirent exports; /* .edata */ + struct data_dirent imports; /* .idata */ + struct data_dirent resources; /* .rsrc */ + struct data_dirent exceptions; /* .pdata */ + struct data_dirent certs; /* certs */ + struct data_dirent base_relocations; /* .reloc */ + struct data_dirent debug; /* .debug */ + struct data_dirent arch; /* reservered */ + struct data_dirent global_ptr; /* global pointer reg. Size=0 */ + struct data_dirent tls; /* .tls */ + struct data_dirent load_config; /* load configuration structure */ + struct data_dirent bound_imports; /* no idea */ + struct data_dirent import_addrs; /* import address table */ + struct data_dirent delay_imports; /* delay-load import table */ + struct data_dirent clr_runtime_hdr; /* .cor (object only) */ + struct data_dirent reserved; +}; + +struct section_header { + char name[8]; /* name or "/12\0" string tbl offset */ + uint32_t virtual_size; /* size of loaded section in ram */ + uint32_t virtual_address; /* relative virtual address */ + uint32_t raw_data_size; /* size of the section */ + uint32_t data_addr; /* file pointer to first page of sec */ + uint32_t relocs; /* file pointer to relocation entries */ + uint32_t line_numbers; /* line numbers! */ + uint16_t num_relocs; /* number of relocations */ + uint16_t num_lin_numbers; /* srsly. */ + uint32_t flags; +}; + +struct win_certificate { + uint32_t length; + uint16_t revision; + uint16_t cert_type; +}; + +/* + * Return -1 if not PE, else offset of the PE header + */ +static int get_pehdr_offset(const char *buf) +{ + int pe_hdr_offset; + + pe_hdr_offset = *((int *)(buf + 0x3c)); + buf += pe_hdr_offset; + if (!!memcmp(buf, "PE\0\0", 4)) { + printf("Not a PE file\n"); + return -1; + } + + return pe_hdr_offset; +} + +#endif diff --git a/tools/kexec/zboot_image_builder.c b/tools/kexec/zboot_image_builder.c new file mode 100644 index 0000000000000..2508cafd7c200 --- /dev/null +++ b/tools/kexec/zboot_image_builder.c @@ -0,0 +1,280 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2025 Red Hat, Inc. + * The zboot format carries the compressed kernel image offset and size + * information in the DOS header. The program appends a bpf section to PE file, + * meanwhile maintains the offset and size information, which is lost when using + * objcopy to handle zboot image. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include "pe.h" + +#ifdef DEBUG_DETAIL + #define dprintf(...) printf(__VA_ARGS__) +#else + #define dprintf(...) 
((void)0) +#endif + +typedef struct { + union { + struct { + unsigned int mz_magic; + char image_type[4]; + /* offset to the whole file start */ + unsigned int payload_offset; + unsigned int payload_size; + unsigned int reserved[2]; + char comp_type[4]; + }; + char raw_bytes[56]; + }; + unsigned int linux_pe_magic; + /* offset at: 0x3c or 60 */ + unsigned int pe_header_offset; +} __attribute__((packed)) pe_zboot_header; + +typedef unsigned long uintptr_t; +#define ALIGN_UP(p, size) (__typeof__(p))(((uintptr_t)(p) + ((size) - 1)) & ~((size) - 1)) + +int main(int argc, char **argv) +{ + uint32_t payload_new_offset, payload_sect_off; + uint32_t payload_size; + uint32_t payload_sect_idx; + pe_zboot_header *zheader; + struct pe_hdr *pe_hdr; + struct pe32plus_opt_hdr *opt_hdr; + int base_fd, bpf_fd, out_fd; + char *base_start_addr, *base_cur; + char *out_start_addr, *out_cur; + uint32_t out_sz, max_va_end = 0; + struct stat sb; + int i = 0, ret = 0; + + if (argc != 4) { + fprintf(stderr, "Usage: %s \n", argv[0]); + return -1; + } + + const char *original_pe = argv[1]; + const char *binary_file = argv[2]; + const char *new_pe = argv[3]; + FILE *bin_fp = fopen(binary_file, "rb"); + if (!bin_fp) { + perror("Failed to open binary file"); + return -1; + } + fseek(bin_fp, 0, SEEK_END); + size_t bin_size = ftell(bin_fp); + fseek(bin_fp, 0, SEEK_SET); + base_fd = open(original_pe, O_RDWR); + out_fd = open(new_pe, O_RDWR | O_CREAT, 0644); + if (base_fd == -1 || out_fd == -1) { + perror("Error opening file"); + exit(1); + } + + if (fstat(base_fd, &sb) == -1) { + perror("Error getting file size"); + exit(1); + } + base_start_addr = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, base_fd, 0); + if (base_start_addr == MAP_FAILED) { + perror("Error mmapping the file"); + exit(1); + } + /* 64KB for section table extending */ + out_sz = sb.st_size + bin_size + (1 << 16); + out_start_addr = mmap(NULL, out_sz, PROT_WRITE, MAP_SHARED, out_fd, 0); + if (ftruncate(out_fd, out_sz) == -1) { + perror("Failed to resize output file"); + ret = -1; + goto err; + } + if (out_start_addr == MAP_FAILED) { + perror("Error mmapping the file"); + exit(1); + } + + zheader = (pe_zboot_header *)base_start_addr; + if (zheader->mz_magic != 0x5A4D) { // 'MZ' + fprintf(stderr, "Invalid DOS signature\n"); + return -1; + } + uint32_t pe_hdr_offset = get_pehdr_offset((const char *)base_start_addr); + base_cur = base_start_addr + pe_hdr_offset; + pe_hdr = (struct pe_hdr *)base_cur; + if (pe_hdr->magic!= 0x00004550) { // 'PE\0\0' + fprintf(stderr, "Invalid PE signature\n"); + return -1; + } + base_cur += sizeof(struct pe_hdr); + opt_hdr = (struct pe32plus_opt_hdr *)base_cur; + uint32_t file_align = opt_hdr->file_align; + uint32_t section_alignment = opt_hdr->section_align; + + uint16_t num_sections = pe_hdr->sections; + struct section_header *base_sections, *sect; + uint32_t section_table_offset = pe_hdr_offset + sizeof(struct pe_hdr) + pe_hdr->opt_hdr_size; + base_sections = (struct section_header *)(base_start_addr + section_table_offset); + + /* Decide the section idx and the payload offset within the section */ + for (i = 0; i < num_sections; i++) { + sect = &base_sections[i]; + if (zheader->payload_offset >= sect->data_addr && + zheader->payload_offset < (sect->data_addr + sect->raw_data_size)) { + payload_sect_idx = i; + payload_sect_off = zheader->payload_offset - sect->data_addr; + } + } + + /* Calculate the end of the last section in virtual memory */ + for (i = 0; i < num_sections; i++) { + uint32_t section_end = 
base_sections[i].virtual_address + base_sections[i].virtual_size; + if (section_end > max_va_end) { + max_va_end = section_end; + } + } + + /* Calculate virtual address for the new .bpf section */ + uint32_t bpf_virtual_address = ALIGN_UP(max_va_end, section_alignment); + + pe_zboot_header *new_zhdr = malloc(sizeof(pe_zboot_header)); + memcpy(new_zhdr, zheader, sizeof(pe_zboot_header)); + struct pe_hdr *new_hdr = malloc(sizeof(struct pe_hdr)); + memcpy(new_hdr, pe_hdr, sizeof(struct pe_hdr)); + new_hdr->sections += 1; + struct pe32plus_opt_hdr *new_opt_hdr = malloc(pe_hdr->opt_hdr_size); + memcpy(new_opt_hdr, opt_hdr, pe_hdr->opt_hdr_size); + /* Create new section headers array (original + new section) */ + struct section_header *new_sections = calloc(1, new_hdr->sections * sizeof(struct section_header)); + if (!new_sections) { + perror("Failed to allocate memory for new section headers"); + return -1; + } + memcpy(new_sections, base_sections, pe_hdr->sections * sizeof(struct section_header)); + + /* Configure the new .bpf section */ + struct section_header *bpf_section = &new_sections[new_hdr->sections - 1]; + memset(bpf_section, 0, sizeof(struct section_header)); + strncpy((char *)bpf_section->name, ".bpf", 8); + bpf_section->virtual_size = bin_size; + bpf_section->virtual_address = bpf_virtual_address; + bpf_section->raw_data_size = bin_size; + bpf_section->flags = 0x40000000; //Readable + + /* Update headers */ + uint32_t new_size_of_image = bpf_section->virtual_address + bpf_section->virtual_size; + new_size_of_image = ALIGN_UP(new_size_of_image, section_alignment); + new_opt_hdr->image_size = new_size_of_image; + + size_t section_table_size = new_hdr->sections * (sizeof(struct section_header)); + size_t headers_size = section_table_offset + section_table_size; + size_t aligned_headers_size = ALIGN_UP(headers_size, file_align); + new_opt_hdr->header_size = aligned_headers_size; + + + uint32_t current_offset = aligned_headers_size; + /* + * If the original PE data_addr is covered by enlarged header_size + * re-assign new data_addr for all sections + */ + if (base_sections[0].data_addr < aligned_headers_size) { + for (i = 0; i < new_hdr->sections; i++) { + new_sections[i].data_addr = current_offset; + current_offset += ALIGN_UP(new_sections[i].raw_data_size, file_align); + } + /* Keep unchanged, just allocating file pointer for bpf section */ + } else { + uint32_t t; + i = new_hdr->sections - 2; + t = new_sections[i].data_addr + new_sections[i].raw_data_size; + i++; + new_sections[i].data_addr = ALIGN_UP(t, file_align); + } + + payload_new_offset = new_sections[payload_sect_idx].data_addr + payload_sect_off; + /* Update */ + new_zhdr->payload_offset = payload_new_offset; + new_zhdr->payload_size = zheader->payload_size; + dprintf("zboot payload_offset updated from 0x%x to 0x%x, size:0x%x\n", + zheader->payload_offset, payload_new_offset, new_zhdr->payload_size); + + + /* compose the new PE file */ + + /* Write Dos header */ + memcpy(out_start_addr, new_zhdr, sizeof(pe_zboot_header)); + out_cur = out_start_addr + pe_hdr_offset; + + /* Write PE header */ + memcpy(out_cur, new_hdr, sizeof(struct pe_hdr)); + out_cur += sizeof(struct pe_hdr); + + /* Write PE optional header */ + memcpy(out_cur, new_opt_hdr, new_hdr->opt_hdr_size); + out_cur += new_hdr->opt_hdr_size; + + /* Write all section headers */ + memcpy(out_cur, new_sections, new_hdr->sections * sizeof(struct section_header)); + + /* Skip padding and copy the section data */ + for (i = 0; i < pe_hdr->sections; i++) { + base_cur = 
base_start_addr + base_sections[i].data_addr; + out_cur = out_start_addr + new_sections[i].data_addr; + memcpy(out_cur, base_cur, base_sections[i].raw_data_size); + } + msync(out_start_addr, new_sections[i].data_addr + new_sections[i].raw_data_size, MS_ASYNC); + /* For the bpf section */ + out_cur = out_start_addr + new_sections[i].data_addr; + + /* Write .bpf section data */ + char *bin_data = calloc(1, bin_size); + if (!bin_data) { + perror("Failed to allocate memory for binary data"); + free(base_sections); + free(new_sections); + ret = -1; + goto err; + } + if (fread(bin_data, bin_size, 1, bin_fp) != 1) { + perror("Failed to read binary data"); + free(bin_data); + free(base_sections); + free(new_sections); + ret = -1; + goto err; + } + + if (out_cur + bin_size > out_start_addr + out_sz) { + perror("out of out_fd mmap\n"); + ret = -1; + goto err; + } + memcpy(out_cur, bin_data, bin_size); + /* calculate the real size */ + out_sz = out_cur + bin_size - out_start_addr; + msync(out_start_addr, out_sz, MS_ASYNC); + /* truncate to the real size */ + if (ftruncate(out_fd, out_sz) == -1) { + perror("Failed to resize output file"); + ret = -1; + goto err; + } + printf("Create a new PE file with bpf section: %s\n", new_pe); +err: + munmap(out_start_addr, out_sz); + munmap(base_start_addr, sb.st_size); + close(base_fd); + close(out_fd); + close(bpf_fd); + + return ret; +} -- 2.49.0 From catalin.marinas at arm.com Mon Jul 21 22:49:32 2025 From: catalin.marinas at arm.com (Catalin Marinas) Date: Tue, 22 Jul 2025 06:49:32 +0100 Subject: [PATCHv4 10/12] arm64/kexec: Add PE image format support In-Reply-To: <20250722020319.5837-11-piliu@redhat.com> References: <20250722020319.5837-1-piliu@redhat.com> <20250722020319.5837-11-piliu@redhat.com> Message-ID: On Tue, Jul 22, 2025 at 10:03:17AM +0800, Pingfan Liu wrote: > Now everything is ready for kexec PE image parser. Select it on arm64 > for zboot and UKI image support. > > Signed-off-by: Pingfan Liu > Cc: Catalin Marinas > Cc: Will Deacon > To: linux-arm-kernel at lists.infradead.org Acked-by: Catalin Marinas From lkp at intel.com Tue Jul 22 09:42:08 2025 From: lkp at intel.com (kernel test robot) Date: Wed, 23 Jul 2025 00:42:08 +0800 Subject: [PATCHv4 03/12] bpf: Introduce bpf_copy_to_kernel() to buffer the content from bpf-prog In-Reply-To: <20250722020319.5837-4-piliu@redhat.com> References: <20250722020319.5837-4-piliu@redhat.com> Message-ID: <202507230035.9xLXz9Js-lkp@intel.com> Hi Pingfan, kernel test robot noticed the following build errors: [auto build test ERROR on bpf-next/net] [also build test ERROR on bpf/master arm64/for-next/core linus/master v6.16-rc7] [cannot apply to bpf-next/master next-20250722] [If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Pingfan-Liu/kexec_file-Make-kexec_image_load_default-global-visible/20250722-100843 base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git net patch link: https://lore.kernel.org/r/20250722020319.5837-4-piliu%40redhat.com patch subject: [PATCHv4 03/12] bpf: Introduce bpf_copy_to_kernel() to buffer the content from bpf-prog config: x86_64-buildonly-randconfig-003-20250722 (https://download.01.org/0day-ci/archive/20250723/202507230035.9xLXz9Js-lkp at intel.com/config) compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250723/202507230035.9xLXz9Js-lkp at intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202507230035.9xLXz9Js-lkp at intel.com/ All error/warnings (new ones prefixed by >>): kernel/bpf/helpers_carrier.c:84:17: warning: no previous prototype for 'bpf_mem_range_result_put' [-Wmissing-prototypes] 84 | __bpf_kfunc int bpf_mem_range_result_put(struct mem_range_result *result) | ^~~~~~~~~~~~~~~~~~~~~~~~ kernel/bpf/helpers_carrier.c:93:17: warning: no previous prototype for 'bpf_copy_to_kernel' [-Wmissing-prototypes] 93 | __bpf_kfunc int bpf_copy_to_kernel(const char *name, char *buf, int size) | ^~~~~~~~~~~~~~~~~~ kernel/bpf/helpers_carrier.c: In function 'bpf_copy_to_kernel': >> kernel/bpf/helpers_carrier.c:124:9: warning: enumeration value 'TYPE_VMAP' not handled in switch [-Wswitch] 124 | switch (alloc_type) { | ^~~~~~ -- >> ld: mm/mmu_notifier.o:mm/mmu_notifier.c:25: multiple definition of `__pcpu_unique_srcu_srcu_data'; kernel/bpf/helpers_carrier.o:kernel/bpf/helpers_carrier.c:13: first defined here -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki From lkp at intel.com Tue Jul 22 10:22:56 2025 From: lkp at intel.com (kernel test robot) Date: Wed, 23 Jul 2025 01:22:56 +0800 Subject: [PATCHv4 08/12] kexec: Factor out routine to find a symbol in ELF In-Reply-To: <20250722020319.5837-9-piliu@redhat.com> References: <20250722020319.5837-9-piliu@redhat.com> Message-ID: <202507230016.NQ1WXqwG-lkp@intel.com> Hi Pingfan, kernel test robot noticed the following build errors: [auto build test ERROR on bpf-next/net] [also build test ERROR on bpf/master arm64/for-next/core linus/master v6.16-rc7] [cannot apply to bpf-next/master next-20250722] [If your patch is applied to the wrong git tree, kindly drop us a note. 
And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Pingfan-Liu/kexec_file-Make-kexec_image_load_default-global-visible/20250722-100843 base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git net patch link: https://lore.kernel.org/r/20250722020319.5837-9-piliu%40redhat.com patch subject: [PATCHv4 08/12] kexec: Factor out routine to find a symbol in ELF config: x86_64-buildonly-randconfig-006-20250722 (https://download.01.org/0day-ci/archive/20250723/202507230016.NQ1WXqwG-lkp at intel.com/config) compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250723/202507230016.NQ1WXqwG-lkp at intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202507230016.NQ1WXqwG-lkp at intel.com/ All errors (new ones prefixed by >>): In file included from include/linux/crash_dump.h:5, from drivers/of/fdt.c:11: >> include/linux/kexec.h:544:7: error: unknown type name 'Elf_Sym' 544 | const Elf_Sym *elf_find_symbol(const Elf_Ehdr *ehdr, const char *name); | ^~~~~~~ >> include/linux/kexec.h:544:38: error: unknown type name 'Elf_Ehdr' 544 | const Elf_Sym *elf_find_symbol(const Elf_Ehdr *ehdr, const char *name); | ^~~~~~~~ vim +/Elf_Sym +544 include/linux/kexec.h 542 543 #if defined(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) || defined(CONFIG_KEXEC_PE_IMAGE) > 544 const Elf_Sym *elf_find_symbol(const Elf_Ehdr *ehdr, const char *name); 545 #endif 546 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki From lkp at intel.com Tue Jul 22 17:44:58 2025 From: lkp at intel.com (kernel test robot) Date: Wed, 23 Jul 2025 08:44:58 +0800 Subject: [PATCHv4 04/12] bpf: Introduce decompressor kfunc In-Reply-To: <20250722020319.5837-5-piliu@redhat.com> References: <20250722020319.5837-5-piliu@redhat.com> Message-ID: <202507230813.M1tmyREU-lkp@intel.com> Hi Pingfan, kernel test robot noticed the following build warnings: [auto build test WARNING on bpf-next/net] [also build test WARNING on bpf/master arm64/for-next/core linus/master v6.16-rc7] [cannot apply to bpf-next/master next-20250722] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Pingfan-Liu/kexec_file-Make-kexec_image_load_default-global-visible/20250722-100843 base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git net patch link: https://lore.kernel.org/r/20250722020319.5837-5-piliu%40redhat.com patch subject: [PATCHv4 04/12] bpf: Introduce decompressor kfunc config: arm-randconfig-004-20250723 (https://download.01.org/0day-ci/archive/20250723/202507230813.M1tmyREU-lkp at intel.com/config) compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 853c343b45b3e83cc5eeef5a52fc8cc9d8a09252) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250723/202507230813.M1tmyREU-lkp at intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. 
not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202507230813.M1tmyREU-lkp at intel.com/ All warnings (new ones prefixed by >>, old ones prefixed by <<): >> WARNING: modpost: vmlinux: section mismatch in reference: bpf_decompress+0xbc (section: .text.bpf_decompress) -> decompress_method (section: .init.text) -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
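A note on the modpost warning above, for readers less familiar with this class of report: it means that bpf_decompress(), which lives in ordinary .text and can run at any time, keeps a reference to decompress_method(), which the kernel places in .init.text and frees once boot is complete. The sketch below is illustration only -- the function names are invented and this is not the fix adopted by the series -- it just shows the minimal pattern that produces the same warning:

    #include <linux/init.h>

    /* Placed in .init.text; the memory is freed after boot. noinline keeps
     * the cross-section reference visible to modpost in this toy example. */
    static noinline int __init choose_method(void)
    {
            return 0;
    }

    /* Placed in regular .text and callable at run time. */
    int runtime_user(void)
    {
            return choose_method();  /* modpost: section mismatch in reference */
    }

The usual remedies are to drop __init from the callee so it stays resident, to annotate the caller __init (or __ref when an init-time-only reference is intentional), or to avoid calling into init code at run time altogether; which of these suits bpf_decompress() is a question for the series author.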