[PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N)

Tao Liu ltao at redhat.com
Sun Jul 13 16:37:25 PDT 2025


Hi YAMAZAKI,

On Sat, Jul 12, 2025 at 12:08 AM YAMAZAKI MASAMITSU(山崎 真光)
<yamazaki-msmt at nec.com> wrote:
>
> Sorry, I'm so late.

No worries :)
>
> I looked into the fix and I think it will work safely on other
> architectures as well. I think it will also solve the problem
> with ppc64. I accept and merge this patch.
>
> Thank you for reporting this problem and providing the fix for this
> very difficult issue.

Thanks for your response and merging!

Thanks,
Tao Liu

>
> Thanks,
>
> Masa
>
> On 2025/07/10 14:34, Tao Liu wrote:
> > Kindly ping...
> >
> > Sorry to interrupt, could you please merge the patch, since there are
> > a few bugs which depend on backporting this patch?
> >
> > Thanks,
> > Tao Liu
> >
> >
> > On Fri, Jul 4, 2025 at 7:51 PM Tao Liu <ltao at redhat.com> wrote:
> >> On Fri, Jul 4, 2025 at 6:49 PM HAGIO KAZUHITO(萩尾 一仁) <k-hagio-ab at nec.com> wrote:
> >>> On 2025/07/04 7:35, Tao Liu wrote:
> >>>> Hi Petr,
> >>>>
> >>>> On Fri, Jul 4, 2025 at 2:31 AM Petr Tesarik <ptesarik at suse.com> wrote:
> >>>>> On Tue, 1 Jul 2025 19:59:53 +1200
> >>>>> Tao Liu <ltao at redhat.com> wrote:
> >>>>>
> >>>>>> Hi Kazu,
> >>>>>>
> >>>>>> Thanks for your comments!
> >>>>>>
> >>>>>> On Tue, Jul 1, 2025 at 7:38 PM HAGIO KAZUHITO(萩尾 一仁) <k-hagio-ab at nec.com> wrote:
> >>>>>>> Hi Tao,
> >>>>>>>
> >>>>>>> thank you for the patch.
> >>>>>>>
> >>>>>>> On 2025/06/25 11:23, Tao Liu wrote:
> >>>>>>>> A vmcore corrupt issue has been noticed in powerpc arch [1]. It can be
> >>>>>>>> reproduced with upstream makedumpfile.
> >>>>>>>>
> > >>>>>>>> When analyzing the corrupt vmcore using crash, the following error
> > >>>>>>>> messages will be output:
> >>>>>>>>
> >>>>>>>>        crash: compressed kdump: uncompress failed: 0
> >>>>>>>>        crash: read error: kernel virtual address: c0001e2d2fe48000  type:
> >>>>>>>>        "hardirq thread_union"
> >>>>>>>>        crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000
> >>>>>>>>        crash: compressed kdump: uncompress failed: 0
> >>>>>>>>
> > >>>>>>>> If the vmcore is generated without the --num-threads option, then no
> > >>>>>>>> such errors are noticed.
> >>>>>>>>
> > >>>>>>>> With --num-threads=N enabled, N sub-threads are created. All
> > >>>>>>>> sub-threads are producers responsible for mm page processing, e.g.
> > >>>>>>>> compression. The main thread is the consumer responsible for
> > >>>>>>>> writing the compressed data into the file. page_flag_buf->ready is
> > >>>>>>>> used to synchronize the main and sub-threads. When a sub-thread
> > >>>>>>>> finishes processing a page, it sets the ready flag to FLAG_READY.
> > >>>>>>>> Meanwhile, the main thread loops over the ready flags of all
> > >>>>>>>> threads, and breaks out of the loop when it finds FLAG_READY.
> >>>>>>> I've tried to reproduce the issue, but I couldn't on x86_64.
> >>>>>> Yes, I cannot reproduce it on x86_64 either, but the issue is very
> >>>>>> easily reproduced on the ppc64 arch, which is where our QE reported it.
> >>>>> Yes, this is expected. X86 implements a strongly ordered memory model,
> >>>>> so a "store-to-memory" instruction ensures that the new value is
> >>>>> immediately observed by other CPUs.
> >>>>>
> >>>>> FWIW the current code is wrong even on X86, because it does nothing to
> >>>>> prevent compiler optimizations. The compiler is then allowed to reorder
> >>>>> instructions so that the write to page_flag_buf->ready happens after
> >>>>> other writes; with a bit of bad scheduling luck, the consumer thread
> >>>>> may see an inconsistent state (e.g. read a stale page_flag_buf->pfn).
> >>>>> Note that thanks to how compilers are designed (today), this issue is
> >>>>> more or less hypothetical. Nevertheless, the use of atomics fixes it,
> >>>>> because they also serve as memory barriers.
> >>> Thank you Petr, for the information.  I was wondering whether atomic
> >>> operations might be necessary for the other members of page_flag_buf,
> >>> but it looks like they won't be necessary in this case.
> >>>
> >>> Then I was convinced that the issue would be fixed by removing the
> >>> inconsistency of page_flag_buf->ready.  And the patch tested ok, so ack.
> >>>
> >> Thank you all for the patch review, patch testing and comments, these
> >> have been so helpful!
> >>
> >> Thanks,
> >> Tao Liu
> >>
> >>> Thanks,
> >>> Kazu
> >>>
> >>>> Thanks a lot for your detailed explanation, it's very helpful! I
> >>>> hadn't thought of the possibility of instruction reordering, or that
> >>>> atomic_rw prevents the reordering.
> >>>>
> >>>> Thanks,
> >>>> Tao Liu
> >>>>
> >>>>> Petr T
> >>>>>




More information about the kexec mailing list