[PATCH v2][makedumpfile] Fix a data race in multi-threading mode (--num-threads=N)
Tao Liu
ltao at redhat.com
Wed Jul 9 22:34:34 PDT 2025
Kindly ping...
Sorry to interrupt, but could you please merge the patch? There are a
few bug fixes which depend on backporting this patch.
Thanks,
Tao Liu
On Fri, Jul 4, 2025 at 7:51 PM Tao Liu <ltao at redhat.com> wrote:
>
> On Fri, Jul 4, 2025 at 6:49 PM HAGIO KAZUHITO(萩尾 一仁) <k-hagio-ab at nec.com> wrote:
> >
> > On 2025/07/04 7:35, Tao Liu wrote:
> > > Hi Petr,
> > >
> > > On Fri, Jul 4, 2025 at 2:31 AM Petr Tesarik <ptesarik at suse.com> wrote:
> > >>
> > >> On Tue, 1 Jul 2025 19:59:53 +1200
> > >> Tao Liu <ltao at redhat.com> wrote:
> > >>
> > >>> Hi Kazu,
> > >>>
> > >>> Thanks for your comments!
> > >>>
> > >>> On Tue, Jul 1, 2025 at 7:38 PM HAGIO KAZUHITO(萩尾 一仁) <k-hagio-ab at nec.com> wrote:
> > >>>>
> > >>>> Hi Tao,
> > >>>>
> > >>>> thank you for the patch.
> > >>>>
> > >>>> On 2025/06/25 11:23, Tao Liu wrote:
> > >>>>> A vmcore corruption issue has been noticed on the powerpc arch [1]. It
> > >>>>> can be reproduced with upstream makedumpfile.
> > >>>>>
> > >>>>> When analyzing the corrupted vmcore with crash, the following error
> > >>>>> messages are output:
> > >>>>>
> > >>>>> crash: compressed kdump: uncompress failed: 0
> > >>>>> crash: read error: kernel virtual address: c0001e2d2fe48000 type:
> > >>>>> "hardirq thread_union"
> > >>>>> crash: cannot read hardirq_ctx[930] at c0001e2d2fe48000
> > >>>>> crash: compressed kdump: uncompress failed: 0
> > >>>>>
> > >>>>> If the vmcore is generated without the --num-threads option, then no
> > >>>>> such errors are observed.
> > >>>>>
> > >>>>> With --num-threads=N enabled, N sub-threads are created. All
> > >>>>> sub-threads are producers responsible for mm page processing, e.g.
> > >>>>> compression. The main thread is the consumer responsible for writing
> > >>>>> the compressed data into the file. page_flag_buf->ready is used to
> > >>>>> synchronize the main thread and the sub-threads. When a sub-thread
> > >>>>> finishes processing a page, it sets the ready flag to FLAG_READY.
> > >>>>> Meanwhile, the main thread loops over the ready flags of all
> > >>>>> threads, and breaks out of the loop when it finds FLAG_READY.
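> > >>>>>
> > >>>>> Roughly, the pattern looks like this (a simplified sketch, not the
> > >>>>> exact makedumpfile code; consume() is a stand-in for the real
> > >>>>> write-out path):
> > >>>>>
> > >>>>>     /* sub-thread (producer), simplified: */
> > >>>>>     page_flag_buf->pfn = pfn;          /* publish the processed page */
> > >>>>>     page_flag_buf->ready = FLAG_READY; /* plain store, no barrier */
> > >>>>>
> > >>>>>     /* main thread (consumer), simplified: */
> > >>>>>     while (page_flag_buf->ready != FLAG_READY)
> > >>>>>             ;                          /* plain load, no barrier */
> > >>>>>     consume(page_flag_buf);            /* may observe a stale pfn */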
> > >>>>
> > >>>> I've tried to reproduce the issue, but I couldn't on x86_64.
> > >>>
> > >>> Yes, I cannot reproduce it on x86_64 either, but the issue is very
> > >>> easily reproduced on the ppc64 arch, which is where our QE reported it.
> > >>
> > >> Yes, this is expected. X86 implements a strongly ordered memory model,
> > >> so a "store-to-memory" instruction ensures that the new value is
> > >> immediately observed by other CPUs.
> > >>
> > >> FWIW the current code is wrong even on X86, because it does nothing to
> > >> prevent compiler optimizations. The compiler is then allowed to reorder
> > >> instructions so that the write to page_flag_buf->ready happens after
> > >> other writes; with a bit of bad scheduling luck, the consumer thread
> > >> may see an inconsistent state (e.g. read a stale page_flag_buf->pfn).
> > >> Note that thanks to how compilers are designed (today), this issue is
> > >> more or less hypothetical. Nevertheless, the use of atomics fixes it,
> > >> because they also serve as memory barriers.
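> > >>
> > >> For illustration, the needed ordering can be expressed with C11
> > >> atomics (a sketch only, assuming ->ready is given an atomic type;
> > >> the actual patch may use different primitives):
> > >>
> > >>     #include <stdatomic.h>
> > >>
> > >>     /* producer: fill the buffer, then release-store the flag */
> > >>     page_flag_buf->pfn = pfn;
> > >>     atomic_store_explicit(&page_flag_buf->ready, FLAG_READY,
> > >>                           memory_order_release);
> > >>
> > >>     /* consumer: acquire-load the flag before touching the buffer */
> > >>     while (atomic_load_explicit(&page_flag_buf->ready,
> > >>                                 memory_order_acquire) != FLAG_READY)
> > >>             ;
> > >>     /* all writes made before the release store are visible here */
> > >>
> > >> A release/acquire pair is enough for a single ready flag guarding one
> > >> buffer: the release store forbids reordering the data writes past the
> > >> flag, and the acquire load forbids hoisting the data reads above it.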
> >
> > Thank you, Petr, for the information. I was wondering whether atomic
> > operations might be necessary for the other members of page_flag_buf,
> > but it looks like they aren't needed in this case.
> >
> > Then I was convinced that the issue would be fixed by removing the
> > inconsistent handling of page_flag_buf->ready. The patch also tested
> > OK, so ack.
> >
>
> Thank you all for the patch review, patch testing and comments, these
> have been so helpful!
>
> Thanks,
> Tao Liu
>
> > Thanks,
> > Kazu
> >
> > >
> > > Thanks a lot for your detailed explanation, it's very helpful! I
> > > hadn't thought of the possibility of instruction reordering, nor that
> > > atomic_rw prevents the reordering.
> > >
> > > Thanks,
> > > Tao Liu
> > >
> > >>
> > >> Petr T
> > >>