[PATCH v3 1/2] mm: cma: fix allocation may fail sometimes

Wed Mar 16 20:49:16 PDT 2022

On Thu, Mar 17, 2022 at 5:09 AM Andrew Morton <akpm at linux-foundation.org> wrote:
>
> On Wed, 16 Mar 2022 11:41:37 +0800 Dong Aisheng <dongas86 at gmail.com> wrote:
>
> > On Wed, Mar 16, 2022 at 6:58 AM Andrew Morton <akpm at linux-foundation.org> wrote:
> > >
> > > On Tue, 15 Mar 2022 22:45:20 +0800 Dong Aisheng <aisheng.dong at nxp.com> wrote:
> > >
> > > > --- a/mm/cma.c
> > > > +++ b/mm/cma.c
> > > >
> > > > ...
> > > >
> > > > @@ -457,6 +458,16 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
> > > >                               offset);
> > > >               if (bitmap_no >= bitmap_maxno) {
> > > >                       spin_unlock_irq(&cma->lock);
> > > > +                     pr_debug("%s(): alloc fail, retry loop %d\n", __func__, loop++);
> > > > +                     /*
> > > > +                      * rescan as others may finish the memory migration
> > > > +                      * and quit if no available CMA memory found finally
> > > > +                      */
> > > > +                     if (start) {
> > > > +                             schedule();
> > > > +                             start = 0;
> > > > +                             continue;
> > > > +                     }
> > > >                       break;
> > >
> > > The schedule() is problematic. For a start, we'd normally use
> > > cond_resched() here, so we avoid calling the more expensive schedule()
> > > if we know it won't perform any action.
> > >
> > > But cond_resched() is problematic if this thread has realtime
> > > scheduling policy and the process we're waiting on does not.  One way
> > > to address that is to use an unconditional msleep(1), but that's still
> > > just a hack.
> > >
> >
> > I think we can simply drop schedule() here during the second round of retry
> > as the estimated delay may not be really needed.
>
> That will simply cause a tight loop, so I'm obviously not understanding
> the proposal.
>

IIUC the original code is already a tight loop, isn't it?
You could also see my observation, thousands of retries, in patch 2.
The logic in this patch is just retry the original loop in case in case there's
a false possive error return.

Or you mean infinite loop? The loop will break out when meet an non EBUSY
error in alloc_contig_range().

BTW, the tight loop situation could be improved a lot by my patch 2.

And after Zi Yan's patchset [1] got merged, the situation could be
further improved by retring in pageblock step.
1. [v7,0/5] Use pageblock_order for cma and alloc_contig_range
alignment. - Patchwork (kernel.org)

So generally i wonder it seems still better than simply revert.
Please fix me if i still missed something.

Regards
Aisheng