Race condition in build_all_zonelists() when offlining movable zone

Mel Gorman mgorman at suse.de
Wed Aug 17 04:26:47 PDT 2022


On Wed, Aug 17, 2022 at 12:54:08PM +0200, Michal Hocko wrote:
> On Wed 17-08-22 11:40:28, Mel Gorman wrote:
> > On Wed, Aug 17, 2022 at 08:59:11AM +0200, Michal Hocko wrote:
> > > > In order to address that, we should either call first_zones_zonelist
> > > > inside get_page_from_freelist when the zoneref doesn't correspond to a
> > > > real zone in the zonelist, or we should revisit my older approach
> > > > referenced above.
> > > 
> > > Would this work? It is not really great to pay an overhead for unlikely
> > > event in the hot path but we might use a similar trick to check_retry_cpuset
> > > in the slowpath to detect this situation. 
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index b0bcab50f0a3..bce786d7fcb4 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4098,7 +4098,17 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> > >  	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
> > >  	 */
> > >  	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
> > > +
> > > +	/*
> > > +	 * A race with memory offlining could alter zones on the zonelist
> > > +	 * e.g. dropping the top (movable) zone if it gets unpopulated
> > > +	 * and so preferred_zoneref is not valid anymore
> > > +	 */
> > > +	if (unlikely(!ac->preferred_zoneref->zone))
> > > +		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> > > +					ac->highest_zoneidx, ac->nodemask);
> > >  	z = ac->preferred_zoneref;
> > > +
> > 
> > ac->preferred_zoneref->zone could still be a valid pointer to a zone,
> > but an empty one, so that would imply
> 
> OK, I managed to confuse myself. I thought the zoneref always points to
> a specific zone in the zonelist.

It's a bit confusing.

> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index e5486d47406e..38ce123af543 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5191,6 +5191,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  	if (check_retry_cpuset(cpuset_mems_cookie, ac))
> >  		goto retry_cpuset;
> >  
> > +	/* Hotplug could have drained the preferred zone. */
> > +	if (!populated_zone(ac->preferred_zoneref->zone))
> > +		goto retry_cpuset;
> > +
> >  	/* Reclaim has failed us, start killing things */
> >  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> >  	if (page)
> 
> Would it be possible to have preferred_zoneref always unpopulated?
>  

I don't think so but this is not my preferred approach anyway.

> > But even that is fragile. If there were multiple zones in the zonelist
> > and the preferred zone was further down the list, the zone could still
> > be populated but a different zone than expected. It may be better to have
> > the same type of seq counter that restarts the allocation attempt if the
> > zonelist changes.
> > 
> > So.... this? It is a plain seqcount with a basic lock, as there already is a
> > full lock on the writer side, and it would appear to be overkill to protect
> > the reader side with read_seqbegin_or_lock as that complicates the writer side.
> > 
> > (boot tested only)
> 
> yes, from a quick look it seems OK, and it matches the handling of a race for
> cpusets, which is not the same thing but very similar in principle. So
> conceptually it looks good to me.

Patrick, as you have a hotplug test that removes a full zone, would you
like to test this? I haven't boot tested it on v5.15, but it applies cleanly
and at least compiles.

-- 
Mel Gorman
SUSE Labs


