Race condition in build_all_zonelists() when offlining movable zone

Michal Hocko mhocko at suse.com
Wed Aug 17 03:54:08 PDT 2022


On Wed 17-08-22 11:40:28, Mel Gorman wrote:
> On Wed, Aug 17, 2022 at 08:59:11AM +0200, Michal Hocko wrote:
> > > In order to address that, we should either have to call first_zones_zonelist
> > > inside get_page_from_freelist if the zoneref doesn't correspond to a
> > > real zone in the zonelist or we should revisit my older approach
> > > referenced above.
> > 
> > Would this work? It is not really great to pay an overhead for unlikely
> > event in the hot path but we might use a similar trick to check_retry_cpuset
> > in the slowpath to detect this situation. 
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index b0bcab50f0a3..bce786d7fcb4 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4098,7 +4098,17 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
> >  	 */
> >  	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
> > +
> > +	/*
> > +	 * A race with memory offlining could alter zones on the zonelist
> > +	 * e.g. dropping the top (movable) zone if it gets unpoppulated
> > +	 * and so preferred_zoneref is not valid anymore
> > +	 */
> > +	if (unlikely(!ac->preferred_zoneref->zone))
> > +		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> > +					ac->highest_zoneidx, ac->nodemask);
> >  	z = ac->preferred_zoneref;
> > +
> 
> ac->preferred_zoneref->zone could still be a valid pointer to a zone,
> but an empty one so that would imply

OK, I managed to confuse myself. I thought the zoneref always points to
a specific zone in the zonelist.
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5486d47406e..38ce123af543 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5191,6 +5191,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (check_retry_cpuset(cpuset_mems_cookie, ac))
>  		goto retry_cpuset;
>  
> +	/* Hotplug could have drained the preferred zone. */
> +	if (!populated_zone(ac->preferred_zoneref->zone))
> +		goto retry_cpuset;
> +
>  	/* Reclaim has failed us, start killing things */
>  	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
>  	if (page)

Would it be possible to have preferred_zoneref always unpopulated?
 
> But even that is fragile. If there were multiple zones in the zonelist
> and the preferred zone was further down the list, the zone could still
> be populated but a different zone than expected. It may be better to have
> the same type of seq counter that restarts the allocation attempt if the
> zonelist changes.
> 
> So.... this? It is seqcount only with a basic lock as there already is a
> full lock on the writer side and it would appear to be overkill to protect
> the reader side with read_seqbegin_or_lock as it complicates the writer side.
> 
> (boot tested only)

yes, from a quick look it seems ok and it matches handling of a race for
cpuset which is not the same thing but very similar in principle. So
conceptually looks good to me.
-- 
Michal Hocko
SUSE Labs



More information about the linux-arm-kernel mailing list