Race condition in build_all_zonelists() when offlining movable zone

Michal Hocko mhocko at suse.com
Tue Aug 16 23:59:11 PDT 2022


On Wed 17-08-22 08:38:13, Michal Hocko wrote:
> [Cc Mel, David and Juergen]
> 
> On Tue 16-08-22 20:42:50, Patrick Daly wrote:
> > System: arm64 with 5.15 based kernel. CONFIG_NUMA=n.
> > 
> > NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - before offline operation
> > [0] - ZONE_MOVABLE
> > [1] - ZONE_NORMAL
> > [2] - NULL
> > 
> > For a GFP_KERNEL allocation, alloc_pages_slowpath() will save the offset of
> > ZONE_NORMAL in ac->preferred_zoneref. If a concurrent memory_offline operation
> > removes the last page from ZONE_MOVABLE, build_all_zonelists() &
> > build_zonerefs_node() will update node_zonelists as shown below. Only
> > populated zones are added.
> >
> > NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK] - after offline operation
> > [0] - ZONE_NORMAL
> > [1] - NULL
> > [2] - NULL  
> > 
> > The thread in alloc_pages_slowpath() will call get_page_from_freelist()
> > repeatedly to allocate from the zones in zonelist beginning from
> > preferred_zoneref. Since this is now NULL, it will never succeed, and OOM
> > killer will kill all killable processes.
> >
> > I noticed a comment on a recent change bb7645c33869 ("mm, page_alloc:
> > fix build_zonerefs_node()") which appeared to be relevant, but later
> > replies indicated concerns with performance implications.
> > https://lore.kernel.org/linux-mm/Yk7NqTlw7lmFzpKb@dhcp22.suse.cz/
> 
> I guess you mean e553f62f10d9 here. After re-reading the discussion I
> seem to remember. We've decided to go with a simple fix (the said
> commit) but I do not think we have realized this side effect of the
> zonelists index invalidating.
> 
> In order to address that, we should either have to call first_zones_zonelist
> inside get_page_from_freelist if the zoneref doesn't correspond to a
> real zone in the zonelist or we should revisit my older approach
> referenced above.

Would this work? It is not really great to pay an overhead for unlikely
event in the hot path but we might use a similar trick to check_retry_cpuset
in the slowpath to detect this situation. 

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b0bcab50f0a3..bce786d7fcb4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4098,7 +4098,17 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
 	 */
 	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
+
+	/*
+	 * A race with memory offlining could alter zones on the zonelist
+	 * e.g. dropping the top (movable) zone if it gets unpoppulated
+	 * and so preferred_zoneref is not valid anymore
+	 */
+	if (unlikely(!ac->preferred_zoneref->zone))
+		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
+					ac->highest_zoneidx, ac->nodemask);
 	z = ac->preferred_zoneref;
+
 	for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
 					ac->nodemask) {
 		struct page *page;
-- 
Michal Hocko
SUSE Labs



More information about the linux-arm-kernel mailing list