Race condition in build_all_zonelists() when offlining movable zone

Mel Gorman mgorman at suse.de
Tue Aug 23 02:49:50 PDT 2022


On Tue, Aug 23, 2022 at 10:52:34AM +0200, David Hildenbrand wrote:
> On 23.08.22 10:33, Mel Gorman wrote:
> > On Tue, Aug 23, 2022 at 08:36:34AM +0200, David Hildenbrand wrote:
> >>> @@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data)
> >>>  	static DEFINE_SPINLOCK(lock);
> >>>  
> >>>  	spin_lock(&lock);
> >>> +	write_seqcount_begin(&zonelist_update_seq);
> >>>  
> >>>  #ifdef CONFIG_NUMA
> >>>  	memset(node_load, 0, sizeof(node_load));
> >>> @@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data)
> >>>  #endif
> >>>  	}
> >>>  
> >>> +	write_seqcount_end(&zonelist_update_seq);
> >>>  	spin_unlock(&lock);
> >>
> >> Do we want to get rid of the static lock by using a seqlock_t instead of
> >> a seqcount_t?
> >>
> > 
> > I do not think so because it's a relatively heavy lock compared to the
> > counter and the read side.
> 
> I was primarily asking because seqlock.h states: "Sequential locks
> (seqlock_t):  Sequence counters with an embedded spinlock for writer
> serialization and non-preemptibility.", which seems to be precisely what
> we are doing here.
> 
> > 
> > As the read-side can be called from hardirq or softirq context, the
> > write-side needs to disable irqs or bottom halves as well according to the
> > documentation. That is relatively minor as the write side is rare but it's
> > trickier because the calling context can be either IRQ or softirq so the IRQ
> > protection would have to be used.
> 
> Naive me would just have used write_seqlock()/write_sequnlock() and
> read_seqbegin()/read_seqretry() to result in almost the same code as
> with your change -- but having both mechanisms encapsulated.
> 
> 
> Yeah, there are special write_seqlock_bh()/write_sequnlock_bh(),
> write_sequnlock_irq() ... but IIRC we don't have to care about that at
> all when just using the primitives as above. But most probably I am
> missing something important.
> 

You're not missing anything important. I'm just not a massive fan of the
API naming because it's unclear from the context whether it's a plain counter
or a locked counter, and I felt it was better to keep the locking explicit.
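
For reference, the two shapes under discussion look roughly like this (a
sketch only, not part of the patch; the writer functions are placeholders):

	/* (a) seqcount_t with an explicit spinlock for writer serialization */
	static seqcount_t update_seq = SEQCNT_ZERO(update_seq);

	static void rebuild_with_seqcount(void)
	{
		static DEFINE_SPINLOCK(lock);

		spin_lock(&lock);
		write_seqcount_begin(&update_seq);
		/* ... update the protected data ... */
		write_seqcount_end(&update_seq);
		spin_unlock(&lock);
	}

	/* (b) seqlock_t with the spinlock embedded in the lock itself */
	static DEFINE_SEQLOCK(update_seqlock);

	static void rebuild_with_seqlock(void)
	{
		write_seqlock(&update_seqlock);
		/* ... update the protected data ... */
		write_sequnlock(&update_seqlock);
	}

The read side pairs read_seqcount_begin()/read_seqcount_retry() with (a) and
read_seqbegin()/read_seqretry() with (b); the patch below switches to (b).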

A seqlock version is below. I updated the comments and naming to make it
clear that the read-side is for iteration and what the locking protocol is,
and to match the retry naming with the cpuset equivalent. It boots on KVM
but would need another test from Patrick to be certain it still works.
Patrick, would you mind testing this version please?
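
For clarity, the retry protocol on the allocation side then looks roughly
like this (a sketch only; try_allocate() is a placeholder standing in for
the real slowpath work, not a kernel function):

	static struct page *slowpath_sketch(gfp_t gfp_mask, unsigned int order)
	{
		struct page *page;
		unsigned int cookie;

	retry:
		cookie = read_seqbegin(&zonelist_update_seq);

		/* Placeholder for the real allocation attempt. */
		page = try_allocate(gfp_mask, order);
		if (page)
			return page;

		/*
		 * If the zonelists were rebuilt while the attempt was failing,
		 * the failure may be spurious; retry rather than declare OOM.
		 */
		if (read_seqretry(&zonelist_update_seq, cookie))
			goto retry;

		return NULL;
	}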

---8<---
 mm/page_alloc.c | 33 ++++++++++++++++++++++++++++-----
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5486d47406e..a644c7b638a3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4708,6 +4708,24 @@ void fs_reclaim_release(gfp_t gfp_mask)
 EXPORT_SYMBOL_GPL(fs_reclaim_release);
 #endif
 
+/*
+ * Zonelists may change due to memory hotplug while an allocation is in
+ * progress. Detect when zonelists have been rebuilt so the allocation can
+ * be retried. The reader side takes no lock and simply retries if the
+ * zonelist changed. The writer side is serialized by the embedded spinlock.
+ */
+DEFINE_SEQLOCK(zonelist_update_seq);
+
+static unsigned int zonelist_iter_begin(void)
+{
+	return read_seqbegin(&zonelist_update_seq);
+}
+
+static unsigned int check_retry_zonelist(unsigned int seq)
+{
+	return read_seqretry(&zonelist_update_seq, seq);
+}
+
 /* Perform direct synchronous page reclaim */
 static unsigned long
 __perform_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -5001,6 +5019,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int compaction_retries;
 	int no_progress_loops;
 	unsigned int cpuset_mems_cookie;
+	unsigned int zonelist_iter_cookie;
 	int reserve_flags;
 
 	/*
@@ -5016,6 +5035,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	no_progress_loops = 0;
 	compact_priority = DEF_COMPACT_PRIORITY;
 	cpuset_mems_cookie = read_mems_allowed_begin();
+	zonelist_iter_cookie = zonelist_iter_begin();
 
 	/*
 	 * The fast path uses conservative alloc_flags to succeed only until
@@ -5187,8 +5207,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto retry;
 
 
-	/* Deal with possible cpuset update races before we start OOM killing */
-	if (check_retry_cpuset(cpuset_mems_cookie, ac))
+	/*
+	 * Deal with possible cpuset update races or zonelist updates to avoid
+	 * an unnecessary OOM kill.
+	 */
+	if (check_retry_cpuset(cpuset_mems_cookie, ac) ||
+	    check_retry_zonelist(zonelist_iter_cookie))
 		goto retry_cpuset;
 
 	/* Reclaim has failed us, start killing things */
@@ -6514,9 +6538,8 @@ static void __build_all_zonelists(void *data)
 	int nid;
 	int __maybe_unused cpu;
 	pg_data_t *self = data;
-	static DEFINE_SPINLOCK(lock);
 
-	spin_lock(&lock);
+	write_seqlock(&zonelist_update_seq);
 
 #ifdef CONFIG_NUMA
 	memset(node_load, 0, sizeof(node_load));
@@ -6553,7 +6576,7 @@ static void __build_all_zonelists(void *data)
 #endif
 	}
 
-	spin_unlock(&lock);
+	write_sequnlock(&zonelist_update_seq);
 }
 
 static noinline void __init


