Race condition in build_all_zonelists() when offlining movable zone

Mel Gorman mgorman at suse.de
Tue Aug 23 01:33:49 PDT 2022


On Tue, Aug 23, 2022 at 08:36:34AM +0200, David Hildenbrand wrote:
> > @@ -6517,6 +6538,7 @@ static void __build_all_zonelists(void *data)
> >  	static DEFINE_SPINLOCK(lock);
> >  
> >  	spin_lock(&lock);
> > +	write_seqcount_begin(&zonelist_update_seq);
> >  
> >  #ifdef CONFIG_NUMA
> >  	memset(node_load, 0, sizeof(node_load));
> > @@ -6553,6 +6575,7 @@ static void __build_all_zonelists(void *data)
> >  #endif
> >  	}
> >  
> > +	write_seqcount_end(&zonelist_update_seq);
> >  	spin_unlock(&lock);
> 
> Do we want to get rid of the static lock by using a seqlock_t instead of
> a seqcount_t?
> 

I do not think so because it's a relatively heavy lock compared to the
counter and the read side.

As the read-side can be called from hardirq or softirq context, the
write-side needs to disable IRQs or bottom halves as well according to the
documentation. That is relatively minor as the write side is rare, but it's
trickier because the calling context can be either hardirq or softirq, so
the heavier IRQ protection would have to be used.
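
To illustrate, this is roughly what the write side would look like if
zonelist_update_seq were a seqlock_t instead. It is a sketch only, not a
proposal, reusing the names from the patch above:

	/*
	 * Sketch only: with a seqlock_t the lock itself replaces the
	 * static spinlock, but because readers may run in hardirq or
	 * softirq context the writer must disable interrupts for the
	 * whole (rare, but not cheap) zonelist rebuild.
	 */
	static DEFINE_SEQLOCK(zonelist_update_seq);

	static void __build_all_zonelists(void *data)
	{
		unsigned long flags;

		write_seqlock_irqsave(&zonelist_update_seq, flags);
		/* rebuild node_load[] and the per-node zonelists here */
		write_sequnlock_irqrestore(&zonelist_update_seq, flags);
	}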

The read side also gets more complicated. It becomes either
read_seqlock_excl() (or the bh/irq variant as appropriate) or
read_seqbegin_or_lock(). read_seqlock_excl() acts like a spinlock blocking
out other readers and writers, and we definitely do not want to block out
other readers in the allocator path because that is crazy; it's basically a
global memory allocator lock. There is no obvious way of limiting the scope
of that lock, such as to a single zone, because it's the zonelists we care
about, not an individual zone. I guess it could be done on a pgdat basis
selected by the preferred zone, but that is also unnecessarily complicated
and still a relatively heavy lock.
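
For reference, the exclusive read side would look like the sketch below,
assuming the seqlock_t conversion above; zonelist_read_excl() is a made-up
helper for illustration:

	/*
	 * Sketch only: read_seqlock_excl() takes the underlying spinlock,
	 * so every reader in the allocator path serialises on one global
	 * lock and also blocks the writer.
	 */
	static void zonelist_read_excl(void)
	{
		read_seqlock_excl(&zonelist_update_seq);
		/* walk the zonelists */
		read_sequnlock_excl(&zonelist_update_seq);
	}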

The other obvious choice is read_seqbegin_or_lock() to try locklessly and
then retry with the lock if necessary. This has better semantics as a
lockless version exists with the caller tracking more state but, again,
the retry step is heavy and acts as a global lock.
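
The usual shape of that pattern is below (sketch only; the helper name is
made up, and the second pass through the loop takes the lock exclusively,
which is where the global-lock behaviour on retry comes from):

	/* Sketch only: lockless first pass, locked retry. */
	static void zonelist_read_or_lock(void)
	{
		int seq = 0;

		do {
			read_seqbegin_or_lock(&zonelist_update_seq, &seq);
			/* walk the zonelists */
		} while (need_seqretry(&zonelist_update_seq, seq));
		done_seqretry(&zonelist_update_seq, seq);
	}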

In this case, the seqcounter or seqlock is protecting relatively simple
data -- the zonelist pointing to struct zones that never disappear (a
zone might be unpopulated but the struct zone still exists). The critical
data being protected in this context is either the PCP lists or the buddy
lists, each of which has separate locking. The zonelist needs less
protection, although RCU protection would be a potential, if somewhat
complicated, option, as even if the zonelist itself is valid after an RCU
update, the zones listed are not necessarily useful any more.
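
For contrast, the read side that pairs with the plain seqcount in the
patch above stays cheap: snapshot the counter before walking the zonelists
and, if the allocation is about to fail, retry the allocation when the
counter has moved. Roughly, with illustrative helper names:

	/* Sketch only: read side for the seqcount_t in the patch above. */
	static unsigned int zonelist_iter_begin(void)
	{
		return read_seqcount_begin(&zonelist_update_seq);
	}

	static bool zonelist_changed(unsigned int seq)
	{
		return read_seqcount_retry(&zonelist_update_seq, seq);
	}

The allocator slowpath would call zonelist_iter_begin() once and check
zonelist_changed() before failing the allocation, retrying from the top
instead of spinning on the counter.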

There is no real advantage to using seqcount_spinlock_t either, as the
associated lock would be a global lock, and there isn't any lockdep
benefit to doing that.
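
For completeness, the association would look something like the sketch
below; the currently function-local static lock would have to become
global (zonelist_lock is a made-up name) so the seqcount can reference it:

	/* Sketch only: tying the seqcount to a now-global zonelist lock. */
	static DEFINE_SPINLOCK(zonelist_lock);
	static seqcount_spinlock_t zonelist_update_seq =
		SEQCNT_SPINLOCK_ZERO(zonelist_update_seq, &zonelist_lock);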

As the alternatives have side effects, I would prefer to see any proposed
conversion on top of the fix with review determining if there is any
unnecessary additional serialisation.

-- 
Mel Gorman
SUSE Labs


