[PATCH] nvme: fix namespace removal list

Tue Jun 11 10:22:46 PDT 2024

On Tue, Jun 11, 2024 at 06:39:55PM +0200, Christoph Hellwig wrote:
> On Tue, Jun 11, 2024 at 08:20:55AM -0700, Keith Busch wrote:
> >  	mutex_lock(&ctrl->namespaces_lock);
> >  	list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
> > -		if (ns->head->ns_id > nsid)
> > -			list_splice_init_rcu(&ns->list, &rm_list,
> > -					     synchronize_rcu);
> > +		if (ns->head->ns_id > nsid) {
> > +			list_del_rcu(&ns->list);
> > +			list_add_tail_rcu(&ns->list, &rm_list);
> > +		}
> 
> Is this actually valid for a (S)RCU protected list?  If the entry gets
> added to the new list before the grace period has completed, we could
> trick a concurrent traversal into following the new list unless I'm
> mistaken (although chances I'm mistaken on RCU corner cases aren't that
> low..).

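Just to confirm, yes, if an RCU reader was referencing the object being
deleted from the first list, that reader really could be transported
(free of charge!) to the second list.  list_del_rcu() leaves the object's
->next pointer intact for the benefit of such readers, and
list_add_tail_rcu() then overwrites it to point at the second list's head.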

One (slow) way to avoid this involuntary transportation is to do something
like this:

		list_del_rcu(&ns->list);
		synchronize_rcu(); // drain all readers from ns.
		list_add_tail_rcu(&ns->list, &rm_list);

You could speed things up using synchronize_rcu_expedited(), with the
usual downsides: it IPIs other CPUs, which burns CPU time and disturbs
latency-sensitive workloads.
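
For anyone who wants to see the transportation happen, here is a
single-threaded userspace sketch, not kernel code.  The sketch_*() helpers
and the two-entry list are hypothetical stand-ins that only mimic the
pointer updates of list_del_rcu() and list_add_tail_rcu(), without the
memory barriers or poisoning, and "draining" is modeled by simply letting
the one reader finish before the re-add.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void sketch_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Same pointer updates as list_add_tail_rcu(), minus the barriers. */
static void sketch_add_tail(struct list_head *entry, struct list_head *head)
{
	struct list_head *prev = head->prev;

	entry->next = head;	/* overwrites whatever ->next held before */
	entry->prev = prev;
	prev->next = entry;
	head->prev = entry;
}

/* Like list_del_rcu(): unlink, but leave entry->next for readers. */
static void sketch_del(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
}

/*
 * A reader parked on pos resumes following ->next.  Returns the first
 * list head it reaches.  A real list_for_each_entry_rcu() walk tests
 * only against its own head, so it would not recognize the other head
 * as the end of the list.
 */
static struct list_head *sketch_resume(struct list_head *pos,
				       struct list_head *a,
				       struct list_head *b)
{
	for (int i = 0; i < 16; i++) {
		pos = pos->next;
		if (pos == a || pos == b)
			return pos;
	}
	return NULL;
}

/* Returns true if a reader that began on "namespaces" ends on "rm_list". */
static bool sketch_reader_transported(bool drain_first)
{
	struct list_head namespaces, rm_list, ns1, ns2;
	struct list_head *reader, *end;

	sketch_init(&namespaces);
	sketch_init(&rm_list);
	sketch_add_tail(&ns1, &namespaces);
	sketch_add_tail(&ns2, &namespaces);

	reader = &ns2;		/* reader is parked on ns2 */
	sketch_del(&ns2);

	if (drain_first) {
		/* Stand-in for synchronize_rcu(): reader finishes first. */
		end = sketch_resume(reader, &namespaces, &rm_list);
		sketch_add_tail(&ns2, &rm_list);
	} else {
		/* Re-add immediately: ns2.next now points at rm_list. */
		sketch_add_tail(&ns2, &rm_list);
		end = sketch_resume(reader, &namespaces, &rm_list);
	}

	assert(end != NULL);
	return end == &rm_list;
}
```

Calling sketch_reader_transported(false) models the patch as posted and
reports the reader landing on rm_list; sketch_reader_transported(true)
models the del/synchronize/add ordering and reports the reader ending on
the original list.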

There are faster ways of getting this effect, but the ones that I
know of involve putting generation numbers on all objects, duplicating
objects on each change, and having readers chase pointers to find the
desired version.  Usually not worth it, but I don't know enough about
this code to judge.

							Thanx, Paul