[PATCH] nvme: fix SRCU protection of nvme_ns_head list

Caleb Sander csander at purestorage.com
Wed Nov 23 16:12:15 PST 2022


On Tue, Nov 22, 2022 at 7:08 AM Sagi Grimberg <sagi at grimberg.me> wrote:
>
>
> >> 3. removes ns from head sibling list + synchronize rcu
> >>     -> this should fence non-sleeping traversals (like revalidate_paths)
> >
> > Well, non-sleeping would only matter if those non-sleeping traversals
> > are under rcu_read_lock(), but they are not.  They are either part of
> > a longer srcu critical section because other code can sleep, or in
> > case of revalidate_paths unprotected at all (which this patch fixes).
>
> The original patch comment was that rcu_read_lock/unlock would be
> sufficient and we don't need to touch nvme_ns_remove()
>
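(To make sure I follow, the two options look roughly like this for the
sibling-list walk in nvme_mpath_revalidate_paths(), where head = ns->head
and capacity = get_capacity(head->disk). These are sketches based on my
reading of the upstream code, so the exact bodies may differ:)

/* Option A: plain RCU on the reader side, paired with the synchronize_rcu()
 * that nvme_ns_remove() already does after list_del_rcu(&ns->siblings), so
 * nvme_ns_remove() stays untouched: */
rcu_read_lock();
list_for_each_entry_rcu(ns, &head->list, siblings) {
        if (capacity != get_capacity(ns->disk))
                clear_bit(NVME_NS_READY, &ns->flags);
}
rcu_read_unlock();

/* Option B: use the head's SRCU on the reader side, which also means
 * nvme_ns_remove() must call synchronize_srcu(&head->srcu) rather than
 * synchronize_rcu() after unlinking the namespace: */
int srcu_idx;

srcu_idx = srcu_read_lock(&head->srcu);
list_for_each_entry_rcu(ns, &head->list, siblings) {
        if (capacity != get_capacity(ns->disk))
                clear_bit(NVME_NS_READY, &ns->flags);
}
srcu_read_unlock(&head->srcu, srcu_idx);
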
> >
> >> Maybe it is OK to have it also srcu locked and just accept that
> >> nshead sibling list is srcu protected. In that case, your patch
> >> needs to extend the srcu also the clearing of current_head pointer.
> >
> > I don't see how nvme_mpath_clear_current_path needs (S)RCU protection.
> > It never dereferences the current_path, it just checks it for pointer
> > equality and if they match clears it to NULL.  (I wonder if it should
> > use cmpxchg though).
>
> Agree, it can stay out, because at this point it does not compete with
> concurrent submissions due to prior synchronizations. The list traversal
> needs to be under rcu lock.
>
>
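(On the cmpxchg note -- just a hypothetical sketch of what I understand the
idea to be, not something the patch currently does: clear
head->current_path[node] only if it still points at the ns being removed, in
one atomic step instead of a separate compare and store. The __rcu annotation
would probably need a cast to keep sparse quiet; omitted here:)

for_each_node(node)
        cmpxchg(&head->current_path[node], ns, NULL);
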
> >
> >> But looking again at your bug report, you mention that there are
> >> concurrent scans, one removing the ns and another accessing it.
> >> That cannot happen due to the scan_lock held around this section afaict.
> >>
> >> I guess that in general ns removal can compete with a scan if some
> >> controller behavior caused an identify command to fail transiently in a
> >> prior scan, and a subsequent scan finds the namespace again? Worth
> >> pinning down exactly what happened in the race you got, because maybe we
> >> have a different issue that may manifest in other ways.
> >
> > So scanning itself should be single threaded as it only happens from
> > the workqueue.  But nvme_ns_remove can be called from
> > nvme_remove_namespaces, and in 6.1 and earlier from the passthrough
> > handler.
>
> The original patch report did not include any sequence that removes all
> namespaces, and given that it came from a RockyLinux 8.6 kernel, it is not
> 6.1... Hence I think that we need to understand how a namespace removal
> happened at the same time that the namespace is being scanned. Maybe
> something else is broken.

We can reliably reproduce the panic by having multiple controllers send a
"NS Changed" AEN at the same time, resulting in multiple scan works
running concurrently for the same namespace heads.
In our test, we have 4 controllers for the same subsystem and 9 namespaces
that are added one at a time, so each controller receives many AENs.
The dmesg logs show the controllers' scan works overlapping:
[37311.530367] nvme nvme0: queue_size 128 > ctrl sqsize 32, clamping down
[37311.530398] nvme nvme0: creating 32 I/O queues.
[37311.819883] nvme nvme0: mapped 32/0/0 default/read/poll queues.
[37311.828129] nvme nvme0: new ctrl: NQN
"nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
192.168.1.110:4420
[37311.924908] nvme nvme1: queue_size 128 > ctrl sqsize 32, clamping down
[37311.924935] nvme nvme1: creating 32 I/O queues.
[37312.298561] nvme nvme1: mapped 32/0/0 default/read/poll queues.
[37312.306296] nvme nvme1: new ctrl: NQN
"nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
192.168.3.110:4420
[37312.400143] nvme nvme2: queue_size 128 > ctrl sqsize 32, clamping down
[37312.400180] nvme nvme2: creating 32 I/O queues.
[37312.671861] nvme nvme2: mapped 32/0/0 default/read/poll queues.
[37312.678318] nvme nvme2: new ctrl: NQN
"nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
192.168.2.110:4420
[37312.760833] nvme nvme3: queue_size 128 > ctrl sqsize 32, clamping down
[37312.760860] nvme nvme3: creating 32 I/O queues.
[37313.123490] nvme nvme3: mapped 32/0/0 default/read/poll queues.
[37313.130407] nvme nvme3: new ctrl: NQN
"nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
192.168.4.110:4420
[37313.654120] nvme nvme3: rescanning namespaces.
[37313.654152] nvme nvme2: rescanning namespaces.
[37313.654867] nvme0n1: detected capacity change from 0 to 11811160064
[37313.654876] nvme0n1: detected capacity change from 0 to 11811160064
[37313.655573] nvme nvme0: rescanning namespaces.
[37313.655694] nvme nvme1: rescanning namespaces.
[37313.656405] nvme0n1: detected capacity change from 0 to 11811160064
[37313.656445] nvme0n1: detected capacity change from 0 to 11811160064
[37313.897745] nvme nvme3: rescanning namespaces.
[37313.897748] nvme nvme2: rescanning namespaces.
[37313.898614] nvme0n2: detected capacity change from 0 to 11811160064
[37313.907348] nvme nvme0: rescanning namespaces.
[37313.907409] nvme nvme1: rescanning namespaces.
[37314.191586] nvme nvme2: rescanning namespaces.
[37314.191589] nvme nvme3: rescanning namespaces.
[37314.193241] nvme nvme0: rescanning namespaces.
[37314.193303] nvme nvme1: rescanning namespaces.
[37314.205965] nvme0n3: detected capacity change from 0 to 11811160064
[37314.206026] nvme0n3: detected capacity change from 0 to 11811160064
[37314.206036] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000050
[37314.206036] nvme0n3: detected capacity change from 0 to 11811160064

I don't see any mechanism that prevents scan works from running concurrently.
scan_lock is per-controller, but there are 4 controllers in our test.
I'm not very familiar with workqueues in Linux, but it looks like they can
run multiple work items concurrently.
nvme-wq appears to be unbound (not per-CPU) and has a high max_active:
$ cat /sys/bus/workqueue/devices/nvme-wq/per_cpu
0
$ cat /sys/bus/workqueue/devices/nvme-wq/max_active
256
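
For reference, each controller queues its own scan work item on that shared
workqueue; roughly, from my reading of the upstream code (the RockyLinux
kernel may differ slightly):

/* nvme-wq is allocated unbound, with the default max_active: */
nvme_wq = alloc_workqueue("nvme-wq",
                WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);

/* and each controller schedules its own scan_work on it, so scans for
 * different controllers of the same subsystem can run in parallel: */
void nvme_queue_scan(struct nvme_ctrl *ctrl)
{
        /* only queue scan work when admin and IO queues are both alive */
        if (ctrl->state == NVME_CTRL_LIVE && ctrl->tagset)
                queue_work(nvme_wq, &ctrl->scan_work);
}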

At the end of each scan, nvme_remove_invalid_namespaces() is called,
which I think explains how we end up with namespace removals racing with
other controllers' scans.
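
Simplified, the race we are hitting looks roughly like this (function bodies
are from my reading of the upstream code before this patch, so the exact
lines may differ in our kernel):

/* controller A's scan work removing a namespace: */
static void nvme_ns_remove(struct nvme_ns *ns)
{
        ...
        mutex_lock(&ns->ctrl->subsys->lock);
        list_del_rcu(&ns->siblings);            /* unlink from head->list */
        mutex_unlock(&ns->ctrl->subsys->lock);
        synchronize_rcu();      /* only fences rcu_read_lock() readers */
        ...                     /* namespace goes away */
}

/* controller B's scan work revalidating paths on the same head: */
void nvme_mpath_revalidate_paths(struct nvme_ns *ns)
{
        struct nvme_ns_head *head = ns->head;
        sector_t capacity = get_capacity(head->disk);

        /* traversal is done without rcu_read_lock() or head->srcu, so it
         * can step onto a namespace that controller A is removing: */
        list_for_each_entry(ns, &head->list, siblings) {
                if (capacity != get_capacity(ns->disk))
                        clear_bit(NVME_NS_READY, &ns->flags);
        }
        ...
}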

Thanks,
Caleb


