[PATCH] nvme: fix SRCU protection of nvme_ns_head list

Chao Leng lengchao at huawei.com
Wed Nov 23 19:08:56 PST 2022



On 2022/11/24 8:12, Caleb Sander wrote:
> On Tue, Nov 22, 2022 at 7:08 AM Sagi Grimberg <sagi at grimberg.me> wrote:
>>
>>
>>>> 3. removes ns from head sibling list + synchronize rcu
>>>>      -> this should fence non-sleeping traversals (like revalidate_paths)
>>>
>>> Well, non-sleeping would only matter if those non-sleeping traversals
>>> are under rcu_read_lock(), but they are not.  They are either part of
>>> a longer srcu critical section because other code can sleep, or, in
>>> the case of revalidate_paths, unprotected at all (which this patch fixes).
>>
>> The original patch comment was that rcu_read_lock/unlock would be
>> sufficient and we don't need to touch nvme_ns_remove()
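For reference, the non-sleeping traversal in question is roughly the sketch
below; whether the read side should be rcu_read_lock() or
srcu_read_lock(&head->srcu) is exactly the open question here. Illustrative
only (the function name is made up), assuming the siblings list is updated
with list_add_rcu()/list_del_rcu() and that removal synchronizes before
freeing the ns:

static void revalidate_paths_sketch(struct nvme_ns_head *head)
{
        sector_t capacity = get_capacity(head->disk);
        struct nvme_ns *ns;

        /* non-sleeping walk of the shared siblings list */
        rcu_read_lock();
        list_for_each_entry_rcu(ns, &head->list, siblings) {
                if (capacity != get_capacity(ns->disk))
                        clear_bit(NVME_NS_READY, &ns->flags);
        }
        rcu_read_unlock();
}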
>>
>>>
>>>> Maybe it is OK to have it also srcu locked and just accept that
>>>> nshead sibling list is srcu protected. In that case, your patch
>>>> needs to extend the srcu also the clearing of current_head pointer.
>>>
>>> I don't see how nvme_mpath_clear_current_path needs (S)RCU protection.
>>> It never dereferences the current_path, it just checks it for pointer
>>> equality and, if they match, clears it to NULL.  (I wonder if it should
>>> use cmpxchg though.)
>>
>> Agreed, it can stay out, because at this point it does not compete with
>> concurrent submissions thanks to the prior synchronizations. The list
>> traversal needs to be under the RCU lock.
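If we did want the cmpxchg variant mentioned above, it would look roughly
like this untested sketch (name made up; current_path[] is an __rcu pointer,
so the usual sparse annotations/casts are omitted here):

static void clear_current_path_cmpxchg(struct nvme_ns *ns)
{
        struct nvme_ns_head *head = ns->head;
        int node;

        if (!head)
                return;

        /*
         * The entry is only compared against @ns and reset to NULL,
         * never dereferenced, so no RCU read-side section is needed.
         */
        for_each_node(node)
                cmpxchg(&head->current_path[node], ns, NULL);
}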
>>
>>
>>>
>>>> But looking again at your bug report, you mention that there are
>>>> concurrent scans, one removing the ns and another accessing it.
>>>> That cannot happen due to the scan_lock held around this section afaict.
>>>>
>>>> I guess it can be that in general ns removal can compete with a scan
>>>> if due to some controller behavior that failed an identify command
>>>> transiently in a prior scan, and a subsequent scan finds it? worth
>>>> pinning down exactly what happened in the race you got because maybe we
>>>> have a different issue that may manifest in other issues.
>>>
>>> So scanning itself should be single-threaded as it only happens from
>>> the workqueue.  But nvme_ns_remove can be called from
>>> nvme_remove_namespaces and, in 6.1 and earlier, from the passthrough
>>> handler.
>>
>> The original patch report did not include any sequence that removes all
>> namespaces, and given that it came from a Rocky Linux 8.6 kernel, it is
>> not 6.1... Hence I think we need to understand how a namespace removal
>> happened at the same time the namespace was being scanned. Maybe
>> something else is broken.
> 
> We can reliably cause the panic by sending "NS Changed" AENs
> from multiple controllers at the same time, resulting in multiple scan works
> running concurrently for the same namespace heads.
> In our test, we have 4 controllers for the same subsystem, with 9 namespaces
> that are added one at a time, resulting in many AENs for each controller.
> We can see from the dmesg logs that the controllers' scan works overlap:
> [37311.530367] nvme nvme0: queue_size 128 > ctrl sqsize 32, clamping down
> [37311.530398] nvme nvme0: creating 32 I/O queues.
> [37311.819883] nvme nvme0: mapped 32/0/0 default/read/poll queues.
> [37311.828129] nvme nvme0: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.1.110:4420
> [37311.924908] nvme nvme1: queue_size 128 > ctrl sqsize 32, clamping down
> [37311.924935] nvme nvme1: creating 32 I/O queues.
> [37312.298561] nvme nvme1: mapped 32/0/0 default/read/poll queues.
> [37312.306296] nvme nvme1: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.3.110:4420
> [37312.400143] nvme nvme2: queue_size 128 > ctrl sqsize 32, clamping down
> [37312.400180] nvme nvme2: creating 32 I/O queues.
> [37312.671861] nvme nvme2: mapped 32/0/0 default/read/poll queues.
> [37312.678318] nvme nvme2: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.2.110:4420
> [37312.760833] nvme nvme3: queue_size 128 > ctrl sqsize 32, clamping down
> [37312.760860] nvme nvme3: creating 32 I/O queues.
> [37313.123490] nvme nvme3: mapped 32/0/0 default/read/poll queues.
> [37313.130407] nvme nvme3: new ctrl: NQN
> "nqn.2010-06.com.purestorage:flasharray.7896f4eca47b4dea", addr
> 192.168.4.110:4420
> [37313.654120] nvme nvme3: rescanning namespaces.
> [37313.654152] nvme nvme2: rescanning namespaces.
> [37313.654867] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.654876] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.655573] nvme nvme0: rescanning namespaces.
> [37313.655694] nvme nvme1: rescanning namespaces.
> [37313.656405] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.656445] nvme0n1: detected capacity change from 0 to 11811160064
> [37313.897745] nvme nvme3: rescanning namespaces.
> [37313.897748] nvme nvme2: rescanning namespaces.
> [37313.898614] nvme0n2: detected capacity change from 0 to 11811160064
> [37313.907348] nvme nvme0: rescanning namespaces.
> [37313.907409] nvme nvme1: rescanning namespaces.
> [37314.191586] nvme nvme2: rescanning namespaces.
> [37314.191589] nvme nvme3: rescanning namespaces.
> [37314.193241] nvme nvme0: rescanning namespaces.
> [37314.193303] nvme nvme1: rescanning namespaces.
> [37314.205965] nvme0n3: detected capacity change from 0 to 11811160064
> [37314.206026] nvme0n3: detected capacity change from 0 to 11811160064
> [37314.206036] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000050
> [37314.206036] nvme0n3: detected capacity change from 0 to 11811160064
> 
> I don't see any mechanism to prevent scan work from running concurrently.
> scan_lock is per-controller, but there are 4 controllers in our test.
> I'm not very familiar with workqueues in Linux, but it looks like they can run
> multiple work items concurrently.
> The nvme-wq workqueue appears to be unbound (not per-CPU) and to have a high
> concurrency limit:
> $ cat /sys/bus/workqueue/devices/nvme-wq/per_cpu
> 0
> $ cat /sys/bus/workqueue/devices/nvme-wq/max_active
> 256
> 
> At the end of the scan, nvme_remove_invalid_namespaces() is called,
> which I think explains how we end up with namespace removals during the scans.
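Right, nvme-wq is an unbound workqueue, so scan works of different
controllers can certainly run in parallel (max_active 0 at allocation time
means the default limit of 256). From memory, the allocation in
drivers/nvme/host/core.c is roughly:

        nvme_wq = alloc_workqueue("nvme-wq",
                        WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS, 0);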
A given scan work should not run concurrently with itself, though, and
ctrl->scan_lock should not be ineffective either.
I guess the reason is that the NSIDs in the ns_list reported by the target
are not in ascending order.
Of course, the host-side code should also be more robust against that.
You can try this patch:
  drivers/nvme/host/core.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c78a5825a057..ebf751cc3855 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4467,6 +4467,8 @@ static int nvme_scan_ns_list(struct nvme_ctrl *ctrl)

                         if (!nsid)      /* end of the list? */
                                 goto out;
+                       if (nsid <= prev)
+                               continue;
                         nvme_scan_ns(ctrl, nsid);
                         while (++prev < nsid)
                                 nvme_ns_remove_by_nsid(ctrl, prev);

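To make the suspected failure mode concrete with a made-up NSID list (purely
hypothetical, not taken from Caleb's report): if a batch came back as
{ 2, 1, 3 }, the unpatched loop would, while handling NSID 2, run
"while (++prev < nsid)" and call nvme_ns_remove_by_nsid(ctrl, 1), and then
immediately re-add namespace 1 when it reaches the next entry. A live
namespace is removed and re-created in the middle of a scan, which is exactly
the kind of removal that can race with the other controllers' scans of the
shared ns_head. The added check simply refuses to process entries whose NSID
is not strictly greater than prev, so prev stays monotonic across the batch.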


