[PATCH] nvme: fix SRCU protection of nvme_ns_head list

Sagi Grimberg sagi at grimberg.me
Tue Nov 22 02:06:57 PST 2022


>> Of course, the easiest thing is to try it both ways and see if you can
>> measure the difference.  If you are unsure whether or not it is safe
>> to replace rcu_read_lock() with srcu_read_lock(), you can just add
>> smp_mb() after each rcu_read_lock() and before each rcu_read_unlock().
>> That won't be exact, but it will get you quite close to the cost of
>> srcu_read_lock() and srcu_read_unlock().
>>
>>                                                          Thanx, Paul
> 
> Regarding your suggestion to use synchronize_rcu_mult() to wait concurrently,
> I'm not sure it's possible to write an equivalent to call_my_srcu() here,
> since the srcu_struct is per-nvme_ns_head rather than global.
> I think we would probably need to synchronize with RCU and SRCU sequentially.
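
Illustratively, the sequential variant would just be something like the
sketch below, with "head" standing for the nvme_ns_head that owns the
per-head srcu_struct (a sketch only, not a concrete proposal):

	/* wait for both grace periods back to back instead of
	 * concurrently via synchronize_rcu_mult() */
	synchronize_srcu(&head->srcu);	/* fence srcu readers of this head */
	synchronize_rcu();		/* fence plain-rcu list traversals */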

I don't think synchronize_rcu_mult() is necessary.

Let's look at nvme_ns_remove(); see the rough sketch after the list below.

1. clears NVME_NS_READY + synchronize_srcu()
    -> guarantees that nshead->current_path is not re-selecting this ns

2. nvme_mpath_clear_current_path + synchronize_srcu()
    -> if this ns happens to be the current_path -> clear it and fence
       concurrent bio submissions

3. removes ns from the head sibling list + synchronize_rcu()
    -> this should fence non-sleeping traversals (like revalidate_paths)
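
Roughly, as a sketch (field and helper names approximated from memory,
assuming the usual definitions from drivers/nvme/host/nvme.h are in
scope; the subsystem locking around the sibling list is omitted):

static void nvme_ns_remove_sketch(struct nvme_ns *ns)
{
	/* 1. stop path selection from re-picking this ns */
	clear_bit(NVME_NS_READY, &ns->flags);
	synchronize_srcu(&ns->head->srcu);

	/* 2. if this ns happens to be the current path, clear it and
	 *    fence concurrent bio submissions */
	nvme_mpath_clear_current_path(ns);
	synchronize_srcu(&ns->head->srcu);

	/* 3. unlink from the head sibling list and fence non-sleeping
	 *    (plain rcu) traversals */
	list_del_rcu(&ns->siblings);
	synchronize_rcu();
}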

Maybe it is OK to have it also srcu locked and just accept that the
nshead sibling list is srcu protected. In that case, your patch
needs to extend the srcu protection to cover the clearing of the
current_path pointer as well.
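
For illustration, a reader like revalidate_paths would then look roughly
like the sketch below (names and the capacity check are approximated from
memory and assume it sits next to the existing multipath code; this
sketches the suggestion, not the exact patch):

static void nvme_mpath_revalidate_paths_sketch(struct nvme_ns *ns)
{
	struct nvme_ns_head *head = ns->head;
	sector_t capacity = get_capacity(head->disk);
	struct nvme_ns *cur;
	int node, srcu_idx;

	srcu_idx = srcu_read_lock(&head->srcu);
	list_for_each_entry_rcu(cur, &head->list, siblings,
				srcu_read_lock_held(&head->srcu)) {
		if (capacity != get_capacity(cur->disk))
			clear_bit(NVME_NS_READY, &cur->flags);
	}
	/* keep the clearing of the cached current_path inside the same
	 * srcu read-side section */
	for_each_node(node)
		rcu_assign_pointer(head->current_path[node], NULL);
	srcu_read_unlock(&head->srcu, srcu_idx);
}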

But looking again at your bug report, you mention that there are
concurrent scans, one removing the ns and another accessing it.
That cannot happen, because scan_lock is held around this section afaict.
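
(For context, the scan path serializes on ctrl->scan_lock, roughly like
the condensed sketch below; simplified from memory, illustrative only:)

static void nvme_scan_work_sketch(struct nvme_ctrl *ctrl)
{
	/* namespace validation, addition and removal all run with
	 * scan_lock held, so two scans cannot race through here */
	mutex_lock(&ctrl->scan_lock);
	nvme_scan_ns_list(ctrl);	/* or the sequential fallback */
	mutex_unlock(&ctrl->scan_lock);
}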

I guess that, in general, ns removal can compete with a scan if some
controller behavior caused an identify command to fail transiently in a
prior scan and a subsequent scan finds the ns again? It is worth pinning
down exactly what happened in the race you hit, because maybe we have a
different issue that manifests in other ways.


