nvme-multipath: round-robin infinite looping

Mon Mar 21 15:43:01 PDT 2022

I've been looking at a lockup reported by a partner doing nvme-tcp
testing, and I believe there's an issue between nvme_ns_remove and
nvme_round_robin_path that can result in infinite looping.

It seems like the same concern that was raised by Chao Leng in this
thread: https://lore.kernel.org/all/bd37abd5-759d-efe2-fdcd-8b004a41c75a@huawei.com/

The ordering of nvme_ns_remove is constructed to prevent races that
would re-assign a namespace being removed to the current_path cache.
That leaves a period where a namespace in current_path is not in the
path sibling list.  But nvme_round_robin_path makes the assumption that
the "old" ns taken from current_path is always on the list, and the odd
list traversal with nvme_next_ns isn't safe with an RCU list that can
change while it's being read.

I'm not convinced that there is a way to meet all of these assumptions
only looking at the list and current_path. I think it can be done if the
NVME_NS_READY flag is taken into account, but possibly needing an
additional synchronization point.

I'm following this email with details from a kdump analysis that shows
this happening, with a current_path entry partially removed from the
list (pointing into the list, but not on it, as list_del_rcu does) and a
CPU stuck in the inner loop of nvme_round_robin_path.

And then a couple of suggestions, for trying to fix this in
nvme_ns_remove as well as an easy backstop for nvme_round_robin_path
that would prevent endless looping without fixing the race.  

Just looking for discussion on these right now, we're working on getting
them tested.

- Chris