target crash / host hang with nvme-all.3 branch of nvme-fabrics
Christoph Hellwig
hch at lst.de
Tue Jun 21 09:01:34 PDT 2016
On Fri, Jun 17, 2016 at 12:37:18AM +0300, Sagi Grimberg wrote:
>> to false because the queue is on the local list, and now we have thread 1
>> and 2 racing for disconnecting the queue.
>
> But the list removal and list_empty evaluation is still under a mutex,
> isn't that sufficient to avoid the race?
If only one side takes the lock it's not very helpful. We can
execute nvmet_rdma_queue_disconnect from the CM handler while the
queue sits on the local to-be-removed list, which creates two
issues: a) we manipulate the local del_list without any knowledge
of the thread calling nvmet_rdma_delete_ctrl, leading to potential
list corruption, and b) we can call into __nvmet_rdma_queue_disconnect
concurrently. As you pointed out, we still hold the per-queue
state_lock inside __nvmet_rdma_queue_disconnect, so b) is probably
harmless at the moment, as long as the queue hasn't already been
freed by one of the racing threads, which is fairly unlikely.
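
To make the race concrete, here is a minimal sketch of the two
paths as I read them in this tree (simplified; exact field names
and details are approximate):

static void nvmet_rdma_delete_ctrl(struct nvmet_ctrl *ctrl)
{
	struct nvmet_rdma_queue *queue, *next;
	LIST_HEAD(del_list);

	mutex_lock(&nvmet_rdma_queue_mutex);
	list_for_each_entry_safe(queue, next,
			&nvmet_rdma_queue_list, queue_list) {
		if (queue->nvme_sq.ctrl == ctrl)
			list_move_tail(&queue->queue_list, &del_list);
	}
	mutex_unlock(&nvmet_rdma_queue_mutex);

	/*
	 * The queues now sit on the local del_list, still linked,
	 * and nothing protects this walk against the CM handler.
	 */
	list_for_each_entry_safe(queue, next, &del_list, queue_list)
		nvmet_rdma_queue_disconnect(queue);
}

/* also called from the CM event handler: */
static void nvmet_rdma_queue_disconnect(struct nvmet_rdma_queue *queue)
{
	bool disconnect = false;

	mutex_lock(&nvmet_rdma_queue_mutex);
	if (!list_empty(&queue->queue_list)) {
		/*
		 * Wrongly true for a queue already moved to del_list:
		 * this list_del_init edits the other thread's local
		 * list, and both threads go on to disconnect.
		 */
		list_del_init(&queue->queue_list);
		disconnect = true;
	}
	mutex_unlock(&nvmet_rdma_queue_mutex);

	if (disconnect)
		__nvmet_rdma_queue_disconnect(queue);
}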
Either way - using list_empty to check whether an object is still
alive by virtue of being linked into a list, and splicing entries
onto a local dispose list, simply don't mix: once an entry has been
moved to the local list it is still linked, so the list_empty check
keeps reporting it as live even though another thread already owns
it. Both are useful patterns on their own, but should not be
combined.
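
One way to avoid the mix entirely - a sketch, not a tested patch -
is to drop the local dispose list and instead unlink and disconnect
one queue at a time, restarting the scan after each unlock:

static void nvmet_rdma_delete_ctrl(struct nvmet_ctrl *ctrl)
{
	struct nvmet_rdma_queue *queue;

restart:
	mutex_lock(&nvmet_rdma_queue_mutex);
	list_for_each_entry(queue, &nvmet_rdma_queue_list, queue_list) {
		if (queue->nvme_sq.ctrl == ctrl) {
			/*
			 * Unlinking under the mutex keeps the
			 * list_empty liveness check in the CM handler
			 * meaningful: whoever unlinks first owns the
			 * disconnect.
			 */
			list_del_init(&queue->queue_list);
			mutex_unlock(&nvmet_rdma_queue_mutex);
			__nvmet_rdma_queue_disconnect(queue);
			goto restart;
		}
	}
	mutex_unlock(&nvmet_rdma_queue_mutex);
}

That keeps every list manipulation under nvmet_rdma_queue_mutex, so
the losing thread simply sees an empty list node and backs off.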