[PATCH] nvmet-rdma: Release connections synchronously

Sagi Grimberg sagi at grimberg.me
Sun May 21 23:41:13 PDT 2023


>>> @@ -1582,11 +1566,6 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>>>           goto put_device;
>>>       }
>>> -    if (queue->host_qid == 0) {
>>> -        /* Let inflight controller teardown complete */
>>> -        flush_workqueue(nvmet_wq);
>>> -    }
>>> -
>>>       ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>>>       if (ret) {
>>>           /*
>>
>> You could have simply removed this hunk alone to make lockdep quiet on
>> this, without the need to rework the async queue removal.
>>
>> The flush here was added to prevent a reset/connect/disconnect storm
>> causing the target to run out of resources (which we have seen reports
>> about in the distant past). What prevents it now?
>>
>> And you both reworked the teardown and still removed the flush; I don't
>> get why both are needed.
> 
> Hi Sagi,
> 
> My understanding is that the above flush_workqueue() call waits for 
> prior release_work to complete. If the release_work instance is removed, 
> I don't think that the above flush_workqueue() call is still necessary.
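
For context, a minimal sketch of the pattern being discussed (simplified
structures, not the actual nvmet-rdma code; the _sketch suffix marks a
hypothetical name): queue release was deferred to nvmet_wq via
release_work, and the flush on an admin-queue connect (host_qid == 0)
serialized new controllers behind inflight teardowns:

#include <linux/workqueue.h>

static struct workqueue_struct *nvmet_wq;

struct nvmet_rdma_queue {
	int host_qid;
	struct work_struct release_work;
};

/* Deferred teardown: runs on nvmet_wq, frees the queue resources. */
static void nvmet_rdma_release_queue_work(struct work_struct *w)
{
	struct nvmet_rdma_queue *queue =
		container_of(w, struct nvmet_rdma_queue, release_work);

	/* ... drain and destroy the QP, free the queue ... */
}

static int nvmet_rdma_queue_connect_sketch(struct nvmet_rdma_queue *queue)
{
	if (queue->host_qid == 0) {
		/*
		 * Admin queue connect: wait for every queued release_work
		 * so a reset/connect/disconnect storm cannot allocate new
		 * queues faster than old ones are released.
		 */
		flush_workqueue(nvmet_wq);
	}

	/* ... accept the connection ... */
	return 0;
}

With release_work gone, there is nothing on nvmet_wq for this flush to
wait for, which is the point being made above.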

I'm wondering if making delete_ctrl synchronous may be a problem
in some cases: nvmet_fatal_error can be triggered from the
rdma completion path (rdma core workqueue context), so there
may be some inter-dependency there...
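
To make the possible inter-dependency concrete, a hedged sketch
(hypothetical call chain, not taken from the patch): a work item must
never wait on the workqueue it runs on, which is what a synchronous
delete_ctrl could end up doing if it is invoked from rdma-core workqueue
context and its teardown has to drain that same context:

#include <linux/workqueue.h>

static struct workqueue_struct *rdma_comp_wq; /* stands in for the rdma-core wq */

/* Hypothetical: fatal error handling running on the rdma-core workqueue. */
static void fatal_error_work(struct work_struct *w)
{
	/*
	 * If a now-synchronous delete_ctrl() internally drains the rdma
	 * completion workqueue (e.g. to wait out pending completions),
	 * the flush below can never finish: it waits for this very work
	 * item to return.  lockdep flags this as a self-flush deadlock.
	 */
	flush_workqueue(rdma_comp_wq);
}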

Also, nvmet-tcp has a similar lockdep complaint, where the teardown goes
via socket shutdown, which has to be async because we cannot release the
socket from the state_change callback.
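
For reference, the tcp constraint looks roughly like this (a sketch along
the lines of nvmet_tcp_state_change, with hypothetical _sketch names):
->sk_state_change() runs in softirq context under socket locks, so the
actual socket release has to be bounced to a work item:

#include <linux/workqueue.h>
#include <net/sock.h>
#include <net/tcp_states.h>

struct tcp_queue_sketch {
	struct socket *sock;
	struct work_struct release_work;
};

/* Process context: here it is safe to shut down and release the socket. */
static void tcp_release_work_sketch(struct work_struct *w)
{
	struct tcp_queue_sketch *queue =
		container_of(w, struct tcp_queue_sketch, release_work);

	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
	sock_release(queue->sock);
}

/*
 * ->sk_state_change() is invoked from softirq context under socket
 * locks, so sock_release() cannot be called here; defer it instead.
 */
static void tcp_state_change_sketch(struct sock *sk)
{
	struct tcp_queue_sketch *queue;

	read_lock_bh(&sk->sk_callback_lock);
	queue = sk->sk_user_data;
	if (queue && (sk->sk_state == TCP_CLOSE_WAIT ||
		      sk->sk_state == TCP_CLOSE))
		schedule_work(&queue->release_work);
	read_unlock_bh(&sk->sk_callback_lock);
}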


