[PATCH 1/2] nvmet-rdma: avoid circular locking dependency on install_queue()
Sagi Grimberg
sagi at grimberg.me
Mon Dec 4 04:46:53 PST 2023
On 12/4/23 14:31, Hannes Reinecke wrote:
> On 12/4/23 12:57, Sagi Grimberg wrote:
>>
>>
>> On 12/4/23 13:49, Hannes Reinecke wrote:
>>> On 12/4/23 11:19, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 11/20/23 15:48, Sagi Grimberg wrote:
>>>>>
>>>>>>> According to 777dc82395de ("nvmet-rdma: occasionally flush ongoing
>>>>>>> controller teardown") this is just for reducing the memory
>>>>>>> footprint.
>>>>>>> Wonder if we need to bother, and whether it won't be better to
>>>>>>> remove
>>>>>>> the whole thing entirely.
>>>>>>
>>>>>> Well, Sagi added it, so I'll let him chime in on the usefulness.
>>>>>
>>>>> While I don't like having nvmet arbitrarily replying busy and instead
>>>>> have lockdep simply just accept that its not a deadlock here, but
>>>>> we can
>>>>> simply just sidetrack it as proposed I guess.
>>>>>
>>>>> But Hannes, this is on the other extreme.. Now every connect that
>>>>> nvmet
>>>>> gets, if there is even a single queue that is disconnecting (global
>>>>> scope), then the host is denied. Lets give it a sane backlog.
>>>>> We use rdma_listen backlog of 128, so maybe stick with this magic
>>>>> number... This way we are busy only if more than 128 queues are
>>>>> tearing
>>>>> down to prevent the memory footprint from exploding, and hopefully
>>>>> it is
>>>>> rare enough that the normal host does not see an arbitrary busy
>>>>> rejection.
>>>>>
>>>>> Same comment for nvmet-tcp.
>>>>
>>>> Hey Hannes, anything happened with this one?
>>>>
>>>> Overall I think that the approach is fine, but I do think
>>>> that we need a backlog for it.
>>>
>>> Hmm. The main issue here is the call to 'flush_workqueue()', which
>>> triggers the circular lock warning. So a ratelimit would only help
>>> us so much; the real issue is to get rid of the flush_workqueue()
>>> thingie.
>>>
>>> What I can to is to add this:
>>>
>>> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
>>> index 4cc27856aa8f..72bcc54701a0 100644
>>> --- a/drivers/nvme/target/tcp.c
>>> +++ b/drivers/nvme/target/tcp.c
>>> @@ -2119,8 +2119,20 @@ static u16 nvmet_tcp_install_queue(struct
>>> nvmet_sq *sq)
>>> container_of(sq, struct nvmet_tcp_queue, nvme_sq);
>>>
>>> if (sq->qid == 0) {
>>> + struct nvmet_tcp_queue *q;
>>> + int pending = 0;
>>> +
>>> /* Let inflight controller teardown complete */
>>> - flush_workqueue(nvmet_wq);
>>> + mutex_lock(&nvmet_tcp_queue_mutex);
>>> + list_for_each_entry(q, &nvmet_tcp_queue_list,
>>> queue_list) {
>>> + if (q->nvme_sq.ctrl == sq->ctrl &&
>>> + q->state == NVMET_TCP_Q_DISCONNECTING)
>>> + pending++;
>>> + }
>>> + mutex_unlock(&nvmet_tcp_queue_mutex);
>>> + /* Retry for pending controller teardown */
>>> + if (pending)
>>> + return NVME_SC_CONNECT_CTRL_BUSY;
>>> }
>>>
>>> which then would only affect the controller we're connecting to.
>>> Hmm?
>>
>> Still I think we should give a reasonable backlog, no reason to limit
>> this as we may hit this more often than we'd like and the sole purpose
>> here is to avoid memory overrun.
>
> So would 'if (pending > tcp_backlog)' (with eg tcp_backlog = 20) fit the
> bill here?
Yes I think so, we already place a listen backlog, so maybe we can reuse
that as NVME_TCP_BACKLOG or something. The same should apply to nvme-rdma.
More information about the Linux-nvme
mailing list