[PATCH 1/2] nvmet-rdma: avoid circular locking dependency on install_queue()

Shinichiro Kawasaki shinichiro.kawasaki at wdc.com
Wed Dec 6 21:54:25 PST 2023


On Dec 04, 2023 / 14:46, Sagi Grimberg wrote:
> 
> 
> On 12/4/23 14:31, Hannes Reinecke wrote:
> > On 12/4/23 12:57, Sagi Grimberg wrote:
> > > 
> > > 
> > > On 12/4/23 13:49, Hannes Reinecke wrote:
> > > > On 12/4/23 11:19, Sagi Grimberg wrote:
> > > > > 
> > > > > 
> > > > > On 11/20/23 15:48, Sagi Grimberg wrote:
> > > > > > 
> > > > > > > > According to 777dc82395de ("nvmet-rdma: occasionally flush
> > > > > > > > ongoing controller teardown"), this is just for reducing the
> > > > > > > > memory footprint. I wonder whether we need to bother, and
> > > > > > > > whether it wouldn't be better to remove the whole thing
> > > > > > > > entirely.
> > > > > > > 
> > > > > > > Well, Sagi added it, so I'll let him chime in on the usefulness.
> > > > > > 
> > > > > > While I don't like having nvmet arbitrarily reply busy, and would
> > > > > > rather have lockdep simply accept that it's not a deadlock here,
> > > > > > I guess we can sidetrack it as proposed.
> > > > > > 
> > > > > > But Hannes, this is on the other extreme. Now, for every connect
> > > > > > that nvmet gets, if there is even a single queue that is
> > > > > > disconnecting (global scope), the host is denied. Let's give it a
> > > > > > sane backlog. We use an rdma_listen backlog of 128, so maybe stick
> > > > > > with this magic number... This way we are busy only if more than
> > > > > > 128 queues are tearing down, which prevents the memory footprint
> > > > > > from exploding, and hopefully it is rare enough that a normal host
> > > > > > does not see an arbitrary busy rejection.
> > > > > > 
> > > > > > Same comment for nvmet-tcp.
> > > > > 
> > > > > Hey Hannes, has anything happened with this one?
> > > > > 
> > > > > Overall I think that the approach is fine, but I do think
> > > > > that we need a backlog for it.
> > > > 
> > > > Hmm. The main issue here is the call to 'flush_workqueue()', which
> > > > triggers the circular locking warning. So a rate limit would only
> > > > help us so much; the real issue is to get rid of the
> > > > flush_workqueue() call.
> > > > 
> > > > What I can do is add this:
> > > > 
> > > > diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> > > > index 4cc27856aa8f..72bcc54701a0 100644
> > > > --- a/drivers/nvme/target/tcp.c
> > > > +++ b/drivers/nvme/target/tcp.c
> > > > @@ -2119,8 +2119,20 @@ static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
> > > >                  container_of(sq, struct nvmet_tcp_queue, nvme_sq);
> > > > 
> > > >          if (sq->qid == 0) {
> > > > +               struct nvmet_tcp_queue *q;
> > > > +               int pending = 0;
> > > > +
> > > >                  /* Let inflight controller teardown complete */
> > > > -               flush_workqueue(nvmet_wq);
> > > > +               mutex_lock(&nvmet_tcp_queue_mutex);
> > > > +               list_for_each_entry(q, &nvmet_tcp_queue_list, queue_list) {
> > > > +                       if (q->nvme_sq.ctrl == sq->ctrl &&
> > > > +                           q->state == NVMET_TCP_Q_DISCONNECTING)
> > > > +                               pending++;
> > > > +               }
> > > > +               mutex_unlock(&nvmet_tcp_queue_mutex);
> > > > +               /* Retry for pending controller teardown */
> > > > +               if (pending)
> > > > +                       return NVME_SC_CONNECT_CTRL_BUSY;
> > > >          }
> > > > 
> > > > which then would only affect the controller we're connecting to.
> > > > Hmm?
> > > 
> > > Still, I think we should allow a reasonable backlog; there is no reason
> > > to limit this so strictly, as we may hit it more often than we'd like,
> > > and the sole purpose here is to avoid memory overrun.
> > 
> > So would 'if (pending > tcp_backlog)' (with, e.g., tcp_backlog = 20) fit
> > the bill here?
> 
> Yes, I think so. We already pass a listen backlog, so maybe we can reuse
> that as NVME_TCP_BACKLOG or something. The same should apply to nvme-rdma.

Thanks for the discussion. FYI, I modified the patches by Hannes to reflect the
discussion above [1][2]. With these, I confirmed that the lockdep WARN
disappears and saw no regression in my blktests runs. Looks good. I will test
again when the formal patches come out.

[1] https://github.com/kawasaki/linux/commit/dc58819d07b02bd54dbcd5b15d2600a517fcbbff
[2] https://github.com/kawasaki/linux/commit/20fb65d145152562d0e11a9c8065e64405c31b0d
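
For reference, here is a minimal sketch of how the backlog-bounded check
discussed above could look in nvmet_tcp_install_queue(). This is not the exact
code in [1][2]; the constant name NVMET_TCP_BACKLOG and its value of 128 are
assumptions taken from the suggestion to mirror the listen backlog, and the
same idea would apply to the rdma-side install_queue() with its own queue list
and mutex.

/*
 * Sketch only: backlog-bounded variant of the check from Hannes' diff above.
 * NVMET_TCP_BACKLOG (128) is an assumption based on the listen backlog
 * mentioned in the discussion.
 */
#define NVMET_TCP_BACKLOG	128

static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
{
	struct nvmet_tcp_queue *queue =
		container_of(sq, struct nvmet_tcp_queue, nvme_sq);

	if (sq->qid == 0) {
		struct nvmet_tcp_queue *q;
		int pending = 0;

		/*
		 * Count queues of this controller that are still tearing
		 * down; this replaces the flush_workqueue(nvmet_wq) call
		 * that triggered the circular locking warning.
		 */
		mutex_lock(&nvmet_tcp_queue_mutex);
		list_for_each_entry(q, &nvmet_tcp_queue_list, queue_list) {
			if (q->nvme_sq.ctrl == sq->ctrl &&
			    q->state == NVMET_TCP_Q_DISCONNECTING)
				pending++;
		}
		mutex_unlock(&nvmet_tcp_queue_mutex);

		/* Only report busy once the backlog is exceeded */
		if (pending > NVMET_TCP_BACKLOG)
			return NVME_SC_CONNECT_CTRL_BUSY;
	}

	/* ... remainder of nvmet_tcp_install_queue() unchanged ... */
}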

