blktests failures with v6.4
Shinichiro Kawasaki
shinichiro.kawasaki at wdc.com
Wed Jul 12 18:22:32 PDT 2023
On Jul 09, 2023 / 17:32, Sagi Grimberg wrote:
>
> > #3: nvme/003 (fabrics transport)
> >
> > When nvme test group is run with trtype=rdma or tcp, the test case fails
> > due to lockdep WARNING "possible circular locking dependency detected".
> > Reported in May/2023. Bart suggested a fix for trtype=rdma [4] but it
> > needs more discussion.
> >
> > [4] https://lore.kernel.org/linux-nvme/20230511150321.103172-1-bvanassche@acm.org/
>
> This patch is unfortunately incorrect and buggy.
>
> This will likely make the issue go away, but adds another
> old issue where a client can DDOS a target by bombarding it
> with connect/disconnect. When releases are async and we don't
> have any back-pressure, it is likely to happen.
> --
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index 4597bca43a6d..8b4f4aa48206 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -1582,11 +1582,6 @@ static int nvmet_rdma_queue_connect(struct rdma_cm_id *cm_id,
>  		goto put_device;
>  	}
>
> -	if (queue->host_qid == 0) {
> -		/* Let inflight controller teardown complete */
> -		flush_workqueue(nvmet_wq);
> -	}
> -
>  	ret = nvmet_rdma_cm_accept(cm_id, queue, &event->param.conn);
>  	if (ret) {
>  		/*
> diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
> index 868aa4de2e4c..c8cfa19e11c7 100644
> --- a/drivers/nvme/target/tcp.c
> +++ b/drivers/nvme/target/tcp.c
> @@ -1844,11 +1844,6 @@ static u16 nvmet_tcp_install_queue(struct nvmet_sq *sq)
>  	struct nvmet_tcp_queue *queue =
>  		container_of(sq, struct nvmet_tcp_queue, nvme_sq);
>
> -	if (sq->qid == 0) {
> -		/* Let inflight controller teardown complete */
> -		flush_workqueue(nvmet_wq);
> -	}
> -
>  	queue->nr_cmds = sq->size * 2;
>  	if (nvmet_tcp_alloc_cmds(queue))
>  		return NVME_SC_INTERNAL;
> --
Thanks Sagi, I tried the patch above and confirmed that the lockdep WARN
disappears for both rdma and tcp. This indicates that the
flush_workqueue(nvmet_wq) calls introduced the circular lock dependency. I also
found that the two commits below record why flush_workqueue(nvmet_wq) was
introduced:

  777dc82395de ("nvmet-rdma: occasionally flush ongoing controller teardown")
  8832cf922151 ("nvmet: use a private workqueue instead of the system workqueue")

The remaining question is how to avoid both the connect/disconnect bombarding
DDOS and the possible circular locking related to nvmet_wq completion.
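
To make the trade-off concrete, here is a minimal user-space sketch of one
possible form of back-pressure that does not wait on the workqueue. It is only
an illustration under stated assumptions, not a proposed or tested kernel fix:
struct target_model, the model_* helpers and MAX_INFLIGHT_TEARDOWNS are all
hypothetical names, and the model is single-threaded (real code would need
atomic counters or locking). The idea is to count teardowns that have been
queued but not yet completed and to refuse a new connect while too many are
pending, instead of calling flush_workqueue(nvmet_wq) in the connect path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cap on teardowns allowed to be in flight at once. */
#define MAX_INFLIGHT_TEARDOWNS 4

/* Hypothetical, single-threaded model of the target side. */
struct target_model {
	int inflight_teardowns;	/* queued but not yet completed */
};

/* A disconnect queues an asynchronous teardown. */
static void model_disconnect(struct target_model *t)
{
	t->inflight_teardowns++;
}

/* Called when an asynchronous teardown has completed. */
static void model_teardown_done(struct target_model *t)
{
	assert(t->inflight_teardowns > 0);
	t->inflight_teardowns--;
}

/*
 * Admission check for a new connect. Rather than blocking until all
 * teardown work has finished, refuse the connect while too many
 * teardowns are pending, so the client has to retry later.
 */
static bool model_connect_allowed(const struct target_model *t)
{
	return t->inflight_teardowns < MAX_INFLIGHT_TEARDOWNS;
}
```

Whether rejecting connects like this is acceptable for real hosts, and what a
sensible limit would be, is exactly the part that still needs discussion.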