SPDK initiators (VMware 7.x) cannot connect to nvmet-rdma.
Mark Ruijter
mruijter at primelogic.nl
Tue Aug 31 06:42:43 PDT 2021
When I connect an SPDK initiator, it tries to connect using 1024 connections.
The Linux target is unable to handle this situation and returns an error:
Aug 28 14:22:56 crashme kernel: [169366.627010] infiniband mlx5_0: create_qp:2789:(pid 33755): Create QP type 2 failed
Aug 28 14:22:56 crashme kernel: [169366.627913] nvmet_rdma: failed to create_qp ret= -12
Aug 28 14:22:56 crashme kernel: [169366.628498] nvmet_rdma: nvmet_rdma_alloc_queue: creating RDMA queue failed (-12).
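For context, the -12 is -ENOMEM from the RDMA stack when the target sizes the completion queue and queue pair for the requested depth. The sketch below is a simplified paraphrase of nvmet_rdma_create_queue_ib() in drivers/nvme/target/rdma.c (from memory; field names and details vary between kernel versions), showing how the negotiated queue sizes go straight into the QP capabilities:

        /* Simplified paraphrase of nvmet_rdma_create_queue_ib() */
        struct ib_qp_init_attr qp_attr = { };
        int nr_cqe, ret;

        /* CQ slots for RECV plus RDMA_READ/RDMA_WRITE plus SEND completions */
        nr_cqe = queue->recv_queue_size + 2 * queue->send_queue_size;

        qp_attr.qp_type = IB_QPT_RC;
        qp_attr.cap.max_send_wr = queue->send_queue_size + 1;  /* +1 for drain */
        /* max_rdma_ctxs for RDMA READ/WRITE is also scaled from send_queue_size */
        if (!ndev->srq)
                qp_attr.cap.max_recv_wr = 1 + queue->recv_queue_size;

        ret = rdma_create_qp(queue->cm_id, ndev->pd, &qp_attr);
        if (ret) {
                pr_err("failed to create_qp ret= %d\n", ret);
                goto err_destroy_cq;
        }

With hsqsize=1024 on every I/O queue those work-request and CQ allocations grow accordingly, which is presumably where mlx5 runs out of resources and create_qp fails.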
It is really easy to reproduce the problem, even when not using the SPDK initiator.
Just type:
nvme connect --transport=rdma --queue-size=1024 --nqn=SOME.NQN --traddr=SOME.IP --trsvcid=XXXX
While a Linux initiator attempts to set up 64 connections, SPDK attempts to create 1024 connections.
The result is that anything that relies on SPDK, such as VMware 7.x, won't be able to connect.
Restricting the queues to a depth of 256 solves part of the problem: with that change SPDK and VMware seem to connect.
See the code section below. Sadly, VMware declares the path dead shortly afterwards, so I guess this 'fix' needs more work. ;-(
I noticed that someone reported this problem on the SPDK list:
https://github.com/spdk/spdk/issues/1719
Thanks,
Mark
---
static int
nvmet_rdma_parse_cm_connect_req(struct rdma_conn_param *conn,
                                struct nvmet_rdma_queue *queue)
{
        struct nvme_rdma_cm_req *req;

        req = (struct nvme_rdma_cm_req *)conn->private_data;
        if (!req || conn->private_data_len == 0)
                return NVME_RDMA_CM_INVALID_LEN;

        if (le16_to_cpu(req->recfmt) != NVME_RDMA_CM_FMT_1_0)
                return NVME_RDMA_CM_INVALID_RECFMT;

        queue->host_qid = le16_to_cpu(req->qid);

        /*
         * req->hsqsize corresponds to our recv queue size plus 1
         * req->hrqsize corresponds to our send queue size
         */
        queue->recv_queue_size = le16_to_cpu(req->hsqsize) + 1;
        queue->send_queue_size = le16_to_cpu(req->hrqsize);

        if (!queue->host_qid && queue->recv_queue_size > NVME_AQ_DEPTH) {
                pr_info("MARK nvmet_rdma_parse_cm_connect_req return %i",
                        NVME_RDMA_CM_INVALID_HSQSIZE);
                return NVME_RDMA_CM_INVALID_HSQSIZE;
        }

+       if (queue->recv_queue_size > 256)
+               queue->recv_queue_size = 256;
+       if (queue->send_queue_size > 256)
+               queue->send_queue_size = 256;
+       pr_info("MARK queue->recv_queue_size = %i", queue->recv_queue_size);
+       pr_info("MARK queue->send_queue_size = %i", queue->send_queue_size);

        /* XXX: Should we enforce some kind of max for IO queues? */

        return 0;
}
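One caveat with the clamp above (an observation from reading the code, not something I have verified against VMware): the target does echo its receive queue size back to the host in the CM accept private data, so the host is at least told about the smaller queue. Roughly, from nvmet_rdma_cm_accept():

        struct rdma_conn_param param = { };
        struct nvme_rdma_cm_rep priv = { };

        priv.recfmt = cpu_to_le16(NVME_RDMA_CM_FMT_1_0);
        /* crqsize reports how many commands the controller queue will accept */
        priv.crqsize = cpu_to_le16(queue->recv_queue_size);
        param.private_data = &priv;
        param.private_data_len = sizeof(priv);

        ret = rdma_accept(cm_id, &param);

Whether the SPDK-based initiator in VMware honours that crqsize after asking for 1024, or keeps driving the queue at its original depth, I cannot tell from the target side; if it does not, that would explain why the path is declared dead later.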