SPDK initiators (VMware 7.x) cannot connect to nvmet-rdma.
Mark Ruijter
mruijter at primelogic.nl
Mon Sep 6 02:12:06 PDT 2021
Hi Max,
The system I use has dual AMD EPYC 7452 32-Core Processors.
MemTotal: 197784196 kB
It has a single dual port ConnectX-6 card.
81:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
81:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
The problem is not related to hardware. VMware works flawlessly using the SPDK target on this system.
The kernel target fails like this:
target/rdma.c -> infiniband/cma.c -> infiniband/verbs.c -> infiniband/hw/mlx5/qp.c
nvmet_rdma_cm_accept -> rdma_accept -> ib_create_named_qp -> create_kernel_qp ->
returns -12 -> mlx5_0: create_qp:2774:(pid 1246): MARK Create QP type 2 failed
The queue size is 1024. The mlx5 driver then enters calc_sq_size, where it fails at this check and returns -ENOMEM:
--
	if (qp->sq.wqe_cnt > (1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz))) {
		mlx5_ib_dbg(dev, "send queue size (%d * %d / %d -> %d) exceeds limits(%d)\n",
			    attr->cap.max_send_wr, wqe_size, MLX5_SEND_WQE_BB,
			    qp->sq.wqe_cnt,
			    1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz));
		return -ENOMEM;
	}
--
Sep 5 12:53:45 everest kernel: [ 567.691658] MARK enter ib_create_named_qp
Sep 5 12:53:45 everest kernel: [ 567.691667] MARK wq_size = 2097152
Sep 5 12:53:46 everest kernel: [ 567.692419] MARK create_kernel_qp 0
Sep 5 12:53:46 everest kernel: [ 568.204213] MARK enter ib_create_named_qp
Sep 5 12:53:46 everest kernel: [ 568.204218] MARK wq_size = 4194304
Sep 5 12:53:46 everest kernel: [ 568.204219] MARK 1 send queue size (4097 * 640 / 64 -> 65536) exceeds limits(32768)
Sep 5 12:53:46 everest kernel: [ 568.204220] MARK 1 calc_sq_size return ENOMEM
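For reference, the arithmetic can be reproduced in user space. This is a minimal sketch: the constants come from the log lines above, and the round-up-to-a-power-of-two step mirrors what calc_sq_size in drivers/infiniband/hw/mlx5/qp.c does before the check quoted earlier.
--
#include <stdio.h>
#include <stdint.h>

/* Round v up to the next power of two, like the kernel's roundup_pow_of_two(). */
static uint64_t roundup_pow_of_two64(uint64_t v)
{
	uint64_t r = 1;

	while (r < v)
		r <<= 1;
	return r;
}

int main(void)
{
	uint64_t max_send_wr = 4097;  /* attr->cap.max_send_wr, from the log */
	uint64_t wqe_size = 640;      /* per-WQE size, from the log */
	uint64_t bb = 64;             /* MLX5_SEND_WQE_BB */
	uint64_t limit = 32768;       /* 1 << log_max_qp_sz on this card */

	uint64_t wq_size = roundup_pow_of_two64(max_send_wr * wqe_size);
	uint64_t wqe_cnt = wq_size / bb;

	/* With these inputs: wq_size = 4194304, wqe_cnt = 65536 -> exceeds 32768 */
	printf("wq_size = %llu, wqe_cnt = %llu -> %s %llu\n",
	       (unsigned long long)wq_size, (unsigned long long)wqe_cnt,
	       wqe_cnt > limit ? "exceeds" : "within",
	       (unsigned long long)limit);
	return 0;
}
--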
A hack/fix I tested, which seems to work or at least prevents the immediate failure, is this:
--- /root/linux-5.11/drivers/nvme/target/rdma.c
+++ rdma.c 2021-09-06 03:05:08.998364562 -0400
@@ -1397,6 +1397,10 @@
if (!queue->host_qid && queue->recv_queue_size > NVME_AQ_DEPTH)
return NVME_RDMA_CM_INVALID_HSQSIZE;
+	if (queue->send_queue_size > 256) {
+		queue->send_queue_size = 256;
+		pr_info("MARK: reducing queue->send_queue_size to 256\n");
+	}
/* XXX: Should we enforce some kind of max for IO queues? */
return 0;
---
The answer to the question in the code: "Should we enforce some kind of max for IO queues?" seems to be: yes?
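One possible shape for that maximum, sketched rather than taken from any existing patch: clamp against a transport-defined ceiling and the device's advertised capability instead of a bare 256 in the parse path. NVMET_RDMA_MAX_QUEUE_SIZE is an invented name, and nvmet_rdma_parse_cm_connect_req() does not currently see the ib_device, so that argument is assumed plumbing:
--
/* Sketch only; not existing nvmet code. */
#define NVMET_RDMA_MAX_QUEUE_SIZE	256

static void nvmet_rdma_clamp_queue_sizes(struct nvmet_rdma_queue *queue,
					 struct ib_device *ibdev)
{
	/* Never exceed what the device itself advertises per QP. */
	u32 max = min_t(u32, NVMET_RDMA_MAX_QUEUE_SIZE,
			ibdev->attrs.max_qp_wr);

	if (queue->send_queue_size > max) {
		pr_info("clamping send queue size %d -> %u\n",
			queue->send_queue_size, max);
		queue->send_queue_size = max;
	}
	if (queue->recv_queue_size > max) {
		pr_info("clamping recv queue size %d -> %u\n",
			queue->recv_queue_size, max);
		queue->recv_queue_size = max;
	}
}
--
Note that attrs.max_qp_wr alone would not have caught this case: the failing request asked for 4097 WRs, well under the 32768 WR limit, and mlx5 rejects it on total work-queue bytes (WR count times WQE size). A fixed ceiling sidesteps that.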
Although VMware now discovers and connects to the kernel target, the path is not working and is declared dead.
The volume appears with an nguid since the target does not set the eui64 field.
However, setting it by using a pass-through device does not solve the issue.
When I don't use pass-through, esxcli nvme reports this:
esxcli nvme namespace list
Name                                   Controller Number  Namespace ID  Block Size  Capacity in MB
-------------------------------------  -----------------  ------------  ----------  --------------
eui.344337304e8001510025384100000001   263                1             4096        12207104
uuid.fa8ab2201ffb4429ba1719ca0d5a3405  322                1             512         14649344
When I use pass-through it reports:
[root@vmw01:~] esxcli nvme namespace list
Name                                  Controller Number  Namespace ID  Block Size  Capacity in MB
------------------------------------  -----------------  ------------  ----------  --------------
eui.344337304e8001510025384100000001  263                1             4096        12207104
eui.344337304e7000780025384100000001  324                1             512         14649344
The reason is easy to explain. Without pass-through, the kernel target shows this when I query a device with sg_inq:
sg_inq -e -p 0x83 /dev/nvmeXn1 -vvv
VPD INQUIRY: Device Identification page
Designation descriptor number 1, descriptor length: 52
designator_type: T10 vendor identification, code_set: ASCII
associated with the Target device that contains addressed lu
vendor id: NVMe
vendor specific: testvg/testlv_79d87ff74dac1b27
With pass-through the kernel target provides this information for the same device:
VPD INQUIRY: Device Identification page
Designation descriptor number 1, descriptor length: 56
designator_type: T10 vendor identification, code_set: ASCII
associated with the Target device that contains addressed lu
vendor id: NVMe
vendor specific: SAMSUNG MZWLL12THMLA-00005_S4C7NA0N700078
Designation descriptor number 2, descriptor length: 20
designator_type: EUI-64 based, code_set: Binary
associated with the Addressed logical unit
EUI-64 based 16 byte identifier
Identifier extension: 0x344337304e700078
IEEE Company_id: 0x2538
Vendor Specific Extension Identifier: 0x410000000103
[0x344337304e7000780025384100000001]
Designation descriptor number 3, descriptor length: 40
designator_type: SCSI name string, code_set: UTF-8
associated with the Addressed logical unit
SCSI name string:
eui.344337304E7000780025384100000001
SPDK returns this for the same device:
VPD INQUIRY: Device Identification page
Designation descriptor number 1, descriptor length: 48
designator_type: T10 vendor identification, code_set: ASCII
associated with the Target device that contains addressed lu
vendor id: NVMe
vendor specific: SPDK_Controller1_SPDK00000000000001
Designation descriptor number 2, descriptor length: 20
designator_type: EUI-64 based, code_set: Binary
associated with the Addressed logical unit
EUI-64 based 16 byte identifier
Identifier extension: 0xe0e9311590254d4f
IEEE Company_id: 0x8fa737
Vendor Specific Extension Identifier: 0xb56897382503
[0xe0e9311590254d4f8fa737b568973825]
Designation descriptor number 3, descriptor length: 40
designator_type: SCSI name string, code_set: UTF-8
associated with the Addressed logical unit
SCSI name string:
eui.E0E9311590254D4F8FA737B568973825
So the kernel target returns limited information when not using pass-through, which forces VMware to use the nguid.
We could use the nguid to fill the eui64 attribute and always report the extended info, like we do with a pass-through device?
-------------------
--- /root/linux-5.11/drivers/nvme/target/admin-cmd.c 2021-02-14 17:32:24.000000000 -0500
+++ admin-cmd.c 2021-09-05 06:18:10.836865874 -0400
@@ -526,6 +526,7 @@
id->anagrpid = cpu_to_le32(ns->anagrpid);
memcpy(&id->nguid, &ns->nguid, sizeof(id->nguid));
+ memcpy(&id->eui64, &ns->nguid, sizeof(id->eui64));
id->lbaf[0].ds = ns->blksize_shift;
--- /root/linux-5.11/drivers/nvme/target/configfs.c 2021-02-14 17:32:24.000000000 -0500
+++ configfs.c 2021-09-05 05:35:35.741619651 -0400
@@ -477,6 +477,7 @@
}
memcpy(&ns->nguid, nguid, sizeof(nguid));
+ memcpy(&ns->eui64, nguid, sizeof(ns->eui64));
out_unlock:
mutex_unlock(&subsys->lock);
return ret ? ret : count;
--------------
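If this direction is pursued, it probably wants a guard so an explicitly configured eui64 is never overwritten, and one caveat is worth spelling out: eui64 is 8 bytes while the nguid is 16, so the memcpy above takes only the first half of the nguid, which is not guaranteed to form a spec-valid OUI-based EUI-64. A sketch, with an invented helper name, assuming the ns->eui64 field from the configfs hunk above:
--
/* Sketch: default the eui64 from the nguid only when it is still unset. */
static void nvmet_ns_default_eui64(struct nvmet_ns *ns)
{
	static const u8 zero[8];

	/* Don't clobber an eui64 the administrator set explicitly. */
	if (memcmp(&ns->eui64, zero, sizeof(zero)))
		return;

	/* Copies the first 8 of the nguid's 16 bytes; see caveat above. */
	memcpy(&ns->eui64, &ns->nguid, sizeof(ns->eui64));
}
--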
Even with pass-through enabled and the kernel target returning all information, the path is immediately reported as dead.
esxcli storage core path list
rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-
UID: rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-
Runtime Name: vmhba64:C0:T1:L0
Device: No associated device
Device Display Name: No associated device
Adapter: vmhba64
Channel: 0
Target: 1
LUN: 0
Plugin: (unclaimed)
State: dead
Transport: rdma
Adapter Identifier: rdma.vmnic2:98:03:9b:03:45:10
Target Identifier: rdma.unknown
Adapter Transport Details: Unavailable or path is unclaimed
Target Transport Details: Unavailable or path is unclaimed
Maximum IO Size: 131072
This may or may not be a VMware path-checker issue.
Since SPDK does not show this problem, some difference between the kernel target and the SPDK target must exist.
I don't know if the patch I use that limits the queue depth to 256 is to blame.
The path for the exact same device exported with SPDK shows up like this:
rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-eui.a012ce7696bf47d5be87760d8f78fb8e
UID: rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-eui.a012ce7696bf47d5be87760d8f78fb8e
Runtime Name: vmhba64:C0:T0:L0
Device: eui.a012ce7696bf47d5be87760d8f78fb8e
Device Display Name: NVMe RDMA Disk (eui.a012ce7696bf47d5be87760d8f78fb8e)
Adapter: vmhba64
Channel: 0
Target: 0
LUN: 0
Plugin: HPP
State: active
Transport: rdma
Adapter Identifier: rdma.vmnic2:98:03:9b:03:45:10
Target Identifier: rdma.unknown
Adapter Transport Details: Unavailable or path is unclaimed
Target Transport Details: Unavailable or path is unclaimed
Maximum IO Size: 131072
It looks like the connect patch does work, but something else causes VMware not to accept the nvmet-rdma target devices.
I'm not sure what to make of that. It could still be eui related? Compare the UID of the nvmet-rdma path above, which ends in a bare '-' where the SPDK path carries an eui.
Thanks,
--Mark
On 02/09/2021, 23:36, "Max Gurtovoy" <mgurtovoy at nvidia.com> wrote:
On 8/31/2021 4:42 PM, Mark Ruijter wrote:
> When I connect an SPDK initiator it will try to connect using 1024 connections.
> The Linux target is unable to handle this situation and returns an error.
>
> Aug 28 14:22:56 crashme kernel: [169366.627010] infiniband mlx5_0: create_qp:2789:(pid 33755): Create QP type 2 failed
> Aug 28 14:22:56 crashme kernel: [169366.627913] nvmet_rdma: failed to create_qp ret= -12
> Aug 28 14:22:56 crashme kernel: [169366.628498] nvmet_rdma: nvmet_rdma_alloc_queue: creating RDMA queue failed (-12).
>
> It is really easy to reproduce the problem, even when not using the SPDK initiator.
>
> Just type:
> nvme connect --transport=rdma --queue-size=1024 --nqn=SOME.NQN --traddr=SOME.IP --trsvcid=XXXX
> While a Linux initiator attempts to set up 64 connections, SPDK attempts to create 1024 connections.
1024 connections, or is it the queue depth?
How many cores do you have in the initiator?
Can you give more details on the systems?
>
> The result is that anything which relies on SPDK, like VMware 7.x for example, won't be able to connect.
> Forcing the queues to be restricted to 256 QD solves some of it. In this case SPDK and VMware seem to connect.
> See the code section below. Sadly, VMware declares the path to be dead afterwards. I guess this 'fix' needs more work. ;-(
>
> I noticed that someone reported this problem on the SPDK list:
> https://github.com/spdk/spdk/issues/1719
>
> Thanks,
>
> Mark
>
> ---
> static int
> nvmet_rdma_parse_cm_connect_req(struct rdma_conn_param *conn,
> 		struct nvmet_rdma_queue *queue)
> {
> 	struct nvme_rdma_cm_req *req;
>
> 	req = (struct nvme_rdma_cm_req *)conn->private_data;
> 	if (!req || conn->private_data_len == 0)
> 		return NVME_RDMA_CM_INVALID_LEN;
>
> 	if (le16_to_cpu(req->recfmt) != NVME_RDMA_CM_FMT_1_0)
> 		return NVME_RDMA_CM_INVALID_RECFMT;
>
> 	queue->host_qid = le16_to_cpu(req->qid);
>
> 	/*
> 	 * req->hsqsize corresponds to our recv queue size plus 1
> 	 * req->hrqsize corresponds to our send queue size
> 	 */
> 	queue->recv_queue_size = le16_to_cpu(req->hsqsize) + 1;
> 	queue->send_queue_size = le16_to_cpu(req->hrqsize);
> 	if (!queue->host_qid && queue->recv_queue_size > NVME_AQ_DEPTH) {
> 		pr_info("MARK nvmet_rdma_parse_cm_connect_req return %i\n", NVME_RDMA_CM_INVALID_HSQSIZE);
> 		return NVME_RDMA_CM_INVALID_HSQSIZE;
> 	}
>
> +	if (queue->recv_queue_size > 256)
> +		queue->recv_queue_size = 256;
> +	if (queue->send_queue_size > 256)
> +		queue->send_queue_size = 256;
> +	pr_info("MARK queue->recv_queue_size = %i\n", queue->recv_queue_size);
> +	pr_info("MARK queue->send_queue_size = %i\n", queue->send_queue_size);
>
> 	/* XXX: Should we enforce some kind of max for IO queues? */
> 	return 0;
> }
>
>
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme