SPDK initiators (VMware 7.x) cannot connect to nvmet-rdma.
Max Gurtovoy
mgurtovoy at nvidia.com
Tue Sep 7 07:25:36 PDT 2021
On 9/6/2021 12:12 PM, Mark Ruijter wrote:
> Hi Max,
>
> The system I use has dual AMD EPYC 7452 32-Core Processors.
> MemTotal: 197784196 kB
>
> It has a single dual port ConnectX-6 card.
> 81:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> 81:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
>
> The problem is not related to hardware. VMware works flawlessly with the SPDK target on this system.
>
> The kernel target fails like this:
> target/rdma.c -> infiniband/cma.c -> infiniband/verbs.c -> infiniband/hw/mlx5/qp.c
> nvmet_rdma_cm_accept -> rdma_accept -> ib_create_named_qp -> create_kernel_qp ->
> returns -12 -> mlx5_0: create_qp:2774:(pid 1246): MARK Create QP type 2 failed)
>
> The queue size is 1024. The mlx5 driver then enters calc_sq_size, where the check quoted below fails and it returns -ENOMEM.
Ok, I see the issue here.
I can repro it with a Linux initiator if I set -Q 1024 in the connect command.
We need to fix a few things in the max_qp_wr calculation and add a
.get_queue_size op to nvmet_fabrics_ops to solve it completely.
For now, you can use a queue size of 256 in the SPDK initiator to work around this.
I'll send a fix.
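To give an idea of the direction (a rough sketch only, not the actual patch:
apart from the .get_queue_size name mentioned above, the names and the constant
below are illustrative):

---
/* sketch: let each transport report the largest queue size it can really
 * allocate, so the core can clamp or reject the host's HSQSIZE/HRQSIZE up
 * front instead of failing much later in create_qp with -ENOMEM */
static u16 nvmet_rdma_get_queue_size(const struct nvmet_ctrl *ctrl)
{
	/* placeholder; the real limit should be derived from the device
	 * caps (max_qp_wr / log_max_qp_sz), not hard-coded */
	return NVMET_RDMA_MAX_QUEUE_SIZE;
}

static const struct nvmet_fabrics_ops nvmet_rdma_ops = {
	/* existing ops unchanged */
	.get_queue_size		= nvmet_rdma_get_queue_size,
};
---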
> --
> if (qp->sq.wqe_cnt > (1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz))) {
> mlx5_ib_dbg(dev, "send queue size (%d * %d / %d -> %d) exceeds limits(%d)\n",
> attr->cap.max_send_wr, wqe_size, MLX5_SEND_WQE_BB,
> qp->sq.wqe_cnt,
> 1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz));
> return -ENOMEM;
> }
> --
> Sep 5 12:53:45 everest kernel: [ 567.691658] MARK enter ib_create_named_qp
> Sep 5 12:53:45 everest kernel: [ 567.691667] MARK wq_size = 2097152
> Sep 5 12:53:46 everest kernel: [ 567.692419] MARK create_kernel_qp 0
> Sep 5 12:53:46 everest kernel: [ 568.204213] MARK enter ib_create_named_qp
> Sep 5 12:53:46 everest kernel: [ 568.204218] MARK wq_size = 4194304
> Sep 5 12:53:46 everest kernel: [ 568.204219] MARK 1 send queue size (4097 * 640 / 64 -> 65536) exceeds limits(32768)
> Sep 5 12:53:46 everest kernel: [ 568.204220] MARK 1 calc_sq_size return ENOMEM
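>
> If I read calc_sq_size right, those numbers line up with the check above:
> 4097 send WRs * 640 bytes per WQE = 2,622,080 bytes, which roundup_pow_of_two()
> turns into the wq_size of 4194304 seen in the log; divided by the 64-byte
> MLX5_SEND_WQE_BB that is a wqe_cnt of 65536, over the limit of 32768
> (1 << log_max_qp_sz), hence the -ENOMEM.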
>
> A hack/fix I tested that seems to work, or at least prevents the immediate failure, is this:
>
> --- /root/linux-5.11/drivers/nvme/target/rdma.c
> +++ rdma.c 2021-09-06 03:05:08.998364562 -0400
> @@ -1397,6 +1397,10 @@
> if (!queue->host_qid && queue->recv_queue_size > NVME_AQ_DEPTH)
> return NVME_RDMA_CM_INVALID_HSQSIZE;
>
> + if (queue->send_queue_size > 256) {
> + queue->send_queue_size = 256;
> + pr_info("MARK : reducing the queue->send_queue_size to 256\n");
> + }
> /* XXX: Should we enforce some kind of max for IO queues? */
>
> return 0;
>
> ---
>
> The answer to the question in the code, "Should we enforce some kind of max for IO queues?", seems to be: yes?
> Although VMware now discovers and connects to the kernel target, the path is not working and is declared dead.
>
> The volume appears with an nguid since the target does not set the eui64 field.
> However, setting it by using a pass-through device does not solve the issue.
>
> When I don't use pass-through, esxcli nvme reports this:
> esxcli nvme namespace list
> Name Controller Number Namespace ID Block Size Capacity in MB
> ------------------------------------- ----------------- ------------ ---------- --------------
> eui.344337304e8001510025384100000001 263 1 4096 12207104
> uuid.fa8ab2201ffb4429ba1719ca0d5a3405 322 1 512 14649344
>
> When I use pass-through it reports:
> [root at vmw01:~] esxcli nvme namespace list
> Name Controller Number Namespace ID Block Size Capacity in MB
> ------------------------------------ ----------------- ------------ ---------- --------------
> eui.344337304e8001510025384100000001 263 1 4096 12207104
> eui.344337304e7000780025384100000001 324 1 512 14649344
>
> The reason is easy to explain. Without pass-through, the kernel target shows this when I query a device with sg_inq:
> sg_inq -e -p 0x83 /dev/nvmeXn1 -vvv
> VPD INQUIRY: Device Identification page
> Designation descriptor number 1, descriptor length: 52
> designator_type: T10 vendor identification, code_set: ASCII
> associated with the Target device that contains addressed lu
> vendor id: NVMe
> vendor specific: testvg/testlv_79d87ff74dac1b27
>
> With pass-through, the kernel target provides this information for the same device:
> VPD INQUIRY: Device Identification page
> Designation descriptor number 1, descriptor length: 56
> designator_type: T10 vendor identification, code_set: ASCII
> associated with the Target device that contains addressed lu
> vendor id: NVMe
> vendor specific: SAMSUNG MZWLL12THMLA-00005_S4C7NA0N700078
> Designation descriptor number 2, descriptor length: 20
> designator_type: EUI-64 based, code_set: Binary
> associated with the Addressed logical unit
> EUI-64 based 16 byte identifier
> Identifier extension: 0x344337304e700078
> IEEE Company_id: 0x2538
> Vendor Specific Extension Identifier: 0x410000000103
> [0x344337304e7000780025384100000001]
> Designation descriptor number 3, descriptor length: 40
> designator_type: SCSI name string, code_set: UTF-8
> associated with the Addressed logical unit
> SCSI name string:
> eui.344337304E7000780025384100000001
>
> SPDK returns this for the same device:
>
> VPD INQUIRY: Device Identification page
> Designation descriptor number 1, descriptor length: 48
> designator_type: T10 vendor identification, code_set: ASCII
> associated with the Target device that contains addressed lu
> vendor id: NVMe
> vendor specific: SPDK_Controller1_SPDK00000000000001
> Designation descriptor number 2, descriptor length: 20
> designator_type: EUI-64 based, code_set: Binary
> associated with the Addressed logical unit
> EUI-64 based 16 byte identifier
> Identifier extension: 0xe0e9311590254d4f
> IEEE Company_id: 0x8fa737
> Vendor Specific Extension Identifier: 0xb56897382503
> [0xe0e9311590254d4f8fa737b568973825]
> Designation descriptor number 3, descriptor length: 40
> designator_type: SCSI name string, code_set: UTF-8
> associated with the Addressed logical unit
> SCSI name string:
> eui.E0E9311590254D4F8FA737B568973825
>
> So, the kernel target returns limited information when not using pass-through, which forces VMware to use the nguid.
> We could use the nguid to fill the eui64 attribute and always report the extended info, like we do with a pass-through device?
>
> -------------------
> --- /root/linux-5.11/drivers/nvme/target/admin-cmd.c 2021-02-14 17:32:24.000000000 -0500
> +++ admin-cmd.c 2021-09-05 06:18:10.836865874 -0400
> @@ -526,6 +526,7 @@
> id->anagrpid = cpu_to_le32(ns->anagrpid);
>
> memcpy(&id->nguid, &ns->nguid, sizeof(id->nguid));
> + memcpy(&id->eui64, &ns->nguid, sizeof(id->eui64));
>
> id->lbaf[0].ds = ns->blksize_shift;
>
> --- /root/linux-5.11/drivers/nvme/target/configfs.c 2021-02-14 17:32:24.000000000 -0500
> +++ configfs.c 2021-09-05 05:35:35.741619651 -0400
> @@ -477,6 +477,7 @@
> }
>
> memcpy(&ns->nguid, nguid, sizeof(nguid));
> + memcpy(&ns->eui64, nguid, sizeof(ns->eui64));
> out_unlock:
> mutex_unlock(&subsys->lock);
> return ret ? ret : count;
> --------------
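>
> (One thing to double-check with this approach: id->eui64 is 8 bytes while the
> nguid is 16, so the memcpy above only copies the first half of the nguid into
> the EUI-64 field.)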
>
> Even with pass-through enabled and the kernel target returning all information, the path is immediately reported to be dead.
> esxcli storage core path list
> rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-
> UID: rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-
> Runtime Name: vmhba64:C0:T1:L0
> Device: No associated device
> Device Display Name: No associated device
> Adapter: vmhba64
> Channel: 0
> Target: 1
> LUN: 0
> Plugin: (unclaimed)
> State: dead
> Transport: rdma
> Adapter Identifier: rdma.vmnic2:98:03:9b:03:45:10
> Target Identifier: rdma.unknown
> Adapter Transport Details: Unavailable or path is unclaimed
> Target Transport Details: Unavailable or path is unclaimed
> Maximum IO Size: 131072
>
> This may or may not be a VMware path-checker issue.
> Since SPDK does not show this problem, some difference between the kernel target and the SPDK target must exist.
> I don't know if the patch I use that limits the queue depth to 256 is to blame.
> The path for the exact same device exported with SPDK shows up like this:
>
> rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-eui.a012ce7696bf47d5be87760d8f78fb8e
> UID: rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-eui.a012ce7696bf47d5be87760d8f78fb8e
> Runtime Name: vmhba64:C0:T0:L0
> Device: eui.a012ce7696bf47d5be87760d8f78fb8e
> Device Display Name: NVMe RDMA Disk (eui.a012ce7696bf47d5be87760d8f78fb8e)
> Adapter: vmhba64
> Channel: 0
> Target: 0
> LUN: 0
> Plugin: HPP
> State: active
> Transport: rdma
> Adapter Identifier: rdma.vmnic2:98:03:9b:03:45:10
> Target Identifier: rdma.unknown
> Adapter Transport Details: Unavailable or path is unclaimed
> Target Transport Details: Unavailable or path is unclaimed
> Maximum IO Size: 131072
>
> It looks like the connect patch does work, but something else causes VMware not to accept the nvmet-rdma target devices.
> Not sure what to make of that. It could still be eui related? See the UID from the nvmet-rdma target.
>
> Thanks,
>
> --Mark
>
> On 02/09/2021, 23:36, "Max Gurtovoy" <mgurtovoy at nvidia.com> wrote:
>
>
> On 8/31/2021 4:42 PM, Mark Ruijter wrote:
> > When I connect an SPDK initiator, it will try to connect using 1024 connections.
> > The Linux target is unable to handle this situation and returns an error.
> >
> > Aug 28 14:22:56 crashme kernel: [169366.627010] infiniband mlx5_0: create_qp:2789:(pid 33755): Create QP type 2 failed
> > Aug 28 14:22:56 crashme kernel: [169366.627913] nvmet_rdma: failed to create_qp ret= -12
> > Aug 28 14:22:56 crashme kernel: [169366.628498] nvmet_rdma: nvmet_rdma_alloc_queue: creating RDMA queue failed (-12).
> >
> > It is really easy to reproduce the problem, even when not using the SPDK initiator.
> >
> > Just type:
> > nvme connect --transport=rdma --queue-size=1024 --nqn=SOME.NQN --traddr=SOME.IP --trsvcid=XXXX
> > While a Linux initiator attempts to set up 64 connections, SPDK attempts to create 1024 connections.
>
> Is it 1024 connections, or is it the queue depth?
>
> How many cores does the initiator have?
>
> Can you give more details on the systems?
>
> >
> > The result is that anything that relies on SPDK, like VMware 7.x for example, won't be able to connect.
> > Forcing the queues to be restricted to a QD of 256 solves some of it. In this case SPDK and VMware seem to connect.
> > See the code section below. Sadly, VMware declares the path to be dead afterwards. I guess this 'fix' needs more work. ;-(
> >
> > I noticed that someone reported this problem on the SPDK list:
> > https://github.com/spdk/spdk/issues/1719
> >
> > Thanks,
> >
> > Mark
> >
> > ---
> > static int
> > nvmet_rdma_parse_cm_connect_req(struct rdma_conn_param *conn,
> > struct nvmet_rdma_queue *queue)
> > {
> > struct nvme_rdma_cm_req *req;
> >
> > req = (struct nvme_rdma_cm_req *)conn->private_data;
> > if (!req || conn->private_data_len == 0)
> > return NVME_RDMA_CM_INVALID_LEN;
> >
> > if (le16_to_cpu(req->recfmt) != NVME_RDMA_CM_FMT_1_0)
> > return NVME_RDMA_CM_INVALID_RECFMT;
> >
> > queue->host_qid = le16_to_cpu(req->qid);
> >
> > /*
> > * req->hsqsize corresponds to our recv queue size plus 1
> > * req->hrqsize corresponds to our send queue size
> > */
> > queue->recv_queue_size = le16_to_cpu(req->hsqsize) + 1;
> > queue->send_queue_size = le16_to_cpu(req->hrqsize);
> > if (!queue->host_qid && queue->recv_queue_size > NVME_AQ_DEPTH) {
> > pr_info("MARK nvmet_rdma_parse_cm_connect_req return %i", NVME_RDMA_CM_INVALID_HSQSIZE);
> > return NVME_RDMA_CM_INVALID_HSQSIZE;
> > }
> >
> > + if (queue->recv_queue_size > 256)
> > + queue->recv_queue_size = 256;
> > + if (queue->send_queue_size > 256)
> > + queue->send_queue_size = 256;
> > + pr_info("MARK queue->recv_queue_size = %i", queue->recv_queue_size);
> > + pr_info("MARK queue->send_queue_size = %i", queue->send_queue_size);
> >
> > /* XXX: Should we enforce some kind of max for IO queues? */
> > return 0;
> > }
> >
> >
> >
> > _______________________________________________
> > Linux-nvme mailing list
> > Linux-nvme at lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-nvme
>