SPDK initiators (VMware 7.x) cannot connect to nvmet-rdma.
Max Gurtovoy
mgurtovoy at nvidia.com
Tue Sep 7 07:25:36 PDT 2021
On 9/6/2021 12:12 PM, Mark Ruijter wrote:
> Hi Max,
>
> The system I use has dual AMD EPYC 7452 32-Core Processors.
> MemTotal: 197784196 kB
>
> It has a single dual port ConnectX-6 card.
> 81:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
> 81:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
>
> The problem is not related to hardware. VMware works flawlessly with the SPDK target on this system.
>
> The kernel target fails like this:
> target/rdma.c -> infiniband/cma.c -> infiniband/verbs.c -> infiniband/hw/mlx5/qp.c
> nvmet_rdma_cm_accept -> rdma_accept -> ib_create_named_qp -> create_kernel_qp ->
> returns -12 -> mlx5_0: create_qp:2774:(pid 1246): MARK Create QP type 2 failed)
>
> The queue size is 1024. The mlx5 driver then enters calc_sq_size, where the check quoted below fails and it returns -ENOMEM.
Ok, I see the issue here.
I can repro it with a Linux initiator if I set -Q 1024 in the connect command.
We need to fix a few things in the max_qp_wr calculation and add a
.get_queue_size op to nvmet_fabrics_ops to solve it completely.
For now, you can use a queue size of 256 in the SPDK initiator to work around this.
I'll send a fix.
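To give an idea of the direction (a rough sketch only, not the actual patch:
apart from the .get_queue_size name mentioned above, the names and the constant
below are illustrative):

---
/* sketch: let each transport report the largest queue size it can really
 * allocate, so the core can clamp or reject the host's HSQSIZE/HRQSIZE up
 * front instead of failing much later in create_qp with -ENOMEM */
static u16 nvmet_rdma_get_queue_size(const struct nvmet_ctrl *ctrl)
{
	/* placeholder; the real limit should be derived from the device
	 * caps (max_qp_wr / log_max_qp_sz), not hard-coded */
	return NVMET_RDMA_MAX_QUEUE_SIZE;
}

static const struct nvmet_fabrics_ops nvmet_rdma_ops = {
	/* existing ops unchanged */
	.get_queue_size		= nvmet_rdma_get_queue_size,
};
---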
> --
> if (qp->sq.wqe_cnt > (1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz))) {
> mlx5_ib_dbg(dev, "send queue size (%d * %d / %d -> %d) exceeds limits(%d)\n",
> attr->cap.max_send_wr, wqe_size, MLX5_SEND_WQE_BB,
> qp->sq.wqe_cnt,
> 1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz));
> return -ENOMEM;
> }
> --
> Sep 5 12:53:45 everest kernel: [ 567.691658] MARK enter ib_create_named_qp
> Sep 5 12:53:45 everest kernel: [ 567.691667] MARK wq_size = 2097152
> Sep 5 12:53:46 everest kernel: [ 567.692419] MARK create_kernel_qp 0
> Sep 5 12:53:46 everest kernel: [ 568.204213] MARK enter ib_create_named_qp
> Sep 5 12:53:46 everest kernel: [ 568.204218] MARK wq_size = 4194304
> Sep 5 12:53:46 everest kernel: [ 568.204219] MARK 1 send queue size (4097 * 640 / 64 -> 65536) exceeds limits(32768)
> Sep 5 12:53:46 everest kernel: [ 568.204220] MARK 1 calc_sq_size return ENOMEM
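>
> If I read calc_sq_size right, those numbers line up with the check above:
> 4097 send WRs * 640 bytes per WQE = 2,622,080 bytes, which roundup_pow_of_two()
> turns into the wq_size of 4194304 seen in the log; divided by the 64-byte
> MLX5_SEND_WQE_BB that is a wqe_cnt of 65536, over the limit of 32768
> (1 << log_max_qp_sz), hence the -ENOMEM.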
>
> A hack/fix I tested that seems to work, or at least prevents the immediate failure, is this:
>
> --- /root/linux-5.11/drivers/nvme/target/rdma.c
> +++ rdma.c 2021-09-06 03:05:08.998364562 -0400
> @@ -1397,6 +1397,10 @@
> if (!queue->host_qid && queue->recv_queue_size > NVME_AQ_DEPTH)
> return NVME_RDMA_CM_INVALID_HSQSIZE;
>
> + if (queue->send_queue_size > 256) {
> + queue->send_queue_size = 256;
> + pr_info("MARK : reducing the queue->send_queue_size to 256\n");
> + }
> /* XXX: Should we enforce some kind of max for IO queues? */
>
> return 0;
>
> ---
>
> The answer to the question in the code, "Should we enforce some kind of max for IO queues?", seems to be: yes?
> Although VMware now discovers and connects to the kernel target, the path is not working and is declared dead.
>
> The volume appears with an nguid since the target does not set the eui64 field.
> However, setting it by using a pass-through device does not solve the issue.
>
> When I don't use pass-through, esxcli nvme reports this:
> esxcli nvme namespace list
> Name Controller Number Namespace ID Block Size Capacity in MB
> ------------------------------------- ----------------- ------------ ---------- --------------
> eui.344337304e8001510025384100000001 263 1 4096 12207104
> uuid.fa8ab2201ffb4429ba1719ca0d5a3405 322 1 512 14649344
>
> When I use pass-through it reports:
> [root at vmw01:~] esxcli nvme namespace list
> Name Controller Number Namespace ID Block Size Capacity in MB
> ------------------------------------ ----------------- ------------ ---------- --------------
> eui.344337304e8001510025384100000001 263 1 4096 12207104
> eui.344337304e7000780025384100000001 324 1 512 14649344
>
> The reason is easy to explain. Without pass-through, the kernel target shows this when I query a device with sg_inq:
> sg_inq -e -p 0x83 /dev/nvmeXn1 -vvv
> VPD INQUIRY: Device Identification page
> Designation descriptor number 1, descriptor length: 52
> designator_type: T10 vendor identification, code_set: ASCII
> associated with the Target device that contains addressed lu
> vendor id: NVMe
> vendor specific: testvg/testlv_79d87ff74dac1b27
>
> With pass-through, the kernel target provides this information for the same device:
> VPD INQUIRY: Device Identification page
> Designation descriptor number 1, descriptor length: 56
> designator_type: T10 vendor identification, code_set: ASCII
> associated with the Target device that contains addressed lu
> vendor id: NVMe
> vendor specific: SAMSUNG MZWLL12THMLA-00005_S4C7NA0N700078
> Designation descriptor number 2, descriptor length: 20
> designator_type: EUI-64 based, code_set: Binary
> associated with the Addressed logical unit
> EUI-64 based 16 byte identifier
> Identifier extension: 0x344337304e700078
> IEEE Company_id: 0x2538
> Vendor Specific Extension Identifier: 0x410000000103
> [0x344337304e7000780025384100000001]
> Designation descriptor number 3, descriptor length: 40
> designator_type: SCSI name string, code_set: UTF-8
> associated with the Addressed logical unit
> SCSI name string:
> eui.344337304E7000780025384100000001
>
> SPDK returns this for the same device:
>
> VPD INQUIRY: Device Identification page
> Designation descriptor number 1, descriptor length: 48
> designator_type: T10 vendor identification, code_set: ASCII
> associated with the Target device that contains addressed lu
> vendor id: NVMe
> vendor specific: SPDK_Controller1_SPDK00000000000001
> Designation descriptor number 2, descriptor length: 20
> designator_type: EUI-64 based, code_set: Binary
> associated with the Addressed logical unit
> EUI-64 based 16 byte identifier
> Identifier extension: 0xe0e9311590254d4f
> IEEE Company_id: 0x8fa737
> Vendor Specific Extension Identifier: 0xb56897382503
> [0xe0e9311590254d4f8fa737b568973825]
> Designation descriptor number 3, descriptor length: 40
> designator_type: SCSI name string, code_set: UTF-8
> associated with the Addressed logical unit
> SCSI name string:
> eui.E0E9311590254D4F8FA737B568973825
>
> So, the kernel target returns limited information when not using pass-through, which forces VMware to use the nguid.
> We could use the nguid to fill the eui64 attribute and always report the extended info, like we do with a pass-through device?
>
> -------------------
> --- /root/linux-5.11/drivers/nvme/target/admin-cmd.c 2021-02-14 17:32:24.000000000 -0500
> +++ admin-cmd.c 2021-09-05 06:18:10.836865874 -0400
> @@ -526,6 +526,7 @@
> id->anagrpid = cpu_to_le32(ns->anagrpid);
>
> memcpy(&id->nguid, &ns->nguid, sizeof(id->nguid));
> + memcpy(&id->eui64, &ns->nguid, sizeof(id->eui64));
>
> id->lbaf[0].ds = ns->blksize_shift;
>
> --- /root/linux-5.11/drivers/nvme/target/configfs.c 2021-02-14 17:32:24.000000000 -0500
> +++ configfs.c 2021-09-05 05:35:35.741619651 -0400
> @@ -477,6 +477,7 @@
> }
>
> memcpy(&ns->nguid, nguid, sizeof(nguid));
> + memcpy(&ns->eui64, nguid, sizeof(ns->eui64));
> out_unlock:
> mutex_unlock(&subsys->lock);
> return ret ? ret : count;
> --------------
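>
> (One thing to double-check with this approach: id->eui64 is 8 bytes while the
> nguid is 16, so the memcpy above only copies the first half of the nguid into
> the EUI-64 field.)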
>
> Even with pass-through enabled and the kernel target returning all information, the path is immediately reported to be dead.
> esxcli storage core path list
> rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-
> UID: rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-
> Runtime Name: vmhba64:C0:T1:L0
> Device: No associated device
> Device Display Name: No associated device
> Adapter: vmhba64
> Channel: 0
> Target: 1
> LUN: 0
> Plugin: (unclaimed)
> State: dead
> Transport: rdma
> Adapter Identifier: rdma.vmnic2:98:03:9b:03:45:10
> Target Identifier: rdma.unknown
> Adapter Transport Details: Unavailable or path is unclaimed
> Target Transport Details: Unavailable or path is unclaimed
> Maximum IO Size: 131072
>
> This may or may not be a VMware path-checker issue.
> Since SPDK does not show this problem, some difference between the kernel target and the SPDK target must exist.
> I don't know if the patch I use that limits the queue depth to 256 is to blame.
> The path for the exact same device exported with SPDK shows up like this:
>
> rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-eui.a012ce7696bf47d5be87760d8f78fb8e
> UID: rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-eui.a012ce7696bf47d5be87760d8f78fb8e
> Runtime Name: vmhba64:C0:T0:L0
> Device: eui.a012ce7696bf47d5be87760d8f78fb8e
> Device Display Name: NVMe RDMA Disk (eui.a012ce7696bf47d5be87760d8f78fb8e)
> Adapter: vmhba64
> Channel: 0
> Target: 0
> LUN: 0
> Plugin: HPP
> State: active
> Transport: rdma
> Adapter Identifier: rdma.vmnic2:98:03:9b:03:45:10
> Target Identifier: rdma.unknown
> Adapter Transport Details: Unavailable or path is unclaimed
> Target Transport Details: Unavailable or path is unclaimed
> Maximum IO Size: 131072
>
> It looks like the connect patch does work, but something else causes VMware not to accept the nvmet-rdma target devices.
> Not sure what to make of that. It could still be eui related? See the UID from the nvmet-rdma target.
>
> Thanks,
>
> --Mark
>
> On 02/09/2021, 23:36, "Max Gurtovoy" <mgurtovoy at nvidia.com> wrote:
>
>
> On 8/31/2021 4:42 PM, Mark Ruijter wrote:
> > When I connect an SPDK initiator, it will try to connect using 1024 connections.
> > The Linux target is unable to handle this situation and returns an error.
> >
> > Aug 28 14:22:56 crashme kernel: [169366.627010] infiniband mlx5_0: create_qp:2789:(pid 33755): Create QP type 2 failed
> > Aug 28 14:22:56 crashme kernel: [169366.627913] nvmet_rdma: failed to create_qp ret= -12
> > Aug 28 14:22:56 crashme kernel: [169366.628498] nvmet_rdma: nvmet_rdma_alloc_queue: creating RDMA queue failed (-12).
> >
> > It is really easy to reproduce the problem, even when not using the SPDK initiator.
> >
> > Just type:
> > nvme connect --transport=rdma --queue-size=1024 --nqn=SOME.NQN --traddr=SOME.IP --trsvcid=XXXX
> > While a Linux initiator attempts to set up 64 connections, SPDK attempts to create 1024 connections.
>
> Is it 1024 connections, or is it the queue depth?
>
> How many cores does the initiator have?
>
> Can you give more details on the systems?
>
> >
> > The result is that anything that relies on SPDK, like VMware 7.x for example, won't be able to connect.
> > Forcing the queues to be restricted to a QD of 256 solves some of it. In this case SPDK and VMware seem to connect.
> > See the code section below. Sadly, VMware declares the path to be dead afterwards. I guess this 'fix' needs more work. ;-(
> >
> > I noticed that someone reported this problem on the SPDK list:
> > https://github.com/spdk/spdk/issues/1719
> >
> > Thanks,
> >
> > Mark
> >
> > ---
> > static int
> > nvmet_rdma_parse_cm_connect_req(struct rdma_conn_param *conn,
> > struct nvmet_rdma_queue *queue)
> > {
> > struct nvme_rdma_cm_req *req;
> >
> > req = (struct nvme_rdma_cm_req *)conn->private_data;
> > if (!req || conn->private_data_len == 0)
> > return NVME_RDMA_CM_INVALID_LEN;
> >
> > if (le16_to_cpu(req->recfmt) != NVME_RDMA_CM_FMT_1_0)
> > return NVME_RDMA_CM_INVALID_RECFMT;
> >
> > queue->host_qid = le16_to_cpu(req->qid);
> >
> > /*
> > * req->hsqsize corresponds to our recv queue size plus 1
> > * req->hrqsize corresponds to our send queue size
> > */
> > queue->recv_queue_size = le16_to_cpu(req->hsqsize) + 1;
> > queue->send_queue_size = le16_to_cpu(req->hrqsize);
> > if (!queue->host_qid && queue->recv_queue_size > NVME_AQ_DEPTH) {
> > pr_info("MARK nvmet_rdma_parse_cm_connect_req return %i", NVME_RDMA_CM_INVALID_HSQSIZE);
> > return NVME_RDMA_CM_INVALID_HSQSIZE;
> > }
> >
> > + if (queue->recv_queue_size > 256)
> > + queue->recv_queue_size = 256;
> > + if (queue->send_queue_size > 256)
> > + queue->send_queue_size = 256;
> > + pr_info("MARK queue->recv_queue_size = %i", queue->recv_queue_size);
> > + pr_info("MARK queue->send_queue_size = %i", queue->send_queue_size);
> >
> > /* XXX: Should we enforce some kind of max for IO queues? */
> > return 0;
> > }
> >
> >
> >
> > _______________________________________________
> > Linux-nvme mailing list
> > Linux-nvme at lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-nvme
>