SPDK initiators (VMware 7.x) cannot connect to nvmet-rdma.
Mark Ruijter
mruijter at primelogic.nl
Mon Sep 6 02:12:06 PDT 2021
Hi Max,
The system I use has dual AMD EPYC 7452 32-Core Processors.
MemTotal: 197784196 kB
It has a single dual port ConnectX-6 card.
81:00.0 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
81:00.1 Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6]
The problem is not related to hardware. VMware works flawlessly using the SPDK target on this system.
The kernel target fails like this:
target/rdma.c -> infiniband/cma.c -> infiniband/verbs.c -> infiniband/hw/mlx5/qp.c
nvmet_rdma_cm_accept -> rdma_accept -> ib_create_named_qp -> create_kernel_qp ->
returns -12 -> mlx5_0: create_qp:2774:(pid 1246): MARK Create QP type 2 failed
The queue size is 1024. The mlx5 driver then enters calc_sq_size, where it fails at this check and returns -ENOMEM:
--
	if (qp->sq.wqe_cnt > (1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz))) {
		mlx5_ib_dbg(dev, "send queue size (%d * %d / %d -> %d) exceeds limits(%d)\n",
			    attr->cap.max_send_wr, wqe_size, MLX5_SEND_WQE_BB,
			    qp->sq.wqe_cnt,
			    1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz));
		return -ENOMEM;
	}
--
Sep 5 12:53:45 everest kernel: [ 567.691658] MARK enter ib_create_named_qp
Sep 5 12:53:45 everest kernel: [ 567.691667] MARK wq_size = 2097152
Sep 5 12:53:46 everest kernel: [ 567.692419] MARK create_kernel_qp 0
Sep 5 12:53:46 everest kernel: [ 568.204213] MARK enter ib_create_named_qp
Sep 5 12:53:46 everest kernel: [ 568.204218] MARK wq_size = 4194304
Sep 5 12:53:46 everest kernel: [ 568.204219] MARK 1 send queue size (4097 * 640 / 64 -> 65536) exceeds limits(32768)
Sep 5 12:53:46 everest kernel: [ 568.204220] MARK 1 calc_sq_size return ENOMEM
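For reference, the arithmetic can be reproduced in user space. This is a minimal sketch: the constants come from the log lines above, and the round-up-to-a-power-of-two step mirrors what calc_sq_size in drivers/infiniband/hw/mlx5/qp.c does before the check quoted earlier.
--
#include <stdio.h>
#include <stdint.h>

/* Round v up to the next power of two, like the kernel's roundup_pow_of_two(). */
static uint64_t roundup_pow_of_two64(uint64_t v)
{
	uint64_t r = 1;

	while (r < v)
		r <<= 1;
	return r;
}

int main(void)
{
	uint64_t max_send_wr = 4097;  /* attr->cap.max_send_wr, from the log */
	uint64_t wqe_size = 640;      /* per-WQE size, from the log */
	uint64_t bb = 64;             /* MLX5_SEND_WQE_BB */
	uint64_t limit = 32768;       /* 1 << log_max_qp_sz on this card */

	uint64_t wq_size = roundup_pow_of_two64(max_send_wr * wqe_size);
	uint64_t wqe_cnt = wq_size / bb;

	/* With these inputs: wq_size = 4194304, wqe_cnt = 65536 -> exceeds 32768 */
	printf("wq_size = %llu, wqe_cnt = %llu -> %s %llu\n",
	       (unsigned long long)wq_size, (unsigned long long)wqe_cnt,
	       wqe_cnt > limit ? "exceeds" : "within",
	       (unsigned long long)limit);
	return 0;
}
--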
A hack/fix I tested, which seems to work or at least prevents the immediate failure, is this:
--- /root/linux-5.11/drivers/nvme/target/rdma.c
+++ rdma.c 2021-09-06 03:05:08.998364562 -0400
@@ -1397,6 +1397,10 @@
if (!queue->host_qid && queue->recv_queue_size > NVME_AQ_DEPTH)
return NVME_RDMA_CM_INVALID_HSQSIZE;
+	if (queue->send_queue_size > 256) {
+		queue->send_queue_size = 256;
+		pr_info("MARK: reducing queue->send_queue_size to 256\n");
+	}
/* XXX: Should we enforce some kind of max for IO queues? */
return 0;
---
The answer to the question in the code: "Should we enforce some kind of max for IO queues?" seems to be: yes?
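One possible shape for that maximum, sketched rather than taken from any existing patch: clamp against a transport-defined ceiling and the device's advertised capability instead of a bare 256 in the parse path. NVMET_RDMA_MAX_QUEUE_SIZE is an invented name, and nvmet_rdma_parse_cm_connect_req() does not currently see the ib_device, so that argument is assumed plumbing:
--
/* Sketch only; not existing nvmet code. */
#define NVMET_RDMA_MAX_QUEUE_SIZE	256

static void nvmet_rdma_clamp_queue_sizes(struct nvmet_rdma_queue *queue,
					 struct ib_device *ibdev)
{
	/* Never exceed what the device itself advertises per QP. */
	u32 max = min_t(u32, NVMET_RDMA_MAX_QUEUE_SIZE,
			ibdev->attrs.max_qp_wr);

	if (queue->send_queue_size > max) {
		pr_info("clamping send queue size %d -> %u\n",
			queue->send_queue_size, max);
		queue->send_queue_size = max;
	}
	if (queue->recv_queue_size > max) {
		pr_info("clamping recv queue size %d -> %u\n",
			queue->recv_queue_size, max);
		queue->recv_queue_size = max;
	}
}
--
Note that attrs.max_qp_wr alone would not have caught this case: the failing request asked for 4097 WRs, well under the 32768 WR limit, and mlx5 rejects it on total work-queue bytes (WR count times WQE size). A fixed ceiling sidesteps that.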
Although VMware now discovers and connects to the kernel target, the path is not working and is declared dead.
The volume appears with an nguid since the target does not set the eui64 field.
However, setting it by using a pass-through device does not solve the issue.
When I don't use pass-through, esxcli nvme reports this:
esxcli nvme namespace list
Name                                   Controller Number  Namespace ID  Block Size  Capacity in MB
-------------------------------------  -----------------  ------------  ----------  --------------
eui.344337304e8001510025384100000001   263                1             4096        12207104
uuid.fa8ab2201ffb4429ba1719ca0d5a3405  322                1             512         14649344
When I use pass-through it reports:
[root@vmw01:~] esxcli nvme namespace list
Name                                  Controller Number  Namespace ID  Block Size  Capacity in MB
------------------------------------  -----------------  ------------  ----------  --------------
eui.344337304e8001510025384100000001  263                1             4096        12207104
eui.344337304e7000780025384100000001  324                1             512         14649344
The reason is easy to explain. Without pass-through, the kernel target shows this when I query a device with sg_inq:
sg_inq -e -p 0x83 /dev/nvmeXn1 -vvv
VPD INQUIRY: Device Identification page
Designation descriptor number 1, descriptor length: 52
designator_type: T10 vendor identification, code_set: ASCII
associated with the Target device that contains addressed lu
vendor id: NVMe
vendor specific: testvg/testlv_79d87ff74dac1b27
With pass-through the kernel target provides this information for the same device:
VPD INQUIRY: Device Identification page
Designation descriptor number 1, descriptor length: 56
designator_type: T10 vendor identification, code_set: ASCII
associated with the Target device that contains addressed lu
vendor id: NVMe
vendor specific: SAMSUNG MZWLL12THMLA-00005_S4C7NA0N700078
Designation descriptor number 2, descriptor length: 20
designator_type: EUI-64 based, code_set: Binary
associated with the Addressed logical unit
EUI-64 based 16 byte identifier
Identifier extension: 0x344337304e700078
IEEE Company_id: 0x2538
Vendor Specific Extension Identifier: 0x410000000103
[0x344337304e7000780025384100000001]
Designation descriptor number 3, descriptor length: 40
designator_type: SCSI name string, code_set: UTF-8
associated with the Addressed logical unit
SCSI name string:
eui.344337304E7000780025384100000001
SPDK returns this for the same device:
VPD INQUIRY: Device Identification page
Designation descriptor number 1, descriptor length: 48
designator_type: T10 vendor identification, code_set: ASCII
associated with the Target device that contains addressed lu
vendor id: NVMe
vendor specific: SPDK_Controller1_SPDK00000000000001
Designation descriptor number 2, descriptor length: 20
designator_type: EUI-64 based, code_set: Binary
associated with the Addressed logical unit
EUI-64 based 16 byte identifier
Identifier extension: 0xe0e9311590254d4f
IEEE Company_id: 0x8fa737
Vendor Specific Extension Identifier: 0xb56897382503
[0xe0e9311590254d4f8fa737b568973825]
Designation descriptor number 3, descriptor length: 40
designator_type: SCSI name string, code_set: UTF-8
associated with the Addressed logical unit
SCSI name string:
eui.E0E9311590254D4F8FA737B568973825
So the kernel target returns limited information when not using pass-through, which forces VMware to use the nguid.
We could use the nguid to fill the eui64 attribute and always report the extended info, like we do with a pass-through device?
-------------------
--- /root/linux-5.11/drivers/nvme/target/admin-cmd.c 2021-02-14 17:32:24.000000000 -0500
+++ admin-cmd.c 2021-09-05 06:18:10.836865874 -0400
@@ -526,6 +526,7 @@
id->anagrpid = cpu_to_le32(ns->anagrpid);
memcpy(&id->nguid, &ns->nguid, sizeof(id->nguid));
+ memcpy(&id->eui64, &ns->nguid, sizeof(id->eui64));
id->lbaf[0].ds = ns->blksize_shift;
--- /root/linux-5.11/drivers/nvme/target/configfs.c 2021-02-14 17:32:24.000000000 -0500
+++ configfs.c 2021-09-05 05:35:35.741619651 -0400
@@ -477,6 +477,7 @@
}
memcpy(&ns->nguid, nguid, sizeof(nguid));
+ memcpy(&ns->eui64, nguid, sizeof(ns->eui64));
out_unlock:
mutex_unlock(&subsys->lock);
return ret ? ret : count;
--------------
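If this direction is pursued, it probably wants a guard so an explicitly configured eui64 is never overwritten, and one caveat is worth spelling out: eui64 is 8 bytes while the nguid is 16, so the memcpy above takes only the first half of the nguid, which is not guaranteed to form a spec-valid OUI-based EUI-64. A sketch, with an invented helper name, assuming the ns->eui64 field from the configfs hunk above:
--
/* Sketch: default the eui64 from the nguid only when it is still unset. */
static void nvmet_ns_default_eui64(struct nvmet_ns *ns)
{
	static const u8 zero[8];

	/* Don't clobber an eui64 the administrator set explicitly. */
	if (memcmp(&ns->eui64, zero, sizeof(zero)))
		return;

	/* Copies the first 8 of the nguid's 16 bytes; see caveat above. */
	memcpy(&ns->eui64, &ns->nguid, sizeof(ns->eui64));
}
--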
Even with pass-through enabled and the kernel target returning all information, the path is immediately reported as dead.
esxcli storage core path list
rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-
UID: rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-
Runtime Name: vmhba64:C0:T1:L0
Device: No associated device
Device Display Name: No associated device
Adapter: vmhba64
Channel: 0
Target: 1
LUN: 0
Plugin: (unclaimed)
State: dead
Transport: rdma
Adapter Identifier: rdma.vmnic2:98:03:9b:03:45:10
Target Identifier: rdma.unknown
Adapter Transport Details: Unavailable or path is unclaimed
Target Transport Details: Unavailable or path is unclaimed
Maximum IO Size: 131072
This may or may not be a VMware path-checker issue.
Since SPDK does not show this problem, some difference between the kernel target and the SPDK target must exist.
I don't know if the patch I use that limits the queue depth to 256 is to blame.
The path for the exact same device exported with SPDK shows up like this:
rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-eui.a012ce7696bf47d5be87760d8f78fb8e
UID: rdma.vmnic2:98:03:9b:03:45:10-rdma.unknown-eui.a012ce7696bf47d5be87760d8f78fb8e
Runtime Name: vmhba64:C0:T0:L0
Device: eui.a012ce7696bf47d5be87760d8f78fb8e
Device Display Name: NVMe RDMA Disk (eui.a012ce7696bf47d5be87760d8f78fb8e)
Adapter: vmhba64
Channel: 0
Target: 0
LUN: 0
Plugin: HPP
State: active
Transport: rdma
Adapter Identifier: rdma.vmnic2:98:03:9b:03:45:10
Target Identifier: rdma.unknown
Adapter Transport Details: Unavailable or path is unclaimed
Target Transport Details: Unavailable or path is unclaimed
Maximum IO Size: 131072
It looks like the connect patch does work, but something else causes VMware not to accept the nvmet-rdma target devices.
I'm not sure what to make of that. It could still be eui related? Compare the UID of the nvmet-rdma path above, which ends in a bare '-' where the SPDK path carries an eui.
Thanks,
--Mark
On 02/09/2021, 23:36, "Max Gurtovoy" <mgurtovoy at nvidia.com> wrote:
On 8/31/2021 4:42 PM, Mark Ruijter wrote:
> When I connect an SPDK initiator it will try to connect using 1024 connections.
> The Linux target is unable to handle this situation and returns an error.
>
> Aug 28 14:22:56 crashme kernel: [169366.627010] infiniband mlx5_0: create_qp:2789:(pid 33755): Create QP type 2 failed
> Aug 28 14:22:56 crashme kernel: [169366.627913] nvmet_rdma: failed to create_qp ret= -12
> Aug 28 14:22:56 crashme kernel: [169366.628498] nvmet_rdma: nvmet_rdma_alloc_queue: creating RDMA queue failed (-12).
>
> It is really easy to reproduce the problem, even when not using the SPDK initiator.
>
> Just type:
> nvme connect --transport=rdma --queue-size=1024 --nqn=SOME.NQN --traddr=SOME.IP --trsvcid=XXXX
> While a Linux initiator attempts to set up 64 connections, SPDK attempts to create 1024 connections.
1024 connections, or is it the queue depth?
How many cores do you have in the initiator?
Can you give more details on the systems?
>
> The result is that anything which relies on SPDK, like VMware 7.x for example, won't be able to connect.
> Forcing the queues to be restricted to 256 QD solves some of it. In this case SPDK and VMware seem to connect.
> See the code section below. Sadly, VMware declares the path to be dead afterwards. I guess this 'fix' needs more work. ;-(
>
> I noticed that someone reported this problem on the SPDK list:
> https://github.com/spdk/spdk/issues/1719
>
> Thanks,
>
> Mark
>
> ---
> static int
> nvmet_rdma_parse_cm_connect_req(struct rdma_conn_param *conn,
> 		struct nvmet_rdma_queue *queue)
> {
> 	struct nvme_rdma_cm_req *req;
>
> 	req = (struct nvme_rdma_cm_req *)conn->private_data;
> 	if (!req || conn->private_data_len == 0)
> 		return NVME_RDMA_CM_INVALID_LEN;
>
> 	if (le16_to_cpu(req->recfmt) != NVME_RDMA_CM_FMT_1_0)
> 		return NVME_RDMA_CM_INVALID_RECFMT;
>
> 	queue->host_qid = le16_to_cpu(req->qid);
>
> 	/*
> 	 * req->hsqsize corresponds to our recv queue size plus 1
> 	 * req->hrqsize corresponds to our send queue size
> 	 */
> 	queue->recv_queue_size = le16_to_cpu(req->hsqsize) + 1;
> 	queue->send_queue_size = le16_to_cpu(req->hrqsize);
> 	if (!queue->host_qid && queue->recv_queue_size > NVME_AQ_DEPTH) {
> 		pr_info("MARK nvmet_rdma_parse_cm_connect_req return %i\n", NVME_RDMA_CM_INVALID_HSQSIZE);
> 		return NVME_RDMA_CM_INVALID_HSQSIZE;
> 	}
>
> +	if (queue->recv_queue_size > 256)
> +		queue->recv_queue_size = 256;
> +	if (queue->send_queue_size > 256)
> +		queue->send_queue_size = 256;
> +	pr_info("MARK queue->recv_queue_size = %i\n", queue->recv_queue_size);
> +	pr_info("MARK queue->send_queue_size = %i\n", queue->send_queue_size);
>
> 	/* XXX: Should we enforce some kind of max for IO queues? */
> 	return 0;
> }
>
>
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme