Windows NVMeoF client - debug assistance requested

Sagi Grimberg sagi at grimberg.me
Tue Jun 21 07:19:07 PDT 2016


> Hello all,
>
> We were making great progress on a Windows version of the NVMe over Fabrics host / initiator / client until we ran into something very interesting.  All of read/write requests from the client execute without any errors reported on either side, the Linux server is happy and the Windows client is happy.  The interesting part is that the writes never make it to the device.  The write completes with success but a subsequent read does not return what was written.  We've verified the command packets, etc. and the writes look just fine and do succeed but the data never makes it to the device.  The server reports no errors and returns success.
>
> If we use the Linux host to write data to the device or just 'dd' a good partition table to the device we can read it without any problems from Windows.
>
> I could use some advice on the best strategy to track this down.
>
> The traffic generated by the Linux host and Windows host are quite similar.  The only difference that might be interesting are the RDMA addresses.  On Linux they tend to be low in the address range.  On Windows they tend to be high in the address range.  But this is true of the reads as well.
>
> A sample write from the Linux host:
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087148] nvmet: drivers/nvme/target/core.c nvmet_req_init 499
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087149] nvmet:  struct nvme_common_command {
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087150] nvmet: __u8 opcode 1
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087151] nvmet: __u8 flags 40
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087153] nvmet: __u16 command_id 78
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087154] nvmet: __le32 nsid 1
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087155] nvmet: __le32 cdw2[0] 0
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087156] nvmet: __le32 cdw2[1] 0
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087157] nvmet: __le64 metadata 0
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087158] nvmet: union nvme_data_ptr { nvme_keyed_sgl_desc
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087159] nvmet: nvme_keyed_sgl_desc.__le64 addr 0
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087161] nvmet: nvme_keyed_sgl_desc.__u8 length[0] 0
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087162] nvmet: nvme_keyed_sgl_desc.__u8 length[1] 10
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087163] nvmet: nvme_keyed_sgl_desc.__u8 length[2] 0
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087164] nvmet: nvme_keyed_sgl_desc.__u8 key[0] 0
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087166] nvmet: nvme_keyed_sgl_desc.__u8 key[1] 0
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087167] nvmet: nvme_keyed_sgl_desc.__u8 key[2] 0
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087168] nvmet: nvme_keyed_sgl_desc.__u8 key[3] 0
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087169] nvmet: nvme_keyed_sgl_desc.__u8 type 1
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087170] nvmet: __le32 cdw10[0] 9C8
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087171] nvmet: __le32 cdw10[1] 0
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087173] nvmet: __le32 cdw10[2] 7
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087174] nvmet: __le32 cdw10[3] 0
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087175] nvmet: __le32 cdw10[4] 0
> Jun 20 16:36:48 nvmeof0 kernel: [ 4024.087176] nvmet: __le32 cdw10[5] 0
>
> A sample write from the Windows host:
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109389] nvmet:  struct nvme_common_command {
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109390] nvmet: __u8 opcode 1
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109391] nvmet: __u8 flags 40
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109392] nvmet: __u16 command_id A
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109394] nvmet: __le32 nsid 1
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109395] nvmet: __le32 cdw2[0] 0
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109396] nvmet: __le32 cdw2[1] 0
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109397] nvmet: __le64 metadata 0
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109398] nvmet: union nvme_data_ptr { nvme_keyed_sgl_desc
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109400] nvmet: nvme_keyed_sgl_desc.__le64 addr FFFFD0009BB95040
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109401] nvmet: nvme_keyed_sgl_desc.__u8 length[0] 0
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109402] nvmet: nvme_keyed_sgl_desc.__u8 length[1] 2
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109405] nvmet: nvme_keyed_sgl_desc.__u8 length[2] 0

Looks like the writes differ in length (4k vs. 512).

> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109406] nvmet: nvme_keyed_sgl_desc.__u8 key[0] A
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109407] nvmet: nvme_keyed_sgl_desc.__u8 key[1] 64
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109408] nvmet: nvme_keyed_sgl_desc.__u8 key[2] 1
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109409] nvmet: nvme_keyed_sgl_desc.__u8 key[3] 92
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109411] nvmet: nvme_keyed_sgl_desc.__u8 type 40
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109412] nvmet: __le32 cdw10[0] 21
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109413] nvmet: __le32 cdw10[1] 0
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109414] nvmet: __le32 cdw10[2] 0
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109415] nvmet: __le32 cdw10[3] 0
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109416] nvmet: __le32 cdw10[4] 0
> Jun 20 17:26:47 nvmeof0 kernel: [ 7023.109418] nvmet: __le32 cdw10[5] 0

Looks like you never use in-capsule data (which we do in linux for 4k
or less), you passed to the target:
{addr=0xFFFFD0009BB95040, length=0x200, key=0x9201640a}

Are you sure that this is the buffer address you have registered with
the device? Do you see the actual data travel the wire?



More information about the Linux-nvme mailing list