nvme tcp receive errors

Sagi Grimberg sagi at grimberg.me
Fri Apr 9 19:04:43 BST 2021


>>>> Thanks for the reply.
>>>>
>>>> This was observed on the recent 5.12-rc4, so it has all the latest tcp
>>>> fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
>>>> difference. It is currently reproducible, though it can take over an
>>>> hour right now.
>>>
>>> After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
>>
>> Ah, that's probably because WRITE_ZEROS requests are not set with RQF_SPECIAL..
>> This patch is actually needed.
>>
>>
>>> We'll try adding it back, plus adding your debug patch.
>>
>> Yes, that would give us more info about what state the
>> request is in when getting these errors.
> 
> We have recreated with your debug patch:
> 
>    nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3
> 
> State 3 corresponds to "NVME_TCP_CMD_DATA_DONE".
> 
> The summary from the test that I received:
> 
>    We have an Ethernet trace for this failure. I filtered the trace for the
>    connection that maps to "queue 6 of nvme4" and tracked the state of the IO
>    command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
>    the Ethernet trace is:
> 
>     1. The target receives this Command in an Ethernet frame that has 9 Command
>        capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
>        operation with a 16K IO size.
>     2. The target sends 11 frames of C2HDATA PDUs, each with 1416 bytes, and one
>        C2HDATA PDU with 832 bytes to complete the 16K transfer. The LAST flag is
>        set in the last PDU.

Do the c2hdata pdus have a data_length of 1416? And the last one has
data_length = 832?

1416 * 11 + 832 = 16408 > 16384
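
As a quick sanity check of that math, here is a standalone sketch (the
1416/832 figures are taken from the trace summary above; the loop is just
illustrative accounting, not the actual nvme-tcp receive path):

#include <stdio.h>

int main(void)
{
	unsigned int req_len = 16 * 1024;	/* 16K read request */
	unsigned int space = req_len;		/* bytes left in the request buffer */
	unsigned int pdu;

	for (pdu = 1; pdu <= 12; pdu++) {
		/* 11 PDUs of 1416 bytes, then one of 832, per the trace */
		unsigned int data_length = (pdu < 12) ? 1416 : 832;

		if (data_length > space) {
			printf("PDU %u: data_length %u, only %u bytes left"
			       " -> \"no space in request\"\n",
			       pdu, data_length, space);
			return 1;
		}
		space -= data_length;
	}
	printf("all data fit, %u bytes to spare\n", space);
	return 0;
}

i.e. the 12th PDU would arrive carrying 832 bytes when only 808 bytes of
the 16K buffer remain, which is the 24-byte overrun in the sum above.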

Can you share, for each of the c2hdata PDUs:
- hlen
- plen
- data_length
- data_offset
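
(For reference, those are the fields of the common PDU header and the
C2HDATA PDU header, roughly as laid out in include/linux/nvme-tcp.h;
reproduced here from memory, so double-check against your tree:)

struct nvme_tcp_hdr {
	__u8	type;		/* PDU type, e.g. C2HData */
	__u8	flags;		/* e.g. LAST_PDU, SUCCESS, digest bits */
	__u8	hlen;		/* header length */
	__u8	pdo;		/* PDU data offset */
	__le32	plen;		/* total PDU length (header + data + digests) */
};

struct nvme_tcp_data_pdu {
	struct nvme_tcp_hdr	hdr;
	__u16			command_id;
	__u16			ttag;
	__le32			data_offset;	/* offset of this data within the command */
	__le32			data_length;	/* payload bytes carried by this PDU */
	__u8			rsvd[4];
};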


