nvme tcp receive errors

Keith Busch kbusch at kernel.org
Mon Apr 5 15:37:02 BST 2021


On Fri, Apr 02, 2021 at 10:27:11AM -0700, Sagi Grimberg wrote:
> 
> > > Thanks for the reply.
> > > 
> > > This was observed on the recent 5.12-rc4, so it has all the latest tcp
> > > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> > > difference. It is currently reproducible, though it can take over an
> > > hour right now.
> > 
> > After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
> 
> Ah, that's probably because WRITE_ZEROS are not set with RQF_SPECIAL..
> This patch is actually needed.
> 
> 
> > We'll try adding it back, plus adding your debug patch.
> 
> Yes, that would give us more info about what state the
> request is in when getting these errors.

We have recreated with your debug patch:

  nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3

State 3 corresponds to NVME_TCP_CMD_DATA_DONE.

The summary from the test that I received:

  We have an Ethernet trace for this failure. I filtered the trace for the
  connection that maps to "queue 6 of nvme4" and tracked the state of the IO
  command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
  the Ethernet trace is:

 1. The target receives this Command in an Ethernet frame that has 9 Command
    capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
    operation for a 16K IO size.
 2. The target sends 11 frames of C2HDATA PDUs, each with 1416 bytes, and one
    C2HDATA PDU with 832 bytes to complete the 16K transfer. The LAST flag is
    set in the last PDU.
   3. The target sends a Response for this Command.
 4. About 1.3 ms later, the Host logs this message and closes the connection.

Please let us know if you need any additional information.
