nvme tcp receive errors

Keith Busch kbusch at kernel.org
Wed Apr 7 20:53:19 BST 2021


On Mon, Apr 05, 2021 at 11:37:02PM +0900, Keith Busch wrote:
> On Fri, Apr 02, 2021 at 10:27:11AM -0700, Sagi Grimberg wrote:
> > 
> > > > Thanks for the reply.
> > > > 
> > > > This was observed on the recent 5.12-rc4, so it has all the latest tcp
> > > > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> > > > difference. It is currently reproducible, though it can take over an
> > > > hour right now.
> > > 
> > > After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
> > 
> > Ah, that's probably because WRITE_ZEROES is not set with RQF_SPECIAL..
> > This patch is actually needed.
> > 
> > 
> > > We'll try adding it back, plus adding your debug patch.
> > 
> > Yes, that would give us more info about what state the
> > request is in when getting these errors.
> 
> We have recreated with your debug patch:
> 
>   nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3
> 
> State 3 corresponds to the "NVME_TCP_CMD_DATA_DONE".
> 
> The summary from the test that I received:
> 
>   We have an Ethernet trace for this failure. I filtered the trace for the
>   connection that maps to "queue 6 of nvme4" and tracked the state of the IO
>   command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
>   the Ethernet trace is:
> 
>    1. The target receives this Command in an Ethernet frame that has 9 Command
>       capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
>       operation with a 16K IO size.
>    2. The target sends 11 frames of C2HDATA PDUs, each with 1416 bytes, and one
>       C2HDATA PDU with 832 bytes to complete the 16K transfer. The LAS flag is
>       set in the last PDU.
>    3. The target sends a Response for this Command.
>    4. About 1.3 ms later, the Host logs this message and closes the connection.
> 
> Please let us know if you need any additional information.

I'm not sure if this is just a different symptom of the same problem,
but with the debug patch, we're occasionally hitting messages like:

  nvme nvme5: req 8 r2t len 16384 exceeded data len 16384 (8192 sent) cmd_state 2



More information about the Linux-nvme mailing list