nvme tcp receive errors

Keith Busch kbusch at kernel.org
Wed Apr 28 16:58:47 BST 2021


On Tue, Apr 27, 2021 at 04:55:43PM -0700, Sagi Grimberg wrote:
> 
> > > > > > > > This was observed on the recent 5.12-rc4, so it has all the latest tcp
> > > > > > > > fixes. I'll check with reverting 0dc9edaf80ea and see if that makes a
> > > > > > > > difference. It is currently reproducible, though it can take over an
> > > > > > > > hour right now.
> > > > > > > 
> > > > > > > After reverting 0dc9edaf80ea, we are observing a kernel panic (below).
> > > > > > 
> > > > > > Ah, that's probably because WRITE_ZEROES requests are not set with RQF_SPECIAL..
> > > > > > This patch is actually needed.
> > > > > > 
> > > > > > 
> > > > > > > We'll try adding it back, plus adding your debug patch.
> > > > > > 
> > > > > > Yes, that would give us more info about what state the
> > > > > > request is in when these errors occur.
> > > > > 
> > > > > We have recreated with your debug patch:
> > > > > 
> > > > >     nvme nvme4: queue 6 no space in request 0x1 no space cmd_state 3
> > > > > 
> > > > > State 3 corresponds to "NVME_TCP_CMD_DATA_DONE".
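
As a rough user-space model of what the "no space in request" message
indicates: the receive path found more C2HDATA payload arriving for a command
than the request's receive buffers have room for. The struct, helper, and
message text below are simplified stand-ins, not the driver's exact code.

--
#include <stdio.h>
#include <stdint.h>

/* Illustrative stand-in for the host's per-request receive accounting. */
struct req_model {
	uint32_t data_len;	/* bytes the host expects for this command */
	uint32_t received;	/* bytes already copied into the request */
};

/* Receive one C2HDATA payload chunk for the command. */
static int recv_data(struct req_model *req, uint32_t len, int qid, int tag)
{
	if (req->received + len > req->data_len) {
		printf("queue %d no space in request %#x\n", qid, tag);
		return -1;
	}
	req->received += len;
	return 0;
}

int main(void)
{
	struct req_model req = { .data_len = 16384, .received = 0 };

	recv_data(&req, 16384, 6, 0x1);	/* fills the request exactly */
	recv_data(&req, 1, 6, 0x1);	/* any further byte trips the check */
	return 0;
}
--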
> > > > > 
> > > > > The summary from the test that I received:
> > > > > 
> > > > >     We have an Ethernet trace for this failure. I filtered the trace for the
> > > > >     connection that maps to "queue 6 of nvme4" and tracked the state of the IO
> > > > >     command with Command ID 0x1 ("Tag 0x1"). The sequence for this command per
> > > > >     the Ethernet trace is:
> > > > > 
> > > > >      1. The target receives this Command in an Ethernet frame that has 9 Command
> > > > >         capsules and a partial H2CDATA PDU. The Command with ID 0x1 is a Read
> > > > >         operation with a 16K IO size.
> > > > >      2. The target sends 11 frames of C2HDATA PDUs, each with 1416 bytes, and one
> > > > >         C2HDATA PDU with 832 bytes to complete the 16K transfer. The LAST_PDU flag
> > > > >         is set in the last PDU.
> > > > >      3. The target sends a Response for this Command.
> > > > >      4. About 1.3 ms later, the Host logs this message and closes the connection.
> > > > > 
> > > > > Please let us know if you need any additional information.
> > > > 
> > > > I'm not sure if this is just a different symptom of the same problem,
> > > > but with the debug patch, we're occasionally hitting messages like:
> > > > 
> > > >     nvme nvme5: req 8 r2t len 16384 exceeded data len 16384 (8192 sent) cmd_state 2
> > > 
> > > According to this message, the host got an r2t for 16384 bytes after
> > > it had already sent 8192 (which can only happen if it previously got
> > > one or more r2ts soliciting at least 8192 bytes in total). Can you
> > > share, for each r2t pdu in this sequence:
> > > r2t_length
> > > r2t_offset
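
As a rough user-space model of the host-side sanity check that produces the
"r2t len ... exceeded data len" message, loosely based on the R2T handling in
drivers/nvme/host/tcp.c; the struct, helper, and message text below are
simplified stand-ins, not the driver's exact code:

--
#include <stdio.h>
#include <stdint.h>

/* Illustrative stand-in for the per-request accounting the host keeps. */
struct req_model {
	uint32_t data_len;	/* total payload the command carries */
	uint32_t data_sent;	/* payload already sent, in-capsule or R2T-solicited */
};

/* Reject an R2T that solicits more data than the request has left. */
static int handle_r2t(struct req_model *req, uint32_t r2t_len, uint32_t r2t_offset)
{
	if (req->data_sent + r2t_len > req->data_len) {
		printf("r2t len %u exceeded data len %u (%u sent)\n",
		       r2t_len, req->data_len, req->data_sent);
		return -1;
	}
	if (r2t_offset < req->data_sent) {
		printf("unexpected r2t offset %u (expected %u)\n",
		       r2t_offset, req->data_sent);
		return -1;
	}
	return 0;
}

int main(void)
{
	/* The values from the message above: 16384 expected, 8192 already sent. */
	struct req_model req = { .data_len = 16384, .data_sent = 8192 };

	return handle_r2t(&req, 16384, 0) ? 1 : 0;
}
--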
> > 
> > This one took a bit to go through. The traces all show only a single r2t
> > pdu with 0 offset for the full length of the requested transfer. I had
> > to add trace events to see what the heck the driver is thinking, and
> > the result is even more confusing.
> > 
> > The kernel message error:
> > 
> >      nvme5: req 5 op 1 r2t len 16384 exceeded data len 16384 (4096 sent)
> > 
> > And all the new trace events for this request are:
> > 
> >      fio-25086   [011] ....  9630.542669: nvme_tcp_queue_rq: nvme5: qid=4 tag=5 op=1 data_len=16384
> >      fio-25093   [007] ....  9630.542854: nvme_tcp_cmd_pdu: nvme5: qid=4 tag=5 op=1 page_offset=3664 send_len=72
> >      <...>-22670 [003] ....  9630.544377: nvme_tcp_r2t: nvme5: qid=4 tag=5 op=1 r2t_len=16384 r2t_offset=0 data_sent=4096 data_len=16384
> > 
> > The fact that "data_sent" is non-zero on the very first r2t makes no sense to me.
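
The actual debug trace events are not included in this thread; as an
illustration only, a trace event that emits a line like the nvme_tcp_r2t one
above could be defined with the kernel's TRACE_EVENT machinery roughly as
follows. This is a hypothetical sketch, omits the controller-name prefix, and
would need the usual trace-header boilerplate to build:

--
TRACE_EVENT(nvme_tcp_r2t,
	TP_PROTO(int qid, u16 tag, u8 op, u32 r2t_len, u32 r2t_offset,
		 u32 data_sent, u32 data_len),
	TP_ARGS(qid, tag, op, r2t_len, r2t_offset, data_sent, data_len),
	TP_STRUCT__entry(
		__field(int, qid)
		__field(u16, tag)
		__field(u8, op)
		__field(u32, r2t_len)
		__field(u32, r2t_offset)
		__field(u32, data_sent)
		__field(u32, data_len)
	),
	TP_fast_assign(
		__entry->qid = qid;
		__entry->tag = tag;
		__entry->op = op;
		__entry->r2t_len = r2t_len;
		__entry->r2t_offset = r2t_offset;
		__entry->data_sent = data_sent;
		__entry->data_len = data_len;
	),
	TP_printk("qid=%d tag=%x op=%d r2t_len=%u r2t_offset=%u data_sent=%u data_len=%u",
		__entry->qid, __entry->tag, __entry->op, __entry->r2t_len,
		__entry->r2t_offset, __entry->data_sent, __entry->data_len)
);
--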
> 
> Yep, not supposed to happen... Like the traces though! very useful and
> absolutely worth having.
> 
> > So far I cannot find any sequence where that could happen.
> 
> The only way that data_sent can increment is either by getting an r2t
> solicitation or by sending in-capsule data.
> 
> What is the ioccsz the controller is exposing?

In-capsule data is not supported by this target, so ioccsz is 4.
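
For reference, IOCCSZ is reported in units of 16 bytes and struct nvme_command
is 64 bytes, so ioccsz = 4 leaves an in-capsule data budget of zero; the
inline_data condition quoted below can therefore never be true for this
target. A trivial sketch of the arithmetic (plain user-space C, not driver
code):

--
#include <stdio.h>

int main(void)
{
	unsigned int ioccsz = 4;			/* from this target's ID_CTRL */
	unsigned int cmnd_capsule_len = ioccsz * 16;	/* IOCCSZ is in 16-byte units: 64 */
	unsigned int nvme_command_size = 64;		/* sizeof(struct nvme_command) */

	/* Same subtraction as the inline-data check quoted below: 64 - 64 = 0,
	 * so no request ever qualifies for in-capsule data on this target. */
	printf("in-capsule data budget: %u bytes\n",
	       cmnd_capsule_len - nvme_command_size);
	return 0;
}
--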

> in nvme_tcp_send_cmd_pdu we have:
> --
>                 if (inline_data) {
>                         req->state = NVME_TCP_SEND_DATA;
>                         if (queue->data_digest)
>                                 crypto_ahash_init(queue->snd_hash);
>                 } else {
>                         nvme_tcp_done_send_req(queue);
>                 }
> --
> 
> Where inline_data flag is:
> 	bool inline_data = nvme_tcp_has_inline_data(req);
> which essentially boils down to:
> 	req->data_len <= queue->cmnd_capsule_len - sizeof(struct nvme_command);
> 
> I wonder what the nvme command sgl looks like? Is it an offset
> sgl?

Just a single data block descriptor SGL. The target supports only 1 and
reports that through ID_CTRL.MSDBD.

What do you mean by "offset" SGL?


