nvme tcp receive errors

Sagi Grimberg sagi at grimberg.me
Tue May 4 20:29:32 BST 2021


>>>>> The driver tracepoints captured millions of IOs where everything
>>>>> happened as expected, so I really think something got confused and
>>>>> mucked with the wrong request. I've added more tracepoints to increase
>>>>> visibility because I frankly didn't find how that could happen just from
>>>>> code inspection. We will also incorporate your patch below for the next
>>>>> recreate.
>>>>
>>>> Keith, does the issue still happen with eliminating the network send
>>>> from .queue_rq() ?
>>>
>>> This patch resolved the observed r2t issues over the weekend test run,
>>> which ran much longer than it could have previously. I'm happy we're
>>> narrowing this down, but I'm not seeing how this addresses the problem.
>>> It looks like the mutex single-threads the critical parts, but maybe I'm
>>> missing something. Any ideas?
>>
>> Not yet, but note that while the send part is mutually exclusive, the
>> receive context is where we handle the r2t, validate length/offset,
>> and (re)queue the request for sending an h2cdata pdu back to the
>> controller.
>>
>> The network send was an optimization for latency, and then I modified
>> the queueing in the driver such that a request first goes to an llist
>> and the sending context (either io_work or .queue_rq) then reaps it
>> into a local send_list. This gives the driver a better understanding of
>> what is in flight so it can better set the network msg flags for EOR/MORE.
>>
>> My assumption is that maybe somehow we send the initial command
>> pdu to the controller from queue_rq, receive the r2t back before the
>> .queue_rq context has completed, and something may not be coherent.
> 
> Interesting. The network traces look correct, so my thoughts jumped to
> possibly incorrect usage of PCIe relaxed ordering, but that appears to
> be disabled. I'll keep looking for other possibilities.
> 
>> Side question: are you running with a fully preemptible kernel, or with
>> fewer NVMe queues than CPUs?
> 
> Voluntary preempt. This test is using the kernel config from Ubuntu
> 20.04.
> 
> There are 16 CPUs in this setup with just 7 IO queues.

Keith,

Maybe this can help add some more information (compile-tested only):
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a848b5b7f77b..9ce20b26c600 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -57,6 +57,7 @@ struct nvme_tcp_request {
         size_t                  data_sent;
         size_t                  data_received;
         enum nvme_tcp_send_state state;
+       bool                    got_r2t;
  };

  enum nvme_tcp_queue_flags {
@@ -622,10 +623,18 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
         }
         req = blk_mq_rq_to_pdu(rq);

+       if (req->state != NVME_TCP_SEND_CMD_PDU) {
+               dev_err(queue->ctrl->ctrl.device,
+                       "queue %d tag %#x req unexpected state %d got_r2t %d\n",
+                       nvme_tcp_queue_id(queue), rq->tag, req->state,
+                       req->got_r2t);
+       }
+
         ret = nvme_tcp_setup_h2c_data_pdu(req, pdu);
         if (unlikely(ret))
                 return ret;

+       req->got_r2t = true;
         req->state = NVME_TCP_SEND_H2C_PDU;
         req->offset = 0;

@@ -1083,6 +1092,8 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
         }

         if (req->state == NVME_TCP_SEND_DATA) {
+               /* sending data not inline AND unsolicited? */
+               WARN_ON_ONCE(!nvme_tcp_has_inline_data(req) && !req->got_r2t);
                 ret = nvme_tcp_try_send_data(req);
                 if (ret <= 0)
                         goto done;
@@ -2275,6 +2286,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
                 return ret;

         req->state = NVME_TCP_SEND_CMD_PDU;
+       req->got_r2t = false;
         req->offset = 0;
         req->data_sent = 0;
         req->data_received = 0;
--
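
For reference, the queueing I described above (a request first goes to an
llist, and the sending context then reaps it into a local send_list) looks
roughly like the sketch below. This is a simplified illustration, not the
actual tcp.c code: it relies on the driver's struct nvme_tcp_request /
nvme_tcp_queue definitions, the member names (req_list, send_list, lentry,
entry) are approximations, and the direct-send-from-.queue_rq latency
optimization mentioned above is omitted.

/*
 * Sketch only: producer side.  Both .queue_rq() and the r2t handler in the
 * receive context push the request onto a lockless llist and kick the
 * sending context.
 */
static void nvme_tcp_queue_request_sketch(struct nvme_tcp_request *req,
                struct nvme_tcp_queue *queue)
{
        llist_add(&req->lentry, &queue->req_list);
        queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}

/*
 * Sketch only: consumer side (io_work or .queue_rq).  Reap everything queued
 * so far into a local send_list; seeing the whole backlog at once is what
 * lets the sender pick MSG_MORE vs. MSG_EOR for the last PDU it writes to
 * the socket.
 */
static void nvme_tcp_reap_req_list_sketch(struct nvme_tcp_queue *queue)
{
        struct llist_node *node;
        struct nvme_tcp_request *req;

        for (node = llist_del_all(&queue->req_list); node; node = node->next) {
                req = llist_entry(node, struct nvme_tcp_request, lentry);
                list_add(&req->entry, &queue->send_list);
        }
}

The r2t handler requeues the request for sending through this same path, so
the got_r2t flag in the patch lets the WARN_ON_ONCE in nvme_tcp_try_send()
catch data being sent that is neither inline nor solicited by an r2t.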


