nvme tcp receive errors
Sagi Grimberg
sagi at grimberg.me
Tue May 4 20:29:32 BST 2021
>>>>> The driver tracepoints captured millions of IOs where everything
>>>>> happened as expected, so I really think something got confused and
>>>>> mucked with the wrong request. I've added more tracepoints to increase
>>>>> visibility because I frankly couldn't see how that could happen just from
>>>>> code inspection. We will also incorporate your patch below for the next
>>>>> recreate.
>>>>
>>>> Keith, does the issue still happen after eliminating the network send
>>>> from .queue_rq()?
>>>
>>> This patch resolved the observed r2t issues over the weekend test run,
>>> which is much longer than the test could run previously. I'm happy we're
>>> narrowing this down, but I'm not seeing how this addresses the problem.
>>> It looks like the mutex single-threads the critical parts, but maybe I'm
>>> missing something. Any ideas?
>>
>> Not yet, but note that while the send path is mutually exclusive, the
>> receive context is where we handle the r2t, validate its length/offset,
>> and (re)queue the request for sending an h2cdata pdu back to the
>> controller.
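
For reference, the receive-side flow in question looks roughly like this
(a simplified sketch of the driver from around this time; error handling and
the actual length/offset checks inside nvme_tcp_setup_h2c_data_pdu are
trimmed, so don't read it as the exact code):

static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
		struct nvme_tcp_r2t_pdu *pdu)
{
	struct nvme_tcp_request *req;
	struct request *rq;
	int ret;

	/* resolve which request this r2t solicits data for */
	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
	if (!rq)
		return -ENOENT;
	req = blk_mq_rq_to_pdu(rq);

	/* validate the r2t and prepare the h2cdata header */
	ret = nvme_tcp_setup_h2c_data_pdu(req, pdu);
	if (unlikely(ret))
		return ret;

	/* hand the request back to the send machinery */
	req->state = NVME_TCP_SEND_H2C_PDU;
	req->offset = 0;
	nvme_tcp_queue_request(req, false, true);
	return 0;
}
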
>>
>> The network send was an optimization for latency. I then modified the
>> queueing in the driver such that a request first goes to a llist, and the
>> sending context (either io_work or .queue_rq) then reaps it to a local
>> send_list. This gives the driver a better understanding of what is
>> inflight so that it can better set the network msg flags for EOR/MORE.
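
Roughly, that queueing looks like this (again a simplified sketch; names such
as req_list, send_mutex, nvme_tcp_wq and the helpers below are taken from the
upstream driver of this era, with details trimmed):

/* producers: .queue_rq and the r2t handler both go through here */
static void nvme_tcp_queue_request(struct nvme_tcp_request *req,
		bool sync, bool last)
{
	struct nvme_tcp_queue *queue = req->queue;
	bool empty;

	/* producers only touch the lockless llist */
	empty = llist_add(&req->lentry, &queue->req_list) &&
		list_empty(&queue->send_list) && !queue->request;

	/*
	 * If we are on the queue's io_cpu, nothing else is pending and the
	 * send_mutex is free, send directly from this context (this is the
	 * network send from .queue_rq); otherwise defer to io_work.
	 */
	if (queue->io_cpu == smp_processor_id() && sync && empty &&
	    mutex_trylock(&queue->send_mutex)) {
		nvme_tcp_send_all(queue);
		mutex_unlock(&queue->send_mutex);
	} else if (last) {
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
	}
}

/* consumer: the sending context reaps the llist into its local send_list */
static void nvme_tcp_process_req_list(struct nvme_tcp_queue *queue)
{
	struct nvme_tcp_request *req;
	struct llist_node *node;

	for (node = llist_del_all(&queue->req_list); node; node = node->next) {
		req = llist_entry(node, struct nvme_tcp_request, lentry);
		list_add(&req->entry, &queue->send_list);
	}
}

Because the sender owns send_list, it can see whether more requests are
queued behind the current one and choose MSG_MORE vs MSG_EOR accordingly.
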
>>
>> My assumption is that maybe somehow we send the initial command pdu to
>> the controller from .queue_rq, receive the r2t back before the .queue_rq
>> context has completed, and something ends up not being coherent.
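
Spelling that theory out against the submission path (a trimmed, hypothetical
sketch; the real nvme_tcp_queue_rq does more setup and readiness checks):

static blk_status_t nvme_tcp_queue_rq(struct blk_mq_hw_ctx *hctx,
		const struct blk_mq_queue_data *bd)
{
	struct nvme_ns *ns = hctx->queue->queuedata;
	struct request *rq = bd->rq;
	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
	blk_status_t ret;

	ret = nvme_tcp_setup_cmd_pdu(ns, rq);	/* state = SEND_CMD_PDU */
	if (unlikely(ret))
		return ret;

	blk_mq_start_request(rq);

	/*
	 * This may send the command pdu inline on the socket.  If the
	 * controller answers with an r2t before we return from here, the
	 * receive context could already be flipping the request to
	 * SEND_H2C_PDU and requeueing it while this context is still
	 * running -- that is the window the assumption above is about.
	 */
	nvme_tcp_queue_request(req, true, bd->last);

	return BLK_STS_OK;
}
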
>
> Interesting. The network traces look correct, so my thoughts jumped to
> possibly incorrect usage of PCIe relaxed ordering, but that appears to
> be disabled. I'll keep looking for other possibilities.
>
>> Side question: are you running with a fully preemptible kernel, or with
>> fewer NVMe queues than CPUs?
>
> Voluntary preempt. This test is using the kernel config from Ubuntu
> 20.04.
>
> There are 16 CPUs in this setup with just 7 IO queues.
Keith,
Maybe this can help add some more information (compile-tested only):
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a848b5b7f77b..9ce20b26c600 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -57,6 +57,7 @@ struct nvme_tcp_request {
 	size_t			data_sent;
 	size_t			data_received;
 	enum nvme_tcp_send_state state;
+	bool			got_r2t;
 };
 
 enum nvme_tcp_queue_flags {
@@ -622,10 +623,18 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
 	}
 	req = blk_mq_rq_to_pdu(rq);
 
+	if (req->state != NVME_TCP_SEND_CMD_PDU) {
+		dev_err(queue->ctrl->ctrl.device,
+			"queue %d tag %#x req unexpected state %d got_r2t %d\n",
+			nvme_tcp_queue_id(queue), rq->tag, req->state,
+			req->got_r2t);
+	}
+
 	ret = nvme_tcp_setup_h2c_data_pdu(req, pdu);
 	if (unlikely(ret))
 		return ret;
 
+	req->got_r2t = true;
 	req->state = NVME_TCP_SEND_H2C_PDU;
 	req->offset = 0;
 
@@ -1083,6 +1092,8 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
 	}
 
 	if (req->state == NVME_TCP_SEND_DATA) {
+		/* sending data not inline AND unsolicited? */
+		WARN_ON_ONCE(!nvme_tcp_has_inline_data(req) && !req->got_r2t);
 		ret = nvme_tcp_try_send_data(req);
 		if (ret <= 0)
 			goto done;
@@ -2275,6 +2286,7 @@ static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
 		return ret;
 
 	req->state = NVME_TCP_SEND_CMD_PDU;
+	req->got_r2t = false;
 	req->offset = 0;
 	req->data_sent = 0;
 	req->data_received = 0;
--
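
The idea: got_r2t records whether the request ever saw an r2t. If an r2t
arrives while the request is not in NVME_TCP_SEND_CMD_PDU state, or the send
path is about to push data that is neither inline nor solicited by an r2t,
we get a loud indication of which request got confused and what state it
was in.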