[PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting

Sagi Grimberg sagi at grimberg.me
Tue Sep 28 14:00:20 PDT 2021



On 9/28/21 11:40 PM, Keith Busch wrote:
> On Tue, Sep 14, 2021 at 06:38:55PM +0300, Sagi Grimberg wrote:
>> When the controller sends us multiple r2t PDUs in a single
>> request we need to account for it correctly, as our send/recv
>> contexts run concurrently (i.e. we may get a new r2t with r2t_offset
>> before we have updated our iterator and req->data_sent marker). This
>> can cause wrong offsets to be sent to the controller.
>>
>> To fix that, note that this can only happen in the send
>> sequence of the last page; hence we take the r2t_offset as
>> the h2c PDU data_offset, and in the
>> nvme_tcp_try_send_data loop we make sure to advance
>> the request markers also when we have completed a PDU but
>> are still expecting more r2t PDUs, as we have not yet sent
>> the entire data of the request.
>>
>> Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
>> Reported-by: Nowak, Lukasz <Lukasz.Nowak at Dell.com>
>> Tested-by: Nowak, Lukasz <Lukasz.Nowak at Dell.com>
>> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
>> ---
>> - Keith, can you ask the WD team to test this as well?
> 
> I finally have some feedback on testing, and it looks like this patch
> has created a regression. The testers are observing:
> 
>    [Tue Sep 28 02:12:55 2021] nvme nvme6: req 3 r2t len 8192 exceeded data len 8192 (4096 sent)
>    [Tue Sep 28 02:12:55 2021] nvme nvme6: receive failed:  -71
>    [Tue Sep 28 02:12:55 2021] nvme nvme6: starting error recovery
> 
> Which we had seen before, and you fixed with commit:
> 
>    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=825619b09ad351894d2c6fb6705f5b3711d145c7
> 
> I only just recently heard about this, so I haven't looked into any
> additional possible cause for this regression yet, but I wanted to
> let you know as early as possible.

Yes I definitely see the bug... Damn it..

The issue here is that we peek at req->data_sent and req->data_len
after the last send, but by then the request may have completed
and these members may have been overwritten by a new execution of
the request.

We really should make sure that we record these values before we
perform the network send, to avoid this race.
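
To make the ordering concrete, here is a small standalone model of
the race (this is NOT the driver code; the struct, the fake_sendpage()
helper and the numbers are purely illustrative):

/*
 * Model of the use-after-completion race: once the last payload
 * fragment is handed to the network, the RX path may complete the
 * request and a new command may reuse it, overwriting data_sent
 * and data_len.  Snapshot both into locals before the send and
 * decide from the locals afterwards.
 */
#include <stdbool.h>
#include <stdio.h>

struct fake_req {
	int data_sent;	/* bytes sent so far for this command */
	int data_len;	/* total payload length of this command */
};

/* Pretend send: on the last fragment the request "completes" and is
 * immediately reused for an unrelated, larger command. */
static int fake_sendpage(struct fake_req *req, int len, bool last)
{
	if (last) {
		req->data_sent = 0;
		req->data_len = 524288;
	}
	return len;	/* pretend all len bytes were sent */
}

int main(void)
{
	struct fake_req req = { .data_sent = 4096, .data_len = 8192 };

	/* record the values of *this* command before the send */
	int req_data_sent = req.data_sent;
	int req_data_len = req.data_len;

	int ret = fake_sendpage(&req, 4096, true);

	/* buggy: these fields may now describe a different command */
	bool buggy_advance = (req.data_sent + ret < req.data_len);
	/* fixed: decide from the snapshots taken before the send */
	bool fixed_advance = (req_data_sent + ret < req_data_len);

	printf("buggy advance=%d, fixed advance=%d\n",
	       buggy_advance, fixed_advance);
	return 0;
}

The diff below applies the same snapshot-before-send idea to
nvme_tcp_try_send_data().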

Can you guys check that this restores the correct behavior?
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 3c1c29dd3020..6d63129adccc 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -926,15 +926,17 @@ static void nvme_tcp_fail_request(struct nvme_tcp_request *req)
  static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
  {
         struct nvme_tcp_queue *queue = req->queue;
+       int req_data_len = req->data_len;

         while (true) {
                 struct page *page = nvme_tcp_req_cur_page(req);
                 size_t offset = nvme_tcp_req_cur_offset(req);
                 size_t len = nvme_tcp_req_cur_length(req);
-               bool last = nvme_tcp_pdu_last_send(req, len);
+               bool pdu_last = nvme_tcp_pdu_last_send(req, len);
+               bool req_data_sent = req->data_sent;
                 int ret, flags = MSG_DONTWAIT;

-               if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
+               if (pdu_last && !queue->data_digest && !nvme_tcp_queue_more(queue))
                         flags |= MSG_EOR;
                 else
                         flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
@@ -958,7 +960,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
                 * in the request where we don't want to modify it as we may
                  * compete with the RX path completing the request.
                  */
-               if (req->data_sent + ret < req->data_len)
+               if (req_data_sent + ret < req_data_len)
                         nvme_tcp_advance_req(req, ret);

                /* fully successful last send in current PDU */
-               if (last && ret == len) {
+               if (pdu_last && ret == len) {
                        if (queue->data_digest) {
                                nvme_tcp_ddgst_final(queue->snd_hash,
                                        &req->ddgst);
--


