[PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting

Keith Busch kbusch at kernel.org
Tue Sep 28 17:24:07 PDT 2021


On Wed, Sep 29, 2021 at 12:00:20AM +0300, Sagi Grimberg wrote:
> 
> 
> On 9/28/21 11:40 PM, Keith Busch wrote:
> > On Tue, Sep 14, 2021 at 06:38:55PM +0300, Sagi Grimberg wrote:
> > > When the controller sends us multiple r2t PDUs for a single
> > > request we need to account for them correctly, as our send and
> > > recv contexts run concurrently (i.e. we may get a new r2t with
> > > r2t_offset before we have updated our iterator and the
> > > req->data_sent marker). This can cause wrong offsets to be sent
> > > to the controller.
> > > 
> > > To fix that, we rely on the fact that this can only happen in
> > > the send sequence of the last page: we take the r2t_offset as
> > > the h2c PDU data_offset, and in the nvme_tcp_try_send_data
> > > loop we make sure to advance the request markers even when we
> > > have completed a PDU but are still expecting more r2t PDUs
> > > because we have not yet sent the entire data of the request.
> > > 
> > > Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
> > > Reported-by: Nowak, Lukasz <Lukasz.Nowak at Dell.com>
> > > Tested-by: Nowak, Lukasz <Lukasz.Nowak at Dell.com>
> > > Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> > > ---
> > > - Keith, can you ask the WD team to test this as well?
> > 
> > I finally have some feedback on testing, and it looks like this patch
> > has created a regression. The testers are observing:
> > 
> >    [Tue Sep 28 02:12:55 2021] nvme nvme6: req 3 r2t len 8192 exceeded data len 8192 (4096 sent)
> >    [Tue Sep 28 02:12:55 2021] nvme nvme6: receive failed:  -71
> >    [Tue Sep 28 02:12:55 2021] nvme nvme6: starting error recovery
> > 
> > Which we had seen before, and you fixed with commit:
> > 
> >    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=825619b09ad351894d2c6fb6705f5b3711d145c7
> > 
> > I only just recently heard about this, so I haven't looked into
> > any additional possible cause for this regression yet, but I
> > wanted to let you know early.
> 
> Yes I definitely see the bug... Damn it..
> 
> The issue here is that we peek at req->data_sent and req->data_len
> after the last send, but by then the request may have completed
> and these members were overwritten by a new execution of the
> request.
> 
> We really should make sure that we record these before we perform
> the network send to avoid this race.
> 
> Can you guys check that this restores the correct behavior?

We'll give this a spin (I had to fix it up so it applies and compiles;
no biggie)

> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 3c1c29dd3020..6d63129adccc 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -926,15 +926,17 @@ static void nvme_tcp_fail_request(struct nvme_tcp_request *req)
>  static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
>  {
>         struct nvme_tcp_queue *queue = req->queue;
> +       int req_data_len = req->data_len;
> 
>         while (true) {
>                 struct page *page = nvme_tcp_req_cur_page(req);
>                 size_t offset = nvme_tcp_req_cur_offset(req);
>                 size_t len = nvme_tcp_req_cur_length(req);
> -               bool last = nvme_tcp_pdu_last_send(req, len);
> +               bool pdu_last = nvme_tcp_pdu_last_send(req, len);
> +               int req_data_sent = req->data_sent;
>                 int ret, flags = MSG_DONTWAIT;
> 
> -               if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
> +               if (pdu_last && !queue->data_digest && !nvme_tcp_queue_more(queue))
>                         flags |= MSG_EOR;
>                 else
>                         flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
> @@ -958,7 +960,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
>                  * in the request where we don't want to modify it as we may
>                  * compete with the RX path completing the request.
>                  */
> -               if (req->data_sent + ret < req->data_len)
> +               if (req_data_sent + ret < req_data_len)
>                         nvme_tcp_advance_req(req, ret);
> 
>                 /* fully successful last send in current PDU */
> -               if (last && ret == len) {
> +               if (pdu_last && ret == len) {
>                         if (queue->data_digest) {
>                                 nvme_tcp_ddgst_final(queue->snd_hash,
>                                         &req->ddgst);
> --


