[PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
Keith Busch
kbusch at kernel.org
Tue Sep 28 17:24:07 PDT 2021
On Wed, Sep 29, 2021 at 12:00:20AM +0300, Sagi Grimberg wrote:
>
>
> On 9/28/21 11:40 PM, Keith Busch wrote:
> > On Tue, Sep 14, 2021 at 06:38:55PM +0300, Sagi Grimberg wrote:
> > > When the controller sends us multiple r2t PDUs for a single
> > > request, we need to account for them correctly because our send/recv
> > > contexts run concurrently (i.e. we can get a new r2t with r2t_offset
> > > before we have updated our iterator and req->data_sent marker). This
> > > can cause wrong offsets to be sent to the controller.
> > >
> > > To fix that, first note that this can happen only in the send
> > > sequence of the last page; hence we take the r2t_offset as the
> > > h2c PDU data_offset, and in the nvme_tcp_try_send_data loop we
> > > make sure to advance the request markers also when we have
> > > completed a PDU but are still expecting more r2t PDUs, since we
> > > have not yet sent the entire data of the request.
> > >
> > > Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
> > > Reported-by: Nowak, Lukasz <Lukasz.Nowak at Dell.com>
> > > Tested-by: Nowak, Lukasz <Lukasz.Nowak at Dell.com>
> > > Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
> > > ---
> > > - Keith, can you ask the WD team to test this as well?
> >
> > I finally have some feedback on testing, and it looks like this patch
> > has created a regression. The testers are observing:
> >
> > [Tue Sep 28 02:12:55 2021] nvme nvme6: req 3 r2t len 8192 exceeded data len 8192 (4096 sent)
> > [Tue Sep 28 02:12:55 2021] nvme nvme6: receive failed: -71
> > [Tue Sep 28 02:12:55 2021] nvme nvme6: starting error recovery
> >
> > Which we had seen before, and you fixed with commit:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=825619b09ad351894d2c6fb6705f5b3711d145c7
> >
> > I only just recently heard about this, so I haven't looked into any
> > additional possible cause for this regression yet, but just wanted to
> > let you know earlier.
>
> Yes I definitely see the bug... Damn it..
>
> The issue here is that we peek at req->data_sent and req->data_len
> after the last send, but by then the request may have completed
> and these members may already have been overwritten by a new
> execution of the request.
>
> We really should make sure that we record these before we perform
> the network send to avoid this race.
>
> Can you guys check that this restores the correct behavior?
We'll give this a spin (I had to fix it up so it applies and compiles;
no biggie)
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 3c1c29dd3020..6d63129adccc 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -926,15 +926,17 @@ static void nvme_tcp_fail_request(struct nvme_tcp_request *req)
>  static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
>  {
>  	struct nvme_tcp_queue *queue = req->queue;
> +	int req_data_len = req->data_len;
>
>  	while (true) {
>  		struct page *page = nvme_tcp_req_cur_page(req);
>  		size_t offset = nvme_tcp_req_cur_offset(req);
>  		size_t len = nvme_tcp_req_cur_length(req);
> -		bool last = nvme_tcp_pdu_last_send(req, len);
> +		bool pdu_last = nvme_tcp_pdu_last_send(req, len);
> +		bool req_data_sent = req->data_sent;
>  		int ret, flags = MSG_DONTWAIT;
>
> -		if (last && !queue->data_digest && !nvme_tcp_queue_more(queue))
> +		if (pdu_last && !queue->data_digest && !nvme_tcp_queue_more(queue))
>  			flags |= MSG_EOR;
>  		else
>  			flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST;
> @@ -958,7 +960,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
>  		 * in the request where we don't want to modify it as we may
>  		 * compete with the RX path completing the request.
>  		 */
> -		if (req->data_sent + ret < req->data_len)
> +		if (req_data_sent + ret < req_data_len)
>  			nvme_tcp_advance_req(req, ret);
>
>  		/* fully successful last send in current PDU */
>  				nvme_tcp_ddgst_final(queue->snd_hash, &req->ddgst);
> --
More information about the Linux-nvme mailing list