nvme tcp receive errors

Sagi Grimberg sagi at grimberg.me
Mon May 10 22:07:47 BST 2021


>>> Sagi,
>>>
>>> Just wanted to give you an update on where we're at with this.
>>>
>>> All tests run with your earlier patch removing the inline dispatch from
>>> nvme_tcp_queue_request() are successful. At this point, I am leaning
>>> toward removing that optimization from mainline.
>>
>> Thanks Keith,
>>
>> Did you run it with the extra debug information patch I sent you? What
>> I'm concerned about is that you have the only environment where this
>> reproduces, so if the optimization is removed it will be very difficult
>> to add it back in.
>>
>> Also, what about the read issue? That one is still unresolved from
>> my PoV.
> 
> The test team reports no issues for both read and write tests with the
> inline dispatch removed. That single patch appears to resolve all
> problems. Sorry, I should have clarified that initially.
> 
> We will go back to the extra debug from last week on top of a pristine
> mainline kernel. There's been some confusion about which patches to apply
> on top of others in the mix, but I'm getting better at coordinating
> that.

Keith,

I may have a theory for this issue. I think the problem is in
cases where we send a command with data to the controller and, in
nvme_tcp_try_send_data, the controller sends back a successful
completion between the last successful kernel_sendpage and the call
to nvme_tcp_advance_req.

If that is the case, then the completion path could be triggered,
the tag would be reused, triggering a new .queue_rq that re-initializes
req.iter with the new bio params (none of this is serialized by the
send_mutex), and then the original send context would call
nvme_tcp_advance_req, advancing req.iter by the bytes it had just sent
for the old command... And given that req.iter is used for both reads
and writes, this could explain both issues.

While this is not easy to trigger, I don't think there is anything
that prevents it. The driver used to have a single context that
did both send and recv, so this could not have happened, but
now that we have added the .queue_rq send context, I guess this can
indeed confuse the driver.
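
To make the suspected ordering concrete, here is a rough sketch of the
interleaving I have in mind (illustrative only, my reading of the
current code paths, not actual driver code):
--
/*
 *   send context (.queue_rq/io_work)     recv context (nvme_tcp_recv_skb)
 *   --------------------------------     --------------------------------
 *   nvme_tcp_try_send_data()
 *     kernel_sendpage() succeeds for
 *     the last chunk of the request
 *                                        controller returns a successful
 *                                        completion for this command;
 *                                        nvme_tcp_process_nvme_cqe()
 *                                        completes the request, the tag
 *                                        is reused and a new .queue_rq
 *                                        re-initializes req->iter for
 *                                        the new bio
 *     nvme_tcp_advance_req(req, ret)
 *     advances the *new* req->iter by
 *     the bytes sent for the old command
 */
--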

There are 3 possible options to resolve this:
1. Never touch the command after the last send (only advance the
request iter after we check if we are done):
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index c51b70aec6dd..5c576eda5ba1 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -933,7 +933,6 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
                 if (ret <= 0)
                         return ret;

-               nvme_tcp_advance_req(req, ret);
                 if (queue->data_digest)
                         nvme_tcp_ddgst_update(queue->snd_hash, page,
                                         offset, ret);
@@ -950,6 +949,7 @@ static int nvme_tcp_try_send_data(struct nvme_tcp_request *req)
                         }
                         return 1;
                 }
+               nvme_tcp_advance_req(req, ret);
         }
         return -EAGAIN;
  }
--

2. Only initialize the request/cmd/pdu (nvme_tcp_setup_cmd_pdu)
once we have taken the send_mutex (say in nvme_tcp_fetch_request
or something), which is more cumbersome as this setup may be
called multiple times in the life of a request.

3. Add refcounting so an I/O is never completed before both send
and receive have fully completed on this command. Besides adding some
overhead on the datapath, there may also be some challenges with
driving the request state machine off that refcounting (rough sketch
below).
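
For reference, a very rough and untested sketch of what (3) could look
like; the "ref" field and the nvme_tcp_put_request() helper below are
hypothetical, they do not exist in the driver today:
--
/* hypothetical: take one reference for the send side and one for the
 * receive/completion side when the request is set up */
static void nvme_tcp_init_request_refs(struct nvme_tcp_request *req)
{
	refcount_set(&req->ref, 2);	/* "ref" would be a new field */
}

/* each side drops its reference once it is fully done with the command;
 * only the last put actually completes the I/O */
static void nvme_tcp_put_request(struct nvme_tcp_request *req, u16 status)
{
	if (refcount_dec_and_test(&req->ref))
		nvme_tcp_end_request(blk_mq_rq_from_pdu(req), status);
}
--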

I suggest that you guys give (1) a shot and see if the theory holds...


