nvme tcp receive errors
Sagi Grimberg
sagi at grimberg.me
Thu May 13 20:53:54 BST 2021
On 5/13/21 8:48 AM, Keith Busch wrote:
> On Tue, May 11, 2021 at 10:17:09AM -0700, Sagi Grimberg wrote:
>>
>>>> I may have a theory about this issue. I think the problem is in
>>>> cases where we send commands with data to the controller and, in
>>>> nvme_tcp_send_data, between the last successful kernel_sendpage
>>>> and before nvme_tcp_advance_req, the controller sends back a successful
>>>> completion.
>>>>
>>>> If that is the case, then the completion path could be triggered,
>>>> the tag would be reused, triggering a new .queue_rq that sets up
>>>> the req.iter again with the new bio params (none of this is covered
>>>> by the send_mutex), and then the send context would call
>>>> nvme_tcp_advance_req, advancing the req.iter by the bytes sent for
>>>> the previous request... And given that the req.iter is used for both
>>>> reads and writes, this could explain both issues.
>>>>
>>>> While this is not easy to trigger, I don't think there is anything
>>>> that prevents it. The driver used to have a single context that
>>>> would do both send and recv, so this could not have happened, but
>>>> now that we added the .queue_rq send context, I guess this can
>>>> indeed confuse the driver.
>>>
>>> Awesome, this is exactly the type of sequence I've been trying to
>>> capture, but couldn't quite get there. Now that you've described it,
>>> that flow can certainly explain the observations, including the
>>> corrupted debug trace event I was trying to add.
>>>
>>> The sequence looks unlikely to happen, which agrees with the difficulty
>>> in reproducing it. I am betting right now that you got it, but I'm a
>>> little surprised no one else has reported a similar problem yet.
>>
>> We had at least one report from Potnuri that I think may have been
>> triggered by this; it ended up fixed (or rather worked around)
>> with 5c11f7d9f843.
>>
>>> Your option "1" looks like the best one, IMO. I've requested dropping
>>> all debug and test patches and using just this one on the current nvme
>>> baseline for the next test cycle.
>>
>> Cool, waiting to hear back...
>
> This patch has been tested successfully on the initial workloads. There
> are several more that need to be validated, but each one runs for many
> hours, so it may be a couple more days before they are completed. Just
> wanted to let you know: so far, so good.
Encouraging... I'll send a patch for that as soon as you give me the
final verdict. I'm assuming Narayan would be the reporter and the
tester?
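For anyone following along, here is a minimal userspace sketch of the
suspected interleaving from the quoted analysis above. All names and sizes
in it are made up for illustration; it is not the driver code, just the
generic pattern of a send context advancing a shared per-request iterator
after the completion path has already reused the slot and re-initialized it:

/*
 * Hypothetical, simplified userspace sketch (not the actual driver code;
 * all identifiers here are invented) of the race described above: the send
 * context advances the iterator only after the final send returns, while
 * the completion path reuses the same tag and resets that iterator for a
 * brand new I/O in between.
 *
 * Build: cc -pthread race_sketch.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Stand-in for the per-request state (think req.iter). */
struct fake_req {
        size_t iter_offset;     /* how far into the payload we have sent */
        size_t payload_len;     /* payload length of the current I/O */
};

static struct fake_req req;     /* the shared request slot ("tag") */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Send context: the final send of the first I/O already succeeded; the
 * iterator is only advanced afterwards. */
static void *send_context(void *arg)
{
        size_t sent = 4096;     /* bytes the last "sendpage" just transmitted */

        sleep(1);               /* window in which the completion can arrive */

        pthread_mutex_lock(&lock);
        req.iter_offset += sent;        /* the "advance_req" step, too late */
        pthread_mutex_unlock(&lock);
        return NULL;
}

/* Completion path: the old command completes, the tag is reused, and a new
 * request is set up on the same slot with a fresh iterator. */
static void *completion_path(void *arg)
{
        pthread_mutex_lock(&lock);
        req.iter_offset = 0;            /* new request on the same tag */
        req.payload_len = 8192;
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t s, c;

        req.iter_offset = 0;
        req.payload_len = 4096;         /* the first I/O */

        pthread_create(&s, NULL, send_context, NULL);
        pthread_create(&c, NULL, completion_path, NULL);
        pthread_join(s, NULL);
        pthread_join(c, NULL);

        /* The new request starts life with a stale, nonzero offset. */
        printf("new request iter_offset = %zu (expected 0)\n", req.iter_offset);
        return 0;
}

Note that locking each individual step does not help here, which mirrors the
point about the send_mutex above: the late advance and the iterator
re-initialization for the reused tag have to be ordered against each other,
not merely serialized.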