[PATCH] nvme-tcp: fix incorrect h2cdata pdu offset accounting
Sagi Grimberg
sagi at grimberg.me
Sat Oct 2 15:19:28 PDT 2021
>>>> When the controller sends us multiple r2t PDUs for a single
>>>> request we need to account for them correctly, as our send and
>>>> recv contexts run concurrently (i.e. we may get a new r2t with
>>>> r2t_offset before we have updated our iterator and the
>>>> req->data_sent marker). This can cause wrong offsets to be sent
>>>> to the controller.
>>>>
>>>> To fix that, first note that this can happen only in the send
>>>> sequence of the last page; hence we take the r2t_offset from the
>>>> r2t PDU itself for the h2c PDU data_offset, and in the
>>>> nvme_tcp_try_send_data loop we make sure to advance the request
>>>> markers also when we have completed a PDU but still expect more
>>>> r2t PDUs, as we have not yet sent the entire data of the request
>>>> (both changes are sketched after the patch trailer below).
>>>>
>>>> Fixes: 825619b09ad3 ("nvme-tcp: fix possible use-after-completion")
>>>> Reported-by: Nowak, Lukasz <Lukasz.Nowak at Dell.com>
>>>> Tested-by: Nowak, Lukasz <Lukasz.Nowak at Dell.com>
>>>> Signed-off-by: Sagi Grimberg <sagi at grimberg.me>
>>>> ---
>>>> - Keith, can you ask the WD team to test this as well?
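>>>>
>>>> For reference, the two changes boil down to roughly this (a
>>>> simplified sketch, not the verbatim diff; field and helper names
>>>> follow drivers/nvme/host/tcp.c):
>>>>
>>>>     /* 1) nvme_tcp_setup_h2c_data_pdu(): take the offset from the
>>>>      *    r2t PDU itself instead of the possibly-stale send marker.
>>>>      */
>>>>     data->data_offset = pdu->r2t_offset; /* was cpu_to_le32(req->data_sent) */
>>>>
>>>>     /* 2) nvme_tcp_try_send_data() loop: advance the iterator and
>>>>      *    req->data_sent also when an h2cdata PDU is finished but
>>>>      *    the request still has data left (i.e. more r2t PDUs are
>>>>      *    expected), and never advance past the last payload send.
>>>>      */
>>>>     if (req->data_sent + ret < req->data_len)
>>>>             nvme_tcp_advance_req(req, ret);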
>>>
>>> I finally have some feedback on testing, and it looks like this patch
>>> has created a regression. The testers are observing:
>>>
>>> [Tue Sep 28 02:12:55 2021] nvme nvme6: req 3 r2t len 8192 exceeded data len 8192 (4096 sent)
>>> [Tue Sep 28 02:12:55 2021] nvme nvme6: receive failed: -71
>>> [Tue Sep 28 02:12:55 2021] nvme nvme6: starting error recovery
>>>
>>> We had seen this before, and you fixed it with commit:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=825619b09ad351894d2c6fb6705f5b3711d145c7
>>>
>>> I only heard about this recently, so I haven't looked into any
>>> additional possible cause for the regression yet, but I wanted to
>>> let you know early.
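>>>
>>> For context, that message comes from the r2t length sanity check in
>>> the h2cdata setup path; roughly (a simplified sketch of the check in
>>> nvme_tcp_setup_h2c_data_pdu() as of this thread):
>>>
>>>     /* a stale/overwritten req->data_sent makes this fire: here
>>>      * 4096 (sent) + 8192 (r2t len) > 8192 (data len) */
>>>     if (unlikely(req->data_sent + r2t_length > req->data_len)) {
>>>             dev_err(queue->ctrl->ctrl.device,
>>>                     "req %d r2t len %u exceeded data len %u (%u sent)\n",
>>>                     rq->tag, r2t_length, req->data_len, req->data_sent);
>>>             return -EPROTO; /* -71, hence "receive failed: -71" */
>>>     }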
>>
>> Yes, I definitely see the bug... Damn it.
>>
>> The issue here is that we peek at req->data_sent and req->data_len
>> after the last send, but by then the request may have completed and
>> these members may have been overwritten by a new execution of the
>> request.
>>
>> We really should make sure that we record these before we perform
>> the network send to avoid this race.
>>
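>> Something along these lines (a simplified sketch of the send loop in
>> nvme_tcp_try_send_data(); flags, digest and completion handling
>> omitted):
>>
>>     while (true) {
>>             struct page *page = nvme_tcp_req_cur_page(req);
>>             size_t offset = nvme_tcp_req_cur_offset(req);
>>             size_t len = nvme_tcp_req_cur_length(req);
>>             /* record before the send: once the last fragment is on
>>              * the wire, the recv path may complete and reuse the
>>              * request, overwriting these members */
>>             size_t req_data_sent = req->data_sent;
>>             u32 req_data_len = req->data_len;
>>             int ret;
>>
>>             ret = kernel_sendpage(queue->sock, page, offset, len,
>>                             MSG_DONTWAIT);
>>             if (ret <= 0)
>>                     return ret;
>>
>>             /* from here on, consult only the local snapshots */
>>             if (req_data_sent + ret < req_data_len)
>>                     nvme_tcp_advance_req(req, ret);
>>     }
>>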
>> Can you guys check that this restores the correct behavior?
>
> Unfortunately this was unsuccessful. The same issue is still occurring. Please
> let me know if you have another patch to try. I'll keep looking for a
> solution as well.
Really? That's unexpected; this patch should ensure that the request
is not advanced after the last payload send. req->data_sent and
req->data_len are recorded before we actually perform the network
send, so the request should not be advanced on the last send, i.e.
when:
(req_data_sent + ret == req_data_len)
So I'm surprised that the same issue is still occurring.
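
Spelled out, the decision after each send is then (using the pre-send
snapshots; a sketch, not the exact diff):

    if (req_data_sent + ret == req_data_len) {
            /* last payload send: don't touch req anymore, the recv
             * path may be completing and reusing it concurrently */
    } else {
            nvme_tcp_advance_req(req, ret);
    }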