[PATCH RFC] nvme-tcp: Implement recvmsg() receive flow

Hannes Reinecke hare at suse.de
Thu Feb 26 00:40:16 PST 2026


On 2/26/26 00:37, Alistair Francis wrote:
> On Wed, 2026-02-25 at 16:02 +0100, Hannes Reinecke wrote:
>> On 2/25/26 14:15, Alistair Francis wrote:
>>> On Wed, 2026-02-25 at 12:41 +0100, Hannes Reinecke wrote:
>>>> On 2/25/26 11:56, Alistair Francis wrote:
>> [ .. ]
>>>>>
>>>>> This doesn't work unfortunately.
>>>>>
>>>>> The problem is what happens if queue->data_remaining is smaller
>>>>> than iov_iter_count(&req->iter)?
>>>>>
>>>>> queue->data_remaining is set by the data length in the c2h,
>>>>> while the length of the request, iov_iter_count(&req->iter),
>>>>> is set when the request is submitted.
>>>>>
>>>>> If queue->data_remaining ends up being smaller than
>>>>> iov_iter_count(&req->iter), then we need to read less data than
>>>>> the actual count of req->iter.
>>>>>
>>>>> So we need an iov_iter_truncate(), but then we end up
>>>>> overwriting the data on the next iteration as we have no way
>>>>> to keep...
>>>>>
>>>> Question is, though: what _is_ in the remaining iov?
>>>
>>> Which remaining iov?
>>>
>>
>> Well, if queue->
>>>> The most reasonable explanation would be that it's the start of
>>>> the next PDU (which we haven't accounted for, and hence haven't
>>>> set up pointers correctly).
>>>
>>> The next PDU seems fine, it's just the next data that goes on the
>>> current (and correct) IOV, just overwriting the previous data as
>>> there is no offset.
>>>
>>
>> The offset is in the iov (ie you advance the iov iter to capture
>> the offset).
>>
>>>> I can see this happening for TLS when the sender doesn't space
>>>> the records correctly (ie if the PDU end is not falling on a
>>>> TLS record boundary).
>>>>
>>>> But yeah, I can see the issue. While we can (and do)
>>>> advance the iterator to complete the request, we still
>>>> have the remaining data in the iterator.
>>>
>>>> What we can do, though, is to copy the remaining data over
>>>> to 'queue->pdu' (as we assume it's the start of the next PDU),
>>>> set up pointers, and let it rip.
> 
> I think I understand this part a bit more now. I'm currently using
> iov_iter_truncate() to reduce the count of req->iter to match
> queue->data_remaining, so I don't see this.
> 
> If there was no truncation then the req->iter would fill up with the
> current data and the next PDU. This is where we could copy that data
> over and let it rip.
> 
> But it still has the same issue in that we can't add future data to an
> offset in the req->iter. At least not that I can figure out.
> 

Ah, I think I see it now.
The iovec might indeed be larger than queue->data_remaining
(data underflows are not that uncommon), and then recvmsg
might indeed fill the iovec with more data than the PDU
requires.

The old code had this bit:

	/* we can read only from what is left in this bio */
	recv_len = min_t(size_t, recv_len,
			iov_iter_count(&req->iter));

(with recv_len being set to queue->data_remaining) to prevent that
from happening.
So we should do an

	if (iov_iter_count(&req->iter) > queue->data_remaining)
		iov_iter_truncate(&req->iter, queue->data_remaining);

before issuing recvmsg(). That should take care of things.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
