[PATCHv2] nvme-tcp: Implement recvmsg() receive flow

Hannes Reinecke hare at suse.de
Mon Dec 1 00:49:00 PST 2025


On 11/30/25 22:35, Sagi Grimberg wrote:
> 
> 
> On 27/11/2025 9:52, Hannes Reinecke wrote:
>> On 11/26/25 08:32, Sagi Grimberg wrote:
>>>
>>>
>>> On 20/10/2025 11:58, Hannes Reinecke wrote:
>>>> The nvme-tcp code is using the ->read_sock() interface to
>>>> read data from the wire. While this interface gives us access
>>>> to the skbs themselves (and so might be able to reduce latency)
>>>> it does not interpret the skbs.
>>>> Additionally for TLS these skbs have to be re-constructed from
>>>> the TLS stream data, rendering any advantage questionable.
>>>> But the main drawback for TLS is that we do not get access to
>>>> the TLS control messages, so if we receive any of those messages
>>>> the only choice we have is to tear down the connection and restart.
>>>> This patch switches the receive side over to use recvmsg(), which
>>>> provides us full access to the TLS control messages and is also
>>>> more efficient when working with TLS as skbs do not need to be
>>>> artificially constructed.
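
[ Just to illustrate the direction, a rough sketch of what a recvmsg()-based
  receive with TLS record-type checking could look like. The field and helper
  names (rcv_buf, rcv_len, nvme_tcp_handle_tls_ctrl(), nvme_tcp_process_pdu())
  are made up for this example and not taken from the actual patch:

static int nvme_tcp_try_recvmsg(struct nvme_tcp_queue *queue)
{
	char cbuf[CMSG_LEN(sizeof(unsigned char))] = {};
	struct cmsghdr *cmsg = (struct cmsghdr *)cbuf;
	struct msghdr msg = {
		.msg_control = cbuf,
		.msg_controllen = sizeof(cbuf),
	};
	struct kvec iov = {
		.iov_base = queue->rcv_buf,	/* hypothetical receive buffer */
		.iov_len = queue->rcv_len,
	};
	unsigned char ctype;
	int ret;

	iov_iter_kvec(&msg.msg_iter, ITER_DEST, &iov, 1, iov.iov_len);
	ret = sock_recvmsg(queue->sock, &msg, MSG_DONTWAIT);
	if (ret <= 0)
		return ret;

	/* with TLS the record type is delivered as a control message */
	if (cmsg->cmsg_level == SOL_TLS &&
	    cmsg->cmsg_type == TLS_GET_RECORD_TYPE) {
		ctype = *((unsigned char *)CMSG_DATA(cmsg));
		if (ctype != TLS_RECORD_TYPE_DATA)
			/*
			 * alert/handshake record: handle it instead of
			 * having to tear down the connection
			 */
			return nvme_tcp_handle_tls_ctrl(queue, ctype);
	}

	return nvme_tcp_process_pdu(queue, ret);
}
]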
>>>
>>> Hannes,
>>>
>>> I generally agree with this approach. I'd like to point out though
>>> that this is going to give up running RX directly from softirq 
>>> context.
>>
>> Yes.
>>
>>> I've gone back and forth on whether nvme-tcp should do that, but never
>>> got to do a thorough comparison between the two. This probably shuts
>>> the door on that option.
>>>
>> The thing with running from softirq context is that it would only
>> make sense if we could _ensure_ that the softirq context runs on
>> the CPU where the blk-mq hardware context expects it to run.
> 
> What is this statement based on? softirq runs where the NIC interrupt
> happens, which eliminates the context switch to the workqueue io_cpu,
> which is not guaranteed to be affinitized with where userspace runs; in
> fact it often isn't in nvme-tcp...
> 
My thinking was that if we were to split the TX and RX paths, it would
make sense to align the RX path with the blk-mq CPU mapping. But that is
not a simple operation, and quite often not possible.
If that wasn't the goal, then fine, ignore my comment.

>> Not only would that require fiddling with RFS contexts, but we also
>> found that NVMe-over-fabrics should _not_ try to align with hardware
>> interrupts but rather rely on the driver to abstract things away.
> 
> I did not expect anyone to fiddle with RFS for softirq context. The main
> benefit of softirq context RX (outside of latency reduction) is that it
> makes io_work handle ONLY TX, which is probably somewhat more efficient.

But we could still do that, right?
We can easily split the current io_work() into two parts: the TX part
driven from queue_rq(), and the RX part driven from the data_ready()
callback, no?
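
Roughly something like this (rx_work/tx_work are made-up names here, just
to illustrate the split; error handling and the rd_enabled/polling checks
of the current data_ready callback are omitted):

static void nvme_tcp_data_ready(struct sock *sk)
{
	struct nvme_tcp_queue *queue;

	read_lock_bh(&sk->sk_callback_lock);
	queue = sk->sk_user_data;
	if (likely(queue))
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->rx_work);
	read_unlock_bh(&sk->sk_callback_lock);
}

static void nvme_tcp_rx_work(struct work_struct *w)
{
	struct nvme_tcp_queue *queue =
		container_of(w, struct nvme_tcp_queue, rx_work);

	/* drain the socket until there is nothing left to receive */
	while (nvme_tcp_try_recv(queue) > 0)
		;
}

static void nvme_tcp_tx_work(struct work_struct *w)
{
	struct nvme_tcp_queue *queue =
		container_of(w, struct nvme_tcp_queue, tx_work);

	/* push queued requests; kicked from queue_rq() */
	nvme_tcp_try_send(queue);
}

queue_rq() would then queue tx_work instead of io_work, and data_ready()
would only ever schedule rx_work.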

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


