[PATCHv2] nvme-tcp: Implement recvmsg() receive flow

Hannes Reinecke hare at suse.de
Mon Dec 1 01:02:12 PST 2025


On 11/30/25 22:38, Sagi Grimberg wrote:
> 
> 
> On 27/11/2025 13:47, Hannes Reinecke wrote:
>> On 11/27/25 08:52, Hannes Reinecke wrote:
>>> On 11/26/25 08:32, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 20/10/2025 11:58, Hannes Reinecke wrote:
>>>>> The nvme-tcp code is using the ->read_sock() interface to
>>>>> read data from the wire. While this interface gives us access
>>>>> to the skbs themselves (and so might be able to reduce latency)
>>>>> it does not interpret the skbs.
>>>>> Additionally for TLS these skbs have to be re-constructed from
>>>>> the TLS stream data, rendering any advantage questionable.
>>>>> But the main drawback for TLS is that we do not get access to
>>>>> the TLS control messages, so if we receive any of those messages
>>>>> the only choice we have is to tear down the connection and restart.
>>>>> This patch switches the receive side over to use recvmsg(), which
>>>>> provides us full access to the TLS control messages and is also
>>>>> more efficient when working with TLS as skbs do not need to be
>>>>> artificially constructed.
>>>>
>>>> Hannes,
>>>>
>>>> I generally agree with this approach. I'd like to point out though
>>>> that this is going to give up running RX directly from softirq 
>>>> context.
>>>
>>> Yes.
>>>
>>>> I've gone back and forth on whether nvme-tcp should do that, but never
>>>> got to do a thorough comparison between the two. This probably shuts
>>>> the door on that option.
>>>>
>>> The thing with running from softirq context is that it would only
>>> make sense if we could _ensure_ that the softirq context is running
>>> on the cpu where the blk-mq hardware context is expecting it to.
>>> Not only would that require fiddling with RFS contexts, but we also
>>> found that NVMe-over-fabrics should _not_ try to align with hardware
>>> interrupts but rather rely on the driver to abstract things away.
>>>
>>>> Having said that, I am fine with this approach. recvmsg is a saner 
>>>> interface.
>>>>
>>> Thanks!
>>>
>>>
>>> [ .. ]
>>>>> @@ -1023,40 +1030,32 @@ static int nvme_tcp_recv_ddgst(struct
>>>>> nvme_tcp_queue *queue,
>>>>>       return 0;
>>>>>   }
>>>>> -static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct 
>>>>> sk_buff *skb,
>>>>> -                 unsigned int offset, size_t len)
>>>>> +static int nvme_tcp_recvmsg_ddgst(struct nvme_tcp_queue *queue)
>>>>>   {
>>>>> -    struct nvme_tcp_queue *queue = desc->arg.data;
>>>>> -    size_t consumed = len;
>>>>> -    int result;
>>>>> +    char *ddgst = (char *)&queue->recv_ddgst;
>>>>> +    off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
>>>>> +    struct msghdr msg = {
>>>>> +        .msg_flags = MSG_WAITALL,
>>>>> +    };
>>>>
>>>> I don't think we want to use MSG_WAITALL ever. Why are you opting to 
>>>> do that?
>>>>
>>> Hmm. You are right, it's not needed.
>>>
>>> [ .. ]
>>>>> @@ -1356,20 +1355,38 @@ static int nvme_tcp_try_send(struct
>>>>> nvme_tcp_queue *queue)
>>>>>       return ret;
>>>>>   }
>>>>> -static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>>>>> +static int nvme_tcp_try_recvmsg(struct nvme_tcp_queue *queue)
>>>>>   {
>>>>> -    struct socket *sock = queue->sock;
>>>>> -    struct sock *sk = sock->sk;
>>>>> -    read_descriptor_t rd_desc;
>>>>> -    int consumed;
>>>>> +    int result;
>>>>> +    int nr_cqe = queue->nr_cqe;
>>>>> +
>>>>> +    if (unlikely(!queue->rd_enabled))
>>>>> +        return -EFAULT;
>>>>> +
>>>>> +    do {
>>>>> +        switch (nvme_tcp_recv_state(queue)) {
>>>>> +        case NVME_TCP_RECV_PDU:
>>>>> +            result = nvme_tcp_recvmsg_pdu(queue);
>>>>> +            break;
>>>>> +        case NVME_TCP_RECV_DATA:
>>>>> +            result = nvme_tcp_recvmsg_data(queue);
>>>>> +            break;
>>>>> +        case NVME_TCP_RECV_DDGST:
>>>>> +            result = nvme_tcp_recvmsg_ddgst(queue);
>>>>> +            break;
>>>>> +        default:
>>>>> +            result = -EFAULT;
>>>>> +        }
>>>>> +    } while (result >= 0);
>>>>> +
>>>>> +    if (result < 0 && result != -EAGAIN) {
>>>>> +        dev_err(queue->ctrl->ctrl.device,
>>>>> +            "receive failed:  %d\n", result);
>>>>> +        queue->rd_enabled = false;
>>>>> +        nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>>>> +    }
>>>>> -    rd_desc.arg.data = queue;
>>>>> -    rd_desc.count = 1;
>>>>> -    lock_sock(sk);
>>>>> -    queue->nr_cqe = 0;
>>>>> -    consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>>>>> -    release_sock(sk);
>>>>> -    return consumed == -EAGAIN ? 0 : consumed;
>>>>> +    return result < 0 ? result : (queue->nr_cqe = nr_cqe);
>>>>
>>>> This is changing the semantics of the retcode - can you please 
>>>> document in the commit msg
>>>> the change?
>>>>
>>> That was done to be compatible with the polling interface.
>>> I'll check if it's needed after all.
>>>
>> Actually, I don't think I'm changing the interface.
>> The original interface would return either a negative number on error,
>> '0' if no PDUs have been processed, or the number of processed PDUs.
>> And the new recvmsg interface does the same (I think).
>> The only change which I did was to _not_ map out -EAGAIN, as this will
>> be handled by the 'data_ready' callback.
> 
> Which is a change that needs to be documented.
> 
Okay, will do.

>> (Reasoning: -EAGAIN is returned when there is no (or not enough) data
>> to be read, and the network layer will call ->data_ready once the
>> situation improves. If we continue with the loop we will race/compete
>> with the networking layer, causing us to do duplicated work as the
>> io_work thread will be picking up the new data _and_ the data_ready
>> callback will be scheduling the io_work callback to pick up the new
>> data. So we can safely stop on -EAGAIN knowing that we'll be woken
>> up once we can continue to read data.).
> 
> There is TX work that may be pending as well, IIRC there is an 
> assumption that io_work will NOT return if there is any TX work
> pending (we've seen stalls in the past due to issues like that).
Hmm.
But I've seen the opposite; the current io_work prioritizes TX
over RX, so a 'data_ready' callback will trigger TX first, letting
the RX data stew for a bit until the TX data has been sent.
Maybe we should be looking into splitting TX and RX paths after all...

>> (And arguably this was an issue with the original code, too.).
>>
>> So where is the change in semantics?
> 
> Exactly where you described it.

Okay, will do that.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


