[PATCHv2] nvme-tcp: Implement recvmsg() receive flow

Sagi Grimberg sagi at grimberg.me
Sun Nov 30 13:38:34 PST 2025



On 27/11/2025 13:47, Hannes Reinecke wrote:
> On 11/27/25 08:52, Hannes Reinecke wrote:
>> On 11/26/25 08:32, Sagi Grimberg wrote:
>>>
>>>
>>> On 20/10/2025 11:58, Hannes Reinecke wrote:
>>>> The nvme-tcp code is using the ->read_sock() interface to
>>>> read data from the wire. While this interface gives us access
>>>> to the skbs themselves (and so might be able to reduce latency)
>>>> it does not interpret the skbs.
>>>> Additionally for TLS these skbs have to be re-constructed from
>>>> the TLS stream data, rendering any advantage questionable.
>>>> But the main drawback for TLS is that we do not get access to
>>>> the TLS control messages, so if we receive any of those messages
>>>> the only choice we have is to tear down the connection and restart.
>>>> This patch switches the receive side over to use recvmsg(), which
>>>> provides us full access to the TLS control messages and is also
>>>> more efficient when working with TLS as skbs do not need to be
>>>> artificially constructed.
>>>
>>> Hannes,
>>>
>>> I generally agree with this approach. I'd like to point out though
>>> that this is going to give up running RX directly from softirq
>>> context.
>>
>> Yes.
>>
>>> I've gone back and forth on whether nvme-tcp should do that, but never
>>> got to do a thorough comparison between the two. This probably shuts
>>> the door on that option.
>>>
>> The thing with running from softirq context is that it would only
>> make sense if we could _ensure_ that the softirq context is running
>> on the cpu where the blk-mq hardware context is expecting it to.
>> Not only would that require fiddling with RFS contexts, but we also
>> found that NVMe-over-fabrics should _not_ try to align with hardware
>> interrupts but rather rely on the driver to abstract things away.
>>
>>> Having said that, I am fine with this approach. recvmsg is a saner 
>>> interface.
>>>
>> Thanks!
>>
>>
>> [ .. ]
>>>> @@ -1023,40 +1030,32 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>>>>       return 0;
>>>>   }
>>>> -static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>>>> -                 unsigned int offset, size_t len)
>>>> +static int nvme_tcp_recvmsg_ddgst(struct nvme_tcp_queue *queue)
>>>>   {
>>>> -    struct nvme_tcp_queue *queue = desc->arg.data;
>>>> -    size_t consumed = len;
>>>> -    int result;
>>>> +    char *ddgst = (char *)&queue->recv_ddgst;
>>>> +    off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
>>>> +    struct msghdr msg = {
>>>> +        .msg_flags = MSG_WAITALL,
>>>> +    };
>>>
>>> I don't think we want to use MSG_WAITALL ever. Why are you opting to 
>>> do that?
>>>
>> Hmm. You are right, it's not needed.
>>
>> [ .. ]
>>>> @@ -1356,20 +1355,38 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
>>>>       return ret;
>>>>   }
>>>> -static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>>>> +static int nvme_tcp_try_recvmsg(struct nvme_tcp_queue *queue)
>>>>   {
>>>> -    struct socket *sock = queue->sock;
>>>> -    struct sock *sk = sock->sk;
>>>> -    read_descriptor_t rd_desc;
>>>> -    int consumed;
>>>> +    int result;
>>>> +    int nr_cqe = queue->nr_cqe;
>>>> +
>>>> +    if (unlikely(!queue->rd_enabled))
>>>> +        return -EFAULT;
>>>> +
>>>> +    do {
>>>> +        switch (nvme_tcp_recv_state(queue)) {
>>>> +        case NVME_TCP_RECV_PDU:
>>>> +            result = nvme_tcp_recvmsg_pdu(queue);
>>>> +            break;
>>>> +        case NVME_TCP_RECV_DATA:
>>>> +            result = nvme_tcp_recvmsg_data(queue);
>>>> +            break;
>>>> +        case NVME_TCP_RECV_DDGST:
>>>> +            result = nvme_tcp_recvmsg_ddgst(queue);
>>>> +            break;
>>>> +        default:
>>>> +            result = -EFAULT;
>>>> +        }
>>>> +    } while (result >= 0);
>>>> +
>>>> +    if (result < 0 && result != -EAGAIN) {
>>>> +        dev_err(queue->ctrl->ctrl.device,
>>>> +            "receive failed:  %d\n", result);
>>>> +        queue->rd_enabled = false;
>>>> +        nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>>> +    }
>>>> -    rd_desc.arg.data = queue;
>>>> -    rd_desc.count = 1;
>>>> -    lock_sock(sk);
>>>> -    queue->nr_cqe = 0;
>>>> -    consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>>>> -    release_sock(sk);
>>>> -    return consumed == -EAGAIN ? 0 : consumed;
>>>> +    return result < 0 ? result : (queue->nr_cqe = nr_cqe);
>>>
>>> This is changing the semantics of the retcode - can you please 
>>> document in the commit msg
>>> the change?
>>>
>> That was done to be compatible with the polling interface.
>> I'll check if it's needed after all.
>>
> Actually, I don't think I'm changing the interface.
> The original interface would return either a negative number on error,
> '0' if no PDUs have been processed, or the number of processed PDUs.
> And the new recvmsg interface does the same (I think).
> The only change which I did was to _not_ map -EAGAIN to '0', as this
> will be handled by the 'data_ready' callback.

Which is a change that needs to be documented.

> (Reasoning: -EAGAIN is returned when there is no (or not enough) data
> to be read, and the network layer will call ->data_ready once the
> situation improves. If we continue with the loop we will race/compete
> with the networking layer and do duplicated work: the io_work thread
> will pick up the new data _and_ the data_ready callback will schedule
> io_work to pick up the same data. So we can safely stop on -EAGAIN,
> knowing that we'll be woken up once we can continue to read data.)

There is TX work that may be pending as well; IIRC there is an
assumption that io_work will NOT return while there is any TX work
pending (we've seen stalls in the past due to issues like that).

> (And arguably this was an issue with the original code, too.).
>
> So where is the change in semantics?

Exactly where you described it.



More information about the Linux-nvme mailing list