[PATCHv2] nvme-tcp: Implement recvmsg() receive flow

Hannes Reinecke hare at suse.de
Wed Nov 26 23:52:11 PST 2025


On 11/26/25 08:32, Sagi Grimberg wrote:
> 
> 
> On 20/10/2025 11:58, Hannes Reinecke wrote:
>> The nvme-tcp code is using the ->read_sock() interface to
>> read data from the wire. While this interface gives us access
>> to the skbs themselves (and so might be able to reduce latency)
>> it does not interpret the skbs.
>> Additionally for TLS these skbs have to be re-constructed from
>> the TLS stream data, rendering any advantage questionable.
>> But the main drawback for TLS is that we do not get access to
>> the TLS control messages, so if we receive any of those messages
>> the only choice we have is to tear down the connection and restart.
>> This patch switches the receive side over to use recvmsg(), which
>> provides us full access to the TLS control messages and is also
>> more efficient when working with TLS as skbs do not need to be
>> artificially constructed.
> 
> Hannes,
> 
> I generally agree with this approach. I'd like to point out though
> that this is going to give up running RX directly from softirq
> context.

Yes.

> I've gone back and forth on whether nvme-tcp should do that, but never
> got to do a thorough comparison between the two. This probably shuts
> the door on that option.
> 
The thing with running RX from softirq context is that it would only
make sense if we could _ensure_ that the softirq runs on the CPU
where the blk-mq hardware context expects it to.
Not only would that require fiddling with RFS settings, but we also
found that NVMe-over-Fabrics should _not_ try to align with hardware
interrupts but rather rely on the driver to abstract things away.

> Having said that, I am fine with this approach. recvmsg is a saner 
> interface.
> 
Thanks!


[ .. ]
>> @@ -1023,40 +1030,32 @@ static int nvme_tcp_recv_ddgst(struct
>> nvme_tcp_queue *queue,
>>       return 0;
>>   }
>> -static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff 
>> *skb,
>> -                 unsigned int offset, size_t len)
>> +static int nvme_tcp_recvmsg_ddgst(struct nvme_tcp_queue *queue)
>>   {
>> -    struct nvme_tcp_queue *queue = desc->arg.data;
>> -    size_t consumed = len;
>> -    int result;
>> +    char *ddgst = (char *)&queue->recv_ddgst;
>> +    off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
>> +    struct msghdr msg = {
>> +        .msg_flags = MSG_WAITALL,
>> +    };
> 
> I don't think we want to use MSG_WAITALL ever. Why are you opting to do 
> that?
> 
Hmm. You are right, it's not needed.

[ .. ]
>> @@ -1356,20 +1355,38 @@ static int nvme_tcp_try_send(struct
>> nvme_tcp_queue *queue)
>>       return ret;
>>   }
>> -static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>> +static int nvme_tcp_try_recvmsg(struct nvme_tcp_queue *queue)
>>   {
>> -    struct socket *sock = queue->sock;
>> -    struct sock *sk = sock->sk;
>> -    read_descriptor_t rd_desc;
>> -    int consumed;
>> +    int result;
>> +    int nr_cqe = queue->nr_cqe;
>> +
>> +    if (unlikely(!queue->rd_enabled))
>> +        return -EFAULT;
>> +
>> +    do {
>> +        switch (nvme_tcp_recv_state(queue)) {
>> +        case NVME_TCP_RECV_PDU:
>> +            result = nvme_tcp_recvmsg_pdu(queue);
>> +            break;
>> +        case NVME_TCP_RECV_DATA:
>> +            result = nvme_tcp_recvmsg_data(queue);
>> +            break;
>> +        case NVME_TCP_RECV_DDGST:
>> +            result = nvme_tcp_recvmsg_ddgst(queue);
>> +            break;
>> +        default:
>> +            result = -EFAULT;
>> +        }
>> +    } while (result >= 0);
>> +
>> +    if (result < 0 && result != -EAGAIN) {
>> +        dev_err(queue->ctrl->ctrl.device,
>> +            "receive failed:  %d\n", result);
>> +        queue->rd_enabled = false;
>> +        nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>> +    }
>> -    rd_desc.arg.data = queue;
>> -    rd_desc.count = 1;
>> -    lock_sock(sk);
>> -    queue->nr_cqe = 0;
>> -    consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>> -    release_sock(sk);
>> -    return consumed == -EAGAIN ? 0 : consumed;
>> +    return result < 0 ? result : (queue->nr_cqe = nr_cqe);
> 
> This is changing the semantics of the retcode - can you please
> document the change in the commit msg?
> 
That was done to be compatible with the polling interface.
I'll check if it's needed after all.

>>   }
>>   static void nvme_tcp_io_work(struct work_struct *w)
>> @@ -1391,7 +1408,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
>>                   break;
>>           }
>> -        result = nvme_tcp_try_recv(queue);
>> +        result = nvme_tcp_try_recvmsg(queue);
>>           if (result > 0)
>>               pending = true;
>>           else if (unlikely(result < 0))
>> @@ -2800,7 +2817,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx 
>> *hctx, struct io_comp_batch *iob)
>>       set_bit(NVME_TCP_Q_POLLING, &queue->flags);
>>       if (sk_can_busy_loop(sk) &&
>>           skb_queue_empty_lockless(&sk->sk_receive_queue))
>>           sk_busy_loop(sk, true);
>> -    ret = nvme_tcp_try_recv(queue);
>> +    ret = nvme_tcp_try_recvmsg(queue);
>>       clear_bit(NVME_TCP_Q_POLLING, &queue->flags);
>>       return ret < 0 ? ret : queue->nr_cqe;
>>   
Cheers,
Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


