[PATCHv2] nvme-tcp: Implement recvmsg() receive flow
Hannes Reinecke
hare at suse.de
Wed Nov 26 23:52:11 PST 2025
On 11/26/25 08:32, Sagi Grimberg wrote:
>
>
> On 20/10/2025 11:58, Hannes Reinecke wrote:
>> The nvme-tcp code is using the ->read_sock() interface to
>> read data from the wire. While this interface gives us access
>> to the skbs themselves (and so might be able to reduce latency)
>> it does not interpret the skbs.
>> Additionally for TLS these skbs have to be re-constructed from
>> the TLS stream data, rendering any advantage questionable.
>> But the main drawback for TLS is that we do not get access to
>> the TLS control messages, so if we receive any of those message
>> the only choice we have is to tear down the connection and restart.
>> This patch switches the receive side over to use recvmsg(), which
>> provides us full access to the TLS control messages and is also
>> more efficient when working with TLS as skbs do not need to be
>> artificially constructed.
>
> Hannes,
>
> I generally agree with this approach. I'd like to point out though
> that this is going to give up running RX directly from softirq
> context.
Yes.
> I've gone back and forth on whether nvme-tcp should do that, but never
> got to do a thorough comparison between the two. This probably shuts
> the door on that option.
>
The thing with running from softirq context is that it would only
make sense if we could _ensure_ that the softirq context is running
on the cpu where the blk-mq hardware context is expecting it to.
Not only would that require fiddling with RFS contexts, but we also
found that NVMe-over-fabrics should _not_ try to align with hardware
interrupts but rather rely on the driver to abstract things away.
> Having said that, I am fine with this approach. recvmsg is a saner
> interface.
>
Thanks!
[ .. ]
>> @@ -1023,40 +1030,32 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>> return 0;
>> }
>> -static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>> -			     unsigned int offset, size_t len)
>> +static int nvme_tcp_recvmsg_ddgst(struct nvme_tcp_queue *queue)
>> {
>> - struct nvme_tcp_queue *queue = desc->arg.data;
>> - size_t consumed = len;
>> - int result;
>> + char *ddgst = (char *)&queue->recv_ddgst;
>> + off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
>> + struct msghdr msg = {
>> + .msg_flags = MSG_WAITALL,
>> + };
>
> I don't think we want to use MSG_WAITALL ever. Why are you opting to do
> that?
>
Hmm. You are right, it's not needed.
[ .. ]
>> @@ -1356,20 +1355,38 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
>> return ret;
>> }
>> -static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>> +static int nvme_tcp_try_recvmsg(struct nvme_tcp_queue *queue)
>> {
>> - struct socket *sock = queue->sock;
>> - struct sock *sk = sock->sk;
>> - read_descriptor_t rd_desc;
>> - int consumed;
>> + int result;
>> + int nr_cqe = queue->nr_cqe;
>> +
>> + if (unlikely(!queue->rd_enabled))
>> + return -EFAULT;
>> +
>> + do {
>> + switch (nvme_tcp_recv_state(queue)) {
>> + case NVME_TCP_RECV_PDU:
>> + result = nvme_tcp_recvmsg_pdu(queue);
>> + break;
>> + case NVME_TCP_RECV_DATA:
>> + result = nvme_tcp_recvmsg_data(queue);
>> + break;
>> + case NVME_TCP_RECV_DDGST:
>> + result = nvme_tcp_recvmsg_ddgst(queue);
>> + break;
>> + default:
>> + result = -EFAULT;
>> + }
>> + } while (result >= 0);
>> +
>> + if (result < 0 && result != -EAGAIN) {
>> + dev_err(queue->ctrl->ctrl.device,
>> + "receive failed: %d\n", result);
>> + queue->rd_enabled = false;
>> + nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>> + }
>> - rd_desc.arg.data = queue;
>> - rd_desc.count = 1;
>> - lock_sock(sk);
>> - queue->nr_cqe = 0;
>> - consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>> - release_sock(sk);
>> - return consumed == -EAGAIN ? 0 : consumed;
>> + return result < 0 ? result : (queue->nr_cqe = nr_cqe);
>
> This is changing the semantics of the retcode - can you please document
> in the commit msg
> the change?
>
That was done to be compatible with the polling interface.
I'll check whether it's actually needed.
>> }
>> static void nvme_tcp_io_work(struct work_struct *w)
>> @@ -1391,7 +1408,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
>> break;
>> }
>> - result = nvme_tcp_try_recv(queue);
>> + result = nvme_tcp_try_recvmsg(queue);
>> if (result > 0)
>> pending = true;
>> else if (unlikely(result < 0))
>> @@ -2800,7 +2817,7 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
>> set_bit(NVME_TCP_Q_POLLING, &queue->flags);
>>   	if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue))
>> sk_busy_loop(sk, true);
>> - ret = nvme_tcp_try_recv(queue);
>> + ret = nvme_tcp_try_recvmsg(queue);
>> clear_bit(NVME_TCP_Q_POLLING, &queue->flags);
>> return ret < 0 ? ret : queue->nr_cqe;
>>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich