[PATCHv2] nvme-tcp: Implement recvmsg() receive flow
Hannes Reinecke
hare at suse.de
Mon Dec 1 01:02:12 PST 2025
On 11/30/25 22:38, Sagi Grimberg wrote:
>
>
> On 27/11/2025 13:47, Hannes Reinecke wrote:
>> On 11/27/25 08:52, Hannes Reinecke wrote:
>>> On 11/26/25 08:32, Sagi Grimberg wrote:
>>>>
>>>>
>>>> On 20/10/2025 11:58, Hannes Reinecke wrote:
>>>>> The nvme-tcp code is using the ->read_sock() interface to
>>>>> read data from the wire. While this interface gives us access
>>>>> to the skbs themselves (and so might be able to reduce latency)
>>>>> it does not interpret the skbs.
>>>>> Additionally for TLS these skbs have to be re-constructed from
>>>>> the TLS stream data, rendering any advantage questionable.
>>>>> But the main drawback for TLS is that we do not get access to
>>>>> the TLS control messages, so if we receive any of those messages
>>>>> the only choice we have is to tear down the connection and restart.
>>>>> This patch switches the receive side over to use recvmsg(), which
>>>>> provides us full access to the TLS control messages and is also
>>>>> more efficient when working with TLS as skbs do not need to be
>>>>> artificially constructed.
>>>>
>>>> Hannes,
>>>>
>>>> I generally agree with this approach. I'd like to point out though
>>>> that this is going to give up running RX directly from softirq
>>>> context.
>>>
>>> Yes.
>>>
>>>> I've gone back and forth on whether nvme-tcp should do that, but never
>>>> got to do a thorough comparison between the two. This probably shuts
>>>> the door on that option.
>>>>
>>> The thing with running from softirq context is that it would only
>>> make sense if we could _ensure_ that the softirq context runs on the
>>> CPU where the blk-mq hardware context expects it to run.
>>> Not only would that require fiddling with RFS contexts, but we also
>>> found that NVMe-over-fabrics should _not_ try to align with hardware
>>> interrupts but rather rely on the driver to abstract things away.
>>>
>>>> Having said that, I am fine with this approach. recvmsg is a saner
>>>> interface.
>>>>
>>> Thanks!
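
To spell out the TLS benefit from the patch description above: with
recvmsg() the record type arrives as a control message, so we can spot
alerts instead of having to tear down the connection blindly. A minimal
sketch of the pattern (helper name and the -ENOTCONN mapping are mine,
for illustration only; not the literal patch):

static int nvme_tcp_recvmsg_tls_sketch(struct socket *sock,
		struct kvec *iov)
{
	unsigned char ctype;
	char cbuf[CMSG_LEN(sizeof(ctype))];
	struct msghdr msg = {
		.msg_control = cbuf,
		.msg_controllen = sizeof(cbuf),
		.msg_flags = MSG_DONTWAIT,
	};
	int ret;

	ret = kernel_recvmsg(sock, &msg, iov, 1, iov->iov_len,
			     msg.msg_flags);
	if (ret <= 0)
		return ret;

	/* kTLS reports the record type via a SOL_TLS control message */
	ctype = tls_get_record_type(sock->sk, (struct cmsghdr *)cbuf);
	if (ctype && ctype != TLS_RECORD_TYPE_DATA)
		return -ENOTCONN;	/* e.g. alert or close_notify */

	return ret;
}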
>>>
>>>
>>> [ .. ]
>>>>> @@ -1023,40 +1030,32 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>>>>> return 0;
>>>>> }
>>>>> -static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>>>>> - unsigned int offset, size_t len)
>>>>> +static int nvme_tcp_recvmsg_ddgst(struct nvme_tcp_queue *queue)
>>>>> {
>>>>> - struct nvme_tcp_queue *queue = desc->arg.data;
>>>>> - size_t consumed = len;
>>>>> - int result;
>>>>> + char *ddgst = (char *)&queue->recv_ddgst;
>>>>> + off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
>>>>> + struct msghdr msg = {
>>>>> + .msg_flags = MSG_WAITALL,
>>>>> + };
>>>>
>>>> I don't think we want to use MSG_WAITALL ever. Why are you opting to
>>>> do that?
>>>>
>>> Hmm. You are right, it's not needed.
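
To make the flag choice explicit: io_work must never block on the
socket, so the digest read should use MSG_DONTWAIT and simply come back
later if the data hasn't arrived yet. Roughly (a sketch based on the
hunk above, untested):

static int nvme_tcp_recvmsg_ddgst(struct nvme_tcp_queue *queue)
{
	char *ddgst = (char *)&queue->recv_ddgst;
	off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
	struct msghdr msg = {
		.msg_flags = MSG_DONTWAIT,	/* never block io_work */
	};
	struct kvec iov = {
		.iov_base = ddgst + off,
		.iov_len = queue->ddgst_remaining,
	};
	int ret;

	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1, iov.iov_len,
			     msg.msg_flags);
	if (ret < 0)
		return ret;		/* includes -EAGAIN: retry later */
	if (!ret)
		return -ECONNRESET;	/* peer closed the connection */

	queue->ddgst_remaining -= ret;
	if (queue->ddgst_remaining)
		return -EAGAIN;		/* partial digest, wait for more data */
	return 0;			/* digest complete, verify and move on */
}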
>>>
>>> [ .. ]
>>>>> @@ -1356,20 +1355,38 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
>>>>> return ret;
>>>>> }
>>>>> -static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>>>>> +static int nvme_tcp_try_recvmsg(struct nvme_tcp_queue *queue)
>>>>> {
>>>>> - struct socket *sock = queue->sock;
>>>>> - struct sock *sk = sock->sk;
>>>>> - read_descriptor_t rd_desc;
>>>>> - int consumed;
>>>>> + int result;
>>>>> + int nr_cqe = queue->nr_cqe;
>>>>> +
>>>>> + if (unlikely(!queue->rd_enabled))
>>>>> + return -EFAULT;
>>>>> +
>>>>> + do {
>>>>> + switch (nvme_tcp_recv_state(queue)) {
>>>>> + case NVME_TCP_RECV_PDU:
>>>>> + result = nvme_tcp_recvmsg_pdu(queue);
>>>>> + break;
>>>>> + case NVME_TCP_RECV_DATA:
>>>>> + result = nvme_tcp_recvmsg_data(queue);
>>>>> + break;
>>>>> + case NVME_TCP_RECV_DDGST:
>>>>> + result = nvme_tcp_recvmsg_ddgst(queue);
>>>>> + break;
>>>>> + default:
>>>>> + result = -EFAULT;
>>>>> + }
>>>>> + } while (result >= 0);
>>>>> +
>>>>> + if (result < 0 && result != -EAGAIN) {
>>>>> + dev_err(queue->ctrl->ctrl.device,
>>>>> + "receive failed: %d\n", result);
>>>>> + queue->rd_enabled = false;
>>>>> + nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>>>> + }
>>>>> - rd_desc.arg.data = queue;
>>>>> - rd_desc.count = 1;
>>>>> - lock_sock(sk);
>>>>> - queue->nr_cqe = 0;
>>>>> - consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>>>>> - release_sock(sk);
>>>>> - return consumed == -EAGAIN ? 0 : consumed;
>>>>> + return result < 0 ? result : (queue->nr_cqe = nr_cqe);
>>>>
>>>> This is changing the semantics of the retcode - can you please
>>>> document the change in the commit msg?
>>>>
>>> That was done to be compatible with the polling interface.
>>> I'll check if it's needed after all.
>>>
>> Actually, I don't think I'm changing the interface.
>> The original interface would return either a negative number on error,
>> '0' if no PDUs have been processed, or the number of processed PDUs.
>> And the new recvmsg interface does the same (I think).
>> The only change I made was to _not_ map -EAGAIN to 0, as -EAGAIN will
>> be handled by the 'data_ready' callback.
>
> Which is a change that needs to be documented.
>
Okay, will do.
>> (Reasoning: -EAGAIN is returned when there is no (or not enough) data
>> to be read, and the network layer will call ->data_ready once the
>> situation improves. If we continued the loop we would race with the
>> networking layer and do duplicated work: the io_work thread would pick
>> up the new data _and_ the data_ready callback would schedule io_work
>> to pick up that same data. So we can safely stop on -EAGAIN, knowing
>> that we'll be woken up once there is more data to read.)
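
In other words, the retcode contract I'm aiming for is: a negative
errno on a hard error (with error recovery already kicked off), -EAGAIN
when the socket is drained, and otherwise the number of completed PDUs.
The caller side would then look roughly like this (a sketch; the real
io_work also drives the send path):

static void nvme_tcp_io_work_rx_sketch(struct nvme_tcp_queue *queue)
{
	int result;

	result = nvme_tcp_try_recvmsg(queue);
	if (result == -EAGAIN) {
		/*
		 * Socket drained; ->data_ready() will requeue io_work
		 * once more data arrives, so don't respin here.
		 */
		return;
	}
	if (result < 0)
		return;	/* hard error, recovery is already scheduled */

	/* result PDUs completed; poll once more for trailing data */
	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}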
>
> There is TX work that may be pending as well, IIRC there is an
> assumption that io_work will NOT return if there is any TX work
> pending (we've seen stalls in the past due to issues like that).
>
Hmm.
But I've seen the opposite: the current io_work prioritizes TX
over RX, so a 'data_ready' callback will trigger TX first, letting
the RX data stew for a bit until the TX data has been sent.
Maybe we should be looking into splitting TX and RX paths after all...
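
For reference, the current io_work loop, heavily simplified and from
memory, always runs the send side first:

	unsigned long deadline = jiffies + msecs_to_jiffies(1);
	int result;

	do {
		bool pending = false;

		/* TX first: flush whatever is queued for sending */
		if (mutex_trylock(&queue->send_mutex)) {
			result = nvme_tcp_try_send(queue);
			mutex_unlock(&queue->send_mutex);
			if (result > 0)
				pending = true;
		}

		/* RX only runs after the TX leg */
		result = nvme_tcp_try_recv(queue);
		if (result > 0)
			pending = true;

		if (!pending)
			break;
	} while (!time_after(jiffies, deadline));

With a separate RX path the data_ready handler could start reading
immediately instead of waiting for the TX leg to finish.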
>> (And arguably this was an issue with the original code, too.).
>>
>> So where is the change in semantics?
>
> Exactly where you described it.
Okay, I'll do that.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich