[PATCHv2] nvme-tcp: Implement recvmsg() receive flow
Hannes Reinecke
hare at suse.de
Thu Nov 27 03:47:28 PST 2025
On 11/27/25 08:52, Hannes Reinecke wrote:
> On 11/26/25 08:32, Sagi Grimberg wrote:
>>
>>
>> On 20/10/2025 11:58, Hannes Reinecke wrote:
>>> The nvme-tcp code is using the ->read_sock() interface to
>>> read data from the wire. While this interface gives us access
>>> to the skbs themselves (and so might be able to reduce latency)
>>> it does not interpret the skbs.
>>> Additionally for TLS these skbs have to be re-constructed from
>>> the TLS stream data, rendering any advantage questionable.
>>> But the main drawback for TLS is that we do not get access to
>>> the TLS control messages, so if we receive any of those messages
>>> the only choice we have is to tear down the connection and restart.
>>> This patch switches the receive side over to use recvmsg(), which
>>> provides us full access to the TLS control messages and is also
>>> more efficient when working with TLS as skbs do not need to be
>>> artificially constructed.
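Just to make that point concrete: with recvmsg() the TLS layer reports
the record type via a control message, so a non-data record (alert,
handshake message) can be detected instead of being misread as PDU
bytes. Roughly like this - a sketch only, with a made-up helper name,
not the code from this patch:

#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>
#include <net/tls.h>

/*
 * Sketch: TLS-aware receive via kernel_recvmsg().  If kTLS is not
 * enabled on the socket the cmsg buffer simply stays zeroed and the
 * record-type check falls through.
 */
static int recvmsg_tls_aware(struct socket *sock, void *buf, size_t len)
{
	char cbuf[CMSG_LEN(sizeof(unsigned char))] = {};
	struct cmsghdr *cmsg = (struct cmsghdr *)cbuf;
	struct kvec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {
		.msg_flags = MSG_DONTWAIT,
		.msg_control = cbuf,
		.msg_controllen = sizeof(cbuf),
	};
	int ret;

	ret = kernel_recvmsg(sock, &msg, &iov, 1, iov.iov_len,
			     msg.msg_flags);
	if (ret <= 0)
		return ret;

	/* Non-data TLS record: must be handled, not parsed as a PDU */
	if (cmsg->cmsg_level == SOL_TLS &&
	    cmsg->cmsg_type == TLS_GET_RECORD_TYPE &&
	    *(unsigned char *)CMSG_DATA(cmsg) != TLS_RECORD_TYPE_DATA)
		return -ENOTCONN;

	return ret;
}

With ->read_sock() there is no equivalent channel for that information,
hence the teardown-and-reconnect fallback described above.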
>>
>> Hannes,
>>
>> I generally agree with this approach. I'd like to point out though
>> that this is going to give up running RX directly from softirq
>> context.
>
> Yes.
>
>> I've gone back and forth on whether nvme-tcp should do that, but never
>> got to do a thorough comparison between the two. This probably shuts
>> the door on that option.
>>
> The thing with running from softirq context is that it would only
> make sense if we could _ensure_ that the softirq context is running
> on the cpu where the blk-mq hardware context is expecting it to.
> Not only would that require fiddling with RFS contexts, but we also
> found that NVMe-over-fabrics should _not_ try to align with hardware
> interrupts but rather rely on the driver to abstract things away.
>
>> Having said that, I am fine with this approach. recvmsg is a saner
>> interface.
>>
> Thanks!
>
>
> [ .. ]
>>> @@ -1023,40 +1030,32 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>>>  	return 0;
>>>  }
>>> -static int nvme_tcp_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
>>> -			     unsigned int offset, size_t len)
>>> +static int nvme_tcp_recvmsg_ddgst(struct nvme_tcp_queue *queue)
>>>  {
>>> -	struct nvme_tcp_queue *queue = desc->arg.data;
>>> -	size_t consumed = len;
>>> -	int result;
>>> +	char *ddgst = (char *)&queue->recv_ddgst;
>>> +	off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
>>> +	struct msghdr msg = {
>>> +		.msg_flags = MSG_WAITALL,
>>> +	};
>>
>> I don't think we want to use MSG_WAITALL ever. Why are you opting to
>> do that?
>>
> Hmm. You are right, it's not needed.
>
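And a short read is harmless without it: ddgst_remaining already
tracks how much of the digest is still outstanding. Something along
these lines (a sketch with a made-up name, not the actual patch code):

/*
 * Sketch: digest read without MSG_WAITALL.  A short read leaves
 * ddgst_remaining non-zero; we return -EAGAIN and finish the digest
 * on the next pass.
 */
static int recvmsg_ddgst_sketch(struct nvme_tcp_queue *queue)
{
	char *ddgst = (char *)&queue->recv_ddgst;
	off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
	struct kvec iov = {
		.iov_base = ddgst + off,
		.iov_len = queue->ddgst_remaining,
	};
	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
	int ret;

	ret = kernel_recvmsg(queue->sock, &msg, &iov, 1, iov.iov_len,
			     msg.msg_flags);
	if (ret <= 0)
		return ret ? ret : -ENOTCONN;	/* 0 means peer closed */

	queue->ddgst_remaining -= ret;
	if (queue->ddgst_remaining)
		return -EAGAIN;		/* partial digest, more to come */
	return 0;
}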
> [ .. ]
>>> @@ -1356,20 +1355,38 @@ static int nvme_tcp_try_send(struct nvme_tcp_queue *queue)
>>>  	return ret;
>>>  }
>>> -static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
>>> +static int nvme_tcp_try_recvmsg(struct nvme_tcp_queue *queue)
>>>  {
>>> -	struct socket *sock = queue->sock;
>>> -	struct sock *sk = sock->sk;
>>> -	read_descriptor_t rd_desc;
>>> -	int consumed;
>>> +	int result;
>>> +	int nr_cqe = queue->nr_cqe;
>>> +
>>> +	if (unlikely(!queue->rd_enabled))
>>> +		return -EFAULT;
>>> +
>>> +	do {
>>> +		switch (nvme_tcp_recv_state(queue)) {
>>> +		case NVME_TCP_RECV_PDU:
>>> +			result = nvme_tcp_recvmsg_pdu(queue);
>>> +			break;
>>> +		case NVME_TCP_RECV_DATA:
>>> +			result = nvme_tcp_recvmsg_data(queue);
>>> +			break;
>>> +		case NVME_TCP_RECV_DDGST:
>>> +			result = nvme_tcp_recvmsg_ddgst(queue);
>>> +			break;
>>> +		default:
>>> +			result = -EFAULT;
>>> +		}
>>> +	} while (result >= 0);
>>> +
>>> +	if (result < 0 && result != -EAGAIN) {
>>> +		dev_err(queue->ctrl->ctrl.device,
>>> +			"receive failed: %d\n", result);
>>> +		queue->rd_enabled = false;
>>> +		nvme_tcp_error_recovery(&queue->ctrl->ctrl);
>>> +	}
>>> -	rd_desc.arg.data = queue;
>>> -	rd_desc.count = 1;
>>> -	lock_sock(sk);
>>> -	queue->nr_cqe = 0;
>>> -	consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
>>> -	release_sock(sk);
>>> -	return consumed == -EAGAIN ? 0 : consumed;
>>> +	return result < 0 ? result : (queue->nr_cqe = nr_cqe);
>>
>> This is changing the semantics of the retcode - can you please
>> document the change in the commit msg?
>>
> That was done to be compatible with the polling interface.
> I'll check if it's needed after all.
>
Actually, I don't think I'm changing the interface.
The original interface would return either a negative number on error,
'0' if no PDUs have been processed, or the number of processed PDUs.
And the new recvmsg interface does the same (I think).
The only change I made was to _not_ map -EAGAIN to '0', as that case
will be handled by the 'data_ready' callback.
(Reasoning: -EAGAIN is returned when there is no (or not enough) data
to be read, and the network layer will call ->data_ready once the
situation improves. If we kept looping we would race with the
networking layer and do duplicate work, as the io_work thread would be
picking up the new data _and_ the data_ready callback would be
scheduling io_work again to pick up the same data. So we can safely
stop on -EAGAIN, knowing that we'll be woken up once there is more
data to read.)
(And arguably this was an issue with the original code, too.)
So where is the change in semantics?
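To spell out the contract I'm assuming on the caller side (the caller
below is made up purely to illustrate the semantics, it is not the
actual io_work code):

/*
 * Assumed return contract of nvme_tcp_try_recvmsg():
 *   < 0   error; -EAGAIN just means "no more data for now",
 *         ->data_ready will schedule io_work again when data arrives
 *   >= 0  number of completions processed in this pass
 */
static void io_work_rx_sketch(struct nvme_tcp_queue *queue)
{
	int ret = nvme_tcp_try_recvmsg(queue);

	if (ret == -EAGAIN)
		return;		/* stop; ->data_ready will re-arm us */
	if (ret < 0)
		return;		/* error recovery has been scheduled */
	/* 'ret' completions were processed; the polling path reports
	 * this number upwards */
}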
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich