[PATCH 3/3] nvme-tcp: do not queue io_work on a full socket

Hannes Reinecke hare at suse.de
Thu May 29 23:42:38 PDT 2025


On 5/29/25 14:38, Sagi Grimberg wrote:
> 
> 
> On 28/05/2025 9:45, Hannes Reinecke wrote:
>> There really is no point in scheduling io_work from ->queue_rq()
>> if the socket is full; we'll be notified by the ->write_space()
>> callback once space on the socket becomes available.
>> Consequently we also need to remove the check for 'sk_stream_is_writeable()'
>> in the ->write_space() callback as we need to schedule io_work
>> to receive packets even if the sending side is blocked.
>>
>> Signed-off-by: Hannes Reinecke <hare at kernel.org>
>> ---
>>   drivers/nvme/host/tcp.c | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 0e178115dc04..e4dd1620dc28 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -411,6 +411,9 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>       empty = llist_add(&req->lentry, &queue->req_list) &&
>>           list_empty(&queue->send_list) && !queue->request;
>> +    if (!sk_stream_is_writeable(queue->sock->sk))
>> +        empty = false;
> 
> This is wrong. The fact that the socket is not writeable does not mean
> that the queue is not empty (meaning we have requests that we have yet
> to write to the socket).
> 
Admittedly I abuse 'empty' here; setting 'empty' to 'false' merely
prevents the workqueue from being run. If 'sk_stream_is_writeable()' is
false, starting the workqueue will just cause sendmsg() to return EAGAIN,
and we'll be notified via the 'write_space' callback anyway.
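
To make the intent explicit, here is roughly what I have in mind, using a
separate 'can_send' flag instead of overloading 'empty'. This is only a
sketch; nvme_tcp_send_all() and the exact surrounding code are paraphrased
from memory rather than copied from the patch:

    bool empty, can_send;

    empty = llist_add(&req->lentry, &queue->req_list) &&
            list_empty(&queue->send_list) && !queue->request;
    /* Avoid touching a socket that has no space left */
    can_send = sk_stream_is_writeable(queue->sock->sk);

    /*
     * Only take the direct-send fast path if we are first in line
     * _and_ the socket has space; otherwise rely on ->write_space()
     * to kick io_work once the socket drains.
     */
    if (sync && empty && can_send &&
        mutex_trylock(&queue->send_mutex)) {
        nvme_tcp_send_all(queue);
        mutex_unlock(&queue->send_mutex);
    }

    if (last && can_send && nvme_tcp_queue_has_pending(queue))
        queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);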

>> +
>>       /*
>>        * if we're the first on the send_list and we can try to send
>>        * directly, otherwise queue io_work. Also, only do that if we
>> @@ -422,7 +425,8 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>>           mutex_unlock(&queue->send_mutex);
>>       }
>> -    if (last && nvme_tcp_queue_has_pending(queue))
>> +    if (last && nvme_tcp_queue_has_pending(queue) &&
>> +        sk_stream_is_writeable(queue->sock->sk))
> 
> I think it's fine, because either this or the callback will effectively
> not queue the work, as the work can only be queued once.
> 
>>           queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>   }
>> @@ -1074,7 +1078,7 @@ static void nvme_tcp_write_space(struct sock *sk)
>>       read_lock_bh(&sk->sk_callback_lock);
>>       queue = sk->sk_user_data;
>> -    if (likely(queue && sk_stream_is_writeable(sk))) {
>> +    if (likely(queue)) {
>>           clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>>           queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>>       }
> 
> I'd keep only this change. Unless you are convinced that the other 
> stream checks are correct?

The gist of this patch was to move the 'sk_stream_is_writeable()' checks
into the io_work path so that io_work is not queued while the socket is
blocked; in those cases io_work will be restarted via the callbacks anyway.
The prime reason was to avoid recvmsg() or sendmsg() returning EAGAIN,
as that increases latency during command processing.
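
To illustrate: even if io_work does get queued while the socket is full,
the send side could bail out before calling sendmsg(). Something like the
fragment below is what I mean (again just a sketch of the io_work loop,
paraphrased from memory rather than quoted from the patch):

    do {
        bool pending = false;
        int result;

        /*
         * Skip the send attempt if the socket has no space;
         * sendmsg() would only return -EAGAIN and waste the
         * iteration. ->write_space() re-arms io_work later.
         */
        if (sk_stream_is_writeable(queue->sock->sk) &&
            mutex_trylock(&queue->send_mutex)) {
            result = nvme_tcp_try_send(queue);
            mutex_unlock(&queue->send_mutex);
            if (result > 0)
                pending = true;
            else if (unlikely(result < 0))
                break;
        }

        /* Receiving is independent of send-side space */
        result = nvme_tcp_try_recv(queue);
        if (result > 0)
            pending = true;

        if (!pending)
            return;
    } while (!time_after(jiffies, deadline));

With that in place the write_space() callback only needs to know that the
queue exists, which is what the last hunk above does.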

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


