Command timeouts with NVMe TCP kernel driver

Sagi Grimberg sagi at grimberg.me
Wed Sep 8 13:55:59 PDT 2021


>>> 1. Userspace context grabs send_mutex via queue_rq and calls into kernel_sendpage. This context is pinned to a CPU X because that's the way my benchmark works.
>>> 2. Userspace context is descheduled.
>>> 3. nvme_tcp_io_work is scheduled on the same CPU X because it so happens that io_cpu == X. (I have hundreds of threads which are statically assigned to CPUs and spread over all the CPUs of the VM, so there are necessarily some userspace threads whose CPU coincides with io_cpu).
>>> 4. nvme_tcp_io_work obviously can't grab send_mutex because it's held by the userspace context. But because req_list is not empty, it doesn't yield; it keeps spinning in the loop until its time quota expires.
>>> 5. Since pending == true, nvme_tcp_io_work re-schedules itself for immediate execution. Because the workqueue is flagged WQ_HIGHPRI, I guess it gets rescheduled almost immediately, and my poor userspace context doesn't get enough CPU to make reasonable forward progress. We find ourselves in a kind of livelock.
>>>
>>> This is consistent with the fact that the problem disappears if I do any one of the following 3 things:
>>>
>>> 1. Modify my userspace benchmark to avoid pinning threads to CPUs => direct send path can execute on a different CPU and make forward progress
>>> 2. Modify nvme-tcp to remove the "direct send" path in queue_rq and always post to the work queue => no contention between direct send path and the workqueue
>>> 3. Modify nvme_tcp_wq to remove the WQ_HIGHPRI flag => I guess this makes the scheduler friendlier towards my userspace thread
>>>
>>> Does this make sense? What do you think is the best way to fix this? I guess the WQ_HIGHPRI flag is there for a reason, and that the "direct send" path can provide lower latency in some cases. What about a heuristic in io_work that will prevent it from looping indefinitely after a certain number of failed attempts to claim the send mutex?
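
For reference, the io_work loop described in steps 4-5 above looks
roughly like this (paraphrased from drivers/nvme/host/tcp.c around this
time, with error handling trimmed, so details may differ from the exact
tree):

static void nvme_tcp_io_work(struct work_struct *w)
{
        struct nvme_tcp_queue *queue =
                container_of(w, struct nvme_tcp_queue, io_work);
        unsigned long deadline = jiffies + msecs_to_jiffies(1);

        do {
                bool pending = false;

                if (mutex_trylock(&queue->send_mutex)) {
                        if (nvme_tcp_try_send(queue) > 0)
                                pending = true;
                        mutex_unlock(&queue->send_mutex);
                } else if (unlikely(!llist_empty(&queue->req_list))) {
                        /* send_mutex is contended (e.g. held by the
                         * direct-send path in queue_rq) but requests
                         * are queued, so stay pending */
                        pending = true;
                }

                if (nvme_tcp_try_recv(queue) > 0)
                        pending = true;

                if (!pending)
                        return;

        } while (!time_after(jiffies, deadline)); /* ~1ms quota */

        /* still pending: immediately re-queue ourselves on io_cpu */
        queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}

If mutex_trylock() keeps failing while req_list stays non-empty,
pending stays true, so the loop spins until the quota expires and then
immediately re-queues itself on io_cpu, which is the livelock described
above. The else-if branch on req_list is the "pending" check discussed
below.
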
>>
>> Sounds possible. The "pending" check on the req_list was added to fix a race
>> condition where io_work could miss handling a request, resulting in
>> a different command timeout. It sounds like that could make io_work
>> spin without having anything to do, while starving the lower-priority
>> task. I'll try to think of another way to handle it.
> 
> Instead of relying on io_work to determine if there are pending requests,
> the queuing side can schedule io_work only after the send_mutex
> is released. I think that should address the issue you've described
> while still fixing the I/O timeout the "pending" trick was addressing.
> Can you try the below patch?
> 
> Sagi, does this look okay to you?

It looks OK to me, but then again the former version
looked perfectly fine to me as well. I'm trying to understand
whether there is a race condition hiding here...

> ---
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index e2ab12f3f51c..e4249b7dc056 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -274,6 +274,12 @@ static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
>   	} while (ret > 0);
>   }
>   
> +static inline bool nvme_tcp_queue_more(struct nvme_tcp_queue *queue)
> +{
> +	return !list_empty(&queue->send_list) ||
> +		!llist_empty(&queue->req_list) || queue->more_requests;
> +}
> +
>   static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>   		bool sync, bool last)
>   {
> @@ -294,9 +300,10 @@ static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
>   		nvme_tcp_send_all(queue);
>   		queue->more_requests = false;
>   		mutex_unlock(&queue->send_mutex);
> -	} else if (last) {
> -		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
>   	}
> +
> +	if (last && nvme_tcp_queue_more(queue))
> +		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);

Basically, the last request in the batch will now not trigger io_work
unconditionally (when the direct-send path was not taken), but only if
it sees queued requests. In theory those could already be reaped by a
running io_work, which would result in an empty io_work invocation. On
one hand that is undesired; on the other hand it should be harmless, as
we shouldn't miss an invocation... So it looks reasonable to me.
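
For completeness, with the patch applied the queuing path ends up
looking roughly like this (reconstructed from the hunk above plus the
surrounding code, so treat it as a sketch rather than the exact tree):

static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
                bool sync, bool last)
{
        struct nvme_tcp_queue *queue = req->queue;
        bool empty;

        empty = llist_add(&req->lentry, &queue->req_list) &&
                list_empty(&queue->send_list) && !queue->request;

        /*
         * If we are already on io_cpu and ours is the only queued
         * request, try to send directly from this context under
         * send_mutex.
         */
        if (queue->io_cpu == raw_smp_processor_id() && sync &&
            empty && mutex_trylock(&queue->send_mutex)) {
                queue->more_requests = !last;
                nvme_tcp_send_all(queue);
                queue->more_requests = false;
                mutex_unlock(&queue->send_mutex);
        }

        /*
         * Only kick io_work after send_mutex has been released, and
         * only if there is still something queued for it to do.
         */
        if (last && nvme_tcp_queue_more(queue))
                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}

So the worst case is an occasional spurious io_work run, never a missed
one.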

Sam, can you run some more extensive I/O testing with this (different
patterns of r/w and block sizes)? Also Keith, can you ask the WD folks
to also take it for a spin?

All of these issues share a common theme: fewer I/O queues than CPUs
for each controller, which makes RX/TX and queue_rq/io_work compete,
usually combined with controllers that have a small MDTS (which triggers
multiple R2Ts per request and has uncovered some races in this area).
So testing those configurations would be beneficial.
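
For context, each queue's io_work is pinned to a single CPU chosen when
the queue is set up, roughly like below (paraphrased from the driver,
details may differ):

static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
{
        struct nvme_tcp_ctrl *ctrl = queue->ctrl;
        int qid = nvme_tcp_queue_id(queue);
        int n = 0;

        if (nvme_tcp_default_queue(queue))
                n = qid - 1;
        else if (nvme_tcp_read_queue(queue))
                n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] - 1;
        else if (nvme_tcp_poll_queue(queue))
                n = qid - ctrl->io_queues[HCTX_TYPE_DEFAULT] -
                                ctrl->io_queues[HCTX_TYPE_READ] - 1;

        /* one fixed io_cpu per queue */
        queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
}

With fewer queues than CPUs, many submitting CPUs map onto the same
queue whose io_work runs on that one io_cpu, so queue_rq's direct-send
path and io_work contend for send_mutex, and a small MDTS means each
request generates multiple R2T-triggered sends, which is where the
races showed up.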


