Command timeouts with NVMe TCP kernel driver

Sagi Grimberg sagi at grimberg.me
Wed Sep 8 23:36:30 PDT 2021



On 9/9/21 12:50 AM, Keith Busch wrote:
> On Wed, Sep 08, 2021 at 11:55:59PM +0300, Sagi Grimberg wrote:
>>
>>>>> 1. Userspace context grabs send_mutex via queue_rq and calls into kernel_sendpage. This context is pinned to a CPU X because that's the way my benchmark works.
>>>>> 2. Userspace context is descheduled.
>>>>> 3. nvme_tcp_io_work is scheduled on the same CPU X because it so happens that io_cpu == X. (I have hundreds of threads which are statically assigned to CPUs and spread over all the CPUs of the VM, so there are necessarily some userspace threads whose CPU coincides with io_cpu).
>>>>> 4. nvme_tcp_io_work obviously can't grab send_mutex because it's held by the userspace. But because req_list is not empty, it doesn't yield but keeps on spinning in the loop until it expires.
>>>>> 5. Since pending = true, nvme_tcp_io_work re-schedules itself for immediate execution. Because it's flagged as HIGHPRI, I guess this means it is rescheduled very soon/almost immediately, and my poor userspace context doesn't get enough CPU to make reasonable forward progress. We find ourselves in a kind of livelock.
>>>>>
>>>>> This seems consistent with the fact that my problem disappears if I do any one of the following three things:
>>>>>
>>>>> 1. Modify my userspace benchmark to avoid pinning threads to CPUs => direct send path can execute on a different CPU and make forward progress
>>>>> 2. Modify nvme-tcp to remove the "direct send" path in queue_rq and always post to the work queue => no contention between direct send path and the workqueue
>>>>> 3. Modify the tcp wq to remove the WQ_HIGHPRI flag => I guess this makes the scheduler more friendly towards my userspace thread
>>>>>
>>>>> Does this make sense? What do you think is the best way to fix this? I guess the WQ_HIGHPRI flag is there for a reason, and that the "direct send" path can provide lower latency in some cases. What about a heuristic in io_work that will prevent it from looping indefinitely after a certain number of failed attempts to claim the send mutex?
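
For context, the direct-send path referenced in steps 1-4 above is the
opportunistic send in nvme_tcp_queue_request(); a simplified sketch
(details vary by kernel version, more_requests handling omitted):

static inline void nvme_tcp_queue_request(struct nvme_tcp_request *req,
                bool sync, bool last)
{
        struct nvme_tcp_queue *queue = req->queue;
        bool empty;

        /* add to the lockless req_list; "empty" means nothing else is queued */
        empty = llist_add(&req->lentry, &queue->req_list) &&
                list_empty(&queue->send_list) && !queue->request;

        /*
         * Send directly from the submitting context if we are already on
         * the queue's io_cpu and the send path is idle; otherwise kick
         * io_work on io_cpu.
         */
        if (queue->io_cpu == raw_smp_processor_id() &&
            sync && empty && mutex_trylock(&queue->send_mutex)) {
                nvme_tcp_try_send(queue);
                mutex_unlock(&queue->send_mutex);
        } else if (last) {
                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
        }
}

This is the path that takes send_mutex from the submitting context on
CPU X while io_work ends up queued on the same CPU.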
>>>>
>>>> Sounds possible. The "pending" check on the req_list was to fix a race
>>>> condition where the io_work could miss handling a request, resulting in
>>>> a different command time out. It sounds like that could make the io_work
>>>> spin without having anything to do, while starving the lower priority
>>>> task. I'll try to think of another way to handle it.
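
For reference, the io_work loop in question is roughly the following
(simplified, error handling omitted); the fallback in the else branch is
what keeps it spinning in the scenario above:

static void nvme_tcp_io_work(struct work_struct *w)
{
        struct nvme_tcp_queue *queue =
                container_of(w, struct nvme_tcp_queue, io_work);
        unsigned long deadline = jiffies + msecs_to_jiffies(1);

        do {
                bool pending = false;
                int result;

                if (mutex_trylock(&queue->send_mutex)) {
                        result = nvme_tcp_try_send(queue);
                        mutex_unlock(&queue->send_mutex);
                        if (result > 0)
                                pending = true;
                } else {
                        /*
                         * Couldn't get send_mutex (e.g. queue_rq holds it):
                         * treat a non-empty req_list as pending work so a
                         * queued request is never missed.
                         */
                        pending = !llist_empty(&queue->req_list);
                }

                result = nvme_tcp_try_recv(queue);
                if (result > 0)
                        pending = true;

                if (!pending)
                        return;

        } while (!time_after(jiffies, deadline));

        /* quota exhausted but still pending: requeue ourselves */
        queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}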
>>>
>>> Instead of relying on io_work to determine if there are pending requests,
>>> the queuing action can try to re-schedule it only after the send_mutex
>>> is released, and I think that should address the issue you've described
>>> while still fixing the IO timeout the "pending" trick was addressing.
>>> Can you try the below patch?
>>>
>>> Sagi, does this look okay to you?
>>
>> It looks OK to me, but on the other hand the former
>> one looked perfectly fine to me.
> 
> It's a priority inversion. The .queue_rq() task is holding onto the
> 'req_list' resource via send_mutex, but it can't schedule in because
> io_work is higher priority and will remain in TASK_RUNNING while
> 'req_list' is non-empty.
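
The elevated priority comes from the driver allocating its workqueue with
WQ_HIGHPRI at module init, roughly:

        nvme_tcp_wq = alloc_workqueue("nvme_tcp_wq",
                        WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);

which places io_work on a high-priority (nice -20) kworker pool, so the
scheduler strongly favors it over the pinned userspace submitter on the
same CPU.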

Yes, I understand the issue.

> I think the user must have pinned the .queue_rq task to that specific
> CPU, otherwise I believe it would have scheduled somewhere else to
> unblock forward progress.

Yes, I agree.

>> Basically now the last request in the batch will not unconditionally
>> trigger io_work (if the direct-send path was not taken), but only if
>> it sees queued requests, which in theory could already be reaped by a
>> running io_work, and that would result in an empty io_work invocation.
>> On one hand that is undesired, on the other hand it should be harmless
>> as we shouldn't miss an invocation... So it looks reasonable to me.
> 
> I can change the check to the same as .commit_rqs(), which just
> considers 'llist_empty(&queue->req_list)'.
> 
> That should be just as safe and cause less unnecessary io_work
> scheduling, for the same reason commit_rqs() uses it. In any case, I
> don't think an occasional unneeded work check is really that harmful.
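
For reference, the .commit_rqs() check being referred to is roughly:

static void nvme_tcp_commit_rqs(struct blk_mq_hw_ctx *hctx)
{
        struct nvme_tcp_queue *queue = hctx->driver_data;

        /* only kick io_work if there is actually something queued */
        if (!llist_empty(&queue->req_list))
                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}

i.e. the queuing path would schedule io_work under the same llist_empty()
condition instead of doing so unconditionally.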

What we need to think about is whether there is a case where we miss
scheduling the io_work.

>> Sam, can you run some more extensive I/O testing with this (different
>> patterns of r/w and block sizes)? Also Keith, can you ask the WD folks
>> to also take it for a spin?
> 
> Sure, I'll request this patch for their next test cycle.

Great.


