[PATCH 2/4] nvme-tcp: align I/O cpu with blk-mq mapping
Sagi Grimberg
sagi at grimberg.me
Thu Jul 4 02:07:46 PDT 2024
On 7/4/24 09:43, Hannes Reinecke wrote:
> On 7/3/24 21:38, Sagi Grimberg wrote:
> [ .. ]
>>>>
>>>> We should make the io_cpu come from blk-mq hctx mapping by default,
>>>> and for every controller it should use a different cpu from the
>>>> hctx mapping. That is the default behavior. In the wq_unbound case,
>>>> we skip all of that and make io_cpu = WORK_CPU_UNBOUND, as it was
>>>> before.
>>>>
>>>> I'm not sure I follow your logic.
>>>>
>>> Hehe. That's quite simple: there is none :-)
>>> I have been tinkering with that approach over the last few weeks, but got
>>> consistently _worse_ results than with the original implementation.
>>> So I gave up on trying to make that the default.
>>
>> What is the "original implementation" ?
>
> nvme-6.10
>
>> What is you target? nvmet?
>
> nvmet with brd backend
>
>> What is the fio job file you are using?
>
> tiobench-example.fio from the fio samples
>
>> what is the queue count? controller count?
>
> 96 queues, 4 subsystems, 2 controllers each.
>
>> What was the queue mapping?
>>
> queue 0-5 maps to cpu 6-11
> queue 6-11 maps to cpu 54-59
> queue 12-17 maps to cpu 18-23
> queue 18-23 maps to cpu 66-71
> queue 24-29 maps to cpu 24-29
> queue 30-35 maps to cpu 72-77
> queue 36-41 maps to cpu 30-35
> queue 42-47 maps to cpu 78-83
> queue 48-53 maps to cpu 36-41
> queue 54-59 maps to cpu 84-89
> queue 60-65 maps to cpu 42-47
> queue 66-71 maps to cpu 90-95
> queue 72-77 maps to cpu 12-17
> queue 78-83 maps to cpu 60-65
> queue 84-89 maps to cpu 0-5
> queue 90-95 maps to cpu 48-53
What is the io_cpu for each queue?
>
>> Please let's NOT condition any of this on the wq_unbound option at this
>> point. This modparam was introduced to address a specific issue. If we
>> see IO timeouts, we should fix them, not tell people to flip a modparam
>> as a solution.
>>
> Thing is, there is no 'best' solution. The current implementation is
> actually quite good in the single subsystem case. Issues start to appear
> when doing performance testing with a really high load.
> The reason for this is high contention on the per-cpu workqueues, which
> are simply overwhelmed by doing I/O _and_ servicing 'normal' OS workload
> like writing to disk etc.
What other 'normal' workloads are you seeing in your test?
> Switching to wq_unbound reduces the contention and makes the system
> scale better, but that scaling leads to a performance regression for
> the single subsystem case.
The kthreads are the same kthreads; the only difference is concurrency
management, which may take advantage of different cpu cores but pays
a price in latency and in bouncing between cpus.
> (See my other mail for performance numbers)
> So what is 'better'?
Which email? The one to Tejun? It seems that bound is better than unbound
in all cases. You are suggesting that you regress with multiple
controllers, though. That is why I suggested that we _spread_ the queue
io_cpu assignments across controllers for the bound case (as I suggested
in your first attempt); I'd want to see what happens in that case.
Roughly along the lines of the sketch below.
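Completely untested, so take the field/helper usage with a grain of salt
(walking the default mq_map and rotating the pick by ctrl.instance is just
one way to do it, and I'm ignoring separate read/poll maps for brevity):
--
/*
 * Sketch only: pick io_cpu from the cpus that blk-mq mapped to this
 * queue, and rotate the pick by the controller instance so that several
 * controllers sharing the same mapping do not stack their io_work on
 * the same cpu.
 */
static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
{
	struct nvme_tcp_ctrl *ctrl = queue->ctrl;
	struct blk_mq_tag_set *set = ctrl->ctrl.tagset;
	unsigned int *mq_map = set->map[HCTX_TYPE_DEFAULT].mq_map;
	int qid = nvme_tcp_queue_id(queue) - 1;
	unsigned int nr_mapped = 0, pick;
	int cpu, io_cpu = WORK_CPU_UNBOUND;

	/* unbound wq or admin queue: skip the spread logic */
	if (wq_unbound || qid < 0)
		goto out;

	/* count the online cpus that blk-mq mapped to this hw queue */
	for_each_online_cpu(cpu)
		if (mq_map[cpu] == qid)
			nr_mapped++;
	if (!nr_mapped)
		goto out;

	/* rotate the choice per controller so controllers spread out */
	pick = ctrl->ctrl.instance % nr_mapped;
	for_each_online_cpu(cpu) {
		if (mq_map[cpu] != qid)
			continue;
		if (!pick--) {
			io_cpu = cpu;
			break;
		}
	}
out:
	queue->io_cpu = io_cpu;
}
--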
>
>>>
>>>>>
>>>>> And it makes the 'CPU hogged' messages go away, which is a bonus
>>>>> in itself...
>>>>
>>>> Which messages? Aren't these messages saying that the work spent
>>>> too much time? Why are you describing the case where the work does
>>>> not get cpu quota to run?
>>>
>>> I mean these messages:
>>>
>>> workqueue: nvme_tcp_io_work [nvme_tcp] hogged CPU for >10000us 32771
>>> times, consider switching to WQ_UNBOUND
>>
>> That means that we are spending too much time in io_work. This is a
>> separate bug. If you look at nvme_tcp_io_work, it has a stop condition
>> after 1 millisecond. However, when we call nvme_tcp_try_recv() it just
>> keeps receiving from the socket until the socket receive buffer has no
>> more payload. So in theory nothing prevents io_work from looping there
>> forever.
>>
> Oh, no. It's not the loop that is the problem. It's the actual sending
> which takes too long; in my test runs I've seen about 250 requests timing
> out, the majority of which were still pending on the send_list.
> So the io_work function wasn't even running to fetch the requests off
> the list.
That is unexpected. It means that either the socket buffer is full (is
it?), or that the rx is taking a long time and eating into tx cpu time.
I am not sure I understand why an unbound wq would solve either, other
than maybe hiding it in some cases.
In what workloads do you see this issue? Reads? Writes?
Can you try the following patch:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 0b04a2bde81d..3360de9ef034 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -350,7 +350,7 @@ static inline void nvme_tcp_advance_req(struct nvme_tcp_request *req,
 	}
 }
 
-static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
+static inline int nvme_tcp_send_all(struct nvme_tcp_queue *queue)
 {
 	int ret;
 
@@ -358,6 +358,7 @@ static inline void nvme_tcp_send_all(struct nvme_tcp_queue *queue)
 	do {
 		ret = nvme_tcp_try_send(queue);
 	} while (ret > 0);
+	return ret;
 }
 
 static inline bool nvme_tcp_queue_has_pending(struct nvme_tcp_queue *queue)
@@ -1276,7 +1277,7 @@ static void nvme_tcp_io_work(struct work_struct *w)
 		int result;
 
 		if (mutex_trylock(&queue->send_mutex)) {
-			result = nvme_tcp_try_send(queue);
+			result = nvme_tcp_send_all(queue);
 			mutex_unlock(&queue->send_mutex);
 			if (result > 0)
 				pending = true;
--
Just to understand if there is a starvation problem where in io_work()
the tx is paced, but the rx is not.
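If rx does turn out to be the problem, the rd_desc.count limit I mention
below would look roughly like this. Sketch only: the budget value is
arbitrary and nvme_tcp_try_recv() is paraphrased from memory, not copied
verbatim:
--
/* assumed value, would need tuning/benchmarking */
#define NVME_TCP_RECV_BUDGET	8

static int nvme_tcp_try_recv(struct nvme_tcp_queue *queue)
{
	struct socket *sock = queue->sock;
	struct sock *sk = sock->sk;
	read_descriptor_t rd_desc;
	int consumed;

	rd_desc.arg.data = queue;
	rd_desc.count = NVME_TCP_RECV_BUDGET;	/* today this is 1 */
	lock_sock(sk);
	queue->nr_cqe = 0;
	consumed = sock->ops->read_sock(sk, &rd_desc, nvme_tcp_recv_skb);
	release_sock(sk);
	return consumed;
}

/*
 * ... and nvme_tcp_recv_skb() would do a desc->count-- per skb, so that
 * tcp_read_sock() breaks out once the budget is exhausted instead of
 * draining the socket completely in one io_work invocation.
 */
--
With a budget like that, tcp_read_sock() stops once the actor has consumed
the budget, and io_work's existing 'pending' handling takes care of
re-queueing.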
>
>> This is indeed a bug that we need to address. Probably by setting
>> rd_desc.count to some limit, decrementing it for every skb that we
>> consume, and if we reach that limit and there are more skbs pending,
>> breaking and self-requeueing.
>>
>> If we indeed spend much time processing a single queue in io_work, it
>> is possible that we have a starvation problem
>> that is escalating to the timeouts you are seeing.
>>
> See above; this is the problem. Most of the requests are still stuck
> on the send_list (with some even still on the req_list) when timeouts
> occur. This means the io_work function is not being scheduled fast
> enough (or often enough) to fetch the requests from the list.
Or maybe some other root cause is creating a large backlog of send
requests.
>
> My theory here is that this is due to us using bound workqueues;
> each workqueue function has to execute on a given cpu, and we can
> only schedule one io_work function per cpu. So if that cpu is busy
> (with receiving packets, say, or normal OS tasks) we cannot execute,
> and we're seeing starvation.
The fact that you are seeing io_work run for over 10ms is the first
red light here, because it has an explicit stop condition after 1ms.
This means that a single pass through the loop is exceeding 10ms.
Something is causing that.
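To spell out what I mean by the stop condition, this is roughly how the
io_work loop paces itself (paraphrased from memory, error handling
trimmed, not a verbatim copy of the driver):
--
static void nvme_tcp_io_work(struct work_struct *w)
{
	struct nvme_tcp_queue *queue =
		container_of(w, struct nvme_tcp_queue, io_work);
	unsigned long deadline = jiffies + msecs_to_jiffies(1);

	do {
		bool pending = false;
		int result;

		if (mutex_trylock(&queue->send_mutex)) {
			result = nvme_tcp_try_send(queue);
			mutex_unlock(&queue->send_mutex);
			if (result > 0)
				pending = true;
		}

		result = nvme_tcp_try_recv(queue);
		if (result > 0)
			pending = true;

		if (!pending)
			return;

	/* ~1ms quota, checked only between passes */
	} while (!time_after(jiffies, deadline));

	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}
--
So a >10ms run means a single try_send/try_recv round blew way past the
quota, because the deadline is only evaluated between passes.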
>
> With wq_unbound we are _not_ tied to a specific cpu, but rather
> scheduled in a round-robin fashion. This avoids the starvation
> and hence the I/O timeouts do not occur.
> But we need to set the 'cpu' affinity for wq_unbound to keep
> the cache locality, otherwise the performance _really_ suffers
> as we're bouncing threads all over the place.
The unbound workqueue exists to solve a specific user issue. I think
that if we come to the conclusion that unbound workqueues are better
(i.e. we have a bug that is a result of using a bound workqueue), we
should change it and stop using bound workqueues altogether.
To be clear, I do not hold that opinion yet.
>
>>>
>>> which I get consistently during testing with the default
>>> implementation.
>>
>> Hannes, let's please separate this specific issue from the
>> performance enhancements.
>> I do not think that we should search for performance enhancements to
>> address what appears to be a logical starvation issue.
>
> I am perfectly fine with that approach. This patchset is indeed just
> to address the I/O timeout issues I've been seeing.
And the solution cannot be "use unbound workqueues - change this modparam".