[PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping

Sagi Grimberg sagi at grimberg.me
Wed Jun 19 08:43:57 PDT 2024



On 19/06/2024 18:23, Hannes Reinecke wrote:
> On 6/19/24 16:59, Christoph Hellwig wrote:
>>>       if (wq_unbound)
>>>           queue->io_cpu = WORK_CPU_UNBOUND;
>>
>> None of the above computed information is even used for the wq_unbound
>> code.  This would make a lot more sense if the above assignment was in
>> the (only) caller.
>>
>> Yes, that probably should have been done when merging the wq_unbound
>> option (which honestly looks so whacky that I wish it wasn't merged).
>>
> Ah, you noticed?
>
> This patch was actually sparked off by one of our partners reporting
> a severe latency increase proportional to the number of controllers,
> with a marked increase when (sw) TLS is active; they even managed to
> run into command timeouts.
>
> What happens is that we shove all work from identical queue numbers
> onto the same CPU; so if your controller only exports 4 queues, _all_
> work from qid 1 (from all controllers!) is pushed onto CPU 0, causing
> a massive oversubscription on that CPU and leaving all other CPUs in
> the system idle.
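
For reference, the pre-patch io_cpu selection is keyed only on the queue
id (plus the per-type queue counts), so qid N of every controller
resolves to the same online CPU. A paraphrased sketch of that selection
(not the exact driver code):
--
/*
 * Nothing here is per-controller: the same qid always yields the same
 * io_cpu, so qid 1 of every controller queues its io_work on one CPU.
 */
static void nvme_tcp_set_queue_io_cpu_sketch(struct nvme_tcp_queue *queue)
{
	int n = nvme_tcp_queue_id(queue) - 1;	/* derived from qid only */

	queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
}
--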

I see how your patch addresses the case of multiple controllers falling
into the same mappings.
You could have selected a different mq_map entry for each controller
(out of the entries that map to the qid).
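
A purely illustrative helper for that idea (the name, signature and the
global round-robin counter are made up here, not taken from your patch):
walk the CPUs whose blk-mq map entry points at the hw queue behind this
qid, and let each controller start at a different offset in that set:
--
/* illustrative only: bumped once per created queue, spreads controllers */
static atomic_t nvme_tcp_cpu_rr = ATOMIC_INIT(0);

static int nvme_tcp_pick_io_cpu(struct blk_mq_queue_map *map, int qid)
{
	unsigned int nr = 0, skip, cpu;

	/* count the online CPUs that blk-mq mapped to hw queue qid - 1 */
	for_each_online_cpu(cpu)
		if (map->mq_map[cpu] == qid - 1)
			nr++;
	if (!nr)
		return WORK_CPU_UNBOUND;

	/* controllers sharing the qid land on different candidates */
	skip = atomic_fetch_inc(&nvme_tcp_cpu_rr) % nr;
	for_each_online_cpu(cpu)
		if (map->mq_map[cpu] == qid - 1 && !skip--)
			return cpu;
	return WORK_CPU_UNBOUND;
}
--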

>
> Not sure how wq_unbound helps in this case; in theory the workqueue
> items can be pushed to arbitrary CPUs, but that only leads to even
> worse thread bouncing.
>
> However, topic for ALPSS. We really should have some sort of
> backpressure here.

I have a patch that has been sitting around for some time now to make
the RX path run directly from softirq, which should make RX execute on
the CPU core mapped to the RSS hash.
Perhaps you or your customer can give it a go.

disclaimer, lightly tested:
--
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index b6ea7e337eb8..02076b8cb4d8 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -44,6 +44,13 @@ static bool wq_unbound;
  module_param(wq_unbound, bool, 0644);
  MODULE_PARM_DESC(wq_unbound, "Use unbound workqueue for nvme-tcp IO context (default false)");

+/*
+ * RX context runs in softirq
+ */
+static bool softirq_rx;
+module_param(softirq_rx, bool, 0644);
+MODULE_PARM_DESC(softirq_rx, "nvme-tcp RX context in softirq");
+
  /*
   * TLS handshake timeout
   */
@@ -976,8 +983,13 @@ static void nvme_tcp_data_ready(struct sock *sk)
         read_lock_bh(&sk->sk_callback_lock);
         queue = sk->sk_user_data;
         if (likely(queue && queue->rd_enabled) &&
-           !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
-               queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
+           !test_bit(NVME_TCP_Q_POLLING, &queue->flags)) {
+               if (softirq_rx)
+                       nvme_tcp_try_recv_locked(queue);
+               else
+                       queue_work_on(queue->io_cpu, nvme_tcp_wq,
+                               &queue->io_work);
+       }
         read_unlock_bh(&sk->sk_callback_lock);
  }

@@ -1291,11 +1303,13 @@ static void nvme_tcp_io_work(struct work_struct *w)
                                 break;
                 }

-               result = nvme_tcp_try_recv(queue);
-               if (result > 0)
-                       pending = true;
-               else if (unlikely(result < 0))
-                       return;
+               if (!softirq_rx) {
+                       result = nvme_tcp_try_recv(queue);
+                       if (result > 0)
+                               pending = true;
+                       else if (unlikely(result < 0))
+                               return;
+               }

                 if (!pending || !queue->rd_enabled)
                         return;
--
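
If you want to give it a spin: the knob can be set at load time with
modprobe nvme_tcp softirq_rx=1, or (since it is 0644) flipped via
/sys/module/nvme_tcp/parameters/softirq_rx before connecting the queues.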

Note, I don't yet know if this behavior should be made conditional
behind a parameter or simply replace the existing behavior.

Note that it depends on a patch I've sent (and will need to send a v2 for):

[PATCH] net: micro-optimize skb_datagram_iter
https://lore.kernel.org/netdev/20240617095852.66c96be9@kernel.org/T/



