[PATCH v24 01/20] net: Introduce direct data placement tcp offload

Aurelien Aptel aaptel at nvidia.com
Thu May 2 00:04:11 PDT 2024


Sagi Grimberg <sagi at grimberg.me> writes:
> Well, you cannot rely on the fact that the application will be pinned to a
> specific cpu core. That may be the case by accident, but you must not and
> cannot assume it.

Just to be clear: any CPU can read from the socket and benefit from the
offload, but there is an extra cost when the CPU consuming the data is
different from the CPU the offload was configured for. We use
cfg->io_cpu only as a hint.

> Even today, nvme-tcp has an option to run from an unbound wq context,
> where queue->io_cpu is set to WORK_CPU_UNBOUND. What are you going to
> do there?

When the queue is not bound to a specific core, we will most likely
always have CPU misalignment and pay the extra cost that comes with it.

But when it is bound, which is still the default common case, we will
benefit from the alignment. To not lose that benefit for the default
most common case, we would like to keep cfg->io_cpu.
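For reference, the bound case comes from how nvme-tcp already schedules
its io work. The sketch below is from memory, not verbatim from the
tree (the exact CPU-selection expression is simplified here):

```c
/* Sketch (from memory, simplified): how nvme-tcp picks the CPU that
 * runs queue->io_work.  With the wq_unbound module parameter set,
 * io_cpu is WORK_CPU_UNBOUND and any online CPU may run the work item;
 * otherwise the work item is pinned to queue->io_cpu.
 */
if (wq_unbound)
        queue->io_cpu = WORK_CPU_UNBOUND;
else
        queue->io_cpu = nvme_tcp_queue_id(queue) % num_online_cpus();

queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
```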

Could you clarify what the advantages are of running unbound queues, or
of handling RX on a different CPU than the current io_cpu?

> nvme-tcp may handle rx side directly from .data_ready() in the future, what
> will the offload do in that case?

It is not clear to us what benefit handling rx in .data_ready() would
bring. From our experiments, ->sk_data_ready() is called either from
queue->io_cpu or from sk->sk_incoming_cpu. Unless you enable aRFS,
sk_incoming_cpu stays constant for the whole connection. Can you
clarify what handling RX from .data_ready() would provide?
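For context, sk_incoming_cpu simply records the softirq CPU that last
processed the socket, which is why it is stable per connection unless
aRFS re-steers the flow. A sketch of the core helper (from memory,
modeled on include/net/sock.h):

```c
/* Sketch (from memory): the RX path records the CPU it runs on, so
 * without aRFS steering the stored value stays constant for the
 * lifetime of the connection.
 */
static inline void sk_incoming_cpu_update(struct sock *sk)
{
        int cpu = raw_smp_processor_id();

        if (unlikely(READ_ONCE(sk->sk_incoming_cpu) != cpu))
                WRITE_ONCE(sk->sk_incoming_cpu, cpu);
}
```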

> io_cpu may or may not mean anything. You cannot rely on it, nor dictate it.

We are just interested in optimizing the bound case, where io_cpu is
meaningful.

> > - or we remove cfg->io_cpu, and we offload the socket from
> >    nvme_tcp_io_work() where the io_cpu is implicitly going to be
> >    the current CPU.
> What do you mean offload the socket from nvme_tcp_io_work? I do not
> understand what this means.

We meant setting up the offload from the io thread instead, by calling
nvme_tcp_offload_socket() from nvme_tcp_io_work(), and making sure it's
only called once. Something like this:

+ if (queue->ctrl->ddp_netdev && !nvme_tcp_admin_queue(queue) &&
+     !test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags)) {
+         int ret = nvme_tcp_offload_socket(queue);
+
+         if (ret)
+                 pr_warn("nvme-tcp: DDP offload setup failed: %d\n", ret);
+ }



More information about the Linux-nvme mailing list