[PATCH 0/2] nvmet: support polling task for RDMA and TCP

Wed Jul 3 12:58:23 PDT 2024

On 02/07/2024 13:02, Ping Gan wrote:
>> On 01/07/2024 10:42, Ping Gan wrote:
>>>> Hey Ping Gan,
>>>>
>>>>
>>>> On 26/06/2024 11:28, Ping Gan wrote:
>>>>> When running nvmf on SMP platform, current nvme target's RDMA and
>>>>> TCP use kworker to handle IO. But if there is other high workload
>>>>> in the system(eg: on kubernetes), the competition between the
>>>>> kworker and other workload is very radical. And since the kworker
>>>>> is scheduled by OS randomly, it's difficult to control OS resource
>>>>> and also tune the performance. If target support to use delicated
>>>>> polling task to handle IO, it's useful to control OS resource and
>>>>> gain good performance. So it makes sense to add polling task in
>>>>> rdma-rdma and rdma-tcp modules.
>>>> This is NOT the way to go here.
>>>>
>>>> Both rdma and tcp are driven from workqueue context, which are bound
>>>> workqueues.
>>>>
>>>> So there are two ways to go here:
>>>> 1. Add generic port cpuset and use that to direct traffic to the
>>>> appropriate set of cores
>>>> (i.e. select an appropriate comp_vector for rdma and add an
>>>> appropriate
>>>> steering rule
>>>> for tcp).
>>>> 2. Add options to rdma/tcp to use UNBOUND workqueues, and allow
>>>> users
>>>> to
>>>> control
>>>> these UNBOUND workqueues cpumask via sysfs.
>>>>
>>>> (2) will not control interrupts to steer to other workloads cpus,
>>>> but
>>>> the handlers may
>>>> run on a set of dedicated cpus.
>>>>
>>>> (1) is a better solution, but harder to implement.
>>>>
>>>> You also should look into nvmet-fc as well (and nvmet-loop for that
>>>> matter).
>>> hi Sagi Grimberg,
>>> Thanks for your reply, actually we had tried the first advice you
>>> suggested, but we found the performance was poor when using spdk
>>> as initiator.
>> I suggest that you focus on that instead of what you proposed.
>> What is the source of your poor performance?
> Before these patches, we had used linux's RPS to forward the packets
> to a fixed cpu set for nvmet-tcp. But when did that we can still not
> cancel the competition between softirq and workqueue since nvme target's
> kworker cpu core bind on socket's cpu which is from skb. Besides that
> we found workqueue's wait latency was very high even we enabled polling
> on nvmet-tcp by module parameter idle_poll_period_usecs. So when
> initiator
> is polling mode, the target of workqueue is the bottleneck. Below is
> work's wait latency trace log of our test on our cluster(per node uses
> 4 numas 96 cores, 192G memory, one dual ports mellanox CX4LX(25Gbps X 2)
> ethernet adapter and randrw 1M IO size) by RPS to 6 cpu cores. And
> system's CPU and memory were used about 80%.

I'd try a simple unbound CPU case, steer packets to say cores [0-5] and 
assign
the cpumask of the unbound workqueue to cores [6-11].

> ogden-brown:~ #/usr/share/bcc/tools/wqlat -T -w nvmet_tcp_wq 1 2
> 01:06:59
>       usecs               : count     distribution
>           0 -> 1          : 0        |                              |
>           2 -> 3          : 0        |                              |
>           4 -> 7          : 0        |                              |
>           8 -> 15         : 3        |                              |
>          16 -> 31         : 10       |                              |
>          32 -> 63         : 3        |                              |
>          64 -> 127        : 2        |                              |
>         128 -> 255        : 0        |                              |
>         256 -> 511        : 5        |                              |
>         512 -> 1023       : 12       |                              |
>        1024 -> 2047       : 26       |*                             |
>        2048 -> 4095       : 34       |*                             |
>        4096 -> 8191       : 350      |************                  |
>        8192 -> 16383      : 625      |******************************|
>       16384 -> 32767      : 244      |*********                     |
>       32768 -> 65535      : 39       |*                             |
>
> 01:07:00
>       usecs               : count     distribution
>           0 -> 1          : 1        |                              |
>           2 -> 3          : 0        |                              |
>           4 -> 7          : 4        |                              |
>           8 -> 15         : 3        |                              |
>          16 -> 31         : 8        |                              |
>          32 -> 63         : 10       |                              |
>          64 -> 127        : 3        |                              |
>         128 -> 255        : 6        |                              |
>         256 -> 511        : 8        |                              |
>         512 -> 1023       : 20       |*                             |
>        1024 -> 2047       : 19       |*                             |
>        2048 -> 4095       : 57       |**                            |
>        4096 -> 8191       : 325      |****************              |
>        8192 -> 16383      : 647      |******************************|
>       16384 -> 32767      : 228      |***********                   |
>       32768 -> 65535      : 43       |**                            |
>       65536 -> 131071     : 1        |                              |
>
> And the bandwidth of a node is only 3100MB. While we used the patch
> and enable 6 polling task, the bandwidth can be 4000MB. It's a good
> improvement.

I think you will see similar performance with unbound workqueue and rps.

>
>>>    You know this patch is not only resolving OS resource
>>> competition issue, but also the perf issue. We have analyzed if we
>>> still use workqueue(kworker) as target when initiator is polling
>>> driver(eg: spdk), then workqueue/kworker target is the bottleneck
>>> since every nvmf request may have a wait latency from queuing on
>>> workqueue to begin processing,
>> That is incorrect, the work context polls the cq until it either drains
>> it
>> completely, or exhaust a quota of IB_POLL_BUDGET_WORKQUEUE (or
>> NVMET_TCP_IO_WORK_BUDGET). Not every command gets its own workqueue
>> queuing delay.
>>
>> And, what does the spdk initiator has to do with it? Didn't
>> understand...
> Yes, target workqueue implementation will poll a quota; but when the
> work
> load was high we found many work will wait too long(some of them at
> several
> ms to hundred ms shown above histogram). We use the spdk initiator(by
> polling mode) to send IO's read/write to nvme disks of a kubernetes
> cluster's remote node.

The initiator is an orthogonal detail here. the same issue exists 
regardless of
spdk afaiu. Let's ignore it, its confusing.

>
>>>    and the latency can be traced by wqlat
>>> of bcc (https://github.com/iovisor/bcc/blob/master/tools/wqlat.py).
>>> We think the latency is a disaster for the polling driver data plane,
>>> right?
>> If you need a target that polls all the time, you should probably
>> resort
>> to spdk.
>> If there is room for optimization in nvmet we'll gladly take it, but
>> this is not the
>> way to go IMO.
> Yes, in the begining we did use the spdk as polling target driver,
> but we suffered from spdk target could not support disk hot plug/unplug
> well, sometimes it will cause data loss when did disk hot plug/unplug.
> So we switch to kernel target driver because in production customer's
> data security is first priority. And for kernel's target it has no
> polling mode target driver, so we implemented these patches.

Well, its a hard sell for upstream nvmet.

>
>>> So we think adding a polling task mode on nvmet side to handle
>>> IO does really make sense; what's your opinion about this?
>> I personally think that adding a polling kthread is questionable.
>> However there is a precedent, io_uring sqthreads. So please look
>> into what is done there. I don't mind having something like
>> IB_POLL_IOTASK (or
>> io_task threads in nvmet-tcp) if its done correctly (leverages common
>> code).
> Yes, we have studied io_uring's code before implementing the patches.
> Actually we followed io_uring's design idea in these patches.

I'm talking about reusing what io_uring sqpoll tasks. Move them to common
code, generalizing it to address what you need, and reuse that. 
Implementing a
half-baked inspired version in nvmet is not going to fly here. Sorry.