[PATCH 0/2] nvmet: support polling task for RDMA and TCP
Sagi Grimberg
sagi at grimberg.me
Thu Jul 4 22:59:24 PDT 2024
On 7/4/24 13:35, Ping Gan wrote:
>> On 7/4/24 11:10, Ping Gan wrote:
>>>> On 02/07/2024 13:02, Ping Gan wrote:
>>>>>> On 01/07/2024 10:42, Ping Gan wrote:
>>>>>>>> Hey Ping Gan,
>>>>>>>>
>>>>>>>>
>>>>>>>> On 26/06/2024 11:28, Ping Gan wrote:
>>>>>>>>> When running nvmf on an SMP platform, the current nvme target's
>>>>>>>>> RDMA and TCP transports use kworkers to handle IO. But if there is
>>>>>>>>> other heavy workload on the system (e.g. on kubernetes), the
>>>>>>>>> competition between the kworkers and that workload is fierce. And
>>>>>>>>> since the kworkers are scheduled by the OS randomly, it is difficult
>>>>>>>>> to control OS resources and to tune performance. If the target
>>>>>>>>> supported a dedicated polling task to handle IO, it would be easier
>>>>>>>>> to control OS resources and to get good performance. So it makes
>>>>>>>>> sense to add a polling task to the nvmet-rdma and nvmet-tcp modules.
>>>>>>>> This is NOT the way to go here.
>>>>>>>>
>>>>>>>> Both rdma and tcp are driven from workqueue context, and those are
>>>>>>>> bound workqueues.
>>>>>>>>
>>>>>>>> So there are two ways to go here:
>>>>>>>> 1. Add a generic port cpuset and use that to direct traffic to the
>>>>>>>> appropriate set of cores (i.e. select an appropriate comp_vector for
>>>>>>>> rdma and add an appropriate steering rule for tcp).
>>>>>>>> 2. Add options to rdma/tcp to use UNBOUND workqueues, and allow users
>>>>>>>> to control these UNBOUND workqueues' cpumask via sysfs.
>>>>>>>>
>>>>>>>> (2) will not control where the interrupts are steered, so they may
>>>>>>>> still land on the other workloads' cpus, but the handlers may run on
>>>>>>>> a set of dedicated cpus.
>>>>>>>>
>>>>>>>> (1) is a better solution, but harder to implement.
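
To illustrate what (2) could look like, here is a rough sketch for nvmet-tcp,
assuming the existing nvmet_tcp_wq allocation stays roughly as it is today;
the unbound_wq module parameter below is only an illustrative name, not an
existing knob:

    /* rough sketch only -- not the actual nvmet-tcp code */
    #include <linux/module.h>
    #include <linux/workqueue.h>

    static struct workqueue_struct *nvmet_tcp_wq;

    /* hypothetical knob; the name is illustrative */
    static bool unbound_wq;
    module_param(unbound_wq, bool, 0444);
    MODULE_PARM_DESC(unbound_wq,
            "Process IO from an unbound workqueue (cpumask tunable via sysfs)");

    static int __init nvmet_tcp_init(void)
    {
            unsigned int flags = WQ_HIGHPRI;

            /*
             * WQ_UNBOUND lets work items run on any allowed CPU instead of
             * the CPU that queued them; WQ_SYSFS exposes
             * /sys/devices/virtual/workqueue/nvmet_tcp_wq/cpumask so an
             * admin can confine IO processing to a dedicated core set.
             */
            if (unbound_wq)
                    flags |= WQ_UNBOUND | WQ_SYSFS;

            nvmet_tcp_wq = alloc_workqueue("nvmet_tcp_wq", flags, 0);
            if (!nvmet_tcp_wq)
                    return -ENOMEM;

            /* transport registration etc. omitted */
            return 0;
    }

The cpumask attribute only appears for unbound workqueues, which is why
WQ_SYSFS is added together with WQ_UNBOUND in this sketch.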
>>>>>>>>
>>>>>>>> You should also look into nvmet-fc (and nvmet-loop, for that
>>>>>>>> matter).
>>>>>>> hi Sagi Grimberg,
>>>>>>> Thanks for your reply. Actually we had tried the first approach you
>>>>>>> suggested, but we found the performance was poor when using spdk as
>>>>>>> the initiator.
>>>>>> I suggest that you focus on that instead of what you proposed.
>>>>>> What is the source of your poor performance?
>>>>> Before these patches, we had used Linux's RPS to forward the packets
>>>>> to a fixed cpu set for nvmet-tcp. But even then we could not eliminate
>>>>> the competition between softirq and the workqueue, since the nvme
>>>>> target's kworker binds to the socket's cpu, which comes from the skb.
>>>>> Besides that, we found the workqueue's wait latency was very high even
>>>>> with polling enabled on nvmet-tcp via the idle_poll_period_usecs module
>>>>> parameter. So when the initiator is in polling mode, the target's
>>>>> workqueue is the bottleneck. Below is the work wait latency trace of
>>>>> our test on our cluster (per node: 4 NUMA nodes, 96 cores, 192G memory,
>>>>> one dual-port Mellanox CX4LX (25Gbps x 2) ethernet adapter, randrw with
>>>>> 1M IO size), with RPS steering to 6 cpu cores. The system's CPU and
>>>>> memory were about 80% used.
>>>> I'd try a simple unbound case: steer packets to, say, cores [0-5] and
>>>> assign the cpumask of the unbound workqueue to cores [6-11].
>>> Okay, thanks for your guide.
>>>
>>>>> ogden-brown:~ #/usr/share/bcc/tools/wqlat -T -w nvmet_tcp_wq 1 2
>>>>> 01:06:59
>>>>> usecs : count distribution
>>>>> 0 -> 1 : 0 | |
>>>>> 2 -> 3 : 0 | |
>>>>> 4 -> 7 : 0 | |
>>>>> 8 -> 15 : 3 | |
>>>>> 16 -> 31 : 10 | |
>>>>> 32 -> 63 : 3 | |
>>>>> 64 -> 127 : 2 | |
>>>>> 128 -> 255 : 0 | |
>>>>> 256 -> 511 : 5 | |
>>>>> 512 -> 1023 : 12 | |
>>>>> 1024 -> 2047 : 26 |* |
>>>>> 2048 -> 4095 : 34 |* |
>>>>> 4096 -> 8191 : 350 |************ |
>>>>> 8192 -> 16383 : 625 |******************************|
>>>>> 16384 -> 32767 : 244 |********* |
>>>>> 32768 -> 65535 : 39 |* |
>>>>>
>>>>> 01:07:00
>>>>> usecs : count distribution
>>>>> 0 -> 1 : 1 | |
>>>>> 2 -> 3 : 0 | |
>>>>> 4 -> 7 : 4 | |
>>>>> 8 -> 15 : 3 | |
>>>>> 16 -> 31 : 8 | |
>>>>> 32 -> 63 : 10 | |
>>>>> 64 -> 127 : 3 | |
>>>>> 128 -> 255 : 6 | |
>>>>> 256 -> 511 : 8 | |
>>>>> 512 -> 1023 : 20 |* |
>>>>> 1024 -> 2047 : 19 |* |
>>>>> 2048 -> 4095 : 57 |** |
>>>>> 4096 -> 8191 : 325 |**************** |
>>>>> 8192 -> 16383 : 647 |******************************|
>>>>> 16384 -> 32767 : 228 |*********** |
>>>>> 32768 -> 65535 : 43 |** |
>>>>> 65536 -> 131071 : 1 | |
>>>>>
>>>>> And the bandwidth of a node was only 3100MB/s. When we used the patch
>>>>> and enabled 6 polling tasks, the bandwidth reached 4000MB/s. It's a
>>>>> good improvement.
>>>> I think you will see similar performance with unbound workqueue and
>>>> rps.
>>> Yes, I modified the nvmet-tcp/nvmet-rdma code to support an unbound
>>> workqueue, ran the test under the same conditions as above, and compared
>>> the results of the unbound workqueue and the polling-mode task. The
>>> unbound workqueue performed well. For unbound-workqueue TCP we got
>>> 3850M/node, almost equal to the polling task. We also tested nvmet-rdma
>>> and got 5100M/node for unbound-workqueue RDMA versus 5600M for the
>>> polling task, so the difference is quite small. Anyway, your advice is
>>> good.
>> I'm a bit surprised that you see a ~10% delta here. I would look into the
>> root cause of this difference. If the load is indeed high, the overhead of
>> the workqueue management should be negligible. I'm assuming you used
>> IB_POLL_UNBOUND_WORKQUEUE?
> Yes, we used IB_POLL_UNBOUND_WORKQUEUE to create the ib CQ. And I observed
> 3% CPU usage for the unbound workqueue versus 6% for the polling task.
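
For reference, the poll-context choice in question amounts to something like
the sketch below; the nvmet_rdma_alloc_cq() wrapper and its 'unbound' flag
are only illustrative, not the actual nvmet-rdma code:

    #include <rdma/ib_verbs.h>

    /* illustrative wrapper, not the actual nvmet-rdma code */
    static struct ib_cq *nvmet_rdma_alloc_cq(struct ib_device *dev, void *priv,
                    int nr_cqe, int comp_vector, bool unbound)
    {
            /*
             * IB_POLL_WORKQUEUE polls the CQ from a bound workqueue, i.e.
             * typically on the CPU that took the completion interrupt;
             * IB_POLL_UNBOUND_WORKQUEUE hands polling to the RDMA core's
             * unbound completion workqueue, so it is not pinned to the
             * interrupted CPU.
             */
            enum ib_poll_context ctx = unbound ?
                    IB_POLL_UNBOUND_WORKQUEUE : IB_POLL_WORKQUEUE;

            return ib_alloc_cq(dev, priv, nr_cqe, comp_vector, ctx);
    }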
>
>>> Do you think we should submit the unbound workqueue patches for
>>> nvmet-tcp and nvmet-rdma to upstream nvmet?
>> For nvmet-tcp, I think there is merit in splitting socket processing from
>> the napi context. For nvmet-rdma, I think the only difference is whether
>> you have multiple CQs assigned to the same comp_vector.
>>
>> How many queues do you have in your test?
> We used 24 IO queues for the nvmet-rdma target. I think this may also be
> related to the workqueue's wait latency. We still see several-ms wait
> latencies for the unbound workqueue with RDMA; see the trace log below.
What is the queue size of each? What rdma device are you using?
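
To clarify the comp_vector point above, a per-queue comp_vector is typically
chosen by spreading the queues over the device's completion vectors, roughly
like the illustrative helper below (not the actual nvmet-rdma code):

    #include <rdma/ib_verbs.h>

    /*
     * Illustrative helper: spread queues round-robin over the device's
     * completion vectors. Once the queue count exceeds num_comp_vectors,
     * several CQs share a vector, and therefore the same completion
     * interrupt/CPU.
     */
    static int pick_comp_vector(struct ib_device *dev, u16 qid)
    {
            return qid % dev->num_comp_vectors;
    }

With 24 IO queues on a device exposing, say, 8 completion vectors, three CQs
would end up on each vector.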