[PATCH 0/2] nvmet: support polling task for RDMA and TCP
Sagi Grimberg
sagi at grimberg.me
Thu Jul 4 01:40:33 PDT 2024
On 7/4/24 11:10, Ping Gan wrote:
>> On 02/07/2024 13:02, Ping Gan wrote:
>>>> On 01/07/2024 10:42, Ping Gan wrote:
>>>>>> Hey Ping Gan,
>>>>>>
>>>>>>
>>>>>> On 26/06/2024 11:28, Ping Gan wrote:
>>>>>>> When running nvmf on an SMP platform, the current nvme target's RDMA
>>>>>>> and TCP transports use kworkers to handle IO. But if there are other
>>>>>>> heavy workloads on the system (e.g. on kubernetes), the competition
>>>>>>> between the kworkers and those workloads is fierce. And since the
>>>>>>> kworkers are scheduled by the OS at random, it is difficult to
>>>>>>> control OS resources and to tune performance. If the target supported
>>>>>>> dedicated polling tasks to handle IO, it would be easier to control
>>>>>>> OS resources and achieve good performance. So it makes sense to add
>>>>>>> a polling task to the nvmet-rdma and nvmet-tcp modules.
>>>>>> This is NOT the way to go here.
>>>>>>
>>>>>> Both rdma and tcp are driven from workqueue context, and those are
>>>>>> bound workqueues.
>>>>>>
>>>>>> So there are two ways to go here:
>>>>>> 1. Add a generic port cpuset and use it to direct traffic to the
>>>>>> appropriate set of cores (i.e. select an appropriate comp_vector for
>>>>>> rdma and add an appropriate steering rule for tcp).
>>>>>> 2. Add options to rdma/tcp to use UNBOUND workqueues, and allow users
>>>>>> to control these UNBOUND workqueues' cpumask via sysfs (a sketch
>>>>>> follows below).
>>>>>>
>>>>>> (2) will not control interrupt steering away from the other workloads'
>>>>>> cpus, but the handlers may run on a set of dedicated cpus.
>>>>>>
>>>>>> (1) is a better solution, but harder to implement.
>>>>>>
>>>>>> You should also look into nvmet-fc (and nvmet-loop for that
>>>>>> matter).
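
For illustration, here is a minimal sketch of what option (2) could look
like on the nvmet-tcp side. The flag choices and the init-time allocation
shown here are assumptions for the example, not an actual patch:

/*
 * Sketch only: allocate the nvmet-tcp I/O workqueue as UNBOUND and
 * expose it through sysfs, so an administrator can restrict the CPUs
 * that run its work items by writing to
 * /sys/devices/virtual/workqueue/nvmet_tcp_wq/cpumask.
 */
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *nvmet_tcp_wq;

static int __init nvmet_tcp_init(void)
{
	/*
	 * WQ_UNBOUND: work items are not pinned to the submitting CPU;
	 * WQ_SYSFS:   exposes a writable cpumask attribute in sysfs.
	 */
	nvmet_tcp_wq = alloc_workqueue("nvmet_tcp_wq",
				       WQ_UNBOUND | WQ_SYSFS | WQ_HIGHPRI, 0);
	if (!nvmet_tcp_wq)
		return -ENOMEM;

	/* the real module would register the nvmet transport here */
	return 0;
}

With WQ_SYSFS the cpumask knob appears automatically, so no new configfs
plumbing would be needed for a first experiment.
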
>>>>> Hi Sagi Grimberg,
>>>>> Thanks for your reply. We actually tried the first approach you
>>>>> suggested, but we found the performance was poor when using SPDK
>>>>> as the initiator.
>>>> I suggest that you focus on that instead of what you proposed.
>>>> What is the source of your poor performance?
>>> Before these patches, we used Linux RPS to steer nvmet-tcp packets to a
>>> fixed set of CPUs. But even with that we still could not eliminate the
>>> competition between softirq and the workqueue, since the nvme target's
>>> kworker binds to the socket's CPU, which comes from the skb. Besides
>>> that, we found the workqueue's wait latency was very high even with
>>> polling enabled on nvmet-tcp via the idle_poll_period_usecs module
>>> parameter. So when the initiator is in polling mode, the target's
>>> workqueue is the bottleneck. Below is the work-item wait-latency trace
>>> from a test on our cluster (each node has 4 NUMA nodes, 96 cores, 192 GB
>>> of memory and one dual-port Mellanox CX4-LX (2 x 25 Gbps) Ethernet
>>> adapter; randrw with 1 MB IO size), with RPS steering to 6 CPU cores.
>>> System CPU and memory utilization were about 80%.
>> I'd try a simple unbound case: steer packets to, say, cores [0-5] and
>> assign the cpumask of the unbound workqueue to cores [6-11].
> Okay, thanks for the guidance.
>
>>> ogden-brown:~ #/usr/share/bcc/tools/wqlat -T -w nvmet_tcp_wq 1 2
>>> 01:06:59
>>> usecs : count distribution
>>> 0 -> 1 : 0 | |
>>> 2 -> 3 : 0 | |
>>> 4 -> 7 : 0 | |
>>> 8 -> 15 : 3 | |
>>> 16 -> 31 : 10 | |
>>> 32 -> 63 : 3 | |
>>> 64 -> 127 : 2 | |
>>> 128 -> 255 : 0 | |
>>> 256 -> 511 : 5 | |
>>> 512 -> 1023 : 12 | |
>>> 1024 -> 2047 : 26 |* |
>>> 2048 -> 4095 : 34 |* |
>>> 4096 -> 8191 : 350 |************ |
>>> 8192 -> 16383 : 625 |******************************|
>>> 16384 -> 32767 : 244 |********* |
>>> 32768 -> 65535 : 39 |* |
>>>
>>> 01:07:00
>>> usecs : count distribution
>>> 0 -> 1 : 1 | |
>>> 2 -> 3 : 0 | |
>>> 4 -> 7 : 4 | |
>>> 8 -> 15 : 3 | |
>>> 16 -> 31 : 8 | |
>>> 32 -> 63 : 10 | |
>>> 64 -> 127 : 3 | |
>>> 128 -> 255 : 6 | |
>>> 256 -> 511 : 8 | |
>>> 512 -> 1023 : 20 |* |
>>> 1024 -> 2047 : 19 |* |
>>> 2048 -> 4095 : 57 |** |
>>> 4096 -> 8191 : 325 |**************** |
>>> 8192 -> 16383 : 647 |******************************|
>>> 16384 -> 32767 : 228 |*********** |
>>> 32768 -> 65535 : 43 |** |
>>> 65536 -> 131071 : 1 | |
>>>
>>> And the bandwidth per node was only 3100 MB/s. When we used the patch
>>> and enabled 6 polling tasks, the bandwidth reached 4000 MB/s. That is a
>>> good improvement.
>> I think you will see similar performance with an unbound workqueue and
>> RPS.
> Yes, I modified the nvmet-tcp/nvmet-rdma code to support an unbound
> workqueue, ran the test under the same conditions as above, and compared
> the unbound workqueue against the polling-mode task. The unbound
> workqueue performed well: for TCP we got 3850 MB/s per node, which is
> almost equal to the polling task. We also tested nvmet-rdma and got
> 5100 MB/s per node with the unbound workqueue versus 5600 MB/s with the
> polling task, so the difference is very small. Anyway, your advice was
> good.
I'm a bit surprised that you see a ~10% delta here. I would look into the
root cause of this difference. If the load is indeed high, the overhead of
the workqueue management should be negligible. I'm assuming you used
IB_POLL_UNBOUND_WORKQUEUE?
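
For reference, the choice is just the poll context passed at CQ allocation
time. A hedged sketch, where everything except the ib_alloc_cq() API and
the IB_POLL_* constants is an illustrative name:

#include <rdma/ib_verbs.h>

/*
 * Sketch: with IB_POLL_UNBOUND_WORKQUEUE the CQ completions are
 * processed on the RDMA core's unbound completion workqueue instead of
 * a CPU-bound one, so their placement can be left to the scheduler.
 */
static struct ib_cq *alloc_queue_cq(struct ib_device *ibdev, void *queue,
				    int nr_cqe, int comp_vector, bool unbound)
{
	enum ib_poll_context ctx = unbound ? IB_POLL_UNBOUND_WORKQUEUE :
					     IB_POLL_WORKQUEUE;

	return ib_alloc_cq(ibdev, queue, nr_cqe, comp_vector, ctx);
}
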
> Do you think we should submit the unbound workqueue patches for
> nvmet-tcp and nvmet-rdma to upstream nvmet?
For nvmet-tcp, I think there is merit in splitting socket processing from
the napi context. For nvmet-rdma, I think the only difference is whether
you have multiple CQs assigned to the same comp_vector. How many queues do
you have in your test?
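
To illustrate the CQ/comp_vector point: if several queues share one
completion vector, all of their completion processing lands on the same
bound context, whereas spreading them looks roughly like the sketch below
(the modulo policy is illustrative, not the actual nvmet-rdma code):

#include <rdma/ib_verbs.h>

/*
 * Sketch: spread queues across the device's completion vectors so that
 * their CQs are not all serviced by the same vector (and hence by the
 * same bound completion context). Illustrative policy only.
 */
static int pick_comp_vector(struct ib_device *ibdev, int queue_idx)
{
	return queue_idx % ibdev->num_comp_vectors;
}
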
> BTW, I have another question: does upstream nvmet plan to support
> polling queues when doing submit_bio in the future?
No plans that I know of. Don't have a coherent idea of how that would work.