[PATCH 0/2] nvmet: support polling task for RDMA and TCP

Sagi Grimberg sagi at grimberg.me
Thu Jul 4 01:40:33 PDT 2024



On 7/4/24 11:10, Ping Gan wrote:
>> On 02/07/2024 13:02, Ping Gan wrote:
>>>> On 01/07/2024 10:42, Ping Gan wrote:
>>>>>> Hey Ping Gan,
>>>>>>
>>>>>>
>>>>>> On 26/06/2024 11:28, Ping Gan wrote:
>>>>>>> When running nvmf on an SMP platform, the current NVMe target's RDMA
>>>>>>> and TCP transports use kworkers to handle IO. But if there is another
>>>>>>> heavy workload on the system (e.g. on kubernetes), the competition
>>>>>>> between the kworkers and that workload is fierce. And since kworkers
>>>>>>> are scheduled by the OS more or less arbitrarily, it is difficult to
>>>>>>> control OS resources and to tune performance. If the target supported
>>>>>>> dedicated polling tasks to handle IO, it would be easier to control OS
>>>>>>> resources and to get good performance. So it makes sense to add a
>>>>>>> polling task to the nvmet-rdma and nvmet-tcp modules.
>>>>>> This is NOT the way to go here.
>>>>>>
>>>>>> Both rdma and tcp are driven from workqueue context, and those are
>>>>>> bound workqueues.
>>>>>>
>>>>>> So there are two ways to go here:
>>>>>> 1. Add a generic port cpuset and use it to direct traffic to the
>>>>>>    appropriate set of cores (i.e. select an appropriate comp_vector
>>>>>>    for rdma and add an appropriate steering rule for tcp).
>>>>>> 2. Add options to rdma/tcp to use UNBOUND workqueues, and allow users
>>>>>>    to control these UNBOUND workqueues' cpumasks via sysfs.
>>>>>>
>>>>>> (2) will not steer interrupts away from the other workload's cpus, but
>>>>>> the handlers can run on a set of dedicated cpus.
>>>>>>
>>>>>> (1) is the better solution, but harder to implement.
>>>>>>
>>>>>> You should look into nvmet-fc as well (and nvmet-loop for that
>>>>>> matter).
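
As a rough illustration of option (2), here is a minimal sketch (not the
actual nvmet-tcp code, just the general shape, reusing the existing
nvmet_tcp_wq name for concreteness) of allocating the IO workqueue as UNBOUND
and exposing it in sysfs so its cpumask can be changed at runtime:

/*
 * Sketch only: an UNBOUND, sysfs-visible IO workqueue. With WQ_SYSFS the
 * cpumask shows up under /sys/devices/virtual/workqueue/nvmet_tcp_wq/cpumask
 * and can be restricted to a dedicated set of cores.
 */
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *nvmet_tcp_wq;

static int __init wq_sketch_init(void)
{
	nvmet_tcp_wq = alloc_workqueue("nvmet_tcp_wq",
				       WQ_UNBOUND | WQ_HIGHPRI | WQ_SYSFS, 0);
	return nvmet_tcp_wq ? 0 : -ENOMEM;
}

static void __exit wq_sketch_exit(void)
{
	destroy_workqueue(nvmet_tcp_wq);
}

module_init(wq_sketch_init);
module_exit(wq_sketch_exit);
MODULE_LICENSE("GPL");

An administrator could then, for example, keep steering packets to one set of
cores while writing a different mask to that cpumask file so the work handlers
run on another set.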
>>>>> hi Sagi Grimberg,
>>>>> Thanks for your reply. We had actually tried the first approach you
>>>>> suggested, but we found the performance was poor when using SPDK as
>>>>> the initiator.
>>>> I suggest that you focus on that instead of what you proposed.
>>>> What is the source of your poor performance?
>>> Before these patches, we had used Linux's RPS to steer packets to a fixed
>>> cpu set for nvmet-tcp. But even when we did that, we could not eliminate
>>> the competition between softirq and the workqueue, since the nvme target's
>>> kworker is bound to the socket's cpu, which comes from the skb. Besides
>>> that, we found the workqueue's wait latency was very high even when we
>>> enabled polling in nvmet-tcp via the module parameter
>>> idle_poll_period_usecs. So when the initiator is in polling mode, the
>>> target's workqueue is the bottleneck. Below is the work-item wait latency
>>> trace from our test on our cluster (each node has 4 NUMA nodes, 96 cores,
>>> 192G memory, one dual-port Mellanox CX4LX (2 x 25Gbps) ethernet adapter;
>>> randrw with 1M IO size), with RPS steering to 6 cpu cores. The system's
>>> CPU and memory utilization was about 80%.
>> I'd try a simple unbound case: steer packets to, say, cores [0-5] and
>> assign the cpumask of the unbound workqueue to cores [6-11].
> Okay, thanks for the guidance.
>
>>> ogden-brown:~ #/usr/share/bcc/tools/wqlat -T -w nvmet_tcp_wq 1 2
>>> 01:06:59
>>>        usecs               : count     distribution
>>>            0 -> 1          : 0        |                              |
>>>            2 -> 3          : 0        |                              |
>>>            4 -> 7          : 0        |                              |
>>>            8 -> 15         : 3        |                              |
>>>           16 -> 31         : 10       |                              |
>>>           32 -> 63         : 3        |                              |
>>>           64 -> 127        : 2        |                              |
>>>          128 -> 255        : 0        |                              |
>>>          256 -> 511        : 5        |                              |
>>>          512 -> 1023       : 12       |                              |
>>>         1024 -> 2047       : 26       |*                             |
>>>         2048 -> 4095       : 34       |*                             |
>>>         4096 -> 8191       : 350      |************                  |
>>>         8192 -> 16383      : 625      |******************************|
>>>        16384 -> 32767      : 244      |*********                     |
>>>        32768 -> 65535      : 39       |*                             |
>>>
>>> 01:07:00
>>>        usecs               : count     distribution
>>>            0 -> 1          : 1        |                              |
>>>            2 -> 3          : 0        |                              |
>>>            4 -> 7          : 4        |                              |
>>>            8 -> 15         : 3        |                              |
>>>           16 -> 31         : 8        |                              |
>>>           32 -> 63         : 10       |                              |
>>>           64 -> 127        : 3        |                              |
>>>          128 -> 255        : 6        |                              |
>>>          256 -> 511        : 8        |                              |
>>>          512 -> 1023       : 20       |*                             |
>>>         1024 -> 2047       : 19       |*                             |
>>>         2048 -> 4095       : 57       |**                            |
>>>         4096 -> 8191       : 325      |****************              |
>>>         8192 -> 16383      : 647      |******************************|
>>>        16384 -> 32767      : 228      |***********                   |
>>>        32768 -> 65535      : 43       |**                            |
>>>        65536 -> 131071     : 1        |                              |
>>>
>>> And the bandwidth per node is only 3100MB/s. When we used the patch and
>>> enabled 6 polling tasks, the bandwidth reached 4000MB/s, which is a good
>>> improvement.
>> I think you will see similar performance with an unbound workqueue and RPS.
> Yes, I modified the nvmet-tcp/nvmet-rdma code to support unbound workqueues,
> ran the test under the same conditions as above, and compared the unbound
> workqueue against the polling-mode task. The unbound workqueue performed
> well: for TCP we got 3850MB/s per node, almost equal to the polling task.
> We also tested nvmet-rdma and got 5100MB/s per node for the unbound
> workqueue versus 5600MB/s for the polling task, so the difference seems
> small. Anyway, your advice is good.

I'm a bit surprised that you see a ~10% delta here. I would look into the
root cause of this difference. If the load is indeed high, the overhead of
the workqueue management should be negligible. I'm assuming you used
IB_POLL_UNBOUND_WORKQUEUE?
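
For reference, the distinction is only in the poll context passed when the
completion queue is allocated; a minimal sketch of generic ib_alloc_cq()
usage (not the exact nvmet-rdma call site):

#include <rdma/ib_verbs.h>

/*
 * Sketch: with IB_POLL_UNBOUND_WORKQUEUE the CQ is processed from an
 * unbound workqueue instead of a workqueue bound to the CPU that the
 * completion vector's interrupt is affined to.
 */
static struct ib_cq *alloc_io_cq(struct ib_device *dev, void *priv,
				 int nr_cqe, int comp_vector)
{
	return ib_alloc_cq(dev, priv, nr_cqe, comp_vector,
			   IB_POLL_UNBOUND_WORKQUEUE);
}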



> Do you think we should submit the unbound workqueue patches for nvmet-tcp
> and nvmet-rdma to upstream nvmet?

For nvmet-tcp, I think there is merit in splitting socket processing from the
napi context. For nvmet-rdma, I think the only difference is if you have
multiple CQs assigned to the same comp_vector.

How many queues do you have in your test?
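
For the RDMA case, the usual way to avoid that is to spread the per-queue CQs
over the device's completion vectors; a tiny sketch of the idea (hypothetical
helper, assuming a 0-based queue index):

#include <rdma/ib_verbs.h>

/*
 * Sketch: give each queue a different comp_vector so that multiple CQs
 * do not all complete on the same vector (and hence the same CPU).
 */
static int pick_comp_vector(struct ib_device *dev, int queue_idx)
{
	return queue_idx % dev->num_comp_vectors;
}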

> BTW, I have another question: does upstream nvmet have any plan to support
> polled queues when doing submit_bio in the future?

No plans that I know of. I don't have a coherent idea of how that would work.
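
For context only, this is roughly what polled block IO looks like at the
submit_bio() level today (a sketch of the generic block-layer interface, not
an nvmet design): the submitter marks the bio REQ_POLLED and busy-polls
bio_poll() instead of sleeping until the completion interrupt.

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch: issue a polled read and busy-poll for its completion.
 * Assumes the underlying request queue supports polling.
 */
static void polled_read_end_io(struct bio *bio)
{
	WRITE_ONCE(*(bool *)bio->bi_private, true);
}

static void polled_read(struct block_device *bdev, struct page *page,
			sector_t sector)
{
	bool done = false;
	struct bio *bio = bio_alloc(bdev, 1, REQ_OP_READ | REQ_POLLED,
				    GFP_KERNEL);

	bio->bi_iter.bi_sector = sector;
	bio->bi_private = &done;
	bio->bi_end_io = polled_read_end_io;
	__bio_add_page(bio, page, PAGE_SIZE, 0);

	submit_bio(bio);

	/* Reap the completion by polling instead of waiting for an IRQ. */
	while (!READ_ONCE(done))
		bio_poll(bio, NULL, 0);

	bio_put(bio);
}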


