[PATCH v3 for-4.13 2/6] mlx5: move affinity hints assignments to generic code
Saeed Mahameed
saeedm at dev.mellanox.co.il
Sun Jun 11 02:32:05 PDT 2017
On Thu, Jun 8, 2017 at 3:29 PM, Sagi Grimberg <sagi at grimberg.me> wrote:
>
>>>>> My interpretation is that mlx5 tried to do this for the (rather
>>>>> esoteric
>>>>> in my mind) case where the platform does not have enough vectors for
>>>>> the
>>>>> driver to allocate percpu. In this case, the next best thing is to stay
>>>>> as close to the device affinity as possible.
>>>>>
>>>>
>>>> No, we did it because the mlx5e netdevice assumes that
>>>> IRQ[0]..IRQ[#num_numa/#cpu_per_numa] are always bound to the NUMA node
>>>> close to the device, and the mlx5e driver chooses those IRQs to spread
>>>> the RSS hash only across them, never using the other IRQs/cores.
>>>
>>>
>>>
>>> OK, that explains a lot of weirdness I've seen with mlx5e.
>>>
>>> Can you explain why you're using only a single NUMA node for your RSS
>>> table? What does it buy you? You open RX rings for _all_ CPUs but
>>> only spread over part of them? I must be missing something here...
>>
>>
>> Adding Tariq,
>>
>> this is also part of the weirdness :), we do that to make sure that any
>> OOB test you run always gets the best performance, since we guarantee
>> to always use cores on the NUMA node close to the device.
>
>
> Well I wish I knew that before :( I got to a point where I started
> to seriously doubt the mathematical strength of xor/toeplitz hashing :)
>
> I'm sure you ran plenty of performance tests, but from my experience,
> application locality makes much more difference than device locality,
> especially when the application needs to touch the data...
>
>> we open RX rings on all of the cores in case the user wants to
>> change the RSS table to point to all of them on the fly with "ethtool
>> -X"
>
>
> That is very counter-intuitive afaict, is it documented anywhere?
>
> users might rely on the (absolutely reasonable) assumption that if a
> NIC exposes X rx rings, rx hashing should spread across all of them and
> not a subset.
>
This is why we want to remove this assumption from mlx5e.
>> But we are willing to change that, and Tariq can provide the patch;
>> without changing this, mlx5e is broken.
>
>
> What patch? to modify the RSS spread? What is exactly broken?
The current mlx5 netdev with your patches might spread traffic _ONLY_ to
the far NUMA node.
To fix this in mlx5e we need something like the following:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 41cd22a223dc..15499865784f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3733,18 +3733,8 @@ void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
 				   u32 *indirection_rqt, int len,
 				   int num_channels)
 {
-	int node = mdev->priv.numa_node;
-	int node_num_of_cores;
 	int i;
 
-	if (node == -1)
-		node = first_online_node;
-
-	node_num_of_cores = cpumask_weight(cpumask_of_node(node));
-
-	if (node_num_of_cores)
-		num_channels = min_t(int, num_channels, node_num_of_cores);
-
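(The hunk above only shows the removed lines; assuming the rest of the
function stays as it is today, it would reduce to a plain modulo spread
over all channels. A rough sketch of the resulting helper:)

/* Sketch only -- assumes the tail of the function (not shown in the hunk
 * above) is unchanged.  The table now spreads over all channels instead of
 * only those on the device-local NUMA node; mdev becomes unused and could
 * be dropped in a follow-up.
 */
void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
				   u32 *indirection_rqt, int len,
				   int num_channels)
{
	int i;

	for (i = 0; i < len; i++)
		indirection_rqt[i] = i % num_channels;
}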
We are working on such a patch, and we would like to have it submitted
and accepted along with or before your series.
Also, we will have to run a performance & functional regression cycle
on all the patches combined.
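
For reference, the generic facility your series moves mlx5 to (as I
understand it) is the managed PCI IRQ affinity API; below is a minimal
sketch of how a driver consumes it, with an illustrative function name and
pre_vectors value (neither is taken from the actual patches):

#include <linux/interrupt.h>
#include <linux/pci.h>

/* Minimal sketch: let the PCI core spread the completion vectors and query
 * the resulting masks, instead of the driver setting affinity hints by hand.
 */
static int example_alloc_comp_irqs(struct pci_dev *pdev, int min_vecs,
				   int max_vecs)
{
	struct irq_affinity affd = { .pre_vectors = 1 };	/* don't spread vector 0 */
	const struct cpumask *mask;
	int nvec, i;

	nvec = pci_alloc_irq_vectors_affinity(pdev, min_vecs, max_vecs,
					      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					      &affd);
	if (nvec < 0)
		return nvec;

	for (i = affd.pre_vectors; i < nvec; i++) {
		mask = pci_irq_get_affinity(pdev, i);
		if (!mask)
			continue;
		/* mask is the core-assigned spread for vector i; consumers
		 * (e.g. RSS setup) can read it instead of assuming any
		 * IRQ -> NUMA ordering.
		 */
	}
	return nvec;
}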
>
> So I'm not sure how to move forward here, should we modify the
> indirection table construction to not rely on the unique affinity
> mappings?