[PATCH v3 for-4.13 2/6] mlx5: move affinity hints assignments to generic code

Sagi Grimberg sagi at grimberg.me
Thu Jun 8 05:29:18 PDT 2017


>>>> My interpretation is that mlx5 tried to do this for the (rather esoteric
>>>> in my mind) case where the platform does not have enough vectors for the
>>>> driver to allocate percpu. In this case, the next best thing is to stay
>>>> as close to the device affinity as possible.
>>>>
>>>
>>> No, we did it because the mlx5e netdevice assumes that
>>> IRQ[0]..IRQ[#num_numa/#cpu_per_numa] are always bound to the NUMA
>>> node close to the device, and the mlx5e driver chooses those IRQs
>>> to spread the RSS hash over, never using the other IRQs/cores.
>>
>>
>> OK, that explains a lot of weirdness I've seen with mlx5e.
>>
>> Can you explain why you're using only a single NUMA node for your RSS
>> table? What does it buy you? You open RX rings for _all_ CPUs but
>> only spread over part of them? I must be missing something here...
> 
> Adding Tariq,
> 
> this is also part of the weirdness :). We do that to make sure that
> any OOB test you run always gets the best performance, since we
> guarantee to always use the close NUMA cores.

Well, I wish I had known that before :( I got to a point where I
started to seriously doubt the mathematical soundness of xor/Toeplitz
hashing :)

I'm sure you ran plenty of performance tests, but from my experience,
application locality makes much more difference than device locality,
especially when the application needs to touch the data...

> we open RX rings on all of the cores in case the user wants to change
> the RSS table on the fly to point to the whole set ("ethtool -X")

That is very counterintuitive afaict. Is it documented anywhere?

Users might rely on the (absolutely reasonable) assumption that if a
NIC exposes X RX rings, RX hashing spreads across all of them and not
just a subset.
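
For reference, that is exactly what the in-kernel default helper
produces: a plain round-robin of the table entries over every RX ring.
Something along these lines (just a sketch; fill_default_indir() and
the fake_priv fields are made up for illustration, only
ethtool_rxfh_indir_default() is the real helper from
include/linux/ethtool.h):

#include <linux/ethtool.h>

struct fake_priv {
	u32 *indir_table;
	u32 indir_table_size;
	u32 num_rx_rings;
};

/* sketch: spread the indirection table entries round-robin over every
 * RX ring instead of over a close-NUMA subset */
static void fill_default_indir(struct fake_priv *priv)
{
	u32 i;

	for (i = 0; i < priv->indir_table_size; i++)
		priv->indir_table[i] =
			ethtool_rxfh_indir_default(i, priv->num_rx_rings);
}

which is also what "ethtool -X <dev> equal <num rings>" restores at
runtime.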

> But we are willing to change that; Tariq can provide the patch.
> Without changing this, mlx5e is broken.

What patch? To modify the RSS spread? What exactly is broken?

So I'm not sure how to move forward here. Should we modify the
indirection table construction so that it does not rely on the unique
affinity mappings?
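
And if we do want to keep a notion of "close" vectors at all, the
driver could at least derive it from the generic affinity assignment
instead of assuming vectors 0..n-1 are the local ones. Rough sketch
(the is_vector_local() helper is made up; pci_irq_get_affinity(),
dev_to_node() and cpumask_of_node() are the real interfaces):

#include <linux/pci.h>
#include <linux/topology.h>

/* sketch: ask the PCI core which CPUs this vector was spread to and
 * check whether any of them sits on the device's NUMA node */
static bool is_vector_local(struct pci_dev *pdev, int vec)
{
	const struct cpumask *mask = pci_irq_get_affinity(pdev, vec);
	int node = dev_to_node(&pdev->dev);

	if (!mask || node == NUMA_NO_NODE)
		return true;	/* no affinity info, treat as local */

	return cpumask_intersects(mask, cpumask_of_node(node));
}

Either way that seems better than hard-coding the assumption that the
first vectors are the device-local ones.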


