[PATCH v2] nvme/tcp: Add support to set the tcp worker cpu affinity

Hannes Reinecke hare at suse.de
Sun Apr 16 23:27:42 PDT 2023


On 4/15/23 23:06, David Laight wrote:
> From: Li Feng
>> Sent: 14 April 2023 10:35
>>>
>>> On 4/13/23 15:29, Li Feng wrote:
>>>> The default worker affinity policy is using all online cpus, e.g. from 0
>>>> to N-1. However, some cpus are busy for other jobs, then the nvme-tcp will
>>>> have a bad performance.
>>>>
>>>> This patch adds a module parameter to set the cpu affinity for the nvme-tcp
>>>> socket worker threads.  The parameter is a comma separated list of CPU
>>>> numbers.  The list is parsed and the resulting cpumask is used to set the
>>>> affinity of the socket worker threads.  If the list is empty or the
>>>> parsing fails, the default affinity is used.
>>>>
> ...
>>> I am not in favour of this.
>>> NVMe-over-Fabrics has _virtual_ queues, which really have no
>>> relationship to the underlying hardware.
>>> So trying to be clever here by tacking queues to CPUs sort of works if
>>> you have one subsystem to talk to, but if you have several where each
>>> exposes a _different_ number of queues you end up with a quite
>>> suboptimal setting (ie you rely on the resulting cpu sets to overlap,
>>> but there is no guarantee that they do).
>>
>> Thanks for your comment.
>> The current io-queues/cpu map method is not optimal.
>> It is stupid, and just starts from 0 to the last CPU, which is not configurable.
> 
> Module parameters suck, and passing the buck to the user
> when you can't decide how to do something isn't a good idea either.
> 
> If the system is busy pinning threads to cpus is very hard to
> get right.
> 
> It can be better to set the threads to run at the lowest RT
> priority - so they have priority over all 'normal' threads
> and also have a very sticky (but not fixed) cpu affinity so
> that all such threads tends to get spread out by the scheduler.
> This all works best if the number of RT threads isn't greater
> than the number of physical cpu.
> 
And the problem is that you cannot give an 'optimal' performance metric
here. With NVMe-over-Fabrics the number of queues is negotiated during
the initial 'connect' call, and the resulting number of queues strongly 
depends on target preferences (eg a NetApp array will expose only 4 
queues, with Dell/EMC you end up with up max 128 queues).
And these queues need to be mapped on the underlying hardware, which has
its own issues wrt to NUMA affinity.

To give you an example:
Given a setup with a 4 node NUMA machine, one NIC connected to
one NUMA core, each socket having 24 threads, the NIC exposing up to 32
interrupts, and connections to a NetApp _and_ a EMC, how exactly should
the 'best' layout look like?
And, what _is_ the 'best' layout?
You cannot satisfy the queue requirements from NetApp _and_ EMC, as you 
only have one NIC, and you cannot change the interrupt affinity for each 
I/O.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare at suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman




More information about the Linux-nvme mailing list