[PATCHv2] nvme-tcp: align I/O cpu with blk-mq mapping

Hannes Reinecke hare at suse.de
Wed Jun 19 08:49:20 PDT 2024


On 6/19/24 17:43, Sagi Grimberg wrote:
> 
> 
> On 19/06/2024 18:23, Hannes Reinecke wrote:
>> On 6/19/24 16:59, Christoph Hellwig wrote:
>>>>       if (wq_unbound)
>>>>           queue->io_cpu = WORK_CPU_UNBOUND;
>>>
>>> None of the above computed information is even used for the wq_unbound
>>> code.  This would make a lot more sense if the above assignment was in
>>> the (only) caller.
>>>
>>> Yes, that probably should have been done when merging the wq_unbound
>>> option (which honestly looks so whacky that I wish it wasn't merged).
>>>
>> Ah, you noticed?
>>
>> This patch was actually sparked off by one of our partners reporting
>> a severe latency increase proportional to the number of controllers,
>> with a marked increase when (sw) TLS is active; they even managed to
>> run into command timeouts.
>>
>> What happens is that we shove all work from identical queue numbers
>> onto the same CPU; so if your controller only exports 4 queues, _all_
>> work from qid 1 (from all controllers!) is pushed onto CPU 0, causing
>> massive oversubscription on that CPU and leaving all other CPUs in the
>> system idle.
> 
> I see how your patch addresses the case where multiple controllers fall
> into the same mappings.
> You could have selected a different mq_map entry for each controller
> (out of the entries that map to the qid).
> 
I looked at it, but had no idea how to figure out the load.
The load is actually per-CPU, but we only have per-controller structures.
So we would need to introduce a per-cpu counter tracking the number of
queues scheduled on each CPU.
But that won't help with the CPU oversubscription issue; we still might
have a substantially higher number of overall queues than we have CPUs...
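
For illustration, something along these lines is what I had in mind; just
a rough sketch, not against any tree, and all names (nvme_tcp_cpu_load,
nvme_tcp_least_loaded_io_cpu) are made up:

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/limits.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

/* per-cpu count of nvme-tcp queues whose io_work is pinned to this CPU */
static DEFINE_PER_CPU(atomic_t, nvme_tcp_cpu_load);

/*
 * Pick the least loaded CPU out of the candidate set that blk-mq maps
 * to this qid, and account for it.  A matching atomic_dec() would be
 * needed on queue teardown.
 */
static int nvme_tcp_least_loaded_io_cpu(const struct cpumask *candidates)
{
	int cpu, best = WORK_CPU_UNBOUND, min_load = INT_MAX;

	for_each_cpu(cpu, candidates) {
		int load = atomic_read(&per_cpu(nvme_tcp_cpu_load, cpu));

		if (load < min_load) {
			min_load = load;
			best = cpu;
		}
	}
	if (best != WORK_CPU_UNBOUND)
		atomic_inc(&per_cpu(nvme_tcp_cpu_load, best));
	return best;
}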

>>
>> Not sure how wq_unbound helps in this case; in theory the workqueue
>> items can be pushed onto arbitrary CPUs, but that only leads to even
>> worse thread bouncing.
>>
>> However, that's a topic for ALPSS. We really should have some sort of
>> backpressure here.
> 
> I have a patch that has been sitting around for some time now to make
> the RX path run directly from softirq, which should make RX execute on
> the CPU core mapped to the RSS hash.
> Perhaps you or your customer can give it a go.
> 
No s**t. That is pretty much what I wanted to do.
I'll be sure to give it a go.
Thanks for that!
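
If I understand the idea correctly it would be roughly this shape; purely
hand-waving on my side, not your actual patch, and the locking plus the
nvme_tcp_try_recv()/recv_lock names are just placeholders on top of the
existing nvme_tcp_data_ready() in drivers/nvme/host/tcp.c:

/*
 * Instead of bouncing RX to io_work on queue->io_cpu, consume the data
 * directly from ->data_ready, i.e. in softirq context on whatever CPU
 * the NIC's RSS steering delivered the interrupt to.
 */
static void nvme_tcp_data_ready(struct sock *sk)
{
	struct nvme_tcp_queue *queue;

	read_lock_bh(&sk->sk_callback_lock);
	queue = sk->sk_user_data;
	if (likely(queue && queue->rd_enabled)) {
		if (spin_trylock(&queue->recv_lock)) {
			/* receive inline instead of queue_work_on() */
			nvme_tcp_try_recv(queue);
			spin_unlock(&queue->recv_lock);
		} else {
			/* someone else is receiving; fall back to io_work */
			queue_work_on(queue->io_cpu, nvme_tcp_wq,
				      &queue->io_work);
		}
	}
	read_unlock_bh(&sk->sk_callback_lock);
}

That would indeed take the io_cpu selection out of the picture for RX.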

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare at suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich



