[PATCH] nvme-tcp: align I/O cpu with blk-mq mapping

Sagi Grimberg sagi at grimberg.me
Wed Jun 19 03:14:11 PDT 2024



On 19/06/2024 13:01, Hannes Reinecke wrote:
> On 6/19/24 10:57, Sagi Grimberg wrote:
>>
>>
>> On 19/06/2024 8:30, Christoph Hellwig wrote:
>>> On Tue, Jun 18, 2024 at 02:03:45PM +0200, Hannes Reinecke wrote:
>>>> Add a new module parameter 'wq_affinity' to spread the I/O
>>>> over all cpus within the blk-mq hctx mapping for the queue.
>>>> This avoids bouncing I/O between cpus when we have fewer
>>>> hardware queues than cpus.
>>> What is the benefit when setting it?  What is the downside? Why do you
>>> think it needs to be conditional?
>>
>> Not entirely sure.
>>
>
> Hehe. But I am :-)
>
>> Hannes,
>>
>> Can you please show the hctx<->cpu mappings before/after this
>> patch? Please use different queue counts so we can see the pattern.
>>
>> Can you please show any performance comparisons for low and high
>> queue counts?
>>
>> I'm not entirely clear what exactly "spreads I/O over all CPUs" means.
>> Every nvme-tcp queue has an io context that does the network sends and
>> receives (unless the send is done directly from .queue_rq() context when
>> possible). Does this somehow mean that now a queue io context will run
>> from multiple cpus at the same time?
>
> We have one connection per queue, and do a 'queue_work_on()' for
> each queue, based on the value of 'io_cpu'.
> That cpu should align with the set of CPUs of the associated hwctx,
> otherwise we need to bounce the thread from one cpu to another.
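
For context: the hand-off described here is the per-queue io_work, which
the driver pins to queue->io_cpu. A minimal sketch of that path
(nvme_tcp_kick_io_work is just an illustrative wrapper name, not something
in the driver):

	/*
	 * Every nvme-tcp queue has a single io_work doing the socket
	 * sends/receives.  When a request cannot be sent directly from
	 * .queue_rq() context, the work is queued on queue->io_cpu.
	 */
	static inline void nvme_tcp_kick_io_work(struct nvme_tcp_queue *queue)
	{
		/*
		 * If io_cpu is outside the queue's hctx cpu set, the io
		 * context runs on a "foreign" cpu and requests bounce
		 * between cpus and caches.
		 */
		queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
	}
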
> Case in point on this AMD machine I get this blk-mq mapping:
>
> queue 0: 6, 7, 8, 54, 55, 56
> queue 1: 9, 10, 11, 57, 58, 59
> queue 10: 42, 43, 44, 90, 91, 92
> queue 11: 45, 46, 47, 93, 94, 95
> queue 12: 12, 13, 14, 60, 61, 62
> queue 13: 15, 16, 17, 63, 64, 65
> queue 14: 0, 1, 2, 48, 49, 50
> queue 15: 3, 4, 5, 51, 52, 53
> queue 2: 18, 19, 20, 66, 67, 68
> queue 3: 21, 22, 23, 69, 70, 71
> queue 4: 24, 25, 26, 72, 73, 74
> queue 5: 27, 28, 29, 75, 76, 77
> queue 6: 30, 31, 32, 78, 79, 80
> queue 7: 33, 34, 35, 81, 82, 83
> queue 8: 36, 37, 38, 84, 85, 86
> queue 9: 39, 40, 41, 87, 88, 89
>
> - default behaviour:
>
> nvme nvme1: mapped 16/0/0 default/read/poll queues.
> nvme nvme1: queue 1: using cpu 0
> nvme nvme1: queue 2: using cpu 1
> nvme nvme1: queue 3: using cpu 2
> nvme nvme1: queue 4: using cpu 3
> nvme nvme1: queue 5: using cpu 4
> nvme nvme1: queue 6: using cpu 5
> nvme nvme1: queue 7: using cpu 6
> nvme nvme1: queue 8: using cpu 7
> nvme nvme1: queue 9: using cpu 8
> nvme nvme1: queue 10: using cpu 9
> nvme nvme1: queue 11: using cpu 10
> nvme nvme1: queue 12: using cpu 11
> nvme nvme1: queue 13: using cpu 12
> nvme nvme1: queue 14: using cpu 13
> nvme nvme1: queue 15: using cpu 14
> nvme nvme1: queue 16: using cpu 15
>
> So just queue 12 is using the correct CPU; all other queues are
> scheduled on cpus which are not part of their hwctx cpu set.
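
That matches the current selection logic, which walks the online cpu mask
by queue index and never consults the blk-mq map. Roughly what
nvme_tcp_set_queue_io_cpu() does today for default queues, as a simplified
sketch (qid is passed explicitly only for brevity, and the read/poll
branches are omitted):

	static void nvme_tcp_set_queue_io_cpu_sketch(struct nvme_tcp_queue *queue,
						     int qid)
	{
		int n = qid - 1;	/* 0-based index among the default queues */

		/*
		 * Pick the n-th online cpu, wrapping around.  With all cpus
		 * online this is simply cpu (qid - 1), regardless of which
		 * cpus blk-mq mapped to this queue's hctx.
		 */
		queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask,
						  -1, false);
	}
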
>
> - wq_affinity behaviour:
>
> nvme nvme1: queue 1: using cpu 6
> nvme nvme1: queue 2: using cpu 9
> nvme nvme1: queue 3: using cpu 18
> nvme nvme1: queue 4: using cpu 21
> nvme nvme1: queue 5: using cpu 24
> nvme nvme1: queue 6: using cpu 27
> nvme nvme1: queue 7: using cpu 30
> nvme nvme1: queue 8: using cpu 33
> nvme nvme1: queue 9: using cpu 36
> nvme nvme1: queue 10: using cpu 39
> nvme nvme1: queue 11: using cpu 42
> nvme nvme1: queue 12: using cpu 45
> nvme nvme1: queue 13: using cpu 12
> nvme nvme1: queue 14: using cpu 15
> nvme nvme1: queue 15: using cpu 0
> nvme nvme1: queue 16: using cpu 3
>
> All queues are scheduled on a cpu which is part of the
> blk-mq hwctx map.

OK, I think I understand what this patch does now, thanks (this should 
appear in the patch change log).

The change log really needs to improve here, Hannes. The wording "spread
the I/O over all cpus within the blk-mq hctx mapping for the queue" is
very misleading. The patch is actually addressing an incorrect io_cpu
selection for certain cpu topologies, which leads to running different
queue threads on sibling cpus and may lead to suboptimal performance.
Please also add performance comparisons for this change. And let's remove
the module parameter; iiuc this change is universally better.
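
To make sure we're talking about the same selection: the fix should boil
down to consulting the tag set's cpu<->hctx map when choosing io_cpu,
instead of walking cpu_online_mask by queue index. A minimal sketch of the
idea, not the actual patch (nvme_tcp_pick_io_cpu is an illustrative name,
and a real version would also have to balance multiple queues that share
a hctx):

	static int nvme_tcp_pick_io_cpu(struct blk_mq_queue_map *map, int hwq)
	{
		int cpu;

		/* first online cpu that blk-mq mapped to this hw queue */
		for_each_online_cpu(cpu) {
			if (map->mq_map[cpu] == hwq)
				return cpu;
		}

		/* no mapped cpu is online: let the workqueue pick one */
		return WORK_CPU_UNBOUND;
	}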


