[PATCHv2 0/3] nvme-tcp: improve scalability

Sagi Grimberg sagi at grimberg.me
Wed Jul 10 07:45:49 PDT 2024



On 10/07/2024 17:06, Hannes Reinecke wrote:
> On 7/10/24 13:56, Sagi Grimberg wrote:
>>
>>
>> On 08/07/2024 10:10, Hannes Reinecke wrote:
>>> Hi all,
>>>
>>> for workloads with a lot of controllers we run into workqueue 
>>> contention,
>>> where the single workqueue is not able to service requests fast enough,
>>> leading to spurious I/O errors and connect resets during high load.
>>> This patchset improves the situation by improving the fairness between
>>> rx and tx scheduling, introducing per-controller workqueues,
>>> and distributing the load according to the blk-mq cpu mapping.
>>> With this we reduce the spurious I/O errors and improve the overall
>>> performance for highly contended workloads.
>>>
>>> All performance numbers are derived from the 'tiobench-example.fio'
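
(For anyone following along: the per-controller workqueue plus blk-mq cpu
mapping described above could look roughly like the sketch below. This is
only an illustration of the idea, not the actual patches; the io_wq field
and the helper names are invented here.)

static int nvme_tcp_alloc_ctrl_wq(struct nvme_tcp_ctrl *ctrl)
{
	/* hypothetical per-controller workqueue instead of the shared nvme_tcp_wq */
	ctrl->io_wq = alloc_workqueue("nvme_tcp_wq_%d",
			WQ_MEM_RECLAIM | WQ_HIGHPRI, 0, ctrl->ctrl.instance);
	if (!ctrl->io_wq)
		return -ENOMEM;
	return 0;
}

static void nvme_tcp_kick_queue(struct nvme_tcp_queue *queue)
{
	/* queue->io_cpu assumed to be derived from the blk-mq cpu mapping */
	queue_work_on(queue->io_cpu, queue->ctrl->io_wq, &queue->io_work);
}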
>>
>> Did you keep the fio file unmodified? I'd suggest running it for longer,
>> say 60 seconds per workload. 512 MB is a very short benchmark...
>
> Not for 32 queues :-)
> But yeah, I can keep it running for slightly longer.

How does the number of queues make a difference?
Doesn't it simply write 512MB from 4 threads?
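
If you do want a longer run, something like this in the job file's [global]
section should do it (illustrative, assuming the stock tiobench-example.fio
layout; time_based and runtime are standard fio options):

[global]
; run each job for 60s of wall-clock time instead of a fixed size
time_based
runtime=60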

>
> Not making much progress, mind; your 'softirq' patch definitely speeds 
> up receiving, but seems to be messing up the write side such that I'm 
> basically guaranteed to hit I/O timeouts on WRITE :-(
>
> Keep on debugging ...

Hannes, now that we've established that rx can starve tx, we must pace 
rx: if we do something like softirq rx, it must be a limited initial 
batch, and once we exhaust it, we schedule a workqueue to continue 
processing. I'd also leave it alone for now; we'll add it once we have a 
good understanding of what is going on...
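
To make the pacing idea concrete, here is a rough, untested sketch
(callback locking and error handling omitted, the function name and
budget value are made up): consume a bounded number of PDUs inline, then
hand the rest to io_work so tx gets a chance to run.

#define NVME_TCP_RCV_BUDGET	8	/* illustrative value */

static void nvme_tcp_data_ready_budgeted(struct sock *sk)
{
	struct nvme_tcp_queue *queue = sk->sk_user_data;
	int budget = NVME_TCP_RCV_BUDGET;

	while (budget-- > 0) {
		/* returns <= 0 once there is nothing left to consume */
		if (nvme_tcp_try_recv(queue) <= 0)
			return;
	}
	/* budget exhausted: defer the rest so rx cannot starve tx */
	queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}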


