[PATCHv3 0/8] nvme-tcp: improve scalability
Hannes Reinecke
hare at suse.de
Wed Jul 17 23:20:18 PDT 2024
On 7/17/24 23:01, Sagi Grimberg wrote:
>
>
> On 16/07/2024 10:36, Hannes Reinecke wrote:
>> Hi all,
>>
>> for workloads with a lot of controllers we run into workqueue contention,
>> where the single workqueue is not able to service requests fast enough,
>> leading to spurious I/O errors and connect resets during high load.
>> One culprit here was lock contention on the callbacks, where we
>> acquired the 'sk_callback_lock' on every callback. As we are dealing
>> with parallel rx and tx flows this induces quite a lot of contention.
>> I have also added instrumentation to analyse I/O flows, adding
>> I/O stall debug messages and debugfs entries to display
>> detailed statistics for each queue.
>
> Hannes, I'm getting really confused with this...
>
> Once again you submit a set that goes in an almost entirely different
> direction from v1 and v2... Again without quantifying what each
> change gives us, which makes it very hard to review and understand.
>
I fully concur. But the previous patchsets turned out not to give a
substantial improvement when scaling up.
And getting reliable performance numbers is _really_ hard, as they
fluctuate quite a bit between runs. That was the reason for including
the statistics patches; they give direct insight into the I/O path
latency, so the impact of each change can be measured directly.
And that's also how I found the lock contention on the callbacks...
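To illustrate what I mean: the hot path in question looks roughly like
this (written from memory, so details may differ from the actual
driver). Every data_ready invocation takes sk_callback_lock just to
dereference sk_user_data, and with dozens of controllers that read
lock gets hammered from rx and tx context at the same time:

static void nvme_tcp_data_ready(struct sock *sk)
{
        struct nvme_tcp_queue *queue;

        /* taken on every single callback, for every queue */
        read_lock_bh(&sk->sk_callback_lock);
        queue = sk->sk_user_data;
        if (likely(queue && queue->rd_enabled) &&
            !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
        read_unlock_bh(&sk->sk_callback_lock);
}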
> I suggest we split the changes that have consensus to a separate series
> (still state what each change gets us), and understand better the rest...
>
The patchset here contains (apart from the statistics patches) just
the patch removing the lock contention on the callbacks (which
_really_ was causing issues) and the alignment with blk-mq, for which
I do see an improvement.
All other patches posted previously turned out to increase latency
(as the statistics patches showed), so I left them out of this round.
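To give an idea of the principle (this is only a sketch, not
necessarily what the patch ends up doing): if the queue pointer is
published via RCU, the callback no longer needs sk_callback_lock at
all and only requires an RCU read-side critical section:

/*
 * Sketch: queue pointer published with rcu_assign_sk_user_data() at
 * connect time and cleared before teardown, so the callback can drop
 * sk_callback_lock entirely.
 */
static void nvme_tcp_data_ready(struct sock *sk)
{
        struct nvme_tcp_queue *queue;

        rcu_read_lock();
        queue = rcu_dereference_sk_user_data(sk);
        if (likely(queue && queue->rd_enabled) &&
            !test_bit(NVME_TCP_Q_POLLING, &queue->flags))
                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
        rcu_read_unlock();
}

The price is that teardown has to clear the pointer and wait for a
grace period before the queue can be freed, but that is a slow path
anyway.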
>>
>> All performance numbers are derived from the 'tiobench-example.fio'
>> sample from the fio sources, running on a 96-core machine with one,
>> two, or four subsystems and two paths, each path exposing 32 queues.
>> Backend is nvmet using an Intel DC P3700 NVMe SSD.
>
> The patchset in v1 started by stating a performance issue when
> controllers have a limited number of queues; does this test case
> represent the original issue?
>
Oh, but it does. The entire test is run on a machine with 96 cores,
while each path exposes only 32 queues, so we always have fewer queues
than cores.
>>
>> write performance:
>> baseline:
>> 1 subsys, 4k seq: bw=523MiB/s (548MB/s), 16.3MiB/s-19.0MiB/s (17.1MB/s-20.0MB/s)
>> 1 subsys, 4k rand: bw=502MiB/s (526MB/s), 15.7MiB/s-21.5MiB/s (16.4MB/s-22.5MB/s)
>> 2 subsys, 4k seq: bw=420MiB/s (440MB/s), 2804KiB/s-4790KiB/s (2871kB/s-4905kB/s)
>> 2 subsys, 4k rand: bw=416MiB/s (436MB/s), 2814KiB/s-5503KiB/s (2881kB/s-5635kB/s)
>> 4 subsys, 4k seq: bw=409MiB/s (429MB/s), 1990KiB/s-8396KiB/s (2038kB/s-8598kB/s)
>> 4 subsys, 4k rand: bw=386MiB/s (405MB/s), 2024KiB/s-6314KiB/s (2072kB/s-6466kB/s)
>>
>> patched:
>> 1 subsys, 4k seq: bw=440MiB/s (461MB/s), 13.7MiB/s-16.1MiB/s (14.4MB/s-16.8MB/s)
>> 1 subsys, 4k rand: bw=427MiB/s (448MB/s), 13.4MiB/s-16.2MiB/s (13.0MB/s-16.0MB/s)
>
> That is a substantial degradation. I also keep asking: how does null_blk
> look?
>
Tested, and it doesn't make a difference; the numbers are similar.
Surprising, but there you are.
>> 2 subsys, 4k seq: bw=506MiB/s (531MB/s), 3581KiB/s-4493KiB/s (3667kB/s-4601kB/s)
>> 2 subsys, 4k rand: bw=494MiB/s (518MB/s), 3630KiB/s-4421KiB/s (3717kB/s-4528kB/s)
>> 4 subsys, 4k seq: bw=457MiB/s (479MB/s), 2564KiB/s-8297KiB/s (2625kB/s-8496kB/s)
>> 4 subsys, 4k rand: bw=424MiB/s (444MB/s), 2509KiB/s-9414KiB/s (2570kB/s-9640kB/s)
>
> There is still an observed degradation when moving from 2 to 4
> subsystems; what is the cause of it?
>
All subsystems are running over the same 10GigE link, so some
performance degradation is to be expected as contention on the link
increases.
>>
>> read performance:
>> baseline:
>> 1 subsys, 4k seq: bw=389MiB/s (408MB/s), 12.2MiB/s-18.1MiB/s (12.7MB/s-18.0MB/s)
>> 1 subsys, 4k rand: bw=430MiB/s (451MB/s), 13.5MiB/s-19.2MiB/s (14.1MB/s-20.2MB/s)
>> 2 subsys, 4k seq: bw=377MiB/s (395MB/s), 2603KiB/s-3987KiB/s (2666kB/s-4083kB/s)
>> 2 subsys, 4k rand: bw=377MiB/s (395MB/s), 2431KiB/s-5403KiB/s (2489kB/s-5533kB/s)
>> 4 subsys, 4k seq: bw=139MiB/s (146MB/s), 197KiB/s-11.1MiB/s (202kB/s-11.6MB/s)
>> 4 subsys, 4k rand: bw=352MiB/s (369MB/s), 1360KiB/s-13.9MiB/s (1392kB/s-14.6MB/s)
>>
>> patched:
>> 1 subsys, 4k seq: bw=405MiB/s (425MB/s), 2.7MiB/s-14.7MiB/s (13.3MB/s-15.4MB/s)
>> 1 subsys, 4k rand: bw=427MiB/s (447MB/s), 13.3MiB/s-16.1MiB/s (13.0MB/s-16.9MB/s)
>> 2 subsys, 4k seq: bw=411MiB/s (431MB/s), 2462KiB/s-4523KiB/s (2522kB/s-4632kB/s)
>> 2 subsys, 4k rand: bw=392MiB/s (411MB/s), 2258KiB/s-4220KiB/s (2312kB/s-4321kB/s)
>> 4 subsys, 4k seq: bw=378MiB/s (397MB/s), 1859KiB/s-8110KiB/s (1904kB/s-8305kB/s)
>> 4 subsys, 4k rand: bw=326MiB/s (342MB/s), 1781KiB/s-4499KiB/s (1823kB/s-4607kB/s)
>
> Same question here: your patches do not seem to eliminate the overall
> loss of efficiency.
I never claimed that, and I really don't see how we could.
All subsystems are running over the same link, so we have to push more
independent frames across it and will suffer from higher contention.
A performance degradation when scaling up the number of subsystems is
unavoidable.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich