[RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export
Hannes Reinecke
hare at suse.de
Wed Apr 22 04:10:13 PDT 2026
On 4/20/26 13:49, Nilay Shroff wrote:
> Hi,
>
> The NVMe/TCP host driver currently provisions I/O queues primarily based
> on CPU availability rather than the capabilities and topology of the
> underlying network interface.
>
> On modern systems with many CPUs but fewer NIC hardware queues, this can
> lead to multiple NVMe/TCP I/O workers contending for the same TX/RX queue,
> resulting in increased lock contention, cacheline bouncing, and degraded
> throughput.
>
> This RFC proposes a set of changes to better align NVMe/TCP I/O queues
> with NIC queue resources, and to expose queue/flow information to enable
> more effective system-level tuning.
>
> Key ideas
> ---------
>
> 1. Scale NVMe/TCP I/O queues based on NIC queue count
> Instead of relying solely on CPU count, limit the number of I/O workers
> to (a sketch follows this list):
>     min(num_online_cpus, netdev->real_num_{tx,rx}_queues)
>
> 2. Improve CPU locality
> Align NVMe/TCP I/O workers with CPUs associated with NIC IRQ affinity
> to reduce cross-CPU traffic and improve cache locality.
>
> 3. Expose queue and flow information via debugfs
> Export per-I/O queue information including:
> - queue id (qid)
> - CPU affinity
> - TCP flow (src/dst IP and ports)
>
> This enables userspace tools to configure:
> - IRQ affinity
> - RPS/XPS
> - ntuple steering
> - any other steering or scaling mechanism deemed useful
>
> 4. Provide infrastructure for extensible debugfs support in NVMe
>
> Together, these changes allow better alignment of:
> flow -> NIC queue -> IRQ -> CPU -> NVMe/TCP I/O worker
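>
> As a rough illustration of idea 1, the clamping could look like the
> sketch below (the helper name and the fallback behaviour are
> illustrative only, not lifted from the actual patches):
>
>     #include <linux/cpumask.h>
>     #include <linux/minmax.h>
>     #include <linux/netdevice.h>
>
>     /*
>      * Clamp the NVMe/TCP I/O queue count to what the NIC can serve
>      * in parallel; fall back to the CPU count when no net_device can
>      * be resolved for the connection.
>      */
>     static unsigned int
>     nvme_tcp_max_io_queues_for_netdev(struct net_device *netdev)
>     {
>             unsigned int nr_queues = num_online_cpus();
>
>             if (netdev)
>                     nr_queues = min3(nr_queues,
>                                      netdev->real_num_tx_queues,
>                                      netdev->real_num_rx_queues);
>             return nr_queues;
>     }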
>
> Performance Evaluation
> ----------------------
> Tests were conducted using fio over NVMe/TCP with the following parameters:
> ioengine=io_uring
> direct=1
> bs=4k
> numjobs=<#nic-queues>
> iodepth=64
> System:
> CPUs: 72
> NIC: 100G mlx5
>
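> For reference, a fio job file equivalent to the above (the device path
> and job section are illustrative; numjobs is set per scenario, e.g. 32
> for scenario 1):
>
>     [global]
>     ioengine=io_uring
>     direct=1
>     bs=4k
>     iodepth=64
>     group_reporting=1
>
>     [randread]
>     rw=randread
>     filename=/dev/nvme1n1
>     numjobs=32
>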
> Two configurations were evaluated.
>
> Scenario 1: NIC queues < CPU count
> ----------------------------------
> - CPUs: 72
> - NIC queues: 32
>
>                  Baseline       Patched        Patched + tuning
>  randread        3141 MB/s      3228 MB/s      7509 MB/s
>                  (767k IOPS)    (788k IOPS)    (1833k IOPS)
>
>  randwrite       4510 MB/s      6172 MB/s      7518 MB/s
>                  (1101k IOPS)   (1507k IOPS)   (1836k IOPS)
>
>  randrw (read)   2156 MB/s      2560 MB/s      3932 MB/s
>                  (526k IOPS)    (625k IOPS)    (960k IOPS)
>
>  randrw (write)  2155 MB/s      2560 MB/s      3932 MB/s
>                  (526k IOPS)    (625k IOPS)    (960k IOPS)
>
> Observation:
> When CPU count exceeds NIC queue count, the baseline configuration
> suffers from queue contention. The proposed changes provide modest
> improvements on their own; combined with queue-aware tuning
> (IRQ affinity, ntuple steering, and CPU alignment), they yield a
> ~1.5x–2.5x throughput improvement depending on workload.
>
> Scenario 2: NIC queues == CPU count
> -----------------------------------
>
> - CPUs: 72
> - NIC queues: 72
>
>                  Baseline       Patched + tuning
>  randread        4310 MB/s      7987 MB/s
>                  (1052k IOPS)   (1950k IOPS)
>
>  randwrite       7947 MB/s      7972 MB/s
>                  (1940k IOPS)   (1946k IOPS)
>
>  randrw (read)   3583 MB/s      4030 MB/s
>                  (875k IOPS)    (984k IOPS)
>
>  randrw (write)  3583 MB/s      4029 MB/s
>                  (875k IOPS)    (984k IOPS)
>
> Observation:
> When NIC queues are already aligned with CPU count, the baseline performs
> well. The proposed changes maintain write performance (no regression) and
> still improve read and mixed workloads due to better flow-to-CPU locality.
>
> Notes on tuning
> ---------------
> The "patched + tuning" configuration includes:
> - aligning NVMe/TCP I/O workers with NIC queue count
> - IRQ affinity configuration per RX queue
> - ntuple-based flow steering
> - CPU/queue affinity alignment
>
> These tuning steps are enabled by the queue/flow information exposed through
> this patchset.
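>
> As an illustration, a per-queue debugfs record could look like the
> line below (the exact path and layout are open for discussion; the
> addresses are made up, 4420 being the standard NVMe/TCP port):
>
>     qid: 3  io_cpu: 12  src: 192.168.10.5:49161  dst: 192.168.10.9:4420
>
> A userspace tool could then pin the IRQ of the NIC RX queue serving
> this flow to CPU 12 and install an ntuple rule steering the 4-tuple to
> that queue, closing the flow -> NIC queue -> IRQ -> CPU -> I/O worker
> loop.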
>
> Discussion
> ----------
> This RFC aims to start discussion around:
> - Whether NVMe/TCP queue scaling should consider NIC queue topology
> - How best to expose queue/flow information to userspace
> - The role of userspace vs kernel in steering decisions
>
> As usual, feedback/comments/suggestions are most welcome!
>
> Reference to the LSF/MM/BPF abstract: https://lore.kernel.org/all/5db8ce78-0dfa-4dcb-bf71-5fb9c8f463e5@linux.ibm.com/
>
Weelll ... we have been debating this back and forth over recent years:
Should we check for hardware limitations for NVMe-over-Fabrics or not?
Initially it sounds appealing, and in fact I've worked on several
attempts myself. But in the end there are far more things which need
to be considered:
-> For networking, the number of queues doesn't really tell us
   anything. Most NICs have distinct RX and TX queues, and the number
   (of both!) varies quite dramatically.
-> The number of queues does _not_ indicate that all queues are used
   simultaneously. That is down to things like RSS and friends.
   I took a stab at configuring _that_, but it's patently horrible
   trying to second-guess the stack yourself.
-> It'll only work if you run directly on the NIC. As soon as there
   is anything in between (qemu? tunnelling?) you are out of luck.
So yeah, we should have a discussion here.
Cheers,
Hannes
--
Dr. Hannes Reinecke                   Kernel Storage Architect
hare at suse.de                       +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich