[RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export

Fri Apr 24 15:30:03 PDT 2026

On 22/04/2026 14:10, Hannes Reinecke wrote:
> On 4/20/26 13:49, Nilay Shroff wrote:
>> Hi,
>>
>> The NVMe/TCP host driver currently provisions I/O queues primarily based
>> on CPU availability rather than the capabilities and topology of the
>> underlying network interface.
>>
>> On modern systems with many CPUs but fewer NIC hardware queues, this can
>> lead to multiple NVMe/TCP I/O workers contending for the same TX/RX queue,
>> resulting in increased lock contention, cacheline bouncing, and degraded
>> throughput.
>>
>> This RFC proposes a set of changes to better align NVMe/TCP I/O queues
>> with NIC queue resources, and to expose queue/flow information to enable
>> more effective system-level tuning.
>>
>> Key ideas
>> ---------
>>
>> 1. Scale NVMe/TCP I/O queues based on NIC queue count
>>     Instead of relying solely on CPU count, limit the number of I/O workers
>>     to:
>>         min(num_online_cpus, netdev->real_num_{tx,rx}_queues)
>>
>> 2. Improve CPU locality
>>     Align NVMe/TCP I/O workers with CPUs associated with NIC IRQ affinity
>>     to reduce cross-CPU traffic and improve cache locality.
>>
>> 3. Expose queue and flow information via debugfs
>>     Export per-I/O queue information including:
>>         - queue id (qid)
>>         - CPU affinity
>>         - TCP flow (src/dst IP and ports)
>>
>>     This enables userspace tools to configure:
>>         - IRQ affinity
>>         - RPS/XPS
>>         - ntuple steering
>>         - or any other scaling as deemed feasible
>>
>> 4. Provide infrastructure for extensible debugfs support in NVMe
>>
>> Together, these changes allow better alignment of:
>>      flow -> NIC queue -> IRQ -> CPU -> NVMe/TCP I/O worker
>>
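As a rough illustration of the cap in key idea 1, here is a minimal Python
sketch. It is not code from the patchset: the function names are hypothetical,
and taking the minimum over both the TX and RX counts is only one reading of
real_num_{tx,rx}_queues. The sysfs helper assumes that the tx-N/rx-N entries
under /sys/class/net/<dev>/queues/ reflect the queues currently in use.

```python
import os
import re


def capped_io_queues(online_cpus, tx_queues, rx_queues):
    """Hypothetical helper: cap I/O workers at min(CPUs, TX queues, RX queues)."""
    if min(online_cpus, tx_queues, rx_queues) < 1:
        raise ValueError("all counts must be at least 1")
    return min(online_cpus, tx_queues, rx_queues)


def count_nic_queues(dev, sysfs_root="/sys/class/net"):
    """Count tx-N and rx-N entries for a netdev (assumed sysfs layout)."""
    names = os.listdir(os.path.join(sysfs_root, dev, "queues"))
    tx = sum(1 for n in names if re.fullmatch(r"tx-\d+", n))
    rx = sum(1 for n in names if re.fullmatch(r"rx-\d+", n))
    return tx, rx


if __name__ == "__main__":
    # Scenario 1 from this thread: 72 CPUs, 32 NIC queues.
    print(capped_io_queues(72, 32, 32))
```

With the Scenario 1 values this gives 32 I/O workers instead of 72; with the
Scenario 2 values (72 CPUs, 72 queues) the cap has no effect.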
>> Performance Evaluation
>> ----------------------
>> Tests were conducted using fio over NVMe/TCP with the following parameters:
>>      ioengine=io_uring
>>      direct=1
>>      bs=4k
>>      numjobs=<#nic-queues>
>>      iodepth=64
>> System:
>>      CPUs: 72
>>      NIC: 100G mlx5
>>
>> Two configurations were evaluated.
>>
>> Scenario 1: NIC queues < CPU count
>> ----------------------------------
>> - CPUs: 72
>> - NIC queues: 32
>>
>>                  Baseline        Patched        Patched + tuning
>> randread        3141 MB/s       3228 MB/s      7509 MB/s
>>                  (767k IOPS)     (788k IOPS)    (1833k IOPS)
>>
>> randwrite       4510 MB/s       6172 MB/s      7518 MB/s
>>                  (1101k IOPS)    (1507k IOPS)   (1836k IOPS)
>>
>> randrw (read)   2156 MB/s       2560 MB/s      3932 MB/s
>>                  (526k IOPS)     (625k IOPS)    (960k IOPS)
>>
>> randrw (write)  2155 MB/s       2560 MB/s      3932 MB/s
>>                  (526k IOPS)     (625k IOPS)    (960k IOPS)
>>
>> Observation:
>> When CPU count exceeds NIC queue count, the baseline configuration
>> suffers from queue contention. The proposed changes provide modest
>> improvements on their own, and when combined with queue-aware tuning
>> (IRQ affinity, ntuple steering, and CPU alignment), enable up to
>> ~1.5x–2.5x throughput improvement.
>>
>> Scenario 2: NIC queues == CPU count
>> -----------------------------------
>>
>> - CPUs: 72
>> - NIC queues: 72
>>
>>                  Baseline                Patched + tuning
>> randread        4310 MB/s               7987 MB/s
>>                  (1052k IOPS)            (1950k IOPS)
>>
>> randwrite       7947 MB/s               7972 MB/s
>>                  (1940k IOPS)            (1946k IOPS)
>>
>> randrw (read)   3583 MB/s               4030 MB/s
>>                  (875k IOPS)             (984k IOPS)
>>
>> randrw (write)  3583 MB/s               4029 MB/s
>>                  (875k IOPS)             (984k IOPS)
>>
>> Observation:
>> When NIC queues are already aligned with CPU count, the baseline performs
>> well. The proposed changes maintain write performance (no regression) and
>> still improve read and mixed workloads due to better flow-to-CPU locality.
>>
>> Notes on tuning
>> ---------------
>> The "patched + tuning" configuration includes:
>>      - aligning NVMe/TCP I/O workers with NIC queue count
>>      - IRQ affinity configuration per RX queue
>>      - ntuple-based flow steering
>>      - CPU/queue affinity alignment
>>
>> These tuning steps are enabled by the queue/flow information exposed through
>> this patchset.
>>
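To make the tuning steps concrete, here is a hedged Python sketch that turns
one per-queue record into an ntuple rule and an IRQ pinning command. The
record layout (the dict keys), the device name, the IRQ number, and all
addresses are made up for illustration; the actual debugfs format is defined
by the patches, not here. The sketch only builds command strings and assumes
the usual `ethtool -N ... flow-type tcp4 ... action N` form and the
/proc/irq/<n>/smp_affinity_list interface.

```python
def ntuple_rule_cmd(dev, flow, rx_queue):
    """Build an ethtool ntuple rule that steers one NVMe/TCP flow to rx_queue.

    `flow` is a hypothetical dict describing the host-side view of the
    connection. The rule matches received packets, so the target address
    is the packet source and the host address is the packet destination.
    """
    return (
        f"ethtool -N {dev} flow-type tcp4 "
        f"src-ip {flow['dst_ip']} dst-ip {flow['src_ip']} "
        f"src-port {flow['dst_port']} dst-port {flow['src_port']} "
        f"action {rx_queue}"
    )


def irq_affinity_cmd(irq, cpu):
    """Build the shell command that pins an IRQ to a single CPU."""
    return f"echo {cpu} > /proc/irq/{irq}/smp_affinity_list"


if __name__ == "__main__":
    # Illustrative values only; none of these come from the thread.
    flow = {"qid": 1, "cpu": 4,
            "src_ip": "192.168.1.10", "src_port": 51234,
            "dst_ip": "192.168.1.20", "dst_port": 4420}
    print(ntuple_rule_cmd("eth0", flow, rx_queue=4))
    print(irq_affinity_cmd(irq=120, cpu=flow["cpu"]))
```

The idea is that the RX queue chosen for a flow, the CPU its IRQ is pinned to,
and the CPU of the matching NVMe/TCP I/O worker all line up.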
>> Discussion
>> ----------
>> This RFC aims to start discussion around:
>>    - Whether NVMe/TCP queue scaling should consider NIC queue topology
>>    - How best to expose queue/flow information to userspace
>>    - The role of userspace vs kernel in steering decisions
>>
>> As usual, feedback, comments, and suggestions are most welcome!
>>
>> Reference to LSF/MM/BPF abstract:
>> https://lore.kernel.org/all/5db8ce78-0dfa-4dcb-bf71-5fb9c8f463e5@linux.ibm.com/
>>
>
> Weelll ... we have been debating this back and forth over recent years:
> Should we check for hardware limitations for NVMe-over-Fabrics or not?
>
> Initially it sounds appealing, and in fact I've worked on several 
> attempts myself. But in the end there are far more things which need
> to be considered:
> -> For networking, number of queues is not really telling us anything.
>    Most NICs have distinct RX and TX queues, and the number (of both!)
>    varies quite dramatically.
> -> The number of queues does _not_ indicate that all queues are used
>    simultaneously. That is down to things like RSS and friends.
>    I gave a stab at configuring _that_ but it's patently horrible
>    trying to out-guess things for yourself.
> -> It'll only work if you run directly on the NIC. As soon as there
>    is anything in between (qemu? Tunnelling?) you are out of luck.
>
> So yeah, we should have a discussion here.

TBH, I don't think that this is very useful. I gave some of the reasons
why in my reply to patch #1.

But the main reason is that I think the majority of the gains that you
are showing come from the tuning, which is somewhat unrelated to the
driver and which, TBH, I doubt anyone will actually do in practice.
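
For what it's worth, the split between the driver change and the tuning can
be checked directly from the Scenario 1 numbers in the cover letter. The
Python sketch below is just arithmetic on those figures; the function name
and the "share of total gain" metric are my own choices, not something the
patchset defines.

```python
# Scenario 1 throughput in MB/s, copied from the cover letter:
# (baseline, patched, patched + tuning)
SCENARIO_1 = {
    "randread": (3141, 3228, 7509),
    "randwrite": (4510, 6172, 7518),
    "randrw (read)": (2156, 2560, 3932),
}


def gain_split(baseline, patched, tuned):
    """Return (patch_share, tuning_share) of the total gain over baseline."""
    total = tuned - baseline
    if total <= 0:
        raise ValueError("no gain over baseline")
    return (patched - baseline) / total, (tuned - patched) / total


if __name__ == "__main__":
    for name, (base, patched, tuned) in SCENARIO_1.items():
        patch_share, tuning_share = gain_split(base, patched, tuned)
        print(f"{name:14s} patch {patch_share:5.1%}  tuning {tuning_share:5.1%}")
```

On these numbers the tuning accounts for roughly 98% of the randread gain and
77% of the mixed-read gain, while for randwrite the driver change alone
contributes a little over half (about 55%).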