[RFC PATCH 0/4] nvme-tcp: NIC topology aware I/O queue scaling and queue info export

Mon Apr 27 05:11:24 PDT 2026

On 4/25/26 4:00 AM, Sagi Grimberg wrote:
> 
> 
> On 22/04/2026 14:10, Hannes Reinecke wrote:
>> On 4/20/26 13:49, Nilay Shroff wrote:
>>> Hi,
>>>
>>> The NVMe/TCP host driver currently provisions I/O queues primarily based
>>> on CPU availability rather than the capabilities and topology of the
>>> underlying network interface.
>>>
>>> On modern systems with many CPUs but fewer NIC hardware queues, this can
>>> lead to multiple NVMe/TCP I/O workers contending for the same TX/RX queue,
>>> resulting in increased lock contention, cacheline bouncing, and degraded
>>> throughput.
>>>
>>> This RFC proposes a set of changes to better align NVMe/TCP I/O queues
>>> with NIC queue resources, and to expose queue/flow information to enable
>>> more effective system-level tuning.
>>>
>>> Key ideas
>>> ---------
>>>
>>> 1. Scale NVMe/TCP I/O queues based on NIC queue count
>>>     Instead of relying solely on CPU count, limit the number of I/O workers
>>>     to:
>>>         min(num_online_cpus, netdev->real_num_{tx,rx}_queues)
>>>
>>> 2. Improve CPU locality
>>>     Align NVMe/TCP I/O workers with CPUs associated with NIC IRQ affinity
>>>     to reduce cross-CPU traffic and improve cache locality.
>>>
>>> 3. Expose queue and flow information via debugfs
>>>     Export per-I/O queue information including:
>>>         - queue id (qid)
>>>         - CPU affinity
>>>         - TCP flow (src/dst IP and ports)
>>>
>>>     This enables userspace tools to configure:
>>>         - IRQ affinity
>>>         - RPS/XPS
>>>         - ntuple steering
>>>         - or any other scaling as deemed feasible
>>>
>>> 4. Provide infrastructure for extensible debugfs support in NVMe
>>>
>>> Together, these changes allow better alignment of:
>>>      flow -> NIC queue -> IRQ -> CPU -> NVMe/TCP I/O worker
>>>
>>> Performance Evaluation
>>> ----------------------
>>> Tests were conducted using fio over NVMe/TCP with the following parameters:
>>>      ioengine=io_uring
>>>      direct=1
>>>      bs=4k
>>>      numjobs=<#nic-queues>
>>>      iodepth=64
>>> System:
>>>      CPUs: 72
>>>      NIC: 100G mlx5
>>>
>>> Two configurations were evaluated.
>>>
>>> Scenario 1: NIC queues < CPU count
>>> ----------------------------------
>>> - CPUs: 72
>>> - NIC queues: 32
>>>
>>>                  Baseline        Patched        Patched + tuning
>>> randread        3141 MB/s       3228 MB/s      7509 MB/s
>>>                  (767k IOPS)     (788k IOPS)    (1833k IOPS)
>>>
>>> randwrite       4510 MB/s       6172 MB/s      7518 MB/s
>>>                  (1101k IOPS)    (1507k IOPS)   (1836k IOPS)
>>>
>>> randrw (read)   2156 MB/s       2560 MB/s      3932 MB/s
>>>                  (526k IOPS)     (625k IOPS)    (960k IOPS)
>>>
>>> randrw (write)  2155 MB/s       2560 MB/s      3932 MB/s
>>>                  (526k IOPS)     (625k IOPS)    (960k IOPS)
>>>
>>> Observation:
>>> When CPU count exceeds NIC queue count, the baseline configuration
>>> suffers from queue contention. The proposed changes provide modest
>>> improvements on their own, and when combined with queue-aware tuning
>>> (IRQ affinity, ntuple steering, and CPU alignment), enable up to
>>> ~1.5x–2.5x throughput improvement.
>>>
>>> Scenario 2: NIC queues == CPU count
>>> -----------------------------------
>>>
>>> - CPUs: 72
>>> - NIC queues: 72
>>>
>>>                  Baseline                Patched + tuning
>>> randread        4310 MB/s               7987 MB/s
>>>                  (1052k IOPS)            (1950k IOPS)
>>>
>>> randwrite       7947 MB/s               7972 MB/s
>>>                  (1940k IOPS)            (1946k IOPS)
>>>
>>> randrw (read)   3583 MB/s               4030 MB/s
>>>                  (875k IOPS)             (984k IOPS)
>>>
>>> randrw (write)  3583 MB/s               4029 MB/s
>>>                  (875k IOPS)             (984k IOPS)
>>>
>>> Observation:
>>> When NIC queues are already aligned with CPU count, the baseline performs
>>> well. The proposed changes maintain write performance (no regression) and
>>> still improve read and mixed workloads due to better flow-to-CPU locality.
>>>
>>> Notes on tuning
>>> ---------------
>>> The "patched + tuning" configuration includes:
>>>      - aligning NVMe/TCP I/O workers with NIC queue count
>>>      - IRQ affinity configuration per RX queue
>>>      - ntuple-based flow steering
>>>      - CPU/queue affinity alignment
>>>
>>> These tuning steps are enabled by the queue/flow information exposed through
>>> this patchset.
>>>
>>> Discussion
>>> ----------
>>> This RFC aims to start discussion around:
>>>    - Whether NVMe/TCP queue scaling should consider NIC queue topology
>>>    - How best to expose queue/flow information to userspace
>>>    - The role of userspace vs kernel in steering decisions
>>>
>>> As usual, feedback/comment/suggestions are most welcome!
>>>
>>> Reference to LSF/MM/BPF abstract: https://lore.kernel.org/all/5db8ce78-0dfa-4dcb-bf71-5fb9c8f463e5@linux.ibm.com/
>>>
>>
>> Weelll ... we have been debating this back and forth over recent years:
>> Should we check for hardware limitations for NVMe-over-Fabrics or not?
>>
>> Initially it sounds appealing, and in fact I've worked on several
>> attempts myself. But in the end there are far more things which need
>> to be considered:
>> -> For networking, number of queues is not really telling us anything.
>>    Most NICs have distinct RX and TX queues, and the number (of both!)
>>    varies quite dramatically.
>> -> The number of queues does _not_ indicate that all queues are used
>>    simultaneously. That is down to things like RSS and friends.
>>    I gave a stab at configuring _that_ but it's patently horrible
>>    trying to out-guess things for yourself.
>> -> It'll only work if you run directly on the NIC. As soon as there
>>    is anything in between (qemu? Tunnelling?) you are out of luck.
>>
>> So yeah, we should have a discussion here.
> 
> TBH, I don't think that this is very useful. I mentioned some of the reasons
> why on patch #1.
> 
> But the main reason is that I think the majority of the gains you are showing
> come from the tuning, which is somewhat unrelated to the driver and which,
> TBH, I doubt anyone will actually do in practice.

Even without additional tuning, aligning the NVMe/TCP I/O workers with
CPU and NIC queue locality already provides measurable performance
benefits (primarily visible in random write workloads, as shown in
Scenario 1).
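
To make the untuned gain concrete, here is a small sketch that recomputes
the relative change from the Scenario 1 table in the cover letter. The
throughput figures are copied from that table; the helper name gain_pct
is only illustrative and not part of the patchset.

```python
# Scenario 1 throughput in MB/s, copied from the cover letter table:
# (baseline, patched, patched + tuning)
scenario1 = {
    "randread": (3141, 3228, 7509),
    "randwrite": (4510, 6172, 7518),
    "randrw (read)": (2156, 2560, 3932),
    "randrw (write)": (2155, 2560, 3932),
}


def gain_pct(base, new):
    """Relative change of new over base, in percent."""
    return (new - base) / base * 100.0


for name, (base, patched, tuned) in scenario1.items():
    print(f"{name:15s} patched {gain_pct(base, patched):+6.1f}%  "
          f"patched+tuning {gain_pct(base, tuned):+6.1f}%")
```

With these inputs, the patched-only column comes out at roughly +3% for
randread, +37% for randwrite and +19% for each direction of randrw.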

The additional gains come from system-level tuning (e.g., XPS/RPS/RSS),
which further improves utilization of NIC queues and CPU locality.
However, it is this patchset that enables such tuning, by exposing
queue/flow information and establishing better default alignment.
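
As a rough userspace illustration of the queue-count clamp from the cover
letter, the sketch below counts the rx-N and tx-N entries under
/sys/class/net/<dev>/queues and applies the min() rule. It makes two
assumptions: that those sysfs entries reflect the active queue counts,
and that the limit is the smaller of the RX and TX counts (the cover
letter only writes min(num_online_cpus, netdev->real_num_{tx,rx}_queues)).
The function names are hypothetical; this is not the driver code.

```python
import os
import re


def count_queues(queues_dir):
    """Return (rx, tx) counts from a /sys/class/net/<dev>/queues style directory."""
    names = os.listdir(queues_dir)
    rx = sum(1 for n in names if re.fullmatch(r"rx-\d+", n))
    tx = sum(1 for n in names if re.fullmatch(r"tx-\d+", n))
    return rx, tx


def io_queue_limit(online_cpus, rx, tx):
    """Clamp the I/O queue count to the CPU count and the NIC queue counts.

    Assumption: use the smaller of RX and TX, and never go below one queue.
    """
    return max(1, min(online_cpus, rx, tx))
```

For the two test setups this gives io_queue_limit(72, 32, 32) == 32 and
io_queue_limit(72, 72, 72) == 72. On a live system, os.cpu_count() can
stand in for the CPU count, although it is only an approximation of
num_online_cpus().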

While such tuning may not be applied in all deployments, IMO, it should be
commonly used in performance-sensitive environments where users aim to
fully utilize available NIC and CPU resources.

Thanks,
--Nilay