[LSF/MM/BPF TOPIC] Topology-Aware NVMe-TCP I/O Queue Scaling and Worker Efficiency
Nilay Shroff
nilay at linux.ibm.com
Sun Feb 15 22:49:42 PST 2026
On 2/16/26 6:05 AM, Chaitanya Kulkarni wrote:
> On 2/15/26 09:06, Nilay Shroff wrote:
>
>> The NVMe-TCP host driver currently provisions I/O queues primarily based on CPU
>> availability rather than the capabilities and topology of the underlying network
>> interface. On modern systems with many CPUs but fewer NIC hardware queues, this
>> can lead to multiple NVMe-TCP I/O queues contending for the same transmit/receive
>> queue, increasing lock contention, cacheline bouncing, and tail latency.
>
> Can you share any performance work that you have done prior to the
> LSF session?
>
Yes — I’ve started prototyping the queue-scaling and CPU/IRQ-affinity changes and
have some early performance results from a local setup. These are still preliminary
and I’m continuing to expand testing, but the initial data looks promising enough
to motivate the discussion.
Test setup (current prototype):
- 32-CPU system
- NIC exposing 2 TX/RX queues
- fio (io_uring, direct=1, 32 jobs, iodepth=64)
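For reference, a fio job file approximating the workload above would look roughly like this; the target device path is illustrative, not the one from my setup:

```ini
; Approximate fio job matching the test parameters above.
; /dev/nvme0n1 is a placeholder target.
[global]
ioengine=io_uring
direct=1
numjobs=32
iodepth=64
time_based=1
runtime=60
filename=/dev/nvme0n1

[randread]
rw=randread
```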
Throughput results:

                 Without patch      With patch
  Randread:        263 MB/s           986 MB/s
  Randwrite:       849 MB/s          1047 MB/s
  Randrw:       R: 142 MB/s        R: 419 MB/s
                W: 142 MB/s        W: 419 MB/s
The largest gains appear in read-heavy and mixed workloads where multiple NVMe-TCP
queues were previously contending for a small number of NIC hardware queues. Aligning
I/O queue count with NIC queue count and improving CPU/IRQ locality significantly
reduced contention in this configuration.
I’m continuing to refine the prototype and expand testing across different queue
counts and NUMA layouts. My goal is to have more comprehensive data available ahead
of the LSFMM+BPF session, and if feasible I plan to demonstrate the impact live using
fio workloads.
Happy to share updated numbers as the work progresses.
Thanks,
--Nilay