[LSF/MM/BPF TOPIC] Topology-Aware NVMe-TCP I/O Queue Scaling and Worker Efficiency
Nilay Shroff
nilay at linux.ibm.com
Sun Feb 15 09:06:38 PST 2026
The NVMe-TCP host driver currently provisions I/O queues primarily based on CPU
availability rather than the capabilities and topology of the underlying network
interface. On modern systems with many CPUs but fewer NIC hardware queues, this
can lead to multiple NVMe-TCP I/O queues contending for the same transmit/receive
queue, increasing lock contention, cacheline bouncing, and tail latency.
This session explores making NVMe-TCP queue provisioning and execution more
network-aware. We propose aligning the number of NVMe-TCP I/O queues with the
number of NIC hardware TX/RX queues, and binding each I/O queue to CPUs that are
already affine to the corresponding NIC interrupt vectors. This aims to improve
cache locality and reduce cross-CPU wakeups in high-IOPS deployments.
We also examine the behavior of the NVMe-TCP I/O worker thread, which currently
operates under a fixed time budget (~1ms). In some workloads, the worker may
relinquish the CPU even when additional transmit/receive work is immediately
available. We propose exposing observability data such as per-worker I/O processing
counts, relinquish events, and CPU placement to better understand and potentially
tune this budget.
We plan to implement a proof-of-concept for these ideas ahead of the conference
and submit an RFC; if feasible, we will demonstrate the impact live using fio
workloads on a real system. This session seeks feedback on whether the NVMe-TCP host should
consider NIC topology when provisioning I/O queues, how tightly queue placement
should follow interrupt affinity, and whether additional observability or tunable
budgets for I/O workers would be useful. We also discuss potential interfaces
between networking and storage subsystems to support topology-aware queue scaling.
Thanks,
--Nilay