[PATCH 0/7] nvme-tcp scalability improvements
Hannes Reinecke
hare at kernel.org
Wed Jun 26 05:13:40 PDT 2024
From: Hannes Reinecke <hare at suse.de>
Hi all,
we have had reports from partners that nvme-tcp suffers from scalability
problems as the number of controllers grows; they even managed to trigger
request timeouts simply by connecting enough controllers to the host.
Looking into it I have found several issues with the nvme-tcp implementation:
- the 'io_cpu' assignment is static, so the same calculation is done
  for each controller. Queues with the same queue number are therefore
  all assigned the same CPU, leading to CPU starvation.
- the blk-mq CPU mapping is not taken into account when calculating
  the 'io_cpu' number, leading to excessive thread bouncing during I/O.
- the socket state is not evaluated, so we keep piling more and more
  requests onto the socket even when it is already full.
This patchset addresses these issues, leading to a better I/O
distribution for several controllers.
Performance for read increases from:
4k seq read: bw=368MiB/s (386MB/s), 11.5MiB/s-12.7MiB/s
(12.1MB/s-13.3MB/s), io=16.0GiB (17.2GB), run=40444-44468msec
4k rand read: bw=360MiB/s (378MB/s), 11.3MiB/s-12.1MiB/s
(11.8MB/s-12.7MB/s), io=16.0GiB (17.2GB), run=42310-45502msec
to:
4k seq read: bw=520MiB/s (545MB/s), 16.3MiB/s-21.1MiB/s
(17.0MB/s-22.2MB/s), io=16.0GiB (17.2GB), run=24208-31505msec
4k rand read: bw=533MiB/s (559MB/s), 16.7MiB/s-22.2MiB/s
(17.5MB/s-23.3MB/s), io=16.0GiB (17.2GB), run=23014-30731msec
However, peak write performance degrades from:
4k seq write: bw=657MiB/s (689MB/s), 20.5MiB/s-20.7MiB/s
(21.5MB/s-21.8MB/s), io=16.0GiB (17.2GB), run=24678-24950msec
4k rand write: bw=687MiB/s (720MB/s), 21.5MiB/s-21.7MiB/s
(22.5MB/s-22.8MB/s), io=16.0GiB (17.2GB), run=23559-23859msec
to:
4k seq write: bw=535MiB/s (561MB/s), 16.7MiB/s-19.9MiB/s
(17.5MB/s-20.9MB/s), io=16.0GiB (17.2GB), run=25707-30624msec
4k rand write: bw=560MiB/s (587MB/s), 17.5MiB/s-22.3MiB/s
(18.4MB/s-23.4MB/s), io=16.0GiB (17.2GB), run=22977-29248msec
which is not surprising, seeing that the original implementation pushed
as many writes as possible onto the workqueue, with complete disregard
for the utilisation of the socket (which is precisely the issue we are
addressing here).
Hannes Reinecke (5):
nvme-tcp: align I/O cpu with blk-mq mapping
nvme-tcp: distribute queue affinity
nvmet-tcp: add wq_unbound module parameter
nvme-tcp: SOCK_NOSPACE handling
nvme-tcp: make softirq_rx the default
Sagi Grimberg (2):
net: micro-optimize skb_datagram_iter
nvme-tcp: receive data in softirq
drivers/nvme/host/tcp.c | 126 ++++++++++++++++++++++++++++----------
drivers/nvme/target/tcp.c | 34 +++++++---
net/core/datagram.c | 4 +-
3 files changed, 122 insertions(+), 42 deletions(-)
--
2.35.3