[PATCH 0/7] nvme-tcp scalability improvements
Sagi Grimberg
sagi at grimberg.me
Wed Jun 26 06:37:38 PDT 2024
On 26/06/2024 15:13, Hannes Reinecke wrote:
> From: Hannes Reinecke <hare at suse.de>
>
> Hi all,
>
> we have had reports from partners that nvme-tcp suffers from scalability
> problems as the number of controllers grows; they even managed to run
> into a request timeout simply by connecting enough controllers to the host.
>
> Looking into it, I found several issues with the nvme-tcp implementation:
> - the 'io_cpu' assignment is static, leading to the same calculation
> for each controller. Thus queue N of every controller is assigned
> the same CPU, leading to CPU starvation.
> - The blk-mq cpu mapping is not taken into account when calculating
> the 'io_cpu' number, leading to excessive thread bouncing during I/O.
> - The socket state is not evaluated, so we're piling more and more
> requests onto the socket even when it's already full.
>
> This patchset addresses these issues, leading to a better I/O
> distribution for several controllers.
Hannes, please quantify what each change gives us. Each change on its
own should present merit; otherwise we should consider whether it is
actually needed.
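
For context: the io_cpu assignment being fixed here is a pure function
of the queue id. In simplified form (a sketch of the current behaviour,
not the exact upstream code) it amounts to:

/*
 * Simplified sketch: the CPU that a queue's io_work is pinned to
 * depends only on the queue id, so queue N of *every* connected
 * controller lands on the same online CPU.
 */
static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
{
        int n = nvme_tcp_queue_id(queue) - 1;

        queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask,
                                          -1, false);
}

With enough controllers this saturates the first few CPUs while the
rest of the machine idles, which matches the starvation described above.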
>
> Performance for read increases from:
> 4k seq read: bw=368MiB/s (386MB/s), 11.5MiB/s-12.7MiB/s
> (12.1MB/s-13.3MB/s), io=16.0GiB (17.2GB), run=40444-44468msec
> 4k rand read: bw=360MiB/s (378MB/s), 11.3MiB/s-12.1MiB/s
> (11.8MB/s-12.7MB/s), io=16.0GiB (17.2GB), run=42310-45502msec
> to:
> 4k seq read: bw=520MiB/s (545MB/s), 16.3MiB/s-21.1MiB/s
> (17.0MB/s-22.2MB/s), io=16.0GiB (17.2GB), run=24208-31505msec
> 4k rand read: bw=533MiB/s (559MB/s), 16.7MiB/s-22.2MiB/s
> (17.5MB/s-23.3MB/s), io=16.0GiB (17.2GB), run=23014-30731msec
>
> However, peak write performance degrades from:
> 4k seq write: bw=657MiB/s (689MB/s), 20.5MiB/s-20.7MiB/s
> (21.5MB/s-21.8MB/s), io=16.0GiB (17.2GB), run=24678-24950msec
> 4k rand write: bw=687MiB/s (720MB/s), 21.5MiB/s-21.7MiB/s
> (22.5MB/s-22.8MB/s), io=16.0GiB (17.2GB), run=23559-23859msec
> to:
> 4k seq write: bw=535MiB/s (561MB/s), 16.7MiB/s-19.9MiB/s
> (17.5MB/s-20.9MB/s), io=16.0GiB (17.2GB), run=25707-30624msec
> 4k rand write: bw=560MiB/s (587MB/s), 17.5MiB/s-22.3MiB/s
> (18.4MB/s-23.4MB/s), io=16.0GiB (17.2GB), run=22977-29248msec
>
> which is not surprising, seeing that the original implementation would
> push as many writes as possible to the workqueue, with complete
> disregard for the utilisation of the queue (which is precisely the
> issue we're addressing here).
Well, I do not expect performance to degrade here. It's a noticeable
drop, I'd say.
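
For reference, my reading of the SOCK_NOSPACE handling is roughly the
following (a hypothetical sketch; the helper name is made up and is not
taken from the patch):

/*
 * Hypothetical sketch of SOCK_NOSPACE-style backpressure: only
 * queue more data if the socket can take it, otherwise set
 * SOCK_NOSPACE so sk->sk_write_space() wakes us up again once the
 * send buffer drains.
 */
static bool nvme_tcp_sock_writeable(struct sock *sk)
{
        if (sk_stream_is_writeable(sk))
                return true;

        set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
        return false;
}

Backpressure like this bounds how much we pile onto a full socket,
which is the stated goal; whether it has to cost this much peak write
bandwidth is the open question.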
>
> Hannes Reinecke (5):
> nvme-tcp: align I/O cpu with blk-mq mapping
> nvme-tcp: distribute queue affinity
> nvmet-tcp: add wq_unbound module parameter
> nvme-tcp: SOCK_NOSPACE handling
> nvme-tcp: make softirq_rx the default
>
> Sagi Grimberg (2):
> net: micro-optimize skb_datagram_iter
> nvme-tcp: receive data in softirq
>
> drivers/nvme/host/tcp.c | 126 ++++++++++++++++++++++++++++----------
> drivers/nvme/target/tcp.c | 34 +++++++---
> net/core/datagram.c | 4 +-
> 3 files changed, 122 insertions(+), 42 deletions(-)
>