[LSF/MM/BPF TOPIC] NVMe over MPTCP: Multi-Fold Acceleration for NVMe over TCP in Multi-NIC Environments

Thu May 28 08:59:43 PDT 2026

On Tue, May 19, 2026 at 12:31 AM Geliang Tang <geliang at kernel.org> wrote:
> Lossless network, comparison between TCP and MPTCP using the "queue-
> depth" policy - MPTCP and TCP show similar performance:
>
> # ./mptcp_nvme.sh tcp 4 queue-depth 0
>    READ: bw=456MiB/s (478MB/s), 456MiB/s-456MiB/s (478MB/s-478MB/s),
>                         io=4632MiB (4857MB), run=10169-10169msec
>   WRITE: bw=455MiB/s (477MB/s), 455MiB/s-455MiB/s (477MB/s-477MB/s),
>                         io=4666MiB (4893MB), run=10250-10250msec
>
> # ./mptcp_nvme.sh mptcp 4 queue-depth 0
>    READ: bw=446MiB/s (467MB/s), 446MiB/s-446MiB/s (467MB/s-467MB/s),
>                         io=4568MiB (4790MB), run=10249-10249msec
>   WRITE: bw=445MiB/s (467MB/s), 445MiB/s-445MiB/s (467MB/s-467MB/s),
>                         io=4563MiB (4784MB), run=10245-10245msec
This makes much more sense to me.  Have you tested the case where one
path is _not_ flaky but is slower?  (three paths at 100 Gbps and one at
50 Gbps, or something like that)

>
>
> Second set of tests: lossy network, path=4, loss=1. The tc output is as
> follows:
>
>   qdisc netem 8051: root refcnt 25 limit 1000 delay 5ms loss 0.5%
>                         rate 1Gbit seed 14946049878654165618
>
>
> Lossy network, comparison between TCP and MPTCP using the "queue-depth"
> policy - MPTCP is four times faster than TCP:
>
> # ./mptcp_nvme.sh tcp 4 queue-depth 1
>    READ: bw=168MiB/s (176MB/s), 168MiB/s-168MiB/s (176MB/s-176MB/s),
>                         io=2179MiB (2285MB), run=12965-12965msec
>   WRITE: bw=128MiB/s (134MB/s), 128MiB/s-128MiB/s (134MB/s-134MB/s),
>                         io=1590MiB (1667MB), run=12418-12418msec
>
> # ./mptcp_nvme.sh mptcp 4 queue-depth 1
>    READ: bw=425MiB/s (445MB/s), 425MiB/s-425MiB/s (445MB/s-445MB/s),
>                         io=4536MiB (4756MB), run=10677-10677msec
>   WRITE: bw=414MiB/s (434MB/s), 414MiB/s-414MiB/s (434MB/s-434MB/s),
>                         io=4447MiB (4663MB), run=10733-10733msec
>
>
> Conclusion: MPTCP achieves bandwidth aggregation comparable to that of
> NVMe multipath while offering better resilience against network
> interference.
This is interesting.  So, with one of the 4 paths flaky, bandwidth
drops to 1/4, effectively a penalty of 2 paths (I was expecting more of a
penalty), while MPTCP can shake it off. Do you have a
hypothesis/understanding of why?
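
For reference, the size of the penalty can be read off the fio numbers
quoted above. The following is a minimal sketch, not part of the test
script: the table layout and the names BW, ratio and summarize are
illustrative only, and the bandwidth figures (MiB/s) are copied from the
quoted output.

```python
# Bandwidth in MiB/s, copied from the fio output quoted in this thread.
# The dictionary layout and helper names are illustrative assumptions.
BW = {
    ("tcp", "lossless"): {"read": 456, "write": 455},
    ("mptcp", "lossless"): {"read": 446, "write": 445},
    ("tcp", "lossy"): {"read": 168, "write": 128},
    ("mptcp", "lossy"): {"read": 425, "write": 414},
}


def ratio(num, den, op):
    """Return bandwidth of run `num` divided by run `den` for `op`."""
    return BW[num][op] / BW[den][op]


def summarize():
    """Compute the loss penalty per transport and the MPTCP/TCP gain."""
    out = {}
    for op in ("read", "write"):
        out[op] = {
            # Fraction of lossless bandwidth that survives under loss.
            "tcp_lossy_vs_lossless": ratio(("tcp", "lossy"), ("tcp", "lossless"), op),
            "mptcp_lossy_vs_lossless": ratio(("mptcp", "lossy"), ("mptcp", "lossless"), op),
            # How much faster MPTCP is than TCP on the lossy network.
            "mptcp_vs_tcp_lossy": ratio(("mptcp", "lossy"), ("tcp", "lossy"), op),
        }
    return out


if __name__ == "__main__":
    for op, row in summarize().items():
        for name, value in row.items():
            print(f"{op:5s} {name:26s} {value:.2f}")
```

For the quoted runs, TCP under loss keeps roughly 0.37 of its lossless
bandwidth for reads and 0.28 for writes, MPTCP keeps about 0.95 and 0.93,
and MPTCP comes out about 2.5x (reads) and 3.2x (writes) faster than TCP
on the lossy network.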

I have a guess that selective retransmission might be kicking in (which
would be good), but how is that different from the expected behavior of
IP/NIC bonding?  (which, I think, could be implemented without an
NVMe driver/protocol change?)  We generally point people away from
IP/NIC bonding, although I am (personally) not sure why.

Sincerely,
Randy Jennings