[PATCH 11/11] selftests: mptcp: nvme: add iopolicy tests

Sun May 31 07:50:51 PDT 2026

On 5/31/26 7:34 PM, Nilay Shroff wrote:
> On 5/28/26 8:40 AM, Geliang Tang wrote:
>> From: Geliang Tang<tanggeliang at kylinos.cn>
>>
>> Add NVMe iopolicy testing to mptcp_nvme.sh, with the default set to
>> "numa". It can be set to "round-robin" or "queue-depth".
>>
>> Test results with 4 NVMe multipath paths and round-robin iopolicy show
>> that TCP and MPTCP achieve similar bandwidth:
>>
>>   # ./mptcp_nvme.sh tcp 4 round-robin
>>     READ: bw=455MiB/s (478MB/s), 455MiB/s-455MiB/s (478MB/s-478MB/s),
>>         io=4665MiB (4891MB), run=10242-10242msec
>>    WRITE: bw=455MiB/s (477MB/s), 455MiB/s-455MiB/s (477MB/s-477MB/s),
>>         io=4633MiB (4858MB), run=10184-10184msec
>>
>>   # ./mptcp_nvme.sh mptcp 4 round-robin
>>     READ: bw=445MiB/s (466MB/s), 445MiB/s-445MiB/s (466MB/s-466MB/s),
>>         io=4575MiB (4797MB), run=10287-10287msec
>>    WRITE: bw=445MiB/s (467MB/s), 445MiB/s-445MiB/s (467MB/s-467MB/s),
>>         io=4572MiB (4794MB), run=10267-10267msec
>>
>> A "loss" argument is added to simulate network packet loss. When loss=1,
>> each veth interface is configured with "delay 5ms loss 0.5%" using tc
>> qdisc. Under this scenario, TCP performance is reduced by multiples
>> compared to MPTCP:
>>
>>   # ./mptcp_nvme.sh tcp 4 round-robin 1
>>     READ: bw=144MiB/s (151MB/s), 144MiB/s-144MiB/s (151MB/s-151MB/s),
>>         io=1909MiB (2001MB), run=13231-13231msec
>>    WRITE: bw=100.0MiB/s (105MB/s), 100.0MiB/s-100.0MiB/s (105MB/s-105MB/s),
>>         io=1397MiB (1465MB), run=13980-13980msec
>>
>>   # ./mptcp_nvme.sh mptcp 4 round-robin 1
>>     READ: bw=428MiB/s (449MB/s), 428MiB/s-428MiB/s (449MB/s-449MB/s),
>>         io=4524MiB (4743MB), run=10564-10564msec
>>    WRITE: bw=431MiB/s (452MB/s), 431MiB/s-431MiB/s (452MB/s-452MB/s),
>>         io=4513MiB (4732MB), run=10481-10481msec
>>
>> These results demonstrate that MPTCP has better resilience against
>> packet loss compared to TCP, as it can leverage multiple subflows to
>> mitigate network degradation.
> 
> There are a few observations I'd like to raise:
> 
> 1. It is difficult to reason about the throughput results when NVMe native
>     multipath is enabled together with MPTCP. In this topology, four NVMe paths
>     are created and the round-robin I/O policy is configured. As a result, each
>     I/O first goes through the NVMe multipath scheduler, which selects a path,
>     and is then further subjected to the MPTCP scheduler, which selects a TCP
>     subflow. This means there are two independent schedulers influencing I/O
>     placement, making it difficult to attribute the observed throughput
>     improvements to either NVMe multipath or MPTCP.
> 
>     For throughput comparisons, it may be more meaningful to disable NVMe native
>     multipath (e.g., modprobe nvme_core multipath=n) when testing MPTCP. This would
>     ensure that all I/O is sent through a single NVMe/TCP path while allowing MPTCP
>     alone to distribute traffic across available subflows. Such a setup would
>     provide a clearer comparison between TCP and MPTCP.
> 
> 2. The current test uses only a 128 KiB I/O size. It would be useful to include
>     additional I/O sizes as well, such as 4 KiB, 8 KiB, and 32 KiB, since MPTCP and
>     NVMe multipath may behave differently under different workload characteristics.
> 
> 3. The fio runtime is only 10 seconds, which is relatively short for performance
>     evaluation. The results may be influenced by startup transients and may not
>     accurately reflect steady-state behavior. It would be preferable to run the tests
>     for a longer duration, for example 120 seconds, to obtain more stable measurements.
> 
> 4. The tests are run on the same host by setting up veth interfaces and running
>     host and target under different network namespaces. It'd be useful if you could
>     run these tests between real host and target systems.
> 
One more point I forgot to add:
The current tests use symmetric path characteristics (i.e., all paths
experience the same loss or rate limit). However, it would be useful to
simulate a scenario where paths behave asymmetrically (for instance, one path
experiences more loss or higher latency than the others). This would model
real-world network failures, and it would be interesting to see how MPTCP
performs compared to native NVMe multipath.
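
As a rough illustration of the asymmetric case, something along these lines
might work. This is only a sketch and is not taken from mptcp_nvme.sh: the
namespace name (ns_host), the interface names (veth1..veth4), and the degraded
values (20ms, 2%) are placeholders. The baseline "delay 5ms loss 0.5%" is the
setting quoted in the commit message. By default the commands are only
printed; set DRY_RUN=0 and run as root to apply them.

```shell
#!/bin/sh
# Sketch: give one veth path worse netem settings than the others.
# ns_host, veth1..veth4 and the degraded values are hypothetical placeholders.

DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
	if [ "$DRY_RUN" = "1" ]; then
		echo "$*"
	else
		"$@"
	fi
}

# netem_args <path index>: path 1 is degraded, the rest keep the baseline.
netem_args() {
	if [ "$1" -eq 1 ]; then
		echo "delay 20ms loss 2%"
	else
		echo "delay 5ms loss 0.5%"
	fi
}

# setup_asymmetric_paths <netns> <number of paths>
setup_asymmetric_paths() {
	ns="$1"
	npaths="$2"
	i=1
	while [ "$i" -le "$npaths" ]; do
		# Word splitting of the netem_args output is intentional here.
		run ip netns exec "$ns" tc qdisc replace dev "veth$i" root \
			netem $(netem_args "$i")
		i=$((i + 1))
	done
}

setup_asymmetric_paths ns_host 4
```

Running the same fio workload once with NVMe multipath over plain TCP and once
with MPTCP on top of this setup would show how each handles the degraded path.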

Thanks,
--Nilay