[RFC PATCH 4/5] nvmf-tcp: add support for retrieving adapter link speed

Nilay Shroff nilay at linux.ibm.com
Tue Sep 23 10:58:20 PDT 2025



On 9/23/25 3:57 PM, Hannes Reinecke wrote:
> On 9/23/25 11:33, Nilay Shroff wrote:
>>
>>
>> On 9/22/25 1:08 PM, Hannes Reinecke wrote:
>>> On 9/21/25 13:12, Nilay Shroff wrote:
>>>> Add support for retrieving the negotiated NIC link speed (in Mbps).
>>>> This value can be factored into path scoring for the adaptive I/O
>>>> policy. For visibility and debugging, a new sysfs attribute "speed"
>>>> is also added under the NVMe path block device.
>>>>
>>>> Signed-off-by: Nilay Shroff <nilay at linux.ibm.com>
>>>> ---
>>>>    drivers/nvme/host/multipath.c | 11 ++++++
>>>>    drivers/nvme/host/nvme.h      |  3 ++
>>>>    drivers/nvme/host/sysfs.c     |  5 +++
>>>>    drivers/nvme/host/tcp.c       | 66 +++++++++++++++++++++++++++++++++++
>>>>    4 files changed, 85 insertions(+)
>>>>
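(For context, one way to look this up on the nvme-tcp side is to resolve the
egress net_device from the queue's socket and ask the ethtool core for the
negotiated speed. The sketch below is illustrative only -- the function name
is mine and this is not the code from the patch:)

#include <linux/ethtool.h>
#include <linux/rtnetlink.h>
#include <net/dst.h>
#include <net/sock.h>

/* Illustrative helper: return the negotiated link speed in Mbps for the
 * net_device behind @sock, or SPEED_UNKNOWN if it cannot be determined. */
static int nvme_tcp_link_speed(struct socket *sock)
{
	struct ethtool_link_ksettings ks = {};
	struct dst_entry *dst;
	int speed = SPEED_UNKNOWN;

	/* sk_dst_get() takes a reference on the dst entry; may return NULL */
	dst = sk_dst_get(sock->sk);
	if (!dst)
		return SPEED_UNKNOWN;

	/* __ethtool_get_link_ksettings() must be called with RTNL held */
	rtnl_lock();
	if (dst->dev && !__ethtool_get_link_ksettings(dst->dev, &ks))
		speed = ks.base.speed;	/* negotiated speed in Mbps */
	rtnl_unlock();

	dst_release(dst);
	return speed;
}

(With the sysfs attribute in place, the value would presumably be readable via
something like "cat /sys/block/nvme0c1n1/speed", the exact path depending on
the controller/namespace instance.)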
>>> Why not for FC? We can easily extract the link speed from there, too ...
>>>
>> Yes, it's easy to get the speed for FC. I just wanted to get feedback from
>> the community about this idea and so didn't include it. But I will do that
>> in a future patchset.
>>  
>>> But why do we need to do that? We already calculated the weighted
>>> average, so we _know_ the latency of each path. And then it's
>>> pretty much immaterial if a path runs with a given speed; if the
>>> latency is lower, that path is being preferred.
>>> Irrespective of the speed, which might be deceptive anyway as
>>> you'll only ever be able to retrieve the speed of the local
>>> link, not of the entire path.
>>>
>> Consider a scenario with two paths: one over a high-capacity link
>> (e.g. 1000 Mbps) and another over a much smaller link (e.g. 10 Mbps).
>> If both paths report the same latency, the current formula would
>> assign them identical weights. But in reality, the higher-capacity
>> path can sustain a larger number of I/Os compared to the lower-
>> capacity one.
>>
> That would be correct if the transfer time were assumed to be negligible.
> But I would assume that we do transfer mainly in units of PAGE_SIZE,
> so with 4k PAGE_SIZE we'll spend roughly 3.3 ms on a 10Mbps link, but
> only about 33 µs on a 1000Mbps link. That actually is one of the issues
> we're facing with
> measuring latency: we only have access to the combined latency
> (submission/data transfer/completion), so it's really hard to separate
> them out.
> 
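(For concreteness: a 4 KiB page is 32768 bits, so the wire time alone is
32768 bits / 10 Mbit/s ≈ 3.3 ms versus 32768 bits / 1000 Mbit/s ≈ 33 µs,
a 100x gap that ends up folded into the single combined latency figure we
can measure.)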
>> In such cases, factoring in link speed allows us to assign proportionally
>> higher weight to the higher-capacity path. At the same time, if that same
>> path exhibits higher latency, it will be penalized accordingly, ensuring
>> the final score balances both latency and bandwidth.
>>
> See above. If we could measure them separately, yes. But we can't.
> 
>> So, including link speed in the weight calculation helps capture both
>> dimensions—latency sensitivity and throughput capacity—leading to a more
>> accurate and proportional I/O distribution.
>>
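(Purely to illustrate the weighting under discussion -- this is not the
formula from the patchset: with a score along the lines of
score = link_speed / smoothed_latency, two paths with equal measured
latency but 1000 Mbps vs 10 Mbps links would score 100:1, whereas a
latency-only score treats them as identical.)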
> Would be true if we could measure it properly. But we can only get the
> speed on the local link; everything behind that is anyone's guess, and
> it would skew measurements even more if we assume the same link speed
> for the entire path.
> 
Yes, you raise a very good point, and I agree that we can’t reliably
determine the end-to-end path capacity. Assuming the same link speed
beyond the first hop may not always be correct and could easily skew
the measurement.
Given that limitation, I agree it would be better to exclude link speed
from the path scoring formula. Without a way to accurately capture the 
full path capacity, incorporating only the local link speed risks making
the scoring misleading rather than more accurate.

Thanks,
--Nilay


