[RFC PATCH 4/5] nvmf-tcp: add support for retrieving adapter link speed
Hannes Reinecke
hare at suse.de
Tue Sep 23 03:27:41 PDT 2025
On 9/23/25 11:33, Nilay Shroff wrote:
>
>
> On 9/22/25 1:08 PM, Hannes Reinecke wrote:
>> On 9/21/25 13:12, Nilay Shroff wrote:
>>> Add support for retrieving the negotiated NIC link speed (in Mbps).
>>> This value can be factored into path scoring for the adaptive I/O
>>> policy. For visibility and debugging, a new sysfs attribute "speed"
>>> is also added under the NVMe path block device.
>>>
>>> Signed-off-by: Nilay Shroff <nilay at linux.ibm.com>
>>> ---
>>> drivers/nvme/host/multipath.c | 11 ++++++
>>> drivers/nvme/host/nvme.h | 3 ++
>>> drivers/nvme/host/sysfs.c | 5 +++
>>> drivers/nvme/host/tcp.c | 66 +++++++++++++++++++++++++++++++++++
>>> 4 files changed, 85 insertions(+)
>>>
>> Why not for FC? We can easily extract the link speed from there, too ...
>>
> Yes it's easy to get the speed for FC. I just wanted to get feedback from
> the community about this idea and so didn't include it. But I will do that
> in the future patchset.
>
>> But why do we need to do that? We already calculated the weighted
>> average, so we _know_ the latency of each path. And then it's
>> pretty much immaterial what speed a path runs at; if the latency
>> is lower, that path will be preferred -- irrespective of the speed,
>> which might be deceptive anyway, as you'll only ever be able to
>> retrieve the speed of the local link, not of the entire path.
>>
> Consider a scenario with two paths: one over a high-capacity link
> (e.g. 1000 Mbps) and another over a much smaller link (e.g. 10 Mbps).
> If both paths report the same latency, the current formula would
> assign them identical weights. But in reality, the higher-capacity
> path can sustain a larger number of I/Os compared to the lower-
> capacity one.
>
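For concreteness, the kind of weighting being proposed might look
roughly like the sketch below. This is purely illustrative; the
function and parameter names are hypothetical and not taken from the
posted patch.

/*
 * Illustrative only: combine the measured path latency with the
 * advertised link speed into a single weight.  Higher speed and
 * lower latency both increase the weight.
 */
static u64 nvme_path_weight(u64 avg_lat_ns, u32 speed_mbps)
{
	/* no samples yet, or speed unknown: fall back to a neutral weight */
	if (!avg_lat_ns || speed_mbps == SPEED_UNKNOWN)
		return 1;

	return div64_u64((u64)speed_mbps * NSEC_PER_SEC, avg_lat_ns);
}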
That would be correct if the transfer time were negligible.
But I would assume that we transfer mainly in units of PAGE_SIZE,
so with a 4k PAGE_SIZE we'll spend roughly 3.3 ms on a 10 Mbps link,
but only about 33 us on a 1000 Mbps link. That actually is one of the
issues we're facing with measuring latency: we only have access to the
combined latency (submission / data transfer / completion), so it's
really hard to separate them out.
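For reference, the per-page transfer time works out as follows,
assuming a 4 KiB page and ignoring protocol overhead:

    4 KiB = 32768 bits
    32768 bits / 10 Mbit/s   ~ 3.3 ms
    32768 bits / 1000 Mbit/s ~ 33 us

So the slow link is two orders of magnitude worse per page, yet both
contributions end up folded into the same measured latency.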
> In such cases, factoring in link speed allows us to assign proportionally
> higher weight to the higher-capacity path. At the same time, if that same
> path exhibits higher latency, it will be penalized accordingly, ensuring
> the final score balances both latency and bandwidth.
>
See above. If we could measure them separately, yes. But we can't.
> So, including link speed in the weight calculation helps capture both
> dimensions—latency sensitivity and throughput capacity—leading to a more
> accurate and proportional I/O distribution.
>
That would be true if we could measure it properly. But we can only get
the speed of the local link; everything behind that is anyone's guess,
and it would skew measurements even more if we assumed the same link
speed for the entire path.
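For context, on the TCP side retrieving the local link speed would
presumably look something like the sketch below -- querying the
net_device behind the queue's socket via ethtool. This is an
illustrative guess at the approach, not the code from the posted patch.

/*
 * Rough sketch: report the speed of the NIC carrying this socket,
 * or SPEED_UNKNOWN if it cannot be determined.
 */
static u32 nvme_tcp_link_speed(struct socket *sock)
{
	struct ethtool_link_ksettings ks;
	struct dst_entry *dst;
	u32 speed = SPEED_UNKNOWN;

	rtnl_lock();
	dst = sk_dst_get(sock->sk);
	if (dst && dst->dev && !__ethtool_get_link_ksettings(dst->dev, &ks))
		speed = ks.base.speed;
	dst_release(dst);
	rtnl_unlock();

	return speed;
}

And even a perfect answer here only describes the first hop, which is
exactly the limitation above.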
Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich