[PATCH RFC 0/2] block,nvme: latency-based I/O scheduler

Hannes Reinecke hare at suse.de
Thu Mar 28 04:32:05 PDT 2024


On 3/28/24 11:38, Sagi Grimberg wrote:
> 
> 
> On 26/03/2024 17:35, Hannes Reinecke wrote:
>> Hi all,
>>
>> there had been several attempts to implement a latency-based I/O
>> scheduler for native nvme multipath, all of which had their issues.
>>
>> So time to start afresh, this time using the QoS framework
>> already present in the block layer.
>> It consists of two parts:
>> - a new 'blk-nodelat' QoS module, which is just a simple per-node
>>    latency tracker
>> - a 'latency' nvme I/O policy
>>
>> Using the 'tiobench' fio script I'm getting:
>>    WRITE: bw=531MiB/s (556MB/s), 33.2MiB/s-52.4MiB/s (34.8MB/s-54.9MB/s), io=4096MiB (4295MB), run=4888-7718msec
>>    WRITE: bw=539MiB/s (566MB/s), 33.7MiB/s-50.9MiB/s (35.3MB/s-53.3MB/s), io=4096MiB (4295MB), run=5033-7594msec
>>     READ: bw=898MiB/s (942MB/s), 56.1MiB/s-75.4MiB/s (58.9MB/s-79.0MB/s), io=4096MiB (4295MB), run=3397-4560msec
>>     READ: bw=1023MiB/s (1072MB/s), 63.9MiB/s-75.1MiB/s (67.0MB/s-78.8MB/s), io=4096MiB (4295MB), run=3408-4005msec
>>
>> for 'round-robin' and
>>
>>    WRITE: bw=574MiB/s (601MB/s), 35.8MiB/s-45.5MiB/s (37.6MB/s-47.7MB/s), io=4096MiB (4295MB), run=5629-7142msec
>>    WRITE: bw=639MiB/s (670MB/s), 39.9MiB/s-47.5MiB/s (41.9MB/s-49.8MB/s), io=4096MiB (4295MB), run=5388-6408msec
>>     READ: bw=1024MiB/s (1074MB/s), 64.0MiB/s-73.7MiB/s (67.1MB/s-77.2MB/s), io=4096MiB (4295MB), run=3475-4000msec
>>     READ: bw=1013MiB/s (1063MB/s), 63.3MiB/s-72.6MiB/s (66.4MB/s-76.2MB/s), io=4096MiB (4295MB), run=3524-4042msec
>>
>> for 'latency' with 'decay' set to 10.
>> That's on a 32G FC testbed running against a brd target,
>> fio running with 16 threads.
> 
> Can you quantify the improvement? Also, the name 'latency' suggests
> that latency should be improved, no?
> 
'latency' refers to the 'latency-based' I/O scheduler, i.e. it selects
the path with the lowest latency. It does not necessarily _improve_
the latency; e.g. for truly symmetric fabrics it doesn't.
It _does_ improve matters when running on asymmetric fabrics
(e.g. on a two-socket system with two PCI HBAs, each connected to one
socket, or like the example above with one path via 'loop' and the
other via 'tcp' to address '127.0.0.1').
And, of course, it helps on congested fabrics, where it should be
able to direct I/O to the least congested path.
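
To make that concrete, here is a minimal userspace sketch (not the
actual patch; names like 'demo_path' and 'select_path' are made up
for illustration): the policy simply picks the path whose tracked
latency average is currently the lowest.

#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for a multipath path with a tracked latency. */
struct demo_path {
        const char *name;
        uint64_t avg_lat_ns;    /* decayed average latency, updated on completion */
};

/* Pick the path with the lowest tracked latency. */
static struct demo_path *select_path(struct demo_path *paths, int nr)
{
        struct demo_path *best = NULL;
        int i;

        for (i = 0; i < nr; i++) {
                if (!best || paths[i].avg_lat_ns < best->avg_lat_ns)
                        best = &paths[i];
        }
        return best;
}

int main(void)
{
        struct demo_path paths[] = {
                { "path via 'loop'", 12000 },
                { "path via 'tcp'",  85000 },
        };

        printf("selected: %s\n", select_path(paths, 2)->name);
        return 0;
}

In the RFC the tracked value would presumably come from the
blk-nodelat per-node latency tracker attached to each path.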

But I'll see about extracting the latency numbers, too.

What I really wanted to show is that we _can_ track latency without
harming performance.
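
As a rough illustration of why the tracking can stay cheap: on
completion the measured latency only needs to be folded into a
decaying average, a handful of integer operations per request.
A small standalone sketch, assuming 'decay' acts roughly as the
weight 1/decay given to each new sample (the actual blk-nodelat
formula may differ):

#include <stdint.h>
#include <stdio.h>

/*
 * Decaying latency average: each new sample replaces 1/decay of the
 * old value, so older samples gradually age out. Whether blk-nodelat
 * uses exactly this formula is an assumption.
 */
static uint64_t decay_update(uint64_t avg_ns, uint64_t sample_ns,
                             unsigned int decay)
{
        if (!avg_ns)
                return sample_ns;       /* first sample seeds the average */
        return avg_ns - avg_ns / decay + sample_ns / decay;
}

int main(void)
{
        uint64_t samples[] = { 50000, 52000, 48000, 250000, 51000 };
        uint64_t avg = 0;
        unsigned int i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                avg = decay_update(avg, samples[i], 10);  /* decay=10, as in the run above */
                printf("sample %llu ns -> avg %llu ns\n",
                       (unsigned long long)samples[i],
                       (unsigned long long)avg);
        }
        return 0;
}

With decay=10, a single slow completion nudges the average but does
not dominate it, so the policy reacts to persistently slow paths
rather than to every outlier.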

Cheers,

Hannes



