[PATCH rfc] nvme: support io stats on the mpath device

Sagi Grimberg sagi at grimberg.me
Mon Oct 3 01:35:52 PDT 2022



On 9/30/22 03:08, Jens Axboe wrote:
> On 9/29/22 10:25 AM, Sagi Grimberg wrote:
>>
>>>>> 3. Do you have some performance numbers (we're touching the fast path here) ?
>>>>
>>>> This is pretty lightweight; accounting is per-cpu and only wrapped by
>>>> preemption disable. This is a very small price to pay for what we gain.
>>>
>>> Is it? Enabling IO stats for normal devices has a very noticeable impact
>>> on performance at the higher end of the scale.
>>
>> Interesting, I didn't think this would be that noticeable. How much
>> would you quantify the impact in terms of %?
> 
> If we take it to the extreme - my usual peak benchmark, which is drive
> limited at 122M IOPS, run at 113M IOPS if I have iostats enabled. If I
> lower the queue depth (128 -> 16), then peak goes from 46M to 44M. Not
> as dramatic, but still quite noticeable. This is just using a single
> thread on a single CPU core per drive, so not throwing tons of CPU at
> it.
> 
> Now, I have no idea how well nvme multipath currently scales or works.

Should be pretty scalable and efficient. There is no bio cloning, and the
only shared state is an SRCU read-side section wrapping the path lookup in
the submission path.
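
For reference, a simplified sketch of that submission path, modeled on the
upstream nvme_ns_head_submit_bio() (bio splitting, requeue and error
handling elided):

/*
 * Simplified sketch of the nvme-multipath bio submission path.  There is
 * no bio clone: just an SRCU read-side section around the path lookup,
 * followed by a bio_set_dev() remap to the chosen bottom namespace.
 */
static void nvme_ns_head_submit_bio_sketch(struct bio *bio)
{
	struct nvme_ns_head *head = bio->bi_bdev->bd_disk->private_data;
	struct nvme_ns *ns;
	int srcu_idx;

	srcu_idx = srcu_read_lock(&head->srcu);
	ns = nvme_find_path(head);		/* pick an optimized, live path */
	if (likely(ns)) {
		bio_set_dev(bio, ns->disk->part0);	/* remap to the bottom namespace */
		bio->bi_opf |= REQ_NVME_MPATH;
		submit_bio_noacct(bio);
	}
	/* else: requeue when no usable path, or fail the bio (elided) */
	srcu_read_unlock(&head->srcu, srcu_idx);
}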

> Would be interesting to test that separately. But if you were to double
> (or more, I guess 3x if you're doing the exposed device and then adding
> stats to at least two below?) the overhead, that'd certainly not be
> free.

It is not 3x. In the patch, nvme-multipath accounts separately from the
bottom devices, so each request is accounted once for the bottom device it
was submitted on and once for the upper (mpath) device.

But again, my working assumption is that IO stats must be exposed for
an nvme-multipath device (unless the user disables them). So it is a
matter of whether we take a simple approach, where nvme-multipath does
"double" accounting, or we come up with a scheme that allows the driver
to collect stats on behalf of the block layer and then add non-trivial
logic to accurately combine stats like iops/bw/latency from the bottom
devices.

My vote would be to go with the former.
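
To make the "double" accounting concrete, here is a minimal illustration
using the block layer's generic bio accounting helpers. The helper names
are hypothetical and this is not the patch's actual API; it only shows
where the upper-device accounting would hook in, while blk-mq keeps
accounting the request on the bottom namespace's queue as it always has.

/* Illustration only -- helper names are hypothetical, not the patch's API. */
static inline unsigned long nvme_mpath_start_io_acct(struct bio *bio)
{
	/*
	 * Must run before the bio is remapped to the bottom namespace:
	 * bio_start_io_acct() charges the per-cpu ios/sectors/in_flight
	 * counters of bio->bi_bdev (still the mpath disk at this point)
	 * under part_stat_lock(), i.e. a preempt_disable() on non-RT
	 * kernels, and returns a timestamp for latency accounting.
	 */
	return bio_start_io_acct(bio);
}

static inline void nvme_mpath_end_io_acct(struct bio *bio,
					  unsigned long start_time,
					  struct block_device *mpath_bdev)
{
	/*
	 * Called from the completion path.  The _remapped variant is
	 * needed because bio->bi_bdev was redirected to the bottom
	 * namespace at submission time.
	 */
	bio_end_io_acct_remapped(bio, start_time, mpath_bdev);
}

The only real plumbing the simple approach needs is stashing the start
timestamp somewhere the completion path can reach (e.g. in the driver's
per-request private data).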

>> I don't have any insight on this for blk-mq, probably because I've never
>> seen any user turn IO stats off (or at least don't remember).
> 
> Most people don't care, but some certainly do. As per the above, it's
> noticeable enough that it makes a difference if you're chasing latencies
> or peak performance.
> 
>> My (very limited) testing did not show any noticeable differences for
>> nvme-loop. All I'm saying that we need to have IO stats for the mpath
>> device node. If there is a clever way to collect this from the hidden
>> devices just for nvme, great, but we need to expose these stats.
> 
>   From a previous message, sounds like that's just some qemu setup? Hard
> to measure anything there with precision in my experience, and it's not
> really peak performance territory either.

It's not qemu, it is null_blk exported over nvme-loop (an nvmet loop
device). So it is fast, but definitely not something that can provide
insight into behavior on real HW.


