[RFC 0/3] nvme uring passthrough diet
Kanchan Joshi
joshi.k at samsung.com
Fri May 5 01:14:55 PDT 2023
On Wed, May 03, 2023 at 09:20:04AM -0600, Keith Busch wrote:
>On Wed, May 03, 2023 at 12:57:17PM +0530, Kanchan Joshi wrote:
>> On Mon, May 01, 2023 at 08:33:03AM -0700, Keith Busch wrote:
>> > From: Keith Busch <kbusch at kernel.org>
>> >
>> > When you disable all the optional features in your kernel config and
>> > request queue, it looks like the normal request dispatching is just as
>> > fast as any attempts to bypass it. So let's do that instead of
>> > reinventing everything.
>> >
>> > This doesn't require additional queues or user setup. It continues to
>> > work with multiple threads and processes, and relies on the well tested
>> > queueing mechanisms that track timeouts, handle tag exhaustion, and sync
>> > with controller state needed for reset control, hotplug events, and
>> > other error handling.
>>
>> I agree with your point that there are some functional holes in
>> the complete-bypass approach. Still, the work needed to be done
>> to quantify the gain of the approach and decide whether the effort
>> to fill those holes is worthwhile.
>>
>> On your specific points
>> - requiring additional queues: not a showstopper IMO.
>> If queues are lying unused in the hardware, we can reap more performance
>> by giving those to the application. If not, we fall back to the existing
>> path. No disruption as such.
>
>The current way we're reserving special queues is bad, and we should
>try not to extend it further. It applies to the whole module and
>would steal resources from some devices that don't want poll queues.
>If you have a mix of device types in your system, the low end ones
>don't want to split their resources this way.
>
>NVMe has no problem creating new queues on the fly. Queue allocation
>doesn't have to be an initialization thing, but you would need to
>reserve the QID's ahead of time.
Totally in agreement with that. Jens also mentioned this point,
and I had added preallocation to my to-be-killed list. Thanks for
expanding.
Related to that, I think the one-qid-per-ring limit also needs to be
lifted. That would allow doing I/O on two or more devices with a
single ring, and we could see how well that scales.
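To make the idea concrete, here is a rough userspace sketch of reserving
QIDs up front while creating the actual queues lazily. All names
(qid_pool and friends) are made up for illustration; this is not the
driver's real code, just the shape of the idea:

```c
#include <stdint.h>

/* Hypothetical sketch: set a range of QIDs aside at controller init,
 * but defer the actual Create I/O SQ/CQ commands until an application
 * asks for a bypass queue. */

#define MAX_USER_QIDS 8

struct qid_pool {
	uint32_t reserved_bitmap;	/* QIDs set aside at init */
	uint32_t created_bitmap;	/* queues actually created on the fly */
};

/* Called once at init: just reserve QIDs, no admin commands issued. */
static void qid_pool_init(struct qid_pool *p)
{
	p->reserved_bitmap = (1u << MAX_USER_QIDS) - 1;
	p->created_bitmap = 0;
}

/* Called when a ring wants a queue: pick a reserved-but-uncreated QID;
 * a real driver would issue Create I/O CQ/SQ commands here. */
static int qid_pool_get(struct qid_pool *p)
{
	for (int qid = 0; qid < MAX_USER_QIDS; qid++) {
		uint32_t bit = 1u << qid;

		if ((p->reserved_bitmap & bit) && !(p->created_bitmap & bit)) {
			p->created_bitmap |= bit;
			return qid;
		}
	}
	return -1;	/* pool exhausted: fall back to the regular path */
}

/* Delete the queue on the fly; the QID stays reserved for reuse. */
static void qid_pool_put(struct qid_pool *p, int qid)
{
	p->created_bitmap &= ~(1u << qid);
}
```

Since only the QIDs are committed ahead of time, devices that never see a
bypass ring never lose a queue.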
>> - tag exhaustion: that is not missing; a retry will be made. I actually
>> wanted to do single command-id management at the io_uring level itself,
>> and that would have cleaned things up. But it did not fit in
>> because of submission/completion lifetime differences.
>> - timeout and other bits you mentioned: yes, those need more work.
>>
>> Now with the alternative proposed in this series, I doubt whether
>> similar gains are possible. Happy to be wrong if that happens.
>
>One other thing: the pure-bypass does appear better at low queue
>depths, but utilizing the plug for aggregated sq doorbell writes
>is a real win at higher queue depths from this series. Batching
>submissions at 4 deep is the tipping point on my test box; this
>series outperforms pure bypass at any higher batch count.
I see.
I hit the 5M cliff without plug/batching primarily because pure bypass
reduces the amount of code needed to do the I/O. But plug/batching is
needed to get beyond that.
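A toy model of why the plug wins at depth: every SQ doorbell ring is an
MMIO write, and plugging lets N queued submissions share a single one.
This is purely illustrative userspace code, not driver code:

```c
#include <stdint.h>

/* Model an NVMe submission queue where the only cost we track is the
 * number of doorbell (MMIO) writes. */
struct sq_model {
	uint32_t tail;			/* SQ tail index */
	unsigned long doorbell_writes;	/* stand-in for MMIO cost */
};

/* Unplugged path: ring the doorbell once per command. */
static void submit_one(struct sq_model *sq)
{
	sq->tail++;
	sq->doorbell_writes++;
}

/* Plugged path: queue n commands, ring the doorbell once at unplug. */
static void submit_batch(struct sq_model *sq, unsigned n)
{
	sq->tail += n;
	sq->doorbell_writes++;
}
```

At batch counts above the tipping point Keith mentions, amortizing that
one write per batch outweighs the shorter pure-bypass code path.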
If we create space for a pointer in io_uring_cmd, it can be added to
the plug list (in place of struct request). That would be one way to
sort out the plugging.
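Roughly what I have in mind, as a hypothetical userspace sketch (field
and type names are invented for illustration, not actual io_uring or
block-layer code):

```c
#include <stddef.h>

/* Sketch: give the command its own link pointer so it can sit on the
 * plug list where struct request would normally go. */
struct uring_cmd_sketch {
	int tag;
	struct uring_cmd_sketch *plug_next;	/* hypothetical plug hook */
};

struct plug_sketch {
	struct uring_cmd_sketch *head;
	unsigned count;
};

/* Queue a command on the plug instead of issuing it immediately. */
static void plug_add(struct plug_sketch *plug, struct uring_cmd_sketch *cmd)
{
	cmd->plug_next = plug->head;
	plug->head = cmd;
	plug->count++;
}

/* On unplug, walk the list and issue everything; this is where a single
 * doorbell write would cover the whole batch.  Returns commands issued. */
static unsigned plug_flush(struct plug_sketch *plug)
{
	unsigned issued = 0;

	for (struct uring_cmd_sketch *c = plug->head; c; c = c->plug_next)
		issued++;
	plug->head = NULL;
	plug->count = 0;
	return issued;
}
```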