[LSF/MM/BPF Topic] Towards more useful nvme-passthrough
Hannes Reinecke
hare at suse.de
Thu Jun 24 02:24:27 PDT 2021
On 6/9/21 12:50 PM, Kanchan Joshi wrote:
> Background & objectives:
> ------------------------
>
> The NVMe passthrough interface
>
> Good part: allows new device features to be usable (at least in raw
> form) without having to build block-generic cmds, in-kernel users,
> emulations and file-generic user-interfaces - all of which take time
> to evolve.
>
> Bad part: the passthrough interface has remained tied to synchronous
> ioctl, which is a blocker for performance-centric usage scenarios.
> User-space can take the pain of implementing async-over-sync on its own,
> but that does not make much sense in a world that already has io_uring.
>
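To make the async-over-sync point concrete, here is a minimal user-space sketch of what that workaround tends to look like: one worker thread per outstanding command, each parked in a blocking call. Everything in it is an assumption for illustration. `blocking_passthrough` is a hypothetical stand-in that merely sleeps (it does not issue an ioctl), and `run_async_over_sync` is an invented helper name.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_passthrough(cmd_id, latency_s=0.01):
    """Hypothetical stand-in for a synchronous passthrough ioctl.

    It only sleeps to mimic a call that blocks until the device completes.
    """
    time.sleep(latency_s)
    return cmd_id


def run_async_over_sync(n_cmds, queue_depth, latency_s=0.01):
    """Keep up to `queue_depth` blocking calls in flight, one thread each.

    Returns (results in submission order, elapsed seconds).
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=queue_depth) as pool:
        results = list(
            pool.map(lambda i: blocking_passthrough(i, latency_s), range(n_cmds))
        )
    return results, time.monotonic() - start
```

With `queue_depth=1` the commands simply serialize. Raising it buys concurrency, but only at the price of one blocked thread per in-flight command, which is the overhead an io_uring-based path would avoid.
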
> Passthrough is lean in the sense that it cuts through layers of
> abstraction and reaches NVMe fast. One of the objectives here is to build
> a scalable pass-through that can be readily used to play with new/emerging
> NVMe features. Another is to match or surpass existing raw/direct block
> I/O performance with this new in-kernel path.
>
> Recent developments:
> --------------------
> - NVMe now has a per-namespace char interface that remains available/usable
> even for unsupported features and for new command-sets [1].
>
> - Jens has proposed an async-ioctl-like facility, 'uring_cmd', in io_uring.
> This introduces new possibilities (beyond storage); async-passthrough is
> one of those. The last posted version is V4 [2].
>
> - I have posted work on async nvme passthrough over block-dev [3]. Posted work
> is in V4 (in sync with the infra of [2]).
>
> Early performance numbers:
> --------------------------
> fio, randread, 4k bs, 1 job
> Kiops, with varying QD:
>
> QD     Sync-PT    io_uring    Async-PT
> 1      10.8       10.6        10.6
> 2      10.9       24.5        24
> 4      10.6       45          46
> 8      10.9       90          89
> 16     11.0       169         170
> 32     10.6       308         307
> 64     10.8       503         506
> 128    10.9       592         596
>
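For reading the numbers above, a small sketch of the arithmetic may help. It copies the table verbatim and derives two things: the Async-PT speedup over Sync-PT at each QD, and a mean per-command latency estimated with Little's law (in-flight commands = throughput x latency). The helper names are invented for illustration, and the latency figure is only an estimate that assumes the queue is kept full at the stated depth.

```python
# (QD, Sync-PT, io_uring, Async-PT) in KIOPS, copied from the quoted table.
ROWS = [
    (1, 10.8, 10.6, 10.6),
    (2, 10.9, 24.5, 24),
    (4, 10.6, 45, 46),
    (8, 10.9, 90, 89),
    (16, 11.0, 169, 170),
    (32, 10.6, 308, 307),
    (64, 10.8, 503, 506),
    (128, 10.9, 592, 596),
]


def speedup_over_sync(rows):
    """Map each QD to the ratio Async-PT / Sync-PT."""
    return {qd: async_pt / sync_pt for qd, sync_pt, _uring, async_pt in rows}


def mean_latency_us(queue_depth, kiops):
    """Little's law estimate: latency = in-flight commands / throughput."""
    return queue_depth / (kiops * 1e3) * 1e6


if __name__ == "__main__":
    for qd, ratio in speedup_over_sync(ROWS).items():
        print(f"QD {qd:>3}: Async-PT is {ratio:.1f}x Sync-PT")
```

Sync-PT stays between 10.6 and 11.0 KIOPS at every QD, which is what one would expect if the blocking ioctl keeps only one command in flight per job. Async-PT tracks io_uring closely and comes out at roughly 55x Sync-PT at QD 128.
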
> Further steps/discussion points:
> --------------------------------
> 1. Async-passthrough over nvme char-dev
> It is in a shape to receive feedback, but I am not sure if the community
> would like to take a look at that before settling on the uring-cmd infra.
>
> 2. Once the above gets in shape, bring other perf-centric features of
> io_uring to this path -
> A. SQPoll and register-file: already functional.
> B. Passthrough polling: This can be enabled for block and looks feasible for
> the char-interface as well. Keith recently posted on enabling polling for
> user pass-through [4].
> C. Pre-mapped buffers: Early thought is to let the buffers be registered by
> io_uring, and add a new passthrough ioctl/uring_cmd in the driver which does
> everything that passthrough does except pinning/unpinning the pages.
>
> 3. Are there more things in the "io_uring->nvme->[block-layer]->nvme" path
> which can be optimized?
>
> Ideally I'd like to cover a good deal of ground before Dec. But there seem
> to be plenty of possibilities on this path. Discussion would help in how
> best to move forward, and cement the ideas.
>
> [1] https://lore.kernel.org/linux-nvme/20210421074504.57750-1-minwoo.im.dev@gmail.com/
> [2] https://lore.kernel.org/linux-nvme/20210317221027.366780-1-axboe@kernel.dk/
> [3] https://lore.kernel.org/linux-nvme/20210325170540.59619-1-joshi.k@samsung.com/
> [4] https://lore.kernel.org/linux-block/20210517171443.GB2709391@dhcp-10-100-145-180.wdc.com/#t
>
I do like the idea.

What I would like to see is the uring_cmd infrastructure made
generally available, such that we can port the SCSI sg asynchronous
interface over to it.

Doug Gilbert has been fighting a lone battle to improve the sg
asynchronous interface, as the current one is deemed a security hazard.
But in the absence of a generic interface he had to design his own
ioctls, with all the expected pushback.
Plus there are only so many people who care about sg internals :-(

Being able to use uring_cmd would be a neat way out of this.

Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer