[LSF/MM/BPF Topic] Towards more useful nvme-passthrough

Hannes Reinecke hare at suse.de
Thu Jun 24 02:24:27 PDT 2021


On 6/9/21 12:50 PM, Kanchan Joshi wrote:
> Background & objectives:
> ------------------------
> 
> The NVMe passthrough interface:
> 
> Good part: it allows new device features to be usable (at least in raw
> form) without having to build block-generic commands, in-kernel users,
> emulations and file-generic user interfaces - all of which take time to
> evolve.
> 
> Bad part: the passthrough interface has remained tied to a synchronous
> ioctl (as sketched below), which is a blocker for performance-centric
> usage scenarios. User-space can take the pain of implementing
> async-over-sync on its own, but that makes little sense in a world that
> already has io_uring.
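> 
> For reference, a minimal sketch of today's synchronous path - a single NVMe
> read issued through NVME_IOCTL_IO64_CMD (the device path, nsid and 4K LBA
> size are assumptions for illustration):
> 
>   /* sync-pt.c: one synchronous NVMe read via the passthrough ioctl */
>   #include <fcntl.h>
>   #include <stdint.h>
>   #include <stdio.h>
>   #include <sys/ioctl.h>
>   #include <linux/nvme_ioctl.h>
> 
>   int main(void)
>   {
>           static char buf[4096];                  /* assumes a 4K LBA */
>           struct nvme_passthru_cmd64 cmd = {
>                   .opcode   = 0x02,               /* NVMe I/O: Read */
>                   .nsid     = 1,
>                   .addr     = (__u64)(uintptr_t)buf,
>                   .data_len = sizeof(buf),
>                   .cdw10    = 0,                  /* SLBA 31:0 (63:32 in cdw11) */
>                   .cdw12    = 0,                  /* NLB = 0 -> one block */
>           };
>           int fd = open("/dev/nvme0n1", O_RDONLY);
> 
>           if (fd < 0)
>                   return 1;
>           /* blocks until the command completes - this is the pain point */
>           if (ioctl(fd, NVME_IOCTL_IO64_CMD, &cmd) < 0)
>                   perror("NVME_IOCTL_IO64_CMD");
>           return 0;
>   }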
> 
> Passthrough is lean in the sense that it cuts through layers of
> abstraction and reaches NVMe quickly. One objective here is to build a
> scalable passthrough that can be readily used to play with new/emerging
> NVMe features. Another is to match or surpass existing raw/direct block
> I/O performance with this new in-kernel path.
> 
> Recent developments:
> --------------------
> - NVMe now has a per-namespace char interface that remains available/usable
>   even for unsupported features and for new command-sets [1].
> 
> - Jens has proposed 'uring_cmd', an async-ioctl-like facility in io_uring. This
>   opens up new possibilities (beyond storage); async passthrough is one of
>   those. The last posted version is V4 [2] (a rough submission sketch follows
>   this list).
> 
> - I have posted work on async nvme passthrough over the block-dev [3]. The
>   posted work is at V4 (in sync with the infrastructure of [2]).
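> 
> To make this concrete, below is a rough sketch of what an async read over
> such an interface could look like from userspace. Note this is illustrative
> only: the IORING_OP_URING_CMD opcode, the big-SQE command area and the
> struct nvme_uring_cmd layout used here are assumptions, not the posted
> patches.
> 
>   /* async-pt.c: sketch of an async NVMe read over uring_cmd (liburing) */
>   #include <fcntl.h>
>   #include <stdint.h>
>   #include <stdio.h>
>   #include <string.h>
>   #include <liburing.h>
>   #include <linux/nvme_ioctl.h>
> 
>   int main(void)
>   {
>           static char buf[4096];                  /* assumes a 4K LBA */
>           struct io_uring ring;
>           struct io_uring_sqe *sqe;
>           struct io_uring_cqe *cqe;
>           struct nvme_uring_cmd *cmd;
>           int fd = open("/dev/ng0n1", O_RDONLY);  /* per-namespace char dev */
> 
>           if (fd < 0)
>                   return 1;
>           /* passthrough needs the big SQE/CQE ring variants */
>           if (io_uring_queue_init(8, &ring,
>                                   IORING_SETUP_SQE128 | IORING_SETUP_CQE32))
>                   return 1;
> 
>           sqe = io_uring_get_sqe(&ring);
>           memset(sqe, 0, 2 * sizeof(*sqe));       /* SQE128 slot = 128 bytes */
>           sqe->opcode = IORING_OP_URING_CMD;
>           sqe->fd = fd;
>           sqe->cmd_op = NVME_URING_CMD_IO;        /* driver-defined sub-opcode */
>           sqe->user_data = 0x42;
> 
>           cmd = (struct nvme_uring_cmd *)sqe->cmd;
>           cmd->opcode = 0x02;                     /* NVMe I/O: Read */
>           cmd->nsid = 1;
>           cmd->addr = (__u64)(uintptr_t)buf;
>           cmd->data_len = sizeof(buf);            /* SLBA/NLB left at 0 */
> 
>           io_uring_submit(&ring);                 /* returns without waiting */
>           io_uring_wait_cqe(&ring, &cqe);         /* reap the completion */
>           printf("cqe res %d\n", cqe->res);
>           io_uring_cqe_seen(&ring, cqe);
>           return 0;
>   }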
> 
> Early performance numbers:
> --------------------------
> fio, randread, 4K block size, 1 job
> KIOPS, with varying queue depth (QD):
> 
> QD      Sync-PT         io_uring        Async-PT
> 1         10.8            10.6            10.6
> 2         10.9            24.5            24
> 4         10.6            45              46
> 8         10.9            90              89
> 16        11.0            169             170
> 32        10.6            308             307
> 64        10.8            503             506
> 128       10.9            592             596
> 
> Further steps/discussion points:
> --------------------------------
> 1. Async passthrough over the nvme char-dev:
> It is in shape to receive feedback, but I am not sure whether the community
> would like to look at it before settling on the uring_cmd infrastructure.
> 
> 2. Once the above is in shape, bring other perf-centric features of io_uring
> to this path:
> A. SQPoll and register-file: already functional.
> B. Passthrough polling: this can be enabled for the block interface and looks
> feasible for the char interface as well. Keith recently posted enabling
> polling for user passthrough [4].
> C. Pre-mapped buffers: the early thought is to let the buffers be registered
> by io_uring, and add a new passthrough ioctl/uring_cmd in the driver which
> does everything that passthrough does except pinning/unpinning the pages
> (a sketch follows this list).
> 
> 3. Are there more things in the "io_uring->nvme->[block-layer]->nvme" path
> which can be optimized?
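> 
> For 2C, the userspace half already exists in io_uring; a minimal sketch of it
> follows. How the passthrough command would then reference the registered
> buffer (e.g. by index) is the open design question - the buffer-index idea
> below is hypothetical.
> 
>   /* fixed-buf.c: userspace side of the pre-mapped buffer idea */
>   #include <stdlib.h>
>   #include <sys/uio.h>
>   #include <liburing.h>
> 
>   int main(void)
>   {
>           struct io_uring ring;
>           struct iovec iov = { .iov_len = 1 << 20 };      /* 1 MiB buffer */
> 
>           if (posix_memalign(&iov.iov_base, 4096, iov.iov_len))
>                   return 1;
>           if (io_uring_queue_init(8, &ring, 0))
>                   return 1;
>           /* pages are pinned once here, instead of on every command */
>           if (io_uring_register_buffers(&ring, &iov, 1))
>                   return 1;
>           /*
>            * ... passthrough submissions would then carry a buffer index
>            * (hypothetical) rather than a user pointer, so the driver can
>            * skip per-I/O pinning/unpinning ...
>            */
>           io_uring_unregister_buffers(&ring);
>           io_uring_queue_exit(&ring);
>           return 0;
>   }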
> 
> Ideally I'd like to cover a good deal of ground before December, but there
> seem to be plenty of possibilities on this path. Discussion would help
> determine how best to move forward, and cement the ideas.
> 
> [1] https://lore.kernel.org/linux-nvme/20210421074504.57750-1-minwoo.im.dev@gmail.com/
> [2] https://lore.kernel.org/linux-nvme/20210317221027.366780-1-axboe@kernel.dk/
> [3] https://lore.kernel.org/linux-nvme/20210325170540.59619-1-joshi.k@samsung.com/
> [4] https://lore.kernel.org/linux-block/20210517171443.GB2709391@dhcp-10-100-145-180.wdc.com/#t
> 
I do like the idea.

What I would like to see is the uring_cmd infrastructure made generally
available, so that we can port the SCSI sg asynchronous interface over
to it.
Doug Gilbert has been fighting a lone battle to improve the sg
asynchronous interface, as the current one is deemed a security hazard.
But in the absence of a generic interface he had to design his own
ioctls, with all the expected pushback.
Plus there are only so many people who care about sg internals :-(

Being able to use uring_cmd would be a neat way out of this.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare at suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


