[LSF/MM/BPF TOPIC] Towards more useful nvme passthrough

Kanchan Joshi joshi.k at samsung.com
Mon Feb 28 01:25:11 PST 2022


Background & Objective:
-----------------------
New storage interfaces/features, especially in NVMe, are emerging
fast. NVMe now has 3 command sets (NVM, ZNS and KV), and the list is
only going to grow (e.g. computational storage). Many of these new
commands do not fit well into the existing block abstraction and/or
syscalls. Whether it is a somewhat specialized operation or a new way
of doing classical read/write (e.g. zone-append, copy command), it
takes a good deal of consensus and time for a new device interface to
climb the ladder of kernel abstractions and become available for
user-space consumption. This presents challenges for early adopters,
and at times leads to kernel bypass.

The passthrough interface cuts through those abstractions and lets
applications issue arbitrary nvme commands readily, much as
kernel-bypass solutions do. But passthrough does not scale, because
it travels over a synchronous ioctl interface; that is particularly
painful for fast, highly parallel NVMe storage.
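
For reference, the classic synchronous path looks roughly like the
sketch below: one ioctl per command, with the calling thread blocked
until completion. This is a minimal sketch; the device path, nsid and
the 512b LBA geometry are illustrative assumptions, and error
handling is trimmed.

/* sync-pt.c: one NVMe read via the classic sync passthru ioctl */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
        struct nvme_passthru_cmd cmd = { 0 };
        void *buf = NULL;
        int ret, fd = open("/dev/nvme0n1", O_RDONLY);

        if (fd < 0 || posix_memalign(&buf, 4096, 4096))
                return 1;

        cmd.opcode = 0x02;      /* NVM read */
        cmd.nsid = 1;
        cmd.addr = (uintptr_t)buf;
        cmd.data_len = 4096;
        cmd.cdw10 = 0;          /* SLBA, low 32 bits */
        cmd.cdw11 = 0;          /* SLBA, high 32 bits */
        cmd.cdw12 = 7;          /* 0-based count: 8 x 512b LBAs */

        /* the thread sleeps here until the command completes */
        ret = ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
        if (ret != 0)
                fprintf(stderr, "read failed: %d\n", ret);

        free(buf);
        close(fd);
        return 0;
}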

The objective is to revamp the existing passthru interface and turn
it into something that applications can readily use to exercise
new/emerging NVMe features.

Current state of work:
----------------------
1. The block interface is, of course, subject to compatibility: it
appears only when the command set fits the block abstraction. But
nvme now also exposes a generic char interface (/dev/ng) which comes
with no such conditions [1]. When passthru is combined with this
generic char interface, applications get a sure-fire way to operate
an nvme device with any current or future command set. This settles
the availability problem.
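
For instance, on a ZNS drive the zone-append result (the assigned
LBA) is 64 bits wide, so the 64-bit ioctl variant fits. A
hypothetical sketch against the char node follows; zslba, nsid and
the sizes are illustrative, and error handling is trimmed.

/* zap.c: zone-append through /dev/ng0n1; works even when no
 * block node exists for this command set */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
        struct nvme_passthru_cmd64 cmd = { 0 };
        uint64_t zslba = 0;     /* start LBA of the target zone */
        void *buf = NULL;
        int fd = open("/dev/ng0n1", O_RDWR);

        if (fd < 0 || posix_memalign(&buf, 4096, 4096))
                return 1;

        cmd.opcode = 0x7d;      /* ZNS zone append */
        cmd.nsid = 1;
        cmd.addr = (uintptr_t)buf;
        cmd.data_len = 4096;
        cmd.cdw10 = zslba & 0xffffffff;
        cmd.cdw11 = zslba >> 32;
        cmd.cdw12 = 7;          /* 0-based LBA count */

        /* 64-bit variant returns the full assigned LBA in result */
        if (ioctl(fd, NVME_IOCTL_IO64_CMD, &cmd) == 0)
                printf("appended at LBA %llu\n",
                       (unsigned long long)cmd.result);

        free(buf);
        close(fd);
        return 0;
}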

2. For the scalability problem, we are discussing "uring-cmd", a new
facility that Jens proposed in io_uring [2]. It enables using
io_uring for any arbitrary command (ioctl, fsctl etc.) exposed by the
underlying component (driver, FS etc.).
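
To give a feel for the model, below is a rough sketch of a single
read submitted as uring-cmd. Everything here follows the in-flight
patches rather than a merged ABI: IORING_SETUP_SQE128,
IORING_OP_URING_CMD, NVME_URING_CMD_IO and struct nvme_uring_cmd
(the command placed inline in a big SQE) are all subject to change,
so treat this as illustrative only.

/* uring-pt.c: one read over uring-passthru (proposed ABI) */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <liburing.h>
#include <linux/nvme_ioctl.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        struct nvme_uring_cmd *cmd;
        void *buf = NULL;
        int fd = open("/dev/ng0n1", O_RDONLY);

        if (fd < 0 || posix_memalign(&buf, 4096, 4096))
                return 1;
        /* big (128b) SQEs leave room for the nvme command inline */
        if (io_uring_queue_init(8, &ring, IORING_SETUP_SQE128))
                return 1;

        sqe = io_uring_get_sqe(&ring);
        memset(sqe, 0, 2 * sizeof(*sqe));       /* one 128b SQE */
        sqe->opcode = IORING_OP_URING_CMD;
        sqe->fd = fd;
        sqe->cmd_op = NVME_URING_CMD_IO;

        cmd = (struct nvme_uring_cmd *)sqe->cmd;
        cmd->opcode = 0x02;     /* NVM read */
        cmd->nsid = 1;
        cmd->addr = (uintptr_t)buf;
        cmd->data_len = 4096;
        cmd->cdw12 = 7;         /* 0-based count: 8 LBAs */

        /* submit returns immediately; completions are reaped from
         * the CQ, so many commands can be in flight at once */
        io_uring_submit(&ring);
        if (io_uring_wait_cqe(&ring, &cqe) == 0) {
                printf("cqe res %d\n", cqe->res);
                io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        free(buf);
        close(fd);
        return 0;
}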

3. I have posted patches combining nvme-passthru with uring-cmd [3].
This new uring-passthru path enables a bunch of capabilities: async
transport, fixed buffers, async polling, bio-cache etc. It scales
well. Below are 512b randread results in KIOPS, comparing
uring-passthru over the char device ("pt", /dev/ng0n1) against
regular io_uring over the block device ("uring", /dev/nvme0n1), with
and without completion polling:

QD     uring      pt    uring-poll    pt-poll
8        538     589         831        902
64       967    1131        1351       1378
256     1043    1230        1376       1429

Discussion points:
------------------
I'd like to propose a session to go over:

- What are the issues in getting the above work (uring-cmd and the
new nvme passthru) merged?

- What other useful things should be added to nvme-passthru? For
example, the lack of vectored IO for passthru was one such missing
piece; that is covered in the nvme driver from kernel 5.18 onwards
[4]. But are there other things that user-space would need before it
starts treating this path as a good alternative to kernel bypass?

- Despite the numbers above, nvme passthru has more room for
efficiency. For example, unlike regular IO, we do a copy_from_user to
fetch the command and a put_user to return the result. Eliminating
some of this may require a new ioctl. There may be other opinions on
what else needs an overhaul in this path.

- What would be a good way to upstream the tests? Nvme-cli may not
be very useful. Should it be something similar to fio's sg ioengine?
But unlike sg, here we are combining ng with io_uring, and one would
want to retain all the tunables of io_uring (registered/fixed
buffers, SQPOLL etc.).

- All the above is for 2.0 passthru, which essentially forms a
direct path between io_uring and nvme; the two share many
similarities in their programming models. For 3.0 passthru, would it
be crazy to trim the path further by eliminating the block layer and
doing things without "struct request"? There is some interest in
developing user-space block devices [5] and filesystems anyway.

[1] https://lore.kernel.org/linux-nvme/20210421074504.57750-1-minwoo.im.dev@gmail.com/
[2] https://lore.kernel.org/linux-nvme/20210317221027.366780-1-axboe@kernel.dk/
[3] https://lore.kernel.org/linux-nvme/20211220141734.12206-1-joshi.k@samsung.com/
[4] https://lore.kernel.org/linux-nvme/20220216080208.GD10554@lst.de/
[5] https://lore.kernel.org/linux-block/87tucsf0sr.fsf@collabora.com/
