[PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write

Kanchan Joshi joshi.k at samsung.com
Sun Nov 10 09:41:55 PST 2024


On 11/7/2024 10:53 PM, Pavel Begunkov wrote:

>>> 1. SQE128 makes it big for all requests, intermixing with requests that
>>> don't need additional space wastes space. SQE128 is fine to use but at
>>> the same time we should be mindful about it and try to avoid enabling it
>>> if feasible.
>>
>> Right. And initial versions of this series did not use SQE128. But as we
>> moved towards passing more comprehensive PI information, first SQE was
>> not enough. And we thought to make use of SQE128 rather than taking
>> copy_from_user cost.
> 
> Do we have any data how expensive it is? I don't think I've ever
> tried to profile it. And where the overhead comes from? speculation
> prevention?

We did measure this for nvme passthru commands in the past (and that was 
the motivation for building SQE128). The perf profile showed about 3% 
overhead for the copy [*].

> If it's indeed costly, we can add sth to io_uring like pre-mapping
> memory to optimise it, which would be useful in other places as
> well.

But why operate as if SQE128 does not exist?
Reads/writes, at this point, clearly leave about 20 bytes of the first 
SQE and the entire second SQE unused. Not using the second SQE at all 
does not seem like the best way to protect it from being claimed by 
future users.
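
Just to illustrate the layout argument (a rough sketch only, not the 
exact struct from this series; the field names and sizes below are 
illustrative): with IORING_SETUP_SQE128 every SQE is 128 bytes, and for 
plain read/write the second 64 bytes are currently unused, so a PI 
description fits there without any copy_from_user:

#include <linux/types.h>

/* hypothetical overlay for the second 64 bytes of a 128-byte SQE */
struct rw_pi_attr {
	__u16	flags;		/* e.g. guard/apptag/reftag check bits */
	__u16	app_tag;
	__u32	len;		/* length of the PI buffer in bytes */
	__u64	addr;		/* user pointer to the PI buffer */
	__u64	seed;		/* reference-tag seed */
	__u8	rsvd[40];	/* rest of the second SQE still spare */
};
/* sizeof(struct rw_pi_attr) == 64, i.e. exactly the second SQE */

So roughly 24 bytes describe the PI buffer, and most of the second SQE 
is still free for future extensions.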

Pre-mapping may be better for opcodes where the copy_from_user has 
already been done. For something new (like this), why start in a 
suboptimal way and later put the burden on userspace to jump through 
hoops just to get to the same level it can reach today by simply 
passing a flag at ring setup time?
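
And "simply passing a flag at ring setup time" is all it takes on the 
userspace side; a minimal liburing sketch (liburing and 
IORING_SETUP_SQE128 both exist today, the helper name is mine):

#include <liburing.h>

static int setup_big_sqe_ring(struct io_uring *ring, unsigned int depth)
{
	/*
	 * IORING_SETUP_SQE128 doubles every SQE to 128 bytes at setup
	 * time; no pre-mapping or extra syscalls are needed afterwards.
	 */
	return io_uring_queue_init(depth, ring, IORING_SETUP_SQE128);
}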

[*]
perf record -a fio -iodepth=256 -rw=randread -ioengine=io_uring -bs=512 \
  -numjobs=1 -size=50G -group_reporting -iodepth_batch_submit=64 \
  -iodepth_batch_complete_min=1 -iodepth_batch_complete_max=64 \
  -fixedbufs=1 -hipri=1 -sqthread_poll=0 -filename=/dev/ng0n1 \
  -name=io_uring_1 -uring_cmd=1


# Overhead  Command          Shared Object                 Symbol
# ........  ...............  ............................  ..................................
#
     14.37%  fio              fio                           [.] axmap_isset
      6.30%  fio              fio                           [.] __fio_gettime
      3.69%  fio              fio                           [.] get_io_u
      3.16%  fio              [kernel.vmlinux]              [k] copy_user_enhanced_fast_string
      2.61%  fio              [kernel.vmlinux]              [k] io_submit_sqes
      1.99%  fio              [kernel.vmlinux]              [k] fget
      1.96%  fio              [nvme_core]                   [k] nvme_alloc_request
      1.82%  fio              [nvme]                        [k] nvme_poll
      1.79%  fio              fio                           [.] add_clat_sample
      1.69%  fio              fio                           [.] fio_ioring_prep
      1.59%  fio              fio                           [.] thread_main
      1.59%  fio              [nvme]                        [k] nvme_queue_rqs
      1.56%  fio              [kernel.vmlinux]              [k] io_issue_sqe
      1.52%  fio              [kernel.vmlinux]              [k] __put_user_nocheck_8
      1.44%  fio              fio                           [.] account_io_completion
      1.37%  fio              fio                           [.] get_next_rand_block
      1.37%  fio              fio                           [.] __get_next_rand_offset.isra.0
      1.34%  fio              fio                           [.] io_completed
      1.34%  fio              fio                           [.] td_io_queue
      1.27%  fio              [kernel.vmlinux]              [k] blk_mq_alloc_request
      1.27%  fio              [nvme_core]                   [k] nvme_user_cmd64
