[PATCH for-next 0/2] Enable IOU_F_TWQ_LAZY_WAKE for passthrough
Pavel Begunkov
asml.silence at gmail.com
Tue May 16 11:38:20 PDT 2023
On 5/16/23 12:42, Anuj gupta wrote:
> On Mon, May 15, 2023 at 6:29 PM Pavel Begunkov <asml.silence at gmail.com> wrote:
>>
>> Let cmds to use IOU_F_TWQ_LAZY_WAKE and enable it for nvme passthrough.
>>
>> The result should be same as in test to the original IOU_F_TWQ_LAZY_WAKE [1]
>> patchset, but for a quick test I took fio/t/io_uring with 4 threads each
>> reading their own drive and all pinned to the same CPU to make it CPU
>> bound and got +10% throughput improvement.
>>
>> [1] https://lore.kernel.org/all/cover.1680782016.git.asml.silence@gmail.com/
>>
>> Pavel Begunkov (2):
>> io_uring/cmd: add cmd lazy tw wake helper
>> nvme: optimise io_uring passthrough completion
>>
>> drivers/nvme/host/ioctl.c | 4 ++--
>> include/linux/io_uring.h | 18 ++++++++++++++++--
>> io_uring/uring_cmd.c | 16 ++++++++++++----
>> 3 files changed, 30 insertions(+), 8 deletions(-)
>>
>>
>> base-commit: 9a48d604672220545d209e9996c2a1edbb5637f6
>> --
>> 2.40.0
>>
>
> I tried to run a few workloads on my setup with your patches applied. However, I
> couldn't see any difference in io passthrough performance. I might have missed
> something. Can you share the workload that you ran which gave you the perf
> improvement? Here is the workload that I ran -
The patch is a way to make completion batching more consistent. If you're so
lucky that all IOs complete before task_work runs, it'll be perfect batching
and there is nothing to improve. That often happens with high-throughput
benchmarks because of how consistent they are: no writes, same size,
everything issued at the same time, and so on. In reality it depends on your
usage pattern, timings, nvme coalescing, and it will also change if you
introduce a second drive, and so on.

With the patch, t/io_uring should run task_work once for exactly the number
of CQEs the user is waiting for, i.e. -c<N>, regardless of circumstances.
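To make the mechanism concrete: patch 1 adds a uring_cmd helper that queues
the command's completion callback as task_work tagged with
IOU_F_TWQ_LAZY_WAKE, and patch 2 switches the nvme passthrough end_io path
over to it. A rough sketch based on the patch titles and diffstat above
(treat the exact names and signatures as illustrative; the patches themselves
are authoritative):

/* include/linux/io_uring.h (sketch): like io_uring_cmd_complete_in_task(),
 * but the queued task_work wakes the waiter lazily, i.e. only once enough
 * items have been queued to cover the number of CQEs it's waiting for,
 * instead of waking it for every single completion. */
void io_uring_cmd_do_in_task_lazy(struct io_uring_cmd *ioucmd,
			void (*task_work_cb)(struct io_uring_cmd *, unsigned));

/* drivers/nvme/host/ioctl.c (sketch): the irq-side passthrough completion
 * defers CQE posting to task context via the lazy variant. */
static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
						blk_status_t err)
{
	struct io_uring_cmd *ioucmd = req->end_io_data;
	...
	/* was: io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb); */
	io_uring_cmd_do_in_task_lazy(ioucmd, nvme_uring_task_cb);
	return RQ_END_IO_NONE;
}

That lazy wake is what the count=4 in the traces further down comes from.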
Just tried it out to confirm:
taskset -c 0 nice -n -20 /t/io_uring -p0 -d4 -b8192 -s4 -c4 -F1 -B1 -R0 -X1 -u1 -O0 /dev/ng0n1
Without:
12:11:10 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:11:20 PM 0 2.03 0.00 25.95 0.00 0.00 0.00 0.00 0.00 0.00 72.03
With:
12:12:00 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:12:10 PM 0 2.22 0.00 17.39 0.00 0.00 0.00 0.00 0.00 0.00 80.40
Double checking it works:
echo 1 > /sys/kernel/debug/tracing/events/io_uring/io_uring_local_work_run/enable
cat /sys/kernel/debug/tracing/trace_pipe
Without the patches I see:
io_uring-4108 [000] ..... 653.820369: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1
io_uring-4108 [000] ..... 653.820371: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1
io_uring-4108 [000] ..... 653.820382: io_uring_local_work_run: ring 00000000b843f57f, count 2, loops 1
io_uring-4108 [000] ..... 653.820383: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1
io_uring-4108 [000] ..... 653.820386: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1
io_uring-4108 [000] ..... 653.820398: io_uring_local_work_run: ring 00000000b843f57f, count 2, loops 1
io_uring-4108 [000] ..... 653.820398: io_uring_local_work_run: ring 00000000b843f57f, count 1, loops 1
And with patches it's strictly count=4.
Another way would be to add more SSDs to the picture and hope they don't
conspire to complete at the same time.
--
Pavel Begunkov