don't reorder requests passed to ->queue_rqs
Jens Axboe
axboe at kernel.dk
Wed Nov 13 12:51:48 PST 2024
On 11/13/24 1:36 PM, Chaitanya Kulkarni wrote:
> On 11/13/24 07:20, Christoph Hellwig wrote:
>> Hi Jens,
>>
>> currently blk-mq reorders requests when adding them to the plug because
>> the request list can't do efficient tail appends. When the plug is
>> directly issued using ->queue_rqs, that means reordered requests are
>> passed to the driver, which can lead to very bad I/O patterns when
>> not corrected, especially on rotational devices (e.g. NVMe HDD) or
>> when using zone append.
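
[A minimal standalone sketch of why that reordering happens, assuming
the plug is kept as a plain singly linked list with no tail pointer;
the struct and function names are hypothetical, not the actual blk-mq
code:]

	#include <stdio.h>

	struct req {
		struct req *next;
		int tag;
	};

	/* Without a tail pointer, the only O(1) insert is push-to-head. */
	static void plug_add(struct req **head, struct req *rq)
	{
		rq->next = *head;	/* newest request becomes the head */
		*head = rq;
	}

	int main(void)
	{
		struct req rqs[3] = { { .tag = 0 }, { .tag = 1 }, { .tag = 2 } };
		struct req *plug = NULL;
		struct req *rq;
		int i;

		/* Submit tags 0, 1, 2 in order. */
		for (i = 0; i < 3; i++)
			plug_add(&plug, &rqs[i]);

		/* Draining the plug hands requests over newest-first. */
		for (rq = plug; rq; rq = rq->next)
			printf("issuing tag %d\n", rq->tag);
		return 0;
	}

[Compiled and run, this prints tags 2, 1, 0: the driver sees the
requests in the reverse of submission order unless something reverses
the list first.]
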
>>
>> This series first adds two easily backportable workarounds that reverse
>> the reordering in the virtio_blk and nvme-pci ->queue_rqs implementations,
>> similar to what the non-queue_rqs path does. It then adds an rq_list
>> type that allows for efficient tail insertions, uses that to fix the
>> reordering for real, and finally does the same for I/O completions as
>> well.
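
[The key idea behind the real fix is a list that also carries a tail
pointer, so appends are O(1) and submission order is preserved. A rough
standalone sketch of that idea follows; the field and function names
are assumptions, not necessarily the rq_list API the series adds:]

	#include <stdio.h>

	struct req {
		struct req *next;
		int tag;
	};

	/* Carrying a tail pointer next to the head makes append O(1)
	 * while keeping the requests in submission order. */
	struct req_list {
		struct req *head;
		struct req *tail;
	};

	static void req_list_add_tail(struct req_list *l, struct req *rq)
	{
		rq->next = NULL;
		if (l->tail)
			l->tail->next = rq;
		else
			l->head = rq;
		l->tail = rq;
	}

	int main(void)
	{
		struct req rqs[3] = { { .tag = 0 }, { .tag = 1 }, { .tag = 2 } };
		struct req_list plug = { NULL, NULL };
		struct req *rq;
		int i;

		for (i = 0; i < 3; i++)
			req_list_add_tail(&plug, &rqs[i]);

		/* Requests drain in the order they were submitted: 0, 1, 2. */
		for (rq = plug.head; rq; rq = rq->next)
			printf("issuing tag %d\n", rq->tag);
		return 0;
	}

[With tail appends the drain passed to ->queue_rqs sees the requests
in submission order, so no per-driver reversal is needed.]
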
>
> Looks good to me. I ran the quick performance numbers [1].
>
> Reviewed-by: Chaitanya Kulkarni <kch at nvidia.com>
>
> -ck
>
> fio randread io_uring workload :-
>
> IOPS :-
> -------
> nvme-orig: Average IOPS: 72,690
> nvme-new-no-reorder: Average IOPS: 72,580
>
> BW :-
> -------
> nvme-orig: Average BW: 283.9 MiB/s
> nvme-new-no-reorder: Average BW: 283.4 MiB/s
Thanks for testing, but you can't verify any kind of perf change with
that kind of setup. I'd be willing to bet that it'll be a 1-2% drop at
higher rates, which is substantial. But the reordering is a problem, not
just for zoned devices, which is why I chose to merge this.
--
Jens Axboe