don't reorder requests passed to ->queue_rqs

Jens Axboe axboe at kernel.dk
Wed Nov 13 12:51:48 PST 2024


On 11/13/24 1:36 PM, Chaitanya Kulkarni wrote:
> On 11/13/24 07:20, Christoph Hellwig wrote:
>> Hi Jens,
>>
>> Currently blk-mq reorders requests when adding them to the plug because
>> the request list can't do efficient tail appends.  When the plug is
>> issued directly via ->queue_rqs, the reordered requests are passed to
>> the driver as-is, which can lead to very bad I/O patterns when not
>> corrected, especially on rotational devices (e.g. NVMe HDDs) or when
>> using zone append.
>>
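For illustration, a minimal userspace sketch of the effect described above
(stand-in types and names, not actual kernel code): with only a head
pointer, each request is prepended to the plug list, so a bulk
->queue_rqs style issue ends up walking the requests in reverse
submission order.

#include <stdio.h>
#include <stdlib.h>

/* stand-in for struct request; only the list linkage matters here */
struct req {
	int tag;			/* submission order */
	struct req *next;
};

int main(void)
{
	struct req *plug = NULL;

	/* head-only list: prepend is O(1), but the order becomes LIFO */
	for (int tag = 0; tag < 4; tag++) {
		struct req *rq = malloc(sizeof(*rq));

		rq->tag = tag;
		rq->next = plug;
		plug = rq;
	}

	/* what a bulk issue path would now walk */
	printf("issue order:");
	for (struct req *rq = plug; rq; rq = rq->next)
		printf(" %d", rq->tag);
	printf("\n");			/* prints 3 2 1 0, i.e. reversed */
	return 0;
}
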
>> This series first adds two easily backportable workarounds that reverse
>> the reordering in the virtio_blk and nvme-pci ->queue_rqs implementations,
>> similar to what the non-queue_rqs path does.  It then adds an rq_list
>> type that allows efficient tail insertions, uses it to fix the
>> reordering for real, and finally does the same for I/O completions.
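A tail-append list like the one described above can be modelled along
these lines (again a userspace sketch; the type and helper names are
illustrative and may not match the series exactly): keeping both a head
and a tail pointer makes appends O(1) while preserving submission order.

#include <stdio.h>
#include <stdlib.h>

struct req {
	int tag;
	struct req *next;
};

/* head + tail so tail appends are O(1) and FIFO order is kept */
struct rq_list {
	struct req *head;
	struct req *tail;
};

static void rq_list_add_tail(struct rq_list *l, struct req *rq)
{
	rq->next = NULL;
	if (l->tail)
		l->tail->next = rq;
	else
		l->head = rq;
	l->tail = rq;
}

int main(void)
{
	struct rq_list plug = { NULL, NULL };

	for (int tag = 0; tag < 4; tag++) {
		struct req *rq = malloc(sizeof(*rq));

		rq->tag = tag;
		rq_list_add_tail(&plug, rq);
	}

	printf("issue order:");
	for (struct req *rq = plug.head; rq; rq = rq->next)
		printf(" %d", rq->tag);
	printf("\n");			/* prints 0 1 2 3: order preserved */
	return 0;
}
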
> 
> Looks good to me. I ran the quick performance numbers [1].
> 
> Reviewed-by: Chaitanya Kulkarni <kch at nvidia.com>
> 
> -ck
> 
> fio randread iouring workload :-
> 
> IOPS :-
> -------
> nvme-orig:           Average IOPS: 72,690
> nvme-new-no-reorder: Average IOPS: 72,580
> 
> BW :-
> -------
> nvme-orig:           Average BW: 283.9 MiB/s
> nvme-new-no-reorder: Average BW: 283.4 MiB/s

Thanks for testing, but you can't verify any kind of perf change with
that kind of setup. I'd be willing to bet it'll be a 1-2% drop at
higher rates, which is substantial. But the reordering is a problem, not
just for zoned devices, which is why I chose to merge this.

-- 
Jens Axboe


