[External] : Re: [bug-report] 5-9% FIO randomwrite ext4 perf regression on 6.12.y kernel
Paul Webb
paul.x.webb at oracle.com
Fri Nov 22 09:13:40 PST 2024
On 21/11/2024 14:49, Jens Axboe wrote:
> On 11/21/24 4:30 AM, Phil Auld wrote:
>> Hi,
>>
>> On Wed, Nov 20, 2024 at 06:20:12PM -0700 Jens Axboe wrote:
>>> On 11/20/24 5:00 PM, Chaitanya Kulkarni wrote:
>>>> On 11/20/24 13:35, Saeed Mirzamohammadi wrote:
>>>>> Hi,
>>>>>
>>>>> I'm reporting a performance regression of up to 9-10% with the FIO randomwrite benchmark on ext4, comparing the 6.12.0-rc2 kernel against v5.15.161. Also, the standard deviation after this change grows by up to 5-6%.
>>>>>
>>>>> Bisect root cause commit
>>>>> ===================
>>>>> - commit 63dfa1004322 ("nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discard")
>>>>>
>>>>>
>>>>> Test details
>>>>> =========
>>>>> - readwrite=randwrite bs=4k size=1G ioengine=libaio iodepth=16 direct=1 time_based=1 ramp_time=180 runtime=1800 randrepeat=1 gtod_reduce=1
>>>>> - Test is on ext4 filesystem
>>>>> - System has 4 NVMe disks
>>>>>
>>>> Thanks a lot for the report. To help narrow down this problem, can
>>>> you please:
>>>>
>>>> 1. Run the same test on the raw nvme device /dev/nvme0n1 that you
>>>> have used for this benchmark?
>>>> 2. Run the same test on an XFS-formatted nvme device instead of ext4?
>>>>
>>>> This way we will know whether the issue is specific to ext4, whether
>>>> other file systems suffer from it too, or whether it lies below the
>>>> file system layer, e.g. in the block layer or the nvme pci driver.
>>>>
>>>> It will also help if you can repeat these numbers with the io_uring
>>>> fio ioengine, to determine whether the issue is ioengine specific.
>>>>
>>>> Looking at the commit [1], it only sets the max write zeroes sectors
>>>> value if NVME_QUIRK_DEALLOCATE_ZEROES is set; otherwise it uses the
>>>> controller's max write zeroes value.
>>> There's no way that commit is involved, the test as quoted doesn't even
>>> touch write zeroes. Hence if there really is a regression here, then
>>> it's either not easily bisectable, some error was injected while
>>> bisecting, or the test itself is bimodal.
>> I was just going to ask how confident we are in that bisect result.
>>
>> I suspect this is the same issue I've been fighting here:
>>
>> https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
>>
>> Saeed, can you try your randwrite test after
>>
>> "echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features"
>>
>> please?
>>
>> We don't as yet have a general fix for it as it seems to be a bit of
>> a trade off.
> Interesting. Might explain some regressions I've seen too related to
> performance.
Apologies to those receiving this twice; I'm resending because my mail
client did not send the message as plain text, which caused it to be
rejected by the lists. I've also added a little more information below.
Hi,
To answer the various questions/suggestions, I'll just group them here:
Phil:
can you try your randwrite test after
"echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features"
Performance regression still persists with this setting being used.
Christoph:
To check for weird lazy init code using write zeroes
Values in the 5.15 kernel baseline prior to the commit:
$ cat /sys/block/nvme*n1/queue/write_zeroes_max_bytes
0
0
0
0
Values in the 6.11 kernel that contains the commit:
$ cat /sys/block/nvme*n1/queue/write_zeroes_max_bytes
2199023255040
2199023255040
2199023255040
2199023255040
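For what it's worth, 2199023255040 bytes appears to be 0xFFFFFFFF * 512,
i.e. UINT_MAX 512-byte sectors (~2 TiB), so the newer kernel is now
advertising Write Zeroes on these devices where 5.15 advertised none. A
quick check of the arithmetic:

$ printf '%d\n' $(( 0xFFFFFFFF * 512 ))
2199023255040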
Chaitanya:
Run the same test on an XFS-formatted nvme device instead of ext4?
- XFS runs did not show the performance regression.
Run the same test on the raw nvme device /dev/nvme0n1 that you have used
for this benchmark?
- Will have to check if this was done, and if not, get that test run.
Repeat these numbers with the io_uring fio ioengine?
- Will look into getting those too; a sketch of the invocation follows.
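For reference, here is a reconstruction of the same job with the
io_uring engine, based on the parameters quoted above (the target path
is a placeholder, not necessarily the device/file used in the original
runs):

$ fio --name=randwrite --filename=/dev/nvme0n1 --readwrite=randwrite \
      --bs=4k --size=1G --ioengine=io_uring --iodepth=16 --direct=1 \
      --time_based=1 --ramp_time=180 --runtime=1800 --randrepeat=1 \
      --gtod_reduce=1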
Another interesting data point: while performing some runs, I see the
following output on the console with the 6.11/6.12 kernels that contain
the commit:
[ 473.398188] operation not supported error, dev nvme2n1, sector 13952 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
[ 473.534550] nvme0n1: Dataset Management(0x9) @ LBA 14000, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
[ 473.660502] operation not supported error, dev nvme0n1, sector 14000 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
[ 473.796859] nvme3n1: Dataset Management(0x9) @ LBA 13952, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
[ 473.922810] operation not supported error, dev nvme3n1, sector 13952 op 0x9:(WRITE_ZEROES) flags 0x800 phys_seg 0 prio class 0
[ 474.059169] nvme1n1: Dataset Management(0x9) @ LBA 13952, 256 blocks, Invalid Command Opcode (sct 0x0 / sc 0x1) DNR
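Those "Invalid Command Opcode" completions suggest the controllers do
not actually implement Write Zeroes even though the queue limit now
advertises it. A quick way to confirm, assuming nvme-cli is installed
(ONCS bit 3 of the Identify Controller data indicates Write Zeroes
support):

for c in /dev/nvme0 /dev/nvme1 /dev/nvme2 /dev/nvme3; do
    # "nvme id-ctrl" prints a line like "oncs : 0x5f"; bit 3 set
    # means the controller claims Write Zeroes support.
    oncs=$(nvme id-ctrl "$c" | awk -F: '/^oncs/ {print $2}')
    printf '%s: oncs=%s write_zeroes=%d\n' "$c" "$oncs" $(( (oncs >> 3) & 1 ))
done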
The errors start as soon as the mkfs command is initiated for ext4 and
continue throughout the fio test run.
These are also seen when using mkfs to create an XFS filesystem;
however, in that case the error appears only once, at filesystem
creation time.
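That pattern is consistent with Christoph's lazy-init suspicion: ext4
keeps initializing inode tables in the background after mount (the
ext4lazyinit thread), whereas XFS only zeroes at mkfs time. If that
background zeroing is what drives the ongoing WRITE_ZEROES traffic
during the fio run, then pre-initializing everything at mkfs time should
make the errors stop once mkfs completes (device path is a placeholder):

$ mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/nvme0n1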
Regards,
Paul.