Sequential read from NVMe/XFS twice as slow on Fedora 42 as on Rocky 9.5
Anton Gavriliuk
antosha20xx at gmail.com
Tue May 6 04:03:37 PDT 2025
> So is this MD chunk size related? i.e. what is the chunk size of
> the MD device? Is it smaller than the IO size (256kB) or larger?
> Does the regression go away if the chunk size matches the IO size,
> or if the IO size vs chunk size relationship is reversed?
According to the output below, the chunk size is 512K:
[root at localhost anton]# mdadm -D /dev/md127
/dev/md127:
Version : 1.2
Creation Time : Thu Apr 17 14:58:23 2025
Raid Level : raid0
Array Size : 37505814528 (34.93 TiB 38.41 TB)
Raid Devices : 12
Total Devices : 12
Persistence : Superblock is persistent
Update Time : Thu Apr 17 14:58:23 2025
State : clean
Active Devices : 12
Working Devices : 12
Failed Devices : 0
Spare Devices : 0
Layout : original
Chunk Size : 512K
Consistency Policy : none
Name : localhost.localdomain:127 (local to host
localhost.localdomain)
UUID : 2fadc96b:f37753af:f3b528a0:067c320d
Events : 0
    Number   Major   Minor   RaidDevice State
       0     259       15        0      active sync   /dev/nvme7n1
       1     259       27        1      active sync   /dev/nvme0n1
       2     259       10        2      active sync   /dev/nvme1n1
       3     259       28        3      active sync   /dev/nvme2n1
       4     259       13        4      active sync   /dev/nvme8n1
       5     259       22        5      active sync   /dev/nvme5n1
       6     259       26        6      active sync   /dev/nvme3n1
       7     259       16        7      active sync   /dev/nvme4n1
       8     259       24        8      active sync   /dev/nvme9n1
       9     259       14        9      active sync   /dev/nvme10n1
      10     259       25       10      active sync   /dev/nvme11n1
      11     259       12       11      active sync   /dev/nvme12n1
[root at localhost anton]# uname -r
6.14.5-300.fc42.x86_64
[root at localhost anton]# cat /proc/mdstat
Personalities : [raid0]
md127 : active raid0 nvme4n1[7] nvme1n1[2] nvme12n1[11] nvme7n1[0]
nvme9n1[8] nvme11n1[10] nvme2n1[3] nvme8n1[4] nvme0n1[1] nvme5n1[5]
nvme3n1[6] nvme10n1[9]
37505814528 blocks super 1.2 512k chunks
unused devices: <none>
[root at localhost anton]#
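As a cross-check (just a sketch; these are the standard md/queue sysfs attributes, using the device names from the array above), the chunk size and the request-size limits can also be read straight from sysfs:

cat /sys/block/md127/md/chunk_size              # chunk size in bytes (524288 = 512K)
cat /sys/block/md127/queue/chunk_sectors        # chunk limit as the block layer sees it, in 512-byte sectors (0 if unset)
cat /sys/block/md127/queue/max_sectors_kb       # largest request the md device will accept
cat /sys/block/nvme7n1/queue/max_hw_sectors_kb  # hardware request-size limit of one member device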
When the I/O size is less than 512K:
[root at localhost ~]# fio --name=test --rw=read --bs=256k
--filename=/dev/md127 --direct=1 --numjobs=1 --iodepth=64 --exitall
--group_reporting --ioengine=libaio --runtime=30 --time_based
test: (g=0): rw=read, bs=(R) 256KiB-256KiB, (W) 256KiB-256KiB, (T)
256KiB-256KiB, ioengine=libaio, iodepth=64
fio-3.39-44-g19d9
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=48.1GiB/s][r=197k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=14340: Tue May 6 13:59:23 2025
read: IOPS=197k, BW=48.0GiB/s (51.6GB/s)(1441GiB/30001msec)
slat (usec): min=3, max=1041, avg= 4.74, stdev= 1.48
clat (usec): min=76, max=2042, avg=320.30, stdev=26.82
lat (usec): min=79, max=2160, avg=325.04, stdev=27.08
When the I/O size is greater than 512K:
[root at localhost ~]# fio --name=test --rw=read --bs=1024k
--filename=/dev/md127 --direct=1 --numjobs=1 --iodepth=64 --exitall
--group_reporting --ioengine=libaio --runtime=30 --time_based
test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T)
1024KiB-1024KiB, ioengine=libaio, iodepth=64
fio-3.39-44-g19d9
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=63.7GiB/s][r=65.2k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=14395: Tue May 6 14:00:28 2025
read: IOPS=64.6k, BW=63.0GiB/s (67.7GB/s)(1891GiB/30001msec)
slat (usec): min=9, max=1045, avg=15.12, stdev= 3.84
clat (usec): min=81, max=18494, avg=975.87, stdev=112.11
lat (usec): min=96, max=18758, avg=990.99, stdev=113.49
But this is still much worse than what 256k gives on Rocky 9.5.
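To answer the chunk-size question more directly, a block-size sweep along these lines (just a sketch, reusing the fio options from the runs above and the same /dev/md127 device) would show where the Fedora 42 numbers fall off relative to the 512K chunk:

for bs in 128k 256k 512k 1024k 2048k; do
    echo "=== bs=$bs ==="
    fio --name=sweep --rw=read --bs=$bs --filename=/dev/md127 \
        --direct=1 --numjobs=1 --iodepth=64 --ioengine=libaio \
        --runtime=30 --time_based --group_reporting \
        | grep -E 'read: IOPS|READ: bw'
done

Running the same sweep on both kernels should make it obvious whether the regression tracks the IO-size/chunk-size relationship or is independent of it.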
Anton
Tue, 6 May 2025 at 01:56, Dave Chinner <david at fromorbit.com>:
>
> On Mon, May 05, 2025 at 09:21:19AM -0400, Laurence Oberman wrote:
> > On Mon, 2025-05-05 at 08:29 -0400, Laurence Oberman wrote:
> > > On Mon, 2025-05-05 at 07:50 +1000, Dave Chinner wrote:
> > > > So the MD block device shows the same read performance as the
> > > > filesystem on top of it. That means this is a regression at the MD
> > > > device layer or in the block/driver layers below it. i.e. it is not
> > > > an XFS of filesystem issue at all.
> > > >
> > > > -Dave.
> > >
> > > I have a lab setup, let me see if I can also reproduce and then trace
> > > this to see where it is spending the time
> > >
> >
> >
> > I'm not seeing a full halving of the bandwidth, but it is still
> > significantly slower on the Fedora 42 kernel.
> > I will trace it.
> >
> > 9.5 kernel - 5.14.0-503.40.1.el9_5.x86_64
> >
> > Run status group 0 (all jobs):
> > READ: bw=14.7GiB/s (15.8GB/s), 14.7GiB/s-14.7GiB/s (15.8GB/s-
> > 15.8GB/s), io=441GiB (473GB), run=30003-30003msec
> >
> > Fedora42 kernel - 6.14.5-300.fc42.x86_64
> >
> > Run status group 0 (all jobs):
> > READ: bw=10.4GiB/s (11.2GB/s), 10.4GiB/s-10.4GiB/s (11.2GB/s-
> > 11.2GB/s), io=313GiB (336GB), run=30001-30001msec
>
> So is this MD chunk size related? i.e. what is the chunk size of
> the MD device? Is it smaller than the IO size (256kB) or larger?
> Does the regression go away if the chunk size matches the IO size,
> or if the IO size vs chunk size relationship is reversed?
>
> -Dave.
> --
> Dave Chinner
> david at fromorbit.com