Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.

Alexander Shumakovitch shurik at jhu.edu
Fri Mar 31 00:53:30 PDT 2023


Thanks a lot, Damien. This was very helpful indeed. As you suggested, I've
run a few fio tests with the libaio and io_uring engines at QD=32 and with
different numbers of jobs. The results were mostly consistent between the
two engines, except for random reads in cached mode. With libaio there was
virtually no difference between the nodes, and the bandwidth increased
steadily with the number of jobs, which made sense to me after your
explanations.
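
For reference, each data point came from an invocation along the lines of
the one below (the device path and runtime are placeholders here, and
numactl is just one way to pin a job and its memory to a single node):

   # Pin the fio process and its allocations to NUMA node $NODE, then
   # run the 4k random-read test; --direct=0 is the cached mode and
   # --direct=1 the direct mode.
   numactl --cpunodebind=$NODE --membind=$NODE \
       fio --name=randread --filename=/dev/nvme0n1 \
           --rw=randread --bs=4k --iodepth=32 --numjobs=$JOBS \
           --ioengine=io_uring --direct=0 \
           --time_based --runtime=30 --group_reporting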

But with io_uring, node #0 got progressively faster as the number of jobs
increased, while the other three got slower; see the summary tables below.
Does this make sense to you? I understand that the libaio engine might
ignore the iodepth setting in cached mode (as far as I can tell, buffered
I/O through libaio completes synchronously), but a smaller effective QD
should make things slower, not faster, shouldn't it? For your information,
I also attach complete fio outputs for a few boundary cases.

The main thing I'm still concerned about is that not all Linux subsystems
might be fully NUMA-aware on this machine. As I wrote, it has a buggy BIOS
that doesn't report the NUMA topology to the OS. I populate the numa_node
values under /sys/devices/pci0000:* myself after each boot, but this might
not be enough.
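
For reference, the workaround I run from a boot script is a loop along
these lines (the bridge and node numbers below are purely illustrative;
the real mapping has to come from the board's topology):

   # Hypothetical mapping: assign every device behind host bridge
   # pci0000:80 to NUMA node 1, since the BIOS leaves numa_node at -1.
   for f in /sys/devices/pci0000:80/*/numa_node; do
       echo 1 > "$f"
   done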

Thank you,

  --- Alex.

Benchmarks for random reads: bs = 4k, iodepth = 32 (in MB/s):

         ||   libaio engine, cached mode  ||  io_uring engine, cached mode |
    jobs || Node0 | Node1 | Node2 | Node3 || Node0 | Node1 | Node2 | Node3 |
   -------------------------------------------------------------------------
      1  ||  47.5 |  46.2 |  46.0 |  46.5 ||   330 |   285 |   281 |   252 |
      2  ||  94.2 |  91.8 |  90.9 |  91.8 ||   571 |   189 |   186 |   203 |
      4  ||   180 |   176 |   175 |   176 ||  1108 |   184 |   191 |   219 |
      8  ||   331 |   322 |   319 |   322 ||  1142 |   170 |   174 |   177 |
     16  ||   585 |   554 |   545 |   552 ||  1353 |   175 |   173 |   180 |
   -------------------------------------------------------------------------
   
         ||   libaio engine, direct mode  ||  io_uring engine, direct mode |
    jobs || Node0 | Node1 | Node2 | Node3 || Node0 | Node1 | Node2 | Node3 |
   -------------------------------------------------------------------------
      1  ||   544 |   520 |   477 |   519 ||   506 |   558 |   532 |   476 |
      2  ||  1034 |   928 |   943 |   996 ||  1028 |   938 |  1023 |  1004 |
      4  ||  1139 |  1138 |  1138 |  1139 ||  1138 |  1138 |  1138 |  1138 |
      8  ||  1140 |  1141 |  1141 |  1141 ||  1142 |  1142 |  1141 |  1141 |
     16  ||  1141 |  1135 |  1112 |  1136 ||  1141 |  1130 |  1133 |  1135 |
   -------------------------------------------------------------------------
   

Benchmarks for sequential reads: bs = 256k, iodepth = 32, numjobs = 1 (in MB/s):

   |   libaio engine, cached mode  ||  io_uring engine, cached mode |
   | Node0 | Node1 | Node2 | Node3 || Node0 | Node1 | Node2 | Node3 |
   ------------------------------------------------------------------
   |  1411 |   160 |   159 |   163 ||  1355 |   160 |   159 |   163 |
   ------------------------------------------------------------------
   
   |   libaio engine, direct mode  ||  io_uring engine, direct mode |
   | Node0 | Node1 | Node2 | Node3 || Node0 | Node1 | Node2 | Node3 |
   ------------------------------------------------------------------
   |  3627 |  2160 |  1637 |  2184 ||  3627 |  2076 |  1756 |  2167 |
   ------------------------------------------------------------------
   

On Sat, Mar 25, 2023 at 10:52:02AM +0900, Damien Le Moal wrote:
> For fast block devices, the overhead of the page management and memory
> copies done when using the page cache is very visible. Nothing can be
> done about that. Any application, fio included, will most of the time
> show slower performance because of that overhead. That is not always
> true (e.g. sequential reads with read-ahead should be just fine), but
> at the very least you will see a higher CPU load.
> 
> dd and hdparm will also exercise the drive at QD=1, which is far from
> ideal when trying to measure the maximum throughput of a device, unless
> one uses very large IO sizes.
> 
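
As a concrete illustration of that last point, a QD=1 sequential read with
large requests and the page cache bypassed would be something like this
(the block size and count are illustrative):

   # One reader at QD=1, 1 MiB requests, page cache bypassed.
   dd if=/dev/nvme0n1 of=/dev/null bs=1M count=4096 iflag=direct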
-------------- next part --------------
Attachments (complete fio outputs for the boundary cases mentioned above):

  fio-io_uring-iodepth_32-numjobs_16_cached-node0.txt
    <http://lists.infradead.org/pipermail/linux-nvme/attachments/20230331/97c5262e/attachment-0008.txt>
  fio-io_uring-iodepth_32-numjobs_16_cached-node1.txt
    <http://lists.infradead.org/pipermail/linux-nvme/attachments/20230331/97c5262e/attachment-0009.txt>
  fio-io_uring-iodepth_32-numjobs_1_cached-node0.txt
    <http://lists.infradead.org/pipermail/linux-nvme/attachments/20230331/97c5262e/attachment-0010.txt>
  fio-io_uring-iodepth_32-numjobs_1_cached-node1.txt
    <http://lists.infradead.org/pipermail/linux-nvme/attachments/20230331/97c5262e/attachment-0011.txt>
  fio-libaio-iodepth_32-numjobs_16_cached-node0.txt
    <http://lists.infradead.org/pipermail/linux-nvme/attachments/20230331/97c5262e/attachment-0012.txt>
  fio-libaio-iodepth_32-numjobs_16_cached-node1.txt
    <http://lists.infradead.org/pipermail/linux-nvme/attachments/20230331/97c5262e/attachment-0013.txt>
  fio-libaio-iodepth_32-numjobs_1_cached-node0.txt
    <http://lists.infradead.org/pipermail/linux-nvme/attachments/20230331/97c5262e/attachment-0014.txt>
  fio-libaio-iodepth_32-numjobs_1_cached-node1.txt
    <http://lists.infradead.org/pipermail/linux-nvme/attachments/20230331/97c5262e/attachment-0015.txt>

