Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.

Alexander Shumakovitch shurik at jhu.edu
Fri Mar 24 14:19:09 PDT 2023


Hi Damien,

Thanks a lot for your thoughtful reply. The main reason I used hdparm and dd
to benchmark the performance is that they are included with every live distro,
and I didn't want to install an OS before confirming that the hardware works
as expected.

Back to the main topic: it hadn't occurred to me that the --direct option could
have such a profound impact on read speeds, but it does. With this option
enabled, most of the discrepancy in read speeds between nodes disappears. The
same happens when using dd with "iflag=direct". This would seem to imply that
the issue lies in access to the kernel's read (page) cache, correct? On the
other hand, MLC reports perfectly reasonable latency and bandwidth numbers
between the nodes, see below.

So what could be the culprit, and in which direction should I continue digging?
If hdparm and dd have trouble accessing the read cache efficiently, then so
will every other read-intensive program. Could this happen because certain IRQs
lack the correct NUMA affinity? I understand that this question might not be
NVMe-specific anymore, but I would be grateful for any pointers.
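
In case it helps, I assume something along these lines is the right way to
inspect the IRQ placement and to re-run the buffered test pinned to node 0
(the <irq> below is a placeholder for whatever /proc/interrupts lists for the
nvme queues; node 0 is where the adapter card is wired):

# grep nvme /proc/interrupts
# cat /proc/irq/<irq>/smp_affinity_list
# numactl --cpunodebind=0 --membind=0 hdparm -t /dev/nvme0n1

If pinning both the process and its memory to node 0 brings the buffered
numbers back, that would at least point at the cross-node path.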

Thank you,

  --- Alex.

# ./mlc --bandwidth_matrix
Intel(R) Memory Latency Checker - v3.10
Command line parameters: --bandwidth_matrix

Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1       2       3
       0        25328.8  4131.8  4013.0  4541.0
       1         4180.3 24696.3  4501.2  3996.3
       2         4017.7  4535.5 25746.4  4105.7
       3         4488.1  4024.0  4157.0 25467.7

# ./mlc --latency_matrix
Intel(R) Memory Latency Checker - v3.10
Command line parameters: --latency_matrix

Using buffer size of 200.000MiB
Measuring idle latencies for sequential access (in ns)...
                Numa node
Numa node            0       1       2       3
       0          71.7   245.9   257.5   239.5
       1         156.4    71.8   238.3   256.3
       2         250.6   237.9    71.8   245.1
       3         238.4   252.5   237.9    71.9


On Fri, Mar 24, 2023 at 05:43:42PM +0900, Damien Le Moal wrote:
> 
> On 3/24/23 15:56, Alexander Shumakovitch wrote:
> > [ please copy me on your replies since I'm not subscribed to this list ]
> >
> > Hello all,
> >
> > I have an oldish quad socket server (Stratos S400-X44E by Quanta, 512GB RAM,
> > 4 x Xeon E5-4620) that I'm trying to upgrade with an NVMe Samsung 970 EVO
> > Plus SSD, connected via an adapter card to a PCIe slot, which is wired to
> > CPU #0 directly and supports PCIe 3.0 speeds. For some reason, the reading
> > speed from this SSD differs by a factor of 10 (ten!), depending on which
> > physical CPU hdparm or dd is run on:
> >
> >     # hdparm -t /dev/nvme0n1
> 
> It is very unusual to use hdparm, a tool designed mainly for ATA devices, to
> benchmark an nvme device. At the very least, if you really want to measure the
> drive performance, you should add the --direct option (see man hdparm).
> 
> But a better way to test would be to use fio with the io_uring or libaio IO
> engine, doing multi-job, high-QD --direct=1 IOs. That will give you the maximum
> performance of your device. Then remove the --direct=1 option to do buffered
> IOs, which will expose potential issues with your system memory bandwidth.
> 
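
For my own reference, I read the suggestion above as something along the lines
of the following invocation (the block size, queue depth, and job count are my
guesses at "multi-job & high QD"):

# fio --name=seqread --filename=/dev/nvme0n1 --readonly --rw=read --bs=128k \
      --ioengine=io_uring --direct=1 --iodepth=32 --numjobs=4 \
      --runtime=30 --time_based --group_reporting

with the same command minus --direct=1 for the buffered case.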

