Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.

Damien Le Moal damien.lemoal at opensource.wdc.com
Fri Mar 24 18:52:02 PDT 2023


On 3/25/23 06:19, Alexander Shumakovitch wrote:
> Hi Damien,
> 
> Thanks a lot for your thoughtful reply. The main reason why I used hdparm
> and dd to benchmark the performance is because they are included with every
> live distro. I didn't want to install an OS before confirming that hardware
> works as expected.

You could install the OS on a USB stick to add fio.

> 
> Back to the main topic, it didn't occur to me that the --direct option can
> have such a profound impact on reading speeds, but it does. With this
> option enabled, most of the discrepancies in reading speeds from different
> nodes disappear. The same happens when using dd with "iflag=direct". This
> should imply that the issue is with the access time to the kernel's read
> cache, correct? On the other hand, MLC shows completely reasonable latency
> and bandwidth numbers between the nodes, see below.
> 
> So what could be the culprit and in which direction should I continue
> digging? If hdparm and dd have issues with accessing the read cache, then
> so will every other read-intensive program. Could this happen because of
> the lack of the (correct) NUMA affinity for certain IRQs? I understand that
> this question might not be NVMe-specific anymore, but would be grateful for
> any pointer.

For fast block devices, the overhead of page management and the memory copies
done when going through the page cache is very visible. There is nothing that
can be done about that. Most of the time, any application (fio included) will
show lower performance because of that overhead. That is not always true (e.g.
a sequential read served by read-ahead should be just fine), but at the very
least you will see a higher CPU load.
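As a quick illustration (a minimal sketch using a scratch file instead of the
raw device so it is safe to run anywhere; substitute your NVMe block device to
reproduce the measurements), the same read can be issued buffered and direct
with dd:

```shell
# Create a small scratch file to read back (name is arbitrary).
dd if=/dev/zero of=scratch.img bs=1M count=64 conv=fsync
# Buffered read: data is copied through the page cache to user space.
dd if=scratch.img of=/dev/null bs=1M
# Direct read: O_DIRECT bypasses the page cache entirely.
# (Some filesystems, e.g. tmpfs, do not support O_DIRECT.)
dd if=scratch.img of=/dev/null bs=1M iflag=direct || echo "O_DIRECT not supported here"
rm -f scratch.img
```

Comparing the throughput and the CPU time of the two reads makes the page
cache copy overhead visible even on a single node.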

dd and hdparm will also exercise the drive at QD=1, which is far from ideal
when trying to measure the maximum throughput of a device, unless one uses
very large IO sizes.
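For comparison, a sketch of a higher queue depth test with fio (the device
path is an assumption here; point --filename at your actual NVMe namespace):

```shell
# Hypothetical example: random read at QD=32 with direct IO,
# bypassing the page cache. Adjust --filename to your device.
fio --name=randread --filename=/dev/nvme0n1 --direct=1 \
    --rw=randread --bs=128k --iodepth=32 --ioengine=libaio \
    --runtime=30 --time_based --group_reporting
```

With --direct=1 and a deep queue, the reported bandwidth should be much
closer to the device maximum than a QD=1 dd run.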

> # ./mlc --bandwidth_matrix
> Intel(R) Memory Latency Checker - v3.10
> Command line parameters: --bandwidth_matrix
> 
> Using buffer size of 100.000MiB/thread for reads and an additional 100.000MiB/thread for writes
> Measuring Memory Bandwidths between nodes within system
> Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
> Using all the threads from each core if Hyper-threading is enabled
> Using Read-only traffic type
>                 Numa node
> Numa node            0       1       2       3
>        0        25328.8  4131.8  4013.0  4541.0
>        1         4180.3 24696.3  4501.2  3996.3
>        2         4017.7  4535.5 25746.4  4105.7
>        3         4488.1  4024.0  4157.0 25467.7

Here you can see that node-local memory accesses are very fast, but about 6x
slower when crossing NUMA nodes. So unless the application explicitly uses
libnuma to do direct IOs with same-node memory, this difference will show up
with the page cache too, due to the balancing of page allocations between
nodes. And there is the copy back to user space itself, which doubles the
memory bandwidth needed.
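A quick way to see this even without fio, assuming numactl is installed and
assuming (for illustration) that the drive sits on node 0, is to pin the same
direct read to different nodes:

```shell
# Hypothetical: identical direct reads pinned to different NUMA nodes.
# Replace the device path and node numbers to match your system
# (lspci -vv / /sys/class/nvme can tell you the device's local node).
numactl --cpunodebind=0 --membind=0 \
    dd if=/dev/nvme0n1 of=/dev/null bs=1M count=4096 iflag=direct
numactl --cpunodebind=2 --membind=2 \
    dd if=/dev/nvme0n1 of=/dev/null bs=1M count=4096 iflag=direct
```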

Use fio and see its options for pinning jobs to CPUs and using libnuma for IO
buffers. You can then run different benchmarks to see the effect of having to
cross NUMA nodes for IOs.
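A sketch of what such a comparison might look like (device path and node
numbers are assumptions; fio must be built with libnuma support for the
--numa_* options to be available):

```shell
# Job pinned to the device's (assumed) local node 0:
fio --name=local --filename=/dev/nvme0n1 --direct=1 --rw=read \
    --bs=128k --iodepth=32 --ioengine=libaio --runtime=30 --time_based \
    --numa_cpu_nodes=0 --numa_mem_policy=bind:0
# Same job pinned to a remote node to expose the cross-node cost:
fio --name=remote --filename=/dev/nvme0n1 --direct=1 --rw=read \
    --bs=128k --iodepth=32 --ioengine=libaio --runtime=30 --time_based \
    --numa_cpu_nodes=2 --numa_mem_policy=bind:2
```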

There are plenty of papers and information about this subject (NUMA memory
management and its effect on performance) all over the place...

-- 
Damien Le Moal
Western Digital Research
