Read speed for a PCIe NVMe SSD is ridiculously slow on a multi-socket machine.

Keith Busch kbusch at kernel.org
Fri Mar 24 12:34:51 PDT 2023


On Fri, Mar 24, 2023 at 06:56:03AM +0000, Alexander Shumakovitch wrote:
> physical CPU hdparm or dd is run on:
>       
>     # hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 510 MB in  3.01 seconds = 169.28 MB/sec
>     
>     # taskset -c 0-7 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 5252 MB in  3.00 seconds = 1750.28 MB/sec
>     
>     # taskset -c 8-15 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 496 MB in  3.01 seconds = 164.83 MB/sec
>     
>     # taskset -c 24-31 hdparm -t /dev/nvme0n1 
>     
>     /dev/nvme0n1:
>      Timing buffered disk reads: 520 MB in  3.01 seconds = 172.65 MB/sec
> 
> Even more mysteriously, the writing speeds are consistent across all the
> CPUs at about 800MB/sec (see the output of dd attached).

When writing host->dev, there is no cache coherency to consider, so writes stay fast
regardless of NUMA placement. Reading dev->host does involve cache coherency, and that
can add considerable overhead when it crosses nodes, though a 10x gap seems a bit high.

Retrying with Damien's O_DIRECT suggestion is a good idea.
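For example, something along these lines should take the page cache out of the
picture (the bs/count values are just illustrative, and hdparm's --direct option
is its O_DIRECT timing mode):

    dd if=/dev/nvme0n1 of=/dev/null bs=1M count=4096 iflag=direct  # O_DIRECT read, bypasses the page cache
    hdparm --direct -t /dev/nvme0n1  # same -t timing, but using O_DIRECT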

Also, 'taskset' only pins which CPUs the process schedules on, not which memory
node it allocates from. Try 'numactl' instead to force local node allocations.
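
Something like the following keeps both the CPUs and the buffer memory on one
node at a time (node IDs 0/1 are placeholders; check the actual layout first):

    numactl --hardware  # show which CPUs and memory belong to each node
    numactl --cpunodebind=0 --membind=0 hdparm -t /dev/nvme0n1  # node 0 CPUs + memory
    numactl --cpunodebind=1 --membind=1 hdparm -t /dev/nvme0n1  # node 1 CPUs + memory

Comparing the node local to the NVMe device against a remote one should show
whether the allocation node, rather than the scheduling CPU, is what drives the
read numbers.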


