[PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64

Jason Gunthorpe jgg at nvidia.com
Thu Jan 25 09:43:33 PST 2024


On Wed, Jan 24, 2024 at 03:26:34PM -0400, Jason Gunthorpe wrote:

> The suggestion that it should not have any interleaving instructions
> and use STP came from our CPU architecture team.

I got some more details here.

They point to the ARM publication about write combining:

https://community.arm.com/cfs-file/__key/telligent-evolution-components-attachments/13-150-00-00-00-00-10-12/Understanding_5F00_Write_5F00_Combining_5F00_on_5F00_Arm_5F00_V.1.0.pdf

specifically to its example code, which uses 4x 128-bit NEON stores.

They point at the actual CPU design and say it is optimized for
128-bit stores (STP and ST4 included, it seems).

64-bit stores trigger some different behavior.
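
To make the two shapes concrete, here is a minimal sketch of a 64-byte
copy done both ways. It is an illustration only, not the code from the
patch: the names copy64_stp() and copy64_str() are hypothetical, it uses
STP on general-purpose register pairs rather than the NEON stores in the
ARM document's example, and the non-arm64 branch exists only so the
sketch compiles elsewhere. Whether either form actually write-combines
depends on the CPU.

```c
#include <stdint.h>

/* Hypothetical helper: 64 bytes as four back-to-back 128-bit STP stores. */
static inline void copy64_stp(volatile void *dst, const uint64_t *src)
{
	/* Read all eight words first so no loads sit between the stores. */
	uint64_t v0 = src[0], v1 = src[1], v2 = src[2], v3 = src[3];
	uint64_t v4 = src[4], v5 = src[5], v6 = src[6], v7 = src[7];
#if defined(__aarch64__)
	__asm__ __volatile__("stp %x0, %x1, [%8, #0]\n\t"
			     "stp %x2, %x3, [%8, #16]\n\t"
			     "stp %x4, %x5, [%8, #32]\n\t"
			     "stp %x6, %x7, [%8, #48]"
			     :
			     : "r"(v0), "r"(v1), "r"(v2), "r"(v3),
			       "r"(v4), "r"(v5), "r"(v6), "r"(v7), "r"(dst)
			     : "memory");
#else
	/* Fallback for other architectures; makes no write-combining claim. */
	volatile uint64_t *d = dst;
	d[0] = v0; d[1] = v1; d[2] = v2; d[3] = v3;
	d[4] = v4; d[5] = v5; d[6] = v6; d[7] = v7;
#endif
}

/* Hypothetical helper: the same 64 bytes as eight 64-bit STR stores. */
static inline void copy64_str(volatile void *dst, const uint64_t *src)
{
	uint64_t v0 = src[0], v1 = src[1], v2 = src[2], v3 = src[3];
	uint64_t v4 = src[4], v5 = src[5], v6 = src[6], v7 = src[7];
#if defined(__aarch64__)
	__asm__ __volatile__("str %x0, [%8, #0]\n\t"
			     "str %x1, [%8, #8]\n\t"
			     "str %x2, [%8, #16]\n\t"
			     "str %x3, [%8, #24]\n\t"
			     "str %x4, [%8, #32]\n\t"
			     "str %x5, [%8, #40]\n\t"
			     "str %x6, [%8, #48]\n\t"
			     "str %x7, [%8, #56]"
			     :
			     : "r"(v0), "r"(v1), "r"(v2), "r"(v3),
			       "r"(v4), "r"(v5), "r"(v6), "r"(v7), "r"(dst)
			     : "memory");
#else
	/* Fallback for other architectures; makes no write-combining claim. */
	volatile uint64_t *d = dst;
	d[0] = v0; d[1] = v1; d[2] = v2; d[3] = v3;
	d[4] = v4; d[5] = v5; d[6] = v6; d[7] = v7;
#endif
}
```

Both helpers write the same bytes; the only difference is the width and
count of the store instructions, which is the point the CPU team is
raising.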

I have no way to know if it will be OK for other drivers that expect
this to be a performance path in the kernel.

Are you *sure* you want to do this STR version? If it works for mlx5 I
will send the patch and the other companies can come later with
performance data.

Jason
