[PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64

Fri Nov 24 06:10:29 PST 2023

On Thu, 2023-11-23 at 21:04 +0200, Leon Romanovsky wrote:
> From: Jason Gunthorpe <jgg at nvidia.com>
> 
> The kernel supports write combining IO memory which is commonly used to
> generate 64 byte TLPs in a PCIe environment. On many CPUs this mechanism
> is pretty tolerant and a simple C loop will suffice to generate a 64 byte
> TLP.
> 
> However modern ARM64 CPUs are quite sensitive and a compiler generated
> loop is not enough to reliably generate a 64 byte TLP. Especially given
> the ARM64 issue that writel() does not codegen anything other than "[xN]"
> as the address calculation.
> 
> These newer CPUs require an orderly consecutive block of stores to work
> reliably. This is best done with four STP integer instructions (perhaps
> ST64B in future), or a single ST4 vector instruction.
> 
> Provide a new generic function memcpy_toio_64() which should reliably
> generate the needed instructions for the architecture, assuming address
> alignment. As the usual need for this operation is performance sensitive a
> fast inline implementation is preferred.
> 
> Implement an optimized version on ARM that is a block of 4 STP
> instructions.
> 
> The generic implementation is just a simple loop. x86-64 (clang 16)
> compiles this into an unrolled loop of 16 movq pairs.
> 
> Cc: Arnd Bergmann <arnd at arndb.de>
> Cc: Catalin Marinas <catalin.marinas at arm.com>
> Cc: Will Deacon <will at kernel.org>
> Cc: linux-arch at vger.kernel.org
> Cc: linux-arm-kernel at lists.infradead.org
> Signed-off-by: Jason Gunthorpe <jgg at nvidia.com>
> Signed-off-by: Leon Romanovsky <leonro at nvidia.com>
> ---
---8<---
> +#ifndef memcpy_toio_64
> +#define memcpy_toio_64 memcpy_toio_64
> +/**
> + * memcpy_toio_64	Copy 64 bytes of data into I/O memory
> + * @addr:		The (I/O memory) destination for the copy
> + * @buffer:		The (RAM) source for the data
> + *
> + * addr and buffer must be aligned to 8 bytes. This operation copies exactly 64
> + * bytes. It is intended to be used for write combining IO memory. The
> + * architecture should provide an implementation that has a high chance of
> + * generating a single combined transaction.
> + */
> +static inline void memcpy_toio_64(volatile void __iomem *addr,
> +				  const void *buffer)
> +{
> +	unsigned int i = 0;
> +
> +#if BITS_PER_LONG == 64
> +	for (; i != 8; i++)
> +		__raw_writeq(((const u64 *)buffer)[i],
> +			     ((u64 __iomem *)addr) + i);
> +#else
> +	for (; i != 16; i++)
> +		__raw_writel(((const u32 *)buffer)[i],
> +			     ((u32 __iomem *)addr) + i);
> +#endif

What's the reasoning behind not using the existing memcpy_toio() here?
On s390 the generic variant above would issue 8 of our special PCI store
instructions, whereas memcpy_toio() is defined as zpci_memcpy_toio(), which
can do the same with a single PCI store block instruction. Of course we
could provide our own memcpy_toio_64(), but that would end up being the
same as just calling memcpy_toio(addr, buffer, 64) here.
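
To make the difference concrete, here is a rough userspace sketch, not
kernel code. mock_raw_writeq(), mock_memcpy_toio(), copy64_loop(),
copy64_block() and the two counters are hypothetical stand-ins I made up
for illustration; they only model how many store operations each variant
would issue, under the assumption that memcpy_toio() can move the whole
block in one operation, as zpci_memcpy_toio() can.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustration-only counters; not part of any kernel API. */
static unsigned int single_stores;
static unsigned int block_stores;

/* Hypothetical stand-in for __raw_writeq(): one 8-byte store per call. */
static void mock_raw_writeq(uint64_t val, volatile uint64_t *addr)
{
	single_stores++;
	*addr = val;
}

/* Hypothetical stand-in for memcpy_toio(): one block operation per call. */
static void mock_memcpy_toio(volatile void *dst, const void *src, size_t count)
{
	volatile uint8_t *d = dst;
	const uint8_t *s = src;

	block_stores++;
	while (count--)
		*d++ = *s++;
}

/* Shape of the generic 64-bit loop from the patch: eight separate stores. */
static void copy64_loop(volatile void *addr, const void *buffer)
{
	unsigned int i;

	for (i = 0; i != 8; i++)
		mock_raw_writeq(((const uint64_t *)buffer)[i],
				((volatile uint64_t *)addr) + i);
}

/* Alternative raised in this reply: delegate to the block copy. */
static void copy64_block(volatile void *addr, const void *buffer)
{
	/* The kernel-doc requires 8-byte alignment of both pointers. */
	assert(((uintptr_t)addr & 7) == 0 && ((uintptr_t)buffer & 7) == 0);
	mock_memcpy_toio(addr, buffer, 64);
}
```

Both variants leave the same 64 bytes at the destination; in this model
the loop issues eight store operations and the block variant issues one.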

> +}
> +#endif
> +
>  extern int devmem_is_allowed(unsigned long pfn);
>  
>  #endif /* __KERNEL__ */