[PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64
Catalin Marinas
catalin.marinas at arm.com
Wed Jan 24 09:22:05 PST 2024
On Wed, Jan 24, 2024 at 09:27:19AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 24, 2024 at 12:40:29PM +0000, Catalin Marinas wrote:
>
> > > Just to be clear, that means we should drop this patch ("arm64/io: add
> > > memcpy_toio_64") for now, right?
> >
> > In its current form yes, but that doesn't mean that memcpy_toio_64()
> > cannot be reworked differently.
>
> I gave up on touching memcpy_toio_64(), it doesn't work very well
> because of the weak alignment
>
> Instead I followed your suggestion to fix __iowrite64_copy()
I forgot the details. Was it to introduce an __iowrite512_copy()
function or to simply use __iowrite64_copy() with a count of 8?
> There are only a couple of places that use this API:
[...]
> __iowrite64_copy() has a sufficient API that the compiler can inline
> the STP block as this patch did.
>
> I experimented with having memcpy_toio_64() invoke __iowrite64_copy(),
> but it did not look very nice. Maybe there is a possible performance
> win there, I don't know.
Just invoking __iowrite64_copy() with a count of 8 wouldn't work well
even if we have the writeq generating STR with an offset (well, it also
increments the pointers, so I don't think Mark's optimisation would
help). The copy implies loads and these would be interleaved with stores
and potentially get in the way of write combining. An
__iowrite512_copy() or maybe even an optimised __iowrite64_copy() for
count 8 could do the loads first followed by the stores. You can use a
special path in __iowrite64_copy() with __builtin_constant_p().
You can try an arm64-specific __iowrite64_copy() and see how it
goes (together with Mark's patch):
void __iowrite64_copy(void __iomem *to, const void *from,
                      size_t count)
{
        u64 __iomem *dst = to;
        const u64 *src = from;
        const u64 *end = src + count;

        /*
         * Try a 64-byte write, the CPUs tend to write-combine them.
         */
        if (__builtin_constant_p(count) && count == 8) {
                __raw_writeq(*src, dst);
                __raw_writeq(*(src + 1), dst + 1);
                __raw_writeq(*(src + 2), dst + 2);
                __raw_writeq(*(src + 3), dst + 3);
                __raw_writeq(*(src + 4), dst + 4);
                __raw_writeq(*(src + 5), dst + 5);
                __raw_writeq(*(src + 6), dst + 6);
                __raw_writeq(*(src + 7), dst + 7);
                return;
        }

        while (src < end)
                __raw_writeq(*src++, dst++);
}
EXPORT_SYMBOL_GPL(__iowrite64_copy);
What we don't have is inlining of __iowrite64_copy(), but if we need
that we can move away from a weak symbol to a static inline.
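
For illustration, a rough sketch of the static inline route (the
__iowrite64_copy_full() fallback and the #define guard are hypothetical
names, and the generic lib/iomap_copy.c version would need a matching
#ifndef):

/* Sketch only: arch/arm64/include/asm/io.h */
#define __iowrite64_copy __iowrite64_copy       /* hide the generic out-of-line copy */

void __iowrite64_copy_full(void __iomem *to, const void *from, size_t count);

static inline void __iowrite64_copy(void __iomem *to, const void *from,
                                    size_t count)
{
        /* Only the constant 64-byte case is worth inlining. */
        if (__builtin_constant_p(count) && count == 8) {
                u64 __iomem *dst = to;
                const u64 *src = from;

                __raw_writeq(*src, dst);
                __raw_writeq(*(src + 1), dst + 1);
                __raw_writeq(*(src + 2), dst + 2);
                __raw_writeq(*(src + 3), dst + 3);
                __raw_writeq(*(src + 4), dst + 4);
                __raw_writeq(*(src + 5), dst + 5);
                __raw_writeq(*(src + 6), dst + 6);
                __raw_writeq(*(src + 7), dst + 7);
                return;
        }
        __iowrite64_copy_full(to, from, count);
}

The count == 8 check then resolves at compile time in the callers and
everything else still goes out of line.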
Give this a go and see if you get write-combining on your hardware.
If the loads interleaved with the stores get in the way, maybe we can
resort to inline asm.
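
If it comes to that, one possible shape for the asm (a sketch only, the
helper name is made up and the "memory" clobber is just being
conservative) is four STPs, so the eight doublewords are issued back to
back with no loads in between:

static inline void __iowrite512_stp(void __iomem *to, const u64 *from)
{
        /* Four store-pair instructions covering the 64-byte block. */
        asm volatile("stp %x0, %x1, [%8, #16 * 0]\n\t"
                     "stp %x2, %x3, [%8, #16 * 1]\n\t"
                     "stp %x4, %x5, [%8, #16 * 2]\n\t"
                     "stp %x6, %x7, [%8, #16 * 3]"
                     :
                     : "rZ"(from[0]), "rZ"(from[1]), "rZ"(from[2]),
                       "rZ"(from[3]), "rZ"(from[4]), "rZ"(from[5]),
                       "rZ"(from[6]), "rZ"(from[7]), "r"(to)
                     : "memory");
}

The compiler can schedule the eight loads however it likes, but the
stores themselves are guaranteed to be emitted as contiguous STPs.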
--
Catalin