[PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64
Niklas Schnelle
schnelle at linux.ibm.com
Fri Nov 24 07:59:38 PST 2023
On Fri, 2023-11-24 at 10:55 -0400, Jason Gunthorpe wrote:
> On Fri, Nov 24, 2023 at 03:48:22PM +0100, Niklas Schnelle wrote:
> > On Fri, 2023-11-24 at 10:20 -0400, Jason Gunthorpe wrote:
> > > On Fri, Nov 24, 2023 at 03:10:29PM +0100, Niklas Schnelle wrote:
> > >
> > > > What's the reasoning behind not using the existing memcpy_toio()
> > > > here?
> > >
> > > Going forward CPUs are implementing an instruction to do a 64 byte
> > > aligned store, this is a wrapper for exactly that operation.
> > >
> > > memcpy_toio() is much more general, it allows unaligned buffers and
> > > non-multiples of 64. Adapting the general version to generate the
> > > optimized version in the cases it can is complex and has a codegen
> > > penalty..
> >
> > I think you misunderstood me. I understand why you want a separate
> > memcpy_toio_64(). I just wonder if its generic implementation shouldn't
> > just be a define or inline wrapper for memcpy_toio(addr, buffer, 64).
>
> Oh, yes, I totally did.
>
> I'm worried that x86 will less reliably generate write combining with
> it's memcpy_toio implemention. It codegens byte copies for that
> function :(
Oh ok I see what you mean.
>
> > Also seeing the second patch of course that would no longer really test
> > for write combining for us which we can also do but I think that's okay
> > and you're probably going to use memcpy_toio_64() in more places and
> > there we really want the PCI store block.
>
> Right now we don't have in-kernel performance use cases for write
> combining for mlx5.
Is the code in patch 2 performance critical?
>
> Userspace uses the WC and we already have the special 390 instructions
> for batching in rdma-core already, IIRC.
Yes, I added that support to rdma-core :-)
>
> So it would be appropriate for s390 to use a consistent path.
>
> Jason
This should be as easy as adding
#define memcpy_toio_64(to, from) zpci_memcpy_toio(to, from, 64)
to arch/s390/include/asm/io.h. I'm wondering if we should do that as
part of this series. It's not as good as a special case but probably
better than the existing loop.
I don't think we have any existing in-kernel users of memcpy_toio() on
s390 so far though so I'd like to give this some extra testing. Could
you share instructions on how to exercise the code path of patch 2 on a
ConnectX-5 or 6? Is this exercised e.g. when using NVMe-oF RDMA?
Thanks,
Niklas
More information about the linux-arm-kernel
mailing list