[PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64

Fri Nov 24 07:59:38 PST 2023

On Fri, 2023-11-24 at 10:55 -0400, Jason Gunthorpe wrote:
> On Fri, Nov 24, 2023 at 03:48:22PM +0100, Niklas Schnelle wrote:
> > On Fri, 2023-11-24 at 10:20 -0400, Jason Gunthorpe wrote:
> > > On Fri, Nov 24, 2023 at 03:10:29PM +0100, Niklas Schnelle wrote:
> > >  
> > > > What's the reasoning behind not using the existing memcpy_toio()
> > > > here?
> > > 
> > > Going forward CPUs are implementing an instruction to do a 64 byte
> > > aligned store, this is a wrapper for exactly that operation.
> > > 
> > > memcpy_toio() is much more general, it allows unaligned buffers and
> > > non-multiples of 64. Adapting the general version to generate the
> > > optimized version in the cases it can is complex and has a codegen
> > > penalty..
> > 
> > I think you misunderstood me. I understand why you want a separate
> > memcpy_toio_64(). I just wonder if its generic implementation shouldn't
> > just be a define or inline wrapper for memcpy_toio(addr, buffer, 64).
> 
> Oh, yes, I totally did.
> 
> I'm worried that x86 will less reliably generate write combining with
> it's memcpy_toio implemention. It codegens byte copies for that
> function :(

Oh ok I see what you mean.

> 
> > Also seeing the second patch of course that would no longer really test
> > for write combining for us which we can also do but I think that's okay
> > and you're probably going to use memcpy_toio_64() in more places and
> > there we really want the PCI store block.
> 
> Right now we don't have in-kernel performance use cases for write
> combining for mlx5.

Is the code in patch 2 performance critical?

> 
> Userspace uses the WC and we already have the special 390 instructions
> for batching in rdma-core already, IIRC.

Yes, I added that support to rdma-core :-)

> 
> So it would be appropriate for s390 to use a consistent path.
> 
> Jason

This should be as easy as adding

#define memcpy_toio_64(to, from) zpci_memcpy_toio(to, from, 64)

to arch/s390/include/asm/io.h. I'm wondering if we should do that as
part of this series. It's not as good as a special case but probably
better than the existing loop.

I don't think we have any existing in-kernel users of memcpy_toio() on
s390 so far though so I'd like to give this some extra testing. Could
you share instructions on how to exercise the code path of patch 2 on a
ConnectX-5 or 6? Is this exercised e.g. when using NVMe-oF RDMA?

Thanks,
Niklas