[PATCH rdma-next 1/2] arm64/io: add memcpy_toio_64

Jason Gunthorpe jgg at nvidia.com
Wed Dec 6 04:59:19 PST 2023


On Wed, Dec 06, 2023 at 11:09:18AM +0000, Catalin Marinas wrote:
> On Tue, Dec 05, 2023 at 03:51:30PM -0400, Jason Gunthorpe wrote:
> > On Tue, Dec 05, 2023 at 07:34:45PM +0000, Catalin Marinas wrote:
> > > > 2) You want to #define __iowrite512_copy() to memcpy_toio() on ARM and
> > > >    implement some quad STP optimization for this case?
> > > 
> > > We can have the generic __iowrite512_copy() do memcpy_toio() and have
> > > the arm64 implement an optimised version.
> > > 
> > > What I'm not entirely sure of is the DGH (whatever the io_* barrier name
> > > is). I'd put it in the same __iowrite512_copy() function and remove it
> > > from the driver code. Otherwise when ST64B is added, we have an
> > > unnecessary DGH in the driver. If this does not match the other
> > > __iowrite*_copy() semantics, we can come up with another name. But start
> > > with this for now and document the function.
> > 
> > I think the iowrite is only used for WC and the DGH is functionally
> > harmless for non-WC, so it makes sense.
> > 
> > In this case we should just remove the DGH macro from the generic
> > architecture code and tell people to use iowrite - since we now
> > understand that callers basically have to in order to use DGH on new
> > ARM CPUs.
> 
> That works for me but what would the semantics be for __iowrite64_copy()
> for example? Is there a DGH at the end of the whole write or after each
> iteration?

End of the iowrite_copy function call. The purpose of DGH is to reduce
latency through write combining buffers by providing a hint to the HW
to close them. __iowrite64_copy can be reasonably thought of as trying
to push the argument into a single TLP.
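
As a rough userspace sketch of that placement (not the kernel
implementation): wc_flush_hint() is a hypothetical stand-in for the DGH
hint, and sketch_iowrite64_copy() is an illustrative name, not the real
__iowrite64_copy(). The hint is given once, after the last store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the arm64 DGH hint. It only counts calls so
 * that the placement of the hint can be checked in userspace.
 */
static unsigned int wc_flush_hints;

static void wc_flush_hint(void)
{
	wc_flush_hints++;
}

/*
 * Sketch only: copy 'count' 64-bit words to the destination, then give
 * a single hint at the end of the whole call, not after each iteration.
 */
static void sketch_iowrite64_copy(volatile uint64_t *to,
				  const uint64_t *from, size_t count)
{
	assert(to && from);

	for (size_t i = 0; i < count; i++)
		to[i] = from[i];

	wc_flush_hint();
}
```

With this shape a 64-byte copy (count == 8) issues eight stores and
exactly one hint.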

> I'd go with the former since e.g. hns3_tx_push_bd() does
> that (and doesn't seem to be a 64 byte copy).

sizeof(struct hns3_desc) == 32, HNS3_MAX_PUSH_BD_NUM == 2, so it is 64
bytes.
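
The arithmetic can be written down as a compile-time check. This is
only a sketch: struct sketch_desc and SKETCH_MAX_PUSH_BD_NUM are
hypothetical stand-ins sized to match the numbers above, not the real
hns3 definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-byte descriptor standing in for struct hns3_desc. */
struct sketch_desc {
	uint8_t raw[32];
};

/* Hypothetical stand-in for HNS3_MAX_PUSH_BD_NUM. */
#define SKETCH_MAX_PUSH_BD_NUM 2

/* Fails the build if the push size is not 64 bytes. */
static_assert(sizeof(struct sketch_desc) * SKETCH_MAX_PUSH_BD_NUM == 64,
	      "push must be exactly 64 bytes");

/* Number of bytes written by one push of the maximum descriptor count. */
static size_t sketch_push_bytes(void)
{
	return sizeof(struct sketch_desc) * SKETCH_MAX_PUSH_BD_NUM;
}
```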

Indeed, I already know this HW and it functions similarly to mlx5. In
userspace it uses the ST4 instruction; in fact, HNS was the team that
made that change, citing measured improvements on their SoC. Changing
this to use the STP block will likely be an improvement.

Jason


