[PATCH v3 6/6] IB/mlx5: Use __iowrite64_copy() for write combining stores

Mon Jul 14 22:57:07 PDT 2025

On Mon, Jul 14, 2025 at 06:55:04PM -0300, Jason Gunthorpe wrote:
> On Thu, Apr 11, 2024 at 01:46:19PM -0300, Jason Gunthorpe wrote:
> > mlx5 has a built in self-test at driver startup to evaluate if the
> > platform supports write combining to generate a 64 byte PCIe TLP or
> > not. This has proven necessary because a lot of common scenarios end up
> > with broken write combining (especially inside virtual machines) and there
> > is no other way to learn this information.
> > 
> > This self test has been consistently failing on new ARM64 CPU
> > designs (specifically with NVIDIA Grace's implementation of Neoverse
> > V2). The C loop around writeq() generates some pretty terrible ARM64
> > assembly, but historically this has worked on a lot of existing ARM64 CPUs
> > till now.
> > 
> > We see it succeed about 1 time in 10,000 on the worst affected
> > systems. The CPU architects speculate that the load instructions
> > interspersed with the stores make the WC buffers statistically flush too
> > often and thus the generation of large TLPs becomes infrequent. This makes
> > the boot up test unreliable in that it indicates no write-combining,
> > however userspace would be fine since it uses a ST4 instruction.
> 
> Hi Catalin,
> 
> After a year of testing this in real systems it turns out that some
> systems are still not good enough with the unrolled 8 byte store loop.
> In my view the CPUs are quite bad here and this WC performance
> optimization is not working very well.
> 
> There are only two more options to work around this issue: use the
> unrolled 16 byte STP or the single Neon instruction 64 byte store.
> 
> Since STP was rejected already, we've only tested the Neon version. It
> does make a huge improvement, but it still occasionally fails to
> combine. The CPU is really bad at this :(
> 
> So we want to make mlx5 use the single 64 byte neon store instruction
> like userspace has been using for a long time for this testing
> algorithm.
> 
> It is simple enough, but the question has come up where to put the
> code.  Do you want the neon option to live in the arch/arm64 code,
> or should we stick it in the driver under an #ifdef?
> 
> The entry/exit from neon is slow enough I don't think any driver doing
> performance work would want to use neon instead of __iowrite64_copy(),
> so I do not think it should be hidden inside __iowrite64_copy(). Nor
> have I thought of a name for an arch generic function..

__iowrite64_slow_copy() ????
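
A minimal userspace sketch of the shape a generic fallback for such a
helper could take. The name iowrite64_slow_copy_sketch() is hypothetical
(borrowed from the suggestion above, not an existing kernel function), and
plain volatile stores to ordinary memory stand in for MMIO writes. An arm64
override would presumably replace the inner block with the single 64 byte
Neon store discussed in the quoted mail; that part is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, illustration only: copy `count` 64-bit words from
 * `from` to `to`, emitting the stores for each 64-byte block back to back.
 * This is a generic fallback; it does NOT use Neon and is not kernel code.
 */
static void iowrite64_slow_copy_sketch(volatile void *to, const void *from,
				       size_t count)
{
	volatile uint64_t *dst = to;
	const uint64_t *src = from;

	/* Both pointers are assumed to be 8-byte aligned. */
	assert(((uintptr_t)to & 7) == 0);
	assert(((uintptr_t)from & 7) == 0);

	/* One 64-byte block (eight 8-byte stores) per iteration. */
	while (count >= 8) {
		dst[0] = src[0];
		dst[1] = src[1];
		dst[2] = src[2];
		dst[3] = src[3];
		dst[4] = src[4];
		dst[5] = src[5];
		dst[6] = src[6];
		dst[7] = src[7];
		dst += 8;
		src += 8;
		count -= 8;
	}

	/* Remaining words that do not fill a whole 64-byte block. */
	while (count--)
		*dst++ = *src++;
}
```

Keeping this separate from __iowrite64_copy() would match the concern in
the quoted mail that the Neon entry/exit cost should not be hidden inside
the fast path.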

> 
> Thanks,
> Jason