[PATCH v3 6/6] IB/mlx5: Use __iowrite64_copy() for write combining stores

Jason Gunthorpe jgg at nvidia.com
Tue Jul 15 04:52:00 PDT 2025


On Tue, Jul 15, 2025 at 11:15:25AM +0100, Will Deacon wrote:
> > Since STP was rejected already we've only tested the Neon version. It
> > does make a huge improvement, but it still somehow fails to combine
> > on rare occasions. The CPU is really bad at this :(
> 
> I think the thread was from last year so I've forgotten most of the
> details, but wasn't STP rejected because it wasn't virtualisable? 

Yes, that was the claim.

> In which case, doesn't NEON suffer from exactly the same (or possibly
> worse) problem?

In general yes, in specific no.

mlx5 (and other RDMA devices) have long used Neon for MMIO in
userspace, so any VMM assigning mlx5 devices simply must make this
work - it is already not optional. So we know that all VMs out there
with mlx5 support neon for mlx5, and it is safe for mlx5 to use.

Typically this is trivially done in a VMM by never emulating mlx5's
MMIO space. If the VMM takes a fault on an MMIO page it fixes the fault
and restarts the neon instruction.

The generality was the notion that there could be other devices in a
VM that are fully emulated and using these challenging instructions
would break the simple emulation. This is why the general purpose
__iowrite64_copy() didn't use STP.

> Also, have you managed to investigate why the CPU tends not to get this
> right? 

I have asked, and our CPU architects have said it is too complex to
analyze, though they admit it doesn't work entirely well :(

The belief is some micro-architectural condition is breaking it as we
see even neon instructions failing during every test.

They say it is fully fixed with ST64B in the future.

> Do we e.g. end up taking interrupts/exceptions while the self
> test is running or something like that?

I doubt it, the test is running in kernel mode during boot for
hundreds of iterations. An interrupt on every iteration is not
likely. Any single successful combine is a pass for the test.
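
To make the pass criterion concrete, here is a minimal sketch of it in
plain C. This is not the mlx5 self test itself: wc_selftest(),
wc_attempt_fn, struct fake_dev and fake_attempt() are all hypothetical
names made up for illustration, and the "device" is simulated so the
loop can be exercised outside the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical callback: performs one 64-byte write attempt and reports
 * whether the device observed it as a single combined write.
 */
typedef bool (*wc_attempt_fn)(void *ctx);

/*
 * Hypothetical test loop: pass as soon as any one attempt combines,
 * fail only if none of max_iters attempts did.
 */
static bool wc_selftest(wc_attempt_fn attempt, void *ctx,
			unsigned int max_iters)
{
	for (unsigned int i = 0; i < max_iters; i++) {
		if (attempt(ctx))
			return true;
	}
	return false;
}

/* Simulated device: combines only on call number succeed_on (0 = never). */
struct fake_dev {
	unsigned int calls;
	unsigned int succeed_on;
};

static bool fake_attempt(void *ctx)
{
	struct fake_dev *d = ctx;

	d->calls++;
	return d->succeed_on != 0 && d->calls == d->succeed_on;
}
```

With this shape, an interrupt or other disturbance would have to hit
every one of the iterations to turn a working CPU into a test failure,
which is why sporadic interruptions alone should not fail the test.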

Even an interrupt shouldn't disrupt a single-instruction Neon store,
yet we can still measure a low rate of neon failures.

> Sorry for the wall of questions!

No worries! It's weird and definitely complicated.

Jason



More information about the linux-arm-kernel mailing list