[PATCH] arm64: Enable PCI write-combine resources under sysfs

Jason Gunthorpe jgg at nvidia.com
Thu Sep 10 08:37:58 EDT 2020

On Thu, Sep 10, 2020 at 10:46:00AM +0100, Lorenzo Pieralisi wrote:
> [+Jason]
> On Tue, Sep 08, 2020 at 09:33:42AM +1000, Benjamin Herrenschmidt wrote:
> > On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote:
> > > > It's been what other architectures have been doing for mroe than a
> > > > decade without significant issues... I don't think you should worry
> > > > too
> > > > much about this.
> > > 
> > > Minus what I wrote above, I agree with you. I'd still be able to
> > > understand what this patch changes in the mellanox driver HW
> > > handling though - not sure what they expect from
> > > arch_can_pci_mmap_wc()
> > > returning 1.
> > 
> > I don't know enough to get into the finer details but looking a bit it
> > seems when this is set, they allow extra ioctls to create buffers
> > mapped with pgprot_writecombine().
> > 
> > I suppose this means faster MMIO backet buffers for small packets (ie,
> > non-DMA use case).
> > 
> > Also note that mlx5_ib_test_wc() only uses arch_can_pci_mmap_wc() for a
> > non-ROCE ethernet port on a PF... For anyting else, it just seems to
> > actually try to do it and see what happens :-)
> > 
> > Leon: Can you clarify the use of arch_can_pci_mmap_wc() in mlx5 and
> > whether you see an issue with enabling this on arm64 ?
> Hi Jason,
> I was wondering if you could help us with this question, we are trying
> to understand what enabling arch_can_pci_mmap_wc() on arm64 would cause
> in mellanox drivers wrt mappings and whether there is an expected
> behaviour behind them, in particular whether there is an implicit
> reliance on x86 write-combine arch/interconnect details.

Looking back at this big thread, let me add some perspective

Mellanox drivers have a performance optimization where a 64 byte MemWr
TLP from the root complex to the MMIO BAR will perform better, often
quite a bit better. We run WC in full QA'd production on PPC, ARM and

The userspace generates a burst of sequential, aligned 8 byte CPU
writes to the MMIO address and triggers an arch specific CPU barrier
to flush/fence the CPU WC buffer. At this point the CPU should emit
the 64 byte TLP toward the device ASAP.

In other words, the only usage here is only about Write. The CPU
should never, ever, generate a MemRD TLP. The code never does a read

If the CPU fails to generate a 64 byte TLP then the device will still
operate correctly but does a different, slower, flow.

If the CPU consistently fails WC then the overhead of trying the WC
flow is a notable net performance loss, and on these CPUs we want to
use only 8 byte write to the MMIO BAR, with NC memory.

There are many important details about how this works and how this
must interact with the CPU barriers and locking.

On x86, arch_can_pci_mmap_wc() is basically meaningless. It indicates
there is a chance that pgprot_writecombine() could work. It can also
be 0 and write combining will work just fine :\.

Thus, mlx5 switched to doing a runtime WC test to determine if the CPU
actually supports WC or not. If the arch can reliably tell the driver
then this test could be avoided. Based on this test the WC mode is
allowed for userspace.

The one call to arch_can_pci_mmap_wc() is in a case where the HW is
configured in a way that can't run the test, here we use
arch_can_pci_mmap_wc() to guess if the CPU has working WC or not.
Ideally an arch would return 1 only when the CPU has working WC.

Depending on workload WC may not be a win. In those cases userspace
will select NC. Thus the same PCI MMIO BAR region can have a mixture
of pages with WC and NC mappings to userspace.

For DEVICE_GRE.. For years now, many deployments of ARM & mlx5 devices
are using an out of tree patch to use DEVICE_GRE for WC on mlx5. This
seems to be the preferred working configuration on at least some ARM
SOCs. So far nobody from the ARM world has shown interest in making a
mainline solution. :(

I can't recall if this is because the relevant ARM SOC's don't support
pgprot_writecombine(), or it doesn't work properly.

I was told the reason ARM never enabled WC was because unaligned
access to WC memory was not supported, and there were existing drivers
that did unaligned writes that would malfunction. I thought this meant
that pgprot_writecombine() was non-working in ARM Linux? So, bit
surprised to see a patch messing with arch_can_pci_mmap_wc() and not
changing the defintion of pgprot_writecombine() ?

mlx5 is more or less a representative user WC for this kind of
'message push' methodology. Several other RDMA devices do this as
well. The methodology is important enough that recent Intel CPUs have
a dedicated instruction to push a 128 byte message in a single TLP
avoiding this whole WC mess.

Frankly, I think the kernel should introduce a well defined pgprot for
this working mode that all archs can agree upon. It should include the
alignment requirement, message push function, CPU barrier macros, and
locking macros that are needed to use this facility correctly.

Defined in a way that is compatible with DEVICE_GRE and can be used by
these 'message push' drivers. That would switch alway most of the
users in the kernel today.


More information about the linux-arm-kernel mailing list