[PATCH] arm64: Enable PCI write-combine resources under sysfs

Lorenzo Pieralisi lorenzo.pieralisi at arm.com
Thu Sep 10 11:17:21 EDT 2020

On Thu, Sep 10, 2020 at 09:37:58AM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 10, 2020 at 10:46:00AM +0100, Lorenzo Pieralisi wrote:
> > [+Jason]
> > 
> > On Tue, Sep 08, 2020 at 09:33:42AM +1000, Benjamin Herrenschmidt wrote:
> > > On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote:
> > > > > It's been what other architectures have been doing for more than a
> > > > > decade without significant issues... I don't think you should worry
> > > > > too much about this.
> > > > 
> > > > Minus what I wrote above, I agree with you. I'd still like to
> > > > understand what this patch changes in the mellanox driver HW
> > > > handling though - not sure what they expect from
> > > > arch_can_pci_mmap_wc() returning 1.
> > > 
> > > I don't know enough to get into the finer details but looking a bit it
> > > seems when this is set, they allow extra ioctls to create buffers
> > > mapped with pgprot_writecombine().
> > > 
> > > I suppose this means faster MMIO packet buffers for small packets (ie,
> > > non-DMA use case).
> > > 
> > > Also note that mlx5_ib_test_wc() only uses arch_can_pci_mmap_wc() for a
> > > non-ROCE ethernet port on a PF... For anything else, it just seems to
> > > actually try to do it and see what happens :-)
> > > 
> > > Leon: Can you clarify the use of arch_can_pci_mmap_wc() in mlx5 and
> > > whether you see an issue with enabling this on arm64 ?
> > 
> > Hi Jason,
> > 
> > I was wondering if you could help us with this question, we are trying
> > to understand what enabling arch_can_pci_mmap_wc() on arm64 would cause
> > in mellanox drivers wrt mappings and whether there is an expected
> > behaviour behind them, in particular whether there is an implicit
> > reliance on x86 write-combine arch/interconnect details.
>
> Looking back at this big thread, let me add some perspective

Thank you - it was needed.

> Mellanox drivers have a performance optimization where a 64 byte MemWr
> TLP from the root complex to the MMIO BAR will perform better, often
> quite a bit better. We run WC in full QA'd production on PPC, ARM and
> x86.
>
> The userspace generates a burst of sequential, aligned 8 byte CPU
> writes to the MMIO address and triggers an arch specific CPU barrier
> to flush/fence the CPU WC buffer. At this point the CPU should emit
> the 64 byte TLP toward the device ASAP.

While at it - mind explaining please what those 64 bytes actually contain ?
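
For reference, the write sequence described in the quoted text can be
sketched roughly as below. This is only an illustration under stated
assumptions: push_message(), wc_flush_barrier() and MSG_BYTES are
hypothetical names, a compiler fence builtin stands in for the arch
specific barrier, and the destination is whatever 64-byte-aligned
buffer the caller passes in rather than a real WC-mapped MMIO BAR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSG_BYTES 64 /* message size discussed in the thread */

/*
 * Hypothetical stand-in for the arch specific barrier that flushes the
 * CPU write-combining buffer; a real driver would use the barrier its
 * architecture requires, not this generic fence.
 */
static inline void wc_flush_barrier(void)
{
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
}

/*
 * Copy one 64 byte message as eight sequential, aligned 8 byte stores,
 * then issue the barrier. Whether the stores are merged into a single
 * 64 byte MemWr TLP depends on the CPU and the bus->PCI logic.
 */
static void push_message(volatile uint64_t *dst, const uint64_t *src)
{
	assert(((uintptr_t)dst % MSG_BYTES) == 0); /* alignment requirement */

	for (size_t i = 0; i < MSG_BYTES / sizeof(uint64_t); i++)
		dst[i] = src[i];

	wc_flush_barrier();
}
```

Nothing in this sketch issues a read of the destination, which matches
the write-only usage described in the thread.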

> In other words, the only usage here is only about Write. The CPU
> should never, ever, generate a MemRD TLP. The code never does a read
> explicitly.

On arm64 pgprot_writecombine() is speculative memory (normal
non-cacheable), which may not do what you expect from it.

> If the CPU fails to generate a 64 byte TLP then the device will still
> operate correctly but does a different, slower, flow.

Side note: on ARM that TLP is not a native interconnect transaction;
put differently, it depends on what the system-bus->PCI logic does in
this respect.

> If the CPU consistently fails WC then the overhead of trying the WC
> flow is a notable net performance loss, and on these CPUs we want to
> use only 8 byte write to the MMIO BAR, with NC memory.

That's why I looped you in - that's what worries me about "enabling"
arch_can_pci_mmap_wc() on arm64. If we enable it and we have perf
regressions that's not OK.

Or we *can* enable arch_can_pci_mmap_wc() but force the mellanox
driver (or more broadly all drivers following this message push
semantics) to use "something else" for WC detection.

> There are many important details about how this works and how this
> must interact with the CPU barriers and locking.
>
> On x86, arch_can_pci_mmap_wc() is basically meaningless.

On arm64 too, for the record - or better, write-combine is not
well defined, ergo I don't know what arch_can_pci_mmap_wc() means.

> It indicates there is a chance that pgprot_writecombine() could work.
> It can also be 0 and write combining will work just fine :\.
>
> Thus, mlx5 switched to doing a runtime WC test to determine if the CPU
> actually supports WC or not. If the arch can reliably tell the driver
> then this test could be avoided. Based on this test the WC mode is
> allowed for userspace.

Can you elaborate on this runtime test please ?

> The one call to arch_can_pci_mmap_wc() is in a case where the HW is
> configured in a way that can't run the test, here we use
> arch_can_pci_mmap_wc() to guess if the CPU has working WC or not.
> Ideally an arch would return 1 only when the CPU has working WC.

Which means we can guarantee the TLP packet you mentioned above I
guess ?

We have to define "working WC" :)
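
To make sure I read the decision flow correctly, here is a rough sketch
of it as described in the quoted text: the runtime test result wins
when the test can run, and the arch hint is only a guess for the case
where it cannot. All names below (wc_test_result,
wc_allowed_for_userspace(), arch_hint) are hypothetical and this is not
the mlx5 code.

```c
#include <stdbool.h>

/* Outcome of a runtime WC test; NOT_RUN models HW that cannot run it. */
enum wc_test_result {
	WC_TEST_NOT_RUN,
	WC_TEST_PASSED,
	WC_TEST_FAILED,
};

/*
 * Decide whether WC mappings are offered to userspace.
 * arch_hint models the value returned by arch_can_pci_mmap_wc() and is
 * consulted only when the runtime test could not be run.
 */
static bool wc_allowed_for_userspace(enum wc_test_result test, bool arch_hint)
{
	switch (test) {
	case WC_TEST_PASSED:
		return true;
	case WC_TEST_FAILED:
		return false;
	default:
		return arch_hint; /* a guess, not a guarantee */
	}
}
```

In this reading, returning 1 from the arch hook on a system where WC
does not actually combine would route the no-test case onto the slower
flow mentioned earlier.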

> Depending on workload WC may not be a win. In those cases userspace
> will select NC. Thus the same PCI MMIO BAR region can have a mixture
> of pages with WC and NC mappings to userspace.
>
> For DEVICE_GRE.. For years now, many deployments of ARM & mlx5 devices
> are using an out of tree patch to use DEVICE_GRE for WC on mlx5. This
> seems to be the preferred working configuration on at least some ARM
> SOCs. So far nobody from the ARM world has shown interest in making a
> mainline solution. :(
>
> I can't recall if this is because the relevant ARM SOC's don't support
> pgprot_writecombine(), or it doesn't work properly.
>
> I was told the reason ARM never enabled WC was because unaligned

When you say "enabled WC" I assume you mean making:

pgprot_writecombine() == DEVICE_GRE

> access to WC memory was not supported, and there were existing drivers
> that did unaligned writes that would malfunction. I thought this meant
> that pgprot_writecombine() was non-working in ARM Linux?

On arm64 pgprot_writecombine() is normal non-cacheable memory at the
moment - it works but that does not precisely do what you *expect* from
arch_can_pci_mmap_wc(), that's the whole point I am making.

> So, bit surprised to see a patch messing with arch_can_pci_mmap_wc()
> and not changing the definition of pgprot_writecombine() ?

We can't change pgprot_writecombine() to DEVICE_GRE, it can trigger
issues on some drivers, see unaligned memory access.

> mlx5 is more or less a representative user of WC for this kind of
> 'message push' methodology. Several other RDMA devices do this as
> well. The methodology is important enough that recent Intel CPUs have
> a dedicated instruction to push a 64 byte message in a single TLP
> avoiding this whole WC mess.
>
> Frankly, I think the kernel should introduce a well defined pgprot for
> this working mode that all archs can agree upon. It should include the
> alignment requirement, message push function, CPU barrier macros, and
> locking macros that are needed to use this facility correctly.
> Defined in a way that is compatible with DEVICE_GRE and can be used by
> these 'message push' drivers. That would switch away most of the
> users in the kernel today.

That's probably the way forward - I still have concerns about this
patch as it stands given your clarifications above.

