[PATCH 4/6] arm64/io: Provide a WC friendly __iowriteXX_copy()

Fri Feb 23 04:19:24 PST 2024

From: Niklas Schnelle
> Sent: 23 February 2024 11:38
...
> > Although I doubt that generating long TLP from byte writes is
> > really necessary.
> 
> I might have gotten confused but I think these are not byte writes.
> Remember that the count is in terms of the number of bits sized
> quantities to copy so "count == 1" is 4/8 bytes here.

Something made me think you were generating a byte version
as well as the 32 and 64 bit ones.
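
To make the 'count' semantics above concrete, here is a minimal
userspace sketch. The sketch_* names are made up for illustration and
this is not the kernel's implementation; it only assumes that 'count'
is the number of 32 or 64 bit quantities, as described in the quoted
text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a 32 bit copy helper.
 * 'count' is the number of 32 bit words, not bytes,
 * so count == 1 stores exactly 4 bytes.
 */
void sketch_write32_copy(volatile uint32_t *to, const uint32_t *from,
			 size_t count)
{
	/* The destination must be naturally aligned for 32 bit stores. */
	assert(((uintptr_t)to & 3) == 0);
	while (count--)
		*to++ = *from++;
}

/* Same idea for 64 bit quantities: count == 1 stores 8 bytes. */
void sketch_write64_copy(volatile uint64_t *to, const uint64_t *from,
			 size_t count)
{
	assert(((uintptr_t)to & 7) == 0);
	while (count--)
		*to++ = *from++;
}
```

Neither helper ever issues a byte-sized store; the smallest transfer
is one full 32 or 64 bit word.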

...
> > While write-combining to generate long TLP is probably mostly
> > safe for PCIe targets, there are some that will only handle
> > TLP for single 32bit data items.
> > Which might be why the code is explicitly requesting 4 byte copies.
> > So it may be entirely wrong to write-combine anything except
> > the generic memcpy_toio().
> >
> > 	David
> 
> On anything other than s390x this should only do write-combine if the
> memory mapping allows it, no? Meaning a driver that can't handle larger
> TLPs really shouldn't use ioremap_wc() then.

I can't decide whether merged writes could be required for some
target addresses but be problematic on others.
Probably not.

> On s390x one could argue that our version of __iowriteXX_copy() is
> strictly speaking not correct in that zpci_memcpy_toio() doesn't really
> use XX bit writes which is why for us memcpy_toio() was actually a
> better fit indeed. On the other hand doing 32 bit PCI stores (an s390x
> thing) can't combine multiple stores into a single TLP which these
> functions are used for and which has much more use cases than forcing a
> copy loop with 32/64 bit sized writes which would also be a lot slower
> on s390x than an aligned zpci_memcpy_toio().

If I read that correctly, 32bit writes don't get merged?
In any case, code that benefits from merging can (probably)
do 64bit writes, so is it even worth the effort of attempting
to merge 32bit ones?

Writes get 'posted' all over the place, so how many writes do you
need to do before write-combining makes a difference?
We've got logic in our fpga to trace the RX and TX TLPs [1].
Although the link is slow, back-to-back writes are limited by
what happens later in the fpga logic - not by the pcie link.
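
For the link-side part of that question only, a rough sketch of how
much of each write TLP is payload. The per-TLP overhead is passed in
as a parameter because it is an assumption: it depends on the link
generation and header format, and the function name is made up for
illustration. It says nothing about limits further downstream.

```c
#include <assert.h>

/*
 * Percentage of the bytes on the link that are payload for a
 * single write TLP.  'overhead_bytes' is an assumed per-TLP cost
 * (framing, sequence number, header, LCRC), not a measured value.
 */
unsigned int tlp_payload_percent(unsigned int payload_bytes,
				 unsigned int overhead_bytes)
{
	assert(payload_bytes + overhead_bytes > 0);
	return payload_bytes * 100u / (payload_bytes + overhead_bytes);
}
```

With an assumed overhead of 24 bytes this gives 14% for a 4 byte
write, 25% for an 8 byte write and 72% for a 64 byte write, so any
gain from merging only shows up if the link is the bottleneck.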

Reads are another matter entirely.
The x86 cpus I've used assign a tag to each cpu core.
So while reads from multiple processes happen in parallel, those
from a single process are definitely synchronous.
The cpu stalls for a few thousand clocks on every read.

Large read TLPs (and overlapped read TLPs) would have a much
bigger effect than large write TLPs.
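
As a back-of-envelope sketch of why: if every read stalls the cpu and
reads from one process never overlap, throughput is just bytes per
read divided by the stall time. The function name and the numbers in
the note below (3GHz clock, 3000 clock stall) are assumptions for
illustration, not measurements.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes per second achievable by one process issuing strictly
 * synchronous reads, each of which stalls for 'stall_clocks'
 * cpu clocks and returns 'bytes_per_read' bytes.
 */
uint64_t sync_read_bytes_per_sec(uint64_t cpu_hz, uint64_t stall_clocks,
				 uint64_t bytes_per_read)
{
	assert(stall_clocks > 0);
	return cpu_hz / stall_clocks * bytes_per_read;
}
```

With those assumed numbers, 8 byte reads give 8MB/s and 64 byte reads
give 64MB/s, so the read size scales throughput directly.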

[1] It is nice to be able to see what is going on without having
to beg/steal/borrow an expensive PCIe analyser and persuade the
hardware to work with it connected.

	David
