L_PTE_MT_BUFFERABLE / device ordered memory

Prof. Michael Taylor prof.taylor at gmail.com
Fri Dec 30 14:23:40 PST 2022


Hi,

Apologies in advance if I have missed an ages old thread on this.  And
apologies for the length of the description.

I am trying to tune the memory-mapped I/O performance of a Zynq-7000
with an ARM Cortex-A9 core running Linux. From what I can observe (in
the phys_mem_access_prot function in mmu.c), the default for a memory
range that has not been given in the device tree is "strongly
ordered", which means that the Zynq core will not proceed to the next
such memory request until the previous one has fully completed. This
performs very poorly, costing on average 24 cycles of overhead per
access. I believe this corresponds to the pgprot_noncached setting
(and in turn to L_PTE_MT_UNCACHED) in the kernel. The ARM
architecture, however, provides another page table entry setting,
"device ordering", which preserves the ordering and number of
requests going out to the device without stalling the ARM core.
Various Xilinx forum posts confirm that under the bare-metal OS
option, setting the ARM page table TEX, C, and B fields to 000, 0,
and 1 respectively improves performance greatly (roughly 4 cycles per
access).
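
For reference, that bare-metal setting amounts to building a
short-descriptor section entry like this (an illustrative sketch of
the ARMv7-A short-descriptor encoding; the macro names are mine, not
from any Xilinx or kernel header):

#include <stdint.h>

#define SECT_TYPE    0x2u          /* bits[1:0] = 10: section descriptor */
#define SECT_B       (1u << 2)     /* B bit */
#define SECT_AP_RW   (0x3u << 10)  /* AP[1:0] = 11: full access */
#define SECT_TEX(x)  ((uint32_t)(x) << 12)  /* TEX[2:0] */

static inline uint32_t device_ordered_section(uint32_t phys_base)
{
    /* TEX=000, C=0 (bit 3 left clear), B=1 -> Shareable Device */
    return (phys_base & 0xFFF00000u) | SECT_TYPE | SECT_B |
           SECT_AP_RW | SECT_TEX(0);
}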

Q1. My goal is to unlock this functionality in the Linux kernel. Any
best practices?

(Below is what I tried/figured out.)

Looking at the phys_mem_access_prot function, I concluded that
perhaps I should map the memory region in via the device tree, as
reserved, which would cause phys_mem_access_prot to select
pgprot_writecombine in the kernel. After doing this successfully, I
saw a great improvement in performance, but also that only a small
fraction of the transactions in my test case were actually making it
out to the I/O device. The test case writes a series of zeros to the
same I/O address, which corresponds to a FIFO, so I really need all
of the zeros to arrive. On the logic analyzer I saw that the
processor was merging away the repeated zero writes, and that the
AWCACHE field on the AXI bus was set to 3. This was quite surprising
to me, since per the ARM docs
(https://developer.arm.com/documentation/ihi0022/c/Additional-Control-Information/Cache-support)
that value marks the access as cacheable and bufferable, rather than
just bufferable.
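
Here is a stripped-down version of the test, in case it matters (the
FIFO physical address below is made up; the real one comes from my
device tree):

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define FIFO_PHYS 0x43C00000UL   /* hypothetical FIFO base */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return 1;
    volatile uint32_t *fifo = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, FIFO_PHYS);
    if (fifo == MAP_FAILED)
        return 1;
    /* Each store should appear as one AXI write; with the
     * write-combine mapping most of these never reach the bus. */
    for (int i = 0; i < 1024; i++)
        fifo[0] = 0;
    munmap((void *)fifo, 4096);
    close(fd);
    return 0;
}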

Diving deeper into the kernel, I see that in proc-macros.S, in
armv6_mt_table, the L_PTE_MT_BUFFERABLE entry is set to PTE_EXT_TEX(1)
(i.e. TEX,C,B = 001,0,0), which per
https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/Protected-Memory-System-Architecture--PMSA-/Memory-region-attributes/C--B--and-TEX-2-0--encodings
is listed as "Normal memory", but with outer and inner regions given
as non-cacheable. I would have expected PTE_BUFFERABLE (i.e.
TEX,C,B = 000,0,1).
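
To keep the encodings straight, here is a small decoder for just the
rows of that table in question (row names abbreviated from the spec):

#include <stdio.h>

static const char *decode(unsigned tex, unsigned c, unsigned b)
{
    switch ((tex << 2) | (c << 1) | b) {
    case 0x0: return "Strongly-ordered";                  /* 000,0,0 */
    case 0x1: return "Shareable Device";                  /* 000,0,1 */
    case 0x4: return "Normal, outer/inner non-cacheable"; /* 001,0,0 */
    case 0x7: return "Normal, write-back write-allocate"; /* 001,1,1 */
    default:  return "other";
    }
}

int main(void)
{
    printf("kernel BUFFERABLE (001,0,0): %s\n", decode(1, 0, 0));
    printf("what I expected   (000,0,1): %s\n", decode(0, 0, 1));
    return 0;
}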

Also looking at proc-v7-2level.S, I see that BUFFERABLE is defined as
TR=10, IR=00, OR=00, where the TR (memory type) field (per
https://developer.arm.com/documentation/ddi0406/c/System-Level-Architecture/System-Control-Registers-in-a-VMSA-implementation/VMSA-System-control-registers-descriptions--in-register-order/PRRR--Primary-Region-Remap-Register--VMSA?lang=en)
is defined as 00=strongly-ordered, 01=device, 10=normal memory. So I
would have expected TR=01, device memory.
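
For context: with TEX remap enabled (SCTLR.TRE=1), the PTE's {TEX[0],
C, B} bits form a 3-bit index into PRRR, and the two-bit field at
that index gives the memory type. A quick extraction helper (the
function name is mine):

#include <stdint.h>

static inline unsigned prrr_type(uint32_t prrr, unsigned tex0,
                                 unsigned c, unsigned b)
{
    unsigned n = (tex0 << 2) | (c << 1) | b;   /* remap index */
    return (prrr >> (2 * n)) & 0x3;            /* TRn field */
}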

So my conclusion is that pgprot_writecombine is not what I am looking
for: not only does it buffer and combine writes into bursts, it also
eliminates repeated writes to the same address.
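
By analogy with pgprot_noncached() and pgprot_writecombine() in
arch/arm/include/asm/pgtable.h, I suppose what I am after is
something like the following untested sketch (the name is mine),
which would select the kernel's existing shared-device memory type
instead of strongly-ordered:

#define pgprot_device_ordered(prot) \
        __pgprot_modify(prot, L_PTE_MT_MASK, L_PTE_MT_DEV_SHARED)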

Q2. What is the history behind using strongly-ordered rather than
device-ordered memory for I/O writes? Why does the write-combining
setting map to "Normal memory" rather than device memory? And why
does mmu.c not provide a mechanism for selecting device ordering (or
does it)?

Thanks!

Michael


