[Linaro-mm-sig] [PATCH/RFC 0/8] ARM: DMA-mapping framework redesign

Sat Jun 25 01:23:00 EDT 2011

On 24 June 2011 01:09, Michael K. Edwards <m.k.edwards at gmail.com> wrote:
> Jonathan -
>
> I'm inviting you to this conversation (and to linaro-mm-sig, if you'd
> care to participate!), because I'd really like your commentary on what
> it takes to make write-combining fully effective on various ARMv7
> implementations.

Thanks for the invite.  I'm not fully conversant with the kernel-level
intricacy, but I do know what application code sees, and I have a
fairly good idea of what happens at the SDRAM pinout.

> Getting full write-combining performance on Intel architectures
> involves a somewhat delicate dance:
>  http://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers/

At a high level, that looks like a pretty effective technique (and
well explained) that works cross-platform with detail changes.
However, that describes *read* combining on an uncached area, rather
than write combining.
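
As a rough illustration of the general shape of that technique (a
two-stage copy through a small cached bounce buffer), here is a minimal
portable sketch.  The function name staged_copy and the 4096-byte
buffer size are assumptions for illustration only; a real
implementation would replace the first memcpy() with whatever wide or
streaming loads the platform offers for uncached memory.

```c
#include <stddef.h>
#include <string.h>

/* Assumed bounce-buffer size: small enough to stay resident in L1. */
enum { BOUNCE_BYTES = 4096 };

/*
 * Copy len bytes from src (notionally an uncached region) to dst,
 * staging each chunk through a cached bounce buffer.
 * Hypothetical helper; not thread-safe because the buffer is static.
 */
void staged_copy(void *dst, const void *src, size_t len)
{
    static unsigned char bounce[BOUNCE_BYTES];
    const unsigned char *s = src;
    unsigned char *d = dst;

    while (len > 0) {
        size_t n = len < BOUNCE_BYTES ? len : BOUNCE_BYTES;

        /* Stage 1: pull a chunk out of the slow region in one burst. */
        memcpy(bounce, s, n);
        /* Stage 2: copy from the cached buffer to the destination. */
        memcpy(d, bounce, n);

        s += n;
        d += n;
        len -= n;
    }
}
```

The point of the staging is to keep all reads from the slow region
back-to-back and sequential, rather than interleaved with writes to
somewhere else.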

Write combining is easy - in my experience it Just Works on ARMv7 SoCs
in general.  In practice, I've found that you can write pretty much
anything to uncached memory and the write-combiner will deal with it
fairly intelligently.  Small writes sufficiently close together in
time and contiguous in area will be combined reliably.  This assumes
that the region is marked write-combinable, which should always be the
case for plain SDRAM.  So memcpy() *to* an uncached zone works okay,
even with an alignment mismatch.

Read combining is much harder as soon as you turn off the cache, which
defeats all of the nice auto-prefetching mechanisms that tend to be
built into modern caches.  Even the newest ARM SoCs disable any and
all speculative behaviour for uncached reads - it is then not possible
to set up a "streaming read" even explicitly (even though you could
reasonably express such using PLD).

There will typically be a full cache-miss latency per instruction (20+
cycles on A8), even if the addresses are exactly sequential.  If they
are not sequential, or if the memory controller does a read or write
to somewhere else in the meantime, you also get the CAS or RAS
latencies of about 25ns each, which hurt badly (CAS and RAS have not
sped up appreciably, in real terms, since PC133 a decade ago - sense
amps are not so good at fulfilling Moore's Law).  So on a 1GHz
Cortex-A8, you can spend 80 clock cycles waiting for a memory load to
complete - that's about 20 for the memory system to figure out it's
uncacheable and cause the CPU core to replay the instruction twice, 50
waiting for the SDRAM chip to spin up, and another 10 as a fudge
factor and to allow the data to percolate up.
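
The arithmetic behind that 80-cycle figure can be written out as a
small sketch.  The function names are made up for illustration, and
treating the replay and fudge terms as fixed cycle counts independent
of clock speed is an assumption; the numbers themselves are the rough
estimates quoted above, not measurements.

```c
/* Convert a latency in nanoseconds to core clock cycles, rounding up. */
static unsigned ns_to_cycles(unsigned ns, unsigned core_mhz)
{
    return (ns * core_mhz + 999) / 1000;
}

/*
 * Estimated worst-case cost of one uncached load, in core cycles,
 * using the rough figures from the text (hypothetical helper).
 */
unsigned uncached_load_cycles(unsigned core_mhz)
{
    const unsigned replay_cycles = 20; /* detect uncacheable + replay */
    const unsigned cas_ns = 25;        /* approximate CAS latency */
    const unsigned ras_ns = 25;        /* approximate RAS latency */
    const unsigned fudge_cycles = 10;  /* data percolating back up */

    return replay_cycles
         + ns_to_cycles(cas_ns + ras_ns, core_mhz)
         + fudge_cycles;
}
```

With core_mhz = 1000 this gives 20 + 50 + 10 = 80 cycles, matching the
estimate above.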

Write combining working, and uncached reads being slow, is sufficiently
common that I assume (and I tell my colleagues to assume) that this is
the case.  If a vendor were to turn
off write-combining for a memory area, I would complain very loudly to
them once I discovered it.  So far, though, I can only wish that they
would sort out the memory hierarchy to make framebuffer & video reads
better.

I *have* found one vendor who appears to put GPU command buffers in
cached memory, but this necessitates a manual cache cleaning exercise
every time the command buffer is flushed.  This is a substantial
overhead too, but is perhaps easier to optimise.

IMO this whole problem is a hardware design fault.  It's SDRAM
directly wired to the chip; there's nothing going on that the memory
controller doesn't know about.  So why isn't the last cache level part
of / attached to the memory controller, so that it can be used
transparently by all relevant bus masters?  It is, BTW, not only ARM
that gets this wrong, but in the Wintel world there is so much
horsepower to spare that few people notice.

> And I expect something similar to be necessary in order to avoid the
> read-modify-write penalty for write-combining buffers on ARMv7.  (NEON
> store-multiple operations can fill an entire 64-byte entry in the
> victim buffer in one opcode; I don't know whether this is enough to
> stop the L3 memory system from reading the data before clobbering it.)

Ah, now you are talking about store misses to cached memory.

Why 64 bytes?  VLD1 does 32 bytes (4x64b) and VLDM can do 128 bytes
(16x64b).  The latter is, I think, bigger than Intel's fill buffers.
Each of these has an exactly equivalent store variant.

The manual for the Cortex-A8 states that store misses in the L1 cache
are sent to the L2 cache; the L2 cache then has validity bits for
every quadword (i.e. four validity domains per line), so a 16-byte
store (if aligned) is sufficient to avoid read traffic.  I assume that
the A9 and A5 are at least as sophisticated, but I'm not sure about
Snapdragon.
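
To make the validity-bit idea concrete, here is a toy model of it.
This is not the actual Cortex-A8 logic; it only assumes a 64-byte line
split into four 16-byte quadwords, as described above, and considers
just the line containing the start address.  The function names are
made up.  A store avoids read traffic in this model when it touches no
quadword only partially.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry: four 16-byte validity domains per 64-byte line. */
enum { LINE_BYTES = 64, QUAD_BYTES = 16 };

/*
 * Classify the quadwords of the line containing addr for a store of
 * len bytes.  Bit q of *full is set if quadword q is completely
 * overwritten; bit q of *partial is set if it is touched but only
 * partly overwritten (so the old data would have to be read first).
 */
void classify_store(uintptr_t addr, size_t len,
                    unsigned *full, unsigned *partial)
{
    uintptr_t line = addr & ~(uintptr_t)(LINE_BYTES - 1);
    uintptr_t end = addr + len;

    *full = 0;
    *partial = 0;
    for (unsigned q = 0; q < LINE_BYTES / QUAD_BYTES; q++) {
        uintptr_t qs = line + q * QUAD_BYTES;
        uintptr_t qe = qs + QUAD_BYTES;

        if (addr <= qs && qe <= end)
            *full |= 1u << q;       /* whole quadword overwritten */
        else if (addr < qe && qs < end)
            *partial |= 1u << q;    /* touched, but not fully covered */
    }
}

/* Returns 1 if, under this model, the store needs no fill read. */
int store_avoids_read(uintptr_t addr, size_t len)
{
    unsigned full, partial;

    classify_store(addr, len, &full, &partial);
    return partial == 0;
}
```

An aligned 16-byte store marks exactly one quadword as fully written
and none as partial; the same store offset by 8 bytes straddles two
quadwords and fully covers neither.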

 - Jonathan Morton