[Linaro-mm-sig] [PATCH/RFC 0/8] ARM: DMA-mapping framework redesign

Sat Jun 25 05:55:28 EDT 2011

Thanks, Jonathan!  I agree with you that it is fundamentally a
hardware design defect that on-chip devices like GPUs and video
capture/encode/decode/display blocks do not participate in the cache
coherency protocol, and thus the buffers passed to, from, and among
them all have to be mapped uncacheable.  Not, unfortunately, something
likely to change soon, with one or two possible exceptions that I've
heard rumors about ...

With regard to the use of NEON for data moves, I have appended a
snippet of a conversation from the BeagleBoard list that veered off
into a related direction.  (My response is lightly edited, since I
made some stupid errors in the original.)  While this is somewhat
off-topic from Marek's patch set, I think it's relevant to the
question of whether "user-allocated" buffers are an important design
consideration for his otherwise DMA-centric API.  (And more to the
point, buffers allocated suitably for one or more on-chip devices, and
also mapped as uncacheable to userland.)

Perhaps you could correct misapprehensions and fill in gaps?  And
comment on what's likely to be different on other ARMv7-A
implementations?  And then I'll confine myself to review of Marek's
patches, on this thread anyway.  ;-)

On Jun 24, 4:50 am, Siarhei Siamashka <siarhei.siamas... at gmail.com> wrote:
> 2011/6/24 Måns Rullgård <m... at mansr.com>:
>
> > "Edwards, Michael" <m.k.edwa... at gmail.com> writes:
> >> and do have dedicated lanes to memory for the NEON unit
>
> > No core released to date, including the A15, has dedicated memory lanes
> > for NEON.  All the Cortex-A* cores have a common load/store unit for all
> > types of instructions.  Some can do multiple concurrent accesses, but
> > that's orthogonal to this discussion.
>
> Probably he wanted to say that NEON unit from Cortex-A8 can load/store
> 128 bits of data per cycle when accessing L1 cache *memory*, while
> ordinary ARM load/store instructions can't handle more than 64 bits
> per cycle there. This makes sense in the context of this discussion
> because loading data to NEON/VFP registers directly without dragging
> it through ARM registers is not a bad idea.

That's close to what I meant.  The load/store path *to main memory* is
indeed shared.  But within the cache hierarchy, at least on the
Cortex-A8, ARM and NEON take separate paths.  And that's a good thing,
because the ARM stalls on an L1 miss, and it would be rather bad if it
had to wait for a big NEON transfer to complete before it could fill
from L2.  Moreover, the only way to get "streaming" performance
(back-to-back AXI burst transactions) on uncacheable regions is by
using the NEON.  That's almost impossible to determine from the TRM,
but it's there:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/ch09s04s03.html
.  Compare against the LDM/STM section of
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0344k/ch09s01s02.html
.

On the A8, the NEON bypasses the L1 cache, and has a dedicated lane
(probably the wrong word, sorry) into the L2 cache -- or for
uncacheable mappings, *past* the L2 per se to its AXI scheduler.  See
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/ch08s02s02.html
.  In addition, NEON load/store operations can be issued in parallel
with integer code, and there can be as many as 12 NEON reads
outstanding in L2 -- vs. the maximum of 4 total cache line refills and
evictions.  So if you are moving data around without doing non-SIMD
operations on it, and without branching based on its contents, you can
do so without polluting L1 cache, or contending with L1 misses that
hit in L2.

There will be some contention between NEON-side loads and ARM-side L2
misses, but even that is negligible if you issue a preload early
enough (which you should do anyway for fetches that you suspect will
miss L2, because the compiler schedules loads based on the assumption
of an L1 hit; an L1 miss stalls the ARM side until it's satisfied).
Preloads do not have any effect if you miss in TLB, and they don't
force premature evictions from L1 cache (they only load as far as L2).
And the contention on the write side is negligible thanks to the
write allocation mechanism, except insofar as you may approach
saturation of the AXI interface due to the total rate of L2
evictions/linefills and cache-bypassing traffic -- in which case,
congratulations!  Your code is well tuned and operates at the maximum
rate that the path to main memory permits.
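
For concreteness, here is a minimal sketch of the "issue a preload early
enough" idea in C.  It assumes GCC or Clang, whose __builtin_prefetch is
normally lowered to PLD on ARMv7-A.  The function name and the
PREFETCH_AHEAD_BYTES distance are made up for illustration; the right
distance has to be measured on the actual part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob: how far ahead of the current read to hint.
 * The value is a guess, not a measured optimum. */
#define PREFETCH_AHEAD_BYTES 256u

/* Sum a buffer of 32-bit words, hinting the cache hierarchy ahead of use. */
uint64_t sum_with_preload(const uint32_t *src, size_t n_words)
{
    const size_t ahead = PREFETCH_AHEAD_BYTES / sizeof *src;
    uint64_t total = 0;

    assert(src != NULL || n_words == 0);
    for (size_t i = 0; i < n_words; i++) {
        /* Roughly one hint per 64 bytes (16 words), never past the end. */
        if ((i % 16) == 0 && i + ahead < n_words)
            __builtin_prefetch(&src[i + ahead], 0 /* read */, 3 /* keep */);
        total += src[i];
    }
    return total;
}
```

The hint is advisory, so the result is the same whether or not the
hardware honors it; only the timing changes.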

If you are fetching data from an uncacheable region, using the NEON to
trampoline into a cacheable region should be a *huge* win.  Remember,
an L1 miss stalls the ARM side, and the only way to get data into L1
is to fetch and miss.  If you want it to hit in L2, you have to use
the NEON to put it there, by fetching up to 128 bytes at a go from the
uncacheable region (e.g., VLDM r1,{d16-d31}) and storing it to a
cacheable buffer (i.e., only as far as L2, since you write it again
and again without an eviction).  You want to limit fetches from the
ARM side to cacheable regions; otherwise every LDM is a round-trip to
AXI.
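
A rough sketch of that load-side trampoline in C follows.  Everything
named here (read_via_bounce, copy_chunk, add_bytes, CHUNK_BYTES) is
hypothetical.  The NEON path uses arm_neon.h intrinsics, which leave
instruction selection to the compiler; pinning down a single VLDM would
take inline assembly.  Non-NEON builds fall back to memcpy so the logic
can still be exercised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

#define CHUNK_BYTES 128u   /* 16 D registers, as in VLDM r1,{d16-d31} */

/* Move one 128-byte block from src to dst through the NEON registers. */
static void copy_chunk(uint8_t *dst, const uint8_t *src)
{
#if defined(__ARM_NEON)
    uint8x16_t q[CHUNK_BYTES / 16];
    for (unsigned i = 0; i < CHUNK_BYTES / 16; i++)   /* all loads first */
        q[i] = vld1q_u8(src + 16 * i);
    for (unsigned i = 0; i < CHUNK_BYTES / 16; i++)   /* then all stores */
        vst1q_u8(dst + 16 * i, q[i]);
#else
    memcpy(dst, src, CHUNK_BYTES);   /* portable fallback, no NEON */
#endif
}

/* Callback that consumes data from the cacheable bounce buffer. */
typedef void (*chunk_fn)(const uint8_t *chunk, size_t len, void *ctx);

/* Read len bytes from src (assumed to be an uncacheable mapping) in
 * 128-byte chunks, handing each chunk to consume() from a small
 * cacheable buffer that is reused for every chunk. */
void read_via_bounce(const uint8_t *src, size_t len,
                     chunk_fn consume, void *ctx)
{
    uint8_t bounce[CHUNK_BYTES] __attribute__((aligned(64)));

    assert(consume != NULL);
    while (len >= CHUNK_BYTES) {
        copy_chunk(bounce, src);
        consume(bounce, CHUNK_BYTES, ctx);
        src += CHUNK_BYTES;
        len -= CHUNK_BYTES;
    }
    if (len > 0) {                    /* short tail: plain byte copy */
        memcpy(bounce, src, len);
        consume(bounce, len, ctx);
    }
}

/* Example consumer: add every byte into a uint64_t accumulator. */
void add_bytes(const uint8_t *chunk, size_t len, void *ctx)
{
    uint64_t *acc = ctx;
    for (size_t i = 0; i < len; i++)
        *acc += chunk[i];
}
```

The ARM-side code (add_bytes here) only ever touches the bounce buffer,
so its loads stay in cacheable memory.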

The store story is similar.  You want the equivalent of the x86's
magic "fill buffers" -- which avoid the read-modify-write penalty when
writing whole cache lines' worth of data through uncacheable
write-combining mappings, but only if you use cache-bypassing SSE2
writes.  To get it, you need to write from the ARM to cacheable
memory, then load that data to NEON registers and store from there.
That pushes up to two whole cache lines' worth of data at a time down
to the L2 controller, which queues the write without blocking the
NEON.  (This is the only way to get an AXI burst longer than four
128-bit transactions without using the preload engine.)
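
The store side can be sketched the same way: accumulate ordinary ARM
writes in a cacheable staging block, and push each full 128-byte block
out through the NEON registers.  The wc_writer type and the wc_* and
push_block names are invented for this illustration, and as before the
intrinsics do not guarantee a single VSTM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

#define BLOCK_BYTES 128u   /* two 64-byte cache lines' worth */

typedef struct {
    uint8_t  stage[BLOCK_BYTES] __attribute__((aligned(64)));
    size_t   fill;   /* bytes currently held in stage[] */
    uint8_t *dst;    /* next write position in the destination,
                        assumed to be an uncacheable mapping */
} wc_writer;

/* Push one full staging block to dst through the NEON registers. */
static void push_block(uint8_t *dst, const uint8_t *stage)
{
#if defined(__ARM_NEON)
    uint8x16_t q[BLOCK_BYTES / 16];
    for (unsigned i = 0; i < BLOCK_BYTES / 16; i++)
        q[i] = vld1q_u8(stage + 16 * i);
    for (unsigned i = 0; i < BLOCK_BYTES / 16; i++)
        vst1q_u8(dst + 16 * i, q[i]);
#else
    memcpy(dst, stage, BLOCK_BYTES);   /* portable fallback, no NEON */
#endif
}

void wc_init(wc_writer *w, uint8_t *dst)
{
    assert(w != NULL && dst != NULL);
    w->fill = 0;
    w->dst = dst;
}

/* Append len bytes; full blocks go out as soon as they are complete. */
void wc_put(wc_writer *w, const void *data, size_t len)
{
    const uint8_t *p = data;

    while (len > 0) {
        size_t room = BLOCK_BYTES - w->fill;
        size_t n = len < room ? len : room;

        memcpy(w->stage + w->fill, p, n);   /* cacheable write */
        w->fill += n;
        p += n;
        len -= n;
        if (w->fill == BLOCK_BYTES) {
            push_block(w->dst, w->stage);
            w->dst += BLOCK_BYTES;
            w->fill = 0;
        }
    }
}

/* Write out any partial block left in the staging buffer. */
void wc_flush(wc_writer *w)
{
    if (w->fill > 0) {
        memcpy(w->dst, w->stage, w->fill);
        w->dst += w->fill;
        w->fill = 0;
    }
}
```

Only the final partial block takes the slow path; everything else
reaches the destination in whole 128-byte units.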

One more nice thing about doing your bulk data transfers this way,
instead of monkeying with some DMA unit (which you probably can't do
in userland anyway), is that there are no explicit cache operations to
deal with.  You don't have to worry about data getting stuck in L1, because
the NEON loads do peek data *from* L1 even though they don't load data
*to* L1.  (Not unless you turn on the L1NEON bit in the Auxiliary
Control Register, which you don't want to do unless you have no L2
cache, in which case you have a whole different set of problems.)

The Cortex-A9 is a whole different animal, with out-of-order issue on
the ARM side and two automatic prefetch mechanisms (based on detection
of miss patterns at L1 and, in MPCore only, at L2).  It also has a far
less detailed TRM, so I can't begin to analyze its memory hierarchy.
Given that the L2 cache has been hived off to an external unit, and
the penalty for transfers between the ARM and NEON units has been
greatly decreased, I would guess that the NEON goes through the L1
just like the ARM.  That changes the game a little -- the NEON
transfers to/from cacheable memory can now cause eviction of the ARM's
working set from L1 -- but in practice that's probably a wash.  The
basic premise (that you want to do your noncacheable transactions in
big bursts, feasible only from the NEON side) still holds.

> >> -- the compiler can tighten up the execution of rather a lot of code
> >> by trampolining structure fetches and stores through the NEON.
>
> > Do you have any numbers to back this up?  I don't see how going through
> > NEON registers would be faster than direct LDM/STM on any core.
>
> My understanding is that it's exactly the other way around. Using
> hardfp allows to avoid going through ARM registers for floating point
> data, which otherwise might be needed for the sole purpose of
> fulfilling ABI requirements in some cases. You are going a bit
> overboard trying to argue with absolutely everything what Edwards has
> posted :)

Not just for floating point data, but for SIMD integer data as well,
or really anything you want -- as long as you frame it as a
"Homogeneous Aggregate of containerized vectors".  That's an extra 64
bytes of structure that you can pass in, and let the callee decide
whether and when to spill a copy to a cache-line-aligned buffer (so
that it can then fetch the lot to the ARM L1 -- which might as well be
registers, as far as memory latency is concerned -- in one L1 miss).
Or you can do actual float/SIMD operations with the data, and return a
healthy chunk in registers, without ever touching memory.  (To be
precise, per the AAPCS, you can pass in one 64-byte chunk as a
"Homogeneous Aggregate with a Base Type of 128-bit containerized
vectors with four Elements", and return a similar chunk in the same
registers, with either the same or different contents.)
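
To make the 64-byte aggregate concrete, here is a small C sketch using
the GCC/Clang generic vector extension.  The type and function names
(u32x4, block64, rotate_lanes) are invented.  My assumption, worth
checking against the generated assembly, is that under the hard-float
(VFP) variant of the AAPCS with NEON enabled a struct like this is
classified as a Homogeneous Aggregate and travels in q0-q3 both ways;
on other targets it simply goes wherever that ABI puts it.

```c
#include <assert.h>
#include <stdint.h>

/* Generic 16-byte vector of four 32-bit lanes (GCC/Clang extension). */
typedef uint32_t u32x4 __attribute__((vector_size(16)));

/* Four 128-bit vectors: 64 bytes in total. */
typedef struct { u32x4 q[4]; } block64;

static_assert(sizeof(block64) == 64, "block64 must be exactly 64 bytes");

/* Takes 64 bytes by value and returns 64 bytes by value.  The body just
 * rotates the four vectors by one position so there is something to
 * observe; the signature is the point. */
block64 rotate_lanes(block64 in)
{
    block64 out;

    for (int i = 0; i < 4; i++)
        out.q[i] = in.q[(i + 1) % 4];
    return out;
}
```

Compiling this for an ARMv7-A target with -mfloat-abi=hard and
-mfpu=neon and reading the assembly is the quickest way to confirm
whether the aggregate really stays in registers.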

The point is not really to have "more registers"; the integer
"registers" are just names anyway, and the L1 cache is almost as
close.  Nor is it to pass floating point values to and from public
function calls cheaply; that's worth almost nothing on system scale.
Even in code that uses no floating point or SIMD whatever, there are
potentially big gains from:

  * preserving an additional 64 bytes of VFP/NEON state across
    functions that don't need big operands or return values, if you are
    willing to alter their function signatures to do so (at zero run-time
    cost, if you're systematic about it); or alternately:

  * postponing the transfer of up to 64 bytes of operands from the
    VFP/NEON bank to the integer side, allowing more time for pending NEON
    operations (especially structure loads) to complete;

  * omitting the transfer from NEON to ARM entirely, if the operands
    turn out to be unneeded (or simply written elsewhere in memory without
    needing to be touched by the ARM);

  * returning up to 64 bytes of results in the VFP/NEON register bank,
    possibly from an address that missed in L2, without stalling to wait
    for a pending load to complete;

  * and, if you really do have to move those operands to the ARM,
    doing so explicitly and efficiently (by spilling the whole block to a
    cache-line-aligned buffer in L2, fetching it back into L1 with a
    single load, and filling the delay with some other useful work)
    instead of in the worst way possible (by transferring them from VFP to
    ARM registers, 4 bytes at a time, before entering the function).

> As for NEON vs. LDM/STM. There are indeed no reasons why for example
> NEON memcpy should be faster than LDM/STM for the large memory buffers
> which do not fit caches. But still this is the case for OMAP3, along
> with some of other memory performance related WTF questions.

I hope I've clarified this a bit above.  But don't take my word for
it; these techniques are almost exactly the same as those described in
Intel's cheat sheet at
http://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers/
, except that there is no need for the equivalent of "fill buffers" /
"write combining buffers" because VLDM/VSTM can move 128 bytes at a
time.  (It's probable that the right micro-optimization is to work in
64-byte chunks and pipeline more deeply; I haven't benchmarked yet.)

> >> If, that is, it can schedule them appropriately to account for
> >> latencies to and from memory as well as the (reduced but non-zero)
> >> latency of VFP<->ARM transfers.
>
> > The out of order issue on A9 and later makes most such tricks unnecessary.
>
> VFP/NEON unit from A9 is still in-order.

True but mostly irrelevant.  If your code is at all tight, and your
working set doesn't fit into L2 cache, all the mere arithmetic
pipelines should be stalled most of the time.  The name of the game is
to race as quickly as possible from one fetch from an uncacheable /
unpredictable address to the next that depends on it, and to get as
high an interleave among such fetch chains as possible.  If your
working set isn't larger than L2 cache, why bother thinking about
performance at all?  Your algorithm could be O(n^3) and coded by
banana slugs, and it would still get the job done.
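
One way to picture "a high interleave among such fetch chains" is to
walk several independent linked lists in lockstep, so that a miss on
one chain can overlap with work on the others.  This is only a sketch
with made-up names (node, sum_interleaved, MAX_CHAINS).  Since an
in-order core stalls on the miss itself, the code issues a preload for
each chain's next node before moving on, using the GCC/Clang
__builtin_prefetch hint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct node {
    struct node *next;
    uint32_t     value;
} node;

#define MAX_CHAINS 8u   /* arbitrary cap, chosen for the example */

/* Sum the values on up to MAX_CHAINS lists, advancing each list by one
 * node per pass so the dependent loads of different chains interleave. */
uint64_t sum_interleaved(node *const heads[], size_t n_chains)
{
    node *cur[MAX_CHAINS];
    uint64_t total = 0;
    size_t live = 0;

    assert(n_chains <= MAX_CHAINS);
    for (size_t i = 0; i < n_chains; i++) {
        cur[i] = heads[i];
        if (cur[i] != NULL)
            live++;
    }
    while (live > 0) {
        for (size_t i = 0; i < n_chains; i++) {
            if (cur[i] == NULL)
                continue;                  /* this chain is finished */
            total += cur[i]->value;
            cur[i] = cur[i]->next;
            if (cur[i] != NULL)
                /* Hint the next node now; it is needed one pass later. */
                __builtin_prefetch(cur[i], 0 /* read */, 3 /* keep */);
            else
                live--;
        }
    }
    return total;
}
```

How many chains are worth interleaving depends on how many misses the
memory system can keep in flight, which is something to measure rather
than assume.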

Cheers,
- Michael