[Linaro-mm-sig] [RFC] ARM DMA mapping TODO, v1

Fri Apr 29 18:46:54 EDT 2011

On Fri, 2011-04-29 at 09:27 -0700, Jesse Barnes wrote:

> You must be making it sound worse than it really is, otherwise how
> would an embedded platform like the above deal with a display engine
> that needed a large, contiguous chunk of uncached memory for the
> display buffer?  If the CPU is actively speculating into it and
> overwriting blits etc it would never work...  Or do you do such
> reservations up front at 1G granularity??

Such embedded platforms have not been used with GPUs so far and our only
implementation of 64-bit BookE is fortunately also completely cache
coherent :-)

The good thing on ppc is that so far there is no new design coming from
us or FSL that isn't cache coherent. The bad thing is that people still
seem to pump out things using the old 44x, which isn't, and some of them
also seem to want to use GPUs on them :-)

The 44x is a case where I have a small (64 entries) SW loaded TLB and I
bolt the first 768M of the linear mapping (lowmem) using 3x256M entries.
What "saves" it is that it's also an ancient design with essentially a
busted prefetch engine that will thus cope with aliases as long as we
don't explicitly access the cached and non-cached aliases
simultaneously.
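
To make the bolting arithmetic concrete, here is a minimal sketch of how
768M of lowmem maps onto 256M entries. The names (BOLT_ENTRY_SIZE,
bolted_entries, bolted_index) are hypothetical and for illustration only;
this is not the actual 44x kernel code.

```c
#include <stdint.h>

/* Hypothetical constant: size covered by one bolted TLB entry (256M). */
#define BOLT_ENTRY_SIZE (256ull << 20)

/* Number of 256M entries needed to cover lowmem_bytes, rounded up. */
static inline unsigned int bolted_entries(uint64_t lowmem_bytes)
{
    return (unsigned int)((lowmem_bytes + BOLT_ENTRY_SIZE - 1) / BOLT_ENTRY_SIZE);
}

/* Index of the bolted entry that covers a given offset into lowmem. */
static inline unsigned int bolted_index(uint64_t lowmem_offset)
{
    return (unsigned int)(lowmem_offset / BOLT_ENTRY_SIZE);
}
```

With 768M of lowmem this gives 3 entries out of the 64 available, and
anything allocated inside that range is always reachable through one of
those cached entries, which is where the alias comes from.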

The nasty cases I have never really dealt with properly are the Apple
machines and their non-coherent AGP. Those processors were really not
designed with the idea that one would do non-coherent DMA, especially
the 970 (G5), and our Linux code really doesn't like it.

Things tend to "work" with DRI 1 because we allocate the AGP memory once
in one big chunk (it's pages but they are allocated together and thus
tend to be contiguous) so the possible issues with prefetch are so rare,
I think we end up being lucky. With DRI 2 dynamically mapping things
in/out, we have a bigger problem and I don't know how to solve it other
than forcing the DRM to allocate graphics objects in reserved areas of
memory made of 16M pools that I unmap from the linear mapping (since
I use 16M pages to map the linear mapping).
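
The constraint behind those pools is just alignment: a region can only be
pulled out of the linear mapping if it is made of whole 16M pages. A
minimal sketch of that check follows, with hypothetical names (POOL_SIZE,
pool_round_up, range_is_whole_pools) that do not correspond to any
existing DRM or kernel interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant: one pool equals one 16M linear-mapping page. */
#define POOL_SIZE (16ull << 20)

/* Round a request up to a whole number of 16M pools. */
static inline uint64_t pool_round_up(uint64_t bytes)
{
    return (bytes + POOL_SIZE - 1) & ~(POOL_SIZE - 1);
}

/*
 * True if [phys, phys + bytes) is non-empty, starts on a 16M boundary and
 * spans whole 16M pages, so it could be unmapped from the linear mapping
 * without touching neighbouring memory.
 */
static inline bool range_is_whole_pools(uint64_t phys, uint64_t bytes)
{
    return bytes != 0 &&
           (phys & (POOL_SIZE - 1)) == 0 &&
           (bytes & (POOL_SIZE - 1)) == 0;
}
```

The cost is granularity: even a small graphics object ties up a full 16M
pool unless several objects are packed into the same pool.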

For ppc32 laptops it's even worse as I use 256MB BATs (block address
translation, kind of special registers to create large static mappings)
to map the linear mapping, which brings me back to the 44x case to some
extent. I can't really do without them at the moment; at the very least I
require the kernel text / data / bss to be covered by BATs.

> > Right. We should still shoot HW designers who give up coherency for the
> > sake of 3D benchmarks. It's insanely stupid.
> 
> Ah if it were that simple. :)  There are big costs to implementing full
> coherency for all your devices, as you well know, so it's just not a
> question of benchmark optimization.

But it -is- that simple.

You do have to deal with coherency anyways for your PHB unless you start
advocating that we should make everything else non-coherent as well. So
you have the logic. Just make your GPU operate on the same protocol.

It's really only a perf tradeoff I believe. And a bad one.

Cheers,
Ben.