[Linaro-mm-sig] [RFCv3 2/2] dma-buf: add helpers for sharing attacher constraints with dma-parms

Tue Feb 3 08:12:40 PST 2015

On Tuesday 03 February 2015 15:54:04 Russell King - ARM Linux wrote:
> On Tue, Feb 03, 2015 at 04:31:13PM +0100, Arnd Bergmann wrote:
> > The dma_map_* interfaces assign the virtual addresses internally,
> > using typically either a global address space for all devices, or one
> > address space per device.
> 
> We shouldn't be doing one address space per device for precisely this
> reason.  We should be doing one address space per *bus*.  I did have
> a nice diagram to illustrate the point in my previous email, but I
> deleted it, I wish I hadn't... briefly:
> 
> Fig. 1.
>                                                  +------------------+
>                                                  |+-----+  device   |
> CPU--L1cache--L2cache--Memory--SysMMU---<iobus>----IOMMU-->         |
>                                                  |+-----+           |
>                                                  +------------------+
> 
> Fig.1 represents what I'd call the "GPU" issue that we're talking about
> in this thread.
> 
> Fig. 2.
> CPU--L1cache--L2cache--Memory--SysMMU---<iobus>--IOMMU--device
> 
> The DMA API should be responsible (at the very least) for everything on
> the left of "<iobus>" in and should be providing a dma_addr_t which is
> representative of what the device (in Fig.1) as a whole sees.  That's
> the "system" part.  
> 
> I believe this is the approach which is taken by x86 and similar platforms,
> simply because they tend not to have an IOMMU on individual devices (and
> if they did, eg, on a PCI card, it's clearly the responsibility of the
> device driver.)
> 
> Whether the DMA API also handles the IOMMU in Fig.1 or 2 is questionable.
> For fig.2, it is entirely possible that the same device could appear
> without an IOMMU, and in that scenario, you would want the IOMMU to be
> handled transparently.
> 
> However, by doing so for everything, you run into exactly the problem
> which is being discussed here - the need to separate out the cache
> coherency from the IOMMU aspects.  You probably also have a setup very
> similar to fig.1 (which is certainly true of Vivante GPUs.)
> 
> If you have the need to separately control both, then using the DMA API
> to encapsulate both does not make sense - at which point, the DMA API
> should be responsible for the minimum only - in other words, everything
> to the left of <iobus> (so including the system MMU.)  The control of
> the device IOMMU should be the responsibility of device driver in this
> case.
> 
> So, dma_map_sg() would be responsible for dealing with the CPU cache
> coherency issues, and setting up the system MMU.  dma_sync_*() would
> be responsible for the CPU cache coherency issues, and dma_unmap_sg()
> would (again) deal with the CPU cache and tear down the system MMU
> mappings.
> 
> Meanwhile, the device driver has ultimate control over its IOMMU, the
> creation and destruction of mappings and context switches at the
> appropriate times.

I agree for the case you are describing here. From what I understood
from Rob was that he is looking at something more like:

Fig 3
CPU--L1cache--L2cache--Memory--IOMMU---<iobus>--device

where the IOMMU controls one or more contexts per device, and is
shared across GPU and non-GPU devices. Here, we need to use the
dmap-mapping interface to set up the IO page table for any device
that is unable to address all of system RAM, and we can use it
for purposes like isolation of the devices. There are also cases
where using the IOMMU is not optional.

So unlike the scenario you describe, the driver cannot at the
same time control the cache (using the dma-mapping API) and
the I/O page tables (using the iommu API or some internal
functions).

	Arnd