[RFC PATCH] dmabuf-sync: Introduce buffer synchronization framework

Thu Jun 13 13:26:11 EDT 2013

On Thu, Jun 13, 2013 at 05:28:08PM +0900, Inki Dae wrote:
> This patch adds a buffer synchronization framework based on DMA BUF[1]
> and reservation[2] to use dma-buf resource, and based on ww-mutexes[3]
> for lock mechanism.
> 
> The purpose of this framework is not only to couple cache operations,
> and buffer access control to CPU and DMA but also to provide easy-to-use
> interfaces for device drivers and potentially user application
> (not implemented for user applications, yet). And this framework can be
> used for all dma devices using system memory as dma buffer, especially
> for most ARM based SoCs.
> 
> The mechanism of this framework has the following steps,
>     1. Register dmabufs to a sync object - A task gets a new sync object and
>     can add one or more dmabufs that the task wants to access.
>     This registering should be performed when a device context or an event
>     context such as a page flip event is created or before CPU accesses a shared
>     buffer.
> 
> 	dma_buf_sync_get(a sync object, a dmabuf);
> 
>     2. Lock a sync object - A task tries to lock all dmabufs added in its own
>     sync object. Basically, the lock mechanism uses ww-mutex[1] to avoid dead
>     lock issue and for race condition between CPU and CPU, CPU and DMA, and DMA
>     and DMA. Taking a lock means that others cannot access all locked dmabufs
>     until the task that locked the corresponding dmabufs, unlocks all the locked
>     dmabufs.
>     This locking should be performed before DMA or CPU accesses these dmabufs.
> 
> 	dma_buf_sync_lock(a sync object);
> 
>     3. Unlock a sync object - The task unlocks all dmabufs added in its own sync
>     object. The unlock means that the DMA or CPU accesses to the dmabufs have
>     been completed so that others may access them.
>     This unlocking should be performed after DMA or CPU has completed accesses
>     to the dmabufs.
> 
> 	dma_buf_sync_unlock(a sync object);
> 
>     4. Unregister one or all dmabufs from a sync object - A task unregisters
>     the given dmabufs from the sync object. This means that the task dosen't
>     want to lock the dmabufs.
>     The unregistering should be performed after DMA or CPU has completed
>     accesses to the dmabufs or when dma_buf_sync_lock() is failed.
> 
> 	dma_buf_sync_put(a sync object, a dmabuf);
> 	dma_buf_sync_put_all(a sync object);
> 
>     The described steps may be summarized as:
> 	get -> lock -> CPU or DMA access to a buffer/s -> unlock -> put
> 
> This framework includes the following two features.
>     1. read (shared) and write (exclusive) locks - A task is required to declare
>     the access type when the task tries to register a dmabuf;
>     READ, WRITE, READ DMA, or WRITE DMA.
> 
>     The below is example codes,
> 	struct dmabuf_sync *sync;
> 
> 	sync = dmabuf_sync_init(NULL, "test sync");
> 
> 	dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
> 	...
> 
> 	And the below can be used as access types:
> 		DMA_BUF_ACCESS_READ,
> 		- CPU will access a buffer for read.
> 		DMA_BUF_ACCESS_WRITE,
> 		- CPU will access a buffer for read or write.
> 		DMA_BUF_ACCESS_READ | DMA_BUF_ACCESS_DMA,
> 		- DMA will access a buffer for read
> 		DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA,
> 		- DMA will access a buffer for read or write.
> 
>     2. Mandatory resource releasing - a task cannot hold a lock indefinitely.
>     A task may never try to unlock a buffer after taking a lock to the buffer.
>     In this case, a timer handler to the corresponding sync object is called
>     in five (default) seconds and then the timed-out buffer is unlocked by work
>     queue handler to avoid lockups and to enforce resources of the buffer.
> 
> The below is how to use:
> 	1. Allocate and Initialize a sync object:
> 		struct dmabuf_sync *sync;
> 
> 		sync = dmabuf_sync_init(NULL, "test sync");
> 		...
> 
> 	2. Add a dmabuf to the sync object when setting up dma buffer relevant
> 	   registers:
> 		dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
> 		...
> 
> 	3. Lock all dmabufs of the sync object before DMA or CPU accesses
> 	   the dmabufs:
> 		dmabuf_sync_lock(sync);
> 		...
> 
> 	4. Now CPU or DMA can access all dmabufs locked in step 3.
> 
> 	5. Unlock all dmabufs added in a sync object after DMA or CPU access
> 	   to these dmabufs is completed:
> 		dmabuf_sync_unlock(sync);
> 
> 	   And call the following functions to release all resources,
> 		dmabuf_sync_put_all(sync);
> 		dmabuf_sync_fini(sync);
> 
> 	You can refer to actual example codes:
> 		https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/
> 		commit/?h=dmabuf-sync&id=4030bdee9bab5841ad32faade528d04cc0c5fc94
> 
> 		https://git.kernel.org/cgit/linux/kernel/git/daeinki/drm-exynos.git/
> 		commit/?h=dmabuf-sync&id=6ca548e9ea9e865592719ef6b1cde58366af9f5c
> 
> The framework performs cache operation based on the previous and current access
> types to the dmabufs after the locks to all dmabufs are taken:
> 	Call dma_buf_begin_cpu_access() to invalidate cache if,
> 		previous access type is DMA_BUF_ACCESS_WRITE | DMA and
> 		current access type is DMA_BUF_ACCESS_READ
> 
> 	Call dma_buf_end_cpu_access() to clean cache if,
> 		previous access type is DMA_BUF_ACCESS_WRITE and
> 		current access type is DMA_BUF_ACCESS_READ | DMA
> 
> Such cache operations are invoked via dma-buf interfaces so the dma buf exporter
> should implement dmabuf->ops->begin_cpu_access/end_cpu_access callbacks.
> 
> [1] http://lwn.net/Articles/470339/
> [2] http://lwn.net/Articles/532616/
> [3] https://patchwork-mail1.kernel.org/patch/2625321/
> 
> Signed-off-by: Inki Dae <inki.dae at samsung.com>
> ---
>  Documentation/dma-buf-sync.txt |  246 ++++++++++++++++++
>  drivers/base/Kconfig           |    7 +
>  drivers/base/Makefile          |    1 +
>  drivers/base/dmabuf-sync.c     |  555 ++++++++++++++++++++++++++++++++++++++++
>  include/linux/dma-buf.h        |    5 +
>  include/linux/dmabuf-sync.h    |  115 +++++++++
>  include/linux/reservation.h    |    7 +
>  7 files changed, 936 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/dma-buf-sync.txt
>  create mode 100644 drivers/base/dmabuf-sync.c
>  create mode 100644 include/linux/dmabuf-sync.h
> 
> diff --git a/Documentation/dma-buf-sync.txt b/Documentation/dma-buf-sync.txt
> new file mode 100644
> index 0000000..e71b6f4
> --- /dev/null
> +++ b/Documentation/dma-buf-sync.txt
> @@ -0,0 +1,246 @@
> +                    DMA Buffer Synchronization Framework
> +                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +                                  Inki Dae
> +                      <inki dot dae at samsung dot com>
> +                          <daeinki at gmail dot com>
> +
> +This document is a guide for device-driver writers describing the DMA buffer
> +synchronization API. This document also describes how to use the API to
> +use buffer synchronization between CPU and CPU, CPU and DMA, and DMA and DMA.
> +
> +The DMA Buffer synchronization API provides buffer synchronization mechanism;
> +i.e., buffer access control to CPU and DMA, cache operations, and easy-to-use
> +interfaces for device drivers and potentially user application
> +(not implemented for user applications, yet). And this API can be used for all
> +dma devices using system memory as dma buffer, especially for most ARM based
> +SoCs.
> +
> +
> +Motivation
> +----------
> +
> +Sharing a buffer, a device cannot be aware of when the other device will access
> +the shared buffer: a device may access a buffer containing wrong data if
> +the device accesses the shared buffer while another device is still accessing
> +the shared buffer. Therefore, a user process should have waited for
> +the completion of DMA access by another device before a device tries to access
> +the shared buffer.
> +
> +Besides, there is the same issue when CPU and DMA are sharing a buffer; i.e.,
> +a user process should consider that when the user process have to send a buffer
> +to a device driver for the device driver to access the buffer as input.
> +This means that a user process needs to understand how the device driver is
> +worked. Hence, the conventional mechanism not only makes user application
> +complicated but also incurs performance overhead because the conventional
> +mechanism cannot control devices precisely without additional and complex
> +implemantations.
> +
> +In addition, in case of ARM based SoCs, most devices have no hardware cache
> +consistency mechanisms between CPU and DMA devices because they do not use ACP
> +(Accelerator Coherency Port). ACP can be connected to DMA engine or similar
> +devices in order to keep cache coherency between CPU cache and DMA device.
> +Thus, we need additional cache operations to have the devices operate properly;
> +i.e., user applications should request cache operations to kernel before DMA
> +accesses the buffer and after the completion of buffer access by CPU, or vise
> +versa.
> +
> +	buffer access by CPU -> cache clean -> buffer access by DMA
> +
> +Or,
> +	buffer access by DMA -> cache invalidate -> buffer access by CPU
> +
> +The below shows why cache operations should be requested by user
> +process,
> +    (Presume that CPU and DMA share a buffer and the buffer is mapped
> +     with user space as cachable)
> +
> +	handle = drm_gem_alloc(size);
> +	...
> +	va1 = drm_gem_mmap(handle1);
> +	va2 = malloc(size);
> +	...
> +
> +	while(conditions) {
> +		memcpy(va1, some data, size);
> +		...
> +		drm_xxx_set_dma_buffer(handle, ...);
> +		...
> +
> +		/* user need to request cache clean at here. */
> +
> +		/* blocked until dma operation is completed. */
> +		drm_xxx_start_dma(...);
> +		...
> +
> +		/* user need to request cache invalidate at here. */
> +
> +		memcpy(va2, va1, size);
> +	}
> +
> +The issue arises: user processes may incur cache operations: user processes may
> +request unnecessary cache operations to kernel. Besides, kernel cannot prevent
> +user processes from requesting such cache operations. Therefore, we need to
> +prevent such excessive and unnecessary cache operations from user processes.
> +
> +
> +Basic concept
> +-------------
> +
> +The mechanism of this framework has the following steps,
> +    1. Register dmabufs to a sync object - A task gets a new sync object and
> +    can add one or more dmabufs that the task wants to access.
> +    This registering should be performed when a device context or an event
> +    context such as a page flip event is created or before CPU accesses a shared
> +    buffer.
> +
> +	dma_buf_sync_get(a sync object, a dmabuf);
> +
> +    2. Lock a sync object - A task tries to lock all dmabufs added in its own
> +    sync object. Basically, the lock mechanism uses ww-mutex[1] to avoid dead
> +    lock issue and for race condition between CPU and CPU, CPU and DMA, and DMA
> +    and DMA. Taking a lock means that others cannot access all locked dmabufs
> +    until the task that locked the corresponding dmabufs, unlocks all the locked
> +    dmabufs.
> +    This locking should be performed before DMA or CPU accesses these dmabufs.
> +
> +	dma_buf_sync_lock(a sync object);
> +
> +    3. Unlock a sync object - The task unlocks all dmabufs added in its own sync
> +    object. The unlock means that the DMA or CPU accesses to the dmabufs have
> +    been completed so that others may access them.
> +    This unlocking should be performed after DMA or CPU has completed accesses
> +    to the dmabufs.
> +
> +	dma_buf_sync_unlock(a sync object);
> +
> +    4. Unregister one or all dmabufs from a sync object - A task unregisters
> +    the given dmabufs from the sync object. This means that the task dosen't
> +    want to lock the dmabufs.
> +    The unregistering should be performed after DMA or CPU has completed
> +    accesses to the dmabufs or when dma_buf_sync_lock() is failed.
> +
> +	dma_buf_sync_put(a sync object, a dmabuf);
> +	dma_buf_sync_put_all(a sync object);
> +
> +    The described steps may be summarized as:
> +	get -> lock -> CPU or DMA access to a buffer/s -> unlock -> put
> +
> +This framework includes the following two features.
> +    1. read (shared) and write (exclusive) locks - A task is required to declare
> +    the access type when the task tries to register a dmabuf;
> +    READ, WRITE, READ DMA, or WRITE DMA.
> +
> +    The below is example codes,
> +	struct dmabuf_sync *sync;
> +
> +	sync = dmabuf_sync_init(NULL, "test sync");
> +
> +	dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
> +	...
> +
> +    2. Mandatory resource releasing - a task cannot hold a lock indefinitely.
> +    A task may never try to unlock a buffer after taking a lock to the buffer.
> +    In this case, a timer handler to the corresponding sync object is called
> +    in five (default) seconds and then the timed-out buffer is unlocked by work
> +    queue handler to avoid lockups and to enforce resources of the buffer.
> +
> +
> +Access types
> +------------
> +
> +DMA_BUF_ACCESS_READ - CPU will access a buffer for read.
> +DMA_BUF_ACCESS_WRITE - CPU will access a buffer for read or write.
> +DMA_BUF_ACCESS_READ | DMA_BUF_ACCESS_DMA - DMA will access a buffer for read
> +DMA_BUF_ACCESS_WRITE | DMA_BUF_ACCESS_DMA - DMA will access a buffer for read
> +					    or write.
> +
> +
> +API set
> +-------
> +
> +bool is_dmabuf_sync_supported(void)
> +	- Check if dmabuf sync is supported or not.
> +
> +struct dmabuf_sync *dmabuf_sync_init(void *priv, const char *name)
> +	- Allocate and initialize a new sync object. The caller can get a new
> +	sync object for buffer synchronization. priv is used to set caller's
> +	private data and name is the name of sync object.
> +
> +void dmabuf_sync_fini(struct dmabuf_sync *sync)
> +	- Release all resources to the sync object.
> +
> +int dmabuf_sync_get(struct dmabuf_sync *sync, void *sync_buf,
> +			unsigned int type)
> +	- Add a dmabuf to a sync object. The caller can group multiple dmabufs
> +	by calling this function several times. Internally, this function also
> +	takes a reference to a dmabuf.
> +
> +void dmabuf_sync_put(struct dmabuf_sync *sync, struct dma_buf *dmabuf)
> +	- Remove a given dmabuf from a sync object. Internally, this function
> +	also release every reference to the given dmabuf.
> +
> +void dmabuf_sync_put_all(struct dmabuf_sync *sync)
> +	- Remove all dmabufs added in a sync object. Internally, this function
> +	also release every reference to the dmabufs of the sync object.
> +
> +int dmabuf_sync_lock(struct dmabuf_sync *sync)
> +	- Lock all dmabufs added in a sync object. The caller should call this
> +	function prior to CPU or DMA access to the dmabufs so that others can
> +	not access the dmabufs. Internally, this function avoids dead lock
> +	issue with ww-mutex.
> +
> +int dmabuf_sync_unlock(struct dmabuf_sync *sync)
> +	- Unlock all dmabufs added in a sync object. The caller should call
> +	this function after CPU or DMA access to the dmabufs is completed so
> +	that others can access the dmabufs.
> +
> +
> +Tutorial
> +--------
> +
> +1. Allocate and Initialize a sync object:
> +	struct dmabuf_sync *sync;
> +
> +	sync = dmabuf_sync_init(NULL, "test sync");
> +	...
> +
> +2. Add a dmabuf to the sync object when setting up dma buffer relevant registers:
> +	dmabuf_sync_get(sync, dmabuf, DMA_BUF_ACCESS_READ);
> +	...
> +
> +3. Lock all dmabufs of the sync object before DMA or CPU accesses the dmabufs:
> +	dmabuf_sync_lock(sync);
> +	...
> +
> +4. Now CPU or DMA can access all dmabufs locked in step 3.
> +
> +5. Unlock all dmabufs added in a sync object after DMA or CPU access to these
> +   dmabufs is completed:
> +	dmabuf_sync_unlock(sync);
> +
> +   And call the following functions to release all resources,
> +	dmabuf_sync_put_all(sync);
> +	dmabuf_sync_fini(sync);
> +
> +
> +Cache operation
> +---------------
> +
> +The framework performs cache operation based on the previous and current access
> +types to the dmabufs after the locks to all dmabufs are taken:
> +	Call dma_buf_begin_cpu_access() to invalidate cache if,
> +		previous access type is DMA_BUF_ACCESS_WRITE | DMA and
> +		current access type is DMA_BUF_ACCESS_READ
> +
> +	Call dma_buf_end_cpu_access() to clean cache if,
> +		previous access type is DMA_BUF_ACCESS_WRITE and
> +		current access type is DMA_BUF_ACCESS_READ | DMA
> +
> +Such cache operations are invoked via dma-buf interfaces. Thus, the dma buf
> +exporter should implement dmabuf->ops->begin_cpu_access and end_cpu_access
> +callbacks.
> +
> +
> +References:
> +[1] https://patchwork-mail1.kernel.org/patch/2625321/
> diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> index 5ccf182..54a1d5a 100644
> --- a/drivers/base/Kconfig
> +++ b/drivers/base/Kconfig
> @@ -212,6 +212,13 @@ config FENCE_TRACE
>  	  lockup related problems for dma-buffers shared across multiple
>  	  devices.
>  
> +config DMABUF_SYNC
> +	bool "DMABUF Synchronization Framework"
> +	depends on DMA_SHARED_BUFFER
> +	help
> +	  This option enables dmabuf sync framework for buffer synchronization between
> +	  DMA and DMA, CPU and DMA, and CPU and CPU.
> +
>  config CMA
>  	bool "Contiguous Memory Allocator"
>  	depends on HAVE_DMA_CONTIGUOUS && HAVE_MEMBLOCK
> diff --git a/drivers/base/Makefile b/drivers/base/Makefile
> index 8a55cb9..599f6c90 100644
> --- a/drivers/base/Makefile
> +++ b/drivers/base/Makefile
> @@ -11,6 +11,7 @@ obj-y			+= power/
>  obj-$(CONFIG_HAS_DMA)	+= dma-mapping.o
>  obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
>  obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf.o fence.o reservation.o
> +obj-$(CONFIG_DMABUF_SYNC) += dmabuf-sync.o
>  obj-$(CONFIG_ISA)	+= isa.o
>  obj-$(CONFIG_FW_LOADER)	+= firmware_class.o
>  obj-$(CONFIG_NUMA)	+= node.o
> diff --git a/drivers/base/dmabuf-sync.c b/drivers/base/dmabuf-sync.c
> new file mode 100644
> index 0000000..c8723a5
> --- /dev/null
> +++ b/drivers/base/dmabuf-sync.c
> @@ -0,0 +1,555 @@
> +/*
> + * Copyright (C) 2013 Samsung Electronics Co.Ltd
> + * Authors:
> + *	Inki Dae <inki.dae at samsung.com>
> + *
> + * This program is free software; you can redistribute  it and/or modify it
> + * under  the terms of  the GNU General  Public License as published by the
> + * Free Software Foundation;  either version 2 of the  License, or (at your
> + * option) any later version.
> + *
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/debugfs.h>
> +#include <linux/uaccess.h>
> +
> +#include <linux/dmabuf-sync.h>
> +
> +#define MAX_SYNC_TIMEOUT	5 /* Second. */
> +
> +#define NEED_BEGIN_CPU_ACCESS(old, new)	\
> +		(old->accessed_type == (DMA_BUF_ACCESS_WRITE | \
> +					DMA_BUF_ACCESS_DMA) && \
> +		 (new->access_type == DMA_BUF_ACCESS_READ))
> +
> +#define NEED_END_CPU_ACCESS(old, new)	\
> +		(old->accessed_type == DMA_BUF_ACCESS_WRITE && \
> +		 ((new->access_type == (DMA_BUF_ACCESS_READ | \
> +					DMA_BUF_ACCESS_DMA)) || \
> +		  (new->access_type == (DMA_BUF_ACCESS_WRITE | \
> +					DMA_BUF_ACCESS_DMA))))
> +
> +int dmabuf_sync_enabled = 1;
> +
> +MODULE_PARM_DESC(enabled, "Check if dmabuf sync is supported or not");
> +module_param_named(enabled, dmabuf_sync_enabled, int, 0444);
> +
> +static void dmabuf_sync_timeout_worker(struct work_struct *work)
> +{
> +	struct dmabuf_sync *sync = container_of(work, struct dmabuf_sync, work);
> +	struct dmabuf_sync_object *sobj;
> +
> +	mutex_lock(&sync->lock);
> +
> +	list_for_each_entry(sobj, &sync->syncs, head) {
> +		if (WARN_ON(!sobj->robj))
> +			continue;
> +
> +		printk(KERN_WARNING "%s: timeout = 0x%x [type = %d, " \
> +					"refcnt = %d, locked = %d]\n",
> +					sync->name, (u32)sobj->dmabuf,
> +					sobj->access_type,
> +					atomic_read(&sobj->robj->shared_cnt),
> +					sobj->robj->locked);
> +
> +		/* unlock only valid sync object. */
> +		if (!sobj->robj->locked)
> +			continue;
> +
> +		if (sobj->robj->shared &&
> +				atomic_read(&sobj->robj->shared_cnt) > 1) {
> +			atomic_dec(&sobj->robj->shared_cnt);

So, in my long standing complaints about atomic_t, this one really takes
the biscuit.  What makes you think that this is somehow safe?

What happens if:

shared_cnt = 2
	CPU0				CPU1
	atomic_read(&shared_cnt)
					atomic_read(&shared_cnt)
	atomic_dec(&shared_cnt)
					atomic_dec(&shared_cnt)

Now, it's zero.  That's not what the above code intends.  You probably
think that because it's called "atomic_*" it has some magical properties
which saves you from stuff like this.  I'm afraid it doesn't.

sync->lock may save you from that, but if that's the case, why use
atomic_t's anyway here because you're already in a protected region.
But maybe not, because I see other uses of shared_cnt without this
lock below.

I think you need to revisit this and think more carefully about how
to deal with this counting.  If you wish to continue using the
atomic_* API, please take time to get familiar with it, and most
importantly realise that virtually any sequence of:

	if some-condition-based-on atomic_read(&something)
		do something with atomic_*(&something)

is a bug.  Maybe take a look at atomic_add_unless() which can be used
with negative values to decrement.