[PATCH v3 1/3] dma: Support multiple interleaved frames with non-contiguous memory

Wed Feb 26 10:02:22 EST 2014

On 26 February 2014 23:21, Srikanth Thokala <sthokal at xilinx.com> wrote:
> On Mon, Feb 24, 2014 at 7:39 AM, Jassi Brar <jaswinder.singh at linaro.org> wrote:
>> On 21 February 2014 23:37, Srikanth Thokala <sthokal at xilinx.com> wrote:
>>> On Thu, Feb 20, 2014 at 3:23 PM, Jassi Brar <jaswinder.singh at linaro.org>
>>> wrote:
>>>> On 20 February 2014 14:54, Srikanth Thokala <sthokal at xilinx.com> wrote:
>>>>> On Wed, Feb 19, 2014 at 12:33 AM, Jassi Brar <jaswinder.singh at linaro.org>
>>>>> wrote:
>>>>>> On 18 February 2014 23:16, Srikanth Thokala <sthokal at xilinx.com> wrote:
>>>>>>> On Tue, Feb 18, 2014 at 10:20 PM, Jassi Brar
>>>>>>> <jaswinder.singh at linaro.org> wrote:
>>>>>>>> On 18 February 2014 16:58, Srikanth Thokala <sthokal at xilinx.com>
>>>>>>>> wrote:
>>>>>>>>> On Mon, Feb 17, 2014 at 3:27 PM, Jassi Brar
>>>>>>>>> <jaswinder.singh at linaro.org> wrote:
>>>>>>>>>> On 15 February 2014 17:30, Srikanth Thokala <sthokal at xilinx.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> The current implementation of interleaved DMA API support multiple
>>>>>>>>>>> frames only when the memory is contiguous by incrementing
>>>>>>>>>>> src_start/
>>>>>>>>>>> dst_start members of interleaved template.
>>>>>>>>>>>
>>>>>>>>>>> But, when the memory is non-contiguous it will restrict slave
>>>>>>>>>>> device
>>>>>>>>>>> to not submit multiple frames in a batch.  This patch handles this
>>>>>>>>>>> issue by allowing the slave device to send array of interleaved dma
>>>>>>>>>>> templates each having a different memory location.
>>>>>>>>>>>
>>>>>>>>>> How fragmented could be memory in your case? Is it inefficient to
>>>>>>>>>> submit separate transfers for each segment/frame?
>>>>>>>>>> It will help if you could give a typical example (chunk size and gap
>>>>>>>>>> in bytes) of what you worry about.
>>>>>>>>>
>>>>>>>>> With scatter-gather engine feature in the hardware, submitting
>>>>>>>>> separate
>>>>>>>>> transfers for each frame look inefficient. As an example, our DMA
>>>>>>>>> engine
>>>>>>>>> supports up to 16 video frames, with each frame (a typical video
>>>>>>>>> frame
>>>>>>>>> size) being contiguous in memory but frames are scattered into
>>>>>>>>> different
>>>>>>>>> locations. We could not definitely submit frame by frame as it would
>>>>>>>>> be
>>>>>>>>> software overhead (HW interrupting for each frame) resulting in video
>>>>>>>>> lags.
>>>>>>>>>
>>>>>>>> IIUIC, it is 30fps and one dma interrupt per frame ... it doesn't seem
>>>>>>>> inefficient at all. Even poor-latency audio would generate a higher
>>>>>>>> interrupt-rate. So the "inefficiency concern" doesn't seem valid to
>>>>>>>> me.
>>>>>>>>
>>>>>>>> Not to mean we shouldn't strive to reduce the interrupt-rate further.
>>>>>>>> Another option is to emulate the ring-buffer scheme of ALSA.... which
>>>>>>>> should be possible since for a session of video playback the frame
>>>>>>>> buffers' locations wouldn't change.
>>>>>>>>
>>>>>>>> Yet another option is to use the full potential of the
>>>>>>>> interleaved-xfer api as such. It seems you confuse a 'video frame'
>>>>>>>> with the interleaved-xfer api's 'frame'. They are different.
>>>>>>>>
>>>>>>>> Assuming your one video frame is F bytes long and Gk is the gap in
>>>>>>>> bytes between end of frame [k] and start of frame [k+1] and  Gi != Gj
>>>>>>>> for i!=j
>>>>>>>> In the context of interleaved-xfer api, you have just 1 Frame of 16
>>>>>>>> chunks. Each chunk is Fbytes and the inter-chunk-gap(ICG) is Gk  where
>>>>>>>> 0<=k<15
>>>>>>>> So for your use-case .....
>>>>>>>>   dma_interleaved_template.numf = 1   /* just 1 frame */
>>>>>>>>   dma_interleaved_template.frame_size = 16  /* containing 16 chunks */
>>>>>>>>    ...... //other parameters
>>>>>>>>
>>>>>>>> You have 3 options to choose from and all should work just as fine.
>>>>>>>> Otherwise please state your problem in real numbers (video-frames'
>>>>>>>> size, count & gap in bytes).
>>>>>>>
>>>>>>> Initially I interpreted interleaved template the same.  But, Lars
>>>>>>> corrected me
>>>>>>> in the subsequent discussion and let me put it here briefly,
>>>>>>>
>>>>>>> In the interleaved template, each frame represents a line of size
>>>>>>> denoted by
>>>>>>> chunk.size and the stride by icg.  'numf' represent number of frames
>>>>>>> i.e.
>>>>>>> number of lines.
>>>>>>>
>>>>>>> In video frame context,
>>>>>>> chunk.size -> hsize
>>>>>>> chunk.icg -> stride
>>>>>>> numf -> vsize
>>>>>>> and frame_size is always 1 as it will have only one chunk in a line.
>>>>>>>
>>>>>> But you said in your last post
>>>>>>   "with each frame (a typical video frame size) being contiguous in
>>>>>> memory"
>>>>>>  ... which is not true from what you write above. Anyways, my first 2
>>>>>> suggestions still hold.
>>>>>
>>>>> Yes, each video frame is contiguous and they can be scattered.
>>>>>
>>>> I assume by contiguous frame you mean as in framebuffer?  Which is an
>>>> array of bytes.
>>>> If yes, then you should do as I suggest first,  frame_size=16  and numf=1.
>>>
>>> I think am confusing you.  I would like to explain with an example.  Lets
>>> say
>>> each video frame is 4k size starting at address 0x10004000 (-0x10005000) and
>>> other frame at 0x20002000 (-0x20003000), and so on.
>>>
>> As I said plz dont confuse video frame with DMA frame.... in video
>> frame the stride is constant(zero or not) whereas in DMA context the
>> stride must be zero for the frame to be called contiguous.
>>
>>> So, the frames are
>>> scattered in memory and as the template doesnt allow multiple src_start/
>>> dst_start we could not use single template to fill the HW descriptors (of
>>> frames).  So, I feel your suggestion might not work if the frames are
>>> scattered.
>>> Also, how could we get 'vsize' value in your approach?
>>>
>> The client driver(video driver) should know the frame parameters.
>> Practically you'll have to populate 16 transfer templates (frame
>> attributes and locations won't change for a session) and submit to be
>> transferred _cyclically_.
>>
>>>  More importantly,
>>> we are overriding the semantics of interleaved template members.
>>>
>> Not at all. Interleaved-dma isn't meant for only constant stride/icg
>> transfers. Rather it's for identical frames with random strides.
>>
>>>
>>>>
>>>> If no, then it seems you are already doing the right thing.... the
>>>> ring-buffer scheme. Please share some stats how the current api is
>>>> causing you overhead because that is a very common case (many
>>>> controllers support LLI) and you have 467ms (@30fps with 16-frames
>>>> ring-buffer) to queue in before you see any frame drop.
>>>
>>> As I mentioned earlier in the thread, our hardware has a SG engine where by
>>> we could send multiple frames in a batch.  Using the original implementation
>>> of interleaved API, we have three options to transfer.
>>>
>>> One is to send frame by frame to the hardware.  We get a async_tx desc for
>>> each frame and then we submit it to hardware triggering it to transfer this
>>> BD.
>>> We queue the next descriptor on the pending queue and whenever there is
>>> a completion interrupt we submit the next BD on this queue to hardware. In
>>> this implementation we are not efficiently using the SG engine in the
>>> hardware
>>> as we transferring frame by frame even HW allows us to transfer multiple
>>> frames.
>>>
>> Sending one frame at a time will likely cause jitters. That shouldn't be done.
>>
>>> The second option is to queue all the BDs until the maximum frames that HW
>>> is
>>> capable to transfer in a batch and then submit to SG engine in the HW.  With
>>> this approach I feel there will be additional software overhead to track the
>>> number
>>> of maximum transfers and few additional cycles to release the cookie of each
>>> desc.
>>> Here, each desc represents a frame.
>>>
>> APIs are written for the gcd of h/w. We should optimize only for
>> majority and here isn't even anything gained after changing the api.
>> What you want to 'optimize' has been there since ever. Nobody
>> considers that overhead. BTW you don't even want to spend a few 'extra
>> cycles' but what about every other platform that doesn't support this
>> and will have to scan for such transfer requests?
>>
>>> The last option is the current implementation of the driver along with this
>>> change in
>>> API.  It will allow us to send array of interleaved templates wherein we
>>> could allocate
>>> a single async desc which will handle multiple frames (or segments) and just
>>> submit
>>> this desc to HW.  Then we program the current to first frame, tail to last
>>> frame and
>>> HW will complete this transfer.  Here, each desc represents multiple frames.
>>>
>>> My point here is the driver should use hardware resources efficiently and I
>>> feel the
>>> driver will be inefficient if we dont use them.
>>>
>> I am afraid you are getting carried away by the 'awesomeness' of your
>> hardware. RingBuffers/Cyclic transfers  are meant for cases just like
>> yours.
>> Consider the following ... if you queue 16 frames and don't care to
>> track before all are transmitted, you'll have very high latency. Video
>> seeks will take unacceptably long and give the impression of a slow
>> system. Whereas if you get callbacks for each frame rendered, you
>> could updates frames from the next one thereby having very quick
>> response time.
>
> Not at all, at least in our hardware, when we submit transfers it is most
> likely to be completed. There are few errors, but mostly are recoverable
> and it doesnt stop the transfer unless there is a critical error which needs
> a reset to the whole system.  So, at least in my use case there will be
> no latency.  I am not saying hardware is great, but it is the IP implementation.
>
I am not talking about the error cases.
Apply your patch locally so that you queue 16frames and not get
notified upon each frame 'rendered'...  now click on 'seek bar' of the
video player. See how slow it is to jump to play from the new
location.  Or if the frames are to be encoded after dma transfer...
see the 'padding' that would need to be done at the end.  These
concerns are common with audio subsystem using ring-buffers.
Anyways that is just FYI, I don't care how you implement your platform.

> Regarding this API change, I had earlier explained my use case in my
> v2 thread.
> Lars and Vinod came up with this resolution to allow array of
> interleaved templates.
> I also feel this is reasonable change for the subsystem.
>
I don't see you getting any better performance from your hardware and
I certainly don't find your usecase anything new (many dma controllers
support LLI and they work just fine as such). And I have already
explained how you could do, whatever you want, without this change.
So a polite NAK from me.