[PATCH 2/2] ath10k: do not use coherent memory for tx buffers

Tue Nov 24 02:32:14 PST 2015

On 2015-11-23 19:50, Peter Oh wrote:
> 
> On 11/23/2015 10:18 AM, Felix Fietkau wrote:
>> On 2015-11-23 18:25, Peter Oh wrote:
>>> Hi,
>>>
>>> Have you measured the peak throughput?
>>> The pre-allocated coherent memory concept was introduced as once of peak
>>> throughput improvement.
>> It's all still pre-allocated and pre-mapped.
> Right. I mis-guessed with the title.
>>
>>> IIRC, dma_map_single takes about 4 us on Cortex A7 and dma_unmap_single
>>> also takes time to invalid cache.
>> That's why I didn't put a map/unmap in the hot path. There is only a
>> cache sync there. With coherent memory, every single word access blocks
>> until the transaction is complete. With cached/mapped memory, the CPU
>> can fill the cachelines first, then flush it in one go. This usually
>> ends up being faster than working with coherent memory directly.
>>
>>> Please share your tput number before and after, so I don't need to worry
>>> about performance degrade.
>> I don't have an ideal setup for tput tests at the moment, so I can't
>> give you any numbers.
> Could you share any rough number?
>>   However, on the device that I'm testing on
>> (IPQ806x based), this patch makes the difference between working and
>> non-working wifi, fixing the regression introduced by your pre-allocated
>> coherent memory patch.
> Thank you for the catch up and fix.
> Btw, the regression can be fixed by using GFP_KERNEL, instead of 
> GFP_DMA, right?
I just did some timing measurements, and it seems that the DMA coherent
variant is roughly 200 nanoseconds faster. Maybe the extra latency is
caused by the CPU filling the cacheline from RAM first.

Kalle, please only merge the first one and drop this patch.
I will send a replacement for it.

- Felix