[PATCH v5 1/5] ARM: dma-mapping: Optimize allocation

Wed Jan 13 09:44:18 PST 2016

On 13/01/16 17:33, Tomasz Figa wrote:
> On Wed, Jan 13, 2016 at 9:17 PM, Robin Murphy <robin.murphy at arm.com> wrote:
>> Hi Doug,
>>
>>
>> On 08/01/16 23:05, Douglas Anderson wrote:
>>>
>>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>>> sizeable buffers.  Upon simple tests of video I saw it trying to
>>> allocate 4,194,304 bytes.  The function tries to allocate large chunks
>>> in order to optimize IOMMU TLB usage.
>>>
>>> The current function is very, very slow.
>>>
>>> One problem is the way it keeps trying and trying to allocate big
>>> chunks.  Imagine a very fragmented memory that has 4M free but no
>>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>>> We'll do the following memory allocations:
>>> - For page 1:
>>>     - Try to allocate order 10 (no retry)
>>>     - Try to allocate order 9 (no retry)
>>>     - ...
>>>     - Try to allocate order 0 (with retry, but not needed)
>>> - For page 2:
>>>     - Try to allocate order 9 (no retry)
>>>     - Try to allocate order 8 (no retry)
>>>     - ...
>>>     - Try to allocate order 0 (with retry, but not needed)
>>> - ...
>>> - ...
>>>
>>> Total number of calls to alloc() calls for this case is:
>>>     sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>>     => 9228
>>>
>>> The above is obviously worse case, but given how slow alloc can be we
>>> really want to try to avoid even somewhat bad cases.  I timed the old
>>> code with a device under memory pressure and it wasn't hard to see it
>>> take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing
>>> was done on kernel 3.14, so possibly mainline would behave
>>> differently).
>>>
>>> A second problem is that allocating big chunks under memory pressure
>>> when we don't need them is just not a great idea anyway unless we really
>>> need them.  We can make due pretty well with smaller chunks so it's
>>> probably wise to leave bigger chunks for other users once memory
>>> pressure is on.
>>>
>>> Let's adjust the allocation like this:
>>>
>>> 1. If a big chunk fails, stop trying to hard and bump down to lower
>>>      order allocations.
>>> 2. Don't try useless orders.  The whole point of big chunks is to
>>>      optimize the TLB and it can really only make use of 2M, 1M, 64K and
>>>      4K sizes.
>>>
>>> We'll still tend to eat up a bunch of big chunks, but that might be the
>>> right answer for some users.  A future patch could possibly add a new
>>> DMA_ATTR that would let the caller decide that TLB optimization isn't
>>> important and that we should use smaller chunks.  Presumably this would
>>> be a sane strategy for some callers.
>>
>>
>> Now that I've had time to think about it properly:
>>
>> Reviewed-by: Robin Murphy <robin.murphy at arm.com>
>>
>> I just had an absolutely disgusting idea of how to get the same progression
>> with just a single variable and no static array, but I'll keep that firmly
>> to myself as it's almost IOCCC-grade WTF :D
>
> Just out of curiosity, a bitmap and loop with fls() and clearing bit
> on failure or something more freaky? :)

Got a Python interpreter handy?

order = 9
for i in range(4):
     print order
     order = (order - 1) & 0xc

Like I said, disgusting :D

Robin.

>
> Anyway:
>
> Reviewed-by: Tomasz Figa <tfiga at chromium.org>
>
> Best regards,
> Tomasz
>