kmalloc memory slower than malloc

Thu Sep 12 11:58:22 EDT 2013

On Tue, 10 Sep 2013, Russell King - ARM Linux wrote:

> On Tue, Sep 10, 2013 at 02:42:17PM +0200, Thommy Jakobsson wrote:
> > Using pgprot_dmacoherent() in mmap they look more similar. Still 
> > ~10-15% difference, but maybe that is normal for kernel/userspace. 
> > 
> > dma_alloc_coherent in kernel   4.257s (s=0)
> > kmalloc in kernel              0.126s (s=81370000)
> > dma_alloc_coherent userspace   4.907s (s=0)
> > kmalloc in userspace          1.815s (s=81370000)
> > malloc in userspace          0.566s (s=0)
> > 
> > Note that I was lazy and used the same pgprot for all mappings now, which 
> > I guess is a violation. 
> 
> What it means is that the results you end up with are documented to be
> "unpredictable" which gives scope to manufacturers to come up with any
> behaviour they desire in that situation - and it doesn't have to be
> consistent.
> 
> What that means is that if you have an area of physical memory mapped as
> "normal memory cacheable" and it's also mapped "strongly ordered" elsewhere,
> it is entirely legal for an access via the strongly ordered mapping to
> hit the cache if a cache line exists, whereas another implementation
> may miss the cache line if it exists.
> 
> Furthermore, with such mappings (and this has been true since ARMv3 days)
> if you have two such mappings - one cacheable and one non-cacheable, and
> the cacheable mapping has dirty cache lines, the dirty cache lines can be
> evicted at any moment, overwriting whatever you're doing via the non-
> cacheable mapping.
But isn't the memory returned by dma_alloc_coherent() given a non-cached 
mapping, or even a strongly ordered one? Won't that conflict with the 
normal kernel mapping, which is cached?

Are all the mappings documented somewhere, i.e. which Linux mapping 
corresponds to which mapping in the MMU? It seems the ARMv7 documentation 
isn't free either, which doesn't make things easier for me.

Coming back to the original issue: disassembling the code, I noticed that 
the userspace code looked really stupid, with a lot of unnecessary memory 
accesses. The kernel code looked much better. Even after commenting out 
the actual memory access in userspace, leaving just the loop itself, I got 
terrible times.

Previous times:
dma_alloc_coherent in kernel   4.257s (s=0)
kmalloc in kernel              0.126s (s=68620000)
dma_alloc_coherent userspace   0.566s (s=0)
kmalloc in userspace           0.566s (s=68620000)
malloc in userspace            0.566s (s=0)

Commenting out the actual memory access (I checked the assembler, and the 
loop is not optimized away):
dma_alloc_coherent in kernel   4.256s (s=0)
kmalloc in kernel              0.126s (s=84750000)
dma_alloc_coherent userspace   0.566s (s=0)
kmalloc in userspace          0.412s (s=0) << just looping
malloc in userspace          0.566s (s=0)
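
To make the "just looping" case concrete, here is a rough sketch of the two 
loop variants. This is not the actual test program: the function names, the 
word-sized accesses and the reading of "s" as a running sum of the buffer 
contents are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical read loop: walks the buffer 'passes' times and sums every
 * word. The volatile qualifier forces a real load for each element. */
uint32_t sum_words(const volatile uint32_t *buf, size_t nwords,
                   unsigned passes)
{
    uint32_t s = 0;

    assert(buf != NULL);
    for (unsigned p = 0; p < passes; p++)
        for (size_t i = 0; i < nwords; i++)
            s += buf[i];
    return s;
}

/* Same loop with the memory access removed. The volatile counter keeps
 * the compiler from deleting the empty loop, so only loop overhead is
 * measured and the returned sum stays 0. */
uint32_t loop_only(size_t nwords, unsigned passes)
{
    uint32_t s = 0;

    for (unsigned p = 0; p < passes; p++)
        for (volatile size_t i = 0; i < nwords; i++)
            ;
    return s;
}
```

With the access removed the sum never changes, which is consistent with the 
s=0 on the "just looping" line above.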

The kernel is built with -O2, so compiling the test program with -O2 as 
well yields more reasonable results:
dma_alloc_coherent in kernel   4.257s (s=0)
kmalloc in kernel              0.126s (s=84560000)
dma_alloc_coherent userspace   0.124s (s=0)
kmalloc in userspace           0.124s (s=84560000)
malloc in userspace            0.113s (s=0)

As can be seen, all tests executed in userspace were cut to 1/4-1/5 of 
their previous time. Malloc is now a bit faster than kmalloc. It could be 
faster if the physical memory is spread out over different banks, but on 
the other hand cache prefetching should be easier if it is contiguous.
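
For anyone wanting to reproduce the userspace side, a minimal timing 
harness might look like the sketch below. It is an illustration only, not 
the program I measured with: time_read_passes(), its parameters and the use 
of clock_gettime(CLOCK_MONOTONIC) are assumptions. Build it with the same 
optimization level as the kernel (e.g. -O2), otherwise you mostly measure 
the compiler.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical harness: reads 'nwords' words from 'buf', 'passes' times,
 * stores the accumulated sum in *sum_out and returns elapsed seconds. */
double time_read_passes(const volatile uint32_t *buf, size_t nwords,
                        unsigned passes, uint32_t *sum_out)
{
    struct timespec t0, t1;
    uint32_t s = 0;

    assert(buf != NULL && sum_out != NULL);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned p = 0; p < passes; p++)
        for (size_t i = 0; i < nwords; i++)
            s += buf[i];          /* volatile: one load per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *sum_out = s;
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

The buffer pointer can come from malloc() or from an mmap() of the driver's 
memory; the loop is the same in both cases, so any remaining difference 
should come from the mapping attributes rather than from the code.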

> I notice you turn off VM_IO - you don't want to do that...
Fixed

Thanks for all the help,
Thommy