kmalloc memory slower than malloc
Thommy Jakobsson
thommyj at gmail.com
Thu Sep 12 11:58:22 EDT 2013
On Tue, 10 Sep 2013, Russell King - ARM Linux wrote:
> On Tue, Sep 10, 2013 at 02:42:17PM +0200, Thommy Jakobsson wrote:
> > Using pgprot_dmacoherent() in mmap they look more similar. Still
> > ~10-15% difference, but maybe that is normal for kernel/userspace.
> >
> > dma_alloc_coherent in kernel 4.257s (s=0)
> > kmalloc in kernel 0.126s (s=81370000)
> > dma_alloc_coherent userspace 4.907s (s=0)
> > kmalloc in userspace 1.815s (s=81370000)
> > malloc in userspace 0.566s (s=0)
> >
> > Note that I was lazy and used the same pgprot for all mappings now, which
> > I guess is a violation.
>
> What it means is that the results you end up with are documented to be
> "unpredictable" which gives scope to manufacturers to come up with any
> behaviour they desire in that situation - and it doesn't have to be
> consistent.
>
> What that means is that if you have an area of physical memory mapped as
> "normal memory cacheable" and it's also mapped "strongly ordered" elsewhere,
> it is entirely legal for an access via the strongly ordered mapping to
> hit the cache if a cache line exists, whereas another implementation
> may miss the cache line if it exists.
>
> Furthermore, with such mappings (and this has been true since ARMv3 days)
> if you have two such mappings - one cacheable and one non-cacheable, and
> the cacheable mapping has dirty cache lines, the dirty cache lines can be
> evicted at any moment, overwriting whatever you're doing via the non-
> cacheable mapping.
But isn't the memory returned by dma_alloc_coherent() given a non-cached
mapping, or even a strongly ordered one? Won't that conflict with the normal
kernel mapping, which is cached?
Are all the mappings documented somewhere, i.e. which Linux mapping
corresponds to which mapping in the MMU? It seems the ARMv7 documentation
isn't freely available either, which isn't making things easier for me.
Coming back to the original issue: disassembling the code, I noticed that
the userspace code looked really poor, with a lot of unnecessary memory
accesses. The kernel code looked much better. Even after commenting out the
actual memory access in userspace, leaving just the loop itself, I got
terrible times.
Previous times:
dma_alloc_coherent in kernel 4.257s (s=0)
kmalloc in kernel 0.126s (s=68620000)
dma_alloc_coherent userspace 0.566s (s=0)
kmalloc in userspace 0.566s (s=68620000)
malloc in userspace 0.566s (s=0)
Commenting out the actual memory access (the loop is not optimized away;
verified in the assembler output):
dma_alloc_coherent in kernel 4.256s (s=0)
kmalloc in kernel 0.126s (s=84750000)
dma_alloc_coherent userspace 0.566s (s=0)
kmalloc in userspace 0.412s (s=0) << just looping
malloc in userspace 0.566s (s=0)
The kernel is built with -O2, so compiling the test program with -O2 as well
yields more reasonable results:
dma_alloc_coherent in kernel 4.257s (s=0)
kmalloc in kernel 0.126s (s=84560000)
dma_alloc_coherent userspace 0.124s (s=0)
kmalloc in userspace 0.124s (s=84560000)
malloc in userspace 0.113s (s=0)
As can be seen, all tests executed in userspace were cut to 1/4-1/5 of the
previous times. malloc is now a bit faster than kmalloc. It could be faster
still if the physical memory is spread out over different banks, but on the
other hand cache prefetching should be easier if it is contiguous.
> I notice you turn off VM_IO - you don't want to do that...
Fixed
Thanks for all help,
Thommy