kmalloc and uncached memory

Lin Ming minggr at gmail.com
Wed Apr 16 16:16:16 PDT 2014


On Wed, Apr 16, 2014 at 3:43 PM, Russell King - ARM Linux
<linux at arm.linux.org.uk> wrote:
> On Wed, Apr 16, 2014 at 02:28:45PM -0700, Lin Ming wrote:
>> On Wed, Apr 16, 2014 at 12:03 PM, Laura Abbott <lauraa at codeaurora.org> wrote:
>> > On 4/16/2014 11:50 AM, Lin Ming wrote:
>> >> On Wed, Apr 16, 2014 at 11:33 AM, Laura Abbott <lauraa at codeaurora.org> wrote:
>> >>> On 4/16/2014 11:11 AM, Lin Ming wrote:
>> >>>> Hi Peter,
>> >>>>
>> >>>> I have a performance problem (on an ARM board): the CPU is very busy
>> >>>> with cache invalidation.
>> >>>> So I'm trying to allocate uncached memory to eliminate the cache invalidation.
>> >>>>
>> >>>> But I also have a problem with dma_alloc_coherent().
>> >>>> If I don't use dma_alloc_coherent(), is it OK to use the code below to
>> >>>> allocate uncached memory?
>> >>>>
>> >>>> struct page *page;
>> >>>> pgd_t *pgd;
>> >>>> pud_t *pud;
>> >>>> pmd_t *pmd;
>> >>>> pte_t *pte;
>> >>>> void *cpu_addr;
>> >>>> dma_addr_t dma_addr;
>> >>>> unsigned int vaddr;
>> >>>>
>> >>>> cpu_addr = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> >>>> dma_addr = pci_map_single(NULL, cpu_addr, PAGE_SIZE, (int)DMA_FROM_DEVICE);
>> >>>> vaddr = (unsigned int)cpu_addr;
>> >>>> pgd = pgd_offset_k(vaddr);
>> >>>> pud = pud_offset(pgd, vaddr);
>> >>>> pmd = pmd_offset(pud, vaddr);
>> >>>> pte = pte_offset_kernel(pmd, vaddr);
>> >>>> page = virt_to_page(vaddr);
>> >>>> set_pte_ext(pte, mk_pte(page,  pgprot_dmacoherent(pgprot_kernel)), 0);
>> >>>>
>> >>>> /* This kmalloc memory won't be freed  */
>> >>>>
>> >>>
>> >>> No, that will not work. lowmem pages are mapped with 1MB sections underneath
>> >>> which cannot be (easily) changed at runtime. You really want to be using
>> >>> dma_alloc_coherent here.
>> >>
>> >> By "lowmem pages", do you mean the first 16M of physical memory?
>> >> What if I only use highmem pages (>16M)?
>> >>
>> >
>> > By lowmem pages I am referring to the direct mapped kernel area. Highmem refers
>> > to pages which do not have a permanent mapping in the kernel address space. If
>> > you are calling kmalloc with GFP_KERNEL you will be getting a page from the lowmem
>> > region.
>>
>> Thanks for the explanation.
>>
>> >
>> > What's the reason you can't use dma_alloc_coherent?
>>
>> I'm actually testing WIFI RX performance on an ARM-based AP.
>> The traffic goes from WIFI to Ethernet: the WIFI driver receives (RX)
>> packets and the Ethernet driver then transmits (TX) them.
>>
>> I used dma_alloc_coherent() to allocate an uncached buffer in the WIFI
>> driver to receive packets.
>> But then the Ethernet driver can't send the packets successfully.
>>
>> If I use kmalloc() to allocate the buffers in the WIFI driver, then everything is OK.
>>
>> I know this problem is very platform- and driver-specific, but any
>> suggestions would be appreciated.
>
> So why are you trying to map the memory into userspace?

I didn't map the memory into userspace.
Or am I missing something obvious?

>
> Given your fragment above, what you're doing there will be no different
> from using dma_alloc_coherent() - think about what type of mapping you
> end up with.
>
> You have two options on ARM:
>
> 1. Use dma_alloc_coherent() - recommended for data which both the CPU and
>    DMA can update simultaneously - eg, descriptor ring buffers typically
>    found on ethernet devices.
>
> 2. Use dma_map_page/dma_map_single() for what we call streaming support,
>    which can use kmalloc memory.  *But* there is only exactly *one* owner
>    of the buffer at any one time - either the CPU owns it *or* the DMA
>    device owns it.  *Only* the current owner may access the buffer.
>    Such mappings must be unmapped before they are freed.
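
To make the single-owner rule in option 2 concrete, here is a toy userspace model of it. It is only a sketch: every name in it (toy_buf, toy_map, toy_unmap and so on) is hypothetical and none of it is the kernel DMA API. toy_map() stands in for dma_map_single() and toy_unmap() for dma_unmap_single(). The model just tracks who owns the buffer and refuses accesses from the other side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy userspace model of the streaming-DMA ownership rule.
 * All names are hypothetical; this is NOT the kernel DMA API. */
enum toy_owner { TOY_OWNER_CPU, TOY_OWNER_DEVICE };

struct toy_buf {
	unsigned char data[64];
	enum toy_owner owner;	/* exactly one owner at any one time */
};

/* A freshly allocated buffer belongs to the CPU. */
static void toy_buf_init(struct toy_buf *b)
{
	memset(b->data, 0, sizeof(b->data));
	b->owner = TOY_OWNER_CPU;
}

/* Stands in for dma_map_single(): ownership passes CPU -> device. */
static bool toy_map(struct toy_buf *b)
{
	if (b->owner != TOY_OWNER_CPU)
		return false;	/* already mapped */
	b->owner = TOY_OWNER_DEVICE;
	return true;
}

/* Stands in for dma_unmap_single(): ownership passes device -> CPU. */
static bool toy_unmap(struct toy_buf *b)
{
	if (b->owner != TOY_OWNER_DEVICE)
		return false;	/* not mapped */
	b->owner = TOY_OWNER_CPU;
	return true;
}

/* The CPU may touch the buffer only while it is the owner. */
static bool toy_cpu_write(struct toy_buf *b, size_t off, unsigned char v)
{
	if (b->owner != TOY_OWNER_CPU || off >= sizeof(b->data))
		return false;
	b->data[off] = v;
	return true;
}

/* The device may touch the buffer only while it is the owner. */
static bool toy_device_write(struct toy_buf *b, size_t off, unsigned char v)
{
	if (b->owner != TOY_OWNER_DEVICE || off >= sizeof(b->data))
		return false;
	b->data[off] = v;
	return true;
}

/* A buffer must be unmapped (CPU-owned again) before it is freed. */
static bool toy_can_free(const struct toy_buf *b)
{
	return b->owner == TOY_OWNER_CPU;
}
```

In this model, a CPU write between toy_map() and toy_unmap() is rejected, and so is freeing the buffer while it is still mapped. In a real driver nothing rejects such accesses; they just risk corrupting the data.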

My WIFI RX driver does 2).
Here is a piece of the perf_event log.
It seems the bottleneck is the CPU cache invalidate operation.

    33.86%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_inv_range
            |
            --- v7_dma_inv_range
               |
               |--51.46%-- ___dma_page_cpu_to_dev
               |          skb2rbd_attach
               |          vmac_rx_poll
               |          net_rx_action
               |          __do_softirq
               |          run_ksoftirqd
               |          kthread
               |          kernel_thread_exit
               |
                --48.54%-- ___dma_page_dev_to_cpu
                          vmac_rx_poll
                          net_rx_action
                          __do_softirq
                          run_ksoftirqd
                          kthread
                          kernel_thread_exit
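
To put those numbers in absolute terms, here is a small sketch. It assumes the two call-chain percentages are relative to the 33.86% v7_dma_inv_range entry (they add up to 100%), and the helper name absolute_share is made up for illustration.

```c
#include <stdio.h>

/* Convert a call-chain percentage that is relative to its parent entry
 * into an absolute percentage of all samples.
 * Assumption: child_pct is relative to parent_pct. */
static double absolute_share(double parent_pct, double child_pct)
{
	return parent_pct * child_pct / 100.0;
}

/* Print both branches of the v7_dma_inv_range entry from the log above. */
static void print_inv_range_breakdown(void)
{
	printf("cpu_to_dev path: %.2f%%\n", absolute_share(33.86, 51.46));
	printf("dev_to_cpu path: %.2f%%\n", absolute_share(33.86, 48.54));
}
```

Under that assumption, invalidating when the buffer is handed to the device and invalidating when it comes back to the CPU each cost roughly one sixth of all samples.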

So I tried 1): using dma_alloc_coherent() to eliminate the cache
invalidate operations.
But for some reason, the Ethernet driver could not successfully TX the
uncached buffer.

Thanks.

>
> Since there's the requirement for ownership in (2), these are not really
> suitable to be mapped into userspace while DMA is happening - accesses to
> the buffer while DMA is in progress /can/ corrupt the data.
>
> --
> FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
> improving, and getting towards what was expected from it.



More information about the linux-arm-kernel mailing list