kmalloc and uncached memory
Lin Ming
minggr at gmail.com
Wed Apr 16 16:16:16 PDT 2014
On Wed, Apr 16, 2014 at 3:43 PM, Russell King - ARM Linux
<linux at arm.linux.org.uk> wrote:
> On Wed, Apr 16, 2014 at 02:28:45PM -0700, Lin Ming wrote:
>> On Wed, Apr 16, 2014 at 12:03 PM, Laura Abbott <lauraa at codeaurora.org> wrote:
>> > On 4/16/2014 11:50 AM, Lin Ming wrote:
>> >> On Wed, Apr 16, 2014 at 11:33 AM, Laura Abbott <lauraa at codeaurora.org> wrote:
>> >>> On 4/16/2014 11:11 AM, Lin Ming wrote:
>> >>>> Hi Peter,
>> >>>>
>> >>>> I have a performance problem (on an ARM board): the CPU is very busy with
>> >>>> cache invalidation.
>> >>>> So I'm trying to allocate uncached memory to eliminate the cache invalidation.
>> >>>>
>> >>>> But I also have a problem with dma_alloc_coherent().
>> >>>> If I don't use dma_alloc_coherent(), is it OK to use the code below to
>> >>>> allocate uncached memory?
>> >>>>
>> >>>> struct page *page;
>> >>>> pgd_t *pgd;
>> >>>> pud_t *pud;
>> >>>> pmd_t *pmd;
>> >>>> pte_t *pte;
>> >>>> void *cpu_addr;
>> >>>> dma_addr_t dma_addr;
>> >>>> unsigned int vaddr;
>> >>>>
>> >>>> cpu_addr = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> >>>> dma_addr = pci_map_single(NULL, cpu_addr, PAGE_SIZE, (int)DMA_FROM_DEVICE);
>> >>>> vaddr = (unsigned int)cpu_addr;
>> >>>> pgd = pgd_offset_k(vaddr);
>> >>>> pud = pud_offset(pgd, vaddr);
>> >>>> pmd = pmd_offset(pud, vaddr);
>> >>>> pte = pte_offset_kernel(pmd, vaddr);
>> >>>> page = virt_to_page(vaddr);
>> >>>> set_pte_ext(pte, mk_pte(page, pgprot_dmacoherent(pgprot_kernel)), 0);
>> >>>>
>> >>>> /* This kmalloc memory won't be freed */
>> >>>>
>> >>>
>> >>> No, that will not work. lowmem pages are mapped with 1MB sections underneath
>> >>> which cannot be (easily) changed at runtime. You really want to be using
>> >>> dma_alloc_coherent here.
>> >>
>> >> By "lowmem pages", do you mean the first 16M of physical memory?
>> >> What if I only use highmem pages (>16M)?
>> >>
>> >
>> > By lowmem pages I am referring to the direct mapped kernel area. Highmem refers
>> > to pages which do not have a permanent mapping in the kernel address space. If
>> > you are calling kmalloc with GFP_KERNEL you will be getting a page from the lowmem
>> > region.
>>
>> Thanks for the explanation.
>>
>> >
>> > What's the reason you can't use dma_alloc_coherent?
>>
>> I'm actually testing WIFI RX performance on an ARM-based AP.
>> The traffic goes from WIFI to Ethernet: the WIFI driver receives the
>> packets and the Ethernet driver then transmits them.
>>
>> I used dma_alloc_coherent() to allocate an uncached buffer in the WIFI
>> driver to receive packets.
>> But then the Ethernet driver can't send the packets successfully.
>>
>> If I use kmalloc() to allocate the buffers in the WIFI driver, everything is OK.
>>
>> I know this is a very platform/driver-specific problem, but any
>> suggestion would be appreciated.
>
> So why are you trying to map the memory into userspace?
I didn't map the memory into userspace.
Or am I missing something obvious?
>
> Given your fragment above, what you're doing there will be no different
> from using dma_alloc_coherent() - think about what type of mapping you
> end up with.
>
> You have two options on ARM:
>
> 1. Use dma_alloc_coherent() - recommended for data which both the CPU and
> DMA can update simultaneously - eg, descriptor ring buffers typically
> found on ethernet devices.
>
> 2. Use dma_map_page/dma_map_single() for what we call streaming support,
> which can use kmalloc memory. *But* there is only exactly *one* owner
> of the buffer at any one time - either the CPU owns it *or* the DMA
> device owns it. *Only* the current owner may access the buffer.
> Such mappings must be unmapped before they are freed.
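The single-owner rule in option 2 can be illustrated with a small userspace model. This is only a sketch and not kernel code: `stream_buf`, `model_map()`, `model_unmap()`, `cpu_write()` and `model_can_free()` are hypothetical names invented for the illustration. `model_map()` and `model_unmap()` stand in for the ownership hand-off done by dma_map_single() and its unmap counterpart, and the model ignores cache maintenance entirely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model of streaming-DMA ownership (not the kernel API). */
enum buf_owner { OWNER_CPU, OWNER_DEVICE };

struct stream_buf {
    unsigned char data[64];
    enum buf_owner owner;
};

/* Stands in for mapping the buffer: ownership passes from the CPU to the device. */
static bool model_map(struct stream_buf *b)
{
    if (b->owner != OWNER_CPU)
        return false;               /* already owned by the device */
    b->owner = OWNER_DEVICE;
    return true;
}

/* Stands in for unmapping the buffer: ownership returns to the CPU. */
static bool model_unmap(struct stream_buf *b)
{
    if (b->owner != OWNER_DEVICE)
        return false;               /* was never mapped */
    b->owner = OWNER_CPU;
    return true;
}

/* A CPU access is only legal while the CPU is the current owner. */
static bool cpu_write(struct stream_buf *b, const void *src, size_t len)
{
    if (b->owner != OWNER_CPU || len > sizeof(b->data))
        return false;
    memcpy(b->data, src, len);
    return true;
}

/* The buffer may only be freed once it has been unmapped. */
static bool model_can_free(const struct stream_buf *b)
{
    return b->owner == OWNER_CPU;
}
```

In this model a CPU write is rejected between map and unmap, and the buffer cannot be freed until it has been unmapped, which mirrors the two constraints stated in option 2.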
My WIFI RX driver did 2).
Here is a piece of the perf_event log.
The bottleneck seems to be the CPU cache invalidate operation.
    33.86%  ksoftirqd/0  [kernel.kallsyms]  [k] v7_dma_inv_range
            |
            --- v7_dma_inv_range
               |
               |--51.46%-- ___dma_page_cpu_to_dev
               |          skb2rbd_attach
               |          vmac_rx_poll
               |          net_rx_action
               |          __do_softirq
               |          run_ksoftirqd
               |          kthread
               |          kernel_thread_exit
               |
                --48.54%-- ___dma_page_dev_to_cpu
                          vmac_rx_poll
                          net_rx_action
                          __do_softirq
                          run_ksoftirqd
                          kthread
                          kernel_thread_exit
So I tried 1): using dma_alloc_coherent() to eliminate the cache
invalidate operation.
But for some reason, the Ethernet driver could not successfully TX the
uncached buffer.
Thanks.
>
> Since there's the requirement for ownership in (2), these are not really
> suitable to be mapped into userspace while DMA is happening - accesses to
> the buffer while DMA is in progress /can/ corrupt the data.