Slowdown copying data between kernel versions 4.19 and 5.15

Havens, Austin austin.havens at anritsu.com
Thu Jul 6 12:12:03 PDT 2023


On Friday Jun 30, 2023 at 4:15 AM PDT Mark Rutland wrote:
>On Thu, Jun 29, 2023 at 07:33:39PM +0000, Havens, Austin wrote:
>> Hi Mark,
>> Thanks for the reply.
>
>No problem; thanks for the info here!
>
>>  On Thursday, June 29, 2023 7:74 AM Mark Rutland wrote:
>> > On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
>
>> >> Profiling with the hacked __arch_copy_from_user 
>> >> root at ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>> >> 
>> >>  Performance counter stats for '/mnt/usrroot/test_copy':
>> >> 
>> >>           11822342      instructions              #    0.23  insn per cycle         
>> >>           50689594      cycles                                                      
>> >>           37627922      ld_dep_stall                                                
>> >>              17933      read_alloc                                                  
>> >>               3421      dTLB-load-misses                                            
>> >> 
>> >>        0.043440253 seconds time elapsed
>> >> 
>> >>        0.004382000 seconds user
>> >>        0.039442000 seconds sys
>> >> 
>> >> Unfortunately the hack crashes in other cases so it is not a viable solution
>> >> for us. Also, on our actual workload there is still a small difference in
>> >> performance remaining that I have not tracked down yet (I am guessing it has
>> >> to do with the dTLB-load-misses remaining higher). 
>> >> 
>> >> Note: I think that the slowdown is only noticeable in cases like ours where
>> >> the data being copied from is not in cache (for us, because the FPGA writes
>> >> it).
>> >
>> > When you say "is not in cache", what exactly do you mean? If this were just the
>> > latency of filling a cache I wouldn't expect the size of the first access to
>> > make a difference, so I'm assuming the source buffer is not mapped with
>> > cacheable memory attributes, which we generally assume.
>> >
>> > Which memory attributes are the source and destination buffers mapped with? Is
>> > that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?
>> >
>> > I'm assuming this is with some out-of-tree driver; if that's in a public tree
>> > could you please provide a pointer to it?
>> >
>> > Thanks,
>> > Mark.
>> 
>> I am actually not 100% clear on how the memory gets mapped. Currently we call 
>> ioremap in our driver, so I think that should map it as iomem. When I removed 
>> that or used /dev/mem, nothing changed, and looking at things now I think that 
>> is because the original mapping is from drivers/of/of_reserved_mem.c
>
>A plain ioremap() will give you Device memory attributes, which
>copy_{to,from}_user() aren't supposed to be used with, and also forbids the
>CPU from doing a bunch of things (e.g. gathering and prefetching) which makes
>this slow.
>
>If it's possible to use Normal Non-Cacheable instead, (e.g. by using
>ioremap_wc()), that will likely be faster since it permits gathering and
>prefetching, etc.

I tried using Normal Non-Cacheable memory instead of iomem with
memremap(r.start, lp->size, MEMREMAP_WC);
and
vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
but it did not help.
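
For reference, the write-combine attempt looked roughly like this; just a sketch, assuming the
driver hands the buffer to userspace through a plain remap_pfn_range()-style mmap handler
(databuffer_mmap is a stand-in name, and databuffer_local is the hypothetical per-device struct
sketched more fully further down):

#include <linux/io.h>
#include <linux/mm.h>

static int databuffer_mmap(struct file *filp, struct vm_area_struct *vma)
{
        struct databuffer_local *lp = filp->private_data;
        unsigned long size = vma->vm_end - vma->vm_start;

        /* Ask for Normal-NC (write-combine) attributes instead of Device. */
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
        return remap_pfn_range(vma, vma->vm_start,
                               lp->databuffer_paddr >> PAGE_SHIFT,
                               size, vma->vm_page_prot);
}

with the kernel-side mapping coming from the memremap(..., MEMREMAP_WC) call above instead of
ioremap().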

I talked to our FPGA engineers and they said it would be possible to make the buffer coherent;
I think this would take advantage of the CCI-400 on the SoC. Before having them implement
it, I tried using the DMA interfaces to see if there was a performance improvement, and
there was not.

For the allocation I used

dma_mask = dma_get_required_mask(dev);
dma_set_coherent_mask(dev, dma_mask);

lp->databuffer_vaddr = dmam_alloc_coherent(dev, lp->size, &lp->databuffer_paddr, GFP_KERNEL);

and for the mmap I used

dma_mmap_coherent(lp->dev, vma, lp->databuffer_vaddr, lp->databuffer_paddr, size);

The performance in both cases was virtually identical to the iomem case. 
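
Put together, the coherent-DMA attempt looked roughly like this; again only a sketch, with
databuffer_local as a stand-in for our real per-device struct and without the error handling
of the actual driver:

#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>

/* Hypothetical per-device state mirroring the fields used above. */
struct databuffer_local {
        struct device *dev;
        size_t size;
        void *databuffer_vaddr;
        dma_addr_t databuffer_paddr;
};

static int databuffer_alloc(struct databuffer_local *lp)
{
        u64 dma_mask = dma_get_required_mask(lp->dev);
        int ret;

        ret = dma_set_coherent_mask(lp->dev, dma_mask);
        if (ret)
                return ret;

        /* Managed allocation, freed automatically when the device goes away. */
        lp->databuffer_vaddr = dmam_alloc_coherent(lp->dev, lp->size,
                                                   &lp->databuffer_paddr,
                                                   GFP_KERNEL);
        return lp->databuffer_vaddr ? 0 : -ENOMEM;
}

static int databuffer_mmap_dma(struct file *filp, struct vm_area_struct *vma)
{
        struct databuffer_local *lp = filp->private_data;
        size_t size = vma->vm_end - vma->vm_start;

        /* Let the DMA layer pick page protections that match the allocation. */
        return dma_mmap_coherent(lp->dev, vma, lp->databuffer_vaddr,
                                 lp->databuffer_paddr, size);
}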

> Even that's a bit weird, and I'd generally expect to have a
>kernel driver to manage non-coherent DMA like this (rather than userspace
>having the buffer and pointing the kernel at it).

I am not sure I follow exactly what you mean by this, but I am assuming it means not using mmap.
We have a few different use cases for the "IQ Capture" feature, which generally fall into "block capture"
and "streaming". For block captures we can capture up to 2 GiB at a very high rate (12.8 Gb/s).
Since the rates are so high, software can't keep up with the data rate, so we need a huge buffer. And
since the buffer is so big we can't keep copies of it, we jump through a bunch of hoops for "zero copy".

For the "streaming" use case we want to continuously write the data somewhere, either the network, or 
a file (E.G. on an external USB SSD). In this case, if the sample rate is higher than we can copy out 
we have to start dropping data. We specify the maximum sample rate we can handle without dropping data
which is why the copy rate is critical to us. For this use case I can see how the non-coherent dma interfaces
would be useful for this, but I did not want to have 2 different ways of using the memory, because I was not 
confident I could get it right. I was also unclear on how the full data path to files or network would look. 
Also, all the controls of the FPGA and RF hardware are done from a separate processor on the SoC (ARM r5) 
running bare metal code. Having a static mmap greatly simplifies our design. 

I also experimented with the sendfile and copy_file_range system calls to avoid the userspace copies
altogether, but could not get anything working, since they seem to only work with regular files and not
device drivers. I am not sure if this is a dead end or worth looking into further.
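
For what it's worth, the experiment was along these lines (a sketch; the "/dev/databuffer-device"
path and output file are hypothetical, and the comment just reflects what I observed rather than a
definitive statement about the API):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int src = open("/dev/databuffer-device", O_RDONLY);
        int dst = open("/tmp/capture.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (src < 0 || dst < 0) {
                perror("open");
                return 1;
        }

        /* copy_file_range() seems to only work when both ends are regular
         * files, so this fails for a character device source like ours. */
        ssize_t n = copy_file_range(src, NULL, dst, NULL, 1 << 20, 0);
        if (n < 0)
                fprintf(stderr, "copy_file_range: %s\n", strerror(errno));
        else
                printf("copied %zd bytes\n", n);

        close(src);
        close(dst);
        return 0;
}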

>
>Robin might have thoughts on how to handle the non-coherent DMA.
>
>Thanks,
>Mark.
>
>> IIRC I mostly followed this wiki when setting things up
>> https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18841683/Linux+Reserved+Memory
>> 
>> I think the relevant parts are from the dts (note we do it twice, because we have some
>> usages that also need to be accessed by other CPUs on the SoC, which have address
>> space restrictions):
>>  
>> reserved-memory {
>>     #address-cells = <2>;
>>     #size-cells = <2>;
>>     ranges;
>> 
>>     iq_capture: fpga_mem@1 {
>>         compatible = "shared-dma-pool";
>>         no-map;
>>         reg = <0x0 0x70000000 0x0 0x10000000>;
>>     };
>>     big_iq_capture: fpga_mem@2 {
>>         compatible = "shared-dma-pool";
>>         no-map;
>>         reg = <0x8 0x0 0x0 0x80000000>;
>>     };
>> };
>> 
>> 
>> anritsu-databuffer@0 {
>>     compatible = "anritsu,databuffer";
>>     memory-region = <&iq_capture>;
>>     device-name = "databuffer-device";
>> };
>> anritsu-databuffer@1 {
>>     compatible = "anritsu,databuffer";
>>     memory-region = <&big_iq_capture>;
>>     device-name = "capturebuffer-device";
>> };
>> 
>> The databuffer driver is something we made and generally build out of tree,
>> but I put it in-tree on our GitHub if you want to look at it. I have not actually
>> tried to build it in-tree yet, so I could have made some mistakes with the Makefile
>> or something. Here is a link to where the ioremap is:
>> 
>> https://github.com/Anritsu/linux-xlnx/blob/intree_databuffer_driver/drivers/char/databuffer_driver.c#L242
>> 
>> Despite doing my best to read the documentation, I was never really sure if I got the 
>> memory mapping right for our use case. 
>> 
>> 
>> If you are interested in context, the use case is in spectrum analyzers.
>> https://www.anritsu.com/en-us/test-measurement/products/ms2090a
>> The feature is IQ capture, which, if you are not familiar with spectrum analyzers,
>> is basically taking the data from a high-speed ADC and storing it as fast
>> as possible. Since the FPGA writing the data is clocked to the ADC, the rates
>> we can stream out without losing any data depend on how fast we can copy the
>> data from memory to either the network or a file, which is why this performance
>> is important to us. I think we should probably be using scatter/gather for this,
>> but I could not convince the FPGA engineers to implement it (and it sounded hard,
>> so I did not try very hard to convince them).
>> 
>> Thanks for the help,
>> Austin

Thanks again for the help,
Austin

