Slowdown copying data between kernel versions 4.19 and 5.15

Fri Jun 30 04:15:02 PDT 2023

On Thu, Jun 29, 2023 at 07:33:39PM +0000, Havens, Austin wrote:
> Hi Mark,
> Thanks for the reply.

No problem; thanks for the info here!

>  On Thursday, June 29, 2023 7:74 AM Mark Rutland wrote:
> > On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:

> >> Profiling with the hacked __arch_copy_from_user 
> >> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> >> 
> >>  Performance counter stats for '/mnt/usrroot/test_copy':
> >> 
> >>           11822342      instructions              #    0.23  insn per cycle         
> >>           50689594      cycles                                                      
> >>           37627922      ld_dep_stall                                                
> >>              17933      read_alloc                                                  
> >>               3421      dTLB-load-misses                                            
> >> 
> >>        0.043440253 seconds time elapsed
> >> 
> >>        0.004382000 seconds user
> >>        0.039442000 seconds sys
> >> 
> >> Unfortunately the hack crashes in other cases so it is not a viable solution
> >> for us. Also, on our actual workload there is still a small difference in
> >> performance remaining that I have not tracked down yet (I am guessing it has
> >> to do with the dTLB-load-misses remaining higher). 
> >> 
> >> Note, I think that the slow down is only noticeable in cases like ours where
> >> the data being copied from is not in cache (for us, because the FPGA writes
> >> it).
> >
> > When you say "is not in cache", what exactly do you mean? If this were just the
> > latency of filling a cache I wouldn't expect the size of the first access to
> > make a difference, so I'm assuming the source buffer is not mapped with
> > cacheable memory attributes, which we generally assume.
> >
> > Which memory attributes are the source and destination buffers mapped with? Is
> > that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?
> >
> > I'm assuming this is with some out-of-tree driver; if that's in a public tree
> > could you please provide a pointer to it?
> >
> > Thanks,
> > Mark.
> 
> I am actually not 100% clear on how the memory gets mapped. Currently we call 
> ioremap in our driver, so I think that should map it as iomem. When I removed 
> that or used /dev/mem, nothing changed, and looking at things now I think that 
> is because the original mapping is from drivers/of/of_reserved_mem.c

A plain ioremap() will give you Device memory attributes, which
copy_{to,from}_user() aren't supposed to be used with. Device attributes also
forbid the CPU from doing a bunch of things (e.g. gathering and prefetching),
which makes this slow.

If it's possible to use Normal Non-Cacheable instead (e.g. by using
ioremap_wc()), that will likely be faster since it permits gathering and
prefetching, etc. Even that's a bit weird, and I'd generally expect to have a
kernel driver to manage non-coherent DMA like this (rather than userspace
having the buffer and pointing the kernel at it).
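
To compare mappings (e.g. ioremap() vs ioremap_wc()) from userspace, a small
read() benchmark is enough. The following is only a sketch: the test_copy
program is not shown in this thread, so timed_read() and its parameters are
hypothetical names, and it assumes the driver exposes the buffer through a
character device that supports read().

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): read up to `total` bytes from
 * `path` using read() calls of at most `chunk` bytes each.
 *
 * Returns the number of bytes actually read, or -1 if the buffer could not
 * be allocated or the path could not be opened. On success the elapsed
 * wall-clock time of the read loop is stored in *secs.
 */
static long long timed_read(const char *path, size_t chunk, long long total,
                            double *secs)
{
    struct timespec t0, t1;
    long long done = 0;
    char *buf = malloc(chunk);
    int fd = open(path, O_RDONLY);

    if (!buf || fd < 0) {
        free(buf);
        if (fd >= 0)
            close(fd);
        return -1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (done < total) {
        /* Never ask for more than what is left of the requested total. */
        size_t want = (total - done) < (long long)chunk
                          ? (size_t)(total - done)
                          : chunk;
        ssize_t n = read(fd, buf, want);

        if (n <= 0) /* EOF or error: stop and report what we got */
            break;
        done += n;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *secs = (double)(t1.tv_sec - t0.tv_sec) +
            (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;

    close(fd);
    free(buf);
    return done;
}
```

Calling timed_read() on the device node and dividing the returned byte count
by *secs gives a throughput figure that can be compared across kernels or
mappings, alongside the perf stat counters quoted above.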

Robin might have thoughts on how to handle the non-coherent DMA.

Thanks,
Mark.

> IIRC I mostly followed this wiki when setting things up
> https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18841683/Linux+Reserved+Memory
> 
> I think the relevant parts are from the dts (note we do it 2x, because we have some
>  usages that also need to be accessed by other CPUs on the SoC which have address 
>  space restrictions)
>  
> reserved-memory {
> #address-cells = <2>;
> #size-cells = <2>;
> ranges;
> 
> iq_capture: fpga_mem@1 {
>     compatible = "shared-dma-pool";
>     no-map;            
>     reg = <0x0 0x70000000 0x0 0x10000000>;
> };
> big_iq_capture: fpga_mem@2 {
>     compatible = "shared-dma-pool";
>     no-map;
>     reg = <0x8 0x0 0x0 0x80000000>;
> };
> };
> 
> 
> anritsu-databuffer@0 {
>  compatible = "anritsu,databuffer";
>  memory-region = <&iq_capture>;
>  device-name = "databuffer-device";
> };
> anritsu-databuffer@1 {
> compatible = "anritsu,databuffer";
> memory-region = <&big_iq_capture>;
> device-name = "capturebuffer-device";
> };
> 
> The databuffer driver is something we made and generally build out of tree,
> but I put it in tree on our github if you want to look at it. I have not actually
> tried to build it in-tree yet, so I could have made some mistakes with the Makefile
> or something. Here is a link to where the ioremap is. 
> 
> https://github.com/Anritsu/linux-xlnx/blob/intree_databuffer_driver/drivers/char/databuffer_driver.c#L242
> 
> Despite doing my best to read the documentation, I was never really sure if I got the 
> memory mapping right for our use case. 
> 
> 
> If you are interested in context, the use case is in spectrum analyzers.
> https://www.anritsu.com/en-us/test-measurement/products/ms2090a
> The feature is IQ capture, which if you are not familiar with Spectrum Analyzers, 
> is basically trying to take the data from a high-speed ADC and store it as fast 
> as possible. Since the FPGA writing the data is clocked to the ADC, the rates 
> we can stream out without losing any data depend on how fast we can copy the 
> data from memory to either the network or a file, which is why this performance
> is important to us. I think we should probably be using scatter/gather for this,
> but I could not convince the FPGA engineers to implement it (and it sounded hard
> so I did not try very hard to convince them).  
> 
> Thanks for the help,
> Austin