Slowdown copying data between kernel versions 4.19 and 5.15
Mark Rutland
mark.rutland at arm.com
Fri Jun 30 04:15:02 PDT 2023
On Thu, Jun 29, 2023 at 07:33:39PM +0000, Havens, Austin wrote:
> Hi Mark,
> Thanks for the reply.
No problem; thanks for the info here!
> On Thursday, June 29, 2023 7:74 AM Mark Rutland wrote:
> > On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
> >> Profiling with the hacked __arch_copy_from_user
> >> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> >>
> >> Performance counter stats for '/mnt/usrroot/test_copy':
> >>
> >> 11822342 instructions # 0.23 insn per cycle
> >> 50689594 cycles
> >> 37627922 ld_dep_stall
> >> 17933 read_alloc
> >> 3421 dTLB-load-misses
> >>
> >> 0.043440253 seconds time elapsed
> >>
> >> 0.004382000 seconds user
> >> 0.039442000 seconds sys
> >>
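For readers who want to check the ratios implied by the counters quoted above, here is a minimal sketch in C. The helper name counter_ratio is hypothetical and appears nowhere in the thread; the input values are copied from the perf output above. Reading the second ratio as "fraction of cycles stalled" assumes that ld_dep_stall counts stalled cycles, which the thread does not confirm.

```c
#include <stdint.h>

/*
 * Ratio of two raw perf counter values, e.g. instructions / cycles.
 * Hypothetical helper for illustration only; returns 0.0 when the
 * denominator is zero so that the caller never divides by zero.
 */
double counter_ratio(uint64_t numerator, uint64_t denominator)
{
    if (denominator == 0)
        return 0.0;
    return (double)numerator / (double)denominator;
}
```

counter_ratio(11822342, 50689594) comes out at roughly 0.23, which agrees with the "insn per cycle" figure perf printed. counter_ratio(37627922, 50689594) comes out at roughly 0.74, so under the assumption above about three quarters of the cycles in this run were attributed to load-dependency stalls.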
> >> Unfortunately the hack crashes in other cases so it is not a viable solution
> >> for us. Also, on our actual workload there is still a small difference in
> >> performance remaining that I have not tracked down yet (I am guessing it has
> >> to do with the dTLB-load-misses remaining higher).
> >>
> >> Note, I think that the slow down is only noticeable in cases like ours where
> >> the data being copied from is not in cache (for us, because the FPGA writes
> >> it).
> >
> > When you say "is not in cache", what exactly do you mean? If this were just the
> > latency of filling a cache I wouldn't expect the size of the first access to
> > make a difference, so I'm assuming the source buffer is not mapped with
> > cacheable memory attributes, which we generally assume.
> >
> > Which memory attributes are the source and destination buffers mapped with? Is
> > that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?
> >
> > I'm assuming this is with some out-of-tree driver; if that's in a public tree
> > could you please provide a pointer to it?
> >
> > Thanks,
> > Mark.
>
> I am actually not 100% clear on how the memory gets mapped. Currently we call
> ioremap in our driver, so I think that should map it as iomem. When I removed
> that or used /dev/mem, nothing changed, and looking at things now I think that
> is because the original mapping is from drivers/of/of_reserved_mem.c
A plain ioremap() will give you Device memory attributes, which
copy_{to,from}_user() aren't supposed to be used with, and which also forbid
the CPU from doing a bunch of things (e.g. gathering and prefetching), which
makes this slow.
If it's possible to use Normal Non-Cacheable instead (e.g. by using
ioremap_wc()), that will likely be faster since it permits gathering and
prefetching, etc. Even that's a bit weird, and I'd generally expect to have a
kernel driver to manage non-coherent DMA like this (rather than userspace
having the buffer and pointing the kernel at it).
Robin might have thoughts on how to handle the non-coherent DMA.
Thanks,
Mark.
> IIRC I mostly followed this wiki when setting things up
> https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18841683/Linux+Reserved+Memory
>
> I think the relevant parts are from the dts (note we do it 2x, because we have some
> usages that also need to be accessed by other CPUs on the SoC which have address
> space restrictions)
>
> reserved-memory {
>     #address-cells = <2>;
>     #size-cells = <2>;
>     ranges;
>
>     iq_capture: fpga_mem@1 {
>         compatible = "shared-dma-pool";
>         no-map;
>         reg = <0x0 0x70000000 0x0 0x10000000>;
>     };
>     big_iq_capture: fpga_mem@2 {
>         compatible = "shared-dma-pool";
>         no-map;
>         reg = <0x8 0x0 0x0 0x80000000>;
>     };
> };
>
>
> anritsu-databuffer@0 {
>     compatible = "anritsu,databuffer";
>     memory-region = <&iq_capture>;
>     device-name = "databuffer-device";
> };
> anritsu-databuffer@1 {
>     compatible = "anritsu,databuffer";
>     memory-region = <&big_iq_capture>;
>     device-name = "capturebuffer-device";
> };
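As an aside on reading the reg properties quoted above: with #address-cells = <2> and #size-cells = <2>, each reg entry is four 32-bit cells, two for the base address and two for the size, most significant cell first. The following is a minimal sketch in C of that decoding. The names dt_cells_to_u64, dt_region and dt_decode_reg are hypothetical and are not kernel APIs; in a real driver the kernel's own OF helpers would do this parsing.

```c
#include <stdint.h>

/* A decoded reserved-memory region (illustrative type, not a kernel struct). */
struct dt_region {
    uint64_t base;
    uint64_t size;
};

/* Combine two 32-bit devicetree cells (high, then low) into one 64-bit value. */
uint64_t dt_cells_to_u64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}

/*
 * Decode one reg entry, assuming #address-cells = <2> and
 * #size-cells = <2>, i.e. the layout <base_hi base_lo size_hi size_lo>.
 */
struct dt_region dt_decode_reg(const uint32_t reg[4])
{
    struct dt_region region;

    region.base = dt_cells_to_u64(reg[0], reg[1]);
    region.size = dt_cells_to_u64(reg[2], reg[3]);
    return region;
}
```

Applied to the two nodes above, <0x0 0x70000000 0x0 0x10000000> decodes to a 256 MiB region at physical address 0x70000000, and <0x8 0x0 0x0 0x80000000> decodes to a 2 GiB region at 0x800000000, which lies above the 32-bit address range.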
>
> The databuffer driver is something we made and generally build out of tree,
> but I put it in tree on our github if you want to look at it. I have not actually
> tried to build it in-tree yet, so I could have made some mistakes with the Makefile
> or something. Here is a link to where the ioremap is.
>
> https://github.com/Anritsu/linux-xlnx/blob/intree_databuffer_driver/drivers/char/databuffer_driver.c#L242
>
> Despite doing my best to read the documentation, I was never really sure if I got the
> memory mapping right for our use case.
>
>
> If you are interested in context, the use case is in spectrum analyzers.
> https://www.anritsu.com/en-us/test-measurement/products/ms2090a
> The feature is IQ capture, which if you are not familiar with Spectrum Analyzers,
> is basically trying to take the data from a high-speed ADC and store it as fast
> as possible. Since the FPGA writing the data is clocked to the ADC, the rates
> we can stream out without losing any data depend on how fast we can copy the
> data from memory to either the network or a file, which is why this performance
> is important to us. I think we should probably be using scatter/gather for this,
> but I could not convince the FPGA engineers to implement it (and it sounded hard
> so I did not try very hard to convince them).
>
> Thanks for the help,
> Austin