Slowdown copying data between kernel versions 4.19 and 5.15
Havens, Austin
austin.havens at anritsu.com
Fri Jun 23 14:30:06 PDT 2023
Hi all,
While updating our kernel from 4.19 to 5.15 we noticed a slowdown when copying data. We are using ZynqMP 9EG SoCs and are basically following the Xilinx/AMD release branches (though a bit behind). I did some sample-based profiling with perf, and it showed that a lot of the time was spent in __arch_copy_from_user. Since the amount of data being copied is the same, it looks like more time is being spent in each __arch_copy_from_user call.
I made a test program to replicate the issue; here is what I see (I used the same binary on both versions to rule out compiler differences).
root at smudge:/tmp# uname -a
Linux smudge 4.19.0-xilinx-v2019.1 #1 SMP PREEMPT Thu May 18 04:01:27 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
root at smudge:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
Performance counter stats for '/mnt/usrroot/test_copy':
13202623 instructions # 0.25 insn per cycle
52947780 cycles
37588761 ld_dep_stall
16301 read_alloc
1660 dTLB-load-misses
0.044990363 seconds time elapsed
0.004092000 seconds user
0.040920000 seconds sys
root at ahraptor:/tmp# uname -a
Linux ahraptor 5.15.36-xilinx-v2022.1 #1 SMP PREEMPT Mon Apr 10 22:46:16 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
root at ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
Performance counter stats for '/mnt/usrroot/test_copy':
11625888 instructions # 0.14 insn per cycle
83135040 cycles
69833562 ld_dep_stall
27948 read_alloc
3367 dTLB-load-misses
0.070537894 seconds time elapsed
0.004165000 seconds user
0.066643000 seconds sys
After some investigation, my guess is that the issue is either in the iov_iter iteration changes (around https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or in the lower-level changes in arch/arm64/lib/copy_from_user.S, but I am pretty far out of my depth, so this is just speculation.
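To try to separate the raw copy cost from the ofstream/file-write path, something like the rough sketch below might help (this is just an illustrative harness, not part of the test program; the /dev/mem offset and size match the program further down, and the output file name is arbitrary). It times a pure userspace memcpy of the mapping against a single write(2) of the same region, which should still go through copy_from_user but without any ofstream buffering in between:

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
    constexpr size_t size = 4096 * 1000;   // same amount of data as the test program
    int mem = ::open("/dev/mem", O_RDWR | O_SYNC);
    if (mem < 0)
        return 1;
    char* src = (char*)mmap(nullptr, size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, mem, 0x800000000);
    if (src == MAP_FAILED)
        return 1;

    // Pure userspace copy of the mapping (no kernel copy involved).
    char* dst = (char*)std::malloc(size);
    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst, src, size);
    auto t1 = std::chrono::steady_clock::now();

    // Single write(2) of the same region straight from the mapping.
    int out = ::open("timing_test.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    ssize_t written = ::write(out, src, size);
    auto t2 = std::chrono::steady_clock::now();

    std::printf("memcpy: %lld us, write: %lld us (%zd bytes)\n",
                (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count(),
                written);

    ::close(out);
    std::free(dst);
    munmap(src, size);
    ::close(mem);
    return 0;
}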
Here is the C++ code for the test program (compiled with g++ -O3). Note that in our products the FPGA writes into a dedicated memory carveout, which is what the /dev/mem mmap points at; you would have to point that somewhere else to run it on another system (see the note after the listing).
#include <iostream>
#include <memory>
#include <fstream>
#include <vector>
#include <string>
#include <algorithm>
#include <cstdint>
#include <fcntl.h>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

using namespace std;
struct CaptureChunk
{
    uint32_t partitionOffset, captureSizeInBytes;
};

static constexpr uint32_t MAX_WRITE_BYTES = (4096 * 256); // 1 MiB per write
constexpr size_t copySize = 4096 * 1000;                   // 4,096,000 bytes copied in total
constexpr size_t databufferSize = copySize;

// Stub in this test program.
void updateTotalBytes(uint32_t bytesWritten)
{
}
void writeChunkToFile(const char* data, const CaptureChunk& chunk, const std::string& filePath)
{
    std::ofstream file;
    // The default buffer size seems to be very large and has to be written on close.
    // This would cause aborting the save to take several minutes (RAP-6926).
    // I don't think we want it completely unbuffered either, so we have to choose
    // a size. The page size is probably a pretty good bet for a good buffer size.
    // We could get it with sysconf(_SC_PAGESIZE), but I am just going to use 4096
    // directly since that is almost always what it is, and that way we won't have
    // to change it on Windows, which does not have sysconf.
    long sz = 4096;
    std::vector<char> buffer;
    buffer.resize(sz);
    file.rdbuf()->pubsetbuf(buffer.data(), sz);
    file.open(filePath.c_str(), std::ofstream::out | std::ofstream::binary);

    uint32_t readOffset = chunk.partitionOffset;
    uint32_t bytesRemainingToBeWritten = chunk.captureSizeInBytes;
    while (bytesRemainingToBeWritten > 0)
    {
        uint32_t bytesToWrite = std::min(MAX_WRITE_BYTES, bytesRemainingToBeWritten);
        if (readOffset + bytesToWrite > databufferSize)
        {
            bytesToWrite = databufferSize - readOffset;
        }
        file.write(data + readOffset, bytesToWrite);
        if (file.fail())
        {
            cout << "failed to write " << filePath;
            break;
        }
        updateTotalBytes(bytesToWrite);
        bytesRemainingToBeWritten -= bytesToWrite;
        readOffset += bytesToWrite;
        if (readOffset == databufferSize)
        {
            readOffset = 0; // wrap around
        }
    }
    file.close();
}
void* getBuffer(int32_t& device)
{
    size_t size = databufferSize;
    device = ::open("/dev/mem", O_RDWR | O_SYNC);
    if (device < 0)
    {
        cout << "could not open /dev/mem ";
    }
    // The databuffer driver should take care of getting the physical address
    void* buffer = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, device, 0x800000000);
    if (buffer == MAP_FAILED)
    {
        ::close(device);
        device = -1;
        buffer = nullptr;
        cout << "could not mmap /dev/mem ";
    }
    return buffer;
}
int main()
{
    std::string fileName = "test_file.bin";
    CaptureChunk testChunk {.partitionOffset = 0, .captureSizeInBytes = copySize};
    int32_t device;
    char* buffer = (char*)getBuffer(device);

#ifdef copy_buffer
    // Stage the data through a plain malloc'd buffer first, then write that out.
    char* copyBuffer = (char*)std::malloc(copySize);
    std::memcpy(copyBuffer, buffer, copySize);
    writeChunkToFile(copyBuffer, testChunk, fileName);
#else
    // Write directly from the /dev/mem mapping.
    writeChunkToFile(buffer, testChunk, fileName);
#endif

    ::close(device);
    return 0;
}
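If you don't have a carveout at 0x800000000, the simplest change is probably to swap getBuffer() for something that hands back an anonymous mapping instead, roughly like the sketch below (just for illustration; it is ordinary cached memory rather than the O_SYNC /dev/mem mapping, so the absolute numbers will not match, but it at least makes the program runnable elsewhere). main() would then call getBufferAnonymous() instead of getBuffer().

// Illustrative fallback for systems without the memory carveout.
void* getBufferAnonymous(int32_t& device)
{
    device = -1; // no /dev/mem fd in this variant
    void* buffer = mmap(nullptr, databufferSize, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buffer == MAP_FAILED)
    {
        cout << "could not mmap anonymous buffer ";
        return nullptr;
    }
    std::memset(buffer, 0xA5, databufferSize); // touch every page up front
    return buffer;
}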
Any help will be greatly appreciated.
-Austin