[PATCH bpf-next 4/5] selftests/bpf: Add benchmark for bpf_csum_diff() helper
Puranjay Mohan
puranjay at kernel.org
Tue Oct 22 03:21:43 PDT 2024
Andrii Nakryiko <andrii.nakryiko at gmail.com> writes:
> On Mon, Oct 21, 2024 at 5:22 AM Puranjay Mohan <puranjay at kernel.org> wrote:
>>
>> Add a microbenchmark for bpf_csum_diff() helper. This benchmark works by
>> filling a 4KB buffer with random data and calculating the internet
>> checksum on different parts of this buffer using bpf_csum_diff().
>>
>> Example run using ./benchs/run_bench_csum_diff.sh on x86_64:
>>
>> [bpf]$ ./benchs/run_bench_csum_diff.sh
>> 4 2.296 ± 0.066M/s (drops 0.000 ± 0.000M/s)
>> 8 2.320 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> 16 2.315 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 20 2.318 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 32 2.308 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>> 40 2.300 ± 0.029M/s (drops 0.000 ± 0.000M/s)
>> 64 2.286 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 128 2.250 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 256 2.173 ± 0.001M/s (drops 0.000 ± 0.000M/s)
>> 512 2.023 ± 0.055M/s (drops 0.000 ± 0.000M/s)
>
> you are not benchmarking bpf_csum_diff(), you are benchmarking how
> often you can call bpf_prog_test_run(). Add some batching on the BPF
> side, these numbers tell you that there is no difference between
> calculating checksum for 4 bytes and for 512, that didn't seem strange
> to you?
This didn't seem strange to me: the tables I added to the cover letter
show a clear improvement after optimizing the helper, and arm64 still
shows a roughly linear drop going from 4 bytes to 512 bytes even after
the optimization.
On x86, 4 bytes and 512 bytes show similar numbers after the
improvement, but a small drop is still visible going from 4 to 512
bytes.
My thinking was that, because bpf_csum_diff() calls csum_partial() on
x86, which is already well optimised, most of the overhead came from
copying the buffer, and that copy is now removed.
I guess I can amplify the difference between 4B and 512B by calling
bpf_csum_diff() multiple times in a loop on the BPF side, or by
calculating the checksum over more slices of the buffer (the BPF code
currently divides it into only 2 parts).
>>
>> Signed-off-by: Puranjay Mohan <puranjay at kernel.org>
>> ---
>> tools/testing/selftests/bpf/Makefile | 2 +
>> tools/testing/selftests/bpf/bench.c | 4 +
>> .../selftests/bpf/benchs/bench_csum_diff.c | 164 ++++++++++++++++++
>> .../bpf/benchs/run_bench_csum_diff.sh | 10 ++
>> .../selftests/bpf/progs/csum_diff_bench.c | 25 +++
>> 5 files changed, 205 insertions(+)
>> create mode 100644 tools/testing/selftests/bpf/benchs/bench_csum_diff.c
>> create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_csum_diff.sh
>> create mode 100644 tools/testing/selftests/bpf/progs/csum_diff_bench.c
>>
>
> [...]
>
>> +
>> +static void csum_diff_setup(void)
>> +{
>> + int err;
>> + char *buff;
>> + size_t i, sz;
>> +
>> + sz = sizeof(ctx.skel->rodata->buff);
>> +
>> + setup_libbpf();
>> +
>> + ctx.skel = csum_diff_bench__open();
>> + if (!ctx.skel) {
>> + fprintf(stderr, "failed to open skeleton\n");
>> + exit(1);
>> + }
>> +
>> + srandom(time(NULL));
>> + buff = ctx.skel->rodata->buff;
>> +
>> + /*
>> + * Set first 8 bytes of buffer to 0xdeadbeefdeadbeef, this is later used to verify the
>> + * correctness of the helper by comparing the checksum result for 0xdeadbeefdeadbeef that
>> + * should be 0x3b3b
>> + */
>> +
>> + *(u64 *)buff = 0xdeadbeefdeadbeef;
>> +
>> + for (i = 8; i < sz; i++)
>> + buff[i] = '1' + random() % 9;
>
> so, you only generate 9 different values for bytes, why? Why not full
> byte range?
Thanks for catching this; there is no reason to restrict the bytes to
the nine values '1'..'9'. I will use the full byte range in the next
version.
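The fix is a one-line change in the fill loop; a sketch of the setup
(using a hypothetical local csum16() helper, not the kernel one) also
shows that the fixed 0xdeadbeefdeadbeef prefix still folds to 0x3b3b
regardless of the random tail:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUFF_SZ 4096

/* RFC 1071 one's-complement sum over 16-bit words, folded to 16 bits. */
static uint16_t csum16(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)p[i] << 8 | p[i + 1];
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

static void fill_buff(uint8_t *buff)
{
	/*
	 * Fixed 8-byte prefix used as a correctness check.  The pattern
	 * is byte-order symmetric, so it folds to 0x3b3b on both
	 * little- and big-endian hosts.
	 */
	const uint8_t magic[8] = { 0xde, 0xad, 0xbe, 0xef,
				   0xde, 0xad, 0xbe, 0xef };

	memcpy(buff, magic, sizeof(magic));

	srandom(time(NULL));
	/* Full byte range instead of '1' + random() % 9. */
	for (size_t i = sizeof(magic); i < BUFF_SZ; i++)
		buff[i] = random() % 256;
}
```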
Thanks,
Puranjay