WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342
Kumar Kartikeya Dwivedi
memxor at gmail.com
Sat Aug 26 15:49:11 PDT 2023
On Sat, 26 Aug 2023 at 03:01, Andrii Nakryiko <andrii.nakryiko at gmail.com> wrote:
>
> On Fri, Aug 25, 2023 at 12:49 PM Alexei Starovoitov
> <alexei.starovoitov at gmail.com> wrote:
> >
> > On Fri, Aug 25, 2023 at 11:53 AM Alexei Starovoitov
> > <alexei.starovoitov at gmail.com> wrote:
> > >
> > > On Fri, Aug 25, 2023 at 8:28 AM Yonghong Song <yonghong.song at linux.dev> wrote:
> > > >
> > > >
> > > >
> > > > On 8/25/23 3:32 AM, Björn Töpel wrote:
> > > > > I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
> > > > > selftests on bpf-next 9e3b47abeb8f.
> > > > >
> > > > > I'm able to reproduce the hang by multiple runs of:
> > > > > | ./test_progs -a link_api -a linked_list
> > > > > I'm currently investigating that.
> > > > >
> > > > > But! Sometimes (every blue moon) I get a warn_on_once hit:
> > > > > | ------------[ cut here ]------------
> > > > > | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 bpf_mem_refill+0x1fc/0x206
> > > > > | Modules linked in: bpf_testmod(OE)
> > > > > | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G OE N 6.5.0-rc5-01743-gdcb152bb8328 #2
> > > > > | Hardware name: riscv-virtio,qemu (DT)
> > > > > | epc : bpf_mem_refill+0x1fc/0x206
> > > > > | ra : irq_work_single+0x68/0x70
> > > > > | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20
> > > > > | gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600
> > > > > | t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70
> > > > > | s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000
> > > > > | a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060
> > > > > | a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049
> > > > > | s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000
> > > > > | s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30
> > > > > | s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff
> > > > > | s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000
> > > > > | t5 : ff6000008fd28278 t6 : 0000000000040000
> > > > > | status: 0000000200000100 badaddr: 0000000000000000 cause: 0000000000000003
> > > > > | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
> > > > > | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
> > > > > | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
> > > > > | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
> > > > > | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
> > > > > | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
> > > > > | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
> > > > > | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
> > > > > | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
> > > > > | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
> > > > > | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
> > > > > | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
> > > > > | [<ffffffff812b6904>] do_irq+0x66/0x98
> > > > > | ---[ end trace 0000000000000000 ]---
> > > > >
> > > > > Code:
> > > > > | static void free_bulk(struct bpf_mem_cache *c)
> > > > > | {
> > > > > | struct bpf_mem_cache *tgt = c->tgt;
> > > > > | struct llist_node *llnode, *t;
> > > > > | unsigned long flags;
> > > > > | int cnt;
> > > > > |
> > > > > | WARN_ON_ONCE(tgt->unit_size != c->unit_size);
> > > > > | ...
> > > > >
> > > > > I'm not well versed in the memory allocator; Before I dive into it --
> > > > > has anyone else hit it? Ideas on why the warn_on_once is hit?
> > > >
> > > > Maybe take a look at the patch
> > > > 822fb26bdb55 bpf: Add a hint to allocated objects.
> > > >
> > > > In the above patch, we have
> > > >
> > > > + /*
> > > > + * Remember bpf_mem_cache that allocated this object.
> > > > + * The hint is not accurate.
> > > > + */
> > > > + c->tgt = *(struct bpf_mem_cache **)llnode;
> > > >
> > > > I suspect that the warning may be related to the above.
> > > > I tried the above ./test_progs command line (running multiple
> > > > at the same time) and didn't trigger the issue.
> > >
> > > Interesting. Thanks for the report.
> > > I wasn't able to repo the warn either, but I can confirm that linked_list
> > > test is super slow on the kernel with debugs on and may appear to "hang",
> > > since it doesn't respond to key press for many seconds.
> > >
> > > time ./test_progs -a linked_list
> > > Summary: 1/132 PASSED, 0 SKIPPED, 0 FAILED
> > >
> > > real 0m35.897s
> > > user 0m1.860s
> > > sys 0m32.105s
> > >
> > > That's not normal.
> > >
> > > Kumar, Dave,
> > >
> > > do you have any suggestions?
> >
> > Andrii,
> >
> > this issue seems to be a scalability issue with libbpf.
> > The perf report looks like:
> >
> > 9.89% test_progs [kernel.kallsyms] [k] __lock_acquire
> > 7.70% test_progs [kernel.kallsyms] [k] check_preemption_disabled
> > 2.61% test_progs [kernel.kallsyms] [k] asm_exc_page_fault
> > 2.32% test_progs [kernel.kallsyms] [k] rcu_is_watching
> > 2.30% test_progs [kernel.kallsyms] [k] memcpy_orig
> > 2.26% test_progs test_progs [.] btf_find_by_name_kind
> > 2.18% test_progs libc-2.28.so [.] __strcmp_avx2
> > 2.14% test_progs [kernel.kallsyms] [k] lock_acquire
> > 2.10% test_progs [kernel.kallsyms] [k] mark_lock.part.48
> > 2.07% test_progs test_progs [.] btf_kind
> > 2.05% test_progs [kernel.kallsyms] [k] clear_page_erms
> > 1.89% test_progs test_progs [.] btf_type_by_id
> > 1.83% test_progs [kernel.kallsyms] [k] lock_is_held_type
> > 1.75% test_progs [kernel.kallsyms] [k] lock_release
> > 1.31% test_progs [kernel.kallsyms] [k] unwind_next_frame
> > 1.29% test_progs test_progs [.] btf__type_by_id
> > 1.28% test_progs [kernel.kallsyms] [k] rep_movs_alternative
> > 1.09% test_progs [kernel.kallsyms] [k] __orc_find
> > 1.06% swapper [kernel.kallsyms] [k] __lock_acquire
> > 1.02% test_progs test_progs [.] btf_type_size
> > 0.84% test_progs test_progs [.] btf_parse_type_sec
> > 0.78% test_progs [kernel.kallsyms] [k] __create_object
> > 0.76% test_progs [kernel.kallsyms] [k] __lookup_object
> > 0.75% test_progs test_progs [.] btf__str_by_offset
> >
> > The top 5 functions coming from kernel due to libbpf
> > reading /sys/kernel/btf/vmlinux
> > Then libbpf proceeded to do a search in btf.
>
> This test loads 132 skeletons/BPF objects sequentially. Each BPF
> object load should cause exactly one /sys/kernel/btf/vmlinux load. Do
> you see multiple vmlinux reads per BPF object?
>
> We then do a linear search for each ksym, so depending on the amount
> of ksyms, that can scale badly. But linked_list.bpf.o and
> linked_list_fail.bpf.o have only a few ksyms:
>
> [72] DATASEC '.ksyms' size=0 vlen=6
> type_id=33 offset=0 size=0 (FUNC 'bpf_obj_new_impl')
> type_id=37 offset=0 size=0 (FUNC 'bpf_list_pop_front')
> type_id=39 offset=0 size=0 (FUNC 'bpf_obj_drop_impl')
> type_id=41 offset=0 size=0 (FUNC 'bpf_list_pop_back')
> type_id=43 offset=0 size=0 (FUNC 'bpf_list_push_front_impl')
> type_id=51 offset=0 size=0 (FUNC 'bpf_list_push_back_impl')
>
> We can improve this part by either pre-building hashmap index for BTF
> types or sorting all ksyms and then doing one linear pass over vmlinux
> BTF, binary searching. But so far it wasn't a big bottleneck in
> practice.
>
> And running this linked_list test on production kernel, it's not that
> slow, considering how much work it's doing:
>
> sudo ./test_progs -a linked_list 1.54s user 0.58s system 97% cpu 2.178 total
>
>
> Also, we are building sefltests+libbpf in debug mode. So I have worse
> news, with my BTF validation patch applied, this test becomes
> extremely slow, because all those 100+ vmlinux BTF parsings are now
> also doing a bunch of validation, and none of the added small static
> functions are inlined in debug mode. I need to look into this a bit
> more and see if I can mitigate this without switching all selftests to
> release mode.
>
Looking at Bjorn's update, I believe the problem is unrelated to the
linked_list test.
I think the main slowdown here is because of reopening the BPF object
having failed tests over and over for each specific program.
If we make it work so that the object is only opened once and each
failing program can be tested, we can avoid the slowdown.
If this makes sense, I can work on this improvement.
More information about the linux-riscv
mailing list