WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342

Yonghong Song yonghong.song at linux.dev
Mon Aug 28 17:54:21 PDT 2023



On 8/28/23 6:57 AM, Hou Tao wrote:
> Hi,
> 
> On 8/27/2023 10:53 PM, Yonghong Song wrote:
>>
>>
>> On 8/27/23 1:37 AM, Björn Töpel wrote:
>>> Björn Töpel <bjorn at kernel.org> writes:
>>>
>>>> Hou Tao <houtao at huaweicloud.com> writes:
>>>>
>>>>> Hi,
>>>>>
>>>>> On 8/26/2023 5:23 PM, Björn Töpel wrote:
>>>>>> Hou Tao <houtao at huaweicloud.com> writes:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 8/25/2023 11:28 PM, Yonghong Song wrote:
>>>>>>>>
>>>>>>>> On 8/25/23 3:32 AM, Björn Töpel wrote:
>>>>>>>>> I'm chasing a workqueue hang on RISC-V/qemu (TCG), using the bpf
>>>>>>>>> selftests on bpf-next 9e3b47abeb8f.
>>>>>>>>>
>>>>>>>>> I'm able to reproduce the hang by multiple runs of:
>>>>>>>>>     | ./test_progs -a link_api -a linked_list
>>>>>>>>> I'm currently investigating that.
>>>>>>>>>
>>>>>>>>> But! Sometimes (every blue moon) I get a warn_on_once hit:
>>>>>>>>>     | ------------[ cut here ]------------
>>>>>>>>>     | WARNING: CPU: 3 PID: 261 at kernel/bpf/memalloc.c:342 bpf_mem_refill+0x1fc/0x206
>>>>>>>>>     | Modules linked in: bpf_testmod(OE)
>>>>>>>>>     | CPU: 3 PID: 261 Comm: test_progs-cpuv Tainted: G           OE N 6.5.0-rc5-01743-gdcb152bb8328 #2
>>>>>>>>>     | Hardware name: riscv-virtio,qemu (DT)
>>>>>>>>>     | epc : bpf_mem_refill+0x1fc/0x206
>>>>>>>>>     |  ra : irq_work_single+0x68/0x70
>>>>>>>>>     | epc : ffffffff801b1bc4 ra : ffffffff8015fe84 sp : ff2000000001be20
>>>>>>>>>     |  gp : ffffffff82d26138 tp : ff6000008477a800 t0 : 0000000000046600
>>>>>>>>>     |  t1 : ffffffff812b6ddc t2 : 0000000000000000 s0 : ff2000000001be70
>>>>>>>>>     |  s1 : ff5ffffffffe8998 a0 : ff5ffffffffe8998 a1 : ff600003fef4b000
>>>>>>>>>     |  a2 : 000000000000003f a3 : ffffffff80008250 a4 : 0000000000000060
>>>>>>>>>     |  a5 : 0000000000000080 a6 : 0000000000000000 a7 : 0000000000735049
>>>>>>>>>     |  s2 : ff5ffffffffe8998 s3 : 0000000000000022 s4 : 0000000000001000
>>>>>>>>>     |  s5 : 0000000000000007 s6 : ff5ffffffffe8570 s7 : ffffffff82d6bd30
>>>>>>>>>     |  s8 : 000000000000003f s9 : ffffffff82d2c5e8 s10: 000000000000ffff
>>>>>>>>>     |  s11: ffffffff82d2c5d8 t3 : ffffffff81ea8f28 t4 : 0000000000000000
>>>>>>>>>     |  t5 : ff6000008fd28278 t6 : 0000000000040000
>>>>>>>>>     | status: 0000000200000100 badaddr: 0000000000000000 cause: 0000000000000003
>>>>>>>>>     | [<ffffffff801b1bc4>] bpf_mem_refill+0x1fc/0x206
>>>>>>>>>     | [<ffffffff8015fe84>] irq_work_single+0x68/0x70
>>>>>>>>>     | [<ffffffff8015feb4>] irq_work_run_list+0x28/0x36
>>>>>>>>>     | [<ffffffff8015fefa>] irq_work_run+0x38/0x66
>>>>>>>>>     | [<ffffffff8000828a>] handle_IPI+0x3a/0xb4
>>>>>>>>>     | [<ffffffff800a5c3a>] handle_percpu_devid_irq+0xa4/0x1f8
>>>>>>>>>     | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>>>     | [<ffffffff800ae570>] ipi_mux_process+0xac/0xfa
>>>>>>>>>     | [<ffffffff8000a8ea>] sbi_ipi_handle+0x2e/0x88
>>>>>>>>>     | [<ffffffff8009fafa>] generic_handle_domain_irq+0x28/0x36
>>>>>>>>>     | [<ffffffff807ee70e>] riscv_intc_irq+0x36/0x4e
>>>>>>>>>     | [<ffffffff812b5d3a>] handle_riscv_irq+0x54/0x86
>>>>>>>>>     | [<ffffffff812b6904>] do_irq+0x66/0x98
>>>>>>>>>     | ---[ end trace 0000000000000000 ]---
>>>>>>>>>
>>>>>>>>> Code:
>>>>>>>>>     | static void free_bulk(struct bpf_mem_cache *c)
>>>>>>>>>     | {
>>>>>>>>>     |     struct bpf_mem_cache *tgt = c->tgt;
>>>>>>>>>     |     struct llist_node *llnode, *t;
>>>>>>>>>     |     unsigned long flags;
>>>>>>>>>     |     int cnt;
>>>>>>>>>     |
>>>>>>>>>     |     WARN_ON_ONCE(tgt->unit_size != c->unit_size);
>>>>>>>>>     | ...
>>>>>>>>>
>>>>>>>>> I'm not well versed in the memory allocator; Before I dive into
>>>>>>>>> it --
>>>>>>>>> has anyone else hit it? Ideas on why the warn_on_once is hit?
>>>>>>>> Maybe take a look at the patch
>>>>>>>>     822fb26bdb55  bpf: Add a hint to allocated objects.
>>>>>>>>
>>>>>>>> In the above patch, we have
>>>>>>>>
>>>>>>>> +       /*
>>>>>>>> +        * Remember bpf_mem_cache that allocated this object.
>>>>>>>> +        * The hint is not accurate.
>>>>>>>> +        */
>>>>>>>> +       c->tgt = *(struct bpf_mem_cache **)llnode;
>>>>>>>>
>>>>>>>> I suspect that the warning may be related to the above.
>>>>>>>> I tried the above ./test_progs command line (running multiple
>>>>>>>> instances at the same time) and didn't trigger the issue.
>>>>>>> The extra 8 bytes before the freed pointer are used to save the
>>>>>>> pointer to the original bpf memory allocator from which the freed
>>>>>>> pointer came, so unit_free() can return the pointer to the
>>>>>>> original allocator and prevent an alloc-and-free imbalance.
>>>>>>>
>>>>>>> I suspect that a wrong pointer was passed to bpf_obj_drop, but I
>>>>>>> did not find anything suspicious after checking linked_list.
>>>>>>> Another possibility is a write-after-free problem which corrupts
>>>>>>> the extra 8 bytes before the freed pointer. Could you please
>>>>>>> apply the following debug patch to check whether or not the extra
>>>>>>> 8 bytes are corrupted?
>>>>>> Thanks for getting back!
>>>>>>
>>>>>> I took your patch for a run, and there's a hit:
>>>>>>     | bad cache ff5ffffffffe8570: got size 96 work ffffffff801b19c8, cache ff5ffffffffe8980 exp size 128 work ffffffff801b19c8
>>>>>
>>>>> The extra 8 bytes are not corrupted. Both of these bpf_mem_cache
>>>>> entries are valid and reside in the cache array defined in
>>>>> bpf_mem_caches. The BPF memory allocator allocated the pointer from
>>>>> the 96-byte sized cache, but it tried to free the pointer through
>>>>> the 128-byte sized cache.
>>>>>
>>>>> Now I suspect there is no 96-byte slab in your system and
>>>>> ksize(ptr - LLIST_NODE_SZ) returns 128, so the BPF memory allocator
>>>>> selected the 128-byte sized cache instead of the 96-byte sized
>>>>> cache. Could you please check the value of KMALLOC_MIN_SIZE in your
>>>>> kernel .config and use the following command to check whether there
>>>>> is a 96-byte slab in your system:
>>>>
>>>> KMALLOC_MIN_SIZE is 64.
>>>>
>>>>> $ cat /proc/slabinfo | grep kmalloc-96
>>>>> dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
>>>>> kmalloc-96          1865   2268     96   42    1 : tunables    0    0    0 : slabdata     54     54      0
>>>>>
>>>>> In my system, slab has a 96-byte cache, so grep outputs something,
>>>>> but I think there will be no output on your system.
>>>>
>>>> You're right! No kmalloc-96.
>>>
>>> To get rid of the warning, limit available sizes from
>>> bpf_mem_alloc_init()?
> 
> It is not enough. We need to adjust size_index accordingly during
> initialization. Could you please try the attached patch below? It is
> not a formal patch, and I am considering disabling prefilling for
> these redirected bpf_mem_caches.
>>
>> Do you know why your system does not have kmalloc-96?
> 
> According to the implementation of setup_kmalloc_cache_index_table()
> and create_kmalloc_caches(), when KMALLOC_MIN_SIZE is greater than 64,
> kmalloc-96 will be omitted. If KMALLOC_MIN_SIZE is greater than 128,
> kmalloc-192 will be omitted as well.

Thanks! This is indeed the case, except that the above 'greater than'
should be 'greater than or equal to'. For example, greater than or
equal to 64 means the minimum alignment is 64, so 96 is removed.
Similarly for greater than or equal to 128.

I agree it is a good idea to align the memalloc cache unit sizes with
the underlying kmalloc cache unit sizes.

>>
>>>
>>>
>>> Björn



More information about the linux-riscv mailing list