[PATCH bpf-next 4/4] riscv, bpf: Mixing bpf2bpf and tailcalls

Wed Jan 31 01:18:55 PST 2024

On 2024/1/31 0:03, Björn Töpel wrote:
> Pu Lehui <pulehui at huawei.com> writes:
> 
>> On 2024/1/30 21:28, Björn Töpel wrote:
>>> Pu Lehui <pulehui at huawei.com> writes:
>>>
>>>> On 2024/1/30 16:29, Björn Töpel wrote:
>>>>> Pu Lehui <pulehui at huaweicloud.com> writes:
>>>>>
>>>>>> On 2023/9/28 17:59, Björn Töpel wrote:
>>>>>>> Pu Lehui <pulehui at huaweicloud.com> writes:
>>>>>>>
>>>>>>>> From: Pu Lehui <pulehui at huawei.com>
>>>>>>>>
>>>>>>>> In the current RV64 JIT, if we just don't initialize the TCC in subprog,
>>>>>>>> the TCC can be propagated from the parent process to the subprocess, but
>>>>>>>> the TCC of the parent process cannot be restored when the subprocess
>>>>>>>> exits. Since the RV64 TCC is initialized before saving the callee saved
>>>>>>>> registers into the stack, we cannot use the callee saved register to
>>>>>>>> pass the TCC, otherwise the original value of the callee saved register
>>>>>>>> will be destroyed. So we implemented mixing bpf2bpf and tailcalls
>>>>>>>> similar to x86_64, i.e. using a non-callee saved register to transfer
>>>>>>>> the TCC between functions, and saving that register to the stack to
>>>>>>>> protect the TCC value. At the same time, we also consider the scenario
>>>>>>>> of mixing trampoline.
>>>>>>>
>>>>>>> Hi!
>>>>>>>
>>>>>>> The RISC-V JIT tries to minimize the stack usage, e.g. it doesn't have a
>>>>>>> fixed pro/epilogue like some of the other JITs. I think we can do better
>>>>>>> here, so that the pass-TCC-via-register can be used, and the additional
>>>>>>> stack access can be avoided.
>>>>>>>
>>>>>>> Today, the TCC is passed via a register (a6) and can be viewed as a
>>>>>>> "state" variable/transparent argument/return value. As you point out, we
>>>>>>> loose this when we do a call. On (any) calls we move the TCC to a
>>>>>>> callee-saved register.
>>>>>>>
>>>>>>> WDYT about the following scheme:
>>>>>>>
>>>>>>> 1 Pickup the arm64 bpf2bpf/tailmix mechanism of just clearing the TCC
>>>>>>>       for the main program.
>>>>>>> 2 For BPF helper calls, move TCC to s6, perform the call, and restore
>>>>>>>       a6. Dito for kfunc calls (BPF_PSEUDO_KFUNC_CALL).
>>>>>>> 3 For all other calls, a6 is passed transparently.
>>>>>>>
>>>>>>> For 2 bpf_jit_get_func_addr() can be used to determine if the callee is
>>>>>>> a BPF helper or not.
>>>>>>>
>>>>>>> In summary; Determine in the JIT if we're leaving BPF-land, and need to
>>>>>>> move the TCC to a callee-saved reg, or not, and save us a bunch of stack
>>>>>>> store/loads.
>>>>>>>
>>>>>>
>>>>>> Valuable scheme. But we need to consider TCC back propagation. Let me
>>>>>> show an example of calling subprog with TCC stored in A6:
>>>>>>
>>>>>> prog1(TCC==1){
>>>>>>         subprog1(TCC==1)
>>>>>>             -> tailcall1(TCC==0)
>>>>>>                 -> subprog2(TCC==0)
>>>>>>         subprog3(TCC==0) <--- should be TCC==1
>>>>>>             -\-> tailcall2 <--- can't be called
>>>>>> }
>>>>
>>>> Let's back with this example again. Imagine that the tailcall chain is a
>>>> list limited to 33 elements. When the list has 32 elements, we call
>>>> subprog1 and then tailcall1. At this time, the list elements count
>>>> becomes 33. Then we call subprog2 and return prog1. At this time, the
>>>> list removes 1 element and becomes 32 elements. At this time, there
>>>> still can perform 1 tailcall.
>>>>
>>>> I've attached a diagram that shows mixing tailcall and subprogs is
>>>> nearly a "call". It can return to caller function.
>>>
>>> Hmm. Let me put my Q in another way.
>>>
>>> The kernel calls into BPF_PROG_RUN() (~a BPF context). Would it ever be
>>> OK to do more than 33 tail calls, regardless of subprogs or not?
>>>
>>> In your example, TCC is 1. You are allowed to perform one tail call. In
>>> your example prog1 performs two.
>>>
>>> My view of TCC has always been ~a counter of the number of tailcalls~.
>>>
>>> With your example expanded:
>>> prog1(TCC==33){
>>>         subprog1(TCC==33)
>>>             -> tailcall1(TCC==33) -> tailcall1(TCC==32) -> tailcall1(TCC==31) -> ... // 33 times
>>>         // Lehui says TCC should be 33 again.
>>>         // Björn says "it's the number of tailcalls", and subprog3 cannot perform a tail call
>>>         subprog3(TCC==?)
>>
>> Yes, my view is take this something like a stack，while you take this as
>> a fixed global value.
>>
>> prog1(TCC==33){
>>       subprog1(TCC==33)
>>           -> tailcall1(TCC==33) -> tailcall1(TCC==32) ->
>> tailcall1(TCC==31) -> ... // 33 times -> subprog2(TCC==0)
>>       subprog3(TCC==33)
>> 	-> tailcall1(TCC==33) -> tailcall1(TCC==32) -> tailcall1(TCC==31) ->
>> ... // 33 times
>>
>>>             
>>> My view has, again, been than TCC is a run-time count of the number
>>> tailcalls (fentry/fexit patch bpf-programs included).
>>>
>>> What does x86 and arm64 do?
>>
>> When subprog return back to caller bpf program, they both restore TCC to
>> the value when enter into subprog. The ARM64 uses the callee saved
>> register to store the TCC. When the ARM64 exits, the TCC is restored to
>> the value when it enter. The while x86 uses the stack to do the same thing.
> 
> Ok! Thanks for clarifying. I'll continue reviewing the v2 of your
> series!
> 
> BTW, I wonder if we can trigger this [1] on RV64 -- i.e. calling the
> main prog, will reset the tcc count.
> 
> [1] https://lore.kernel.org/bpf/20240104142226.87869-1-hffilwlqm@gmail.com/

Yes, I have been paying attention to this matter recently and will 
allocate time to analyze it.