[PATCH bpf-next] bpf, arm64: Support up to 12 function arguments

Thu Sep 21 04:19:31 PDT 2023

On 9/21/2023 6:09 PM, Florent Revest wrote:
> On Thu, Sep 21, 2023 at 7:21 AM Xu Kuohai <xukuohai at huaweicloud.com> wrote:
>> On 9/21/2023 6:17 AM, Florent Revest wrote:
>>> On Sun, Sep 17, 2023 at 5:09 PM Xu Kuohai <xukuohai at huaweicloud.com> wrote:
>>>> Due to bpf only supports function arguments up to 16 bytes, according to
>>>> AAPCS64, starting from the first argument, each argument is first
>>>> attempted to be loaded to 1 or 2 smallest registers from x0-x7, if there
>>>> are no enough registers to hold the entire argument, then all remaining
>>>> arguments starting from this one are pushed to the stack for passing.
>>>
>>> If I read the section 6.8.2 of the AAPCS64 correctly, there is a
>>> corner case which I believe isn't covered by this logic.
>>>
>>> void f(u128 a, u128 b, u128, c, u64 d, u128 e, u64 f) {}
>>> - a goes on x0 and x1
>>> - b goes on x2 and x3
>>> - c goes on x4 and x5
>>> - d goes on x6
>>> - e spills on the stack because it doesn't fit in the remaining regs
>>> - f goes on x7
>>>
>>
>> I guess you might have overlooked rule C.13 in AAPCS64. Non-floating type arguments
>> are copied to stack under rule C.15/C.17. However, C.13 is applied before C.15/C.17,
>> which means that NGRN is set to 8 before the stack is used. That is, all 8 parameter
>> arguments are used up and any remaining arguments can only be passed by the stack.
>>
>> C.13    The NGRN is set to 8.
>>
>> C.14    The NSAA is rounded up to the larger of 8 or the Natural Alignment of the
>>           argument’s type.
>>
>> C.15    If the argument is a composite type then the argument is copied to memory
>>           at the adjusted NSAA. The NSAA is incremented by the size of the argument.
>>           The argument has now been allocated.
>>
>> C.16    If the size of the argument is less than 8 bytes then the size of the argument
>>           is set to 8 bytes. The effect is as if the argument was copied to the least
>>           significant bits of a 64-bit register and the remaining bits filled with
>>           unspecified values.
>>
>> C.17    The argument is copied to memory at the adjusted NSAA. The NSAA is incremented
>>           by the size of the argument. The argument has now been allocated.
>>
>>
>> And the following assembly code also shows 'e' and 'f' are passed by stack.
>>
>> int func(__int128 a, __int128 b, __int128 c, int64_t d, __int128 e, int64_t f)
>> {
>>           return e == 5 || f == 7;
>> }
>>
>> asseembly:
>>
>> func:
>>           sub     sp, sp, #64
>>           stp     x0, x1, [sp, 48]
>>           stp     x2, x3, [sp, 32]
>>           stp     x4, x5, [sp, 16]
>>           str     x6, [sp, 8]
>>           ldr     x0, [sp, 64] // ** load the low 8 bytes of e from SP + 64 **
>>           cmp     x0, 5
>>           bne     .L27
>>           ldr     x0, [sp, 72] // ** load the high 8 bytes of e from SP + 72 **
>>           cmp     x0, 0
>>           beq     .L22
>> .L27:
>>           ldr     x0, [sp, 80] // ** load f from SP + 80 **
>>           cmp     x0, 7
>>           bne     .L24
>> .L22:
>>           mov     w0, 1
>>           b       .L26
>> .L24:
>>           mov     w0, 0
>> .L26:
>>           add     sp, sp, 64
>>           ret
> 
> Ah, that's great! :) It keeps things easy then. Thank you for the explanation!
> 
>> Although the above case is fine, the current patch does not handle rule C.14 correctly.
>> For example, for the following func, an 8-byte padding is inserted between f and g by
>> C.14 to align g to 16 bytes, but this patch mistakenly treats it as part of g.
>>
>> int func(__int128 a, __int128 b, __int128 c, int64_t d, __int128 e, int64_t f, __int128 g)
>> {
>> }
>>
>> Maybe we could fix it by adding argument alignment to struct btf_func_model, I'll
>> give it a try.
> 
> Good catch!
> 
>>> Maybe it would be good to add something pathological like this to the
>>> selftests ?
>>>
>>
>> OK, will do
> 
> Just a thought on that topic, maybe it would be preferable to have a
> new separate test for this ? Currently we have a test for 128b long
> arguments and a test for many arguments, it's good to have these
> separate because they are two dimensions of bpf architectural support:
> if someone adds support for one xor the other feature to an arch, it's
> good to see a test go green. Since that new test would cover both long
> and many arguments, it should probably be a new test.
>

Agree, I'll add a separate test for unaligned stack arguments, thanks.

>>>> -static void save_args(struct jit_ctx *ctx, int args_off, int nregs)
>>>> +struct arg_aux {
>>>> +       /* how many args are passed through registers, the rest args are
>>>
>>> the rest of the* args
>>>
>>>> +        * passed through stack
>>>> +        */
>>>> +       int args_in_reg;
>>>
>>> Maybe args_in_regs ? since args can go in multiple regs
>>>
>>>> +       /* how many registers used for passing arguments */
>>>
>>> are* used
>>>
>>>> +       int regs_for_arg;
>>>
>>> And here regs_for_args ? Since It's the number of registers used for all args
>>>
>>>> +       /* how many stack slots used for arguments, each slot is 8 bytes */
>>>
>>> are* used
>>>
>>>> +       int stack_slots_for_arg;
>>>
>>> And here stack_slots_for_args, for the same reason as above?
>>>
>>>> +};
>>>> +
>>>> +static void calc_arg_aux(const struct btf_func_model *m,
>>>> +                        struct arg_aux *a)
>>>>    {
>>>>           int i;
>>>> +       int nregs;
>>>> +       int slots;
>>>> +       int stack_slots;
>>>> +
>>>> +       /* verifier ensures m->nr_args <= MAX_BPF_FUNC_ARGS */
>>>> +       for (i = 0, nregs = 0; i < m->nr_args; i++) {
>>>> +               slots = (m->arg_size[i] + 7) / 8;
>>>> +               if (nregs + slots <= 8) /* passed through register ? */
>>>> +                       nregs += slots;
>>>> +               else
>>>> +                       break;
>>>> +       }
>>>> +
>>>> +       a->args_in_reg = i;
>>>> +       a->regs_for_arg = nregs;
>>>>
>>>> -       for (i = 0; i < nregs; i++) {
>>>> -               emit(A64_STR64I(i, A64_SP, args_off), ctx);
>>>> -               args_off += 8;
>>>> +       /* the rest arguments are passed through stack */
>>>> +       for (stack_slots = 0; i < m->nr_args; i++)
>>>> +               stack_slots += (m->arg_size[i] + 7) / 8;
>>>> +
>>>> +       a->stack_slots_for_arg = stack_slots;
>>>> +}
>>>> +
>>>> +static void clear_garbage(struct jit_ctx *ctx, int reg, int effective_bytes)
>>>> +{
>>>> +       if (effective_bytes) {
>>>> +               int garbage_bits = 64 - 8 * effective_bytes;
>>>> +#ifdef CONFIG_CPU_BIG_ENDIAN
>>>> +               /* garbage bits are at the right end */
>>>> +               emit(A64_LSR(1, reg, reg, garbage_bits), ctx);
>>>> +               emit(A64_LSL(1, reg, reg, garbage_bits), ctx);
>>>> +#else
>>>> +               /* garbage bits are at the left end */
>>>> +               emit(A64_LSL(1, reg, reg, garbage_bits), ctx);
>>>> +               emit(A64_LSR(1, reg, reg, garbage_bits), ctx);
>>>> +#endif
>>>>           }
>>>>    }
>>>>
>>>> -static void restore_args(struct jit_ctx *ctx, int args_off, int nregs)
>>>> +static void save_args(struct jit_ctx *ctx, int bargs_off, int oargs_off,
>>>> +                     const struct btf_func_model *m,
>>>> +                     const struct arg_aux *a,
>>>> +                     bool for_call_origin)
>>>>    {
>>>>           int i;
>>>> +       int reg;
>>>> +       int doff;
>>>> +       int soff;
>>>> +       int slots;
>>>> +       u8 tmp = bpf2a64[TMP_REG_1];
>>>> +
>>>> +       /* store argument registers to stack for call bpf, or restore argument
>>>
>>> to* call bpf or "for the bpf program"
>>>
>>
>> Sorry for these incorrect words :(, all will be fixed in the next version, thanks!
> 
> No problem!
> 
>>>>    static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>>>>                                 struct bpf_tramp_links *tlinks, void *orig_call,
>>>> -                             int nregs, u32 flags)
>>>> +                             const struct btf_func_model *m,
>>>> +                             const struct arg_aux *a,
>>>> +                             u32 flags)
>>>>    {
>>>>           int i;
>>>>           int stack_size;
>>>>           int retaddr_off;
>>>>           int regs_off;
>>>>           int retval_off;
>>>> -       int args_off;
>>>> +       int bargs_off;
>>>>           int nregs_off;
>>>>           int ip_off;
>>>>           int run_ctx_off;
>>>> +       int oargs_off;
>>>> +       int nregs;
>>>>           struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
>>>>           struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
>>>>           struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
>>>> @@ -1859,19 +1951,26 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>>>>            *
>>>>            * SP + retval_off  [ return value      ] BPF_TRAMP_F_CALL_ORIG or
>>>>            *                                        BPF_TRAMP_F_RET_FENTRY_RET
>>>> -        *
>>>>            *                  [ arg reg N         ]
>>>>            *                  [ ...               ]
>>>> -        * SP + args_off    [ arg reg 1         ]
>>>> +        * SP + bargs_off   [ arg reg 1         ] for bpf
>>>>            *
>>>>            * SP + nregs_off   [ arg regs count    ]
>>>>            *
>>>>            * SP + ip_off      [ traced function   ] BPF_TRAMP_F_IP_ARG flag
>>>>            *
>>>>            * SP + run_ctx_off [ bpf_tramp_run_ctx ]
>>>> +        *
>>>> +        *                  [ stack arg N       ]
>>>> +        *                  [ ...               ]
>>>> +        * SP + oargs_off   [ stack arg 1       ] for original func
>>>>            */
>>>>
>>>>           stack_size = 0;
>>>> +       oargs_off = stack_size;
>>>> +       if (flags & BPF_TRAMP_F_CALL_ORIG)
>>>> +               stack_size += 8 * a->stack_slots_for_arg;
>>>> +
>>>>           run_ctx_off = stack_size;
>>>>           /* room for bpf_tramp_run_ctx */
>>>>           stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8);
>>>> @@ -1885,9 +1984,10 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>>>>           /* room for args count */
>>>>           stack_size += 8;
>>>>
>>>> -       args_off = stack_size;
>>>> +       bargs_off = stack_size;
>>>>           /* room for args */
>>>> -       stack_size += nregs * 8;
>>>> +       nregs = a->regs_for_arg + a->stack_slots_for_arg;
>>>
>>> Maybe this name no longer makes sense ?
>>
>> OK, I'll remove it
> 
> Ack