[PATCH bpf-next] bpf, arm64: Support up to 12 function arguments
Xu Kuohai
xukuohai at huaweicloud.com
Thu Sep 21 04:19:31 PDT 2023
On 9/21/2023 6:09 PM, Florent Revest wrote:
> On Thu, Sep 21, 2023 at 7:21 AM Xu Kuohai <xukuohai at huaweicloud.com> wrote:
>> On 9/21/2023 6:17 AM, Florent Revest wrote:
>>> On Sun, Sep 17, 2023 at 5:09 PM Xu Kuohai <xukuohai at huaweicloud.com> wrote:
>>>> Since bpf only supports function arguments up to 16 bytes, per the
>>>> AAPCS64, each argument is assigned, starting from the first one, to
>>>> the lowest 1 or 2 available registers from x0-x7; once there are not
>>>> enough registers left to hold an entire argument, that argument and
>>>> all remaining arguments are passed on the stack.
>>>
>>> If I read the section 6.8.2 of the AAPCS64 correctly, there is a
>>> corner case which I believe isn't covered by this logic.
>>>
>>> void f(u128 a, u128 b, u128 c, u64 d, u128 e, u64 f) {}
>>> - a goes on x0 and x1
>>> - b goes on x2 and x3
>>> - c goes on x4 and x5
>>> - d goes on x6
>>> - e spills on the stack because it doesn't fit in the remaining regs
>>> - f goes on x7
>>>
>>
>> I guess you might have overlooked rule C.13 in the AAPCS64. Non-floating-point
>> arguments are copied to the stack under rules C.15/C.17. However, C.13 is applied
>> before C.15/C.17, which means that the NGRN is set to 8 before the stack is used.
>> That is, all 8 parameter registers are used up and any remaining arguments can
>> only be passed on the stack.
>>
>> C.13 The NGRN is set to 8.
>>
>> C.14 The NSAA is rounded up to the larger of 8 or the Natural Alignment of the
>> argument’s type.
>>
>> C.15 If the argument is a composite type then the argument is copied to memory
>> at the adjusted NSAA. The NSAA is incremented by the size of the argument.
>> The argument has now been allocated.
>>
>> C.16 If the size of the argument is less than 8 bytes then the size of the argument
>> is set to 8 bytes. The effect is as if the argument was copied to the least
>> significant bits of a 64-bit register and the remaining bits filled with
>> unspecified values.
>>
>> C.17 The argument is copied to memory at the adjusted NSAA. The NSAA is incremented
>> by the size of the argument. The argument has now been allocated.
>>
>>
>> And the following assembly code also shows that 'e' and 'f' are passed on the stack.
>>
>> int func(__int128 a, __int128 b, __int128 c, int64_t d, __int128 e, int64_t f)
>> {
>> return e == 5 || f == 7;
>> }
>>
>> assembly:
>>
>> func:
>> sub sp, sp, #64
>> stp x0, x1, [sp, 48]
>> stp x2, x3, [sp, 32]
>> stp x4, x5, [sp, 16]
>> str x6, [sp, 8]
>> ldr x0, [sp, 64] // ** load the low 8 bytes of e from SP + 64 **
>> cmp x0, 5
>> bne .L27
>> ldr x0, [sp, 72] // ** load the high 8 bytes of e from SP + 72 **
>> cmp x0, 0
>> beq .L22
>> .L27:
>> ldr x0, [sp, 80] // ** load f from SP + 80 **
>> cmp x0, 7
>> bne .L24
>> .L22:
>> mov w0, 1
>> b .L26
>> .L24:
>> mov w0, 0
>> .L26:
>> add sp, sp, 64
>> ret
>
> Ah, that's great! :) It keeps things easy then. Thank you for the explanation!
>
>> Although the above case is fine, the current patch does not handle rule C.14
>> correctly. For example, for the following function, rule C.14 inserts 8 bytes
>> of padding between f and g to align g to 16 bytes, but this patch mistakenly
>> treats the padding as part of g.
>>
>> int func(__int128 a, __int128 b, __int128 c, int64_t d, __int128 e, int64_t f, __int128 g)
>> {
>> }
>>
>> Maybe we could fix it by adding argument alignment to struct btf_func_model;
>> I'll give it a try.
>
> Good catch!
>
>>> Maybe it would be good to add something pathological like this to
>>> the selftests?
>>>
>>
>> OK, will do
>
> Just a thought on that topic, maybe it would be preferable to have a
> new separate test for this ? Currently we have a test for 128b long
> arguments and a test for many arguments, it's good to have these
> separate because they are two dimensions of bpf architectural support:
> if someone adds support for one xor the other feature to an arch, it's
> good to see a test go green. Since that new test would cover both long
> and many arguments, it should probably be a new test.
>
Agreed, I'll add a separate test for unaligned stack arguments, thanks.
>>>> -static void save_args(struct jit_ctx *ctx, int args_off, int nregs)
>>>> +struct arg_aux {
>>>> + /* how many args are passed through registers, the rest args are
>>>
>>> the rest of the* args
>>>
>>>> + * passed through stack
>>>> + */
>>>> + int args_in_reg;
>>>
>>> Maybe args_in_regs ? since args can go in multiple regs
>>>
>>>> + /* how many registers used for passing arguments */
>>>
>>> are* used
>>>
>>>> + int regs_for_arg;
>>>
>>> And here regs_for_args ? Since It's the number of registers used for all args
>>>
>>>> + /* how many stack slots used for arguments, each slot is 8 bytes */
>>>
>>> are* used
>>>
>>>> + int stack_slots_for_arg;
>>>
>>> And here stack_slots_for_args, for the same reason as above?
>>>
>>>> +};
>>>> +
>>>> +static void calc_arg_aux(const struct btf_func_model *m,
>>>> + struct arg_aux *a)
>>>> {
>>>> int i;
>>>> + int nregs;
>>>> + int slots;
>>>> + int stack_slots;
>>>> +
>>>> + /* verifier ensures m->nr_args <= MAX_BPF_FUNC_ARGS */
>>>> + for (i = 0, nregs = 0; i < m->nr_args; i++) {
>>>> + slots = (m->arg_size[i] + 7) / 8;
>>>> + if (nregs + slots <= 8) /* passed through register ? */
>>>> + nregs += slots;
>>>> + else
>>>> + break;
>>>> + }
>>>> +
>>>> + a->args_in_reg = i;
>>>> + a->regs_for_arg = nregs;
>>>>
>>>> - for (i = 0; i < nregs; i++) {
>>>> - emit(A64_STR64I(i, A64_SP, args_off), ctx);
>>>> - args_off += 8;
>>>> + /* the rest arguments are passed through stack */
>>>> + for (stack_slots = 0; i < m->nr_args; i++)
>>>> + stack_slots += (m->arg_size[i] + 7) / 8;
>>>> +
>>>> + a->stack_slots_for_arg = stack_slots;
>>>> +}
>>>> +
>>>> +static void clear_garbage(struct jit_ctx *ctx, int reg, int effective_bytes)
>>>> +{
>>>> + if (effective_bytes) {
>>>> + int garbage_bits = 64 - 8 * effective_bytes;
>>>> +#ifdef CONFIG_CPU_BIG_ENDIAN
>>>> + /* garbage bits are at the right end */
>>>> + emit(A64_LSR(1, reg, reg, garbage_bits), ctx);
>>>> + emit(A64_LSL(1, reg, reg, garbage_bits), ctx);
>>>> +#else
>>>> + /* garbage bits are at the left end */
>>>> + emit(A64_LSL(1, reg, reg, garbage_bits), ctx);
>>>> + emit(A64_LSR(1, reg, reg, garbage_bits), ctx);
>>>> +#endif
>>>> }
>>>> }
>>>>
>>>> -static void restore_args(struct jit_ctx *ctx, int args_off, int nregs)
>>>> +static void save_args(struct jit_ctx *ctx, int bargs_off, int oargs_off,
>>>> + const struct btf_func_model *m,
>>>> + const struct arg_aux *a,
>>>> + bool for_call_origin)
>>>> {
>>>> int i;
>>>> + int reg;
>>>> + int doff;
>>>> + int soff;
>>>> + int slots;
>>>> + u8 tmp = bpf2a64[TMP_REG_1];
>>>> +
>>>> + /* store argument registers to stack for call bpf, or restore argument
>>>
>>> to* call bpf or "for the bpf program"
>>>
>>
>> Sorry for these incorrect words :(, all will be fixed in the next version, thanks!
>
> No problem!
>
>>>> static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>>>> struct bpf_tramp_links *tlinks, void *orig_call,
>>>> - int nregs, u32 flags)
>>>> + const struct btf_func_model *m,
>>>> + const struct arg_aux *a,
>>>> + u32 flags)
>>>> {
>>>> int i;
>>>> int stack_size;
>>>> int retaddr_off;
>>>> int regs_off;
>>>> int retval_off;
>>>> - int args_off;
>>>> + int bargs_off;
>>>> int nregs_off;
>>>> int ip_off;
>>>> int run_ctx_off;
>>>> + int oargs_off;
>>>> + int nregs;
>>>> struct bpf_tramp_links *fentry = &tlinks[BPF_TRAMP_FENTRY];
>>>> struct bpf_tramp_links *fexit = &tlinks[BPF_TRAMP_FEXIT];
>>>> struct bpf_tramp_links *fmod_ret = &tlinks[BPF_TRAMP_MODIFY_RETURN];
>>>> @@ -1859,19 +1951,26 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>>>> *
>>>> * SP + retval_off [ return value ] BPF_TRAMP_F_CALL_ORIG or
>>>> * BPF_TRAMP_F_RET_FENTRY_RET
>>>> - *
>>>> * [ arg reg N ]
>>>> * [ ... ]
>>>> - * SP + args_off [ arg reg 1 ]
>>>> + * SP + bargs_off [ arg reg 1 ] for bpf
>>>> *
>>>> * SP + nregs_off [ arg regs count ]
>>>> *
>>>> * SP + ip_off [ traced function ] BPF_TRAMP_F_IP_ARG flag
>>>> *
>>>> * SP + run_ctx_off [ bpf_tramp_run_ctx ]
>>>> + *
>>>> + * [ stack arg N ]
>>>> + * [ ... ]
>>>> + * SP + oargs_off [ stack arg 1 ] for original func
>>>> */
>>>>
>>>> stack_size = 0;
>>>> + oargs_off = stack_size;
>>>> + if (flags & BPF_TRAMP_F_CALL_ORIG)
>>>> + stack_size += 8 * a->stack_slots_for_arg;
>>>> +
>>>> run_ctx_off = stack_size;
>>>> /* room for bpf_tramp_run_ctx */
>>>> stack_size += round_up(sizeof(struct bpf_tramp_run_ctx), 8);
>>>> @@ -1885,9 +1984,10 @@ static int prepare_trampoline(struct jit_ctx *ctx, struct bpf_tramp_image *im,
>>>> /* room for args count */
>>>> stack_size += 8;
>>>>
>>>> - args_off = stack_size;
>>>> + bargs_off = stack_size;
>>>> /* room for args */
>>>> - stack_size += nregs * 8;
>>>> + nregs = a->regs_for_arg + a->stack_slots_for_arg;
>>>
>>> Maybe this name no longer makes sense ?
>>
>> OK, I'll remove it
>
> Ack