[PATCH bpf-next v2 2/3] bpf, arm64: Add JIT support for stack arguments

Thu May 28 07:23:32 PDT 2026

Hi Will,

Thanks for your review.
These patches have already landed in bpf-next tree as a part of this
series: https://lore.kernel.org/all/20260513044949.2382019-1-yonghong.song@linux.dev/

I will take all your feedback and send fixes for all the required changes.

> How does that work with kfuncs? I think the PCS means that they will
> expect to pick the stack arguments starting at [SP+0]. Or are you
> saying that SP == FP+16 on entry to the callee? It's hard to reconcile
> that with the ASCII art in build_prologue() because neither "current
> A64_FP" nor "BPF_FP" point below A64_SP and it's not clear which of them
> you're referring to when you refer to "FP" on its own.

At entry to kfunc the arguments are starting at [SP+0] like a normal C
callee function would expect.

The line "callee reads them from [FP+16], [FP+24]" is for a BPF callee
which always emits a frame record.
So the caller puts the arguments at SP+0, SP+8, .. and then the callee
creates the frame record in the prologue which means the first
argument is at FP+16, ....
This is just an implementation detail about how the JITed BPF callee
programs can read stack arguments from FP+16, ... easily.

>
> > BPF convention uses fixed offsets from BPF_REG_PARAMS (r11): off=-8 is
> > always arg 6, off=-16 arg 7, etc. The verifier invalidates all outgoing
> > stack arg slots after each call, so the compiler must re-store before
> > every call. This means x5-x7 don't need to be saved on stack.
> >
> > Signed-off-by: Yonghong Song <yonghong.song at linux.dev>
> > Signed-off-by: Puranjay Mohan <puranjay at kernel.org>
> > ---
> >  arch/arm64/net/bpf_jit_comp.c | 87 ++++++++++++++++++++++++++++++++++-
> >  1 file changed, 86 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> > index 085e650662e3..cd8279880795 100644
> > --- a/arch/arm64/net/bpf_jit_comp.c
> > +++ b/arch/arm64/net/bpf_jit_comp.c
> > @@ -86,6 +86,7 @@ struct jit_ctx {
> >       __le32 *image;
> >       __le32 *ro_image;
> >       u32 stack_size;
> > +     u16 stack_arg_size;
> >       u64 user_vm_start;
> >       u64 arena_vm_start;
> >       bool fp_used;
> > @@ -533,13 +534,19 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
> >        *                        |     |
> >        *                        +-----+ <= (BPF_FP - prog->aux->stack_depth)
> >        *                        |RSVD | padding
> > -      * current A64_SP =>      +-----+ <= (BPF_FP - ctx->stack_size)
> > +      *                        +-----+ <= (BPF_FP - ctx->stack_size)
> > +      *                        |     |
> > +      *                        | ... | outgoing stack args (9+, if any)
> > +      *                        |     |
> > +      * current A64_SP =>      +-----+
> >        *                        |     |
> >        *                        | ... | Function call stack
> >        *                        |     |
> >        *                        +-----+
> >        *                          low
> >        *
> > +      * Stack args 6-8 are passed in x5-x7, args 9+ at [SP].
> > +      * Incoming args 9+ are at [FP + 16], [FP + 24], ...
> >        */
>
> I assume the arguments being passed are all <= 64-bit scalar types? If
> we ever want to pass anything with > 64-bit alignment or to a varargs
> function, then the rules for allocation get a little hairy.

Yes, right now only 64-bit scalars are supported.

>
> >       emit_kcfi(is_main_prog ? cfi_bpf_hash : cfi_bpf_subprog_hash, ctx);
> > @@ -613,6 +620,9 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
> >       if (ctx->stack_size && !ctx->priv_sp_used)
> >               emit(A64_SUB_I(1, A64_SP, A64_SP, ctx->stack_size), ctx);
> >
> > +     if (ctx->stack_arg_size)
> > +             emit(A64_SUB_I(1, A64_SP, A64_SP, ctx->stack_arg_size), ctx);
>
> How do you ensure that the stack pointer is always 16-byte aligned? We
> run with SP alignment checking enabled, so you need to take care of that.

Down below as you have already found, stack_arg_size is always aligned.

>
> > @@ -1191,6 +1207,41 @@ static int add_exception_handler(const struct bpf_insn *insn,
> >       return 0;
> >  }
> >
> > +static const u8 stack_arg_reg[] = { A64_R(5), A64_R(6), A64_R(7) };
> > +
> > +#define NR_STACK_ARG_REGS    ARRAY_SIZE(stack_arg_reg)
> > +
> > +static void emit_stack_arg_load(u8 dst, s16 bpf_off, struct jit_ctx *ctx)
> > +{
> > +     int idx = bpf_off / sizeof(u64) - 1;
> > +
> > +     if (idx < NR_STACK_ARG_REGS)
> > +             emit(A64_MOV(1, dst, stack_arg_reg[idx]), ctx);
> > +     else
> > +             emit(A64_LDR64I(dst, A64_FP, (idx - NR_STACK_ARG_REGS) * sizeof(u64) + 16), ctx);
> > +}
>
> Is it worth asserting that bpf_off >= 8 here or can we rely on that? I
> struggled to find any details about how bpf passes arguments on the
> stack (beyond what you describe in the commit message) and grepping for
> BPF_REG_PARAMS didn't help either.

This and other patches in this set has more information about this:
https://lore.kernel.org/all/20260513045015.2385013-1-yonghong.song@linux.dev/

>
> > +static void emit_stack_arg_store(u8 src_a64, s16 bpf_off, struct jit_ctx *ctx)
> > +{
> > +     int idx = -bpf_off / sizeof(u64) - 1;
> > +
> > +     if (idx < NR_STACK_ARG_REGS)
> > +             emit(A64_MOV(1, stack_arg_reg[idx], src_a64), ctx);
> > +     else
> > +             emit(A64_STR64I(src_a64, A64_SP, (idx - NR_STACK_ARG_REGS) * sizeof(u64)), ctx);
> > +}
> > +
> > +static void emit_stack_arg_store_imm(s32 imm, s16 bpf_off, const u8 tmp, struct jit_ctx *ctx)
> > +{
> > +     int idx = -bpf_off / sizeof(u64) - 1;
> > +
> > +     emit_a64_mov_i(1, tmp, imm, ctx);
> > +     if (idx < NR_STACK_ARG_REGS)
> > +             emit(A64_MOV(1, stack_arg_reg[idx], tmp), ctx);
>
> nit: You seem to have redundant MOVs here.

Ack, will fix this.

>
> > @@ -2065,6 +2137,14 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_verifier_env *env, struct bpf_pr
> >       ctx.user_vm_start = bpf_arena_get_user_vm_start(prog->aux->arena);
> >       ctx.arena_vm_start = bpf_arena_get_kern_vm_start(prog->aux->arena);
> >
> > +     if (prog->aux->stack_arg_depth > prog->aux->incoming_stack_arg_depth) {
> > +             u16 outgoing = prog->aux->stack_arg_depth - prog->aux->incoming_stack_arg_depth;
> > +             int nr_on_stack = outgoing / sizeof(u64) - NR_STACK_ARG_REGS;
> > +
> > +             if (nr_on_stack > 0)
> > +                     ctx.stack_arg_size = round_up(nr_on_stack * sizeof(u64), 16);
> > +     }
>
> ah, that's presumably where you handle the SP alignment. I think a
> comment would really help folks here... In fact, how does this interact
> with the same sort of SP adjustment that already exists in build_prologue()?
> Can we avoid the pointless re-alignment?

Yeah, you are right, if ctx->stack_size + ctx-> stack_arg_size is
aligned then we don't need to separately align these two, I can move
this from here and consolidate it in the build_prologue() function and
also add a comment there.

> Will