[PATCH bpf-next v2 0/4] Add ftrace direct call for arm64

Xu Kuohai xukuohai at huawei.com
Thu Oct 6 03:09:44 PDT 2022


On 10/5/2022 11:30 PM, Steven Rostedt wrote:
> On Wed, 5 Oct 2022 17:10:33 +0200
> Florent Revest <revest at chromium.org> wrote:
> 
>> On Wed, Oct 5, 2022 at 5:07 PM Steven Rostedt <rostedt at goodmis.org> wrote:
>>>
>>> On Wed, 5 Oct 2022 22:54:15 +0800
>>> Xu Kuohai <xukuohai at huawei.com> wrote:
>>>   
>>>> 1.3 attach bpf prog with direct call, bpftrace -e 'kfunc:vfs_write {}'
>>>>
>>>> # dd if=/dev/zero of=/dev/null count=1000000
>>>> 1000000+0 records in
>>>> 1000000+0 records out
>>>> 512000000 bytes (512 MB, 488 MiB) copied, 1.72973 s, 296 MB/s
>>>>
>>>>
>>>> 1.4 attach bpf prog with indirect call, bpftrace -e 'kfunc:vfs_write {}'
>>>>
>>>> # dd if=/dev/zero of=/dev/null count=1000000
>>>> 1000000+0 records in
>>>> 1000000+0 records out
>>>> 512000000 bytes (512 MB, 488 MiB) copied, 1.99179 s, 257 MB/s
>>
>> Thanks for the measurements Xu!
>>
>>> Can you show the implementation of the indirect call you used?
>>
>> Xu used my development branch here
>> https://github.com/FlorentRevest/linux/commits/fprobe-min-args
> 
> That looks like it could be optimized quite a bit too.
> 
> Specifically this part:
> 
> static bool bpf_fprobe_entry(struct fprobe *fp, unsigned long ip, struct ftrace_regs *regs, void *private)
> {
> 	struct bpf_fprobe_call_context *call_ctx = private;
> 	struct bpf_fprobe_context *fprobe_ctx = fp->ops.private;
> 	struct bpf_tramp_links *links = fprobe_ctx->links;
> 	struct bpf_tramp_links *fentry = &links[BPF_TRAMP_FENTRY];
> 	struct bpf_tramp_links *fmod_ret = &links[BPF_TRAMP_MODIFY_RETURN];
> 	struct bpf_tramp_links *fexit = &links[BPF_TRAMP_FEXIT];
> 	int i, ret;
> 
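> 	/* Clear the BPF run context and copy the traced function's
> 	 * arguments out of ftrace_regs.
> 	 */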
> 	memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
> 	call_ctx->ip = ip;
> 	for (i = 0; i < fprobe_ctx->nr_args; i++)
> 		call_ctx->args[i] = ftrace_regs_get_argument(regs, i);
> 
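> 	/* Run all FENTRY programs attached to the traced function */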
> 	for (i = 0; i < fentry->nr_links; i++)
> 		call_bpf_prog(fentry->links[i], &call_ctx->ctx, call_ctx->args);
> 
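> 	/* args[nr_args] is the return value slot seen by FMOD_RET programs;
> 	 * a nonzero program return value is written back and the traced
> 	 * function itself is skipped.
> 	 */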
> 	call_ctx->args[fprobe_ctx->nr_args] = 0;
> 	for (i = 0; i < fmod_ret->nr_links; i++) {
> 		ret = call_bpf_prog(fmod_ret->links[i], &call_ctx->ctx,
> 				      call_ctx->args);
> 
> 		if (ret) {
> 			ftrace_regs_set_return_value(regs, ret);
> 			ftrace_override_function_with_return(regs);
> 
> 			bpf_fprobe_exit(fp, ip, regs, private);
> 			return false;
> 		}
> 	}
> 
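> 	/* Run the exit handler only if FEXIT programs are attached */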
> 	return fexit->nr_links;
> }
> 
> There's a lot of low-hanging fruit to speed up there. I wouldn't be too
> quick to throw out this solution before it has had the same optimization
> care that direct calls have had.
> 
> For example, trampolines currently only allow attaching to functions with 6
> parameters or fewer (3 on x86_32). You could make 7 specific callbacks, one
> for each count from zero to 6 parameters, and unroll the argument loop.
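
Something like this, I suppose? A rough sketch of the unrolling idea; the
per-arity handlers, the bpf_fprobe_entry_handlers[] table and the
__bpf_fprobe_run_progs() helper are made-up names for illustration, not
code from Florent's branch:

	/* One handler per arity: the argument copies are unrolled at
	 * compile time, so there is no loop over fprobe_ctx->nr_args
	 * on the hot path.
	 */
	static bool bpf_fprobe_entry_args2(struct fprobe *fp, unsigned long ip,
					   struct ftrace_regs *regs,
					   void *private)
	{
		struct bpf_fprobe_call_context *call_ctx = private;

		memset(&call_ctx->ctx, 0, sizeof(call_ctx->ctx));
		call_ctx->ip = ip;
		call_ctx->args[0] = ftrace_regs_get_argument(regs, 0);
		call_ctx->args[1] = ftrace_regs_get_argument(regs, 1);

		/* fentry/fmod_ret dispatch shared by all arities */
		return __bpf_fprobe_run_progs(fp, ip, regs, call_ctx);
	}

	/* ... bpf_fprobe_entry_args0() to bpf_fprobe_entry_args6() ... */

	/* chosen once at attach time instead of looping on every call */
	fp->entry_handler = bpf_fprobe_entry_handlers[fprobe_ctx->nr_args];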
> 
> It would also be interesting to run perf to see where the overhead is. There
> may be other places to work on to make it almost as fast as direct calls,
> without the other baggage.
> 

There is something wrong with perf on my Pi 4; I'll send the perf report
after I fix it.
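
Once it works, I plan to profile the same dd workload with something like
the following (just generic perf usage, nothing arm64 specific):

  # perf record -g -- dd if=/dev/zero of=/dev/null count=1000000
  # perf report --no-children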

> -- Steve
> 
>>
>> As it stands, the performance impact of the fprobe-based
>> implementation would be too high for us. I wonder how much Mark's idea
>> here https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/ftrace/per-callsite-ops
>> would help, but it doesn't work right now.



