kprobe overhead and OPTPROBES implementation on ARM64

Song Bao Hua (Barry Song) song.bao.hua at hisilicon.com
Mon Aug 10 00:47:08 EDT 2020


Hi,
I was trying to use kprobe example module[1] and eBPF/BCC funclatency[2] to analyze the latency of
a kernel function. eBPF/BCC funclatency is also based on kprobe.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/kprobes/kretprobe_example.c
[2] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py

I found the latency reported by kprobe was much larger than the actual execution time of a function. Then I wrote an
empty function in kernel to measure the overhead of kprobe:

static int do_empty(void)                                                                                                                      
{                                                                                                                                              
	return 0;                                                                                                                              
}                                                                                                                                              

typedef int (*empty)(void);                                                                                                                    
empty p = do_empty;                                                                                                                            

static ssize_t globalfifo_read(struct file *filp, char __user *buf,                                                                            
                               size_t count, loff_t *ppos)                                                                                     
{                                                                                                                                              
	...                                                                                                                            
	for(i = 0;i < 100; i++)                                                                                                                     
		a = p();                                                                                                                               
	...
}

After that, I triggered the execution of do_empty() by:
root at ubuntu:~# modprobe kretprobe_example func=do_empty
root at ubuntu:~# cat /dev/globalfifo    /* this command will lead to the call of globalfifo_read() */
root at ubuntu:~# dmesg -c                                                                                                                        
[ 2853.971945] do_empty returned 0 and took 3260 ns to execute
[ 2853.971948] do_empty returned 0 and took 490 ns to execute
[ 2853.971950] do_empty returned 0 and took 450 ns to execute
[ 2853.971952] do_empty returned 0 and took 430 ns to execute
[ 2853.971953] do_empty returned 0 and took 430 ns to execute
[ 2853.971955] do_empty returned 0 and took 420 ns to execute
[ 2853.971956] do_empty returned 0 and took 430 ns to execute
[ 2853.971957] do_empty returned 0 and took 430 ns to execute
[ 2853.971959] do_empty returned 0 and took 420 ns to execute
[ 2853.971960] do_empty returned 0 and took 420 ns to execute
[ 2853.971962] do_empty returned 0 and took 430 ns to execute
[ 2853.971963] do_empty returned 0 and took 430 ns to execute
[ 2853.971964] do_empty returned 0 and took 430 ns to execute
[ 2853.971966] do_empty returned 0 and took 430 ns to execute
[ 2853.971967] do_empty returned 0 and took 430 ns to execute
[ 2853.971969] do_empty returned 0 and took 430 ns to execute
[ 2853.971970] do_empty returned 0 and took 420 ns to execute
[ 2853.971972] do_empty returned 0 and took 430 ns to execute
[ 2853.971973] do_empty returned 0 and took 430 ns to execute
[ 2853.971974] do_empty returned 0 and took 430 ns to execute
[ 2853.971976] do_empty returned 0 and took 430 ns to execute
[ 2853.971977] do_empty returned 0 and took 420 ns to execute
[ 2853.971978] do_empty returned 0 and took 430 ns to execute
[ 2853.971980] do_empty returned 0 and took 420 ns to execute
[ 2853.971981] do_empty returned 0 and took 420 ns to execute
[ 2853.971983] do_empty returned 0 and took 430 ns to execute
[ 2853.971984] do_empty returned 0 and took 430 ns to execute
[ 2853.971986] do_empty returned 0 and took 420 ns to execute
[ 2853.971987] do_empty returned 0 and took 430 ns to execute
[ 2853.971988] do_empty returned 0 and took 430 ns to execute
[ 2853.971990] do_empty returned 0 and took 430 ns to execute
[ 2853.971991] do_empty returned 0 and took 430 ns to execute
[ 2853.971992] do_empty returned 0 and took 430 ns to execute
[ 2853.971994] do_empty returned 0 and took 430 ns to execute
[ 2853.971995] do_empty returned 0 and took 430 ns to execute
[ 2853.971997] do_empty returned 0 and took 430 ns to execute
[ 2853.971998] do_empty returned 0 and took 420 ns to execute
[ 2853.972000] do_empty returned 0 and took 430 ns to execute
[ 2853.972001] do_empty returned 0 and took 430 ns to execute
[ 2853.972002] do_empty returned 0 and took 430 ns to execute
[ 2853.972004] do_empty returned 0 and took 430 ns to execute
[ 2853.972005] do_empty returned 0 and took 430 ns to execute
[ 2853.972006] do_empty returned 0 and took 420 ns to execute
[ 2853.972008] do_empty returned 0 and took 430 ns to execute
[ 2853.972009] do_empty returned 0 and took 430 ns to execute
[ 2853.972011] do_empty returned 0 and took 430 ns to execute
[ 2853.972012] do_empty returned 0 and took 420 ns to execute
[ 2853.972014] do_empty returned 0 and took 420 ns to execute
[ 2853.972015] do_empty returned 0 and took 430 ns to execute
[ 2853.972016] do_empty returned 0 and took 430 ns to execute
[ 2853.972018] do_empty returned 0 and took 420 ns to execute
[ 2853.972019] do_empty returned 0 and took 430 ns to execute
[ 2853.972020] do_empty returned 0 and took 430 ns to execute
[ 2853.972022] do_empty returned 0 and took 430 ns to execute
[ 2853.972023] do_empty returned 0 and took 430 ns to execute
[ 2853.972025] do_empty returned 0 and took 430 ns to execute
[ 2853.972026] do_empty returned 0 and took 420 ns to execute
[ 2853.972028] do_empty returned 0 and took 420 ns to execute
[ 2853.972029] do_empty returned 0 and took 420 ns to execute
[ 2853.972030] do_empty returned 0 and took 430 ns to execute
[ 2853.972032] do_empty returned 0 and took 430 ns to execute
[ 2853.972033] do_empty returned 0 and took 420 ns to execute
[ 2853.972034] do_empty returned 0 and took 420 ns to execute
[ 2853.972036] do_empty returned 0 and took 430 ns to execute
[ 2853.972037] do_empty returned 0 and took 420 ns to execute
[ 2853.972039] do_empty returned 0 and took 430 ns to execute
[ 2853.972040] do_empty returned 0 and took 420 ns to execute
[ 2853.972042] do_empty returned 0 and took 430 ns to execute
[ 2853.972043] do_empty returned 0 and took 420 ns to execute
[ 2853.972044] do_empty returned 0 and took 430 ns to execute
[ 2853.972046] do_empty returned 0 and took 420 ns to execute

>From the above data, we can see for the 1st time (i==0), kprobe latency can be 3260 ns.
For a "hot" probed function, i==2 to i==99, kprobe latency can be 430ns.

funclatency in eBPF/BCC also shows the same result:
root at ubuntu:/usr/share/bcc/tools# ./funclatency do_empty 
Tracing 1 functions for "do_empty"... Hit Ctrl-C to end. 
^C 
     nsecs               : count     distribution 
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 99       |****************************************|
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 1        |                                        |

The data was got while I set cpufreq to "performance" and make the CPU running in
the highest frequency 2.6Ghz. And the cpuinfo is like:
$ cat /proc/cpuinfo | more                                                                                     
processor       : 0 
BogoMIPS        : 200.00 
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm 
CPU implementer : 0x48 
CPU architecture: 8 
CPU variant     : 0x1 
CPU part        : 0xd01 
CPU revision    : 0 

If I set cpufreq to "ondemand", the overhead of kprobe will be much bigger:

root at ubuntu:~# dmesg -c                                                                                                                        
[ 3363.950350] Planted return probe at do_empty: 000000005a96fe8a
[ 3370.512534] do_empty returned 0 and took 14546 ns to execute
[ 3370.512553] do_empty returned 0 and took 3299 ns to execute
[ 3370.512563] do_empty returned 0 and took 3019 ns to execute
[ 3370.512573] do_empty returned 0 and took 2959 ns to execute
[ 3370.512582] do_empty returned 0 and took 2969 ns to execute
[ 3370.512590] do_empty returned 0 and took 2980 ns to execute
[ 3370.512599] do_empty returned 0 and took 2979 ns to execute
[ 3370.512608] do_empty returned 0 and took 2979 ns to execute
[ 3370.512617] do_empty returned 0 and took 2970 ns to execute
[ 3370.512625] do_empty returned 0 and took 2979 ns to execute
[ 3370.512634] do_empty returned 0 and took 2979 ns to execute
[ 3370.512643] do_empty returned 0 and took 2979 ns to execute
[ 3370.512651] do_empty returned 0 and took 2980 ns to execute
[ 3370.512661] do_empty returned 0 and took 2979 ns to execute
[ 3370.512670] do_empty returned 0 and took 2989 ns to execute
[ 3370.512679] do_empty returned 0 and took 2980 ns to execute
[ 3370.512688] do_empty returned 0 and took 2969 ns to execute
[ 3370.512697] do_empty returned 0 and took 2979 ns to execute
[ 3370.512705] do_empty returned 0 and took 2979 ns to execute
[ 3370.512715] do_empty returned 0 and took 2979 ns to execute
[ 3370.512723] do_empty returned 0 and took 2979 ns to execute

>From the above data, we can find for the 1st time (i==0), kprobe latency can be 14546 ns.
For a "hot" probed function, i==2 to i==99, kprobe latency can be 3000ns.

funclatency shows the similar result:

root at ubuntu:/usr/share/bcc/tools# ./funclatency do_empty 
Tracing 1 functions for "do_empty"... Hit Ctrl-C to end. 
^C 
     nsecs               : count     distribution 
         0 -> 1          : 0        |                                        | 
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 98       |****************************************|
      4096 -> 8191       : 1        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 1        |                                        |

Such a huge overhead of kprobe seems to make the latency data gotten from kprobe and funclatency
distorted and make those tools practically useless.

Considering kprobe and funclatency are very convenient and even popular tools which are used widely,
do you think it is necessary to implement OPTPROBES on ARM64? 

On x86, it is said OPTPROBES can greatly decrease the overhead of kprobe.
I noticed that OPTPROBES are supported on X86, PowerPC and ARM32. But on ARM64, it is marked as
"TODO".

Thanks
Barry



More information about the linux-arm-kernel mailing list