Performance impact of CONFIG_FUNCTION_TRACER

Tue Jul 5 15:27:46 PDT 2022

On Tue, 5 Jul 2022 23:59:48 +0200
Sascha Hauer <sha at pengutronix.de> wrote:

> > 
> > As I believe due to using a link register for function calls, ARM
> > requires adding two 4 byte nops to every function where as x86 only
> > adds a single 5 byte nop.
> > 
> > Although nops are very fast (they should not be processed in the CPU's
> > pipe line, but I don't know if that's true for every arch). It also
> > affects instruction cache misses, as adding 8 bytes around the code
> > will cause more cache misses than when they do not exist.  
> 
> Just digged around a bit and saw that on ARM it's not even a real nop.
> The compiler emits:
> 
> 	push    {lr}
> 	bl      8010e7c0 <__gnu_mcount_nc>
> 
> Which is then turned into a nop by replacing the second instruction with
> 
> 	add   sp, sp, #4
> 
> to bring the stack pointer back to its original value. This indeed must
> be processed by the CPU pipeline. I wonder if that could be optimized by
> replacing both instructions with a nop. I have no idea though if that's
> feasible at all or if the overhead would even get smaller by that.

The problem is that there's no easy way to do that, because a task
could have been preempted after doing the 'push {lr}' and before the
'bl'. Thus, you create a race by changing either one to a nop first.

I wonder if it would have been better to change the first one to a jump
passed the second :-/

Actually, if you don't mind setups that take a long time, if you change
the first to a jump passed the second, then do synchronize_rcu_rude()
(which may take a while, possibly several seconds or more) then you know
that all users now only see the jump, and none will see the bl. Then
you could convert the bl to nop, and then even change the jump to nop
after that.

To convert back, you would need to reverse it. Convert the first nop
back to a jmp, run synchronize_rcu_rude(). Then convert the second nop
to the bl, and then convert the first to the push {lr}.

> 
> > 
> > Also, there's some configurations that use the old mcount that does add
> > some more code to handle the mcount case.
> > 
> > So if this is just to have us change the kconfig, I'm happy to do that.  
> 
> Yes, would be good to make the kconfig text clear. The overhead itself
> is fine when people know that's the price to pay for getting the
> function tracer.

Agreed. I'll write up a patch.

-- Steve