CFT: move outer_cache_sync() out of line

Catalin Marinas catalin.marinas at arm.com
Tue Jan 13 10:39:20 PST 2015


On Tue, Jan 13, 2015 at 05:09:11PM +0000, Russell King - ARM Linux wrote:
> On Tue, Jan 13, 2015 at 04:27:54PM +0000, Russell King - ARM Linux wrote:
> > On Tue, Jan 13, 2015 at 03:45:52PM +0000, Catalin Marinas wrote:
> > > On Mon, Jan 12, 2015 at 04:36:48PM +0000, Russell King - ARM Linux wrote:
> > > > The theory is that moving outer_cache_sync() out of line moves the
> > > > conditional test from multiple sites to a single location, which
> > > > in turn means that the branch predictor stands a better chance of
> > > > making the correct decision.
> > > [...]
> > > > --- a/arch/arm/mm/l2c-common.c
> > > > +++ b/arch/arm/mm/l2c-common.c
> > > > @@ -7,9 +7,17 @@
> > > >   * published by the Free Software Foundation.
> > > >   */
> > > >  #include <linux/bug.h>
> > > > +#include <linux/export.h>
> > > >  #include <linux/smp.h>
> > > >  #include <asm/outercache.h>
> > > >  
> > > > +void outer_sync(void)
> > > > +{
> > > > +	if (outer_cache.sync)
> > > > +		outer_cache.sync();
> > > > +}
> > > 
> > > But don't you end up with two branches now? With an outer cache, I guess
> > > the additional branch won't be noticed, especially if the prediction on
> > > the outer_cache.sync() call is better (but that's not a simple
> > > conditional branch prediction, it relies on predicting the value of the
> > > pointer, not entirely sure how this works).
> > 
> > Yes, we end up with two branches.  The branch to this function is 
> > unconditional and always taken - if a branch predictor gets that wrong,
> > then the branch predictor is broken!
> > 
> > However, that is not the only side effect of this change: the other
> > change is a reduction in instruction cache usage.
> > 
> > > However, on SoCs without an outer cache (and a single kernel image),
> > > we end up with a permanent branch/return even though such a branch
> > > would not be needed.
> > 
> > Right, so we go from something like this:
> > 
> >  188:   e594301c        ldr     r3, [r4, #28]
> >  18c:   e3002356        movw    r2, #854        ; 0x356
> >  190:   e3402100        movt    r2, #256        ; 0x100
> >  194:   e583204c        str     r2, [r3, #76]   ; 0x4c
> >  198:   f57ff04e        dsb     st
> >  19c:   e5953014        ldr     r3, [r5, #20]
> >  1a0:   e3530000        cmp     r3, #0
> >  1a4:   0a000000        beq     1ac <ipu_dp_csc_init+0x1ac>
> >  1a8:   e12fff33        blx     r3
> > 
> > down to:
> > 
> >  1e0:   e594301c        ldr     r3, [r4, #28]
> >  1e4:   e3002356        movw    r2, #854        ; 0x356
> >  1e8:   e3402100        movt    r2, #256        ; 0x100
> >  1ec:   e583204c        str     r2, [r3, #76]   ; 0x4c
> >  1f0:   f57ff04e        dsb     st
> >  1f4:   ebfffffe        bl      0 <outer_sync>
> > 
> > which saves 12 bytes per writel() - in favour of having the contents
> > of outer_cache_sync() in a single location in the kernel, where it
> > will therefore be hotter in the instruction cache.
> > 
> > (Addresses different because the optimiser arranges the functions
> > differently.)
> > 
> > I'm willing to bet that the excessive use of "inline" in the kernel is
> > actually harming things: it made sense when taking a branch was expensive
> > (such as with older ARM CPUs) but this isn't the case anymore.
> 
> BTW, what we /could/ do is to be a little more clever at these (and
> similar) callsites.
> 
> We could have the linker build a table of callsites and a pointer to
> the indirect function pointer, and we initialise each of the callsites
> to a small "dynamic patcher."
> 
> When one of these sites is hit, the dynamic patcher is called, which
> looks up the site in the linker built table, and patches the callsite
> with a direct call to the required function, or if the function pointer
> is NULL, replaces it with a NOP instruction.
> 
> This eliminates the loads to get the function pointer, the test, and
> the conditional branch, turning the whole thing into a _single_ well
> known branch or a NOP instruction.
> 
> Since things like the cache ops, processor ops, etc are all constant
> at runtime, I think this would make sense.

Jump labels. Maybe we could get them to do a BL instead of B (unless
they can do this already). This would be beneficial in all cases.

(BTW, I'm off tomorrow, so not able to follow up)

-- 
Catalin
