[PATCH v3 11/13] ARM: cacheflush: avoid clobbering the frame pointer

Ard Biesheuvel ardb at kernel.org
Tue Feb 8 02:08:19 PST 2022


On Mon, 7 Feb 2022 at 20:12, Nick Desaulniers <ndesaulniers at google.com> wrote:
>
> On Thu, Feb 3, 2022 at 12:22 AM Ard Biesheuvel <ardb at kernel.org> wrote:
> >
> > Thumb2 uses R7 rather than R11 as the frame pointer, and even if we
> > rarely use a frame pointer to begin with when building in Thumb2 mode,
> > there are cases where it is required by the compiler (Clang when
> > inserting profiling hooks via -pg).
> >
> > However, preserving and restoring the frame pointer is risky, as any
> > unhandled exceptions raised in the meantime will produce a bogus
> > backtrace, and it would be better not to touch the frame pointer at all.
> > This is the case even when CONFIG_FRAME_POINTER is not set, as the
> > unwind directive used by the unwinder may also use R7 or R11 as the
> > unwind anchor, even if the frame pointer is not managed strictly
> > according to the frame pointer ABI.
> >
> > So let's tweak the cacheflush asm code not to clobber R7 or R11 at all,
> > so that we can drop R7 from the clobber lists of the inline asm blocks
> > that call these routines, and remove the code that preserves/restores
> > R11.
> >
> > Signed-off-by: Ard Biesheuvel <ardb at kernel.org>
> > ---
> >  arch/arm/include/asm/cacheflush.h  | 12 ++----
> >  arch/arm/mach-exynos/mcpm-exynos.c |  6 +--
> >  arch/arm/mm/cache-v7.S             | 40 +++++++++-----------
> >  3 files changed, 23 insertions(+), 35 deletions(-)
> >
> > diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h
> > index e68fb879e4f9..d27782331556 100644
> > --- a/arch/arm/include/asm/cacheflush.h
> > +++ b/arch/arm/include/asm/cacheflush.h
> > @@ -446,15 +446,10 @@ static inline void __sync_cache_range_r(volatile void *p, size_t size)
> >   *   however some exceptions may exist.  Caveat emptor.
> >   *
> >   * - The clobber list is dictated by the call to v7_flush_dcache_*.
> > - *   fp is preserved to the stack explicitly prior disabling the cache
> > - *   since adding it to the clobber list is incompatible with having
> > - *   CONFIG_FRAME_POINTER=y.  ip is saved as well if ever r12-clobbering
> > - *   trampoline are inserted by the linker and to keep sp 64-bit aligned.
> >   */
> >  #define v7_exit_coherency_flush(level) \
> >         asm volatile( \
> >         ".arch  armv7-a \n\t" \
> > -       "stmfd  sp!, {fp, ip} \n\t" \
> >         "mrc    p15, 0, r0, c1, c0, 0   @ get SCTLR \n\t" \
> >         "bic    r0, r0, #"__stringify(CR_C)" \n\t" \
> >         "mcr    p15, 0, r0, c1, c0, 0   @ set SCTLR \n\t" \
> > @@ -464,10 +459,9 @@ static inline void __sync_cache_range_r(volatile void *p, size_t size)
> >         "bic    r0, r0, #(1 << 6)       @ disable local coherency \n\t" \
> >         "mcr    p15, 0, r0, c1, c0, 1   @ set ACTLR \n\t" \
> >         "isb    \n\t" \
> > -       "dsb    \n\t" \
> > -       "ldmfd  sp!, {fp, ip}" \
> > -       : : : "r0","r1","r2","r3","r4","r5","r6","r7", \
> > -             "r9","r10","lr","memory" )
> > +       "dsb" \
> > +       : : : "r0","r1","r2","r3","r4","r5","r6", \
> > +             "r9","r10","ip","lr","memory" )
> >
> >  void flush_uprobe_xol_access(struct page *page, unsigned long uaddr,
> >                              void *kaddr, unsigned long len);
> > diff --git a/arch/arm/mach-exynos/mcpm-exynos.c b/arch/arm/mach-exynos/mcpm-exynos.c
> > index cd861c57d5ad..fd0dbeb93357 100644
> > --- a/arch/arm/mach-exynos/mcpm-exynos.c
> > +++ b/arch/arm/mach-exynos/mcpm-exynos.c
> > @@ -35,7 +35,6 @@ static bool secure_firmware __ro_after_init;
> >   */
> >  #define exynos_v7_exit_coherency_flush(level) \
> >         asm volatile( \
> > -       "stmfd  sp!, {fp, ip}\n\t"\
> >         "mrc    p15, 0, r0, c1, c0, 0   @ get SCTLR\n\t" \
> >         "bic    r0, r0, #"__stringify(CR_C)"\n\t" \
> >         "mcr    p15, 0, r0, c1, c0, 0   @ set SCTLR\n\t" \
> > @@ -50,11 +49,10 @@ static bool secure_firmware __ro_after_init;
> >         "mcr    p15, 0, r0, c1, c0, 1   @ set ACTLR\n\t" \
> >         "isb\n\t" \
> >         "dsb\n\t" \
> > -       "ldmfd  sp!, {fp, ip}" \
> >         : \
> >         : "Ir" (pmu_base_addr + S5P_INFORM0) \
> > -       : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", \
> > -         "r9", "r10", "lr", "memory")
> > +       : "r0", "r1", "r2", "r3", "r4", "r5", "r6", \
> > +         "r9", "r10", "ip", "lr", "memory")
> >
> >  static int exynos_cpu_powerup(unsigned int cpu, unsigned int cluster)
> >  {
> > diff --git a/arch/arm/mm/cache-v7.S b/arch/arm/mm/cache-v7.S
> > index 830bbfb26ca5..b9f55f0bc278 100644
> > --- a/arch/arm/mm/cache-v7.S
> > +++ b/arch/arm/mm/cache-v7.S
> > @@ -90,7 +90,7 @@ ENDPROC(v7_flush_icache_all)
> >   *
> >   *     Flush the D-cache up to the Level of Unification Inner Shareable
> >   *
> > - *     Corrupted registers: r0-r7, r9-r11 (r6 only in Thumb mode)
> > + *     Corrupted registers: r0-r6, r9-r10
> >   */
> >
> >  ENTRY(v7_flush_dcache_louis)
> > @@ -117,7 +117,7 @@ ENDPROC(v7_flush_dcache_louis)
> >   *
> >   *     Flush the whole D-cache.
> >   *
> > - *     Corrupted registers: r0-r7, r9-r11 (r6 only in Thumb mode)
> > + *     Corrupted registers: r0-r6, r9-r10
> >   *
> >   *     - mm    - mm_struct describing address space
> >   */
> > @@ -149,22 +149,22 @@ flush_levels:
> >         movw    r4, #0x3ff
> >         ands    r4, r4, r1, lsr #3              @ find maximum number on the way size
> >         clz     r5, r4                          @ find bit position of way size increment
> > -       movw    r7, #0x7fff
> > -       ands    r7, r7, r1, lsr #13             @ extract max number of the index size
> > +       movw    r6, #0x7fff
> > +       and     r1, r6, r1, lsr #13             @ extract max number of the index size
> > +       mov     r6, #1
> > +       movne   r4, r4, lsl r5                  @ # of ways shifted into bits [31:...]
> > +       movne   r6, r6, lsl r5                  @ 1 shifted left by same amount
>
> This group of changes was definitely trickier to review...can you
> provide more color as to why the conditional `mov`s and the change in
> the loop back-edge? (I appreciate you explaining some on IRC; it might
> be helpful to other reviewers to have more commentary on list) The
> rest LGTM.
>

The primary goal is to use fewer registers, so that neither R7 nor
R11 needs to be clobbered.

The first step is to reuse R1 in the loop: its value is not needed
once the number of sets and ways has been derived from it, which
allows us to avoid using R7.
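
Concretely, that is this part of the hunk quoted above (comment added
here for clarity; note that the ANDS is also relaxed to AND, presumably
so the flags set by the earlier ANDS on R4 survive for the MOVNEs that
follow):

movw r6, #0x7fff
and  r1, r6, r1, lsr #13   @ max index now lives in R1 itself; R7 is free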

The second step is to rewrite the inner loop so that R6 is used as
the temporary on both ARM and Thumb2.

Currently, R6 is only used on Thumb2, because Thumb2 does not support
register-controlled shifts in the second operand of data processing
instructions. So ARM has

orr r11, r10, r4, lsl r5        @ factor way and cache number into r11
orr r11, r11, r9, lsl r2        @ factor index number into r11
mcr p15, 0, r11, c7, c14, 2     @ clean & invalidate by set/way

whereas Thumb2 has

lsl r6, r4, r5                  @ shift the way number in a separate step
orr r11, r10, r6                @ factor way and cache number into r11
lsl r6, r9, r2                  @ shift the index number in a separate step
orr r11, r11, r6                @ factor index number into r11
mcr p15, 0, r11, c7, c14, 2     @ clean & invalidate by set/way

Note that the value of R4 << R5 only changes in the outer loop, so if
we precompute it, we can turn this into the following for both ARM and
Thumb2:

mov r6, r9, lsl r2              @ factor index number into r6
orr r6, r6, r4                  @ r4 already shifted left by r5
orr r6, r6, r10                 @ factor cache number in as well
mcr p15, 0, r6, c7, c14, 2      @ clean & invalidate by set/way

which no longer needs R11 as a temporary.

Since R4 now carries the left-shifted way value, we need to subtract 1
shifted left by the same amount after every iteration of the outer
loop, and terminate once the subtraction produces a borrow (carry
cleared on ARM); in other words, we keep looping as long as the carry
remains set. Note that the existing BGE instruction relies on signed
arithmetic, which is no longer suitable now that bit 31 holds way bits
rather than acting as a sign bit.
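
Roughly, the outer loop back-edge then becomes something like this (a
sketch rather than the literal hunk; the loop label and register names
are assumed from the context above):

subs r4, r4, r6   @ decrement the way field by the precomputed 1 << R5
bcs  loop1        @ no borrow, i.e. carry still set: more ways to flush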

To handle a corner case where the cache geometry is reported as one
cache and one way (793acf870ea3), we use MOVNE to skip the left shifts
at the start when R4 == 0, and to force the outer loop decrement to be
0x1, as otherwise the loop would never terminate.
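
To spell that out using the hunk quoted above (comments added here):

mov   r6, #1
movne r4, r4, lsl r5   @ skipped when R4 == 0 (Z still set by the ANDS on R4)
movne r6, r6, lsl r5   @ skipped too, so the decrement stays 1

With R4 == 0, CLZ leaves R5 equal to 32, and a register-specified shift
by 32 produces a zero result, so shifting R6 unconditionally would make
the outer loop decrement zero, and the subtraction would never produce
a borrow.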


