[PATCH v3 11/13] ARM: cacheflush: avoid clobbering the frame pointer
Nick Desaulniers
ndesaulniers at google.com
Tue Feb 8 14:34:14 PST 2022
On Tue, Feb 8, 2022 at 2:08 AM Ard Biesheuvel <ardb at kernel.org> wrote:
>
> On Mon, 7 Feb 2022 at 20:12, Nick Desaulniers <ndesaulniers at google.com> wrote:
> >
> > On Thu, Feb 3, 2022 at 12:22 AM Ard Biesheuvel <ardb at kernel.org> wrote:
> > >
> > > @@ -149,22 +149,22 @@ flush_levels:
> > > movw r4, #0x3ff
> > > ands r4, r4, r1, lsr #3 @ find maximum number on the way size
> > > clz r5, r4 @ find bit position of way size increment
> > > - movw r7, #0x7fff
> > > - ands r7, r7, r1, lsr #13 @ extract max number of the index size
> > > + movw r6, #0x7fff
> > > + and r1, r6, r1, lsr #13 @ extract max number of the index size
> > > + mov r6, #1
> > > + movne r4, r4, lsl r5 @ # of ways shifted into bits [31:...]
> > > + movne r6, r6, lsl r5 @ 1 shifted left by same amount
> >
> > This group of changes was definitely trickier to review... Can you
> > provide more color on why the conditional `mov`s and the change in
> > the loop back-edge are needed? (I appreciate you explaining some on
> > IRC; it might be helpful to other reviewers to have more commentary
> > on list.) The rest LGTM.
> >
>
> So the primary goal is to use fewer registers, to avoid clobbering R7 and R11.
>
> So the first step is to reuse R1 in the loop, as its value is not used
> after deriving the number of sets and ways from it, allowing us to
> avoid using R7.
Yep. LGTM.
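(To spell out the register reuse in C for other readers -- a rough
sketch; the function and variable names are mine, not the patch's:

unsigned int extract_sets(unsigned int r1 /* CCSIDR */)
{
  /* old: the extracted field needed a second register (r7):
   *   unsigned int r7 = (r1 >> 13) & 0x7fff; */

  /* new: r1 is dead once the field is pulled out of CCSIDR, so it
   * can carry the max index (set) number itself, freeing r7 */
  r1 = (r1 >> 13) & 0x7fff;
  return r1;
}

i.e. the raw CCSIDR value is never read again after this point.)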
>
> The second step is to rewrite the inner loop so that we use R6 on ARM
> as well as Thumb2 as a temporary.
Yep. LGTM.
>
> Currently, R6 is used on Thumb2 only because it does not support
> variable inline shifts (register-shifted register operands). So ARM has
>
> orr r11, r10, r4, lsl r5
> orr r11, r11, r9, lsl r2
> mcr p15, 0, r11, c7, c14, 2
>
> whereas Thumb2 has
>
> lsl r6, r4, r5
> orr r11, r10, r6
> lsl r6, r9, r2
> orr r11, r11, r6
> mcr p15, 0, r11, c7, c14, 2
>
> Note that the value of R4 << R5 only changes in the outer loop, so if
> we can precompute it, we can turn this into the following for both ARM
> and Thumb
>
> mov r6, r9, lsl r2
> orr r6, r6, r4 <-- already shifted left by r5
> orr r6, r6, r10
> mcr p15, 0, r6, c7, c14, 2
>
> which no longer needs R11 as a temporary.
Yep. LGTM.
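(Again in rough C, with dccisw() standing in for the
"mcr p15, 0, rX, c7, c14, 2" (clean+invalidate by set/way) -- my
naming, not the kernel's. Before, per inner-loop iteration:

  r11 = r10 | (r4 << r5);  /* level | (way << way_shift) */
  r11 |= r9 << r2;         /* ...   | (set << set_shift) */
  dccisw(r11);

After, with r4 pre-shifted once in the outer loop:

  r6 = r9 << r2;           /* set << set_shift  */
  r6 |= r4;                /* | pre-shifted way */
  r6 |= r10;               /* | level           */
  dccisw(r6);

so the inner loop needs only one temporary on both ARM and Thumb2.)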
>
> Since R4 now carries the left-shifted value, we need to subtract 1
> shifted left by the same amount after every iteration of the outer
> loop, and terminate once the subtraction triggers a borrow (!carry on
> ARM), so we continue as long as the carry remains set. Note that the
> existing BGE instruction relies on signed arithmetic, which is
> unsuitable as bit 31 is no longer a sign bit.
Ah, perfect. This was the part I was struggling with most. I guess
before we had something almost like (for loop1):
int r4 = ...;
do {
  r11 = r10 | (r4 << r5);
  ...
} while (--r4 >= 0);  /* subs r4, r4, #1; bge loop1 */
Now it's
unsigned r4 = ...;
r4 = r4 << r5;  /* movne */
r6 = 1 << r5;   /* mov #1 + movne lsl */
do {
  ...
  borrow = r4 < r6;  /* subs r4, r4, r6: C clear means borrow */
  r4 = r4 - r6;
} while (!borrow);   /* bcs loop1 */
>
> To handle a corner case where the cache geometry is reported as one
> set and one way (793acf870ea3), we use MOVNE to skip the left-shift
> at the start when R4 == 0x0, and to force the outer loop decrement to
> be 0x1, as otherwise, the loop would never terminate.
Right, so when r4 == 0x0 and r6 == 0x1, r4 = 0x0 - 0x1 == 0xFFFFFFFF.
That sets the N bit of the PSR, but because we only branch back if C
was set to 1 (which it would be only if r4 >= r6, i.e. the subtraction
did not borrow), we exit loop1 after the single iteration.
(I found this fun little ARM simulator:
http://www.csbio.unc.edu/mcmillan/miniARM.html; it definitely helped
me observe the PSR through the loop!)
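Modeling that exit condition in C for my own sanity -- a toy model of
SUBS+BCS with my own variable names, not kernel code:

unsigned int r4 = 0, r6 = 1;  /* one way: movne skipped, r6 forced to 1 */
do {
  /* ... clean/invalidate one set/way here ... */
  int carry = (r4 >= r6);  /* SUBS sets C iff the sub doesn't borrow */
  r4 = r4 - r6;            /* 0x0 - 0x1 wraps to 0xFFFFFFFF; N is set */
} while (carry);           /* BCS not taken, so the body ran exactly once */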
Thanks again for taking the time to explain it; I appreciate it!
Reviewed-by: Nick Desaulniers <ndesaulniers at google.com>
--
Thanks,
~Nick Desaulniers