[PATCH v3 11/13] ARM: cacheflush: avoid clobbering the frame pointer
Nick Desaulniers
ndesaulniers at google.com
Tue Feb 8 14:34:14 PST 2022
On Tue, Feb 8, 2022 at 2:08 AM Ard Biesheuvel <ardb at kernel.org> wrote:
>
> On Mon, 7 Feb 2022 at 20:12, Nick Desaulniers <ndesaulniers at google.com> wrote:
> >
> > On Thu, Feb 3, 2022 at 12:22 AM Ard Biesheuvel <ardb at kernel.org> wrote:
> > >
> > > @@ -149,22 +149,22 @@ flush_levels:
> > > movw r4, #0x3ff
> > > ands r4, r4, r1, lsr #3 @ find maximum number on the way size
> > > clz r5, r4 @ find bit position of way size increment
> > > - movw r7, #0x7fff
> > > - ands r7, r7, r1, lsr #13 @ extract max number of the index size
> > > + movw r6, #0x7fff
> > > + and r1, r6, r1, lsr #13 @ extract max number of the index size
> > > + mov r6, #1
> > > + movne r4, r4, lsl r5 @ # of ways shifted into bits [31:...]
> > > + movne r6, r6, lsl r5 @ 1 shifted left by same amount
> >
> > This group of changes was definitely trickier to review... Can you
> > provide more color on why the conditional `mov`s and the change in
> > the loop back-edge are needed? (I appreciate you explaining some on
> > IRC; it might be helpful to other reviewers to have more commentary
> > on list.) The rest LGTM.
> >
>
> So the primary goal is to use fewer registers, to avoid clobbering R7 and R11.
>
> So the first step is to reuse R1 in the loop, as its value is not used
> after deriving the number of sets and ways from it, allowing us to
> avoid using R7.
Yep. LGTM.
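(To spell out the register reuse in C for other readers -- a rough
sketch; the function and variable names are mine, not the patch's:

unsigned int extract_sets(unsigned int r1 /* CCSIDR */)
{
  /* old: the extracted field needed a second register (r7):
   *   unsigned int r7 = (r1 >> 13) & 0x7fff; */

  /* new: r1 is dead once the field is pulled out of CCSIDR, so it
   * can carry the max index (set) number itself, freeing r7 */
  r1 = (r1 >> 13) & 0x7fff;
  return r1;
}

i.e. the raw CCSIDR value is never read again after this point.)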
>
> The second step is to rewrite the inner loop so that we use R6 on ARM
> as well as Thumb2 as a temporary.
Yep. LGTM.
>
> Currently, R6 is used on Thumb2 only because it does not support
> variable inline shifts (register-shifted register operands). So ARM has
>
> orr r11, r10, r4, lsl r5
> orr r11, r11, r9, lsl r2
> mcr p15, 0, r11, c7, c14, 2
>
> whereas Thumb2 has
>
> lsl r6, r4, r5
> orr r11, r10, r6
> lsl r6, r9, r2
> orr r11, r11, r6
> mcr p15, 0, r11, c7, c14, 2
>
> Note that the value of R4 << R5 only changes in the outer loop, so if
> we can precompute it, we can turn this into the following for both ARM
> and Thumb
>
> mov r6, r9, lsl r2
> orr r6, r6, r4 <-- already shifted left by r5
> orr r6, r6, r10
> mcr p15, 0, r6, c7, c14, 2
>
> which no longer needs R11 as a temporary.
Yep. LGTM.
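(Again in rough C, with dccisw() standing in for the
"mcr p15, 0, rX, c7, c14, 2" (clean+invalidate by set/way) -- my
naming, not the kernel's. Before, per inner-loop iteration:

  r11 = r10 | (r4 << r5);  /* level | (way << way_shift) */
  r11 |= r9 << r2;         /* ...   | (set << set_shift) */
  dccisw(r11);

After, with r4 pre-shifted once in the outer loop:

  r6 = r9 << r2;           /* set << set_shift  */
  r6 |= r4;                /* | pre-shifted way */
  r6 |= r10;               /* | level           */
  dccisw(r6);

so the inner loop needs only one temporary on both ARM and Thumb2.)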
>
> Since R4 now carries the left-shifted value, we need to subtract 1
> shifted left by the same amount after every iteration of the outer
> loop, and terminate once the subtraction triggers a borrow (!carry on
> ARM), so we continue as long as the carry remains set. Note that the
> existing BGE instruction relies on signed arithmetic, which is
> unsuitable as bit 31 is no longer a sign bit.
Ah, perfect. This was the part I was struggling with most. I guess
before we had something almost like (for loop1):
int r4 = ...;
do {
  r11 = r10 | (r4 << r5);
  ...
} while (--r4 >= 0);  /* subs r4, r4, #1; bge loop1 */
Now it's
unsigned r4 = ...;
r4 = r4 << r5;  /* movne */
r6 = 1 << r5;   /* mov #1 + movne lsl */
do {
  ...
  borrow = r4 < r6;  /* subs r4, r4, r6: C clear means borrow */
  r4 = r4 - r6;
} while (!borrow);   /* bcs loop1 */
>
> To handle a corner case where the cache geometry is reported as one
> set and one way (793acf870ea3), we use MOVNE to skip the left-shift
> at the start when R4 == 0x0, and to force the outer loop decrement to
> be 0x1, as otherwise, the loop would never terminate.
Right, so when r4 == 0x0 and r6 == 0x1, r4 = 0x0 - 0x1 == 0xFFFFFFFF.
That sets the N bit of the PSR, but because we only branch back if C
was set to 1 (which it would be only if r4 >= r6, i.e. the subtraction
did not borrow), we exit loop1 after the single iteration.
(I found this fun little ARM simulator:
http://www.csbio.unc.edu/mcmillan/miniARM.html; it definitely helped
me observe the PSR through the loop!)
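Modeling that exit condition in C for my own sanity -- a toy model of
SUBS+BCS with my own variable names, not kernel code:

unsigned int r4 = 0, r6 = 1;  /* one way: movne skipped, r6 forced to 1 */
do {
  /* ... clean/invalidate one set/way here ... */
  int carry = (r4 >= r6);  /* SUBS sets C iff the sub doesn't borrow */
  r4 = r4 - r6;            /* 0x0 - 0x1 wraps to 0xFFFFFFFF; N is set */
} while (carry);           /* BCS not taken, so the body ran exactly once */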
Thanks again for taking the time to explain it; I appreciate it!
Reviewed-by: Nick Desaulniers <ndesaulniers at google.com>
--
Thanks,
~Nick Desaulniers